Monday, June 10, 2019

How to show HBA/NIC driver version

How to find the version of HBA or NIC driver on VMware ESXi?

Let's start with HBA drivers. 

STEP 1/ Find driver name for the particular HBA. In this example, we are interested in vmhba3.

We can use following esxcli command to see driver names ...
esxcli storage core adapter list

So now we have driver name for vmhba3, which is qlnativefc

STEP 2/ Find the driver version.
The following command will show you the version.
vmkload_mod -s qlnativefc | grep -i version

NIC drivers

The process to get NIC driver version is very similar.

STEP 1/ Find driver name for the particular NIC. In this example, we are interested in vmhba3.
esxcli network nic list

STEP 2/ Find the driver version.
The following command will show you the version.
vmkload_mod -s ntg3 | grep -i version

You should always verify your driver versions are at VMware Compatibility Guide. The process of how to do it is documented here How to check I/O device on VMware HCL.

For further information see VMware KB - Determining Network/Storage firmware and driver version in ESXi 4.x and later (1027206)

Thursday, June 06, 2019

vMotion multi-threading and other tuning settings

When you need to boost overall vMotion throughput, you can leverage Multi-NIC vMotion. This is good when you have multiple NICs so it is kind of scale-out solution. But what if you have 40 Gb NICs and you would like to do scale-up and leverage the huge NIC bandwidth (40 Gb) for vMotion?

vMotion is by default using a single thread (aka stream), therefore it does not have enough CPU performance to transfer more than 10 Gb of network traffic. If you really want to use higher NIC bandwidth, the only way is to increase the number of threads pushing the data through the NIC. This is where advanced setting Migrate.VMotionStreamHelpers comes in to play.

I have been informed about these advanced settings by one VMware customer who saw it on some VMworld presentation. I did not find anything in VMware documentation, therefore these settings are undocumented and you should use it with special care.

Advanced System Settings
Number of helpers to allocate for VMotion streams
Max TX queue load (in thousand packet per second) to allow packing on the corresponding RX queue
Threshold (in thousand packet per second) for TX queue load to trigger unpacking of the corresponding RX queue
Maximum length of the Tx queue for the physical NICs 

Wednesday, June 05, 2019

How to get more IOPS from a single VM?

Yesterday, I have got a typical storage performance question. Here is the question ...
I am running a test with my customer how many IOPS we can get from a single VM working with HDS all flash array. The best that I could get with IOmeter was 32K IOPS with 3ms latency at 8KB blocks. No matter what other block size I choose or outstanding IOs, I am unable to have more then 32k. On the other hand I can't find any bottlenecks across the paths or storage. I use PVSCSI storage controller. Latency and queues looks to be ok
IOmeter is good storage test tool. However, you have to understand basic storage principles to plan and interpret your storage performance test properly. The storage is the most crucial component for any vSphere infrastructure, therefore I have some experience with IOmeter and storage performance tests in general and here are my thoughts about this question.

First thing first, every shared storage system requires specific I/O scheduling to NOT give the whole performance to a single worker. The storage worker is the compute process or thread sending storage I/Os down the storage subsystem. If you think about it, it makes a perfect sense as it mitigates the problem of a noisy neighbor. When you invest a lot of money to a shared storage system, you most probably want to use it for multiple servers, right? Does not matter if these servers are physical (ESXi hosts) or virtual (VMs). To get the most performance from shared storage you must use multiple workers and optimally spread them across multiple servers and multiple storage devices (aka LUNs, volumes,  datastores).

IOmeter allows you to use

  • Multiple workers on a single server (aka Manager)
  • Outstanding I/Os within a single worker (asynchronous I/O to a disk queue without waiting for acknowledge)
  • Multiple Managers – the manager is the server generating storage workload (multiple workers) and reporting results to a central IOmeter GUI. This is where IOmeter dynamos come in to play.
To test the performance limits of a shared storage subsystem, it is an always good idea to use multiple servers (IOmeter managers) with multiple workers on each server (nowadays usually VMs) spread across multiple storage devices (datastores / LUNs). This will give you multiple storage queues, which means more parallel I/Os. Parallelism is the way which will give you more performance when such performance exists on shared storage. If such performance does not exist on the shared storage, queueing will not help you to boost performance. If you want, you can also leverage Oustanding I/Os to fill disk queue(s) more quickly and make an additional pressure to a storage subsystem, but it is not necessary if you use the number of workers equal to available queue depth. Outstanding I/Os can help you potentially generating more I/Os with fewer workers but it does not help you to get more performance when your queues are full. You will just increase response times without any positive performance gain.

Just as an example of IOmeter performance test, on the image below, you can see the results from IOmeter distributed performance tests on 2-node vSAN I planned, designed, implemented and tested recently for one of my customers. There is just one disk group (1xSSD cache, 4xSSD capacity).

Above storage performance test was using 8xVMs and each VMs was running 8 storage workers.
I have performed different storage patterns (I/O size, R/W ratio, 100% random access). The performance is pretty good, right? However, I would not be able to get such performance from the single VM having a single vDisk. 
Note: vSAN has a significant advantage in comparison to traditional storage because you do not need to deal with LUNs queueing (HBA Device Queue Depth) as there are no LUNs. On the other hand, in vSAN storage, you have to think about the total performance available for a single vDisk and it boils down to vSAN DiskGroup(s) layout and vDisk object components distribution across physical disks. But that's another topic as the asker is using traditional storage with LUNs.

Unfortunately, using multiple VMs is not the solution for the asker as he is trying to get all I/Os from a single VM.

In the question is declared that a single VM cannot get more than 32K IOPS and observed I/O response time is 3ms. The asker is curious why he cannot get more IOPS from the single VM?

Well, there can be multiple reasons but let’s assume the physical storage is capable provide more than 32K IOPS. I think, that more IOPS cannot be achieved because only one VM is used and IOmeter is using a single vDisk having a single queue. The situation is depicted in drawing below.

So, let’s do the simple math calculation for this particular situation …
  • We have a single vDisk queue having default queue depth 64 (we use Paravirtual SCSI adapter. Non-paravirtualized SCSI adapters have queue depth 32)
  • We have an HBA QLogic having queue default depth 64 (other HBA vendors like Emulex, have default queue depth 32, so it would be another bottleneck on the storage path)
  • The storage has average service time (response time) around 3ms
We have to understand the following basic principles
  • IOPS is the number of I/O operations per second
  • 64 queue depth = 64 I/O operations in parallel = 64 slices for I/O operations
  • Each I/O from these 64 I/Os are in the vDisk queue until SCSI response from the LUN will come back
  • All other I/Os have to wait until there is the free I/O slice in the queue.
And here is the math calculation ...

Q1: How many I/Os can be delivered in this situation per 1 millisecond?
A1: 64 (queue depth) / 3 (service time in ms)  = 64 / 3 = 21.33333 I/Os per 1 millisecond
Q2: How many I/Os can be delivered per 1 second?
A2: It is easy. 1,000 times more than in millisecond. So, 21.33333 x 1,000 = 21333.33 IOs per second ~= 21.3K IOPS
The asker is claiming he can get 32K IOPS with 3 ms response time, therefore it seems that the response time from storage is better than 3 ms. The math above would tell me that storage response time in this particular exercise is somewhere around 2 ms. There can be other mechanisms to boost performance. For example, I/O coalescing but let's keep it simple.

If the storage would be able to service I/O in 1 ms we would be able to get ~64K IOPS.
If the storage would be able to service I/O in 2 ms we would be able to get ~32K IOPS. 
If the storage would be able to service I/O in 3 ms we would be able to get ~21K IOPS. 

The math above would work if END-2-END queue depth is 64. This would be the case when QLogic HBA is used as it has HBA LUN Queue Depth 64. In the case of Emulex HBA, there is HBA LUN Queue Depth 32, therefore higher vDisk Queue Depth (64), would not help.
Hope the principle is clear now.

So how can I boost storage performance for a single VM? If you really need to get more IOPS from the single VM you have only three following options:
  1. increase queue depth, but not only on vDISK itself but END-2-END. IT IS GENERALLY NOT RECOMMENDED as you really must know what you are doing and it can have a negative impact on overall shared storage. However, if you need it and have the justification for it, you can try to tune the system.
  2. use the storage system with low service time (response time). For example, the sub-millisecond storage system (for example 0.5 ms) will give you more IOPS for the same queue depth as a storage system having higher service time (for example 3 ms).
  3. leverage multiple vDisks spread across multiple vSCSI controllers and datastores (LUNs). This would give you more (total) queue depth in a distributed fashion. However, this would have additional requirements for your real application as it would need a filesystem or other mechanism supporting multiple storage devices (vDisks).
I hope options 1 and 2 are clear. Option 3 is depicted in the figure below.

On a typical VMware vSphere environment, you use the shared storage system from multiple ESXi hosts, multiple VMs having vDisks on multiple datastores (LUNs). That's the reason why the default queue depth usually makes perfect sense as it provides fairness among all storage consumers. If you have storage system with, let's say 2 ms response time, and queue depth 32, you can still get around 16K IOPS. This should be good enough for any typical enterprise application, and usually, I recommend to use IOPS limiting to limit some VMs (vDisks) even more. This is how storage performance tiering can be very simply achieved on VMware SDDC with unified infrastructure.  If you need higher storage performance, your application is specific and you should do a specific design and leverage specific technologies or tunings.

By the way, I like Howard's Marks (@DeepStorageNet) statement I have heard on his storage technologies related podcast "GrayBeards".  It is something like ...
"There are only two storage performance types - good enough and not good enough." 
This is very true.
Hope this writeup helps to broader VMware community.

Relevant articles:

Friday, May 24, 2019 is invalid or exceeds the maximum number of characters permitted

I have a customer who has a pretty decent vSphere environment and uses VMware vRealize LogInsight as a central syslog server for advanced troubleshooting and actionable loging. VMware vRealize LogInsight is tightly integrated with vSphere so it configures syslog configuration on ESXi hosts automatically through vCenter API. Everything worked fine but one day customer realized there is the issue with one and only one ESXi host.

He saw the following failed vCenter task in his vSphere Client.

The error message:
setting[""] is invalid or exceeds the maximum number of characters permitted
seemed very strange to me.

From ESXi logs collected by LogInsight was evident that ESXi advanced setting cannot be configured through API. However, the same or similar issue can be reproduced by esxcli command for setting the advanced parameter.

esxcli system settings advanced set -o /Syslog/global/logHost -s “udp://”

Unable to find branch Syslog

The resolution of this problem was to configure syslog configuration (as described in my older blog post here) instead of setting advanced parameter /Syslog/global/logHost.

The command to configure remote syslog is
esxcli system syslog config set --loghost='tcp://'  help customer to resolve the issue.

I have never seen this issue in other vSphere environments so hope this helps to at least one other person from VMware community.

Thursday, May 16, 2019

The SPECTRE story continues ... now it is MDS

Last year (2018) started with shocked Intel CPU vulnerabilities Spectre and Meltdown and two days ago was published another SPECTRE variant know as Microarchitectural Data Sampling or MDS. It was obvious from the beginning, that this is just a start and other vulnerabilities will be found over time by security experts and researchers. All these vulnerabilities are collectively known as Speculative Executions aka SPECTRE variants.

Here is the timeline of particular SPECTRE variant vulnerabilities along with VMware Security Advisories.

2018-01-03 - Spectre (speculative execution by performing a bounds-check bypass) / Meltdown (speculative execution by utilizing branch target injection) - VMSA-2018-0002.3 
2018-05-21 - Speculative Store Bypass (SSB) - VMSA-2018-0012.1
2018-08-14 - L1 Terminal Fault - VMSA-2018-0020
2019-05-14 - Microarchitectural Data Sampling (MDS) - VMSA-2019-0008

I published several blog posts about SPECTRE topics in the past

The last two vulnerabilities "L1 Terminal Fault (aka L1TF)" and "Microarchitectural Data Sampling (aka MDS)" are related to Intel CPU Hyper-threading. As per statement here AMD is not vulnerable.

When we are talking about L1TF and MDS, a typical question of my customers having Intel CPUs is if they are safe when Hyper-Threading is disabled in the BIOS. The answer is yes but you would have to power cycle the physical system to reconfigure BIOS settings which can be pretty annoying and time-consuming in larger environments. That's' why VMware recommends leveraging SDDC concept and set it by software change - ESXi hypervisor advanced setting. It is obviously much easier to change two ESXi advanced settings VMkernel.Boot.hyperthreadingMitigation and VMkernel.Boot.hyperthreadingMitigationIntraVM to the value true and disable hyperthreading in ESXi CPU scheduler without a need of physical server power cycle. You can do it by PowerCLI one-liner in a few minutes which is much more flexible than BIOS changes.

So that's it from the security point of view but what about performance?

It is simple and obvious. When hyper-threading is disabled you will obviously lose the CPU performance benefit of Hyper-Threading technology which can be somewhere between 5 - 20% and heavily depends on the type of particular workload. Let's be absolutely clear here. Until the issue is addressed inside the CPU hardware architecture it will be always the tradeoff between security and performance. If I understand Intel messaging correctly, the first hardware solution for their Hyper-Threading is implemented in Cascade Lake family. You can double check it by yourself here ...
Side Channel Mitigation by Product CPU Model

You can get hyperthreading performance back but only in VMware vSphere 6.7 U2. VMware vSphere 6.7 U2 includes new scheduler options that secure it from the L1TF vulnerability, while also retaining as much performance as possible. This new scheduler has introduced ESXi advanced setting
VMkernel.Boot.hyperthreadingMitigationIntraVM which allows you to set it to FALSE (this is the default) and leverage HyperThreading benefits within Virtual Machine but still do isolation between VMs when VMkernel.Boot.hyperthreadingMitigation is set to TRUE. This possibility is not available in older ESXi hypervisors and there are no plans to backport it. For further info read paper "Performance of vSphere 6.7 Scheduling Options".

By the way, last year I have spent a significant time to test the performance impact of SPECTRE and MELTDOWN vulnerabilities remediations. If you want to check the results of the performance tests of Spectre/Meltdown 2018 variants along with the conclusion, you can read my document published on SlideShare. It would be cool to perform the same tests for L1TF and MDS but it would require additional time effort. I'm not going to do so until sponsored by some of my customers. But anybody can do it by himself as a test plan is described in the document below.

Friday, May 03, 2019

Storage and Fabric latencies - difference in order of magnitude

It is well known, that the storage industry is in a big transformation. SSD's based on Flash is changing the old storage paradigma and supporting fast computing required nowadays in modern applications supporting digital transformation projects.

So the Flash is great but it is also about the bus and the protocol over which the Flash is connected.
We have traditional storage protocols SCSI, SATA, and SAS but these interface protocols were invented for magnetic disks, that's the reason why Flash over these legacy interface protocols cannot leverage the full potential of Flash technology. That's why NVMe (new storage interface protocol over PCI) or even 3D XPoint memory (Intel Optane).

It is all about latency and available bandwidth. Total throughput depends on I/O size and achievable transaction (IOPS). IOPS on storage systems below can be achieved on particular storage media by a single worker with random access, 100% read, 4 KB I/O size workload. Multiple workers can achieve higher performance but with higher latency.

Latencies order of magnitude:
  • ms - miliseconds - 0.001 of second = 10−3
  • μs - microseconds - 0.000001 of second = 10−6
  • ns - nanoseconds - 0.000000001 of second = 10−9
Storage Latencies

SATA - magnetic disk 7.2k RPM ~= 80 I/O per second (IOPS) = 1,000ms / 80 = 12ms
SAS - magnetic disk 15k RPM ~= 200 I/O per second (IOPS) = 1,000ms / 200 = 5 ms
SAS - Solid State Disk (SSD) Mixed use SFF ~= 4,000 I/O per second (IOPS) = 1,000ms / 4,000 = 0.25 ms = 250 μs.
NVMe over RoCE - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.100 ms =  100 μs
NVMe - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.080 ms =  80 μs
DIMM - 3D XPoint memory (Intel Optane) ~=   the latency less than 500 ns (0.5 μs)

Ethernet Fabric Latencies

Gigabit Ethernet - 125 MB/s ~= 25 ~ 65 μs
10G Ethernet - 1.25 GB/s ~=  μs (sockets application) / 1.3 μs (RDMA application)
40G Ethernet - 5 GB/s ~= μs (sockets application) / 1.3 μs (RDMA application)

InfiniBand and Omni-Path Fabrics Latencies

10Gb/s SDR - 1 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
20Gb/s DDR - 2 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
40Gb/s QDR - 4 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
40Gb/s FDR-10 - 5.16 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
56Gb/s FDR-10 - 6.82 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
100Gb/s EDR-10 - 12.08 GB/s  ~=  1.01 μs (Mellanox ConnectX-4)
100Gb/s Omni-Path - 12.36 GB/s  ~=  1.04 μs (Intel 100G Omni-Path)

RAM Latency

DIMM - DDR4 SDRAM ~=  75 ns (local NUMA access) - 120 ns (remote NUMA access)


It is good to realize what latencies we should expect on different infrastructure subsystems 
  • RAM ~= 100 ns
  • 3D XPoint memory ~= 500 ns
  • Modern Fabrics ~= 1-4 μs
  • NVMe ~= 80 μs
  • NVMe over RoCE ~= 100 μs
  • SAS SSD ~= 250 μs
  • SAS magnetic disks ~= 5-12 ms
The latency order of magnitude is important for several reasons. Let's focus on one of them - latency monitoring. It was always a challenge to monitor traditional storage systems as 5 minutes or even 1-minute is simply too large interval for ms (milisecond) latency and the average does not tell you anything about microbursts. However, in lower latency (μs or even ns) systems is 5-minute interval like an eternity. Average, Min and Max of 5-minute interval might not help you to understand what is really happening there. Much deeper mathematical statistics would be needed to have real and valuable visibility into telemetry data. Percentiles are good but Histograms can help even more ...
Wavefront links above are talking mainly about application monitoring but do we have such telemetry granularity in hardware? Mellanox Spectrum claims Real-time Network Visibility but it seems to me as an exception. Intel had an open source project "The Snap Telemetry Framework", however, it seems that it was discontinued by Intel. And what about other components? To be honest, I do not know and it seems to me that real-time visibility is not a big priority for the Infrastructure Industry, however, Operating Systems, Hypervisors and Software Defined Storages could help here. VMware vSphere Performance Manager available via vCenter SOAP API can provide "real-time" monitoring. I'm quoting "real-time" into brackets because it can provide 20-second samples (min, max, average) for metrics in leaf objects. Is it good enough? Well, not really. It is better than a 5-minute or 1-minute sample but still very long for sub-millisecond latencies. Minimum, maximum and average do not have enough information value for some decisions. The histograms could help here. ESXi has an old good tool vscsiStats supporting histograms latency of IOs in Microseconds (us) for virtual machine. Unfortunately, there is no officially supported vCenter API for this tool so it is usually used for short-term manual performance troubleshooting and not for continuous latency monitoring.  William Lam has published a blog post and scripts on how to leverage ESXi API to get vscsiStats histograms. It would be great to be able to get histograms for some objects through vCenter in a supported way and expose such information to external monitoring tools. #FEATURE-REQUEST

Hope this is informative and educational.

Other sources:
Performance Characteristics of Common Network Fabrics:
Real-time Network Visibility:
Johan van Amersfoort and Frank Denneman present a NUMA deep dive:
Wiliam Lam : Retrieving vscsiStats Using the vSphere 5.1 API

Tuesday, April 09, 2019

What NSX-T Manager appliance size is good for your environment?

NSX-T 2.4 has NSX Manager and NSX Controller still logically separated but physically integrated within a single virtual appliance which can be clustered as a 3-node management/controller cluster. So the first typical question during NSX-T design workshop or before NSX-T implementation is what NSX-T Manager appliance size is good for my environment.

In NSX-T 2.4 documentation (NSX Manager VM System Requirements) are documented following NSX Manager Appliance sizes.

Appliance Size
Disk Space
VM Hardware Version
NSX Manager Extra Small
8 GB
200 GB
10 or later
NSX Manager Small VM
16 GB
200 GB
10 or later
NSX Manager Medium VM
24 GB
200 GB
10 or later
NSX Manager Large VM
48 GB
200 GB
10 or later
In the above documentation section is written that
  • The NSX Manager Extra Small VM resource requirements apply only to the Cloud Service Manager.
  • The NSX Manager Small VM appliance size is suitable for lab and proof-of-concept deployments.
So for NSX-T on-prem production usage, you can use Medium and Large size. But which one? The NSX-T documentation section (NSX Manager VM System Requirements) has no more info to support your design or implementation decision. However, in another part of the documentation (Overview of NSX-T Data Center) is written that
  • The NSX Manager Medium appliance is targeted for deployments up to 64 hosts
  • The NSX Manager Large appliance for larger-scale environments.


Long story short, only Medium and Large sizes are targeted to On-Prem NSX-T production usage. The Medium size should be used in an environment up to 64 ESXi hosts. For larger environments, the Large size is the way to go.

Hope this helps to your NSX-T Plan, Design, and Implement exercise.