Thursday, September 12, 2019

Datacenter Fabric for HCI at scale

I'm currently designing a brand new data center based on VMware HCI for one of my customers. Conceptually, we are planning to have two sites within metro distance (~10 km) for disaster avoidance and cross-site high availability. For me, cross-site high availability (stretched metro clusters) is not a disaster recovery solution, so we will have a third location (200+ km away from the primary location) for DR purposes. We are also considering remote clusters in branch offices to provide services outside the datacenters. The overall concept is depicted in the drawing below.
Datacenter conceptual design
The capacity requirements lead to a decent number of servers, therefore the datacenter size will be somewhere around 10+ racks per site. The rack design is pretty simple: top-of-rack (TOR) switches and standard rack servers connected to the TOR switches, optimally by 4x 25Gb links.
Server rack design
However, we will have 10+ racks per site, so there is the question of how we will connect the racks together. Well, we have two options: traditional (Access / Aggregation / Core) or Leaf/Spine. In our case, both options are valid and pretty similar, because if we used the traditional topology, we would go with a collapsed Access / Core topology anyway. However, there are two key differences between an Access/Core network and a Leaf/Spine fabric:

  1. Leaf/Spine is always L3 internally, keeping the L2/L3 boundary at the TORs. This is good because it splits the L2 fault domain and mitigates the risks of L2 networking such as STP issues, broadcast storms, unknown unicast flooding, etc.
  2. Leaf/Spine supports additional fabric services like automation, L2 over L3 to stretch L2 across the racks (PODs), Life Cycle Management (fabric-level firmware upgrades, rollbacks, etc.), a single point of management/visibility, etc.

We are designing a brand new data center which has to be in production for the next 15+ years, and we will need very good connectivity within the data center supporting huge east-west traffic, therefore a leaf-spine fabric topology based on 100Gb/25Gb Ethernet makes perfect sense. The concept of a data center fabric in leaf-spine topology is depicted in the figure below.
Datacenter fabric
Ok. So, conceptually and logically we know what we want, but how do we design it physically and what products should we choose?

I've just started working at VMware as an HCI Specialist supporting Dell within our Dell Synergy Acceleration Team, so there is no doubt VxRail makes perfect sense here and it fits the concept perfectly. However, we need a data center fabric to connect all racks within each site and also to interconnect the two sites.

I have found that Dell EMC has SmartFabric Services. You can watch the high-level introduction at
https://www.dellemc.com/en-us/video-collateral/dellemc-smartfabric-services-for-vxrail.htm

SmartFabric Services seems very tempting. To be honest, I do not have any experience with Dell EMC SmartFabric so far; however, my regular readers know that I was testing, designing and implementing Dell Networking a few years ago. At that time I was blogging about Force10 Networking (FTOS 9) technology and published a series of blog posts available at https://www.vcdx200.com/p/series.html

However, Dell EMC SmartFabric Services is based on a newer switch operating system (OS10) with which I do not have any experience yet. Therefore, I did some research and found very interesting blog posts about Dell EMC SmartFabric published by Mike Orth aka @HCIdiver and Hasan Mansur. Here are links to the blog posts:


So the next step is to get more familiar with Dell EMC SmartFabric Services, because it can significantly simplify data center operations and split the duties between datacenter full-stack engineers (compute/storage/network) and the traditional network team.

I personally believe that a datacenter full-stack engineer should be responsible not only for compute and storage (HCI) but also for the data center fabric, while the traditional networking team takes responsibility at the network rack where the fabric is connected to the external network. You can treat the datacenter fabric as a new generation of SAN, which is nowadays operated by storage guys anyway, right?

Hope this makes sense, and if there is anybody with Dell EMC SmartFabric experience or with a similar design, please feel free to leave a comment below or contact me on Twitter @david_pasek.

Wednesday, September 11, 2019

New job role - vSAN rulez

Hey, my readers.

Long-time readers of my blog know that I have been working with VMware datacenter technologies since 2006, when I moved from software development to data center infrastructure consulting. In June 2006, VMware released VMware Virtual Infrastructure 3 and, for me, it was the first production-ready version for hosting business applications. Back in the day, it was a very simple platform (at least for end-users / administrators) and a very functional piece of software with huge business benefits.

Let me share with you one secret I have discovered. As I have a software background, over the years I personally developed a lot of software, on my own and also in development teams, and there is a law I discovered during my developer days in the '90s and 2000s ... "The 3rd version of a software product is production-ready, and developing three software versions usually takes around 5 years." There are multiple reasons for this statement, but they are beyond the scope of this post.

I will just give you two examples related to the topic of this article.

  1. ESX 1.0 was released in March 2001 and the production-ready version (at least for me) was ESX 3.0, released in June 2006 ... 5 years of software development and continuous improvements based on customer feedback.
  2. vSAN 1.0 was released in 2013 and the production-ready version (at least for me) was vSAN 6.7 U1, released in October 2018 ... 5 years of software development and continuous improvements based on customer feedback.
But let's go back to the topic of this blog post. Since 2006, I have been dealing with various adventures and challenges around server virtualization and software-defined infrastructure as a whole. It was, and still is, a very interesting journey and a great industry and community to be in, as the infrastructure principles do not change very often and it is mainly about better manageability, scalability, and simpler infrastructure consumption for application admins. Well, when I say that key principles in IT infrastructure do not change very often, there are a few innovations over the decades which changed the industry significantly.
  • Think about "The Flash" (SLC, MLC, QLC, Intel Optane, NVMe) which already changed the storage industry.
  • Think about the concept of Composable Disaggregated Infrastructure
  • Think about software-defined infrastructures in compute, storage and network
The few innovations mentioned above are changing the technologies, but it always takes some time because the most resistant to change are humans. On the other hand, resistance, or conservatism if you will, is not always bad because, at the end of the day, the infrastructure is just another tool in the toolbox of IT guys supporting the business.

Well, this is a pretty important thing to realize.
Infrastructure is not here for the sake of the infrastructure but to support software applications. And software applications are not here for the sake of applications but to support some business, whatever the business is. And each business supports some other business. So generally speaking, the quality of the support and simplicity of consumption are the keys to having satisfied customers and being successful in any business.
During my datacenter career, I worked for two infrastructure giants - Dell and Cisco - delivering server, storage, and network virtualization consulting to simplify sometimes pretty complex infrastructures. If you ask me for one reason why this is an interesting industry to be in, my answer would be - THE SIMPLIFICATION BY ABSTRACTION.

When I look back retrospectively, this is the reason why I decided to focus on the VMware platform and invest my time (13+ years) into this technology. Back then, I chose VMware because they were emulating physical infrastructure components as logical software constructs, all backed by APIs with automation in mind from day zero. Another reason to choose the virtualization abstraction is that I'm a big believer in standardization and commoditization. It allowed me to leverage my existing knowledge of infrastructure principles. A disk is a disk; logically it does not matter whether it is connected locally through SATA or SCSI, or remotely over Fibre Channel or iSCSI. At the end of the day, the disk is there to provide a block device to a server and store some data on it. Period. And similarly, you can think about other infrastructure components like NICs, switches, routers, file storage, object storage, you name it.

My decisions and good intuition have always paid back. Nowadays, nobody doubts server virtualization, and business-critical apps and even mission-critical apps are virtualized or planned to be virtualized soon. It's all about x86 commoditization and "good enough" technology. Compare it with another example in ICT. Do you remember the Telco battle between traditional telephony and IP telephony? I do. And do you see such discussions nowadays? No. IP telephony was accepted as a "good enough" technology even though you do not have a 100% guarantee that you own the line during the telephone conversation. However, it has a lot of other benefits.

And we are in the same situation with PRIMARY STORAGE in datacenters. During the last 10+ years, I was dealing with or witnessing multiple storage issues in various data centers. I'm always saying NO STORAGE = NO DATACENTER. The storage system is a critical component of any datacenter. The major incidents and problems were caused by
  • environment complexity (SAN fan-in / fan-out ratios, micro-bursting, and "slow-drain" issues)
  • overloaded storage systems from a performance point of view (any overloaded system behaves unpredictably, even a human body).
  • human mistakes
I witnessed all these issues not only as a data center consultant but also as a VMware Technical Account Manager during the last 4 years, helping a few assigned VIP VMware accounts keep the lights on and think strategically about how to simplify data center infrastructures and deliver value to the business.

Since 2013, VMware has been developing the next big thing, software-defined storage. As I've mentioned above, from a technical point of view, I believe vSAN 6.7 U1 is a production-ready piece of software and from now on it will only get better and better. I'm not saying there are not, or will not be, any problems. Problems are everywhere around us and failure is inevitable, but it is about the availability and recoverability options and how fast any solution can fail over or be recovered. This is again beyond the scope of this article, but I hope you understand the point.

Nevertheless, software is eating the world and I believe there is momentum for Hyper-Converged Infrastructure, which in VMware terms means vSAN. Of course, it will require some time to convince traditional storage architects, engineers, and operations guys, but the trend is absolutely clear. For the transition period, VMware has other storage technologies to leverage, not only vSAN, but also the ability to consume traditional storage systems more intelligently with vVols and eventually evolve into the software-defined datacenter (SDDC).

So let me announce publicly that, as of September 1st, I work for VMware as a Senior Systems Engineer for Hyper-Converged Infrastructure (HCI). As a VMware Systems Engineer for HCI, I'm responsible for driving the technical solutions to ensure customer success and the revenue goals derived from VMware software and solutions. I'm focused on HCI solutions for the assigned Alliance Partner (Dell Technologies), working closely with both the Alliance Partner and VMware technical teams to build the differentiation with that Alliance. I will also be able to bring feedback from the field to VMware and Alliance Product Management teams.

So what can you, my blog readers, expect? More storage-related blog posts about vSAN, vVols, vSphere Replication, SRM, I/O filters leveraged by 3rd party solutions, etc.


I'm excited and looking forward to this new journey. I hope I will have enough spare time to blog about the journey and share it with the great VMware community.

Wednesday, August 28, 2019

VMware Ports and Protocols

VMware recently released a very interesting tool. The tool documents all network ports and protocols required for communication from/to selected VMware products. At the moment, the following products are covered:

  • vSphere
  • vSAN
  • NSX for vSphere
  • vRealize Network Insight
  • vRealize Operations Manager
  • vRealize Automation
I believe other products will follow. See the screenshot of the tool below.



The tool is available at https://ports.vmware.com/

I'm pretty sure technical designers and implementation engineers will love this tool.

Wednesday, August 21, 2019

VMware vSphere 6.7 Update 3 is GA

VMware vSphere 6.7 Update 3 is GA as of August 20, 2019.

The most interesting new feature is the possibility to change the Primary Network Identifier (PNID) of the vCenter Server Appliance.

With vCenter Server 6.7 Update 3, you can change the Primary Network Identifier (PNID) of your vCenter Server Appliance. You can change the vCenter Server Appliance FQDN or hostname, and also modify the IP address configuration of the virtual machine Management Network (NIC 0).

I wrote about PNID last year in the blog post What is vCenter PNID?
It is a great improvement that you can change the vCenter FQDN; however, the process has some caveats.

For further operational details on how to change vCenter Server's FQDN/PNID read the blog post - Changing your vCenter Server’s FQDN.

There are a few more new features and enhancements

  • vCenter Server 6.7 Update 3 supports a dynamic relationship between the IP address settings of a vCenter Server Appliance and a DNS server by using the Dynamic Domain Name Service (DDNS). The DDNS client in the appliance automatically sends secure updates to DDNS servers on scheduled intervals.
  • With vCenter Server 6.7 Update 3, you can publish your .vmtx templates directly from a published library to multiple subscribers in a single action, instead of performing a synchronization from each subscribed library individually. The published and subscribed libraries must be in the same linked vCenter Server system, regardless of whether they are on-premises, in the cloud, or hybrid. Working with other templates in content libraries does not change.
  • With vCenter Server 6.7 Update 3, if the overall health status of a vSAN cluster is Red, APIs to configure or extend HCI clusters throw an InvalidState exception to prevent further configuration or extension. This fix aims to resolve situations when mixed versions of ESXi hosts in an HCI cluster might cause a vSAN network partition.
  • ixgben driver enhancements
  • VMXNET3 enhancements
  • NVIDIA virtual GPU (vGPU) enhancements
  • bnxtnet driver enhancements
  • QuickBoot support enhancements
  • Configurable shutdown time for the sfcbd service

For further information read the ESXi Release Notes here, the vCenter Server Release Notes here, and the VMware Update Manager Release Notes here.

I also recommend reading the blog post Announcing vSphere 6.7 Update 3 available at https://blogs.vmware.com/vsphere/2019/08/announcing-vsphere-6-7-update-3.html

Tuesday, August 20, 2019

VMware vSAN 6.7 U3 has been released

VMware vSAN 6.7 U3 is GA as of August 20, 2019!
This is a great release. I was waiting mainly for native support for Windows Server Failover Clusters, which is now officially supported, so no more vSAN iSCSI targets and in-guest iSCSI for shared disks across the WSFC, as vSAN VMDKs now support SCSI-3 persistent reservations. This is a great improvement and a significant simplification for vSphere / vSAN architects, technical designers, implementers, and operations guys. For further details see the blog post "Native SQL Server Cluster support on vSAN".

The second feature I like a lot is the performance improvement. Performance improvements and better performance predictability are always welcome. More predictable application performance is achieved in vSAN 6.7 U3 by improved destaging of data from the write buffer to the capacity tier. This makes vSAN more efficient and consistent, with increased throughput for sequential writes. The increased consistency in destaging operations results in a smaller deviation between high and low latency. The increased throughput also reduces the amount of time necessary for resyncs from repairs and rebuild tasks. The destaging improvement is depicted in the figure below.


However, there are other operational and availability improvements in 6.7 U3. Let's highlight the vSAN 6.7 U3 improvements one by one, as published in the Release Notes.
What's New?

Cloud Native Applications
  • Cloud Native Storage: Introducing vSphere integrated provisioning and management of Kubernetes persistent volumes (PVs) and a CSI (Container Storage Interface) based plugin. This integration enables unified management of modern and traditional applications in vSphere.

Intelligent Operations
  • Enhanced day 2 monitoring: vSAN’s capacity dashboard has been overhauled for greater simplicity, and the resync dashboard now includes improved completion estimates and granular filtering of active tasks.
  • Data migration pre-check report: A new detailed report for greater insight and predictive analysis of host maintenance mode operations.
  • Greater resiliency with capacity usage conditions: Key enhancements include reduced transient capacity usage for policy changes, and improved remediation during capacity-strained scenarios.
  • Automated rebalancing: Fully automate all rebalancing operations with a new cluster-level configuration, including modifiable thresholds.
  • Native support for Windows Server Failover Clusters: Deploy WSFC natively on vSAN VMDKs with SCSI-3 persistent reservations.

Enhanced Performance & Availability
  • Performance enhancements: Deduplication & compression enabled workloads will have improved performance in terms of predictable I/O latencies and increased sequential I/O throughput.
  • Availability enhancements: Optimizations to sequential I/O throughput and resync parallelization will result in faster rebuilds.
For further details about vSAN 6.7 U3, read the blog post vSAN 6.7 Update 3 – What’s New, available at https://blogs.vmware.com/virtualblocks/2019/08/13/vsan67u3-whats-new/

Thursday, August 08, 2019

vSAN Capacity planning - Understanding vSAN memory consumption in ESXi

It is very clear that VMware vSAN (VMware's software-defined storage) has momentum in the field, as almost all my customers are planning and designing vSAN in their environments. Capacity planning is an important part of any logical design, so we have to do the same for vSAN. Capacity planning is nothing more than simple math; however, we need to know how the designed system works and what overheads we have to include in our capacity planning exercise. Over the years, a lot of VMware vSphere technical designers have gained the knowledge and practice of how to do capacity planning for core vSphere, because server virtualization has been here for ages (13+ years). But we are all just starting (3+ years) with vSAN designs, therefore it will take some time to gain the practice and know-how about what is important to calculate in terms of VMware hyper-converged infrastructure (VMware HCI = vSphere + vSAN).

One of the many important factors for HCI capacity planning is vSAN memory consumption from the ESXi host memory. There is a very good VMware KB 2113954 explaining the calculations and formulas behind the scenes. However, we are tech geeks, so we do not want to do the math on paper, so here is the link to a Google Sheets calculator I have prepared for vSAN (All-Flash) memory overhead calculation.
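If you prefer scripting over spreadsheets, below is a minimal Python sketch of the general shape of the KB 2113954 calculation: a fixed per-host footprint plus a per-disk-group footprint that grows with the cache device size and the number of capacity disks. The constants in the sketch are illustrative placeholders only, not the official numbers, so take the real values for your vSAN version from the KB before relying on the result.

# Minimal sketch of the vSAN (All-Flash) host memory overhead calculation.
# WARNING: the constants below are illustrative placeholders only.
# Take the official values from VMware KB 2113954 for your vSAN version.
BASE_HOST_FOOTPRINT_MB = 5000        # placeholder: fixed vSAN overhead per ESXi host
DISKGROUP_FIXED_MB = 600             # placeholder: fixed overhead per disk group
DISKGROUP_SCALABLE_MB = 1700         # placeholder: scalable overhead per disk group
CACHE_FOOTPRINT_MB_PER_GB = 15       # placeholder: overhead per GB of cache device
CAPACITY_DISK_FOOTPRINT_MB = 70      # placeholder: overhead per capacity disk

def vsan_memory_overhead_gb(num_disk_groups, cache_size_gb, capacity_disks_per_dg):
    """Estimated vSAN memory overhead per ESXi host in GB."""
    per_disk_group_mb = (DISKGROUP_FIXED_MB
                         + DISKGROUP_SCALABLE_MB
                         + cache_size_gb * CACHE_FOOTPRINT_MB_PER_GB
                         + capacity_disks_per_dg * CAPACITY_DISK_FOOTPRINT_MB)
    total_mb = BASE_HOST_FOOTPRINT_MB + num_disk_groups * per_disk_group_mb
    return total_mb / 1024

# Example: 2 disk groups, 400 GB cache device, 2 capacity disks per disk group
print(round(vsan_memory_overhead_gb(2, 400, 2), 2))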

Here is the calculator embedded into this blog post; however, it is just in read-only mode. If you want to change the parameters (yellow cells), you have to open the Google Sheet available at this link.



Note: I did not finish the calculator for Hybrid configuration because I personally believe that 99% of vSAN deployments should be All-Flash. The reason for this assumption is the fact that flash capacity is only 2x or 3x more expensive than magnetic disks, and the lower price of magnetic disks is not worth the low speed you can achieve with them. In terms of capacity, advanced techniques like erasure coding (RAID-5, RAID-6) and deduplication + compression can give you capacity back in All-Flash vSAN, as these techniques are available, or make sense, only on All-Flash vSAN. If you would like the same calculator for Hybrid vSAN, leave a comment below this blog post and I will try to find some spare time to prepare another sheet for Hybrid vSAN.

Let's document some design scenarios with their vSAN memory consumption.

Scenario 1
ESXi host system memory: 192 GB
Number of disk groups: 1
Cache disk size in each disk group: 400 GB
Number of capacity disks in each disk group: 4
vSAN memory consumption per ESXi host is 17.78 GB.

Scenario 2
ESXi host system memory: 192 GB
Number of disk groups: 2
Cache disk size in each disk group: 400 GB
Number of capacity disks in each disk group: 2
vSAN memory consumption per ESXi host is 28 GB.

Scenario 3
ESXi host system memory: 256 GB
Number of disk groups: 2
Cache disk size in each disk group: 400 GB
Number of capacity disks in each disk group: 2
vSAN memory consumption per ESXi host is 28.64 GB.

Hope this is informative and helps the broader VMware community.

Friday, August 02, 2019

Updating Firmware in vSAN Clusters from VUM

If you operate vSAN, you know that correct firmware and drivers are super important for system stability, as vSAN software heavily depends on the IO controller and the physical disks within the server.

Different server vendors have different system management. Some are more complex than others, but a typical vSphere admin uses vSphere Update Manager (VUM), so wouldn't it be cool to do firmware management directly from VUM? Yes, of course, so how can it be done?

Well, there is a chapter in the vSphere documentation covering "Updating Firmware in vSAN Clusters"; however, there is no information about what hardware is supported. I did some internal research and colleagues pointed me to VMware KB 60382 "IO Controllers that vSAN supports firmware updating", where the supported IO controllers are listed.

At the time of writing this post, there are plenty of Dell IO Controllers, two Lenovo, two SuperMicro and one Fujitsu.

At the moment, you can update only IO Controllers but it may or may not change in the future.


Wednesday, July 03, 2019

VMware Skyline

VMware Skyline is a relatively new "phone home" (call home) functionality developed by VMware Global Services. It is a proactive support technology available to customers with an active Production Support or Premier Services contract. Skyline automatically and securely collects, aggregates and analyzes customer-specific product usage data to proactively identify potential issues and improve time-to-resolution.

You are probably interested in Skyline Collector System Requirements which are documented here.

Skyline is packaged as a VMware Virtual Appliance (OVA) which is easy to install and operate. From a networking standpoint, there are only two external network connections you have to allow from your environment:
  • HTTPS (443) to vcsa.vmware.com
  • HTTPS (443) to vapp-updates.vmware.com
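If you want to quickly verify that these two outbound connections are allowed from the network where the Skyline Collector will run, a trivial reachability check like the one below is usually enough. This is just my own helper sketch, not an official VMware tool.

# Quick TCP reachability check for the Skyline Collector outbound connections.
import socket

ENDPOINTS = [
    ("vcsa.vmware.com", 443),
    ("vapp-updates.vmware.com", 443),
]

for host, port in ENDPOINTS:
    try:
        with socket.create_connection((host, port), timeout=5):
            print("OK      %s:%d" % (host, port))
    except OSError as err:
        print("BLOCKED %s:%d (%s)" % (host, port, err))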
Do you have more questions about Skyline? Your questions can be addressed in the Skyline FAQ.

Tuesday, July 02, 2019

vSAN logical design and SSD versus NVMe considerations

I'm just preparing vSAN capacity planning for a PoC for one of my customers. Capacity planning for traditional and hyper-converged infrastructure is principally the same. You have to understand the TOTAL REQUIRED CAPACITY of your workloads and the USABLE CAPACITY of the vSphere cluster you are designing. Of course, you need to understand how a vSAN hyper-converged system conceptually and logically works, but it is not rocket science. vSAN is conceptually very straightforward and you can design very different storage systems from a performance and capacity point of view. It is just a matter of the components you will use. You probably understand that performance characteristics differ if you use rotational SATA disks, SSDs or NVMe. For NVMe, a 10Gb network can be the bottleneck, so you should consider a 25Gb network or even more. So, in the figure below is an example of my particular vSAN capacity planning and the proposed logical specifications.


Capacity planning is part of the logical design phase, therefore any physical specifications and details should be avoided. However, within the logical design, you should compare multiple options having an impact on infrastructure design qualities such as

  • availability, 
  • manageability, 
  • scalability, 
  • performance, 
  • security, 
  • recoverability 
  • and last but not least the cost.  

For such considerations, you have to understand the characteristics of the different "materials" your system will eventually be built from. When we are talking about magnetic disks, SSD, NVMe, NICs, etc., we are thinking about logical components. So I was just considering the difference between SAS SSD and NVMe flash for the intended storage system. Of course, different physical models will behave differently, but hey, we are in the logical design phase, so we need at least some theoretical estimations. We will see the real behavior and performance characteristics after the system is built and tested before production usage, or we can invest some time into a PoC and validate our expectations.

Nevertheless, cost and performance are always a hot topic when talking with technical architects. Of course, higher performance costs more. However, I was curious about the current situation on the market, so I quickly checked the prices of SSD and NVMe in the DELL.com e-shop.

Note that these are just indicative, kind of street prices, but they have some informational value.

This is what I have found there today

  • Dell 6.4TB, NVMe, Mixed Use Express Flash, HHHL AIC, PM1725b, DIB - 213,150 CZK
  • Dell 3.84TB SSD vSAS Mixed Use 12Gbps 512e 2.5in Hot-Plug AG drive,3 DWPD 21024 TBW - 105,878 CZK

1 TB of NVMe storage costs 33,281 CZK
1 TB of SAS SSD storage costs 27,572 CZK
This is approximately a 20% price advantage for SSD.

So here are SSD advantages

  • ~ 20% less expensive material
  • scalability, because you can put 24 or more SSD disks into a 2U rack server, while the same server usually has fewer than 8 PCIe slots
  • manageability, as you can replace disks more easily than PCIe cards

The NVMe advantage is performance, with a positive impact on storage latency: SAS SSD has ~250 μs latency while NVMe has ~80 μs, so you should improve performance and storage service quality roughly by a factor of 3.
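For completeness, here is the same arithmetic as a tiny Python snippet, so you can plug in your own quotes. The prices and latencies are just the indicative figures from this post, not authoritative numbers.

# Indicative price-per-TB and latency comparison (figures from this post).
nvme_price_czk, nvme_capacity_tb = 213150, 6.4   # Dell 6.4TB NVMe Mixed Use
ssd_price_czk, ssd_capacity_tb = 105878, 3.84    # Dell 3.84TB vSAS SSD Mixed Use

nvme_per_tb = nvme_price_czk / nvme_capacity_tb
ssd_per_tb = ssd_price_czk / ssd_capacity_tb

print("NVMe: %.0f CZK/TB" % nvme_per_tb)
print("SSD : %.0f CZK/TB" % ssd_per_tb)
print("NVMe price premium: %.0f%%" % ((nvme_per_tb / ssd_per_tb - 1) * 100))
print("NVMe latency advantage: ~%.1fx (250 us vs 80 us)" % (250 / 80))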

So as always, you have to consider which infrastructure design qualities are important for your particular use case and non-functional requirements, and make the right design decision(s) with justification(s).

Any comments? Real experience? Please leave a comment below the article.

Monday, June 10, 2019

How to show HBA/NIC driver version

How to find the version of HBA or NIC driver on VMware ESXi?

Let's start with HBA drivers. 

STEP 1/ Find the driver name for the particular HBA. In this example, we are interested in vmhba3.

We can use the following esxcli command to see the driver names ...
esxcli storage core adapter list


So now we have the driver name for vmhba3, which is qlnativefc.

STEP 2/ Find the driver version.
The following command will show you the version.
vmkload_mod -s qlnativefc | grep -i version

NIC drivers

The process to get the NIC driver version is very similar.

STEP 1/ Find the driver name for the particular NIC (vmnic).
esxcli network nic list


STEP 2/ Find the driver version.
The following command will show you the version.
vmkload_mod -s ntg3 | grep -i version


You should always verify your driver versions against the VMware Compatibility Guide. The process of how to do it is documented here: How to check I/O device on VMware HCL.

For further information see VMware KB - Determining Network/Storage firmware and driver version in ESXi 4.x and later (1027206)

Thursday, June 06, 2019

vMotion multi-threading and other tuning settings

When you need to boost overall vMotion throughput, you can leverage Multi-NIC vMotion. This is good when you have multiple NICs, so it is kind of a scale-out solution. But what if you have 40 Gb NICs and you would like to scale up and leverage the huge NIC bandwidth (40 Gb) for vMotion?

vMotion by default uses a single thread (aka stream), therefore it does not have enough CPU performance to transfer more than roughly 10 Gb of network traffic. If you really want to use higher NIC bandwidth, the only way is to increase the number of threads pushing the data through the NIC. This is where the advanced setting Migrate.VMotionStreamHelpers comes into play.

I was informed about these advanced settings by a VMware customer who saw them in a VMworld presentation. I did not find anything about them in the VMware documentation, therefore these settings are undocumented and you should use them with special care.

Advanced System Settings (default value, tuned value, description):
  • Migrate.VMotionStreamHelpers (default 0, tuned 8): Number of helpers to allocate for VMotion streams
  • Net.NetNetqTxPackKpps (default 300, tuned 600): Max TX queue load (in thousand packets per second) to allow packing on the corresponding RX queue
  • Net.NetNetqTxUnpackKpps (default 600, tuned 1200): Threshold (in thousand packets per second) for TX queue load to trigger unpacking of the corresponding RX queue
  • Net.MaxNetifTxQueueLen (default 2000, tuned 10000): Maximum length of the Tx queue for the physical NICs
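If you decide to experiment with these values (at your own risk), note that they are regular ESXi advanced system settings, so they can be changed per host, for example with esxcli. The small helper below only prints the corresponding esxcli commands for the tuned values from the list above; since the settings are undocumented, validate them in a lab first.

# Print esxcli commands for the tuned vMotion-related advanced settings above.
# These are undocumented tunings - test in a lab before touching production.
TUNED_SETTINGS = {
    "/Migrate/VMotionStreamHelpers": 8,
    "/Net/NetNetqTxPackKpps": 600,
    "/Net/NetNetqTxUnpackKpps": 1200,
    "/Net/MaxNetifTxQueueLen": 10000,
}

for option, value in TUNED_SETTINGS.items():
    print("esxcli system settings advanced set -o %s -i %d" % (option, value))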





Wednesday, June 05, 2019

How to get more IOPS from a single VM?

Yesterday, I got a typical storage performance question. Here it is ...
I am running a test with my customer how many IOPS we can get from a single VM working with HDS all flash array. The best that I could get with IOmeter was 32K IOPS with 3ms latency at 8KB blocks. No matter what other block size I choose or outstanding IOs, I am unable to have more then 32k. On the other hand I can't find any bottlenecks across the paths or storage. I use PVSCSI storage controller. Latency and queues looks to be ok
IOmeter is a good storage test tool. However, you have to understand basic storage principles to plan and interpret your storage performance test properly. Storage is the most crucial component of any vSphere infrastructure, therefore I have some experience with IOmeter and storage performance tests in general, and here are my thoughts about this question.

First things first, every shared storage system requires specific I/O scheduling to NOT give the whole performance to a single worker. A storage worker is the compute process or thread sending storage I/Os down the storage subsystem. If you think about it, it makes perfect sense as it mitigates the problem of a noisy neighbor. When you invest a lot of money into a shared storage system, you most probably want to use it for multiple servers, right? It does not matter if these servers are physical (ESXi hosts) or virtual (VMs). To get the most performance from shared storage, you must use multiple workers and optimally spread them across multiple servers and multiple storage devices (aka LUNs, volumes, datastores).

IOmeter allows you to use

  • Multiple workers on a single server (aka Manager)
  • Outstanding I/Os within a single worker (asynchronous I/Os sent to a disk queue without waiting for acknowledgement)
  • Multiple Managers – the manager is the server generating the storage workload (multiple workers) and reporting results to the central IOmeter GUI. This is where IOmeter dynamos come into play.
To test the performance limits of a shared storage subsystem, it is always a good idea to use multiple servers (IOmeter managers) with multiple workers on each server (nowadays usually VMs), spread across multiple storage devices (datastores / LUNs). This will give you multiple storage queues, which means more parallel I/Os. Parallelism is what will give you more performance, when such performance exists on the shared storage. If such performance does not exist on the shared storage, queueing will not help you boost performance. If you want, you can also leverage Outstanding I/Os to fill the disk queue(s) more quickly and put additional pressure on the storage subsystem, but it is not necessary if you use a number of workers equal to the available queue depth. Outstanding I/Os can potentially help you generate more I/Os with fewer workers, but they do not help you get more performance when your queues are full. You will just increase response times without any positive performance gain.

Just as an example of an IOmeter performance test, in the image below you can see the results from distributed IOmeter performance tests on a 2-node vSAN I planned, designed, implemented and tested recently for one of my customers. There is just one disk group (1x SSD cache, 4x SSD capacity).


The storage performance test above was using 8 VMs and each VM was running 8 storage workers.
I performed tests with different storage patterns (I/O size, R/W ratio, 100% random access). The performance is pretty good, right? However, I would not be able to get such performance from a single VM having a single vDisk.
Note: vSAN has a significant advantage in comparison to traditional storage because you do not need to deal with LUN queueing (HBA Device Queue Depth) as there are no LUNs. On the other hand, with vSAN storage, you have to think about the total performance available for a single vDisk, and it boils down to the vSAN disk group(s) layout and the distribution of vDisk object components across physical disks. But that's another topic, as the asker is using traditional storage with LUNs.

Unfortunately, using multiple VMs is not the solution for the asker, as he is trying to get all the I/Os from a single VM.

The question states that a single VM cannot get more than 32K IOPS and the observed I/O response time is 3 ms. The asker is curious why he cannot get more IOPS from the single VM.

Well, there can be multiple reasons, but let's assume the physical storage is capable of providing more than 32K IOPS. I think that more IOPS cannot be achieved because only one VM is used and IOmeter is using a single vDisk having a single queue. The situation is depicted in the drawing below.


So, let’s do a simple math calculation for this particular situation …
  • We have a single vDisk queue with a default queue depth of 64 (we use the Paravirtual SCSI adapter; non-paravirtualized SCSI adapters have a queue depth of 32)
  • We have a QLogic HBA with a default queue depth of 64 (other HBA vendors, like Emulex, have a default queue depth of 32, which would be another bottleneck on the storage path)
  • The storage has an average service time (response time) of around 3 ms
We have to understand the following basic principles:
  • IOPS is the number of I/O operations per second
  • A queue depth of 64 = 64 I/O operations in parallel = 64 slots for I/O operations
  • Each of these 64 I/Os stays in the vDisk queue until the SCSI response comes back from the LUN
  • All other I/Os have to wait until there is a free slot in the queue.
And here is the math calculation ...

Q1: How many I/Os can be delivered in this situation per 1 millisecond?
A1: 64 (queue depth) / 3 ms (service time) = 21.33 I/Os per millisecond
 
Q2: How many I/Os can be delivered per 1 second?
A2: It is easy, 1,000 times more than per millisecond. So, 21.33 x 1,000 = 21,333 I/Os per second ~= 21.3K IOPS
 
The asker claims he can get 32K IOPS at a 3 ms response time, therefore it seems that the response time from the storage is actually better than 3 ms. The math above tells me that the storage response time in this particular exercise is somewhere around 2 ms. There can be other mechanisms boosting performance, for example I/O coalescing, but let's keep it simple.

If the storage were able to service an I/O in 1 ms, we would be able to get ~64K IOPS.
If the storage were able to service an I/O in 2 ms, we would be able to get ~32K IOPS.
If the storage were able to service an I/O in 3 ms, we would be able to get ~21K IOPS.

The math above works if the END-2-END queue depth is 64. This is the case when a QLogic HBA is used, as it has an HBA LUN queue depth of 64. In the case of an Emulex HBA, the HBA LUN queue depth is 32, therefore a higher vDisk queue depth (64) would not help.
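The whole reasoning above fits into a few lines of code. The sketch below just replays the math: the effective queue depth is the minimum along the end-to-end path (vDisk, HBA LUN, storage port), and the achievable IOPS from that single path is the effective queue depth divided by the service time.

# Little's Law style estimate: max IOPS achievable through a single I/O path.
def max_iops(service_time_ms, *queue_depths):
    """The effective queue depth is the smallest queue along the end-to-end path."""
    effective_qd = min(queue_depths)
    return effective_qd * 1000 / service_time_ms

print(max_iops(3, 64, 64))   # PVSCSI vDisk QD 64 + QLogic HBA LUN QD 64, 3 ms -> ~21,333 IOPS
print(max_iops(2, 64, 64))   # the same path with a 2 ms service time -> 32,000 IOPS
print(max_iops(2, 64, 32))   # Emulex HBA LUN QD 32 becomes the bottleneck -> 16,000 IOPS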
 
Hope the principle is clear now.

So how can you boost storage performance for a single VM? If you really need to get more IOPS from a single VM, you have only the three following options:
  1. Increase the queue depth, not only on the vDisk itself but END-2-END. THIS IS GENERALLY NOT RECOMMENDED, as you really must know what you are doing and it can have a negative impact on the overall shared storage. However, if you need it and have a justification for it, you can try to tune the system.
  2. Use a storage system with a low service time (response time). For example, a sub-millisecond storage system (for example 0.5 ms) will give you more IOPS for the same queue depth than a storage system with a higher service time (for example 3 ms).
  3. Leverage multiple vDisks spread across multiple vSCSI controllers and datastores (LUNs). This gives you more (total) queue depth in a distributed fashion. However, it has additional requirements for your real application, as it needs a filesystem or another mechanism supporting multiple storage devices (vDisks).
I hope options 1 and 2 are clear. Option 3 is depicted in the figure below.


CONCLUSION
In a typical VMware vSphere environment, you use a shared storage system from multiple ESXi hosts, with multiple VMs having vDisks on multiple datastores (LUNs). That's the reason why the default queue depth usually makes perfect sense, as it provides fairness among all storage consumers. If you have a storage system with, let's say, a 2 ms response time and a queue depth of 32, you can still get around 16K IOPS. This should be good enough for any typical enterprise application, and usually I recommend using IOPS limits to limit some VMs (vDisks) even more. This is how storage performance tiering can be very simply achieved on a VMware SDDC with unified infrastructure. If you need higher storage performance, your application is specific and you should do a specific design and leverage specific technologies or tunings.

By the way, I like a statement from Howard Marks (@DeepStorageNet) that I heard on his storage technologies podcast "GrayBeards". It is something like ...
"There are only two storage performance types - good enough and not good enough." 
This is very true.
 
Hope this writeup helps the broader VMware community.

Relevant articles: