Thursday, October 11, 2018

VMware virtual disk (VMDK) in Multi Write Mode

VMFS is a clustered file system that, by default, prevents multiple virtual machines from opening and writing to the same virtual disk (vmdk file). This prevents more than one virtual machine from inadvertently accessing the same vmdk file. It is a safety mechanism that avoids data corruption in cases where the applications in the virtual machines do not maintain consistency in the writes performed to the shared disk. However, you might run a third-party cluster-aware application. For such cases, the multi-writer option allows VMFS-backed disks to be shared by multiple virtual machines, so that third-party OS/App cluster solutions can share a single VMDK disk on a VMFS filesystem. In these cluster-aware applications, the application itself ensures that writes originating from multiple different virtual machines do not cause data loss. Examples of such third-party cluster-aware applications are Oracle RAC, Veritas Cluster Filesystem, etc.

VMware KB 1034165, “Enabling or disabling simultaneous write protection provided by VMFS using the multi-writer flag”, describes how to enable or disable the simultaneous write protection provided by VMFS. It is the official resource on how to use the multi-writer flag, but the operational procedure is slightly obsolete, because vSphere 6.x supports the configuration from the Web Client (Flash) or vSphere Client (HTML5) GUI, as highlighted in the screenshot below.

However, KB 1034165 contains several important limitations which should be considered and addressed in solution design. Limitations of multi-writer mode are:
  • The virtual disk must be eager zeroed thick; it cannot be lazy zeroed thick or thin provisioned.
  • Sharing is limited to 8 ESXi/ESX hosts with VMFS-3 (vSphere 4.x), VMFS-5 (vSphere 5.x), and VMFS-6 in multi-writer mode.
  • Hot adding a virtual disk removes the multi-writer flag.

Let’s focus on the 8 ESXi host limit. The above statement about scalability is a little unclear, which is why one of my customers asked me what it really means. I did some research in internal VMware resources and, fortunately, found an internal VMware discussion about this topic, so I think sharing the information will help the broader VMware community.

Here is the 8 host limit explained in other words …

“The 8 host limit defines how many ESXi hosts can simultaneously open the same virtual disk (aka VMDK file). If the cluster-aware application is not going to have more than 8 nodes, it works and it is supported. This limitation applies to a group of VMs sharing the same VMDK file for a particular instance of the cluster-aware application. In case you need to consolidate multiple application clusters into a single vSphere cluster, you can safely do it, and app nodes from one app cluster instance can run on different ESXi nodes than app nodes from another app cluster instance. It means that if you have more than one app cluster instance, all app cluster instances together can leverage resources from more than 8 ESXi hosts in the vSphere cluster.”

The best way to fully understand specific behavior is to test it. That’s why I have a pretty decent home lab. However, I do not have 10 physical ESXi hosts, so I created a nested vSphere environment with a vSphere cluster of 9 ESXi hosts. You can see the vSphere cluster with two app cluster instances (App1, App2) in the screenshot below.

Application cluster instance App1 is composed of 9 nodes (9 VMs) and instance App2 of just 2 nodes. Each instance shares its own VMDK disk. The whole test infrastructure is conceptually depicted in the figures below.

Test Step 1: I started 8 of the 9 VMs of the App1 cluster instance on 8 ESXi hosts (ESXi01-ESXi08). Such a setup works perfectly fine, as there is a 1 to 1 mapping between VMs and ESXi hosts, within the limit of 8 ESXi hosts having the shared VMDK1 open.

Test Step 2: The next step is to test the Power-On operation of App1-VM9 on ESXi09. The operation fails. This is the expected result, because a 9th ESXi host cannot open the VMDK1 file on the VMFS datastore.

The error message is visible on the screenshot below.

Test Step 3: The next step is to power on App1-VM9 on ESXi01. This operation is successful, because two app cluster nodes (virtual machines App1-VM1 and App1-VM9) are running on a single ESXi host (ESXi01). Therefore only 8 ESXi hosts have the VMDK1 file open and we are within the supported limits.

Test Step 4: Let’s test vMotion of App1-VM9 from ESXi01 to ESXi09. The operation fails. This is the expected result, for the same reason as with the Power-On operation: the App1 cluster instance would be stretched across 9 ESXi hosts, but a 9th ESXi host cannot open the VMDK1 file on the VMFS datastore.

The error message is a little bit different but the root cause is the same.

Test Step 5: Let’s test vMotion of App2-VM2 from ESXi08 to ESXi09. The operation works, because the App2 cluster instance is still stretched across only two ESXi hosts, so it is within the supported limit of 8 ESXi hosts.

Test Step 6: The last test is the vMotion of App2-VM2 from the vSphere cluster (ESXi08) to a standalone ESXi host outside of the vSphere cluster (ESX01). The operation works, because the App2 cluster instance is still stretched across only two ESXi hosts, so it is within the supported limit of 8 ESXi hosts. The vSphere cluster is not the boundary for multi-writer VMDK mode.
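
The rule observed in the test steps above can be summarized in a small toy model. This is only a sketch of the observed behavior, not how VMFS locking is implemented. The class and method names are made up for illustration, the model only counts which hosts have the disk open after an operation completes (any transient state during vMotion is ignored), and the placement of App2-VM1 on ESXi07 is an assumption, because the text does not say where it runs.

```python
MAX_HOSTS_PER_VMDK = 8  # limit stated in VMware KB 1034165


class SharedVmdk:
    """Toy model: tracks which ESXi hosts have a multi-writer VMDK open."""

    def __init__(self, name, max_hosts=MAX_HOSTS_PER_VMDK):
        self.name = name
        self.max_hosts = max_hosts
        self.vm_host = {}  # VM name -> host the VM currently runs on

    def hosts(self, exclude_vm=None):
        """Set of hosts that have the disk open, optionally ignoring one VM."""
        return {h for vm, h in self.vm_host.items() if vm != exclude_vm}

    def can_place(self, vm, host):
        """A VM fits if its host already has the disk open or a slot is free."""
        others = self.hosts(exclude_vm=vm)
        return host in others or len(others) < self.max_hosts

    def power_on(self, vm, host):
        if not self.can_place(vm, host):
            return False
        self.vm_host[vm] = host
        return True

    def vmotion(self, vm, target_host):
        if vm not in self.vm_host:
            raise KeyError(f"{vm} is not powered on")
        # Same check as power-on: only the resulting host set is evaluated.
        return self.power_on(vm, target_host)


# Replay of test steps 1-5 (hypothetical host for App2-VM1: ESXi07).
vmdk1 = SharedVmdk("VMDK1")
step1 = [vmdk1.power_on(f"App1-VM{i}", f"ESXi0{i}") for i in range(1, 9)]
step2 = vmdk1.power_on("App1-VM9", "ESXi09")   # 9th host: refused
step3 = vmdk1.power_on("App1-VM9", "ESXi01")   # host already has the disk open
step4 = vmdk1.vmotion("App1-VM9", "ESXi09")    # would need a 9th host: refused

vmdk2 = SharedVmdk("VMDK2")
vmdk2.power_on("App2-VM1", "ESXi07")
vmdk2.power_on("App2-VM2", "ESXi08")
step5 = vmdk2.vmotion("App2-VM2", "ESXi09")    # only two hosts involved

print(all(step1), step2, step3, step4, step5)
```

Each shared VMDK has its own counter, which is why App1 and App2 together can span more than 8 hosts.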


Q: What exactly does the limit of 8 ESXi hosts mean?
A: The limit of 8 ESXi hosts defines how many ESXi hosts can simultaneously open the same virtual disk (aka VMDK file). If the cluster-aware application is not going to have more than 8 nodes, it works and it is supported. Details and various scenarios are described in this article.

Q: Where is the information about the locks held by ESXi hosts stored?
A: The normal VMFS file locking mechanism is in use, so there are VMFS file locks, which can be displayed with the ESXi command: vmkfstools -D
The only difference is that multi-writer VMDKs can have multiple locks, as shown in the screenshot below.

Q: Is it supported to use DRS rules with multi-writer vmdks when there are more than 8 ESXi hosts in the cluster where the VMs with multi-writer vmdks are running?
A: Yes, it is supported. DRS rules can be used to keep all nodes of a particular app cluster instance on specified ESXi hosts. This is not required from a technical point of view, but it can be beneficial from a licensing point of view.

Q: How can the ESXi life cycle be handled with the limit of 8 ESXi hosts?
A: Let’s discuss specific VM operations and the supportability of the multi-writer vmdk configuration. The source for the answers is the VMware KB.
  • Power on, off, restart virtual machine – supported
  • Suspend VM – unsupported
  • Hot add virtual disks – only to existing adapters
  • Hot remove devices – supported
  • Hot extend virtual disk – unsupported
  • Connect and disconnect devices – supported
  • Snapshots – unsupported
  • Snapshots of VMs with independent-persistent disks – supported
  • Cloning – unsupported
  • Storage vMotion – unsupported
  • Changed Block Tracking (CBT) – unsupported
  • vSphere Flash Read Cache (vFRC) – unsupported
  • vMotion – supported by VMware for Oracle RAC only and limited to 8 ESX/ESXi hosts. Note: other cluster-aware applications are not supported by VMware but can be supported by partners. For example, Veritas products have their supportability documented here. Please verify current supportability directly with the specific partners.

Q: Is it possible to migrate VMs with multi-writer vmdks to a different cluster while they are offline?
A: Yes. A VM can be shut down or powered off and then powered on on any ESXi host outside of the vSphere cluster. The only requirement is to have the same VMFS datastore available on the source and target ESXi hosts. Please keep in mind that the maximum supported number of ESXi hosts connected to a single VMFS datastore is 64.

Saturday, September 01, 2018

New with vSphere 6.7 U1 - Enhanced Load Balancing Path Selection Policy

With the release of vSphere 6.7 U1, there are now sub-policy options for VMW_PSP_RR to enable active monitoring of the paths. The policy considers path latency and pending IOs on each active path. This is accomplished with an algorithm that monitors active paths and calculates average latency per path based on either time and/or the number of IOs. When the module is loaded, the latency logic will get triggered and the first 16 IOs per path are used to calculate the latency. The remaining IOs will then be directed based on the results of the algorithm’s calculations to use the path with the least latency. When using the latency mechanism, the Round Robin policy can dynamically select the optimal path and achieve better load balancing results.

The user must first enable the configuration option that allows the latency-based sub-policy for VMW_PSP_RR:
esxcfg-advcfg -s 1 /Misc/EnablePSPLatencyPolicy
To switch to the latency-based sub-policy, use the following command:
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency
If you want to change the default evaluation time or the number of sampling IOs used to evaluate latency, use the following commands.

For the latency evaluation time (default is 15000 ms = 15 sec):
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency --latency-eval-time=18000
For the number of sampling IOs:
esxcli storage nmp psp roundrobin deviceconfig set -d --type=latency --num-sampling-cycles=32
To check the device configuration and sub-policy:
esxcli storage nmp device list -d
The diagram below shows how sampling IOs are monitored on paths P1, P2, and P3 and how a path is eventually selected. At time “t” the sampling window starts. In the sampling window, IOs are issued on each path in Round Robin fashion and their round-trip time is monitored. Path P1 took 10ms in total to complete 16 sampling IOs. Similarly, path P2 took 20ms for the same number of sampling IOs and path P3 took 30ms. As path P1 has the lowest latency, it will be selected more often for IOs. The sampling window then starts again after the interval “T”. Both “m” and “T” are tunable parameters, but we suggest not changing them, as they are set to default values based on experiments run internally during implementation.

Diagram: how sampling IOs are monitored and paths are selected.
T = Interval after which sampling should start again
m = Sampling IOs per path

t1 < t2 < t3 ---------------> 10ms < 20ms < 30ms 
t1/m < t2/m < t3/m -----> 10/16 < 20/16 < 30/16
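
The arithmetic above can be reproduced in a few lines of Python. This is a minimal sketch that only computes the average latency per sampling IO and picks the lowest one. The function name is made up for illustration, and the way the real PSP weights IO distribution across paths ("selected more often") is not modeled here.

```python
def pick_lowest_latency_path(sample_totals_ms, m=16):
    """Return (best_path, averages) for total sampling times per path.

    sample_totals_ms: dict of path name -> total time in ms for m sampling IOs
    m: number of sampling IOs per path (16 in the example above)
    """
    averages = {path: total / m for path, total in sample_totals_ms.items()}
    best = min(averages, key=averages.get)
    return best, averages


# Values from the example: P1 = 10 ms, P2 = 20 ms, P3 = 30 ms for 16 IOs each.
best_path, avg_ms = pick_lowest_latency_path({"P1": 10.0, "P2": 20.0, "P3": 30.0})
print(best_path, avg_ms)
```

With the example values, the averages are 0.625 ms, 1.25 ms, and 1.875 ms per IO, so P1 is the preferred path until the next sampling window.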

In testing, VMware found that with the new latency monitoring policy, even with up to 100ms of latency introduced on half the paths, the PSP sub-policy maintained almost full throughput.
Setting the values for the round robin sub-policy can be accomplished via CLI or using host-profiles.

VMworld US 2018 - VIN2416BU - Core Storage Best Practices

As in previous years, William Lam has published URLs to the VMworld US 2018 breakout sessions. William wrote a blog post about it and created the GitHub repo vmworld2018-session-urls. A direct link to the US sessions is here.

I'm going to watch sessions from my areas of interest and write up my thoughts and interesting findings on my blog. So stay tuned and come back to read future posts, if interested. Let's start with VMworld 2018 session VIN2416BU - Core Storage Best Practices. Speakers: Jason Massae, Cody Hosterman. This technical session is about vSphere core storage topics. In the beginning, Jason shared with the audience the top storage issues seen by GSS and the challenges customers face. These are PSA, iSCSI, VMFS, NFS, VVols, Trim/Unmap, Queueing, and Troubleshooting. The general recommendation from Jason and Cody is to validate any vSphere storage change or adjustment with the particular storage vendor and VMware. Customers should change advanced settings only when recommended by the storage vendor or VMware GSS.

Cody follows with a basic explanation of how SATP and PSP work. Then he explains why some storage vendors recommend adjusting the default number of Round Robin I/Os sent down a single path before switching to the next path. The reason is not performance but faster failover in case of storage path issues. If you want to know how to change this setting, read VMware KB 2069356.

Jason continues with iSCSI which, according to VMware GSS, is the #1 source of problems. The first recommendation is not to expose LUNs used for the virtual environment to other external systems or functions. The only exception might be RDM LUNs used for OS-level clustering, but this is a special use case. Another topic is teaming and port binding. Some kind of teaming is highly recommended, and iSCSI port binding is preferred. Port binding gives you load balancing and failover not only on link failure but also on SCSI sense codes. This helps in situations when the network path is OK but the target LUN is not available for whatever reason. I wrote a blog post about this topic here. It is about the advanced option enable_action_OnRetryErrors which, as far as I know, was not enabled by default in vSphere 6.5 but probably is in 6.7. I did not test it, so it should be validated before it is implemented in production. Jason explained that in vSphere 6.0 and below, port binding did not allow L3 network routing, so NIC teaming was the only way to go in environments where initiators and targets (iSCSI portals) were in different IP subnets. However, since vSphere 6.5, iSCSI can leverage a dedicated TCP/IP stack, so a default gateway can be specified for the VMkernel port used for iSCSI. From vSphere 6.5 on, port binding is therefore highly recommended over NIC teaming. Afterwards, Jason explains the differences among the software iSCSI adapter, dependent iSCSI adapters (iSCSI offloaded to the hardware), and iSER (iSCSI over RDMA).

The mic is handed over to Cody, who shares with the audience information about the new enhanced Round Robin policy, available in vSphere 6.7 U1, which considers I/O latency for path selection. For more info read my blog post here.

Cody moves on to VMFS. VMFS 6 has a lot of enhancements. You cannot upgrade VMFS 5 to VMFS 6 in place, so you have to create a new datastore, format it with VMFS 6, and use Storage vMotion to migrate VMs from VMFS 5. Cody tries to answer typical customer questions. How big should my datastores be? How many VMs should I place in a single datastore? Do not expect an exact answer. The answer is that it depends on your storage architecture, performance, and capabilities. Everything begins with the question of whether your array supports VAAI (VMware API for Array Integration), but there are other questions you have to answer yourself, such as what granularity of recoverability you expect. Cody joked a little and, at the beginning of the VMFS section of the presentation, shared his opinion that the best VMFS datastore is a VVols datastore :-)

Jason continued with NFS best practices. The interesting one was about the usage of Jumbo Frames. Jason's recommendation is to use Jumbo Frames only when they are already configured in your network. In other words, it is not worth enabling them in the belief that you will get significantly better performance.

Another topic is VVols. Cody starts with a general explanation of what VVols are and are NOT. Cody highlights that with VVols you can finally use snapshots without any performance overhead, because they are offloaded to the storage hardware. Cody explains the basics of VVols terminology and architecture. VVols use one or more Protocol Endpoints. A Protocol Endpoint is nothing more than a LUN with ID 254, used as an Administrative Logical Unit (ALU). The Protocol Endpoint is used to handle the data, but within the storage there are Virtual Volumes, also known as Subsidiary Logical Units (SLU), having VVol Sub-LUN IDs (for example 254:7), where vSphere storage objects (VM home directory, VMDKs, SWAPs, Snapshots) are stored. This session is not about VVols, so it is really just a brief overview.

Trim/UNMAP and Space Reclamation on Thin Provisioned storage
The mic is handed over to Jason, who starts another topic: Trim/UNMAP and Space Reclamation. Jason informs the audience that Trim/UNMAP functionality depends on the vSphere version and other circumstances.

In vSphere 6.0, Trim/UNMAP is a manual operation (esxcli storage vmfs unmap). In vSphere 6.5, it is automated with VMFS-6: it is enabled by default, can be configured to Low Priority or Off, and takes 12-24 hours to clean up space. vSphere 6.7 adds configurable throughput limits, where the default is Low Priority.

Space reclamation also works differently in each vSphere version.

  • In vSphere 6.0, it works when virtual disks are thin, VM hardware is 11+, and only for MS Windows OS. It does NOT work with VMware snapshots, with CBT enabled, with Linux OS, or when UNMAPs are misaligned.
  • In vSphere 6.5, it works when virtual disks are thin, it works for Windows and Linux OS, VM hardware is 11+ for MS Windows OS, VM hardware 13 for Linux OS, CBT can be enabled, UNMAPs can be misaligned. It does NOT work when the VM has VMware snapshots, with thick virtual disks, or when the Virtual NVMe adapter is used.
  • In vSphere 6.7, it works when virtual disks are thin, it works for Windows and Linux OS, VM hardware is 11+ for MS Windows OS, VM hardware 13 for Linux OS, CBT can be enabled, UNMAPs can be misaligned, VM snapshots are supported, and the Virtual NVMe adapter is supported. It does NOT work for thick provisioned virtual disks.

The mic is handed back to Cody to speak about queueing. At the beginning of this section, Cody explains how storage queuing works in vSphere. He shares the default values of HBA Device Queues (aka HBA LUN queues) for different HBA types:

  • QLogic - 64
  • Brocade - 32
  • Emulex - 32
  • Cisco UCS (VIC) - 32
  • Software iSCSI - 128
HBA Device Queue is an HBA setting which controls how many I/Os may be queued on a device (aka LUN). Default values are configurable via esxcli. Changing them requires a reboot. Details are documented in VMware KB 1267. After the explanation of "HBA Device Queue", Cody explains DQLEN, which is the hypervisor-level device queue limit. The actual device queue depth is the minimum of "HBA Device Queue" and DQLEN, that is, MIN("HBA Device Queue", DQLEN). Therefore, if you increase DQLEN, you also have to adjust "HBA Device Queue" to get any real effect. DQLEN defaults are:

  • VMFS - 32
  • RDM - 32
  • VVols Protocol Endpoints - 128 (Scsi.ScsiVVolPESNRO)
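
To make the MIN("HBA Device Queue", DQLEN) rule concrete, here is a small Python sketch using the default values listed above. The function and dictionary names are made up for illustration; nothing here talks to a real host.

```python
# Defaults as listed in the session summary above.
HBA_DEVICE_QUEUE_DEFAULTS = {
    "QLogic": 64,
    "Brocade": 32,
    "Emulex": 32,
    "Cisco UCS (VIC)": 32,
    "Software iSCSI": 128,
}
DQLEN_DEFAULTS = {"VMFS": 32, "RDM": 32, "VVols Protocol Endpoint": 128}


def effective_queue_depth(hba_device_queue, dqlen):
    """Actual device queue depth is the smaller of the two limits."""
    return min(hba_device_queue, dqlen)


# QLogic HBA with a VMFS datastore: DQLEN (32) is the bottleneck.
print(effective_queue_depth(HBA_DEVICE_QUEUE_DEFAULTS["QLogic"], DQLEN_DEFAULTS["VMFS"]))

# Raising DQLEN to 128 on an Emulex HBA changes nothing until the
# HBA Device Queue (32) is raised as well.
print(effective_queue_depth(HBA_DEVICE_QUEUE_DEFAULTS["Emulex"], 128))
```

Both calls return 32, which illustrates why changing only one of the two settings usually has no effect.
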
After the basic explanation of the problem, Cody does some quick math to stress that the default settings are OK for most environments and it usually does not make sense to change them. However, if you have specific storage performance requirements, you have to understand the whole end-to-end storage stack, and then you can adjust the settings appropriately and do performance tuning. If you make any change, you should make it on all ESXi hosts in the cluster to keep performance consistent after migration from one ESXi host to another.

Storage DRS (SDRS) and Storage I/O Control (SIOC)
Cody continues with SDRS and SIOC. SDRS and SIOC are two different technologies.

SDRS moves VMs around when a latency threshold is hit. This is the VM observed latency, which includes any latency induced by queueing.

SIOC throttles VMs when a datastore latency threshold is hit. SIOC uses device (LUN) latency to throttle the device queue depth automatically when the device is stressed. If SIOC kicks in, it takes VMDK shares into consideration, so VMDKs with higher shares get more frequent access to the stressed (overloaded) storage device.

Jason notes that what Cody explained is how SIOC version 1 works, but there is also SIOC version 2, introduced in vSphere 6.5. SIOC v1 and v2 are different. SIOC v1 works at the datastore level, while SIOC v2 is a policy-based setting. It is good to know that SIOC v1 and SIOC v2 can co-exist on vSphere 6.5+. SIOC v2 is considerably different from v1 from a user experience perspective. SIOC v2 is implemented using the IO Filter framework, in the Storage IO Control category, and can be managed using SPBM policies. This means that you create a policy which contains your SIOC specifications, and these policies are then attached to virtual machines. One thing to note is that IO Filter based IOPS does not look at the size of the IO. There is no normalization like in SIOC v1, so a 64K IO is not counted as 2 x 32K IOs. It is a fixed value of IOPS irrespective of the size of the IO. For more information about SIOC look here.
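
The difference in IO accounting can be illustrated with a short Python sketch. It is illustrative only: the function names are made up, and the 32 KB normalization unit is simply borrowed from the example in the text, not a confirmed SIOC v1 parameter.

```python
import math


def sioc_v1_style_io_count(io_size_kb, unit_kb=32):
    """Size-normalized accounting: a large IO counts as several unit IOs.

    unit_kb=32 is an assumption taken from the 64K vs 2 x 32K example.
    """
    return max(1, math.ceil(io_size_kb / unit_kb))


def sioc_v2_style_io_count(io_size_kb):
    """IO Filter based accounting: every IO counts as one, whatever its size."""
    return 1


for size_kb in (4, 32, 64, 256):
    print(size_kb, sioc_v1_style_io_count(size_kb), sioc_v2_style_io_count(size_kb))
```

Under these assumptions, a 64K IO counts as two IOs with normalization and as one IO without it, so the same IOPS limit allows more throughput for workloads with large IOs when no normalization is applied.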

The last section of the presentation is about troubleshooting and is presented by Jason. When you have to troubleshoot storage performance, start by reviewing the performance graphs and try to narrow down the issue. Look at the VM and the ESXi host, and select only the problem component. You have to build your troubleshooting toolbox. It should include:

  • Performance Graph
  • vRealize LogInsight
  • vSphere On-disk Metadata Analyzer (VOMA)
  • Enable CEIP (Customer Experience Improvement Program) which shares troubleshooting data with VMware Support (GSS). CEIP is actually call-home functionality which is opt-out in vSphere 6.5+

Session evaluation 
This is a very good technical session for infrastructure folks responsible for VMware vSphere and storage interoperability. I highly recommend watching it.

Monday, August 27, 2018

VMworld 2018 announcements

In this post, I would like to summarize the coolest VMworld 2018 announcements.

Project Dimension
On-premise managed vSphere infrastructure in a cloudy fashion. Project Dimension will extend VMware Cloud to deliver SDDC infrastructure and hardware as-a-service to on-premises locations.  Because this will be a service, it means that VMware can take care of managing the infrastructure, troubleshooting issues, and performing patching and maintenance. For more info read this blog post.

Project Magna
Project Magna will make a self-driving data center based on machine learning possible. It is focused on applying reinforcement learning to a data center environment to drive greater performance and efficiencies. The demonstration illustrated how Project Magna can learn and understand application behavior to the point that it can model, test, and then reconfigure the network to improve performance. Project Magna relies on artificial intelligence algorithms to help connect the dots across huge data sets and gain deep insights across applications and the stack, from application code to software, to hardware infrastructure, to the public cloud and the edge.

vSphere Platinum
VMware vSphere Platinum is a new edition of vSphere that delivers advanced security capabilities fully integrated into the hypervisor. This new release combines the industry-leading capabilities of vSphere with VMware AppDefense, delivering purpose-built VMs to secure applications. For more info read this blog post.

VMware ESXi 64-bit Arm Support. 
ESXi will probably run on Cavium ThunderX2 servers. Cavium ThunderX2 has very interesting specifications.

vSphere 6.7 Update 1
VMware announced vSphere 6.7 Update 1, which includes some key new and enhanced capabilities. Here are some highlights:
  • Fully Featured HTML5-based vSphere Client
  • Enhanced support for NVIDIA Quadro vDWS powered VMs (vSphere vMotion with NVIDIA Quadro vDWS vGPU powered VMs)
  • Support for Intel FPGA 
  • New vCenter Server Convergence Tool (allows migration from an external PSC architecture into embedded PSC architecture and also combine, merge, or separate vSphere SSO Domains)
  • Enhancements for HCI and vSAN
  • Enhanced vSphere Content Library (import of OVA templates from a HTTPS endpoint and local storage, native support of VM templates)
For more info read this blog post.

VMware vSAN 6.7 Update 1
vSAN 6.7 U1 will be available together with vSphere 6.7 U1. Here are some highlights:
  • Firmware Updates through VUM
  • Cluster Quickstart wizard
  • UNMAP support (capability of unmapping blocks when the Guest OS sends an unmap/trim command)
  • Mixed MTU support for vSAN Stretched Clusters (different MTU for Witness traffic than for vSAN traffic)
  • Historical capacity reporting
Amazon Relational Database Service (RDS) on VMware
AWS and VMware Announce Amazon Relational Database Service on VMware. It is a database as a service managed by Amazon. Amazon RDS on VMware will be generally available soon and will support Microsoft SQL Server, Oracle, PostgreSQL, MySQL, and MariaDB databases. Read announcement or register for preview here.

VVols support for SRM is now officially on the roadmap
It is not coming in the latest SRM version, but it is officially on the roadmap, so VMware has announced its commitment to develop it soon. [Source]

VMware vCloud Director 9.5
The new vCloud Director 9.5 enhances easy and intuitive cloud provisioning and consumption by adding highly-requested capabilities including self-service data protection, disaster recovery, and container-orchestration for cloud consumers, along with multi-site management, multi-tenancy and cross-platform networking for Cloud Providers.  6 Key New Innovations in vCloud Director 9.5:
  • Cross-site networking improvements powered by deeper integration with NSX
  • Initial integration with NSX-T
  • Additional integration with NSX including e.g. the possibility to stretch networks across virtual datacenters on different vCenter Servers or vCD instances residing at different sites right from the UI.
  • Cross-platform networking for Cloud Providers. Makes it possible for NSX-T and NSX-V managers in same vCD instance to create isolated logical L2 networks with a directly connected network.
  • Full transition to an HTML5 UI for the cloud consumer
  • Improvements to role-based access control
  • Natively integrated data protection capabilities, powered by Dell-EMC Avamar
  • vCD virtual appliance deployment model
  • Container-orchestration for cloud consumers. Deploy both VMs and containers, consumed via Kubernetes.
  • Data protection capabilities. EMC Avamar is added to the vCD UI to make it easier for end consumers to manage these tasks. This is made possible based on the extensible tools available via vCD which means you (software vendor, cloud provider) can publish services to the vCD UI as needed. Hopefully, other vendors will follow.
For more information read VMware blog.

Introducing VMware Cloud Provider Pod: Custom Designs
VMware announced a new product that will revolutionize the deployment of Cloud Provider environments through the first flexible, validated VMware cloud stack with 1-click deployment: VMware Cloud Provider Pod. Cloud Provider environments are complex to deploy thanks to interoperability, scalability, reliability and performance issues that constantly plague cloud admins and architects. It is a time-consuming process that takes weeks to months. Yes, there are “one-click” deployment products out there, but these are rigid, have stringent hardware compatibility requirements, and end up creating yet another datacenter silo to manage. The Cloud Provider Pod has been designed to deliver three key capabilities:

  • Allows Cloud Providers to design a custom cloud environment of their choice
  • Automates the deployment of the designed cloud environment in adherence with VMware Validated Designs for Cloud Providers
  • Generates customized documentation and guidelines for their environment that radically simplifies operations
For more information read VMware blog.

VMworld US 2018 General Sessions & Breakout Sessions Playback
You can watch the general sessions and also the technical (breakout) sessions online. General sessions from VMworld US 2018 are available at

A nice summary list of all VMworld US 2018 technical (breakout) sessions with the respective video playback & download URLs is available at