Sunday, December 17, 2017

No Storage, No vSphere, No Datacenter

In the past, I have had a lot of discussions with different customers and partners about various storage issues with VMware vSphere. It was always identified as a physical storage or SAN issue and VMware support recommendation was to contact the particular storage vendor. It was always true and correct recommendation, however such storage issues always have the catastrophic or at least huge impact not only on virtualized workloads running on the impacted datastore but also on manageability of VMware vSphere because of intensive ESXi logging which affects hostd and vpxa services and it ends up with ESXi host disconnection from vCenter and very slow direct manageability of ESXi host. Such issues should be resolved by fixing the storage issue but in the meantime, vSphere admins do not have visibility into the part or even the whole vSphere environment, therefore, they usually restart impacted ESXi hosts which have a negative impact on the availability of VMs running even on not impacted datastores. Such situations are usually classified by users as the whole datacenter outage. You can imagine how hot are discussions between VMware and Storage teams in such situations and I often say the generic expression ...
"NO STORAGE,  NO DATACENTER"
Well, there is no doubt, the storage is the most important piece of the datacenter. VMware ESXi hypervisor is usually just an I/O storage passthrough component with some additional intelligence like
  • native storage multipathing (NMP), 
  • fair storage I/O scheduling (SIOC), 
  • I/O filtering (VAIO),
  • etc. 

And probably due to such additional intelligence, VMware customers usually expect that VMware vSphere will do some magic to mitigate physical storage or SAN related issues. First of all, it is logical and obvious that VMware vSphere cannot solve the issue of physical infrastructure. However, there can be some specific scenarios when the storage device is not available through one path but available via another path.

In this blog post, I would like to share my recent findings of storage issues and VMware native multipathing. Let's start with the visualization of storage multipathing over Fibre Channel SAN. Usually, there are two independent SANs (A and B). Each ESXi HBA is connected to different SAN. From the storage point of view, each storage controller (two storage controllers depicted in the figure below) is connected to different SAN through different storage front-end ports. HBA port is storage initiator and storage front-end ports are usually storage targets.

The I/O sent from ESXi hosts to their assigned logical unit numbers (LUNs) travels through a specific route that starts with an HBA and ends at a LUN. This route is referred to as a path. Each host, in a properly designed infrastructure, should have more than one path to each LUN. VMware generally recommends four storage paths but the optimal number of paths depending on particular storage architecture. In the figure above we have following four paths to LUN 1
  • vmhba0:C0:T0:L1
  • vmhba0:C0:T2:L1
  • vmhba1:C0:T1:L1
  • vmhba1:C0:T3:L1
Note: The storage system usually exports multiple LUNs with additional paths but we use single LUN (LUN 1) here for simplicity. 

ESXi host sees LUN 1 as four independent devices (volumes) but because ESXi has native multipathing driver these four LUNs are identified as the same LUN (LUN1 in our case) therefore ESXi automatically collapse these four devices into a single device having four independent paths. Storage I/Os to such device are distributed across multiple paths based on multipathing policy. ESXi has three native multipathing policies
  • fixed (FIXED), 
  • most recently used (aka MRU) and 
  • round robin (RR). 
Multipathing policy type dictates how multiple I/Os are distributed across available paths but if the one I/O is sent through particular path it will stick on it until the path is claimed as dead. Single I/O flow is depicted below.

Most commonly used SCSI commands are
  • Inquiry (Requests general information of the target device)
  • Test/Unit/Ready aka TUR (Checks whether the target device is ready for the transfer operation)
  • Read (Transfers data from the SCSI target device)
  • Write (Transfers data to the SCSI target device)
  • Request Sense (Requests the sense data of the last command)
  • Read Capacity (Requests the storage capacity information)
If the LUN accepts the SCSI command everything is great and shiny, however when a LUN at the end of storage path experiences some problems, then an ESXi host sends the Test Unit Ready (TUR) command to the storage target (particular storage front-end port) to confirm that the path to the LUN is down before initiating a path failover. However, when the ESXi receive some TUR response from the storage system the path is for ESXi host up a running and repeatedly returns a retry operation request without triggering the failover even the TUR returns error responses and effectively the LUN is not ready. Typical TUR SCSI command response should be "TEST_UNIT_READY" but in case of any problem it returns from the storage systems following responses:
  • SCSI_HOST_BUS_BUSY 0x02
  • SCSI_HOST_SOFT_ERROR 0x0b
  • SCSI_HOST_RETRY 0x0c

The particular I/O flow is happening over a single selected path and VMware native multipathing will not try another path even there is some probability that LUN could be ready via another path. Let me say it again. The default behavior is that ...
... storage path does not fail over when the path to the target is up and sending reponse back into the initiator even the LUN is not available for whatever reasons.
The reason for such conservative vSphere behavior is that Enterprise Storage System and SAN should work and storage vendors claiming storage availability higher than 99.999%. Multipathing is usually solving the issue with the path to the storage system (to storage target ports) but not the problem on the storage system itself (LUN unknown unavailability). I personally believe, the physical storage system has another possibilities how to respond to ESXi host that particular path is not available at the moment and instruct ESXi multipathing driver to not issue I/Os via particular path if it is necessary and the storage system does not have other possibilities how to transfer I/O to the LUN in the storage. However, the reality is that some storage systems do not have LUN available (TUR return errors) through one path but it works via another path. This is typical interoperability issue. However, I have just been informed that there is solution how to resolve this interoperability issue. You can use the enable_action_OnRetryErrors option. This option allows the ESXi host to mark a problematic path as dead. After marking the path as dead, the host can trigger a failover and use an alternative working path. I assume that in case the LUN is not available via any path, all paths will be claimed as dead until LUN works again. 

Now you can ask when the storage path claimed as dead will become active again in case the LUN is back and available. All paths claimed as dead are periodically evaluated. The Fibre Channel path state is evaluated at a fixed interval or when there is an I/O error and TUR is returning nothing, which is not our case here. The path evaluation interval is defined via the advanced configuration option Disk.PathEvalTime in seconds. The default value is 300 seconds. This means that the path state is evaluated every 5 minutes unless an error is reported sooner on that path, in which case the path state might change depending on the interpretation of the reported error. However, I have been told that this standard Disk Path Evaluation DOES NOT return path to an active state when it was claimed as down by OnRetryErrors action. My understanding of the reason for such behavior is that the storage path had some errors, therefore, it is not good to put the path back into production to avoid flip-flop situation.

Let me stress again, such intelligent and proactive failover behavior based on TUR responses is not the default one. At least not in vSphere 6.5 and below. There are some rumors that it can change in the next vSphere release but there is not any official messaging so far. I personally think that more intelligent behavior is better for VMware customers which are usually expecting such cleverness from the vSphere and they are negatively surprised how vSphere behaves in case of storage issues over some paths. Som the intelligent and proactive failover behavior based on TUR responses can be additional cleverness of VMware vSphere native multipathing, however, it is important to say that it would help with few specific behaviors/misbehaviors of some storage systems but the basic rule is still valid ... "NO STORAGE,  NO DATACENTER".

Disclaimer: This is my current understanding how vSphere ESXi handles storage I/O based on my long experience in the field, tests in the lab, design and implementation projects and knowledge I have read from the documentation, VMware KB's and books.  If you want to know more, please, check some relevant references below and do your own research. I do not if my understanding of this topic is complete and if I do not understand something wrong. Therefore, express any feedback in the comments and we can discuss it further because only deep constructive discussions lead to further knowledge. 

References:

No comments: