While researching this topic I found Frank Denneman's blog post from March 2009, where almost everything is explained very well and accurately, but the technology has changed a little since then. Another great document is here on VMware Community Docs.
Disclaimer: I work for DELL Services, so I work on a lot of projects with DELL Compellent Storage Systems, DELL Servers and QLogic HBAs. This blog post is therefore based on "DELL Compellent Best Practices with VMware vSphere 5.x" and "Qlogic Best Practices for VMware vSphere 5.x". I have all this equipment in the lab, so I was able to test it and verify what is going on. Although this post contains information about specific hardware, the general principles are the same for any other HBAs and storage systems.
Some sections of this post are copied and pasted from the public documents mentioned above, so all credit goes there. Other sections are based on my lab tests, experiments and thoughts.
What is queue depth?
Queue depth is defined as the number of disk transactions that are allowed to be “in flight” between an initiator and a target, where the initiator is typically an ESXi host HBA port/iSCSI initiator and the target is typically the Storage Center front-end port. Since any given target can have multiple initiators sending it data, the initiator queue depth is generally used to throttle the number of transactions being sent to a target to keep it from becoming “flooded”. When this happens, the transactions start to pile up causing higher latencies and degraded performance. That being said, while increasing the queue depth can sometimes increase performance, if it is set too high, there is an increased risk of over-driving the storage array. As data travels between the application and the storage array, there are several places that the queue depth can be set to throttle the number of concurrent disk transactions. The most common places queue depth can be modified are:
- The application itself (Default=dependent on application)
- The virtual SCSI card driver in the guest (Default=32)
- The VMFS layer (DSNRO) (Default=32)
- The HBA VMkernel Module driver (Default=64)
- The HBA BIOS (Default=Varies)
HBA Queue Depth = AQLEN (Default varies, but on my QLogic card it is 2176)
- The “connection options” field should be set to 1 for point to point only
- The “login retry count” field should be set to 60 attempts
- The “port down retry” count field should be set to 60 attempts
- The “link down timeout” field should be set to 30 seconds
- The “queue depth” (or “Execution Throttle”) field should be set to 255. This queue depth can be set to 255 because the ESXi VMkernel driver module and DSNRO can more conveniently control the queue depth
I believe the QLogic Host Bus Adapter has a queue depth of 2176 because this number is visible in esxtop as AQLEN (Adapter Queue Length) for each HBA (vmhba).
HBA LUN Queue Depth (Default=64)
HBA LUN Queue Depth is 64 by default but can be changed via the HBA VMkernel Module driver. Compellent recommends setting the HBA max queue depth to 255, but this queue depth is only used in full by a single VM running on a single LUN (disk). If more VMs are running on the LUN, then DSNRO takes precedence. However, when SIOC (Storage I/O Control) is used, DSNRO is not used at all and the HBA VMkernel Module Driver queue depth value is used as the queue depth of each device (DQLEN). I think SIOC is beneficial in this situation because you can have deeper queues with dynamic queue management across all ESXi hosts connected to a particular LUN. Please note that special attention has to be paid to choosing the correct SIOC latency threshold, especially on LUNs with sub-LUN tiering.
If you want to check your Disk (aka LUN) Queue Depth value, it is nicely visible in esxtop as DQLEN.
So here is the procedure for changing the HBA VMkernel Module Driver Queue Depth.
Find the appropriate driver name for the module that is loaded for the QLogic HBA:
~ # esxcli system module list | grep ql
qlnativefc true true
Set the queue depth and retry parameters for that module:
esxcli system module parameters set -m qlnativefc -p "ql2xmaxqdepth=255 ql2xloginretrycount=60 qlport_down_retry=60"
You have to reboot the ESXi host to apply the module changes.
Below is a description of the QLogic parameters we have just changed:
- ql2xmaxqdepth (int) - Maximum queue depth to report for target devices.
- ql2xloginretrycount (int) - Specify an alternate value for the NVRAM login retry count.
- qlport_down_retry (int) - Maximum number of command retries to a port that returns a PORT-DOWN status.
You can check the configured module options with:
esxcfg-module --get-options qlnativefc
The resulting change (after the ESXi reboot) is visible in esxtop on disk devices as DQLEN.
VMFS Layer DSNRO (Default=32)
When two or more virtual machines share a LUN (logical unit number), this parameter controls the total number of outstanding commands permitted from all virtual machines collectively on the host to that LUN (this setting is not per virtual machine). For more information see KB
esxcli storage core device set -d
~ # esxcli storage core device list -d naa.6000d3100025e7000000000000000098
Display Name: COMPELNT Fibre Channel Disk (naa.6000d3100025e7000000000000000098)
Has Settable Display Name: true
Device Type: Direct-Access
Multipath Plugin: NMP
Devfs Path: /vmfs/devices/disks/naa.6000d3100025e7000000000000000098
Model: Compellent Vol
SCSI Level: 5
Is Pseudo: false
Is RDM Capable: true
Is Local: false
Is Removable: false
Is SSD: false
Is Offline: false
Is Perennially Reserved: false
Queue Full Sample Size: 0
Queue Full Threshold: 0
Thin Provisioning Status: yes
VAAI Status: supported
Other UIDs: vml.02000200006000d3100025e7000000000000000098436f6d70656c
Is Local SAS Device: false
Is Boot USB Device: false
No of outstanding IOs with competing worlds: 32
Once again: DSNRO is not used when SIOC is enabled, because in that case the "HBA LUN queue depth" is the leading parameter for the disk queue depth (DQLEN).
Storage Target Port Queue Depth
Queues exist in a lot of places and you have to understand the path from end to end. A queue exists on the storage array controller port as well; this is called the “Target Port Queue Depth”. Modern midrange storage arrays, like most EMC and HP arrays, can handle around 2048 outstanding I/Os.
2048 I/Os sounds like a lot, but most of the time multiple servers communicate with the storage controller at the same time. Because a port can only service one request at a time, additional requests are placed in the queue, and when the storage controller port receives more than 2048 I/O requests, the queue gets flooded. When the queue depth is reached (this status is called QFULL), the storage controller issues an I/O throttling command to the host to suspend further requests until space in the queue becomes available. The ESXi host accepts the I/O throttling command and decreases the LUN queue depth to the minimum value, which is 1! The QFULL condition is handled by the QLogic driver itself; the QFULL status is not returned to the OS. Some storage devices, however, return BUSY rather than QFULL. BUSY errors are logged in /var/log/vmkernel. Not every BUSY error is a QFULL error! Check the SCSI sense codes indicated in the vmkernel message to determine what type of error it is. The VMkernel checks every 2 seconds whether the QFULL condition is resolved. If it is resolved, the VMkernel slowly increases the LUN queue depth back to its normal value; this can take up to 60 seconds.
All together will give us the holistic view
Here are end to end disk queue parameters:
- DSNRO is 32 - this is default ESXi value.
- HBA LUN Queue Depth is 64 by default in QLogic HBAs. Compellent recommends changing it to 255 via a parameter of the HBA VMkernel Module Driver.
- HBA Queue Depth is 2176.
- Compellent Target Ports Queue Depths are 2048.
[Figure: Disk Queue Depth - Holistic View]
What is the optimal HBA LUN Queue Depth?
We already know that 64 is the default value for QLogic and that you can change it via the VMkernel Module driver. So what is the optimal HBA LUN Queue Depth?
To prevent flooding the target port queue, the product of the number of host paths, the HBA LUN Queue Depth, and the number of LUNs presented through the host port must not exceed the target port queue depth. In short: T >= P * Q * L
T = Target Port Queue Depth (2048)
P = Paths connected to the target port (1)
Q = HBA LUN Queue Depth (?)
L = Number of LUNs presented to the host through this port (20)
So when I have 20 LUNs exposed from the Compellent storage system, the HBA LUN Queue Depth should be at most Q = T / P / L = 2048 / 1 / 20 = 102.4 for a single ESXi host. When I have 16 hosts, that value must be divided by 16, so the value for this particular scenario is 102.4 / 16 = 6.4.
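The arithmetic above can be reproduced with a few lines of Python. This is only a sketch of the formula T >= P * Q * L solved for Q; the function name and the `hosts` parameter are my own naming, and the even split across hosts is the same simplification used in the text.

```python
def max_lun_queue_depth(target_port_qd, paths, luns, hosts=1):
    """Largest Q that satisfies T >= P * Q * L, split evenly across hosts."""
    return target_port_qd / (paths * luns * hosts)


# Values from the example: T = 2048, P = 1, L = 20
single_host = max_lun_queue_depth(2048, paths=1, luns=20)
sixteen_hosts = max_lun_queue_depth(2048, paths=1, luns=20, hosts=16)

print(single_host)    # 102.4
print(sixteen_hosts)  # 6.4
```

Plugging in your own number of paths, LUNs and hosts shows quickly how small the "no overbooking" queue depth becomes in a shared environment.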
This number assumes no overbooking, which is not realistic. The current QLogic HBA LUN Queue Depth default value (64) introduces a fan-in/fan-out ratio of 10:1 (64 / 6.4), which is probably based on long-term experience with virtualized workloads. In the past QLogic had lower default values, 16 in ESX 3.5 and 32 in ESX 4.x, which gave lower fan-in/fan-out ratios (2.5:1 and 5:1).
With the Compellent-recommended QLogic HBA LUN Queue Depth of 255 we will have, in this particular case, a fan-in/fan-out ratio of about 40:1, but only when SIOC is used. And when SIOC is used and the normalized datastore latency is higher than the SIOC threshold, dynamic queue management kicks in and the disk queues are automatically throttled. So all is good.
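The fan-in/fan-out ratios quoted here can be checked the same way. The sketch below assumes the scenario from this post (16 hosts, 20 LUNs, one path, target port queue depth 2048); the function name is hypothetical, and the ratio is simply the worst-case number of outstanding I/Os the hosts are allowed to send divided by what the target port can queue.

```python
def fan_in_ratio(lun_qd, luns, hosts, target_port_qd, paths=1):
    """Worst-case outstanding I/Os from all hosts divided by the target port queue depth."""
    return (paths * lun_qd * luns * hosts) / target_port_qd


# HBA LUN queue depths mentioned in the text: old defaults, current default, Compellent value
for lun_qd in (16, 32, 64, 255):
    ratio = fan_in_ratio(lun_qd, luns=20, hosts=16, target_port_qd=2048)
    print(lun_qd, round(ratio, 2))
# 16 2.5
# 32 5.0
# 64 10.0
# 255 39.84
```

The last line is where the roughly 40:1 figure for a queue depth of 255 comes from.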
What is my real Disk Queue Depth?
VMkernel Disk Queue Depth (DQLEN) is the number which matters. However, this number is dynamic and depends on several factors. Here are the scenarios you can have:
Scenario 1 - SIOC enabled
When you have a datastore with SIOC enabled, your DQLEN will be set to the HBA LUN Queue Depth. It is 64 by default, but Compellent recommends setting it to 255. However, Compellent doesn't recommend using SIOC.
Scenario 2 - SIOC disabled and you have only one VM on datastore
When you have just one VM on the datastore, your DQLEN will be set to the HBA LUN Queue Depth. But how often do you have just one VM on a datastore?
Scenario 3 - SIOC disabled and you have two or more VMs on datastore
In this scenario your DQLEN will be set to DSNRO, which is 32 by default and takes precedence over the HBA LUN Queue Depth.
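The three scenarios can be summarized as a small decision function. This is a simplified model of what is described in this post, not the actual VMkernel logic; the function and parameter names are hypothetical, and capping DSNRO at the HBA LUN Queue Depth is my assumption.

```python
def effective_dqlen(sioc_enabled, vms_on_datastore, hba_lun_qd=64, dsnro=32):
    """Return the disk queue depth (DQLEN) that the three scenarios predict."""
    if sioc_enabled:
        # Scenario 1: SIOC ignores DSNRO and uses the HBA LUN queue depth.
        return hba_lun_qd
    if vms_on_datastore <= 1:
        # Scenario 2: a single VM gets the full HBA LUN queue depth.
        return hba_lun_qd
    # Scenario 3: DSNRO takes precedence. Assumption: it cannot usefully
    # exceed the HBA LUN queue depth, hence the min().
    return min(dsnro, hba_lun_qd)


print(effective_dqlen(True, 10, hba_lun_qd=255))  # 255
print(effective_dqlen(False, 1))                  # 64
print(effective_dqlen(False, 10))                 # 32
```

Whatever the function predicts, the value to trust is the DQLEN that esxtop reports for the device.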
ESXi Disk Queue Management - None, Adaptive Queuing or Storage I/O Control?
By default ESXi doesn't use any Disk Queue Length (DQLEN) throttling mechanism, and your DQLEN is set to DSNRO (32) when two or more VMs (vDisks) are on the datastore. When only one VM disk is on the datastore, your DQLEN will be set to the HBA LUN Queue Depth (default=64, Compellent recommends 255).
When you have enabled Adaptive Queuing or Storage I/O Control your Disk Queue Length (DQLEN) will be throttled automatically when I/O congestion occurs. The difference between Adaptive Queuing and SIOC is how I/O congestion is detected.
Adaptive queuing waits for the storage to return a SCSI sense code of BUSY or a QUEUE FULL status on the I/O path. For adaptive queuing there are some advanced parameters which must be set on a per-host basis. From the Configuration tab, select Advanced Settings under Software. Navigate to Disk; the two parameters you need are Disk.QFullSampleSize and Disk.QFullThreshold. By default, QFullSampleSize is set to 0, meaning that adaptive queuing is disabled. When this value is set, adaptive queuing will kick in and halve the queue depth when this number of queue full conditions is reported by the array. QFullThreshold is the number of good statuses to receive before incrementing the queue depth once again.
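To make the halve-and-recover behaviour more tangible, here is a toy simulation in Python. It is only a sketch of the mechanism as described in the paragraph above, not the real VMkernel algorithm: the function name, the status strings, the way the counters are reset and the increment of 1 per recovery step are all my assumptions.

```python
def simulate_adaptive_queue(statuses, max_depth=64, sample_size=32, threshold=8):
    """Return the queue depth after each I/O status in a simplified adaptive queuing model."""
    depth = max_depth
    qfull_seen = 0   # queue full / busy conditions counted so far
    good_seen = 0    # good statuses counted since the last change
    history = []
    for status in statuses:
        if status in ("QFULL", "BUSY"):
            qfull_seen += 1
            good_seen = 0
            if qfull_seen >= sample_size:
                depth = max(1, depth // 2)   # halve, but never go below 1
                qfull_seen = 0
        else:
            good_seen += 1
            if good_seen >= threshold and depth < max_depth:
                depth += 1                   # recover one step at a time
                good_seen = 0
        history.append(depth)
    return history


# Small numbers so the effect is visible: 4 queue full conditions, then 4 good statuses
history = simulate_adaptive_queue(["QFULL"] * 4 + ["GOOD"] * 4,
                                  max_depth=64, sample_size=2, threshold=2)
print(history)  # [64, 32, 32, 16, 16, 17, 17, 18]
```

The point of the example is the asymmetry: the queue depth drops quickly under congestion and climbs back slowly once the array reports good statuses again.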
Storage I/O Control uses the concept of a congestion threshold, which is based on latency. This is not simply the latency of the datastore as seen by one particular ESXi host: a sophisticated algorithm computes a normalized datastore latency from the datastore latencies of all ESXi hosts using that particular datastore. On top of this datastore-wide normalization there is another type of normalization which takes into account just "normal" I/O sizes. What is normal IO size?
The information in the section above is based on Cormac Hogan's blog post.
Conclusion
Based on the information above, I think that the default ESXi and QLogic values are the best for general workloads, and tuning queues is not something I would recommend. It is important to know that queue tuning does not help performance in most cases. Queues are there to improve latency during transient I/O bursts. If you have storage performance issues, it is usually because of the storage system's performance and not because of queues in the network path. A bigger queue depth allows more asynchronous I/Os to the storage subsystem, which can be beneficial for storage performance in an environment with just a few VMs, but when you have hundreds of VMs, a lot of I/Os are already in flight to the storage system, so deeper queues will not help you anyway.
The question I still need to answer for myself is why Compellent doesn't recommend VMware Storage I/O Control and Adaptive Queuing for dynamic queue management, which is in my opinion a very good thing.
UPDATE 2014-12-16: I have downloaded the latest Compellent Best Practices and there is a new statement about SIOC. SIOC can be enabled, but you must know what is happening and whether it is beneficial for you. Here is the snippet from the document:
SIOC is a feature that was introduced in ESX/ESXi 4.1 to help VMware administrators regulate storage performance and provide fairness across hosts sharing a LUN. Due to factors such as Data Progression and the fact that Storage Center uses a shared pool of disk spindles, it is recommended that caution is exercised when using this feature. Due to how Data Progression migrates portions of volumes into different storage tiers and RAID levels at the block level, this could ultimately affect the latency of the volume, and trigger the resource scheduler at inappropriate times. Practically speaking, it may not make sense to use SIOC unless pinning particular volumes into specific tiers of disk.
I still believe SIOC is the way to go and that special attention has to be paid to the SIOC latency threshold. Compellent recommends keeping it at the default value of 30 milliseconds, which makes perfect sense. The storage system will do all the hard work for you, but when there is huge congestion and the normalized latency is too high, dynamic ESXi disk queue management can kick in. It makes sense to me.
I also believe that Adaptive Queuing is a really good and practical safety mechanism when your storage array has full queues on storage front-end ports or LUNs. If Adaptive Queuing is not used, even SIOC cannot help you with LUN/datastore issues (datastore disconnections), because the SIOC algorithm is based on device response time and not on queue full responses from the storage. Therefore, I strongly recommend enabling SIOC together with Adaptive Queuing unless your storage vendor has a really good justification not to do so. SIOC will help you with storage traffic throttling during high device response times, and Adaptive Queuing will help when storage array queues are full and the device cannot accept new I/Os. However, Adaptive Queuing should be configured in concert with your storage vendor. For more information on how to enable Adaptive Queuing, read VMware KB 1008113. Please note that SIOC and Adaptive Queuing are just safety mechanisms to mitigate the impact of storage issues; the root cause lies in capacity planning on the storage array.
If any of you with a deep understanding of vSphere and storage architectures see an error in my analysis, please let me know so that I can correct appropriately.