I found the IBM Redbook explanation [SOURCE: IBM RedBook redp-4609-01] the most descriptive, so here it is.
The term segment size refers to the amount of data that is written to one disk drive in an array before writing to the next disk drive in the array. For example, in a RAID 5, 4+1 array with a segment size of 128 KB, the first 128 KB of the LUN storage capacity is written to the first disk drive and the next 128 KB to the second disk drive. For a RAID 1, 2+2 array, 128 KB of an I/O is written to each of the two data disk drives and to the mirrors. If the I/O size is larger than the number of disk drives times 128 KB, this pattern repeats until the entire I/O is completed. For very large I/O requests, the optimal segment size for a RAID array is one that distributes a single host I/O across all data disk drives.
The formula for optimal segment size is:
LUN segment size = LUN stripe width ÷ number of data disk drives
For RAID 5, the number of data disk drives is equal to the number of disk drives in the array minus one, for example:
RAID 5, 4+1 with a 64 KB segment size: (5-1) × 64 KB = 256 KB stripe width
For RAID 1, the number of data disk drives is equal to the number of disk drives divided by two, for example:
RAID 10, 2+2 with a 64 KB segment size: 2 × 64 KB = 128 KB stripe width
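The formula and both examples above can be sketched in a few lines of Python (the helper names are mine, purely for illustration):

```python
def data_drives(raid_level, total_drives):
    """Number of data-bearing drives for the RAID levels discussed above."""
    if raid_level == 5:
        return total_drives - 1      # one drive's worth of capacity holds parity
    if raid_level in (1, 10):
        return total_drives // 2     # the other half holds the mirrors
    raise ValueError("unsupported RAID level in this sketch")

def stripe_width_kb(raid_level, total_drives, segment_kb):
    """Stripe width (data stripe size) = data drives per stripe * segment size."""
    return data_drives(raid_level, total_drives) * segment_kb

# The two examples from the text:
print(stripe_width_kb(5, 5, 64))   # RAID 5, 4+1, 64 KB segments -> 256
print(stripe_width_kb(10, 4, 64))  # RAID 10, 2+2, 64 KB segments -> 128
```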
For small I/O requests, the segment size must be large enough to minimize the number of segments (disk drives in the LUN) that must be accessed to satisfy the I/O request, that is, to minimize segment boundary crossings.
For IOPS environments, set the segment size to 256 KB or larger, so that the stripe width is at least as large as the median I/O size.
IBM Best practice: For most implementations, set the segment size of VMware data partitions to 256 KB.
Note: If I'm decoding IBM terminology correctly, then what IBM calls "stripe width" is actually the "data stripe size". We need to clarify the terminology, because the term "stripe width" is normally used for the number of disks in a RAID group. The "data stripe size" is the payload without the parity; the parity is stored in another segment (or segments), depending on the selected RAID level.
To make the terminology clear, I've created a RAID 5 (4+1) segment/stripe visualization, depicted below.
|RAID 5 (4+1) striping example|
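Alongside the image, the same layout can be generated with a minimal Python sketch that prints data segments and a rotating parity segment per stripe (the parity rotation order is assumed here purely for illustration; real arrays may rotate differently):

```python
def raid5_layout(drives=5, stripes=5):
    """Return one text row per stripe: data segments 'Dnn' plus a rotating 'P' parity."""
    rows, seg = [], 0
    for stripe in range(stripes):
        parity_drive = (drives - 1 - stripe) % drives  # parity walks backwards each stripe
        cells = []
        for drive in range(drives):
            if drive == parity_drive:
                cells.append("P  ")
            else:
                cells.append(f"D{seg:02d}")
                seg += 1
        rows.append(" | ".join(cells))
    return rows

for row in raid5_layout():
    print(row)
# First row: D00 | D01 | D02 | D03 | P
```

Each row is one stripe: four data segments (the data stripe) plus one parity segment spread across all five drives.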
Even though I found this IBM description very informative, I'm not sure why they recommend using a 256 KB segment size for VMware. It is true that the biggest I/O size issued from ESX is by default 32 MB, because ESX splits bigger I/Os issued from the guest OS into multiple I/Os (for more information about big I/O splitting see this blog post). However, the most important factor is the I/O size issued from the guest OSes. If you want to monitor the max/average/median I/O size from ESX, you can use the vscsiStats tool already included in ESXi for exactly this purpose. It can show you a histogram, which is really cool (for more information about vscsiStats read this excellent blog post). Based on all these assumptions, and also on my own I/O size monitoring in the field, it seems to me that the average I/O size issued from ESX is usually somewhere between 32 and 64 KB. So let's use 64 KB as the average data stripe (the I/O size issued from the OS). Then for RAID 5 (4+1) the data stripe will be composed of 4 segments, and the optimal segment size in this particular case should be 16 KB (64/4).
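The calculation above, expressed as a tiny sketch (the function name is mine, just inverting the Redbook formula):

```python
def optimal_segment_kb(io_size_kb, data_drives):
    """Redbook-style optimum: spread one host I/O evenly across all data drives."""
    return io_size_kb / data_drives

# 64 KB average I/O spread over the 4 data drives of a RAID 5 (4+1):
print(optimal_segment_kb(64, 4))  # -> 16.0
```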
Am I right, or did I miss something? Any comments are welcome and highly appreciated.
We discuss this topic very frequently with my colleague who works as a Dell storage specialist. The theory is nice, but only a real test can prove any theory. Recently he performed a set of Iometer tests against a Dell PV MD3600f, which is actually the same array as the IBM DS3500. He found that optimal performance (number of IOPS versus response times) is achieved when the segment size is as close as possible to the I/O size issued from the operating system. So the key takeaway from this exercise is that the optimal segment size for the example above is not 16 KB but 64 KB. Now I understand IBM's general recommendation (best practice) to use a 256 KB segment size for VMware workloads, as this is the biggest segment size that can be chosen.
After more thinking about this topic, I've realized that the idea of using a segment size bigger than your biggest I/O size can make sense for several reasons:
- each I/O will get a single spindle (disk) to handle it, so it will use the queues down the route and will be served within a single spindle's latency, which is the minimum for that single I/O, right?
- a typical virtual infrastructure environment runs several VMs generating many I/Os in parallel, depending on the queues available in the guest OS and the ESX disk scheduler settings (see more on Duncan Epping's blog), so at the end of the day you are able to generate lots of IOPS from different threads and the load is evenly distributed across the RAID group
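The first bullet's reasoning can be checked with a quick segment-boundary-crossing sketch (the helper name is mine; offsets are in KB and stripe wrap-around is ignored for simplicity):

```python
def drives_touched(io_offset_kb, io_size_kb, segment_kb):
    """How many segments (hence spindles, ignoring stripe wrap) one I/O crosses."""
    first = io_offset_kb // segment_kb
    last = (io_offset_kb + io_size_kb - 1) // segment_kb
    return last - first + 1

# A 64 KB I/O with 16 KB segments always spans 4 spindles...
print(drives_touched(0, 64, 16))   # -> 4
# ...while with a 64 KB segment it lands on 1 spindle when aligned,
print(drives_touched(0, 64, 64))   # -> 1
# and at most 2 when it straddles a segment boundary.
print(drives_touched(32, 64, 64))  # -> 2
```

With the segment size at or above the I/O size, most I/Os are served by one spindle each, so concurrent I/Os from many VMs spread naturally across the RAID group.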