Thursday, May 16, 2019

The SPECTRE story continues ... now it is MDS

Last year (2018) started with the shocking Intel CPU vulnerabilities Spectre and Meltdown, and two days ago another SPECTRE variant, known as Microarchitectural Data Sampling or MDS, was published. It was obvious from the beginning that this was just the start and that security experts and researchers would find other vulnerabilities over time. All these vulnerabilities are collectively known as speculative execution vulnerabilities, aka SPECTRE variants.

Here is the timeline of particular SPECTRE variant vulnerabilities along with VMware Security Advisories.

2018-01-03 - Spectre (speculative execution: bounds-check bypass and branch target injection) / Meltdown (speculative execution: rogue data cache load) - VMSA-2018-0002.3
2018-05-21 - Speculative Store Bypass (SSB) - VMSA-2018-0012.1
2018-08-14 - L1 Terminal Fault - VMSA-2018-0020
2019-05-14 - Microarchitectural Data Sampling (MDS) - VMSA-2019-0008

I published several blog posts about SPECTRE topics in the past

The last two vulnerabilities "L1 Terminal Fault (aka L1TF)" and "Microarchitectural Data Sampling (aka MDS)" are related to Intel CPU Hyper-threading. As per statement here AMD is not vulnerable.

When we are talking about L1TF and MDS, a typical question from my customers with Intel CPUs is whether they are safe when Hyper-Threading is disabled in the BIOS. The answer is yes, but you would have to power cycle the physical system to reconfigure the BIOS settings, which can be pretty annoying and time-consuming in larger environments. That's why VMware recommends leveraging the SDDC concept and making the change in software, through ESXi hypervisor advanced settings. It is obviously much easier to set the two ESXi advanced settings VMkernel.Boot.hyperthreadingMitigation and VMkernel.Boot.hyperthreadingMitigationIntraVM to true, which disables hyper-threading in the ESXi CPU scheduler without touching the BIOS of each physical server (as boot-time options, they still need an ESXi host reboot to take effect). You can do it with a PowerCLI one-liner in a few minutes, which is much more flexible than BIOS changes.

So that's it from the security point of view but what about performance?

It is simple. When Hyper-Threading is disabled, you lose the CPU performance benefit of Hyper-Threading technology, which can be somewhere between 5% and 20% and heavily depends on the type of workload. Let's be absolutely clear here. Until the issue is addressed inside the CPU hardware architecture, there will always be a tradeoff between security and performance. If I understand Intel's messaging correctly, the first hardware fix for their Hyper-Threading is implemented in the Cascade Lake family. You can double-check it yourself here ...
Side Channel Mitigation by Product CPU Model

You can get the hyper-threading performance back, but only in VMware vSphere 6.7 U2. vSphere 6.7 U2 includes new scheduler options that protect against the L1TF vulnerability while retaining as much performance as possible. The new scheduler introduces the ESXi advanced setting VMkernel.Boot.hyperthreadingMitigationIntraVM, which you can set to FALSE (this is the default) to leverage Hyper-Threading benefits within a virtual machine while still isolating VMs from each other when VMkernel.Boot.hyperthreadingMitigation is set to TRUE. This option is not available in older ESXi hypervisors and there are no plans to backport it. For further information, read the paper "Performance of vSphere 6.7 Scheduling Options".
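
To make the combinations easier to see, here is a minimal Python sketch of how I read the two settings described above. The function name and the returned descriptions are made up for illustration, and the behaviour when hyperthreadingMitigation is false is my assumption, not something taken from VMware documentation.

```python
def scheduler_mode(hyperthreading_mitigation: bool,
                   hyperthreading_mitigation_intra_vm: bool) -> str:
    """Describe the scheduler behaviour implied by the two ESXi settings.

    Hypothetical helper: it only encodes the combinations discussed in
    this post and does not talk to any ESXi host.
    """
    if not hyperthreading_mitigation:
        # Assumption: with the main switch off, the intra-VM switch is irrelevant.
        return "no mitigation, hyper-threads are shared freely"
    if hyperthreading_mitigation_intra_vm:
        # Both settings true: hyper-threading is not used by the scheduler.
        return "hyper-threading disabled in the ESXi CPU scheduler"
    # Mitigation on, intra-VM off: the vSphere 6.7 U2 option.
    return "hyper-threads shared within a VM, isolated between VMs"


for mitigation in (False, True):
    for intra_vm in (False, True):
        print(mitigation, intra_vm, "->", scheduler_mode(mitigation, intra_vm))
```

The point of the sketch is only that the second setting matters when the first one is enabled.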

By the way, last year I spent significant time testing the performance impact of the SPECTRE and MELTDOWN remediations. If you want to check the results of the performance tests of the Spectre/Meltdown 2018 variants along with the conclusion, you can read my document published on SlideShare. It would be cool to perform the same tests for L1TF and MDS, but it would require additional time and effort. I'm not going to do so unless sponsored by some of my customers. But anybody can do it themselves, as the test plan is described in the document below.

Friday, May 03, 2019

Storage and Fabric latencies - difference in order of magnitude

It is well known that the storage industry is going through a big transformation. Flash-based SSDs are changing the old storage paradigm and supporting the fast computing required by modern applications behind digital transformation projects.

So Flash is great, but it is also about the bus and the protocol over which the Flash is connected.
We have the traditional storage protocols SCSI, SATA, and SAS, but these interface protocols were invented for magnetic disks, which is why Flash behind these legacy interfaces cannot leverage the full potential of Flash technology. That's why we now have NVMe (a new storage interface protocol over PCIe) or even 3D XPoint memory (Intel Optane).

It is all about latency and available bandwidth. Total throughput depends on the I/O size and the achievable number of transactions per second (IOPS). The IOPS figures below are what a single worker can achieve on the particular storage media with a random-access, 100% read, 4 KB I/O size workload. Multiple workers can achieve higher performance but with higher latency.
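
The relation between IOPS, I/O size and throughput can be sketched in a few lines of Python. This is just an illustration: the function name is mine, I assume decimal units (1 MB = 1,000 KB), and the IOPS values are the single-worker figures used in the storage list in this post.

```python
def throughput_mb_per_s(iops: float, io_size_kb: float) -> float:
    """Throughput in MB/s for a given IOPS rate and I/O size in KB.

    Assumes decimal units: 1 MB = 1,000 KB.
    """
    return iops * io_size_kb / 1000.0


# Single-worker IOPS figures from this post, all with 4 KB I/O size.
for iops in (80, 200, 4000):
    print(f"{iops:>5} IOPS x 4 KB = {throughput_mb_per_s(iops, 4):.2f} MB/s")
```

With such small I/O sizes the throughput of a single worker is tiny, which is why latency, not bandwidth, is the limiting factor here.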

Latencies order of magnitude:
  • ms - milliseconds - 0.001 of a second = 10^-3 s
  • μs - microseconds - 0.000001 of a second = 10^-6 s
  • ns - nanoseconds - 0.000000001 of a second = 10^-9 s
Storage Latencies

SATA - magnetic disk 7.2k RPM ~= 80 I/O per second (IOPS) = 1,000ms / 80 = 12.5 ms
SAS - magnetic disk 15k RPM ~= 200 I/O per second (IOPS) = 1,000ms / 200 = 5 ms
SAS - Solid State Disk (SSD) Mixed use SFF ~= 4,000 I/O per second (IOPS) = 1,000ms / 4,000 = 0.25 ms = 250 μs.
NVMe over RoCE - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.100 ms =  100 μs
NVMe - Solid State Disk (SSD) ~= TBT I/O per second (IOPS) = 1,000ms / ??? = 0.080 ms =  80 μs
DIMM - 3D XPoint memory (Intel Optane) ~=   the latency less than 500 ns (0.5 μs)
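
The single-worker latencies above are simply 1,000 ms divided by the IOPS. A small Python sketch of that arithmetic follows. The helper names are hypothetical, and the calculation only holds for one worker issuing one I/O at a time.

```python
def latency_ms(iops: float) -> float:
    """Average latency in ms of a single worker issuing one I/O at a time."""
    return 1000.0 / iops


def format_latency(ms: float) -> str:
    """Render a latency given in ms with the most readable unit (ms, μs or ns)."""
    if ms >= 1:
        return f"{ms:g} ms"
    us = ms * 1000
    if us >= 1:
        return f"{us:g} μs"
    return f"{us * 1000:g} ns"


# IOPS figures taken from the list above.
for media, iops in (("SATA 7.2k RPM", 80), ("SAS 15k RPM", 200), ("SAS SSD", 4000)):
    print(f"{media}: {format_latency(latency_ms(iops))}")
```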

Ethernet Fabric Latencies

Gigabit Ethernet - 125 MB/s ~= 25 ~ 65 μs
10G Ethernet - 1.25 GB/s ~=  μs (sockets application) / 1.3 μs (RDMA application)
40G Ethernet - 5 GB/s ~= μs (sockets application) / 1.3 μs (RDMA application)
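
The bandwidth figures for Ethernet above are just the line rate divided by 8 bits per byte. Here is a minimal Python sketch, assuming the raw line rate in decimal units and ignoring encoding and protocol overhead. The function name is made up for illustration.

```python
def line_rate_bytes_per_s(gbit_per_s: float) -> float:
    """Convert a raw line rate in Gb/s to bytes per second.

    Assumption: decimal units and no encoding or protocol overhead.
    """
    return gbit_per_s * 1_000_000_000 / 8


for name, gbit in (("Gigabit Ethernet", 1), ("10G Ethernet", 10), ("40G Ethernet", 40)):
    mb_per_s = line_rate_bytes_per_s(gbit) / 1_000_000
    print(f"{name}: {mb_per_s:g} MB/s")
```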

InfiniBand and Omni-Path Fabrics Latencies

10Gb/s SDR - 1 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
20Gb/s DDR - 2 GB/s  ~=  2.6 μs (Mellanox InfiniHost III)
40Gb/s QDR - 4 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
40Gb/s FDR-10 - 5.16 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
56Gb/s FDR - 6.82 GB/s  ~=  1.07 μs (Mellanox ConnectX-3)
100Gb/s EDR - 12.08 GB/s  ~=  1.01 μs (Mellanox ConnectX-4)
100Gb/s Omni-Path - 12.36 GB/s  ~=  1.04 μs (Intel 100G Omni-Path)

RAM Latency

DIMM - DDR4 SDRAM ~=  75 ns (local NUMA access) - 120 ns (remote NUMA access)


It is good to realize what latencies we should expect on different infrastructure subsystems:
  • RAM ~= 100 ns
  • 3D XPoint memory ~= 500 ns
  • Modern Fabrics ~= 1-4 μs
  • NVMe ~= 80 μs
  • NVMe over RoCE ~= 100 μs
  • SAS SSD ~= 250 μs
  • Magnetic disks (SAS, SATA) ~= 5-12 ms
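
To see the differences as orders of magnitude, the numbers from the list above can be compared against RAM in a few lines of Python. This is only a sketch: the dictionary and function names are mine, and for ranges I took the lower bound.

```python
import math

# Latencies in nanoseconds, taken from the summary list above
# (lower bound used where the list gives a range).
LATENCY_NS = {
    "RAM": 100,
    "3D XPoint memory": 500,
    "Modern fabrics": 1_000,
    "NVMe": 80_000,
    "NVMe over RoCE": 100_000,
    "SAS SSD": 250_000,
    "SAS magnetic disk": 5_000_000,
}


def orders_of_magnitude(latency_ns: float, baseline_ns: float = 100) -> float:
    """Base-10 logarithm of the latency ratio against a baseline (RAM by default)."""
    return math.log10(latency_ns / baseline_ns)


for name, ns in LATENCY_NS.items():
    ratio = ns / LATENCY_NS["RAM"]
    print(f"{name}: {ratio:g}x RAM, {orders_of_magnitude(ns):.1f} orders of magnitude")
```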
The latency order of magnitude is important for several reasons. Let's focus on one of them - latency monitoring. It was always a challenge to monitor traditional storage systems, as a 5-minute or even 1-minute interval is simply too long for millisecond (ms) latencies, and the average does not tell you anything about microbursts. In lower-latency (μs or even ns) systems, a 5-minute interval is like an eternity. The average, minimum and maximum over a 5-minute interval might not help you understand what is really happening there. Deeper statistics are needed to get real and valuable visibility into telemetry data. Percentiles are good but histograms can help even more ...
The Wavefront links above are mainly about application monitoring, but do we have such telemetry granularity in hardware? Mellanox Spectrum claims Real-time Network Visibility, but it seems to me to be an exception. Intel had an open source project, "The Snap Telemetry Framework", but it seems to have been discontinued. And what about other components? To be honest, I do not know, and it seems to me that real-time visibility is not a big priority for the infrastructure industry. However, operating systems, hypervisors and software-defined storage could help here.

VMware vSphere Performance Manager, available via the vCenter SOAP API, can provide "real-time" monitoring. I'm putting "real-time" in quotes because it provides 20-second samples (min, max, average) for metrics on leaf objects. Is it good enough? Well, not really. It is better than a 5-minute or 1-minute sample but still very long for sub-millisecond latencies, and minimum, maximum and average do not carry enough information for some decisions. Histograms could help here. ESXi has a good old tool, vscsiStats, which supports histograms of I/O latency in microseconds (us) per virtual machine. Unfortunately, there is no officially supported vCenter API for this tool, so it is usually used for short-term manual performance troubleshooting and not for continuous latency monitoring. William Lam has published a blog post and scripts on how to leverage the ESXi API to get vscsiStats histograms. It would be great to be able to get histograms for some objects through vCenter in a supported way and expose such information to external monitoring tools. #FEATURE-REQUEST
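
To illustrate why min, max and average are not enough, here is a small Python sketch with made-up latency samples: mostly 80 μs I/Os with a few 5 ms outliers. The sample data, the nearest-rank percentile and the power-of-two bucket boundaries are my assumptions for illustration only. They are not the bucket layout used by vscsiStats.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of data at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def log2_histogram(samples_us):
    """Count samples per bucket, keyed by a power-of-two upper bound in μs."""
    buckets = {}
    for sample in samples_us:
        upper = 1
        while upper < sample:
            upper *= 2
        buckets[upper] = buckets.get(upper, 0) + 1
    return dict(sorted(buckets.items()))


# Hypothetical interval: 980 I/Os at 80 μs and 20 I/Os at 5,000 μs.
samples_us = [80] * 980 + [5000] * 20

print("min/avg/max:", min(samples_us), statistics.mean(samples_us), max(samples_us))
print("p50:", percentile(samples_us, 50), "p99:", percentile(samples_us, 99))
print("histogram:", log2_histogram(samples_us))
```

The average lands somewhere between the two groups and describes neither of them, while the histogram shows both populations and how many I/Os fell into each.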

Hope this is informative and educational.

Other sources:
Performance Characteristics of Common Network Fabrics:
Real-time Network Visibility:
Johan van Amersfoort and Frank Denneman present a NUMA deep dive:
William Lam: Retrieving vscsiStats Using the vSphere 5.1 API