Tuesday, September 22, 2020

vSAN - vLCM Capable ReadyNode

VMware vSphere Lifecycle Manager (aka vLCM) is one of the most interesting features in vSphere 7. vLCM is a powerful new approach to simplified, consistent lifecycle management of the hypervisor and the full stack of drivers and firmware for the servers powering your data center.

There are only a few server vendors who have implemented firmware management with vLCM.

At the time of writing this article, these vendors are:

  • Dell and HPE for vSphere 7.0
  • Dell, HPE, and Lenovo for vSphere 7.0 Update 1

Recently, I got the following question from one of my customers:

"Where I can find official information about certified vLCM server vendors?"

It is a very good question. I would expect to find such information in the VMware Compatibility Guide (VCG); however, it is not listed in the "Systems / Servers" VCG. Instead, you can find it in the "vSAN" VCG.



The vSAN VCG contains "vSAN ReadyNodes Additional Features", one of which is "vLCM Capable ReadyNode". There you can find the server vendors that have successfully implemented firmware management integration with vLCM, but the information is available only for vSAN ReadyNodes. I can imagine that, in the future, vLCM capability may become available for standard servers as well, not only for vSAN ReadyNodes.

Tuesday, September 15, 2020

vSAN 7 Update 1 - What's new

vSAN 7 Update 1 has been announced, so let's look at what it brings to the table.

NEW FEATURES

In the figure below, you will see what new features are available in this release.

New features in vSAN 7 Update 1


VMware HCI Mesh

HCI Mesh makes it possible to mount a remote vSAN datastore from an external (aka Server) vSAN cluster to multiple Client vSAN clusters. An example topology is depicted in the figure below.

vSAN HCI Mesh

HCI Mesh allows multiple client vSAN clusters to mount and share a remote datastore from a vSAN Server cluster. A single datastore can be mounted by up to 64 hosts, including the server cluster's hosts. With such a topology, you can do compute-only vMotion across multiple vSphere/vSAN clusters.

With HCI Mesh, you can also build a Full Mesh, where a vSAN cluster acts as both a client and a server. Such a topology is depicted below, where all three clusters are both clients and servers.

Full Mesh

Such a Full Mesh topology is ideal for homogeneous clusters and equalizes storage consumption across clusters. Its scalability is limited to 5 remote datastores per client and 5 client clusters per server. In other words, a client cluster can mount a maximum of 5 remote datastores, and a server cluster can export its datastore to a maximum of 5 client clusters.

A few more notes about HCI Mesh:

  • An HCI Mesh client must itself be a vSAN cluster, so the minimum node count is a 2-node vSAN cluster
  • A compute-only vSAN cluster technically works, but it is neither recommended nor supported at the moment
  • Meshing Hybrid and All-Flash datastores is supported
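
To make the limits above easier to reason about, here is a minimal Python sketch (my own illustration, not a VMware tool) that checks a planned HCI Mesh topology against them; the cluster names, host counts, and mounts are made-up examples:

from collections import defaultdict

# Planned topology: which client cluster mounts which remote (server) datastore,
# plus the host count of every cluster. Names and numbers are made-up examples.
mounts = {
    "cluster-a": ["ds-b", "ds-c"],
    "cluster-b": ["ds-a"],
    "cluster-c": ["ds-a", "ds-b"],
}
hosts_per_cluster = {"cluster-a": 8, "cluster-b": 6, "cluster-c": 4}
datastore_owner = {"ds-a": "cluster-a", "ds-b": "cluster-b", "ds-c": "cluster-c"}

MAX_REMOTE_DS_PER_CLIENT = 5  # a client cluster can mount at most 5 remote datastores
MAX_CLIENTS_PER_SERVER = 5    # a server datastore can be exported to at most 5 client clusters
MAX_HOSTS_PER_DATASTORE = 64  # all hosts mounting a datastore, including the server cluster's hosts

clients_per_ds = defaultdict(list)
for client, datastores in mounts.items():
    if len(datastores) > MAX_REMOTE_DS_PER_CLIENT:
        print(f"{client}: mounts {len(datastores)} remote datastores (limit {MAX_REMOTE_DS_PER_CLIENT})")
    for ds in datastores:
        clients_per_ds[ds].append(client)

for ds, clients in clients_per_ds.items():
    if len(clients) > MAX_CLIENTS_PER_SERVER:
        print(f"{ds}: exported to {len(clients)} client clusters (limit {MAX_CLIENTS_PER_SERVER})")
    total_hosts = hosts_per_cluster[datastore_owner[ds]] + sum(hosts_per_cluster[c] for c in clients)
    if total_hosts > MAX_HOSTS_PER_DATASTORE:
        print(f"{ds}: mounted by {total_hosts} hosts in total (limit {MAX_HOSTS_PER_DATASTORE})")

print("Check finished.")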

vSAN Native File Services

VMware is extending vSAN File Services with the SMB protocol. SMB is integrated with Microsoft Active Directory and supports Kerberos authentication. This means that vSAN now supports NFS (versions 3 and 4.1) and SMB.

SECURITY

vSAN Data-in-Transit Encryption

vSAN 7 Update 1 increases overall security with native inter-node encryption of vSAN data traffic over TCP, which ensures data privacy, authentication, and integrity, leveraging the existing FIPS 140-2 validated crypto module. Interestingly, an external Key Management Server (KMS) is not required for this feature. However, please be aware that HCI Mesh and Data-in-Transit Encryption together are not supported in this release.

vSAN Data-in-Transit Encryption


SSD Secure Erase (Secure wipe method)

vSAN 7 Update 1 adds the option to securely erase SSDs. In this release it is supported on Dell and HPE devices, so HPE and Dell vSAN ReadyNodes and DellEMC VxRail should be able to use this feature. Other hardware vendors will likely follow in the future.

PERFORMANCE

Overall performance optimization

Based on VMware internal performance tests, vSAN 7 Update 1 should be approximately 30% faster than vSAN 6.7 U3, which was the fastest vSAN release so far. I know this is a somewhat vague statement without further details, but I personally believe vSAN performance, especially in the All-Flash configuration, was already good enough for the majority of traditional workloads. Of course, additional performance improvements are always nice to have, but I think there are other factors that are more important, at least for the customers I work with.

Compression-Only

vSAN prior to 7 Update 1 supported compression only together with deduplication. vSAN 7 Update 1 decouples the compression feature from the deduplication feature to allow space efficiency without the higher performance overhead caused by the deduplication algorithm.

When both features (deduplication and compression) are turned on, they work in the following way:

Deduplication

  • Per disk group
  • Occurs when destaging to the capacity tier
  • 4KB fixed blocks

Compression

  • Occurs after deduplication, prior to data being destaged
  • A block is stored compressed only if it compresses down to 2KB or less (see the sketch below)
  • Otherwise, the full 4KB block is stored
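
To illustrate the destaging rule, here is a minimal Python sketch of the decision (my own illustration of the logic, not vSAN code; zlib merely stands in for vSAN's real compression algorithm):

import os
import zlib

BLOCK_SIZE = 4096        # vSAN works with 4KB blocks
COMPRESSED_LIMIT = 2048  # store the compressed form only if it fits into 2KB

def destage_size(block: bytes) -> int:
    """Return how many bytes would land on the capacity tier for one 4KB block."""
    compressed = zlib.compress(block)
    return len(compressed) if len(compressed) <= COMPRESSED_LIMIT else BLOCK_SIZE

print(destage_size(b"A" * BLOCK_SIZE))       # highly compressible -> stored compressed (a few bytes)
print(destage_size(os.urandom(BLOCK_SIZE)))  # incompressible -> full 4096-byte block stored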

When deduplication is turned on, the failure domain is the whole disk group.
In compression-only mode, the failure domain is reduced to a single disk, which is another benefit: less data has to be resynced after a failure, which improves availability and recoverability SLAs.

Compression-only delivers significantly higher performance than deduplication, which is why OLTP workloads benefit from this feature the most.


AVAILABILITY AND RECOVERABILITY

Enhanced Durability During Maintenance Operations

First of all, it is important to understand the difference between durability and availability. Here we are talking about durability during vSphere/vSAN cluster node maintenance mode. The feature is nicely depicted in the figure below.

Enhanced Durability During Maintenance Mode

When an ESXi host enters maintenance mode, you can choose one of the following options:
  • Full data migration <-- a time-consuming operation, but vSAN objects stay protected per the storage policy intent
  • Ensure accessibility <-- ensures that no vSAN object becomes inaccessible, but objects can become unprotected
  • No data migration <-- this is very dangerous and can cause data loss
Now, with vSAN 7 Update 1, when you choose Ensure accessibility and acknowledge that some vSAN objects can become unavailable in case of another failure, vSAN additionally writes deltas to another host or fault domain for every object that is left with only one active replica. This provides additional durability protection (it prevents data loss): if the remaining active replica (Replica 1) fails, the object becomes temporarily unavailable, but the data can be recovered very quickly from Replica 2 plus the delta writes once the ESXi host returns from planned maintenance mode.
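
To make the recovery flow easier to picture, below is a toy Python model of the delta-write idea (my own illustration, not vSAN internals); block names and versions are made up:

# Replica 2 sits on the host entering maintenance mode and therefore goes stale.
replica_2_on_mm_host = {"blk0": "v1", "blk1": "v1"}
replica_1_active = {"blk0": "v1", "blk1": "v1"}
delta_component = {}  # extra copy of writes made during maintenance, placed on another host / fault domain

def write(block: str, value: str) -> None:
    replica_1_active[block] = value
    delta_component[block] = value  # every write is also tracked in the delta component

write("blk1", "v2")

# If Replica 1 fails before maintenance ends, the object is temporarily unavailable,
# but once the host returns, the stale replica plus the delta writes rebuild the latest data quickly.
recovered = {**replica_2_on_mm_host, **delta_component}
assert recovered == {"blk0": "v1", "blk1": "v2"}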

Implementation details:
  • Only available when another host or fault domain is available
  • The delta component can be placed on the same host as the witness component
  • Applies to both RAID-1 mirroring and erasure coding

Faster Host Reboots and Cluster Upgrades

  • Significant improvement in cluster upgrades due to faster host reboots
  • Host metadata is written to disk before a reboot and read back into memory after the reboot, which is faster than rebuilding the metadata
  • On average, a 5x improvement in host reboot times

MANAGEABILITY

Shared Witness for 2 Node vSAN Deployments

This feature enables 2-node vSAN deployments to share a common witness instance. vSAN 7 Update 1 supports up to 64 2-node (ROBO) clusters per shared witness. With vSAN witness consolidation, customers can reduce deployment cost and operational complexity.

Slack Space optimized, operationalized, and renamed to Reserved Capacity

In the past, VMware recommended keeping 25% - 30% of capacity free as slack space. The new Reserved Capacity is optimized to require less space; it depends on deployment variables and decreases with the number of hosts in the vSAN cluster. Example deployments (see the sketch below the list):
  • 12-node cluster = ~16%
  • 24-node cluster = ~12%
  • 48-node cluster = ~10%
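
As a rough, back-of-the-envelope illustration of what those example figures mean for sizing (my own sketch, not an official VMware formula, and it ignores storage policy/FTT overhead):

# Approximate Reserved Capacity percentages from the examples above.
reserved_pct_by_hosts = {12: 0.16, 24: 0.12, 48: 0.10}

def usable_capacity_tib(raw_tib: float, hosts: int) -> float:
    """Raw vSAN capacity minus the approximate Reserved Capacity for the given cluster size."""
    return raw_tib * (1 - reserved_pct_by_hosts[hosts])  # only the documented example sizes are covered

print(usable_capacity_tib(400, 12))  # ~336 TiB left for workloads on a 12-node cluster
print(usable_capacity_tib(400, 48))  # ~360 TiB left on a 48-node cluster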
Reserved capacity is required for:
  • Resync operations such as policy changes, rebalancing, and data movement
  • Rebuild activities due to failures
Slack space is Reserved Capacity

On top of the disk space optimization for reserved capacity, vSAN can now prevent the consumption of the reserved capacity with optional capacity reserves, including:
  • Operations reserve
  • Host rebuild reserve
Capacity reserves are soft thresholds that block provisioning activities; existing VM I/O is not affected. And once again, this enforcement is completely optional: it is an opt-in setting and is not enabled by default. You can see the UI in the figure below.


Please note that the vSAN Reserved Capacity feature is not supported on stretched clusters and 2-node vSAN.

vLCM Enhancements

Dell and HPE have supported vLCM since the vSAN 7 release. Now, in vSAN 7 Update 1, Lenovo ReadyNode models are supported as well.

From a technical point of view, vLCM has been enhanced in the following areas:
  • Awareness of vSAN fault domains, 2-node, and stretched clusters
  • Hardware compatibility pre-checks
  • Parallel remediation of up to 64 clusters
  • Support for environments running NSX-T 3.1
The technical features above are depicted in the following figure.

vSAN & vLCM

Simplified Routing for vSAN Network Topologies

Prior to this release, vSAN topologies with an external witness had to use static routes on each ESXi host, which was a significant management pain. In vSAN 7 Update 1, an alternate default gateway can be specified for the VMkernel interface, so static routes are no longer needed. This is a very useful feature from an operational point of view, if you ask me.

vSAN alternate default gateway
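
If you prefer to script the setting instead of clicking through the vSphere Client, the override maps to the per-VMkernel-adapter default gateway in the host network configuration. Below is a rough, untested pyVmomi sketch of the idea; the vCenter and host names, the password, the vmk1 interface, and the gateway address are placeholder assumptions, and the exact API classes should be verified against your vSphere version.

import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Lab-only SSL handling; use a properly trusted certificate in production.
ctx = ssl._create_unverified_context()
si = SmartConnect(host="vcenter.example.com", user="administrator@vsphere.local",
                  pwd="***", sslContext=ctx)
content = si.RetrieveContent()

# Locate the ESXi host and its network configuration manager.
host = content.searchIndex.FindByDnsName(dnsName="esx01.example.com", vmSearch=False)
net_sys = host.configManager.networkSystem

# Set a dedicated default gateway on the vSAN/witness VMkernel adapter (vmk1 is an assumption).
for vnic in net_sys.networkInfo.vnic:
    if vnic.device == "vmk1":
        spec = vnic.spec
        spec.ipRouteSpec = vim.host.VirtualNic.IpRouteSpec(
            ipRouteConfig=vim.host.IpRouteConfig(defaultGateway="192.168.10.1"))
        net_sys.UpdateVirtualNic("vmk1", spec)

Disconnect(si)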


vSAN I/O Insight

This is another very useful feature for analyzing storage workload I/O patterns. It provides:
  • A quick and easy tool in the vSphere Client to capture workload I/O characteristics on vSAN
  • Rich I/O pattern metrics and histograms to analyze the read/write ratio, sequential/random ratio, 4K aligned/unaligned ratio, and I/O size distribution
  • Finer-grained I/O performance metrics
The tool gives the infrastructure team solid data points to triage issues with application owners without the need for complex external tools. See the screenshot below.

vSAN IO Insight


CONCLUSION

The vSAN 7.0 U1 release is definitely a significant step forward for VMware software-defined storage, which is an important component of the full VMware SDDC stack. And it is a very nice proof of a key SDDC benefit ... a quick response to customer and industry requirements. This is the reason why I really like software-defined and hyper-converged infrastructure, leveraging commoditization, integration, and continuous product improvement. The vSphere administrator is the king of the house ... at least from the infrastructure point of view :-)

Friday, September 11, 2020

Datacenter Network Topology - Dell OS10 MultiDomain VLT

Yesterday, I got the following e-mail from one of my blog readers ...

Hello David,

Let me introduce myself: I work in a medium-size company and we began to sell Dell Networking stuff to go along with VxRail. We do small deployments, not the big stuff with spine/leaf L3 BGP, you name it. For a customer, I had to implement this solution. Sadly, we are having a bad time with STP, as you can see in the design.

 

Customer design with STP challenge

Is there a way to be loop-free? I thought about a Multi-Domain VLT LAG, but it looks like it is not supported in OS10.

I wonder how you would do this. Is SmartFabric the answer?
Thank you

Well, first of all, thanks for the question. If you ask me, it all boils down to specific design factors - use cases, requirements, constraints, assumptions.

So let's write down the design factors.

Requirements:

  • Multi-site deployment
  • A small deployment with a single VLT domain per site.
  • Robust L2 networking for VxRail clusters

Constraints:

  • Dell Networking hardware with OS10
  • Networking for VMware vSphere/vSAN (VxRail)

Assumptions:

  • No more than a single VLT domain per site is required
  • No vSphere/vSAN (VxRail) clusters are stretched across sites

Any unfulfilled assumption is a potential risk. If an assumption turns out to be invalid, the design should be reviewed and potentially reworked to fulfill the design factors.

Now, let's think about the network topology options we have.

The reader asked if DellEMC SmartFabric can help him. Well, SmartFabric can be an option, as it is a leaf-spine fabric fully managed by the external SmartFabric Orchestrator, something like Cisco ACI / APIC. SmartFabric uses EVPN, BGP, VXLAN, etc. for multi-rack deployments. I do not know the latest details, but AFAIK it was not multi-site ready a few months ago; the latest SmartFabric features should be validated with DellEMC. Anyway, SmartFabric can do L2 over L3 if you need to stretch L2 across racks, and eventually it should be possible to stretch L2 even across sites.

However, because our design targets a small deployment, I think leaf-spine is overkill here, and I always prefer the KISS (Keep It Simple, Stupid) approach.

So, here are two final options of network topology I would consider and compare.

OPTION 1: Stretched L2 Loop-Free across sites 
OPTION 2: L3 across sites with L2/L3 boundary in TOR access switches 


Option 1 - Stretched L2 Loop-Free across sites


Option 2 - L3 across sites with L2/L3 boundary in TOR access switches 

So let's compare these two options. 

Option 1 - Stretched L2 Loop-Free across sites 

Benefits

  • Simplicity
  • Stretched L2 across sites allows workload (device, VM, container, etc.) migrations across sites without an L2-over-L3 network overlay (NSX, SmartFabric, etc.) or re-IP.

Drawbacks

  • The topology does not scale to more TOR access switches (VLT domains), but that is acceptable given the design factors above
  • The topology optimally requires 8 links across sites; optionally, this can be reduced to 4 links
  • Only two routers, one per site
  • Stretching the L2 topology across sites also stretches the L2 fault domain across sites; therefore, broadcast storms, unknown unicast flooding, and potential STP issues are risks
  • This topology has an L3 traffic trombone by design (see https://blog.ipspace.net/2011/02/traffic-trombone-what-it-is-and-how-you.html). This drawback can be accepted or mitigated with NSX distributed routing.

OPTION 2 - L3 across sites with L2/L3 boundary in TOR access switches 

Benefits

  • Better scalability, because additional VLT domains (TOR access switches) can be connected to the core routers. However, this benefit is not required by the design factors above.
  • The topology optimally requires 4 links across sites; optionally, this can be reduced to 2 links. This is fewer than Option 1 requires.
  • Each site is a local fault domain from an L2 networking point of view, because the L2 fault domain is not stretched across sites. L2 faults (STP issues, broadcast storms, unknown unicast flooding, etc.) are isolated within a site.

Drawbacks

  • More complex routing configuration with ECMP and a dynamic routing protocol such as iBGP or OSPF
  • Four routers, two per site
  • An L3 topology across sites prevents workload (device, VM, container, etc.) migrations across sites without an L2-over-L3 network overlay (NSX, SmartFabric, etc.) or re-IP of the migrated workload

Conclusion and Design Decision

Both considered design options are L2 loop-free topologies, and I believe they fit all the design factors defined above. If you do not agree, please write a comment, because anybody can make an error in a design or fail to foresee all situations until the architecture is implemented and validated.

If I had to make the final design decision, it would depend on two other factors:
  • Do I have VMware NSX in my toolbox or not?
  • What is the skill level of the network operators (dynamic routing, ECMP, VRRP) responsible for operations?
If I did not have NSX and the network operators preferred routing high availability (VRRP) over dynamic routing with ECMP (high availability + scalability + performance), I would implement Option 1.

With NSX in place and a willingness to use dynamic routing with ECMP, I would implement Option 2.

The reader mentioned in his question that his company does not do spine/leaf L3 BGP deployments; therefore, Option 1 is probably a better fit for him.

Disclaimer: I have not had a chance to test and validate any of the design options considered above, so if you have real-world experience, please speak up in the comments.

Tuesday, September 01, 2020

Why NUMA matters?

This is a very short blog post because more and more VMware customers and partners are asking me the same question ... 

"Why NUMA matters?"

If you want to know more, I would highly recommend reading Frank Denneman's detailed blog posts and books about NUMA; however, the numbers below are worth a thousand words.

Local memory access latency is ~ 75 ns.

Remote memory access latency is ~ 132 ns.

A roughly 40% difference in memory access latency is well worth incorporating NUMA considerations into your data center infrastructure design.
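
Here is a quick back-of-the-envelope check of where the ~40% figure comes from, using the two latency numbers quoted above (a simple Python illustration, nothing vSphere-specific):

local_ns = 75    # local memory access latency
remote_ns = 132  # remote memory access latency

penalty = (remote_ns - local_ns) / local_ns   # remote access is ~76% slower than local
saving = (remote_ns - local_ns) / remote_ns   # local access has ~43% lower latency than remote

print(f"Remote access penalty: {penalty:.0%}")
print(f"Local access advantage: {saving:.0%}")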

If you prefer a comprehensive presentation, Frank Denneman is scheduled to speak about NUMA at VMworld 2020 in the session "60 Minutes of NUMA [HCP2453]".