Wednesday, March 16, 2016

General recommendations for stretched vSphere HA Cluster aka vSphere Metro Storage Cluster (vMSC)

This is just a brief blog post with general recommendations for VMware vSphere Metro Storage Cluster (aka vMSC). For a more holistic view, please read the white paper "VMware vSphere Metro Storage Cluster Recommended Practices".

vSphere HA Cluster Recommended Configuration Settings:
  • Set Admission Control - Define failover capacity as a percentage of cluster resources (50% for CPU and Memory)
  • Set Host Isolation Response - Power Off and Restart VMs
  • Specify multiple host isolation addresses - Advanced configuration option das.isolationaddressX
  • Disable default gateway as host isolation address - Advanced configuration option das.useDefaultIsolationAddress=false
  • Change the default settings of vSphere HA and configure it to Respect VM to Host affinity rules during failover - Advanced configuration option das.respectVmHostSoftAffinityRules=true
  • The minimum number of heartbeat datastores is two and the maximum is five. VMware recommends increasing the number of heartbeat datastores from two to four in a stretched cluster environment - Advanced configuration option das.heartbeatDsPerHost=4
  • VMware recommends using "Select any of the cluster datastores taking into account my preferences" for heartbeat datastores and choosing two datastores (active distributed volumes/LUNs) on each site
  • PDL and APD considerations depend on the stretched cluster mode (uniform/non-uniform). However, VMware recommends configuring PDL/APD responses, so VM Component Protection (VMCP) must be enabled and the response should be set to "Power Off and Restart VMs - Conservative". The detailed configuration should be discussed with the particular storage vendor.
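
The advanced options listed above can be collected in one place before they are applied to the cluster. The following Python sketch only builds the key/value pairs; it does not talk to vCenter. The function name and the example addresses are hypothetical, and the 0 to 9 range for das.isolationaddressX is my assumption about the allowed suffixes, so please verify it for your vSphere version.

```python
def build_ha_advanced_options(isolation_addresses, heartbeat_ds_per_host=4):
    """Return the vSphere HA advanced options from the list above as a dict.

    Hypothetical helper for illustration only; it does not connect to vCenter.
    """
    if not isolation_addresses:
        raise ValueError("at least one isolation address is required")
    # Assumption: das.isolationaddressX accepts suffixes 0..9 (ten addresses).
    if len(isolation_addresses) > 10:
        raise ValueError("too many isolation addresses (maximum assumed: 10)")
    # The post states a minimum of two and a maximum of five heartbeat datastores.
    if not 2 <= heartbeat_ds_per_host <= 5:
        raise ValueError("heartbeat datastores per host must be between 2 and 5")

    options = {
        "das.useDefaultIsolationAddress": "false",
        "das.respectVmHostSoftAffinityRules": "true",
        "das.heartbeatDsPerHost": str(heartbeat_ds_per_host),
    }
    # One das.isolationaddressX entry per address, numbered from 0.
    for index, address in enumerate(isolation_addresses):
        options[f"das.isolationaddress{index}"] = address
    return options


# Example: one isolation address per site (documentation-only IP addresses).
ha_options = build_ha_advanced_options(["192.0.2.1", "198.51.100.1"])
for key in sorted(ha_options):
    print(f"{key}={ha_options[key]}")
```

The resulting pairs can then be entered in the cluster's HA advanced options or pushed by whatever automation tool you use.
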
vSphere DRS Recommended Configuration Settings:
  • DRS mode - Fully automated
  • Use DRS VM/Host rules to set VM per site locality
  • Use DRS "Should Rules" and avoid the use of "Must Rules"

  • Based on KB 2042596 SIOC is not supported
  • Based on KB 2042596 SDRS is only supported when the IO Metric function is disabled.

Distributed (stretched) Storage Recommendations:
  • Always consult your configuration with your storage vendor
  • VMware highly recommends using a storage witness (aka arbitrator, tie-breaker, etc.) in a third site.
Recommendations for custom automation of compliance checks and/or operational procedures:
  • VMware recommends manually defining “sites” by creating a group of hosts that belong to a site and then adding VMs to these sites based on the affinity of the datastore on which they are provisioned. 
  • VMware recommends automating the process of defining site affinity by using tools such as VMware vCenter Orchestrator™ or VMware vSphere PowerCLI™.
  • If automating the process is not an option, use of a generic naming convention is recommended to simplify the creation of these groups. 
  • VMware recommends that these groups be validated on a regular basis to ensure that all VMs belong to the group with the correct site affinity.
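
The regular validation recommended above boils down to comparing, for each VM, the site of its datastore with the site of the group the VM belongs to. Here is a minimal Python sketch of that check on plain dictionaries. All names (VMs, datastores, sites, the function itself) are hypothetical, and in practice the three inventories would be exported from vCenter by your automation tool.

```python
def find_site_affinity_violations(vm_datastore, datastore_site, vm_group_site):
    """Return {vm: (expected_site, actual_site)} for VMs in the wrong site group.

    vm_datastore:   VM name -> datastore the VM is provisioned on
    datastore_site: datastore name -> site the datastore has affinity to
    vm_group_site:  VM name -> site of the VM group the VM is a member of
    """
    violations = {}
    for vm, datastore in vm_datastore.items():
        expected_site = datastore_site.get(datastore)
        if expected_site is None:
            # Datastore has no site affinity defined, so nothing to validate.
            continue
        actual_site = vm_group_site.get(vm)  # None means the VM is in no group
        if actual_site != expected_site:
            violations[vm] = (expected_site, actual_site)
    return violations


# Example inventory with made-up names.
vm_datastore = {"vm-app01": "ds-siteA-01", "vm-db01": "ds-siteB-01", "vm-web01": "ds-siteA-01"}
datastore_site = {"ds-siteA-01": "SiteA", "ds-siteB-01": "SiteB"}
vm_group_site = {"vm-app01": "SiteA", "vm-db01": "SiteA"}

for vm, (expected, actual) in find_site_affinity_violations(
        vm_datastore, datastore_site, vm_group_site).items():
    print(f"{vm}: expected group {expected}, found {actual}")
```

A VM that is in no group at all is reported as well, because it would otherwise be free to run in either site.
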


TheFluffyAdmin said...

The challenge we face with Metro Cluster is how the VMs access the outside world.
You may have a metro cluster design where, in both datacenters, the VMs have their own way out of the datacenter. 2 gateways basically.

However, in many designs, there is only 1 way out of the datacenter, often an active-passive type way out. Depending on the type of network split you encounter, you may run into scenarios where internet connectivity to the outside world only survives in 1 of the 2 datacenters. This means that all VMs in 1 of the 2 will lose their networking to the outside (as well as to the other datacenter).

But in most cases, HA cannot help you here. As long as you have isolation addresses configured that represent resources in both datacenters, HA will always find at least 1 isolation address pingable. It will, therefore, never come to the conclusion that there is host isolation. It will merely conclude that the HA cluster is split. In those cases, HA doesn't restart anything.

Part of the problem is that isolation addresses are set on the cluster level, not on a per-host level. So you are never able to set one set of isolation addresses for the hosts in datacenter1, and a -different- set of isolation addresses for the hosts in datacenter2.

If you were able to do that, you could make HA far more useful, because it could be aware of more dynamic kinds of cluster split scenarios.

For example, in the above case, if my internet relies on a way out of the 2 datacenters that can only ever exist in 1 datacenter, then I would want the hosts in the datacenter without internet to come to the conclusion that they were all individually isolated. That would trigger a useful HA response in such a scenario: all VMs would be restarted on the side that still had internet.

A workaround for this would be to set isolation addresses on the -other side- of your internet routing. Something in the perimeter maybe. But this introduces other risks.

David Pasek said...

First of all, thanks for the comment.

BTW I had a similar discussion with a customer of mine who is currently planning a stretched cluster. He asked me how to use the vSphere HA isolation response (isolation addresses) to identify a network failure in one site and restart VMs in the other site.

My answer was that vSphere HA Host Isolation Response is not the right tool for this task. The main goal of host isolation detection is to identify the situation where a single host is totally isolated from the management network. It means the host cannot see heartbeats from other hosts in the cluster and, on top of that, the isolated ESXi host cannot ping the isolation addresses (the default gateway by default, but this can be changed for the whole cluster).

Now, the host isolation response will not come into play if at least two hosts from the cluster stay up in the DC and have network visibility between them, right?

So the "Host Isolation algorithm" is used to identify the local network isolation of ESXi hosts, not the global L3 network isolation of VMs.
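
The rule described above can be written down as a tiny Python sketch. It is a deliberately simplified model of the decision as explained in this comment thread, not the actual HA agent logic, and the function and parameter names are hypothetical.

```python
def is_host_isolated(sees_heartbeats_from_other_hosts, isolation_ping_results):
    """Simplified model: a host considers itself isolated only when it sees no
    heartbeats from other hosts AND cannot ping any isolation address.

    isolation_ping_results: isolation address -> True if the ping succeeded
    """
    can_ping_any_address = any(isolation_ping_results.values())
    return not sees_heartbeats_from_other_hosts and not can_ping_any_address


# Site B lost its way out to the internet, but its hosts still see each other
# and still reach the isolation address in their own site.
site_b_host = is_host_isolated(
    sees_heartbeats_from_other_hosts=True,
    isolation_ping_results={"192.0.2.1": False, "198.51.100.1": True},
)
print("Site B host isolated:", site_b_host)

# A single host whose management network is completely cut off.
lonely_host = is_host_isolated(
    sees_heartbeats_from_other_hosts=False,
    isolation_ping_results={"192.0.2.1": False, "198.51.100.1": False},
)
print("Lonely host isolated:", lonely_host)
```

In this model the hosts in the site without internet never become isolated, which is exactly why the isolation response does not help with the scenario discussed here.
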

As you can see, you have to use other mechanisms to fail over your L3 networking from site A to site B or vice versa. The answer is a dynamic routing protocol and keeping the VMs of the same L2 segment (VLAN or VXLAN) local to one site.

Your stretched storage volume is also in active/passive mode, so you should keep the VMs on a particular datastore in the same site. The same is true for the L2 segment and the outside routing from the site.

Just my $0.02 based on my current knowledge ;-)

BTW: did I mention that a stretched cluster is not DR for me? It is a metro cluster = single failure zone. A DR architecture should be designed as two independent failure zones.

TheFluffyAdmin said...

I agree, vSphere HA was never made for this kind of detection. You need other ways to deal with this kind of failure.

So in the case of internet routing only surviving in 1 of the 2 datacenters, what we would probably do is have some kind of a 'big red button' for the datacenter where internet is no longer available. Some kind of scripted mechanism to shut down all the VMs there.
And then bring them up on the other side. We can do this neatly (clean shutdown, nice, ordered restart), but it would be slow.
You could also find ways of leveraging the HA response anyway, by 'cheating', as in pulling the networking for all the hosts in that datacenter. Or forcing an ESX shutdown for all those hosts. Depending on how fast you had to act and how many VMs were involved, these are scenarios we have seriously considered.

But I have to say, if a metro-cluster split is not enough reason for a business to allow reasonable downtime to get VMs up again on the side with the surviving internet routing, then perhaps they should not have gone with a full data-mobility concept. Other design choices become more important, like:
- Multiple internet routes out, so that both datacenters still have one, regardless of an inter-cluster split. (hard to do with active/passive firewall and router solutions, though, and more costly)
- Run all production in 1 datacenter, and keep the other completely empty (just another view of the 50% rule, but a lot harder to justify politically)

David Pasek said...

Yep. That's where VM site locality preference together with NSX (ESG, DLR, DFW) comes in handy ;-)

But it is for another blog post.

Mino said...

Other requirement:

VMware KB 2059622: PDL AutoRemove feature in vSphere 5.5 and vSphere 6.x

Set Disk.AutoremoveOnPDL to 0 instead of the default of 1

David Pasek said...

Thanks for the comment.

Generally, I agree with this recommendation for vMSC ...

... BUT it really depends on the particular storage and whether the storage even uses PDL for stretched volume control (active/passive volume fail-over) and leverages the VM PDL response type to indirectly fail over VMs from one site to the other. This VM fail-over control via the PDL response type is IMHO a very important consideration for a non-uniform vMSC topology and not so important for a uniform vMSC topology.

So it depends, but I agree with the general recommendation of Disk.AutoremoveOnPDL=0 (false) for any vMSC cluster to avoid any unwanted LUN (volume) removal.