Thursday, March 17, 2022

vSAN Health Service - Network Health - vSAN: MTU check

I have a customer who has an issue with the vSAN Health Service check "Network Health - vSAN: MTU check", which raised alerts from time to time. Normally, the check is green, as depicted in the screenshot below.

The same can be checked from CLI via esxcli.

However, my customer was experiencing intermittent yellow and red alerts, and the only remedy was to retest the Skyline Health test suite. After retesting, the check sometimes switched back to green and sometimes did not.

During problem isolation, we identified that the problem occurs only on vSAN clusters with witness nodes (2-node clusters, stretched clusters). Another indication was that the problem appeared only between vSAN data nodes and the vSAN witness. The network communication between data nodes was always OK.

How does this particular vSAN health check work?

It is important to understand that “vSAN: MTU check (ping with large packet size)”

  • does not use the “don’t fragment” bit to test the end-to-end MTU configuration
  • does not use the manually reconfigured (decreased) MTU of the vSAN witness vmkernel interfaces, which is leveraged in my customer's environment. The check uses a static large packet size to see how the network handles it.
  • sends the large packet between ESXi hosts (vSAN nodes) and evaluates packet loss based on the following thresholds:
    • 0% <-> 32% packet loss => green
    • 33% <-> 66% packet loss => yellow
    • 67% <-> 100% packet loss => red
The vSAN health check is great for understanding whether there is a network problem (packet loss) between vSAN data nodes. The potential problem can be on the ESXi hosts or somewhere in the network path.
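
The thresholds listed above can be sketched in a few lines of Python. This is only an illustration of the mapping, not the actual code of the health check. The function name is hypothetical, and the handling of fractional values between the listed ranges (for example 32.5%) is my assumption.

```python
def classify_packet_loss(loss_percent: float) -> str:
    """Map a packet-loss percentage to the colour of the health check.

    Assumption: values between the listed integer ranges (e.g. 32.5)
    fall into the lower band. The real check may round differently.
    """
    if not 0 <= loss_percent <= 100:
        raise ValueError("loss_percent must be between 0 and 100")
    if loss_percent < 33:
        return "green"
    if loss_percent < 67:
        return "yellow"
    return "red"


if __name__ == "__main__":
    for loss in (0, 32, 33, 66, 67, 100):
        print(f"{loss:>3}% packet loss => {classify_packet_loss(loss)}")
```

Note how wide the green band is: a path can lose almost a third of the large packets and the check still reports green, so a yellow or red result already indicates substantial loss.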

So what's the problem?

Let's visualize the environment architecture which is depicted in the drawing below.

The customer has the vSAN witness in a remote location and experiences the problem only between the vSAN data nodes and the vSAN witness node. A ping with a large packet size (ping -s 8000) to the vSAN witness was run from the ESXi console to test whether packet loss is observed there as well. We did observe packet loss, which indicated that the problem is somewhere in the middle of the network. Some network routers could be overloaded and unable to fragment packets fast enough, causing packet loss.
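
To illustrate why a large ping is so sensitive to a lossy or overloaded path, here is a rough back-of-the-envelope sketch in Python. It assumes IPv4 without header options, a hypothetical path MTU of 1500 bytes towards the witness (the real value in the customer's network may differ), and independent loss of each fragment in one direction only. The function names are mine.

```python
import math

IP_HEADER = 20   # bytes, IPv4 header without options (assumption)
ICMP_HEADER = 8  # bytes, ICMP echo header


def fragment_count(icmp_payload: int, path_mtu: int) -> int:
    """Number of IPv4 fragments needed to carry one ICMP echo packet."""
    ip_payload = icmp_payload + ICMP_HEADER
    # Every fragment except the last must carry a multiple of 8 bytes.
    per_fragment = (path_mtu - IP_HEADER) // 8 * 8
    return math.ceil(ip_payload / per_fragment)


def packet_loss(fragment_loss: float, fragments: int) -> float:
    """Probability that a packet is lost, i.e. at least one fragment is lost.

    Assumes every fragment is dropped independently with the same probability.
    """
    return 1 - (1 - fragment_loss) ** fragments


if __name__ == "__main__":
    n = fragment_count(8000, 1500)
    print(f"ping -s 8000 over a 1500-byte MTU path: {n} fragments")
    for p in (0.01, 0.05, 0.10):
        print(f"fragment loss {p:.0%} => packet loss {packet_loss(p, n):.1%}")
```

Under these assumptions, an 8000-byte ping over a 1500-byte path is split into six fragments, and losing any one of them loses the whole packet. A small per-fragment loss rate on the routed path is therefore amplified into a much larger loss rate seen by the health check.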

Feature Request

My customer understands that this is the correct behavior and that everything works as designed. However, as they have a large number of vSAN clusters, they would highly appreciate it if the check "vSAN: MTU check (ping with large packet size)" were separated into two independent tests.
  • Test #1: “vSAN: MTU check (ping with large packet size) between data nodes”
  • Test #2: “vSAN: MTU check (ping with large packet size) between data nodes and witness”
We believe that such functionality would significantly improve the operational experience for large and complex environments.
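
The requested separation could look roughly like the following Python sketch. It is purely hypothetical and only illustrates the idea: the result format (source host, destination host, packet loss in percent) and the host names are made up, and nothing here reflects how the health service stores its results internally.

```python
def split_by_peer_type(results, witness_hosts):
    """Split per-pair ping results into data-data and data-witness groups.

    results:       iterable of (source, destination, loss_percent) tuples
    witness_hosts: set of host names that are vSAN witness nodes
    """
    groups = {"data-data": [], "data-witness": []}
    for src, dst, loss in results:
        is_witness_pair = src in witness_hosts or dst in witness_hosts
        key = "data-witness" if is_witness_pair else "data-data"
        groups[key].append((src, dst, loss))
    return groups


if __name__ == "__main__":
    # Hypothetical sample values, for illustration only.
    sample = [
        ("esx01", "esx02", 0),
        ("esx01", "witness01", 40),
        ("esx02", "witness01", 50),
    ]
    for name, pairs in split_by_peer_type(sample, {"witness01"}).items():
        print(name, pairs)
```

With results grouped this way, each group could be evaluated and alerted on separately, so a lossy WAN link to the witness would not hide the state of the data node network.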

Hope this explanation helps someone else within the VMware community.
