Friday, December 01, 2017

vSphere Switch Independent Teaming or LACP?

I have answered this question many times over the last couple of years, so I have finally decided to write a blog post on the topic. Unfortunately, the answer always depends on the specific factors (requirements and constraints) of the particular environment, so do not expect a short answer. Instead of a simple answer, I will compare LBT and LACP.

I assume you (my reader) are familiar with LACP, but do you know what LBT is? If not, here is a short explanation.
VMware LBT (Load Based Teaming) is an advanced switch independent teaming policy available on the VMware DVS. It pins each VM vNIC to a particular physical uplink in round-robin fashion, but if the traffic on a particular physical NIC exceeds 75% of its total bandwidth over a 30-second period, LBT rebalances the vNICs across the available physical uplinks (the physical NICs of the ESXi host) to avoid network congestion on that uplink.
If you are not familiar with basic VMware vSphere networking, read my previous blog post "Back to the basics - VMware vSphere networking" before continuing.
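
To make the mechanism more concrete, here is a minimal Python sketch of the behaviour described above. It is not VMware's implementation. The names (`Uplink`, `initial_placement`, `rebalance`) are made up, and the decision to move the busiest vNIC to the least loaded uplink is an assumption for illustration only. Just the round-robin placement and the 75% threshold come from the description above, and the per-vNIC rates are assumed to be already averaged over the 30-second window.

```python
from dataclasses import dataclass, field

THRESHOLD = 0.75  # fraction of uplink bandwidth that triggers rebalancing


@dataclass
class Uplink:
    name: str
    capacity_mbps: float
    # vNIC name -> Mbps, assumed to be averaged over the 30-second window
    vnics: dict = field(default_factory=dict)

    def load(self) -> float:
        return sum(self.vnics.values())

    def utilization(self) -> float:
        return self.load() / self.capacity_mbps


def initial_placement(vnic_names, uplinks):
    """Pin each vNIC to an uplink in round-robin order."""
    for i, vnic in enumerate(vnic_names):
        uplinks[i % len(uplinks)].vnics[vnic] = 0.0


def rebalance(uplinks):
    """Move one vNIC off each uplink whose average load exceeds the threshold.

    Assumption (not from VMware): the busiest vNIC goes to the least loaded
    uplink, and only if the target stays at or below the threshold.
    """
    moves = []
    for src in uplinks:
        others = [u for u in uplinks if u is not src]
        if src.utilization() <= THRESHOLD or len(src.vnics) < 2 or not others:
            continue
        dst = min(others, key=Uplink.load)
        vnic = max(src.vnics, key=src.vnics.get)
        if (dst.load() + src.vnics[vnic]) / dst.capacity_mbps <= THRESHOLD:
            dst.vnics[vnic] = src.vnics.pop(vnic)
            moves.append((vnic, src.name, dst.name))
    return moves


uplinks = [Uplink("vmnic0", 1000), Uplink("vmnic1", 1000)]
initial_placement(["vm1", "vm2", "vm3", "vm4"], uplinks)  # vm1, vm3 -> vmnic0; vm2, vm4 -> vmnic1
uplinks[0].vnics.update({"vm1": 500, "vm3": 400})         # vmnic0 at 90%
uplinks[1].vnics.update({"vm2": 100, "vm4": 50})          # vmnic1 at 15%
print(rebalance(uplinks))  # [('vm1', 'vmnic0', 'vmnic1')]
```

Note that a vNIC which saturates an uplink on its own is never moved in this sketch, because moving it would not relieve anything. That mirrors the LBT limitation discussed in the comparison: a single vNIC is always tied to one physical NIC.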

What we are comparing is switch independent teaming and LACP. LACP is a capability of the VMware Distributed Virtual Switch (VDS), so I assume you have a vSphere Enterprise Plus license and a VDS. If you have a VDS, I also assume you are already considering LBT, as it is the best of the switch independent teaming algorithms available on the VDS.

LBT versus LACP comparison

Option 1: Switch Independent Teaming (LBT - Load Based Teaming)
Option 2: LACP

LBT advantages
  • Fully independent of the upstream physical switches
  • Simple configuration
  • Beacon probing can be used.  Note: Beacon probing requires at least 3 physical NICs.
LBT disadvantages
  • A single VM cannot handle more traffic than the bandwidth of a single physical NIC.
  • Traffic is load-balanced across the uplinks from the ESXi perspective (egress traffic) but only at VM NIC granularity, and returning traffic (ingress traffic) is forwarded over the same link as the egress traffic.
LACP advantages
  • One of the main LACP advantages is the continuous heartbeat between the two sides of the link (ESXi physical NIC port and switch port). VMware's LACP sends LACPDUs every 30 seconds, but it can be reconfigured to fast mode, in which LACPDUs are exchanged every 1 second. This improves failover in case of link failure and also helps when link status (up/down) detection does not work well.
  • A single VM can, in theory, handle more traffic than a single physical NIC because of the load-balancing algorithm.
  • Traffic can be load-balanced from both sides of the link (virtual link channel, port-channel, etc.): from the ESXi side by ESXi and from the switch side by the load-balancing algorithm set on the switch. Proper configuration on both sides is required.
LACP disadvantages
  • ESXi Network Dump Collector does not work if the Management vmkernel port has been configured to use EtherChannel/LACP.
  • VMware vSphere beacon probing cannot be used.
  • LACP is not supported with software iSCSI port binding.
  • The LACP support settings are not available in host profiles.
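
The bandwidth difference between the two options comes down to what gets mapped to an uplink: a whole vNIC (LBT) or an individual flow (LAG). The following Python sketch contrasts the two ideas under simplifying assumptions. `lbt_uplink` and `lag_uplink` are hypothetical helpers, and the CRC32 hash over source IP, destination IP and destination port only stands in for whichever load-balancing hash is configured on the LAG. It is not the algorithm used by ESXi or by any physical switch.

```python
import zlib

UPLINKS = ["vmnic0", "vmnic1"]


def lbt_uplink(vnic_pinning: dict, vnic: str) -> str:
    """Switch independent teaming: every flow of a vNIC uses the uplink the vNIC is pinned to."""
    return vnic_pinning[vnic]


def lag_uplink(src_ip: str, dst_ip: str, dst_port: int) -> str:
    """LAG: each flow is hashed to one member link (illustrative hash only)."""
    key = f"{src_ip}|{dst_ip}|{dst_port}".encode()
    return UPLINKS[zlib.crc32(key) % len(UPLINKS)]


# Eight flows from the same VM to the same server, differing only in destination port.
flows = [("10.0.0.10", "10.0.1.20", port) for port in range(5000, 5008)]
pinning = {"vm1-vnic0": "vmnic0"}

lbt_links = {lbt_uplink(pinning, "vm1-vnic0") for _ in flows}
lag_links = {lag_uplink(*flow) for flow in flows}

print("LBT uplinks used by vm1:", sorted(lbt_links))  # always the single pinned uplink
print("LAG uplinks used by vm1:", sorted(lag_links))  # depends on how the flows hash
```

Even with a LAG, one flow always hashes to the same member link, so a single TCP session is still limited to the bandwidth of one physical NIC. This is why the advantage for a single VM is only "in theory": it depends on the VM having several flows that hash to different links.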

CONCLUSION AND ANSWER

So which option is better? Well, it depends.

When you do not have direct or indirect control of the physical network infrastructure, switch independent teaming is generally a much simpler and safer solution; therefore, LBT is the better choice.

If you trust your network vendor's LACP implementation and you either have some control over your physical switch configuration or trust it, LACP is the better choice because of the LACPDU heartbeat and the multiple load-balancing hash algorithms, which can, in theory, improve network bandwidth for a single VM's traffic and can be configured on both sides of the link channel. Another advantage is that LACP works better with multi-chassis LAG (MLAG) technologies like Cisco vPC, Dell Force10 VLT, Arista MLAG, etc. Generally, multi-chassis LAG "orphan ports" (ports without LACP) are not recommended by MLAG switch vendors because they do not have control of the end-point.

So the final decision is, as always, up to you, but this blog post should help you make the right decision for your specific environment.

Any other opinions, advantages, disadvantages, and ideas are welcome, so do not hesitate to write a comment.

****************************************************************

References to other resources:

[1] Check "Limitations of the LACP Support on a vSphere Distributed Switch" in the documentation here.


FAQ related to LBT and LACP comparison

Q: VMware's LACP sends LACPDUs every 30 seconds. Is there a way to configure the LACPDU frequency to 1 second?

A: Yes.

You can use the command "esxcli network vswitch dvs vmware lacp timeout set". It allows you to set advanced timeout settings for LACP.

Description:
set ... Set long/short timeout for vmnics in one LACP LAG
Cmd options:
-l|--lag-id= The ID of LAG to be configured. (required)
-n|--nic-name= The nic name. If it is set, then only this vmnic in the lag will be configured.
-t|--timeout Set long or short timeout: 1 for short timeout and 0 for long timeout. (required)
-s|--vds= The name of VDS. (required)
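
As an illustration of how the options listed above fit together, the following Python sketch only assembles the argument list for the command; it does not run anything. The helper name `lacp_timeout_command` and the example values (`DSwitch01`, LAG ID `1`, `vmnic0`) are made up, so substitute the names from your own environment and verify the result against the command's built-in help on your ESXi version.

```python
def lacp_timeout_command(vds, lag_id, short_timeout, nic_name=None):
    """Build the argument list for 'esxcli network vswitch dvs vmware lacp timeout set'.

    Only the short options from the help text above are used.
    """
    args = [
        "esxcli", "network", "vswitch", "dvs", "vmware", "lacp", "timeout", "set",
        "-s", vds,                            # name of the VDS (required)
        "-l", str(lag_id),                    # ID of the LAG (required)
        "-t", "1" if short_timeout else "0",  # 1 = short timeout, 0 = long timeout
    ]
    if nic_name is not None:
        args += ["-n", nic_name]              # optional: limit the change to one vmnic
    return args


# Hypothetical names: VDS "DSwitch01", LAG ID 1, uplink vmnic0.
print(" ".join(lacp_timeout_command("DSwitch01", 1, short_timeout=True, nic_name="vmnic0")))
# esxcli network vswitch dvs vmware lacp timeout set -s DSwitch01 -l 1 -t 1 -n vmnic0
```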

Relevant blog post on this topic "VMware vSphere DVS LACP timers".

Q: Can ESXi display the LACP settings of an established LACP session on a particular ESXi host? Something like "show lacp" on a Cisco switch?

A: Yes. You can use the command "esxcli network vswitch dvs vmware lacp status get". It should be equivalent to "show lacp" on a Cisco physical switch.

Q: How does VMware vSwitch Beacon Probing work?
A: Read the following blog posts
Q: What is the Beacon Probing interval?
A: 1 second

Q: Does ESXi beacon probing send beacons to every VLAN?
A: Yes, but only to VLANs (portgroups) where at least one VM is connected. It does not make sense to test failure on VLANs where nothing is connected.

