Thursday, January 20, 2022

Energetics and Distributed Cloud Computing

Energy

The cost of energy is increasing. A significant part of the cost of electrical energy is the cost of distribution. That is why small home solar systems are becoming more popular: they are a way to generate and consume electricity locally and be independent of the distribution network. However, we have a problem. "Green energy" from solar, wind, and hydroelectric power stations is difficult to distribute via the electrical grid. Energy storage (batteries, pumped-storage power plants, etc.) is costly, and for the traditional electrical grid it is very difficult to automatically manage the distribution of so many energy sources.

Cloud Computing

The demand for cloud (computing and storage) capacity is increasing year by year. Internet bandwidth increases and its cost decreases every year. 5G networks and SD-WANs are on the radar. Cloud computing is operated in data centers, and a significant part of data center costs is the cost of energy.

The potential synergy between Energetics and Cloud Computing 

The solution is to consume electricity in the proximity of green power generators. Excess electricity is stored in batteries, but battery capacity is limited. We should treat batteries like a cache or buffer to bridge the times when green sources do not generate energy but there is local demand. However, when there is excess electricity and the battery (cache/buffer) is full, instead of feeding the energy into the electrical grid, the excess electricity can be consumed by a computer system providing compute resources to cloud computing consumers over the internet. This is a form of Distributed Cloud Computing.

Cloud-Native Applications

So, let's assume we will have Distributed Cloud Computing with so-called "Spot Compute Resource Pools". Spot Compute Resource Pools are computing resources that can appear or disappear within hours or minutes. This is not an optimal IT infrastructure for traditional software applications, which are not infrastructure aware. For such distributed cloud computing, software applications must be designed and developed with the ephemerality of infrastructure resources in mind. In other words, Cloud-Native Applications must be able to leverage ephemeral compute resource pools and know how to use "Spot Compute Resource Pools".

Conclusion

With today's technology, it is not very difficult to roll out such a network of data centers providing distributed cloud computing and locally consuming the excess electricity from "green" electric sources. I'm planning a Proof of Concept at my house in the middle of this year and will let you know about some real experiences, because the devil is in the detail.

The conceptual design of such a solution is available at https://www.slideshare.net/davidpasek/flex-cloud-conceptual-design-ver-02

If you would like to discuss this topic, do not hesitate to use the comments below the blog post or open a discussion on Twitter @vcdx200.

Wednesday, January 19, 2022

How to avoid or at least mitigate the risk of software and hardware component failures?

Last Thursday, my Firefox web browser stopped working during a regular Zoom meeting with my team. Today, thanks to The Register, I realized that it was due to the "Foxstuck" software bug. For further details about the bug, read https://www.theregister.com/2022/01/18/foxstuck_firefox_browser_bug_boots/

My troubleshooting was pretty quick. Both Chrome and Safari worked fine, so it was evident that this was a Firefox issue.

I tried various classic tricks to solve the Firefox problem (clearing the cache, cookies, reinstalling the software to the latest version, etc.), but because nothing helped in the 10 minutes I was willing to invest, I decided I didn't have time for further experiments and after about a year of using Firefox, I switched back to Chrome.

The switchover was all about transferring important data from Firefox to Chrome. I use an external password manager (thank god), so the only important data in Firefox were my bookmarks. Exporting bookmarks from Firefox and importing them into Chrome was a matter of seconds.

Problem solved. Hurrah!

But it's clear that a similar software bug may hit Chrome or Safari in the future, so it's only a matter of time before I'm forced to switch to another web browser. Actually, Chrome made me angry in the past, and that was the reason I switched to Firefox.

So what is the moral of this story?

The only way not to be affected by such software bugs is a dual-, triple-, or even multi-vendor strategy (in this case Firefox, Chrome, Safari) and the art of quickly identifying the problematic component and replacing it with another one.

This blog is about data centers, data center infrastructure, and software-defined infrastructure. Does it apply here? I think so.

In the hardware area, we can implement the MULTI-VENDOR strategy using compute, storage, and network virtualization, where VMware is the industry leader. Server virtualization (ESXi) gives us hardware abstraction, so we can use HPE, Dell, or Lenovo servers in the same way. Storage virtualization (vSAN, vVols) gives us storage abstraction and independence from storage vendors. Network virtualization does the same for network components like switches, routers, firewalls, and load balancers.

When we virtualize all hardware components, we have a software-defined infrastructure. If we do not want to plan, design, implement, and operate software-defined infrastructure by ourselves, we can outsource it to cloud providers and consume it as a service. This is IaaS cloud infrastructure.

If we consume IaaS cloud infrastructure, we can implement the MULTI-VENDOR strategy as MULTI-CLOUD. The MULTI-CLOUD strategy is based on the assumption that if one IaaS cloud provider fails, the other cloud providers will not fail at the same time; therefore, such a strategy has a positive impact on availability and/or recoverability.

And if we have already adopted a MULTI-CLOUD strategy, then we only lack modern applications designed to automatically detect an infrastructure failure of one cloud provider and recover from it by a fast application fail-over to another cloud. Kubernetes can help with multi-cloud from an infrastructure point of view, but in the end, it is all about the application architecture having self-healing natively within the application DNA. An application architected for MULTI-CLOUD is, at least for me, a CLOUD NATIVE APPLICATION: an application that is able to live in the cloud and survive inevitable failures. This is exactly how the human body works and how human civilizations migrate between regions. That's why we have multi-site and multi-region architectures, and cloud-native applications are able to recognize where the best place to live is, do some cost analysis, and migrate if it makes sense. Isn't that similar to humans?

And that's it. Easy to write, isn't it? ... The real implementation of MULTI-CLOUD architecture is a bit trickier, but with today's technology, it's feasible.

Wednesday, October 20, 2021

Kubernetes vSphere CSI Driver

The main reason why I blog is to document some technical details and design patterns I discuss with my customers. Usually, I decide to write a blog post about some topic when there are more than two customers wanting to know some technical details or experiencing some technical challenge.

Today I will write my first blog post about Kubernetes. It seems to me that Kubernetes has finally reached momentum and everybody is trying to jump on the bandwagon. It is obvious that Kubernetes is the infrastructure platform for modern distributed applications. VMware recognized this trend very early and integrated Kubernetes into the VMware vSphere platform, also known as Tanzu. I do not want to describe the Tanzu platform from a product perspective because there are plenty of such blog posts across the blogosphere. Cormac Hogan is my favorite Tanzu/Kubernetes blogger, probably because in the past he was blogging about vSphere and storage-related topics. Therefore, if you want to get some info about VMware Tanzu, I highly recommend Cormac's blog, which is available at https://cormachogan.com/.

In this article, I would like to describe an architecture overview of the vSphere CSI Driver and some of the process flow behind the scenes.

Disclaimer: Please note that this is just my personal understanding of how it works, and some things can be inaccurate or described only at a high level. Nevertheless, if you believe there is something totally wrong, speak up in the comments below the article.

First things first, I'm a visual guy, so let's start with the overall solution architecture.


The DevOps process to create a persistent volume is as follows:

  • The DevOps admin asks the Kubernetes cluster to create a persistent volume via kubectl and a YAML manifest (aka a Persistent Volume Claim).
  • The CSI driver has a control plane in the K8s supervisor and CSI driver agents on all K8s worker nodes.
  • The DevOps admin's request (claim) for a persistent volume is sent to the CSI driver control plane.
  • The CSI driver control plane is integrated with the vCenter Server via the vSphere API.
  • The CSI driver control plane asks vSphere, via the vCenter API, to create a storage volume.
  • The storage volume can be a VMDK file on a VMFS filesystem, a vSAN object, a vVol (a LUN on physical storage), or NFS shared storage (a mount point).
  • vCenter creates the storage volume via one of the ESXi hosts.
  • The CSI driver control plane can leave the storage volume unattached (aka FCD - First Class Disk), or it can attach the storage volume to a particular ESXi host, because eventually it knows which K8s pod (container) the volume should be attached to. It also knows on which K8s worker node (a Linux guest OS in a virtual machine) the K8s pod is running; therefore, it dynamically attaches the volume to that particular virtual machine (leveraging the hot-plug/hot-add capability).
    • Note 1: Block persistent volumes are attached to virtual machines via the PVSCSI adapter because it supports a higher number of disks (64), and since a virtual machine supports up to four (4) SCSI adapters, a single VM (K8s worker node) can have up to 256 volumes.
    • Note 2: The CSI driver can add additional PVSCSI adapters to the VM dynamically.
    • Note 3: This only works when the VM advanced setting "devices.hotplug" is enabled, which is the default.
  • Finally, the CSI driver agent detects the new storage volume within the K8s worker node (Linux guest OS), and because it knows which K8s pod (Linux container / chroot) the particular volume should be attached to, it attaches the volume to the desired container (pod). A quick way to verify the result from the DevOps side is shown below.
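
As a quick sanity check of this workflow from the DevOps side, you can watch the claim getting bound and the volume being attached. The object name below (example-pvc) is just an illustrative placeholder:

  • kubectl get pvc example-pvc
  • kubectl describe pvc example-pvc
  • kubectl get pv
  • kubectl get volumeattachments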

Hope I did not forget anything in the automated workflow the vSphere CSI driver is doing :-)

I guess now you would ask me how the DevOps admin issues persistent volume claims to the K8s cluster, right?

Well, it is a two-step process. First of all, the K8s cluster must know a K8s Storage Class, which is later used for persistent volume claims. A Storage Class is just a mapping between a vSphere Storage Policy and a K8s StorageClass object (aka kind). If you are not yet familiar with VMware vSphere SPBM (Storage Policy Based Management), please read this.

The second step is to create a Persistent Volume Claim describing the particular storage request.

Examples of both Kubernetes (YAML) requests are below. 
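
Take the following as a minimal sketch only. The storage policy name ("vSAN Default Storage Policy"), the StorageClass name, and the claim name are illustrative placeholders; the provisioner name csi.vsphere.vmware.com is the one used by the vSphere CSI driver.

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: vsphere-csi-gold
provisioner: csi.vsphere.vmware.com
parameters:
  storagepolicyname: "vSAN Default Storage Policy"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: example-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 5Gi
  storageClassName: vsphere-csi-gold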

 

I believe the examples above are self-explanatory.

Hope this article helps the broader VMware user community understand what is under the cover.


Monday, October 04, 2021

2-Node vSAN Direct Connect and LACP

One of my customers is using 2-node vSANs in multiple branch offices. One of the many reasons for using 2-node vSAN is the possibility to leverage the existing 1 Gb network and use a 25 Gb Direct Connect between the ESXi hosts (vSAN nodes) without the need for 25 Gb Ethernet switches. Generally, they have a very good experience with vSAN, but recently they experienced vSAN Direct Connect outages when testing network resiliency. The resiliency test was done by administratively shutting down one vmnic (physical NIC port) on one vSAN node. After further troubleshooting, they realized that their particular NICs (network adapters) do not propagate the link-down state to the physical link when a vmnic is administratively disabled by the command "esxcli network nic down -n vmnic2".

It is worth mentioning that such a network outage does not mean a 2-node vSAN outage (that's exactly why we have the vSAN witness); however, vSAN is in a degraded state and cannot provide mirror (RAID 1) protection of vSAN objects.

Such network behavior is definitely strange, and we have opened a discussion and root cause analysis with the hardware vendor. However, we have also started an internal discussion about the design alternatives we have to mitigate such weird situations and increase the resiliency and overall availability of the vSAN system.

Here are three design options for how to implement direct connect networking between two ESXi hosts.

Design Option 1 - Switch independent teaming with explicit fail-over

Option 1 uses a single VMkernel interface (vmk2) connected to a single vSwitch port group that uses two uplinks with explicit fail-over teaming, where vmnic2 is the explicitly active uplink and vmnic3 is used only in case vmnic2 is not available.
 

This design option is generally recommended by VMware.

Benefits: simple configuration, highly available solution

Drawbacks: in case of a link-state hardware problem, you can end up in a situation where one vSAN node is using the VMkernel interface via the vmnic2 uplink while the other node has already failed over to vmnic3, so the two nodes sit on different direct-connect links and vSAN traffic is black-holed.

Design Option 2 - Link Aggregation (LACP)

Option 2 uses a single VMkernel interface (vmk2) connected to a single vSwitch port group with a single logical uplink (LAG), which is backed by two uplinks (vmnic2, vmnic3) bonded into a port-channel. In such a network configuration, both uplinks are active. It is worth mentioning that in a 2-node configuration the LACP load balancing algorithm can help with load balancing of vSAN traffic across both uplinks, but the main benefit of LACP is periodic heartbeating (sending LACPDUs), which by default is done every 30 seconds (slow LACP). For more information about LACP timers, read this blog post.

Benefits: the LAG virtual interface with LACPDU heartbeating can mitigate the risk of a black hole scenario in case of problems with link state.

Drawbacks: 

  • LACP configuration is more complicated than switch-independent teaming, therefore it has a negative impact on manageability. 
  • Network availability is not guaranteed with multiple vmknics in some asymmetric failures, such as one NIC failure on one host and another NIC failure on another host. However, more bundled links can increase vSAN traffic availability, because vSAN L3 connectivity stays up and running as long as at least a single L1 link is up.

Useful LACP commands

  • esxcli network vswitch dvs vmware lacp status get
  • esxcli network vswitch dvs vmware lacp stats get
  • esxcli network nic down -n vmnic2 
  • esxcli network nic up -n vmnic2
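
Note: by default, ESXi uses the slow LACP timer mentioned above. If I remember correctly, the LACPDU interval can be switched per LAG with "esxcli network vswitch dvs vmware lacp timeout set" (verify the exact options with --help in your ESXi version); the VDS name and LAG ID below are just placeholders.

  • esxcli network vswitch dvs vmware lacp timeout set -s <VDS-name> -l <LAG-ID> -t 1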

Design Option 3 - Two vSAN Air Gap Networks

Two vSAN air gap networks actually mean two vSAN VMkernel interfaces connected to two totally independent (air-gapped) networks.

Benefits: a little bit easier configuration than LACP.

Drawbacks: 

  • Setup is complex and error-prone, and troubleshooting is more complex. 
    • Requires multiple L3 VMkernel interfaces for vSAN traffic. 
  • Network availability is not guaranteed with multiple vmknics in some asymmetric failures, such as one NIC failure on one host and another NIC failure on another host. 
  • Source: Pros and Cons of Air Gap Network Configurations with vSAN

Conclusion and design decision

In this blog post, I have described three different options for the network configuration of vSAN Direct Connect. I personally believe design option 2 (LACP for vSAN Direct Connect) is the optimal design decision, especially if NIC link-state propagation is not reliable, as is the case for my customer. However, design option 3 would solve the issue as well. The final design decision is up to the customer.

Friday, October 01, 2021

Enhanced Load Balancing Path Selection Policy

This blog post will be very short.

A few years ago, I wrote a blog post about this topic. It is available here, so read it for further details.

What my colleagues and I realized today is that this VMW_PSP_RR sub-policy option is enabled by default, therefore the VMware Round Robin multi-pathing policy considers I/O latency for optimal storage path selection.

The ESXi setting can be validated in the ESXi shell with the command

esxcfg-advcfg -g /Misc/EnablePSPLatencyPolicy 

where the output in ESXi 6.7 U3 and above is

Value of EnablePSPLatencyPolicy is 1

Note: 1 is TRUE.

This is the reason why you can observe different traffic via different storage paths.
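
By the way, the Round Robin sub-policy can also be checked or set per device. A small sketch (the device identifier naa.xxx is just a placeholder):

esxcli storage nmp psp roundrobin deviceconfig get --device=naa.xxx

esxcli storage nmp psp roundrobin deviceconfig set --device=naa.xxx --type=latency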

Thursday, September 30, 2021

VMware Distributed Switch - vSphere 6.7 versus 7.0

This will be a really quick heads-up for those upgrading vSphere 6 to vSphere 7.

I've been informed by a colleague that his customer had a network outage when he upgraded the VMware Distributed Switch (aka VDS) from version 6.6.0 (vSphere 6.7 U3) to 7.0.2 (vSphere 7.0 U2).

That was a surprise, as we were not aware of any VDS upgrade issues in the past.

The network outage was observed on Microsoft Network Load Balancers (aka NLB), which was a pretty good hint for the Root Cause Analysis.

After further analysis, the root cause turned out to be the change of the VDS default advanced setting "Multicast filtering mode".

In vSphere 6.7, the default "Multicast filtering mode" is basic.


In vSphere 7.0, the default "Multicast filtering mode" is IGMP/MLD Snooping.

 

For those who know how IGMP Snooping works, it is not a big surprise why it might be a problem for the Microsoft Network Load Balancer.

Hope this will help the broader VMware community.
 


Thursday, September 09, 2021

vSphere design : ESXi protection against network port flapping

I've just finished a root cause analysis of a VM restart in a customer's production environment, so let me share with you the symptoms of the problem, the customer's current vSphere design, and the recommended improvement to avoid similar problems in the future.

After further discussion with the customer, we identified the following symptoms:

  • the VM was restarted on a different ESXi host
  • the original ESXi host, where the VM was running before the restart, was isolated (network isolation)
  • vSAN was partitioned

What does it mean?

Well, for those who understand how a vSphere HA cluster works, it is a pretty simple diagnosis.

  • ESXi was isolated from the network
  • The HA cluster "Response for Host Isolation" was set to "Power off and restart VMs"
    • this is the recommended setting for IP storage, because when the network is not available, there is a high probability that the storage is not available either and the VM is in trouble
    • the customer has vSAN, which is IP storage, therefore such a setting makes perfect sense

That having been said, this was the reason the VM was restarted, and it is expected behavior to achieve higher VM availability at the cost of a short unavailability caused by the VM restart.

However, there is a logical question.

Why was the ESXi host isolated from the network when network teaming (vmnic1 + vmnic3) is configured?

The customer environment is depicted in the design drawing below.

When vSAN is used, vSphere HA heartbeating happens across the vSAN network, therefore the vmk3 L3 interface (vSAN) is in use, leveraging the vmnic1 and vmnic3 uplinks. The customer has both uplinks active with "Route based on originating virtual port", therefore the traffic goes through either vmnic1 or vmnic3. This is called uplink pinning, and only one uplink is used for vSphere HA heartbeat traffic.

The customer is using VMware Log Insight (syslog + data analytics) for central log management, therefore troubleshooting was a piece of cake. We found vmnic3 flapping (link up, down, up, down, ...) and the Fault Domain Manager (aka FDM) log message about the host isolation and VM restart.

Cool, we know the root cause, but what options do we have to avoid such a situation?

Well, the issue described above is called network port flapping. In such a single-port issue, in our case with vmnic3, the vmk3 (vSAN, HA heartbeat) interface was originally pinned to vmnic3, and when vmnic3 went down, vmk3 failed over from vmnic3 to vmnic1. However, because vmnic3 came back up, the fail-over process was stopped and vmk3 was kept on vmnic3. Nevertheless, vmnic3 went down, up, down, up, etc. again, and as the network was very unstable, vSphere HA heartbeating failed. As we do not have traditional datastores, there is no vSphere HA storage heartbeating and we rely only on network heartbeating, which failed; thus the ESXi host was declared isolated, and the VM was powered off and restarted on another ESXi host within the vSphere cluster, where the VM could provide its application services again. This is actually the goal of vSphere HA: to increase the availability of VM services, and network availability is part of that availability.

So, what is port flapping?

Source: https://lantern.splunk.com/IT_Use_Case_Guidance/Infrastructure_Performance_Monitoring/Network_Monitoring/Managing_Cisco_IOS_devices/Port_flapping_on_Cisco_IOS_devices

Port flapping is a situation in which a physical interface on the switch continually goes up and down, three or more times a second for at least 10 seconds.

Common causes for port flapping are bad, unsupported, or non-standard cable or other link synchronization issues. The cause for port flapping can be intermittent or permanent. You need a search to identify when it happens on your network so you can investigate and resolve the problem.

How to avoid port flapping consequences in vSphere Cluster?

(1) Link Dampening. There are some possibilities on the Ethernet switch side. I blogged about "Dell Force10 Link Dampening" a few years ago, which should help in these situations.

(2) There is the VMware vSwitch "Teaming and failover" option Failback=No, available through the GUI.


(3) And there is the ESXi advanced setting "Net.TeamPolicyUpDelay", which is something like the "Link Dampening" described above. Source: https://kb.vmware.com/s/article/2014075

Each option above has its own benefits and drawbacks:

+ means benefit

- means drawback

Let's go option by option and discuss pluses and minuses.

Option 1: Physical Ethernet switch Link Dampening

+ per physical switch port setting, therefore not too many places to configure, but still some effort. Some switches support profile configuration, which can have a positive impact on manageability.

- such a feature might or might not be available from a particular network vendor, and if available, the configuration varies from vendor to vendor

- it must be done by the network admin, therefore the vSphere admin has neither the rights nor insight into such a setting, and you must explain and justify it to the network admin, network manager, etc.

Option 2: VMware vSwitch "Teaming and failover / Failback=No"

+ per vSwitch port group setting, therefore a single and straightforward setting in the case of a Distributed Virtual Switch (aka VDS)

- in the case of a Standard Virtual Switch (aka VSS), the setting must be done on the vSwitch of each ESXi host, which has a negative impact on manageability

- it will fail over all traffic from the flapping vmnic to the fully operational vmnic, but it will never fail back until an ESXi restart. It has a positive impact on availability but a potentially negative impact on performance and throughput

Option 3: ESXi advanced setting "Net.TeamPolicyUpDelay"

- a per-ESXi advanced setting, which is not perfect from a manageability point of view

+ it has a positive impact on availability and also on performance, because in case of a temporary flapping issue, it can fail traffic back after some longer time, let's say 5 or 10 seconds (see the example command below).

- unfortunately, there is no granularity like Force10 Link Dampening, which can penalize the interface based on flap frequency and decay exponentially depending on the configured half-life.
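
For illustration, Net.TeamPolicyUpDelay is set in milliseconds, so delaying the failback by roughly 10 seconds would look something like the sketch below (the value 10000 is just an example; test it in a lab first):

esxcli system settings advanced set -o /Net/TeamPolicyUpDelay -i 10000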

Conclusion

Which option should the customer implement? To be honest, it is up to a cross-team discussion, because each option has some advantages and disadvantages. Nevertheless, there are options to consider to increase system availability and resiliency.

Hope you have found this write-up useful. This is my way of giving back to the VMware community. I believe that sharing knowledge is the only way to improve not only technology but human civilization. If you have another opinion, other options, or experience, please do not hesitate to write a comment below this article.