Sunday, April 20, 2014

Potential Network Black Hole Issue

When I do vSphere and hardware infrastructure health checks, I very often come across misconfigured networks, usually but not only in blade server environments. That's why I decided to write a blog post about this issue. The issue is general and should be considered and checked for any vendor's solution, but because I'm very familiar with DELL products I'll use the DELL blade system and I/O modules to show more detailed specifications and configurations.

Blade server chassis typically have switch modules, as depicted in the figure below.

When blade chassis switch modules are connected to another network layer (aggregation or core), there is a possibility of a network black hole, which I would like to discuss in depth in this post.

Let's assume you lose a single uplink from I/O module A. This situation is depicted below.

In this situation there is no availability problem, because network traffic can flow via the second uplink port of the I/O module. Of course, only half of the uplink bandwidth is available, so throughput may degrade and congestion can occur, but everything works and it is not an availability issue.

But what happens when the second uplink port of the I/O switch module fails? Look at the figure below.

If the I/O switch module is in normal switch mode, its uplink ports are in link-down state but its downlink server ports stay in link-up state. The ESX host NIC ports are therefore also up, ESX teaming doesn't know that something is wrong further down the path, and traffic is still sent to both NIC uplinks. We call this situation a "black hole" because traffic sent via NIC1 will never reach its destination, and your infrastructure is in trouble.

To overcome this issue, some I/O modules in blade systems can be configured as I/O Aggregators. Other modules are designed as I/O Aggregators by default, and this cannot be changed.

Here are examples of DELL blade switch modules which are switches by default but can be configured to work as I/O Aggregators (aka Simple Switch Mode):
  • DELL PowerConnect M6220
  • DELL PowerConnect M6348
  • DELL PowerConnect M8024-k
An example of an implicit I/O Aggregator is the DELL Force10 IOA.

Another I/O Aggregator-like option is the Fabric Extender architecture, implemented in the DELL blade system as the CISCO Nexus B22. CISCO FEX is a slightly different topic, but it also helps you effectively avoid the black hole issue.

When you use "simple switch mode" you have limited configuration possibilities. For example, you can use the module only for L2 and you cannot use advanced features like access control lists (ACLs). That can be a reason to leave the I/O module in normal switch mode. But even with I/O modules in normal switch mode, you can configure your switch to overcome the potential "black hole" issue. Here are examples of DELL blade switches and the technologies they offer to overcome this issue:
  • DELL PowerConnect M6220 (Link Dependency)
  • DELL PowerConnect M6348 (Link Dependency)
  • DELL PowerConnect M8024-k (Link Dependency)
  • DELL Force10 MXL (Uplink Failure Detection)
  • CISCO 3130X (Link State Tracking)
  • CISCO 3130G (Link State Tracking)
  • CISCO 3032 (Link State Tracking)
  • CISCO Nexus B22 (Fabric Extender)
If you leverage any of the technologies listed above, the link states of the I/O module uplink ports are synchronized to the configured downlink ports, and the ESX teaming driver can effectively provide ESX uplink high availability. This situation is depicted in the figure below.
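The difference between the two behaviours can be illustrated with a small model. This is only a sketch: the `IOModule` class, the `link_tracking` flag and the `classify_nics` function are hypothetical names invented for illustration, and real switches track groups of ports rather than a single downlink.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class IOModule:
    """Simplified blade I/O module: several uplinks, one server-facing downlink."""
    name: str
    uplinks_up: List[bool]        # link state of each uplink port
    link_tracking: bool = False   # hypothetical flag for Link Dependency / UFD / Link State Tracking

    def has_path_upstream(self) -> bool:
        # Traffic can leave the module if at least one uplink is up.
        return any(self.uplinks_up)

    def downlink_up(self) -> bool:
        # Without tracking, the server-facing port stays up regardless of the uplinks.
        if not self.link_tracking:
            return True
        # With tracking, the downlink goes down once all uplinks are down.
        return self.has_path_upstream()


def classify_nics(modules: List[IOModule]) -> Dict[str, str]:
    """Describe each host NIC (one per module) as NIC teaming would experience it."""
    result = {}
    for module in modules:
        if module.downlink_up() and module.has_path_upstream():
            result[module.name] = "ok"
        elif module.downlink_up():
            result[module.name] = "black hole"   # NIC link is up, but traffic is dropped
        else:
            result[module.name] = "failed over"  # NIC link is down, teaming stops using it
    return result


if __name__ == "__main__":
    module_a = IOModule("A", uplinks_up=[False, False])
    module_b = IOModule("B", uplinks_up=[True, True])
    print(classify_nics([module_a, module_b]))
    module_a.link_tracking = True
    print(classify_nics([module_a, module_b]))
```

With tracking disabled, the NIC behind module A is reported as a black hole; with tracking enabled, the same failure shows up as a link-down event that teaming can react to.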

Below are examples of detailed CLI configurations for some of the port tracking technologies described above.

DELL PowerConnect Link Dependency

Link dependency configuration on both blade access switch modules can solve the "Network Black Hole" issue.

 ! Server port configuration  
 interface Gi1/0/1  
 switchport mode general  
 switchport general pvid 201  
 switchport general allowed vlan add 201  
 switchport general allowed vlan add 500-999 tagged  
 ! Physical Uplink port configuration   
 interface Gi1/0/47  
 channel-group 1 mode auto  
 ! Physical Uplink port configuration   
 interface Gi1/0/48  
 channel-group 1 mode auto  
 ! Logical Uplink port configuration (LACP Port Channel)  
 interface port-channel 1  
 switchport mode trunk  
 ! Link dependency configuration  
 link-dependency group 1  
 add Gi1/0/1-16  
 depends-on port-channel 1  

Force10 Uplink Failure Detection (UFD)

Force10 calls its link dependency feature UFD. Here is a configuration example:

 FTOS#show running-config uplink-state-group  
 uplink-state-group 1  
 downstream TenGigabitEthernet 0/0  
 upstream TenGigabitEthernet 0/1  

The UFD configuration of a group can be displayed with the "show configuration" command from within the uplink-state-group context:

 FTOS(conf-uplink-state-group-16)# show configuration  
 uplink-state-group 16  
 description test  
 downstream disable links all  
 downstream TengigabitEthernet 0/40  
 upstream TengigabitEthernet 0/41  
 upstream Port-channel 8  

CISCO Link State Tracking

Link state tracking is a feature available on Cisco switches to manage the link state of downstream ports (ports connected to Servers) based on the status of upstream ports (ports connected to Aggregation/Core switches).

Saturday, April 19, 2014

How to fix broken VMFS partition?

I needed more storage space in my lab. Redundancy was not important, so I changed the RAID configuration of the local disks from RAID 1 to RAID 0. After this change the old VMFS partition was left on the disk volume. That's why I saw just half of the total disk space when I tried to create a new datastore. The other half was still used by the old VMFS partition. You can SSH to the ESXi host and check the disk partitions with the partedUtil command.

 ~ # partedUtil get /dev/disks/naa.690b11c0034974001ae597300bc5b3aa  
 Error: The backup GPT table is not at the end of the disk, as it should be. This might mean that another operating system believes the disk is smaller. Fix, by moving the backup to the end (and removing the old backup)? This will also fix the last usable sector as per the new size. diskSize (1169686528) AlternateLBA (584843263) LastUsableLBA (584843230)  
 Warning: Not all of the space available to /dev/disks/naa.690b11c0034974001ae597300bc5b3aa appears to be used, you can fix the GPT to use all of the space (an extra 584843264 blocks) or continue with the current setting? This will also move the backup table at the end if is is not at the end already. diskSize (1169686528) AlternateLBA (584843263) LastUsableLBA (584843230) NewLastUsableLBA (1169686494)  
 72809 255 63 1169686528  
 1 2048 584843230 0 0  

The old datastore didn't contain any data, so I had no problem trying the fix option with the following command.

 ~ # partedUtil fix /dev/disks/naa.690b11c0034974001ae597300bc5b3aa  

When I checked the partitions on the disk device again, I got the following output:

 ~ # partedUtil get /dev/disks/naa.690b11c0034974001ae597300bc5b3aa  
 72809 255 63 1169686528  

... and then I was able to create a datastore on the whole disk.
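The numbers in the partedUtil output above can be checked with a few lines of Python. This is a minimal sketch which assumes that the first numeric line is the geometry (cylinders, heads, sectors per track, total sectors), that each following numeric line starts with partition number, start sector and end sector, and that sectors are 512 bytes. The function names are my own and are not part of partedUtil.

```python
SECTOR_BYTES = 512  # assumption: 512-byte logical sectors

# Shortened copy of the first "partedUtil get" output from this post.
SAMPLE = """\
Error: The backup GPT table is not at the end of the disk, as it should be.
Warning: Not all of the space available to the disk appears to be used.
72809 255 63 1169686528
1 2048 584843230 0 0
"""


def parse_parted_get(text):
    """Return (total_sectors, [(number, start, end), ...]) from 'partedUtil get' output."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Keep only purely numeric lines; Error/Warning lines are skipped.
        if fields and all(field.isdigit() for field in fields):
            rows.append([int(field) for field in fields])
    if not rows or len(rows[0]) < 4:
        raise ValueError("no geometry line found")
    total_sectors = rows[0][3]
    partitions = [(row[0], row[1], row[2]) for row in rows[1:]]
    return total_sectors, partitions


def unpartitioned_sectors(total_sectors, partitions):
    """Sectors not covered by any partition (partition table overhead included)."""
    used = sum(end - start + 1 for _, start, end in partitions)
    return total_sectors - used


def to_gib(sectors):
    return sectors * SECTOR_BYTES / 2**30


if __name__ == "__main__":
    total, parts = parse_parted_get(SAMPLE)
    print(f"disk size: {to_gib(total):.2f} GiB")
    for number, start, end in parts:
        print(f"partition {number}: {to_gib(end - start + 1):.2f} GiB")
    print(f"unpartitioned: {to_gib(unpartitioned_sectors(total, parts)):.2f} GiB")
```

Under these assumptions the old partition covers roughly half of the disk, which matches what the new datastore wizard showed.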

DELL recommended BIOS Settings for VMware vSphere Hypervisor

Here is a list of BIOS settings specifically for Dell PowerEdge servers:
  • Hardware-Assisted Virtualization: As the VMware best practices state, this technology provides hardware-assisted CPU and MMU virtualization.
    In the Dell PowerEdge BIOS, this is known as “Virtualization Technology” under the “Processor Settings” screen. Depending upon server model, this may be Disabled by default. In order to utilize these technologies, Dell recommends setting this to Enabled.
  • Intel® Turbo Boost Technology and Hyper-Threading Technology: These technologies, known as "Turbo Mode" and "Logical Processor" respectively in the Dell BIOS under the "Processor Settings" screen, are recommended by VMware to be Enabled for applicable processors; this is the Dell factory default setting.
  • Non-Uniform Memory Access (NUMA): VMware states that in most cases, disabling "Node Interleaving" (which enables NUMA) provides the best performance, as the VMware kernel scheduler is NUMA-aware and optimizes memory accesses to the processor it belongs to. This is the Dell factory default.
  • Power Management: VMware states “For the highest performance, potentially at the expense of higher power consumption, set any BIOS power-saving options to high-performance mode.” In the Dell BIOS, this is accomplished by setting "Power Management" to Maximum Performance.
  • Integrated Devices: VMware states “Disable from within the BIOS any unneeded devices, such as serial and USB ports.” These devices can be turned off under the “Integrated Devices” screen within the Dell BIOS.
  • C1E: VMware recommends disabling the C1E halt state for multi-threaded, I/O latency sensitive workloads. This option is Enabled by default, and may be set to Disabled under the “Processor Settings” screen of the Dell BIOS.
  • Processor Prefetchers: Certain processor architectures may have additional options under the “Processor Settings” screen, such as Hardware Prefetcher, Adjacent Cache Line Prefetch, DCU Streamer Prefetcher, Data Reuse, DRAM Prefetcher, etc. The default setting for these options is Enabled, and in general, Dell does not recommend disabling them, as they typically improve performance. However, for very random, memory-intensive workloads, you can try disabling these settings to evaluate whether that increases the performance of your virtualized workloads.

Tuesday, April 15, 2014

Long network loss during vMotion

We observed strange vMotion behavior during vSphere Design Verification tests after a successful vSphere implementation. By the way, that's why Design Verification tests are so important before putting infrastructure into production. But back to the problem. When a VM was migrated between ESXi hosts with VMware vMotion, we saw a long loss of VM network connectivity, somewhere between 5 and 20 seconds. That's really strange, because usually you don't observe any network loss, or at most a single lost ping from the VM's perspective.

My first question to the implementation team was whether the virtual switch option "Notify Switch" was enabled, as documented in the vSphere design. The importance of this option is explained in relative depth, for example, here and here. The implementation team confirmed it was enabled, so the problem had to be somewhere else, most probably on the physical switches. The main purpose of the "Notify Switch" option is to quickly update the physical switch's MAC address table (aka CAM) and help the physical switch learn which physical port the VM vNIC is connected to. So the next suspect was the physical switch. In our case it was a virtual stack of two HP 6125XLG blade modules.
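Why a stale MAC address table entry causes an outage can be shown with a toy model. This is only a sketch of generic MAC learning: the `MacTable` class is hypothetical, it ignores aging timers and VLANs, and it does not model the HP-specific behavior discussed in this post.

```python
class MacTable:
    """Toy model of a switch MAC address (CAM) table."""

    def __init__(self):
        self._port_by_mac = {}

    def learn(self, src_mac, port):
        """Record the port a source MAC was seen on; return True if the MAC moved."""
        previous = self._port_by_mac.get(src_mac)
        self._port_by_mac[src_mac] = port
        return previous is not None and previous != port

    def lookup(self, dst_mac):
        """Return the port frames for dst_mac are forwarded to, or None if unknown."""
        return self._port_by_mac.get(dst_mac)


if __name__ == "__main__":
    vm_mac = "00:50:56:aa:bb:cc"   # placeholder MAC address
    table = MacTable()
    table.learn(vm_mac, port=1)    # VM runs on the host behind port 1

    # The VM is migrated to the host behind port 2. Until the switch sees a
    # frame sourced from the VM's MAC on port 2, it keeps forwarding to port 1.
    print(table.lookup(vm_mac))    # still the old port

    # "Notify Switch" makes the destination host send such a frame right away.
    moved = table.learn(vm_mac, port=2)
    print(moved, table.lookup(vm_mac))
```

The faster the switch accepts and applies the move, the shorter the window in which traffic for the VM goes to the old port.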

In the end, the implementation team proved that the HP switch was the culprit. Here is the very important HP switch global configuration setting that eliminates this issue.

 mac-address mac-move fast-update  

I didn't find any similar blog post or KB article about this issue, so I hope it will be helpful for the VMware community.

DELL official response to OpenSSL Heartbleed

The official Dell response to OpenSSL Heartbleed for our entire portfolio of products is listed here.

Monday, April 14, 2014

Code Formatter

From time to time I publish source code or configurations on my blog, which runs on the Google blog platform, and I always struggle with formatting the code.

I've just found and I'll try it next time when needed.

Tuesday, April 08, 2014

PRTG alerts phone call notifications

Someone asked me how to set up phone call notifications for critical alerts in the PRTG monitoring system. The advantage of a phone call over email or SMS is that it can wake up a sleeping administrator at night when he is on support duty and a critical alert appears in the central monitoring system.

My conceptual answer was: use the PRTG API to monitor alerts and make a phone call when critical alerts exist.

The new generation of admins has no problem with APIs but doesn't know how to dial a voice call. That's because they are an IP-based generation and don't have experience with the modems we played with extensively back in the 80's and 90's ;-)

In the end I promised to prepare an embedded system for them, integrated with PRTG over the API and dialing a configured phone number in case of critical alerts.
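The concept can be sketched in a few lines of Python. This is not the code running on the prototype, only an illustration under stated assumptions: that the PRTG HTTP API is queried via `/api/table.json` with `content=sensors` and `filter_status=5`, that each returned sensor carries a `status_raw` field where 5 means "Down", and that the GSM modem accepts the standard `ATD<number>;` command for a voice call. All function names are my own, and the HTTP request and the serial port write are intentionally left out.

```python
import json
from urllib.parse import urlencode

DOWN = 5  # assumption: PRTG reports sensors in "Down" state with raw status 5


def sensors_url(base_url, username, passhash):
    """Build the (assumed) PRTG API query for sensors that are down."""
    query = urlencode({
        "content": "sensors",
        "columns": "objid,device,sensor,status",
        "filter_status": DOWN,
        "username": username,
        "passhash": passhash,
    })
    return f"{base_url}/api/table.json?{query}"


def down_sensors(response_text):
    """Return the sensors reported as down in a table.json style response."""
    payload = json.loads(response_text)
    return [s for s in payload.get("sensors", []) if s.get("status_raw") == DOWN]


def dial_command(number):
    """Return the AT command for a voice call; the trailing ';' selects voice mode."""
    if not number.lstrip("+").isdigit():
        raise ValueError(f"not a phone number: {number!r}")
    return f"ATD{number};\r"


if __name__ == "__main__":
    # Hand-written sample response, not captured from a real PRTG server.
    sample = json.dumps({"sensors": [
        {"objid": 1001, "device": "esx01", "sensor": "Ping", "status_raw": 5},
        {"objid": 1002, "device": "esx02", "sensor": "Ping", "status_raw": 3},
    ]})
    if down_sensors(sample):
        # On the real system this string would be written to the modem's serial port.
        print(repr(dial_command("+10000000000")))  # placeholder number
```

Keeping the alert check and the dialing as separate functions makes it possible to test the logic without a PRTG server or a modem attached.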

Here is a picture of the hardware prototype, built on the Soekris computing platform running FreeBSD from a RAM disk and making phone calls via an RS-232 GSM modem.
Soekris computing platform and RS-232 GSM modem.
Here are relevant blog posts describing some technical details in a little more depth