Wednesday, March 24, 2021

What's new in vSphere 7 Update 2

vSphere 7 is not only about server virtualization (Virtual Machines) but also about Containers orchestrated by the Kubernetes orchestration engine. The VMware Kubernetes distribution and the broader platform for modern applications (also known as CNA - Cloud Native Applications, or Developer Ready Infrastructure) is called VMware Tanzu. Let's start with enhancements in this area and continue with more traditional areas like Operational, Scalability, and Security improvements.

Developer Ready Infrastructure

vSphere with Tanzu - Integrated LoadBalancer

vSphere 7 Update 2 includes a fully supported, integrated, highly available, enterprise-ready Load Balancer for the Tanzu Kubernetes Grid Control Plane and Kubernetes Services of type LoadBalancer: NSX Advanced Load Balancer Essentials (formerly Avi Load Balancer). NSX Advanced Load Balancer Essentials is a scale-out load balancer. The data path for users accessing the VIPs goes through a set of Service Engines that automatically scale out as workloads increase.

vSphere with Tanzu - Private Registry Support

If you are using a container registry with self-signed or private CA-signed certificates, these can now be used with TKG clusters.

vSphere with Tanzu - Advanced security for container-based workloads in vSphere with Tanzu on AMD

For customers interested in running containers with as much security in place as possible, Confidential Containers provides full and complete register and memory isolation and encryption from Pod to Pod and Hypervisor to Pod.

  • Builds on vSphere’s industry-leading, easy-to-enable support for AMD SEV-ES data protections on 2nd & 3rd generation AMD EPYC CPUs
  • Each Pod is uniquely encrypted to protect applications and data in use within CPU and memory
  • Enabled with standard Kubernetes YAML annotation

Artificial Intelligence & Machine Learning

vSphere and NVIDIA. The new Ampere family of NVIDIA GPUs is supported on vSphere 7U2. This is part of a bigger effort between the two companies to build a full stack AI/ML offering for customers.

  • Support for new NVIDIA Ampere family of GPUs
    • In the new Ampere family of GPUs, the A100 GPU is the new high-end offering. Previously the high-end GPU was the V100 – the A100 is about double the performance of the V100. 
  • Multi-Instance GPU (MIG) improves physical isolation between VMs & workloads
    • You can think of MIG as spatial separation, as opposed to the older form of vGPU, which used time-slicing to separate one VM from another on the GPU. MIG is used through a familiar vGPU profile assigned to the VM. You first enable MIG at the vSphere host level with one simple command: "nvidia-smi -i 0 -mig 1". This requires SR-IOV to be switched on in the BIOS (via the iDRAC on a Dell server, for example).
  • Performance enhancements with GPUDirect & Address Translation Service in the hypervisor

Operational Enhancements

VMware vSphere Lifecycle Manager - support for Tanzu & NSX-T

  • vSphere Lifecycle Manager now handles vSphere with Tanzu “supervisor” cluster lifecycle operations
  • Uses declarative model for host management

VMware vSphere Lifecycle Manager Desired Image Seeding

Extract an image from an existing host

ESXi Suspend-to-Memory

Suspend to Memory introduces a new option to help reduce the overall ESXi host upgrade time.

  • Depends on Quick Boot
  • New option to suspend the VM state to memory during upgrades
  • Options defined in the Host Remediation Settings
  • Adds flexibility and reduces upgrade time

Availability & Efficiency

vSphere HA support for Persistent Memory Workloads

  • Use vSphere HA to automatically restart workloads with PMEM
  • Admission Control ensures NVDIMM failover capacity
  • Can be enabled with VM Hardware 19

Note: By default, vSphere HA will not attempt to restart a virtual machine using NVDIMM on another host. If you allow HA to fail over the virtual machine on host failure, the virtual machine is restarted on another host with a new, empty NVDIMM.

VMware vMotion Auto Scale

vSphere 7 U2 automatically tunes vMotion to the available network bandwidth, giving faster live migrations, faster outage avoidance, and less time spent on maintenance.

  • Faster live migration on 25, 40, and 100 GbE networks means faster outage avoidance and less time spent on maintenance
  • One vMotion stream capable of processing 15 Gbps+
  • vMotion automatically scales the number of streams to the available bandwidth
  • No more manual tuning to get the most from your network
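
To get a feel for the numbers above, here is a minimal back-of-the-envelope sketch. It assumes roughly 15 Gbps per stream (the figure from the bullet list) and simply divides the link speed by that value. The function name and the rounding rule are my own illustration and not the actual vMotion auto-scaling algorithm.

```python
import math

# Assumption: one vMotion stream can process about 15 Gbps (from the list above).
STREAM_GBPS = 15.0


def estimate_streams(link_gbps: float, per_stream_gbps: float = STREAM_GBPS) -> int:
    """Roughly estimate how many streams are needed to fill a link.

    Illustrative arithmetic only; the real auto-scaling logic is internal to ESXi.
    """
    if link_gbps <= 0 or per_stream_gbps <= 0:
        raise ValueError("bandwidth values must be positive")
    # At least one stream is always needed, even on slower links.
    return max(1, math.ceil(link_gbps / per_stream_gbps))


if __name__ == "__main__":
    for nic in (10, 25, 40, 100):
        print(f"{nic} GbE -> about {estimate_streams(nic)} stream(s)")
```

Under this simple model a single stream cannot fill a 25 GbE or faster link, which is why automatic scaling matters on those networks.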

AMD optimizations

As customers' trust in AMD increases, so does the performance of ESXi on modern AMD processors.

  • Optimized scheduler for AMD EPYC architecture
  • Better load balancing and cache locality
  • Enormous performance gains

Reduced I/O Jitter for Latency-sensitive Workloads

Under-the-hood kernel improvements in vSphere 7U2 allow for significantly improved I/O latency for virtual Telco 5G Radio Access Network (vRAN) deployments.

  • Eliminate Jitter for Telco 5G Deployments
  • Significantly Improve I/O Latency
  • Reduce NIC Passthrough Interrupts

Security & Compliance

ESXi Key Persistence

ESXi Key Persistence helps eliminate dependency loops and creates options for encryption without the traditional infrastructure. It’s the ability to use a Trusted Platform Module, or TPM, on a host to store secrets. A TPM is a secure enclave for a server, and we strongly recommend customers install them in all of their servers because they’re an inexpensive way to get a lot of advanced security.

  • Helps Eliminate Dependencies
  • Enabled via Hardware TPM
  • Encryption Without vCenter Server

VMware vSphere Native Key Provider 

vSphere Native Key Provider puts data-at-rest protections in reach for all customers.

  • Easily enable vSAN Encryption, VM Encryption, and vTPM
  • Key provider integrated in vCenter Server & clustered ESXi hosts
  • Works with ESXi Key Persistence to eliminate dependencies
  • Adds flexible and easy-to-use options for advanced data-at-rest security
 vSphere has some pretty heavy-duty data-at-rest protections, like vSAN Encryption, VM encryption, and virtual TPMs for workloads. One of the gotchas there is that customers need a third-party key provider to enable those features, traditionally known as a key management service or KMS. There are inexpensive KMS options out there but they add significant complexity to operations. In fact, complexity has been a real deterrent to using these features… until now!

Storage

iSCSI path limits
 
ESXi has had a disparity in path limits between iSCSI and Fibre Channel: 32 paths for FC but only 8 (8!) paths for iSCSI. As of ESXi 7.0 U2, the iSCSI limit is also 32 paths.
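
To see why the old limit was painful, here is a small sketch. It uses the common simplification that the number of paths to a device equals host initiator ports multiplied by reachable target ports, and it assumes the limits quoted above apply per device. The function names are hypothetical and only illustrate the arithmetic.

```python
# Limits quoted in the text above (assumed here to apply per storage device).
OLD_ISCSI_PATH_LIMIT = 8
NEW_ISCSI_PATH_LIMIT = 32


def path_count(initiator_ports: int, target_ports: int) -> int:
    """Simplified model: every initiator port can reach every target port."""
    return initiator_ports * target_ports


def fits(paths: int, limit: int) -> bool:
    """Return True if the path count stays within the given limit."""
    return paths <= limit


if __name__ == "__main__":
    # Example: 2 bound VMkernel ports and an array exposing 8 target portals.
    paths = path_count(2, 8)
    print(paths, "paths; fits old limit:", fits(paths, OLD_ISCSI_PATH_LIMIT))
    print(paths, "paths; fits new limit:", fits(paths, NEW_ISCSI_PATH_LIMIT))
```

In this example, a fairly ordinary dual-port host against an eight-portal array already needs 16 paths, which exceeds the old limit but fits comfortably within the new one.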

File Repository on a vVol Datastore

VMware added a new feature that supports creating a custom-size config vVol. While this was technically possible in earlier releases, it was not supported.

VMware Tools and Guest OS

Virtual Trusted Platform Module (vTPM) support on Linux & Windows

  • Easily enable in-guest security requiring TPM support
  • vTPM available for modern versions of Microsoft Windows and select Linux distributions
  • Does not require physical TPM
  • Requires VM Encryption, easy with Native Key Provider!

VMware Tools Guest Content Distribution

Guest Store enables customers to distribute various types of content to VMs, like an internal CDN system.

  • Distribute content “like an internal CDN”
  • Granular control over participation
  • Flexibility to choose content

VMware Time Provider Plugin for Precision Time on Windows

With the new vmwTimeProvider plugin shipped with VMware Tools, guests can synchronize directly with hosts over a low-jitter channel.

  • VMware Tools plugin to synchronize guest clocks with Windows Time Service
  • Added via custom install option in VMware Tools
  • Precision Clock device available in VM Hardware 18+
  • Supported on Windows 10 and Windows Server 2016+
  • High quality alternative to traditional time sources like NTP or Active Directory

Conclusion

vSphere 7 Update 2 is a nice evolution of the vSphere platform. If you ask me what the most interesting feature in this release is, I would probably answer VMware vSphere Native Key Provider, because it has a positive impact on manageability and simplifies the overall architecture. The second one is VMware vMotion Auto Scale, which reduces operational time during ESXi maintenance operations in environments that have already adopted 25+ Gb NICs.


 




What's new in vSAN 7 Update 2

vSAN 7 Update 2 has been released. Let's look at what's new in vSAN. It is a very interesting release if you ask me.

HCI Mesh 

HCI Mesh was available in the previous release as well, but Update 2 brings significant technical and licensing enhancements. Let's go through these enhancements one by one.

  • Mount a remote vSAN datastore from a non-vSAN based vSphere cluster (aka HCI Mesh Compute Cluster). We call it Disaggregated HCI. This is huge because it does not require vSAN licensing in the vSphere cluster that mounts the remote vSAN datastore.
  • Improved scalability. Up to 128 hosts connected to a remote vSAN datastore.
  • Improved Storage policies and their integration with vSAN datastores. Storage policies now support
    • Dedup & Compression or Compression-only
    • Data-at-rest encryption
    • Hybrid or all-flash
Disaggregated HCI

Cloud Native Applications

Based on my experience, Kubernetes and DevOps madness is the biggest use case for vSAN 7 nowadays. Do not get me wrong, I like agile and DevOps in general, as I have used these principles for the last 25 years; however, Enterprise DevOps adoption is a little bit "specific" :-) Nevertheless, time pressure drives DevOps and Agile methodologies even into inflexible enterprises, so the VMware hyper-converged stack supports them very nicely from a technical point of view.

  • vSAN 7 U2 supports expansion of persistent volumes without the need to take them offline, eliminating any interruption. This has a positive impact on the flexibility of container-powered workloads.

Native File Services

  • Support for vSAN stretched clusters and 2-node topologies
  • Support of Data-in-Transit Encryption and UNMAP
  • Snapshots for file services volumes (via API only), including the ability to extract the differences (delta) between two snapshots. This is a new snapshotting mechanism for point-in-time recovery of files. This mechanism allows our backup partners to build applications to protect file shares in new and interesting ways.
  • Improved scale, performance and efficiency
    • Optimized metadata handling and data path for more efficient transactions, especially with small files
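
The "delta between two snapshots" idea from the list above can be sketched conceptually as follows. This is not the vSAN File Services API. It assumes, purely for illustration, that each snapshot is represented as a mapping of file path to content hash, and the function name is hypothetical.

```python
from typing import Dict, List


def snapshot_delta(old: Dict[str, str], new: Dict[str, str]) -> Dict[str, List[str]]:
    """Compare two snapshots given as {path: content_hash} mappings.

    Hypothetical representation; a backup tool would get this data from the API.
    """
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    modified = sorted(p for p in new if p in old and new[p] != old[p])
    return {"added": added, "removed": removed, "modified": modified}


if __name__ == "__main__":
    monday = {"/share/a.txt": "h1", "/share/b.txt": "h2"}
    tuesday = {"/share/a.txt": "h1", "/share/b.txt": "h9", "/share/c.txt": "h3"}
    print(snapshot_delta(monday, tuesday))
```

A backup application only needs to copy the added and modified files, which is what makes incremental protection of file shares efficient.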

vSAN File Services in stretched and 2-node clusters
 

Performance : vSAN over RDMA

Distributed storage systems like vSAN rely heavily on resilient network connectivity, performance, and efficiency. vSAN 7 U2 now supports clusters configured for RDMA-based networking, RoCE v2 specifically. Transmitting native vSAN protocols directly over RDMA can offer a level of efficiency that is difficult to achieve with traditional TCP-based connectivity over Ethernet. The support for RDMA also means that vSAN hosts have the intelligence to fall back to TCP connectivity in the event that RDMA is not supported on one of the hosts in a cluster. vSAN over RDMA is a great step toward supporting topologies that use fast, efficient delivery of east-west traffic over a commodity network. Below is the list of new RDMA-related features:

  • Efficient connectivity for vSAN clusters using RDMA
  • Improved CPU utilization and app performance for certain workloads
    • Sequential reads
    • Random mixed reads/writes
  • Supports RDMA over Converged Ethernet v2 (RoCE v2)
  • Automatic detection and handling of RDMA adapters

 

vSAN over RDMA improves performance and efficiency. Performance numbers are always tricky, but the slide below shows some.

vSAN over RDMA performance and efficiency improvements

Performance : Optimizations in vSAN 7 U2 

VMware continues to drive better performance with each new version of vSAN. Many of these gains are achieved by optimizing the hypervisor and storage stack for the very latest and greatest hardware, which is precisely how VMware was able to deliver improved performance with vSAN 7 U2. First, changes were made in the hypervisor to better accommodate the architecture of AMD-based chipsets. Improvements were also made in vSAN (regardless of chipset used) to help reduce CPU usage during I/O activity for objects using the RAID-5 or RAID-6 data placement scheme; this will be especially beneficial for workloads issuing large sequential writes. vSAN 7 U2 also includes enhancements to help I/O commit to the buffer tier with higher degrees of parallelization. All of these add up to fewer impediments as I/O traverses the storage stack, and a reduction of CPU utilization and storage latency. Let's go through them one by one:

  • Improved NUMA awareness for AMD EPYC Rome processors
  • Improved performance when using RAID-5/6 erasure codes
    • Improved large sequential writes
    • Reduced CPU usage
  • Improved CPU efficiency writing data to cache/buffer tier
    • Improved small random I/O
Performance optimization for AMD platform


Security: vSphere Native Key Provider in vSAN

I like this one, as it is another operational and architectural simplification leading to better infrastructure security. Security through data encryption is top of mind for many VMware customers. Encrypting data at rest is often a part of this security effort. With vSphere and vSAN 7 U2, VMware introduces support for the “Native Key Provider” feature, which can simplify key management for environments using encryption. For vSAN, the embedded KMS is ideal for Edge or 2-Node topologies, and is a great example of VMware’s approach to intrinsic security. So what is inside?

  • vSphere Native Key Provider enables “out of the box” data-at-rest protections 
  • Key provider integrated in vCenter Server & clustered ESXi hosts
  • Works with ESXi Key Persistence to eliminate dependencies
  • Adds flexible and easy-to-use options for advanced data-at-rest security

Manageability: Skyline Health Diagnostic (SHD) tool for VMware customers

This is another great evolution of VMware Skyline. The Skyline Health Diagnostics tool is a self-service tool that brings some of the benefits of Skyline Health directly to an isolated environment. The tool is run by an administrator at whatever frequency they desire. It scans critical log bundles to detect issues, and gives notifications and recommendations for important issues along with their related KB articles. In a nutshell, it is

  • Ideal for isolated environments unable to use online Skyline Health
  • Self-service tool for customers
    • Uses latest signature library from VMware
    • Scans log bundles
    • Detects issues and gives KB recommendations
  • Proactive support of issues without contacting GSS
  • Improved support experience for opened cases with GSS
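
The general idea of scanning logs against a signature library can be sketched like this. It is a conceptual illustration only and not how the SHD tool is implemented. The signature names, regular expressions, log lines, and KB identifiers are all made-up placeholders.

```python
import re
from typing import Iterable, List, NamedTuple, Tuple


class Signature(NamedTuple):
    name: str     # hypothetical signature name
    pattern: str  # regular expression matched against each log line
    kb: str       # placeholder KB reference, not a real article number


# Made-up signature library for illustration.
SIGNATURES = [
    Signature("example-storage-latency", r"latency .* exceeded", "KB-PLACEHOLDER-1"),
    Signature("example-nic-link-down", r"link (is )?down", "KB-PLACEHOLDER-2"),
]


def scan(lines: Iterable[str],
         signatures: List[Signature] = SIGNATURES) -> List[Tuple[int, str, str]]:
    """Return (line number, signature name, KB reference) for every match."""
    findings = []
    for lineno, line in enumerate(lines, start=1):
        for sig in signatures:
            if re.search(sig.pattern, line, re.IGNORECASE):
                findings.append((lineno, sig.name, sig.kb))
    return findings


if __name__ == "__main__":
    sample_log = [
        "host1 vmnic0 link is down",
        "host1 everything looks fine",
        "host2 device latency 500ms exceeded threshold",
    ]
    for finding in scan(sample_log):
        print(finding)
```

Because the signature library is separate from the scanner, an isolated environment only needs an updated library file to detect newly known issues.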

Manageability: Extending vLCM compatibility

The vSphere Lifecycle Manager (vLCM) is VMware’s new lifecycle management platform built into the hypervisor, first introduced in vSphere 7. vSphere 7 U2 also includes support for environments running vSphere with Tanzu that use NSX-T for their network overlay. These enhancements are a great example of how vLCM is maturing into a robust platform for hypervisor management that benefits our customers and our OEM partners. So what's new in vSAN 7 Update 2?

  • New vendor Plugin* for select Hitachi UCP ReadyNode models
  • Update recommendations automatically refreshed after common change events
    • VMware image depot
    • Change in desired image
  • Support for vSphere with Tanzu using NSX-T

vCenter Server and cluster deployment wizards with vLCM integration

Deploying a new vSAN cluster has never been easier, thanks to the improvements introduced in this release. vSphere 7 U2 integrates vLCM into the “Easy Install” and “Cluster Quick Start” deployment wizards for deployments of new hosts and clusters from OEMs that support vLCM. OEM vendors now have all of the capabilities in place to help their users deploy new systems in a fast and fully compliant manner. The workflows for both new cluster creation and greenfield environments, where a vCenter Server appliance is bootstrapped onto a single host, have been updated so that administrators can reference a host for easy compliance through vLCM.

  • New vLCM image options found in multiple wizards:
    • Easy Install
    • Cluster Quick Start
  • Seed vLCM desired image through a reference host for new deployments
  • OEM Servers include:
    • OEM hypervisor ISO
    • vLCM desired state specs
    • vLCM image depot contents
    • VCSA installer
  • Supports VCSA CLI installer
Cluster deployment wizards with vLCM

Manageability: Fast Restarts of vSAN Hosts during Upgrades

vSAN 7 U2 provides better integration and coordination for hosts using Quick Boot to speed up the host update process.  By introducing a new “suspending the VMs to memory” option, and better integration with the Quick Boot workflow, the amount of data moved during a rolling upgrade is drastically reduced due to reduced VM migrations, and a smaller amount of resynchronization data.

  • Minimizes VM migrations during rolling upgrades using ‘Suspend to memory’ in Quick Boot
  • Restart hypervisor and vSAN while avoiding time-consuming hardware initialization
  • Faster restarts reduce resynchronization efforts
  • Complements host restart optimizations made in vSAN 7 U1

Recoverability : Enhanced Data Durability During Unplanned Events

vSAN 7 U2 makes a significant improvement in ensuring that the latest written data is saved redundantly in the event of an unplanned transient error or outage. When an unplanned outage occurs on a host, vSAN 7 U2 immediately writes all incremental updates to another host, in addition to the host holding the active object replica. This helps ensure the durability of the changed data if an additional outage occurs on the host holding the active object replica. This builds on the capability first introduced in vSAN 7 U1, which used this technique for planned maintenance events. These data durability improvements also have an additional benefit: they improve the time it takes to resynchronize data to a stale object.

  • Maintains latest data redundantly in the event of an unplanned transient error or outage
  • Latest writes quickly committed to additional host ensuring durability of new data
  • Efficient and fast resyncs to stale components on recovered or new host
  • More frequent checks for silent disk errors

Availability : vSAN support of vSphere Proactive High Availability (HA)

vSAN 7 U2 now supports vSphere Proactive HA, where the application state and any potential data stored can be proactively migrated to another host. 

  • Proactive response when vSAN host detects impending failure
    • Evacuates VMs
    • Migrates object data
  • Uses plug-in provided by participating OEM server vendors
  • Supports quarantine mode and maintenance mode
  • Increased application up-time

Availability : Integrated DRS awareness of Stretched Cluster configurations

Stretched clusters provide higher availability across two availability zones within metro distance (response time less than 5 ms). A stretched cluster configuration must account not only for a variety of failure scenarios but also for recovery conditions. vSAN 7 U2 introduces integration between data placement and DRS so that, after recovery from a failure condition, DRS keeps the VM state at the same site until data is fully resynchronized, which ensures that read operations do not traverse the inter-site link (ISL). Once data is fully resynchronized, DRS moves the VM state to the desired site in accordance with DRS rules. This improvement can dramatically reduce unnecessary read operations across the ISL and free up ISL resources to complete any resynchronizations after site recovery. Finally, vSAN 7 U2 also increases the maximum host count for a stretched cluster configuration, from 30 data hosts spread across two sites to 40 data hosts spread across two sites (not including the witness host appliance).

  • Prioritizes I/O read locality over any VM site affinity rules
  • Instructs DRS not to migrate VMs to desired site until resyncs complete
  • Reduces I/O across ISL in recovery conditions
    • Improve read performance
    • Free up ISL for resyncs to regain compliance
  • Support for larger stretched clusters:  20+20+1
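
The placement rule described above can be reduced to a very small decision sketch. This is my simplified reading of the behavior, with hypothetical function and parameter names. The real logic lives inside DRS and vSAN and considers much more than these three inputs.

```python
def target_site(current_site: str, preferred_site: str, resync_complete: bool) -> str:
    """Decide where a VM should run after a site has recovered.

    Simplified illustration: read locality wins over site affinity
    until the data has been fully resynchronized.
    """
    if not resync_complete:
        # Stay where the up-to-date data is, so reads do not cross the ISL.
        return current_site
    # Data is in sync again, so the DRS affinity rule can be honored.
    return preferred_site


if __name__ == "__main__":
    print(target_site("site-B", "site-A", resync_complete=False))
    print(target_site("site-B", "site-A", resync_complete=True))
```

The point of the rule is ordering: the migration back to the preferred site is not cancelled, only deferred until the resync has finished.
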

Manageability : Proactive capacity monitoring

Capacity monitoring and alerting see some great improvements with vSAN 7 U2 that make it easier for administrators to understand capacity limits and oversubscription ratios. In this release, vSAN 7 U2 introduces the ability for the administrator to see how oversubscribed capacity is for the cluster. vSAN is inherently thin provisioned, meaning that only the used space of an object is counted against the capacity usage. Oversubscription visibility helps the administrator understand how much storage has been allocated, so they can easily see the capacity required in a worst-case scenario and adhere to their own sizing practices. vSAN 7 U2 also provides customizable warning and error alert thresholds directly in the Capacity Management UI in vCenter Server. Redundant alerting for capacity thresholds has also been eliminated to help clarify and simplify the condition reported to the administrator.

  • Capacity estimations for fully allocated capacity of thin provisioned objects
    • Easily see oversubscription
    • Factors in storage policy used
  • Customizable alarm thresholds for vSAN cluster capacity
  • Eliminated redundant alerting
Proactive capacity monitoring - Thinprovisioning awareness

Proactive capacity monitoring - Reserved Capacity Alerting
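
The worst-case arithmetic behind oversubscription can be sketched as follows. The policy names are hypothetical labels, and the multipliers are the commonly cited space overheads for mirroring and erasure coding, used here as assumptions. This is not how vCenter Server computes its figures internally.

```python
from typing import Iterable, Tuple

# Assumed raw-capacity multipliers per storage policy (hypothetical labels).
POLICY_MULTIPLIER = {
    "RAID1_FTT1": 2.0,      # two full mirror copies
    "RAID5_FTT1": 4 / 3,    # 3 data + 1 parity
    "RAID6_FTT2": 1.5,      # 4 data + 2 parity
}


def oversubscription_ratio(vmdks: Iterable[Tuple[float, str]],
                           raw_capacity_gb: float) -> float:
    """Ratio of fully allocated (worst-case) capacity to raw cluster capacity.

    vmdks is an iterable of (provisioned_size_gb, policy_name) tuples.
    """
    if raw_capacity_gb <= 0:
        raise ValueError("raw capacity must be positive")
    worst_case_gb = sum(size * POLICY_MULTIPLIER[policy] for size, policy in vmdks)
    return worst_case_gb / raw_capacity_gb


if __name__ == "__main__":
    disks = [(1000, "RAID1_FTT1"), (600, "RAID6_FTT2")]
    ratio = oversubscription_ratio(disks, raw_capacity_gb=2000)
    print(f"oversubscription ratio: {ratio:.2f}")
```

A ratio above 1.0 means that the cluster could not hold all thin-provisioned objects if every one of them grew to its full provisioned size.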

Manageability : Enhanced network monitoring

vSAN 7 U2 introduces several new metrics and health checks to provide better visibility into the switch fabric that connects the vSAN hosts.

  • Advanced networking metrics integrated into vSAN performance monitoring and alerts
  • New network statistics with customizable alerts
    • TCP/IP layer
    • Physical network layer
  • Augments existing networking metrics
  • Integrated into vSAN Support Insight

Manageability : Health check history and correlation

vSAN 7 U2 introduces new enhancements to help provide context and insight into the sophisticated collection of alerts found in vSAN.
  • View a timeline of discrete error conditions
  • Gain insight into transient conditions that are difficult to track
  • Easily enable/disable based on need


Manageability : vSAN performance ‘top contributors’

vSAN 7 U2 introduces an easy way to determine the heavily utilized VMs, sometimes referred to as noisy neighbors.
  • Easily determine top contributors when experiencing performance issues
    • VMs
    • Disk groups
  • Quickly find potential noisy neighbors and their impact on resources
  • Set time period, and view by latency, throughput, or IOPS
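
Conceptually, a "top contributors" view is a ranking of entities by a chosen metric within a time window. The sketch below shows that idea on made-up sample data. The field names ("vm", "t", "iops", "latency_ms") and the function are hypothetical and unrelated to the actual vSAN performance service.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def top_contributors(samples: Iterable[Dict], metric: str,
                     start: int, end: int, n: int = 3) -> List[Tuple[str, float]]:
    """Rank VMs by the average of `metric` over samples with start <= t <= end."""
    values = defaultdict(list)
    for sample in samples:
        if start <= sample["t"] <= end:
            values[sample["vm"]].append(sample[metric])
    averages = {vm: sum(v) / len(v) for vm, v in values.items()}
    # Highest average first; keep only the top n entries.
    return sorted(averages.items(), key=lambda kv: kv[1], reverse=True)[:n]


if __name__ == "__main__":
    data = [
        {"vm": "vm-a", "t": 1, "iops": 100, "latency_ms": 2.0},
        {"vm": "vm-b", "t": 1, "iops": 900, "latency_ms": 8.0},
        {"vm": "vm-a", "t": 2, "iops": 300, "latency_ms": 4.0},
        {"vm": "vm-c", "t": 9, "iops": 5000, "latency_ms": 1.0},
    ]
    print(top_contributors(data, "iops", start=1, end=2, n=2))
```

Switching the metric argument between IOPS, throughput, and latency over the same window is what helps to separate a genuinely noisy neighbor from a VM that is merely busy.
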
 

Conclusion

As you can see, vSAN 7 Update 2 brings significant improvements which are very useful for modern datacenters. I personally like it and am looking forward to seeing these enhancements, at least in our labs and pre-production environments, to get hands-on experience.

 

Wednesday, February 17, 2021

VMware Short URLs

VMware has a lot of products and technologies. Here are a few interesting URL shortcuts to quickly get resources for a particular product, technology, or other information.

VMware HCL and Interop

https://vmware.com/go/hcl - VMware Compatibility Guide

https://vmwa.re/vsanhclc or https://vmware.com/go/vsanvcg - VMware Compatibility Guide vSAN 

https://vmware.com/go/interop - VMware Product Interoperability Matrices

VMware Partners

https://www.vmware.com/go/partnerconnect - VMware Partner Connect

VMware Customers

https://www.vmware.com/go/myvmware - My VMware Overview
 
https://www.vmware.com/go/customerconnect - Customer Connect Overview

https://www.vmware.com/go/patch - Customer Connect, where you can download VMware bits

http://vmware.com/go/skyline - VMware Skyline

http://vmware.com/go/skyline/download - Download VMware Skyline

VMware vSphere

http://vmware.com/go/vsphere - VMware vSphere

VMware CLIs

http://vmware.com/go/dcli - VMware Data Center CLI

VMware Software-Defined Networking and Security

https://vmware.com/go/vcn - Virtual Cloud Network

https://vmware.com/go/nsx - VMware NSX Data Center

https://vmware.com/go/vmware_hcx - Download VMware HCX

VVD

https://vmware.com/go/vvd-diagrams - Diagrams for VMware Validated Design

https://vmware.com/go/vvd-stencils - VMware Stencils for Visio and OmniGraffle

http://vmware.com/go/vvd-community - VVD Community

http://www.vmware.com/go/vvd-sddc - Download VMware Validated Design for Software-Defined Data Center

VCF

https://vmware.com/go/vcfrc - VMware Cloud Foundation Resource Center

http://vmware.com/go/cloudfoundation - VMware Cloud Foundation

http://vmware.com/go/cloudfoundation-community - VMware Cloud Foundation Discussions

http://vmware.com/go/cloudfoundation-docs - VMware Cloud Foundation Documentation

Tanzu Kubernetes Grid (TKG)

http://vmware.com/go/get-tkg - Download VMware Tanzu Kubernetes Grid

Hope this helps at least one person in the VMware community.

Sunday, February 14, 2021

Top Ten Things VMware TAM should have on his mind and use on a daily basis

Readers may or may not know that I work for VMware as a TAM. For those who do not know, TAM stands for Technical Account Manager. The VMware TAM is a billable consulting role available to VMware customers who want an on-site, dedicated technical advisor/consultant for long-term cooperation. The VMware TAM organization historically belonged to VMware PSO (Professional Services Organization); however, it has recently been moved under the Customer Success Organization, which makes perfect sense if you ask me, because customer success is the key goal of the TAM role.

How does a TAM engagement work? It is pretty easy. VMware Technical Account Managers have 5 slots (days) per week, which can be consumed by one or many VMware customers. There are Tier 1, Tier 2, and Tier 3 offerings: the Tier 1 TAM service includes one day per week for the customer, Tier 2 includes 2.5 days per week, and a Tier 3 TAM is fully dedicated.

The TAM job role is very flexible and customizable based on specific customer demand. I like the figure below, describing TAM Service standard Deliverables and On-Demand Activities.


A VMware TAM delivers standard deliverables like
  • Kickoff Meeting and TAM Business Reviews to continuously align with customer expectations
  • Standard Analytics and Reporting, including a report of the customer estate in terms of VMware products and technologies (we call it CI.Next) and a Best Practices Review report highlighting a few violations of the practices recommended by VMware Health Check
  • Technical Advisory Service about VMware Product Releases, VMware Security Advisories, Specific TAM Customer Technical Webinars, Events, etc.
However, the most interesting part of the VMware TAM job role, at least for me, is the On-Demand Activities, including
  • Technical Enablements, DeepDives, Roadmaps, etc.
  • Planning and Conceptual Design of Technical Solutions and Transformation Projects
  • Problem Management and Design Troubleshooting
  • Product Feature Request management
  • Etc.

And this is the reason why I love my job: I like technical planning, designing, coordinating technical implementations, and validating and testing implementations before they are handed over to production. I also like to communicate with operations teams and, after a while, reevaluate the implemented design and take the operational feedback back to architecture and engineering for continuous solution improvement.
That’s the reason why the TAM role is my dream job at one of the best and most impactful IT companies in the world.

During the last one-on-one meeting with my manager, I was asked to write down the top ten things a VMware TAM should have on his mind and use on a daily basis in 2021. To be honest, the rules I will list are not specific to the year 2021 but are very general, applying to any other year, and are also easily reusable for any other human activity.

After 25 years in the IT industry, 15 years in Professional Consulting, and 5 years as a VMware TAM, I immodestly believe the 10 things below are the most important for being a valuable VMware TAM for my customers. These are just my best practices, and it is good to know that no best practices are written in stone, so your opinion may vary. Anyway, take it or leave it. Here we go.

#1 Top Bottom approach

I use the Top Bottom approach to be able to split any project or solution into Conceptual, Logical, and Physical layers. I use Abstraction and Generalization. While abstraction reduces complexity by hiding irrelevant detail, generalization reduces complexity by replacing multiple entities that perform similar functions with a single construct. Do not forget that the complexity of modern IT systems can be insane. Check the video “Powers of Ten” to get a feel for the complexity of other systems and how it becomes visible at various levels.

#2 Correct Expectations

I always set correct expectations. Discovering the customer’s requirements, constraints, and specific use cases before going into any details or specific solutions is the key to customer success.

#3 Communication

Open and honest communication is the key to any long-term successful relationship. As a TAM, I have to be the communicator who can break barriers between various customer silos and teams, like VMware, compute, storage, network, security, application, developers, DevOps, you name it. They have to trust you; otherwise, you cannot succeed.

#4 Assumptions

I do not assume. Sometimes we need some assumptions to not be stuck and move forward, however, we should validate those assumptions as soon as possible, because false assumptions lead to risks. And one of our primary goals as TAMs is to mitigate risks for our customers. 

#5 Digital Transformation

I leverage online and digital platforms. Nothing compares to personal meetings and whiteboarding; however, tools like Zoom, Miro.com, and Monday.com increase efficiency and help with communication, especially in COVID-19 times. This is probably the only point related specifically to the year 2021, as COVID-19 challenges are staying with us for some time.

#6 Agile Methodologies

I use an agile consulting approach. Leveraging tools like Miro.com, Monday.com, etc. gives me a toolbox to apply agile software methodologies to technical infrastructure architecture design. In the past, when I worked as a software developer, software engineer, and software architect, I was a follower of Extreme Programming. I apply the same or similar concepts and methods to Infrastructure Architecture Design and Consulting. This approach helps me to keep up with the speed of IT and high business expectations.

#7 Documentation

I document everything. The documentation is essential. If it’s not written down, it doesn’t exist! I like "Eleven Rules of Design Documentation" by Greg Ferro.

#8 Resource Mobilization

I leverage resources. Internal and External. As TAMs, we have access to a lot of internal resources (GSS, Engineering, Product Management, Technical Marketing, etc.) which we can leverage for our TAM customers. We can also leverage external resources like partners, other vendors from the broader VMware ecosystem, etc. However, we should use resources efficiently. Do not forget, all human resources are shared, thus limited. And time is the most valuable resource, at least for humans, therefore Time Management is important. Anyway, resource mobilization is the true value of the VMware TAM program, therefore we must know how to leverage these resources. 

#9 Customer Advocacy

As a TAM, I work for VMware but also for TAM customers. Therefore, I do customer advocacy within VMware and VMware advocacy within the Customer organization. This is again about the art of communication.

#10 Technical Expertise

Last but not least, I must have technical expertise and competency. I’m a Technical Account Manager, therefore I try to have deep technical expertise in at least one VMware area and broader technical proficiency in a few other areas. This approach is often called Full Stack Engineering. I’m very aware of the fact that expertise and competency are very tricky and subjective. It is worth understanding the Dunning-Kruger effect, which describes the correlation between competence and confidence. In other words, I want to have real competence and not only false confidence about my competence. If I do not feel confident in some area, I honestly admit it and try to find another resource (see rule #8). The best approach to get and validate my competency and expertise is to continuously learn and validate it by VMware advanced certifications.

Hope this write-up will be useful for at least one person in the VMware TAM Organization.

Thursday, February 04, 2021

Back to basics - MTU & IP defragmentation

This is just a short blog post as it can be useful for other full-stack (compute/storage/network) infrastructure engineers.

I have just had a call from my customer with the following problem symptom. 

Symptom:

When ESXi (in ROBO) is connected to vCenter (in Datacenter), TCP/IP communication overloads the 60 Mbps network link. In such a scenario, IP packets are fragmented and heavy packet retransmission is observed.

Design drawing:

Hypothesis:

IP fragmentation is happening in the physical network and the MTU is lower than 1280 Bytes.

Planned test:

Find the smallest MTU in the end-2-end network path between ESXi and vCenter

vmkping -s 1472 -d VCENTER-IP

Decrease the -s parameter value until the ping is successful. This is how to find the smallest MTU in the IP network path.

Back to basics

IP fragmentation is an Internet Protocol (IP) process that breaks packets into smaller pieces (fragments), so that the resulting pieces can pass through a link with a smaller maximum transmission unit (MTU) than the original packet size. The fragments are reassembled by the receiving host. [source]

The vmkping command has some parameters you should know and use in this case:

-s to set the payload size

Syntax: vmkping -s size IP-address

With the -s parameter you can define the size of the ICMP payload. If you have defined an MTU size of e.g. 1500 bytes and use this size in your vmkping command, you may get a “Message too long” error. This happens because ICMP needs 8 bytes for its ICMP header and 20 bytes for the IP header:

The size you need to use in your command will be:

1500 (MTU size) – 8 (ICMP header) – 20 (IP header) = 1472 bytes for ICMP payload
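The same arithmetic can be written as a small helper. This is only a sketch: the function names are made up for illustration, and it assumes plain IPv4 without IP options (20-byte header) and an ICMP echo request (8-byte header), as in the calculation above.

```python
# Header sizes used in the calculation above (IPv4 without options, ICMP echo).
IPV4_HEADER_BYTES = 20
ICMP_HEADER_BYTES = 8


def icmp_payload_for_mtu(mtu: int) -> int:
    """Return the largest ICMP payload (the vmkping -s value) that fits into one packet of size mtu."""
    payload = mtu - IPV4_HEADER_BYTES - ICMP_HEADER_BYTES
    if payload < 0:
        raise ValueError(f"MTU {mtu} is too small to carry the IP and ICMP headers")
    return payload


def mtu_for_icmp_payload(payload: int) -> int:
    """Return the packet size on the wire (IP level) for a given ICMP payload size."""
    return payload + IPV4_HEADER_BYTES + ICMP_HEADER_BYTES


if __name__ == "__main__":
    for mtu in (1500, 9000):
        print(f"MTU {mtu} -> vmkping -s {icmp_payload_for_mtu(mtu)}")
```

For a standard 1500-byte MTU this gives 1472, and for a 9000-byte jumbo-frame MTU it gives 8972.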

-d to disable IP fragmentation

Syntax: vmkping -d IP-address

Use the command “vmkping -d -s 1472 IP-address” to test your end-2-end network path.

Decrease -s parameter until the ping is successful.
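Decreasing the size by hand works, but the search can also be done as a binary search. The sketch below is an illustration only: probe is a hypothetical callback that must return True when a ping with fragmentation disabled and the given payload size gets a reply (on ESXi it could wrap the vmkping -d -s command shown above). It also assumes that if one size passes, all smaller sizes pass as well.

```python
from typing import Callable, Optional

OVERHEAD_BYTES = 28  # 20-byte IPv4 header + 8-byte ICMP header


def find_path_mtu(probe: Callable[[int], bool],
                  low: int = 0,
                  high: int = 1472) -> Optional[int]:
    """Binary-search the largest payload accepted by probe and return the path MTU.

    probe(size) is a hypothetical callback: True means a don't-fragment ping
    with `size` bytes of ICMP payload was answered.
    Returns None when even the smallest payload fails (host unreachable).
    """
    if not probe(low):
        return None
    # Invariant: probe(low) is True; sizes above `high` are assumed to fail.
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always makes progress
        if probe(mid):
            low = mid
        else:
            high = mid - 1
    return low + OVERHEAD_BYTES


if __name__ == "__main__":
    # Simulated network path with an MTU of 1400 bytes, instead of a real ping.
    def simulated_probe(size: int) -> bool:
        return size + OVERHEAD_BYTES <= 1400

    print("Path MTU:", find_path_mtu(simulated_probe))
```

With the default bounds the search needs about 11 probes instead of stepping down one size at a time.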

Monday, January 11, 2021

Server rack design and capacity planning

Our local VMware SE team got a great Christmas present from the regional Intel BU: four rack servers with very nice technical specifications and the latest Intel Optane technology.

Here is the server technical spec: 

Node Configuration | Description | Quantity
CPU | Intel Platinum 8280L (28 cores, max memory 4.5TB) | 2
DDR4 Memory | 768GB DDR4 DRAM RDIMM | 12 x 64GB
Intel Persistent Memory | 3TB Intel Persistent Memory | 12 x 256GB
Caching Tier | Intel Optane SSD DC P4800X Series (750GB, 2.5in PCIe* x4, 3D XPoint™) | 2
Capacity Tier | Intel SSD DC P4510 Series (4.0TB, 2.5in PCIe* 3.1 x4, 3D2, TLC) | 4
Networking + transceivers, cables | Intel® Ethernet Network Adapter XXV710-DA2 (25G, 2 ports) | 1


These servers are vSAN Ready and the local VMware team is planning to use them for demonstration purposes of VMware SDDC (vSphere, vSAN, NSX, vRealize), therefore VMware Cloud Foundation (VCF) is a very logical choice.

Anyway, even a Software-Defined Data Center requires power and cooling, so I've been asked to help with server rack design and proper power capacity planning. To be honest, server rack planning and design is not rocket science. It is just simple math & elementary physics; however, you have to know the power consumption of each computer component. I did some research, and here is the math exercise with the power consumption of each component:

  • CPU - 2x CPU Intel Platinum 8280L (110 W Idle, 150 W Computational, 360 W Peak load)
    • Estimation: 2x 150 W = 300 W
  • RAM - 12x 64 GB DDR4 DRAM RDIMM (768 GB)
    • Estimation: 12x 24 W = 288 W
  • Persistent RAM - 12x 256 GB (3 TB) Intel Persistent Memory
    • Estimation: 12x 15 W = 180 W
  • vSAN Caching Tier - 2x Intel Optane SSD DC P4800X 750 GB
    • Estimation: 2x 18 W = 36 W
  • vSAN Capacity Tier - 4x Intel SSD DC P4510 Series 4 TB
    • Estimation: 4x 16 W = 64 W
  • NIC - 1x Intel® Ethernet Network Adapter XXV710-DA2 (25G, 2 ports)
    • Estimation: 15 W

If we sum the power consumption above, we will get 883 Watt per single server.  
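The sum is easy to reproduce and to adapt to a different configuration. The snippet below is just a sketch of the arithmetic: the per-unit wattages are the estimates from the list above, and the data structure and function name are made up for illustration.

```python
# (component, quantity, estimated watts per unit), taken from the estimates above
COMPONENTS = [
    ("CPU Intel Platinum 8280L", 2, 150),
    ("64 GB DDR4 DRAM RDIMM", 12, 24),
    ("256 GB Intel Persistent Memory", 12, 15),
    ("Intel Optane SSD DC P4800X 750 GB", 2, 18),
    ("Intel SSD DC P4510 4 TB", 4, 16),
    ("Intel Ethernet Network Adapter XXV710-DA2", 1, 15),
]


def server_power_watts(components) -> int:
    """Sum quantity x watts-per-unit over all components of one server."""
    return sum(quantity * watts for _, quantity, watts in components)


if __name__ == "__main__":
    for name, quantity, watts in COMPONENTS:
        print(f"{quantity:>2} x {watts:>3} W = {quantity * watts:>3} W  {name}")
    print(f"Total per server: {server_power_watts(COMPONENTS)} W")
```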

To validate the estimation above, I used the DellEMC Enterprise Infrastructure Planning Tool available at http://dell-ui-eipt.azurewebsites.net/#/, where you can place infrastructure devices and get the Power and Heating calculations. You can see the IDLE and COMPUTATIONAL consumptions below.

Idle Power Consumption


Computational Power Consumption

POWER CONSUMPTION
Based on the above calculations, the server power consumption ranges between 300 and 900 Watts, so it is good to plan a 1 kW power budget per server, which in our case would be 4 kW / 17.4 Amp per single power branch. That means one 32 Amp PDU per branch just for 4 servers.

For a full 45U Rack with 21 servers, it would be 21 kW / 91.3 Amp, which would mean 3x 32 Amp PDUs per single branch in the rack.
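The current and PDU numbers follow from I = P / U. The sketch below reproduces them under two assumptions that are not stated in the text: a single-phase 230 V supply (which is what 4 kW / 17.4 Amp implies) and no safety derating of the 32 Amp PDU rating. The function names are illustrative only.

```python
import math

LINE_VOLTAGE = 230.0     # assumption: single-phase 230 V, implied by 4 kW / 17.4 A
PDU_RATING_AMPS = 32.0   # PDU rating used in the text; no derating margin applied


def branch_current_amps(servers: int,
                        watts_per_server: float = 1000.0,
                        voltage: float = LINE_VOLTAGE) -> float:
    """Current drawn on one power branch: I = P / U."""
    return servers * watts_per_server / voltage


def pdus_needed(amps: float, pdu_rating: float = PDU_RATING_AMPS) -> int:
    """Number of PDUs of the given rating needed to carry the current."""
    return math.ceil(amps / pdu_rating)


if __name__ == "__main__":
    for servers in (4, 21):
        amps = branch_current_amps(servers)
        print(f"{servers} servers: {amps:.1f} A -> {pdus_needed(amps)} x 32 A PDU per branch")
```

If your electrical code requires a continuous-load margin, lower the pdu_rating argument accordingly. The result for a full rack may then change.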

HEATING AND COOLING
Heating and cooling are other considerations. Based on the Dell Enterprise Infrastructure Planning Tool, the temperature in the environment will rise by 9 °C (idle load) or even 15 °C (computational load). This also requires appropriate cooling and electricity planning.
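Cooling capacity is often specified in BTU/h, so it can help to convert the power budget into a heat load. The sketch below is not taken from the Dell tool. It assumes that practically all electrical power consumed by a server ends up as heat, and it uses the standard conversion 1 W ≈ 3.412 BTU/h. The function name is made up for illustration.

```python
BTU_PER_HOUR_PER_WATT = 3.412142  # standard conversion factor: 1 W is about 3.412 BTU/h


def watts_to_btu_per_hour(watts: float) -> float:
    """Convert a power budget in watts to a heat load in BTU/h.

    Assumption: all consumed electrical power is dissipated as heat.
    """
    return watts * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    # 1 kW per server, 4 kW for four servers, 21 kW for a full rack
    for watts in (1_000, 4_000, 21_000):
        print(f"{watts / 1000:g} kW -> {watts_to_btu_per_hour(watts):,.0f} BTU/h")
```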

Conclusion

1 kW per server is a pretty decent consumption. When you design your cool SDDC, do not forget the basics - Power and Cooling.