
VMware vSphere: Performance Optimization


May 27, 2019

By Amir BENYEKKOU - Systems Engineer, Virtual Infrastructure at Squad

This first article discusses performance optimization for VMware's vSphere hypervisor, specifically the Network, Storage, Processor, and Memory components.
A second article will be dedicated to troubleshooting these components and their associated performance by monitoring various indicators.
Prior knowledge of vSphere environments is required before making any changes to these performance settings.
All best practices for optimizing performance can be found in the official VMware documentation: https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/performance/Perf_Best_Practices_vSphere65.pdf


Virtualization techniques

In the early days, VMware used binary translation as a software virtualization technique within its bare-metal hypervisor. This technique generated CPU overhead at the hypervisor level.
With the advent of hardware-assisted virtualization in the processor, the Virtual Machine Monitor (VMM) no longer runs in Ring 0 with the guest OS in Ring 1 (1): the VMM runs in a new privileged mode that allows it to trap kernel requests from the guest OS, which now runs in Ring 0.

Similarly, for memory management (Memory Management Unit), the VMM manages a correspondence between the guest OS memory and the physical memory. Initially, a software technique called Shadow Page Tables was used, which also generated processor overhead. The arrival of techniques managed directly by the processor has significantly improved performance.
There are as many VMMs as there are virtual machines on a hypervisor: the VMM provides access to VMKernel technologies (scheduler, memory management, network stacks, and storage) and manages the virtualization techniques of each virtual machine.

A virtual processor therefore consists of its instruction set and the MMU (Memory Management Unit) to form its Monitor Mode execution mode (2).

The Monitor Mode of each virtual machine is recorded in its vmware.log file on an ESXi hypervisor; search for the keyword "MONITOR MODE":

  • BT: Binary Translation and Shadow Page Tables (CPU/Memory software virtualization)
  • HV: Hardware Virtualization and SPT (CPU hardware virtualization, software for memory)
  • HWMMU: Hardware Virtualization for CPU and Memory

Depending on the mode used, this will have a direct impact on performance (https://kb.vmware.com/s/article/1036775).
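A quick way to check which mode a VM runs in is to scan its vmware.log for the "MONITOR MODE" keyword mentioned above. The sketch below parses such a line; note that the sample line format is an assumption for illustration, and real vmware.log output may differ slightly (see KB 1036775):

```python
import re

# Descriptions of the three monitor modes discussed above.
MODE_DESCRIPTIONS = {
    "BT": "Binary Translation + Shadow Page Tables (software CPU/memory)",
    "HV": "Hardware CPU virtualization + Shadow Page Tables",
    "HWMMU": "Hardware virtualization for CPU and memory (MMU)",
}

def extract_monitor_modes(log_line: str) -> list[str]:
    """Return the monitor-mode keywords found after 'MONITOR MODE:'."""
    match = re.search(r"MONITOR MODE:.*?:\s*(.+)", log_line)
    if not match:
        return []
    return [tok for tok in match.group(1).split() if tok in MODE_DESCRIPTIONS]

# Hypothetical sample line; the exact wording in vmware.log may vary.
sample = "vmx| MONITOR MODE: allowed modes : BT HV HWMMU"
print(extract_monitor_modes(sample))  # ['BT', 'HV', 'HWMMU']
```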


Performance Optimization on the Network Component

A vSwitch is a virtual switch that bridges the vNICs (virtual network interfaces) of virtual machines to the NICs (physical network interfaces) of an ESXi hypervisor.

In a vSphere environment and depending on the type of license used on ESXi hypervisors, two types of virtual switches can be used: the standard vSwitch or the Distributed vSwitch (DVS). DVSs are only available when using an Enterprise license on a vCenter (VMware's centralized management system). The configuration of a DVS is distributed across all ESXi hypervisors and performed directly from the vCenter, whereas the configuration of a standard vSwitch must be performed on each ESXi hypervisor.

A DVS provides additional features compared to a standard vSwitch:

  • No loss of connectivity for a virtual machine during vMotion (migration of a virtual machine from one hypervisor to another)
  • The control plane is managed by vCenter and the I/O plane by the ESXi hypervisors.
  • Additional load-balancing policies for the hypervisors' physical NICs: load-based teaming (based on network I/O load) and LACP.
  • Inbound and outbound traffic shaping (the standard vSwitch supports only outbound shaping)
  • Private VLANs to segment a Layer 2 network (ideal on a DMZ or hosting network composed of public IPs)
  • Network troubleshooting using port mirroring, port monitoring, and NetFlow
  • Troubleshooting DVS with HealthCheck for configuration errors
  • Backup and restoration of DVS
  • Network I/O Control to allocate bandwidth reservations, limits, and shares to each traffic type on ESXi hypervisors and virtual machines

When designing or upgrading a complex, large-scale vSphere environment, it is therefore preferable to use Distributed vSwitches (DVS) (https://kb.vmware.com/s/article/1010555).

On a virtual machine, several types of vNICs can be used:

  • Vlance, E1000 & E1000e: emulated drivers
  • VMXNET, VMXNET2, VMXNET3: paravirtualized drivers

It is therefore important to use the VMXNET3 adapter type, with VMware Tools installed in the virtual machine, to guarantee maximum performance and the latest available features (3).

Noteworthy features include TCP Segmentation Offload (TSO), which reduces CPU overhead (it must be enabled manually in virtual machines); Jumbo Frames, which allow frames with an MTU of 9000 (useful when mounting NFS filesystems in a virtual machine over a dedicated storage network); and support for 10G adapters in virtual machines.
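To see why Jumbo Frames help, consider the per-frame overhead: a rough sketch (ignoring TCP/IP header bytes inside each frame, for simplicity) shows that an MTU of 9000 needs about six times fewer frames than the standard 1500 to move the same payload:

```python
# Rough sketch: frame count for a large transfer at MTU 1500 vs. 9000.
# Fewer frames means fewer headers and fewer per-frame interrupts to process.

def frames_needed(payload_bytes: int, mtu: int) -> int:
    """Number of frames needed to carry payload_bytes at a given MTU."""
    return -(-payload_bytes // mtu)  # ceiling division

payload = 1_000_000_000  # a 1 GB transfer (illustrative figure)
standard = frames_needed(payload, 1500)
jumbo = frames_needed(payload, 9000)
print(standard, jumbo, round(standard / jumbo, 1))  # 666667 111112 6.0
```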


Storage Performance Optimization

In a vSphere environment, storage arrays based on SAN (FC, FCoE, and iSCSI) or NAS (NFS/CIFS) protocols can be used.

The choice of protocol type, storage configuration, load balancing, queue management and depth, VMFS configuration, and virtual HDD type will impact workload performance.

Here are some measures that can improve storage performance. In a SAN/NAS environment, the RAID level must be appropriate for the performance expectations of the workloads, and best practices for multipathing(4) must be applied:

  • For an active/passive array: Most Recently Used (MRU)
  • For an active/active array: Fixed

Where possible, use the storage array vendor's path selection plugins (PSP: Path Selection Plugins) if available.

The depth of the HBA, VMkernel, and LUN (array-side) queues should be adjusted if the number of outstanding commands at a given moment causes latency in SCSI operations.
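A useful rule of thumb for spotting queue-depth pressure comes from Little's law: the average number of outstanding I/Os equals IOPS multiplied by latency. When that figure exceeds the configured queue depth, commands start queuing and SCSI latency climbs. The numbers below are illustrative assumptions:

```python
# Sketch using Little's law: outstanding I/Os = IOPS x latency (seconds).
# If this exceeds the queue depth (HBA, VMkernel, or array-side LUN),
# commands queue up and latency rises further.

def outstanding_ios(iops: float, latency_ms: float) -> float:
    """Average number of I/Os in flight for a given rate and latency."""
    return iops * (latency_ms / 1000.0)

queue_depth = 32  # a common default LUN queue depth (assumption)
load = outstanding_ios(iops=8000, latency_ms=5)
print(load, load > queue_depth)  # 40.0 True -> queue depth is a bottleneck
```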

The use of iSCSI or NFS protocols must ensure that the underlying Ethernet network is not saturated and that a dedicated network is available. Separating storage traffic from virtual machine traffic is also recommended.

For virtual machines, the choice of disk type will impact performance:

  • Thick provisioned: blocks are fully allocated and zeroed when the disk is created, giving better performance.
  • Thin provisioned: blocks are allocated and zeroed on demand as they are first written, giving lower performance.

To better control performance, vendors provide various APIs, such as vSphere Storage APIs - Array Integration (VAAI) to accelerate I/O operations and vSphere APIs for Storage Awareness (VASA), which respectively offer the following features:

  • Reduction of ESXi CPU overhead on storage operations by dedicating them to the array in a SAN/NAS environment
  • Integration of topology, capacity, and status of storage arrays

In addition to these components, vSphere provides the ability to implement storage policies to place virtual machines on storage suited to the workload, use datastore clusters to distribute the IO load, or use vSphere IO Control to provide QoS on virtual machine IOs (only in iSCSI and Fibre Channel).


Processor Performance Optimization

The CPU scheduler on ESXi hypervisors is responsible for distributing the load of vCPUs (virtual processors) across pCPUs (physical processors) and scheduling the execution of execution contexts. A VM is a collection of worlds: one for each vCPU, one for the VMM, one for the keyboard, etc.

When vCPUs are allocated in excess of pCPUs, proportional time is allocated to each vCPU based on the shares, limits, and reservations configured on the virtual machines.
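Under contention, the share mechanism is essentially proportional: each VM's entitlement is its share count divided by the total shares of the competing VMs. The sketch below illustrates this with hypothetical VM names and an assumed 10,000 MHz of available pCPU time, ignoring limits and reservations for simplicity:

```python
# Sketch: proportional CPU entitlement from shares when pCPUs are
# oversubscribed (limits and reservations ignored for simplicity).

def entitlements(shares: dict[str, int], available_mhz: float) -> dict[str, float]:
    """Split available CPU time across VMs in proportion to their shares."""
    total = sum(shares.values())
    return {vm: available_mhz * s / total for vm, s in shares.items()}

# Hypothetical VMs competing for 10,000 MHz of pCPU time:
result = entitlements({"vm-high": 2000, "vm-normal": 1000, "vm-low": 500}, 10_000)
print({vm: round(mhz) for vm, mhz in result.items()})
# {'vm-high': 5714, 'vm-normal': 2857, 'vm-low': 1429}
```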

Several factors can affect CPU performance:

  • Idle virtual machines: even idle VMs generate overhead due to their timer interrupt requests.
  • CPU affinity: pinning vCPUs directly to specific pCPUs can lead to CPU load imbalance.
  • SMP VMs: multi-vCPU virtual machines should run SMP kernels rather than single-processor kernels.
  • vCPU allocation: allocate as few vCPUs as possible to a virtual machine to reduce the scheduling load on the CPU scheduler.
  • NUMA on multi-socket systems: allocating memory or vCPUs to a virtual machine beyond the capabilities of a NUMA node can cause a significant drop in performance on a workload.

An alert indicator, "Ready Time," shows the wait time for a vCPU to obtain processor cycles on the ESXi hypervisor. If this time increases, then the vCPUs of the virtual machines on the hypervisor are in contention.
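In the vSphere performance charts this counter appears as a summation in milliseconds per sample; converting it to a percentage makes it comparable across intervals. The sketch below uses the real-time chart's 20-second sampling interval; the ~5% per-vCPU alert threshold is a common rule of thumb, not an official limit:

```python
# Sketch: convert the "Ready" summation counter (milliseconds per chart
# sample) into a CPU Ready percentage. The real-time chart samples
# every 20 seconds.

def ready_percent(ready_ms: float, interval_s: int = 20) -> float:
    """CPU Ready as a percentage of the sampling interval."""
    return ready_ms / (interval_s * 1000) * 100

# 2000 ms of ready time in a 20 s sample -> 10 % ready: likely contention.
print(ready_percent(2000))  # 10.0
```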


Memory Performance Optimization

Physical memory allocation to a virtual machine is always done on the fly by the ESXi hypervisor. The hypervisor does not have access to the contents of the virtual machine's physical memory and is not aware of memory releases made by the machine or their contents.

There are several techniques for optimizing and reclaiming memory (5), performed by the hypervisor when its memory becomes saturated or when the memory allocated to virtual machines exceeds the hypervisor's memory capacity.

The following techniques are used, in that order, to free up memory:

  • Transparent Page Sharing: sharing of memory pages common to multiple VMs (disabled by default)
  • Ballooning: a dedicated driver installed with VMware Tools asks the guest OS to free up inactive or free memory so that the hypervisor can reclaim it.
  • Memory compression: this only occurs under RAM over-allocation. Pages that would otherwise be swapped are compressed directly in memory, but only if they compress to 50% of their size or less.
  • Host Swapping: the hypervisor creates a swap file per virtual machine at power-on and can swap out memory pages, active or inactive, to it.
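The compression step above can be sketched as a simple decision: compress the page if it shrinks to half its size or less, otherwise swap it. This is an illustrative model using zlib, not the hypervisor's actual compression algorithm:

```python
import os
import zlib

# Sketch of the memory-compression decision described above: a 4 KB page
# is kept compressed only if it shrinks to 50 % of its size or less;
# otherwise the hypervisor falls back to swapping it out.
PAGE_SIZE = 4096

def reclaim_action(page: bytes) -> str:
    """Return 'compress' if the page compresses well enough, else 'swap'."""
    compressed = zlib.compress(page)
    return "compress" if len(compressed) <= PAGE_SIZE // 2 else "swap"

zero_page = bytes(PAGE_SIZE)         # all zeros: highly compressible
random_page = os.urandom(PAGE_SIZE)  # random data: incompressible
print(reclaim_action(zero_page), reclaim_action(random_page))  # compress swap
```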

Sources

(1): Understanding Full Virtualization, Paravirtualization, and Hardware Assist - White Paper VMware
(2): Software and Hardware Techniques for x86 Virtualization - Information Guide VMware
(3): Performance Evaluation of VMXNET3 Virtual Network Device - Performance study VMware
(4): Understanding Multipathing and Failover - FAQ VMware Docs
(5): Understanding Memory Resource Management in VMware® ESX™ Server - White Paper