How to Minimize CPU Latency in VMware with NUMA

January 7, 2015 | Heroix Support

On the most basic level CPUs do one thing: process data based on instructions. The faster the CPU the faster it processes data. But before a CPU can process data it has to read both the data and the instructions from slower system RAM and that latency can slow the CPU processing. In order to minimize the time the CPU is waiting on reading data, CPU architectures include on-chip memory caches that are much faster than RAM. However even though the on-chip caches have hit rates that are better than 95% there are still times when the CPU has to wait for data from RAM.

When the CPU reads from RAM the data is transferred along a bus shared by all the CPUs on a system. As the number of CPUs in a system increase the traffic along that bus increases as well, and CPUs can end up contending with each other to access RAM. This is where NUMA comes in – NUMA is designed to minimize the problem of system bus contention by increasing the number of paths between CPU and RAM.

NUMA (Non Uniform Memory Architecture) breaks up a system into nodes of associated CPUs and local RAM. NUMA nodes are optimized so that the CPUs in a node preferentially use the local RAM within that node. The result is that CPUs typically contend only with other CPUs within their NUMA node for access to RAM rather than with all the CPUs on a system.

As an example consider a system with 4 processor sockets, each with 4 cores and 128 GB RAM. Without NUMA that comes to 16 physical processors that could potentially be queued up on the same system bus to access 128 GB RAM. If this same system were broken up into 4 NUMA nodes each node would have 4 CPUs with local access to 32 GB RAM.

The ESXi hypervisor can manage virtual machines (VMs) so that they take advantage of the NUMA system architecture. The VMware 6.5 Best Practices Guide divides VMs running on NUMA into 2 groups:

The number of virtual CPUs for a VM is less than or equal to physical CPUs in the NUMA node.

The ESXi hypervisor assigns the VM to a home NUMA node where memory and physical CPU are preferentially used. Best practices in this case are that the allocated VM memory be less than the NUMA node memory. As far as the VM is concerned it is effectively on a non-NUMA system where all CPU and memory resources are local

The number of virtual CPUs for a VM is greater than the number of physical CPUs in the NUMA node (“Wide VMs”).

Wide VMs are split into multiple NUMA clients with each client assigned a different home NUMA node. For example, if a system had multiple NUMA nodes of 1 socket with 4 cores each (4 physical CPUs/node) and a wide VM had 8 virtual CPUs then ESXi can divide the VM into two NUMA clients with 4 physical CPUs each assigned to 2 different home NUMA nodes. The problem with dividing a wide VM into multiple NUMA clients is that it introduces the possibility that one of the client nodes may need to access memory from a different NUMA client node.

Note: In this post we discuss hyperthreading – as per The CPU Scheduler in VMware vSphere, hyperthreading isn’t taken into account when you’re calculating the number of available virtual processors on a NUMA node.

In our next post we’ll take a look at using virtual NUMA (vNUMA) to minimize wide VMs accessing remote NUMA memory and what happens when VMs configured to use NUMA are migrated to a host with a different NUMA system configuration.

Want to learn more?

Download our Overcommitting VMware Resources Whitepaper for the guidelines you need to ensure that you are getting the most out of your host resources without sacrificing performance.

Download the whitepaper

Categories

How to Minimize CPU Latency in VMware with NUMA

Sign In