In this post, I will cover common performance metrics for VMware vSphere hosts and VMs, as well as some issues I’ve often encountered while managing vSphere clusters and how to identify and solve them. I will focus on the compute side (CPU/memory) only, to keep the post from getting too long.
CPU Metrics
1. CPU utilization
CPU utilization is the most obvious CPU metric in vSphere. It indicates what percentage of the VM’s vCPU capacity is being used and is calculated as follows:
virtual CPU usage = usagemhz ÷ (number of virtual CPUs × core frequency)
This tells you whether your VM has too few or too many vCPUs assigned. For example, if values average above 80% or even 90% during normal operations, it is advisable to increase the number of vCPUs. If usage is mostly below 40%, you may have too many vCPUs assigned, which in turn can cause CPU ready time.
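As a quick illustration, here is a minimal Python sketch of that formula. The sample values are made up; in practice, `usage_mhz` would come from the `cpu.usagemhz.average` counter in vCenter.

```python
def vcpu_usage_pct(usage_mhz: float, num_vcpus: int, core_mhz: float) -> float:
    """Virtual CPU usage = usagemhz / (number of vCPUs * core frequency)."""
    return usage_mhz / (num_vcpus * core_mhz) * 100

# Example: a 4-vCPU VM on 2.6 GHz cores, currently drawing 8840 MHz.
print(f"{vcpu_usage_pct(8840, 4, 2600):.1f}%")  # -> 85.0%, consider adding vCPUs
```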
2. CPU Ready time (%RDY)
CPU Ready time is a more complex metric with several possible causes, and it can heavily impact the performance of your VMs. It is the percentage of time during which a virtual machine was ready to run but could not be scheduled on a physical CPU. In esxtop (a command-line tool that provides real-time resource usage on a vSphere ESXi host) this metric is called %RDY, while in the vSphere Client it’s called “Readiness” – not to be confused with “CPU Ready”, which shows a summation in milliseconds instead of a percentage. You can calculate the percentage from the summation using the following formula:
CPU ready % = (CPU ready summation in ms ÷ (chart update interval in seconds × 1000)) × 100
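Here is that conversion as a small Python sketch; the 20-second default matches the vSphere Client’s real-time charts, so adjust the interval if your chart uses a different rollup period.

```python
def cpu_ready_pct(summation_ms: float, interval_s: int = 20) -> float:
    """Convert a CPU Ready summation (ms) to a readiness percentage."""
    return summation_ms / (interval_s * 1000) * 100

# Example: a real-time chart (20 s interval) reporting 2400 ms of ready time.
print(f"{cpu_ready_pct(2400):.1f}%")  # -> 12.0%, well past the point where it hurts
```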
A ready time below 5% is typically not an issue and should not cause significant performance impact. Once the metric rises above 10%, however, the impact becomes quite significant. I’ll go over the two most common causes of ready time and how you can identify and solve them.
3. Co-Stop (%CSTP)
Co-Stop indicates the percentage of time a vSMP (Virtual Symmetric Multi-Processing) virtual machine was ready to run but incurred delay due to co-vCPU scheduling contention (%RUN + %RDY + %CSTP + %WAIT = 100%). This metric applies to all VMs with more than one vCPU, and it essentially means the physical processors cannot find enough simultaneous scheduling opportunities for the VM’s vCPUs. Depending on the situation, there are multiple ways to resolve this.
First, identify whether many VMs have this issue or only a few big ones. For example, one or more large VMs may have so many vCPUs assigned relative to the number of physical cores on the host that it is difficult to schedule all of their vCPUs at the same time. This can be solved by lowering the number of vCPUs allocated to those VMs or by moving them to a host with more physical cores per processor.
On the other hand, (almost) all of your VMs may be seeing Co-Stop time. This can indicate that your CPU overprovisioning factor is too high, which leads to the same issue. You can calculate the overprovisioning factor by dividing the number of allocated vCPUs by the number of physical cores in the platform/host, as in the sketch below. In general, an overprovisioning factor of 3:1 should not cause any issues; going beyond that can start to cause performance degradation, and going above 5:1 is highly likely to cause significant impact.
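A back-of-the-envelope check might look like this; the core and vCPU counts are made-up example numbers.

```python
def cpu_overprovisioning(total_vcpus: int, physical_cores: int) -> float:
    """vCPU-to-pCore overprovisioning factor for a host or cluster."""
    return total_vcpus / physical_cores

# Example: 180 allocated vCPUs on a host with 2 sockets x 24 cores.
factor = cpu_overprovisioning(180, 2 * 24)
print(f"{factor:.2f}:1")  # -> 3.75:1, past the comfortable 3:1 guideline
```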
4. Max Limited (%MLMTD)
Max Limited is another common metric in VMware environments. It means that vCPUs were deliberately not scheduled on the physical processor because doing so would violate a CPU limit set in vSphere, thus throttling the VM(s). The limit can be set on a VM or on a resource pool, and the issue can typically be resolved by simply raising that limit. However, these limits were likely configured by an administrator to prevent one VM or resource pool from impacting another. In that case, you’ll need to make sure your platform can support the CPU resources you allocate to the VMs/resource pools; otherwise you will likely trigger a wider CPU problem.
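If you want to audit which VMs have a CPU limit configured, a sketch along these lines with the pyVmomi SDK can help. The vCenter address and credentials are placeholders, and this assumes pyVmomi is installed (`pip install pyvmomi`).

```python
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

# Placeholder connection details -- replace with your own vCenter and account.
ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.example.com", user="readonly@vsphere.local",
                  pwd="secret", sslContext=ctx)
try:
    content = si.RetrieveContent()
    # Walk every VM in the inventory and report any configured CPU limit.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        if vm.config and vm.config.cpuAllocation.limit != -1:  # -1 means unlimited
            print(f"{vm.name}: CPU limit {vm.config.cpuAllocation.limit} MHz")
finally:
    Disconnect(si)
```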
Memory Metrics
1. Memory utilization
Memory utilization in vSphere is split into a few different metrics, the main ones being Active, Consumed, and Granted memory. How you handle high memory utilization depends on where it’s observed. When utilization gets close to the maximum threshold on a host, you can look at balancing VMs across your hosts, adjusting DRS settings to allow more dynamic balancing, or extending your cluster with additional hosts/memory. When memory utilization is reaching its maximum threshold on a VM, you would typically assign additional memory to that specific VM. However, an issue in the guest could be causing it to utilize more memory than it should, so ideally you should troubleshoot the guest OS before simply allocating more memory.
2. Consumed memory
Consumed memory is also referred to as “used memory”. It is the total amount of memory used by the ESXi host, calculated by summing the memory consumed by all running virtual machines on the host plus the memory used by components of the ESXi host itself, such as the VMkernel, management agents, and other vSphere services.
3. Granted memory
Granted memory is the amount of physical memory that has been mapped to the virtual machines on the host. This also includes the amount of memory that has been allocated to vSphere services on the host.
4. Active memory
Memory consumed by virtual machines is not always actively touched; much of it is used to cache data that may be accessed often or required at a later moment. Active memory is the part of consumed memory that is actually being used. The active memory shown in vCenter is memory that has been touched within the last 20 seconds.
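To see how these metrics relate on a live VM, you can read them from the VM’s quick stats. This sketch reuses the pyVmomi connection pattern from the CPU-limit example above; `vm` is assumed to be a `vim.VirtualMachine` object.

```python
def memory_snapshot(vm):
    """Print consumed, granted, active, and ballooned memory for one VM (MB)."""
    qs = vm.summary.quickStats
    print(f"{vm.name}: consumed={qs.hostMemoryUsage} MB, "
          f"granted={qs.grantedMemory} MB, "
          f"active={qs.guestMemoryUsage} MB, "
          f"ballooned={qs.balloonedMemory} MB")
```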
5. Memory ballooning
Most of the memory issues in VMware are seen when memory ballooning occurs. However, ballooning is not necessarily a bad thing. It is a feature that allows the host to use its physical memory more efficiently by reclaiming unused memory that had been allocated to running virtual machines. This is done by the memory balloon driver included in VMware Tools: the driver determines how much memory can be reclaimed and then inflates until that amount of memory is pinned inside the guest. Issues typically appear when the balloon driver inflates so far that the guest can no longer run its workload. This causes performance problems and, in some cases, even an outage – for example, an SQL Server might run out of memory and crash. It’s therefore recommended to decide on a per-virtual-machine basis whether ballooning should be possible, and to reserve the full guest memory for machines you don’t want to balloon at all, as in the sketch below.
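Reserving all guest memory can be done from the vSphere Client, or programmatically; here is a minimal pyVmomi sketch (again with `vm` as a `vim.VirtualMachine` handle) that sets the same “Reserve all guest memory” option.

```python
from pyVmomi import vim

def reserve_all_guest_memory(vm):
    """Lock the VM's memory reservation to its configured size (no ballooning)."""
    spec = vim.vm.ConfigSpec()
    spec.memoryReservationLockedToMax = True  # the "Reserve all guest memory" checkbox
    return vm.ReconfigVM_Task(spec=spec)  # caller can wait on the returned task
```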
6. Worst case allocation
Worst-case allocation is the amount of memory a virtual machine can allocate when all the virtual machines on the ESXi host or in the cluster consume the full amount of their configured resources. In an overprovisioned environment, this is the amount of memory the VM will actually get. Typically, you don’t want the worst-case allocation to be much lower than the amount of memory you’ve allocated to VMs in a production environment, as this will start to cause a lot of ballooning or even out-of-memory issues once the VMs begin to utilize more of their allocated memory at the same time. Keeping worst-case allocation and actual assigned memory as close to each other as possible ensures that your VMs will be able to use their allocated memory during high-load events in your cluster.
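As a rough first-order check (not the exact calculation vSphere performs, which also accounts for reservations and virtualization overhead), you can compare total configured VM memory against physical host memory; the numbers below are made up.

```python
def memory_overcommit_ratio(total_vm_memory_gb: float, host_memory_gb: float) -> float:
    """Rough overcommit check: configured VM memory vs. physical host memory."""
    return total_vm_memory_gb / host_memory_gb

# Example: 1.5 TB of configured VM memory on a host with 1 TB of RAM.
ratio = memory_overcommit_ratio(1536, 1024)
print(f"{ratio:.2f}:1")  # -> 1.50:1; under full load VMs get roughly 2/3 of what they expect
```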
In this post, we discussed the most common compute performance metrics and issues that I’ve encountered while managing VMware vSphere hosts and VMs, and how you can identify and solve them. I hope these insights will help you identify and resolve performance issues in your own environments.