Tuesday, October 8, 2013

Things to look for when troubleshooting VMware CPU performance problems

I get asked this question often, so I figured it would make sense to list the metrics I review when troubleshooting CPU performance issues in a VMware environment.

High Ready Time: Ready time above 10% could indicate CPU contention and might impact the performance of CPU-intensive applications. However, some less CPU-sensitive applications and virtual machines can have much higher ready time values and still perform satisfactorily. A vCPU is in the Ready state when the virtual machine is ready to run but cannot, because the vSphere scheduler is unable to find physical host CPU resources to run it on. One potential reason for elevated ready time is that the virtual machine is constrained by a user-set CPU limit or resource pool limit, reported as max limited (MLMTD).

High Co-Stop Time: Co-Stop (CSTP) is the time the vCPUs of a multi-way virtual machine spent waiting to be co-started, which gives an indication of the co-scheduling overhead incurred by the virtual machine. Co-Stop time indicates that CPU contention is occurring among the vCPUs of a multi-way virtual machine. Co-Stop time above 10% could be an indicator that vSphere is having trouble scheduling all the vCPUs of a multi-way virtual machine.

CPU Limits: CPU Limits directly prevent a virtual machine from using more than a set amount of CPU resources. Any CPU limit might cause a CPU performance problem if the virtual machine needs resources beyond the limit.

Host CPU Saturation: When the physical CPUs of a vSphere host are consistently utilized at 85% or more, the host may be saturated. When a vSphere host is saturated, it is more difficult for the scheduler to find free physical CPU resources to run virtual machines.

Guest CPU Saturation: Guest CPU (vCPU) saturation is when the application inside the virtual machine is using 90% or more of the CPU resources assigned to the virtual machine. This may be an indicator that the application is bottlenecked on vCPU resources. In these situations, adding vCPU resources to the virtual machine might improve performance.
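The rule-of-thumb thresholds mentioned so far (10% ready, 10% Co-Stop, 85% host utilization, 90% guest utilization) can be collected into a quick triage check. This is only a minimal sketch: the `CpuSample` class, its field names, and the `triage` function are hypothetical names made up for illustration, and the inputs are assumed to already be percentages gathered from your own tooling. A flag here is a reason to look closer, not proof of a problem.

```python
from dataclasses import dataclass
from typing import List

# Rule-of-thumb thresholds taken from the text above (all in percent).
READY_PCT_THRESHOLD = 10.0
COSTOP_PCT_THRESHOLD = 10.0
HOST_SATURATION_PCT = 85.0
GUEST_SATURATION_PCT = 90.0


@dataclass
class CpuSample:
    """Hypothetical container for one observation of a VM and its host."""
    ready_pct: float       # ready time as a percentage
    costop_pct: float      # Co-Stop time as a percentage
    host_util_pct: float   # physical CPU utilization of the host
    guest_util_pct: float  # in-guest utilization of the assigned vCPUs


def triage(sample: CpuSample) -> List[str]:
    """Return the list of indicators that exceed their threshold."""
    findings = []
    if sample.ready_pct > READY_PCT_THRESHOLD:
        findings.append("high ready time")
    if sample.costop_pct > COSTOP_PCT_THRESHOLD:
        findings.append("high co-stop time")
    if sample.host_util_pct >= HOST_SATURATION_PCT:
        findings.append("host CPU saturation")
    if sample.guest_util_pct >= GUEST_SATURATION_PCT:
        findings.append("guest CPU saturation")
    return findings


if __name__ == "__main__":
    print(triage(CpuSample(ready_pct=12.0, costop_pct=3.0,
                           host_util_pct=88.0, guest_util_pct=50.0)))
```

Since the thresholds describe sustained behavior ("consistently utilized"), it probably makes sense to feed this averages over a window rather than a single instantaneous reading.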

Incorrect SMP Usage: Using large SMP virtual machines can cause extra overhead. Virtual machines should be correctly sized for the application they are intended to run. Some applications only support multithreading up to a certain number of threads, so assigning additional vCPUs beyond that point adds overhead without adding usable capacity. If vCPU usage shows that a machine configured with multiple vCPUs is only using one of them, it might be an indicator that the application inside the virtual machine is unable to take advantage of the additional vCPU capacity, or that the guest OS is not configured correctly.
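One way to spot the "only using one of them" pattern is to look at per-vCPU utilization and check whether a single vCPU is carrying the load while the rest sit idle. The sketch below is an illustration only: the function name and the 50% "busy" and 10% "idle" cutoffs are assumptions of mine, not VMware guidance, so tune them to your environment.

```python
from typing import Sequence


def looks_single_threaded(per_vcpu_usage_pct: Sequence[float],
                          busy_pct: float = 50.0,
                          idle_pct: float = 10.0) -> bool:
    """Return True when exactly one vCPU is busy and all others are near idle.

    busy_pct and idle_pct are assumed cutoffs chosen for illustration.
    """
    if len(per_vcpu_usage_pct) < 2:
        # A single-vCPU machine cannot be over-provisioned in this sense.
        return False
    busy = [u for u in per_vcpu_usage_pct if u >= busy_pct]
    idle = [u for u in per_vcpu_usage_pct if u <= idle_pct]
    return len(busy) == 1 and len(idle) == len(per_vcpu_usage_pct) - 1


if __name__ == "__main__":
    print(looks_single_threaded([95.0, 2.0, 1.0, 3.0]))     # one hot vCPU
    print(looks_single_threaded([60.0, 55.0, 70.0, 65.0]))  # load is spread
```

The guest scheduler can move a single thread between vCPUs, so averaged per-vCPU numbers may hide this pattern; treat a negative result as inconclusive.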

Low Guest Usage: Low in-guest CPU utilization might be an indicator that the application is not configured correctly, or that the application is starved of some other resource such as I/O or memory and therefore cannot fully utilize the assigned vCPU resources.

Wait: This can occur when the virtual machine's guest OS is idle (waiting for work), or when the virtual machine is waiting on vSphere tasks. Examples of vSphere tasks that a vCPU may be waiting on include I/O completion and ESX-level swapping. These non-idle vSphere system waits are called VMWAIT.



Notice the amount of CPU this virtual machine is demanding and compare that to the amount of CPU the virtual machine is actually getting (Usage in MHz). The virtual machine is demanding more than it is currently being allowed to use.

Notice that the virtual machine is also seeing a large amount of ready time.
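The demand-versus-usage comparison described above comes down to simple arithmetic. As a rough sketch, assuming you have already read the Demand and Usage in MHz values off the chart (the function names here are hypothetical):

```python
def unmet_demand_mhz(demand_mhz: float, usage_mhz: float) -> float:
    """CPU the VM asked for but did not get, in MHz (never negative)."""
    return max(0.0, demand_mhz - usage_mhz)


def unmet_demand_pct(demand_mhz: float, usage_mhz: float) -> float:
    """Unmet demand as a percentage of what the VM demanded."""
    if demand_mhz <= 0:
        return 0.0
    return 100.0 * unmet_demand_mhz(demand_mhz, usage_mhz) / demand_mhz


if __name__ == "__main__":
    # Made-up example values, not taken from the original screenshots.
    print(unmet_demand_mhz(4000.0, 2500.0))
    print(unmet_demand_pct(4000.0, 3000.0))
```

A persistent gap between demand and usage, combined with elevated ready time, is the pattern to watch for; a gap on its own in one sample may just be noise.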



vCenter reports some metrics, such as "Ready Time", in milliseconds (ms). To convert the milliseconds (ms) value to a percentage, divide it by the length of the sample period in milliseconds and multiply by 100.

For multi-vCPU virtual machines, you need to multiply the sample period by the number of vCPUs in the virtual machine to determine the total time of the sample period. It is also beneficial to monitor Co-Stop time on multi-vCPU virtual machines. Like ready time, Co-Stop time greater than 10% could indicate a performance problem.
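The conversion described above can be sketched in a few lines. The function name is hypothetical, and the 20-second default assumes you are reading the vCenter real-time chart; other chart views use longer sample periods, so pass the interval that matches the chart you are looking at.

```python
# Assumption: vCenter real-time charts use a 20-second sample period.
REALTIME_INTERVAL_S = 20


def summation_ms_to_pct(value_ms: float,
                        interval_s: float = REALTIME_INTERVAL_S,
                        num_vcpus: int = 1) -> float:
    """Convert a ms summation (e.g. ready or Co-Stop) to a percentage.

    The sample period is multiplied by the vCPU count, as described in
    the text, so the result is a per-vCPU average for the whole VM.
    """
    if interval_s <= 0:
        raise ValueError("interval_s must be positive")
    if num_vcpus < 1:
        raise ValueError("num_vcpus must be at least 1")
    total_period_ms = interval_s * 1000.0 * num_vcpus
    return value_ms / total_period_ms * 100.0


if __name__ == "__main__":
    # 2000 ms of ready time in a 20 s sample on a 1-vCPU VM
    print(summation_ms_to_pct(2000))
    # The same kind of reading on a 4-vCPU VM, summed across its vCPUs
    print(summation_ms_to_pct(4000, num_vcpus=4))
```

With the real-time interval, 2000 ms on a single-vCPU machine lands right at the 10% mark, which makes it a handy number to keep in mind when scanning the chart.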
