2009-07-24

How Heavy is your ESX Load?

Well OK.. This could be taken the wrong way (and all of you with the dirty minds should be ashamed of yourselves - ha ha). In one of my previous posts - How Much Ram per Host - a.k.a Lego - I gave a hypothetical scenario of 40 1 vCPU VM's on a single host as opposed to 80 VM's on one host. There was one thing I neglected to mention, and because of an issue with a client this week, I feel it is important to point it out.

CPU Contention. For those of you who do not know what this is, a brief explanation: if you have too many VM's competing for CPU resources, each vCPU has to wait its turn for a physical core, and your VM's will stop behaving and start to crawl.

So here was the story - a client called me with an issue, all his VM's had started to crawl - EVERYTHING was running slowly!

Troubleshooting walkthrough:

  1. Log into the VI Client - and check the resource utilization of the Host - CPU, RAM, Network, Disk.
    Ok I did that - absolutely nothing!
    CPU - 40%
    RAM - 50%
    NIC - 5-10mb/s utilization
    Disk - This was NFS, so there were no disk statistics - I looked at the VMNIC of the VMKernel instead - and also nothing!

    On to the next step..
  2. top on the ESX host
    ssh'd into the ESX host and looked at the resources with top. I do this first, before even going into the ESX statistics. I looked to see if the iowait was high, if any processes were hogging resources, and at the state of the RAM on the host.

    14:21:34  up 2 days, 20:57,  1 user,  load average: 1.06, 0.92, 0.75
    286 processes: 284 sleeping, 2 running, 0 zombie, 0 stopped
    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total    0.9%    0.0%    0.0%   0.0%     0.0%    45.8%   94.1%
    Mem:   268548k av,  256560k used,   11988k free,       0k shrd,   21432k buff
                        189028k actv,   29240k in_d,    3232k in_c
    Swap: 1638620k av,   251022k used, 1541988k free                   74068k cached

    Notice the last line:

    Swap: 1638620k av,   251022k used, 1541988k free                   74068k cached

    Why was it swapping? That is not normal. A quick check in the VI Client showed how much RAM was allocated:

    image

    So there was only 272MB (the default) allocated. Someone had done the proper work of creating the swap of 1600MB (double the maximum of 800MB) - well done! - but had not restarted the host! So effectively the host was still set for 272MB. Now of course the load on the machine was high enough to cause the host to run out of RAM, and anything that was done on the host ran slowly.

    Vmotioned the VM's off and restarted the host, which came back with the full amount this time.

    image

    Swap: 1638588k av,       0k used, 1638588k free       155924k cached

    Ahh much better - no more swapping. Vmotioned the machines back, and at a certain stage all VM's started to crawl again.

    Looked into top again

    CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
               total    0.3%    0.0%    0.0%   0.0%     0.9%     38.9%   59.6%

    Whoa! That iowait is also extremely high!

  3. esxtop
    shift + V to show only VM's, shift + R to sort by ready time

    ID  GID  NAME          NWLD  %USED   %RUN  %SYS   %WAIT   %RDY  %IDLE  %OVRLP  %CSTP  %MLMTD
    21   21  MEM_ABC_STB_     5  49.84  50.08  0.04  393.77  54.77   0.00    0.34   0.00   51.17

    There were something like 10 VM's with %RDY times of over 10% constantly.
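
Eyeballing esxtop works for one host, but the same check can be scripted. Below is a minimal sketch, assuming you have pasted or captured the CPU panel as plain text with a header row and whitespace-separated columns, and that VM names contain no spaces. The function names and the 10% threshold are my own choices for illustration (the threshold simply mirrors the figure mentioned above); this is not an official VMware tool.

```python
def parse_esxtop_rows(text):
    """Turn whitespace-separated esxtop CPU panel text into a list of dicts.

    Assumes the first non-empty line is the header and that no field
    (including the VM name) contains spaces.
    """
    lines = [ln.split() for ln in text.strip().splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    parsed = []
    for fields in rows:
        if len(fields) != len(header):
            continue  # skip rows that do not line up with the header
        parsed.append(dict(zip(header, fields)))
    return parsed


def high_ready(text, threshold=10.0):
    """Return (name, %RDY) pairs for rows whose %RDY exceeds the threshold."""
    return [
        (row["NAME"], float(row["%RDY"]))
        for row in parse_esxtop_rows(text)
        if float(row["%RDY"]) > threshold
    ]


# The row captured during the incident.
SAMPLE = """\
ID    GID NAME      NWLD   %USED    %RUN    %SYS   %WAIT    %RDY   %IDLE  %OVRLP   %CSTP  %MLMTD
21     21 MEM_ABC_STB_     5   49.84   50.08    0.04  393.77   54.77   0.00    0.34    0.00   51.17
"""

print(high_ready(SAMPLE))
```

Run against the row above, this flags MEM_ABC_STB_ with its %RDY of 54.77.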

Here you have a perfect case of CPU contention. The host is a dual quad-core X5320 (8 cores in total), and it was running 44 VM's.

image

The ratio of VM's per core is high - but achievable. I then looked to see how many vCPU's there were on the host: approximately 10 VM's had 2 or more vCPU's.

image
This brought the ratio to 6.75 vCPU's per core (54 vCPU's on 8 cores). And this is what was killing the host.

Even though the ratio of vm:core was 5.5:1 the vCPU:core ratio was much higher and therefore causing the contention throughout the server.
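
The arithmetic behind those two ratios is worth spelling out. A small sketch follows; the exact split of 34 single-vCPU and 10 dual-vCPU VM's is my assumption, chosen because it matches the 44 VM's and the 6.75 figure, and the function name is just for illustration.

```python
PHYSICAL_CORES = 2 * 4  # two quad-core sockets


def per_core_ratios(vcpus_per_vm, cores):
    """Return (VM's per core, vCPU's per core) for a list of per-VM vCPU counts."""
    vm_count = len(vcpus_per_vm)
    vcpu_count = sum(vcpus_per_vm)
    return vm_count / cores, vcpu_count / cores


# Assumed split: 34 VM's with 1 vCPU + 10 VM's with 2 vCPU's = 44 VM's, 54 vCPU's.
as_found = [1] * 34 + [2] * 10

# The same 44 VM's if every one of them had a single vCPU.
all_single = [1] * 44

print(per_core_ratios(as_found, PHYSICAL_CORES))    # (5.5, 6.75)
print(per_core_ratios(all_single, PHYSICAL_CORES))  # (5.5, 5.5)
```

The VM count never changes, which is why looking only at VM's per core hides the problem: the scheduler has to place vCPU's, not VM's.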

Of course the client did not understand why any of these VM's should be configured with less than 2 vCPU's - "because that is what you get with any desktop computer.."

It took an incident like this for the client to understand that there is no reason to configure a machine with more than 1 vCPU unless it really needs (and knows how to use) the extra ones.

We brought all the machines back down to 1 vCPU, and the ready times dropped:

 ID  GID  NAME             NWLD  %USED   %RUN  %SYS   %WAIT  %RDY
128  128  RPM_Tester_Arie     5   4.70   4.70  0.03  487.45  7.35
118  118  RHEL4_6             5   5.33   5.37  0.00  488.44  5.67
108  108  STBinteg3           5   1.68   1.70  0.00  495.00  2.78
112  112  STBinteg2           5  11.20  11.25  0.00  485.50  2.74
 21   21  MEM_ABC_STB_        5   6.90   6.92  0.02  490.19  2.35

And all was back to normal!

Lessons learned from this episode:

  1. First thing you do when installing the host - set the swap to 1600MB and the Service Console RAM to the maximum of 800MB.
  2. Reboot the host after that!!!
  3. Remember to always check all your resources - CPU/RAM/NIC/disk usage are not the only bottlenecks which can cause performance issues.
  4. 80 vCPU's might not actually be possible - it will depend on the workload that is running on the host - but hey, this was a hypothetical scenario anyway.
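
The swap check from the top step is easy to script as well. This is only a sketch: it assumes the Swap: line has the same layout as the top output pasted earlier, and the function names are made up for this example. Keep in mind that a non-zero "used" figure means the console has swapped at some point, not necessarily that it is swapping right now.

```python
import re

# Matches lines like: "Swap: 1638620k av,   251022k used, 1541988k free ..."
SWAP_RE = re.compile(r"Swap:\s*(\d+)k av,\s*(\d+)k used,\s*(\d+)k free")


def swap_used_kb(top_output):
    """Return the 'used' swap figure in KB, or None if no Swap: line is found."""
    match = SWAP_RE.search(top_output)
    if match is None:
        return None
    return int(match.group(2))


def has_used_swap(top_output):
    """True if top reports any swap in use."""
    used = swap_used_kb(top_output)
    return used is not None and used > 0


# The two Swap: lines from this incident, before and after the reboot.
BEFORE = "Swap: 1638620k av,   251022k used, 1541988k free                   74068k cached"
AFTER = "Swap: 1638588k av,       0k used, 1638588k free       155924k cached"

print(swap_used_kb(BEFORE), has_used_swap(BEFORE))
print(swap_used_kb(AFTER), has_used_swap(AFTER))
```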

Invaluable resources for troubleshooting performance:

Checking for resource starvation of the ESX Server service console

Ready Time

CPU Performance Analysis and Monitoring

Hope you enjoyed the ride..

2 comments:

Fred L. said...

Thanks for this article. Something important is missing for me: i'm trying to find the limit or the metrics for I/O disk (write, read, for all technologies such as SAS, SATA, FC,etc)...How can I know what are the good metrics for disks ? Is there any document, or article talking about this ? Thank you !