Well ok.. This could be taken the wrong way (and all of you with dirty minds should be ashamed of yourselves - ha ha). In one of my previous posts - How Much Ram per Host - a.k.a Lego - I gave a hypothetical scenario of 40 single-vCPU VMs on a single host as opposed to 80 VMs on one host. There was one thing I neglected to mention, and because of an issue with a client this week, I feel it is important to point it out.
CPU Contention. For those of you who do not know what this is about, a brief explanation: if you have too many VMs competing for the host's CPU resources, your VMs will stop behaving and start to crawl.
So here was the story: a client called me with an issue - all of his VMs had started to crawl, EVERYTHING was running slowly!
Troubleshooting walkthrough:
- Log into the VI Client - and check the resource utilization of the Host - CPU, RAM, Network, Disk.
OK, I did that - and found absolutely nothing out of the ordinary!
CPU - 40%
RAM - 50%
NIC - 5-10mb/s utilization
Disk - this was NFS, so there were no disk statistics; I looked at the vmnic used by the VMkernel instead - and also nothing!
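(By the way, a quick way to eyeball NFS traffic from the command line is esxtop's network view - more on esxtop below. A minimal sketch, assuming the 'n' network screen in the ESX 3.x esxtop:)
esxtop    # press 'n' for the network view and find the VMkernel port - the MbTX/s and MbRX/s columns show the NFS traffic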
On to the next step.. - top on the ESX host
I SSH'd into the ESX host and looked at the resources with top. I do this first, before even going into the ESX statistics. I looked to see if the iowait was high, whether there were any processes hogging too many resources, and what the state of the RAM on the host was.
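(For reference, you can grab the same snapshot non-interactively - a minimal sketch, assuming the standard procps top that ships in the ESX Service Console:)
top -b -n 1 | head -8    # batch mode, one iteration - just the load, CPU and memory summary lines
This is what the snapshot showed: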
14:21:34 up 2 days, 20:57, 1 user, load average: 1.06, 0.92, 0.75
286 processes: 284 sleeping, 2 running, 0 zombie, 0 stopped
CPU states: cpu user nice system irq softirq iowait idle
total 0.9% 0.0% 0.0% 0.0% 0.0% 45.8% 94.1%
Mem: 268548k av, 256560k used, 11988k free, 0k shrd, 21432k buff
189028k actv, 29240k in_d, 3232k in_c
Swap: 1638620k av, 251022k used, 1541988k free 74068k cached
If you notice the last line:
Swap: 1638620k av, 251022k used, 1541988k free 74068k cached
Why was it swapping? That is not normal. A quick check in the VI Client showed how much RAM was allocated to the Service Console.
There was only 272MB (the default) allocated. Someone had done the proper work of creating a swap partition of 1600MB (double the maximum of 800MB) - well done! - but had not restarted the host, so effectively the Service Console was still set to 272MB. Now, of course, the load on the machine was high enough to make the Service Console run out of RAM, and anything that was done on the host was working slowly.
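(A quick way to confirm this from the Service Console itself - a minimal sketch using standard Linux commands available in the COS, your numbers will obviously differ:)
free -m      # 'total' is the RAM the Service Console is actually running with - it should line up with the allocated amount, give or take overhead
swapon -s    # lists the swap partition and how much of it is already in use
If free -m still reports the old value after the setting has been changed in the VI Client, the host simply has not been rebooted yet.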
VMotioned the VMs off and restarted the host, which came back with the full amount this time:
Swap: 1638588k av, 0k used, 1638588k free 155924k cached
Ahh, much better - no more swapping. VMotioned the machines back, and at a certain point all the VMs started to crawl again.
Looked at top again:
CPU states: cpu user nice system irq softirq iowait idle
total 0.3% 0.0% 0.0% 0.0% 0.9% 38.9% 59.6%
Whoa! That iowait is also extremely high!
- esxtop
Shift+V to show only VMs, Shift+R to sort by ready time:
ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY %IDLE %OVRLP %CSTP %MLMTD
21 21 MEM_ABC_STB_ 5 49.84 50.08 0.04 393.77 54.77 0.00 0.34 0.00 51.17
There were something like 10 VMs with %RDY times constantly over 10%.
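(If you want to capture this over a period of time instead of watching it live - a rough sketch of esxtop's batch mode; exact flags may differ slightly between ESX versions:)
esxtop -b -d 5 -n 60 > esxtop-capture.csv    # batch mode, 5 second samples, 60 iterations - the CSV can be analysed later in perfmon or a spreadsheet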
Here you have a perfect case of CPU contention. The host is a dual quad-core Xeon X5320 - 8 cores - and the machine was running 44 VMs.
The ratio of VMs per core is high, but achievable. I then looked to see how many vCPUs there were on the host - approximately 10 VMs had 2 or more vCPUs.
This brought the ratio up to 6.75 vCPUs per core, and this is what was killing the host.
Even though the VM:core ratio was 5.5:1, the vCPU:core ratio was much higher, and that is what was causing the contention throughout the server.
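The arithmetic, for clarity (assuming the ~10 multi-vCPU VMs each had 2 vCPUs, which is what fits the numbers above):
# 34 x 1 vCPU + 10 x 2 vCPUs = 54 vCPUs on 8 cores
echo "scale=2; (34 + 10*2) / 8" | bc    # 6.75 vCPUs per core
echo "scale=2; 44 / 8" | bc             # 5.50 VMs per core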
Of course the client could not understand why these VMs should be configured with anything less than 2 vCPUs - "because that is what you get with any desktop computer.."
It took an incident like this for the client to understand that there is no reason to configure a VM with more than 1 vCPU unless it really needs (and knows how) to use it.
We brought all the machines back down to 1 vCPU, and:
ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY
128 128 RPM_Tester_Arie 5 4.70 4.70 0.03 487.45 7.35
118 118 RHEL4_6 5 5.33 5.37 0.00 488.44 5.67
108 108 STBinteg3 5 1.68 1.70 0.00 495.00 2.78
112 112 STBinteg2 5 11.20 11.25 0.00 485.50 2.74
21 21 MEM_ABC_STB_ 5 6.90 6.92 0.02 490.19 2.35
And all was back to normal!
Lessons learned from this episode:
- First thing you do when installing the host: set the swap partition to 1600MB and the Service Console RAM to the maximum of 800MB
- Reboot the host after that!!! (and verify that the change actually took - see the sketch after this list)
- Remember to always check all of your resources - CPU/RAM/NIC/Disk usage are not the only bottlenecks that can cause performance issues (in this case it was CPU ready time and Service Console swapping that told the real story)
- 80 vCPUs might not actually be possible - it will depend on the workload that is running on the host - but hey, this was a hypothetical scenario anyway.
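One last sketch on that second lesson - how to tell the configured Service Console memory apart from the value it is actually running with (hedged: the path and format are from memory and may vary between ESX 3.x builds):
free -m                              # what the Service Console is running with right now
grep -i "mem=" /boot/grub/grub.conf  # what it should come up with after the next reboot (the mem= kernel parameter)
If the two disagree, the host is still waiting for that reboot.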
Invaluable resources for troubleshooting performance:
Checking for resource starvation of the ESX Server service console
CPU Performance Analysis and Monitoring
Hope you enjoyed the ride..