If you were following me on Twitter, you would have noticed that I was extremely busy this week troubleshooting and solving a serious performance issue.
First things first - the environment.
Multiple ESX 3.5 clusters residing on NFS, across multiple datastores, all coming from the same storage array.
On average these hosts are utilizing 50-70% RAM and 20-30% CPU. The machines run without any noticeable issues.
Next - the incident.
Along comes 05:00 - alerts start coming in from the monitoring system. The monitoring agents were timing out, i.e. it looked like the virtual machines were not responding. Within 20 minutes things were back to normal.
08:00 - the same thing happens again. While this was happening I tested the guest machines for connectivity - all 100%. I tried to log into a virtual machine with RDP - slow as a snail. It took approximately 3 minutes from CTRL+ALT+DELETE until I got the desktop, and all this time network connectivity to the VM was 100%.
I started to receive more complaints of the same issue across a number of virtual machines. The first thing I did was to try to find what (if anything) they had in common. The machines were spread over different hosts, in different clusters, on different VLANs, so that was not it.
During these outages the Hosts themselves were completely fine.
- CPU usage was normal
- RAM usage was normal
- Network usage was normal
- ESXTOP statistics were normal - no contention on CPU / memory
Now what the heck was going on here?
The only thing that was common amongst all the Virtual machines - they were all using NFS datastores - divided over 3 different Datamovers on the same EMC Storage array.
The outages were intermittent and not permanent.
Logs were collected.
Tests were performed to test network issues with the ESX hosts.
In the meantime - we tried to see if anything was wrong with the network infrastructure - no issues at all. Throughput on the ports using the NFS datastores was well below normal, and the virtual machine network was not suffering under any kind of load either.
Again all fingers were pointing to the Storage Array.
There was a slight amount of stress on the storage array - this we found with the help of EMC (who also got a priority 1 call the same time as VMware) but nothing to be highly worried about.
OK - so how do you measure NFS throughput on the ESX side? Unfortunately this is not so simple. Unlike disk throughput on iSCSI / SAN, which can be measured relatively easily with the performance charts / ESXTOP, there are no metrics for disk performance when it comes to NFS datastores. The only thing you can check is vmkernel throughput.
Using ESXTOP -> n:ESX nic -> T to sort by megabits tx (I truncated the data a bit to make it presentable):
PORT ID   USED BY             DNAME     PKTTX/s  MbTX/s  PKTRX/s  MbRX/s  %DRPTX  %DRPRX
33554433  vmnic2              vSwitch1   195.50    2.03   118.83    0.11    0.00    0.00
33554436  vmk-tcpip-1.1.x.xx  vSwitch1   195.50    2.03   118.26    0.11    0.00    0.00
The vmk-tcpip entry is the VMkernel interface, and the numbers next to it are its network traffic. The utilization of this port never got over 2-4 Mb/s - which is nothing.
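To make that check repeatable, here is a minimal Python sketch that picks the VMkernel row out of the esxtop network view and adds up its throughput. It only uses the two sample rows shown above. The parsing rules are my assumptions: whitespace-separated columns in the order of the header line, no spaces inside the USED BY value, and a VMkernel port name that starts with `vmk-tcpip`. The function names are made up for illustration.

```python
# Sample rows copied from the esxtop output above (header line left out).
SAMPLE = """\
33554433 vmnic2 vSwitch1 195.50 2.03 118.83 0.11 0.00 0.00
33554436 vmk-tcpip-1.1.x.xx vSwitch1 195.50 2.03 118.26 0.11 0.00 0.00
"""

# Names for the six numeric columns, in the order of the header line.
FIELDS = ("pkttx", "mbtx", "pktrx", "mbrx", "drptx", "drprx")


def parse_row(line):
    """Split one esxtop network row into a dict (assumes no spaces in USED BY)."""
    parts = line.split()
    row = {"port_id": parts[0], "used_by": parts[1], "dname": parts[2]}
    row.update(zip(FIELDS, (float(x) for x in parts[3:9])))
    return row


def vmkernel_rows(text):
    """Return only the rows that belong to a VMkernel TCP/IP port."""
    rows = [parse_row(line) for line in text.splitlines() if line.strip()]
    # Assumption: the VMkernel port is the one whose USED BY starts with "vmk-tcpip".
    return [r for r in rows if r["used_by"].startswith("vmk-tcpip")]


for row in vmkernel_rows(SAMPLE):
    total = row["mbtx"] + row["mbrx"]
    print(f'{row["used_by"]}: {total:.2f} Mb/s (tx + rx), '
          f'dropped tx {row["drptx"]}%, rx {row["drprx"]}%')
```

This says nothing about NFS latency, only how much traffic the VMkernel port is moving, which is exactly the limitation described above.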
In the meantime we started to receive more complaints about regular NFS mounts (not connected to our virtual infrastructure) that were performing slowly. In addition, other servers connected directly to the SAN were suffering as well.
Again all pointed to the storage.
One more thing.
NFS (like iSCSI) uses the vmkernel - so where would you look for issues if that were the case?
If you said /var/log/vmkernel - you were right!
From the log - during these outages entries similar to this were present
xxxxx vmkernel: 133:06:49:16.958 cpu6:1724)VSCSIFs: 441: fd 258193 status No connection
No connection? No connection? Datastore not responding - Storage anyone?
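If you want to see how often (and when) these entries show up, a small Python sketch like the one below can pull them out of the log. The regular expression is an assumption built from the single log line quoted above, so other ESX builds may format the line differently. The function name and group names are hypothetical.

```python
import re

# Pattern assumed from the one log line quoted above; adjust it if your log differs.
NO_CONN = re.compile(
    r"vmkernel: (?P<uptime>[\d:.]+) (?P<cpu>cpu\d+):(?P<ctx_id>\d+)\)"
    r"VSCSIFs: \d+: fd (?P<fd>\d+) status No connection"
)


def find_no_connection(lines):
    """Return one dict per 'status No connection' entry found in the given lines."""
    hits = []
    for line in lines:
        match = NO_CONN.search(line)
        if match:
            hits.append(match.groupdict())
    return hits


# The sample entry from the log above, plus one unrelated made-up line.
SAMPLE = [
    "xxxxx vmkernel: 133:06:49:16.958 cpu6:1724)VSCSIFs: 441: fd 258193 status No connection",
    "xxxxx vmkernel: some unrelated message (made up for this example)",
]

for hit in find_no_connection(SAMPLE):
    print(f'{hit["uptime"]} {hit["cpu"]} fd {hit["fd"]}: No connection')
```

On a real host you would pass an open file handle for /var/log/vmkernel instead of SAMPLE, then compare the timestamps of the hits with the times of the outages.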
After putting 2+2 together - and getting a big headache - we all knew it was a storage issue.
We sat on EMC's head to solve it.
They did. It turned out that an application connected to a LUN on the storage array (not my LUN) had malfunctioned - and was driving its LUN at 100% utilization more than 90% of the time.
Why this affected the rest of the storage - we will hear back from EMC after they complete the root cause analysis on the issue. But as soon as the rogue application was stopped - like magic, all returned to normal. Measures have been taken to alert us of such issues on the storage array in future.
So what did I learn from this experience?
- Why were the machines still responding - even though the storage was not working properly? My theory on this is as follows. The network was working fine. The machines responded slowly when you tried to log in. What happens when you log in? You load up a user profile - which is on the vmdk - which in turn was on the NFS share - which was as slow as a snail. So it was logical that the slowness showed up at login: the network was fine, but the disk was performing badly.
- NFS throughput is not something that VMware can present easily to the administrator for troubleshooting. There are no disk counters for VMs on an NFS datastore. Disk performance on the ESX host does not include NFS traffic. I find this is something that VMware has to improve on - since more and more shops are starting to use NFS by default. If they provide the statistics for iSCSI / Fibre Channel - then there is no reason they should not do it for NFS.
- An assumption was made that the storage array was the component least likely to fail in the whole chain of the virtual infrastructure.
In the ESX server - we have 2 disks / 2 CPUs / 2 power supplies / at least 2 NICs - all to protect from a single point of failure.
The Network cards are connected redundantly to the Network Infrastructure - to protect from a single point of failure.
The ESX Servers were connected to the storage array through 3 different Datamovers - to protect from a single point of failure.
But all in all the storage was the point of failure here.
The storage is shared with other applications and not dedicated to Virtualization - this has its ups and downs.
So now all is calm and well - and I can start up solitaire on my Windows servers within a few seconds from the time I press CTRL+ALT+DEL - so I am happy :)
What I do like about instances like these - is that things that should not / cannot happen (in theory) actually do (in reality) - and when they do, it is a great learning experience, which only makes me want to improve and provide an even higher level of performance / availability.
Hope you all enjoyed the ride!