What troubleshooting tools are you familiar with for monitoring your ESX host? I am sure that one of the first things that pops into your mind is, of course, ESXTOP.
ESXTOP provides a great deal of information about the performance of your host: network, storage, VMs, you name it.
The next thing that pops into my head is logs.
There are numerous logs on an ESX host: hostd, vmkernel, vmkwarning, aam, and I am sure that I am leaving out a number of others as well. To find out what is going wrong you will have to go through the logs and troubleshoot the issue.
It is a bit like looking for a needle in a haystack. Troubleshooting is an art. Sometimes it is luck - sometimes it is just knowing where to look for the right things, and they will pop out at you.
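To make the haystack a little smaller, a first pass over a log can simply count how many lines mention the usual suspects. What follows is a minimal sketch, not a definitive tool: the keyword list and the function names `count_keywords` and `scan_log` are my own assumptions, and it makes no assumption about the log format beyond it being plain text.

```python
import re
from collections import Counter
from pathlib import Path

# Hypothetical keywords; adjust to whatever matters in your environment.
KEYWORDS = ("error", "warning", "failed", "timeout")


def count_keywords(lines, keywords=KEYWORDS):
    """Count how many lines mention each keyword (case-insensitive)."""
    pattern = re.compile("|".join(re.escape(k) for k in keywords), re.IGNORECASE)
    counts = Counter()
    for line in lines:
        # A line counts once per distinct keyword it contains.
        for hit in {m.group(0).lower() for m in pattern.finditer(line)}:
            counts[hit] += 1
    return counts


def scan_log(path, keywords=KEYWORDS):
    """Read one plain-text log file and return its keyword counts."""
    with Path(path).open(errors="replace") as handle:
        return count_keywords(handle, keywords)
```

Point `scan_log` at a copy of whichever log you pulled off the host, and compare the counts between a healthy day and a bad day. It will not tell you what is wrong, but it can tell you which log deserves a closer read.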
Let me give an example. I was asked the other day to check why some computers running continuous integration tests kept freezing periodically, each time after a few days of running. Looking at Task Manager, there was no problem with RAM, CPU, disk, or network on the OS. Now, I have seen more than once a Windows OS lock up because of a huge number of open handles from a single process clogging up the system. Lo and behold, the process they were using had approximately 300 times more handles than it should have been using. But how did I find this? Using Task Manager, which gave me enough information about what was happening.
Let's get back on track. Back to vCenter. What troubleshooting tools are there for vCenter?
- Tasks and Events - which can give you a great deal of information about what is happening on your vCenter server.
- Logs again, under \ProgramData\VMware\VMware VirtualCenter\Logs
But then again, do we really know what is happening under the covers?
"I'm looking for a script to kill a hung vCenter task."
"stuck Task in Virtual Center"
"Cancelled task stuck in "Recent Tasks" area"
There are many, many more like these - and what is usually the answer to all of these problems?
Yep, you guessed it - restart the vCenter service.
OK, I do have to admit - we are talking about a Windows application here, and this usually solves the problem.
But there are so many other things that I would like to see.
- Why did an alarm not get triggered?
- Why did it change state when it should not have?
- Why was my VM migrated to this host and not that host?
- What configuration changes were made to which VM?
- Why is Linked Mode not working?
- Is my vCenter so swamped with requests that it cannot keep up?
- Where are these requests coming from?
These are all things that are not really available today.
One of the questions that should be asked is why you need additional tools to give you this info - I mean, most of these issues are sorted out by a restart of the service. The answer to that is actually very simple.
More and more components are hooking into vCenter.
- Your storage plug-ins
- Backup (SMVI for example)
- Linked Mode
- Application Discovery Manager
- Change Insight
- Future VMsafe products
- Cloud services that are coming.
If you have not realized it by now, vCenter is a critical component of your virtual environment. Almost everything hooks into it, and the more components that interact with it, the more insight you will need into vCenter to see how it is performing.
I think it is time to start thinking in this direction.
I would be very interested to hear your thoughts on the subject.