This was a weird one that hit me today.
I had a performance issue on a server.
esxtop was the first thing I looked at, and I got this:
ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY
69 69 VSE 5 83.98 85.74 0.00 380.16 22.14
So I looked to see which machine it was:
[root@dmz1 root]# vmware-cmd -l | grep VSE
[root@dmz1 root]#
And the result I got was nuddah!!
So next I tried:
[root@dmz1 root]# vm-support -x | grep VSE
vmid=1428 VSE
root 4426 1 0 Feb23 ? 00:00:06 /usr/lib/vmware/bin/vmkload_app …… …… a/VSE/VSE.vmx
root 4476 3756 0 12:27 pts/2 00:00:00 grep VSE
So there was a running VM – or so it seemed.
I ran the same steps on the other host in the cluster:
ID GID NAME NWLD %USED %RUN %SYS %WAIT %RDY
59 59 VSE 5 11.74 11.96 0.00 487.23 6.94
[root@dmz2 root]# vm-support -x | grep VSE
vmid=1410 VSE
[root@dmz2 root]# vmware-cmd -l | grep VSE
[root@dmz2 root]#
OK, so what was going on here? Looking at the details of the machine, I saw that the name of the VM had no correlation to the actual folder it was in.
Looking for the machine again, this time by its folder name:
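For anyone who wants to check this kind of mismatch from the service console: the name a VM shows up under is stored in the displayName entry of its .vmx file, not in the folder name. Below is a minimal sketch for finding which registered .vmx carries a given display name. The function names vmx_display_name and find_vmx_by_name are hypothetical helpers, and the sketch assumes the entry is written as displayName = "..." with double quotes.

```shell
#!/bin/sh
# Hypothetical helper: print the display name stored in a .vmx file ($1).
# Assumes a line of the form: displayName = "Some Name"
vmx_display_name() {
    sed -n 's/^[[:space:]]*displayName[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p' "$1"
}

# Hypothetical helper: read .vmx paths on stdin (one per line) and print
# those whose display name equals $1. Unreadable paths are skipped.
find_vmx_by_name() {
    while IFS= read -r vmx; do
        [ -f "$vmx" ] || continue
        if [ "$(vmx_display_name "$vmx")" = "$1" ]; then
            printf '%s\n' "$vmx"
        fi
    done
}

# Intended use on a host (not run here):
#   vmware-cmd -l | find_vmx_by_name VSE
```

This only covers VMs that are registered on the host, since it works from the vmware-cmd -l list.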
[root@dmz2 root]# vmware-cmd -l | grep CSG1
/vmfs/volumes/…………a/CSG1/CSG1.vmx
[root@dmz1 /]# vmware-cmd -l | grep CSG1
[root@dmz1 /]#
OK. So now I had found the machine named VSE running on dmz2, but I still had a process on dmz1 that was taking up CPU.
[root@dmz1 /]# ps -efww | grep VSE
root 4426 1 0 Feb23 ? 00:00:06 /usr/lib/vmware/bin/vmkload_app …… …… a/VSE/VSE.vmx
root 4476 3756 0 12:27 pts/2 00:00:00 grep VSE
I looked into the folder itself
[root@dmz1 /]# ls -la /vmfs/volumes/……a/VSE/
total 23413952
drwxr-xr-x 1 root root 980 Mar 17 11:39 .
drwxr-xr-t 1 root root 2380 Mar 15 11:21 ..
-rw------- 1 root root 2510 Mar 17 11:35 vmdumper.png
-rw------- 1 root root 23573652480 Feb 23 23:53 VSE_1-flat.vmdk
-rw------- 1 root root 268435456 Feb 23 23:53 VSE-6785c36f.vswp
-rw------- 1 root root 131604480 Feb 23 23:53 VSE-flat.vmdk
-rwxr-xr-- 1 root root 1960 Feb 24 02:04 VSE.vmx
[root@ilesxdmz1 /]#
As you can see, all the files were old, and this looked like a phantom machine.
Time to kill the process on dmz1
I had the wid (World ID) from before, the vmid reported by vm-support -x: 1428
[root@dmz1 /]# less /proc/vmware/vm/1428/cpu/status
The master world ID for this process appears in the output right after "vm." (vm.XXXX).
The XXXX is the four-digit ID; in my case it was 1427.
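If you would rather not eyeball the output, the digits after "vm." can be pulled out with sed. This is only a sketch: the function name master_world_id is hypothetical, and since the exact layout of the status file is not shown here, it assumes nothing beyond the ID appearing as vm.<digits> somewhere on a line.

```shell
#!/bin/sh
# Hypothetical helper: read status text on stdin and print the digits that
# follow the first line containing "vm.<digits>".
master_world_id() {
    sed -n 's/.*vm\.\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Intended use on a host (not run here):
#   master_world_id < /proc/vmware/vm/1428/cpu/status
```

Check the value by eye before feeding it to a kill command.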
Then kill the process
[root@dmz1 /]# /usr/lib/vmware/bin/vmkload_app -k 9 1427
Warning: Mar 17 12:37:04.706: Sending signal '9' to world 1427.
The process was gone and no longer using a full processor on nothing.
[root@dmz1 /]# ps -efww | grep VSE
root 4785 3756 0 12:37 pts/2 00:00:00 grep VSE
Just to be on the safe side, I took a vm-support snapshot of the VMID before killing the process; maybe I can find out something about the problem later on.
How the phantom happened I am still not sure. What worries me more is how this can be detected in the future, so that I do not have to wait for a problem to arise to find these things out.
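One possible approach, offered as an untested sketch rather than a proven fix: compare the .vmx paths of the running vmkload_app processes with the list of registered VMs, and flag anything that is running but not registered. On dmz1 that would have flagged the VSE.vmx process, since vmware-cmd -l did not list it. The function names are hypothetical, and the sketch assumes that the path in the ps output starts with /vmfs/ and is written the same way as in the vmware-cmd -l output.

```shell
#!/bin/sh
# Hypothetical helper: read `ps -efww` output on stdin and print the .vmx
# path of every vmkload_app process (assumes the path starts with /vmfs/).
running_vmx_paths() {
    sed -n '/vmkload_app/ s|.*\(/vmfs/.*\.vmx\).*|\1|p' | sort -u
}

# Hypothetical helper: print running .vmx paths that are not registered.
#   $1: file holding saved `ps -efww` output
#   $2: file holding saved `vmware-cmd -l` output
phantom_vmx_paths() {
    tmp_run=$(mktemp) || return 1
    tmp_reg=$(mktemp) || { rm -f "$tmp_run"; return 1; }
    running_vmx_paths < "$1" > "$tmp_run"
    grep '\.vmx$' "$2" | sort -u > "$tmp_reg"
    # comm -23 keeps lines that appear only in the first (running) list
    comm -23 "$tmp_run" "$tmp_reg"
    rm -f "$tmp_run" "$tmp_reg"
}

# Intended use on a host (not run here):
#   ps -efww > /tmp/ps.out
#   vmware-cmd -l > /tmp/registered.out
#   phantom_vmx_paths /tmp/ps.out /tmp/registered.out
```

Run from cron on each host, any output would be worth a closer look. A VM caught in the middle of a migration might show up as a false positive, so treat a hit as a prompt to investigate, not as proof.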
I would be interested in hearing your comments or suggestions as to how to address the above question.