2009-11-23

Disconnected ESX Host



Got a call today.

Panic!!!!

All VMs on an ESX host just went grey – all disconnected.

Troubleshooting steps:

  1. Ping ESX host Service Console – All ok
  2. Look at the server in the VI Client – NOT OK – all machines are greyed out (hey, that is what they said, wasn’t it?).
  3. SSH into the Service console - All ok
  4. Direct GUI management of the server – NOT OK – could not load the inventory.
  5. All VMs on the host were running and responding to ping.
  6. No failover was initiated in the cluster.
  7. On the console – I saw that there were 7 processes of vmware-hostd each using a lot of RAM.
  8. service mgmt-vmware stop – to stop the service. GOT STUCK
  9. Off to this KB, which helped me stop the service and get the host responsive again.

    # cd /var/run/vmware
    # ls -l vmware-hostd.PID watchdog-hostd.PID (check that the PID files exist)
    # cat vmware-hostd.PID (prints the current PID of the process, e.g. 1234)
    # kill -9 <PID> (kill the process)
    # rm vmware-hostd.PID watchdog-hostd.PID (remove the files)
    # service mgmt-vmware start (restart the agent)
  10. The host came back online – all VMs were no longer grey.
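
For anyone who wants the recovery commands above in one place, here is a rough sketch of them as a shell function. The function name, the RUN_DIR and DRY_RUN variables and the dry-run default are my own additions for illustration, not something from the KB. It only prints what it would do unless DRY_RUN is set to 0, and it assumes the PID file holds a single numeric PID.

```shell
#!/bin/sh
# Sketch only: wraps the manual steps from the post (read the PID, kill -9,
# remove the PID files, start mgmt-vmware). The names below are illustrative.

RUN_DIR="${RUN_DIR:-/var/run/vmware}"   # directory holding the PID files
DRY_RUN="${DRY_RUN:-1}"                 # 1 = only print the commands

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

force_restart_hostd() {
    pid_file="$RUN_DIR/vmware-hostd.PID"

    if [ ! -r "$pid_file" ]; then
        echo "no readable PID file at $pid_file" >&2
        return 1
    fi

    pid=$(cat "$pid_file")

    # Refuse to continue if the file does not contain a plain number.
    case "$pid" in
        ''|*[!0-9]*)
            echo "unexpected content in $pid_file" >&2
            return 1
            ;;
    esac

    run kill -9 "$pid"
    run rm -f "$RUN_DIR/vmware-hostd.PID" "$RUN_DIR/watchdog-hostd.PID"
    run service mgmt-vmware start
}
```

To use it, source the file on the Service Console and call force_restart_hostd once to review the printed commands, then set DRY_RUN=0 and call it again to actually run them.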

Here start my questions.

  1. Why did this happen?

    I went to start digging into the logs and found that there was a gap in the system logs for about 20 minutes – which is really strange.
  2. It seems this happened after a snapshot removal.
  3. I have opened an SR with VMware to get to the bottom of this issue.
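
A gap like the 20-minute one mentioned above can be spotted without scrolling through the logs by hand. The sketch below is one possible way to do it, not what I actually ran. It assumes classic syslog lines that start with "Mon DD HH:MM:SS" (as in /var/log/messages), ignores leap years and year rollover, and the function name find_log_gaps and the 600-second default threshold are made up for illustration.

```shell
#!/bin/sh
# Sketch only: report places where consecutive syslog lines are at least
# <min_seconds> apart. Usage: find_log_gaps <logfile> [min_seconds]

find_log_gaps() {
    awk -v min="${2:-600}" '
    BEGIN {
        # Month name -> month number, and days elapsed before each month
        # (non-leap year).
        split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", names, " ")
        for (i = 1; i <= 12; i++) mon[names[i]] = i
        split("0 31 59 90 120 151 181 212 243 273 304 334", cum, " ")
    }
    {
        # Skip lines that do not start with a syslog timestamp.
        n = split($3, t, ":")
        if (!($1 in mon) || n != 3) next

        # Seconds since the start of the year.
        now = (cum[mon[$1]] + $2) * 86400 + t[1] * 3600 + t[2] * 60 + t[3]
        stamp = $1 " " $2 " " $3

        if (seen && now - prev >= min)
            printf "gap of %d s: %s -> %s\n", now - prev, prevstamp, stamp

        prev = now
        prevstamp = stamp
        seen = 1
    }' "$1"
}
```

For example, find_log_gaps /var/log/messages 900 would list every stretch of 15 minutes or more with no log entries, which narrows down where to start looking.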

6 comments:

meob said...

scsi reservation conflicts possibly...

Maish said...

Thanks for the input, but this is not likely since this is a pure NFS environment.

Andrew VanSpronsen said...

Known issue. You will likely be told to increase timeout periods for tasks in VC and on the host.

I didn't believe them at first but as we rolled out the config changes the situation improved dramatically.

Maish said...

Thank you Andrew for your comment - do you have a reference link for this problem?

Ronny said...

I always see this behavior when someone commits/deletes a snapshot via VI Client or directly on the host. But I think this is not normal because the SC is still available.
Could you please post the answer from VMware?

bpace said...

I have had this issue for OVER 3 YEARS!!!!!
VMware has been working on it and keeps giving us "new" builds of VC or hostd, but the problem still exists. It is better, but it is still there.
Demand escalation and you might get some attention.