2013-01-28

Error Removing Nexus 1000V VEM

I encountered this last week and could not find any reference to my specific problem, so I am documenting it here.
I was trying to remove the Cisco Nexus 1000V VEM from the ESXi hosts in my lab.
This was the error I was getting.
Fail1
This is what I found in the esxupdate.log file:

2013-01-24T08:35:32Z esxupdate: LiveImageInstaller: DEBUG: Starting to live remove VIBs: Cisco_bootbank_cisco-vem-v147-esx_4.2.1.1.5.2b.0-3.1.1
2013-01-24T08:35:32Z esxupdate: LiveImageInstaller: INFO: Live removing cisco-vem-v147-esx-4.2.1.1.5.2b.0-3.1.1
2013-01-24T08:35:32Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/chkconfig', '-B', '/etc/chkconfig.db', '-D', '/etc/init.d', '-i', '-o']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2013-01-24T08:35:32Z esxupdate: LiveImageInstaller: DEBUG: Running [['/etc/init.d/n1k-vem', 'stop', 'remove']]...
2013-01-24T08:35:32Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/etc/init.d/n1k-vem', 'stop', 'remove']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: output: svsStop, remove
svsStopRemove
stopDpa
Stopping Cisco Nexus 1000V VEM
stopDpa
Unload N1k switch modules
Warning: /dev/char/vmkdriver/stun not found
Unload of N1k modules done.
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: Starting to run etc/vmware/shutdown/shutdown.d/*
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: Trying to unmount payload [cisco-vem-v147-] of VIB Cisco_bootbank_cisco-vem-v147-esx_4.2.1.1.5.2b.0-3.1.1
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: Unmounting cisco_ve.v00...
2013-01-24T08:35:33Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = 'rm /tardisks/cisco_ve.v00', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: output: rm: can't remove '/tardisks/cisco_ve.v00': Device or resource busy
<…truncated..>
2013-01-24T08:35:33Z esxupdate: root: ERROR: InstallationError: ([], "Error in running rm /tardisks/cisco_ve.v00:\nReturn code: 1\nOutput: rm: can't remove '/tardisks/cisco_ve.v00': Device or resource busy\n\nIt is not safe to continue. Please reboot the host immediately to discard the unfinished update.")
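
The key line in that log is the failed rm on the tardisk. As a minimal sketch (not part of my original troubleshooting), the following Python pulls the busy paths out of esxupdate.log text. The function name find_busy_tardisks and the SAMPLE variable are my own hypothetical names; the sample lines are copied from the excerpt above.

```python
import re

# Matches the busybox rm failure seen in the log excerpt above.
BUSY_RE = re.compile(r"can't remove '([^']+)': Device or resource busy")


def find_busy_tardisks(log_text):
    """Return the unique paths that rm reported as busy, in first-seen order."""
    seen = []
    for match in BUSY_RE.finditer(log_text):
        path = match.group(1)
        if path not in seen:
            seen.append(path)
    return seen


# Two lines taken from the esxupdate.log excerpt in this post.
SAMPLE = """\
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: Unmounting cisco_ve.v00...
2013-01-24T08:35:33Z esxupdate: LiveImageInstaller: DEBUG: output: rm: can't remove '/tardisks/cisco_ve.v00': Device or resource busy
"""

print(find_busy_tardisks(SAMPLE))
```

If this reports a path under /tardisks, something on the host still has that payload in use, which is what the rest of this post tracks down.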

I searched the web and came across a post, "Problems with uninstalling Nexus 1000v VEM VIB", which suggested that the VEM might still be running.

So I checked that as well. Here we can see that the VEM is still running:
vem status
Fail2
Even after I stopped the VEM, the VIB could not be removed. Maybe the modules were still loaded?
Fail3
Still no go…

I then came across these two KB articles:

"The vem-swiscsi process fails to exit even when no Software iSCSI device is found" and "High CPU and memory utilization by the vem-swiscsi process"

Neither applied to my versions of ESXi or the Cisco modules, but they still led me to the right solution.

I checked to see if I had any vem* processes still running.
lsof
I then killed the processes.
kill process
The removal was successful.
Success!
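
The check for leftover processes can be sketched in a few lines of Python. This is an illustration under assumptions, not what I ran on the host: the function name vem_process_lines is hypothetical, and the sample listing is invented (the real column layout of the process listing on your ESXi build may differ), so the filter only looks for any field whose basename starts with "vem".

```python
import os


def vem_process_lines(ps_output, prefix="vem"):
    """Return the lines of a process listing where some field's basename starts with prefix."""
    hits = []
    for line in ps_output.splitlines():
        fields = line.split()
        if any(os.path.basename(field).startswith(prefix) for field in fields):
            hits.append(line.strip())
    return hits


# Invented sample listing; IDs and column layout are assumptions for illustration only.
SAMPLE_PS = """\
2101 2101 hostd
2345 2345 vem-swiscsi
2400 2400 /sbin/vemdpa
"""

for line in vem_process_lines(SAMPLE_PS):
    print(line)
```

Any line this prints is a candidate for a manual kill before retrying the VIB removal.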

The vem-swiscsi process was not killed properly when I stopped the VEM (or removed the modules). I assume this is a bug that has been reintroduced since 4.2(1)SV1(5.1).

The Release Notes for Release 4.2(1)SV1(5.1) say that these bugs were resolved:
17. CSCtl21012 The vem-swiscsi process fails to exit when no "Software iSCSI" device is found.
44. CSCtr83664 The vem-swiscsi process fails to exit when no "Software iSCSI" device is found.

In short: if you cannot remove the Cisco VEM from an ESXi host, check that no vem processes are still running, because they will prevent you from removing the module.

I would also like to thank Frank Denneman for his very useful post on Removing orphaned Nexus DVS.