2010-09-21

Murphy's Law

Let me introduce you to a good old soul by the name of Edward Murphy. Who was Edward Murphy?
In short, he was responsible for the famous saying, "Anything that can go wrong, will go wrong".

And so the story goes.

Before leaving, I went over with the rest of my team what should be done and how it should be done, and reviewed all the open issues that needed to be taken care of while I was away. I thought, "Great! Finally some quiet, relaxing time while I am on vacation".
But our dear old soul Murphy? Nuhuh, he had other plans …

I received an email 6 hours after I had flown out of the country, saying that a number of VMs had been deleted.

Deleted? Deleted? How? What? When? Why???????????????????????

So what happened? User error. Plain and simple. Someone on the storage team wanted to delete the snapshots on an NFS qtree but, by mistake, deleted the volume instead. And at the push of a button, boom! 40 VMs were gone!

[Image. Source: Deepspar]

A great many SMS alerts, email notifications, and several phone calls later, the error was identified.

Now we were left with 40 orphaned VMs in the infrastructure.

By the way - there is no undo button here.

So what did I get from this experience?

  1. My boss has already told me that the next time I can take a vacation is sometime in 2020 :)
  2. No matter how much redundancy you plan for (RAID groups, storage processors, disks, network), there are always unknown things that catch you with your pants down.
  3. Human error accounts for a good percentage of outages. Mistakes can be made, mistakes will be made. You can cover 99% of the cases, but it will always be that last 1% that gets you.
  4. Backup - Backup - BACKUP.
  5. Restoring from backup can take a while. Quite a while.

Back to my vacation…

2 comments:

Hayes Whitt said...

I had a similar experience, once. And once they are gone, forget it, get your backup. This issue is one of the reasons a lot of IT folks poo-poo virtualization in general. I mean, it's pretty hard to delete 40 bare metal machines. But restoring from backup is the last resort, at 8-10 hours.

What did you recover with, Maish, a tape? You have the pro VM stuff for sure, but I have been using Trilead VM Explorer with good success. It backs up live VMs and can connect to a Unix server.

If you don't mind me asking... what kind of machines were deleted? Servers or clients? I have had very good success deploying ultra low profile dedicated virtual servers, which restore from backup in minutes with Veeam FastSCP or even rsync.

I recommend that all "Delete" operations require a written Method of Procedure (MoP) that describes the operation in detail, step by step, with initialed check boxes. The first one being, "Ensure Maish is not on vacation before proceeding."

Maish said...

The machines were recovered from tape backup.
These were a mixture of clients and servers.
I am flattered but I do not want to put my name in any of the company procedures :)
The machines were deleted by a member of the storage team, from the backend storage. This team did not even have the appropriate rights to delete these VMs from the vCenter interface.