Let me introduce you to a good old soul by the name of Edward Murphy. Who was Edward Murphy?
In short, he was responsible for the famous saying, "Anything that can go wrong, will go wrong."
And so the story goes.
After going over with the rest of my team what should be done and how it should be done, and reviewing all the open issues that needed to be taken care of while I was away, I thought, "Great! Finally some quiet, relaxing time while I go on vacation."
But our dear old soul Murphy? Nuhuh, he had other plans …
I received an email 6 hours after I had flown out of the country - saying that the following VMs had been deleted.
Deleted? Deleted? How? What? When? Why???????????????????????
So what happened? User error. Plain and simple. Someone on the storage team wanted to delete the snapshots on an NFS qtree, but by mistake, instead of deleting the snapshots, they deleted the volume. And at the push of a button, Boom! 40 VMs were gone!
A great many SMS alerts, email notifications and several phone calls later, the error was identified.
Now we were left with 40 orphaned VMs in the infrastructure.
By the way - there is no undo button here.
So what did I get from this experience?
- My boss has already told me that the next time I can take a vacation is sometime in 2020 :)
- No matter how much redundancy you plan for (RAID groups, storage processors, disks, network), there are always unknown things that catch you with your pants down.
- Human error accounts for a good percentage of outages. Mistakes can be made, mistakes will be made. You can cover 99% of the cases - but it will always be that 1% that gets you.
- Backup - Backup - BACKUP.
- Restoring from backup can take a while. Quite a while.
Back to my vacation…