Completing the Missing Piece in the VMware HA Puzzle

I cannot remember the number of times I have been saved by VMware High Availability. Protecting an application in my datacenter from hardware failure has never been easier. Just put your VM in a VMware HA cluster and bam! You are done. It is that easy.

And for your more critical VMs you have Fault Tolerance. Granted, today that will only help with your low-hanging-fruit VMs that have 1 vCPU – but that will most probably change in an upcoming version.

But what is still missing? I feel that piece is clustering at the application level.

Let me explain what I mean.

VM HA monitoring will restart your VM only when it loses the heartbeat from VMware Tools inside the guest – and restarting the VM is all it can do. This covers the case where your VM has blue-screened or panicked. HA will pick up on this and restart the VM for you.

Next is Symantec ApplicationHA. They go a step further – utilizing the vCenter API they will monitor a specific process inside the VM (there is even built-in support for certain applications already) – and if that process is not running, it can issue a command to the guest OS to restart the service X number of times, and if that does not help, it will restart the VM. This is a step forward – but this is still not where I want to get to. (Just to clarify – this is not a free product and is priced on a per-VM basis.)

Take the following scenario, where we will see that neither of the solutions above helps me.

I have a SQL database which I need to make highly available. None of the solutions above will help me out in the case that SQL decides to go “belly up”. Say for some reason my sqlservr.exe process decided to go on a trip and stopped working, because someone did something really, really bad to the master database or there was a bug in the software.

My OS is up and running – so VM HA monitoring will not help here and will not even kick in at all. From its perspective the VM is fine. VMware Tools is running and responding. The OS is responding. There is activity on both the network and the disk. So there is no reason to perform any action to restart the VM.

Let’s see what would happen with Symantec ApplicationHA. If my sqlservr.exe process stopped, it would restart it – but my master database is now mush, so it will not start. It will retry the restart X number of times and fail each time. The next plan of action (by default) is to restart the VM. Needless to say, if my VM had something else running on it besides SQL (and not dependent on SQL) then whoa!!! I just lost that service as well, because the VM was restarted. I am pretty sure, though, that ApplicationHA can be configured not to restart the VM if need be. But of course the restart does not help anyway – remember the mush?

So how does one protect themselves from such a failure? By using application clustering software. If I put SQL into a Microsoft Cluster and my one node does go down, it will fail over to node 2.

Same with Oracle, Red Hat, Veritas, etc.

Ok Maish – What is your point???

What I would like to see in the future is the following. I do not actually know if this is (technologically) possible or not.

My wish is for VMware HA to handle the application layer as well.

Take the following scenario. I have an application that needs to run in an active-standby model. Let us take a simple webserver for example. (I am explicitly not taking a database application – but perhaps this model could be adapted for that as well.)

HostA is running GuestA, which in turn is running an http service. This provides service for a web site. The configuration/data files for this service are on a shared location. I cannot afford to have this VM go down – if it does, I lose $$$$ per minute. Therefore on HostB I put GuestB, which has the same configuration and the same access to the same shared location – but the http service is not started (active-standby). The only way today that I can fail over the service from GuestA to GuestB is by putting that service under an application cluster and configuring it so that GuestB will take over from GuestA in the case of a failure.

What if VMware HA could take care of this for me?

Let’s look at how deep VMware can actually see into the OS today.

Guest up/down – Available today
Process up/down – Available today – with the use of the API.
Group of processes (a service) – surprisingly enough – yes, it is available. VMware Infrastructure Navigator can identify services and relationships between applications without the need to install any agent on the VM – just using VMware Tools (and, I gather, some closed API as well).

I would find it highly beneficial if I could define a rule in my HA settings – that would say:

GuestA and GuestB should never be on the same host (available today)
GuestA and GuestB are part of an application cluster
The cluster model is Active/Passive
The processes / services that are part of the cluster are A, B, C
When processes A, B, C stop responding:

Trigger an alarm
Try X number of times to restart the process
If not successful trigger an alarm
Start Process A,B,C on the second node in the cluster (failover to another node)
If successful ensure that the process does not come up on GuestA
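
To make the wish list above a little more concrete, here is a minimal sketch of how such a rule might behave. To be clear, nothing here is a VMware API – Guest, FailoverRule, start_service and blocked are all hypothetical names I made up for illustration, and the start_service hook simply stands in for "whatever mechanism would start processes A, B, C inside the guest".

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Guest:
    """A VM in the hypothetical application cluster."""
    name: str
    host: str
    # Hypothetical hook: returns True if the clustered service came up.
    start_service: Callable[[], bool]
    # True means the service must not be started on this guest.
    blocked: bool = False


@dataclass
class FailoverRule:
    """Hypothetical active/passive HA rule (not a real VMware feature)."""
    active: Guest
    standby: Guest
    processes: List[str]
    max_restarts: int = 3
    alarms: List[str] = field(default_factory=list)

    def validate(self) -> None:
        # Anti-affinity: the two guests should never be on the same host.
        if self.active.host == self.standby.host:
            raise ValueError("active and standby guests share a host")

    def on_failure(self) -> Guest:
        """Run the escalation steps; return the guest now serving."""
        self.validate()
        self.alarms.append(f"{self.processes} stopped on {self.active.name}")
        # Try X number of times to restart the processes in place.
        for _ in range(self.max_restarts):
            if self.active.start_service():
                return self.active
        self.alarms.append(
            f"{self.max_restarts} restarts failed on {self.active.name}"
        )
        # Fail over: start the processes on the second node.
        if self.standby.blocked or not self.standby.start_service():
            raise RuntimeError("failover to standby failed")
        # Make sure the service does not come back up on the old node.
        self.active.blocked = True
        self.active, self.standby = self.standby, self.active
        return self.active


if __name__ == "__main__":
    guest_a = Guest("GuestA", "HostA", start_service=lambda: False)
    guest_b = Guest("GuestB", "HostB", start_service=lambda: True)
    rule = FailoverRule(guest_a, guest_b, ["A", "B", "C"])
    print(rule.on_failure().name)
    print(rule.alarms)
```

In this toy run GuestA never manages to restart the service (think of the "mush" scenario), so the rule raises its two alarms, starts the service on GuestB and marks GuestA as blocked. The hard parts – how the hooks actually reach into the guest, and how to avoid both nodes running the service at once – are exactly what I would hope HA could take care of.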

I do admit that this is a primitive example, and granted it will not be as simple as the steps above – but for the end user this would be an amazing benefit if this functionality were added to HA.

No more having to worry about an application cluster solution – and if packaged well enough could even be encapsulated in such a way that the end user will not even care what the guest OS is. They could save on expensive licenses needed for their clusters.

What do you say? Is this something that you could use? Would it be beneficial to you? Am I way off?

Please feel free to add your comment below.

2012-07-02
