2012-07-02

Completing the Missing Piece in the VMware HA Puzzle

I cannot remember the number of times I have been saved by VMware High Availability. Protecting an application in my datacenter from hardware failure has never been easier. Just put your VM in a VMware HA cluster and Bam! You are done. It is that easy.

And for your more critical VMs you have Fault Tolerance. Granted, today that will only help you with your low-hanging-fruit VMs that have 1 vCPU, but that will most probably change in an upcoming version.

But what is still missing? I feel that piece is clustering at the application level.

Let me explain what I mean.

VM HA monitoring will only restart your VM when it loses the heartbeat from VMware Tools in the guest. Restarting the VM is all it can do. This covers the case where your VM has blue-screened or panicked: HA will pick up on this and restart the VM for you.

Next is Symantec ApplicationHA. They go a step further: utilizing the vCenter API, they monitor a specific process inside the VM (support for certain applications is already built in). If that process is not running, ApplicationHA can issue a command to the guest OS to restart the service X number of times, and if that does not help, it will restart the VM. This is a step forward, but it is still not where I want to get to. (Just to clarify: this is not a free product, and it is priced on a per-VM basis.)

Take the following scenario where we will see that neither of the solutions above help me.

I have a SQL database which I need to make highly available. None of the solutions above will help me out if SQL decides to go “belly up”. Say that, for some reason, my sqlservr.exe process decided to go on a trip and stopped working, because someone did something really, really bad to the master database or there was a bug in the software.

My OS is up and running, so VM HA monitoring will not help here and will not even kick in at all. From its perspective the VM is fine: VMware Tools is running and responding, the OS is responding, and there is activity on both the network and the disk. So there is no reason to take any action to restart the VM.
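To make that point concrete, here is a minimal Python sketch of the decision as I understand it. The names (`VmObservation`, `should_restart_vm`) and the boolean flags are my own simplification for illustration; this is not VMware's implementation or API. Notice that the state of the application is not an input at all.

```python
from dataclasses import dataclass


@dataclass
class VmObservation:
    """What the monitoring layer can see from outside the guest (hypothetical model)."""
    heartbeat_ok: bool     # VMware Tools heartbeat is being received
    disk_io_seen: bool     # recent disk activity observed
    network_io_seen: bool  # recent network activity observed


def should_restart_vm(obs: VmObservation) -> bool:
    """Restart only when the heartbeat is lost and the VM also shows no I/O."""
    if obs.heartbeat_ok:
        return False
    return not (obs.disk_io_seen or obs.network_io_seen)


# SQL has died inside the guest, but the OS and VMware Tools are healthy.
sql_down_os_up = VmObservation(heartbeat_ok=True, disk_io_seen=True, network_io_seen=True)
# The guest has blue-screened: no heartbeat and no I/O.
guest_crashed = VmObservation(heartbeat_ok=False, disk_io_seen=False, network_io_seen=False)

print(should_restart_vm(sql_down_os_up))  # False
print(should_restart_vm(guest_crashed))   # True
```

A dead SQL process inside a healthy guest looks exactly like a healthy VM to this check, which is why nothing happens.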

Let’s see what would happen with Symantec ApplicationHA. If my sqlservr.exe process stopped, it would restart it, but my master database is now mush, so it will not start. ApplicationHA will retry the restart X number of times and fail each time. The next plan of action (by default) is to restart the VM. Needless to say, if my VM had something else running on it besides SQL (and not dependent on SQL) then whoa! I just lost that service as well, because the VM was restarted. I am pretty sure, though, that ApplicationHA can be configured not to restart the VM if need be. But of course the restart does not help anyway. Remember the mush?

So how do you protect yourself from such a failure? By using application clustering software. If I put SQL into a Microsoft cluster and one node goes down, it will fail over to node 2.

The same goes for Oracle, Red Hat, Veritas, etc.

Ok Maish – What is your point???

What I would like to see in the future is the following. I do not actually know if this is (technologically) possible or not.

My wish is for VMware HA to handle the application layer as well.

Take the following scenario. I have an application that needs to run in an active-standby model. Let us take a simple web server as an example. (I am explicitly not taking a database application, but perhaps this model could be adapted for that as well.)

HostA is running GuestA, which in turn is running an http service. This provides service for a web site. The configuration/data files for this service are in a shared location. I cannot afford to have this VM go down; if it does, I lose $$$$ per minute. Therefore on HostB I put GuestB, which has the same configuration and the same access to the same shared location, but the http service is not started (active-standby). The only way today that I can fail over the service from GuestA to GuestB is by putting that service under an application cluster and configuring it so that GuestB will take over from GuestA in the case of a failure.

What if VMware HA could take care of this for me?

Let’s look at how deep VMware can actually see into the OS today.

  • Guest up/down – Available today
  • Process up/down – Available today, through the API.
  • Group of processes (a service) – Surprisingly enough, yes, this is available too. VMware Infrastructure Navigator can identify services and the relationships between applications without installing any agent on the VM, using just the VMware Tools
    (and, I gather, some closed API as well).

I would find it highly beneficial if I could define a rule in my HA settings – that would say:

  • GuestA and GuestB should never be on the same host (available today)
  • GuestA and GuestB are part of an application cluster
  • The cluster model is Active/Passive
  • The processes/services that are part of the cluster are A, B, C
  • When processes A, B, C stop responding:
    • Trigger an alarm
    • Try X number of times to restart the processes
    • If not successful, trigger an alarm
    • Start processes A, B, C on the second node in the cluster (failover to another node)
    • If successful, ensure that the processes do not come up on GuestA
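The rule above could be sketched roughly like this in Python. Everything here is hypothetical and only models the steps in the list: `Guest`, `ActivePassiveRule`, `handle_failure` and the `try_start` callback are names I made up, not an existing HA feature or API. The "never on the same host" rule is left out, since that part already exists today.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Guest:
    name: str
    running: bool = False  # are the clustered processes running on this guest?
    fenced: bool = False   # True means: never start the processes on this guest


@dataclass
class ActivePassiveRule:
    active: Guest
    passive: Guest
    processes: List[str]
    max_restarts: int = 3  # the "X number of times" from the rule
    alarms: List[str] = field(default_factory=list)

    def handle_failure(self, try_start: Callable[[Guest, List[str]], bool]) -> Guest:
        """Run the escalation steps and return the guest that ends up serving."""
        self.active.running = False
        self.alarms.append(f"{self.processes} stopped responding on {self.active.name}")

        # Try X times to restart the processes in place.
        for _ in range(self.max_restarts):
            if try_start(self.active, self.processes):
                self.active.running = True
                return self.active

        self.alarms.append(f"restart failed {self.max_restarts} times on {self.active.name}")

        # Fail over to the second node in the cluster.
        if try_start(self.passive, self.processes):
            self.passive.running = True
            self.active.fenced = True  # keep the processes from coming back on the failed node
            self.active, self.passive = self.passive, self.active
            return self.active
        raise RuntimeError("failover failed: no healthy node left")


def fake_start(guest: Guest, processes: List[str]) -> bool:
    # Stand-in for a real in-guest start command: GuestA is broken, GuestB is healthy.
    return guest.name == "GuestB" and not guest.fenced


guest_a = Guest("GuestA", running=True)
guest_b = Guest("GuestB")
rule = ActivePassiveRule(active=guest_a, passive=guest_b, processes=["A", "B", "C"])

serving = rule.handle_failure(fake_start)
print(serving.name)      # GuestB
print(guest_a.fenced)    # True
print(len(rule.alarms))  # 2
```

The hard parts of a real implementation (moving the service IP and DNS name, dealing with application state) are exactly what this sketch skips.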

I do admit that this is a primitive example, and granted, it will not be as simple as the steps above. But for the end user it would be an amazing benefit if this functionality were added to HA.

No more having to worry about an application cluster solution – and if packaged well enough could even be encapsulated in such a way that the end user will not even care what the guest OS is. They could save on expensive licenses needed for their clusters.

What do you say? Is this something that you could use? Would it be beneficial to you? Am I way off?

Please feel free to add your comment below.

8 comments:

Mandiv said...

Alarms can be done only at the vCenter level. vApps can still manage the power-on sequence and can also trigger some other events with scheduled tasks for the VMs. As for starting a particular process within the guest OS, it is the same problem: monitoring the apps within the guest OS is out of scope for HA, as it is at the OS level. FT and Heartbeat may apply, but that needs to be researched for this condition. A feature request would definitely be a good start in this direction, to keep the necessary stakeholders in the loop :-)

Kushmaro said...

Just to be a bit more accurate: today, if service X fails, the Windows SCM will try to restart it automatically up to 3 times, if I'm not mistaken, so I'm not quite sure what benefit the Symantec cluster gives you.
As for your idea, I think it is great except for one thing.
What you described sounds (obviously) a lot like MSCS. The thing is that in order for VMware to implement this according to your explanation, it would also have to take care of the application DNS name and IP... which requires GuestA and GuestB to be clones.
And of course there is the fact that the application has to be stateless...
A bit complicated..

Maish said...

 That is next on my list. It is good to have the idea documented here first though.

Maish said...

I agree that it is complicated, and that it is very similar to what we already have at the OS level. No one said it would be easy.

I would still like this functionality built-in if possible - it would make things a lot simpler.

Thanks for joining in the discussion.

@Virtually_LG said...

Hi Maish, great article. Interestingly, ApplicationHA is our (Symantec) first step towards this utopia of application management. Of course, clustering solutions today can offer the majority of the ask, but with limitations on vMotion etc. All I can say at the moment is watch this space (VMworld). We are working towards this goal, and although it will be a paid solution from us, it will offer enterprise functionality across all OSs.

Maish said...

 Thanks for the comment - I am looking forward to your announcements at VMworld.

Duncan said...

All I can say is: working on it :-)

Maish said...

Thanks for the update Duncan - looking forward to being able to ditch the Operating system cluster solutions.