Is SMP Fault Tolerance Even Useful?

Last week I got into a very interesting Twitter conversation about whether Fault Tolerance (FT) is a solution worth using, and why.

The short version of this post: in my honest opinion the answer is a definite NO! But I cannot leave you with that kind of statement without explaining in detail why. So here goes.

A quick quote from VMware’s site as to what Fault Tolerance is useful for:

[Image: Fault Tolerance]

When Fault Tolerance first came out it was a hot item, although the use case was very limited. There were a number of reasons for this, but mainly these two:

  1. An FT-protected VM was limited to a single vCPU.
  2. A 10Gb connection was recommended (perhaps required, I cannot remember) to carry the replication traffic.

VMware has always sold FT as an additional level of protection for virtual machines that must have the highest availability, something that cannot afford downtime due to hardware failure. It has never protected you in the case of an application failure, and perhaps never will. That is where VMware are trying to go with AppHA, but they are still not there yet (as I said in this post a while back).

What kind of applications is FT suited for? If you asked that question, most people would answer that these are older Line of Business (LOB) applications that were never built with any kind of clustering solution. Adapting or creating a clustering solution for these applications would probably be impossible and, if possible, would come with a very hefty price tag.

Let me draw out a typical scenario of how such a dilemma might come into place.

Company XYZ has an application that was written by a software company approximately 15-20 years ago. At the time it was best of breed, and the company paid a substantial amount of money for customizations and enhancements because there was nothing better available. Since then the software company has either gone out of business, or the product has been retired and is no longer supported, even if you wanted to pay for that support.

And yes, it is probably installed on a Windows NT4/2000/2003 machine, or perhaps a RHEL 2.x machine. Your IT department rejoiced, and perhaps even threw a party, when VMware came along. Now you could finally keep that old ‘critical’ application, convert it to a virtual machine, and breathe a little easier, because you no longer had to worry so much about hardware that was also over 7 years old. The number of times I have heard ‘Don’t touch it or it might break. Don’t reboot the server, it might not come back up again’ just indicates that these applications are a disaster waiting to happen, and yet the business relies upon them.

I dare say that if you have such a ‘critical’ application in a scenario like the one above, then you are doing yourself and your company a disservice, and a huge one. If you have an application that cannot suffer any downtime (and yes, these applications are always so important that they have to be up all the time), then it will never be protected unless you have some kind of application clustering solution.

Let us take the following scenarios:

  1. Hardware failure

    • VM resides in an HA cluster
      VM will be restarted according to the HA policy – which will usually mean that the VM will not be available for about 2-3 minutes.
    • VM is protected by Fault Tolerance
      VM will fail over to a secondary host, and you will probably not notice more than a dropped ping or two. No VM restart, quick and clean.
  2. Operating System Failure

    • VM resides in an HA cluster
      If you enabled VMware Tools heartbeat properly, then the VM will be restarted according to the HA policy – which will usually mean that the VM will not be available for about 2-3 minutes.
    • VM is protected by Fault Tolerance

      The kernel panic will be replicated to the secondary host, so HA will kick in as before and restart the VM. Again, the VM will be unavailable for 2-3 minutes.

      In this case there is absolutely no benefit to using FT.

  3. Application Failure

    • VM resides in an HA cluster

      If you are lucky enough to have an application that is supported by AppHA (which is probably not the case; we are talking about old, custom applications, remember?) then AppHA will try to restart the service a number of times and, if that does not work, it will restart the VM. But most probably you are not protected by AppHA, so you will need to go in and fix the problem manually.
      Downtime – unknown.
    • VM is protected by Fault Tolerance

      Same as above. If your application or service fails, the VM will not even fail over to the secondary host, because FT does not recognize this as a problem. You will need to go in and fix the problem manually.
      Downtime – unknown.

      In this case there is absolutely no benefit to using FT.
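The three scenarios above can be written down as a small lookup table. This is only a sketch of the comparison as I described it: the downtime values are the rough figures from the list, not measurements, the names (`DOWNTIME`, `ft_helps`) are made up for illustration, and the application row assumes the application is not covered by AppHA.

```python
# Rough downtime per (failure type, protection), taken from the list above.
# These are the post's ballpark figures, not measured values.
RESTART = "2-3 minutes (VM restart)"
NEGLIGIBLE = "a dropped ping or two"
UNKNOWN = "unknown (manual fix)"  # assumes the app is NOT covered by AppHA

DOWNTIME = {
    ("hardware", "HA"): RESTART,
    ("hardware", "FT"): NEGLIGIBLE,
    ("os", "HA"): RESTART,
    ("os", "FT"): RESTART,          # the panic is mirrored, HA restarts the VM
    ("application", "HA"): UNKNOWN,
    ("application", "FT"): UNKNOWN,  # FT does not see an application failure
}

FAILURES = ("hardware", "os", "application")


def ft_helps(failure):
    """FT only adds value if its outcome differs from plain HA."""
    return DOWNTIME[(failure, "FT")] != DOWNTIME[(failure, "HA")]


def scenarios_where_ft_helps():
    """Return the failure types for which FT changes the outcome."""
    return [f for f in FAILURES if ft_helps(f)]


if __name__ == "__main__":
    for failure in FAILURES:
        print(failure, "| HA:", DOWNTIME[(failure, "HA")],
              "| FT:", DOWNTIME[(failure, "FT")],
              "| FT helps:", ft_helps(failure))
```

Only the hardware row differs between the two columns, which is the whole point of the comparison.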

So how does FT actually help you in protecting your application? I mean, where does it really provide a benefit? In one scenario and one scenario only: when the hardware fails. In that case the clients connecting to the application will probably not even need to reconnect.

And here is where I say you are doing a disservice to your users and organization. If the application is so absolutely critical, so essential that downtime will cause a big outage or great financial loss, then you should be protecting it at the application level, because HA and FT will not help you here. Not at all.

I have heard comments on Twitter that the cost of protecting the application with a software clustering solution is too high, or that a re-write of the code would be painful and cost even more.

But I would be so bold as to go out on a limb and say that if you have a critical LOB application that is not protected at the application level, then either it is not a critical app or you are neglecting your job. You should raise a HUGE red flag about this issue and find a way to provide the correct level of resilience for your application.

One of the points that I have been hearing lately is that you could use FT to protect vCenter as well. Personally I think vCenter is a critical application and should be protected at the application level. I have said this multiple times: vCenter is a single point of failure, and FT will not help you except for one specific use case, hardware failure. If any of the processes in your vCenter go down, FT will be of absolutely no use.

One more thing I would like to point out: you are limited to a maximum of 8 protected vCPUs or 4 protected VMs per host, whichever comes first.

The VCSA comes in a number of profiles:
[Image: vCenter Profiles]

That means that if you protect vCenter with FT then, if you are lucky, you will be able to protect perhaps one additional VM on that host.
One vCenter profile is even a non-starter, as it does not fit within the limits of FT.
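To make the arithmetic concrete, here is a minimal sketch of the limit check, assuming the limits quoted above (8 protected vCPUs or 4 protected VMs, whichever comes first). The function names are made up, and the vCPU counts in the usage note are hypothetical examples, not the actual VCSA profile sizes.

```python
# Limits as quoted in this post; treat them as assumptions and check
# them against the documentation for your vSphere version.
MAX_FT_VCPUS_PER_HOST = 8
MAX_FT_VMS_PER_HOST = 4


def fits_on_host(vcpu_counts):
    """True if a set of FT-protected VMs stays within both limits.

    vcpu_counts is a list with one entry per protected VM,
    holding that VM's vCPU count.
    """
    return (len(vcpu_counts) <= MAX_FT_VMS_PER_HOST
            and sum(vcpu_counts) <= MAX_FT_VCPUS_PER_HOST)


def remaining_capacity(vcpu_counts):
    """Return (vms_left, vcpus_left), or None if the limits are exceeded."""
    if not fits_on_host(vcpu_counts):
        return None
    return (MAX_FT_VMS_PER_HOST - len(vcpu_counts),
            MAX_FT_VCPUS_PER_HOST - sum(vcpu_counts))


if __name__ == "__main__":
    # Hypothetical 4-vCPU vCenter VM, protected on its own.
    print(remaining_capacity([4]))
    # Hypothetical 16-vCPU profile: exceeds the vCPU limit outright.
    print(remaining_capacity([16]))
```

With a hypothetical 4-vCPU vCenter, only 4 vCPUs remain, so one more 4-vCPU VM fills the host long before the 4-VM limit is reached. A profile with more than 8 vCPUs cannot be protected at all.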

Do you use FT, and is it worth the additional infrastructure overhead needed to implement this additional layer of protection? Will you use it to protect vSMP VMs? Why not invest in proper application resilience instead?

Please feel free to leave your thoughts and comments below.