Is SMP Fault Tolerance Even Useful?

Last week I got into a very interesting Twitter conversation regarding whether FT is a solution worth using and why.

The short version of this post – in my honest opinion the answer is a definite NO! But I cannot leave you with that kind of statement without explaining in detail why. So here goes.

A quick quote from VMware’s site as to what Fault Tolerance is useful for

Fault Tolerance

When Fault Tolerance first came out it was a hot item, albeit with a very limited use case. There were a number of reasons for this, but mainly these two.

  1. An FT-protected VM was limited to a single vCPU
  2. A 10Gb connection was recommended (perhaps required – I cannot remember) to assure the replication traffic.

VMware has always sold FT as an additional level of protection for virtual machines that must have the highest availability, something that cannot afford downtime due to hardware failure. It never has and perhaps never will protect you in the case of an application failure. That is where VMware is trying to go with AppHA, but they are still not there yet (as I said in this post a while back).

What kind of applications is FT suited for? If you asked that question, most people would answer that these are older applications – Line of Business (LOB) applications that were never built with any kind of clustering solution. Adapting or creating a clustering solution for these applications would probably be impossible – and if possible would come with a very hefty price tag.

Let me draw out a typical scenario of how such a dilemma might come into place.

Company XYZ has an application that was written by a software company approximately 15-20 years ago. At the time it was best of breed, and your company paid a substantial amount of money for customizations and enhancements – because there was nothing better at the time. But since then the software company has either gone out of business – or the product has been retired and is no longer supported, even if you wanted to pay for that support.

And yes, it is probably installed on a Windows NT4/2000/2003 machine – or perhaps a RHEL 2.x machine. Your IT department rejoiced and perhaps even threw a party when VMware came along. Now you could finally keep that old ‘critical’ application, convert it to a virtual machine, and breathe a little easier, because you no longer had to worry as much about hardware that was also over 7 years old. The number of times I have heard ‘don’t touch it – or it might break. Don’t reboot the server, it might not come back up again’ just indicates that these applications are a disaster waiting to happen, and yet the business relies upon them.

I dare to say that if you have such a ‘critical’ application in such a scenario above – then you are doing yourself and your company a disservice – a huge one. If you have an application that cannot suffer any downtime (and yes these applications are always so important that they have to be up the whole time – all the time) then you will never be protected unless you have some kind of application clustering solution.

Let us take the following scenarios

  1. Hardware failure

    • VM resides in an HA cluster
      VM will be restarted according to the HA policy – which will usually mean that the VM will not be available for about 2-3 minutes.
    • VM is protected by Fault Tolerance
      VM will failover to a secondary host – and you will probably not notice more than a ping or two that drop. No VM restart, quick and clean.
  2. Operating System Failure

    • VM resides in an HA cluster
      If you enabled VMware Tools heartbeat properly, then the VM will be restarted according to the HA policy – which will usually mean that the VM will not be available for about 2-3 minutes.
    • VM is protected by Fault Tolerance

      The kernel panic will replicate to the secondary host – and HA would kick in as before and restart the VM – again it will be down for 2-3 minutes until the VM is available again.

      In this case there is absolutely no benefit to use FT.

  3. Application Failure

    • VM resides in an HA cluster

      If you are lucky enough to have an application that is supported by AppHA (which is probably not the case – we are talking about custom applications and they are old – remember?) then AppHA will try to restart the service a number of times and, if that does not work, it will restart the VM. But most probably you are not protected by AppHA, so in that case you will need to go in and fix the problem manually.
      Downtime – unknown.
    • VM is protected by Fault Tolerance

      Same as above – if your application / service fails, it will not even fail over to the secondary host, because FT does not even recognize this as a problem. In this case you will need to go in and fix the problem manually.
      Downtime – unknown.

      In this case there is absolutely no benefit to use FT.
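The three scenarios above can be condensed into a small lookup table. The following is only an illustrative sketch: the enum names, the outcome wording, and the `ft_adds_benefit` helper are my own shorthand for the scenarios described in this post, not anything taken from a VMware API.

```python
from enum import Enum


class Failure(Enum):
    HARDWARE = "hardware"
    OPERATING_SYSTEM = "operating system"
    APPLICATION = "application"


class Protection(Enum):
    HA = "HA cluster"
    FT = "Fault Tolerance"


# Outcomes paraphrased from the scenarios in this post (illustrative wording).
RESTART = "VM restarted by HA, roughly 2-3 minutes of downtime"
MANUAL = "manual fix needed, downtime unknown"

OUTCOMES = {
    (Failure.HARDWARE, Protection.HA): RESTART,
    (Failure.HARDWARE, Protection.FT): "failover to secondary host, a ping or two dropped",
    (Failure.OPERATING_SYSTEM, Protection.HA): RESTART,
    (Failure.OPERATING_SYSTEM, Protection.FT): RESTART,  # the panic is replicated too
    (Failure.APPLICATION, Protection.HA): MANUAL,
    (Failure.APPLICATION, Protection.FT): MANUAL,  # FT does not see an app failure
}


def ft_adds_benefit(failure: Failure) -> bool:
    """True only when the FT outcome differs from the plain HA outcome."""
    return OUTCOMES[(failure, Protection.FT)] != OUTCOMES[(failure, Protection.HA)]


for failure in Failure:
    print(f"{failure.value:>16}: FT adds benefit = {ft_adds_benefit(failure)}")
```

Laid out this way, FT changes the outcome in only one of the three rows.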

So how does FT actually help you in protecting your application? I mean actually really provide benefit? In one scenario and one scenario only – when the hardware fails. Probably no re-connects needed on the clients connecting to the application.

And here is where I say you are doing a disservice to your users and organization. If the application is so absolutely critical – so essential that suffering downtime will cause a big outage or will cause great financial loss – then you should be protecting it – and protecting it at the application level – because HA and FT will not help you here – not at all.

I have heard comments on Twitter that the cost is too high to protect the application with a software clustering solution – or that a re-write of the code will be painful and cost even more.

But I would be so bold to go out on a limb and say that if you have a LOB critical application that is not protected at the application level – then it is either not a critical app or you are neglecting your job. You should raise a HUGE red flag about this issue and find a way to provide the correct level of resilience for your application.

One of the points that I have been hearing lately is that you could use FT to protect vCenter as well. Personally I think vCenter is a Critical application and should be protected at the application level. I have said this multiple times – vCenter is a Single point of failure and FT will not help you except for one specific use case – hardware failure. If any of the processes in your vCenter go down – it will be of absolutely no use.

One more thing I would like to point out – on each host you are limited to a maximum of 8 vCPUs across protected VMs or 4 protected VMs, whichever comes first.

The VCSA comes in a number of profiles:
vCenter Profiles

That means that if you protect vCenter with FT, then if you are lucky you will be able to protect perhaps one additional VM on that host.
There is a vCenter profile that is even a non-starter, as it does not even fit into the limitations of FT.
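To make the per-host limits concrete, here is a minimal sketch of the check. The two limits come from this post; the vCPU sizes used in the examples are hypothetical placeholders, not the actual VCSA profile sizes.

```python
from typing import Sequence

MAX_FT_VCPUS_PER_HOST = 8  # limit quoted in this post
MAX_FT_VMS_PER_HOST = 4    # limit quoted in this post


def fits_on_host(ft_vm_vcpus: Sequence[int]) -> bool:
    """Return True if the FT-protected VMs stay within both per-host limits."""
    return (
        len(ft_vm_vcpus) <= MAX_FT_VMS_PER_HOST
        and sum(ft_vm_vcpus) <= MAX_FT_VCPUS_PER_HOST
    )


def remaining_ft_vcpus(ft_vm_vcpus: Sequence[int]) -> int:
    """vCPUs still available for FT protection on the host (never negative)."""
    return max(0, MAX_FT_VCPUS_PER_HOST - sum(ft_vm_vcpus))


# Hypothetical sizes: a 4 vCPU vCenter plus one more 4 vCPU VM uses up the host.
print(fits_on_host([4, 4]))       # both limits respected
print(remaining_ft_vcpus([4, 4]))  # nothing left for a third protected VM
# A hypothetical 16 vCPU vCenter profile cannot be protected at all.
print(fits_on_host([16]))
```

Whichever limit is hit first wins, so five single-vCPU VMs fail the check just as a single oversized VM does.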

Do you use FT, and is it worth the additional infrastructure overhead needed to implement this additional layer of protection? Will you use it to protect vSMP VMs? Why not invest in proper application resilience instead?

Please feel free to leave your thoughts and comments below.


VMware Integrated OpenStack - Cost Analysis

VMware announced last week the launch of VIO and there are a number of things that I think people are missing and should be pointed out.

The information I have taken is from the Datasheet and publicly available information.


A great part of the functionality and flexibility that people use in OpenStack is flexible networking, i.e. creating private networks and routers, for example.

NSX for Neutron

That is great – all the functionality is there – with NSX. But how many people are actually using NSX today? How many people have deployed NSX? In a previous article I went through the reasons why this will not be an easy path. So how does that work with what I currently have in my datacenter?

I think it is safe to assume that this will be deployed in an environment with a Distributed virtual switch – we are talking about environments that are using Enterprise Plus after all (and VMware is giving this away for free to everyone with the Ent+ licenses).

So how does OpenStack work with a DvSwitch today?

Well I am sorry to surprise you – but it does not. It only works with nova-network (which is supposed to be deprecated – so caveat emptor). VMware themselves have said that most customers that are using OpenStack and VMware are using NSX (and that they don’t really have much experience with nova-network).


** Edited February 11th, 2015 **

According to a Twitter conversation with @hui_kenneth and @danwendlandt last night – VIO GA will support the dvSwitch, only that information is currently not public and is only available to the Beta users. The functionality still will not be the same as that of NSX.

So it boils down to this. The only way to really use OpenStack with vSphere today – in any kind of semi-normal way, is to do it with NSX. Any demo you have seen – Hands-on Labs, presentations – all use NSX. And the assumption is always that it already exists in the environment you are working with.

So VMware is giving this away for free (unless you are interested in support – which will cost you another $200 per socket) – but this essentially is giving you a hobbled product – which does not have functionality that you get out of the OpenStack box – because you are using vSphere networking.

So what features will you not be able to use – without NSX?

  • No GRE – standard VLANs only
  • No LBaaS
  • No VPNaaS
  • No FWaaS
  • No security groups

I will say that the people who are actually using FWaaS and VPNaaS are not the majority of OpenStack users – but on the other hand, LBaaS is more or less an essential part of any automated cloud. And even more so – security groups are definitely an essential part of any cloud.

But of course we would like to use all the bells and whistles – actually I would really only like to use Neutron with vSphere – so my only option is going to be NSX (until they manage to get this working as well with a dvSwitch).

So what is this going to cost me?

This is going to hurt (and by no means am I a licensing expert – and yes, I know that no-one really pays list price – but here goes).

You have two options to buy NSX – per VM or per socket. Now we all know that the per-VM model usually does not run in the customer’s favor – and the last time a per-VM model was proposed, there was a huge disturbance in the force. So I am going to assume that you will want a per-CPU license.

1 CPU license of NSX for vSphere (NS-VS-C) – $5,996
1 CPU SNS Basic support (NX-VS-G-SSS-C) for 1 CPU – $1,259
1 Year SNS Production support for VMware Integrated OpenStack for 1 CPU – $200
1 CPU license of VMware Integrated OpenStack (assuming you have Enterprise Plus) – $0

Total cost for 1 CPU of VMware Integrated OpenStack – $5,996
Annual support costs for 1 CPU of VMware Integrated OpenStack (and NSX) – $1,459
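As a quick sanity check of the per-CPU arithmetic, here is a short sketch using the list prices quoted above. The constant names are my own labels for those line items.

```python
# List prices per CPU as quoted in this post (USD, MSRP).
NSX_LICENSE_PER_CPU = 5_996
NSX_SNS_BASIC_PER_CPU = 1_259
VIO_SUPPORT_PER_CPU = 200
VIO_LICENSE_PER_CPU = 0  # assuming Enterprise Plus is already licensed

# One-off licensing versus the recurring yearly support bill.
upfront_per_cpu = NSX_LICENSE_PER_CPU + VIO_LICENSE_PER_CPU
annual_support_per_cpu = NSX_SNS_BASIC_PER_CPU + VIO_SUPPORT_PER_CPU

print(f"Upfront per CPU: ${upfront_per_cpu:,}")
print(f"Annual support per CPU: ${annual_support_per_cpu:,}")
```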

So let me lay this out in simple terms with an example.

** Post Updated February 11th **

It was brought to my attention that there is a minimum purchase of 50 CPU licenses for OpenStack support as part of the FAQ notes.

Minimum of 50

I did not change my assumption that you would only be using 4 hosts, but the purchase of additional OpenStack CPU support is required.

I have therefore amended the numbers below.

If you are interested in really using OpenStack (i.e. with Neutron and NSX) on a 4 host (8 socket) cluster, this will cost you:

  • Initial cost
    • NSX licensing – 8x$5,996 = $47,968
    • SnS first year (original figure, before the 50-CPU minimum) – 8x$(1,259 + 200) = $11,672
    • SnS first year (amended) – 8x$1,259 + 50x$200 = $20,072
    • Total (original figure) – $59,640
    • Total (amended) – $68,040
  • Annual costs
    • SnS per year (original figure) – $11,672
    • SnS per year (amended) – $20,072
That is above and beyond the regular licensing fees that you pay for Enterprise Plus licenses (which I have not factored in here, because I am assuming that you already have them). But if you do not, then that is even more of a hit to your CAPEX.

Again I would like to stress – that this is MSRP – and not including any bundles.

My Take

VMware would like to see the whole world run on their platform (obviously), and they have started to make a move to minimize the impact that I think they are starting to feel – due to people moving over to OpenStack. This offering is a foot in the door to minimize the business they could lose from people moving off of their platform (seriously speaking though vCloud is a competing product – and I do not know how much longer they can continue to sell competing products). There are a number of benefits of running on top of vSphere, of course: the underlying platform is one, and the hooks and insight into vRO are another.

That is one side of the story. The other side is NSX adoption – I do not think that VMware is seeing the market share that they were hoping to gain with NSX – network virtualization is still not a mainstream concept. Companies are starting to dabble and try – but no – we are not there yet.


The ironic thing is that even with VIO integrated with NSX when it is released – it still will not support native LBaaS, VPNaaS and FWaaS out of the box (you probably will be able to integrate with 3rd party vendors) – that will probably come in a future release.

So even with their flagship product – it still will not have all the functionality that OpenStack Operators/Users are accustomed to having in their environments today.

There are benefits of having “one neck to throttle” so to speak – but that comes with a price tag – and a hefty one. It certainly is not a free product, or one that comes at a minimal cost (VMware support), as it is being made out to be.

The devil is always in the details.

What do you say? Is it financially viable? Would you use VIO? Why? Or would you rather rough it and go with another vendor?

I would be happy to hear your thoughts, and comments. Please feel free to leave them in the comments below.