2014-12-11

vCenter is Still a Single Point of Failure

A few days ago VMware (Mike Brown, Anil Kapur and Justin King are the authors) announced the updated document for the vCenter Server 5.5 Availability guide.

I would like to make clear a few things from the start.

  • This is not a VMware bashing post. (Even it might be perceived as such)
  • I hold all three of the authors in very high regard.

Here goes.

When reading this document I was hoping to hear something new, something refreshing, something that VMware customers have been asking and verbally complaining about for a very long time.

Alas – this is not the case.

vCenter is a single point of failure. There I have said it. I have said it before, I will continue to say it in the future until this if fixed.

In the following article I will be taking statements directly from the text, providing my thoughts as I go along.

Overview

Great start – This document will discuss the requirements… After re-reading that statement – I understood what VMware did. VMware has not provided us with a method of providing HA for vCenter – but rather – have explained what they think should be defined as High Availability be for your vCenter server.

SLA

The authors then go into explaining all about MTBF and MTTR – they did a great job. I will not go into the details here – you should read the document

SLA’s are extremely important – and for and every environment – an SLA is something different – yours may differ from your neighbor, so it is important to understand what you need to achieve.

tests

They then go into describing the tests that were run in order to measure the amount of time it would take for a vCenter server to recover. Fair enough.

results

Here is where it starts to get interesting. Let us look at this in a picture.

Timeline

Bottom line is – that once a vCenter server has gone down – it will take a little over 5 minutes until it is fully functional.

recommendations

This part of the document states that having vSphere HA – and having vCenter running as a virtual machine actually provides some level of protection.

A dedicated management cluster is of course advised – that way you have a dedicated environment to run your management components without having to worry that the client workloads will interfere.

ESXi Hosts

Also putting the database in the same management cluster is recommended – seems logical.

I then noticed that the only SQL version that is supported for vCenter 5.5 is Enterprise and up – which was news to me. I gather this is a documentation bug – because the VMware Product Interoperability Matrix says that Standard is supported.

Matrix

So how do you protect vCenter?

Replication

It would really be great if they would explain exactly how that would be possible and how that should be done. It still might be possible? How exactly? In order to protect vCenter – I will need another vCenter? Licensing? Implications?

VDP

Emergency Restore was a new one to me – but it is only available in vSphere Data Protection Advanced Edition – that is something that was left out – which is approximately $1,500 (list price) / per socket. As a result of the feedback received in the comments – I have amended this. It seems that Emergency restore is also available in all editions of VDP – not only Advanced (more information here).

Definition

OK, enough copy and paste. This piece above is what set me off.

Essentially what VMware are saying the following:

  1. Use a separate management cluster
  2. Run vCenter in a VM
  3. Run the Database in a VM
  4. No matter what happens – if your vCenter crashes then it will be down for 5 minutes.
  5. Your  workloads are safe because they are running on your ESXi hosts are protected by HA and can continue running without a vCenter server.

Points 1-4 - I totally agree. With point #5 I also agree.

But there are environments that cannot afford to have a 5 minute outage. VMware might say that having vCenter go done and out for five minutes, is not really an outage per se, but I would very much like to disagree here.

If I cannot provision a new VM because my vCenter is not available – that is an outage.

Where would this be an issue?

  • VDI environments – If a user logs in and his desktop is not provisioned because vCenter is down? How about the whole 100 or 1000 employees?
  • Highly automated environments – ones that use products like vRealize Automation or vRealize Code Stream. Imagine having your code builds fail for 5 minutes because vCenter is not available? The whole continuous delivery process breaks down.

I might be exaggerating a bit – but I have voiced this more than once – I started more than 4 years ago - Troubleshooting Tools for vCenter.

vCenter is probably the most crucial part of your virtual infrastructure. And all that you can expect from from an availability perspective is that you have to accept as a given that vCenter might go down for 5 minutes at a time.

There are environments that will accept this - I would actually say that the large majority are fine with this – but what about those who are not? Those who cannot afford having this kind of outage? What do they do?

There used to be a product called vCenter Server Heartbeat – which was retired.

Heartbeat

Where are those promised options? When will they be available? What do companies do in the interim? Pray that there vCenter does not crash?

Embedded below is the Twitter conversation that sparked this post.

 

The scenario on which VMware based their whole presumption was on the fact that the host on which vCenter was running would crash, HA would kick in and the VM would be restarted on another host within 5 minutes.

The whole scenario of having a problem with your database, or a vCenter service problem (and believe me it happens), that was not covered.

Take the following scenario. You have a vCenter appliance. For some reason the vCenter service stops responding on the VM. There is no automatic restart. Eventually you get a call, something is not right. You try and restart the service, nothing happens. You restart the VM, nothing happens.

Now what? Open a call with VMware? Deploy another vCenter appliance and hope that nothing goes wrong? I can guarantee you that will take a hell of a lot longer than 5 minutes.

Why does the document even go into providing a clustered solution for the MSSQL database? Because that might fail? Yes it could happen. But guess what – the whole system is only as strong as its single weakest link. So providing a clustered database solution might give some piece of mind – but it will not protect you from an outage – because there is no way to cluster a vCenter server.

conclusion

In conclusion – yes there are considerations. I would definitely not say that VMware have a High Availability solution for vCenter, they have done their best to minimize the impact it will have when it vCenter crashes – but that is not HA!

What do you think? Am I making a mountain out of  molehill? Or this a real and valid concern? Please feel free to leave your comments and thoughts below.