2010-02-24

So How Up-to-Date is Your BCP Plan?

Here is a tale of a small airplane and the kind of damage it can do. No, I am not going back to Sep. 11 and the WTC. I am talking about Feb 17th, 2010, at 07:50 in Palo Alto, California.

A play-by-play of the events showed that because of a small plane that crashed (unfortunately killing 3 people) Palo Alto went dark - well not really dark, because it was early morning - but power went out, COMPLETELY!!

Now I guess that most of us can imagine what that means in our personal lives. I mean you have no lights, no TV, no VoIP, no computers, no air conditioners, no hot espresso from the espresso machine…
You see where I am going.

But taking this to a different angle. Your Datacenter.

Power goes out. If you are well organized - you will have backup power for a certain amount of time - depending on the generator.

SMS alerts start flying into your mobile phone like crazy.

You call the office … no-one answers.

You manage to find someone on his mobile that is at work - and try to find out what is going on - but they are as confused as you are.

You try to turn on your TV to see what is going on - whoops, forgot, no electricity - which means no internet connection either.

Chaos… Confusion…. this is what it can be like.

I mean I could go on and on but I think you get my drift.

So if you are lucky enough you manage to power everything down properly before your backup power runs out of juice.

OK now comes the analysis phase - WHAT IS DOWN?

Let us go back to VMware's case (their headquarters are in Palo Alto). No Email. No Phones.

Post notification to the world of service availability - Twitter

Open up a channel of communications as soon as possible - Twitter

Restore services as soon as possible
(in this case it took almost 10 hours until full functionality was back)

I am not going to go into the BCP plan of VMware - because I have absolutely no knowledge of what was down and what was still up and working.

For your BCP plan you will need (amongst others) these things:

Recovery time objective - How long until I am back up?

Recovery point objective - How much is the acceptable amount of data loss measured in time?

What critical services do I need back online and in what order and how soon?
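
To make those three questions a bit more concrete, here is a minimal Python sketch. The service names, priorities and hour values are made up for illustration (only the roughly 10-hour outage comes from the story above), so plug in your own numbers.

```python
from dataclasses import dataclass


@dataclass
class Service:
    name: str
    priority: int                 # 1 = bring back first (hypothetical ranking)
    rto_hours: float              # recovery time objective
    rpo_hours: float              # recovery point objective
    backup_interval_hours: float  # how often the data is actually protected


def recovery_order(services):
    """Names in the order they should come back online."""
    ranked = sorted(services, key=lambda s: (s.priority, s.rto_hours))
    return [s.name for s in ranked]


def rpo_violations(services):
    """Services whose backup interval can lose more data than the RPO allows."""
    return [s.name for s in services if s.backup_interval_hours > s.rpo_hours]


def rto_violations(services, outage_hours):
    """Services that would have missed their RTO in an outage of this length."""
    return [s.name for s in services if outage_hours > s.rto_hours]


# Hypothetical example values, not taken from any real BCP.
services = [
    Service("file shares", priority=3, rto_hours=24, rpo_hours=24, backup_interval_hours=12),
    Service("email", priority=1, rto_hours=4, rpo_hours=1, backup_interval_hours=4),
    Service("phones", priority=2, rto_hours=2, rpo_hours=24, backup_interval_hours=24),
]

print(recovery_order(services))
print(rpo_violations(services))
print(rto_violations(services, outage_hours=10))
```

Anything that shows up in the last two lists is a gap between what the plan promises and what the setup can actually deliver.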

In this particular case I can only assume that the BCP plan that existed for VMware in Palo Alto before February 17th and the BCP plan that will be amended thereafter will not be the same.

I am sure that the small Flagship product called Site Recovery Manager can be used for this purpose and hopefully VMware will come out with a better BCP plan.

So have you reviewed your BCP plan for your Datacenter lately? If not….

DO IT!!!!!! NOW!!!!!!!!!!

p.s. I hope all of you in Palo Alto did not have to throw out too much food from your refrigerators - I mean 10 hours is a long….. time

2010-02-21

Israel VMUG - February 2010 - Cloud vision

Last Thursday I participated in the Israeli VMUG meeting. The following are my thoughts and observations from what was presented at this meeting. (The product release dates mentioned below are not mine, but were presented at the event).

First up was someone from the Systems Engineering group from VMware who talked about:
Cloud Vision

It was mentioned that ESX 4.1 will be released in the not too distant future.

VMware are working on raising the maximum number of vCPUs for a VM above the current 8 per VM. This should ensure that there is no CPU workload - whatsoever - that cannot be run as a VM.

From a VMware Customer survey from 2008:

The application that most customers are virtualizing is SQL - 56%, followed by SharePoint - 53%, IBM WebSphere - 50%, Oracle Middleware - 41% and Exchange - 36%.

The average Oracle DB server uses 2-4 vCPUs, and utilization is approximately 6%
Average disk I/O 2000 IOPS

VMware are going to push cloud this upcoming year, and in order to embrace this you should stop treating your VM as a server - it should be a service. That way it will be easier to define the financial cost of the service, as opposed to the cost of a single server.

VMware is looking for Service providers for Cloud Services in Israel.

Project Redwood - Middleware - a common service model for infrastructure clouds, which will allow services to be consumed from a user-centric perspective. It will include an API that can communicate between clouds. This will be the focus for the upcoming year.

Next up was a presentation about CapacityIQ. I was surprised to see that not many of the 60-80 participants knew of any of the other vendors that provide similar services. I did not receive a satisfactory response regarding what added benefit CapacityIQ has over these 3rd party products.

Thereafter was a presentation about AppSpeed. At present, the product is limited to a small number of protocols that it can analyze. Plans for the future are to enlarge that number for broader application support.

After that was a presentation regarding View 4.0 and the new enhancements - Teradici and PCoIP. A very good explanation was given of how the graphics rendering works, which is what allows for the better overall user experience with this protocol.

Last on the agenda was a session about PowerCLI. The only earth-shattering moment I had in this session was when the presenter asked, "How many of you have used PowerCLI for scripting tasks?" Three people (out of an audience of 80) put their hands up: the presenter, one other person and myself.

I really have to ask, how are these people performing their VI duties efficiently if they are doing everything manually? I mean there were people there who manage VMware environments that contain hundreds of hosts and thousands of VMs. I am wondering how they manage to get their work done without using PowerCLI.

I do see that the potential for educating VI admins in Israel to use PowerShell for management is extremely high - and there is still a lot of work to be done.

2010-02-17

ESX 4.0 Active Directory Authentication

There are numerous posts about how to use Active Directory to authenticate your ssh logins to your ESX servers, for example by Jason Boche, Travis Laird and Geert Baeke.

The idea is pretty simple

  1. Configure the ESX server with esxcfg-auth as in the above posts
  2. Add the desired users locally on your ESX Server
  3. Login away

But….

Once this is enabled, all authentication will be done against Active Directory - INCLUDING THE root USER

Jason mentioned this in his post:

Warning:  One thing to watch out for would the existance of a root account in AD in which you are not the owner of.  By implementing AD authentication, a root account in AD is going to be granted root level Service Console access on the ESX host!  Take the necessary precautions here.

Travis mentioned it too, and also provided a solution:

If you are not using root login through SSH and you want to exclude the root user login from attempting Active Directory authentication, modify the /etc/pam.d/system-auth file and add the parameter minimum_uid=1 to the following line so it reads:

auth sufficient /lib/security/$ISA/pam_krb5.so use_first_pass minimum_uid=1

Geert as well..

A couple of other things to think of:

  • If you create a user in AD with account name root, you can logon as root with its AD password.
  • If you don't want AD authentication for root, you can edit /etc/pam.d/system-auth. On the line that starts with auth and also includes pam_krb5.so, add this to the end: minimum_uid=1. Authentication for root (uid=0) will now be done locally only.

Now of course ssh login is disabled for root. But when trying to log in as root, I saw in /var/log/secure that root was still trying to authenticate against the domain.

Feb 17 10:32:08 esx2 sshd[5838]: pam_krb5[5838]: authentication fails for 'root' (root@MAISHK.LOCAL): User not known to the underlying authentication module (Clients credentials have been revoked)

(the root account exists in the domain, but is disabled)

So I wanted to add the solution as published above.

These are the contents of the /etc/pam.d/system-auth file

#%PAM-1.0

account      required pam_per_user.so   /etc/pam.d/login.map
auth         required pam_per_user.so   /etc/pam.d/login.map
password     required pam_per_user.so   /etc/pam.d/login.map
session      required pam_per_user.so   /etc/pam.d/login.map

As you can see, there is no auth sufficient line and no /lib/security/$ISA/pam_krb5.so in the file…

Hmmmmm….

So I gather that this has changed for ESX 4.0: every entry (account, auth, password and session) now goes through pam_per_user.so, which points to /etc/pam.d/login.map

These are the contents of the /etc/pam.d/login.map file

vpxuser  : system-auth-local
*        : system-auth-generic

OK. so all users except vpxuser are pointed to /etc/pam.d/system-auth-generic
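
To spell out how I read that mapping, here is a small Python sketch of the lookup. It only models the behaviour described in this post (exact user name first, then the * wildcard) and is not how pam_per_user.so is actually implemented; the function names are my own.

```python
LOGIN_MAP = """\
vpxuser  : system-auth-local
*        : system-auth-generic
"""


def parse_login_map(text):
    """Turn 'user : pam-service' lines into a dict, skipping blanks and comments."""
    mapping = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        user, _, service = line.partition(":")
        mapping[user.strip()] = service.strip()
    return mapping


def pam_service_for(user, mapping):
    """Exact match wins; everyone else falls through to the '*' entry."""
    return mapping.get(user, mapping.get("*"))


mapping = parse_login_map(LOGIN_MAP)
for user in ("vpxuser", "root", "someaduser"):
    print(user, "->", "/etc/pam.d/" + pam_service_for(user, mapping))
```

So root, like every other user that is not vpxuser, ends up in system-auth-generic, which is the file to look at next.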

These are the contents of the /etc/pam.d/system-auth-generic file

#%PAM-1.0
# Autogenerated by esxcfg-auth

account         sufficient      /lib/security/$ISA/pam_krb5.so
account         required        pam_unix.so

auth            required        pam_env.so
auth            sufficient      pam_unix.so         try_first_pass nullok
auth            sufficient      /lib/security/$ISA/pam_krb5.so              use_first_pass
auth            required        pam_deny.so

password        required        /lib/security/$ISA/pam_passwdqc.so          min=8,8,8,7,6 similar=deny match=0
password        sufficient      pam_unix.so         try_first_pass use_authtok nullok shadow md5
password        sufficient      /lib/security/$ISA/pam_krb5.so              use_authtok
password        required        pam_deny.so

session         optional        pam_keyinit.so              revoke
session         required        pam_limits.so
session         sufficient      /lib/security/$ISA/pam_krb5.so
session         [success=1 default=ignore]      pam_succeed_if.so           service in crond quiet use_uid
session         required        pam_unix.so

Yep! There it is!

So I added minimum_uid=1 to the auth line for pam_krb5.so in that file

auth            sufficient      /lib/security/$ISA/pam_krb5.so       use_first_pass minimum_uid=1
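
If you would rather script that one-line change than edit the file by hand, something along these lines should do it. This is only a sketch in Python that works on the text of the file: the function name is mine, it touches only auth lines whose module is pam_krb5.so, and it is safe to run twice. Keep a copy of the original file before writing anything back, and remember that the file header says it is autogenerated by esxcfg-auth.

```python
def add_minimum_uid(pam_text, min_uid=1):
    """Append minimum_uid=<n> to every 'auth ... pam_krb5.so' line that lacks it."""
    option = f"minimum_uid={min_uid}"
    patched = []
    for line in pam_text.splitlines():
        fields = line.split()
        is_krb5_auth = (
            len(fields) >= 3
            and fields[0] == "auth"
            and fields[2].endswith("pam_krb5.so")
        )
        already_set = any(f.startswith("minimum_uid=") for f in fields)
        if is_krb5_auth and not already_set:
            line = f"{line.rstrip()} {option}"
        patched.append(line)
    return "\n".join(patched) + "\n"


# The relevant lines from system-auth-generic as shown in this post.
sample = """\
auth            required        pam_env.so
auth            sufficient      pam_unix.so         try_first_pass nullok
auth            sufficient      /lib/security/$ISA/pam_krb5.so              use_first_pass
auth            required        pam_deny.so
account         sufficient      /lib/security/$ISA/pam_krb5.so
"""

patched = add_minimum_uid(sample)
print(patched)
```

Only the third line changes; the account line for pam_krb5.so is left alone.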

And now when logging in as root I see in the log that the authentication is done by
pam_unix(system-auth-generic:auth)
and not pam_krb5

Feb 17 10:58:51 ilesx2 sshd[11906]: pam_unix(system-auth-generic:auth): authentication failure; logname= uid=0 euid=0 tty=ssh ruser= rhost=msaidelk-server.xxx.xxxx.com  user=root

My Active Directory Authentication process is complete!!

Update:
Thanks to Armin van Lieshout for pointing this out to me. You can do all of this from the command line as well:

esxcfg-auth --enforce-local-auth=root

This will force local authentication for the defined user

2010-02-16

Is 100 Mb/s Enough for the Service Console?

It is accepted practice in most of the virtualization world that you can use a 100 Mb/s link for your Service Console port, because there is not much traffic flowing over that link.

Well, in the majority of cases that is true.

For example, vmnic0 is running the Service Console. This link is connected at 1000 Mb/s

image

As you can see in the screen shot there is nothing really running through the Service Console (vmnic0)

But how about this

image

Now you would (and should) most definitely ask what is running through vmnic0 that is causing that amount of traffic.

It is not a Backup agent, no VMkernel on vmnic0 either. It was a regular virtual machine import.

image

A bit more of an explanation. How often does it happen that you are asked to import a virtual machine from somewhere?  A virtual appliance? A new image that has to go onto the ESX?

When importing a virtual machine into an ESX host, the traffic goes directly through the Service Console and its vmnic.

As you can see, that amount of traffic can easily surpass 100 Mb/s.

But if you only have the option of running the SC on 100 Mb/s, then you will have to take into account that virtual machine imports will take longer - and they can take much.. much longer if you run multiple simultaneous imports. I do not know what impact it will have on the other traffic that has to run on the Service Console. I do not know of any QoS on ESX that will ensure that one kind of traffic gets priority over another.
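
A quick back-of-the-envelope calculation shows how much longer. The Python sketch below is a best case: it assumes the whole link is available to the import, ignores protocol overhead, and splits the link evenly between simultaneous imports. The 20 GB image size is just an example.

```python
def transfer_minutes(size_gb, link_mbps, simultaneous=1):
    """Best-case minutes to move size_gb (decimal GB) over a shared link."""
    megabits = size_gb * 8 * 1000               # GB -> megabits
    per_import_mbps = link_mbps / simultaneous  # even split between imports
    return megabits / per_import_mbps / 60


for link in (100, 1000):
    for n in (1, 3):
        minutes = transfer_minutes(20, link, n)
        print(f"{link} Mb/s, {n} import(s): {minutes:.1f} min each")
```

For a single 20 GB import that works out to roughly 27 minutes at 100 Mb/s versus under 3 minutes at 1 Gb/s, and real-world numbers will only be worse.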

Finding a 100 Mb/s NIC is something that you will have to go on a treasure hunt for - we are in a day and age where 1Gb is the default and soon 10Gb will become the norm. The issue here is not the NIC - it is the switchport. Not everything can run on 1Gb end-to-end, so here is your limitation.

Solutions?

  1. Run your SC on 1Gb
  2. Share your SC with other ESX network components (VMkernel / VM traffic) and lower the level of security for your ESX
  3. Create an aggregate to widen the "pipe"
  4. Run your SC on 100 Mb/s and take into account that virtual machine imports will take longer

Thanks to Tom Howarth, Scott Lowe, Roger Lund, Dave Graham, Mike LaSpina and Tommy Hall for joining in on the discussion.

2010-02-10

Install VMware Tools on Server Core

Just to remind you all - Server Core - NO GUI!

I needed to install VMware Tools today on a VM with Windows 2008 R2 Core.

Start the VMware Tools install.

msiexec /i "<path to>\VMware Tools64.msi" /qn

The machine will reboot automatically - unless you provide the correct parameters to the msi installer, for example the standard Windows Installer option /norestart (or the property REBOOT=ReallySuppress)

Thanks to Mike and Geert for the assistance

New Hyper-V Security Vulnerability

Many eons ago there was talk about patch footprints - comparing ESXi to Hyper-V, footprints and security patches.

So today I came across this one.

Microsoft Security Bulletin MS10-010 - Important

General Information

Executive Summary

This security update resolves a privately reported vulnerability in Windows Server 2008 Hyper-V and Windows Server 2008 R2 Hyper-V. The vulnerability could allow denial of service if a malformed sequence of machine instructions is run by an authenticated user in one of the guest virtual machines hosted by the Hyper-V server. An attacker must have valid logon credentials and be able to log on locally into a guest virtual machine to exploit this vulnerability. The vulnerability could not be exploited remotely or by anonymous users.

This security update is rated Important for all supported x64-based editions of Windows Server 2008 and Windows Server 2008 R2. For more information, see the subsection, Affected and Non-Affected Software, in this section.

The security update addresses the vulnerability by correcting the way Hyper-V server validates encoding on machine instructions executed inside its guest virtual machines. For more information about the vulnerability, see the Frequently Asked Questions (FAQ) subsection for the specific vulnerability entry under the next section, Vulnerability Information.

So what is so different about this one - I mean Microsoft releases patches once a month (and in certain extreme cases - more often).

Read the fine print…

An attacker must have valid logon credentials and be able to log on locally into a guest virtual machine to exploit this vulnerability.

From the details:

An attacker must have valid logon credentials and be able to log on locally to a Hyper-V virtual machine to exploit this vulnerability. The vulnerability could not be exploited remotely or by anonymous users.

What is the scope of the vulnerability?
This is a denial of service vulnerability. An attacker who exploited this vulnerability could cause the affected Hyper-V server to stop responding and require it to be restarted. Note that the denial of service vulnerability would not allow an attacker to execute code or to elevate their user rights, but it could cause the affected system to stop accepting requests.

What causes the vulnerability?
This vulnerability is caused by Hyper-V server incorrectly validating the encoding of specific machine instructions executed inside the guest virtual machines. Due to this lack of validation, processing of these instructions may cause the Hyper-V server application to become non-responsive.

What might an attacker use the vulnerability to do?
An attacker who successfully exploited this vulnerability could cause a user’s system to become non-responsive until the system is restarted. Note that exploitation of this vulnerability could cause the actual Hyper-V server to stop responding, including all guest virtual machines hosted by that server.

How could an attacker exploit the vulnerability?
An attacker would have to be an authenticated user in one of the guest virtual machines hosted by the Hyper-V server and would need to have the ability to execute arbitrary code on the system. An attacker could then run an untrusted executable on the system that invokes a malformed sequence of machine instructions and thereby cause the Hyper-V server to become non-responsive.

And again, what matters is in the details..

Note that exploitation of this vulnerability could cause the actual Hyper-V server to stop responding, including all guest virtual machines hosted by that server.

And yep..

Restart Requirement

Restart required?

Yes, you must restart your system after you apply this security update.

HotPatching - Not applicable.

Pot calling the kettle black ??

image

2010-02-09

Veeam FastSCP - Supports Win7 and Win2008R2

Last week I posted the link to the Beta release. Today Veeam announces the release of version 3.0.2 that now supports Windows 7 and Windows 2008 R2.

Release Notes

New Features

The following is a list of new features introduced in the Veeam FastSCP 3.0.2:

  • Added support for Microsoft Windows 7 and Microsoft Windows Server 2008 R2
  • VMware vSphere 4 is now “officially” supported.

The following is a list of new features introduced in the Veeam FastSCP 3.0.1:

  • Added support for ESX 3.5 Update 4.

The following is a list of new features introduced in the Veeam FastSCP 3.0:

  • Full ESXi support.
  • Improved file copy scheduling options
  • Improved user interface

Resolved Issues

The following is a list of issues resolved in the Veeam FastSCP 3.0.2:

  • FastSCP doesn't process flat.vmdk files on ESX(i) 4 when using agentless mode

The following is a list of issues resolved in the Veeam FastSCP 3.0.1:

  • Inactive datastore error when trying to copy files to/from online NFS datastore connected to an ESXi host.
  • If the daily scheduled copy job’s schedule is set to skip some days, the job always starts at 12am on the next day it is allowed to run.
  • Unknown api version error displayed when processing vCenter containing older versions of ESX servers (prior to ESX 3.0).
  • File deletion errors are not displayed correctly

Known Issues

The following is a list of issues known to exist at the time of the Veeam FastSCP 3.0.2 release:

General 

  • Creating new folder, and copying files within the same datastore fails with COM error () is not
    when FastSCP is installed on the 64bit OS and agentless data transfer mode (ESXi) is used. Service console agent based data transfer mode ESX and Linux is not affected.
  • Scheduled file copy job statuses do not refresh automatically. Click the Refresh button to update the status for all jobs when required.
  • Under certain circumstances, making changes in FastSCP user interface while a file copy job is running may result in loss of session information and statistics.
  • Local administrator rights are required to setup and run FastSCP.

2010-02-08

VMware workstation - Small Rant

Every now and again VMware releases updates for their products, VMware Workstation among them. I have a small issue with the whole thing.

Firstly, I must say this only concerns those who are running Workstation on Windows - from what I hear this does not happen on a Linux host OS.

When you install an update for Workstation, you are required to reboot the machine. That would be fine and more than acceptable - but it has to be done twice. Workstation cannot update the software in place: it has to uninstall the previous version, reboot, then install the new version, and then reboot again.

All in all the process took 35 minutes from start to finish (Uninstall - Reboot - Install - Reboot).

Is there no way that VMware can make this process less time consuming??

Thanks @jasonboche @wilva for the empathy :)

2010-02-02

Optimize NFS settings for Celerra

Following up on both Jason Boche's and Scott Lowe's excellent posts about the recommended Advanced Settings for ESX when using a Celerra NFS mount, and on Jase McCarty's earlier post about how to set the recommended settings for ESX and NetApp, I wanted to share my script for doing the same for those who are using EMC.

function optimize-CelerraNFS {
    Set-VMHostAdvancedConfiguration -Name NFS.SendBufferSize -Value 64
    Set-VMHostAdvancedConfiguration -Name NFS.ReceiveBufferSize -Value 64
    Set-VMHostAdvancedConfiguration -Name NFS.MaxVolumes -Value 32
    Set-VMHostAdvancedConfiguration -Name Net.TcpipHeapMax -Value 120
    Set-VMHostAdvancedConfiguration -Name Net.TcpipHeapSize -Value 30
    Set-VMHostAdvancedConfiguration -Name NFS.HeartbeatFrequency -Value 12
    Set-VMHostAdvancedConfiguration -Name NFS.HeartbeatDelta -Value 12
    Set-VMHostAdvancedConfiguration -Name NFS.HeartbeatMaxFailures -Value 10
}

Connect-VIServer VISERVER      # your vCenter server, used only to list the hosts
$hostcreds = Get-Credential    # credentials for logging on to the ESX hosts themselves

Get-VMHost | ForEach-Object {
    Write-Host "Optimizing $($_.Name) ..."
    Connect-VIServer -Server $_.Name -Credential $hostcreds    # connect to the host directly
    optimize-CelerraNFS                                        # apply the settings to that host
    Disconnect-VIServer -Confirm:$false                        # drop the host connection again
}


The settings are as per the recommendations on the above blog posts and this EMC document.

Regarding lines 5-6 (Net.TcpipHeapMax and Net.TcpipHeapSize): the settings are taken from this knowledgebase article. The EMC document and Scott's blog recommend multiplying these values by the same factor by which you multiplied NFS.MaxVolumes, but since the maximum value for Net.TcpipHeapMax is 120, I could not multiply it by 4.

Lines 12-13: The action must be performed against each ESX host directly, so I gather the credentials for the ESX hosts up front.

The rest is pretty self-explanatory.

It goes without saying that you have to reboot your Host in order for these settings to take effect.

FastSCP 3.0.2.270 Beta

For all of you who have been waiting for a version that works on Windows 7 and on Windows Server 2008 R2 - it is out. For those who do not know, the current tool does not work correctly on Windows 7 or on Windows 2008 R2.

As was posted on the Veeam Forums, you can download a Beta release here

There are issues with ESXi and 64-bit Windows OSes, so read the post on the forum to see what can be done and what cannot.