2009-12-30

My Visio stencils again

*****Update June 11, 2012******

The new version is available here

*****************************


Luc Dekens brought to my attention last night that the Visio stencils I posted on VIOPS a while back have somehow been moved and, even worse, the content has been modified and the stencils are no longer attached.
I have already contacted VMware to try and get the content restored, and am awaiting their reply.
In the meantime, if you need the stencils - ping me via Twitter and I will provide a download link.
Update: Since it is taking a while for VMware to fix the link, I am providing a temporary download link for all those who need the stencils in the interim.
Part 1 & Part 2 (~30MB in total)

2009 - in 349 words



I started this blog, like most bloggers, as nothing more than a personal reference for keeping track of things I find during my day.

Officially my first blog post is dated somewhere in November 2007, but I can say that I started seriously from November 2008.

I posted some New Year's resolutions a while back, when the new Jewish year started.

I can say that from that list of things I wanted to get done I have done the following:

The Business

  • Continue assistance to R&D / PD with enabling them to perform their job better / faster / easier
    (This project is well under way)
  • Lab manager - start to work with these products - for better productivity
    (I am starting a POC for Lab Manager)
  • Help the business with developing a Virtual Appliance Version of our product
    (Not yet - but we are getting more and more interest from customers in utilizing virtualization)

BCP/DR

  • More efficient backup methods and faster restore time
    (we are currently evaluating EMC Avamar for this purpose)

Storage

  • Utilize the latest storage technologies for
    • Thin provisioning
    • Storage de-duplication
    • Storage offloaded snapshots
      (This is coming  - much sooner than I expected since we are outgrowing our current storage)

Automation

  • Utilize more scripting for more standardization of procedures (PowerCLI)
    (Have you noticed my blog lately???)

There is still a lot of work for me to do, but I think this is a good start.

I would like to thank each and every one of the 50,533 unique visitors since January 1, 2009, who have honored me by visiting this blog. I certainly did not think that this was going to happen when I started my first blog just over a year ago.

The most popular articles over the past year were:

ESX3i to ESX4i Update
Small Present for you all - VMware Visio Stencils
How Heavy is your ESX Load?
Converting a Linux Virtual Machine With an LVM

Thank you all for a wonderful year - and I am really looking forward to an
even more interesting one in 2010.

2009-12-28

VMwareVCMSDS - A time for change



As of vCenter 4.0, installing a new instance of vCenter also installs an instance of Microsoft ADAM (on Windows 2003) or AD LDS (on Windows 2008) in order to allow for Linked Mode. The reason is that linking vCenter Servers together requires some kind of hierarchy for the communication between the servers, similar to Active Directory.

Below is a sample screenshot of what it looks like:

 

OK, so here comes the reason for the post and why all of this is important. A customer contacted me today saying, “I cannot log into my vCenter Server - something is wrong!”. I asked what the error message was, and he said, “Something about an error! Please fix it!!”. He sent me a screenshot:

image

Well he was right about the error message, totally no information here.

First things first I looked at the Services, to see if everything was started and lo and behold there was one that was not:

image

Started the Service and connectivity returned to normal.

The reason this service stopped is not exactly clear, but it got me thinking: if this is such a critical service, why did it not restart? If it hits an error (for whatever reason) that shuts the service down, I would expect it to restart.

This is the service setting for the vCenter Service

image

For the Webservices:

image

And even for Update Manager:

image

But the VMwareVCMSDS?

image

Now that seems a bit strange for a critical service - don’t you think? It would be interesting to find the logic behind not setting this service up the same way as all the others.

I am changing mine - What about you?

image

Update: 

I would not be doing myself justice if I did not provide a quick PowerShell way to change this value – without going into the GUI.

I would like to thank Shay Levy for the assistance and point you to his:
Stand alone registry functions library

Save this file as a .ps1 file (Registry.ps1 in the script below), and then use this small bit of PowerShell code.

   1: ##Load the modules in the script
   2: ################################
   3:  
   4: . .\Registry.ps1 
   5:  
   6: ##Get the correct values from the vCenter Service (vpxd)
   7: ########################################################
   8:  
   9: $myval = Get-RegBinary -server ilvcenter -hive localmachine `
  10:     -keyName "SYSTEM\CurrentControlSet\Services\vpxd" `
  11:     -valueName FailureActions
  12:  
  13: ##Set the values to the ADAM_VMwareVCMSDS Service
  14: ################################################
  15:  
  16: Set-RegBinary -server ilvcenter -hive localmachine -keyName `
  17:     "SYSTEM\CurrentControlSet\Services\ADAM_VMwareVCMSDS" `
  18:     -valueName FailureActions -value $myval

Line 9 – Gets the current recovery option settings from the vCenter service (vpxd) and saves them to the $myval variable.

Line 16 – Writes the same settings to the ADAM_VMwareVCMSDS service.
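
For anyone curious about what is actually being copied: the FailureActions value is a binary blob. Below is a minimal Python sketch of decoding it, assuming the commonly documented registry layout of the Windows SERVICE_FAILURE_ACTIONS structure (a 20-byte header followed by one 8-byte type/delay pair per action). The function name and the sample bytes are made up for illustration and were not read from a real vCenter server.

```python
import struct

# Action types as defined by the Windows SC_ACTION_TYPE enumeration.
ACTION_NAMES = {0: "none", 1: "restart service", 2: "reboot", 3: "run command"}


def decode_failure_actions(blob: bytes):
    """Decode a FailureActions REG_BINARY value (assumed layout, see the text above)."""
    # Header: reset period in seconds, two unused pointer slots,
    # the number of actions, and one more unused pointer slot.
    reset_period, _msg, _cmd, count, _ptr = struct.unpack_from("<5I", blob, 0)
    actions = []
    for i in range(count):
        action_type, delay_ms = struct.unpack_from("<2I", blob, 20 + 8 * i)
        actions.append((ACTION_NAMES.get(action_type, "unknown"), delay_ms))
    return reset_period, actions


# Hypothetical sample: reset the failure counter after 86400 s,
# and restart the service after 60 s for each of three failures.
sample = struct.pack("<5I", 86400, 0, 0, 3, 20) + struct.pack("<2I", 1, 60000) * 3

reset_period, actions = decode_failure_actions(sample)
print(reset_period, actions)
```

Decoding the value before and after the change is one way to confirm that both services really ended up with the same recovery actions.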

2009-12-26

The case of my SSL cert – RTFM!!!!!

It took me a while to understand why this was not working - it could be because I hate - actually loathe - having to dig through logs because of Java and Tomcat issues, but I only have myself to blame for this one.

I am currently installing a new vCenter for my Production Environment (this is part of the MJTV series that I am currently going through). The last time we installed, we were just starting out with VMware – and we have encountered a decent number of problems because of lack of experience and knowledge. Therefore a new vCenter (not from scratch, but that is another post entirely).

Fast forwarding a couple of years - the technology has evolved - and I have gained more knowledge. One of the things that was never implemented correctly was an SSL Certificate for vCenter. I wanted to do this right, so I started reading up on what should be done and how.

Firstly – this is the official VMware reference document. Since we are a Microsoft shop with an established PKI infrastructure, I went to page 2 - Replacing Default Server Certificates with Certificates Signed by a Commercial CA.

OK, so first things first. In order to create the Certificate Signing Request (CSR) you will have to download the OpenSSL binaries from here. Since the vCenter is a 64-bit box – I got the 64-bit version. Before installing the software you will also need to download and install the Visual C++ 2008 Redistributables (x64), otherwise you will not be able to run the binaries.

I installed it all to C:\Program Files\OpenSSL. The files you will work with are in the bin directory of the installation folder.

First you generate an RSA key for your host.

C:\Program Files\OpenSSL\bin>openssl.exe genrsa 1024 > rui.key

A small pause here. The openssl.cfg is the configuration file for the application. I wanted to install a certificate with two different hostnames. Why, you may ask? Actually it is very simple. Not all of the users will always remember to put in a Fully Qualified Domain Name when accessing the vCenter server. True, they should – but it doesn’t always work that way. So I wanted the SSL certificate to be valid both for the FQDN and the short hostname - i.e. vcenter.maishsk.local and just plain vcenter. How is this done? With a field in your certificate called Subject Alternative Name (subjectAltName). How do you get this into your CSR? Following the great advice from this link, I added the following to the [req] section of the openssl.cfg file:

[req]
req_extensions = v3_req

And in the v3_req section:

[ v3_req ]
subjectAltName          = @alt_names

[alt_names]
DNS.1   = vcenter.maishsk.local
DNS.2   = vcenter
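
If you need more than two names, the pattern is simply one numbered DNS.n entry per hostname. As a rough illustration (a Python sketch, not part of the procedure I followed; render_alt_names is a made-up helper name), the section can be generated from a list:

```python
def render_alt_names(hostnames):
    """Render an OpenSSL [alt_names] section with one DNS.n entry per hostname."""
    lines = ["[alt_names]"]
    for index, name in enumerate(hostnames, start=1):
        lines.append(f"DNS.{index} = {name}")
    return "\n".join(lines)


# The two names used in this post: the FQDN and the short hostname.
section = render_alt_names(["vcenter.maishsk.local", "vcenter"])
print(section)
```

The output is plain text that can be pasted into openssl.cfg under the v3_req section shown above.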

Next I created the CSR

C:\Program Files\OpenSSL\bin>openssl req -new -key rui.key -config openssl.cfg > rui.csr

To check if the CSR was created correctly with the multiple hostnames, I ran

C:\Program Files\OpenSSL\bin>openssl req -text -noout -in rui.csr

and got output similar to this

Requested Extensions:
    X509v3 Basic Constraints:
        CA:FALSE
    X509v3 Key Usage:
        Digital Signature, Non Repudiation, Key Encipherment
    X509v3 Subject Alternative Name:
        DNS:vcenter.maishsk.local, DNS:vcenter

From there, to get the actual Certificate from your CA, browse to:

http://<CA_HOSTNAME>/certsrv

Click on Request a Certificate

Click on Submit a certificate request by using a base-64-encoded CMC or PKCS #10 file, or submit a renewal request by using a base-64-encoded PKCS #7 file.

Open the rui.csr file that you saved a few steps above with Notepad and copy all of the contents, including the "-----BEGIN CERTIFICATE REQUEST-----" and "-----END CERTIFICATE REQUEST-----" lines.

Paste the contents into the Saved Request field

Choose Web Certificate from the Certificate Template and click on the Submit button.

Select Base 64 Encoded, click Download Certificate, and save the certificate to C:\Program Files\OpenSSL\bin - make sure you save the file as rui.crt!

Next I created the .pfx (Personal Information Exchange) file for rui.crt.

C:\Program Files\OpenSSL\bin> openssl pkcs12 -export -in rui.crt -inkey rui.key -name vcenter.maishsk.local -out rui.pfx

I was prompted with a password request

Loading 'screen' into random state - done
Enter Export Password:
Verifying - Enter Export Password:

-------------- Make note of this part - because this is where it went wrong.------------------------
I pressed Enter twice - because I had not entered a password earlier when creating the CSR.

I stopped both the VMware VirtualCenter Management Webservices and the VMware VirtualCenter Server. I then copied the three files rui.crt, rui.key, and rui.pfx to the default SSL location which is, according to the official Whitepaper from VMware:

image

Unfortunately - that is not 100% accurate. The correct path should be:

C:\ProgramData\VMware\VMware VirtualCenter\SSL

If you try to access the path in the Whitepaper you will get a nice

image

So I backed up the old SSL Certificate in the folder and pasted my new files.

And started the Service again.

I tried to access the server with https://vcenter.maishsk.local and https://vcenter and everything worked fine. The Certificate was good, no errors - all was hunky-dory!

This is what the Altname part of the Certificate looks like by the way

image

Opened up the vSphere client - accessed the vCenter server and was not presented with a Certificate warning so all was good - or at least I thought so.

Later on in the day I went to look to see if all was ok - and noticed that there was no vCenter Health Status Icon

image

So I looked at the plug-ins and got this

image

Now naturally - I went looking for the problem on the web and this post popped up.

http://kb.vmware.com/kb/1010641 did not resolve my problem.

I tried to access the link, which was https://vcenter.maishsk.local:8443/sdk, and it was not working. It was not that I was getting a blank page - there was no response from the other side at all.

Next was to look at the vCenter server to see if it was listening on port 8443.

That is easy enough, back to the command-line.

netstat -a | findstr 8443

(for those of you who do not know, findstr is a Windows command-line tool similar to grep)

Findstr

Searches for patterns of text in files using regular expressions.

Syntax

findstr [/b] [/e] [/l] [/r] [/s] [/i] [/x] [/v] [/n] [/m] [/o] [/p] [/offline] [/g:file] [/f:file] [/c:string] [/d:dirlist] [/a:ColorAttribute] [strings] [[Drive:][Path] FileName [...]]

And that came up empty

C:\Program Files\OpenSSL\bin>netstat -a | findstr 8443

C:\Program Files\OpenSSL\bin>

Just to make sure that I was not using the wrong syntax I looked for port 443

C:\Program Files\OpenSSL\bin>netstat -a | findstr 443
  TCP    0.0.0.0:443        VCENTER:0                LISTENING
  TCP    1xx.xx.3.xx:443    msaidelk-server:59260    ESTABLISHED
  TCP    1xx.xx.3.xx:443    msaidelk-server:59261    ESTABLISHED
  TCP    [::]:443           VCENTER:0                LISTENING

C:\Program Files\OpenSSL\bin>

So SSL was working - but not on port 8443. Now this was weird.
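
The same check can be made from any machine on the network rather than on the server itself. Here is a minimal Python sketch, assuming plain TCP reachability is all you want to test (is_listening is a made-up helper name; the throwaway local listener exists only so the example runs without a real vCenter server):

```python
import socket


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


# Self-contained demonstration: open a temporary listener on a free local port.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]

open_result = is_listening("127.0.0.1", open_port)    # listener is up
listener.close()
closed_result = is_listening("127.0.0.1", open_port)  # listener is gone
print(open_result, closed_result)
```

Against the server in this post the call would be is_listening("vcenter.maishsk.local", 8443), which tells you whether anything answers on the port but nothing about why it does not.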

I went to look at the Tomcat Logs (located c:\Program Files (x86)\VMware\Infrastructure\tomcat\logs\catalina.2009-12-24.log)

And in the log I saw the following.

Dec 24, 2009 10:41:55 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-127.0.0.1-8080
Dec 24, 2009 10:41:56 PM org.apache.tomcat.util.net.jsse.JSSESocketFactory getStore
SEVERE: Failed to load keystore type PKCS12 with path C:\ProgramData\VMware\VMware VirtualCenter\SSL\rui.pfx due to failed to decrypt safe contents entry: javax.crypto.BadPaddingException: Given final block not properly padded
…..

Dec 24, 2009 10:41:56 PM org.apache.coyote.http11.Http11Protocol init
SEVERE: Error initializing endpoint
java.io.IOException: failed to decrypt safe contents entry: javax.crypto.BadPaddingException: Given final block not properly padded
    at com.sun.net.ssl.internal.ssl.PKCS12KeyStore.engineLoad(PKCS12KeyStore.java:1275)
….

And so on and so on…
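
If, like me, you would rather not read these logs by eye, the SEVERE entries are easy to pull out: each entry is a timestamp line followed by a "LEVEL: message" line. A minimal Python sketch, assuming that two-line layout holds throughout the file (severe_entries is a made-up helper name, and the sample is abridged from the log above):

```python
def severe_entries(log_text):
    """Return (header, message) pairs for every SEVERE entry in a Tomcat log."""
    entries = []
    previous = ""
    for line in log_text.splitlines():
        if line.startswith("SEVERE:"):
            # The line before a level line carries the timestamp and source class.
            entries.append((previous, line[len("SEVERE:"):].strip()))
        previous = line
    return entries


sample_log = """\
Dec 24, 2009 10:41:55 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-127.0.0.1-8080
Dec 24, 2009 10:41:56 PM org.apache.coyote.http11.Http11Protocol init
SEVERE: Error initializing endpoint
"""

found = severe_entries(sample_log)
for header, message in found:
    print(header, "->", message)
```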

After searching the VMware forums and anything connected with VMware and coming up completely blank, I widened the search to treat it as a pure Tomcat and Java issue. That turned up Unable to import openssl key to java keystore and this post, both of which suggested that the rui.pfx I created earlier should have been password protected.

Now looking at C:\Program Files (x86)\VMware\Infrastructure\tomcat\conf\server.xml brought me to find this configuration

<Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
    maxThreads="150" scheme="https" secure="false"
    clientAuth="false" sslProtocol="TLS"
    keystoreFile="C:\ProgramData\VMware\VMware VirtualCenter\SSL\rui.pfx"
    keystorePass="testpassword" keystoreType="PKCS12"

Now where had I seen that string before?? hmmm…
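
To see at a glance which keystore and password Tomcat expects, the Connector attributes can be read straight from the XML. This is only a Python sketch: the attributes are the ones shown above, while the wrapping Service element and the closing "/>" are my assumptions to make the fragment well-formed, and keystore_settings is a made-up helper name.

```python
import xml.etree.ElementTree as ET

# Attributes copied from the Connector above; the <Service> wrapper and the
# closing "/>" are assumptions added so the fragment parses as XML.
sample_xml = r"""
<Service name="Catalina">
  <Connector port="8443" protocol="HTTP/1.1" SSLEnabled="true"
             keystoreFile="C:\ProgramData\VMware\VMware VirtualCenter\SSL\rui.pfx"
             keystorePass="testpassword" keystoreType="PKCS12" />
</Service>
"""


def keystore_settings(xml_text):
    """Return (port, keystoreFile, keystorePass) for each keystore-backed Connector."""
    root = ET.fromstring(xml_text)
    return [
        (c.get("port"), c.get("keystoreFile"), c.get("keystorePass"))
        for c in root.iter("Connector")
        if c.get("keystoreFile")
    ]


settings = keystore_settings(sample_xml)
for port, path, password in settings:
    print(port, path, password)
```

Whatever password comes back here is the one the .pfx file has to be exported with.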

(And this is where things went wrong). I had started to follow a walkthrough that was posted on the VMTN, and on this page there was no mention of what the password was supposed to be. So I naturally pressed Enter - Twice - and continued.

Remember I said above

-------------- Make note of this part - because this is where it went wrong.------------------------
If I had read the White paper carefully - it explicitly states

image

So pushing Enter twice was not a good idea after all. I should have entered the password shown above (testpassword, the keystorePass value in server.xml) when prompted, or run the command exactly as the White paper gives it.

A quick change of the rui.pfx. Stop Services. Copy new file. Start Services.

And….

Logs were clean

Dec 24, 2009 11:28:22 PM org.apache.catalina.core.AprLifecycleListener init
INFO: The APR based Apache Tomcat Native library which allows optimal performance in production environments was not found on the java.library.path: C:\Program Files (x86)\VMware\Infrastructure\tomcat\bin;.;C:\Windows\system32;C:\Windows;C:\Windows\system32;C:\Windows;C:\Windows\System32\Wbem
Dec 24, 2009 11:28:22 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-127.0.0.1-8080
Dec 24, 2009 11:28:22 PM org.apache.coyote.http11.Http11Protocol init
INFO: Initializing Coyote HTTP/1.1 on http-8443
Dec 24, 2009 11:28:22 PM org.apache.catalina.startup.Catalina load
INFO: Initialization processed in 1080 ms
Dec 24, 2009 11:28:22 PM org.apache.catalina.core.StandardService start
INFO: Starting service Catalina
Dec 24, 2009 11:28:22 PM org.apache.catalina.core.StandardEngine start
INFO: Starting Servlet Engine: Apache Tomcat/6.0.20
Dec 24, 2009 11:28:26 PM org.apache.catalina.loader.WebappClassLoader validateJarFile
INFO: validateJarFile(C:\Program Files (x86)\VMware\Infrastructure\tomcat\webapps\sms\WEB-INF\lib\servlet-api.jar) - jar not loaded. See Servlet Spec 2.3, section 9.7.2. Offending class: javax/servlet/Servlet.class
Dec 24, 2009 11:28:30 PM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-127.0.0.1-8080
Dec 24, 2009 11:28:30 PM org.apache.coyote.http11.Http11Protocol start
INFO: Starting Coyote HTTP/1.1 on http-8443
Dec 24, 2009 11:28:30 PM org.apache.jk.common.ChannelSocket init
INFO: JK: ajp13 listening on /0.0.0.0:8009
Dec 24, 2009 11:28:30 PM org.apache.jk.server.JkMain start
INFO: Jk running ID=0 time=0/31 config=null
Dec 24, 2009 11:28:30 PM org.apache.catalina.startup.Catalina start
INFO: Server startup in 8060 ms

My vCenter Service was back

image

And all my Plug-ins were working

image

So what have I learned from this experience?

  1. RTFM!!!!!!!!!!!
  2. Read it again!!!
  3. Debugging an issue like this takes time. It does give a large amount of satisfaction actually finding the problem and even more so finding the solution.
  4. Writing a blog post like this can take over two hours :)
  5. Have a happy holidays everyone!

2009-12-23

ESX Disconnecting from vCenter



I got a call today from a colleague that had an issue with an ESX Server that was behind a firewall that kept on disconnecting every 30 seconds, and he could not understand why.

I remembered that I had encountered this before. It was happening because not all of the required ports were open in the firewall to allow the traffic through, and therefore the Host was losing its connection.

So as a reference for myself (and anyone else that can use it) here is what needs to be opened.

firewallrules

Happy holidays to you all!

2009-12-21

Assigning Permissions – PowerCLI

Have you ever been asked to assign permissions to a VM/Folder/Resource?

Come on, own up! Of course you have.

Ever done it with the GUI? I guess the answer is the same.

So the GUI is pretty easy:

  1. Find Resource (for example VM)
  2. Right-Click
  3. Add Permission
  4. Choose Role
  5. Check Propagate (if needed)
  6. Add User/Group
  7. OK
  8. OK

In total, 8 different steps that need to be performed for one action.

Enter PowerCLI. In the latest release there is a new cmdlet – New-VIPermission

NAME
  New-VIPermission

SYNOPSIS
  Creates new permissions on the specified inventory objects for the provided users and groups in the role.

SYNTAX
  New-VIPermission [-Entity] <InventoryItem[]> [-Principal] <VIAccount[]> [-Role] <Role>
[-Propagate [<Boolean>]] [-Server <VIServer>] [-WhatIf] [-Confirm] [<CommonParameters>]

 

So if you would like to add a Domain (MAISHSK) User (User1) as an Administrator on a Folder (Folder1) you would run:

[vSphere PowerCLI] C:\> get-folder folder1 | New-VIPermission -Role 'Admin' -Principal 'MAISHSK\User1' 

New-VIPermission : 12/21/2009 5:48:29 AM    New-VIPermission        Could not find VIAccount with name 'MAISHSK\User1'.
At line:1 char:17
+ New-VIPermission <<<<  -Role 'Admin' -Principal 'MAISHSK\User1' -Entity (Get-folder folder1)
    + CategoryInfo          : ObjectNotFound: (MAISHSK\User1:String) [New-VIPermission], VimException
    + FullyQualifiedErrorId :  Core_ObnSelector_SelectObjectByNameCore_ObjectNotFound,
       VMware.VimAutomation.Commands.PermissionManagement.NewVIPermission

New-VIPermission : Value cannot be found for the mandatory parameter Principal
At line:1 char:17
+ New-VIPermission <<<<  -Role 'Admin' -Principal 'MAISHSK\User1' -Entity (Get-folder folder1)
    + CategoryInfo             : NotSpecified: (:) [New-VIPermission], ParameterBindingException
    + FullyQualifiedErrorId : RuntimeException,VMware.VimAutomation.Commands.
       PermissionManagement.NewVIPermission

But Hey that did not work! Huh???!!

This led me to a post on the VMTN forums regarding this issue by Carter Shanklin.

In short:

The source of the bug is that PowerCLI cannot correctly convert this principal into the type of object it needs, which is a VIAccount object. The workaround is to create the VIAccount object yourself.

And how do you do that you may ask? With this Function

function New-VIAccount($principal) {
	$flags = `
		[System.Reflection.BindingFlags]::NonPublic    -bor
		[System.Reflection.BindingFlags]::Public       -bor
		[System.Reflection.BindingFlags]::DeclaredOnly -bor
		[System.Reflection.BindingFlags]::Instance

	$method = $defaultviserver.GetType().GetMethods($flags) |
	where { $_.Name -eq "VMware.VimAutomation.Types.VIObjectCore.get_Client" }

	$client = $method.Invoke($global:DefaultVIServer, $null)
	Write-Output (New-Object  VMware.VimAutomation.Client20.PermissionManagement.VCUserAccountImpl  -ArgumentList $principal, "", $client)
}

[vSphere PowerCLI] C:\> $account = New-VIAccount "MAISHSK\user1"
[vSphere PowerCLI] C:\> get-folder folder1 | New-VIPermission -Role 'Admin' -Principal $account -Propagate:$true

EntityId             Role     Principal        IsGroup   Propagate
--------             ----     ---------        -------   ---------
Folder-group-v241    Admin    MAISHSK\user1    False     True

How many clicks was that?

2009-12-20

Updating a User attribute in the Enterprise



I was asked to populate the EmployeeNumber attribute for each and every user in the Enterprise, because a new global database application will be using that attribute.

I had several examples that I could use for the job utilizing VBScript – but I wanted to use PowerShell for the task.

It turned out to be a relatively easy task – using the Quest Active Directory cmdlets.

   1: add-PSSnapin quest.activeroles.admanagement 
   2:  
   3: Connect-QADService -Service domain.com -Credential (Get-Credential)
   4:  
   5: $infile = Import-Csv "c:\temp\file.csv"
   6:  
   7: $logfile = "c:\temp\logfile.log"
   8: foreach ($line in $infile) {
   9:         set-QADObject ($line.domain +"\" + $line.login) -ObjectAttributes `
  10:             @{employeeNumber=$line.guid} 
  11:         if ($? -eq $true){
  12:         Write-output "Updated: $($line.domain)\$($line.login) with employeeNumber: `
  13:             $($line.guid)" >>  $logfile
  14:         } else {
  15:         Write-output "Error in updating: $($line.domain)\$($line.login)" >> $logfile
  16:         } 
  17:     }    
  18:  
  19: ##Get Results
  20: $results = foreach ($line in $infile) {
  21:     get-QADObject ($line.domain +"\" + $line.login) -IncludedProperties `
  22:         Name, employeeNumber | select Name, employeeNumber 
  23:     } 
  24: $results >> $logfile
  25:  
  26: Disconnect-QADService -Service domain.com


A Quick explanation:

Line 1: Add the Quest Snapin

Line 3: Connect to the domain with acquired credentials

Lines 5-7: import the CSV file that was formatted - domain,login,guid, and create a log file for results

Lines 8-17: Go through each line in the CSV – if successful log to the file and if not then report the error to the log file.

Lines 20-24: Go through the list of users again – retrieving only the Name and EmployeeNumber properties and append the results to the same log file.

The script took longer to write than it did to run.
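
Since the whole run depends on the CSV being clean, it can be worth checking the file before touching the directory. Below is a minimal Python sketch of such a pre-check, assuming the domain,login,guid header described above; check_rows is a made-up helper name and the sample accounts and numbers are hypothetical.

```python
import csv
import io

REQUIRED = ("domain", "login", "guid")


def check_rows(csv_text):
    """Split CSV rows into usable ones and (line number, reason) problems."""
    good, problems, seen = [], [], set()
    reader = csv.DictReader(io.StringIO(csv_text))
    for number, row in enumerate(reader, start=2):  # line 1 is the header
        account = f"{row.get('domain') or ''}\\{row.get('login') or ''}"
        if any(not (row.get(field) or "").strip() for field in REQUIRED):
            problems.append((number, "missing field"))
        elif account.lower() in seen:
            problems.append((number, "duplicate account"))
        else:
            seen.add(account.lower())
            good.append((account, row["guid"]))
    return good, problems


# Hypothetical input: one clean row, one with no guid, one duplicate login.
sample = "domain,login,guid\nMAISHSK,user1,1001\nMAISHSK,user2,\nMAISHSK,User1,1003\n"
good, problems = check_rows(sample)
print(good)
print(problems)
```

Accounts are compared case-insensitively here on the assumption that the directory treats logins that way.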

Hope you enjoyed the ride.

2009-12-06

Benefits and Justification - MJTV1



Firstly, you might ask what is MJTV1? I wanted to tag all my posts for this series with something that makes them easy to recognize. So no, it is not Michael Jackson TV 1, but rather My Journey To VSphere. I started two weeks ago with this post.

So let us start with Part 1. Today I will discuss the topic: Benefits and Justification

As you might all know, the technical part accounts for less than 20% of a successful project. I believe that a project will succeed or fail mainly based on how well it was planned, documented and thought through, and on how well risks were identified and mitigated - and only in the end on the technical implementation.

So what goes into planning your project?

image

First and foremost - you will have to identify (and of course "sell" to Management) why you need to upgrade. We all know the saying "if it ain't broke then don't fix it!" I personally do not really believe in that, because technology evolves all the time; things get better, faster and cheaper, which makes this kind of logic not always the best option. Driving around in a '69 Mustang is a really cool thing - you get the babes, you get to go from A->B and in general this car fills its purpose. Comparing it to the new Hybrid cars today - you still get the babes, get from A->B and it also fills its purpose. But looking at the bigger picture…

 

Your Mustang needs:

  • frequent repairs and tune-ups
  • more gas
  • a new coat of paint
  • doesn't drive as fast
  • emits more fumes
  • parts are harder to come by
  • but it still stays a cool car!
  • no air-conditioning

Your Hybrid needs:

  • less gas - because it uses alternative energy
  • fewer repairs (it is a new car)
  • zooms like the wind
  • easier on the environment
  • can be a cool car - depends on who is driving it
  • it has air-conditioning

I think you see where I am going here. Let us take it to the upgrade to vSphere. The question you will and must ask yourself is - "What are the benefits I will receive with the upgrade?" Now of course I can list a number of benefits here - you can get them from VMware's site or from your own personal environment. Also you can add to the list - what are the problems / issues I am currently experiencing in my environment and does this upgrade solve/ease them?

From the list you must identify your current pain points and how this upgrade can ease / solve them.

For me - these are some of the benefits I will emphasize for upgrading to vSphere 4 U1.

  • VDR and vStorage API's - Better options for backup compared to 3.5
  • Large performance improvement on the same hardware.
  • Larger Resources available for each Virtual Machine
  • Host Profiles and dvSwitch - will save a huge amount of time in the enterprise with configuration of ESX hosts.
  • Higher density on each ESX host.

If you cannot come up with a good list of why you should upgrade and which problems this upgrade will solve FOR THE BUSINESS, then you should not be doing it. I do not agree with those who upgrade to a new version just because it came out. That is why so many of Microsoft's Enterprise customers told Microsoft to take Vista, and shove it. We did not deploy Vista - there never was any reason to do so. Now that Windows 7 is out and its benefits include a large number of Enterprise features that we can use - there is more of a reason to roll out the new OS. I want to emphasize one more thing here - and it was written in capital letters a few lines above - the upgrade could make things easier for you as an Admin, it could save you time, you could do your job better - but if there is no real benefit for your Business, then you will have a hard time selling the justification to Management.

Of course this will all have to go down on paper / presentation for the right people, in order to get the go-ahead for your project.

You will of course also have to keep in mind what the process of the upgrade will be, but that is for another post.

2009-12-04

Once Upon a Performance Issue



If you were following me on Twitter this week, you would have noticed that I was extremely busy troubleshooting and solving a serious performance issue that I encountered.

First things first - the environment.

Multiple ESX 3.5 Clusters residing on NFS across multiple Datastores coming from same Storage.

On average these hosts are utilizing 50-70% RAM and 20-30% CPU. The machines run without any noticeable issues.

Next - the incident.

Along comes 05.00 - alerts start coming in from the monitoring system - the monitoring agents were timing out, i.e. it looked like the virtual machines were not responding. Within 20 minutes things were back to normal.

08.00 - the same thing happens again. While this was happening I tested the guest machines for connectivity - all 100%. Tried to log into a virtual machine with RDP - slow as a snail. It took approximately 3 minutes from CTRL+ALT+DELETE till I got the desktop, and again all this time network connectivity to the VM was 100%.

I started to receive more complaints of the same issue across a number of Virtual Machines. First thing I did was to try to find what (if anything) was in common. The machines were spread around over different hosts, in different clusters on different VLAN's, so that was not it.

During these outages the Hosts themselves were completely fine.

CPU usage was normal
RAM usage was normal
Network Usage was normal
ESXTOP statistics were normal - no contention on CPU / memory.

Now what the heck was going on here?

The only thing that was common amongst all the Virtual machines - they were all using NFS datastores - divided over 3 different Datamovers on the same EMC Storage array.

The outages were intermittent and not permanent.

In the meantime I opened a priority 1 SR with VMware support. Support was back to me within 30 minutes (according to the Severity Definitions), so they were right on time.

Logs were collected.

Tests were performed to test network issues with the ESX hosts.

In the meantime - we tried to see if anything was wrong with the network infrastructure - no issues at all. Throughput on the ports using the NFS datastores was well below normal, and the Virtual Machine network was also not suffering under any kind of load.

Again all fingers were pointing to the Storage Array.

There was a slight amount of stress on the storage array - this we found with the help of EMC (who also got a priority 1 call the same time as VMware) but nothing to be highly worried about.

OK - so how do you measure NFS throughput on the ESX side? Unfortunately this is not so simple. In contrast to disk throughput with iSCSI / SAN, which can be measured relatively easily with the performance charts / ESXTOP, there are no metrics for disk performance when it comes to NFS datastores. The only thing you can check is vmkernel throughput.

Using ESXTOP -> n:ESX nic -> T to Sort by megabits tx ( I truncated the data a bit to make it presentable)

PORT ID   USED BY              DNAME      PKTTX/s  MbTX/s  PKTRX/s  MbRX/s  %DRPTX  %DRPRX
33554433  vmnic2               vSwitch1    195.50    2.03   118.83    0.11    0.00    0.00
33554436  vmk-tcpip-1.1.x.xx   vSwitch1    195.50    2.03   118.26    0.11    0.00    0.00

The vmk-tcpip entry is the VMkernel interface, and its row shows the interface's network traffic. The utilization of this port was never getting over 2-4 Mb/s - which is nothing.
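
If you capture this screen as text, picking out the VMkernel rows is easy to script. A minimal Python sketch, assuming the nine columns shown above and that the USED BY value contains no spaces (vmkernel_traffic is a made-up helper name; the two rows are the ones from my output):

```python
def vmkernel_traffic(esxtop_rows):
    """Return {USED BY: (MbTX/s, MbRX/s)} for VMkernel ports in esxtop network rows."""
    traffic = {}
    for row in esxtop_rows.strip().splitlines():
        fields = row.split()
        # Columns: PORT ID, USED BY, DNAME, PKTTX/s, MbTX/s, PKTRX/s, MbRX/s, %DRPTX, %DRPRX
        if len(fields) == 9 and fields[1].startswith("vmk"):
            traffic[fields[1]] = (float(fields[4]), float(fields[6]))
    return traffic


rows = """
33554433 vmnic2             vSwitch1 195.50 2.03 118.83 0.11 0.00 0.00
33554436 vmk-tcpip-1.1.x.xx vSwitch1 195.50 2.03 118.26 0.11 0.00 0.00
"""

result = vmkernel_traffic(rows)
print(result)
```

Keep in mind that this only shows network throughput on the VMkernel port; it is still not a disk latency figure for the NFS datastore.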

In the meantime we started to receive more complaints about regular NFS mounts (not connected to our Virtual Infrastructure) that were performing slowly - in addition other servers that were connected directly to the SAN as well were suffering.

Again all pointed to the storage.

One more thing.

NFS (like iSCSI) uses the vmkernel - so where would you look for issues if that were the case?
If you said /var/log/vmkernel - you were right!

From the log - during these outages entries similar to this were present

xxxxx vmkernel: 133:06:49:16.958 cpu6:1724)VSCSIFs: 441: fd 258193 status No connection

No connection? No connection? Datastore not responding - Storage anyone?
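
To get a feel for how often this happens during an outage, the matching entries can be pulled out of the log. This Python sketch uses a pattern modelled only on the single line quoted above, so treat it as an assumption about the format; no_connection_events is a made-up helper name and the second sample line is invented to show a non-match.

```python
import re

# Pattern modelled on the one log line quoted above; other vmkernel
# messages may well be formatted differently.
NO_CONNECTION = re.compile(
    r"vmkernel: (?P<stamp>\d+:\d+:\d+:\d+\.\d+) (?P<cpu>cpu\d+):\d+\)"
    r"VSCSIFs: \d+: fd (?P<fd>\d+) status No connection"
)


def no_connection_events(log_lines):
    """Return (stamp, cpu, fd) for each 'status No connection' entry."""
    events = []
    for line in log_lines:
        match = NO_CONNECTION.search(line)
        if match:
            events.append((match["stamp"], match["cpu"], int(match["fd"])))
    return events


sample = [
    "xxxxx vmkernel: 133:06:49:16.958 cpu6:1724)VSCSIFs: 441: fd 258193 status No connection",
    "xxxxx vmkernel: 133:06:49:17.100 cpu2:1040)some unrelated message",  # invented line
]

events = no_connection_events(sample)
print(events)
```

Counting these per minute and lining them up against the monitoring alerts is a quick way to confirm that the slow logins and the storage timeouts are the same event.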

After putting 2+2 together - and getting a big headache - we all knew it was a storage issue.

Sat on EMC's head to solve it.

They did. What it turned out to be was an application that was connected to a LUN on the storage array (not my LUN) that had malfunctioned - and was using its LUN with 100% utilization over 90% of the time.

Why this affected the rest of the storage - we will hear back from EMC after they complete the root cause analysis on the issue. But as soon as the rogue application was stopped - like magic all returned to normal. Measures have been taken to alert us of such issues on the storage array in future.

So what did I learn from this experience?

  1. Why were the machines still responding - even though the storage was not working properly? My theory on this is as follows. Network was working fine. The machines responded slowly - when you tried to login. What happens when you login?  You load up a user profile - which is on the vmdk - which in turn was on the NFS share - which was as slow as a snail. Therefore it was logical that this was the issue, because of a badly performing disk.
  2. NFS throughput is not something that VMware can present easily to the administrator for troubleshooting. There are no disk counters for VM's on an NFS datastore. Disk Performance on the ESX does not include NFS traffic. This I find is something that VMware has to improve on - since more and more shops are starting to use NFS by default. If they provide the statistics for iSCSI / Fiber - then there is no reason they should not do it for NFS.
  3. An assumption was made that the Storage Array was most probably the least likely to fail out of all the chain of components in the virtual Infrastructure.
    In the ESX Server - we have 2 Disks / 2 CPU's / 2 Power supplies / at least 2 NIC's - all to protect from a single point of failure
    The Network cards are connected redundantly to the Network Infrastructure - to protect from a single point of failure.
    The ESX Servers were connected to the storage array to 3 different Datamovers - to protect from a single point of failure.
    But all in all the storage was the point of failure here.

The storage is shared with other applications and not dedicated to Virtualization - this has its ups and downs.

So now all is calm and well - and now I can start up solitaire on my Windows servers within a few seconds from the time I press CTRL+ALT+DEL - so I am happy :)

What I do like about instances like these is that things that should not / cannot happen (in theory) actually do (in reality) - and when they do, it is a great learning experience, which only makes me want to improve and provide an even higher level of performance / availability.

Hope you all enjoyed the ride!