Sunday, February 23, 2014

UCS Firmware 2.2(1c) Upgrade Woes - WILL_BOOT_FAULT

This weekend, while testing out the new features and compatibility of Cisco UCS firmware 2.2(1c), I ran into an issue where a blade being upgraded via the auto-upgrade feature was showing a very strange status.

I performed the normal steps:

  • Upgrade the Cisco UCS Manager software to version 2.2(1c) or later. The required server CIMC and BIOS firmware are also included in the download bundle.

  • Use Cisco UCS Manager to upgrade and activate the server CIMC, BIOS, and board controller firmware

  • Monitor the firmware upgrade process via FSM at either the blade level (recommended) or from the service profile
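The FSM can also be monitored from an SSH session to UCSM. A minimal sketch, assuming the same chassis 2 / blade 5 example used later in this post (verify the exact scope and command against your UCSM version):

```
# scope server 2/5
# show fsm status
```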


Normally, this process is straightforward and very easy, which is one of the reasons I love Cisco UCS. This time, however, after everything was said and done, the blade status was showing "WILL_BOOT_FAULT."

Uh, what the heck is WILL_BOOT_FAULT, and where did this come from? I've never seen this before! So I elected to dig into the problem and see what could be wrong.

I opened an SSH connection to UCSM and proceeded to do the following:

# scope server 2/5 (chassis 2, blade 5)
# scope boardcontroller
# show image (lists all of the available firmware versions)
# activate firmware version.0 force (substitute the appropriate version from the show image output)
# commit-buffer

This forced the firmware version onto the blade in question, and everything seemed to progress without any issues. Or so I thought: upon completion, the WILL_BOOT_FAULT error came back. DARN! I then decided to fall back on an old trick that I have used before when moving from pre-2.0.x to 2.0.x, and again when upgrading from 2.0.x to 2.1.x, which is to reset the CIMC.

  • In the Navigation pane, click the Equipment tab.

  • On the Equipment tab, expand Equipment > Chassis > Chassis number > Servers, then choose your server.

  • In the Work pane, click the General tab.

  • In the Actions area, click Recover Server.

  • In the Recover Server dialog, click Reset CIMC (Server Controller), then click OK.

  • Wait for CIMC to reboot and for Cisco UCS Manager to do a shallow discovery of the server. This takes two to three minutes.
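If you prefer the CLI, the CIMC reset can also be issued from a UCSM SSH session. A hedged sketch, again using chassis 2 / blade 5 as the example (verify the reset-cimc command name against your UCSM release):

```
# scope server 2/5
# reset-cimc
# commit-buffer
```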


When all was said and done, the blade came happily back online and the WILL_BOOT_FAULT error was cleared.

Wednesday, January 15, 2014

New Training Course - VMware Virtual SAN: Deploy and Manage

This training course focuses on deploying and managing a software-defined storage solution with VMware Virtual SAN 5.5. This course looks at how Virtual SAN is used as an important component in the VMware software-defined data center. The course is based on VMware ESXi 5.5 and VMware vCenter Server 5.5.

By the end of the course, you should be able to meet the following objectives:

  • Define the key components of a software-defined data center

  • Identify benefits of software-defined storage solutions

  • Compare and contrast disk types and storage technologies

  • Explain file, block, and object-oriented storage

  • Identify Virtual SAN requirements, use cases, and architecture components

  • Plan and design a Virtual SAN deployment

  • Configure Virtual SAN clusters

  • Identify benefits of storage policy-based management

  • Scale a Virtual SAN deployment based on storage needs

  • Monitor Virtual SAN

  • Troubleshoot Virtual SAN

  • Identify the integration of Virtual SAN with the VMware product portfolio


VMware Virtual SAN: Deploy and Manage Training Course

Friday, January 10, 2014

Cisco UCS Boot from SAN Troubleshooting

So, first let me define some terms: the Cisco VIC is also called the Palo card/adapter. The current generation of Palo adapters is the VIC-1240 and VIC-1280. The VIC-1240 is a built-in option on the M3 blades.

The VIC itself will show you whether zoning and LUN masking are correct; however, the LUNLIST command is a very powerful debugging tool and one that should always be used when troubleshooting a boot from SAN problem. There are prerequisites that must be met for the VIC to show success during POST, and when POST does not show you what you expect, you don't always know where to start. Is the problem in the service profile, the Fabric Interconnect, the upstream FC switch, or the array itself? It's this kind of ambiguity that makes SAN engineers frustrated with boot from SAN, and why many people tend to stay away from non-persistent storage.

Have no fear, though: debugging boot from SAN isn't that hard or complex if you know the general basics of how HBAs work, switch zoning, and LUN masking.

Cisco UCS certainly makes boot from SAN tremendously easier, but I believe there is always room for improvement, and this area is no different. So with UCS 2.0, Cisco introduced LUNLIST.

LUNLIST only works prior to the OS HBA driver loading. Once the driver loads, the VIC BIOS is no longer in control and will not return valid data.

To get to the command, you need to gain access to the UCS CLI and run the following:


  • connect: connects to the VIC's management processor

  • adapter: the chassis/blade/interface card (example: connect adapter 1/1/1 will connect you to chassis 1, blade 1, adapter 1)

  • attach-fls: attaches to the fabric login service of the adapter

  • vnic: displays the vNIC adapters for the Palo

  • lunlist <vnic-id>: displays the LUN list for the chosen vNIC
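Putting the commands above together, a typical session looks something like the sketch below. The chassis/blade/adapter numbers and the vNIC id are examples only; take the real vNIC id from the vnic output:

```
# connect adapter 1/1/1
adapter 1/1/1 # attach-fls
adapter 1/1/1 (fls) # vnic
adapter 1/1/1 (fls) # lunlist 13
```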


Once you run lunlist, you see output similar to the below. This one is from a server where the end-to-end configuration is correct and the server can boot from SAN:

Now let’s break it apart and describe what you are seeing:
[Screenshot: lunlist output from a correctly configured boot-from-SAN server]



  1. Incorrect LUN masking:

    Here is the LUNLIST output from a server that is having an issue with incorrect LUN masking. The host has not been allowed access to the assigned LUN. The same problem would likely result if the host is not set up in the array at all, or if it was created on the array but someone mistyped the host's WWPN. Zoning is correct because the Nameserver Query Response succeeds (line 11) and returns a WWPN target that matches the WWPN target in the boot policy (line 5). The HBA successfully logged into the fabric and was able to see that a LUN of ID 0x00 is visible (line 9). But when the LUN is queried for additional information, it fails with "access failure" (line 7).




  2. Incorrect Zoning:

    The host is not zoned correctly. It is either in a zone by itself or not zoned at all. This is an easier one to troubleshoot because the host cannot see a LUN, nor can it see any available WWPN targets. Look at lines 8 and 9 and notice that there is no response returned for either of these queries. Note that the PLOGI is unsuccessful (the fc_id in line 5 is 0x000000) because the host was unable to successfully establish a session with the target.


  3. Incorrect SAN Boot Target in the boot policy:

    You can clearly see that the WWPN configured in the boot policy (line 5) does not match the available target found on the fabric (line 10). In this situation, the PLOGI (line 5) is once again unsuccessful because a session cannot be established between the host and the target.
    [Screenshot: lunlist output with an incorrect boot target WWPN]


  4. Incorrect LUN ID in the boot policy

    You can see that the LUN ID configured in the boot policy for the server (line 7) does not match the LUN ID found on the fabric (line 9).
    [Screenshot: lunlist output with an incorrect LUN ID]


  5. This example displays a properly configured host that has access to multiple LUNs presented.



  6. When you run LUNLIST while the OS is up and running with the VIC driver loaded (which means LUNLIST won't work), you will see the following:
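The failure signatures in the cases above lend themselves to quick triage if you save a lunlist capture to a text file. Here's a minimal sketch; the matched strings are taken from the cases above, and the exact output format may vary by VIC firmware release, so treat them as assumptions:

```shell
# classify_lunlist FILE - rough triage of a saved lunlist capture.
# Assumes the capture is plain text; the matched strings come from the
# failure cases described above and may vary by VIC firmware release.
classify_lunlist() {
    if grep -q "access failure" "$1"; then
        # Case 1: target found and logged in, but the LUN query was denied
        echo "lun-masking"
    elif grep -q "fc_id: 0x000000" "$1"; then
        # Cases 2/3: PLOGI never completed - zoning or boot-policy WWPN
        echo "plogi-failed"
    else
        echo "unknown"
    fi
}
```

Paste the lunlist output into a file and run classify_lunlist /path/to/capture.txt.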

Monday, January 6, 2014

Performance issues with VMware Fusion and OSX Mavericks App Nap

I recently upgraded to OS X 10.9.1, and while most of the changes in OS X are welcome, I have been noticing some strange and inconsistent performance issues with VMware Fusion 6.

I knew the issues probably had something to do with Mavericks, as I had no issues with my Mac when using Mountain Lion. I dug a little deeper and looked at the newly refreshed Activity Monitor in OS X. Nothing exceptional in the CPU and Memory tabs: CPU was at a reasonable level before and after running a virtual machine in VMware Fusion, and I have plenty of headroom with 16GB of memory.

Nothing seemed to be pegging either metric. I looked at Disk and Network... again, nothing out of the ordinary.

Knowing that Energy was a new tab I hadn't seen before in Activity Monitor, I selected it and noticed the "App Nap" column was listed and indicated "Yes" in the row for VMware Fusion.

App Nap is the new feature in Mavericks that, thankfully, allows OS X to put to sleep applications consuming excessive amounts of your limited battery power. A great feature for laptop users, but I'm using a desktop plugged into AC, so it's not as big of an issue at my desk.

[Screenshot: Activity Monitor, Energy tab, applications in the last 8 hours]

A quick scan of Apple's support site showed me how to disable this feature on an app-by-app basis. Here's what I did:

1.  Open a Finder window and navigate to your Applications Folder

2.  Locate VMware Fusion, right click and select “Get Info”

3.  In the "General:" section of the dialog box you will see a checkbox for "Prevent App Nap". Make sure this box is checked.
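As an alternative to the Get Info checkbox, the same per-app setting can reportedly be flipped from Terminal using the NSAppSleepDisabled defaults key. The bundle identifier below is an assumption; verify it against Fusion's Info.plist before relying on it:

```
# bundle id is a guess - confirm before use
defaults write com.vmware.fusion NSAppSleepDisabled -bool YES
```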

[Screenshot: VMware Fusion Get Info dialog]

I quit and restarted VMware Fusion, launched my Windows virtual machines, and everything returned to normal. No more performance loss or stuttering.

Hope this helps!