iSCSI Boot with ESXi 5.0 & UCS Blades

UPDATE: The issue was the NIC/HBA placement policy.  The customer had set a policy to have the HBAs first, then the iSCSI overlay NIC, then the remaining NICs.  When we moved the iSCSI NIC to the bottom of the list, the ESXi 5.0 installer worked just fine.  I'm not 100% sure why this fix works, but either way it works.

So at a recent customer's site I was trying to configure iSCSI booting of ESXi 5.0 on a UCS blade, a B230 M2.  To make a long story short, it doesn't fully work and isn't officially supported by Cisco.  In fact, NO blade models are supported for ESXi 5.0 iSCSI boot by Cisco.  They claim a fix is on the way, and I will post an update when there is one.

Here is the exact issue, and my original thoughts, in case it helps anybody:

We got an error installing ESXi 5 to a NetApp LUN: "Expecting 2 bootbanks, found 0" at 90% of the install.  The blade is a B230 M2.

The LUN is seen in BIOS as well as by the ESXi 5 installer.  I even verified the “Details” option, and all the information is correct.

Doing an Alt-F12 during the install and watching the logs more closely, at ~90% the installer appears to unload a module that, judging by its name, is some sort of VMware Tools-type package.  As SOON as it does that, the installer claims that there is no IP address on the iSCSI NIC and begins to look for DHCP.  The issue is that during the configuration of the service profile and the iSCSI NIC, at no time did we choose DHCP; we chose static.  (We even tried Pooled.)  Since there is no DHCP server in that subnet, it doesn't pick up an address and thus loses connectivity to the LUN.

So we rebooted the blade after the error, and ESXi 5 actually loads with no errors.  The odd thing is that the root password that was specified isn't set; it's blank like ESXi 4.x was.

So an interesting question is what's happening during that last 10% of the installation of ESXi 5?  Since it boots cleanly, it almost seems like it does a sort of "sysprep" of the OS, i.e. all the configuration details.  If that's the only issue then it might technically be OK.  However, I don't get the "warm and fuzzies."  My concern would be that, maybe not today but down the road, some module that wasn't loaded correctly will come back to bite the client.

Also, what is happening in that last 10% that's different than ESXi 4.x?  We were able to load 4.1 just fine with no errors.

Again we called Cisco TAC and were told that ESXi 5 iSCSI booting wasn't supported on any blade.  They do support 4.1, as well as Windows and a variety of Linux distros.

Configuring iSCSI boot on a FlexPod

Here is a nice document to follow to configure iSCSI booting for a FlexPod, i.e. UCS blades, a NetApp array, and ESXi.

UPDATE: This document has the fix I found for ESXi 5.0.  This was tested on B230 M2s and seems to work every time.

This document will be updated as I get new information.

FlexPod iSCSI Boot-Fixed

Major Lessons in UCS Failure & Recovery

While at a customer's site we were doing a V&V test on a fully completed UCS chassis.  The client indicated that he wanted to do a "hard fail" of the primary interconnect, so we pulled the power cord.  We then left the interconnect off overnight.  The next morning we powered on the now-subordinate interconnect.  After its long boot, the servers went into a "Discovery" mode.

Issue #1

After 30 minutes the servers' secondary NICs wouldn't come back online.  The IOMs for the A side were showing Critical errors and all the links were administratively down.  In the Faults area the error was that the discovery policy didn't match what the IOMs were seeing, which wasn't the case.

After a lot of painful troubleshooting I came to find out that after the Fabric Interconnect came back online, I had acknowledged the chassis too quickly.  A colleague pointed out the following:

"Perfect understanding of the chassis discovery policy and number of links between IOM & FI. I would add just 2c here — care must be taken about "when" you hit the "acknowledge chassis" button. When you "acknowledge chassis", what you are telling UCSM is that "I acknowledge current connectivity of the chassis". Every "Fabric Port" of the IOM (the one connected to the FI) has two states: 'Discovery' and 'Acknowledgement'. You can see that under "Status Details" of a given fabric port (under 'Equipment' –> 'Chassis' –> 'IO Module' in the GUI). Discovery is an operational state – it can be 'absent' or 'present'. Ack tells whether the link is used by the system or not.

When admin hits “acknowledge chassis”, UCSM takes the snapshot of Discovery state – and if link is ‘Present’, then it is marked as ‘Acknowledge’ (and if not present, then un-ack) — and all the ack’ed ports are used to pass data.

So, before hitting ‘acknowledge chassis’, it is advisable to make sure that the links are all in ‘present’ state.”

It turns out you need to wait a few minutes after everything comes back online before doing the acknowledgement.


Fix #1

Unfortunately I tried several methods of fixing this, including trying to acknowledge the chassis after all 4 "Fabric Ports" came up.  The only way I could fix it was to decommission and recommission the chassis.


Issue #2

So now, after recommissioning the chassis, the servers needed to go through a rediscovery.  Again, after 30 minutes or so, the servers were getting a critical error that the discovery process was failing.  Watching the FSM, all the servers were stuck at the same point: "configure primary fabric interconnect in <svr#> for pre-boot environment (FSM-STAGE:sam:dme:ComputeBladeDiscover:SwConfigPnuOSLocal)".


This was at 30% of the discovery process, and it kept doing retries until it failed out.  Looking at the KVM of the server, it was sitting in the BIOS looking for something to boot off of.  Knowing that for the discovery process it has to boot into the UCS PXE image, I knew there was an issue.


I attempted "Recover corrupt BIOS", "Reset CMOS", "Reset CIMC" & "Re-acknowledge"; I even attempted pulling the blades and re-seating them.  Nothing worked.  The servers were unusable.


Fix #2

I decommissioned the servers.  Upon clicking on the servers again, a popup appeared stating that the inserted servers were not the same as the configured ones (which were none) and asking if I wanted to accept the new servers.  After accepting this, a discovery was launched.  Luckily this time the discovery proceeded correctly and the service profiles began to load on the blades.


So as much as it seems really scary and drastic to do a decommission and recommission, it is sometimes necessary and does seem to do a true reset of the configuration of the components.  However, I say this with caution.  This process should be the LAST resort as it does involve a major outage, although if you're wondering whether you need to do this, you most likely already have downtime.



Direct Connected Fiber Storage to UCS

So I've come across this recently.  I have a client that is direct-connecting the fiber from their NetApp array to the 6120s of the UCS.

The issue that has been raised is that this is not technically supported.  It seems Cisco announced with the 1.4.1 firmware release that you can absolutely do this.  However, there is a caveat: it's supported by Cisco as long as the storage vendor will support it.

The biggest problem is that NetApp did support it, but they don't any longer.  So it seems Cisco was left holding the bag when NetApp walked away.

So if you're running a NetApp array that is direct-connected to the UCS without an MDS or even a 5548 with the FC module, it's no longer technically supported and you very well may run into issues if you need vendor support.

For those not familiar with direct-connecting the storage, I'll give a little bit of information on it, as well as some of my experiences with it and some tips on making it "work" with UCS.

So inside the 6120 there is effectively a very, very dumb MDS switch.  There is no zoning; it is all one big zone.  You do have vSANs, but obviously no inter-vSAN routing, no security, and no real way of even getting any initiator/target information for troubleshooting purposes.

In order to even use the functionality, you must change the Fiber portion of the switch from “End-Host Mode” to “Switch Mode”.  This is EXTREMELY similar in method and functionality to switching the Network side to “Switch Mode”.

You MUST also make sure to select the default vSAN that is created upon initial set-up and enable "Default Zoning".

Interesting note: you MUST absolutely make sure the HBA name in the Boot Policy is the EXACT same as the HBA name in the HBA template, or it won't boot.
So again, in my opinion, if you can avoid direct-connecting your SAN storage to the 6120, please avoid it, at least until UCS 2.0 comes out  🙂

Enabling Jumbo Frames in a Flexpod Environment

Update: I have fixed the 5548 section; I was missing the last two lines.

This post will help you enable jumbo frames in your FlexPod environment. It will also work for just about any UCS-based environment; however, you will have to check on how to enable jumbo frames for your particular storage array.

This post assumes a few things:

-The environment is running 5548 Nexus switches
-The user needs to set up jumbo frames on the NetApp for NFS/CIFS shares
-The NetApp has VIF or MMVIF connections for said NFS/CIFS connections

Cisco UCS Configuration 

-Log in to UCSM, and click on the LAN tab.
-Expand LANs & LAN Cloud.
-Click on QoS System Class, and change the "Best-Effort" MTU to 9216.

NOTE: You need to just type in the number; it's not one of the options that can be selected in the drop-down.

-Expand the Policies section on the LAN tab.  Right-click on QoS Policies and click "Create new QoS Policy".  Call it "Jumbo-Frames" or something similar.
-On the vNIC template, or the actual vNIC on the Service Profile, set the "QoS Policy" to the new policy.

 ESX/ESXi Configuration

-Either SSH or console into the ESX host.  If you're using ESXi, you'll need to ensure local or remote Tech Support Mode is enabled.
-We need to set the vSwitch that the Jumbo-Framed NICs will be on to allow Jumbo-Frames.
          Type esxcfg-vswitch -l   to find the vSwitch we need to modify.
          Type esxcfg-vswitch -m 9000 vSwitch# (replace # with the actual number)
          Type esxcfg-vswitch -l   again; you should now see the MTU set to 9000

-We now need to set the actual VMKernel NICs.

          Type esxcfg-vmknic -l  to find the vmks that we need to modify
          Type esxcfg-vmknic -m 9000 <portgroup name> (this is the portgroup that the vmk is part of)
          Type esxcfg-vmknic -l   to verify that the MTU is now 9000

Note: If you're using dvSwitches, you can set the MTU size through the VI Client.
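On ESXi 5.0 you can also use the newer esxcli equivalents of the esxcfg commands above.  This is just a sketch; the vSwitch and vmk names here are examples, so substitute your own:

```
# Set the MTU on a standard vSwitch (vSwitch1 is an example name)
esxcli network vswitch standard set --mtu 9000 --vswitch-name vSwitch1

# Set the MTU on the VMkernel interface (vmk1 is an example name)
esxcli network ip interface set --mtu 9000 --interface-name vmk1

# Verify both
esxcli network vswitch standard list
esxcli network ip interface list
```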

5548 Configuration 

-Log in to the 5548 switch on the "A" side.
-Type the following:

system jumbomtu 9216
policy-map type network-qos jumbo
  class type network-qos class-default
    mtu 9216
system qos
  service-policy type network-qos jumbo
copy run start

-Repeat on the “B” Side 
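To confirm the policy took effect, you can check the hardware MTU that the switch applied.  Exact output varies by NX-OS release; the interface number below is just an example:

```
show policy-map system type network-qos
show queuing interface ethernet 1/1
```

The queuing output should report an MTU of 9216 on the interface once the network-qos policy is active.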

NetApp Configuration 

-Login to the Filer.
-Type ifconfig -a  to verify which ports we need to run jumbo frames on.
-Type ifconfig <VIF_NAME> mtusize 9000

NOTE: You need to make sure you enable jumbo frames not only on the VLAN'd VIF but also on the "root" VIF.
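Also note that an MTU set from the filer command line doesn't survive a reboot; in 7-Mode it needs to be in /etc/rc as well.  A sketch, assuming hypothetical example interfaces vif0 (the root VIF) and vif0-100 (its VLAN interface):

```
# On the filer: add to /etc/rc so the MTU persists across reboots
ifconfig vif0 mtusize 9000
ifconfig vif0-100 mtusize 9000

# From an ESXi host: verify jumbo frames end-to-end with a don't-fragment ping
# (8972 bytes of payload + 28 bytes of ICMP/IP headers = 9000)
vmkping -d -s 8972 <filer-ip>
```

If the vmkping succeeds without fragmenting, jumbo frames are working along the entire path.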

Good questions asked during UCS Design Workshop

So I've recently started working for a large technology company on their Datacenter Services team in their Professional Services org.  It's been quite an experience so far, and I'm doing my first solo Cisco UCS Design Workshop coupled with an installation as well as some basic teaching.

I was asked some good questions and figured that others may be asked the same things, or may just have the questions themselves.  I figured I could share and maybe help somebody else.  I will try to keep this page updated with some of the more interesting questions that aren't easily found in Cisco's documentation.

Q1. According to Cisco's documents, when you're using VM-FEX or Pass-Through Switching there is a limit of 54 VMs per server when those hosts have 2 HBAs.  What is the real reason for the low limit?  With today's high-powered servers, 54 VMs isn't an unreachable goal.

A1. The 54-VM limit is based on VN-Tag address space limitations in the UCS 6100 ASICs.  Future UCS hardware will support more.  PTS may not be the right fit for high-density virtual deployments, especially VDI.  Here is a link to a great blog on it.

Q2. What is the minimum number of power supplies needed for a UCS Chassis?

A2. The answer is 2, even for a fully populated chassis.  In this case you are running in a non-redundant mode.  If one of the power supplies fails, the UCS system will continue to power the fans and the IO Modules.  It will, however, begin to power off the blades in reverse numerical order until it reaches a supported power load.

Q3. Can you change the number of uplinks from the IO Modules to the Fabric Interconnects once the system is online?

A3. Changing the number of cables from FI to the chassis requires a re-pinning of the server links to the new number of uplinks.  The pinning is based on a hard-coded static mapping based on number of links used.  This re-pinning is temporarily disruptive to the A fabric then the B fabric path on the chassis.  NIC-teaming / SAN multi-pathing will handle failover/failback if in place.

Q4. If the uplinks from the Fabric Interconnects are connected to Nexus switches and we don't use vPC on them, do we lose the full bandwidth because the switches are in an active/passive mode?  Can you get the full bandwidth using VMware and no vPC?

A4. Even without vPC, the UCS Fabric Interconnects will utilize the bandwidth of all of the uplinks; there is no active/passive.  However, I would still recommend configuring VMware for active/active use, but ensure you are using MAC- or virtual-port-based pinning rather than src/dest IP hash.

Q5. So is there any advantage to doing vPC other than the simplified management?

A5. Two: faster failover, and the potential for a single server to utilize more than 10 Gbps thanks to port-channel load balancing.

SSD drives don’t secure erase

So if you're in an industry that requires its drives to be securely erased, or even if you're just a security-minded person, I came across a very interesting study.

In essence it says that because there are some brains on the actual SSD itself, there is no way to be sure you've erased the disk.  On a normal HDD, the erase program just writes a ton of 1s and 0s to the disk.  The problem is that when the erasing program writes to what it thinks is block X, the SSD might actually write to block Y.  This is because of the way SSDs try to spread out the data so that one particular area of the memory chip isn't overutilized.

This is quite an interesting article.

Initial impressions of Ubuntu 10.10

Alright, so even though I work in IT, I actually like Windows OSes for a few reasons.  Granted, I'm referring more to Windows 7, Server 2008, and maybe XP.

However, I'm pretty good with Linux, and I use a 2010 MacBook Pro for work, so suffice it to say I've worked with the major players.  I recently decided to try my hand at doing a ChromeOS build, but to do so they recommend Ubuntu.  I've used Ubuntu in the past, but not recently; I've been using Fedora or CentOS, since I'm comfortable with RHEL.

I have to say I'm really, really impressed with the new Ubuntu.  I think it could actually make for a very nice full-time OS, even for the average user.  Now granted, I wouldn't give it to the extreme novice, but for your average user, and especially a power user, it works quite well.  They have made it intuitive and easy to use.  It is "pretty," which is still necessary, as most people don't want something that looks like it's from the Windows 3.1 era.

I think I'm going to keep it as a VM for a while, and I'm going to try to do as much work in it as possible.  Here's to nothing…

Boot from USB drive in VMware Workstation

It's quite annoying that you can't boot from a USB drive in VMware Workstation.  So here's a simple workaround.

1. Download the PloP boot manager.

2. Extract the .ZIP.

3. Attach the .ISO from the extracted .ZIP to the VM that you want to boot from USB.

4. Make sure the USB stick is inserted into your PC and attached to the VM.

5. When the PloP boot manager comes up, select "USB".

Enjoy booting whatever you have on your USB drive.

VMware Workstation USB Issues

So I'm using VMware Workstation 7.1 to run Ubuntu under my standard Windows 7 desktop.  I'm trying to build ChromeOS to play around with, and I can't get my USB stick to be seen by my VM.

As it turns out, the VMware USB Arbitration Service on my host isn't started, and in fact won't start.  It turns out there is some USB filter driver causing the issue.  I'm willing to bet it's part of the driver for my new USB 3.0 motherboard.

Anyway here is the simple fix.

Shut down Workstation.

Open the registry (Start > Run > regedit).

Browse to HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\hcmon.

Create a new key called Parameters.

In Parameters, create a new DWORD value entry named DisableDriverCheck, and then set the value to 1.
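If you prefer the command line, the same key and value can be created from an elevated command prompt with reg.exe (this is just a sketch of the steps above, against the same hcmon path):

```
reg add "HKLM\SYSTEM\CurrentControlSet\Services\hcmon\Parameters" /v DisableDriverCheck /t REG_DWORD /d 1 /f
```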

This works great, and I can now pass USB through to my VM.