AutoDeploy and VXLAN

UPDATE!!: Another awesome engineer that I work with has found a solution to this.

“as long as you let VSM create the vmknic the first time, and then preserve that MAC address (in the answer file of the Host-Profile), you’re good.  (Assuming you’ve added the vxlan vib to your image)”  He also mentions “I think one of the keys is to make sure VSM is showing the correct IP in Datacenters->Network Virtualization->Preparation->Connectivity BEFORE attempting host reboots.”

Thank you Jason

Original Post:

The same technician who found the UCS and PXE bug has found another one, this time relating to VXLAN and Auto-Deploy. (Thanks Zach & Eric)

First, a little background.

The customer is doing a full-scale vCloud Enterprise Suite deployment.  They wanted to utilize VXLAN and fully stateless hosts.

So normally when the cluster gets prepared, vShield Manager (now called vCloud Networking and Security; I’m sure the name will change tomorrow) creates a vmknic on each host that is used for VXLAN transport.

Now we have Auto-Deploy, which muddies everything up. The process that should occur is this:

First, boot a new host and configure it as needed.  Then prepare the cluster through vSM/VCNS, which adds a vmknic to the host for the VXLAN transport.  Then create a host profile from that host.  As you add more hosts, update the answer file for each of them.  Then reboot and all is happy…

Well, here is the rub: upon reboot, the vSM/VCNS prep happens before the host profile is applied.  So when the host profile gets applied, the just-created vmknic is removed and re-added.  Sadly, there is no way to just add the IP address to this vmknic through the host profile; it’s an all-or-nothing affair.  What sucks is that during host profile remediation the vmknic isn’t modified in place, it’s actually deleted and re-created with the appropriate settings.  I’m sure this is to simplify code.  Anyway, the new vmknic has the correct settings, but vSM/VCNS doesn’t know about it because some identifier has changed… doh!!
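To make the ordering problem concrete, here is a toy model in Python.  It is only a sketch and calls no VMware API.  It assumes the identifier that vSM/VCNS tracks is the vmknic's MAC address, which is what the update at the top of this post suggests but which I have not confirmed.  All the names (`Vmknic`, `vsm_prepare`, `apply_host_profile`, `pinned_mac`, the IP and the MAC values) are made up for illustration.

```python
import itertools
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for ESXi handing out a fresh MAC to every new vmknic.
_mac_counter = itertools.count(1)


def new_mac() -> str:
    return f"00:50:56:00:00:{next(_mac_counter):02x}"


@dataclass
class Vmknic:
    name: str
    ip: str
    mac: str


def vsm_prepare(ip: str) -> Vmknic:
    """Cluster prep: vSM/VCNS creates the VXLAN vmknic and the host picks a MAC."""
    return Vmknic("vmk1", ip, new_mac())


def apply_host_profile(existing: Vmknic, ip: str,
                       pinned_mac: Optional[str] = None) -> Vmknic:
    """Remediation: the existing vmknic is thrown away and a new one is built.

    If the answer file pins a MAC (assumption), the new vmknic reuses it;
    otherwise it gets a fresh one.
    """
    return Vmknic(existing.name, ip, pinned_mac or new_mac())


def vsm_recognises(tracked_mac: str, nic: Vmknic) -> bool:
    """Assumed check: vSM/VCNS matches the vmknic by the MAC it saw at prep time."""
    return nic.mac == tracked_mac


first = vsm_prepare("10.0.0.11")
tracked = first.mac

# Reboot without a pinned MAC: the re-created vmknic looks like a stranger.
broken = apply_host_profile(first, "10.0.0.11")
# Reboot with the MAC preserved in the answer file: still recognised.
fixed = apply_host_profile(first, "10.0.0.11", pinned_mac=tracked)

print(vsm_recognises(tracked, broken))  # False
print(vsm_recognises(tracked, fixed))   # True
```

The IP address is identical in both cases; the only thing that differs is whether the re-created vmknic carries the MAC that was there when the cluster was prepared.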

There is currently no fix for this order-of-operations issue.  This will be fixed when VCNS gets updated to 5.5, though.  So for now it’s VXLAN or Auto-Deploy.

 

UCS Firmware bug found, affects PXE Boot

UPDATE:

On the system that discovered the initial bug updating to 2.1(1f) did

 

 

I am working with one of my fellow engineers who is doing a whole vCloud Suite/SRM/kitchen sink deployment.  Anyway, he found a bug in UCS firmware version 2.1(1e).

When he tried to Auto-Deploy his ESXi hosts, they would not get DHCP at all.

We looked at the DHCP server: it was on the same VLAN as the hosts, it was configured correctly, and the scope was actually on.  We looked at the UCS settings: the correct mgmt VLAN was set to “Native” so that the NICs could actually see the DHCP reply.

What we noticed was that, for some reason, the VIFs were not coming up on the UCS system.  Now, anybody who has used UCS for any period of time is used to seeing VIF errors, especially when setting up blades and installing OSes.  You typically get these errors when the FI is trying to set up the network before the blade is actually online; they are usually transient and go away quickly.

Anyway, these weren’t the typical errors: the VIFs were truly down and would not come up until the ESXi installer was run off of a mounted ISO.  They would not come up while the NICs were looking for DHCP… odd.

They were running the exact same hardware as I had in 4 different FlexPods: 2248s, B200 M3s, and a very similarly set up VMware environment.  The ONLY difference was that the firmware they loaded was the latest at the time, 2.1(1e).  I was still running 2.1(1a).

The engineer downgraded to 2.1(1d) and everything immediately came online and worked perfectly.  Problem solved…well sort of.   I have confirmed on a new install of mine that there is an issue.
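If you want to flag suspect firmware before attempting a PXE boot, something like the Python sketch below would do.  It is only a sketch: the field names (`major`, `minor`, `maintenance`, `patch`) are my own labels for the parts of a version string like 2.1(1e), and the "known bad" list contains only the one release we actually saw fail.  It says nothing about releases we have not tested.

```python
import re
from typing import NamedTuple


class UcsVersion(NamedTuple):
    major: int
    minor: int
    maintenance: int
    patch: str


# Matches strings shaped like "2.1(1e)".
_PATTERN = re.compile(r"^(\d+)\.(\d+)\((\d+)([a-z])\)$")


def parse_ucs_version(text: str) -> UcsVersion:
    match = _PATTERN.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognised UCS version string: {text!r}")
    major, minor, maintenance, patch = match.groups()
    return UcsVersion(int(major), int(minor), int(maintenance), patch)


# Only the release observed failing in this post.
KNOWN_BAD = {parse_ucs_version("2.1(1e)")}


def pxe_dhcp_suspect(text: str) -> bool:
    """True if this firmware is the one we saw drop DHCP during PXE boot."""
    return parse_ucs_version(text) in KNOWN_BAD


for version in ("2.1(1a)", "2.1(1d)", "2.1(1e)"):
    print(version, pxe_dhcp_suspect(version))
```

Because `UcsVersion` is a tuple, versions also compare in order, so `parse_ucs_version("2.1(1d)") < parse_ucs_version("2.1(1e)")` works if you later want a range check instead of an exact list.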

I am in the process of building a new system at 2.1(1f) to see if the problem has been fixed.  I will update ASAP.