Major lessons learned in UCS Failure and Recovery
While at a customer's site we were doing a V&V test on a fully completed UCS chassis. The client indicated that he wanted to do a "hard-fail" of the primary fabric interconnect, so we pulled the power cord. We then left the interconnect off overnight. The next morning we powered on the now-subordinate interconnect. After its long boot, the servers went into "Discovery" mode.
After 30 minutes the servers' secondary NICs still wouldn't come back online. The IOMs on the A side were showing Critical Errors and all the links were administratively down. In the Faults area, the error said that the discovery policy didn't match what the IOMs were seeing, which wasn't the case.
After a lot of painful troubleshooting I came to find out that after the fabric interconnect came back online, I had acknowledged the chassis too quickly. A colleague pointed the following out to me:
"Perfect understanding of the chassis discovery policy and number of links between IOM & FI. I would add just 2c here: care must be taken about 'when' you hit the 'acknowledge chassis' button. When you 'acknowledge chassis', what you are telling UCSM is 'I acknowledge the current connectivity of the chassis'. Every 'Fabric Port' of the IOM (the one connected to the FI) has two states: 'Discovery' and 'Acknowledgement'. You can see these under 'Status Details' of a given fabric port (under 'Equipment' -> 'Chassis' -> 'IO Module' in the GUI). Discovery is an operational state: it can be 'absent' or 'present'. Ack tells whether the link is used by the system or not.
When the admin hits 'acknowledge chassis', UCSM takes a snapshot of the Discovery state: if a link is 'Present', it is marked as 'Acknowledged' (and if not present, un-acked), and all the acked ports are used to pass data.
So, before hitting ‘acknowledge chassis’, it is advisable to make sure that the links are all in ‘present’ state.”
It turns out you need to wait a few minutes after everything comes back online before doing the acknowledgement.
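For those who prefer the CLI, the check-then-acknowledge sequence can be sketched in the UCSM CLI as well. This is a rough sketch only; chassis ID 1 is an example, and the exact command names should be verified against your UCSM version (the fabric-port state check itself is the GUI one described in the quote above):

```
UCS-A# show chassis detail          # confirm overall chassis state first
UCS-A# acknowledge chassis 1        # only once every fabric port shows 'Present'
UCS-A# commit-buffer                # UCSM CLI changes take effect on commit
```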
Unfortunately I tried several methods of fixing this, including acknowledging the chassis after all four "Fabric Ports" came up. The only way I could fix it was to decommission and recommission the chassis.
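For reference, the decommission/recommission cycle can also be driven from the UCSM CLI. Again a sketch only; the recommission syntax (vendor, model, serial) and the placeholders here are assumptions to verify against your UCSM version:

```
UCS-A# decommission chassis 1
UCS-A# commit-buffer
# wait for the chassis to appear as decommissioned, then bring it back
# by identity (vendor/model/serial as reported by UCSM):
UCS-A# recommission chassis "<vendor>" "<model>" <serial>
UCS-A# commit-buffer
```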
So after recommissioning the chassis, the servers needed to go through rediscovery. Again, after 30 minutes or so, the servers got a critical error that the discovery process was failing. Watching the FSM, all the servers were stuck at the same point: "configure primary fabric interconnect in <svr#> for pre-boot environment (FSM-STAGE:sam:dme:ComputeBladeDiscover:SwConfigPnuOSLocal)".
This was at 30% of the discovery process, and it kept retrying until it failed out. Looking at the KVM of a server, it was sitting in the BIOS looking for something to boot from. Knowing that for discovery the blade has to boot into the UCS PXE image, I knew there was an issue.
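The FSM stage above can also be watched from the UCSM CLI rather than the GUI. A sketch, assuming server 1/1 (chassis 1, slot 1); verify the exact command against your UCSM version:

```
UCS-A# scope server 1/1
UCS-A /chassis/server # show fsm status   # current stage, progress %, retry count
```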
I attempted to “Recover corrupt BIOS”, “Reset CMOS”, “Reset CIMC” & “Re-acknowledge”, i even attempted pulling the blades and re-seating them, nothing work. The servers were unusable.
I decommissioned the servers. Upon clicking on the servers again, a popup appeared stating that the inserted servers were not the same as the configured ones (which were none) and asking whether I wanted to accept the new servers. After accepting, a discovery was launched. Luckily, this time the discovery proceeded correctly and the service profiles began to load onto the blades.
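If you end up in the same spot, the server decommission that finally worked for me can likewise be sketched in the UCSM CLI (server 1/1 is an example; verify against your UCSM version):

```
UCS-A# decommission server 1/1
UCS-A# commit-buffer
# when the blade is re-accepted, UCSM kicks off a fresh discovery
```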
So as much as it seems scary and drastic to do a decommission and recommission, it is sometimes necessary and does seem to do a true reset of the components' configuration. However, I say this with caution: this process should be the LAST resort, as it involves a major outage. Although, if you're wondering whether you need to do this, you most likely already have downtime.