Major Releases from Cisco today!!

Cisco has announced some major products and updates this morning around their UCS product line.  These announcements are not just hardware and software; they also show how Cisco sees the changing data center landscape and where it needs to compete.  As a note, I will be updating this as I get more information, as well as deep diving on each topic shortly.  I suggest you follow me on twitter @ck_nic or subscribe to the blog to see when I’ve updated this or added deep dives.

Cisco has realized that the single monolithic data center is not always the norm today.  Companies are moving towards more remote-site or multi-data center environments.  In my opinion this is something that has been happening for years, but has only recently accelerated.  With new technologies such as VXLAN and OTV, and huge advances in the virtualization world, there is no need to build a single giant data center.  Personally, I’m seeing companies building multiple smaller and more efficient data centers, or utilizing space in remote offices or shared space.  The push for the “Internet of Things” is only going to accelerate that, as computing power will need to be closer to the edge.  Cisco has also recognized the “app-centric” data center and cloud model that most vendors are moving towards, especially around SDN and automation.  Cisco has announced several new items that speak to this.

UCS Mini

First, Cisco has announced the UCS Mini.  This is a UCS blade chassis with the Fabric Interconnects in the back of the chassis instead of top of rack.  Cisco is positioning this as “Edge-Scale Computing”.  They see the UCS Mini being deployed in remote offices or smaller data centers where the expected growth is small and the power and cooling requirements need to be smaller than the current UCS line.  For WAY more information I suggest you read my earlier post relating to the UCS Mini.  I have updated it with some new information gained in the last week or so.

UCS Director Updates

Secondly, Cisco has updated its UCS Director software to be more useful to more people.  UCS Director will now allow an administrator to automate and monitor not only UCS equipment but Nexus products as well.  UCS Director will also be able to push out ACI configurations to the new Nexus 9k product line.  UCS Director has also introduced what it calls “Application Containers”.  These allow configuration to be done from the “application level”, meaning you will be able to create networking and compute resources for a given application.  Cisco states that this is a very good way to simplify private-cloud deployment.  Finally, UCS Director now has Hadoop integration: there is a very easy way to deploy and monitor Hadoop clusters on UCS hardware.  This is something I’d like to see more of, personally.

UCS M-Series Servers

Cisco is announcing a new line of servers today that are very different from just about any other server on the market.  Cisco’s M-Series servers are modular servers that can pack 16 individual servers into a single 2U chassis.  This is accomplished by creating “Compute Cartridges” that consist of CPU & memory only.  Each cartridge contains two separate servers with a single Intel Xeon E3 processor and four DIMM slots.  All of the cartridges share 4 SSDs that serve iSCSI boot LUNs to each compute node, as well as all power supplies, fans, and outbound network & SAN connections.  These servers support the new VIC 1300 mentioned below, which means they can be uplinked to a UCS Fabric Interconnect as well.  Now, these servers are NOT designed to run your typical virtualization or bare-metal OSes.  These are designed more for a lightweight OS, such as Linux.  Cisco sees these being deployed in large numbers for uses like Big Data and other “scale-out” applications, online gaming, and ecommerce.

There has been a lot of talk comparing these to both HP’s Moonshot servers and to the offerings of Nutanix.  These are a bit different from both.  Nutanix is a “hyper-converged” platform: it uses its own filesystem and does a lot of neat tricks to distribute things across the nodes, so the compute nodes become part of the virtual environment more than normal servers do.  The M-Series is “disaggregated”: it uses what Cisco calls its “System Link” technology to separate the components, making them more modular.  HP’s Moonshot is somewhat similar to the M-Series in that it uses “server cartridges”; however, those are mostly Atom-based processors and still carry some other hardware in the cartridges.  Cisco’s are all full Intel Xeon x86 processors.

UCS C3000 Series Servers

Cisco is not only releasing a compute-heavy server but also a storage-heavy one.  Cisco has announced the C3160 Rack Storage Server.  It is a 4U server that is capable of holding up to 360TB of storage.  It is a single server just like any other; it has two processor sockets and an LSI 12Gb SAS controller connected to the disks.  Cisco is targeting this server at Big Data or web applications that need a very large, fast central storage repository.  Cisco has provided some examples where it uses both the new M-Series and the new C3160 together in various designs.  It has mentioned both Big Data and gaming services where the compute is distributed across an array of M-Series with all of the backend storage being hosted on the C3160s.

New M4 Servers & VIC released

Cisco has announced the newest line of its blade and rackmount servers: the B200 M4, C220 M4 & C240 M4.  These servers take advantage of the latest Intel processors as well as DDR4 RAM, with up to 1.5TB of RAM per server.  Cisco is not introducing any of the configuration constraints that some other vendors have been adding.  Cisco has said it will support the new 18-core Intel processor when released.  This means you could get 36 full cores, 72 if you count hyper-threading, in each blade!!!  Cisco has also announced a new VIC 1300 to go along with the newer servers.  This VIC is natively 40Gb capable.  However, until the new FIs and IOMs are released the card will run at 4x 10Gb.  The PCIe-based version of the VIC has QSFP ports which will support both breakout cables as well as a special adapter that converts 40Gb to 10Gb.  It’s nice that these VICs are released to “future proof” hardware for when we see more 40Gb switches; however, I am a bit bummed we didn’t see a 40Gb FI.

Overall there have been a lot of things announced and a lot of information to digest.  I have seen some pictures of the new hardware and hope to get to play with it soon.  Expect some deep dives to be written about the hardware and my experience with it.

UCS Build Good Practices

So I have done more UCS builds than I can count, and between some of my colleagues and me, we’ve done hundreds of them by now. So we put together a “UCS Best Practice Guide”, and I am sharing it here because I feel this information should be made public. I have added a “Why?” to the line items since many customers want to know why things are done. Again, this isn’t the end-all be-all, just what we have found works for many customers and makes for a good starting point.

Overview

This guide is intended to be used on UCS deployments.  It details the best practices defined by Cisco as well as lessons learned in the field by Subject Matter Experts.  Following it will ensure not only proper deployment but also consistency.

This document is written following the basic installation order and processes.

This document is written as a guideline for when existing client practices aren’t in place, or when asked for recommendations.  Client requirements, wishes and needs should always supersede anything written here.


Physical Equipment

This section lists the best practices for the physical equipment and rack & stack.

Rack & Stack

  • The Fabric Interconnects should ideally be located at the top of the racks.

  • It is preferred to be in two separate racks to provide additional failure domains.
  • Cisco does approve of a “Center of Rack” design as well.
  • Airflow for both 61xx & 62xx are Front to Back.
    • Air enters through the fan trays and power supplies and exits by the ports.
  • Airflow for the Chassis is also Front to Back.
  • The FIs weigh between 35 & 50 pounds each so be sure to account for it.
  • The UCS Chassis can be up to 300 pounds.
  • Ensure all equipment is using the proper rails and/or brackets
  • Ensure equipment is secured using ALL of the screw holes!
  • Ensure what will become Fabric Interconnect A is on the top or left, and B is on the bottom or right.
    • Typically Left vs Right is looking at the front of the system.
      • Front contains the blades.
    • Exception: Please be sure to refer to customer requirements on numbering.
  • Chassis numbering should be bottom -> top & left -> right.

Cabling

  • All Server Ports should be cabled using the leftmost ports on the FI.
    • Why?
      • Since it’s typically server ports that get added, the next chassis can remain cabled in the correct numerical order.
  • All Uplink Ports should be cabled using the rightmost available ports on the FI.
    • Why?
      • Again, this leaves spare ports available for the server ports that are typically added later.
  • FEX/IOMs should be cabled from top to bottom.
  • Use the shortest possible cables, while ensuring no cables are too tight.
  • Keep cabling neatly dressed (the original post included an example photo); otherwise airflow becomes an issue.


UCSM Configuration

This section will provide the configuration best practices for all UCSM items.

Overall

  • Do not put any pools, policies or templates under the root organization; create a sub-org and place everything under that (see the sketch below).
    • Why?
      • This allows additional sub-orgs to be created later that won’t pull from these root-org items.
  • Delete all default pools that can be deleted.
    •  Why?
      • This keeps the error count down and ensures the correct pools/policies are being used.
  • Ensure all names are meaningful and descriptive.
  • Use the “Description” field when possible for more information.
  • All pools should be set to Sequential.
    •  Why?
      • The default setting uses an algorithm that is seemingly random with more than 32 objects.
  • Engineers must have patience while working with UCS; you WILL break it otherwise!!
    • Why?
      • UCSM can have a substantial delay between clicking and tasks reporting back; if you click too quickly, tasks can get interrupted or queued and cause major issues.
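
As a side note, most of this section can be scripted.  Here’s a minimal sketch assuming Cisco’s UCS PowerTool module (module and cmdlet names can vary by PowerTool version, so verify against yours; the IP, credential prompt and org name are just placeholders):

# Load PowerTool (module name differs in older releases)
Import-Module Cisco.UCSManager
# Connect to the UCSM virtual IP
$handle = Connect-Ucs -Name 192.0.2.10 -Credential (Get-Credential)
# Create a sub-org under root so nothing lives directly in org-root
Get-UcsOrg -Level root | Add-UcsOrg -Name "Prod" -Descr "All pools, policies and templates"
Disconnect-Ucs -Ucs $handle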


Equipment Tab

Policies

  • Chassis Discovery Policy should be set to the minimum number of IOM -> FI links (see the sketch below).
    • Why?
      • This will prevent any issues down the road relating to chassis being discovered.
    • You MUST acknowledge each chassis after all links are up (if using the above method).
  • Link Grouping Preference should be set to Port-Channel.
  • Power Policy should be set to Grid.
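
For reference, the discovery policy can be set from PowerTool too.  This is a hedged sketch (I’ve seen these cmdlets in FlexPod build scripts; double-check the parameter values against your PowerTool version):

# Minimum link count with Port-Channel link grouping
Get-UcsChassisDiscoveryPolicy | Set-UcsChassisDiscoveryPolicy -Action "2-link" -LinkAggregationPref "port-channel" -Force
# Still re-acknowledge each chassis in the GUI once all links are up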

Fabric Interconnects

  • Fabric Interconnects should be placed into End-Host mode whenever possible. (default)
  • When setting Unified Ports, only the minimum necessary number of ports should be turned into Fibre Channel ports.
    • Why?
      • This prevents wasting ports
      • Should be done right away as a reboot is necessary.
      • Keep in mind any future expansion the customer may discuss.
  • Primary Fabric Interconnect should be “A” and subordinate should be “B”.
  • Server Ports should be enabled one chassis at a time.
    • Why?
      • This will order the chassis in the correct physical order.
    • If mis-numbered they can be decommissioned and re-numbered.
  • Uplink ports can be enabled in bulk.
  • Set any unused ports to “Unconfigured”.
    • Why?
      • This is for both security and licensing purposes.

Chassis

  • Resolve any “Fabric Conn” problems before continuing configuration.
    • Why?
      • The remediation steps are disruptive to the environment.
    • Ensure at least 10 minutes have passed from the time the chassis appeared within UCSM.
    • If problem persists, perform a chassis “Acknowledge” and wait for it to clear.
    • Chassis should not be “acknowledged” until all cables have been connected.
      • Why?
        • If this is done earlier, there is a potential to have issues with the IOMs
  • Ensure that all blades and IOMs show no errors.

Firmware

  • Update to the latest recommended firmware.
    • Why?
      • This prevents issues during configuration
    • Firmware should be no less than 60 days old, unless dictated by the customer or TAC.
  • Engineers MUST read the Cisco Release Notes and understand the upgrade process described.
    • Why?
      • These procedures can vary between versions.
      • In addition, the notes list many “do not update if” caveats.


Admin Tab

Communication Management

  • Ensure that your Management Interfaces are specified correctly, including domain.
  • If the Management subnet’s gateway is not pingable, specify the MII Status setting.
    • Why?
      • If this is not set the FIs will assume they are down and attempt failover constantly.

Time Zone Management

  • Be sure to set the time-zone if not UTC.
  • Be sure to specify NTP Servers.
    • Note: This does not set the time on blades themselves.

License Management

  • Be sure to check that no “Grace Period” licenses are in use.
    • Why?
      • This will prevent licensing errors on day 121.
  • Download and apply any licenses that were purchased.
    • Most times they come pre-loaded depending on the SKU.

User Management

  • Engineer must fully understand RBAC integration processes.
    • Why?
      • The process can be a bit complicated and involves integration with production customer systems.

LAN Tab

LAN Cloud

  • When creating Port-Channels match the ID to the Port-Channel ID of the uplinks on the upstream switch.
    • Why?
      • This is to help simplify troubleshooting and keep items consistent.
  • Set the QoS System Class to match the upstream Nexus switches’ QoS.
    • It is highly recommended to enable jumbo frames on the Best Effort and other system classes.  Many traffic types take advantage of jumbo frames: vMotion, NFS, iSCSI, Oracle RAC, etc.
    • If jumbo frames are not set here, setting them on the vNICs will not do anything.
      • Otherwise future troubleshooting is often very difficult: an MTU mismatch at L2 results in dropped frames, vs. a mismatch at L3, which results in fragmentation.

Policies

  • Create a Network Control Policy that enables CDP and sets MAC Register Mode to All Host VLANs.
    • Why?
      • This allows ESXi to see the CDP information.
      • This also allows MACs to be registered on other VLANs.
        • This is especially useful when the customer is not utilizing the native VLAN on the trunks, or when the native VLAN is locked down.
    • The exception to this rule is:
      • Set this to Native VLAN Only if the customer has a large number of VLANs specified (200+, typically only seen in the Service Provider space).
  • Create QoS Policies that match up with all the enabled classes in the QoS System Class.
    • Leave Host Control set to None.
      • The exception to this rule is:
        • Host Control should be set to “Full” if tagging CoS at the host level (typically done with a 1000v or a vSphere 5.5 DVS).

IP Pools

  • Create an ext-mgmt pool that is at least the size of the maximum number of blades possible in the domain, if possible (or at least account for future growth).
    • Why?
      • This prevents the possibility of a split pool later on.
    • Ensure the pool is in the same subnet as the Fabric Interconnects.
      • Why?
        • The Out-Of-Band management for the blades MUST reside on the same subnet as the Fabric Interconnects’ MGMT ports, since it uses them for external connectivity.
    • Accounting for the number of ports on the FI and the number of uplinks per chassis, determine the max number of chassis possible (the math is sketched below).
      • Why?
        • This ensures the customer creates a pool ahead of time sized for the max number of servers supported.  If they don’t plan on growing that big, trim back as necessary.
      • Fabric Interconnects have a maximum possible chassis count of 20.
    • Multiply that number by 8 (the max number of blades per chassis).
      • Using a 6248 with 4 uplinks per chassis and 4 total uplinks as an example, the math is this:
        • (48 total ports - 4 uplink ports) / (4 uplinks per chassis) = 11 chassis * 8 blades per chassis = 88 total possible blades per UCS domain.
        • This means you’ll need at least 91 IPs (88 blades + 3 FI IPs) in the management subnet.
    • It is sometimes preferred to double that number in case Service Profile IPs are wanted as well; however, this is not necessary or always possible.
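
To make the sizing concrete, here is that math as a quick PowerShell sketch (the inputs are the example values from above):

$totalPorts       = 48   # 6248 fixed ports
$uplinkPorts      = 4    # ports reserved for uplinks
$linksPerChassis  = 4    # IOM-to-FI links per chassis
$bladesPerChassis = 8
# UCSM tops out at 20 chassis per domain, so cap the result
$maxChassis = [math]::Min([math]::Floor(($totalPorts - $uplinkPorts) / $linksPerChassis), 20)
$maxBlades  = $maxChassis * $bladesPerChassis
# +3 covers FI A, FI B and the cluster IP on the same subnet
"$maxChassis chassis, $maxBlades blades, $($maxBlades + 3) management IPs minimum"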

MAC Pools

  • Create MAC Pools with the following convention in mind:
    • This is only a suggestion; follow customer needs. Larger environments will be limited by the 255-host limit, so please plan accordingly.
    • This convention requires a MAC pool for each vNIC type, which does increase the initial setup and some management; however, a lot of customers like it for the better ability to segment traffic for security, QoS & monitoring.

The MAC pool value convention is used to provide a contiguous range of MAC address values.  The MAC address consists of 12 hex digits.  The address is built utilizing the following table:

Section 1   Section 2   Section 3   Section 4   Section 5   Section 6
00:25:B5    X           Y           Z           A/B         XX


  • Section 1 identifies the Cisco OUI.
  • Section 2 is the site code.
    • For example, Las Vegas is 1, Chicago is 2, etc.
  • Section 3 is the domain code.
    • This is per UCS domain and increments; e.g. the 2nd UCS domain at a given DC would be 2.
  • Section 4 identifies the purpose of the NIC.
    • 1 is for Mgmt, 2 is for vMotion, 3 is for Storage Traffic, 4 is for VM Data Traffic.
  • Section 5 identifies Fabric A or B (A is shown).
  • Section 6 is the assigned GID (allows for 255 unique GIDs).
    • We start at 01 so that the server-to-ID numbering matches.
  • This would give a Storage vNIC on Fabric A of the 3rd server in the above Chicago UCS a MAC address of:
    • 00:25:B5:22:3A:03 (see the sketch below)


  • Recommend starting the pools with 01 instead of 00.
    • Why?
      • This makes it more “human readable” and less prone to errors
  • Recommend creating a MAC Pool for each NIC type, for each Fabric.
  • The maximum pool size with this convention is 255 if starting at 01.
    • This should be enough, since the max number of blades in a domain is 160.
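
To show the convention in action, here’s a small PowerShell sketch that builds MACs from the codes above (the site/domain/purpose values are just this post’s examples; UCSM doesn’t enforce any of this):

function New-ConventionMac {
    param([int]$Site, [int]$Domain, [int]$Purpose,
          [ValidateSet('A','B')][string]$Fabric, [int]$Id)
    # Sections: Cisco OUI : site+domain : purpose+fabric : two-hex-digit GID
    '00:25:B5:{0}{1}:{2}{3}:{4:X2}' -f $Site, $Domain, $Purpose, $Fabric, $Id
}
# 3rd server, Storage vNIC, Fabric A, Chicago (site 2) domain 2:
New-ConventionMac -Site 2 -Domain 2 -Purpose 3 -Fabric A -Id 3
# -> 00:25:B5:22:3A:03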

vNIC Templates

  • The name should be descriptive of the template’s function.
  • You should NOT enable failover in the following situations:
    • When any multipathing-aware OS is installed:
      • ESXi
      • Certain newer Linux OSes
      • Windows 2012 or later
      • Windows 2008 if Cisco teaming drivers will be installed
  • You should enable failover in the following situations:
    • Windows 2008 and earlier with no teaming drivers installed
    • Older Linux OSes
    • iSCSI Boot NICs
  • Create the template as an Updating Template.
    • Why?
      • This will make any changes done to the template propagate to any attached items.
      • Understand the implications though!!
  • Set the appropriate MTU, MAC Pool, QoS & Network Control Policy. (A PowerTool sketch follows below.)
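
If you’re scripting this, a vNIC template can be created with PowerTool along these lines.  A hedged sketch only (the names and the pool/policy references are this guide’s examples; verify Add-UcsVnicTemplate and its parameters against your PowerTool version):

# Updating template on fabric A, jumbo MTU, failover left disabled per the guidance above
Get-UcsOrg -Name "Prod" | Add-UcsVnicTemplate -Name "Storage-A" -SwitchId "A" `
    -TemplType "updating-template" -Mtu 9000 `
    -IdentPoolName "MAC-Storage-A" -QosPolicyName "Jumbo-Frames"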


SAN Tab

SAN Cloud

  • If using VSANs, be sure to create them (see the sketch below).
  • Ensure each VSAN’s FCoE VLAN is in a high enough range.
    • Why?
      • This will ensure that it is in a range that will not later be used by networking equipment, while keeping it human-readable so you know what the VLAN is used for.
    • It is recommended to add 2000 or 3000 to the VSAN ID number.
    • So VSAN 2 would have an FCoE VLAN of 2002 or 3002.
      • Ensure the VLAN # is unique and reserved in the environment!!
  • Do NOT create Common/Global VSANs; they should be specific to each Fabric Interconnect.
    • Why?
      • Each FI should be treated as a separate SAN switch.  This complies with typical SAN switch design.
      • This also prevents issues with FCoE, where you cannot have the same VSAN on both Fabric Interconnects.
  • Enable FC Zoning for that VSAN ONLY if there is no upstream SAN switch.
    • Why?
      • This prevents the FI from creating zonesets and trying to do an additional layer of zoning that will cause issues.
      • In addition, the upstream switches typically do a much better job at handling the zoning than the FIs can.
  • Ensure that your FC Uplink Interfaces are in the correct VSAN.
    • By default they will be in VSAN 1.
  • If utilizing FC Port-Channels, ensure the ID matches an uplink port-channel ID.
    • Why?
      • This helps with consistency as well as future troubleshooting.
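
For those scripting the SAN side, here’s a hedged PowerTool sketch of the per-fabric VSAN guidance above (I’ve seen Add-UcsVsan used this way in FlexPod scripts; verify against your PowerTool version):

# Per-fabric VSANs; FCoE VLAN = VSAN ID + 3000, per the convention above
Get-UcsFiSanCloud -Id "A" | Add-UcsVsan -Name "VSAN2-A" -Id 2 -FcoeVlan 3002
Get-UcsFiSanCloud -Id "B" | Add-UcsVsan -Name "VSAN3-B" -Id 3 -FcoeVlan 3003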

Policies

  • Create a Storage Connectivity Policy if doing Fabric Interconnect FC Zoning.
    • Use ONLY “Single Initiator Single Target”.
      • Why?
        • This is a best practice that is recommended by almost every storage vendor on the market today.

Pools

  • Create a WWNN Pool with the following convention:
    • This is only a suggestion; follow customer needs. Larger environments will be limited by the 255-host limit, so please plan accordingly.
    • You MUST have a zero in the place where the WWPN’s A or B identifier will go.

The WWNN pool value convention is used to provide a contiguous range of WWNN address values.  The WWNN address consists of 16 hex digits.  The address is built utilizing the following table:

Section 1   Section 2   Section 3   Section 4   Section 5   Section 6
20:00       00:25:B5    X           Y           00          XX


  • Section 1 identifies the ID as an initiator.
  • Section 2 is the Cisco OUI.
  • Section 3 is the site code.
    • For example, Las Vegas is 1, Chicago is 2, etc.
  • Section 4 is the domain code.
    • This is per UCS domain and increments; e.g. the 2nd UCS domain at a given DC would be 2.
  • Section 5 is 00, as this is a WWNN.
  • Section 6 is the assigned GID (allows for 255 unique GIDs).
    • We start at 01 so that the server-to-ID numbering matches.
  • This would give the 3rd server in the above Chicago UCS a WWNN of:
    • 20:00:00:25:B5:22:00:03
  • Create WWPN Pools with the following convention:
    • This is only a suggestion; follow customer needs. Larger environments will be limited by the 255-host limit, so please plan accordingly.
    • You MUST place the A or B identifier in the place where the WWNN has its 0 digit.

The WWPN pool value convention is used to provide a contiguous range of WWPN address values.  The WWPN address consists of 16 hex digits.  The address is built utilizing the following table:

Section 1   Section 2   Section 3   Section 4   Section 5   Section 6
20:00       00:25:B5    X           Y           0A/0B       XX


  • Section 1 identifies the ID as an initiator.
  • Section 2 is the Cisco OUI.
  • Section 3 is the site code.
    • For example, Las Vegas is 1, Chicago is 2, etc.
  • Section 4 is the domain code.
    • This is per UCS domain and increments; e.g. the 2nd UCS domain at a given DC would be 2.
  • Section 5 is either 0A or 0B depending on the fabric.
  • Section 6 is the assigned GID (allows for 255 unique GIDs).
    • We start at 01 so that the server-to-ID numbering matches.
  • This would give a WWPN on the A fabric of the 3rd server in the above Chicago UCS of:
    • 20:00:00:25:B5:22:0A:03 (see the sketch below)
  • Create IQN Pools with the following prefix:
    • iqn.2014-05.com.cisco:25B5
      • Why?
        • Certain storage vendors will not work correctly without this type of formatted IQN.
        • This prefix states it is a Cisco device, and 25B5 is part of the Cisco registered OUI.
    • The suffix does not matter.
      • However, it’s recommended to identify it with the future hostname or another descriptive name and number identifier.
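
Same idea as the MAC sketch: a little PowerShell that builds WWNNs and WWPNs from the convention above (again, this only illustrates the convention; nothing here is UCSM-specific):

function New-ConventionWwn {
    param([int]$Site, [int]$Domain, [string]$Fabric = '0', [int]$Id)
    # Fabric '0' yields a WWNN; 'A' or 'B' yields a WWPN for that fabric
    '20:00:00:25:B5:{0}{1}:0{2}:{3:X2}' -f $Site, $Domain, $Fabric, $Id
}
New-ConventionWwn -Site 2 -Domain 2 -Id 3            # -> 20:00:00:25:B5:22:00:03 (WWNN)
New-ConventionWwn -Site 2 -Domain 2 -Fabric A -Id 3  # -> 20:00:00:25:B5:22:0A:03 (WWPN)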

HBA Templates

  • The name should be simple if using only 2 HBAs.
    • Why?
      • This will keep it easy to understand and use.
      • Either way the name should make sense.
    • HBA-A & HBA-B is a typically used option.
  • Ensure the VSAN selected is correct.
  • Select Updating Template.
    • Why?
      • This ensures any changes to the template are propagated to all attached items.
      • Ensure the implications of this are understood.
  • Ensure the QoS Policy is set to the previously created FC policy.

Server Tab

Policies

  • Create a BIOS policy that sets the following:
    • Why?
      • These are items typically requested by customers and recommended by the industry.
    • Resume AC on Power Loss to “Last-State”.
    • Set DDR Mode to “Performance-Mode”.
    • Boot Option Retry “Enabled”.
    • Quiet Boot “Disabled”.
  • When creating a Boot Policy, ensure “Enforce vNIC/vHBA” is checked.
    • Why?
      • This ensures that the vNIC you want to boot off of is the one the server actually boots off of.
    • This means you MUST enter the names exactly the same as they will be created in the Service Profile Templates!!
      • Why?
        • If you don’t, you will get an error relating to the boot policy.
    • Ensure that you always have CD/DVD as the first boot device.
      • Why?
        • This allows the use of ISO media even if there is an OS on the blade.
        • Otherwise the admin would have to try and catch the boot selection keypress.
    • Unless specified by the Operating System, use Legacy boot mode.
      • Why?
        • If EFI mode is selected, many Operating Systems will not boot.
    • If doing SAN boot, odd-numbered servers should boot off SAN head A and even-numbered servers off head B.
      • Why?
        • This balances the blades and reduces the impact of a boot storm.
      • Be sure to set both A & B HBAs; WWPNs should come from both storage heads, on their respective fabrics.
      • The drawback to this is multiple Service Profile Templates, so discuss it with the customer.
  • Create a Host Firmware Package and set it to the same Firmware revision as the FIs.
  • Create a Maintenance Policy with a setting of “User Ack”.
    • Why?
      • This prevents any updates from rebooting all of the UCS blades at once.
  • Create a Power Control Policy with a setting of “No Cap”, unless required by customer.
    • Why?
      • This is used as a majority of customers do not setup or use the Power Capping features.
  • Create a Scrub Policy with all settings set to No, unless required by the customer.
    • Note: the policy does not actually do a data wipe.

Pools

  • Create a UUID Suffix Pool with the following:
    • If you are following the MAC/WWNN/WWPN conventions in this guide, the Prefix should be set to “Other”.
      • Using the site and domain codes, enter them into the first two places of the Prefix.
      • Using the site and domain codes, enter them into the first two places of the Suffix. (See the sketch below.)
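
For illustration, deriving the UUID suffix from the same codes (a sketch of the convention only; UCSM expects a suffix of 4 hex digits, a dash, then 12 hex digits):

$site = 2; $domain = 2
# Site and domain codes land in the first two places of the suffix
'{0}{1}00-000000000000' -f $site, $domain
# -> 2200-000000000000 ; start the UUID suffix pool block from this value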

Service Profile Templates

  • Ensure the name is descriptive of its purpose as well as any unique features.
  • Select “Updating Template”.
    • Why?
      • This ensures any changes to the template are propagated to all attached items.
      • Ensure the implications of this are understood.
  • Use “Expert” mode for adding vNICs.
    • Why?
      • Choosing anything other than “Expert” will miss important network settings.
  • Name the vNICs the same as the vNIC Template if possible.
    • Why?
      • This keeps things consistent and easy to troubleshoot.
  • Ensure you select the appropriate Adapter Policy.
  • Use “Expert” mode for adding vHBAs.
    • Why?
      • Choosing anything other than “Expert” will miss important settings.
  • Name the vHBAs the same as the vHBA Template if possible.
    • Why?
      • This keeps things consistent and easy to troubleshoot.
  • Ensure you select the appropriate Adapter Policy.
  • Select “Let system perform placement” for vNIC/vHBA Placement.
  • Set the Firmware Management policy on the “Server Assignment” page.
  • Set the BIOS, Power Control & Scrub Policy on the last page.
  • Ensure the names of the profiles are descriptive.

Service Profiles

  • Ensure the names of the profiles are descriptive.
    • Best case: use the hostname of the server as the name, if possible.
  • Put the hostname in the “User Label” section.

Custom UCS/FlexPod Build Script

UPDATE: Working with some of our internal guys, it’s come to my attention that some of the script has broken with the newer UCSM versions.  I will be updating this to be more “adaptable”; in the meantime, use the script for ideas and feel free to kang any code from it.



So I started working on a PowerShell script that grabs variables from an Excel sheet and creates a UCS build from them.

I am at the point where the build actually works quite well now. I’m pretty proud of myself since I’m NOT a deep PowerShell guy. This came about from looking at other UCS PowerShell scripts and a lot of tweaking and testing.

Anyway, this script will continue to grow and its functionality expand. My end goal is to be able to do a base FlexPod build by scripting, including UCS, Nexus switches, NetApp and VMware.

It will take a lot of time, and I may never really use the script, but it’s more of a pet project: not only to see if I can do it, but also to grow my PowerShell skillset. (A minimal sketch of the approach is below.)
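
For the curious, the core of the approach is small.  Here’s a minimal sketch (I’m using a CSV export instead of Excel COM to keep it simple; the file and column names are made up):

Import-Module Cisco.UCSManager            # PowerTool; module name varies by version
$rows = Import-Csv .\ucs-build-vars.csv   # e.g. columns: UcsIp, OrgName, MacPoolStart
Connect-Ucs -Name $rows[0].UcsIp -Credential (Get-Credential) | Out-Null
foreach ($row in $rows) {
    # ...create orgs, pools, policies and templates from the row's values...
}
Disconnect-Ucs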

Here is the github if you’d like to follow/assist or download and play with it a bit.

https://github.com/cknic/UCS_Build

UCS Firmware bug found, affects PXE Boot

UPDATE:

On the system that discovered the initial bug updating to 2.1(1f) did


I am working with one of my fellow engineers who is doing a whole vCloud Suite/SRM/kitchen-sink deployment.  Anyway, he found a bug in UCS firmware version 2.1(1e).

When he was trying to Auto Deploy his ESXi hosts, they would not get DHCP at all.

We looked at the DHCP server: it was on the same VLAN as the hosts, it was configured correctly, and the scope was actually on.  We looked at the UCS settings: the correct mgmt VLAN was set to “Native” so that the NICs could actually see the DHCP reply.

What we noticed was that for some reason the VIFs were not coming up on the UCS system.  Now, anybody who has used UCS for any period of time is used to seeing VIF errors, especially when setting up blades and installing OSes.  The typical time you get these errors is when the FI is trying to set up the network before the blade is actually online; these are typically transient and go away quickly.

Anyway, these weren’t the typical errors: the VIFs were truly down and would not come up until the ESXi installer was run off of a mounted ISO.  They would not come up while the NICs were looking for DHCP… odd.

They were running the exact same hardware as I had in 4 different FlexPods: 2248s, B200 M3s, and a very similarly set up VMware environment.  The ONLY difference was that the firmware they loaded was the latest at the time, 2.1(1e).  I was still running 2.1(1a).

The engineer downgraded to 2.1(1d) and everything immediately came online and worked perfectly.  Problem solved… well, sort of.  I have confirmed on a new install of mine that there is an issue.

I am in the process of building a new system at 2.1(1f) to see if the problem has been fixed.  I will update ASAP.

Updates

So I’ve been neglecting this site really badly.

I’ve been insanely busy with all kinds of things.  So what’s new:

Got my Citrix CCIA certification, woohoo!!  Now just waiting on some Citrix projects 🙂


In the meantime I’m doing more FlexPods.  I am working on an update to my iSCSI boot document.  I found that the new UCS 2.0(2r), I believe, added IQN Pools, so some of the screenshots changed; however, the process is basically the same.

I’ve also been writing a lot of internal documentation, so the thought of writing more when I’m “off the clock” hasn’t been fun.  That’s coming to a bit of an end, so I’m going to start writing more here again.  I’ve come across some interesting things lately.

iSCSI Boot with ESXi 5.0 & UCS Blades

UPDATE: The issue was the NIC/HBA placement policy.  The customer had set a policy to have the HBAs first, then the iSCSI overlay NIC, then the remaining NICs.  When we moved the iSCSI NIC to the bottom of the list, the ESXi 5.0 installer worked just fine.  I’m not 100% sure why this fix actually works, but either way it works.

So at a recent customer’s site I was trying to configure iSCSI booting of ESXi 5.0 on a UCS blade, a B230 M2.  To make a long story short, it doesn’t fully work and isn’t officially supported by Cisco.  In fact, NO blade models are supported for ESXi 5.0 & iSCSI boot by Cisco.  They claim a fix is on the way, and I will post an update when there is one.

Here is the exact issue, and my original thoughts, in case it helps anybody:

We got an error installing ESXi 5 to a NetApp LUN: “Expecting 2 bootbanks, found 0” at 90% of the install.  The blade is a B230 M2.

The LUN is seen in BIOS as well as by the ESXi 5 installer.  I even verified the “Details” option, and all the information is correct.

Doing an Alt-F12 during the install and watching the logs more closely today, at ~90% it appears to be unloading a module that appears, by its name, to be some sort of VMware Tools-type package.  As SOON as it does that, the installer claims that there is no IP address on the iSCSI NIC and begins to look for DHCP.  The issue is that during the configuration of the Service Profile and the iSCSI NIC, at no time did we choose DHCP; we chose static.  (We even tried Pooled.)  Since there is no DHCP server in that subnet it doesn’t pick up an address and thus loses connectivity to the LUN.

So we rebooted the blade after the error, and ESXi 5 actually loads with no errors.  The odd thing is that the root password that was specified isn’t set; it’s blank, like ESXi 4.x was.

So an interesting question is what’s happening during that last 10% of the installation of ESXi 5?  Since it boots cleanly, it almost seems like it does a sort of “sysprep” of the OS, i.e. all the configuration details.  If that’s the only issue then it might technically be OK.  However, I don’t get the “warm and fuzzies”.  My concern would be that, maybe not today, but down the road some module that wasn’t loaded correctly will come back to bite the client.

Also, what is happening in that last 10% that’s different from ESXi 4.x?  We were able to load 4.1 just fine with no errors.

Again, we called Cisco TAC and were told that ESXi 5 iSCSI booting wasn’t supported on any blade.  They do support 4.1, as well as Windows and a variety of Linux distros.

Configuring iSCSI boot on a FlexPod

Here is a nice document to follow to configure iSCSI booting for a FlexPod, i.e. UCS blades, a NetApp array & ESXi.

UPDATE: This document has the fix I found for ESXi 5.0.  This was tested on B230 M2s and seems to work every time.

This document will be updated as I get new information.

FlexPod iSCSI Boot-Fixed

Major Lessons in UCS Failure & Recovery

While at a customer’s site we were doing a V&V test on a fully completed UCS chassis.  The client indicated that he wanted to do a “hard-fail” of the primary interconnect, so we pulled the power cord.  We then left the interconnect off overnight.  The next morning we powered on the now-subordinate interconnect.  After its long boot, the servers went into a “Discovery” mode.

Issue #1

After 30 minutes the servers’ secondary NICs wouldn’t come back online.  The IOMs for the A side were showing Critical errors and all the links were administratively down.  In the Faults area the error was that the discovery policy didn’t match what the IOMs were seeing, which wasn’t the case.

After a lot of painful troubleshooting I came to find out that after the Fabric Interconnect came back online, I had acknowledged the chassis too quickly.  A colleague pointed the following out to me:

“Perfect understanding of the chassis discovery policy and number of links between IOM & FI. I would add just 2c here — care must be taken about “when” you hit the “acknowledge chassis” button. When you “acknowledge chassis”, what you are telling UCSM is “I acknowledge the current connectivity of the chassis”. Every “Fabric Port” of the IOM (the one connected to the FI) has two states: ‘Discovery’ and ‘Acknowledgement’. You can see these under “Status Details” of a given fabric port (under ‘Equipment’ –> ‘Chassis’ –> ‘IO Module’ in the GUI). Discovery is an operational state – it can be ‘absent’ or ‘present’. Ack tells whether the link is used by the system or not.

When admin hits “acknowledge chassis”, UCSM takes the snapshot of Discovery state – and if link is ‘Present’, then it is marked as ‘Acknowledge’ (and if not present, then un-ack) — and all the ack’ed ports are used to pass data.

So, before hitting ‘acknowledge chassis’, it is advisable to make sure that the links are all in ‘present’ state.”

It turns out you need to wait a few minutes after everything comes back online before doing the acknowledgement.


Fix #1

Unfortunately, I tried several methods of fixing this, including trying to acknowledge the chassis after all 4 “Fabric Ports” came up.  The only way I could fix it was to decommission and recommission the chassis.


Issue #2

So now, after recommissioning the chassis, the servers needed to go through a rediscovery.  Again after 30 minutes or so, the servers were getting a critical error that the discovery process was failing.  Watching the FSM, all the servers were stuck at the same point: “configure primary fabric interconnect in <svr#> for pre-boot environment (FSM-STAGE:sam:dme:ComputeBladeDiscover:SwConfigPnuOSLocal)”.


This was at 30% of the discovery process, and it kept doing retries until it failed out.  Looking at the KVM of the server, it was sitting in the BIOS looking for something to boot off of.  Knowing that for the discovery process it has to boot into the UCS PXE image, I knew there was an issue.


I attempted to “Recover corrupt BIOS”, “Reset CMOS”, “Reset CIMC” & “Re-acknowledge”; I even attempted pulling the blades and re-seating them.  Nothing worked.  The servers were unusable.


Fix #2

I decommissioned the servers.  Upon clicking on the servers again, a popup appeared stating that the inserted servers were not the same as those configured (which was none), asking if I wanted to accept the new servers.  After accepting, a discovery was launched.  Luckily, this time the discovery proceeded correctly and the service profiles began to load on the blades.


So as much as it seems really scary and drastic to do a decommission and recommission, it is sometimes necessary and does seem to do a true reset on the configuration of the components.  However, I say this with caution: this process should be the LAST resort as it does involve a major outage.  Although, if you’re wondering whether you need to do this, you most likely already have downtime.


Direct Connected Fiber Storage to UCS

So I’ve come across this recently.  I have a client that is directly connecting the Fibre Channel from their NetApp array to the 6120s of the UCS.

The issue that has been raised is that this is not technically supported.  It seems Cisco stated with the 1.4.1 firmware release that you can absolutely do this.  However, there is a caveat: it’s supported by Cisco as long as the storage vendor will support it.

The biggest problem is that NetApp did support it, but they don’t any longer.  So it seems Cisco was left holding the ball when NetApp walked away.

So if you’re running a NetApp array that is directly connected to your UCS without an MDS or even a 5548 with the FC module, it’s no longer technically supported and you very well may run into issues if you need vendor support.

For those not familiar with direct connecting the storage, I’ll give a little bit of information on it, as well as some of my experiences with it and some tips on making it “work” with UCS.

So inside the 6120 there is effectively a very, very dumb MDS switch.  There is no zoning (it is all one big zone); you do have VSANs, but obviously no inter-VSAN routing, no security, and no real way of even getting initiator/target information for troubleshooting purposes.

In order to even use the functionality, you must change the Fibre Channel portion of the switch from “End-Host Mode” to “Switch Mode”.  This is EXTREMELY similar in method and functionality to switching the network side to “Switch Mode”.

You MUST also make sure to select the default VSAN that is created upon initial set-up, and enable “Default Zoning”.

Interesting note: you MUST absolutely make sure the HBA name in the Boot Policy is the EXACT same as the HBA name in the HBA Template, or it won’t boot.
So again, in my opinion, if you can avoid direct connecting your SAN storage to the 6120, please avoid it, at least until UCS 2.0 comes out  🙂

Enabling Jumbo Frames in a Flexpod Environment

Update: I have fixed the 5548 section; I was missing the last two lines.

This post will help you enable jumbo frames in your FlexPod environment. It will also work for just about any UCS-based environment; however, you will have to check how to enable jumbo frames on your particular storage array.

This post assumes a few things:

Environment is running 5548 Nexus switches.
User needs to set up jumbo frames on the NetApp for NFS/CIFS shares.
NetApp has VIF or MMVIF connections for said NFS/CIFS shares.

Cisco UCS Configuration 

-Login to UCSM and click on the LAN Tab.
-Expand LANs & LAN Cloud.
-Click on the QoS System Class and change the “Best-Effort” MTU to 9216.

NOTE: You need to just type in the number, it’s not one of the ones that can be selected in the drop-down.

-Expand the Policies section on the LAN Tab.  Right-click on QoS Policies and click “Create new QoS Policy”.  Call it “Jumbo-Frames” or something similar.
-On the vNIC Template or the actual vNIC on the Service Profile, set the “QoS Policy” to the new policy. (A PowerTool sketch is below.)
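
If you’d rather script the UCS half, the same change via PowerTool looks like this (a hedged sketch; I’ve seen this cmdlet pair in FlexPod build scripts, so verify it against your PowerTool version):

Get-UcsBestEffortQosClass | Set-UcsBestEffortQosClass -Mtu 9216 -Force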

 ESX/ESXi Configuration

-Either SSH or console into the ESX host.  If you’re using ESXi you’ll need to ensure local or remote tech support mode is enabled.
-We need to set the vSwitch that the Jumbo-Framed NICs will be on to allow Jumbo-Frames.
          Type esxcfg-vswitch -l   to find the vSwitch we need to modify.
          Type esxcfg-vswitch -m 9000 vSwitch# (replace # with the actual number)
          Type esxcfg-vswitch -l   again; you should now see the MTU set to 9000.

-We now need to set the actual VMKernel NICs.

          Type esxcfg-vmknic -l   to find the vmk’s that we need to modify.
          Type esxcfg-vmknic -m 9000 <portgroup name> (this is the portgroup that the vmk is part of)
          Type esxcfg-vmknic -l   to verify that the MTU is now 9000.

Note: If you’re using dvSwitches, you can set the MTU size through the VI Client.

5548 Configuration 

Login to the 5548 switch on the “A” side.
-Type the following:

system jumbomtu 9216
policy-map type network-qos jumbo
class type network-qos class-default
mtu 9216
multicast-optimize
exit
system qos
service-policy type network-qos jumbo
exit
copy run start

-Repeat on the “B” Side 

NetApp Configuration 

-Login to the Filer.
-Type ifconfig -a   to verify which ports we need to make run jumbo frames.
-Type ifconfig <VIF_NAME> mtusize 9000

NOTE: You need to make sure you enable jumbo frames not only on the VLAN’d VIF but also on the “root” VIF.