Nutanix Tech Field Day “Demo”


I had the great pleasure of taking part in Stephen Foskett’s Tech Field Day sessions again this year and providing a “demo” for the delegates and viewers (those attending know the reason for the quotes; personally, I’m not a fan of canned demos :) ).  In this blog post I’ll recap the topics from the session as well as provide some additional insight and videos on what was taking place.

I tried to go deeper than the discussion without going down the rabbit hole.  If there are areas where you’d like more information, let me know and I’d be more than happy to add more detail!

Here’s the video:


Phase 1 – Deploy!

To kick things off we started with a deployment of a fresh-from-the-factory, 4-node Nutanix cluster.

Nutanix uses IPv6 link-local addressing in coordination with Avahi (the Linux implementation of zeroconf, akin to Apple’s Bonjour) to dynamically discover the nodes on the network before any IP addresses are configured.  This is also leveraged for cluster communication by our native Genesis framework.  Utilizing these technologies allows the cluster to be configured independently of any IP scheme, and also allows the cluster to run when there is an IP configuration issue which needs to be addressed.
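To give a feel for how this style of discovery works in general, here’s a small Python sketch using the third-party zeroconf package to browse for mDNS announcements on the local network.  The service type below is a placeholder I made up for illustration, not Nutanix’s actual service name:

```python
from zeroconf import ServiceBrowser, Zeroconf  # pip install zeroconf

SERVICE = "_nutanix._tcp.local."  # placeholder service type, for illustration only

class Listener:
    def add_service(self, zc, type_, name):
        # Resolve and print each node announcing itself on the LAN
        info = zc.get_service_info(type_, name)
        print(f"discovered: {name} -> {info.parsed_addresses() if info else '?'}")

    def remove_service(self, zc, type_, name):
        print(f"gone: {name}")

    def update_service(self, zc, type_, name):
        pass

zc = Zeroconf()
ServiceBrowser(zc, SERVICE, Listener())
input("browsing for nodes; press Enter to stop\n")
zc.close()
```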

Upon navigating to the “cluster_init” page (http://<IPv6 link local address>:2100/cluster_init.html) you can see an HTML page showing the discovered nodes and configuration parameters (this year at TFD we had already configured the IPv4 addresses, so IP re-configuration wasn’t necessary).  From here we selected the discovered nodes and created our Nutanix cluster.  This process configures certain infrastructure components, initializes the cluster, and sets all of the IP addresses for ESXi, Nutanix and IPMI.  I can normally complete this process in < 3 minutes, which beats the days or weeks typical of traditional solutions.
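As a side note, a node’s IPv6 link-local address is typically derived from its NIC’s MAC address via modified EUI-64, so given a MAC you can compute where cluster_init should be listening.  A quick sketch (the MAC below is made up):

```python
def mac_to_link_local(mac: str) -> str:
    """Derive an IPv6 link-local address from a MAC via modified EUI-64."""
    b = [int(x, 16) for x in mac.split(":")]
    b[0] ^= 0x02                              # flip the universal/local bit
    eui = b[:3] + [0xFF, 0xFE] + b[3:]        # insert ff:fe in the middle
    groups = [f"{eui[i] << 8 | eui[i + 1]:x}" for i in range(0, 8, 2)]
    return "fe80::" + ":".join(groups)

# e.g. browse to http://[<addr>%eth0]:2100/cluster_init.html
print(mac_to_link_local("00:25:90:aa:bb:cc"))  # fe80::225:90ff:feaa:bbcc
```

Remember that when connecting to a link-local address you’ll also need to append the zone/interface identifier (e.g. %eth0 on Linux).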

At the end of the creation I actually destroyed the cluster to put the nodes back into an “available” state so they could be discovered and added later to expand our cluster.

Phase 2 – Nutanix UI and Cluster Management

Next up, we logged into the Nutanix UI, our new HTML5-based user interface which will be part of the upcoming 3.5 release (I’ll have a separate post covering all of its enhancements/features).  Now, normally I’m not a huge fan of glitzy interfaces, as for the majority of my time I prefer to use APIs or the CLI (e.g. PowerCLI, NCLI, etc.); however, this one I can use.  Not only is the interface very pleasing, it also surfaces a ton of valuable information.

The main screen of the dashboard gives me a high-level overview of all of the details I need, including compute resources, virtual machines, and storage metrics, as well as events and alerts in the system.

REST API & Explorer


Extensible interfaces and APIs are a must-have for any developer or anyone trying to interface with a platform.  These APIs allow tools and workflows to be developed as part of any orchestration and automation effort.  Maybe we’ll see a PowerNCLI soon? ;)

This is a big thing for me, as well as for a lot of other engineers/developers/admins/etc. who are looking to do things in the most efficient manner possible.  For example, I leverage the vSphere UI primarily just to monitor alerts and tasks; most of my work with vSphere is done using PowerCLI.

Another nice piece is the REST API Explorer, which allows you to browse the REST API calls and schemas, as well as execute actual calls and see the responses and headers.  This is very handy when looking to implement or troubleshoot a workflow.
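To make that concrete, here’s a rough sketch of what a scripted call against the REST API could look like in Python.  Treat the path, port, credentials, and field names as assumptions on my part; the Explorer is the authoritative source for the real calls and schemas on your cluster:

```python
import requests  # pip install requests

# Hypothetical example: list VMs via the Prism REST API.  The endpoint,
# port, credentials, and response fields below are placeholders; verify
# the actual calls in the REST API Explorer.
resp = requests.get(
    "https://cluster:9440/PrismGateway/services/rest/v1/vms",
    auth=("admin", "password"),
    verify=False,  # lab only: self-signed certificate
)
resp.raise_for_status()
for vm in resp.json().get("entities", []):
    print(vm.get("vmName"))
```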

Events, Alerting, and (More Importantly) Analysis

In reality, an event or alert is good at telling you that something happened (normally after the fact), but more importantly I want to know what was impacted.  The alerting framework will display alerts from the hardware all the way up the stack to the VMs.  For example, if a bad disk or bad blocks are found during our scrubbing process, we can fire an alert saying that a disk may be going bad.  Likewise, if latency or CPU utilization is higher than usual, we can alert on that as well.
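As a rough illustration of the “higher than usual” idea (a simplified stand-in, not our actual implementation), a baseline-deviation check could look like this:

```python
from statistics import mean, stdev

def is_anomalous(history, sample, sigmas=3.0):
    """Flag a metric sample (e.g. disk latency in ms) that deviates
    strongly from its recent baseline.  Illustrative only."""
    if len(history) < 2:
        return False
    mu, sd = mean(history), stdev(history)
    return sd > 0 and (sample - mu) / sd > sigmas

baseline = [1.1, 0.9, 1.0, 1.2, 1.0, 0.8, 1.1]
print(is_anomalous(baseline, 1.3))   # False: within normal variation
print(is_anomalous(baseline, 15.0))  # True: latency spike worth alerting on
```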

Alerts are great for visibility, but their impact is what matters most.  When we showed the analysis framework, we looked at events and alerts alongside the metrics to see how the conditions behind the alerts affected the performance, utilization, etc. that we care about.

Phase 3 – Cluster Expansion

This is personally one of my favorite pieces and is one of the keys to the “scale out” capabilities of the cluster.  Before I talk about what I showed, I first want to take a step back and talk a little bit about how the cluster expansion feature works.

As with any distributed system, peer coordination is absolutely critical; if peers get out of sync, there can be multiple issues ranging from system downtime to corruption.

Discovery

As mentioned in the deployment section, Avahi and zeroconf are used to allow nodes to “announce” themselves on the network.  This plays a key role in the cluster expansion process, as it allows the existing cluster to discover the new nodes as well as configure the IP settings of the to-be-added nodes.

During the demo we could see this in action when the system reported 4 discovered nodes available on the network.

Metadata

Expansion of metadata is another critical piece for any distributed system.  As shown in Binny’s slides, metadata is stored in a “ring-like” manner.  During the cluster expansion process, block awareness will insert the new nodes throughout the ring to minimize the impact of a block or chassis failure.  Any new keys will be distributed across all nodes (including the new ones) while existing metadata is re-balanced in the background.
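Here’s a toy sketch of the block-aware idea (not our actual implementation): with the new nodes interleaved around the ring, a key’s replicas can be picked by walking clockwise and skipping nodes in blocks that are already used, so a single block failure can’t take out a majority of a key’s replicas:

```python
# Toy metadata ring: (node, block) pairs ordered clockwise.  After
# expansion, the 4 new nodes (blocks b3/b4) are interleaved with the old.
RING = [
    ("n1", "b1"), ("n5", "b3"), ("n2", "b1"), ("n6", "b3"),
    ("n3", "b2"), ("n7", "b4"), ("n4", "b2"), ("n8", "b4"),
]

def replica_nodes(key_pos, ring, rf=3):
    """Walk clockwise from the key's ring position, taking the next rf
    nodes whose blocks are all distinct (block awareness)."""
    picked, blocks, i = [], set(), key_pos
    while len(picked) < rf and i < key_pos + len(ring):
        node, block = ring[i % len(ring)]
        if block not in blocks:
            picked.append(node)
            blocks.add(block)
        i += 1
    return picked

print(replica_nodes(0, RING))  # -> ['n1', 'n5', 'n3'] (three different blocks)
```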

Data

Balancing of data is another critical part of the expansion process.  First off, IO locality is a very important aspect of Nutanix ILM and key for linear performance at scale.  As new nodes are added, the VMs running on the prior hosts will continue to read/write locally; however, the replication of data from those write IOs will be distributed across all nodes in the cluster (in our case, 8).  This is very important, as write IO performance is immediately increased.  A parallel is to think about spindle count in a RAID group: as the number of spindles goes up, so does the IO performance, and the same is true when scaling Nutanix nodes from a replication-traffic standpoint.  Over time the MapReduce framework (Curator) will automatically re-balance the previous replicas in order to ensure disk balancing (capacity utilization) as well as spread the load for the highest performance from an update standpoint.  For example, the previous replicas existed on 4 nodes and any updates would occur on those specifically; however, once the replicas are balanced across all 8 nodes, all of the nodes can be leveraged for updates.  That covers the data side; now on to compute…
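To illustrate the fan-out point with a trivial simulation (purely illustrative; real replica placement also weighs capacity, health, and so on): with RF=2, the second copy of each write from a given node can land on any other node, so doubling the node count roughly doubles the set of peers absorbing replication traffic:

```python
import random
from collections import Counter

def replica_peer(local, nodes):
    """Pick a peer for the second copy of a write (RF=2).
    Illustrative only: a stand-in for real placement logic."""
    return random.choice([n for n in nodes if n != local])

for size in (4, 8):
    nodes = [f"node{i}" for i in range(size)]
    spread = Counter(replica_peer("node0", nodes) for _ in range(10_000))
    print(f"{size} nodes: replica writes from node0 spread over {len(spread)} peers")
```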

As we add these new ESXi hosts to our vSphere cluster, tools like DRS can automatically come in, optimize VM placement, and balance the load from a compute perspective amongst all hosts.  Upon being vMotioned to a new node (for example, one of the 4 nodes we added), the VM’s write IOs will all occur on that node, and Curator will dynamically move all of the previous data (composing its vDisk(s)) from the prior Controller VM (CVM) to the local CVM, as it can see the IOs are originating from a different ESXi host.  Meaning, in time, all of that VM’s data will sit locally for both read and write IO.
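A minimal sketch of that locality decision (illustrative only; this isn’t Curator’s actual logic): if the bulk of a vDisk’s recent IOs originate from a host that holds none of its replicas, that vDisk becomes a candidate for a background migration to that host’s CVM:

```python
from collections import Counter

def localization_target(recent_io_hosts, replica_hosts, threshold=0.8):
    """Illustrative: return the host a vDisk's data should migrate to,
    if most recent IOs come from a host with no local replica
    (e.g. after a vMotion); otherwise return None."""
    host, n = Counter(recent_io_hosts).most_common(1)[0]
    if host not in replica_hosts and n / len(recent_io_hosts) >= threshold:
        return host
    return None

# VM vMotioned from node1 to node5; its data still sits on node1/node2
print(localization_target(["node5"] * 9 + ["node1"], {"node1", "node2"}))  # node5
```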

Misc. Scripts, Prep, etc.

Here are some scripts and commands I used during the session:

Running Diagnostics Tests

PowerShell Commands

Enjoy!

Legal Mumbo Jumbo

Copyright © Steven Poitras, The Nutanix Bible and StevenPoitras.com, 2014. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Steven Poitras and StevenPoitras.com with appropriate and specific direction to the original content.