Advanced Nutanix: “Basics”


I have a few posts ready on hacking ILM and advanced configuration of the Nutanix platform; in this post, however, I want to focus on the “basics.”

Cluster Sizing

Sizing is a key piece of any reference architecture, RFP, or reliable design.  Here are the guidelines I normally follow:

Nutanix
  • Max cluster size: Unlimited, however 48 nodes is ideal
  • Number of clusters:
    • If # of nodes is < 16 then start with a single cluster
    • If # of nodes is > 16 then start with two clusters and scale those up to 48+ as necessary  (NOTE: The reasoning for this is to have isolated fault domains)
  • Tip: Sizing should always be done with reliability and growth built in.
vSphere
  • Max cluster size: 32, however 8 or 16 nodes is ideal
  • Number of clusters:
    • If # of nodes is < 8 then start with a single cluster
    • If # of nodes is > 16 then start with two clusters and scale those up to 16 or 32 as necessary  (NOTE: The reasoning for this is to have isolated fault domains and will be driven by the workload and use-case)
  • Tip: Multiple vSphere clusters can span a single Nutanix cluster (for example, a single 32 node Nutanix cluster could be accessed by four 8 host vSphere clusters)
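
The guidelines above can be sketched in a few lines of Python. This is only an illustration of the rules of thumb, not a sizing tool: the function names are hypothetical, and treating exactly 16 nodes as a single Nutanix cluster is my assumption, since the guideline only covers "< 16" and "> 16".

```python
import math


def nutanix_starting_clusters(total_nodes: int) -> int:
    """Starting number of Nutanix clusters for a given node count."""
    if total_nodes < 1:
        raise ValueError("total_nodes must be at least 1")
    # Assumption: exactly 16 nodes is treated as a single cluster,
    # because the guideline only states "< 16" and "> 16".
    return 1 if total_nodes <= 16 else 2


def vsphere_clusters_spanning(nutanix_nodes: int, hosts_per_cluster: int = 8) -> int:
    """Number of vSphere clusters needed to cover one Nutanix cluster."""
    # 32 is the vSphere max cluster size listed above.
    if not 1 <= hosts_per_cluster <= 32:
        raise ValueError("hosts_per_cluster must be between 1 and 32")
    return math.ceil(nutanix_nodes / hosts_per_cluster)


print(nutanix_starting_clusters(12))     # small deployment: one cluster
print(nutanix_starting_clusters(40))     # larger deployment: start with two
print(vsphere_clusters_spanning(32, 8))  # the 32 node example from the tip
```

The last call reproduces the tip: a 32 node Nutanix cluster split into 8 host vSphere clusters needs four of them.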

Gflags

Gflags are advanced configuration parameters that let us modify configuration values on the Nutanix platform.  DISCLAIMER: Gflags should only be modified by a Nutanix SE or a Nutanix Certified Professional. They should not be modified on production systems, nor should it be necessary to do so!

Here are the usual Gflags I work with:

curator_tier_usage_ilm_threshold_percent
  • Explanation: This Gflag specifies the SSD tier utilization percentage which, when breached, forces an ILM down-migration of cold data.
  • Default: 75
curator_tier_free_up_percent_by_ilm
  • Explanation: This Gflag specifies the amount of cold data that will be moved during an ILM down-migration
  • Default: 15
stargate_extent_cache_max_MB
  • Explanation: This Gflag specifies the size of the in-memory extent cache
  • Default: -1 (dynamic)
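
To make the two Curator Gflags more concrete, here is a rough sketch of the arithmetic using the defaults above. It is a simplified model, not how Curator is implemented: the function name is hypothetical, and I am assuming that the free-up percentage is measured against SSD tier capacity and that the threshold counts as breached only when utilization is strictly above it.

```python
def ilm_down_migration_gb(
    ssd_used_gb: float,
    ssd_capacity_gb: float,
    threshold_percent: float = 75,  # curator_tier_usage_ilm_threshold_percent
    free_up_percent: float = 15,    # curator_tier_free_up_percent_by_ilm
) -> float:
    """Return the GB of cold data to move down, or 0.0 if no migration is due."""
    if ssd_capacity_gb <= 0:
        raise ValueError("ssd_capacity_gb must be positive")
    utilization = 100.0 * ssd_used_gb / ssd_capacity_gb
    # Assumption: "breached" means strictly above the threshold.
    if utilization <= threshold_percent:
        return 0.0
    # Assumption: the free-up value is a percentage of SSD tier capacity.
    return ssd_capacity_gb * free_up_percent / 100.0


print(ilm_down_migration_gb(700, 1000))  # 70% used: below the threshold
print(ilm_down_migration_gb(800, 1000))  # 80% used: threshold breached
```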

Important Pages

These are advanced Nutanix pages besides the standard user interface that allow you to monitor detailed stats and metrics.  The URLs are formatted in the following way: http://<Nutanix CVM IP/DNS>:<Port/path (mentioned below)>  Example: http://MyCVM-A:2009

2009 Page
  • This is a Stargate page used to monitor the back end storage system and should only be used by advanced users.  I’ll have a post that explains the 2009 pages and things to look for.
2009/latency Page
  • This is a Stargate page used to monitor the back end latency
2009/traces Page
  • This is the Stargate page used to monitor activity traces for operations
2010 Page
  • This is the Curator page which is used for monitoring curator runs
2011 Page
  • This is the Chronos page which monitors jobs and tasks scheduled by curator
2020 Page
  • This is the Cerebro page which monitors the protection domains, replication status and DR
7777 Page
  • This is the Aegis Portal page which can be used to get good logs and statistics, useful commands, and modify Gflags
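
As a convenience, the URL format above can be captured in a small lookup. This is just a sketch: the short page keys and the `page_url` helper are names I made up for illustration, while the ports and paths are the ones listed above.

```python
# Port/path for each page listed above; the dictionary keys are hypothetical labels.
PAGES = {
    "stargate": "2009",
    "stargate-latency": "2009/latency",
    "stargate-traces": "2009/traces",
    "curator": "2010",
    "chronos": "2011",
    "cerebro": "2020",
    "aegis": "7777",
}


def page_url(cvm: str, page: str) -> str:
    """Build http://<Nutanix CVM IP/DNS>:<Port/path> for a known page."""
    if page not in PAGES:
        raise KeyError(f"unknown page: {page}")
    return f"http://{cvm}:{PAGES[page]}"


print(page_url("MyCVM-A", "stargate"))
print(page_url("MyCVM-A", "stargate-latency"))
```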

Useful Commands

ncli cluster version
  • Description: Displays the current version of the Nutanix software
  • Example: ncli cluster version
ncli sp ls
  • Description: Displays the existing storage pools
  • Example: ncli sp ls
ncli ctr ls
  • Description: Displays the existing containers
  • Example: ncli ctr ls
ncli ctr create name=<NAME> sp-name=<SP NAME>
  • Description: Creates a new container
  • Example: ncli ctr create name=NFS-VM sp-name=sp1
ncli vm ls
  • Description: Displays the existing VMs
  • Example: ncli vm ls
ncli pd create name=<NAME>
  • Description: Creates a protection domain
  • Example: ncli pd create name="pd_prod1"
ncli remote-site create name=<NAME> address-list=<Remote CVM IP(s)>
  • Description: Create a remote site for replication
  • Example: ncli remote-site create name=remote_site_1 address-list="10.2.100.55"
ncli pd protect name=<PD NAME> ctr-id=<Container ID> cg-name=<NAME>
  • Description: Protect all VMs in the specified container
  • Example: ncli pd protect name="pd_prod1" ctr-id=194 cg-name="cg_1"
ncli pd protect name=<PD NAME> vm-names=<VM Name(s)> cg-name=<NAME>
  • Description: Protect the VMs specified
  • Example: ncli pd protect name="pd_prod1" vm-names=VM1,VM2 cg-name="cg_1"
ncli pd protect name=<PD NAME> files=<File Name(s)> cg-name=<NAME>
  • Description: Protect the NFS Files specified
  • Example: ncli pd protect name="pd_prod1" files="myfile.bin" cg-name="cg_1"
ncli pd add-one-time-snapshot name=<PD NAME> retention-time=<seconds>
  • Description: Create a one-time snapshot of the protection domain
  • Example: ncli pd add-one-time-snapshot name="pd_prod1" retention-time=3600
ncli pd set-schedule name=<PD NAME> interval=<seconds> retention-policy=<POLICY> remote-sites=<REMOTE SITE NAME>
  • Description: Create a recurring snapshot schedule and replication to n remote sites
  • Example: ncli pd set-schedule name="pd_prod1" interval="3600" retention-policy="1:5" remote-sites="remote_site_1"
ncli pd list-replication-status
  • Description: Monitor replication status
  • Example: ncli pd list-replication-status
ncli pd migrate name=<PD NAME> remote-site=<REMOTE SITE NAME>
  • Description: Fail-over a protection domain to a remote site
  • Example: ncli pd migrate name="pd_prod1" remote-site="remote_site_2"
ncli pd activate name=<PD NAME>
  • Description: Activate a protection domain at a remote site
  • Example: ncli pd activate name="pd_prod1"
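
Several of the commands above naturally chain together when setting up replication. The sketch below only builds and prints the command lines in order; it does not run anything. The `ncli` helper function is hypothetical (it simply turns keyword arguments into key=value pairs, swapping underscores for hyphens), and the names and values are the example ones used above.

```python
import shlex


def ncli(entity: str, action: str, **params: object) -> list[str]:
    """Build an ncli argument list; underscores in keys become hyphens."""
    args = ["ncli", entity, action]
    for key, value in params.items():
        args.append(f"{key.replace('_', '-')}={value}")
    return args


# Example replication setup using the commands listed above.
workflow = [
    ncli("pd", "create", name="pd_prod1"),
    ncli("remote-site", "create", name="remote_site_1", address_list="10.2.100.55"),
    ncli("pd", "protect", name="pd_prod1", vm_names="VM1,VM2", cg_name="cg_1"),
    ncli("pd", "set-schedule", name="pd_prod1", interval=3600,
         retention_policy="1:5", remote_sites="remote_site_1"),
    ncli("pd", "list-replication-status"),
]

for command in workflow:
    # shlex.join quotes any argument that needs it for a POSIX shell.
    print(shlex.join(command))
```

Keeping each command as a list of arguments avoids quoting problems if a name ever contains spaces, and the list could be handed to something like subprocess.run on a CVM if you wanted to execute it.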

Enjoy!

  • Sam

    Hi Steven

    Love your blog with lots of technical information around Nutanix! Could you give me some more details regarding why 48 node clusters are recommended? What’s the drawback if you want to build larger clusters?

    • stevenpoitras

      Great to hear! It all comes down to isolation of “fault domains” where 48 nodes seems to be a good fit because it fits within a single rack. However, there’s no limitation there and we have people using larger cluster sizes (hundreds). We have customers with larger cluster sizes than that, so it all comes down to preference. You could definitely have a 256 node cluster :)

      • Sam

        Thanks for the fast reply! I was also wondering what the minimum RPO is you can set for the replication? Do you have customers looking for synchronous replication as well? Is there something planned?

        • stevenpoitras

          Currently, through the UI, the min RPO for remote replication is 1 hour; however, this will be decreasing over time towards pure sync. Can’t say much, but we’ll change the way “metro-clusters” are handled :)

Legal Mumbo Jumbo

Copyright © Steven Poitras, The Nutanix Bible and StevenPoitras.com, 2014. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Steven Poitras and StevenPoitras.com with appropriate and specific direction to the original content.