Advanced Nutanix: SQL Server on Nutanix Best Practices Released!

advNutanix_640

Recently I’m been in the trenches with Microsoft SQL Server on the Nutanix platform and am pleased to finally announce the public release of the Microsoft SQL Server on Nutanix Best Practices!

In this post I’ll talk about some of the key best practices to optimize MSSQL performance on Nutanix and cover some of the performance results.  The following post below is a small excerpt from the document so be sure to check it out!

Here’s the link to the SQL Server on Nutanix Best Practices document: LINK

Solution Overview

Below we show a high-level overview of the Microsoft SQL Server on Nutanix solution.  A key thing to highlight here is the ability to handle both MSSQL workloads as well as other workloads including VDI, App/Data servers and other Microsoft services like Sharepoint and Exchange.

SQL-overview

Best Practices Checklist

In reality its simple, to get optimal performance of MSSQL on Nutanix you don’t need to change anything on Nutanix.  Just keep a minimum of 4-6 disks on each SQL server and all of the ILM, tiering, caching, etc. is handled by NDFS.  Like one of my earlier posts on DFS, just keep it simple :)

Now, there are a lot of configurable parameters for MSSQL and items that need to be tuned within the application.

Here’s a high-level checklist of the SQL Server tunings:

General

  • Perform a current state analysis to identify workloads and sizing
  • Spend time up front to architect a solution that meets both current and future needs
  • Design to deliver consistent performance, reliability, and scale
  • Don’t undersize, don’t oversize, right size
  • Start with a PoC, test, optimize, iterate, scale

Core Components

MSSQL

Performance and Scalability

  • Utilize multiple drives for TempDB Log/Data and Database Log/Data
    • Start with a minimum of 2 for small environments or 4 for larger environments
    • Look for PAGEIOLATCH_XX contention and scale number of drives as necessary
  • Utilize a 64KB NTFS allocation unit size for MSSQL drives
  • Enabled locked pages in memory for MSSQL Server service account (NOTE: if this setting is used the VM’s memory must be locked, only applies with memory > 8GB)
  • TempDB Data Files
    • Set TempDB size between 1 and 10% of instance database sizes
    • If number of cores < 8
      • # of cores = # of data files
    • If number of cores > 8Use 8 data files to being with
      • Look for contention for in-memory allocation (PAGELATCH_XX) and scale 4 files at a time until contention is eliminated
  • Database Data files
    • Size appropriately and enable AUTOGROW respective to database growth
    • Do not AUTOSHRINK data and log files
    • At a maximum keep below 80% of disk capacity utilization
    • Use multiple data file and drives
      • Look for contention  for in-memory allocation (PAGELATCH_XX), if contention increase number of files
      • Look for I/O subsystem contention (PAGEIOLATCH_XX), if contention, spread the data files across multiple drives
  • Trace flags
    • Implement trace flag 1118 at startup to remove single page allocations
    • Implement trace flag 834 to enable large pages (for tier-1 performance)
  • Utilize the MSSQL Server Best Practices Analyzer (BPA) to identify potential issues
  • Utilize fast file initialization
  • Scale number of MSSQL VMs vs. number of MSSQL instances per VM
  • More memory = higher performance, if seeing memory pressures, increase VM memory
  • Utilize a dedicated disk for Microsoft Page File

Availability

  • In most cases vSphere HA will provide an adequate level of availability and uptime for non-mission critical/tier-1 applications
  • For mission critical/tier-1 applications:
    • MSSQL 2012: utilize AlwaysOn availability groups (preferred)
    • MSSQL 2008 and prior: utilize log shipping or clustered MSSQL using MSCS clusters
  • Take consistent database snapshots/backups, frequency should be derived from required RPOs
  • Leverage native or 3rd party tools to manage backups (eg. Microsoft System Center Data Protection Manager (DPM), etc.)

Manageability

  • Standardize, monitor and maintain
  • Leverage a MSSQL application monitoring solution (eg. System Center, etc.)
  • Create standardized MSSQL VM Templates
  • Utilize consistent disk quantities and layout schemes for MSSQL VMs
  • Join the MSSQL Server to the domain and use Active Directory for authentication
  • Leverage Contained Database Authentication (MSSQL 2012)
  • Use named instances for MSSQL database instances, even when only planning a single instance per VM
  • For named instances,  ensure application compatibility with dynamic ports, otherwise set instance to use a fixed port

VMware vSphere

  • Follow VMware performance best practices
  • Avoid CPU core oversubscription (for tier-1 workloads)
  • For small MSSQL VMs keep vCPUs <= to the number of cores per each physical NUMA node
  • For wide MSSQL VMs size vCPUs to align with physical NUMA boundaries and leverage vNUMA
  • Keep vCPU numbers easily divisible by NUMA node sizes for easy scheduling
  • Leave Hyperthreading sharing at the default policy (Any)
  • Enable ‘High Performance’ host power policy
  • Lock MSSQL VM memory (for tier-1 workloads)
  • Size MSSQL VM memory using the following calculation:
    • VM Memory = SQL Server Max Memory + ThreadStack + OS Memory + VM Overhead
    • Threadstack = SQL Max Worker Threads * 2MB (for x64)
  • Use paravirtual SCSI Controllers and VMXNET3 Nics
  • Use resource pools with correct share allocation
  • Use DRS anti-affinity rules to keep MSSQL VMs apart

Nutanix

  • Use a single container
  • Utilize appropriate model based upon compute and storage requirements
    • Ideally keep working set in SSD and capacity within node
    • Choose a model which can ideally fit the full database on a single node.  NOTE: for larger databases which cannot fit on a node, ensure there is ample bandwidth between nodes
    • Utilize higher memory node models for I/O heavy MSSQL workloads
  • Create a dedicated consistency group with the MSSQL VMs and applications
  • Leverage ‘Application Consistent Snapshots’ on the consistency group to invoke VSS when snapshotting

Supporting Components

Network

  • Utilize and optimize QoS for NDFS and database traffic
  • Use low-latency 10GbE switches
  • Utilize redundant 10GbE uplinks from each Nutanix node
  • Ensure adequate throughput between Nutanix nodes and MSSQL VMs
  • Check for any pause frames which could impact replication and VM communication

Active Directory

  • Utilize AD based authentication for MSSQL servers

OS and Application Updates

  • Schedule updates to be applied outside business hours to avoid performance impacts
  • Stagger updates in phases

Performance Testing

For performance testing we utilized a mix of SQLIO, SQLIOSim and HammerDB to simulate workloads and test NDFS I/O performance for MSSQL.  To learn more on how the testing was performed or to automate, check out my earlier post on Automating SQLIO Benchmarking with Powershell or check out the best practices doc.  The script is pretty awesome, I used it to automate over 20,000 SQLIO test runs, made my life easy :)

For the SQLIO testing a single VM was leveraged to find a node’s performance.  These numbers give a nodes performance which can then be extrapolated by the number of nodes (NOTE: minimum cluster size is 3).

The figure below shows the SQLIO IOPS and throughput performance based upon the block size.  Results showed the single SQLIO VM was able to facilitate ~35,000 random IOPS with a 8 KB block size and ~16,000 IOPS with a 64 KB block size.  Sequential I/O peaked with a 512 KB block size at ~1,200 MBps (1.2 GBps) and was ~950 MBps (.95 GBps) with a 64 KB block size.

sqlio_agg_iops2

The figure below shows the SQLIO operation latency for read and write workloads based upon the block size.  Results showed read latency kept consistently under 1ms and in microseconds for 8 KB and 64 KB block sizes and ~1 ms for 512 KB.  Write latency was consistent at 1 ms for 8 and 64 KB block sizes and ~5 ms for 512 KB.

sqlio_lat_blocksize

 

The Controller VM (CVM) hosting the VM running the SQLIO workload (CVM-B – highlighted with * below), peaked at ~91% CPU utilization during the testing.  All other CVM CPU utilizations stayed consistently ~15-20% which is expected during idle operation.  Aggregate cluster CPU utilization peaked at 22%.

This highlights the ability to incrementally scale out the number of SQL servers and maintain performance as the number of databases/instances scale, as well as the ability to eliminate any noisy-neighbor scenarios.

sqlio_cvm_cpu

 

To learn more check out the best practices document which has a lot more details and test information! :)

 

Legal Mumbo Jumbo

Copyright © Steven Poitras, The Nutanix Bible and StevenPoitras.com, 2014. Unauthorized use and/or duplication of this material without express and written permission from this blog’s author and/or owner is strictly prohibited. Excerpts and links may be used, provided that full and clear credit is given to Steven Poitras and StevenPoitras.com with appropriate and specific direction to the original content.