github.com/hspak/nomad@v0.7.2-0.20180309000617-bc4ae22a39a5/website/source/guides/cluster/autopilot.html.md (about)

---
layout: "guides"
page_title: "Autopilot"
sidebar_current: "guides-cluster-autopilot"
description: |-
  This guide covers how to configure and use Autopilot features.
---

# Autopilot

Autopilot is a set of new features added in Nomad 0.8 to allow for automatic
operator-friendly management of Nomad servers. It includes cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server introduction.

To enable Autopilot features (with the exception of dead server cleanup),
the [`raft_protocol`](/docs/agent/configuration/server.html#raft_protocol) setting in
the Agent configuration must be set to 3 or higher on all servers. In Nomad
0.8 this setting defaults to 2; in Nomad 0.9 it will default to 3. For more
information, see the [Version Upgrade section](/docs/upgrade/upgrade-specific.html#raft-protocol-version-compatibility)
on Raft Protocol versions.

## Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/configuration/autopilot.html) when initially
bootstrapping the cluster:

```hcl
autopilot {
    cleanup_dead_servers = true
    last_contact_threshold = "200ms"
    max_trailing_logs = 250
    server_stabilization_time = "10s"
    enable_redundancy_zones = false
    disable_upgrade_migration = false
    enable_custom_upgrades = false
}
```

After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`](/docs/commands/operator.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#read-autopilot-configuration)
HTTP endpoint:

```
$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false

$ nomad operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!

$ nomad operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false
```
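
The same settings can be read over the HTTP endpoint mentioned above. As a
sketch (assuming a local agent listening on Nomad's default HTTP port 4646),
the endpoint returns the configuration as a JSON object with the same field
names shown in the CLI output:

```
$ curl -s localhost:4646/v1/operator/autopilot/configuration
```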

## Dead Server Cleanup

Dead servers will periodically be cleaned up and removed from the Raft peer
set to prevent them from interfering with the quorum size and leader elections.
This cleanup will also happen whenever a new server is successfully added to the
cluster.

Prior to Autopilot, it would take 72 hours for dead servers to be automatically reaped,
or operators had to script a `nomad force-leave`. If another server failure occurred,
it could jeopardize the quorum, even if the failed Nomad server had been automatically
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Nomad server comes online. When servers are removed
by the cleanup process they will enter the "left" state.

This option can be disabled by running `nomad operator autopilot set-config`
with the `-cleanup-dead-servers=false` option.

## Server Health Checking

An internal health check runs on the leader to track the stability of servers.
A server is considered healthy if all of the following conditions are true:

- Its status according to Serf is 'Alive'
- The time since its last contact with the current leader is below
  `LastContactThreshold`
- Its latest Raft term matches the leader's term
- The number of Raft log entries it trails the leader by does not exceed
  `MaxTrailingLogs`

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`](/api/operator.html#read-health) HTTP endpoint, with a top-level
`Healthy` field indicating the overall status of the cluster:

```
$ curl localhost:4646/v1/operator/autopilot/health
{
    "Healthy": true,
    "FailureTolerance": 0,
    "Servers": [
        {
            "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
            "Name": "node1",
            "Address": "127.0.0.1:4647",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": true,
            "LastContact": "0s",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": true,
            "StableSince": "2017-03-28T18:28:52Z"
        },
        {
            "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
            "Name": "node2",
            "Address": "127.0.0.1:4747",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": false,
            "LastContact": "35.371007ms",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": false,
            "StableSince": "2017-03-28T18:29:10Z"
        }
    ]
}
```
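
When scripting against this endpoint, the top-level `Healthy` field is often
all that matters. A minimal offline sketch, using a hard-coded response in
place of a live `curl` call and plain `grep` instead of a JSON parser:

```shell
# Stand-in for: curl -s localhost:4646/v1/operator/autopilot/health
# (the JSON shape follows the example above; values are illustrative).
response='{"Healthy":true,"FailureTolerance":0}'

# Pull out the top-level Healthy field; a production script would
# use a real JSON parser such as jq instead of grep.
printf '%s\n' "$response" | grep -o '"Healthy":[a-z]*'
```

This prints `"Healthy":true` for the sample response, which a monitoring
script could test against before alerting.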

## Stable Server Introduction

When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.
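
For example, to require 30 seconds of stability before promotion, the
threshold can be raised with `set-config` (the flag name mirrors the
`ServerStabilizationTime` field shown above):

```
$ nomad operator autopilot set-config -server-stabilization-time=30s
Configuration updated!
```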

---

~> The following Autopilot features are available only in
   [Nomad Enterprise](https://www.hashicorp.com/products/nomad/) version 0.8.0 and later.

## Server Read Scaling

With the [`non_voting_server`](/docs/agent/configuration/server.html#non_voting_server) option, a
server can be explicitly marked as a non-voter and will never be promoted to a voting
member. This can be useful when more read scaling is needed; being a non-voter means
that the server will still have data replicated to it, but it will not be part of the
quorum that the leader must wait for before committing log entries.
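
In agent configuration this looks like the following sketch; the surrounding
`server` block (including `enabled = true`) is assumed to match your existing
setup:

```hcl
/* config.hcl */
server {
    enabled           = true
    non_voting_server = true
}
```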

## Redundancy Zones

Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.

If the `EnableRedundancyZones` setting is enabled, Nomad will look for a
zone in each server's specified
[`redundancy_zone`](/docs/agent/configuration/server.html#redundancy_zone) field.

Here's an example showing how to configure this:

```hcl
/* config.hcl */
server {
    redundancy_zone = "west-1"
}
```

```
$ nomad operator autopilot set-config -enable-redundancy-zones=true
Configuration updated!
```

Nomad will then use these values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby to be promoted if the active voter leaves or dies.

## Upgrade Migrations

Autopilot in Nomad Enterprise supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to true.
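
Disabling it is a one-liner with `set-config` (the flag name mirrors the
field above):

```
$ nomad operator autopilot set-config -disable-upgrade-migration=true
Configuration updated!
```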

When a new server is added and Autopilot detects that its Nomad version is newer than
that of the existing servers, Autopilot will avoid promoting the new server until enough
newer-versioned servers have been added to the cluster. When the count of new servers
equals or exceeds that of the old servers, Autopilot will begin promoting the new servers
to voters and demoting the old servers. After this is finished, the old servers can be
safely removed from the cluster.

To check the Nomad version of the servers, either the
[autopilot health](/api/operator.html#read-health) endpoint or the `nomad server-members`
command can be used:

```
$ nomad server-members
Name   Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
node1  127.0.0.1  4648  alive   true    3         0.7.1  dc1         global
node2  127.0.0.1  4748  alive   false   3         0.7.1  dc1         global
node3  127.0.0.1  4848  alive   false   3         0.7.1  dc1         global
node4  127.0.0.1  4948  alive   false   3         0.8.0  dc1         global
```

### Migrations Without a Nomad Version Change

The `EnableCustomUpgrades` field can be used to override the version information used during
a migration, so that the migration logic can be used for updating the cluster when
changing configuration.

If the `EnableCustomUpgrades` setting is set to `true`, Nomad will look for a
version in each server's specified [`upgrade_version`](/docs/agent/configuration/server.html#upgrade_version)
tag. The upgrade logic will follow semantic versioning and the `upgrade_version`
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
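
A sketch of how this fits together, mirroring the redundancy zone example
above (the version value is illustrative):

```hcl
/* config.hcl */
server {
    upgrade_version = "1.1.0"
}
```

```
$ nomad operator autopilot set-config -enable-custom-upgrades=true
Configuration updated!
```

Servers reporting a higher `upgrade_version` will then be treated as the
newer generation by the migration logic described above.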