---
layout: "guides"
page_title: "Autopilot"
sidebar_current: "guides-cluster-autopilot"
description: |-
  This guide covers how to configure and use Autopilot features.
---

# Autopilot

Autopilot is a set of new features added in Nomad 0.8 to allow for automatic
operator-friendly management of Nomad servers. It includes cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server introduction.

To enable Autopilot features (with the exception of dead server cleanup),
the `raft_protocol` setting in the [server stanza](/docs/agent/configuration/server.html)
must be set to 3 on all servers. In Nomad 0.8 this setting defaults to 2; in Nomad 0.9 it will default to 3.
For more information, see the [Version Upgrade section](/docs/upgrade/upgrade-specific.html#raft-protocol-version-compatibility)
on Raft Protocol versions.
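
For example, a minimal sketch of a server configuration with Raft protocol
version 3 enabled (the surrounding settings are illustrative):

```hcl
server {
    enabled          = true
    bootstrap_expect = 3

    # Required for all Autopilot features except dead server cleanup
    raft_protocol = 3
}
```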

## Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/configuration/autopilot.html) when initially
bootstrapping the cluster:

```hcl
autopilot {
    cleanup_dead_servers = true
    last_contact_threshold = "200ms"
    max_trailing_logs = 250
    server_stabilization_time = "10s"
    enable_redundancy_zones = false
    disable_upgrade_migration = false
    enable_custom_upgrades = false
}
```

After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`](/docs/commands/operator.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#read-autopilot-configuration)
HTTP endpoint:

```
$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false

$ nomad operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!

$ nomad operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false
```
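
The same configuration is exposed over HTTP. As a rough sketch using `curl`
(assuming an agent listening on the default address `localhost:4646`; the field
values shown are illustrative):

```
$ curl localhost:4646/v1/operator/autopilot/configuration
{
    "CleanupDeadServers": true,
    "LastContactThreshold": "200ms",
    "MaxTrailingLogs": 250,
    "ServerStabilizationTime": "10s",
    "EnableRedundancyZones": false,
    "DisableUpgradeMigration": false,
    "EnableCustomUpgrades": false,
    "CreateIndex": 4,
    "ModifyIndex": 4
}
```

Updates are made by sending a full configuration document back to the same path
with a `PUT` request; see the API documentation linked above for the exact payload.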

## Dead Server Cleanup

Dead servers will periodically be cleaned up and removed from the Raft peer
set, to prevent them from interfering with the quorum size and leader elections.
This cleanup will also happen whenever a new server is successfully added to the
cluster.

Prior to Autopilot, it would take 72 hours for dead servers to be automatically reaped,
or operators had to script a `nomad server force-leave`. If another server failure occurred,
it could jeopardize the quorum, even if the failed Nomad server had been automatically
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Nomad server comes online. When servers are removed
by the cleanup process they will enter the "left" state.
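
For reference, the manual removal that Autopilot now automates looks like this
(assuming a failed server named `node3`):

```
$ nomad server force-leave node3
```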

This option can be disabled by running `nomad operator autopilot set-config`
with the `-cleanup-dead-servers=false` option.

## Server Health Checking

An internal health check runs on the leader to track the stability of servers.
A server is considered healthy if all of the following conditions are true:

- Its status according to Serf is 'Alive'
- The time since its last contact with the current leader is below
  `LastContactThreshold`
- Its latest Raft term matches the leader's term
- The number of Raft log entries it trails the leader by does not exceed
  `MaxTrailingLogs`

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`](/api/operator.html#read-health) HTTP endpoint, with
a top level `Healthy` field indicating the overall status of the cluster:

```
$ curl localhost:4646/v1/operator/autopilot/health
{
    "Healthy": true,
    "FailureTolerance": 0,
    "Servers": [
        {
            "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
            "Name": "node1",
            "Address": "127.0.0.1:4647",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": true,
            "LastContact": "0s",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": true,
            "StableSince": "2017-03-28T18:28:52Z"
        },
        {
            "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
            "Name": "node2",
            "Address": "127.0.0.1:4747",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": false,
            "LastContact": "35.371007ms",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": false,
            "StableSince": "2017-03-28T18:29:10Z"
        }
    ]
}
```
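
In scripts, the top level field can be extracted directly; a small sketch using
`jq` (assuming it is installed):

```
$ curl -s localhost:4646/v1/operator/autopilot/health | jq .Healthy
true
```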

## Stable Server Introduction

When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.
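
For example, to require a longer stabilization window, the
`-server-stabilization-time` flag mirrors the setting above (the `30s` value is
illustrative):

```
$ nomad operator autopilot set-config -server-stabilization-time=30s
Configuration updated!
```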

---

~> The following Autopilot features are available only in
   [Nomad Enterprise](https://www.hashicorp.com/products/nomad/) version 0.8.0 and later.

## Server Read and Scheduling Scaling

With the [`non_voting_server`](/docs/agent/configuration/server.html#non_voting_server) option, a
server can be explicitly marked as a non-voter and will never be promoted to a voting
member. This can be useful when more read scaling is needed; being a non-voter means
that the server will still have data replicated to it, but it will not be part of the
quorum that the leader must wait for before committing log entries. Non-voting servers can also
act as scheduling workers to increase scheduling throughput in large clusters.
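
A sketch of how such a server might be configured (the other settings are
illustrative):

```hcl
server {
    enabled           = true
    non_voting_server = true
}
```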

## Redundancy Zones

Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.

If the `EnableRedundancyZones` setting is enabled, Nomad will use the value of each
server's [`redundancy_zone`](/docs/agent/configuration/server.html#redundancy_zone)
field to determine which zone that server belongs to.

Here's an example showing how to configure this:

```hcl
/* config.hcl */
server {
    redundancy_zone = "west-1"
}
```

```
$ nomad operator autopilot set-config -enable-redundancy-zones=true
Configuration updated!
```

Nomad will then use these values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby to be promoted if the active voter leaves or dies.

## Upgrade Migrations

Autopilot in Nomad Enterprise supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to `true`.
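
For example, via the CLI:

```
$ nomad operator autopilot set-config -disable-upgrade-migration=true
Configuration updated!
```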

When a new server is added and Autopilot detects that its Nomad version is newer than
that of the existing servers, Autopilot will avoid promoting the new server until enough
newer-versioned servers have been added to the cluster. When the count of new servers
equals or exceeds that of the old servers, Autopilot will begin promoting the new servers
to voters and demoting the old servers. After this is finished, the old servers can be
safely removed from the cluster.

To check the Nomad version of the servers, either the [autopilot health](/api/operator.html#read-health)
endpoint or the `nomad server members` command can be used:

```
$ nomad server members
Name   Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
node1  127.0.0.1  4648  alive   true    3         0.7.1  dc1         global
node2  127.0.0.1  4748  alive   false   3         0.7.1  dc1         global
node3  127.0.0.1  4848  alive   false   3         0.7.1  dc1         global
node4  127.0.0.1  4948  alive   false   3         0.8.0  dc1         global
```

### Migrations Without a Nomad Version Change

The `EnableCustomUpgrades` field can be used to override the version information used during
a migration, so that the migration logic can be used for updating the cluster when
changing configuration.

If the `EnableCustomUpgrades` setting is set to `true`, Nomad will use the value of each
server's [`upgrade_version`](/docs/agent/configuration/server.html#upgrade_version)
field in place of its Nomad version when applying the upgrade migration logic. The
upgrade logic will follow semantic versioning and the `upgrade_version`
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
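
A sketch of how this might be used (the version value is illustrative):

```hcl
/* config.hcl */
server {
    upgrade_version = "1.1.0"
}
```

```
$ nomad operator autopilot set-config -enable-custom-upgrades=true
Configuration updated!
```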