---
layout: "docs"
page_title: "Autopilot"
sidebar_current: "docs-guides-autopilot"
description: |-
  This guide covers how to configure and use Autopilot features.
---

# Autopilot

Autopilot features allow for automatic, operator-friendly management of Consul
servers. They include cleanup of dead servers, monitoring the state of the Raft
cluster, and stable server introduction.

To enable Autopilot features (with the exception of dead server cleanup),
the [`raft_protocol`](/docs/agent/options.html#_raft_protocol) setting in
the Agent configuration must be set to 3 or higher on all servers. In Consul
0.8 this setting defaults to 2; in Consul 1.0 and later it defaults to 3. For more
information, see the [Version Upgrade section](/docs/upgrade-specific.html#raft_protocol)
on Raft Protocol versions.

In this guide we will learn more about Autopilot's features:

* Dead server cleanup
* Server stabilization
* Redundancy zone tags
* Upgrade migration

Finally, we will review how to ensure Autopilot is healthy.

Note that the examples in this guide come from a Consul 1.4 cluster, where
Autopilot is enabled by default.

## Default Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/options.html#autopilot) when initially
bootstrapping the cluster. Since Autopilot and its features are already
enabled, we only need to update the configuration to disable them. The
following are the defaults.

```
{
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "10s",
    "redundancy_zone_tag": "",
    "disable_upgrade_migration": false,
    "upgrade_version_tag": ""
}
```

All Consul servers should have Autopilot and its features either enabled
or disabled to ensure consistency across servers in case of a failure. Additionally,
Autopilot must be enabled to use any of the features, but the features themselves
can be configured independently: you can enable or disable any of them
separately, at any time.

After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`](/docs/commands/operator/autopilot.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#autopilot-configuration)
HTTP endpoint.

```
$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

In the example above, we used the `operator autopilot get-config` subcommand to check
the Autopilot configuration. You can see we still have all the defaults.
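For scripted checks, the `Key = value` lines printed by `get-config` can be turned into a dictionary. The following is a minimal sketch and not part of Consul: `parse_get_config` is a hypothetical helper that assumes the output format shown above, and it leaves durations such as `200ms` as plain strings.

```python
def parse_get_config(output: str) -> dict:
    """Parse `consul operator autopilot get-config` output into a dict.

    Hypothetical helper: assumes one `Key = value` pair per line, as in the
    example output above. Lines without " = " (such as the prompt) are skipped.
    """
    config = {}
    for line in output.splitlines():
        key, sep, value = line.partition(" = ")
        if not sep:
            continue
        key, value = key.strip(), value.strip()
        if value in ("true", "false"):
            config[key] = value == "true"
        elif len(value) >= 2 and value[0] == value[-1] == '"':
            config[key] = value[1:-1]   # quoted string, possibly empty
        elif value.isdigit():
            config[key] = int(value)
        else:
            config[key] = value         # durations like "200ms" stay strings
    return config


SAMPLE = """\
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
"""

config = parse_get_config(SAMPLE)
print(config["CleanupDeadServers"], config["MaxTrailingLogs"])
```

A script could compare the resulting dictionary against the expected settings before and after a change.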

## Dead Server Cleanup

If Autopilot is disabled, it takes 72 hours for dead servers to be automatically reaped,
unless an operator scripts a `consul force-leave`. If another server failure occurs
in the meantime, it could jeopardize the quorum, even if the failed Consul server had been
automatically replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Consul server comes online. When servers are removed
by the cleanup process they will enter the "left" state.

With Autopilot's dead server cleanup enabled, dead servers will periodically be
cleaned up and removed from the Raft peer set to prevent them from interfering with
the quorum size and leader elections. The cleanup process will also be automatically
triggered whenever a new server is successfully added to the cluster.

To update the dead server cleanup feature, use `consul operator autopilot set-config`
with the `-cleanup-dead-servers` flag.

```sh
$ consul operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

We have disabled dead server cleanup, but still have all the other Autopilot defaults.
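To see why dead servers left in the Raft peer set interfere with the quorum size, it helps to work through the arithmetic. The sketch below uses the standard Raft majority rule (more than half of the peer set); the function names `quorum_size` and `failure_tolerance` are illustrative only and do not correspond to anything in Consul.

```python
def quorum_size(peers: int) -> int:
    """Votes needed for a Raft majority in a peer set of this size."""
    return peers // 2 + 1


def failure_tolerance(peers: int, alive: int) -> int:
    """How many more live servers can fail before quorum is lost.

    A negative result means quorum is already lost.
    """
    return alive - quorum_size(peers)


# Five servers, two of which have died but are still in the peer set.
before = failure_tolerance(peers=5, alive=3)

# The same cluster after the two dead servers are removed from the peer set.
after = failure_tolerance(peers=3, alive=3)

print(f"before cleanup: quorum={quorum_size(5)}, can lose {before} more")
print(f"after cleanup:  quorum={quorum_size(3)}, can lose {after} more")
```

With the dead servers still counted, the three survivors are exactly a quorum and no further failure can be absorbed; once the peer set shrinks to three, one more failure is tolerable.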

## Server Stabilization

When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.

```sh
$ consul operator autopilot set-config -server-stabilization-time=5s
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

Now we have disabled dead server cleanup and set the server stabilization time to 5 seconds.
When a new server is added to our cluster, it will only need to be healthy and stable for
5 seconds.
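The waiting period can be pictured as a simple time comparison. This is a simplified model rather than Consul's actual promotion logic: `ready_for_promotion` is a hypothetical function, and treating the boundary as inclusive (`>=`) is an assumption.

```python
from datetime import datetime, timedelta, timezone


def ready_for_promotion(healthy: bool, stable_since: datetime,
                        now: datetime, stabilization: timedelta) -> bool:
    """Return True once a server has been continuously healthy long enough."""
    return healthy and (now - stable_since) >= stabilization


stable_since = datetime(2017, 3, 28, 18, 29, 10, tzinfo=timezone.utc)
window = timedelta(seconds=5)  # mirrors ServerStabilizationTime = 5s

# Three seconds after becoming stable: still waiting.
early = ready_for_promotion(True, stable_since,
                            stable_since + timedelta(seconds=3), window)
# Five seconds after becoming stable: eligible in this model.
on_time = ready_for_promotion(True, stable_since,
                              stable_since + timedelta(seconds=5), window)
print(early, on_time)
```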

## Redundancy Zones

Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.

If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
is used when starting a server, that server's redundancy zone will be `east1a`.

```
$ consul operator autopilot set-config -redundancy-zone-tag=uswest1
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

Our Autopilot configuration now has dead server cleanup disabled, the server
stabilization time set to 5 seconds, and the redundancy zone tag set to `uswest1`.
Because the tag names a `-node-meta` key, Consul reads each server's zone from the
value stored under its `uswest1` key.

Consul uses these zone values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby to be promoted if the active voter leaves or dies.
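The partitioning described above can be sketched as grouping servers by their zone value and keeping one voter per group. This is an illustrative model only: `pick_voters` and the `name`, `meta`, and `healthy` fields are hypothetical, and choosing the first healthy server in each zone is an assumption rather than Consul's actual selection rule.

```python
from collections import defaultdict


def pick_voters(servers, zone_key):
    """Group healthy servers by zone and choose one voter per zone.

    Servers that lack the zone key are ignored here, which is a
    simplification made for this sketch.
    """
    zones = defaultdict(list)
    for server in servers:
        zone = server["meta"].get(zone_key)
        if zone is not None and server["healthy"]:
            zones[zone].append(server["name"])

    voters = {zone: names[0] for zone, names in zones.items()}
    standby = {zone: names[1:] for zone, names in zones.items()}
    return voters, standby


servers = [
    {"name": "node1", "meta": {"zone": "east1a"}, "healthy": True},
    {"name": "node2", "meta": {"zone": "east1a"}, "healthy": True},
    {"name": "node3", "meta": {"zone": "east1b"}, "healthy": True},
    {"name": "node4", "meta": {"zone": "east1c"}, "healthy": True},
    {"name": "node5", "meta": {"zone": "east1c"}, "healthy": True},
]

voters, standby = pick_voters(servers, "zone")
print(voters)
print(standby)
```

Five servers across three zones yield three voters, with `node2` and `node5` held back as non-voting standbys for their zones.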

## Upgrade Migrations

Autopilot in Consul *Enterprise* supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to true.

```sh
$ consul operator autopilot set-config -disable-upgrade-migration=true
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = true
UpgradeVersionTag = ""
```

With upgrade migration enabled, when a new server is added and Autopilot detects that
its Consul version is newer than that of the existing servers, Autopilot will avoid
promoting the new server until enough newer-versioned servers have been added to the
cluster. When the count of new servers equals or exceeds that of the old servers,
Autopilot will begin promoting the new servers to voters and demoting the old servers.
After this is finished, the old servers can be safely removed from the cluster.

To check the Consul version of the servers, you can either use the
[autopilot health](/api/operator.html#autopilot-health) endpoint or the `consul members`
command.

```
$ consul members
Node   Address         Status  Type    Build  Protocol  DC   Segment
node1  127.0.0.1:8301  alive   server  1.4.0  2         dc1  <all>
node2  127.0.0.1:8703  alive   server  1.4.0  2         dc1  <all>
node3  127.0.0.1:8803  alive   server  1.4.0  2         dc1  <all>
node4  127.0.0.1:8203  alive   server  1.3.0  2         dc1  <all>
```
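The counting rule for upgrade migration (promotion begins once new servers are at least as numerous as old ones) can be modeled from a list of build versions like the `Build` column above. This is a simplified sketch: `promotion_can_begin` is a hypothetical function, and treating "new" as "running the highest version present" is an assumption.

```python
def parse_version(text):
    """Turn "1.4.0" into (1, 4, 0) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def promotion_can_begin(builds):
    """Simplified model of the migration rule.

    Servers on the highest version count as "new"; all others count as
    "old". Promotion begins once new servers are at least as numerous.
    """
    versions = [parse_version(b) for b in builds]
    newest = max(versions)
    new_count = sum(1 for v in versions if v == newest)
    old_count = len(versions) - new_count
    return new_count >= old_count


# Start with three 1.3.0 servers and add 1.4.0 servers one at a time.
cluster = ["1.3.0", "1.3.0", "1.3.0"]
results = []
for _ in range(3):
    cluster.append("1.4.0")
    results.append(promotion_can_begin(cluster))
    print(cluster.count("1.4.0"), "new server(s):", results[-1])
```

In this model the first two new servers wait as non-voters, and promotion starts when the third one joins and the counts are equal.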

### Migrations Without a Consul Version Change

The `UpgradeVersionTag` can be used to override the version information used during
a migration, so that the migration logic can be used for updating the cluster when
changing configuration.

If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
is used when starting a server, that server's version will be `0.0.2` when considered in
a migration. The upgrade logic will follow semantic versioning and the version string
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.

```sh
$ consul operator autopilot set-config -upgrade-version-tag=1.4.0
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = true
UpgradeVersionTag = "1.4.0"
```
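The `X`, `X.Y`, or `X.Y.Z` requirement and the node-meta lookup can be illustrated with a short sketch. Both functions are hypothetical, and two behaviors are assumptions made for illustration: missing components are padded with zeros, and a server without the tagged meta key falls back to its Consul build version.

```python
import re

_VERSION_RE = re.compile(r"^\d+(\.\d+){0,2}$")


def parse_version(text):
    """Parse "X", "X.Y", or "X.Y.Z" into a 3-tuple, padding with zeros."""
    if not _VERSION_RE.match(text):
        raise ValueError(f"not in X, X.Y, or X.Y.Z form: {text!r}")
    parts = [int(p) for p in text.split(".")]
    return tuple(parts + [0] * (3 - len(parts)))


def effective_version(build, node_meta, upgrade_version_tag=""):
    """Version used for migration decisions in this simplified model.

    If a tag is configured and the server carries that meta key, the meta
    value wins; otherwise the Consul build version is used (assumption).
    """
    if upgrade_version_tag and upgrade_version_tag in node_meta:
        return parse_version(node_meta[upgrade_version_tag])
    return parse_version(build)


# Mirrors the prose example: tag "build" with -node-meta build:0.0.2.
print(effective_version("1.4.0", {"build": "0.0.2"}, "build"))
# No tag configured: the Consul build version is used.
print(effective_version("1.4.0", {"build": "0.0.2"}))
```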

## Server Health Checking

An internal health check runs on the leader to track the stability of servers.
<br>A server is considered healthy if all of the following conditions are true.

- It has a SerfHealth status of 'Alive'.
- The time since its last contact with the current leader is below
`LastContactThreshold`.
- Its latest Raft term matches the leader's term.
- The number of Raft log entries it trails the leader by does not exceed
`MaxTrailingLogs`.
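The four conditions above translate directly into a boolean check. The sketch below is a model of the listed rules, not Consul's implementation: `ServerStats` and `is_healthy` are hypothetical names, and the default thresholds simply mirror the default configuration shown earlier in this guide.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    serf_status: str        # e.g. "alive"
    last_contact_ms: float  # time since last contact with the leader
    last_term: int          # latest Raft term seen by this server
    last_index: int         # latest Raft log index on this server


def is_healthy(server, leader_term, leader_index,
               last_contact_threshold_ms=200.0, max_trailing_logs=250):
    """Apply the four health conditions listed above."""
    trailing = leader_index - server.last_index
    return (
        server.serf_status.lower() == "alive"
        and server.last_contact_ms < last_contact_threshold_ms  # "below"
        and server.last_term == leader_term
        and trailing <= max_trailing_logs                       # "does not exceed"
    )


follower = ServerStats("alive", 35.371007, last_term=2, last_index=10)

# Caught up with the leader: all four conditions hold.
caught_up = is_healthy(follower, leader_term=2, leader_index=10)
# Leader is 390 entries ahead, which exceeds the 250-entry default.
lagging = is_healthy(follower, leader_term=2, leader_index=400)
print(caught_up, lagging)
```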

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`](/api/operator.html#autopilot-health) HTTP endpoint, with a top-level
`Healthy` field indicating the overall status of the cluster:

```
$ curl localhost:8500/v1/operator/autopilot/health
{
    "Healthy": true,
    "FailureTolerance": 0,
    "Servers": [
        {
            "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
            "Name": "node1",
            "Address": "127.0.0.1:8300",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": true,
            "LastContact": "0s",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": true,
            "StableSince": "2017-03-28T18:28:52Z"
        },
        {
            "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
            "Name": "node2",
            "Address": "127.0.0.1:8705",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": false,
            "LastContact": "35.371007ms",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": false,
            "StableSince": "2017-03-28T18:29:10Z"
        }
    ]
}
```
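A monitoring script might reduce a response like the one above to the handful of fields an operator cares about. The following sketch only uses field names that appear in the example response; `summarize_health` is a hypothetical helper, and the sample body is a trimmed copy of that response rather than live output.

```python
import json


def summarize_health(body: str) -> dict:
    """Reduce an autopilot health response to a few key facts."""
    health = json.loads(body)
    servers = health["Servers"]
    return {
        "healthy": health["Healthy"],
        "failure_tolerance": health["FailureTolerance"],
        "voters": [s["Name"] for s in servers if s["Voter"]],
        "non_voters": [s["Name"] for s in servers if not s["Voter"]],
        "unhealthy": [s["Name"] for s in servers if not s["Healthy"]],
    }


# Trimmed copy of the example response above.
SAMPLE = """
{
    "Healthy": true,
    "FailureTolerance": 0,
    "Servers": [
        {"Name": "node1", "Healthy": true, "Voter": true},
        {"Name": "node2", "Healthy": true, "Voter": false}
    ]
}
"""

summary = summarize_health(SAMPLE)
print(summary)
```

In the example cluster, `node2` is healthy but not yet a voter, and a `FailureTolerance` of 0 indicates that no server can be lost without affecting the cluster.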

## Summary

In this guide we configured most of the Autopilot features: dead server cleanup, server
stabilization, redundancy zone tags, upgrade migration, and upgrade version tag.

To learn more about the Autopilot settings we did not configure,
[last_contact_threshold](https://www.consul.io/docs/agent/options.html#last_contact_threshold)
and [max_trailing_logs](https://www.consul.io/docs/agent/options.html#max_trailing_logs),
either read the agent configuration documentation or run
`consul operator autopilot set-config -h` to see the command's help.