---
layout: "docs"
page_title: "Autopilot"
sidebar_current: "docs-guides-autopilot"
description: |-
  This guide covers how to configure and use Autopilot features.
---

# Autopilot

Autopilot features allow for automatic,
operator-friendly management of Consul servers. They include cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server introduction.

To enable Autopilot features (with the exception of dead server cleanup),
the [`raft_protocol`](/docs/agent/options.html#_raft_protocol) setting in
the Agent configuration must be set to 3 or higher on all servers. In Consul
0.8 this setting defaults to 2; in Consul 1.0 and later it defaults to 3. For more
information, see the [Version Upgrade section](/docs/upgrade-specific.html#raft_protocol)
on Raft Protocol versions.

In this guide we will learn more about Autopilot's features.

* Dead server cleanup
* Server stabilization
* Redundancy zone tags
* Upgrade migration

Finally, we will review how to ensure Autopilot is healthy.

Note that the examples in this guide come from a Consul 1.4 cluster, where
Autopilot is enabled by default.

## Default Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/options.html#autopilot) when initially
bootstrapping the cluster. Since Autopilot and its features are already
enabled, we only need to update the configuration to disable them. The
following are the defaults.

```
{
    "cleanup_dead_servers": true,
    "last_contact_threshold": "200ms",
    "max_trailing_logs": 250,
    "server_stabilization_time": "10s",
    "redundancy_zone_tag": "",
    "disable_upgrade_migration": false,
    "upgrade_version_tag": ""
}
```

All Consul servers should have Autopilot and its features either enabled
or disabled to ensure consistency across servers in case of a failure. Additionally,
Autopilot must be enabled to use any of the features, but the features themselves
can be configured independently. This means you can enable or disable any of the
features separately, at any time.

After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`](/docs/commands/operator/autopilot.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#autopilot-configuration)
HTTP endpoint.

```
$ consul operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

In the example above, we used the `operator autopilot get-config` subcommand to check
the Autopilot configuration. You can see we still have all the defaults.

## Dead Server Cleanup

If Autopilot is disabled, it takes 72 hours for dead servers to be automatically reaped,
unless an operator scripts a `consul force-leave`. If another server failure occurs in
the meantime, it could jeopardize the quorum, even if the failed Consul server has been
automatically replaced. Autopilot helps prevent these kinds of outages by quickly removing
failed servers as soon as a replacement Consul server comes online. When servers are removed
by the cleanup process they will enter the "left" state.

With Autopilot's dead server cleanup enabled, dead servers will periodically be
cleaned up and removed from the Raft peer set to prevent them from interfering with
the quorum size and leader elections. The cleanup process will also be automatically
triggered whenever a new server is successfully added to the cluster.

To update the dead server cleanup feature, use `consul operator autopilot set-config`
with the `-cleanup-dead-servers` flag.

```sh
$ consul operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

We have disabled dead server cleanup, but still have all the other Autopilot defaults.

## Server Stabilization

When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.

```sh
$ consul operator autopilot set-config -server-stabilization-time=5s
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = ""
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

Now we have disabled dead server cleanup and set the server stabilization time to 5 seconds.
When a new server is added to our cluster, it will only need to be healthy and stable for
5 seconds.
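
The waiting period described above can be pictured with a small Python sketch. This is
only an illustration of the rule, not Consul's implementation: the helper names
`parse_duration` and `eligible_for_promotion` are hypothetical, and the sketch assumes
the leader knows the time since which a server has been continuously healthy.

```python
from datetime import datetime, timedelta, timezone


def parse_duration(text: str) -> timedelta:
    """Parse a simple duration such as "10s" or "200ms" (hypothetical helper)."""
    # Check "ms" before "s" because "200ms" also ends with "s".
    if text.endswith("ms"):
        return timedelta(milliseconds=float(text[:-2]))
    if text.endswith("s"):
        return timedelta(seconds=float(text[:-1]))
    raise ValueError(f"unsupported duration: {text!r}")


def eligible_for_promotion(stable_since: datetime, now: datetime,
                           stabilization_time: str) -> bool:
    """Return True once the server has been stable for the full waiting period."""
    return now - stable_since >= parse_duration(stabilization_time)


# Illustrative timestamps: the server became stable at `joined`.
joined = datetime(2019, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(eligible_for_promotion(joined, joined + timedelta(seconds=3), "5s"))  # False
print(eligible_for_promotion(joined, joined + timedelta(seconds=5), "5s"))  # True
```

With the 5 second setting used in this guide, the server in the sketch becomes
eligible for promotion only after it has been stable for the full 5 seconds.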

## Redundancy Zones

Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.

If the `RedundancyZoneTag` setting is set, Consul will use its value to look for a
zone in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `RedundancyZoneTag` is set to `zone`, and `-node-meta zone:east1a`
is used when starting a server, that server's redundancy zone will be `east1a`.

```
$ consul operator autopilot set-config -redundancy-zone-tag=uswest1
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = false
UpgradeVersionTag = ""
```

For our Autopilot features, we have now disabled dead server cleanup, set the server
stabilization time to 5 seconds, and set the redundancy zone tag to `uswest1`. Note that
the value passed to `-redundancy-zone-tag` is the name of the node metadata key that
Consul inspects, not a zone itself. With this setting, each server's zone is read from
the `uswest1` key of its `-node-meta`.

Consul will then use these values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby to be promoted if the active voter leaves or dies.

## Upgrade Migrations

Autopilot in Consul *Enterprise* supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to true.

```sh
$ consul operator autopilot set-config -disable-upgrade-migration=true
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = true
UpgradeVersionTag = ""
```

With upgrade migration enabled, when a new server is added and Autopilot detects that
its Consul version is newer than that of the existing servers, Autopilot will avoid
promoting the new server until enough newer-versioned servers have been added to the
cluster. When the count of new servers equals or exceeds that of the old servers,
Autopilot will begin promoting the new servers to voters and demoting the old servers.
After this is finished, the old servers can be safely removed from the cluster.

To check the Consul version of the servers, you can either use the
[autopilot health](/api/operator.html#autopilot-health) endpoint or the `consul members`
command.

```
$ consul members
Node   Address         Status  Type    Build  Protocol  DC   Segment
node1  127.0.0.1:8301  alive   server  1.4.0  2         dc1  <all>
node2  127.0.0.1:8703  alive   server  1.4.0  2         dc1  <all>
node3  127.0.0.1:8803  alive   server  1.4.0  2         dc1  <all>
node4  127.0.0.1:8203  alive   server  1.3.0  2         dc1  <all>
```

### Migrations Without a Consul Version Change

The `UpgradeVersionTag` can be used to override the version information used during
a migration, so that the migration logic can be used for updating the cluster when
changing configuration.

If the `UpgradeVersionTag` setting is set, Consul will use its value to look for a
version in each server's specified [`-node-meta`](/docs/agent/options.html#_node_meta)
tag. For example, if `UpgradeVersionTag` is set to `build`, and `-node-meta build:0.0.2`
is used when starting a server, that server's version will be `0.0.2` when considered in
a migration.
The upgrade logic will follow semantic versioning and the version string
must be in the form of either `X`, `X.Y`, or `X.Y.Z`.

```sh
$ consul operator autopilot set-config -upgrade-version-tag=1.4.0
Configuration updated!

$ consul operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 5s
RedundancyZoneTag = "uswest1"
DisableUpgradeMigration = true
UpgradeVersionTag = "1.4.0"
```

As with the redundancy zone tag, the value passed to `-upgrade-version-tag` is the name
of the node metadata key to read, so with this setting each server's version is looked up
under the `1.4.0` key of its `-node-meta`.

## Server Health Checking

An internal health check runs on the leader to track the stability of servers.
A server is considered healthy if all of the following conditions are true:

- It has a SerfHealth status of 'Alive'.
- The time since its last contact with the current leader is below
  `LastContactThreshold`.
- Its latest Raft term matches the leader's term.
- The number of Raft log entries it trails the leader by does not exceed
  `MaxTrailingLogs`.
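
The four conditions above can be combined into a single predicate. The following Python
sketch is only an illustration of that rule under stated assumptions, not the leader's
actual code: the `ServerStats` type, its field names, and the `is_healthy` function are
hypothetical, and the default thresholds mirror the `200ms` and `250` defaults shown
earlier in this guide.

```python
from dataclasses import dataclass


@dataclass
class ServerStats:
    """Hypothetical per-server snapshot; field names are illustrative."""
    serf_status: str        # e.g. "alive"
    last_contact_ms: float  # time since last contact with the leader
    last_term: int          # latest Raft term seen by the server
    last_index: int         # latest Raft log index on the server


def is_healthy(server: ServerStats, leader_term: int, leader_index: int,
               last_contact_threshold_ms: float = 200.0,
               max_trailing_logs: int = 250) -> bool:
    """Return True only if all four health conditions hold."""
    return (
        server.serf_status == "alive"
        and server.last_contact_ms < last_contact_threshold_ms
        and server.last_term == leader_term
        and leader_index - server.last_index <= max_trailing_logs
    )


follower = ServerStats("alive", 35.4, 2, 10)
print(is_healthy(follower, leader_term=2, leader_index=10))   # True
print(is_healthy(follower, leader_term=2, leader_index=500))  # False: trails by 490 entries
```

Raising `MaxTrailingLogs` or `LastContactThreshold` loosens the corresponding check, which
is why these two settings affect when a server is reported as healthy.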

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`](/api/operator.html#autopilot-health) HTTP endpoint,
with a top-level `Healthy` field indicating the overall status of the cluster:

```
$ curl localhost:8500/v1/operator/autopilot/health
{
    "Healthy": true,
    "FailureTolerance": 0,
    "Servers": [
        {
            "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
            "Name": "node1",
            "Address": "127.0.0.1:8300",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": true,
            "LastContact": "0s",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": true,
            "StableSince": "2017-03-28T18:28:52Z"
        },
        {
            "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
            "Name": "node2",
            "Address": "127.0.0.1:8705",
            "SerfStatus": "alive",
            "Version": "0.8.0",
            "Leader": false,
            "LastContact": "35.371007ms",
            "LastTerm": 2,
            "LastIndex": 10,
            "Healthy": true,
            "Voter": false,
            "StableSince": "2017-03-28T18:29:10Z"
        }
    ]
}
```

## Summary

In this guide we configured most of the Autopilot features: dead server cleanup, server
stabilization, redundancy zone tags, upgrade migration, and upgrade version tag.

To learn more about the Autopilot settings we did not configure,
[last_contact_threshold](https://www.consul.io/docs/agent/options.html#last_contact_threshold)
and [max_trailing_logs](https://www.consul.io/docs/agent/options.html#max_trailing_logs),
either read the agent configuration documentation or use the help flag:
`consul operator autopilot set-config -h`.