---
layout: "guides"
page_title: "Autopilot"
sidebar_current: "guides-cluster-autopilot"
description: |-
  This guide covers how to configure and use Autopilot features.
---

# Autopilot

Autopilot is a set of new features added in Nomad 0.8 to allow for automatic,
operator-friendly management of Nomad servers. It includes cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server introduction.

To enable Autopilot features (with the exception of dead server cleanup),
the `raft_protocol` setting in the [server stanza](/docs/agent/configuration/server.html)
must be set to 3 on all servers. In Nomad 0.8 this setting defaults to 2; in Nomad 0.9 it will default to 3.
For more information, see the [Version Upgrade section](/docs/upgrade/upgrade-specific.html#raft-protocol-version-compatibility)
on Raft Protocol versions.

## Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/configuration/autopilot.html) when initially
bootstrapping the cluster:

```hcl
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 250
  server_stabilization_time = "10s"
  enable_redundancy_zones   = false
  disable_upgrade_migration = false
  enable_custom_upgrades    = false
}
```

After bootstrapping, the configuration can be viewed or modified either via the
[`operator autopilot`](/docs/commands/operator.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#read-autopilot-configuration)
HTTP endpoint:

```
$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false

$ nomad operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!

$ nomad operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false
```
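The HTTP endpoint accepts the same settings as a JSON document. Below is a minimal
sketch of reading and replacing the configuration with `curl`, assuming an agent
listening on the default HTTP port 4646; the field names match the `get-config`
output above:

```
# Read the current Autopilot configuration
$ curl -s http://127.0.0.1:4646/v1/operator/autopilot/configuration

# Replace it; the payload overwrites the stored configuration, so include every field
$ curl -s -X PUT http://127.0.0.1:4646/v1/operator/autopilot/configuration -d '{
  "CleanupDeadServers": false,
  "LastContactThreshold": "200ms",
  "MaxTrailingLogs": 250,
  "ServerStabilizationTime": "10s",
  "EnableRedundancyZones": false,
  "DisableUpgradeMigration": false,
  "EnableCustomUpgrades": false
}'
```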
## Dead Server Cleanup

Dead servers will periodically be cleaned up and removed from the Raft peer
set to prevent them from interfering with the quorum size and leader elections.
This cleanup will also happen whenever a new server is successfully added to the
cluster.

Prior to Autopilot, it would take 72 hours for dead servers to be automatically reaped,
or operators had to script a `nomad force-leave`. If another server failure occurred,
it could jeopardize the quorum, even if the failed Nomad server had been automatically
replaced. Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Nomad server comes online. When servers are removed
by the cleanup process they will enter the "left" state.

This option can be disabled by running `nomad operator autopilot set-config`
with the `-cleanup-dead-servers=false` option.

## Server Health Checking

An internal health check runs on the leader to track the stability of servers.
A server is considered healthy if all of the following conditions are true:

- Its status according to Serf is 'Alive'
- The time since its last contact with the current leader is below
  `LastContactThreshold`
- Its latest Raft term matches the leader's term
- The number of Raft log entries it trails the leader by does not exceed
  `MaxTrailingLogs`

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`](/api/operator.html#read-health) HTTP endpoint, with
a top-level `Healthy` field indicating the overall status of the cluster:

```
$ curl localhost:4646/v1/operator/autopilot/health
{
  "Healthy": true,
  "FailureTolerance": 0,
  "Servers": [
    {
      "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
      "Name": "node1",
      "Address": "127.0.0.1:4647",
      "SerfStatus": "alive",
      "Version": "0.8.0",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 2,
      "LastIndex": 10,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2017-03-28T18:28:52Z"
    },
    {
      "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
      "Name": "node2",
      "Address": "127.0.0.1:4747",
      "SerfStatus": "alive",
      "Version": "0.8.0",
      "Leader": false,
      "LastContact": "35.371007ms",
      "LastTerm": 2,
      "LastIndex": 10,
      "Healthy": true,
      "Voter": false,
      "StableSince": "2017-03-28T18:29:10Z"
    }
  ]
}
```

## Stable Server Introduction

When a new server is added to the cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the `ServerStabilizationTime`
setting.

---

~> The following Autopilot features are available only in
   [Nomad Enterprise](https://www.hashicorp.com/products/nomad/) version 0.8.0 and later.

## Server Read and Scheduling Scaling

With the [`non_voting_server`](/docs/agent/configuration/server.html#non_voting_server) option, a
server can be explicitly marked as a non-voter and will never be promoted to a voting
member. This can be useful when more read scaling is needed; being a non-voter means
that the server will still have data replicated to it, but it will not be part of the
quorum that the leader must wait for before committing log entries. Non-voting servers can also
act as scheduling workers to increase scheduling throughput in large clusters.
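For example, a server intended only for read scaling and scheduling work might be
configured as follows; this is a minimal sketch, with all other server options omitted:

```hcl
/* config.hcl */
server {
  enabled           = true
  # Never promote this server to a voting member of the Raft quorum
  non_voting_server = true
}
```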
## Redundancy Zones

Prior to Autopilot, it was difficult to deploy servers in a way that took advantage of
isolated failure domains such as AWS Availability Zones; users would be forced to either
have an overly-large quorum (2-3 nodes per AZ) or give up redundancy within an AZ by
deploying just one server in each.

If the `EnableRedundancyZones` setting is enabled, Nomad will use each server's
[`redundancy_zone`](/docs/agent/configuration/server.html#redundancy_zone) field to
determine which zone it belongs to.

Here's an example showing how to configure this:

```hcl
/* config.hcl */
server {
  redundancy_zone = "west-1"
}
```

```
$ nomad operator autopilot set-config -enable-redundancy-zones=true
Configuration updated!
```

Nomad will then use these values to partition the servers by redundancy zone, and will
aim to keep one voting server per zone. Extra servers in each zone will stay as non-voters
on standby, to be promoted if the active voter leaves or dies.

## Upgrade Migrations

Autopilot in Nomad Enterprise supports upgrade migrations by default. To disable this
functionality, set `DisableUpgradeMigration` to true.

When a new server is added and Autopilot detects that its Nomad version is newer than
that of the existing servers, Autopilot will avoid promoting the new server until enough
newer-versioned servers have been added to the cluster. When the count of new servers
equals or exceeds that of the old servers, Autopilot will begin promoting the new servers
to voters and demoting the old servers. After this is finished, the old servers can be
safely removed from the cluster.

To check the Nomad version of the servers, either the [autopilot health](/api/operator.html#read-health)
endpoint or the `nomad server members` command can be used:

```
$ nomad server members
Name   Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
node1  127.0.0.1  4648  alive   true    3         0.7.1  dc1         global
node2  127.0.0.1  4748  alive   false   3         0.7.1  dc1         global
node3  127.0.0.1  4848  alive   false   3         0.7.1  dc1         global
node4  127.0.0.1  4948  alive   false   3         0.8.0  dc1         global
```

### Migrations Without a Nomad Version Change

The `EnableCustomUpgrades` field can be used to override the version information used during
a migration, so that the migration logic can also be used to roll out configuration changes
that are not tied to a new Nomad release.

If the `EnableCustomUpgrades` setting is set to `true`, Nomad will use each server's
[`upgrade_version`](/docs/agent/configuration/server.html#upgrade_version) tag in place of
its Nomad version when making migration decisions. The upgrade logic will follow semantic
versioning and the `upgrade_version` must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
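As a sketch of how this might look, each server can be tagged with an arbitrary version
and custom upgrades enabled, mirroring the redundancy-zone example above. The `1.1.0`
value below is purely illustrative, and this assumes `set-config` exposes an
`-enable-custom-upgrades` flag analogous to the flags for the other settings:

```hcl
/* config.hcl */
server {
  # Arbitrary semantic version used only by Autopilot's migration logic
  upgrade_version = "1.1.0"
}
```

```
$ nomad operator autopilot set-config -enable-custom-upgrades=true
Configuration updated!
```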