---
layout: "guides"
page_title: "Autopilot"
sidebar_current: "guides-cluster-autopilot"
description: |-
  This guide covers how to configure and use Autopilot features.
---

# Autopilot

Autopilot is a set of new features added in Nomad 0.8 to allow for automatic
operator-friendly management of Nomad servers. It includes cleanup of dead
servers, monitoring the state of the Raft cluster, and stable server
introduction.

To enable Autopilot features (with the exception of dead server cleanup), the
[`raft_protocol`](/docs/agent/configuration/server.html#raft_protocol) setting
in the agent configuration must be set to 3 or higher on all servers. In Nomad
0.8 this setting defaults to 2; in Nomad 0.9 it will default to 3. For more
information, see the
[Version Upgrade section](/docs/upgrade/upgrade-specific.html#raft-protocol-version-compatibility)
on Raft Protocol versions.
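As a sketch, setting the Raft protocol version in each server's agent
configuration might look like the following (the surrounding stanza is
hypothetical and depends on your existing configuration; only `raft_protocol`
is relevant here):

```hcl
# Hypothetical server stanza; raft_protocol = 3 enables Autopilot features.
server {
  enabled       = true
  raft_protocol = 3
}
```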
## Configuration

The configuration of Autopilot is loaded by the leader from the agent's
[Autopilot settings](/docs/agent/configuration/autopilot.html) when initially
bootstrapping the cluster:

```hcl
autopilot {
  cleanup_dead_servers      = true
  last_contact_threshold    = "200ms"
  max_trailing_logs         = 250
  server_stabilization_time = "10s"
  enable_redundancy_zones   = false
  disable_upgrade_migration = false
  enable_custom_upgrades    = false
}
```

After bootstrapping, the configuration can be viewed or modified either via
the [`operator autopilot`](/docs/commands/operator.html) subcommand or the
[`/v1/operator/autopilot/configuration`](/api/operator.html#read-autopilot-configuration)
HTTP endpoint:

```
$ nomad operator autopilot get-config
CleanupDeadServers = true
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false

$ nomad operator autopilot set-config -cleanup-dead-servers=false
Configuration updated!

$ nomad operator autopilot get-config
CleanupDeadServers = false
LastContactThreshold = 200ms
MaxTrailingLogs = 250
ServerStabilizationTime = 10s
EnableRedundancyZones = false
DisableUpgradeMigration = false
EnableCustomUpgrades = false
```

## Dead Server Cleanup

Dead servers will periodically be cleaned up and removed from the Raft peer
set to prevent them from interfering with the quorum size and leader
elections. This cleanup will also happen whenever a new server is successfully
added to the cluster.

Prior to Autopilot, it would take 72 hours for dead servers to be
automatically reaped, or operators had to script a `nomad force-leave`. If
another server failure occurred, it could jeopardize the quorum, even if the
failed Nomad server had been automatically replaced.
Autopilot helps prevent these kinds of outages by quickly removing failed
servers as soon as a replacement Nomad server comes online. When servers are
removed by the cleanup process, they will enter the "left" state.

This option can be disabled by running `nomad operator autopilot set-config`
with the `-cleanup-dead-servers=false` option.

## Server Health Checking

An internal health check runs on the leader to track the stability of
servers. A server is considered healthy if all of the following conditions
are true:

- Its status according to Serf is 'Alive'
- The time since its last contact with the current leader is below
  `LastContactThreshold`
- Its latest Raft term matches the leader's term
- The number of Raft log entries it trails the leader by does not exceed
  `MaxTrailingLogs`

The status of these health checks can be viewed through the
[`/v1/operator/autopilot/health`](/api/operator.html#read-health) HTTP
endpoint, with a top-level `Healthy` field indicating the overall status of
the cluster:

```
$ curl localhost:4646/v1/operator/autopilot/health
{
  "Healthy": true,
  "FailureTolerance": 0,
  "Servers": [
    {
      "ID": "e349749b-3303-3ddf-959c-b5885a0e1f6e",
      "Name": "node1",
      "Address": "127.0.0.1:4647",
      "SerfStatus": "alive",
      "Version": "0.8.0",
      "Leader": true,
      "LastContact": "0s",
      "LastTerm": 2,
      "LastIndex": 10,
      "Healthy": true,
      "Voter": true,
      "StableSince": "2017-03-28T18:28:52Z"
    },
    {
      "ID": "e35bde83-4e9c-434f-a6ef-453f44ee21ea",
      "Name": "node2",
      "Address": "127.0.0.1:4747",
      "SerfStatus": "alive",
      "Version": "0.8.0",
      "Leader": false,
      "LastContact": "35.371007ms",
      "LastTerm": 2,
      "LastIndex": 10,
      "Healthy": true,
      "Voter": false,
      "StableSince": "2017-03-28T18:29:10Z"
    }
  ]
}
```

## Stable Server Introduction

When a new server is added to the
cluster, there is a waiting period where it
must be healthy and stable for a certain amount of time before being promoted
to a full, voting member. This can be configured via the
`ServerStabilizationTime` setting.

---

~> The following Autopilot features are available only in
[Nomad Enterprise](https://www.hashicorp.com/products/nomad/) version 0.8.0
and later.

## Server Read Scaling

With the
[`non_voting_server`](/docs/agent/configuration/server.html#non_voting_server)
option, a server can be explicitly marked as a non-voter and will never be
promoted to a voting member. This can be useful when more read scaling is
needed; being a non-voter means that the server will still have data
replicated to it, but it will not be part of the quorum that the leader must
wait for before committing log entries.

## Redundancy Zones

Prior to Autopilot, it was difficult to deploy servers in a way that took
advantage of isolated failure domains such as AWS Availability Zones; users
would be forced to either have an overly-large quorum (2-3 nodes per AZ) or
give up redundancy within an AZ by deploying just one server in each.

If the `EnableRedundancyZones` setting is set, Nomad will use its value to
look for a zone in each server's specified
[`redundancy_zone`](/docs/agent/configuration/server.html#redundancy_zone)
field.

Here's an example showing how to configure this:

```hcl
/* config.hcl */
server {
  redundancy_zone = "west-1"
}
```

```
$ nomad operator autopilot set-config -enable-redundancy-zones=true
Configuration updated!
```

Nomad will then use these values to partition the servers by redundancy zone,
and will aim to keep one voting server per zone. Extra servers in each zone
will stay as non-voters on standby to be promoted if the active voter leaves
or dies.
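To extend the single-zone example above across failure domains, each server
simply advertises the zone it runs in. A hypothetical second-zone fragment
(the zone name is a placeholder) might look like:

```hcl
/* config.hcl on a server in a different availability zone; Autopilot will
   keep one voter in this zone and hold any extra servers here as
   non-voting standbys. */
server {
  redundancy_zone = "west-2"
}
```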
## Upgrade Migrations

Autopilot in Nomad Enterprise supports upgrade migrations by default. To
disable this functionality, set `DisableUpgradeMigration` to true.

When a new server is added and Autopilot detects that its Nomad version is
newer than that of the existing servers, Autopilot will avoid promoting the
new server until enough newer-versioned servers have been added to the
cluster. When the count of new servers equals or exceeds that of the old
servers, Autopilot will begin promoting the new servers to voters and
demoting the old servers. After this is finished, the old servers can be
safely removed from the cluster.

To check the Nomad version of the servers, either the
[autopilot health](/api/operator.html#read-health) endpoint or the
`nomad server-members` command can be used:

```
$ nomad server-members
Name   Address    Port  Status  Leader  Protocol  Build  Datacenter  Region
node1  127.0.0.1  4648  alive   true    3         0.7.1  dc1         global
node2  127.0.0.1  4748  alive   false   3         0.7.1  dc1         global
node3  127.0.0.1  4848  alive   false   3         0.7.1  dc1         global
node4  127.0.0.1  4948  alive   false   3         0.8.0  dc1         global
```

### Migrations Without a Nomad Version Change

The `EnableCustomUpgrades` field can be used to override the version
information used during a migration, so that the migration logic can be used
for updating the cluster when changing configuration.

If the `EnableCustomUpgrades` setting is set to `true`, Nomad will use its
value to look for a version in each server's specified
[`upgrade_version`](/docs/agent/configuration/server.html#upgrade_version)
tag. The upgrade logic will follow semantic versioning and the
`upgrade_version` must be in the form of either `X`, `X.Y`, or `X.Y.Z`.
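For instance, a configuration-only migration could be driven by giving the
replacement servers a higher custom version tag than the existing ones. The
version values below are purely illustrative:

```hcl
/* Hypothetical: existing servers carry upgrade_version = "1.0.0"; new
   servers joining with a higher tag are treated as the "newer" generation
   by the migration logic, even though the Nomad binary is unchanged. */
server {
  upgrade_version = "1.1.0"
}
```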