---
layout: "docs"
page_title: "Check Definition"
sidebar_current: "docs-agent-checks"
description: |-
  One of the primary roles of the agent is management of system- and application-level health checks. A health check is considered to be application-level if it is associated with a service. A check is defined in a configuration file or added at runtime over the HTTP interface.
---

# Checks

One of the primary roles of the agent is management of system-level and application-level health
checks. A health check is considered to be application-level if it is associated with a
service. If not associated with a service, the check monitors the health of the entire node.

A check is defined in a configuration file or added at runtime over the HTTP interface. Checks
created via the HTTP interface persist with that node.

There are several different kinds of checks:

* Script + Interval - These checks depend on invoking an external application
  that performs the health check, exits with an appropriate exit code, and potentially
  generates some output. A script is paired with an invocation interval (e.g.
  every 30 seconds). This is similar to the Nagios plugin system. The output of
  a script check is limited to 4KB. Output larger than this will be truncated.
  By default, Script checks will be configured with a timeout equal to 30 seconds.
  It is possible to configure a custom Script check timeout value by specifying the
  `timeout` field in the check definition. When the timeout is reached on Windows,
  Consul will wait for any child processes spawned by the script to finish. For any
  other system, Consul will attempt to force-kill the script and any child processes
  it has spawned once the timeout has passed.
  In Consul 0.9.0 and later, script checks are not enabled by default.
  To use them, set one of the following options:
  * [`enable_local_script_checks`](/docs/agent/options.html#_enable_local_script_checks):
    enable script checks defined in local config files. Script checks defined via the HTTP
    API will not be allowed.
  * [`enable_script_checks`](/docs/agent/options.html#_enable_script_checks): enable
    script checks regardless of how they are defined.

  ~> **Security Warning:** Enabling script checks in some configurations may
  introduce a remote execution vulnerability which is known to be targeted by
  malware. We strongly recommend `enable_local_script_checks` instead. See [this
  blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations)
  for more details.

* HTTP + Interval - These checks make an HTTP `GET` request every Interval (e.g.
  every 30 seconds) to the specified URL. The status of the service depends on
  the HTTP response code: any `2xx` code is considered passing, a `429 Too Many
  Requests` is a warning, and anything else is a failure. This type of check
  should be preferred over a script that uses `curl` or another external process
  to check a simple HTTP operation. By default, HTTP checks are `GET` requests
  unless the `method` field specifies a different method. Additional header
  fields can be set through the `header` field which is a map of lists of
  strings, e.g. `{"x-foo": ["bar", "baz"]}`. By default, HTTP checks will be
  configured with a request timeout equal to the check interval, with a max of
  10 seconds. It is possible to configure a custom HTTP check timeout value by
  specifying the `timeout` field in the check definition. The output of the
  check is limited to roughly 4KB. Responses larger than this will be truncated.
  HTTP checks also support TLS. By default, a valid TLS certificate is expected.
  Certificate verification can be turned off by setting the `tls_skip_verify`
  field to `true` in the check definition.

* TCP + Interval - These checks make a TCP connection attempt every Interval
  (e.g. every 30 seconds) to the specified IP/hostname and port. If no hostname
  is specified, it defaults to "localhost". The status of the service depends on
  whether the connection attempt is successful (i.e., the port is currently
  accepting connections). If the connection is accepted, the status is
  `passing`; otherwise the status is `critical`. In the case of a hostname that
  resolves to both IPv4 and IPv6 addresses, an attempt will be made to both
  addresses, and the first successful connection attempt will result in a
  successful check. This type of check should be preferred over a script that
  uses `netcat` or another external process to check a simple socket operation.
  By default, TCP checks will be configured with a request timeout equal to the
  check interval, with a max of 10 seconds. It is possible to configure a custom
  TCP check timeout value by specifying the `timeout` field in the check
  definition.

* <a name="TTL"></a>Time to Live (TTL) - These checks retain their last known
  state for a given TTL. The state of the check must be updated periodically
  over the HTTP interface. If an external system fails to update the status
  within a given TTL, the check is set to the failed state. This mechanism,
  conceptually similar to a dead man's switch, relies on the application to
  directly report its health. For example, a healthy app can periodically `PUT` a
  status update to the HTTP endpoint; if the app fails, the TTL will expire and
  the health check enters a critical state. The endpoints used to update health
  information for a given check are:
  [pass](/api/agent/check.html#ttl-check-pass),
  [warn](/api/agent/check.html#ttl-check-warn),
  [fail](/api/agent/check.html#ttl-check-fail), and
  [update](/api/agent/check.html#ttl-check-update). TTL
  checks also persist their last known status to disk.
  This allows the Consul
  agent to restore the last known status of the check across restarts. Persisted
  check status is valid through the end of the TTL from the time of the last
  check.

* Docker + Interval - These checks depend on invoking an external application that
  is packaged within a Docker container. The application is triggered within the running
  container via the Docker Exec API. We expect that the Consul agent user has access
  to either the Docker HTTP API or the unix socket. Consul uses `$DOCKER_HOST` to
  determine the Docker API endpoint. The application is expected to run, perform a health
  check of the service running inside the container, and exit with an appropriate exit code.
  The check should be paired with an invocation interval. The shell in which the check
  is run is configurable, which makes it possible to run containers that
  have different shells on the same host. Check output for Docker is limited to
  4KB. Any output larger than this will be truncated. In Consul 0.9.0 and later, the agent
  must be configured with [`enable_script_checks`](/docs/agent/options.html#_enable_script_checks)
  set to `true` in order to enable Docker health checks.

* gRPC + Interval - These checks are intended for applications that support the standard
  [gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
  The state of the check will be updated at the given interval by probing the configured
  endpoint. By default, gRPC checks are configured with a timeout of 10 seconds.
  It is possible to configure a custom timeout value by specifying the `timeout` field in
  the check definition. gRPC checks will default to not using TLS, but TLS can be enabled by
  setting `grpc_use_tls` in the check definition. If TLS is enabled, then by default, a valid
  TLS certificate is expected.
  Certificate verification can be turned off by setting the
  `tls_skip_verify` field to `true` in the check definition.

* <a name="alias"></a>Alias - These checks alias the health state of another registered
  node or service. The state of the check will be updated asynchronously,
  but is nearly instant. For aliased services on the same agent, the local
  state is monitored and no additional network resources are consumed. For
  other services and nodes, the check maintains a blocking query over the
  agent's connection with a current server and allows stale requests. If there
  are any errors in watching the aliased node or service, the check state will be
  critical. For the blocking query, the check will use the ACL token set
  on the service or check definition or otherwise will fall back to the default ACL
  token set with the agent (`acl_token`).

## Check Definition

A script check:

```javascript
{
  "check": {
    "id": "mem-util",
    "name": "Memory utilization",
    "args": ["/usr/local/bin/check_mem.py", "-limit", "256MB"],
    "interval": "10s",
    "timeout": "1s"
  }
}
```

An HTTP check:

```javascript
{
  "check": {
    "id": "api",
    "name": "HTTP API on port 5000",
    "http": "https://localhost:5000/health",
    "tls_skip_verify": false,
    "method": "POST",
    "header": {"x-foo":["bar", "baz"]},
    "interval": "10s",
    "timeout": "1s"
  }
}
```

A TCP check:

```javascript
{
  "check": {
    "id": "ssh",
    "name": "SSH TCP on port 22",
    "tcp": "localhost:22",
    "interval": "10s",
    "timeout": "1s"
  }
}
```

A TTL check:

```javascript
{
  "check": {
    "id": "web-app",
    "name": "Web App Status",
    "notes": "Web app does a curl internally every 10 seconds",
    "ttl": "30s"
  }
}
```

A Docker check:

```javascript
{
  "check": {
    "id": "mem-util",
    "name": "Memory utilization",
    "docker_container_id": "f972c95ebf0e",
    "shell": "/bin/bash",
    "args": ["/usr/local/bin/check_mem.py"],
    "interval": "10s"
  }
}
```

A gRPC check:

```javascript
{
  "check": {
    "id": "mem-util",
    "name": "Service health status",
    "grpc": "127.0.0.1:12345",
    "grpc_use_tls": true,
    "interval": "10s"
  }
}
```

An alias check for a local service:

```javascript
{
  "check": {
    "id": "web-alias",
    "alias_service": "web"
  }
}
```

Each type of definition must include a `name` and may optionally provide an
`id` and `notes` field. The `id` must be unique per _agent_; otherwise, only the
last defined check with that `id` will be registered. If the `id` is not set
and the check is embedded within a service definition, a unique check id is
generated. Otherwise, `id` will be set to `name`. If names might conflict,
unique IDs should be provided.

The `notes` field is opaque to Consul but can be used to provide a human-readable
description of the current state of the check. Similarly, an external process
updating a TTL check via the HTTP interface can set the `notes` value.

Checks may also contain a `token` field to provide an ACL token. This token is
used for any interaction with the catalog for the check, including
[anti-entropy syncs](/docs/internals/anti-entropy.html) and deregistration.
For Alias checks, this token is used if a remote blocking query is necessary
to watch the state of the aliased node or service.

Script, TCP, HTTP, Docker, and gRPC checks must include an `interval` field.
This field is parsed by Go's `time` package, and has the following
[formatting specification](https://golang.org/pkg/time/#ParseDuration):
> A duration string is a possibly signed sequence of decimal numbers, each with
> optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m".
> Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".

In Consul 0.7 and later, checks that are associated with a service may also contain
an optional `deregister_critical_service_after` field, which is a timeout in the
same Go time format as `interval` and `ttl`. If a check is in the critical state
for more than this configured value, then its associated service (and all of its
associated checks) will automatically be deregistered. The minimum timeout is 1
minute, and the process that reaps critical services runs every 30 seconds, so it
may take slightly longer than the configured timeout to trigger the deregistration.
This should generally be configured with a timeout that's much, much longer than
any expected recoverable outage for the given service.

To configure a check, either provide it as a `-config-file` option to the
agent or place it inside the `-config-dir` of the agent. The file must
end in a ".json" or ".hcl" extension to be loaded by Consul. Check definitions
can also be updated by sending a `SIGHUP` to the agent. Alternatively, the
check can be registered dynamically using the [HTTP API](/api/index.html).

## Check Scripts

A check script is generally free to do anything to determine the status
of the check. The only limitations placed are that the exit codes must obey
this convention:

 * Exit code 0 - Check is passing
 * Exit code 1 - Check is warning
 * Any other code - Check is failing

This is the only convention that Consul depends on. Any output of the script
will be captured and stored in the `output` field.
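
The exit-code convention above can be illustrated with a small script. This is only a sketch in Python: the function names `classify` and `check_disk`, the disk-usage metric, and the 80%/95% thresholds are hypothetical choices for illustration, and exit code 2 is used as one example of "any other code".

```python
import shutil

# Exit codes interpreted by Consul for script checks.
PASSING, WARNING, CRITICAL = 0, 1, 2  # 2 stands in for "any other code"


def classify(used_fraction, warn_at=0.80, crit_at=0.95):
    """Map a usage fraction (0.0-1.0) to a check exit code.

    The thresholds are illustrative assumptions, not Consul defaults.
    """
    if used_fraction >= crit_at:
        return CRITICAL
    if used_fraction >= warn_at:
        return WARNING
    return PASSING


def check_disk(path="/"):
    """Return (exit_code, message) for the filesystem containing `path`."""
    usage = shutil.disk_usage(path)
    # Guard against pseudo-filesystems that report a total size of zero.
    fraction = usage.used / usage.total if usage.total else 0.0
    return classify(fraction), f"{path}: {fraction:.1%} used"


# Anything printed to stdout would be captured as the check's output.
code, message = check_disk(".")
print(message)
```

A real check script would finish by passing `code` to `sys.exit()` so that the agent sees the intended exit status.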

In Consul 0.9.0 and later, the agent must be configured with
[`enable_script_checks`](/docs/agent/options.html#_enable_script_checks) set to `true`
in order to enable script checks.

## Initial Health Check Status

By default, when checks are registered against a Consul agent, the state is set
immediately to "critical". This is useful to prevent services from being
registered as "passing" and entering the service pool before they are confirmed
to be healthy. In certain cases, it may be desirable to specify the initial
state of a health check. This can be done by specifying the `status` field in a
health check definition, like so:

```javascript
{
  "check": {
    "id": "mem",
    "args": ["/bin/check_mem", "-limit", "256MB"],
    "interval": "10s",
    "status": "passing"
  }
}
```

The above check definition would cause the new "mem" check to be
registered with its initial state set to "passing".

## Service-bound checks

Health checks may optionally be bound to a specific service. This ensures
that the status of the health check will only affect the health status of the
given service instead of the entire node. Service-bound health checks may be
provided by adding a `service_id` field to a check configuration:

```javascript
{
  "check": {
    "id": "web-app",
    "name": "Web App Status",
    "service_id": "web-app",
    "ttl": "30s"
  }
}
```

In the above configuration, if the web-app health check begins failing, it will
only affect the availability of the web-app service. All other services
provided by the node will remain unchanged.

## Agent Certificates for TLS Checks

The [enable_agent_tls_for_checks](/docs/agent/options.html#enable_agent_tls_for_checks)
agent configuration option can be used to have HTTP or gRPC health checks
use the agent's credentials when configured for TLS.

## Multiple Check Definitions

Multiple check definitions can be defined using the `checks` (plural)
key in your configuration file.

```javascript
{
  "checks": [
    {
      "id": "chk1",
      "name": "mem",
      "args": ["/bin/check_mem", "-limit", "256MB"],
      "interval": "5s"
    },
    {
      "id": "chk2",
      "name": "/health",
      "http": "http://localhost:5000/health",
      "interval": "15s"
    },
    {
      "id": "chk3",
      "name": "cpu",
      "args": ["/bin/check_cpu"],
      "interval": "10s"
    },
    ...
  ]
}
```
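
When many checks share a shape, a file like the one above could be generated rather than written by hand. The following is a minimal sketch in Python using only the standard `json` module. The helper names `script_check`, `http_check`, and `render_checks` are hypothetical and not part of Consul; the field names are the ones shown on this page. Because `id` must be unique per agent, the sketch rejects duplicates instead of letting a later definition silently replace an earlier one.

```python
import json


def script_check(check_id, name, args, interval):
    """Build a script check definition (hypothetical helper)."""
    return {"id": check_id, "name": name, "args": list(args), "interval": interval}


def http_check(check_id, name, url, interval):
    """Build an HTTP check definition (hypothetical helper)."""
    return {"id": check_id, "name": name, "http": url, "interval": interval}


def render_checks(checks):
    """Serialize definitions under the plural `checks` key.

    Raises ValueError on a duplicate id, since only the last check
    defined with a given id would be registered.
    """
    seen = set()
    for check in checks:
        if check["id"] in seen:
            raise ValueError(f"duplicate check id: {check['id']}")
        seen.add(check["id"])
    return json.dumps({"checks": checks}, indent=2)


config_text = render_checks([
    script_check("chk1", "mem", ["/bin/check_mem", "-limit", "256MB"], "5s"),
    http_check("chk2", "/health", "http://localhost:5000/health", "15s"),
    script_check("chk3", "cpu", ["/bin/check_cpu"], "10s"),
])
print(config_text)
```

The resulting text could be written to a file ending in ".json" inside the agent's `-config-dir`.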