github.com/outbrain/consul@v1.4.5/website/source/docs/agent/checks.html.md

     1  ---
     2  layout: "docs"
     3  page_title: "Check Definition"
     4  sidebar_current: "docs-agent-checks"
     5  description: |-
     6    One of the primary roles of the agent is management of system- and application-level health checks. A health check is considered to be application-level if it is associated with a service. A check is defined in a configuration file or added at runtime over the HTTP interface.
     7  ---
     8  
     9  # Checks
    10  
    11  One of the primary roles of the agent is management of system-level and application-level health
    12  checks. A health check is considered to be application-level if it is associated with a
    13  service. If not associated with a service, the check monitors the health of the entire node.
    14  
    15  A check is defined in a configuration file or added at runtime over the HTTP interface. Checks
    16  created via the HTTP interface persist with that node.
    17  
    18  There are several different kinds of checks:
    19  
    20  * Script + Interval - These checks depend on invoking an external application
    21    that performs the health check, exits with an appropriate exit code, and potentially
    22    generates some output. A script is paired with an invocation interval (e.g.
    23    every 30 seconds). This is similar to the Nagios plugin system. The output of
    24    a script check is limited to 4KB. Output larger than this will be truncated.
    25    By default, Script checks will be configured with a timeout equal to 30 seconds.
    26    It is possible to configure a custom Script check timeout value by specifying the
    27    `timeout` field in the check definition. When the timeout is reached on Windows,
    28    Consul will wait for any child processes spawned by the script to finish. For any
    29    other system, Consul will attempt to force-kill the script and any child processes
    30    it has spawned once the timeout has passed.
    31    In Consul 0.9.0 and later, script checks are not enabled by default. To use them, set
    32    one of the following agent options:
    33    * [`enable_local_script_checks`](/docs/agent/options.html#_enable_local_script_checks):
    34      enable script checks defined in local config files. Script checks defined via the HTTP
    35      API will not be allowed.
    36    * [`enable_script_checks`](/docs/agent/options.html#_enable_script_checks): enable
    37      script checks regardless of how they are defined.
    38  
    39    ~> **Security Warning:** Enabling script checks in some configurations may
    40    introduce a remote execution vulnerability which is known to be targeted by
    41    malware. We strongly recommend `enable_local_script_checks` instead. See [this
    42    blog post](https://www.hashicorp.com/blog/protecting-consul-from-rce-risk-in-specific-configurations)
    43    for more details.
    44  
    45  * HTTP + Interval - These checks make an HTTP `GET` request every Interval (e.g.
    46    every 30 seconds) to the specified URL. The status of the service depends on
    47    the HTTP response code: any `2xx` code is considered passing, a `429 Too Many
    48    Requests` is a warning, and anything else is a failure. This type of check
    49    should be preferred over a script that uses `curl` or another external process
    50    to check a simple HTTP operation. By default, HTTP checks are `GET` requests
    51    unless the `method` field specifies a different method. Additional header
    52    fields can be set through the `header` field which is a map of lists of
    53    strings, e.g. `{"x-foo": ["bar", "baz"]}`. By default, HTTP checks will be
    54    configured with a request timeout equal to the check interval, with a max of
    55    10 seconds. It is possible to configure a custom HTTP check timeout value by
    56    specifying the `timeout` field in the check definition. The output of the
    57    check is limited to roughly 4KB. Responses larger than this will be truncated.
    58    HTTP checks also support TLS. By default, a valid TLS certificate is expected.
    59    Certificate verification can be turned off by setting the `tls_skip_verify`
    60    field to `true` in the check definition.
    61  
    62  * TCP + Interval - These checks make a TCP connection attempt every Interval
    63    (e.g. every 30 seconds) to the specified IP/hostname and port. If no hostname
    64    is specified, it defaults to "localhost". The status of the service depends on
    65    whether the connection attempt is successful (i.e., the port is currently
    66    accepting connections). If the connection is accepted, the status is
    67    `passing`; otherwise the status is `critical`. In the case of a hostname that
    68    resolves to both IPv4 and IPv6 addresses, a connection is attempted to both
    69    addresses, and the first successful connection attempt will result in a
    70    successful check. This type of check should be preferred over a script that
    71    uses `netcat` or another external process to check a simple socket operation.
    72    By default, TCP checks will be configured with a request timeout equal to the
    73    check interval, with a max of 10 seconds. It is possible to configure a custom
    74    TCP check timeout value by specifying the `timeout` field in the check
    75    definition.
    76  
    77  * <a name="TTL"></a>Time to Live (TTL) - These checks retain their last known
    78    state for a given TTL.  The state of the check must be updated periodically
    79    over the HTTP interface. If an external system fails to update the status
    80    within a given TTL, the check is set to the critical state. This mechanism,
    81    conceptually similar to a dead man's switch, relies on the application to
    82    directly report its health. For example, a healthy app can periodically `PUT` a
    83    status update to the HTTP endpoint; if the app fails, the TTL will expire and
    84    the health check enters a critical state. The endpoints used to update health
    85    information for a given check are:
    86    [pass](/api/agent/check.html#ttl-check-pass),
    87    [warn](/api/agent/check.html#ttl-check-warn),
    88    [fail](/api/agent/check.html#ttl-check-fail), and
    89    [update](/api/agent/check.html#ttl-check-update).  TTL
    90    checks also persist their last known status to disk. This allows the Consul
    91    agent to restore the last known status of the check across restarts.  Persisted
    92    check status is valid through the end of the TTL from the time of the last
    93    check.
    94  
    95  * Docker + Interval - These checks depend on invoking an external application which
    96    is packaged within a Docker Container. The application is triggered within the running
    97    container via the Docker Exec API. We expect that the Consul agent user has access
    98    to either the Docker HTTP API or the Unix socket. Consul uses `$DOCKER_HOST` to
    99    determine the Docker API endpoint. The application is expected to run, perform a health
   100    check of the service running inside the container, and exit with an appropriate exit code.
   101    The check should be paired with an invocation interval. The shell on which the check
   102    has to be performed is configurable which makes it possible to run containers which
   103    have different shells on the same host. Check output for Docker is limited to
   104    4KB. Any output larger than this will be truncated. In Consul 0.9.0 and later, the agent
   105    must be configured with [`enable_script_checks`](/docs/agent/options.html#_enable_script_checks)
   106    set to `true` in order to enable Docker health checks.
   107  
   108  * gRPC + Interval - These checks are intended for applications that support the standard
   109    [gRPC health checking protocol](https://github.com/grpc/grpc/blob/master/doc/health-checking.md).
   110    The state of the check will be updated at the given interval by probing the configured
   111    endpoint. By default, gRPC checks are configured with a timeout of 10 seconds.
   112    It is possible to configure a custom timeout value by specifying the `timeout` field in
   113    the check definition. gRPC checks will default to not using TLS, but TLS can be enabled by
   114    setting `grpc_use_tls` in the check definition. If TLS is enabled, then by default, a valid
   115    TLS certificate is expected. Certificate verification can be turned off by setting the
   116    `tls_skip_verify` field to `true` in the check definition.
   117  
   118  * <a name="alias"></a>Alias - These checks alias the health state of another registered
   119    node or service. The state of the check will be updated asynchronously,
   120    but is nearly instant. For aliased services on the same agent, the local
   121    state is monitored and no additional network resources are consumed. For
   122    other services and nodes, the check maintains a blocking query over the
   123    agent's connection with a current server and allows stale requests. If there
   124    are any errors in watching the aliased node or service, the check state will be
   125    critical. For the blocking query, the check will use the ACL token set
   126    on the service or check definition or otherwise will fall back to the default ACL
   127    token set with the agent (`acl_token`).
   128  
   129  ## Check Definition
   130  
   131  A script check:
   132  
   133  ```javascript
   134  {
   135    "check": {
   136      "id": "mem-util",
   137      "name": "Memory utilization",
   138      "args": ["/usr/local/bin/check_mem.py", "-limit", "256MB"],
   139      "interval": "10s",
   140      "timeout": "1s"
   141    }
   142  }
   143  ```
   144  
   145  An HTTP check:
   146  
   147  ```javascript
   148  {
   149    "check": {
   150      "id": "api",
   151      "name": "HTTP API on port 5000",
   152      "http": "https://localhost:5000/health",
   153      "tls_skip_verify": false,
   154      "method": "POST",
   155      "header": {"x-foo":["bar", "baz"]},
   156      "interval": "10s",
   157      "timeout": "1s"
   158    }
   159  }
   160  ```
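
The response-code rule for HTTP checks described earlier can be sketched as a small function. This is only an illustration of the documented rule, not Consul's implementation; the function names are hypothetical, and the status strings follow the `passing`/`warning`/`critical` names used elsewhere on this page.

```python
def status_from_http_code(code):
    """Map an HTTP response code to a check status, per the documented rule."""
    if 200 <= code <= 299:
        return "passing"   # any 2xx code
    if code == 429:
        return "warning"   # 429 Too Many Requests
    return "critical"      # anything else


def default_timeout_seconds(interval_seconds):
    """Default request timeout: the check interval, capped at 10 seconds."""
    return min(interval_seconds, 10.0)
```

Note that redirects and other non-2xx codes fall through to `critical` under this rule, so a health endpoint should answer with a 2xx code directly.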
   161  
   162  A TCP check:
   163  
   164  ```javascript
   165  {
   166    "check": {
   167      "id": "ssh",
   168      "name": "SSH TCP on port 22",
   169      "tcp": "localhost:22",
   170      "interval": "10s",
   171      "timeout": "1s"
   172    }
   173  }
   174  ```
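
To make the TCP check semantics concrete, here is a rough Python sketch of a single probe. It is an approximation under stated assumptions, not the agent's code: `tcp_check` is a hypothetical name, and Python's `socket.create_connection` tries resolved addresses one after another, which may differ from how the agent handles dual-stack hostnames.

```python
import socket


def tcp_check(host, port, timeout=10.0):
    """Return "passing" if a TCP connection is accepted, else "critical"."""
    try:
        # An empty host falls back to "localhost", mirroring the documented default.
        # create_connection tries each resolved address (IPv4 and IPv6) in turn
        # and returns as soon as one connection succeeds.
        with socket.create_connection((host or "localhost", port), timeout=timeout):
            return "passing"
    except OSError:
        # Refused connections, timeouts, and resolution failures all land here.
        return "critical"
```

For the example above, `tcp_check("localhost", 22, timeout=1.0)` would correspond to one run of the `ssh` check.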
   175  
   176  A TTL check:
   177  
   178  ```javascript
   179  {
   180    "check": {
   181      "id": "web-app",
   182      "name": "Web App Status",
   183      "notes": "Web app does a curl internally every 10 seconds",
   184      "ttl": "30s"
   185    }
   186  }
   187  ```
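
An application that owns a TTL check has to report in before the TTL expires. The sketch below shows one way to do that against the agent's `pass` endpoint using only the Python standard library. It assumes the agent's HTTP API is reachable at `http://127.0.0.1:8500` with no ACL token required; the function names are hypothetical.

```python
import urllib.request
from urllib.parse import quote

AGENT_ADDR = "http://127.0.0.1:8500"  # assumption: local agent, default HTTP port


def ttl_pass_request(check_id, agent_addr=AGENT_ADDR):
    """Build (but do not send) the PUT request that marks a TTL check as passing."""
    url = f"{agent_addr}/v1/agent/check/pass/{quote(check_id, safe='')}"
    return urllib.request.Request(url, method="PUT")


def heartbeat(check_id, timeout=5.0):
    """Send one heartbeat and return the HTTP status code of the agent's reply."""
    with urllib.request.urlopen(ttl_pass_request(check_id), timeout=timeout) as resp:
        return resp.status
```

For the `web-app` check above, calling `heartbeat("web-app")` every 10 seconds leaves room for two missed updates before the 30 second TTL expires.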
   188  
   189  A Docker check:
   190  
   191  ```javascript
   192  {
   193    "check": {
   194      "id": "mem-util",
   195      "name": "Memory utilization",
   196      "docker_container_id": "f972c95ebf0e",
   197      "shell": "/bin/bash",
   198      "args": ["/usr/local/bin/check_mem.py"],
   199      "interval": "10s"
   200    }
   201  }
   202  ```
   203  
   204  A gRPC check:
   205  
   206  ```javascript
   207  {
   208    "check": {
   209      "id": "mem-util",
   210      "name": "Service health status",
   211      "grpc": "127.0.0.1:12345",
   212      "grpc_use_tls": true,
   213      "interval": "10s"
   214    }
   215  }
   216  ```
   217  
   218  An alias check for a local service:
   219  
   220  ```javascript
   221  {
   222    "check": {
   223      "id": "web-alias",
   224      "alias_service": "web"
   225    }
   226  }
   227  ```
   228  
   229  Each type of definition must include a `name` and may optionally provide an
   230  `id` and `notes` field. The `id` must be unique per _agent_; otherwise, only the
   231  last defined check with that `id` will be registered. If the `id` is not set
   232  and the check is embedded within a service definition, a unique check ID is
   233  generated. Otherwise, `id` will be set to `name`. If names might conflict,
   234  unique IDs should be provided.
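
The ID rules can be easy to trip over, so here is a small sketch that applies them to a list of standalone check definitions before they are handed to the agent. It is a hypothetical lint helper, not part of Consul, and it deliberately ignores checks embedded in service definitions, where Consul generates the ID itself.

```python
def effective_id(check):
    """Return the ID a standalone check is registered under: `id` if set, else `name`."""
    return check.get("id") or check["name"]


def registered_checks(checks):
    """Index checks by effective ID; a later definition replaces an earlier one."""
    registered = {}
    for check in checks:
        registered[effective_id(check)] = check
    return registered


def conflicting_ids(checks):
    """List effective IDs that occur more than once, in first-seen order."""
    counts = {}
    for check in checks:
        check_id = effective_id(check)
        counts[check_id] = counts.get(check_id, 0) + 1
    return [check_id for check_id, n in counts.items() if n > 1]
```

Running `conflicting_ids` over a configuration before deploying it flags definitions that would otherwise be silently replaced.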
   235  
   236  The `notes` field is opaque to Consul but can be used to provide a human-readable
   237  description of the current state of the check. Similarly, an external process
   238  updating a TTL check via the HTTP interface can set the `notes` value.
   239  
   240  Checks may also contain a `token` field to provide an ACL token. This token is
   241  used for any interaction with the catalog for the check, including
   242  [anti-entropy syncs](/docs/internals/anti-entropy.html) and deregistration.
   243  For Alias checks, this token is used if a remote blocking query is necessary
   244  to watch the state of the aliased node or service.
   245  
   246  Script, TCP, HTTP, Docker, and gRPC checks must include an `interval` field. This
   247  field is parsed by Go's `time` package, and has the following
   248  [formatting specification](https://golang.org/pkg/time/#ParseDuration):
   249  > A duration string is a possibly signed sequence of decimal numbers, each with
   250  > optional fraction and a unit suffix, such as "300ms", "-1.5h" or "2h45m".
   251  > Valid time units are "ns", "us" (or "µs"), "ms", "s", "m", "h".
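
For tooling that needs to validate `interval`, `timeout`, or `ttl` values outside of Go, the quoted format can be approximated as follows. This is a simplified sketch, not a port of `time.ParseDuration`: it returns floating-point seconds rather than integer nanoseconds and does no overflow checking. The name `parse_duration` is hypothetical.

```python
import re

# Seconds per unit, for the units listed in the specification quoted above.
_UNIT_SECONDS = {
    "ns": 1e-9, "us": 1e-6, "µs": 1e-6, "ms": 1e-3,
    "s": 1.0, "m": 60.0, "h": 3600.0,
}
# One "number + unit" term; two-letter units are listed first so "ms" is not read as "m".
_TERM = re.compile(r"(\d+(?:\.\d*)?|\.\d+)(ns|us|µs|ms|s|m|h)")


def parse_duration(text):
    """Parse a duration string such as "2h45m" into seconds (as a float)."""
    sign, body = 1.0, text
    if body.startswith(("+", "-")):
        sign = -1.0 if body[0] == "-" else 1.0
        body = body[1:]
    if body == "0":
        return 0.0  # a bare zero needs no unit
    pos, total = 0, 0.0
    while pos < len(body):
        match = _TERM.match(body, pos)
        if match is None:
            raise ValueError(f"invalid duration: {text!r}")
        total += float(match.group(1)) * _UNIT_SECONDS[match.group(2)]
        pos = match.end()
    if pos == 0:
        raise ValueError(f"invalid duration: {text!r}")  # empty or sign-only input
    return sign * total
```

A value such as `"10"` (no unit) raises `ValueError` here, matching the specification's requirement that every number carries a unit suffix.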
   252  
   253  In Consul 0.7 and later, checks that are associated with a service may also contain
   254  an optional `deregister_critical_service_after` field, which is a timeout in the
   255  same Go time format as `interval` and `ttl`. If a check is in the critical state
   256  for more than this configured value, then its associated service (and all of its
   257  associated checks) will automatically be deregistered. The minimum timeout is 1
   258  minute, and the process that reaps critical services runs every 30 seconds, so it
   259  may take slightly longer than the configured timeout to trigger the deregistration.
   260  This should generally be configured with a timeout that's much, much longer than
   261  any expected recoverable outage for the given service.
   262  
   263  To configure a check, either provide it as a `-config-file` option to the
   264  agent or place it inside the `-config-dir` of the agent. The file must
   265  end in a ".json" or ".hcl" extension to be loaded by Consul. Check definitions
   266  can also be updated by sending a `SIGHUP` to the agent. Alternatively, the
   267  check can be registered dynamically using the [HTTP API](/api/index.html).
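
As a rough illustration of dynamic registration, the sketch below builds a request for the agent's `/v1/agent/check/register` endpoint. It assumes a local agent at `http://127.0.0.1:8500` with no ACL token required, and `register_check_request` is a hypothetical helper. The payload uses the HTTP API's capitalized field names (`ID`, `Name`, `TTL`) rather than the lowercase config-file keys.

```python
import json
import urllib.request


def register_check_request(definition, agent_addr="http://127.0.0.1:8500"):
    """Build (but do not send) a PUT request that registers a check with the agent."""
    return urllib.request.Request(
        f"{agent_addr}/v1/agent/check/register",
        data=json.dumps(definition).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# The TTL check from the examples on this page, expressed as an API payload.
ttl_check = {"ID": "web-app", "Name": "Web App Status", "TTL": "30s"}
request = register_check_request(ttl_check)
# To actually register it: urllib.request.urlopen(request, timeout=5)
```

Building the request separately from sending it keeps the payload easy to inspect or log before it reaches the agent.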
   268  
   269  ## Check Scripts
   270  
   271  A check script is generally free to do anything to determine the status
   272  of the check. The only limitations placed are that the exit codes must obey
   273  this convention:
   274  
   275   * Exit code 0 - Check is passing
   276   * Exit code 1 - Check is warning
   277   * Any other code - Check is failing
   278  
   279  This is the only convention that Consul depends on. Any output of the script
   280  will be captured and stored in the `output` field.
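
A minimal sketch of both sides of this convention is shown below: how a script might choose its exit code, and how that code maps to a status. The thresholds and function names are hypothetical, and `2` is used simply as a representative of "any other code".

```python
PASSING, WARNING, CRITICAL = 0, 1, 2  # 2 stands in for "any other code"


def exit_code_for(used_percent, warn_at=80.0, crit_at=95.0):
    """Pick an exit code from a utilization percentage (thresholds are hypothetical)."""
    if used_percent >= crit_at:
        return CRITICAL
    if used_percent >= warn_at:
        return WARNING
    return PASSING


def status_from_exit_code(code):
    """Interpret an exit code according to the convention above."""
    if code == 0:
        return "passing"
    if code == 1:
        return "warning"
    return "critical"  # every other code means the check is failing
```

A real check script would print a short summary line, which ends up in the `output` field, and then finish with `sys.exit(exit_code_for(value))`.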
   281  
   282  In Consul 0.9.0 and later, the agent must be configured with either
   283  [`enable_local_script_checks`](/docs/agent/options.html#_enable_local_script_checks) or
   284  [`enable_script_checks`](/docs/agent/options.html#_enable_script_checks) set to `true` in order to enable script checks.
   285  
   286  ## Initial Health Check Status
   287  
   288  By default, when checks are registered against a Consul agent, the state is set
   289  immediately to "critical". This is useful to prevent services from being
   290  registered as "passing" and entering the service pool before they are confirmed
   291  to be healthy. In certain cases, it may be desirable to specify the initial
   292  state of a health check. This can be done by specifying the `status` field in a
   293  health check definition, like so:
   294  
   295  ```javascript
   296  {
   297    "check": {
   298      "id": "mem",
   299      "args": ["/bin/check_mem", "-limit", "256MB"],
   300      "interval": "10s",
   301      "status": "passing"
   302    }
   303  }
   304  ```
   305  
   306  The above check definition would cause the new "mem" check to be
   307  registered with its initial state set to "passing".
   308  
   309  ## Service-bound checks
   310  
   311  Health checks may optionally be bound to a specific service. This ensures
   312  that the status of the health check will only affect the health status of the
   313  given service instead of the entire node. Service-bound health checks may be
   314  provided by adding a `service_id` field to a check configuration:
   315  
   316  ```javascript
   317  {
   318    "check": {
   319      "id": "web-app",
   320      "name": "Web App Status",
   321      "service_id": "web-app",
   322      "ttl": "30s"
   323    }
   324  }
   325  ```
   326  
   327  In the above configuration, if the web-app health check begins failing, it will
   328  only affect the availability of the web-app service. All other services
   329  provided by the node will remain unchanged.
   330  
   331  ## Agent Certificates for TLS Checks
   332  
   333  The [enable_agent_tls_for_checks](/docs/agent/options.html#enable_agent_tls_for_checks)
   334  agent configuration option can be used to have HTTP or gRPC health checks
   335  use the agent's credentials when configured for TLS.
   336  
   337  ## Multiple Check Definitions
   338  
   339  Multiple check definitions can be defined using the `checks` (plural)
   340  key in your configuration file.
   341  
   342  ```javascript
   343  {
   344    "checks": [
   345      {
   346        "id": "chk1",
   347        "name": "mem",
   348        "args": ["/bin/check_mem", "-limit", "256MB"],
   349        "interval": "5s"
   350      },
   351      {
   352        "id": "chk2",
   353        "name": "/health",
   354        "http": "http://localhost:5000/health",
   355        "interval": "15s"
   356      },
   357      {
   358        "id": "chk3",
   359        "name": "cpu",
   360        "args": ["/bin/check_cpu"],
   361        "interval": "10s"
   362      },
   363      ...
   364    ]
   365  }
   366  ```