---
layout: docs
page_title: Monitoring Nomad
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Monitoring Nomad

The Nomad client and server agents collect a wide range of runtime metrics.
These metrics are useful for monitoring the health and performance of Nomad
clusters. Careful monitoring can spot trends before they cause issues and help
debug issues if they arise.

All Nomad agents, both servers and clients, report basic system and Go runtime
metrics.

All Nomad servers report many metrics, but some metrics are specific to the
leader server. Since leadership may change at any time, these metrics should be
monitored on all servers. Missing (or zero-valued) metrics from non-leaders may
be safely ignored.

Nomad clients have separate metrics for the host they are running on as well as
for each allocation being run. Both of these metrics [must be explicitly
enabled][telemetry-stanza].

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Note that Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:

- Query the [/v1/metrics API endpoint][metrics-api-endpoint] to return metrics
  for the current Nomad process. This endpoint supports Prometheus-formatted
  metrics.

- Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to STDERR (on Linux).

- Configure Nomad to automatically forward metrics to a third-party provider
  such as [DataDog][datadog-telem], [Prometheus][prometheus-telem],
  [statsd][statsd-telem], and [Circonus][circonus-telem].
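
For example, with an agent listening on the default HTTP address, the endpoint
and signal approaches look like the following sketch. The `format=prometheus`
query parameter assumes Prometheus metrics are enabled in the
[telemetry configuration][telemetry-stanza], and the `pgrep` pattern assumes a
single local agent process:

```
# Query the metrics endpoint; JSON by default, Prometheus format on request.
curl -s http://localhost:4646/v1/metrics
curl -s "http://localhost:4646/v1/metrics?format=prometheus"

# Dump the current telemetry to the agent's stderr (Linux).
kill -USR1 $(pgrep -x nomad)
```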

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Rather than supporting alerting natively, Nomad's
intention is to surface metrics that enable users to configure the necessary
alerts in their existing monitoring systems. Here are a few common patterns.

- Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

- Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads; a minimal sketch follows this list.

- Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

- Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it goes back to passing.
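
A minimal sketch of the test-job pattern above, assuming a hypothetical
`smoke-test.nomad.hcl` job file and a hypothetical `notify-monitoring` helper
that pushes a failure signal to your monitoring system:

```
# Submit the test job from cron or a periodic batch job; `nomad job run`
# exits non-zero if submission or placement fails.
nomad job run smoke-test.nomad.hcl || notify-monitoring "nomad smoke test failed"
```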

# Key Performance Indicators

The memory, CPU, disk, and network usage of Nomad servers all scale linearly
with cluster size and scheduling throughput. The most important aspect of
ensuring Nomad operates normally is monitoring these system resources to ensure
the servers are not encountering resource constraints.

The sections below cover a number of other important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking
issues between the servers, insufficient CPU resources, or
insufficient disk IOPS. Users in cloud environments often bump their
servers up to the next instance class with improved networking and CPU
to stabilize leader elections, or switch to higher-performance disks.

The `nomad.raft.leader.lastContact` metric is a general indicator of
Raft latency, which can be used to observe how Raft timing is
performing and guide infrastructure provisioning. If this number
trends upwards, look at CPU, disk IOPS, and network
latency. `nomad.raft.leader.lastContact` should not get too close to
the leader lease timeout of 500ms.

The `nomad.raft.replication.appendEntries` metric is an indicator of
the time it takes for a Raft transaction to be replicated to a quorum
of followers. If this number trends upwards, check the disk I/O on the
followers and network latency between the leader and the followers.

The details for how to examine CPU, I/O operations, and networking are
specific to your platform and environment. On Linux, the `sysstat`
package contains a number of useful tools. Here are examples to
consider; a short sketch of their use follows the list.

- **CPU** - `vmstat 1`, cloud provider metrics for "CPU %"

- **IO** - `iostat`, `sar -d`, cloud provider metrics for "volume
  write/read ops" and "burst balance"

- **Network** - `sar -n`, `netstat -s`, cloud provider metrics for
  interface "allowance"
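
For instance, on a Linux server with the `sysstat` package installed, the
following commands sample these resources; the intervals are illustrative and
the exact flags and output columns vary by distribution:

```
# CPU: run queue length, context switches, and utilization once per second.
vmstat 1

# Disk: per-device utilization, request latency, and IOPS every five seconds.
iostat -x 5
sar -d 5

# Network: per-interface throughput and protocol error counters.
sar -n DEV 5
netstat -s
```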

The `nomad.raft.fsm.apply` metric is an indicator of the time it takes
for a server to apply Raft entries to the internal state machine. If
this number trends upwards, look at the `nomad.nomad.fsm.*` metrics to
see if a specific Raft entry is increasing in latency. You can compare
this to warn-level logs on the Nomad servers for `attempting to apply
large raft entry`. If a specific type of message appears here, there
may be a job with a large job specification or dispatch payload that
is increasing the time it takes to apply Raft messages. Try shrinking the size
of the job by putting distinct task groups into separate jobs,
downloading templates instead of embedding them, or reducing the `count` on
task groups.
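
For example, if the server agent runs under systemd (the `nomad` unit name is
an assumption here), the warn-level log can be checked with:

```
# Search the server logs for oversized Raft entries; the log location depends
# on how the agent was installed and configured.
journalctl -u nomad | grep "attempting to apply large raft entry"
```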

## Scheduling

The [Scheduling] documentation describes the workflow of how evaluations become
scheduled plans and placed allocations.

### Progress

There is a class of bug possible in Nomad where the two parts of the scheduling
pipeline, the workers and the leader's plan applier, *disagree* about the
validity of a plan. In the pathological case this can cause a job to never
finish scheduling, as workers produce the same plan and the plan applier
repeatedly rejects it.

While this class of bug is very rare, it can be detected by repeated log lines
on the Nomad servers containing `plan for node rejected`:

```
nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5
```

While it is possible for these log lines to occur infrequently due to normal
cluster conditions, they should not appear repeatedly and prevent the job from
eventually running (look up the evaluation ID logged to find the job).
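
For example, the evaluation ID from the log line above can be resolved to its
job with the `nomad eval status` command, which accepts a unique ID prefix:

```
# Show the job, trigger, and any placement failures for the evaluation
# referenced in the rejected-plan log line.
nomad eval status 098a5
```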

#### Plan rejection tracker

Nomad provides a mechanism to track the history of plan rejections per client
and mark them as ineligible if the number goes above a given threshold within a
time window. This functionality can be enabled using the
[`plan_rejection_tracker`] server configuration.

When a node is marked as ineligible due to excessive plan rejections, the
following node event is registered:

```
Node marked as ineligible for scheduling due to multiple plan rejections, refer to https://www.nomadproject.io/s/port-plan-failure for more information
```

Along with the log line:

```
[WARN]  nomad.state_store: marking node as ineligible due to multiple plan rejections: node_id=67af2541-5e96-6f54-9095-11089d627626
```

If a client is marked as ineligible due to repeated plan rejections, try
[draining] the node and shutting it down. Misconfigurations not caught by
validation can cause nodes to enter this state, as in [#11830][gh-11830].
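
A sketch of that remediation, using the node ID from the example log line above
(`nomad node drain` accepts a full node ID or a unique prefix):

```
# Mark the node as ineligible and drain its allocations, then stop the Nomad
# agent on that host once the drain completes.
nomad node drain -enable -yes 67af2541-5e96-6f54-9095-11089d627626
```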

If the `plan for node rejected` log *does* appear repeatedly with the same
`node_id` referenced, but the client is not being set as ineligible, try
adjusting the [`plan_rejection_tracker`] configuration on the servers.

### Performance

The following metrics allow observing changes in throughput at the various
points in the scheduling process.

- **nomad.worker.invoke_scheduler.<type\>** - The time to run the
  scheduler of the given type. Each scheduler worker handles one
  evaluation at a time, entirely in-memory. If this metric increases,
  examine the CPU and memory resources of the scheduler.

- **nomad.broker.total_blocked** - The number of blocked
  evaluations. Blocked evaluations are created when the scheduler
  cannot place all allocations as part of a plan. Blocked evaluations
  will be re-evaluated so that changes in cluster resources can be
  used for the blocked evaluation's allocations. An increase in
  blocked evaluations may mean that the cluster's clients are low on
  resources or that jobs have been submitted that can never have all
  their allocations placed. Nomad also emits a similar metric for each
  individual scheduler. For example, `nomad.broker.batch_blocked` shows
  the number of blocked evaluations for the batch scheduler.

- **nomad.broker.total_unacked** - The number of unacknowledged
  evaluations. When an evaluation has been processed, the worker sends
  an acknowledgment RPC to the leader to signal to the eval broker
  that processing is complete. The unacked evals are those that are
  in-flight in the scheduler and have not yet been acknowledged. An
  increase in unacknowledged evaluations may mean that the schedulers
  have a large queue of evaluations to process. See the
  `invoke_scheduler` metric (above) and examine the CPU and memory
  resources of the scheduler. Nomad also emits a similar metric for
  each individual scheduler. For example, `nomad.broker.batch_unacked`
  shows the number of unacknowledged evaluations for the batch
  scheduler.

- **nomad.plan.evaluate** - The time to evaluate a scheduler plan
  submitted by a worker. This operation happens on the leader to
  serialize the plans of all the scheduler workers. This happens
  entirely in memory on the leader. If this metric increases, examine
  the CPU and memory resources of the leader.

- **nomad.plan.wait_for_index** - The time required for the planner to wait for
  the Raft index of the plan to be processed. If this metric increases, refer
  to the [Consensus Protocol (Raft)] section above. If this metric approaches 5
  seconds, scheduling operations may fail and be retried. If possible, reduce
  scheduling load until metrics improve.

- **nomad.plan.submit** - The time to submit a scheduler plan from the
  worker to the leader. This operation requires writing to Raft and
  includes the time from `nomad.plan.evaluate` and
  `nomad.plan.wait_for_index` (above). If this metric increases, refer
  to the [Consensus Protocol (Raft)] section above.

- **nomad.plan.queue_depth** - The number of scheduler plans waiting
  to be evaluated after being submitted. If this metric increases,
  examine the `nomad.plan.evaluate` and `nomad.plan.submit` metrics to
  determine if the problem is in general leader resources or Raft
  performance.

Upticks in any of the above metrics indicate a decrease in scheduler
throughput.
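
These metrics flow through whichever telemetry sink is configured, but they can
also be spot-checked directly against a server's metrics endpoint. A sketch
using `curl` and `jq`; the JSON field names follow the in-memory telemetry sink
and may differ between Nomad versions:

```
# Spot-check blocked evaluations and plan queue depth on a server.
curl -s http://localhost:4646/v1/metrics |
  jq '.Gauges[] | select(.Name | test("broker.total_blocked|plan.queue_depth"))'
```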

## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom of 20% or
more. The following metrics can be used to assess capacity across the cluster
on a per-client basis.

- **nomad.client.allocated.cpu**
- **nomad.client.unallocated.cpu**
- **nomad.client.allocated.disk**
- **nomad.client.unallocated.disk**
- **nomad.client.allocated.iops**
- **nomad.client.unallocated.iops**
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**
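
These gauges are reported by each client agent. As a quick spot check without a
metrics pipeline, a client's allocated resources and host utilization can also
be inspected from the CLI (the node ID prefix below is hypothetical):

```
# Show allocated CPU, memory, and disk for the client; -stats adds detailed
# host resource utilization.
nomad node status -stats 4ae5d9c1
```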

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to
alert when the CPU is at or above the reserved resources for the task.

## Job and Task Status

See [Job Summary Metrics] for monitoring the health and status of workloads
running on Nomad.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure.

- **nomad.runtime.num_goroutines**
- **nomad.runtime.heap_objects**
- **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

If these metrics increase, look at CPU load on the servers and network
latency and packet loss for the [Serf] address.
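
Gossip membership can also be checked directly from any server: `nomad server
members` lists every server in the global gossip pool along with its Serf
status, so flapping members surface as `failed` or `left` entries:

```
# List all servers in the gossip pool, their status, and their region.
nomad server members
```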

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/operations/metrics-reference#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[draining]: https://learn.hashicorp.com/tutorials/nomad/node-drain
[gh-11830]: https://github.com/hashicorp/nomad/pull/11830
[metric-types]: /docs/operations/metrics-reference#metric-types
[metrics-api-endpoint]: /api-docs/metrics
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[`plan_rejection_tracker`]: /docs/configuration/server#plan_rejection_tracker
[serf]: /docs/configuration#serf-1
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/operations/metrics-reference#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry
[Consensus Protocol (Raft)]: /docs/operations/monitoring-nomad#consensus-protocol-raft
[Job Summary Metrics]: /docs/operations/metrics-reference#job-summary-metrics
[Scheduling]: /docs/concepts/scheduling/scheduling