---
layout: docs
page_title: Monitoring Nomad
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Monitoring Nomad

The Nomad client and server agents collect a wide range of runtime metrics.
These metrics are useful for monitoring the health and performance of Nomad
clusters. Careful monitoring can spot trends before they cause issues and help
debug issues if they arise.

All Nomad agents, both servers and clients, report basic system and Go runtime
metrics.

All Nomad servers report many metrics, but some metrics are specific to the
leader server. Since leadership may change at any time, these metrics should be
monitored on all servers. Missing (or zero-valued) metrics from non-leaders may
be safely ignored.

Nomad clients have separate metrics for the host they are running on as well as
for each allocation being run. Both of these metrics [must be explicitly
enabled][telemetry-stanza].

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Note that Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:

- Query the [/v1/metrics API endpoint][metrics-api-endpoint] to return metrics
  for the current Nomad process. This endpoint supports Prometheus-formatted
  metrics.

- Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to STDERR (on Linux).

- Configure Nomad to automatically forward metrics to a third-party provider
  such as [DataDog][datadog-telem], [Prometheus][prometheus-telem],
  [statsd][statsd-telem], and [Circonus][circonus-telem].
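
For example, with an agent listening on the default HTTP address, the endpoint
and signal approaches look like the following sketch. The `format=prometheus`
query parameter assumes Prometheus metrics are enabled in the
[telemetry configuration][telemetry-stanza], and the `pgrep` pattern assumes a
single local agent process:

```
# Query the metrics endpoint; JSON by default, Prometheus format on request.
curl -s http://localhost:4646/v1/metrics
curl -s "http://localhost:4646/v1/metrics?format=prometheus"

# Dump the current telemetry to the agent's stderr (Linux).
kill -USR1 $(pgrep -x nomad)
```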

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Rather than supporting alerting natively, Nomad's
intention is to surface metrics that enable users to configure the necessary
alerts in their existing monitoring systems. Here are a few common patterns.

- Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

- Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads; a minimal sketch follows this list.

- Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

- Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it goes back to passing.
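
A minimal sketch of the test-job pattern above, assuming a hypothetical
`smoke-test.nomad.hcl` job file and a hypothetical `notify-monitoring` helper
that pushes a failure signal to your monitoring system:

```
# Submit the test job from cron or a periodic batch job; `nomad job run`
# exits non-zero if submission or placement fails.
nomad job run smoke-test.nomad.hcl || notify-monitoring "nomad smoke test failed"
```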

# Key Performance Indicators

The memory, CPU, disk, and network usage of Nomad servers all scale linearly
with cluster size and scheduling throughput. The most important aspect of
ensuring Nomad operates normally is monitoring these system resources to ensure
the servers are not encountering resource constraints.

The sections below cover a number of other important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking
issues between the servers, insufficient CPU resources, or
insufficient disk IOPS. Users in cloud environments often bump their
servers up to the next instance class with improved networking and CPU
to stabilize leader elections, or switch to higher-performance disks.

The `nomad.raft.leader.lastContact` metric is a general indicator of
Raft latency, which can be used to observe how Raft timing is
performing and guide infrastructure provisioning. If this number
trends upwards, look at CPU, disk IOPS, and network
latency. `nomad.raft.leader.lastContact` should not get too close to
the leader lease timeout of 500ms.

The `nomad.raft.replication.appendEntries` metric is an indicator of
the time it takes for a Raft transaction to be replicated to a quorum
of followers. If this number trends upwards, check the disk I/O on the
followers and network latency between the leader and the followers.

The details for how to examine CPU, I/O operations, and networking are
specific to your platform and environment. On Linux, the `sysstat`
package contains a number of useful tools. Here are examples to
consider; a short sketch of their use follows the list.

- **CPU** - `vmstat 1`, cloud provider metrics for "CPU %"

- **IO** - `iostat`, `sar -d`, cloud provider metrics for "volume
  write/read ops" and "burst balance"

- **Network** - `sar -n`, `netstat -s`, cloud provider metrics for
  interface "allowance"
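
For instance, on a Linux server with the `sysstat` package installed, the
following commands sample these resources; the intervals are illustrative and
the exact flags and output columns vary by distribution:

```
# CPU: run queue length, context switches, and utilization once per second.
vmstat 1

# Disk: per-device utilization, request latency, and IOPS every five seconds.
iostat -x 5
sar -d 5

# Network: per-interface throughput and protocol error counters.
sar -n DEV 5
netstat -s
```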

The `nomad.raft.fsm.apply` metric is an indicator of the time it takes
for a server to apply Raft entries to the internal state machine. If
this number trends upwards, look at the `nomad.nomad.fsm.*` metrics to
see if a specific Raft entry is increasing in latency. You can compare
this to warn-level logs on the Nomad servers for `attempting to apply
large raft entry`. If a specific type of message appears here, there
may be a job with a large job specification or dispatch payload that
is increasing the time it takes to apply Raft messages. Try shrinking the size
of the job by putting distinct task groups into separate jobs,
downloading templates instead of embedding them, or reducing the `count` on
task groups.
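
For example, if the server agent runs under systemd (the `nomad` unit name is
an assumption here), the warn-level log can be checked with:

```
# Search the server logs for oversized Raft entries; the log location depends
# on how the agent was installed and configured.
journalctl -u nomad | grep "attempting to apply large raft entry"
```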

## Scheduling

The [Scheduling] documentation describes the workflow of how evaluations become
scheduled plans and placed allocations.

### Progress

There is a class of bug possible in Nomad where the two parts of the scheduling
pipeline, the workers and the leader's plan applier, *disagree* about the
validity of a plan. In the pathological case this can cause a job to never
finish scheduling, as workers produce the same plan and the plan applier
repeatedly rejects it.

While this class of bug is very rare, it can be detected by repeated log lines
on the Nomad servers containing `plan for node rejected`:

```
nomad: plan for node rejected: node_id=0fa84370-c713-b914-d329-f6485951cddc reason="reserved port collision" eval_id=098a5
```

While it is possible for these log lines to occur infrequently due to normal
cluster conditions, they should not appear repeatedly and prevent the job from
eventually running (look up the evaluation ID logged to find the job).
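
For example, the evaluation ID from the log line above can be resolved to its
job with the `nomad eval status` command, which accepts a unique ID prefix:

```
# Show the job, trigger, and any placement failures for the evaluation
# referenced in the rejected-plan log line.
nomad eval status 098a5
```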

#### Plan rejection tracker

Nomad provides a mechanism to track the history of plan rejections per client
and mark them as ineligible if the number goes above a given threshold within a
time window. This functionality can be enabled using the
[`plan_rejection_tracker`] server configuration.

When a node is marked as ineligible due to excessive plan rejections, the
following node event is registered:

```
Node marked as ineligible for scheduling due to multiple plan rejections, refer to https://www.nomadproject.io/s/port-plan-failure for more information
```

Along with the log line:

```
[WARN]  nomad.state_store: marking node as ineligible due to multiple plan rejections: node_id=67af2541-5e96-6f54-9095-11089d627626
```

If a client is marked as ineligible due to repeated plan rejections, try
[draining] the node and shutting it down. Misconfigurations not caught by
validation can cause nodes to enter this state, as in [#11830][gh-11830].
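
A sketch of that remediation, using the node ID from the example log line above
(`nomad node drain` accepts a full node ID or a unique prefix):

```
# Mark the node as ineligible and drain its allocations, then stop the Nomad
# agent on that host once the drain completes.
nomad node drain -enable -yes 67af2541-5e96-6f54-9095-11089d627626
```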

If the `plan for node rejected` log *does* appear repeatedly with the same
`node_id` referenced, but the client is not being set as ineligible, try
adjusting the [`plan_rejection_tracker`] configuration on the servers.

### Performance

The following metrics allow observing changes in throughput at the various
points in the scheduling process.

- **nomad.worker.invoke_scheduler.<type\>** - The time to run the
  scheduler of the given type. Each scheduler worker handles one
  evaluation at a time, entirely in-memory. If this metric increases,
  examine the CPU and memory resources of the scheduler.

- **nomad.broker.total_blocked** - The number of blocked
  evaluations. Blocked evaluations are created when the scheduler
  cannot place all allocations as part of a plan. Blocked evaluations
  will be re-evaluated so that changes in cluster resources can be
  used for the blocked evaluation's allocations. An increase in
  blocked evaluations may mean that the cluster's clients are low on
  resources or that jobs have been submitted that can never have all
  their allocations placed. Nomad also emits a similar metric for each
  individual scheduler. For example, `nomad.broker.batch_blocked` shows
  the number of blocked evaluations for the batch scheduler.

- **nomad.broker.total_unacked** - The number of unacknowledged
  evaluations. When an evaluation has been processed, the worker sends
  an acknowledgment RPC to the leader to signal to the eval broker
  that processing is complete. The unacked evals are those that are
  in-flight in the scheduler and have not yet been acknowledged. An
  increase in unacknowledged evaluations may mean that the schedulers
  have a large queue of evaluations to process. See the
  `invoke_scheduler` metric (above) and examine the CPU and memory
  resources of the scheduler. Nomad also emits a similar metric for
  each individual scheduler. For example, `nomad.broker.batch_unacked`
  shows the number of unacknowledged evaluations for the batch
  scheduler.

- **nomad.plan.evaluate** - The time to evaluate a scheduler plan
  submitted by a worker. This operation happens on the leader to
  serialize the plans of all the scheduler workers. This happens
  entirely in memory on the leader. If this metric increases, examine
  the CPU and memory resources of the leader.

- **nomad.plan.wait_for_index** - The time required for the planner to wait for
  the Raft index of the plan to be processed. If this metric increases, refer
  to the [Consensus Protocol (Raft)] section above. If this metric approaches 5
  seconds, scheduling operations may fail and be retried. If possible, reduce
  scheduling load until metrics improve.

- **nomad.plan.submit** - The time to submit a scheduler plan from the
  worker to the leader. This operation requires writing to Raft and
  includes the time from `nomad.plan.evaluate` and
  `nomad.plan.wait_for_index` (above). If this metric increases, refer
  to the [Consensus Protocol (Raft)] section above.

- **nomad.plan.queue_depth** - The number of scheduler plans waiting
  to be evaluated after being submitted. If this metric increases,
  examine the `nomad.plan.evaluate` and `nomad.plan.submit` metrics to
  determine if the problem is in general leader resources or Raft
  performance.

Upticks in any of the above metrics indicate a decrease in scheduler
throughput.
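
These metrics flow through whichever telemetry sink is configured, but they can
also be spot-checked directly against a server's metrics endpoint. A sketch
using `curl` and `jq`; the JSON field names follow the in-memory telemetry sink
and may differ between Nomad versions:

```
# Spot-check blocked evaluations and plan queue depth on a server.
curl -s http://localhost:4646/v1/metrics |
  jq '.Gauges[] | select(.Name | test("broker.total_blocked|plan.queue_depth"))'
```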

## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom of 20% or
more. The following metrics can be used to assess capacity across the cluster
on a per-client basis.

- **nomad.client.allocated.cpu**
- **nomad.client.unallocated.cpu**
- **nomad.client.allocated.disk**
- **nomad.client.unallocated.disk**
- **nomad.client.allocated.iops**
- **nomad.client.unallocated.iops**
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**
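
These gauges are reported by each client agent. As a quick spot check without a
metrics pipeline, a client's allocated resources and host utilization can also
be inspected from the CLI (the node ID prefix below is hypothetical):

```
# Show allocated CPU, memory, and disk for the client; -stats adds detailed
# host resource utilization.
nomad node status -stats 4ae5d9c1
```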

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to
alert when the CPU is at or above the reserved resources for the task.

## Job and Task Status

See [Job Summary Metrics] for monitoring the health and status of workloads
running on Nomad.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure.

- **nomad.runtime.num_goroutines**
- **nomad.runtime.heap_objects**
- **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

If these metrics increase, look at CPU load on the servers and network
latency and packet loss for the [Serf] address.
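
Gossip membership can also be checked directly from any server: `nomad server
members` lists every server in the global gossip pool along with its Serf
status, so flapping members surface as `failed` or `left` entries:

```
# List all servers in the gossip pool, their status, and their region.
nomad server members
```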

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/operations/metrics-reference#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[draining]: https://learn.hashicorp.com/tutorials/nomad/node-drain
[gh-11830]: https://github.com/hashicorp/nomad/pull/11830
[metric-types]: /docs/operations/metrics-reference#metric-types
[metrics-api-endpoint]: /api-docs/metrics
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[`plan_rejection_tracker`]: /docs/configuration/server#plan_rejection_tracker
[serf]: /docs/configuration#serf-1
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/operations/metrics-reference#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry
[Consensus Protocol (Raft)]: /docs/operations/monitoring-nomad#consensus-protocol-raft
[Job Summary Metrics]: /docs/operations/metrics-reference#job-summary-metrics
[Scheduling]: /docs/concepts/scheduling/scheduling