---
layout: docs
page_title: Telemetry Overview
sidebar_title: Telemetry
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. Operators can use this data to gain
real-time visibility into their cluster and improve performance. Additionally,
Nomad operators can set up monitoring and alerting based on these metrics in
order to respond to any changes in the cluster state.

On the server side, leaders and followers have metrics in common as well as
metrics that are specific to their roles. Clients have separate metrics for
host resources and for allocations/tasks, both of which must be [explicitly
enabled][telemetry-stanza]. There are also runtime metrics that are common to
all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Nomad supports [gauges, counters, and
timers][metric-types].

There are three ways to obtain metrics from Nomad:

- Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
  the current Nomad process (as of Nomad 0.7). This endpoint supports
  Prometheus-formatted metrics (see the example commands at the end of this
  section).
- Send the `USR1` signal to the Nomad process. This dumps the current telemetry
  information to `STDERR` (on Linux).
- Configure Nomad to automatically forward metrics to a third-party provider.

Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [Datadog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Rather than supporting alerting natively, Nomad
aims to surface the metrics that enable users to configure the necessary alerts
in their existing monitoring systems. Here are a few common patterns:

- Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

- Periodically submit test jobs into Nomad to determine whether your
  application deployment pipeline is working end-to-end. This pattern is
  well-suited to batch processing workloads.

- Deploy Nagios on Nomad. Centrally manage Nomad job files and add a Nagios
  monitor when a new Nomad job is added; when a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

- Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it goes back to passing.
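Whichever pattern you adopt, it is useful to confirm that the agent is actually
emitting the metrics you expect before wiring up alerts. The commands below are
a minimal sketch of the first two approaches listed above; the default HTTP
address `http://localhost:4646`, the `pgrep` lookup, and the Prometheus format
(which assumes `prometheus_metrics` is enabled in the [telemetry
stanza][telemetry-stanza]) are assumptions to adjust for your environment.

```shell-session
$ # Return metrics for the local agent in JSON format
$ curl http://localhost:4646/v1/metrics

$ # Return metrics in Prometheus format (assumes prometheus_metrics is enabled)
$ curl 'http://localhost:4646/v1/metrics?format=prometheus'

$ # Dump the current telemetry information to the agent's STDERR (Linux)
$ kill -USR1 $(pgrep -x nomad)
```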
# Key Performance Indicators

The sections below cover a number of important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The `nomad.raft.leader.lastContact`
metric is a general indicator of Raft latency that can be used to observe how
Raft timing is performing and guide the decision to upgrade to more powerful
servers. `nomad.raft.leader.lastContact` should not get too close to the leader
lease timeout of 500ms.

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf
library to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable
indicator that membership is unstable.

## Scheduling

The following metrics allow an operator to observe changes in throughput at the
various points in the scheduling process (evaluation, scheduling/planning, and
placement):

- **nomad.broker.total_blocked** - The number of blocked evaluations.
- **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler of
  the given type.
- **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
- **nomad.plan.submit** - The time to submit a scheduler Plan.
- **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.

## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom at 20% or
more. The following metrics can be used to assess capacity across the cluster
on a per-client basis:

- **nomad.client.allocated.cpu**
- **nomad.client.unallocated.cpu**
- **nomad.client.allocated.disk**
- **nomad.client.unallocated.disk**
- **nomad.client.allocated.iops**
- **nomad.client.unallocated.iops**
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to
alert when CPU usage is at or above the reserved resources for the task.

## Job and Task Status

We do not currently surface metrics for job and task/allocation status, although
we will consider adding metrics where it makes sense.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:

- **nomad.runtime.num_goroutines**
- **nomad.runtime.heap_objects**
- **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.
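Outside of a full monitoring pipeline, these runtime metrics can be spot-checked
directly from the [metrics API endpoint][metrics-api-endpoint]. The following is
a rough sketch that assumes a local agent on the default port and `jq`
installed; note that metric names may also embed the agent's hostname unless
`disable_hostname` is set in the [telemetry stanza][telemetry-stanza].

```shell-session
$ # List the runtime gauges currently reported by the local agent
$ curl -s http://localhost:4646/v1/metrics | \
    jq '.Gauges[] | select(.Name | contains("runtime"))'
```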
[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[metrics-api-endpoint]: /api-docs/metrics
[metric-types]: /docs/telemetry/metrics#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry