---
layout: docs
page_title: Telemetry Overview
sidebar_title: Telemetry
description: |-
  Overview of runtime metrics available in Nomad along with monitoring and
  alerting.
---

# Telemetry Overview

The Nomad client and server agents collect a wide range of runtime metrics
related to the performance of the system. Operators can use this data to gain
real-time visibility into their cluster and improve performance. Additionally,
Nomad operators can set up monitoring and alerting based on these metrics in
order to respond to any changes in the cluster state.

On the server side, leaders and followers have metrics in common as well as
metrics that are specific to their roles. Clients have separate metrics for
host resources and for allocations/tasks, both of which have to be
[explicitly enabled][telemetry-stanza]. There are also runtime metrics that
are common to all servers and clients.

By default, the Nomad agent collects telemetry data at a [1 second
interval][collection-interval]. Note that Nomad supports [gauges, counters,
and timers][metric-types].
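
Host and allocation metrics, as well as the collection interval, are controlled
by the agent's [telemetry stanza][telemetry-stanza]. A minimal sketch might look
like the following (the interval shown is just the default made explicit):

```hcl
telemetry {
  # How often the agent aggregates metrics; "1s" is the default.
  collection_interval = "1s"

  # Host (node) and allocation/task metrics are not published by default
  # and must be enabled explicitly.
  publish_node_metrics       = true
  publish_allocation_metrics = true
}
```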

There are three ways to obtain metrics from Nomad:

- Query the [/metrics API endpoint][metrics-api-endpoint] to return metrics for
  the current Nomad process (as of Nomad 0.7). This endpoint supports
  Prometheus-formatted metrics, as shown in the example after this list.
- Send the USR1 signal to the Nomad process. This will dump the current
  telemetry information to STDERR (on Linux).
- Configure Nomad to automatically forward metrics to a third-party provider.
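
As a sketch of the first two approaches, the commands below assume a Nomad agent
listening on the default HTTP address `127.0.0.1:4646`, a single local `nomad`
process, and that `prometheus_metrics` has been enabled in the telemetry stanza
(see the configuration example further below):

```shell-session
$ # Fetch the agent's current metrics in Prometheus exposition format;
$ # omit the format parameter to get JSON instead.
$ curl -s "http://127.0.0.1:4646/v1/metrics?format=prometheus"

$ # Dump the current telemetry snapshot to the agent's STDERR (Linux).
$ kill -USR1 $(pgrep nomad)
```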

Nomad 0.7 added support for [tagged metrics][tagged-metrics], improving the
integrations with [DataDog][datadog-telem] and [Prometheus][prometheus-telem].
Metrics can also be forwarded to [Statsite][statsite-telem],
[StatsD][statsd-telem], and [Circonus][circonus-telem].
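
Forwarding is also configured in the telemetry stanza. The sketch below enables
the Prometheus endpoint and streams metrics to a hypothetical local DogStatsD
agent; the address is a placeholder for your own collector:

```hcl
telemetry {
  # Expose /v1/metrics?format=prometheus for Prometheus to scrape.
  prometheus_metrics = true

  # Stream metrics to a local DogStatsD agent (placeholder address).
  datadog_address = "127.0.0.1:8125"

  # statsd_address, statsite_address, and the circonus_* options configure
  # the other supported sinks in the same way.
}
```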

## Alerting

The recommended practice for alerting is to leverage the alerting capabilities
of your monitoring provider. Rather than supporting alerting natively, Nomad
aims to surface the metrics that let users configure the necessary alerts in
their existing monitoring systems. Here are a few common patterns:

- Export metrics from Nomad to Prometheus using the [StatsD
  exporter][statsd-exporter], define [alerting rules][alerting-rules] in
  Prometheus, and use [Alertmanager][alertmanager] for summarization and
  routing/notifications (to PagerDuty, Slack, etc.). A similar workflow is
  supported for [Datadog][datadog-alerting].

- Periodically submit test jobs into Nomad to determine if your application
  deployment pipeline is working end-to-end. This pattern is well-suited to
  batch processing workloads.

- Deploy Nagios on Nomad. Centrally manage Nomad job files and add the Nagios
  monitor when a new Nomad job is added. When a job is removed, remove the
  Nagios monitor. Map Consul alerts to the Nagios monitor. This provides a
  job-specific alerting system.

- Write a script that looks at the history of each batch job to determine
  whether or not the job is in an unhealthy state, updating your monitoring
  system as appropriate. In many cases, it may be acceptable for a given batch
  job to fail occasionally, as long as it returns to passing.

# Key Performance Indicators

The sections below cover a number of important metrics.

## Consensus Protocol (Raft)

Nomad uses the Raft consensus protocol for leader election and state
replication. Spurious leader elections can be caused by networking issues
between the servers or insufficient CPU resources. Users in cloud environments
often bump their servers up to the next instance class with improved networking
and CPU to stabilize leader elections. The `nomad.raft.leader.lastContact`
metric is a general indicator of Raft latency that can be used to observe how
Raft timing is performing and to guide the decision to upgrade to more powerful
servers. `nomad.raft.leader.lastContact` should not get too close to the leader
lease timeout of 500ms.
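
As a quick spot check on a server, the metric can be pulled from the metrics
endpoint. This assumes the Prometheus format is enabled and that the dots in the
metric name are rendered as underscores in that output:

```shell-session
$ # Summary statistics for leader contact latency, reported in milliseconds.
$ curl -s "http://127.0.0.1:4646/v1/metrics?format=prometheus" \
    | grep raft_leader_lastContact
```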

## Federated Deployments (Serf)

Nomad uses the membership and failure detection capabilities of the Serf library
to maintain a single, global gossip pool for all servers in a federated
deployment. An uptick in `member.flap` and/or `msg.suspect` is a reliable indicator
that membership is unstable.

## Scheduling

The following metrics allow an operator to observe changes in throughput at the
various points in the scheduling process (evaluation, scheduling/planning, and
placement):

- **nomad.broker.total_blocked** - The number of blocked evaluations.
- **nomad.worker.invoke_scheduler.\<type\>** - The time to run the scheduler of
  the given type.
- **nomad.plan.evaluate** - The time to evaluate a scheduler Plan.
- **nomad.plan.submit** - The time to submit a scheduler Plan.
- **nomad.plan.queue_depth** - The number of scheduler Plans waiting to be
  evaluated.

Upticks in any of the above metrics indicate a decrease in scheduler throughput.

## Capacity

The importance of monitoring resource availability is workload specific. Batch
processing workloads often operate under the assumption that the cluster should
be at or near capacity, with queued jobs running as soon as adequate resources
become available. Clusters that are primarily responsible for long-running
services with an uptime requirement may want to maintain headroom at 20% or
more. The following metrics can be used to assess capacity across the cluster on
a per-client basis:

- **nomad.client.allocated.cpu**
- **nomad.client.unallocated.cpu**
- **nomad.client.allocated.disk**
- **nomad.client.unallocated.disk**
- **nomad.client.allocated.iops**
- **nomad.client.unallocated.iops**
- **nomad.client.allocated.memory**
- **nomad.client.unallocated.memory**
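
As a sketch of a simple headroom check, the allocated/unallocated pair for a
single resource can be pulled from the Prometheus-formatted output on a client
(assuming, as above, that `prometheus_metrics` is enabled and that dots become
underscores):

```shell-session
$ # Allocated vs. unallocated memory gauges for this client (in megabytes).
$ curl -s "http://127.0.0.1:4646/v1/metrics?format=prometheus" \
    | grep -E "client_(un)?allocated_memory"
```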

## Task Resource Consumption

The metrics listed [here][allocation-metrics] can be used to track resource
consumption on a per-task basis. For user-facing services, it is common to
alert when CPU usage is at or above the resources reserved for the task.

## Job and Task Status

We do not currently surface metrics for job and task/allocation status, although
we will consider adding metrics where it makes sense.

## Runtime Metrics

Runtime metrics apply to all clients and servers. The following metrics are
general indicators of load and memory pressure:

- **nomad.runtime.num_goroutines**
- **nomad.runtime.heap_objects**
- **nomad.runtime.alloc_bytes**

It is recommended to alert on upticks in any of the above, server memory usage
in particular.

[alerting-rules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[allocation-metrics]: /docs/telemetry/metrics#allocation-metrics
[circonus-telem]: /docs/configuration/telemetry#circonus
[collection-interval]: /docs/configuration/telemetry#collection_interval
[datadog-alerting]: https://www.datadoghq.com/blog/monitoring-101-alerting/
[datadog-telem]: /docs/configuration/telemetry#datadog
[prometheus-telem]: /docs/configuration/telemetry#prometheus
[metrics-api-endpoint]: /api-docs/metrics
[metric-types]: /docs/telemetry/metrics#metric-types
[statsd-exporter]: https://github.com/prometheus/statsd_exporter
[statsd-telem]: /docs/configuration/telemetry#statsd
[statsite-telem]: /docs/configuration/telemetry#statsite
[tagged-metrics]: /docs/telemetry/metrics#tagged-metrics
[telemetry-stanza]: /docs/configuration/telemetry