---
layout: "docs"
page_title: "Telemetry"
sidebar_current: "docs-agent-telemetry"
description: |-
  Learn about the telemetry data available in Nomad.
---

# Telemetry

The Nomad agent collects various runtime metrics about the performance of
different libraries and subsystems. These metrics are aggregated on a
ten-second interval and are retained for one minute.

To view this data, send a signal to the Nomad process: `USR1` on Unix,
`BREAK` on Windows. When Nomad receives the signal, it dumps the current
telemetry information to the agent's `stderr`.
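
As a minimal sketch of the Unix case, the signal can be sent from a script.
The helper name `request_telemetry_dump` is hypothetical, and the caller is
assumed to already know the PID of the Nomad agent:

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Send SIGUSR1 to the process with the given PID (Unix only).

    The dump itself is written by the agent to its own stderr; this
    helper only delivers the signal and returns nothing.
    """
    os.kill(pid, signal.SIGUSR1)
```

For example, `request_telemetry_dump(12345)` would signal a process with the
(made-up) PID 12345. This is equivalent to running `kill -USR1 12345` in a
shell.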

This telemetry information can be used for debugging or otherwise
getting a better view of what Nomad is doing.

Telemetry information can also be streamed to
[statsite](https://github.com/armon/statsite) and statsd by providing the
appropriate configuration options.

Below is sample output of a telemetry dump:

```text
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```
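
If the dump needs to be processed by a script, each line can be split into a
timestamp, a type tag, a metric name, and its values. The following is a
rough sketch based only on the line layout shown above. The function name
`parse_dump_line` and the `Value` key are hypothetical, and it assumes that
the `[G]`, `[C]`, and `[S]` tags mark gauges, counters, and timer samples,
which the sample suggests but this page does not state:

```python
import re

# "[timestamp][tag] 'metric.name': remainder", as seen in the sample dump
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<kind>[GCS])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
# "Key: 1.23" pairs such as "Count: 3" or "Mean: 0.018"
FIELD_RE = re.compile(r"(\w+): (-?[\d.]+)")


def parse_dump_line(line):
    """Parse one dump line into a dict, or return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    kind, rest = match.group("kind"), match.group("rest")
    if kind == "G":
        # Gauge lines carry a single bare number
        fields = {"Value": float(rest)}
    else:
        # Counter and sample lines carry labelled numbers
        fields = {key: float(value) for key, value in FIELD_RE.findall(rest)}
    return {
        "timestamp": match.group("ts"),
        "kind": kind,
        "name": match.group("name"),
        "fields": fields,
    }
```

Lines that do not follow this layout, such as ordinary log output on the same
`stderr` stream, are returned as `None` so that callers can skip them.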

# Key Metrics

When telemetry is streamed to statsite or statsd, `interval` is their flush
interval. Otherwise, when metrics are retrieved with the signals described
above, the interval can be assumed to be 10 seconds.

<table class="table table-bordered table-striped">
  <tr>
    <th>Metric</th>
    <th>Description</th>
    <th>Unit</th>
    <th>Type</th>
  </tr>
  <tr>
    <td>`nomad.runtime.num_goroutines`</td>
    <td>Number of goroutines and general load pressure indicator</td>
    <td># of goroutines</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.runtime.alloc_bytes`</td>
    <td>Memory utilization</td>
    <td># of bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.runtime.heap_objects`</td>
    <td>Number of objects on the heap. General memory pressure indicator</td>
    <td># of heap objects</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.raft.apply`</td>
    <td>Number of Raft transactions</td>
    <td>Raft transactions / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.raft.replication.appendEntries`</td>
    <td>Raft transaction commit time</td>
    <td>ms / Raft Log Append</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.raft.leader.lastContact`</td>
    <td>Time since last contact to leader. General indicator of Raft latency</td>
    <td>ms / Leader Contact</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.broker.total_ready`</td>
    <td>Number of evaluations ready to be processed</td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.broker.total_unacked`</td>
    <td>Evaluations dispatched for processing but incomplete</td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.broker.total_blocked`</td>
    <td>
        Evaluations that are blocked until an existing evaluation for the same
        job completes
    </td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.plan.queue_depth`</td>
    <td>Number of scheduler Plans waiting to be evaluated</td>
    <td># of plans</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.plan.submit`</td>
    <td>
        Time to submit a scheduler Plan. Higher values cause lower scheduling
        throughput
    </td>
    <td>ms / Plan Submit</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.plan.evaluate`</td>
    <td>
        Time to validate a scheduler Plan. Higher values cause lower scheduling
        throughput. Similar to `nomad.plan.submit` but does not include RPC time
        or time in the Plan Queue
    </td>
    <td>ms / Plan Evaluation</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.worker.invoke_scheduler.<type>`</td>
    <td>Time to run the scheduler of the given type</td>
    <td>ms / Scheduler Run</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.worker.wait_for_index`</td>
    <td>
        Time waiting for Raft log replication from leader. High delays result in
        lower scheduling throughput
    </td>
    <td>ms / Raft Index Wait</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.heartbeat.active`</td>
    <td>
        Number of active heartbeat timers. Each timer represents a Nomad Client
        connection
    </td>
    <td># of heartbeat timers</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.heartbeat.invalidate`</td>
    <td>
        The length of time it takes to invalidate a Nomad Client due to failed
        heartbeats
    </td>
    <td>ms / Heartbeat Invalidation</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.rpc.query`</td>
    <td>Number of RPC queries</td>
    <td>RPC Queries / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.rpc.request`</td>
    <td>Number of RPC requests being handled</td>
    <td>RPC Requests / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.rpc.request_error`</td>
    <td>Number of RPC requests being handled that result in an error</td>
    <td>RPC Errors / `interval`</td>
    <td>Counter</td>
  </tr>
</table>
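
Because counters in the table above are reported per `interval`, comparing
values from sinks with different flush intervals usually means normalizing
them first. A small illustrative sketch follows; `per_second_rate` is a
hypothetical helper, and its 10-second default only matches the signal-based
dump, so a statsite or statsd flush interval would have to be passed
explicitly:

```python
def per_second_rate(count, interval_seconds=10.0):
    """Convert a per-interval counter value into a per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return count / interval_seconds
```

For instance, the `Count: 2` reported for the RPC query counter in the sample
dump corresponds to 2 / 10 = 0.2 queries per second over that interval.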

# Metric Types

<table class="table table-bordered table-striped">
  <tr>
    <th>Type</th>
    <th>Description</th>
    <th>Quantiles</th>
  </tr>
  <tr>
    <td>Gauge</td>
    <td>
        Gauge types report an absolute number at the end of the aggregation
        interval
    </td>
    <td>false</td>
  </tr>
  <tr>
    <td>Counter</td>
    <td>
        Counts are incremented and flushed at the end of the aggregation
        interval and then are reset to zero
    </td>
    <td>true</td>
  </tr>
  <tr>
    <td>Timer</td>
    <td>
        Timers measure the time to complete a task and will include quantiles,
        means, standard deviation, etc. per interval
    </td>
    <td>true</td>
  </tr>
</table>
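
To make the Timer row more concrete, the sketch below computes the kind of
per-interval summary that appears in the sample dump (`Count`, `Min`, `Mean`,
`Max`, `Stddev`, `Sum`) from a list of raw durations. It is an illustration,
not Nomad's implementation: `summarize_timer` is a hypothetical name,
quantiles are left out, and the use of the sample standard deviation is an
assumption that happens to be consistent with the figures in the dump:

```python
import math
import statistics


def summarize_timer(samples):
    """Summarize the timer samples recorded during one interval."""
    if not samples:
        raise ValueError("at least one sample is required")
    summary = {
        "Count": len(samples),
        "Min": min(samples),
        "Mean": statistics.mean(samples),
        "Max": max(samples),
        "Sum": math.fsum(samples),
    }
    # A standard deviation needs at least two samples
    if len(samples) > 1:
        summary["Stddev"] = statistics.stdev(samples)
    return summary
```

As a rough check, three made-up durations of 0.007, 0.008, and 0.039 ms give
a count of 3, a sum of 0.054, and a mean and standard deviation that both
round to 0.018, which lines up with the `nomad.raft.leader.dispatchLog` line
in the sample output.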