---
layout: "docs"
page_title: "Telemetry"
sidebar_current: "docs-agent-telemetry"
description: |-
  Learn about the telemetry data available in Nomad.
---

# Telemetry

The Nomad agent collects various runtime metrics about the performance of
different libraries and subsystems. These metrics are aggregated on a
ten-second interval and are retained for one minute.

To view this data, you must send a signal to the Nomad process: on Unix,
this is `USR1`, while on Windows it is `BREAK`. Once Nomad receives the signal,
it dumps the current telemetry information to the agent's `stderr`.

This telemetry information can be used for debugging or otherwise
getting a better view of what Nomad is doing.

Telemetry information can also be streamed to
[statsite](https://github.com/armon/statsite) or statsd by providing the
appropriate configuration options.
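
On Unix, the signal can be sent from a script as well as from a shell. The
following is a minimal sketch in Python, assuming the agent's process ID is
already known; the helper name `request_telemetry_dump` is hypothetical and not
part of Nomad.

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Ask the process with the given PID to dump telemetry (Unix only)."""
    # SIGUSR1 is the Unix signal named in this document; Windows uses BREAK
    # instead, which this sketch does not cover.
    os.kill(pid, signal.SIGUSR1)
```

The dump is written to the agent's `stderr`, so nothing is returned to the
caller; the output has to be read from wherever the agent's `stderr` is
captured.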

Below is sample output of a telemetry dump:

```text
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```

# Key Metrics

When telemetry is being streamed to statsite or statsd, `interval` is defined to
be their flush interval. Otherwise, the interval can be assumed to be 10 seconds
when retrieving metrics using the signals described above.
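
Because counter values are reported per `interval`, a per-second rate can be
derived by dividing a counter's `Sum` by the interval length. The sketch below
illustrates the arithmetic in Python, assuming the 10-second interval that
applies to signal-triggered dumps; the function and constant names are
hypothetical.

```python
# Aggregation interval for signal-triggered dumps, per this document.
DEFAULT_INTERVAL_SECONDS = 10.0


def per_second_rate(counter_sum: float,
                    interval_seconds: float = DEFAULT_INTERVAL_SECONDS) -> float:
    """Convert a counter's per-interval Sum into an average per-second rate."""
    if interval_seconds <= 0:
        raise ValueError("interval_seconds must be positive")
    return counter_sum / interval_seconds


# Sample dump line: 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
print(per_second_rate(2.0))  # 0.2
```

When streaming to statsite or statsd, pass their configured flush interval as
`interval_seconds` instead of relying on the default.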

<table class="table table-bordered table-striped">
  <tr>
    <th>Metric</th>
    <th>Description</th>
    <th>Unit</th>
    <th>Type</th>
  </tr>
  <tr>
    <td>`nomad.runtime.num_goroutines`</td>
    <td>Number of goroutines. General load pressure indicator</td>
    <td># of goroutines</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.runtime.alloc_bytes`</td>
    <td>Memory utilization</td>
    <td># of bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.runtime.heap_objects`</td>
    <td>Number of objects on the heap. General memory pressure indicator</td>
    <td># of heap objects</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.raft.apply`</td>
    <td>Number of Raft transactions</td>
    <td>Raft transactions / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.raft.replication.appendEntries`</td>
    <td>Raft transaction commit time</td>
    <td>ms / Raft Log Append</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.raft.leader.lastContact`</td>
    <td>Time since last contact to leader. General indicator of Raft latency</td>
    <td>ms / Leader Contact</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.nomad.broker.total_ready`</td>
    <td>Number of evaluations ready to be processed</td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.nomad.broker.total_unacked`</td>
    <td>Evaluations dispatched for processing but incomplete</td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.nomad.broker.total_blocked`</td>
    <td>
      Evaluations that are blocked until an existing evaluation for the same
      job completes
    </td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.nomad.plan.queue_depth`</td>
    <td>Number of scheduler Plans waiting to be evaluated</td>
    <td># of plans</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.nomad.plan.submit`</td>
    <td>
      Time to submit a scheduler Plan.
      Higher values cause lower scheduling throughput
    </td>
    <td>ms / Plan Submit</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.nomad.plan.evaluate`</td>
    <td>
      Time to validate a scheduler Plan. Higher values cause lower scheduling
      throughput. Similar to `nomad.nomad.plan.submit` but does not include
      RPC time or time in the Plan Queue
    </td>
    <td>ms / Plan Evaluation</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.nomad.worker.invoke_scheduler.<type>`</td>
    <td>Time to run the scheduler of the given type</td>
    <td>ms / Scheduler Run</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.nomad.worker.wait_for_index`</td>
    <td>
      Time waiting for Raft log replication from leader. High delays result in
      lower scheduling throughput
    </td>
    <td>ms / Raft Index Wait</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.nomad.heartbeat.active`</td>
    <td>
      Number of active heartbeat timers. Each timer represents a Nomad Client
      connection
    </td>
    <td># of heartbeat timers</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.nomad.heartbeat.invalidate`</td>
    <td>
      The length of time it takes to invalidate a Nomad Client due to failed
      heartbeats
    </td>
    <td>ms / Heartbeat Invalidation</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.nomad.rpc.query`</td>
    <td>Number of RPC queries</td>
    <td>RPC Queries / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.nomad.rpc.request`</td>
    <td>Number of RPC requests being handled</td>
    <td>RPC Requests / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.nomad.rpc.request_error`</td>
    <td>Number of RPC requests being handled that result in an error</td>
    <td>RPC Errors / `interval`</td>
    <td>Counter</td>
  </tr>
</table>

# Metric Types

<table class="table table-bordered table-striped">
  <tr>
    <th>Type</th>
    <th>Description</th>
    <th>Quantiles</th>
  </tr>
  <tr>
    <td>Gauge</td>
    <td>
      Gauge types report an absolute number at the end of the aggregation
      interval
    </td>
    <td>false</td>
  </tr>
  <tr>
    <td>Counter</td>
    <td>
      Counters are incremented during the aggregation interval, flushed at
      its end, and then reset to zero
    </td>
    <td>true</td>
  </tr>
  <tr>
    <td>Timer</td>
    <td>
      Timers measure the time to complete a task and include quantiles,
      means, standard deviation, etc. per interval
    </td>
    <td>true</td>
  </tr>
</table>
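
In the sample dump earlier on this page, each line carries a bracketed tag:
`[G]` appears on gauges, `[C]` on counters, and `[S]` on timer samples. The
following Python sketch parses one dump line into its tag, metric name, and
numeric fields. It assumes the line layout matches the sample output exactly;
the function name and the `Value` key used for gauges are hypothetical.

```python
import re

# Matches: [timestamp][TAG] 'metric.name': remainder
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<kind>[A-Z])\] '(?P<name>[^']+)': (?P<rest>.*)$"
)
# Matches labelled numbers such as "Count: 3" or "Mean: 0.018"
FIELD_RE = re.compile(r"(\w+): (-?\d+(?:\.\d+)?)")


def parse_dump_line(line):
    """Return a dict describing one dump line, or None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    rest = match.group("rest")
    fields = {key: float(value) for key, value in FIELD_RE.findall(rest)}
    if not fields:
        # Gauge lines carry a single bare number rather than labelled fields.
        fields = {"Value": float(rest)}
    return {
        "timestamp": match.group("ts"),
        "kind": match.group("kind"),
        "name": match.group("name"),
        "fields": fields,
    }
```

For example, passing the `nomad.raft.commitTime` line from the sample yields
`kind` set to `"S"` and a `fields` dict with `Count`, `Min`, `Mean`, `Max`,
`Stddev`, and `Sum` entries.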