---
layout: "docs"
page_title: "Telemetry"
sidebar_current: "docs-agent-telemetry"
description: |-
  Learn about the telemetry data available in Nomad.
---

# Telemetry

The Nomad agent collects various runtime metrics about the performance of
different libraries and subsystems. These metrics are aggregated on a ten
second interval and are retained for one minute.

To view this data, you must send a signal to the Nomad process: on Unix,
this is `USR1` while on Windows it is `BREAK`. Once Nomad receives the signal,
it will dump the current telemetry information to the agent's `stderr`.

This telemetry information can be used for debugging or otherwise
getting a better view of what Nomad is doing.

Telemetry information can be streamed to [statsite](https://github.com/armon/statsite)
or statsd by providing the appropriate configuration options.

To configure the telemetry output please see the [agent
configuration](/docs/agent/config.html#telemetry_config).
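On Unix, the `USR1` signal described above can also be sent from a short script. The following is a minimal sketch rather than anything shipped with Nomad: the helper name `request_telemetry_dump` is hypothetical, the caller is assumed to already know the agent's PID, and Windows is deliberately left out because `BREAK` is delivered differently there.

```python
import os
import signal


def request_telemetry_dump(pid: int) -> None:
    """Send USR1 to the process with the given PID (Unix only).

    Hypothetical helper: the PID of the Nomad agent must be supplied by
    the caller. The dump is written to the agent's stderr, so nothing is
    returned here.
    """
    if not hasattr(signal, "SIGUSR1"):
        # SIGUSR1 is not defined by Python's signal module on Windows.
        raise OSError("SIGUSR1 is not available on this platform")
    os.kill(pid, signal.SIGUSR1)
```

Calling `request_telemetry_dump(pid)` is the scripted equivalent of running `kill -USR1 <pid>` from a shell; the output still has to be read from wherever the agent's `stderr` is being captured.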

Below is sample output of a telemetry dump:

```text
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_blocked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.plan.queue_depth': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.malloc_count': 7568.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_runs': 8.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_ready': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.num_goroutines': 56.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.sys_bytes': 3999992.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.heap_objects': 4135.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.heartbeat.active': 1.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_unacked': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.nomad.broker.total_waiting': 0.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.alloc_bytes': 634056.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.free_count': 3433.000
[2015-09-17 16:59:40 -0700 PDT][G] 'nomad.runtime.total_gc_pause_ns': 6572135.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.memberlist.msg.alive': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.serf.member.join': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.barrier': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
[2015-09-17 16:59:40 -0700 PDT][C] 'nomad.nomad.rpc.query': Count: 2 Sum: 2.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Query': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.fsm.register_node': Count: 1 Sum: 1.296
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Intent': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.runtime.gc_pause_ns': Count: 8 Min: 126492.000 Mean: 821516.875 Max: 3126670.000 Stddev: 1139250.294 Sum: 6572135.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.leader.dispatchLog': Count: 3 Min: 0.007 Mean: 0.018 Max: 0.039 Stddev: 0.018 Sum: 0.054
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcileMember': Count: 1 Sum: 0.007
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.reconcile': Count: 1 Sum: 0.025
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.fsm.apply': Count: 1 Sum: 1.306
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.get_allocs': Count: 1 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.worker.dequeue_eval': Count: 29 Min: 0.003 Mean: 363.426 Max: 503.377 Stddev: 228.126 Sum: 10539.354
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.serf.queue.Event': Count: 6 Sum: 0.000
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.raft.commitTime': Count: 3 Min: 0.013 Mean: 0.037 Max: 0.079 Stddev: 0.037 Sum: 0.110
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.leader.barrier': Count: 1 Sum: 0.071
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.client.register': Count: 1 Sum: 1.626
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.nomad.eval.dequeue': Count: 21 Min: 500.610 Mean: 501.753 Max: 503.361 Stddev: 1.030 Sum: 10536.813
[2015-09-17 16:59:40 -0700 PDT][S] 'nomad.memberlist.gossip': Count: 12 Min: 0.009 Mean: 0.017 Max: 0.025 Stddev: 0.005 Sum: 0.204
```

# Key Metrics

When telemetry is being streamed to statsite or statsd, `interval` is defined to
be their flush interval. Otherwise, the interval can be assumed to be 10 seconds
when retrieving metrics using the above described signals.
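When working from a signal-triggered dump rather than statsite or statsd, the lines in the sample output can be parsed and counters turned into per-second rates using the 10 second interval mentioned above. The sketch below is illustrative only: `parse_line` and `per_second` are hypothetical names, and reading the `[G]`, `[C]`, and `[S]` tags as gauge, counter, and sample is an assumption drawn from the sample output, not a documented format guarantee.

```python
import re
from typing import Dict, NamedTuple

# Matches lines such as:
# [2015-09-17 16:59:40 -0700 PDT][C] 'nomad.raft.apply': Count: 1 Sum: 1.000
LINE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\[(?P<kind>[GCS])\]\s+'(?P<name>[^']+)':\s*(?P<rest>.*)$"
)
# Matches "Label: number" pairs such as "Count: 3" or "Mean: 0.037".
FIELD_RE = re.compile(r"(\w+):\s*(-?\d+(?:\.\d+)?)")

# Assumed meaning of the single-letter tags in the dump.
KINDS = {"G": "gauge", "C": "counter", "S": "sample"}


class Metric(NamedTuple):
    timestamp: str
    kind: str
    name: str
    fields: Dict[str, float]


def parse_line(line: str) -> Metric:
    """Parse one line of a telemetry dump into a Metric."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized telemetry line: {line!r}")
    rest = match.group("rest")
    if match.group("kind") == "G":
        # Gauges carry a single bare number.
        fields = {"Value": float(rest)}
    else:
        # Counters and samples carry labelled fields.
        fields = {key: float(value) for key, value in FIELD_RE.findall(rest)}
    return Metric(
        match.group("ts"), KINDS[match.group("kind")], match.group("name"), fields
    )


def per_second(metric: Metric, interval_seconds: float = 10.0) -> float:
    """Convert a counter's Sum over one interval into a per-second rate."""
    if metric.kind != "counter":
        raise ValueError("rates are only computed for counters")
    return metric.fields["Sum"] / interval_seconds
```

When the data comes from statsite or statsd instead, `interval_seconds` would need to be set to that sink's flush interval.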

<table class="table table-bordered table-striped">
  <tr>
    <th>Metric</th>
    <th>Description</th>
    <th>Unit</th>
    <th>Type</th>
  </tr>
  <tr>
    <td>`nomad.runtime.num_goroutines`</td>
    <td>Number of goroutines and general load pressure indicator</td>
    <td># of goroutines</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.runtime.alloc_bytes`</td>
    <td>Memory utilization</td>
    <td># of bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.runtime.heap_objects`</td>
    <td>Number of objects on the heap. General memory pressure indicator</td>
    <td># of heap objects</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.raft.apply`</td>
    <td>Number of Raft transactions</td>
    <td>Raft transactions / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.raft.replication.appendEntries`</td>
    <td>Raft transaction commit time</td>
    <td>ms / Raft Log Append</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.raft.leader.lastContact`</td>
    <td>Time since last contact to leader. General indicator of Raft latency</td>
    <td>ms / Leader Contact</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.broker.total_ready`</td>
    <td>Number of evaluations ready to be processed</td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.broker.total_unacked`</td>
    <td>Evaluations dispatched for processing but incomplete</td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.broker.total_blocked`</td>
    <td>
      Evaluations that are blocked until an existing evaluation for the same job
      completes
    </td>
    <td># of evaluations</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.plan.queue_depth`</td>
    <td>Number of scheduler Plans waiting to be evaluated</td>
    <td># of plans</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.plan.submit`</td>
    <td>
      Time to submit a scheduler Plan.
      Higher values cause lower scheduling
      throughput
    </td>
    <td>ms / Plan Submit</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.plan.evaluate`</td>
    <td>
      Time to validate a scheduler Plan. Higher values cause lower scheduling
      throughput. Similar to `nomad.plan.submit` but does not include RPC time
      or time in the Plan Queue
    </td>
    <td>ms / Plan Evaluation</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.worker.invoke_scheduler.<type>`</td>
    <td>Time to run the scheduler of the given type</td>
    <td>ms / Scheduler Run</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.worker.wait_for_index`</td>
    <td>
      Time waiting for Raft log replication from leader. High delays result in
      lower scheduling throughput
    </td>
    <td>ms / Raft Index Wait</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.heartbeat.active`</td>
    <td>
      Number of active heartbeat timers. Each timer represents a Nomad Client
      connection
    </td>
    <td># of heartbeat timers</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.heartbeat.invalidate`</td>
    <td>
      The length of time it takes to invalidate a Nomad Client due to failed
      heartbeats
    </td>
    <td>ms / Heartbeat Invalidation</td>
    <td>Timer</td>
  </tr>
  <tr>
    <td>`nomad.rpc.query`</td>
    <td>Number of RPC queries</td>
    <td>RPC Queries / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.rpc.request`</td>
    <td>Number of RPC requests being handled</td>
    <td>RPC Requests / `interval`</td>
    <td>Counter</td>
  </tr>
  <tr>
    <td>`nomad.rpc.request_error`</td>
    <td>Number of RPC requests being handled that result in an error</td>
    <td>RPC Errors / `interval`</td>
    <td>Counter</td>
  </tr>
</table>

# Client Metrics

The Nomad client emits metrics related to the resource usage of the allocations
and tasks running on it and the node itself.
Operators have to explicitly turn on publishing of host and allocation
metrics by setting `publish_node_metrics` and `publish_allocation_metrics`
to `true`.

By default the collection interval is 1 second, but it can be changed by
setting the `collection_interval` key in the `telemetry` configuration block.

Please see the [agent configuration](/docs/agent/config.html#telemetry_config)
page for more details.

## Host Metrics

<table class="table table-bordered table-striped">
  <tr>
    <th>Metric</th>
    <th>Description</th>
    <th>Unit</th>
    <th>Type</th>
  </tr>
  <tr>
    <td>`nomad.client.host.memory.<HostID>.total`</td>
    <td>Total amount of physical memory on the node</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.memory.<HostID>.available`</td>
    <td>Total amount of memory available to processes which includes free and
    cached memory</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.memory.<HostID>.used`</td>
    <td>Amount of memory used by processes</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.memory.<HostID>.free`</td>
    <td>Amount of memory which is free</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.uptime.<HostID>`</td>
    <td>Uptime of the host running the Nomad client</td>
    <td>Seconds</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.cpu.<HostID>.<CPU-Core>.total`</td>
    <td>Total CPU utilization</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.cpu.<HostID>.<CPU-Core>.user`</td>
    <td>CPU utilization in the user space</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.cpu.<HostID>.<CPU-Core>.system`</td>
    <td>CPU
    utilization in the system space</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.cpu.<HostID>.<CPU-Core>.idle`</td>
    <td>Idle time spent by the CPU</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.disk.<HostID>.<Device-Name>.size`</td>
    <td>Total size of the device</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.disk.<HostID>.<Device-Name>.used`</td>
    <td>Amount of space which has been used</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.disk.<HostID>.<Device-Name>.available`</td>
    <td>Amount of space which is available</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.disk.<HostID>.<Device-Name>.used_percent`</td>
    <td>Percentage of disk space used</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.host.disk.<HostID>.<Device-Name>.inodes_percent`</td>
    <td>Disk space consumed by the inodes</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
</table>

## Allocation Metrics

<table class="table table-bordered table-striped">
  <tr>
    <th>Metric</th>
    <th>Description</th>
    <th>Unit</th>
    <th>Type</th>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.rss`</td>
    <td>Amount of RSS memory consumed by the task</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.cache`</td>
    <td>Amount of memory cached by the task</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.swap`</td>
    <td>Amount of memory swapped by the task</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.max_usage`</td>
    <td>Maximum amount of memory ever used by the task</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_usage`</td>
    <td>Amount of memory used by the kernel for this task</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.memory.kernel_max_usage`</td>
    <td>Maximum amount of memory ever used by the kernel for this task</td>
    <td>Bytes</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_percent`</td>
    <td>Total CPU resources consumed by the task across all cores</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.system`</td>
    <td>Total CPU resources consumed by the task in the system space</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.user`</td>
    <td>Total CPU resources consumed by the task in the user space</td>
    <td>Percentage</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.throttled_time`</td>
    <td>Total time that the task was throttled</td>
    <td>Nanoseconds</td>
    <td>Gauge</td>
  </tr>
  <tr>
    <td>`nomad.client.allocs.<Job>.<TaskGroup>.<AllocID>.<Task>.cpu.total_ticks`</td>
    <td>CPU ticks consumed by the process in the last collection interval</td>
    <td>Integer</td>
    <td>Gauge</td>
  </tr>
</table>

# Metric Types

<table class="table table-bordered table-striped">
  <tr>
    <th>Type</th>
    <th>Description</th>
    <th>Quantiles</th>
  </tr>
  <tr>
    <td>Gauge</td>
    <td>
      Gauge types report an absolute number at the end of the aggregation
      interval
    </td>
    <td>false</td>
  </tr>
  <tr>
    <td>Counter</td>
    <td>
      Counts are incremented and flushed at the end of the aggregation
      interval and then are reset to zero
    </td>
    <td>true</td>
  </tr>
  <tr>
    <td>Timer</td>
    <td>
      Timers measure the time to complete a task and will include quantiles,
      means, standard deviation, etc per interval.
    </td>
    <td>true</td>
  </tr>
</table>
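
The difference between the three types is easiest to see over two consecutive aggregation intervals. The toy aggregator below is a sketch of the semantics described in the table, not Nomad's implementation: the class `IntervalAggregator` and its methods are hypothetical, and quantiles are omitted to keep it short.

```python
import statistics
from typing import Dict, List


class IntervalAggregator:
    """Toy model of one aggregation interval (illustrative only)."""

    def __init__(self) -> None:
        self.gauges: Dict[str, float] = {}
        self.counters: Dict[str, float] = {}
        self.timers: Dict[str, List[float]] = {}

    def set_gauge(self, name: str, value: float) -> None:
        # A gauge keeps only the most recent absolute value.
        self.gauges[name] = value

    def incr_counter(self, name: str, amount: float = 1.0) -> None:
        # A counter accumulates within the current interval.
        self.counters[name] = self.counters.get(name, 0.0) + amount

    def add_timing(self, name: str, millis: float) -> None:
        # A timer records every observation so it can be summarized.
        self.timers.setdefault(name, []).append(millis)

    def flush(self) -> Dict[str, Dict[str, float]]:
        """Report the interval, then start counters and timers afresh."""
        report: Dict[str, Dict[str, float]] = {}
        for name, value in self.gauges.items():
            report[name] = {"value": value}
        for name, total in self.counters.items():
            report[name] = {"sum": total}
        for name, samples in self.timers.items():
            report[name] = {
                "count": len(samples),
                "min": min(samples),
                "max": max(samples),
                "mean": statistics.mean(samples),
            }
        # Counters and timers restart from nothing; gauges keep their value.
        self.counters.clear()
        self.timers.clear()
        return report
```

In this model, a gauge set in one interval is still reported in the next, while a counter or timer that sees no activity simply disappears from the following report.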