---
layout: "guides"
page_title: "Using Prometheus to Monitor Nomad Metrics"
sidebar_current: "guides-operations-monitoring-prometheus"
description: |-
  It is possible to collect metrics on Nomad with Prometheus after enabling
  telemetry on Nomad servers and clients.
---

# Using Prometheus to Monitor Nomad Metrics

This guide explains how to configure [Prometheus][prometheus] to integrate with
a Nomad cluster and Prometheus [Alertmanager][alertmanager]. While this guide
introduces the basics of enabling [telemetry][telemetry] and alerting, a Nomad
operator can go much further by customizing dashboards and integrating different
[receivers][receivers] for alerts.

## Reference Material

- [Configuring Prometheus][configuring prometheus]
- [Telemetry Stanza in Nomad Agent Configuration][telemetry stanza]
- [Alerting Overview][alerting]

## Estimated Time to Complete

25 minutes

## Challenge

Think of a scenario where a Nomad operator needs to deploy Prometheus to
collect metrics from a Nomad cluster. The operator must enable telemetry on
the Nomad servers and clients as well as configure Prometheus to use Consul
for service discovery. The operator must also configure Prometheus Alertmanager
so notifications can be sent out to a specified [receiver][receivers].

## Solution

Deploy Prometheus with a configuration that accounts for a highly dynamic
environment. Integrate service discovery into the configuration file to avoid
using hard-coded IP addresses. Place the Prometheus deployment behind
[fabio][fabio] (this will allow easy access to the Prometheus web interface
by allowing the Nomad operator to hit any of the client nodes at the `/` path).

## Prerequisites

To perform the tasks described in this guide, you need to have a Nomad
environment with Consul installed. You can use this
[repo](https://github.com/hashicorp/nomad/tree/master/terraform#provision-a-nomad-cluster-in-the-cloud)
to easily provision a sandbox environment. This guide will assume a cluster
with one server node and three client nodes.

-> **Please Note:** This guide is for demo purposes and only uses a single
server node. In a production cluster, 3 or 5 server nodes are recommended. The
alerting rules defined in this guide are for instructional purposes. Please
refer to [Alerting Rules][alertingrules] for more information.

## Steps

### Step 1: Enable Telemetry on Nomad Servers and Clients

Add the stanza below to your Nomad client and server configuration
files. If you have used the provided repo in this guide to set up a Nomad
cluster, the configuration file will be `/etc/nomad.d/nomad.hcl`.
After making this change, restart the Nomad service on each server and
client node.
```hcl
telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}
```

### Step 2: Create a Job for Fabio

Create a job for Fabio and name it `fabio.nomad`:

```hcl
job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
      }

      resources {
        cpu = 100
        memory = 64
        network {
          mbits = 20
          port "lb" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}
```

To learn more about fabio and the options used in this job file, see
[Load Balancing with Fabio][fabio-lb]. For the purpose of this guide, it is
important to note that the `type` option is set to [system][system] so that
fabio will be deployed on all client nodes. We have also set `network_mode` to
`host` so that fabio will be able to use Consul for service discovery.

### Step 3: Run the Fabio Job

We can now register our fabio job:

```shell
$ nomad job run fabio.nomad
==> Monitoring evaluation "7b96701e"
    Evaluation triggered by job "fabio"
    Allocation "d0e34682" created: node "28d7f859", group "fabio"
    Allocation "238ec0f7" created: node "510898b6", group "fabio"
    Allocation "9a2e8359" created: node "f3739267", group "fabio"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7b96701e" finished with status "complete"
```

At this point, you should be able to visit any one of your client nodes at port
`9998` and see the web interface for fabio. The routing table will be empty
since we have not yet deployed anything that fabio can route to.
Accordingly, if you visit any of the client nodes at port `9999` at this
point, you will get a `404` HTTP response. That will change soon.
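
You can also confirm the empty routing table from the command line. The check
below is an optional sanity test and is not part of the guide's setup; replace
`<client node IP>` with the address of any of your client nodes. The status
code should be `404` until fabio has a service to route to:

```shell
# Print only the HTTP status code returned by fabio's load balancer port
$ curl -s -o /dev/null -w "%{http_code}\n" http://<client node IP>:9999/
404
```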
### Step 4: Create a Job for Prometheus

Create a job for Prometheus and name it `prometheus.nomad`:

```hcl
job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "prometheus" {
      template {
        change_mode = "noop"
        destination = "local/prometheus.yml"
        data = <<EOH
---
global:
  scrape_interval: 5s
  evaluation_interval: 5s

scrape_configs:

  - job_name: 'nomad_metrics'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']
EOH
      }
      driver = "docker"
      config {
        image = "prom/prometheus:latest"
        volumes = [
          "local/prometheus.yml:/etc/prometheus/prometheus.yml"
        ]
        port_map {
          prometheus_ui = 9090
        }
      }
      resources {
        network {
          mbits = 10
          port "prometheus_ui" {}
        }
      }
      service {
        name = "prometheus"
        tags = ["urlprefix-/"]
        port = "prometheus_ui"
        check {
          name = "prometheus_ui port alive"
          type = "http"
          path = "/-/healthy"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
}
```

Notice we are using the [template][template] stanza to create a Prometheus
configuration using [environment][env] variables. In this case, we are using
the environment variable `NOMAD_IP_prometheus_ui` in the
[consul_sd_configs][consul_sd_config] section to ensure Prometheus can use
Consul to detect and scrape targets. This works in our example because Consul
is installed alongside Nomad. Additionally, we benefit from this configuration
by avoiding the need to hard-code IP addresses. If you did not use the repo
provided in this guide to create a Nomad cluster, be sure to point your
Prometheus configuration to a Consul server you have set up.

The [volumes][volumes] option allows us to take the configuration file we
dynamically created and place it in our Prometheus container.

### Step 5: Run the Prometheus Job

We can now register our job for Prometheus:

```shell
$ nomad job run prometheus.nomad
==> Monitoring evaluation "4e6b7127"
    Evaluation triggered by job "prometheus"
    Evaluation within deployment: "d3a651a7"
    Allocation "9725af3d" created: node "28d7f859", group "monitoring"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4e6b7127" finished with status "complete"
```

Prometheus is now deployed. You can visit any of your client nodes at port
`9999` to reach the web interface. There is only one instance of Prometheus
running in the Nomad cluster, but you are automatically routed to it
regardless of which node you visit because fabio is deployed and running on the
cluster as well.
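
The same kind of command-line check that returned a `404` in Step 3 should now
succeed, because fabio routes `/` to Prometheus. This optional check hits the
same `/-/healthy` path used by the job's service check (substitute one of your
client node IPs); a healthy instance returns `200`:

```shell
# Request Prometheus's health endpoint through fabio's load balancer port
$ curl -s -o /dev/null -w "%{http_code}\n" http://<client node IP>:9999/-/healthy
200
```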
At the top menu bar, click on `Status` and then `Targets`. You should see all
of your Nomad nodes (servers and clients) show up as targets. Please note that
the IP addresses will be different in your cluster.

[![Prometheus Targets][prometheus-targets]][prometheus-targets]

Let's use Prometheus to query how many jobs are running in our Nomad cluster.
On the main page, type `nomad_nomad_job_summary_running` into the query
section. You can also select the query from the drop-down list.

[![Running Jobs][running-jobs]][running-jobs]

You can see that the value of our fabio job is `3` since it is using the
[system][system] scheduler type. This makes sense because we are running
three Nomad clients in our demo cluster. The value of our Prometheus job, on
the other hand, is `1` since we have only deployed one instance of it.
To see the description of other metrics, visit the [telemetry][telemetry]
section.

### Step 6: Deploy Alertmanager

Now that we have enabled Prometheus to collect metrics from our cluster and see
the state of our jobs, let's deploy [Alertmanager][alertmanager]. Keep in mind
that Prometheus sends alerts to Alertmanager. It is then Alertmanager's job to
send out the notifications on those alerts to any designated
[receiver][receivers].

Create a job for Alertmanager and name it `alertmanager.nomad`:

```hcl
job "alertmanager" {
  datacenters = ["dc1"]
  type = "service"

  group "alerting" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "alertmanager" {
      driver = "docker"
      config {
        image = "prom/alertmanager:latest"
        port_map {
          alertmanager_ui = 9093
        }
      }
      resources {
        network {
          mbits = 10
          port "alertmanager_ui" {}
        }
      }
      service {
        name = "alertmanager"
        tags = ["urlprefix-/alertmanager strip=/alertmanager"]
        port = "alertmanager_ui"
        check {
          name = "alertmanager_ui port alive"
          type = "http"
          path = "/-/healthy"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
}
```

### Step 7: Configure Prometheus to Integrate with Alertmanager

Now that we have deployed Alertmanager, let's slightly modify our Prometheus
job configuration so that Prometheus can discover Alertmanager and send alerts
to it. Note that the configuration also includes some rules that refer to a web
server we will deploy soon.

Below is the same Prometheus configuration we detailed above, but we have added
some sections that hook Prometheus into Alertmanager and set up some alerting
rules.

```hcl
job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "prometheus" {
      template {
        change_mode = "noop"
        destination = "local/webserver_alert.yml"
        data = <<EOH
---
groups:
- name: prometheus_alerts
  rules:
  - alert: Webserver down
    expr: absent(up{job="webserver"})
    for: 10s
    labels:
      severity: critical
    annotations:
      description: "Our webserver is down."
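    # Note: absent(up{job="webserver"}) only returns a value when no 'up' time
    # series exists for the 'webserver' scrape job, so this expression fires
    # once the web server's target disappears from Consul and Prometheus stops
    # scraping it. The 'for: 10s' clause requires the condition to hold for 10
    # seconds before the alert becomes active, and the 'severity' label is
    # available to Alertmanager for routing notifications.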
EOH
      }

      template {
        change_mode = "noop"
        destination = "local/prometheus.yml"
        data = <<EOH
---
global:
  scrape_interval: 5s
  evaluation_interval: 5s

alerting:
  alertmanagers:
  - consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['alertmanager']

rule_files:
  - "webserver_alert.yml"

scrape_configs:

  - job_name: 'alertmanager'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['alertmanager']

  - job_name: 'nomad_metrics'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

  - job_name: 'webserver'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['webserver']

    metrics_path: /metrics
EOH
      }
      driver = "docker"
      config {
        image = "prom/prometheus:latest"
        volumes = [
          "local/webserver_alert.yml:/etc/prometheus/webserver_alert.yml",
          "local/prometheus.yml:/etc/prometheus/prometheus.yml"
        ]
        port_map {
          prometheus_ui = 9090
        }
      }
      resources {
        network {
          mbits = 10
          port "prometheus_ui" {}
        }
      }
      service {
        name = "prometheus"
        tags = ["urlprefix-/"]
        port = "prometheus_ui"
        check {
          name = "prometheus_ui port alive"
          type = "http"
          path = "/-/healthy"
          interval = "10s"
          timeout = "2s"
        }
      }
    }
  }
}
```

Notice we have added a few important sections to this job file:

- We added another template stanza that defines an [alerting rule][alertingrules]
  for our web server. Namely, Prometheus will send out an alert if it detects
  the `webserver` service has disappeared.

- We added an `alerting` block to our Prometheus configuration as well as a
  `rule_files` block to make Prometheus aware of Alertmanager as well as the
  rule we have defined.

- We are now also scraping Alertmanager along with our web server.

### Step 8: Deploy Web Server

Create a job for our web server and name it `webserver.nomad`:

```hcl
job "webserver" {
  datacenters = ["dc1"]

  group "webserver" {
    task "server" {
      driver = "docker"
      config {
        image = "hashicorp/demo-prometheus-instrumentation:latest"
      }

      resources {
        cpu = 500
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }

      service {
        name = "webserver"
        port = "http"

        tags = [
          "testweb",
          "urlprefix-/webserver strip=/webserver",
        ]

        check {
          type = "http"
          path = "/"
          interval = "2s"
          timeout = "2s"
        }
      }
    }
  }
}
```

At this point, re-run your Prometheus job. After a few seconds, you will see the
web server and Alertmanager appear in your list of targets.

[![New Targets][new-targets]][new-targets]

You should also be able to go to the `Alerts` section of the Prometheus web
interface and see the alert that we have configured. No alerts are active
because our web server is up and running.

[![Alerts][alerts]][alerts]
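
If you prefer the command line, the Prometheus HTTP API can also confirm that
the alerting rule was loaded. The `/api/v1/rules` endpoint used below is a
standard Prometheus API path (assuming the `prom/prometheus:latest` image you
pulled is a recent 2.x release), and `jq` is only used here to make the JSON
readable, so it is optional:

```shell
# List the names of the alerting rules Prometheus has loaded, routed through fabio
$ curl -s http://<client node IP>:9999/api/v1/rules | jq '.data.groups[].rules[].name'
"Webserver down"
```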
### Step 9: Stop the Web Server

Run `nomad stop webserver` to stop our webserver. After a few seconds, you will
see that we have an active alert in the `Alerts` section of the web interface.

[![Active Alerts][active-alerts]][active-alerts]

We can now go to the Alertmanager web interface to see that Alertmanager has
received this alert as well. Since Alertmanager has been configured behind
fabio, go to the IP address of any of your client nodes at port `9999` and use
`/alertmanager` as the route. An example is shown below:

-> < client node IP >:9999/alertmanager

You should see that Alertmanager has received the alert.

[![Alertmanager Web UI][alertmanager-webui]][alertmanager-webui]

## Next Steps

Read more about Prometheus [Alertmanager][alertmanager] and how to configure it
to send out notifications to a [receiver][receivers] of your choice.

[active-alerts]: /assets/images/active-alert.png
[alerts]: /assets/images/alerts.png
[alerting]: https://prometheus.io/docs/alerting/overview/
[alertingrules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[alertmanager-webui]: /assets/images/alertmanager-webui.png
[configuring prometheus]: https://prometheus.io/docs/introduction/first_steps/#configuring-prometheus
[consul_sd_config]: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cconsul_sd_config%3E
[env]: /docs/runtime/environment.html
[fabio]: https://fabiolb.net/
[fabio-lb]: /guides/load-balancing/fabio.html
[new-targets]: /assets/images/new-targets.png
[prometheus-targets]: /assets/images/prometheus-targets.png
[running-jobs]: /assets/images/running-jobs.png
[telemetry]: /docs/configuration/telemetry.html
[telemetry stanza]: /docs/configuration/telemetry.html
[template]: /docs/job-specification/template.html
[volumes]: /docs/drivers/docker.html#volumes
[prometheus]: https://prometheus.io/docs/introduction/overview/
[receivers]: https://prometheus.io/docs/alerting/configuration/#%3Creceiver%3E
[system]: /docs/schedulers.html#system