---
layout: "guides"
page_title: "Using Prometheus to Monitor Nomad Metrics"
sidebar_current: "guides-operations-monitoring-prometheus"
description: |-
  It is possible to collect metrics on Nomad with Prometheus after enabling
  telemetry on Nomad servers and clients.
---

# Using Prometheus to Monitor Nomad Metrics

This guide explains how to configure [Prometheus][prometheus] to integrate with
a Nomad cluster and Prometheus [Alertmanager][alertmanager]. While this guide
introduces the basics of enabling [telemetry][telemetry] and alerting, a Nomad
operator can go much further by customizing dashboards and integrating different
[receivers][receivers] for alerts.

## Reference Material

- [Configuring Prometheus][configuring prometheus]
- [Telemetry Stanza in Nomad Agent Configuration][telemetry stanza]
- [Alerting Overview][alerting]

## Estimated Time to Complete

25 minutes

## Challenge

Think of a scenario where a Nomad operator needs to deploy Prometheus to
collect metrics from a Nomad cluster. The operator must enable telemetry on
the Nomad servers and clients as well as configure Prometheus to use Consul
for service discovery. The operator must also configure Prometheus Alertmanager
so notifications can be sent out to a specified [receiver][receivers].

## Solution

Deploy Prometheus with a configuration that accounts for a highly dynamic
environment. Integrate service discovery into the configuration file to avoid
using hard-coded IP addresses. Place the Prometheus deployment behind
[fabio][fabio]; this allows easy access to the Prometheus web interface by
letting the Nomad operator hit any of the client nodes at the `/` path.

## Prerequisites

To perform the tasks described in this guide, you need to have a Nomad
environment with Consul installed. You can use this
[repo](https://github.com/hashicorp/nomad/tree/master/terraform#provision-a-nomad-cluster-in-the-cloud)
to easily provision a sandbox environment. This guide will assume a cluster with
one server node and three client nodes.

-> **Please Note:** This guide is for demo purposes and only uses a single
server node. In a production cluster, 3 or 5 server nodes are recommended. The
alerting rules defined in this guide are for instructional purposes. Please
refer to [Alerting Rules][alertingrules] for more information.

## Steps

### Step 1: Enable Telemetry on Nomad Servers and Clients

Add the stanza below to your Nomad client and server configuration
files. If you have used the provided repo in this guide to set up a Nomad
cluster, the configuration file will be `/etc/nomad.d/nomad.hcl`.
After making this change, restart the Nomad service on each server and
client node.

```hcl
telemetry {
  collection_interval = "1s"
  disable_hostname = true
  prometheus_metrics = true
  publish_allocation_metrics = true
  publish_node_metrics = true
}
```
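If you want to confirm the change before moving on, you can query Nomad's HTTP
API directly from any node. This is only a quick sanity check; it assumes the
default Nomad HTTP port of `4646`, and the exact metric names you see will vary
with your cluster:

```shell
# Ask Nomad for its metrics in Prometheus exposition format.
$ curl -s "http://localhost:4646/v1/metrics?format=prometheus" | head
```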

### Step 2: Create a Job for Fabio

Create a job for fabio and name it `fabio.nomad`:

```hcl
job "fabio" {
  datacenters = ["dc1"]
  type = "system"

  group "fabio" {
    task "fabio" {
      driver = "docker"
      config {
        image = "fabiolb/fabio"
        network_mode = "host"
      }

      resources {
        cpu    = 100
        memory = 64
        network {
          mbits = 20
          port "lb" {
            static = 9999
          }
          port "ui" {
            static = 9998
          }
        }
      }
    }
  }
}
```
To learn more about fabio and the options used in this job file, see
[Load Balancing with Fabio][fabio-lb]. For the purpose of this guide, it is
important to note that the `type` option is set to [system][system] so that
fabio will be deployed on all client nodes. We have also set `network_mode` to
`host` so that fabio will be able to use Consul for service discovery.

### Step 3: Run the Fabio Job

We can now register our fabio job:

```shell
$ nomad job run fabio.nomad
==> Monitoring evaluation "7b96701e"
    Evaluation triggered by job "fabio"
    Allocation "d0e34682" created: node "28d7f859", group "fabio"
    Allocation "238ec0f7" created: node "510898b6", group "fabio"
    Allocation "9a2e8359" created: node "f3739267", group "fabio"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "7b96701e" finished with status "complete"
```
At this point, you should be able to visit any one of your client nodes at port
`9998` and see the web interface for fabio. The routing table will be empty
since we have not yet deployed anything that fabio can route to.
Accordingly, if you visit any of the client nodes at port `9999` at this
point, you will get a `404` HTTP response. That will change soon.
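
If you prefer the command line to the fabio UI, you can also inspect the
routing table through fabio's API on the UI port. This is optional and assumes
fabio's default ports; replace `<client-ip>` with the address of any client
node:

```shell
# List fabio's current routes (empty until a service with a urlprefix- tag registers).
$ curl -s http://<client-ip>:9998/api/routes
```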

### Step 4: Create a Job for Prometheus

Create a job for Prometheus and name it `prometheus.nomad`:

```hcl
job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "prometheus" {
      template {
        change_mode = "noop"
        destination = "local/prometheus.yml"
        data = <<EOH
---
global:
  scrape_interval:     5s
  evaluation_interval: 5s

scrape_configs:

  - job_name: 'nomad_metrics'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']
EOH
      }
      driver = "docker"
      config {
        image = "prom/prometheus:latest"
        volumes = [
          "local/prometheus.yml:/etc/prometheus/prometheus.yml"
        ]
        port_map {
          prometheus_ui = 9090
        }
      }
      resources {
        network {
          mbits = 10
          port "prometheus_ui" {}
        }
      }
      service {
        name = "prometheus"
        tags = ["urlprefix-/"]
        port = "prometheus_ui"
        check {
          name     = "prometheus_ui port alive"
          type     = "http"
          path     = "/-/healthy"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```
Notice we are using the [template][template] stanza to create a Prometheus
configuration using [environment][env] variables. In this case, we are using
the environment variable `NOMAD_IP_prometheus_ui` in the
[consul_sd_configs][consul_sd_config] section to ensure Prometheus can use
Consul to detect and scrape targets. This works in our example because Consul
is installed alongside Nomad. Additionally, we benefit from this configuration
by avoiding the need to hard-code IP addresses. If you did not use the repo
provided in this guide to create a Nomad cluster, be sure to point your
Prometheus configuration to a Consul server you have set up.

The [volumes][volumes] option allows us to take the configuration file we
dynamically created and place it in our Prometheus container.
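
If you would like to see what Prometheus will discover before running the job,
you can ask Consul for the same services that the `consul_sd_configs` block
references. This is just a sanity check against Consul's standard catalog API;
replace `<consul-ip>` with the address of any node running a Consul agent:

```shell
# Show the Consul catalog entries Prometheus will use as scrape targets.
$ curl -s http://<consul-ip>:8500/v1/catalog/service/nomad-client
$ curl -s http://<consul-ip>:8500/v1/catalog/service/nomad
```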

### Step 5: Run the Prometheus Job

We can now register our job for Prometheus:

```shell
$ nomad job run prometheus.nomad
==> Monitoring evaluation "4e6b7127"
    Evaluation triggered by job "prometheus"
    Evaluation within deployment: "d3a651a7"
    Allocation "9725af3d" created: node "28d7f859", group "monitoring"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "4e6b7127" finished with status "complete"
```
Prometheus is now deployed. You can visit any of your client nodes at port
`9999` to see the web interface. There is only one instance of Prometheus
running in the Nomad cluster, but you are automatically routed to it
regardless of which node you visit because fabio is deployed and running on the
cluster as well.

At the top menu bar, click on `Status` and then `Targets`. You should see all
of your Nomad nodes (servers and clients) show up as targets. Please note that
the IP addresses will be different in your cluster.

[![Prometheus Targets][prometheus-targets]][prometheus-targets]

Let's use Prometheus to query how many jobs are running in our Nomad cluster.
On the main page, type `nomad_nomad_job_summary_running` into the query
section. You can also select the query from the drop-down list.

[![Running Jobs][running-jobs]][running-jobs]

You can see that the value of our fabio job is `3` since it is using the
[system][system] scheduler type. This makes sense because we are running
three Nomad clients in our demo cluster. The value of our Prometheus job, on
the other hand, is `1` since we have only deployed one instance of it.
To see the description of other metrics, visit the [telemetry][telemetry]
section.
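
If you prefer the command line, you can run the same query against the
Prometheus HTTP API. The sketch below goes through fabio on port `9999`, which
routes `/` to Prometheus; replace `<client-ip>` with the address of any client
node:

```shell
# Run the PromQL query nomad_nomad_job_summary_running via the HTTP API.
$ curl -s 'http://<client-ip>:9999/api/v1/query?query=nomad_nomad_job_summary_running'
```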

### Step 6: Deploy Alertmanager

Now that we have enabled Prometheus to collect metrics from our cluster and see
the state of our jobs, let's deploy [Alertmanager][alertmanager]. Keep in mind
that Prometheus sends alerts to Alertmanager. It is then Alertmanager's job to
send out notifications for those alerts to any designated [receiver][receivers].

Create a job for Alertmanager and name it `alertmanager.nomad`:

```hcl
job "alertmanager" {
  datacenters = ["dc1"]
  type = "service"

  group "alerting" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "alertmanager" {
      driver = "docker"
      config {
        image = "prom/alertmanager:latest"
        port_map {
          alertmanager_ui = 9093
        }
      }
      resources {
        network {
          mbits = 10
          port "alertmanager_ui" {}
        }
      }
      service {
        name = "alertmanager"
        tags = ["urlprefix-/alertmanager strip=/alertmanager"]
        port = "alertmanager_ui"
        check {
          name     = "alertmanager_ui port alive"
          type     = "http"
          path     = "/-/healthy"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```
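Register the Alertmanager job the same way you ran the fabio and Prometheus
jobs; the evaluation output will look similar to the earlier runs:

```shell
$ nomad job run alertmanager.nomad
```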

### Step 7: Configure Prometheus to Integrate with Alertmanager

Now that we have deployed Alertmanager, let's slightly modify our Prometheus
job configuration so that Prometheus can recognize Alertmanager and send
alerts to it. Note that the configuration includes a rule that refers to a web
server we will deploy soon.

Below is the same Prometheus configuration we detailed above, but we have added
some sections that hook Prometheus into Alertmanager and set up some alerting
rules.

```hcl
job "prometheus" {
  datacenters = ["dc1"]
  type = "service"

  group "monitoring" {
    count = 1
    restart {
      attempts = 2
      interval = "30m"
      delay = "15s"
      mode = "fail"
    }
    ephemeral_disk {
      size = 300
    }

    task "prometheus" {
      template {
        change_mode = "noop"
        destination = "local/webserver_alert.yml"
        data = <<EOH
---
groups:
- name: prometheus_alerts
  rules:
  - alert: Webserver down
    expr: absent(up{job="webserver"})
    for: 10s
    labels:
      severity: critical
    annotations:
      description: "Our webserver is down."
EOH
      }

      template {
        change_mode = "noop"
        destination = "local/prometheus.yml"
        data = <<EOH
---
global:
  scrape_interval:     5s
  evaluation_interval: 5s

alerting:
  alertmanagers:
  - consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['alertmanager']

rule_files:
  - "webserver_alert.yml"

scrape_configs:

  - job_name: 'alertmanager'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['alertmanager']

  - job_name: 'nomad_metrics'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['nomad-client', 'nomad']

    relabel_configs:
    - source_labels: ['__meta_consul_tags']
      regex: '(.*)http(.*)'
      action: keep

    scrape_interval: 5s
    metrics_path: /v1/metrics
    params:
      format: ['prometheus']

  - job_name: 'webserver'

    consul_sd_configs:
    - server: '{{ env "NOMAD_IP_prometheus_ui" }}:8500'
      services: ['webserver']

    metrics_path: /metrics
EOH
      }
      driver = "docker"
      config {
        image = "prom/prometheus:latest"
        volumes = [
          "local/webserver_alert.yml:/etc/prometheus/webserver_alert.yml",
          "local/prometheus.yml:/etc/prometheus/prometheus.yml"
        ]
        port_map {
          prometheus_ui = 9090
        }
      }
      resources {
        network {
          mbits = 10
          port "prometheus_ui" {}
        }
      }
      service {
        name = "prometheus"
        tags = ["urlprefix-/"]
        port = "prometheus_ui"
        check {
          name     = "prometheus_ui port alive"
          type     = "http"
          path     = "/-/healthy"
          interval = "10s"
          timeout  = "2s"
        }
      }
    }
  }
}
```
Notice we have added a few important sections to this job file:

  - We added another template stanza that defines an [alerting rule][alertingrules]
    for our web server. Namely, Prometheus will send out an alert if it detects
    the `webserver` service has disappeared (you can validate this rule file with
    `promtool`, as shown after this list).

  - We added an `alerting` block to our Prometheus configuration as well as a
    `rule_files` block to make Prometheus aware of Alertmanager as well as the
    rule we have defined.

  - We are now also scraping Alertmanager along with our web server.
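
If you want to validate the alerting rule before re-running the job, Prometheus
ships with `promtool`, which can check rule files for syntax errors. This step
is optional and assumes you have saved the rendered rule file locally as
`webserver_alert.yml`:

```shell
# Check the alerting rule file for syntax errors.
$ promtool check rules webserver_alert.yml
```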

### Step 8: Deploy Web Server

Create a job for our web server and name it `webserver.nomad`:

```hcl
job "webserver" {
  datacenters = ["dc1"]

  group "webserver" {
    task "server" {
      driver = "docker"
      config {
        image = "hashicorp/demo-prometheus-instrumentation:latest"
      }

      resources {
        cpu = 500
        memory = 256
        network {
          mbits = 10
          port "http" {}
        }
      }

      service {
        name = "webserver"
        port = "http"

        tags = [
          "testweb",
          "urlprefix-/webserver strip=/webserver",
        ]

        check {
          type     = "http"
          path     = "/"
          interval = "2s"
          timeout  = "2s"
        }
      }
    }
  }
}
```
At this point, run the web server job with `nomad job run webserver.nomad` and
then re-run your Prometheus job. After a few seconds, you will see the web
server and Alertmanager appear in your list of targets.
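
As with the earlier query, you can confirm the new targets from the command
line through the Prometheus HTTP API. This is just a quick check and goes
through fabio on port `9999`; replace `<client-ip>` with the address of any
client node:

```shell
# List every scrape target Prometheus currently knows about.
$ curl -s http://<client-ip>:9999/api/v1/targets
```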

[![New Targets][new-targets]][new-targets]

You should also be able to go to the `Alerts` section of the Prometheus web
interface and see the alert that we have configured. No alerts are active
because our web server is up and running.

[![Alerts][alerts]][alerts]

### Step 9: Stop the Web Server

Run `nomad stop webserver` to stop our web server. After a few seconds, you
will see that we have an active alert in the `Alerts` section of the web
interface.

[![Active Alerts][active-alerts]][active-alerts]

We can now go to the Alertmanager web interface to verify that Alertmanager
has received this alert as well. Since Alertmanager has been configured behind
fabio, go to the IP address of any of your client nodes at port `9999` and use
`/alertmanager` as the route. An example is shown below:

-> < client node IP >:9999/alertmanager

You should see that Alertmanager has received the alert.
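
You can also confirm the firing alert from the command line. Alertmanager
exposes an alerts endpoint under its API (the exact path can vary by
Alertmanager version); the request below goes through fabio, which strips the
`/alertmanager` prefix before forwarding it:

```shell
# List the alerts Alertmanager currently knows about.
$ curl -s http://<client-ip>:9999/alertmanager/api/v2/alerts
```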

[![Alertmanager Web UI][alertmanager-webui]][alertmanager-webui]

## Next Steps

Read more about Prometheus [Alertmanager][alertmanager] and how to configure it
to send out notifications to a [receiver][receivers] of your choice.

[active-alerts]: /assets/images/active-alert.png
[alerts]: /assets/images/alerts.png
[alerting]: https://prometheus.io/docs/alerting/overview/
[alertingrules]: https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/
[alertmanager]: https://prometheus.io/docs/alerting/alertmanager/
[alertmanager-webui]: /assets/images/alertmanager-webui.png
[configuring prometheus]: https://prometheus.io/docs/introduction/first_steps/#configuring-prometheus
[consul_sd_config]: https://prometheus.io/docs/prometheus/latest/configuration/configuration/#%3Cconsul_sd_config%3E
[env]: /docs/runtime/environment.html
[fabio]: https://fabiolb.net/
[fabio-lb]: /guides/load-balancing/fabio.html
[new-targets]: /assets/images/new-targets.png
[prometheus-targets]: /assets/images/prometheus-targets.png
[running-jobs]: /assets/images/running-jobs.png
[telemetry]: /docs/configuration/telemetry.html
[telemetry stanza]: /docs/configuration/telemetry.html
[template]: /docs/job-specification/template.html
[volumes]: /docs/drivers/docker.html#volumes
[prometheus]: https://prometheus.io/docs/introduction/overview/
[receivers]: https://prometheus.io/docs/alerting/configuration/#%3Creceiver%3E
[system]: /docs/schedulers.html#system