# boulder-observer

A modular, configuration-driven approach to black-box monitoring with
Prometheus.

* [boulder-observer](#boulder-observer)
  * [Usage](#usage)
    * [Options](#options)
    * [Starting the boulder-observer
      daemon](#starting-the-boulder-observer-daemon)
  * [Configuration](#configuration)
    * [Root](#root)
      * [Schema](#schema)
      * [Example](#example)
    * [Monitors](#monitors)
      * [Schema](#schema-1)
      * [Example](#example-1)
    * [Probers](#probers)
      * [DNS](#dns)
        * [Schema](#schema-2)
        * [Example](#example-2)
      * [HTTP](#http)
        * [Schema](#schema-3)
        * [Example](#example-3)
      * [CRL](#crl)
        * [Schema](#schema-4)
        * [Example](#example-4)
      * [TLS](#tls)
        * [Schema](#schema-5)
        * [Example](#example-5)
  * [Metrics](#metrics)
    * [Global Metrics](#global-metrics)
      * [obs_monitors](#obs_monitors)
      * [obs_observations](#obs_observations)
    * [CRL Metrics](#crl-metrics)
      * [obs_crl_this_update](#obs_crl_this_update)
      * [obs_crl_next_update](#obs_crl_next_update)
      * [obs_crl_revoked_cert_count](#obs_crl_revoked_cert_count)
    * [TLS Metrics](#tls-metrics)
      * [obs_tls_not_after](#obs_tls_not_after)
      * [obs_tls_reason](#obs_tls_reason)
  * [Development](#development)
    * [Starting Prometheus locally](#starting-prometheus-locally)
    * [Viewing metrics locally](#viewing-metrics-locally)

## Usage

### Options

```shell
$ ./boulder-observer -help
  -config string
        Path to boulder-observer configuration file (default "config.yml")
```

### Starting the boulder-observer daemon

```shell
$ ./boulder-observer -config test/config-next/observer.yml
I152525 boulder-observer _KzylQI Versions: main=(Unspecified Unspecified) Golang=(go1.16.2) BuildHost=(Unspecified)
I152525 boulder-observer q_D84gk Initializing boulder-observer daemon from config: test/config-next/observer.yml
I152525 boulder-observer 7aq68AQ all monitors passed validation
I152527 boulder-observer yaefiAw kind=[HTTP] success=[true] duration=[0.130097] name=[https://letsencrypt.org-[200]]
I152527 boulder-observer 65CuDAA kind=[HTTP] success=[true] duration=[0.148633] name=[http://letsencrypt.org/foo-[200 404]]
I152530 boulder-observer idi4rwE kind=[DNS] success=[false] duration=[0.000093] name=[[2606:4700:4700::1111]:53-udp-A-google.com-recurse]
I152530 boulder-observer prOnrw8 kind=[DNS] success=[false] duration=[0.000242] name=[[2606:4700:4700::1111]:53-tcp-A-google.com-recurse]
I152530 boulder-observer 6uXugQw kind=[DNS] success=[true] duration=[0.022962] name=[1.1.1.1:53-udp-A-google.com-recurse]
I152530 boulder-observer to7h-wo kind=[DNS] success=[true] duration=[0.029860] name=[owen.ns.cloudflare.com:53-udp-A-letsencrypt.org-no-recurse]
I152530 boulder-observer ovDorAY kind=[DNS] success=[true] duration=[0.033820] name=[owen.ns.cloudflare.com:53-tcp-A-letsencrypt.org-no-recurse]
...
```

## Configuration

Configuration is provided via a YAML file.

### Root

#### Schema

`debugaddr`: The Prometheus scrape port prefixed with a single colon
(e.g. `:8040`).

`buckets`: List of floats representing Prometheus histogram buckets (e.g.
`[.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]`).

`syslog`: Map of log levels, see schema below.

- `stdoutlevel`: Log level for stdout, see legend below.
- `sysloglevel`: Log level for syslog, see legend below.

`0`: *EMERG* `1`: *ALERT* `2`: *CRIT* `3`: *ERR* `4`: *WARN* `5`:
*NOTICE* `6`: *INFO* `7`: *DEBUG*

`monitors`: List of monitors, see [monitors](#monitors) for schema.

#### Example

```yaml
debugaddr: :8040
buckets: [.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]
syslog:
  stdoutlevel: 6
  sysloglevel: 6
monitors:
  -
    ...
```

### Monitors

#### Schema

`period`: Interval between probing attempts (e.g. `1s` `1m` `1h`).

`kind`: Kind of prober to use, see [probers](#probers) for schema.

`settings`: Map of prober settings, see [probers](#probers) for schema.

#### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      ...
```

### Probers

#### DNS

##### Schema

`protocol`: Protocol to use, options are: `udp` or `tcp`.

`server`: Hostname, IPv4 address, or IPv6 address surrounded with
brackets + port of the DNS server to send the query to (e.g.
`example.com:53`, `1.1.1.1:53`, or `[2606:4700:4700::1111]:53`).

`recurse`: Bool indicating if recursive resolution is desired.

`query_name`: Name to query (e.g. `example.com`).

`query_type`: Record type to query, options are: `A`, `AAAA`, `TXT`, or
`CAA`.

##### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      protocol: tcp
      # Quoted so YAML does not parse the leading bracket as a flow sequence.
      server: "[2606:4700:4700::1111]:53"
      recurse: false
      query_name: letsencrypt.org
      query_type: A
```

#### HTTP

##### Schema

`url`: Scheme + Hostname to send a request to (e.g.
`https://example.com`).

`rcodes`: List of expected HTTP response codes.

`useragent`: String to set HTTP header User-Agent. If no useragent string
is provided it will default to `letsencrypt/boulder-observer-http-client`.

##### Example

```yaml
monitors:
  -
    period: 2s
    kind: HTTP
    settings:
      url: http://letsencrypt.org/FOO
      rcodes: [200, 404]
      useragent: letsencrypt/boulder-observer-http-client
```

#### CRL

##### Schema

`url`: Scheme + Hostname to grab the CRL from (e.g. `http://x1.c.lencr.org/`).

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: CRL
    settings:
      url: http://x1.c.lencr.org/
```

#### TLS

##### Schema

`hostname`: Hostname to run TLS check on (e.g. `valid-isrgrootx1.letsencrypt.org`).

`rootOrg`: Organization to check against the root certificate Organization (e.g. `Internet Security Research Group`).

`rootCN`: Name to check against the root certificate Common Name (e.g. `ISRG Root X1`). If not provided, root comparison will be skipped.

`response`: Expected site response; must be one of: `valid`, `revoked` or `expired`.

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: TLS
    settings:
      hostname: valid-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: valid
```

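Putting the root schema and the prober examples together, a complete configuration file might look like the sketch below. Every value is copied from the individual examples in this section; the particular mix of monitors is only illustrative, not a recommended deployment.

```yaml
# Root settings
debugaddr: :8040
buckets: [.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]
syslog:
  stdoutlevel: 6
  sysloglevel: 6

# One monitor per prober kind
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      protocol: tcp
      # Quoted so YAML does not parse the leading bracket as a flow sequence.
      server: "[2606:4700:4700::1111]:53"
      recurse: false
      query_name: letsencrypt.org
      query_type: A
  -
    period: 2s
    kind: HTTP
    settings:
      url: http://letsencrypt.org/FOO
      rcodes: [200, 404]
      useragent: letsencrypt/boulder-observer-http-client
  -
    period: 1h
    kind: CRL
    settings:
      url: http://x1.c.lencr.org/
  -
    period: 1h
    kind: TLS
    settings:
      hostname: valid-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: valid
```
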
## Metrics

Observer provides the following metrics.

### Global Metrics

These metrics will always be available.

#### obs_monitors

Count of configured monitors.

**Labels:**

`kind`: Kind of Prober the monitor is configured to use.

`valid`: Bool indicating whether settings provided could be validated
for the `kind` of Prober specified.

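**Example Usage:**

As a sketch, a rule along these lines could flag monitors whose settings failed validation. It assumes the `valid` label is exported as the string `"false"` for such monitors; the alert name and severity are hypothetical.

```yaml
- alert: ObserverMonitorInvalid  # hypothetical alert name
  # Assumes invalid monitors are exported with valid="false".
  expr: obs_monitors{valid="false"} > 0
  labels:
    severity: warning
  annotations:
    description: '{{ $value }} monitor(s) of kind {{ $labels.kind }} failed validation'
```
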
#### obs_observations

**Labels:**

`name`: Name of the monitor.

`kind`: Kind of prober the monitor is configured to use.

`duration`: Duration of the probing in seconds.

`success`: Bool indicating whether the result of the probe attempt was
successful.

**Bucketed response times:**

This is configurable, see `buckets` under [root/schema](#schema).

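**Example Usage:**

Because the response times are bucketed, a latency rule can be sketched with `histogram_quantile`. This assumes the metric is exported as a standard Prometheus histogram, so that an `obs_observations_bucket` series with an `le` label exists. The alert name, the 95th percentile, and the 1 second threshold are illustrative choices only.

```yaml
- alert: ObserverProbeSlow  # hypothetical alert name
  # 95th percentile probe duration per monitor over the last 5 minutes.
  expr: histogram_quantile(0.95, sum by (le, name) (rate(obs_observations_bucket[5m]))) > 1
  for: 10m
  labels:
    severity: warning
  annotations:
    description: '95th percentile probe duration for {{ $labels.name }} is above 1s'
```

A threshold that coincides with one of the configured `buckets` boundaries tends to give the most accurate quantile estimate.
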
### CRL Metrics

These metrics will be available whenever a valid CRL prober is configured.

#### obs_crl_this_update

Unix timestamp value (in seconds) of the thisUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a thisUpdate timestamp in the future, signalling that something may have gone wrong during its creation:

```yaml
- alert: CRLThisUpdateInFuture
  expr: obs_crl_this_update{url="http://x1.c.lencr.org/"} > time()
  labels:
    severity: critical
  annotations:
    description: 'CRL thisUpdate is in the future'
```

#### obs_crl_next_update

Unix timestamp value (in seconds) of the nextUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a nextUpdate timestamp in the past, signalling that the CRL was not updated on time:

```yaml
- alert: CRLNextUpdateInPast
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} < time()
  labels:
    severity: critical
  annotations:
    description: 'CRL nextUpdate is in the past'
```

Another potentially useful rule would be to notify when nextUpdate is within X days of the current time, as a reminder that the update is coming up soon.

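A minimal sketch of such a reminder rule, with X arbitrarily set to 3 days (the alert name and threshold are assumptions, not values used by the project):

```yaml
- alert: CRLNextUpdateSoon  # hypothetical alert name
  # Seconds remaining until nextUpdate, compared against 3 days (3 * 86400 s).
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} - time() < 3 * 86400
  labels:
    severity: warning
  annotations:
    description: 'CRL nextUpdate is less than 3 days away'
```

Note that this expression also matches once nextUpdate is already in the past, so it would fire alongside a rule that alerts on an overdue CRL.
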
#### obs_crl_revoked_cert_count

Count of revoked certificates in a CRL.

**Labels:**

`url`: URL of the CRL.

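**Example Usage:**

One possible use is to watch for a sudden jump in the number of revoked certificates. The sketch below assumes the metric is a gauge (so `delta` applies); the alert name, the 6 hour window, and the threshold of 1000 are placeholders to be tuned per CRL.

```yaml
- alert: CRLRevokedCountJump  # hypothetical alert name
  # Change in the revoked-certificate count over the last 6 hours.
  expr: delta(obs_crl_revoked_cert_count{url="http://x1.c.lencr.org/"}[6h]) > 1000
  labels:
    severity: warning
  annotations:
    description: 'Revoked certificate count for {{ $labels.url }} grew by more than 1000 in 6 hours'
```
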
### TLS Metrics

These metrics will be available whenever a valid TLS prober is configured.

#### obs_tls_not_after

Unix timestamp value (in seconds) of the notAfter field for a subscriber certificate.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate

**Example Usage:**

This is a sample rule that alerts when a site has a notAfter timestamp indicating that the certificate will expire within the next 20 days:

```yaml
  - alert: CertExpiresSoonWarning
    annotations:
      description: "The certificate at {{ $labels.hostname }} expires within 20 days, on: {{ $value | humanizeTimestamp }}"
    expr: (obs_tls_not_after{hostname=~"^[^e][a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}) <= time() + 1728000
    for: 60m
    labels:
      severity: warning
```

#### obs_tls_reason

A counter that increments by one for each result of a TLS check, labeled with the reason. The reason is `nil` if the TLS Prober returns `true`; otherwise it is one of: `internalError`, `ocspError`, `rootDidNotMatch`, `responseDidNotMatch`.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate

`reason`: The reason for the TLS Prober returning false, or `nil` if it returns true

**Example Usage:**

This is a sample rule that alerts when the TLS Prober returns false, providing insight into the reason for failure.

```yaml
  - alert: TLSCertCheckFailed
    annotations:
      description: "The TLS probe for {{ $labels.hostname }} failed for reason: {{ $labels.reason }}. This potentially violates CP 2.2."
    expr: (rate(obs_observations_count{success="false",name=~"[a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}[5m])) > 0
    for: 5m
    labels:
      severity: critical
```

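As an alternative sketch, a rule can be built directly on `obs_tls_reason`, which makes the `hostname` and `reason` labels available to the annotation. It assumes the metric behaves as a Prometheus counter and that successful checks carry the literal label value `nil`, as described above. The alert name, window, and severity are hypothetical.

```yaml
  - alert: TLSProbeFailureReason  # hypothetical alert name
    annotations:
      description: "The TLS probe for {{ $labels.hostname }} failed with reason: {{ $labels.reason }}"
    # Any increase in a non-nil reason over the last 10 minutes.
    expr: increase(obs_tls_reason{reason!="nil"}[10m]) > 0
    labels:
      severity: warning
```
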
## Development

### Starting Prometheus locally

Please note, this assumes you've installed a local Prometheus binary.

```shell
prometheus --config.file=boulder/test/prometheus/prometheus.yml
```

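For orientation, a minimal Prometheus configuration that scrapes a local boulder-observer could look like the sketch below. This is not the content of `boulder/test/prometheus/prometheus.yml`; the job name and scrape interval are assumptions, the target port comes from the `debugaddr: :8040` example, and the metrics are assumed to be served on Prometheus's default `/metrics` path.

```yaml
global:
  scrape_interval: 15s  # assumed interval
scrape_configs:
  - job_name: boulder-observer  # hypothetical job name
    static_configs:
      # Port taken from the debugaddr example (:8040).
      - targets: ['localhost:8040']
```
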
### Viewing metrics locally

When developing with a local Prometheus instance, you can use this link
to view metrics: [link](http://0.0.0.0:9090)