# boulder-observer

A modular, configuration-driven approach to black box monitoring with
Prometheus.

* [boulder-observer](#boulder-observer)
  * [Usage](#usage)
    * [Options](#options)
    * [Starting the boulder-observer daemon](#starting-the-boulder-observer-daemon)
  * [Configuration](#configuration)
    * [Root](#root)
      * [Schema](#schema)
      * [Example](#example)
    * [Monitors](#monitors)
      * [Schema](#schema-1)
      * [Example](#example-1)
    * [Probers](#probers)
      * [DNS](#dns)
        * [Schema](#schema-2)
        * [Example](#example-2)
      * [HTTP](#http)
        * [Schema](#schema-3)
        * [Example](#example-3)
      * [CRL](#crl)
        * [Schema](#schema-4)
        * [Example](#example-4)
      * [TLS](#tls)
        * [Schema](#schema-5)
        * [Example](#example-5)
  * [Metrics](#metrics)
    * [Global Metrics](#global-metrics)
      * [obs_monitors](#obs_monitors)
      * [obs_observations](#obs_observations)
    * [CRL Metrics](#crl-metrics)
      * [obs_crl_this_update](#obs_crl_this_update)
      * [obs_crl_next_update](#obs_crl_next_update)
      * [obs_crl_revoked_cert_count](#obs_crl_revoked_cert_count)
    * [TLS Metrics](#tls-metrics)
      * [obs_tls_not_after](#obs_tls_not_after)
      * [obs_tls_reason](#obs_tls_reason)
  * [Development](#development)
    * [Starting Prometheus locally](#starting-prometheus-locally)
    * [Viewing metrics locally](#viewing-metrics-locally)

## Usage

### Options

```shell
$ ./boulder-observer -help
  -config string
        Path to boulder-observer configuration file (default "config.yml")
```

### Starting the boulder-observer daemon

```shell
$ ./boulder-observer -config test/config-next/observer.yml
I152525 boulder-observer _KzylQI Versions: main=(Unspecified Unspecified) Golang=(go1.16.2) BuildHost=(Unspecified)
I152525 boulder-observer q_D84gk Initializing boulder-observer daemon from config: test/config-next/observer.yml
I152525 boulder-observer 7aq68AQ all monitors passed validation
I152527 boulder-observer yaefiAw kind=[HTTP] success=[true] duration=[0.130097] name=[https://letsencrypt.org-[200]]
I152527 boulder-observer 65CuDAA kind=[HTTP] success=[true] duration=[0.148633] name=[http://letsencrypt.org/foo-[200 404]]
I152530 boulder-observer idi4rwE kind=[DNS] success=[false] duration=[0.000093] name=[[2606:4700:4700::1111]:53-udp-A-google.com-recurse]
I152530 boulder-observer prOnrw8 kind=[DNS] success=[false] duration=[0.000242] name=[[2606:4700:4700::1111]:53-tcp-A-google.com-recurse]
I152530 boulder-observer 6uXugQw kind=[DNS] success=[true] duration=[0.022962] name=[1.1.1.1:53-udp-A-google.com-recurse]
I152530 boulder-observer to7h-wo kind=[DNS] success=[true] duration=[0.029860] name=[owen.ns.cloudflare.com:53-udp-A-letsencrypt.org-no-recurse]
I152530 boulder-observer ovDorAY kind=[DNS] success=[true] duration=[0.033820] name=[owen.ns.cloudflare.com:53-tcp-A-letsencrypt.org-no-recurse]
...
```

## Configuration

Configuration is provided via a YAML file.

### Root

#### Schema

`debugaddr`: The Prometheus scrape port prefixed with a single colon
(e.g. `:8040`).

`buckets`: List of floats representing Prometheus histogram buckets (e.g.
`[.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]`).

`syslog`: Map of log levels, see schema below.

- `stdoutlevel`: Log level for stdout, see legend below.
- `sysloglevel`: Log level for syslog, see legend below.

`0`: *EMERG* `1`: *ALERT* `2`: *CRIT* `3`: *ERR* `4`: *WARN* `5`:
*NOTICE* `6`: *INFO* `7`: *DEBUG*

`monitors`: List of monitors, see [monitors](#monitors) for schema.

#### Example

```yaml
debugaddr: :8040
buckets: [.001, .002, .005, .01, .02, .05, .1, .2, .5, 1, 2, 5, 10]
syslog:
  stdoutlevel: 6
  sysloglevel: 6
monitors:
  -
    ...
```

### Monitors

#### Schema

`period`: Interval between probing attempts (e.g. `1s`, `1m`, `1h`).

`kind`: Kind of prober to use, see [probers](#probers) for schema.

`settings`: Map of prober settings, see [probers](#probers) for schema.

#### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      ...
```

### Probers

#### DNS

##### Schema

`protocol`: Protocol to use, options are: `udp` or `tcp`.

`server`: Hostname, IPv4 address, or IPv6 address surrounded with
brackets + port of the DNS server to send the query to (e.g.
`example.com:53`, `1.1.1.1:53`, or `[2606:4700:4700::1111]:53`).

`recurse`: Bool indicating if recursive resolution is desired.

`query_name`: Name to query (e.g. `example.com`).

`query_type`: Record type to query, options are: `A`, `AAAA`, `TXT`, or
`CAA`.

##### Example

```yaml
monitors:
  -
    period: 5s
    kind: DNS
    settings:
      protocol: tcp
      # Quoted because a YAML value starting with `[` would otherwise be
      # parsed as a flow sequence.
      server: "[2606:4700:4700::1111]:53"
      recurse: false
      query_name: letsencrypt.org
      query_type: A
```

#### HTTP

##### Schema

`url`: Scheme + Hostname to send a request to (e.g.
`https://example.com`).

`rcodes`: List of expected HTTP response codes.

`useragent`: String to set as the HTTP `User-Agent` header. If no
useragent string is provided, it defaults to
`letsencrypt/boulder-observer-http-client`.

##### Example

```yaml
monitors:
  -
    period: 2s
    kind: HTTP
    settings:
      url: http://letsencrypt.org/FOO
      rcodes: [200, 404]
      useragent: letsencrypt/boulder-observer-http-client
```

#### CRL

##### Schema

`url`: Scheme + Hostname to grab the CRL from (e.g. `http://x1.c.lencr.org/`).

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: CRL
    settings:
      url: http://x1.c.lencr.org/
```

#### TLS

##### Schema

`hostname`: Hostname to run TLS check on (e.g. `valid-isrgrootx1.letsencrypt.org`).

`rootOrg`: Organization to check against the root certificate Organization (e.g. `Internet Security Research Group`).

`rootCN`: Name to check against the root certificate Common Name (e.g. `ISRG Root X1`). If not provided, root comparison will be skipped.

`response`: Expected site response; must be one of: `valid`, `revoked` or `expired`.

##### Example

```yaml
monitors:
  -
    period: 1h
    kind: TLS
    settings:
      hostname: valid-isrgrootx1.letsencrypt.org
      rootOrg: "Internet Security Research Group"
      rootCN: "ISRG Root X1"
      response: valid
```

## Metrics

Observer provides the following metrics.

### Global Metrics

These metrics will always be available.

#### obs_monitors

Count of configured monitors.

**Labels:**

`kind`: Kind of Prober the monitor is configured to use.

`valid`: Bool indicating whether settings provided could be validated
for the `kind` of Prober specified.

#### obs_observations

**Labels:**

`name`: Name of the monitor.

`kind`: Kind of prober the monitor is configured to use.

`duration`: Duration of the probing in seconds.

`success`: Bool indicating whether the result of the probe attempt was
successful.

**Bucketed response times:**

This is configurable, see `buckets` under [root/schema](#schema).

### CRL Metrics

These metrics will be available whenever a valid CRL prober is configured.

#### obs_crl_this_update

Unix timestamp value (in seconds) of the thisUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a thisUpdate timestamp in the future, signalling that something may have gone wrong during its creation:

```yaml
- alert: CRLThisUpdateInFuture
  expr: obs_crl_this_update{url="http://x1.c.lencr.org/"} > time()
  labels:
    severity: critical
  annotations:
    description: 'CRL thisUpdate is in the future'
```

#### obs_crl_next_update

Unix timestamp value (in seconds) of the nextUpdate field for a CRL.

**Labels:**

`url`: URL of the CRL.

**Example Usage:**

This is a sample rule that alerts when a CRL has a nextUpdate timestamp in the past, signalling that the CRL was not updated on time:

```yaml
- alert: CRLNextUpdateInPast
  expr: obs_crl_next_update{url="http://x1.c.lencr.org/"} < time()
  labels:
    severity: critical
  annotations:
    description: 'CRL nextUpdate is in the past'
```

Another potentially useful rule would be to notify when nextUpdate is within X days of the current time, as a reminder that the update is coming up soon.

#### obs_crl_revoked_cert_count

Count of revoked certificates in a CRL.

**Labels:**

`url`: URL of the CRL.

### TLS Metrics

These metrics will be available whenever a valid TLS prober is configured.

#### obs_tls_not_after

Unix timestamp value (in seconds) of the notAfter field for a subscriber certificate.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate.

**Example Usage:**

This is a sample rule that alerts when a site has a notAfter timestamp indicating that the certificate will expire within the next 20 days:

```yaml
- alert: CertExpiresSoonWarning
  annotations:
    description: "The certificate at {{ $labels.hostname }} expires within 20 days, on: {{ $value | humanizeTimestamp }}"
  expr: (obs_tls_not_after{hostname=~"^[^e][a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}) <= time() + 1728000
  for: 60m
  labels:
    severity: warning
```

#### obs_tls_reason

This is a count that increments by one for each resulting reason of a TLS check. The reason is `nil` if the TLS Prober returns `true` and one of the following otherwise: `internalError`, `ocspError`, `rootDidNotMatch`, `responseDidNotMatch`.

**Labels:**

`hostname`: Hostname of the site of the subscriber certificate.

`reason`: The reason for the TLS Prober returning false, or `nil` if it returns true.

**Example Usage:**

This is a sample rule that alerts when the TLS Prober returns false, providing insight into the reason for failure:

```yaml
- alert: TLSCertCheckFailed
  annotations:
    description: "The TLS probe for {{ $labels.hostname }} failed for reason: {{ $labels.reason }}. This potentially violates CP 2.2."
  expr: (rate(obs_observations_count{success="false",name=~"[a-zA-Z]*-isrgrootx[12][.]letsencrypt[.]org"}[5m])) > 0
  for: 5m
  labels:
    severity: critical
```

## Development

### Starting Prometheus locally

Please note, this assumes you've installed a local Prometheus binary.

```shell
prometheus --config.file=boulder/test/prometheus/prometheus.yml
```

### Viewing metrics locally

When developing with a local Prometheus instance you can use this link
to view metrics: [link](http://0.0.0.0:9090)
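
For the local Prometheus instance to show observer metrics, it has to scrape the address configured as `debugaddr`. The fragment below is a minimal sketch of such a scrape job, not a copy of the repository's `prometheus.yml`, which may already contain an equivalent entry. It assumes the observer runs on the same host with `debugaddr: :8040` and serves metrics at Prometheus's default `/metrics` path. The job name and scrape interval are hypothetical.

```yaml
scrape_configs:
  # Hypothetical job name; any unique name works.
  - job_name: boulder-observer
    # Assumed interval; choose one no longer than the shortest monitor period
    # you care about.
    scrape_interval: 5s
    # /metrics is the Prometheus default and is assumed to be where the
    # observer exposes its metrics.
    metrics_path: /metrics
    static_configs:
      # Port taken from the `debugaddr: :8040` example above.
      - targets: ["localhost:8040"]
```

With a job like this in place, querying `obs_monitors` in the Prometheus expression browser should return the configured monitors broken down by `kind` and `valid`.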