github.com/alloyci/alloy-runner@v1.0.1-0.20180222164613-925503ccafd6/docs/monitoring/README.md

# AlloyCI Runner monitoring

AlloyCI Runner can be monitored using [Prometheus].

## Embedded Prometheus metrics

The AlloyCI Runner is instrumented with native Prometheus
metrics, which can be exposed via an embedded HTTP server on the `/metrics`
path. The server, if enabled, can be scraped by the Prometheus monitoring
system or accessed with any other HTTP client.
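
As an illustration of the "any other HTTP client" case, the following is a minimal Go sketch that fetches the endpoint once and prints the raw output. The address `localhost:9252` is an assumption about how the metrics server was started, and the helper names `metricsURL` and `fetchMetrics` are hypothetical; they are not part of the Runner.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"os"
	"time"
)

// metricsURL builds the scrape URL for a metrics server address of the
// form host:port. The /metrics path comes from the documentation above.
func metricsURL(addr string) string {
	return "http://" + addr + "/metrics"
}

// fetchMetrics performs a single GET request and returns the response body.
func fetchMetrics(addr string) (string, error) {
	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Get(metricsURL(addr))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return "", fmt.Errorf("unexpected status %s", resp.Status)
	}
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return "", err
	}
	return string(body), nil
}

func main() {
	// Assumption: the metrics server listens on localhost:9252.
	out, err := fetchMetrics("localhost:9252")
	if err != nil {
		fmt.Fprintln(os.Stderr, "fetch failed:", err)
		return
	}
	fmt.Print(out)
}
```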

The exposed information includes:

- Runner business logic metrics (e.g., the number of currently running builds)
- Go-specific process metrics (garbage collection stats, goroutines, memstats, etc.)
- General process metrics (memory usage, CPU usage, file descriptor usage, etc.)
- Build version information

The following is an example of the metrics output in Prometheus'
text-based metrics exposition format:

```
# HELP ci_docker_machines The total number of machines created.
# TYPE ci_docker_machines counter
ci_docker_machines{type="created"} 0
ci_docker_machines{type="removed"} 0
ci_docker_machines{type="used"} 0
# HELP ci_docker_machines_provider The current number of machines in given state.
# TYPE ci_docker_machines_provider gauge
ci_docker_machines_provider{state="acquired"} 0
ci_docker_machines_provider{state="creating"} 0
ci_docker_machines_provider{state="idle"} 0
ci_docker_machines_provider{state="removing"} 0
ci_docker_machines_provider{state="used"} 0
# HELP ci_runner_builds The current number of running builds.
# TYPE ci_runner_builds gauge
ci_runner_builds{stage="prepare_script",state="running"} 1
# HELP ci_runner_version_info A metric with a constant '1' value labeled by different build stats fields.
# TYPE ci_runner_version_info gauge
ci_runner_version_info{architecture="amd64",branch="rename-to-alloy-runner",built_at="2017-09-11 15:30:31 +0000 +0000",go_version="go1.8.3",name="alloy-runner",os="linux",revision="35e724fa",version="10.0.0~beta.28.g35e724fa"} 1
# HELP ci_ssh_docker_machines The total number of machines created.
# TYPE ci_ssh_docker_machines counter
ci_ssh_docker_machines{type="created"} 0
ci_ssh_docker_machines{type="removed"} 0
ci_ssh_docker_machines{type="used"} 0
# HELP ci_ssh_docker_machines_provider The current number of machines in given state.
# TYPE ci_ssh_docker_machines_provider gauge
ci_ssh_docker_machines_provider{state="acquired"} 0
ci_ssh_docker_machines_provider{state="creating"} 0
ci_ssh_docker_machines_provider{state="idle"} 0
ci_ssh_docker_machines_provider{state="removing"} 0
ci_ssh_docker_machines_provider{state="used"} 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.00030304800000000004
go_gc_duration_seconds{quantile="0.25"} 0.00038177500000000005
go_gc_duration_seconds{quantile="0.5"} 0.0009022510000000001
go_gc_duration_seconds{quantile="0.75"} 0.006189937
go_gc_duration_seconds{quantile="1"} 0.00880617
go_gc_duration_seconds_sum 0.016583181000000002
go_gc_duration_seconds_count 5
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 16
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.8288e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 7.973392e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.444932e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 73317
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 423936
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.8288e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 1.39264e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 4.407296e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 23532
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 5.799936e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.4768981425195277e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 42
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 96849
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 4800
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 72320
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 98304
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 5.274438e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.2341e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 491520
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 491520
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 9.509112e+06
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.18
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.3191552e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.47689813837e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.39746816e+08
```

Note that the lines starting with `# HELP` document the meaning of each exposed
metric. This metrics format is documented in Prometheus'
[Exposition formats](https://prometheus.io/docs/instrumenting/exposition_formats/)
specification.
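
To make the line structure concrete, here is a small Go sketch that splits one sample line into its series (metric name plus labels) and its numeric value, skipping `#` comment lines. The function name `parseSample` is hypothetical, and the parser is deliberately simplified: it assumes no trailing timestamp, which the format permits but the output above does not use. For production use, an existing Prometheus parsing library is likely the better choice.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSample splits one exposition-format line into the series (metric
// name plus labels) and its value. Comment and blank lines yield ok == false.
// Simplification: optional trailing timestamps are not handled.
func parseSample(line string) (series string, value float64, ok bool) {
	line = strings.TrimSpace(line)
	if line == "" || strings.HasPrefix(line, "#") {
		return "", 0, false
	}
	// Label values may contain spaces, so split on the last space only.
	i := strings.LastIndexByte(line, ' ')
	if i < 0 {
		return "", 0, false
	}
	v, err := strconv.ParseFloat(line[i+1:], 64)
	if err != nil {
		return "", 0, false
	}
	return line[:i], v, true
}

func main() {
	// Lines taken from the example output above.
	lines := []string{
		"# TYPE ci_runner_builds gauge",
		`ci_runner_builds{stage="prepare_script",state="running"} 1`,
		"go_goroutines 16",
		"process_resident_memory_bytes 2.3191552e+07",
	}
	for _, l := range lines {
		if series, v, ok := parseSample(l); ok {
			fmt.Printf("%s = %g\n", series, v)
		}
	}
}
```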

These metrics are meant as a way for operators to monitor and gain insight into
AlloyCI Runners. For example, you may want to know whether an increase in the
load average on your runner's host is related to an increase in processed builds.
Or you may be running a cluster of machines for builds and want to
track build trends to plan changes in your infrastructure.

### Learning more about Prometheus

To learn how to set up a Prometheus server to scrape this HTTP endpoint and
make use of the collected metrics, see Prometheus' [Getting
started](https://prometheus.io/docs/introduction/getting_started/) guide. Also
see the [Configuration](https://prometheus.io/docs/operating/configuration/)
section for more details on how to configure Prometheus, as well as the section
on [Alerting rules](https://prometheus.io/docs/alerting/rules/) and setting up
an [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) to
dispatch alert notifications.

## `pprof` HTTP endpoints

While metrics about the internal state of the Runner process are useful,
we've found that in some cases it helps to check what is happening
inside the Runner process in real time. That's why we've introduced
the `pprof` HTTP endpoints.

The `pprof` endpoints are available via an embedded HTTP server on the
`/debug/pprof/` path.

You can read more about using `pprof` in its [documentation][go-pprof].
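
As a rough sketch of how such an endpoint might be queried, the Go program below requests a text dump of all goroutine stacks. It assumes the endpoints are served by the standard `net/http/pprof` handlers (which accept a `debug` query parameter for text output) and that the server is reachable on `localhost:9252`; both are assumptions, and `profileURL` is a hypothetical helper.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
	"time"
)

// profileURL builds the URL of a named profile under /debug/pprof/.
// A debug value above zero asks net/http/pprof for human-readable text.
func profileURL(addr, profile string, debug int) string {
	u := url.URL{
		Scheme:   "http",
		Host:     addr,
		Path:     "/debug/pprof/" + profile,
		RawQuery: fmt.Sprintf("debug=%d", debug),
	}
	return u.String()
}

func main() {
	// Assumption: the embedded HTTP server listens on localhost:9252.
	client := &http.Client{Timeout: 10 * time.Second}
	resp, err := client.Get(profileURL("localhost:9252", "goroutine", 2))
	if err != nil {
		fmt.Fprintln(os.Stderr, "request failed:", err)
		return
	}
	defer resp.Body.Close()

	// Stream the goroutine dump to standard output.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		fmt.Fprintln(os.Stderr, "read failed:", err)
	}
}
```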

## Configuration of the metrics HTTP server

> **Note:**
The metrics server exports data about the internal state of the
AlloyCI Runner process and should not be publicly available!

The metrics HTTP server can be configured in two ways:

- with a `metrics_server` global configuration option in the `config.toml` file,
- with a `--metrics-server` command line option for the `run` command.

In both cases the option accepts a string with the format `[host]:<port>`,
where:

- `host` can be an IP address or a host name,
- `port` is a valid TCP port or symbolic service name (like `http`). We recommend using port `9252`, which is already [allocated in Prometheus](https://github.com/prometheus/prometheus/wiki/Default-port-allocations).

If the metrics server address does not contain a port, it will default to `9252`.

Examples of addresses:

- `:9252` - will listen on all IPs of all interfaces on port `9252`
- `localhost:9252` - will only listen on the loopback interface on port `9252`
- `[2001:db8::1]:http` - will listen on IPv6 address `[2001:db8::1]` on the HTTP port `80`
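
The address rules described above can be approximated with Go's standard `net` package. The sketch below is not the Runner's actual parsing code; `normalizeMetricsAddr` is a hypothetical helper that applies the default port `9252` when none is given and resolves symbolic service names to numeric ports.

```go
package main

import (
	"fmt"
	"net"
	"strconv"
	"strings"
)

// defaultPort is the port the documentation says is used when none is given.
const defaultPort = "9252"

// normalizeMetricsAddr returns addr as host:port with a numeric port.
// Assumption: any address that net.SplitHostPort rejects is treated as a
// bare host without a port, which is a simplification.
func normalizeMetricsAddr(addr string) (string, error) {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		// Strip IPv6 brackets so JoinHostPort does not add them twice.
		host = strings.TrimSuffix(strings.TrimPrefix(addr, "["), "]")
		port = defaultPort
	}
	if port == "" {
		port = defaultPort
	}
	// LookupPort accepts numeric ports and service names such as "http".
	n, err := net.LookupPort("tcp", port)
	if err != nil {
		return "", fmt.Errorf("invalid port %q: %w", port, err)
	}
	return net.JoinHostPort(host, strconv.Itoa(n)), nil
}

func main() {
	// The first three addresses are the examples listed above.
	for _, addr := range []string{":9252", "localhost:9252", "[2001:db8::1]:http", "localhost"} {
		resolved, err := normalizeMetricsAddr(addr)
		if err != nil {
			fmt.Printf("%q: error: %v\n", addr, err)
			continue
		}
		fmt.Printf("%q -> %s\n", addr, resolved)
	}
}
```

Resolving a service name depends on the services known to the host system, so the result for `http` may vary between environments.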

Remember that listening on ports below `1024` requires root/administrator
rights, at least on Linux/Unix systems.

Also note that the HTTP server is opened on the selected `host:port`
**without any authorization**. If you plan to bind the metrics server
to a public interface, consider using your firewall to
limit access to this server, or add an HTTP proxy that provides the
authorization and access control layer.
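
One possible shape for such a proxy is sketched below in Go: a reverse proxy that requires HTTP Basic credentials before forwarding requests. This is an in-process demonstration under stated assumptions, not a hardened setup. The backend is a stand-in for the metrics server, the credentials are placeholders, and `credentialsMatch`, `withBasicAuth`, and `statusFor` are hypothetical helpers. A real deployment would also want TLS in front of the proxy.

```go
package main

import (
	"crypto/subtle"
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/http/httputil"
	"net/url"
)

// credentialsMatch compares the supplied credentials in constant time.
func credentialsMatch(gotUser, gotPass, wantUser, wantPass string) bool {
	userOK := subtle.ConstantTimeCompare([]byte(gotUser), []byte(wantUser)) == 1
	passOK := subtle.ConstantTimeCompare([]byte(gotPass), []byte(wantPass)) == 1
	return userOK && passOK
}

// withBasicAuth rejects requests that lack the expected Basic credentials.
func withBasicAuth(user, pass string, next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth()
		if !ok || !credentialsMatch(u, p, user, pass) {
			w.Header().Set("WWW-Authenticate", `Basic realm="metrics"`)
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		next.ServeHTTP(w, r)
	})
}

// statusFor sends GET /metrics through h and returns the status code.
// Credentials are attached only when user is non-empty.
func statusFor(h http.Handler, user, pass string) int {
	req := httptest.NewRequest(http.MethodGet, "/metrics", nil)
	if user != "" {
		req.SetBasicAuth(user, pass)
	}
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, req)
	return rec.Code
}

func main() {
	// Stand-in for the metrics server, which would normally be bound to a
	// private address such as localhost:9252 (an assumption).
	backend := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "go_goroutines 16")
	}))
	defer backend.Close()

	target, err := url.Parse(backend.URL)
	if err != nil {
		panic(err)
	}
	// Placeholder credentials for the demonstration only.
	proxy := withBasicAuth("prometheus", "change-me", httputil.NewSingleHostReverseProxy(target))

	fmt.Println("without credentials:", statusFor(proxy, "", ""))
	fmt.Println("with credentials:", statusFor(proxy, "prometheus", "change-me"))
}
```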

[go-pprof]: https://golang.org/pkg/net/http/pprof/
[prometheus]: https://prometheus.io