# AlloyCI Runner monitoring

AlloyCI Runner can be monitored using [Prometheus].

## Embedded Prometheus metrics

The AlloyCI Runner is instrumented with native Prometheus
metrics, which can be exposed via an embedded HTTP server on the `/metrics`
path. The server, if enabled, can be scraped by the Prometheus monitoring
system or accessed with any other HTTP client.

The exposed information includes:

- Runner business logic metrics (e.g., the number of currently running builds)
- Go-specific process metrics (garbage collection stats, goroutines, memstats, etc.)
- general process metrics (memory usage, CPU usage, file descriptor usage, etc.)
- build version information

The following is an example of the metrics output in Prometheus'
text-based metrics exposition format:

```
# HELP ci_docker_machines The total number of machines created.
# TYPE ci_docker_machines counter
ci_docker_machines{type="created"} 0
ci_docker_machines{type="removed"} 0
ci_docker_machines{type="used"} 0
# HELP ci_docker_machines_provider The current number of machines in given state.
# TYPE ci_docker_machines_provider gauge
ci_docker_machines_provider{state="acquired"} 0
ci_docker_machines_provider{state="creating"} 0
ci_docker_machines_provider{state="idle"} 0
ci_docker_machines_provider{state="removing"} 0
ci_docker_machines_provider{state="used"} 0
# HELP ci_runner_builds The current number of running builds.
# TYPE ci_runner_builds gauge
ci_runner_builds{stage="prepare_script",state="running"} 1
# HELP ci_runner_version_info A metric with a constant '1' value labeled by different build stats fields.
# TYPE ci_runner_version_info gauge
ci_runner_version_info{architecture="amd64",branch="rename-to-alloy-runner",built_at="2017-09-11 15:30:31 +0000 +0000",go_version="go1.8.3",name="alloy-runner",os="linux",revision="35e724fa",version="10.0.0~beta.28.g35e724fa"} 1
# HELP ci_ssh_docker_machines The total number of machines created.
# TYPE ci_ssh_docker_machines counter
ci_ssh_docker_machines{type="created"} 0
ci_ssh_docker_machines{type="removed"} 0
ci_ssh_docker_machines{type="used"} 0
# HELP ci_ssh_docker_machines_provider The current number of machines in given state.
# TYPE ci_ssh_docker_machines_provider gauge
ci_ssh_docker_machines_provider{state="acquired"} 0
ci_ssh_docker_machines_provider{state="creating"} 0
ci_ssh_docker_machines_provider{state="idle"} 0
ci_ssh_docker_machines_provider{state="removing"} 0
ci_ssh_docker_machines_provider{state="used"} 0
# HELP go_gc_duration_seconds A summary of the GC invocation durations.
# TYPE go_gc_duration_seconds summary
go_gc_duration_seconds{quantile="0"} 0.00030304800000000004
go_gc_duration_seconds{quantile="0.25"} 0.00038177500000000005
go_gc_duration_seconds{quantile="0.5"} 0.0009022510000000001
go_gc_duration_seconds{quantile="0.75"} 0.006189937
go_gc_duration_seconds{quantile="1"} 0.00880617
go_gc_duration_seconds_sum 0.016583181000000002
go_gc_duration_seconds_count 5
# HELP go_goroutines Number of goroutines that currently exist.
# TYPE go_goroutines gauge
go_goroutines 16
# HELP go_memstats_alloc_bytes Number of bytes allocated and still in use.
# TYPE go_memstats_alloc_bytes gauge
go_memstats_alloc_bytes 2.8288e+06
# HELP go_memstats_alloc_bytes_total Total number of bytes allocated, even if freed.
# TYPE go_memstats_alloc_bytes_total counter
go_memstats_alloc_bytes_total 7.973392e+06
# HELP go_memstats_buck_hash_sys_bytes Number of bytes used by the profiling bucket hash table.
# TYPE go_memstats_buck_hash_sys_bytes gauge
go_memstats_buck_hash_sys_bytes 1.444932e+06
# HELP go_memstats_frees_total Total number of frees.
# TYPE go_memstats_frees_total counter
go_memstats_frees_total 73317
# HELP go_memstats_gc_sys_bytes Number of bytes used for garbage collection system metadata.
# TYPE go_memstats_gc_sys_bytes gauge
go_memstats_gc_sys_bytes 423936
# HELP go_memstats_heap_alloc_bytes Number of heap bytes allocated and still in use.
# TYPE go_memstats_heap_alloc_bytes gauge
go_memstats_heap_alloc_bytes 2.8288e+06
# HELP go_memstats_heap_idle_bytes Number of heap bytes waiting to be used.
# TYPE go_memstats_heap_idle_bytes gauge
go_memstats_heap_idle_bytes 1.39264e+06
# HELP go_memstats_heap_inuse_bytes Number of heap bytes that are in use.
# TYPE go_memstats_heap_inuse_bytes gauge
go_memstats_heap_inuse_bytes 4.407296e+06
# HELP go_memstats_heap_objects Number of allocated objects.
# TYPE go_memstats_heap_objects gauge
go_memstats_heap_objects 23532
# HELP go_memstats_heap_released_bytes_total Total number of heap bytes released to OS.
# TYPE go_memstats_heap_released_bytes_total counter
go_memstats_heap_released_bytes_total 0
# HELP go_memstats_heap_sys_bytes Number of heap bytes obtained from system.
# TYPE go_memstats_heap_sys_bytes gauge
go_memstats_heap_sys_bytes 5.799936e+06
# HELP go_memstats_last_gc_time_seconds Number of seconds since 1970 of last garbage collection.
# TYPE go_memstats_last_gc_time_seconds gauge
go_memstats_last_gc_time_seconds 1.4768981425195277e+09
# HELP go_memstats_lookups_total Total number of pointer lookups.
# TYPE go_memstats_lookups_total counter
go_memstats_lookups_total 42
# HELP go_memstats_mallocs_total Total number of mallocs.
# TYPE go_memstats_mallocs_total counter
go_memstats_mallocs_total 96849
# HELP go_memstats_mcache_inuse_bytes Number of bytes in use by mcache structures.
# TYPE go_memstats_mcache_inuse_bytes gauge
go_memstats_mcache_inuse_bytes 4800
# HELP go_memstats_mcache_sys_bytes Number of bytes used for mcache structures obtained from system.
# TYPE go_memstats_mcache_sys_bytes gauge
go_memstats_mcache_sys_bytes 16384
# HELP go_memstats_mspan_inuse_bytes Number of bytes in use by mspan structures.
# TYPE go_memstats_mspan_inuse_bytes gauge
go_memstats_mspan_inuse_bytes 72320
# HELP go_memstats_mspan_sys_bytes Number of bytes used for mspan structures obtained from system.
# TYPE go_memstats_mspan_sys_bytes gauge
go_memstats_mspan_sys_bytes 98304
# HELP go_memstats_next_gc_bytes Number of heap bytes when next garbage collection will take place.
# TYPE go_memstats_next_gc_bytes gauge
go_memstats_next_gc_bytes 5.274438e+06
# HELP go_memstats_other_sys_bytes Number of bytes used for other system allocations.
# TYPE go_memstats_other_sys_bytes gauge
go_memstats_other_sys_bytes 1.2341e+06
# HELP go_memstats_stack_inuse_bytes Number of bytes in use by the stack allocator.
# TYPE go_memstats_stack_inuse_bytes gauge
go_memstats_stack_inuse_bytes 491520
# HELP go_memstats_stack_sys_bytes Number of bytes obtained from system for stack allocator.
# TYPE go_memstats_stack_sys_bytes gauge
go_memstats_stack_sys_bytes 491520
# HELP go_memstats_sys_bytes Number of bytes obtained by system. Sum of all system allocations.
# TYPE go_memstats_sys_bytes gauge
go_memstats_sys_bytes 9.509112e+06
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 0.18
# HELP process_max_fds Maximum number of open file descriptors.
# TYPE process_max_fds gauge
process_max_fds 1024
# HELP process_open_fds Number of open file descriptors.
# TYPE process_open_fds gauge
process_open_fds 8
# HELP process_resident_memory_bytes Resident memory size in bytes.
# TYPE process_resident_memory_bytes gauge
process_resident_memory_bytes 2.3191552e+07
# HELP process_start_time_seconds Start time of the process since unix epoch in seconds.
# TYPE process_start_time_seconds gauge
process_start_time_seconds 1.47689813837e+09
# HELP process_virtual_memory_bytes Virtual memory size in bytes.
# TYPE process_virtual_memory_bytes gauge
process_virtual_memory_bytes 3.39746816e+08
```

Note that the lines starting with `# HELP` document the meaning of each exposed
metric. This metrics format is documented in Prometheus'
[Exposition formats](https://prometheus.io/docs/instrumenting/exposition_formats/)
specification.

These metrics are meant as a way for operators to monitor and gain insight into
AlloyCI Runners. For example, you may want to know whether an increase in load
average on your runner's host is related to an increase in processed builds. Or
you may be running a cluster of machines used for builds and want to track build
trends in order to plan changes to your infrastructure.

### Learning more about Prometheus

To learn how to set up a Prometheus server to scrape this HTTP endpoint and
make use of the collected metrics, see Prometheus's [Getting
started](https://prometheus.io/docs/introduction/getting_started/) guide. Also
see the [Configuration](https://prometheus.io/docs/operating/configuration/)
section for more details on how to configure Prometheus, as well as the sections
on [Alerting rules](https://prometheus.io/docs/alerting/rules/) and setting up
an [Alertmanager](https://prometheus.io/docs/alerting/alertmanager/) to
dispatch alert notifications.
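As a starting point, a minimal Prometheus scrape configuration for this endpoint might look like the following sketch. The job name and the target address are assumptions; use the `[host]:<port>` your metrics server is actually configured to listen on:

```yaml
# Hedged sketch of a prometheus.yml fragment, not a definitive setup.
scrape_configs:
  - job_name: 'alloy-runner'
    static_configs:
      - targets: ['localhost:9252']   # assumed metrics server address
```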

## `pprof` HTTP endpoints

While metrics about the internal state of the runner process are useful,
we've found that in some cases it is also good to check what is happening
inside the running process in real time. That's why we've introduced
the `pprof` HTTP endpoints.

The `pprof` endpoints are available via the embedded HTTP server on the
`/debug/pprof/` path.

You can read more about using `pprof` in its [documentation][go-pprof].

## Configuration of the metrics HTTP server

> **Note:**
> The metrics server exports data about the internal state of the
> AlloyCI Runner process and should not be publicly available!

The metrics HTTP server can be configured in two ways:

- with the `metrics_server` global configuration option in the `config.toml` file,
- with the `--metrics-server` command line option for the `run` command.

In both cases the option accepts a string with the format `[host]:<port>`,
where:

- `host` can be an IP address or a host name,
- `port` is a valid TCP port or symbolic service name (like `http`). We
  recommend using port `9252`, which is already
  [allocated in Prometheus](https://github.com/prometheus/prometheus/wiki/Default-port-allocations).

If the metrics server address does not contain a port, it will default to `9252`.

Examples of addresses:

- `:9252` - will listen on all IPs of all interfaces on port `9252`
- `localhost:9252` - will only listen on the loopback interface on port `9252`
- `[2001:db8::1]:http` - will listen on the IPv6 address `[2001:db8::1]` on the HTTP port (`80`)

Remember that to listen on ports below `1024`, at least on Linux/Unix
systems, you need root/administrator rights.

Also note that the HTTP server is opened on the selected `host:port`
**without any authorization**.
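For that reason, a sensible default is to keep the server on the loopback interface. A hedged sketch of the relevant `config.toml` fragment, using only the `metrics_server` option described above:

```toml
# Hedged sketch of the top-level (global) part of config.toml.
# "localhost" keeps the metrics server reachable only from this host;
# port 9252 matches the recommended/default port.
metrics_server = "localhost:9252"
```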
If you plan to bind the metrics server
to a public interface, consider using your firewall to
limit access to this server, or add an HTTP proxy in front of it that
provides an authorization and access control layer.

[go-pprof]: https://golang.org/pkg/net/http/pprof/
[prometheus]: https://prometheus.io