# Prometheus support in Inspektor Gadget

Inspektor Gadget has a lot of tools that hook into the kernel to capture different events like files
being opened, processes being created, DNS requests, etc. Currently it's mostly designed as a
troubleshooting tool: it prints those events to the terminal as they happen. However, it's an easy
win to provide metrics through Prometheus: the whole logic to capture the data is already in place;
we only need to aggregate the data and expose it in the Prometheus format.

This document contains a design proposal for supporting Prometheus metrics in Inspektor Gadget.
Upstream issue:
[https://github.com/inspektor-gadget/inspektor-gadget/issues/1513](https://github.com/inspektor-gadget/inspektor-gadget/issues/1513)

# Goals

This document is written with the following goals in mind, in descending order of priority:

- Bring this support to market soon
- The metrics to expose should be configurable
- The solution should be performant

# Design Decisions

## Metrics to expose

In order to be as flexible as possible, the user should be able to configure the metrics they want
to expose for each gadget. Most gadgets emit events from eBPF, including several fields of data, and
send them to the user-space part of IG for processing. In a generic solution, most of these fields
should be selectable for metric collection, aggregation and filtering. However, since all events are
handled in user space, this could negatively impact performance. To mitigate that, we also propose a
way to handle collection of the most commonly used metrics directly in eBPF.

## Labels Granularity

High cardinality (a lot of distinct label combinations) can be problematic as it increases the
memory usage of both the collector (IG) and the consumer (Prometheus).
As stated above, users
should still be able to configure the granularity they want to have, and so should consider the
cardinality themselves.

## Filtering

Inspektor Gadget already provides a mechanism to filter out events we're not interested in. This
mechanism should be reused by the Prometheus integration to avoid handling metrics for objects the
user is not interested in.

# User Experience

The metric collection and export to Prometheus should be supported in both cases: a) when running in
Kubernetes (ig-k8s), and b) when running on Linux hosts (ig). This can be achieved by implementing
it as a new Prometheus gadget/operator, as that makes the code automatically shareable between ig,
ig-k8s and external applications. This gadget/operator provides start / stop operations to enable /
disable collection of metrics.

```bash
$ kubectl gadget prometheus start --config <path>
$ kubectl gadget prometheus stop
```

TODO: need to think about not having a start operation on this one

```bash
$ ig prometheus --config <path>
```

It should also be possible to configure the metrics using a CR - supporting a [static
configuration](https://github.com/inspektor-gadget/inspektor-gadget/issues/1401) could be
implemented in the future as well.

## Configuration File

Given that we want this to be flexible, allowing the user to control which metrics to capture and
how to aggregate them, we will use configuration files to define those aspects. This takes
inspiration from
[https://github.com/cloudflare/ebpf_exporter](https://github.com/cloudflare/ebpf_exporter).

### Filtering (aka Selectors)

The user should be able to provide a set of filters indicating the events that should be taken into
consideration when collecting the metrics. The mechanism should provide the following features:

- Equal operator
  - "columnName:value"
- Not equal operator (!)
  - "columnName:!value"
- Greater than, less than operators (<, >, <=, >=)
  - "columnName:>value"
- Match regex (~)
  - "columnName:~regex"

In the future, we could consider introducing more advanced operators like:

- Set based
  - In
  - NotIn

It's similar to the existing [Labels and
Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) mechanism, but
we still need to understand whether we can reuse that or whether we need a completely new
implementation.

Some examples of possible filters are:

```yaml
# Only metrics for the default namespace
selector:
  - "k8s.namespace:default"

# Count only events with retval != 0
selector:
  - "retval:!0"
```

The configuration file defines the different metrics to collect.

### Counters

This is probably the most intuitive metric: "A _counter_ is a cumulative metric that represents a
single [monotonically increasing counter](https://en.wikipedia.org/wiki/Monotonic_function) whose
value can only increase or be reset to zero on restart. For example, you can use a counter to
represent the number of requests served, tasks completed, or errors." from
[https://prometheus.io/docs/concepts/metric_types/#counter](https://prometheus.io/docs/concepts/metric_types/#counter).

The following are examples of counters we can support with the existing gadgets. The first one
counts the number of executed processes.

```yaml
metrics:
  # executed processes by namespace, pod and container
  - name: executed_processes
    type: counter
    category: trace
    gadget: exec
    labels:
      - k8s.namespace
      - k8s.pod
      - k8s.container
```

The category and gadget fields define which gadget to use. The labels indicate how metrics are
aggregated, i.e., the cardinality of the exposed metric.
In this case, we'll have a counter for each
namespace, pod and container combination.

Another example that reports the number of executed processes, aggregated by comm and namespace:

```yaml
# executed processes by comm and namespace
- name: executed_processes_by_comm
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.namespace
    - comm
```

It is possible to count events based on matching criteria. For instance, the following counter
only considers events in the default namespace:

```yaml
# executed processes by pod and container in the default namespace
- name: executed_processes
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.pod
    - k8s.container
  selector:
    - "k8s.namespace:default"
```

Or only count events for a given command:

```yaml
# cat executions by namespace, pod and container
- name: executed_cats # ohno!
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  selector:
    - "comm:cat"
```

And finally, we can provide counters for failed operations:

```yaml
# failed execs by namespace, pod and container
- name: failed_execs
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  selector:
    - "retval:!0"
```

Filtering can also be used for gadgets that provide events describing two different situations; for
instance, the trace dns gadget emits events for both requests and answers. We can then expose a
counter only for requests, based on the value of the "qr" field.

```yaml
# DNS requests aggregated by namespace and pod
- name: dns_requests
  type: counter
  category: trace
  gadget: dns
  labels:
    - k8s.namespace
    - k8s.pod
  selector:
    # Only count query events
    - "qr:Q"
```

Another example is:

```yaml
# bpf seccomp violations
- name: seccomp_violations
  type: counter
  category: audit
  gadget: seccomp
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
    - syscall
  selector:
    - "syscall:bpf"
```

By default, a counter is increased by one each time there is an event; however, it's also possible
to increase a counter by a field of the event:

```yaml
# Read bytes on ext4 filesystem
- name: read_bytes_ext4
  type: counter
  category: trace
  gadget: fsslower
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  field: bytes
  selector:
    - "filesystem:ext4"
    - "op:R"
```

### Gauges

"A _gauge_ is a metric that represents a single numerical value that can arbitrarily go up and down"
from
[https://prometheus.io/docs/concepts/metric_types/#gauge](https://prometheus.io/docs/concepts/metric_types/#gauge).

It seems that the only category of gadgets that can provide data to be interpreted as a gauge is the
snapshotters.

```yaml
# Number of processes by namespace / pod / container
- name: number_of_processes
  type: gauge
  category: snapshot
  gadget: process
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container

# Number of sockets in CLOSE_WAIT state
- name: number_of_sockets_close_wait
  type: gauge
  category: snapshot
  gadget: socket
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  selector:
    - "status:CLOSE_WAIT"
```

TODO: It is not totally clear how this should work, since these gadgets don't provide a stream of
events.
In this case, we should execute the gadget each time Prometheus scrapes the endpoint.

### Histograms

The histogram definition is a bit more complex than the previous ones, so please check the
Prometheus documentation:
[https://prometheus.io/docs/concepts/metric_types/#histogram](https://prometheus.io/docs/concepts/metric_types/#histogram)

We'll support the same bucket configuration as described in
[https://github.com/cloudflare/ebpf_exporter#histograms](https://github.com/cloudflare/ebpf_exporter#histograms).

```yaml
# DNS replies latency
- name: dns_latency
  type: histogram
  category: trace
  gadget: dns
  field: latency
  bucket:
    min: 0s
    max: 1m
    type: exp2
  labels:
    - k8s.namespace
    - k8s.pod
  selector:
    - "qr:R"
```

# Implementation

## Gadgets supported

We want Prometheus to be supported by as many gadgets as possible; however, it's currently not
possible to support all of them. The initial implementation covers these gadgets:

- Tracers: counters and histograms
- Snapshot: gauges

There are some categories where we don't know whether they can be supported at all, so we should
probably also define per-gadget support:

- Audit seccomp: counters
- Profile block-io: would be nice, but could require some extra work

## Metrics collection

The main implementation decision is where to count the metrics. This proposal includes two
approaches that are independent and can be implemented in parallel or one after the other:

- Collect metrics in user space: a very flexible but less performant solution
- Collect metrics in eBPF: a more performant solution that should handle the most common metrics

### Collection in user space

In this case, the counting / aggregation happens in user space.
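
As a rough, stdlib-only Go sketch of what this user-space aggregation amounts to (the `event`
fields and the `labelKey` / `countEvents` names are hypothetical placeholders, not Inspektor
Gadget's actual API):

```go
package main

import "fmt"

// event is a simplified stand-in for a gadget event (e.g. from trace exec),
// carrying only the fields used in this sketch.
type event struct {
	Namespace string
	Pod       string
	Retval    int
}

// labelKey is the tuple of label values a metric is aggregated by; using a
// comparable struct as the map key yields one counter per label combination.
type labelKey struct {
	Namespace string
	Pod       string
}

// countEvents folds a stream of events into a counter per label tuple,
// applying a selector (the equivalent of filters like "retval:!0").
func countEvents(events []event, selector func(event) bool) map[labelKey]uint64 {
	counters := map[labelKey]uint64{}
	for _, ev := range events {
		if !selector(ev) {
			continue // filtered out, exactly like the selectors above
		}
		counters[labelKey{ev.Namespace, ev.Pod}]++
	}
	return counters
}

func main() {
	events := []event{
		{"default", "web-1", 0},
		{"default", "web-1", 1},
		{"kube-system", "dns-2", 1},
	}
	// Mirror the "retval:!0" selector: only count failed execs.
	failed := countEvents(events, func(ev event) bool { return ev.Retval != 0 })
	fmt.Println(failed[labelKey{"default", "web-1"}], failed[labelKey{"kube-system", "dns-2"}])
	// prints: 1 1
}
```

The selector closure plays the role of the selectors in the configuration examples above; in the
real implementation both labels and selectors come from the user's configuration file rather than
being hard-coded.
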
Collecting in user space is the simplest way to collect metrics, but also the most expensive one. It
leverages the whole functionality of our gadgets as it is right now. Events are still collected in
eBPF and sent to user space as they occur, where they are evaluated, i.e., aggregated and filtered
according to the user's configuration.

This option is designed to be flexible rather than performant. For metrics with high throughput,
users should use the metrics collection backed by eBPF (see below).

The implementation is based on the existing parser that uses reflection underneath. It was
implemented in https://github.com/inspektor-gadget/inspektor-gadget/pull/1620.

### Collection in eBPF

We should extend the gadgets to collect some common metrics in eBPF to make this solution more
performant. This should be the preferred way of collecting metrics, even if it isn't as flexible as
the implementation based on events (and it's harder to implement). In this case, we'll have to
define a list of common metrics that are exposed by each gadget (TODO).

Gadgets' eBPF code would require two new constants:

```c
bool enable_events;
bool enable_metrics;
```

When metric collection is enabled, the gadget expects maps that it can fill. The gadget/operator
will provide such maps, with a layout depending on the user's configuration.

The biggest issue here is that we basically need to have counters for each of the possible tuples of
the requested labels (which are dynamic). This could be solved by a BPF_HASH_MAP (potentially
PER_CPU) with a key looking, for example, like this:

```c
struct metric_key_t {
    __u64 mount_ns_id;
    __u32 reason;
};
```

With each added label, the key length increases.
Adding, for example, the SADDR, the key would look
like this:

```c
struct metric_key_t {
    __u64 mount_ns_id;
    __u8 saddr[16];
    __u32 reason;
};
```

The actual value consists of a simpler struct, containing for example just a counter variable,
buckets for histograms, etc. (and maybe a last-access timestamp).

The operator would then periodically iterate over the map and update the exported metrics.

We'd have to think about pruning the maps periodically as well, otherwise the maps would only grow -
the mentioned timestamp could help with that. We also have to notify the user about possible
overflows.

#### Use of Macros

To be able to use maps as described above, the keys of the maps have to be sized dynamically. This
can be achieved by using macros and consts that switch certain fields on/off on demand (I've started
work on a PoC).

A definition like this

```c
#define METRIC_LIST_KEYS(X) \
    X(MntNS, __u64, 8) \
    X(Reason, __u32, 4)
```

could then create the required functions (like offset helpers) and volatile consts that will be set
upon starting the gadget.

# Compatibility with script and BYOB (bring your own bpf) gadgets

These two gadgets allow the user to inject custom eBPF programs. It should be possible to also
support metric collection for them using this solution. The structure of the eBPF maps that collect
the metrics should be well defined to create a contract between the eBPF programs and the
Operator/Gadget in user space.

## BYOB

We have to document the structure of the maps, and the user will be responsible for creating and
filling them in their eBPF programs.

## Script

This gadget creates the eBPF maps on behalf of users when they specify they want a counter. We will
need to be sure those maps stick to the contract defined above.
The syntax for
defining counters, histograms, etc. via the DSL can be very similar to what we already have in
bpftrace:
[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#2-count-count](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#2-count-count),
[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#8-hist-log2-histogram](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#8-hist-log2-histogram).

# Out of Scope

- Providing a Golang package for 3rd party applications: This solution will be implemented as a
  Gadget/Operator, hence it'll also be available for 3rd party applications. It's not on our roadmap
  to provide a Golang package supporting this; however, it should be easy to refactor the code and
  create such a package if needed in the future. At that point we'll need to determine the format
  used by this package to expose the data. One possibility is to expose OTel metrics that the user
  then converts to the needed format, Prometheus for instance.