# Prometheus support in Inspektor Gadget

Inspektor Gadget has a lot of tools that hook into the kernel to capture different events like files
being opened, processes being created, DNS requests, etc. Currently it's mostly designed as a
troubleshooting tool: it prints those events to the terminal as they happen. However, it's an easy
win to provide metrics through Prometheus: the whole logic to capture the data is already in place;
we only need to aggregate the data and expose it in the Prometheus format.

This document contains a design proposal for supporting Prometheus metrics in Inspektor Gadget.
Upstream issue:
[https://github.com/inspektor-gadget/inspektor-gadget/issues/1513](https://github.com/inspektor-gadget/inspektor-gadget/issues/1513)

# Goals

This document is written with the following goals in mind, in descending order of priority:

- Bring this support to market soon
- The metrics to expose should be configurable
- The solution should be performant

# Design Decisions

## Metrics to expose

In order to be as flexible as possible, the user should be able to configure the metrics they want
to expose for each gadget. Most gadgets emit events from eBPF, including several fields of data, and
send them to the user-space part of IG for processing. In a generic solution, most of these fields
should be selectable for metric collection, aggregation and filtering. However, since all events are
handled in user space, this could negatively impact performance. To mitigate that, we also propose a
way to handle collection of the most commonly used metrics directly in eBPF.

## Labels Granularity

High cardinality (a lot of distinct label combinations) can be problematic as it increases the
memory usage of both the collector (IG) and the consumer (Prometheus).
As stated above, users
should still be able to configure the granularity they want to have, and so should consider the
cardinality themselves.

## Filtering

Inspektor Gadget already provides a mechanism to filter out events we're not interested in. This
mechanism should be reused by the Prometheus integration to avoid handling metrics for objects the
user is not interested in.

# User Experience

The metric collection and export to Prometheus should be supported in both cases: a) when running in
Kubernetes (ig-k8s), and b) when running on Linux hosts (ig). This can be achieved by implementing
it as a new Prometheus gadget/operator, as that makes the code automatically shareable between ig,
ig-k8s and external applications. This gadget/operator provides start / stop operations to enable /
disable collection of metrics.

```bash
$ kubectl gadget prometheus start --config <path>
$ kubectl gadget prometheus stop
```

TODO: need to think about not having a start operation on this one

```bash
$ ig prometheus --config <path>
```

It should also be possible to configure the metrics using a CR - supporting a [static
configuration](https://github.com/inspektor-gadget/inspektor-gadget/issues/1401) could be
implemented in the future as well.

## Configuration File

Given that we want this to be flexible, allowing the user to control which metrics to capture and
how to aggregate them, we will use configuration files to define those aspects. This takes
inspiration from
[https://github.com/cloudflare/ebpf_exporter](https://github.com/cloudflare/ebpf_exporter).

### Filtering (aka Selectors)

The user should be able to provide a set of filters indicating the events that should be taken into
consideration when collecting the metrics. The mechanism should provide the following features:

- Equal operator
  - "columnName:value"
- Not equal operator (!)
  - "columnName:!value"
- Greater than, less than operators (<, >, <=, >=)
  - "columnName:>value"
- Match regex (~)
  - "columnName:~regex"

In the future, we could consider introducing more advanced operators like:

- Set based
  - In
  - NotIn

It's similar to the existing [Labels and
Selectors](https://kubernetes.io/docs/concepts/overview/working-with-objects/labels/) mechanism, but
we still need to understand whether we can reuse that or whether we need a completely new
implementation.

Some examples of possible filters are:

```yaml
# Only metrics for the default namespace
selector:
  - "k8s.namespace:default"

# Count only events with retval != 0
selector:
  - "retval:!0"
```

The configuration file defines the different metrics to collect.

### Counters

This is probably the most intuitive metric: "A _counter_ is a cumulative metric that represents a
single [monotonically increasing counter](https://en.wikipedia.org/wiki/Monotonic_function) whose
value can only increase or be reset to zero on restart. For example, you can use a counter to
represent the number of requests served, tasks completed, or errors." from
[https://prometheus.io/docs/concepts/metric_types/#counter](https://prometheus.io/docs/concepts/metric_types/#counter).

The following are examples of counters we can support with the existing gadgets. The first one
counts the number of executed processes.

```yaml
metrics:
  # executed processes by namespace, pod and container
  - name: executed_processes
    type: counter
    category: trace
    gadget: exec
    labels:
      - k8s.namespace
      - k8s.pod
      - k8s.container
```

The category and gadget fields define which gadget to use. The labels indicate how metrics are
aggregated, i.e., the cardinality of the exposed metric.
In this case, we'll have a counter for each
namespace, pod and container combination.

Another example that reports the number of executed processes, aggregated by comm and namespace:

```yaml
# executed processes by comm and namespace
- name: executed_processes_by_comm
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.namespace
    - comm
```

It is possible to count events based on matching criteria. For instance, the following counter
only considers events in the default namespace:

```yaml
# executed processes by pod and container in the default namespace
- name: executed_processes
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.pod
    - k8s.container
  selector:
    - "k8s.namespace:default"
```

Or only count events for a given command:

```yaml
# cat executions by namespace, pod and container
- name: executed_cats # ohno!
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  selector:
    - "comm:cat"
```

And finally, we can provide counters for failed operations:

```yaml
# failed execs by namespace, pod and container
- name: failed_execs
  type: counter
  category: trace
  gadget: exec
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  selector:
    - "retval:!0"
```

Filtering can also be used for gadgets that provide events describing two different situations; for
instance, the trace dns gadget emits events for both requests and answers. We can then expose a
counter only for requests, based on the value of the "qr" field.

```yaml
# DNS requests aggregated by namespace and pod
- name: dns_requests
  type: counter
  category: trace
  gadget: dns
  labels:
    - k8s.namespace
    - k8s.pod
  selector:
    # Only count query events
    - "qr:Q"
```

Another example is:

```yaml
# bpf seccomp violations
- name: seccomp_violations
  type: counter
  category: audit
  gadget: seccomp
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
    - syscall
  selector:
    - "syscall:bpf"
```

By default, a counter is increased by one each time there is an event; however, it's also possible
to increase a counter by a field of the event:

```yaml
# Read bytes on ext4 filesystem
- name: read_bytes_ext4
  type: counter
  category: trace
  gadget: fsslower
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  field: bytes
  selector:
    - "filesystem:ext4"
    - "op:R"
```

### Gauges

"A _gauge_ is a metric that represents a single numerical value that can arbitrarily go up and down"
from
[https://prometheus.io/docs/concepts/metric_types/#gauge](https://prometheus.io/docs/concepts/metric_types/#gauge).

It seems that the only category of gadgets that can provide data to be interpreted as a gauge is the
snapshotters.

```yaml
# Number of processes by namespace / pod / container
- name: number_of_processes
  type: gauge
  category: snapshot
  gadget: process
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container

# Number of sockets in CLOSE_WAIT state
- name: number_of_sockets_close_wait
  type: gauge
  category: snapshot
  gadget: socket
  labels:
    - k8s.namespace
    - k8s.pod
    - k8s.container
  selector:
    - "status:CLOSE_WAIT"
```

TODO: It is not totally clear how this should work, since these gadgets don't provide a stream of
events.
In this case, we should execute the gadget each time Prometheus scrapes the endpoint.

### Histograms

The histogram definition is a bit more complex than the previous ones, so please check the
Prometheus documentation:
[https://prometheus.io/docs/concepts/metric_types/#histogram](https://prometheus.io/docs/concepts/metric_types/#histogram)

We'll support the same bucket configuration as described in
[https://github.com/cloudflare/ebpf_exporter#histograms](https://github.com/cloudflare/ebpf_exporter#histograms).

```yaml
# DNS replies latency
- name: dns_latency
  type: histogram
  category: trace
  gadget: dns
  field: latency
  bucket:
    min: 0s
    max: 1m
    type: exp2
  labels:
    - k8s.namespace
    - k8s.pod
  selector:
    - "qr:R"
```

# Implementation

## Gadgets supported

We want Prometheus to be supported by as many gadgets as possible; however, it's currently not
possible to support all of them. The initial implementation covers these gadgets:

- Tracers: counters and histograms
- Snapshot: gauges

There are some categories where we don't know whether they can be supported at all, so we should
probably also define per-gadget support:

- Audit seccomp: counters
- Profile block-io: would be nice, but could require some extra work

## Metrics collection

The main implementation decision is where to count the metrics. This proposal includes two
approaches that are independent and can be implemented in parallel or one after the other:

- Collect metrics in user space: a very flexible but less performant solution
- Collect metrics in eBPF: a more performant solution that should handle the most common metrics

### Collection in user space

In this case, the counting / aggregation happens in user space.
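
As a rough, stdlib-only Go sketch of what this user-space aggregation amounts to (the `event`
fields and the `labelKey` / `countEvents` names are hypothetical placeholders, not Inspektor
Gadget's actual API):

```go
package main

import "fmt"

// event is a simplified stand-in for a gadget event (e.g. from trace exec),
// carrying only the fields used in this sketch.
type event struct {
	Namespace string
	Pod       string
	Retval    int
}

// labelKey is the tuple of label values a metric is aggregated by; using a
// comparable struct as the map key yields one counter per label combination.
type labelKey struct {
	Namespace string
	Pod       string
}

// countEvents folds a stream of events into a counter per label tuple,
// applying a selector (the equivalent of filters like "retval:!0").
func countEvents(events []event, selector func(event) bool) map[labelKey]uint64 {
	counters := map[labelKey]uint64{}
	for _, ev := range events {
		if !selector(ev) {
			continue // filtered out, exactly like the selectors above
		}
		counters[labelKey{ev.Namespace, ev.Pod}]++
	}
	return counters
}

func main() {
	events := []event{
		{"default", "web-1", 0},
		{"default", "web-1", 1},
		{"kube-system", "dns-2", 1},
	}
	// Mirror the "retval:!0" selector: only count failed execs.
	failed := countEvents(events, func(ev event) bool { return ev.Retval != 0 })
	fmt.Println(failed[labelKey{"default", "web-1"}], failed[labelKey{"kube-system", "dns-2"}])
	// prints: 1 1
}
```

The selector closure plays the role of the selectors in the configuration examples above; in the
real implementation both labels and selectors come from the user's configuration file rather than
being hard-coded.
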
Collecting in user space is the simplest way to collect metrics, but also the most expensive one. It
leverages the whole functionality of our gadgets as it is right now. Events are still collected in
eBPF and sent to user space as they occur, where they are evaluated, i.e., aggregated and filtered
according to the user's configuration.

This option is designed to be flexible rather than performant. For metrics with high throughput,
users should use the metrics collection backed by eBPF (see below).

The implementation is based on the existing parser that uses reflection underneath. It was
implemented in https://github.com/inspektor-gadget/inspektor-gadget/pull/1620.

### Collection in eBPF

We should extend the gadgets to collect some common metrics in eBPF to make this solution more
performant. This should be the preferred way of collecting metrics, even if it isn't as flexible as
the implementation based on events (and it's harder to implement). In this case, we'll have to
define a list of common metrics that are exposed by each gadget (TODO).

Gadgets' eBPF code would require two new constants:

```c
bool enable_events;
bool enable_metrics;
```

When metric collection is enabled, the gadget expects maps that it can fill. The gadget/operator
will provide such maps, with a layout depending on the user's configuration.

The biggest issue here is that we basically need to have counters for each of the possible tuples of
the requested labels (which are dynamic). This could be solved by a BPF_HASH_MAP (potentially
PER_CPU) with a key looking, for example, like this:

```c
struct metric_key_t {
    __u64 mount_ns_id;
    __u32 reason;
};
```

With each added label, the key length increases.
Adding, for example, the SADDR, the key would look
like this:

```c
struct metric_key_t {
    __u64 mount_ns_id;
    __u8 saddr[16];
    __u32 reason;
};
```

The actual value consists of a simpler struct, containing for example just a counter variable,
buckets for histograms, etc. (and maybe a last-access timestamp).

The operator would then periodically iterate over the map and update the exported metrics.

We'd have to think about pruning the maps periodically as well, otherwise the maps would only grow -
the mentioned timestamp could help with that. We also have to notify the user about possible
overflows.

#### Use of Macros

To be able to use maps as described above, the keys of the maps have to be sized dynamically. This
can be achieved by using macros and consts that switch certain fields on/off on demand (I've started
work on a PoC).

A definition like this

```c
#define METRIC_LIST_KEYS(X) \
    X(MntNS, __u64, 8) \
    X(Reason, __u32, 4)
```

could then create the required functions (like offset helpers) and volatile consts that will be set
upon starting the gadget.

# Compatibility with script and BYOB (bring your own bpf) gadgets

These two gadgets allow the user to inject custom eBPF programs. It should be possible to also
support metric collection for them using this solution. The structure of the eBPF maps that collect
the metrics should be well defined to create a contract between the eBPF programs and the
Operator/Gadget in user space.

## BYOB

We have to document the structure of the maps, and the user will be responsible for creating and
filling them in their eBPF programs.

## Script

This gadget creates the eBPF maps on behalf of users when they specify they want a counter. We will
need to be sure those maps stick to the contract defined above.
The syntax for
defining counters, histograms, etc. via the DSL can be very similar to what we already have in
bpftrace:
[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#2-count-count](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#2-count-count),
[https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#8-hist-log2-histogram](https://github.com/iovisor/bpftrace/blob/master/docs/reference_guide.md#8-hist-log2-histogram).

# Out of Scope

- Providing a Golang package for 3rd party applications: This solution will be implemented as a
  Gadget/Operator, hence it'll also be available for 3rd party applications. It's not on our roadmap
  to provide a Golang package supporting this; however, it should be easy to refactor the code and
  create such a package if needed in the future. At that point we'll need to determine the format
  used by this package to expose the data. One possibility is to expose OTel metrics that the user
  then converts to the needed format, Prometheus for instance.