# Introduction

This package provides a remote interface to observe the behavior of the
application running inside the sandbox. It was built with runtime monitoring in
mind, e.g. threat detection, but it can be used for other purposes as well. It
allows a process running outside the sandbox to receive a stream of trace data
asynchronously. This process can watch actions performed by the application,
generate alerts when something unexpected occurs, log these actions, etc.

First, let's go over a few concepts before we get into the details.

## Concepts

-   **Points:** these are discrete places (or points) in the code where
    instrumentation was added. Each point has a unique name and schema. They can
    be individually enabled/disabled. For example, `container/start` is a point
    that is fired when a new container starts.
-   **Point fields:** each point may contain fields that carry point data. For
    example, `container/start` has an `id` field with the ID of the container
    that is being started.
-   **Optional fields:** each point may also have optional fields. By default
    these fields are not collected, and they can be manually set to be collected
    when the point is configured. These fields are normally more expensive to
    collect and/or large, e.g. resolving an FD to its path, or the data for
    read/write.
-   **Context fields:** these are fields generally available to most events, but
    are disabled by default. Like optional fields, they can be set to be
    collected when the point is configured. Context field data comes from the
    context where the point is being fired; for example, PID, UID/GID, and
    container ID are available to most trace points.
-   **Sink:** sinks are trace point consumers. Each sink is identified by a name
    and may handle trace points differently. Later we'll describe in more detail
    which sinks are available in the system and how to use them.
-   **Session:** a trace session is a set of points that are enabled with their
    corresponding configuration. A trace session also has a list of sinks that
    will receive the trace points. A session is identified by a unique name.
    Once a session is deleted, all points belonging to the session are disabled
    and the sinks are destroyed.

If you're interested in exploring further, there are more details in the
[design doc](https://docs.google.com/document/d/1RQQKzeFpO-zOoBHZLA-tr5Ed_bvAOLDqgGgKhqUff2A/edit).

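
The rest of this document covers each of these concepts in detail. As a quick
orientation, the `runsc trace` subcommands used throughout map to the concepts
roughly as follows (a sketch only; exact flags and arguments are covered in the
sections below):

```shell
# Inspect the available points and sinks (see the Points and Sinks sections).
$ runsc trace metadata

# Manage trace sessions in a running sandbox (see the Sessions section).
$ runsc trace create --config session.json <container-id>
$ runsc trace list <container-id>
$ runsc trace delete --name Default <container-id>
```
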
# Points

Every trace point in the system is identified by a unique name. The naming
convention is to scope the point with a main component followed by its name to
avoid conflicts. Here are a few examples:

-   `sentry/signal_delivered`
-   `container/start`
-   `syscall/openat/enter`

> Note: the syscall trace point contains an extra level to separate the
> enter/exit points.

Most of the trace points are in the `syscall` component. They come in two
flavors: raw and schematized. Raw syscalls include all syscalls in the system
and contain the 6 arguments for the given syscall. Schematized trace points
exist for many syscalls, but not all. They provide fields that are specific to
the syscall and fetch more information than is available from the raw syscall
arguments. For example, here is the schema for the open syscall:

```proto
message Open {
  gvisor.common.ContextData context_data = 1;
  Exit exit = 2;
  uint64 sysno = 3;
  int64 fd = 4;
  string fd_path = 5;
  string pathname = 6;
  uint32 flags = 7;
  uint32 mode = 8;
}
```

As you can see, some fields appear in both the raw and schematized points, like
`fd`, which is also `arg1` in the raw syscall, but here it has a proper name
and type. There are also fields like `pathname` that are not available in the
raw syscall event. In addition, `fd_path` is an optional field that can take
the `fd` and translate it into a full path for convenience. In some cases, the
same schema can be shared by many syscalls. In this example, `message Open` is
used for the `open(2)`, `openat(2)` and `creat(2)` syscalls. The `sysno` field
can be used to distinguish between them. The schema for all syscall trace
points can be found
[here](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/syscall.proto).

Other components that exist today are:

*   **sentry:** trace points fired from within gVisor's kernel
    ([schema](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/sentry.proto)).
*   **container:** container related events
    ([schema](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/container.proto)).

    93  The following command lists all trace points available in the system:
    94  
    95  ```shell
    96  $ runsc trace metadata
    97  POINTS (973)
    98  Name: container/start, optional fields: [env], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
    99  Name: sentry/clone, optional fields: [], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
   100  Name: syscall/accept/enter, optional fields: [fd_path], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
   101  ...
   102  ```
   103  
   104  > Note: the output format for `trace metadata` may change without notice.
   105  
   106  The list above also includes what optional and context fields are available for
   107  each trace point. Optional fields schema is part of the trace point proto, like
   108  `fd_path` we saw above. Context fields are set in `context_data` field of all
   109  points and is defined in
   110  [gvisor.common.ContextData](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/common.proto;bpv=1;bpt=1;l=77?gsn=ContextData&gs=kythe%3A%2F%2Fgithub.com%2Fgoogle%2Fgvisor%3Flang%3Dprotobuf%3Fpath%3Dpkg%2Fsentry%2Fseccheck%2Fpoints%2Fcommon.proto%234.2).
   111  
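
If you're looking for a specific point, filtering the output with standard
shell tools is a quick way to check which optional and context fields it
supports. A minimal sketch (remember that the output format may change without
notice):

```shell
# Show only the openat-related trace points and their available fields.
$ runsc trace metadata | grep "syscall/openat"
```
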
# Sinks

Sinks receive enabled trace points and do something useful with them. They are
identified by a unique name. The same `runsc trace metadata` command used above
also lists all sinks:

```shell
$ runsc trace metadata
...
SINKS (2)
Name: remote
Name: null

```

> Note: the output format for `trace metadata` may change without notice.

## Remote

The remote sink serializes the trace point into protobuf and sends it to a
separate process. For threat detection, an external monitoring process can
accept connections from remote sinks and receive a stream of trace points as
they occur in the system. This sink connects to a remote process via Unix
domain socket and expects the remote process to be listening for new
connections. If you're interested in creating a monitoring process that
communicates with the remote sink, [this document](sinks/remote/README.md) has
more details.

The remote sink has several properties that can be configured when it's created
(more on how to configure sinks below):

*   `endpoint` (mandatory): Unix domain socket address to connect to.
*   `retries`: number of attempts to write the trace point before dropping it
    in case the remote process is not responding. Note that a high number of
    retries can significantly delay application execution.
*   `backoff`: initial backoff time after the first failed attempt. This value
    doubles with every failed attempt, up to the max.
*   `backoff_max`: max duration to wait between retries.
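
To make this concrete, here is a sketch of a minimal session configuration that
sets these properties for the remote sink. Only the property names come from
the list above; the placement of the retry/backoff settings under `config`, the
duration format of the backoff values, and the socket path are assumptions
(sessions and their configuration are covered in the next sections):

```shell
$ cat <<EOF >session.json
{
  "trace_session": {
    "name": "Default",
    "points": [
      {
        "name": "container/start"
      }
    ],
    "sinks": [
      {
        "name": "remote",
        "config": {
          "endpoint": "/tmp/gvisor_events.sock",
          "retries": 3,
          "backoff": "100ms",
          "backoff_max": "1s"
        }
      }
    ]
  }
}
EOF
```
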

## Null

The null sink discards all trace points and is used for testing. Syscall tests
enable all trace points, with all optional and context fields, to ensure
nothing crashes with them enabled.

## Strace (not implemented)

The strace sink has not been implemented yet. It's meant to replace the strace
mechanism that exists in the Sentry, to simplify the code and add more trace
points to it.

> Note: It requires more than one trace session to be supported.

# Sessions

Trace sessions scope a set of trace points with their corresponding
configuration and a set of sinks that receive the points. Sessions can be
created at sandbox initialization time or during runtime. Creating sessions at
init time guarantees that no trace points are missed, which is important for
threat detection, and is configured using the `--pod-init-config` flag (more on
it below). To manage sessions during runtime, use the
`runsc trace create|delete|list` commands. Here are a few examples, assuming
there is a running container with ID=cont123 using Docker:

```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace create --config session.json cont123
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list cont123
SESSIONS (1)
"Default"
        Sink: "remote", dropped: 0

$ sudo runsc --root /var/run/docker/runtime-runc/moby trace delete --name Default cont123
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list cont123
SESSIONS (0)
```

> Note: There is a current limitation that only a single session can exist in
> the system and it must be called `Default`. This restriction can be lifted in
> the future when more than one session is needed.

## Config

A trace session can be defined using JSON and passed to the
`runsc trace create` command. The session definition has 3 main parts (see the
sketch after this list):

1.  `name`: name of the session being created. Only `Default` for now.
1.  `points`: array of points being enabled in the session. Each point has:
    1.  `name`: name of the trace point being enabled.
    1.  `optional_fields`: array of optional fields to include with the trace
        point.
    1.  `context_fields`: array of context fields to include with the trace
        point.
1.  `sinks`: array of sinks that will process the trace points.
    1.  `name`: name of the sink.
    1.  `config`: sink specific configuration.
    1.  `ignore_setup_error`: ignores failure to configure the sink. In the
        remote sink case, for example, it doesn't fail container startup if the
        remote process cannot be reached.
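
Putting it together, a session definition might look like the sketch below. The
point name, field names, and socket path are illustrative, and the boolean form
of `ignore_setup_error` is an assumption; the overall shape follows the list
above:

```shell
$ cat <<EOF >session.json
{
  "trace_session": {
    "name": "Default",
    "points": [
      {
        "name": "syscall/openat/enter",
        "optional_fields": [
          "fd_path"
        ],
        "context_fields": [
          "container_id",
          "process_name"
        ]
      }
    ],
    "sinks": [
      {
        "name": "remote",
        "config": {
          "endpoint": "/tmp/gvisor_events.sock"
        },
        "ignore_setup_error": true
      }
    ]
  }
}
EOF
```
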

The session configuration above can also be used with the `--pod-init-config`
flag, under the `"trace_session"` JSON object (as in the sketch above). There
is a full example
[here](https://cs.opensource.google/gvisor/gvisor/+/master:examples/seccheck/pod_init.json).

> Note: For convenience, the `--pod-init-config` file can also be used with the
> `runsc trace create` command. The portions of the Pod init config file that
> are not related to the session configuration are ignored.
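
For example, assuming the Pod init config was saved as `pod_init.json` (a
hypothetical path) and a container with ID `${CID}` is already running, the
same file can be reused to start a session at runtime:

```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace create --config pod_init.json ${CID?}
```
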

# Full Example

Here, we're going to explore how to use runtime monitoring end to end. Under
the `examples` directory there is an implementation of the monitoring process
that accepts connections from remote sinks and prints out all the trace points
it receives.

First, let's start the monitoring process and leave it running:

```shell
$ bazel run examples/seccheck:server_cc
Socket address /tmp/gvisor_events.sock
```

The server is now listening on the socket at `/tmp/gvisor_events.sock` for new
gVisor sandboxes to connect. Now let's create a session configuration file with
some trace points enabled and the remote sink using the socket address from
above:

```shell
$ cat <<EOF >session.json
{
  "trace_session": {
    "name": "Default",
    "points": [
      {
        "name": "sentry/clone"
      },
      {
        "name": "syscall/fork/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/fork/exit",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/execve/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/sysno/35/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/sysno/35/exit"
      }
    ],
    "sinks": [
      {
        "name": "remote",
        "config": {
          "endpoint": "/tmp/gvisor_events.sock"
        }
      }
    ]
  }
}
EOF
```

Now, we're ready to start a container and watch it send traces to the
monitoring process. The container we're going to create simply loops every 5
seconds and writes something to stdout. While the container is running, we're
going to call the `runsc trace` command to create a trace session.

```shell
# Start the container and copy the container ID for future reference.
$ docker run --rm --runtime=runsc -d bash -c "while true; do echo looping; sleep 5; done"
dee0da1eafc6b15abffeed1abc6ca968c6d816252ae334435de6f3871fb05e61

$ CID=dee0da1eafc6b15abffeed1abc6ca968c6d816252ae334435de6f3871fb05e61

# Create new trace session in the container above.
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace create --config session.json ${CID?}
Trace session "Default" created.
```

In the terminal where you are running the monitoring process, you'll start
seeing messages like this:

```
Connection accepted
E Fork context_data      { thread_group_id: 1 process_name: "bash" } sysno: 57
CloneInfo => created_thread_id:      110 created_thread_group_id: 110 created_thread_start_time_ns: 1660249219204031676
X Fork context_data      { thread_group_id: 1 process_name: "bash" } exit { result: 110 } sysno: 57
E Execve context_data    { thread_group_id: 110 process_name: "bash" } sysno: 59 pathname: "/bin/sleep" argv: "sleep" argv: "5"
E Syscall context_data   { thread_group_id: 110 process_name: "sleep" } sysno: 35 arg1: 139785970818200 arg2: 139785970818200
X Syscall context_data   { thread_group_id: 110 process_name: "sleep" } exit { } sysno: 35 arg1: 139785970818200 arg2: 139785970818200
```

The first message in the log is a notification that a new sandbox connected to
the monitoring process. The `E` and `X` in front of the syscall traces denote
whether the trace belongs to an `E`nter or e`X`it syscall trace. The first
syscall trace shows a call to `fork(2)` from a process with `thread_group_id`
(or PID) equal to 1 and process name `bash`. In other words, this is the init
process of the container, running `bash`, calling fork to execute `sleep 5`.
The next trace is from `sentry/clone` and reports that the forked process has
PID 110. Then, `X Fork` indicates that the `fork(2)` syscall returned to the
parent. The child continues and executes `execve(2)` to call `sleep`, as can be
seen from the `pathname` and `argv` fields. Note that at this moment the PID is
110 (the child), but the process name is still `bash` because it hasn't
executed `sleep` yet. After `execve(2)` is called, the process name changes to
`sleep` as expected. Finally, it shows the `nanosleep(2)` raw syscall traces,
which have `sysno`=35 (referred to as `syscall/sysno/35` in the configuration
file), one for enter with the exit trace arriving 5 seconds later.

Let's list all trace sessions that are active in the sandbox:

```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list ${CID?}
SESSIONS (1)
"Default"
        Sink: "remote", dropped: 0
```

It shows the `Default` session created above, using the `remote` sink, with no
trace points dropped. Once we're done, the trace session can be deleted with
the command below:

```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace delete --name Default ${CID?}
Trace session "Default" deleted.
```

In the monitoring process you should see a `Connection closed` message
informing that the sandbox has disconnected.


If you want to set up `runsc` to connect to the monitoring process automatically
before the application starts running, you can set the `--pod-init-config` flag
to the configuration file created above. Here's an example:

```shell
$ sudo runsc --install --runtime=runsc-trace -- --pod-init-config=$PWD/session.json
```
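
After `runsc --install` registers the `runsc-trace` runtime, Docker typically
needs to be restarted to pick it up, and containers can then be started with
the new runtime. A minimal sketch, assuming a systemd-managed Docker daemon and
using `hello-world` purely as an illustrative image:

```shell
# Restart Docker so it picks up the newly registered runtime.
$ sudo systemctl restart docker

# Run any container with the tracing-enabled runtime.
$ docker run --rm --runtime=runsc-trace hello-world
```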