# Introduction

This package provides a remote interface to observe the behavior of the
application running inside the sandbox. It was built with runtime monitoring in
mind, e.g. threat detection, but it can be used for other purposes as well. It
allows a process running outside the sandbox to receive a stream of trace data
asynchronously. This process can watch actions performed by the application,
generate alerts when something unexpected occurs, log these actions, etc.

First, let's go over a few concepts before we get into the details.

## Concepts

-   **Points:** these are discrete places (or points) in the code where
    instrumentation was added. Each point has a unique name and schema. They
    can be individually enabled/disabled. For example, `container/start` is a
    point that is fired when a new container starts.
-   **Point fields:** each point may contain fields that carry point data. For
    example, `container/start` has an `id` field with the ID of the container
    that is being started.
-   **Optional fields:** each point may also have optional fields. By default
    these fields are not collected, and they can be manually set to be
    collected when the point is configured. These fields are normally more
    expensive to collect and/or large, e.g. resolving the path for an FD, or
    the data for read/write calls.
-   **Context fields:** these are fields generally available to most events,
    but they are disabled by default. Like optional fields, they can be set to
    be collected when the point is configured. Context field data comes from
    the context where the point is being fired; for example, PID, UID/GID, and
    container ID are fields available to most trace points.
-   **Sink:** sinks are trace point consumers. Each sink is identified by a
    name and may handle trace points differently. Later we'll describe in more
    detail what sinks are available in the system and how to use them.
-   **Session:** a trace session is a set of points that are enabled with
    their corresponding configuration. A trace session also has a list of
    sinks that will receive the trace points. A session is identified by a
    unique name. Once a session is deleted, all points belonging to the
    session are disabled and the sinks destroyed.

If you're interested in exploring further, there are more details in the
[design doc](https://docs.google.com/document/d/1RQQKzeFpO-zOoBHZLA-tr5Ed_bvAOLDqgGgKhqUff2A/edit).

# Points

Every trace point in the system is identified by a unique name. The naming
convention is to scope the point with a main component followed by its name to
avoid conflicts. Here are a few examples:

-   `sentry/signal_delivered`
-   `container/start`
-   `syscall/openat/enter`

> Note: the syscall trace point contains an extra level to separate the
> enter/exit points.

Most of the trace points are in the `syscall` component. They come in 2
flavors: raw and schematized. Raw syscall points include all syscalls in the
system and contain the 6 arguments for the given syscall. Schematized trace
points exist for many syscalls, but not all. They provide fields that are
specific to the syscall and fetch more information than is available from the
raw syscall arguments. For example, here is the schema for the open syscall:

```proto
message Open {
  gvisor.common.ContextData context_data = 1;
  Exit exit = 2;
  uint64 sysno = 3;
  int64 fd = 4;
  string fd_path = 5;
  string pathname = 6;
  uint32 flags = 7;
  uint32 mode = 8;
}
```

As you can see, some fields exist in both raw and schematized points, like
`fd`, which is also `arg1` in the raw syscall, but here it has a name and the
correct type. There are also fields like `pathname` that are not available in
the raw syscall event. In addition, `fd_path` is an optional field that takes
the `fd` and translates it into a full path for convenience. In some cases, the
same schema can be shared by many syscalls. In this example, `message Open` is
used for the `open(2)`, `openat(2)` and `creat(2)` syscalls. The `sysno` field
can be used to distinguish between them. The schema for all syscall trace
points can be found
[here](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/syscall.proto).
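To make this concrete, here is how a single point entry could look inside a
trace session configuration (the full configuration format is described in the
Sessions section below). This is only a sketch: it enables the schematized
`syscall/openat/enter` point, requests the optional `fd_path` field, and adds
two of the context fields listed later in this document.

```json
{
  "name": "syscall/openat/enter",
  "optional_fields": ["fd_path"],
  "context_fields": ["container_id", "process_name"]
}
```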
Other components that exist today are:

*   **sentry:** trace points fired from within gVisor's kernel
    ([schema](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/sentry.proto)).
*   **container:** container related events
    ([schema](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/container.proto)).

The following command lists all trace points available in the system:

```shell
$ runsc trace metadata
POINTS (973)
Name: container/start, optional fields: [env], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
Name: sentry/clone, optional fields: [], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
Name: syscall/accept/enter, optional fields: [fd_path], context fields: [time|thread_id|task_start_time|group_id|thread_group_start_time|container_id|credentials|cwd|process_name]
...
```

> Note: the output format for `trace metadata` may change without notice.

The list above also includes which optional and context fields are available
for each trace point. The schema for optional fields is part of the trace
point proto, like the `fd_path` field we saw above. Context fields are set in
the `context_data` field of all points and are defined in
[gvisor.common.ContextData](https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sentry/seccheck/points/common.proto;bpv=1;bpt=1;l=77?gsn=ContextData&gs=kythe%3A%2F%2Fgithub.com%2Fgoogle%2Fgvisor%3Flang%3Dprotobuf%3Fpath%3Dpkg%2Fsentry%2Fseccheck%2Fpoints%2Fcommon.proto%234.2).

# Sinks

Sinks receive enabled trace points and do something useful with them. They are
identified by a unique name. The same `runsc trace metadata` command used above
also lists all sinks:

```shell
$ runsc trace metadata
...
SINKS (2)
Name: remote
Name: null

```

> Note: the output format for `trace metadata` may change without notice.

## Remote

The remote sink serializes the trace point into protobuf and sends it to a
separate process. For threat detection, external monitoring processes can
receive connections from remote sinks and be sent a stream of trace points that
are occurring in the system. This sink connects to a remote process via a Unix
domain socket and expects the remote process to be listening for new
connections. If you're interested in creating a monitoring process that
communicates with the remote sink, [this document](sinks/remote/README.md) has
more details.

The remote sink has a few properties that can be configured when it's created
(more on how to configure sinks below; a sketch follows this list):

*   `endpoint` (mandatory): Unix domain socket address to connect to.
*   `retries`: number of attempts to write the trace point before dropping it
    in case the remote process is not responding. Note that a high number of
    retries can significantly delay application execution.
*   `backoff`: initial backoff time after the first failed attempt. This value
    doubles with every failed attempt, up to the max.
*   `backoff_max`: max duration to wait between retries.
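As an illustration, a remote sink entry in a session configuration could look
like the following. Only `endpoint` is taken from the example later in this
document; the retry and backoff values, and their formats, are assumptions
shown purely for illustration, so check the sink documentation before relying
on them.

```json
{
  "name": "remote",
  "config": {
    "endpoint": "/tmp/gvisor_events.sock",
    "retries": 3,
    "backoff": "100ms",
    "backoff_max": "1s"
  }
}
```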
## Null

The null sink does nothing with the trace points and is used for testing.
Syscall tests enable all trace points, with all optional and context fields, to
ensure nothing crashes with them enabled.

## Strace (not implemented)

The strace sink has not been implemented yet. It's meant to replace the strace
mechanism that exists in the Sentry, to simplify the code and add more trace
points to it.

> Note: it requires support for more than one trace session.

# Sessions

Trace sessions scope a set of trace points with their corresponding
configuration and a set of sinks that receive the points. Sessions can be
created at sandbox initialization time or during runtime. Creating sessions at
init time guarantees that no trace points are missed, which is important for
threat detection. This is configured using the `--pod-init-config` flag (more
on it below). To manage sessions during runtime, the
`runsc trace create|delete|list` commands are used to manipulate trace
sessions. Here are a few examples, assuming there is a running container with
ID=cont123 using Docker:

```shell
$ sudo runsc --root /run/docker/runtime-runc/moby trace create --config session.json cont123
$ sudo runsc --root /run/docker/runtime-runc/moby trace list cont123
SESSIONS (1)
"Default"
	Sink: "remote", dropped: 0

$ sudo runsc --root /var/run/docker/runtime-runc/moby trace delete --name Default cont123
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list cont123
SESSIONS (0)
```

> Note: there is a current limitation that only a single session can exist in
> the system and it must be called `Default`. This restriction may be lifted in
> the future when more than one session is needed.

## Config

A trace session can be defined using JSON for the `runsc trace create` command.
The session definition has 3 main parts (a minimal example follows this list):

1.  `name`: name of the session being created. Only `Default` for now.
1.  `points`: array of points being enabled in the session. Each point has:
    1.  `name`: name of the trace point being enabled.
    1.  `optional_fields`: array of optional fields to include with the trace
        point.
    1.  `context_fields`: array of context fields to include with the trace
        point.
1.  `sinks`: array of sinks that will process the trace points.
    1.  `name`: name of the sink.
    1.  `config`: sink specific configuration.
    1.  `ignore_setup_error`: ignores failures to configure the sink. In the
        remote sink case, for example, it prevents a failure to reach the
        remote process from failing container startup.
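Putting those parts together, a minimal session definition could look like the
sketch below. It assumes only the point, field, and sink names shown in the
`runsc trace metadata` output above; a complete, working configuration is built
step by step in the Full Example section.

```json
{
  "name": "Default",
  "points": [
    {
      "name": "container/start",
      "optional_fields": ["env"],
      "context_fields": ["container_id", "process_name"]
    }
  ],
  "sinks": [
    {
      "name": "remote",
      "config": {
        "endpoint": "/tmp/gvisor_events.sock"
      },
      "ignore_setup_error": true
    }
  ]
}
```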
The session configuration above can also be used with the `--pod-init-config`
flag, under the `"trace_session"` JSON object. There is a full example
[here](https://cs.opensource.google/gvisor/gvisor/+/master:examples/seccheck/pod_init.json).

> Note: for convenience, a `--pod-init-config` file can also be used with the
> `runsc trace create` command. The portions of the pod init config file that
> are not related to the session configuration are ignored.

# Full Example

Here, we're going to explore how to use runtime monitoring end to end. Under
the `examples` directory there is an implementation of the monitoring process
that accepts connections from remote sinks and prints out all the trace points
it receives.

First, let's start the monitoring process and leave it running:

```shell
$ bazel run examples/seccheck:server_cc
Socket address /tmp/gvisor_events.sock
```

The server is now listening on the socket at `/tmp/gvisor_events.sock` for new
gVisor sandboxes to connect. Now let's create a session configuration file with
some trace points enabled and the remote sink using the socket address from
above:

```shell
$ cat <<EOF >session.json
{
  "trace_session": {
    "name": "Default",
    "points": [
      {
        "name": "sentry/clone"
      },
      {
        "name": "syscall/fork/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/fork/exit",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/execve/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/sysno/35/enter",
        "context_fields": [
          "group_id",
          "process_name"
        ]
      },
      {
        "name": "syscall/sysno/35/exit"
      }
    ],
    "sinks": [
      {
        "name": "remote",
        "config": {
          "endpoint": "/tmp/gvisor_events.sock"
        }
      }
    ]
  }
}
EOF
```

Now we're ready to start a container and watch it send traces to the monitoring
process. The container we're going to create simply loops every 5 seconds and
writes something to stdout. While the container is running, we're going to call
the `runsc trace` command to create a trace session.

```shell
# Start the container and copy the container ID for future reference.
$ docker run --rm --runtime=runsc -d bash -c "while true; do echo looping; sleep 5; done"
dee0da1eafc6b15abffeed1abc6ca968c6d816252ae334435de6f3871fb05e61

$ CID=dee0da1eafc6b15abffeed1abc6ca968c6d816252ae334435de6f3871fb05e61

# Create a new trace session in the container above.
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace create --config session.json ${CID?}
Trace session "Default" created.
```
In the terminal where you are running the monitoring process, you'll start
seeing messages like this:

```
Connection accepted
E Fork context_data { thread_group_id: 1 process_name: "bash" } sysno: 57
CloneInfo => created_thread_id: 110 created_thread_group_id: 110 created_thread_start_time_ns: 1660249219204031676
X Fork context_data { thread_group_id: 1 process_name: "bash" } exit { result: 110 } sysno: 57
E Execve context_data { thread_group_id: 110 process_name: "bash" } sysno: 59 pathname: "/bin/sleep" argv: "sleep" argv: "5"
E Syscall context_data { thread_group_id: 110 process_name: "sleep" } sysno: 35 arg1: 139785970818200 arg2: 139785970818200
X Syscall context_data { thread_group_id: 110 process_name: "sleep" } exit { } sysno: 35 arg1: 139785970818200 arg2: 139785970818200
```

The first message in the log is a notification that a new sandbox connected to
the monitoring process. The `E` and `X` in front of the syscall traces denote
whether the trace belongs to an `E`nter or e`X`it syscall trace. The first
syscall trace shows a call to `fork(2)` from a process with `thread_group_id`
(or PID) equal to 1 and process name `bash`. In other words, this is the init
process of the container, running `bash`, calling fork to execute `sleep 5`.
The next trace is from `sentry/clone` and informs that the forked process has
PID=110. Then `X Fork` indicates that the `fork(2)` syscall returned to the
parent. The child continues and executes `execve(2)` to call `sleep`, as can be
seen from the `pathname` and `argv` fields. Note that at this moment the PID is
110 (the child) but the process name is still `bash` because it hasn't executed
`sleep` yet. After `execve(2)` is called, the process name changes to `sleep`
as expected. Next, it shows the `nanosleep(2)` raw syscall traces, which have
`sysno`=35 (referred to as `syscall/sysno/35` in the configuration file), one
for enter with the exit trace happening 5 seconds later.

Let's list all trace sessions that are active in the sandbox:

```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace list ${CID?}
SESSIONS (1)
"Default"
	Sink: "remote", dropped: 0
```

It shows the `Default` session created above, using the `remote` sink, and that
no trace points have been dropped. Once we're done, the trace session can be
deleted with the command below:

```shell
$ sudo runsc --root /var/run/docker/runtime-runc/moby trace delete --name Default ${CID?}
Trace session "Default" deleted.
```

In the monitoring process, you should see a `Connection closed` message
indicating that the sandbox has disconnected.

If you want to set up `runsc` to connect to the monitoring process
automatically before the application starts running, you can set the
`--pod-init-config` flag to the configuration file created above. Here's an
example:

```shell
$ sudo runsc --install --runtime=runsc-trace -- --pod-init-config=$PWD/session.json
```
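The command above registers a new Docker runtime called `runsc-trace`; restart
the Docker daemon for it to take effect. The resulting entry in
`/etc/docker/daemon.json` should look roughly like the sketch below, where the
`runsc` binary path and the location of the session configuration file are
placeholders that depend on your installation. Containers started with
`--runtime=runsc-trace` will then create the trace session and connect to the
monitoring process automatically at startup.

```json
{
  "runtimes": {
    "runsc-trace": {
      "path": "/usr/local/bin/runsc",
      "runtimeArgs": [
        "--pod-init-config=/home/user/session.json"
      ]
    }
  }
}
```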