# Seccomp Isolators Guide

This document is a walk-through guide describing how to use rkt isolators for
[Linux seccomp filtering][lwn-seccomp].

* [About Seccomp](#about-seccomp)
* [Predefined Seccomp Filters](#predefined-seccomp-filters)
* [Seccomp Isolators](#seccomp-isolators)
* [Usage Example](#usage-example)
* [Overriding Seccomp Filters](#overriding-seccomp-filters)
* [Recommendations](#recommendations)

## About seccomp

Linux seccomp (short for SECure COMputing) filtering allows one to specify which
system calls a process should be allowed to invoke, reducing the kernel surface
exposed to applications.
This provides a clearly defined mechanism to build sandboxed environments, where
processes can run having access only to a specific reduced set of system calls.

In the context of containers, seccomp filtering is useful for:

* Restricting applications from invoking syscalls that can affect the host
* Reducing kernel attack surface in case of security bugs

For more details on how Linux seccomp filtering works, see
[seccomp(2)][man-seccomp].

## Predefined seccomp filters

By default, rkt comes with a set of predefined filtering groups that can be
used to quickly build sandboxed environments for containerized applications.
Each set is simply a reference to a group of syscalls, covering a single
functional area or kernel subsystem. They can be further combined to
build more complex filters, either by blacklisting or by whitelisting specific
system calls. To distinguish these predefined groups from real syscall names,
wildcard labels are prefixed with an `@` symbol and are namespaced.

The App Container Spec (appc) defines
[two groups][appc-isolators]:

* `@appc.io/all` represents the set of all available syscalls.
* `@appc.io/empty` represents the empty set.

rkt provides two default groups for generic usage:

* `@rkt/default-blacklist` represents a broad-scope filter that can be used for generic blacklisting
* `@rkt/default-whitelist` represents a broad-scope filter that can be used for generic whitelisting

For compatibility reasons, two groups are provided mirroring [default Docker profiles][docker-seccomp]:

* `@docker/default-blacklist`
* `@docker/default-whitelist`

When using stage1 images with systemd >= v231, some
[predefined groups][systemd-seccomp]
are also available:

* `@systemd/clock` for syscalls manipulating the system clock
* `@systemd/default-whitelist` for a generic set of typically whitelisted syscalls
* `@systemd/mount` for filesystem mounting and unmounting
* `@systemd/network-io` for socket I/O operations
* `@systemd/obsolete` for unusual, obsolete or unimplemented syscalls
* `@systemd/privileged` for syscalls which need super-user capabilities
* `@systemd/process` for syscalls acting on process control, execution and namespacing
* `@systemd/raw-io` for raw I/O port access

When no seccomp filtering is specified, by default rkt whitelists all the generic
syscalls typically needed by applications for common operations. This is
the same set defined by `@rkt/default-whitelist`.

The default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, e.g. invoking [`umount(2)`][man-umount], are denied in this way.

However, this default set is mostly meant as a safety precaution against erratic
and misbehaving applications, and will not suffice against tailored attacks.
As such, it is recommended to fine-tune seccomp filtering using one of the
customizable isolators available in rkt.
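
To illustrate the naming convention used by the predefined groups, the
following sketch separates group labels from plain syscall names in a list of
filter entries. The `classify_entry` function is a hypothetical helper and not
part of rkt; it only inspects the `@` prefix and the namespace separator, and
it assumes that every group label has the form `@namespace/name`. It does not
check that a group or syscall actually exists.

```shell
#!/bin/sh
# Hypothetical helper (not part of rkt): tell predefined groups apart from
# plain syscall names by looking at the leading "@" and the "/" separator.
classify_entry() {
    case "$1" in
        @*/*) echo "group" ;;    # namespaced group, e.g. @rkt/default-whitelist
        @*)   echo "invalid" ;;  # "@" prefix without a namespace (assumed invalid)
        "")   echo "invalid" ;;  # empty entry
        *)    echo "syscall" ;;  # anything else is treated as a syscall name
    esac
}

# Classify a few entries taken from this guide.
for entry in @rkt/default-blacklist @systemd/mount socket umount2; do
    printf '%s\t%s\n' "$entry" "$(classify_entry "$entry")"
done
```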

## Seccomp Isolators

When running Linux containers, rkt provides two mutually exclusive isolators
to define a seccomp filter for an application:

* `os/linux/seccomp-retain-set`
* `os/linux/seccomp-remove-set`

Those isolators cover different use-cases and employ different techniques to
achieve the same goal of limiting available syscalls. As such, they cannot
be used together at the same time, and recommended usage varies on a
case-by-case basis.

### Operation mode

Seccomp isolators work by defining a set of syscalls that can be either blocked
("remove-set") or allowed ("retain-set"). Once an application tries to invoke
a blocked syscall, the kernel will deny this operation and the application will
be notified about the failure.

By default, invoking blocked syscalls will result in the application being
immediately terminated with a `SIGSYS` signal. This behavior can be tweaked by
returning a specific error code ("errno") to the application instead of
terminating it.

For both isolators, this can be customized by specifying an additional `errno`
parameter with the desired symbolic errno name. For a list of errno labels, check
the [reference][man-errno] at `man 3 errno`.

### Retain-set

`os/linux/seccomp-retain-set` allows for an additive approach to build a seccomp
filter: applications will not be able to use any syscalls, except the ones
listed in this isolator.

This whitelisting approach is useful for completely locking down environments
and whenever application requirements (in terms of syscalls) are
well-defined in advance. It allows one to ensure that exactly and only the
specified syscalls could ever be used.

For example, the "retain-set" for a typical network application will include
entries for generic POSIX operations (available in `@systemd/default-whitelist`),
socket operations (`@systemd/network-io`) and reacting to I/O
events (`@systemd/io-event`).

### Remove-set

`os/linux/seccomp-remove-set` tackles syscalls in a subtractive way:
starting from all available syscalls, single entries can be forbidden in order
to prevent specific actions.

This blacklisting approach is useful to limit applications which have
broad requirements in terms of syscalls, in order to deny access to some clearly
unused but potentially exploitable syscalls.

For example, an application that will need to perform multiple operations but is
known to never touch mountpoints could have `@systemd/mount` specified in its
"remove-set".

## Usage Example

The goal of these examples is to show how to build ACI images with [`acbuild`][acbuild],
where some syscalls are either explicitly blocked or allowed.
For simplicity, the starting point will be a bare Alpine Linux image which
ships with `ping` and `umount` commands (from busybox). Those
commands respectively require the [`socket(2)`][man-socket] and [`umount(2)`][man-umount] syscalls in order to
perform privileged operations.
To block their usage, a syscall filter can be installed via
`os/linux/seccomp-remove-set` or `os/linux/seccomp-retain-set`; both approaches
are shown here.

### Blacklisting specific syscalls

This example shows how to block socket operations (e.g. with `ping`), by removing
`socket()` from the set of allowed syscalls.

First, a local image is built with an explicit "remove-set" isolator.
This set contains the syscalls that need to be forbidden in order to block
socket setup:

```
$ acbuild begin
$ acbuild set-name localhost/seccomp-remove-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["@rkt/default-blacklist", "socket"] }' | acbuild isolator add "os/linux/seccomp-remove-set" -
$ acbuild write seccomp-remove-set-example.aci
$ acbuild end
```

Once properly built, this image can be run in order to check that `ping` usage is
now blocked by the seccomp filter. At the same time, the default blacklist will
also block other dangerous syscalls like `umount(2)`:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-remove-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file seccomp-remove-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
Bad system call

/ # umount /proc/bus/
Bad system call
```

This means that `socket(2)` and `umount(2)` have both been effectively disabled
inside the container.

### Allowing specific syscalls

In contrast to the example above, this one shows how to allow some operations
only (e.g. network communication via `ping`), by whitelisting all required
syscalls. This means that syscalls outside of this set will be blocked.

First, a local image is built with an explicit "retain-set" isolator.
This set contains the rkt wildcard "default-whitelist" (which already provides
all socket-related entries), plus some custom syscalls (e.g.
`umount(2)`) which
are typically not allowed:

```
$ acbuild begin
$ acbuild set-name localhost/seccomp-retain-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["@rkt/default-whitelist", "umount", "umount2"] }' | acbuild isolator add "os/linux/seccomp-retain-set" -
$ acbuild write seccomp-retain-set-example.aci
$ acbuild end
```

Once run, it can be easily verified that both `ping` and `umount` are now
functional inside the container. These operations also require [additional
capabilities][capabilities-guide] to be retained in order to work:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-retain-set-example.aci --caps-retain=CAP_SYS_ADMIN,CAP_NET_RAW
image: using image from file stage1-coreos.aci
image: using image from file seccomp-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=41 time=24.910 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 24.910/24.910/24.910 ms

/ # mount | grep /proc/bus
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
/ # umount /proc/bus
/ # mount | grep /proc/bus
```

However, other syscalls are still not available to the application.
For example, trying to set the time will result in a failure due to invoking
non-whitelisted syscalls:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-retain-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file seccomp-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # adjtimex -f 0
Bad system call
```

## Overriding Seccomp Filters

Seccomp filters are typically defined when creating images, as they are tightly
linked to specific app requirements. However, image consumers may need to further
tweak or restrict the set of available syscalls in specific local scenarios.
This can be done either by permanently patching the manifest of specific images,
or by overriding seccomp isolators with command line options.

### Patching images

Image manifests can be manipulated manually, by unpacking the image and editing
the manifest file, or with helper tools like [`actool`][actool].
To override an image's pre-defined syscall set, just replace the existing seccomp
isolators in the image with new isolators defining the desired syscalls.

The `patch-manifest` subcommand of `actool` manipulates the syscall sets
defined in an image.
The `-seccomp-mode=...` and `-seccomp-set=...` options of `actool patch-manifest`
can be used together to override any seccomp filters by specifying a new mode
(retain or remove), an optional custom errno, and a set of syscalls to filter.
These commands take an input image, modify any existing seccomp isolators, and
write the changes to an output image, as shown in the example:

```
$ actool cat-manifest seccomp-retain-set-example.aci
...
    "isolators": [
      {
        "name": "os/linux/seccomp-retain-set",
        "value": {
          "set": [
            "@rkt/default-whitelist",
            "umount",
            "umount2"
          ]
        }
      }
    ]
...

$ actool patch-manifest -seccomp-mode=retain,errno=ENOSYS -seccomp-set=@rkt/default-whitelist seccomp-retain-set-example.aci seccomp-retain-set-patched.aci

$ actool cat-manifest seccomp-retain-set-patched.aci
...
    "isolators": [
      {
        "name": "os/linux/seccomp-retain-set",
        "value": {
          "set": [
            "@rkt/default-whitelist"
          ],
          "errno": "ENOSYS"
        }
      }
    ]
...
```

Now run the image to verify that the `umount(2)` syscall is no longer allowed,
and a custom error is returned:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-retain-set-patched.aci
image: using image from file stage1-coreos.aci
image: using image from file seccomp-retain-set-patched.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # mount | grep /proc/bus
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
/ # umount /proc/bus/
umount: can't umount /proc/bus: Function not implemented
```

### Overriding seccomp filters at run-time

Seccomp filters can be directly overridden at run time from the command line,
without changing the executed images.
The `--seccomp` option to `rkt run` can manipulate both the "retain" and the
"remove" isolators.

Isolators overridden from the command line will replace all seccomp settings in
the image manifest, and can be specified as shown in this example:

```
$ sudo rkt run --interactive quay.io/coreos/alpine-sh --seccomp mode=remove,errno=ENOTSUP,socket
image: using image from file /usr/local/bin/stage1-coreos.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: can't create raw socket: Not supported
```

Seccomp isolators are application-specific configuration entries, and in a
`rkt run` command line they **must follow the application container image to
which they apply**.
Each application within a pod can have different seccomp filters.

## Recommendations

As with most security features, seccomp isolators may require some
application-specific tuning in order to be maximally effective. For this reason,
for security-sensitive environments it is recommended to have a well-specified
set of syscall requirements and follow best practices:

 1. Only allow syscalls needed by an application, according to its typical usage.
 2. While it is possible to completely disable seccomp, it is rarely needed and
    should be generally avoided. Tweaking the syscall set is a better approach
    instead.
 3. Avoid granting access to dangerous syscalls. For example, [`mount(2)`][man-mount] and
    [`ptrace(2)`][man-ptrace] are typically abused to escape containers.
 4. Prefer a whitelisting approach, trying to keep the "retain-set" as small as
    possible.
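
As a sketch of recommendations 1 and 4, the snippet below assembles the JSON
value for a small "retain-set" isolator from an explicit list of entries.
The `seccomp_value` function is a hypothetical helper, not part of rkt or
acbuild. It assumes the value has the same `set` and optional `errno` keys as
the manifests shown in this guide, and it does no JSON escaping, so entries
must be plain syscall or group names.

```shell
#!/bin/sh
# Hypothetical helper (not part of rkt or acbuild): print the JSON value for
# a seccomp isolator.
# Usage: seccomp_value ERRNO ENTRY...   (pass "" as ERRNO to omit it)
seccomp_value() {
    errno=$1
    shift
    set_json=""
    for entry in "$@"; do
        # Append each entry as a quoted string, comma-separated.
        set_json="${set_json:+$set_json, }\"$entry\""
    done
    if [ -n "$errno" ]; then
        printf '{ "set": [%s], "errno": "%s" }\n' "$set_json" "$errno"
    else
        printf '{ "set": [%s] }\n' "$set_json"
    fi
}

# A narrow whitelist for a network application, returning ENOSYS on denial.
seccomp_value ENOSYS @systemd/default-whitelist @systemd/network-io
```

The printed value could then be piped into
`acbuild isolator add "os/linux/seccomp-retain-set" -`, in the same way as the
usage examples. Whether acbuild accepts the `errno` key through this path is
an assumption based on the manifest format and should be verified.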


[acbuild]: https://github.com/containers/build
[actool]: https://github.com/appc/spec#building-acis
[appc-isolators]: https://github.com/appc/spec/blob/master/spec/ace.md#linux-isolators
[capabilities-guide]: capabilities-guide.md
[docker-seccomp]: https://docs.docker.com/engine/security/seccomp/
[lwn-seccomp]: https://lwn.net/Articles/656307/
[man-errno]: http://man7.org/linux/man-pages/man3/errno.3.html
[man-mount]: http://man7.org/linux/man-pages/man2/mount.2.html
[man-ptrace]: http://man7.org/linux/man-pages/man2/ptrace.2.html
[man-seccomp]: http://man7.org/linux/man-pages/man2/seccomp.2.html
[man-socket]: http://man7.org/linux/man-pages/man2/socket.2.html
[man-umount]: http://man7.org/linux/man-pages/man2/umount.2.html
[systemd-seccomp]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=