# Capabilities Isolators Guide

This document is a walk-through guide describing how to use rkt isolators for
[Linux Capabilities][capabilities].

* [About Linux Capabilities](#about-linux-capabilities)
* [Default Capabilities](#default-capabilities)
* [Capability Isolators](#capability-isolators)
* [Configure capabilities via the command line](#configure-capabilities-via-the-command-line)
* [Configure capabilities in ACI images](#configure-capabilities-in-aci-images)
* [Capabilities when running as non-root](#capabilities-when-running-as-non-root)
* [Recommendations](#recommendations)

## About Linux Capabilities

Linux capabilities are meant to be a modern evolution of traditional UNIX
permission checks.
The goal is to split the permissions granted to privileged processes into a set
of capabilities (e.g. `CAP_NET_RAW` to open a raw socket), which can be
separately handled and assigned to single threads.

Processes can gain specific capabilities by either being run by the superuser, or by
having the setuid/setgid bits or specific file-capabilities set on their
executable file.
Once running, each process has a bounding set of capabilities which it can
enable and use; such a process cannot get further capabilities outside of this set.

In the context of containers, capabilities are useful for:

* Restricting the effective privileges of applications running as root
* Allowing applications to perform specific privileged operations, without
  having to run them as root

For the complete list of existing Linux capabilities and a detailed description
of this security mechanism, see the [capabilities(7) man page][man-capabilities].

## Default capabilities

By default, rkt enforces [a default set of capabilities][default-caps] onto applications.
This default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, e.g. invoking `reboot(2)`, are denied in this way.

However, this default set is mostly meant as a safety precaution against erratic
and misbehaving applications, and will not suffice against tailored attacks.
As such, it is recommended to fine-tune the capabilities bounding set using one
of the customizable isolators available in rkt.

## Capability Isolators

When running Linux containers, rkt provides two mutually exclusive isolators
to define the bounding set under which an application will be run:

* `os/linux/capabilities-retain-set`
* `os/linux/capabilities-remove-set`

These isolators cover different use cases and employ different techniques to
achieve the same goal of limiting available capabilities. As such, they cannot
be used together at the same time, and recommended usage varies on a
case-by-case basis.

As the granularity of capabilities varies for specific permission cases, a word
of warning is needed in order to avoid a false sense of security.
In many cases it is possible to abuse granted capabilities in order to
completely subvert the sandbox: for example, `CAP_SYS_PTRACE` allows access to
the stage1 environment and `CAP_SYS_ADMIN` grants a broad range of privileges,
effectively equivalent to root.
Many other ways to maliciously transition across capabilities have already been
[reported][grsec-forums].

### Retain-set

`os/linux/capabilities-retain-set` allows for an additive approach to
capabilities: applications will be stripped of all capabilities, except the ones
listed in this isolator.

This whitelisting approach is useful for completely locking down environments
and whenever application requirements (in terms of capabilities) are
well-defined in advance. It ensures that only the specified capabilities can
ever be used.

For example, an application that only needs to bind to port 80 as
a privileged operation will have `CAP_NET_BIND_SERVICE` as the only entry in
its "retain-set".

### Remove-set

`os/linux/capabilities-remove-set` tackles capabilities in a subtractive way:
starting from the default set of capabilities, single entries can be further
forbidden in order to prevent specific actions.

This blacklisting approach is useful for limiting applications which have
broad requirements in terms of privileged operations, in order to deny some
potentially malicious operations.

For example, an application that needs to perform multiple privileged
operations but is known to never open a raw socket will have
`CAP_NET_RAW` specified in its "remove-set".

## Configure capabilities via the command line

Capabilities can be directly overridden at run time from the command line,
without changing the executed images.
The `--caps-retain` option to `rkt run` manipulates the `retain` capabilities set.
The `--caps-remove` option manipulates the `remove` set.

Capabilities specified from the command line will replace all capability settings in the image manifest.
As stated above, the options `--caps-retain` and `--caps-remove` are mutually exclusive.
Only one can be specified at a time.
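
The mutual-exclusion rule above can be sketched as a small pre-flight check in a wrapper script. This is only an illustration: `check_caps_flags` is a hypothetical helper, not part of rkt, and it assumes the argument list describes a single application (a pod with several applications may legitimately use a different option for each one).

```shell
#!/bin/sh
# Hypothetical pre-flight check; check_caps_flags is NOT part of rkt.
# Assumption: the arguments describe a single application image.
# It fails if both --caps-retain and --caps-remove appear in the list,
# in either the "--opt VALUE" or the "--opt=VALUE" form.
check_caps_flags() {
    retain=0
    remove=0
    for arg in "$@"; do
        case "$arg" in
            --caps-retain|--caps-retain=*) retain=1 ;;
            --caps-remove|--caps-remove=*) remove=1 ;;
        esac
    done
    if [ "$retain" -eq 1 ] && [ "$remove" -eq 1 ]; then
        echo "error: --caps-retain and --caps-remove are mutually exclusive" >&2
        return 1
    fi
    return 0
}

# Possible use inside a wrapper script:
#   check_caps_flags "$@" && sudo rkt run "$@"
```

The check only inspects the arguments; it never invokes rkt itself, so it can be tried out safely on any machine.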

Capabilities isolators can be added on the command line at run time by
specifying the desired overriding set, as shown in this example:

```
$ sudo rkt run --interactive quay.io/coreos/alpine-sh --caps-retain CAP_NET_BIND_SERVICE
image: using image from file /usr/local/bin/stage1-coreos.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: permission denied (are you root?)
```

Capability sets are application-specific configuration entries, and in a
`rkt run` command line, they must follow the application container image to
which they apply.
Each application within a pod can have different capability sets.

## Configure capabilities in ACI images

Capability sets are typically defined when creating images, as they are tightly
linked to specific app requirements.

The goal of these examples is to show how to build ACIs with [`acbuild`][acbuild],
where some capabilities are either explicitly blocked or allowed.
For simplicity, the starting point will be the official Alpine Linux image from
CoreOS which ships with `ping` and `nc` commands (from busybox). These
commands require the `CAP_NET_RAW` and `CAP_NET_BIND_SERVICE` capabilities,
respectively, in order to perform privileged operations.
To block their usage, the capabilities bounding set
can be manipulated via `os/linux/capabilities-remove-set` or
`os/linux/capabilities-retain-set`; both approaches are shown here.

### Removing specific capabilities

This example shows how to block `ping` only, by removing `CAP_NET_RAW` from
the capabilities bounding set.

First, a local image is built with an explicit "remove-set" isolator.
This set contains the capabilities that need to be forbidden in order to block
`ping` usage (and only that):

```
$ acbuild begin
$ acbuild set-name localhost/caps-remove-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["CAP_NET_RAW"] }' | acbuild isolator add "os/linux/capabilities-remove-set" -
$ acbuild write caps-remove-set-example.aci
$ acbuild end
```

Once properly built, this image can be run in order to check that `ping` usage has
been effectively disabled:

```
$ sudo rkt run --interactive --insecure-options=image caps-remove-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-remove-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: permission denied (are you root?)
```

This means that `CAP_NET_RAW` has been effectively disabled inside the container.
At the same time, `CAP_NET_BIND_SERVICE` is still available in the default bounding
set, so the `nc` command will be able to bind to port 80:

```
$ sudo rkt run --interactive --insecure-options=image caps-remove-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-remove-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # nc -v -l -p 80
listening on [::]:80 ...
```

### Allowing specific capabilities

In contrast to the example above, this one shows how to allow `ping` only, by
removing all capabilities except `CAP_NET_RAW` from the bounding set.
This means that all other privileged operations, including binding to port 80,
will be blocked.

First, a local image is built with an explicit "retain-set" isolator.
This set contains the capabilities that need to be enabled in order to allow
`ping` usage (and only that):

```
$ acbuild begin
$ acbuild set-name localhost/caps-retain-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["CAP_NET_RAW"] }' | acbuild isolator add "os/linux/capabilities-retain-set" -
$ acbuild write caps-retain-set-example.aci
$ acbuild end
```

Once run, it can be easily verified that `ping` from inside the container is now
functional:

```
$ sudo rkt run --interactive --insecure-options=image caps-retain-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=41 time=24.910 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 24.910/24.910/24.910 ms
```

However, all other capabilities are no longer available to the application.
For example, using `nc` to bind to port 80 will now result in a failure due to
the missing `CAP_NET_BIND_SERVICE` capability:

```
$ sudo rkt run --interactive --insecure-options=image caps-retain-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # nc -v -l -p 80
nc: bind: Permission denied
```

### Patching images

Image manifests can be manipulated manually, by unpacking the image and editing
the manifest file, or with helper tools like [`actool`][actool].
To override an image's pre-defined capabilities set, replace the existing capabilities
isolators in the image with new isolators defining the desired capabilities.

The `patch-manifest` subcommand to `actool` manipulates the capabilities sets
defined in an image.
`actool patch-manifest --capability` changes the `retain` capabilities set.
`actool patch-manifest --revoke-capability` changes the `remove` set.
These commands take an input image, modify its existing capabilities sets, and
write the changes to an output image, as shown in the example:

```
$ actool cat-manifest caps-retain-set-example.aci
...
    "isolators": [
      {
        "name": "os/linux/capabilities-retain-set",
        "value": {
          "set": [
            "CAP_NET_RAW"
          ]
        }
      }
    ]
...

$ actool patch-manifest -capability CAP_NET_RAW,CAP_NET_BIND_SERVICE caps-retain-set-example.aci caps-retain-set-patched.aci

$ actool cat-manifest caps-retain-set-patched.aci
...
    "isolators": [
      {
        "name": "os/linux/capabilities-retain-set",
        "value": {
          "set": [
            "CAP_NET_RAW",
            "CAP_NET_BIND_SERVICE"
          ]
        }
      }
    ]
...

```

Now run the image to check that the `CAP_NET_BIND_SERVICE` capability added to
the patched image is retained as expected by using `nc` to listen on a
"privileged" port:

```
$ sudo rkt run --interactive --insecure-options=image caps-retain-set-patched.aci
image: using image from file stage1-coreos.aci
image: using image from file caps-retain-set-patched.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # nc -v -l -p 80
listening on [::]:80 ...
```

## Capabilities when running as non-root

The capability isolators (and default capabilities) mentioned in this document operate on the capability bounding set.
When running containers as non-root, capabilities are not added to the effective set of the process, which is the one the kernel will check when the app is attempting to perform a privileged operation.
This means the process won't be able to run the privileged operations enabled by the capabilities granted to the container directly.

For example, including `CAP_NET_RAW` in the retain set when running a container as non-root doesn't allow the container to run `ping` (which uses raw sockets):

```
$ sudo rkt run --interactive kinvolk.io/aci/busybox --user=1000 --group=1000 --caps-retain=CAP_NET_RAW --exec ping -- 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: permission denied (are you root?)
```

To be able to execute `ping` as a non-root user, the binary needs to have the corresponding file capability.

**Note**: running an image with file capabilities currently requires disabling seccomp in rkt.
This is due to a systemd bug where using seccomp results in enabling [no_new_privs][NNP] (you can track progress in [#3896](https://github.com/rkt/rkt/issues/3896)).
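
To see the difference between the bounding set and the effective set on a running process, the Linux kernel exposes every capability set as a hexadecimal mask in `/proc/<pid>/status`. The following is a minimal sketch, not something shipped with rkt: `show_caps` and `mask_has_cap` are hypothetical helper names, and the bit numbers used in the comments (10 for `CAP_NET_BIND_SERVICE`, 13 for `CAP_NET_RAW`) come from the kernel's `linux/capability.h` header.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; they are NOT part of rkt.

# Print the capability masks of a process (default: the current shell).
# The NoNewPrivs line is only present on kernels that expose it.
show_caps() {
    grep -E '^(Cap(Inh|Prm|Eff|Bnd|Amb)|NoNewPrivs):' "/proc/${1:-$$}/status"
}

# Succeed if the hexadecimal mask $1 has capability bit number $2 set.
# Bit numbers follow linux/capability.h, for example:
#   CAP_NET_BIND_SERVICE = 10, CAP_NET_RAW = 13
mask_has_cap() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Possible use: check whether CAP_NET_RAW is in the bounding set.
#   bnd=$(show_caps | awk '/^CapBnd:/ { print $2 }')
#   mask_has_cap "$bnd" 13 && echo "CAP_NET_RAW is in the bounding set"
```

Run inside a container started as a non-root user with `--caps-retain=CAP_NET_RAW`, one would expect `CapBnd` to have bit 13 set while `CapEff` stays empty, which matches the `ping` failure shown above.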

### Building images with file capabilities

Building images that include files with file capabilities is challenging since:

* [build][acbuild] doesn't preserve file capabilities. See [containers/build#197](https://github.com/containers/build/issues/197).
* docker build doesn't preserve file capabilities. See [moby/moby#35699](https://github.com/moby/moby/issues/35699).

However, provided we have an ACI (created for example with [build][acbuild] or with [docker2aci][docker2aci]), we can extract it and add the file capabilities manually.

#### Example

We'll use build to create an Ubuntu ACI with ping installed:

```bash
#!/usr/bin/env bash

acbuild --debug begin docker://ubuntu
acbuild --debug set-name example.com/filecap

# Install ping
acbuild --debug run -- apt-get update
acbuild --debug run -- apt-get install -y inetutils-ping

# ping comes as a setuid file, we don't want that
acbuild --debug run -- chmod -s /bin/ping

acbuild --debug set-exec /bin/bash

acbuild --debug write --overwrite filecaps.aci
```

After running the script, we'll extract the ACI, add file capabilities, and rebuild it.

```
$ sudo tar -xf filecaps.aci
$ sudo setcap cap_net_raw+ep rootfs/bin/ping
$ sudo tar --xattrs -cf filecaps-mod.aci manifest rootfs
```

Now we can run rkt with that image as a non-root user and with the right capability, and `ping` should work:

```
$ sudo rkt --insecure-options=image,seccomp run --interactive filecaps-mod.aci --user=1000 --group=1000 --caps-retain=cap_net_raw
groups: cannot find name for group ID 1000
bash: /root/.bashrc: Permission denied
I have no name!@rkt-4a9e66a0-6ee0-496f-8b7a-ab259362cba7:/$ ls -l /bin/ping
-rwxr-xr-x 1 root root 70680 Feb 6 2016 /bin/ping
I have no name!@rkt-4a9e66a0-6ee0-496f-8b7a-ab259362cba7:/$ getcap /bin/ping
/bin/ping = cap_net_raw+ep
I have no name!@rkt-4a9e66a0-6ee0-496f-8b7a-ab259362cba7:/$ ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: icmp_seq=0 ttl=52 time=21.859 ms
64 bytes from 8.8.8.8: icmp_seq=1 ttl=52 time=20.298 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=52 time=25.207 ms
^C--- 8.8.8.8 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max/stddev = 20.298/22.455/25.207/2.048 ms
```

### Ambient capabilities

Linux offers a way to give capabilities to non-root processes without needing file capabilities to perform the privileged task: ambient capabilities.

When a capability is added to the ambient set, it will be preserved in the effective set of the process executed in the container.
However, this is currently not implemented in rkt.

For more information about ambient capabilities, check [capabilities(7)][man-capabilities].

## Recommendations

As with most security features, capability isolators may require some
application-specific tuning in order to be maximally effective.
For this reason,
for security-sensitive environments it is recommended to have a well-specified
set of capability requirements and follow best practices:

 1. Always follow the principle of least privilege and, whenever possible,
    avoid running applications as root.
 2. Only grant the minimum set of capabilities needed by an application,
    according to its typical usage.
 3. Avoid granting overly generic capabilities. For example, `CAP_SYS_ADMIN` and
    `CAP_SYS_PTRACE` are typically bad choices, as they open large attack
    surfaces.
 4. Prefer a whitelisting approach, trying to keep the "retain-set" as small as
    possible.

[acbuild]: https://github.com/containers/build
[docker2aci]: https://github.com/appc/docker2aci
[actool]: https://github.com/appc/spec#building-acis
[capabilities]: https://lwn.net/Kernel/Index/#Capabilities
[default-caps]: https://github.com/appc/spec/blob/master/spec/ace.md#oslinuxcapabilities-remove-set
[grsec-forums]: https://forums.grsecurity.net/viewtopic.php?f=7&t=2522
[man-capabilities]: http://man7.org/linux/man-pages/man7/capabilities.7.html
[NNP]: https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt