     1  # Capabilities Isolators Guide
     2  
     3  This document is a walk-through guide describing how to use rkt isolators for
     4  [Linux Capabilities][capabilities].
     5  
     6  * [About Linux Capabilities](#about-linux-capabilities)
     7  * [Default Capabilities](#default-capabilities)
     8  * [Capability Isolators](#capability-isolators)
     9  * [Configure capabilities via the command line](#configure-capabilities-via-the-command-line)
    10  * [Configure capabilities in ACI images](#configure-capabilities-in-aci-images)
    11  * [Capabilities when running as non-root](#capabilities-when-running-as-non-root)
    12  * [Recommendations](#recommendations)
    13  
    14  ## About Linux Capabilities
    15  
Linux capabilities are meant to be a modern evolution of traditional UNIX
permission checks.
The goal is to split the permissions granted to privileged processes into a set
of capabilities (e.g. `CAP_NET_RAW` to open a raw socket), which can be
handled separately and assigned to individual threads.
    21  
Processes can gain specific capabilities either by being run by the superuser,
or by having the setuid/setgid bits or specific file capabilities set on their
executable file.
Once running, each process has a bounding set of capabilities which it can
enable and use; such a process cannot gain further capabilities outside of this set.
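
The kernel exposes the bounding set of a process as a hexadecimal bit mask on the `CapBnd` line of `/proc/<pid>/status`. The following is a minimal sketch for turning such a mask into capability names. The helper name `decode_caps` and the `CAP_NAMES` variable are made up for this illustration; the bit order is assumed to follow the kernel's `linux/capability.h`.

```bash
# Capability names indexed by bit number (assumed to match linux/capability.h).
CAP_NAMES="CAP_CHOWN CAP_DAC_OVERRIDE CAP_DAC_READ_SEARCH CAP_FOWNER CAP_FSETID
CAP_KILL CAP_SETGID CAP_SETUID CAP_SETPCAP CAP_LINUX_IMMUTABLE
CAP_NET_BIND_SERVICE CAP_NET_BROADCAST CAP_NET_ADMIN CAP_NET_RAW CAP_IPC_LOCK
CAP_IPC_OWNER CAP_SYS_MODULE CAP_SYS_RAWIO CAP_SYS_CHROOT CAP_SYS_PTRACE
CAP_SYS_PACCT CAP_SYS_ADMIN CAP_SYS_BOOT CAP_SYS_NICE CAP_SYS_RESOURCE
CAP_SYS_TIME CAP_SYS_TTY_CONFIG CAP_MKNOD CAP_LEASE CAP_AUDIT_WRITE
CAP_AUDIT_CONTROL CAP_SETFCAP CAP_MAC_OVERRIDE CAP_MAC_ADMIN CAP_SYSLOG
CAP_WAKE_ALARM CAP_BLOCK_SUSPEND CAP_AUDIT_READ CAP_PERFMON CAP_BPF
CAP_CHECKPOINT_RESTORE"

# decode_caps MASK
# Prints the name of every capability whose bit is set in the hexadecimal
# mask MASK (the format used by the Cap* lines of /proc/<pid>/status).
decode_caps() {
    local bit name
    bit=0
    for name in $CAP_NAMES; do
        # Shift the mask right by the bit number and test the lowest bit.
        if [ $(( (0x$1 >> $bit) & 1 )) -eq 1 ]; then
            echo "$name"
        fi
        bit=$((bit + 1))
    done
}
```

For instance, `decode_caps 0000000000002400` prints `CAP_NET_BIND_SERVICE` and `CAP_NET_RAW`, since bits 10 and 13 are set in that mask.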
    27  
    28  In the context of containers, capabilities are useful for:
    29  
    30  * Restricting the effective privileges of applications running as root
    31  * Allowing applications to perform specific privileged operations, without
    32     having to run them as root
    33  
    34  For the complete list of existing Linux capabilities and a detailed description
    35  of this security mechanism, see the [capabilities(7) man page][man-capabilities].
    36  
## Default Capabilities
    38  
By default, rkt enforces [a default set of capabilities][default-caps] on applications.
This default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, e.g. invoking `reboot(2)`, are denied in this way.
    44  
    45  However, this default set is mostly meant as a safety precaution against erratic
    46  and misbehaving applications, and will not suffice against tailored attacks.
    47  As such, it is recommended to fine-tune the capabilities bounding set using one
    48  of the customizable isolators available in rkt.
    49  
    50  ## Capability Isolators
    51  
    52  When running Linux containers, rkt provides two mutually exclusive isolators
    53  to define the bounding set under which an application will be run:
    54  
    55  * `os/linux/capabilities-retain-set`
    56  * `os/linux/capabilities-remove-set`
    57  
These isolators cover different use cases and employ different techniques to
achieve the same goal of limiting available capabilities. As such, they cannot
be used together, and recommended usage varies on a case-by-case basis.
    62  
As the granularity of capabilities varies for specific permission cases, a word
of warning is needed in order to avoid a false sense of security.
In many cases it is possible to abuse granted capabilities in order to
completely subvert the sandbox: for example, `CAP_SYS_PTRACE` allows access to
the stage1 environment and `CAP_SYS_ADMIN` grants a broad range of privileges,
effectively equivalent to root.
Many other ways to maliciously transition across capabilities have already been
[reported][grsec-forums].
    71  
    72  ### Retain-set
    73  
    74  `os/linux/capabilities-retain-set` allows for an additive approach to
    75  capabilities: applications will be stripped of all capabilities, except the ones
    76  listed in this isolator.
    77  
    78  This whitelisting approach is useful for completely locking down environments
    79  and whenever application requirements (in terms of capabilities) are
    80  well-defined in advance. It allows one to ensure that exactly and only the
    81  specified capabilities could ever be used.
    82  
For example, an application whose only privileged operation is binding to
port 80 will have `CAP_NET_BIND_SERVICE` as the only entry in its "retain-set".
    86  
    87  ### Remove-set
    88  
    89  `os/linux/capabilities-remove-set` tackles capabilities in a subtractive way:
    90  starting from the default set of capabilities, single entries can be further
    91  forbidden in order to prevent specific actions.
    92  
This blacklisting approach is useful for restricting applications which have
broad requirements in terms of privileged operations, in order to deny some
potentially malicious operations.

For example, an application that needs to perform multiple privileged
operations but is known to never open a raw socket will have
`CAP_NET_RAW` specified in its "remove-set".
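
The difference between the two isolators can be modelled as plain set operations. The sketch below is only an illustration of the semantics described in this section, not rkt's implementation; the function name `apply_isolator` is hypothetical, and the starting set in the usage note is a made-up placeholder rather than rkt's real default set.

```bash
# apply_isolator KIND START_SET LISTED_SET
# KIND is "retain" or "remove"; both sets are space-separated capability names.
# Prints the resulting bounding set, one name per line.
apply_isolator() {
    local cap
    case "$1" in
        retain)
            # Additive: only the listed capabilities survive.
            for cap in $3; do
                echo "$cap"
            done
            ;;
        remove)
            # Subtractive: keep the starting set minus the listed entries.
            for cap in $2; do
                case " $3 " in
                    *" $cap "*) ;;      # listed in the remove-set: dropped
                    *) echo "$cap" ;;   # not listed: stays available
                esac
            done
            ;;
        *)
            echo "unknown isolator kind: $1" >&2
            return 1
            ;;
    esac
}
```

With a placeholder starting set of `"CAP_CHOWN CAP_NET_RAW CAP_NET_BIND_SERVICE"`, `apply_isolator remove` with `"CAP_NET_RAW"` leaves `CAP_CHOWN` and `CAP_NET_BIND_SERVICE`, while `apply_isolator retain` with `"CAP_NET_BIND_SERVICE"` leaves only that one entry.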
   100  
   101  ## Configure capabilities via the command line
   102  
   103  Capabilities can be directly overridden at run time from the command-line,
   104  without changing the executed images.
   105  The `--caps-retain` option to `rkt run` manipulates the `retain` capabilities set.
   106  The `--caps-remove` option manipulates the `remove` set.
   107  
Capabilities specified on the command line replace all capability settings in the image manifest.
As stated above, the `--caps-retain` and `--caps-remove` options are mutually exclusive:
only one of them can be specified at a time.
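
A wrapper script can catch this conflict before invoking rkt. The following sketch assumes a single-application `rkt run` invocation; `check_caps_flags` is a hypothetical helper, not part of rkt.

```bash
# check_caps_flags ARG...
# Scans a prospective single-application `rkt run` argument list and fails
# if both --caps-retain and --caps-remove appear in it.
check_caps_flags() {
    local retain remove arg
    retain=0
    remove=0
    for arg in "$@"; do
        case "$arg" in
            --caps-retain|--caps-retain=*) retain=1 ;;
            --caps-remove|--caps-remove=*) remove=1 ;;
        esac
    done
    if [ "$retain" -eq 1 ] && [ "$remove" -eq 1 ]; then
        echo "error: --caps-retain and --caps-remove are mutually exclusive" >&2
        return 1
    fi
    return 0
}
```

The helper only inspects its arguments; it could be called as `check_caps_flags "$@" && sudo rkt run "$@"` from a wrapper.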
   111  
   112  Capabilities isolators can be added on the command line at run time by
   113  specifying the desired overriding set, as shown in this example:
   114  
   115  ```
   116  $ sudo rkt run --interactive quay.io/coreos/alpine-sh --caps-retain CAP_NET_BIND_SERVICE
   117  image: using image from file /usr/local/bin/stage1-coreos.aci
   118  image: using image from local store for image name quay.io/coreos/alpine-sh
   119  
   120  / # whoami
   121  root
   122  
   123  / # ping -c 1 8.8.8.8
   124  PING 8.8.8.8 (8.8.8.8): 56 data bytes
   125  ping: permission denied (are you root?)
   126  
   127  ```
   128  
   129  Capability sets are application-specific configuration entries, and in a
   130  `rkt run` command line, they must follow the application container image to
   131  which they apply.
   132  Each application within a pod can have different capability sets.
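
As an illustration of this placement rule, the sketch below prints (without running it) a command line for a pod with two applications, each followed by its own capability option. The image names `example.com/web` and `example.com/probe` are hypothetical placeholders, and the function `print_pod_command` exists only for this example.

```bash
# print_pod_command
# Prints, without executing it, a `rkt run` command line for a pod with two
# applications. Both image names are hypothetical placeholders.
print_pod_command() {
    local web_image probe_image
    web_image="example.com/web"
    probe_image="example.com/probe"
    # Each --caps-retain option follows the image it applies to.
    echo sudo rkt run \
        "$web_image" --caps-retain=CAP_NET_BIND_SERVICE \
        "$probe_image" --caps-retain=CAP_NET_RAW
}
```

Here the first application would only keep `CAP_NET_BIND_SERVICE`, while the second would only keep `CAP_NET_RAW`.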
   133  
   134  ## Configure capabilities in ACI images
   135  
   136  Capability sets are typically defined when creating images, as they are tightly
   137  linked to specific app requirements.
   138  
The goal of these examples is to show how to build ACIs with [`acbuild`][acbuild],
where some capabilities are either explicitly blocked or allowed.
For simplicity, the starting point will be the official Alpine Linux image from
CoreOS, which ships with the `ping` and `nc` commands (from busybox). These
commands require the `CAP_NET_RAW` and `CAP_NET_BIND_SERVICE` capabilities,
respectively, in order to perform privileged operations.
To block their usage, the capabilities bounding set
can be manipulated via `os/linux/capabilities-remove-set` or
`os/linux/capabilities-retain-set`; both approaches are shown here.
   148  
   149  ### Removing specific capabilities
   150  
This example shows how to block `ping` only, by removing `CAP_NET_RAW` from
the capabilities bounding set.
   153  
   154  First, a local image is built with an explicit "remove-set" isolator.
   155  This set contains the capabilities that need to be forbidden in order to block
   156  `ping` usage (and only that):
   157  
   158  ```
   159  $ acbuild begin
   160  $ acbuild set-name localhost/caps-remove-set-example
   161  $ acbuild dependency add quay.io/coreos/alpine-sh
   162  $ acbuild set-exec -- /bin/sh
   163  $ echo '{ "set": ["CAP_NET_RAW"] }' | acbuild isolator add "os/linux/capabilities-remove-set" -
   164  $ acbuild write caps-remove-set-example.aci
   165  $ acbuild end
   166  ```
   167  
   168  Once properly built, this image can be run in order to check that `ping` usage has
   169  been effectively disabled:
   170  
   171  ```
   172  $ sudo rkt run --interactive --insecure-options=image caps-remove-set-example.aci
   173  image: using image from file stage1-coreos.aci
   174  image: using image from file caps-remove-set-example.aci
   175  image: using image from local store for image name quay.io/coreos/alpine-sh
   176  
   177  / # whoami
   178  root
   179  
   180  / # ping -c 1 8.8.8.8
   181  PING 8.8.8.8 (8.8.8.8): 56 data bytes
   182  ping: permission denied (are you root?)
   183  ```
   184  
This means that `CAP_NET_RAW` has been effectively disabled inside the container.
At the same time, `CAP_NET_BIND_SERVICE` is still available in the default bounding
set, so the `nc` command will be able to bind to port 80:
   188  
   189  ```
   190  $ sudo rkt run --interactive --insecure-options=image caps-remove-set-example.aci
   191  image: using image from file stage1-coreos.aci
   192  image: using image from file caps-remove-set-example.aci
   193  image: using image from local store for image name quay.io/coreos/alpine-sh
   194  
   195  / # whoami
   196  root
   197  
   198  / # nc -v -l -p 80
   199  listening on [::]:80 ...
   200  ```
   201  
   202  ### Allowing specific capabilities
   203  
In contrast to the example above, this one shows how to allow `ping` only, by
removing all capabilities except `CAP_NET_RAW` from the bounding set.
This means that all other privileged operations, including binding to port 80,
will be blocked.

First, a local image is built with an explicit "retain-set" isolator.
This set contains the capabilities that need to be enabled in order to allow
`ping` usage (and only that):
   212  
   213  ```
   214  $ acbuild begin
   215  $ acbuild set-name localhost/caps-retain-set-example
   216  $ acbuild dependency add quay.io/coreos/alpine-sh
   217  $ acbuild set-exec -- /bin/sh
   218  $ echo '{ "set": ["CAP_NET_RAW"] }' | acbuild isolator add "os/linux/capabilities-retain-set" -
   219  $ acbuild write caps-retain-set-example.aci
   220  $ acbuild end
   221  ```
   222  
   223  Once run, it can be easily verified that `ping` from inside the container is now
   224  functional:
   225  
   226  ```
   227  $ sudo rkt run --interactive --insecure-options=image caps-retain-set-example.aci
   228  image: using image from file stage1-coreos.aci
   229  image: using image from file caps-retain-set-example.aci
   230  image: using image from local store for image name quay.io/coreos/alpine-sh
   231  
   232  / # whoami
   233  root
   234  
   235  / # ping -c 1 8.8.8.8
   236  PING 8.8.8.8 (8.8.8.8): 56 data bytes
   237  64 bytes from 8.8.8.8: seq=0 ttl=41 time=24.910 ms
   238  
   239  --- 8.8.8.8 ping statistics ---
   240  1 packets transmitted, 1 packets received, 0% packet loss
   241  round-trip min/avg/max = 24.910/24.910/24.910 ms
   242  ```
   243  
However, all other capabilities are no longer available to the application.
For example, using `nc` to bind to port 80 will now result in a failure due to
the missing `CAP_NET_BIND_SERVICE` capability:
   247  
   248  ```
   249  $ sudo rkt run --interactive --insecure-options=image caps-retain-set-example.aci
   250  image: using image from file stage1-coreos.aci
   251  image: using image from file caps-retain-set-example.aci
   252  image: using image from local store for image name quay.io/coreos/alpine-sh
   253  
   254  / # whoami
   255  root
   256  
   257  / # nc -v -l -p 80
   258  nc: bind: Permission denied
   259  ```
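
The two build sequences used in these examples differ only in the image name, the isolator and the capability list, so they can be wrapped in one function. This is a sketch under the assumption that the `acbuild` subcommands behave as in the examples above; `build_caps_image` is a hypothetical helper name.

```bash
# build_caps_image NAME KIND CAP...
# Builds NAME.aci on top of quay.io/coreos/alpine-sh with one capability
# isolator. KIND is "retain" or "remove"; CAP... are the capability names.
build_caps_image() {
    local name kind caps cap status
    name=$1
    kind=$2
    shift 2
    # Render the capability names as a comma-separated list of JSON strings.
    caps=""
    for cap in "$@"; do
        caps="${caps:+$caps, }\"$cap\""
    done
    status=0
    acbuild begin || return 1
    {
        acbuild set-name "localhost/$name" &&
        acbuild dependency add quay.io/coreos/alpine-sh &&
        acbuild set-exec -- /bin/sh &&
        echo "{ \"set\": [$caps] }" |
            acbuild isolator add "os/linux/capabilities-$kind-set" - &&
        acbuild write "$name.aci"
    } || status=$?
    # Always discard the build context, even when a step failed.
    acbuild end
    return "$status"
}
```

Calling `build_caps_image caps-remove-set-example remove CAP_NET_RAW` or `build_caps_image caps-retain-set-example retain CAP_NET_RAW` would then correspond to the two images built in this section.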
   260  
   261  ### Patching images
   262  
   263  Image manifests can be manipulated manually, by unpacking the image and editing
   264  the manifest file, or with helper tools like [`actool`][actool].
   265  To override an image's pre-defined capabilities set, replace the existing capabilities
   266  isolators in the image with new isolators defining the desired capabilities.
   267  
   268  The `patch-manifest` subcommand to `actool` manipulates the capabilities sets
   269  defined in an image.
   270  `actool patch-manifest --capability` changes the `retain` capabilities set.
   271  `actool patch-manifest --revoke-capability` changes the `remove` set.
   272  These commands take an input image, modify its existing capabilities sets, and
   273  write the changes to an output image, as shown in the example:
   274  
   275  ```
   276  $ actool cat-manifest caps-retain-set-example.aci
   277  ...
   278      "isolators": [
   279        {
   280          "name": "os/linux/capabilities-retain-set",
   281          "value": {
   282            "set": [
   283              "CAP_NET_RAW"
   284            ]
   285          }
   286        }
   287      ]
   288  ...
   289  
   290  $ actool patch-manifest -capability CAP_NET_RAW,CAP_NET_BIND_SERVICE caps-retain-set-example.aci caps-retain-set-patched.aci
   291  
   292  $ actool cat-manifest caps-retain-set-patched.aci
   293  ...
   294      "isolators": [
   295        {
   296          "name": "os/linux/capabilities-retain-set",
   297          "value": {
   298            "set": [
   299              "CAP_NET_RAW",
   300              "CAP_NET_BIND_SERVICE"
   301            ]
   302          }
   303        }
   304      ]
   305  ...
   306  
   307  ```
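
When comparing manifests before and after patching, it can help to reduce them to the capability names they mention. The sketch below does this with plain `grep`; the helper names are hypothetical, and the filter matches every `CAP_*` token in its input, regardless of which isolator it belongs to.

```bash
# manifest_caps
# Reads a manifest (for example the output of `actool cat-manifest`) on
# standard input and prints every capability name it mentions, one per line.
manifest_caps() {
    grep -o 'CAP_[A-Z_]*'
}

# manifest_has_cap NAME
# Succeeds when the manifest on standard input mentions capability NAME.
manifest_has_cap() {
    manifest_caps | grep -qx "$1"
}
```

For example, `actool cat-manifest caps-retain-set-patched.aci | manifest_caps` would be expected to list `CAP_NET_RAW` and `CAP_NET_BIND_SERVICE` for the patched image above.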
   308  
   309  Now run the image to check that the `CAP_NET_BIND_SERVICE` capability added to
   310  the patched image is retained as expected by using `nc` to listen on a
   311  "privileged" port:
   312  
   313  ```
   314  $ sudo rkt run --interactive --insecure-options=image caps-retain-set-patched.aci
   315  image: using image from file stage1-coreos.aci
   316  image: using image from file caps-retain-set-patched.aci
   317  image: using image from local store for image name quay.io/coreos/alpine-sh
   318  
   319  / # nc -v -l -p 80
   320  listening on [::]:80 ...
   321  ```
   322  
   323  ## Capabilities when running as non-root
   324  
   325  The capability isolators (and default capabilities) mentioned in this document operate on the capability bounding set.
When running containers as non-root, capabilities are not added to the effective set of the process, which is the set the kernel checks when the app attempts to perform a privileged operation.
This means the process cannot directly perform the privileged operations that the capabilities granted to the container would otherwise enable.
   328  
   329  For example, including `CAP_NET_RAW` in the retain set when running a container as non-root doesn't allow the container to run `ping` (which uses raw sockets):
   330  
   331  ```
   332  $ sudo rkt run --interactive kinvolk.io/aci/busybox --user=1000 --group=1000 --caps-retain=CAP_NET_RAW --exec ping -- 8.8.8.8
   333  PING 8.8.8.8 (8.8.8.8): 56 data bytes
   334  ping: permission denied (are you root?)
   335  ```
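
The gap between the bounding set and the effective set can be observed in `/proc/<pid>/status`, where the kernel prints both as hexadecimal masks (`CapBnd` and `CapEff`). The sketch below reads those fields; the helper names are hypothetical, and bit 13 is assumed to be `CAP_NET_RAW` as defined in the kernel's `linux/capability.h`.

```bash
# status_cap_mask FIELD [FILE]
# Prints the hexadecimal mask stored in FIELD (CapInh, CapPrm, CapEff, CapBnd
# or CapAmb) of a /proc/<pid>/status file; FILE defaults to /proc/self/status.
status_cap_mask() {
    awk -v field="$1:" '$1 == field { print $2 }' "${2:-/proc/self/status}"
}

# cap_is_set FIELD BIT [FILE]
# Succeeds when capability number BIT is set in the mask stored in FIELD.
cap_is_set() {
    local mask
    mask=$(status_cap_mask "$1" "${3:-}")
    [ -n "$mask" ] && [ $(( (0x$mask >> $2) & 1 )) -eq 1 ]
}
```

In the non-root case described above, `cap_is_set CapBnd 13` would be expected to succeed while `cap_is_set CapEff 13` fails, which is why `ping` is denied.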
   336  
   337  To be able to execute `ping` as a non-root user, the binary needs to have the corresponding file capability.
   338  
   339  **Note**: running an image with file capabilities currently requires disabling seccomp in rkt.
   340  This is due to a systemd bug where using seccomp results in enabling [no_new_privs][NNP] (you can track progress in [#3896](https://github.com/rkt/rkt/issues/3896)).
   341  
   342  ### Building images with file capabilities
   343  
   344  Building images that include files with file capabilities is challenging since:
   345  
   346  * [build][acbuild] doesn't preserve file capabilities. See [containers/build#197](https://github.com/containers/build/issues/197).
   347  * docker build doesn't preserve file capabilities. See [moby/moby#35699](https://github.com/moby/moby/issues/35699).
   348  
   349  However, provided we have an ACI (created for example with [build][acbuild] or with [docker2aci][docker2aci]), we can extract it and add the file capabilities manually.
   350  
   351  #### Example
   352  
   353  We'll use build to create an Ubuntu ACI with ping installed:
   354  
```bash
#!/usr/bin/env bash
# Stop at the first failing command.
set -e

acbuild --debug begin docker://ubuntu
acbuild --debug set-name example.com/filecap

# Install ping
acbuild --debug run -- apt-get update
acbuild --debug run -- apt-get install -y inetutils-ping

# ping comes as a setuid file, we don't want that
acbuild --debug run -- chmod -s /bin/ping

acbuild --debug set-exec /bin/bash

acbuild --debug write --overwrite filecaps.aci

# Discard the build context so the script can be run again.
acbuild --debug end
```
   372  
   373  After running the script, we'll extract the ACI, add file capabilities, and rebuild it.
   374  
   375  ```
   376  $ sudo tar -xf filecaps.aci
   377  $ sudo setcap cap_net_raw+ep rootfs/bin/ping
   378  $ sudo tar --xattrs -cf filecaps-mod.aci manifest rootfs
   379  ```
   380  
Now we can run rkt with that image as a non-root user, retaining the `CAP_NET_RAW` capability, and `ping` should work:
   382  
   383  ```
   384  $ sudo rkt --insecure-options=image,seccomp run --interactive filecaps-mod.aci --user=1000 --group=1000 --caps-retain=cap_net_raw
   385  groups: cannot find name for group ID 1000
   386  bash: /root/.bashrc: Permission denied
   387  I have no name!@rkt-4a9e66a0-6ee0-496f-8b7a-ab259362cba7:/$ ls -l /bin/ping
   388  -rwxr-xr-x 1 root root 70680 Feb  6  2016 /bin/ping
   389  I have no name!@rkt-4a9e66a0-6ee0-496f-8b7a-ab259362cba7:/$ getcap /bin/ping
   390  /bin/ping = cap_net_raw+ep
   391  I have no name!@rkt-4a9e66a0-6ee0-496f-8b7a-ab259362cba7:/$ ping 8.8.8.8
   392  PING 8.8.8.8 (8.8.8.8): 56 data bytes
   393  64 bytes from 8.8.8.8: icmp_seq=0 ttl=52 time=21.859 ms
   394  64 bytes from 8.8.8.8: icmp_seq=1 ttl=52 time=20.298 ms
   395  64 bytes from 8.8.8.8: icmp_seq=2 ttl=52 time=25.207 ms
   396  ^C--- 8.8.8.8 ping statistics ---
   397  3 packets transmitted, 3 packets received, 0% packet loss
   398  round-trip min/avg/max/stddev = 20.298/22.455/25.207/2.048 ms
   399  ```
   400  
   401  ### Ambient capabilities
   402  
   403  There's a way in Linux to give capabilities to non-root processes without needing file capabilities to perform the privileged task: ambient capabilities.
   404  
   405  When a capability is added to the ambient set, it will be preserved in the effective set of the process executed in the container.
   406  However, this is currently not implemented in rkt.
   407  
   408  For more information about ambient capabilities, check [capabilities(7)][man-capabilities].
   409  
   410  ## Recommendations
   411  
   412  As with most security features, capability isolators may require some
   413  application-specific tuning in order to be maximally effective. For this reason,
   414  for security-sensitive environments it is recommended to have a well-specified
   415  set of capabilities requirements and follow best practices:
   416  
 1. Always follow the principle of least privilege and, whenever possible,
    avoid running applications as root.
 2. Only grant the minimum set of capabilities needed by an application,
    according to its typical usage.
   421   3. Avoid granting overly generic capabilities. For example, `CAP_SYS_ADMIN` and
   422      `CAP_SYS_PTRACE` are typically bad choices, as they open large attack
   423      surfaces.
   424   4. Prefer a whitelisting approach, trying to keep the "retain-set" as small as
   425      possible.
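
Part of these recommendations can be checked mechanically before an image is built or run. The sketch below flags the two capabilities named in recommendation 3; `check_retain_set` is a hypothetical helper, and a real policy would likely cover more entries.

```bash
# check_retain_set CAP...
# Writes a warning to standard error for each overly generic capability in
# the given retain-set and fails if at least one was found.
check_retain_set() {
    local cap found
    found=0
    for cap in "$@"; do
        case "$cap" in
            CAP_SYS_ADMIN|CAP_SYS_PTRACE)
                echo "warning: $cap opens a large attack surface" >&2
                found=1
                ;;
        esac
    done
    [ "$found" -eq 0 ]
}
```

For example, `check_retain_set CAP_NET_BIND_SERVICE` succeeds silently, whereas `check_retain_set CAP_NET_RAW CAP_SYS_ADMIN` prints a warning and returns a non-zero status.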
   426  
   427  [acbuild]: https://github.com/containers/build
   428  [docker2aci]: https://github.com/appc/docker2aci
   429  [actool]: https://github.com/appc/spec#building-acis
   430  [capabilities]: https://lwn.net/Kernel/Index/#Capabilities
   431  [default-caps]: https://github.com/appc/spec/blob/master/spec/ace.md#oslinuxcapabilities-remove-set
   432  [grsec-forums]: https://forums.grsecurity.net/viewtopic.php?f=7&t=2522
   433  [man-capabilities]: http://man7.org/linux/man-pages/man7/capabilities.7.html
   434  [NNP]: https://www.kernel.org/doc/Documentation/prctl/no_new_privs.txt