# Seccomp Isolators Guide

This document is a walk-through guide describing how to use rkt isolators for
[Linux seccomp filtering][lwn-seccomp].

* [About Seccomp](#about-seccomp)
* [Predefined Seccomp Filters](#predefined-seccomp-filters)
* [Seccomp Isolators](#seccomp-isolators)
* [Usage Example](#usage-example)
* [Overriding Seccomp Filters](#overriding-seccomp-filters)
* [Recommendations](#recommendations)

## About seccomp

Linux seccomp (short for SECure COMputing) filtering allows one to specify which
system calls a process should be allowed to invoke, reducing the kernel surface
exposed to applications.
This provides a clearly defined mechanism to build sandboxed environments, where
processes can run having access only to a specific reduced set of system calls.

In the context of containers, seccomp filtering is useful for:

* Restricting applications from invoking syscalls that can affect the host
* Reducing kernel attack surface in case of security bugs

For more details on how Linux seccomp filtering works, see
[seccomp(2)][man-seccomp].

## Predefined seccomp filters

By default, rkt comes with a set of predefined filtering groups that can be
used to quickly build sandboxed environments for containerized applications.
Each set is simply a reference to a group of syscalls, covering a single
functional area or kernel subsystem. They can be further combined to
build more complex filters, either by blacklisting or by whitelisting specific
system calls. To distinguish these predefined groups from real syscall names,
wildcard labels are prefixed with an `@` symbol and are namespaced.

The App Container Spec (appc) defines
[two groups][appc-isolators]:

 * `@appc.io/all` represents the set of all available syscalls.
 * `@appc.io/empty` represents the empty set.

rkt provides two default groups for generic usage:

 * `@rkt/default-blacklist` represents a broad-scope filter that can be used for generic blacklisting
 * `@rkt/default-whitelist` represents a broad-scope filter that can be used for generic whitelisting

For compatibility reasons, two groups are provided mirroring [default Docker profiles][docker-seccomp]:

 * `@docker/default-blacklist`
 * `@docker/default-whitelist`

When using stage1 images with systemd >= v231, some
[predefined groups][systemd-seccomp]
are also available:

 * `@systemd/clock` for syscalls manipulating the system clock
 * `@systemd/default-whitelist` for a generic set of typically whitelisted syscalls
 * `@systemd/mount` for filesystem mounting and unmounting
 * `@systemd/network-io` for socket I/O operations
 * `@systemd/obsolete` for unusual, obsolete or unimplemented syscalls
 * `@systemd/privileged` for syscalls which need super-user capabilities
 * `@systemd/process` for syscalls acting on process control, execution and namespacing
 * `@systemd/raw-io` for raw I/O port access

When no seccomp filtering is specified, by default rkt whitelists all the generic
syscalls typically needed by applications for common operations. This is
the same set defined by `@rkt/default-whitelist`.

The default set is tailored to stop applications from performing a large
variety of privileged actions, while not impacting their normal behavior.
Operations which are typically not needed in containers and which may
impact host state, e.g. invoking [`umount(2)`][man-umount], are denied in this way.

However, this default set is mostly meant as a safety precaution against erratic
and misbehaving applications, and will not suffice against tailored attacks.
As such, it is recommended to fine-tune seccomp filtering using one of the
customizable isolators available in rkt.

## Seccomp Isolators

When running Linux containers, rkt provides two mutually exclusive isolators
to define a seccomp filter for an application:

* `os/linux/seccomp-retain-set`
* `os/linux/seccomp-remove-set`

These isolators cover different use-cases and employ different techniques to
achieve the same goal of limiting available syscalls. As such, they cannot
be used together at the same time, and recommended usage varies on a
case-by-case basis.

### Operation mode

Seccomp isolators work by defining a set of syscalls that can be either blocked
("remove-set") or allowed ("retain-set"). Once an application tries to invoke
a blocked syscall, the kernel will deny this operation and the application will
be notified about the failure.

By default, invoking blocked syscalls will result in the application being
immediately terminated with a `SIGSYS` signal. This behavior can be tweaked by
returning a specific error code ("errno") to the application instead of
terminating it.

For both isolators, this can be customized by specifying an additional `errno`
parameter with the desired symbolic errno name. For a list of errno labels, check
the [reference][man-errno] at `man 3 errno`.

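As a minimal sketch, assuming the `errno` key sits next to `set` inside the
isolator value (the layout printed by `actool cat-manifest` later in this
guide), the following builds a "remove-set" value that returns `ENOTSUP`
instead of raising `SIGSYS`. The variable name `isolator` is only illustrative,
and the `acbuild` call is left as a comment because it needs an open `acbuild`
session:

```shell
# Block socket() on top of the default blacklist and return ENOTSUP
# to the caller instead of terminating it with SIGSYS.
isolator='{ "set": ["@rkt/default-blacklist", "socket"], "errno": "ENOTSUP" }'
printf '%s\n' "$isolator"

# Inside an acbuild session, the value would be piped in as:
#   printf '%s\n' "$isolator" | acbuild isolator add "os/linux/seccomp-remove-set" -
```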
### Retain-set

`os/linux/seccomp-retain-set` allows for an additive approach to build a seccomp
filter: applications will not be able to use any syscalls, except the ones
listed in this isolator.

This whitelisting approach is useful for completely locking down environments
and whenever application requirements (in terms of syscalls) are
well-defined in advance. It allows one to ensure that exactly and only the
specified syscalls could ever be used.

For example, the "retain-set" for a typical network application will include
entries for generic POSIX operations (available in `@systemd/default-whitelist`),
socket operations (`@systemd/network-io`) and reacting to I/O
events (`@systemd/io-event`).

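The network application described above could be expressed roughly as follows.
This is a sketch rather than a vetted profile: the variable name `retain_set`
is illustrative, and the `@systemd/...` groups are only available with stage1
images using systemd >= v231. The `acbuild` call is shown as a comment because
it needs an open `acbuild` session:

```shell
# Retain-set for a hypothetical network service: generic POSIX calls,
# socket I/O and I/O event notification. Any other syscall is denied.
retain_set='{ "set": ["@systemd/default-whitelist", "@systemd/network-io", "@systemd/io-event"] }'
printf '%s\n' "$retain_set"

# Inside an acbuild session, the value would be piped in as:
#   printf '%s\n' "$retain_set" | acbuild isolator add "os/linux/seccomp-retain-set" -
```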
### Remove-set

`os/linux/seccomp-remove-set` tackles syscalls in a subtractive way:
starting from all available syscalls, single entries can be forbidden in order
to prevent specific actions.

This blacklisting approach is useful to limit applications which have
broad requirements in terms of syscalls, in order to deny access to some clearly
unused but potentially exploitable syscalls.

For example, an application that will need to perform multiple operations but is
known to never touch mountpoints could have `@systemd/mount` specified in its
"remove-set".

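The difference between the additive and the subtractive approach can be
illustrated with a toy model. The five-entry syscall "universe" below is made
up for the example and has nothing to do with the real kernel list or with how
rkt builds its filters; it only shows how each set determines what remains
allowed:

```shell
# Toy model only: a made-up universe of five syscalls.
all="read write socket mount umount2"
retain="read write socket"   # retain-set: only these are allowed
remove="mount umount2"       # remove-set: everything except these is allowed

# Retain-set: the allowed set is exactly what is listed.
allowed_retain=$retain

# Remove-set: start from everything and drop the listed entries.
allowed_remove=""
for sc in $all; do
  case " $remove " in
    *" $sc "*) ;;  # listed in the remove-set: blocked
    *) allowed_remove="${allowed_remove:+$allowed_remove }$sc" ;;
  esac
done

echo "retain-set allows: $allowed_retain"
echo "remove-set allows: $allowed_remove"
```

In this model both isolators end up allowing `read write socket`. The
difference shows when a new syscall appears in the universe: the remove-set
allows it automatically, while the retain-set keeps denying it.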
## Usage Example

The goal of these examples is to show how to build ACI images with [`acbuild`][acbuild],
where some syscalls are either explicitly blocked or allowed.
For simplicity, the starting point will be a bare Alpine Linux image which
ships with `ping` and `umount` commands (from busybox). These
commands respectively require the [`socket(2)`][man-socket] and [`umount(2)`][man-umount] syscalls in order to
perform privileged operations.
To block their usage, a syscall filter can be installed via
`os/linux/seccomp-remove-set` or `os/linux/seccomp-retain-set`; both approaches
are shown here.

### Blacklisting specific syscalls

This example shows how to block socket operations (e.g. with `ping`), by removing
`socket()` from the set of allowed syscalls.

First, a local image is built with an explicit "remove-set" isolator.
This set contains the syscalls that need to be forbidden in order to block
socket setup:

```
$ acbuild begin
$ acbuild set-name localhost/seccomp-remove-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["@rkt/default-blacklist", "socket"] }' | acbuild isolator add "os/linux/seccomp-remove-set" -
$ acbuild write seccomp-remove-set-example.aci
$ acbuild end
```

Once properly built, this image can be run in order to check that `ping` usage is
now blocked by the seccomp filter. At the same time, the default blacklist will
also block other dangerous syscalls like `umount(2)`:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-remove-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file seccomp-remove-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
Bad system call

/ # umount /proc/bus/
Bad system call
```

This means that `socket(2)` and `umount(2)` have both been effectively disabled
inside the container.

### Allowing specific syscalls

In contrast to the example above, this one shows how to allow some operations
only (e.g. network communication via `ping`), by whitelisting all required
syscalls. This means that syscalls outside of this set will be blocked.

First, a local image is built with an explicit "retain-set" isolator.
This set contains the rkt wildcard "default-whitelist" (which already provides
all socket-related entries), plus some custom syscalls (e.g. `umount(2)`) which
are typically not allowed:

```
$ acbuild begin
$ acbuild set-name localhost/seccomp-retain-set-example
$ acbuild dependency add quay.io/coreos/alpine-sh
$ acbuild set-exec -- /bin/sh
$ echo '{ "set": ["@rkt/default-whitelist", "umount", "umount2"] }' | acbuild isolator add "os/linux/seccomp-retain-set" -
$ acbuild write seccomp-retain-set-example.aci
$ acbuild end
```

Once run, it can be easily verified that both `ping` and `umount` are now
functional inside the container. These operations also require [additional
capabilities][capabilities-guide] to be retained in order to work:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-retain-set-example.aci --caps-retain=CAP_SYS_ADMIN,CAP_NET_RAW
image: using image from file stage1-coreos.aci
image: using image from file seccomp-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
64 bytes from 8.8.8.8: seq=0 ttl=41 time=24.910 ms

--- 8.8.8.8 ping statistics ---
1 packets transmitted, 1 packets received, 0% packet loss
round-trip min/avg/max = 24.910/24.910/24.910 ms

/ # mount | grep /proc/bus
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
/ # umount /proc/bus
/ # mount | grep /proc/bus
```

However, other syscalls are still not available to the application.
For example, trying to set the time will result in a failure due to invoking
non-whitelisted syscalls:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-retain-set-example.aci
image: using image from file stage1-coreos.aci
image: using image from file seccomp-retain-set-example.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # adjtimex -f 0
Bad system call
```

## Overriding Seccomp Filters

Seccomp filters are typically defined when creating images, as they are tightly
linked to specific app requirements. However, image consumers may need to further
tweak/restrict the set of available syscalls in specific local scenarios.
This can be done either by permanently patching the manifest of specific images,
or by overriding seccomp isolators with command line options.

### Patching images

Image manifests can be manipulated manually, by unpacking the image and editing
the manifest file, or with helper tools like [`actool`][actool].
To override an image's pre-defined syscall set, just replace the existing seccomp
isolators in the image with new isolators defining the desired syscalls.

The `patch-manifest` subcommand of `actool` manipulates the syscall sets
defined in an image.
The `-seccomp-mode=...` and `-seccomp-set=...` options of `actool patch-manifest`
can be used together to override any seccomp filters by specifying a new mode
(retain or remove), an optional custom errno, and a set of syscalls to filter.
These commands take an input image, modify any existing seccomp isolators, and
write the changes to an output image, as shown in the example:

```
$ actool cat-manifest seccomp-retain-set-example.aci
...
    "isolators": [
      {
        "name": "os/linux/seccomp-retain-set",
        "value": {
          "set": [
            "@rkt/default-whitelist",
            "umount",
            "umount2"
          ]
        }
      }
    ]
...

$ actool patch-manifest -seccomp-mode=retain,errno=ENOSYS -seccomp-set=@rkt/default-whitelist seccomp-retain-set-example.aci seccomp-retain-set-patched.aci

$ actool cat-manifest seccomp-retain-set-patched.aci
...
    "isolators": [
      {
        "name": "os/linux/seccomp-retain-set",
        "value": {
          "set": [
            "@rkt/default-whitelist"
          ],
          "errno": "ENOSYS"
        }
      }
    ]
...
```

Now run the image to verify that the `umount(2)` syscall is no longer allowed,
and a custom error is returned:

```
$ sudo rkt run --interactive --insecure-options=image seccomp-retain-set-patched.aci
image: using image from file stage1-coreos.aci
image: using image from file seccomp-retain-set-patched.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # mount | grep /proc/bus
proc on /proc/bus type proc (ro,nosuid,nodev,noexec,relatime)
/ # umount /proc/bus/
umount: can't umount /proc/bus: Function not implemented
```

### Overriding seccomp filters at run-time

Seccomp filters can be directly overridden at run time from the command line,
without changing the executed images.
The `--seccomp` option to `rkt run` can manipulate both the "retain" and the
"remove" isolators.

Isolators overridden from the command line replace all seccomp settings in
the image manifest, and can be specified as shown in this example:

```
$ sudo rkt run --interactive quay.io/coreos/alpine-sh --seccomp mode=remove,errno=ENOTSUP,socket
image: using image from file /usr/local/bin/stage1-coreos.aci
image: using image from local store for image name quay.io/coreos/alpine-sh

/ # whoami
root

/ # ping -c 1 8.8.8.8
PING 8.8.8.8 (8.8.8.8): 56 data bytes
ping: can't create raw socket: Not supported
```

Seccomp isolators are application-specific configuration entries, and in a
`rkt run` command line they **must follow the application container image to
which they apply**.
Each application within a pod can have different seccomp filters.

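As a rough sketch of the placement rule, the following assembles a `rkt run`
command line for a pod with two applications, each followed by its own
`--seccomp` option. The image names `example.com/app-a` and `example.com/app-b`
are placeholders, and listing several syscalls in one comma-separated
`--seccomp` value is an assumption based on the single-syscall form shown
above. The script only prints the command; the real invocation is left as a
comment:

```shell
# Placeholder images; each --seccomp option follows the image it applies to.
set -- run \
  example.com/app-a --seccomp mode=remove,errno=ENOTSUP,socket \
  example.com/app-b --seccomp mode=remove,errno=EPERM,umount,umount2

# Print the assembled command line for inspection.
echo "rkt $*"

# To execute it for real (requires rkt and both images):
#   sudo rkt "$@"
```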
## Recommendations

As with most security features, seccomp isolators may require some
application-specific tuning in order to be maximally effective. For this reason,
for security-sensitive environments it is recommended to have a well-specified
set of syscall requirements and follow best practices:

 1. Only allow syscalls needed by an application, according to its typical usage.
 2. While it is possible to completely disable seccomp, it is rarely needed and
    should be generally avoided. Tweaking the syscall set is a better approach
    instead.
 3. Avoid granting access to dangerous syscalls. For example, [`mount(2)`][man-mount] and
    [`ptrace(2)`][man-ptrace] are typically abused to escape containers.
 4. Prefer a whitelisting approach, trying to keep the "retain-set" as small as
    possible.


[acbuild]: https://github.com/containers/build
[actool]: https://github.com/appc/spec#building-acis
[appc-isolators]: https://github.com/appc/spec/blob/master/spec/ace.md#linux-isolators
[capabilities-guide]: capabilities-guide.md
[docker-seccomp]: https://docs.docker.com/engine/security/seccomp/
[lwn-seccomp]: https://lwn.net/Articles/656307/
[man-errno]: http://man7.org/linux/man-pages/man3/errno.3.html
[man-mount]: http://man7.org/linux/man-pages/man2/mount.2.html
[man-ptrace]: http://man7.org/linux/man-pages/man2/ptrace.2.html
[man-seccomp]: http://man7.org/linux/man-pages/man2/seccomp.2.html
[man-socket]: http://man7.org/linux/man-pages/man2/socket.2.html
[man-umount]: http://man7.org/linux/man-pages/man2/umount.2.html
[systemd-seccomp]: https://www.freedesktop.org/software/systemd/man/systemd.exec.html#SystemCallFilter=