<!-- [metadata]>
+++
title = "Seccomp security profiles for Docker"
description = "Enabling seccomp in Docker"
keywords = ["seccomp, security, docker, documentation"]
[menu.main]
parent= "smn_secure_docker"
weight=90
+++
<![end-metadata]-->

# Seccomp security profiles for Docker

Secure computing mode (seccomp) is a Linux kernel feature that restricts the
system calls a process can make. The `seccomp()` system call operates on the
seccomp state of the calling process. Docker uses this feature to restrict the
actions available within a container.

This feature is available only if Docker has been built with seccomp and the
kernel is configured with `CONFIG_SECCOMP` enabled. To check if your kernel
supports seccomp:

```bash
$ grep CONFIG_SECCOMP= /boot/config-$(uname -r)
CONFIG_SECCOMP=y
```
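
The same check can be scripted. The following is a minimal sketch in Python, assuming the kernel config is available as plain text under `/boot`, as in the command above. The function name `seccomp_enabled` is illustrative and not part of Docker.

```python
import platform
import re


def seccomp_enabled(config_text: str) -> bool:
    """Return True if the kernel config text contains CONFIG_SECCOMP=y."""
    # Match the option on its own line so that related options such as
    # CONFIG_SECCOMP_FILTER=y do not count as a match.
    pattern = r"^CONFIG_SECCOMP=y$"
    return re.search(pattern, config_text, flags=re.MULTILINE) is not None


if __name__ == "__main__":
    # Assumption: the distribution ships the config under /boot, named after
    # the running kernel release. Some systems keep it elsewhere or omit it.
    path = f"/boot/config-{platform.release()}"
    try:
        with open(path) as f:
            print("seccomp supported:", seccomp_enabled(f.read()))
    except OSError as err:
        print(f"could not read {path}: {err}")
```

If the config file is missing, the script reports that it could not be read rather than concluding that seccomp is unsupported.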

> **Note**: seccomp profiles require libseccomp 2.2.1 and are only
> available starting with Debian 9 "Stretch", Ubuntu 15.10 "Wily", and
> Fedora 22. To use this feature on Ubuntu 14.04, Debian Wheezy, or
> Debian Jessie, you must download the [latest static Docker Linux binary](../installation/binaries.md).
> This feature is currently *not* available on other distributions.

## Passing a profile for a container

The default seccomp profile provides a sensible default for running containers
with seccomp and disables around 44 system calls out of 300+. It is moderately
protective while providing wide application compatibility. The default Docker
profile (found [here](https://github.com/docker/docker/blob/master/profiles/seccomp/default.json))
has a JSON layout in the following form:

```json
{
	"defaultAction": "SCMP_ACT_ERRNO",
	"architectures": [
		"SCMP_ARCH_X86_64",
		"SCMP_ARCH_X86",
		"SCMP_ARCH_X32"
	],
	"syscalls": [
		{
			"name": "accept",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		},
		{
			"name": "accept4",
			"action": "SCMP_ACT_ALLOW",
			"args": []
		},
		...
	]
}
```
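
To illustrate how this layout is read, the sketch below looks up the action a profile assigns to a given syscall name: the action of a matching entry if one exists, otherwise `defaultAction`. It is a simplified model and not Docker's implementation. The helper `action_for` and the trimmed `SAMPLE_PROFILE` are hypothetical, and argument filters in `args` are ignored.

```python
import json

# Trimmed sample in the layout shown above (two entries only).
SAMPLE_PROFILE = """
{
    "defaultAction": "SCMP_ACT_ERRNO",
    "architectures": ["SCMP_ARCH_X86_64", "SCMP_ARCH_X86", "SCMP_ARCH_X32"],
    "syscalls": [
        {"name": "accept", "action": "SCMP_ACT_ALLOW", "args": []},
        {"name": "accept4", "action": "SCMP_ACT_ALLOW", "args": []}
    ]
}
"""


def action_for(profile: dict, syscall: str) -> str:
    """Return the action the profile assigns to a syscall name.

    Simplification: entries are matched by name only, so any argument
    filters listed under "args" are not evaluated.
    """
    for entry in profile.get("syscalls", []):
        if entry["name"] == syscall:
            return entry["action"]
    # No explicit entry: the profile's default action applies.
    return profile["defaultAction"]


profile = json.loads(SAMPLE_PROFILE)
print(action_for(profile, "accept"))  # SCMP_ACT_ALLOW
print(action_for(profile, "mount"))   # SCMP_ACT_ERRNO
```

Because `defaultAction` is `SCMP_ACT_ERRNO`, any syscall without an explicit entry, such as `mount` here, is rejected with an error.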

When you run a container, it uses the default profile unless you override
it with the `--security-opt` option. For example, the following explicitly
specifies a profile:

```bash
$ docker run --rm -it --security-opt seccomp=/path/to/seccomp/profile.json hello-world
```
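
A custom profile file can be generated rather than written by hand. The sketch below builds a profile in the layout shown earlier that allows everything except a list of named syscalls, writes it to a temporary file, and prints the matching `docker run` command. Blocking `chmod` is only an illustration, and the helper name `deny_profile` is hypothetical.

```python
import json
import tempfile


def deny_profile(blocked: list) -> dict:
    """Build a profile that allows every syscall except the named ones."""
    return {
        "defaultAction": "SCMP_ACT_ALLOW",
        "syscalls": [
            {"name": name, "action": "SCMP_ACT_ERRNO", "args": []}
            for name in blocked
        ],
    }


profile = deny_profile(["chmod"])

# Write the profile to a temporary file so its path can be passed to Docker.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(profile, f, indent=2)
    profile_path = f.name

print(f"docker run --rm -it --security-opt seccomp={profile_path} hello-world")
```

Note that this inverts the approach of the default profile: it denies a few syscalls instead of allowing a known-good set, so it is far less protective.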

### Significant syscalls blocked by the default profile

Docker's default seccomp profile is a whitelist which specifies the calls that
are allowed. The table below lists the significant (but not all) syscalls that
are effectively blocked because they are not on the whitelist. The table includes
the reason each syscall is blocked rather than whitelisted.

| Syscall             | Description |
|---------------------|-------------|
| `acct`              | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_PACCT`. |
| `add_key`           | Prevent containers from using the kernel keyring, which is not namespaced. |
| `adjtimex`          | Similar to `clock_settime` and `settimeofday`, time/date is not namespaced. |
| `bpf`               | Deny loading potentially persistent BPF programs into the kernel. Already gated by `CAP_SYS_ADMIN`. |
| `clock_adjtime`     | Time/date is not namespaced. |
| `clock_settime`     | Time/date is not namespaced. |
| `clone`             | Deny cloning new namespaces. Also gated by `CAP_SYS_ADMIN` for `CLONE_*` flags, except `CLONE_NEWUSER`. |
| `create_module`     | Deny manipulation and functions on kernel modules. |
| `delete_module`     | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
| `finit_module`      | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
| `get_kernel_syms`   | Deny retrieval of exported kernel and module symbols. |
| `get_mempolicy`     | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
| `init_module`       | Deny manipulation and functions on kernel modules. Also gated by `CAP_SYS_MODULE`. |
| `ioperm`            | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
| `iopl`              | Prevent containers from modifying kernel I/O privilege levels. Already gated by `CAP_SYS_RAWIO`. |
| `kcmp`              | Restrict process inspection capabilities. Already blocked by dropping `CAP_SYS_PTRACE`. |
| `kexec_file_load`   | Sister syscall of `kexec_load` that does the same thing with slightly different arguments. |
| `kexec_load`        | Deny loading a new kernel for later execution. |
| `keyctl`            | Prevent containers from using the kernel keyring, which is not namespaced. |
| `lookup_dcookie`    | Tracing/profiling syscall, which could leak a lot of information on the host. |
| `mbind`             | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
| `modify_ldt`        | Old syscall only used in 16-bit code and a potential information leak. |
| `mount`             | Deny mounting. Already gated by `CAP_SYS_ADMIN`. |
| `move_pages`        | Syscall that modifies kernel memory and NUMA settings. |
| `name_to_handle_at` | Sister syscall to `open_by_handle_at`, which is gated by `CAP_DAC_READ_SEARCH`. |
| `nfsservctl`        | Deny interaction with the kernel NFS daemon. |
| `open_by_handle_at` | Cause of an old container breakout. Also gated by `CAP_DAC_READ_SEARCH`. |
| `perf_event_open`   | Tracing/profiling syscall, which could leak a lot of information on the host. |
| `personality`       | Prevent containers from enabling BSD emulation. Not inherently dangerous, but poorly tested, with potential for a lot of kernel vulnerabilities. |
| `pivot_root`        | Deny `pivot_root`, which should be a privileged operation. |
| `process_vm_readv`  | Restrict process inspection capabilities. Already blocked by dropping `CAP_SYS_PTRACE`. |
| `process_vm_writev` | Restrict process inspection capabilities. Already blocked by dropping `CAP_SYS_PTRACE`. |
| `ptrace`            | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping `CAP_SYS_PTRACE`. |
| `query_module`      | Deny manipulation and functions on kernel modules. |
| `quotactl`          | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by `CAP_SYS_ADMIN`. |
| `reboot`            | Don't let containers reboot the host. Also gated by `CAP_SYS_BOOT`. |
| `request_key`       | Prevent containers from using the kernel keyring, which is not namespaced. |
| `set_mempolicy`     | Syscall that modifies kernel memory and NUMA settings. Already gated by `CAP_SYS_NICE`. |
| `setns`             | Deny associating a thread with a namespace. Also gated by `CAP_SYS_ADMIN`. |
| `settimeofday`      | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
| `stime`             | Time/date is not namespaced. Also gated by `CAP_SYS_TIME`. |
| `swapon`            | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
| `swapoff`           | Deny start/stop swapping to file/device. Also gated by `CAP_SYS_ADMIN`. |
| `sysfs`             | Obsolete syscall. |
| `_sysctl`           | Obsolete, replaced by `/proc/sys`. |
| `umount`            | Should be a privileged operation. Also gated by `CAP_SYS_ADMIN`. |
| `umount2`           | Should be a privileged operation. |
| `unshare`           | Deny cloning new namespaces for processes. Also gated by `CAP_SYS_ADMIN`, with the exception of `unshare --user`. |
| `uselib`            | Older syscall related to shared libraries, unused for a long time. |
| `userfaultfd`       | Userspace page fault handling, largely needed for process migration. |
| `ustat`             | Obsolete syscall. |
| `vm86`              | In-kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |
| `vm86old`           | In-kernel x86 real mode virtual machine. Also gated by `CAP_SYS_ADMIN`. |

## Run without the default seccomp profile

You can pass `seccomp=unconfined` to `--security-opt` to run a container
without the default seccomp profile.

```bash
$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
    unshare --map-root-user --user sh -c whoami
```
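
One way to confirm from inside a container whether a seccomp filter is active is to read the `Seccomp:` field of `/proc/self/status`, which the kernel reports as `0` (disabled), `1` (strict), or `2` (filter). The sketch below parses that field. It assumes the kernel exposes the field, and the function name `seccomp_mode` is illustrative.

```python
import re

# Meaning of the "Seccomp:" field in /proc/<pid>/status.
SECCOMP_MODES = {0: "disabled", 1: "strict", 2: "filter"}


def seccomp_mode(status_text: str) -> str:
    """Extract the seccomp mode from the text of a /proc/<pid>/status file."""
    match = re.search(r"^Seccomp:\s*(\d+)", status_text, flags=re.MULTILINE)
    if match is None:
        # Field absent, for example on a kernel built without seccomp.
        return "unknown"
    return SECCOMP_MODES.get(int(match.group(1)), "unknown")


if __name__ == "__main__":
    try:
        with open("/proc/self/status") as f:
            print("seccomp mode:", seccomp_mode(f.read()))
    except OSError as err:
        print(f"could not read /proc/self/status: {err}")
```

Under these assumptions, a process started with `seccomp=unconfined` would be expected to report `disabled`, and one running under a profile to report `filter`.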