     1  ## Container Specification - v1
     2  
     3  This is the standard configuration for version 1 containers.  It includes
     4  namespaces, standard filesystem setup, a default Linux capability set, and
     5  information about resource reservations.  It also has information about any
     6  populated environment settings for the processes running inside a container.
     7  
Along with the configuration of how a container is created, the standard also
discusses actions that can be performed on a container to manage and inspect
information about the processes running inside.
    11  
    12  The v1 profile is meant to be able to accommodate the majority of applications
    13  with a strong security configuration.
    14  
    15  ### System Requirements and Compatibility
    16  
    17  Minimum requirements:
* Kernel version - 3.10 recommended; 2.6.2x minimum (with backported patches)
    19  * Mounted cgroups with each subsystem in its own hierarchy
    20  
    21  
    22  ### Namespaces
    23  
    24  |     Flag        | Enabled |
    25  | --------------- | ------- |
    26  | CLONE_NEWPID    |    1    |
    27  | CLONE_NEWUTS    |    1    |
    28  | CLONE_NEWIPC    |    1    |
    29  | CLONE_NEWNET    |    1    |
    30  | CLONE_NEWNS     |    1    |
    31  | CLONE_NEWUSER   |    1    |
    32  | CLONE_NEWCGROUP |    1    |
    33  
    34  Namespaces are created for the container via the `unshare` syscall.
    35  
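For illustration, a minimal sketch of requesting the full namespace set with a
single `unshare(2)` call (illustrative only; ordering concerns and further
setup steps are omitted):

```c
/* Sketch: request every namespace from the table above via unshare(2). */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int flags = CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWIPC | CLONE_NEWNET |
                CLONE_NEWNS | CLONE_NEWUSER | CLONE_NEWCGROUP;

    if (unshare(flags) != 0) {
        perror("unshare");
        exit(EXIT_FAILURE);
    }
    /* CLONE_NEWPID only takes effect for children: the next fork(2)
     * becomes PID 1 of the new PID namespace. */
    return 0;
}
```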
    36  
    37  ### Filesystem
    38  
    39  A root filesystem must be provided to a container for execution.  The container
uses this root filesystem (rootfs) to jail and spawn processes, with the
binaries and system libraries local to that directory.  Any binaries
to be executed must be contained within this rootfs.
    43  
Mounts that happen inside the container are automatically cleaned up when the
container exits: the mount namespace is destroyed and the kernel unmounts
everything that was set up within that namespace.
    47  
For a container to execute properly, certain filesystems must be mounted
within the rootfs; the runtime sets these up.
    50  
    51  |     Path    |  Type  |                  Flags                 |                 Data                     |
    52  | ----------- | ------ | -------------------------------------- | ---------------------------------------- |
    53  | /proc       | proc   | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
    54  | /dev        | tmpfs  | MS_NOEXEC,MS_STRICTATIME               | mode=755                                 |
    55  | /dev/shm    | tmpfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV           | mode=1777,size=65536k                    |
    56  | /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
    57  | /dev/pts    | devpts | MS_NOEXEC,MS_NOSUID                    | newinstance,ptmxmode=0666,mode=620,gid=5 |
    58  | /sys        | sysfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY |                                          |
    59  
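As an illustration, the first row of the table corresponds roughly to the
following `mount(2)` call (a sketch only; the target path is a placeholder and
error handling is minimal):

```c
/* Sketch: mount proc at <rootfs>/proc with the flags from the table above.
 * "rootfs_proc" is a placeholder such as "/path/to/rootfs/proc". */
#include <stdio.h>
#include <sys/mount.h>

int mount_proc(const char *rootfs_proc)
{
    if (mount("proc", rootfs_proc, "proc",
              MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL) != 0) {
        perror("mount proc");
        return -1;
    }
    return 0;
}
```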
    60  
After a container's filesystems are mounted within the newly created
mount namespace, `/dev` will need to be populated with a set of device nodes.
A rootfs is not expected to ship any device nodes in `/dev`; the runtime
creates the correct devices required for executing the container's process.
    66  
    67  |      Path    | Mode |   Access   |
    68  | ------------ | ---- | ---------- |
    69  | /dev/null    | 0666 |  rwm       |
    70  | /dev/zero    | 0666 |  rwm       |
    71  | /dev/full    | 0666 |  rwm       |
    72  | /dev/tty     | 0666 |  rwm       |
    73  | /dev/random  | 0666 |  rwm       |
    74  | /dev/urandom | 0666 |  rwm       |
    75  
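For illustration, a device node such as `/dev/null` (character device 1:3)
could be created with `mknod(2)` roughly as follows (a sketch; the path is a
placeholder and error handling is minimal):

```c
/* Sketch: create /dev/null inside the rootfs as character device 1:3 with
 * mode 0666, as listed in the table above. */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

int create_dev_null(const char *path) /* e.g. "<rootfs>/dev/null" */
{
    if (mknod(path, S_IFCHR | 0666, makedev(1, 3)) != 0) {
        perror("mknod");
        return -1;
    }
    /* mknod(2) applies the umask, so set the final mode explicitly. */
    if (chmod(path, 0666) != 0) {
        perror("chmod");
        return -1;
    }
    return 0;
}
```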
    76  
    77  **ptmx**
    78  `/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
    79  the container.
    80  
The use of a pseudo TTY is optional within a container, and the runtime should
support both cases (with and without one).  If a pseudo TTY is provided to the
container, `/dev/console` will need to be set up by bind mounting the console
into `/dev/` after it has been populated and mounted as tmpfs.
    85  
    86  |      Source     | Destination  | UID GID | Mode | Type |
    87  | --------------- | ------------ | ------- | ---- | ---- |
    88  | *pty host path* | /dev/console | 0 0     | 0600 | bind |
    89  
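For illustration, that bind mount could look roughly like the following sketch
(the pty slave path is a placeholder, and the ownership/permission handling
from the table is assumed to be done separately):

```c
/* Sketch: bind mount the host pty slave onto /dev/console inside the rootfs.
 * "pty_slave" (e.g. "/dev/pts/3") and "console_path" are placeholders; the
 * target file must already exist. */
#include <stdio.h>
#include <sys/mount.h>

int bind_console(const char *pty_slave, const char *console_path)
{
    if (mount(pty_slave, console_path, NULL, MS_BIND, NULL) != 0) {
        perror("bind /dev/console");
        return -1;
    }
    return 0;
}
```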
    90  
After `/dev/null` has been set up, we check whether any of the container's io
streams (STDIN, STDOUT, STDERR) point to `/dev/null` outside the container.
If so, we close them and `dup2` the `/dev/null` that is local to the
container's rootfs in their place.
    95  
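A sketch of that replacement for a single stdio descriptor (the descriptor
number and null path are illustrative):

```c
/* Sketch: point an stdio descriptor (0, 1 or 2) at the rootfs-local /dev/null.
 * dup2() closes "stdio_fd" and reuses it for the newly opened file. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int renull_fd(int stdio_fd, const char *rootfs_null)
{
    int null_fd = open(rootfs_null, O_RDWR);
    if (null_fd < 0) {
        perror("open /dev/null");
        return -1;
    }
    if (dup2(null_fd, stdio_fd) < 0) {
        perror("dup2");
        close(null_fd);
        return -1;
    }
    close(null_fd);
    return 0;
}
```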
    96  
After the container has `/proc` mounted, a few standard symlinks are set up
within `/dev/` for the io.
    99  
   100  |    Source       | Destination |
   101  | --------------- | ----------- |
   102  | /proc/self/fd   | /dev/fd     |
   103  | /proc/self/fd/0 | /dev/stdin  |
   104  | /proc/self/fd/1 | /dev/stdout |
   105  | /proc/self/fd/2 | /dev/stderr |
   106  
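These could be created with a simple `symlink(2)` loop along the following
lines (a sketch; pre-existing entries and error recovery are not handled):

```c
/* Sketch: create the /dev io symlinks listed in the table above. */
#include <stddef.h>
#include <stdio.h>
#include <unistd.h>

static const struct { const char *target, *linkpath; } io_links[] = {
    { "/proc/self/fd",   "/dev/fd"     },
    { "/proc/self/fd/0", "/dev/stdin"  },
    { "/proc/self/fd/1", "/dev/stdout" },
    { "/proc/self/fd/2", "/dev/stderr" },
};

int create_io_links(void)
{
    for (size_t i = 0; i < sizeof(io_links) / sizeof(io_links[0]); i++) {
        if (symlink(io_links[i].target, io_links[i].linkpath) != 0) {
            perror(io_links[i].linkpath);
            return -1;
        }
    }
    return 0;
}
```
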
   107  A `pivot_root` is used to change the root for the process, effectively
   108  jailing the process inside the rootfs.
   109  
```c
/* Sequence sketch (error handling and #includes omitted).  pivot_root(2) has
 * no glibc wrapper, so it is invoked via syscall(2). */
mkdir(put_old, 0777);                      /* put_old is a directory inside rootfs */
syscall(SYS_pivot_root, rootfs, put_old);
chdir("/");
umount2(put_old, MNT_DETACH);              /* lazily detach the old root */
rmdir(put_old);
```
   117  
For containers running with a rootfs inside `ramfs`, a `MS_MOVE` combined
with a `chroot` is required, as `pivot_root` is not supported in `ramfs`.
   120  
```c
/* Assumes the current working directory is already the rootfs. */
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");
```
   126  
   127  The `umask` is set back to `0022` after the filesystem setup has been completed.
   128  
   129  ### Resources
   130  
   131  Cgroups are used to handle resource allocation for containers.  This includes
   132  system resources like cpu, memory, and device access.
   133  
   134  | Subsystem  | Enabled |
   135  | ---------- | ------- |
   136  | devices    | 1       |
   137  | memory     | 1       |
   138  | cpu        | 1       |
   139  | cpuacct    | 1       |
   140  | cpuset     | 1       |
   141  | blkio      | 1       |
   142  | perf_event | 1       |
   143  | freezer    | 1       |
   144  | hugetlb    | 1       |
   145  | pids       | 1       |
   146  
   147  
All cgroup subsystems are joined so that statistics can be collected from
each of them.  The freezer subsystem does not expose any stats but is joined
so that containers can be paused and resumed.
   151  
The parent process of the container's init must place the init pid inside the
correct cgroups before the initialization begins.  This is done so that no
processes or threads escape the cgroups.  This synchronization is done via a
pipe (specified in the runtime section below) on which the container's init
process blocks until the parent has finished the setup.
   157  
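For cgroup v1, placing the init pid into a cgroup amounts to writing it into
the hierarchy's `cgroup.procs` (or `tasks`) file; a minimal sketch, with the
cgroup directory path as an assumption for illustration:

```c
/* Sketch: move a pid into a cgroup v1 hierarchy by writing it to cgroup.procs.
 * "cgroup_dir" is a placeholder such as "/sys/fs/cgroup/memory/<container_id>". */
#include <stdio.h>
#include <sys/types.h>

int join_cgroup(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", (int)pid);
    return fclose(f) == 0 ? 0 : -1;
}
```
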
   158  ### IntelRdt
   159  
Newer Intel Xeon platforms support Resource Director Technology (RDT).
Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA) are
two sub-features of RDT.
   163  
   164  Cache Allocation Technology (CAT) provides a way for the software to restrict
   165  cache allocation to a defined 'subset' of L3 cache which may be overlapping
   166  with other 'subsets'. The different subsets are identified by class of
   167  service (CLOS) and each CLOS has a capacity bitmask (CBM).
   168  
Memory Bandwidth Allocation (MBA) provides indirect and approximate throttling
of memory bandwidth to software. A user controls the resource by indicating a
percentage of maximum memory bandwidth, or a memory bandwidth limit in MBps if
the MBA Software Controller is enabled.
   173  
Intel RDT can be used to handle L3 cache and memory bandwidth resource
allocation for containers if the hardware and kernel support the Intel RDT CAT
and MBA features.

In Linux kernel 4.10 or newer, the interface is defined and exposed via the
"resource control" filesystem, which is a "cgroup-like" interface.

Compared with cgroups, it has a similar process management lifecycle and
interfaces in a container, but unlike the cgroups hierarchy, it has a
single-level filesystem layout.

The CAT and MBA features were introduced in the Linux 4.10 and 4.12 kernels
respectively, via the "resource control" filesystem.
   186  
   187  Intel RDT "resource control" filesystem hierarchy:
   188  ```
   189  mount -t resctrl resctrl /sys/fs/resctrl
   190  tree /sys/fs/resctrl
   191  /sys/fs/resctrl/
   192  |-- info
   193  |   |-- L3
   194  |   |   |-- cbm_mask
   195  |   |   |-- min_cbm_bits
   196  |   |   |-- num_closids
   197  |   |-- MB
   198  |       |-- bandwidth_gran
   199  |       |-- delay_linear
   200  |       |-- min_bandwidth
   201  |       |-- num_closids
   202  |-- ...
   203  |-- schemata
   204  |-- tasks
   205  |-- <container_id>
   206      |-- ...
   207      |-- schemata
   208      |-- tasks
   209  ```
   210  
For runc, we can make use of the `tasks` and `schemata` configuration for L3
cache and memory bandwidth resource constraints.
   213  
The file `tasks` has a list of tasks that belong to this group (e.g., the
"<container_id>" group). Tasks can be added to a group by writing the task ID
to the "tasks" file (which will automatically remove them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent.
   219  
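A sketch of that write, with the group path assumed from the hierarchy shown
above:

```c
/* Sketch: add a task to an Intel RDT group by writing its ID to the group's
 * "tasks" file.  "group_dir" is a placeholder such as
 * "/sys/fs/resctrl/<container_id>". */
#include <stdio.h>
#include <sys/types.h>

int rdt_add_task(const char *group_dir, pid_t tid)
{
    char path[4096];
    snprintf(path, sizeof(path), "%s/tasks", group_dir);

    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return -1;
    }
    fprintf(f, "%d\n", (int)tid);
    return fclose(f) == 0 ? 0 : -1;
}
```
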
   220  The file `schemata` has a list of all the resources available to this group.
   221  Each resource (L3 cache, memory bandwidth) has its own line and format.
   222  
   223  L3 cache schema:
It has allocation bitmasks/values for the L3 cache on each socket; each entry
contains an L3 cache id and a capacity bitmask (CBM).
   226  ```
   227  	Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
   228  ```
   229  For example, on a two-socket machine, the schema line could be "L3:0=ff;1=c0"
   230  which means L3 cache id 0's CBM is 0xff, and L3 cache id 1's CBM is 0xc0.
   231  
A valid L3 cache CBM is a *contiguous* set of bits, and the number of bits
that can be set is at most the maximum width. The maximum number of bits in
the CBM varies among supported Intel CPU models, and the kernel checks
validity on write. For example, the default value 0xfffff in the root group
indicates that the CBM is 20 bits wide, mapping to the entire L3 cache
capacity. Some valid CBM values to set in a group: 0xf, 0xf0, 0x3ff, 0x1f00,
etc.
   238  
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket; each entry
contains an L3 cache id and a memory bandwidth value.
   242  ```
   243  	Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
   244  ```
For example, on a two-socket machine, the schema line could be "MB:0=20;1=70",
which means a 20% memory bandwidth limit on socket 0 and a 70% limit on socket 1.
   246  
   247  The minimum bandwidth percentage value for each CPU model is predefined and
   248  can be looked up through "info/MB/min_bandwidth". The bandwidth granularity
   249  that is allocated is also dependent on the CPU model and can be looked up at
   250  "info/MB/bandwidth_gran". The available bandwidth control steps are:
   251  min_bw + N * bw_gran. Intermediate values are rounded to the next control
   252  step available on the hardware.
   253  
If the MBA Software Controller is enabled through the mount option "-o mba_MBps"
(`mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl`), we can specify memory
bandwidth in "MBps" (megabytes per second) instead of percentages. The kernel
then uses a software feedback mechanism, or "Software Controller", which reads
the actual bandwidth using MBM counters and adjusts the memory bandwidth
percentages to ensure:
"actual memory bandwidth < user specified memory bandwidth".
   261  
   262  For example, on a two-socket machine, the schema line could be
   263  "MB:0=5000;1=7000" which means 5000 MBps memory bandwidth limit on socket 0
   264  and 7000 MBps memory bandwidth limit on socket 1.
   265  
   266  For more information about Intel RDT kernel interface:
   267  https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt
   268  
An example for runc:

Consider a two-socket machine with two L3 caches where the default CBM is
0x7ff and the max CBM length is 11 bits, and minimum memory bandwidth of 10%
with a memory bandwidth granularity of 10%.

Tasks inside the container only have access to the "upper" 7/11 of L3 cache
on socket 0 and the "lower" 5/11 L3 cache on socket 1, and may use a
maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.

```
"linux": {
    "intelRdt": {
        "closID": "guaranteed_group",
        "l3CacheSchema": "L3:0=7f0;1=1f",
        "memBwSchema": "MB:0=20;1=70"
    }
}
```
   287  
   288  ### Security
   289  
The standard set of Linux capabilities that are set in a container
provides a good default for security and flexibility for applications.
   292  
   293  
   294  |     Capability       | Enabled |
   295  | -------------------- | ------- |
   296  | CAP_NET_RAW          | 1       |
   297  | CAP_NET_BIND_SERVICE | 1       |
   298  | CAP_AUDIT_READ       | 1       |
   299  | CAP_AUDIT_WRITE      | 1       |
   300  | CAP_DAC_OVERRIDE     | 1       |
   301  | CAP_SETFCAP          | 1       |
   302  | CAP_SETPCAP          | 1       |
   303  | CAP_SETGID           | 1       |
   304  | CAP_SETUID           | 1       |
   305  | CAP_MKNOD            | 1       |
   306  | CAP_CHOWN            | 1       |
   307  | CAP_FOWNER           | 1       |
   308  | CAP_FSETID           | 1       |
   309  | CAP_KILL             | 1       |
   310  | CAP_SYS_CHROOT       | 1       |
   311  | CAP_NET_BROADCAST    | 0       |
   312  | CAP_SYS_MODULE       | 0       |
   313  | CAP_SYS_RAWIO        | 0       |
   314  | CAP_SYS_PACCT        | 0       |
   315  | CAP_SYS_ADMIN        | 0       |
   316  | CAP_SYS_NICE         | 0       |
   317  | CAP_SYS_RESOURCE     | 0       |
   318  | CAP_SYS_TIME         | 0       |
   319  | CAP_SYS_TTY_CONFIG   | 0       |
   320  | CAP_AUDIT_CONTROL    | 0       |
   321  | CAP_MAC_OVERRIDE     | 0       |
   322  | CAP_MAC_ADMIN        | 0       |
   323  | CAP_NET_ADMIN        | 0       |
   324  | CAP_SYSLOG           | 0       |
   325  | CAP_DAC_READ_SEARCH  | 0       |
   326  | CAP_LINUX_IMMUTABLE  | 0       |
   327  | CAP_IPC_LOCK         | 0       |
   328  | CAP_IPC_OWNER        | 0       |
   329  | CAP_SYS_PTRACE       | 0       |
   330  | CAP_SYS_BOOT         | 0       |
   331  | CAP_LEASE            | 0       |
   332  | CAP_WAKE_ALARM       | 0       |
   333  | CAP_BLOCK_SUSPEND    | 0       |
   334  
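For illustration, reducing a process to a whitelist like the one above could
be done with libcap roughly as follows (a sketch under the assumption that
libcap is available; the capability list is abbreviated and this is not
presented as runc's implementation):

```c
/* Sketch: reduce the process capability sets to an explicit whitelist using
 * libcap (link with -lcap).  The list below is abbreviated. */
#include <stdio.h>
#include <sys/capability.h>

int apply_cap_whitelist(void)
{
    cap_value_t keep[] = {
        CAP_NET_RAW, CAP_NET_BIND_SERVICE, CAP_CHOWN,
        CAP_SETUID, CAP_SETGID, CAP_KILL, CAP_SYS_CHROOT,
    };
    int n = sizeof(keep) / sizeof(keep[0]);

    cap_t caps = cap_init();            /* starts with all flags cleared */
    if (!caps)
        return -1;
    cap_set_flag(caps, CAP_PERMITTED, n, keep, CAP_SET);
    cap_set_flag(caps, CAP_EFFECTIVE, n, keep, CAP_SET);
    cap_set_flag(caps, CAP_INHERITABLE, n, keep, CAP_SET);

    int ret = cap_set_proc(caps);       /* apply to the calling process */
    cap_free(caps);
    if (ret != 0)
        perror("cap_set_proc");
    return ret;
}
```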
   335  
   336  Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor)
   337  and [selinux](http://selinuxproject.org/page/Main_Page) can be used with
   338  the containers.  A container should support setting an apparmor profile or
   339  selinux process and mount labels if provided in the configuration.
   340  
   341  Standard apparmor profile:
   342  ```c
   343  #include <tunables/global>
   344  profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
   345    #include <abstractions/base>
   346    network,
   347    capability,
   348    file,
   349    umount,
   350  
   351    deny @{PROC}/sys/fs/** wklx,
   352    deny @{PROC}/sysrq-trigger rwklx,
   353    deny @{PROC}/mem rwklx,
   354    deny @{PROC}/kmem rwklx,
   355    deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
   356    deny @{PROC}/sys/kernel/*/** wklx,
   357  
   358    deny mount,
   359  
   360    deny /sys/[^f]*/** wklx,
   361    deny /sys/f[^s]*/** wklx,
   362    deny /sys/fs/[^c]*/** wklx,
   363    deny /sys/fs/c[^g]*/** wklx,
   364    deny /sys/fs/cg[^r]*/** wklx,
   365    deny /sys/firmware/efi/efivars/** rwklx,
   366    deny /sys/kernel/security/** rwklx,
   367  }
   368  ```
   369  
   370  *TODO: seccomp work is being done to find a good default config*
   371  
   372  ### Runtime and Init Process
   373  
   374  During container creation the parent process needs to talk to the container's init
   375  process and have a form of synchronization.  This is accomplished by creating
   376  a pipe that is passed to the container's init.  When the init process first spawns
   377  it will block on its side of the pipe until the parent closes its side.  This
   378  allows the parent to have time to set the new process inside a cgroup hierarchy
   379  and/or write any uid/gid mappings required for user namespaces.
   380  The pipe is passed to the init process via FD 3.
   381  
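A minimal sketch of the init-process side of that synchronization (the FD
number comes from the text above; what is actually exchanged over the pipe is
omitted here):

```c
/* Sketch: the container init blocks on the sync pipe (FD 3) until the parent
 * finishes cgroup and uid/gid mapping setup and signals or closes its end. */
#include <stdio.h>
#include <unistd.h>

#define SYNC_PIPE_FD 3

void wait_for_parent(void)
{
    char buf[1];

    /* Blocks until the parent writes to, or closes, its side of the pipe. */
    if (read(SYNC_PIPE_FD, buf, sizeof(buf)) < 0)
        perror("read sync pipe");
    close(SYNC_PIPE_FD);
}
```
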
The application consuming libcontainer should be compiled statically.
libcontainer does not define any init process, and the arguments provided are
used to `exec` the process inside the application.  There should be no
long-running init within the container spec.
   386  
If a pseudo TTY is provided to a container, the runtime will open the console
and `dup2` it as the container's STDIN, STDOUT, and STDERR, as well as
mounting the console as `/dev/console`.
   390  
An extra set of mounts is provided to a container and set up for use.  A
container's rootfs can contain non-portable files that can cause side effects
during execution of a process.  These files are usually created and populated
with the container-specific information by the runtime.
   395  
   396  **Extra runtime files:**
   397  * /etc/hosts
   398  * /etc/resolv.conf
   399  * /etc/hostname
   400  * /etc/localtime
   401  
   402  
   403  #### Defaults
   404  
There are a few defaults that can be overridden by users; if omitted, the
following apply to processes within a container.
   407  
   408  |       Type          |             Value              |
   409  | ------------------- | ------------------------------ |
   410  | Parent Death Signal | SIGKILL                        |
   411  | UID                 | 0                              |
   412  | GID                 | 0                              |
   413  | GROUPS              | 0, NULL                        |
   414  | CWD                 | "/"                            |
   415  | $HOME               | Current user's home dir or "/" |
   416  | Readonly rootfs     | false                          |
   417  | Pseudo TTY          | false                          |
   418  
   419  
   420  ## Actions
   421  
   422  After a container is created there is a standard set of actions that can
   423  be done to the container.  These actions are part of the public API for
   424  a container.
   425  
   426  |     Action     |                         Description                                |
   427  | -------------- | ------------------------------------------------------------------ |
   428  | Get processes  | Return all the pids for processes running inside a container       |
   429  | Get Stats      | Return resource statistics for the container as a whole            |
   430  | Wait           | Waits on the container's init process ( pid 1 )                    |
   431  | Wait Process   | Wait on any of the container's processes returning the exit status |
   432  | Destroy        | Kill the container's init process and remove any filesystem state  |
   433  | Signal         | Send a signal to the container's init process                      |
   434  | Signal Process | Send a signal to any of the container's processes                  |
   435  | Pause          | Pause all processes inside the container                           |
   436  | Resume         | Resume all processes inside the container if paused                |
   437  | Exec           | Execute a new process inside of the container  ( requires setns )  |
   438  | Set            | Setup configs of the container after it's created                  |
   439  
   440  ### Execute a new process inside of a running container
   441  
A user can execute a new process inside of a running container. Any binaries to
be executed must be accessible within the container's rootfs.
   444  
The started process will run inside the container's rootfs. Any changes
made by the process to the container's filesystem will persist after the
process has finished executing.
   448  
   449  The started process will join all the container's existing namespaces. When the
   450  container is paused, the process will also be paused and will resume when
   451  the container is unpaused.  The started process will only run when the container's
   452  primary process (PID 1) is running, and will not be restarted when the container
   453  is restarted.
   454  
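Joining an existing namespace is done with `setns(2)` on a namespace file
descriptor; a minimal sketch for a single namespace (the `/proc` path is
illustrative):

```c
/* Sketch: join one of the container's namespaces by opening the init
 * process's namespace file and calling setns(2).  Repeat for each namespace
 * (net, ipc, uts, pid, mnt, user, cgroup) before exec'ing the new process. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int join_namespace(const char *ns_path) /* e.g. "/proc/<init_pid>/ns/net" */
{
    int fd = open(ns_path, O_RDONLY);
    if (fd < 0) {
        perror(ns_path);
        return -1;
    }
    if (setns(fd, 0) != 0) {    /* 0 = do not verify the namespace type */
        perror("setns");
        close(fd);
        return -1;
    }
    close(fd);
    return 0;
}
```
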
   455  #### Planned additions
   456  
   457  The started process will have its own cgroups nested inside the container's
   458  cgroups. This is used for process tracking and optionally resource allocation
handling for the new process. The freezer cgroup is required; the rest of the cgroups
   460  are optional. The process executor must place its pid inside the correct
   461  cgroups before starting the process. This is done so that no child processes or
   462  threads can escape the cgroups.
   463  
   464  When the process is stopped, the process executor will try (in a best-effort way)
   465  to stop all its children and remove the sub-cgroups.