## Container Specification - v1

This is the standard configuration for version 1 containers.  It includes
namespaces, standard filesystem setup, a default Linux capability set, and
information about resource reservations.  It also has information about any
populated environment settings for the processes running inside a container.

Along with the configuration of how a container is created, the standard also
discusses actions that can be performed on a container to manage and inspect
information about the processes running inside.

The v1 profile is meant to accommodate the majority of applications
with a strong security configuration.

### System Requirements and Compatibility

Minimum requirements:
* Kernel version - 3.10 recommended, 2.6.2x minimum (with backported patches)
* Mounted cgroups with each subsystem in its own hierarchy

### Namespaces

|     Flag      | Enabled |
| ------------  | ------- |
| CLONE_NEWPID  |    1    |
| CLONE_NEWUTS  |    1    |
| CLONE_NEWIPC  |    1    |
| CLONE_NEWNET  |    1    |
| CLONE_NEWNS   |    1    |
| CLONE_NEWUSER |    1    |

Namespaces are created for the container via the `clone` syscall.

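As a rough illustration of the table above, the enabled flags can be OR-ed into one mask and handed to `clone(2)`. The helper names `container_clone_flags` and `spawn_in_namespaces` are hypothetical, and the caller is assumed to supply the top of a valid child stack. This is a sketch under those assumptions, not libcontainer's own code.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <signal.h>

/* OR together the six namespace flags the table marks as enabled. */
int container_clone_flags(void)
{
    return CLONE_NEWPID | CLONE_NEWUTS | CLONE_NEWIPC |
           CLONE_NEWNET | CLONE_NEWNS  | CLONE_NEWUSER;
}

/*
 * Hypothetical helper: start fn(arg) in fresh namespaces.
 * stack_top must point at the top of a caller-allocated stack.
 * SIGCHLD lets the parent wait for the child in the usual way.
 * Returns the child's pid, or -1 with errno set.
 */
int spawn_in_namespaces(int (*fn)(void *), void *arg, void *stack_top)
{
    return clone(fn, stack_top, container_clone_flags() | SIGCHLD, arg);
}
```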
### Filesystem

A root filesystem must be provided to a container for execution.  The container
will use this root filesystem (rootfs) to jail and spawn processes inside where
the binaries and system libraries are local to that directory.  Any binaries
to be executed must be contained within this rootfs.

Mounts that happen inside the container are automatically cleaned up when the
container exits, as the mount namespace is destroyed and the kernel will
unmount all the mounts that were set up within that namespace.

For a container to execute properly, certain filesystems must be mounted
within the rootfs; the runtime will set these up.

|     Path    |  Type  |                  Flags                 |                 Data                     |
| ----------- | ------ | -------------------------------------- | ---------------------------------------- |
| /proc       | proc   | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev        | tmpfs  | MS_NOEXEC,MS_STRICTATIME               | mode=755                                 |
| /dev/shm    | tmpfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV           | mode=1777,size=65536k                    |
| /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev/pts    | devpts | MS_NOEXEC,MS_NOSUID                    | newinstance,ptmxmode=0666,mode=620,gid=5 |
| /sys        | sysfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY |                                          |

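One way to read the mount table is as data driving a loop of `mount(2)` calls. The sketch below assumes that approach. The names `default_mount`, `default_mounts` and `mount_defaults` are hypothetical. Creating missing mount points with `mkdir` is also an assumption, and the caller is assumed to have the privileges `mount(2)` requires.

```c
#include <errno.h>
#include <stdio.h>
#include <sys/mount.h>
#include <sys/stat.h>

struct default_mount {
    const char *path;     /* mount point, relative to the rootfs */
    const char *type;     /* filesystem type */
    unsigned long flags;  /* MS_* flags from the table */
    const char *data;     /* NULL where the Data column is empty */
};

static const struct default_mount default_mounts[] = {
    { "/proc",       "proc",   MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL },
    { "/dev",        "tmpfs",  MS_NOEXEC | MS_STRICTATIME, "mode=755" },
    { "/dev/shm",    "tmpfs",  MS_NOEXEC | MS_NOSUID | MS_NODEV,
      "mode=1777,size=65536k" },
    { "/dev/mqueue", "mqueue", MS_NOEXEC | MS_NOSUID | MS_NODEV, NULL },
    { "/dev/pts",    "devpts", MS_NOEXEC | MS_NOSUID,
      "newinstance,ptmxmode=0666,mode=620,gid=5" },
    { "/sys",        "sysfs",  MS_NOEXEC | MS_NOSUID | MS_NODEV | MS_RDONLY,
      NULL },
};

#define N_DEFAULT_MOUNTS (sizeof(default_mounts) / sizeof(default_mounts[0]))

/* Mount every table entry beneath rootfs, in order. Returns 0 or -1. */
int mount_defaults(const char *rootfs)
{
    char target[4096];

    for (size_t i = 0; i < N_DEFAULT_MOUNTS; i++) {
        const struct default_mount *m = &default_mounts[i];
        int n = snprintf(target, sizeof(target), "%s%s", rootfs, m->path);

        if (n < 0 || (size_t)n >= sizeof(target))
            return -1;
        /* /dev/shm and friends do not exist yet inside the fresh tmpfs. */
        if (mkdir(target, 0755) < 0 && errno != EEXIST)
            return -1;
        if (mount(m->type, target, m->type, m->flags, m->data) < 0)
            return -1;
    }
    return 0;
}
```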
After a container's filesystems are mounted within the newly created
mount namespace, `/dev` will need to be populated with a set of device nodes.
A rootfs is not expected to provide any device nodes for `/dev`, as the
container runtime will set up the devices that are required for executing a
container's process.

|      Path    | Mode |   Access   |
| ------------ | ---- | ---------- |
| /dev/null    | 0666 |  rwm       |
| /dev/zero    | 0666 |  rwm       |
| /dev/full    | 0666 |  rwm       |
| /dev/tty     | 0666 |  rwm       |
| /dev/random  | 0666 |  rwm       |
| /dev/urandom | 0666 |  rwm       |

**ptmx**
`/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
the container.

The use of a pseudo TTY is optional within a container, and the runtime should
support running both with and without one.
If a pseudo TTY is provided to the container, `/dev/console` will need to be
set up by bind mounting the console in `/dev/` after it has been populated and
mounted in tmpfs.

|      Source     | Destination  | UID GID | Mode | Type |
| --------------- | ------------ | ------- | ---- | ---- |
| *pty host path* | /dev/console | 0 0     | 0600 | bind |

After `/dev/null` has been set up, we check for any external links between
the container's I/O streams: STDIN, STDOUT, and STDERR.  If any of them points
to `/dev/null` outside the container, we close it and `dup2` the `/dev/null`
that is local to the container's rootfs.

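A minimal sketch of that check follows. It assumes a descriptor counts as `/dev/null` when it is a character device with the conventional major 1 and minor 3. The function names `fd_is_dev_null` and `reopen_dev_null` are hypothetical, and `dup2` is relied on to close the old descriptor implicitly.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/sysmacros.h>
#include <unistd.h>

/* 1 if fd refers to the null character device, 0 if not, -1 on error. */
int fd_is_dev_null(int fd)
{
    struct stat st;

    if (fstat(fd, &st) < 0)
        return -1;
    return S_ISCHR(st.st_mode) &&
           major(st.st_rdev) == 1 && minor(st.st_rdev) == 3;
}

/*
 * Re-point STDIN, STDOUT and STDERR at the /dev/null visible from the
 * current root whenever they currently refer to a null device.
 * Returns 0 on success, -1 on error.
 */
int reopen_dev_null(void)
{
    int null_fd = open("/dev/null", O_RDWR);

    if (null_fd < 0)
        return -1;
    for (int fd = 0; fd < 3; fd++) {
        if (fd_is_dev_null(fd) == 1 && dup2(null_fd, fd) < 0) {
            close(null_fd);
            return -1;
        }
    }
    if (null_fd > 2)
        close(null_fd);
    return 0;
}
```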
After the container has `/proc` mounted, a few standard symlinks are set up
within `/dev/` for I/O.

|    Source       | Destination |
| --------------- | ----------- |
| /proc/self/fd   | /dev/fd     |
| /proc/self/fd/0 | /dev/stdin  |
| /proc/self/fd/1 | /dev/stdout |
| /proc/self/fd/2 | /dev/stderr |

A `pivot_root` is used to change the root for the process, effectively
jailing the process inside the rootfs.

```c
/* put_old must be a directory at or underneath rootfs */
put_old = mkdir(...);
pivot_root(rootfs, put_old);
chdir("/");
umount2(put_old, MNT_DETACH);  /* lazily detach the old root */
rmdir(put_old);
```

For containers running with a rootfs inside `ramfs`, an `MS_MOVE` mount combined
with a `chroot` is required, as `pivot_root` is not supported in `ramfs`.

```c
/* assumes the current working directory is already rootfs */
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");
```

The `umask` is set back to `0022` after the filesystem setup has been completed.

### Resources

Cgroups are used to handle resource allocation for containers.  This includes
system resources like cpu, memory, and device access.

| Subsystem  | Enabled |
| ---------- | ------- |
| devices    | 1       |
| memory     | 1       |
| cpu        | 1       |
| cpuacct    | 1       |
| cpuset     | 1       |
| blkio      | 1       |
| perf_event | 1       |
| freezer    | 1       |
| hugetlb    | 1       |
| pids       | 1       |

All cgroup subsystems are joined so that statistics can be collected from
each of the subsystems.  Freezer does not expose any stats but is joined
so that containers can be paused and resumed.

The parent process of the container's init must place the init pid inside
the correct cgroups before the initialization begins.  This is done so
that no processes or threads escape the cgroups.  This sync is
done via a pipe (specified in the runtime section below) on which the
container's init process blocks while waiting for the parent to finish setup.

### Security

The standard set of Linux capabilities that are set in a container
provide a good default for security and flexibility for the applications.

|     Capability       | Enabled |
| -------------------- | ------- |
| CAP_NET_RAW          | 1       |
| CAP_NET_BIND_SERVICE | 1       |
| CAP_AUDIT_READ       | 1       |
| CAP_AUDIT_WRITE      | 1       |
| CAP_DAC_OVERRIDE     | 1       |
| CAP_SETFCAP          | 1       |
| CAP_SETPCAP          | 1       |
| CAP_SETGID           | 1       |
| CAP_SETUID           | 1       |
| CAP_MKNOD            | 1       |
| CAP_CHOWN            | 1       |
| CAP_FOWNER           | 1       |
| CAP_FSETID           | 1       |
| CAP_KILL             | 1       |
| CAP_SYS_CHROOT       | 1       |
| CAP_NET_BROADCAST    | 0       |
| CAP_SYS_MODULE       | 0       |
| CAP_SYS_RAWIO        | 0       |
| CAP_SYS_PACCT        | 0       |
| CAP_SYS_ADMIN        | 0       |
| CAP_SYS_NICE         | 0       |
| CAP_SYS_RESOURCE     | 0       |
| CAP_SYS_TIME         | 0       |
| CAP_SYS_TTY_CONFIG   | 0       |
| CAP_AUDIT_CONTROL    | 0       |
| CAP_MAC_OVERRIDE     | 0       |
| CAP_MAC_ADMIN        | 0       |
| CAP_NET_ADMIN        | 0       |
| CAP_SYSLOG           | 0       |
| CAP_DAC_READ_SEARCH  | 0       |
| CAP_LINUX_IMMUTABLE  | 0       |
| CAP_IPC_LOCK         | 0       |
| CAP_IPC_OWNER        | 0       |
| CAP_SYS_PTRACE       | 0       |
| CAP_SYS_BOOT         | 0       |
| CAP_LEASE            | 0       |
| CAP_WAKE_ALARM       | 0       |
| CAP_BLOCK_SUSPEND    | 0       |

Additional security layers like [AppArmor](https://wiki.ubuntu.com/AppArmor)
and [SELinux](http://selinuxproject.org/page/Main_Page) can be used with
the containers.  A container should support setting an AppArmor profile or
SELinux process and mount labels if provided in the configuration.

Standard AppArmor profile:
```c
#include <tunables/global>
profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network,
  capability,
  file,
  umount,

  deny @{PROC}/sys/fs/** wklx,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/mem rwklx,
  deny @{PROC}/kmem rwklx,
  deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
  deny @{PROC}/sys/kernel/*/** wklx,

  deny mount,

  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  deny /sys/fs/[^c]*/** wklx,
  deny /sys/fs/c[^g]*/** wklx,
  deny /sys/fs/cg[^r]*/** wklx,
  deny /sys/firmware/efi/efivars/** rwklx,
  deny /sys/kernel/security/** rwklx,
}
```

*TODO: seccomp work is being done to find a good default config*

### Runtime and Init Process

During container creation the parent process needs to talk to the container's init
process and have a form of synchronization.  This is accomplished by creating
a pipe that is passed to the container's init.  When the init process first spawns
it will block on its side of the pipe until the parent closes its side.  This
gives the parent time to place the new process inside a cgroup hierarchy
and/or write any uid/gid mappings required for user namespaces.
The pipe is passed to the init process via FD 3.

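The handshake can be sketched with `pipe(2)` and `fork(2)`. The child moves the read end to FD 3, closes its inherited copy of the write end (otherwise it would never see end-of-file), and blocks in `read` until the parent closes its side. The names `spawn_with_sync_pipe`, `SYNC_FD` and the `setup` callback are hypothetical. The child here simply exits once released, where a real init would carry on with container setup.

```c
#include <errno.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SYNC_FD 3  /* the spec passes the pipe to init on FD 3 */

/*
 * Fork a child that blocks on the sync pipe, run setup(child_pid) in the
 * parent (for example cgroup placement), then release the child by closing
 * the write end. Returns the child's exit code (0 on a clean handshake),
 * or -1 on error.
 */
int spawn_with_sync_pipe(void (*setup)(pid_t child))
{
    int fds[2];

    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        char c;
        ssize_t n;

        close(fds[1]);  /* drop the child's copy of the write end */
        if (fds[0] != SYNC_FD) {
            if (dup2(fds[0], SYNC_FD) < 0)
                _exit(2);
            close(fds[0]);
        }
        /* Block until the parent closes its side (read returns 0). */
        while ((n = read(SYNC_FD, &c, 1)) > 0 || (n < 0 && errno == EINTR))
            ;
        _exit(n == 0 ? 0 : 1);
    }

    close(fds[0]);
    if (setup)
        setup(pid);   /* parent-side work happens while the child waits */
    close(fds[1]);    /* signals the child to continue */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```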
The application consuming libcontainer should be compiled statically.  libcontainer
does not define any init process and the arguments provided are used to `exec` the
process inside the application.  There should be no long running init within the
container spec.

If a pseudo TTY is provided to a container it will open and `dup2` the console
as the container's STDIN, STDOUT, STDERR as well as mounting the console
as `/dev/console`.

An extra set of mounts is provided to a container and set up for use.  A container's
rootfs can contain some non-portable files that can cause side effects during
execution of a process.  These files are usually created and populated with
container-specific information by the runtime.

**Extra runtime files:**
* /etc/hosts
* /etc/resolv.conf
* /etc/hostname
* /etc/localtime

#### Defaults

There are a few defaults that can be overridden by users; when omitted,
these apply to processes within a container.

|       Type          |             Value              |
| ------------------- | ------------------------------ |
| Parent Death Signal | SIGKILL                        |
| UID                 | 0                              |
| GID                 | 0                              |
| GROUPS              | 0, NULL                        |
| CWD                 | "/"                            |
| $HOME               | Current user's home dir or "/" |
| Readonly rootfs     | false                          |
| Pseudo TTY          | false                          |

## Actions

After a container is created there is a standard set of actions that can
be done to the container.  These actions are part of the public API for
a container.

|     Action     |                         Description                                 |
| -------------- | ------------------------------------------------------------------- |
| Get Processes  | Return all the pids for processes running inside a container        |
| Get Stats      | Return resource statistics for the container as a whole             |
| Wait           | Wait on the container's init process (pid 1)                        |
| Wait Process   | Wait on any of the container's processes, returning the exit status |
| Destroy        | Kill the container's init process and remove any filesystem state   |
| Signal         | Send a signal to the container's init process                       |
| Signal Process | Send a signal to any of the container's processes                   |
| Pause          | Pause all processes inside the container                            |
| Resume         | Resume all processes inside the container if paused                 |
| Exec           | Execute a new process inside of the container (requires setns)      |
| Set            | Set up configs of the container after it's created                  |

### Execute a new process inside of a running container

A user can execute a new process inside of a running container. Any binaries to be
executed must be accessible within the container's rootfs.

The started process will run inside the container's rootfs. Any changes
made by the process to the container's filesystem will persist after the
process has finished executing.

The started process will join all the container's existing namespaces. When the
container is paused, the process will also be paused and will resume when
the container is unpaused.  The started process will only run when the container's
primary process (PID 1) is running, and will not be restarted when the container
is restarted.

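Joining an existing namespace is typically done by opening the matching entry under `/proc/<pid>/ns/` and passing it to `setns(2)`. The sketch below assumes that mechanism and that the caller has the privileges `setns` requires. The names `ns_path` and `join_namespace` are hypothetical, and `pid` stands for the container's init pid as seen from the caller.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/* Format "/proc/<pid>/ns/<ns>" into buf. Returns 0, or -1 if it won't fit. */
int ns_path(pid_t pid, const char *ns, char *buf, size_t len)
{
    int n = snprintf(buf, len, "/proc/%d/ns/%s", (int)pid, ns);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/*
 * Join one namespace ("net", "ipc", "uts", "pid", "mnt" or "user")
 * of the process pid. Returns 0 on success, -1 on error.
 */
int join_namespace(pid_t pid, const char *ns)
{
    char path[64];

    if (ns_path(pid, ns, path, sizeof(path)) < 0)
        return -1;

    int fd = open(path, O_RDONLY | O_CLOEXEC);
    if (fd < 0)
        return -1;

    int rc = setns(fd, 0);  /* 0 accepts whatever namespace type fd refers to */
    close(fd);
    return rc;
}
```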
#### Planned additions

The started process will have its own cgroups nested inside the container's
cgroups. This is used for process tracking and optionally resource allocation
handling for the new process. The freezer cgroup is required; the rest of the
cgroups are optional. The process executor must place its pid inside the correct
cgroups before starting the process. This is done so that no child processes or
threads can escape the cgroups.

When the process is stopped, the process executor will try (in a best-effort way)
to stop all its children and remove the sub-cgroups.