## Container Specification - v1

This is the standard configuration for version 1 containers. It includes
namespaces, standard filesystem setup, a default Linux capability set, and
information about resource reservations. It also has information about any
populated environment settings for the processes running inside a container.

Along with the configuration of how a container is created, the standard also
discusses actions that can be performed on a container to manage and inspect
information about the processes running inside.

The v1 profile is meant to accommodate the majority of applications
with a strong security configuration.

### System Requirements and Compatibility

Minimum requirements:
* Kernel version: 3.10 recommended, 2.6.2x minimum (with backported patches)
* Mounted cgroups with each subsystem in its own hierarchy


### Namespaces

| Flag          | Enabled |
| ------------- | ------- |
| CLONE_NEWPID  | 1       |
| CLONE_NEWUTS  | 1       |
| CLONE_NEWIPC  | 1       |
| CLONE_NEWNET  | 1       |
| CLONE_NEWNS   | 1       |
| CLONE_NEWUSER | 1       |

Namespaces are created for the container via the `clone` syscall.


### Filesystem

A root filesystem must be provided to a container for execution. The container
will use this root filesystem (rootfs) to jail and spawn processes inside where
the binaries and system libraries are local to that directory. Any binaries
to be executed must be contained within this rootfs.

Mounts that happen inside the container are automatically cleaned up when the
container exits, as the mount namespace is destroyed and the kernel will
unmount all the mounts that were set up within that namespace.
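
The mount namespace relied on here is one of the flags in the Namespaces table. As a minimal sketch, the flags from that table could be combined into a single mask for `clone`. The helper name `v1_namespace_flags` is hypothetical, and the constants are taken from the kernel header `<linux/sched.h>`; a program that actually calls `clone(2)` would more typically include `<sched.h>` with `_GNU_SOURCE` defined and would need sufficient privileges.

```c
#include <linux/sched.h> /* CLONE_NEW* constants from the kernel UAPI headers */

/*
 * Hypothetical helper (not part of libcontainer): OR together the
 * namespace flags that the v1 spec marks as enabled.
 */
unsigned long v1_namespace_flags(void)
{
    return CLONE_NEWPID  | /* new PID namespace */
           CLONE_NEWUTS  | /* new hostname/domainname namespace */
           CLONE_NEWIPC  | /* new System V IPC / POSIX mqueue namespace */
           CLONE_NEWNET  | /* new network namespace */
           CLONE_NEWNS   | /* new mount namespace */
           CLONE_NEWUSER;  /* new user namespace */
}
```

The resulting mask would be passed in the `flags` argument of `clone`, usually together with `SIGCHLD` so the parent can wait on the child.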

For a container to execute properly there are certain filesystems that
are required to be mounted within the rootfs; the runtime will set these up.

| Path        | Type   | Flags                                  | Data                                     |
| ----------- | ------ | -------------------------------------- | ---------------------------------------- |
| /proc       | proc   | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev        | tmpfs  | MS_NOEXEC,MS_STRICTATIME               | mode=755                                 |
| /dev/shm    | tmpfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV           | mode=1777,size=65536k                    |
| /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev/pts    | devpts | MS_NOEXEC,MS_NOSUID                    | newinstance,ptmxmode=0666,mode=620,gid=5 |
| /sys        | sysfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY |                                          |


After a container's filesystems are mounted within the newly created
mount namespace, `/dev` will need to be populated with a set of device nodes.
A rootfs is not expected to have any device nodes specified
for `/dev`, as the container will set up the correct devices
that are required for executing a container's process.

| Path         | Mode | Access |
| ------------ | ---- | ------ |
| /dev/null    | 0666 | rwm    |
| /dev/zero    | 0666 | rwm    |
| /dev/full    | 0666 | rwm    |
| /dev/tty     | 0666 | rwm    |
| /dev/random  | 0666 | rwm    |
| /dev/urandom | 0666 | rwm    |


**ptmx**
`/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
the container.

The use of a pseudo TTY is optional within a container, and a container should
support running both with and without one.
If a pseudo TTY is provided to the container, `/dev/console` will need to be
set up by binding the console in `/dev/` after it has been populated and mounted
in tmpfs.

| Source          | Destination  | UID GID | Mode | Type |
| --------------- | ------------ | ------- | ---- | ---- |
| *pty host path* | /dev/console | 0 0     | 0600 | bind |


After `/dev/null` has been set up we check for any external links between
the container's I/O streams: STDIN, STDOUT, and STDERR. If any of them points
to `/dev/null` outside the container, we close it and `dup2` the `/dev/null`
that is local to the container's rootfs.


After the container has `/proc` mounted, a few standard symlinks are set up
within `/dev/` for I/O.

| Source          | Destination |
| --------------- | ----------- |
| /proc/self/fd   | /dev/fd     |
| /proc/self/fd/0 | /dev/stdin  |
| /proc/self/fd/1 | /dev/stdout |
| /proc/self/fd/2 | /dev/stderr |

A `pivot_root` is used to change the root for the process, effectively
jailing the process inside the rootfs.

```c
put_old = mkdir(...);
pivot_root(rootfs, put_old);  /* rootfs becomes "/", the old root moves to put_old */
chdir("/");
umount2(put_old, MNT_DETACH); /* lazily detach the old root */
rmdir(put_old);
```

For containers running with a rootfs inside `ramfs`, an `MS_MOVE` combined
with a `chroot` is required, as `pivot_root` is not supported in `ramfs`.

```c
mount(rootfs, "/", NULL, MS_MOVE, NULL); /* move the rootfs mount onto "/" */
chroot(".");
chdir("/");
```

The `umask` is set back to `0022` after the filesystem setup has been completed.

### Resources

Cgroups are used to handle resource allocation for containers. This includes
system resources like cpu, memory, and device access.

| Subsystem  | Enabled |
| ---------- | ------- |
| devices    | 1       |
| memory     | 1       |
| cpu        | 1       |
| cpuacct    | 1       |
| cpuset     | 1       |
| blkio      | 1       |
| perf_event | 1       |
| freezer    | 1       |
| hugetlb    | 1       |
| pids       | 1       |


All cgroup subsystems are joined so that statistics can be collected from
each of the subsystems.
Freezer does not expose any stats but is joined
so that containers can be paused and resumed.

The parent process of the container's init must place the init pid inside
the correct cgroups before the initialization begins. This is done so
that no processes or threads escape the cgroups. This synchronization is
done via a pipe (specified in the runtime section below) on which the container's
init process blocks while waiting for the parent to finish setup.

### Security

The standard set of Linux capabilities that is set in a container
provides a good default for security and flexibility for the applications.


| Capability           | Enabled |
| -------------------- | ------- |
| CAP_NET_RAW          | 1       |
| CAP_NET_BIND_SERVICE | 1       |
| CAP_AUDIT_READ       | 1       |
| CAP_AUDIT_WRITE      | 1       |
| CAP_DAC_OVERRIDE     | 1       |
| CAP_SETFCAP          | 1       |
| CAP_SETPCAP          | 1       |
| CAP_SETGID           | 1       |
| CAP_SETUID           | 1       |
| CAP_MKNOD            | 1       |
| CAP_CHOWN            | 1       |
| CAP_FOWNER           | 1       |
| CAP_FSETID           | 1       |
| CAP_KILL             | 1       |
| CAP_SYS_CHROOT       | 1       |
| CAP_NET_BROADCAST    | 0       |
| CAP_SYS_MODULE       | 0       |
| CAP_SYS_RAWIO        | 0       |
| CAP_SYS_PACCT        | 0       |
| CAP_SYS_ADMIN        | 0       |
| CAP_SYS_NICE         | 0       |
| CAP_SYS_RESOURCE     | 0       |
| CAP_SYS_TIME         | 0       |
| CAP_SYS_TTY_CONFIG   | 0       |
| CAP_AUDIT_CONTROL    | 0       |
| CAP_MAC_OVERRIDE     | 0       |
| CAP_MAC_ADMIN        | 0       |
| CAP_NET_ADMIN        | 0       |
| CAP_SYSLOG           | 0       |
| CAP_DAC_READ_SEARCH  | 0       |
| CAP_LINUX_IMMUTABLE  | 0       |
| CAP_IPC_LOCK         | 0       |
| CAP_IPC_OWNER        | 0       |
| CAP_SYS_PTRACE       | 0       |
| CAP_SYS_BOOT         | 0       |
| CAP_LEASE            | 0       |
| CAP_WAKE_ALARM       | 0       |
| CAP_BLOCK_SUSPEND    | 0       |


Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor)
and [selinux](http://selinuxproject.org/page/Main_Page) can be used with
the containers.
A container should support setting an apparmor profile or
selinux process and mount labels if provided in the configuration.

Standard apparmor profile:
```c
#include <tunables/global>
profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network,
  capability,
  file,
  umount,

  deny @{PROC}/sys/fs/** wklx,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/mem rwklx,
  deny @{PROC}/kmem rwklx,
  deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
  deny @{PROC}/sys/kernel/*/** wklx,

  deny mount,

  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  deny /sys/fs/[^c]*/** wklx,
  deny /sys/fs/c[^g]*/** wklx,
  deny /sys/fs/cg[^r]*/** wklx,
  deny /sys/firmware/efi/efivars/** rwklx,
  deny /sys/kernel/security/** rwklx,
}
```

*TODO: seccomp work is being done to find a good default config*

### Runtime and Init Process

During container creation the parent process needs to talk to the container's init
process and have a form of synchronization. This is accomplished by creating
a pipe that is passed to the container's init. When the init process first spawns
it will block on its side of the pipe until the parent closes its side. This
gives the parent time to place the new process inside a cgroup hierarchy
and/or write any uid/gid mappings required for user namespaces.
The pipe is passed to the init process via FD 3.

The application consuming libcontainer should be compiled statically. libcontainer
does not define any init process, and the arguments provided are used to `exec` the
process inside the application. There should be no long-running init within the
container spec.

If a pseudo TTY is provided to a container, the container will open and `dup2` the console
as its STDIN, STDOUT, and STDERR, as well as mounting the console
as `/dev/console`.
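
The pipe-based synchronization described in this section can be sketched as follows. This is a minimal illustration rather than libcontainer's actual code: the function name `run_sync_demo` and the `SYNC_FD` macro are hypothetical, `fork` stands in for `clone`, and the cgroup and uid/gid mapping work is only marked by a comment.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define SYNC_FD 3 /* the spec passes the pipe to init on FD 3 */

/*
 * Hypothetical demo: returns the child's exit status (0 when the child
 * observed the parent closing its end of the pipe), or -1 on error.
 */
int run_sync_demo(void)
{
    int fds[2];
    if (pipe(fds) < 0)
        return -1;

    pid_t pid = fork();
    if (pid < 0) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        /* "init" side: drop the inherited write end so EOF can be seen. */
        close(fds[1]);
        if (fds[0] != SYNC_FD) {
            if (dup2(fds[0], SYNC_FD) < 0)
                _exit(1);
            close(fds[0]);
        }
        char buf;
        ssize_t n;
        /* Block until the parent closes its end; read() then returns 0. */
        do {
            n = read(SYNC_FD, &buf, 1);
        } while (n > 0);
        /* A real init would now exec the user's process. */
        _exit(n == 0 ? 0 : 1);
    }

    /* Parent side: only the write end is needed. */
    close(fds[0]);
    /* A real runtime would join cgroups and write uid/gid mappings here. */
    close(fds[1]); /* closing the write end releases the child */

    int status;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Closing the write end, rather than writing a byte, means the child is also released if the parent dies unexpectedly, since the kernel closes the descriptor on exit.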

An extra set of mounts is provided to a container and set up for use. A container's
rootfs can contain some non-portable files that can cause side effects during
execution of a process. These files are usually created and populated with
container-specific information by the runtime.

**Extra runtime files:**
* /etc/hosts
* /etc/resolv.conf
* /etc/hostname
* /etc/localtime


#### Defaults

There are a few defaults that can be overridden by users, but in their omission
these apply to processes within a container.

| Type                | Value                          |
| ------------------- | ------------------------------ |
| Parent Death Signal | SIGKILL                        |
| UID                 | 0                              |
| GID                 | 0                              |
| GROUPS              | 0, NULL                        |
| CWD                 | "/"                            |
| $HOME               | Current user's home dir or "/" |
| Readonly rootfs     | false                          |
| Pseudo TTY          | false                          |


## Actions

After a container is created there is a standard set of actions that can
be performed on the container. These actions are part of the public API for
a container.

| Action         | Description                                                        |
| -------------- | ------------------------------------------------------------------ |
| Get processes  | Return all the pids for processes running inside a container       |
| Get Stats      | Return resource statistics for the container as a whole            |
| Wait           | Wait on the container's init process (pid 1)                       |
| Wait Process   | Wait on any of the container's processes returning the exit status |
| Destroy        | Kill the container's init process and remove any filesystem state  |
| Signal         | Send a signal to the container's init process                      |
| Signal Process | Send a signal to any of the container's processes                  |
| Pause          | Pause all processes inside the container                           |
| Resume         | Resume all processes inside the container if paused                |
| Exec           | Execute a new process inside of the container (requires setns)     |
| Set            | Set up configs of the container after it is created                |

### Execute a new process inside of a running container

A user can execute a new process inside of a running container. Any binaries to be
executed must be accessible within the container's rootfs.

The started process will run inside the container's rootfs. Any changes
made by the process to the container's filesystem will persist after the
process has finished executing.

The started process will join all the container's existing namespaces. When the
container is paused, the process will also be paused and will resume when
the container is unpaused. The started process will only run when the container's
primary process (PID 1) is running, and will not be restarted when the container
is restarted.

#### Planned additions

The started process will have its own cgroups nested inside the container's
cgroups. This is used for process tracking and optionally resource allocation
handling for the new process. The freezer cgroup is required; the rest of the cgroups
are optional.
The process executor must place its pid inside the correct
cgroups before starting the process. This is done so that no child processes or
threads can escape the cgroups.

When the process is stopped, the process executor will try (in a best-effort way)
to stop all its children and remove the sub-cgroups.
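
Placing a pid inside a cgroup, as required above for the process executor and earlier for the container's init, amounts to writing the pid to a control file in the cgroup's directory. The sketch below assumes the kernel's `cgroup.procs` file is used; the helper name `cgroup_add_pid` and the directory passed to it are hypothetical, and this is not presented as libcontainer's own implementation.

```c
#include <fcntl.h>
#include <stdio.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `pid` into <cgroup_dir>/cgroup.procs.
 * Returns 0 on success and -1 on any failure.
 */
int cgroup_add_pid(const char *cgroup_dir, pid_t pid)
{
    char path[4096];
    int n = snprintf(path, sizeof(path), "%s/cgroup.procs", cgroup_dir);
    if (n < 0 || (size_t)n >= sizeof(path))
        return -1; /* path would not fit in the buffer */

    /* The control file already exists in a real cgroup, so no O_CREAT. */
    int fd = open(path, O_WRONLY);
    if (fd < 0)
        return -1;

    /* The kernel expects the pid as decimal text. */
    char buf[32];
    int len = snprintf(buf, sizeof(buf), "%d\n", (int)pid);
    ssize_t written = write(fd, buf, (size_t)len);
    int rc = (written == (ssize_t)len) ? 0 : -1;

    if (close(fd) < 0)
        rc = -1;
    return rc;
}
```

A caller would invoke this once per subsystem hierarchy before starting the process, for example with the path of the nested freezer cgroup, so that children forked afterwards inherit the membership.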