## Container Specification - v1

This is the standard configuration for version 1 containers. It includes
namespaces, standard filesystem setup, a default Linux capability set, and
information about resource reservations. It also has information about any
populated environment settings for the processes running inside a container.

Along with the configuration of how a container is created, the standard also
discusses actions that can be performed on a container to manage and inspect
information about the processes running inside.

The v1 profile is meant to be able to accommodate the majority of applications
with a strong security configuration.

### System Requirements and Compatibility

Minimum requirements:
* Kernel version - 3.10 recommended; 2.6.2x minimum (with backported patches)
* Mounted cgroups with each subsystem in its own hierarchy


### Namespaces

| Flag            | Enabled |
| --------------- | ------- |
| CLONE_NEWPID    | 1       |
| CLONE_NEWUTS    | 1       |
| CLONE_NEWIPC    | 1       |
| CLONE_NEWNET    | 1       |
| CLONE_NEWNS     | 1       |
| CLONE_NEWUSER   | 1       |
| CLONE_NEWCGROUP | 1       |

Namespaces are created for the container via the `unshare` syscall.


### Filesystem

A root filesystem must be provided to a container for execution. The container
will use this root filesystem (rootfs) to jail and spawn processes inside, where
the binaries and system libraries are local to that directory. Any binaries
to be executed must be contained within this rootfs.

Mounts that happen inside the container are automatically cleaned up when the
container exits: the mount namespace is destroyed and the kernel
unmounts all the mounts that were set up within that namespace.
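As a rough sketch of how the namespace flags above compose for `unshare(2)`: the constant values below are copied from `<linux/sched.h>` and defined locally so the sketch stays portable; a real runtime would pass the combined mask to the `unshare` syscall (e.g. via `golang.org/x/sys/unix.Unshare`), which this illustration deliberately does not do.

```go
package main

import "fmt"

// Clone flag values from <linux/sched.h>, defined locally so this
// sketch compiles on any platform without the Linux syscall package.
const (
	cloneNewNS     = 0x00020000 // CLONE_NEWNS
	cloneNewCgroup = 0x02000000 // CLONE_NEWCGROUP
	cloneNewUTS    = 0x04000000 // CLONE_NEWUTS
	cloneNewIPC    = 0x08000000 // CLONE_NEWIPC
	cloneNewUser   = 0x10000000 // CLONE_NEWUSER
	cloneNewPID    = 0x20000000 // CLONE_NEWPID
	cloneNewNet    = 0x40000000 // CLONE_NEWNET
)

// namespaceFlags ORs together the v1 profile's full namespace set.
func namespaceFlags() int {
	return cloneNewNS | cloneNewCgroup | cloneNewUTS |
		cloneNewIPC | cloneNewUser | cloneNewPID | cloneNewNet
}

func main() {
	// A real runtime would now call unix.Unshare(namespaceFlags()).
	fmt.Printf("unshare flags: %#x\n", namespaceFlags())
}
```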
For a container to execute properly, certain filesystems must be mounted
within the rootfs; the runtime will set these up.

| Path        | Type   | Flags                                  | Data                                     |
| ----------- | ------ | -------------------------------------- | ---------------------------------------- |
| /proc       | proc   | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev        | tmpfs  | MS_NOEXEC,MS_STRICTATIME               | mode=755                                 |
| /dev/shm    | tmpfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV           | mode=1777,size=65536k                    |
| /dev/mqueue | mqueue | MS_NOEXEC,MS_NOSUID,MS_NODEV           |                                          |
| /dev/pts    | devpts | MS_NOEXEC,MS_NOSUID                    | newinstance,ptmxmode=0666,mode=620,gid=5 |
| /sys        | sysfs  | MS_NOEXEC,MS_NOSUID,MS_NODEV,MS_RDONLY |                                          |


After a container's filesystems are mounted within the newly created
mount namespace, `/dev` will need to be populated with a set of device nodes.
A rootfs is not expected to ship any device nodes of its own in `/dev`;
the container will set up the correct devices required for executing the
container's process.

| Path         | Mode | Access |
| ------------ | ---- | ------ |
| /dev/null    | 0666 | rwm    |
| /dev/zero    | 0666 | rwm    |
| /dev/full    | 0666 | rwm    |
| /dev/tty     | 0666 | rwm    |
| /dev/random  | 0666 | rwm    |
| /dev/urandom | 0666 | rwm    |


**ptmx**

`/dev/ptmx` will need to be a symlink to the host's `/dev/ptmx` within
the container.

The use of a pseudo TTY is optional within a container, and a container should
support both modes. If a pseudo TTY is provided to the container, `/dev/console`
will need to be set up by bind mounting the console into `/dev/` after it has
been populated and mounted as tmpfs.
| Source          | Destination  | UID GID | Mode | Type |
| --------------- | ------------ | ------- | ---- | ---- |
| *pty host path* | /dev/console | 0 0     | 0600 | bind |


After `/dev/null` has been set up, we check for any external links between
the container's io (STDIN, STDOUT, STDERR). If the container's io is pointing
to `/dev/null` outside the container, we close it and `dup2` the `/dev/null`
that is local to the container's rootfs.


After the container has `/proc` mounted, a few standard symlinks are set up
within `/dev/` for the io.

| Source          | Destination |
| --------------- | ----------- |
| /proc/self/fd   | /dev/fd     |
| /proc/self/fd/0 | /dev/stdin  |
| /proc/self/fd/1 | /dev/stdout |
| /proc/self/fd/2 | /dev/stderr |

A `pivot_root` is used to change the root for the process, effectively
jailing the process inside the rootfs.

```c
put_old = mkdir(...);
pivot_root(rootfs, put_old);
chdir("/");
unmount(put_old, MS_DETACH);
rmdir(put_old);
```

For containers running with a rootfs inside `ramfs`, an `MS_MOVE` combined
with a `chroot` is required, as `pivot_root` is not supported in `ramfs`.

```c
mount(rootfs, "/", NULL, MS_MOVE, NULL);
chroot(".");
chdir("/");
```

The `umask` is set back to `0022` after the filesystem setup has been completed.

### Resources

Cgroups are used to handle resource allocation for containers. This includes
system resources like cpu, memory, and device access.

| Subsystem  | Enabled |
| ---------- | ------- |
| devices    | 1       |
| memory     | 1       |
| cpu        | 1       |
| cpuacct    | 1       |
| cpuset     | 1       |
| blkio      | 1       |
| perf_event | 1       |
| freezer    | 1       |
| hugetlb    | 1       |
| pids       | 1       |


All cgroup subsystems are joined so that statistics can be collected from
each of them.
Freezer does not expose any stats but is joined
so that containers can be paused and resumed.

The parent process of the container's init must place the init pid inside
the correct cgroups before initialization begins. This is done so
that no processes or threads escape the cgroups. This synchronization is
done via a pipe (specified in the runtime section below) on which the
container's init process blocks, waiting for the parent to finish setup.

### IntelRdt

Intel platforms with newer Xeon CPUs support Resource Director Technology
(RDT). Cache Allocation Technology (CAT) and Memory Bandwidth Allocation (MBA)
are two sub-features of RDT.

Cache Allocation Technology (CAT) provides a way for software to restrict
cache allocation to a defined 'subset' of the L3 cache, which may overlap
with other 'subsets'. The different subsets are identified by class of
service (CLOS), and each CLOS has a capacity bitmask (CBM).

Memory Bandwidth Allocation (MBA) provides indirect and approximate throttling
of memory bandwidth for software. A user controls the resource by indicating
the percentage of maximum memory bandwidth, or a memory bandwidth limit in
MBps units if the MBA Software Controller is enabled.

These features can be used to handle L3 cache and memory bandwidth resource
allocation for containers if the hardware and kernel support Intel RDT CAT
and MBA.

In Linux 4.10 or newer kernels, the interface is defined and exposed via the
"resource control" filesystem, which is a "cgroup-like" interface.

Compared with cgroups, it has a similar process management lifecycle and
interface in a container. But unlike the cgroups hierarchy, it has a
single-level filesystem layout.

The CAT and MBA features were introduced in the Linux 4.10 and 4.12 kernels,
respectively, via the "resource control" filesystem.
Intel RDT "resource control" filesystem hierarchy:
```
mount -t resctrl resctrl /sys/fs/resctrl
tree /sys/fs/resctrl
/sys/fs/resctrl/
|-- info
|   |-- L3
|   |   |-- cbm_mask
|   |   |-- min_cbm_bits
|   |   |-- num_closids
|   |-- MB
|       |-- bandwidth_gran
|       |-- delay_linear
|       |-- min_bandwidth
|       |-- num_closids
|-- ...
|-- schemata
|-- tasks
|-- <container_id>
    |-- ...
    |-- schemata
    |-- tasks
```

For runc, we can make use of the `tasks` and `schemata` configuration for L3
cache and memory bandwidth resource constraints.

The file `tasks` has a list of tasks that belong to this group (e.g., the
`<container_id>` group). Tasks can be added to a group by writing the task ID
to the `tasks` file (which automatically removes them from the previous
group to which they belonged). New tasks created by fork(2) and clone(2) are
added to the same group as their parent.

The file `schemata` has a list of all the resources available to this group.
Each resource (L3 cache, memory bandwidth) has its own line and format.

L3 cache schema:
It has allocation bitmasks/values for the L3 cache on each socket, each entry
containing an L3 cache id and a capacity bitmask (CBM).
```
Format: "L3:<cache_id0>=<cbm0>;<cache_id1>=<cbm1>;..."
```
For example, on a two-socket machine, the schema line could be "L3:0=ff;1=c0",
which means L3 cache id 0's CBM is 0xff and L3 cache id 1's CBM is 0xc0.

A valid L3 cache CBM is a *contiguous set of bits*, and the number of bits
that can be set is bounded by a maximum that varies among supported Intel CPU
models. The kernel checks validity when writing. For example, a default value
of 0xfffff in the root group indicates the maximum CBM is 20 bits, which maps
to the entire L3 cache capacity. Some valid CBM values to set in a group:
0xf, 0xf0, 0x3ff, 0x1f00, etc.
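The schema format and the CBM validity rule above can be checked mechanically. The following is an illustrative sketch, not runc's actual API: `parseL3Schema` splits a line such as `L3:0=ff;1=c0` into per-cache-id masks, and `validCBM` verifies that a mask is a contiguous run of set bits within the model's maximum width (the per-model `min_cbm_bits` constraint is ignored here for brevity).

```go
package main

import (
	"fmt"
	"math/bits"
	"strconv"
	"strings"
)

// parseL3Schema parses a schemata line like "L3:0=ff;1=c0" into a map
// from L3 cache id to its capacity bitmask (CBM).
func parseL3Schema(line string) (map[int]uint64, error) {
	if !strings.HasPrefix(line, "L3:") {
		return nil, fmt.Errorf("not an L3 schema: %q", line)
	}
	masks := make(map[int]uint64)
	for _, part := range strings.Split(strings.TrimPrefix(line, "L3:"), ";") {
		kv := strings.SplitN(part, "=", 2)
		if len(kv) != 2 {
			return nil, fmt.Errorf("bad entry: %q", part)
		}
		id, err := strconv.Atoi(kv[0])
		if err != nil {
			return nil, err
		}
		cbm, err := strconv.ParseUint(kv[1], 16, 64) // CBMs are hex
		if err != nil {
			return nil, err
		}
		masks[id] = cbm
	}
	return masks, nil
}

// validCBM reports whether cbm is a non-empty contiguous run of set
// bits that fits within maxBits; the kernel enforces the same rule
// when the schemata file is written.
func validCBM(cbm uint64, maxBits int) bool {
	if cbm == 0 || bits.Len64(cbm) > maxBits {
		return false
	}
	shifted := cbm >> bits.TrailingZeros64(cbm)
	return shifted&(shifted+1) == 0 // contiguous iff now of form 2^n-1
}

func main() {
	masks, _ := parseL3Schema("L3:0=ff;1=c0")
	fmt.Println(masks[0], masks[1])  // 255 192
	fmt.Println(validCBM(0xc0, 20))  // true
	fmt.Println(validCBM(0xa0, 20))  // false: 1010_0000 is not contiguous
}
```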
Memory bandwidth schema:
It has allocation values for memory bandwidth on each socket, each entry
containing an L3 cache id and a memory bandwidth value.
```
Format: "MB:<cache_id0>=bandwidth0;<cache_id1>=bandwidth1;..."
```
For example, on a two-socket machine, the schema line could be "MB:0=20;1=70".

The minimum bandwidth percentage value for each CPU model is predefined and
can be looked up through `info/MB/min_bandwidth`. The bandwidth granularity
that is allocated is also dependent on the CPU model and can be looked up at
`info/MB/bandwidth_gran`. The available bandwidth control steps are:
min_bw + N * bw_gran. Intermediate values are rounded to the next control
step available on the hardware.

If the MBA Software Controller is enabled through the mount option
`-o mba_MBps` (`mount -t resctrl resctrl -o mba_MBps /sys/fs/resctrl`),
we can specify memory bandwidth in MBps (megabytes per second) units
instead of percentages. The kernel underneath uses a software feedback
mechanism, or "Software Controller", which reads the actual bandwidth using
MBM counters and adjusts the memory bandwidth percentages to ensure:
"actual memory bandwidth < user-specified memory bandwidth".

For example, on a two-socket machine, the schema line could be
"MB:0=5000;1=7000", which means a 5000 MBps memory bandwidth limit on
socket 0 and a 7000 MBps memory bandwidth limit on socket 1.

For more information about the Intel RDT kernel interface, see
https://www.kernel.org/doc/Documentation/x86/intel_rdt_ui.txt

```
An example for runc:
Consider a two-socket machine with two L3 caches, where the default CBM is
0x7ff and the max CBM length is 11 bits, and the minimum memory bandwidth is
10% with a memory bandwidth granularity of 10%.

Tasks inside the container only have access to the "upper" 7/11 of the L3
cache on socket 0 and the "lower" 5/11 of the L3 cache on socket 1, and may
use a maximum memory bandwidth of 20% on socket 0 and 70% on socket 1.

"linux": {
    "intelRdt": {
        "closID": "guaranteed_group",
        "l3CacheSchema": "L3:0=7f0;1=1f",
        "memBwSchema": "MB:0=20;1=70"
    }
}
```

### Security

The standard set of Linux capabilities that are set in a container
provides a good default for security and flexibility for applications.


| Capability           | Enabled |
| -------------------- | ------- |
| CAP_NET_RAW          | 1       |
| CAP_NET_BIND_SERVICE | 1       |
| CAP_AUDIT_READ       | 1       |
| CAP_AUDIT_WRITE      | 1       |
| CAP_DAC_OVERRIDE     | 1       |
| CAP_SETFCAP          | 1       |
| CAP_SETPCAP          | 1       |
| CAP_SETGID           | 1       |
| CAP_SETUID           | 1       |
| CAP_MKNOD            | 1       |
| CAP_CHOWN            | 1       |
| CAP_FOWNER           | 1       |
| CAP_FSETID           | 1       |
| CAP_KILL             | 1       |
| CAP_SYS_CHROOT       | 1       |
| CAP_NET_BROADCAST    | 0       |
| CAP_SYS_MODULE       | 0       |
| CAP_SYS_RAWIO        | 0       |
| CAP_SYS_PACCT        | 0       |
| CAP_SYS_ADMIN        | 0       |
| CAP_SYS_NICE         | 0       |
| CAP_SYS_RESOURCE     | 0       |
| CAP_SYS_TIME         | 0       |
| CAP_SYS_TTY_CONFIG   | 0       |
| CAP_AUDIT_CONTROL    | 0       |
| CAP_MAC_OVERRIDE     | 0       |
| CAP_MAC_ADMIN        | 0       |
| CAP_NET_ADMIN        | 0       |
| CAP_SYSLOG           | 0       |
| CAP_DAC_READ_SEARCH  | 0       |
| CAP_LINUX_IMMUTABLE  | 0       |
| CAP_IPC_LOCK         | 0       |
| CAP_IPC_OWNER        | 0       |
| CAP_SYS_PTRACE       | 0       |
| CAP_SYS_BOOT         | 0       |
| CAP_LEASE            | 0       |
| CAP_WAKE_ALARM       | 0       |
| CAP_BLOCK_SUSPEND    | 0       |


Additional security layers like [apparmor](https://wiki.ubuntu.com/AppArmor)
and [selinux](http://selinuxproject.org/page/Main_Page) can be used with
containers. A container should support setting an apparmor profile or
selinux process and mount labels if provided in the configuration.
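The enabled rows of the table form the default capability set. As a sketch of folding that set into a bitmask (the capability numbers are taken from `<linux/capability.h>`; an actual runtime would apply the set via `prctl(2)` or a capabilities library rather than computing a raw mask by hand):

```go
package main

import "fmt"

// Capability numbers from <linux/capability.h>.
const (
	capChown          = 0
	capDacOverride    = 1
	capFowner         = 3
	capFsetid         = 4
	capKill           = 5
	capSetgid         = 6
	capSetuid         = 7
	capSetpcap        = 8
	capNetBindService = 10
	capNetRaw         = 13
	capSysChroot      = 18
	capSysAdmin       = 21
	capMknod          = 27
	capAuditWrite     = 29
	capSetfcap        = 31
	capAuditRead      = 37
)

// defaultCaps lists the capabilities enabled in the v1 profile.
var defaultCaps = []uint{
	capChown, capDacOverride, capFowner, capFsetid, capKill,
	capSetgid, capSetuid, capSetpcap, capNetBindService, capNetRaw,
	capSysChroot, capMknod, capAuditWrite, capSetfcap, capAuditRead,
}

// capMask folds a capability list into a bitmask, one bit per capability.
func capMask(caps []uint) uint64 {
	var mask uint64
	for _, c := range caps {
		mask |= 1 << c
	}
	return mask
}

func main() {
	mask := capMask(defaultCaps)
	fmt.Printf("default capability mask: %#x\n", mask)
	fmt.Println("CAP_SYS_ADMIN retained:", mask&(1<<capSysAdmin) != 0) // false
}
```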
Standard apparmor profile:
```c
#include <tunables/global>
profile <profile_name> flags=(attach_disconnected,mediate_deleted) {
  #include <abstractions/base>
  network,
  capability,
  file,
  umount,

  deny @{PROC}/sys/fs/** wklx,
  deny @{PROC}/sysrq-trigger rwklx,
  deny @{PROC}/mem rwklx,
  deny @{PROC}/kmem rwklx,
  deny @{PROC}/sys/kernel/[^s][^h][^m]* wklx,
  deny @{PROC}/sys/kernel/*/** wklx,

  deny mount,

  deny /sys/[^f]*/** wklx,
  deny /sys/f[^s]*/** wklx,
  deny /sys/fs/[^c]*/** wklx,
  deny /sys/fs/c[^g]*/** wklx,
  deny /sys/fs/cg[^r]*/** wklx,
  deny /sys/firmware/efi/efivars/** rwklx,
  deny /sys/kernel/security/** rwklx,
}
```

*TODO: seccomp work is being done to find a good default config*

### Runtime and Init Process

During container creation, the parent process needs to talk to the container's
init process and have a form of synchronization. This is accomplished by
creating a pipe that is passed to the container's init. When the init process
first spawns, it will block on its side of the pipe until the parent closes
its side. This gives the parent time to place the new process inside a cgroup
hierarchy and/or write any uid/gid mappings required for user namespaces.
The pipe is passed to the init process via FD 3.

The application consuming libcontainer should be compiled statically.
libcontainer does not define any init process, and the arguments provided are
used to `exec` the process inside the application. There should be no
long-running init within the container spec.

If a pseudo TTY is provided to a container, it will open and `dup2` the
console as the container's STDIN, STDOUT, and STDERR, as well as mounting the
console as `/dev/console`.

An extra set of mounts is provided to a container and set up for use.
A container's
rootfs can contain some non-portable files that can cause side effects during
the execution of a process. These files are usually created and populated with
container-specific information by the runtime.

**Extra runtime files:**
* /etc/hosts
* /etc/resolv.conf
* /etc/hostname
* /etc/localtime


#### Defaults

There are a few defaults that can be overridden by users; in their absence,
these apply to processes within a container.

| Type                | Value                          |
| ------------------- | ------------------------------ |
| Parent Death Signal | SIGKILL                        |
| UID                 | 0                              |
| GID                 | 0                              |
| GROUPS              | 0, NULL                        |
| CWD                 | "/"                            |
| $HOME               | Current user's home dir or "/" |
| Readonly rootfs     | false                          |
| Pseudo TTY          | false                          |


## Actions

After a container is created, there is a standard set of actions that can
be performed on the container. These actions are part of the public API for
a container.

| Action         | Description                                                         |
| -------------- | ------------------------------------------------------------------- |
| Get processes  | Return all the pids for processes running inside a container        |
| Get Stats      | Return resource statistics for the container as a whole             |
| Wait           | Wait on the container's init process (pid 1)                        |
| Wait Process   | Wait on any of the container's processes, returning the exit status |
| Destroy        | Kill the container's init process and remove any filesystem state   |
| Signal         | Send a signal to the container's init process                       |
| Signal Process | Send a signal to any of the container's processes                   |
| Pause          | Pause all processes inside the container                            |
| Resume         | Resume all processes inside the container, if paused                |
| Exec           | Execute a new process inside of the container (requires setns)      |
| Set            | Update the container's configuration after it is created            |

### Execute a new process inside of a running container

Users can execute a new process inside of a running container. Any binaries to
be executed must be accessible within the container's rootfs.

The started process will run inside the container's rootfs. Any changes
made by the process to the container's filesystem will persist after the
process finishes executing.

The started process will join all the container's existing namespaces. When
the container is paused, the process will also be paused, and will resume when
the container is unpaused. The started process will only run while the
container's primary process (PID 1) is running, and will not be restarted when
the container is restarted.

#### Planned additions

The started process will have its own cgroups nested inside the container's
cgroups. This is used for process tracking and, optionally, resource
allocation handling for the new process. The freezer cgroup is required;
the rest of the cgroups are optional.
The process executor must place its pid inside the correct
cgroups before starting the process. This is done so that no child processes
or threads can escape the cgroups.

When the process is stopped, the process executor will try (in a best-effort
way) to stop all its children and remove the sub-cgroups.
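The action table above maps naturally onto a programmatic interface. The following is a hypothetical Go sketch with a trivial in-memory stand-in implementation; the method names mirror the table and are illustrative, not libcontainer's exact API.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

// Stats is a placeholder for per-container resource statistics.
type Stats struct {
	MemoryUsage uint64
	CPUUsage    uint64
}

// Container models the public action set from the table above.
type Container interface {
	Processes() ([]int, error)  // Get processes
	Stats() (*Stats, error)     // Get Stats
	Signal(sig os.Signal) error // Signal (init process)
	Pause() error               // Pause
	Resume() error              // Resume
	Destroy() error             // Destroy
}

// fakeContainer is an in-memory implementation for illustration only.
type fakeContainer struct {
	paused    bool
	destroyed bool
}

func (c *fakeContainer) Processes() ([]int, error) { return []int{1}, nil }
func (c *fakeContainer) Stats() (*Stats, error)    { return &Stats{}, nil }
func (c *fakeContainer) Signal(os.Signal) error    { return nil }
func (c *fakeContainer) Pause() error              { c.paused = true; return nil }
func (c *fakeContainer) Resume() error {
	if !c.paused {
		return errors.New("container is not paused")
	}
	c.paused = false
	return nil
}
func (c *fakeContainer) Destroy() error { c.destroyed = true; return nil }

func main() {
	var c Container = &fakeContainer{}
	c.Pause()
	fmt.Println(c.Resume() == nil) // true
}
```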