# Control Groups (cgroups)

## Background

[Control Groups][cgroups] are a Linux feature for organizing processes into hierarchical groups and applying resource limits to them. Each rkt pod is placed in its own cgroup to separate the processes of the pod from the processes of the host. The memory and CPU isolators are also implemented with cgroups.

## What cgroup does rkt use?

Every pod, and every application within that pod, runs in its own cgroup.

### `rkt` started from the command line

When a recent version of systemd is running on the host and `rkt` is not started as a systemd service (typically, when run from the command line), `rkt` calls `systemd-nspawn` with `--register=true`. This causes `systemd-nspawn` to call the D-Bus method `CreateMachineWithNetwork` on `systemd-machined`, which creates the cgroup `/machine.slice/machine-rkt...`. This requires systemd v216+, as detected by the [machinedRegister][machinedRegister] function in stage1's `init`.
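
The real check is the linked `machinedRegister` function; the standalone Go sketch below merely approximates that version gate by parsing `systemctl --version` (the `systemdVersion` helper is made up for illustration):

```go
package main

import (
	"fmt"
	"os/exec"
	"strconv"
	"strings"
)

// systemdVersion parses the first line of `systemctl --version`,
// which looks like "systemd 245 (245.4-4ubuntu3)".
func systemdVersion() (int, error) {
	out, err := exec.Command("systemctl", "--version").Output()
	if err != nil {
		return 0, err // no usable systemd on the host
	}
	fields := strings.Fields(strings.SplitN(string(out), "\n", 2)[0])
	if len(fields) < 2 {
		return 0, fmt.Errorf("unexpected systemctl output: %q", out)
	}
	return strconv.Atoi(fields[1])
}

func main() {
	v, err := systemdVersion()
	canRegister := err == nil && v >= 216
	fmt.Printf("register with systemd-machined: %v\n", canRegister)
}
```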

When systemd is not running on the host, or the systemd version is too old (< v216), `rkt` runs `systemd-nspawn` with the `--register=false` parameter. In this case, `systemd-nspawn` and the other systemd components will not create new cgroups for rkt. Instead, `rkt` creates a new cgroup for each pod under the caller's current cgroup, like `$CALLER_CGROUP/machine-some-id.slice`.
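
For illustration, the caller's cgroup can be found by parsing `/proc/self/cgroup`. A minimal sketch (the `callerCgroup` helper is hypothetical, not rkt's code):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// callerCgroup returns this process's cgroup path for a given v1
// subsystem by parsing /proc/self/cgroup, whose lines look like
// "4:memory:/user.slice". Hypothetical helper, not rkt's code.
func callerCgroup(subsystem string) (string, error) {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		return "", err
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		// Format: hierarchy-ID:controller-list:cgroup-path
		parts := strings.SplitN(s.Text(), ":", 3)
		if len(parts) == 3 && strings.Contains(parts[1], subsystem) {
			return parts[2], nil
		}
	}
	return "", fmt.Errorf("subsystem %q not found", subsystem)
}

func main() {
	cg, err := callerCgroup("memory")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	// rkt would create the pod's cgroup underneath this path,
	// e.g. $CALLER_CGROUP/machine-some-id.slice.
	fmt.Println(cg)
}
```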

### `rkt` started as a systemd service

`rkt` detects whether it was started as a systemd service (from a `.service` file or via `systemd-run`).
In that case, `systemd-nspawn` is started with the `--keep-unit` parameter.
This causes `systemd-nspawn` to use the D-Bus method call `RegisterMachineWithNetwork` instead of `CreateMachineWithNetwork`, and the pod remains in the cgroup of the service.
By default, the slice is `system.slice`, but [users are advised][rkt-systemd] to select `machine.slice` with `systemd-run --slice=machine` or `Slice=machine.slice` in the `.service` file.
When the user selects that slice, the pod ends up in `/machine.slice/servicename.service`.
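
For example, a minimal unit along these lines would place the pod in `/machine.slice/mypod.service`; the unit name, image name, and rkt path are made up, and only the `Slice=` line matters here:

```
# Illustrative unit; the unit name, image, and rkt path are made up.
[Unit]
Description=Example rkt pod

[Service]
Slice=machine.slice
ExecStart=/usr/bin/rkt run example.com/myapp
```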

### Summary

1. `/machine.slice/machine-rkt...` when started on the command line with systemd v216+.
2. `/$SLICE.slice/servicename.service` when started from a systemd service.
3. `$CALLER_CGROUP/machine-some-id.slice` without systemd, or with systemd pre-v216.

For example, a simple pod run interactively on a system with systemd looks like this:

```
├─machine.slice
│ └─machine-rkt\x2df28d074b\x2da8bb\x2d4246\x2d96a5\x2db961e1fe7035.scope
│   ├─init.scope
│   │ └─/usr/lib/systemd/systemd
│   └─system.slice
│     ├─alpine-sh.service
│     │ └─/bin/sh
│     └─systemd-journald.service
│       └─/usr/lib/systemd/systemd-journald
```

## What subsystems does rkt use?

Right now, rkt uses the `cpu`, `cpuset`, and `memory` subsystems.
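
These are a subset of whatever v1 subsystems the host kernel provides, which can be listed from `/proc/cgroups`. A small illustrative sketch (not part of rkt):

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// Prints the v1 subsystems known to the running kernel by reading
// /proc/cgroups; rkt uses only a subset of these. Illustrative only.
func main() {
	f, err := os.Open("/proc/cgroups")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer f.Close()
	s := bufio.NewScanner(f)
	for s.Scan() {
		line := s.Text()
		// Skip the "#subsys_name hierarchy num_cgroups enabled" header.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		fmt.Println(strings.Fields(line)[0])
	}
}
```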

### How are they mounted?

When the stage1 starts, it mounts `/sys`. Then, for every subsystem, it:

1. Mounts the subsystem (on `<rootfs>/sys/fs/cgroup/<subsystem>`)
2. Bind-mounts the pod's subcgroup on top of itself (e.g. `<rootfs>/sys/fs/cgroup/memory/machine.slice/machine-rkt-UUID.scope/`)
3. Remounts the subsystem read-only

This is done so that the pod itself cannot escape its cgroup. Currently, the cgroup filesystems are not accessible to applications within the pod, but that may change.

(N.B. `rkt` prior to v1.23 mounted each individual *knob* read-write, e.g. `.../memory/machine.slice/machine-rkt-UUID.scope/system.slice/etcd.service/{memory.limit_in_bytes, cgroup.procs}`.)
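
A rough Go sketch of the three-step mount sequence above, with hypothetical paths and without the extra mount flags and error handling real stage1 code needs. Note that making only the outer mount read-only requires `MS_REMOUNT|MS_BIND`, so the bind mount from step 2 keeps its own (read-write) flags:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// mountSubsystemRO illustrates the three steps for one subsystem.
// All paths here are hypothetical; the real logic lives in stage1.
func mountSubsystemRO(rootfs, subsys, podCgroup string) error {
	base := filepath.Join(rootfs, "sys/fs/cgroup", subsys)
	if err := os.MkdirAll(base, 0755); err != nil {
		return err
	}
	// 1. Mount the subsystem, i.e. mount -t cgroup -o <subsys>.
	if err := syscall.Mount("cgroup", base, "cgroup", 0, subsys); err != nil {
		return err
	}
	// 2. Bind-mount the pod's subcgroup on top of itself so that it
	//    stays writable after step 3.
	sub := filepath.Join(base, podCgroup)
	if err := syscall.Mount(sub, sub, "", syscall.MS_BIND, ""); err != nil {
		return err
	}
	// 3. Remount the outer subsystem mount read-only. MS_BIND makes the
	//    change per-mount-point, leaving the bind mount from step 2
	//    read-write.
	return syscall.Mount("", base, "",
		syscall.MS_REMOUNT|syscall.MS_BIND|syscall.MS_RDONLY, "")
}

func main() {
	// Needs root and an existing pod cgroup; the rootfs path is made up.
	err := mountSubsystemRO("/tmp/podrootfs", "memory",
		"machine.slice/machine-rkt-UUID.scope")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```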

## Future work

### Unified hierarchy and cgroup2

The unified hierarchy (cgroup2) is a new Linux feature that will be available in Linux 4.4.

This is tracked by [#1757][rkt-1757].
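
For illustration, a process can detect whether `/sys/fs/cgroup` is a unified (cgroup2) mount by checking the filesystem magic number. This is a sketch of how such a check could look, not rkt code:

```go
package main

import (
	"fmt"
	"syscall"
)

// CGROUP2_SUPER_MAGIC from include/uapi/linux/magic.h.
const cgroup2SuperMagic = 0x63677270

func main() {
	var st syscall.Statfs_t
	if err := syscall.Statfs("/sys/fs/cgroup", &st); err != nil {
		fmt.Println("cannot stat /sys/fs/cgroup:", err)
		return
	}
	fmt.Println("unified hierarchy:", st.Type == cgroup2SuperMagic)
}
```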

### CGroup Namespaces

Cgroup namespaces are a new feature being developed in Linux.

This is tracked by [#1844][rkt-1844].

### Network isolator

The appc spec defines the [network isolator][network-isolator] `resource/network-bandwidth` to limit the network bandwidth used by each app in the pod.
This is not implemented yet.
It could be implemented with cgroups (for example, via the `net_cls` controller together with `tc`).

[cgroups]: https://www.kernel.org/doc/Documentation/cgroup-v1/cgroups.txt
[machinedRegister]: https://github.com/rkt/rkt/blob/master/stage1/init/init.go#L153
[network-isolator]: https://github.com/appc/spec/blob/master/spec/ace.md#resourcenetwork-bandwidth
[rkt-1757]: https://github.com/rkt/rkt/issues/1757
[rkt-1844]: https://github.com/rkt/rkt/pull/1844
[rkt-systemd]: ../using-rkt-with-systemd.md