## Isolated CPU affinity transition

Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76, introduced in Linux 5.7,
changed a previously deterministic scheduling behavior: tasks are now
distributed across the CPU cores of a cgroup cpuset. As a result, `runc exec`
might be impacted under some circumstances, for example when a container has
been created within a cgroup cpuset composed entirely of isolated CPU cores,
which are usually set with the `nohz_full` and/or `isolcpus` kernel boot
parameters.

Some containerized real-time applications rely on this deterministic
behavior and use the first CPU core to run a slow thread while the other CPU
cores are fully used by real-time threads with the SCHED_FIFO policy.
Such applications can prevent the runc process from joining a container when
the runc process is randomly scheduled on a CPU core owned by a real-time
thread.

Runc provides a way to restore this behavior by adding the following
annotation to the container runtime spec (`config.json`):

`org.opencontainers.runc.exec.isolated-cpu-affinity-transition`

This annotation can take one of the following values:

* `temporary`: temporarily sets the runc process CPU affinity to the first
  isolated CPU core of the container cgroup cpuset.
* `definitive`: definitively sets the runc process CPU affinity to the first
  isolated CPU core of the container cgroup cpuset.

For example:

```json
  "annotations": {
    "org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"
  }
```
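
If the spec is generated by a script, the annotation can be added programmatically. The following is a minimal sketch in Python, not part of runc: the helper name `set_affinity_transition` and the sample spec contents are hypothetical, and only the annotation key and its two values come from this document.

```python
from typing import Any, Dict

# Annotation key and accepted values, as documented above.
ANNOTATION = "org.opencontainers.runc.exec.isolated-cpu-affinity-transition"
ALLOWED_VALUES = {"temporary", "definitive"}


def set_affinity_transition(spec: Dict[str, Any], value: str) -> Dict[str, Any]:
    """Return a copy of a runtime spec dict with the annotation set.

    Hypothetical helper for illustration; the input dict is left unchanged.
    """
    if value not in ALLOWED_VALUES:
        raise ValueError(f"unsupported value {value!r}, expected one of {sorted(ALLOWED_VALUES)}")
    updated = dict(spec)
    # Copy the existing annotations so that other keys are preserved.
    annotations = dict(spec.get("annotations") or {})
    annotations[ANNOTATION] = value
    updated["annotations"] = annotations
    return updated
```

The resulting dict can be written back to `config.json` with `json.dump`. Rejecting unknown values early is a choice made for this sketch; this document does not describe how runc itself treats other values.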

__WARNING:__ `definitive` requires kernel 6.2 or later. It also works with
RHEL 9 and above.

### How does it work?

When the annotation is set, `runc exec` looks for the `nohz_full` kernel boot
parameter and considers the CPUs in its list as isolated. It doesn't look for
the `isolcpus` boot parameter; it simply assumes that the `isolcpus` value,
when specified, is identical to `nohz_full`. If the `nohz_full` parameter is
not found, runc also attempts to read the list from
`/sys/devices/system/cpu/nohz_full`.
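
The lookup order described above can be sketched in Python as follows. This is an illustration under stated assumptions, not runc's implementation: it assumes the boot parameters are read from `/proc/cmdline`, it only handles simple CPU lists such as `4-7` or `0-3,8` (not the other list formats the kernel accepts), and all function names are hypothetical.

```python
from typing import List, Optional


def parse_cpu_list(text: str) -> List[int]:
    """Parse a simple CPU list such as "0-3,8" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def nohz_full_from_cmdline(cmdline: str) -> Optional[List[int]]:
    """Return the nohz_full CPU list from a kernel command line, or None."""
    prefix = "nohz_full="
    for token in cmdline.split():
        if token.startswith(prefix):
            return parse_cpu_list(token[len(prefix):])
    return None


def read_isolated_cpus(
    cmdline_path: str = "/proc/cmdline",  # assumed source of boot parameters
    sysfs_path: str = "/sys/devices/system/cpu/nohz_full",
) -> List[int]:
    """Try the boot parameter first, then fall back to the sysfs file."""
    try:
        with open(cmdline_path) as f:
            cpus = nohz_full_from_cmdline(f.read())
        if cpus is not None:
            return cpus
    except (OSError, ValueError):
        pass  # unreadable or unsupported format: fall back to sysfs
    try:
        with open(sysfs_path) as f:
            return parse_cpu_list(f.read())
    except (OSError, ValueError):
        return []  # no isolated CPU list could be determined
```

Note that `isolcpus` is deliberately never parsed here, mirroring the assumption that its value matches `nohz_full`.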

Once it has the isolated CPU list, runc picks an eligible CPU core within the
container cgroup cpuset based on the following heuristics:

* when there are no cpuset cores: no eligible CPU
* when there are no isolated cores: no eligible CPU
* when none of the cpuset cores are in the isolated core list: no eligible CPU
* when cpuset cores are all isolated cores: return the first CPU of the cpuset
* when cpuset cores are mixed between housekeeping/isolated cores: return the
  first housekeeping CPU, that is, the first cpuset CPU not in the isolated CPUs.
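
The heuristics listed above can be expressed as a small function. This is a sketch in Python rather than runc's actual code; the name `eligible_cpu` is hypothetical, and `None` stands for "no eligible CPU".

```python
from typing import Iterable, Optional


def eligible_cpu(cpuset: Iterable[int], isolated: Iterable[int]) -> Optional[int]:
    """Pick the CPU used for the transition, or None if there is none."""
    cpuset_cores = sorted(set(cpuset))
    isolated_cores = set(isolated)

    # No cpuset cores or no isolated cores: no eligible CPU.
    if not cpuset_cores or not isolated_cores:
        return None

    # Cpuset cores that are not isolated are housekeeping cores.
    housekeeping = [cpu for cpu in cpuset_cores if cpu not in isolated_cores]

    # None of the cpuset cores are isolated: no eligible CPU.
    if len(housekeeping) == len(cpuset_cores):
        return None

    # All cpuset cores are isolated: first CPU of the cpuset.
    if not housekeeping:
        return cpuset_cores[0]

    # Mixed cpuset: first housekeeping CPU.
    return housekeeping[0]
```

For instance, with isolated cores `4-7`, a cpuset of `4-7` yields CPU 4, while a mixed cpuset of `2-5` yields CPU 2.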

The returned CPU core is then used to set the `runc init` CPU affinity before
the container cgroup cpuset transition.

#### Transition example

`nohz_full` has the isolated cores `4-7`. A container has been created with
the cgroup cpuset `4-7` to only run on the isolated CPU cores 4 to 7.
`runc exec` is called by a process with CPU affinity set to `0-3`.

* with `temporary` transition:

  runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4-7)

* with `definitive` transition:

  runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4)

The difference between `temporary` and `definitive` is the container process
affinity: `definitive` constrains the container process to run on the
first isolated CPU core of the cgroup cpuset, while `temporary` restores the
CPU affinity to match the container cgroup cpuset.
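
The two chains above can be reproduced with a small model. The sketch below does not set any real CPU affinity; it only computes the affinities that each step of the example would end up with, and the function name `affinity_chain` is hypothetical.

```python
from typing import Iterable, List, Tuple


def affinity_chain(
    mode: str,
    caller_affinity: Iterable[int],
    cpuset: Iterable[int],
    eligible: int,
) -> Tuple[List[int], List[int], List[int]]:
    """Return the affinities of (runc exec, runc init, container process)."""
    if mode not in ("temporary", "definitive"):
        raise ValueError(f"unknown transition mode: {mode!r}")

    # runc exec inherits the affinity of the calling process.
    runc_exec = sorted(caller_affinity)
    # runc init is pinned to the eligible CPU before joining the cpuset.
    runc_init = [eligible]
    if mode == "temporary":
        # The affinity is restored to the whole container cpuset.
        container = sorted(cpuset)
    else:
        # The container process stays on the eligible CPU.
        container = [eligible]
    return runc_exec, runc_init, container
```

Calling `affinity_chain("temporary", range(0, 4), range(4, 8), 4)` models the first chain of the example, and passing `"definitive"` models the second.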

The `definitive` transition might be helpful when `nohz_full` is used without
`isolcpus`, to prevent runc and the container process from being noisy
neighbours for real-time applications.

### How to use it with Kubernetes?

Kubernetes doesn't manage containers directly; instead it uses the Container Runtime
Interface (CRI) to communicate with software that implements this interface and is
responsible for managing the lifecycle of containers. Popular CRI implementations include
containerd and CRI-O. These implementations allow pod annotations to be passed to the
container runtime via the container runtime spec. Currently, runc is the default runtime
for both.

#### Containerd configuration

The containerd CRI plugin uses runc by default but requires an extra step to pass the
annotation to runc. You have to allow `org.opencontainers.runc.exec.isolated-cpu-affinity-transition`
as a pod annotation that can be passed to the container runtime in `/etc/containerd/config.toml`:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      base_runtime_spec = "/etc/containerd/cri-base.json"
      pod_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"]
```

#### CRI-O configuration

CRI-O doesn't require any extra step; however, some annotations could be excluded by
configuration.

#### Pod deployment example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  annotations:
    org.opencontainers.runc.exec.isolated-cpu-affinity-transition: "temporary"
spec:
  containers:
  - name: demo
    image: registry.com/demo:latest
```