## Isolated CPU affinity transition

Kernel commit 46a87b3851f0d6eb05e6d83d5c5a30df0eca8f76, introduced in Linux
5.7, changed a previously deterministic scheduling behavior: tasks are now
distributed across the CPU cores of a cgroup cpuset. As a result, `runc exec`
might be impacted under some circumstances, for example when a container has
been created within a cgroup cpuset composed entirely of isolated CPU cores,
which are usually set with the `nohz_full` and/or `isolcpus` kernel boot
parameters.

Some containerized real-time applications rely on this deterministic
behavior and use the first CPU core to run a slow thread while the other CPU
cores are fully used by real-time threads with the SCHED_FIFO policy.
Such applications can prevent the runc process from joining a container when
the runc process is randomly scheduled on a CPU core owned by a real-time
thread.

runc provides a way to restore this behavior by adding the following
annotation to the container runtime spec (`config.json`):

`org.opencontainers.runc.exec.isolated-cpu-affinity-transition`

This annotation can take one of the following values:

* `temporary`: temporarily sets the runc process CPU affinity to the first
isolated CPU core of the container cgroup cpuset.
* `definitive`: definitively sets the runc process CPU affinity to the first
isolated CPU core of the container cgroup cpuset.

For example:

```json
"annotations": {
    "org.opencontainers.runc.exec.isolated-cpu-affinity-transition": "temporary"
}
```

__WARNING:__ `definitive` requires a kernel >= 6.2; it also works with RHEL 9
and above.

### How it works?

When the annotation is set, `runc exec` looks for the value of the `nohz_full`
kernel boot parameter and considers the CPUs in that list as isolated. It does
not look for the `isolcpus` boot parameter; it simply assumes that the
`isolcpus` value, when specified, is identical to `nohz_full`. If the
`nohz_full` parameter is not found, runc also attempts to read the list from
`/sys/devices/system/cpu/nohz_full`.

Once it has the isolated CPU list, runc selects an eligible CPU core within the
container cgroup cpuset based on the following heuristics:

* when there are no cpuset cores: no eligible CPU
* when there are no isolated cores: no eligible CPU
* when none of the cpuset cores is in the isolated core list: no eligible CPU
* when the cpuset cores are all isolated cores: return the first CPU of the cpuset
* when the cpuset mixes housekeeping and isolated cores: return the first
housekeeping CPU, that is, the first cpuset CPU not in the isolated CPU list.

The returned CPU core is then used to set the `runc init` CPU affinity before
the container cgroup cpuset transition.

#### Transition example

`nohz_full` has the isolated cores `4-7`. A container has been created with
the cgroup cpuset `4-7` to only run on the isolated CPU cores 4 to 7.
`runc exec` is called by a process with CPU affinity set to `0-3`.

* with `temporary` transition:

    runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4-7)

* with `definitive` transition:

    runc exec (affinity 0-3) -> runc init (affinity 4) -> container process (affinity 4)

The difference between `temporary` and `definitive` is the container process
affinity: `definitive` constrains the container process to run on the
first isolated CPU core of the cgroup cpuset, while `temporary` restores the
CPU affinity to match the container cgroup cpuset.

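The selection heuristics and the transition example above can be modelled in a few lines. The following is a minimal Python sketch for illustration only, not runc's actual Go implementation: the function names `parse_cpu_list`, `eligible_cpu` and `transition` are hypothetical, and it assumes that "first" means the lowest-numbered CPU. It only computes the affinities and does not change any process affinity.

```python
def parse_cpu_list(text):
    """Parse a kernel-style CPU list such as "0,2-3" into a set of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        part = part.strip()
        if not part:
            continue  # empty input or stray comma
        if "-" in part:
            first, last = part.split("-", 1)
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return cpus


def eligible_cpu(cpuset, isolated):
    """Return the eligible CPU for the transition, or None if there is none.

    Assumption: "first" is taken to mean the lowest-numbered CPU.
    """
    if not cpuset or not isolated:
        return None  # no cpuset cores, or no isolated cores
    if cpuset.isdisjoint(isolated):
        return None  # no cpuset core is isolated
    housekeeping = cpuset - isolated
    if housekeeping:
        return min(housekeeping)  # mixed cpuset: first housekeeping CPU
    return min(cpuset)  # fully isolated cpuset: first CPU of the cpuset


def transition(mode, cpuset, isolated):
    """Model the (runc init, container process) affinities for a given mode.

    Returns None when there is no eligible CPU; this sketch makes no claim
    about what runc does in that case.
    """
    cpu = eligible_cpu(cpuset, isolated)
    if cpu is None:
        return None
    init_affinity = {cpu}
    if mode == "definitive":
        return init_affinity, init_affinity  # process stays on that CPU
    return init_affinity, set(cpuset)  # "temporary": back to the cpuset


if __name__ == "__main__":
    isolated = parse_cpu_list("4-7")  # nohz_full value from the example
    cpuset = parse_cpu_list("4-7")    # container cgroup cpuset
    for mode in ("temporary", "definitive"):
        print(mode, transition(mode, cpuset, isolated))
```

With the values of the transition example, the model yields a `runc init` affinity of CPU 4 in both modes, and a container process affinity of CPUs 4 to 7 for `temporary` versus CPU 4 only for `definitive`.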
The `definitive` transition might be helpful when `nohz_full` is used without
`isolcpus`, to prevent runc and the container process from being noisy
neighbours for real-time applications.

### How to use it with Kubernetes?

Kubernetes doesn't manage containers directly; instead it uses the Container Runtime
Interface (CRI) to communicate with software that implements this interface and is
responsible for managing the lifecycle of containers. Popular CRI implementations include
Containerd and CRI-O. These implementations allow pod annotations to be passed to the
container runtime via the container runtime spec. Currently runc is the runtime used by
default for both.

#### Containerd configuration

Containerd CRI uses runc by default but requires an extra step to pass the annotation to runc.
You have to whitelist `org.opencontainers.runc.exec.isolated-cpu-affinity-transition` as a pod
annotation allowed to be passed to the container runtime in `/etc/containerd/config.toml`:

```toml
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "runc"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes]
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
      runtime_type = "io.containerd.runc.v2"
      base_runtime_spec = "/etc/containerd/cri-base.json"
      pod_annotations = ["org.opencontainers.runc.exec.isolated-cpu-affinity-transition"]
```

#### CRI-O configuration

CRI-O doesn't require any extra step; however, some annotations could be excluded by
configuration.

#### Pod deployment example

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
  annotations:
    org.opencontainers.runc.exec.isolated-cpu-affinity-transition: "temporary"
spec:
  containers:
  - name: demo
    image: registry.com/demo:latest
```
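
One way to check the effect of the annotation on a running pod is to print the CPU affinity of a process started inside the container through an exec. The following is a minimal Python sketch, assuming a Linux container image that ships Python 3; `format_cpu_list` is a hypothetical helper introduced here for readability, and `os.sched_getaffinity` is the standard library call that reads the affinity of the calling process.

```python
import os


def format_cpu_list(cpus):
    """Render a set of CPU ids as a kernel-style list, e.g. {4, 5, 6, 7} -> "4-7"."""
    ranges = []
    for cpu in sorted(cpus):
        if ranges and cpu == ranges[-1][1] + 1:
            ranges[-1][1] = cpu        # extend the current range
        else:
            ranges.append([cpu, cpu])  # start a new range
    return ",".join(
        str(lo) if lo == hi else f"{lo}-{hi}" for lo, hi in ranges
    )


if __name__ == "__main__":
    # os.sched_getaffinity is only available on some platforms (Linux
    # included); pid 0 refers to the calling process.
    if hasattr(os, "sched_getaffinity"):
        print("CPU affinity:", format_cpu_list(os.sched_getaffinity(0)))
    else:
        print("os.sched_getaffinity is not available on this platform")
```

If the container cgroup cpuset were `4-7` with all four cores isolated, as in the transition example, the expectation would be `4-7` with `temporary` and `4` with `definitive`.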