# Tutorial - katalyst colocation best-practices

This guide introduces best practices for colocation in katalyst with an end-to-end example. Follow the steps below to take a glance at the integrated functionality, and then replace the sample yamls with your own workloads when applying the system in your production environment.

## Prerequisite

Please make sure you have deployed all pre-dependent components before moving on to the next step.

- Install enhanced kubernetes based on [install-enhanced-k8s.md](../install-enhanced-k8s.md)
- Install components according to the instructions in [Charts](https://github.com/kubewharf/charts.git). To enable the full functionality of colocation, the following components are required while others are optional.
  - Agent
  - Controller
  - Scheduler

## Functionalities

Before going to the next step, let's assume the following settings and configurations as the baseline:

- Total resources are set to 48 cores and 195924424Ki per node;
- Reserved resources for pods with shared_cores are set to 4 cores and 5Gi, meaning that we always keep at least this amount of resources aside for those pods' bursting requirements.

Based on the assumptions above, you can follow the steps below to dive into the colocation workflow.

### Resource Reporting

After installation, the resource reporting module reports reclaimed resources. Since no pods are running yet, the reclaimed resources are calculated as:
`reclaimed_resources = total_resources - reserved_resources`

With the baseline above, that is 48 - 4 = 44 cores (44k millicpu) and 195924424Ki - 5Gi = 195257901056 bytes of memory. When you inspect the CNR (CustomNodeResource), the reclaimed resources read as follows, meaning that pods with reclaimed_cores can be scheduled onto this node.

```
status:
  resourceAllocatable:
    katalyst.kubewharf.io/reclaimed_memory: "195257901056"
    katalyst.kubewharf.io/reclaimed_millicpu: 44k
  resourceCapacity:
    katalyst.kubewharf.io/reclaimed_memory: "195257901056"
    katalyst.kubewharf.io/reclaimed_millicpu: 44k
```

Submit several pods with shared_cores, and put pressure on those workloads to make the reclaimed resources fluctuate along with the running state of the workload.

```
$ kubectl create -f ./examples/shared-normal-pod.yaml
```

After being successfully scheduled, the pod starts running with cpu-usage ~= 1 core and cpu-load ~= 1, and the reclaimed resources change according to the formula below. We skip memory here since it is harder to reproduce with accurate values than cpu, but the principle is similar.
`reclaimed_cpu = allocatable - round(ceil(reserve + max(usage, load.1min, load.5min)))`

In this example the shared pool is padded to 6 cores (the 4 reserved cores plus 2 cores for the running pod, as explained in the QoS Controlling section below), leaving 42 cores reclaimable.

```
status:
  resourceAllocatable:
    katalyst.kubewharf.io/reclaimed_millicpu: 42k
  resourceCapacity:
    katalyst.kubewharf.io/reclaimed_millicpu: 42k
```

You can then put pressure on those pods with `stress` to simulate request peaks; the cpu-load will rise to approximately 3, making the reclaimed cpu shrink to 40k.

```
$ kubectl exec shared-normal-pod -it -- stress -c 2
```
```
status:
  resourceAllocatable:
    katalyst.kubewharf.io/reclaimed_millicpu: 40k
  resourceCapacity:
    katalyst.kubewharf.io/reclaimed_millicpu: 40k
```

### Scheduling Strategy

Katalyst provides several scheduling strategies to schedule pods with reclaimed_cores. You can alter the default scheduling config, and then create a deployment with reclaimed_cores.
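Such a deployment is declared like any other, except that it carries the reclaimed_cores QoS annotation and requests the reclaimed extended resources reported above. Below is a minimal sketch: the annotation key and resource names follow the conventions shown in the CNR above, while the image, replica count, and resource sizes are illustrative placeholders; refer to ./examples/reclaimed-deployment.yaml for the authoritative version.

```
# Sketch of a deployment running with reclaimed_cores; values are placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reclaimed-pod
spec:
  replicas: 10
  selector:
    matchLabels:
      app: reclaimed-pod
  template:
    metadata:
      labels:
        app: reclaimed-pod
      annotations:
        # katalyst QoS level; reclaimed_cores marks this as a best-effort colocation workload
        "katalyst.kubewharf.io/qos_level": reclaimed_cores
    spec:
      containers:
        - name: stress
          image: polinux/stress
          command: ["stress", "-c", "1"]
          resources:
            # extended resources require requests == limits
            requests:
              "katalyst.kubewharf.io/reclaimed_millicpu": 2k
              "katalyst.kubewharf.io/reclaimed_memory": 2Gi
            limits:
              "katalyst.kubewharf.io/reclaimed_millicpu": 2k
              "katalyst.kubewharf.io/reclaimed_memory": 2Gi
```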
```
$ kubectl create -f ./examples/reclaimed-deployment.yaml
```

#### Spread

Spread is the default scheduling strategy. It tries to spread pods among all suitable nodes, and is usually used to balance workload contention across the cluster. Apply the Spread policy with the command below, and the pods will be spread evenly across the nodes.

```
$ kubectl apply -f ./examples/scheduler-policy-spread.yaml
$ kubectl get po -owide
```
```
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE     NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-4lknl   1/1     Running   0          3m31s   192.168.1.169   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-656bz   1/1     Running   0          3m31s   192.168.2.103   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-89n46   1/1     Running   0          3m31s   192.168.0.129   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-bcpbs   1/1     Running   0          3m31s   192.168.1.171   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-bq22q   1/1     Running   0          3m31s   192.168.0.126   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-jblgk   1/1     Running   0          3m31s   192.168.0.128   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-kxqdl   1/1     Running   0          3m31s   192.168.0.127   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-mdh2d   1/1     Running   0          3m31s   192.168.1.170   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-p2q7s   1/1     Running   0          3m31s   192.168.2.104   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-x7lqh   1/1     Running   0          3m31s   192.168.2.102   node-2   <none>           <none>
```

#### Binpack

Binpack tries to schedule pods onto a single node until that node is unsuitable for scheduling more. It is usually used to squeeze workloads into a bounded set of nodes to raise utilization. Apply the Binpack policy with the command below, and the pods will be packed onto one intensive node.

```
$ kubectl apply -f ./examples/scheduler-policy-binpack.yaml
$ kubectl get po -owide
```
```
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-7mjbz   1/1     Running   0          36s   192.168.1.176   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-h8nmk   1/1     Running   0          36s   192.168.1.177   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-hfhqt   1/1     Running   0          36s   192.168.1.181   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-nhx4h   1/1     Running   0          36s   192.168.1.182   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-s8sx7   1/1     Running   0          36s   192.168.1.178   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-szn8z   1/1     Running   0          36s   192.168.1.180   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-vdm7c   1/1     Running   0          36s   192.168.0.133   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-vrr8w   1/1     Running   0          36s   192.168.1.179   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-w9hv4   1/1     Running   0          36s   192.168.2.109   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-z2wqv   1/1     Running   0          36s   192.168.4.200   node-4   <none>           <none>
```

#### Custom

Besides those in-tree policies, you can also use self-defined scoring functions to customize the scheduling strategy. In the example below, we use a self-defined RequestedToCapacityRatio scorer as the scheduling policy, and it behaves the same as the Binpack policy.
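For reference, such a policy is expressed as scheduler scoring configuration. The sketch below uses the upstream KubeSchedulerConfiguration schema for a RequestedToCapacityRatio scoring strategy; the scheduler name, plugin wiring, and API version here are assumptions and may differ in katalyst, so treat ./examples/scheduler-policy-custom.yaml as the authoritative version. A shape that awards higher scores as utilization grows reproduces Binpack-like packing:

```
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
  - schedulerName: katalyst-scheduler
    pluginConfig:
      - name: NodeResourcesFit
        args:
          scoringStrategy:
            type: RequestedToCapacityRatio
            requestedToCapacityRatio:
              # score 0 at 0% utilization and 10 at 100%: fuller nodes win
              shape:
                - utilization: 0
                  score: 0
                - utilization: 100
                  score: 10
            resources:
              - name: katalyst.kubewharf.io/reclaimed_millicpu
                weight: 1
```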
```
$ kubectl apply -f ./examples/scheduler-policy-custom.yaml
$ kubectl get po -owide
```
```
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE     NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-547zk   1/1     Running   0          7s    192.168.1.191   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-6jzbs   1/1     Running   0          6s    192.168.1.193   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-6v7kr   1/1     Running   0          7s    192.168.2.111   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-9vrb9   1/1     Running   0          6s    192.168.1.192   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-dnn7n   1/1     Running   0          7s    192.168.4.204   node-4   <none>           <none>
reclaimed-pod-5f7f69d7b8-jtgx9   1/1     Running   0          7s    192.168.1.189   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-kjrlv   1/1     Running   0          7s    192.168.0.139   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-mr85t   1/1     Running   0          6s    192.168.1.194   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-q4dz5   1/1     Running   0          7s    192.168.1.188   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-v28nv   1/1     Running   0          7s    192.168.1.190   node-1   <none>           <none>
```

### QoS Controlling

After pods are successfully scheduled, katalyst enters its main QoS-control loop to dynamically adjust resource allocation on each node. In the current version, we use cpuset to isolate the scheduling domain for the pods in each pool, and memory limits to restrict the upper bound of memory usage.

Before going to the next step, remember to clean up the previous pods to start from a fresh environment.

Apply a pod with shared_cores using the command below. After the ramp-up period, the cpuset for the shared pool will be 6 cores in total (i.e. 4 cores reserved for bursting, plus 2 cores for the regular requirement). The remaining cores are considered suitable for reclaimed pods.

```
$ kubectl create -f ./examples/shared-normal-pod.yaml
```
```
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan 3 16:18:31 CST 2023
11,22-23,35,46-47
```
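The helper script simply prints the cpuset currently assigned to the pod's container. If it is not at hand, a rough equivalent is sketched below, assuming you are logged into the node running the pod and a cgroup-v1 hierarchy is mounted under /sys/fs/cgroup:

```
# Resolve the pod's container id, then read its cpuset from the cgroup
# filesystem (paths vary with container runtime and cgroup driver).
CID=$(kubectl get pod shared-normal-pod \
  -o jsonpath='{.status.containerStatuses[0].containerID}' | sed 's|.*://||')
find /sys/fs/cgroup/cpuset -path "*${CID}*" -name cpuset.cpus -exec cat {} \;
```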
Apply a pod with reclaimed_cores using the command below, and the cpuset for the reclaimed pool will be the remaining 42 cores.

```
$ kubectl create -f ./examples/reclaimed-normal-pod.yaml
```
```
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan 3 16:23:20 CST 2023
0-10,12-21,24-34,36-45
```

Put pressure on the previous pod with shared_cores to make its load rise to 3, and the cpuset for the shared pool will grow to 8 cores in total (i.e. 4 cores reserved for bursting, plus 4 cores for the regular requirement). The reclaimed pool will shrink to 40 cores accordingly.

```
$ kubectl exec shared-normal-pod -it -- stress -c 2
```
```
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan 3 16:25:23 CST 2023
10-11,22-23,34-35,46-47
```
```
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan 3 16:28:32 CST 2023
0-9,12-21,24-33,36-45
```

### Pod Eviction

Eviction is usually used as a common back-stop method in case QoS fails to be satisfied: we should always make sure pods with higher priority (i.e. shared_cores) meet their QoS, by evicting pods with lower priority (i.e. reclaimed_cores). Katalyst contains both agent-side and centralized evictions to meet different requirements.

Before going to the next step, remember to clean up the previous pods to start from a fresh environment.

#### Agent Eviction

Currently, katalyst provides several in-tree agent eviction implementations.

##### Resource OverCommit

Since reclaimed resources keep fluctuating with the running state of pods with shared_cores, they may shrink to a critical point where pods with reclaimed_cores can no longer run properly. In this case, katalyst evicts those pods so they can be rebalanced to other nodes, using the comparison formula below:
`sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold`

Apply several pods (including shared_cores and reclaimed_cores), and put pressure on them to reduce the allocatable reclaimed resources below the tolerated amount. This finally triggers the eviction of pod reclaimed-large-pod-2.

```
$ kubectl create -f ./examples/shared-large-pod.yaml -f ./examples/reclaimed-large-pod.yaml
```
```
$ kubectl exec shared-large-pod-2 -it -- stress -c 40
```
```
status:
  resourceAllocatable:
    katalyst.kubewharf.io/reclaimed_millicpu: 4k
  resourceCapacity:
    katalyst.kubewharf.io/reclaimed_millicpu: 4k
```
```
$ kubectl get event -A | grep evict
default   43s   Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
default   8s    Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
```

The default threshold for reclaimed resources is 5; with allocatable shrunk to 4k millicpu as above, eviction fires once the sum of reclaimed requests exceeds 4k * 5 = 20k millicpu. You can change the threshold dynamically with KCC.

```
$ kubectl create -f ./examples/kcc-eviction-reclaimed-resource-config.yaml
```

##### Memory

Memory eviction is implemented in two parts: numa-level eviction and system-level eviction. The former is used along with the numa-binding enhancement, while the latter is used for more general cases; this tutorial mainly demonstrates the latter. For each level, katalyst triggers memory eviction based on memory usage and the kswapd active rate, so as to stay clear of the kernel's slow path for memory allocation.

Apply several pods (including shared_cores and reclaimed_cores).

```
$ kubectl create -f ./examples/shared-large-pod.yaml -f ./examples/reclaimed-large-pod.yaml
```

Apply KCC to alter the default free-memory and kswapd-rate thresholds.

```
$ kubectl create -f ./examples/kcc-eviction-memory-system-config.yaml
```

###### For Memory Usage

Exec into reclaimed-large-pod-2 and request enough memory. When free memory falls below the target, eviction is triggered for pods with reclaimed_cores, choosing the pod that uses the most memory.

```
$ kubectl exec -it reclaimed-large-pod-2 -- bash
$ stress --vm 1 --vm-bytes 175G --vm-hang 1000 --verbose
```
```
$ kubectl get event -A | grep evict
default   2m40s   Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default   2m5s    Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
```
```
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"
```

###### For Kswapd

Log in to the working node and put pressure on system memory. When the kswapd active rate exceeds the target threshold (default = 1), eviction is triggered for pods with both reclaimed_cores and shared_cores, but pods with reclaimed_cores are evicted ahead of those with shared_cores.
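To confirm that kswapd is actually being woken up while you apply the pressure, you can watch the kernel's page-reclaim counters. This is a sketch; the exact counter names can vary slightly across kernel versions:

```
# pgscan_kswapd / pgsteal_kswapd grow only while kswapd is reclaiming pages.
$ watch -n1 'grep -E "pgscan_kswapd|pgsteal_kswapd" /proc/vmstat'
```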
```
$ stress --vm 1 --vm-bytes 180G --vm-hang 1000 --verbose
```
```
$ kubectl get event -A | grep evict
default   2m2s   Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default   92s    Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
```
```
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"
```

##### Load

For pods with shared_cores, if any pod spawns too many threads, the cfs scheduling period may be split into small pieces, making throttling more frequent and thus hurting workload performance. To solve this, katalyst implements load eviction: it watches the load level and triggers taint and eviction actions based on thresholds, using the comparison formulas below.

`soft: load > resource_pool_cpu_amount`

`hard: load > resource_pool_cpu_amount * threshold`

Apply several pods (including shared_cores and reclaimed_cores).

```
$ kubectl create -f ./examples/shared-large-pod.yaml -f ./examples/reclaimed-large-pod.yaml
```

Put pressure on the pod with shared_cores until the load exceeds the soft bound. In this case, a taint is added to the CNR to avoid scheduling new pods, but the existing pods keep running.

```
$ kubectl exec shared-large-pod-2 -it -- stress -c 50
```
```
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/CPUPressure
  timeAdded: "2023-01-05T05:26:51Z"
```

Put more pressure on until the load exceeds the hard bound. In this case, katalyst evicts the pods that have created the largest number of threads.

```
$ kubectl exec shared-large-pod-2 -it -- stress -c 100
```
```
$ kubectl get event -A | grep evict
67s   Normal   EvictCreated     pod/shared-large-pod-2   Successfully create eviction; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
68s   Normal   Killing          pod/shared-large-pod-2   Stopping container stress
32s   Normal   EvictSucceeded   pod/shared-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
```

#### Centralized Eviction

In some cases, agents may suffer from single-point problems: in a large cluster, the daemon on a node may fail for all sorts of abnormal reasons, leaving the pods running on that node out of control. To relieve this, katalyst's centralized eviction tries to evict all reclaimed pods on such nodes.
By default, if a node's readiness state keeps failing for 10 minutes, katalyst taints the CNR as unschedulable to make sure no more pods with reclaimed_cores can be scheduled onto the node; and if the readiness state keeps failing for 20 minutes, it tries to evict all pods with reclaimed_cores.

```
taints:
- effect: NoScheduleForReclaimedTasks
  key: node.kubernetes.io/unschedulable
```
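You can verify the taint by inspecting the CNR of the abnormal node; this sketch assumes the `cnr` short name is registered along with the katalyst CRDs:

```
$ kubectl get cnr node-1 -o yaml | grep -A 3 taints
```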
## Further More

We will provide more tutorials along with future feature releases.