
# Tutorial - katalyst colocation best-practices
This guide introduces best practices for colocation in katalyst with an end-to-end example. Follow the steps below to take a glance at the integrated functionality, and then replace the sample yamls with your own workloads when applying the system in your production environment.

## Prerequisites
Please make sure you have deployed all prerequisite components before moving on to the next step.

- Install enhanced kubernetes based on [install-enhanced-k8s.md](../install-enhanced-k8s.md)
- Install components according to the instructions in [Charts](https://github.com/kubewharf/charts.git). To enable the full colocation functionality, the following components are required, while the others are optional.
    - Agent
    - Controller
    - Scheduler

## Functionalities
Before going to the next step, let's assume the following settings and configurations as the baseline:

- Total resources are set as 48 cores and 195924424Ki per node;
- Reserved resources for pods with shared_cores are set as 4 cores and 5Gi, which means we always keep at least this amount of resources aside for those pods to absorb bursts.

Based on the assumptions above, you can follow the steps below to dive into the colocation workflow.

### Resource Reporting
After installation, the resource reporting module will report reclaimed resources. Since there are no pods running yet, the reclaimed resources are calculated as:
`reclaimed_resources = total_resources - reserve_resources`
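
With the baseline above, this works out to 48 - 4 = 44 cores (i.e. `44k` millicpu) and 195924424Ki - 5Gi = 195257901056 bytes of memory, which is exactly what shows up in the CNR below.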

When you inspect the CNR (CustomNodeResource), the reclaimed resources will look as follows, which means that pods with reclaimed_cores can be scheduled onto this node.

```
status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_memory: "195257901056"
        katalyst.kubewharf.io/reclaimed_millicpu: 44k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_memory: "195257901056"
        katalyst.kubewharf.io/reclaimed_millicpu: 44k
```
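
You can check these values yourself by querying the CNR object for a node, e.g. (assuming the CustomNodeResource CRD is reachable under the short name `cnr`):

```
$ kubectl get cnr node-1 -o yaml
```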

Submit several pods with shared_cores, and put pressure on those workloads to make the reclaimed resources fluctuate along with the running state of the workload.

```
$ kubectl create -f ./examples/shared-normal-pod.yaml
```
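
For reference, a shared_cores pod is essentially an ordinary pod carrying the katalyst QoS-level annotation. The sketch below is only illustrative (the annotation key, image, and requests are assumptions); see `./examples/shared-normal-pod.yaml` for the real manifest:

```
apiVersion: v1
kind: Pod
metadata:
  name: shared-normal-pod
  annotations:
    "katalyst.kubewharf.io/qos_level": shared_cores   # assumed QoS-level annotation key
spec:
  containers:
  - name: stress
    image: polinux/stress                   # illustrative image that ships the `stress` tool
    command: ["stress", "-c", "1"]          # keeps roughly one core busy, as described below
    resources:
      requests:
        cpu: "2"
        memory: 2Gi
      limits:
        cpu: "2"
        memory: 2Gi
```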

After being successfully scheduled, the pod starts running with cpu usage ~= 1 core and cpu load ~= 1, and the reclaimed resources will change according to the formula below. We skip memory here since it is harder to reproduce with an accurate value than cpu, but the principle is similar.
`reclaimed_cpu = allocatable - round(ceil(reserve + max(usage, load.1min, load.5min)))`

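For example, with 48 allocatable cores, a 4-core reserve, and the shared pod's usage/load hovering around 1 to 2 cores, the reclaimed cpu works out to roughly `48 - ceil(4 + 2) = 42` cores, i.e. the `42k` millicpu shown below; the exact value depends on the sampled usage and load.
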
```
status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 42k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 42k
```

You can then put pressure on those pods with `stress` to simulate request peaks; the cpu load will rise to approximately 3 (the original workload thread plus the two stress workers), shrinking the reclaimed cpu to 40k.

```
$ kubectl exec shared-normal-pod -it -- stress -c 2
```
```
status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 40k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 40k
```

### Scheduling Strategy
Katalyst provides several scheduling strategies for pods with reclaimed_cores. You can alter the default scheduling config, and then create a deployment with reclaimed_cores.

```
$ kubectl create -f ./examples/reclaimed-deployment.yaml
```
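
For reference, the reclaimed deployment is, roughly, a Deployment whose pod template carries the reclaimed_cores QoS annotation and requests the reclaimed resources reported through the CNR. The sketch below is only illustrative (the annotation key, resource names, replica count and image are assumptions); see `./examples/reclaimed-deployment.yaml` for the real manifest:

```
apiVersion: apps/v1
kind: Deployment
metadata:
  name: reclaimed-pod
spec:
  replicas: 10
  selector:
    matchLabels:
      app: reclaimed-pod
  template:
    metadata:
      labels:
        app: reclaimed-pod
      annotations:
        "katalyst.kubewharf.io/qos_level": reclaimed_cores   # assumed QoS-level annotation key
    spec:
      containers:
      - name: busy
        image: busybox                     # illustrative image
        command: ["sleep", "3600"]
        resources:
          requests:
            # assumed to mirror the reclaimed resource names reported in the CNR above
            katalyst.kubewharf.io/reclaimed_millicpu: 2k
            katalyst.kubewharf.io/reclaimed_memory: 2Gi
          limits:
            katalyst.kubewharf.io/reclaimed_millicpu: 2k
            katalyst.kubewharf.io/reclaimed_memory: 2Gi
```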

#### Spread
Spread is the default scheduling strategy. It tries to spread pods across all suitable nodes, and is usually used to balance workload contention in the cluster. Apply the Spread policy with the commands below, and the pods will be scheduled evenly across the nodes.

```
$ kubectl apply -f ./examples/scheduler-policy-spread.yaml
$ kubectl get po -owide
```
```
NAME                             READY   STATUS    RESTARTS   AGE     IP              NODE          NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-4lknl   1/1     Running   0          3m31s   192.168.1.169   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-656bz   1/1     Running   0          3m31s   192.168.2.103   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-89n46   1/1     Running   0          3m31s   192.168.0.129   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-bcpbs   1/1     Running   0          3m31s   192.168.1.171   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-bq22q   1/1     Running   0          3m31s   192.168.0.126   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-jblgk   1/1     Running   0          3m31s   192.168.0.128   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-kxqdl   1/1     Running   0          3m31s   192.168.0.127   node-3   <none>           <none>
reclaimed-pod-5f7f69d7b8-mdh2d   1/1     Running   0          3m31s   192.168.1.170   node-1   <none>           <none>
reclaimed-pod-5f7f69d7b8-p2q7s   1/1     Running   0          3m31s   192.168.2.104   node-2   <none>           <none>
reclaimed-pod-5f7f69d7b8-x7lqh   1/1     Running   0          3m31s   192.168.2.102   node-2   <none>           <none>
```

#### Binpack
Binpack tries to schedule pods onto a single node until that node cannot fit any more, and it is usually used to pack workloads onto fewer nodes to raise utilization. Apply the Binpack policy with the commands below, and most pods will be scheduled onto one densely packed node.

```
$ kubectl apply -f ./examples/scheduler-policy-binpack.yaml
$ kubectl get po -owide
```
```
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-7mjbz   1/1     Running   0          36s   192.168.1.176   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-h8nmk   1/1     Running   0          36s   192.168.1.177   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-hfhqt   1/1     Running   0          36s   192.168.1.181   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-nhx4h   1/1     Running   0          36s   192.168.1.182   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-s8sx7   1/1     Running   0          36s   192.168.1.178   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-szn8z   1/1     Running   0          36s   192.168.1.180   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-vdm7c   1/1     Running   0          36s   192.168.0.133   node-3    <none>           <none>
reclaimed-pod-5f7f69d7b8-vrr8w   1/1     Running   0          36s   192.168.1.179   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-w9hv4   1/1     Running   0          36s   192.168.2.109   node-2    <none>           <none>
reclaimed-pod-5f7f69d7b8-z2wqv   1/1     Running   0          36s   192.168.4.200   node-4    <none>           <none>
```

#### Custom
Besides the in-tree policies, you can also use self-defined scoring functions to customize the scheduling strategy. In the example below, we use a self-defined RequestedToCapacityRatio scorer as the scheduling policy, and it behaves much like the Binpack policy.
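
For background, the sketch below shows how a RequestedToCapacityRatio scorer is typically expressed in upstream kube-scheduler configuration, only to illustrate the scoring shape (higher score at higher utilization, i.e. binpack-like behavior); the file actually applied here is `./examples/scheduler-policy-custom.yaml`, which may take a different form:

```
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- pluginConfig:
  - name: NodeResourcesFit
    args:
      scoringStrategy:
        type: RequestedToCapacityRatio
        requestedToCapacityRatio:
          shape:
          - utilization: 0
            score: 0
          - utilization: 100
            score: 10
        resources:
        - name: cpu
          weight: 1
```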
```
$ kubectl apply -f ./examples/scheduler-policy-custom.yaml
$ kubectl get po -owide
```
```
NAME                             READY   STATUS    RESTARTS   AGE   IP              NODE      NOMINATED NODE   READINESS GATES
reclaimed-pod-5f7f69d7b8-547zk   1/1     Running   0          7s    192.168.1.191   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-6jzbs   1/1     Running   0          6s    192.168.1.193   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-6v7kr   1/1     Running   0          7s    192.168.2.111   node-2    <none>           <none>
reclaimed-pod-5f7f69d7b8-9vrb9   1/1     Running   0          6s    192.168.1.192   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-dnn7n   1/1     Running   0          7s    192.168.4.204   node-4   <none>           <none>
reclaimed-pod-5f7f69d7b8-jtgx9   1/1     Running   0          7s    192.168.1.189   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-kjrlv   1/1     Running   0          7s    192.168.0.139   node-3    <none>           <none>
reclaimed-pod-5f7f69d7b8-mr85t   1/1     Running   0          6s    192.168.1.194   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-q4dz5   1/1     Running   0          7s    192.168.1.188   node-1    <none>           <none>
reclaimed-pod-5f7f69d7b8-v28nv   1/1     Running   0          7s    192.168.1.190   node-1    <none>           <none>
```

### QoS Controlling
After pods are successfully scheduled, katalyst enters its main QoS-controlling loop to dynamically adjust resource allocations on each node. In the current version, we use cpusets to isolate the scheduling domain for pods in each pool, and memory limits to restrict the upper bound of memory usage.

Before going to the next step, remember to delete the previous pods to start from a clean environment.

Apply a pod with shared_cores with the command below. After the ramp-up period, the cpuset for the shared pool will contain 6 cores in total (i.e. 4 cores reserved for bursting requirements, plus 2 cores for the pod's regular request), and the remaining cores are considered available for reclaimed pods.

```
$ kubectl create -f ./examples/shared-normal-pod.yaml
```
```
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan  3 16:18:31 CST 2023
11,22-23,35,46-47
```
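
If you prefer not to rely on the helper script, you can usually read the same information directly from the container's cgroup, for example (assuming a cgroup v1 layout; the path differs under cgroup v2):

```
$ kubectl exec shared-normal-pod -- cat /sys/fs/cgroup/cpuset/cpuset.cpus
```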

Apply a pod with reclaimed_cores with the command below, and the cpuset for the reclaimed pool will be the remaining 42 cores.

```
$ kubectl create -f ./examples/reclaimed-normal-pod.yaml
```
```
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan  3 16:23:20 CST 2023
0-10,12-21,24-34,36-45
```

Put pressure on the previous pod with shared_cores to make its load rise to 3, and the cpuset for the shared pool will grow to 8 cores in total (i.e. 4 cores reserved for bursting requirements, plus 4 cores for the regular requirements). The cores for the reclaimed pool will shrink to 40 accordingly.

```
$ kubectl exec shared-normal-pod -it -- stress -c 2
```
```
root@node-1:~# ./examples/get_cpuset.sh shared-normal-pod
Tue Jan  3 16:25:23 CST 2023
10-11,22-23,34-35,46-47
```
```
root@node-1:~# ./examples/get_cpuset.sh reclaimed-normal-pod
Tue Jan  3 16:28:32 CST 2023
0-9,12-21,24-33,36-45
```

### Pod Eviction
Eviction is usually used as a last-resort measure in case the QoS requirements fail to be satisfied, and we should always make sure pods with higher priority (i.e. shared_cores) meet their QoS by evicting pods with lower priority (i.e. reclaimed_cores). Katalyst contains both agent-side and centralized evictions to meet different requirements.

Before going to the next step, remember to delete the previous pods to start from a clean environment.

#### Agent Eviction
Currently, katalyst provides several in-tree agent eviction implementations.

##### Resource OverCommit
Since reclaimed resources always fluctuate with the running state of the pods with shared_cores, they may shrink to a critical point where pods with reclaimed_cores can no longer run properly. In this case, katalyst will evict those pods so that they can be rebalanced to other nodes, and the comparison formula is as follows:
`sum(requested_reclaimed_resource) > allocatable_reclaimed_resource * threshold`
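
For example (with illustrative numbers): if pods with reclaimed_cores request 30 cores in total while the allocatable reclaimed cpu has shrunk to 4 cores and the threshold is 5, then `30k > 4k * 5` holds and the eviction is triggered.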

Apply several pods (including shared_cores and reclaimed_cores), and put some pressure on them to reduce the allocatable reclaimed resources until they fall below the tolerance threshold. This will eventually trigger the eviction of pod reclaimed-large-pod-2.

```
$ kubectl create -f ./examples/shared-large-pod.yaml -f ./examples/reclaimed-large-pod.yaml
```
```
$ kubectl exec shared-large-pod-2 -it -- stress -c 40
```
```
status:
    resourceAllocatable:
        katalyst.kubewharf.io/reclaimed_millicpu: 4k
    resourceCapacity:
        katalyst.kubewharf.io/reclaimed_millicpu: 4k
```
```
$ kubectl get event -A | grep evict
default     43s         Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
default     8s          Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: katalyst.kubewharf.io/reclaimed_millicpu from plugin: reclaimed-resource-pressure-eviction-plugin
```

The default threshold for reclaimed resources is 5; you can change it dynamically with KCC.

```
$ kubectl create -f ./examples/kcc-eviction-reclaimed-resource-config.yaml
```

##### Memory
Memory eviction is implemented in two parts: numa-level eviction and system-level eviction. The former is used along with the numa-binding enhancement, while the latter is used for more general cases. In this tutorial, we will mainly demonstrate the latter. For each level, katalyst triggers memory eviction based on memory usage and the kswapd active rate, to avoid hitting the slow path of memory allocation in the kernel.

Apply several pods (including shared_cores and reclaimed_cores).

```
$ kubectl create -f ./examples/shared-large-pod.yaml -f ./examples/reclaimed-large-pod.yaml
```

Apply KCC to alter the default free-memory and kswapd rate thresholds.

```
$ kubectl create -f ./examples/kcc-eviction-memory-system-config.yaml
```

###### For Memory Usage
Exec into reclaimed-large-pod-2 and allocate enough memory. When free memory falls below the target, eviction is triggered for pods with reclaimed_cores, choosing the pod that uses the most memory.

```
$ kubectl exec -it reclaimed-large-pod-2 -- bash
$ stress --vm 1 --vm-bytes 175G --vm-hang 1000 --verbose
```
```
$ kubectl get event -A | grep evict
default     2m40s       Normal   EvictCreated     pod/reclaimed-large-pod-2   Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default     2m5s        Normal   EvictSucceeded   pod/reclaimed-large-pod-2   Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
```
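
In addition to the eviction events, a memory-pressure taint is reported to keep new pods away while the pressure lasts; you can look it up with something like `kubectl get cnr node-1 -o yaml` (whether it lands on the CNR or the Node may depend on your setup):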
```
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"
```

###### For Kswapd
Log into the working node and put some pressure on system memory. When the kswapd active rate exceeds the target threshold (default = 1), eviction is triggered for pods with both reclaimed_cores and shared_cores, but pods with reclaimed_cores are evicted before those with shared_cores. The same memory-pressure taint as above will also be reported.

```
$ stress --vm 1 --vm-bytes 180G --vm-hang 1000 --verbose
```
```
$ kubectl get event -A | grep evict
default           2m2s        Normal    EvictCreated              pod/reclaimed-large-pod-2          Successfully create eviction; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
default           92s         Normal    EvictSucceeded            pod/reclaimed-large-pod-2          Evicted pod has been deleted physically; reason: met threshold in scope: memory from plugin: memory-pressure-eviction-plugin
```
```
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/MemoryPressure
  timeAdded: "2023-01-09T06:32:08Z"
```

##### Load
For pods with shared_cores, if any pod creates too many threads, the cfs scheduling period may be split into small pieces, making throttling more frequent and thus hurting workload performance. To solve this, katalyst implements load eviction: it tracks the load of each pool and triggers taint and eviction actions based on thresholds, and the comparison formulas are as follows.
`soft: load > resource_pool_cpu_amount`
`hard: load > resource_pool_cpu_amount * threshold`
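
For instance (with illustrative numbers): if the pool holds 48 cores and the hard threshold is 2, a load of 50 exceeds the soft bound (50 > 48) and only triggers the taint, while a load of 100 exceeds the hard bound (100 > 96) and triggers eviction; this is roughly what the two `stress` commands below demonstrate.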

Apply several pods (including shared_cores and reclaimed_cores).

```
$ kubectl create -f ./examples/shared-large-pod.yaml -f ./examples/reclaimed-large-pod.yaml
```

Put some pressure on the pods until the load exceeds the soft bound. In this case, a taint will be added to the CNR to avoid scheduling new pods, but the existing pods will keep running.

```
$ kubectl exec shared-large-pod-2 -it -- stress -c 50
```
```
taints:
- effect: NoSchedule
  key: node.katalyst.kubewharf.io/CPUPressure
  timeAdded: "2023-01-05T05:26:51Z"
```

Put more pressure on the pods until the load exceeds the hard bound. In this case, katalyst will evict the pods that create the most threads.

```
$ kubectl exec shared-large-pod-2 -it -- stress -c 100
```
```
$ kubectl get event -A | grep evict
67s         Normal   EvictCreated     pod/shared-large-pod-2      Successfully create eviction; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
68s         Normal   Killing          pod/shared-large-pod-2      Stopping container stress
32s         Normal   EvictSucceeded   pod/shared-large-pod-2      Evicted pod has been deleted physically; reason: met threshold in scope: cpu.load.1min.container from plugin: cpu-pressure-eviction-plugin
```

#### Centralized Eviction
In some cases, the agents may suffer from single-point problems, i.e. in a large cluster, the daemon may fail to work because of various abnormal cases, leaving the pods running on that node out of control. To relieve this, katalyst's centralized eviction will try to evict all reclaimed pods on such nodes.
By default, if the agent's readiness state keeps failing for 10 minutes, katalyst will taint the CNR as unschedulable to make sure no more pods with reclaimed_cores can be scheduled onto this node; and if the readiness state keeps failing for 20 minutes, it will try to evict all pods with reclaimed_cores.

```
taints:
- effect: NoScheduleForReclaimedTasks
  key: node.kubernetes.io/unschedulable
```

## Further More
We will try to provide more tutorials along with future feature releases.