volcano.sh/volcano@v1.9.0/docs/design/numa-aware.md (about) 1 # NUMA Aware Plugin 2 3 ## Backgrounds 4 5 When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time. Many workloads are not sensitive to this migration and thus work fine without any intervention. However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU management policies to determine some placement preferences on the node. 6 7 The CPU Manager and the Topology Manager are all Kubelet components, However There is the following limitation: 8 9 - The scheduler is not topology-aware. so it is possible to be scheduled on a node and then fail on the node due to the Topology Manager. this is unacceptable for Tensorflow job. If any worker or ps failed on node, the job will fail. 10 - The managers are node-level that results in an inability to match the best node for NUMA topology in the whole cluster. 11 12 ## Motivation 13 14 We target to resolve the limitation to make scheduler NUMA topology aware so as to achieve the following: 15 16 - Don't schedule pods to the nodes which NUMA topology don't match. 17 - Schedule pods to the best node for NUMA topology. 18 ## Goals 19 - Support cpu resource topology scheduling 20 - Support pod-level topology policies 21 22 ## Non-Goals 23 - Support other resources topology schedule, such as GPU. 24 25 26 ## Design Action 27 28 ### Node numa information 29 30 The kubelet has no interface about the cpu topology information externally, so we need to report it to volcano scheduler by ourselves. <br> So a new CRD is created to do it. 31 It is consistent of the following parts: 32 ```` 33 1. the topology policy on kubelet config 34 2. the cpu topology information 35 3. the cpu allocatable sets 36 4. the reserved cpu resource 37 ```` 38 For details, refer to [numatopo_types](https://github.com/volcano-sh/apis/blob/master/pkg/apis/nodeinfo/v1alpha1/numatopo_types.go) 39 40 41 ### Pod scheduling process 42 43  44 45 46 ### Pod-level topology 47 In the volcano job, it sets the different policies config for the specific task. 48 ``` 49 task: 50 - replicas: 1 51 name: "test-1" 52 topologyPolicy: single-numa-node 53 ... 54 - replicas: 1 55 name: "test-2" 56 topologyPolicy: best-effort 57 ... 58 ``` 59 There are the topology policies as same as the [Topology Manager](https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/): 60 ``` 61 1. single-numa-node 62 2. best-effort 63 3. restricted 64 4. none 65 ``` 66 67 ### Predicate function 68 69 for the pods with the topology policy, we need to predicate the matched node list. 70 71 | policy | action | 72 | :---- | :---- | 73 | none | 1. no filter action | 74 | best-effort | 1. filter out the node with the topology policy “best-effort”| 75 | restricted | 1. filter out the node with the topology policy “restricted” <br> 2. filter out the node that the cpu topology meets the cpu requirements for "restricted" | 76 | single-numa-node | 1. filter out the node with the topology policy “single-numa-node”; <br> 2. filter out the node that the cpu topology meets the cpu requirements for "single-numa-node" | 77 78 ### Priority function 79 80 Regardless of the topology policy, pod is hoped to be scheduled to the optimal node. <br> 81 So we select the best node by scoring all filtered nodes. 82 ``` 83 calculation formula: 84 score = weight * (100 - 100 * numaNodeNum / maxNumaNodeNum) 85 Arguments: 86 weight: the weight of the NUMA Aware Plugin, default is 1 87 numaNodeNum: the member of required NUMA node in the calculated node for meeting the request resources 88 maxNumaNodeNum: the maximum NUMA node number in all filtered nodes 89 ``` 90 91 For example: 92 93 ``` 94 There are three nodes to meet the cpu topology of the pod: 95 96 the numa node layout: 97 1. Node-A : need one numa node (0) 98 2. Node-B : need two numa node (0, 1) 99 3. Node-C : need four numa node (0, 1, 2, 3) 100 101 calculate the score 102 maxNumaNodeNum = 4 103 1. Node-A : 10 * (100 - 100 * 1 / 4) = 750 104 2. Node-B : 10 * (100 - 100 * 2 / 4) = 500 105 3. Node-C : 10 * (100 - 100 * 4 / 4) = 0 106 so the best node is Node-A. 107 108 ``` 109 110 For the usage details, please refer to the [NUMA Aware guide](../user-guide/how_to_use_numa_aware.md) 111 ## Drawbacks 112 113 Kubelet processes pods based on the creation time sequence of pods, but volcano uses a series of plugins to determine the scheduling order of pods, not just the creation time. This results in different resource NUMA allocations between kubelet and scheduler, even if the same algorithm is used with kubelet. 114 115 116 117 118 119