volcano.sh/volcano@v1.9.0/docs/design/numa-aware.md

     1  # NUMA Aware Plugin
     2  
     3  ## Background
     4  
     5  When the node runs many CPU-bound pods, the workload can move to different CPU cores depending on whether the pod is throttled and which CPU cores are available at scheduling time.  Many workloads are not sensitive to this migration and thus work fine without any intervention. However, in workloads where CPU cache affinity and scheduling latency significantly affect workload performance, the kubelet allows alternative CPU management policies to determine some placement preferences on the node.
     6  
     7  The CPU Manager and the Topology Manager are both kubelet components. However, they have the following limitations:
     8    
     9    - The scheduler is not topology-aware, so a pod can be scheduled to a node and then be rejected there by the Topology Manager. This is unacceptable for a TensorFlow job: if any worker or ps fails on a node, the whole job fails.
    10    - The managers work at node level, so they cannot pick the best node for a given NUMA topology across the whole cluster.
    11  
    12  ## Motivation
    13  
    14  We aim to remove these limitations by making the scheduler NUMA topology aware, so as to achieve the following:
    15      
    16     - Don't schedule pods to nodes whose NUMA topology doesn't match.
    17     - Schedule pods to the best node for NUMA topology.
    18  ## Goals
    19   - Support CPU resource topology scheduling
    20   - Support pod-level topology policies
    21  
    22  ## Non-Goals
    23   - Support topology scheduling for other resources, such as GPU.
    24   
    25  
    26  ## Design Action
    27  
    28  ### Node numa information
    29  
    30  The kubelet does not expose CPU topology information externally, so we need to report it to the Volcano scheduler ourselves. <br> A new CRD is created for this purpose.
    31  It consists of the following parts:
    32  ````
    33  1. the topology policy on kubelet config
    34  2. the cpu topology information
    35  3. the cpu allocatable sets
    36  4. the reserved cpu resource
    37  ````
    38  For details, refer to [numatopo_types](https://github.com/volcano-sh/apis/blob/master/pkg/apis/nodeinfo/v1alpha1/numatopo_types.go)
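
To make the four parts concrete, here is a minimal Go sketch of the kind of per-node record they describe. `NodeNumaInfo`, its field names, and `FreeCPUsPerNumaNode` are hypothetical and chosen only for illustration; the real CRD types are the ones defined in `numatopo_types.go` and may be shaped differently.

```go
package main

import "fmt"

// NodeNumaInfo is a hypothetical, simplified model of what the CRD reports
// for one node. The field names are illustrative, not the real API.
type NodeNumaInfo struct {
	TopologyPolicy  string      // 1. topology policy from the kubelet config
	CPUToNumaNode   map[int]int // 2. CPU topology: logical CPU ID -> NUMA node ID
	AllocatableCPUs []int       // 3. CPU IDs that can still be allocated to pods
	ReservedCPUs    int         // 4. number of CPUs reserved (not available to pods)
}

// FreeCPUsPerNumaNode counts the allocatable CPUs on each NUMA node.
func (n NodeNumaInfo) FreeCPUsPerNumaNode() map[int]int {
	free := make(map[int]int)
	for _, cpu := range n.AllocatableCPUs {
		if numaID, ok := n.CPUToNumaNode[cpu]; ok {
			free[numaID]++
		}
	}
	return free
}

func main() {
	info := NodeNumaInfo{
		TopologyPolicy:  "single-numa-node",
		CPUToNumaNode:   map[int]int{0: 0, 1: 0, 2: 1, 3: 1},
		AllocatableCPUs: []int{1, 2, 3}, // CPU 0 is the reserved one in this example
		ReservedCPUs:    1,
	}
	fmt.Println(info.TopologyPolicy, info.FreeCPUsPerNumaNode())
}
```

A scheduler-side view like `FreeCPUsPerNumaNode` is what the predicate and priority steps need: how many CPUs are still free on each NUMA node of a candidate node.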
    39  
    40  
    41  ### Pod scheduling process
    42  
    43  ![](./images/numa-aware-process.png) 
    44  
    45  
    46  ### Pod-level topology
    47  In a Volcano job, a different topology policy can be configured for each task.
    48  ```
    49  tasks:
    50    - replicas: 1
    51      name: "test-1"
    52      topologyPolicy: single-numa-node
    53      ...
    54    - replicas: 1
    55      name: "test-2"
    56      topologyPolicy: best-effort
    57      ...
    58  ```
    59  The available topology policies are the same as those of the [Topology Manager](https://kubernetes.io/docs/tasks/administer-cluster/topology-manager/):
    60  ```
    61    1. single-numa-node
    62    2. best-effort
    63    3. restricted
    64    4. none
    65  ```
    66  
    67  ### Predicate function
    68  
    69  For pods with a topology policy, the predicate function narrows the node list down to the nodes that match.
    70  
    71  | policy | action |
    72  | :----   | :---- |
    73  | none   | 1. no filter action |
    74  | best-effort | 1. keep only the nodes whose topology policy is "best-effort" |
    75  | restricted | 1. keep only the nodes whose topology policy is "restricted" <br> 2. keep only the nodes whose CPU topology meets the CPU requirements for "restricted" |
    76  | single-numa-node | 1. keep only the nodes whose topology policy is "single-numa-node" <br> 2. keep only the nodes whose CPU topology meets the CPU requirements for "single-numa-node" |
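
The table can be read as two checks per node: a policy match, then a CPU-topology check. The Go sketch below illustrates that reading and is not the plugin's actual code. `policyMatches` and `fitsSingleNumaNode` are hypothetical helpers. It assumes that an empty policy behaves like `none`, and that the `single-numa-node` requirement means one NUMA node alone has enough free CPUs. The CPU check for `restricted` is not modeled, because its exact requirements are not spelled out here.

```go
package main

import "fmt"

// policyMatches covers step 1 of each row: a pod with a topology policy is
// only placed on nodes configured with the same policy.
// Assumption: an empty policy is treated like "none" (no filter action).
func policyMatches(podPolicy, nodePolicy string) bool {
	if podPolicy == "" || podPolicy == "none" {
		return true
	}
	return podPolicy == nodePolicy
}

// fitsSingleNumaNode covers step 2 for "single-numa-node", assuming the
// requirement is that a single NUMA node has enough free CPUs on its own.
// freeCPUsPerNuma[i] is the number of free CPUs on NUMA node i.
func fitsSingleNumaNode(requestCPUs int, freeCPUsPerNuma []int) bool {
	for _, free := range freeCPUsPerNuma {
		if free >= requestCPUs {
			return true
		}
	}
	return false
}

func main() {
	free := []int{2, 6} // NUMA node 0 has 2 free CPUs, NUMA node 1 has 6

	// Same policy on pod and node, and NUMA node 1 can hold 4 CPUs.
	fmt.Println(policyMatches("single-numa-node", "single-numa-node") && fitsSingleNumaNode(4, free))
	// Policy mismatch: the node is filtered out regardless of free CPUs.
	fmt.Println(policyMatches("single-numa-node", "best-effort"))
	// 8 CPUs are free in total, but no single NUMA node can hold all 8.
	fmt.Println(fitsSingleNumaNode(8, free))
}
```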
    77  
    78  ### Priority function
    79  
    80  Regardless of the topology policy, a pod should be scheduled to the optimal node. <br>
    81  So we select the best node by scoring all the nodes that pass the filter.
    82  ```
    83  calculation formula:
    84       score = weight * (100 - 100 * numaNodeNum / maxNumaNodeNum)
    85       Arguments: 
    86           weight: the weight of the NUMA Aware Plugin, default is 1
    87           numaNodeNum: the number of NUMA nodes required on the node being scored to satisfy the requested resources
    88           maxNumaNodeNum: the maximum NUMA node number in all filtered nodes
    89  ```
    90  
    91  For example:
    92  
    93  ```
    94  Three nodes satisfy the CPU topology of the pod:
    95  
    96  the NUMA node layout:
    97     1. Node-A : needs one NUMA node (0)
    98     2. Node-B : needs two NUMA nodes (0, 1)
    99     3. Node-C : needs four NUMA nodes (0, 1, 2, 3)
   100  
   101  calculate the score (with weight = 10)
   102     maxNumaNodeNum = 4
   103     1. Node-A : 10 * (100 - 100 * 1 / 4) = 750
   104     2. Node-B : 10 * (100 - 100 * 2 / 4) = 500
   105     3. Node-C : 10 * (100 - 100 * 4 / 4) = 0
   106  so the best node is Node-A.
   107  
   108  ``` 
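
The same calculation can be sketched in Go. `numaScore` is a hypothetical helper that applies the formula directly. Integer arithmetic and the guard against a zero `maxNumaNodeNum` are assumptions of this sketch, and the weight of 10 is taken from the worked example rather than from the plugin default.

```go
package main

import "fmt"

// numaScore applies: score = weight * (100 - 100 * numaNodeNum / maxNumaNodeNum).
// Integer arithmetic is an assumption of this sketch.
func numaScore(weight, numaNodeNum, maxNumaNodeNum int) int {
	if maxNumaNodeNum <= 0 {
		return 0 // nothing to compare against; avoids division by zero
	}
	return weight * (100 - 100*numaNodeNum/maxNumaNodeNum)
}

func main() {
	names := []string{"Node-A", "Node-B", "Node-C"}
	// Number of NUMA nodes each candidate needs to satisfy the request.
	needed := map[string]int{"Node-A": 1, "Node-B": 2, "Node-C": 4}

	// maxNumaNodeNum is the largest requirement among the filtered nodes.
	maxNumaNodeNum := 0
	for _, n := range needed {
		if n > maxNumaNodeNum {
			maxNumaNodeNum = n
		}
	}

	best, bestScore := "", -1
	for _, name := range names {
		s := numaScore(10, needed[name], maxNumaNodeNum)
		fmt.Printf("%s: %d\n", name, s)
		if s > bestScore {
			best, bestScore = name, s
		}
	}
	fmt.Println("best node:", best)
}
```

The fewer NUMA nodes a candidate needs, the higher it scores, so a node that can serve the request from a single NUMA node always ranks first.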
   109  
   110  For the usage details, please refer to the [NUMA Aware guide](../user-guide/how_to_use_numa_aware.md)
   111  ## Drawbacks
   112  
   113  The kubelet processes pods in order of their creation time, but Volcano uses a series of plugins to determine the scheduling order of pods, not just the creation time. As a result, the NUMA resource allocation computed by the scheduler can differ from the one made by the kubelet, even when both use the same algorithm.
   114  