volcano.sh/volcano@v1.9.0/docs/design/node-selector.md

## Introduction

This feature allows Volcano to schedule workloads only on nodes that carry specific labels, rather than on every node in the Kubernetes cluster.

In my case, the cluster has 10 nodes (10 GPUs per node): 5 for training and 5 for serving. I want Volcano to schedule training jobs (tfjob/pytorchjob/vcjob) on the training nodes, and the default-scheduler to schedule online pods on the serving nodes.

![](./images/node-selector-1.png)

If you only want to place workloads on the training nodes, you can set nodeSelector or nodeAffinity in `Pod.Spec`. However, this does not work properly with Volcano's queue mechanism: Volcano assumes it can use all 10 nodes and all of their resources, so it believes it has 100 GPUs. In fact it can only use the 5 training nodes, so it has 50 GPUs.

Suppose there are two queues, queue1 and queue2:

|queue|weight|reclaimable|deserved GPUs|
|---|---|---|---|
|queue1|1|true|50|
|queue2|1|true|50|

If queue1 is already using 45 GPUs and I then submit a training job to queue2 that needs 10 GPUs, the job stays pending because only 5 GPUs are free on the training nodes. From Volcano's point of view queue1 is not overused (45 is below its deserved 50), so Volcano will not reclaim queue1's jobs to release resources.

So it is necessary to tell the Volcano scheduler that it can only work on the training nodes, not on all nodes in the cluster. Then queue1 deserves only 25 GPUs, and using 45 GPUs counts as overused.

![](./images/node-selector-2.png)

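The arithmetic in the queue example above can be sketched as follows. This is a simplified model, not Volcano's actual proportion logic: it assumes each queue's deserved share is the total GPU count split by weight, and that a queue is overused when its allocation exceeds that share. The `queue` type and the `deservedGPUs` helper are hypothetical names used only for illustration.

```go
package main

import "fmt"

// queue is a simplified stand-in for a Volcano queue; it is not the real API type.
type queue struct {
	name      string
	weight    int
	allocated int // GPUs currently in use
}

// deservedGPUs splits totalGPUs across queues in proportion to their weights.
// Simplifying assumptions: integer division, no capability or request caps.
func deservedGPUs(totalGPUs int, queues []queue) map[string]int {
	sum := 0
	for _, q := range queues {
		sum += q.weight
	}
	out := make(map[string]int, len(queues))
	if sum == 0 {
		return out
	}
	for _, q := range queues {
		out[q.name] = totalGPUs * q.weight / sum
	}
	return out
}

func main() {
	queues := []queue{
		{name: "queue1", weight: 1, allocated: 45},
		{name: "queue2", weight: 1, allocated: 0},
	}
	// 100 GPUs: what the scheduler sees without a node selector.
	// 50 GPUs: what it sees when restricted to the training nodes.
	for _, total := range []int{100, 50} {
		deserved := deservedGPUs(total, queues)
		for _, q := range queues {
			fmt.Printf("total=%d %s deserved=%d allocated=%d overused=%t\n",
				total, q.name, deserved[q.name], q.allocated, q.allocated > deserved[q.name])
		}
	}
}
```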
So I added nodeSelector as a command-line argument of the Volcano scheduler.

## Usage

In the following example, Volcano only works on nodes that have the label `nodeRole:training`, `nodeRole:dev`, or `gpuModel:tesla`. You can use any label name you like.

```yaml
# ...
spec:
  serviceAccount: volcano-scheduler
  containers:
    - name: volcano-scheduler
      image: xx
      args:
        - --node-selector=nodeRole:training
        - --node-selector=nodeRole:dev
        - --node-selector=gpuModel:tesla
        - --logtostderr
        - --scheduler-conf=/volcano.scheduler/volcano-scheduler.conf
        - -v=5
        - 2>&1
# ...
```

## Design

Parse the nodeSelector values from the command-line arguments and store the labels in `SchedulerCache.nodeSelectorLabels`.
```go
// pkg/scheduler/cache/cache.go
type SchedulerCache struct {
	// Added field. Keys have the form "name:value", which is how the
	// node filter looks them up.
	nodeSelectorLabels map[string]string
}
```
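
The parsing step might look roughly like the sketch below. It assumes every `--node-selector` value has the form `name:value` and that the map is used as a set keyed by `name:value`, matching the lookup in the node filter. The helper name `parseNodeSelectors` is hypothetical, and silently skipping malformed entries is an assumption, not confirmed behavior.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeSelectors is a hypothetical helper: it turns repeated
// --node-selector values such as "nodeRole:training" into a set keyed by
// "name:value". Entries without a colon, or with an empty name or value,
// are skipped.
func parseNodeSelectors(args []string) map[string]string {
	labels := make(map[string]string, len(args))
	for _, arg := range args {
		name, value, found := strings.Cut(arg, ":")
		name, value = strings.TrimSpace(name), strings.TrimSpace(value)
		if !found || name == "" || value == "" {
			continue
		}
		// The map is used as a set, so the value is left empty.
		labels[name+":"+value] = ""
	}
	return labels
}

func main() {
	labels := parseNodeSelectors([]string{"nodeRole:training", "gpuModel: tesla", "bad"})
	_, ok := labels["gpuModel:tesla"]
	fmt.Println(len(labels), ok)
}
```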

Add the filter logic to the node event handler.
```go
// pkg/scheduler/cache/cache.go
func newSchedulerCache(config *rest.Config, schedulerName string, defaultQueue string, nodeSelector []string) *SchedulerCache {
	// ...
	sc.nodeInformer.Informer().AddEventHandlerWithResyncPeriod(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				node, ok := obj.(*v1.Node)
				if !ok {
					klog.Errorf("Cannot convert to *v1.Node: %v", obj)
					return false
				}
				if !responsibleForNode(node.Name, mySchedulerPodName, c) {
					return false
				}
				// ===== added code begins =====
				// Filter nodes by the nodeSelector labels. With no selector
				// configured, every node is accepted.
				if len(sc.nodeSelectorLabels) == 0 {
					return true
				}
				// Accept the node if any one of its labels matches a selector.
				for labelName, labelValue := range node.Labels {
					key := labelName + ":" + labelValue
					if _, ok := sc.nodeSelectorLabels[key]; ok {
						return true
					}
				}
				klog.Infof("node %s ignore add/update/delete into schedulerCache", node.Name)
				return false
				// ===== added code ends =====
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sc.AddNode,
				UpdateFunc: sc.UpdateNode,
				DeleteFunc: sc.DeleteNode,
			},
		},
		0,
	)
	}
// ...
```
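
The matching rule inside the filter can be tried in isolation. The sketch below pulls it out into a standalone function, `matchesNodeSelector` (a hypothetical name, not part of the Volcano code base), so the OR semantics are easy to see: a node is accepted when any one of its labels matches any configured selector, and an empty selector set accepts every node.

```go
package main

import "fmt"

// matchesNodeSelector mirrors the filter logic above without the informer
// plumbing. selectors is keyed by "name:value"; an empty set accepts all nodes.
func matchesNodeSelector(nodeLabels, selectors map[string]string) bool {
	if len(selectors) == 0 {
		return true
	}
	for name, value := range nodeLabels {
		if _, ok := selectors[name+":"+value]; ok {
			return true
		}
	}
	return false
}

func main() {
	selectors := map[string]string{"nodeRole:training": "", "gpuModel:tesla": ""}

	training := map[string]string{"nodeRole": "training"}
	servingTesla := map[string]string{"nodeRole": "serving", "gpuModel": "tesla"}
	serving := map[string]string{"nodeRole": "serving"}

	fmt.Println(matchesNodeSelector(training, selectors))     // true
	fmt.Println(matchesNodeSelector(servingTesla, selectors)) // true: one matching label is enough
	fmt.Println(matchesNodeSelector(serving, selectors))      // false
	fmt.Println(matchesNodeSelector(serving, nil))            // true: no selector configured
}
```

Because one matching label is enough, a serving node that also carries `gpuModel:tesla` would still be picked up under the configuration in the Usage section. Labels used as selectors should therefore be applied only to nodes that Volcano is meant to manage.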