## Introduction

This feature allows Volcano to schedule workloads only on nodes that carry specific labels, rather than on every node in the Kubernetes cluster.

In my case, the cluster has 10 nodes (10 GPUs per node): 5 for training and 5 for serving. I want Volcano to schedule training jobs (tfjob/pytorchjob/vcjob) on the training nodes, and the default scheduler to schedule online pods on the serving nodes.

![](images/node-selector-1.png)

If you only want to place workloads on training nodes, you can set nodeSelector or nodeAffinity in `Pod.Spec`. However, this does not work properly with Volcano's queue mechanism, because Volcano assumes it can use all 10 nodes and therefore has 100 GPUs. In fact, it can only use the 5 training nodes, which have 50 GPUs.

Suppose there are two queues, queue1 and queue2. The deserved GPUs below are what Volcano computes when it sees all 10 nodes:

|queue|weight|reclaimable|deserved GPUs|
|---|---|---|---|
|queue1|1|true|50|
|queue2|1|true|50|

If queue1 is already using 45 GPUs and I submit a queue2 training job that needs 10 GPUs, the job stays pending because the training nodes have only 5 free GPUs. In Volcano's view queue1 is not overused (45 is below its deserved 50), so Volcano does not reclaim queue1's jobs to release resources.

So it is necessary to tell the Volcano scheduler that it can only work on the training nodes, not on all nodes in the cluster. Then queue1 deserves only 25 GPUs, and using 45 GPUs counts as overused.

![](images/node-selector-2.png)

So I added nodeSelector as a command-line argument of the Volcano scheduler.

## Usage

In the following example, Volcano works only on nodes that have the label `nodeRole:training`, `nodeRole:dev`, or `gpuModel:tesla`. You can use any label names you like.

```yaml
...
    spec:
      serviceAccount: volcano-scheduler
      containers:
        - name: volcano-scheduler
          image: xx
          args:
            - --node-selector=nodeRole:training
            - --node-selector=nodeRole:dev
            - --node-selector=gpuModel:tesla
            - --logtostderr
            - --scheduler-conf=/volcano.scheduler/volcano-scheduler.conf
            - -v=5
            - 2>&1
...
```

## Design

Parse the nodeSelector values from the command-line arguments and store the labels in `SchedulerCache.nodeSelectorLabels`.

```go
// pkg/scheduler/cache/cache.go
type SchedulerCache struct {
	// added field: set of selected labels, keyed by "name:value"
	nodeSelectorLabels map[string]string
}
```

Add the filter logic to the node event handler. A node is kept if no selector is configured or if any of its labels matches one of the selected `name:value` pairs.

```go
// pkg/scheduler/cache/cache.go
func newSchedulerCache(config *rest.Config, schedulerName string, defaultQueue string, nodeSelector []string) *SchedulerCache {
	// ...
	sc.nodeInformer.Informer().AddEventHandlerWithResyncPeriod(
		cache.FilteringResourceEventHandler{
			FilterFunc: func(obj interface{}) bool {
				node, ok := obj.(*v1.Node)
				if !ok {
					klog.Errorf("Cannot convert to *v1.Node: %v", obj)
					return false
				}
				if !responsibleForNode(node.Name, mySchedulerPodName, c) {
					return false
				}
				// added code begins ==========================>
				// filter node by nodeSelector labels
				if len(sc.nodeSelectorLabels) == 0 {
					// no selector configured: keep every node
					return true
				}
				for labelName, labelValue := range node.Labels {
					key := labelName + ":" + labelValue
					if _, ok := sc.nodeSelectorLabels[key]; ok {
						return true
					}
				}
				klog.Infof("node %s ignore add/update/delete into schedulerCache", node.Name)
				return false
				// <========================== added code ends
			},
			Handler: cache.ResourceEventHandlerFuncs{
				AddFunc:    sc.AddNode,
				UpdateFunc: sc.UpdateNode,
				DeleteFunc: sc.DeleteNode,
			},
		},
		0,
	)
}
// ...
```
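The parsing step is described above but not shown, so the following is a minimal, standalone sketch of how the `--node-selector` values could be turned into the lookup set and matched against node labels. The helper names `parseNodeSelector` and `nodeSelected` are hypothetical and are not taken from the Volcano source. The sketch assumes each flag value has the form `name:value` and that malformed values are skipped.

```go
package main

import (
	"fmt"
	"strings"
)

// parseNodeSelector (hypothetical helper) converts repeated "name:value"
// flag values into a set keyed by the "name:value" string, which is the
// key format the filter in the design looks up.
func parseNodeSelector(args []string) map[string]string {
	labels := make(map[string]string, len(args))
	for _, arg := range args {
		parts := strings.SplitN(arg, ":", 2)
		if len(parts) != 2 {
			continue // assumption: silently skip values without a ':'
		}
		name := strings.TrimSpace(parts[0])
		value := strings.TrimSpace(parts[1])
		if name == "" || value == "" {
			continue // assumption: skip empty names or values
		}
		labels[name+":"+value] = ""
	}
	return labels
}

// nodeSelected (hypothetical helper) mirrors the filter logic: with no
// selector every node is accepted, otherwise a node is accepted when at
// least one of its labels matches a selected "name:value" pair.
func nodeSelected(selector, nodeLabels map[string]string) bool {
	if len(selector) == 0 {
		return true
	}
	for name, value := range nodeLabels {
		if _, ok := selector[name+":"+value]; ok {
			return true
		}
	}
	return false
}

func main() {
	selector := parseNodeSelector([]string{
		"nodeRole:training", "nodeRole:dev", "gpuModel:tesla",
	})
	fmt.Println(nodeSelected(selector, map[string]string{"nodeRole": "training"})) // true
	fmt.Println(nodeSelected(selector, map[string]string{"nodeRole": "serving"}))  // false
	fmt.Println(nodeSelected(nil, map[string]string{"nodeRole": "serving"}))       // true
}
```

Because the match is an OR across all configured pairs, a node labeled only `gpuModel:tesla` is selected even if its `nodeRole` is `serving`. Requiring all pairs to match would need a different check than the one in this design.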