# Volcano Resource Reservation For Queue

@[qiankunli](https://github.com/qiankunli); Oct 11th, 2021

## Motivation

In my case, we use Volcano to schedule training jobs (tfjob/pytorchjob/vcjob) in a Kubernetes cluster. There are many groups, such as ad/recommend/tts, and each queue represents one group.
To keep resources fully utilized, we generally do not configure `queue.capability`. However, this lets a single queue use up all resources, so new jobs in other queues cannot run.
We therefore want to reserve some resources for a queue, so that any new job in that queue can be submitted immediately.

As [issue 1101](https://github.com/volcano-sh/volcano/issues/1101) mentions, Volcano should support resource reservation
for a specified queue. The requirements are as follows:
* Support reserving specified resources for a specified queue.
* Only non-preemption reservation is considered.
* Support enabling and disabling resource reservation for a specified queue dynamically, without restarting Volcano.
* Support specifying the reserved resources both as a hard amount and as a percentage.

@[Thor-wl](https://github.com/Thor-wl) has already provided a design doc, [Volcano Resource Reservation For Queue](https://github.com/volcano-sh/volcano/blob/master/docs/design/queue-resource-reservation-design.md).
I do not implement all of the features above. The supported features are as follows:

* Support reserving specified resources for a specified queue.
* Only non-preemption reservation is considered.
* Support enabling and disabling resource reservation for a specified queue dynamically, without restarting Volcano.
* Support specifying the reserved resources as a hard amount.

## Consideration
### Resource Request
* The reserved resource cannot exceed the total resource amount of the cluster in any dimension.
* If `capability` is set in a queue, the reserved resource must not exceed it in any dimension.

### Safety
* A malicious request for a large resource reservation will block jobs in other queues.

## Design
### API
```
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  reclaimable: true
  weight: 1
  guarantee:             # reservation key word
    resource:            # specified reserving resource
      cpu: 2
      memory: 4G
```

`guarantee.resource`: the list of resource categories to reserve and their amounts.

## Implementation

To support the guarantee mechanism, there are two scenarios to consider:
1. Supporting `spec.guarantee` during scheduling.
2. Creating a new queue whose `spec.guarantee` is not nil, or increasing the `spec.guarantee` of an existing queue.

### support `spec.guarantee` during scheduling

Suppose there are three queues and 30 GPUs in the cluster.

|queue/attr|guarantee GPUs|capability GPUs|realCapability GPUs|
|---|---|---|---|
|queue1|5|nil|30|
|queue2|nil|nil|25|
|queue3|nil|10|10|


```go
// /volcano/pkg/scheduler/plugins/proportion/proportion.go
type queueAttr struct {
	queueID api.QueueID
	name    string

	deserved  *api.Resource
	allocated *api.Resource
	request   *api.Resource
	// inqueue represents the resource request of the inqueue job
	inqueue        *api.Resource
	capability     *api.Resource
	realCapability *api.Resource
	guarantee      *api.Resource
}
```

In each scheduling cycle, the proportion plugin calculates `queueAttr.deserved` for every queue, which is the amount of resources the queue can use. When a new task is considered,
it can be scheduled if `queueAttr.deserved` is bigger than `queueAttr.allocated`.
1. `queueAttr.deserved` must be no smaller than `queueAttr.guarantee`.
2. If `queueAttr.guarantee` is not nil (like queue1), the 5 GPUs can only be used by queue1, even if no job is running in queue1. We use `queueAttr.realCapability` to represent the upper limit of resources that a queue can use.
   1. If `queueAttr.capability` is nil (like queue2), `realCapability = total resources - sum(other-queue.guarantee)`.
   2. If `queueAttr.capability` is not nil (like queue3), `realCapability = min(capability, total resources - sum(other-queue.guarantee))`.
   3. Replace `queueAttr.capability` with `queueAttr.realCapability` everywhere.

After doing this, the resources a queue owns are at least `queueAttr.guarantee` and at most `queueAttr.realCapability`.

### create a new queue whose `spec.guarantee` is not nil

Suppose there are three queues and 30 GPUs in the cluster, and the tasks in queue1/queue2/queue3 have used up all 30 GPUs.

|queue/attr|weight|deserved GPUs|guarantee GPUs|capability GPUs|realCapability GPUs|
|---|---|---|---|---|---|
|queue1|1|10|5|nil|30|
|queue2|1|10|nil|nil|25|
|queue3|1|10|nil|10|10|

Then we create queue4 and submit a new job (requesting 2 GPUs) to queue4:

|queue/attr|weight|deserved GPUs|guarantee GPUs|capability GPUs|realCapability GPUs|
|---|---|---|---|---|---|
|queue1|1|6|5|nil|20|
|queue2|1|6|nil|nil|15|
|queue3|1|6|nil|10|10|
|queue4|2|12|10|nil|25|

1. The overcommit plugin will reject the new job in queue4 because there are no free GPUs in the cluster. So we should change the logic: if `job.request < queue4.guarantee`, the job can become `Inqueue` whether or not there are free GPUs.
2. We should enable the reclaim action, so that Volcano can reclaim tasks from overused queues.

## Usage
Configure the guarantee for a queue:
```yaml
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: q1
spec:
  reclaimable: true
  weight: 1
  guarantee:             # reservation key word
    resource:            # specified reserving resource
      cpu: 2
      memory: 4G
```
Enable the reclaim action for the scheduler:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue,allocate,reclaim,backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
    - plugins:
      - name: overcommit
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
```
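
As a supplement to the Implementation section, the `realCapability` rule can be sketched in Go as below. This is a minimal illustration, not the actual proportion plugin code: it assumes a single resource dimension (GPUs as an `int64`) instead of `api.Resource`, and the `queue` type and the `realCapability` and `ptr` helpers are hypothetical names introduced only for this example.

```go
package main

import "fmt"

// queue is a simplified, hypothetical stand-in for queueAttr that tracks a
// single resource dimension (GPUs). A nil pointer means "not set".
type queue struct {
	name       string
	guarantee  *int64
	capability *int64
}

// realCapability applies the rule described in this document:
// total resources minus the guarantees of all other queues,
// capped by the queue's own capability when one is set.
func realCapability(q queue, all []queue, total int64) int64 {
	limit := total
	for _, other := range all {
		if other.name != q.name && other.guarantee != nil {
			limit -= *other.guarantee
		}
	}
	if q.capability != nil && *q.capability < limit {
		limit = *q.capability
	}
	return limit
}

// ptr is a small helper for building optional values.
func ptr(v int64) *int64 { return &v }

func main() {
	// The three queues from the first example table, with 30 GPUs in total.
	queues := []queue{
		{name: "queue1", guarantee: ptr(5)},
		{name: "queue2"},
		{name: "queue3", capability: ptr(10)},
	}
	for _, q := range queues {
		fmt.Printf("%s realCapability=%d\n", q.name, realCapability(q, queues, 30))
	}
}
```

With the three queues from the first example table, this yields 30, 25 and 10 GPUs for queue1, queue2 and queue3, matching the `realCapability GPUs` column.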