# Namespace fair share

[@lminzhw](http://github.com/lminzhw); May 8, 2019

## Motivation

`Queue` was introduced in [kube-batch](http://github.com/kubernetes-sigs/kube-batch) to share resources among users.

However, all users in the same `Queue` are treated equally during scheduling. For example, suppose a `Queue` holds a small amount of resources, with 10 pods belonging to UserA and 1000 pods belonging to UserB. In this case, the pods of UserA are less likely to be bound to a node.

So we need a more fine-grained strategy to balance resource usage among users in the same `Queue`.

Following the multi-user model of Kubernetes, we use namespaces to distinguish users. Each namespace has a weight that controls its resource usage.

## Function Specification

The weight has these features:
> 1. `Queue` level
> 2. an `integer` with default value 1
> 3. recorded in the namespace `quota`
> 4. a higher value means more resources after balancing

### where is the weight

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  namespace: default
spec:
  hard:
    limits.memory: 2Gi
    volcano.sh/namespace.weight: 1  # this field represents the weight of this namespace
```

If several `ResourceQuota` objects in the same namespace carry a weight, the weight of the namespace is the highest of them.

The weight must be positive; any invalid value is treated as the default value 1.

### Scheduler Framework

Introduce two new fields in SchedulerCache

```go
// NamespaceInfo holds the scheduling data kept per namespace.
type NamespaceInfo struct {
    Weight int
}

type SchedulerCache struct {
    ...
    quotaInformer infov1.ResourceQuotaInformer
    ...
    NamespaceInfo map[string]*kbapi.NamespaceInfo
    ...
}
```

The scheduler watches the lifecycle of `ResourceQuota` objects through `quotaInformer` and refreshes the data in `NamespaceInfo`.
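
The weight rules above (highest weight wins, invalid values fall back to 1) can be sketched as follows. This is only an illustration under simplifying assumptions, not the scheduler's code: `quotaHard` is a hypothetical stand-in for the `spec.hard` map of a `ResourceQuota` with values kept as plain strings, and `namespaceWeight` is a hypothetical helper name.

```go
package main

import (
	"fmt"
	"strconv"
)

// weightKey is the ResourceQuota hard-limit key named in this document.
const weightKey = "volcano.sh/namespace.weight"

// defaultWeight is used when no quota carries a valid weight.
const defaultWeight = 1

// quotaHard is a simplified, hypothetical stand-in for the `spec.hard`
// map of one ResourceQuota (values kept as raw strings).
type quotaHard map[string]string

// namespaceWeight returns the highest valid (positive integer) weight
// found across the quotas of one namespace, or defaultWeight if none.
func namespaceWeight(quotas []quotaHard) int {
	weight := 0
	for _, hard := range quotas {
		raw, ok := hard[weightKey]
		if !ok {
			continue
		}
		w, err := strconv.Atoi(raw)
		if err != nil || w <= 0 {
			continue // invalid values never raise the result above the default
		}
		if w > weight {
			weight = w
		}
	}
	if weight == 0 {
		return defaultWeight
	}
	return weight
}

func main() {
	quotas := []quotaHard{
		{"limits.memory": "2Gi", weightKey: "2"},
		{weightKey: "5"},
		{weightKey: "-3"}, // invalid: not positive
	}
	fmt.Println(namespaceWeight(quotas)) // highest valid weight
	fmt.Println(namespaceWeight(nil))    // no quota at all: default weight
}
```

Skipping invalid values and treating them as 1 give the same result here, because every valid weight is at least 1.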

In the `openSession` function, the `NamespaceInfo` should be passed through `cache.Snapshot` into the `Session` by using a new field in the `Session`/`ClusterInfo` structs.

```go
type Session struct {
    ...
    NamespaceInfo map[string]*kbapi.NamespaceInfo
    ...
}
type ClusterInfo struct {
    ...
    NamespaceInfo map[string]*kbapi.NamespaceInfo
    ...
}
```

### Allocate Action

#### Scheduling loop

The `allocate` action schedules the jobs in a `Queue` one by one.

At the beginning of each scheduling round, it takes the job with the highest priority from the `Queue` and tries to schedule the tasks that belong to it until the job is ready (matches the minMember). Then it goes to the next round.

The job priority mentioned above is defined by the `JobOrder` functions registered by plugins, such as the job ready order of the Gang plugin, the priority order of the Priority plugin, and the share order of the DRF plugin.

#### JobOrder

Namespace weight `should not` be implemented with a JobOrder func, because scheduling one job would affect the priority of the others.

> e.g.
>
> ns1 has job1, job2, ns2 has job3, job4. The original order is job1-job2-job3-job4.
>
> After the scheduling of job1, the right order should be job3-job4-job2. But in a priority queue, we have no chance to fix the priority of job2.

#### Namespace Order

To add namespace weight, we introduce a new order function named `NamespaceOrder` in `Session`.

```go
type Session struct {
    ...
    NamespaceOrderFns map[string]api.CompareFn
    ...
}
```

The scheduling loop in allocate should change as follows.

In the scheduling loop, first choose the namespace with the highest priority by calling `NamespaceOrderFn`, then choose the job with the highest priority in this namespace using `JobOrderFn`.

After scheduling a job, push the namespace and the job back into the priority queue so that their priorities are refreshed.
Once a job is scheduled, the resources assigned to it may lower the priority of its namespace, so the other jobs in the same namespace may be scheduled later.

Always assigning resources to the namespace with the highest priority (the lowest resource usage) in every turn keeps resource usage balanced.

### DRF plugin

The DRF plugin uses preemption and job ordering to balance resources among jobs. The `share` in this plugin is defined as resource usage: a higher `share` means the job currently occupies more resources.

#### Namespace Compare

To introduce namespace weight into this plugin, we first define how to compare namespaces that have weights.

For namespace n1 with weight w1 and namespace n2 with weight w2, we compute the resource `share` of each and record them as u1 and u2. Namespace n1 uses less resource than n2 when (u1 / w1 < u2 / w2).

`e.g.` ns1 with weight w1=3 uses 6 cpu, and ns2 with weight w2=1 uses 3 cpu. In the scope of cpu, ns1 uses less resource than ns2. (6 / 3 < 3 / 1)

#### Namespace Order

Register a `NamespaceOrder` function that uses the strategy mentioned above.

#### preemption

> The `preempt` action is disabled now. Do this later.

Currently, the `preemption` function simply compares the resource share among jobs.

After adding namespace weight, we should first check the namespaces of the preemptor and the preemptee. A job in the namespace with lower resource usage can preempt the others; when the namespace resource usage is the same, compare the share of the jobs instead.

### Feature Interaction

#### preempt action

Preempt is a set of strategies that chooses victims and finally evicts them.

Victims are chosen by a set of functions named `Preemptable` registered by plugins, such as the job ready protection of the Gang plugin, the special pod protection of the Conformance plugin, and the job share balance strategy of the DRF plugin.

Each of these plugins chooses its own victims, and the intersection of those choices is the final victim set. So the choice made by the DRF plugin never breaks the requirements of the others.

### Shortcomings

1. Preempt may cause some running pods to be killed.

### Cases:

- cluster has __16 cpu__, queue and namespace have default weight.

| queue | namespace | requested | queue assigned | namespace assigned |
| ----- | --------- | --------- | -------------- | ------------------ |
| q1    | ns1       | 5 cpu     | 8 cpu          | 4 cpu              |
|       | ns2       | 10 cpu    |                | 4 cpu              |
| q2    | ns3       | 10 cpu    | 8 cpu          | 6 cpu              |
|       | ns4       | 2 cpu     |                | 2 cpu              |

- cluster has __16 cpu__, q1 with weight 1, q2 with weight 3. ns1 with weight 3, ns2 with weight 1, ns3 with weight 2, ns4 with weight 6.

| queue | namespace | requested | queue assigned | namespace assigned |
| ----- | --------- | --------- | -------------- | ------------------ |
| q1 w1 | ns1 w3    | 5 cpu     | 4 cpu          | 3 cpu              |
|       | ns2 w1    | 10 cpu    |                | 1 cpu              |
| q2 w3 | ns3 w2    | 10 cpu    | 12 cpu         | 10 cpu             |
|       | ns4 w6    | 2 cpu     |                | 2 cpu              |

- cluster has __16 cpu__, q1 with weight 1, q2 with weight 3. ns1 with weight 2, ns2 with weight 6.

| queue | namespace | requested | queue assigned | namespace assigned |
| ----- | --------- | --------- | -------------- | ------------------ |
| q1 w1 | ns1 w2    |           | 4 cpu          |                    |
| q2 w3 | ns1 w2    | 5 cpu     | 12 cpu         | 3 cpu              |
|       | ns2 w6    | 20 cpu    |                | 9 cpu              |
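
The "namespace assigned" column of the cases can be reproduced with a small simulation of the namespace comparison (u1 / w1 < u2 / w2). The sketch below is an illustration only and makes several assumptions: it hands out one cpu at a time instead of scheduling jobs and tasks, it breaks ties in favor of the namespace listed first, and the `namespace` type, `lessUsage`, and `allocate` are hypothetical names that do not come from the scheduler.

```go
package main

import "fmt"

// namespace is a hypothetical, simplified record used only for this sketch.
type namespace struct {
	name      string
	weight    int // namespace weight (positive)
	requested int // cpu requested in total
	assigned  int // cpu assigned so far
}

// lessUsage reports whether a uses less than b, i.e.
// a.assigned/a.weight < b.assigned/b.weight. It cross-multiplies
// so the comparison stays in exact integer arithmetic.
func lessUsage(a, b *namespace) bool {
	return a.assigned*b.weight < b.assigned*a.weight
}

// allocate hands out capacity one cpu at a time, always to the
// unsatisfied namespace with the lowest weighted usage. Ties go to
// the namespace listed first.
func allocate(capacity int, nss []*namespace) {
	for ; capacity > 0; capacity-- {
		var best *namespace
		for _, ns := range nss {
			if ns.assigned >= ns.requested {
				continue // request already satisfied
			}
			if best == nil || lessUsage(ns, best) {
				best = ns
			}
		}
		if best == nil {
			return // every request is satisfied
		}
		best.assigned++
	}
}

func main() {
	// Queue q1 of the second case: 4 cpu to share.
	q1 := []*namespace{
		{name: "ns1", weight: 3, requested: 5},
		{name: "ns2", weight: 1, requested: 10},
	}
	allocate(4, q1)

	// Queue q2 of the second case: 12 cpu to share.
	q2 := []*namespace{
		{name: "ns3", weight: 2, requested: 10},
		{name: "ns4", weight: 6, requested: 2},
	}
	allocate(12, q2)

	for _, ns := range append(q1, q2...) {
		fmt.Printf("%s: %d cpu\n", ns.name, ns.assigned)
	}
}
```

Under these assumptions the program assigns 3, 1, 10 and 2 cpu to ns1 through ns4, matching the second table. ns4 stops at its request of 2 cpu, and the remainder of q2 flows to ns3.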