volcano.sh/volcano@v1.9.0/docs/design/fairshare.md

# Namespace fair share

[@lminzhw](http://github.com/lminzhw); May 8, 2019

## Motivation

`Queue` was introduced in [kube-batch](http://github.com/kubernetes-sigs/kube-batch) to share resources among users.

However, users in the same `Queue` are treated equally during scheduling. For example, suppose a `Queue` holds a small amount of resources, and 10 pods belong to UserA while 1000 pods belong to UserB. In this case, the pods of UserA are less likely to be bound to a node.

We therefore need a more fine-grained strategy to balance resource usage among users in the same `Queue`.

Given the multi-user model in Kubernetes, we use namespaces to distinguish users. Each namespace has its own weight to control its resource usage.

## Function Specification

The weight has these features:
> 1. `Queue` level
> 2. an `integer` with default value 1
> 3. recorded in the namespace `quota`
> 4. a higher value means more resources after balancing

### where is the weight

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  namespace: default
spec:
  hard:
    limits.memory: 2Gi
    volcano.sh/namespace.weight: 1  # this field represents the weight of this namespace
```

If several `ResourceQuota` objects in the same namespace define a weight, the weight of the namespace is the highest of them.

The weight must be positive; any invalid value is treated as the default value 1.

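The two rules above (the highest weight wins, and invalid values fall back to 1) can be sketched as follows. This is a minimal illustration rather than the scheduler's code: `resolveNamespaceWeight` and `defaultNamespaceWeight` are hypothetical names, and the weights are assumed to have already been read from the `volcano.sh/namespace.weight` entry of each `ResourceQuota` in the namespace.

```go
package main

import "fmt"

// defaultNamespaceWeight is the fallback weight described in the text.
const defaultNamespaceWeight int64 = 1

// resolveNamespaceWeight returns the highest weight among the quotas of one
// namespace. Starting from the default means that non-positive (invalid)
// values can never win, so they behave as if they were the default value 1.
func resolveNamespaceWeight(quotaWeights []int64) int64 {
	weight := defaultNamespaceWeight
	for _, w := range quotaWeights {
		if w > weight {
			weight = w
		}
	}
	return weight
}

func main() {
	fmt.Println(resolveNamespaceWeight([]int64{1, 5, 3})) // several quotas carry a weight
	fmt.Println(resolveNamespaceWeight([]int64{0, -2}))   // only invalid values
	fmt.Println(resolveNamespaceWeight(nil))              // no quota carries a weight
}
```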
### Scheduler Framework

Introduce two new fields in `SchedulerCache`:

```go
type NamespaceInfo struct {
    Weight int
}

type SchedulerCache struct {
    ...
    quotaInformer infov1.ResourceQuotaInformer
    ...
    NamespaceInfo map[string]*kbapi.NamespaceInfo
    ...
}
```

The scheduler watches the lifecycle of `ResourceQuota` objects through `quotaInformer` and refreshes the information in `NamespaceInfo`.

In the `openSession` function, the `NamespaceInfo` should be passed through `cache.Snapshot` into the `Session` by adding a new field to the `Session` and `ClusterInfo` structs.

```go
type Session struct {
    ...
    NamespaceInfo  map[string]*kbapi.NamespaceInfo
    ...
}
type ClusterInfo struct {
    ...
    NamespaceInfo  map[string]*kbapi.NamespaceInfo
    ...
}
```

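One way to hand the namespace data to a session is to copy it while taking the snapshot, so that later cache updates do not change what a running session sees. The sketch below assumes that copy-on-snapshot behavior, which the text does not spell out; the types are simplified stand-ins for the structs above and locking is omitted.

```go
package main

import "fmt"

// Simplified stand-ins for the structs in the text (assumption: only the
// namespace-related field is modeled here).
type NamespaceInfo struct {
	Weight int
}

type ClusterInfo struct {
	NamespaceInfo map[string]*NamespaceInfo
}

type SchedulerCache struct {
	NamespaceInfo map[string]*NamespaceInfo
}

// Snapshot copies every NamespaceInfo so the returned ClusterInfo is
// independent of later changes to the cache.
func (sc *SchedulerCache) Snapshot() *ClusterInfo {
	snapshot := &ClusterInfo{
		NamespaceInfo: make(map[string]*NamespaceInfo, len(sc.NamespaceInfo)),
	}
	for name, info := range sc.NamespaceInfo {
		if info == nil {
			continue
		}
		copied := *info
		snapshot.NamespaceInfo[name] = &copied
	}
	return snapshot
}

func main() {
	cache := &SchedulerCache{NamespaceInfo: map[string]*NamespaceInfo{"default": {Weight: 2}}}
	snapshot := cache.Snapshot()
	cache.NamespaceInfo["default"].Weight = 5 // a later quota update
	fmt.Println(snapshot.NamespaceInfo["default"].Weight, cache.NamespaceInfo["default"].Weight)
}
```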
### Allocate Action

#### Scheduling loop

The `allocate` action schedules the jobs in a `Queue` one by one.

At the beginning of each round of the scheduling loop, it takes the job with the highest priority from the `Queue` and tries to schedule the tasks that belong to it until the job is ready (it satisfies `minMember`), then moves on to the next round.

The job priority mentioned above is defined by the `JobOrder` functions registered by plugins, such as the job-ready order of the Gang plugin, the priority order of the Priority plugin, and the share order of the DRF plugin.

#### JobOrder

Namespace weight `should not` be implemented with a `JobOrder` function, because scheduling one job affects the priority of the others.

> e.g.
>
> ns1 has job1 and job2; ns2 has job3 and job4. The original order is job1-job2-job3-job4.
>
> After job1 is scheduled, the correct order is job3-job4-job2. But in a priority queue, there is no opportunity to update the priority of job2.

#### Namespace Order

To add namespace weight, we introduce a new order function named `NamespaceOrder` in `Session`.

```go
type Session struct {
    ...
    NamespaceOrderFns map[string]api.CompareFn
    ...
}
```

The scheduling loop in `allocate` should change as follows.

In each round of the scheduling loop, first choose the namespace with the highest priority by calling `NamespaceOrderFn`, then choose the job with the highest priority in that namespace using `JobOrderFn`.

After the job is scheduled, push the namespace and the job back into the priority queue so that their priorities are refreshed. Once a job is scheduled, the resources assigned to it may lower the priority of its namespace, so the other jobs in the same namespace may be scheduled later.

Always assigning resources to the namespace with the highest priority (the lowest resource usage) in every round keeps resource usage balanced.

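The loop described above can be simulated with a small priority queue of namespaces. This is a sketch under simplifying assumptions, not the `allocate` implementation: a single scalar resource, one job scheduled per round, jobs already sorted by `JobOrderFn` inside each namespace, and namespaces compared by usage divided by weight. All names (`nsState`, `nsQueue`, `allocate`) are hypothetical.

```go
package main

import (
	"container/heap"
	"fmt"
)

// nsState is a hypothetical per-namespace record.
type nsState struct {
	name    string
	weight  float64
	used    float64   // resources already assigned
	pending []float64 // requests of pending jobs, highest job priority first
}

// nsQueue is a min-heap on used/weight; ties are broken by name.
type nsQueue []*nsState

func (q nsQueue) Len() int { return len(q) }
func (q nsQueue) Less(i, j int) bool {
	// used_i/weight_i < used_j/weight_j, cross-multiplied to avoid division.
	l, r := q[i].used*q[j].weight, q[j].used*q[i].weight
	if l != r {
		return l < r
	}
	return q[i].name < q[j].name
}
func (q nsQueue) Swap(i, j int)       { q[i], q[j] = q[j], q[i] }
func (q *nsQueue) Push(x interface{}) { *q = append(*q, x.(*nsState)) }
func (q *nsQueue) Pop() interface{} {
	old := *q
	last := old[len(old)-1]
	*q = old[:len(old)-1]
	return last
}

// allocate returns the namespace chosen in each round.
func allocate(capacity float64, namespaces []*nsState) []string {
	q := &nsQueue{}
	for _, ns := range namespaces {
		if len(ns.pending) > 0 {
			heap.Push(q, ns)
		}
	}
	var order []string
	for q.Len() > 0 {
		ns := heap.Pop(q).(*nsState) // namespace with the highest priority
		request := ns.pending[0]     // its job with the highest priority
		if request > capacity {
			continue // simplification: a namespace whose next job does not fit is dropped
		}
		ns.pending = ns.pending[1:]
		ns.used += request
		capacity -= request
		order = append(order, ns.name)
		if len(ns.pending) > 0 {
			heap.Push(q, ns) // push back so the new usage is reflected in the order
		}
	}
	return order
}

func main() {
	order := allocate(6, []*nsState{
		{name: "ns1", weight: 1, pending: []float64{1, 1}},
		{name: "ns2", weight: 2, pending: []float64{1, 1, 1, 1}},
	})
	fmt.Println(order)
}
```

In this run ns2 has twice the weight of ns1 and ends up with twice the resources, because each round goes to whichever namespace currently has the lower weighted usage.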
### DRF plugin

The DRF plugin uses preemption and job ordering to balance resources among jobs. The `share` in this plugin is defined as resource usage: a higher `share` means the job currently occupies more resources.

#### Namespace Compare

To introduce namespace weight into this plugin, we first need to define how to compare namespaces that have weights.

For namespace n1 with weight w1 and namespace n2 with weight w2, compute the `share` of resources of each and record them as u1 and u2. The resource usage of n1 is then defined to be less than that of n2 when (u1 / w1 < u2 / w2).

`e.g.` ns1 with weight w1=3 uses 6 cpu, and ns2 with weight w2=1 uses 3 cpu. In terms of cpu, ns1 uses less resource than ns2. (6 / 3 < 3 / 1)

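The comparison (u1 / w1 < u2 / w2) can be written as a three-way compare function. The sketch below is illustrative only: `namespaceUsage` and `compareNamespaces` are hypothetical names, the -1/0/1 return convention is an assumption, and weights are assumed to be positive. It cross-multiplies instead of dividing, which is equivalent for positive weights.

```go
package main

import "fmt"

// namespaceUsage is a hypothetical pair of share (u) and weight (w).
type namespaceUsage struct {
	Name   string
	Share  float64
	Weight float64
}

// compareNamespaces returns -1 if l uses less weighted resource than r,
// 1 if it uses more, and 0 if both are equal.
// u1/w1 < u2/w2 is evaluated as u1*w2 < u2*w1 (valid for positive weights).
func compareNamespaces(l, r namespaceUsage) int {
	lv := l.Share * r.Weight
	rv := r.Share * l.Weight
	switch {
	case lv < rv:
		return -1
	case lv > rv:
		return 1
	default:
		return 0
	}
}

func main() {
	// The values from the inequality above: 6 / 3 < 3 / 1.
	ns1 := namespaceUsage{Name: "ns1", Share: 6, Weight: 3}
	ns2 := namespaceUsage{Name: "ns2", Share: 3, Weight: 1}
	fmt.Println(compareNamespaces(ns1, ns2))
}
```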
#### Namespace Order

Register a `NamespaceOrder` function that uses the comparison defined above.

#### preemption

> The `preempt` action is disabled now. Do this later.

Currently, the `preemption` function simply compares the resource share among jobs.

After adding namespace weight, we should first compare the namespaces of the preemptor and the preemptee. A job in the namespace with less resource usage can preempt the other; when the namespace resource usages are the same, compare the shares of the jobs instead.

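The two-level rule above (namespace first, job share as the tie-breaker) might look like the following. This is a sketch of the proposed rule only, since the text notes that `preempt` is currently disabled; `canPreempt`, `weightedShare`, and the struct types are hypothetical, and a missing or non-positive weight is treated as the default value 1.

```go
package main

import "fmt"

// namespaceShare holds the share and weight of one namespace (hypothetical type).
type namespaceShare struct {
	Share  float64
	Weight float64
}

// job is a hypothetical, minimal view of a job for this comparison.
type job struct {
	Name      string
	Namespace string
	Share     float64
}

// weightedShare returns share/weight, using weight 1 for invalid weights.
func weightedShare(ns namespaceShare) float64 {
	w := ns.Weight
	if w <= 0 {
		w = 1
	}
	return ns.Share / w
}

// canPreempt reports whether preemptor may preempt preemptee: the namespace
// with lower weighted usage wins; on equal namespace usage, the job with the
// lower share wins.
func canPreempt(preemptor, preemptee job, namespaces map[string]namespaceShare) bool {
	l := weightedShare(namespaces[preemptor.Namespace])
	r := weightedShare(namespaces[preemptee.Namespace])
	if l != r {
		return l < r
	}
	return preemptor.Share < preemptee.Share
}

func main() {
	namespaces := map[string]namespaceShare{
		"ns1": {Share: 6, Weight: 3},
		"ns2": {Share: 3, Weight: 1},
	}
	a := job{Name: "a", Namespace: "ns1", Share: 0.5}
	b := job{Name: "b", Namespace: "ns2", Share: 0.1}
	fmt.Println(canPreempt(a, b, namespaces), canPreempt(b, a, namespaces))
}
```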
### Feature Interaction

#### preempt action

The `preempt` action is a set of strategies that chooses victims and finally evicts them.

Victims are chosen by a set of functions named `Preemptable` registered by plugins, such as the job-ready protection of the Gang plugin, the special pod protection of the Conformance plugin, and the job share balancing strategy of the DRF plugin.

Each of these plugins chooses its own set of victims, and the intersection of these sets is the final victim set. So, the choice made by the DRF plugin never breaks the requirements of the others.

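The intersection step can be illustrated with plain string identifiers standing in for candidate tasks. This is a minimal sketch, not the framework's code: `intersectVictims` is a hypothetical helper, and each argument represents the victims chosen by one plugin's `Preemptable` function.

```go
package main

import "fmt"

// intersectVictims returns the candidates chosen by every plugin, in the
// order they appear in the first plugin's choice.
func intersectVictims(choices ...[]string) []string {
	if len(choices) == 0 {
		return nil
	}
	// Count, per candidate, how many plugins selected it (once per plugin).
	counts := make(map[string]int)
	for _, choice := range choices {
		seen := make(map[string]bool)
		for _, victim := range choice {
			if !seen[victim] {
				seen[victim] = true
				counts[victim]++
			}
		}
	}
	var result []string
	for _, victim := range choices[0] {
		if counts[victim] == len(choices) {
			result = append(result, victim)
			counts[victim] = 0 // avoid emitting the same victim twice
		}
	}
	return result
}

func main() {
	gang := []string{"p1", "p2", "p3"}
	conformance := []string{"p2", "p3", "p4"}
	drf := []string{"p3", "p2"}
	fmt.Println(intersectVictims(gang, conformance, drf))
}
```

Because a pod must be selected by every plugin to become a victim, a pod protected by any one plugin is never evicted, whatever the DRF plugin chooses.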
### Shortcomings

1. Preemption may kill some running pods.

### Cases:

- The cluster has __16 cpu__; queues and namespaces have the default weight.

    | queue | namespace | requested | queue assigned | namespace assigned |
    | ----- | --------- | --------- | -------------- | ------------------ |
    | q1    | ns1       | 5 cpu     | 8 cpu          | 4 cpu              |
    |       | ns2       | 10 cpu    |                | 4 cpu              |
    | q2    | ns3       | 10 cpu    | 8 cpu          | 6 cpu              |
    |       | ns4       | 2 cpu     |                | 2 cpu              |

- The cluster has __16 cpu__; q1 has weight 1 and q2 has weight 3; ns1 has weight 3, ns2 has weight 1, ns3 has weight 2, and ns4 has weight 6.

    | queue | namespace | requested | queue assigned | namespace assigned |
    | ----- | --------- | --------- | -------------- | ------------------ |
    | q1 w1 | ns1 w3    | 5 cpu     | 4 cpu          | 3 cpu              |
    |       | ns2 w1    | 10 cpu    |                | 1 cpu              |
    | q2 w3 | ns3 w2    | 10 cpu    | 12 cpu         | 10 cpu             |
    |       | ns4 w6    | 2 cpu     |                | 2 cpu              |

- The cluster has __16 cpu__; q1 has weight 1 and q2 has weight 3; ns1 has weight 2 and ns2 has weight 6.

    | queue | namespace | requested | queue assigned | namespace assigned |
    | ----- | --------- | --------- | -------------- | ------------------ |
    | q1 w1 | ns1 w2    |           | 4 cpu          |                    |
    | q2 w3 | ns1 w2    | 5 cpu     | 12 cpu         | 3 cpu              |
    |       | ns2 w6    | 20 cpu    |                | 9 cpu              |

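The "namespace assigned" column in the tables above is consistent with weighted max-min fairness applied to the cpu assigned to each queue: a namespace never receives more than it requested, and whatever it leaves unused is redistributed to the others by weight. The following is a sketch under that assumption, useful for checking the tables; it is not the scheduler's algorithm, `Claimant` and `WeightedFairShare` are hypothetical names, and weights are assumed to be positive.

```go
package main

import "fmt"

// Claimant is a hypothetical stand-in for a namespace inside one queue.
type Claimant struct {
	Name    string
	Weight  float64
	Request float64
}

// WeightedFairShare divides capacity among claimants by weight, capping each
// claimant at its request and redistributing the leftover (water-filling).
func WeightedFairShare(capacity float64, claimants []Claimant) map[string]float64 {
	assigned := make(map[string]float64, len(claimants))
	active := append([]Claimant(nil), claimants...)
	remaining := capacity
	for len(active) > 0 && remaining > 0 {
		totalWeight := 0.0
		for _, c := range active {
			totalWeight += c.Weight
		}
		perWeight := remaining / totalWeight

		// Claimants whose request fits in their fair share are fully satisfied.
		var unsatisfied []Claimant
		for _, c := range active {
			if c.Request <= c.Weight*perWeight {
				assigned[c.Name] = c.Request
				remaining -= c.Request
			} else {
				unsatisfied = append(unsatisfied, c)
			}
		}
		// Nobody was satisfied: split what is left by weight and stop.
		if len(unsatisfied) == len(active) {
			for _, c := range active {
				assigned[c.Name] = c.Weight * perWeight
			}
			break
		}
		active = unsatisfied
	}
	return assigned
}

func main() {
	// Second case above: q1 is assigned 4 cpu, q2 is assigned 12 cpu.
	fmt.Println(WeightedFairShare(4, []Claimant{{"ns1", 3, 5}, {"ns2", 1, 10}}))
	fmt.Println(WeightedFairShare(12, []Claimant{{"ns3", 2, 10}, {"ns4", 6, 2}}))
}
```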