volcano.sh/volcano@v1.9.0/docs/design/job-scale-up-down.md (about)

# Volcano Job scale up and down

@hzxuzhonghu; April 24, 2020

## Motivation

Currently, Volcano does not support Job updates: `Job.Spec` cannot be changed on the fly.
However, many users want to run ML training jobs in an elastic manner. For example, ModelArts wants to dynamically adjust a Job's replicas according to the idle capacity of the cluster
in order to use its GPU cards as efficiently as possible.

As a first step, before any more intelligent elasticity, I propose supporting dynamic scale up/down of Volcano Jobs.

## Design

Before describing the design, let's recall how a Job is currently initialized.

### Job Initialization

When a Volcano job is created, the job controller does the following to run and manage all of its tasks:

1. all the plugins execute their `OnJobAdd` callbacks to create the service, the hosts configmap, etc.

2. create the PVC for the job

3. create the PodGroup for the job

4. execute the plugins' `OnPodCreate` callbacks to set pod-related envs, mount the hostfile, etc.

5. call the kube-apiserver to create as many pods as the job has replicas

All of the above steps run in `syncJob`, which is called when external events happen; in this case, when the Job is newly created.

### Volcano Job Scale Up/Down

Scaling a Job up or down comes down to reconciling the resources the job owns, such as the PVC, PodGroup, Service, and hostfile ConfigMap,
so the procedure is similar to [Job Initialization](#job-initialization).

The differences are:

1. job plugins' callbacks: only the `svc` plugin needs to update the configmap that lists the job's tasks

2. create pods when scaling up

3. delete pods when scaling down

However, initialization runs only when the job has not started yet.
So we need a way to know whether a scale up/down event triggered the current round of sync.

I propose adding a new event, `JobUpdatedEvent`, to indicate that the job has been updated (here we only care about scale up/down),
and a corresponding new action, `UpdateJobAction`, which runs the `UpdateJob` function. The overall workflow is:

![workflow](images/Job-scale-up-down.PNG)
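
The event-to-action mapping described above can be sketched roughly as follows. This is a minimal illustration, not the controller's actual code: the `Event` and `Action` string types, the string values, the `SyncJobAction` fallback, and the `actionFor` helper are all assumptions made for the example.

```go
package main

// Event and Action are simplified stand-ins for the controller's event and
// action types (assumed here to be plain strings).
type Event string
type Action string

const (
	// JobUpdatedEvent signals that the job spec changed (scale up/down).
	JobUpdatedEvent Event = "JobUpdated"

	// UpdateJobAction makes the controller run UpdateJob.
	UpdateJobAction Action = "UpdateJob"
	// SyncJobAction is the assumed default action that runs syncJob.
	SyncJobAction Action = "SyncJob"
)

// actionFor is a hypothetical helper: a scale event selects the update path,
// anything else falls back to a regular sync.
func actionFor(e Event) Action {
	if e == JobUpdatedEvent {
		return UpdateJobAction
	}
	return SyncJobAction
}
```

Keeping the update path behind its own action means that initialization logic in `syncJob` does not need to guess whether a sync was triggered by a scale request.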

To scale up/down on the fly, Volcano is responsible for notifying the existing pods of the current status, including the hosts of all the pods.
This is done by plugins, so to distinguish it from the initialization phase, a new callback `OnJobUpdate` is introduced.
It reconciles all the configs associated with the job. Currently, only the `svc` plugin needs to act, by updating the configmap that lists all the hosts.

**NOTE**:

1. Users should watch `/etc/volcano` to get the up-to-date hosts files if they want to stay aware of the training workers.

2. The envs `VC_{task name}_HOSTS` and `VC_{task name}_NUM` of existing pods cannot be mutated on the fly, so be careful not to rely on them.
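
A training process could follow the first note with something like the sketch below, which re-reads a hosts file periodically and reports changes. It assumes the hosts file holds one host per line; the file name passed to `pollHosts` and all function names are hypothetical. Polling is used only to keep the example within the standard library.

```go
package main

import (
	"os"
	"strings"
	"time"
)

// parseHosts splits a hosts file body into one host per non-empty line
// (the one-host-per-line layout is an assumption of this sketch).
func parseHosts(data string) []string {
	var hosts []string
	for _, line := range strings.Split(data, "\n") {
		if h := strings.TrimSpace(line); h != "" {
			hosts = append(hosts, h)
		}
	}
	return hosts
}

// readHosts loads and parses the hosts file at path.
func readHosts(path string) ([]string, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return nil, err
	}
	return parseHosts(string(data)), nil
}

// sameHosts reports whether two host lists are identical, order included.
func sameHosts(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}

// pollHosts re-reads path every interval and calls onChange whenever the
// host list differs from the last one seen. It returns when stop is closed.
func pollHosts(path string, interval time.Duration, stop <-chan struct{}, onChange func([]string)) {
	var last []string
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if hosts, err := readHosts(path); err == nil && !sameHosts(last, hosts) {
			last = hosts
			onChange(hosts)
		}
		select {
		case <-stop:
			return
		case <-ticker.C:
		}
	}
}
```

A worker would typically run `pollHosts` in its own goroutine and rebuild its view of the peers inside `onChange`, rather than reading the `VC_{task name}_HOSTS` env once at startup.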

```
type PluginInterface interface {
	// Name returns the unique name of the plugin.
	Name() string

	// OnPodCreate is called for every pod in createJobPod.
	OnPodCreate(pod *v1.Pod, job *vcbatch.Job) error

	// OnJobAdd is called once in syncJob.
	OnJobAdd(job *vcbatch.Job) error

	// OnJobDelete is called once in killJob.
	OnJobDelete(job *vcbatch.Job) error

	// OnJobUpdate is called when the job is updated (scale up/down)
	// to reconcile the configs associated with the job.
	OnJobUpdate(job *vcbatch.Job) error
}
```
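
To illustrate what an `OnJobUpdate` implementation in the `svc` plugin might compute, the sketch below rebuilds the hosts data from the job's current replicas and reports whether the stored ConfigMap needs updating. The `Task` and `Job` types are simplified stand-ins for the real API types, and the `<job>-<task>-<index>.<job>` host format and `<task>.host` key are assumptions made for the example.

```go
package main

import (
	"fmt"
	"strings"
)

// Task and Job are simplified stand-ins for the vcbatch types.
type Task struct {
	Name     string
	Replicas int
}

type Job struct {
	Name  string
	Tasks []Task
}

// hostsData builds the hosts ConfigMap data for the job's current replicas:
// one key per task, one host per line. The host format and key naming are
// assumptions of this sketch.
func hostsData(job Job) map[string]string {
	data := make(map[string]string, len(job.Tasks))
	for _, t := range job.Tasks {
		hosts := make([]string, 0, t.Replicas)
		for i := 0; i < t.Replicas; i++ {
			hosts = append(hosts, fmt.Sprintf("%s-%s-%d.%s", job.Name, t.Name, i, job.Name))
		}
		data[t.Name+".host"] = strings.Join(hosts, "\n")
	}
	return data
}

// onJobUpdate recomputes the hosts data and reports whether it differs from
// the data currently stored, i.e. whether the ConfigMap must be updated.
func onJobUpdate(job Job, current map[string]string) (map[string]string, bool) {
	want := hostsData(job)
	if len(want) != len(current) {
		return want, true
	}
	for key, value := range want {
		if cur, ok := current[key]; !ok || cur != value {
			return want, true
		}
	}
	return want, false
}
```

Because the data is derived purely from the job spec, calling `onJobUpdate` repeatedly with an unchanged job is a no-op, which fits the reconcile style of the controller.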

`UpdateJob` is much like the current `SyncJob`, and its workflow is:

1. all the plugins execute their `OnJobUpdate` callbacks to update all the envs, the service, and the hosts configmap

2. create the PVC for the job if necessary

3. update the PodGroup for the job if necessary

4. execute the plugins' `OnPodCreate` callbacks to set pod-related envs, mount the hostfile, etc.

5. call the kube-apiserver to create or delete pods so that their number equals the job's replicas

**Note**: when scaling down, pods are deleted from the highest index to the lowest. This order is not guaranteed, however, because Kubernetes is an eventually consistent system.
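
The pod reconciliation in step 5 and the deletion order from the note can be sketched as a pure function. This is an illustration under the assumption that each pod of a task is identified by a unique integer index starting at 0; the `reconcile` name and signature are hypothetical.

```go
package main

import "sort"

// reconcile compares the indices of a task's existing pods with the desired
// replica count. It returns the missing indices to create (ascending) and
// the surplus indices to delete (descending, highest index first).
// Indices in existing are assumed to be unique.
func reconcile(existing []int, replicas int) (toCreate, toDelete []int) {
	have := make(map[int]bool, len(existing))
	for _, idx := range existing {
		have[idx] = true
		if idx >= replicas {
			toDelete = append(toDelete, idx)
		}
	}
	for idx := 0; idx < replicas; idx++ {
		if !have[idx] {
			toCreate = append(toCreate, idx)
		}
	}
	sort.Sort(sort.Reverse(sort.IntSlice(toDelete)))
	return toCreate, toDelete
}
```

For example, with pods 0 to 3 and a target of 2 replicas, the function asks for pods 3 and 2 to be deleted, in that order. The order only describes the requests the controller issues; as noted above, the order in which pods actually terminate is not guaranteed.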

### Admission webhook

The admission webhook should reject invalid mutations of the Job spec on the fly. In this proposal, only `replicas` and `minAvailable` may be updated; any other spec change is prohibited.
An update is also rejected if the total number of replicas is less than `minAvailable`.

`minAvailable` must be greater than zero, because we depend on it to maintain the job status.
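
The three rules above can be sketched as a single validation function. The `TaskSpec` and `JobSpec` types below are simplified stand-ins with only a few illustrative fields (`Queue` and `Image` represent "any other spec field"), and `validateUpdate` and `normalize` are hypothetical names, not the webhook's actual code.

```go
package main

import (
	"errors"
	"reflect"
)

// TaskSpec and JobSpec are simplified stand-ins for the real Job spec.
type TaskSpec struct {
	Name     string
	Replicas int32
	Image    string // stands in for the rest of the pod template
}

type JobSpec struct {
	MinAvailable int32
	Queue        string
	Tasks        []TaskSpec
}

// normalize returns a copy of the spec with the mutable fields zeroed, so
// that two specs can be compared on their immutable parts only.
func normalize(s JobSpec) JobSpec {
	tasks := make([]TaskSpec, len(s.Tasks))
	copy(tasks, s.Tasks)
	for i := range tasks {
		tasks[i].Replicas = 0
	}
	s.Tasks = tasks
	s.MinAvailable = 0
	return s
}

// validateUpdate applies the rules of this proposal to a spec update.
func validateUpdate(old, updated JobSpec) error {
	if updated.MinAvailable <= 0 {
		return errors.New("minAvailable must be greater than zero")
	}
	var total int32
	for _, t := range updated.Tasks {
		total += t.Replicas
	}
	if total < updated.MinAvailable {
		return errors.New("total replicas must not be less than minAvailable")
	}
	if !reflect.DeepEqual(normalize(old), normalize(updated)) {
		return errors.New("only replicas and minAvailable may be updated")
	}
	return nil
}
```

Comparing normalized copies keeps the check generic: any field added to the spec later is treated as immutable by default, without the validation needing to list it.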