# Service Level Agreement (SLA) Plugin

## Introduction

When users submit jobs to Volcano, they may need to add particular constraints to a job, for example a maximum `Pending` time that prevents the job from starving. These constraints can be regarded as a Service Level Agreement (SLA) between Volcano and the user. The `sla` plugin is provided to receive and enforce SLA settings for both individual jobs and the whole cluster.

## Solution

1. In the `sla` plugin, the argument `sla-waiting-time` is provided to realize job resource reservation: `sla-waiting-time` is the maximum time a job may stay in `Pending` or `inqueue` status without being allocated. When `sla-waiting-time` expires, the `sla` plugin immediately sets the job to `inqueue` in the `enqueue` action. The `sla` plugin then locks idle resources pre-allocated to the pods of this job in the `allocate` action, even if the job is not yet `Ready`. In this way, the `sla` plugin realizes large job election and resource reservation, and thus replaces the `elect` & `reserve` actions in v1.1.0.

2. The argument `sla-waiting-time` can be set for one job or for all jobs in the cluster.
    1. For one job, users can set it in the job annotations in the following format:

        ```yaml
        apiVersion: batch.volcano.sh/v1alpha1
        kind: Job
        metadata:
          annotations:
            sla-waiting-time: 1h2m3s
        ```

    2. For all jobs, users can set the `sla-waiting-time` field in the `sla` plugin arguments via `volcano-scheduler-configmap` in the following format:

        ```yaml
        actions: "enqueue, allocate, backfill"
        tiers:
        - plugins:
          - name: priority
          - name: gang
          - name: sla
            arguments:
              sla-waiting-time: 1h2m3s
        ```

3. The `sla` plugin returns 3 callback functions: `JobEnqueueableFn`, `JobPipelinedFn`, and `JobOrderFn`:

    1. `JobEnqueueableFn` returns `Permit` when the job's waiting time in `Pending` status is longer than `sla-waiting-time`. The job then goes through the `enqueue` action and becomes `inqueue` instantly, regardless of other plugins returning `Reject` or `Abstain` to keep this job from becoming `inqueue`.

    2. `JobPipelinedFn` returns `Permit` when the job's waiting time in `inqueue` status is longer than `sla-waiting-time`. The job then enters `Pipelined` status instantly, regardless of other plugins returning `Reject` or `Abstain` to keep this job from becoming `Pipelined`. In this way the `allocate` action reserves resources for the pods of the job even if the job is not yet `Ready`.

    3. `JobOrderFn` adjusts the order of this job in the waiting queues of the `enqueue` & `allocate` actions. The closer the job's waiting time is to `sla-waiting-time`, the higher the job is scored by the `JobOrderFn` of the `sla` plugin, so the job is more likely to sit near the front of the priority queue. This means it can reach more idle resources and has higher priority to become `inqueue` and be allocated.

4. The execution flow of the `sla` plugin is shown in a flow chart (figure not reproduced here).

## Feature Interaction

1. For now only one argument, `sla-waiting-time`, is needed, so I added it to annotations for simplicity of invocation. When the `sla` plugin is extended with more arguments, a better way to invoke this plugin may be as a job plugin like `svc` and `ssh`.
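
The decision logic described in the Solution section can be illustrated with a small standalone sketch. It is written in Python for brevity and is not Volcano's actual Go implementation. All function names and the `PERMIT`/`ABSTAIN`/`REJECT` constants are hypothetical. Three points are assumptions not stated above: a job annotation takes precedence over the cluster-wide plugin argument, the callback votes `Abstain` while the SLA has not expired, and ordering is done by earliest SLA deadline. The parser handles only the `h`, `m`, and `s` units seen in the example value `1h2m3s`.

```python
import re
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical stand-ins for the three vote values named in the text.
PERMIT, ABSTAIN, REJECT = 1, 0, -1

_UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def parse_waiting_time(text: str) -> timedelta:
    """Parse a duration such as '1h2m3s' (hours, minutes, seconds only)."""
    if not re.fullmatch(r"(\d+[hms])+", text):
        raise ValueError(f"unsupported duration: {text!r}")
    pairs = re.findall(r"(\d+)([hms])", text)
    return timedelta(seconds=sum(int(n) * _UNIT_SECONDS[u] for n, u in pairs))


def effective_waiting_time(annotations: dict, plugin_args: dict) -> Optional[timedelta]:
    """Assumed precedence: the job annotation overrides the cluster-wide argument."""
    raw = annotations.get("sla-waiting-time") or plugin_args.get("sla-waiting-time")
    return parse_waiting_time(raw) if raw else None


def job_enqueueable(created: datetime, now: datetime,
                    waiting_time: Optional[timedelta]) -> int:
    """Vote PERMIT once the job has waited longer than its SLA.

    Returning ABSTAIN in every other case is an assumption of this sketch.
    """
    if waiting_time is None:
        return ABSTAIN
    return PERMIT if now - created > waiting_time else ABSTAIN


def order_jobs(jobs: list) -> list:
    """Sort (name, created, waiting_time) tuples, earliest SLA deadline first.

    Jobs without an SLA are placed last, ordered by creation time.
    """
    def key(job):
        _, created, waiting_time = job
        if waiting_time is None:
            return (1, created)
        return (0, created + waiting_time)

    return sorted(jobs, key=key)
```

For example, `parse_waiting_time("1h2m3s")` yields `timedelta(seconds=3723)`, and a job created at time `t0` with a 10-minute SLA receives `PERMIT` from `job_enqueueable` once `now` is more than 10 minutes past `t0`. The same comparison, measured from the time the job became `inqueue`, would model the `JobPipelinedFn` vote.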