volcano.sh/volcano@v1.9.0/docs/design/task-topology-plugin.md

# Task Topology Plugin

## Introduction

In big data processing jobs such as TensorFlow and Spark, tasks exchange large amounts of data with each other, so transmission delay accounts for a large proportion of job execution time. The task-topology plugin was proposed to adjust the scheduling strategy according to the transmission topology inside a job, in order to reduce the amount of data transmitted between nodes, lower the share of transmission delay in job execution time, and improve resource utilization.

## Theory

- For simplicity, the task-topology plugin builds the task topology of a job from the task affinities set in the job annotations, then creates buckets to hold the tasks. Tasks with affinity tend to be put in the same bucket, and tasks with anti-affinity tend to be put in different buckets. Finally, the buckets are mapped to nodes so as to minimize data transmission between nodes.

- Here is an example of what the task-topology plugin does.

- Suppose a TensorFlow job has 6 tasks: `ps0, ps1, worker0, worker1, worker2, worker3`. For simplicity, each task needs just 1 CPU. Set the task affinity as `"affinity": "ps,worker"`, `"anti-affinity": "ps"`.

  - In `OnSessionOpen`, the task-topology plugin generates the buckets by affinity:
    - Sort the tasks by `taskAffinityOrder`. In this order, anti-affinity takes priority over affinity, because anti-affinity generates more buckets.
    - Suppose the sorted tasks are: `ps0, ps1, worker0, worker1, worker2, worker3`

  - Generate the buckets:
    1. ps0: there is no bucket yet, so generate bucket 1.
    2. ps1: there is 1 bucket, but ps1 has an anti-affinity config, so generate bucket 2.
    3. worker0: has affinity to both buckets, choose bucket 1.
    4. worker1: has affinity to both buckets, but by resource balancing, choose bucket 2.
    5. worker2: choose bucket 1.
    6. worker3: choose bucket 2.
    7. Now we have the following buckets:

        | bucket | tasks |
        | --- | --- |
        | bucket1 | ps0, worker0, worker2 |
        | bucket2 | ps1, worker1, worker3 |

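The bucket generation steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the plugin's Go implementation: `role`, `build_buckets`, and the encoding of the annotations as lists of role sets are hypothetical names chosen for this example, and "resource balancing" is approximated by picking the candidate bucket with the fewest tasks, which is equivalent here only because every task needs 1 CPU.

```python
def role(task):
    # Hypothetical helper: "worker2" -> "worker" (role name followed by an index).
    return task.rstrip("0123456789")


def build_buckets(ordered_tasks, affinity, anti_affinity):
    """Greedily assign already-sorted tasks to buckets.

    affinity / anti_affinity are lists of role sets (an assumed encoding):
    roles in the same affinity set attract each other, and roles in the
    same anti-affinity set repel each other.
    """
    buckets = []
    for task in ordered_tasks:
        candidates = []
        for bucket in buckets:
            roles = {role(t) for t in bucket}
            # Anti-affinity is checked first: a repelled bucket is never a candidate.
            if any(role(task) in group and roles & group for group in anti_affinity):
                continue
            if any(role(task) in group and roles & group for group in affinity):
                candidates.append(bucket)
        if candidates:
            # Resource balancing: smallest bucket wins, the earliest one on ties.
            min(candidates, key=len).append(task)
        else:
            buckets.append([task])
    return buckets


tasks = ["ps0", "ps1", "worker0", "worker1", "worker2", "worker3"]
buckets = build_buckets(tasks, affinity=[{"ps", "worker"}], anti_affinity=[{"ps"}])
print(buckets)
# [['ps0', 'worker0', 'worker2'], ['ps1', 'worker1', 'worker3']]
```

The result matches the bucket table above: the two `ps` tasks are forced apart, and the workers alternate between the two buckets.
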
- After bucket generation, the task-topology plugin provides a `taskOrderFn` that the `allocate` action uses to build its `priorityQueue`. In the example above, the task order will be: `ps0, worker0, worker2, ps1, worker1, worker3`

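As a rough sketch of how such an ordering could feed a priority queue, the snippet below pushes every task with the key `(bucket index, position in bucket)` so that tasks of the same bucket are popped together. The key is an assumption made for this example and is simpler than whatever comparison the real `taskOrderFn` performs. `allocation_order` is a hypothetical name.

```python
import heapq


def allocation_order(buckets):
    """Return tasks in the order a bucket-aware priority queue would pop them."""
    heap = []
    for bucket_index, bucket in enumerate(buckets):
        for position, task in enumerate(bucket):
            # Assumed key: earlier buckets first, then insertion order in the bucket.
            heapq.heappush(heap, (bucket_index, position, task))
    return [heapq.heappop(heap)[2] for _ in range(len(heap))]


buckets = [["ps0", "worker0", "worker2"], ["ps1", "worker1", "worker3"]]
print(allocation_order(buckets))
# ['ps0', 'worker0', 'worker2', 'ps1', 'worker1', 'worker3']
```
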
- Suppose there are 3 nodes available in the cluster:

    | node | resources |
    | --- | --- |
    | node1 | cpu: 2 |
    | node2 | cpu: 1 |
    | node3 | cpu: 4 |

- The task-topology plugin also provides a `nodeOrderFn` that computes a priority score for each node. The score is mapped to [0, 10], but the example below uses the raw bucket score for simplicity:
  - For ps0:

    | node | bucket in node | score |
    | --- | --- | --- |
    | node1 | ps0 worker0 | 2 |
    | node2 | ps0 | 1 |
    | node3 | ps0 worker0 worker2 | 3 |

    ps0 therefore binds to node3, which has the highest score:

    | node1 | node2 | node3 |
    | --- | --- | --- |
    | | | ps0 |

  - For worker0:

    | node | own tasks | bucket in node | score |
    | --- | --- | --- | --- |
    | node1 | | worker0 worker2 | 2 |
    | node2 | | worker0 | 1 |
    | node3 | ps0 | worker0 worker2 | 3 |

    worker0 therefore follows ps0 and binds to node3.

  - The same applies to worker2, which also binds to node3:

    | node1 | node2 | node3 |
    | --- | --- | --- |
    | | | ps0, worker0, worker2 |

  - Next, for ps1:

    | node | own tasks | bucket in node | score |
    | --- | --- | --- | --- |
    | node1 | | ps1 worker1 | 2 |
    | node2 | | ps1 | 1 |
    | node3 | ps0, worker0, worker2 | ps1 | 0 (anti-affinity) |

    So ps1 binds to node1:

    | node1 | node2 | node3 |
    | --- | --- | --- |
    | ps1 | | ps0, worker0, worker2 |

  - worker1 then binds to node1:

    | node1 | node2 | node3 |
    | --- | --- | --- |
    | ps1, worker1 | | ps0, worker0, worker2 |

  - For worker3, node1 is already full, and node2 and node3 have the same score, so the choice is decided by other plugins such as `binpack` or `leastRequestPriority`.

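The node scoring walk-through above can be reproduced with a small simulation. The formula below is one reading of the tables, not something the plugin documents: a node's score is the number of tasks from the same bucket already on the node ("own tasks") plus the number of still-unplaced bucket tasks that fit into its free CPU ("bucket in node"), and it drops to 0 when anti-affinity is violated. `bucket_score` and `simulate` are hypothetical names, the [0, 10] mapping is omitted, and nodes without free CPU are simply skipped, which stands in for the scheduler's normal feasibility checks.

```python
def role(task):
    # Hypothetical helper: "worker2" -> "worker".
    return task.rstrip("0123456789")


def bucket_score(task, bucket, node, placed, free_cpu, anti_affinity):
    """Raw bucket score of `node` for `task` (assumed formula, see lead-in)."""
    on_node = [t for t, n in placed.items() if n == node]
    for group in anti_affinity:
        if role(task) in group and any(role(t) in group for t in on_node):
            return 0  # a repelled task already runs on this node
    own = sum(1 for t in on_node if t in bucket)
    pending = sum(1 for t in bucket if t not in placed)  # includes `task` itself
    return own + min(free_cpu[node], pending)


def simulate(buckets, nodes, anti_affinity):
    """Place tasks bucket by bucket; report ties instead of breaking them."""
    free_cpu = dict(nodes)  # each task is assumed to need 1 CPU
    placed, ties = {}, {}
    for bucket in buckets:
        for task in bucket:
            feasible = [n for n in free_cpu if free_cpu[n] >= 1]
            scores = {
                n: bucket_score(task, bucket, n, placed, free_cpu, anti_affinity)
                for n in feasible
            }
            if not scores:
                continue  # no node has room left
            best = max(scores.values())
            winners = [n for n in feasible if scores[n] == best]
            if len(winners) > 1:
                ties[task] = winners  # left for other plugins to decide
                continue
            placed[task] = winners[0]
            free_cpu[winners[0]] -= 1
    return placed, ties


buckets = [["ps0", "worker0", "worker2"], ["ps1", "worker1", "worker3"]]
nodes = {"node1": 2, "node2": 1, "node3": 4}
placed, ties = simulate(buckets, nodes, anti_affinity=[{"ps"}])
print(placed)
# {'ps0': 'node3', 'worker0': 'node3', 'worker2': 'node3', 'ps1': 'node1', 'worker1': 'node1'}
print(ties)
# {'worker3': ['node2', 'node3']}
```

Under these assumptions the simulation reaches the same placements as the tables, and worker3 is left as a tie between node2 and node3.
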
## Future Improvement

1. Currently the task-topology plugin takes annotations as its input arguments. This makes it easy to integrate with upper-layer applications through various operators, but it is not an official interface. As a next step, task-topology could be added as a job plugin like `svc` and `ssh`, so that it can still be configured per job.
2. Currently the task-topology plugin builds the task topology only from task types and affinities, but a more detailed topology may need a full matrix of data volumes between tasks. An additional interface will be needed if the plugin is extended in this way.
3. Currently the task-topology plugin does not interact with other Volcano arguments such as `minAvailable`. Support for this may be added if necessary.