# Distributed Framework Plugins

- [Motivation](#motivation)
- [Goals](#goals)
- [Design](#design)
  - [Introduction of ML Distributed Pattern](#introduction-of-ml-distributed-pattern)
  - [Implementation](#implementation)
    - [Tensorflow Plugin](#tensorflow-plugin)
    - [Pytorch Plugin](#pytorch-plugin)
    - [Other Framework](#other-framework)
    - [Task Launch Sequence](#task-launch-sequence)
    - [Webhook](#webhook)

## Motivation

Volcano is widely used in machine learning, but configuring it for distributed jobs can be complicated for users:

- Users have to be familiar with the Volcano job plugins (i.e. the `svc`, `env` and `ssh` job plugins).
  - For example, to run an MPI job with Volcano, you need to know that the plugins provide the names of all worker nodes in a file such as `/etc/volcano/mpiworker.host`.
- Users have to be familiar with shell syntax, which is used to generate a cluster spec parameter from the files produced by the plugins.
  - e.g. generating `TF_CONFIG` from the files `worker.host` and `ps.host` for a Tensorflow job.
- Running a distributed ML job via Volcano is not straightforward. Users have to carefully set the `lifeCyclePolicy` and `restartPolicy` in every task for distributed training.
  - For example, in an MPI job the master task fails and is restarted until all worker tasks are ready. Users therefore have to add an `OnFailure` restart policy and a `TaskCompleted-CompleteJob` lifecycle policy to the master task.

Adding more in-tree plugins for distributed ML jobs lets Volcano know more about the job type and reduces the complexity of using Volcano for ML workloads.

## Goals

- Add several plugins for distributed frameworks, including but not limited to Tensorflow-Plugin, MPI-Plugin, Pytorch-Plugin and MxNet-Plugin.
  - These plugins will patch the pod spec to fit the distributed pattern of the specified framework.
- Make it easier to set ML distributed topology.
  - By using init containers, plugins will make sure tasks are launched in topology order.
- Make it easier to use these plugins. Users only need to add a few lines to the job spec, e.g.
  ```yaml
  spec:
    plugins:
      tensorflow: []
  ```

## Design

### Introduction of ML Distributed Pattern

Here is a summary of the distributed training pattern in various ML frameworks, including the node topology, environment variables, files and entrypoint for distributed training.

| Framework | Topology | Environment Variables | File | Entrypoint |
| --- | --- | --- | --- | --- |
| Tensorflow | **PS mode**: Chief + Evaluator + Worker + PS<br />**All-Reduce mode**: multiple workers | `TF_CONFIG`: tensorflow cluster spec | none | `python TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| Pytorch (Official recommendation) | Master + Worker | `PET_MASTER_ADDR`: master address<br/>`PET_MASTER_PORT`: master port<br />`PET_NNODES`: number of nodes<br/>`PET_NPROC_PER_NODE`: number of processes per node<br/>`PET_NODE_RANK`: current node index | none | `python -m torch.distributed.run TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| Pytorch (custom) | Master + Worker | `MASTER_ADDR`: master address<br/>`MASTER_PORT`: master port<br />`WORLD_SIZE`: number of nodes<br/>`RANK`: current node index | none | `python TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| MXNet | Scheduler + Worker + PS | `DMLC_PS_ROOT_URI`: scheduler address<br />`DMLC_PS_ROOT_PORT`: scheduler port<br/>`DMLC_NUM_SERVER`: number of parameter servers<br/>`DMLC_NUM_WORKER`: number of workers<br/>`DMLC_ROLE`: current node role | none | `python TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| MPI | Master + Worker | `OMPI_MCA_orte_default_hostfile`: default host file path | hostfile: with every node name and slot number | **master node**: `mpirun --hostfile HOSTFILE (-H HOSTS) python TRAINING_SCRIPT.py (--arg1 ... train script args...)`<br />**worker node**: `/usr/sbin/sshd -D` |

### Implementation

Based on the distributed patterns summarized above, we can implement a plugin for each framework.

#### Tensorflow Plugin

The key part of the Tensorflow plugin is setting the correct `TF_CONFIG` environment variable for every pod.

First, the plugin must know the cluster role of each task in the Volcano job and the port to be exposed. This information is passed through plugin arguments defined in the job spec.

```yaml
spec:
  plugins:
    # set tensorflow plugin
    tensorflow: ["--port=5000", "--worker=worker", "--ps=ps"]
```

In the implementation of `tensorflowPlugin`, these arguments are parsed.

```go
// tensorflowPlugin is the plugin for the tensorflow framework
type tensorflowPlugin struct {
	tfArguments   []string
	Clientset     pluginsinterface.PluginClientset
	psName        string
	workerName    string
	chiefName     string
	evaluatorName string
	port          int
}

// addFlags parses all plugin arguments into the plugin fields
func (tp *tensorflowPlugin) addFlags() {
	flagSet := flag.NewFlagSet(tp.Name(), flag.ContinueOnError)
	flagSet.StringVar(&tp.psName, "ps", "ps", "name of ps role task")
	flagSet.StringVar(&tp.workerName, "worker", "worker", "name of worker role task")
	flagSet.StringVar(&tp.chiefName, "chief", "chief", "name of chief role task")
	flagSet.StringVar(&tp.evaluatorName, "evaluator", "evaluator", "name of evaluator role task")
	flagSet.IntVar(&tp.port, "port", 2222, "service port")
	if err := flagSet.Parse(tp.tfArguments); err != nil {
		klog.Errorf("plugin %s flagset parse failed, err: %v", tp.Name(), err)
	}
}
```

The plugin then patches the pod spec in the `OnPodCreate` method.

```go
func (tp *tensorflowPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
	// do not patch if job is not distributed
	if len(job.Spec.Tasks) == 1 && job.Spec.Tasks[0].Replicas == 1 {
		return nil
	}
	// generate tfconfig spec
	c, err := tp.generateTFConfig(pod, job)
	if err != nil {
		return err
	}
	raw, err := json.Marshal(c)
	if err != nil {
		return err
	}
	// add TF_CONFIG environment variable to every container
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env, v1.EnvVar{
			Name:  "TF_CONFIG",
			Value: string(raw),
		})
	}
	return nil
}
```

Here is the structure of `TF_CONFIG`:

```go
type tfClusterSpec struct {
	Cluster clusterInfo `json:"cluster"`
	Task    taskInfo    `json:"task"`
}

type clusterInfo struct {
	PS        []string `json:"ps,omitempty"`
	Worker    []string `json:"worker,omitempty"`
	Chief     []string `json:"chief,omitempty"`
	Evaluator []string `json:"evaluator,omitempty"`
}

type taskInfo struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}
```

A `tfClusterSpec` can then be generated for each pod in the job. Here is an example:

```go
// generateTFConfig generates a tfClusterSpec for a given pod and job
func (tp *tensorflowPlugin) generateTFConfig(pod *v1.Pod, job *batch.Job) (tfClusterSpec, error) {
	// get task index by pod
	index, err := strconv.Atoi(helpers.GetPodIndexUnderTask(pod))
	if err != nil {
		return tfClusterSpec{}, err
	}
	// get task type by pod and job
	taskType := tp.getTaskType(pod, job)
	// get cluster info by job
	spec := tfClusterSpec{
		Cluster: tp.getClusterInfo(job),
		Task: taskInfo{
			Type:  taskType,
			Index: index,
		},
	}
	return spec, nil
}

// getClusterInfo returns the clusterInfo for a given job
func (tp *tensorflowPlugin) getClusterInfo(job *batch.Job) clusterInfo {
	cluster := clusterInfo{}
	for _, ts := range job.Spec.Tasks {
		hosts := []string{}
		for i := 0; i < int(ts.Replicas); i++ {
			// generate a domain name for each task replica
			hosts = append(hosts, helpers.MakeDomainName(job.Name, ts, i))
		}
		// assign the hostnames to the matching role in clusterInfo
		switch ts.Name {
		case tp.psName:
			cluster.PS = hosts
		case tp.workerName:
			cluster.Worker = hosts
		case tp.chiefName:
			cluster.Chief = hosts
		case tp.evaluatorName:
			cluster.Evaluator = hosts
		}
	}
	return cluster
}
```

#### Pytorch Plugin

As with the Tensorflow plugin, the plugin must first know the cluster role of each task in the Volcano job and the port to be exposed. This information is passed through plugin arguments defined in the job spec.

```yaml
spec:
  plugins:
    # set pytorch plugin
    pytorch: ["--master=master","--worker=worker","--port=23456"]
```

In the implementation of `pytorchPlugin`, these arguments are parsed.

```go
// pytorchPlugin is the plugin for the pytorch framework
type pytorchPlugin struct {
	pytorchArguments []string
	clientset        pluginsinterface.PluginClientset
	masterName       string
	workerName       string
	port             int
}
```

The plugin then adds the environment variables needed for Pytorch distributed training to the container envs in the `OnPodCreate` method.
The main environment variables are:
* `MASTER_ADDR`: master address
* `MASTER_PORT`: master port
* `WORLD_SIZE`: total number of nodes
* `RANK`: current node index

#### Other Framework

Most other frameworks are similar to Tensorflow, but MPI is a special case. In most cases it needs a `hostfile`, e.g.:

```
jobname-worker-0.jobname slots=4
jobname-worker-1.jobname slots=4
jobname-worker-2.jobname slots=4
```

To generate the `hostfile`, the plugin creates a `configMap` in the `OnJobAdd` phase.

```go
func (mp *mpiPlugin) OnJobAdd(job *batch.Job) error {
	// generate hostfile, and create a configmap
	data := map[string]string{"hostfile": mp.hostfile(job)}
	if err := helpers.CreateOrUpdateConfigMap(job, mp.Clientset.KubeClients, data, mp.cmName(job)); err != nil {
		return err
	}

	// record that this plugin has already been applied to the job
	if job.Status.ControlledResources["plugin-"+mp.Name()] == mp.Name() {
		return nil
	}
	job.Status.ControlledResources["plugin-"+mp.Name()] = mp.Name()
	return nil
}
```

The data in the `configMap` is as follows:

```yaml
data:
  hostfile: |-
    jobname-worker-0.jobname slots=4
    jobname-worker-1.jobname slots=4
    jobname-worker-2.jobname slots=4
```

> The utility function `CreateOrUpdateConfigMap` adds an owner reference to the ConfigMap's metadata, so the ConfigMap is deleted when the job is deleted.

In the `OnPodCreate` phase, the `hostfile` is added to the pod volumes and mounted at a specified path (e.g. `/etc/mpi/hostfile`). The `OMPI_MCA_orte_default_hostfile` environment variable should also be set.

```go
func (mp *mpiPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
	// generate hostfile volume and volumeMount
	volume := mp.hostfileVolume(job)
	mount := mp.hostfileVolumeMount(job)
	// add to pod and containers
	pod.Spec.Volumes = append(pod.Spec.Volumes, volume)
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts, mount)
		pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env, v1.EnvVar{
			Name:  "OMPI_MCA_orte_default_hostfile",
			Value: "/etc/mpi/hostfile",
		})
	}
	return nil
}
```

#### Task Launch Sequence

As mentioned in the *Motivation* section, task-level topology and launch sequence are common in distributed ML training. But there is no task-level scheduling policy in Volcano at present.
Task dependencies could be set in plugins:

- e.g. we could use an `InitContainer` to control the dependency between tasks.
- any other approaches are welcome.

##### Using InitContainer

In `OnPodCreate`, we could patch `InitContainers` into the pod. Here is an example for MPI-Plugin:

```go
func (mp *mpiPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
	// ......
	// Other code
	// ......

	// Add an init container to wait for dependency tasks
	// Get dependency tasks by pod and job
	depTasks := mp.getDepTasks(pod, job)
	if len(depTasks) != 0 {
		// Generate an init container and insert it into pod spec
		pod.Spec.InitContainers = append(pod.Spec.InitContainers, mlhelpers.CreateInitContainer(depTasks, job))
	}
	return nil
}
```

For MPI-Plugin, the master task should wait for the worker task, so dependency tasks are only generated for the master pod:

```go
func (mp *mpiPlugin) getDepTasks(pod *v1.Pod, job *batch.Job) (tasks []batch.TaskSpec) {
	// get task name from pod
	taskName := mlhelpers.GetTaskName(pod, job)
	if taskName == mp.masterName {
		// get task spec from job by a given task name
		if t, ok := mlhelpers.GetTaskSpec(mp.workerName, job); ok {
			tasks = append(tasks, t)
		}
	}
	return
}
```

We offer one implementation of the `InitContainer`. It has limitations but works for common scenarios, and we welcome better approaches.

The logic in the init container is quite simple. It sends an ICMP message to the domain name of every task pod to check whether the pod is alive. Here is an example shell script:

```shell
# SECONDS is a bash builtin that counts the seconds since it was last assigned
SECONDS=0
while true
do
  ok=true
  for ip in ${IP_LIST}
  do
    # one ICMP echo request per host; any failure means the task is not ready
    if ! ping -c 1 "$ip" >/dev/null
    then
      ok=false
      break
    fi
  done
  if $ok
  then
    exit 0
  fi
  # give up once the timeout has elapsed, otherwise retry after a short pause
  if [ "$SECONDS" -gt "${WAIT_TIMEOUT}" ]
  then
    exit 1
  fi
  sleep 5
done
```

##### Alternatives Considered

With the introduction of [Task Launch Order Design](https://github.com/volcano-sh/volcano/blob/master/docs/design/task-launch-order-within-job.md), we can use the existing solution to manage the task launch sequence.

#### Webhook

The distributed framework plugins described above depend on the `svc` plugin or other job plugins, so new logic is needed to ensure that those plugins are present.

The new logic to be added to the webhook is as follows:

1. Check whether any distributed framework plugins are configured.
2. Patch the job spec with the plugin dependencies.
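
The two webhook steps can be sketched as a small pure function over the `spec.plugins` map, which has the same shape as in the YAML examples above (plugin name mapped to its argument list). This is a minimal sketch and not the actual webhook code: the names `frameworkDeps` and `ensureDependencies` are hypothetical, and the dependency sets listed for each framework are assumptions made for illustration.

```go
package main

// frameworkDeps maps each distributed framework plugin to the job plugins it
// relies on. Both the variable name and the dependency sets are assumptions.
var frameworkDeps = map[string][]string{
	"tensorflow": {"svc"},
	"pytorch":    {"svc"},
	"mpi":        {"svc", "ssh"},
}

// ensureDependencies returns a copy of plugins in which every dependency of a
// configured framework plugin is present. Plugins the user already configured
// keep their arguments; missing dependencies are added with empty arguments.
func ensureDependencies(plugins map[string][]string) map[string][]string {
	patched := make(map[string][]string, len(plugins))
	for name, args := range plugins {
		patched[name] = args
	}
	// Step 1: check which framework plugins are configured.
	for name := range plugins {
		// Step 2: patch in any dependency that is not configured yet.
		for _, dep := range frameworkDeps[name] {
			if _, ok := patched[dep]; !ok {
				patched[dep] = []string{}
			}
		}
	}
	return patched
}
```

Calling `ensureDependencies` with only `tensorflow` configured would return a map that also contains an `svc` entry with no arguments. Returning a copy rather than mutating the input keeps the check easy to test in isolation. A real mutating webhook would instead emit the result as a patch against the job spec.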