# Distributed Framework Plugins

  - [Motivation](#motivation)
  - [Goals](#goals)
  - [Design](#design)
    - [Introduction of ML Distributed Pattern](#introduction-of-ml-distributed-pattern)
    - [Implementation](#implementation)
      - [Tensorflow Plugin](#tensorflow-plugin)
      - [Pytorch Plugin](#pytorch-plugin)
      - [Other Framework](#other-framework)
      - [Task Launch Sequence](#task-launch-sequence)
      - [Webhook](#webhook)

## Motivation

Volcano is widely used for machine learning workloads, but configuring a distributed job is often complicated for users.

- Users must be familiar with the Volcano job plugins (i.e. the `svc`, `env` and `ssh` job plugins).
  - For example, to run an MPI job with Volcano, you need to know exactly how these plugins behave, such as providing the names of all worker nodes in a file like `/etc/volcano/mpiworker.host`.
- Users must be familiar with shell syntax, which is needed to generate a cluster spec parameter from the files produced by the plugins.
  - e.g. generating `TF_CONFIG` from the files `worker.host` and `ps.host` for a Tensorflow job.
- Running a distributed ML job via Volcano is not straightforward. Users have to carefully set the `lifeCyclePolicy` and `restartPolicy` of every task for distributed training.
  - For example, in an MPI job, the master task fails and restarts until all worker tasks are ready. Therefore, users should add the `OnFailure` restart policy and the `TaskCompleted-CompleteJob` lifecycle policy to the master task.

If we add more in-tree plugins for distributed ML jobs, which let Volcano know more about the job type, the complexity of using Volcano for ML workloads will be reduced.

## Goals

- Add several plugins for distributed frameworks, including but not limited to Tensorflow-Plugin, MPI-Plugin, Pytorch-Plugin and MxNet-Plugin.

  - These plugins will patch the pod spec to fit the distributed pattern of the specified framework.

- Make it easier to set the ML distributed topology.

  - By using init containers, plugins will make sure tasks are launched in topology order.

- Make these plugins easy to use. Users only need to add a few lines to the job spec, e.g.
  ```yaml
  spec:
    plugins:
      tensorflow: []
  ```

## Design

### Introduction of ML Distributed Pattern

Here is a summary of the distributed training patterns of various ML frameworks, including the node topology, environment variables, files and entrypoint for distributed training.

| Framework | Topology | Environment Variables | File | Entrypoint |
| --- | --- | --- | --- | --- |
| Tensorflow | **PS mode**: Chief + Evaluator + Worker + PS<br />**All-Reduce mode**: multiple workers | `TF_CONFIG`: tensorflow cluster spec | none | `python TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| Pytorch (Official recommendation) | Master + Worker | `PET_MASTER_ADDR`: master address<br />`PET_MASTER_PORT`: master port<br />`PET_NNODES`: number of nodes<br />`PET_NPROC_PER_NODE`: number of processes per node<br />`PET_NODE_RANK`: current node index | none | `python -m torch.distributed.run TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| Pytorch (custom) | Master + Worker | `MASTER_ADDR`: master address<br />`MASTER_PORT`: master port<br />`WORLD_SIZE`: number of nodes<br />`RANK`: current node index | none | `python TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| MXNet | Scheduler + Worker + PS | `DMLC_PS_ROOT_URI`: scheduler address<br />`DMLC_PS_ROOT_PORT`: scheduler port<br />`DMLC_NUM_SERVER`: number of parameter servers<br />`DMLC_NUM_WORKER`: number of workers<br />`DMLC_ROLE`: current node role | none | `python TRAINING_SCRIPT.py (--arg1 ... train script args...)` |
| MPI | Master + Worker | `OMPI_MCA_orte_default_hostfile`: default host file path | hostfile: lists every node name and slot number | **master node**: `mpirun --hostfile HOSTFILE (-H HOSTS)  python TRAINING_SCRIPT.py (--arg1 ... train script args...)`<br />**worker node**: `/usr/sbin/sshd -D` |

### Implementation

Based on the distributed patterns summarized above, we can implement a plugin for each framework.

#### Tensorflow Plugin

The key part of the tensorflow plugin is setting the correct `TF_CONFIG` environment variable for every pod.

First, we must know the cluster role of each task in the Volcano job and the port to be exposed. This information is passed through plugin arguments, which are defined in the job spec.

```yaml
spec:
  plugins:
    # set tensorflow plugin
    tensorflow: ["--port=5000", "--worker=worker", "--ps=ps"]
```

In the implementation of `tensorflowPlugin`, these arguments are parsed.

```go
// tensorflowPlugin is the plugin for the tensorflow framework
type tensorflowPlugin struct {
	tfArguments   []string
	Clientset     pluginsinterface.PluginClientset
	psName        string
	workerName    string
	chiefName     string
	evaluatorName string
	port          int
}

// addFlags parses all plugin arguments into the plugin fields
func (tp *tensorflowPlugin) addFlags() {
	flagSet := flag.NewFlagSet(tp.Name(), flag.ContinueOnError)
	flagSet.StringVar(&tp.psName, "ps", "ps", "name of ps role task")
	flagSet.StringVar(&tp.workerName, "worker", "worker", "name of worker role task")
	flagSet.StringVar(&tp.chiefName, "chief", "chief", "name of chief role task")
	flagSet.StringVar(&tp.evaluatorName, "evaluator", "evaluator", "name of evaluator role task")
	flagSet.IntVar(&tp.port, "port", 2222, "service port")
	if err := flagSet.Parse(tp.tfArguments); err != nil {
		klog.Errorf("plugin %s flagset parse failed, err: %v", tp.Name(), err)
	}
}
```
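As a minimal, standalone sketch of this argument handling, the block below parses the same flags with the standard `flag` package. The `roleArgs` type and the `parseRoleArgs` function are hypothetical names used only for illustration and are not part of Volcano. Returning the parse error to the caller, instead of only logging it, is a choice made for this sketch.

```go
package main

import (
	"flag"
	"io"
)

// roleArgs holds the task names and the port taken from the plugin arguments.
// Hypothetical type, for illustration only.
type roleArgs struct {
	PS, Worker, Chief, Evaluator string
	Port                         int
}

// parseRoleArgs parses arguments such as ["--port=5000", "--worker=worker"].
// Flags that are not supplied keep their default values.
func parseRoleArgs(args []string) (roleArgs, error) {
	var r roleArgs
	fs := flag.NewFlagSet("tensorflow", flag.ContinueOnError)
	fs.SetOutput(io.Discard) // suppress the usage text printed on parse errors
	fs.StringVar(&r.PS, "ps", "ps", "name of ps role task")
	fs.StringVar(&r.Worker, "worker", "worker", "name of worker role task")
	fs.StringVar(&r.Chief, "chief", "chief", "name of chief role task")
	fs.StringVar(&r.Evaluator, "evaluator", "evaluator", "name of evaluator role task")
	fs.IntVar(&r.Port, "port", 2222, "service port")
	err := fs.Parse(args)
	return r, err
}
```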

Then we patch the pod spec in the `OnPodCreate` method.

```go
func (tp *tensorflowPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
	// do not patch if job is not distributed
	if len(job.Spec.Tasks) == 1 && job.Spec.Tasks[0].Replicas == 1 {
		return nil
	}
	// generate tfconfig spec
	c, err := tp.generateTFConfig(pod, job)
	if err != nil {
		return err
	}
	raw, err := json.Marshal(c)
	if err != nil {
		return err
	}
	// add TF_CONFIG environment variable to every container
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env, v1.EnvVar{
			Name:  "TF_CONFIG",
			Value: string(raw),
		})
	}
	return nil
}
```

Here is the structure of `TF_CONFIG`:

```go
type tfClusterSpec struct {
	Cluster clusterInfo `json:"cluster"`
	Task    taskInfo    `json:"task"`
}

type clusterInfo struct {
	PS        []string `json:"ps,omitempty"`
	Worker    []string `json:"worker,omitempty"`
	Chief     []string `json:"chief,omitempty"`
	Evaluator []string `json:"evaluator,omitempty"`
}

type taskInfo struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}
```
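To make the resulting value concrete, the following sketch repeats the three types and marshals one sample spec with `encoding/json`. The helper names `renderTFConfig` and `exampleTFConfig`, the host names, and the `:2222` port suffix are placeholder assumptions for illustration; the real values depend on the job name, the task names and the plugin arguments.

```go
package main

import "encoding/json"

type tfClusterSpec struct {
	Cluster clusterInfo `json:"cluster"`
	Task    taskInfo    `json:"task"`
}

type clusterInfo struct {
	PS        []string `json:"ps,omitempty"`
	Worker    []string `json:"worker,omitempty"`
	Chief     []string `json:"chief,omitempty"`
	Evaluator []string `json:"evaluator,omitempty"`
}

type taskInfo struct {
	Type  string `json:"type"`
	Index int    `json:"index"`
}

// renderTFConfig returns the JSON string that would be stored in TF_CONFIG.
func renderTFConfig(spec tfClusterSpec) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	return string(raw), nil
}

// exampleTFConfig renders the value for the second worker (index 1) of a job
// with one ps replica and two worker replicas. Roles without replicas are
// omitted from the output because of the omitempty tags.
func exampleTFConfig() (string, error) {
	return renderTFConfig(tfClusterSpec{
		Cluster: clusterInfo{
			PS:     []string{"job-ps-0.job:2222"},
			Worker: []string{"job-worker-0.job:2222", "job-worker-1.job:2222"},
		},
		Task: taskInfo{Type: "worker", Index: 1},
	})
}
```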

We can then generate a `tfClusterSpec` for each pod in the job. Here is an example:

```go
// generateTFConfig generates a tfClusterSpec for a given pod and job
func (tp *tensorflowPlugin) generateTFConfig(pod *v1.Pod, job *batch.Job) (tfClusterSpec, error) {
	// get task index by pod
	index, err := strconv.Atoi(helpers.GetPodIndexUnderTask(pod))
	if err != nil {
		return tfClusterSpec{}, err
	}
	// get task type by pod and job
	taskType := tp.getTaskType(pod, job)
	// get cluster info by job
	spec := tfClusterSpec{
		Cluster: tp.getClusterInfo(job),
		Task: taskInfo{
			Type:  taskType,
			Index: index,
		},
	}
	return spec, nil
}

// getClusterInfo returns the clusterInfo of a given job
func (tp *tensorflowPlugin) getClusterInfo(job *batch.Job) clusterInfo {
	cluster := clusterInfo{}
	for _, ts := range job.Spec.Tasks {
		hosts := []string{}
		for i := 0; i < int(ts.Replicas); i++ {
			// generate a domain name for each task replica
			hosts = append(hosts, helpers.MakeDomainName(job.Name, ts, i))
		}
		// assign the host names to the matching role in clusterInfo
		switch ts.Name {
		case tp.psName:
			cluster.PS = hosts
		case tp.workerName:
			cluster.Worker = hosts
		case tp.chiefName:
			cluster.Chief = hosts
		case tp.evaluatorName:
			cluster.Evaluator = hosts
		}
	}
	return cluster
}
```

#### Pytorch Plugin

As with the tensorflow plugin, we must first know the cluster role of each task in the Volcano job and the port to be exposed. This information is passed through plugin arguments, which are defined in the job spec.

```yaml
spec:
  plugins:
    # set pytorch plugin
    pytorch: ["--master=master","--worker=worker","--port=23456"]
```

In the implementation of `pytorchPlugin`, these arguments are parsed.

```go
// pytorchPlugin is the plugin for the pytorch framework
type pytorchPlugin struct {
	pytorchArguments []string
	clientset        pluginsinterface.PluginClientset
	masterName       string
	workerName       string
	port             int
}
```

Then, in the `OnPodCreate` method, we add the environment variables for pytorch distributed training to the container envs.
The main environment variables are:
* `MASTER_ADDR`: master address
* `MASTER_PORT`: master port
* `WORLD_SIZE`: total number of nodes
* `RANK`: current node index

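
One way these four values could be derived for a single pod is sketched below. It is not taken from the Volcano source: the `envVar` type and the `pytorchEnv` function are hypothetical, and the sketch assumes a single master replica with rank 0, workers ranked by their index plus one, and a master address of the form `<job>-<master task>-0.<job>`.

```go
package main

import "strconv"

// envVar mirrors the name/value shape of a container environment variable.
// Hypothetical type, for illustration only.
type envVar struct {
	Name  string
	Value string
}

// pytorchEnv builds the four variables for one pod.
// Assumptions: exactly one master replica (rank 0), workers get rank
// index+1, and WORLD_SIZE counts the master plus all worker replicas.
func pytorchEnv(jobName, masterTask string, port, workerReplicas int, isMaster bool, index int) []envVar {
	rank := 0
	if !isMaster {
		rank = index + 1
	}
	return []envVar{
		{Name: "MASTER_ADDR", Value: jobName + "-" + masterTask + "-0." + jobName},
		{Name: "MASTER_PORT", Value: strconv.Itoa(port)},
		{Name: "WORLD_SIZE", Value: strconv.Itoa(workerReplicas + 1)},
		{Name: "RANK", Value: strconv.Itoa(rank)},
	}
}
```

For example, the second worker (index 1) of a job with two worker replicas would get `WORLD_SIZE=3` and `RANK=2` under these assumptions.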
#### Other Framework

Most other frameworks are similar to Tensorflow, but the MPI framework is special. In most cases, it needs a `hostfile`, e.g.:

```
jobname-worker-0.jobname slots=4
jobname-worker-1.jobname slots=4
jobname-worker-2.jobname slots=4
```

To generate the `hostfile`, we need to create a `ConfigMap` in the `OnJobAdd` phase.

```go
func (mp *mpiPlugin) OnJobAdd(job *batch.Job) error {
	// generate hostfile, and create a configmap
	data := map[string]string{"hostfile": mp.hostfile(job)}
	if err := helpers.CreateOrUpdateConfigMap(job, mp.Clientset.KubeClients, data, mp.cmName(job)); err != nil {
		return err
	}

	// record the plugin in the job status once
	if job.Status.ControlledResources["plugin-"+mp.Name()] == mp.Name() {
		return nil
	}
	job.Status.ControlledResources["plugin-"+mp.Name()] = mp.Name()
	return nil
}
```

The data in the `ConfigMap` is as follows:

```yaml
data:
  hostfile: |-
    jobname-worker-0.jobname slots=4
    jobname-worker-1.jobname slots=4
    jobname-worker-2.jobname slots=4
```
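The `mp.hostfile(job)` helper is not shown in this document. A minimal sketch of how such content could be rendered is given below; `buildHostfile` and its parameters are hypothetical, and the host name pattern `<job>-<task>-<index>.<job>` is inferred from the example above.

```go
package main

import (
	"fmt"
	"strings"
)

// buildHostfile renders one "<host> slots=<n>" line per worker replica,
// joined by newlines without a trailing newline.
// Hypothetical helper, for illustration only.
func buildHostfile(jobName, workerTask string, replicas, slots int) string {
	lines := make([]string, 0, replicas)
	for i := 0; i < replicas; i++ {
		host := fmt.Sprintf("%s-%s-%d.%s", jobName, workerTask, i, jobName)
		lines = append(lines, fmt.Sprintf("%s slots=%d", host, slots))
	}
	return strings.Join(lines, "\n")
}
```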

> The utility function `CreateOrUpdateConfigMap` adds an owner reference to the ConfigMap's metadata, so the ConfigMap is deleted when the job is deleted.

In the `OnPodCreate` phase, the `hostfile` is added to the pod volumes and mounted at a specified path (e.g. `/etc/mpi/hostfile`). The `OMPI_MCA_orte_default_hostfile` environment variable should also be set.

```go
func (mp *mpiPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
	// generate hostfile volume and volumeMount
	volume := mp.hostfileVolume(job)
	mount := mp.hostfileVolumeMount(job)
	// add to pod and containers
	pod.Spec.Volumes = append(pod.Spec.Volumes, volume)
	for i := range pod.Spec.Containers {
		pod.Spec.Containers[i].VolumeMounts = append(pod.Spec.Containers[i].VolumeMounts, mount)
		pod.Spec.Containers[i].Env = append(pod.Spec.Containers[i].Env, v1.EnvVar{
			Name:  "OMPI_MCA_orte_default_hostfile",
			Value: "/etc/mpi/hostfile",
		})
	}
	return nil
}
```

#### Task Launch Sequence

As mentioned in section *Motivation*, task-level topology and launch sequence are common in distributed ML training, but there is currently no task-level scheduling policy in Volcano.
We could set task dependencies in plugins:

 - e.g. we could use `InitContainer` to control the dependency of tasks.
 - any other approaches are welcome.

##### Using InitContainer

In `OnPodCreate`, we could patch `InitContainers` into the pod. Here is an example for MPI-Plugin:

```go
func (mp *mpiPlugin) OnPodCreate(pod *v1.Pod, job *batch.Job) error {
	// ......
	// Other code
	// ......

	// Add an init container to wait for dependency tasks
	// Get dependency tasks by pod and job
	depTasks := mp.getDepTasks(pod, job)
	if len(depTasks) != 0 {
		// Generate an init container and insert it into pod spec
		pod.Spec.InitContainers = append(pod.Spec.InitContainers, mlhelpers.CreateInitContainer(depTasks, job))
	}
	return nil
}
```

For MPI-Plugin, the master task should wait for the worker task, so we only generate dependency tasks for the master pod:

```go
func (mp *mpiPlugin) getDepTasks(pod *v1.Pod, job *batch.Job) (tasks []batch.TaskSpec) {
	// get task name from pod
	taskName := mlhelpers.GetTaskName(pod, job)
	if taskName == mp.masterName {
		// get task spec from job by a given task name
		if t, ok := mlhelpers.GetTaskSpec(mp.workerName, job); ok {
			tasks = append(tasks, t)
		}
	}
	return
}
```

We offer one implementation for `InitContainer`. It has limitations but works for common scenarios. We welcome better approaches.

The logic in the init container is quite simple. It sends an ICMP message (ping) to the domain name of every task pod to check whether the pod is alive. Here is an example shell script:

```shell
# SECONDS is a bash built-in variable that counts elapsed seconds
SECONDS=0
while true
do
	ok=true
	for ip in ${IP_LIST}
	do
		# one failed ping means the dependency is not ready yet
		if ! ping -c 1 "$ip" >/dev/null
		then
			ok=false
			break
		fi
	done
	if $ok
	then
		exit 0
	else
		# give up once the timeout has elapsed
		if [ "$SECONDS" -gt "${WAIT_TIMEOUT}" ]
		then
			exit 1
		fi
		sleep 5
	fi
done
```

##### Alternatives Considered

With the introduction of [Task Launch Order Design](https://github.com/volcano-sh/volcano/blob/master/docs/design/task-launch-order-within-job.md), we can use the existing solution to manage task launch sequence.

#### Webhook

The Distributed Framework Plugins mentioned above depend on the svc plugin or other plugins, so we need to add new logic to ensure that these dependencies exist.

The new logic to be added to the webhook is as follows:

1. Check whether Distributed-Framework plugins exist
2. Patch the job spec with the plugin dependencies
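
The two steps above can be sketched as a small function over the job's plugin map. This is a sketch under stated assumptions rather than the webhook implementation: `frameworkDeps` and `patchPluginDeps` are hypothetical names, and the dependency table is assumed for illustration, since the design only states that the framework plugins depend on the svc plugin or others.

```go
package main

import "sort"

// frameworkDeps lists, for each framework plugin, the plugins it relies on.
// The entries are assumptions for illustration only.
var frameworkDeps = map[string][]string{
	"tensorflow": {"svc"},
	"pytorch":    {"svc"},
	"mpi":        {"svc", "ssh"},
}

// patchPluginDeps checks which framework plugins are present in the job's
// plugin map, adds every missing dependency with empty arguments, and
// returns the sorted names of the plugins that were added.
func patchPluginDeps(plugins map[string][]string) []string {
	var added []string
	for framework, deps := range frameworkDeps {
		if _, used := plugins[framework]; !used {
			continue
		}
		for _, dep := range deps {
			if _, ok := plugins[dep]; !ok {
				plugins[dep] = []string{}
				added = append(added, dep)
			}
		}
	}
	sort.Strings(added)
	return added
}
```

Plugins that the user already configured are left untouched, so explicit arguments such as those of a user-supplied `svc` entry are preserved.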