# Pytorch Plugin User Guide

## Introduction

The **Pytorch plugin** is designed to improve the user experience when running Pytorch jobs. It lets users write less YAML and also ensures that Pytorch jobs run normally.

## How the Pytorch Plugin Works

The Pytorch plugin does three things:

* Opens the ports used by Pytorch for all containers of the job
* Forces the `svc` plugin to be enabled
* Automatically adds the environment variables that Pytorch distributed training needs, such as `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`, to the containers

## Parameters of the Pytorch Plugin

### Arguments

| ID | Name   | Type   | Default Value | Required | Description                        | Example         |
| -- | ------ | ------ | ------------- | -------- | ---------------------------------- | --------------- |
| 1  | master | string | master        | No       | Name of Pytorch master             | --master=master |
| 2  | worker | string | worker        | No       | Name of Pytorch worker             | --worker=worker |
| 3  | port   | string | 23456         | No       | The port to open for the container | --port=23456    |

## Examples

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-job
spec:
  minAvailable: 1
  schedulerName: volcano
  plugins:
    pytorch: ["--master=master","--worker=worker","--port=23456"] # Register the Pytorch plugin
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-sendrecv-test:1.0
              imagePullPolicy: IfNotPresent
              name: master
          restartPolicy: OnFailure
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-sendrecv-test:1.0
              imagePullPolicy: IfNotPresent
              name: worker
              workingDir: /home
          restartPolicy: OnFailure
```
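
To illustrate how a training script inside one of the job's containers might consume the environment variables the plugin injects (`MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, `RANK`), here is a minimal Python sketch. The `DistEnv` class and `read_dist_env` function are hypothetical helpers introduced for this illustration; they are not part of Volcano or Pytorch. The sketch only parses and validates the variables, so it runs without Pytorch installed.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    """Hypothetical container for the variables injected by the plugin."""

    master_addr: str
    master_port: int
    world_size: int
    rank: int

    @property
    def init_method(self) -> str:
        # TCP rendezvous URL built from the master address and port.
        return f"tcp://{self.master_addr}:{self.master_port}"


def read_dist_env(env: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the four variables from `env`; raise if any is missing."""
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in env]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return DistEnv(
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
    )


if __name__ == "__main__":
    # Illustrative values only; inside a real job the plugin sets these.
    example = {
        "MASTER_ADDR": "example-master-0",
        "MASTER_PORT": "23456",
        "WORLD_SIZE": "3",
        "RANK": "0",
    }
    print(read_dist_env(example))
```

In a real training script, the same variables are typically picked up directly by `torch.distributed.init_process_group(backend=..., init_method="env://")`, so an explicit helper like this is mainly useful for logging or for failing early when the job was submitted without the plugin.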