# Pytorch Plugin User Guide

## Introduction

The **Pytorch plugin** is designed to improve the user experience of running PyTorch jobs. It lets users write less YAML and configures what PyTorch jobs need in order to run normally.

## How the Pytorch Plugin Works

The Pytorch plugin does three things:

* Opens the ports used by PyTorch on all containers of the job
* Forces the `svc` plugin to be enabled
* Automatically adds the environment variables that PyTorch distributed training needs, such as `MASTER_ADDR`, `MASTER_PORT`, `WORLD_SIZE`, and `RANK`, to the containers
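
As a rough illustration of the third point, the sketch below shows how a training script might read the injected variables. The `DistEnv` class and the `read_dist_env` helper are hypothetical names, not part of Volcano or PyTorch, and the sample values are made up for the demonstration.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class DistEnv:
    """Rendezvous settings for one process (hypothetical helper type)."""
    master_addr: str
    master_port: int
    world_size: int
    rank: int


def read_dist_env(environ: Mapping[str, str] = os.environ) -> DistEnv:
    """Read the four variables named above, failing loudly if one is missing."""
    required = ("MASTER_ADDR", "MASTER_PORT", "WORLD_SIZE", "RANK")
    missing = [name for name in required if name not in environ]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")
    return DistEnv(
        master_addr=environ["MASTER_ADDR"],
        master_port=int(environ["MASTER_PORT"]),
        world_size=int(environ["WORLD_SIZE"]),
        rank=int(environ["RANK"]),
    )


if __name__ == "__main__":
    # Made-up values for illustration; inside a job the plugin sets them.
    sample = {
        "MASTER_ADDR": "example-master-host",
        "MASTER_PORT": "23456",
        "WORLD_SIZE": "3",
        "RANK": "1",
    }
    print(read_dist_env(sample))
```

In a real training script these values are usually consumed indirectly: `torch.distributed.init_process_group` with `init_method="env://"` reads the same four variables.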

## Parameters of the Pytorch Plugin

### Arguments

| ID   | Name   | Type   | Default Value | Required | Description                        | Example         |
| ---- | ------ | ------ | ------------- | -------- | ---------------------------------- | --------------- |
| 1    | master | string | master        | No       | Name of the Pytorch master task    | --master=master |
| 2    | worker | string | worker        | No       | Name of the Pytorch worker task    | --worker=worker |
| 3    | port   | string | 23456         | No       | The port to open for the container | --port=23456    |
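
To make the defaults concrete, here is a minimal Python sketch that parses an argument list in the same `--name=value` form. It only mirrors the table above for illustration; the plugin itself is part of Volcano and does not use this code, and `parse_plugin_args` is a hypothetical name.

```python
import argparse
from typing import List


def parse_plugin_args(args: List[str]) -> argparse.Namespace:
    """Parse Pytorch-plugin-style arguments using the defaults from the table."""
    parser = argparse.ArgumentParser(prog="pytorch-plugin-args", add_help=False)
    parser.add_argument("--master", default="master")  # name of the master task
    parser.add_argument("--worker", default="worker")  # name of the worker task
    parser.add_argument("--port", default="23456")     # kept as a string, as in the table
    return parser.parse_args(args)


if __name__ == "__main__":
    print(parse_plugin_args([]))
    print(parse_plugin_args(["--master=chief", "--port=29500"]))
```

Since none of the arguments is required, an empty argument list should resolve to `master`, `worker`, and `23456`.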

## Examples

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: pytorch-job
spec:
  minAvailable: 1
  schedulerName: volcano
  plugins:
    pytorch: ["--master=master","--worker=worker","--port=23456"] # Register the Pytorch plugin
  tasks:
    - replicas: 1
      name: master
      policies:
        - event: TaskCompleted
          action: CompleteJob
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-sendrecv-test:1.0
              imagePullPolicy: IfNotPresent
              name: master
          restartPolicy: OnFailure
    - replicas: 2
      name: worker
      template:
        spec:
          containers:
            - image: gcr.io/kubeflow-ci/pytorch-dist-sendrecv-test:1.0
              imagePullPolicy: IfNotPresent
              name: worker
              workingDir: /home
          restartPolicy: OnFailure
```
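
For the job above (one master replica and two worker replicas), the following Python sketch works out the values a reader might expect to find in each pod. It rests on an assumption that this guide does not state: `WORLD_SIZE` counts every master and worker replica, the master gets rank 0, and worker `i` gets rank `i + 1`. The `expected_env` helper and the `master-0` / `worker-N` keys are hypothetical labels, not actual pod names.

```python
from typing import Dict


def expected_env(worker_replicas: int, port: str = "23456") -> Dict[str, Dict[str, str]]:
    """Hypothetical helper: expected variables for one master plus N workers.

    ASSUMPTION: WORLD_SIZE counts every master and worker replica, the master
    gets rank 0, and worker i gets rank i + 1.
    """
    world_size = str(1 + worker_replicas)
    env = {"master-0": {"MASTER_PORT": port, "WORLD_SIZE": world_size, "RANK": "0"}}
    for i in range(worker_replicas):
        env[f"worker-{i}"] = {
            "MASTER_PORT": port,
            "WORLD_SIZE": world_size,
            "RANK": str(i + 1),
        }
    return env


if __name__ == "__main__":
    # Two workers, matching `replicas: 2` in the worker task above.
    for pod, variables in expected_env(worker_replicas=2).items():
        print(pod, variables)
```

Comparing these values with the environment of a running container is a quick way to check that the plugin was applied to the job.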