
## Installation & deployment tips
1. You need to configure your nodes to use GPUs. This can be done as follows:
    * Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
    * Connect to your master node and set nvidia as the default runtime in `/etc/docker/daemon.json`:
        ```json
        {
            "default-runtime": "nvidia",
            "runtimes": {
                "nvidia": {
                    "path": "/usr/bin/nvidia-container-runtime",
                    "runtimeArgs": []
                }
            }
        }
        ```
    * After that, deploy the NVIDIA device plugin to Kubernetes:
        ```bash
        kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
        ```
    20          
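    * To verify that the device plugin is running and the GPUs are advertised to the scheduler, you can inspect the node resources (a quick sanity check; the reported count depends on your hardware):
        ```bash
        kubectl describe nodes | grep nvidia.com/gpu
        ```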
2. NVIDIA GPUs can now be consumed via container-level resource requests using the resource name `nvidia.com/gpu`:
      ```yaml
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
      ```
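      For context, a minimal standalone pod spec that consumes a GPU could look like the following sketch (the pod name and image are illustrative, not part of the examples):
      ```yaml
      apiVersion: v1
      kind: Pod
      metadata:
        name: gpu-test          # hypothetical pod name
      spec:
        restartPolicy: Never
        containers:
          - name: cuda
            image: nvidia/cuda:11.8.0-base-ubuntu22.04   # illustrative CUDA base image
            command: ["nvidia-smi"]                      # prints the visible GPUs, then exits
            resources:
              limits:
                nvidia.com/gpu: 1
      ```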
    27  
3. Building an image. Each example has prebuilt images that are stored on Google Container Registry (GCR). If you want to create your own image, we recommend using Docker Hub. Each example has its own Dockerfile, which we strongly advise using. To build a custom image, follow the instructions on [TechRepublic](https://www.techrepublic.com/article/how-to-create-a-docker-image-and-push-it-to-docker-hub/).
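    In short, the Docker Hub workflow looks roughly like this (the repository name and tag are placeholders to replace with your own):
    ```bash
    # Build the image from the example's Dockerfile in the current directory
    docker build -t <your-dockerhub-user>/pytorch-example:v1 .
    # Authenticate and push the image to Docker Hub
    docker login
    docker push <your-dockerhub-user>/pytorch-example:v1
    ```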
    29  
4. To deploy your job, we recommend following the official [Kubeflow documentation](https://www.kubeflow.org/docs/guides/components/pytorch/). Each example ships example YAML files for the two API versions. Feel free to modify them, e.g. the image or the number of GPUs.
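    As an illustration, a minimal PyTorchJob manifest for the `kubeflow.org/v1` API could look like the following sketch (the job name, image, and replica counts are placeholders to adapt; the examples' own YAML files remain the authoritative reference):
    ```yaml
    apiVersion: kubeflow.org/v1
    kind: PyTorchJob
    metadata:
      name: pytorch-example            # hypothetical job name
    spec:
      pytorchReplicaSpecs:
        Master:
          replicas: 1
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch        # the container must be named "pytorch"
                  image: <your-dockerhub-user>/pytorch-example:v1
                  resources:
                    limits:
                      nvidia.com/gpu: 1
        Worker:
          replicas: 2                  # scale the number of workers as needed
          restartPolicy: OnFailure
          template:
            spec:
              containers:
                - name: pytorch
                  image: <your-dockerhub-user>/pytorch-example:v1
                  resources:
                    limits:
                      nvidia.com/gpu: 1
    ```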