## Installation & deployment tips

1. You need to configure your nodes to use GPUs. This can be done as follows:
   * Install [nvidia-docker2](https://github.com/NVIDIA/nvidia-docker)
   * Connect to your master node and set `nvidia` as the default runtime in `/etc/docker/daemon.json`:
     ```
     {
         "default-runtime": "nvidia",
         "runtimes": {
             "nvidia": {
                 "path": "/usr/bin/nvidia-container-runtime",
                 "runtimeArgs": []
             }
         }
     }
     ```
   * After that, deploy the NVIDIA device plugin DaemonSet to Kubernetes:
     ```bash
     kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v1.11/nvidia-device-plugin.yml
     ```

2. NVIDIA GPUs can now be consumed via container-level resource requirements using the resource name `nvidia.com/gpu`:
   ```
   resources:
     limits:
       nvidia.com/gpu: 2 # requesting 2 GPUs
   ```

3. Building an image. Each example comes with prebuilt images stored on Google Container Registry (GCR). If you want to create your own image, we recommend using Docker Hub. Each example has its own Dockerfile, which we strongly advise using. To build a custom image, follow the instructions on [TechRepublic](https://www.techrepublic.com/article/how-to-create-a-docker-image-and-push-it-to-docker-hub/).

4. To deploy your job, we recommend following the official [Kubeflow documentation](https://www.kubeflow.org/docs/guides/components/pytorch/). Each example includes example YAML files for the two API versions. Feel free to modify them, e.g. the image or the number of GPUs.
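As a sketch of how the steps above fit together, a minimal PyTorchJob manifest might look like the following. The job name, image name, replica counts, and GPU counts are placeholder assumptions, and the exact API group/version depends on the training-operator release you have installed — compare against the example YAML files shipped with each example:

```
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-example          # hypothetical job name
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-dockerhub-user>/pytorch-example:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 1  # one GPU for the master
    Worker:
      replicas: 2
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: <your-dockerhub-user>/pytorch-example:latest  # placeholder image
              resources:
                limits:
                  nvidia.com/gpu: 2  # two GPUs per worker
```

Such a manifest would be submitted with `kubectl create -f <file>.yaml`, the same way the device plugin was deployed above.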
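If you build your own image from an example's Dockerfile, the workflow is roughly the following sketch; the repository and tag names (`<your-dockerhub-user>/pytorch-example:latest`) are placeholders, not names from this repository:

```bash
# Build the image from the example's Dockerfile (run inside the example directory).
docker build -t <your-dockerhub-user>/pytorch-example:latest .

# Log in to Docker Hub and push the image so your cluster nodes can pull it.
docker login
docker push <your-dockerhub-user>/pytorch-example:latest
```

After the push, reference the pushed image name in the job's YAML under the container's `image` field.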