github.com/kubeflow/training-operator@v1.7.0/examples/pytorch/mnist/README.md

### Distributed MNIST Examples

This folder contains an example that trains a model on MNIST. The example is also used for e2e testing.

The Python script that trains MNIST with PyTorch accepts several arguments that can be used
to switch the distributed backend. The manifests that launch distributed training of this
script with the PyTorch operator are under the respective version folders: [v1](./v1).
Each folder contains manifests with example usage of the different backends.

**Note**: By default, a PyTorch job doesn't work in a user namespace because of Istio [automatic sidecar injection](https://istio.io/v1.3/docs/setup/additional-setup/sidecar-injection/#automatic-sidecar-injection). To get it running, disable injection by adding the annotation `sidecar.istio.io/inject: "false"` to either the PyTorch pods or the namespace.

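As a sketch of where the annotation could go for the pod-level option, the snippet below writes a YAML fragment covering the pod templates of a PyTorchJob. The file name `istio-annotation-fragment.yaml` is hypothetical, and the `Master`/`Worker` layout assumes the usual `kubeflow.org/v1` PyTorchJob structure; merge the `annotations` block into the manifest you actually deploy.

```shell
# Write an illustrative fragment (hypothetical file name) showing the
# annotation on the Master and Worker pod templates of a PyTorchJob.
cat > istio-annotation-fragment.yaml <<'EOF'
spec:
  pytorchReplicaSpecs:
    Master:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
    Worker:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
EOF

# Show the fragment so it can be copied into the job manifest.
cat istio-annotation-fragment.yaml
```
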
**Build Image**

The default image name and tag is `kubeflow/pytorch-dist-mnist-test:1.0`.

```shell
docker build -f Dockerfile -t kubeflow/pytorch-dist-mnist-test:1.0 ./
```

NOTE: If you are working on a Power system, `Dockerfile.ppc64le` can be used instead.

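The choice between the two Dockerfiles can be scripted. The following is a minimal sketch that assumes both `Dockerfile` and `Dockerfile.ppc64le` sit in the current directory and that `uname -m` reports `ppc64le` on Power systems; the `IMAGE` and `DOCKERFILE` variable names are illustrative. It only prints the build command, so nothing is built until you run the printed line yourself.

```shell
# Default image name and tag from this README.
IMAGE="kubeflow/pytorch-dist-mnist-test:1.0"

# Select the Dockerfile from the machine architecture.
case "$(uname -m)" in
  ppc64le) DOCKERFILE="Dockerfile.ppc64le" ;;
  *)       DOCKERFILE="Dockerfile" ;;
esac

# Print the command instead of running it (dry run).
echo "docker build -f ${DOCKERFILE} -t ${IMAGE} ./"
```
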
**Create the mnist PyTorch job**

The example below uses the gloo backend.

```shell
kubectl create -f ./v1/pytorch_job_mnist_gloo.yaml
```
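
To try another backend, the manifest path can be derived from the backend name. This is a sketch under an assumption: only `pytorch_job_mnist_gloo.yaml` is named in this README, and the `pytorch_job_mnist_<backend>.yaml` pattern for other backends is inferred from it, so check the `v1` folder for the files that actually exist. The `BACKEND` and `MANIFEST` variables are illustrative.

```shell
# Backend to use; defaults to gloo, the one named in this README.
BACKEND="${BACKEND:-gloo}"

# Assumed naming pattern, inferred from the gloo manifest.
MANIFEST="./v1/pytorch_job_mnist_${BACKEND}.yaml"

# Create the job only if the manifest is present.
if [ -f "${MANIFEST}" ]; then
  kubectl create -f "${MANIFEST}"
else
  echo "manifest not found: ${MANIFEST}" >&2
fi
```

Checking for the file first keeps a mistyped backend name from reaching `kubectl`.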