# Multi-worker training with Keras

This directory contains an example of running multi-worker distributed training
using the TensorFlow 2.1 Keras API on Kubeflow. For more information about the
source code, see the TensorFlow tutorials [here](https://www.tensorflow.org/tutorials/distribute/keras) and [here](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).

## Prerequisites

Your cluster must be configured to use multiple GPUs;
please follow the [instructions](https://www.kubeflow.org/docs/components/training/tftraining/#using-gpus).

## Steps

1. Build an image:
   ```
   docker build -f Dockerfile -t kubeflow/multi_worker_strategy:v1.0 .
   ```

2. Specify your `storageClassName` and create a persistent volume claim to save
   models and checkpoints:
   ```
   kubectl -n ${NAMESPACE} create -f pvc.yaml
   ```

3. Create a TFJob. If you use GPUs from a vendor other than NVIDIA, replace
   `nvidia.com/gpu` with your GPU vendor's resource name in the `limits` section:
   ```
   kubectl -n ${NAMESPACE} create -f multi_worker_tfjob.yaml
   ```
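For orientation, a TFJob manifest for this example might look roughly like the sketch below. The replica count, job name, and image tag are assumptions for illustration; the actual manifest shipped with this example is `multi_worker_tfjob.yaml`.

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: multi-worker        # hypothetical job name
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2           # number of training workers (assumed)
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow    # TFJob expects the container to be named "tensorflow"
              image: kubeflow/multi_worker_strategy:v1.0
              resources:
                limits:
                  nvidia.com/gpu: 1   # replace with your vendor's resource name if not NVIDIA
```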
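When the TFJob's worker pods start, the training operator injects a `TF_CONFIG` environment variable describing the cluster, which `tf.distribute.MultiWorkerMirroredStrategy` reads to discover its peers (see the multi-worker tutorial linked above). A minimal sketch of that variable's structure, using hypothetical worker addresses:

```python
import json
import os

# Hypothetical TF_CONFIG of the kind the training operator injects into each
# worker pod; the service names and port here are illustrative only.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": [
            "multi-worker-worker-0.default.svc:2222",
            "multi-worker-worker-1.default.svc:2222",
        ]
    },
    # Each pod gets a different "index"; together they form one training job.
    "task": {"type": "worker", "index": 0},
})

tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
task_index = tf_config["task"]["index"]

print(num_workers)  # total peers participating in training
print(task_index)   # this pod's position in the worker list
```

Inside the container, the training script itself does not need to parse `TF_CONFIG` by hand; constructing `tf.distribute.MultiWorkerMirroredStrategy()` and building the Keras model under `strategy.scope()` picks it up automatically.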