# Distributed Training on Kubeflow

This is an example of running distributed training on Kubeflow. The source code is taken from
the TensorFlow team's example [here](https://github.com/tensorflow/ecosystem/tree/master/distribution_strategy).

The directory contains the following files:
* Dockerfile: Builds the independent worker image.
* Makefile: For building the above image.
* keras_model_to_estimator.py: The model code that runs multi-worker training. It is identical to the TensorFlow example.
* distributed_tfjob.yaml: The TFJob spec.

To run the example, edit `distributed_tfjob.yaml` to use your cluster's namespace, then run
```
kubectl apply -f distributed_tfjob.yaml
```
to create the job.

Then use
```
kubectl -n ${NAMESPACE} describe tfjob distributed-training
```
to see the status.
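For reference, a minimal sketch of the shape such a TFJob spec takes; the namespace value, replica count, and image name below are illustrative assumptions, not the actual contents of `distributed_tfjob.yaml`:

```
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-training
  namespace: your-namespace        # edit this to match your cluster's namespace
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3                  # assumed worker count for illustration
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow     # TFJob expects the primary container to be named "tensorflow"
              image: your-registry/estimator-example:latest  # assumed tag for the image built by the Makefile
```

The operator creates one pod per worker replica and wires up the `TF_CONFIG` environment variable for multi-worker training, which is why only the replica count and image need to be specified here.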