github.com/kubeflow/training-operator@v1.7.0/examples/tensorflow/distribution_strategy/estimator-API/README.md

# Distributed Training on Kubeflow

This is an example of running distributed training on Kubeflow. The source code is taken from
the TensorFlow team's example [here](https://github.com/tensorflow/ecosystem/tree/master/distribution_strategy).
     5  
The directory contains the following files:
* Dockerfile: Builds the independent worker image.
* Makefile: For building the above image.
* keras_model_to_estimator.py: The model code that runs multi-worker training; identical to the TensorFlow example.
* distributed_tfjob.yaml: The TFJob spec.
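
The actual spec lives in `distributed_tfjob.yaml` in this directory; as a rough sketch (replica count, namespace, and image name below are placeholders, not the file's real values), a TFJob for this example has this shape:

```yaml
apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: distributed-training
  namespace: your-namespace        # edit to match your cluster
spec:
  tfReplicaSpecs:
    Worker:
      replicas: 3                  # number of independent workers
      restartPolicy: Never
      template:
        spec:
          containers:
            - name: tensorflow     # the operator expects this container name
              image: your-registry/estimator-api-example:latest  # image built by the Makefile
```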
    11  
To run the example, edit `distributed_tfjob.yaml` to set your cluster's namespace. Then run
```shell
kubectl apply -f distributed_tfjob.yaml
```
to create the job.
    17  
Then use
```shell
kubectl -n ${NAMESPACE} describe tfjob distributed-training
```
to see the job's status.
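
Under the hood, the training operator injects a `TF_CONFIG` environment variable into each worker pod; TensorFlow's multi-worker distribution strategy reads it to discover the cluster and the pod's own role. A sketch of what worker 0 of a 3-worker job would see (hostnames and port are illustrative, following the operator's `<job-name>-worker-<index>` pod naming):

```python
import json
import os

# Illustrative TF_CONFIG for worker 0 of a 3-worker TFJob named
# "distributed-training" (values are an example, not taken from a live cluster).
tf_config = {
    "cluster": {
        "worker": [
            "distributed-training-worker-0:2222",
            "distributed-training-worker-1:2222",
            "distributed-training-worker-2:2222",
        ]
    },
    "task": {"type": "worker", "index": 0},
}
os.environ["TF_CONFIG"] = json.dumps(tf_config)

# The model code (keras_model_to_estimator.py) relies on TensorFlow parsing
# this variable; here we just parse it back to show its structure.
parsed = json.loads(os.environ["TF_CONFIG"])
print(parsed["task"]["type"])               # → worker
print(len(parsed["cluster"]["worker"]))     # → 3
```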