github.com/kubeflow/training-operator@v1.7.0/examples/tensorflow/distribution_strategy/keras-API/README.md

# Multi-worker training with Keras

This directory contains an example of running multi-worker distributed training
using the TensorFlow 2.1 Keras API on Kubeflow. For more information about the
source code, see the TensorFlow tutorials [here](https://www.tensorflow.org/tutorials/distribute/keras) and [here](https://www.tensorflow.org/tutorials/distribute/multi_worker_with_keras).
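Under the hood, the training operator injects a `TF_CONFIG` environment variable into each worker pod, and `tf.distribute.MultiWorkerMirroredStrategy` reads it to discover its peers and its own role. A minimal sketch of that variable's shape, parsed with only the standard library (the worker service names below are hypothetical):

```python
import json
import os

# A TF_CONFIG value in the shape the training operator injects into each
# worker pod (the host names here are made up for illustration).
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {
        "worker": ["multi-worker-worker-0:2222", "multi-worker-worker-1:2222"],
    },
    "task": {"type": "worker", "index": 0},
})

# MultiWorkerMirroredStrategy performs this discovery itself; the same
# information can be recovered by hand:
tf_config = json.loads(os.environ["TF_CONFIG"])
num_workers = len(tf_config["cluster"]["worker"])
# By convention, worker 0 acts as the chief and writes checkpoints.
is_chief = tf_config["task"]["type"] == "worker" and tf_config["task"]["index"] == 0
print(num_workers, is_chief)  # → 2 True
```

Because each pod sees a different `task.index`, the same training script can decide per-pod responsibilities (for example, only the chief saving the final model).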

## Prerequisites

Your cluster must be configured to use multiple GPUs;
please follow the [instructions](https://www.kubeflow.org/docs/components/training/tftraining/#using-gpus).

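To check that your nodes actually expose a GPU resource before submitting the job, you can inspect the allocatable resources (this assumes the NVIDIA device plugin's `nvidia.com/gpu` resource name):

```
# List each node and how many GPUs it advertises as allocatable
kubectl get nodes -o custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu
```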
## Steps

1.  Build an image
    ```
    docker build -f Dockerfile -t kubeflow/multi_worker_strategy:v1.0 .
    ```

2.  Specify your storageClassName and create a persistent volume claim to save
    models and checkpoints
    ```
    kubectl -n ${NAMESPACE} create -f pvc.yaml
    ```

3.  Create a TFJob. If you use GPUs from a vendor other than NVIDIA, replace
    `nvidia.com/gpu` with your vendor's resource name in the `limits` section.
    ```
    kubectl -n ${NAMESPACE} create -f multi_worker_tfjob.yaml
    ```