github.com/kubeflow/training-operator@v1.7.0/sdk/python/docs/TFJobClient.md (about)

# TFJobClient

> TFJobClient(config_file=None, context=None, client_configuration=None, persist_config=True)

Loads authentication and cluster information from a kube-config file and stores them in `kubernetes.client.configuration`. The parameters are as follows:

Parameter | Description
------------ | -------------
config_file | Name of the kube-config file. Defaults to `~/.kube/config`. If the SDK is running inside a cluster and you want to operate on TFJobs in another, remote cluster, you must set `config_file` explicitly to load the kube-config file, e.g. `TFJobClient(config_file="~/.kube/config")`. |
context | The active context to set. If `None`, the `current_context` from the config file is used. |
client_configuration | The `kubernetes.client.Configuration` to set configs to. |
persist_config | If `True`, the config file is updated when it changes (e.g. on GCP token refresh). |


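As a minimal sketch of the two cases in the table above (default kube-config versus an explicitly loaded one for a remote cluster), the helper below builds the keyword arguments for `TFJobClient`. The name `client_kwargs` and its `remote_kubeconfig` argument are hypothetical and not part of the SDK; only `config_file`, `context`, and `persist_config` come from the signature above.

```python
import os


def client_kwargs(remote_kubeconfig=None, context=None, persist_config=True):
    """Build keyword arguments for TFJobClient (hypothetical helper).

    Pass remote_kubeconfig when the SDK runs inside one cluster but must
    operate on TFJobs in another, so the kube-config file is loaded explicitly.
    """
    kwargs = {"context": context, "persist_config": persist_config}
    if remote_kubeconfig is not None:
        # Expand "~" here so the resulting path is unambiguous.
        kwargs["config_file"] = os.path.expanduser(remote_kubeconfig)
    return kwargs


# Usage (assumes the Kubeflow training SDK is installed):
#   from kubeflow.training import TFJobClient
#   tfjob_client = TFJobClient(**client_kwargs("~/.kube/config"))
```

When `remote_kubeconfig` is omitted, `config_file` is left out entirely so the client falls back to its own default.
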
The APIs for TFJobClient are as follows:

Class | Method | Description
------------ | ------------- | -------------
TFJobClient | [create](#create) | Create a TFJob |
TFJobClient | [get](#get) | Get the specified TFJob or all TFJobs in the namespace |
TFJobClient | [patch](#patch) | Patch the specified TFJob |
TFJobClient | [delete](#delete) | Delete the specified TFJob |
TFJobClient | [wait_for_job](#wait_for_job) | Wait for the specified job to finish |
TFJobClient | [wait_for_condition](#wait_for_condition) | Wait until any of the specified conditions occurs |
TFJobClient | [get_job_status](#get_job_status) | Get the TFJob status |
TFJobClient | [is_job_running](#is_job_running) | Check if the TFJob status is Running |
TFJobClient | [is_job_succeeded](#is_job_succeeded) | Check if the TFJob status is Succeeded |
TFJobClient | [get_pod_names](#get_pod_names) | Get pod names of the TFJob |
TFJobClient | [get_logs](#get_logs) | Get training logs of the TFJob |


## create
> create(tfjob, namespace=None)

Create the provided TFJob in the specified namespace.

### Example

```python
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container

from kubeflow.training import constants
from kubeflow.training import utils
from kubeflow.training import V1ReplicaSpec
from kubeflow.training import KubeflowOrgV1TFJob
from kubeflow.training import KubeflowOrgV1TFJobList
from kubeflow.training import KubeflowOrgV1TFJobSpec
from kubeflow.training import TFJobClient

# Namespace to create the TFJob in; 'kubeflow' matches the other examples.
namespace = "kubeflow"

# Training container that runs the MNIST example.
container = V1Container(
    name="tensorflow",
    image="gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    command=[
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs", "--learning_rate=0.01",
        "--batch_size=150"
        ]
)

# A single worker replica that is not restarted on exit.
worker = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

tfjob = KubeflowOrgV1TFJob(
    api_version="kubeflow.org/v1",
    kind="TFJob",
    metadata=V1ObjectMeta(name="mnist", namespace=namespace),
    spec=KubeflowOrgV1TFJobSpec(
        clean_pod_policy="None",
        tf_replica_specs={"Worker": worker}
    )
)


tfjob_client = TFJobClient()
tfjob_client.create(tfjob)

```


### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
tfjob | [KubeflowOrgV1TFJob](KubeflowOrgV1TFJob.md) | The TFJob definition. | Required |
namespace | str | Namespace to deploy the TFJob to. If `namespace` is not specified, the namespace from the TFJob definition is used; if that is also unset, the current or default namespace is used. | Optional |

### Return type
object

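The namespace fallback order described for `namespace` can be sketched as a small function. This illustrates the documented precedence only and is not the SDK's code; `resolve_namespace` and its `fallback` argument (standing in for the current or default namespace) are hypothetical names.

```python
def resolve_namespace(explicit=None, tfjob_namespace=None, fallback="default"):
    """Pick a namespace in the documented order (illustrative only).

    1. The `namespace` argument, if given.
    2. The namespace set in the TFJob definition's metadata.
    3. The current or default namespace (represented here by `fallback`).
    """
    if explicit:
        return explicit
    if tfjob_namespace:
        return tfjob_namespace
    return fallback
```

For the `create` example above, `resolve_namespace(None, "kubeflow")` picks the namespace from the TFJob metadata because no explicit `namespace` argument is passed.
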
## get
> get(name=None, namespace=None, watch=False, timeout_seconds=600)

Get the specified TFJob, or all TFJobs, in the specified namespace.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. If `name` is not specified, all TFJobs in the namespace are returned. | Optional |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |
watch | bool | If `True`, watch the TFJob; otherwise return the TFJob object. Watching stops when `timeout_seconds` elapses or once the TFJob status is `Succeeded` or `Failed`. | Optional |
timeout_seconds | int | Timeout in seconds for watching. Defaults to 600. | Optional |

### Return type
object


## patch
> patch(name, tfjob, namespace=None)

Patch the specified TFJob in the specified namespace.

Note that `patch` may not work if you want to change a field from an existing value to `None`; in that case, use the [replace](#replace) API to remove the field value.

### Example

```python
from kubeflow.training import KubeflowOrgV1TFJob
from kubeflow.training import TFJobClient

tfjob = KubeflowOrgV1TFJob(
    api_version="kubeflow.org/v1",
    ... # update something in the TFJob spec
)

tfjob_client = TFJobClient()
tfjob_client.patch('mnist', tfjob)

```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
tfjob | [KubeflowOrgV1TFJob](KubeflowOrgV1TFJob.md) | The TFJob definition. | Required |
namespace | str | The TFJob's namespace for patching. If `namespace` is not specified, the namespace from the TFJob definition is used; if that is also unset, the current or default namespace is used. | Optional |

### Return type
object


## delete
> delete(name, namespace=None)

Delete the specified TFJob in the specified namespace.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.delete('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |

### Return type
object


## wait_for_job
> wait_for_job(name,
>              namespace=None,
>              timeout_seconds=600,
>              polling_interval=30,
>              watch=False,
>              status_callback=None)

Wait for the specified job to finish.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.wait_for_job('mnist', namespace='kubeflow')

# The API also supports watching the TFJob status until it is Succeeded or Failed.
tfjob_client.wait_for_job('mnist', namespace='kubeflow', watch=True)
# Example output:
# NAME                           STATE                TIME
# mnist                          Created              2019-12-31T09:20:07Z
# mnist                          Running              2019-12-31T09:20:19Z
# mnist                          Running              2019-12-31T09:20:19Z
# mnist                          Succeeded            2019-12-31T09:22:04Z
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |
timeout_seconds | int | How long to wait for the job, in seconds. Defaults to 600. | Optional |
polling_interval | int | How often to poll for the status of the job. Defaults to 30. | Optional |
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the TFJob. | Optional |
watch | bool | If `True`, watch the TFJob. Watching stops when `timeout_seconds` elapses or once the TFJob status is `Succeeded` or `Failed`. | Optional |

### Return type
object


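The `status_callback` parameter accepts any callable that takes the TFJob as its only argument. The sketch below records the latest condition type seen on each poll. It assumes the TFJob is passed as a dict-like object with a `status.conditions` list ordered oldest to newest; that shape is an assumption and is not stated in this document. `make_status_recorder` is a hypothetical helper name.

```python
def make_status_recorder():
    """Return (callback, history); pass callback as status_callback."""
    history = []

    def callback(tfjob):
        # Assumption: tfjob is dict-like with status.conditions, newest last.
        conditions = (tfjob.get("status") or {}).get("conditions") or []
        latest = conditions[-1].get("type", "") if conditions else ""
        history.append(latest)

    return callback, history


# Usage (assumes the Kubeflow training SDK is installed):
#   callback, history = make_status_recorder()
#   tfjob_client.wait_for_job('mnist', namespace='kubeflow',
#                             status_callback=callback)
```
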
## wait_for_condition
> wait_for_condition(name,
>                    expected_condition,
>                    namespace=None,
>                    timeout_seconds=600,
>                    polling_interval=30,
>                    status_callback=None)


Wait until any of the specified conditions occurs.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.wait_for_condition('mnist', expected_condition=["Succeeded", "Failed"], namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
expected_condition | list | A list of conditions. The function waits until any of the supplied conditions is reached. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |
timeout_seconds | int | How long to wait for the job, in seconds. Defaults to 600. | Optional |
polling_interval | int | How often to poll for the status of the job. Defaults to 30. | Optional |
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the TFJob. | Optional |

### Return type
object

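The interplay of `expected_condition`, `timeout_seconds`, and `polling_interval` can be illustrated with a generic polling loop. This is a sketch of the semantics described above, not the SDK's implementation; `wait_for_any`, `get_status`, `sleep`, and `clock` are hypothetical names, and raising `TimeoutError` on timeout is an assumption about how a caller might want to be notified.

```python
import time


def wait_for_any(get_status, expected_condition, timeout_seconds=600,
                 polling_interval=30, sleep=time.sleep, clock=time.monotonic):
    """Poll get_status() until it returns one of expected_condition.

    Illustrative only: raises TimeoutError once timeout_seconds has elapsed
    without any of the expected conditions being observed.
    """
    deadline = clock() + timeout_seconds
    while True:
        status = get_status()
        if status in expected_condition:
            return status
        if clock() >= deadline:
            raise TimeoutError(
                f"none of {expected_condition} reached in {timeout_seconds}s"
            )
        sleep(polling_interval)
```

With a real client, `get_status` could be `lambda: tfjob_client.get_job_status('mnist', namespace='kubeflow')`.
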
## get_job_status
> get_job_status(name, namespace=None)

Returns the TFJob status, such as Running, Failed or Succeeded.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get_job_status('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |

### Return type
str

## is_job_running
> is_job_running(name, namespace=None)

Returns `True` if the TFJob is running; `False` otherwise.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.is_job_running('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |

### Return type
bool

## is_job_succeeded
> is_job_succeeded(name, namespace=None)

Returns `True` if the TFJob succeeded; `False` otherwise.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.is_job_succeeded('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |

### Return type
bool


## get_pod_names
> get_pod_names(name, namespace=None, master=False, replica_type=None, replica_index=None)

Get pod names of the TFJob.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get_pod_names('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |
master | bool | If `True`, only get the pod with the label `job-role: master`. Defaults to `False`. | Optional |
replica_type | str | One of `worker`, `ps`, or `chief`, to get only pods of that type. By default, pods of all types are returned. | Optional |
replica_index | str | The replica index, to get a single pod of the TFJob. | Optional |

### Return type
set


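To show how the `replica_type` filter might be used, the sketch below groups pod names by replica type. It is written against any object that exposes `get_pod_names` with the signature above, so it does not import the SDK itself; `pods_by_replica_type` is a hypothetical helper, not part of `TFJobClient`.

```python
def pods_by_replica_type(client, name, namespace=None,
                         replica_types=("worker", "ps", "chief")):
    """Map each replica type to the set of pod names reported for it."""
    return {
        replica_type: set(
            client.get_pod_names(name, namespace=namespace,
                                 replica_type=replica_type)
        )
        for replica_type in replica_types
    }


# Usage (assumes the Kubeflow training SDK is installed):
#   from kubeflow.training import TFJobClient
#   pods = pods_by_replica_type(TFJobClient(), 'mnist', namespace='kubeflow')
```
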
## get_logs
> get_logs(name, namespace=None, master=True, replica_type=None, replica_index=None, follow=False)

Get training logs of the TFJob. By default, only the logs of the pod labeled `job-role: master` are returned; to get the logs of all pods, specify `master=False`.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get_logs('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
namespace | str | The TFJob's namespace. Defaults to current or default namespace. | Optional |
master | bool | If `True`, only get logs of the pod with the label `job-role: master`. Defaults to `True`. | Optional |
replica_type | str | One of `worker`, `ps`, or `chief`, to get logs only from pods of that type. By default, pods of all types are included. | Optional |
replica_index | str | The replica index, to get logs from a single pod of the TFJob. | Optional |
follow | bool | Follow the log stream of the pod. Defaults to `False`. | Optional |

### Return type
str
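
The methods above can be combined into a small workflow: wait for the job, check whether it succeeded, then fetch logs from all pods. The sketch below is one possible composition, written against any object that exposes the documented methods; `wait_and_fetch_logs` is a hypothetical helper, and it simply passes through whatever `get_logs` returns.

```python
def wait_and_fetch_logs(client, name, namespace=None, timeout_seconds=600):
    """Wait for the TFJob to finish, then fetch logs from all of its pods."""
    client.wait_for_job(name, namespace=namespace,
                        timeout_seconds=timeout_seconds)
    succeeded = client.is_job_succeeded(name, namespace=namespace)
    # master=False asks for logs from every pod, not only the
    # pod labeled 'job-role: master'.
    logs = client.get_logs(name, namespace=namespace, master=False)
    return succeeded, logs


# Usage (assumes the Kubeflow training SDK is installed):
#   from kubeflow.training import TFJobClient
#   ok, logs = wait_and_fetch_logs(TFJobClient(), 'mnist', namespace='kubeflow')
```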