github.com/kubeflow/training-operator@v1.7.0/sdk/python/docs/PyTorchJobClient.md

# PyTorchJobClient

> PyTorchJobClient(config_file=None, context=None, client_configuration=None, persist_config=True)

Loads authentication and cluster information from a kube-config file and stores them in kubernetes.client.configuration. The parameters are as follows:

Parameter | Description
------------ | -------------
config_file | Name of the kube-config file. Defaults to `~/.kube/config`. If the SDK is running inside a cluster and you want to operate on PyTorchJobs in another, remote cluster, you must set `config_file` explicitly to load that kube-config file, e.g. `PyTorchJobClient(config_file="~/.kube/config")`. |
context | Sets the active context. If set to None, current_context from the config file is used. |
client_configuration | The kubernetes.client.Configuration to set configs to. |
persist_config | If True, the config file is updated when it changes (e.g. GCP token refresh). |

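As a rough illustration of the `context` fallback described in the table, the sketch below picks a context from an already parsed kube-config mapping. `select_context` and the `sample` data are hypothetical and exist only for this example; the SDK's own loading logic is not shown in this document and may differ.

```python
from typing import Optional


def select_context(kubeconfig: dict, context: Optional[str] = None) -> dict:
    """Return the context entry to use from a parsed kube-config mapping.

    Hypothetical helper for illustration; not part of the SDK.
    """
    # Fall back to the file's current-context when no context is given.
    wanted = context if context is not None else kubeconfig.get("current-context")
    for entry in kubeconfig.get("contexts", []):
        if entry.get("name") == wanted:
            return entry
    raise KeyError(f"context {wanted!r} not found in kube-config")


# A tiny in-memory stand-in for a parsed ~/.kube/config file (made-up values).
sample = {
    "current-context": "dev",
    "contexts": [
        {"name": "dev", "context": {"cluster": "dev-cluster"}},
        {"name": "prod", "context": {"cluster": "prod-cluster"}},
    ],
}

print(select_context(sample)["name"])          # dev
print(select_context(sample, "prod")["name"])  # prod
```

Passing `context=None` therefore behaves like "use whatever the file says is current", while an explicit name overrides it.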
The APIs for PyTorchJobClient are as follows:

Class | Method | Description
------------ | ------------- | -------------
PyTorchJobClient | [create](#create) | Create a PyTorchJob |
PyTorchJobClient | [get](#get) | Get the specified PyTorchJob or all PyTorchJobs in the namespace |
PyTorchJobClient | [patch](#patch) | Patch the specified PyTorchJob |
PyTorchJobClient | [delete](#delete) | Delete the specified PyTorchJob |
PyTorchJobClient | [wait_for_job](#wait_for_job) | Wait for the specified job to finish |
PyTorchJobClient | [wait_for_condition](#wait_for_condition) | Wait until any of the specified conditions occurs |
PyTorchJobClient | [get_job_status](#get_job_status) | Get the PyTorchJob status |
PyTorchJobClient | [is_job_running](#is_job_running) | Check if the PyTorchJob is running |
PyTorchJobClient | [is_job_succeeded](#is_job_succeeded) | Check if the PyTorchJob has succeeded |
PyTorchJobClient | [get_pod_names](#get_pod_names) | Get pod names of the PyTorchJob |
PyTorchJobClient | [get_logs](#get_logs) | Get training logs of the PyTorchJob |

## create
> create(pytorchjob, namespace=None)

Create the provided pytorchjob in the specified namespace.

### Example

```python
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1ResourceRequirements

from kubeflow.training import constants
from kubeflow.training import utils
from kubeflow.training import V1ReplicaSpec
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import KubeflowOrgV1PyTorchJobSpec
from kubeflow.training import PyTorchJobClient

# Container shared by the master and worker replicas.
container = V1Container(
    name="pytorch",
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    args=["--backend", "gloo"],
)

master = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

worker = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name="mnist", namespace='default'),
    spec=KubeflowOrgV1PyTorchJobSpec(
        clean_pod_policy="None",
        pytorch_replica_specs={"Master": master,
                               "Worker": worker}
    )
)

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.create(pytorchjob)
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
pytorchjob | [KubeflowOrgV1PyTorchJob](KubeflowOrgV1PyTorchJob.md) | pytorchjob definition | Required |
namespace | str | Namespace to deploy the pytorchjob to. If `namespace` is not given, the namespace from the pytorchjob definition is used; if that is not set either, the current or default namespace is used. | Optional |

### Return type
object

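The `namespace` fallback order described for `create` can be sketched in plain Python. `resolve_namespace` is a hypothetical helper written for this illustration, and it assumes the job is available as a plain dict; how the SDK actually determines the "current" namespace is not covered here.

```python
from typing import Optional


def resolve_namespace(
    namespace: Optional[str],
    job: dict,
    current_namespace: Optional[str] = None,
) -> str:
    """Pick a namespace using the documented order of precedence.

    Hypothetical helper for illustration; not part of the SDK.
    """
    # 1. An explicit argument wins.
    if namespace:
        return namespace
    # 2. Otherwise use the namespace in the job definition, if any.
    from_job = (job.get("metadata") or {}).get("namespace")
    if from_job:
        return from_job
    # 3. Otherwise fall back to the current namespace, then to "default".
    return current_namespace or "default"


job = {"metadata": {"name": "mnist", "namespace": "team-a"}}
print(resolve_namespace("kubeflow", job))           # kubeflow
print(resolve_namespace(None, job))                 # team-a
print(resolve_namespace(None, {"metadata": {}}))    # default
```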
## get
> get(name=None, namespace=None, watch=False, timeout_seconds=600)

Get the created pytorchjob in the specified namespace.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | pytorchjob name. If `name` is not specified, all pytorchjobs in the namespace are returned. | Optional |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |
watch | bool | If `True`, watch the created pytorchjob; otherwise return the created pytorchjob object. Watching stops when the optional `timeout_seconds` elapses or once the PyTorchJob status is `Succeeded` or `Failed`. | Optional |
timeout_seconds | int | Timeout in seconds for watching. Defaults to 600. | Optional |

### Return type
object

## patch
> patch(name, pytorchjob, namespace=None)

Patch the created pytorchjob in the specified namespace.

Note that if you want to change a field from an existing value to `None`, the `patch` API may not work; you need to use the [replace](#replace) API to remove the field value.

### Example

```python
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import PyTorchJobClient

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    # ... update something in the PyTorchJob spec
)

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.patch('mnist', pytorchjob)
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | pytorchjob name | Required |
pytorchjob | [KubeflowOrgV1PyTorchJob](KubeflowOrgV1PyTorchJob.md) | pytorchjob definition | Required |
namespace | str | The pytorchjob's namespace for patching. If `namespace` is not given, the namespace from the pytorchjob definition is used; if that is not set either, the current or default namespace is used. | Optional |

### Return type
object

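The note at the top of this section about `None` values can be illustrated with a small, self-contained sketch. It rests on two assumptions that this document does not confirm: that `None` fields are dropped when the object is serialized, and that the patch is applied with JSON-merge-patch-like semantics. `drop_none`, `merge_patch`, and the field names `fieldA` and `fieldB` are hypothetical placeholders.

```python
import copy


def drop_none(value):
    """Recursively remove None-valued dict keys, as a serializer might."""
    if isinstance(value, dict):
        return {k: drop_none(v) for k, v in value.items() if v is not None}
    if isinstance(value, list):
        return [drop_none(v) for v in value]
    return value


def merge_patch(target, patch):
    """Apply a merge-style patch: a null deletes a key, dicts merge recursively."""
    if not isinstance(patch, dict):
        return copy.deepcopy(patch)
    result = copy.deepcopy(target) if isinstance(target, dict) else {}
    for key, value in patch.items():
        if value is None:
            result.pop(key, None)
        else:
            result[key] = merge_patch(result.get(key), value)
    return result


existing = {"spec": {"fieldA": "keep", "fieldB": 60}}
desired = {"spec": {"fieldB": None}}

# If the None is dropped before sending, the patch is empty and nothing changes.
print(merge_patch(existing, drop_none(desired)))
# An explicit null in the patch body would remove the field.
print(merge_patch(existing, desired))
```

Under those assumptions, clearing a field requires sending the whole object rather than a patch, which is consistent with the advice in the note.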
## delete
> delete(name, namespace=None)

Delete the created pytorchjob in the specified namespace.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.delete('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | pytorchjob name | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |

### Return type
object

## wait_for_job
> wait_for_job(name,
>              namespace=None,
>              watch=False,
>              timeout_seconds=600,
>              polling_interval=30,
>              status_callback=None)

Wait for the specified job to finish.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.wait_for_job('mnist', namespace='kubeflow')

# The API also supports watching the PyTorchJob status until it is Succeeded or Failed.
pytorchjob_client.wait_for_job('mnist', namespace='kubeflow', watch=True)
# Example output:
# NAME                           STATE                TIME
# pytorch-dist-mnist-gloo        Created              2020-01-02T09:21:22Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Succeeded            2020-01-02T09:26:38Z
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |
watch | bool | Watch the PyTorchJob if `True`. Watching stops when the optional `timeout_seconds` elapses or once the PyTorchJob status is `Succeeded` or `Failed`. | Optional |
timeout_seconds | int | How long to wait for the job, in seconds. Defaults to 600. | Optional |
polling_interval | int | How often to poll for the status of the job. Defaults to 30. | Optional |
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the pytorchjob. | Optional |

### Return type
object

## wait_for_condition
> wait_for_condition(name,
>                    expected_condition,
>                    namespace=None,
>                    timeout_seconds=600,
>                    polling_interval=30,
>                    status_callback=None)

Waits until any of the specified conditions occurs.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.wait_for_condition('mnist', expected_condition=["Succeeded", "Failed"], namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
expected_condition | list | A list of conditions. The function waits until any of the supplied conditions is reached. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |
timeout_seconds | int | How long to wait for the job, in seconds. Defaults to 600. | Optional |
polling_interval | int | How often to poll for the status of the job. Defaults to 30. | Optional |
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the pytorchjob. | Optional |

### Return type
object

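To show how `timeout_seconds`, `polling_interval`, and `status_callback` interact, here is a minimal polling-loop sketch that runs without a cluster. `poll_until` and `fake_get_job` are hypothetical names, the job is modelled as a plain dict, and the "latest condition" check is an assumption; this is not the SDK's implementation.

```python
import time
from typing import Callable, Iterable, Optional


def poll_until(
    get_job: Callable[[], dict],
    expected_condition: Iterable[str],
    timeout_seconds: float = 600,
    polling_interval: float = 30,
    status_callback: Optional[Callable[[dict], None]] = None,
) -> dict:
    """Poll get_job() until its latest condition type is in expected_condition.

    Hypothetical helper for illustration; not part of the SDK.
    """
    expected = set(expected_condition)
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = get_job()
        # Invoke the callback after every poll, passing the job itself.
        if status_callback is not None:
            status_callback(job)
        conditions = (job.get("status") or {}).get("conditions") or []
        if conditions and conditions[-1].get("type") in expected:
            return job
        if time.monotonic() >= deadline:
            raise TimeoutError(f"none of {sorted(expected)} reached in time")
        time.sleep(polling_interval)


# Simulated job that moves through three states, one per poll.
states = iter(["Created", "Running", "Succeeded"])
seen = []


def fake_get_job() -> dict:
    return {"status": {"conditions": [{"type": next(states)}]}}


final = poll_until(
    fake_get_job,
    expected_condition=["Succeeded", "Failed"],
    timeout_seconds=5,
    polling_interval=0.01,
    status_callback=lambda job: seen.append(job["status"]["conditions"][-1]["type"]),
)
print(seen)  # ['Created', 'Running', 'Succeeded']
```

A callback like the one above is a convenient place to log progress between polls.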
## get_job_status
> get_job_status(name, namespace=None)

Returns the PyTorchJob status, such as Running, Failed or Succeeded.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get_job_status('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |

### Return type
str

## is_job_running
> is_job_running(name, namespace=None)

Returns True if the PyTorchJob is running; False otherwise.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.is_job_running('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |

### Return type
bool

## is_job_succeeded
> is_job_succeeded(name, namespace=None)

Returns True if the PyTorchJob succeeded; False otherwise.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.is_job_succeeded('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |

### Return type
bool

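One way to think about how `get_job_status`, `is_job_running`, and `is_job_succeeded` relate is that the boolean checks are comparisons against the status string. The sketch below assumes the job is a plain dict and that the status is the type of the most recent entry in `status.conditions`; both are assumptions for illustration, and `job_status`, `is_running`, and `is_succeeded` are hypothetical names.

```python
def job_status(job: dict) -> str:
    """Return the type of the latest condition, or "" if there is none."""
    conditions = (job.get("status") or {}).get("conditions") or []
    return conditions[-1].get("type", "") if conditions else ""


def is_running(job: dict) -> bool:
    return job_status(job).lower() == "running"


def is_succeeded(job: dict) -> bool:
    return job_status(job).lower() == "succeeded"


job = {"status": {"conditions": [{"type": "Created"}, {"type": "Running"}]}}
print(job_status(job))    # Running
print(is_running(job))    # True
print(is_succeeded(job))  # False
```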
## get_pod_names
> get_pod_names(name, namespace=None, master=False, replica_type=None, replica_index=None)

Get pod names of the PyTorchJob.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get_pod_names('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |
master | bool | Only get pods with the label 'job-role: master' if True. | |
replica_type | str | One of 'master' or 'worker', to get only pods of that type. By default, pods of all types are returned. | |
replica_index | str | Replica index, to get a single pod of the PyTorchJob. | |

### Return type
set

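The `master`, `replica_type`, and `replica_index` filters can be pictured as terms in a Kubernetes label selector. In the sketch below only the `job-role` label comes from this document; the `job-name`, `replica-type`, and `replica-index` keys, and the `build_label_selector` helper itself, are hypothetical placeholders rather than the labels the operator necessarily uses.

```python
from typing import Optional


def build_label_selector(
    name: str,
    master: bool = False,
    replica_type: Optional[str] = None,
    replica_index: Optional[str] = None,
) -> str:
    """Build a comma-separated equality selector from the filter arguments.

    Hypothetical helper; label keys other than job-role are assumptions.
    """
    labels = {"job-name": name}
    if master:
        labels["job-role"] = "master"
    if replica_type is not None:
        labels["replica-type"] = replica_type.lower()
    if replica_index is not None:
        labels["replica-index"] = str(replica_index)
    # Equality-based selectors are joined with commas (logical AND).
    return ",".join(f"{key}={value}" for key, value in labels.items())


print(build_label_selector("mnist"))
# job-name=mnist
print(build_label_selector("mnist", master=True))
# job-name=mnist,job-role=master
print(build_label_selector("mnist", replica_type="Worker", replica_index="0"))
# job-name=mnist,replica-type=worker,replica-index=0
```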
## get_logs
> get_logs(name, namespace=None, master=True, replica_type=None, replica_index=None, follow=False)

Get training logs of the PyTorchJob. By default, only the logs of the pod labeled 'job-role: master' are returned; to get the logs of all pods, specify `master=False`.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get_logs('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | |
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional |
master | bool | Only get pods with the label 'job-role: master' if True. | |
replica_type | str | One of 'master' or 'worker', to get only pods of that type. By default, pods of all types are returned. | |
replica_index | str | Replica index, to get a single pod of the PyTorchJob. | |
follow | bool | Follow the log stream of the pod. Defaults to False. | |

### Return type
str