# PyTorchJobClient

> PyTorchJobClient(config_file=None, context=None, client_configuration=None, persist_config=True)

Loads authentication and cluster information from a kube-config file and stores them in `kubernetes.client.configuration`. The parameters are as follows:

Parameter | Description
------------ | -------------
config_file | Name of the kube-config file. Defaults to `~/.kube/config`. If the SDK runs inside a cluster and you want to operate on PyTorchJobs in a different, remote cluster, you must set `config_file` explicitly to load that cluster's kube-config file, e.g. `PyTorchJobClient(config_file="~/.kube/config")`.
context | The active context to set. If `None`, `current_context` from the config file is used.
client_configuration | The `kubernetes.client.Configuration` to set configs to.
persist_config | If `True`, the config file is updated when it changes (e.g. on GCP token refresh).

The APIs for PyTorchJobClient are as follows:

Class | Method | Description
------------ | ------------- | -------------
PyTorchJobClient | [create](#create) | Create a PyTorchJob
PyTorchJobClient | [get](#get) | Get the specified PyTorchJob, or all PyTorchJobs in the namespace
PyTorchJobClient | [patch](#patch) | Patch the specified PyTorchJob
PyTorchJobClient | [delete](#delete) | Delete the specified PyTorchJob
PyTorchJobClient | [wait_for_job](#wait_for_job) | Wait for the specified job to finish
PyTorchJobClient | [wait_for_condition](#wait_for_condition) | Wait until any of the specified conditions occurs
PyTorchJobClient | [get_job_status](#get_job_status) | Get the PyTorchJob status
PyTorchJobClient | [is_job_running](#is_job_running) | Check whether the PyTorchJob is running
PyTorchJobClient | [is_job_succeeded](#is_job_succeeded) | Check whether the PyTorchJob has succeeded
PyTorchJobClient | [get_pod_names](#get_pod_names) | Get the pod names of the PyTorchJob
PyTorchJobClient | [get_logs](#get_logs) | Get the training logs of the PyTorchJob

## create
> create(pytorchjob, namespace=None)

Create the provided pytorchjob in the specified namespace.

### Example

```python
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container
from kubernetes.client import V1ResourceRequirements

from kubeflow.training import constants
from kubeflow.training import utils
from kubeflow.training import V1ReplicaSpec
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import KubeflowOrgV1PyTorchJobSpec
from kubeflow.training import PyTorchJobClient

# The same container definition is shared by the master and worker replicas.
container = V1Container(
    name="pytorch",
    image="gcr.io/kubeflow-ci/pytorch-dist-mnist-test:v1.0",
    args=["--backend", "gloo"],
)

master = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

worker = V1ReplicaSpec(
    replicas=1,
    restart_policy="OnFailure",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    kind="PyTorchJob",
    metadata=V1ObjectMeta(name="mnist", namespace="default"),
    spec=KubeflowOrgV1PyTorchJobSpec(
        clean_pod_policy="None",
        pytorch_replica_specs={"Master": master,
                               "Worker": worker}
    )
)

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.create(pytorchjob)
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
pytorchjob | [KubeflowOrgV1PyTorchJob](KubeflowOrgV1PyTorchJob.md) | The pytorchjob definition. | Required
namespace | str | Namespace to deploy the pytorchjob to. If `namespace` is not set, the namespace from the pytorchjob definition is used; if that is not set either, the current or default namespace is used. | Optional

### Return type
object

## get
> get(name=None, namespace=None, watch=False, timeout_seconds=600)

Get the created pytorchjob in the specified namespace.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The pytorchjob name. If `name` is not specified, all pytorchjobs in the namespace are returned. | Optional
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional
watch | bool | Watch the created pytorchjob if `True`; otherwise return the created pytorchjob object. Watching stops when the optional `timeout_seconds` elapses or once the PyTorchJob status is `Succeeded` or `Failed`. | Optional
timeout_seconds | int | Timeout in seconds for watching. Defaults to 600. | Optional

### Return type
object

## patch
> patch(name, pytorchjob, namespace=None)

Patch the created pytorchjob in the specified namespace.

Note that setting a field from an existing value to `None` may not work with the `patch` API; use the [replace](#replace) API to remove the field value.

### Example

```python
from kubeflow.training import KubeflowOrgV1PyTorchJob
from kubeflow.training import PyTorchJobClient

pytorchjob = KubeflowOrgV1PyTorchJob(
    api_version="kubeflow.org/v1",
    # ... update something in the PyTorchJob spec
)

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.patch('mnist', pytorchjob)
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The pytorchjob name. | Required
pytorchjob | [KubeflowOrgV1PyTorchJob](KubeflowOrgV1PyTorchJob.md) | The pytorchjob definition. | Required
namespace | str | The pytorchjob's namespace for patching. If `namespace` is not set, the namespace from the pytorchjob definition is used; if that is not set either, the current or default namespace is used. | Optional

### Return type
object

## delete
> delete(name, namespace=None)

Delete the created pytorchjob in the specified namespace.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.delete('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The pytorchjob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional

### Return type
object

## wait_for_job
> wait_for_job(name, namespace=None, watch=False, timeout_seconds=600, polling_interval=30, status_callback=None)

Wait for the specified job to finish.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.wait_for_job('mnist', namespace='kubeflow')

# The API also supports watching the PyTorchJob status until it is Succeeded or Failed.
pytorchjob_client.wait_for_job('mnist', namespace='kubeflow', watch=True)
# Sample output:
# NAME                           STATE                TIME
# pytorch-dist-mnist-gloo        Created              2020-01-02T09:21:22Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Running              2020-01-02T09:21:36Z
# pytorch-dist-mnist-gloo        Succeeded            2020-01-02T09:26:38Z
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional
watch | bool | Watch the PyTorchJob if `True`. Watching stops when the optional `timeout_seconds` elapses or once the PyTorchJob status is `Succeeded` or `Failed`. | Optional
timeout_seconds | int | How long to wait for the job. Defaults to 600 seconds. | Optional
polling_interval | int | How often to poll for the status of the job. Defaults to 30. | Optional
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the pytorchjob. | Optional

### Return type
object

## wait_for_condition
> wait_for_condition(name, expected_condition, namespace=None, timeout_seconds=600, polling_interval=30, status_callback=None)

Waits until any of the specified conditions occurs.
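
The waiting behaviour described above can be pictured as a plain polling loop. The following is a minimal sketch under stated assumptions, not the SDK's implementation: `poll_until` and `fetch_status` are hypothetical names, and `fetch_status` stands in for whatever lookup returns the job's current condition. Unlike the documented `status_callback`, which receives the pytorchjob, the callback here simply receives the value returned by `fetch_status`.

```python
import time


def poll_until(fetch_status, expected_conditions, timeout_seconds=600,
               polling_interval=30, status_callback=None):
    """Poll fetch_status() until it returns one of expected_conditions.

    fetch_status is a hypothetical zero-argument callable standing in for a
    status lookup; this loop only illustrates the idea of "wait until any of
    the supplied conditions is reached".
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = fetch_status()
        if status_callback is not None:
            status_callback(status)  # invoked after every poll
        if status in expected_conditions:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"none of {list(expected_conditions)} reached within "
                f"{timeout_seconds}s; last status was {status!r}"
            )
        time.sleep(polling_interval)
```

Passing several conditions, such as `["Succeeded", "Failed"]`, makes the loop return on whichever is observed first, which is why a terminal failure does not have to wait for the full timeout.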

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.wait_for_condition('mnist', expected_condition=["Succeeded", "Failed"], namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
expected_condition | list | A list of conditions. The function waits until any of the supplied conditions is reached. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional
timeout_seconds | int | How long to wait for the job. Defaults to 600 seconds. | Optional
polling_interval | int | How often to poll for the status of the job. Defaults to 30. | Optional
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the pytorchjob. | Optional

### Return type
object

## get_job_status
> get_job_status(name, namespace=None)

Returns the PyTorchJob status, such as Running, Failed or Succeeded.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get_job_status('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional

### Return type
str

## is_job_running
> is_job_running(name, namespace=None)

Returns `True` if the PyTorchJob is running; `False` otherwise.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.is_job_running('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional

### Return type
bool

## is_job_succeeded
> is_job_succeeded(name, namespace=None)

Returns `True` if the PyTorchJob succeeded; `False` otherwise.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.is_job_succeeded('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional

### Return type
bool

## get_pod_names
> get_pod_names(name, namespace=None, master=False, replica_type=None, replica_index=None)

Get the pod names of the PyTorchJob.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get_pod_names('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional
master | bool | If `True`, only get the pod with the label 'job-role: master'. Defaults to `False`. | Optional
replica_type | str | One of 'master' or 'worker', to get only pods of that type. By default, pods of all types are returned. | Optional
replica_index | str | The replica index, to get a single pod of the PyTorchJob. | Optional

### Return type
set

## get_logs
> get_logs(name, namespace=None, master=True, replica_type=None, replica_index=None, follow=False)

Get the training logs of the PyTorchJob. By default, only the logs of the pod with the label 'job-role: master' are returned; to get the logs of all pods, specify `master=False`.

### Example

```python
from kubeflow.training import PyTorchJobClient

pytorchjob_client = PyTorchJobClient()
pytorchjob_client.get_logs('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The PyTorchJob name. | Required
namespace | str | The pytorchjob's namespace. Defaults to current or default namespace. | Optional
master | bool | If `True`, only get the logs of the pod with the label 'job-role: master'. Defaults to `True`. | Optional
replica_type | str | One of 'master' or 'worker', to get only pods of that type. By default, pods of all types are returned. | Optional
replica_index | str | The replica index, to get a single pod of the PyTorchJob. | Optional
follow | bool | Follow the log stream of the pod. Defaults to `False`. | Optional

### Return type
str
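
The methods documented here are typically chained: create a job, wait for it, read its logs, then clean up. The sketch below strings those calls together using only the signatures listed in this document. The helper name `run_and_collect_logs` is hypothetical and not part of the SDK, and it assumes that `get_logs` returns the log text as the return type above states.

```python
def run_and_collect_logs(client, pytorchjob, name, namespace="kubeflow",
                         timeout_seconds=600):
    """Create a job, wait for it to finish, and return (succeeded, logs).

    `client` is any object exposing the PyTorchJobClient methods documented
    above. This helper is a hypothetical illustration, not an SDK function.
    """
    client.create(pytorchjob, namespace=namespace)
    try:
        # Block until the job finishes or the timeout elapses.
        client.wait_for_job(name, namespace=namespace,
                            timeout_seconds=timeout_seconds)
        succeeded = client.is_job_succeeded(name, namespace=namespace)
        # Assumption: get_logs returns the master pod's logs as a string.
        logs = client.get_logs(name, namespace=namespace)
        return succeeded, logs
    finally:
        # Delete the job whether or not waiting and log retrieval worked.
        client.delete(name, namespace=namespace)
```

With a real cluster, this could be called as `run_and_collect_logs(PyTorchJobClient(), pytorchjob, "mnist")`, where `pytorchjob` is built as in the `create` example. Deleting in a `finally` block is a design choice for throwaway runs; drop it if the job should remain for later inspection.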