# TFJobClient

> TFJobClient(config_file=None, context=None, client_configuration=None, persist_config=True)

Loads authentication and cluster information from a kube-config file and stores it in `kubernetes.client.configuration`. The parameters are as follows:

Parameter | Description
------------ | -------------
config_file | Name of the kube-config file. Defaults to `~/.kube/config`. Note that if the SDK is running in a cluster and you want to operate on TFJobs in another, remote cluster, you must set `config_file` explicitly to load the kube-config file, e.g. `TFJobClient(config_file="~/.kube/config")`. |
context | Set the active context. If set to `None`, `current_context` from the config file is used. |
client_configuration | The `kubernetes.client.Configuration` to set configs to. |
persist_config | If `True`, the config file is updated when changed (e.g. GCP token refresh). |


The APIs for TFJobClient are as follows:

Class | Method | Description
------------ | ------------- | -------------
TFJobClient | [create](#create) | Create a TFJob |
TFJobClient | [get](#get) | Get the specified TFJob, or all TFJobs in the namespace |
TFJobClient | [patch](#patch) | Patch the specified TFJob |
TFJobClient | [delete](#delete) | Delete the specified TFJob |
TFJobClient | [wait_for_job](#wait_for_job) | Wait for the specified job to finish |
TFJobClient | [wait_for_condition](#wait_for_condition) | Wait until any of the specified conditions occurs |
TFJobClient | [get_job_status](#get_job_status) | Get the TFJob status |
TFJobClient | [is_job_running](#is_job_running) | Check if the TFJob status is Running |
TFJobClient | [is_job_succeeded](#is_job_succeeded) | Check if the TFJob status is Succeeded |
TFJobClient | [get_pod_names](#get_pod_names) | Get pod names of the TFJob |
TFJobClient | [get_logs](#get_logs) | Get training logs of the TFJob |


## create
> create(tfjob, namespace=None)

Create the provided TFJob in the specified namespace.

### Example

```python
from kubernetes.client import V1PodTemplateSpec
from kubernetes.client import V1ObjectMeta
from kubernetes.client import V1PodSpec
from kubernetes.client import V1Container

from kubeflow.training import V1ReplicaSpec
from kubeflow.training import KubeflowOrgV1TFJob
from kubeflow.training import KubeflowOrgV1TFJobSpec
from kubeflow.training import TFJobClient

# Namespace to deploy the TFJob to (matches the other examples in this document).
namespace = "kubeflow"

# Container that runs the MNIST training script.
container = V1Container(
    name="tensorflow",
    image="gcr.io/kubeflow-ci/tf-mnist-with-summaries:1.0",
    command=[
        "python",
        "/var/tf_mnist/mnist_with_summaries.py",
        "--log_dir=/train/logs", "--learning_rate=0.01",
        "--batch_size=150"
    ]
)

# A single worker replica that is not restarted on failure.
worker = V1ReplicaSpec(
    replicas=1,
    restart_policy="Never",
    template=V1PodTemplateSpec(
        spec=V1PodSpec(
            containers=[container]
        )
    )
)

tfjob = KubeflowOrgV1TFJob(
    api_version="kubeflow.org/v1",
    kind="TFJob",
    metadata=V1ObjectMeta(name="mnist", namespace=namespace),
    spec=KubeflowOrgV1TFJobSpec(
        clean_pod_policy="None",
        tf_replica_specs={"Worker": worker}
    )
)

tfjob_client = TFJobClient()
tfjob_client.create(tfjob, namespace=namespace)
```


### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
tfjob | [KubeflowOrgV1TFJob](KubeflowOrgV1TFJob.md) | The TFJob definition. | Required |
namespace | str | Namespace to deploy the TFJob to. If `namespace` is not given, the namespace from the TFJob definition is used; if that is not set either, the current or default namespace is used. | Optional |

### Return type
object

## get
> get(name=None, namespace=None, watch=False, timeout_seconds=600)

Get the specified TFJob, or all TFJobs, in the specified namespace.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. If `name` is not specified, all TFJobs in the namespace are returned. | Optional |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |
watch | bool | If `True`, watch the TFJob; otherwise return the TFJob object. Watching stops when `timeout_seconds` elapses or once the TFJob status is `Succeeded` or `Failed`. | Optional |
timeout_seconds | int | Timeout in seconds for watching. Defaults to 600. | Optional |

### Return type
object


## patch
> patch(name, tfjob, namespace=None)

Patch the created TFJob in the specified namespace.

Note that the `patch` API may not work if you want to change a field from an existing value to `None`; in that case, use the [replace](#replace) API to remove the field value.

### Example

```python
from kubeflow.training import KubeflowOrgV1TFJob
from kubeflow.training import TFJobClient

tfjob = KubeflowOrgV1TFJob(
    api_version="kubeflow.org/v1",
    # ... update something in the TFJob spec
)

tfjob_client = TFJobClient()
tfjob_client.patch('mnist', tfjob)
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | Required |
tfjob | [KubeflowOrgV1TFJob](KubeflowOrgV1TFJob.md) | The TFJob definition. | Required |
namespace | str | The TFJob's namespace for patching. If `namespace` is not given, the namespace from the TFJob definition is used; if that is not set either, the current or default namespace is used. | Optional |

### Return type
object


## delete
> delete(name, namespace=None)

Delete the created TFJob in the specified namespace.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.delete('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |

### Return type
object


## wait_for_job
> wait_for_job(name, namespace=None, timeout_seconds=600, polling_interval=30, watch=False, status_callback=None)

Wait for the specified job to finish.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.wait_for_job('mnist', namespace='kubeflow')

# The API also supports watching the TFJob status until it is Succeeded or Failed.
tfjob_client.wait_for_job('mnist', namespace='kubeflow', watch=True)
# Sample output:
# NAME    STATE      TIME
# mnist   Created    2019-12-31T09:20:07Z
# mnist   Running    2019-12-31T09:20:19Z
# mnist   Running    2019-12-31T09:20:19Z
# mnist   Succeeded  2019-12-31T09:22:04Z
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |
timeout_seconds | int | How long to wait for the job, in seconds. Defaults to 600. | Optional |
polling_interval | int | How often to poll for the status of the job. | Optional |
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the TFJob. | Optional |
watch | bool | Watch the TFJob if `True`. Watching stops when `timeout_seconds` elapses or once the TFJob status is `Succeeded` or `Failed`. | Optional |

### Return type
object


## wait_for_condition
> wait_for_condition(name, expected_condition, namespace=None, timeout_seconds=600, polling_interval=30, status_callback=None)

Wait until any of the specified conditions occurs.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.wait_for_condition('mnist', expected_condition=["Succeeded", "Failed"], namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
expected_condition | list | A list of conditions. The function waits until any of the supplied conditions is reached. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |
timeout_seconds | int | How long to wait for the job, in seconds. Defaults to 600. | Optional |
polling_interval | int | How often to poll for the status of the job. | Optional |
status_callback | callable | If supplied, this callable is invoked after each poll of the job. It takes a single argument, the TFJob. | Optional |

### Return type
object

## get_job_status
> get_job_status(name, namespace=None)

Returns the TFJob status, such as Running, Failed or Succeeded.
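
As a rough illustration of how this status string might be used, the sketch below polls `get_job_status` until the job reaches `Succeeded` or `Failed`, the two states at which `wait_for_job` stops watching. The helper name `poll_until_finished` and its `interval` and `max_polls` parameters are hypothetical and not part of the SDK; `client` is assumed to be a `TFJobClient`, or any object exposing the same `get_job_status(name, namespace=...)` method.

```python
import time

# Assumption: these are the terminal states, as named in the wait_for_job docs.
TERMINAL_STATES = {"Succeeded", "Failed"}


def poll_until_finished(client, name, namespace=None, interval=30, max_polls=20):
    """Hypothetical helper: poll get_job_status until a terminal state.

    Returns the last status seen, which may be non-terminal if max_polls
    is exhausted first.
    """
    status = None
    for attempt in range(max_polls):
        status = client.get_job_status(name, namespace=namespace)
        if status in TERMINAL_STATES:
            break
        # Sleep between polls, but not after the final attempt.
        if attempt < max_polls - 1:
            time.sleep(interval)
    return status
```

With a real client this could be called as `poll_until_finished(TFJobClient(), 'mnist', namespace='kubeflow')`. In most cases the built-in `wait_for_job` is likely the simpler choice.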

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get_job_status('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |

### Return type
str

## is_job_running
> is_job_running(name, namespace=None)

Returns `True` if the TFJob is running; `False` otherwise.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.is_job_running('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |

### Return type
bool

## is_job_succeeded
> is_job_succeeded(name, namespace=None)

Returns `True` if the TFJob succeeded; `False` otherwise.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.is_job_succeeded('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |

### Return type
bool


## get_pod_names
> get_pod_names(name, namespace=None, master=False, replica_type=None, replica_index=None)

Get pod names of the TFJob.
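
To illustrate the `replica_type` filter, the following sketch collects pod names per replica type. It assumes `client` behaves like `TFJobClient` as documented here. The helper `pods_by_replica_type` and the `REPLICA_TYPES` tuple are hypothetical names introduced for illustration, with the type values taken from the `replica_type` parameter description for this method.

```python
# Replica type values as listed in the replica_type parameter description.
REPLICA_TYPES = ("worker", "ps", "chief")


def pods_by_replica_type(client, name, namespace=None):
    """Hypothetical helper: map each replica type to its sorted pod names."""
    result = {}
    for replica_type in REPLICA_TYPES:
        names = client.get_pod_names(
            name, namespace=namespace, replica_type=replica_type
        )
        # get_pod_names is documented to return a set; sort for stable output.
        # Treating a None result as "no pods" is an assumption for robustness.
        result[replica_type] = sorted(names or [])
    return result
```

This issues one API call per replica type, which keeps the grouping explicit at the cost of extra round trips.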

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get_pod_names('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |
master | bool | If `True`, only get the pod with the label 'job-role: master'. | |
replica_type | str | One of 'worker, ps, chief', to get pods of only that type. By default, pods of all types are returned. | |
replica_index | str | The replica index, to get a single pod of the TFJob. | |

### Return type
set


## get_logs
> get_logs(name, namespace=None, master=True, replica_type=None, replica_index=None, follow=False)

Get training logs of the TFJob. By default, only the logs of the pod with the label 'job-role: master' are returned; to get the logs of all pods, specify `master=False`.

### Example

```python
from kubeflow.training import TFJobClient

tfjob_client = TFJobClient()
tfjob_client.get_logs('mnist', namespace='kubeflow')
```

### Parameters
Name | Type | Description | Notes
------------ | ------------- | ------------- | -------------
name | str | The TFJob name. | |
namespace | str | The TFJob's namespace. Defaults to the current or default namespace. | Optional |
master | bool | If `True`, only get the pod with the label 'job-role: master'. | |
replica_type | str | One of 'worker, ps, chief', to get pods of only that type. By default, pods of all types are returned. | |
replica_index | str | The replica index, to get a single pod of the TFJob. | |
follow | bool | Follow the log stream of the pod. Defaults to `False`. | |

### Return type
str
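
Putting several of these calls together, the sketch below waits for a job and then fetches its logs only if it succeeded. It relies only on the method signatures documented above; the helper name `logs_if_succeeded` is hypothetical, and `client` is assumed to be a `TFJobClient` connected to a cluster where the job exists.

```python
def logs_if_succeeded(client, name, namespace=None, timeout_seconds=600):
    """Hypothetical helper: wait for the job, then return its logs.

    Returns None when the job did not succeed.
    """
    # Blocks until the job finishes, waiting at most timeout_seconds.
    client.wait_for_job(name, namespace=namespace, timeout_seconds=timeout_seconds)
    if not client.is_job_succeeded(name, namespace=namespace):
        return None
    # master=True is the documented default: only the 'job-role: master' pod.
    return client.get_logs(name, namespace=namespace, master=True)
```

A call might look like `logs_if_succeeded(TFJobClient(), 'mnist', namespace='kubeflow')`. Passing the client in as an argument keeps the helper easy to exercise with a stand-in object when no cluster is available.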