### Distributed XGBoost Job train and prediction

This folder contains the files for distributed XGBoost training and prediction. The demo uses the
[Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris), a well-known multi-class classification dataset,
so the distributed XGBoost job here solves a multi-class classification problem.
You can extend the provided data reader to read data from distributed storage such as HDFS, HBase, or Hive.


**Build image**

The default image name is `kubeflow/xgboost-dist-iris-test` and the default tag is `1.1`. The commands below tag the image as `1.0`; make sure the tag you build and push matches the `image` field in your job YAML files.

```shell
docker build -f Dockerfile -t kubeflow/xgboost-dist-iris-test:1.0 ./
```

Then push the Docker image to your repository:
```shell
docker push kubeflow/xgboost-dist-iris-test:1.0
```

**Configure the job runtime via YAML file**

The following files are available to set up the distributed XGBoost computation runtime.

To store the model in OSS:

* xgboostjob_v1alpha1_iris_train.yaml
* xgboostjob_v1alpha1_iris_predict.yaml

To store the model in a local path:

* xgboostjob_v1alpha1_iris_train_local.yaml
* xgboostjob_v1alpha1_iris_predict_local.yaml

For jobs that store the model in [OSS](https://www.alibabacloud.com/product/oss), configure xgboostjob_v1alpha1_iris_train.yaml and xgboostjob_v1alpha1_iris_predict.yaml, and remember to fill in the OSS parameter in both files.
The OSS parameter includes account information such as access_id, access_key, access_bucket and endpoint.
For example:
`--oss_param=endpoint:http://oss-ap-south-1.aliyuncs.com,access_id:XXXXXXXXXXX,access_key:XXXXXXXXXXXXXXXXXXX,access_bucket:XXXXXX`

Similarly, xgboostjob_v1alpha1_iris_predict.yaml is used to configure the XGBoost batch prediction job.


**Start the distributed XGBoost train to store the model in OSS**
```
kubectl create -f xgboostjob_v1alpha1_iris_train.yaml
```

**Look at the train job status**
```
kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-train
```
Here is a sample output when the job is finished (the sample below is in the format printed by `kubectl describe`):
```
Name:         xgboost-dist-iris-test
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  xgboostjob.kubeflow.org/v1alpha1
Kind:         XGBoostJob
Metadata:
  Creation Timestamp:  2019-06-27T01:16:09Z
  Generation:          9
  Resource Version:    385834
  Self Link:           /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test
  UID:                 2565e99a-9879-11e9-bbab-080027dfbfe2
Spec:
  Run Policy:
    Clean Pod Policy:  None
  Xgb Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              --job_type=Train
              --xgboost_parameter=objective:multi:softprob,num_class:3
              --n_estimators=10
              --learning_rate=0.1
              --model_path=autoAI/xgb-opt/2
              --model_storage_type=oss
              --oss_param=unknown
            Image:              docker.io/merlintang/xgboost-dist-iris:1.1
            Image Pull Policy:  Always
            Name:               xgboostjob
            Ports:
              Container Port:  9991
              Name:            xgboostjob-port
            Resources:
    Worker:
      Replicas:        2
      Restart Policy:  ExitCode
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              --job_type=Train
              --xgboost_parameter="objective:multi:softprob,num_class:3"
              --n_estimators=10
              --learning_rate=0.1
              --model_path="/tmp/xgboost_model"
              --model_storage_type=oss
            Image:              docker.io/merlintang/xgboost-dist-iris:1.1
            Image Pull Policy:  Always
            Name:               xgboostjob
            Ports:
              Container Port:  9991
              Name:            xgboostjob-port
            Resources:
Status:
  Completion Time:  2019-06-27T01:17:04Z
  Conditions:
    Last Transition Time:  2019-06-27T01:16:09Z
    Last Update Time:      2019-06-27T01:16:09Z
    Message:               xgboostJob xgboost-dist-iris-test is created.
    Reason:                XGBoostJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2019-06-27T01:16:09Z
    Last Update Time:      2019-06-27T01:16:09Z
    Message:               XGBoostJob xgboost-dist-iris-test is running.
    Reason:                XGBoostJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2019-06-27T01:17:04Z
    Last Update Time:      2019-06-27T01:17:04Z
    Message:               XGBoostJob xgboost-dist-iris-test is successfully completed.
    Reason:                XGBoostJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Master:
      Succeeded:  1
    Worker:
      Succeeded:  2
Events:
  Type    Reason                   Age                From                 Message
  ----    ------                   ----               ----                 -------
  Normal  SuccessfulCreatePod      102s               xgboostjob-operator  Created pod: xgboost-dist-iris-test-master-0
  Normal  SuccessfulCreateService  102s               xgboostjob-operator  Created service: xgboost-dist-iris-test-master-0
  Normal  SuccessfulCreatePod      102s               xgboostjob-operator  Created pod: xgboost-dist-iris-test-worker-1
  Normal  SuccessfulCreateService  102s               xgboostjob-operator  Created service: xgboost-dist-iris-test-worker-0
  Normal  SuccessfulCreateService  102s               xgboostjob-operator  Created service: xgboost-dist-iris-test-worker-1
  Normal  SuccessfulCreatePod      64s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-worker-0
  Normal  ExitedWithCode           47s (x3 over 49s)  xgboostjob-operator  Pod: default.xgboost-dist-iris-test-worker-1 exited with code 0
  Normal  ExitedWithCode           47s                xgboostjob-operator  Pod: default.xgboost-dist-iris-test-master-0 exited with code 0
  Normal  XGBoostJobSucceeded      47s                xgboostjob-operator  XGBoostJob xgboost-dist-iris-test is successfully completed.
```

**Start the distributed XGBoost job predict**
```shell
kubectl create -f xgboostjob_v1alpha1_iris_predict.yaml
```

**Look at the batch predict job status**
```
kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-predict
```
Here is a sample output when the job is finished (the sample below is in the format printed by `kubectl describe`):
```
Name:         xgboost-dist-iris-test-predict
Namespace:    default
Labels:       <none>
Annotations:  <none>
API Version:  xgboostjob.kubeflow.org/v1alpha1
Kind:         XGBoostJob
Metadata:
  Creation Timestamp:  2019-06-27T06:06:53Z
  Generation:          8
  Resource Version:    394523
  Self Link:           /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-predict
  UID:                 c2a04cbc-98a1-11e9-bbab-080027dfbfe2
Spec:
  Run Policy:
    Clean Pod Policy:  None
  Xgb Replica Specs:
    Master:
      Replicas:        1
      Restart Policy:  Never
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              --job_type=Predict
              --model_path=autoAI/xgb-opt/3
              --model_storage_type=oss
              --oss_param=unkown
            Image:              docker.io/merlintang/xgboost-dist-iris:1.1
            Image Pull Policy:  Always
            Name:               xgboostjob
            Ports:
              Container Port:  9991
              Name:            xgboostjob-port
            Resources:
    Worker:
      Replicas:        2
      Restart Policy:  ExitCode
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              --job_type=Predict
              --model_path=autoAI/xgb-opt/3
              --model_storage_type=oss
              --oss_param=unkown
            Image:              docker.io/merlintang/xgboost-dist-iris:1.1
            Image Pull Policy:  Always
            Name:               xgboostjob
            Ports:
              Container Port:  9991
              Name:            xgboostjob-port
            Resources:
Status:
  Completion Time:  2019-06-27T06:07:02Z
  Conditions:
    Last Transition Time:  2019-06-27T06:06:53Z
    Last Update Time:      2019-06-27T06:06:53Z
    Message:               xgboostJob xgboost-dist-iris-test-predict is created.
    Reason:                XGBoostJobCreated
    Status:                True
    Type:                  Created
    Last Transition Time:  2019-06-27T06:06:53Z
    Last Update Time:      2019-06-27T06:06:53Z
    Message:               XGBoostJob xgboost-dist-iris-test-predict is running.
    Reason:                XGBoostJobRunning
    Status:                False
    Type:                  Running
    Last Transition Time:  2019-06-27T06:07:02Z
    Last Update Time:      2019-06-27T06:07:02Z
    Message:               XGBoostJob xgboost-dist-iris-test-predict is successfully completed.
    Reason:                XGBoostJobSucceeded
    Status:                True
    Type:                  Succeeded
  Replica Statuses:
    Master:
      Succeeded:  1
    Worker:
      Succeeded:  2
Events:
  Type    Reason                   Age                From                 Message
  ----    ------                   ----               ----                 -------
  Normal  SuccessfulCreatePod      47s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-predict-worker-0
  Normal  SuccessfulCreatePod      47s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-predict-worker-1
  Normal  SuccessfulCreateService  47s                xgboostjob-operator  Created service: xgboost-dist-iris-test-predict-worker-0
  Normal  SuccessfulCreateService  47s                xgboostjob-operator  Created service: xgboost-dist-iris-test-predict-worker-1
  Normal  SuccessfulCreatePod      47s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-predict-master-0
  Normal  SuccessfulCreateService  47s                xgboostjob-operator  Created service: xgboost-dist-iris-test-predict-master-0
  Normal  ExitedWithCode           38s (x3 over 40s)  xgboostjob-operator  Pod: default.xgboost-dist-iris-test-predict-worker-0 exited with code 0
  Normal  ExitedWithCode           38s                xgboostjob-operator  Pod: default.xgboost-dist-iris-test-predict-master-0 exited with code 0
  Normal  XGBoostJobSucceeded      38s                xgboostjob-operator  XGBoostJob xgboost-dist-iris-test-predict is successfully completed.
```

**Start the distributed XGBoost train to store the model locally**

Before training, create a PVC to store the trained model.
To create the PVC, save the following content in a file named pvc.yaml:
```
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xgboostlocal
spec:
  storageClassName: glusterfs
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
```
```
kubectl create -f pvc.yaml
```
Note:

* Use a storage class that supports ReadWriteMany. The example YAML above uses glusterfs.

* Set model_storage_type=local and set model_path accordingly (the example uses /tmp/xgboost_model/2) in xgboostjob_v1alpha1_iris_train_local.yaml and xgboostjob_v1alpha1_iris_predict_local.yaml.

Now start the distributed XGBoost train.
```
kubectl create -f xgboostjob_v1alpha1_iris_train_local.yaml
```

**Look at the train job status**
```
kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-train-local
```
Here is a sample output when the job is finished:
```
apiVersion: xgboostjob.kubeflow.org/v1alpha1
kind: XGBoostJob
metadata:
  creationTimestamp: "2019-09-17T05:36:01Z"
  generation: 7
  name: xgboost-dist-iris-test-train_local
  namespace: default
  resourceVersion: "8919366"
  selfLink: /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-train_local
  uid: 08f85fad-d90d-11e9-aca1-fa163ea13108
spec:
  RunPolicy:
    cleanPodPolicy: None
  xgbReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - --job_type=Train
            - --xgboost_parameter=objective:multi:softprob,num_class:3
            - --n_estimators=10
            - --learning_rate=0.1
            - --model_path=/tmp/xgboost_model/2
            - --model_storage_type=local
            image: docker.io/merlintang/xgboost-dist-iris:1.1
            imagePullPolicy: Always
            name: xgboostjob
            ports:
            - containerPort: 9991
              name: xgboostjob-port
            resources: {}
            volumeMounts:
            - mountPath: /tmp/xgboost_model
              name: task-pv-storage
          volumes:
          - name: task-pv-storage
            persistentVolumeClaim:
              claimName: xgboostlocal
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - --job_type=Train
            - --xgboost_parameter="objective:multi:softprob,num_class:3"
            - --n_estimators=10
            - --learning_rate=0.1
            - --model_path=/tmp/xgboost_model/2
            - --model_storage_type=local
            image: bcmt-registry:5000/kubeflow/xgboost-dist-iris-test:1.0
            imagePullPolicy: Always
            name: xgboostjob
            ports:
            - containerPort: 9991
              name: xgboostjob-port
            resources: {}
            volumeMounts:
            - mountPath: /tmp/xgboost_model
              name: task-pv-storage
          volumes:
          - name: task-pv-storage
            persistentVolumeClaim:
              claimName: xgboostlocal
status:
  completionTime: "2019-09-17T05:37:02Z"
  conditions:
  - lastTransitionTime: "2019-09-17T05:36:02Z"
    lastUpdateTime: "2019-09-17T05:36:02Z"
    message: xgboostJob xgboost-dist-iris-test-train_local is created.
    reason: XGBoostJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2019-09-17T05:36:02Z"
    lastUpdateTime: "2019-09-17T05:36:02Z"
    message: XGBoostJob xgboost-dist-iris-test-train_local is running.
    reason: XGBoostJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2019-09-17T05:37:02Z"
    lastUpdateTime: "2019-09-17T05:37:02Z"
    message: XGBoostJob xgboost-dist-iris-test-train_local is successfully completed.
    reason: XGBoostJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master:
      succeeded: 1
    Worker:
      succeeded: 2
```
**Start the distributed XGBoost job predict**
```
kubectl create -f xgboostjob_v1alpha1_iris_predict_local.yaml
```

**Look at the batch predict job status**
```
kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-predict-local
```
Here is a sample output when the job is finished:
```
apiVersion: xgboostjob.kubeflow.org/v1alpha1
kind: XGBoostJob
metadata:
  creationTimestamp: "2019-09-17T06:33:38Z"
  generation: 6
  name: xgboost-dist-iris-test-predict_local
  namespace: default
  resourceVersion: "8976054"
  selfLink: /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-predict_local
  uid: 151655b0-d915-11e9-aca1-fa163ea13108
spec:
  RunPolicy:
    cleanPodPolicy: None
  xgbReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Never
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - --job_type=Predict
            - --model_path=/tmp/xgboost_model/2
            - --model_storage_type=local
            image: docker.io/merlintang/xgboost-dist-iris:1.1
            imagePullPolicy: Always
            name: xgboostjob
            ports:
            - containerPort: 9991
              name: xgboostjob-port
            resources: {}
            volumeMounts:
            - mountPath: /tmp/xgboost_model
              name: task-pv-storage
          volumes:
          - name: task-pv-storage
            persistentVolumeClaim:
              claimName: xgboostlocal
    Worker:
      replicas: 2
      restartPolicy: ExitCode
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - --job_type=Predict
            - --model_path=/tmp/xgboost_model/2
            - --model_storage_type=local
            image: docker.io/merlintang/xgboost-dist-iris:1.1
            imagePullPolicy: Always
            name: xgboostjob
            ports:
            - containerPort: 9991
              name: xgboostjob-port
            resources: {}
            volumeMounts:
            - mountPath: /tmp/xgboost_model
              name: task-pv-storage
          volumes:
          - name: task-pv-storage
            persistentVolumeClaim:
              claimName: xgboostlocal
status:
  completionTime: "2019-09-17T06:33:51Z"
  conditions:
  - lastTransitionTime: "2019-09-17T06:33:38Z"
    lastUpdateTime: "2019-09-17T06:33:38Z"
    message: xgboostJob xgboost-dist-iris-test-predict_local is created.
    reason: XGBoostJobCreated
    status: "True"
    type: Created
  - lastTransitionTime: "2019-09-17T06:33:38Z"
    lastUpdateTime: "2019-09-17T06:33:38Z"
    message: XGBoostJob xgboost-dist-iris-test-predict_local is running.
    reason: XGBoostJobRunning
    status: "False"
    type: Running
  - lastTransitionTime: "2019-09-17T06:33:51Z"
    lastUpdateTime: "2019-09-17T06:33:51Z"
    message: XGBoostJob xgboost-dist-iris-test-predict_local is successfully completed.
    reason: XGBoostJobSucceeded
    status: "True"
    type: Succeeded
  replicaStatuses:
    Master:
      succeeded: 1
    Worker:
      succeeded: 1
```
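
**Check the job conditions from a script**

In the sample outputs, the `conditions` list under `status` shows whether a job has finished: a finished job reports the `Succeeded` condition with status `"True"`. The following is a minimal sketch of how that check could be scripted, assuming the job is fetched as JSON (for example with `kubectl get -o json` instead of `-o yaml`). The helper name `condition_is_true` and the trimmed sample document are illustrative and are not part of this example's code.

```python
import json


def condition_is_true(job, condition_type):
    """Return True if the job reports `condition_type` with status "True"."""
    # A job that has not been reconciled yet may have no status at all,
    # so fall back to empty containers instead of raising KeyError.
    conditions = job.get("status", {}).get("conditions", [])
    return any(
        c.get("type") == condition_type and str(c.get("status")) == "True"
        for c in conditions
    )


# Trimmed-down job document modeled on the status block of the sample outputs.
SAMPLE_JOB_JSON = """
{
  "kind": "XGBoostJob",
  "status": {
    "conditions": [
      {"type": "Created", "status": "True"},
      {"type": "Running", "status": "False"},
      {"type": "Succeeded", "status": "True"}
    ],
    "replicaStatuses": {"Master": {"succeeded": 1}, "Worker": {"succeeded": 1}}
  }
}
"""

job = json.loads(SAMPLE_JOB_JSON)
print(condition_is_true(job, "Succeeded"))  # True
print(condition_is_true(job, "Running"))    # False
```

In practice the JSON text would come from the cluster rather than from a string literal; the function only inspects the `type` and `status` fields that appear in the sample outputs.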