### Distributed XGBoost job training and prediction

This folder contains the files for distributed XGBoost training and prediction. The demo uses the
[Iris Data Set](https://archive.ics.uci.edu/ml/datasets/iris), a well-known multi-class classification dataset,
so the distributed XGBoost job solves a multi-class classification problem.
Users can extend the provided data reader to read data from distributed storage such as HDFS, HBase, or Hive.
     7  
     8  
     9  **Build image**
    10  
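The default image name and tag are `kubeflow/xgboost-dist-iris-test` and `1.1`, respectively. The commands below use the tag `1.0`; whichever tag you choose, use it consistently in the build, the push, and the job YAML files.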
    12  
    13  ```shell
    14  docker build -f Dockerfile -t kubeflow/xgboost-dist-iris-test:1.0 ./
    15  ```
    16  
Then push the Docker image to your registry:
    18  ```shell
docker push kubeflow/xgboost-dist-iris-test:1.0
    20  ```
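
The build and push steps must name the same image reference. A minimal sketch that keeps the repository and tag in shell variables, so the two commands cannot drift apart, could look like this. The variable names `IMAGE_REPO`, `IMAGE_TAG`, and `IMAGE` are assumptions made for this example, and the script only prints the commands so they can be reviewed before being run.

```shell
# Hypothetical variable names; adjust the repository and tag for your registry.
IMAGE_REPO="kubeflow/xgboost-dist-iris-test"
IMAGE_TAG="1.0"
IMAGE="${IMAGE_REPO}:${IMAGE_TAG}"

# Print the commands instead of running them, so they can be checked first.
echo "docker build -f Dockerfile -t ${IMAGE} ./"
echo "docker push ${IMAGE}"
```

The same `IMAGE` value is what the `image:` field of the job YAML files would need to reference.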
    21  
**Configure the job runtime via YAML file**

The following files are available to set up the distributed XGBoost computation runtime.
    25   
    26  To store the model in OSS:
    27  
    28  * xgboostjob_v1alpha1_iris_train.yaml 
    29  * xgboostjob_v1alpha1_iris_predict.yaml
    30  
    31  To store the model in local path:
    32  
    33  * xgboostjob_v1alpha1_iris_train_local.yaml
    34  * xgboostjob_v1alpha1_iris_predict_local.yaml
    35  
To train with OSS storage, configure xgboostjob_v1alpha1_iris_train.yaml and xgboostjob_v1alpha1_iris_predict.yaml.
The trained model is stored in [OSS](https://www.alibabacloud.com/product/oss), so you must fill in the OSS parameter in both files.
The OSS parameter carries the account information: access_id, access_key, access_bucket, and endpoint.
For example:

`--oss_param=endpoint:http://oss-ap-south-1.aliyuncs.com,access_id:XXXXXXXXXXX,access_key:XXXXXXXXXXXXXXXXXXX,access_bucket:XXXXXX`

Similarly, xgboostjob_v1alpha1_iris_predict.yaml configures the XGBoost batch prediction job.
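
The `--oss_param` value is a comma-separated list of `key:value` pairs. As an illustration, a small helper can assemble it from the four fields so that the same string is used in both YAML files. The function name `build_oss_param`, its argument order, and the placeholder credentials are assumptions made for this sketch.

```shell
# Assemble the --oss_param value from its four fields.
# Argument order (endpoint, access_id, access_key, access_bucket) is a choice
# made for this sketch, not something the job itself requires.
build_oss_param() {
  printf 'endpoint:%s,access_id:%s,access_key:%s,access_bucket:%s' \
    "$1" "$2" "$3" "$4"
}

# Placeholder credentials; replace them with your own and keep real keys
# out of version control.
OSS_PARAM="$(build_oss_param "http://oss-ap-south-1.aliyuncs.com" "MY_ID" "MY_KEY" "MY_BUCKET")"
echo "--oss_param=${OSS_PARAM}"
```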
    43  
    44  
**Start distributed XGBoost training and store the model in OSS**
    46  ```
    47  kubectl create -f xgboostjob_v1alpha1_iris_train.yaml
    48  ```
    49  
**Look at the training job status**
    51  ```
    52   kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-train
    53   ```
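Here is sample output for a finished job. The sample is in the human-readable layout that `kubectl describe` prints, so it looks different from the YAML returned by the command above: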
    55  ```
    56  Name:         xgboost-dist-iris-test
    57  Namespace:    default
    58  Labels:       <none>
    59  Annotations:  <none>
    60  API Version:  xgboostjob.kubeflow.org/v1alpha1
    61  Kind:         XGBoostJob
    62  Metadata:
    63    Creation Timestamp:  2019-06-27T01:16:09Z
    64    Generation:          9
    65    Resource Version:    385834
    66    Self Link:           /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test
    67    UID:                 2565e99a-9879-11e9-bbab-080027dfbfe2
    68  Spec:
    69    Run Policy:
    70      Clean Pod Policy:  None
    71    Xgb Replica Specs:
    72      Master:
    73        Replicas:        1
    74        Restart Policy:  Never
    75        Template:
    76          Metadata:
    77            Creation Timestamp:  <nil>
    78          Spec:
    79            Containers:
    80              Args:
    81                --job_type=Train
    82                --xgboost_parameter=objective:multi:softprob,num_class:3
    83                --n_estimators=10
    84                --learning_rate=0.1
    85                --model_path=autoAI/xgb-opt/2
    86                --model_storage_type=oss
    87                --oss_param=unknown
    88              Image:              docker.io/merlintang/xgboost-dist-iris:1.1
    89              Image Pull Policy:  Always
    90              Name:               xgboostjob
    91              Ports:
    92                Container Port:  9991
    93                Name:            xgboostjob-port
    94              Resources:
    95      Worker:
    96        Replicas:        2
    97        Restart Policy:  ExitCode
    98        Template:
    99          Metadata:
   100            Creation Timestamp:  <nil>
   101          Spec:
   102            Containers:
   103              Args:
   104                --job_type=Train
   105                --xgboost_parameter="objective:multi:softprob,num_class:3"
   106                --n_estimators=10
   107                --learning_rate=0.1
   108                --model_path="/tmp/xgboost_model"
   109                --model_storage_type=oss
   110              Image:              docker.io/merlintang/xgboost-dist-iris:1.1
   111              Image Pull Policy:  Always
   112              Name:               xgboostjob
   113              Ports:
   114                Container Port:  9991
   115                Name:            xgboostjob-port
   116              Resources:
   117  Status:
   118    Completion Time:  2019-06-27T01:17:04Z
   119    Conditions:
   120      Last Transition Time:  2019-06-27T01:16:09Z
   121      Last Update Time:      2019-06-27T01:16:09Z
   122      Message:               xgboostJob xgboost-dist-iris-test is created.
   123      Reason:                XGBoostJobCreated
   124      Status:                True
   125      Type:                  Created
   126      Last Transition Time:  2019-06-27T01:16:09Z
   127      Last Update Time:      2019-06-27T01:16:09Z
   128      Message:               XGBoostJob xgboost-dist-iris-test is running.
   129      Reason:                XGBoostJobRunning
   130      Status:                False
   131      Type:                  Running
   132      Last Transition Time:  2019-06-27T01:17:04Z
   133      Last Update Time:      2019-06-27T01:17:04Z
   134      Message:               XGBoostJob xgboost-dist-iris-test is successfully completed.
   135      Reason:                XGBoostJobSucceeded
   136      Status:                True
   137      Type:                  Succeeded
   138    Replica Statuses:
   139      Master:
   140        Succeeded:  1
   141      Worker:
   142        Succeeded:  2
   143  Events:
   144    Type    Reason                   Age                From                 Message
   145    ----    ------                   ----               ----                 -------
   146    Normal  SuccessfulCreatePod      102s               xgboostjob-operator  Created pod: xgboost-dist-iris-test-master-0
   147    Normal  SuccessfulCreateService  102s               xgboostjob-operator  Created service: xgboost-dist-iris-test-master-0
   148    Normal  SuccessfulCreatePod      102s               xgboostjob-operator  Created pod: xgboost-dist-iris-test-worker-1
   149    Normal  SuccessfulCreateService  102s               xgboostjob-operator  Created service: xgboost-dist-iris-test-worker-0
   150    Normal  SuccessfulCreateService  102s               xgboostjob-operator  Created service: xgboost-dist-iris-test-worker-1
   151    Normal  SuccessfulCreatePod      64s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-worker-0
   152    Normal  ExitedWithCode           47s (x3 over 49s)  xgboostjob-operator  Pod: default.xgboost-dist-iris-test-worker-1 exited with code 0
   153    Normal  ExitedWithCode           47s                xgboostjob-operator  Pod: default.xgboost-dist-iris-test-master-0 exited with code 0
   154    Normal  XGBoostJobSucceeded      47s                xgboostjob-operator  XGBoostJob xgboost-dist-iris-test is successfully completed.
   155   ```
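
Instead of reading the whole status by eye, the `Succeeded` condition can be extracted on its own. The sketch below wraps a `kubectl get` call with a JSONPath filter. The function name `job_succeeded_status` is hypothetical, and the field names assume the YAML form of the resource (`status.conditions[].type` and `.status`).

```shell
# Print the status ("True" or "False") of the job's Succeeded condition.
# Prints nothing if the job has no Succeeded condition yet.
# Requires kubectl access to the cluster where the job runs.
job_succeeded_status() {
  kubectl get "XGBoostJob/$1" \
    -o jsonpath='{.status.conditions[?(@.type=="Succeeded")].status}'
}

# Example call (commented out so the definition can be sourced safely):
# job_succeeded_status xgboost-dist-iris-test-train
```

A result of `True` corresponds to the `XGBoostJobSucceeded` condition shown in the sample output.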
   156  
**Start the distributed XGBoost prediction job**
   158  ```shell
   159  kubectl create -f xgboostjob_v1alpha1_iris_predict.yaml
   160  ```
   161  
   162  **Look at the batch predict job status**
   163  ```
   164   kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-predict
   165   ```
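Here is sample output for a finished job. The sample is in the human-readable layout that `kubectl describe` prints, so it looks different from the YAML returned by the command above: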
   167  ```
   168  Name:         xgboost-dist-iris-test-predict
   169  Namespace:    default
   170  Labels:       <none>
   171  Annotations:  <none>
   172  API Version:  xgboostjob.kubeflow.org/v1alpha1
   173  Kind:         XGBoostJob
   174  Metadata:
   175    Creation Timestamp:  2019-06-27T06:06:53Z
   176    Generation:          8
   177    Resource Version:    394523
   178    Self Link:           /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-predict
   179    UID:                 c2a04cbc-98a1-11e9-bbab-080027dfbfe2
   180  Spec:
   181    Run Policy:
   182      Clean Pod Policy:  None
   183    Xgb Replica Specs:
   184      Master:
   185        Replicas:        1
   186        Restart Policy:  Never
   187        Template:
   188          Metadata:
   189            Creation Timestamp:  <nil>
   190          Spec:
   191            Containers:
   192              Args:
   193                --job_type=Predict
   194                --model_path=autoAI/xgb-opt/3
   195                --model_storage_type=oss
   196                --oss_param=unkown
   197              Image:              docker.io/merlintang/xgboost-dist-iris:1.1
   198              Image Pull Policy:  Always
   199              Name:               xgboostjob
   200              Ports:
   201                Container Port:  9991
   202                Name:            xgboostjob-port
   203              Resources:
   204      Worker:
   205        Replicas:        2
   206        Restart Policy:  ExitCode
   207        Template:
   208          Metadata:
   209            Creation Timestamp:  <nil>
   210          Spec:
   211            Containers:
   212              Args:
   213                --job_type=Predict
   214                --model_path=autoAI/xgb-opt/3
   215                --model_storage_type=oss
   216                --oss_param=unkown
   217              Image:              docker.io/merlintang/xgboost-dist-iris:1.1
   218              Image Pull Policy:  Always
   219              Name:               xgboostjob
   220              Ports:
   221                Container Port:  9991
   222                Name:            xgboostjob-port
   223              Resources:
   224  Status:
   225    Completion Time:  2019-06-27T06:07:02Z
   226    Conditions:
   227      Last Transition Time:  2019-06-27T06:06:53Z
   228      Last Update Time:      2019-06-27T06:06:53Z
   229      Message:               xgboostJob xgboost-dist-iris-test-predict is created.
   230      Reason:                XGBoostJobCreated
   231      Status:                True
   232      Type:                  Created
   233      Last Transition Time:  2019-06-27T06:06:53Z
   234      Last Update Time:      2019-06-27T06:06:53Z
   235      Message:               XGBoostJob xgboost-dist-iris-test-predict is running.
   236      Reason:                XGBoostJobRunning
   237      Status:                False
   238      Type:                  Running
   239      Last Transition Time:  2019-06-27T06:07:02Z
   240      Last Update Time:      2019-06-27T06:07:02Z
   241      Message:               XGBoostJob xgboost-dist-iris-test-predict is successfully completed.
   242      Reason:                XGBoostJobSucceeded
   243      Status:                True
   244      Type:                  Succeeded
   245    Replica Statuses:
   246      Master:
   247        Succeeded:  1
   248      Worker:
   249        Succeeded:  2
   250  Events:
   251    Type    Reason                   Age                From                 Message
   252    ----    ------                   ----               ----                 -------
   253    Normal  SuccessfulCreatePod      47s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-predict-worker-0
   254    Normal  SuccessfulCreatePod      47s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-predict-worker-1
   255    Normal  SuccessfulCreateService  47s                xgboostjob-operator  Created service: xgboost-dist-iris-test-predict-worker-0
   256    Normal  SuccessfulCreateService  47s                xgboostjob-operator  Created service: xgboost-dist-iris-test-predict-worker-1
   257    Normal  SuccessfulCreatePod      47s                xgboostjob-operator  Created pod: xgboost-dist-iris-test-predict-master-0
   258    Normal  SuccessfulCreateService  47s                xgboostjob-operator  Created service: xgboost-dist-iris-test-predict-master-0
   259    Normal  ExitedWithCode           38s (x3 over 40s)  xgboostjob-operator  Pod: default.xgboost-dist-iris-test-predict-worker-0 exited with code 0
   260    Normal  ExitedWithCode           38s                xgboostjob-operator  Pod: default.xgboost-dist-iris-test-predict-master-0 exited with code 0
   261    Normal  XGBoostJobSucceeded      38s                xgboostjob-operator  XGBoostJob xgboost-dist-iris-test-predict is successfully completed.
   262  ```
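
The events above suggest that the operator names pods `<job>-master-0` and `<job>-worker-<index>`. Assuming that pattern holds for your operator version, the sketch below lists the expected pod names for a job, which is handy when collecting logs. The function name `job_pod_names` is hypothetical.

```shell
# Print the expected pod names for a job with one master and N workers.
# $1 = job name, $2 = number of worker replicas.
# The body runs in a subshell so the loop counter does not leak.
job_pod_names() (
  echo "$1-master-0"
  i=0
  while [ "$i" -lt "$2" ]; do
    echo "$1-worker-$i"
    i=$((i + 1))
  done
)

job_pod_names xgboost-dist-iris-test-predict 2

# Example use with a cluster (commented out):
# for pod in $(job_pod_names xgboost-dist-iris-test-predict 2); do
#   kubectl logs "$pod"
# done
```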
   263  
**Start distributed XGBoost training and store the model locally**
   265  
Before training, create a PVC to store the trained model.
To create the PVC, save the following content in a file named pvc.yaml:
   270  ```
   271  apiVersion: v1
   272  kind: PersistentVolumeClaim
   273  metadata:
   274    name: xgboostlocal
   275  spec:
   276    storageClassName: glusterfs
   277    accessModes:
   278      - ReadWriteMany
   279    resources:
   280      requests:
   281        storage: 10Gi
   282  ```
   283  ```
   284  kubectl create -f pvc.yaml
   285  ```
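
Because the storage class depends on the cluster, it can help to generate pvc.yaml from a variable rather than editing the file by hand. This is a sketch under that assumption: the variables `STORAGE_CLASS` and `PVC_FILE` are hypothetical names, and the default value `glusterfs` is the one used in this README.

```shell
# Storage class is cluster-specific; glusterfs is the value used in this README.
STORAGE_CLASS="${STORAGE_CLASS:-glusterfs}"
PVC_FILE="${PVC_FILE:-pvc.yaml}"

# Write the same claim as above, with the storage class substituted.
cat > "$PVC_FILE" <<EOF
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: xgboostlocal
spec:
  storageClassName: ${STORAGE_CLASS}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 10Gi
EOF

echo "Wrote ${PVC_FILE}; create the claim with: kubectl create -f ${PVC_FILE}"
```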
Note:

* Use a storage class that supports ReadWriteMany. The example YAML above uses glusterfs.

* Set `model_storage_type=local` and set `model_path` accordingly (the example uses /tmp/xgboost_model/2) in xgboostjob_v1alpha1_iris_train_local.yaml and xgboostjob_v1alpha1_iris_predict_local.yaml.
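
In the sample job specs, the PVC is mounted at /tmp/xgboost_model and `model_path` points to /tmp/xgboost_model/2, that is, inside the mounted volume. Assuming the prediction job needs to read the model from that shared volume, a quick sanity check that `model_path` lies under the mount path can be sketched as follows. The function name `path_under_mount` is hypothetical.

```shell
# Succeed (exit 0) when $1 is the mount path $2 itself or a path below it.
path_under_mount() {
  case "$1" in
    "$2"|"$2"/*) return 0 ;;
    *) return 1 ;;
  esac
}

MOUNT_PATH="/tmp/xgboost_model"      # mountPath in the sample job spec
MODEL_PATH="/tmp/xgboost_model/2"    # --model_path in the sample job spec

if path_under_mount "$MODEL_PATH" "$MOUNT_PATH"; then
  echo "model_path is on the mounted volume"
else
  echo "model_path is outside the mounted volume" >&2
fi
```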
   291  
Now start the distributed XGBoost training.
   293  ```
   294  kubectl create -f xgboostjob_v1alpha1_iris_train_local.yaml
   295  ```
   296  
**Look at the training job status**
   298  ```
   299   kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-train-local
   300   ```
Here is sample output for a finished job:
   302  ```
   303  
   304  apiVersion: xgboostjob.kubeflow.org/v1alpha1
   305  kind: XGBoostJob
   306  metadata:
   307    creationTimestamp: "2019-09-17T05:36:01Z"
   308    generation: 7
   309    name: xgboost-dist-iris-test-train_local
   310    namespace: default
   311    resourceVersion: "8919366"
   312    selfLink: /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-train_local
   313    uid: 08f85fad-d90d-11e9-aca1-fa163ea13108
   314  spec:
   315    RunPolicy:
   316      cleanPodPolicy: None
   317    xgbReplicaSpecs:
   318      Master:
   319        replicas: 1
   320        restartPolicy: Never
   321        template:
   322          metadata:
   323            creationTimestamp: null
   324          spec:
   325            containers:
   326            - args:
   327              - --job_type=Train
   328              - --xgboost_parameter=objective:multi:softprob,num_class:3
   329              - --n_estimators=10
   330              - --learning_rate=0.1
   331              - --model_path=/tmp/xgboost_model/2
   332              - --model_storage_type=local
   333              image: docker.io/merlintang/xgboost-dist-iris:1.1
   334              imagePullPolicy: Always
   335              name: xgboostjob
   336              ports:
   337              - containerPort: 9991
   338                name: xgboostjob-port
   339              resources: {}
   340              volumeMounts:
   341              - mountPath: /tmp/xgboost_model
   342                name: task-pv-storage
   343            volumes:
   344            - name: task-pv-storage
   345              persistentVolumeClaim:
   346                claimName: xgboostlocal
   347      Worker:
   348        replicas: 2
   349        restartPolicy: ExitCode
   350        template:
   351          metadata:
   352            creationTimestamp: null
   353          spec:
   354            containers:
   355            - args:
   356              - --job_type=Train
   357              - --xgboost_parameter="objective:multi:softprob,num_class:3"
   358              - --n_estimators=10
   359              - --learning_rate=0.1
   360              - --model_path=/tmp/xgboost_model/2
   361              - --model_storage_type=local
   362              image: bcmt-registry:5000/kubeflow/xgboost-dist-iris-test:1.0
   363              imagePullPolicy: Always
   364              name: xgboostjob
   365              ports:
   366              - containerPort: 9991
   367                name: xgboostjob-port
   368              resources: {}
   369              volumeMounts:
   370              - mountPath: /tmp/xgboost_model
   371                name: task-pv-storage
   372            volumes:
   373            - name: task-pv-storage
   374              persistentVolumeClaim:
   375                claimName: xgboostlocal
   376  status:
   377    completionTime: "2019-09-17T05:37:02Z"
   378    conditions:
   379    - lastTransitionTime: "2019-09-17T05:36:02Z"
   380      lastUpdateTime: "2019-09-17T05:36:02Z"
   381      message: xgboostJob xgboost-dist-iris-test-train_local is created.
   382      reason: XGBoostJobCreated
   383      status: "True"
   384      type: Created
   385    - lastTransitionTime: "2019-09-17T05:36:02Z"
   386      lastUpdateTime: "2019-09-17T05:36:02Z"
   387      message: XGBoostJob xgboost-dist-iris-test-train_local is running.
   388      reason: XGBoostJobRunning
   389      status: "False"
   390      type: Running
   391    - lastTransitionTime: "2019-09-17T05:37:02Z"
   392      lastUpdateTime: "2019-09-17T05:37:02Z"
   393      message: XGBoostJob xgboost-dist-iris-test-train_local is successfully completed.
   394      reason: XGBoostJobSucceeded
   395      status: "True"
   396      type: Succeeded
   397    replicaStatuses:
   398      Master:
   399        succeeded: 1
   400      Worker:
   401        succeeded: 2 
   402   ```
**Start the distributed XGBoost prediction job**
   404  ```
   405  kubectl create -f xgboostjob_v1alpha1_iris_predict_local.yaml
   406  ```
   407  
   408  **Look at the batch predict job status**
   409  ```
   410   kubectl get -o yaml XGBoostJob/xgboost-dist-iris-test-predict-local
   411   ```
Here is sample output for a finished job:
   413  ```
   414  apiVersion: xgboostjob.kubeflow.org/v1alpha1
   415  kind: XGBoostJob
   416  metadata:
   417    creationTimestamp: "2019-09-17T06:33:38Z"
   418    generation: 6
   419    name: xgboost-dist-iris-test-predict_local
   420    namespace: default
   421    resourceVersion: "8976054"
   422    selfLink: /apis/xgboostjob.kubeflow.org/v1alpha1/namespaces/default/xgboostjobs/xgboost-dist-iris-test-predict_local
   423    uid: 151655b0-d915-11e9-aca1-fa163ea13108
   424  spec:
   425    RunPolicy:
   426      cleanPodPolicy: None
   427    xgbReplicaSpecs:
   428      Master:
   429        replicas: 1
   430        restartPolicy: Never
   431        template:
   432          metadata:
   433            creationTimestamp: null
   434          spec:
   435            containers:
   436            - args:
   437              - --job_type=Predict
   438              - --model_path=/tmp/xgboost_model/2
   439              - --model_storage_type=local
   440              image: docker.io/merlintang/xgboost-dist-iris:1.1
   441              imagePullPolicy: Always
   442              name: xgboostjob
   443              ports:
   444              - containerPort: 9991
   445                name: xgboostjob-port
   446              resources: {}
   447              volumeMounts:
   448              - mountPath: /tmp/xgboost_model
   449                name: task-pv-storage
   450            volumes:
   451            - name: task-pv-storage
   452              persistentVolumeClaim:
   453                claimName: xgboostlocal
   454      Worker:
   455        replicas: 2
   456        restartPolicy: ExitCode
   457        template:
   458          metadata:
   459            creationTimestamp: null
   460          spec:
   461            containers:
   462            - args:
   463              - --job_type=Predict
   464              - --model_path=/tmp/xgboost_model/2
   465              - --model_storage_type=local
   466              image: docker.io/merlintang/xgboost-dist-iris:1.1
   467              imagePullPolicy: Always
   468              name: xgboostjob
   469              ports:
   470              - containerPort: 9991
   471                name: xgboostjob-port
   472              resources: {}
   473              volumeMounts:
   474              - mountPath: /tmp/xgboost_model
   475                name: task-pv-storage
   476            volumes:
   477            - name: task-pv-storage
   478              persistentVolumeClaim:
   479                claimName: xgboostlocal
   480  status:
   481    completionTime: "2019-09-17T06:33:51Z"
   482    conditions:
   483    - lastTransitionTime: "2019-09-17T06:33:38Z"
   484      lastUpdateTime: "2019-09-17T06:33:38Z"
   485      message: xgboostJob xgboost-dist-iris-test-predict_local is created.
   486      reason: XGBoostJobCreated
   487      status: "True"
   488      type: Created
   489    - lastTransitionTime: "2019-09-17T06:33:38Z"
   490      lastUpdateTime: "2019-09-17T06:33:38Z"
   491      message: XGBoostJob xgboost-dist-iris-test-predict_local is running.
   492      reason: XGBoostJobRunning
   493      status: "False"
   494      type: Running
   495    - lastTransitionTime: "2019-09-17T06:33:51Z"
   496      lastUpdateTime: "2019-09-17T06:33:51Z"
   497      message: XGBoostJob xgboost-dist-iris-test-predict_local is successfully completed.
   498      reason: XGBoostJobSucceeded
   499      status: "True"
   500      type: Succeeded
   501    replicaStatuses:
   502      Master:
   503        succeeded: 1
   504      Worker:
   505        succeeded: 1
   506  ```