# Building Machine Learning Platforms Using KubeVela and ACK


## Background

Data scientists are embracing Kubernetes as the infrastructure to run ML apps.
Nonetheless, when it comes to converting machine learning code into application delivery pipelines, data scientists struggle a lot:
it is a challenging, time-consuming task that needs the cooperation of different domain experts: application developers, data scientists, and platform engineers.

As a result, platform teams are building self-service ML platforms for data scientists to test, deploy and upgrade models.
Such platforms provide the following benefits:

- Improve the speed-to-market for ML models.
- Lower the barrier to entry for ML developers to get their models into production.
- Implement operational efficiencies and economies of scale.

With KubeVela and ACK (Alibaba Cloud Container Service for Kubernetes), we can build ML platforms easily:

- ACK + Alibaba Cloud can provide the infra services to support deployment of ML code and models.
- KubeVela can provide the standard workflow and APIs to glue all the deployment steps together.

In this doc, we will discuss one generic solution to building an ML platform using KubeVela and ACK.
We will see that with KubeVela it is easy to build high-level abstractions and developer-facing APIs that improve the user experience on top of cloud infrastructure.


## ACK Features Used

Building ML platforms with KubeVela on ACK gives you the following feature benefits:

- You can provision and manage Kubernetes clusters via the ACK console and easily configure multiple compute and GPU node configurations.
- You can scale up cluster resources or set up staging environments in pay-as-you-go mode by using ASK (Serverless Kubernetes).
- You can deploy your apps to the edge and manage them in edge-autonomous mode by using ACK@Edge.
- Machine learning jobs can share GPUs to save cost and improve utilization by enabling GPU sharing mode on ACK (see the sketch after this list).
- Application logs and metrics are centralized and unified in ARMS, which helps with monitoring, troubleshooting, and debugging.
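To make the GPU sharing option concrete, here is a minimal sketch of a training pod that requests a slice of GPU memory through ACK's shared-GPU scheduling. It assumes GPU sharing is already enabled on the cluster, which exposes the `aliyun.com/gpu-mem` extended resource; the image name and memory amount are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-training
spec:
  containers:
    - name: tensorflow
      image: tf-mnist-estimator-api:v0.1  # placeholder training image
      resources:
        limits:
          # Request 4 GiB of GPU memory instead of a whole card,
          # so several training jobs can share one physical GPU.
          aliyun.com/gpu-mem: 4
```

With this in place, the scheduler can pack multiple such pods onto one GPU node as long as their memory requests fit.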
## Initialize Infrastructure Environment

Users need to set up the following infrastructure resources before deploying ML code:

- Kubernetes cluster
- Kubeflow operator
- OSS bucket

We propose to add the following Initializer to achieve this:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Initializer
spec:
  appTemplate:
    spec:
      components:
        - name: prod-cluster
          type: k8s-cluster
          properties:
            provider: alibaba
            resource: ACK
            version: v1.20

        - name: dev-cluster
          type: k8s-cluster
          properties:
            provider: alibaba
            resource: ASK
            version: v1.20

        - name: kubeflow
          type: helm-chart
          properties:
            repo: repo-url
            chart: kubeflow
            namespace: kubeflow-system
            create-namespace: true

        - name: s3-bucket
          type: s3-bucket
          properties:
            provider: alibaba
            bucket: ml-example

      workflows:
        - name: create-prod-cluster
          type: terraform-apply
          properties:
            component: prod-cluster

        - name: create-dev-cluster
          type: terraform-apply
          properties:
            component: dev-cluster

        - name: deploy-kubeflow
          type: helm-apply
          properties:
            component: kubeflow

        - name: create-s3-bucket
          type: terraform-apply
          properties:
            component: s3-bucket
```
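The `k8s-cluster` component type consumed by the `terraform-apply` steps above could be backed by KubeVela's Terraform support. Below is a rough, illustrative sketch; the parameter names and the single Terraform resource shown are assumptions for this sketch, and a real definition would map `resource: ACK/ASK` to different configurations:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: ComponentDefinition
metadata:
  name: k8s-cluster
spec:
  workload:
    definition:
      # Terraform-based components render to a Configuration CR
      # handled by the terraform-controller addon.
      apiVersion: terraform.core.oam.dev/v1beta2
      kind: Configuration
  schematic:
    terraform:
      configuration: |
        variable "name" {
          type = string
        }
        variable "version" {
          type = string
        }

        # Managed ACK cluster; most arguments omitted for brevity.
        resource "alicloud_cs_managed_kubernetes" "cluster" {
          name    = var.name
          version = var.version
        }
```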
## Model Training and Serving

In this section, we will define the high-level, user-facing APIs exposed to users.

Here is an overview:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
spec:
  components:
    # This is the component to train the models.
    - name: my-tfjob
      type: tfjob
      properties:
        # modelVersion defines the location where the model is stored.
        modelVersion:
          modelName: mymodel
          # The dockerhub repo to push the generated image to.
          imageRepo: myhub/mymodel
        # tfReplicaSpecs defines the config to run the training job.
        tfReplicaSpecs:
          Worker:
            replicas: 3
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tf-mnist-estimator-api:v0.1

    # This is the component to serve the models in prod.
    - name: my-tfserving-prod
      type: tfserving
      properties:
        # Below we show two predictors that split the serving traffic.
        predictors:
          # 90% of the traffic will be routed to this predictor.
          - name: model-a-predictor
            modelVersion: mymodel-v1
            replicas: 3
            trafficPercentage: 90
            autoScale:
              minReplicas: 1
              maxReplicas: 10
            batching:
              batchSize: 32
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/serving:1.11.0
          # 10% of the traffic will be routed to this predictor.
          - name: model-b-predictor
            modelVersion: mymodel-v2
            replicas: 3
            trafficPercentage: 10
            autoScale:
              minReplicas: 1
              maxReplicas: 10
            batching:
              batchSize: 64
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/serving:1.11.1
      traits:
        - name: metrics
          type: arms-metrics
        - name: logging
          type: arms-logging

    # This is the component to serve the models in dev.
    - name: my-tfserving-dev
      type: tfserving
      properties:
        predictors:
          - name: model-predictor
            modelVersion: mymodel-v2
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/serving:1.11.1

  workflow:
    steps:
      - name: train-model
        type: ml-model-training
        properties:
          component: my-tfjob
          # The workflow task will load the dataset into the volumes of the training job container.
          dataset:
            s3:
              bucket: bucket-url

      # Wait for the user to evaluate the model and decide to pass/fail.
      - name: evaluate-model
        type: suspend

      - name: save-model
        type: ml-model-checkpoint
        properties:
          # modelVersion defines the location where the model is stored.
          modelVersion:
            modelName: mymodel-v2
            # The docker repo to push the generated image to.
            imageRepo: myrepo/mymodel

      - name: serve-model-in-dev
        type: ml-model-serving
        properties:
          component: my-tfserving-dev
          env: dev

      # Wait for the user to evaluate the serving and decide to pass/fail.
      - name: evaluate-serving
        type: suspend

      - name: serve-model-in-prod
        type: ml-model-serving
        properties:
          component: my-tfserving-prod
          env: prod
```

## Integration with ACK Services

Above we have defined the user-facing APIs.
Under the hood, we can leverage ACK and cloud services to support the deployment of the ML models.
Here is how they are implemented:

- We can create and manage ACK clusters in the cluster-creation workflow tasks (e.g. `create-prod-cluster`).
  We can define the ACK cluster templates in the `k8s-cluster` component.
- We can use ASK as the cluster resource for the dev environment, as defined in the `dev-cluster` component.
  Once users have evaluated the service and promoted it to production, the ASK cluster will automatically scale down.
- We can use ASK for scaling up cluster resources in the prod environment.
  When a traffic spike comes, users automatically get more resources to create more serving instances,
  which keeps the services responsive.
- We can deploy ML models to ACK@Edge to keep services running in edge-autonomous mode.
- We can provide GPU sharing options to users by using the ACK GPU sharing feature.
- We can export the logs and metrics to ARMS and display them in dashboards automatically.


## Considerations

### 1. Comparison to using Kubeflow

How is this different from traditional methods like using Kubeflow directly?

- Users of Kubeflow still need to write a lot of scripts.
  Managing those scripts is a challenging problem in itself: for example, how should they be stored, and how should they be documented?
- With KubeVela, we provide a standard way to manage this glue code:
  it is organized in modules, stored as CRDs, and exposed via CUE APIs.
- Kubeflow and KubeVela work at different levels.
  Kubeflow provides low-level, atomic capabilities.
  KubeVela works on higher-level APIs to simplify deployment and operations for users.
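As a closing illustration of how thin this glue layer can be, here is a minimal sketch of how a step type used above, such as `ml-model-serving`, could be implemented as a KubeVela WorkflowStepDefinition. The CUE body is an assumption for illustration only; a real implementation would also select the target cluster based on `env` and verify serving health:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: WorkflowStepDefinition
metadata:
  name: ml-model-serving
spec:
  schematic:
    cue:
      template: |
        import ("vela/op")

        parameter: {
          // Name of the tfserving component to deploy.
          component: string
          // Target environment, e.g. "dev" or "prod".
          env: string
        }

        // Apply the referenced component. Dispatching it to the
        // cluster matching `env` is omitted in this sketch.
        apply: op.#ApplyComponent & {
          component: parameter.component
        }
```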