# Building Machine Learning Platforms Using KubeVela and ACK


## Background

Data scientists are embracing Kubernetes as the infrastructure for running ML apps.
Nonetheless, when it comes to converting machine learning code into application delivery pipelines, data scientists struggle a lot:
it is a challenging, time-consuming task that requires the cooperation of different domain experts: application developers, data scientists, and platform engineers.

As a result, platform teams are building self-service ML platforms for data scientists to test, deploy, and upgrade models.
Such platforms provide the following benefits:

- Improve the speed-to-market for ML models.
- Lower the barrier to entry for ML developers to get their models into production.
- Implement operational efficiencies and economies of scale.

With KubeVela and ACK (Alibaba Cloud Container Service for Kubernetes), we can build ML platforms easily:

- ACK and Alibaba Cloud provide the infrastructure services to support the deployment of ML code and models.
- KubeVela provides standard workflows and APIs to glue all the deployment steps together.

In this doc, we will discuss a generic solution for building an ML platform using KubeVela and ACK.
We will see that KubeVela makes it easy to build high-level abstractions and developer-facing APIs on top of cloud infrastructure to improve the user experience.


## ACK Features Used

Building ML platforms with KubeVela on ACK gives you the following benefits:

- You can provision and manage Kubernetes clusters via the ACK console and easily configure multiple compute and GPU node configurations.
- You can scale up cluster resources or set up staging environments in pay-as-you-go mode by using ASK (Serverless Kubernetes).
- You can deploy your apps to the edge and manage them in edge-autonomous mode by using ACK@Edge.
- Machine learning jobs can share GPUs to save cost and improve utilization by enabling GPU sharing mode on ACK (see the sketch after this list).
- Application logs and metrics are centralized and unified in ARMS, which helps with monitoring, troubleshooting, and debugging.

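
For illustration, with GPU sharing enabled on ACK, a training pod can request a slice of GPU memory instead of a whole card. Below is a minimal sketch, assuming the `aliyun.com/gpu-mem` extended resource that ACK's GPU sharing feature exposes; the pod name and memory amount are illustrative:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-training
spec:
  containers:
    - name: tensorflow
      image: tf-mnist-estimator-api:v0.1
      resources:
        limits:
          # Request 4 GiB of GPU memory on a shared GPU rather than a
          # whole device (resource name assumes ACK's GPU sharing mode).
          aliyun.com/gpu-mem: 4
```
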
## Initialize Infrastructure Environment

Users need to set up the following infrastructure resources before deploying ML code:

- Kubernetes cluster
- Kubeflow operator
- OSS bucket

We propose to add the following Initializer to achieve this:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Initializer
spec:
  appTemplate:
    spec:
      components:
        - name: prod-cluster
          type: k8s-cluster
          properties:
            provider: alibaba
            resource: ACK
            version: v1.20

        - name: dev-cluster
          type: k8s-cluster
          properties:
            provider: alibaba
            resource: ASK
            version: v1.20

        - name: kubeflow
          type: helm-chart
          properties:
            repo: repo-url
            chart: kubeflow
            namespace: kubeflow-system
            create-namespace: true

        - name: s3-bucket
          type: s3-bucket
          properties:
            provider: alibaba
            bucket: ml-example

      workflow:
        steps:
          - name: create-prod-cluster
            type: terraform-apply
            properties:
              component: prod-cluster

          - name: create-dev-cluster
            type: terraform-apply
            properties:
              component: dev-cluster

          - name: deploy-kubeflow
            type: helm-apply
            properties:
              component: kubeflow

          - name: create-s3-bucket
            type: terraform-apply
            properties:
              component: s3-bucket
```
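
Under the hood, the proposed `k8s-cluster` component type could be backed by KubeVela's Terraform support, which gives the `terraform-apply` workflow steps above a concrete resource to provision. Below is a minimal sketch assuming the alicloud Terraform provider's managed Kubernetes resource; all names and fields are illustrative, and a real definition would expose more inputs (region, node pools, etc.):

```yaml
apiVersion: core.oam.dev/v1beta1
kind: ComponentDefinition
metadata:
  name: k8s-cluster
spec:
  workload:
    definition:
      apiVersion: terraform.core.oam.dev/v1beta1
      kind: Configuration
  schematic:
    terraform:
      configuration: |
        variable "version" {
          type        = string
          description = "Kubernetes version of the cluster"
        }
        # Provision an ACK managed cluster; the ASK case would map to the
        # serverless Kubernetes resource of the alicloud provider instead.
        # Required inputs such as vswitches and worker specs are omitted.
        resource "alicloud_cs_managed_kubernetes" "cluster" {
          version = var.version
        }
```
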
## Model Training and Serving

In this section, we will define the high-level, user-facing APIs exposed to users.

Here is an overview:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
spec:
  components:
    # This is the component to train the models.
    - name: my-tfjob
      type: tfjob
      properties:
        # modelVersion defines the location where the model is stored.
        modelVersion:
          modelName: mymodel
          # The dockerhub repo to push the generated image to.
          imageRepo: myhub/mymodel
        # tfReplicaSpecs defines the config to run the training job.
        tfReplicaSpecs:
          Worker:
            replicas: 3
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tf-mnist-estimator-api:v0.1

    # This is the component to serve the models in production.
    - name: my-tfserving-prod
      type: tfserving
      properties:
        # Below we show two predictors that split the serving traffic.
        predictors:
          # 90% of the traffic will be routed to this predictor.
          - name: model-a-predictor
            modelVersion: mymodel-v1
            replicas: 3
            trafficPercentage: 90
            autoScale:
              minReplicas: 1
              maxReplicas: 10
            batching:
              batchSize: 32
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/serving:1.11.0
          # 10% of the traffic will be routed to this predictor.
          - name: model-b-predictor
            modelVersion: mymodel-v2
            replicas: 3
            trafficPercentage: 10
            autoScale:
              minReplicas: 1
              maxReplicas: 10
            batching:
              batchSize: 64
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/serving:1.11.1

      traits:
        - type: arms-metrics
        - type: arms-logging

    # This is the component to serve the models in dev.
    - name: my-tfserving-dev
      type: tfserving
      properties:
        predictors:
          - name: model-predictor
            modelVersion: mymodel-v2
            template:
              spec:
                containers:
                  - name: tensorflow
                    image: tensorflow/serving:1.11.1

  workflow:
    steps:
      - name: train-model
        type: ml-model-training
        properties:
          component: my-tfjob
          # The workflow task will load the dataset into the volumes of the training job containers.
          dataset:
            s3:
              bucket: bucket-url

      # Wait for the user to evaluate the model and decide to pass/fail.
      - name: evaluate-model
        type: suspend

      - name: save-model
        type: ml-model-checkpoint
        properties:
          # modelVersion defines the location where the model is stored.
          modelVersion:
            modelName: mymodel-v2
            # The docker repo to push the generated image to.
            imageRepo: myrepo/mymodel

      - name: serve-model-in-dev
        type: ml-model-serving
        properties:
          component: my-tfserving-dev
          env: dev

      # Wait for the user to evaluate the serving and decide to pass/fail.
      - name: evaluate-serving
        type: suspend

      - name: serve-model-in-prod
        type: ml-model-serving
        properties:
          component: my-tfserving-prod
          env: prod
```
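
The `evaluate-model` and `evaluate-serving` steps use KubeVela's built-in `suspend` step type; after manual evaluation, the user resumes the workflow, for example with `vela workflow resume <app-name>`. Under the hood, each component type above would be backed by a definition object. As one illustration, here is a minimal sketch of how the `tfjob` component type could render a Kubeflow `TFJob` via a CUE-based ComponentDefinition; the parameter surface is an assumption, and a real design would also handle the `modelVersion` image building:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: ComponentDefinition
metadata:
  name: tfjob
spec:
  workload:
    definition:
      apiVersion: kubeflow.org/v1
      kind: TFJob
  schematic:
    cue:
      template: |
        // Render a Kubeflow TFJob from the user-facing properties.
        output: {
          apiVersion: "kubeflow.org/v1"
          kind:       "TFJob"
          spec: tfReplicaSpecs: parameter.tfReplicaSpecs
        }
        parameter: {
          // Passed through as-is to the TFJob spec.
          tfReplicaSpecs: {...}
          // Where to store/push the trained model; consumed by the
          // training and checkpoint workflow steps (omitted here).
          modelVersion?: {...}
        }
```
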
## Integration with ACK Services

Above we have defined the user-facing APIs.
Under the hood, we can leverage ACK and cloud services to support the deployment of the ML models.
Here is how they are implemented:

- We can create and manage ACK clusters in the `create-prod-cluster` workflow task.
  We can define the ACK cluster templates in the `k8s-cluster` component.
- We can use ASK as the cluster resource for the dev environment, which is defined in the `dev-cluster` component.
  Once users have evaluated the service and promoted it to production, the ASK cluster will automatically scale down.
- We can use ASK to scale up cluster resources in the prod environment.
  When a traffic spike comes, users automatically get more resources to create more serving instances,
  which keeps the services responsive.
- We can deploy ML models to ACK@Edge to keep services running in edge-autonomous mode.
- We can offer GPU sharing to users via the ACK GPU sharing feature.
- We can export the logs and metrics to ARMS and display them in dashboards automatically, via the `arms-metrics` and `arms-logging` traits; a sketch follows this list.
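
For illustration, the `arms-metrics` trait could be a CUE-based TraitDefinition that patches the workload's pod template so that the Prometheus agent managed by ARMS discovers and scrapes it. This is a minimal sketch; the annotation-based discovery and the default port are assumptions:

```yaml
apiVersion: core.oam.dev/v1beta1
kind: TraitDefinition
metadata:
  name: arms-metrics
spec:
  appliesToWorkloads:
    - deployments.apps
  schematic:
    cue:
      template: |
        // Patch pod annotations so the ARMS-managed Prometheus agent
        // scrapes the serving containers (port is a hypothetical default).
        patch: spec: template: metadata: annotations: {
          "prometheus.io/scrape": "true"
          "prometheus.io/port":   "\(parameter.port)"
        }
        parameter: {
          port: *8501 | int
        }
```
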
## Considerations

### 1. Comparison to using Kubeflow

How is this different from traditional methods like using Kubeflow directly?

- Users of Kubeflow still need to write a lot of glue scripts.
  Managing those scripts is a challenging problem in itself:
  for example, how should they be stored, and how should they be documented?
- With KubeVela, we provide a standard way to manage this glue code.
  It is organized in modules, stored as CRDs, and exposed via CUE APIs.
- Kubeflow and KubeVela work at different levels:
  Kubeflow provides low-level, atomic capabilities,
  while KubeVela provides higher-level APIs that simplify deployment and operations for users.