# studio-go-runner Kubernetes features

This document describes features supported by the studioml go runner (runner) for generic Kubernetes installations and builds.

This document is useful for technical staff who wish to dig into the mechanics behind the runner Kubernetes deployment technology.

## Prerequisites

This document assumes that the reader is familiar with Kubernetes (k8s), Docker, and Linux.

In order to perform builds and prepare docker images for remote builds inside a k8s cluster you should have the following:

* An Ubuntu workstation with Docker-CE 17 or later installed, https://docs.docker.com/install/linux/docker-ce/ubuntu/
* A shell account with docker accessible and appropriate rights enabled
* The go compiler installed, https://github.com/golang/go/wiki/Ubuntu, snap is the preferred method, `snap install --classic go`
* The python runtime installed, the default on Ubuntu distributions
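
A quick way to confirm the tooling is in place is to ask each tool for its version; the commands below are illustrative and any reasonably recent versions should do:

```
docker --version        # expect Docker-CE 17 or later
go version              # any recent go toolchain
python --version        # ships with Ubuntu by default
```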

The next two steps will first prepare a directory from which docker images can be produced for builds, and the second will produce images that can later be tagged against publicly hosted image repositories and pulled into your build cluster.

### Build bootstrapping

In order to perform a build you will need to check out a copy of the runner using git and define several environment variables.

First a decision needs to be made as to whether you will use a fork of the open source repository and which branch is needed.  The following instructions assume that the master branch of the original open source repository is being used:

```
export GOPATH=~/project
export PATH=$GOPATH/bin:$PATH

mkdir -p ~/project/src/github.com/leaf-ai
cd ~/project/src/github.com/leaf-ai
git clone https://github.com/leaf-ai/studio-go-runner.git
cd studio-go-runner

# Get build tooling
go get github.com/karlmutch/duat
go install github.com/karlmutch/duat/cmd/semver
go install github.com/karlmutch/duat/cmd/github-release
go install github.com/karlmutch/duat/cmd/image-release
go install github.com/karlmutch/duat/cmd/stencil

# Get build dependency and package manager
go get -u github.com/golang/dep/cmd/dep

# (Optional) Get the Azure CLI tools, more information at https://github.com/Azure/azure-cli
AZ_REPO=$(lsb_release -cs)
echo "deb [arch=amd64] https://packages.microsoft.com/repos/azure-cli/ $AZ_REPO main" | \
    sudo tee /etc/apt/sources.list.d/azure-cli.list
curl -L https://packages.microsoft.com/keys/microsoft.asc | sudo apt-key add -

sudo apt-get update
sudo apt-get install apt-transport-https azure-cli

# (Optional) Get the AWS CLI tools, more information at https://github.com/aws/aws-cli
pip install --user --upgrade awscli
```

### Building reference docker images

The next step is to produce a set of build images that can be run either locally, or remotely via k8s, using Docker to create the images.

The runner supports a standalone build mode which can be used to perform local or remote builds without needing a local developer environment configured.  The `Dockerfile_standalone` image specification file contains the container definition to do this.

The runner also supports a developer build mode which mounts code from a developer's workstation environment into a running container defined by the default Dockerfile and `Dockerfile_workstation`.

```
export SEMVER=`semver`
export GIT_BRANCH=`echo '{{.duat.gitBranch}}'|stencil - | tr '_' '-' | tr '\/' '-'`

stencil -input Dockerfile_developer | docker build -t leafai/studio-go-runner-build:$GIT_BRANCH --build-arg USER=$USER --build-arg USER_ID=`id -u $USER` --build-arg USER_GROUP_ID=`id -g $USER` -
stencil -input Dockerfile_standalone | docker build -t leafai/studio-go-runner-standalone-build:$GIT_BRANCH -
```

You will now find that you have two docker images registered locally and ready to perform full builds for you.  The first of these containers can be used for localized building during iterative development and testing.  The second image, tagged with standalone-build, can be used to run the build remotely without access to your local source code copy.
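
As a minimal sketch of how the developer image might be used, the container can be started with the workstation source tree mounted into it.  The mount path and environment shown here are assumptions for illustration only; build.sh takes care of the real invocation details:

```
# Hypothetical invocation; the mount path and entrypoint are assumptions,
# build.sh performs the equivalent steps for you
docker run -it --rm \
    -v $GOPATH:/project \
    -e GOPATH=/project \
    leafai/studio-go-runner-build:$GIT_BRANCH
```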

When build.sh is used to perform local developer builds, a container tagged as $azure_registry_name.azurecr.io/leafai/studio-go-runner/standalone-build is also produced.  This container will be pushed to the Azure and AWS docker image registries when the appropriate cloud tooling is available and configured: for Azure the $azure_registry_name environment variable must be set, and for AWS a default account must be configured and an ECR login activated.  When run, the image produced by the build will access the github source repo, then build and test the code for the branch the developer initiating the build was using.
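
Putting this together, a local developer build that also pushes the standalone image might be kicked off as follows; the registry name myregistry is illustrative:

```
# Illustrative values; substitute your own registry name
export azure_registry_name=myregistry
az acr login --name $azure_registry_name      # Azure registry credentials
$(aws ecr get-login --no-include-email)       # awscli v1 style ECR login
./build.sh
```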

### Image management

A script, build.sh, is provided within the git repo that builds the images and can then tag and push them to Azure or AWS.

The script is written to make use of environment variables to push images to cloud provider image registries.  The script will check for the presence of the aws and az command line client tools before using either of these cloud providers.
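
A minimal sketch of the kind of guard the script applies, not its exact code:

```
# build.sh only pushes to providers whose CLI tooling is installed
which az  >/dev/null 2>&1 && echo "Azure image pushes enabled"
which aws >/dev/null 2>&1 && echo "AWS image pushes enabled"
```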

The script has been used within a number of CI/CD systems and so has many commands that allow travis log folding and the like.  The actual number of commands resulting in direct effects on image registries is fairly limited.

#### Azure Images

Prior to using this feature you should authenticate to the Azure infrastructure from your workstation using the 'az login' command described at https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli?view=azure-cli-latest.  Your credentials will then be saved in your bootstrapping environment and used when pushing images.

The Azure image support checks that the $azure_registry_name environment variable and the az command line tool are present before being used by build.sh.

The azure_registry_name will be combined with the standard Azure registry host suffix to produce a prefix for images, for example $azure_registry_name.azurecr.io/leafai/studio-go-runner.

The component name will then be added to the prefix, and the semantic version added to the tag, as the image is pushed to Azure.
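
As an illustration of the resulting naming, assuming the hypothetical registry name myregistry and a semantic version of 0.9.27, the push performed by build.sh would be equivalent to:

```
# Illustrative only; build.sh derives these values for you
docker tag leafai/studio-go-runner-standalone-build:$GIT_BRANCH \
    myregistry.azurecr.io/leafai/studio-go-runner/standalone-build:0.9.27
docker push myregistry.azurecr.io/leafai/studio-go-runner/standalone-build:0.9.27
```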

#### AWS Images

The AWS images will be pushed automatically if a default AWS account is configured.  Images will be pushed to the default region for that account, and to the registry leafai/studio-go-runner.  The semantic version will also be used within the tag for the image.
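
A sketch of the equivalent manual steps for AWS follows; the account id, region, repository path, and version are assumptions for illustration, and build.sh performs the real work:

```
# Illustrative only; uses the awscli v1 style login installed earlier
$(aws ecr get-login --no-include-email)
aws ecr create-repository --repository-name leafai/studio-go-runner/standalone-build
docker push 012345678901.dkr.ecr.us-west-2.amazonaws.com/leafai/studio-go-runner/standalone-build:0.9.27
```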

## Using k8s build and test

This section describes the k8s based builds.

In order to create a k8s cluster you will need to select a cloud provider, or identify a k8s cluster running within your own infrastructure.  This document does not describe creation of a cluster; information can be found in your cloud provider's documentation, or on the k8s documentation web site, http://kubernetes.io/.

If you wish to test the Kubernetes features, create a cluster with at least one agent node that has the nvidia plugin installed; if you are using a cloud provider, use the cloud provider's GPU host type when creating nodes.  Set your KUBECONFIG environment variable to point at your cluster, then create a secret to enable access from your cluster to your AWS or Azure registry.
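
A quick sanity check that the cluster is reachable and that GPU resources are visible might look like the following; the kubeconfig path is hypothetical:

```
export KUBECONFIG=~/.kube/my-cluster.yaml   # hypothetical path to your cluster config
kubectl get nodes
kubectl describe nodes | grep nvidia.com/gpu
```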

Your registry secrets are typically obtained from the administration portal of your cloud account.  In Azure the username and password can be found by navigating to the registry and selecting the Settings -> Access Keys section.  When using AWS the docker registry will typically be authenticated at the account level, so your k8s cluster should automatically have access to the registry.

```
kubectl create secret docker-registry studioml-docker-key --docker-server=studio-repo.azurecr.io --docker-username=studio-repo --docker-password=long-hash-value --docker-email=karlmutch@gmail.com
```

The secret will be used by the build job to retrieve the build image and create the running container.
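
For reference, a job's pod specification consumes a secret of this kind through an imagePullSecrets entry; the fragment below is illustrative rather than a copy of the repo's build.yaml:

```
# Illustrative fragment only
spec:
  template:
    spec:
      imagePullSecrets:
        - name: studioml-docker-key
```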

### k8s testing builds using the k8s job resource type

The main reasons for using a k8s cluster to build the runner are to offload longer running tests into a cluster, and to obtain access to a GPU for more complete testing use cases.  When using k8s you will not be able to perform a release from within the cluster because the docker daemon is not directly accessible to you.  In these cases you would wait for the test results and do a locally controlled release using the standalone build script, build.sh.

The k8s build job can safely be run on a production cluster with GPU resources.

To bootstrap an image that can be dispatched to a k8s job the local build.sh can be used.  If the appropriate cloud environment variables are set and the build environment is successfully authenticated to the cloud, the build image will be pushed to your cloud provider.

The environment variable that must be set for this to work on Azure is azure_registry_name.

When the local build has completed, any code that needs building within the k8s cluster should be committed to the current branch.

The full build has the ability to upload releases if the shell from which the build is launched has the GITHUB_TOKEN environment variable set.  You should take careful note that the k8s build will store the token as a k8s secret within the namespace of your build.  You should pay careful attention to securing your kubernetes RBAC system to prevent the token from leaking.  One option is to rotate the token on a daily basis, or to use another tool to cycle your tokens automatically within the shell of the account launching these builds.

A full build can then be kicked off by using the build.yaml file to create a k8s job resource.

```
$ stencil -input build.yaml | kubectl create -f -
```

This will then initiate the build, which can be tracked through its k8s pod, for example:

```
$ kubectl describe jobs/build-studio-go-runner
Name:           build-studio-go-runner
Namespace:      default
Selector:       controller-uid=c1593355-b554-11e8-afa6-000d3a4d8ade
Labels:         controller-uid=c1593355-b554-11e8-afa6-000d3a4d8ade
                job-name=build-studio-go-runner
Annotations:    <none>
Parallelism:    1
Completions:    1
Start Time:     Mon, 10 Sep 2018 16:53:46 -0700
Pods Statuses:  0 Running / 1 Succeeded / 0 Failed
Pod Template:
  Labels:  controller-uid=c1593355-b554-11e8-afa6-000d3a4d8ade
           job-name=build-studio-go-runner
  Containers:
   build:
    Image:      quotaworkaround001.azurecr.io/leafai/studio-go-runner/standalone-build:feature-137-service-management
    Port:       <none>
    Host Port:  <none>
    Limits:
      cpu:             2
      memory:          10Gi
      nvidia.com/gpu:  1
    Environment Variables from:
      build-studio-go-runner-env  ConfigMap  Optional: false
    Environment:                  <none>
    Mounts:                       <none>
  Volumes:                        <none>
Events:
  Type    Reason            Age   From            Message
  ----    ------            ----  ----            -------
  Normal  SuccessfulCreate  25m   job-controller  Created pod: build-studio-go-runner-mpfpt

$ kubectl logs build-studio-go-runner-mpfpt -f
...
2018-09-10T23:57:22+0000 INF cache_xhaust_test removed "0331071c2b0ecb52b71beafc254e0055-1" from cache _: [host build-studio-go-runner-mpfpt]
2018-09-10T23:57:25+0000 DBG cache_xhaust_test cache gc signalled _: [[cache_test.go:461] host build-studio-go-runner-mpfpt]
2018-09-10T23:57:25+0000 INF cache_xhaust_test bebg9jme75mc1e60rig0-11 _: [0331071c2b0ecb52b71beafc254e0055-1 [cache_test.go:480] host build-studio-go-runner-mpfpt]
2018-09-10T23:57:26+0000 INF cache_xhaust_test TestCacheXhaust completed _: [host build-studio-go-runner-mpfpt]
--- PASS: TestCacheXhaust (24.94s)
PASS
2018-09-10T23:57:26+0000 INF cache_xhaust_test waiting for server down to complete _: [host build-studio-go-runner-mpfpt]
2018-09-10T23:57:26+0000 WRN cache_xhaust_test stopping k8sStateLogger _: [host build-studio-go-runner-mpfpt] in:
2018-09-10T23:57:26+0000 WRN cache_xhaust_test cache service stopped _: [host build-studio-go-runner-mpfpt] in:
2018-09-10T23:57:26+0000 WRN cache_xhaust_test http: Server closed [monitor.go:66] _: [host build-studio-go-runner-mpfpt] in:
2018-09-10T23:57:26+0000 INF cache_xhaust_test forcing test mode server down _: [host build-studio-go-runner-mpfpt]
ok      github.com/leaf-ai/studio-go-runner/cmd/runner     30.064s
2018-09-10T23:57:29+0000 DBG build.go built  [build.go:138]
```

Once you have seen the logs for the job you can delete it using the following command:

```
$ stencil -input build.yaml | kubectl delete -f -
configmap "build-studio-go-runner-env" deleted
job.batch "build-studio-go-runner" deleted
```

### k8s builds done the hard way

After creating the k8s secret to enable access to the image registry you can then run the build in an ad-hoc fashion using a command such as the following:

```
kubectl run --image=studio-repo.azurecr.io/leafai/studio-go-runner/standalone-build --attach --requests="nvidia.com/gpu=1" --limits="nvidia.com/gpu=1" build
```

Performing the build within a k8s cluster can take time due to the container creation and large images involved.  It will probably take several minutes, but you can check the progress from another terminal, where you will likely see something like the following:

```
$ kubectl get pods
NAME                                             READY     STATUS              RESTARTS   AGE
build-67b64d446f-tfwbg                           0/1       ContainerCreating   0          2m
studioml-go-runner-deployment-847d7d5874-5lrs7   1/1       Running             0          15h
```

Once the build starts you will be able to see output like the following:

```
kubectl run --image=quotaworkaround001.azurecr.io/leafai/studio-go-runner/standalone-build --attach --requests="nvidia.com/gpu=1" --limits="nvidia.com/gpu=1" build

If you don't see a command prompt, try pressing enter.
Branch feature/137_service_management set up to track remote branch feature/137_service_management from origin.
Switched to a new branch 'feature/137_service_management'
Warning: CUDA not supported on this platform stack="[cuda_nosupport.go:30 cuda.go:70]"
=== RUN   TestK8sConfig
--- PASS: TestK8sConfig (0.00s)
=== RUN   TestStrawMan
--- PASS: TestStrawMan (0.00s)
PASS
ok      github.com/leaf-ai/studio-go-runner/internal/runner        0.011s
```

Seeing the k8s tests complete without warning messages lets you know that they have run successfully.

The 'kubectl run' command makes use of deployment resources, so if something goes wrong you can manually manipulate the deployment using, for example, the 'kubectl delete deployment build' command.

## Configuration Map support

The runner uses both a global configuration map and a node specific configuration map within k8s to store state.  The node specific map will supersede the global map.

The global configuration map can be found by looking for the map named 'studioml-go-runner'.  This map differs from the env maps also used by the runner in that, once found, it will be watched for changes.  Currently the configuration map supports a single key, 'STATE', which is used by the runners to determine what state they should be in, or if they should terminate.

The node specific configuration map can be found using the host name, ${HOSTNAME}, as the naming convention for the maps.  Care should be taken concerning this naming if the k8s deployment is modified, as these names can easily change.

The following is an example of working with the configuration map state.  In this case the global state shared by all of the runners is being set.

```
$ cat global_config.yaml
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: studioml-go-runner
data:
  STATE: Running
$ kubectl apply -f global_config.yaml
$ kubectl get -o=yaml --export cm studioml-go-runner
apiVersion: v1
data:
  STATE: Running
kind: ConfigMap
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"v1","data":{"STATE":"Running"},"kind":"ConfigMap","metadata":{"annotations":{},"name":"studioml-go-runner","namespace":"default"}}
  creationTimestamp: null
  name: studioml-go-runner
  selfLink: /api/v1/namespaces/default/configmaps/studioml-go-runner
```
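
A node specific override can be applied in the same way using the host name convention described above; the pod host name below is hypothetical:

```
# Hypothetical pod host name used as the map name; this map supersedes the global one
kubectl create configmap studioml-go-runner-deployment-847d7d5874-5lrs7 \
    --from-literal=STATE=DrainAndSuspend
```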

Supported states include:

```
Running, DrainAndSuspend
```

Other states, such as a hard abort or a hard restart, can be effected using Kubernetes itself and are not application states.
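
For example, draining the runners ahead of maintenance can be done by patching the global map; a minimal sketch:

```
# Ask all runners to drain and suspend, then later resume them
kubectl patch configmap studioml-go-runner --type merge -p '{"data":{"STATE":"DrainAndSuspend"}}'
kubectl patch configmap studioml-go-runner --type merge -p '{"data":{"STATE":"Running"}}'
```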

### Security requirements

The runner watches configuration maps through the Kubernetes API, so the service account under which it runs needs RBAC permissions to do so.  The following binding is a broad grant that gives the default service account cluster admin rights, suitable for experimentation rather than production:

```
kubectl create clusterrolebinding default-cluster-admin --clusterrole=cluster-admin --serviceaccount=default:default
```

## Kubernetes labelling

Kubernetes supports the ability for deployments to select nodes based upon the labels of those nodes.  For example, you might wish to steer work requiring 2 GPUs to specific machines using specific queues.  To do this you can either change the deployment specification to reflect the need for multiple GPUs, or you can use a label.  Labels are very useful when you wish to partition a cluster's nodes temporarily so that builds, or other specialized work, can be hosted in specific places.

Using labels in this way is a best practice: if the special purpose nodes within your clusters carry explicit labels, then your general work pool will avoid them by default, since unlabelled deployments will not select them.

An example of labelling a single GPU host and reserving it for specific work can be seen below:

```
$ kubectl get nodes
NAME                                 STATUS    ROLES     AGE       VERSION
k8s-agentpool1-11296868-vmss000000   Ready     agent     3d        v1.10.8
k8s-agentpool2-11296868-vmss000000   Ready     agent     3d        v1.10.8
k8s-master-11296868-0                Ready     master    3d        v1.10.8
$ kubectl describe node k8s-agentpool2-11296868-vmss000000 | grep gpu
 nvidia.com/gpu:     2
 nvidia.com/gpu:     2
$ kubectl describe node k8s-agentpool1-11296868-vmss000000 | grep gpu
 nvidia.com/gpu:     1
 nvidia.com/gpu:     1
$ kubectl label node k8s-agentpool1-11296868-vmss000000 leafai.affinity=production
node "k8s-agentpool1-11296868-vmss000000" labeled
```

The studioml go runner deployment can then be given a nodeSelector that narrows the choice of node on which it is deployed:

```
 template:
   metadata:
     labels:
       app: studioml-go-runner
   spec:
     imagePullSecrets:
       - name: studioml-docker-key
     nodeSelector:
       beta.kubernetes.io/os: linux
       leafai.affinity: production
     containers:
     - name: studioml-go-runner
...
       resources:
         requests:
           nvidia.com/gpu: 1
           memory: 8G
           cpu: 2
         limits:
           nvidia.com/gpu: 1
           memory: 16G
           cpu: 2
```

Because the deployment selects nodes on the basis of either resources or labelling, the opportunity exists for the runners assigned to those nodes to make use of different queue names.  With some forethought this allows workloads to arrive on the nodes that were selected for them, avoiding your unlabelled nodes as well as nodes that are costly or dedicated to a specific purpose.

Copyright &copy; 2019-2020 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.