# studio-go-runner AWS support

This document details the installation of the studio go runner within an AWS hosted Kubernetes cluster. After completing the Kubernetes installation using these instructions please return to the main README.md file to continue.

If you are interested in using CPU deployments with attached EBS volumes the [README at examples/aws/cpu/README.md](examples/aws/cpu/README.md) will be of interest.

# Prerequisites

* Install and configure the AWS Command Line Interface (AWS CLI):
    * Install the [AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).
    * Configure the AWS CLI using the command: `aws configure`.
        * Enter credentials ([Access Key ID and Secret Access Key](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys)).
        * Enter the Region and other options.
* Install [eksctl](https://github.com/weaveworks/eksctl).
* Load the AWS SQS Credentials
* Deploy the runner

## Install eksctl (AWS only)

If you are using Azure or GCP then options such as acs-engine and skaffold are natively supported by those cloud vendors. These tools are readily customizable and actively maintained, so they are recommended.

For AWS, eksctl is now considered the official CLI tool for EKS. A full set of instructions for installing eksctl can be found at https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html. In brief, eksctl can be installed using the following steps:

```shell
pip install awscli --upgrade --user
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
```

One requirement of using eksctl is that you must first subscribe to the AMI that will be used with your GPU EC2 instances. The subscription can be found at https://aws.amazon.com/marketplace/pp/B07GRHFXGM.
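Before provisioning anything it can be useful to confirm that the tooling is installed and that your credentials resolve to the account you expect. The commands below are only a minimal sanity check and assume nothing beyond the AWS CLI, eksctl, and kubectl installed above being on your PATH:

```shell
# Report the CLI versions that will be used throughout this guide
aws --version
eksctl version
kubectl version --client

# Confirm the configured credentials map to the intended AWS account and identity
aws sts get-caller-identity
```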
## AWS Cloud support for Kubernetes 1.14.x and GPU

This section discusses the use of eksctl to provision a working k8s cluster onto which the GPU runner can be deployed.

The use of AWS EC2 machines requires that the AWS account has an EC2 key pair imported from your administration machine, or newly created, so that machines provisioned using eksctl can be accessed. More information can be found at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html.

In order to make use of StudioML environment variable based templates you should export the AWS environment variables. While doing this you should also synchronize your system clock, as clock drift is a common source of authentication issues with AWS.

<pre><code><b>export AWS_ACCESS_KEY=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_DEFAULT_REGION=xxx
sudo ntpdate ntp.ubuntu.com
</b></code></pre>

<pre><code><b>
export AWS_CLUSTER_NAME=test-eks
eksctl create cluster --name $AWS_CLUSTER_NAME --region us-west-2 --nodegroup-name $AWS_CLUSTER_NAME --node-type p2.xlarge --nodes 1 --nodes-min 1 --nodes-max 3 --ssh-access --ssh-public-key ~/.ssh/id_rsa.pub --managed</b>
[ℹ]  eksctl version 0.16.0
[ℹ]  using region us-west-2
[ℹ]  setting availability zones to [us-west-2a us-west-2c us-west-2b]
[ℹ]  subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
[ℹ]  subnets for us-west-2c - public:192.168.32.0/19 private:192.168.128.0/19
[ℹ]  subnets for us-west-2b - public:192.168.64.0/19 private:192.168.160.0/19
[ℹ]  using SSH public key "/home/kmutch/.ssh/id_rsa.pub" as "eksctl-test-eks-nodegroup-kmutch-workers-be:07:a0:27:44:d8:27:04:c2:ba:28:fa:8c:47:7f:09"
[ℹ]  using Kubernetes version 1.14
[ℹ]  creating EKS cluster "test-eks" in "us-west-2" region with managed nodes
[ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial managed nodegroup
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=test-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "test-eks" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=test-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "test-eks" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "test-eks", create managed nodegroup "kmutch-workers" }
[ℹ]  building cluster stack "eksctl-test-eks-cluster"
[ℹ]  deploying stack "eksctl-test-eks-cluster"
[ℹ]  building managed nodegroup stack "eksctl-test-eks-nodegroup-kmutch-workers"
[ℹ]  deploying stack "eksctl-test-eks-nodegroup-kmutch-workers"
[✔]  all EKS cluster resources for "test-eks" have been created
[✔]  saved kubeconfig as "/home/kmutch/.kube/microk8s.config"
[ℹ]  nodegroup "kmutch-workers" has 1 node(s)
[ℹ]  node "ip-192-168-5-16.us-west-2.compute.internal" is ready
[ℹ]  waiting for at least 1 node(s) to become ready in "kmutch-workers"
[ℹ]  nodegroup "kmutch-workers" has 1 node(s)
[ℹ]  node "ip-192-168-5-16.us-west-2.compute.internal" is ready
[ℹ]  kubectl command should work with "/home/kmutch/.kube/microk8s.config", try 'kubectl --kubeconfig=/home/kmutch/.kube/microk8s.config get nodes'
[✔]  EKS cluster "test-eks" in "us-west-2" region is ready

</code></pre>

When creating a cluster the credentials will be loaded into your ~/.kube/config file automatically. When using the AWS service oriented method of deployment the normally visible master will not be displayed as a node.

eksctl is written in Go, uses CloudFormation internally, and supports YAML resources for defining deployments; more information can be found at https://eksctl.io/.
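As a sketch of that YAML based approach, the cluster created above can also be described in a config file and handed to eksctl with its --config-file option. The file below simply mirrors the command line flags used earlier and follows the ClusterConfig schema documented at https://eksctl.io/; the names, region, instance type, and key path are illustrative and should be adjusted to match your own account before use.

```shell
cat <<EOF > test-eks.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-eks
  region: us-west-2

managedNodeGroups:
  - name: test-eks
    instanceType: p2.xlarge
    desiredCapacity: 1
    minSize: 1
    maxSize: 3
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub
EOF

eksctl create cluster --config-file test-eks.yaml
```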
## GPU Setup

In order to activate GPU support within the workers a DaemonSet needs to be created that will mediate between the Kubernetes device plugin framework and the GPU resources available to pods, as shown in the following command.

<pre><code><b>
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml</b>
daemonset.apps/nvidia-device-plugin-daemonset created
</code></pre>

Machines, when first started, will have an allocatable resource named nvidia.com/gpu. When this resource flips from 0 to 1 the machine has become available for GPU work. The plugin YAML added above will cause a container to be bootstrapped onto new nodes to perform the installation of the drivers and related software.

<pre><code><b>
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"</b>
NAME                                         GPU
ip-192-168-5-16.us-west-2.compute.internal   1
</code></pre>
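If you want to block until the drivers have finished installing and the GPU resource has been advertised, a small polling loop such as the sketch below can be used. It simply re-runs the query shown above until at least one node reports a non-zero nvidia.com/gpu count; the 15 second interval is an arbitrary choice.

```shell
# Poll the node list until at least one node advertises an allocatable nvidia.com/gpu resource
until kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" \
      | awk 'NR > 1 && $2 ~ /^[1-9]/ {found = 1} END {exit !found}'; do
    echo "waiting for nvidia.com/gpu to become allocatable"
    sleep 15
done
```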
## GPU Testing

A test pod for validating the GPU functionality can be created using the following commands:

```
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: gpu
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh", "-c"]
    args: ["sleep 10000"]
    resources:
      limits:
        memory: 1024Mi
        # ^ Set memory in case default limits are set low
        nvidia.com/gpu: 1 # requesting 1 GPU
        # ^ For Legacy Accelerators mode this key must be renamed
        #   'alpha.kubernetes.io/nvidia-gpu'
  tolerations:
  # This toleration will allow the gpu hook to run anywhere
  # By default this is permissive in case you have tainted your GPU nodes.
  - operator: "Exists"
EOF
```

Once the pod is in a running state you should be able to test access to the GPU cards using the following commands:

<pre><code><b>
kubectl get pods</b>
NAME     READY   STATUS    RESTARTS   AGE
tf-gpu   1/1     Running   0          2m31s
<b>kubectl exec -it tf-gpu -- \
  python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'</b>
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/__init__.py:1467: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.

2020-04-02 19:53:04.846974: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300070000 Hz
2020-04-02 19:53:04.847631: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47a9050 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-02 19:53:04.847672: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-02 19:53:04.851171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-02 19:53:05.074667: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.075725: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4870840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-02 19:53:05.075757: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-04-02 19:53:05.077045: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.077866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-04-02 19:53:05.078377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-02 19:53:05.080249: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-02 19:53:05.081941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-02 19:53:05.082422: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-02 19:53:05.084606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-02 19:53:05.086207: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-02 19:53:05.090706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-02 19:53:05.090908: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.091833: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.092591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-04-02 19:53:05.092655: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-02 19:53:05.094180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-02 19:53:05.094214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2020-04-02 19:53:05.094237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2020-04-02 19:53:05.094439: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.095349: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.096185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15851552145019400091
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 589949818737926036
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1337920997684791636
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11330115994
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6377093002559650203
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7"
]
<b>kubectl exec -it tf-gpu -- nvidia-smi</b>
Thu Apr  2 19:58:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

<b>kubectl delete pod tf-gpu</b>
pod "tf-gpu" deleted
</code></pre>

It is also possible to use the stock nvidia docker images to perform tests, for example:

```
$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:latest
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/nvidia-smi created
$ kubectl logs nvidia-smi
Thu Apr  2 20:03:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P8    27W / 149W |      0MiB / 11441MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ kubectl delete pod nvidia-smi
pod "nvidia-smi" deleted
```
"/device:XLA_CPU:0" 178 device_type: "XLA_CPU" 179 memory_limit: 17179869184 180 locality { 181 } 182 incarnation: 589949818737926036 183 physical_device_desc: "device: XLA_CPU device" 184 , name: "/device:XLA_GPU:0" 185 device_type: "XLA_GPU" 186 memory_limit: 17179869184 187 locality { 188 } 189 incarnation: 1337920997684791636 190 physical_device_desc: "device: XLA_GPU device" 191 , name: "/device:GPU:0" 192 device_type: "GPU" 193 memory_limit: 11330115994 194 locality { 195 bus_id: 1 196 links { 197 } 198 } 199 incarnation: 6377093002559650203 200 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7" 201 ] 202 <b>kubectl exec -it tf-gpu nvidia-smi</b> 203 Thu Apr 2 19:58:15 2020 204 +-----------------------------------------------------------------------------+ 205 | NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 | 206 |-------------------------------+----------------------+----------------------+ 207 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 208 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 209 |===============================+======================+======================| 210 | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 | 211 | N/A 44C P8 27W / 149W | 0MiB / 11441MiB | 0% Default | 212 +-------------------------------+----------------------+----------------------+ 213 214 +-----------------------------------------------------------------------------+ 215 | Processes: GPU Memory | 216 | GPU PID Type Process name Usage | 217 |=============================================================================| 218 | No running processes found | 219 +-----------------------------------------------------------------------------+ 220 221 <b>kubectl delete pod tf-gpu</b> 222 pod "tf-gpu" deleted 223 </code></pre> 224 225 It is also possible to use the stock nvidia docker images to perform tests as well, for example: 226 227 ``` 228 $ cat << EOF | kubectl create -f - 229 apiVersion: v1 230 kind: Pod 231 metadata: 232 name: nvidia-smi 233 spec: 234 restartPolicy: OnFailure 235 containers: 236 - name: nvidia-smi 237 image: nvidia/cuda:latest 238 args: 239 - "nvidia-smi" 240 resources: 241 limits: 242 nvidia.com/gpu: 1 243 EOF 244 pod/nvidia-smi created 245 $ kubectl logs nvidia-smi 246 Thu Apr 2 20:03:44 2020 247 +-----------------------------------------------------------------------------+ 248 | NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 | 249 |-------------------------------+----------------------+----------------------+ 250 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 251 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. 
## Deployment of the runner

Having deployed the needed secrets for the SQS queue, the runner can now be deployed. A template for deployment can be found at examples/aws/deployment.yaml.

Copy the example and examine the file for the studioml-go-runner-ecr-cred CronJob resource. In this resource you will need to change the [AWS Account ID], [AWS_ACCESS_KEY_ID], and [AWS_SECRET_ACCESS_KEY] strings to the appropriate values and then run `kubectl apply -f [the file]`. You will also want to modify the replicas parameter in the studioml-go-runner-deployment Deployment resource.

Be aware that any person or entity with access to the Kubernetes secrets store can extract these secrets unless extra measures are taken to encrypt the secrets before injecting them into the cluster.
For more information on how to use secrets exposed through the file system of a running k8s container please refer to https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod.
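Once your edited copy of the deployment file has been applied, the commands below can be used to watch the rollout and inspect the runner logs. They assume the resource names from the example file were kept, in particular the studioml-go-runner-deployment Deployment, and that everything was deployed into the current namespace; add a --namespace flag if that is not the case.

```shell
# Apply your edited copy of the example deployment
kubectl apply -f deployment.yaml

# Wait for the runner pods to become ready
kubectl rollout status deployment/studioml-go-runner-deployment

# Tail the runner logs to confirm that queue polling has started
kubectl logs -f deployment/studioml-go-runner-deployment
```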
## Manually accessing cluster master APIs

In order to retrieve the Kubernetes API Bearer token you can use the following command:

```
kops get secrets --type secret admin -oplaintext
```

Access for the administrative API can be exposed using one of the two following commands:

```
kops get secrets kube -oplaintext
kubectl config view --minify
```

More information concerning kubelet security can be found at https://github.com/kubernetes/kops/blob/master/docs/security.md#kubelet-api.

If you wish to pass the ability to manage your cluster to another person, or wish to move to running the dashboard from a browser on another machine, you can use the kops export command to pass a kubectl configuration file around. Take care, however, as this will greatly increase the risk of a security incident if not done correctly. The configuration for accessing your cluster will be stored in your $KUBECONFIG file, defaulting to $HOME/.kube/config if not defined in your environment.

If you wish to delete the cluster you can use the following command:

```
$ kops delete cluster $AWS_CLUSTER_NAME --yes
```

Note that the kops commands in this section apply to kops managed clusters; a cluster created with eksctl, as shown earlier in this document, is deleted with `eksctl delete cluster --name $AWS_CLUSTER_NAME`.

Copyright © 2019-2020 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.