# studio-go-runner AWS support

This document details the installation of the studio go runner within an AWS hosted Kubernetes cluster. After completing the Kubernetes installation using these instructions please return to the main README.md file to continue.

If you are interested in using CPU deployments with attached EBS volumes the [README at examples/aws/cpu/README.md](examples/aws/cpu/README.md) will be of interest.

# Prerequisites

* Install and configure the AWS Command Line Interface (AWS CLI):
    * Install the [AWS Command Line Interface](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-install.html).
    * Configure the AWS CLI using the command: `aws configure`.
        * Enter credentials ([Access Key ID and Secret Access Key](https://docs.aws.amazon.com/general/latest/gr/aws-sec-cred-types.html#access-keys-and-secret-access-keys)).
        * Enter the Region and other options.
* Install [eksctl](https://github.com/weaveworks/eksctl).
* Load the AWS SQS Credentials
* Deploy the runner

## Install eksctl (AWS only)

If you are using Azure or GCP then options such as acs-engine and skaffold are natively supported by those cloud vendors. These tools are readily customizable and actively maintained, so they are recommended.

For AWS, eksctl is now considered the official CLI tool for EKS. A full set of instructions for installing eksctl can be found at https://docs.aws.amazon.com/eks/latest/userguide/getting-started-eksctl.html. In brief, eksctl can be installed using the following steps:

```shell
pip install awscli --upgrade --user
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin
```

One requirement of using eksctl is that you must first subscribe to the AMI that will be used with your GPU EC2 instances. The subscription can be found at https://aws.amazon.com/marketplace/pp/B07GRHFXGM.
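Before provisioning anything it can be useful to confirm that the tooling is installed and that your credentials resolve to the account you expect. The commands below are only a minimal sanity check and assume nothing beyond the AWS CLI, eksctl, and kubectl installed above being on your PATH:

```shell
# Report the CLI versions that will be used throughout this guide
aws --version
eksctl version
kubectl version --client

# Confirm the configured credentials map to the intended AWS account and identity
aws sts get-caller-identity
```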
## AWS Cloud support for Kubernetes 1.14.x and GPU

This section discusses the use of eksctl to provision a working k8s cluster onto which the GPU runner can be deployed.

The use of AWS EC2 machines requires that the AWS account has an EC2 key pair imported from your administration machine, or newly created, so that machines provisioned using eksctl can be accessed. More information can be found at https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/ec2-key-pairs.html.

In order to make use of StudioML environment variable based templates you should export the AWS environment variables. While doing this you should also synchronize your system clock, as clock drift is a common source of authentication issues with AWS.

<pre><code><b>export AWS_ACCESS_KEY=xxx
export AWS_SECRET_ACCESS_KEY=xxx
export AWS_DEFAULT_REGION=xxx
sudo ntpdate ntp.ubuntu.com
</b></code></pre>

<pre><code><b>
export AWS_CLUSTER_NAME=test-eks
eksctl create cluster --name $AWS_CLUSTER_NAME --region us-west-2 --nodegroup-name $AWS_CLUSTER_NAME --node-type p2.xlarge --nodes 1 --nodes-min 1 --nodes-max 3 --ssh-access --ssh-public-key ~/.ssh/id_rsa.pub --managed</b>
[ℹ]  eksctl version 0.16.0
[ℹ]  using region us-west-2
[ℹ]  setting availability zones to [us-west-2a us-west-2c us-west-2b]
[ℹ]  subnets for us-west-2a - public:192.168.0.0/19 private:192.168.96.0/19
[ℹ]  subnets for us-west-2c - public:192.168.32.0/19 private:192.168.128.0/19
[ℹ]  subnets for us-west-2b - public:192.168.64.0/19 private:192.168.160.0/19
[ℹ]  using SSH public key "/home/kmutch/.ssh/id_rsa.pub" as "eksctl-test-eks-nodegroup-kmutch-workers-be:07:a0:27:44:d8:27:04:c2:ba:28:fa:8c:47:7f:09"
[ℹ]  using Kubernetes version 1.14
[ℹ]  creating EKS cluster "test-eks" in "us-west-2" region with managed nodes
[ℹ]  will create 2 separate CloudFormation stacks for cluster itself and the initial managed nodegroup
[ℹ]  if you encounter any issues, check CloudFormation console or try 'eksctl utils describe-stacks --region=us-west-2 --cluster=test-eks'
[ℹ]  CloudWatch logging will not be enabled for cluster "test-eks" in "us-west-2"
[ℹ]  you can enable it with 'eksctl utils update-cluster-logging --region=us-west-2 --cluster=test-eks'
[ℹ]  Kubernetes API endpoint access will use default of {publicAccess=true, privateAccess=false} for cluster "test-eks" in "us-west-2"
[ℹ]  2 sequential tasks: { create cluster control plane "test-eks", create managed nodegroup "kmutch-workers" }
[ℹ]  building cluster stack "eksctl-test-eks-cluster"
[ℹ]  deploying stack "eksctl-test-eks-cluster"
[ℹ]  building managed nodegroup stack "eksctl-test-eks-nodegroup-kmutch-workers"
[ℹ]  deploying stack "eksctl-test-eks-nodegroup-kmutch-workers"
[✔]  all EKS cluster resources for "test-eks" have been created
[✔]  saved kubeconfig as "/home/kmutch/.kube/microk8s.config"
[ℹ]  nodegroup "kmutch-workers" has 1 node(s)
[ℹ]  node "ip-192-168-5-16.us-west-2.compute.internal" is ready
[ℹ]  waiting for at least 1 node(s) to become ready in "kmutch-workers"
[ℹ]  nodegroup "kmutch-workers" has 1 node(s)
[ℹ]  node "ip-192-168-5-16.us-west-2.compute.internal" is ready
[ℹ]  kubectl command should work with "/home/kmutch/.kube/microk8s.config", try 'kubectl --kubeconfig=/home/kmutch/.kube/microk8s.config get nodes'
[✔]  EKS cluster "test-eks" in "us-west-2" region is ready

</code></pre>

When creating a cluster the credentials will be loaded into your ~/.kube/config file automatically. When using the AWS service oriented method of deployment the normally visible master will not be displayed as a node.

eksctl is written in Go, uses CloudFormation internally, and supports YAML resources for defining deployments; more information can be found at https://eksctl.io/.
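As a sketch of that YAML based approach, the cluster created above can also be described in a config file and handed to eksctl with its --config-file option. The file below simply mirrors the command line flags used earlier and follows the ClusterConfig schema documented at https://eksctl.io/; the names, region, instance type, and key path are illustrative and should be adjusted to match your own account before use.

```shell
cat <<EOF > test-eks.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: test-eks
  region: us-west-2

managedNodeGroups:
  - name: test-eks
    instanceType: p2.xlarge
    desiredCapacity: 1
    minSize: 1
    maxSize: 3
    ssh:
      allow: true
      publicKeyPath: ~/.ssh/id_rsa.pub
EOF

eksctl create cluster --config-file test-eks.yaml
```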
## GPU Setup

In order to activate GPU support within the workers a DaemonSet needs to be created that will mediate between the Kubernetes device plugin framework and the GPU resources available to pods, as shown in the following command.

<pre><code><b>
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml</b>
daemonset.apps/nvidia-device-plugin-daemonset created
</code></pre>

Machines, when first started, will have an allocatable resource named nvidia.com/gpu. When this resource flips from 0 to 1 the machine has become available for GPU work. The plugin YAML added above will cause a container to be bootstrapped onto new nodes to perform the installation of the drivers and related software.

<pre><code><b>
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"</b>
NAME                                         GPU
ip-192-168-5-16.us-west-2.compute.internal   1
</code></pre>
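If you want to block until the drivers have finished installing and the GPU resource has been advertised, a small polling loop such as the sketch below can be used. It simply re-runs the query shown above until at least one node reports a non-zero nvidia.com/gpu count; the 15 second interval is an arbitrary choice.

```shell
# Poll the node list until at least one node advertises an allocatable nvidia.com/gpu resource
until kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu" \
      | awk 'NR > 1 && $2 ~ /^[1-9]/ {found = 1} END {exit !found}'; do
    echo "waiting for nvidia.com/gpu to become allocatable"
    sleep 15
done
```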
## GPU Testing

A test pod for validating the GPU functionality can be created using the following commands:

```
$ cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: tf-gpu
spec:
  containers:
  - name: gpu
    image: 763104351884.dkr.ecr.us-west-2.amazonaws.com/tensorflow-training:1.15.2-gpu-py36-cu100-ubuntu18.04
    imagePullPolicy: IfNotPresent
    command: ["/bin/sh", "-c"]
    args: ["sleep 10000"]
    resources:
      limits:
        memory: 1024Mi
        # ^ Set memory in case default limits are set low
        nvidia.com/gpu: 1 # requesting 1 GPU
        # ^ For Legacy Accelerators mode this key must be renamed
        #   'alpha.kubernetes.io/nvidia-gpu'
  tolerations:
  # This toleration will allow the gpu hook to run anywhere
  # By default this is permissive in case you have tainted your GPU nodes.
  - operator: "Exists"
EOF
```

Once the pod is in a running state you should be able to test access to the GPU cards using the following commands:

<pre><code><b>
kubectl get pods</b>
NAME     READY   STATUS    RESTARTS   AGE
tf-gpu   1/1     Running   0          2m31s
<b>kubectl exec -it tf-gpu -- \
  python -c 'from tensorflow.python.client import device_lib; print(device_lib.list_local_devices())'</b>
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow_core/__init__.py:1467: The name tf.estimator.inputs is deprecated. Please use tf.compat.v1.estimator.inputs instead.

2020-04-02 19:53:04.846974: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2300070000 Hz
2020-04-02 19:53:04.847631: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x47a9050 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-04-02 19:53:04.847672: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-04-02 19:53:04.851171: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-04-02 19:53:05.074667: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.075725: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x4870840 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-04-02 19:53:05.075757: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla K80, Compute Capability 3.7
2020-04-02 19:53:05.077045: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.077866: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties:
name: Tesla K80 major: 3 minor: 7 memoryClockRate(GHz): 0.8235
pciBusID: 0000:00:1e.0
2020-04-02 19:53:05.078377: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-02 19:53:05.080249: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2020-04-02 19:53:05.081941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10.0
2020-04-02 19:53:05.082422: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10.0
2020-04-02 19:53:05.084606: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10.0
2020-04-02 19:53:05.086207: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10.0
2020-04-02 19:53:05.090706: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-04-02 19:53:05.090908: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.091833: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.092591: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1767] Adding visible gpu devices: 0
2020-04-02 19:53:05.092655: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.0
2020-04-02 19:53:05.094180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-04-02 19:53:05.094214: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      0
2020-04-02 19:53:05.094237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1199] 0:   N
2020-04-02 19:53:05.094439: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.095349: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:983] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2020-04-02 19:53:05.096185: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1325] Created TensorFlow device (/device:GPU:0 with 10805 MB memory) -> physical GPU (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7)
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15851552145019400091
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 589949818737926036
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 1337920997684791636
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 11330115994
locality {
  bus_id: 1
  links {
  }
}
incarnation: 6377093002559650203
physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7"
]
<b>kubectl exec -it tf-gpu -- nvidia-smi</b>
Thu Apr  2 19:58:15 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P8    27W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

<b>kubectl delete pod tf-gpu</b>
pod "tf-gpu" deleted
</code></pre>

It is also possible to use the stock nvidia docker images to perform tests, for example:

```
$ cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi
spec:
  restartPolicy: OnFailure
  containers:
  - name: nvidia-smi
    image: nvidia/cuda:latest
    args:
    - "nvidia-smi"
    resources:
      limits:
        nvidia.com/gpu: 1
EOF
pod/nvidia-smi created
$ kubectl logs nvidia-smi
Thu Apr  2 20:03:44 2020
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.87.00    Driver Version: 418.87.00    CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00000000:00:1E.0 Off |                    0 |
| N/A   44C    P8    27W / 149W |      0MiB / 11441MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
$ kubectl delete pod nvidia-smi
pod "nvidia-smi" deleted
```
"/device:XLA_CPU:0" 178 device_type: "XLA_CPU" 179 memory_limit: 17179869184 180 locality { 181 } 182 incarnation: 589949818737926036 183 physical_device_desc: "device: XLA_CPU device" 184 , name: "/device:XLA_GPU:0" 185 device_type: "XLA_GPU" 186 memory_limit: 17179869184 187 locality { 188 } 189 incarnation: 1337920997684791636 190 physical_device_desc: "device: XLA_GPU device" 191 , name: "/device:GPU:0" 192 device_type: "GPU" 193 memory_limit: 11330115994 194 locality { 195 bus_id: 1 196 links { 197 } 198 } 199 incarnation: 6377093002559650203 200 physical_device_desc: "device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0, compute capability: 3.7" 201 ] 202 <b>kubectl exec -it tf-gpu nvidia-smi</b> 203 Thu Apr 2 19:58:15 2020 204 +-----------------------------------------------------------------------------+ 205 | NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 | 206 |-------------------------------+----------------------+----------------------+ 207 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 208 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. | 209 |===============================+======================+======================| 210 | 0 Tesla K80 On | 00000000:00:1E.0 Off | 0 | 211 | N/A 44C P8 27W / 149W | 0MiB / 11441MiB | 0% Default | 212 +-------------------------------+----------------------+----------------------+ 213 214 +-----------------------------------------------------------------------------+ 215 | Processes: GPU Memory | 216 | GPU PID Type Process name Usage | 217 |=============================================================================| 218 | No running processes found | 219 +-----------------------------------------------------------------------------+ 220 221 <b>kubectl delete pod tf-gpu</b> 222 pod "tf-gpu" deleted 223 </code></pre> 224 225 It is also possible to use the stock nvidia docker images to perform tests as well, for example: 226 227 ``` 228 $ cat << EOF | kubectl create -f - 229 apiVersion: v1 230 kind: Pod 231 metadata: 232 name: nvidia-smi 233 spec: 234 restartPolicy: OnFailure 235 containers: 236 - name: nvidia-smi 237 image: nvidia/cuda:latest 238 args: 239 - "nvidia-smi" 240 resources: 241 limits: 242 nvidia.com/gpu: 1 243 EOF 244 pod/nvidia-smi created 245 $ kubectl logs nvidia-smi 246 Thu Apr 2 20:03:44 2020 247 +-----------------------------------------------------------------------------+ 248 | NVIDIA-SMI 418.87.00 Driver Version: 418.87.00 CUDA Version: 10.1 | 249 |-------------------------------+----------------------+----------------------+ 250 | GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC | 251 | Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. 
## Deployment of the runner

Having deployed the needed secrets for the SQS queue, the runner can now be deployed. A template for deployment can be found at examples/aws/deployment.yaml.

Copy the example and examine the file for the studioml-go-runner-ecr-cred CronJob resource. In this resource you will need to change the [AWS Account ID], [AWS_ACCESS_KEY_ID], and [AWS_SECRET_ACCESS_KEY] strings to the appropriate values and then run `kubectl apply -f [the file]`. You will also want to modify the replicas parameter in the studioml-go-runner-deployment Deployment resource.

Be aware that any person or entity with access to the Kubernetes secrets store can extract these secrets unless extra measures are taken to encrypt the secrets before injecting them into the cluster.
For more information on how to use secrets exposed through the file system of a running k8s container please refer to https://kubernetes.io/docs/concepts/configuration/secret/#using-secrets-as-files-from-a-pod.
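Once your edited copy of the deployment file has been applied, the commands below can be used to watch the rollout and inspect the runner logs. They assume the resource names from the example file were kept, in particular the studioml-go-runner-deployment Deployment, and that everything was deployed into the current namespace; add a --namespace flag if that is not the case.

```shell
# Apply your edited copy of the example deployment
kubectl apply -f deployment.yaml

# Wait for the runner pods to become ready
kubectl rollout status deployment/studioml-go-runner-deployment

# Tail the runner logs to confirm that queue polling has started
kubectl logs -f deployment/studioml-go-runner-deployment
```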
## Manually accessing cluster master APIs

In order to retrieve the Kubernetes API Bearer token you can use the following command:

```
kops get secrets --type secret admin -oplaintext
```

Access for the administrative API can be exposed using one of the two following commands:

```
kops get secrets kube -oplaintext
kubectl config view --minify
```

More information concerning kubelet security can be found at https://github.com/kubernetes/kops/blob/master/docs/security.md#kubelet-api.

If you wish to pass the ability to manage your cluster to another person, or wish to move to running the dashboard from a browser on another machine, you can use the kops export command to pass a kubectl configuration file around. Take care, however, as this will greatly increase the risk of a security incident if not done correctly. The configuration for accessing your cluster will be stored in your $KUBECONFIG file, defaulting to $HOME/.kube/config if not defined in your environment.

If you wish to delete the cluster you can use the following command:

```
$ kops delete cluster $AWS_CLUSTER_NAME --yes
```

Note that the kops commands in this section apply to kops managed clusters; a cluster created with eksctl, as shown earlier in this document, is deleted with `eksctl delete cluster --name $AWS_CLUSTER_NAME`.

Copyright © 2019-2020 Cognizant Digital Business, Evolutionary AI. All rights reserved. Issued under the Apache 2.0 license.