# GPU-enabled clusters

## Overview

With CAPZ you can create GPU-enabled Kubernetes clusters on Microsoft Azure.

Before you begin, be aware that:

- [Scheduling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is a Kubernetes beta feature
- [NVIDIA GPUs](https://learn.microsoft.com/azure/virtual-machines/sizes-gpu) are supported on Azure NC-series, NV-series, and NVv3-series VMs
- [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster.

To deploy a cluster with support for GPU nodes, use the [nvidia-gpu flavor](https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-azure/main/templates/cluster-template-nvidia-gpu.yaml).

## An example GPU cluster

Let's create a CAPZ cluster with an N-series node and run a GPU-powered vector calculation.

### Generate an nvidia-gpu cluster template

Use the `clusterctl generate cluster` command to generate a manifest that defines your GPU-enabled
workload cluster.

Remember to use the `nvidia-gpu` flavor with N-series nodes.

```bash
AZURE_CONTROL_PLANE_MACHINE_TYPE=Standard_B2s \
AZURE_NODE_MACHINE_TYPE=Standard_NC6s_v3 \
AZURE_LOCATION=southcentralus \
clusterctl generate cluster azure-gpu \
  --kubernetes-version=v1.22.1 \
  --worker-machine-count=1 \
  --flavor=nvidia-gpu > azure-gpu-cluster.yaml
```

### Create the cluster

Apply the manifest from the previous step to your management cluster to have CAPZ create a
workload cluster:

```bash
$ kubectl apply -f azure-gpu-cluster.yaml
cluster.cluster.x-k8s.io/azure-gpu serverside-applied
azurecluster.infrastructure.cluster.x-k8s.io/azure-gpu serverside-applied
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
machinedeployment.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
```

Wait until the cluster and nodes are finished provisioning...

```bash
$ kubectl get cluster azure-gpu
NAME        PHASE
azure-gpu   Provisioned
$ kubectl get machines
NAME                             PROVIDERID                                                                                                                                     PHASE     VERSION
azure-gpu-control-plane-t94nm    azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-control-plane-nnb57   Running   v1.22.1
azure-gpu-md-0-f6b88dd78-vmkph   azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-md-0-gcc8v            Running   v1.22.1
```

... and then you can install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.

Once all nodes are `Ready`, install the official NVIDIA gpu-operator via Helm.

### Install nvidia gpu-operator Helm chart

If you don't have `helm`, installation instructions for your environment can be found [here](https://helm.sh).

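As a quick sanity check, you can confirm that a Helm client is available before installing the chart:

```bash
# Print the Helm client version; a recent v3 release is assumed for the gpu-operator chart.
helm version --short
```
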
First, grab the kubeconfig from your newly created cluster and save it to a file:

```bash
$ clusterctl get kubeconfig azure-gpu > ./azure-gpu-cluster.conf
```

Now we can use Helm to install the official chart:

```bash
$ helm install --kubeconfig ./azure-gpu-cluster.conf --repo https://helm.ngc.nvidia.com/nvidia gpu-operator --generate-name
```

The installation of GPU drivers via gpu-operator will take several minutes. Coffee or tea may be appropriate at this time.

After a few minutes, run the following command against the workload cluster to check that all the `gpu-operator` resources are installed:

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get pods -A -o wide | grep 'gpu\|nvidia'
NAMESPACE   NAME                                                               READY   STATUS      RESTARTS   AGE     IP               NODE                            NOMINATED NODE   READINESS GATES
default     gpu-feature-discovery-r6zgh                                        1/1     Running     0          7m21s   192.168.132.75   azure-gpu-md-0-gcc8v            <none>           <none>
default     gpu-operator-1674686292-node-feature-discovery-master-79d8pbcg6   1/1     Running     0          8m15s   192.168.96.7     azure-gpu-control-plane-nnb57   <none>           <none>
default     gpu-operator-1674686292-node-feature-discovery-worker-g9dj2       1/1     Running     0          8m15s   192.168.132.66   azure-gpu-md-0-gcc8v            <none>           <none>
default     gpu-operator-95b545d6f-rmlf2                                       1/1     Running     0          8m15s   192.168.132.67   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-container-toolkit-daemonset-hstgw                           1/1     Running     0          7m21s   192.168.132.70   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-cuda-validator-pdmkl                                        0/1     Completed   0          3m47s   192.168.132.74   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-dcgm-exporter-wjm7p                                         1/1     Running     0          7m21s   192.168.132.71   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-device-plugin-daemonset-csv6k                               1/1     Running     0          7m21s   192.168.132.73   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-device-plugin-validator-gxzt2                               0/1     Completed   0          2m49s   192.168.132.76   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-driver-daemonset-zww52                                      1/1     Running     0          7m46s   192.168.132.68   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-operator-validator-kjr6m                                    1/1     Running     0          7m21s   192.168.132.72   azure-gpu-md-0-gcc8v            <none>           <none>
```

You should see all pods in either a `Running` or `Completed` state. If so, the driver installation and GPU node configuration were successful.

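If you would rather block until validation completes than poll the pod list, one option is `kubectl wait`. This is a minimal sketch; the `app=nvidia-operator-validator` label selector is an assumption based on the chart's default pod labels:

```bash
# Wait up to 10 minutes for the operator validator pod to become Ready.
# NOTE: the label selector below assumes the gpu-operator chart's default pod labels.
kubectl --kubeconfig ./azure-gpu-cluster.conf wait pod \
  -l app=nvidia-operator-validator \
  --for=condition=Ready --timeout=10m
```
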
Then run the following commands against the workload cluster to verify that the
[NVIDIA device plugin](https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin.yml)
has initialized and the `nvidia.com/gpu` resource is available:

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get nodes
NAME                            STATUS   ROLES    AGE   VERSION
azure-gpu-control-plane-nnb57   Ready    master   42m   v1.22.1
azure-gpu-md-0-gcc8v            Ready    <none>   38m   v1.22.1
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
{
  "attachable-volumes-azure-disk": "12",
  "cpu": "6",
  "ephemeral-storage": "119716326407",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "115312060Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}
```

### Run a test app

Let's create a pod manifest for the `cuda-vector-add` example from the Kubernetes documentation and
deploy it:

```shell
$ cat > cuda-vector-add.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
EOF
$ kubectl --kubeconfig ./azure-gpu-cluster.conf apply -f cuda-vector-add.yaml
```

The container will download, run, and perform a [CUDA](https://developer.nvidia.com/cuda-zone)
calculation with the GPU.

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get po cuda-vector-add
cuda-vector-add   0/1   Completed   0   91s
$ kubectl --kubeconfig ./azure-gpu-cluster.conf logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

If you see output like the above, your GPU cluster is working!

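Once you are done experimenting, you can remove the test pod from the workload cluster:

```bash
# Clean up the completed cuda-vector-add test pod.
kubectl --kubeconfig ./azure-gpu-cluster.conf delete pod cuda-vector-add
```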