
# GPU-enabled clusters

## Overview

With CAPZ you can create GPU-enabled Kubernetes clusters on Microsoft Azure.

Before you begin, be aware that:

- [Scheduling GPUs](https://kubernetes.io/docs/tasks/manage-gpus/scheduling-gpus/) is a Kubernetes beta feature
- [NVIDIA GPUs](https://learn.microsoft.com/azure/virtual-machines/sizes-gpu) are supported on Azure NC-series, NV-series, and NVv3-series VMs
- [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) allows administrators of Kubernetes clusters to manage GPU nodes just like CPU nodes in the cluster

To deploy a cluster with support for GPU nodes, use the [nvidia-gpu flavor](https://raw.githubusercontent.com/kubernetes-sigs/cluster-api-provider-azure/main/templates/cluster-template-nvidia-gpu.yaml).
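
Before generating a manifest, you can ask `clusterctl` which environment variables the flavor expects (a quick check, assuming `clusterctl` is already installed and initialized):

```bash
# List the variables required by the nvidia-gpu flavor template
clusterctl generate cluster azure-gpu --flavor nvidia-gpu --list-variables
```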

## An example GPU cluster

Let's create a CAPZ cluster with an N-series node and run a GPU-powered vector calculation.

### Generate an nvidia-gpu cluster template

Use the `clusterctl generate cluster` command to generate a manifest that defines your GPU-enabled
workload cluster.

Remember to use the `nvidia-gpu` flavor with N-series nodes.
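
N-series capacity varies by region, so it is worth confirming that the SKU you plan to use is available in your target location. A minimal check with the Azure CLI (assuming `az` is installed and logged in; `Standard_NC` here is just a filter prefix):

```bash
# List NC-series SKUs available in the target region
az vm list-skus --location southcentralus --size Standard_NC --output table
```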

```bash
AZURE_CONTROL_PLANE_MACHINE_TYPE=Standard_B2s \
AZURE_NODE_MACHINE_TYPE=Standard_NC6s_v3 \
AZURE_LOCATION=southcentralus \
clusterctl generate cluster azure-gpu \
  --kubernetes-version=v1.22.1 \
  --worker-machine-count=1 \
  --flavor=nvidia-gpu > azure-gpu-cluster.yaml
```
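
Before applying it, you can sanity-check that the generated manifest picked up your machine types; the `vmSize` fields are filled in from the `AZURE_*_MACHINE_TYPE` variables above:

```bash
# Confirm the control plane and worker VM sizes in the generated manifest
grep vmSize azure-gpu-cluster.yaml
```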

### Create the cluster

Apply the manifest from the previous step to your management cluster to have CAPZ create a
workload cluster:

```bash
$ kubectl apply -f azure-gpu-cluster.yaml
cluster.cluster.x-k8s.io/azure-gpu serverside-applied
azurecluster.infrastructure.cluster.x-k8s.io/azure-gpu serverside-applied
kubeadmcontrolplane.controlplane.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-control-plane serverside-applied
machinedeployment.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
azuremachinetemplate.infrastructure.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
kubeadmconfigtemplate.bootstrap.cluster.x-k8s.io/azure-gpu-md-0 serverside-applied
```
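
You can watch provisioning progress from the management cluster with `clusterctl describe`, which summarizes the cluster's conditions:

```bash
clusterctl describe cluster azure-gpu
```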

Wait until the cluster and nodes are finished provisioning...

```bash
$ kubectl get cluster azure-gpu
NAME        PHASE
azure-gpu   Provisioned
$ kubectl get machines
NAME                             PROVIDERID                                                                                                                                     PHASE     VERSION
azure-gpu-control-plane-t94nm    azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-control-plane-nnb57   Running   v1.22.1
azure-gpu-md-0-f6b88dd78-vmkph   azure:////subscriptions/<subscription_id>/resourceGroups/azure-gpu/providers/Microsoft.Compute/virtualMachines/azure-gpu-md-0-gcc8v            Running   v1.22.1
```

... and then you can install a [CNI](https://cluster-api.sigs.k8s.io/user/quick-start.html#deploy-a-cni-solution) of your choice.
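
For example, here is a minimal sketch that installs Calico via its operator Helm chart; the chart repository and namespace are upstream Calico defaults rather than anything this template configures, so adjust for your preferred CNI:

```bash
# Fetch the workload cluster kubeconfig (also shown in the next section)
clusterctl get kubeconfig azure-gpu > ./azure-gpu-cluster.conf

# Install the Calico operator chart into the workload cluster
helm repo add projectcalico https://docs.tigera.io/calico/charts
helm install calico projectcalico/tigera-operator \
  --kubeconfig ./azure-gpu-cluster.conf \
  --namespace tigera-operator --create-namespace
```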

Once all nodes are `Ready`, install the official NVIDIA gpu-operator via Helm.
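
To block until every node reports `Ready`, you can use `kubectl wait` (adjust the timeout to taste):

```bash
kubectl --kubeconfig ./azure-gpu-cluster.conf wait --for=condition=Ready node --all --timeout=15m
```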

### Install nvidia gpu-operator Helm chart

If you don't have `helm`, installation instructions for your environment can be found [here](https://helm.sh).

First, grab the kubeconfig from your newly created cluster and save it to a file:

```bash
$ clusterctl get kubeconfig azure-gpu > ./azure-gpu-cluster.conf
```
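
If you would rather not pass `--kubeconfig` to every command, you can export it for the current shell instead; the rest of this guide keeps the explicit flag for clarity:

```bash
export KUBECONFIG=$(pwd)/azure-gpu-cluster.conf
```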

Now we can use Helm to install the official chart:

```bash
$ helm install --kubeconfig ./azure-gpu-cluster.conf --repo https://helm.ngc.nvidia.com/nvidia gpu-operator --generate-name
```

The installation of GPU drivers via gpu-operator will take several minutes. Coffee or tea may be appropriate at this time.
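
If you would rather watch than wait, you can follow the pods as the operator brings them up (press Ctrl-C to stop):

```bash
kubectl --kubeconfig ./azure-gpu-cluster.conf get pods -w
```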

After a few minutes, run the following command against the workload cluster to check that all the `gpu-operator` resources are installed:

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get pods -o wide | grep 'gpu\|nvidia'
NAMESPACE   NAME                                                               READY   STATUS      RESTARTS   AGE     IP               NODE                            NOMINATED NODE   READINESS GATES
default     gpu-feature-discovery-r6zgh                                        1/1     Running     0          7m21s   192.168.132.75   azure-gpu-md-0-gcc8v            <none>           <none>
default     gpu-operator-1674686292-node-feature-discovery-master-79d8pbcg6   1/1     Running     0          8m15s   192.168.96.7     azure-gpu-control-plane-nnb57   <none>           <none>
default     gpu-operator-1674686292-node-feature-discovery-worker-g9dj2       1/1     Running     0          8m15s   192.168.132.66   azure-gpu-md-0-gcc8v            <none>           <none>
default     gpu-operator-95b545d6f-rmlf2                                       1/1     Running     0          8m15s   192.168.132.67   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-container-toolkit-daemonset-hstgw                           1/1     Running     0          7m21s   192.168.132.70   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-cuda-validator-pdmkl                                        0/1     Completed   0          3m47s   192.168.132.74   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-dcgm-exporter-wjm7p                                         1/1     Running     0          7m21s   192.168.132.71   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-device-plugin-daemonset-csv6k                               1/1     Running     0          7m21s   192.168.132.73   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-device-plugin-validator-gxzt2                               0/1     Completed   0          2m49s   192.168.132.76   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-driver-daemonset-zww52                                      1/1     Running     0          7m46s   192.168.132.68   azure-gpu-md-0-gcc8v            <none>           <none>
default     nvidia-operator-validator-kjr6m                                    1/1     Running     0          7m21s   192.168.132.72   azure-gpu-md-0-gcc8v            <none>           <none>
```

All pods should be in a state of `Running` or `Completed`. If that is the case, the driver installation and GPU node configuration were successful.

Then run the following commands against the workload cluster to verify that the
[NVIDIA device plugin](https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/main/deployments/static/nvidia-device-plugin.yml)
has initialized and the `nvidia.com/gpu` resource is available:

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get nodes
NAME                            STATUS   ROLES    AGE   VERSION
azure-gpu-control-plane-nnb57   Ready    master   42m   v1.22.1
azure-gpu-md-0-gcc8v            Ready    <none>   38m   v1.22.1
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get node azure-gpu-md-0-gcc8v -o jsonpath={.status.allocatable} | jq
{
  "attachable-volumes-azure-disk": "12",
  "cpu": "6",
  "ephemeral-storage": "119716326407",
  "hugepages-1Gi": "0",
  "hugepages-2Mi": "0",
  "memory": "115312060Ki",
  "nvidia.com/gpu": "1",
  "pods": "110"
}
```
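
If you don't have `jq` handy, a jsonpath query can extract just the GPU count; note the escaped dot in the resource name:

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get node azure-gpu-md-0-gcc8v \
    -o jsonpath="{.status.allocatable['nvidia\.com/gpu']}"
1
```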

### Run a test app

Let's create a pod manifest for the `cuda-vector-add` example from the Kubernetes documentation and
deploy it:

```bash
$ cat > cuda-vector-add.yaml << EOF
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "registry.k8s.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
EOF
$ kubectl --kubeconfig ./azure-gpu-cluster.conf apply -f cuda-vector-add.yaml
```

The container will download, run, and perform a [CUDA](https://developer.nvidia.com/cuda-zone)
calculation with the GPU.
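
If you want to block until the pod finishes before checking its logs, here is a sketch (the jsonpath form of `kubectl wait` requires a reasonably recent kubectl):

```bash
kubectl --kubeconfig ./azure-gpu-cluster.conf wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vector-add --timeout=5m
```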

```bash
$ kubectl --kubeconfig ./azure-gpu-cluster.conf get po cuda-vector-add
cuda-vector-add   0/1     Completed   0          91s
$ kubectl --kubeconfig ./azure-gpu-cluster.conf logs cuda-vector-add
[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
```

If you see output like the above, your GPU cluster is working!
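
When you are done experimenting, you can delete the test pod from the workload cluster and, once you no longer need it, delete the workload cluster itself from the management cluster:

```bash
# On the workload cluster: remove the test pod
kubectl --kubeconfig ./azure-gpu-cluster.conf delete pod cuda-vector-add

# On the management cluster: deletes the workload cluster and its Azure resources
kubectl delete cluster azure-gpu
```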