---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nomad-device-nvidia`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.

~> **Note**: The Nvidia device plugin setup changed in Nomad 1.2. You must add
a [`plugin`] block to your client configuration and install the
[external Nvidia device plugin][nvidia_plugin_download] into the client's
[`plugin_dir`] prior to upgrading. See the plugin options below for an
example. Job specifications remain the same.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>

## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment
variables to tasks:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the
runtime environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

## Installation Requirements

In order to use the `nomad-device-nvidia` device plugin, the following
prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
4. Docker v19.03+

### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation
instructions][nvidia_container_toolkit] from Nvidia to prepare a machine to
run Docker containers with Nvidia GPUs. You should be able to run the
following command to test your environment and produce meaningful output:

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration
in the agent config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and
  running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs
  that should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint
  for device changes.
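
Because the plugin ships as an external binary in Nomad 1.2 and later, the
client agent must also be told where to find it. Below is a minimal client
agent configuration sketch that pairs the `plugin` block with
[`plugin_dir`]; the `/opt/nomad/plugins` path is illustrative, not required.

```hcl
# Minimal client agent configuration (sketch). plugin_dir must point to the
# directory holding the nomad-device-nvidia binary from the releases page;
# /opt/nomad/plugins is an example path, not a required location.
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}
```

After restarting the client agent, the fingerprinted `nvidia/gpu` devices
should appear in `nomad node status -verbose` output.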

## Limitations

The Nvidia integration only works with task drivers that natively integrate
with Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver]. Support for
[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook]
but is not tested or documented by Nomad.

## Source Code & Compiled Binaries

The source code for this plugin can be found at
[hashicorp/nomad-device-nvidia][source]. You can also find pre-built binaries
on the [releases page][nvidia_plugin_download].

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed

// ...TRUNCATED...

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```
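
Fingerprinted attributes can also be used as hard constraints rather than
affinities. Below is a sketch of a similar job with a minimum GPU memory
requirement; the job name and the `10 GiB` threshold are illustrative:

```hcl
job "gpu-constrained" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Filter out nodes whose GPUs report less than 10 GiB of
          # fingerprinted memory (illustrative threshold).
          constraint {
            attribute = "${device.attr.memory}"
            operator  = ">="
            value     = "10 GiB"
          }
        }
      }
    }
  }
}
```

Unlike the `affinity` in the earlier example, which only influences scoring,
a `constraint` excludes non-matching nodes from placement entirely.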

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
[`plugin`]: /docs/configuration/plugin
[`plugin_dir`]: /docs/configuration#plugin_dir
[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia