---
layout: docs
page_title: 'Device Plugins: Nvidia'
sidebar_title: Nvidia
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>
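These attributes can be referenced from a job's `device` stanza to constrain
placement or to express scheduling preferences. The following is a minimal
sketch of a task's `resources` stanza, assuming the attribute names from the
table above; the memory thresholds and weight are illustrative:

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Require a GPU with at least 2 GiB of memory (illustrative threshold).
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "2 GiB"
    }

    # Prefer GPUs with more memory when several are available
    # (illustrative threshold and weight).
    affinity {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "8 GiB"
      weight    = 75
    }
  }
}
```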
## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the
runtime environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
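For example, a task could limit which driver capabilities are exposed inside
its container by setting `NVIDIA_DRIVER_CAPABILITIES`, one of the variables
described in the documentation linked above. A minimal sketch (the task name,
image, and capability list are illustrative):

```hcl
task "cuda-app" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  # Illustrative: expose only the compute and utility capabilities
  # (CUDA libraries and tools such as nvidia-smi) inside the container.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```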
## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nvidia-gpu` device plugin supports the following configuration in the
agent config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and
  running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint
  for device changes.

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'