---
layout: docs
page_title: 'Device Plugins: Nvidia'
sidebar_title: Nvidia
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>
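These fingerprinted attributes can be referenced from a job's `device` block
(via `${device.attr.<name>}`) to steer placement. The following is a minimal
sketch of a `resources` block that requires a minimum amount of GPU memory and
prefers a faster memory clock; the attribute names map to the table above,
while the threshold values are purely illustrative:

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Only consider GPUs that expose at least 4 GiB of memory.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }

    # Among eligible GPUs, prefer those with a faster memory clock.
    affinity {
      attribute = "${device.attr.memory_clock}"
      operator  = ">="
      value     = "1200 MHz"
      weight    = 50
    }
  }
}
```

A `constraint` filters out devices that do not match, while an `affinity` only
biases the scheduler toward matching devices without making them mandatory.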
## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
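As a minimal sketch, such variables can be supplied through the task's `env`
block. The variable names below come from the Nvidia documentation linked
above; the task name and the values chosen are illustrative assumptions, not
recommended settings:

```hcl
task "cuda-app" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  env {
    # Mount only the compute and utility driver capabilities
    # into the container.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"

    # Refuse to start the container if the host driver cannot
    # satisfy this CUDA version requirement.
    NVIDIA_REQUIRE_CUDA = "cuda>=9.0"
  }

  resources {
    device "nvidia/gpu" {}
  }
}
```

Note that `NVIDIA_VISIBLE_DEVICES` is set by the plugin itself, based on the
devices allocated to the task, and does not need to be set manually.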
## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites must
be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'