---
layout: docs
page_title: 'Device Plugins: Nvidia'
sidebar_title: Nvidia
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
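As an illustrative sketch, the task below passes two of those variables,
`NVIDIA_DRIVER_CAPABILITIES` and `NVIDIA_REQUIRE_CUDA`, through the task's
`env` block. The task name and values are hypothetical, and the linked Nvidia
documentation remains the authoritative reference for which variables exist and
what they mean.

```hcl
task "cuda-app" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  # Illustrative values only; consult Nvidia's documentation for the
  # supported variables and their exact semantics.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
    NVIDIA_REQUIRE_CUDA        = "cuda>=9.0"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```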
## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites must
be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```
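The fingerprinted attributes listed earlier can also steer placement. The
following is an illustrative sketch (not part of the original example) of a
`resources` block that constrains the `nvidia/gpu` device to GPUs with at least
a given amount of memory using Nomad's `${device.attr.<name>}` syntax; the
10 GiB threshold is arbitrary.

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Only consider GPUs whose fingerprinted "memory" attribute is at
    # least 10 GiB. The threshold is illustrative.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "10 GiB"
    }
  }
}
```

Registering the unmodified `gpu-test` job shows the allocation being placed and
the GPU being exposed to the container: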
```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'