---
layout: "docs"
page_title: "Device Plugins: Nvidia"
sidebar_current: "docs-devices-nvidia"
description: |-
  The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table class="table table-bordered table-striped">
  <tr>
    <th>Attribute</th>
    <th>Unit</th>
  </tr>
  <tr>
    <td><tt>memory</tt></td>
    <td>MiB</td>
  </tr>
  <tr>
    <td><tt>power</tt></td>
    <td>W (Watt)</td>
  </tr>
  <tr>
    <td><tt>bar1</tt></td>
    <td>MiB</td>
  </tr>
  <tr>
    <td><tt>driver_version</tt></td>
    <td>string</td>
  </tr>
  <tr>
    <td><tt>cores_clock</tt></td>
    <td>MHz</td>
  </tr>
  <tr>
    <td><tt>memory_clock</tt></td>
    <td>MHz</td>
  </tr>
  <tr>
    <td><tt>pci_bandwidth</tt></td>
    <td>MB/s</td>
  </tr>
  <tr>
    <td><tt>display_state</tt></td>
    <td>string</td>
  </tr>
  <tr>
    <td><tt>persistence_mode</tt></td>
    <td>string</td>
  </tr>
</table>

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

* `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the
runtime environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
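For example, a task could opt in to a subset of driver capabilities through its
`env` stanza. This is a minimal sketch only: the `NVIDIA_DRIVER_CAPABILITIES`
variable and its values come from Nvidia's documentation linked above, not from
Nomad itself, and the job and task names are illustrative.

```hcl
job "gpu-env-example" {
  datacenters = ["dc1"]
  type        = "batch"

  group "example" {
    task "cuda-app" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      # Passed through to the container and read by Nvidia's container
      # runtime; here the task requests only the compute and utility
      # capabilities (see Nvidia's documentation for supported values).
      env {
        NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
      }

      resources {
        device "nvidia/gpu" {
          count = 1
        }
      }
    }
  }
}
```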
## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](https://github.com/NVIDIA/nvidia-docker/wiki/Installation-\(version-1.0\)).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
```

The `nvidia-gpu` device plugin supports the following configuration in the
agent config:

* `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

* `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```sh
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```sh
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```
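The attributes fingerprinted by the plugin (see the table above) can also steer
placement: a `device` block accepts `constraint` and `affinity` blocks whose
attributes reference `${device.attr.<name>}` properties. The following is a
minimal sketch with an arbitrary 4 GiB threshold, mirroring the device stanza's
attribute syntax rather than a tested configuration:

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Only consider GPUs whose fingerprinted memory is at least 4 GiB.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```

The example job below uses an `affinity` on `${device.model}` in the same way.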
Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```sh
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker.html "Nomad docker Driver"
[exec-driver]: /docs/drivers/exec.html "Nomad exec Driver"
[java-driver]: /docs/drivers/java.html "Nomad java Driver"
[lxc-driver]: /docs/drivers/external/lxc.html "Nomad lxc Driver"