
---
layout: "docs"
page_title: "Device Plugins: Nvidia"
sidebar_current: "docs-devices-nvidia"
description: |-
  The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table class="table table-bordered table-striped">
  <tr>
    <th>Attribute</th>
    <th>Unit</th>
  </tr>
  <tr>
    <td><tt>memory</tt></td>
    <td>MiB</td>
  </tr>
  <tr>
    <td><tt>power</tt></td>
    <td>W (Watt)</td>
  </tr>
  <tr>
    <td><tt>bar1</tt></td>
    <td>MiB</td>
  </tr>
  <tr>
    <td><tt>driver_version</tt></td>
    <td>string</td>
  </tr>
  <tr>
    <td><tt>cores_clock</tt></td>
    <td>MHz</td>
  </tr>
  <tr>
    <td><tt>memory_clock</tt></td>
    <td>MHz</td>
  </tr>
  <tr>
    <td><tt>pci_bandwidth</tt></td>
    <td>MB/s</td>
  </tr>
  <tr>
    <td><tt>display_state</tt></td>
    <td>string</td>
  </tr>
  <tr>
    <td><tt>persistence_mode</tt></td>
    <td>string</td>
  </tr>
</table>
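
These fingerprinted attributes can be referenced from a job's `device` block,
placed inside a task's `resources` stanza, when constraining or preferring
particular GPUs. Below is a minimal sketch of such a block; the attribute name
matches the table above, while the threshold values are purely illustrative:

```hcl
device "nvidia/gpu" {
  count = 1

  # Only consider GPUs with at least 2 GiB of memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "2 GiB"
  }

  # Prefer GPUs with more memory when several are eligible.
  affinity {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "4 GiB"
    weight    = 75
  }
}
```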

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

* `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
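
For example, a task that only runs compute workloads could narrow the driver
capabilities mounted into its container and state the CUDA version it expects
by setting these variables in its `env` stanza. The variable names below come
from Nvidia's documentation linked above; the task name, image, and values are
illustrative:

```hcl
task "cuda-check" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:9.0-base"
    command = "nvidia-smi"
  }

  env {
    # Mount only the compute and utility driver capabilities into the container.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"

    # Refuse to start on hosts whose driver cannot satisfy CUDA 9.0.
    NVIDIA_REQUIRE_CUDA = "cuda>=9.0"
  }

  resources {
    # Request a single GPU so NVIDIA_VISIBLE_DEVICES is populated.
    device "nvidia/gpu" {}
  }
}
```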

## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites must
be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](https://github.com/NVIDIA/nvidia-docker/wiki/Installation-\(version-1.0\)).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

* `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

* `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.
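
The `plugin` block is a top-level stanza in the client agent's configuration
file. The following sketch shows where it sits alongside the rest of a minimal
client configuration; the file name, path, and GPU ID are illustrative:

```hcl
# client.hcl (illustrative)
data_dir = "/opt/nomad/data"

client {
  enabled = true
}

plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b"]
  fingerprint_period = "1m"
}
```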

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```sh
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```sh
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```sh
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker.html "Nomad docker Driver"
[exec-driver]: /docs/drivers/exec.html "Nomad exec Driver"
[java-driver]: /docs/drivers/java.html "Nomad java Driver"
[lxc-driver]: /docs/drivers/external/lxc.html "Nomad lxc Driver"