---
layout: docs
page_title: 'Device Plugins: Nvidia'
sidebar_title: Nvidia
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>

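These attributes can be referenced from a job's `device` block (inside the
task's `resources` stanza) to steer placement toward particular GPUs. A minimal
sketch, with illustrative memory thresholds, that requires a GPU with at least
2 GiB of memory and prefers one with 4 GiB or more:

```hcl
device "nvidia/gpu" {
  count = 1

  # Hard requirement: only GPUs with at least 2 GiB of memory are eligible.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "2 GiB"
  }

  # Soft preference: favor GPUs with 4 GiB or more.
  affinity {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "4 GiB"
    weight    = 75
  }
}
```
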
## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

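A task can print this variable to confirm which GPUs were assigned to it. A
minimal sketch, assuming the Docker driver and an illustrative image name:

```hcl
task "show-gpus" {
  driver = "docker"

  config {
    image   = "nvidia/cuda:9.0-base"
    # Prints the task environment, including NVIDIA_VISIBLE_DEVICES.
    command = "env"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```
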
### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
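
The variables and their accepted values are defined by Nvidia's container
runtime rather than by Nomad. As a sketch, a task could restrict which driver
capabilities are injected into its container by setting one of these variables
in its `env` stanza:

```hcl
env {
  # Only expose the compute and utility driver capabilities to the task;
  # accepted values are documented by Nvidia's container runtime.
  NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
}
```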

## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites must
be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

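A quick way to check the kernel and driver versions on a client node (output is
omitted here since it varies by machine):

```shell-session
$ uname -r
$ nvidia-smi --query-gpu=name,driver_version --format=csv,noheader
```
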
### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'