---
layout: docs
page_title: 'Device Plugins: Nvidia'
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nomad-device-nvidia`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad.

~> **Note**: The Nvidia device plugin setup has changed in Nomad 1.2. Before
upgrading, you must add a [`plugin`] block to your client configuration and
install the [external Nvidia device plugin][nvidia_plugin_download] into the
client's [`plugin_dir`]. See the plugin options below for an example. The job
specification itself remains the same.

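For example, a client agent configuration that installs and enables the plugin
might look like the following minimal sketch; the `plugin_dir` path is
illustrative and should match where you placed the plugin binary.

```hcl
# Client agent configuration (the plugin_dir path is illustrative).
plugin_dir = "/opt/nomad/plugins"

plugin "nomad-device-nvidia" {
  config {
    enabled = true
  }
}
```
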
## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>

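These attributes can be referenced from a job's `device` block to steer
placement, for example by constraining on available GPU memory. The following
is a minimal sketch using Nomad's device `constraint` syntax; the memory
threshold is illustrative.

```hcl
device "nvidia/gpu" {
  count = 1

  # Only schedule on GPUs fingerprinted with at least 4 GiB of memory.
  constraint {
    attribute = "${device.attr.memory}"
    operator  = ">="
    value     = "4 GiB"
  }
}
```
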
## Runtime Environment

The `nomad-device-nvidia` device plugin exposes the following environment
variables to tasks:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

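For a running allocation that was scheduled with a GPU device, you can confirm
which GPU IDs were handed to the task by inspecting its environment. The
allocation ID and UUID below are illustrative.

```shell-session
$ nomad alloc exec d250baed env | grep NVIDIA
NVIDIA_VISIBLE_DEVICES=GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416
```
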
### Additional Task Configurations

Additional environment variables can be set by the task to influence the
runtime environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).

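For example, a Docker task can set `NVIDIA_DRIVER_CAPABILITIES` to control
which driver libraries and binaries are mounted into its container. This is a
minimal sketch; the task name and image are illustrative.

```hcl
task "cuda-app" {
  driver = "docker"

  config {
    image = "nvidia/cuda:11.0-base"
  }

  env {
    # Mount only compute and utility (e.g. nvidia-smi) capabilities.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }

  resources {
    device "nvidia/gpu" {}
  }
}
```
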
## Installation Requirements

To use the `nomad-device-nvidia` device plugin, the following prerequisites
must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`
4. Docker v19.03+

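You can spot-check these prerequisites directly on a client node; the versions
shown below are illustrative and will vary by machine.

```shell-session
$ uname -sr
Linux 5.4.0-135-generic

$ nvidia-smi --query-gpu=driver_version --format=csv,noheader
470.161.03

$ docker --version
Docker version 20.10.21, build baeda1f
```
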
### Container Toolkit Installation

Follow the [NVIDIA Container Toolkit installation instructions][nvidia_container_toolkit]
from Nvidia to prepare a machine to run Docker containers with Nvidia GPUs. You
should be able to run the following command to test your environment and
produce meaningful output.

```shell
docker run --rm --gpus all nvidia/cuda:11.0-base nvidia-smi
```

## Plugin Configuration

```hcl
plugin "nomad-device-nvidia" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nomad-device-nvidia` device plugin supports the following configuration in
the agent config:

- `enabled` `(bool: true)` - Control whether the plugin should be enabled and running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

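The GPU UUIDs used in `ignored_gpu_ids` can be listed with `nvidia-smi` on the
client node; the device names and UUIDs below are illustrative.

```shell-session
$ nvidia-smi -L
GPU 0: Tesla K80 (UUID: GPU-fef8089b-...)
GPU 1: Tesla K80 (UUID: GPU-ac81e44d-...)
```
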
## Limitations

The Nvidia integration only works with task drivers that natively integrate
with Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver]. Support for
[`lxc`][lxc-driver] should be possible by installing the [Nvidia hook][nvidia_hook],
but it is not tested or documented by Nomad.

## Source Code & Compiled Binaries

The source code for this plugin can be found at
[hashicorp/nomad-device-nvidia][source]. You can also find pre-built binaries
on the [releases page][nvidia_plugin_download].

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f

// ...TRUNCATED...

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:11.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed

// ...TRUNCATED...

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /plugins/drivers/community/lxc 'Nomad lxc Driver'
[`plugin`]: /docs/configuration/plugin
[`plugin_dir`]: /docs/configuration#plugin_dir
[nvidia_hook]: https://github.com/lxc/lxc/blob/master/hooks/nvidia
[nvidia_plugin_download]: https://releases.hashicorp.com/nomad-device-nvidia/
[nvidia_container_toolkit]: https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html
[source]: https://github.com/hashicorp/nomad-device-nvidia