---
layout: docs
page_title: 'Device Plugins: Nvidia'
sidebar_title: Nvidia
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>
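
These attributes can be referenced when placing jobs. As a minimal sketch (not
taken from this guide), a device block could constrain scheduling to GPUs with
enough memory by matching on the fingerprinted `memory` attribute:

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Only consider GPUs whose fingerprinted memory is at least 4 GiB.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "4 GiB"
    }
  }
}
```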

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variable:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.
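
To see the variable at runtime, a task that requests a GPU can simply print its
environment. The following is an illustrative sketch only; the task and image
names are placeholders rather than part of this guide:

```hcl
task "print-env" {
  driver = "docker"

  config {
    # Any image that provides `env` will do; this one is just an example.
    image   = "ubuntu:18.04"
    command = "env"
  }

  resources {
    # Requesting a device causes NVIDIA_VISIBLE_DEVICES to be populated with
    # the UUIDs of the GPUs assigned to this task.
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```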

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
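
For example, a task could limit itself to the compute and utility driver
capabilities through its `env` block. This is a hedged sketch: the variable and
its values are interpreted by Nvidia's container runtime (see the documentation
linked above), not by Nomad itself.

```hcl
task "cuda-app" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  env {
    # Expose only the compute (CUDA) and utility (nvidia-smi) capabilities
    # inside the container.
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```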

## Installation Requirements

To use the `nvidia-gpu` plugin, the following prerequisites must be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

To use the Nvidia device plugin with the Docker driver, follow the
installation instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  config {
    enabled            = true
    ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
    fingerprint_period = "1m"
  }
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `enabled` `(bool: true)` - Controls whether the plugin should be enabled and running.

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.
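
The `plugin` block shown above belongs in the Nomad client agent's
configuration file, next to the other agent stanzas. A rough sketch, with an
illustrative file path and values:

```hcl
# e.g. /etc/nomad.d/client.hcl (hypothetical path)
client {
  enabled = true
}

plugin "nvidia-gpu" {
  config {
    enabled         = true
    ignored_gpu_ids = ["GPU-fef8089b"]
  }
}
```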

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```

```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'