---
layout: docs
page_title: 'Device Plugins: Nvidia'
sidebar_title: Nvidia
description: The Nvidia Device Plugin detects and makes Nvidia devices available to tasks.
---

# Nvidia GPU Device Plugin

Name: `nvidia-gpu`

The Nvidia device plugin is used to expose Nvidia GPUs to Nomad. The Nvidia
plugin is built into Nomad and does not need to be downloaded separately.

## Fingerprinted Attributes

<table>
  <thead>
    <tr>
      <th>Attribute</th>
      <th>Unit</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td>
        <tt>memory</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>power</tt>
      </td>
      <td>W (Watt)</td>
    </tr>
    <tr>
      <td>
        <tt>bar1</tt>
      </td>
      <td>MiB</td>
    </tr>
    <tr>
      <td>
        <tt>driver_version</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>cores_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>memory_clock</tt>
      </td>
      <td>MHz</td>
    </tr>
    <tr>
      <td>
        <tt>pci_bandwidth</tt>
      </td>
      <td>MB/s</td>
    </tr>
    <tr>
      <td>
        <tt>display_state</tt>
      </td>
      <td>string</td>
    </tr>
    <tr>
      <td>
        <tt>persistence_mode</tt>
      </td>
      <td>string</td>
    </tr>
  </tbody>
</table>

## Runtime Environment

The `nvidia-gpu` device plugin exposes the following environment variables:

- `NVIDIA_VISIBLE_DEVICES` - List of Nvidia GPU IDs available to the task.

### Additional Task Configurations

Additional environment variables can be set by the task to influence the runtime
environment. See [Nvidia's
documentation](https://github.com/NVIDIA/nvidia-container-runtime#environment-variables-oci-spec).
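As an illustrative sketch, the task below passes two of those variables,
`NVIDIA_DRIVER_CAPABILITIES` and `NVIDIA_REQUIRE_CUDA`, through the task's
`env` block. The task name and values are hypothetical, and the linked Nvidia
documentation remains the authoritative reference for which variables exist and
what they mean.

```hcl
task "cuda-app" {
  driver = "docker"

  config {
    image = "nvidia/cuda:9.0-base"
  }

  # Illustrative values only; consult Nvidia's documentation for the
  # supported variables and their exact semantics.
  env {
    NVIDIA_DRIVER_CAPABILITIES = "compute,utility"
    NVIDIA_REQUIRE_CUDA        = "cuda>=9.0"
  }

  resources {
    device "nvidia/gpu" {
      count = 1
    }
  }
}
```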
## Installation Requirements

In order to use the `nvidia-gpu` device plugin, the following prerequisites must
be met:

1. GNU/Linux x86_64 with kernel version > 3.10
2. NVIDIA GPU with Architecture > Fermi (2.1)
3. NVIDIA drivers >= 340.29 with binary `nvidia-smi`

### Docker Driver Requirements

In order to use the Nvidia device plugin with the Docker driver, please follow
the installation instructions for
[`nvidia-docker`](<https://github.com/NVIDIA/nvidia-docker/wiki/Installation-(version-1.0)>).

## Plugin Configuration

```hcl
plugin "nvidia-gpu" {
  ignored_gpu_ids    = ["GPU-fef8089b", "GPU-ac81e44d"]
  fingerprint_period = "1m"
}
```

The `nvidia-gpu` device plugin supports the following configuration in the agent
config:

- `ignored_gpu_ids` `(array<string>: [])` - Specifies the set of GPU UUIDs that
  should be ignored when fingerprinting.

- `fingerprint_period` `(string: "1m")` - The period in which to fingerprint for
  device changes.

## Restrictions

The Nvidia integration only works with drivers that natively integrate with
Nvidia's [container runtime
library](https://github.com/NVIDIA/libnvidia-container).

Nomad has tested support with the [`docker` driver][docker-driver] and plans to
bring support to the built-in [`exec`][exec-driver] and [`java`][java-driver]
drivers. Support for [`lxc`][lxc-driver] should be possible by installing the
[Nvidia hook](https://github.com/lxc/lxc/blob/master/hooks/nvidia) but is not
tested or documented by Nomad.

## Examples

Inspect a node with a GPU:

```shell-session
$ nomad node status 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m43s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2674/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Allocations
No allocations placed
```

Display detailed statistics on a node with a GPU:

```shell-session
$ nomad node status -stats 4d46e59f
ID            = 4d46e59f
Name          = nomad
Class         = <none>
DC            = dc1
Drain         = false
Eligibility   = eligible
Status        = ready
Uptime        = 19m59s
Driver Status = docker,mock_driver,raw_exec

Node Events
Time                  Subsystem  Message
2019-01-23T18:25:18Z  Cluster    Node registered

Allocated Resources
CPU          Memory      Disk
0/15576 MHz  0 B/55 GiB  0 B/28 GiB

Allocation Resource Utilization
CPU          Memory
0/15576 MHz  0 B/55 GiB

Host Resource Utilization
CPU             Memory          Disk
2673/15576 MHz  1.5 GiB/55 GiB  3.0 GiB/31 GiB

Device Resource Utilization
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

// ...TRUNCATED...

Device Stats
Device              = nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]
BAR1 buffer state   = 2 / 16384 MiB
Decoder utilization = 0 %
ECC L1 errors       = 0
ECC L2 errors       = 0
ECC memory errors   = 0
Encoder utilization = 0 %
GPU utilization     = 0 %
Memory state        = 0 / 11441 MiB
Memory utilization  = 0 %
Power usage         = 37 / 149 W
Temperature         = 34 C

Allocations
No allocations placed
```

Run the following example job to see that the GPU was mounted in the
container:

```hcl
job "gpu-test" {
  datacenters = ["dc1"]
  type        = "batch"

  group "smi" {
    task "smi" {
      driver = "docker"

      config {
        image   = "nvidia/cuda:9.0-base"
        command = "nvidia-smi"
      }

      resources {
        device "nvidia/gpu" {
          count = 1

          # Add an affinity for a particular model
          affinity {
            attribute = "${device.model}"
            value     = "Tesla K80"
            weight    = 50
          }
        }
      }
    }
  }
}
```
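The fingerprinted attributes listed earlier can also steer placement. The
following is an illustrative sketch (not part of the original example) of a
`resources` block that constrains the `nvidia/gpu` device to GPUs with at least
a given amount of memory using Nomad's `${device.attr.<name>}` syntax; the
10 GiB threshold is arbitrary.

```hcl
resources {
  device "nvidia/gpu" {
    count = 1

    # Only consider GPUs whose fingerprinted "memory" attribute is at
    # least 10 GiB. The threshold is illustrative.
    constraint {
      attribute = "${device.attr.memory}"
      operator  = ">="
      value     = "10 GiB"
    }
  }
}
```

Registering the unmodified `gpu-test` job shows the allocation being placed and
the GPU being exposed to the container: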
```shell-session
$ nomad run example.nomad
==> Monitoring evaluation "21bd7584"
    Evaluation triggered by job "gpu-test"
    Allocation "d250baed" created: node "4d46e59f", group "smi"
    Evaluation status changed: "pending" -> "complete"
==> Evaluation "21bd7584" finished with status "complete"

$ nomad alloc status d250baed
ID                  = d250baed
Eval ID             = 21bd7584
Name                = gpu-test.smi[0]
Node ID             = 4d46e59f
Job ID              = example
Job Version         = 0
Client Status       = complete
Client Description  = All tasks have completed
Desired Status      = run
Desired Description = <none>
Created             = 7s ago
Modified            = 2s ago

Task "smi" is "dead"
Task Resources
CPU        Memory       Disk     Addresses
0/100 MHz  0 B/300 MiB  300 MiB

Device Stats
nvidia/gpu/Tesla K80[GPU-e1f6f4f1-1ea5-7b9d-5f03-338a9dc32416]  0 / 11441 MiB

Task Events:
Started At     = 2019-01-23T18:25:32Z
Finished At    = 2019-01-23T18:25:34Z
Total Restarts = 0
Last Restart   = N/A

Recent Events:
Time                  Type        Description
2019-01-23T18:25:34Z  Terminated  Exit Code: 0
2019-01-23T18:25:32Z  Started     Task started by client
2019-01-23T18:25:29Z  Task Setup  Building Task Directory
2019-01-23T18:25:29Z  Received    Task received by client

$ nomad alloc logs d250baed
Wed Jan 23 18:25:32 2019
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 410.48                 Driver Version: 410.48                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           On   | 00004477:00:00.0 Off |                    0 |
| N/A   33C    P8    37W / 149W |      0MiB / 11441MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

[docker-driver]: /docs/drivers/docker 'Nomad docker Driver'
[exec-driver]: /docs/drivers/exec 'Nomad exec Driver'
[java-driver]: /docs/drivers/java 'Nomad java Driver'
[lxc-driver]: /docs/drivers/external/lxc 'Nomad lxc Driver'