
<!--startmeta
custom_edit_url: "https://github.com/netdata/go.d.plugin/edit/master/modules/nvidia_smi/README.md"
meta_yaml: "https://github.com/netdata/go.d.plugin/edit/master/modules/nvidia_smi/metadata.yaml"
sidebar_label: "Nvidia GPU"
learn_status: "Published"
learn_rel_path: "Data Collection/Hardware Devices and Sensors"
most_popular: False
message: "DO NOT EDIT THIS FILE DIRECTLY, IT IS GENERATED BY THE COLLECTOR'S metadata.yaml FILE"
endmeta-->

# Nvidia GPU


<img src="https://netdata.cloud/img/nvidia.svg" width="150"/>


Plugin: go.d.plugin
Module: nvidia_smi

<img src="https://img.shields.io/badge/maintained%20by-Netdata-%2300ab44" />

## Overview

This collector monitors GPU performance metrics using
the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) CLI tool.

> **Warning**: under development, [loop mode](https://github.com/netdata/netdata/issues/14522) is not implemented yet.

This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.

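As a rough illustration of how data is gathered, these are typical `nvidia-smi` invocations for the two output formats the collector supports (the exact flags and fields the collector passes may differ):

```shell
# Full GPU report as XML -- the kind of output the collector's XML mode parses
nvidia-smi -q -x

# Selected fields as CSV -- roughly what CSV mode requests
nvidia-smi --query-gpu=uuid,name,fan.speed,utilization.gpu,utilization.memory,memory.used,memory.free,temperature.gpu,power.draw --format=csv,noheader,nounits
```

Both commands require an NVIDIA GPU and driver to be present on the host.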
### Default Behavior

#### Auto-Detection

This integration doesn't support auto-detection.

#### Limits

The default configuration for this integration does not impose any limits on data collection.

#### Performance Impact

The default configuration for this integration is not expected to impose a significant performance impact on the system.

## Metrics

Metrics are grouped by *scope*.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.

### Per gpu

These metrics refer to the GPU.

Labels:

| Label | Description |
|:-----------|:----------------|
| uuid | GPU id (e.g. 00000000:00:04.0) |
| product_name | GPU product name (e.g. NVIDIA A100-SXM4-40GB) |

Metrics:

| Metric | Dimensions | Unit | XML | CSV |
|:------|:----------|:----|:---:|:---:|
| nvidia_smi.gpu_pcie_bandwidth_usage | rx, tx | B/s | • |   |
| nvidia_smi.gpu_pcie_bandwidth_utilization | rx, tx | % | • |   |
| nvidia_smi.gpu_fan_speed_perc | fan_speed | % | • | • |
| nvidia_smi.gpu_utilization | gpu | % | • | • |
| nvidia_smi.gpu_memory_utilization | memory | % | • | • |
| nvidia_smi.gpu_decoder_utilization | decoder | % | • |   |
| nvidia_smi.gpu_encoder_utilization | encoder | % | • |   |
| nvidia_smi.gpu_frame_buffer_memory_usage | free, used, reserved | B | • | • |
| nvidia_smi.gpu_bar1_memory_usage | free, used | B | • |   |
| nvidia_smi.gpu_temperature | temperature | Celsius | • | • |
| nvidia_smi.gpu_voltage | voltage | V | • |   |
| nvidia_smi.gpu_clock_freq | graphics, video, sm, mem | MHz | • | • |
| nvidia_smi.gpu_power_draw | power_draw | Watts | • | • |
| nvidia_smi.gpu_performance_state | P0-P15 | state | • | • |
| nvidia_smi.gpu_mig_mode_current_status | enabled, disabled | status | • |   |
| nvidia_smi.gpu_mig_devices_count | mig | devices | • |   |

### Per mig

These metrics refer to a Multi-Instance GPU (MIG) instance.

Labels:

| Label | Description |
|:-----------|:----------------|
| uuid | GPU id (e.g. 00000000:00:04.0) |
| product_name | GPU product name (e.g. NVIDIA A100-SXM4-40GB) |
| gpu_instance_id | GPU instance id (e.g. 1) |

Metrics:

| Metric | Dimensions | Unit | XML | CSV |
|:------|:----------|:----|:---:|:---:|
| nvidia_smi.gpu_mig_frame_buffer_memory_usage | free, used, reserved | B | • |   |
| nvidia_smi.gpu_mig_bar1_memory_usage | free, used | B | • |   |

## Alerts

There are no alerts configured by default for this integration.

## Setup

### Prerequisites

#### Enable in go.d.conf

This collector is disabled by default. You need to explicitly enable it in the `go.d.conf` file.

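For example, enabling the module in `go.d.conf` would look like this (the surrounding `modules` key follows the stock file layout; edit the file with `edit-config go.d.conf`):

```yaml
# go.d.conf
modules:
  nvidia_smi: yes
```
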
### Configuration

#### File

The configuration file name for this integration is `go.d/nvidia_smi.conf`.

You can edit the configuration file using the `edit-config` script from the
Netdata [config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#the-netdata-config-directory).

```bash
cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/nvidia_smi.conf
```

#### Options

The following options can be defined globally: update_every, autodetection_retry.

<details><summary>Config options</summary>

| Name | Description | Default | Required |
|:----|:-----------|:-------|:--------:|
| update_every | Data collection frequency. | 10 | no |
| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |
| binary_path | Path to the nvidia_smi binary. With the default value, the executable is searched for in the directories listed in the PATH environment variable. | nvidia_smi | no |
| timeout | nvidia_smi binary execution timeout, in seconds. | 2 | no |
| use_csv_format | Whether to use CSV format when requesting GPU information. If set to 'no', XML format is used. | yes | no |

</details>

#### Examples

##### XML format

Use XML format when requesting GPU information.

<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    use_csv_format: no
```

</details>

##### Custom binary path

Use this when the executable is not in a directory listed in the PATH environment variable.

<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    binary_path: /usr/local/sbin/nvidia_smi
```

</details>

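An additional illustrative example (the values are placeholders, not upstream recommendations): if `nvidia-smi` responds slowly on your system, you can raise the execution `timeout` and collect less frequently via `update_every`.

<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    update_every: 30  # collect every 30 seconds (illustrative value)
    timeout: 10       # allow nvidia-smi up to 10 seconds (illustrative value)
```

</details>
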
## Troubleshooting

### Debug Mode

To troubleshoot issues with the `nvidia_smi` collector, run the `go.d.plugin` with the debug option enabled. The output
should give you clues as to why the collector isn't working.

- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on
  your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.

  ```bash
  cd /usr/libexec/netdata/plugins.d/
  ```

- Switch to the `netdata` user.

  ```bash
  sudo -u netdata -s
  ```

- Run the `go.d.plugin` to debug the collector:

  ```bash
  ./go.d.plugin -d -m nvidia_smi
  ```