<!--startmeta
custom_edit_url: "https://github.com/netdata/go.d.plugin/edit/master/modules/nvidia_smi/README.md"
meta_yaml: "https://github.com/netdata/go.d.plugin/edit/master/modules/nvidia_smi/metadata.yaml"
sidebar_label: "Nvidia GPU"
learn_status: "Published"
learn_rel_path: "Data Collection/Hardware Devices and Sensors"
most_popular: False
message: "DO NOT EDIT THIS FILE DIRECTLY, IT IS GENERATED BY THE COLLECTOR'S metadata.yaml FILE"
endmeta-->

# Nvidia GPU


<img src="https://netdata.cloud/img/nvidia.svg" width="150"/>


Plugin: go.d.plugin
Module: nvidia_smi

<img src="https://img.shields.io/badge/maintained%20by-Netdata-%2300ab44" />

## Overview

This collector monitors GPU performance metrics using
the [nvidia-smi](https://developer.nvidia.com/nvidia-system-management-interface) CLI tool.

> **Warning**: under development, [loop mode](https://github.com/netdata/netdata/issues/14522) is not implemented yet.


This collector is supported on all platforms.

This collector supports collecting metrics from multiple instances of this integration, including remote instances.


### Default Behavior

#### Auto-Detection

This integration doesn't support auto-detection.

#### Limits

The default configuration for this integration does not impose any limits on data collection.

#### Performance Impact

The default configuration for this integration is not expected to impose a significant performance impact on the system.


## Metrics

Metrics are grouped by *scope*.

The scope defines the instance that the metric belongs to. An instance is uniquely identified by a set of labels.


### Per gpu

These metrics refer to the GPU.

Labels:

| Label | Description |
|:-------------|:----------------------------------------------|
| uuid | GPU id (e.g. 00000000:00:04.0) |
| product_name | GPU product name (e.g. NVIDIA A100-SXM4-40GB) |

Metrics:

| Metric | Dimensions | Unit | XML | CSV |
|:------|:----------|:----|:---:|:---:|
| nvidia_smi.gpu_pcie_bandwidth_usage | rx, tx | B/s | • | |
| nvidia_smi.gpu_pcie_bandwidth_utilization | rx, tx | % | • | |
| nvidia_smi.gpu_fan_speed_perc | fan_speed | % | • | • |
| nvidia_smi.gpu_utilization | gpu | % | • | • |
| nvidia_smi.gpu_memory_utilization | memory | % | • | • |
| nvidia_smi.gpu_decoder_utilization | decoder | % | • | |
| nvidia_smi.gpu_encoder_utilization | encoder | % | • | |
| nvidia_smi.gpu_frame_buffer_memory_usage | free, used, reserved | B | • | • |
| nvidia_smi.gpu_bar1_memory_usage | free, used | B | • | |
| nvidia_smi.gpu_temperature | temperature | Celsius | • | • |
| nvidia_smi.gpu_voltage | voltage | V | • | |
| nvidia_smi.gpu_clock_freq | graphics, video, sm, mem | MHz | • | • |
| nvidia_smi.gpu_power_draw | power_draw | Watts | • | • |
| nvidia_smi.gpu_performance_state | P0-P15 | state | • | • |
| nvidia_smi.gpu_mig_mode_current_status | enabled, disabled | status | • | |
| nvidia_smi.gpu_mig_devices_count | mig | devices | • | |

### Per mig

These metrics refer to the Multi-Instance GPU (MIG).

Labels:

| Label | Description |
|:----------------|:----------------------------------------------|
| uuid | GPU id (e.g. 00000000:00:04.0) |
| product_name | GPU product name (e.g. NVIDIA A100-SXM4-40GB) |
| gpu_instance_id | GPU instance id (e.g. 1) |

Metrics:

| Metric | Dimensions | Unit | XML | CSV |
|:------|:----------|:----|:---:|:---:|
| nvidia_smi.gpu_mig_frame_buffer_memory_usage | free, used, reserved | B | • | |
| nvidia_smi.gpu_mig_bar1_memory_usage | free, used | B | • | |


## Alerts

There are no alerts configured by default for this integration.


## Setup

### Prerequisites

#### Enable in `go.d.conf`
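Enabling the collector amounts to flipping one flag in the `go.d.conf` `modules` map. A minimal fragment, assuming the stock file layout (other entries omitted; your install path may differ):

```yaml
# go.d.conf (edit via `edit-config go.d.conf` from the Netdata config directory)
modules:
  nvidia_smi: yes
```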
This collector is disabled by default. You need to explicitly enable it in the `go.d.conf` file.


### Configuration

#### File

The configuration file name for this integration is `go.d/nvidia_smi.conf`.


You can edit the configuration file using the `edit-config` script from the
Netdata [config directory](https://github.com/netdata/netdata/blob/master/docs/configure/nodes.md#the-netdata-config-directory).

```bash
cd /etc/netdata 2>/dev/null || cd /opt/netdata/etc/netdata
sudo ./edit-config go.d/nvidia_smi.conf
```

#### Options

The following options can be defined globally: update_every, autodetection_retry.


<details><summary>Config options</summary>

| Name | Description | Default | Required |
|:----|:-----------|:-------|:--------:|
| update_every | Data collection frequency. | 10 | no |
| autodetection_retry | Recheck interval in seconds. Zero means no recheck will be scheduled. | 0 | no |
| binary_path | Path to the nvidia_smi binary. The default is "nvidia_smi", and the executable is looked for in the directories specified in the PATH environment variable. | nvidia_smi | no |
| timeout | nvidia_smi binary execution timeout, in seconds. | 2 | no |
| use_csv_format | Format used when requesting GPU information. XML is used if set to 'no'. | yes | no |

</details>

#### Examples

##### XML format

Use the XML format when requesting GPU information.

<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    use_csv_format: no
```

</details>

##### Custom binary path

Use this option when the executable is not in one of the directories specified in the PATH environment variable.
<details><summary>Config</summary>

```yaml
jobs:
  - name: nvidia_smi
    binary_path: /usr/local/sbin/nvidia_smi
```

</details>



## Troubleshooting

### Debug Mode

To troubleshoot issues with the `nvidia_smi` collector, run the `go.d.plugin` with the debug option enabled. The output
should give you clues as to why the collector isn't working.

- Navigate to the `plugins.d` directory, usually at `/usr/libexec/netdata/plugins.d/`. If that's not the case on
  your system, open `netdata.conf` and look for the `plugins` setting under `[directories]`.

  ```bash
  cd /usr/libexec/netdata/plugins.d/
  ```

- Switch to the `netdata` user.

  ```bash
  sudo -u netdata -s
  ```

- Run the `go.d.plugin` to debug the collector:

  ```bash
  ./go.d.plugin -d -m nvidia_smi
  ```
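Because the collector simply shells out to `nvidia-smi`, a common failure mode is the binary not being reachable from the plugin's environment. A quick sanity check before reaching for debug mode (a sketch; `NVIDIA_SMI` here is just a shell variable for this script, not a collector option; substitute your custom `binary_path` if you set one):

```shell
#!/bin/sh
# Report whether the nvidia-smi binary the collector would execute is on PATH.
bin="${NVIDIA_SMI:-nvidia-smi}"
if command -v "$bin" >/dev/null 2>&1; then
  echo "found: $(command -v "$bin")"
else
  echo "not found in PATH: $bin"
fi
```

If the binary is found but charts are still missing, running it by hand (e.g. `nvidia-smi -q -x` for an XML dump) shows whether the tool itself can talk to the driver.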