github.com/netdata/go.d.plugin@v0.58.1/modules/nvidia_smi/testdata/help-query-gpu.txt (about)

     1  List of valid properties to query for the switch "--query-gpu=":
     2  
     3  "timestamp"
     4  The timestamp of when the query was made in format "YYYY/MM/DD HH:MM:SS.msec".
     5  
     6  "driver_version"
     7  The version of the installed NVIDIA display driver. This is an alphanumeric string.
     8  
     9  "count"
    10  The number of NVIDIA GPUs in the system.
    11  
    12  "name" or "gpu_name"
    13  The official product name of the GPU. This is an alphanumeric string. For all products.
    14  
    15  "serial" or "gpu_serial"
    16  This number matches the serial number physically printed on each board. It is a globally unique immutable alphanumeric value.
    17  
    18  "uuid" or "gpu_uuid"
    19  This value is the globally unique immutable alphanumeric identifier of the GPU. It does not correspond to any physical label on the board.
    20  
    21  "pci.bus_id" or "gpu_bus_id"
    22  PCI bus id as "domain:bus:device.function", in hex.
    23  
    24  "pci.domain"
    25  PCI domain number, in hex.
    26  
    27  "pci.bus"
    28  PCI bus number, in hex.
    29  
    30  "pci.device"
    31  PCI device number, in hex.
    32  
    33  "pci.device_id"
    34  PCI vendor device id, in hex.
    35  
    36  "pci.sub_device_id"
    37  PCI Sub System id, in hex.
    38  
    39  "pcie.link.gen.current"
    40  The current PCI-E link generation. This may be reduced when the GPU is not in use.
    41  
    42  "pcie.link.gen.max"
    43  The maximum PCI-E link generation possible with this GPU and system configuration. For example, if the GPU supports a higher PCIe generation than the system supports then this reports the system PCIe generation.
    44  
    45  "pcie.link.width.current"
    46  The current PCI-E link width. This may be reduced when the GPU is not in use.
    47  
    48  "pcie.link.width.max"
    49  The maximum PCI-E link width possible with this GPU and system configuration. For example, if the GPU supports a wider PCIe link than the system supports, this reports the system PCIe link width.
    50  
    51  "index"
    52  Zero-based index of the GPU. Can change at each boot.
    53  
    54  "display_mode"
    55  A flag that indicates whether a physical display (e.g. monitor) is currently connected to any of the GPU's connectors. "Enabled" indicates an attached display. "Disabled" indicates otherwise.
    56  
    57  "display_active"
    58  A flag that indicates whether a display is initialized on the GPU (e.g. memory is allocated on the device for display). Display can be active even when no monitor is physically attached. "Enabled" indicates an active display. "Disabled" indicates otherwise.
    59  
    60  "persistence_mode"
    61  A flag that indicates whether persistence mode is enabled for the GPU. Value is either "Enabled" or "Disabled". When persistence mode is enabled the NVIDIA driver remains loaded even when no active clients, such as X11 or nvidia-smi, exist. This minimizes the driver load latency associated with running dependent apps, such as CUDA programs. Linux only.
    62  
    63  "accounting.mode"
    64  A flag that indicates whether accounting mode is enabled for the GPU. Value is either "Enabled" or "Disabled". When accounting is enabled statistics are calculated for each compute process running on the GPU. Statistics can be queried during the lifetime or after termination of the process. The execution time of the process is reported as 0 while the process is in the running state, and updated to the actual execution time after the process has terminated. See --help-query-accounted-apps for more info.
    65  
    66  "accounting.buffer_size"
    67  The size of the circular buffer that holds list of processes that can be queried for accounting stats. This is the maximum number of processes that accounting information will be stored for before information about oldest processes will get overwritten by information about new processes.
    68  
    69  Section about driver_model properties
    70  On Windows, the TCC and WDDM driver models are supported. The driver model can be changed with the (-dm) or (-fdm) flags. The TCC driver model is optimized for compute applications, i.e. kernel launch times will be quicker with TCC. The WDDM driver model is designed for graphics applications and is not recommended for compute applications. Linux does not support multiple driver models, and will always have the value of "N/A". Only for selected products. Please see feature matrix in NVML documentation.
    71  
    72  "driver_model.current"
    73  The driver model currently in use. Always "N/A" on Linux.
    74  
    75  "driver_model.pending"
    76  The driver model that will be used on the next reboot. Always "N/A" on Linux.
    77  
    78  "vbios_version"
    79  The VBIOS version of the GPU board.
    80  
    81  Section about inforom properties
    82  Version numbers for each object in the GPU board's infoROM storage. The infoROM is a small, persistent store of configuration and state data for the GPU. All infoROM version fields are numerical. It can be useful to know these version numbers because some GPU features are only available with infoROMs of a certain version or higher.
    83  
    84  "inforom.img" or "inforom.image"
    85  Global version of the infoROM image. Image version just like VBIOS version uniquely describes the exact version of the infoROM flashed on the board in contrast to infoROM object version which is only an indicator of supported features.
    86  
    87  "inforom.oem"
    88  Version for the OEM configuration data.
    89  
    90  "inforom.ecc"
    91  Version for the ECC recording data.
    92  
    93  "inforom.pwr" or "inforom.power"
    94  Version for the power management data.
    95  
    96  Section about gom properties
    97  GOM (GPU Operation Mode) allows reducing power usage and optimizing GPU throughput by disabling GPU features. Each GOM is designed to meet specific user needs.
    98  In "All On" mode everything is enabled and running at full speed.
    99  The "Compute" mode is designed for running only compute tasks. Graphics operations are not allowed.
   100  The "Low Double Precision" mode is designed for running graphics applications that don't require high bandwidth double precision.
   101  GOM can be changed with the (--gom) flag.
   102  
   103  "gom.current" or "gpu_operation_mode.current"
   104  The GOM currently in use.
   105  
   106  "gom.pending" or "gpu_operation_mode.pending"
   107  The GOM that will be used on the next reboot.
   108  
   109  "fan.speed"
   110  The fan speed value is the percent of the product's maximum noise tolerance fan speed that the device's fan is currently intended to run at. This value may exceed 100% in certain cases. Note: The reported speed is the intended fan speed. If the fan is physically blocked and unable to spin, this output will not match the actual fan speed. Many parts do not report fan speeds because they rely on cooling via fans in the surrounding enclosure.
   111  
   112  "pstate"
   113  The current performance state for the GPU. States range from P0 (maximum performance) to P12 (minimum performance).
   114  
   115  Section about clocks_throttle_reasons properties
   116  Retrieves information about factors that are reducing the frequency of clocks. If all throttle reasons are returned as "Not Active" it means that clocks are running as high as possible.
   117  
   118  "clocks_throttle_reasons.supported"
   119  Bitmask of supported clock throttle reasons. See nvml.h for more details.
   120  
   121  "clocks_throttle_reasons.active"
   122  Bitmask of active clock throttle reasons. See nvml.h for more details.
   123  
   124  "clocks_throttle_reasons.gpu_idle"
   125  Nothing is running on the GPU and the clocks are dropping to Idle state. This limiter may be removed in a later release.
   126  
   127  "clocks_throttle_reasons.applications_clocks_setting"
   128  GPU clocks are limited by the applications clocks setting. E.g. it can be changed with nvidia-smi --applications-clocks=
   129  
   130  "clocks_throttle_reasons.sw_power_cap"
   131  SW Power Scaling algorithm is reducing the clocks below requested clocks because the GPU is consuming too much power. E.g. SW power cap limit can be changed with nvidia-smi --power-limit=
   132  
   133  "clocks_throttle_reasons.hw_slowdown"
   134  HW Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of:
   135   * HW Thermal Slowdown: temperature being too high
   136   * HW Power Brake Slowdown: External Power Brake Assertion is triggered (e.g. by the system power supply)
   137   * Power draw is too high and Fast Trigger protection is reducing the clocks
   138   * May also be reported during PState or clock change
   139   * This behavior may be removed in a later release
   140  
   141  "clocks_throttle_reasons.hw_thermal_slowdown"
   142  HW Thermal Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of temperature being too high.
   143  
   144  "clocks_throttle_reasons.hw_power_brake_slowdown"
   145  HW Power Brake Slowdown (reducing the core clocks by a factor of 2 or more) is engaged. This is an indicator of External Power Brake Assertion being triggered (e.g. by the system power supply).
   146  
   147  "clocks_throttle_reasons.sw_thermal_slowdown"
   148  SW Thermal capping algorithm is reducing clocks below requested clocks because GPU temperature is higher than Max Operating Temp.
   149  
   150  "clocks_throttle_reasons.sync_boost"
   151  This GPU has been added to a Sync Boost group with nvidia-smi or DCGM in
   152  order to maximize performance per watt. All GPUs in the Sync Boost group
   153  will boost to the minimum possible clocks across the entire group. Look at
   154  the throttle reasons for other GPUs in the system to see why those GPUs are
   155  holding this one at lower clocks.
   156  
   157  Section about memory properties
   158  On-board memory information. Reported total memory is affected by ECC state. If ECC is enabled the total available memory is decreased by several percent, due to the requisite parity bits. The driver may also reserve a small amount of memory for internal use, even without active work on the GPU.
   159  
   160  "memory.total"
   161  Total installed GPU memory.
   162  
   163  "memory.reserved"
   164  Total memory reserved by the NVIDIA driver and firmware.
   165  
   166  "memory.used"
   167  Total memory allocated by active contexts.
   168  
   169  "memory.free"
   170  Total free memory.
   171  
   172  "compute_mode"
   173  The compute mode flag indicates whether individual or multiple compute applications may run on the GPU.
   174  "0: Default" means multiple contexts are allowed per device.
   175  "1: Exclusive_Thread" is deprecated; use "Exclusive_Process" instead.
   176  "2: Prohibited" means no contexts are allowed per device (no compute apps).
   177  "3: Exclusive_Process" means only one context is allowed per device, usable from multiple threads at a time.
   178  
   179  "compute_cap"
   180  The CUDA Compute Capability, represented as Major DOT Minor.
   181  
   182  Section about utilization properties
   183  Utilization rates report how busy each GPU is over time, and can be used to determine how much an application is using the GPUs in the system.
   184  
   185  "utilization.gpu"
   186  Percent of time over the past sample period during which one or more kernels was executing on the GPU.
   187  The sample period may be between 1 second and 1/6 second depending on the product.
   188  
   189  "utilization.memory"
   190  Percent of time over the past sample period during which global (device) memory was being read or written.
   191  The sample period may be between 1 second and 1/6 second depending on the product.
   192  
   193  Section about encoder.stats properties
   194  Encoder stats report number of encoder sessions, average FPS and average latency in us for given GPUs in the system.
   195  
   196  "encoder.stats.sessionCount"
   197  Number of encoder sessions running on the GPU.
   198  
   199  "encoder.stats.averageFps"
   200  Average FPS of all sessions running on the GPU.
   201  
   202  "encoder.stats.averageLatency"
   203  Average latency in microseconds of all sessions running on the GPU.
   204  
   205  Section about ecc.mode properties
   206  A flag that indicates whether ECC support is enabled. May be either "Enabled" or "Disabled". Changes to ECC mode require a reboot. Requires Inforom ECC object version 1.0 or higher.
   207  
   208  "ecc.mode.current"
   209  The ECC mode that the GPU is currently operating under.
   210  
   211  "ecc.mode.pending"
   212  The ECC mode that the GPU will operate under after the next reboot.
   213  
   214  Section about ecc.errors properties
   215  NVIDIA GPUs can provide error counts for various types of ECC errors. Some ECC errors are either single or double bit, where single bit errors are corrected and double bit errors are uncorrectable. Texture memory errors may be correctable via resend or uncorrectable if the resend fails. These errors are available across two timescales (volatile and aggregate). Single bit ECC errors are automatically corrected by the HW and do not result in data corruption. Double bit errors are detected but not corrected. Please see the ECC documents on the web for information on compute application behavior when double bit errors occur. Volatile error counters track the number of errors detected since the last driver load. Aggregate error counts persist indefinitely and thus act as a lifetime counter.
   216  
   217  "ecc.errors.corrected.volatile.device_memory"
   218  Errors detected in global device memory.
   219  
   220  "ecc.errors.corrected.volatile.dram"
   221  Errors detected in global device memory.
   222  
   223  "ecc.errors.corrected.volatile.register_file"
   224  Errors detected in register file memory.
   225  
   226  "ecc.errors.corrected.volatile.l1_cache"
   227  Errors detected in the L1 cache.
   228  
   229  "ecc.errors.corrected.volatile.l2_cache"
   230  Errors detected in the L2 cache.
   231  
   232  "ecc.errors.corrected.volatile.texture_memory"
   233  Parity errors detected in texture memory.
   234  
   235  "ecc.errors.corrected.volatile.cbu"
   236  Parity errors detected in CBU.
   237  
   238  "ecc.errors.corrected.volatile.sram"
   239  Errors detected in global SRAMs.
   240  
   241  "ecc.errors.corrected.volatile.total"
   242  Total errors detected across entire chip.
   243  
   244  "ecc.errors.corrected.aggregate.device_memory"
   245  Errors detected in global device memory.
   246  
   247  "ecc.errors.corrected.aggregate.dram"
   248  Errors detected in global device memory.
   249  
   250  "ecc.errors.corrected.aggregate.register_file"
   251  Errors detected in register file memory.
   252  
   253  "ecc.errors.corrected.aggregate.l1_cache"
   254  Errors detected in the L1 cache.
   255  
   256  "ecc.errors.corrected.aggregate.l2_cache"
   257  Errors detected in the L2 cache.
   258  
   259  "ecc.errors.corrected.aggregate.texture_memory"
   260  Parity errors detected in texture memory.
   261  
   262  "ecc.errors.corrected.aggregate.cbu"
   263  Parity errors detected in CBU.
   264  
   265  "ecc.errors.corrected.aggregate.sram"
   266  Errors detected in global SRAMs.
   267  
   268  "ecc.errors.corrected.aggregate.total"
   269  Total errors detected across entire chip.
   270  
   271  "ecc.errors.uncorrected.volatile.device_memory"
   272  Errors detected in global device memory.
   273  
   274  "ecc.errors.uncorrected.volatile.dram"
   275  Errors detected in global device memory.
   276  
   277  "ecc.errors.uncorrected.volatile.register_file"
   278  Errors detected in register file memory.
   279  
   280  "ecc.errors.uncorrected.volatile.l1_cache"
   281  Errors detected in the L1 cache.
   282  
   283  "ecc.errors.uncorrected.volatile.l2_cache"
   284  Errors detected in the L2 cache.
   285  
   286  "ecc.errors.uncorrected.volatile.texture_memory"
   287  Parity errors detected in texture memory.
   288  
   289  "ecc.errors.uncorrected.volatile.cbu"
   290  Parity errors detected in CBU.
   291  
   292  "ecc.errors.uncorrected.volatile.sram"
   293  Errors detected in global SRAMs.
   294  
   295  "ecc.errors.uncorrected.volatile.total"
   296  Total errors detected across entire chip.
   297  
   298  "ecc.errors.uncorrected.aggregate.device_memory"
   299  Errors detected in global device memory.
   300  
   301  "ecc.errors.uncorrected.aggregate.dram"
   302  Errors detected in global device memory.
   303  
   304  "ecc.errors.uncorrected.aggregate.register_file"
   305  Errors detected in register file memory.
   306  
   307  "ecc.errors.uncorrected.aggregate.l1_cache"
   308  Errors detected in the L1 cache.
   309  
   310  "ecc.errors.uncorrected.aggregate.l2_cache"
   311  Errors detected in the L2 cache.
   312  
   313  "ecc.errors.uncorrected.aggregate.texture_memory"
   314  Parity errors detected in texture memory.
   315  
   316  "ecc.errors.uncorrected.aggregate.cbu"
   317  Parity errors detected in CBU.
   318  
   319  "ecc.errors.uncorrected.aggregate.sram"
   320  Errors detected in global SRAMs.
   321  
   322  "ecc.errors.uncorrected.aggregate.total"
   323  Total errors detected across entire chip.
   324  
   325  Section about retired_pages properties
   326  NVIDIA GPUs can retire pages of GPU device memory when they become unreliable. This can happen when multiple single bit ECC errors occur for the same page, or on a double bit ECC error. When a page is retired, the NVIDIA driver will hide it such that no driver or application memory allocations can access it.
   327  
   328  "retired_pages.single_bit_ecc.count" or "retired_pages.sbe"
   329  The number of GPU device memory pages that have been retired due to multiple single bit ECC errors.
   330  
   331  "retired_pages.double_bit.count" or "retired_pages.dbe"
   332  The number of GPU device memory pages that have been retired due to a double bit ECC error.
   333  
   334  "retired_pages.pending"
   335  Checks if any GPU device memory pages are pending retirement on the next reboot. Pages that are pending retirement can still be allocated, and may cause further reliability issues.
   336  
   337  "temperature.gpu"
   338  Core GPU temperature, in degrees C.
   339  
   340  "temperature.memory"
   341  HBM memory temperature, in degrees C.
   342  
   343  "power.management"
   344  A flag that indicates whether power management is enabled. Either "Supported" or "[Not Supported]". Requires Inforom PWR object version 3.0 or higher or Kepler device.
   345  
   346  "power.draw"
   347  The last measured power draw for the entire board, in watts. Only available if power management is supported. This reading is accurate to within +/- 5 watts.
   348  
   349  "power.limit"
   350  The software power limit in watts. Set by software like nvidia-smi. On Kepler devices Power Limit can be adjusted using [-pl | --power-limit=] switches.
   351  
   352  "enforced.power.limit"
   353  The power management algorithm's power ceiling, in watts. Total board power draw is manipulated by the power management algorithm such that it stays under this value. This value is the minimum of various power limiters.
   354  
   355  "power.default_limit"
   356  The default power management algorithm's power ceiling, in watts. Power Limit will be set back to Default Power Limit after driver unload.
   357  
   358  "power.min_limit"
   359  The minimum value in watts that power limit can be set to.
   360  
   361  "power.max_limit"
   362  The maximum value in watts that power limit can be set to.
   363  
   364  "clocks.current.graphics" or "clocks.gr"
   365  Current frequency of graphics (shader) clock.
   366  
   367  "clocks.current.sm" or "clocks.sm"
   368  Current frequency of SM (Streaming Multiprocessor) clock.
   369  
   370  "clocks.current.memory" or "clocks.mem"
   371  Current frequency of memory clock.
   372  
   373  "clocks.current.video" or "clocks.video"
   374  Current frequency of video encoder/decoder clock.
   375  
   376  Section about clocks.applications properties
   377  User specified frequency at which applications will run. Can be changed with [-ac | --applications-clocks] switches.
   378  
   379  "clocks.applications.graphics" or "clocks.applications.gr"
   380  User specified frequency of graphics (shader) clock.
   381  
   382  "clocks.applications.memory" or "clocks.applications.mem"
   383  User specified frequency of memory clock.
   384  
   385  Section about clocks.default_applications properties
   386  Default frequency at which applications will run. Application clocks can be changed with [-ac | --applications-clocks] switches. Application clocks can be set to default using [-rac | --reset-applications-clocks] switches.
   387  
   388  "clocks.default_applications.graphics" or "clocks.default_applications.gr"
   389  Default frequency of applications graphics (shader) clock.
   390  
   391  "clocks.default_applications.memory" or "clocks.default_applications.mem"
   392  Default frequency of applications memory clock.
   393  
   394  Section about clocks.max properties
   395  Maximum frequency at which parts of the GPU are designed to run.
   396  
   397  "clocks.max.graphics" or "clocks.max.gr"
   398  Maximum frequency of graphics (shader) clock.
   399  
   400  "clocks.max.sm"
   401  Maximum frequency of SM (Streaming Multiprocessor) clock.
   402  
   403  "clocks.max.memory" or "clocks.max.mem"
   404  Maximum frequency of memory clock.
   405  
   406  Section about mig.mode properties
   407  A flag that indicates whether MIG mode is enabled. May be either "Enabled" or "Disabled". Changes to MIG mode require a GPU reset.
   408  
   409  "mig.mode.current"
   410  The MIG mode that the GPU is currently operating under.
   411  
   412  "mig.mode.pending"
   413  The MIG mode that the GPU will operate under after reset.
   414