# Prometheus Monitoring for Training Job

## Available Metrics

The metrics currently available for monitoring are listed below.

**Job Creation**

```
training_operator_jobs_created_total
```

**Job Deletion**

```
training_operator_jobs_deleted_total
```

**Successful Job Completions**

```
training_operator_jobs_successful_total
```

**Failed Jobs**

```
training_operator_jobs_failed_total
```

**Restarted Jobs**

```
training_operator_jobs_restarted_total
```
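
As a rough illustration of how these metrics might be consumed, the sketch below parses a Prometheus text-format scrape and sums each `training_operator_jobs_*_total` series across its labels. The sample payload, the label names (`job_namespace`, `framework`), all values, and the helper names `sum_counters` and `failure_ratio` are assumptions made for this example; they are not taken from the operator's documentation. The failed-to-created ratio is only a crude health signal, since jobs that are still running count as created but not yet as successful or failed.

```python
import re
from collections import defaultdict

# Hypothetical scrape output in the Prometheus text exposition format.
# Label names and values are assumptions for illustration only.
SAMPLE = """\
# HELP training_operator_jobs_created_total Counts number of jobs created
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="default"} 10
training_operator_jobs_created_total{framework="pytorch",job_namespace="default"} 5
training_operator_jobs_successful_total{framework="tensorflow",job_namespace="default"} 7
training_operator_jobs_failed_total{framework="tensorflow",job_namespace="default"} 3
training_operator_jobs_failed_total{framework="pytorch",job_namespace="default"} 1
"""

# Matches a sample line: metric name, optional {labels}, then the value.
# Simplification: assumes label values never contain a "}" character.
LINE_RE = re.compile(r"^(training_operator_jobs_\w+_total)(?:\{[^}]*\})?\s+(\S+)")


def sum_counters(text):
    """Sum every training_operator_jobs_*_total series across all labels."""
    totals = defaultdict(float)
    for line in text.splitlines():
        if line.startswith("#"):  # skip HELP and TYPE comment lines
            continue
        match = LINE_RE.match(line)
        if match:
            totals[match.group(1)] += float(match.group(2))
    return dict(totals)


def failure_ratio(totals):
    """Return failed / created, or 0.0 when no jobs have been created."""
    created = totals.get("training_operator_jobs_created_total", 0.0)
    failed = totals.get("training_operator_jobs_failed_total", 0.0)
    return failed / created if created else 0.0


if __name__ == "__main__":
    totals = sum_counters(SAMPLE)
    for name, value in sorted(totals.items()):
        print(f"{name}: {value}")
    print(f"failure ratio: {failure_ratio(totals):.3f}")
```

In practice the text would come from the operator's metrics endpoint or, more commonly, the aggregation would be done in Prometheus itself; the parsing here only shows the shape of the data.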