# Prometheus Monitoring for Training Job

## Available Metrics

Currently available metrics to monitor are listed below.

**Job Creation**

```
training_operator_jobs_created_total
```

**Job Deletion**

```
training_operator_jobs_deleted_total
```

**Successful Job Completions**

```
training_operator_jobs_successful_total
```

**Failed Jobs**

```
training_operator_jobs_failed_total
```

**Restarted Jobs**

```
training_operator_jobs_restarted_total
```
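
As a rough illustration of how these counters might be consumed, the sketch below sums each metric across all of its label sets from a block of Prometheus text-format output. It is a minimal example, not part of the operator: the helper `sum_counters`, the `SAMPLE` text, and the label names and values in it are assumptions made for demonstration, and real output from your deployment may carry different labels.

```python
# Metric names taken from the list above.
JOB_METRICS = (
    "training_operator_jobs_created_total",
    "training_operator_jobs_deleted_total",
    "training_operator_jobs_successful_total",
    "training_operator_jobs_failed_total",
    "training_operator_jobs_restarted_total",
)

# Hypothetical scrape output; the labels and values are illustrative only.
SAMPLE = """\
# TYPE training_operator_jobs_created_total counter
training_operator_jobs_created_total{framework="tensorflow",job_namespace="default"} 3
training_operator_jobs_created_total{framework="pytorch",job_namespace="default"} 2
# TYPE training_operator_jobs_successful_total counter
training_operator_jobs_successful_total{framework="tensorflow",job_namespace="default"} 1
"""


def sum_counters(exposition_text: str) -> dict:
    """Sum each known job counter across all label combinations."""
    totals = {name: 0.0 for name in JOB_METRICS}
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        if "{" in line:
            # Form: name{label="value",...} value [timestamp]
            name = line[: line.index("{")]
            rest = line[line.rindex("}") + 1:]
        else:
            # Form: name value [timestamp]
            name, _, rest = line.partition(" ")
        fields = rest.split()
        if name in totals and fields:
            totals[name] += float(fields[0])
    return totals


if __name__ == "__main__":
    for metric, value in sum_counters(SAMPLE).items():
        print(f"{metric}: {value}")
```

Because these are counters, the absolute totals only ever increase while the operator is running; in practice a rate or increase over a time window is usually more informative than the raw value.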