## Scheduler Monitoring

## Introduction
Currently, users can leverage controller logs and job events to monitor the scheduler. While useful for debugging, none of these options is particularly practical for monitoring kube-batch behaviour over time. There is also a requirement to monitor kube-batch in a single view so that critical performance issues can be resolved in time [#427](https://github.com/kubernetes-sigs/kube-batch/issues/427).

This document describes the metrics we want to add to kube-batch to better monitor performance.

## Metrics
To support metrics, kube-batch needs to expose a metrics endpoint that provides Go process metrics (number of goroutines, GC duration, CPU and memory usage, etc.) as well as kube-batch custom metrics related to the time taken by plugins or actions.

All the metrics are prefixed with `kube_batch_`.

### kube-batch execution
These metrics track the execution of plugins and actions in the kube-batch loop.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| e2e_scheduling_latency | histogram | | End-to-end scheduling latency in seconds |
| plugin_latency | histogram | `plugin=<plugin_name>` | Scheduling latency for each plugin |
| action_latency | histogram | `action=<action_name>` | Scheduling latency for each action |
| task_latency | histogram | `job=<job_id>` `task=<task_id>` | Scheduling latency for each task |


### kube-batch operations
These metrics describe the internal state of kube-batch.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| pod_schedule_errors | Counter | | Number of times kube-batch failed to schedule due to an error |
| pod_schedule_successes | Counter | | Number of times kube-batch succeeded in scheduling a job |
| pod_preemption_victims | Counter | | Number of selected preemption victims |
| total_preemption_attempts | Counter | | Total preemption attempts in the cluster so far |
| unschedule_task_count | Counter | `job=<job_id>` | Number of tasks that failed to schedule |
| unschedule_job_counts | Counter | | Number of jobs that failed to schedule in each iteration |
| job_retry_counts | Counter | `job=<job_id>` | Number of retries of a job |


### kube-batch Liveness
A health check based on the time of the last kube-batch activity and a timeout.