## Scheduler Monitoring

## Introduction
Currently, users can rely on controller logs and job events to monitor the scheduler. While useful for debugging, none of these options is particularly practical for monitoring kube-batch behaviour over time. There is also a requirement to monitor kube-batch in a single view so that critical performance issues can be resolved in time ([#427](https://github.com/kubernetes-sigs/kube-batch/issues/427)).

This document describes the metrics we want to add to kube-batch to better monitor its performance.

## Metrics
To support metrics, kube-batch needs to expose a metrics endpoint that provides Go process metrics (number of goroutines, GC duration, CPU and memory usage, etc.) as well as kube-batch custom metrics related to the time taken by plugins and actions.

All the metrics are prefixed with `kube_batch_`.
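As a rough illustration of the endpoint and the naming rule, the sketch below serves a single counter over HTTP using only the Go standard library. It assumes a Prometheus-style text format; `metricName`, `metricsHandler`, and the `/metrics` path are hypothetical names chosen for this example, not part of kube-batch.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// metricPrefix is the common prefix required for every metric name.
const metricPrefix = "kube_batch_"

// metricName (hypothetical helper) builds the full name from a short name.
func metricName(short string) string {
	return metricPrefix + short
}

// scheduleSuccesses backs the pod_schedule_successes counter in this sketch.
var scheduleSuccesses int64

// metricsHandler (hypothetical) writes the counter in a Prometheus-style
// text format: one TYPE line followed by one sample line.
func metricsHandler(w http.ResponseWriter, _ *http.Request) {
	name := metricName("pod_schedule_successes")
	fmt.Fprintf(w, "# TYPE %s counter\n", name)
	fmt.Fprintf(w, "%s %d\n", name, atomic.LoadInt64(&scheduleSuccesses))
}

func main() {
	atomic.AddInt64(&scheduleSuccesses, 1)

	// Exercise the handler in-process instead of binding a real port.
	rec := httptest.NewRecorder()
	metricsHandler(rec, httptest.NewRequest(http.MethodGet, "/metrics", nil))
	fmt.Print(rec.Body.String())
}
```

A real deployment would more likely register the metrics with a metrics client library and mount its handler on the scheduler's HTTP server; the hand-written handler here only shows the shape of the output.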

### kube-batch execution
These metrics track the execution of plugins and actions in the kube-batch loop.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| e2e_scheduling_latency | Histogram |  | End-to-end scheduling latency in seconds |
| plugin_latency | Histogram | `plugin`=`<plugin_name>` | Scheduling latency for each plugin |
| action_latency | Histogram | `action`=`<action_name>` | Scheduling latency for each action |
| task_latency | Histogram | `job`=`<job_id>`, `task`=`<task_id>` | Scheduling latency for each task |
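To make the labelled latency histograms more concrete, here is a minimal sketch of how an action's duration could be measured and recorded under its `action` label. `latencyHistogram`, `observe`, and `timeAction` are hypothetical names, the bucket bounds are arbitrary, and the action name `allocate` is used only as an example label value.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// latencyHistogram is a simplified, hypothetical stand-in for a labelled
// histogram: for each label value it keeps cumulative bucket counts.
type latencyHistogram struct {
	mu      sync.Mutex
	bounds  []float64           // bucket upper bounds in seconds, ascending
	buckets map[string][]uint64 // label value -> cumulative count per bound
	counts  map[string]uint64   // label value -> total observations
}

func newLatencyHistogram(bounds []float64) *latencyHistogram {
	return &latencyHistogram{
		bounds:  bounds,
		buckets: make(map[string][]uint64),
		counts:  make(map[string]uint64),
	}
}

// observe records one latency sample (in seconds) for the given label value.
func (h *latencyHistogram) observe(label string, seconds float64) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if _, ok := h.buckets[label]; !ok {
		h.buckets[label] = make([]uint64, len(h.bounds))
	}
	// Cumulative buckets: a sample counts toward every bound it fits under.
	for i, bound := range h.bounds {
		if seconds <= bound {
			h.buckets[label][i]++
		}
	}
	h.counts[label]++
}

// timeAction runs fn and records how long it took under the action name.
func (h *latencyHistogram) timeAction(action string, fn func()) {
	start := time.Now()
	fn()
	h.observe(action, time.Since(start).Seconds())
}

func main() {
	actionLatency := newLatencyHistogram([]float64{0.1, 1, 10})
	actionLatency.timeAction("allocate", func() {
		time.Sleep(5 * time.Millisecond) // placeholder for real work
	})
	fmt.Println(actionLatency.counts["allocate"], actionLatency.buckets["allocate"])
}
```

The same wrapper pattern would apply to `plugin_latency`, with the plugin name as the label value.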

### kube-batch operations
These metrics describe the internal state of kube-batch.

| Metric name | Metric type | Labels | Description |
| ----------- | ----------- | ------ | ----------- |
| pod_schedule_errors | Counter |  | Number of times kube-batch failed due to an error |
| pod_schedule_successes | Counter |  | Number of times kube-batch succeeded in scheduling a job |
| pod_preemption_victims | Counter |  | Number of selected preemption victims |
| total_preemption_attempts | Counter |  | Total number of preemption attempts in the cluster so far |
| unschedule_task_count | Counter | `job`=`<job_id>` | Number of tasks that failed to schedule |
| unschedule_job_counts | Counter |  | Number of jobs that failed to schedule in each iteration |
| job_retry_counts | Counter | `job`=`<job_id>` | Number of times a job has been retried |
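The per-job counters in the table can be pictured as one running total per `job` label value. The sketch below is an illustration under that assumption; `jobCounter`, `inc`, and `render` are hypothetical names, and the sample-line format assumes a Prometheus-style text output.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// jobCounter is a hypothetical counter keyed by the value of the `job` label.
type jobCounter struct {
	mu     sync.Mutex
	name   string
	values map[string]uint64
}

func newJobCounter(name string) *jobCounter {
	return &jobCounter{name: name, values: make(map[string]uint64)}
}

// inc adds one to the counter for the given job ID.
func (c *jobCounter) inc(jobID string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[jobID]++
}

// render returns one sample line per job, sorted by job ID so that the
// output is deterministic.
func (c *jobCounter) render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	ids := make([]string, 0, len(c.values))
	for id := range c.values {
		ids = append(ids, id)
	}
	sort.Strings(ids)
	var b strings.Builder
	for _, id := range ids {
		fmt.Fprintf(&b, "%s{job=%q} %d\n", c.name, id, c.values[id])
	}
	return b.String()
}

func main() {
	retries := newJobCounter("kube_batch_job_retry_counts")
	retries.inc("job-a")
	retries.inc("job-a")
	retries.inc("job-b")
	fmt.Print(retries.render())
}
```

Because every distinct job ID creates its own series, a counter labelled by `job` grows with the number of jobs seen, which is worth keeping in mind when choosing labels.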

### kube-batch Liveness
A health check based on the time of the last kube-batch activity and a timeout.
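One possible reading of this check, sketched under assumptions: the scheduler records a timestamp whenever it does work, and a health endpoint reports failure once that timestamp is older than the timeout. The `liveness` type, the `/healthz` path, and the 30-second timeout are all hypothetical choices for the example.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// liveness (hypothetical) tracks the last time the scheduler was active.
type liveness struct {
	mu      sync.Mutex
	last    time.Time
	timeout time.Duration
}

// touch records scheduler activity at the given time.
func (l *liveness) touch(now time.Time) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.last = now
}

// healthy reports whether the last activity is no older than the timeout.
func (l *liveness) healthy(now time.Time) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	return now.Sub(l.last) <= l.timeout
}

// ServeHTTP answers 200 while healthy and 503 once the timeout has elapsed.
func (l *liveness) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	if l.healthy(time.Now()) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprintln(w, "ok")
		return
	}
	http.Error(w, "no recent scheduler activity", http.StatusServiceUnavailable)
}

func main() {
	l := &liveness{timeout: 30 * time.Second}
	l.touch(time.Now()) // would be called at the end of each scheduling cycle

	// Exercise the handler in-process instead of binding a real port.
	rec := httptest.NewRecorder()
	l.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/healthz", nil))
	fmt.Println(rec.Code)
}
```

Passing the current time into `touch` and `healthy` keeps the timeout logic easy to test without waiting for real time to pass.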