# Developing measurement

## Prerequisite
It's strongly recommended to get familiar with the [Getting started] tutorial.

Also, you can check out the [Design] doc of Clusterloader2.

## Introduction

All measurements are implemented [here][Measurements].
You can find the current Measurement interface [here][Measurement interface].

The Measurement interface consists of three methods:
```golang
Execute(config *Config) ([]Summary, error)
Dispose()
String() string
```
The `Execute` method is called each time you specify your measurement in the test config.
Usually a measurement is specified twice in the test config - once when we start measuring something and a second time when we want to gather the results.
To distinguish these two cases, you need to specify a parameter called `action` that takes the values `start` and `gather`.
For example, for PodStartupLatency you specify the action and some additional parameters when starting the measurement:

```yaml
- Identifier: PodStartupLatency
  Method: PodStartupLatency
  Params:
    action: start
    labelSelector: group = test-pod
    threshold: 20s
```
And then later you can just gather the results with:
```yaml
- Identifier: PodStartupLatency
  Method: PodStartupLatency
  Params:
    action: gather
```

The `Dispose` method can be used to clean up once the measurement is no longer needed.

The `String()` method should return the name of the measurement.

## Implementing simple measurement

Let's start with implementing a simple measurement that will track the maximum number of running pods.
First, let's specify the package and the measurement name:

```golang
package common

const (
    maxRunningPodsMeasurementName = "MaxRunningPods"
)
```
Next, let's define the structure and create a constructor. We will need two integers for tracking the current and the maximum number of running pods.
Our measurement will have two actions, `start` and `gather`, so we also want to know whether the measurement is currently running.
We will also need a channel for stopping the informer.
```golang
type maxRunningPodsMeasurement struct {
    maxRunningPods     int
    currentRunningPods int
    stopCh             chan struct{}
    isRunning          bool
    lock               sync.Mutex
}

func createMaxRunningPodsMeasurement() measurement.Measurement {
    return &maxRunningPodsMeasurement{}
}
```

Once we have it, we want to register our new measurement so it can be used in the test config:
```golang
func init() {
    if err := measurement.Register(maxRunningPodsMeasurementName, createMaxRunningPodsMeasurement); err != nil {
        klog.Fatalf("Cannot register %s: %v", maxRunningPodsMeasurementName, err)
    }
}
```

Next, let's implement the `String` method:
```golang
func (*maxRunningPodsMeasurement) String() string {
    return maxRunningPodsMeasurementName
}
```
Next, we need to implement the `Execute` method. We want to support the two actions - `start` and `gather`:
```golang
func (s *maxRunningPodsMeasurement) Execute(config *measurement.Config) ([]measurement.Summary, error) {
    action, err := util.GetString(config.Params, "action")
    if err != nil {
        return nil, err
    }
    switch action {
    case "start":
        return nil, s.start(config.ClusterFramework.GetClientSets().GetClient())
    case "gather":
        return s.gather()
    default:
        return nil, fmt.Errorf("unknown action %v", action)
    }
}
```

Now, let's focus on the `start` method. First, we want to check whether the measurement is already running - in that case we return an error.
If it is not running, we start an informer that will receive events about all pods in the cluster.

```golang
func (m *maxRunningPodsMeasurement) start(c clientset.Interface) error {
    if m.isRunning {
        return fmt.Errorf("%s: measurement already running", m)
    }
    klog.V(2).Infof("%s: starting max pod measurement...", m)
    m.isRunning = true
    m.stopCh = make(chan struct{})
    i := informer.NewInformer(
        &cache.ListWatch{
            ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                return c.CoreV1().Pods("").List(context.TODO(), options)
            },
            WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                return c.CoreV1().Pods("").Watch(context.TODO(), options)
            },
        },
        m.checkPod,
    )
    return informer.StartAndSync(i, m.stopCh, informerSyncTimeout)
}
```

The next step is implementing the `checkPod` method, which is responsible for counting pods in the `Running` state.
As arguments, we get the old and the new version of an object.
There are two special cases: if the old version is nil, the object has just been added, and if the new version is nil, the object was deleted.
Since the informer callback can run concurrently with `gather`, we guard the counters with the mutex.

```golang
func MaxInt(x, y int) int {
    if x < y {
        return y
    }
    return x
}

func (m *maxRunningPodsMeasurement) checkPod(oldObj, newObj interface{}) {
    isPodRunning := func(obj interface{}) bool {
        pod, ok := obj.(*corev1.Pod)
        if !ok {
            klog.Warningf("Couldn't convert object to Pod")
            return false
        }
        return pod != nil && pod.Status.Phase == corev1.PodRunning
    }

    change := 0
    if isPodRunning(oldObj) {
        change--
    }
    if isPodRunning(newObj) {
        change++
    }

    m.lock.Lock()
    defer m.lock.Unlock()
    m.currentRunningPods += change
    m.maxRunningPods = MaxInt(m.maxRunningPods, m.currentRunningPods)
    klog.V(2).Infof("Max: %d, current: %d", m.maxRunningPods, m.currentRunningPods)
}
```

Finally, we can implement the `gather` method. The easiest way of creating a summary is to define a structure with JSON annotations and pass the serialized JSON to the `CreateSummary` function like this:
```golang
type runningPods struct {
    Max int `json:"max"`
}

func (m *maxRunningPodsMeasurement) gather() ([]measurement.Summary, error) {
    if !m.isRunning {
        return nil, fmt.Errorf("measurement %s has not been started", maxRunningPodsMeasurementName)
    }

    m.lock.Lock()
    defer m.lock.Unlock()
    runningPods := &runningPods{Max: m.maxRunningPods}
    content, err := util.PrettyPrintJSON(runningPods)
    if err != nil {
        return nil, err
    }
    summary := measurement.CreateSummary(maxRunningPodsMeasurementName, "json", content)
    return []measurement.Summary{summary}, err
}
```
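Summaries are written by Clusterloader2 to its report directory. Assuming the observed maximum was, say, 50 running pods, the content of this measurement's JSON summary would look roughly like this:
```json
{
  "max": 50
}
```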
Last but not least, let's implement the `Dispose` method. In this method we want to close the channel so that the informer is stopped as well.
```golang
func (s *maxRunningPodsMeasurement) Dispose() {
    s.stop()
}

func (m *maxRunningPodsMeasurement) stop() {
    if m.isRunning {
        m.isRunning = false
        close(m.stopCh)
    }
}
```

Combining it all together, we have the full implementation:
```golang
package common

import (
    "context"
    "fmt"
    "sync"

    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime"
    "k8s.io/apimachinery/pkg/watch"
    clientset "k8s.io/client-go/kubernetes"
    "k8s.io/client-go/tools/cache"
    "k8s.io/klog/v2"
    "k8s.io/perf-tests/clusterloader2/pkg/measurement"
    "k8s.io/perf-tests/clusterloader2/pkg/measurement/util/informer"
    "k8s.io/perf-tests/clusterloader2/pkg/util"
)

const (
    maxRunningPodsMeasurementName = "MaxRunningPods"
)

type maxRunningPodsMeasurement struct {
    maxRunningPods     int
    currentRunningPods int
    stopCh             chan struct{}
    isRunning          bool
    lock               sync.Mutex
}

func createMaxRunningPodsMeasurement() measurement.Measurement {
    return &maxRunningPodsMeasurement{}
}

func init() {
    if err := measurement.Register(maxRunningPodsMeasurementName, createMaxRunningPodsMeasurement); err != nil {
        klog.Fatalf("Cannot register %s: %v", maxRunningPodsMeasurementName, err)
    }
}

func (s *maxRunningPodsMeasurement) Dispose() {
    s.stop()
}

func (*maxRunningPodsMeasurement) String() string {
    return maxRunningPodsMeasurementName
}

func (s *maxRunningPodsMeasurement) Execute(config *measurement.Config) ([]measurement.Summary, error) {
    action, err := util.GetString(config.Params, "action")
    if err != nil {
        return nil, err
    }
    switch action {
    case "start":
        return nil, s.start(config.ClusterFramework.GetClientSets().GetClient())
    case "gather":
        return s.gather()
    default:
        return nil, fmt.Errorf("unknown action %v", action)
    }
}

func (m *maxRunningPodsMeasurement) start(c clientset.Interface) error {
    if m.isRunning {
        return fmt.Errorf("%s: measurement already running", m)
    }
    klog.V(2).Infof("%s: starting max pod measurement...", m)
    m.isRunning = true
    m.stopCh = make(chan struct{})
    i := informer.NewInformer(
        &cache.ListWatch{
            ListFunc: func(options metav1.ListOptions) (runtime.Object, error) {
                return c.CoreV1().Pods("").List(context.TODO(), options)
            },
            WatchFunc: func(options metav1.ListOptions) (watch.Interface, error) {
                return c.CoreV1().Pods("").Watch(context.TODO(), options)
            },
        },
        m.checkPod,
    )
    return informer.StartAndSync(i, m.stopCh, informerSyncTimeout)
}

func (m *maxRunningPodsMeasurement) stop() {
    if m.isRunning {
        m.isRunning = false
        close(m.stopCh)
    }
}

func MaxInt(x, y int) int {
    if x < y {
        return y
    }
    return x
}

func (m *maxRunningPodsMeasurement) checkPod(oldObj, newObj interface{}) {
    isPodRunning := func(obj interface{}) bool {
        pod, ok := obj.(*corev1.Pod)
        if !ok {
            klog.Warningf("Couldn't convert object to Pod")
            return false
        }
        return pod != nil && pod.Status.Phase == corev1.PodRunning
    }

    change := 0
    if isPodRunning(oldObj) {
        change--
    }
    if isPodRunning(newObj) {
        change++
    }

    m.lock.Lock()
    defer m.lock.Unlock()
    m.currentRunningPods += change
    m.maxRunningPods = MaxInt(m.maxRunningPods, m.currentRunningPods)
    klog.V(2).Infof("Max: %d, current: %d", m.maxRunningPods, m.currentRunningPods)
}

type runningPods struct {
    Max int `json:"max"`
}

func (m *maxRunningPodsMeasurement) gather() ([]measurement.Summary, error) {
    if !m.isRunning {
        return nil, fmt.Errorf("measurement %s has not been started", maxRunningPodsMeasurementName)
    }

    m.lock.Lock()
    defer m.lock.Unlock()
    runningPods := &runningPods{Max: m.maxRunningPods}
    content, err := util.PrettyPrintJSON(runningPods)
    if err != nil {
        return nil, err
    }
    summary := measurement.CreateSummary(maxRunningPodsMeasurementName, "json", content)
    return []measurement.Summary{summary}, err
}
```

## Trying new measurement

Once we have the whole implementation, you can try it out. The easiest way is to modify the [Getting started] example `config.yaml` by replacing `PodStartupLatency` with `MaxRunningPods`.
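A minimal sketch of the replaced start and gather steps (our measurement takes no parameters other than `action`, so the rest of the config stays as in the example):
```yaml
- Identifier: MaxRunningPods
  Method: MaxRunningPods
  Params:
    action: start

# ... and in the measurements step at the end of the test:

- Identifier: MaxRunningPods
  Method: MaxRunningPods
  Params:
    action: gather
```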
## Enabling new measurement in tests

Once your measurement is implemented and tested, we need to add it to the existing scalability test(s).
This process needs to be done very carefully, because any mistake can end up blocking all PRs to the kubernetes repository.
You can follow these instructions: [Rollout process]

## Prometheus based measurements

Sometimes you can implement your measurement based on a Prometheus metric. In this case, there are two steps:
- Gathering the new metric (if it's not already gathered)
- Implementing a measurement based on the Prometheus metric

Let's start with two examples of how you can gather more Prometheus metrics.

### Gathering Prometheus metric

You can check example PodMonitor and ServiceMonitor definitions [here][Monitors]. Let's go through two examples.
PodMonitor for NodeLocalDNS:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  labels:
    k8s-app: node-local-dns-pods
  name: node-local-dns-pods
  namespace: monitoring
spec:
  podMetricsEndpoints:
  - interval: 10m
    port: metrics
  jobLabel: k8s-app
  selector:
    matchLabels:
      k8s-app: node-local-dns
  namespaceSelector:
    matchNames:
    - kube-system
```
Any monitor needs to be created within the monitoring namespace.
Most important is specifying the interval, i.e. how often the metric will be scraped (here once every 10 minutes).
The more pods you want to scrape, the lower the scrape frequency should be and/or the more resources the Prometheus server should have.
In this example we scrape up to 5k pods.

If your pods are already behind a service, you can use a ServiceMonitor to scrape metrics:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  labels:
    k8s-app: my-service
  name: my-service
  namespace: monitoring
spec:
  endpoints:
  - interval: 60s
    port: http-metrics
  jobLabel: k8s-app
  namespaceSelector:
    matchNames:
    - kube-system
  selector:
    matchLabels:
      k8s-app: my-service
```

Also, instead of using the port name, you can specify the port number with `targetPort`, as in the sketch below.
You can check more options in the [Prometheus operator doc].
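For example, assuming the pods expose metrics on port 8080 and that port has no name in the Service definition, the endpoint entry could look roughly like this (the port number is just a placeholder):
```yaml
spec:
  endpoints:
  - interval: 60s
    targetPort: 8080
```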
### Implementing new measurement

In this example we will implement a measurement that checks how many pods were scheduled, based on a Prometheus metric.
The apiserver provides multiple metrics, including `apiserver_request_total`, which can be used to check how many pods have been bound to nodes.
We are interested in `apiserver_request_total{verb="POST", resource="pods", subresource="binding", code="201"}`.

Just like before, let's start with defining the package and the measurement name:
```golang
package common

const (
    schedulingThroughputPrometheusMeasurementName = "SchedulingThroughputPrometheus"
)
```

In the case of a measurement based on a Prometheus metric, we only need to specify how the results will be gathered.
You can check the interface [here][Prometheus interface], which looks like this:

```golang
func CreatePrometheusMeasurement(gatherer Gatherer) measurement.Measurement

type Gatherer interface {
    Gather(executor QueryExecutor, startTime time.Time, config *measurement.Config) ([]measurement.Summary, error)
    IsEnabled(config *measurement.Config) bool
    String() string
}
```

So let's start with creating the structure:
```golang
type schedulingThroughputGatherer struct{}
```

Now, we can register our new measurement:

```golang
func init() {
    create := func() measurement.Measurement {
        return CreatePrometheusMeasurement(&schedulingThroughputGatherer{})
    }
    if err := measurement.Register(schedulingThroughputPrometheusMeasurementName, create); err != nil {
        klog.Fatalf("Cannot register %s: %v", schedulingThroughputPrometheusMeasurementName, err)
    }
}
```

And let's define the `IsEnabled` and `String` methods:

```golang
func (a *schedulingThroughputGatherer) String() string {
    return schedulingThroughputPrometheusMeasurementName
}

func (a *schedulingThroughputGatherer) IsEnabled(config *measurement.Config) bool {
    return true
}
```

Next, we need to implement the `Gather` method. We can start with the Prometheus query. We only want to check the maximum scheduling throughput, so the query can look like this:
```golang
const (
    maxSchedulingThroughputQuery = `max_over_time(sum(irate(apiserver_request_total{verb="POST", resource="pods", subresource="binding",code="201"}[1m]))[%v:5s])`
)
```
We only need to provide the duration over which we want to compute it.

Now, let's define the structure for our summary of the results. As mentioned before, we will only gather the maximum scheduling throughput:
```golang
type schedulingThroughputPrometheus struct {
    Max float64 `json:"max"`
}
```
We can now implement the method that will gather the scheduling throughput data.
We need to compute the measurement duration, execute the query, check for errors, and return the previously defined structure:
```golang
func (a *schedulingThroughputGatherer) getThroughputSummary(executor QueryExecutor, startTime time.Time, config *measurement.Config) (*schedulingThroughputPrometheus, error) {
    measurementEnd := time.Now()
    measurementDuration := measurementEnd.Sub(startTime)
    promDuration := measurementutil.ToPrometheusTime(measurementDuration)
    query := fmt.Sprintf(maxSchedulingThroughputQuery, promDuration)

    samples, err := executor.Query(query, measurementEnd)
    if err != nil {
        return nil, err
    }
    if len(samples) != 1 {
        return nil, fmt.Errorf("got unexpected number of samples: %d", len(samples))
    }

    maxSchedulingThroughput := samples[0].Value
    throughputSummary := &schedulingThroughputPrometheus{
        Max: float64(maxSchedulingThroughput),
    }

    return throughputSummary, nil
}
```

We can finally implement the `Gather` method. We need to get the result and convert it to the `measurement.Summary` format:
```golang
func (a *schedulingThroughputGatherer) Gather(executor QueryExecutor, startTime time.Time, config *measurement.Config) ([]measurement.Summary, error) {
    throughputSummary, err := a.getThroughputSummary(executor, startTime, config)
    if err != nil {
        return nil, err
    }

    content, err := util.PrettyPrintJSON(throughputSummary)
    if err != nil {
        return nil, err
    }

    summaries := []measurement.Summary{
        measurement.CreateSummary(a.String(), "json", content),
    }

    return summaries, err
}
```

If we want to add a `threshold` parameter to our measurement so that it fails when the scheduling throughput is below it, we can add this check to the `Gather` method:
```golang
threshold, err := util.GetFloat64OrDefault(config.Params, "threshold", 0)
if threshold > 0 && throughputSummary.Max < threshold {
    err = errors.NewMetricViolationError(
        "scheduler throughput_prometheus",
        fmt.Sprintf("actual throughput %f lower than threshold %f", throughputSummary.Max, threshold))
}
```
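The threshold itself comes from the test config, just like the parameters of `PodStartupLatency` shown earlier. A hypothetical gather step, assuming the usual `start`/`gather` pattern of Prometheus-based measurements (the value `100` is only an illustration):
```yaml
- Identifier: SchedulingThroughputPrometheus
  Method: SchedulingThroughputPrometheus
  Params:
    action: gather
    threshold: 100
```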
You can check the whole implementation [here][Scheduling throughput].

### Unit testing of Prometheus measurements

You can use our testing framework, which runs a local Prometheus query executor, to check whether your measurement works as intended in an end-to-end way without running Clusterloader2 on a live cluster.
This works by providing sample files containing time series as inputs for the measurement's underlying PromQL queries.
See an example of a [unit test][Container restarts test] written in this framework and the [input data][Test input data] used therein.

[Design]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/design.md
[Getting started]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/GETTING_STARTED.md
[Measurement interface]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/interface.go
[Measurements]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader2/pkg/measurement
[Monitors]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader2/pkg/prometheus/manifests/default
[Prometheus interface]: https://github.com/kubernetes/perf-tests/blob/1c21298a325633062d6069c01a3b27933d6dba93/clusterloader2/pkg/measurement/common/prometheus_measurement.go#L45
[Prometheus operator doc]: https://github.com/prometheus-operator/prometheus-operator/blob/master/Documentation/api.md
[Rollout process]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/experiments.md
[Scheduling throughput]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/scheduling_throughput_prometheus.go
[Container restarts test]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/measurement/common/container_restarts_test.go
[Test input data]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader2/pkg/measurement/common/testdata/container_restarts