# Clusterloader2 - experiment rollout

In this doc, any change to the behavior of clusterloader2 that

- enables a new measurement
- changes the semantics of an existing measurement
- changes how clusterloader2 sets up the cluster and runs tests

is referred to as an "experiment".

## Motivation

clusterloader2 is a tool used by all scalability and performance tests. Tests
compile clusterloader2 at [HEAD], so introducing a breaking change to
clusterloader2 will stop scalability tests from passing at all. These tests have
a large blast radius: every PR to k/k needs to pass them. Also, while breakages
aren't that costly for smaller and faster tests (unless they happen on the
weekend, see https://github.com/kubernetes/perf-tests/pull/586), they are
expensive for large, rarely run tests (e.g.
[ci-kubernetes-e2e-gce-scale-performance]).

For this reason, all new changes/features added to clusterloader2 shall be gated
and rolled out gradually. This way, we can minimize the blast radius of breaking
changes. We can even prevent some kinds of breakage entirely, as they should be
caught at presubmit time and never allowed to merge.

### General principles

1. Allow at least 24h between changes to test configs to ensure the experiment
   is stable. We don't want to block PRs to k/k because of a flaky experiment in
   clusterloader2.

1. Check with sig-scalability whether there is a regression in the test you want
   to enable the experiment for. We don't want new features in clusterloader2 to
   interfere with regression debugging.

1. For tests not listed below, it's fine to enable experiments at your
   convenience.

### Step-by-step process

_Each step should be a separate PR_

1. Add a "knob" to turn on the future experiment. Usually this means adding a
   new environment variable (PR to [test-infra]) or a new override file (PR to
   [perf-tests]). At this point the "knob" is not used by any code path.

   Since the knob is not used anywhere, it's a no-op and should be safe.

1. Enable the experiment in [perf-tests presubmits],
   [ci-kubernetes-e2e-gci-gce-scalability] and [ci-kubernetes-kubemark-100-gce].

   Again, since the knob is not used anywhere, it's a no-op and should be safe.

   The perf-tests presubmits run two jobs: [pull-perf-tests-clusterloader2] and
   [pull-perf-tests-clusterloader2-kubemark]. The primary role of those
   presubmits is to catch bugs in code from perf-tests, so we enable the
   experiment for both of those jobs first. Enabling the experiment in
   [ci-kubernetes-e2e-gci-gce-scalability] should give you enough data points to
   determine whether the experiment works once we add the new code path or
   config. It also runs frequently enough that you can revert quickly in case of
   problems. Before enabling the experiment in
   [ci-kubernetes-e2e-gci-gce-scalability], make sure there is no ongoing
   regression affecting this test. If we are in code freeze or thaw, you should
   wait with updating [ci-kubernetes-e2e-gci-gce-scalability] until the freeze
   is lifted.

1. Add a new code path or config that uses the "knob" added in the first PR.

   We've already enabled it in the previous step, so the PR will only be merged
   if the new code path or configuration passes [perf-tests presubmits].

1. Enable the experiment in [pull-kubernetes-e2e-gce-100-performance] and
   [pull-kubernetes-kubemark-e2e-gce-big].

   Changing presubmit definitions in test-infra can break k/k presubmits. PRs to
   test-infra don't trigger presubmits in k/k. Once you enable the experiment in
   the presubmit, you need to watch the next 3 runs after your PR is merged to
   detect breakages.

1. Enable the experiment in [ci-kubernetes-e2e-gce-scale-performance] and the
   rest of the Kubemark tests ([ci-kubernetes-kubemark-500-gce],
   [ci-kubernetes-kubemark-gce-scale], and
   [ci-kubernetes-kubemark-high-density-100-gce]).

   If feasible, please test the experiment locally first (that is, outside of
   the jobs run on Prow), as those tests run once a day and are expensive.
   Double-check with sig-scalability that there is no ongoing regression in big
   clusters.

[head]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L121
[ci-kubernetes-e2e-gce-scale-performance]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L44
[ci-kubernetes-e2e-gci-gce-scalability]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L98
[ci-kubernetes-kubemark-100-gce]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L258
[ci-kubernetes-kubemark-500-gce]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L307
[ci-kubernetes-kubemark-gce-scale]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L355
[ci-kubernetes-kubemark-high-density-100-gce]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L406
[perf-tests presubmits]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L267
[pull-kubernetes-e2e-gce-100-performance]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L3
[pull-kubernetes-kubemark-e2e-gce-big]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L149
[pull-perf-tests-clusterloader2-kubemark]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L317
[pull-perf-tests-clusterloader2]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L268
[test-infra]: https://github.com/kubernetes/test-infra
[perf-tests]: https://github.com/kubernetes/perf-tests
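
### Example: gating a code path behind a knob

The step-by-step process relies on a "knob" that is a no-op until a code path
reads it, and that leaves the default behavior untouched when it is unset. The
following is a minimal sketch of that idea, not clusterloader2 code: it is
written in Python for brevity, and the environment variable name
`CL2_ENABLE_EXAMPLE_EXPERIMENT` as well as the functions `experiment_enabled`
and `run_measurement` are hypothetical names made up for illustration.

```python
import os

# Hypothetical knob name, used only for illustration; it is not a real
# clusterloader2 setting.
KNOB = "CL2_ENABLE_EXAMPLE_EXPERIMENT"

# Values that count as "enabled"; anything else, including unset, is "off".
_TRUE_VALUES = {"1", "true", "yes", "on"}


def experiment_enabled(knob, env=None):
    """Return True only when the knob is explicitly set to a true-like value."""
    env = os.environ if env is None else env
    return env.get(knob, "").strip().lower() in _TRUE_VALUES


def run_measurement(env=None):
    """Pick the code path based on the knob; the default path is unchanged."""
    if experiment_enabled(KNOB, env):
        return "experimental path"
    return "default path"
```

Because an unset or unrecognized value falls back to the default path, jobs
that have not yet enabled the knob keep their existing behavior, which is what
makes enabling it job by job safe.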