# Clusterloader2 - experiment rollout

In this doc, any change to the behavior of clusterloader2 that

- enables a new measurement
- changes the semantics of an existing measurement
- changes how clusterloader2 sets up the cluster and runs tests

is referred to as an "experiment".

## Motivation

clusterloader2 is the tool used by all scalability and performance tests. Tests
compile clusterloader2 at [HEAD], so a breaking change to clusterloader2 stops
the scalability tests from passing at all. Such changes have a large blast
radius: every PR to k/k needs to pass these tests. Also, while breakages of
smaller and faster tests aren't that costly (unless they happen on the weekend,
see https://github.com/kubernetes/perf-tests/pull/586), they are expensive for
large, rarely run tests (e.g. [ci-kubernetes-e2e-gce-scale-performance]).

For this reason, all new changes/features added to clusterloader2 shall be gated
and rolled out gradually. This way, we can minimize the blast radius of breaking
changes. We can even prevent some kinds of them from happening at all, as they
should be caught at presubmit time and never allowed to merge.

### General principles

1. Allow at least 24h between changes to test configs to ensure the experiment
   is stable. We don't want to block PRs to k/k because of a flaky experiment in
   clusterloader2.

1. Check with sig-scalability whether there is an ongoing regression in the test
   you want to enable the experiment for. We don't want new features in
   clusterloader2 to interfere with regression debugging.

1. For tests not listed below, it's fine to enable experiments at your
   convenience.

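The 24h rule from the first principle is easy to check mechanically before you open the next config PR. The following is only a minimal sketch, assuming you track the merge time of the last test-config change yourself; `MIN_SOAK` and `ready_for_next_change` are hypothetical names and are not part of clusterloader2 or test-infra.

```python
from datetime import datetime, timedelta, timezone

# Minimum time to wait between two changes to test configs (first principle).
MIN_SOAK = timedelta(hours=24)


def ready_for_next_change(last_change, now=None):
    """Return True once at least MIN_SOAK has passed since last_change.

    Both arguments are timezone-aware datetimes; `now` defaults to the
    current UTC time and is a parameter only to make the check testable.
    """
    now = datetime.now(timezone.utc) if now is None else now
    return now - last_change >= MIN_SOAK
```

A check like this only covers the waiting period. Whether the experiment was actually stable during those 24h still has to be judged from the test runs.
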
### Step-by-step process

_Each step should be a separate PR._

1. Add a "knob" to turn on the future experiment. Usually this means adding a
   new environment variable (PR to [test-infra]) or a new override file (PR to
   [perf-tests]). At this point the "knob" is not used by any code path.

   Since the knob is not used anywhere, it's a no-op and should be safe.

1. Enable the experiment in [perf-tests presubmits],
   [ci-kubernetes-e2e-gci-gce-scalability] and [ci-kubernetes-kubemark-100-gce].

   Again, since the knob is not used anywhere, it's a no-op and should be safe.

   The perf-tests presubmit runs two jobs: [pull-perf-tests-clusterloader2] and
   [pull-perf-tests-clusterloader2-kubemark]. The primary role of those
   presubmits is to catch bugs in code from perf-tests, so we enable the
   experiment for both of those jobs first. Enabling the experiment in
   [ci-kubernetes-e2e-gci-gce-scalability] should give you enough data points to
   determine whether the experiment works once we add the new code path or
   config. It also runs frequently enough that you can revert quickly in case
   of problems. Before enabling the experiment in
   [ci-kubernetes-e2e-gci-gce-scalability], make sure there is no ongoing
   regression affecting this test. If we are in code freeze or thaw, you should
   wait with updating [ci-kubernetes-e2e-gci-gce-scalability] until the freeze
   is lifted.

1. Add a new code path or config that uses the "knob" added in the first PR.

   We've already enabled the knob in the previous step, so the PR will only be
   merged if the new code path or configuration passes [perf-tests presubmits].

1. Enable the experiment in [pull-kubernetes-e2e-gce-100-performance] and
   [pull-kubernetes-kubemark-e2e-gce-big].

   Changing presubmit definitions in test-infra can break k/k presubmits, and
   PRs to test-infra don't trigger presubmits in k/k. Once you enable the
   experiment in the presubmit, you need to watch the next 3 runs after your PR
   is merged to detect breakages.

1. Enable the experiment in [ci-kubernetes-e2e-gce-scale-performance] and the
   rest of the Kubemark tests ([ci-kubernetes-kubemark-500-gce],
   [ci-kubernetes-kubemark-gce-scale], and
   [ci-kubernetes-kubemark-high-density-100-gce]).

   If feasible, please test the experiment locally first (that is, outside of
   the jobs run on Prow), as those tests run once a day and are expensive.
   Double-check with sig-scalability that there is no ongoing regression in big
   clusters.

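The reason the first two steps are no-ops is that the knob defaults to off and nothing reads it until the third PR. The sketch below illustrates that gating pattern with an environment variable. It is written in Python for brevity and is not clusterloader2 code; the knob name `CL2_ENABLE_EXAMPLE_EXPERIMENT` and both functions are hypothetical and stand in for whatever your PRs introduce.

```python
import os

# Hypothetical knob name, used only for illustration.
KNOB = "CL2_ENABLE_EXAMPLE_EXPERIMENT"


def experiment_enabled(env=None):
    """Return True only if the knob is explicitly set to "true".

    Defaulting to off keeps every job that has not opted in on the
    existing behavior.
    """
    env = os.environ if env is None else env
    return env.get(KNOB, "false").strip().lower() == "true"


def measurement_mode(env=None):
    """Choose the code path; this is the kind of change made in the third PR."""
    return "experimental" if experiment_enabled(env) else "default"
```

With this shape, a job whose environment does not set the knob keeps the default path, so enabling the experiment job by job (steps two, four and five) changes behavior only for the jobs that set it.
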
[head]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L121
[ci-kubernetes-e2e-gce-scale-performance]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L44
[ci-kubernetes-e2e-gci-gce-scalability]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-release-blocking-jobs.yaml#L98
[ci-kubernetes-kubemark-100-gce]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L258
[ci-kubernetes-kubemark-500-gce]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L307
[ci-kubernetes-kubemark-gce-scale]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L355
[ci-kubernetes-kubemark-high-density-100-gce]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-periodic-jobs.yaml#L406
[perf-tests presubmits]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L267
[pull-kubernetes-e2e-gce-100-performance]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L3
[pull-kubernetes-kubemark-e2e-gce-big]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L149
[pull-perf-tests-clusterloader2-kubemark]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L317
[pull-perf-tests-clusterloader2]: https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/sig-scalability/sig-scalability-presubmit-jobs.yaml#L268
[test-infra]: https://github.com/kubernetes/test-infra
[perf-tests]: https://github.com/kubernetes/perf-tests