# ClusterLoader

ClusterLoader2 (CL2) is a "bring your own yaml" Kubernetes load testing tool
and the official K8s scalability and performance testing framework.

The CL2 tests are written in yaml using a semi-declarative paradigm.
A test defines a set of states in which a cluster should be
(e.g. I want to run 10k pods, 2k cluster-ip services, 5 daemon-sets, etc.)
and specifies how fast (e.g. with what pod throughput) a given state should be reached.
In addition, it defines which performance characteristics
should be measured (see the [measurements list](#measurement)).
Last but not least, CL2 provides extra observability of the cluster
during the test with [Prometheus](#prometheus-metrics).

The CL2 test API is described [here][api].

## Getting started

See the [Getting started] guide if you are a new user of ClusterLoader.

### Flags

#### Required

These flags are required for any test to be run.
 - kubeconfig - path to the kubeconfig file.
 - testconfig - path to the test config file. This flag can be used multiple times
if more than one test should be run.
 - provider - cluster provider; one of: gce, gke, kind, kubemark, aws, local, vsphere, skeleton

#### Optional

 - nodes - number of nodes in the cluster.
If not provided, the test will use the number of schedulable nodes in the cluster.
 - report-dir - path to the directory where summary files should be stored.
If not specified, summaries are printed to the standard log.
 - mastername - name of the master node.
 - masterip - DNS name / IP of the master node.
 - testoverrides - path to a file with overrides.
 - kubelet-port - port of the kubelet to use (*default: 10250*)
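
For example, a minimal sketch of an invocation with the required flags (the
test config path, provider and node count here are illustrative):

```shell
# Run from the clusterloader2 directory; paths and values are examples only.
go run cmd/clusterloader.go \
  --kubeconfig=$HOME/.kube/config \
  --testconfig=testing/load/config.yaml \
  --provider=gce \
  --nodes=100 \
  --report-dir=/tmp/cl2-results
```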

## Tests

### Test definition

A test definition is an instantiation of this [api] (in json or yaml).
The motivation and description of the API can be found in the [design doc].
Both test definitions and definitions of individual objects support templating.
Test definition templates come with one predefined value - ```{{.Nodes}}```,
which represents the number of schedulable nodes in the cluster. \
An example of a test definition can be found here: [load test].
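
The sketch below shows the overall shape of a test definition under the
current [api]; the names, the ```pod.yaml``` template path and the parameter
values are illustrative rather than taken from a real test:

```yaml
# Illustrative test definition: create one pod per schedulable node at 1 qps
# and measure pod startup latency.
name: minimal-test
namespace:
  number: 1
tuningSets:
- name: Uniform1qps
  qpsLoad:
    qps: 1
steps:
- name: Start measuring pod startup
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: start
      labelSelector: group = test-pod
- name: Create pods
  phases:
  - namespaceRange:
      min: 1
      max: 1
    replicasPerNamespace: {{.Nodes}}
    tuningSet: Uniform1qps
    objectBundle:
    - basename: test-pod
      objectTemplatePath: pod.yaml
- name: Gather pod startup latency
  measurements:
  - Identifier: PodStartupLatency
    Method: PodStartupLatency
    Params:
      action: gather
```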

### Modules

ClusterLoader2 supports modularization of the test configs via the [Module API](https://github.com/kubernetes/perf-tests/blob/1bbb8bd493e5ce6370b0e18f3deaf821f3f28fd0/clusterloader2/api/types.go#L77).
With the Module API, you can divide a single test config file into multiple
module files. A module can be parameterized and used multiple times by
the test or by other modules. This provides a convenient way to avoid copy-pasting
and maintaining super-long, unreadable test configs<sup id="a1">[1](#f1)</sup>.

[TODO(mm4tt)]: <> (Point to the load config based on modules here once we migrate it)
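
As a rough sketch (the module path and its ```replicas``` parameter are
hypothetical), a test config can include a parameterized module as a step:

```yaml
# In the main test config: include a module and pass parameters to it.
steps:
- module:
    path: modules/create-deployments.yaml
    params:
      replicas: 10
```

Inside the module file, the parameters are available as template variables:

```yaml
# modules/create-deployments.yaml (hypothetical): .replicas comes from params.
steps:
- name: Create deployments
  phases:
  - namespaceRange:
      min: 1
      max: 1
    replicasPerNamespace: {{.replicas}}
    tuningSet: Uniform1qps
    objectBundle:
    - basename: test-deployment
      objectTemplatePath: deployment.yaml
```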

### Object template

An object template is similar to a standard kubernetes object definition,
with the only difference being the templating mechanism.
Parameters can be passed from the test definition to the object template
using the ```templateFillMap``` map.
Two always-available parameters are ```{{.Name}}``` and ```{{.Index}}```,
which specify the object name and the object replica index, respectively. \
An example of a template can be found here: [load deployment template].
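
A minimal sketch of an object template (```{{.Image}}``` is a hypothetical
parameter that would be passed from the test via ```templateFillMap```):

```yaml
# pod.yaml: {{.Name}} and {{.Index}} are always available;
# {{.Image}} is a hypothetical parameter supplied by the test.
apiVersion: v1
kind: Pod
metadata:
  name: {{.Name}}
  labels:
    group: test-pod
spec:
  containers:
  - name: {{.Name}}
    image: {{.Image}}
```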

### Overrides

Overrides allow injecting new variable values into the template. \
Many tests define input parameters. An input parameter is a variable
that may be provided by the test framework. Because input parameters are optional,
each reference has to be wrapped in the ```DefaultParam``` function, which
handles the case where the given variable doesn't exist. \
An example of overrides can be found here: [overrides]
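
For illustration (the variable name is hypothetical), an overrides file is a
flat map of values:

```yaml
# overrides.yaml, passed via --testoverrides.
PODS_PER_NODE: 30
```

and the test definition reads it through ```DefaultParam``` so that it still
works when the override is absent:

```yaml
{{$podsPerNode := DefaultParam .PODS_PER_NODE 10}}
```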

#### Passing environment variables

Instead of using overrides in a file, it is possible to depend on environment
variables. Only variables that start with the `CL2_` prefix will be parsed and
made available in the test definition.

Environment variables can be used with the `DefaultParam` function to provide sane
default values.

##### Setting variables in shell
```shell
export CL2_ACCESS_TOKENS_QPS=5
```

##### Usage from test definition
```yaml
{{$qpsPerToken := DefaultParam .CL2_ACCESS_TOKENS_QPS 0.1}}
```

## Measurement

Currently available measurements are:
- **APIAvailabilityMeasurement** \
This measurement collects information about the availability of the cluster's control plane. \
There are two slightly different ways this is measured:
  - cluster-level availability, where we periodically issue an API call to `/readyz`,
  - host-level availability, where we periodically poll the `/readyz` endpoint of each control plane host.
    - this requires the [exec service] to be enabled.
- **APIResponsivenessPrometheusSimple** \
This measurement creates percentiles of latency and count for server API calls based on the data collected by the Prometheus server.
API calls are grouped by resource, subresource, verb and scope. \
This measurement verifies whether the [API call latencies SLO] is satisfied.
If the Prometheus server is not available, the measurement will be skipped.
- **APIResponsivenessPrometheus** \
This measurement creates a summary of latency and count for server API calls
based on the data collected by the Prometheus server.
API calls are grouped by resource, subresource, verb and scope. \
This measurement verifies whether the [API call latencies SLO] is satisfied.
If the Prometheus server is not available, the measurement will be skipped.
- **CPUProfile** \
This measurement gathers the CPU usage profile provided by pprof for a given component.
- **EtcdMetrics** \
This measurement gathers a set of etcd metrics and etcd's database size.
- **MemoryProfile** \
This measurement gathers the memory profile provided by pprof for a given component.
- **MetricsForE2E** \
The measurement gathers metrics from kube-apiserver, controller manager,
scheduler and optionally all kubelets.
- **PodPeriodicCommand** \
This measurement continually runs commands on an interval in pods targeted
with a label selector. The output from each command is collected, allowing
information such as CPU and memory profiles to be polled throughout the
duration of the measurement.
- **PodStartupLatency** \
This measurement verifies whether the [pod startup SLO] is satisfied.
- **ResourceUsageSummary** \
This measurement collects the resource usage per component. During the gather phase,
the collected data is converted into a summary presenting the 90th, 99th and 100th usage percentiles
for each observed component. \
Optionally, a resource constraints file can be provided to the measurement.
A resource constraints file specifies cpu and/or memory constraints for a given component.
If any of the constraints is violated, an error will be returned, causing the test to fail.
- **SchedulingMetrics** \
This measurement gathers a set of scheduler metrics.
- **SchedulingThroughput** \
This measurement gathers scheduling throughput.
- **Timer** \
Timer allows for measuring latencies of certain parts of the test
(a single timer allows for independent measurements of different actions).
- **WaitForControlledPodsRunning** \
This measurement works as a barrier that waits until the specified controlling objects
(ReplicationController, ReplicaSet, Deployment, DaemonSet and Job) have all pods running.
Controlling objects can be specified by label selector, field selector and namespace.
In case of a timeout, the test continues to run, but an error is logged and the test is marked as failed.
- **WaitForRunningPods** \
This is a barrier that waits until the required number of pods are running.
Pods can be specified by label selector, field selector and namespace.
In case of a timeout, the test continues to run, but an error is logged and the test is marked as failed.
- **Sleep** \
This is a barrier that waits until the requested amount of time has passed.
- **WaitForGenericK8sObjects** \
This is a barrier that waits until the required number of k8s objects fulfill the given condition requirements.
Conditions can be specified as a list of requirements in `Type=Status` format, e.g.: `NodeReady=True`.
In case of a timeout, the test continues to run, but an error is logged and the test is marked as failed.
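
Measurements are declared in test steps and typically follow a start/gather
lifecycle. A sketch using **WaitForControlledPodsRunning** as a barrier (the
identifier, selector and timeout values are illustrative):

```yaml
steps:
- name: Start waiting for deployments
  measurements:
  - Identifier: WaitForRunningDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: start
      apiVersion: apps/v1
      kind: Deployment
      labelSelector: group = test-deployment
      operationTimeout: 15m
# ... steps creating the deployments go here ...
- name: Wait for deployments to be running
  measurements:
  - Identifier: WaitForRunningDeployments
    Method: WaitForControlledPodsRunning
    Params:
      action: gather
```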

## Prometheus metrics

There are two ways of scraping metrics from pods within the cluster:
- **ServiceMonitor** \
Allows scraping metrics from all pods behind a service. An example can be found here: [Service monitor]
- **PodMonitor** \
Allows scraping metrics from all pods with a specific label. An example can be found here: [Pod monitor]
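
For illustration, a minimal PodMonitor built on the prometheus-operator API
that CL2's Prometheus stack uses (the label and port name are hypothetical):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: test-pods
spec:
  selector:
    matchLabels:
      group: test-pod   # scrape every pod carrying this label
  podMetricsEndpoints:
  - port: metrics       # name of the pod port exposing /metrics
    interval: 30s
```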

## Vendor

Vendor is created using [Go modules].

---

<sup><b id="f1">1.</b> As an example and anti-pattern see the 900 line [load test config.yaml](https://github.com/kubernetes/perf-tests/blob/92cc27ff529ae3702c87e8f154ea62f3f2d8e837/clusterloader2/testing/load/config.yaml) we ended up maintaining at some point. [↩](#a1)</sup>

[api]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/api/types.go
[API call latencies SLO]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/api_call_latency.md
[exec service]: https://github.com/kubernetes/perf-tests/tree/master/clusterloader2/pkg/execservice
[design doc]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/design.md
[Go modules]: https://blog.golang.org/using-go-modules
[Getting started]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/docs/GETTING_STARTED.md
[load deployment template]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/deployment.yaml
[load test]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/load/config.yaml
[overrides]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/testing/density/scheduler/pod-affinity/overrides.yaml
[pod startup SLO]: https://github.com/kubernetes/community/blob/master/sig-scalability/slos/pod_startup_latency.md
[Service monitor]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/prometheus/manifests/default/prometheus-serviceMonitorKubeProxy.yaml
[Pod monitor]: https://github.com/kubernetes/perf-tests/blob/master/clusterloader2/pkg/prometheus/manifests/default/prometheus-podMonitorNodeLocalDNS.yaml