# ![Triage](logo.svg)

Triage identifies clusters of similar test failures across all jobs.

Use it here: https://go.k8s.io/triage


## Intro

Triage consists of two parts: a summarizer, which clusters together similar test failure messages,
and a web page which can be used to browse the results. The web page is a static HTML page which grabs
the results in JSON format, parses them, and displays them.


## Usage

Triage summarization is generally run via `update_summaries.sh`, which downloads the input files in
the correct format and passes them automatically to `triage`. (File formats are listed below.)
However, summarization can be run directly with the following flags:
- `builds`: a path to a JSON file containing build information
- `previous` (optional): a path to a previous output which can be used to maintain consistent cluster
  IDs
- `owners` (optional): a path to a file that maps SIGs to the labels they own (see [Methodology](#methodology));
  no longer used as labels are read straight from test names
- `output` (optional): the path where the output should be written; defaults to `./failure_data.json`
- `output_slices` (optional): a pattern to be used when outputting slices, if desired (see
  [Methodology](#methodology)); e.g. `slices/failure_data_PREFIX.json`, where `PREFIX` will be replaced
  with some identifier
- `num_workers` (optional): the number of worker goroutines to spawn for parallelized functions; defaults to `2*runtime.NumCPU()-1`. (Since CPU detection is unreliable in Kubernetes, we set it manually according to the number of CPUs in [test-infra-periodics.yaml](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/test-infra/test-infra-periodics.yaml).)
- `memoize` (optional): whether to memoize certain function results to JSON (and use previously memoized results if they exist); defaults to false
- `...tests`: after all named flags are passed in, a space-delimited series of paths to files containing test information should be passed in as well

Triage uses klog for logging, so klog flags can be passed in as well.

The web page can be accessed at https://go.k8s.io/triage with the following options:
- `Date`: defaults to "today"; note that all usages of "today" on the page refer to the currently set date
- `Show clusters for SIG`: filter results by the SIG assigned to the majority of the tests; allows multi-select
- `Include results from`: toggle between CI tests, PR tests, or both
- `Sort by`: basic sorting
- `Include filter`/`Exclude filter`: advanced regex filtering by field

Note that the clusters at the top of the web page are static, and must be added/removed manually.
Simply adding a button to the HTML is enough.


## Go Packages

Package `berghelroach` contains a modified Levenshtein distance formula. Its only export is a `Dist()` function.
Package `summarize` depends on package `berghelroach` and does the actual heavy lifting.


## Methodology

The entire process is orchestrated by `update_summaries.sh`, as follows:

1. Download all builds for the last 14 days from BigQuery.
1. Download all failed tests for the last 14 days from BigQuery.
1. Run `triage`:
    1. Load the downloaded files, and convert them into a format that Go can handle better (i.e. by
       parsing numbers).
    1. Group the builds by their build paths, and the test failures by their test names.
    1. Load previous results (if any) to aid in computation.
    1. Create a local clustering of the test failures from step 2. This splits each group of test
       failures into local clusters, i.e. groups of failures with similar failure texts. The mapping
       at this point is `Test Name => Local Cluster Text => Group of Test Failures`.
    1. Create a global clustering of the local clusters from the previous step, optionally using the
       previous results. This takes each local cluster and attempts to find clusters from other tests
       with similar cluster texts. If one is found, they are merged into a global cluster, with each
       test's failures remaining separate within the global cluster. The mapping at this point is
       `Global Cluster Text => Test Name => Group of Test Failures`.
    1. Transform the global clustering into a format that compresses better, and which is consumable
       by the web page.
    1. If a mapping of owners to owner prefixes (such as `sig-testing => [sig-testing]`) was provided
       as a flag, load it.
    1. Annotate each cluster with an owner, by parsing the test name or using the provided mapping
       from the previous step. This can be used to filter the clusters by SIG on the web page.
    1. Write the results to a JSON file.
    1. If the `output_slices` flag is set, create individual files ("slices") for each owner. Also,
       split the results into 256 slices based on the cluster IDs. Write the slices to JSON files.
1. Upload the results into Google Cloud Storage so they can be browsed via the web page.


## File Structure

Below are the file structures for the ingested and outputted files. `...` denotes a repetition of the
previous element. "`x` Flag" denotes the file format of a file passed to flag `x` of the summarizer.

### Main Output
```
{
    "clustered": [
        {
            "key": string,
            "id": string,
            "text": string,
            "spans": [
                int,
                ...
            ],
            "tests": [
                {
                    "name": string,
                    "jobs": [
                        {
                            "name": string,
                            "builds": [
                                int,
                                ...
                            ]
                        },
                        ...
                    ]
                },
                ...
            ],
            "owner": string,
        },
        ...
    ],
    "builds": {
        "jobs": {
            string: ([int, ...] OR {int as string: int, ...}) // See the description of the jobCollection type
        },
        "cols": {
            "started": [int, ...],
            "tests_failed": [int, ...],
            "elapsed": [int, ...],
            "tests_run": [int, ...],
            "result": [string, ...],
            "executor": [string, ...],
            "pr": [string, ...]
        },
        "job_paths": {
            string: string,
            ...
        },
    }
}
```

### `builds` Flag
```
[
    {
        "path": string,
        "started": int as string,
        "elapsed": int as string,
        "tests_run": int as string,
        "tests_failed": int as string,
        "result": string,
        "executor": string,
        "job": string,
        "number": int as string,
        "pr": string,
        "key": string
    },
    ...
]
```

### `tests` Flag
This is a newline-delimited list of JSON objects. **Note the lack of comma between objects.**
```
{
    "started": int as string,
    "build": string,
    "name": string,
    "failure_text": string
}
...
```

### `previous` Flag
See [Main Output](#main-output).

### `owners` Flag
```
{
    string: [
        string,
        ...
    ],
    ...
}
```

### Slice Output
See [Main Output](#main-output). This is only a subset of the main output.


## Updating JS dependencies for the web page

See: `package.json` + `./hack/build/ensure-node_modules.sh`

## Deployment
Triage runs as static HTML hosted in GCS that is updated as part of a [Prow Periodic](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/test-infra/test-infra-periodics.yaml#L27).

To update the triage image, run `make push` from `./triage`, which triggers a [cloudbuild](https://cloud.google.com/cloud-build) using [`//images/builder`](https://github.com/kubernetes/test-infra/tree/master/images/builder). This results in a fresh triage image within the cloud image registry of the `k8s-testimages` project. (See Container Registry -> Images)

To update the Triage frontend in Production or Staging manually, run `make push-static` or `make push-staging` respectively. Otherwise it is updated on postsubmit via [post-test-infra-upload-triage](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/test-infra/test-infra-trusted.yaml#L616).

### Staging
To access staging, see [Triage Staging](https://storage.googleapis.com/k8s-triage/staging).