# ![Triage](logo.svg)

Triage identifies clusters of similar test failures across all jobs.

Use it here: https://go.k8s.io/triage


## Intro

Triage consists of two parts: a summarizer, which clusters together similar test failure messages,
and a web page for browsing the results. The web page is a static HTML page that fetches
the results in JSON format, parses them, and displays them.


## Usage

Triage summarization is generally run via `update_summaries.sh`, which downloads the input files in
the correct format and passes them automatically to `triage`. (File formats are listed under
[File Structure](#file-structure).) However, the summarizer can also be run directly with the
following flags:
- `builds`: a path to a JSON file containing build information
- `previous` (optional): a path to a previous output, which can be used to maintain consistent cluster
  IDs
- `owners` (optional): a path to a file that maps SIGs to the labels they own (see [Methodology](#methodology));
  no longer used, since labels are read straight from test names
- `output` (optional): the path where the output should be written; defaults to `./failure_data.json`
- `output_slices` (optional): a pattern to use when outputting slices, if desired (see
  [Methodology](#methodology)); e.g. `slices/failure_data_PREFIX.json`, where `PREFIX` is replaced
  with an identifier for each slice
- `num_workers` (optional): the number of worker goroutines to spawn for parallelized functions; defaults to `2*runtime.NumCPU()-1`. (Since CPU detection is unreliable in Kubernetes, we set it manually according to the number of CPUs in [test-infra-periodics.yaml](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/test-infra/test-infra-periodics.yaml).)
- `memoize` (optional): whether to memoize certain function results to JSON (and use previously memoized results if they exist); defaults to false
- `...tests`: after all named flags, a space-delimited series of paths to files containing test information

Triage uses klog for logging, so klog flags can be passed in as well.
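
As a rough illustration of the flag list above, the following Go sketch declares equivalent flags with the standard `flag` package. It is not the summarizer's actual code: the `options` struct and the `parseOptions` function are hypothetical names, the requirement check on `builds` is an assumption, and only the flag names and defaults are taken from the list.

```go
package main

import (
	"errors"
	"flag"
	"runtime"
)

// options is a hypothetical container for the flags listed above.
type options struct {
	builds       string
	previous     string
	owners       string
	output       string
	outputSlices string
	numWorkers   int
	memoize      bool
	tests        []string // positional paths that follow the named flags
}

// parseOptions parses args (without the program name) into an options value.
func parseOptions(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("triage", flag.ContinueOnError)
	fs.StringVar(&o.builds, "builds", "", "path to a JSON file containing build information")
	fs.StringVar(&o.previous, "previous", "", "path to a previous output (optional)")
	fs.StringVar(&o.owners, "owners", "", "path to an owners mapping file (optional)")
	fs.StringVar(&o.output, "output", "./failure_data.json", "where to write the output")
	fs.StringVar(&o.outputSlices, "output_slices", "", "pattern for slice files, containing PREFIX (optional)")
	fs.IntVar(&o.numWorkers, "num_workers", 2*runtime.NumCPU()-1, "number of worker goroutines")
	fs.BoolVar(&o.memoize, "memoize", false, "memoize certain function results to JSON")
	if err := fs.Parse(args); err != nil {
		return options{}, err
	}
	if o.builds == "" {
		// Assumption: builds is the only flag not marked optional above.
		return options{}, errors.New("the builds flag is required")
	}
	// Everything left after the named flags is treated as a test file path.
	o.tests = fs.Args()
	return o, nil
}
```

Calling `parseOptions(os.Args[1:])` from a `main` function would mirror a direct invocation; any klog flags would need to be registered on the same flag set separately.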

The web page can be accessed at https://go.k8s.io/triage with the following options:
- `Date`: defaults to "today"; note that all usages of "today" on the page refer to the currently set date
- `Show clusters for SIG`: filter results by the SIG assigned to the majority of the tests; allows multi-select
- `Include results from`: toggle between CI tests, PR tests, or both
- `Sort by`: basic sorting
- `Include filter`/`Exclude filter`: advanced regex filtering by field

Note that the clusters at the top of the web page are static, and must be added/removed manually.
Simply adding a button to the HTML is enough.


## Go Packages

Package `berghelroach` implements a modified Levenshtein distance algorithm. Its only export is a `Dist()` function.  
Package `summarize` depends on package `berghelroach` and does the actual heavy lifting.
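
For orientation, the sketch below implements the classic Levenshtein edit distance with two rolling rows. It is not the `berghelroach` algorithm, which is a modified variant, and it does not reproduce the signature of `Dist()`. The names `levenshtein` and `min3` are illustrative, and the function compares bytes rather than runes for brevity.

```go
package main

// levenshtein returns the number of single-byte insertions, deletions and
// substitutions needed to turn a into b (classic dynamic programming).
func levenshtein(a, b string) int {
	prev := make([]int, len(b)+1) // distances for the previous row of a
	curr := make([]int, len(b)+1) // distances for the current row of a
	for j := range prev {
		prev[j] = j // turning "" into b[:j] takes j insertions
	}
	for i := 1; i <= len(a); i++ {
		curr[0] = i // turning a[:i] into "" takes i deletions
		for j := 1; j <= len(b); j++ {
			cost := 1
			if a[i-1] == b[j-1] {
				cost = 0
			}
			curr[j] = min3(prev[j]+1, curr[j-1]+1, prev[j-1]+cost)
		}
		prev, curr = curr, prev
	}
	return prev[len(b)]
}

// min3 returns the smallest of three ints.
func min3(x, y, z int) int {
	if y < x {
		x = y
	}
	if z < x {
		x = z
	}
	return x
}
```

A distance like this gives a numeric measure of how far apart two failure messages are, which is the kind of signal a clustering step can threshold on.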


## Methodology

The entire process is orchestrated by `update_summaries.sh`, as follows:

1. Download all builds for the last 14 days from BigQuery.
1. Download all failed tests for the last 14 days from BigQuery.
1. Run `triage`:
   1. Load the downloaded files, and convert them into a format that Go can handle better (i.e. by
      parsing numbers).
   1. Group the builds by their build paths, and the test failures by their test names.
   1. Load previous results (if any) to aid in computation.
   1. Create a local clustering of the grouped test failures. This splits each group of test
      failures into local clusters, i.e. groups of failures with similar failure texts. The mapping
      at this point is `Test Name => Local Cluster Text => Group of Test Failures`.
   1. Create a global clustering of the local clusters from the previous step, optionally using the
      previous results. This takes each local cluster and attempts to find clusters from other tests
      with similar cluster texts. If one is found, they are merged into a global cluster, with each
      test's failures remaining separate within the global cluster. The mapping at this point is
      `Global Cluster Text => Test Name => Group of Test Failures`.
   1. Transform the global clustering into a format that compresses better, and which is consumable
      by the web page.
   1. If a mapping of owners to owner prefixes (such as `sig-testing => [sig-testing]`) was provided
      as a flag, load it.
   1. Annotate each cluster with an owner, by parsing the test name or using the provided mapping
      from the previous step. This can be used to filter the clusters by SIG on the web page.
   1. Write the results to a JSON file.
   1. If the `output_slices` flag is set, create individual files ("slices") for each owner. Also,
      split the results into 256 slices based on the cluster IDs. Write the slices to JSON files.
1. Upload the results into Google Cloud Storage so they can be browsed via the web page.
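
The two clustering steps can be sketched in Go to show the shape of the mappings involved. This is a simplified illustration, not the summarizer's implementation: the `failure` type and all function names are hypothetical, and the similarity test (`sameIgnoringDigits`, which treats texts as similar when they differ only in numbers) is a stand-in for whatever text-distance measure the real code applies.

```go
package main

import "strings"

// failure is a hypothetical, minimal test failure record.
type failure struct {
	Test string
	Text string
}

// collapseDigits replaces every run of digits with a single '#'.
func collapseDigits(s string) string {
	var b strings.Builder
	inDigits := false
	for _, r := range s {
		if r >= '0' && r <= '9' {
			if !inDigits {
				b.WriteByte('#')
			}
			inDigits = true
			continue
		}
		inDigits = false
		b.WriteRune(r)
	}
	return b.String()
}

// sameIgnoringDigits is a stand-in similarity test (an assumption).
func sameIgnoringDigits(a, b string) bool {
	return collapseDigits(a) == collapseDigits(b)
}

// localClusters builds Test Name => Local Cluster Text => failures.
func localClusters(fs []failure, similar func(a, b string) bool) map[string]map[string][]failure {
	out := map[string]map[string][]failure{}
	for _, f := range fs {
		clusters := out[f.Test]
		if clusters == nil {
			clusters = map[string][]failure{}
			out[f.Test] = clusters
		}
		key := f.Text // start a new cluster unless a similar one exists
		for text := range clusters {
			if similar(text, f.Text) {
				key = text
				break
			}
		}
		clusters[key] = append(clusters[key], f)
	}
	return out
}

// globalClusters builds Global Cluster Text => Test Name => failures by
// merging local clusters from different tests that have similar texts.
func globalClusters(local map[string]map[string][]failure, similar func(a, b string) bool) map[string]map[string][]failure {
	out := map[string]map[string][]failure{}
	for test, clusters := range local {
		for text, fs := range clusters {
			key := text
			for g := range out {
				if similar(g, text) {
					key = g
					break
				}
			}
			if out[key] == nil {
				out[key] = map[string][]failure{}
			}
			out[key][test] = append(out[key][test], fs...)
		}
	}
	return out
}
```

Because Go map iteration order is unspecified, which text ends up as a cluster's key can vary between runs in this sketch; the description above mentions previous results as a way to keep cluster IDs consistent.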


## File Structure

Below are the file structures for the ingested and outputted files. `...` denotes a repetition of the
previous element. "`x` Flag" denotes the file format of a file passed to flag `x` of the summarizer.

### Main Output
```
{
   "clustered": [
      {
         "key": string,
         "id": string,
         "text": string,
         "spans": [
            int,
            ...
         ],
         "tests": [
            {
               "name": string,
               "jobs": [
                  {
                     "name": string,
                     "builds": [
                        int,
                        ...
                     ]
                  },
                  ...
               ]
            },
            ...
         ],
         "owner": string
      },
      ...
   ],
   "builds": {
      "jobs": {
         string: ([int, ...] OR {int as string: int, ...})  // See the description of the jobCollection type
      },
      "cols": {
         "started": [int, ...],
         "tests_failed": [int, ...],
         "elapsed": [int, ...],
         "tests_run": [int, ...],
         "result": [string, ...],
         "executor": [string, ...],
         "pr": [string, ...]
      },
      "job_paths": {
         string: string,
         ...
      }
   }
}
```
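
Each value under `builds.jobs` can be either an array of ints or an object with integer-like string keys. One way to read such a value in Go is a custom `UnmarshalJSON` that tries both shapes, as in the sketch below. The type `jobBuilds` and the function `decodeJobs` are hypothetical names used for illustration; they are not the `jobCollection` type referenced in the schema, and the sketch attaches no meaning to either shape.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// jobBuilds is a hypothetical holder for one value under "builds"."jobs".
// After decoding, exactly one of List or ByNumber is non-nil.
type jobBuilds struct {
	List     []int       // set when the JSON value is an array
	ByNumber map[int]int // set when the JSON value is an object
}

// UnmarshalJSON accepts either [int, ...] or {"int as string": int, ...}.
func (j *jobBuilds) UnmarshalJSON(data []byte) error {
	var list []int
	if err := json.Unmarshal(data, &list); err == nil {
		j.List, j.ByNumber = list, nil
		return nil
	}
	// encoding/json converts string keys to integer map keys directly.
	var byNumber map[int]int
	if err := json.Unmarshal(data, &byNumber); err != nil {
		return fmt.Errorf("job builds is neither an array nor an object: %w", err)
	}
	j.List, j.ByNumber = nil, byNumber
	return nil
}

// decodeJobs decodes the whole "jobs" object, keyed by job name.
func decodeJobs(data []byte) (map[string]jobBuilds, error) {
	var jobs map[string]jobBuilds
	if err := json.Unmarshal(data, &jobs); err != nil {
		return nil, err
	}
	return jobs, nil
}
```

Trying the array form first keeps the common path cheap; a value that matches neither shape surfaces as a decode error rather than being silently dropped.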

### `builds` Flag
```
[
   {
      "path": string,
      "started": int as string,
      "elapsed": int as string,
      "tests_run": int as string,
      "tests_failed": int as string,
      "result": string,
      "executor": string,
      "job": string,
      "number": int as string,
      "pr": string,
      "key": string
   },
   ...
]
```

### `tests` Flag
This is a newline-delimited list of JSON objects. **Note the lack of commas between objects.**
```
{
   "started": int as string,
   "build": string,
   "name": string,
   "failure_text": string
}
...
```
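
A newline-delimited file like this can be read in Go with a streaming `json.Decoder`, and the "int as string" fields can be converted with the `,string` struct tag option. The sketch below shows one way to do it; the `testFailure` type and the `parseFailures` function are illustrative names, not the summarizer's own.

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"io"
)

// testFailure mirrors one object of the tests file (hypothetical type).
type testFailure struct {
	Started     int64  `json:"started,string"` // number encoded as a JSON string
	Build       string `json:"build"`
	Name        string `json:"name"`
	FailureText string `json:"failure_text"`
}

// parseFailures decodes consecutive JSON objects until the input ends.
func parseFailures(data []byte) ([]testFailure, error) {
	dec := json.NewDecoder(bytes.NewReader(data))
	var out []testFailure
	for {
		var f testFailure
		err := dec.Decode(&f)
		if errors.Is(err, io.EOF) {
			return out, nil // clean end of input
		}
		if err != nil {
			return nil, err
		}
		out = append(out, f)
	}
}
```

Decoding object by object avoids wrapping the input in an array, which is why the missing commas are not a problem for this approach.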

### `previous` Flag
See [Main Output](#main-output).

### `owners` Flag
```
{
   string: [
      string,
      ...
   ],
   ...
}
```
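
To illustrate how an owners mapping of this shape could be applied, the Go sketch below picks an owner for a test name. It rests on several assumptions that the text does not confirm: that the listed strings are matched as prefixes of the test name, that the explicit mapping is consulted before the test name is parsed, and that a bracketed label such as `[sig-testing]` yields the owner `sig-testing`. The names `ownerFor` and `sigLabel` are hypothetical.

```go
package main

import (
	"regexp"
	"strings"
)

// sigLabel matches a bracketed SIG label such as "[sig-testing]".
var sigLabel = regexp.MustCompile(`\[(sig-[^\]]+)\]`)

// ownerFor returns an owner for testName, or "" if none can be determined.
// owners maps an owner to the prefixes it owns, as in the owners file.
func ownerFor(testName string, owners map[string][]string) string {
	best, bestLen := "", 0
	for owner, prefixes := range owners {
		for _, p := range prefixes {
			// Prefer the longest matching prefix so the result does not
			// depend on map iteration order (ties are still unordered).
			if strings.HasPrefix(testName, p) && len(p) > bestLen {
				best, bestLen = owner, len(p)
			}
		}
	}
	if best != "" {
		return best
	}
	// Fall back to a label embedded in the test name itself.
	if m := sigLabel.FindStringSubmatch(testName); m != nil {
		return m[1]
	}
	return ""
}
```

With a mapping such as `{"sig-node": ["[k8s.io] Kubelet"]}`, a test named `[k8s.io] Kubelet should restart` would resolve to `sig-node`, while `[sig-apps] Deployment rollover` would resolve to `sig-apps` through the label fallback.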

### Slice Output
See [Main Output](#main-output). This is only a subset of the main output.


## Updating JS dependencies for the web page

See `package.json` and `./hack/build/ensure-node_modules.sh`.

## Deployment

Triage runs as static HTML hosted in GCS that is updated as part of a [Prow Periodic](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/test-infra/test-infra-periodics.yaml#L27).

To update the triage image, run `make push` from `./triage`, which triggers a [Cloud Build](https://cloud.google.com/cloud-build) using [`//images/builder`](https://github.com/kubernetes/test-infra/tree/master/images/builder). This produces a fresh triage image in the cloud image registry of the `k8s-testimages` project (see Container Registry -> Images).

To update the Triage frontend manually in production or staging, run `make push-static` or `make push-staging`, respectively. Otherwise, it is updated on postsubmit via [post-test-infra-upload-triage](https://github.com/kubernetes/test-infra/blob/master/config/jobs/kubernetes/test-infra/test-infra-trusted.yaml#L616).

### Staging

To access staging, see [Triage Staging](https://storage.googleapis.com/k8s-triage/staging).