# Kettle Workflow

## Overview
Kettle is a service tasked with tracking all completed jobs whose results are stored in the list of [Buckets] and uploading them to [BigQuery]. This data can then be used to track metrics on failures, flakes, activity, etc.

This document walks maintainers through Kettle's workflow to build a better understanding of how to add features, fix bugs, or rearchitect the service. It is organized top-down, starting at Kettle's ENTRYPOINT and working through each stage.

## ENTRYPOINT
Kettle's main process is the execution of `runner.sh`, which:
- sets gcloud auth credentials
- creates the initial "`bq config`"
- pulls the most recent [Buckets]
- executes `update.py` in a loop

`update.py` governs the flow of Kettle's three main stages (a rough sketch of this loop follows the list):
- [make_db.py](#Make-Database): Collects every build from GCS in the given buckets and creates a database entry for each result.
- [make_json.py+bq load](#Create-json-Results-and-Upload): Builds a JSON representation of the database and uploads results to the respective tables.
- [stream.py](#Stream-Results): Waits for pub-sub events for completed builds and uploads results as they surface.
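
The sketch below shows roughly how such a loop chains the three stages together. It is illustrative only (function names, the table name, and the exact flags are simplified assumptions), not the actual contents of `update.py`, which drives the same commands with more configuration.

```python
# Illustrative sketch of the update.py control flow, not the real script.
import subprocess
import time

TABLE = "day"  # hypothetical table name; see the bq load commands below for real targets


def run(cmd):
    """Run one stage as a subprocess and fail loudly if it errors."""
    print(f"+ {cmd}")
    subprocess.run(cmd, shell=True, check=True)


def main():
    while True:
        # Stage 1: collect all completed builds from GCS into the local database.
        run("python3 make_db.py --buckets buckets.yaml --junit --threads 32")

        # Stage 2: emit new rows as newline-delimited JSON and load them into BigQuery.
        run(f"pypy3 make_json.py --days 7 | pv | gzip > build_{TABLE}.json.gz")
        run(f"bq load --source_format=NEWLINE_DELIMITED_JSON "
            f"kubernetes-public:k8s_infra_kettle.{TABLE} build_{TABLE}.json.gz schema.json")

        # Stage 3: stream pub-sub completion events until the next full rebuild
        # (flags omitted here; see stream.py).
        run("python3 stream.py")

        time.sleep(60)  # brief back-off before the next cycle


if __name__ == "__main__":
    main()
```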

# Make Database
Flags:
- --buckets (str): Path to a YAML file that defines all the GCS buckets to collect jobs from
- --junit (bool): If true, collect JUnit XML test results
- --threads (int): Number of threads to run concurrently
- --buildlimit (int): **Used in staging:** collect only N builds for each job

`make_db.py` does the work of determining all the builds to collect and storing them in the database. It aggregates builds of two flavors: `pr` and `non-pr` builds. It searches GCS for build paths, or generates build paths if they are "incremental builds" (monotonically increasing). It hands the work of collecting build information and results off to threads, then makes a best-effort attempt to insert the build results into the DB, committing the insert every 200 builds.
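
A rough sketch of this collect-and-commit pattern (names like `collect_build` and the single-table schema here are illustrative assumptions, not the real `make_db.py` API):

```python
# Minimal sketch of the make_db.py pattern: fan build paths out to worker
# threads, then insert results and commit in batches of 200. Illustrative only.
import sqlite3
from concurrent.futures import ThreadPoolExecutor


def collect_build(build_path):
    """Placeholder for the GCS work: fetch started/finished/junit for one build."""
    return {"path": build_path, "result": "SUCCESS"}


def make_db(build_paths, db_path="build.db", threads=32, batch=200):
    db = sqlite3.connect(db_path)
    db.execute("CREATE TABLE IF NOT EXISTS build (path TEXT PRIMARY KEY, result TEXT)")

    with ThreadPoolExecutor(max_workers=threads) as pool:
        for count, build in enumerate(pool.map(collect_build, build_paths), 1):
            # Best-effort insert: skip builds that are already stored.
            db.execute(
                "INSERT OR IGNORE INTO build (path, result) VALUES (?, ?)",
                (build["path"], build["result"]),
            )
            if count % batch == 0:  # commit every 200 builds
                db.commit()
    db.commit()
```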

# Create JSON Results and Upload
This stage is run for each [BigQuery] table that Kettle is tasked with uploading data to. The commands typically look like one of:
- Fixed Time: `pypy3 make_json.py --days <num> | pv | gzip > build_<table>.json.gz`
    and `bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records={MAX_BAD_RECORDS} kubernetes-public:k8s_infra_kettle.<table> build_<table>.json.gz schema.json`
- All Results: `pypy3 make_json.py | pv | gzip > build_<table>.json.gz`
    and `bq load --source_format=NEWLINE_DELIMITED_JSON --max_bad_records={MAX_BAD_RECORDS} kubernetes-public:k8s_infra_kettle.<table> build_<table>.json.gz schema.json`

### Make Json
`make_json.py` prepares an incremental table to track the builds it has emitted to BQ. This table is named `build_emitted_<days>` (if the days flag is passed) or `build_emitted` otherwise. **This is important: if you change the days but NOT the table being uploaded to, you will get duplicate results.** If the `--reset_emitted` flag is passed, the incremental table is refreshed for fresh data. It then walks all of the builds to fetch within `<days>` (or since epoch if unset) and dumps each as a JSON object into the gzipped build file.

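A simplified illustration of this incremental-table bookkeeping, assuming a sqlite-backed database with a `build` table (the table layout and column names here are guesses for illustration; the real logic lives in `make_json.py`):

```python
# Sketch of the "emitted" bookkeeping: remember which builds were already
# dumped for a given target and only emit the new ones. The real script also
# filters builds by start time when --days is passed.
import gzip
import json
import sqlite3


def emit_new_builds(db_path, days=None, out_path="build_day.json.gz", reset=False):
    incremental = f"build_emitted_{days}" if days else "build_emitted"
    db = sqlite3.connect(db_path)
    db.execute(f"CREATE TABLE IF NOT EXISTS {incremental} (path TEXT PRIMARY KEY)")
    if reset:  # --reset_emitted: start the incremental table from scratch
        db.execute(f"DELETE FROM {incremental}")

    with gzip.open(out_path, "wt") as out:
        for path, result in db.execute("SELECT path, result FROM build").fetchall():
            emitted = db.execute(
                f"SELECT 1 FROM {incremental} WHERE path = ?", (path,)
            ).fetchone()
            if emitted:
                continue
            out.write(json.dumps({"path": path, "result": result}) + "\n")
            db.execute(f"INSERT INTO {incremental} (path) VALUES (?)", (path,))
    db.commit()
```

Because the incremental table name is derived from `--days`, pointing a different `--days` value at the same BigQuery table bypasses this bookkeeping, which is exactly how the duplicate-results pitfall arises.
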
### BQ Load
This step uploads all of the gzipped JSON data to BQ while conforming to the [Schema]. This schema must match the fields defined within [BigQuery] (see the README for details on adding fields).

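As a local sanity check (not part of Kettle itself), one could verify that the emitted rows only use top-level fields declared in `schema.json` before handing the file to `bq load`; a minimal sketch:

```python
# Quick, optional local check: confirm every newline-delimited JSON row only
# uses top-level fields declared in schema.json before running bq load.
# BigQuery JSON schema files are a list of {"name": ..., "type": ...} objects.
import gzip
import json
import sys


def check(json_gz_path, schema_path="schema.json"):
    with open(schema_path) as f:
        allowed = {field["name"] for field in json.load(f)}

    with gzip.open(json_gz_path, "rt") as rows:
        for lineno, line in enumerate(rows, 1):
            extra = set(json.loads(line)) - allowed
            if extra:
                print(f"line {lineno}: fields not in schema: {sorted(extra)}")
                return 1
    return 0


if __name__ == "__main__":
    sys.exit(check(sys.argv[1]))
```
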
# Stream Results
After all historical data has been uploaded, Kettle enters a streaming phase. It subscribes to pub-sub results from `kubernetes-jenkins/gcs-changes/kettle` (or the specified GCS subscription path) and listens for events (jobs completing). When a job triggers an event, Kettle (see the sketch after this list):
- collects data for the job
- inserts it into the database
- creates a BQ client
- gets the builds it just inserted
- serializes the rows to JSON
- inserts them into the tables (specified via flag)
- adds the data to the respective incremental tables
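
The sketch below shows the shape of this loop using the `google-cloud-pubsub` and `google-cloud-bigquery` client libraries; it skips the local database step and the incremental-table update, and names like `handle_build` and the table identifier are illustrative assumptions rather than the real `stream.py` API.

```python
# Minimal sketch of the streaming phase: subscribe to job-completion events,
# collect each finished build, and insert the serialized row into BigQuery.
from google.cloud import bigquery, pubsub_v1


def handle_build(attrs):
    """Placeholder: fetch the finished build from GCS and return a schema row."""
    return {"path": attrs.get("objectId", ""), "result": "SUCCESS"}


def stream(project, subscription, table="kubernetes-public.k8s_infra_kettle.day"):
    subscriber = pubsub_v1.SubscriberClient()
    sub_path = subscriber.subscription_path(project, subscription)
    bq = bigquery.Client()

    def callback(message):
        # A job finished: collect its data and push it to the BQ table.
        row = handle_build(dict(message.attributes))
        errors = bq.insert_rows_json(table, [row])
        if not errors:
            message.ack()  # only ack once the row made it into BigQuery

    future = subscriber.subscribe(sub_path, callback=callback)
    future.result()  # block, processing completion events as they arrive
```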

[BigQuery]: https://console.cloud.google.com/bigquery?utm_source=bqui&utm_medium=link&utm_campaign=classic&project=k8s-infra-kettle
[Buckets]: https://github.com/kubernetes/test-infra/blob/master/kettle/buckets.yaml
[Schema]: https://github.com/kubernetes/test-infra/blob/master/kettle/schema.json