github.com/cilium/cilium@v1.16.2/Documentation/contributing/testing/ci.rst

     1  .. only:: not (epub or latex or html)
     2  
     3      WARNING: You are looking at unreleased Cilium documentation.
     4      Please use the official rendered version released here:
     5      https://docs.cilium.io
     6  
     7  .. _ci_gha:
     8  
     9  CI / GitHub Actions
    10  -------------------
    11  
    12  The main CI infrastructure is maintained on GitHub Actions (GHA).
    13  
    14  This infrastructure consists broadly of smoke tests and platform tests.
    15  Smoke tests are typically initiated by ``pull_request`` or
    16  ``pull_request_target`` triggers automatically when opening or updating a pull
    17  request. Platform tests often require an organization member to manually
    18  trigger the test when the pull request is ready to be tested.
    19  
    20  Triggering Smoke Tests
    21  ~~~~~~~~~~~~~~~~~~~~~~
    22  
    23  Several short-running tests are automatically triggered for all contributor
    24  submissions, subject to GitHub's limitations around first-time contributors.
    25  If no GitHub workflows are triggering on your PR, a committer for the project
    26  should trigger these within a few days. Reach out in the ``#testing``
    27  channel on `Cilium Slack`_ for assistance in running these tests.
    28  
    29  .. _trigger_phrases:
    30  
    31  Triggering Platform Tests
    32  ~~~~~~~~~~~~~~~~~~~~~~~~~
    33  
    34  To ensure that build resources are used judiciously, some tests on GHA are
    35  manually triggered via comments. These builds typically make use of cloud
    36  infrastructure, such as allocating clusters or VMs in AKS, EKS or GKE. In
    37  order to trigger these jobs, a member of the GitHub organization must post a
    38  comment on the Pull Request with a "trigger phrase".
    39  
    40  If you'd like to trigger these jobs, ask in `Cilium Slack`_ in the ``#testing``
    41  channel. If you're regularly contributing to Cilium, you can also `become a
    42  member <https://github.com/cilium/community/blob/main/CONTRIBUTOR-LADDER.md#organization-member>`__
    43  of the Cilium organization.
    44  
    45  Depending on the PR target branch, a specific set of jobs is marked as required,
    46  as per the `Cilium CI matrix`_. They will be automatically featured in PR checks
    47  directly on the PR page. The following trigger phrases may be used to trigger
    48  them all at once:
    49  
    50  +------------------+--------------------------+
    51  | PR target branch | Trigger required PR jobs |
    52  +==================+==========================+
    53  | main             | /test                    |
    54  +------------------+--------------------------+
    55  | v1.15            | /test-backport-1.15      |
    56  +------------------+--------------------------+
    57  | v1.14            | /test-backport-1.14      |
    58  +------------------+--------------------------+
    59  | v1.13            | /test-backport-1.13      |
    60  +------------------+--------------------------+
    61  | v1.12            | /test-backport-1.12      |
    62  +------------------+--------------------------+
    63  
    64  Pull requests submitted against older stable branches such as v1.13 may also be
    65  subject to Jenkins CI jobs. For more information, see
    66  `v1.13 CI <https://docs.cilium.io/en/v1.13/contributing/testing/ci/#ci-jenkins>`__.
    67  
    68  For a full list of GHA workflows, see the `GitHub Actions page <https://github.com/cilium/cilium/actions>`_.
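
The trigger-phrase table above can also be expressed as a small lookup. The following is a minimal sketch, not part of Cilium's tooling: ``trigger_phrase`` is a hypothetical helper name, and the branch list only mirrors the table, so it would need updating as branches are added or retired.

```shell
#!/bin/sh
# trigger_phrase is a hypothetical helper (an assumption for illustration).
# It prints the trigger phrase for a PR target branch, following the table.
trigger_phrase() {
    case "$1" in
        main)
            echo "/test"
            ;;
        v1.15|v1.14|v1.13|v1.12)
            # Strip the leading "v": v1.15 becomes /test-backport-1.15
            echo "/test-backport-${1#v}"
            ;;
        *)
            echo "no trigger phrase known for branch '$1'" >&2
            return 1
            ;;
    esac
}

# Possible usage by an organization member with an authenticated GitHub CLI
# (12345 is a placeholder PR number):
#   gh pr comment 12345 --repo cilium/cilium --body "$(trigger_phrase main)"
```

The comment still has to be posted by a member of the GitHub organization. The helper only saves looking up the phrase.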
    69  
    70  Using GitHub Actions for testing
    71  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    72  
    73  On GHA, running a specific set of Ginkgo tests (``conformance-ginkgo.yaml``)
    74  can also be accomplished by modifying the files under
    75  ``.github/actions/ginkgo/`` by adding or removing entries.
    76  
    77  ``main-focus.yaml``:
    78  
    79      This file contains a list of tests to include and exclude. The ``cliFocus``
    80      defined for each element in the "include" section is expanded to the
    81      specific defined ``focus``. This mapping allows us to determine which regex
    82      should be used with ``ginkgo --focus`` for each element in the "focus" list.
    83      See :ref:`ginkgo-documentation` for more information about the ``--focus`` flag.
    84  
    85      Additionally, there is a list of excluded tests along with justifications
    86      in the form of comments, explaining why each test is excluded based on
    87      constraints defined in the ginkgo tests.
    88  
    89      For more information, refer to
    90      `GitHub's documentation on expanding matrix configurations <https://docs.github.com/en/actions/using-jobs/using-a-matrix-for-your-jobs#expanding-or-adding-matrix-configurations>`__.
    91  
    92  ``main-k8s-versions.yaml``:
    93  
    94      This file defines which kernel versions should be run with specific Kubernetes
    95      (k8s) versions. It contains an "include" section where each entry consists of
    96      a k8s version, IP family, Kubernetes image, and kernel version. These details
    97      determine the combinations of k8s versions and kernel versions to be tested.
    98  
    99  ``main-prs.yaml``:
   100  
   101      This file specifies the k8s versions to be executed for each pull request (PR).
   102      The list of k8s versions under the "k8s-version" section determines the matrix
   103      of jobs that should be executed for CI when triggered by PRs.
   104  
   105  ``main-scheduled.yaml``:
   106  
   107      This file specifies the k8s versions to be executed on a regular basis. The
   108      list of k8s versions under the "k8s-version" section determines the matrix of
   109      jobs that should be executed for CI as part of scheduled jobs.
   110  
   111  Workflow interactions:
   112  
   113      - The ``main-focus.yaml`` file helps define the test focus for CI jobs based on
   114        specific criteria, expanding the ``cliFocus`` to determine the relevant
   115        ``focus`` regex for ``ginkgo --focus``.
   116  
   117      - The ``main-k8s-versions.yaml`` file defines the mapping between k8s versions
   118        and the associated kernel versions to be tested.
   119  
   120      - Both ``main-prs.yaml`` and ``main-scheduled.yaml`` files utilize the
   121        "k8s-version" section to specify the k8s versions that should be included
   122        in the job matrix for PRs and scheduled jobs respectively.
   123  
   124      - These files collectively contribute to the generation of the job matrix
   125        for GitHub Actions workflows, ensuring appropriate testing and validation
   126        of the defined k8s versions.
   127  
   128  For example, to run only the tests under ``f09-datapath-misc-2`` with Kubernetes
   129  version 1.26, modify the following files so that they have the following content:
   130  
   131  ``main-focus.yaml``:
   132  
   133     .. code-block:: yaml
   134  
   135          ---
   136          focus:
   137          - "f09-datapath-misc-2"
   138          include:
   139            - focus: "f09-datapath-misc-2"
   140              cliFocus: "K8sDatapathConfig Check|K8sDatapathConfig IPv4Only|K8sDatapathConfig High-scale|K8sDatapathConfig Iptables|K8sDatapathConfig IPv4Only|K8sDatapathConfig IPv6|K8sDatapathConfig Transparent"
   141  
   142  ``main-prs.yaml``:
   143  
   144     .. code-block:: yaml
   145  
   146          ---
   147          k8s-version:
   148            - "1.26"
   149  
   150  The ``main-k8s-versions.yaml`` and ``main-scheduled.yaml`` files can be left
   151  unmodified. This results in the execution of the tests under
   152  ``f09-datapath-misc-2`` for the ``k8s-version`` "``1.26``".
   153  
   154  
   155  Bisect process
   156  ^^^^^^^^^^^^^^
   157  
   158  Bisecting Ginkgo tests (``conformance-ginkgo.yaml``) can be performed by
   159  modifying the workflow file, as well as the files under
   160  ``.github/actions/ginkgo/`` as explained in the previous section. The parts of
   161  ``conformance-ginkgo.yaml`` that need to be modified are marked by comments
   162  inside that file. Under the ``on`` section, enable the ``pull_request`` event
   163  type. Additionally, the following section needs to be
   164  modified:
   165  
   166     .. code-block:: yaml
   167  
   168          jobs:
   169            check_changes:
   170              name: Deduce required tests from code changes
   171              [...]
   172              outputs:
   173                tested: ${{ steps.tested-tree.outputs.src }}
   174                matrix_sha: ${{ steps.sha.outputs.sha }}
   175                base_branch: ${{ steps.sha.outputs.base_branch }}
   176                sha: ${{ steps.sha.outputs.sha }}
   177                #
   178                # For bisect uncomment the base_branch and 'sha' lines below and comment
   179                # the two lines above this comment
   180                #
   181                #base_branch: <replace with the base branch name, should be 'main', not your branch name>
   182                #sha: <replace with the SHA of an existing docker image tag that you want to bisect>
   183  
   184  As per the instructions, the ``base_branch`` and ``sha`` lines need to be
   185  uncommented. The ``base_branch`` should point to the base branch name that we
   186  are testing, and the ``sha`` must point to the commit SHA that we want to bisect.
   187  **The SHA must point to an existing image tag** in the
   188  ``quay.io/cilium/cilium-ci`` docker image repository.
   189  
   190  It is possible to find out whether or not a SHA exists by running either
   191  ``docker manifest inspect`` or ``docker buildx imagetools inspect``.
   192  This is an example output for the nonexistent SHA ``22fa4bbd9a03db162f08c74c6ef260c015ecf25e``
   193  and the existing SHA ``7b368923823e63c9824ea2b5ee4dc026bc4d5cd8``:
   194  
   195  
   196     .. code-block:: shell
   197  
   198          $ docker manifest inspect quay.io/cilium/cilium-ci:22fa4bbd9a03db162f08c74c6ef260c015ecf25e
   199          ERROR: quay.io/cilium/cilium-ci:22fa4bbd9a03db162f08c74c6ef260c015ecf25e: not found
   200  
   201          $ docker buildx imagetools inspect quay.io/cilium/cilium-ci:7b368923823e63c9824ea2b5ee4dc026bc4d5cd8
   202          Name:      quay.io/cilium/cilium-ci:7b368923823e63c9824ea2b5ee4dc026bc4d5cd8
   203          MediaType: application/vnd.docker.distribution.manifest.list.v2+json
   204          Digest:    sha256:0b7d1078570e6979c3a3b98896e4a3811bff483834771abc5969660df38463b5
   205  
   206          Manifests:
   207            Name:      quay.io/cilium/cilium-ci:7b368923823e63c9824ea2b5ee4dc026bc4d5cd8@sha256:63dbffea393df2c4cc96ff340280e92d2191b6961912f70ff3b44a0dd2b73c74
   208            MediaType: application/vnd.docker.distribution.manifest.v2+json
   209            Platform:  linux/amd64
   210  
   211            Name:      quay.io/cilium/cilium-ci:7b368923823e63c9824ea2b5ee4dc026bc4d5cd8@sha256:0c310ab0b7a14437abb5df46d62188f4b8b809f0a2091899b8151e5c0c578d09
   212            MediaType: application/vnd.docker.distribution.manifest.v2+json
   213            Platform:  linux/arm64
   214  
   215  Once the changes are committed and pushed into a draft Pull Request, it is
   216  possible to visualize the test results on the Pull Request's page.
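
The image-tag check described above can be wrapped in a small script when choosing candidate commits for a bisect. This is a sketch under stated assumptions: ``image_exists`` and ``bisectable_commits`` are hypothetical helper names, the script is assumed to run inside a local clone of the repository, and it relies on ``docker manifest inspect`` returning a non-zero exit status when the tag is missing.

```shell
#!/bin/sh
# image_exists is a hypothetical helper: it succeeds when the given commit SHA
# has a matching tag in the quay.io/cilium/cilium-ci repository.
image_exists() {
    docker manifest inspect "quay.io/cilium/cilium-ci:$1" >/dev/null 2>&1
}

# bisectable_commits is a hypothetical helper: it prints, oldest first, the
# commits in the range GOOD..BAD for which a CI image tag exists.
# Needs a local git clone and network access to quay.io.
bisectable_commits() {
    git rev-list --reverse "$1..$2" | while read -r sha; do
        if image_exists "$sha"; then
            echo "$sha"
        fi
    done
}

# Possible usage (both arguments are placeholders for real commit SHAs):
#   bisectable_commits <known-good-sha> <known-bad-sha>
```

Only the SHAs printed by such a script are usable as the ``sha`` value in the workflow file, since the other commits have no image to test.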
   217  
   218  GitHub Test Results
   219  ^^^^^^^^^^^^^^^^^^^
   220  
   221  Once the test finishes, its result is sent to the respective Pull Request's
   222  page.
   223  
   224  In case of a failure, it is possible to check which test failed by going over the
   225  summary of the test on the GitHub Workflow Run's page:
   226  
   227  
   228  .. image:: /images/gha-summary.png
   229      :align: center
   230  
   231  
   232  In this example, the test ``K8sDatapathConfig Transparent encryption DirectRouting Check connectivity with transparent encryption and direct routing with bpf_host``
   233  failed. With the ``cilium-sysdumps`` artifact available for download we can
   234  retrieve it and perform further inspection to identify the cause for the
   235  failure. To investigate CI failures, see :ref:`ci_failure_triage`.
   236  
   237  .. _test_matrix:
   238  
   239  Testing matrix
   240  ^^^^^^^^^^^^^^
   241  
   242  Up-to-date CI testing information regarding k8s - kernel version pairs can
   243  always be found in the `Cilium CI matrix`_.
   244  
   245  .. _Cilium CI matrix: https://docs.google.com/spreadsheets/d/1TThkqvVZxaqLR-Ela4ZrcJ0lrTJByCqrbdCjnI32_X0
   246  
   247  .. _ci_failure_triage:
   248  
   249  CI Failure Triage
   250  ~~~~~~~~~~~~~~~~~
   251  
   252  This section describes the process to triage CI failures. We define 3 categories:
   253  
   254  +----------------------+-----------------------------------------------------------------------------------+
   255  | Keyword              | Description                                                                       |
   256  +======================+===================================================================================+
   257  | Flake                | Failure due to a temporary situation such as loss of connectivity to external     |
   258  |                      | services or bug in system component, e.g. quay.io is down, VM race conditions,    |
   259  |                      | kube-dns bug, ...                                                                 |
   260  +----------------------+-----------------------------------------------------------------------------------+
   261  | CI-Bug               | Bug in the test itself that renders the test unreliable, e.g. a timing issue when |
   262  |                      | importing a policy and failing to block until it is enforced before connectivity  |
   263  |                      | is verified.                                                                      |
   264  +----------------------+-----------------------------------------------------------------------------------+
   265  | Regression           | Failure is due to a regression. All failures in the CI that are not caused by     |
   266  |                      | bugs in the test are considered regressions.                                      |
   267  +----------------------+-----------------------------------------------------------------------------------+
   268  
   269  Triage process
   270  ^^^^^^^^^^^^^^
   271  
   272  #. Investigate the failure you are interested in and determine if it is a
   273     CI-Bug, Flake, or a Regression as defined in the table above.
   274  
   275     #. Search `GitHub issues <https://github.com/cilium/cilium/issues?utf8=%E2%9C%93&q=is%3Aissue+>`_
   276        to see if a bug is already filed. Make sure to also include closed issues in
   277        your search, as a CI issue can be considered solved and then reappear.
   278        Good search terms are:
   279  
   280        - The test name, e.g.
   281          ::
   282  
   283              k8s-1.7.K8sValidatedKafkaPolicyTest Kafka Policy Tests KafkaPolicies (from (k8s-1.7.xml))
   284  
   285        - The line on which the test failed, e.g.
   286          ::
   287  
   288              github.com/cilium/cilium/test/k8s/kafka_policies.go:202
   289  
   290        - The error message, e.g.
   291          ::
   292  
   293              Failed to produce from empire-hq on topic deathstar-plan
   294  
   295  #. If a corresponding GitHub issue exists, update it with:
   296  
   297     #. A link to the failing GHA build (note that the build information is
   298        eventually deleted).
   299  
   300  #. If no existing GitHub issue was found, file a `new GitHub issue <https://github.com/cilium/cilium/issues/new>`_:
   301  
   302     #. Attach the failure case and logs from the failing test.
   303     #. If the failure is a new regression or a real bug:
   304  
   305        #. Title: ``<Short bug description>``
   306        #. Labels ``kind/bug`` and ``needs/triage``.
   307  
   308     #. If the failure is a new CI-Bug or Flake, or if you are unsure:
   309  
   310        #. Title ``CI: <testname>: <cause>``, e.g. ``CI: K8sValidatedPolicyTest Namespaces: cannot curl service``
   311        #. Labels ``kind/bug/CI`` and ``needs/triage``
   312        #. Include the test name and whole Stacktrace section to help others find this issue.
   313  
   314     .. note::
   315  
   316        Be extra careful when you see a new flake on a PR, and want to open an
   317        issue. It's much more difficult to debug these without context around the
   318        PR and the changes it introduced. When creating an issue for a PR flake,
   319        include a description of the code change, the PR, or the diff. If it
   320        isn't related to the PR, then it should already happen in the ``main``
   321        branch, and a new issue isn't needed.
   322  
   323  **Examples:**
   324  
   325  * ``Flake, quay.io is down``
   326  * ``Flake, DNS not ready, #3333``
   327  * ``CI-Bug, K8sValidatedPolicyTest: Namespaces, pod not ready, #9939``
   328  * ``Regression, k8s host policy, #1111``
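
The search and titling steps of the triage process can be scripted with the GitHub CLI. The following is a minimal sketch rather than an established Cilium workflow: ``search_ci_issues`` and ``ci_issue_title`` are hypothetical helper names, and an installed, authenticated ``gh`` is assumed.

```shell
#!/bin/sh
# Hypothetical helpers built on the GitHub CLI (gh); not part of Cilium's tooling.

# Search open AND closed issues for a test name, failing line, or error message.
search_ci_issues() {
    gh issue list --repo cilium/cilium --state all --search "$1"
}

# Format the title of a new CI-Bug or Flake issue: "CI: <testname>: <cause>".
ci_issue_title() {
    printf 'CI: %s: %s\n' "$1" "$2"
}

# Possible usage:
#   search_ci_issues "Failed to produce from empire-hq on topic deathstar-plan"
#   gh issue create --repo cilium/cilium \
#       --title "$(ci_issue_title 'K8sValidatedPolicyTest Namespaces' 'cannot curl service')" \
#       --label kind/bug/CI --label needs/triage
```

Passing ``--state all`` matters because, as noted in the triage steps, a CI issue may have been closed and then reappeared.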
   329  
   330  Disabling GitHub Actions Workflows
   331  ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
   332  
   333  .. warning::
   334      Do not use the `GitHub web UI <https://docs.github.com/en/actions/using-workflows/disabling-and-enabling-a-workflow?tool=webui>`_
   335      to disable GitHub Actions workflows. It makes it difficult to find out who
   336      disabled the workflows and why.
   337  
   338  Alternatives to Disabling GitHub Actions Workflows
   339  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   340  
   341  Before proceeding, consider the following alternatives to disabling an entire
   342  GitHub Actions workflow.
   343  
   344  - Skip individual tests. If specific tests are causing the workflow to fail,
   345    disable those tests instead of disabling the workflow. When you disable a
   346    workflow, all the tests in the workflow stop running. This makes it easier
   347    to introduce new regressions that would have been caught by these tests
   348    otherwise.
   349  - Remove the workflow from the list of required status checks. This way the
   350    workflow still runs on pull requests, but you can still merge them without
   351    the workflow succeeding. To remove the workflow from the required status check
   352    list, post a message in the `#testing Slack channel <https://cilium.slack.com/archives/C7PE7V806>`_
   353    and @mention people in the `cilium-maintainers team <https://github.com/orgs/cilium/teams/cilium-maintainers>`__.
   354  
   355  Step 1: Open a GitHub Issue
   356  ^^^^^^^^^^^^^^^^^^^^^^^^^^^
   357  
   358  Open a GitHub issue to track activities related to fixing the workflow. If there
   359  are existing test flake GitHub issues, list them in the tracking issue. Find an
   360  assignee for the tracking issue to avoid the situation where the workflow remains
   361  disabled indefinitely because nobody is assigned to actually fix the workflow.
   362  
   363  Step 2: Update the required status check list
   364  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   365  
   366  If the workflow is in the required status check list, it needs to be removed
   367  from the list. Notify the `cilium-maintainers team <https://github.com/orgs/cilium/teams/cilium-maintainers>`__
   368  by mentioning ``@cilium/cilium-maintainers`` in the tracking issue and ask them
   369  to remove the workflow from the required status check list.
   370  
   371  Step 3: Update the workflow configuration
   372  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
   373  
   374  Update the workflow configuration as described in the following sub-steps
   375  depending on whether the workflow is triggered by the ``/test`` comment
   376  or by the ``pull_request`` or ``pull_request_target`` trigger. Open a pull
   377  request with your changes, have it reviewed, then merged.
   378  
   379  .. tabs::
   380    .. group-tab:: ``/test`` comment trigger
   381  
   382      For those workflows that get triggered by the ``/test`` comment, update
   383      ``ariane-config.yaml`` and remove the workflow from the ``triggers:/test:workflows``
   384      section (`an example <https://github.com/cilium/cilium/pull/29488>`_). Do not
   385      remove the targeted trigger (``triggers:/ci-e2e`` for example) so that you can
   386      still use the targeted trigger to run the workflow when needed.
   387  
   388    .. group-tab:: ``pull_request`` or ``pull_request_target`` trigger
   389  
   390      For those workflows that get triggered by the ``pull_request`` or
   391      ``pull_request_target`` trigger, remove the trigger from the workflow file.
   392      Do not remove the ``schedule`` trigger if the workflow has it. It is useful
   393      to be able to see if the workflow has stabilized enough over time when making
   394      the decision to re-enable the workflow.