# Migrating Kubernetes Jobs To The EKS Prow Build Cluster

In an ongoing effort to migrate to community-owned resources, SIG K8s Infra and
SIG Testing are working to complete the migration of jobs from the Google-owned
internal GKE cluster to community-owned clusters.

Most jobs in the Prow Default Build Cluster should be migrated to
`cluster: eks-prow-build-cluster`. For the migration criteria and requirements,
see the sections below.

## What is the EKS Prow Build Cluster?

The EKS Prow Build Cluster (`eks-prow-build-cluster`) is an AWS EKS-based
cluster owned by the community and used for running ProwJobs. It's similar to
our GCP/GKE-based clusters, but there are some significant differences:

- The operating system on the worker nodes that run jobs is Ubuntu 20.04
  (EKS optimized)
- The cluster has fewer worker nodes, but the nodes are larger
  (16 vCPUs, 128 GB RAM, 300 GB NVMe SSD)
- cluster-autoscaler is in place to scale the cluster up and down on demand
- The cluster is hosted on AWS, which means we get to use the credits
  donated by AWS :)

## Criteria and Requirements for Migration

The following jobs can be migrated out of the box:

- Jobs not using any external resources (e.g. cloud accounts)
  - Build, lint, verify, and similar jobs

The following jobs can be migrated but require some action to be taken:

- Jobs using external non-GCP resources (e.g. DigitalOcean, vSphere...)
  - Community ownership of resources is required for jobs running in the new
    cluster; see the Community Ownership section for more details

The following jobs **MUST NOT** be migrated at this time:

- Jobs using GCP resources (e.g. E2E tests, promotion jobs, etc.)
- Jobs running in trusted clusters (e.g. `test-infra-trusted` and
  `k8s-infra-prow-build-trusted`)

Jobs that are already running in community-owned clusters (e.g.
`k8s-infra-prow-build`) can be migrated as well, but this is neither required
nor mandated.

## How To Migrate Jobs?

Fork and check out the [kubernetes/test-infra][k-test-infra] repository,
then follow the steps below:

[k-test-infra]: https://github.com/kubernetes/test-infra

- Find a job that you want to migrate
  - You can check the ["Prow Results"][prow-results-default] page for recent
    jobs that are running in the `default` cluster
  - For the sake of example, let's say you picked a job called
    `pull-jobset-test-integration-main`
- Edit the file that `pull-jobset-test-integration-main` is defined in. All
  jobs are defined in the [kubernetes/test-infra][k-test-infra] repository. You
  can use GitHub or our [Code Search tool][cs-test-infra] to find the file
  where this job is defined (search by job name).
- Look for a `cluster` key in the job definition. If there isn't one or it's
  set to `default`, the job runs in the default cluster. Add (or replace) the
  key so that it reads `cluster: eks-prow-build-cluster` (see the sketch after
  this list).
  - **IMPORTANT: if you see any entries under `labels` that mention `gce`, skip
    this job and come back to it later, as it is not ready to be moved yet.**
  - **IMPORTANT: if you see that a job uses Boskos (e.g. the `BOSKOS_HOST`
    environment variable is set), please check with SIG K8s Infra whether the
    needed Boskos resource is available in the EKS Prow build cluster (see the
    contact information at the end of this document).**
  - **IMPORTANT: if you see any entries under `labels` or `volumeMounts` that
    might indicate that the job is using an external non-GCP resource (e.g.
    credentials for some cloud platform), you need to check whether the
    Community Ownership criteria are satisfied for that resource (see the
    Community Ownership section for more details).**
  - **IMPORTANT: jobs running in community-owned clusters must have resource
    requests and limits specified for cluster-autoscaler to work properly.
    If that's not the case for your job, please see the next section for
    details on determining correct resource requests and limits.**
  - Here's an [example of a migrated job][example-eks-job] as a reference (pay
    attention to the `cluster` key)
- Save the file, commit the change to a branch, and open a PR
- Once the PR is merged, follow the guidelines in the Post Migration Period
  section of this document to ensure that the job remains stable
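
For reference, here's a minimal sketch of what the migrated job definition
might look like. The repository, image, and command below are illustrative
placeholders rather than the job's actual definition; the linked
[example job][example-eks-job] shows a real one:

```yaml
presubmits:
  kubernetes-sigs/jobset:
  - name: pull-jobset-test-integration-main
    cluster: eks-prow-build-cluster  # the migrated value (was `default` or unset)
    always_run: true
    decorate: true
    spec:
      containers:
      - image: golang:1.22  # placeholder image
        command:
        - make
        args:
        - test-integration
        resources:
          requests:  # requests and limits are required in community-owned clusters
            cpu: "2"
            memory: 4Gi
          limits:
            cpu: "2"
            memory: 4Gi
```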

If you have any trouble, please see how you can get in touch with us at the
end of this document.

[prow-results-default]: https://prow.k8s.io/?cluster=default
[cs-test-infra]: https://cs.k8s.io/?q=job-name-here&i=nope&files=config%2F&excludeFiles=&repos=kubernetes/test-infra
[example-eks-job]: https://github.com/kubernetes/test-infra/blob/1d219efcca8a254aaca2c34570db0a56a05f5770/config/jobs/kubernetes/cloud-provider-aws/cloud-provider-aws-config.yaml#L3C1-L31

### Determining Resource Requests and Limits

Jobs running in community-owned clusters must have resource requests and
limits specified for cluster-autoscaler to work properly. However, determining
them is not always easy, and incorrect requests and limits can cause the job
to start flaking or failing.

- If your job has resource requests specified but not limits, set the limits
  to the same values as the requests. That's usually a good starting point,
  but some adjustments might be needed
- Try to determine requests and limits based on what the job is doing
  - Simple verification jobs (e.g. `gofmt`, license header checks, etc.)
    generally don't require a lot of resources. In that case, 1-2 vCPUs and
    1-2 GB RAM is usually enough, but that also depends on the size of the
    project
  - Builds and tests take more resources. We generally recommend at least
    2-4 vCPUs and 2-4 GB RAM, but that again depends on the project size
  - Lint jobs (e.g. `golangci-lint`) can be very resource intensive depending
    on the configuration (e.g. enabled and disabled linters) as well as the
    project size. We recommend at least 2-4 vCPUs and 4-8 GB RAM for lint jobs

At this time, we strongly recommend that you match the values for requests and
limits to avoid any potential issues.
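
For example, a job that needs roughly 2 vCPUs and 4 GB of RAM would set
matching requests and limits like this (the values are illustrative; adjust
them to your job):

```yaml
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:  # matching the requests, as recommended above
    cpu: "2"
    memory: 4Gi
```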

Once the job migration PR is merged, make sure to follow the guidelines stated
in the Post Migration Period section and adjust requests and limits as needed.

## Post Migration Period

The job should be monitored for stability after the migration. We want to make
sure that stability remains the same or improves. There are several steps that
you should take:

- Watch the job's duration and success/flakiness rate over several days
  - You can find information about job runs on the
    ["Prow Results"][prow-results] page. You can also use [Testgrid][testgrid]
    if the job has a Testgrid dashboard
  - The job's duration and success/flakiness rate should be the same as or
    similar to what they were in the old cluster. If you notice significant
    regressions, try adjusting the resource requests and limits. If that
    doesn't help, reach out to us so that we can investigate
  - Note that some jobs are flaky by nature, i.e. they were flaking in the
    default cluster too. This is not going to be fixed by moving the job to
    the new cluster, but we shouldn't see the flakiness rate getting worse
- Watch the job's actual resource consumption and adjust requests and limits
  as needed
  - We have a [Grafana dashboard][monitoring-eks] that shows the actual CPU
    and memory usage for a job. If you notice that the CPU gets throttled too
    often, try increasing the CPU limit. Similarly for memory: if memory usage
    is too close to the limit, try increasing it a little bit

[prow-results]: https://prow.k8s.io
[testgrid]: http://testgrid.k8s.io
[monitoring-eks]: https://monitoring-eks.prow.k8s.io/d/53g2x7OZz/jobs?orgId=1&refresh=30s&var-org=kubernetes&var-repo=kubernetes&var-job=All

## Known Issues

- Go doesn't respect cgroups, i.e. CPU limits are not going to be respected
  by Go applications
  - This means that a Go application will try to use all available CPUs on the
    node, even though it's limited by (Kubernetes) resource limits, e.g. to 2
  - This can cause massive throttling, and performance-sensitive tasks and
    tests can be hit by it. Nodes in the new cluster are much larger (8 vCPUs
    per node in the old cluster vs. 16 vCPUs per node in the new one), so it
    can be easier to get affected by this issue
  - In general, the number of tests affected by this should be **very low**.
    However, if you think you're affected, you can try to mitigate the issue
    by setting the `GOMAXPROCS` environment variable for that job to the value
    of the `cpu` limit, as shown in the sketch after this list. There are also
    ways to automatically determine `GOMAXPROCS`, such as
    [`automaxprocs`][automaxprocs]
- Kernel configs are not available inside the job's test pod, so kubeadm might
  show a warning about this
  - We're still working on a permanent resolution, but you can take a look at
    the [following GitHub issue][gh-issue-kubeadm] for more details
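
As a sketch, pinning `GOMAXPROCS` to a `cpu` limit of 2 could look like this
in the job's container definition (the image is a placeholder):

```yaml
containers:
- image: golang:1.22  # placeholder image
  env:
  - name: GOMAXPROCS
    value: "2"  # keep in sync with the cpu limit below
  resources:
    requests:
      cpu: "2"
      memory: 4Gi
    limits:
      cpu: "2"
      memory: 4Gi
```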

[automaxprocs]: https://github.com/uber-go/automaxprocs
[gh-issue-kubeadm]: https://github.com/kubernetes/kubeadm/issues/2898

## Community Ownership of Resources

We require that external resources used in community-owned clusters satisfy
the minimum Community Ownership criteria before we add the relevant secrets
and credentials to our clusters. There are two major reasons for that:

- Ensuring a safe and secure pipeline. We want to be able to securely
  integrate the given resource with our clusters, e.g. by generating
  credentials and putting them in the cluster
- Continuity. We want to make sure that we don't lose access to the resource
  in case someone steps down from the project or becomes unreachable

The minimum Community Ownership criteria are as follows:

- [SIG K8s Infra leadership][sig-k8s-infra-leads] **MUST** have access to the
  given external resource
  - This means that the leadership team must be given access and onboarded to
    the resource (e.g. cloud platform) so they can maintain the access and
    generate secrets to be used in the build cluster

The recommended Community Ownership criteria are as follows:

- We recommend going through the [donation process with CNCF][cncf-credits]
  so that we can properly keep track of the resources available to us, and so
  that you also get highlighted for your donation and help to the project
  - If you want to go through this process, please reach out to SIG K8s
    Infra so that we can connect you with CNCF and guide you through the
    process
  - However, we understand that this requires additional effort and is not
    always possible, hence the minimum criteria ensure we don't block the
    migration

[sig-k8s-infra-leads]: https://github.com/kubernetes/community/tree/master/sig-k8s-infra#leadership
[cncf-credits]: https://www.cncf.io/credits/

## Reporting Issues and Getting in Touch

If you encounter any issues along the way, we recommend leaving a comment
in our [tracking GitHub issue][test-infra-gh-issue]. You can also reach out
to us:

- via our Slack channels: [#sig-k8s-infra][slack-k8s-infra] and
  [#sig-testing][slack-sig-testing]
- via our mailing lists: `kubernetes-sig-k8s-infra@googlegroups.com` and
  `kubernetes-sig-testing@googlegroups.com`

[test-infra-gh-issue]: https://github.com/kubernetes/test-infra/issues/29722
[slack-k8s-infra]: https://kubernetes.slack.com/archives/CCK68P2Q2
[slack-sig-testing]: https://kubernetes.slack.com/archives/C09QZ4DQB