# Migrating Kubernetes Jobs To The EKS Prow Build Cluster

In an ongoing effort to migrate to community-owned resources, SIG K8s Infra and
SIG Testing are working to complete the migration of jobs from the Google-owned
internal GKE cluster to community-owned clusters.

Most jobs in the Prow Default Build Cluster should eventually migrate to
`cluster: eks-prow-build-cluster`. The criteria and requirements for the
migration are described in this document.

## What is EKS Prow Build Cluster?

EKS Prow Build Cluster (`eks-prow-build-cluster`) is a community-owned,
AWS EKS based cluster that's used for running ProwJobs. It's intended to be
similar to our GCP/GKE based clusters, but there are some significant
differences:

- The operating system on the worker nodes that run jobs is Ubuntu 20.04
  (EKS optimized)
- The cluster has fewer worker nodes, but the worker nodes are larger
  (16 vCPUs, 128 GB RAM, 300 GB NVMe SSD)
- cluster-autoscaler is in place to scale the cluster up/down on demand
- The cluster is hosted on AWS, which means we get to use the credits donated
  to the project by AWS :)

## Criteria and Requirements for Migration

The following jobs can be migrated out of the box:

- Jobs not using any external resources (e.g. cloud accounts)
- Build, lint, verify, and similar jobs

The following jobs can be migrated but require some actions to be taken:

- Jobs using external non-GCP resources (e.g. DigitalOcean, vSphere...)
  - Community ownership of resources is required for jobs running in the new
    cluster; see the Community Ownership section for more details

The following jobs **MUST NOT** be migrated at this time:

- Jobs using GCP resources (e.g. E2E tests, promotion jobs, etc.)
- Jobs that are running in trusted clusters (e.g. `test-infra-trusted` and
  `k8s-infra-prow-build-trusted`)

Jobs that are already running in community-owned clusters (e.g.
`k8s-infra-prow-build`) can be migrated as well, but that's not required or
mandated.

## How To Migrate Jobs?

Fork and check out the [kubernetes/test-infra][k-test-infra] repository,
then follow the steps below:

[k-test-infra]: https://github.com/kubernetes/test-infra

- Find a job that you want to migrate
  - You can check out the following ["Prow Results"][prow-results-default] link
    for recent jobs that are running in the `default` cluster
  - For the sake of example, let's say that you picked a job called
    `pull-jobset-test-integration-main`
- Edit the file that `pull-jobset-test-integration-main` is defined in. All
  jobs are defined in the [kubernetes/test-infra][k-test-infra] repository. You
  can use GitHub or our [Code Search tool][cs-test-infra] to find the file
  where this job is defined (search by job name).
- Look for a `.spec.cluster` key in the job definition. If there isn't one or
  it's set to `default`, then the job runs in the default cluster. Add (or
  replace) the following `.spec.cluster` key:
  `cluster: eks-prow-build-cluster` (see the sketch after this list).
  - **IMPORTANT: if you see any entry under `labels` that says `gce`, skip this
    job and come back to it later, as it's not ready to be moved yet.**
  - **IMPORTANT: if you see that a job uses Boskos (e.g. the `BOSKOS_HOST`
    environment variable is set), please check with SIG K8s Infra whether the
    needed Boskos resource is available in the EKS Prow build cluster (see
    contact information at the end of this document).**
  - **IMPORTANT: if you see any entries under `labels` or `volumeMounts` that
    might indicate that a job is using some external non-GCP resource (e.g.
    credentials for some cloud platform), you need to check whether the
    Community Ownership criteria are satisfied for that resource (see the
    Community Ownership section for more details)**
  - **IMPORTANT: jobs running in community-owned clusters must have resource
    requests and limits specified for cluster-autoscaler to work properly.
    If that's not the case for your job, please see the next section for some
    details about determining correct resource requests and limits.**
- Here's an [example of a job][example-eks-job] as a reference (pay attention
  to the `.spec.cluster` key)
- Save the file, commit the change to a new branch, and file a PR
- Once the PR is merged, follow the guidelines in the Post Migration Period
  section of this document to ensure that the job remains stable
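For illustration, here is a minimal sketch of what the change could look like
for a hypothetical presubmit. The repository entry, image, command, and other
fields below are placeholders rather than the real
`pull-jobset-test-integration-main` definition; the migration-relevant line is
`cluster`:

```yaml
presubmits:
  kubernetes-sigs/jobset:
    - name: pull-jobset-test-integration-main
      cluster: eks-prow-build-cluster   # <- the migration change
      decorate: true
      spec:
        containers:
          - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master  # placeholder image
            command:
              - make
            args:
              - test-integration
            # resources: required in community-owned clusters, see
            # "Determining Resource Requests and Limits" below
```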
If you have any trouble, please see how you can get in touch with us at the
end of this document.

[prow-results-default]: https://prow.k8s.io/?cluster=default
[cs-test-infra]: https://cs.k8s.io/?q=job-name-here&i=nope&files=config%2F&excludeFiles=&repos=kubernetes/test-infra
[example-eks-job]: https://github.com/kubernetes/test-infra/blob/1d219efcca8a254aaca2c34570db0a56a05f5770/config/jobs/kubernetes/cloud-provider-aws/cloud-provider-aws-config.yaml#L3C1-L31

### Determining Resource Requests and Limits

Jobs running in community-owned clusters must have resource requests and
limits specified for cluster-autoscaler to work properly. However, determining
them is not always easy, and incorrect requests and limits can cause the job
to start flaking or failing.

- If your job has resource requests specified but not limits, set the limits
  to the same values as the requests. That's usually a good starting point,
  but some adjustments might be needed
- Try to determine requests and limits based on what the job is doing
  - Simple verification jobs (e.g. `gofmt`, license header checks, etc.)
    generally don't require a lot of resources. In that case, 1-2 vCPUs and
    1-2 GB of RAM are usually enough, but that also depends on the size of
    the project
  - Builds and tests take more resources. We generally recommend at least
    2-4 vCPUs and 2-4 GB of RAM, but that again depends on the project size
  - Lint jobs (e.g. `golangci-lint`) can be very resource intensive depending
    on the configuration (e.g. enabled and disabled linters) as well as the
    project size. We recommend at least 2-4 vCPUs and 4-8 GB of RAM for lint
    jobs

At this time, we strongly recommend that you match the values for requests and
limits to avoid any potential issues.
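As a rough starting point, a build or test job following the guidance above
might declare matched requests and limits like this in its container spec (the
image is a placeholder and the values are illustrative; adjust them based on
observed usage):

```yaml
spec:
  containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master  # placeholder image
      command:
        - make
      args:
        - test
      resources:
        requests:
          cpu: "4"        # requests and limits intentionally match
          memory: 4Gi
        limits:
          cpu: "4"
          memory: 4Gi
```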
Once the job migration PR is merged, make sure to follow the guidelines stated
in the Post Migration Period section and adjust requests and limits as needed.

## Post Migration Period

The job should be monitored for its stability after the migration. We want to
make sure that stability remains the same or improves after the migration.
There are several steps that you should take:

- Watch the job's duration and success/flakiness rate over several days
  - You can find information about job runs on the
    ["Prow Results"][prow-results] page. You can also use [Testgrid][testgrid]
    if the job has a Testgrid dashboard
  - The job duration and success/flakiness rate should be the same as or
    similar to what they were in the old cluster. If you notice significant
    bumps, try adjusting resource requests and limits. If that doesn't help,
    reach out to us so that we can investigate
  - Note that some jobs are flaky by nature, i.e. they were flaking in
    the default cluster too. This is not going to be fixed by moving the job
    to the new cluster, but we shouldn't see the flakiness rate getting worse
- Watch the job's actual resource consumption and adjust requests and limits
  as needed
  - We have a [Grafana dashboard][monitoring-eks] that can show you the actual
    CPU and memory usage for a job. If you notice that the CPU gets throttled
    too often, try increasing the number of allowed CPUs. Similarly for
    memory: if you see memory usage getting too close to the limit, try
    increasing it a little bit

[prow-results]: https://prow.k8s.io
[testgrid]: http://testgrid.k8s.io
[monitoring-eks]: https://monitoring-eks.prow.k8s.io/d/53g2x7OZz/jobs?orgId=1&refresh=30s&var-org=kubernetes&var-repo=kubernetes&var-job=All

## Known Issues

- Go doesn't respect cgroups, i.e. CPU limits are not going to be respected
  by Go applications
  - This means that a Go application will try to use all available CPUs on
    the node, even though it's limited by (Kubernetes) resource limits,
    e.g. to 2
  - This can cause massive throttling, and performance-sensitive tasks and
    tests can be hit by it. Nodes in the new cluster are much larger
    (we had 8 vCPUs per node in the old cluster, while we have 16 vCPUs per
    node in the new cluster), so it can be easier to get affected by this
    issue
  - In general, the number of tests affected by this should be **very low**.
    However, if you think you're affected, you can try to mitigate the issue
    by setting the `GOMAXPROCS` environment variable for that job to the
    value of the `cpu` limit (see the sketch at the end of this section).
    There are also ways to automatically determine `GOMAXPROCS`, such as
    [`automaxprocs`][automaxprocs]
- Kernel configs are not available inside the job's test pod, so kubeadm might
  show a warning about this
  - We're still working on a permanent resolution, but you can take a look at
    the [following GitHub issue][gh-issue-kubeadm] for more details

[automaxprocs]: https://github.com/uber-go/automaxprocs
[gh-issue-kubeadm]: https://github.com/kubernetes/kubeadm/issues/2898
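If you need the `GOMAXPROCS` workaround, it's just an environment variable on
the job's container. A minimal sketch, where the image is a placeholder and
the value of `2` is an assumption that should match your job's actual `cpu`
limit:

```yaml
spec:
  containers:
    - image: gcr.io/k8s-staging-test-infra/kubekins-e2e:latest-master  # placeholder image
      env:
        - name: GOMAXPROCS
          value: "2"        # keep in sync with the container's cpu limit
      resources:
        requests:
          cpu: "2"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 2Gi
```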
## Community Ownership of Resources

We require that external resources used in community-owned clusters satisfy
the minimum Community Ownership criteria before we add the relevant secrets
and credentials to our community-owned clusters. There are two major reasons
for that:

- Ensuring a safe and secure pipeline. We want to be able to securely
  integrate the given resource with our clusters, e.g. by generating
  credentials and putting them in the cluster
- Continuity. We want to make sure that we don't lose access to the resource
  in case someone steps down from the project or becomes unreachable

The minimum Community Ownership criteria are as follows:

- [SIG K8s Infra leadership][sig-k8s-infra-leads] **MUST** have access to the
  given external resource
  - This means that the leadership team must be given access and onboarded to
    the resource (e.g. cloud platform) so they can maintain the access and
    generate secrets to be used in the build cluster

The recommended Community Ownership criteria are as follows:

- We recommend going through the [donation process with CNCF][cncf-credits]
  so that we have a proper track of the resources available to us, and so
  that you also get highlighted for your donation and help to the project
  - In case you want to go through this process, please reach out to SIG K8s
    Infra so that we can connect you with CNCF and guide you through the
    process
  - However, we understand that this requires additional effort and is not
    always possible, hence the minimum criteria to ensure we don't block
    the migration

[sig-k8s-infra-leads]: https://github.com/kubernetes/community/tree/master/sig-k8s-infra#leadership
[cncf-credits]: https://www.cncf.io/credits/

## Reporting Issues and Getting in Touch

If you encounter any issue along the way, we recommend leaving a comment
on our [tracking GitHub issue][test-infra-gh-issue]. You can also reach out
to us:

- via our Slack channels: [#sig-k8s-infra][slack-k8s-infra] and
  [#sig-testing][slack-sig-testing]
- via our mailing lists: `kubernetes-sig-k8s-infra@googlegroups.com` and
  `kubernetes-sig-testing@googlegroups.com`

[test-infra-gh-issue]: https://github.com/kubernetes/test-infra/issues/29722
[slack-k8s-infra]: https://kubernetes.slack.com/archives/CCK68P2Q2
[slack-sig-testing]: https://kubernetes.slack.com/archives/C09QZ4DQB