# Troubleshooting

## Troubleshooting Quick Start with Docker (CAPD)

<aside class="note warning">

<h1>Warning</h1>

If you've run the Quick Start before, ensure that you've [cleaned up](./quick-start.md#clean-up) all resources before trying it again. Check `docker ps` to ensure there are no running containers left before beginning the Quick Start.

</aside>

This guide assumes you've completed the [apply the workload cluster](./quick-start.md#apply-the-workload-cluster) section of the Quick Start using Docker.

When running `clusterctl describe cluster capi-quickstart` to verify the created resources, we expect the output to be similar to this (**note: this is before installing the Calico CNI**).

```shell
NAME                                                             READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/capi-quickstart                                          True                                            46m
├─ClusterInfrastructure - DockerCluster/capi-quickstart-94r9d    True                                            48m
├─ControlPlane - KubeadmControlPlane/capi-quickstart-6487w       True                                            46m
│ └─3 Machines...                                                True                                            47m    See capi-quickstart-6487w-d5lkp, capi-quickstart-6487w-mpmkq, ...
└─Workers
  └─MachineDeployment/capi-quickstart-md-0-d6dn6                 False  Warning   WaitingForAvailableMachines  48m    Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                                              True                                            47m    See capi-quickstart-md-0-d6dn6-584ff97cb7-kr7bj, capi-quickstart-md-0-d6dn6-584ff97cb7-s6cbf, ...
```

Machines should be started, but Workers are not, because Calico isn't installed yet. You should be able to see the containers running with `docker ps --all` and they should not be restarting.

If you notice that Machines are failing to start or are restarting, your output might look similar to this:

```shell
clusterctl describe cluster capi-quickstart
NAME                                                             READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/capi-quickstart                                          False  Warning   ScalingUp                    57s    Scaling up control plane to 3 replicas (actual 2)
├─ClusterInfrastructure - DockerCluster/capi-quickstart-n5w87    True                                            110s
├─ControlPlane - KubeadmControlPlane/capi-quickstart-6587k       False  Warning   ScalingUp                    57s    Scaling up control plane to 3 replicas (actual 2)
│ ├─Machine/capi-quickstart-6587k-fgc6m                          True                                            81s
│ └─Machine/capi-quickstart-6587k-xtvnz                          False  Warning   BootstrapFailed              52s    1 of 2 completed
└─Workers
  └─MachineDeployment/capi-quickstart-md-0-5whtj                 False  Warning   WaitingForAvailableMachines  110s   Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                                              False  Info      Bootstrapping                77s    See capi-quickstart-md-0-5whtj-5d8c9746c9-f8sw8, capi-quickstart-md-0-5whtj-5d8c9746c9-hzxc2, ...
```

In the example above, we can see that the Machine `capi-quickstart-6587k-xtvnz` has failed to start. The reason provided is `BootstrapFailed`.

To investigate why a machine fails to start, you can inspect the conditions of the objects using `clusterctl describe --show-conditions all cluster capi-quickstart`. You can get more detailed information about the status of the machines using `kubectl describe machines`.

To inspect the underlying infrastructure, in this case Docker containers acting as Machines, you can access the logs using `docker logs <MACHINE-NAME>`. For example:

```shell
docker logs capi-quickstart-6587k-xtvnz
(...)
Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
```

To resolve this specific error, please read [Cluster API with Docker - "too many open files"](#cluster-api-with-docker----too-many-open-files).

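
To recap, a typical inspection sequence for a failing Machine looks like the following (all commands are the ones mentioned above; the Machine name is taken from the example output and will differ in your cluster):

```bash
# Check the conditions of the cluster and its machines.
clusterctl describe --show-conditions all cluster capi-quickstart
kubectl describe machines

# List the Docker containers backing the Machines and inspect the logs of a failing one.
docker ps --all
docker logs capi-quickstart-6587k-xtvnz
```
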

## Node bootstrap failures when using CABPK with cloud-init

Failures during Node bootstrapping can have a lot of different causes. For example, Cluster API resources might be
misconfigured or there might be problems with the network. The following steps describe how to troubleshoot bootstrap
failures systematically.

1. Access the Node via ssh.
2. Take a look at cloud-init logs via `less /var/log/cloud-init-output.log` or `journalctl -u cloud-init --since "1 day ago"`.
   (Note: cloud-init persists logs of the commands it executes (like kubeadm) only after they have returned.)
3. It might also be helpful to take a look at `journalctl --since "1 day ago"`.
4. If you see that kubeadm times out waiting for the static Pods to come up, take a look at:
   1. containerd: `crictl ps -a`, `crictl logs`, `journalctl -u containerd`
   2. Kubelet: `journalctl -u kubelet --since "1 day ago"`
      (Note: it might be helpful to increase the Kubelet log level by e.g. setting `--v=8` via
      `systemctl edit --full kubelet && systemctl restart kubelet`)
5. If Node bootstrapping consistently fails and the kubeadm logs are not verbose enough, the `kubeadm` verbosity
   can be increased via `KubeadmConfigSpec.Verbosity`, as shown in the sketch after this list.

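
For control plane Machines the verbosity lives under `spec.kubeadmConfigSpec.verbosity` of the `KubeadmControlPlane` (for workers, the equivalent field in the `KubeadmConfigTemplate`). A minimal sketch, assuming an object named `capi-quickstart-control-plane` (a hypothetical name) and a Cluster API version that allows updating this field in place; note that changing `kubeadmConfigSpec` triggers a rollout of the control plane Machines:

```bash
# Hypothetical object name; adjust to your cluster.
# Raises the kubeadm log level for newly bootstrapped control plane Machines.
kubectl patch kubeadmcontrolplane capi-quickstart-control-plane --type merge \
  -p '{"spec":{"kubeadmConfigSpec":{"verbosity":5}}}'
```
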

## Labeling nodes with reserved labels such as `node-role.kubernetes.io` fails with kubeadm error during bootstrap

Self-assigning Node labels such as `node-role.kubernetes.io` using the kubelet `--node-labels` flag
(see `kubeletExtraArgs` in the [CABPK examples](https://github.com/kubernetes-sigs/cluster-api/tree/main/bootstrap/kubeadm))
is not possible due to a security measure imposed by the
[`NodeRestriction` admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction)
that kubeadm enables by default.

Assigning such labels to Nodes must be done after the bootstrap process has completed:

```bash
kubectl label nodes <name> node-role.kubernetes.io/worker=""
```

For convenience, here is an example one-liner to do this post-installation:

```bash
# Kubernetes 1.19 (kubeadm 1.19 sets only the node-role.kubernetes.io/master label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/master' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
# Kubernetes >= 1.20 (kubeadm >= 1.20 sets the node-role.kubernetes.io/control-plane label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
```

## Cluster API with Docker

When provisioning workload clusters using Cluster API with the Docker infrastructure provider,
provisioning might get stuck:

1. if there are stopped containers on your machine from previous runs. Clean unused containers with [docker rm -f](https://docs.docker.com/engine/reference/commandline/rm/).

2. if the Docker space on your disk is being exhausted:
   * Run [docker system df](https://docs.docker.com/engine/reference/commandline/system_df/) to inspect the disk space consumed by Docker resources.
   * Run [docker system prune --volumes](https://docs.docker.com/engine/reference/commandline/system_prune/) to prune dangling images, containers, volumes, and networks.

## Cluster API with Docker - "too many open files"

When creating many nodes using Cluster API and Docker infrastructure, either by creating large Clusters or a number of small Clusters, the OS may run into inotify limits, which prevent new nodes from being provisioned.
If the error `Failed to create inotify object: Too many open files` is present in the logs of the Docker infrastructure provider, this limit is being hit.

On Linux this issue can be resolved by increasing the inotify watch limits with:

```bash
sysctl fs.inotify.max_user_watches=1048576
sysctl fs.inotify.max_user_instances=8192
```

Newly created clusters should be able to take advantage of the increased limits.

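
These `sysctl` settings only last until the next reboot. On most Linux distributions they can be persisted with a drop-in file, for example (the file name below is arbitrary):

```bash
# Persist the increased inotify limits across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-inotify-limits.conf
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 8192
EOF
sudo sysctl --system
```
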

### MacOS and Docker Desktop - "too many open files"

This error was also observed in Docker Desktop 4.3 and 4.4 on macOS. It can be resolved by updating to Docker Desktop for Mac 4.5 or by using a version lower than 4.3.

[The upstream issue for this error is closed as of the release of Docker Desktop 4.5.0.](https://github.com/docker/for-mac/issues/6071)

Note: The workaround below is not recommended unless an upgrade or downgrade cannot be performed.

If using Docker Desktop for Mac 4.3 or 4.4, the following workaround can be used to increase the maximum inotify file watch settings in the Docker Desktop VM:

1) Enter the Docker Desktop VM
```bash
nc -U ~/Library/Containers/com.docker.docker/Data/debug-shell.sock
```
2) Increase the inotify limits using sysctl
```bash
sysctl fs.inotify.max_user_watches=1048576
sysctl fs.inotify.max_user_instances=8192
```
3) Exit the Docker Desktop VM
```bash
exit
```

## Failed clusterctl init - 'failed to get cert-manager object'

When using older versions of the Cluster API 0.4 and 1.0 releases (0.4.6 and 1.0.3 or older, respectively), Cert Manager may not be downloadable due to a change in the repository location. This will cause `clusterctl init` to fail with the error:

```bash
clusterctl init --infrastructure docker
```
```bash
Fetching providers
Installing cert-manager Version="v1.11.0"
Error: action failed after 10 attempts: failed to get cert-manager object /, Kind=, /: Object 'Kind' is missing in 'unstructured object has no kind'
```

This error was fixed in more recent Cluster API releases on the 0.4 and 1.0 release branches. The simplest way to resolve the issue is to upgrade to a newer version of Cluster API for a given release. For those who need to continue using an older release, it is possible to override the repository used by `clusterctl init` in the clusterctl config file. The default location of this file is `$XDG_CONFIG_HOME/cluster-api/clusterctl.yaml`.

To do so, add the following to the file:
```yaml
cert-manager:
  url: "https://github.com/cert-manager/cert-manager/releases/latest/cert-manager.yaml"
```

Alternatively, a Cert Manager yaml file can be placed in the [clusterctl overrides layer](../clusterctl/configuration.md#overrides-layer), which is by default in `$XDG_CONFIG_HOME/cluster-api/overrides`, for example at
`$XDG_CONFIG_HOME/cluster-api/overrides/cert-manager/v1.11.0/cert-manager.yaml`.

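
As a sketch, the overrides layer could be pre-populated with a manifest downloaded from the cert-manager releases; the version matches the example above and is only illustrative, and `$HOME/.config` is assumed as the fallback when `XDG_CONFIG_HOME` is unset:

```bash
# Place a cert-manager manifest in the clusterctl overrides layer.
CERT_MANAGER_VERSION=v1.11.0
OVERRIDES_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/cluster-api/overrides/cert-manager/${CERT_MANAGER_VERSION}"
mkdir -p "${OVERRIDES_DIR}"
curl -L -o "${OVERRIDES_DIR}/cert-manager.yaml" \
  "https://github.com/cert-manager/cert-manager/releases/download/${CERT_MANAGER_VERSION}/cert-manager.yaml"
```
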

More information on the clusterctl config file can be found at [its page in the book](../clusterctl/configuration.md#clusterctl-configuration-file).

## Failed clusterctl upgrade apply - 'failed to update cert-manager component'

Upgrading Cert Manager may fail due to a breaking change introduced in Cert Manager release v1.6.
An upgrade using `clusterctl` is affected when:

* using `clusterctl` in version `v1.1.4` or a more recent version.
* Cert Manager lower than version `v1.0.0` did run in the management cluster (which was shipped in Cluster API until and including `v0.3.14`).

This will cause `clusterctl upgrade apply` to fail with the error:

```bash
clusterctl upgrade apply
```

```bash
Checking cert-manager version...
Deleting cert-manager Version="v1.5.3"
Installing cert-manager Version="v1.7.2"
Error: action failed after 10 attempts: failed to update cert-manager component apiextensions.k8s.io/v1, Kind=CustomResourceDefinition, /certificaterequests.cert-manager.io: CustomResourceDefinition.apiextensions.k8s.io "certificaterequests.cert-manager.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha2": must appear in spec.versions
```

The Cert Manager maintainers provide documentation to [migrate the deprecated API Resources](https://cert-manager.io/docs/installation/upgrading/remove-deprecated-apis/#upgrading-existing-cert-manager-resources) to the new storage versions to mitigate the issue.

More information about the change in Cert Manager can be found in [their upgrade notes from v1.5 to v1.6](https://cert-manager.io/docs/installation/upgrading/upgrading-1.5-1.6).

## Clusterctl failing to start providers due to outdated image overrides

clusterctl allows users to configure [image overrides](../clusterctl/configuration.md#image-overrides) via the clusterctl config file.
However, when an image override pins a provider image to a specific version, this can conflict with clusterctl's behavior of picking the latest version of a provider.

E.g., if you are pinning KCP images to version v1.0.2 but clusterctl init fetches yamls for version v1.1.0 or greater, KCP will
fail to start with the following error:

```bash
invalid argument "ClusterTopology=false,KubeadmBootstrapFormatIgnition=false" for "--feature-gates" flag: unrecognized feature gate: KubeadmBootstrapFormatIgnition
```

In order to solve this problem you should specify the version of the provider you are installing by appending a
version tag to the provider name:

```bash
clusterctl init -b kubeadm:v1.0.2 -c kubeadm:v1.0.2 --core cluster-api:v1.0.2 -i docker:v1.0.2
```

Even if slightly verbose, pinning the version provides better control over what is installed, as is usually
required in an enterprise environment, especially if you rely on an internal repository with a separate
software supply chain or a custom versioning schema.

## Managed Cluster and co-authored slices

As documented in [#6320](https://github.com/kubernetes-sigs/cluster-api/issues/6320), managed topologies
assume a slice to be authored either from templates or by the users/the infrastructure controllers.

In cases where the slice is instead co-authored (templates provide some info, the infrastructure controller
fills in other info), this can lead to an infinite reconcile.

A solution to this problem is being investigated, but in the meantime you should avoid co-authored slices.