# Troubleshooting

## Troubleshooting Quick Start with Docker (CAPD)

<aside class="note warning">

<h1>Warning</h1>

If you've run the Quick Start before, ensure that you've [cleaned up](./quick-start.md#clean-up) all resources before trying it again. Check `docker ps` to ensure there are no running containers left before beginning the Quick Start.

</aside>

This guide assumes you've completed the [apply the workload cluster](./quick-start.md#apply-the-workload-cluster) section of the Quick Start using Docker.

When running `clusterctl describe cluster capi-quickstart` to verify the created resources, we expect the output to be similar to this (**note: this is before installing the Calico CNI**).

```shell
NAME                                                             READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/capi-quickstart                                          True                                            46m
├─ClusterInfrastructure - DockerCluster/capi-quickstart-94r9d    True                                            48m
├─ControlPlane - KubeadmControlPlane/capi-quickstart-6487w       True                                            46m
│ └─3 Machines...                                                True                                            47m    See capi-quickstart-6487w-d5lkp, capi-quickstart-6487w-mpmkq, ...
└─Workers
  └─MachineDeployment/capi-quickstart-md-0-d6dn6                 False  Warning   WaitingForAvailableMachines  48m    Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                                              True                                            47m    See capi-quickstart-md-0-d6dn6-584ff97cb7-kr7bj, capi-quickstart-md-0-d6dn6-584ff97cb7-s6cbf, ...
```

Machines should be started, but Workers are not, because Calico isn't installed yet. You should be able to see the containers running with `docker ps --all` and they should not be restarting.

If you notice that Machines are failing to start or are restarting, your output might look similar to this:

```shell
clusterctl describe cluster capi-quickstart
NAME                                                             READY  SEVERITY  REASON                       SINCE  MESSAGE
Cluster/capi-quickstart                                          False  Warning   ScalingUp                    57s    Scaling up control plane to 3 replicas (actual 2)
├─ClusterInfrastructure - DockerCluster/capi-quickstart-n5w87    True                                            110s
├─ControlPlane - KubeadmControlPlane/capi-quickstart-6587k       False  Warning   ScalingUp                    57s    Scaling up control plane to 3 replicas (actual 2)
│ ├─Machine/capi-quickstart-6587k-fgc6m                          True                                            81s
│ └─Machine/capi-quickstart-6587k-xtvnz                          False  Warning   BootstrapFailed              52s    1 of 2 completed
└─Workers
  └─MachineDeployment/capi-quickstart-md-0-5whtj                 False  Warning   WaitingForAvailableMachines  110s   Minimum availability requires 3 replicas, current 0 available
    └─3 Machines...                                              False  Info      Bootstrapping                77s    See capi-quickstart-md-0-5whtj-5d8c9746c9-f8sw8, capi-quickstart-md-0-5whtj-5d8c9746c9-hzxc2, ...
```

In the example above, we can see that the Machine `capi-quickstart-6587k-xtvnz` has failed to start. The reason provided is `BootstrapFailed`.

To investigate why a machine fails to start, you can inspect the conditions of the objects using `clusterctl describe --show-conditions all cluster capi-quickstart`. You can get more detailed information about the status of the machines using `kubectl describe machines`.

To inspect the underlying infrastructure, in this case Docker containers acting as Machines, you can access the logs using `docker logs <MACHINE-NAME>`. For example:

```shell
docker logs capi-quickstart-6587k-xtvnz
(...)
Failed to create control group inotify object: Too many open files
Failed to allocate manager object: Too many open files
[!!!!!!] Failed to allocate manager object.
Exiting PID 1...
```

To resolve this specific error, please read [Cluster API with Docker - "too many open files"](#cluster-api-with-docker----too-many-open-files).

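
To recap, a typical inspection sequence for a failing Machine looks like the following (all commands are the ones mentioned above; the Machine name is taken from the example output and will differ in your cluster):

```bash
# Check the conditions of the cluster and its machines.
clusterctl describe --show-conditions all cluster capi-quickstart
kubectl describe machines

# List the Docker containers backing the Machines and inspect the logs of a failing one.
docker ps --all
docker logs capi-quickstart-6587k-xtvnz
```
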

## Node bootstrap failures when using CABPK with cloud-init

Failures during Node bootstrapping can have a lot of different causes. For example, Cluster API resources might be
misconfigured or there might be problems with the network. The following steps describe how to troubleshoot bootstrap
failures systematically.

1. Access the Node via ssh.
2. Take a look at cloud-init logs via `less /var/log/cloud-init-output.log` or `journalctl -u cloud-init --since "1 day ago"`.
   (Note: cloud-init persists logs of the commands it executes (like kubeadm) only after they have returned.)
3. It might also be helpful to take a look at `journalctl --since "1 day ago"`.
4. If you see that kubeadm times out waiting for the static Pods to come up, take a look at:
   1. containerd: `crictl ps -a`, `crictl logs`, `journalctl -u containerd`
   2. Kubelet: `journalctl -u kubelet --since "1 day ago"`
      (Note: it might be helpful to increase the Kubelet log level by e.g. setting `--v=8` via
      `systemctl edit --full kubelet && systemctl restart kubelet`)
5. If Node bootstrapping consistently fails and the kubeadm logs are not verbose enough, the `kubeadm` verbosity
   can be increased via `KubeadmConfigSpec.Verbosity`, as shown in the sketch after this list.

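
For control plane Machines the verbosity lives under `spec.kubeadmConfigSpec.verbosity` of the `KubeadmControlPlane` (for workers, the equivalent field in the `KubeadmConfigTemplate`). A minimal sketch, assuming an object named `capi-quickstart-control-plane` (a hypothetical name) and a Cluster API version that allows updating this field in place; note that changing `kubeadmConfigSpec` triggers a rollout of the control plane Machines:

```bash
# Hypothetical object name; adjust to your cluster.
# Raises the kubeadm log level for newly bootstrapped control plane Machines.
kubectl patch kubeadmcontrolplane capi-quickstart-control-plane --type merge \
  -p '{"spec":{"kubeadmConfigSpec":{"verbosity":5}}}'
```
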

## Labeling nodes with reserved labels such as `node-role.kubernetes.io` fails with kubeadm error during bootstrap

Self-assigning Node labels such as `node-role.kubernetes.io` using the kubelet `--node-labels` flag
(see `kubeletExtraArgs` in the [CABPK examples](https://github.com/kubernetes-sigs/cluster-api/tree/main/bootstrap/kubeadm))
is not possible due to a security measure imposed by the
[`NodeRestriction` admission controller](https://kubernetes.io/docs/reference/access-authn-authz/admission-controllers/#noderestriction)
that kubeadm enables by default.

Assigning such labels to Nodes must be done after the bootstrap process has completed:

```bash
kubectl label nodes <name> node-role.kubernetes.io/worker=""
```

For convenience, here is an example one-liner to do this post-installation:

```bash
# Kubernetes 1.19 (kubeadm 1.19 sets only the node-role.kubernetes.io/master label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/master' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
# Kubernetes >= 1.20 (kubeadm >= 1.20 sets the node-role.kubernetes.io/control-plane label)
kubectl get nodes --no-headers -l '!node-role.kubernetes.io/control-plane' -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}' | xargs -I{} kubectl label node {} node-role.kubernetes.io/worker=''
```

## Cluster API with Docker

When provisioning workload clusters using Cluster API with the Docker infrastructure provider,
provisioning might get stuck:

1. if there are stopped containers on your machine from previous runs. Clean unused containers with [docker rm -f](https://docs.docker.com/engine/reference/commandline/rm/).

2. if the Docker space on your disk is being exhausted:
   * Run [docker system df](https://docs.docker.com/engine/reference/commandline/system_df/) to inspect the disk space consumed by Docker resources.
   * Run [docker system prune --volumes](https://docs.docker.com/engine/reference/commandline/system_prune/) to prune dangling images, containers, volumes, and networks.

## Cluster API with Docker - "too many open files"

When creating many nodes using Cluster API and Docker infrastructure, either by creating large Clusters or a number of small Clusters, the OS may run into inotify limits, which prevent new nodes from being provisioned.
If the error `Failed to create inotify object: Too many open files` is present in the logs of the Docker infrastructure provider, this limit is being hit.

On Linux this issue can be resolved by increasing the inotify watch limits with:

```bash
sysctl fs.inotify.max_user_watches=1048576
sysctl fs.inotify.max_user_instances=8192
```

Newly created clusters should be able to take advantage of the increased limits.

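
These `sysctl` settings only last until the next reboot. On most Linux distributions they can be persisted with a drop-in file, for example (the file name below is arbitrary):

```bash
# Persist the increased inotify limits across reboots.
cat <<EOF | sudo tee /etc/sysctl.d/99-inotify-limits.conf
fs.inotify.max_user_watches = 1048576
fs.inotify.max_user_instances = 8192
EOF
sudo sysctl --system
```
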

### MacOS and Docker Desktop - "too many open files"

This error was also observed in Docker Desktop 4.3 and 4.4 on macOS. It can be resolved by updating to Docker Desktop for Mac 4.5 or by using a version lower than 4.3.

[The upstream issue for this error is closed as of the release of Docker Desktop 4.5.0.](https://github.com/docker/for-mac/issues/6071)

Note: The workaround below is not recommended unless an upgrade or downgrade cannot be performed.

If using Docker Desktop for Mac 4.3 or 4.4, the following workaround can be used to increase the maximum inotify file watch settings in the Docker Desktop VM:

1) Enter the Docker Desktop VM
```bash
nc -U ~/Library/Containers/com.docker.docker/Data/debug-shell.sock
```
2) Increase the inotify limits using sysctl
```bash
sysctl fs.inotify.max_user_watches=1048576
sysctl fs.inotify.max_user_instances=8192
```
3) Exit the Docker Desktop VM
```bash
exit
```

## Failed clusterctl init - 'failed to get cert-manager object'

When using older versions of the Cluster API 0.4 and 1.0 releases (0.4.6 and 1.0.3 or older, respectively), Cert Manager may not be downloadable due to a change in the repository location. This will cause `clusterctl init` to fail with the error:

```bash
clusterctl init --infrastructure docker
```
```bash
Fetching providers
Installing cert-manager Version="v1.11.0"
Error: action failed after 10 attempts: failed to get cert-manager object /, Kind=, /: Object 'Kind' is missing in 'unstructured object has no kind'
```

This error was fixed in more recent Cluster API releases on the 0.4 and 1.0 release branches. The simplest way to resolve the issue is to upgrade to a newer version of Cluster API for a given release. For those who need to continue using an older release, it is possible to override the repository used by `clusterctl init` in the clusterctl config file. The default location of this file is `$XDG_CONFIG_HOME/cluster-api/clusterctl.yaml`.

To do so, add the following to the file:
```yaml
cert-manager:
  url: "https://github.com/cert-manager/cert-manager/releases/latest/cert-manager.yaml"
```

Alternatively, a Cert Manager yaml file can be placed in the [clusterctl overrides layer](../clusterctl/configuration.md#overrides-layer), which is by default in `$XDG_CONFIG_HOME/cluster-api/overrides`, for example at
`$XDG_CONFIG_HOME/cluster-api/overrides/cert-manager/v1.11.0/cert-manager.yaml`.

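
As a sketch, the overrides layer could be pre-populated with a manifest downloaded from the cert-manager releases; the version matches the example above and is only illustrative, and `$HOME/.config` is assumed as the fallback when `XDG_CONFIG_HOME` is unset:

```bash
# Place a cert-manager manifest in the clusterctl overrides layer.
CERT_MANAGER_VERSION=v1.11.0
OVERRIDES_DIR="${XDG_CONFIG_HOME:-$HOME/.config}/cluster-api/overrides/cert-manager/${CERT_MANAGER_VERSION}"
mkdir -p "${OVERRIDES_DIR}"
curl -L -o "${OVERRIDES_DIR}/cert-manager.yaml" \
  "https://github.com/cert-manager/cert-manager/releases/download/${CERT_MANAGER_VERSION}/cert-manager.yaml"
```
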

More information on the clusterctl config file can be found at [its page in the book](../clusterctl/configuration.md#clusterctl-configuration-file).

## Failed clusterctl upgrade apply - 'failed to update cert-manager component'

Upgrading Cert Manager may fail due to a breaking change introduced in Cert Manager release v1.6.
An upgrade using `clusterctl` is affected when:

* using `clusterctl` in version `v1.1.4` or a more recent version.
* Cert Manager lower than version `v1.0.0` did run in the management cluster (which was shipped in Cluster API until and including `v0.3.14`).

This will cause `clusterctl upgrade apply` to fail with the error:

```bash
clusterctl upgrade apply
```

```bash
Checking cert-manager version...
Deleting cert-manager Version="v1.5.3"
Installing cert-manager Version="v1.7.2"
Error: action failed after 10 attempts: failed to update cert-manager component apiextensions.k8s.io/v1, Kind=CustomResourceDefinition, /certificaterequests.cert-manager.io: CustomResourceDefinition.apiextensions.k8s.io "certificaterequests.cert-manager.io" is invalid: status.storedVersions[0]: Invalid value: "v1alpha2": must appear in spec.versions
```

The Cert Manager maintainers provide documentation to [migrate the deprecated API Resources](https://cert-manager.io/docs/installation/upgrading/remove-deprecated-apis/#upgrading-existing-cert-manager-resources) to the new storage versions to mitigate the issue.

More information about the change in Cert Manager can be found in [their upgrade notes from v1.5 to v1.6](https://cert-manager.io/docs/installation/upgrading/upgrading-1.5-1.6).

## Clusterctl failing to start providers due to outdated image overrides

clusterctl allows users to configure [image overrides](../clusterctl/configuration.md#image-overrides) via the clusterctl config file.
However, when an image override pins a provider image to a specific version, this can conflict with clusterctl's behavior of picking the latest version of a provider.

E.g., if you are pinning KCP images to version v1.0.2 but clusterctl init fetches yamls for version v1.1.0 or greater, KCP will
fail to start with the following error:

```bash
invalid argument "ClusterTopology=false,KubeadmBootstrapFormatIgnition=false" for "--feature-gates" flag: unrecognized feature gate: KubeadmBootstrapFormatIgnition
```

In order to solve this problem you should specify the version of the provider you are installing by appending a
version tag to the provider name:

```bash
clusterctl init -b kubeadm:v1.0.2 -c kubeadm:v1.0.2 --core cluster-api:v1.0.2 -i docker:v1.0.2
```

Even if slightly verbose, pinning the version provides better control over what is installed, as is usually
required in an enterprise environment, especially if you rely on an internal repository with a separate
software supply chain or a custom versioning schema.

## Managed Cluster and co-authored slices

As documented in [#6320](https://github.com/kubernetes-sigs/cluster-api/issues/6320), managed topologies
assume a slice to be authored either from templates or by the users/the infrastructure controllers.

In cases where the slice is instead co-authored (templates provide some info, the infrastructure controller
fills in other info), this can lead to an infinite reconcile.

A solution to this problem is being investigated, but in the meantime you should avoid co-authored slices.