
# Troubleshooting

## Resources aren't being created

TODO

## Target cluster's control plane machine is up but target cluster's apiserver not working as expected

If the `aws-provider-controller-manager-0` logs did not help, you might want to look into the cloud-init logs, `/var/log/cloud-init-output.log`, on the controller host.
Verifying kubelet status and logs may also provide hints:
```bash
journalctl -u kubelet.service
systemctl status kubelet
```
To reach the controller host from your local machine:
```bash
ssh -i <private-key> -o "ProxyCommand ssh -W %h:%p -i <private-key> ubuntu@<bastion-IP>" ubuntu@<controller-host-IP>
```

`private-key` is the private key from the key pair discussed in the `ssh key pair` section above.

## kubelet on the control plane host failing with error: NoCredentialProviders
```bash
failed to run Kubelet: could not init cloud provider "aws": error finding instance i-0c276f2a1f1c617b2: "error listing AWS instances: \"NoCredentialProviders: no valid providers in chain. Deprecated.\\n\\tFor verbose messaging see aws.Config.CredentialsChainVerboseErrors\""
```
This error can occur if the `CloudFormation` stack is not created properly and the IAM instance profile is missing the appropriate roles. Run the following command to inspect the IAM instance profile:
```bash
$ aws iam get-instance-profile --instance-profile-name control-plane.cluster-api-provider-aws.sigs.k8s.io --output json
{
    "InstanceProfile": {
        "InstanceProfileId": "AIPAJQABLZS4A3QDU576Q",
        "Roles": [
            {
                "AssumeRolePolicyDocument": {
                    "Version": "2012-10-17",
                    "Statement": [
                        {
                            "Action": "sts:AssumeRole",
                            "Effect": "Allow",
                            "Principal": {
                                "Service": "ec2.amazonaws.com"
                            }
                        }
                    ]
                },
                "RoleId": "AROAJQABLZS4A3QDU576Q",
                "CreateDate": "2019-05-13T16:45:12Z",
                "RoleName": "control-plane.cluster-api-provider-aws.sigs.k8s.io",
                "Path": "/",
                "Arn": "arn:aws:iam::123456789012:role/control-plane.cluster-api-provider-aws.sigs.k8s.io"
            }
        ],
        "CreateDate": "2019-05-13T16:45:28Z",
        "InstanceProfileName": "control-plane.cluster-api-provider-aws.sigs.k8s.io",
        "Path": "/",
        "Arn": "arn:aws:iam::123456789012:instance-profile/control-plane.cluster-api-provider-aws.sigs.k8s.io"
    }
}
```
If the instance profile does not look as expected, you may try recreating the CloudFormation stack using `clusterawsadm` as explained in the above sections.


## Recover a management cluster after losing the api server load balancer

These steps outline the process for recovering a management cluster after losing the api server load balancer. They are needed because AWS load balancers have dynamically generated DNS names: when a load balancer is deleted, CAPA recreates it, but the new load balancer has a different DNS name that does not match the original, so we must update several resources as well as the certs to match the new name before the cluster is healthy again. There are a few different scenarios in which this can happen:

* The load balancer gets deleted by some external process or user.
* A cluster with the same name as the management cluster is created in a different namespace and then deleted, which deletes the existing load balancer. This is due to ownership of AWS resources being managed by tags. See this [issue](https://github.com/kubernetes-sigs/cluster-api-provider-aws/issues/969#issuecomment-519121056) for reference.

### **Access the api server locally**

1. SSH to a control plane node and modify `/etc/kubernetes/admin.conf`:

    * Replace the `server` value with `server: https://localhost:6443`

    * Add `insecure-skip-tls-verify: true`

    * Comment out `certificate-authority-data:`

2. Export the kubeconfig and ensure you can connect:

    ```bash
    export KUBECONFIG=/etc/kubernetes/admin.conf
    kubectl get nodes
    ```

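The `admin.conf` edits above can also be scripted. Below is a minimal sketch of the same three changes; it runs against a stand-in copy of the file whose layout is an assumption based on a standard kubeadm-generated kubeconfig, and it assumes GNU `sed` (`-i`, `\n` in replacements). On a real node you would target `/etc/kubernetes/admin.conf` instead:

```bash
# Stand-in for /etc/kubernetes/admin.conf so the sketch is safe to run anywhere.
cat > /tmp/admin.conf <<'EOF'
clusters:
- cluster:
    certificate-authority-data: LS0tLS1CRUdJTi4uLg==
    server: https://old-lb.example.com:6443
  name: kubernetes
EOF

# Point the kubeconfig at the local api server,
sed -i 's|server: https://.*|server: https://localhost:6443|' /tmp/admin.conf
# comment out the CA data,
sed -i 's|certificate-authority-data:|# certificate-authority-data:|' /tmp/admin.conf
# and skip TLS verification for the now-mismatched certificate.
sed -i 's|server: https://localhost:6443|&\n    insecure-skip-tls-verify: true|' /tmp/admin.conf

cat /tmp/admin.conf
```
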

### **Get rid of the lingering duplicate cluster**

**This step is only needed in the scenario that a duplicate cluster was created and deleted, which caused the api server load balancer to be deleted.**

1. Since the duplicate cluster is stuck deleting (some of its resources cannot be cleaned up because they are still in use), we need to stop the conflicting reconciliation process. Edit the duplicate AWSCluster object and remove the `finalizers`:

    ```bash
    kubectl edit awscluster <clustername>
    ```
2. Run `kubectl describe awscluster <clustername>` to validate that the finalizers have been removed.

3. Run `kubectl get clusters` to verify the cluster is gone.
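
Instead of editing interactively, the finalizers can also be removed with a merge patch. The sketch below exercises the same edit on a local stand-in file, since it cannot assume a live cluster; the object contents are hypothetical, and the live-cluster command appears only as a comment:

```bash
# Live-cluster equivalent of deleting the finalizers in `kubectl edit` (not run here):
#   kubectl patch awscluster <clustername> --type merge -p '{"metadata":{"finalizers":null}}'

# Stand-in for what `kubectl get awscluster <clustername> -o yaml` might show:
cat > /tmp/awscluster.yaml <<'EOF'
metadata:
  finalizers:
  - awscluster.infrastructure.cluster.x-k8s.io
  name: duplicate-cluster
EOF

# Drop the finalizers list, the same change the interactive edit makes:
sed -i '/finalizers:/,/^  - /d' /tmp/awscluster.yaml
cat /tmp/awscluster.yaml
```
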


### **Make at least one node `Ready`**

1. Right now all endpoints are down because no nodes are ready, which is problematic for the CoreDNS and CNI pods in particular. Let's get one control plane node back to healthy. On the control plane node we logged into, edit `/etc/kubernetes/kubelet.conf`:

    * Replace the `server` value with `server: https://localhost:6443`

    * Add `insecure-skip-tls-verify: true`

    * Comment out `certificate-authority-data:`

    * Restart the kubelet: `systemctl restart kubelet`

2. Run `kubectl get nodes` and validate that the node is in a `Ready` state.
3. After a few minutes most things should start scheduling themselves on the node. The pods that did not restart on their own and were causing issues were CoreDNS, kube-proxy, and the CNI pods; restart those manually.
4. (optional) Tail the CAPA logs to see the load balancer start to reconcile:

    ```bash
    kubectl logs -f -n capa-system deployments.apps/capa-controller-manager
    ```

### **Update the control plane nodes with new LB settings**

1. To be safe, we will do this on all control plane nodes rather than recreating them, to avoid potential data loss issues. Follow these steps for **each** control plane node.

2. Regenerate the certs for the api server using the new name. Make sure to update the service CIDR and endpoint in the command below.

    ```bash
    rm /etc/kubernetes/pki/apiserver.crt
    rm /etc/kubernetes/pki/apiserver.key

    kubeadm init phase certs apiserver --control-plane-endpoint="mynewendpoint.com" --service-cidr=100.64.0.0/13 -v10
    ```

3. Update the settings in `/etc/kubernetes/admin.conf`:

    * Replace the `server` value with `server: https://<your-new-lb.com>:6443`

    * Remove `insecure-skip-tls-verify: true`

    * Uncomment `certificate-authority-data:`

    * Export the kubeconfig and ensure you can connect:

        ```bash
        export KUBECONFIG=/etc/kubernetes/admin.conf
        kubectl get nodes
        ```

4. Update the settings in `/etc/kubernetes/kubelet.conf`:

    * Replace the `server` value with `server: https://<your-new-lb.com>:6443`

    * Remove `insecure-skip-tls-verify: true`

    * Uncomment `certificate-authority-data:`

    * Restart the kubelet: `systemctl restart kubelet`

5. Just as we did before, we need new pods to pick up the api server cache changes, so force-restart pods such as the CNI pods, kube-proxy, CoreDNS, etc.

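After regenerating the certificate in step 2, you can confirm it actually carries the new endpoint by inspecting its subject alternative names. The sketch below generates a throwaway self-signed certificate so it is runnable anywhere (it assumes OpenSSL 1.1.1+ for `-addext`/`-ext`); on a real control plane node you would point the second command at `/etc/kubernetes/pki/apiserver.crt` instead. `mynewendpoint.com` is the example endpoint from step 2:

```bash
# Throwaway self-signed cert with the new endpoint as a SAN (stand-in for
# /etc/kubernetes/pki/apiserver.crt produced by `kubeadm init phase certs`).
openssl req -x509 -newkey rsa:2048 -nodes -days 1 \
  -keyout /tmp/apiserver.key -out /tmp/apiserver.crt \
  -subj "/CN=kube-apiserver" \
  -addext "subjectAltName=DNS:mynewendpoint.com,DNS:kubernetes"

# The new LB name must appear in the SAN list, or clients will fail TLS verification.
openssl x509 -in /tmp/apiserver.crt -noout -ext subjectAltName
```
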
### Update capi settings for new LB DNS name

1. Update the control plane endpoint on the `awscluster` and `cluster` objects. To do this we need to disable the validating webhooks; we will back them up and then delete them so we can re-apply them later.

    ```bash
    kubectl get validatingwebhookconfigurations capa-validating-webhook-configuration -o yaml > capa-webhook && kubectl delete validatingwebhookconfigurations capa-validating-webhook-configuration

    kubectl get validatingwebhookconfigurations capi-validating-webhook-configuration -o yaml > capi-webhook && kubectl delete validatingwebhookconfigurations capi-validating-webhook-configuration
    ```

2. Edit the `spec.controlPlaneEndpoint.host` field on both `awscluster` and `cluster` to have the new endpoint.

3. Re-apply your webhooks:

    ```bash
    kubectl apply -f capi-webhook
    kubectl apply -f capa-webhook
    ```

4. Update the following config maps, replacing the old control plane name with the new one:

    ```bash
    kubectl edit cm -n kube-system kubeadm-config
    kubectl edit cm -n kube-system kube-proxy
    kubectl edit cm -n kube-public cluster-info
    ```

5. Edit the cluster kubeconfig secret that capi uses to talk to the management cluster. You will need to decode the secret, replace the endpoint, re-encode it, and save.

    ```bash
    kubectl edit secret -n <namespace> <cluster-name>-kubeconfig
    ```
6. At this point things should start to reconcile on their own, but we can use the commands in the next step to force it.
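
The decode/replace/re-encode round trip from step 5 looks roughly like the sketch below. It operates on a stand-in base64 value so it is safe to run anywhere; the endpoint names are examples, GNU `base64 -w0` is assumed, and the `kubectl` commands for pulling and pushing the real secret value appear only as comments:

```bash
# Pull the real value with (live cluster, not run here):
#   kubectl get secret -n <namespace> <cluster-name>-kubeconfig -o jsonpath='{.data.value}'
# Stand-in for that base64-encoded kubeconfig:
OLD_B64=$(printf 'server: https://old-lb.example.com:6443\n' | base64 -w0)

# Decode, swap the endpoint, re-encode.
NEW_B64=$(echo "$OLD_B64" | base64 -d \
  | sed 's|old-lb.example.com|mynewendpoint.com|' \
  | base64 -w0)

echo "$NEW_B64" | base64 -d

# Push it back with (live cluster, not run here):
#   kubectl patch secret -n <namespace> <cluster-name>-kubeconfig -p "{\"data\":{\"value\":\"$NEW_B64\"}}"
```
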


### Roll all of the nodes to make sure everything is fresh

1. Trigger a rollout of the control plane machines:

    ```bash
    kubectl patch kcp <clusternamekcp> -n <namespace> --type merge -p "{\"spec\":{\"rolloutAfter\":\"`date +'%Y-%m-%dT%TZ'`\"}}"
    ```

2. Trigger a rollout of the worker machines:

    ```bash
    kubectl patch machinedeployment <cluster-name>-md-0 -n <namespace> --type merge -p "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"date\":\"`date +'%s'`\"}}}}}"
    ```
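
The backtick-embedded `date` calls are what make each patch unique, which is what triggers a fresh rollout. A sketch of the payload the first command generates, runnable without a cluster (`-u` is added here on the assumption that the timestamp should be UTC to match the literal `Z` suffix):

```bash
# Build the rolloutAfter patch body the same way the kubectl command above does.
PATCH="{\"spec\":{\"rolloutAfter\":\"`date -u +'%Y-%m-%dT%TZ'`\"}}"
echo "$PATCH"
```
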