
# Troubleshooting Guide

Common issues users might run into when using Cluster API Provider for Azure. This list is a work in progress. Feel free to open a PR to add to it if you find that useful information is missing.

## Examples of troubleshooting real-world issues

### No Azure resources are getting created

This is likely due to missing or invalid Azure credentials.

Check the CAPZ controller logs on the management cluster:

```bash
kubectl logs deploy/capz-controller-manager -n capz-system manager
```

If you see an error similar to this:

```
azure.BearerAuthorizer#WithAuthorization: Failed to refresh the Token for request to https://management.azure.com/subscriptions/123/providers/Microsoft.Compute/skus?%24filter=location+eq+%27eastus2%27&api-version=2019-04-01: StatusCode=401 -- Original Error: adal: Refresh request failed. Status Code = '401'. Response body: {\"error\":\"invalid_client\",\"error_description\":\"AADSTS7000215: Invalid client secret is provided.
```

Make sure the provided Service Principal client ID and client secret are correct and that the password has not expired.
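
If you are using a Service Principal, the Azure CLI can show when its secrets expire. This is a sketch: `$AZURE_CLIENT_ID` is assumed to be the app (client) ID used by CAPZ, and the `endDateTime` value below is a made-up example of what the CLI returns:

```shell
# List credential expiry dates for the app registration; `endDateTime`
# in the output is the value that matters:
#   az ad app credential list --id "$AZURE_CLIENT_ID" --query '[].endDateTime'

# ISO 8601 UTC timestamps compare correctly as plain strings, so a lexical
# comparison against the current time shows whether a secret has expired.
# The endDateTime below is a made-up example:
end="2023-01-01T00:00:00Z"
now=$(date -u +%Y-%m-%dT%H:%M:%SZ)
if [[ "$end" < "$now" ]]; then
  echo "client secret expired on $end"
fi
```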

### The AzureCluster infrastructure is provisioned but no virtual machines are coming up

Your Azure subscription might have no quota for the requested VM size in the specified Azure location.

Check the CAPZ controller logs on the management cluster:

```bash
kubectl logs deploy/capz-controller-manager -n capz-system manager
```

If you see an error similar to this:

```
"error"="failed to reconcile AzureMachine: failed to create virtual machine: failed to create VM capz-md-0-qkg6m in resource group capz-fkl3tp: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=\u003cnil\u003e Code=\"OperationNotAllowed\" Message=\"Operation could not be completed as it results in exceeding approved standardDSv3Family Cores quota.
```

Follow [these steps](https://learn.microsoft.com/azure/azure-resource-manager/templates/error-resource-quota) to request a quota increase. Alternatively, you can specify another Azure location and/or VM size during cluster creation.
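
You can also compare current usage against the quota before creating the cluster. `az vm list-usage` is the relevant Azure CLI command; the family name and the sample numbers below are illustrative:

```shell
# Show current vCPU usage against the limit for a VM family in a region:
#   az vm list-usage --location eastus2 -o table \
#     --query "[?contains(name.value, 'standardDSv3Family')]"

# Using hypothetical numbers from that output, check whether the machines
# you are about to create would exceed the remaining quota:
current=8      # CurrentValue reported by the CLI
limit=10       # Limit reported by the CLI
requested=4    # vCPUs required by the new machines
if [ $((current + requested)) -gt "$limit" ]; then
  echo "quota exceeded: need $((current + requested)) vCPUs, limit is $limit"
fi
```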

### A virtual machine is running but the k8s node did not join the cluster

Check the AzureMachine (or AzureMachinePool if using a MachinePool) status:

```bash
kubectl get azuremachines -o wide
```

If you see an output like this:

```
NAME                                       READY   STATE
default-template-md-0-w78jt                false   Updating
```

This indicates that the bootstrap script has not yet succeeded. Check the AzureMachine `status.conditions` field for more information.
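
To inspect those conditions, a `jsonpath` query works well; the machine name below comes from the example output above:

```shell
# Print each condition's type, status, and message for the machine:
kubectl get azuremachine default-template-md-0-w78jt \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.message}{"\n"}{end}'
```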

[Take a look at the cloud-init logs](#checking-cloud-init-logs-ubuntu) for further debugging.

### One or more control plane replicas are missing

Take a look at the KubeadmControlPlane controller logs and look for any potential errors:

```bash
kubectl logs deploy/capi-kubeadm-control-plane-controller-manager -n capi-kubeadm-control-plane-system manager
```

In addition, make sure all pods on the workload cluster are healthy, including pods in the `kube-system` namespace.

### Nodes are in NotReady state

Make sure you have installed a CNI on the workload cluster and that all the pods on the workload cluster are in a Running state.
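
Assuming your kubeconfig points at the workload cluster, the following is a quick way to check both; the field selector surfaces any pods that are not Running (e.g. a missing CNI typically leaves `kube-system` pods Pending):

```shell
# Node readiness and any pods stuck outside the Running phase:
kubectl get nodes -o wide
kubectl get pods -A --field-selector=status.phase!=Running
```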

### Load Balancer service fails to come up

Check the cloud-controller-manager logs on the workload cluster.

If running the Azure cloud provider in-tree:

```bash
kubectl logs kube-controller-manager-<control-plane-node-name> -n kube-system
```

If running the Azure cloud provider out-of-tree:

```bash
kubectl logs cloud-controller-manager -n kube-system
```
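
It is also worth inspecting the Service itself on the workload cluster; an `EXTERNAL-IP` stuck in `<pending>` usually means the cloud provider could not create the load balancer, and the Events section often names the reason (`<service-name>` is a placeholder):

```shell
# Check the Service's external IP and recent events:
kubectl get service <service-name> -o wide
kubectl describe service <service-name>
```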

## Watching Kubernetes resources

To watch progression of all Cluster API resources on the management cluster you can run:

```bash
kubectl get cluster-api
```
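
If you have `clusterctl` installed, it can render the same resources as a tree with their conditions, which is often easier to scan (the cluster name is a placeholder):

```shell
clusterctl describe cluster <workload-cluster-name> --show-conditions all
```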

## Looking at controller logs

To check the CAPZ controller logs on the management cluster, run:

```bash
kubectl logs deploy/capz-controller-manager -n capz-system manager
```

### Checking cloud-init logs (Ubuntu)

Cloud-init logs can provide more information on any issues that happened when running the bootstrap script.

#### Option 1: Using the Azure Portal

Located in the virtual machine blade (if [enabled](https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/boot-diagnostics) for the VM), the boot diagnostics option is under the Support and Troubleshooting section in the Azure portal.

For more information, see [here](https://learn.microsoft.com/azure/virtual-machines/boot-diagnostics#boot-diagnostics-view).

#### Option 2: Using the Azure CLI

```bash
az vm boot-diagnostics get-boot-log --name MyVirtualMachine --resource-group MyResourceGroup
```

For more information, see [here](https://learn.microsoft.com/cli/azure/vm/boot-diagnostics?view=azure-cli-latest).
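
Both options require boot diagnostics to be enabled on the VM; if it is not, you can turn it on for an existing VM with the Azure CLI (names are placeholders):

```shell
az vm boot-diagnostics enable --name MyVirtualMachine --resource-group MyResourceGroup
```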

#### Option 3: With SSH

Using the SSH information provided during cluster creation (environment variable `AZURE_SSH_PUBLIC_KEY_B64`):

##### Connect to the first control plane node (`capi` is the default Linux user created by the deployment)

```bash
API_SERVER=$(kubectl get azurecluster capz-cluster -o jsonpath='{.spec.controlPlaneEndpoint.host}')
ssh capi@${API_SERVER}
```

##### List nodes

```
kubectl get azuremachines
NAME                               READY   STATE
capz-cluster-control-plane-2jprg   true    Succeeded
capz-cluster-control-plane-ck5wv   true    Succeeded
capz-cluster-control-plane-w4tv6   true    Succeeded
capz-cluster-md-0-s52wb            false   Failed
capz-cluster-md-0-w8xxw            true    Succeeded
```

##### Pick a node name from the output above

```bash
node=$(kubectl get azuremachine capz-cluster-md-0-s52wb -o jsonpath='{.status.addresses[0].address}')
ssh -J capi@${API_SERVER} capi@${node}
```

##### Look at cloud-init logs

`less /var/log/cloud-init-output.log`
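
Beyond the cloud-init output log, a few other logs on the node are often useful once you are connected; these commands assume a systemd-based image such as the default Ubuntu one:

```shell
less /var/log/cloud-init.log               # full cloud-init module log
sudo journalctl -u kubelet --no-pager      # kubelet logs
sudo journalctl -u containerd --no-pager   # container runtime logs
```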

## Automated log collection

As part of CI there is a [log collection tool](https://github.com/kubernetes-sigs/cluster-api-provider-azure/tree/main/test/logger.go) <!-- markdown-link-check-disable-line -->
which you can also leverage to pull all the logs for machines; by default, it will dump logs to `${PWD}/_artifacts`. The following works
if your kubeconfig is configured with the management cluster. See the tool for more settings.

```bash
go run -tags e2e ./test/logger.go --name <workload-cluster-name> --namespace <workload-cluster-namespace>
```

There are also some [provided scripts](https://github.com/kubernetes-sigs/cluster-api-provider-azure/tree/main/hack/debugging) that can help automate a few common tasks.