Cluster Mesh Troubleshooting
============================


Install the Cilium CLI
----------------------

.. include:: /installation/cli-download.rst

Automatic Verification
----------------------

#. Validate that Cilium pods are healthy and ready:

   .. code-block:: shell-session

      cilium status

#. Validate that Cluster Mesh is enabled and operational:

   .. code-block:: shell-session

      cilium clustermesh status

#. In case of errors, run the troubleshoot command to automatically investigate
   connectivity issues from the Cilium agents towards the ClusterMesh control
   plane in remote clusters:

   .. code-block:: shell-session

      kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg troubleshoot clustermesh

   The troubleshoot command performs a set of automatic checks to validate
   DNS resolution, network connectivity, TLS authentication, etcd authorization
   and more, and reports the output in a user-friendly format.

   When KVStoreMesh is enabled, the output of the troubleshoot command refers
   to the connections from the agents to the local cache, and it is expected to
   be the same for all the clusters they are connected to. Run the troubleshoot
   command inside the clustermesh-apiserver to investigate KVStoreMesh connectivity
   issues towards the ClusterMesh control plane in remote clusters:

   .. code-block:: shell-session

      kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh -- \
         clustermesh-apiserver kvstoremesh-dbg troubleshoot

   .. tip::

      You can specify one or more cluster names as parameters of the troubleshoot
      command to run the checks only towards a subset of remote clusters.


Manual Verification
-------------------

As an alternative to leveraging the tools presented in the previous section,
you may perform the following steps to troubleshoot ClusterMesh issues.

#. Validate that each cluster is assigned a **unique** human-readable name as
   well as a **unique** numeric cluster ID (1-255).
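   For example, you can compare these settings across clusters by reading them
   from the ``cilium-config`` ConfigMap in each cluster. This is a minimal
   sketch, assuming ``$CLUSTER1`` and ``$CLUSTER2`` hold the kubectl contexts
   of the two clusters:

   .. code-block:: shell-session

      # Print the configured cluster name and ID for each context;
      # every printed pair must be unique across the mesh.
      for ctx in "$CLUSTER1" "$CLUSTER2"; do
         kubectl --context "$ctx" -n kube-system get configmap cilium-config \
            -o jsonpath='{.data.cluster-name}{" "}{.data.cluster-id}{"\n"}'
      done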
#. Validate that the clustermesh-apiserver is initialized correctly for each cluster:

   .. code-block:: shell-session

      $ kubectl logs -n kube-system deployment/clustermesh-apiserver -c apiserver
      ...
      level=info msg="Connecting to etcd server..." config=/var/lib/cilium/etcd-config.yaml endpoints="[https://127.0.0.1:2379]" subsys=kvstore
      level=info msg="Got lock lease ID 7c0281854b945c07" subsys=kvstore
      level=info msg="Initial etcd session established" config=/var/lib/cilium/etcd-config.yaml endpoints="[https://127.0.0.1:2379]" subsys=kvstore
      level=info msg="Successfully verified version of etcd endpoint" config=/var/lib/cilium/etcd-config.yaml endpoints="[https://127.0.0.1:2379]" etcdEndpoint="https://127.0.0.1:2379" subsys=kvstore version=3.4.13

#. Validate that ClusterMesh is healthy by running ``cilium-dbg status --all-clusters``
   inside each Cilium agent::

     ClusterMesh:   1/1 remote clusters ready, 10 global-services
     k8s-c2: ready, 3 nodes, 25 endpoints, 8 identities, 10 services, 0 reconnections (last: never)
     └  etcd: 1/1 connected, leases=0, lock lease-ID=7c028201b53de662, has-quorum=true: https://k8s-c2.mesh.cilium.io:2379 - 3.5.4 (Leader)
     └  remote configuration: expected=true, retrieved=true, cluster-id=3, kvstoremesh=false, sync-canaries=true
     └  synchronization status: nodes=true, endpoints=true, identities=true, services=true

   When KVStoreMesh is enabled, additionally check its status and validate that
   it is correctly connected to all remote clusters:

   .. code-block:: shell-session

      $ kubectl --context $CLUSTER1 exec -it -n kube-system deploy/clustermesh-apiserver \
         -c kvstoremesh -- clustermesh-apiserver kvstoremesh-dbg status --verbose

#. Validate that the required TLS secrets are set up properly. By default, the
   following TLS secrets must be available in the namespace in which Cilium is
   installed:

   * ``clustermesh-apiserver-server-cert``, which is used by the etcd container
     in the clustermesh-apiserver deployment. Not applicable if an external etcd
     cluster is used.

   * ``clustermesh-apiserver-admin-cert``, which is used by the apiserver and
     kvstoremesh containers in the clustermesh-apiserver deployment to
     authenticate against the sidecar etcd instance. Not applicable if an
     external etcd cluster is used.

   * ``clustermesh-apiserver-remote-cert``, which is used by the Cilium agents,
     and optionally by the kvstoremesh container in the clustermesh-apiserver
     deployment, to authenticate against remote etcd instances (either internal
     or external).

   * ``clustermesh-apiserver-local-cert``, which is used by the Cilium agents to
     authenticate against the local etcd instance. Only applicable if KVStoreMesh
     is enabled.

#. Validate that the configuration for remote clusters is picked up correctly.
   For each remote cluster, an info log message ``New remote cluster
   configuration`` along with the remote cluster name must be logged in the
   ``cilium-agent`` logs.

   If the configuration is not found, check the following:

   * The ``cilium-clustermesh`` Kubernetes secret is present and correctly
     mounted by the Cilium agent pods.

   * The secret contains a file for each remote cluster, with the filename
     matching the name of the remote cluster as provided by the
     ``--cluster-name`` argument or the ``cluster-name`` ConfigMap option.

   * Each file named after a remote cluster contains a valid etcd configuration
     consisting of the endpoints to reach the remote etcd cluster, and the path
     of the certificate and private key to authenticate against that etcd
     cluster. Additional files may be included in the secret to provide the
     certificate and private key themselves.

   * The ``/var/lib/cilium/clustermesh`` directory inside any of the Cilium
     agent pods contains the files mounted from the ``cilium-clustermesh``
     secret. You can use
     ``kubectl exec -ti -n kube-system ds/cilium -c cilium-agent -- ls /var/lib/cilium/clustermesh``
     to list the files present.
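   For example, you can verify both the log message and the secret contents in
   one go. This is a minimal sketch, assuming ``jq`` is available on the
   machine running ``kubectl``:

   .. code-block:: shell-session

      # One such log line is expected per remote cluster
      # (ds/cilium selects a single agent pod)
      kubectl -n kube-system logs ds/cilium -c cilium-agent | grep "New remote cluster configuration"

      # The secret keys must include one entry named after each remote cluster
      kubectl -n kube-system get secret cilium-clustermesh -o jsonpath='{.data}' | jq -r 'keys[]'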
#. Validate that the connection to the remote cluster can be established.
   You will see a log message like this in the ``cilium-agent`` logs for each
   remote cluster::

      level=info msg="Connection to remote cluster established"

   If the connection failed, you will see a warning like this instead::

      level=warning msg="Unable to establish etcd connection to remote cluster"

   In that case, check the following:

   * When KVStoreMesh is disabled, validate that the ``hostAliases`` section in
     the Cilium DaemonSet maps each remote cluster to the IP of the
     LoadBalancer that makes the remote control plane available; when
     KVStoreMesh is enabled, validate the ``hostAliases`` section in the
     clustermesh-apiserver Deployment instead.

   * Validate that a local node in the source cluster can reach the IP
     specified in the ``hostAliases`` section. The configuration file for each
     remote cluster (found in the ``cilium-clustermesh`` secret when
     KVStoreMesh is disabled, and in the ``cilium-kvstoremesh`` secret when it
     is enabled) points to a logical name representing the remote cluster:

     .. code-block:: yaml

        endpoints:
        - https://cluster1.mesh.cilium.io:2379

     The name will *NOT* be resolvable via DNS outside the Cilium agent pods.
     The name is mapped to an IP using ``hostAliases``. Run ``kubectl -n
     kube-system get daemonset cilium -o yaml`` when KVStoreMesh is disabled,
     or ``kubectl -n kube-system get deployment clustermesh-apiserver -o yaml``
     when KVStoreMesh is enabled, and grep for the FQDN to retrieve the
     configured IP. Then use ``curl`` to validate that the port is reachable.

   * A firewall between the local cluster and the remote cluster may drop the
     control plane connection. Ensure that port 2379/TCP is allowed.

State Propagation
-----------------

#. Run ``cilium-dbg node list`` in one of the Cilium pods and validate that it
   lists both local nodes and nodes from remote clusters. If remote nodes are
   not present, validate that the Cilium agents (or KVStoreMesh, if enabled)
   are correctly connected to the given remote cluster. Additionally, verify
   that the initial synchronization of nodes from all clusters has completed.

#. Validate the connectivity health matrix across clusters by running
   ``cilium-health status`` inside any Cilium pod. It will list the status of
   the connectivity health check to each remote node. If this fails, make sure
   that the network allows the health checking traffic as specified in the
   :ref:`firewall_requirements` section.

#. Validate that identities are synchronized correctly by running ``cilium-dbg
   identity list`` in one of the Cilium pods. It must list identities from all
   clusters. You can determine which cluster an identity belongs to by looking
   at the label ``io.cilium.k8s.policy.cluster``. If remote identities are
   not present, validate that the Cilium agents (or KVStoreMesh, if enabled)
   are correctly connected to the given remote cluster. Additionally, verify
   that the initial synchronization of identities from all clusters has
   completed.
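   For example, to get a rough count of the identities imported from the remote
   cluster named ``k8s-c2`` (the example cluster name used on this page), you
   can filter the output on the cluster label. This is a minimal sketch; adjust
   the cluster name to your environment:

   .. code-block:: shell-session

      # Each remote identity carries exactly one cluster label line
      kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- \
         cilium-dbg identity list | grep -c "io.cilium.k8s.policy.cluster=k8s-c2"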
#. Validate that the IP cache is synchronized correctly by running ``cilium-dbg
   bpf ipcache list`` or ``cilium-dbg map get cilium_ipcache``. The output must
   contain pod IPs from local and remote clusters. If remote IP addresses are
   not present, validate that the Cilium agents (or KVStoreMesh, if enabled)
   are correctly connected to the given remote cluster. Additionally, verify
   that the initial synchronization of IPs from all clusters has completed.

#. When using global services, ensure that global services are configured with
   endpoints from all clusters. Run ``cilium-dbg service list`` in any Cilium pod
   and validate that the backend IPs consist of pod IPs from all clusters
   running relevant backends. You can further validate the correct datapath
   plumbing by running ``cilium-dbg bpf lb list`` to inspect the state of the
   eBPF maps; see the sketch after this list.

   If this fails:

   * Run ``cilium-dbg debuginfo`` and look for the section ``k8s-service-cache``. In
     that section, you will find the contents of the service correlation
     cache. It will list the Kubernetes services and endpoints of the local
     cluster. It will also have a section ``externalEndpoints`` which must
     list all endpoints of remote clusters.

     ::

        #### k8s-service-cache

        (*k8s.ServiceCache)(0xc00000c500)({
        [...]
          services: (map[k8s.ServiceID]*k8s.Service) (len=2) {
            (k8s.ServiceID) default/kubernetes: (*k8s.Service)(0xc000cd11d0)(frontend:172.20.0.1/ports=[https]/selector=map[]),
            (k8s.ServiceID) kube-system/kube-dns: (*k8s.Service)(0xc000cd1220)(frontend:172.20.0.10/ports=[metrics dns dns-tcp]/selector=map[k8s-app:kube-dns])
          },
          endpoints: (map[k8s.ServiceID]*k8s.Endpoints) (len=2) {
            (k8s.ServiceID) kube-system/kube-dns: (*k8s.Endpoints)(0xc0000103c0)(10.16.127.105:53/TCP,10.16.127.105:53/UDP,10.16.127.105:9153/TCP),
            (k8s.ServiceID) default/kubernetes: (*k8s.Endpoints)(0xc0000103f8)(192.168.60.11:6443/TCP)
          },
          externalEndpoints: (map[k8s.ServiceID]k8s.externalEndpoints) {
          }
        })

     The sections ``services`` and ``endpoints`` represent the services of the
     local cluster, while the section ``externalEndpoints`` lists all remote
     services, which will be correlated with local services matching the same
     ``ServiceID``.
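   For example, to compare the agent's view of the global services with the
   state actually plumbed into the eBPF maps, you can run both commands back to
   back. This is a minimal sketch; selecting the agent pod via the DaemonSet is
   illustrative:

   .. code-block:: shell-session

      # Frontend-to-backend mapping as seen by the agent
      kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg service list

      # The same mapping as installed in the eBPF load-balancing maps;
      # both outputs must include backends from all clusters
      kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg bpf lb list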