github.com/cilium/cilium@v1.16.2/Documentation/operations/troubleshooting_clustermesh.rst

Cluster Mesh Troubleshooting
============================


Install the Cilium CLI
----------------------

.. include:: /installation/cli-download.rst

Automatic Verification
----------------------

 #. Validate that Cilium pods are healthy and ready:

    .. code-block:: shell-session

       cilium status

 #. Validate that Cluster Mesh is enabled and operational:

    .. code-block:: shell-session

       cilium clustermesh status

 #. In case of errors, run the troubleshoot command to automatically investigate
    connectivity issues from the Cilium agents towards the ClusterMesh control
    plane in remote clusters:

    .. code-block:: shell-session

       kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- cilium-dbg troubleshoot clustermesh

    The troubleshoot command performs a set of automatic checks to validate
    DNS resolution, network connectivity, TLS authentication, etcd authorization
    and more, and reports the output in a user-friendly format.

    When KVStoreMesh is enabled, the output of the troubleshoot command refers
    to the connections from the agents to the local cache, and is therefore
    expected to be the same for all the clusters they are connected to. Run the
    troubleshoot command inside the clustermesh-apiserver to investigate
    KVStoreMesh connectivity issues towards the ClusterMesh control plane in
    remote clusters:

    .. code-block:: shell-session

       kubectl exec -it -n kube-system deploy/clustermesh-apiserver -c kvstoremesh -- \
         clustermesh-apiserver kvstoremesh-dbg troubleshoot

    .. tip::

       You can specify one or more cluster names as parameters of the troubleshoot
       command to run the checks only towards a subset of remote clusters.

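    For example, assuming a remote cluster named ``cluster2`` (a placeholder for
    your own cluster name), you could restrict the checks to that cluster only:

    .. code-block:: shell-session

       kubectl exec -it -n kube-system ds/cilium -c cilium-agent -- \
         cilium-dbg troubleshoot clustermesh cluster2
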
Manual Verification
-------------------

As an alternative to leveraging the tools presented in the previous section,
you may perform the following steps to troubleshoot ClusterMesh issues.

 #. Validate that each cluster is assigned a **unique** human-readable name as well
    as a numeric cluster ID (1-255).

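    As a quick check, both values can be read from the ``cilium-config``
    ConfigMap (key names as set by the official Helm chart):

    .. code-block:: shell-session

       kubectl -n kube-system get configmap cilium-config \
         -o jsonpath='cluster-name={.data.cluster-name} cluster-id={.data.cluster-id}{"\n"}'
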
 #. Validate that the clustermesh-apiserver is initialized correctly for each cluster:

    .. code-block:: shell-session

        $ kubectl logs -n kube-system deployment/clustermesh-apiserver -c apiserver
        ...
        level=info msg="Connecting to etcd server..." config=/var/lib/cilium/etcd-config.yaml endpoints="[https://127.0.0.1:2379]" subsys=kvstore
        level=info msg="Got lock lease ID 7c0281854b945c07" subsys=kvstore
        level=info msg="Initial etcd session established" config=/var/lib/cilium/etcd-config.yaml endpoints="[https://127.0.0.1:2379]" subsys=kvstore
        level=info msg="Successfully verified version of etcd endpoint" config=/var/lib/cilium/etcd-config.yaml endpoints="[https://127.0.0.1:2379]" etcdEndpoint="https://127.0.0.1:2379" subsys=kvstore version=3.4.13

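    If no such messages appear, first make sure the deployment rolled out
    successfully, for example:

    .. code-block:: shell-session

       kubectl -n kube-system rollout status deployment/clustermesh-apiserver
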
 #. Validate that ClusterMesh is healthy by running ``cilium-dbg status --all-clusters`` inside each Cilium agent::

        ClusterMesh:   1/1 remote clusters ready, 10 global-services
           k8s-c2: ready, 3 nodes, 25 endpoints, 8 identities, 10 services, 0 reconnections (last: never)
           └  etcd: 1/1 connected, leases=0, lock lease-ID=7c028201b53de662, has-quorum=true: https://k8s-c2.mesh.cilium.io:2379 - 3.5.4 (Leader)
           └  remote configuration: expected=true, retrieved=true, cluster-id=3, kvstoremesh=false, sync-canaries=true
           └  synchronization status: nodes=true, endpoints=true, identities=true, services=true

    When KVStoreMesh is enabled, additionally check its status and validate that
    it is correctly connected to all remote clusters:

    .. code-block:: shell-session

       $ kubectl --context $CLUSTER1 exec -it -n kube-system deploy/clustermesh-apiserver \
           -c kvstoremesh -- clustermesh-apiserver kvstoremesh-dbg status --verbose

 #. Validate that the required TLS secrets are set up properly. By default, the
    following TLS secrets must be available in the namespace in which Cilium is
    installed:

    * ``clustermesh-apiserver-server-cert``, which is used by the etcd container
      in the clustermesh-apiserver deployment. Not applicable if an external etcd
      cluster is used.

    * ``clustermesh-apiserver-admin-cert``, which is used by the apiserver/kvstoremesh
      containers in the clustermesh-apiserver deployment, to authenticate against the
      sidecar etcd instance. Not applicable if an external etcd cluster is used.

    * ``clustermesh-apiserver-remote-cert``, which is used by Cilium agents, and
      optionally the kvstoremesh container in the clustermesh-apiserver deployment,
      to authenticate against remote etcd instances (either internal or external).

    * ``clustermesh-apiserver-local-cert``, which is used by Cilium agents to
      authenticate against the local etcd instance. Only applicable if KVStoreMesh
      is enabled.

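    As a sketch, you can verify that the expected secrets exist with:

    .. code-block:: shell-session

       kubectl -n kube-system get secrets | grep clustermesh-apiserver
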
 #. Validate that the configuration for remote clusters is picked up correctly.
    For each remote cluster, an info log message ``New remote cluster
    configuration`` along with the remote cluster name must be logged in the
    ``cilium-agent`` logs.

    If the configuration is not found, check the following:

    * The ``cilium-clustermesh`` Kubernetes secret is present and correctly
      mounted by the Cilium agent pods.

    * The secret contains a file for each remote cluster, with the filename matching
      the name of the remote cluster as provided by the ``--cluster-name`` argument
      or the ``cluster-name`` ConfigMap option.

    * Each file named after a remote cluster contains a valid etcd configuration
      consisting of the endpoints to reach the remote etcd cluster, and the path
      of the certificate and private key to authenticate against that etcd cluster.
      Additional files may be included in the secret to provide the certificate
      and private key themselves.

    * The ``/var/lib/cilium/clustermesh`` directory inside any of the Cilium agent
      pods contains the files mounted from the ``cilium-clustermesh`` secret.
      You can use
      ``kubectl exec -ti -n kube-system ds/cilium -c cilium-agent -- ls /var/lib/cilium/clustermesh``
      to list the files present.

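    To look for the expected log message, you can grep the agent logs (this
    inspects a single pod of the DaemonSet; repeat per pod if needed):

    .. code-block:: shell-session

       kubectl -n kube-system logs ds/cilium -c cilium-agent | \
         grep "New remote cluster configuration"
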
 #. Validate that the connection to the remote cluster could be established.
    You will see a log message like this in the ``cilium-agent`` logs for each
    remote cluster::

       level=info msg="Connection to remote cluster established"

    If the connection failed, you will see a warning like this::

       level=warning msg="Unable to establish etcd connection to remote cluster"

    If the connection fails, check the following:

    * When KVStoreMesh is disabled, validate that the ``hostAliases`` section in
      the Cilium DaemonSet maps each remote cluster to the IP of the LoadBalancer
      that makes the remote control plane available. When KVStoreMesh is enabled,
      validate the ``hostAliases`` section in the clustermesh-apiserver Deployment
      instead.

    * Validate that a local node in the source cluster can reach the IP
      specified in the ``hostAliases`` section. The ``cilium-clustermesh`` secret
      (or the ``cilium-kvstoremesh`` secret, when KVStoreMesh is enabled) contains
      a configuration file for each remote cluster, which points to a logical name
      representing that remote cluster:

      .. code-block:: yaml

         endpoints:
         - https://cluster1.mesh.cilium.io:2379

      This name will *NOT* be resolvable via DNS outside the Cilium agent pods;
      it is mapped to an IP using ``hostAliases``. Run ``kubectl -n
      kube-system get daemonset cilium -o yaml`` when KVStoreMesh is disabled,
      or ``kubectl -n kube-system get deployment clustermesh-apiserver -o yaml``
      when KVStoreMesh is enabled, and grep for the FQDN to retrieve the
      configured IP. Then use ``curl`` to validate that the port is reachable.

    * A firewall between the local cluster and the remote cluster may drop the
      control plane connection. Ensure that port 2379/TCP is allowed.

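    The lookup and reachability check can be sketched as follows;
    ``cluster1.mesh.cilium.io`` is the example name from above, and the IP is a
    placeholder for the one configured in your mesh:

    .. code-block:: shell-session

       # Retrieve the configured hostAliases mapping (KVStoreMesh disabled).
       kubectl -n kube-system get daemonset cilium \
         -o jsonpath='{.spec.template.spec.hostAliases}{"\n"}'

       # The request is expected to fail authentication; a TLS handshake error
       # still proves the port is reachable, while a timeout points to a
       # connectivity issue.
       curl -k --connect-timeout 5 https://192.0.2.10:2379
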
State Propagation
-----------------

 #. Run ``cilium-dbg node list`` in one of the Cilium pods and validate that it
    lists both local nodes and nodes from remote clusters. If remote nodes are
    not present, validate that Cilium agents (or KVStoreMesh, if enabled)
    are correctly connected to the given remote cluster. Additionally, verify
    that the initial nodes synchronization from all clusters has completed.

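    For example, to run the command through ``kubectl`` without opening a shell
    in the pod:

    .. code-block:: shell-session

       kubectl -n kube-system exec ds/cilium -c cilium-agent -- cilium-dbg node list
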
 #. Validate the connectivity health matrix across clusters by running
    ``cilium-health status`` inside any Cilium pod. It will list the status of
    the connectivity health check to each remote node. If this fails, make sure
    that the network allows the health checking traffic as specified in the
    :ref:`firewall_requirements` section.

 #. Validate that identities are synchronized correctly by running ``cilium-dbg
    identity list`` in one of the Cilium pods. It must list identities from all
    clusters. You can determine what cluster an identity belongs to by looking
    at the label ``io.cilium.k8s.policy.cluster``. If remote identities are
    not present, validate that Cilium agents (or KVStoreMesh, if enabled)
    are correctly connected to the given remote cluster. Additionally, verify
    that the initial identities synchronization from all clusters has completed.

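    For instance, the following lists only the identities carrying the cluster
    label, with the cluster name as its value:

    .. code-block:: shell-session

       kubectl -n kube-system exec ds/cilium -c cilium-agent -- \
         cilium-dbg identity list | grep io.cilium.k8s.policy.cluster
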
 #. Validate that the IP cache is synchronized correctly by running ``cilium-dbg
    bpf ipcache list`` or ``cilium-dbg map get cilium_ipcache``. The output must
    contain pod IPs from local and remote clusters. If remote IP addresses are
    not present, validate that Cilium agents (or KVStoreMesh, if enabled)
    are correctly connected to the given remote cluster. Additionally, verify
    that the initial IPs synchronization from all clusters has completed.

 #. When using global services, ensure that global services are configured with
    endpoints from all clusters. Run ``cilium-dbg service list`` in any Cilium pod
    and validate that the backend IPs consist of pod IPs from all clusters
    running relevant backends. You can further validate the correct datapath
    plumbing by running ``cilium-dbg bpf lb list`` to inspect the state of the eBPF
    maps.

    If this fails:

    * Run ``cilium-dbg debuginfo`` and look for the section ``k8s-service-cache``. In
      that section, you will find the contents of the service correlation
      cache. It will list the Kubernetes services and endpoints of the local
      cluster. It will also have a section ``externalEndpoints`` which must
      list all endpoints of remote clusters.

      ::

          #### k8s-service-cache

          (*k8s.ServiceCache)(0xc00000c500)({
          [...]
           services: (map[k8s.ServiceID]*k8s.Service) (len=2) {
             (k8s.ServiceID) default/kubernetes: (*k8s.Service)(0xc000cd11d0)(frontend:172.20.0.1/ports=[https]/selector=map[]),
             (k8s.ServiceID) kube-system/kube-dns: (*k8s.Service)(0xc000cd1220)(frontend:172.20.0.10/ports=[metrics dns dns-tcp]/selector=map[k8s-app:kube-dns])
           },
           endpoints: (map[k8s.ServiceID]*k8s.Endpoints) (len=2) {
             (k8s.ServiceID) kube-system/kube-dns: (*k8s.Endpoints)(0xc0000103c0)(10.16.127.105:53/TCP,10.16.127.105:53/UDP,10.16.127.105:9153/TCP),
             (k8s.ServiceID) default/kubernetes: (*k8s.Endpoints)(0xc0000103f8)(192.168.60.11:6443/TCP)
           },
           externalEndpoints: (map[k8s.ServiceID]k8s.externalEndpoints) {
           }
          })

      The sections ``services`` and ``endpoints`` represent the services of the
      local cluster; the section ``externalEndpoints`` lists all remote
      services and will be correlated with services matching the same
      ``ServiceID``.
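
      You can extract just this section from the debug output, for example:

      .. code-block:: shell-session

         kubectl -n kube-system exec ds/cilium -c cilium-agent -- \
           cilium-dbg debuginfo | grep -A 20 "k8s-service-cache"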