.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bgp_control_plane_operation:

BGP Control Plane Operation Guide
#################################

This document provides guidance on how to operate the BGP Control Plane.

Logs
====

BGP Control Plane logs can be found in the Cilium agent logs. The logs
are tagged with ``subsys=bgp-control-plane``. You can use this tag to filter
the logs as in the following example:

.. code-block:: shell-session

   kubectl -n kube-system logs <cilium agent pod name> | grep "subsys=bgp-control-plane"

Metrics
=======

Metrics exposed by BGP Control Plane are listed in the :ref:`metrics document
<metrics_bgp_control_plane>`.
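
For a quick ad-hoc check, you can also read these metrics directly from an
agent's Prometheus endpoint. The following is a minimal sketch that assumes
agent metrics are enabled and exposed on the default port ``9962`` and that
``curl`` is available inside the agent container; refer to the document above
for the exact metric names.

.. code-block:: shell-session

   kubectl -n kube-system exec <cilium agent pod name> -c cilium-agent -- \
       curl -s http://localhost:9962/metrics | grep "cilium_bgp_"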

.. _bgp_control_plane_agent_restart:

Restarting an Agent
===================

When you restart the Cilium agent, the BGP session will be lost because the BGP
speaker is integrated within the Cilium agent. The session is re-established
once the Cilium agent comes back up. However, while the agent is down, the
advertised routes are removed from the BGP peer. As a result, you may
temporarily lose connectivity to the Pods or Services. You can enable
:ref:`Graceful Restart <bgp_control_plane_graceful_restart>` to continue
forwarding traffic to the Pods or Services during the agent restart.
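
Graceful Restart is configured per neighbor in the ``CiliumBGPPeeringPolicy``.
The following snippet is a minimal sketch; the node selector label, ASNs, and
peer address are placeholders to be replaced with your own values.

.. code-block:: yaml

   apiVersion: cilium.io/v2alpha1
   kind: CiliumBGPPeeringPolicy
   metadata:
     name: bgp-peering-policy
   spec:
     nodeSelector:
       matchLabels:
         enable-bgp: "true"
     virtualRouters:
     - localASN: 65001
       exportPodCIDR: true
       neighbors:
       - peerAddress: "10.0.0.1/32"
         peerASN: 65000
         gracefulRestart:
           enabled: true
           restartTimeSeconds: 120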

Upgrading or Downgrading Cilium
===============================

When you upgrade or downgrade Cilium, you must restart the Cilium agent. For
more details about the agent restart, see the
:ref:`bgp_control_plane_agent_restart` section.

Note that with the BGP Control Plane, it's especially important to pre-pull the
agent image by following the :ref:`preflight process <pre_flight>` before
upgrading Cilium. Image pulls are time-consuming and error-prone because they
involve network communication. If the image pull takes too long, it may exceed
the Graceful Restart time (``restartTimeSeconds``) and cause the BGP peer to
withdraw routes.
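
As a rough sketch, the preflight check that pre-pulls the target agent image
can be deployed with Helm as shown below; the release name and target version
are placeholders, and the :ref:`preflight process <pre_flight>` documentation
is the authoritative reference.

.. code-block:: shell-session

   helm install cilium-preflight cilium/cilium --version <target-version> \
     --namespace kube-system \
     --set preflight.enabled=true \
     --set agent=false \
     --set operator.enabled=false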

.. _bgp_control_plane_node_shutdown:

Shutting Down a Node
====================

When you need to shut down a node for maintenance, you can follow the steps
below to avoid packet loss as much as possible.

1. Drain the node to evict all workloads. This will remove all Pods on the node
   from the Service endpoints and prevent Services with
   ``externalTrafficPolicy=Cluster`` from redirecting traffic to the node.

   .. code-block:: bash

      kubectl drain <node-name> --ignore-daemonsets

2. Deconfigure the BGP sessions by modifying or removing the
   CiliumBGPPeeringPolicy node selector label on the Node object. This will
   shut down all BGP sessions on the node (see the example after these steps
   for a way to verify this from the Cilium side).

   .. code-block:: bash

      # Assuming you select the node by the label enable-bgp=true
      kubectl label node <node-name> --overwrite enable-bgp=false

3. Wait until the BGP peer removes the routes towards the node. During this
   period, the BGP peer may still send traffic to the node. If you shut down
   the node without waiting for the BGP peer to remove the routes, you will
   break ongoing traffic of ``externalTrafficPolicy=Cluster`` Services.

4. Shut down the node.
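
To verify that step 2 actually shut down the sessions, you can check the
peering status reported by the Cilium agent on that node. This is a minimal
sketch assuming the agent's ``cilium-dbg bgp peers`` command is available, as
in recent Cilium releases; once the node no longer matches the policy, its
sessions should disappear from the output.

.. code-block:: shell-session

   kubectl -n kube-system exec <cilium agent pod name> -c cilium-agent -- cilium-dbg bgp peers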

In step 3, you may not be able to check the peer status and may instead want
to wait for a fixed period of time. In that case, you can roughly estimate the
required time as follows:

* If the BGP Graceful Restart feature is disabled, the BGP peer should withdraw
  routes immediately after step 2.

* If the BGP Graceful Restart feature is enabled, there are two possible cases.

  * If the BGP peer supports Graceful Restart with Notification
    (:rfc:`8538`), it will withdraw routes after the Stale Timer (defined in
    :rfc:`8538#section-4.1`) expires.

  * If the BGP peer does not support Graceful Restart with Notification, it
    will withdraw routes immediately after step 2 because the BGP Control Plane
    sends a BGP Notification to the peer when you unselect the node.

The above estimates are theoretical values, and the actual time always depends
on the BGP peer's implementation. Ideally, you should check the peer router's
actual behavior in advance with your network administrator.

.. warning::

   Even if you follow the above steps, some ongoing Service traffic originally
   destined for the node may be reset because, after the route withdrawal and
   ECMP rehashing, the traffic is redirected to a different node, and the new
   node may select a different endpoint.

Failure Scenarios
=================

This section describes common failure scenarios that you may encounter when
using the BGP Control Plane and provides guidance on how to mitigate them.

Cilium Agent Down
-----------------

If the Cilium agent goes down, the BGP session will be lost because the BGP
speaker is integrated within the Cilium agent. The session is re-established
once the Cilium agent is restarted. However, while the agent is down, the
advertised routes are removed from the BGP peer. As a result, you may
temporarily lose connectivity to the Pods or Services.

Mitigation
~~~~~~~~~~

The recommended way to address this issue is by enabling the
:ref:`bgp_control_plane_graceful_restart` feature. This feature allows the BGP
peer to retain routes for a specific period of time after the BGP session is
lost. Since the datapath remains active even when the agent is down, this
prevents the loss of connectivity to the Pods or Services.

When you can't use BGP Graceful Restart, you can take the following actions,
depending on the kind of routes you are using:

PodCIDR routes
++++++++++++++

If you are advertising PodCIDR routes, pods on the failed node will be
unreachable from the external network. If the failure only occurs on a subset
of the nodes in the cluster, you can drain the unhealthy nodes to migrate the
pods to other nodes.

Service routes
++++++++++++++

If you are advertising service routes, the load balancer (KubeProxy or Cilium
KubeProxyReplacement) may become unreachable from the external network.
Additionally, ongoing connections may be redirected to different nodes due to
ECMP rehashing on the upstream routers. When the load balancer encounters
unknown traffic, it will select a new endpoint. Depending on the load
balancer's backend selection algorithm, the traffic may be directed to a
different endpoint than before, potentially causing the connection to be reset.

If your upstream routers support ECMP with `Resilient Hashing`_, enabling it
may help keep ongoing connections forwarded to the same node. Enabling the
:ref:`maglev` feature in Cilium may also help, since it increases the
probability that all nodes select the same endpoint for the same flow.
However, this only works for Services with ``externalTrafficPolicy: Cluster``.
If the Service's ``externalTrafficPolicy`` is set to ``Local``, ongoing
connections to endpoints on the failed node, as well as connections that are
rehashed to a different node than before, will inevitably be reset.
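
As a sketch, Maglev backend selection can be enabled through Helm roughly as
follows; the release name is a placeholder and the :ref:`maglev` documentation
is the authoritative reference for the related options.

.. code-block:: shell-session

   helm upgrade cilium cilium/cilium --namespace kube-system \
     --reuse-values \
     --set loadBalancer.algorithm=maglev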

.. _Resilient Hashing: https://www.juniper.net/documentation/us/en/software/junos/interfaces-ethernet-switches/topics/topic-map/switches-interface-resilient-hashing.html

Node Down
---------

If a node goes down, the BGP sessions from this node will be lost. Depending
on the Graceful Restart settings, the peer either withdraws the routes
advertised by the node immediately or takes some time to stop forwarding
traffic to the node. The latter case is problematic when you advertise a route
to a Service with ``externalTrafficPolicy=Cluster``, because the peer will
continue to forward traffic to the unavailable node until the restart timer
(which is 120s by default) expires.

Mitigation
~~~~~~~~~~

Involuntary Shutdown
++++++++++++++++++++

When a node is involuntarily shut down, there's no direct mitigation. You can
choose not to use the BGP Graceful Restart feature, depending on the trade-off
between failure detection time and the stability that Graceful Restart provides
when the Cilium pod restarts.

Disabling Graceful Restart allows the BGP peer to withdraw routes faster. Even
if the node is shut down without a BGP Notification or a TCP connection close,
the worst-case time for the peer to withdraw routes is the BGP hold time. When
Graceful Restart is enabled, the BGP peer may need the hold time plus the
restart time to withdraw the routes received from the node.
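
For example, with the default hold time of 90 seconds and the default
``restartTimeSeconds`` of 120 seconds, a peer may keep forwarding traffic to
the failed node for up to roughly 90 seconds without Graceful Restart, or up to
roughly 90 + 120 = 210 seconds with it. These numbers are only illustrative;
the actual behavior depends on your configuration and the peer's implementation.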

Voluntary Shutdown
++++++++++++++++++

When you voluntarily shut down a node, you can follow the steps described in
the :ref:`bgp_control_plane_node_shutdown` section to avoid packet loss as much
as possible.

Peering Link Down
-----------------

If the peering link between the BGP peers goes down, usually both the BGP
session and datapath connectivity will be lost. However, there may be a period
during which the datapath connectivity is lost while the BGP session remains up
and routes are still being advertised. This can cause the BGP peer to send
traffic over the failed link, resulting in dropped packets. The length of this
period depends on which link is down and the BGP configuration.

If the link directly connected to the node goes down, the BGP session will
likely be lost immediately because the Linux kernel detects the link failure
and shuts down the TCP session right away. If a link not directly connected to
the node goes down, the BGP session will be lost after the hold timer expires,
which is set to 90 seconds by default.

Mitigation
~~~~~~~~~~

To detect link failures faster, you can adjust ``holdTimeSeconds`` and
``keepAliveTimeSeconds`` in the BGP configuration to shorter values. However,
the minimum allowed values are ``holdTimeSeconds=3`` and
``keepAliveTimeSeconds=1``. The general approach to making failure detection
faster is to use BFD (Bidirectional Forwarding Detection), but Cilium does not
currently support it.
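
These timers are configured per neighbor in the ``CiliumBGPPeeringPolicy``. The
following is a minimal sketch of the relevant part of the ``spec``, using the
shortest allowed values mentioned above; the ASNs and peer address are
placeholders.

.. code-block:: yaml

   virtualRouters:
   - localASN: 65001
     neighbors:
     - peerAddress: "10.0.0.1/32"
       peerASN: 65000
       holdTimeSeconds: 3
       keepAliveTimeSeconds: 1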

Cilium Operator Down
--------------------

If the Cilium Operator goes down, PodCIDR allocation by IPAM and LoadBalancer
IP allocation by LB-IPAM stop. As a result, new PodCIDR and Service VIP routes
are no longer advertised, and stale ones are no longer withdrawn.

Mitigation
~~~~~~~~~~

There's no direct mitigation in terms of BGP. However, running the Cilium
Operator with a :ref:`high-availability setup <cilium_operator_internals>` will
make the Cilium Operator more resilient to failures.
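
As a sketch, the number of operator replicas can be increased through Helm
roughly as follows; the release name is a placeholder and the linked operator
documentation is the authoritative reference.

.. code-block:: shell-session

   helm upgrade cilium cilium/cilium --namespace kube-system \
     --reuse-values \
     --set operator.replicas=2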

Service Losing All Backends
---------------------------

If all Service backends are gone due to an outage or a configuration mistake,
the BGP Control Plane behaves differently depending on the Service's
``externalTrafficPolicy``. When the ``externalTrafficPolicy`` is set to
``Cluster``, the Service's VIP remains advertised from all nodes selected by
the CiliumBGPPeeringPolicy. When the ``externalTrafficPolicy`` is set to
``Local``, the advertisement stops entirely because the Service's VIP is only
advertised from the nodes where the Service backends are running.

Mitigation
~~~~~~~~~~

There's no direct mitigation in terms of BGP. In general, you should prevent
all Service backends from disappearing at once by using Kubernetes features
such as PodDisruptionBudget.
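
As a sketch, a PodDisruptionBudget that keeps at least one backend available
during voluntary disruptions could look like the following; the name and label
selector are placeholders and must match your Service's backend Pods.

.. code-block:: yaml

   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: my-service-pdb
   spec:
     minAvailable: 1
     selector:
       matchLabels:
         app: my-service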