.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bgp_control_plane_operation:

BGP Control Plane Operation Guide
#################################

This document provides guidance on how to operate the BGP Control Plane.

Logs
====

BGP Control Plane logs can be found in the Cilium agent logs. The logs
are tagged with ``subsys=bgp-control-plane``. You can use this tag to filter
the logs as in the following example:

.. code-block:: shell-session

   kubectl -n kube-system logs <cilium agent pod name> | grep "subsys=bgp-control-plane"

Metrics
=======

Metrics exposed by the BGP Control Plane are listed in the :ref:`metrics document
<metrics_bgp_control_plane>`.

.. _bgp_control_plane_agent_restart:

Restarting an Agent
===================

When you restart the Cilium agent, the BGP session will be lost because the BGP
speaker is integrated within the Cilium agent. The BGP session will be restored
once the Cilium agent comes back up. However, while the Cilium agent is down,
the advertised routes will be removed from the BGP peer. As a result, you may
temporarily lose connectivity to the Pods or Services. You can enable
:ref:`Graceful Restart <bgp_control_plane_graceful_restart>` to continue
forwarding traffic to the Pods or Services during the agent restart.

Upgrading or Downgrading Cilium
===============================

When you upgrade or downgrade Cilium, you must restart the Cilium agent. For
more details about the agent restart, see the
:ref:`bgp_control_plane_agent_restart` section.

Note that with the BGP Control Plane, it's especially important to pre-pull the
agent image by following the :ref:`preflight process <pre_flight>` before
upgrading Cilium. Image pull is time-consuming and error-prone because it
involves network communication. If the image pull takes too long, it may exceed
the Graceful Restart time (``restartTimeSeconds``) and cause the BGP peer to
withdraw routes.

.. _bgp_control_plane_node_shutdown:

Shutting Down a Node
====================

When you need to shut down a node for maintenance, you can follow the steps
below to avoid packet loss as much as possible.

1. Drain the node to evict all workloads. This removes all Pods on the node
   from the Service endpoints and prevents Services with
   ``externalTrafficPolicy=Cluster`` from redirecting traffic to the node.

   .. code-block:: bash

      kubectl drain <node-name> --ignore-daemonsets

2. Deconfigure the BGP sessions by modifying or removing the
   CiliumBGPPeeringPolicy node selector label on the Node object. This shuts
   down all BGP sessions on the node. (You can verify this from the Cilium
   side, as shown in the sketch after this list.)

   .. code-block:: bash

      # Assuming you select the node by the label enable-bgp=true
      kubectl label node <node-name> --overwrite enable-bgp=false

3. Wait until the BGP peer removes its routes towards the node. During this
   period, the BGP peer may still send traffic to the node. If you shut down
   the node without waiting for the BGP peer to remove the routes, you will
   break the ongoing traffic of ``externalTrafficPolicy=Cluster`` Services.

4. Shut down the node.
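After step 2, you can confirm from the Cilium side that the node's BGP sessions
have actually been torn down before you start waiting in step 3. The following
is a minimal sketch, assuming the Cilium CLI is installed; command availability
and output format can vary between Cilium versions:

.. code-block:: shell-session

   # Cluster-wide view of the BGP peering status reported by every agent
   cilium bgp peers

   # Alternatively, query the agent on the node directly
   kubectl -n kube-system exec <cilium agent pod name> -c cilium-agent -- cilium-dbg bgp peers

Once the node no longer reports an established session, the remaining wait in
step 3 only concerns the peer's side of the withdrawal.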
In step 3, you may not be able to check the peer status and may want to wait
for a specific period of time without checking the actual peer status. In this
case, you can roughly estimate the time as follows:

* If you disable the BGP Graceful Restart feature, the BGP peer should withdraw
  routes immediately after step 2.

* If you enable the BGP Graceful Restart feature, there are two possible cases.

  * If the BGP peer supports Graceful Restart with Notification
    (:rfc:`8538`), it will withdraw routes after the Stale Timer (defined in
    :rfc:`8538#section-4.1`) expires.

  * If the BGP peer does not support Graceful Restart with Notification, it
    will withdraw routes immediately after step 2, because the BGP Control
    Plane sends a BGP Notification to the peer when you unselect the node.

The above estimation is theoretical, and the actual time always depends
on the BGP peer's implementation. Ideally, you should check the peer router's
actual behavior in advance with your network administrator.

.. warning::

   Even if you follow the above steps, some ongoing Service traffic originally
   destined for the node may be reset because, after the route withdrawal and
   ECMP rehashing, the traffic is redirected to a different node, and the new
   node may select a different endpoint.

Failure Scenarios
=================

This section describes common failure scenarios that you may encounter when
using the BGP Control Plane and provides guidance on how to mitigate them.

Cilium Agent Down
-----------------

If the Cilium agent goes down, the BGP session will be lost because the BGP
speaker is integrated within the Cilium agent. The BGP session will be restored
once the Cilium agent is restarted. However, while the Cilium agent is down,
the advertised routes will be removed from the BGP peer. As a result, you may
temporarily lose connectivity to the Pods or Services.

Mitigation
~~~~~~~~~~

The recommended way to address this issue is by enabling the
:ref:`bgp_control_plane_graceful_restart` feature. This feature allows the BGP
peer to retain routes for a specific period of time after the BGP session is
lost. Since the datapath remains active even when the agent is down, this
prevents the loss of connectivity to the Pods or Services.

When you can't use BGP Graceful Restart, you can take the following actions,
depending on the kind of routes you are using:

PodCIDR routes
++++++++++++++

If you are advertising PodCIDR routes, Pods on the failed node will be
unreachable from the external network. If the failure only occurs on a subset
of the nodes in the cluster, you can drain the unhealthy nodes to migrate the
Pods to other nodes.

Service routes
++++++++++++++

If you are advertising Service routes, the load balancer (KubeProxy or Cilium
KubeProxyReplacement) may become unreachable from the external network.
Additionally, ongoing connections may be redirected to different nodes due to
ECMP rehashing on the upstream routers. When the load balancer encounters
unknown traffic, it will select a new endpoint. Depending on the load
balancer's backend selection algorithm, the traffic may be directed to a
different endpoint than before, potentially causing the connection to be reset.
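Cilium's :ref:`maglev` consistent hashing, discussed in the next paragraph,
reduces the chance of such resets by making it more likely that every node
selects the same backend for a given flow. The following is a minimal sketch of
enabling it through Helm values; it assumes kube-proxy replacement is in use,
that the installation is managed with Helm, and that the option names below
match your Cilium Helm chart version:

.. code-block:: shell-session

   helm upgrade cilium cilium/cilium --namespace kube-system \
      --reuse-values \
      --set kubeProxyReplacement=true \
      --set loadBalancer.algorithm=maglev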
If your upstream routers support ECMP with `Resilient Hashing`_, enabling it
may help to keep the ongoing connections forwarded to the same node. Enabling
the :ref:`maglev` feature in Cilium may also help, since it increases the
probability that all nodes select the same endpoint for the same flow. However,
it only works for Services with ``externalTrafficPolicy: Cluster``. If the
Service's ``externalTrafficPolicy`` is set to ``Local``, it is inevitable that
all ongoing connections with the endpoints on the failed node, as well as
connections forwarded to a different node than before, will be reset.

.. _Resilient Hashing: https://www.juniper.net/documentation/us/en/software/junos/interfaces-ethernet-switches/topics/topic-map/switches-interface-resilient-hashing.html

Node Down
---------

If a node goes down, the BGP sessions from this node will be lost. Depending on
the Graceful Restart settings, the peer either withdraws the routes advertised
by the node immediately or takes some time to stop forwarding traffic to the
node. The latter case is problematic when you advertise the route to a Service
with ``externalTrafficPolicy=Cluster``, because the peer will continue to
forward traffic to the unavailable node until the restart timer (which is 120s
by default) expires.

Mitigation
~~~~~~~~~~

Involuntary Shutdown
++++++++++++++++++++

When a node is involuntarily shut down, there's no direct mitigation. You can
choose not to use the BGP Graceful Restart feature, weighing faster failure
detection against the stability that Graceful Restart provides across Cilium
pod restarts.

Disabling Graceful Restart allows the BGP peer to withdraw routes faster. Even
if the node is shut down without a BGP Notification or a TCP connection close,
the worst-case time for the peer to withdraw routes is the BGP hold time. When
Graceful Restart is enabled, the BGP peer may need the hold time plus the
restart time to withdraw routes received from the node.

Voluntary Shutdown
++++++++++++++++++

When you voluntarily shut down a node, you can follow the steps described in
the :ref:`bgp_control_plane_node_shutdown` section to avoid packet loss as much
as possible.

Peering Link Down
-----------------

If the peering link between the BGP peers goes down, usually both the BGP
session and datapath connectivity will be lost. However, there may be a period
during which the datapath connectivity is lost while the BGP session remains up
and routes are still being advertised. This can cause the BGP peer to send
traffic over the failed link, resulting in dropped packets. The length of this
period depends on which link is down and on the BGP configuration.

If the link directly connected to the Node goes down, the BGP session will
likely be lost immediately because the Linux kernel detects the link failure
and shuts down the TCP session right away. If a link not directly connected to
the Node goes down, the BGP session will be lost after the hold timer expires,
which is set to 90 seconds by default.

Mitigation
~~~~~~~~~~

To detect a link failure faster, you can adjust ``holdTimeSeconds`` and
``keepAliveTimeSeconds`` in the BGP configuration to shorter values. However,
the minimum possible values are ``holdTimeSeconds=3`` and
``keepAliveTimeSeconds=1``.
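For reference, these timers are configured per neighbor. The following is a
minimal ``CiliumBGPPeeringPolicy`` sketch with shortened timers; the policy
name, node selector label, ASNs, and peer address are placeholder values that
you would adapt to your environment:

.. code-block:: yaml

   apiVersion: cilium.io/v2alpha1
   kind: CiliumBGPPeeringPolicy
   metadata:
     name: tor-peering               # example name
   spec:
     nodeSelector:
       matchLabels:
         enable-bgp: "true"          # example label, as in the node shutdown steps above
     virtualRouters:
     - localASN: 65001               # example local ASN
       exportPodCIDR: true
       neighbors:
       - peerAddress: "10.0.0.1/32"  # example peer address
         peerASN: 64512              # example peer ASN
         holdTimeSeconds: 9          # shorter than the 90-second default
         keepAliveTimeSeconds: 3     # typically one third of the hold time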
The general approach to making failure detection faster is to use BFD
(Bidirectional Forwarding Detection), but currently, Cilium does not support
it.

Cilium Operator Down
--------------------

If the Cilium Operator goes down, PodCIDR allocation by IPAM and LoadBalancer
IP allocation by LB-IPAM stop. As a result, the advertisement of new PodCIDR
and Service VIP routes, and the withdrawal of old ones, stop as well.

Mitigation
~~~~~~~~~~

There's no direct mitigation in terms of BGP. However, running the Cilium
Operator with a :ref:`high-availability setup <cilium_operator_internals>` will
make the Cilium Operator more resilient to failures.

Service Losing All Backends
---------------------------

If all Service backends are gone due to an outage or a configuration mistake,
the BGP Control Plane behaves differently depending on the Service's
``externalTrafficPolicy``. When the ``externalTrafficPolicy`` is set to
``Cluster``, the Service's VIP remains advertised from all nodes selected by
the CiliumBGPPeeringPolicy. When the ``externalTrafficPolicy`` is set to
``Local``, the advertisement stops entirely because the Service's VIP is only
advertised from the nodes where the Service backends are running.

Mitigation
~~~~~~~~~~

There's no direct mitigation in terms of BGP. In general, you should prevent
all Service backends from disappearing at once by using Kubernetes features
such as PodDisruptionBudget.
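As a reference, the following is a minimal PodDisruptionBudget sketch that
keeps at least one backend of a hypothetical ``app: my-service`` workload
available during voluntary disruptions such as node drains; the names and the
``minAvailable`` value are examples only:

.. code-block:: yaml

   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: my-service-pdb        # example name
   spec:
     minAvailable: 1             # keep at least one backend during voluntary disruptions
     selector:
       matchLabels:
         app: my-service         # example label selecting the Service's backend Pods

Note that a PodDisruptionBudget only protects against voluntary disruptions
such as drains and evictions; it does not help when all backends are lost at
once due to an outage.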