.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bgp_control_plane_operation:

BGP Control Plane Operation Guide
#################################

This document provides guidance on how to operate the BGP Control Plane.

Logs
====

BGP Control Plane logs can be found in the Cilium agent logs. The logs
are tagged with ``subsys=bgp-control-plane``. You can use this tag to filter
the logs as in the following example:

.. code-block:: shell-session

   kubectl -n kube-system logs <cilium agent pod name> | grep "subsys=bgp-control-plane"

Metrics
=======

Metrics exposed by BGP Control Plane are listed in the :ref:`metrics document
<metrics_bgp_control_plane>`.
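
For a quick ad-hoc check, you can also read these metrics directly from an
agent's Prometheus endpoint. The following is a minimal sketch that assumes
agent metrics are enabled and exposed on the default port ``9962`` and that
``curl`` is available inside the agent container; refer to the document above
for the exact metric names.

.. code-block:: shell-session

   kubectl -n kube-system exec <cilium agent pod name> -c cilium-agent -- \
       curl -s http://localhost:9962/metrics | grep "cilium_bgp_"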

.. _bgp_control_plane_agent_restart:

Restarting an Agent
===================

When you restart the Cilium agent, the BGP session will be lost because the BGP
speaker is integrated within the Cilium agent. The session is re-established
once the Cilium agent comes back up. However, while the agent is down, the
advertised routes are removed from the BGP peer. As a result, you may
temporarily lose connectivity to the Pods or Services. You can enable
:ref:`Graceful Restart <bgp_control_plane_graceful_restart>` to continue
forwarding traffic to the Pods or Services during the agent restart.
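
Graceful Restart is configured per neighbor in the ``CiliumBGPPeeringPolicy``.
The following snippet is a minimal sketch; the node selector label, ASNs, and
peer address are placeholders to be replaced with your own values.

.. code-block:: yaml

   apiVersion: cilium.io/v2alpha1
   kind: CiliumBGPPeeringPolicy
   metadata:
     name: bgp-peering-policy
   spec:
     nodeSelector:
       matchLabels:
         enable-bgp: "true"
     virtualRouters:
     - localASN: 65001
       exportPodCIDR: true
       neighbors:
       - peerAddress: "10.0.0.1/32"
         peerASN: 65000
         gracefulRestart:
           enabled: true
           restartTimeSeconds: 120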

Upgrading or Downgrading Cilium
===============================

When you upgrade or downgrade Cilium, you must restart the Cilium agent. For
more details about the agent restart, see the
:ref:`bgp_control_plane_agent_restart` section.

Note that with the BGP Control Plane, it's especially important to pre-pull the
agent image by following the :ref:`preflight process <pre_flight>` before
upgrading Cilium. Image pulls are time-consuming and error-prone because they
involve network communication. If the image pull takes too long, it may exceed
the Graceful Restart time (``restartTimeSeconds``) and cause the BGP peer to
withdraw routes.
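
As a rough sketch, the preflight check that pre-pulls the target agent image
can be deployed with Helm as shown below; the release name and target version
are placeholders, and the :ref:`preflight process <pre_flight>` documentation
is the authoritative reference.

.. code-block:: shell-session

   helm install cilium-preflight cilium/cilium --version <target-version> \
     --namespace kube-system \
     --set preflight.enabled=true \
     --set agent=false \
     --set operator.enabled=false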

.. _bgp_control_plane_node_shutdown:

Shutting Down a Node
====================

When you need to shut down a node for maintenance, you can follow the steps
below to avoid packet loss as much as possible.

1. Drain the node to evict all workloads. This will remove all Pods on the node
   from the Service endpoints and prevent Services with
   ``externalTrafficPolicy=Cluster`` from redirecting traffic to the node.

   .. code-block:: bash

      kubectl drain <node-name> --ignore-daemonsets

2. Deconfigure the BGP sessions by modifying or removing the
   CiliumBGPPeeringPolicy node selector label on the Node object. This will
   shut down all BGP sessions on the node (see the example after these steps
   for a way to verify this from the Cilium side).

   .. code-block:: bash

      # Assuming you select the node by the label enable-bgp=true
      kubectl label node <node-name> --overwrite enable-bgp=false

3. Wait until the BGP peer removes the routes towards the node. During this
   period, the BGP peer may still send traffic to the node. If you shut down
   the node without waiting for the BGP peer to remove the routes, you will
   break ongoing traffic of ``externalTrafficPolicy=Cluster`` Services.

4. Shut down the node.
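
To verify that step 2 actually shut down the sessions, you can check the
peering status reported by the Cilium agent on that node. This is a minimal
sketch assuming the agent's ``cilium-dbg bgp peers`` command is available, as
in recent Cilium releases; once the node no longer matches the policy, its
sessions should disappear from the output.

.. code-block:: shell-session

   kubectl -n kube-system exec <cilium agent pod name> -c cilium-agent -- cilium-dbg bgp peers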

In step 3, you may not be able to check the peer status and may instead want
to wait for a fixed period of time. In that case, you can roughly estimate the
required time as follows:

* If the BGP Graceful Restart feature is disabled, the BGP peer should withdraw
  routes immediately after step 2.

* If the BGP Graceful Restart feature is enabled, there are two possible cases.

  * If the BGP peer supports Graceful Restart with Notification
    (:rfc:`8538`), it will withdraw routes after the Stale Timer (defined in
    :rfc:`8538#section-4.1`) expires.

  * If the BGP peer does not support Graceful Restart with Notification, it
    will withdraw routes immediately after step 2 because the BGP Control Plane
    sends a BGP Notification to the peer when you unselect the node.

The above estimates are theoretical values, and the actual time always depends
on the BGP peer's implementation. Ideally, you should check the peer router's
actual behavior in advance with your network administrator.

.. warning::

   Even if you follow the above steps, some ongoing Service traffic originally
   destined for the node may be reset because, after the route withdrawal and
   ECMP rehashing, the traffic is redirected to a different node, and the new
   node may select a different endpoint.

Failure Scenarios
=================

This section describes common failure scenarios that you may encounter when
using the BGP Control Plane and provides guidance on how to mitigate them.

Cilium Agent Down
-----------------

If the Cilium agent goes down, the BGP session will be lost because the BGP
speaker is integrated within the Cilium agent. The session is re-established
once the Cilium agent is restarted. However, while the agent is down, the
advertised routes are removed from the BGP peer. As a result, you may
temporarily lose connectivity to the Pods or Services.

Mitigation
~~~~~~~~~~

The recommended way to address this issue is by enabling the
:ref:`bgp_control_plane_graceful_restart` feature. This feature allows the BGP
peer to retain routes for a specific period of time after the BGP session is
lost. Since the datapath remains active even when the agent is down, this
prevents the loss of connectivity to the Pods or Services.

When you can't use BGP Graceful Restart, you can take the following actions,
depending on the kind of routes you are using:

PodCIDR routes
++++++++++++++

If you are advertising PodCIDR routes, pods on the failed node will be
unreachable from the external network. If the failure only occurs on a subset
of the nodes in the cluster, you can drain the unhealthy nodes to migrate the
pods to other nodes.

Service routes
++++++++++++++

If you are advertising service routes, the load balancer (KubeProxy or Cilium
KubeProxyReplacement) may become unreachable from the external network.
Additionally, ongoing connections may be redirected to different nodes due to
ECMP rehashing on the upstream routers. When the load balancer encounters
unknown traffic, it will select a new endpoint. Depending on the load
balancer's backend selection algorithm, the traffic may be directed to a
different endpoint than before, potentially causing the connection to be reset.

If your upstream routers support ECMP with `Resilient Hashing`_, enabling it
may help keep ongoing connections forwarded to the same node. Enabling the
:ref:`maglev` feature in Cilium may also help, since it increases the
probability that all nodes select the same endpoint for the same flow.
However, this only works for Services with ``externalTrafficPolicy: Cluster``.
If the Service's ``externalTrafficPolicy`` is set to ``Local``, ongoing
connections to endpoints on the failed node, as well as connections that are
rehashed to a different node than before, will inevitably be reset.
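
As a sketch, Maglev backend selection can be enabled through Helm roughly as
follows; the release name is a placeholder and the :ref:`maglev` documentation
is the authoritative reference for the related options.

.. code-block:: shell-session

   helm upgrade cilium cilium/cilium --namespace kube-system \
     --reuse-values \
     --set loadBalancer.algorithm=maglev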

.. _Resilient Hashing: https://www.juniper.net/documentation/us/en/software/junos/interfaces-ethernet-switches/topics/topic-map/switches-interface-resilient-hashing.html

Node Down
---------

If a node goes down, the BGP sessions from this node will be lost. Depending
on the Graceful Restart settings, the peer either withdraws the routes
advertised by the node immediately or takes some time to stop forwarding
traffic to the node. The latter case is problematic when you advertise a route
to a Service with ``externalTrafficPolicy=Cluster``, because the peer will
continue to forward traffic to the unavailable node until the restart timer
(which is 120s by default) expires.

Mitigation
~~~~~~~~~~

Involuntary Shutdown
++++++++++++++++++++

When a node is involuntarily shut down, there's no direct mitigation. You can
choose not to use the BGP Graceful Restart feature, depending on the trade-off
between failure detection time and the stability that Graceful Restart provides
when the Cilium pod restarts.

Disabling Graceful Restart allows the BGP peer to withdraw routes faster. Even
if the node is shut down without a BGP Notification or a TCP connection close,
the worst-case time for the peer to withdraw routes is the BGP hold time. When
Graceful Restart is enabled, the BGP peer may need the hold time plus the
restart time to withdraw the routes received from the node.
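
For example, with the default hold time of 90 seconds and the default
``restartTimeSeconds`` of 120 seconds, a peer may keep forwarding traffic to
the failed node for up to roughly 90 seconds without Graceful Restart, or up to
roughly 90 + 120 = 210 seconds with it. These numbers are only illustrative;
the actual behavior depends on your configuration and the peer's implementation.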

Voluntary Shutdown
++++++++++++++++++

When you voluntarily shut down a node, you can follow the steps described in
the :ref:`bgp_control_plane_node_shutdown` section to avoid packet loss as much
as possible.

Peering Link Down
-----------------

If the peering link between the BGP peers goes down, usually both the BGP
session and datapath connectivity will be lost. However, there may be a period
during which the datapath connectivity is lost while the BGP session remains up
and routes are still being advertised. This can cause the BGP peer to send
traffic over the failed link, resulting in dropped packets. The length of this
period depends on which link is down and the BGP configuration.

If the link directly connected to the node goes down, the BGP session will
likely be lost immediately because the Linux kernel detects the link failure
and shuts down the TCP session right away. If a link not directly connected to
the node goes down, the BGP session will be lost after the hold timer expires,
which is set to 90 seconds by default.

Mitigation
~~~~~~~~~~

To detect link failures faster, you can adjust ``holdTimeSeconds`` and
``keepAliveTimeSeconds`` in the BGP configuration to shorter values. However,
the minimum allowed values are ``holdTimeSeconds=3`` and
``keepAliveTimeSeconds=1``. The general approach to making failure detection
faster is to use BFD (Bidirectional Forwarding Detection), but Cilium does not
currently support it.
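
These timers are configured per neighbor in the ``CiliumBGPPeeringPolicy``. The
following is a minimal sketch of the relevant part of the ``spec``, using the
shortest allowed values mentioned above; the ASNs and peer address are
placeholders.

.. code-block:: yaml

   virtualRouters:
   - localASN: 65001
     neighbors:
     - peerAddress: "10.0.0.1/32"
       peerASN: 65000
       holdTimeSeconds: 3
       keepAliveTimeSeconds: 1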

Cilium Operator Down
--------------------

If the Cilium Operator goes down, PodCIDR allocation by IPAM and LoadBalancer
IP allocation by LB-IPAM stop. As a result, new PodCIDR and Service VIP routes
are no longer advertised, and stale ones are no longer withdrawn.

Mitigation
~~~~~~~~~~

There's no direct mitigation in terms of BGP. However, running the Cilium
Operator with a :ref:`high-availability setup <cilium_operator_internals>` will
make the Cilium Operator more resilient to failures.
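
As a sketch, the number of operator replicas can be increased through Helm
roughly as follows; the release name is a placeholder and the linked operator
documentation is the authoritative reference.

.. code-block:: shell-session

   helm upgrade cilium cilium/cilium --namespace kube-system \
     --reuse-values \
     --set operator.replicas=2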

Service Losing All Backends
---------------------------

If all Service backends are gone due to an outage or a configuration mistake,
the BGP Control Plane behaves differently depending on the Service's
``externalTrafficPolicy``. When the ``externalTrafficPolicy`` is set to
``Cluster``, the Service's VIP remains advertised from all nodes selected by
the CiliumBGPPeeringPolicy. When the ``externalTrafficPolicy`` is set to
``Local``, the advertisement stops entirely because the Service's VIP is only
advertised from the nodes where the Service backends are running.

Mitigation
~~~~~~~~~~

There's no direct mitigation in terms of BGP. In general, you should prevent
all Service backends from disappearing at once by using Kubernetes features
such as PodDisruptionBudget.
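
As a sketch, a PodDisruptionBudget that keeps at least one backend available
during voluntary disruptions could look like the following; the name and label
selector are placeholders and must match your Service's backend Pods.

.. code-block:: yaml

   apiVersion: policy/v1
   kind: PodDisruptionBudget
   metadata:
     name: my-service-pdb
   spec:
     minAvailable: 1
     selector:
       matchLabels:
         app: my-service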