.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    http://docs.cilium.io

.. _arch_guide:

############
Architecture
############

This document describes the Cilium architecture. It focuses on the BPF
datapath hooks used to implement the Cilium datapath, on how the Cilium
datapath integrates with the container orchestration layer, and on the
objects shared between the layers, e.g. the BPF datapath and the Cilium
agent.

Datapath
========

The Linux kernel supports a set of BPF hooks in the networking stack
that can be used to run BPF programs. The Cilium datapath uses these
hooks to load BPF programs that, when used together, create higher level
networking constructs.

The following is a list of the hooks used by Cilium and a brief
description of each. For more thorough documentation on the specifics of
each hook, see :ref:`bpf_guide`.

* **XDP:** The XDP BPF hook is at the earliest possible point in the
  networking driver and triggers a run of the BPF program upon packet
  reception. This achieves the best possible packet processing performance
  since the program runs directly on the packet data before any other
  processing can happen. This hook is ideal for running filtering programs
  that drop malicious or unexpected traffic, and for other common DDoS
  protection mechanisms.

* **Traffic Control Ingress/Egress:** BPF programs attached to the traffic
  control (tc) ingress hook are attached to a networking interface, same as
  XDP, but run after the networking stack has done initial processing of
  the packet. The hook runs before the L3 layer of the stack but has access
  to most of the metadata associated with a packet. This is ideal for doing
  local node processing, such as applying L3/L4 endpoint policy and
  redirecting traffic to endpoints. For network-facing devices the tc
  ingress hook can be coupled with the XDP hook above. When this is done it
  is reasonable to assume that the majority of the traffic at this point is
  legitimate and destined for the host.

  Containers typically use a virtual device called a veth pair, which acts
  as a virtual wire connecting the container to the host. By attaching to
  the tc ingress hook of the host side of this veth pair, Cilium can
  monitor and enforce policy on all traffic exiting a container. By
  attaching a BPF program to the veth pair associated with each container,
  and routing all network traffic to the host side virtual devices with
  another BPF program attached to the tc ingress hook as well, Cilium can
  monitor and enforce policy on all traffic entering or exiting the node.

  Depending on the use case, containers may also be connected through
  ipvlan devices instead of a veth pair. In this mode, the physical device
  in the host is the ipvlan master and virtual ipvlan devices in slave mode
  are set up inside the container. One of the benefits of ipvlan over a
  veth pair is that the stack requires fewer resources to push the packet
  into the ipvlan slave device of the other network namespace and may
  therefore achieve better latency. This option can be used for
  unprivileged containers. The BPF programs for tc are then attached to the
  tc egress hook on the ipvlan slave device inside the container's network
  namespace so that Cilium can, for example, apply L3/L4 endpoint policy,
  combined with another BPF program running on the tc ingress hook of the
  ipvlan master so that incoming traffic on the node can be enforced as
  well.

* **Socket operations:** The socket operations hook is attached to a
  specific cgroup and runs on TCP events. Cilium attaches a BPF socket
  operations program to the root cgroup and uses this to monitor for TCP
  state transitions, specifically ESTABLISHED state transitions. When a
  socket transitions into ESTABLISHED state, if the TCP socket has a node
  local peer (possibly a local proxy), a socket send/recv program is
  attached.

* **Socket send/recv:** The socket send/recv hook runs on every send
  operation performed by a TCP socket. At this point the hook can inspect
  the message and either drop the message, send the message to the TCP
  layer, or redirect the message to another socket. Cilium uses this to
  accelerate the datapath redirects as described below; a minimal sketch of
  this hook pair follows this list.
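As a rough illustration of how these last two hooks cooperate, the sketch
below shows a socket operations program that inserts sockets into a sockhash
once they reach ESTABLISHED state, and a send/recv program that redirects
messages directly to the peer socket found in that sockhash. This is only a
minimal sketch, not Cilium's implementation; the ``sock_key`` layout, map
name and sizes are illustrative assumptions.

.. code-block:: c

    /* Minimal sketch of a socket operations + socket send/recv program pair.
     * This is not Cilium's implementation; the map layout and names below
     * are illustrative only. Built as a standard libbpf-style BPF object. */
    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct sock_key {
        __u32 sip4;
        __u32 dip4;
        __u32 sport;
        __u32 dport;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_SOCKHASH);
        __uint(max_entries, 65536);
        __type(key, struct sock_key);
        __type(value, __u64);
    } sock_ops_map SEC(".maps");

    /* Socket operations: runs on TCP events for the attached cgroup. On
     * ESTABLISHED transitions, insert the socket into the sockhash. */
    SEC("sockops")
    int bpf_sockops(struct bpf_sock_ops *skops)
    {
        struct sock_key key = {
            .sip4  = skops->local_ip4,
            .dip4  = skops->remote_ip4,
            .sport = skops->local_port,
            .dport = bpf_ntohl(skops->remote_port),
        };

        switch (skops->op) {
        case BPF_SOCK_OPS_PASSIVE_ESTABLISHED_CB:
        case BPF_SOCK_OPS_ACTIVE_ESTABLISHED_CB:
            bpf_sock_hash_update(skops, &sock_ops_map, &key, BPF_ANY);
            break;
        }
        return 0;
    }

    /* Socket send/recv: runs on every sendmsg of a socket present in the
     * sockhash and redirects the message straight to the peer socket. */
    SEC("sk_msg")
    int bpf_redir(struct sk_msg_md *msg)
    {
        struct sock_key peer = {
            .sip4  = msg->remote_ip4,
            .dip4  = msg->local_ip4,
            .sport = bpf_ntohl(msg->remote_port),
            .dport = msg->local_port,
        };

        bpf_msg_redirect_hash(msg, &sock_ops_map, &peer, BPF_F_INGRESS);
        return SK_PASS;
    }

    char _license[] SEC("license") = "GPL";

In such a setup the socket operations program would be attached to a cgroup
(Cilium uses the root cgroup, as noted above) and the send/recv program to
the sockhash itself as the message verdict program.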
Combining the above hooks with virtual interfaces (cilium_host, cilium_net),
an optional overlay interface (cilium_vxlan), Linux kernel crypto support and
a userspace proxy (Envoy), Cilium creates the following networking objects:

* **Prefilter:** The prefilter object runs an XDP program and provides a set
  of prefilter rules used to filter traffic from the network for best
  performance. Specifically, a set of CIDR maps supplied by the Cilium agent
  is used to do a lookup, and the packet is either dropped, for example when
  the destination is not a valid endpoint, or allowed to be processed by the
  stack. This can easily be extended as needed to build in new prefilter
  criteria/capabilities (a sketch of this kind of XDP CIDR filter follows
  this list).

* **Endpoint Policy:** The endpoint policy object implements Cilium endpoint
  enforcement. Using a map to look up a packet's associated identity and
  policy, this layer scales well to large numbers of endpoints. Depending on
  the policy, this layer may drop the packet, forward to a local endpoint,
  forward to the service object, or forward to the L7 Policy object for
  further L7 rules. This is the primary object in the Cilium datapath
  responsible for mapping packets to identities and enforcing L3 and L4
  policies.

* **Service:** The Service object performs a map lookup on the destination
  IP and optionally the destination port for every packet received by the
  object. If a matching entry is found, the packet is forwarded to one of
  the configured L3/L4 endpoints. The Service block can be used to implement
  a standalone load balancer on any interface using the tc ingress hook, or
  it may be integrated into the endpoint policy object.

* **L3 Encryption:** On ingress the L3 Encryption object marks packets for
  decryption, passes the packets to the Linux xfrm (transform) layer for
  decryption, and after the packet is decrypted the object receives the
  packet and passes it up the stack for further processing by other objects.
  Depending on the mode, direct routing or overlay, this may be a BPF tail
  call or the Linux routing stack that passes the packet to the next object.
  The key required for decryption is encoded in the IPsec header, so on
  ingress we do not need to do a map lookup to find the decryption key.

  On egress a map lookup is first performed using the destination IP to
  determine whether a packet should be encrypted and, if so, which keys are
  available on the destination node. The most recent key available on both
  nodes is chosen and the packet is marked for encryption. The packet is
  then passed to the Linux xfrm layer where it is encrypted. Upon receiving
  the now encrypted packet, it is passed to the next layer, either by
  sending it to the Linux stack for routing or by doing a direct tail call
  if an overlay is in use.

* **Socket Layer Enforcement:** Socket layer enforcement uses two hooks, the
  socket operations hook and the socket send/recv hook, to monitor and
  attach to all TCP sockets associated with Cilium managed endpoints,
  including any L7 proxies. The socket operations hook identifies candidate
  sockets for acceleration. These include all local node connections
  (endpoint to endpoint) and any connection to a Cilium proxy. These
  identified connections then have all messages handled by the socket
  send/recv hook and are accelerated using sockmap fast redirects. The fast
  redirect ensures all policies implemented in Cilium are valid for the
  associated socket/endpoint mapping and, assuming they are, sends the
  message directly to the peer socket. This is allowed because the sockmap
  send/recv hooks ensure the message will not need to be processed by any of
  the objects above.

* **L7 Policy:** The L7 Policy object redirects proxy traffic to a Cilium
  userspace proxy instance. Cilium uses an Envoy instance as its userspace
  proxy. Envoy will then either forward the traffic or generate appropriate
  reject messages based on the configured L7 policy.
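To make the prefilter idea concrete, the sketch below shows the general shape
of such an XDP program: parse the headers, look the destination address up in
an LPM (longest-prefix match) map of allowed CIDRs, and drop everything that
does not match. It is only a sketch under the assumptions noted in the
comments, not Cilium's actual prefilter code.

.. code-block:: c

    /* Minimal sketch of an XDP prefilter in the spirit of the object above.
     * This is not Cilium's actual prefilter; the map name, key layout and
     * the IPv4-only handling are illustrative assumptions. Userspace (the
     * agent) would populate allowed_prefixes with valid endpoint CIDRs. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    struct lpm_v4_key {
        __u32 prefixlen;
        __u32 addr;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(map_flags, BPF_F_NO_PREALLOC);
        __uint(max_entries, 1024);
        __type(key, struct lpm_v4_key);
        __type(value, __u32);
    } allowed_prefixes SEC(".maps");

    SEC("xdp")
    int xdp_prefilter(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *ip;
        struct lpm_v4_key key;

        /* Bounds checks are required by the verifier before header reads. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;            /* this sketch only filters IPv4 */

        ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        key.prefixlen = 32;
        key.addr = ip->daddr;

        /* Drop as early as possible if the destination does not fall into
         * a CIDR marked as a valid endpoint prefix. */
        if (!bpf_map_lookup_elem(&allowed_prefixes, &key))
            return XDP_DROP;

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";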
These components are connected to create the flexible and efficient datapath
used by Cilium. Below we show the possible flows connecting endpoints on a
single node, ingress to an endpoint, and endpoint to an egress networking
device. In each case there is an additional diagram showing the TCP
accelerated path available when socket layer enforcement is enabled.

Endpoint to Endpoint
--------------------

First we show the local endpoint to endpoint flow with optional L7 Policy on
egress and ingress, followed by the same endpoint to endpoint flow with
socket layer enforcement enabled. With socket layer enforcement enabled for
TCP traffic, the handshake initiating the connection traverses the endpoint
policy object until the TCP state is ESTABLISHED. After the connection is
ESTABLISHED, only the L7 Policy object is still required.

.. image:: _static/cilium_bpf_endpoint.svg

Egress from Endpoint
--------------------

Next we show local endpoint to egress with an optional overlay network. In
the optional overlay network, traffic is forwarded out the Linux network
interface corresponding to the overlay. In the default case the overlay
interface is named cilium_vxlan. Similar to above, when socket layer
enforcement is enabled and an L7 proxy is in use, we can avoid running the
endpoint policy block between the endpoint and the L7 Policy for TCP
traffic. An optional L3 encryption block will encrypt the packet if enabled.

.. image:: _static/cilium_bpf_egress.svg

Ingress to Endpoint
-------------------

Finally we show ingress to a local endpoint, also with an optional overlay
network. Similar to above, socket layer enforcement can be used to avoid a
set of policy traversals between the proxy and the endpoint socket. If the
packet is encrypted upon receive, it is first decrypted and then handled
through the normal flow.

.. image:: _static/cilium_bpf_ingress.svg
veth-based versus ipvlan-based datapath
---------------------------------------

.. note:: The ipvlan-based datapath is currently only in technology preview
          and should be used for experimentation purposes only. This
          restriction will be lifted in future Cilium releases.

By default Cilium CNI operates in veth-based datapath mode, which allows for
more flexibility in that all BPF programs are managed by Cilium out of the
host network namespace. Containers can therefore be granted privileges for
their namespaces, such as CAP_NET_ADMIN, without affecting security, since
the BPF enforcement points in the host are unreachable for the container.
Given that BPF programs are attached from the host's network namespace, BPF
also has the ability to take over and efficiently manage most of the
forwarding logic between local containers and the host, since there is
always a networking device reachable. However, this also comes at a latency
cost: in veth-based mode the network stack internally needs to be
re-traversed when handing the packet from one veth device to its peer device
in the other network namespace. This egress-to-ingress switch needs to be
done twice when communicating between local Cilium endpoints, and once for
packets that are arriving at or sent out of the host.

For a more latency-optimized datapath, Cilium CNI also supports ipvlan
L3/L3S mode with a number of restrictions. In order to support older kernels
without ipvlan's hairpin mode, Cilium attaches BPF programs at the tc egress
layer on the ipvlan slave device inside the container's network namespace,
which means that this datapath mode can only be used for containers which
are not running with CAP_NET_ADMIN and CAP_NET_RAW privileges! ipvlan uses
an internal forwarding logic for direct slave-to-slave or slave-to-master
redirection, and therefore forwarding to devices is not performed from the
BPF program itself. The network namespace switching is more efficient in
ipvlan mode since the stack does not need to be re-traversed as in the
veth-based datapath case for external packets. The host-to-container network
namespace switch happens directly at the L3 layer without having to queue
and reschedule the packet for later ingress processing. In the case of
communication among local endpoints, the egress-to-ingress switch is
performed once instead of twice.

For Cilium in ipvlan mode there are a number of additional restrictions in
the current implementation which are to be addressed in upcoming work: NAT64
cannot be enabled at this point, nor can L7 policy enforcement via proxy.
Service load-balancing to local endpoints is currently not enabled, nor is
container to host-local communication. If one of these features is needed,
then the default veth-based datapath mode is recommended instead.

The ipvlan mode in Cilium's CNI can be enabled by running the Cilium daemon
with e.g. ``--datapath-mode=ipvlan --ipvlan-master-device=bond0``, where the
latter typically specifies the physical networking device which then also
acts as the ipvlan master device. Note that if the ipvlan datapath mode is
deployed in L3S mode with Kubernetes, make sure to run a stable kernel with
the following ipvlan fix included: `d5256083f62e <https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d5256083f62e2720f75bb3c5a928a0afe47d6bc3>`_.

This completes the datapath overview. More BPF specifics can be found in the
:ref:`bpf_guide`. Additional details on how to extend the L7 Policy exist in
the :ref:`envoy` section.

Scale
=====

BPF Map Limitations
-------------------

All BPF maps are created with upper capacity limits. Insertion beyond the
limit fails and thus limits the scalability of the datapath. The following
table shows the default values of the maps. Each limit can be bumped in the
source code. Configuration options will be added on request if demand
arises.

======================== ================ =============== =====================================================
Map Name                 Scope            Default Limit   Scale Implications
======================== ================ =============== =====================================================
Connection Tracking      node or endpoint 1M TCP/256K UDP Max 1M concurrent TCP connections, max 256K expected UDP answers
Endpoints                node             64k             Max 64k local endpoints + host IPs per node
IP cache                 node             512K            Max 256K endpoints (IPv4+IPv6), max 512K endpoints (IPv4 or IPv6) across all clusters
Load Balancer            node             64k             Max 64k cumulative backends across all services across all clusters
Policy                   endpoint         16k             Max 16k allowed identity + port + protocol pairs for a specific endpoint
Proxy Map                node             512k            Max 512k concurrent redirected TCP connections to proxy
Tunnel                   node             64k             Max 32k nodes (IPv4+IPv6) or 64k nodes (IPv4 or IPv6) across all clusters
======================== ================ =============== =====================================================
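As an illustration of what these limits mean at the BPF level, the sketch
below declares a hash map whose capacity is fixed at creation time via
``max_entries`` and handles a failed insert; bumping a limit in the source
code amounts to changing this value. The map layout and names here are
illustrative assumptions, not Cilium's actual map definitions.

.. code-block:: c

    /* Illustrative only: shows how a BPF map's capacity is fixed via
     * max_entries and that the program has to handle a failed insert.
     * The key/value layouts and names are not Cilium's definitions. */
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    struct ct_key {
        __u32 saddr;
        __u32 daddr;
        __u16 sport;
        __u16 dport;
        __u8  proto;
        __u8  pad[3];
    };

    struct ct_entry {
        __u64 packets;
        __u64 bytes;
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1048576);   /* upper capacity limit, e.g. 1M */
        __type(key, struct ct_key);
        __type(value, struct ct_entry);
    } ct_map_example SEC(".maps");

    SEC("tc")
    int handle_packet(struct __sk_buff *skb)
    {
        struct ct_key key = {};         /* would be filled from the packet */
        struct ct_entry entry = {};

        /* Once the map holds max_entries elements, inserting a new entry
         * fails, and the program must decide how to react, e.g. drop. */
        if (bpf_map_update_elem(&ct_map_example, &key, &entry, BPF_ANY) < 0)
            return TC_ACT_SHOT;

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";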
Kubernetes Integration
======================

The following diagram shows the integration of the iptables rules as
installed by kube-proxy and the iptables rules as installed by Cilium.

.. image:: _static/kubernetes_iptables.svg