.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    http://docs.cilium.io

.. _arch_guide:

############
Architecture
############

This document describes the Cilium architecture. It focuses on
documenting the BPF datapath hooks used to implement the Cilium datapath, how
the Cilium datapath integrates with the container orchestration layer, and the
objects shared between the layers, e.g. between the BPF datapath and the Cilium agent.

Datapath
========

The Linux kernel supports a set of BPF hooks in the networking stack
that can be used to run BPF programs. The Cilium datapath uses these
hooks to load BPF programs that, when used together, create higher-level
networking constructs.

The following is a list of the hooks used by Cilium and a brief
description of each. For more thorough documentation on the specifics of each
hook, see :ref:`bpf_guide`.

* **XDP:** The XDP BPF hook is at the earliest point possible in the networking driver
  and triggers a run of the BPF program upon packet reception. This
  achieves the best possible packet processing performance since the
  program runs directly on the packet data before any other processing
  can happen. This hook is ideal for running filtering programs that
  drop malicious or unexpected traffic, and other common DDoS protection
  mechanisms.

* **Traffic Control Ingress/Egress:** BPF programs attached to the traffic
  control (tc) ingress hook are, like XDP programs, attached to a networking
  interface, but run after the networking stack has done initial processing
  of the packet. The hook runs before the L3 layer of the stack but has
  access to most of the metadata associated with a packet. This is ideal
  for doing local node processing, such as applying L3/L4 endpoint policy
  and redirecting traffic to endpoints. For network-facing devices the
  tc ingress hook can be coupled with the XDP hook above. When this is done it
  is reasonable to assume that the majority of the traffic at this
  point is legitimate and destined for the host. A minimal sketch of a tc
  ingress program is shown after this list.

  Containers typically use a virtual device called a veth pair, which acts
  as a virtual wire connecting the container to the host. By attaching to
  the tc ingress hook of the host side of this veth pair, Cilium can monitor
  and enforce policy on all traffic exiting a container. By attaching a BPF
  program to the veth pair associated with each container, and routing all
  network traffic to the host side virtual devices with another BPF program
  attached to the tc ingress hook as well, Cilium can monitor and enforce
  policy on all traffic entering or exiting the node.

  Depending on the use case, containers may also be connected through ipvlan
  devices instead of a veth pair. In this mode, the physical device in the
  host is the ipvlan master, and virtual ipvlan devices in slave mode are
  set up inside the container. One of the benefits of ipvlan over a veth pair
  is that the stack requires fewer resources to push the packet into the
  ipvlan slave device of the other network namespace and therefore may
  achieve better latency results. This option can be used for unprivileged
  containers. The BPF programs for tc are then attached to the tc egress
  hook on the ipvlan slave device inside the container's network namespace
  in order to have Cilium apply L3/L4 endpoint policy, for example, combined
  with another BPF program running on the tc ingress hook of the ipvlan master
  so that incoming traffic on the node can also be enforced.

* **Socket operations:** The socket operations hook is attached to a specific
  cgroup and runs on TCP events. Cilium attaches a BPF socket operations
  program to the root cgroup and uses this to monitor for TCP state transitions,
  specifically for ESTABLISHED state transitions. When a socket transitions
  into the ESTABLISHED state and the TCP socket has a node-local peer
  (possibly a local proxy), a socket send/recv program is attached.

* **Socket send/recv:** The socket send/recv hook runs on every send operation
  performed by a TCP socket. At this point the hook can inspect the message
  and either drop the message, send the message to the TCP layer, or redirect
  the message to another socket. Cilium uses this to accelerate the datapath
  redirects as described below.
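
The tc ingress pattern referenced above can be illustrated with a short,
self-contained sketch. This is not Cilium's actual datapath: the ``endpoints``
map, the ``endpoint_info`` structure and the overall logic are hypothetical
stand-ins that only demonstrate the idea of looking up a packet's destination
and either redirecting it to a local endpoint device or handing it back to the
stack. It assumes a libbpf-style loader for the BTF map definition.

.. code-block:: c

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch of a tc ingress program: redirect packets destined to a local
     * endpoint directly to that endpoint's interface, pass everything else. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    struct endpoint_info {
        __u32 ifindex;   /* host-side device of the endpoint */
        __u32 identity;  /* security identity, unused in this sketch */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 65536);
        __type(key, __u32);                 /* destination IPv4 address */
        __type(value, struct endpoint_info);
    } endpoints SEC(".maps");

    SEC("tc")
    int tc_ingress(struct __sk_buff *skb)
    {
        void *data = (void *)(long)skb->data;
        void *data_end = (void *)(long)skb->data_end;
        struct ethhdr *eth = data;
        struct iphdr *ip;
        struct endpoint_info *ep;
        __u32 daddr;

        if ((void *)(eth + 1) > data_end)
            return TC_ACT_OK;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return TC_ACT_OK;

        ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return TC_ACT_OK;

        daddr = ip->daddr;
        ep = bpf_map_lookup_elem(&endpoints, &daddr);
        if (ep)
            /* Local endpoint: deliver to its device's ingress queue. */
            return bpf_redirect(ep->ifindex, BPF_F_INGRESS);

        return TC_ACT_OK; /* not local, let the stack continue processing */
    }

    char _license[] SEC("license") = "GPL";

Such a sketch would be compiled with clang for the BPF target and attached to a
device's tc ingress hook, for example with libbpf or a recent ``tc`` tool.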

Combining the above hooks with virtual interfaces (cilium_host, cilium_net),
an optional overlay interface (cilium_vxlan), Linux kernel crypto support and
a userspace proxy (Envoy), Cilium creates the following networking objects.

* **Prefilter:** The prefilter object runs an XDP program and
  provides a set of prefilter rules used to filter traffic from the network for best
  performance. Specifically, a set of CIDR maps supplied by the Cilium agent are used
  to do a lookup and the packet is either dropped, for example when the destination is
  not a valid endpoint, or allowed to be processed by the stack. This can easily be
  extended as needed to build in new prefilter criteria/capabilities. A minimal sketch
  of an XDP filter program is shown after this list.

* **Endpoint Policy:** The endpoint policy object implements the Cilium endpoint enforcement.
  Using a map to look up a packet's associated identity and policy, this layer
  scales well to large numbers of endpoints. Depending on the policy this layer may drop the
  packet, forward to a local endpoint, forward to the service object or forward to the
  L7 Policy object for further L7 rules. This is the primary object in the Cilium
  datapath responsible for mapping packets to identities and enforcing L3 and L4 policies.

* **Service:** The Service object performs a map lookup on the destination IP
  and optionally destination port for every packet received by the object.
  If a matching entry is found, the packet will be forwarded to one of the
  configured L3/L4 endpoints. The Service block can be used to implement a
  standalone load balancer on any interface using the tc ingress hook or may
  be integrated in the endpoint policy object.

* **L3 Encryption:** On ingress the L3 Encryption object marks packets for
  decryption, passes the packets to the Linux xfrm (transform) layer for
  decryption, and after the packet is decrypted receives it back and passes
  it up the stack for further processing by other objects. Depending
  on the mode, direct routing or overlay, this may be a BPF tail call or the
  Linux routing stack that passes the packet to the next object. The key required
  for decryption is encoded in the IPsec header, so on ingress we do not need to
  do a map lookup to find the decryption key.

  On egress a map lookup is first performed using the destination IP to determine
  if a packet should be encrypted and, if so, what keys are available on the destination
  node. The most recent key available on both nodes is chosen and the
  packet is marked for encryption. The packet is then passed to the Linux
  xfrm layer where it is encrypted. Upon receiving the now encrypted packet
  it is passed to the next layer, either by sending it to the Linux stack for
  routing or doing a direct tail call if an overlay is in use.

* **Socket Layer Enforcement:** Socket layer enforcement uses two
  hooks, the socket operations hook and the socket send/recv hook, to monitor
  and attach to all TCP sockets associated with Cilium managed endpoints, including
  any L7 proxies. The socket operations hook
  identifies candidate sockets for acceleration. These include all local node connections
  (endpoint to endpoint) and any connection to a Cilium proxy.
  These identified connections will then have all messages handled by the socket
  send/recv hook and will be accelerated using sockmap fast redirects. The fast
  redirect ensures all policies implemented in Cilium are valid for the associated
  socket/endpoint mapping and, assuming they are, sends the message directly to the
  peer socket. This is allowed because the sockmap send/recv hooks ensure the message
  will not need to be processed by any of the objects above.

* **L7 Policy:** The L7 Policy object redirects proxy traffic to a Cilium userspace
  proxy instance. Cilium uses an Envoy instance as its userspace proxy. Envoy will
  then either forward the traffic or generate appropriate reject messages based on the configured L7 policy.
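
The prefilter pattern described above can be illustrated with a short,
self-contained XDP program. This is not Cilium's actual prefilter; the
``blocklist`` LPM map and its contents are hypothetical and only demonstrate
the idea of consulting a CIDR map at the earliest possible point and dropping
packets before the stack spends any further work on them.

.. code-block:: c

    // SPDX-License-Identifier: GPL-2.0
    /* Sketch of an XDP filter: drop IPv4 packets whose source address falls
     * into a blocklisted prefix, pass everything else to the stack. */
    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <bpf/bpf_endian.h>
    #include <bpf/bpf_helpers.h>

    struct lpm_v4_key {
        __u32 prefixlen;   /* number of significant bits in addr */
        __u32 addr;        /* IPv4 address, network byte order */
    };

    struct {
        __uint(type, BPF_MAP_TYPE_LPM_TRIE);
        __uint(max_entries, 1024);
        __type(key, struct lpm_v4_key);
        __type(value, __u8);
        __uint(map_flags, BPF_F_NO_PREALLOC);
    } blocklist SEC(".maps");

    SEC("xdp")
    int xdp_prefilter(struct xdp_md *ctx)
    {
        void *data = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *ip;
        struct lpm_v4_key key;

        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        ip = (void *)(eth + 1);
        if ((void *)(ip + 1) > data_end)
            return XDP_PASS;

        key.prefixlen = 32;
        key.addr = ip->saddr;
        if (bpf_map_lookup_elem(&blocklist, &key))
            return XDP_DROP;   /* source matches a blocklisted prefix */

        return XDP_PASS;       /* hand the packet to the networking stack */
    }

    char _license[] SEC("license") = "GPL";

Dropping at this hook is cheap because the program runs before the kernel
allocates socket buffer (sk_buff) metadata for the packet, which is what makes
XDP attractive for DDoS mitigation.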

These components are connected to create the flexible and efficient datapath used
by Cilium. Below we show the possible flows connecting endpoints on a single
node, ingress to an endpoint, and endpoint to egress networking device. In each case
there is an additional diagram showing the TCP accelerated path available when socket
layer enforcement is enabled.

Endpoint to Endpoint
--------------------
First we show the local endpoint to endpoint flow with optional L7 Policy on
egress and ingress, followed by the same endpoint to endpoint flow with
socket layer enforcement enabled. With socket layer enforcement enabled for TCP
traffic, the handshake initiating the connection will traverse the endpoint policy
object until the TCP state is ESTABLISHED. After the connection is ESTABLISHED,
only the L7 Policy object is still required.

.. image:: _static/cilium_bpf_endpoint.svg

Egress from Endpoint
--------------------

Next we show the local endpoint to egress flow with an optional overlay network.
When the overlay network is in use, traffic is forwarded out the Linux network
interface corresponding to the overlay. In the default case the overlay interface is
named cilium_vxlan. Similar to above, when socket layer enforcement is enabled
and an L7 proxy is in use, we can avoid running the endpoint policy block between
the endpoint and the L7 Policy for TCP traffic. An optional L3 encryption block
will encrypt the packet if enabled.

.. image:: _static/cilium_bpf_egress.svg

Ingress to Endpoint
-------------------

Finally we show ingress to a local endpoint, also with an optional overlay network.
Similar to above, socket layer enforcement can be used to avoid a set of
policy traversals between the proxy and the endpoint socket. If the packet
is encrypted upon receipt, it is first decrypted and then handled through
the normal flow.

.. image:: _static/cilium_bpf_ingress.svg

veth-based versus ipvlan-based datapath
---------------------------------------

.. note:: The ipvlan-based datapath is currently only in technology preview
          and should only be used for experimentation purposes. This restriction
          will be lifted in future Cilium releases.

By default, the Cilium CNI operates in veth-based datapath mode, which allows for
more flexibility: all BPF programs are managed by Cilium out of the host
network namespace, so containers can be granted privileges for their
namespaces, such as CAP_NET_ADMIN, without affecting security, since the BPF
enforcement points in the host are unreachable from the container. Because the
BPF programs are attached from the host's network namespace, BPF also has the
ability to take over and efficiently manage most of the forwarding logic between
local containers and the host, since there is always a networking device reachable.
However, this also comes at a latency cost: in veth-based mode the network stack
internally needs to be re-traversed when handing the packet from one veth device
to its peer device in the other network namespace. This egress-to-ingress switch
needs to be done twice when communicating between local Cilium endpoints, and once
for packets that are arriving at or sent out of the host.

For a more latency-optimized datapath, the Cilium CNI also supports ipvlan L3/L3S mode
with a number of restrictions. In order to support older kernels without ipvlan's
hairpin mode, Cilium attaches the BPF programs at the ipvlan slave device inside
the container's network namespace on the tc egress layer, which means that
this datapath mode can only be used for containers which are not running with
CAP_NET_ADMIN and CAP_NET_RAW privileges! ipvlan uses an internal forwarding
logic for direct slave-to-slave or slave-to-master redirection, and therefore
forwarding to devices is not performed from the BPF program itself. The network
namespace switching is more efficient in ipvlan mode since the stack does not
need to be re-traversed as in the veth-based datapath case for external packets.
The host-to-container network namespace switch happens directly at the L3 layer
without having to queue and reschedule the packet for later ingress processing.
In the case of communication among local endpoints, the egress-to-ingress switch
is performed once instead of twice.

For Cilium in ipvlan mode there are a number of additional restrictions in
the current implementation which are to be addressed in upcoming work: NAT64
cannot be enabled at this point, nor can L7 policy enforcement via proxy.
Service load-balancing to local endpoints is currently not enabled either, nor
is container to host-local communication. If any of these features are needed,
then the default veth-based datapath mode is recommended instead.

The ipvlan mode in Cilium's CNI can be enabled by running the Cilium daemon
with e.g. ``--datapath-mode=ipvlan --ipvlan-master-device=bond0``, where the
latter typically specifies the physical networking device which then also acts
as the ipvlan master device. Note that if the ipvlan datapath mode is deployed
in L3S mode with Kubernetes, make sure to run a stable kernel with the
following ipvlan fix included: `d5256083f62e <https://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git/commit/?id=d5256083f62e2720f75bb3c5a928a0afe47d6bc3>`_.

This completes the datapath overview. More BPF specifics can be found in the
:ref:`bpf_guide`. Additional details on how to extend the L7 Policy
exist in the :ref:`envoy` section.

Scale
=====

BPF Map Limitations
-------------------

All BPF maps are created with upper capacity limits. Insertion beyond the limit
will fail, thus limiting the scalability of the datapath. The following table
shows the default values of the maps. Each limit can be bumped in the source
code. Configuration options will be added on request if demand arises. A small
sketch of what such a limit means in practice is shown after the table.

======================== ================ =============== =====================================================
Map Name                 Scope            Default Limit   Scale Implications
======================== ================ =============== =====================================================
Connection Tracking      node or endpoint 1M TCP/256K UDP Max 1M concurrent TCP connections, max 256K expected UDP answers
Endpoints                node             64k             Max 64k local endpoints + host IPs per node
IP cache                 node             512K            Max 256K endpoints (IPv4+IPv6), max 512k endpoints (IPv4 or IPv6) across all clusters
Load Balancer            node             64k             Max 64k cumulative backends across all services across all clusters
Policy                   endpoint         16k             Max 16k allowed identity + port + protocol pairs for specific endpoint
Proxy Map                node             512k            Max 512k concurrent redirected TCP connections to proxy
Tunnel                   node             64k             Max 32k nodes (IPv4+IPv6) or 64k nodes (IPv4 or IPv6) across all clusters
======================== ================ =============== =====================================================
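
The effect of these limits can be demonstrated with a short user-space sketch
using a recent libbpf. The map below is a hypothetical stand-in, not one of
Cilium's maps; it only shows that once ``max_entries`` is reached, inserting a
new key fails (for hash maps typically with ``E2BIG``), which is why the limits
above bound how far the datapath scales.

.. code-block:: c

    // SPDX-License-Identifier: GPL-2.0
    /* Create a hash map capped at 4 entries and insert until it fails. */
    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <linux/types.h>
    #include <bpf/bpf.h>

    int main(void)
    {
        /* Hypothetical map with a deliberately tiny capacity limit. */
        int fd = bpf_map_create(BPF_MAP_TYPE_HASH, "demo_map",
                                sizeof(__u32), sizeof(__u64), 4, NULL);
        if (fd < 0) {
            perror("bpf_map_create");
            return 1;
        }

        for (__u32 key = 0; key < 5; key++) {
            __u64 value = key;
            if (bpf_map_update_elem(fd, &key, &value, BPF_NOEXIST) != 0) {
                /* The fifth insertion exceeds max_entries and is rejected. */
                printf("insert of key %u failed: %s\n", key, strerror(errno));
                break;
            }
        }
        return 0;
    }

In Cilium the corresponding insertions into its datapath maps fail in the same
way once a limit from the table is reached, which is why each limit can be
raised in the source code when a deployment needs more capacity.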

Kubernetes Integration
======================

The following diagram shows the integration of iptables rules as installed by
kube-proxy and the iptables rules as installed by Cilium.

.. image:: _static/kubernetes_iptables.svg