.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bpf_program:

Program Types
=============

At the time of this writing, there are eighteen different BPF program types
available. Two of the main types for networking, XDP BPF programs and tc BPF
programs, are further explained in the subsections below. Extensive usage
examples for the two program types with LLVM, iproute2 or other tools are
spread throughout the toolchain section and are not covered here. Instead,
this section focuses on their architecture, concepts and use cases.

XDP
---

XDP stands for eXpress Data Path and provides a framework for BPF that enables
high-performance programmable packet processing in the Linux kernel. It runs
the BPF program at the earliest possible point in software, namely at the moment
the network driver receives the packet.

At this point in the fast-path the driver has just picked up the packet from its
receive rings, without having done any expensive operations such as allocating
an ``skb`` for pushing the packet further up the networking stack, and without
having pushed the packet into the GRO engine. Thus, the XDP BPF program is
executed at the earliest point when the packet becomes available to the CPU
for processing.

XDP works in concert with the Linux kernel and its infrastructure, meaning
the kernel is not bypassed as in various networking frameworks that operate
in user space only. Keeping the packet in kernel space has several major
advantages:

* XDP is able to reuse all the upstream developed kernel networking drivers,
  user space tooling, and other available in-kernel infrastructure such
  as routing tables or sockets from its BPF helper calls.
* Residing in kernel space, XDP has the same security model as the rest of
  the kernel for accessing hardware.
* There is no need for crossing kernel / user space boundaries since the
  processed packet already resides in the kernel and can therefore be flexibly
  forwarded into other in-kernel entities like namespaces used by
  containers or the kernel's networking stack itself. This is particularly
  relevant in times of Meltdown and Spectre.
* Punting packets from XDP to the kernel's robust, widely used and efficient
  TCP/IP stack is trivially possible and allows for full reuse; there is no
  need to maintain a separate TCP/IP stack as with user space frameworks.
* The use of BPF allows for full programmability while keeping a stable ABI
  with the same 'never-break-user-space' guarantees as the kernel's system
  call ABI. Compared to kernel modules, it also provides safety measures
  thanks to the BPF verifier which ensures the stability of the kernel's
  operation.
* XDP trivially allows for atomically swapping programs at runtime without
  any interruption of network traffic or kernel / system reboot.
* XDP allows for flexible structuring of workloads integrated into
  the kernel. For example, it can operate in "busy polling" or "interrupt
  driven" mode. Explicitly dedicating CPUs to XDP is not required. There
  are no special hardware requirements and it does not rely on hugepages.
* XDP does not require any third party kernel modules or licensing. It is
  a long-term architectural solution, a core part of the Linux kernel, and
  developed by the kernel community.
* XDP is already enabled and shipped everywhere with major distributions
  running a kernel equivalent to 4.8 or higher and supports most major 10G
  or higher networking drivers.

As a framework for running BPF in the driver, XDP additionally ensures that
packets are laid out linearly and fit into a single DMA'ed page which is
readable and writable by the BPF program. XDP also ensures that additional
headroom of 256 bytes is available to the program for implementing custom
encapsulation headers with the help of the ``bpf_xdp_adjust_head()`` BPF helper
or adding custom metadata in front of the packet through ``bpf_xdp_adjust_meta()``.

The framework contains XDP action codes, further described in the section
below, which a BPF program can return in order to instruct the driver how
to proceed with the packet, and it makes it possible to atomically replace
BPF programs running at the XDP layer. XDP is tailored for high performance
by design. BPF allows the packet data to be accessed through 'direct packet
access', which means that the program holds data pointers directly in
registers, loads packet content into registers and writes from registers
back into the packet.

The packet representation in XDP that is passed to the BPF program as
the BPF context looks as follows:

.. code-block:: c

    struct xdp_buff {
        void *data;
        void *data_end;
        void *data_meta;
        void *data_hard_start;
        struct xdp_rxq_info *rxq;
    };

``data`` points to the start of the packet data in the page, and as the
name suggests, ``data_end`` points to the end of the packet data. Since XDP
allows for a headroom, ``data_hard_start`` points to the maximum possible
headroom start in the page, meaning, when the packet should be encapsulated,
``data`` is moved closer towards ``data_hard_start`` via ``bpf_xdp_adjust_head()``.
The same BPF helper function also allows for decapsulation, in which case
``data`` is moved further away from ``data_hard_start``.

``data_meta`` initially points to the same location as ``data``, but
``bpf_xdp_adjust_meta()`` is able to move this pointer towards ``data_hard_start``
as well in order to provide room for custom metadata which is invisible to
the normal kernel networking stack but can be read by tc BPF programs since
it is transferred from XDP to the ``skb``. Vice versa, the same helper can
remove or reduce the size of the custom metadata by moving ``data_meta``
away from ``data_hard_start`` again. ``data_meta`` can also be used solely
for passing state between tail calls, similar to the ``skb->cb[]`` control
block that is accessible in tc BPF programs.

This gives the following relation, or invariant, for the ``struct xdp_buff``
packet pointers: ``data_hard_start`` <= ``data_meta`` <= ``data`` < ``data_end``.

The ``rxq`` field points to additional per receive queue metadata which
is populated at ring setup time (not at XDP runtime):

.. code-block:: c

    struct xdp_rxq_info {
        struct net_device *dev;
        u32 queue_index;
        u32 reg_state;
    } ____cacheline_aligned;

The BPF program can retrieve ``queue_index`` as well as additional data
from the netdevice itself such as ``ifindex``, etc.
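
To make the pointer handling more concrete, the following is a minimal sketch
(not taken from Cilium) of how a program reserves and fills such a metadata
area. Note that BPF programs access the buffer through the UAPI
``struct xdp_md`` context rather than ``struct xdp_buff`` directly; the
``struct meta_info`` layout and the stored value are made-up placeholders:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical metadata layout, invisible to the regular stack but
     * readable later from a tc BPF program via data_meta.
     */
    struct meta_info {
        __u32 mark;
    };

    SEC("xdp")
    int xdp_meta_example(struct xdp_md *ctx)
    {
        struct meta_info *meta;
        void *data, *data_meta;

        /* Reserve space in front of the packet data; a negative delta
         * grows the metadata area towards data_hard_start.
         */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
            return XDP_PASS;

        /* Pointers must be re-read after any adjust helper call. */
        data      = (void *)(long)ctx->data;
        data_meta = (void *)(long)ctx->data_meta;

        /* Verifier-mandated bounds check against data. */
        meta = data_meta;
        if ((void *)(meta + 1) > data)
            return XDP_PASS;

        meta->mark = 0x2a;  /* arbitrary example value */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

After a successful ``bpf_xdp_adjust_meta()`` call the invariant above still
holds, which is why the pointers are re-read and bounds checked before the
metadata is written.
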

**BPF program return codes**

After running the XDP BPF program, a verdict is returned from the program in
order to tell the driver how to process the packet next. In the ``linux/bpf.h``
system header file all available return verdicts are enumerated:

.. code-block:: c

    enum xdp_action {
        XDP_ABORTED = 0,
        XDP_DROP,
        XDP_PASS,
        XDP_TX,
        XDP_REDIRECT,
    };

``XDP_DROP`` as the name suggests will drop the packet right at the driver
level without wasting any further resources. This is in particular useful
for BPF programs implementing DDoS mitigation mechanisms or firewalling in
general. The ``XDP_PASS`` return code means that the packet is allowed to
be passed up to the kernel's networking stack, meaning the current CPU
that was processing this packet now allocates an ``skb``, populates it, and
passes it onwards into the GRO engine. This is equivalent to the default
packet handling behavior without XDP. With ``XDP_TX`` the BPF program has
an efficient option to transmit the network packet out of the same NIC it
just arrived on again. This is typically useful when a few nodes implement,
for example, firewalling with subsequent load balancing in a cluster and
thus act as a hairpinned load balancer pushing the incoming packets back
into the switch after rewriting them in XDP BPF. ``XDP_REDIRECT`` is similar
to ``XDP_TX`` in that it is able to transmit the XDP packet, but through
another NIC. Another option for the ``XDP_REDIRECT`` case is to redirect
into a BPF cpumap, meaning, the CPUs serving XDP on the NIC's receive queues
can continue to do so and push the packet for processing by the upper kernel
stack to a remote CPU. This is similar to ``XDP_PASS``, but with the ability
for the XDP BPF program to keep serving the incoming high load instead of
temporarily spending work on the current packet for pushing it into upper
layers. Last but not least, ``XDP_ABORTED`` denotes an exception-like state
from the program and has the same behavior as ``XDP_DROP``, except that
``XDP_ABORTED`` passes the ``trace_xdp_exception`` tracepoint which can be
additionally monitored to detect misbehavior.
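
As a rough illustration of how such a verdict is returned in practice, the
sketch below parses just enough of the packet to drop all UDP traffic at the
driver and pass everything else up the stack. The drop-all-UDP policy is only
an illustrative placeholder:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;

        /* Bounds checks keep the verifier happy before each access. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        /* Example policy: drop all UDP at the driver, pass the rest. */
        if (iph->protocol == IPPROTO_UDP)
            return XDP_DROP;

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";
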

**Use cases for XDP**

Some of the main use cases for XDP are presented in this subsection. The
list is non-exhaustive and given the programmability and efficiency XDP
and BPF enables, it can easily be adapted to solve very specific use
cases.

* **DDoS mitigation, firewalling**

  One of the basic XDP BPF features is to tell the driver to drop a packet
  with ``XDP_DROP`` at this early stage which allows for any kind of efficient
  network policy enforcement with an extremely low per-packet cost. This is
  ideal in situations when needing to cope with any sort of DDoS attacks, but
  it also more generally allows implementing any sort of firewalling policies
  with close to no overhead in BPF, e.g. either as a standalone appliance
  (e.g. scrubbing 'clean' traffic through ``XDP_TX``) or widely deployed on
  nodes protecting end hosts themselves (via ``XDP_PASS`` or cpumap
  ``XDP_REDIRECT`` for good traffic). Offloaded XDP takes this even one step
  further by moving the already small per-packet cost entirely into the NIC
  with processing at line-rate.

..

* **Forwarding and load-balancing**

  Another major use case of XDP is packet forwarding and load-balancing
  through either ``XDP_TX`` or ``XDP_REDIRECT`` actions. The packet can
  be arbitrarily mangled by the BPF program running in the XDP layer;
  BPF helper functions are available for increasing or decreasing
  the packet's headroom in order to arbitrarily encapsulate or
  decapsulate the packet before sending it out again. With ``XDP_TX``
  hairpinned load-balancers can be implemented that push the packet out
  of the same networking device it originally arrived on, or with the
  ``XDP_REDIRECT`` action it can be forwarded to another NIC for
  transmission. The latter return code can also be used in combination
  with BPF's cpumap to load-balance packets for passing up the local
  stack, but on remote, non-XDP processing CPUs.

..

* **Pre-stack filtering / processing**

  Besides policy enforcement, XDP can also be used for hardening the
  kernel's networking stack with the help of the ``XDP_DROP`` case, meaning
  it can drop packets that are irrelevant for the local node right at the
  earliest possible point before the networking stack sees them, e.g. given
  we know that a node only serves TCP traffic, any UDP, SCTP or other L4
  traffic can be dropped right away. This has the advantage that packets
  do not need to traverse various entities like the GRO engine, the kernel's
  flow dissector and others before it can be determined that they should be
  dropped, which reduces the kernel's attack surface. Thanks to XDP's early
  processing stage, this effectively 'pretends' to the kernel's networking
  stack that these packets have never been seen by the networking device.
  Additionally, if a potential bug in the stack's receive path got uncovered
  and would cause a 'ping of death' like scenario, XDP can be utilized to
  drop such packets right away without having to reboot the kernel or
  restart any services. Due to the ability to atomically swap such programs
  to enforce a drop of bad packets, no network traffic is even interrupted
  on a host.

  Another use case for pre-stack processing is that, given the kernel has not
  yet allocated an ``skb`` for the packet, the BPF program is free to modify
  the packet and, again, have it 'pretend' to the stack that it was received
  by the networking device this way. This allows for cases such as having
  custom packet mangling and encapsulation protocols where the packet can be
  decapsulated prior to entering GRO aggregation, where GRO otherwise would
  not be able to perform any sort of aggregation due to not being aware of
  the custom protocol. XDP also allows metadata (non-packet data) to be pushed
  in front of the packet. This is 'invisible' to the normal kernel stack, can
  be GRO aggregated (for matching metadata) and later on processed in
  coordination with a tc ingress BPF program where it has the context of
  an ``skb`` available for e.g. setting various skb fields.

..

* **Flow sampling, monitoring**

  XDP can also be used for cases such as packet monitoring, sampling or any
  other network analytics, for example, as part of an intermediate node in
  the path or on end hosts, also in combination with the previously mentioned
  use cases.
  For complex packet analysis, XDP provides a facility to efficiently push
  network packets (truncated or with full payload) and custom metadata into
  a fast lockless per-CPU memory mapped ring buffer provided by the Linux
  perf infrastructure to a user space application. This also allows for
  cases where only a flow's initial data is analyzed and, once determined
  to be good traffic, the monitoring is bypassed. Thanks to the flexibility
  brought by BPF, this allows for implementing any sort of custom monitoring
  or sampling.

..

One example of XDP BPF production usage is Facebook's SHIV and Droplet
infrastructure which implement their L4 load-balancing and DDoS countermeasures.
Migrating their production infrastructure away from netfilter's IPVS
(IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared
to their previous IPVS setup. This was first presented at the netdev 2.1
conference:

* Slides: https://netdevconf.info/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
* Video: https://youtu.be/YEU2ClcGqts

Another example is the integration of XDP into Cloudflare's DDoS mitigation
pipeline, which originally was using cBPF instead of eBPF for attack signature
matching through iptables' ``xt_bpf`` module. Due to the use of iptables this
caused severe performance problems under attack where a user space bypass
solution was deemed necessary, but it came with drawbacks as well such as
needing to busy poll the NIC and expensive packet re-injection into the
kernel's stack. The migration over to eBPF and XDP combined the best of both
worlds by having high-performance programmable packet processing directly
inside the kernel:

* Slides: https://netdevconf.info/2.1/slides/apr6/bertin_Netdev-XDP.pdf
* Video: https://youtu.be/7OuOukmuivg

**XDP operation modes**

XDP has three operation modes where 'native' XDP is the default mode. When
XDP is mentioned, this mode is typically implied.

* **Native XDP**

  This is the default mode where the XDP BPF program is run directly out
  of the networking driver's early receive path. Most widely used NICs
  for 10G and higher support native XDP already.

..

* **Offloaded XDP**

  In the offloaded XDP mode the XDP BPF program is directly offloaded into
  the NIC instead of being executed on the host CPU. Thus, the already
  extremely low per-packet cost is pushed off the host CPU entirely and
  executed on the NIC, providing even higher performance than running in
  native XDP. This offload is typically implemented by SmartNICs
  containing multi-threaded, multicore flow processors where an in-kernel
  JIT compiler translates BPF into native instructions for the latter.
  Drivers supporting offloaded XDP usually also support native XDP for
  cases where some BPF helpers are not yet available for the offload or
  are only available in native mode.

..

* **Generic XDP**

  For drivers not implementing native or offloaded XDP yet, the kernel
  provides an option for generic XDP which does not require any driver
  changes since it runs at a much later point out of the networking stack.
  This setting is primarily targeted at developers who want to write and
  test programs against the kernel's XDP API, and will not operate at the
  performance rate of the native or offloaded modes.
  For XDP usage in a production environment either the native or offloaded
  mode is better suited and the recommended way to run XDP.

.. _xdp_drivers:

**Driver support**

**Drivers supporting native XDP**

A list of drivers supporting native XDP can be found in the table below. The
corresponding network driver name of an interface can be determined as follows:

.. code-block:: shell-session

    # ethtool -i eth0
    driver: nfp
    [...]

+------------------------+-------------------------+-------------+
| Vendor                 | Driver                  | XDP Support |
+========================+=========================+=============+
| Amazon                 | ena                     | >= 5.6      |
+------------------------+-------------------------+-------------+
| Aquantia               | atlantic                | >= 5.19     |
+------------------------+-------------------------+-------------+
| Broadcom               | bnxt_en                 | >= 4.11     |
+------------------------+-------------------------+-------------+
| Cavium                 | thunderx                | >= 4.12     |
+------------------------+-------------------------+-------------+
| Engleder               | tsnep                   | >= 6.3      |
|                        | (TSN Express Path)      |             |
+------------------------+-------------------------+-------------+
| Freescale              | dpaa                    | >= 5.11     |
|                        +-------------------------+-------------+
|                        | dpaa2                   | >= 5.0      |
|                        +-------------------------+-------------+
|                        | enetc                   | >= 5.13     |
|                        +-------------------------+-------------+
|                        | fec_enet                | >= 6.2      |
+------------------------+-------------------------+-------------+
| Fungible               | fun                     | >= 5.18     |
+------------------------+-------------------------+-------------+
| Google                 | gve                     | >= 6.4      |
+------------------------+-------------------------+-------------+
| Intel                  | ice                     | >= 5.5      |
|                        +-------------------------+-------------+
|                        | igb                     | >= 5.10     |
|                        +-------------------------+-------------+
|                        | igc                     | >= 5.13     |
|                        +-------------------------+-------------+
|                        | i40e                    | >= 4.13     |
|                        +-------------------------+-------------+
|                        | ixgbe                   | >= 4.12     |
|                        +-------------------------+-------------+
|                        | ixgbevf                 | >= 4.17     |
+------------------------+-------------------------+-------------+
| Marvell                | mvneta                  | >= 5.5      |
|                        +-------------------------+-------------+
|                        | mvpp2                   | >= 5.9      |
|                        +-------------------------+-------------+
|                        | otx2                    | >= 5.16     |
+------------------------+-------------------------+-------------+
| Mediatek               | mtk                     | >= 6.0      |
+------------------------+-------------------------+-------------+
| Mellanox               | mlx4                    | >= 4.8      |
|                        +-------------------------+-------------+
|                        | mlx5                    | >= 4.9      |
+------------------------+-------------------------+-------------+
| Microchip              | lan966x                 | >= 6.2      |
+------------------------+-------------------------+-------------+
| Microsoft              | hv_netvsc (Hyper-V)     | >= 5.6      |
|                        +-------------------------+-------------+
|                        | mana                    | >= 5.17     |
+------------------------+-------------------------+-------------+
| Netronome              | nfp                     | >= 4.10     |
+------------------------+-------------------------+-------------+
| Others                 | bonding                 | >= 5.15     |
|                        +-------------------------+-------------+
|                        | netdevsim               | >= 4.16     |
|                        +-------------------------+-------------+
|                        | tun/tap                 | >= 4.14     |
|                        +-------------------------+-------------+
|                        | virtio_net              | >= 4.10     |
|                        +-------------------------+-------------+
|                        | xen-netfront            | >= 5.9      |
|                        +-------------------------+-------------+
|                        | veth                    | >= 4.19     |
+------------------------+-------------------------+-------------+
| QLogic                 | qede                    | >= 4.10     |
+------------------------+-------------------------+-------------+
| Socionext              | netsec                  | >= 5.3      |
+------------------------+-------------------------+-------------+
| Solarflare             | SFC Efx                 | >= 5.5      |
+------------------------+-------------------------+-------------+
| STMicro                | stmmac                  | >= 5.13     |
+------------------------+-------------------------+-------------+
| Texas Instruments      | cpsw                    | >= 5.3      |
+------------------------+-------------------------+-------------+
| VMware                 | vmxnet3                 | >= 6.6      |
+------------------------+-------------------------+-------------+

**Drivers supporting offloaded XDP**

* **Netronome**

  * nfp [2]_

.. note::

    Examples for writing and loading XDP programs are included in the
    `bpf_dev` section under the respective tools.

.. [2] Some BPF helper functions such as retrieving the current CPU number
   will not be available in an offloaded setting.

tc (traffic control)
--------------------

Aside from other program types such as XDP, BPF can also be used out of the
kernel's tc (traffic control) layer in the networking data path. On a high
level there are three major differences when comparing XDP BPF programs to
tc BPF ones:

* The BPF input context is an ``sk_buff``, not an ``xdp_buff``. When the
  kernel's networking stack receives a packet, after the XDP layer, it
  allocates a buffer and parses the packet to store metadata about the packet.
  This representation is known as the ``sk_buff``. This structure is then
  exposed in the BPF input context so that BPF programs from the tc ingress
  layer can use the metadata that the stack extracts from the packet. This can
  be useful, but comes with an associated cost of the stack performing this
  allocation and metadata extraction, and handling the packet until it hits
  the tc hook. By definition, the ``xdp_buff`` doesn't have access to this
  metadata because the XDP hook is called before this work is done. This is a
  significant contributor to the performance difference between the XDP and
  tc hooks.

  Therefore, BPF programs attached to the tc BPF hook can, for instance, read
  or write the skb's ``mark``, ``pkt_type``, ``protocol``, ``priority``,
  ``queue_mapping``, ``napi_id``, ``cb[]`` array, ``hash``, ``tc_classid`` or
  ``tc_index``, vlan metadata, the XDP transferred custom metadata and various
  other information (a short sketch of such access follows after this list).
  All members of the ``struct __sk_buff`` BPF context used in tc BPF are
  defined in the ``linux/bpf.h`` system header.

  Generally, the ``sk_buff`` is of a completely different nature than the
  ``xdp_buff``, and both come with advantages and disadvantages. For example,
  the ``sk_buff`` case has the advantage that it is rather straightforward to
  mangle its associated metadata, however, it also contains a lot of protocol
  specific information (e.g. GSO related state) which makes it difficult to
  simply switch protocols by solely rewriting the packet data. This is because
  the stack processes the packet based on the metadata rather than paying the
  cost of accessing the packet contents each time. Thus, additional conversion
  is required from BPF helper functions taking care that ``sk_buff`` internals
  are properly converted as well. The ``xdp_buff`` case however does not
  face such issues since it comes at such an early stage where the kernel
  has not even allocated an ``sk_buff`` yet, thus packet rewrites of any
  kind can be realized trivially. However, the ``xdp_buff`` case has the
  disadvantage that ``sk_buff`` metadata is not available for mangling
  at this stage. This can be overcome by passing custom metadata from
  XDP BPF to tc BPF. In this way, the limitations of each program
  type can be overcome by operating complementary programs of both types
  as the use case requires.

..

* Compared to XDP, tc BPF programs can be triggered out of ingress and also
  egress points in the networking data path as opposed to ingress only in
  the case of XDP.

  The two hook points ``sch_handle_ingress()`` and ``sch_handle_egress()`` in
  the kernel are triggered out of ``__netif_receive_skb_core()`` and
  ``__dev_queue_xmit()``, respectively. The latter two are the main receive
  and transmit functions in the data path that, setting XDP aside, are
  triggered for every network packet going in or coming out of the node,
  allowing for full visibility for tc BPF programs at these hook points.

..

* The tc BPF programs do not require any driver changes since they are run
  at hook points in generic layers in the networking stack. Therefore, they
  can be attached to any type of networking device.

  While this provides flexibility, it also trades off performance compared
  to running at the native XDP layer. However, tc BPF programs still come
  at the earliest point in the generic kernel's networking data path after
  GRO has been run but **before** any protocol processing, traditional
  iptables firewalling such as iptables PREROUTING or nftables ingress hooks
  or other packet processing takes place. Likewise on egress, tc BPF programs
  execute at the latest point before handing the packet to the driver itself
  for transmission, meaning **after** traditional iptables firewalling hooks
  like iptables POSTROUTING, but still before handing the packet to the
  kernel's GSO engine.

  One exception which does require driver changes, however, is offloaded tc
  BPF programs, typically provided by SmartNICs in a similar way as offloaded
  XDP just with a differing set of features due to the differences in the BPF
  input context, helper functions and verdict codes.

..
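
As a small illustration of the first point, the sketch below reads the custom
metadata area that the earlier XDP example prepended and copies it into
``skb->mark``. The ``struct meta_info`` layout is the same hypothetical one
as before, and ``SEC("tc")`` assumes a recent libbpf:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    /* Must match the hypothetical layout the XDP program prepended. */
    struct meta_info {
        __u32 mark;
    };

    SEC("tc")
    int tc_read_meta(struct __sk_buff *skb)
    {
        void *data      = (void *)(long)skb->data;
        void *data_meta = (void *)(long)skb->data_meta;
        struct meta_info *meta = data_meta;

        /* The metadata area is empty unless an XDP program reserved one. */
        if ((void *)(meta + 1) > data)
            return TC_ACT_OK;

        /* Transfer the XDP-provided value into skb->mark, one of the
         * __sk_buff members writable from tc BPF programs.
         */
        skb->mark = meta->mark;
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
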

BPF programs run in the tc layer are run from the ``cls_bpf`` classifier.
While the tc terminology describes the BPF attachment point as a "classifier",
this is a bit misleading since it under-represents what ``cls_bpf`` is
capable of, that is to say, a fully programmable packet processor which can
not only read the ``skb`` metadata and packet data, but also arbitrarily
mangle both, and terminate the tc processing with an action verdict.
``cls_bpf`` can thus be regarded as a self-contained entity that manages and
executes tc BPF programs.

``cls_bpf`` can hold one or more tc BPF programs. In the case where Cilium
deploys ``cls_bpf`` programs, it attaches only a single program for a given hook
in ``direct-action`` mode. Typically, in the traditional tc scheme, there is a
split between classifier and action modules, where the classifier has one
or more actions attached to it that are triggered once the classifier has a
match.
In the modern world of using tc in the software data path this model
does not scale well for complex packet processing. Given tc BPF programs
attached to ``cls_bpf`` are fully self-contained, they effectively fuse the
parsing and action process together into a single unit. Thanks to ``cls_bpf``'s
``direct-action`` mode, ``cls_bpf`` will just return the tc action verdict and
terminate the processing pipeline immediately. This allows for implementing
scalable programmable packet processing in the networking data path by avoiding
linear iteration of actions. ``cls_bpf`` is the only such "classifier" module
in the tc layer capable of such a fast-path.

Like XDP BPF programs, tc BPF programs can be atomically updated at runtime
via ``cls_bpf`` without interrupting any network traffic or having to restart
services.

Both the tc ingress and the egress hook where ``cls_bpf`` itself can be
attached to are managed by a pseudo qdisc called ``sch_clsact``. This is a
drop-in replacement and proper superset of the ingress qdisc since it
is able to manage both the ingress and egress tc hooks. For tc's egress hook
in ``__dev_queue_xmit()`` it is important to stress that it is not executed
under the kernel's qdisc root lock. Thus, both tc ingress and egress hooks
are executed in a lockless manner in the fast-path. In either case, preemption
is disabled and execution happens under RCU read side.

Typically on egress there are qdiscs attached to netdevices such as ``sch_mq``,
``sch_fq``, ``sch_fq_codel`` or ``sch_htb``, some of which are classful
qdiscs that contain subclasses and thus require a packet classification
mechanism to determine a verdict on where to demux the packet. This is handled
by a call to ``tcf_classify()`` which calls into tc classifiers if present.
``cls_bpf`` can also be attached and used in such cases. Such operation usually
happens under the qdisc root lock and can be subject to lock contention. The
``sch_clsact`` qdisc's egress hook comes at a much earlier point, however, which
does not fall under that lock and operates completely independently from
conventional egress qdiscs. Thus for cases like ``sch_htb`` the ``sch_clsact``
qdisc could perform the heavy lifting packet classification through tc BPF
outside of the qdisc root lock, setting the ``skb->mark`` or ``skb->priority``
from there such that ``sch_htb`` only requires a flat mapping without expensive
packet classification under the root lock, thus reducing contention.
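
A minimal sketch of this pattern could look as follows; the classid values
``0x10001`` and ``0x10002`` are placeholders assuming an ``sch_htb`` setup
with classes ``1:1`` and ``1:2``:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("tc")
    int tc_egress_classify(struct __sk_buff *skb)
    {
        /* Heavy classification would happen here, outside of the qdisc
         * root lock. The result is boiled down to a classid stored in
         * skb->priority so that sch_htb later only needs a flat lookup.
         */
        if (skb->protocol == bpf_htons(ETH_P_IP))
            skb->priority = 0x10001;    /* hypothetical htb class 1:1 */
        else
            skb->priority = 0x10002;    /* hypothetical htb class 1:2 */

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
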

Offloaded tc BPF programs are supported for the case of ``sch_clsact`` in
combination with ``cls_bpf`` where the previously loaded BPF program was JITed
by a SmartNIC driver to be run natively on the NIC. Only ``cls_bpf``
programs operating in ``direct-action`` mode are supported to be offloaded.
``cls_bpf`` only supports offloading a single program and cannot offload
multiple programs. Furthermore only the ingress hook supports offloading
BPF programs.

One ``cls_bpf`` instance is able to hold multiple tc BPF programs internally.
If this is the case, then the ``TC_ACT_UNSPEC`` program return code will
continue execution with the next tc BPF program in that list. However, this
has the drawback that several programs would need to parse the packet over
and over again, resulting in degraded performance.

**BPF program return codes**

Both the tc ingress and egress hooks share the same action return verdicts
that tc BPF programs can use. They are defined in the ``linux/pkt_cls.h``
system header:

.. code-block:: c

    #define TC_ACT_UNSPEC         (-1)
    #define TC_ACT_OK               0
    #define TC_ACT_SHOT             2
    #define TC_ACT_STOLEN           4
    #define TC_ACT_REDIRECT         7

There are a few more action ``TC_ACT_*`` verdicts available in the system
header file which are also used in the two hooks. However, they share the
same semantics with the ones above. Meaning, from a tc BPF perspective,
``TC_ACT_OK`` and ``TC_ACT_RECLASSIFY`` have the same semantics, as do the
three ``TC_ACT_STOLEN``, ``TC_ACT_QUEUED`` and ``TC_ACT_TRAP`` opcodes.
Therefore, for these cases we only describe ``TC_ACT_OK`` and the
``TC_ACT_STOLEN`` opcode for the two groups.

Starting out with ``TC_ACT_UNSPEC``: it has the meaning of "unspecified action"
and is used in three cases, i) when an offloaded tc BPF program is attached
and the tc ingress hook is run where the ``cls_bpf`` representation for the
offloaded program will return ``TC_ACT_UNSPEC``, ii) in order to continue
with the next tc BPF program in ``cls_bpf`` for the multi-program case. The
latter also works in combination with offloaded tc BPF programs from point i)
where the ``TC_ACT_UNSPEC`` from there continues with a next tc BPF program
that runs solely in the non-offloaded case. Last but not least, iii)
``TC_ACT_UNSPEC`` is also used for the single program case to simply tell the
kernel to continue with the ``skb`` without additional side-effects.
``TC_ACT_UNSPEC`` is very similar to the ``TC_ACT_OK`` action code in the
sense that both pass the ``skb`` onwards, either to upper layers of the stack
on ingress or down to the networking device driver for transmission on egress,
respectively. The only difference to ``TC_ACT_OK`` is that ``TC_ACT_OK`` sets
``skb->tc_index`` based on the classid the tc BPF program set. The latter is
set out of the tc BPF program itself through ``skb->tc_classid`` from the BPF
context.

``TC_ACT_SHOT`` instructs the kernel to drop the packet, meaning, upper
layers of the networking stack will never see the ``skb`` on ingress and
similarly the packet will never be submitted for transmission on egress.
``TC_ACT_SHOT`` and ``TC_ACT_STOLEN`` are both similar in nature with few
differences: ``TC_ACT_SHOT`` will indicate to the kernel that the ``skb``
was released through ``kfree_skb()`` and return ``NET_XMIT_DROP`` to the
callers for immediate feedback, whereas ``TC_ACT_STOLEN`` will release
the ``skb`` through ``consume_skb()`` and pretend to upper layers that
the transmission was successful through ``NET_XMIT_SUCCESS``. The perf
drop monitor which records traces of ``kfree_skb()`` will therefore
also not see any drop indications from ``TC_ACT_STOLEN`` since its
semantics are such that the ``skb`` has been "consumed" or queued but
certainly not "dropped".

Last but not least there is the ``TC_ACT_REDIRECT`` action which is available
for tc BPF programs as well. This allows the ``skb`` to be redirected to the
same or another device's ingress or egress path together with the
``bpf_redirect()`` helper. Being able to inject the packet into another
device's ingress or egress direction allows for full flexibility in packet
forwarding with BPF. There are no requirements on the target networking device
other than being a networking device itself; there is no need to run another
instance of ``cls_bpf`` on the target device or other such restrictions.
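
The following sketch shows a typical use of this verdict from a tc BPF
program; the target ifindex is a made-up constant which in practice would
usually be resolved from a BPF map or configuration:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Placeholder ifindex of the device to forward to. */
    #define TARGET_IFINDEX    4

    SEC("tc")
    int tc_redirect_example(struct __sk_buff *skb)
    {
        /* bpf_redirect() records the target and returns TC_ACT_REDIRECT;
         * passing BPF_F_INGRESS instead of 0 would inject the packet
         * into the target device's ingress path instead of its egress.
         */
        return bpf_redirect(TARGET_IFINDEX, 0);
    }

    char _license[] SEC("license") = "GPL";
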

**tc BPF FAQ**

This section contains a few miscellaneous questions and answers related to
tc BPF programs that are asked from time to time.

* **Question:** What about ``act_bpf`` as a tc action module, is it still relevant?
* **Answer:** Not really. Although ``cls_bpf`` and ``act_bpf`` share the same
  functionality for tc BPF programs, ``cls_bpf`` is more flexible since it is a
  proper superset of ``act_bpf``. The way tc works is that tc actions need to be
  attached to tc classifiers. In order to achieve the same flexibility as ``cls_bpf``,
  ``act_bpf`` would need to be attached to the ``cls_matchall`` classifier. As the
  name says, this will match on every packet in order to pass them through for attached
  tc action processing. For ``act_bpf``, this will result in less efficient packet
  processing than using ``cls_bpf`` in ``direct-action`` mode directly. If ``act_bpf``
  is used in a setting with other classifiers than ``cls_bpf`` or ``cls_matchall``
  then this will perform even worse due to the nature of operation of tc classifiers.
  Meaning, if classifier A has a mismatch, then the packet is passed to classifier
  B, reparsing the packet, etc, thus in the typical case there will be linear
  processing where the packet would need to traverse N classifiers in the worst
  case to find a match and execute ``act_bpf`` on that. Therefore, ``act_bpf`` has
  never been largely relevant. Additionally, ``act_bpf`` does not provide a tc
  offloading interface either, compared to ``cls_bpf``.

..

* **Question:** Is it recommended to use ``cls_bpf`` not in ``direct-action`` mode?
* **Answer:** No. The answer is similar to the one above in that this is otherwise
  unable to scale for more complex processing. tc BPF can already do everything needed
  by itself in an efficient manner and thus there is no need for anything other than
  ``direct-action`` mode.

..

* **Question:** Is there any performance difference between offloaded ``cls_bpf``
  and offloaded XDP?
* **Answer:** No. Both are JITed through the same compiler in the kernel which
  handles the offloading to the SmartNIC and the loading mechanism for both is
  very similar as well. Thus, the BPF program gets translated into the same target
  instruction set in order to be able to run on the NIC natively. The two tc BPF
  and XDP BPF program types have a differing set of features, so depending on the
  use case one might be picked over the other due to availability of certain helper
  functions in the offload case, for example.

**Use cases for tc BPF**

Some of the main use cases for tc BPF programs are presented in this subsection.
Also here, the list is non-exhaustive and given the programmability and efficiency
of tc BPF, it can easily be tailored and integrated into orchestration systems
in order to solve very specific use cases. While some use cases with XDP may overlap,
tc BPF and XDP BPF are mostly complementary to each other and both can also be
used at the same time or one over the other depending on which is most suitable
for a given problem to solve.

* **Policy enforcement for containers**

  One application which tc BPF programs are suitable for is to implement policy
  enforcement, custom firewalling or similar security measures for containers or
  pods, respectively. In the conventional case, container isolation is implemented
  through network namespaces with veth networking devices connecting the host's
  initial namespace with the dedicated container's namespace. Since one end of
  the veth pair has been moved into the container's namespace whereas the other
  end remains in the initial namespace of the host, all network traffic from the
  container has to pass through the host-facing veth device, allowing for attaching
  tc BPF programs on the tc ingress and egress hook of the veth. Network traffic
  going into the container will pass through the host-facing veth's tc egress
  hook whereas network traffic coming from the container will pass through the
  host-facing veth's tc ingress hook.

  For virtual devices like veth devices XDP is unsuitable in this case since the
  kernel operates solely on an ``skb`` here and generic XDP has a few limitations
  where it does not operate with cloned ``skb``'s. Cloned ``skb``'s are heavily
  used by the TCP/IP stack in order to hold data segments for retransmission,
  and the generic XDP hook would simply be bypassed for them. Moreover, generic
  XDP needs to linearize the entire ``skb`` resulting in heavily degraded
  performance. tc BPF on the other hand is more flexible as it specializes in
  the ``skb`` input context case and thus does not need to cope with the
  limitations of generic XDP.

..

* **Forwarding and load-balancing**

  The forwarding and load-balancing use case is quite similar to XDP, although
  slightly more targeted towards east-west container workloads rather than
  north-south traffic (though both technologies can be used in either case).
  Since XDP is only available on the ingress side, tc BPF programs allow for
  further use cases that apply in particular on egress, for example, container
  based traffic can already be NATed and load-balanced on the egress side
  through BPF out of the initial namespace such that this is done transparently
  to the container itself. Egress traffic is already based on the ``sk_buff``
  structure due to the nature of the kernel's networking stack, so packet
  rewrites and redirects are suitable out of tc BPF. By utilizing the
  ``bpf_redirect()`` helper function, BPF can take over the forwarding logic
  to push the packet either into the ingress or egress path of another
  networking device. Thus, any bridge-like devices become unnecessary to use
  as well by utilizing tc BPF as forwarding fabric.

..

* **Flow sampling, monitoring**

  Like in the XDP case, flow sampling and monitoring can be realized through a
  high-performance lockless per-CPU memory mapped perf ring buffer where the
  BPF program is able to push custom data, the full or truncated packet
  contents, or both up to a user space application. From the tc BPF program
  this is realized through the ``bpf_skb_event_output()`` BPF helper function
  which has the same function signature and semantics as
  ``bpf_xdp_event_output()`` (a sketch follows after this list).
  Given tc BPF programs can be attached to ingress and egress as opposed to
  only ingress in the XDP BPF case, plus the two tc hooks are at the lowest
  layer in the (generic) networking stack, this allows for bidirectional
  monitoring of all network traffic from a particular node. This might be
  somewhat related to the cBPF case which tcpdump and Wireshark make use of,
  though without having to clone the ``skb`` and while being a lot more
  flexible in terms of programmability where, for example, BPF can already
  perform in-kernel aggregation rather than pushing everything up to user
  space, as well as add custom annotations for packets pushed into the ring
  buffer. The latter is also heavily used in Cilium where packet drops can be
  further annotated to correlate container labels and reasons for why a given
  packet had to be dropped (such as due to policy violation) in order to
  provide a richer context.

..

* **Packet scheduler pre-processing**

  The ``sch_clsact``'s egress hook, which is called ``sch_handle_egress()``,
  runs right before taking the kernel's qdisc root lock, thus tc BPF programs
  can be utilized to perform all the heavy lifting packet classification
  and mangling before the packet is transmitted into a real full blown
  qdisc such as ``sch_htb``. This type of interaction of ``sch_clsact``
  with a real qdisc like ``sch_htb`` coming later in the transmission phase
  allows to reduce the lock contention on transmission since ``sch_clsact``'s
  egress hook is executed without taking locks.

..
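
A sketch of the flow sampling pattern from the tc side is shown below. From C
programs the event output helper is invoked as ``bpf_perf_event_output()``;
the map name, the metadata layout and the 64 byte sample cap are arbitrary
choices for illustration:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    /* Per-CPU perf ring buffer that a user space reader consumes. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } events SEC(".maps");

    /* Hypothetical custom annotation pushed in front of each sample. */
    struct sample_meta {
        __u32 ifindex;
        __u32 pkt_len;
    };

    SEC("tc")
    int tc_sample(struct __sk_buff *skb)
    {
        struct sample_meta meta = {
            .ifindex = skb->ifindex,
            .pkt_len = skb->len,
        };
        /* The upper 32 bits of the flags select how many packet bytes
         * to append after the metadata; cap the sample at 64 bytes.
         */
        __u32 cap = skb->len < 64 ? skb->len : 64;
        __u64 flags = BPF_F_CURRENT_CPU | ((__u64)cap << 32);

        bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
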

One concrete example user of tc BPF but also XDP BPF programs is Cilium.
Cilium is open source software for transparently securing the network
connectivity between application services deployed using Linux container
management platforms like Docker and Kubernetes and operates at Layer 3/4
as well as Layer 7. At the heart of Cilium operates BPF in order to
implement the policy enforcement as well as load balancing and monitoring.

* Slides: https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
* Video: https://youtu.be/ilKlmTDdFgk
* Github: https://github.com/cilium/cilium

**Driver support**

Since tc BPF programs are triggered from the kernel's networking stack
and not directly out of the driver, they do not require any extra driver
modification and therefore can run on any networking device. The only
exception listed below is for offloading tc BPF programs to the NIC.

**Drivers supporting offloaded tc BPF**

* **Netronome**

  * nfp [2]_

.. note::

    Examples for writing and loading tc BPF programs are included in the
    `bpf_dev` section under the respective tools.