
     1  .. only:: not (epub or latex or html)
     2  
     3      WARNING: You are looking at unreleased Cilium documentation.
     4      Please use the official rendered version released here:
     5      http://docs.cilium.io
     6  
     7  .. _bpf_guide:
     8  
     9  ***************************
    10  BPF and XDP Reference Guide
    11  ***************************
    12  
    13  .. note:: This documentation section is targeted at developers and users who
    14            want to understand BPF and XDP in great technical depth. While
    15            reading this reference guide may help broaden your understanding of
    16            Cilium, it is not a requirement to use Cilium. Please refer to the
    17            :ref:`gs_guide` and :ref:`arch_guide` for a higher level
    18            introduction.
    19  
BPF is a highly flexible and efficient virtual machine-like construct in the
Linux kernel which allows bytecode to be executed at various hook points in a
safe manner. It is used in a number of Linux kernel subsystems, most prominently
networking, tracing and security (e.g. sandboxing).
    24  
Although BPF has existed since 1992, this document covers the extended Berkeley
Packet Filter (eBPF) version which first appeared in kernel 3.18 and renders
the original version, these days referred to as "classic" BPF (cBPF), mostly
obsolete. cBPF is known to many as the packet filter language used by tcpdump.
Nowadays, the Linux kernel runs eBPF only and loaded cBPF bytecode is
transparently translated into an eBPF representation in the kernel before
program execution. This documentation generally uses the term BPF unless
explicit differences between eBPF and cBPF are pointed out.
    34  
    35  Even though the name Berkeley Packet Filter hints at a packet filtering specific
    36  purpose, the instruction set is generic and flexible enough these days that
    37  there are many use cases for BPF apart from networking. See :ref:`bpf_users`
    38  for a list of projects which use BPF.
    39  
    40  Cilium uses BPF heavily in its data path, see :ref:`arch_guide` for further
    41  information. The goal of this chapter is to provide a BPF reference guide in
    42  order to gain understanding of BPF, its networking specific use including loading
    43  BPF programs with tc (traffic control) and XDP (eXpress Data Path), and to aid
    44  with developing Cilium's BPF templates.
    45  
    46  BPF Architecture
    47  ================
    48  
    49  BPF does not define itself by only providing its instruction set, but also by
    50  offering further infrastructure around it such as maps which act as efficient
    51  key / value stores, helper functions to interact with and leverage kernel
    52  functionality, tail calls for calling into other BPF programs, security
    53  hardening primitives, a pseudo file system for pinning objects (maps,
    54  programs), and infrastructure for allowing BPF to be offloaded, for example, to
    55  a network card.
    56  
    57  LLVM provides a BPF back end, so that tools like clang can be used to
    58  compile C into a BPF object file, which can then be loaded into the kernel.
    59  BPF is deeply tied to the Linux kernel and allows for full programmability
    60  without sacrificing native kernel performance.
    61  
Last but not least, the kernel subsystems making use of BPF are themselves part
of BPF's infrastructure. The two main subsystems discussed throughout this
document are tc and XDP, to which BPF programs can be attached. XDP BPF programs
    65  are attached at the earliest networking driver stage and trigger a run of the
    66  BPF program upon packet reception. By definition, this achieves the best
    67  possible packet processing performance since packets cannot get processed at an
    68  even earlier point in software. However, since this processing occurs so early
    69  in the networking stack, the stack has not yet extracted metadata out of the
    70  packet. On the other hand, tc BPF programs are executed later in the kernel
    71  stack, so they have access to more metadata and core kernel functionality.
    72  Apart from tc and XDP programs, there are various other kernel subsystems as
    73  well which use BPF such as tracing (kprobes, uprobes, tracepoints, etc).
    74  
    75  The following subsections provide further details on individual aspects of the
    76  BPF architecture.
    77  
    78  Instruction Set
    79  ---------------
    80  
    81  BPF is a general purpose RISC instruction set and was originally designed for the
    82  purpose of writing programs in a subset of C which can be compiled into BPF instructions
    83  through a compiler back end (e.g. LLVM), so that the kernel can later on map them
    84  through an in-kernel JIT compiler into native opcodes for optimal execution performance
    85  inside the kernel.
    86  
The advantages of pushing these instructions into the kernel include:
    88  
    89  * Making the kernel programmable without having to cross kernel / user space
    90    boundaries. For example, BPF programs related to networking, as in the case of
    91    Cilium, can implement flexible container policies, load balancing and other means
    92    without having to move packets to user space and back into the kernel. State
    93    between BPF programs and kernel / user space can still be shared through maps
    94    whenever needed.
    95  
* Given the flexibility of a programmable data path, programs can also be heavily
  optimized for performance by compiling out features that are not required for the use cases
    98    the program solves. For example, if a container does not require IPv4, then the BPF
    99    program can be built to only deal with IPv6 in order to save resources in the fast-path.
   100  
   101  * In case of networking (e.g. tc and XDP), BPF programs can be updated atomically
   102    without having to restart the kernel, system services or containers, and without
   103    traffic interruptions. Furthermore, any program state can also be maintained
   104    throughout updates via BPF maps.
   105  
   106  * BPF provides a stable ABI towards user space, and does not require any third party
   107    kernel modules. BPF is a core part of the Linux kernel that is shipped everywhere,
   108    and guarantees that existing BPF programs keep running with newer kernel versions.
   109    This guarantee is the same guarantee that the kernel provides for system calls with
   110    regard to user space applications. Moreover, BPF programs are portable across
   111    different architectures.
   112  
   113  * BPF programs work in concert with the kernel, they make use of existing kernel
   114    infrastructure (e.g. drivers, netdevices, tunnels, protocol stack, sockets) and
   115    tooling (e.g. iproute2) as well as the safety guarantees which the kernel provides.
   116    Unlike kernel modules, BPF programs are verified through an in-kernel verifier in
   117    order to ensure that they cannot crash the kernel, always terminate, etc. XDP
   118    programs, for example, reuse the existing in-kernel drivers and operate on the
   119    provided DMA buffers containing the packet frames without exposing them or an entire
   120    driver to user space as in other models. Moreover, XDP programs reuse the existing
   121    stack instead of bypassing it. BPF can be considered a generic "glue code" to
   122    kernel facilities for crafting programs to solve specific use cases.
   123  
   124  The execution of a BPF program inside the kernel is always event-driven! Examples:
   125  
   126  * A networking device which has a BPF program attached on its ingress path will
   127    trigger the execution of the program once a packet is received.
   128  
   129  * A kernel address which has a kprobe with a BPF program attached will trap once
   130    the code at that address gets executed, which will then invoke the kprobe's
   131    callback function for instrumentation, subsequently triggering the execution
   132    of the attached BPF program.
   133  
   134  BPF consists of eleven 64 bit registers with 32 bit subregisters, a program counter
   135  and a 512 byte large BPF stack space. Registers are named ``r0`` - ``r10``. The
   136  operating mode is 64 bit by default, the 32 bit subregisters can only be accessed
   137  through special ALU (arithmetic logic unit) operations. The 32 bit lower subregisters
   138  zero-extend into 64 bit when they are being written to.
   139  
   140  Register ``r10`` is the only register which is read-only and contains the frame pointer
   141  address in order to access the BPF stack space. The remaining ``r0`` - ``r9``
   142  registers are general purpose and of read/write nature.
   143  
   144  A BPF program can call into a predefined helper function, which is defined by
   145  the core kernel (never by modules). The BPF calling convention is defined as
   146  follows:
   147  
   148  * ``r0`` contains the return value of a helper function call.
   149  * ``r1`` - ``r5`` hold arguments from the BPF program to the kernel helper function.
   150  * ``r6`` - ``r9`` are callee saved registers that will be preserved on helper function call.
   151  
   152  The BPF calling convention is generic enough to map directly to ``x86_64``, ``arm64``
and other ABIs, thus all BPF registers map one to one to HW CPU registers, so that a
JIT only needs to issue a call instruction, but no additional moves for placing
function arguments. This calling convention was modeled to cover common call
   156  situations without having a performance penalty. Calls with 6 or more arguments
   157  are currently not supported. The helper functions in the kernel which are dedicated
   158  to BPF (``BPF_CALL_0()`` to ``BPF_CALL_5()`` functions) are specifically designed
   159  with this convention in mind.
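
As an illustration of this convention in the assembler notation used later in
this document (a hand-written sketch rather than compiler output, with helper
id ``5`` assumed to denote ``bpf_ktime_get_ns``), a helper call could look as
follows:

::

    call 5        # invoke helper with id 5, result is returned in r0
    r6 = r0       # preserve the result in callee saved r6 across further calls
    r0 = 0        # set the program's exit value
    exit          # hand execution back to the kernel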
   160  
   161  Register ``r0`` is also the register containing the exit value for the BPF program.
   162  The semantics of the exit value are defined by the type of program. Furthermore, when
   163  handing execution back to the kernel, the exit value is passed as a 32 bit value.
   164  
   165  Registers ``r1`` - ``r5`` are scratch registers, meaning the BPF program needs to
   166  either spill them to the BPF stack or move them to callee saved registers if these
   167  arguments are to be reused across multiple helper function calls. Spilling means
   168  that the variable in the register is moved to the BPF stack. The reverse operation
   169  of moving the variable from the BPF stack to the register is called filling. The
   170  reason for spilling/filling is due to the limited number of registers.
   171  
   172  Upon entering execution of a BPF program, register ``r1`` initially contains the
   173  context for the program. The context is the input argument for the program (similar
   174  to ``argc/argv`` pair for a typical C program). BPF is restricted to work on a single
   175  context. The context is defined by the program type, for example, a networking
   176  program can have a kernel representation of the network packet (``skb``) as the
   177  input argument.
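
As an illustration, the following minimal sketch of a tc BPF program takes the
kernel's ``struct __sk_buff`` view of the packet as its context; the
``classifier`` section name follows the convention expected by iproute2's
loader, and the program body is only meant to show how the context is used:

::

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    /* r1 holds the context on entry; for tc programs this is the
     * struct __sk_buff representation of the packet. */
    __attribute__((section("classifier"), used))
    int cls_main(struct __sk_buff *skb)
    {
        /* Fields like skb->len are part of the context exposed to BPF. */
        if (skb->len > 1500)
            return TC_ACT_SHOT;    /* drop */

        return TC_ACT_OK;          /* continue processing */
    }

    char __license[] __attribute__((section("license"), used)) = "GPL";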
   178  
The general operation of BPF is 64 bit to follow the natural model of 64 bit
architectures in order to perform pointer arithmetic, to pass pointers as well
as 64 bit values into helper functions, and to allow for 64 bit atomic operations.
   182  
   183  The maximum instruction limit per program is restricted to 4096 BPF instructions,
   184  which, by design, means that any program will terminate quickly. Although the
   185  instruction set contains forward as well as backward jumps, the in-kernel BPF
   186  verifier will forbid loops so that termination is always guaranteed. Since BPF
   187  programs run inside the kernel, the verifier's job is to make sure that these are
   188  safe to run, not affecting the system's stability. This means that from an instruction
   189  set point of view, loops can be implemented, but the verifier will restrict that.
   190  However, there is also a concept of tail calls that allows for one BPF program to
   191  jump into another one. This, too, comes with an upper nesting limit of 32 calls,
   192  and is usually used to decouple parts of the program logic, for example, into stages.
   193  
The instruction format is modeled as two operand instructions, which helps with mapping
BPF instructions to native instructions during the JIT phase. The instruction set is
of fixed size, meaning every instruction has a 64 bit encoding. Currently, 87 instructions
have been implemented and the encoding also allows extending the set with further
instructions when needed. The instruction encoding of a single 64 bit instruction on a
big-endian machine is defined as a bit sequence from most significant bit (MSB) to least
significant bit (LSB) of ``op:8``, ``dst_reg:4``, ``src_reg:4``, ``off:16``, ``imm:32``.
``off`` and ``imm`` are of signed type. The encodings are part of the kernel headers and
defined in the ``linux/bpf.h`` header, which also includes ``linux/bpf_common.h``.
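
This layout corresponds to the ``struct bpf_insn`` definition which the kernel's
UAPI header exposes, shown here for illustration:

::

    struct bpf_insn {
        __u8    code;           /* opcode */
        __u8    dst_reg:4;      /* destination register */
        __u8    src_reg:4;      /* source register */
        __s16   off;            /* signed offset */
        __s32   imm;            /* signed immediate constant */
    };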
   203  
   204  ``op`` defines the actual operation to be performed. Most of the encoding for ``op``
   205  has been reused from cBPF. The operation can be based on register or immediate
   206  operands. The encoding of ``op`` itself provides information on which mode to use
   207  (``BPF_X`` for denoting register-based operations, and ``BPF_K`` for immediate-based
   208  operations respectively). In the latter case, the destination operand is always
   209  a register. Both ``dst_reg`` and ``src_reg`` provide additional information about
   210  the register operands to be used (e.g. ``r0`` - ``r9``) for the operation. ``off``
   211  is used in some instructions to provide a relative offset, for example, for addressing
   212  the stack or other buffers available to BPF (e.g. map values, packet data, etc),
   213  or jump targets in jump instructions. ``imm`` contains a constant / immediate value.
   214  
   215  The available ``op`` instructions can be categorized into various instruction
   216  classes. These classes are also encoded inside the ``op`` field. The ``op`` field
   217  is divided into (from MSB to LSB) ``code:4``, ``source:1`` and ``class:3``. ``class``
   218  is the more generic instruction class, ``code`` denotes a specific operational
   219  code inside that class, and ``source`` tells whether the source operand is a register
   220  or an immediate value. Possible instruction classes include:
   221  
   222  * ``BPF_LD``, ``BPF_LDX``: Both classes are for load operations. ``BPF_LD`` is
   223    used for loading a double word as a special instruction spanning two instructions
   224    due to the ``imm:32`` split, and for byte / half-word / word loads of packet data.
   225    The latter was carried over from cBPF mainly in order to keep cBPF to BPF
   226    translations efficient, since they have optimized JIT code. For native BPF
   227    these packet load instructions are less relevant nowadays. ``BPF_LDX`` class
   228    holds instructions for byte / half-word / word / double-word loads out of
   229    memory. Memory in this context is generic and could be stack memory, map value
   230    data, packet data, etc.
   231  
* ``BPF_ST``, ``BPF_STX``: Both classes are for store operations. Similar to ``BPF_LDX``,
  ``BPF_STX`` is the store counterpart and is used to store the data from a
  register into memory, which, again, can be stack memory, map value, packet data,
  etc. ``BPF_STX`` also holds special instructions for performing word and double-word
  based atomic add operations, which can be used for counters, for example. The
  ``BPF_ST`` class is similar to ``BPF_STX`` in providing instructions for storing
  data into memory, except that the source operand is an immediate value.
   239  
   240  * ``BPF_ALU``, ``BPF_ALU64``: Both classes contain ALU operations. Generally,
   241    ``BPF_ALU`` operations are in 32 bit mode and ``BPF_ALU64`` in 64 bit mode.
  Both ALU classes have basic operations with a register-based source operand as
  well as an immediate-based counterpart. Supported by both are add (``+``), sub (``-``),
   244    and (``&``), or (``|``), left shift (``<<``), right shift (``>>``), xor (``^``),
   245    mul (``*``), div (``/``), mod (``%``), neg (``~``) operations. Also mov (``<X> := <Y>``)
   246    was added as a special ALU operation for both classes in both operand modes.
   247    ``BPF_ALU64`` also contains a signed right shift. ``BPF_ALU`` additionally
   248    contains endianness conversion instructions for half-word / word / double-word
   249    on a given source register.
   250  
   251  * ``BPF_JMP``: This class is dedicated to jump operations. Jumps can be unconditional
   252    and conditional. Unconditional jumps simply move the program counter forward, so
   253    that the next instruction to be executed relative to the current instruction is
   254    ``off + 1``, where ``off`` is the constant offset encoded in the instruction. Since
   255    ``off`` is signed, the jump can also be performed backwards as long as it does not
  create a loop and is within program bounds. Conditional jumps operate on both
   257    register-based and immediate-based source operands. If the condition in the jump
   258    operations results in ``true``, then a relative jump to ``off + 1`` is performed,
   259    otherwise the next instruction (``0 + 1``) is performed. This fall-through
   260    jump logic differs compared to cBPF and allows for better branch prediction as it
   261    fits the CPU branch predictor logic more naturally. Available conditions are
   262    jeq (``==``), jne (``!=``), jgt (``>``), jge (``>=``), jsgt (signed ``>``), jsge
   263    (signed ``>=``), jlt (``<``), jle (``<=``), jslt (signed ``<``), jsle (signed
   264    ``<=``) and jset (jump if ``DST & SRC``). Apart from that, there are three
   265    special jump operations within this class: the exit instruction which will leave
   266    the BPF program and return the current value in ``r0`` as a return code, the call
   267    instruction, which will issue a function call into one of the available BPF helper
   268    functions, and a hidden tail call instruction, which will jump into a different
   269    BPF program.
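
The split of the ``op`` field described above is also reflected in the masks
which ``linux/bpf_common.h`` provides for decoding an opcode, roughly along
these lines:

::

    #define BPF_CLASS(code) ((code) & 0x07)   /* instruction class           */
    #define BPF_SRC(code)   ((code) & 0x08)   /* BPF_K (imm) or BPF_X (reg)  */
    #define BPF_OP(code)    ((code) & 0xf0)   /* operation code within class */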
   270  
   271  The Linux kernel is shipped with a BPF interpreter which executes programs assembled in
   272  BPF instructions. Even cBPF programs are translated into eBPF programs transparently
   273  in the kernel, except for architectures that still ship with a cBPF JIT and
   274  have not yet migrated to an eBPF JIT.
   275  
   276  Currently ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64`` and
   277  ``arm`` architectures come with an in-kernel eBPF JIT compiler.
   278  
   279  All BPF handling such as loading of programs into the kernel or creation of BPF maps
   280  is managed through a central ``bpf()`` system call. It is also used for managing map
   281  entries (lookup / update / delete), and making programs as well as maps persistent
   282  in the BPF file system through pinning.
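
As a minimal user space sketch of this interface, the following example creates
a hash map through the ``bpf(2)`` system call and inserts one element into it;
error handling is omitted for brevity:

::

    #include <linux/bpf.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Thin wrapper around the bpf(2) system call. */
    static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
    {
        return syscall(__NR_bpf, cmd, attr, size);
    }

    int main(void)
    {
        union bpf_attr create, update;
        uint32_t key = 1, value = 42;
        int map_fd;

        /* BPF_MAP_CREATE returns a file descriptor for the new map. */
        memset(&create, 0, sizeof(create));
        create.map_type    = BPF_MAP_TYPE_HASH;
        create.key_size    = sizeof(key);
        create.value_size  = sizeof(value);
        create.max_entries = 16;
        map_fd = bpf(BPF_MAP_CREATE, &create, sizeof(create));

        /* Key and value are passed as pointers cast to u64. */
        memset(&update, 0, sizeof(update));
        update.map_fd = map_fd;
        update.key    = (uint64_t)(uintptr_t)&key;
        update.value  = (uint64_t)(uintptr_t)&value;
        update.flags  = BPF_ANY;
        bpf(BPF_MAP_UPDATE_ELEM, &update, sizeof(update));

        close(map_fd);
        return 0;
    }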
   283  
   284  Helper Functions
   285  ----------------
   286  
   287  Helper functions are a concept which enables BPF programs to consult a core kernel
   288  defined set of function calls in order to retrieve / push data from / to the
   289  kernel. Available helper functions may differ for each BPF program type,
   290  for example, BPF programs attached to sockets are only allowed to call into
   291  a subset of helpers compared to BPF programs attached to the tc layer.
   292  Encapsulation and decapsulation helpers for lightweight tunneling constitute
   293  an example of functions which are only available to lower tc layers, whereas
   294  event output helpers for pushing notifications to user space are available to
   295  tc and XDP programs.
   296  
   297  Each helper function is implemented with a commonly shared function signature
   298  similar to system calls. The signature is defined as:
   299  
   300  ::
   301  
   302      u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)
   303  
   304  The calling convention as described in the previous section applies to all
   305  BPF helper functions.
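
On the BPF program side, a helper is typically invoked through a function
pointer declared with the corresponding ``BPF_FUNC_*`` identifier from
``linux/bpf.h``, so that the compiler emits a generic call instruction which
the verifier and JIT later resolve to the in-kernel implementation. A minimal
sketch of this declaration style, as commonly shipped in loader helper headers:

::

    #include <linux/bpf.h>

    /* Declare the helper by its function id; calling it results in a
     * BPF call instruction with BPF_FUNC_ktime_get_ns as immediate. */
    static __u64 (*bpf_ktime_get_ns)(void) =
        (void *) BPF_FUNC_ktime_get_ns;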
   306  
   307  The kernel abstracts helper functions into macros ``BPF_CALL_0()`` to ``BPF_CALL_5()``
   308  which are similar to those of system calls. The following example is an extract
   309  from a helper function which updates map elements by calling into the
   310  corresponding map implementation callbacks:
   311  
   312  ::
   313  
   314      BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
   315                 void *, value, u64, flags)
   316      {
   317          WARN_ON_ONCE(!rcu_read_lock_held());
   318          return map->ops->map_update_elem(map, key, value, flags);
   319      }
   320  
   321      const struct bpf_func_proto bpf_map_update_elem_proto = {
   322          .func           = bpf_map_update_elem,
   323          .gpl_only       = false,
   324          .ret_type       = RET_INTEGER,
   325          .arg1_type      = ARG_CONST_MAP_PTR,
   326          .arg2_type      = ARG_PTR_TO_MAP_KEY,
   327          .arg3_type      = ARG_PTR_TO_MAP_VALUE,
   328          .arg4_type      = ARG_ANYTHING,
   329      };
   330  
   331  There are various advantages of this approach: while cBPF overloaded its
   332  load instructions in order to fetch data at an impossible packet offset to
   333  invoke auxiliary helper functions, each cBPF JIT needed to implement support
   334  for such a cBPF extension. In case of eBPF, each newly added helper function
   335  will be JIT compiled in a transparent and efficient way, meaning that the JIT
   336  compiler only needs to emit a call instruction since the register mapping
   337  is made in such a way that BPF register assignments already match the
   338  underlying architecture's calling convention. This allows for easily extending
   339  the core kernel with new helper functionality. All BPF helper functions are
   340  part of the core kernel and cannot be extended or added through kernel modules.
   341  
   342  The aforementioned function signature also allows the verifier to perform type
checks. The above ``struct bpf_func_proto`` is used to hand all the necessary
information which needs to be known about the helper to the verifier, so that
   345  the verifier can make sure that the expected types from the helper match the
   346  current contents of the BPF program's analyzed registers.
   347  
   348  Argument types can range from passing in any kind of value up to restricted
   349  contents such as a pointer / size pair for the BPF stack buffer, which the
   350  helper should read from or write to. In the latter case, the verifier can also
   351  perform additional checks, for example, whether the buffer was previously
   352  initialized.
   353  
   354  The list of available BPF helper functions is rather long and constantly growing,
   355  for example, at the time of this writing, tc BPF programs can choose from 38
   356  different BPF helpers. The kernel's ``struct bpf_verifier_ops`` contains a
   357  ``get_func_proto`` callback function that provides the mapping of a specific
   358  ``enum bpf_func_id`` to one of the available helpers for a given BPF program
   359  type.
   360  
   361  Maps
   362  ----
   363  
   364  .. image:: images/bpf_map.png
   365      :align: center
   366  
   367  Maps are efficient key / value stores that reside in kernel space. They can be
   368  accessed from a BPF program in order to keep state among multiple BPF program
   369  invocations. They can also be accessed through file descriptors from user space
   370  and can be arbitrarily shared with other BPF programs or user space applications.
   371  
   372  BPF programs which share maps with each other are not required to be of the same
   373  program type, for example, tracing programs can share maps with networking programs.
   374  A single BPF program can currently access up to 64 different maps directly.
   375  
   376  Map implementations are provided by the core kernel. There are generic maps with
   377  per-CPU and non-per-CPU flavor that can read / write arbitrary data, but there are
   378  also a few non-generic maps that are used along with helper functions.
   379  
   380  Generic maps currently available are ``BPF_MAP_TYPE_HASH``, ``BPF_MAP_TYPE_ARRAY``,
   381  ``BPF_MAP_TYPE_PERCPU_HASH``, ``BPF_MAP_TYPE_PERCPU_ARRAY``, ``BPF_MAP_TYPE_LRU_HASH``,
   382  ``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and ``BPF_MAP_TYPE_LPM_TRIE``. They all use the
   383  same common set of BPF helper functions in order to perform lookup, update or
   384  delete operations while implementing a different backend with differing semantics
   385  and performance characteristics.
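
As an illustration of how a generic map is typically defined and accessed from
a BPF program, the following sketch counts packets in a per-CPU array; the
``struct bpf_elf_map`` definition mirrors the map description that iproute2's
ELF loader expects (other loaders use slightly different layouts), and the
helper is declared by its function id as shown earlier:

::

    #include <linux/bpf.h>

    /* Map description consumed by the loader from the "maps" section. */
    struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;
        __u32 pinning;
    };

    static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

    struct bpf_elf_map count_map __attribute__((section("maps"), used)) = {
        .type       = BPF_MAP_TYPE_PERCPU_ARRAY,
        .size_key   = sizeof(__u32),
        .size_value = sizeof(__u64),
        .max_elem   = 1,
    };

    __attribute__((section("prog"), used))
    int xdp_count(struct xdp_md *ctx)
    {
        __u32 key = 0;
        __u64 *count;

        /* Lookup returns a pointer into the map value or NULL. */
        count = bpf_map_lookup_elem(&count_map, &key);
        if (count)
            (*count)++;

        return XDP_PASS;
    }

    char __license[] __attribute__((section("license"), used)) = "GPL";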
   386  
   387  Non-generic maps that are currently in the kernel are ``BPF_MAP_TYPE_PROG_ARRAY``,
   388  ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, ``BPF_MAP_TYPE_CGROUP_ARRAY``,
   389  ``BPF_MAP_TYPE_STACK_TRACE``, ``BPF_MAP_TYPE_ARRAY_OF_MAPS``,
   390  ``BPF_MAP_TYPE_HASH_OF_MAPS``. For example, ``BPF_MAP_TYPE_PROG_ARRAY`` is an
   391  array map which holds other BPF programs, ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and
   392  ``BPF_MAP_TYPE_HASH_OF_MAPS`` both hold pointers to other maps such that entire
   393  BPF maps can be atomically replaced at runtime. These types of maps tackle a
   394  specific issue which was unsuitable to be implemented solely through a BPF helper
   395  function since additional (non-data) state is required to be held across BPF
   396  program invocations.
   397  
   398  Object Pinning
   399  --------------
   400  
   401  .. image:: images/bpf_fs.png
   402      :align: center
   403  
BPF maps and programs act as a kernel resource and can only be accessed through
file descriptors, backed by anonymous inodes in the kernel. This brings along
advantages, but also a number of disadvantages:

User space applications can make use of most file descriptor related APIs, and
file descriptor passing for Unix domain sockets works transparently, etc, but
at the same time, file descriptors are limited to a process's lifetime,
which makes options like map sharing rather cumbersome to carry out.
   412  
This brings a number of complications for certain use cases such as iproute2,
where tc or XDP sets up and loads the program into the kernel and eventually
terminates itself. With that, the maps also become inaccessible from user
space, where access could otherwise be useful, for example, when maps are
shared between ingress and egress locations of the data path. Also, third
party applications may wish to monitor or update map contents during BPF
program runtime.
   420  
To overcome this limitation, a minimal kernel space BPF file system has been
implemented, to which BPF maps and programs can be pinned, a process called
object pinning. The BPF system call has therefore been extended with two new
commands which can pin (``BPF_OBJ_PIN``) or retrieve (``BPF_OBJ_GET``) a
previously pinned object.
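
As a user space sketch continuing the earlier ``bpf(2)`` example, the two
commands can be used roughly as follows, assuming a BPF file system mounted at
the conventional ``/sys/fs/bpf`` location (the function names here are only
for illustration):

::

    #include <linux/bpf.h>
    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Pin an object (map or program) referred to by fd under bpffs. */
    static int pin_object(int fd, const char *path)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (uint64_t)(uintptr_t)path;
        attr.bpf_fd   = fd;

        return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
    }

    /* Retrieve a new fd for a previously pinned object, possibly from a
     * different process and long after the original fd was closed. */
    static int get_pinned_object(const char *path)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (uint64_t)(uintptr_t)path;

        return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
    }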
   426  
For instance, tools such as tc make use of this infrastructure for sharing
maps on ingress and egress. The BPF related file system is not a singleton;
it supports multiple mount instances, hard and soft links, and so on.
   430  
   431  Tail Calls
   432  ----------
   433  
   434  .. image:: images/bpf_tailcall.png
   435      :align: center
   436  
Another concept that can be used with BPF is called tail calls. Tail calls can
be seen as a mechanism that allows one BPF program to call another, without
returning to the old program. Such a call has minimal overhead as, unlike
function calls, it is implemented as a long jump, reusing the same stack frame.

Such programs are verified independently of each other, thus for transferring
state, either per-CPU maps as scratch buffers or, in case of tc programs, ``skb``
fields such as the ``cb[]`` area must be used.
   445  
   446  Only programs of the same type can be tail called, and they also need to match
in terms of JIT compilation, thus either only JIT compiled or only interpreted
programs can be invoked, but the two cannot be mixed.
   449  
There are two components involved in carrying out tail calls: the first part
needs to set up a specialized map called a program array (``BPF_MAP_TYPE_PROG_ARRAY``)
that can be populated by user space with key / value pairs, where the values are the
file descriptors of the tail called BPF programs; the second part is the
``bpf_tail_call()`` helper, to which the context, a reference to the program array
and the lookup key are passed. The kernel then inlines this helper call
directly into a specialized BPF instruction. Such a program array is currently
write-only from the user space side.
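
A minimal sketch of the BPF program side could look as follows; the map
definition struct and helper declaration follow the same illustrative
conventions as in the earlier Maps example, and slot ``0`` of the program
array is assumed to be populated by user space:

::

    #include <linux/bpf.h>

    static void (*bpf_tail_call)(void *ctx, void *map, __u32 index) =
        (void *) BPF_FUNC_tail_call;

    struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;
        __u32 pinning;
    };

    /* Program array holding file descriptors of tail call targets,
     * populated from user space. */
    struct bpf_elf_map jmp_table __attribute__((section("maps"), used)) = {
        .type       = BPF_MAP_TYPE_PROG_ARRAY,
        .size_key   = sizeof(__u32),
        .size_value = sizeof(__u32),
        .max_elem   = 8,
    };

    __attribute__((section("prog"), used))
    int xdp_entry(struct xdp_md *ctx)
    {
        /* Jump to the program stored at slot 0, never returning here on
         * success. If the slot is empty, execution falls through. */
        bpf_tail_call(ctx, &jmp_table, 0);

        return XDP_PASS;
    }

    char __license[] __attribute__((section("license"), used)) = "GPL";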
   458  
   459  The kernel looks up the related BPF program from the passed file descriptor
   460  and atomically replaces program pointers at the given map slot. When no map
   461  entry has been found at the provided key, the kernel will just "fall through"
   462  and continue execution of the old program with the instructions following
   463  after the ``bpf_tail_call()``. Tail calls are a powerful utility, for example,
   464  parsing network headers could be structured through tail calls. During runtime,
functionality can be added or replaced atomically, thereby altering the BPF
program's execution behavior.
   467  
   468  .. _bpf_to_bpf_calls:
   469  
   470  BPF to BPF Calls
   471  ----------------
   472  
   473  .. image:: images/bpf_call.png
   474      :align: center
   475  
   476  Aside from BPF helper calls and BPF tail calls, a more recent feature that has
   477  been added to the BPF core infrastructure is BPF to BPF calls. Before this
   478  feature was introduced into the kernel, a typical BPF C program had to declare
any reusable code that, for example, resides in headers as ``always_inline``,
so that when LLVM compiled and generated the BPF object file, all these
functions were inlined and therefore duplicated many times in the resulting
object file, artificially inflating its code size:
   483  
   484    ::
   485  
   486      #include <linux/bpf.h>
   487  
   488      #ifndef __section
   489      # define __section(NAME)                  \
   490         __attribute__((section(NAME), used))
   491      #endif
   492  
   493      #ifndef __inline
   494      # define __inline                         \
   495         inline __attribute__((always_inline))
   496      #endif
   497  
   498      static __inline int foo(void)
   499      {
   500          return XDP_DROP;
   501      }
   502  
   503      __section("prog")
   504      int xdp_drop(struct xdp_md *ctx)
   505      {
   506          return foo();
   507      }
   508  
   509      char __license[] __section("license") = "GPL";
   510  
The main reason why this was necessary was the lack of function call support
in the BPF program loader as well as in the verifier, interpreter and JITs. Starting
with Linux kernel 4.16 and LLVM 6.0 this restriction got lifted and BPF programs
no longer need to use ``always_inline`` everywhere. Thus, the previously shown BPF
example code can be rewritten more naturally as:
   516  
   517    ::
   518  
   519      #include <linux/bpf.h>
   520  
   521      #ifndef __section
   522      # define __section(NAME)                  \
   523         __attribute__((section(NAME), used))
   524      #endif
   525  
   526      static int foo(void)
   527      {
   528          return XDP_DROP;
   529      }
   530  
   531      __section("prog")
   532      int xdp_drop(struct xdp_md *ctx)
   533      {
   534          return foo();
   535      }
   536  
   537      char __license[] __section("license") = "GPL";
   538  
Mainstream BPF JIT compilers like ``x86_64`` and ``arm64`` support BPF to BPF
calls today, with others following in the near future. BPF to BPF calls are an
important performance optimization since they heavily reduce the generated BPF
code size and are therefore friendlier to a CPU's instruction cache.
   543  
The calling convention known from BPF helper functions applies to BPF to BPF
calls just as well, meaning ``r1`` up to ``r5`` are for passing arguments to
the callee and the result is returned in ``r0``. ``r1`` to ``r5`` are scratch
registers whereas ``r6`` to ``r9`` are preserved across calls the usual way. The
maximum nesting depth, that is, the number of allowed call frames, is ``8``.
   549  A caller can pass pointers (e.g. to the caller's stack frame) down to the
   550  callee, but never vice versa.
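
For illustration, the following sketch passes a pointer into the caller's stack
frame down to a callee, which is permitted, while the opposite direction is not:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    /* Callee may dereference a pointer into the caller's stack frame;
     * noinline keeps this a genuine BPF to BPF call at -O2. */
    static __attribute__((noinline)) int first_byte(const char *buf)
    {
        return buf[0];
    }

    __section("prog")
    int xdp_check(struct xdp_md *ctx)
    {
        char scratch[4] = { 1, 2, 3, 4 };

        if (first_byte(scratch) != 1)
            return XDP_DROP;

        return XDP_PASS;
    }

    char __license[] __section("license") = "GPL";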
   551  
BPF to BPF calls are currently incompatible with the use of BPF tail calls,
since the latter requires reusing the current stack setup as-is, whereas
   554  the former adds additional stack frames and thus changes the expected layout
   555  for tail calls.
   556  
   557  BPF JIT compilers emit separate images for each function body and later fix
   558  up the function call addresses in the image in a final JIT pass. This has
   559  proven to require minimal changes to the JITs in that they can treat BPF to
   560  BPF calls as conventional BPF helper calls.
   561  
   562  JIT
   563  ---
   564  
   565  .. image:: images/bpf_jit.png
   566      :align: center
   567  
The 64 bit ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64``
and 32 bit ``arm``, ``x86_32`` architectures all ship with an in-kernel
eBPF JIT compiler. All of them are feature equivalent and the JIT can be enabled
through:
   572  
   573  ::
   574  
   575      # echo 1 > /proc/sys/net/core/bpf_jit_enable
   576  
The 32 bit ``mips``, ``ppc`` and ``sparc`` architectures currently have a cBPF
JIT compiler only. These architectures, as well as all remaining architectures
supported by the Linux kernel which have no BPF JIT compiler at all, need to run
eBPF programs through the in-kernel interpreter.
   581  
   582  In the kernel's source tree, eBPF JIT support can be easily determined through
   583  issuing a grep for ``HAVE_EBPF_JIT``:
   584  
   585  ::
   586  
   587      # git grep HAVE_EBPF_JIT arch/
   588      arch/arm/Kconfig:       select HAVE_EBPF_JIT   if !CPU_ENDIAN_BE32
   589      arch/arm64/Kconfig:     select HAVE_EBPF_JIT
   590      arch/powerpc/Kconfig:   select HAVE_EBPF_JIT   if PPC64
   591      arch/mips/Kconfig:      select HAVE_EBPF_JIT   if (64BIT && !CPU_MICROMIPS)
   592      arch/s390/Kconfig:      select HAVE_EBPF_JIT   if PACK_STACK && HAVE_MARCH_Z196_FEATURES
   593      arch/sparc/Kconfig:     select HAVE_EBPF_JIT   if SPARC64
   594      arch/x86/Kconfig:       select HAVE_EBPF_JIT   if X86_64
   595  
   596  JIT compilers speed up execution of the BPF program significantly since they
   597  reduce the per instruction cost compared to the interpreter. Often instructions
   598  can be mapped 1:1 with native instructions of the underlying architecture. This
   599  also reduces the resulting executable image size and is therefore more
   600  instruction cache friendly to the CPU. In particular in case of CISC instruction
   601  sets such as ``x86``, the JITs are optimized for emitting the shortest possible
   602  opcodes for a given instruction to shrink the total necessary size for the
   603  program translation.
   604  
   605  Hardening
   606  ---------
   607  
BPF locks the entire BPF interpreter image (``struct bpf_prog``) as well
as the JIT compiled image (``struct bpf_binary_header``) in the kernel as
read-only during the program's lifetime in order to protect the code from
potential corruption. Any corruption happening at that point, for example,
due to some kernel bug, will result in a general protection fault and thus
crash the kernel instead of allowing the corruption to happen silently.
   614  
   615  Architectures that support setting the image memory as read-only can be
   616  determined through:
   617  
   618  ::
   619  
   620      $ git grep ARCH_HAS_SET_MEMORY | grep select
   621      arch/arm/Kconfig:    select ARCH_HAS_SET_MEMORY
   622      arch/arm64/Kconfig:  select ARCH_HAS_SET_MEMORY
   623      arch/s390/Kconfig:   select ARCH_HAS_SET_MEMORY
   624      arch/x86/Kconfig:    select ARCH_HAS_SET_MEMORY
   625  
The option ``CONFIG_ARCH_HAS_SET_MEMORY`` is not user-configurable, meaning
that on these architectures the protection is always built in. Other
architectures might follow in the future.
   629  
In case of the ``x86_64`` JIT compiler, the JITing of the indirect jump used
for tail calls is realized through a retpoline when ``CONFIG_RETPOLINE``
has been set, which is the default in most modern Linux distributions at the
time of writing.
   634  
If ``/proc/sys/net/core/bpf_jit_harden`` is set to ``1``, additional
hardening steps for the JIT compilation take effect for unprivileged users.
This effectively trades off a slight amount of performance for a reduced
(potential) attack surface in case of untrusted users operating on the
system. Even with the decrease in program execution speed, this still results
in better performance compared to switching to the interpreter entirely.
   641  
   642  Currently, enabling hardening will blind all user provided 32 bit and 64 bit
   643  constants from the BPF program when it gets JIT compiled in order to prevent
   644  JIT spraying attacks which inject native opcodes as immediate values. This is
   645  problematic as these immediate values reside in executable kernel memory,
   646  therefore a jump that could be triggered from some kernel bug would jump to
   647  the start of the immediate value and then execute these as native instructions.
   648  
   649  JIT constant blinding prevents this due to randomizing the actual instruction,
   650  which means the operation is transformed from an immediate based source operand
   651  to a register based one through rewriting the instruction by splitting the
   652  actual load of the value into two steps: 1) load of a blinded immediate
   653  value ``rnd ^ imm`` into a register, 2) xoring that register with ``rnd``
   654  such that the original ``imm`` immediate then resides in the register and
   655  can be used for the actual operation. The example was provided for a load
   656  operation, but really all generic operations are blinded.
   657  
   658  Example of JITing a program with hardening disabled:
   659  
   660  ::
   661  
   662      # echo 0 > /proc/sys/net/core/bpf_jit_harden
   663  
   664        ffffffffa034f5e9 + <x>:
   665        [...]
   666        39:   mov    $0xa8909090,%eax
   667        3e:   mov    $0xa8909090,%eax
   668        43:   mov    $0xa8ff3148,%eax
   669        48:   mov    $0xa89081b4,%eax
   670        4d:   mov    $0xa8900bb0,%eax
   671        52:   mov    $0xa810e0c1,%eax
   672        57:   mov    $0xa8908eb4,%eax
   673        5c:   mov    $0xa89020b0,%eax
   674        [...]
   675  
   676  The same program gets constant blinded when loaded through BPF
   677  as an unprivileged user in the case hardening is enabled:
   678  
   679  ::
   680  
   681      # echo 1 > /proc/sys/net/core/bpf_jit_harden
   682  
   683        ffffffffa034f1e5 + <x>:
   684        [...]
   685        39:   mov    $0xe1192563,%r10d
   686        3f:   xor    $0x4989b5f3,%r10d
   687        46:   mov    %r10d,%eax
   688        49:   mov    $0xb8296d93,%r10d
   689        4f:   xor    $0x10b9fd03,%r10d
   690        56:   mov    %r10d,%eax
   691        59:   mov    $0x8c381146,%r10d
   692        5f:   xor    $0x24c7200e,%r10d
   693        66:   mov    %r10d,%eax
   694        69:   mov    $0xeb2a830e,%r10d
   695        6f:   xor    $0x43ba02ba,%r10d
   696        76:   mov    %r10d,%eax
   697        79:   mov    $0xd9730af,%r10d
   698        7f:   xor    $0xa5073b1f,%r10d
   699        86:   mov    %r10d,%eax
   700        89:   mov    $0x9a45662b,%r10d
   701        8f:   xor    $0x325586ea,%r10d
   702        96:   mov    %r10d,%eax
   703        [...]
   704  
   705  Both programs are semantically the same, only that none of the
   706  original immediate values are visible anymore in the disassembly of
   707  the second program.
   708  
At the same time, hardening also disables any JIT kallsyms exposure
for privileged users, so that JIT image addresses are no longer
exposed via ``/proc/kallsyms``.
   712  
   713  Moreover, the Linux kernel provides the option ``CONFIG_BPF_JIT_ALWAYS_ON``
   714  which removes the entire BPF interpreter from the kernel and permanently
   715  enables the JIT compiler. This has been developed as part of a mitigation
   716  in the context of Spectre v2 such that when used in a VM-based setting,
   717  the guest kernel is not going to reuse the host kernel's BPF interpreter
   718  when mounting an attack anymore. For container-based environments, the
   719  ``CONFIG_BPF_JIT_ALWAYS_ON`` configuration option is optional, but in
   720  case JITs are enabled there anyway, the interpreter may as well be compiled
out to reduce the kernel's complexity. Thus, it is also generally
recommended for widely used JITs on mainstream architectures
such as ``x86_64`` and ``arm64``.
   724  
   725  Last but not least, the kernel offers an option to disable the use of
   726  the ``bpf(2)`` system call for unprivileged users through the
   727  ``/proc/sys/kernel/unprivileged_bpf_disabled`` sysctl knob. This is
on purpose a one-time kill switch, meaning once set to ``1``, there is
no option to reset it back to ``0`` until the next reboot. When set,
only ``CAP_SYS_ADMIN`` privileged processes out of the initial
namespace are allowed to use the ``bpf(2)`` system call from that
point onwards. Upon start, Cilium sets this knob to ``1`` as well.
   733  
   734  ::
   735  
   736      # echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled
   737  
   738  Offloads
   739  --------
   740  
   741  .. image:: images/bpf_offload.png
   742      :align: center
   743  
Networking BPF programs, in particular for tc and XDP, have an
offload interface to hardware in the kernel in order to execute BPF
code directly on the NIC.
   747  
   748  Currently, the ``nfp`` driver from Netronome has support for offloading
   749  BPF through a JIT compiler which translates BPF instructions to an
   750  instruction set implemented against the NIC. This includes offloading
   751  of BPF maps to the NIC as well, thus the offloaded BPF program can
   752  perform map lookups, updates and deletions.
   753  
   754  Toolchain
   755  =========
   756  
Current user space tooling, introspection facilities and kernel control knobs around
BPF are discussed in this section. Note that the tooling and infrastructure around BPF
is still rapidly evolving, thus this section may not provide a complete picture of all
available tools.
   761  
   762  Development Environment
   763  -----------------------
   764  
   765  A step by step guide for setting up a development environment for BPF can be found
   766  below for both Fedora and Ubuntu. This will guide you through building, installing
   767  and testing a development kernel as well as building and installing iproute2.
   768  
The step of manually building iproute2 and the Linux kernel is usually not necessary
   770  given that major distributions already ship recent enough kernels by default, but
   771  would be needed for testing bleeding edge versions or contributing BPF patches to
   772  iproute2 and to the Linux kernel, respectively. Similarly, for debugging and
   773  introspection purposes building bpftool is optional, but recommended.
   774  
   775  Fedora
   776  ``````
   777  
   778  The following applies to Fedora 25 or later:
   779  
   780  ::
   781  
   782      $ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \
   783        openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static
   784  
   785  .. note:: If you are running some other Fedora derivative and ``dnf`` is missing,
   786            try using ``yum`` instead.
   787  
   788  Ubuntu
   789  ``````
   790  
   791  The following applies to Ubuntu 17.04 or later:
   792  
   793  ::
   794  
   795      $ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \
   796        clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \
   797        graphviz
   798  
   799  openSUSE Tumbleweed
   800  ```````````````````
   801  
   802  The following applies to openSUSE Tumbleweed and openSUSE Leap 15.0 or later:
   803  
   804  ::
   805  
   806     $ sudo  zypper install -y git gcc ncurses-devel libelf-devel bc libopenssl-devel \
   807     libcap-devel clang llvm graphviz bison flex glibc-devel-static
   808  
   809  Compiling the Kernel
   810  ````````````````````
   811  
   812  Development of new BPF features for the Linux kernel happens inside the ``net-next``
   813  git tree, latest BPF fixes in the ``net`` tree. The following command will obtain
   814  the kernel source for the ``net-next`` tree through git:
   815  
   816  ::
   817  
   818      $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git
   819  
   820  If the git commit history is not of interest, then ``--depth 1`` will clone the
   821  tree much faster by truncating the git history only to the most recent commit.
   822  
   823  In case the ``net`` tree is of interest, it can be cloned from this url:
   824  
   825  ::
   826  
   827      $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git
   828  
There are dozens of tutorials on the Internet on how to build Linux kernels. One
good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild),
which can be followed with one of the two git trees mentioned above.
   832  
   833  Make sure that the generated ``.config`` file contains the following ``CONFIG_*``
   834  entries for running BPF. These entries are also needed for Cilium.
   835  
   836  ::
   837  
   838      CONFIG_CGROUP_BPF=y
   839      CONFIG_BPF=y
   840      CONFIG_BPF_SYSCALL=y
   841      CONFIG_NET_SCH_INGRESS=m
   842      CONFIG_NET_CLS_BPF=m
   843      CONFIG_NET_CLS_ACT=y
   844      CONFIG_BPF_JIT=y
   845      CONFIG_LWTUNNEL_BPF=y
   846      CONFIG_HAVE_EBPF_JIT=y
   847      CONFIG_BPF_EVENTS=y
   848      CONFIG_TEST_BPF=m
   849  
Some of the entries cannot be adjusted through ``make menuconfig``. For example,
``CONFIG_HAVE_EBPF_JIT`` is selected automatically if a given architecture comes
with an eBPF JIT. In this specific case, ``CONFIG_HAVE_EBPF_JIT`` is optional
but highly recommended. An architecture without an eBPF JIT compiler will need
to fall back to the in-kernel interpreter, with the cost of executing BPF
instructions less efficiently.
   856  
   857  Verifying the Setup
   858  ```````````````````
   859  
   860  After you have booted into the newly compiled kernel, navigate to the BPF selftest
   861  suite in order to test BPF functionality (current working directory points to
   862  the root of the cloned git tree):
   863  
   864  ::
   865  
   866      $ cd tools/testing/selftests/bpf/
   867      $ make
   868      $ sudo ./test_verifier
   869  
   870  The verifier tests print out all the current checks being performed. The summary
   871  at the end of running all tests will dump information of test successes and
   872  failures:
   873  
   874  ::
   875  
   876      Summary: 847 PASSED, 0 SKIPPED, 0 FAILED
   877  
   878  .. note:: For kernel releases 4.16+ the BPF selftest has a dependency on LLVM 6.0+
   879            caused by the BPF function calls which do not need to be inlined
   880            anymore. See section :ref:`bpf_to_bpf_calls` or the cover letter mail
   881            from the kernel patch (https://lwn.net/Articles/741773/) for more information.
   882            Not every BPF program has a dependency on LLVM 6.0+ if it does not
   883            use this new feature. If your distribution does not provide LLVM 6.0+
   884            you may compile it by following the instruction in the :ref:`tooling_llvm`
   885            section.
   886  
   887  In order to run through all BPF selftests, the following command is needed:
   888  
   889  ::
   890  
   891      $ sudo make run_tests
   892  
   893  If you see any failures, please contact us on Slack with the full test output.
   894  
   895  Compiling iproute2
   896  ``````````````````
   897  
   898  Similar to the ``net`` (fixes only) and ``net-next`` (new features) kernel trees,
   899  the iproute2 git tree has two branches, namely ``master`` and ``net-next``. The
   900  ``master`` branch is based on the ``net`` tree and the ``net-next`` branch is
   901  based against the ``net-next`` kernel tree. This is necessary, so that changes
   902  in header files can be synchronized in the iproute2 tree.
   903  
   904  In order to clone the iproute2 ``master`` branch, the following command can
   905  be used:
   906  
   907  ::
   908  
   909      $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git
   910  
   911  Similarly, to clone into mentioned ``net-next`` branch of iproute2, run the
   912  following:
   913  
   914  ::
   915  
   916      $ git clone -b net-next git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git
   917  
   918  After that, proceed with the build and installation:
   919  
   920  ::
   921  
   922      $ cd iproute2/
   923      $ ./configure --prefix=/usr
   924      TC schedulers
   925       ATM    no
   926  
   927      libc has setns: yes
   928      SELinux support: yes
   929      ELF support: yes
   930      libmnl support: no
   931      Berkeley DB: no
   932  
   933      docs: latex: no
   934       WARNING: no docs can be built from LaTeX files
   935       sgml2html: no
   936       WARNING: no HTML docs can be built from SGML
   937      $ make
   938      [...]
   939      $ sudo make install
   940  
   941  Ensure that the ``configure`` script shows ``ELF support: yes``, so that iproute2
   942  can process ELF files from LLVM's BPF back end. libelf was listed in the instructions
   943  for installing the dependencies in case of Fedora and Ubuntu earlier.
   944  
   945  Compiling bpftool
   946  `````````````````
   947  
   948  bpftool is an essential tool around debugging and introspection of BPF programs
   949  and maps. It is part of the kernel tree and available under ``tools/bpf/bpftool/``.
   950  
   951  Make sure to have cloned either the ``net`` or ``net-next`` kernel tree as described
   952  earlier. In order to build and install bpftool, the following steps are required:
   953  
   954  ::
   955  
   956      $ cd <kernel-tree>/tools/bpf/bpftool/
   957      $ make
   958      Auto-detecting system features:
   959      ...                        libbfd: [ on  ]
   960      ...        disassembler-four-args: [ OFF ]
   961  
   962        CC       xlated_dumper.o
   963        CC       prog.o
   964        CC       common.o
   965        CC       cgroup.o
   966        CC       main.o
   967        CC       json_writer.o
   968        CC       cfg.o
   969        CC       map.o
   970        CC       jit_disasm.o
   971        CC       disasm.o
   972      make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf'
   973  
   974      Auto-detecting system features:
   975      ...                        libelf: [ on  ]
   976      ...                           bpf: [ on  ]
   977  
   978        CC       libbpf.o
   979        CC       bpf.o
   980        CC       nlattr.o
   981        LD       libbpf-in.o
   982        LINK     libbpf.a
   983      make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf'
   984        LINK     bpftool
   985      $ sudo make install
   986  
   987  .. _tooling_llvm:
   988  
   989  LLVM
   990  ----
   991  
   992  LLVM is currently the only compiler suite providing a BPF back end. gcc does
   993  not support BPF at this point.
   994  
   995  The BPF back end was merged into LLVM's 3.7 release. Major distributions enable
   996  the BPF back end by default when they package LLVM, therefore installing clang
   997  and llvm is sufficient on most recent distributions to start compiling C
   998  into BPF object files.
   999  
  1000  The typical workflow is that BPF programs are written in C, compiled by LLVM
  1001  into object / ELF files, which are parsed by user space BPF ELF loaders (such as
  1002  iproute2 or others), and pushed into the kernel through the BPF system call.
  1003  The kernel verifies the BPF instructions and JITs them, returning a new file
  1004  descriptor for the program, which then can be attached to a subsystem (e.g.
  1005  networking). If supported, the subsystem could then further offload the BPF
  1006  program to hardware (e.g. NIC).
  1007  
  1008  For LLVM, BPF target support can be checked, for example, through the following:
  1009  
  1010  ::
  1011  
  1012      $ llc --version
  1013      LLVM (http://llvm.org/):
  1014      LLVM version 3.8.1
  1015      Optimized build.
  1016      Default target: x86_64-unknown-linux-gnu
  1017      Host CPU: skylake
  1018  
  1019      Registered Targets:
  1020        [...]
  1021        bpf        - BPF (host endian)
  1022        bpfeb      - BPF (big endian)
  1023        bpfel      - BPF (little endian)
  1024        [...]
  1025  
  1026  By default, the ``bpf`` target uses the endianness of the CPU it compiles on,
  1027  meaning that if the CPU's endianness is little endian, the program is represented
  1028  in little endian format as well, and if the CPU's endianness is big endian,
  1029  the program is represented in big endian. This also matches the runtime behavior
of BPF, which is generic and uses the endianness of the CPU it runs on in order
not to disadvantage architectures in either of the formats.
  1032  
  1033  For cross-compilation, the two targets ``bpfeb`` and ``bpfel`` were introduced,
so that BPF programs can be compiled on a node running in one endianness
  1035  (e.g. little endian on x86) and run on a node in another endianness format (e.g.
  1036  big endian on arm). Note that the front end (clang) needs to run in the target
  1037  endianness as well.
  1038  
  1039  Using ``bpf`` as a target is the preferred way in situations where no mixture of
  1040  endianness applies. For example, compilation on ``x86_64`` results in the same
  1041  output for the targets ``bpf`` and ``bpfel`` due to being little endian, therefore
  1042  scripts triggering a compilation also do not have to be endian aware.
  1043  
  1044  A minimal, stand-alone XDP drop program might look like the following example
  1045  (``xdp-example.c``):
  1046  
  1047  ::
  1048  
  1049      #include <linux/bpf.h>
  1050  
  1051      #ifndef __section
  1052      # define __section(NAME)                  \
  1053         __attribute__((section(NAME), used))
  1054      #endif
  1055  
  1056      __section("prog")
  1057      int xdp_drop(struct xdp_md *ctx)
  1058      {
  1059          return XDP_DROP;
  1060      }
  1061  
  1062      char __license[] __section("license") = "GPL";
  1063  
  1064  It can then be compiled and loaded into the kernel as follows:
  1065  
  1066  ::
  1067  
  1068      $ clang -O2 -Wall -target bpf -c xdp-example.c -o xdp-example.o
  1069      # ip link set dev em1 xdp obj xdp-example.o
  1070  
  1071  .. note:: Attaching an XDP BPF program to a network device as above requires
  1072            Linux 4.11 with a device that supports XDP, or Linux 4.12 or later.
  1073  
  1074  For the generated object file LLVM (>= 3.9) uses the official BPF machine value,
  1075  that is, ``EM_BPF`` (decimal: ``247`` / hex: ``0xf7``). In this example, the program
  1076  has been compiled with ``bpf`` target under ``x86_64``, therefore ``LSB`` (as opposed
  1077  to ``MSB``) is shown regarding endianness:
  1078  
  1079  ::
  1080  
  1081      $ file xdp-example.o
  1082      xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped
  1083  
  1084  ``readelf -a xdp-example.o`` will dump further information about the ELF file, which can
  1085  sometimes be useful for introspecting generated section headers, relocation entries
  1086  and the symbol table.
  1087  
  1088  In the unlikely case where clang and LLVM need to be compiled from scratch, the
  1089  following commands can be used:
  1090  
  1091  ::
  1092  
  1093      $ git clone http://llvm.org/git/llvm.git
  1094      $ cd llvm/tools
  1095      $ git clone --depth 1 http://llvm.org/git/clang.git
  1096      $ cd ..; mkdir build; cd build
  1097      $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF
  1098      $ make -j $(getconf _NPROCESSORS_ONLN)
  1099  
  1100      $ ./bin/llc --version
  1101      LLVM (http://llvm.org/):
  1102      LLVM version x.y.zsvn
  1103      Optimized build.
  1104      Default target: x86_64-unknown-linux-gnu
  1105      Host CPU: skylake
  1106  
  1107      Registered Targets:
  1108        bpf    - BPF (host endian)
  1109        bpfeb  - BPF (big endian)
  1110        bpfel  - BPF (little endian)
  1111        x86    - 32-bit X86: Pentium-Pro and above
  1112        x86-64 - 64-bit X86: EM64T and AMD64
  1113  
  1114      $ export PATH=$PWD/bin:$PATH   # add to ~/.bashrc
  1115  
Make sure that ``--version`` mentions ``Optimized build.``, otherwise the
compilation time for programs will increase significantly (e.g. by 10x or more)
when LLVM was built in debugging mode.
  1119  
  1120  For debugging, clang can generate the assembler output as follows:
  1121  
  1122  ::
  1123  
  1124      $ clang -O2 -S -Wall -target bpf -c xdp-example.c -o xdp-example.S
  1125      $ cat xdp-example.S
  1126          .text
  1127          .section    prog,"ax",@progbits
  1128          .globl      xdp_drop
  1129          .p2align    3
  1130      xdp_drop:                             # @xdp_drop
  1131      # BB#0:
  1132          r0 = 1
  1133          exit
  1134  
  1135          .section    license,"aw",@progbits
  1136          .globl    __license               # @__license
  1137      __license:
  1138          .asciz    "GPL"
  1139  
Starting from LLVM's release 6.0, there is also assembler parser support. You can
program in BPF assembly directly, then use llvm-mc to assemble it into an
object file. For example, you can assemble the xdp-example.S listed above back
into an object file using:
  1144  
  1145  ::
  1146  
  1147      $ llvm-mc -triple bpf -filetype=obj -o xdp-example.o xdp-example.S
  1148  
Furthermore, more recent LLVM versions (>= 4.0) can also store debugging
information in DWARF format in the object file. This can be done through
the usual workflow by adding ``-g`` to the compilation.
  1152  
  1153  ::
  1154  
  1155      $ clang -O2 -g -Wall -target bpf -c xdp-example.c -o xdp-example.o
  1156      $ llvm-objdump -S -no-show-raw-insn xdp-example.o
  1157  
  1158      xdp-example.o:        file format ELF64-BPF
  1159  
  1160      Disassembly of section prog:
  1161      xdp_drop:
  1162      ; {
  1163          0:        r0 = 1
  1164      ; return XDP_DROP;
  1165          1:        exit
  1166  
  1167  The ``llvm-objdump`` tool can then annotate the assembler output with the
  1168  original C code used in the compilation. The trivial example in this case
  1169  does not contain much C code, however, the line numbers shown as ``0:``
  1170  and ``1:`` correspond directly to the kernel's verifier log.
  1171  
  1172  This means that in case BPF programs get rejected by the verifier, ``llvm-objdump``
  1173  can help to correlate the instructions back to the original C code, which is
  1174  highly useful for analysis.
  1175  
  1176  ::
  1177  
  1178      # ip link set dev em1 xdp obj xdp-example.o verb
  1179  
  1180      Prog section 'prog' loaded (5)!
  1181       - Type:         6
  1182       - Instructions: 2 (0 over limit)
  1183       - License:      GPL
  1184  
  1185      Verifier analysis:
  1186  
  1187      0: (b7) r0 = 1
  1188      1: (95) exit
  1189      processed 2 insns
  1190  
As can be seen in the verifier analysis, the ``llvm-objdump`` output dumps
  1192  the same BPF assembler code as the kernel.
  1193  
  1194  Leaving out the ``-no-show-raw-insn`` option will also dump the raw
  1195  ``struct bpf_insn`` as hex in front of the assembly:
  1196  
  1197  ::
  1198  
  1199      $ llvm-objdump -S xdp-example.o
  1200  
  1201      xdp-example.o:        file format ELF64-BPF
  1202  
  1203      Disassembly of section prog:
  1204      xdp_drop:
  1205      ; {
  1206         0:       b7 00 00 00 01 00 00 00     r0 = 1
    ; return XDP_DROP;
  1208         1:       95 00 00 00 00 00 00 00     exit
  1209  
  1210  For LLVM IR debugging, the compilation process for BPF can be split into
  1211  two steps, generating a binary LLVM IR intermediate file ``xdp-example.bc``, which
  1212  can later on be passed to llc:
  1213  
  1214  ::
  1215  
  1216      $ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
  1217      $ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o
  1218  
  1219  The generated LLVM IR can also be dumped in human readable format through:
  1220  
  1221  ::
  1222  
  1223      $ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -
  1224  
  1225  LLVM is able to attach debug information such as the description of used data
  1226  types in the program to the generated BPF object file. By default this is in
  1227  DWARF format.
  1228  
A heavily simplified version of DWARF used by BPF is called BTF (BPF Type Format).
The resulting DWARF can be converted into BTF, which is later on loaded into the
kernel through BPF object loaders. The kernel will then verify the BTF data
for correctness and keep track of the data types the BTF data contains.
  1233  
  1234  BPF maps can then be annotated with key and value types out of the BTF data
  1235  such that a later dump of the map exports the map data along with the related
  1236  type information. This allows for better introspection, debugging and value
  1237  pretty printing. Note that BTF data is a generic debugging data format and
  1238  as such any DWARF to BTF converted data can be loaded (e.g. kernel's vmlinux
DWARF data could be converted to BTF and loaded). The latter is in particular
useful for BPF tracing in the future.
  1241  
  1242  In order to generate BTF from DWARF debugging information, elfutils (>= 0.173)
  1243  is needed. If that is not available, then adding the ``-mattr=dwarfris`` option
  1244  to the ``llc`` command is required during compilation:
  1245  
  1246  ::
  1247  
  1248      $ llc -march=bpf -mattr=help |& grep dwarfris
  1249        dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
  1250        [...]
  1251  
The reason for using ``-mattr=dwarfris`` is that the flag ``dwarfris`` (``dwarf
  1253  relocation in section``) disables DWARF cross-section relocations between DWARF
  1254  and the ELF's symbol table since libdw does not have proper BPF relocation
  1255  support, and therefore tools like ``pahole`` would otherwise not be able to
  1256  properly dump structures from the object.
  1257  
  1258  elfutils (>= 0.173) implements proper BPF relocation support and therefore
  1259  the same can be achieved without the ``-mattr=dwarfris`` option. Dumping
  1260  the structures from the object file could be done from either DWARF or BTF
  1261  information. ``pahole`` uses the LLVM emitted DWARF information at this
  1262  point, however, future ``pahole`` versions could rely on BTF if available.
  1263  
  1264  For converting DWARF into BTF, a recent pahole version (>= 1.12) is required.
It can also be obtained from its official git repository
  1266  if not available from one of the distribution packages:
  1267  
  1268  ::
  1269  
  1270      $ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
  1271  
  1272  ``pahole`` comes with the option ``-J`` to convert DWARF into BTF from an
  1273  object file. ``pahole`` can be probed for BTF support as follows (note that
  1274  the ``llvm-objcopy`` tool is required for ``pahole`` as well, so check its
  1275  presence, too):
  1276  
  1277  ::
  1278  
  1279      $ pahole --help | grep BTF
  1280      -J, --btf_encode           Encode as BTF
  1281  
  1282  Generating debugging information also requires the front end to generate
  1283  source level debug information by passing ``-g`` to the ``clang`` command
  1284  line. Note that ``-g`` is needed independently of whether ``llc``'s
  1285  ``dwarfris`` option is used. Full example for generating the object file:
  1286  
  1287  ::
  1288  
  1289      $ clang -O2 -g -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
  1290      $ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o
  1291  
Alternatively, clang can be used on its own to build a BPF program with debugging
information (again, the dwarfris flag can be omitted when a recent enough
elfutils version is available):
  1295  
  1296  ::
  1297  
  1298      $ clang -target bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
  1299  
  1300  After successful compilation ``pahole`` can be used to properly dump structures
  1301  of the BPF program based on the DWARF information:
  1302  
  1303  ::
  1304  
  1305      $ pahole xdp-example.o
  1306      struct xdp_md {
  1307              __u32                      data;                 /*     0     4 */
  1308              __u32                      data_end;             /*     4     4 */
  1309              __u32                      data_meta;            /*     8     4 */
  1310  
  1311              /* size: 12, cachelines: 1, members: 3 */
  1312              /* last cacheline: 12 bytes */
  1313      };
  1314  
  1315  Through the option ``-J`` ``pahole`` can eventually generate the BTF from
  1316  DWARF. In the object file DWARF data will still be retained alongside the
  1317  newly added BTF data. Full ``clang`` and ``pahole`` example combined:
  1318  
  1319  ::
  1320  
  1321      $ clang -target bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
  1322      $ pahole -J xdp-example.o
  1323  
The presence of a ``.BTF`` section can be seen through the ``readelf`` tool:
  1325  
  1326  ::
  1327  
  1328      $ readelf -a xdp-example.o
  1329      [...]
  1330        [18] .BTF              PROGBITS         0000000000000000  00000671
  1331      [...]
  1332  
  1333  BPF loaders such as iproute2 will detect and load the BTF section, so that
  1334  BPF maps can be annotated with type information.
  1335  
  1336  LLVM by default uses the BPF base instruction set for generating code
  1337  in order to make sure that the generated object file can also be loaded
  1338  with older kernels such as long-term stable kernels (e.g. 4.9+).
  1339  
  1340  However, LLVM has a ``-mcpu`` selector for the BPF back end in order to
  1341  select different versions of the BPF instruction set, namely instruction
  1342  set extensions on top of the BPF base instruction set in order to generate
  1343  more efficient and smaller code.
  1344  
  1345  Available ``-mcpu`` options can be queried through:
  1346  
  1347  ::
  1348  
  1349      $ llc -march bpf -mcpu=help
  1350      Available CPUs for this target:
  1351  
  1352        generic - Select the generic processor.
  1353        probe   - Select the probe processor.
  1354        v1      - Select the v1 processor.
  1355        v2      - Select the v2 processor.
  1356      [...]
  1357  
  1358  The ``generic`` processor is the default processor, which is also the
  1359  base instruction set ``v1`` of BPF. Options ``v1`` and ``v2`` are typically
  1360  useful in an environment where the BPF program is being cross compiled
  1361  and the target host where the program is loaded differs from the one
  1362  where it is compiled (and thus available BPF kernel features might differ
  1363  as well).
  1364  
The recommended ``-mcpu`` option, which is also used by Cilium internally, is
``-mcpu=probe``. Here, the LLVM BPF back end queries the kernel for the availability
of BPF instruction set extensions and, when found, LLVM will use
them for compiling the BPF program whenever appropriate.
  1369  
  1370  A full command line example with llc's ``-mcpu=probe``:
  1371  
  1372  ::
  1373  
  1374      $ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
  1375      $ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o
  1376  
  1377  Generally, LLVM IR generation is architecture independent. There are
  1378  however a few differences when using ``clang -target bpf`` versus
  1379  leaving ``-target bpf`` out and thus using clang's default target which,
  1380  depending on the underlying architecture, might be ``x86_64``, ``arm64``
  1381  or others.
  1382  
  1383  Quoting from the kernel's ``Documentation/bpf/bpf_devel_QA.txt``:
  1384  
  1385  * BPF programs may recursively include header file(s) with file scope
  1386    inline assembly codes. The default target can handle this well, while
  1387    bpf target may fail if bpf backend assembler does not understand
  1388    these assembly codes, which is true in most cases.
  1389  
  1390  * When compiled without -g, additional elf sections, e.g., ``.eh_frame``
  1391    and ``.rela.eh_frame``, may be present in the object file with default
  1392    target, but not with bpf target.
  1393  
  1394  * The default target may turn a C switch statement into a switch table
  1395    lookup and jump operation. Since the switch table is placed in the
  1396    global read-only section, the bpf program will fail to load.
  1397    The bpf target does not support switch table optimization. The clang
  1398    option ``-fno-jump-tables`` can be used to disable switch table
  1399    generation.
  1400  
  1401  * For clang ``-target bpf``, it is guaranteed that pointer or long /
  1402    unsigned long types will always have a width of 64 bit, no matter
  1403    whether underlying clang binary or default target (or kernel) is
  1404    32 bit. However, when native clang target is used, then it will
  1405    compile these types based on the underlying architecture's
  1406    conventions, meaning in case of 32 bit architecture, pointer or
  1407    long / unsigned long types e.g. in BPF context structure will have
  1408    width of 32 bit while the BPF LLVM back end still operates in 64 bit.
  1409  
  1410  The native target is mostly needed in tracing for the case of walking
  1411  the kernel's ``struct pt_regs`` that maps CPU registers, or other kernel
structures where the CPU's register width matters. In all other cases such
  1413  as networking, the use of ``clang -target bpf`` is the preferred choice.
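
A minimal sketch of such a tracing case, in the style of the kernel's
``samples/bpf/`` programs, could look like the following. It assumes the
``SEC()``, ``PT_REGS_PARM1()`` and ``bpf_trace_printk()`` definitions from the
kernel tree's ``bpf_helpers.h`` (or libbpf's ``bpf_tracing.h``), and it is
compiled with the native clang target so that ``struct pt_regs`` matches the
architecture the program runs on (probe and variable names are illustrative):

::

    #include <linux/ptrace.h>
    #include <uapi/linux/bpf.h>
    #include "bpf_helpers.h"

    /* Sketch only: attached to the entry of do_sys_open(). */
    SEC("kprobe/do_sys_open")
    int trace_open_entry(struct pt_regs *ctx)
    {
        char fmt[] = "do_sys_open(dfd=%ld)\n";
        /* First function argument, taken out of the CPU register snapshot;
         * this is why the native target is needed here. */
        long dfd = PT_REGS_PARM1(ctx);

        bpf_trace_printk(fmt, sizeof(fmt), dfd);
        return 0;
    }

    char _license[] SEC("license") = "GPL";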
  1414  
Also, since release 7.0, LLVM supports 32-bit subregisters and BPF ALU32 instructions.
A new code generation attribute ``alu32`` has been added. When it is
  1417  enabled, LLVM will try to use 32-bit subregisters whenever possible, typically
  1418  when there are operations on 32-bit types. The associated ALU instructions with
  1419  32-bit subregisters will become ALU32 instructions. For example, for the
  1420  following sample code:
  1421  
  1422  ::
  1423  
  1424      $ cat 32-bit-example.c
  1425          void cal(unsigned int *a, unsigned int *b, unsigned int *c)
  1426          {
  1427            unsigned int sum = *a + *b;
  1428            *c = sum;
  1429          }
  1430  
With the default code generation, the assembler output will look like:
  1432  
  1433  ::
  1434  
  1435      $ clang -target bpf -emit-llvm -S 32-bit-example.c
  1436      $ llc -march=bpf 32-bit-example.ll
  1437      $ cat 32-bit-example.s
  1438          cal:
  1439            r1 = *(u32 *)(r1 + 0)
  1440            r2 = *(u32 *)(r2 + 0)
  1441            r2 += r1
  1442            *(u32 *)(r3 + 0) = r2
  1443            exit
  1444  
64-bit registers are used, hence the addition means 64-bit addition. Now, if you
enable the new 32-bit subregister support by specifying ``-mattr=+alu32``, then
the assembler output will look like:
  1448  
  1449  ::
  1450  
  1451      $ llc -march=bpf -mattr=+alu32 32-bit-example.ll
  1452      $ cat 32-bit-example.s
  1453          cal:
  1454            w1 = *(u32 *)(r1 + 0)
  1455            w2 = *(u32 *)(r2 + 0)
  1456            w2 += w1
  1457            *(u32 *)(r3 + 0) = w2
  1458            exit
  1459  
The ``w`` registers, meaning 32-bit subregisters, will be used instead of the
64-bit ``r`` registers.
  1462  
Enabling 32-bit subregisters might help reduce type extension instruction
sequences. It could also help the kernel's eBPF JIT compiler on 32-bit
architectures, where register pairs are used to model the 64-bit eBPF registers
and extra instructions are needed for manipulating the high 32 bits. A read from
a 32-bit subregister is guaranteed to only read the low 32 bits, even though a
write still needs to clear the high 32 bits. Therefore, if the JIT compiler knows
that the definition of a register is only ever read through subregisters, the
instructions for setting the high 32 bits of the destination can be eliminated.
  1471  
  1472  When writing C programs for BPF, there are a couple of pitfalls to be aware
  1473  of, compared to usual application development with C. The following items
  1474  describe some of the differences for the BPF model:
  1475  
  1476  1. **Everything needs to be inlined, there are no function calls (on older
  1477     LLVM versions) or shared library calls available.**
  1478  
   Shared libraries, etc. cannot be used with BPF. However, common library
   code used in BPF programs can be placed into header files and included in
   the main programs. For example, Cilium makes heavy use of this approach (see
   ``bpf/lib/``). It is still possible to include header files, for example, from
   the kernel or other libraries and to reuse their static inline functions or
   macros / definitions.
  1485  
   Unless a recent kernel (4.16+) and LLVM (6.0+) are used, where BPF to BPF
   function calls are supported, LLVM needs to compile and inline the
   entire code into a flat sequence of BPF instructions for a given program
   section. In such a case, best practice is to use an annotation like ``__inline``
   for every library function as shown below. The use of ``always_inline``
   is recommended, since the compiler could still decide to uninline large
   functions that are only annotated as ``inline``.
  1493  
  1494     In case the latter happens, LLVM will generate a relocation entry into
  1495     the ELF file, which BPF ELF loaders such as iproute2 cannot resolve and
  1496     will thus produce an error since only BPF maps are valid relocation entries
  1497     which loaders can process.
  1498  
  1499     ::
  1500  
  1501      #include <linux/bpf.h>
  1502  
  1503      #ifndef __section
  1504      # define __section(NAME)                  \
  1505         __attribute__((section(NAME), used))
  1506      #endif
  1507  
  1508      #ifndef __inline
  1509      # define __inline                         \
  1510         inline __attribute__((always_inline))
  1511      #endif
  1512  
  1513      static __inline int foo(void)
  1514      {
  1515          return XDP_DROP;
  1516      }
  1517  
  1518      __section("prog")
  1519      int xdp_drop(struct xdp_md *ctx)
  1520      {
  1521          return foo();
  1522      }
  1523  
  1524      char __license[] __section("license") = "GPL";
  1525  
  1526  2. **Multiple programs can reside inside a single C file in different sections.**
  1527  
  1528     C programs for BPF make heavy use of section annotations. A C file is
  1529     typically structured into 3 or more sections. BPF ELF loaders use these
  1530     names to extract and prepare the relevant information in order to load
  1531     the programs and maps through the bpf system call. For example, iproute2
   uses ``maps`` and ``license`` as the default section names to find metadata
   needed for map creation and the license for the BPF program, respectively.
   At program creation time the latter is pushed into the kernel as well,
  1535     and enables some of the helper functions which are exposed as GPL only
  1536     in case the program also holds a GPL compatible license, for example
  1537     ``bpf_ktime_get_ns()``, ``bpf_probe_read()`` and others.
  1538  
  1539     The remaining section names are specific for BPF program code, for example,
  1540     the below code has been modified to contain two program sections, ``ingress``
  1541     and ``egress``. The toy example code demonstrates that both can share a map
  1542     and common static inline helpers such as the ``account_data()`` function.
  1543  
  1544     The ``xdp-example.c`` example has been modified to a ``tc-example.c``
  1545     example that can be loaded with tc and attached to a netdevice's ingress
  1546     and egress hook.  It accounts the transferred bytes into a map called
  1547     ``acc_map``, which has two map slots, one for traffic accounted on the
  1548     ingress hook, one on the egress hook.
  1549  
  1550     ::
  1551  
  1552      #include <linux/bpf.h>
  1553      #include <linux/pkt_cls.h>
  1554      #include <stdint.h>
  1555      #include <iproute2/bpf_elf.h>
  1556  
  1557      #ifndef __section
  1558      # define __section(NAME)                  \
  1559         __attribute__((section(NAME), used))
  1560      #endif
  1561  
  1562      #ifndef __inline
  1563      # define __inline                         \
  1564         inline __attribute__((always_inline))
  1565      #endif
  1566  
  1567      #ifndef lock_xadd
  1568      # define lock_xadd(ptr, val)              \
  1569         ((void)__sync_fetch_and_add(ptr, val))
  1570      #endif
  1571  
  1572      #ifndef BPF_FUNC
  1573      # define BPF_FUNC(NAME, ...)              \
  1574         (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
  1575      #endif
  1576  
  1577      static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);
  1578  
  1579      struct bpf_elf_map acc_map __section("maps") = {
  1580          .type           = BPF_MAP_TYPE_ARRAY,
  1581          .size_key       = sizeof(uint32_t),
  1582          .size_value     = sizeof(uint32_t),
  1583          .pinning        = PIN_GLOBAL_NS,
  1584          .max_elem       = 2,
  1585      };
  1586  
  1587      static __inline int account_data(struct __sk_buff *skb, uint32_t dir)
  1588      {
  1589          uint32_t *bytes;
  1590  
  1591          bytes = map_lookup_elem(&acc_map, &dir);
  1592          if (bytes)
  1593                  lock_xadd(bytes, skb->len);
  1594  
  1595          return TC_ACT_OK;
  1596      }
  1597  
  1598      __section("ingress")
  1599      int tc_ingress(struct __sk_buff *skb)
  1600      {
  1601          return account_data(skb, 0);
  1602      }
  1603  
  1604      __section("egress")
  1605      int tc_egress(struct __sk_buff *skb)
  1606      {
  1607          return account_data(skb, 1);
  1608      }
  1609  
  1610      char __license[] __section("license") = "GPL";
  1611  
  1612    The example also demonstrates a couple of other things which are useful
  1613    to be aware of when developing programs. The code includes kernel headers,
  1614    standard C headers and an iproute2 specific header containing the
  1615    definition of ``struct bpf_elf_map``. iproute2 has a common BPF ELF loader
  1616    and as such the definition of ``struct bpf_elf_map`` is the very same for
  1617    XDP and tc typed programs.
  1618  
  1619    A ``struct bpf_elf_map`` entry defines a map in the program and contains
  1620    all relevant information (such as key / value size, etc) needed to generate
  1621    a map which is used from the two BPF programs. The structure must be placed
  1622    into the ``maps`` section, so that the loader can find it. There can be
  1623    multiple map declarations of this type with different variable names, but
  1624    all must be annotated with ``__section("maps")``.
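
  For example, a second map could be declared in the same object file right next
  to ``acc_map``; both must then carry the ``__section("maps")`` annotation. A
  small sketch, where the map name and its parameters are purely illustrative:

  ::

    struct bpf_elf_map proto_count_map __section("maps") = {
        .type           = BPF_MAP_TYPE_HASH,
        .size_key       = sizeof(uint8_t),
        .size_value     = sizeof(uint64_t),
        .pinning        = PIN_GLOBAL_NS,
        .max_elem       = 256,
    };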
  1625  
  1626    The ``struct bpf_elf_map`` is specific to iproute2. Different BPF ELF
  1627    loaders can have different formats, for example, the libbpf in the kernel
  1628    source tree, which is mainly used by ``perf``, has a different specification.
  1629    iproute2 guarantees backwards compatibility for ``struct bpf_elf_map``.
  1630    Cilium follows the iproute2 model.
  1631  
  1632    The example also demonstrates how BPF helper functions are mapped into
  the C code and used. Here, ``map_lookup_elem()`` is defined by
  1634    mapping this function into the ``BPF_FUNC_map_lookup_elem`` enum value
  1635    which is exposed as a helper in ``uapi/linux/bpf.h``. When the program is later
  1636    loaded into the kernel, the verifier checks whether the passed arguments
  1637    are of the expected type and re-points the helper call into a real
  1638    function call. Moreover, ``map_lookup_elem()`` also demonstrates how
  1639    maps can be passed to BPF helper functions. Here, ``&acc_map`` from the
  1640    ``maps`` section is passed as the first argument to ``map_lookup_elem()``.
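
  Further helpers can be declared in the same way through the ``BPF_FUNC()``
  macro from the example, just like ``map_lookup_elem()``. A small sketch which
  maps two more helpers from ``uapi/linux/bpf.h`` onto C declarations (the
  chosen helpers are just examples):

  ::

    static int BPF_FUNC(map_update_elem, void *map, const void *key,
                        const void *value, uint32_t flags);
    static uint64_t BPF_FUNC(ktime_get_ns);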
  1641  
  1642    Since the defined array map is global, the accounting needs to use an
  1643    atomic operation, which is defined as ``lock_xadd()``. LLVM maps
  1644    ``__sync_fetch_and_add()`` as a built-in function to the BPF atomic
  1645    add instruction, that is, ``BPF_STX | BPF_XADD | BPF_W`` for word sizes.
  1646  
  1647    Last but not least, the ``struct bpf_elf_map`` tells that the map is to
  1648    be pinned as ``PIN_GLOBAL_NS``. This means that tc will pin the map
  1649    into the BPF pseudo file system as a node. By default, it will be pinned
  1650    to ``/sys/fs/bpf/tc/globals/acc_map`` for the given example. Due to the
  1651    ``PIN_GLOBAL_NS``, the map will be placed under ``/sys/fs/bpf/tc/globals/``.
  1652    ``globals`` acts as a global namespace that spans across object files.
  1653    If the example used ``PIN_OBJECT_NS``, then tc would create a directory
  1654    that is local to the object file. For example, different C files with
  1655    BPF code could have the same ``acc_map`` definition as above with a
  1656    ``PIN_GLOBAL_NS`` pinning. In that case, the map will be shared among
  1657    BPF programs originating from various object files. ``PIN_NONE`` would
  1658    mean that the map is not placed into the BPF file system as a node,
  1659    and as a result will not be accessible from user space after tc quits. It
  would also mean that tc creates two separate map instances, one for each
  program, since it cannot retrieve a previously pinned map under that
  1662    name. The ``acc_map`` part from the mentioned path is the name of the
  1663    map as specified in the source code.
  1664  
  1665    Thus, upon loading of the ``ingress`` program, tc will find that no such
  1666    map exists in the BPF file system and creates a new one. On success, the
  1667    map will also be pinned, so that when the ``egress`` program is loaded
  1668    through tc, it will find that such map already exists in the BPF file
  1669    system and will reuse that for the ``egress`` program. The loader also
  makes sure that, in case maps exist with the same name, their properties
  1671    (key / value size, etc) match.
  1672  
  Just as tc can retrieve the same map, third party applications can also
  use the ``BPF_OBJ_GET`` command of the bpf system call in order to create
  a new file descriptor pointing to the same map instance, which can then
  be used to lookup / update / delete map elements.
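
  A minimal user space sketch of this, assuming the pinned path
  ``/sys/fs/bpf/tc/globals/acc_map`` from the example above and using the raw
  bpf(2) system call directly (the wrapper names below are illustrative):

  ::

    #include <linux/bpf.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Thin wrappers around the bpf(2) syscall, no libbpf dependency. */
    static int bpf_obj_get(const char *pathname)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (uint64_t)(unsigned long)pathname;
        return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
    }

    static int bpf_map_lookup(int fd, const void *key, void *value)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = (uint64_t)(unsigned long)key;
        attr.value  = (uint64_t)(unsigned long)value;
        return syscall(__NR_bpf, BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
    }

    int main(void)
    {
        uint32_t key, value;
        int fd = bpf_obj_get("/sys/fs/bpf/tc/globals/acc_map");

        if (fd < 0) {
            perror("bpf_obj_get");
            return 1;
        }
        for (key = 0; key < 2; key++) {
            value = 0;
            if (!bpf_map_lookup(fd, &key, &value))
                printf("%s bytes: %u\n", key == 0 ? "ingress" : "egress",
                       value);
        }
        return 0;
    }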
  1677  
  1678    The code can be compiled and loaded via iproute2 as follows:
  1679  
  1680    ::
  1681  
  1682      $ clang -O2 -Wall -target bpf -c tc-example.c -o tc-example.o
  1683  
  1684      # tc qdisc add dev em1 clsact
  1685      # tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress
  1686      # tc filter add dev em1 egress bpf da obj tc-example.o sec egress
  1687  
  1688      # tc filter show dev em1 ingress
  1689      filter protocol all pref 49152 bpf
  1690      filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f
  1691  
  1692      # tc filter show dev em1 egress
  1693      filter protocol all pref 49152 bpf
  1694      filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714
  1695  
  1696      # mount | grep bpf
  1697      sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
  1698      bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700)
  1699  
  1700      # tree /sys/fs/bpf/
  1701      /sys/fs/bpf/
  1702      +-- ip -> /sys/fs/bpf/tc/
  1703      +-- tc
  1704      |   +-- globals
  1705      |       +-- acc_map
  1706      +-- xdp -> /sys/fs/bpf/tc/
  1707  
  1708      4 directories, 1 file
  1709  
  1710    As soon as packets pass the ``em1`` device, counters from the BPF map will
  1711    be increased.
  1712  
  1713  3. **There are no global variables allowed.**
  1714  
  1715    For the reasons already mentioned in point 1, BPF cannot have global variables
  1716    as often used in normal C programs.
  1717  
  1718    However, there is a work-around in that the program can simply use a BPF map
  1719    of type ``BPF_MAP_TYPE_PERCPU_ARRAY`` with just a single slot of arbitrary
  1720    value size. This works, because during execution, BPF programs are guaranteed
  1721    to never get preempted by the kernel and therefore can use the single map entry
  1722    as a scratch buffer for temporary data, for example, to extend beyond the stack
  1723    limitation. This also functions across tail calls, since it has the same
  1724    guarantees with regards to preemption.
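
  A minimal sketch of such a single-slot scratch map, reusing the iproute2-style
  helper macros and the ``map_lookup_elem()`` declaration from the ``tc-example.c``
  above (the ``struct scratch_space`` layout is purely illustrative):

  ::

    /* One value per CPU, used as scratch memory beyond the 512 byte stack. */
    struct scratch_space {
        uint8_t buf[1024];
    };

    struct bpf_elf_map scratch_map __section("maps") = {
        .type           = BPF_MAP_TYPE_PERCPU_ARRAY,
        .size_key       = sizeof(uint32_t),
        .size_value     = sizeof(struct scratch_space),
        .pinning        = PIN_GLOBAL_NS,
        .max_elem       = 1,
    };

    static __inline struct scratch_space *get_scratch(void)
    {
        uint32_t key = 0;

        /* BPF programs are not preempted while running, so the per-CPU
         * slot can safely be used as temporary storage. */
        return map_lookup_elem(&scratch_map, &key);
    }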
  1725  
  1726    Otherwise, for holding state across multiple BPF program runs, normal BPF
  1727    maps can be used.
  1728  
  1729  4. **There are no const strings or arrays allowed.**
  1730  
  Defining ``const`` strings or other arrays in the BPF C program does not work
  for the same reasons as pointed out in points 1 and 3, namely, that relocation
  entries will be generated in the ELF file which will be rejected by loaders due
  to not being part of the ABI towards loaders (loaders also cannot fix up such
  entries as it would require large rewrites of the already compiled BPF sequence).
  1736  
  In the future, LLVM might detect these occurrences and throw an error
  to the user early on.
  1739  
  For helper functions such as ``trace_printk()``, the restriction can be worked around as follows:
  1741  
  1742    ::
  1743  
  1744      static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);
  1745  
  1746      #ifndef printk
  1747      # define printk(fmt, ...)                                      \
  1748          ({                                                         \
  1749              char ____fmt[] = fmt;                                  \
  1750              trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
  1751          })
  1752      #endif
  1753  
  1754    The program can then use the macro naturally like ``printk("skb len:%u\n", skb->len);``.
  1755    The output will then be written to the trace pipe. ``tc exec bpf dbg`` can be
  1756    used to retrieve the messages from there.
  1757  
  1758    The use of the ``trace_printk()`` helper function has a couple of disadvantages
  1759    and thus is not recommended for production usage. Constant strings like the
  1760    ``"skb len:%u\n"`` need to be loaded into the BPF stack each time the helper
  1761    function is called, but also BPF helper functions are limited to a maximum
  1762    of 5 arguments. This leaves room for only 3 additional variables which can be
  1763    passed for dumping.
  1764  
  1765    Therefore, despite being helpful for quick debugging, it is recommended (for networking
  1766    programs) to use the ``skb_event_output()`` or the ``xdp_event_output()`` helper,
  1767    respectively. They allow for passing custom structs from the BPF program to
  1768    the perf event ring buffer along with an optional packet sample. For example,
  1769    Cilium's monitor makes use of these helpers in order to implement a debugging
  1770    framework, notifications for network policy violations, etc. These helpers pass
  1771    the data through a lockless memory mapped per-CPU ``perf`` ring buffer, and
  are thus significantly faster than ``trace_printk()``.
  1773  
  1774  5. **Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().**
  1775  
  1776    Since BPF programs cannot perform any function calls other than those to BPF
  1777    helpers, common library code needs to be implemented as inline functions. In
  addition, LLVM also provides some built-ins that the programs can use for
  1779    constant sizes (here: ``n``) which will then always get inlined:
  1780  
  1781    ::
  1782  
  1783      #ifndef memset
  1784      # define memset(dest, chr, n)   __builtin_memset((dest), (chr), (n))
  1785      #endif
  1786  
  1787      #ifndef memcpy
  1788      # define memcpy(dest, src, n)   __builtin_memcpy((dest), (src), (n))
  1789      #endif
  1790  
  1791      #ifndef memmove
  1792      # define memmove(dest, src, n)  __builtin_memmove((dest), (src), (n))
  1793      #endif
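
  A short sketch of how such a built-in is typically used, assuming the
  ``memset()`` and ``__inline`` macros shown earlier plus ``<stdint.h>``, with an
  illustrative ``struct flow_key`` (the size argument must be a compile-time
  constant):

  ::

    struct flow_key {
        uint32_t src;
        uint32_t dst;
        uint16_t sport;
        uint16_t dport;
    };

    static __inline void flow_key_init(struct flow_key *key)
    {
        /* Constant size, therefore LLVM inlines this into plain stores. */
        memset(key, 0, sizeof(*key));
    }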
  1794  
  1795    The ``memcmp()`` built-in had some corner cases where inlining did not take place
  1796    due to an LLVM issue in the back end, and is therefore not recommended to be
  1797    used until the issue is fixed.
  1798  
  1799  6. **There are no loops available (yet).**
  1800  
  1801    The BPF verifier in the kernel checks that a BPF program does not contain
  loops by performing a depth first search of all possible program paths, in addition
  to other control flow graph validations. The purpose is to make sure that the
  1804    program is always guaranteed to terminate.
  1805  
  1806    A very limited form of looping is available for constant upper loop bounds
  by using the ``#pragma unroll`` directive. Example code that is compiled to BPF:
  1808  
  1809    ::
  1810  
  1811      #pragma unroll
  1812          for (i = 0; i < IPV6_MAX_HEADERS; i++) {
  1813              switch (nh) {
  1814              case NEXTHDR_NONE:
  1815                  return DROP_INVALID_EXTHDR;
  1816              case NEXTHDR_FRAGMENT:
  1817                  return DROP_FRAG_NOSUPPORT;
  1818              case NEXTHDR_HOP:
  1819              case NEXTHDR_ROUTING:
  1820              case NEXTHDR_AUTH:
  1821              case NEXTHDR_DEST:
  1822                  if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0)
  1823                      return DROP_INVALID;
  1824  
  1825                  nh = opthdr.nexthdr;
  1826                  if (nh == NEXTHDR_AUTH)
  1827                      len += ipv6_authlen(&opthdr);
  1828                  else
  1829                      len += ipv6_optlen(&opthdr);
  1830                  break;
  1831              default:
  1832                  *nexthdr = nh;
  1833                  return len;
  1834              }
  1835          }
  1836  
  1837    Another possibility is to use tail calls by calling into the same program
  1838    again and using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map for having a local
  1839    scratch space. While being dynamic, this form of looping however is limited
  1840    to a maximum of 32 iterations.
  1841  
  1842    In the future, BPF may have some native, but limited form of implementing loops.
  1843  
  1844  7. **Partitioning programs with tail calls.**
  1845  
  1846    Tail calls provide the flexibility to atomically alter program behavior during
  1847    runtime by jumping from one BPF program into another. In order to select the
  1848    next program, tail calls make use of program array maps (``BPF_MAP_TYPE_PROG_ARRAY``),
  1849    and pass the map as well as the index to the next program to jump to. There is no
  1850    return to the old program after the jump has been performed, and in case there was
  1851    no program present at the given map index, then execution continues on the original
  1852    program.
  1853  
  1854    For example, this can be used to implement various stages of a parser, where
  1855    such stages could be updated with new parsing features during runtime.
  1856  
  1857    Another use case are event notifications, for example, Cilium can opt in packet
  1858    drop notifications during runtime, where the ``skb_event_output()`` call is
  1859    located inside the tail called program. Thus, during normal operations, the
  1860    fall-through path will always be executed unless a program is added to the
  1861    related map index, where the program then prepares the metadata and triggers
  1862    the event notification to a user space daemon.
  1863  
  1864    Program array maps are quite flexible, enabling also individual actions to
  1865    be implemented for programs located in each map index. For example, the root
  1866    program attached to XDP or tc could perform an initial tail call to index 0
  1867    of the program array map, performing traffic sampling, then jumping to index 1
  1868    of the program array map, where firewalling policy is applied and the packet
  1869    either dropped or further processed in index 2 of the program array map, where
  1870    it is mangled and sent out of an interface again. Jumps in the program array
  1871    map can, of course, be arbitrary. The kernel will eventually execute the
  1872    fall-through path when the maximum tail call limit has been reached.
  1873  
  1874    Minimal example extract of using tail calls:
  1875  
  1876    ::
  1877  
  1878      [...]
  1879  
  1880      #ifndef __stringify
  1881      # define __stringify(X)   #X
  1882      #endif
  1883  
  1884      #ifndef __section
  1885      # define __section(NAME)                  \
  1886         __attribute__((section(NAME), used))
  1887      #endif
  1888  
  1889      #ifndef __section_tail
  1890      # define __section_tail(ID, KEY)          \
  1891         __section(__stringify(ID) "/" __stringify(KEY))
  1892      #endif
  1893  
  1894      #ifndef BPF_FUNC
  1895      # define BPF_FUNC(NAME, ...)              \
  1896         (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
  1897      #endif
  1898  
    #define JMP_MAP_ID   1
  1900  
  1901      static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map,
  1902                           uint32_t index);
  1903  
  1904      struct bpf_elf_map jmp_map __section("maps") = {
  1905          .type           = BPF_MAP_TYPE_PROG_ARRAY,
        .id             = JMP_MAP_ID,
  1907          .size_key       = sizeof(uint32_t),
  1908          .size_value     = sizeof(uint32_t),
  1909          .pinning        = PIN_GLOBAL_NS,
  1910          .max_elem       = 1,
  1911      };
  1912  
  1913      __section_tail(JMP_MAP_ID, 0)
  1914      int looper(struct __sk_buff *skb)
  1915      {
  1916          printk("skb cb: %u\n", skb->cb[0]++);
  1917          tail_call(skb, &jmp_map, 0);
  1918          return TC_ACT_OK;
  1919      }
  1920  
  1921      __section("prog")
  1922      int entry(struct __sk_buff *skb)
  1923      {
  1924          skb->cb[0] = 0;
  1925          tail_call(skb, &jmp_map, 0);
  1926          return TC_ACT_OK;
  1927      }
  1928  
  1929      char __license[] __section("license") = "GPL";
  1930  
  1931    When loading this toy program, tc will create the program array and pin it
  to the BPF file system in the global namespace under ``jmp_map``. The
  BPF ELF loader in iproute2 will also recognize sections that are marked as
  ``__section_tail()``. The provided ``id`` in ``struct bpf_elf_map`` will be
  matched against the id marker in the ``__section_tail()``, that is, ``JMP_MAP_ID``,
  and the program is therefore loaded at the user specified program array map index,
  which is ``0`` in this example. As a result, all provided tail call sections
  1938    will be populated by the iproute2 loader to the corresponding maps. This mechanism
  1939    is not specific to tc, but can be applied with any other BPF program type
  1940    that iproute2 supports (such as XDP, lwt).
  1941  
  The generated ELF contains section headers describing the map id and the
  1943    entry within that map:
  1944  
  1945    ::
  1946  
  1947      $ llvm-objdump -S --no-show-raw-insn prog_array.o | less
  1948      prog_array.o:   file format ELF64-BPF
  1949  
  1950      Disassembly of section 1/0:
  1951      looper:
  1952             0:       r6 = r1
  1953             1:       r2 = *(u32 *)(r6 + 48)
  1954             2:       r1 = r2
  1955             3:       r1 += 1
  1956             4:       *(u32 *)(r6 + 48) = r1
  1957             5:       r1 = 0 ll
  1958             7:       call -1
  1959             8:       r1 = r6
  1960             9:       r2 = 0 ll
  1961            11:       r3 = 0
  1962            12:       call 12
  1963            13:       r0 = 0
  1964            14:       exit
  1965      Disassembly of section prog:
  1966      entry:
  1967             0:       r2 = 0
  1968             1:       *(u32 *)(r1 + 48) = r2
  1969             2:       r2 = 0 ll
  1970             4:       r3 = 0
  1971             5:       call 12
  1972             6:       r0 = 0
           7:       exit
  1974  
  1975    In this case, the ``section 1/0`` indicates that the ``looper()`` function
  1976    resides in the map id ``1`` at position ``0``.
  1977  
  The pinned map can be retrieved by user space applications (e.g. the Cilium daemon),
  but also by tc itself in order to update the map with new programs. Updates
  happen atomically, and the initial entry programs that are triggered first from the
  1981    various subsystems are also updated atomically.
  1982  
  1983    Example for tc to perform tail call map updates:
  1984  
  1985    ::
  1986  
  1987      # tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec foo
  1988  
  If iproute2 needs to update the pinned program array, the ``graft`` command
  1990    can be used. By pointing it to ``globals/jmp_map``, tc will update the
  1991    map at index / key ``0`` with a new program residing in the object file ``new.o``
  1992    under section ``foo``.
  1993  
  1994  8. **Limited stack space of maximum 512 bytes.**
  1995  
  1996    Stack space in BPF programs is limited to only 512 bytes, which needs to be
  1997    taken into careful consideration when implementing BPF programs in C. However,
  1998    as mentioned earlier in point 3, a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map with a
  1999    single entry can be used in order to enlarge scratch buffer space.
  2000  
  2001  9. **Use of BPF inline assembly possible.**
  2002  
  2003    LLVM 6.0 or later allows use of inline assembly for BPF for the rare cases where it
  2004    might be needed. The following (nonsense) toy example shows a 64 bit atomic
  2005    add. Due to lack of documentation, LLVM source code in ``lib/Target/BPF/BPFInstrInfo.td``
  2006    as well as ``test/CodeGen/BPF/`` might be helpful for providing some additional
  2007    examples. Test code:
  2008  
  2009    ::
  2010  
  2011      #include <linux/bpf.h>
  2012  
  2013      #ifndef __section
  2014      # define __section(NAME)                  \
  2015         __attribute__((section(NAME), used))
  2016      #endif
  2017  
  2018      __section("prog")
  2019      int xdp_test(struct xdp_md *ctx)
  2020      {
  2021          __u64 a = 2, b = 3, *c = &a;
  2022          /* just a toy xadd example to show the syntax */
  2023          asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c));
  2024          return a;
  2025      }
  2026  
  2027      char __license[] __section("license") = "GPL";
  2028  
  2029    The above program is compiled into the following sequence of BPF
  2030    instructions:
  2031  
  2032    ::
  2033  
  2034      Verifier analysis:
  2035  
  2036      0: (b7) r1 = 2
  2037      1: (7b) *(u64 *)(r10 -8) = r1
  2038      2: (b7) r1 = 3
  2039      3: (bf) r2 = r10
  2040      4: (07) r2 += -8
  2041      5: (db) lock *(u64 *)(r2 +0) += r1
  2042      6: (79) r0 = *(u64 *)(r10 -8)
  2043      7: (95) exit
  2044      processed 8 insns (limit 131072), stack depth 8
  2045  
10. **Removing struct padding by aligning members with #pragma pack.**
  2047  
  In modern compilers, data structures are aligned by default to access memory
  efficiently. Structure members are aligned to memory addresses that are multiples
  of their size, and padding is added for proper alignment. Because of this, the
  size of a struct may often grow larger than expected.
  2052  
  2053    ::
  2054  
  2055      struct called_info {
  2056          u64 start;  // 8-byte
  2057          u64 end;    // 8-byte
  2058          u32 sector; // 4-byte
  2059      }; // size of 20-byte ?
  2060  
  2061      printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte
  2062  
  2063      // Actual compiled composition of struct called_info
  2064      // 0x0(0)                   0x8(8)
  2065      //  ↓________________________↓
  2066      //  |        start (8)       |
  2067      //  |________________________|
  2068      //  |         end  (8)       |
  2069      //  |________________________|
  2070      //  |  sector(4) |  PADDING  | <= address aligned to 8
  2071      //  |____________|___________|     with 4-byte PADDING.
  2072  
  The BPF verifier in the kernel checks the stack boundary so that a BPF program
  does not access memory outside of the boundary or an uninitialized stack area.
  Using a struct with padding as a map value will cause an ``invalid indirect read
  from stack`` failure on ``bpf_prog_load()``.
  2077  
  2078    Example code:
  2079  
  2080    ::
  2081  
  2082      struct called_info {
  2083          u64 start;
  2084          u64 end;
  2085          u32 sector;
  2086      };
  2087  
  2088      struct bpf_map_def SEC("maps") called_info_map = {
  2089          .type = BPF_MAP_TYPE_HASH,
  2090          .key_size = sizeof(long),
  2091          .value_size = sizeof(struct called_info),
  2092          .max_entries = 4096,
  2093      };
  2094  
  2095      SEC("kprobe/submit_bio")
  2096      int submit_bio_entry(struct pt_regs *ctx)
  2097      {
  2098          char fmt[] = "submit_bio(bio=0x%lx) called: %llu\n";
  2099          u64 start_time = bpf_ktime_get_ns();
  2100          long bio_ptr = PT_REGS_PARM1(ctx);
  2101          struct called_info called_info = {
  2102                  .start = start_time,
  2103                  .end = 0,
                .sector = 0
  2105          };
  2106  
  2107          bpf_map_update_elem(&called_info_map, &bio_ptr, &called_info, BPF_ANY);
  2108          bpf_trace_printk(fmt, sizeof(fmt), bio_ptr, start_time);
  2109          return 0;
  2110      }
  2111  
  2112      // On bpf_load_program
  2113      bpf_load_program() err=13
  2114      0: (bf) r6 = r1
  2115      ...
  2116      19: (b7) r1 = 0
  2117      20: (7b) *(u64 *)(r10 -72) = r1
  2118      21: (7b) *(u64 *)(r10 -80) = r7
  2119      22: (63) *(u32 *)(r10 -64) = r1
  2120      ...
  2121      30: (85) call bpf_map_update_elem#2
  2122      invalid indirect read from stack off -80+20 size 24
  2123  
  At ``bpf_prog_load()``, the eBPF verifier ``bpf_check()`` is called, and it
  checks the stack boundary by calling ``check_func_arg() -> check_stack_boundary()``.
  The error above shows that ``struct called_info`` is compiled to a 24-byte size,
  and the message says that reading data from +20 is an invalid indirect read.
  As discussed earlier, the address 0x14(20) is the place where the padding is.
  2129  
  2130    ::
  2131  
  2132      // Actual compiled composition of struct called_info
  2133      // 0x10(16)    0x14(20)    0x18(24)
  2134      //  ↓____________↓___________↓
  2135      //  |  sector(4) |  PADDING  | <= address aligned to 8
  2136      //  |____________|___________|     with 4-byte PADDING.
  2137  
  Internally, ``check_stack_boundary()`` loops through all ``access_size`` (24)
  bytes from the start pointer to make sure that they are within the stack boundary
  and that all elements of the stack are initialized. Since the padding isn't
  supposed to be used, it gets the 'invalid indirect read from stack' failure. To
  avoid this kind of failure, removing the padding from the struct is necessary.
  2143  
  2144    Removing the padding by using ``#pragma pack(n)`` directive:
  2145  
  2146    ::
  2147  
  2148      #pragma pack(4)
  2149      struct called_info {
  2150          u64 start;  // 8-byte
  2151          u64 end;    // 8-byte
  2152          u32 sector; // 4-byte
  2153      }; // size of 20-byte ?
  2154  
  2155      printf("size of %d-byte\n", sizeof(struct called_info)); // size of 20-byte
  2156  
  2157      // Actual compiled composition of packed struct called_info
  2158      // 0x0(0)                   0x8(8)
  2159      //  ↓________________________↓
  2160      //  |        start (8)       |
  2161      //  |________________________|
  2162      //  |         end  (8)       |
  2163      //  |________________________|
  2164      //  |  sector(4) |             <= address aligned to 4
  2165      //  |____________|                 with no PADDING.
  2166  
  By placing ``#pragma pack(4)`` before ``struct called_info``, the compiler will align
  members of the struct to the lesser of 4 bytes and their natural alignment. As you can
  see, the size of ``struct called_info`` has been shrunk to 20 bytes and the padding
  no longer exists.

  However, removing the padding has downsides, too. For example, the compiler will
  generate less optimized code. Since the padding has been removed, processors will
  perform unaligned accesses to the structure and this might lead to performance
  degradation. Also, unaligned access might get rejected by the verifier on some
  architectures.
  2176  
  However, there is a way to avoid the downsides of a packed structure. Simply adding
  an explicit padding member ``u32 pad`` at the end resolves the same problem without
  packing the structure.
  2180  
  2181    ::
  2182  
  2183      struct called_info {
  2184          u64 start;  // 8-byte
  2185          u64 end;    // 8-byte
  2186          u32 sector; // 4-byte
  2187          u32 pad;    // 4-byte
  2188      }; // size of 24-byte ?
  2189  
  2190      printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte
  2191  
  2192      // Actual compiled composition of struct called_info with explicit padding
  2193      // 0x0(0)                   0x8(8)
  2194      //  ↓________________________↓
  2195      //  |        start (8)       |
  2196      //  |________________________|
  2197      //  |         end  (8)       |
  2198      //  |________________________|
  2199      //  |  sector(4) |  pad (4)  | <= address aligned to 8
  2200      //  |____________|___________|     with explicit PADDING.
  2201  
  2202  11. **Accessing packet data via invalidated references**
  2203  
  Some networking BPF helper functions such as ``bpf_skb_store_bytes`` might
  change the size of the packet data. As the verifier is not able to track such
  changes, any a priori reference to the data will be invalidated by the verifier.
  Therefore, the reference needs to be updated before accessing the data to
  avoid the verifier rejecting the program.
  2209  
  2210    To illustrate this, consider the following snippet:
  2211  
  2212    ::
  2213  
    struct iphdr *ip4 = (struct iphdr *)((void *)(long)skb->data + ETH_HLEN);
  2215  
  2216      skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);
  2217  
  2218      if (ip4->protocol == IPPROTO_TCP) {
  2219          // do something
  2220      }
  2221  
  The verifier will reject the snippet due to the dereference of the invalidated
  2223    ``ip4->protocol``:
  2224  
  2225    ::
  2226  
  2227        R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=34,r=34,imm=0) R3=inv0
  2228        R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  2229        R8=inv4294967162 R9=pkt(id=0,off=0,r=34,imm=0) R10=fp0,call_-1
  2230        ...
  2231        18: (85) call bpf_skb_store_bytes#9
  2232        19: (7b) *(u64 *)(r10 -56) = r7
  2233        R0=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=2,var_off=(0x0; 0x3))
  2234        R8=inv4294967162 R9=inv(id=0) R10=fp0,call_-1 fp-48=mmmm???? fp-56=mmmmmmmm
  2235        21: (61) r1 = *(u32 *)(r9 +23)
  2236        R9 invalid mem access 'inv'
  2237  
  2238    To fix this, the reference to ``ip4`` has to be updated:
  2239  
  2240    ::
  2241  
    struct iphdr *ip4 = (struct iphdr *)((void *)(long)skb->data + ETH_HLEN);
  2243  
  2244      skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);
  2245  
    ip4 = (struct iphdr *)((void *)(long)skb->data + ETH_HLEN);
  2247  
  2248      if (ip4->protocol == IPPROTO_TCP) {
  2249          // do something
  2250      }
  2251  
  2252  iproute2
  2253  --------
  2254  
  2255  There are various front ends for loading BPF programs into the kernel such as bcc,
  2256  perf, iproute2 and others. The Linux kernel source tree also provides a user space
  2257  library under ``tools/lib/bpf/``, which is mainly used and driven by perf for
  2258  loading BPF tracing programs into the kernel. However, the library itself is
  2259  generic and not limited to perf only. bcc is a toolkit providing many useful
  2260  BPF programs mainly for tracing that are loaded ad-hoc through a Python interface
  2261  embedding the BPF C code. Syntax and semantics for implementing BPF programs
  2262  slightly differ among front ends in general, though. Additionally, there are also
  2263  BPF samples in the kernel source tree (``samples/bpf/``) which parse the generated
  2264  object files and load the code directly through the system call interface.
  2265  
  2266  This and previous sections mainly focus on the iproute2 suite's BPF front end for
  2267  loading networking programs of XDP, tc or lwt type, since Cilium's programs are
implemented against this BPF loader. In the future, Cilium will be equipped with a
native BPF loader, but programs will still be compatible with being loaded through
the iproute2 suite in order to facilitate development and debugging.
  2271  
  2272  All BPF program types supported by iproute2 share the same BPF loader logic
  2273  due to having a common loader back end implemented as a library (``lib/bpf.c``
  2274  in iproute2 source tree).
  2275  
  2276  The previous section on LLVM also covered some iproute2 parts related to writing
  2277  BPF C programs, and later sections in this document are related to tc and XDP
  2278  specific aspects when writing programs. Therefore, this section will rather focus
  2279  on usage examples for loading object files with iproute2 as well as some of the
generic mechanics of the loader. It does not try to provide complete coverage
of all details, but enough to get started.
  2282  
  2283  **1. Loading of XDP BPF object files.**
  2284  
  Given a BPF object file ``prog.o`` has been compiled for XDP, it can be loaded
  through ``ip`` to an XDP-supported netdevice called ``em1`` with the following
  command:
  2288  
  2289    ::
  2290  
  2291      # ip link set dev em1 xdp obj prog.o
  2292  
  The above command assumes that the program code resides in the default section
  which is called ``prog`` in the XDP case. Should this not be the case, and the
  section is named differently, for example, ``foobar``, then the program needs
  to be loaded as:
  2297  
  2298    ::
  2299  
  2300      # ip link set dev em1 xdp obj prog.o sec foobar
  2301  
  2302    Note that it is also possible to load the program out of the ``.text`` section.
  2303    Changing the minimal, stand-alone XDP drop program by removing the ``__section()``
  2304    annotation from the ``xdp_drop`` entry point would look like the following:
  2305  
  2306    ::
  2307  
  2308      #include <linux/bpf.h>
  2309  
  2310      #ifndef __section
  2311      # define __section(NAME)                  \
  2312         __attribute__((section(NAME), used))
  2313      #endif
  2314  
  2315      int xdp_drop(struct xdp_md *ctx)
  2316      {
  2317          return XDP_DROP;
  2318      }
  2319  
  2320      char __license[] __section("license") = "GPL";
  2321  
  The program can then be loaded as follows:
  2323  
  2324    ::
  2325  
  2326      # ip link set dev em1 xdp obj prog.o sec .text
  2327  
  By default, ``ip`` will throw an error in case an XDP program is already attached
  to the networking interface, to prevent it from being overridden by accident. In
  order to replace the currently running XDP program with a new one, the ``-force``
  option must be used:
  2332  
  2333    ::
  2334  
  2335      # ip -force link set dev em1 xdp obj prog.o
  2336  
  Most XDP-enabled drivers today support an atomic replacement of the existing
  program with a new one without traffic interruption. For performance reasons,
  there is always only a single program attached to an XDP-enabled interface,
  hence a chain of programs is not supported. However, as described in the
  previous section, partitioning of programs can be performed through tail
  calls to achieve a similar use case when necessary.
  2343  
  The ``ip link`` command will display an ``xdp`` flag if the interface has an XDP
  program attached. ``ip link | grep xdp`` can thus be used to find all interfaces
  that have XDP running. Further introspection facilities are provided through
  the detailed view with ``ip -d link``, and ``bpftool`` can be used to retrieve
  information about the attached program based on the BPF program ID shown in
  the ``ip link`` dump.
  2350  
  2351    In order to remove the existing XDP program from the interface, the following
  2352    command must be issued:
  2353  
  2354    ::
  2355  
  2356      # ip link set dev em1 xdp off
  2357  
  When switching a driver's operation mode from non-XDP to native XDP or vice
  versa, the driver typically needs to reconfigure its receive (and transmit)
  rings in order to ensure received packets are set up linearly within a single
  page for BPF to read and write into. Once completed, however, most drivers
  only need to perform an atomic replacement of the program itself when a BPF
  program is requested to be swapped.
  2364  
  2365    In total, XDP supports three operation modes which iproute2 implements as well:
  2366    ``xdpdrv``, ``xdpoffload`` and ``xdpgeneric``.
  2367  
  ``xdpdrv`` stands for native XDP, meaning the BPF program is run directly in
  the driver's receive path at the earliest possible point in software. This is
  the normal / conventional XDP mode and requires drivers to implement XDP
  support, which all major 10G/40G/+ networking drivers in the upstream Linux
  kernel already provide.
  2373  
  ``xdpgeneric`` stands for generic XDP and is intended as an experimental test
  bed for drivers which do not yet support native XDP. Given the generic XDP hook
  in the ingress path comes at a much later point in time, when the packet has
  already entered the stack's main receive path as a ``skb``, the performance is
  significantly lower than with processing in ``xdpdrv`` mode. ``xdpgeneric`` is
  therefore mostly of interest for experimentation rather than for production
  environments.
  2380  
  Last but not least, the ``xdpoffload`` mode is implemented by SmartNICs such
  as those supported by Netronome's nfp driver and allows for offloading the
  entire BPF/XDP program into hardware, so that the program is run directly on
  the card on each packet reception. This provides even higher performance than
  running in native XDP, although not all BPF map types or BPF helper functions
  are available for use compared to native XDP. The BPF verifier will reject the
  program in such a case and report what is unsupported to the user. Other than
  staying in the realm of supported BPF features and helper functions, no special
  precautions have to be taken when writing BPF C programs.
  2390  
  When a command like ``ip link set dev em1 xdp obj [...]`` is used, the kernel
  will first attempt to load the program as native XDP and, in case the driver
  does not support native XDP, automatically fall back to generic XDP. Thus, by
  explicitly using ``xdpdrv`` instead of ``xdp``, the kernel will only attempt
  to load the program as native XDP and fail in case the driver does not support
  it, which guarantees that generic XDP is avoided altogether.
  2398  
  2399    Example for enforcing a BPF/XDP program to be loaded in native XDP mode,
  2400    dumping the link details and unloading the program again:
  2401  
  2402    ::
  2403  
  2404       # ip -force link set dev em1 xdpdrv obj prog.o
  2405       # ip link show
  2406       [...]
  2407       6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000
  2408           link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  2409           prog/xdp id 1 tag 57cd311f2e27366b
  2410       [...]
  2411       # ip link set dev em1 xdpdrv off
  2412  
  2413    Same example now for forcing generic XDP, even if the driver would support
  2414    native XDP, and additionally dumping the BPF instructions of the attached
  2415    dummy program through bpftool:
  2416  
  2417    ::
  2418  
  2419      # ip -force link set dev em1 xdpgeneric obj prog.o
  2420      # ip link show
  2421      [...]
  2422      6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000
  2423          link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  2424          prog/xdp id 4 tag 57cd311f2e27366b                <-- BPF program ID 4
  2425      [...]
  2426      # bpftool prog dump xlated id 4                       <-- Dump of instructions running on em1
  2427      0: (b7) r0 = 1
  2428      1: (95) exit
  2429      # ip link set dev em1 xdpgeneric off
  2430  
  2431    And last but not least offloaded XDP, where we additionally dump program
  2432    information via bpftool for retrieving general metadata:
  2433  
  2434    ::
  2435  
  2436       # ip -force link set dev em1 xdpoffload obj prog.o
  2437       # ip link show
  2438       [...]
  2439       6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
  2440           link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  2441           prog/xdp id 8 tag 57cd311f2e27366b
  2442       [...]
  2443       # bpftool prog show id 8
  2444       8: xdp  tag 57cd311f2e27366b dev em1                  <-- Also indicates a BPF program offloaded to em1
  2445           loaded_at Apr 11/20:38  uid 0
  2446           xlated 16B  not jited  memlock 4096B
  2447       # ip link set dev em1 xdpoffload off
  2448  
  Note that it is not possible to use ``xdpdrv`` and ``xdpgeneric`` or other
  modes at the same time, meaning only one of the XDP operation modes can be
  in use.

  A switch between different XDP modes, e.g. from generic to native or vice
  versa, is not possible atomically. Only switching programs within a specific
  operation mode is:
  2456  
  2457    ::
  2458  
  2459       # ip -force link set dev em1 xdpgeneric obj prog.o
  2460       # ip -force link set dev em1 xdpoffload obj prog.o
  2461       RTNETLINK answers: File exists
  2462       # ip -force link set dev em1 xdpdrv obj prog.o
  2463       RTNETLINK answers: File exists
  2464       # ip -force link set dev em1 xdpgeneric obj prog.o    <-- Succeeds due to xdpgeneric
  2465       #
  2466  
  Switching between modes requires leaving the current operation mode first
  before entering the new one:
  2469  
  2470    ::
  2471  
  2472       # ip -force link set dev em1 xdpgeneric obj prog.o
  2473       # ip -force link set dev em1 xdpgeneric off
  2474       # ip -force link set dev em1 xdpoffload obj prog.o
  2475       # ip l
  2476       [...]
  2477       6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
  2478           link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  2479           prog/xdp id 17 tag 57cd311f2e27366b
  2480       [...]
  2481       # ip -force link set dev em1 xdpoffload off
  2482  
  2483  **2. Loading of tc BPF object files.**
  2484  
  Given a BPF object file ``prog.o`` has been compiled for tc, it can be loaded
  through the tc command to a netdevice. Unlike XDP, there is no driver dependency
  for attaching BPF programs to the device. Here, the netdevice is called ``em1``,
  and with the following command the program can be attached to the networking
  ``ingress`` path of ``em1``:
  2490  
  2491    ::
  2492  
  2493      # tc qdisc add dev em1 clsact
  2494      # tc filter add dev em1 ingress bpf da obj prog.o
  2495  
  The first step is to set up a ``clsact`` qdisc (Linux queueing discipline).
  ``clsact`` is a dummy qdisc similar to the ``ingress`` qdisc, which can only
  hold classifiers and actions, but does not perform actual queueing. It is
  needed in order to attach the ``bpf`` classifier. The ``clsact`` qdisc
  provides two special hooks called ``ingress`` and ``egress``, to which the
  classifier can be attached. Both the ``ingress`` and ``egress`` hooks are
  located at central receive and transmit locations in the networking data
  path, where every packet on the device passes through. The ``ingress`` hook
  is called from ``__netif_receive_skb_core() -> sch_handle_ingress()`` in the
  kernel and the ``egress`` hook from ``__dev_queue_xmit() -> sch_handle_egress()``.
  2505  
  2506    The equivalent for attaching the program to the ``egress`` hook looks as follows:
  2507  
  2508    ::
  2509  
  2510      # tc filter add dev em1 egress bpf da obj prog.o
  2511  
  The ``clsact`` qdisc is processed locklessly from the ``ingress`` and ``egress``
  directions and can also be attached to virtual, queue-less devices such as
  ``veth`` devices connecting containers.
  2515  
  Next to the hook, the ``tc filter`` command selects ``bpf`` to be used in ``da``
  (direct-action) mode. ``da`` mode is recommended and should always be specified.
  It basically means that the ``bpf`` classifier does not need to call into external
  tc action modules, which are not necessary for ``bpf`` anyway, since all packet
  mangling, forwarding or other kinds of actions can already be performed inside
  the single BPF program which is to be attached, making this mode significantly
  faster.
  2523  
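  As a minimal sketch (with an arbitrary program name, placed into the default
  ``classifier`` section used by tc so that no ``sec`` option is needed), such
  a ``da`` mode program simply returns the tc verdict itself:

  ::

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    __section("classifier")
    int cls_main(struct __sk_buff *skb)
    {
        /* In direct-action mode, the return value is the tc verdict. */
        return TC_ACT_OK;
    }

    char __license[] __section("license") = "GPL";
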
  At this point, the program has been attached and is executed whenever packets
  traverse the device. As with XDP, should the default section name not be used,
  then it can be specified during load, for example, in case of section ``foobar``:
  2527  
  2528    ::
  2529  
  2530      # tc filter add dev em1 egress bpf da obj prog.o sec foobar
  2531  
  2532    iproute2's BPF loader allows for using the same command line syntax across
  2533    program types, hence the ``obj prog.o sec foobar`` is the same syntax as with
  2534    XDP mentioned earlier.
  2535  
  2536    The attached programs can be listed through the following commands:
  2537  
  2538    ::
  2539  
  2540      # tc filter show dev em1 ingress
  2541      filter protocol all pref 49152 bpf
  2542      filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f
  2543  
  2544      # tc filter show dev em1 egress
  2545      filter protocol all pref 49152 bpf
  2546      filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714
  2547  
  The output of ``prog.o:[ingress]`` indicates that the program section ``ingress``
  was loaded from the file ``prog.o``, and that ``bpf`` operates in ``direct-action``
  mode. The program ``id`` and ``tag`` are appended in each case, where the latter
  denotes a hash over the instruction stream which can be correlated with the object
  file or ``perf`` reports with stack traces, etc. Last but not least, the ``id``
  represents the system-wide unique BPF program identifier that can be used along
  with ``bpftool`` to further inspect or dump the attached BPF program.
  2555  
  tc can attach more than just a single BPF program; it also provides various
  other classifiers which can be chained together. However, attaching a single
  BPF program is fully sufficient since all packet operations can be contained
  in the program itself thanks to ``da`` (``direct-action``) mode, meaning the
  BPF program itself will already return the tc action verdict such as
  ``TC_ACT_OK``, ``TC_ACT_SHOT`` and others. For optimal performance and
  flexibility, this is the recommended usage.
  2562  
  2563    In the above ``show`` command, tc also displays ``pref 49152`` and
  2564    ``handle 0x1`` next to the BPF related output. Both are auto-generated in
  2565    case they are not explicitly provided through the command line. ``pref``
  2566    denotes a priority number, which means that in case multiple classifiers are
  2567    attached, they will be executed based on ascending priority, and ``handle``
  2568    represents an identifier in case multiple instances of the same classifier have
  2569    been loaded under the same ``pref``. Since in case of BPF, a single program is
  2570    fully sufficient, ``pref`` and ``handle`` can typically be ignored.
  2571  
  Only when it is planned to atomically replace the attached BPF programs is it
  recommended to explicitly specify ``pref`` and ``handle`` on initial load, so
  that they do not have to be queried at a later point in time for the
  ``replace`` operation. Thus, creation becomes:
  2576  
  2577    ::
  2578  
  2579      # tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar
  2580  
  2581      # tc filter show dev em1 ingress
  2582      filter protocol all pref 1 bpf
  2583      filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396f
  2584  
  2585    And for the atomic replacement, the following can be issued for updating the
  2586    existing program at ``ingress`` hook with the new BPF program from the file
  2587    ``prog.o`` in section ``foobar``:
  2588  
  2589    ::
  2590  
  2591      # tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar
  2592  
  Last but not least, in order to remove all attached programs from the ``ingress``
  and ``egress`` hooks respectively, the following can be used:
  2595  
  2596    ::
  2597  
  2598      # tc filter del dev em1 ingress
  2599      # tc filter del dev em1 egress
  2600  
  For removing the entire ``clsact`` qdisc from the netdevice, which implicitly also
  removes all attached programs from the ``ingress`` and ``egress`` hooks, the
  following command can be used:
  2604  
  2605    ::
  2606  
  2607      # tc qdisc del dev em1 clsact
  2608  
  tc BPF programs can also be offloaded if the NIC and driver have support for
  it, similar to XDP BPF programs. NICs supported by Netronome's nfp driver
  offer both types of BPF offload.
  2612  
  2613    ::
  2614  
  2615      # tc qdisc add dev em1 clsact
  2616      # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
  2617      Error: TC offload is disabled on net device.
  2618      We have an error talking to the kernel
  2619  
  2620    If the above error is shown, then tc hardware offload first needs to be enabled
  2621    for the device through ethtool's ``hw-tc-offload`` setting:
  2622  
  2623    ::
  2624  
  2625      # ethtool -K em1 hw-tc-offload on
  2626      # tc qdisc add dev em1 clsact
  2627      # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
  2628      # tc filter show dev em1 ingress
  2629      filter protocol all pref 1 bpf
  2630      filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366b
  2631  
  2632    The ``in_hw`` flag confirms that the program has been offloaded to the NIC.
  2633  
  Note that BPF offloads for tc and XDP cannot both be loaded at the same time;
  either the tc or the XDP offload option must be selected.
  2636  
  2637  **3. Testing BPF offload interface via netdevsim driver.**
  2638  
  The netdevsim driver, which is part of the Linux kernel, is a dummy driver
  that implements offload interfaces for XDP BPF and tc BPF programs. It
  facilitates testing kernel changes or low-level user space programs
  implementing a control plane directly against the kernel's UAPI.
  2643  
  2644    A netdevsim device can be created as follows:
  2645  
  2646    ::
  2647  
  2648      # modprobe netdevsim
  2649      // [ID] [PORT_COUNT]
  2650      # echo "1 1" > /sys/bus/netdevsim/new_device
  2651      # devlink dev
  2652      netdevsim/netdevsim1
  2653      # devlink port
  2654      netdevsim/netdevsim1/0: type eth netdev eth0 flavour physical
  2655      # ip l
  2656      [...]
  2657      4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
  2658          link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff
  2659  
  2660    After that step, XDP BPF or tc BPF programs can be test loaded as shown
  2661    in the various examples earlier:
  2662  
  2663    ::
  2664  
  2665      # ip -force link set dev eth0 xdpoffload obj prog.o
  2666      # ip l
  2667      [...]
  2668      4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
  2669          link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff
  2670          prog/xdp id 16 tag a04f5eef06a7f555
  2671  
These two workflows cover the basic operations for loading XDP BPF and tc BPF
programs, respectively, with iproute2.

There are various other advanced options for the BPF loader that apply to both
XDP and tc; some of them are listed here. For simplicity, only XDP examples are
presented.
  2678  
  2679  **1. Verbose log output even on success.**
  2680  
  2681    The option ``verb`` can be appended for loading programs in order to dump the
  2682    verifier log, even if no error occurred:
  2683  
  2684    ::
  2685  
  2686      # ip link set dev em1 xdp obj xdp-example.o verb
  2687  
  2688      Prog section 'prog' loaded (5)!
  2689       - Type:         6
  2690       - Instructions: 2 (0 over limit)
  2691       - License:      GPL
  2692  
  2693      Verifier analysis:
  2694  
  2695      0: (b7) r0 = 1
  2696      1: (95) exit
  2697      processed 2 insns
  2698  
  2699  **2. Load program that is already pinned in BPF file system.**
  2700  
  2701    Instead of loading a program from an object file, iproute2 can also retrieve
  2702    the program from the BPF file system in case some external entity pinned it
  2703    there and attach it to the device:
  2704  
  2705    ::
  2706  
    # ip link set dev em1 xdp pinned /sys/fs/bpf/prog
  2708  
  2709    iproute2 can also use the short form that is relative to the detected mount
  2710    point of the BPF file system:
  2711  
  2712    ::
  2713  
    # ip link set dev em1 xdp pinned m:prog
  2715  
When loading BPF programs, iproute2 will automatically detect the mounted
file system instance in order to perform pinning of nodes. In case no mounted
BPF file system instance is found, tc will automatically mount it to the
default location under ``/sys/fs/bpf/``.

In case an instance has already been mounted, it will be used and no additional
mount will be performed:
  2723  
  2724    ::
  2725  
  2726      # mkdir /var/run/bpf
  2727      # mount --bind /var/run/bpf /var/run/bpf
  2728      # mount -t bpf bpf /var/run/bpf
  2729      # tc filter add dev em1 ingress bpf da obj tc-example.o sec prog
  2730      # tree /var/run/bpf
  2731      /var/run/bpf
  2732      +-- ip -> /run/bpf/tc/
  2733      +-- tc
  2734      |   +-- globals
  2735      |       +-- jmp_map
  2736      +-- xdp -> /run/bpf/tc/
  2737  
  2738      4 directories, 1 file
  2739  
By default tc will create an initial directory structure as shown above,
where all subsystem users will point to the same location through symbolic
links for the ``globals`` namespace, so that pinned BPF maps can be reused
among various BPF program types in iproute2. If the file system instance has
already been mounted and an existing structure is already in place, then tc
will not override it. This could be the case when separating ``lwt``, ``tc``
and ``xdp`` maps in order not to share ``globals`` among them all.
  2747  
As briefly covered in the previous LLVM section, iproute2 ships a header
file which can be included by BPF programs through the standard include
path:
  2751  
  2752    ::
  2753  
  2754      #include <iproute2/bpf_elf.h>
  2755  
  2756  The purpose of this header file is to provide an API for maps and default section
  2757  names used by programs. It's a stable contract between iproute2 and BPF programs.
  2758  
  2759  The map definition for iproute2 is ``struct bpf_elf_map``. Its members have
  2760  been covered earlier in the LLVM section of this document.
  2761  
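As a brief reminder, a sketch of such a map definition placed into the ``maps``
ELF section could look as follows (reusing the ``__section()`` helper macro from
the earlier examples; the map name, key / value sizes and map type are arbitrary
and only for illustration):

  ::

    #include <linux/bpf.h>
    #include <iproute2/bpf_elf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    /* Pinned under the globals namespace of the BPF file system by the
     * iproute2 loader, hence shareable across tc, XDP and lwt programs. */
    struct bpf_elf_map __section("maps") example_map = {
        .type       = BPF_MAP_TYPE_ARRAY,
        .size_key   = sizeof(__u32),
        .size_value = sizeof(__u64),
        .pinning    = PIN_GLOBAL_NS,
        .max_elem   = 2,
    };
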
When parsing the BPF object file, the iproute2 loader will walk through
all ELF sections. It initially fetches ancillary sections like ``maps`` and
``license``. For ``maps``, the ``struct bpf_elf_map`` array will be checked
for validity and, whenever needed, compatibility workarounds are performed.
Subsequently all maps are created with the user-provided information, either
retrieved as a pinned object, or newly created and then pinned into the BPF
file system. Next, the loader will handle all program sections that contain
ELF relocation entries for maps, meaning that BPF instructions loading
map file descriptors into registers are rewritten so that the corresponding
map file descriptors are encoded into the instructions' immediate values, in
order for the kernel to be able to convert them later on into map kernel
pointers. After that, all the programs themselves are created through the BPF
system call, and tail call maps, if present, are updated with the programs'
file descriptors.
  2776  
  2777  bpftool
  2778  -------
  2779  
bpftool is the main introspection and debugging tool around BPF; it is
developed and shipped along with the Linux kernel tree under ``tools/bpf/bpftool/``.

The tool can dump all BPF programs and maps that are currently loaded in
the system, or list and correlate all BPF maps used by a specific program.
Furthermore, it allows dumping an entire map's key / value pairs, as well as
looking up, updating or deleting individual entries and retrieving a key's
neighbor key in the map. Such operations can be performed based on BPF program
or map IDs or by specifying the location of a BPF file system pinned program
or map. The tool additionally offers an option to pin maps or programs
into the BPF file system.
  2791  
  2792  For a quick overview of all BPF programs currently loaded on the host
  2793  invoke the following command:
  2794  
  2795    ::
  2796  
  2797       # bpftool prog
  2798       398: sched_cls  tag 56207908be8ad877
  2799          loaded_at Apr 09/16:24  uid 0
  2800          xlated 8800B  jited 6184B  memlock 12288B  map_ids 18,5,17,14
  2801       399: sched_cls  tag abc95fb4835a6ec9
  2802          loaded_at Apr 09/16:24  uid 0
  2803          xlated 344B  jited 223B  memlock 4096B  map_ids 18
  2804       400: sched_cls  tag afd2e542b30ff3ec
  2805          loaded_at Apr 09/16:24  uid 0
  2806          xlated 1720B  jited 1001B  memlock 4096B  map_ids 17
  2807       401: sched_cls  tag 2dbbd74ee5d51cc8
  2808          loaded_at Apr 09/16:24  uid 0
  2809          xlated 3728B  jited 2099B  memlock 4096B  map_ids 17
  2810       [...]
  2811  
  2812  Similarly, to get an overview of all active maps:
  2813  
  2814    ::
  2815  
  2816      # bpftool map
  2817      5: hash  flags 0x0
  2818          key 20B  value 112B  max_entries 65535  memlock 13111296B
  2819      6: hash  flags 0x0
  2820          key 20B  value 20B  max_entries 65536  memlock 7344128B
  2821      7: hash  flags 0x0
  2822          key 10B  value 16B  max_entries 8192  memlock 790528B
  2823      8: hash  flags 0x0
  2824          key 22B  value 28B  max_entries 8192  memlock 987136B
  2825      9: hash  flags 0x0
  2826          key 20B  value 8B  max_entries 512000  memlock 49352704B
  2827      [...]
  2828  
Note that for each command, bpftool also supports JSON based output by
appending ``--json`` at the end of the command line. An additional
``--pretty`` makes the output more human readable.
  2832  
  2833    ::
  2834  
  2835       # bpftool prog --json --pretty
  2836  
For dumping the post-verifier BPF instruction image of a specific BPF
program, one starting point could be to inspect a specific program, e.g.
one attached to the tc egress hook:
  2840  
  2841    ::
  2842  
  2843       # tc filter show dev cilium_host egress
  2844       filter protocol all pref 1 bpf chain 0
  2845       filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_host.o:[from-netdev] \
  2846                           direct-action not_in_hw id 406 tag e0362f5bd9163a0a jited
  2847  
  2848  The program from the object file ``bpf_host.o``, section ``from-netdev`` has
  2849  a BPF program ID of ``406`` as denoted in ``id 406``. Based on this information
  2850  bpftool can provide some high-level metadata specific to the program:
  2851  
  2852    ::
  2853  
  2854       # bpftool prog show id 406
  2855       406: sched_cls  tag e0362f5bd9163a0a
  2856            loaded_at Apr 09/16:24  uid 0
  2857            xlated 11144B  jited 7721B  memlock 12288B  map_ids 18,20,8,5,6,14
  2858  
The program with ID 406 is of type ``sched_cls`` (``BPF_PROG_TYPE_SCHED_CLS``),
has a ``tag`` of ``e0362f5bd9163a0a`` (SHA sum over the instruction sequence),
and was loaded by root ``uid 0`` on ``Apr 09/16:24``. The BPF instruction
sequence is ``11,144 bytes`` long and the JITed image ``7,721 bytes``. The
program itself (excluding maps) consumes ``12,288 bytes`` that are accounted /
charged against user ``uid 0``. The BPF program uses the BPF maps with
IDs ``18``, ``20``, ``8``, ``5``, ``6`` and ``14``. These IDs can further
be used to get information about or to dump the maps themselves.
  2867  
  2868  Additionally, bpftool can issue a dump request of the BPF instructions the
  2869  program runs:
  2870  
  2871    ::
  2872  
  2873       # bpftool prog dump xlated id 406
  2874        0: (b7) r7 = 0
  2875        1: (63) *(u32 *)(r1 +60) = r7
  2876        2: (63) *(u32 *)(r1 +56) = r7
  2877        3: (63) *(u32 *)(r1 +52) = r7
  2878       [...]
  2879       47: (bf) r4 = r10
  2880       48: (07) r4 += -40
  2881       49: (79) r6 = *(u64 *)(r10 -104)
  2882       50: (bf) r1 = r6
  2883       51: (18) r2 = map[id:18]                    <-- BPF map id 18
  2884       53: (b7) r5 = 32
  2885       54: (85) call bpf_skb_event_output#5656112  <-- BPF helper call
  2886       55: (69) r1 = *(u16 *)(r6 +192)
  2887       [...]
  2888  
As shown above, bpftool correlates BPF map IDs in the instruction stream,
as well as calls to BPF helpers or other BPF programs.
  2891  
The instruction dump reuses the same 'pretty-printer' as the kernel's BPF
verifier. Since the program was JITed, meaning the actual JIT image generated
out of the above ``xlated`` instructions is what gets executed, it can be
dumped as well through bpftool:
  2896  
  2897    ::
  2898  
  2899       # bpftool prog dump jited id 406
  2900        0:        push   %rbp
  2901        1:        mov    %rsp,%rbp
  2902        4:        sub    $0x228,%rsp
  2903        b:        sub    $0x28,%rbp
  2904        f:        mov    %rbx,0x0(%rbp)
  2905       13:        mov    %r13,0x8(%rbp)
  2906       17:        mov    %r14,0x10(%rbp)
  2907       1b:        mov    %r15,0x18(%rbp)
  2908       1f:        xor    %eax,%eax
  2909       21:        mov    %rax,0x20(%rbp)
  2910       25:        mov    0x80(%rdi),%r9d
  2911       [...]
  2912  
  2913  Mainly for BPF JIT developers, the option also exists to interleave the
  2914  disassembly with the actual native opcodes:
  2915  
  2916    ::
  2917  
  2918       # bpftool prog dump jited id 406 opcodes
  2919        0:        push   %rbp
  2920                  55
  2921        1:        mov    %rsp,%rbp
  2922                  48 89 e5
  2923        4:        sub    $0x228,%rsp
  2924                  48 81 ec 28 02 00 00
  2925        b:        sub    $0x28,%rbp
  2926                  48 83 ed 28
  2927        f:        mov    %rbx,0x0(%rbp)
  2928                  48 89 5d 00
  2929       13:        mov    %r13,0x8(%rbp)
  2930                  4c 89 6d 08
  2931       17:        mov    %r14,0x10(%rbp)
  2932                  4c 89 75 10
  2933       1b:        mov    %r15,0x18(%rbp)
  2934                  4c 89 7d 18
  2935       [...]
  2936  
  2937  The same interleaving can be done for the normal BPF instructions which
  2938  can sometimes be useful for debugging in the kernel:
  2939  
  2940    ::
  2941  
  2942       # bpftool prog dump xlated id 406 opcodes
  2943        0: (b7) r7 = 0
  2944           b7 07 00 00 00 00 00 00
  2945        1: (63) *(u32 *)(r1 +60) = r7
  2946           63 71 3c 00 00 00 00 00
  2947        2: (63) *(u32 *)(r1 +56) = r7
  2948           63 71 38 00 00 00 00 00
  2949        3: (63) *(u32 *)(r1 +52) = r7
  2950           63 71 34 00 00 00 00 00
  2951        4: (63) *(u32 *)(r1 +48) = r7
  2952           63 71 30 00 00 00 00 00
  2953        5: (63) *(u32 *)(r1 +64) = r7
  2954           63 71 40 00 00 00 00 00
  2955        [...]
  2956  
The basic blocks of a program can also be visualized with the help of
``graphviz``. For this purpose bpftool has a ``visual`` dump mode that
generates a dot file instead of the plain BPF ``xlated`` instruction
dump; the dot file can later be converted to a png file:
  2961  
  2962    ::
  2963  
  2964       # bpftool prog dump xlated id 406 visual &> output.dot
  2965       $ dot -Tpng output.dot -o output.png
  2966  
  2967  Another option would be to pass the dot file to dotty as a viewer, that
  2968  is ``dotty output.dot``, where the result for the ``bpf_host.o`` program
  2969  looks as follows (small extract):
  2970  
  2971  .. image:: images/bpf_dot.png
  2972      :align: center
  2973  
  2974  Note that the ``xlated`` instruction dump provides the post-verifier BPF
  2975  instruction image which means that it dumps the instructions as if they
  2976  were to be run through the BPF interpreter. In the kernel, the verifier
  2977  performs various rewrites of the original instructions provided by the
  2978  BPF loader.
  2979  
  2980  One example of rewrites is the inlining of helper functions in order to
  2981  improve runtime performance, here in the case of a map lookup for hash
  2982  tables:
  2983  
  2984    ::
  2985  
  2986       # bpftool prog dump xlated id 3
  2987        0: (b7) r1 = 2
  2988        1: (63) *(u32 *)(r10 -4) = r1
  2989        2: (bf) r2 = r10
  2990        3: (07) r2 += -4
  2991        4: (18) r1 = map[id:2]                      <-- BPF map id 2
  2992        6: (85) call __htab_map_lookup_elem#77408   <-+ BPF helper inlined rewrite
  2993        7: (15) if r0 == 0x0 goto pc+2                |
  2994        8: (07) r0 += 56                              |
  2995        9: (79) r0 = *(u64 *)(r0 +0)                <-+
  2996       10: (15) if r0 == 0x0 goto pc+24
  2997       11: (bf) r2 = r10
  2998       12: (07) r2 += -4
  2999       [...]
  3000  
  3001  bpftool correlates calls to helper functions or BPF to BPF calls through
  3002  kallsyms. Therefore, make sure that JITed BPF programs are exposed to
  3003  kallsyms (``bpf_jit_kallsyms``) and that kallsyms addresses are not
  3004  obfuscated (calls are otherwise shown as ``call bpf_unspec#0``):
  3005  
  3006    ::
  3007  
  3008       # echo 0 > /proc/sys/kernel/kptr_restrict
  3009       # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
  3010  
BPF to BPF calls are correlated as well, both for the interpreter and the
JIT case. In the latter, the tag of the subprogram is shown as the call
target. In each case, ``pc+2`` is the pc-relative offset of the call
target, which denotes the subprogram.
  3015  
  3016    ::
  3017  
  3018       # bpftool prog dump xlated id 1
  3019       0: (85) call pc+2#__bpf_prog_run_args32
  3020       1: (b7) r0 = 1
  3021       2: (95) exit
  3022       3: (b7) r0 = 2
  3023       4: (95) exit
  3024  
  3025  JITed variant of the dump:
  3026  
  3027    ::
  3028  
  3029       # bpftool prog dump xlated id 1
  3030       0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F
  3031       1: (b7) r0 = 1
  3032       2: (95) exit
  3033       3: (b7) r0 = 2
  3034       4: (95) exit
  3035  
In the case of tail calls, the kernel maps them into a single instruction
internally; bpftool will still correlate them as a helper call for ease
of debugging:
  3039  
  3040    ::
  3041  
  3042       # bpftool prog dump xlated id 2
  3043       [...]
  3044       10: (b7) r2 = 8
  3045       11: (85) call bpf_trace_printk#-41312
  3046       12: (bf) r1 = r6
  3047       13: (18) r2 = map[id:1]
  3048       15: (b7) r3 = 0
  3049       16: (85) call bpf_tail_call#12
  3050       17: (b7) r1 = 42
  3051       18: (6b) *(u16 *)(r6 +46) = r1
  3052       19: (b7) r0 = 0
  3053       20: (95) exit
  3054  
  3055       # bpftool map show id 1
  3056       1: prog_array  flags 0x0
  3057             key 4B  value 4B  max_entries 1  memlock 4096B
  3058  
  3059  Dumping an entire map is possible through the ``map dump`` subcommand
  3060  which iterates through all present map elements and dumps the key /
  3061  value pairs.
  3062  
  3063  If no BTF (BPF Type Format) data is available for a given map, then
  3064  the key / value pairs are dumped as hex:
  3065  
  3066    ::
  3067  
  3068       # bpftool map dump id 5
  3069       key:
  3070       f0 0d 00 00 00 00 00 00  0a 66 00 00 00 00 8a d6
  3071       02 00 00 00
  3072       value:
  3073       00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00
  3074       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3075       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3076       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3077       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3078       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3079       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3080       key:
  3081       0a 66 1c ee 00 00 00 00  00 00 00 00 00 00 00 00
  3082       01 00 00 00
  3083       value:
  3084       00 00 00 00 00 00 00 00  01 00 00 00 00 00 00 00
  3085       00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00
  3086       [...]
  3087       Found 6 elements
  3088  
  3089  However, with BTF, the map also holds debugging information about
  3090  the key and value structures. For example, BTF in combination with
  3091  BPF maps and the BPF_ANNOTATE_KV_PAIR() macro from iproute2 will
  3092  result in the following dump (``test_xdp_noinline.o`` from kernel
  3093  selftests):
  3094  
  3095    ::
  3096  
  3097       # cat tools/testing/selftests/bpf/test_xdp_noinline.c
  3098         [...]
  3099          struct ctl_value {
  3100                union {
  3101                        __u64 value;
  3102                        __u32 ifindex;
  3103                        __u8 mac[6];
  3104                };
  3105          };
  3106  
  3107          struct bpf_map_def __attribute__ ((section("maps"), used)) ctl_array = {
  3108                 .type		= BPF_MAP_TYPE_ARRAY,
  3109                 .key_size	= sizeof(__u32),
  3110                 .value_size	= sizeof(struct ctl_value),
  3111                 .max_entries	= 16,
  3112                 .map_flags	= 0,
  3113          };
  3114          BPF_ANNOTATE_KV_PAIR(ctl_array, __u32, struct ctl_value);
  3115  
  3116          [...]
  3117  
The BPF_ANNOTATE_KV_PAIR() macro forces a map-specific ELF section
containing an empty key and value; this enables the iproute2 BPF loader
to correlate BTF data with that section and thus allows choosing the
corresponding types out of the BTF for loading the map.
  3122  
Compiling through LLVM and generating BTF from the debugging information
via ``pahole``:
  3125  
  3126    ::
  3127  
  3128       # clang [...] -O2 -target bpf -g -emit-llvm -c test_xdp_noinline.c -o - |
  3129         llc -march=bpf -mcpu=probe -mattr=dwarfris -filetype=obj -o test_xdp_noinline.o
  3130       # pahole -J test_xdp_noinline.o
  3131  
Now loading into the kernel and dumping the map via bpftool:
  3133  
  3134    ::
  3135  
  3136       # ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test
  3137       # ip a
  3138       1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:227 qdisc noqueue state UNKNOWN group default qlen 1000
  3139           link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
  3140           inet 127.0.0.1/8 scope host lo
  3141              valid_lft forever preferred_lft forever
  3142           inet6 ::1/128 scope host
  3143              valid_lft forever preferred_lft forever
  3144       [...]
  3145       # bpftool prog show id 227
  3146       227: xdp  tag a85e060c275c5616  gpl
  3147           loaded_at 2018-07-17T14:41:29+0000  uid 0
  3148           xlated 8152B  not jited  memlock 12288B  map_ids 381,385,386,382,384,383
  3149       # bpftool map dump id 386
  3150        [{
  3151             "key": 0,
  3152             "value": {
  3153                 "": {
  3154                     "value": 0,
  3155                     "ifindex": 0,
  3156                     "mac": []
  3157                 }
  3158             }
  3159         },{
  3160             "key": 1,
  3161             "value": {
  3162                 "": {
  3163                     "value": 0,
  3164                     "ifindex": 0,
  3165                     "mac": []
  3166                 }
  3167             }
  3168         },{
  3169       [...]
  3170  
  3171  Lookup, update, delete, and 'get next key' operations on the map for specific
  3172  keys can be performed through bpftool as well.
  3173  
  3174  BPF sysctls
  3175  -----------
  3176  
The Linux kernel provides a few BPF related sysctls that are covered in this section.
  3178  
  3179  * ``/proc/sys/net/core/bpf_jit_enable``: Enables or disables the BPF JIT compiler.
  3180  
  3181    +-------+-------------------------------------------------------------------+
  3182    | Value | Description                                                       |
  3183    +-------+-------------------------------------------------------------------+
  3184    | 0     | Disable the JIT and use only interpreter (kernel's default value) |
  3185    +-------+-------------------------------------------------------------------+
  3186    | 1     | Enable the JIT compiler                                           |
  3187    +-------+-------------------------------------------------------------------+
  3188    | 2     | Enable the JIT and emit debugging traces to the kernel log        |
  3189    +-------+-------------------------------------------------------------------+
  3190  
  As described in subsequent sections, the ``bpf_jit_disasm`` tool can be used to
  process debugging traces when the JIT compiler is set to debugging mode (option ``2``).
  3193  
  3194  * ``/proc/sys/net/core/bpf_jit_harden``: Enables or disables BPF JIT hardening.
  3195    Note that enabling hardening trades off performance, but can mitigate JIT spraying
  3196    by blinding out the BPF program's immediate values. For programs processed through
  3197    the interpreter, blinding of immediate values is not needed / performed.
  3198  
  3199    +-------+-------------------------------------------------------------------+
  3200    | Value | Description                                                       |
  3201    +-------+-------------------------------------------------------------------+
  3202    | 0     | Disable JIT hardening (kernel's default value)                    |
  3203    +-------+-------------------------------------------------------------------+
  3204    | 1     | Enable JIT hardening for unprivileged users only                  |
  3205    +-------+-------------------------------------------------------------------+
  3206    | 2     | Enable JIT hardening for all users                                |
  3207    +-------+-------------------------------------------------------------------+
  3208  
* ``/proc/sys/net/core/bpf_jit_kallsyms``: Enables or disables export of JITed
  programs as kernel symbols to ``/proc/kallsyms``, so that they can be used together
  with ``perf`` tooling and so that the kernel is made aware of these addresses for
  stack unwinding, for example, when dumping stack traces. The symbol names
  contain the BPF program tag (``bpf_prog_<tag>``). If ``bpf_jit_harden`` is enabled,
  then this feature is disabled.
  3215  
  3216    +-------+-------------------------------------------------------------------+
  3217    | Value | Description                                                       |
  3218    +-------+-------------------------------------------------------------------+
  3219    | 0     | Disable JIT kallsyms export (kernel's default value)              |
  3220    +-------+-------------------------------------------------------------------+
  3221    | 1     | Enable JIT kallsyms export for privileged users only              |
  3222    +-------+-------------------------------------------------------------------+
  3223  
* ``/proc/sys/kernel/unprivileged_bpf_disabled``: Enables or disables unprivileged
  use of the ``bpf(2)`` system call. The Linux kernel has unprivileged use of
  ``bpf(2)`` enabled by default, but once the switch is flipped, unprivileged use
  will be permanently disabled until the next reboot. This sysctl knob is a one-time
  switch, meaning that once it is set, neither an application nor an admin can reset
  the value anymore. This knob does not affect any cBPF programs such as seccomp
  or traditional socket filters that do not use the ``bpf(2)`` system call for
  loading the program into the kernel.
  3232  
  3233    +-------+-------------------------------------------------------------------+
  3234    | Value | Description                                                       |
  3235    +-------+-------------------------------------------------------------------+
  3236    | 0     | Unprivileged use of bpf syscall enabled (kernel's default value)  |
  3237    +-------+-------------------------------------------------------------------+
  3238    | 1     | Unprivileged use of bpf syscall disabled                          |
  3239    +-------+-------------------------------------------------------------------+
  3240  
  3241  Kernel Testing
  3242  --------------
  3243  
  3244  The Linux kernel ships a BPF selftest suite, which can be found in the kernel
  3245  source tree under ``tools/testing/selftests/bpf/``.
  3246  
  3247  ::
  3248  
  3249      $ cd tools/testing/selftests/bpf/
  3250      $ make
  3251      # make run_tests
  3252  
The test suite contains test cases against the BPF verifier and program tags,
as well as various tests against the BPF map interface and map types. It also
contains runtime tests from C code for checking the LLVM back end, and eBPF as
well as cBPF asm code that is run in the kernel for testing the interpreter
and the JITs.
  3257  
  3258  JIT Debugging
  3259  -------------
  3260  
  3261  For JIT developers performing audits or writing extensions, each compile run
  3262  can output the generated JIT image into the kernel log through:
  3263  
  3264  ::
  3265  
  3266      # echo 2 > /proc/sys/net/core/bpf_jit_enable
  3267  
  3268  Whenever a new BPF program is loaded, the JIT compiler will dump the output,
  3269  which can then be inspected with ``dmesg``, for example:
  3270  
  3271  ::
  3272  
  3273      [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f from=tcpdump pid=20583
  3274      [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
  3275      [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
  3276      [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
  3277      [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
  3278      [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3
  3279  
``flen`` is the length of the BPF program (here, 6 BPF instructions), and ``proglen``
tells the number of bytes generated by the JIT for the opcode image (here, 70 bytes
in size). ``pass`` means that the image was generated in 3 compiler passes; for
example, ``x86_64`` can have various optimization passes to further reduce the image
size when possible. ``image`` contains the address of the generated JIT image, while
``from`` and ``pid`` show the user space application name and PID, respectively, which
triggered the compilation process. The dump output for eBPF and cBPF JITs uses the
same format.
  3287  
  3288  In the kernel tree under ``tools/bpf/``, there is a tool called ``bpf_jit_disasm``. It
  3289  reads out the latest dump and prints the disassembly for further inspection:
  3290  
  3291  ::
  3292  
  3293      # ./bpf_jit_disasm
  3294      70 bytes emitted from JIT compiler (pass:3, flen:6)
  3295      ffffffffa0069c8f + <x>:
  3296         0:       push   %rbp
  3297         1:       mov    %rsp,%rbp
  3298         4:       sub    $0x60,%rsp
  3299         8:       mov    %rbx,-0x8(%rbp)
  3300         c:       mov    0x68(%rdi),%r9d
  3301        10:       sub    0x6c(%rdi),%r9d
  3302        14:       mov    0xd8(%rdi),%r8
  3303        1b:       mov    $0xc,%esi
  3304        20:       callq  0xffffffffe0ff9442
  3305        25:       cmp    $0x800,%eax
  3306        2a:       jne    0x0000000000000042
  3307        2c:       mov    $0x17,%esi
  3308        31:       callq  0xffffffffe0ff945e
  3309        36:       cmp    $0x1,%eax
  3310        39:       jne    0x0000000000000042
  3311        3b:       mov    $0xffff,%eax
  3312        40:       jmp    0x0000000000000044
  3313        42:       xor    %eax,%eax
  3314        44:       leaveq
  3315        45:       retq
  3316  
  3317  Alternatively, the tool can also dump related opcodes along with the disassembly.
  3318  
  3319  ::
  3320  
  3321      # ./bpf_jit_disasm -o
  3322      70 bytes emitted from JIT compiler (pass:3, flen:6)
  3323      ffffffffa0069c8f + <x>:
  3324         0:       push   %rbp
  3325          55
  3326         1:       mov    %rsp,%rbp
  3327          48 89 e5
  3328         4:       sub    $0x60,%rsp
  3329          48 83 ec 60
  3330         8:       mov    %rbx,-0x8(%rbp)
  3331          48 89 5d f8
  3332         c:       mov    0x68(%rdi),%r9d
  3333          44 8b 4f 68
  3334        10:       sub    0x6c(%rdi),%r9d
  3335          44 2b 4f 6c
  3336        14:       mov    0xd8(%rdi),%r8
  3337          4c 8b 87 d8 00 00 00
  3338        1b:       mov    $0xc,%esi
  3339          be 0c 00 00 00
  3340        20:       callq  0xffffffffe0ff9442
  3341          e8 1d 94 ff e0
  3342        25:       cmp    $0x800,%eax
  3343          3d 00 08 00 00
  3344        2a:       jne    0x0000000000000042
  3345          75 16
  3346        2c:       mov    $0x17,%esi
  3347          be 17 00 00 00
  3348        31:       callq  0xffffffffe0ff945e
  3349          e8 28 94 ff e0
  3350        36:       cmp    $0x1,%eax
  3351          83 f8 01
  3352        39:       jne    0x0000000000000042
  3353          75 07
  3354        3b:       mov    $0xffff,%eax
  3355          b8 ff ff 00 00
  3356        40:       jmp    0x0000000000000044
  3357          eb 02
  3358        42:       xor    %eax,%eax
  3359          31 c0
  3360        44:       leaveq
  3361          c9
  3362        45:       retq
  3363          c3
  3364  
More recently, ``bpftool`` adopted the same feature of dumping the BPF JIT
image based on a given BPF program ID already loaded in the system (see the
bpftool section).
  3368  
For performance analysis of JITed BPF programs, ``perf`` can be used as
usual. As a prerequisite, JITed programs need to be exported through the
kallsyms infrastructure.
  3372  
  3373  ::
  3374  
  3375      # echo 1 > /proc/sys/net/core/bpf_jit_enable
  3376      # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms
  3377  
Enabling or disabling ``bpf_jit_kallsyms`` does not require a reload of the
related BPF programs. Next, a small workflow example is provided for profiling
BPF programs. A crafted tc BPF program is used for demonstration purposes,
where perf records a failed allocation inside the ``bpf_clone_redirect()``
helper. Due to the use of direct write, ``bpf_try_make_head_writable()`` fails,
which then releases the cloned ``skb`` again and returns with an error message.
``perf`` thus records all ``kfree_skb`` events.
  3385  
  3386  ::
  3387  
  3388      # tc qdisc add dev em1 clsact
  3389      # tc filter add dev em1 ingress bpf da obj prog.o sec main
  3390      # tc filter show dev em1 ingress
  3391      filter protocol all pref 49152 bpf
  3392      filter protocol all pref 49152 bpf handle 0x1 prog.o:[main] direct-action id 1 tag 8227addf251b7543
  3393  
  3394      # cat /proc/kallsyms
  3395      [...]
  3396      ffffffffc00349e0 t fjes_hw_init_command_registers    [fjes]
  3397      ffffffffc003e2e0 d __tracepoint_fjes_hw_stop_debug_err    [fjes]
  3398      ffffffffc0036190 t fjes_hw_epbuf_tx_pkt_send    [fjes]
  3399      ffffffffc004b000 t bpf_prog_8227addf251b7543
  3400  
  3401      # perf record -a -g -e skb:kfree_skb sleep 60
  3402      # perf script --kallsyms=/proc/kallsyms
  3403      [...]
  3404      ksoftirqd/0     6 [000]  1004.578402:    skb:kfree_skb: skbaddr=0xffff9d4161f20a00 protocol=2048 location=0xffffffffc004b52c
  3405         7fffb8745961 bpf_clone_redirect (/lib/modules/4.10.0+/build/vmlinux)
  3406         7fffc004e52c bpf_prog_8227addf251b7543 (/lib/modules/4.10.0+/build/vmlinux)
  3407         7fffc05b6283 cls_bpf_classify (/lib/modules/4.10.0+/build/vmlinux)
  3408         7fffb875957a tc_classify (/lib/modules/4.10.0+/build/vmlinux)
  3409         7fffb8729840 __netif_receive_skb_core (/lib/modules/4.10.0+/build/vmlinux)
  3410         7fffb8729e38 __netif_receive_skb (/lib/modules/4.10.0+/build/vmlinux)
  3411         7fffb872ae05 process_backlog (/lib/modules/4.10.0+/build/vmlinux)
  3412         7fffb872a43e net_rx_action (/lib/modules/4.10.0+/build/vmlinux)
  3413         7fffb886176c __do_softirq (/lib/modules/4.10.0+/build/vmlinux)
  3414         7fffb80ac5b9 run_ksoftirqd (/lib/modules/4.10.0+/build/vmlinux)
  3415         7fffb80ca7fa smpboot_thread_fn (/lib/modules/4.10.0+/build/vmlinux)
  3416         7fffb80c6831 kthread (/lib/modules/4.10.0+/build/vmlinux)
  3417         7fffb885e09c ret_from_fork (/lib/modules/4.10.0+/build/vmlinux)
  3418  
The stack trace recorded by ``perf`` will then show the ``bpf_prog_8227addf251b7543()``
symbol as part of the call trace, meaning that the BPF program with the
tag ``8227addf251b7543`` was related to the ``kfree_skb`` event, and that
this program was attached to the netdevice ``em1`` on the ingress hook, as
shown by tc.
  3424  
  3425  Introspection
  3426  -------------
  3427  
  3428  The Linux kernel provides various tracepoints around BPF and XDP which
  3429  can be used for additional introspection, for example, to trace interactions
  3430  of user space programs with the bpf system call.
  3431  
  3432  Tracepoints for BPF:
  3433  
  3434  ::
  3435  
  3436      # perf list | grep bpf:
  3437      bpf:bpf_map_create                                 [Tracepoint event]
  3438      bpf:bpf_map_delete_elem                            [Tracepoint event]
  3439      bpf:bpf_map_lookup_elem                            [Tracepoint event]
  3440      bpf:bpf_map_next_key                               [Tracepoint event]
  3441      bpf:bpf_map_update_elem                            [Tracepoint event]
  3442      bpf:bpf_obj_get_map                                [Tracepoint event]
  3443      bpf:bpf_obj_get_prog                               [Tracepoint event]
  3444      bpf:bpf_obj_pin_map                                [Tracepoint event]
  3445      bpf:bpf_obj_pin_prog                               [Tracepoint event]
  3446      bpf:bpf_prog_get_type                              [Tracepoint event]
  3447      bpf:bpf_prog_load                                  [Tracepoint event]
  3448      bpf:bpf_prog_put_rcu                               [Tracepoint event]
  3449  
Example usage with ``perf`` (instead of the ``sleep`` example used here, a
specific application like ``tc`` could of course be traced as well):
  3452  
  3453  ::
  3454  
  3455      # perf record -a -e bpf:* sleep 10
  3456      # perf script
  3457      sock_example  6197 [005]   283.980322:      bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0
  3458      sock_example  6197 [005]   283.980721:       bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5
  3459      sock_example  6197 [005]   283.988423:   bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  3460      sock_example  6197 [005]   283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00]
  3461      [...]
  3462      sock_example  6197 [005]   288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00]
  3463           swapper     0 [005]   289.338243:    bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER
  3464  
  3465  For the BPF programs, their individual program tag is displayed.
  3466  
  3467  For debugging, XDP also has a tracepoint that is triggered when exceptions are raised:
  3468  
  3469  ::
  3470  
  3471      # perf list | grep xdp:
  3472      xdp:xdp_exception                                  [Tracepoint event]
  3473  
  3474  Exceptions are triggered in the following scenarios:
  3475  
  3476  * The BPF program returned an invalid / unknown XDP action code.
  3477  * The BPF program returned with ``XDP_ABORTED`` indicating a non-graceful exit.
  3478  * The BPF program returned with ``XDP_TX``, but there was an error on transmit,
  3479    for example, due to the port not being up, due to the transmit ring being full,
  3480    due to allocation failures, etc.
  3481  
  3482  Both tracepoint classes can also be inspected with a BPF program itself
  3483  attached to one or more tracepoints, collecting further information
  3484  in a map or punting such events to a user space collector through the
  3485  ``bpf_perf_event_output()`` helper, for example.
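
Alternatively to attaching a BPF program, the exception tracepoint can also be
recorded directly with ``perf``, analogous to the BPF tracepoint example above:

::

    # perf record -a -e xdp:xdp_exception sleep 10
    # perf script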
  3486  
  3487  Miscellaneous
  3488  -------------
  3489  
BPF programs and maps are memory accounted against ``RLIMIT_MEMLOCK``, similar
to ``perf``. The currently available size in units of system pages which may be
locked into memory can be inspected through ``ulimit -l``. The setrlimit system
call man page provides further details.
  3494  
The default limit is usually insufficient to load more complex programs or
larger BPF maps, causing the BPF system call to fail with an ``errno`` of
``EPERM``. In such situations the limit can be raised through ``ulimit -l
unlimited`` or set to a sufficiently large value. ``RLIMIT_MEMLOCK`` mainly
enforces limits for unprivileged users. Depending on the setup, setting a
higher limit for privileged users is often acceptable.
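
Alternatively to raising the shell limit, loaders commonly bump
``RLIMIT_MEMLOCK`` programmatically before issuing BPF system calls. A minimal
sketch of this (requiring sufficient privileges, e.g. ``CAP_SYS_RESOURCE``)
could look like:

::

    #include <sys/resource.h>

    /* Raise RLIMIT_MEMLOCK so that larger BPF programs and maps can be
     * created; the new limit applies to subsequent bpf(2) calls of this
     * process.
     */
    static int bump_memlock_rlimit(void)
    {
        struct rlimit rlim = {
            .rlim_cur = RLIM_INFINITY,
            .rlim_max = RLIM_INFINITY,
        };

        return setrlimit(RLIMIT_MEMLOCK, &rlim);
    }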
  3501  
  3502  Program Types
  3503  =============
  3504  
At the time of this writing, there are eighteen different BPF program types
available. The two main types for networking, namely XDP BPF programs and tc
BPF programs, are further explained in the subsections below. Extensive usage
examples for the two program types with LLVM, iproute2 or other tools are
spread throughout the toolchain section and are not covered here. Instead,
this section focuses on their architecture, concepts and use cases.
  3511  
  3512  XDP
  3513  ---
  3514  
  3515  XDP stands for eXpress Data Path and provides a framework for BPF that enables
  3516  high-performance programmable packet processing in the Linux kernel. It runs
  3517  the BPF program at the earliest possible point in software, namely at the moment
  3518  the network driver receives the packet.
  3519  
  3520  At this point in the fast-path the driver just picked up the packet from its
  3521  receive rings, without having done any expensive operations such as allocating
  3522  an ``skb`` for pushing the packet further up the networking stack, without
  3523  having pushed the packet into the GRO engine, etc. Thus, the XDP BPF program
  3524  is executed at the earliest point when it becomes available to the CPU for
  3525  processing.
  3526  
  3527  XDP works in concert with the Linux kernel and its infrastructure, meaning
  3528  the kernel is not bypassed as in various networking frameworks that operate
  3529  in user space only. Keeping the packet in kernel space has several major
  3530  advantages:
  3531  
* XDP is able to reuse all the upstream developed kernel networking drivers,
  user space tooling, and even other available in-kernel infrastructure such
  as routing tables or sockets through BPF helper calls.
  3535  * Residing in kernel space, XDP has the same security model as the rest of
  3536    the kernel for accessing hardware.
* There is no need to cross the kernel / user space boundary since the
  processed packet already resides in the kernel and can therefore be flexibly
  forwarded into other in-kernel entities like namespaces used by containers
  or the kernel's networking stack itself. This is particularly relevant in
  times of Meltdown and Spectre.
  3542  * Punting packets from XDP to the kernel's robust, widely used and efficient
  3543    TCP/IP stack is trivially possible, allows for full reuse and does not
  3544    require maintaining a separate TCP/IP stack as with user space frameworks.
* The use of BPF allows for full programmability while keeping a stable ABI
  with the same 'never-break-user-space' guarantees as the kernel's system
  call ABI. Compared to kernel modules, it also provides safety measures
  thanks to the BPF verifier which ensures the stability of the kernel's
  operation.
  3549  * XDP trivially allows for atomically swapping programs during runtime without
  3550    any network traffic interruption or even kernel / system reboot.
  3551  * XDP allows for flexible structuring of workloads integrated into
  3552    the kernel. For example, it can operate in "busy polling" or "interrupt
  3553    driven" mode. Explicitly dedicating CPUs to XDP is not required. There
  3554    are no special hardware requirements and it does not rely on hugepages.
  3555  * XDP does not require any third party kernel modules or licensing. It is
  3556    a long-term architectural solution, a core part of the Linux kernel, and
  3557    developed by the kernel community.
  3558  * XDP is already enabled and shipped everywhere with major distributions
  3559    running a kernel equivalent to 4.8 or higher and supports most major 10G
  3560    or higher networking drivers.
  3561  
  3562  As a framework for running BPF in the driver, XDP additionally ensures that
  3563  packets are laid out linearly and fit into a single DMA'ed page which is
  3564  readable and writable by the BPF program. XDP also ensures that additional
  3565  headroom of 256 bytes is available to the program for implementing custom
  3566  encapsulation headers with the help of the ``bpf_xdp_adjust_head()`` BPF helper
  3567  or adding custom metadata in front of the packet through ``bpf_xdp_adjust_meta()``.
  3568  
The framework contains XDP action codes, further described in the section
below, which a BPF program can return in order to instruct the driver how
to proceed with the packet, and it enables the possibility to atomically
replace BPF programs running at the XDP layer. XDP is tailored for
high performance by design. BPF allows accessing the packet data through
'direct packet access', which means that the program holds data pointers
directly in registers, loads packet contents into registers, and writes
from registers into the packet.
  3577  
  3578  The packet representation in XDP that is passed to the BPF program as
  3579  the BPF context looks as follows:
  3580  
  3581  ::
  3582  
  3583      struct xdp_buff {
  3584          void *data;
  3585          void *data_end;
  3586          void *data_meta;
  3587          void *data_hard_start;
  3588          struct xdp_rxq_info *rxq;
  3589      };
  3590  
``data`` points to the start of the packet data in the page, and as the
name suggests, ``data_end`` points to the end of the packet data. Since XDP
allows for a headroom, ``data_hard_start`` points to the maximum possible
headroom start in the page, meaning that when the packet is to be encapsulated,
``data`` is moved closer towards ``data_hard_start`` via ``bpf_xdp_adjust_head()``.
The same BPF helper function also allows for decapsulation, in which case
``data`` is moved further away from ``data_hard_start``.
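
The following minimal sketch illustrates growing the headroom in order to
prepend a custom 8 byte header. Note that the program-facing BPF context is
``struct xdp_md`` from ``linux/bpf.h``, which the kernel maps onto ``struct
xdp_buff``; the hand-written helper declaration, the section name and the
header size are assumptions made for this example only:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    /* Helper declared by hand for the sketch; normally provided by a header. */
    static int (*bpf_xdp_adjust_head)(struct xdp_md *ctx, int delta) =
        (void *) BPF_FUNC_xdp_adjust_head;

    __section("prog")
    int xdp_encap(struct xdp_md *ctx)
    {
        /* A negative delta moves data towards data_hard_start, here growing
         * the headroom by 8 bytes for a custom encapsulation header.
         */
        if (bpf_xdp_adjust_head(ctx, -8) < 0)
            return XDP_ABORTED;

        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* The verifier requires a bounds check before writing the header. */
        if (data + 8 > data_end)
            return XDP_ABORTED;

        __builtin_memset(data, 0, 8);
        return XDP_PASS;
    }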
  3598  
  3599  ``data_meta`` initially points to the same location as ``data`` but
  3600  ``bpf_xdp_adjust_meta()`` is able to move the pointer towards ``data_hard_start``
  3601  as well in order to provide room for custom metadata which is invisible to
  3602  the normal kernel networking stack but can be read by tc BPF programs since
  3603  it is transferred from XDP to the ``skb``. Vice versa, it can remove or reduce
  3604  the size of the custom metadata through the same BPF helper function by
  3605  moving ``data_meta`` away from ``data_hard_start`` again. ``data_meta`` can
  3606  also be used solely for passing state between tail calls similarly to the
  3607  ``skb->cb[]`` control block case that is accessible in tc BPF programs.
  3608  
This gives the following invariant for the ``struct xdp_buff`` packet
pointers: ``data_hard_start`` <= ``data_meta`` <= ``data`` < ``data_end``.
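
A corresponding sketch for reserving 4 bytes of custom metadata in front of the
packet, which a tc BPF program can later pick up, could look as follows, reusing
the includes and ``__section()`` macro from the previous sketch; the helper
declaration and the metadata size are again assumptions for the example:

::

    static int (*bpf_xdp_adjust_meta)(struct xdp_md *ctx, int delta) =
        (void *) BPF_FUNC_xdp_adjust_meta;

    __section("prog")
    int xdp_meta(struct xdp_md *ctx)
    {
        /* Move data_meta 4 bytes towards data_hard_start. */
        if (bpf_xdp_adjust_meta(ctx, -4) < 0)
            return XDP_ABORTED;

        void *data      = (void *)(long)ctx->data;
        void *data_meta = (void *)(long)ctx->data_meta;

        /* Bounds check: the metadata must lie in front of the packet data. */
        if (data_meta + 4 > data)
            return XDP_ABORTED;

        /* Custom metadata, readable later from the tc BPF layer. */
        *(__u32 *)data_meta = 0xcafe;
        return XDP_PASS;
    }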
  3611  
  3612  The ``rxq`` field points to some additional per receive queue metadata which
  3613  is populated at ring setup time (not at XDP runtime):
  3614  
  3615  ::
  3616  
  3617      struct xdp_rxq_info {
  3618          struct net_device *dev;
  3619          u32 queue_index;
  3620          u32 reg_state;
  3621      } ____cacheline_aligned;
  3622  
  3623  The BPF program can retrieve ``queue_index`` as well as additional data
  3624  from the netdevice itself such as ``ifindex``, etc.
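
Assuming a kernel where the program-facing ``struct xdp_md`` context in
``linux/bpf.h`` exposes these values through the ``rx_queue_index`` and
``ingress_ifindex`` fields, retrieving them inside an XDP program with context
``struct xdp_md *ctx`` is a plain field access:

::

    /* Minimal sketch: read receive queue and ifindex from the XDP context. */
    __u32 queue_index = ctx->rx_queue_index;
    __u32 ifindex     = ctx->ingress_ifindex;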
  3625  
  3626  **BPF program return codes**
  3627  
  3628  After running the XDP BPF program, a verdict is returned from the program in
  3629  order to tell the driver how to process the packet next. In the ``linux/bpf.h``
  3630  system header file all available return verdicts are enumerated:
  3631  
  3632  ::
  3633  
  3634      enum xdp_action {
  3635          XDP_ABORTED = 0,
  3636          XDP_DROP,
  3637          XDP_PASS,
  3638          XDP_TX,
  3639          XDP_REDIRECT,
  3640      };
  3641  
``XDP_DROP`` as the name suggests will drop the packet right at the driver
level without wasting any further resources. This is particularly useful
for BPF programs implementing DDoS mitigation mechanisms or firewalling in
general. The ``XDP_PASS`` return code means that the packet is allowed to
be passed up to the kernel's networking stack, meaning the current CPU
that was processing this packet now allocates an ``skb``, populates it, and
passes it onwards into the GRO engine. This is equivalent to the default
packet handling behavior without XDP. With ``XDP_TX`` the BPF program has
an efficient option to transmit the network packet out of the same NIC it
just arrived on again. This is typically useful when a few nodes implement,
for example, firewalling with subsequent load balancing in a cluster and
thus act as a hairpinned load balancer, pushing the incoming packets back
into the switch after rewriting them in XDP BPF. ``XDP_REDIRECT`` is similar
to ``XDP_TX`` in that it is able to transmit the XDP packet, but through
another NIC. Another option for the ``XDP_REDIRECT`` case is to redirect
into a BPF cpumap, meaning the CPUs serving XDP on the NIC's receive queues
can continue to do so and push the packet for processing by the upper kernel
stack to a remote CPU. This is similar to ``XDP_PASS``, but with the ability
that the XDP BPF program can keep serving the incoming high load as opposed
to temporarily spending work on the current packet for pushing it into upper
layers. Last but not least, ``XDP_ABORTED`` denotes an exception-like state
from the program and has the same behavior as ``XDP_DROP``, with the
difference that ``XDP_ABORTED`` passes the ``trace_xdp_exception`` tracepoint
which can be additionally monitored to detect misbehavior.
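
As a small illustration of these verdicts, the following sketch of an XDP
program drops all IPv4 UDP traffic and passes everything else up the stack;
the includes, the byte order macro and the section name are assumptions made
for the example:

::

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <asm/byteorder.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    __section("prog")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph  = data + sizeof(*eth);

        /* Bounds checks are mandatory for the verifier. */
        if (data + sizeof(*eth) + sizeof(*iph) > data_end)
            return XDP_PASS;
        if (eth->h_proto != __constant_htons(ETH_P_IP))
            return XDP_PASS;
        if (iph->protocol == IPPROTO_UDP)
            return XDP_DROP;    /* dropped before any skb allocation */

        return XDP_PASS;        /* handed to the regular networking stack */
    }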
  3666  
  3667  **Use cases for XDP**
  3668  
Some of the main use cases for XDP are presented in this subsection. The
list is non-exhaustive and, given the programmability and efficiency that
XDP and BPF enable, it can easily be adapted to solve very specific use
cases.
  3673  
  3674  * **DDoS mitigation, firewalling**
  3675  
  One of the basic XDP BPF features is to tell the driver to drop a packet
  with ``XDP_DROP`` at this early stage, which allows for any kind of efficient
  network policy enforcement at an extremely low per-packet cost. This is
  ideal in situations when needing to cope with any sort of DDoS attacks,
  but it more generally also allows implementing any sort of firewalling
  policies with close to no overhead in BPF, for example, either as a stand
  alone appliance (e.g. scrubbing 'clean' traffic through ``XDP_TX``) or
  widely deployed on nodes protecting end hosts themselves (via ``XDP_PASS``
  or cpumap ``XDP_REDIRECT`` for good traffic). Offloaded XDP takes this even
  one step further by moving the already small per-packet cost entirely
  into the NIC with processing at line-rate.
  3687  
  3688  ..
  3689  
  3690  * **Forwarding and load-balancing**
  3691  
  3692    Another major use case of XDP is packet forwarding and load-balancing
  3693    through either ``XDP_TX`` or ``XDP_REDIRECT`` actions. The packet can
  3694    be arbitrarily mangled by the BPF program running in the XDP layer,
  even BPF helper functions are available for increasing or decreasing
  the packet's headroom in order to arbitrarily encapsulate or decapsulate
  the packet before sending it out again. With ``XDP_TX``
  3698    hairpinned load-balancers can be implemented that push the packet out
  3699    of the same networking device it originally arrived on, or with the
  3700    ``XDP_REDIRECT`` action it can be forwarded to another NIC for
  3701    transmission. The latter return code can also be used in combination
  3702    with BPF's cpumap to load-balance packets for passing up the local
  3703    stack, but on remote, non-XDP processing CPUs.
  3704  
  3705  ..
  3706  
  3707  * **Pre-stack filtering / processing**
  3708  
  Besides policy enforcement, XDP can also be used for hardening the
  kernel's networking stack with the help of the ``XDP_DROP`` action, meaning
  it can drop irrelevant packets for a local node right at the earliest
  possible point before the networking stack sees them. For example, given
  that a node only serves TCP traffic, any UDP, SCTP or other L4 traffic can
  be dropped right away. This has the advantage that packets do not need to
  traverse various entities like the GRO engine, the kernel's flow dissector
  and others before it can be determined that they should be dropped, thus
  reducing the kernel's attack surface. Thanks to XDP's early processing
  stage, this effectively 'pretends' to the kernel's networking stack that
  these packets have never been seen by the networking device. Additionally,
  if a potential bug in the stack's receive path got uncovered and would
  cause a 'ping of death' like scenario, XDP can be utilized to drop such
  packets right away without having to reboot the kernel or restart any
  services. Due to the ability to atomically swap such programs to enforce
  a drop of bad packets, no network traffic is even interrupted on a host.
  3726  
  3727    Another use case for pre-stack processing is that given the kernel has not
  3728    yet allocated an ``skb`` for the packet, the BPF program is free to modify
  3729    the packet and, again, have it 'pretend' to the stack that it was received
  3730    by the networking device this way. This allows for cases such as having
  3731    custom packet mangling and encapsulation protocols where the packet can be
  3732    decapsulated prior to entering GRO aggregation in which GRO otherwise would
  3733    not be able to perform any sort of aggregation due to not being aware of
  the custom protocol. XDP also allows pushing metadata (non-packet data) in
  3735    front of the packet. This is 'invisible' to the normal kernel stack, can
  3736    be GRO aggregated (for matching metadata) and later on processed in
  3737    coordination with a tc ingress BPF program where it has the context of
  3738    a ``skb`` available for e.g. setting various skb fields.
  3739  
  3740  ..
  3741  
  3742  * **Flow sampling, monitoring**
  3743  
  3744    XDP can also be used for cases such as packet monitoring, sampling or any
  3745    other network analytics, for example, as part of an intermediate node in
  3746    the path or on end hosts in combination also with prior mentioned use cases.
  3747    For complex packet analysis, XDP provides a facility to efficiently push
  3748    network packets (truncated or with full payload) and custom metadata into
  a fast, lockless, per-CPU memory mapped ring buffer provided by the Linux
  perf infrastructure to a user space application. This also allows for
  cases where only a flow's initial data is analyzed and, once determined
  to be good traffic, monitoring is bypassed for it. Thanks to the flexibility
  3753    brought by BPF, this allows for implementing any sort of custom monitoring
  3754    or sampling.
  3755  
  3756  ..
  3757  
  3758  One example of XDP BPF production usage is Facebook's SHIV and Droplet
  3759  infrastructure which implement their L4 load-balancing and DDoS countermeasures.
  3760  Migrating their production infrastructure away from netfilter's IPVS
  3761  (IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared
  3762  to their previous IPVS setup. This was first presented at the netdev 2.1
  3763  conference:
  3764  
  3765  * Slides: https://www.netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
  3766  * Video: https://youtu.be/YEU2ClcGqts
  3767  
  3768  Another example is the integration of XDP into Cloudflare's DDoS mitigation
  3769  pipeline, which originally was using cBPF instead of eBPF for attack signature
matching through iptables' ``xt_bpf`` module. Due to the use of iptables this
caused severe performance problems under attack, where a user space bypass
solution was deemed necessary but came with drawbacks as well, such as needing
to busy poll the NIC and expensive packet re-injection into the kernel's stack.
The migration over to eBPF and XDP combined the best of both worlds by having
  3775  high-performance programmable packet processing directly inside the kernel:
  3776  
  3777  * Slides: https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP.pdf
  3778  * Video: https://youtu.be/7OuOukmuivg
  3779  
  3780  **XDP operation modes**
  3781  
XDP has three operation modes, with 'native' XDP being the default mode. When
XDP is talked about, this mode is typically implied. A short sketch of how
each mode can be requested when attaching a program with iproute2 follows the
list below.
  3784  
  3785  * **Native XDP**
  3786  
  This is the default mode where the XDP BPF program is run directly out
  of the networking driver's early receive path. Most widely used NICs
  for 10G and higher already support native XDP.
  3790  
  3791  ..
  3792  
  3793  * **Offloaded XDP**
  3794  
  3795    In the offloaded XDP mode the XDP BPF program is directly offloaded into
  3796    the NIC instead of being executed on the host CPU. Thus, the already
  3797    extremely low per-packet cost is pushed off the host CPU entirely and
  3798    executed on the NIC, providing even higher performance than running in
  3799    native XDP. This offload is typically implemented by SmartNICs
  containing multi-threaded, multicore flow processors where an in-kernel
  JIT compiler translates BPF into native instructions for the latter.
  Drivers supporting offloaded XDP usually also support native XDP for
  cases where some BPF helpers may not yet be available, or may only be
  available, for the native mode.
  3805  
  3806  ..
  3807  
  3808  * **Generic XDP**
  3809  
  For drivers not yet implementing native or offloaded XDP, the kernel
  provides an option for generic XDP which does not require any driver
  changes since it runs at a much later point out of the networking stack.
  3813    This setting is primarily targeted at developers who want to write and
  3814    test programs against the kernel's XDP API, and will not operate at the
  3815    performance rate of the native or offloaded modes. For XDP usage in a
  3816    production environment either the native or offloaded mode is better
  3817    suited and the recommended way to run XDP.
  3818  
  3819  ..
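
Assuming iproute2 is used as the loader, the three modes can typically be
requested explicitly through the ``xdpdrv`` (native), ``xdpoffload``
(offloaded) and ``xdpgeneric`` (generic) keywords with one of the following
commands, provided the kernel and driver support the mode in question; object
and section names are just placeholders here:

::

    # ip link set dev em1 xdpdrv obj prog.o sec prog
    # ip link set dev em1 xdpoffload obj prog.o sec prog
    # ip link set dev em1 xdpgeneric obj prog.o sec prog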
  3820  
  3821  **Driver support**
  3822  
Since BPF and XDP are evolving quickly in terms of feature and driver support,
the following lists the native and offloaded XDP drivers as of kernel 4.17.
  3825  
  3826  **Drivers supporting native XDP**
  3827  
  3828  * **Broadcom**
  3829  
  3830    * bnxt
  3831  
  3832  ..
  3833  
  3834  * **Cavium**
  3835  
  3836    * thunderx
  3837  
  3838  ..
  3839  
  3840  * **Intel**
  3841  
  3842    * ixgbe
  3843    * ixgbevf
  3844    * i40e
  3845  
  3846  ..
  3847  
  3848  * **Mellanox**
  3849  
  3850    * mlx4
  3851    * mlx5
  3852  
  3853  ..
  3854  
  3855  * **Netronome**
  3856  
  3857    * nfp
  3858  
  3859  ..
  3860  
  3861  * **Others**
  3862  
  3863    * tun
  3864    * virtio_net
  3865  
  3866  ..
  3867  
  3868  * **Qlogic**
  3869  
  3870    * qede
  3871  
  3872  ..
  3873  
  3874  * **Solarflare**
  3875  
  3876    * sfc [1]_
  3877  
  3878  **Drivers supporting offloaded XDP**
  3879  
  3880  * **Netronome**
  3881  
  3882    * nfp [2]_
  3883  
  3884  Note that examples for writing and loading XDP programs are included in
  3885  the toolchain section under the respective tools.
  3886  
  3887  .. [1] XDP for sfc available via out of tree driver as of kernel 4.17, but
  3888     will be upstreamed soon.
  3889  .. [2] Some BPF helper functions such as retrieving the current CPU number
  3890     will not be available in an offloaded setting.
  3891  
  3892  tc (traffic control)
  3893  --------------------
  3894  
  3895  Aside from other program types such as XDP, BPF can also be used out of the
kernel's tc (traffic control) layer in the networking data path. At a high level
  3897  there are three major differences when comparing XDP BPF programs to tc BPF
  3898  ones:
  3899  
  3900  * The BPF input context is a ``sk_buff`` not a ``xdp_buff``. When the kernel's
  3901    networking stack receives a packet, after the XDP layer, it allocates a buffer
  3902    and parses the packet to store metadata about the packet. This representation
  3903    is known as the ``sk_buff``. This structure is then exposed in the BPF input
  3904    context so that BPF programs from the tc ingress layer can use the metadata that
  3905    the stack extracts from the packet. This can be useful, but comes with an
  3906    associated cost of the stack performing this allocation and metadata extraction,
  3907    and handling the packet until it hits the tc hook. By definition, the ``xdp_buff``
  3908    doesn't have access to this metadata because the XDP hook is called before
  3909    this work is done. This is a significant contributor to the performance
  3910    difference between the XDP and tc hooks.
  3911  
  3912    Therefore, BPF programs attached to the tc BPF hook can, for instance, read or
  3913    write the skb's ``mark``, ``pkt_type``, ``protocol``, ``priority``,
  3914    ``queue_mapping``, ``napi_id``, ``cb[]`` array, ``hash``, ``tc_classid`` or
  3915    ``tc_index``, vlan metadata, the XDP transferred custom metadata and various
  3916    other information. All members of the ``struct __sk_buff`` BPF context used
  3917    in tc BPF are defined in the ``linux/bpf.h`` system header.
  3918  
  3919    Generally, the ``sk_buff`` is of a completely different nature than
  3920    ``xdp_buff`` where both come with advantages and disadvantages. For example,
  the ``sk_buff`` case has the advantage that it is rather straightforward to
  3922    mangle its associated metadata, however, it also contains a lot of protocol
  3923    specific information (e.g. GSO related state) which makes it difficult to
  3924    simply switch protocols by solely rewriting the packet data. This is due to
  3925    the stack processing the packet based on the metadata rather than having the
  3926    cost of accessing the packet contents each time. Thus, additional conversion
  3927    is required from BPF helper functions taking care that ``sk_buff`` internals
  3928    are properly converted as well. The ``xdp_buff`` case however does not
  3929    face such issues since it comes at such an early stage where the kernel
  3930    has not even allocated an ``sk_buff`` yet, thus packet rewrites of any
  3931    kind can be realized trivially. However, the ``xdp_buff`` case has the
  3932    disadvantage that ``sk_buff`` metadata is not available for mangling
  3933    at this stage. The latter is overcome by passing custom metadata from
  3934    XDP BPF to tc BPF, though. In this way, the limitations of each program
  3935    type can be overcome by operating complementary programs of both types
  3936    as the use case requires.
  3937  
  3938  ..
  3939  
  3940  * Compared to XDP, tc BPF programs can be triggered out of ingress and also
  3941    egress points in the networking data path as opposed to ingress only in
  3942    the case of XDP.
  3943  
  3944    The two hook points ``sch_handle_ingress()`` and ``sch_handle_egress()`` in
  3945    the kernel are triggered out of ``__netif_receive_skb_core()`` and
  3946    ``__dev_queue_xmit()``, respectively. The latter two are the main receive
  3947    and transmit functions in the data path that, setting XDP aside, are triggered
  3948    for every network packet going in or coming out of the node allowing for
  3949    full visibility for tc BPF programs at these hook points.
  3950  
  3951  ..
  3952  
  3953  * The tc BPF programs do not require any driver changes since they are run
  3954    at hook points in generic layers in the networking stack. Therefore, they
  3955    can be attached to any type of networking device.
  3956  
  3957    While this provides flexibility, it also trades off performance compared
  3958    to running at the native XDP layer. However, tc BPF programs still come
  3959    at the earliest point in the generic kernel's networking data path after
  3960    GRO has been run but **before** any protocol processing, traditional iptables
  3961    firewalling such as iptables PREROUTING or nftables ingress hooks or other
  3962    packet processing takes place. Likewise on egress, tc BPF programs execute
  3963    at the latest point before handing the packet to the driver itself for
  3964    transmission, meaning **after** traditional iptables firewalling hooks like
  3965    iptables POSTROUTING, but still before handing the packet to the kernel's
  3966    GSO engine.
  3967  
  One exception which does require driver changes, however, is offloaded tc
  BPF programs, typically provided by SmartNICs in a similar way as offloaded
  XDP, just with a differing set of features due to the differences in the
  BPF input context, helper functions and verdict codes.
  3972  
  3973  ..
  3974  
BPF programs run in the tc layer are run from the ``cls_bpf`` classifier.
While the tc terminology describes the BPF attachment point as a "classifier",
this is a bit misleading since it under-represents what ``cls_bpf`` is capable
of: a fully programmable packet processor which can not only read the ``skb``
metadata and packet data, but also arbitrarily mangle both and terminate the
tc processing with an action verdict. ``cls_bpf`` can thus be regarded as a
self-contained entity that manages and executes tc BPF programs.
  3983  
  3984  ``cls_bpf`` can hold one or more tc BPF programs. In the case where Cilium
  3985  deploys ``cls_bpf`` programs, it attaches only a single program for a given hook
  3986  in ``direct-action`` mode. Typically, in the traditional tc scheme, there is a
  3987  split between classifier and action modules, where the classifier has one
  3988  or more actions attached to it that are triggered once the classifier has a
match. For modern use of tc in the software data path, this model does not
scale well for complex packet processing. Given that tc BPF programs
  3991  attached to ``cls_bpf`` are fully self-contained, they effectively fuse the
  3992  parsing and action process together into a single unit. Thanks to ``cls_bpf``'s
  3993  ``direct-action`` mode, it will just return the tc action verdict and
  3994  terminate the processing pipeline immediately. This allows for implementing
  3995  scalable programmable packet processing in the networking data path by avoiding
  3996  linear iteration of actions. ``cls_bpf`` is the only such "classifier" module
  3997  in the tc layer capable of such a fast-path.
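
For illustration, a minimal sketch of such a self-contained ``direct-action``
program, here simply consulting the ``skb`` metadata and returning a verdict,
could look as follows; the section name, the includes and the mark value are
assumptions made for the example:

::

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    __section("main")
    int cls_main(struct __sk_buff *skb)
    {
        /* Example policy: drop packets carrying a given mark which, for
         * instance, an earlier program in the processing path has set.
         */
        if (skb->mark == 0xcafe)
            return TC_ACT_SHOT;

        /* Pass everything else on without further side effects. */
        return TC_ACT_OK;
    }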
  3998  
  3999  Like XDP BPF programs, tc BPF programs can be atomically updated at runtime
  4000  via ``cls_bpf`` without interrupting any network traffic or having to restart
  4001  services.
  4002  
Both the tc ingress and the egress hook to which ``cls_bpf`` itself can be
attached are managed by a pseudo qdisc called ``sch_clsact``. This is a
drop-in replacement and proper superset of the ingress qdisc since it
is able to manage both the ingress and egress tc hooks. For tc's egress hook
  4007  in ``__dev_queue_xmit()`` it is important to stress that it is not executed
  4008  under the kernel's qdisc root lock. Thus, both tc ingress and egress hooks
  4009  are executed in a lockless manner in the fast-path. In either case, preemption
  4010  is disabled and execution happens under RCU read side.
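
In practice, the ``sch_clsact`` qdisc is added once per device, after which
``cls_bpf`` programs can be attached in ``direct-action`` mode to its ingress
and / or egress hook, for example (object and section names as in the earlier
ingress example):

::

    # tc qdisc add dev em1 clsact
    # tc filter add dev em1 ingress bpf da obj prog.o sec main
    # tc filter add dev em1 egress bpf da obj prog.o sec main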
  4011  
  4012  Typically on egress there are qdiscs attached to netdevices such as ``sch_mq``,
  4013  ``sch_fq``, ``sch_fq_codel`` or ``sch_htb`` where some of them are classful
  4014  qdiscs that contain subclasses and thus require a packet classification
  4015  mechanism to determine a verdict where to demux the packet. This is handled
  4016  by a call to ``tcf_classify()`` which calls into tc classifiers if present.
  4017  ``cls_bpf`` can also be attached and used in such cases. Such operation usually
  4018  happens under the qdisc root lock and can be subject to lock contention. The
  4019  ``sch_clsact`` qdisc's egress hook comes at a much earlier point however which
  4020  does not fall under that and operates completely independent from conventional
egress qdiscs. Thus, for cases like ``sch_htb``, the ``sch_clsact`` qdisc can
perform the heavy lifting of packet classification through tc BPF outside of
the qdisc root lock, setting ``skb->mark`` or ``skb->priority`` from there,
such that ``sch_htb`` only requires a flat mapping without expensive packet
classification under the root lock, thus reducing contention.
  4026  
Offloaded tc BPF programs are supported for the case of ``sch_clsact`` in
combination with ``cls_bpf``, where the previously loaded BPF program was
JITed by the SmartNIC driver in order to run natively on the NIC. Only
``cls_bpf`` programs operating in ``direct-action`` mode are supported for
offloading. ``cls_bpf`` only supports offloading a single program and cannot
offload multiple programs. Furthermore, only the ingress hook supports
offloading BPF programs.
  4034  
  4035  One ``cls_bpf`` instance is able to hold multiple tc BPF programs internally.
  4036  If this is the case, then the ``TC_ACT_UNSPEC`` program return code will
  4037  continue execution with the next tc BPF program in that list. However, this
  4038  has the drawback that several programs would need to parse the packet over
  4039  and over again resulting in degraded performance.
  4040  
  4041  **BPF program return codes**
  4042  
  4043  Both the tc ingress and egress hook share the same action return verdicts
  4044  that tc BPF programs can use. They are defined in the ``linux/pkt_cls.h``
  4045  system header:
  4046  
  4047  ::
  4048  
  4049      #define TC_ACT_UNSPEC         (-1)
  4050      #define TC_ACT_OK               0
  4051      #define TC_ACT_SHOT             2
  4052      #define TC_ACT_STOLEN           4
  4053      #define TC_ACT_REDIRECT         7
  4054  
There are a few more action ``TC_ACT_*`` verdicts available in the system
header file which are also used in the two hooks. However, they share the
same semantics as the ones above. Meaning, from a tc BPF perspective,
``TC_ACT_OK`` and ``TC_ACT_RECLASSIFY`` have the same semantics, as do the
three ``TC_ACT_STOLEN``, ``TC_ACT_QUEUED`` and ``TC_ACT_TRAP`` opcodes.
Therefore, for these cases we only describe ``TC_ACT_OK`` and the ``TC_ACT_STOLEN``
opcode for the two groups.
  4062  
Starting out with ``TC_ACT_UNSPEC``: it has the meaning of an "unspecified
action" and is used in three cases, i) when an offloaded tc BPF program is
attached and the tc ingress hook is run, where the ``cls_bpf`` representation
for the offloaded program will return ``TC_ACT_UNSPEC``, ii) in order to
continue with the next tc BPF program in ``cls_bpf`` for the multi-program
case. The latter also works in combination with offloaded tc BPF programs
from point i), where the ``TC_ACT_UNSPEC`` from there continues with the next
tc BPF program which solely runs in the non-offloaded case. Last but not
least, iii) ``TC_ACT_UNSPEC`` is also used for the single program case to
simply tell the kernel to continue with the ``skb`` without additional
side effects. ``TC_ACT_UNSPEC`` is very
  4073  similar to the ``TC_ACT_OK`` action code in the sense that both pass the
  4074  ``skb`` onwards either to upper layers of the stack on ingress or down to
  4075  the networking device driver for transmission on egress, respectively. The
  4076  only difference to ``TC_ACT_OK`` is that ``TC_ACT_OK`` sets ``skb->tc_index``
  4077  based on the classid the tc BPF program set. The latter is set out of the
  4078  tc BPF program itself through ``skb->tc_classid`` from the BPF context.
  4079  
  4080  ``TC_ACT_SHOT`` instructs the kernel to drop the packet, meaning, upper
  4081  layers of the networking stack will never see the ``skb`` on ingress and
  4082  similarly the packet will never be submitted for transmission on egress.
``TC_ACT_SHOT`` and ``TC_ACT_STOLEN`` are both similar in nature with a few
  4084  differences: ``TC_ACT_SHOT`` will indicate to the kernel that the ``skb``
  4085  was released through ``kfree_skb()`` and return ``NET_XMIT_DROP`` to the
  4086  callers for immediate feedback, whereas ``TC_ACT_STOLEN`` will release
  4087  the ``skb`` through ``consume_skb()`` and pretend to upper layers that
the transmission was successful through ``NET_XMIT_SUCCESS``. perf's
  4089  drop monitor which records traces of ``kfree_skb()`` will therefore
  4090  also not see any drop indications from ``TC_ACT_STOLEN`` since its
  4091  semantics are such that the ``skb`` has been "consumed" or queued but
  4092  certainly not "dropped".
  4093  
Last but not least, there is the ``TC_ACT_REDIRECT`` action which is available
for tc BPF programs as well. It allows redirecting the ``skb`` to the same or
another device's ingress or egress path in combination with the ``bpf_redirect()``
helper. Being able to inject the packet into another device's ingress or
egress direction allows for full flexibility in packet forwarding with
BPF. There are no requirements on the target networking device other than
being a networking device itself; there is no need to run another instance
of ``cls_bpf`` on the target device or other such restrictions.
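
A minimal sketch of using the helper from a tc BPF program could look as
follows, where the target ifindex is a made-up constant for illustration and
would typically come from a BPF map or configuration, and the hand-written
helper declaration as well as the section name are assumptions for the example:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    /* Helper declared by hand for the sketch. */
    static int (*bpf_redirect)(__u32 ifindex, __u64 flags) =
        (void *) BPF_FUNC_redirect;

    /* Hypothetical target device. */
    #define TARGET_IFINDEX 4

    __section("redir")
    int tc_redirect(struct __sk_buff *skb)
    {
        /* Push the packet into the egress path of TARGET_IFINDEX; passing
         * BPF_F_INGRESS as flag instead would inject it into that device's
         * ingress path. The helper's return value is used as the tc verdict.
         */
        return bpf_redirect(TARGET_IFINDEX, 0);
    }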
  4102  
  4103  **tc BPF FAQ**
  4104  
  4105  This section contains a few miscellaneous question and answer pairs related to
  4106  tc BPF programs that are asked from time to time.
  4107  
  4108  * **Question:** What about ``act_bpf`` as a tc action module, is it still relevant?
  4109  * **Answer:** Not really. Although ``cls_bpf`` and ``act_bpf`` share the same
  4110    functionality for tc BPF programs, ``cls_bpf`` is more flexible since it is a
  4111    proper superset of ``act_bpf``. The way tc works is that tc actions need to be
  4112    attached to tc classifiers. In order to achieve the same flexibility as ``cls_bpf``,
  4113    ``act_bpf`` would need to be attached to the ``cls_matchall`` classifier. As the
  name says, this will match on every packet in order to pass it through for attached
  tc action processing. For ``act_bpf``, this will result in less efficient packet
  4116    processing than using ``cls_bpf`` in ``direct-action`` mode directly. If ``act_bpf``
  4117    is used in a setting with other classifiers than ``cls_bpf`` or ``cls_matchall``
  4118    then this will perform even worse due to the nature of operation of tc classifiers.
  4119    Meaning, if classifier A has a mismatch, then the packet is passed to classifier
  4120    B, reparsing the packet, etc, thus in the typical case there will be linear
  4121    processing where the packet would need to traverse N classifiers in the worst
  4122    case to find a match and execute ``act_bpf`` on that. Therefore, ``act_bpf`` has
  4123    never been largely relevant. Additionally, ``act_bpf`` does not provide a tc
  4124    offloading interface either compared to ``cls_bpf``.
  4125  
  4126  ..
  4127  
  4128  * **Question:** Is it recommended to use ``cls_bpf`` not in ``direct-action`` mode?
  4129  * **Answer:** No. The answer is similar to the one above in that this is otherwise
  4130    unable to scale for more complex processing. tc BPF can already do everything needed
  4131    by itself in an efficient manner and thus there is no need for anything other than
  4132    ``direct-action`` mode.
  4133  
  4134  ..
  4135  
  4136  * **Question:** Is there any performance difference in offloaded ``cls_bpf`` and
  4137    offloaded XDP?
  4138  * **Answer:** No. Both are JITed through the same compiler in the kernel which
  4139    handles the offloading to the SmartNIC and the loading mechanism for both is
  4140    very similar as well. Thus, the BPF program gets translated into the same target
  4141    instruction set in order to be able to run on the NIC natively. The two tc BPF
  4142    and XDP BPF program types have a differing set of features, so depending on the
  4143    use case one might be picked over the other due to availability of certain helper
  4144    functions in the offload case, for example.
  4145  
  4146  **Use cases for tc BPF**
  4147  
  4148  Some of the main use cases for tc BPF programs are presented in this subsection.
  4149  Also here, the list is non-exhaustive and given the programmability and efficiency
  4150  of tc BPF, it can easily be tailored and integrated into orchestration systems
  4151  in order to solve very specific use cases. While some use cases with XDP may overlap,
  4152  tc BPF and XDP BPF are mostly complementary to each other and both can also be
used at the same time or one over the other, depending on which is most suitable
for the given problem to solve.
  4155  
  4156  * **Policy enforcement for containers**
  4157  
  4158    One application which tc BPF programs are suitable for is to implement policy
  4159    enforcement, custom firewalling or similar security measures for containers or
  4160    pods, respectively. In the conventional case, container isolation is implemented
  4161    through network namespaces with veth networking devices connecting the host's
  4162    initial namespace with the dedicated container's namespace. Since one end of
  4163    the veth pair has been moved into the container's namespace whereas the other
  4164    end remains in the initial namespace of the host, all network traffic from the
  4165    container has to pass through the host-facing veth device allowing for attaching
  4166    tc BPF programs on the tc ingress and egress hook of the veth. Network traffic
  4167    going into the container will pass through the host-facing veth's tc egress
  4168    hook whereas network traffic coming from the container will pass through the
  4169    host-facing veth's tc ingress hook.
  4170  
  For virtual devices like veth devices, XDP is unsuitable in this case since the
  kernel operates solely on an ``skb`` here and generic XDP has a few limitations
  where it does not operate with cloned ``skb``'s. The latter are heavily used
  by the TCP/IP stack in order to hold data segments for retransmission, where
  the generic XDP hook would simply get bypassed instead. Moreover, generic XDP
  4176    needs to linearize the entire ``skb`` resulting in heavily degraded performance.
  4177    tc BPF on the other hand is more flexible as it specializes on the ``skb``
  4178    input context case and thus does not need to cope with the limitations from
  4179    generic XDP.
  4180  
  4181  ..
  4182  
  4183  * **Forwarding and load-balancing**
  4184  
  4185    The forwarding and load-balancing use case is quite similar to XDP, although
  4186    slightly more targeted towards east-west container workloads rather than
  4187    north-south traffic (though both technologies can be used in either case).
  4188    Since XDP is only available on ingress side, tc BPF programs allow for
  4189    further use cases that apply in particular on egress, for example, container
  4190    based traffic can already be NATed and load-balanced on the egress side
  through BPF out of the initial namespace such that this is done transparently
  to the container itself. Egress traffic is already based on the ``sk_buff``
  4193    structure due to the nature of the kernel's networking stack, so packet
  4194    rewrites and redirects are suitable out of tc BPF. By utilizing the
  4195    ``bpf_redirect()`` helper function, BPF can take over the forwarding logic
  4196    to push the packet either into the ingress or egress path of another networking
  device. Thus, by utilizing tc BPF as the forwarding fabric, any bridge-like
  devices become unnecessary as well.
  4199  
  4200  ..
  4201  
  4202  * **Flow sampling, monitoring**
  4203  
  As in the XDP case, flow sampling and monitoring can be realized through a
  4205    high-performance lockless per-CPU memory mapped perf ring buffer where the
  4206    BPF program is able to push custom data, the full or truncated packet
  4207    contents, or both up to a user space application. From the tc BPF program
  4208    this is realized through the ``bpf_skb_event_output()`` BPF helper function
  4209    which has the same function signature and semantics as ``bpf_xdp_event_output()``.
  4210    Given tc BPF programs can be attached to ingress and egress as opposed to
  4211    only ingress in XDP BPF case plus the two tc hooks are at the lowest layer
  4212    in the (generic) networking stack, this allows for bidirectional monitoring
  of all network traffic from a particular node. This is somewhat related
  to the cBPF case which tcpdump and Wireshark make use of, though without
  having to clone the ``skb`` and while being a lot more flexible in terms of
  programmability where, for example, BPF can already perform in-kernel
  aggregation rather than pushing everything up to user space, as well as add
  4218    custom annotations for packets pushed into the ring buffer. The latter is
  4219    also heavily used in Cilium where packet drops can be further annotated
  4220    to correlate container labels and reasons for why a given packet had to
  4221    be dropped (such as due to policy violation) in order to provide a richer
  4222    context.
  4223  
  4224  ..
  4225  
  4226  * **Packet scheduler pre-processing**
  4227  
  4228    The ``sch_clsact``'s egress hook which is called ``sch_handle_egress()``
  4229    runs right before taking the kernel's qdisc root lock, thus tc BPF programs
  4230    can be utilized to perform all the heavy lifting packet classification
  4231    and mangling before the packet is transmitted into a real full blown
  4232    qdisc such as ``sch_htb``. This type of interaction of ``sch_clsact``
  4233    with a real qdisc like ``sch_htb`` coming later in the transmission phase
  allows reducing the lock contention on transmission since ``sch_clsact``'s
  4235    egress hook is executed without taking locks.
  4236  
  4237  ..
  4238  
  4239  One concrete example user of tc BPF but also XDP BPF programs is Cilium.
  4240  Cilium is open source software for transparently securing the network
  4241  connectivity between application services deployed using Linux container
  4242  management platforms like Docker and Kubernetes and operates at Layer 3/4
as well as Layer 7. At the heart of Cilium, BPF is used in order to
implement the policy enforcement as well as load balancing and monitoring.
  4245  
  4246  * Slides: https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
  4247  * Video: https://youtu.be/ilKlmTDdFgk
  4248  * Github: https://github.com/cilium/cilium
  4249  
  4250  **Driver support**
  4251  
  4252  Since tc BPF programs are triggered from the kernel's networking stack
  4253  and not directly out of the driver, they do not require any extra driver
  4254  modification and therefore can run on any networking device. The only
  4255  exception listed below is for offloading tc BPF programs to the NIC.
  4256  
  4257  **Drivers supporting offloaded tc BPF**
  4258  
  4259  * **Netronome**
  4260  
  4261    * nfp [2]_
  4262  
Note that, here as well, examples for writing and loading tc BPF programs
are included in the toolchain section under the respective tools.
  4265  
  4266  .. _bpf_users:
  4267  
  4268  Further Reading
  4269  ===============
  4270  
The mentioned lists of docs, projects, talks, papers, and further reading
material are likely not complete. Thus, feel free to open pull requests
to complete them.
  4274  
  4275  Kernel Developer FAQ
  4276  --------------------
  4277  
  4278  Under ``Documentation/bpf/``, the Linux kernel provides two FAQ files that
  4279  are mainly targeted for kernel developers involved in the BPF subsystem.
  4280  
  4281  * **BPF Devel FAQ:** this document provides mostly information around patch
  4282    submission process as well as BPF kernel tree, stable tree and bug
  4283    reporting workflows, questions around BPF's extensibility and interaction
  4284    with LLVM and more.
  4285  
  4286    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/bpf/bpf_devel_QA.rst
  4287  
  4288  ..
  4289  
  4290  * **BPF Design FAQ:** this document tries to answer frequently asked questions
  4291    around BPF design decisions related to the instruction set, verifier,
  4292    calling convention, JITs, etc.
  4293  
  4294    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/bpf/bpf_design_QA.rst
  4295  
  4296  Projects using BPF
  4297  ------------------
  4298  
The following list includes a selection of open source projects which make
use of BPF or provide tooling for it. In this context, specifically the eBPF
instruction set is meant, as opposed to projects utilizing the legacy cBPF:
  4303  
  4304  **Tracing**
  4305  
  4306  * **BCC**
  4307  
  4308    BCC stands for BPF Compiler Collection and its key feature is to provide
  4309    a set of easy to use and efficient kernel tracing utilities all based
  4310    upon BPF programs hooking into kernel infrastructure based upon kprobes,
  4311    kretprobes, tracepoints, uprobes, uretprobes as well as USDT probes. The
  collection provides close to a hundred tools targeting different layers
  4313    across the stack from applications, system libraries, to the various
  4314    different kernel subsystems in order to analyze a system's performance
  4315    characteristics or problems. Additionally, BCC provides an API in order
  4316    to be used as a library for other projects.
  4317  
  4318    https://github.com/iovisor/bcc
  4319  
  4320  ..
  4321  
  4322  * **bpftrace**
  4323  
  4324    bpftrace is a DTrace-style dynamic tracing tool for Linux and uses LLVM
  4325    as a back end to compile scripts to BPF-bytecode and makes use of BCC
  4326    for interacting with the kernel's BPF tracing infrastructure. It provides
  4327    a higher-level language for implementing tracing scripts compared to
  4328    native BCC.
  4329  
  4330    https://github.com/ajor/bpftrace
  4331  
  4332  ..
  4333  
  4334  * **perf**
  4335  
  4336    The perf tool which is developed by the Linux kernel community as
  4337    part of the kernel source tree provides a way to load tracing BPF
  4338    programs through the conventional perf record subcommand where the
  4339    aggregated data from BPF can be retrieved and post processed in
  4340    perf.data for example through perf script and other means.
  4341  
  4342    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf
  4343  
  4344  ..
  4345  
  4346  * **ply**
  4347  
  4348    ply is a tracing tool that follows the 'Little Language' approach of
  4349    yore, and compiles ply scripts into Linux BPF programs that are attached
  4350    to kprobes and tracepoints in the kernel. The scripts have a C-like syntax,
  4351    heavily inspired by DTrace and by extension awk. ply keeps dependencies
  to a very minimum and only requires flex and bison at build time, and only libc
  4353    at runtime.
  4354  
  4355    https://github.com/wkz/ply
  4356  
  4357  ..
  4358  
  4359  * **systemtap**
  4360  
  4361    systemtap is a scripting language and tool for extracting, filtering and
  4362    summarizing data in order to diagnose and analyze performance or functional
  4363    problems. It comes with a BPF back end called stapbpf which translates
  4364    the script directly into BPF without the need of an additional compiler
  and injects the probe into the kernel. Thus, unlike stap's kernel modules,
  this neither has external dependencies nor requires loading kernel
  modules.
  4368  
  4369    https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=summary
  4370  
  4371  ..
  4372  
  4373  * **PCP**
  4374  
  4375    Performance Co-Pilot (PCP) is a system performance and analysis framework
  4376    which is able to collect metrics through a variety of agents as well as
  4377    analyze collected systems' performance metrics in real-time or by using
  4378    historical data. With pmdabcc, PCP has a BCC based performance metrics
  4379    domain agent which extracts data from the kernel via BPF and BCC.
  4380  
  4381    https://github.com/performancecopilot/pcp
  4382  
  4383  ..
  4384  
  4385  * **Weave Scope**
  4386  
  4387    Weave Scope is a cloud monitoring tool collecting data about processes,
  4388    networking connections or other system data by making use of BPF in combination
  4389    with kprobes. Weave Scope works on top of the gobpf library in order to load
  4390    BPF ELF files into the kernel, and comes with a tcptracer-bpf tool which
  4391    monitors connect, accept and close calls in order to trace TCP events.
  4392  
  4393    https://github.com/weaveworks/scope
  4394  
  4395  ..
  4396  
  4397  **Networking**
  4398  
  4399  * **Cilium**
  4400  
  4401    Cilium provides and transparently secures network connectivity and load-balancing
  4402    between application workloads such as application containers or processes. Cilium
  4403    operates at Layer 3/4 to provide traditional networking and security services
  4404    as well as Layer 7 to protect and secure use of modern application protocols
  4405    such as HTTP, gRPC and Kafka. It is integrated into orchestration frameworks
  4406    such as Kubernetes and Mesos, and BPF is the foundational part of Cilium that
  4407    operates in the kernel's networking data path.
  4408  
  4409    https://github.com/cilium/cilium
  4410  
  4411  ..
  4412  
  4413  * **Suricata**
  4414  
  4415    Suricata is a network IDS, IPS and NSM engine, and utilizes BPF as well as XDP
  4416    in three different areas, that is, as BPF filter in order to process or bypass
  4417    certain packets, as a BPF based load balancer in order to allow for programmable
  4418    load balancing and for XDP to implement a bypass or dropping mechanism at high
  4419    packet rates.
  4420  
  4421    http://suricata.readthedocs.io/en/latest/capture-hardware/ebpf-xdp.html
  4422  
  4423    https://github.com/OISF/suricata
  4424  
  4425  ..
  4426  
  4427  * **systemd**
  4428  
  4429    systemd allows for IPv4/v6 accounting as well as implementing network access
  4430    control for its systemd units based on BPF's cgroup ingress and egress hooks.
  4431    Accounting is based on packets / bytes, and ACLs can be specified as address
  4432    prefixes for allow / deny rules. More information can be found at:
  4433  
  4434    http://0pointer.net/blog/ip-accounting-and-access-lists-with-systemd.html
  4435  
  4436    https://github.com/systemd/systemd
  4437  
  4438  ..
  4439  
  4440  * **iproute2**
  4441  
  4442    iproute2 offers the ability to load BPF programs as LLVM generated ELF files
  into the kernel. iproute2 supports both XDP BPF programs and tc BPF
  4444    programs through a common BPF loader backend. The tc and ip command line
  4445    utilities enable loader and introspection functionality for the user.
  4446  
  4447    https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/
  4448  
  4449  ..
  4450  
  4451  * **p4c-xdp**
  4452  
  4453    p4c-xdp presents a P4 compiler backend targeting BPF and XDP. P4 is a domain
  4454    specific language describing how packets are processed by the data plane of
  4455    a programmable network element such as NICs, appliances or switches, and with
  4456    the help of p4c-xdp P4 programs can be translated into BPF C programs which
  4457    can be compiled by clang / LLVM and loaded as BPF programs into the kernel
  4458    at XDP layer for high performance packet processing.
  4459  
  4460    https://github.com/vmware/p4c-xdp
  4461  
  4462  ..
  4463  
  4464  **Others**
  4465  
  4466  * **LLVM**
  4467  
  clang / LLVM provides the BPF back end used to compile C BPF programs into
  BPF instructions contained in ELF files. The LLVM BPF back end is developed
  alongside the BPF core infrastructure in the Linux kernel and maintained by
  the same community. clang / LLVM is a key part of the toolchain for
  developing BPF programs.
  4473  
  4474    https://llvm.org/
  4475  
  4476  ..
  4477  
  4478  * **libbpf**
  4479  
  libbpf is a generic BPF library which is developed by the Linux kernel
  community as part of the kernel source tree and allows for loading
  LLVM-generated ELF files into the kernel and attaching the contained BPF
  programs. The library is used by other kernel tools such as perf and bpftool.
  4484  
  4485    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf
  4486  
  4487  ..
  4488  
  4489  * **bpftool**
  4490  
  bpftool is the main tool for introspecting and debugging BPF programs and
  BPF maps, and like libbpf it is developed by the Linux kernel community.
  It allows for listing all active BPF programs and maps in the system,
  dumping and disassembling the BPF or JITed BPF instructions of a program,
  and dumping and manipulating BPF maps. bpftool also supports interaction
  with the BPF filesystem, loading various program types from an object file
  into the kernel and much more. A few example invocations are sketched after
  this list.
  4498  
  4499    https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool
  4500  
  4501  ..
  4502  
  4503  * **gobpf**
  4504  
  gobpf provides Go bindings for the bcc framework as well as low-level routines
  to load and use BPF programs from ELF files.
  4507  
  4508    https://github.com/iovisor/gobpf
  4509  
  4510  ..
  4511  
  4512  * **ebpf_asm**
  4513  
  ebpf_asm provides an assembler for BPF programs written in an Intel-like
  assembly syntax, and therefore offers an alternative for writing small and
  simple BPF programs directly in assembly without needing the clang / LLVM
  toolchain.
  4518  
  4519    https://github.com/solarflarecom/ebpf_asm
  4520  
  4521  ..
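
As a brief illustration of the bpftool introspection described above, the
following commands list the BPF programs and maps loaded on a system and dump
the translated as well as JITed instructions of one program. The numeric IDs
are placeholders and will differ from system to system::

    # bpftool prog show
    # bpftool map show
    # bpftool prog dump xlated id 42
    # bpftool prog dump jited id 42
    # bpftool map dump id 7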
  4522  
  4523  XDP Newbies
  4524  -----------
  4525  
There are several walk-through posts by David S. Miller to the xdp-newbies
mailing list (http://vger.kernel.org/vger-lists.html#xdp-newbies), which explain
various parts of XDP and BPF:
  4529  
  4530  4. May 2017,
  4531       BPF Verifier Overview,
  4532       David S. Miller,
  4533       https://www.spinics.net/lists/xdp-newbies/msg00185.html
  4534  
  4535  3. May 2017,
  4536       Contextually speaking...,
  4537       David S. Miller,
  4538       https://www.spinics.net/lists/xdp-newbies/msg00181.html
  4539  
  4540  2. May 2017,
  4541       bpf.h and you...,
  4542       David S. Miller,
  4543       https://www.spinics.net/lists/xdp-newbies/msg00179.html
  4544  
  4545  1. Apr 2017,
  4546       XDP example of the day,
  4547       David S. Miller,
  4548       https://www.spinics.net/lists/xdp-newbies/msg00009.html
  4549  
  4550  BPF Newsletter
  4551  --------------
  4552  
Alexander Alemayhu initiated a newsletter around BPF, published roughly once
per week, covering the latest developments around BPF in Linux kernel land and
its surrounding ecosystem in user space.
  4556  
  4557  All BPF update newsletters (01 - 12) can be found here:
  4558  
  4559       https://cilium.io/blog/categories/BPF%20Newsletter
  4560  
  4561  Podcasts
  4562  --------
  4563  
There have been a number of technical podcasts partially covering BPF. An
incomplete list:
  4566  
  4567  5. Feb 2017,
  4568       Linux Networking Update from Netdev Conference,
  4569       Thomas Graf,
  4570       Software Gone Wild, Show 71,
  4571       http://blog.ipspace.net/2017/02/linux-networking-update-from-netdev.html
  4572       http://media.blubrry.com/ipspace/stream.ipspace.net/nuggets/podcast/Show_71-NetDev_Update.mp3
  4573  
  4574  4. Jan 2017,
  4575       The IO Visor Project,
  4576       Brenden Blanco,
  4577       OVS Orbit, Episode 23,
  4578       https://ovsorbit.org/#e23
  4579       https://ovsorbit.org/episode-23.mp3
  4580  
  4581  3. Oct 2016,
  4582       Fast Linux Packet Forwarding,
  4583       Thomas Graf,
  4584       Software Gone Wild, Show 64,
  4585       http://blog.ipspace.net/2016/10/fast-linux-packet-forwarding-with.html
  4586       http://media.blubrry.com/ipspace/stream.ipspace.net/nuggets/podcast/Show_64-Cilium_with_Thomas_Graf.mp3
  4587  
  4588  2. Aug 2016,
  4589       P4 on the Edge,
  4590       John Fastabend,
  4591       OVS Orbit, Episode 11,
  4592       https://ovsorbit.org/#e11
  4593       https://ovsorbit.org/episode-11.mp3
  4594  
  4595  1. May 2016,
  4596       Cilium,
  4597       Thomas Graf,
  4598       OVS Orbit, Episode 4,
  4599       https://ovsorbit.org/#e4
  4600       https://ovsorbit.benpfaff.org/episode-4.mp3
  4601  
  4602  Blog posts
  4603  ----------
  4604  
  4605  The following (incomplete) list includes blog posts around BPF, XDP and related projects:
  4606  
  4607  34. May 2017,
  4608       An entertaining eBPF XDP adventure,
  4609       Suchakra Sharma,
  4610       https://suchakra.wordpress.com/2017/05/23/an-entertaining-ebpf-xdp-adventure/
  4611  
  4612  33. May 2017,
  4613       eBPF, part 2: Syscall and Map Types,
  4614       Ferris Ellis,
  4615       https://ferrisellis.com/posts/ebpf_syscall_and_maps/
  4616  
  4617  32. May 2017,
  4618       Monitoring the Control Plane,
  4619       Gary Berger,
  4620       http://firstclassfunc.com/2017/05/monitoring-the-control-plane/
  4621  
  4622  31. Apr 2017,
  4623       USENIX/LISA 2016 Linux bcc/BPF Tools,
  4624       Brendan Gregg,
  4625       http://www.brendangregg.com/blog/2017-04-29/usenix-lisa-2016-bcc-bpf-tools.html
  4626  
  4627  30. Apr 2017,
  4628       Liveblog: Cilium for Network and Application Security with BPF and XDP,
  4629       Scott Lowe,
  4630       http://blog.scottlowe.org//2017/04/18/black-belt-cilium/
  4631  
  4632  29. Apr 2017,
  4633       eBPF, part 1: Past, Present, and Future,
  4634       Ferris Ellis,
  4635       https://ferrisellis.com/posts/ebpf_past_present_future/
  4636  
  4637  28. Mar 2017,
  4638       Analyzing KVM Hypercalls with eBPF Tracing,
  4639       Suchakra Sharma,
  4640       https://suchakra.wordpress.com/2017/03/31/analyzing-kvm-hypercalls-with-ebpf-tracing/
  4641  
  4642  27. Jan 2017,
  4643       Golang bcc/BPF Function Tracing,
  4644       Brendan Gregg,
  4645       http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html
  4646  
  4647  26. Dec 2016,
  4648       Give me 15 minutes and I'll change your view of Linux tracing,
  4649       Brendan Gregg,
  4650       http://www.brendangregg.com/blog/2016-12-27/linux-tracing-in-15-minutes.html
  4651  
  4652  25. Nov 2016,
  4653       Cilium: Networking and security for containers with BPF and XDP,
  4654       Daniel Borkmann,
  4655       https://opensource.googleblog.com/2016/11/cilium-networking-and-security.html
  4656  
  4657  24. Nov 2016,
  4658       Linux bcc/BPF tcplife: TCP Lifespans,
  4659       Brendan Gregg,
  4660       http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html
  4661  
  4662  23. Oct 2016,
  4663       DTrace for Linux 2016,
  4664       Brendan Gregg,
  4665       http://www.brendangregg.com/blog/2016-10-27/dtrace-for-linux-2016.html
  4666  
  4667  22. Oct 2016,
  4668       Linux 4.9's Efficient BPF-based Profiler,
  4669       Brendan Gregg,
  4670       http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html
  4671  
  4672  21. Oct 2016,
  4673       Linux bcc tcptop,
  4674       Brendan Gregg,
  4675       http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html
  4676  
  4677  20. Oct 2016,
  4678       Linux bcc/BPF Node.js USDT Tracing,
  4679       Brendan Gregg,
  4680       http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html
  4681  
  4682  19. Oct 2016,
  4683       Linux bcc/BPF Run Queue (Scheduler) Latency,
  4684       Brendan Gregg,
  4685       http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html
  4686  
  4687  18. Oct 2016,
  4688       Linux bcc ext4 Latency Tracing,
  4689       Brendan Gregg,
  4690       http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html
  4691  
  4692  17. Oct 2016,
  4693       Linux MySQL Slow Query Tracing with bcc/BPF,
  4694       Brendan Gregg,
  4695       http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html
  4696  
  4697  16. Oct 2016,
  4698       Linux bcc Tracing Security Capabilities,
  4699       Brendan Gregg,
  4700       http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html
  4701  
  4702  15. Sep 2016,
  4703       Suricata bypass feature,
  4704       Eric Leblond,
  4705       https://www.stamus-networks.com/2016/09/28/suricata-bypass-feature/
  4706  
  4707  14. Aug 2016,
  4708       Introducing the p0f BPF compiler,
  4709       Gilberto Bertin,
  4710       https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/
  4711  
  4712  13. Jun 2016,
  4713       Ubuntu Xenial bcc/BPF,
  4714       Brendan Gregg,
  4715       http://www.brendangregg.com/blog/2016-06-14/ubuntu-xenial-bcc-bpf.html
  4716  
  4717  12. Mar 2016,
  4718       Linux BPF/bcc Road Ahead, March 2016,
  4719       Brendan Gregg,
  4720       http://www.brendangregg.com/blog/2016-03-28/linux-bpf-bcc-road-ahead-2016.html
  4721  
  4722  11. Mar 2016,
  4723       Linux BPF Superpowers,
  4724       Brendan Gregg,
  4725       http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html
  4726  
  4727  10. Feb 2016,
  4728       Linux eBPF/bcc uprobes,
  4729       Brendan Gregg,
  4730       http://www.brendangregg.com/blog/2016-02-08/linux-ebpf-bcc-uprobes.html
  4731  
  4732  9. Feb 2016,
  4733       Who is waking the waker? (Linux chain graph prototype),
  4734       Brendan Gregg,
  4735       http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html
  4736  
  4737  8. Feb 2016,
  4738       Linux Wakeup and Off-Wake Profiling,
  4739       Brendan Gregg,
  4740       http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html
  4741  
  4742  7. Jan 2016,
  4743       Linux eBPF Off-CPU Flame Graph,
  4744       Brendan Gregg,
  4745       http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html
  4746  
  4747  6. Jan 2016,
  4748       Linux eBPF Stack Trace Hack,
  4749       Brendan Gregg,
  4750       http://www.brendangregg.com/blog/2016-01-18/ebpf-stack-trace-hack.html
  4751  
5. Sep 2015,
  4753       Linux Networking, Tracing and IO Visor, a New Systems Performance Tool for a Distributed World,
  4754       Suchakra Sharma,
  4755       https://thenewstack.io/comparing-dtrace-iovisor-new-systems-performance-platform-advance-linux-networking-virtualization/
  4756  
4. Aug 2015,
  4758       BPF Internals - II,
  4759       Suchakra Sharma,
  4760       https://suchakra.wordpress.com/2015/08/12/bpf-internals-ii/
  4761  
3. May 2015,
  4763       eBPF: One Small Step,
  4764       Brendan Gregg,
  4765       http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html
  4766  
2. May 2015,
  4768       BPF Internals - I,
  4769       Suchakra Sharma,
  4770       https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/
  4771  
1. Jul 2014,
  4773       Introducing the BPF Tools,
  4774       Marek Majkowski,
  4775       https://blog.cloudflare.com/introducing-the-bpf-tools/
  4776  
0. May 2014,
  4778       BPF - the forgotten bytecode,
  4779       Marek Majkowski,
  4780       https://blog.cloudflare.com/bpf-the-forgotten-bytecode/
  4781  
  4782  Talks
  4783  -----
  4784  
  4785  The following (incomplete) list includes talks and conference papers
  4786  related to BPF and XDP:
  4787  
  4788  44. May 2017,
  4789       PyCon 2017, Portland,
  4790       Executing python functions in the linux kernel by transpiling to bpf,
  4791       Alex Gartrell,
  4792       https://www.youtube.com/watch?v=CpqMroMBGP4
  4793  
  4794  43. May 2017,
  4795       gluecon 2017, Denver,
  4796       Cilium + BPF: Least Privilege Security on API Call Level for Microservices,
  4797       Dan Wendlandt,
  4798       http://gluecon.com/#agenda
  4799  
  4800  42. May 2017,
  4801       Lund Linux Con, Lund,
  4802       XDP - eXpress Data Path,
  4803       Jesper Dangaard Brouer,
  4804       http://people.netfilter.org/hawk/presentations/LLC2017/XDP_DDoS_protecting_LLC2017.pdf
  4805  
  4806  41. May 2017,
  4807       Polytechnique Montreal,
  4808       Trace Aggregation and Collection with eBPF,
  4809       Suchakra Sharma,
  4810       http://step.polymtl.ca/~suchakra/eBPF-5May2017.pdf
  4811  
  4812  40. Apr 2017,
  4813       DockerCon, Austin,
  4814       Cilium - Network and Application Security with BPF and XDP,
  4815       Thomas Graf,
  4816       https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
  4817  
  4818  39. Apr 2017,
  4819       NetDev 2.1, Montreal,
  4820       XDP Mythbusters,
  4821       David S. Miller,
  4822       https://www.netdevconf.org/2.1/slides/apr7/miller-XDP-MythBusters.pdf
  4823  
  4824  38. Apr 2017,
  4825       NetDev 2.1, Montreal,
  4826       Droplet: DDoS countermeasures powered by BPF + XDP,
  4827       Huapeng Zhou, Doug Porter, Ryan Tierney, Nikita Shirokov,
  4828       https://www.netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
  4829  
  4830  37. Apr 2017,
  4831       NetDev 2.1, Montreal,
  4832       XDP in practice: integrating XDP in our DDoS mitigation pipeline,
  4833       Gilberto Bertin,
  4834       https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP.pdf
  4835  
  4836  36. Apr 2017,
  4837       NetDev 2.1, Montreal,
  4838       XDP for the Rest of Us,
  4839       Andy Gospodarek, Jesper Dangaard Brouer,
  4840       https://www.netdevconf.org/2.1/slides/apr7/gospodarek-Netdev2.1-XDP-for-the-Rest-of-Us_Final.pdf
  4841  
  4842  35. Mar 2017,
  4843       SCALE15x, Pasadena,
  4844       Linux 4.x Tracing: Performance Analysis with bcc/BPF,
  4845       Brendan Gregg,
  4846       https://www.slideshare.net/brendangregg/linux-4x-tracing-performance-analysis-with-bccbpf
  4847  
  4848  34. Mar 2017,
  4849       XDP Inside and Out,
  4850       David S. Miller,
  4851       https://github.com/iovisor/bpf-docs/raw/master/XDP_Inside_and_Out.pdf
  4852  
  4853  33. Mar 2017,
  4854       OpenSourceDays, Copenhagen,
  4855       XDP - eXpress Data Path, Used for DDoS protection,
  4856       Jesper Dangaard Brouer,
  4857       https://github.com/iovisor/bpf-docs/raw/master/XDP_Inside_and_Out.pdf
  4858  
  4859  32. Mar 2017,
  4860       source{d}, Infrastructure 2017, Madrid,
  4861       High-performance Linux monitoring with eBPF,
  4862       Alfonso Acosta,
  4863       https://www.youtube.com/watch?v=k4jqTLtdrxQ
  4864  
  4865  31. Feb 2017,
  4866       FOSDEM 2017, Brussels,
  4867       Stateful packet processing with eBPF, an implementation of OpenState interface,
  4868       Quentin Monnet,
  4869       https://fosdem.org/2017/schedule/event/stateful_ebpf/
  4870  
  4871  30. Feb 2017,
  4872       FOSDEM 2017, Brussels,
  4873       eBPF and XDP walkthrough and recent updates,
  4874       Daniel Borkmann,
  4875       http://borkmann.ch/talks/2017_fosdem.pdf
  4876  
  4877  29. Feb 2017,
  4878       FOSDEM 2017, Brussels,
  4879       Cilium - BPF & XDP for containers,
  4880       Thomas Graf,
  4881       https://fosdem.org/2017/schedule/event/cilium/
  4882  
  4883  28. Jan 2017,
  4884       linuxconf.au, Hobart,
  4885       BPF: Tracing and more,
  4886       Brendan Gregg,
  4887       https://www.slideshare.net/brendangregg/bpf-tracing-and-more
  4888  
  4889  27. Dec 2016,
  4890       USENIX LISA 2016, Boston,
  4891       Linux 4.x Tracing Tools: Using BPF Superpowers,
  4892       Brendan Gregg,
  4893       https://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers
  4894  
  4895  26. Nov 2016,
  4896       Linux Plumbers, Santa Fe,
  4897       Cilium: Networking & Security for Containers with BPF & XDP,
  4898       Thomas Graf,
  4899       http://www.slideshare.net/ThomasGraf5/clium-container-networking-with-bpf-xdp
  4900  
  4901  25. Nov 2016,
  4902       OVS Conference, Santa Clara,
  4903       Offloading OVS Flow Processing using eBPF,
  4904       William (Cheng-Chun) Tu,
  4905       http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf
  4906  
  4907  24. Oct 2016,
  4908       One.com, Copenhagen,
  4909       XDP - eXpress Data Path, Intro and future use-cases,
  4910       Jesper Dangaard Brouer,
  4911       http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf
  4912  
  4913  23. Oct 2016,
  4914       Docker Distributed Systems Summit, Berlin,
  4915       Cilium: Networking & Security for Containers with BPF & XDP,
  4916       Thomas Graf,
  4917       http://www.slideshare.net/Docker/cilium-bpf-xdp-for-containers-66969823
  4918  
  4919  22. Oct 2016,
  4920       NetDev 1.2, Tokyo,
  4921       Data center networking stack,
  4922       Tom Herbert,
  4923       http://netdevconf.org/1.2/session.html?tom-herbert
  4924  
  4925  21. Oct 2016,
  4926       NetDev 1.2, Tokyo,
  4927       Fast Programmable Networks & Encapsulated Protocols,
  4928       David S. Miller,
  4929       http://netdevconf.org/1.2/session.html?david-miller-keynote
  4930  
  4931  20. Oct 2016,
  4932       NetDev 1.2, Tokyo,
  4933       XDP workshop - Introduction, experience, and future development,
  4934       Tom Herbert,
  4935       http://netdevconf.org/1.2/session.html?herbert-xdp-workshop
  4936  
  4937  19. Oct 2016,
     NetDev 1.2, Tokyo,
  4939       The adventures of a Suricate in eBPF land,
  4940       Eric Leblond,
  4941       http://netdevconf.org/1.2/slides/oct6/10_suricata_ebpf.pdf
  4942  
  4943  18. Oct 2016,
     NetDev 1.2, Tokyo,
  4945       cls_bpf/eBPF updates since netdev 1.1,
  4946       Daniel Borkmann,
  4947       http://borkmann.ch/talks/2016_tcws.pdf
  4948  
  4949  17. Oct 2016,
     NetDev 1.2, Tokyo,
  4951       Advanced programmability and recent updates with tc’s cls_bpf,
  4952       Daniel Borkmann,
  4953       http://borkmann.ch/talks/2016_netdev2.pdf
  4954       http://www.netdevconf.org/1.2/papers/borkmann.pdf
  4955  
  4956  16. Oct 2016,
  4957       NetDev 1.2, Tokyo,
  4958       eBPF/XDP hardware offload to SmartNICs,
  4959       Jakub Kicinski, Nic Viljoen,
  4960       http://netdevconf.org/1.2/papers/eBPF_HW_OFFLOAD.pdf
  4961  
  4962  15. Aug 2016,
  4963       LinuxCon, Toronto,
  4964       What Can BPF Do For You?,
  4965       Brenden Blanco,
  4966       https://events.linuxfoundation.org/sites/events/files/slides/iovisor-lc-bof-2016.pdf
  4967  
  4968  14. Aug 2016,
  4969       LinuxCon, Toronto,
  4970       Cilium - Fast IPv6 Container Networking with BPF and XDP,
  4971       Thomas Graf,
  4972       https://www.slideshare.net/ThomasGraf5/cilium-fast-ipv6-container-networking-with-bpf-and-xdp
  4973  
  4974  13. Aug 2016,
  4975       P4, EBPF and Linux TC Offload,
  4976       Dinan Gunawardena, Jakub Kicinski,
  4977       https://de.slideshare.net/Open-NFP/p4-epbf-and-linux-tc-offload
  4978  
  4979  12. Jul 2016,
  4980       Linux Meetup, Santa Clara,
  4981       eXpress Data Path,
  4982       Brenden Blanco,
  4983       http://www.slideshare.net/IOVisor/express-data-path-linux-meetup-santa-clara-july-2016
  4984  
  4985  11. Jul 2016,
  4986       Linux Meetup, Santa Clara,
  4987       CETH for XDP,
  4988       Yan Chan, Yunsong Lu,
  4989       http://www.slideshare.net/IOVisor/ceth-for-xdp-linux-meetup-santa-clara-july-2016
  4990  
  4991  10. May 2016,
  4992       P4 workshop, Stanford,
  4993       P4 on the Edge,
  4994       John Fastabend,
  4995       https://schd.ws/hosted_files/2016p4workshop/1d/Intel%20Fastabend-P4%20on%20the%20Edge.pdf
  4996  
  4997  9. Mar 2016,
  4998      Performance @Scale 2016, Menlo Park,
  4999      Linux BPF Superpowers,
  5000      Brendan Gregg,
  5001      https://www.slideshare.net/brendangregg/linux-bpf-superpowers
  5002  
  5003  8. Mar 2016,
  5004      eXpress Data Path,
  5005      Tom Herbert, Alexei Starovoitov,
  5006      https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf
  5007  
  5008  7. Feb 2016,
    NetDev 1.1, Seville,
  5010      On getting tc classifier fully programmable with cls_bpf,
  5011      Daniel Borkmann,
  5012      http://borkmann.ch/talks/2016_netdev.pdf
  5013      http://www.netdevconf.org/1.1/proceedings/papers/On-getting-tc-classifier-fully-programmable-with-cls-bpf.pdf
  5014  
  5015  6. Jan 2016,
  5016      FOSDEM 2016, Brussels,
  5017      Linux tc and eBPF,
  5018      Daniel Borkmann,
  5019      http://borkmann.ch/talks/2016_fosdem.pdf
  5020  
  5021  5. Oct 2015,
  5022      LinuxCon Europe, Dublin,
  5023      eBPF on the Mainframe,
  5024      Michael Holzheu,
  5025      https://events.linuxfoundation.org/sites/events/files/slides/ebpf_on_the_mainframe_lcon_2015.pdf
  5026  
  5027  4. Aug 2015,
  5028      Tracing Summit, Seattle,
    LTTng's Trace Filtering and beyond (with some eBPF goodness, of course!),
  5030      Suchakra Sharma,
  5031      https://github.com/iovisor/bpf-docs/raw/master/ebpf_excerpt_20Aug2015.pdf
  5032  
  5033  3. Jun 2015,
  5034      LinuxCon Japan, Tokyo,
  5035      Exciting Developments in Linux Tracing,
  5036      Elena Zannoni,
  5037      https://events.linuxfoundation.org/sites/events/files/slides/tracing-linux-ezannoni-linuxcon-ja-2015_0.pdf
  5038  
  5039  2. Feb 2015,
  5040      Collaboration Summit, Santa Rosa,
  5041      BPF: In-kernel Virtual Machine,
  5042      Alexei Starovoitov,
  5043      https://events.linuxfoundation.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf
  5044  
  5045  1. Feb 2015,
  5046      NetDev 0.1, Ottawa,
  5047      BPF: In-kernel Virtual Machine,
  5048      Alexei Starovoitov,
  5049      http://netdevconf.org/0.1/sessions/15.html
  5050  
  5051  0. Feb 2014,
  5052      DevConf.cz, Brno,
  5053      tc and cls_bpf: lightweight packet classifying with BPF,
  5054      Daniel Borkmann,
  5055      http://borkmann.ch/talks/2014_devconf.pdf
  5056  
  5057  Further Documents
  5058  -----------------
  5059  
  5060  - Dive into BPF: a list of reading material,
  5061    Quentin Monnet
  5062    (https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/)
  5063  
  5064  - XDP - eXpress Data Path,
  5065    Jesper Dangaard Brouer
  5066    (https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html)