
.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bpf_architect:

BPF Architecture
================

BPF does not define itself by only providing its instruction set, but also by
offering further infrastructure around it such as maps which act as efficient
key / value stores, helper functions to interact with and leverage kernel
functionality, tail calls for calling into other BPF programs, security
hardening primitives, a pseudo file system for pinning objects (maps,
programs), and infrastructure for allowing BPF to be offloaded, for example, to
a network card.

LLVM provides a BPF back end, so that tools like clang can be used to
compile C into a BPF object file, which can then be loaded into the kernel.
BPF is deeply tied to the Linux kernel and allows for full programmability
without sacrificing native kernel performance.

Last but not least, the kernel subsystems making use of BPF are themselves part
of BPF's infrastructure. The two main subsystems discussed throughout this
document are tc and XDP, to which BPF programs can be attached. XDP BPF programs
are attached at the earliest networking driver stage and trigger a run of the
BPF program upon packet reception. By definition, this achieves the best
possible packet processing performance since packets cannot get processed at an
even earlier point in software. However, since this processing occurs so early
in the networking stack, the stack has not yet extracted metadata out of the
packet. tc BPF programs, on the other hand, are executed later in the kernel
stack, so they have access to more metadata and core kernel functionality.
Apart from tc and XDP, various other kernel subsystems use BPF as well, such
as tracing (kprobes, uprobes, tracepoints, etc).

The following subsections provide further details on individual aspects of the
BPF architecture.

Instruction Set
---------------

BPF is a general purpose RISC instruction set. It was originally designed so
that programs can be written in a subset of C and compiled into BPF
instructions through a compiler back end (e.g. LLVM), which the kernel can
later map through an in-kernel JIT compiler into native opcodes for optimal
execution performance.

The advantages of pushing these instructions into the kernel include:

* Making the kernel programmable without having to cross kernel / user space
  boundaries. For example, BPF programs related to networking, as in the case of
  Cilium, can implement flexible container policies, load balancing and other means
  without having to move packets to user space and back into the kernel. State
  between BPF programs and kernel / user space can still be shared through maps
  whenever needed.

* Given the flexibility of a programmable data path, programs can also be heavily
  optimized for performance by compiling out features that are not required for
  the use cases the program solves. For example, if a container does not require
  IPv4, then the BPF program can be built to only deal with IPv6 in order to save
  resources in the fast-path.

* In case of networking (e.g. tc and XDP), BPF programs can be updated atomically
  without having to restart the kernel, system services or containers, and without
  traffic interruptions. Furthermore, any program state can also be maintained
  throughout updates via BPF maps.

* BPF provides a stable ABI towards user space, and does not require any third party
  kernel modules. BPF is a core part of the Linux kernel that is shipped everywhere,
  and guarantees that existing BPF programs keep running with newer kernel versions.
  This guarantee is the same guarantee that the kernel provides for system calls with
  regard to user space applications. Moreover, BPF programs are portable across
  different architectures.

* BPF programs work in concert with the kernel: they make use of existing kernel
  infrastructure (e.g. drivers, netdevices, tunnels, protocol stack, sockets) and
  tooling (e.g. iproute2) as well as the safety guarantees which the kernel provides.
  Unlike kernel modules, BPF programs are verified through an in-kernel verifier in
  order to ensure that they cannot crash the kernel, always terminate, etc. XDP
  programs, for example, reuse the existing in-kernel drivers and operate on the
  provided DMA buffers containing the packet frames without exposing them or an entire
  driver to user space as in other models. Moreover, XDP programs reuse the existing
  stack instead of bypassing it. BPF can be considered a generic "glue code" to
  kernel facilities for crafting programs to solve specific use cases.

The execution of a BPF program inside the kernel is always event-driven! Examples:

* A networking device which has a BPF program attached on its ingress path will
  trigger the execution of the program once a packet is received.

* A kernel address which has a kprobe with a BPF program attached will trap once
  the code at that address gets executed, which will then invoke the kprobe's
  callback function for instrumentation, subsequently triggering the execution
  of the attached BPF program.

BPF consists of eleven 64 bit registers with 32 bit subregisters, a program counter
and a 512 byte large BPF stack space. Registers are named ``r0`` - ``r10``. The
operating mode is 64 bit by default; the 32 bit subregisters can only be accessed
through special ALU (arithmetic logic unit) operations. The lower 32 bit
subregisters zero-extend into 64 bit when they are written to.

Register ``r10`` is the only register which is read-only and contains the frame pointer
address in order to access the BPF stack space. The remaining ``r0`` - ``r9``
registers are general purpose and of read/write nature.

A BPF program can call into a predefined helper function, which is defined by
the core kernel (never by modules). The BPF calling convention is defined as
follows:

* ``r0`` contains the return value of a helper function call.
* ``r1`` - ``r5`` hold arguments from the BPF program to the kernel helper function.
* ``r6`` - ``r9`` are callee saved registers that will be preserved on helper function call.

The BPF calling convention is generic enough to map directly to ``x86_64``, ``arm64``
and other ABIs, thus all BPF registers map one to one to HW CPU registers, so that a
JIT only needs to issue a call instruction without any additional moves for placing
function arguments. This calling convention was modeled to cover common call
situations without a performance penalty. Calls with 6 or more arguments
are currently not supported. The helper functions in the kernel which are dedicated
to BPF (``BPF_CALL_0()`` to ``BPF_CALL_5()`` functions) are specifically designed
with this convention in mind.

Register ``r0`` is also the register containing the exit value for the BPF program.
The semantics of the exit value are defined by the type of program. Furthermore, when
handing execution back to the kernel, the exit value is passed as a 32 bit value.

Registers ``r1`` - ``r5`` are scratch registers, meaning the BPF program needs to
either spill them to the BPF stack or move them to callee saved registers if these
arguments are to be reused across multiple helper function calls. Spilling means
that the variable in the register is moved to the BPF stack. The reverse operation
of moving the variable from the BPF stack to the register is called filling. The
reason for spilling/filling is due to the limited number of registers.

Upon entering execution of a BPF program, register ``r1`` initially contains the
context for the program. The context is the input argument for the program (similar
to the ``argc/argv`` pair for a typical C program). BPF is restricted to work on a
single context. The context is defined by the program type, for example, a networking
program can have a kernel representation of the network packet (``skb``) as the
input argument.
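
As an illustration of the context being handed over in ``r1``, the following is a
minimal XDP program sketch that only touches its single input argument. It assumes
clang with the BPF target and libbpf's ``bpf_helpers.h``; the section name and the
14 byte Ethernet header length are purely illustrative.

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    SEC("xdp")
    int xdp_ctx_example(struct xdp_md *ctx)
    {
        /* The context is the only input; for XDP it describes the packet. */
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;

        /* The verifier requires bounds checks before touching packet data. */
        if (data + 14 > data_end)
            return XDP_DROP;

        return XDP_PASS;
    }

    char __license[] SEC("license") = "GPL";

For a tc BPF program, the context would instead be a ``struct __sk_buff`` pointer.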

The general operation of BPF is 64 bit to follow the natural model of 64 bit
architectures in order to perform pointer arithmetic, pass pointers but also 64
bit values into helper functions, and to allow for 64 bit atomic operations.

The maximum instruction limit per program is restricted to 4096 BPF instructions,
which, by design, means that any program will terminate quickly. For kernels newer
than 5.1 this limit was lifted to 1 million BPF instructions. Although the
instruction set contains forward as well as backward jumps, the in-kernel BPF
verifier will forbid loops so that termination is always guaranteed. Since BPF
programs run inside the kernel, the verifier's job is to make sure that these are
safe to run, not affecting the system's stability. This means that from an instruction
set point of view, loops can be implemented, but the verifier will restrict that.
However, there is also a concept of tail calls that allows for one BPF program to
jump into another one. This, too, comes with an upper nesting limit of 33 calls,
and is usually used to decouple parts of the program logic, for example, into stages.

The instruction format is modeled as two operand instructions, which helps mapping
BPF instructions to native instructions during the JIT phase. The instruction set is
of fixed size, meaning every instruction has a 64 bit encoding. Currently, 87 instructions
have been implemented and the encoding also allows the set to be extended with further
instructions when needed. The instruction encoding of a single 64 bit instruction on a
big-endian machine is defined as a bit sequence from most significant bit (MSB) to least
significant bit (LSB) of ``op:8``, ``dst_reg:4``, ``src_reg:4``, ``off:16``, ``imm:32``.
``off`` and ``imm`` are of signed type. The encodings are part of the kernel headers and
defined in the ``linux/bpf.h`` header, which also includes ``linux/bpf_common.h``.
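
For reference, this encoding corresponds to ``struct bpf_insn`` from the kernel's
UAPI headers, which at the time of writing looks as follows:

.. code-block:: c

    struct bpf_insn {
        __u8    code;           /* opcode */
        __u8    dst_reg:4;      /* dest register */
        __u8    src_reg:4;      /* source register */
        __s16   off;            /* signed offset */
        __s32   imm;            /* signed immediate constant */
    };

A BPF program handed to the kernel is simply an array of such instructions.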

``op`` defines the actual operation to be performed. Most of the encoding for ``op``
has been reused from cBPF. The operation can be based on register or immediate
operands. The encoding of ``op`` itself provides information on which mode to use
(``BPF_X`` for denoting register-based operations, and ``BPF_K`` for immediate-based
operations respectively). In the latter case, the destination operand is always
a register. Both ``dst_reg`` and ``src_reg`` provide additional information about
the register operands to be used (e.g. ``r0`` - ``r9``) for the operation. ``off``
is used in some instructions to provide a relative offset, for example, for addressing
the stack or other buffers available to BPF (e.g. map values, packet data, etc),
or jump targets in jump instructions. ``imm`` contains a constant / immediate value.

The available ``op`` instructions can be categorized into various instruction
classes. These classes are also encoded inside the ``op`` field. The ``op`` field
is divided into (from MSB to LSB) ``code:4``, ``source:1`` and ``class:3``. ``class``
is the more generic instruction class, ``code`` denotes a specific operational
code inside that class, and ``source`` tells whether the source operand is a register
or an immediate value. Possible instruction classes include:

* ``BPF_LD``, ``BPF_LDX``: Both classes are for load operations. ``BPF_LD`` is
  used for loading a double word as a special instruction spanning two instructions
  due to the ``imm:32`` split, and for byte / half-word / word loads of packet data.
  The latter was carried over from cBPF mainly in order to keep cBPF to BPF
  translations efficient, since they have optimized JIT code. For native BPF
  these packet load instructions are less relevant nowadays. The ``BPF_LDX`` class
  holds instructions for byte / half-word / word / double-word loads out of
  memory. Memory in this context is generic and could be stack memory, map value
  data, packet data, etc.

* ``BPF_ST``, ``BPF_STX``: Both classes are for store operations. Similar to
  ``BPF_LDX``, the ``BPF_STX`` class is the store counterpart and is used to store
  data from a register into memory, which, again, can be stack memory, map value,
  packet data, etc. ``BPF_STX`` also holds special instructions for performing word
  and double-word based atomic add operations, which can be used for counters, for
  example. The ``BPF_ST`` class is similar to ``BPF_STX`` by providing instructions
  for storing data into memory, only that the source operand is an immediate value.

* ``BPF_ALU``, ``BPF_ALU64``: Both classes contain ALU operations. Generally,
  ``BPF_ALU`` operations are in 32 bit mode and ``BPF_ALU64`` in 64 bit mode.
  Both ALU classes have basic operations with a register-based source operand as
  well as an immediate-based counterpart. Supported by both are add (``+``), sub (``-``),
  and (``&``), or (``|``), left shift (``<<``), right shift (``>>``), xor (``^``),
  mul (``*``), div (``/``), mod (``%``), neg (``~``) operations. Also mov (``<X> := <Y>``)
  was added as a special ALU operation for both classes in both operand modes.
  ``BPF_ALU64`` also contains a signed right shift. ``BPF_ALU`` additionally
  contains endianness conversion instructions for half-word / word / double-word
  on a given source register.

* ``BPF_JMP``: This class is dedicated to jump operations. Jumps can be unconditional
  and conditional. Unconditional jumps simply move the program counter forward, so
  that the next instruction to be executed relative to the current instruction is
  ``off + 1``, where ``off`` is the constant offset encoded in the instruction. Since
  ``off`` is signed, the jump can also be performed backwards as long as it does not
  create a loop and is within program bounds. Conditional jumps operate on both
  register-based and immediate-based source operands. If the condition in the jump
  operations results in ``true``, then a relative jump to ``off + 1`` is performed,
  otherwise the next instruction (``0 + 1``) is performed. This fall-through
  jump logic differs compared to cBPF and allows for better branch prediction as it
  fits the CPU branch predictor logic more naturally. Available conditions are
  jeq (``==``), jne (``!=``), jgt (``>``), jge (``>=``), jsgt (signed ``>``), jsge
  (signed ``>=``), jlt (``<``), jle (``<=``), jslt (signed ``<``), jsle (signed
  ``<=``) and jset (jump if ``DST & SRC``). Apart from that, there are three
  special jump operations within this class: the exit instruction which will leave
  the BPF program and return the current value in ``r0`` as a return code, the call
  instruction, which will issue a function call into one of the available BPF helper
  functions, and a hidden tail call instruction, which will jump into a different
  BPF program.
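
To make the composition of ``op`` concrete, the following user space sketch encodes
the 64 bit register-based addition ``r2 += r3`` and decomposes the opcode again with
the helper macros from ``linux/bpf_common.h``. The snippet is purely illustrative and
the instruction is not loaded anywhere.

.. code-block:: c

    #include <linux/bpf.h>        /* struct bpf_insn, BPF_ALU64, BPF_REG_* */
    #include <linux/bpf_common.h> /* BPF_CLASS(), BPF_OP(), BPF_SRC() */
    #include <assert.h>

    int main(void)
    {
        /* class BPF_ALU64, operation BPF_ADD, register source BPF_X */
        struct bpf_insn insn = {
            .code    = BPF_ALU64 | BPF_ADD | BPF_X,
            .dst_reg = BPF_REG_2,
            .src_reg = BPF_REG_3,
            .off     = 0,
            .imm     = 0,
        };

        /* The op field decomposes back into class, code and source. */
        assert(BPF_CLASS(insn.code) == BPF_ALU64);
        assert(BPF_OP(insn.code)    == BPF_ADD);
        assert(BPF_SRC(insn.code)   == BPF_X);
        return 0;
    }

The same macros are used by the kernel itself when inspecting instructions.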

The Linux kernel is shipped with a BPF interpreter which executes programs assembled in
BPF instructions. Even cBPF programs are translated into eBPF programs transparently
in the kernel, except for architectures that still ship with a cBPF JIT and
have not yet migrated to an eBPF JIT.

Currently ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64`` and
``arm`` architectures come with an in-kernel eBPF JIT compiler.

All BPF handling such as loading of programs into the kernel or creation of BPF maps
is managed through a central ``bpf()`` system call. It is also used for managing map
entries (lookup / update / delete), and making programs as well as maps persistent
in the BPF file system through pinning.
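
The following user space sketch shows the central role of ``bpf(2)``: the very same
system call creates a map and updates one of its elements. Error handling is kept
minimal and the map dimensions are arbitrary.

.. code-block:: c

    #include <linux/bpf.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Thin wrapper around the bpf(2) system call. */
    static int sys_bpf(enum bpf_cmd cmd, union bpf_attr *attr, unsigned int size)
    {
        return syscall(__NR_bpf, cmd, attr, size);
    }

    int main(void)
    {
        union bpf_attr attr;

        /* Create a small hash map with 4 byte keys and 8 byte values. */
        memset(&attr, 0, sizeof(attr));
        attr.map_type    = BPF_MAP_TYPE_HASH;
        attr.key_size    = 4;
        attr.value_size  = 8;
        attr.max_entries = 16;

        int fd = sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
        if (fd < 0) {
            perror("BPF_MAP_CREATE");
            return 1;
        }

        /* Insert an element through the same system call. */
        __u32 key = 1;
        __u64 value = 42;
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = (__u64)(unsigned long)&key;
        attr.value  = (__u64)(unsigned long)&value;
        attr.flags  = BPF_ANY;

        if (sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)) < 0) {
            perror("BPF_MAP_UPDATE_ELEM");
            return 1;
        }

        printf("created and updated map fd %d\n", fd);
        return 0;
    }

In practice, libraries such as libbpf wrap these low-level details.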

Helper Functions
----------------

Helper functions are a concept which enables BPF programs to consult a core kernel
defined set of function calls in order to retrieve / push data from / to the
kernel. Available helper functions may differ for each BPF program type,
for example, BPF programs attached to sockets are only allowed to call into
a subset of helpers compared to BPF programs attached to the tc layer.
Encapsulation and decapsulation helpers for lightweight tunneling constitute
an example of functions which are only available to lower tc layers, whereas
event output helpers for pushing notifications to user space are available to
tc and XDP programs.

Each helper function is implemented with a commonly shared function signature
similar to system calls. The signature is defined as:

.. code-block:: c

    u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)

The calling convention as described in the previous section applies to all
BPF helper functions.

The kernel abstracts helper functions into macros ``BPF_CALL_0()`` to ``BPF_CALL_5()``
which are similar to those of system calls. The following example is an extract
from a helper function which updates map elements by calling into the
corresponding map implementation callbacks:

.. code-block:: c

    BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
               void *, value, u64, flags)
    {
        WARN_ON_ONCE(!rcu_read_lock_held());
        return map->ops->map_update_elem(map, key, value, flags);
    }

    const struct bpf_func_proto bpf_map_update_elem_proto = {
        .func           = bpf_map_update_elem,
        .gpl_only       = false,
        .ret_type       = RET_INTEGER,
        .arg1_type      = ARG_CONST_MAP_PTR,
        .arg2_type      = ARG_PTR_TO_MAP_KEY,
        .arg3_type      = ARG_PTR_TO_MAP_VALUE,
        .arg4_type      = ARG_ANYTHING,
    };

There are various advantages to this approach: cBPF overloaded its load
instructions in order to fetch data at an otherwise impossible packet offset and
thereby invoke auxiliary helper functions, so each cBPF JIT needed to implement
support for every such cBPF extension. In case of eBPF, each newly added helper
function is JIT compiled in a transparent and efficient way, meaning that the JIT
compiler only needs to emit a call instruction since the register mapping
is made in such a way that BPF register assignments already match the
underlying architecture's calling convention. This allows for easily extending
the core kernel with new helper functionality. All BPF helper functions are
part of the core kernel and cannot be extended or added through kernel modules.

The aforementioned function signature also allows the verifier to perform type
checks. The above ``struct bpf_func_proto`` is used to hand all the necessary
information which needs to be known about the helper to the verifier, so that
the verifier can make sure that the expected types from the helper match the
current contents of the BPF program's analyzed registers.

Argument types can range from passing in any kind of value up to restricted
contents such as a pointer / size pair for the BPF stack buffer, which the
helper should read from or write to. In the latter case, the verifier can also
perform additional checks, for example, whether the buffer was previously
initialized.
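
From the BPF program side, these type checks show up as constraints that the
verifier enforces at the call site. The sketch below assumes libbpf's
``bpf_helpers.h`` and calls ``bpf_map_lookup_elem()``: the key has to point to
initialized stack memory matching ``ARG_PTR_TO_MAP_KEY``, and the returned map
value pointer must be checked against ``NULL`` before it may be dereferenced.
Map and section names are made up for the example.

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } packet_count SEC(".maps");

    SEC("tc")
    int count_packets(struct __sk_buff *skb)
    {
        __u32 key = 0;          /* initialized key on the BPF stack */
        __u64 *value;

        value = bpf_map_lookup_elem(&packet_count, &key);
        if (value)              /* verifier-enforced NULL check */
            __sync_fetch_and_add(value, 1);

        return 0;               /* TC_ACT_OK */
    }

    char __license[] SEC("license") = "GPL";

If either constraint is violated, the verifier rejects the program at load time.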

The list of available BPF helper functions is rather long and constantly growing,
for example, at the time of this writing, tc BPF programs can choose from 38
different BPF helpers. The kernel's ``struct bpf_verifier_ops`` contains a
``get_func_proto`` callback function that provides the mapping of a specific
``enum bpf_func_id`` to one of the available helpers for a given BPF program
type.

Maps
----

.. image:: /images/bpf_map.png
    :align: center

Maps are efficient key / value stores that reside in kernel space. They can be
accessed from a BPF program in order to keep state among multiple BPF program
invocations. They can also be accessed through file descriptors from user space
and can be arbitrarily shared with other BPF programs or user space applications.

BPF programs which share maps with each other are not required to be of the same
program type, for example, tracing programs can share maps with networking programs.
A single BPF program can currently access up to 64 different maps directly.

Map implementations are provided by the core kernel. There are generic maps with
per-CPU and non-per-CPU flavor that can read / write arbitrary data, but there are
also a few non-generic maps that are used along with helper functions.

Generic maps currently available are ``BPF_MAP_TYPE_HASH``, ``BPF_MAP_TYPE_ARRAY``,
``BPF_MAP_TYPE_PERCPU_HASH``, ``BPF_MAP_TYPE_PERCPU_ARRAY``, ``BPF_MAP_TYPE_LRU_HASH``,
``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and ``BPF_MAP_TYPE_LPM_TRIE``. They all use the
same common set of BPF helper functions in order to perform lookup, update or
delete operations while implementing a different backend with differing semantics
and performance characteristics.
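
As a sketch of how such generic maps are typically declared from BPF C code, the
following BTF-style definitions (assuming libbpf's ``bpf_helpers.h``; names and
sizes are arbitrary) declare one non-per-CPU and one per-CPU flavor:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_HASH);
        __uint(max_entries, 1024);
        __type(key, __u32);
        __type(value, __u64);
    } flow_bytes SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_ARRAY);
        __uint(max_entries, 1);
        __type(key, __u32);
        __type(value, __u64);
    } scratch_buf SEC(".maps");

Per-CPU maps hold one value per possible CPU, which avoids synchronization between
CPUs at the cost of user space having to aggregate the per-CPU values itself.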

Non-generic maps that are currently in the kernel are ``BPF_MAP_TYPE_PROG_ARRAY``,
``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, ``BPF_MAP_TYPE_CGROUP_ARRAY``,
``BPF_MAP_TYPE_STACK_TRACE``, ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and
``BPF_MAP_TYPE_HASH_OF_MAPS``. For example, ``BPF_MAP_TYPE_PROG_ARRAY`` is an
array map which holds other BPF programs, while ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and
``BPF_MAP_TYPE_HASH_OF_MAPS`` both hold pointers to other maps such that entire
BPF maps can be atomically replaced at runtime. These types of maps tackle specific
issues which were unsuitable to implement solely through a BPF helper function,
since additional (non-data) state is required to be held across BPF program
invocations.

Object Pinning
--------------

.. image:: /images/bpf_fs.png
    :align: center

BPF maps and programs act as a kernel resource and can only be accessed through
file descriptors, backed by anonymous inodes in the kernel. This brings advantages,
but also a number of disadvantages:

User space applications can make use of most file descriptor related APIs,
file descriptor passing for Unix domain sockets works transparently, etc, but
at the same time, file descriptors are limited to a process's lifetime,
which makes options like map sharing rather cumbersome to carry out.

Thus, it brings a number of complications for certain use cases such as iproute2,
where tc or XDP sets up and loads the program into the kernel and eventually
terminates. With that, access to maps from user space also becomes unavailable,
even though it could otherwise be useful, for example, when maps are shared
between ingress and egress locations of the data path. Also, third party
applications may wish to monitor or update map contents during BPF program
runtime.

To overcome this limitation, a minimal kernel space BPF file system has been
implemented, where BPF maps and programs can be pinned, a process called
object pinning. The BPF system call has therefore been extended with two new
commands which can pin (``BPF_OBJ_PIN``) or retrieve (``BPF_OBJ_GET``) a
previously pinned object.
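
A minimal user space sketch of both commands, using libbpf's thin syscall wrappers
``bpf_obj_pin()`` and ``bpf_obj_get()`` and assuming a BPF file system mounted at
``/sys/fs/bpf`` (the path name is arbitrary):

.. code-block:: c

    #include <bpf/bpf.h>   /* bpf_obj_pin(), bpf_obj_get() */
    #include <stdio.h>

    int pin_and_reopen(int map_fd)
    {
        const char *path = "/sys/fs/bpf/example_map";

        /* BPF_OBJ_PIN: make the map outlive this process. */
        if (bpf_obj_pin(map_fd, path) < 0) {
            perror("bpf_obj_pin");
            return -1;
        }

        /* BPF_OBJ_GET: another process can later re-open it by path. */
        int fd = bpf_obj_get(path);
        if (fd < 0) {
            perror("bpf_obj_get");
            return -1;
        }
        return fd;
    }

The same can be done for programs, and tools like tc or bpftool perform the
equivalent operations when pinning objects on behalf of the user.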

For instance, tools such as tc make use of this infrastructure for sharing
maps on ingress and egress. The BPF related file system is not a singleton;
it supports multiple mount instances, hard and soft links, etc.

Tail Calls
----------

.. image:: /images/bpf_tailcall.png
    :align: center

Another concept that can be used with BPF is called tail calls. Tail calls can
be seen as a mechanism that allows one BPF program to call another, without
returning to the old program. Such a call has minimal overhead as, unlike
function calls, it is implemented as a long jump, reusing the same stack frame.

Such programs are verified independently of each other; thus, for transferring
state, either per-CPU maps as scratch buffers or, in case of tc programs, ``skb``
fields such as the ``cb[]`` area must be used.

Only programs of the same type can be tail called, and they also need to match
in terms of JIT compilation, meaning either only JIT compiled or only interpreted
programs can be invoked, but not a mix of the two.

There are two components involved in carrying out tail calls: the first part is a
specialized map called a program array (``BPF_MAP_TYPE_PROG_ARRAY``) that can be
populated by user space with key / value pairs, where the values are the file
descriptors of the tail called BPF programs; the second part is the
``bpf_tail_call()`` helper, to which the context, a reference to the program array
and the lookup key are passed. The kernel then inlines this helper call directly
into a specialized BPF instruction. Such a program array is currently write-only
from the user space side.

The kernel looks up the related BPF program from the passed file descriptor
and atomically replaces the program pointer at the given map slot. When no map
entry has been found at the provided key, the kernel just "falls through"
and continues execution of the old program with the instructions following
the ``bpf_tail_call()``. Tail calls are a powerful utility; for example,
parsing network headers could be structured through tail calls. During runtime,
functionality can be added or replaced atomically, thus altering the BPF
program's execution behavior.
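
The following sketch shows both components from the BPF C side, assuming libbpf's
``bpf_helpers.h``. The program array, section names and slot index are illustrative;
user space still has to populate slot ``0`` with the file descriptor of the target
program before the tail call can succeed.

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
        __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
        __uint(max_entries, 2);
        __uint(key_size, sizeof(__u32));
        __uint(value_size, sizeof(__u32));
    } jmp_table SEC(".maps");

    SEC("xdp")
    int xdp_stage2(struct xdp_md *ctx)
    {
        return XDP_DROP;
    }

    SEC("xdp")
    int xdp_stage1(struct xdp_md *ctx)
    {
        bpf_tail_call(ctx, &jmp_table, 0);

        /* Only reached when slot 0 is empty: fall through. */
        return XDP_PASS;
    }

    char __license[] SEC("license") = "GPL";

If the tail call succeeds, execution continues in ``xdp_stage2`` and never returns
to ``xdp_stage1``.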

.. _bpf_to_bpf_calls:

BPF to BPF Calls
----------------

.. image:: /images/bpf_call.png
    :align: center

Aside from BPF helper calls and BPF tail calls, a more recent feature that has
been added to the BPF core infrastructure is BPF to BPF calls. Before this
feature was introduced into the kernel, a typical BPF C program had to declare
any reusable code that, for example, resides in headers as ``always_inline``
so that, when LLVM compiles and generates the BPF object file, all these
functions were inlined and therefore duplicated many times in the resulting
object file, artificially inflating its code size:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    #ifndef __inline
    # define __inline                         \
       inline __attribute__((always_inline))
    #endif

    static __inline int foo(void)
    {
        return XDP_DROP;
    }

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }

    char __license[] __section("license") = "GPL";

The main reason why this was necessary was the lack of function call support
in the BPF program loader as well as in the verifier, interpreter and JITs. Starting
with Linux kernel 4.16 and LLVM 6.0, this restriction was lifted and BPF programs
no longer need to use ``always_inline`` everywhere. Thus, the previously shown BPF
example code can be rewritten more naturally as:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    static int foo(void)
    {
        return XDP_DROP;
    }

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }

    char __license[] __section("license") = "GPL";

Mainstream BPF JIT compilers like ``x86_64`` and ``arm64`` support BPF to BPF
calls today, with others following in the near future. BPF to BPF calls are an
important performance optimization since they heavily reduce the generated BPF
code size and are therefore friendlier to a CPU's instruction cache.

The calling convention known from BPF helper functions applies to BPF to BPF
calls just as well, meaning ``r1`` up to ``r5`` are used for passing arguments to
the callee and the result is returned in ``r0``. ``r1`` to ``r5`` are scratch
registers whereas ``r6`` to ``r9`` are preserved across calls the usual way. The
maximum call depth, that is, the number of allowed call frames, is ``8``.
A caller can pass pointers (e.g. to the caller's stack frame) down to the
callee, but never vice versa.
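
A small sketch of such a call, again assuming libbpf's ``bpf_helpers.h`` (which
provides ``__noinline``); the function and section names are made up. The callee is
compiled as a separate BPF function and receives a pointer into the caller's stack
frame:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    static __noinline void mark_seen(int *flag)
    {
        *flag = 1;        /* dereferences a pointer into the caller's stack */
    }

    SEC("xdp")
    int xdp_main(struct xdp_md *ctx)
    {
        int seen = 0;     /* lives in the caller's stack frame */

        mark_seen(&seen); /* argument passed in r1, real call instruction */

        return seen ? XDP_PASS : XDP_DROP;
    }

    char __license[] SEC("license") = "GPL";

Without ``__noinline``, LLVM would be free to inline a small function like this one
anyway; the attribute merely forces a real BPF to BPF call for the sake of the example.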

BPF JIT compilers emit separate images for each function body and later fix
up the function call addresses in the image in a final JIT pass. This has
proven to require minimal changes to the JITs in that they can treat BPF to
BPF calls as conventional BPF helper calls.

Up to kernel 5.9, BPF tail calls and BPF subprograms were mutually exclusive. BPF
programs that utilized tail calls could not benefit from reduced program
image size and faster load times. Linux kernel 5.10 finally allows users to get
the best of both worlds and adds the ability to combine BPF subprograms with
tail calls.

This improvement comes with some restrictions, though. Mixing these two features
can cause a kernel stack overflow. To get an idea of what might happen, see the
picture below that illustrates the mix of bpf2bpf calls and tail calls:

.. image:: /images/bpf_tailcall_subprograms.png
    :align: center

A tail call, before the actual jump to the target program, will unwind only its
own current stack frame. As we can see in the example above, if a tail call occurs
from within a sub-function, the function's (func1) stack frame will still be
present on the stack while program execution is at func2. Once the final
function (func3) terminates, all the previous stack frames are unwound and
control returns to the caller of the BPF program.

The kernel introduced additional logic for detecting this feature combination.
The stack size throughout the whole call chain is limited to 256 bytes per
subprogram (note that if the verifier detects a bpf2bpf call, then the main
function is treated as a subprogram as well). In total, with this restriction,
the BPF program's call chain can consume at most 8KB of stack space. This limit
comes from the 256 bytes per stack frame multiplied by the tail call count limit
(33). Without this restriction, BPF programs would operate on a 512-byte stack
size, yielding 16KB in total for the maximum count of tail calls, which would
overflow the stack on some architectures.

One more thing to mention is that this feature combination is currently
supported only on the x86-64 architecture.

JIT
---

.. image:: /images/bpf_jit.png
    :align: center

The 64 bit ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64``
and 32 bit ``arm``, ``x86_32`` architectures are all shipped with an in-kernel
eBPF JIT compiler. All of them are feature equivalent and can be enabled
through:

.. code-block:: shell-session

    # echo 1 > /proc/sys/net/core/bpf_jit_enable

The 32 bit ``mips``, ``ppc`` and ``sparc`` architectures currently have a cBPF
JIT compiler. These architectures, as well as all remaining architectures
supported by the Linux kernel which do not have a BPF JIT compiler at all, need
to run eBPF programs through the in-kernel interpreter.

In the kernel's source tree, eBPF JIT support can be easily determined through
issuing a grep for ``HAVE_EBPF_JIT``:

.. code-block:: shell-session

    # git grep HAVE_EBPF_JIT arch/
    arch/arm/Kconfig:       select HAVE_EBPF_JIT   if !CPU_ENDIAN_BE32
    arch/arm64/Kconfig:     select HAVE_EBPF_JIT
    arch/powerpc/Kconfig:   select HAVE_EBPF_JIT   if PPC64
    arch/mips/Kconfig:      select HAVE_EBPF_JIT   if (64BIT && !CPU_MICROMIPS)
    arch/s390/Kconfig:      select HAVE_EBPF_JIT   if PACK_STACK && HAVE_MARCH_Z196_FEATURES
    arch/sparc/Kconfig:     select HAVE_EBPF_JIT   if SPARC64
    arch/x86/Kconfig:       select HAVE_EBPF_JIT   if X86_64

JIT compilers speed up execution of the BPF program significantly since they
reduce the per instruction cost compared to the interpreter. Often instructions
can be mapped 1:1 with native instructions of the underlying architecture. This
also reduces the resulting executable image size and is therefore more
instruction cache friendly to the CPU. In particular in case of CISC instruction
sets such as ``x86``, the JITs are optimized for emitting the shortest possible
opcodes for a given instruction to shrink the total necessary size for the
program translation.

Hardening
---------

BPF locks the entire BPF interpreter image (``struct bpf_prog``) as well
as the JIT compiled image (``struct bpf_binary_header``) in the kernel as
read-only during the program's lifetime in order to prevent the code from
potential corruption. Any corruption happening at that point, for example,
due to some kernel bug, will result in a general protection fault and thus
crash the kernel instead of allowing the corruption to happen silently.

Architectures that support setting the image memory as read-only can be
determined through:

.. code-block:: shell-session

    $ git grep ARCH_HAS_SET_MEMORY | grep select
    arch/arm/Kconfig:    select ARCH_HAS_SET_MEMORY
    arch/arm64/Kconfig:  select ARCH_HAS_SET_MEMORY
    arch/s390/Kconfig:   select ARCH_HAS_SET_MEMORY
    arch/x86/Kconfig:    select ARCH_HAS_SET_MEMORY

The option ``CONFIG_ARCH_HAS_SET_MEMORY`` is not configurable, which means that
this protection is always built in. Other architectures might follow
in the future.

In case of the ``x86_64`` JIT compiler, the JITing of the indirect jump from
the use of tail calls is realized through a retpoline in case ``CONFIG_RETPOLINE``
has been set, which is the default at the time of writing in most modern Linux
distributions.

If ``/proc/sys/net/core/bpf_jit_harden`` is set to ``1``, additional
hardening steps for the JIT compilation take effect for unprivileged users.
This effectively trades off a slight performance hit for a decreased
(potential) attack surface in case of untrusted users operating on the
system. The decrease in program execution speed still results in better performance
compared to switching to the interpreter entirely.

Currently, enabling hardening will blind all user provided 32 bit and 64 bit
constants from the BPF program when it gets JIT compiled in order to prevent
JIT spraying attacks which inject native opcodes as immediate values. This is
problematic as these immediate values reside in executable kernel memory;
a jump that could be triggered by some kernel bug could therefore land at
the start of the immediate value and then execute it as native instructions.

JIT constant blinding prevents this by randomizing the actual instruction:
the operation is transformed from an immediate based source operand
to a register based one by rewriting the instruction and splitting the
actual load of the value into two steps: 1) loading a blinded immediate
value ``rnd ^ imm`` into a register, 2) xoring that register with ``rnd``
such that the original ``imm`` immediate then resides in the register and
can be used for the actual operation. The example was provided for a load
operation, but really all generic operations are blinded.
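
The recovery step relies on nothing more than the xor identity
``(rnd ^ imm) ^ rnd == imm``, as the following small user space sketch (with made
up values) demonstrates:

.. code-block:: c

    #include <assert.h>
    #include <stdint.h>

    int main(void)
    {
        uint32_t imm = 0xa8909090;  /* attacker-chosen immediate value */
        uint32_t rnd = 0x4989b5f3;  /* per-instruction random blinding value */

        uint32_t reg = rnd ^ imm;   /* step 1: load blinded immediate */
        reg ^= rnd;                 /* step 2: xor with rnd to recover imm */

        assert(reg == imm);         /* original imm never appears literally */
        return 0;
    }

The effect is visible in the disassembly of the hardened program further below.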

Example of JITing a program with hardening disabled:

.. code-block:: shell-session

    # echo 0 > /proc/sys/net/core/bpf_jit_harden

      ffffffffa034f5e9 + <x>:
      [...]
      39:   mov    $0xa8909090,%eax
      3e:   mov    $0xa8909090,%eax
      43:   mov    $0xa8ff3148,%eax
      48:   mov    $0xa89081b4,%eax
      4d:   mov    $0xa8900bb0,%eax
      52:   mov    $0xa810e0c1,%eax
      57:   mov    $0xa8908eb4,%eax
      5c:   mov    $0xa89020b0,%eax
      [...]

The same program gets constant blinded when loaded through BPF
as an unprivileged user in case hardening is enabled:

.. code-block:: shell-session

    # echo 1 > /proc/sys/net/core/bpf_jit_harden

      ffffffffa034f1e5 + <x>:
      [...]
      39:   mov    $0xe1192563,%r10d
      3f:   xor    $0x4989b5f3,%r10d
      46:   mov    %r10d,%eax
      49:   mov    $0xb8296d93,%r10d
      4f:   xor    $0x10b9fd03,%r10d
      56:   mov    %r10d,%eax
      59:   mov    $0x8c381146,%r10d
      5f:   xor    $0x24c7200e,%r10d
      66:   mov    %r10d,%eax
      69:   mov    $0xeb2a830e,%r10d
      6f:   xor    $0x43ba02ba,%r10d
      76:   mov    %r10d,%eax
      79:   mov    $0xd9730af,%r10d
      7f:   xor    $0xa5073b1f,%r10d
      86:   mov    %r10d,%eax
      89:   mov    $0x9a45662b,%r10d
      8f:   xor    $0x325586ea,%r10d
      96:   mov    %r10d,%eax
      [...]

Both programs are semantically the same, only that none of the
original immediate values are visible anymore in the disassembly of
the second program.

At the same time, hardening also disables any JIT kallsyms exposure
for privileged users, preventing JIT image addresses from being
exposed to ``/proc/kallsyms``.

Moreover, the Linux kernel provides the option ``CONFIG_BPF_JIT_ALWAYS_ON``
which removes the entire BPF interpreter from the kernel and permanently
enables the JIT compiler. This has been developed as part of a mitigation
in the context of Spectre v2 such that when used in a VM-based setting,
the guest kernel is not going to reuse the host kernel's BPF interpreter
when mounting an attack anymore. For container-based environments, the
``CONFIG_BPF_JIT_ALWAYS_ON`` configuration option is optional, but in
case JITs are enabled there anyway, the interpreter may as well be compiled
out to reduce the kernel's complexity. Thus, it is also generally
recommended for widely used JITs in case of mainstream architectures
such as ``x86_64`` and ``arm64``.

Last but not least, the kernel offers an option to disable the use of
the ``bpf(2)`` system call for unprivileged users through the
``/proc/sys/kernel/unprivileged_bpf_disabled`` sysctl knob. This is
a one-time kill switch on purpose, meaning that once set to ``1``, there is
no option to reset it back to ``0`` until the next kernel reboot. When set,
only ``CAP_SYS_ADMIN`` privileged processes out of the initial
namespace are allowed to use the ``bpf(2)`` system call from that
point onwards. Upon start, Cilium sets this knob to ``1`` as well.

.. code-block:: shell-session

    # echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled

Offloads
--------

.. image:: /images/bpf_offload.png
    :align: center

Networking BPF programs, in particular for tc and XDP, have an
offload interface to hardware in the kernel in order to execute BPF
code directly on the NIC.

Currently, the ``nfp`` driver from Netronome has support for offloading
BPF through a JIT compiler which translates BPF instructions to an
instruction set implemented against the NIC. This includes offloading
of BPF maps to the NIC as well, thus the offloaded BPF program can
perform map lookups, updates and deletions.

BPF sysctls
-----------

The Linux kernel provides a few sysctls that are BPF related and are covered in this section.

* ``/proc/sys/net/core/bpf_jit_enable``: Enables or disables the BPF JIT compiler.

  +-------+-------------------------------------------------------------------+
  | Value | Description                                                       |
  +-------+-------------------------------------------------------------------+
  | 0     | Disable the JIT and use only interpreter (kernel's default value) |
  +-------+-------------------------------------------------------------------+
  | 1     | Enable the JIT compiler                                           |
  +-------+-------------------------------------------------------------------+
  | 2     | Enable the JIT and emit debugging traces to the kernel log        |
  +-------+-------------------------------------------------------------------+

  As described in subsequent sections, the ``bpf_jit_disasm`` tool can be used to
  process debugging traces when the JIT compiler is set to debugging mode (option ``2``).

* ``/proc/sys/net/core/bpf_jit_harden``: Enables or disables BPF JIT hardening.
  Note that enabling hardening trades off performance, but can mitigate JIT spraying
  by blinding out the BPF program's immediate values. For programs processed through
  the interpreter, blinding of immediate values is not needed / performed.

  +-------+-------------------------------------------------------------------+
  | Value | Description                                                       |
  +-------+-------------------------------------------------------------------+
  | 0     | Disable JIT hardening (kernel's default value)                    |
  +-------+-------------------------------------------------------------------+
  | 1     | Enable JIT hardening for unprivileged users only                  |
  +-------+-------------------------------------------------------------------+
  | 2     | Enable JIT hardening for all users                                |
  +-------+-------------------------------------------------------------------+

* ``/proc/sys/net/core/bpf_jit_kallsyms``: Enables or disables export of JITed
  programs as kernel symbols to ``/proc/kallsyms`` so that they can be used together
  with ``perf`` tooling as well as making the kernel aware of these addresses for
  stack unwinding, for example, when dumping stack traces. The symbol names
  contain the BPF program tag (``bpf_prog_<tag>``). If ``bpf_jit_harden`` is enabled,
  then this feature is disabled.

  +-------+-------------------------------------------------------------------+
  | Value | Description                                                       |
  +-------+-------------------------------------------------------------------+
  | 0     | Disable JIT kallsyms export (kernel's default value)              |
  +-------+-------------------------------------------------------------------+
  | 1     | Enable JIT kallsyms export for privileged users only              |
  +-------+-------------------------------------------------------------------+

* ``/proc/sys/kernel/unprivileged_bpf_disabled``: Enables or disables unprivileged
  use of the ``bpf(2)`` system call. The Linux kernel has unprivileged use of
  ``bpf(2)`` enabled by default.

  Once the value is set to 1, unprivileged use will be permanently disabled until
  the next reboot; neither an application nor an admin can reset the value anymore.

  The value can also be set to 2, which means it can be changed at runtime to 0 or
  1 later, while disabling unprivileged use for now. This value was added
  in Linux 5.13. If ``BPF_UNPRIV_DEFAULT_OFF``
  is enabled in the kernel config, then this knob will default to 2 instead of 0.

  This knob does not affect any cBPF programs such as seccomp
  or traditional socket filters that do not use the ``bpf(2)`` system call for
  loading the program into the kernel.

  +-------+---------------------------------------------------------------------+
  | Value | Description                                                         |
  +-------+---------------------------------------------------------------------+
  | 0     | Unprivileged use of bpf syscall enabled (kernel's default value)    |
  +-------+---------------------------------------------------------------------+
  | 1     | Unprivileged use of bpf syscall disabled (until reboot)             |
  +-------+---------------------------------------------------------------------+
  | 2     | Unprivileged use of bpf syscall disabled                            |
  |       | (default if ``BPF_UNPRIV_DEFAULT_OFF`` is enabled in kernel config) |
  +-------+---------------------------------------------------------------------+