
     1  .. only:: not (epub or latex or html)
     2  
     3      WARNING: You are looking at unreleased Cilium documentation.
     4      Please use the official rendered version released here:
     5      https://docs.cilium.io
     6  
     7  .. _bpf_dev:
     8  
     9  Development Tools
    10  =================
    11  
    12  Current user space tooling, introspection facilities and kernel control knobs around
    13  BPF are discussed in this section.
    14  
    15  .. note:: The tooling and infrastructure around BPF is still rapidly evolving and thus may not provide a complete picture of all available tools.
    16  
    17  Development Environment
    18  -----------------------
    19  
    20  A step by step guide for setting up a development environment for BPF can be found
    21  below for both Fedora and Ubuntu. This will guide you through building, installing
    22  and testing a development kernel as well as building and installing iproute2.
    23  
Manually building iproute2 and the Linux kernel is usually not necessary, given
that major distributions already ship recent enough kernels by default, but it
is needed for testing bleeding edge versions or for contributing BPF patches to
iproute2 and to the Linux kernel, respectively. Similarly, building bpftool is
optional, but recommended for debugging and introspection purposes.
    29  
    30  .. tabs::
    31  
    32      .. group-tab:: Fedora
    33  
    34          The following applies to Fedora 25 or later:
    35  
    36          .. code-block:: shell-session
    37  
    38              $ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \
    39                openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static
    40  
    41          .. note:: If you are running some other Fedora derivative and ``dnf`` is missing,
    42                    try using ``yum`` instead.
    43  
    44      .. group-tab:: Ubuntu
    45  
    46          The following applies to Ubuntu 17.04 or later:
    47  
    48          .. code-block:: shell-session
    49  
    50              $ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \
    51                clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \
    52                graphviz
    53  
    54      .. group-tab:: openSUSE Tumbleweed
    55  
    56          The following applies to openSUSE Tumbleweed and openSUSE Leap 15.0 or later:
    57  
    58          .. code-block:: shell-session
    59  
    60             $ sudo zypper install -y git gcc ncurses-devel libelf-devel bc libopenssl-devel \
    61             libcap-devel clang llvm graphviz bison flex glibc-devel-static
    62  
    63  Compiling the Kernel
    64  ````````````````````
    65  
Development of new BPF features for the Linux kernel happens in the ``net-next``
git tree; the latest BPF fixes land in the ``net`` tree. The following command
obtains the kernel source for the ``net-next`` tree through git:
    69  
    70  .. code-block:: shell-session
    71  
    72      $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git
    73  
If the git commit history is not of interest, then ``--depth 1`` clones the
tree much faster by truncating the git history to only the most recent commit.
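
For example, a shallow clone of the ``net-next`` tree:

.. code-block:: shell-session

    $ git clone --depth 1 git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git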
    76  
In case the ``net`` tree is of interest, it can be cloned from this URL:
    78  
    79  .. code-block:: shell-session
    80  
    81      $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git
    82  
There are dozens of tutorials on the Internet on how to build Linux kernels. One
good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild),
which can be followed with either of the two git trees mentioned above.
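
For example, a minimal build and install sequence on top of an existing kernel
configuration might look like the following; the exact steps are covered in the
guide referenced above:

.. code-block:: shell-session

    $ make olddefconfig
    $ make -j $(getconf _NPROCESSORS_ONLN)
    $ sudo make modules_install install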
    86  
    87  Make sure that the generated ``.config`` file contains the following ``CONFIG_*``
    88  entries for running BPF. These entries are also needed for Cilium.
    89  
    90  ::
    91  
    92      CONFIG_CGROUP_BPF=y
    93      CONFIG_BPF=y
    94      CONFIG_BPF_SYSCALL=y
    95      CONFIG_NET_SCH_INGRESS=m
    96      CONFIG_NET_CLS_BPF=m
    97      CONFIG_NET_CLS_ACT=y
    98      CONFIG_BPF_JIT=y
    99      CONFIG_LWTUNNEL_BPF=y
   100      CONFIG_HAVE_EBPF_JIT=y
   101      CONFIG_BPF_EVENTS=y
   102      CONFIG_TEST_BPF=m
   103  
Some of the entries cannot be adjusted through ``make menuconfig``. For example,
``CONFIG_HAVE_EBPF_JIT`` is selected automatically if a given architecture comes
with an eBPF JIT. In this specific case, ``CONFIG_HAVE_EBPF_JIT`` is optional but
highly recommended. An architecture without an eBPF JIT compiler needs to fall
back to the in-kernel interpreter, at the cost of less efficient execution of
BPF instructions.
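
Whether the JIT is enabled on a running system can be verified through the
``bpf_jit_enable`` sysctl knob; the output shown is for a system with the JIT
turned on:

.. code-block:: shell-session

    $ sysctl net.core.bpf_jit_enable
    net.core.bpf_jit_enable = 1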
   110  
   111  Verifying the Setup
   112  ```````````````````
   113  
   114  After you have booted into the newly compiled kernel, navigate to the BPF selftest
   115  suite in order to test BPF functionality (current working directory points to
   116  the root of the cloned git tree):
   117  
   118  .. code-block:: shell-session
   119  
   120      $ cd tools/testing/selftests/bpf/
   121      $ make
   122      $ sudo ./test_verifier
   123  
The verifier tests print out all the current checks being performed. The summary
at the end of running all tests dumps information about test successes and
failures:
   127  
   128  ::
   129  
   130      Summary: 847 PASSED, 0 SKIPPED, 0 FAILED
   131  
.. note:: For kernel releases 4.16+ the BPF selftest suite has a dependency on
          LLVM 6.0+, caused by BPF function calls which no longer need to be
          inlined. See section :ref:`bpf_to_bpf_calls` or the cover letter mail
          of the kernel patch (https://lwn.net/Articles/741773/) for more
          information. Not every BPF program has a dependency on LLVM 6.0+ if
          it does not use this new feature. If your distribution does not
          provide LLVM 6.0+ you may compile it by following the instructions
          in the :ref:`tooling_llvm` section.
   140  
   141  In order to run through all BPF selftests, the following command is needed:
   142  
   143  .. code-block:: shell-session
   144  
   145      $ sudo make run_tests
   146  
   147  If you see any failures, please contact us on `Cilium Slack`_ with the full
   148  test output.
   149  
   150  Compiling iproute2
   151  ``````````````````
   152  
Similar to the ``net`` (fixes only) and ``net-next`` (new features) kernel trees,
the iproute2 git tree has two branches, namely ``master`` and ``net-next``. The
``master`` branch is based on the ``net`` kernel tree and the ``net-next`` branch
is based on the ``net-next`` kernel tree. This is necessary so that changes in
header files can be synchronized in the iproute2 tree.
   158  
   159  In order to clone the iproute2 ``master`` branch, the following command can
   160  be used:
   161  
   162  .. code-block:: shell-session
   163  
   164      $ git clone https://git.kernel.org/pub/scm/network/iproute2/iproute2.git
   165  
Similarly, to clone the mentioned ``net-next`` branch of iproute2, run the
following:
   168  
   169  .. code-block:: shell-session
   170  
   171      $ git clone -b net-next https://git.kernel.org/pub/scm/network/iproute2/iproute2.git
   172  
   173  After that, proceed with the build and installation:
   174  
   175  .. code-block:: shell-session
   176  
   177      $ cd iproute2/
   178      $ ./configure --prefix=/usr
   179      TC schedulers
   180       ATM    no
   181  
   182      libc has setns: yes
   183      SELinux support: yes
   184      ELF support: yes
   185      libmnl support: no
   186      Berkeley DB: no
   187  
   188      docs: latex: no
   189       WARNING: no docs can be built from LaTeX files
   190       sgml2html: no
   191       WARNING: no HTML docs can be built from SGML
   192      $ make
   193      [...]
   194      $ sudo make install
   195  
Ensure that the ``configure`` script shows ``ELF support: yes``, so that iproute2
can process ELF files from LLVM's BPF back end. libelf was included in the
dependency installation instructions for Fedora and Ubuntu earlier.
   199  
   200  Compiling bpftool
   201  `````````````````
   202  
bpftool is an essential tool for debugging and introspection of BPF programs
and maps. It is part of the kernel tree and available under ``tools/bpf/bpftool/``.
   205  
   206  Make sure to have cloned either the ``net`` or ``net-next`` kernel tree as described
   207  earlier. In order to build and install bpftool, the following steps are required:
   208  
   209  .. code-block:: shell-session
   210  
   211      $ cd <kernel-tree>/tools/bpf/bpftool/
   212      $ make
   213      Auto-detecting system features:
   214      ...                        libbfd: [ on  ]
   215      ...        disassembler-four-args: [ OFF ]
   216  
   217        CC       xlated_dumper.o
   218        CC       prog.o
   219        CC       common.o
   220        CC       cgroup.o
   221        CC       main.o
   222        CC       json_writer.o
   223        CC       cfg.o
   224        CC       map.o
   225        CC       jit_disasm.o
   226        CC       disasm.o
   227      make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf'
   228  
   229      Auto-detecting system features:
   230      ...                        libelf: [ on  ]
   231      ...                           bpf: [ on  ]
   232  
   233        CC       libbpf.o
   234        CC       bpf.o
   235        CC       nlattr.o
   236        LD       libbpf-in.o
   237        LINK     libbpf.a
   238      make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf'
   239        LINK     bpftool
   240      $ sudo make install
   241  
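After installation, a quick functional check is to list the BPF programs and
maps currently loaded on the system; both subcommands are part of bpftool's
standard interface:

.. code-block:: shell-session

    # bpftool prog show
    # bpftool map show
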
   242  .. _tooling_llvm:
   243  
   244  LLVM
   245  ----
   246  
   247  LLVM is currently the only compiler suite providing a BPF back end. gcc does
   248  not support BPF at this point.
   249  
   250  The BPF back end was merged into LLVM's 3.7 release. Major distributions enable
   251  the BPF back end by default when they package LLVM, therefore installing clang
   252  and llvm is sufficient on most recent distributions to start compiling C
   253  into BPF object files.
   254  
   255  The typical workflow is that BPF programs are written in C, compiled by LLVM
   256  into object / ELF files, which are parsed by user space BPF ELF loaders (such as
   257  iproute2 or others), and pushed into the kernel through the BPF system call.
   258  The kernel verifies the BPF instructions and JITs them, returning a new file
   259  descriptor for the program, which then can be attached to a subsystem (e.g.
   260  networking). If supported, the subsystem could then further offload the BPF
   261  program to hardware (e.g. NIC).
   262  
   263  For LLVM, BPF target support can be checked, for example, through the following:
   264  
   265  .. code-block:: shell-session
   266  
   267      $ llc --version
   268      LLVM (http://llvm.org/):
   269      LLVM version 3.8.1
   270      Optimized build.
   271      Default target: x86_64-unknown-linux-gnu
   272      Host CPU: skylake
   273  
   274      Registered Targets:
   275        [...]
   276        bpf        - BPF (host endian)
   277        bpfeb      - BPF (big endian)
   278        bpfel      - BPF (little endian)
   279        [...]
   280  
By default, the ``bpf`` target uses the endianness of the CPU it compiles on,
meaning that if the CPU's endianness is little endian, the program is represented
in little endian format as well, and if the CPU's endianness is big endian,
the program is represented in big endian. This also matches the runtime behavior
of BPF, which is generic and uses the endianness of the CPU it runs on in order
not to disadvantage any architecture.
   287  
For cross-compilation, the two targets ``bpfeb`` and ``bpfel`` were introduced.
Thanks to them, BPF programs can be compiled on a node running one endianness
(e.g. little endian on x86) and run on a node with the other endianness (e.g.
big endian on arm). Note that the front end (clang) needs to run in the target
endianness as well.
   293  
   294  Using ``bpf`` as a target is the preferred way in situations where no mixture of
   295  endianness applies. For example, compilation on ``x86_64`` results in the same
   296  output for the targets ``bpf`` and ``bpfel`` due to being little endian, therefore
   297  scripts triggering a compilation also do not have to be endian aware.
   298  
   299  A minimal, stand-alone XDP drop program might look like the following example
   300  (``xdp-example.c``):
   301  
   302  .. code-block:: c
   303  
   304      #include <linux/bpf.h>
   305  
   306      #ifndef __section
   307      # define __section(NAME)                  \
   308         __attribute__((section(NAME), used))
   309      #endif
   310  
   311      __section("prog")
   312      int xdp_drop(struct xdp_md *ctx)
   313      {
   314          return XDP_DROP;
   315      }
   316  
   317      char __license[] __section("license") = "GPL";
   318  
   319  It can then be compiled and loaded into the kernel as follows:
   320  
   321  .. code-block:: shell-session
   322  
   323      $ clang -O2 -Wall --target=bpf -c xdp-example.c -o xdp-example.o
   324      # ip link set dev em1 xdp obj xdp-example.o
   325  
   326  .. note:: Attaching an XDP BPF program to a network device as above requires
   327            Linux 4.11 with a device that supports XDP, or Linux 4.12 or later.
   328  
For the generated object file, LLVM (>= 3.9) uses the official BPF machine value,
that is, ``EM_BPF`` (decimal: ``247`` / hex: ``0xf7``). In this example, the program
has been compiled with the ``bpf`` target under ``x86_64``, therefore ``LSB`` (as
opposed to ``MSB``) is shown regarding endianness:
   333  
   334  .. code-block:: shell-session
   335  
   336      $ file xdp-example.o
   337      xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped
   338  
   339  ``readelf -a xdp-example.o`` will dump further information about the ELF file, which can
   340  sometimes be useful for introspecting generated section headers, relocation entries
   341  and the symbol table.
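
For comparison, the same program can be compiled for the opposite endianness by
switching the target. The following is a sketch for a big endian object; the
``file`` output shown is the expected result and may differ slightly depending
on the installed file(1) version:

.. code-block:: shell-session

    $ clang -O2 -Wall --target=bpfeb -c xdp-example.c -o xdp-example-eb.o
    $ file xdp-example-eb.o
    xdp-example-eb.o: ELF 64-bit MSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped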
   342  
   343  In the unlikely case where clang and LLVM need to be compiled from scratch, the
   344  following commands can be used:
   345  
   346  .. code-block:: shell-session
   347  
   348      $ git clone https://github.com/llvm/llvm-project.git
   349      $ cd llvm-project
   350      $ mkdir build
   351      $ cd build
   352      $ cmake -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF  -G "Unix Makefiles" ../llvm
   353      $ make -j $(getconf _NPROCESSORS_ONLN)
   354      $ ./bin/llc --version
   355      LLVM (http://llvm.org/):
   356      LLVM version x.y.zsvn
   357      Optimized build.
   358      Default target: x86_64-unknown-linux-gnu
   359      Host CPU: skylake
   360  
   361      Registered Targets:
   362        bpf    - BPF (host endian)
   363        bpfeb  - BPF (big endian)
   364        bpfel  - BPF (little endian)
   365        x86    - 32-bit X86: Pentium-Pro and above
   366        x86-64 - 64-bit X86: EM64T and AMD64
   367  
   368      $ export PATH=$PWD/bin:$PATH   # add to ~/.bashrc
   369  
Make sure that ``--version`` mentions ``Optimized build.``, otherwise the
compilation time for programs will increase significantly (e.g. by 10x or more)
when LLVM is built in debugging mode.
   373  
   374  For debugging, clang can generate the assembler output as follows:
   375  
   376  .. code-block:: shell-session
   377  
   378      $ clang -O2 -S -Wall --target=bpf -c xdp-example.c -o xdp-example.S
   379      $ cat xdp-example.S
   380          .text
   381          .section    prog,"ax",@progbits
   382          .globl      xdp_drop
   383          .p2align    3
   384      xdp_drop:                             # @xdp_drop
   385      # BB#0:
   386          r0 = 1
   387          exit
   388  
   389          .section    license,"aw",@progbits
   390          .globl    __license               # @__license
   391      __license:
   392          .asciz    "GPL"
   393  
Starting from LLVM's release 6.0, there is also assembler parser support. You can
program in BPF assembler directly, then use llvm-mc to assemble it into an
object file. For example, you can assemble the ``xdp-example.S`` listed above back
into an object file using:
   398  
   399  .. code-block:: shell-session
   400  
   401      $ llvm-mc -triple bpf -filetype=obj -o xdp-example.o xdp-example.S
   402  
Furthermore, more recent LLVM versions (>= 4.0) can also store debugging
information in DWARF format in the object file. This can be done through
the usual workflow by adding ``-g`` for compilation.
   406  
   407  .. code-block:: shell-session
   408  
   409      $ clang -O2 -g -Wall --target=bpf -c xdp-example.c -o xdp-example.o
   410      $ llvm-objdump -S --no-show-raw-insn xdp-example.o
   411  
   412      xdp-example.o:        file format ELF64-BPF
   413  
   414      Disassembly of section prog:
   415      xdp_drop:
   416      ; {
   417          0:        r0 = 1
   418      ; return XDP_DROP;
   419          1:        exit
   420  
   421  The ``llvm-objdump`` tool can then annotate the assembler output with the
   422  original C code used in the compilation. The trivial example in this case
   423  does not contain much C code, however, the line numbers shown as ``0:``
   424  and ``1:`` correspond directly to the kernel's verifier log.
   425  
   426  This means that in case BPF programs get rejected by the verifier, ``llvm-objdump``
   427  can help to correlate the instructions back to the original C code, which is
   428  highly useful for analysis.
   429  
   430  .. code-block:: shell-session
   431  
   432      # ip link set dev em1 xdp obj xdp-example.o verb
   433  
   434      Prog section 'prog' loaded (5)!
   435       - Type:         6
   436       - Instructions: 2 (0 over limit)
   437       - License:      GPL
   438  
   439      Verifier analysis:
   440  
   441      0: (b7) r0 = 1
   442      1: (95) exit
   443      processed 2 insns
   444  
As can be seen in the verifier analysis, the ``llvm-objdump`` output dumps
the same BPF assembler code as the kernel.
   447  
   448  Leaving out the ``--no-show-raw-insn`` option will also dump the raw
   449  ``struct bpf_insn`` as hex in front of the assembly:
   450  
   451  .. code-block:: shell-session
   452  
   453      $ llvm-objdump -S xdp-example.o
   454  
   455      xdp-example.o:        file format ELF64-BPF
   456  
   457      Disassembly of section prog:
   458      xdp_drop:
   459      ; {
   460         0:       b7 00 00 00 01 00 00 00     r0 = 1
    ; return XDP_DROP;
   462         1:       95 00 00 00 00 00 00 00     exit
   463  
   464  For LLVM IR debugging, the compilation process for BPF can be split into
   465  two steps, generating a binary LLVM IR intermediate file ``xdp-example.bc``, which
   466  can later on be passed to llc:
   467  
   468  .. code-block:: shell-session
   469  
   470      $ clang -O2 -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
   471      $ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o
   472  
   473  The generated LLVM IR can also be dumped in human readable format through:
   474  
   475  .. code-block:: shell-session
   476  
   477      $ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -
   478  
   479  LLVM is able to attach debug information such as the description of used data
   480  types in the program to the generated BPF object file. By default this is in
   481  DWARF format.
   482  
A heavily simplified version of DWARF used by BPF is called BTF (BPF Type
Format). The resulting DWARF can be converted into BTF, which is later loaded
into the kernel through BPF object loaders. The kernel then verifies the BTF
data for correctness and keeps track of the data types the BTF data contains.
   487  
BPF maps can then be annotated with key and value types out of the BTF data
such that a later dump of the map exports the map data along with the related
type information. This allows for better introspection, debugging and value
pretty printing. Note that BTF data is a generic debugging data format and
as such any DWARF to BTF converted data can be loaded (e.g. the kernel's vmlinux
DWARF data could be converted to BTF and loaded). The latter is particularly
useful for BPF tracing in the future.
   495  
   496  In order to generate BTF from DWARF debugging information, elfutils (>= 0.173)
   497  is needed. If that is not available, then adding the ``-mattr=dwarfris`` option
   498  to the ``llc`` command is required during compilation:
   499  
   500  .. code-block:: shell-session
   501  
   502      $ llc -march=bpf -mattr=help |& grep dwarfris
   503        dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
   504        [...]
   505  
The reason for using ``-mattr=dwarfris`` is that the flag ``dwarfris`` (``dwarf
relocation in section``) disables DWARF cross-section relocations between DWARF
and the ELF's symbol table, since libdw does not have proper BPF relocation
support; without it, tools like ``pahole`` would not be able to properly dump
structures from the object.
   511  
   512  elfutils (>= 0.173) implements proper BPF relocation support and therefore
   513  the same can be achieved without the ``-mattr=dwarfris`` option. Dumping
   514  the structures from the object file could be done from either DWARF or BTF
   515  information. ``pahole`` uses the LLVM emitted DWARF information at this
   516  point, however, future ``pahole`` versions could rely on BTF if available.
   517  
   518  For converting DWARF into BTF, a recent pahole version (>= 1.12) is required.
   519  A recent pahole version can also be obtained from its official git repository
   520  if not available from one of the distribution packages:
   521  
   522  .. code-block:: shell-session
   523  
   524      $ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git
   525  
   526  ``pahole`` comes with the option ``-J`` to convert DWARF into BTF from an
   527  object file. ``pahole`` can be probed for BTF support as follows (note that
   528  the ``llvm-objcopy`` tool is required for ``pahole`` as well, so check its
   529  presence, too):
   530  
   531  .. code-block:: shell-session
   532  
   533      $ pahole --help | grep BTF
   534      -J, --btf_encode           Encode as BTF
   535  
   536  Generating debugging information also requires the front end to generate
   537  source level debug information by passing ``-g`` to the ``clang`` command
   538  line. Note that ``-g`` is needed independently of whether ``llc``'s
   539  ``dwarfris`` option is used. Full example for generating the object file:
   540  
   541  .. code-block:: shell-session
   542  
   543      $ clang -O2 -g -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
   544      $ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o
   545  
Alternatively, clang alone can be used to build a BPF program with debugging
information (again, the dwarfris flag can be omitted when a recent enough
elfutils version is present):
   549  
   550  .. code-block:: shell-session
   551  
   552      $ clang --target=bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
   553  
   554  After successful compilation ``pahole`` can be used to properly dump structures
   555  of the BPF program based on the DWARF information:
   556  
   557  .. code-block:: shell-session
   558  
   559      $ pahole xdp-example.o
   560      struct xdp_md {
   561              __u32                      data;                 /*     0     4 */
   562              __u32                      data_end;             /*     4     4 */
   563              __u32                      data_meta;            /*     8     4 */
   564  
   565              /* size: 12, cachelines: 1, members: 3 */
   566              /* last cacheline: 12 bytes */
   567      };
   568  
   569  Through the option ``-J`` ``pahole`` can eventually generate the BTF from
   570  DWARF. In the object file DWARF data will still be retained alongside the
   571  newly added BTF data. Full ``clang`` and ``pahole`` example combined:
   572  
   573  .. code-block:: shell-session
   574  
   575      $ clang --target=bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
   576      $ pahole -J xdp-example.o
   577  
The presence of a ``.BTF`` section can be seen through the ``readelf`` tool:
   579  
   580  .. code-block:: shell-session
   581  
   582      $ readelf -a xdp-example.o
   583      [...]
   584        [18] .BTF              PROGBITS         0000000000000000  00000671
   585      [...]
   586  
   587  BPF loaders such as iproute2 will detect and load the BTF section, so that
   588  BPF maps can be annotated with type information.
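
Assuming a bpftool version built with BTF support, the types encoded in the
object file can also be dumped directly for inspection:

.. code-block:: shell-session

    # bpftool btf dump file xdp-example.o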
   589  
   590  LLVM by default uses the BPF base instruction set for generating code
   591  in order to make sure that the generated object file can also be loaded
   592  with older kernels such as long-term stable kernels (e.g. 4.9+).
   593  
   594  However, LLVM has a ``-mcpu`` selector for the BPF back end in order to
   595  select different versions of the BPF instruction set, namely instruction
   596  set extensions on top of the BPF base instruction set in order to generate
   597  more efficient and smaller code.
   598  
   599  Available ``-mcpu`` options can be queried through:
   600  
   601  .. code-block:: shell-session
   602  
   603      $ llc -march bpf -mcpu=help
   604      Available CPUs for this target:
   605  
   606        generic - Select the generic processor.
   607        probe   - Select the probe processor.
   608        v1      - Select the v1 processor.
   609        v2      - Select the v2 processor.
   610      [...]
   611  
   612  The ``generic`` processor is the default processor, which is also the
   613  base instruction set ``v1`` of BPF. Options ``v1`` and ``v2`` are typically
   614  useful in an environment where the BPF program is being cross compiled
   615  and the target host where the program is loaded differs from the one
   616  where it is compiled (and thus available BPF kernel features might differ
   617  as well).
   618  
The recommended ``-mcpu`` option, which is also used by Cilium internally, is
``-mcpu=probe``. Here, the LLVM BPF back end queries the kernel for the
availability of BPF instruction set extensions and, when found, LLVM will use
them for compiling the BPF program whenever appropriate.
   623  
   624  A full command line example with llc's ``-mcpu=probe``:
   625  
   626  .. code-block:: shell-session
   627  
   628      $ clang -O2 -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
   629      $ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o
   630  
   631  Generally, LLVM IR generation is architecture independent. There are
   632  however a few differences when using ``clang --target=bpf`` versus
   633  leaving ``--target=bpf`` out and thus using clang's default target which,
   634  depending on the underlying architecture, might be ``x86_64``, ``arm64``
   635  or others.
   636  
   637  Quoting from the kernel's ``Documentation/bpf/bpf_devel_QA.txt``:
   638  
   639  * BPF programs may recursively include header file(s) with file scope
   640    inline assembly codes. The default target can handle this well, while
   641    bpf target may fail if bpf backend assembler does not understand
   642    these assembly codes, which is true in most cases.
   643  
   644  * When compiled without -g, additional elf sections, e.g., ``.eh_frame``
   645    and ``.rela.eh_frame``, may be present in the object file with default
   646    target, but not with bpf target.
   647  
   648  * The default target may turn a C switch statement into a switch table
   649    lookup and jump operation. Since the switch table is placed in the
   650    global read-only section, the bpf program will fail to load.
   651    The bpf target does not support switch table optimization. The clang
   652    option ``-fno-jump-tables`` can be used to disable switch table
   653    generation.
   654  
   655  * For clang ``--target=bpf``, it is guaranteed that pointer or long /
   656    unsigned long types will always have a width of 64 bit, no matter
   657    whether underlying clang binary or default target (or kernel) is
   658    32 bit. However, when native clang target is used, then it will
   659    compile these types based on the underlying architecture's
   660    conventions, meaning in case of 32 bit architecture, pointer or
   661    long / unsigned long types e.g. in BPF context structure will have
   662    width of 32 bit while the BPF LLVM back end still operates in 64 bit.
   663  
   664  The native target is mostly needed in tracing for the case of walking
   665  the kernel's ``struct pt_regs`` that maps CPU registers, or other kernel
   666  structures where CPU's register width matters. In all other cases such
   667  as networking, the use of ``clang --target=bpf`` is the preferred choice.
   668  
LLVM has also supported 32-bit subregisters and BPF ALU32 instructions since
release 7.0, through a code generation attribute called ``alu32``. When it is
enabled, LLVM will try to use 32-bit subregisters whenever possible, typically
when there are operations on 32-bit types. The associated ALU instructions with
32-bit subregisters then become ALU32 instructions. For example, for the
following sample code:
   675  
   676  .. code-block:: shell-session
   677  
   678      $ cat 32-bit-example.c
   679          void cal(unsigned int *a, unsigned int *b, unsigned int *c)
   680          {
   681            unsigned int sum = *a + *b;
   682            *c = sum;
   683          }
   684  
At default code generation, the assembler output looks like:
   686  
   687  .. code-block:: shell-session
   688  
   689      $ clang --target=bpf -emit-llvm -S 32-bit-example.c
   690      $ llc -march=bpf 32-bit-example.ll
   691      $ cat 32-bit-example.s
   692          cal:
   693            r1 = *(u32 *)(r1 + 0)
   694            r2 = *(u32 *)(r2 + 0)
   695            r2 += r1
   696            *(u32 *)(r3 + 0) = r2
   697            exit
   698  
64-bit registers are used, hence the addition is a 64-bit addition. Now, if you
enable the new 32-bit subregisters support by specifying ``-mattr=+alu32``, then
the assembler output looks like:
   702  
   703  .. code-block:: shell-session
   704  
   705      $ llc -march=bpf -mattr=+alu32 32-bit-example.ll
   706      $ cat 32-bit-example.s
   707          cal:
   708            w1 = *(u32 *)(r1 + 0)
   709            w2 = *(u32 *)(r2 + 0)
   710            w2 += w1
   711            *(u32 *)(r3 + 0) = w2
   712            exit
   713  
``w`` registers, meaning 32-bit subregisters, are used instead of the 64-bit
``r`` registers.
   716  
Enabling 32-bit subregisters might help reduce type extension instruction
sequences. It could also help the kernel's eBPF JIT compiler on 32-bit
architectures, where register pairs are used to model the 64-bit eBPF registers
and extra instructions are needed for manipulating the high 32 bits. A read from
a 32-bit subregister is guaranteed to read only the low 32 bits, even though a
write still needs to clear the high 32 bits; so if the JIT compiler knows that
the definition of a register is only ever read through subregisters, the
instructions for clearing the high 32 bits of the destination can be eliminated.
   725  
   726  When writing C programs for BPF, there are a couple of pitfalls to be aware
   727  of, compared to usual application development with C. The following items
   728  describe some of the differences for the BPF model:
   729  
   730  1. **Everything needs to be inlined, there are no function calls (on older
   731     LLVM versions) or shared library calls available.**
   732  
   Shared libraries, etc. cannot be used with BPF. However, common library
   code used in BPF programs can be placed into header files and included in
   the main programs. For example, Cilium makes heavy use of this (see ``bpf/lib/``).
   This still allows for including header files, for example, from the kernel
   or other libraries, and reusing their static inline functions or
   macros / definitions.
   739  
   Unless a recent kernel (4.16+) and LLVM (6.0+) are used, where BPF to BPF
   function calls are supported, LLVM needs to compile and inline the entire
   code into a flat sequence of BPF instructions for a given program section.
   In that case, best practice is to use an annotation like ``__inline`` for
   every library function as shown below. The use of ``always_inline`` is
   recommended, since the compiler could still decide to uninline large
   functions that are only annotated as ``inline``.
   747  
   748     In case the latter happens, LLVM will generate a relocation entry into
   749     the ELF file, which BPF ELF loaders such as iproute2 cannot resolve and
   750     will thus produce an error since only BPF maps are valid relocation entries
   751     which loaders can process.
   752  
   753     .. code-block:: c
   754  
   755      #include <linux/bpf.h>
   756  
   757      #ifndef __section
   758      # define __section(NAME)                  \
   759         __attribute__((section(NAME), used))
   760      #endif
   761  
   762      #ifndef __inline
   763      # define __inline                         \
   764         inline __attribute__((always_inline))
   765      #endif
   766  
   767      static __inline int foo(void)
   768      {
   769          return XDP_DROP;
   770      }
   771  
   772      __section("prog")
   773      int xdp_drop(struct xdp_md *ctx)
   774      {
   775          return foo();
   776      }
   777  
   778      char __license[] __section("license") = "GPL";
   779  
   780  2. **Multiple programs can reside inside a single C file in different sections.**
   781  
   C programs for BPF make heavy use of section annotations. A C file is
   typically structured into 3 or more sections. BPF ELF loaders use these
   names to extract and prepare the relevant information in order to load
   the programs and maps through the bpf system call. For example, iproute2
   uses ``maps`` and ``license`` as default section names to find the metadata
   needed for map creation and the license for the BPF program, respectively.
   At program creation time the latter is pushed into the kernel as well, and
   enables some of the helper functions which are exposed as GPL only in case
   the program also holds a GPL compatible license, for example
   ``bpf_ktime_get_ns()``, ``bpf_probe_read()`` and others.
   792  
   The remaining section names are specific to the BPF program code, for
   example, the code below has been modified to contain two program sections,
   ``ingress`` and ``egress``. The toy example code demonstrates that both
   can share a map and common static inline helpers such as the
   ``account_data()`` function.
   797  
   798     The ``xdp-example.c`` example has been modified to a ``tc-example.c``
   799     example that can be loaded with tc and attached to a netdevice's ingress
   800     and egress hook.  It accounts the transferred bytes into a map called
   801     ``acc_map``, which has two map slots, one for traffic accounted on the
   802     ingress hook, one on the egress hook.
   803  
   804     .. code-block:: c
   805  
   806      #include <linux/bpf.h>
   807      #include <linux/pkt_cls.h>
   808      #include <stdint.h>
   809      #include <iproute2/bpf_elf.h>
   810  
   811      #ifndef __section
   812      # define __section(NAME)                  \
   813         __attribute__((section(NAME), used))
   814      #endif
   815  
   816      #ifndef __inline
   817      # define __inline                         \
   818         inline __attribute__((always_inline))
   819      #endif
   820  
   821      #ifndef lock_xadd
   822      # define lock_xadd(ptr, val)              \
   823         ((void)__sync_fetch_and_add(ptr, val))
   824      #endif
   825  
   826      #ifndef BPF_FUNC
   827      # define BPF_FUNC(NAME, ...)              \
   828         (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
   829      #endif
   830  
   831      static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);
   832  
   833      struct bpf_elf_map acc_map __section("maps") = {
   834          .type           = BPF_MAP_TYPE_ARRAY,
   835          .size_key       = sizeof(uint32_t),
   836          .size_value     = sizeof(uint32_t),
   837          .pinning        = PIN_GLOBAL_NS,
   838          .max_elem       = 2,
   839      };
   840  
   841      static __inline int account_data(struct __sk_buff *skb, uint32_t dir)
   842      {
   843          uint32_t *bytes;
   844  
   845          bytes = map_lookup_elem(&acc_map, &dir);
   846          if (bytes)
   847                  lock_xadd(bytes, skb->len);
   848  
   849          return TC_ACT_OK;
   850      }
   851  
   852      __section("ingress")
   853      int tc_ingress(struct __sk_buff *skb)
   854      {
   855          return account_data(skb, 0);
   856      }
   857  
   858      __section("egress")
   859      int tc_egress(struct __sk_buff *skb)
   860      {
   861          return account_data(skb, 1);
   862      }
   863  
   864      char __license[] __section("license") = "GPL";
   865  
  The example also demonstrates a couple of other things to be aware of when
  developing programs. The code includes kernel headers, standard C headers
  and an iproute2 specific header containing the definition of
  ``struct bpf_elf_map``. iproute2 has a common BPF ELF loader and as such
  the definition of ``struct bpf_elf_map`` is the very same for XDP and tc
  typed programs.
   872  
   873    A ``struct bpf_elf_map`` entry defines a map in the program and contains
   874    all relevant information (such as key / value size, etc) needed to generate
   875    a map which is used from the two BPF programs. The structure must be placed
   876    into the ``maps`` section, so that the loader can find it. There can be
   877    multiple map declarations of this type with different variable names, but
   878    all must be annotated with ``__section("maps")``.
   879  
   880    The ``struct bpf_elf_map`` is specific to iproute2. Different BPF ELF
   881    loaders can have different formats, for example, the libbpf in the kernel
   882    source tree, which is mainly used by ``perf``, has a different specification.
   883    iproute2 guarantees backwards compatibility for ``struct bpf_elf_map``.
   884    Cilium follows the iproute2 model.
   885  
   886    The example also demonstrates how BPF helper functions are mapped into
   887    the C code and being used. Here, ``map_lookup_elem()`` is defined by
   888    mapping this function into the ``BPF_FUNC_map_lookup_elem`` enum value
   889    which is exposed as a helper in ``uapi/linux/bpf.h``. When the program is later
   890    loaded into the kernel, the verifier checks whether the passed arguments
   891    are of the expected type and re-points the helper call into a real
   892    function call. Moreover, ``map_lookup_elem()`` also demonstrates how
   893    maps can be passed to BPF helper functions. Here, ``&acc_map`` from the
   894    ``maps`` section is passed as the first argument to ``map_lookup_elem()``.
   895  
   896    Since the defined array map is global, the accounting needs to use an
   897    atomic operation, which is defined as ``lock_xadd()``. LLVM maps
   898    ``__sync_fetch_and_add()`` as a built-in function to the BPF atomic
   899    add instruction, that is, ``BPF_STX | BPF_XADD | BPF_W`` for word sizes.
   900  
  Last but not least, the ``struct bpf_elf_map`` states that the map is to
  be pinned as ``PIN_GLOBAL_NS``. This means that tc will pin the map into
  the BPF pseudo file system as a node. By default, it will be pinned
   904    to ``/sys/fs/bpf/tc/globals/acc_map`` for the given example. Due to the
   905    ``PIN_GLOBAL_NS``, the map will be placed under ``/sys/fs/bpf/tc/globals/``.
   906    ``globals`` acts as a global namespace that spans across object files.
   907    If the example used ``PIN_OBJECT_NS``, then tc would create a directory
   908    that is local to the object file. For example, different C files with
   909    BPF code could have the same ``acc_map`` definition as above with a
   910    ``PIN_GLOBAL_NS`` pinning. In that case, the map will be shared among
   911    BPF programs originating from various object files. ``PIN_NONE`` would
   912    mean that the map is not placed into the BPF file system as a node,
   913    and as a result will not be accessible from user space after tc quits. It
   914    would also mean that tc creates two separate map instances for each
   915    program, since it cannot retrieve a previously pinned map under that
   916    name. The ``acc_map`` part from the mentioned path is the name of the
   917    map as specified in the source code.
   918  
   919    Thus, upon loading of the ``ingress`` program, tc will find that no such
   920    map exists in the BPF file system and creates a new one. On success, the
   921    map will also be pinned, so that when the ``egress`` program is loaded
   922    through tc, it will find that such map already exists in the BPF file
  system and will reuse it for the ``egress`` program. The loader also
  ensures that maps which exist under the same name have matching properties
  (key / value size, etc).
   926  
  Just like tc, third party applications can use the ``BPF_OBJ_GET`` command
  of the bpf system call in order to create a new file descriptor pointing to
  the same map instance, which can then be used to lookup / update / delete
  map elements.
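
  A minimal user space sketch of this, using the raw ``bpf(2)`` system call
  directly rather than a loader library; the pinned path matches the
  ``acc_map`` pinning described above, and error handling is kept minimal
  for brevity:

  .. code-block:: c

    #include <linux/bpf.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    /* Thin wrapper around the raw bpf(2) system call. */
    static int sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
    {
        return syscall(__NR_bpf, cmd, attr, size);
    }

    int main(void)
    {
        union bpf_attr attr;
        uint32_t key, value;
        int fd;

        /* BPF_OBJ_GET returns a new fd referencing the pinned map. */
        memset(&attr, 0, sizeof(attr));
        attr.pathname = (uint64_t)(uintptr_t)"/sys/fs/bpf/tc/globals/acc_map";

        fd = sys_bpf(BPF_OBJ_GET, &attr, sizeof(attr));
        if (fd < 0)
            return 1;

        /* Slot 0 holds ingress byte counts, slot 1 egress byte counts. */
        for (key = 0; key < 2; key++) {
            memset(&attr, 0, sizeof(attr));
            attr.map_fd = fd;
            attr.key    = (uint64_t)(uintptr_t)&key;
            attr.value  = (uint64_t)(uintptr_t)&value;

            if (sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)) == 0)
                printf("%s bytes: %u\n", key == 0 ? "ingress" : "egress", value);
        }
        return 0;
    }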
   931  
   932    The code can be compiled and loaded via iproute2 as follows:
   933  
   934    .. code-block:: shell-session
   935  
   936      $ clang -O2 -Wall --target=bpf -c tc-example.c -o tc-example.o
   937  
   938      # tc qdisc add dev em1 clsact
   939      # tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress
   940      # tc filter add dev em1 egress bpf da obj tc-example.o sec egress
   941  
   942      # tc filter show dev em1 ingress
   943      filter protocol all pref 49152 bpf
   944      filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f
   945  
   946      # tc filter show dev em1 egress
   947      filter protocol all pref 49152 bpf
   948      filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714
   949  
   950      # mount | grep bpf
   951      sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
   952      bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700)
   953  
   954      # tree /sys/fs/bpf/
   955      /sys/fs/bpf/
   956      +-- ip -> /sys/fs/bpf/tc/
   957      +-- tc
   958      |   +-- globals
   959      |       +-- acc_map
   960      +-- xdp -> /sys/fs/bpf/tc/
   961  
   962      4 directories, 1 file
   963  
   964    As soon as packets pass the ``em1`` device, counters from the BPF map will
   965    be increased.
   966  
   967  3. **There are no global variables allowed.**
   968  
   969    For the reasons already mentioned in point 1, BPF cannot have global variables
   970    as often used in normal C programs.
   971  
  However, there is a work-around: the program can simply use a BPF map of
  type ``BPF_MAP_TYPE_PERCPU_ARRAY`` with just a single slot of arbitrary
  value size. This works because, during execution, BPF programs are
  guaranteed to never be preempted by the kernel and therefore can use the
  single map entry as a scratch buffer for temporary data, for example, to
  extend beyond the stack limitation. This also works across tail calls,
  since they come with the same guarantees with regards to preemption.
   979  
   980    Otherwise, for holding state across multiple BPF program runs, normal BPF
   981    maps can be used.
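
  A minimal sketch of this work-around, reusing the ``__section`` /
  ``__inline`` / ``BPF_FUNC`` helpers and the iproute2 map definition from
  the tc example in point 2; the value size here is chosen arbitrarily to
  exceed the 512 byte stack limit:

  .. code-block:: c

    /* Single-slot per-CPU scratch space. */
    struct scratch {
        char buf[1024];
    };

    struct bpf_elf_map scratch_map __section("maps") = {
        .type           = BPF_MAP_TYPE_PERCPU_ARRAY,
        .size_key       = sizeof(uint32_t),
        .size_value     = sizeof(struct scratch),
        .max_elem       = 1,
    };

    static __inline struct scratch *scratch_space(void)
    {
        uint32_t key = 0;

        /* The verifier requires the caller to check the returned
         * pointer against NULL before dereferencing it.
         */
        return map_lookup_elem(&scratch_map, &key);
    }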
   982  
   983  4. **There are no const strings or arrays allowed.**
   984  
  Defining ``const`` strings or other arrays in the BPF C program does not
  work for the same reasons pointed out in sections 1 and 3: relocation
  entries would be generated in the ELF file which will be rejected by loaders
  due to not being part of the ABI towards loaders (loaders also cannot fix up
  such entries as it would require large rewrites of the already compiled BPF
  sequence).
   990  
  In the future, LLVM might detect these occurrences and throw an error
  early to the user.
   993  
   994    Helper functions such as ``trace_printk()`` can be worked around as follows:
   995  
   996    .. code-block:: c
   997  
   998      static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);
   999  
  1000      #ifndef printk
  1001      # define printk(fmt, ...)                                      \
  1002          ({                                                         \
  1003              char ____fmt[] = fmt;                                  \
  1004              trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \
  1005          })
  1006      #endif
  1007  
  1008    The program can then use the macro naturally like ``printk("skb len:%u\n", skb->len);``.
  1009    The output will then be written to the trace pipe. ``tc exec bpf dbg`` can be
  1010    used to retrieve the messages from there.
  1011  
  The use of the ``trace_printk()`` helper function has a couple of
  disadvantages and is thus not recommended for production usage. Constant
  strings like ``"skb len:%u\n"`` need to be loaded into the BPF stack each
  time the helper function is called, and BPF helper functions are also
  limited to a maximum of 5 arguments. This leaves room for only 3 additional
  variables which can be passed for dumping.
  1018  
  Therefore, despite being helpful for quick debugging, it is recommended (for
  networking programs) to use the ``skb_event_output()`` or the
  ``xdp_event_output()`` helper, respectively. They allow for passing custom
  structs from the BPF program to the perf event ring buffer along with an
  optional packet sample. For example, Cilium's monitor makes use of these
  helpers in order to implement a debugging framework, notifications for
  network policy violations, etc. These helpers pass the data through a
  lockless memory mapped per-CPU ``perf`` ring buffer, and are thus
  significantly faster than ``trace_printk()``.
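
  A rough sketch of how such a notification path could look, following the
  ``BPF_FUNC`` convention from the earlier tc example; ``struct event`` and
  ``event_map`` are made up for illustration, and the helper maps to the
  ``BPF_FUNC_perf_event_output`` enum value from ``uapi/linux/bpf.h``:

  .. code-block:: c

    struct event {
        uint32_t ifindex;
        uint32_t len;
    };

    /* One ring buffer slot per possible CPU. */
    struct bpf_elf_map event_map __section("maps") = {
        .type           = BPF_MAP_TYPE_PERF_EVENT_ARRAY,
        .size_key       = sizeof(uint32_t),
        .size_value     = sizeof(uint32_t),
        .max_elem       = 64,
    };

    static int BPF_FUNC(perf_event_output, struct __sk_buff *skb, void *map,
                        uint64_t flags, void *data, uint64_t size);

    static __inline void notify_drop(struct __sk_buff *skb)
    {
        struct event ev = {
            .ifindex = skb->ifindex,
            .len     = skb->len,
        };

        /* BPF_F_CURRENT_CPU selects the current CPU's ring buffer slot. */
        perf_event_output(skb, &event_map, BPF_F_CURRENT_CPU, &ev, sizeof(ev));
    }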
  1027  
  1028  5. **Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().**
  1029  
  Since BPF programs cannot perform any function calls other than those to
  BPF helpers, common library code needs to be implemented as inline functions.
  In addition, LLVM provides some built-ins that the programs can use for
  constant sizes (here: ``n``) which will then always get inlined:
  1034  
  1035    .. code-block:: c
  1036  
  1037      #ifndef memset
  1038      # define memset(dest, chr, n)   __builtin_memset((dest), (chr), (n))
  1039      #endif
  1040  
  1041      #ifndef memcpy
  1042      # define memcpy(dest, src, n)   __builtin_memcpy((dest), (src), (n))
  1043      #endif
  1044  
  1045      #ifndef memmove
  1046      # define memmove(dest, src, n)  __builtin_memmove((dest), (src), (n))
  1047      #endif
  1048  
  The ``memcmp()`` built-in has some corner cases where inlining does not take
  place due to an LLVM issue in the back end, and is therefore not recommended
  for use until the issue is fixed.
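
  A short usage sketch, assuming the macros above are in scope; since the size
  argument is a compile time constant, LLVM fully inlines the operation:

  .. code-block:: c

    struct entry {
        uint32_t addr;
        uint32_t flags;
    };

    static __inline void init_entry(struct entry *e)
    {
        /* Constant size, hence inlined by LLVM. */
        memset(e, 0, sizeof(*e));
    }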
  1052  
  1053  6. **There are no loops available (yet).**
  1054  
  The BPF verifier in the kernel checks that a BPF program does not contain
  loops by performing a depth first search of all possible program paths,
  among other control flow graph validations. The purpose is to make sure
  that the program is always guaranteed to terminate.

  A very limited form of looping is available for constant upper loop bounds
  by using the ``#pragma unroll`` directive. Example code that is compiled to
  BPF:
  1062  
  1063    .. code-block:: c
  1064  
  1065      #pragma unroll
  1066          for (i = 0; i < IPV6_MAX_HEADERS; i++) {
  1067              switch (nh) {
  1068              case NEXTHDR_NONE:
  1069                  return DROP_INVALID_EXTHDR;
  1070              case NEXTHDR_FRAGMENT:
  1071                  return DROP_FRAG_NOSUPPORT;
  1072              case NEXTHDR_HOP:
  1073              case NEXTHDR_ROUTING:
  1074              case NEXTHDR_AUTH:
  1075              case NEXTHDR_DEST:
  1076                  if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0)
  1077                      return DROP_INVALID;
  1078  
  1079                  nh = opthdr.nexthdr;
  1080                  if (nh == NEXTHDR_AUTH)
  1081                      len += ipv6_authlen(&opthdr);
  1082                  else
  1083                      len += ipv6_optlen(&opthdr);
  1084                  break;
  1085              default:
  1086                  *nexthdr = nh;
  1087                  return len;
  1088              }
  1089          }
  1090  
  Another possibility is to use tail calls by calling into the same program
  again, using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map for local scratch space.
  While dynamic, this form of looping is limited to a maximum of 34 iterations
  (the initial program, plus 33 iterations from the tail calls).
  1096  
  1097    In the future, BPF may have some native, but limited form of implementing loops.
  1098  
  1099  7. **Partitioning programs with tail calls.**
  1100  
  Tail calls provide the flexibility to atomically alter program behavior at
  runtime by jumping from one BPF program into another. In order to select the
  next program, tail calls make use of program array maps
  (``BPF_MAP_TYPE_PROG_ARRAY``), and pass the map as well as the index of the
  next program to jump to. There is no return to the old program after the
  jump has been performed, and in case there was no program present at the
  given map index, execution continues in the original program.
  1108  
  1109    For example, this can be used to implement various stages of a parser, where
  1110    such stages could be updated with new parsing features during runtime.
  1111  
  Another use case is event notifications, for example, Cilium can opt in to
  packet drop notifications at runtime, where the ``skb_event_output()`` call
  is located inside the tail called program. Thus, during normal operation,
  the fall-through path will always be executed unless a program is added to
  the related map index, where the program then prepares the metadata and
  triggers the event notification to a user space daemon.
  1118  
  1119    Program array maps are quite flexible, enabling also individual actions to
  1120    be implemented for programs located in each map index. For example, the root
  1121    program attached to XDP or tc could perform an initial tail call to index 0
  1122    of the program array map, performing traffic sampling, then jumping to index 1
  1123    of the program array map, where firewalling policy is applied and the packet
  1124    either dropped or further processed in index 2 of the program array map, where
  1125    it is mangled and sent out of an interface again. Jumps in the program array
  1126    map can, of course, be arbitrary. The kernel will eventually execute the
  1127    fall-through path when the maximum tail call limit has been reached.
  1128  
  1129    Minimal example extract of using tail calls:
  1130  
  1131    .. code-block:: c
  1132  
  1133      [...]
  1134  
  1135      #ifndef __stringify
  1136      # define __stringify(X)   #X
  1137      #endif
  1138  
  1139      #ifndef __section
  1140      # define __section(NAME)                  \
  1141         __attribute__((section(NAME), used))
  1142      #endif
  1143  
  1144      #ifndef __section_tail
  1145      # define __section_tail(ID, KEY)          \
  1146         __section(__stringify(ID) "/" __stringify(KEY))
  1147      #endif
  1148  
  1149      #ifndef BPF_FUNC
  1150      # define BPF_FUNC(NAME, ...)              \
  1151         (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
  1152      #endif
  1153  
  1154      #define BPF_JMP_MAP_ID   1
  1155  
  1156      static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map,
  1157                           uint32_t index);
  1158  
  1159      struct bpf_elf_map jmp_map __section("maps") = {
  1160          .type           = BPF_MAP_TYPE_PROG_ARRAY,
  1161          .id             = BPF_JMP_MAP_ID,
  1162          .size_key       = sizeof(uint32_t),
  1163          .size_value     = sizeof(uint32_t),
  1164          .pinning        = PIN_GLOBAL_NS,
  1165          .max_elem       = 1,
  1166      };
  1167  
  1168      __section_tail(BPF_JMP_MAP_ID, 0)
  1169      int looper(struct __sk_buff *skb)
  1170      {
  1171          printk("skb cb: %u\n", skb->cb[0]++);
  1172          tail_call(skb, &jmp_map, 0);
  1173          return TC_ACT_OK;
  1174      }
  1175  
  1176      __section("prog")
  1177      int entry(struct __sk_buff *skb)
  1178      {
  1179          skb->cb[0] = 0;
  1180          tail_call(skb, &jmp_map, 0);
  1181          return TC_ACT_OK;
  1182      }
  1183  
  1184      char __license[] __section("license") = "GPL";
  1185  
  When loading this toy program, tc will create the program array and pin it
  to the BPF file system in the global namespace under ``jmp_map``. The BPF
  ELF loader in iproute2 will also recognize sections that are marked as
  ``__section_tail()``. The provided ``id`` in ``struct bpf_elf_map`` will be
  matched against the id marker in ``__section_tail()``, that is,
  ``BPF_JMP_MAP_ID``, and the program is therefore loaded at the user
  specified program array map index, which is ``0`` in this example. As a
  result, all provided tail call sections will be populated by the iproute2
  loader into the corresponding maps. This mechanism is not specific to tc,
  but can be applied with any other BPF program type that iproute2 supports
  (such as XDP, lwt).
  1196  
  The generated ELF object contains section headers describing the map id and
  the entry within that map:
  1199  
  1200    .. code-block:: shell-session
  1201  
  1202      $ llvm-objdump -S --no-show-raw-insn prog_array.o | less
  1203      prog_array.o:   file format ELF64-BPF
  1204  
  1205      Disassembly of section 1/0:
  1206      looper:
  1207             0:       r6 = r1
  1208             1:       r2 = *(u32 *)(r6 + 48)
  1209             2:       r1 = r2
  1210             3:       r1 += 1
  1211             4:       *(u32 *)(r6 + 48) = r1
  1212             5:       r1 = 0 ll
  1213             7:       call -1
  1214             8:       r1 = r6
  1215             9:       r2 = 0 ll
  1216            11:       r3 = 0
  1217            12:       call 12
  1218            13:       r0 = 0
  1219            14:       exit
  1220      Disassembly of section prog:
  1221      entry:
  1222             0:       r2 = 0
  1223             1:       *(u32 *)(r1 + 48) = r2
  1224             2:       r2 = 0 ll
  1225             4:       r3 = 0
  1226             5:       call 12
  1227             6:       r0 = 0
       7:       exit
  1229  
  1230    In this case, the ``section 1/0`` indicates that the ``looper()`` function
  1231    resides in the map id ``1`` at position ``0``.
  1232  
  The pinned map can be retrieved by user space applications (e.g. the Cilium daemon),
  but also by tc itself in order to update the map with new programs. Updates
  happen atomically; the initial entry programs that are triggered first from the
  various subsystems are also updated atomically.
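
  For illustration, a user space sketch of such an update via the bare
  ``bpf(2)`` system call could look as follows. Error handling is trimmed,
  and the pin path assumes the default mount point together with the
  ``jmp_map`` pin from the example above:

  .. code-block:: c

    #include <stdint.h>
    #include <string.h>
    #include <unistd.h>
    #include <linux/bpf.h>
    #include <sys/syscall.h>

    /* Thin wrapper, since glibc does not expose bpf(2) directly. */
    static int bpf(int cmd, union bpf_attr *attr, unsigned int size)
    {
        return syscall(__NR_bpf, cmd, attr, size);
    }

    int update_jmp_map(int new_prog_fd)
    {
        const char *path = "/sys/fs/bpf/tc/globals/jmp_map";
        union bpf_attr attr = {};
        uint32_t key = 0;
        int map_fd;

        /* Retrieve the program array which tc pinned to the BPF fs. */
        attr.pathname = (uint64_t)(unsigned long)path;
        map_fd = bpf(BPF_OBJ_GET, &attr, sizeof(attr));
        if (map_fd < 0)
            return -1;

        /* Atomically replace the tail call program at index 0. */
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = map_fd;
        attr.key    = (uint64_t)(unsigned long)&key;
        attr.value  = (uint64_t)(unsigned long)&new_prog_fd;
        attr.flags  = BPF_ANY;

        return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
    }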
  1237  
  1238    Example for tc to perform tail call map updates:
  1239  
  1240    .. code-block:: shell-session
  1241  
  1242      # tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec foo
  1243  
  When the pinned program array is to be updated through iproute2, the ``graft``
  command can be used. By pointing it to ``globals/jmp_map``, tc will update the
  map at index / key ``0`` with the new program residing in the object file
  ``new.o`` under section ``foo``.
  1248  
8. **Limited stack space of a maximum of 512 bytes.**

  Stack space in BPF programs is limited to only 512 bytes, which needs to be
  taken into careful consideration when implementing BPF programs in C. However,
  as mentioned earlier in point 3, a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map with a
  single entry can be used in order to enlarge scratch buffer space, as the
  sketch below shows.
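
  A minimal sketch of this pattern, reusing the iproute2 conventions shown in
  the tail call example above; the ``scratch`` struct, its size and the section
  names are purely illustrative:

  .. code-block:: c

    #include <stdint.h>
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <iproute2/bpf_elf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    #ifndef BPF_FUNC
    # define BPF_FUNC(NAME, ...)              \
       (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
    #endif

    static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

    /* 2048 bytes would overflow the 512 byte stack, hence map-backed. */
    struct scratch {
        uint8_t buf[2048];
    };

    struct bpf_elf_map scratch_map __section("maps") = {
        .type           = BPF_MAP_TYPE_PERCPU_ARRAY,
        .size_key       = sizeof(uint32_t),
        .size_value     = sizeof(struct scratch),
        .max_elem       = 1,
    };

    __section("prog")
    int use_scratch(struct __sk_buff *skb)
    {
        uint32_t key = 0;
        struct scratch *s = map_lookup_elem(&scratch_map, &key);

        if (!s)
            return TC_ACT_OK;
        /* s->buf now acts as enlarged, per-CPU scratch space. */
        s->buf[0] = 0;
        return TC_ACT_OK;
    }

    char __license[] __section("license") = "GPL";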
  1255  
9. **Use of BPF inline assembly is possible.**

  LLVM 6.0 or later allows use of inline assembly for BPF for the rare cases where it
  might be needed. The following (nonsensical) toy example shows a 64-bit atomic
  add. Due to lack of documentation, the LLVM source code in ``lib/Target/BPF/BPFInstrInfo.td``
  as well as ``test/CodeGen/BPF/`` might be helpful for providing additional
  examples. Test code:
  1263  
  1264    .. code-block:: c
  1265  
  1266      #include <linux/bpf.h>
  1267  
  1268      #ifndef __section
  1269      # define __section(NAME)                  \
  1270         __attribute__((section(NAME), used))
  1271      #endif
  1272  
  1273      __section("prog")
  1274      int xdp_test(struct xdp_md *ctx)
  1275      {
  1276          __u64 a = 2, b = 3, *c = &a;
  1277          /* just a toy xadd example to show the syntax */
  1278          asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c));
  1279          return a;
  1280      }
  1281  
  1282      char __license[] __section("license") = "GPL";
  1283  
  1284    The above program is compiled into the following sequence of BPF
  1285    instructions:
  1286  
  1287    ::
  1288  
  1289      Verifier analysis:
  1290  
  1291      0: (b7) r1 = 2
  1292      1: (7b) *(u64 *)(r10 -8) = r1
  1293      2: (b7) r1 = 3
  1294      3: (bf) r2 = r10
  1295      4: (07) r2 += -8
  1296      5: (db) lock *(u64 *)(r2 +0) += r1
  1297      6: (79) r0 = *(u64 *)(r10 -8)
  1298      7: (95) exit
  1299      processed 8 insns (limit 131072), stack depth 8
  1300  
10. **Remove struct padding by aligning members or using #pragma pack.**
  1302  
  Modern compilers align data structures by default in order to access memory
  efficiently. Structure members are laid out at aligned memory addresses, and
  padding is added as needed for proper alignment with the processor word size
  (e.g. 8 bytes for 64-bit processors, 4 bytes for 32-bit processors). Because
  of this, the size of a struct may often grow larger than expected.
  1308  
  1309    .. code-block:: c
  1310  
  1311      struct called_info {
  1312          u64 start;  // 8-byte
  1313          u64 end;    // 8-byte
  1314          u32 sector; // 4-byte
  1315      }; // size of 20-byte ?
  1316  
  1317      printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte
  1318  
  1319      // Actual compiled composition of struct called_info
  1320      // 0x0(0)                   0x8(8)
  1321      //  ↓________________________↓
  1322      //  |        start (8)       |
  1323      //  |________________________|
  1324      //  |         end  (8)       |
  1325      //  |________________________|
  1326      //  |  sector(4) |  PADDING  | <= address aligned to 8
  1327      //  |____________|___________|     with 4-byte PADDING.
  1328  
  The BPF verifier in the kernel checks that a BPF program neither accesses memory
  outside of the stack boundary nor reads an uninitialized stack area. Using a struct
  with padding as a map value will cause an ``invalid indirect read from stack``
  failure on ``bpf_prog_load()``.
  1333  
  1334    Example code:
  1335  
  1336    .. code-block:: c
  1337  
  1338      struct called_info {
  1339          u64 start;
  1340          u64 end;
  1341          u32 sector;
  1342      };
  1343  
  1344      struct bpf_map_def SEC("maps") called_info_map = {
  1345          .type = BPF_MAP_TYPE_HASH,
  1346          .key_size = sizeof(long),
  1347          .value_size = sizeof(struct called_info),
  1348          .max_entries = 4096,
  1349      };
  1350  
  1351      SEC("kprobe/submit_bio")
  1352      int submit_bio_entry(struct pt_regs *ctx)
  1353      {
  1354          char fmt[] = "submit_bio(bio=0x%lx) called: %llu\n";
  1355          u64 start_time = bpf_ktime_get_ns();
  1356          long bio_ptr = PT_REGS_PARM1(ctx);
  1357          struct called_info called_info = {
  1358                  .start = start_time,
  1359                  .end = 0,
  1360                  .sector = 0
  1361          };
  1362  
  1363          bpf_map_update_elem(&called_info_map, &bio_ptr, &called_info, BPF_ANY);
  1364          bpf_trace_printk(fmt, sizeof(fmt), bio_ptr, start_time);
  1365          return 0;
  1366      }
  1367  
  1368    Corresponding output on ``bpf_load_program()``::
  1369  
  1370      bpf_load_program() err=13
  1371      0: (bf) r6 = r1
  1372      ...
  1373      19: (b7) r1 = 0
  1374      20: (7b) *(u64 *)(r10 -72) = r1
  1375      21: (7b) *(u64 *)(r10 -80) = r7
  1376      22: (63) *(u32 *)(r10 -64) = r1
  1377      ...
  1378      30: (85) call bpf_map_update_elem#2
  1379      invalid indirect read from stack off -80+20 size 24
  1380  
  At ``bpf_prog_load()``, the eBPF verifier ``bpf_check()`` is called, and it
  checks the stack boundary by calling ``check_func_arg() -> check_stack_boundary()``.
  As the error above shows, ``struct called_info`` is compiled to a 24-byte size,
  and the message says that reading data from offset +20 is an invalid indirect read.
  As discussed earlier, the address 0x14(20) is the place where the PADDING resides.
  1386  
  1387    .. code-block:: c
  1388  
  1389      // Actual compiled composition of struct called_info
  1390      // 0x10(16)    0x14(20)    0x18(24)
  1391      //  ↓____________↓___________↓
  1392      //  |  sector(4) |  PADDING  | <= address aligned to 8
  1393      //  |____________|___________|     with 4-byte PADDING.
  1394  
  ``check_stack_boundary()`` internally loops through every one of the ``access_size`` (24)
  bytes from the start pointer to make sure that they are within the stack boundary and
  that all elements of the stack are initialized. Since the padding isn't supposed to be
  used, it triggers the 'invalid indirect read from stack' failure. To avoid this kind
  of failure, removing the padding from the struct is necessary.
  1400  
  1401    Removing the padding by using ``#pragma pack(n)`` directive:
  1402  
  1403    .. code-block:: c
  1404  
  1405      #pragma pack(4)
  1406      struct called_info {
  1407          u64 start;  // 8-byte
  1408          u64 end;    // 8-byte
  1409          u32 sector; // 4-byte
  1410      }; // size of 20-byte ?
  1411  
  1412      printf("size of %d-byte\n", sizeof(struct called_info)); // size of 20-byte
  1413  
  1414      // Actual compiled composition of packed struct called_info
  1415      // 0x0(0)                   0x8(8)
  1416      //  ↓________________________↓
  1417      //  |        start (8)       |
  1418      //  |________________________|
  1419      //  |         end  (8)       |
  1420      //  |________________________|
  1421      //  |  sector(4) |             <= address aligned to 4
  1422      //  |____________|                 with no PADDING.
  1423  
  By placing ``#pragma pack(4)`` before ``struct called_info``, the compiler will align
  members of the struct to the lesser of 4 bytes and their natural alignment. As you can
  see, the size of ``struct called_info`` has shrunk to 20 bytes and the padding
  no longer exists.
  1428  
  But removing the padding has downsides too. For example, the compiler will generate
  less optimized code. Since the padding has been removed, processors will perform
  unaligned accesses to the structure, which might lead to performance degradation.
  Also, unaligned access might get rejected by the verifier on some architectures.

  However, there is a way to avoid the downsides of a packed structure: simply adding
  an explicit ``u32 pad`` member at the end resolves the same problem without
  packing the structure.
  1437  
  1438    .. code-block:: c
  1439  
  1440      struct called_info {
  1441          u64 start;  // 8-byte
  1442          u64 end;    // 8-byte
  1443          u32 sector; // 4-byte
  1444          u32 pad;    // 4-byte
  1445      }; // size of 24-byte ?
  1446  
  1447      printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte
  1448  
  1449      // Actual compiled composition of struct called_info with explicit padding
  1450      // 0x0(0)                   0x8(8)
  1451      //  ↓________________________↓
  1452      //  |        start (8)       |
  1453      //  |________________________|
  1454      //  |         end  (8)       |
  1455      //  |________________________|
  1456      //  |  sector(4) |  pad (4)  | <= address aligned to 8
  1457      //  |____________|___________|     with explicit PADDING.
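
  To guard against padding sneaking back in when the struct is later modified,
  a compile-time size check can be placed next to the definition. A small
  sketch of this idea; ``_Static_assert`` is C11 and supported by clang, and
  the assertion message is illustrative:

  .. code-block:: c

    struct called_info {
        u64 start;  // 8-byte
        u64 end;    // 8-byte
        u32 sector; // 4-byte
        u32 pad;    // 4-byte
    };

    /* Fails the build if the layout ever reintroduces implicit padding. */
    _Static_assert(sizeof(struct called_info) ==
                   2 * sizeof(u64) + 2 * sizeof(u32),
                   "struct called_info contains implicit padding");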
  1458  
11. **Accessing packet data via invalidated references.**

  Some networking BPF helper functions such as ``bpf_skb_store_bytes`` might
  change the size of the packet data. As the verifier is not able to track such
  changes, any a priori reference to the data will be invalidated by the verifier.
  Therefore, the reference needs to be updated before accessing the data to
  avoid the verifier rejecting the program.
  1466  
  1467    To illustrate this, consider the following snippet:
  1468  
  1469    .. code-block:: c
  1470  
    struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);
  1472  
  1473      skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);
  1474  
  1475      if (ip4->protocol == IPPROTO_TCP) {
  1476          // do something
  1477      }
  1478  
  The verifier will reject the snippet due to the dereference of the invalidated
  ``ip4->protocol``:
  1481  
  1482    ::
  1483  
  1484        R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=34,r=34,imm=0) R3=inv0
  1485        R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
  1486        R8=inv4294967162 R9=pkt(id=0,off=0,r=34,imm=0) R10=fp0,call_-1
  1487        ...
  1488        18: (85) call bpf_skb_store_bytes#9
  1489        19: (7b) *(u64 *)(r10 -56) = r7
  1490        R0=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=2,var_off=(0x0; 0x3))
  1491        R8=inv4294967162 R9=inv(id=0) R10=fp0,call_-1 fp-48=mmmm???? fp-56=mmmmmmmm
  1492        21: (61) r1 = *(u32 *)(r9 +23)
  1493        R9 invalid mem access 'inv'
  1494  
  1495    To fix this, the reference to ``ip4`` has to be updated:
  1496  
  1497    .. code-block:: c
  1498  
    struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);
  1500  
  1501      skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);
  1502  
    ip4 = (struct iphdr *)(skb->data + ETH_HLEN);
  1504  
  1505      if (ip4->protocol == IPPROTO_TCP) {
  1506          // do something
  1507      }
  1508  
  1509  iproute2
  1510  --------
  1511  
  1512  There are various front ends for loading BPF programs into the kernel such as bcc,
  1513  perf, iproute2 and others. The Linux kernel source tree also provides a user space
  1514  library under ``tools/lib/bpf/``, which is mainly used and driven by perf for
  1515  loading BPF tracing programs into the kernel. However, the library itself is
generic and not limited to perf only. bcc is a toolkit that provides many useful
BPF programs, mainly for tracing, which are loaded ad hoc through a Python interface
embedding the BPF C code. In general, though, syntax and semantics for implementing
BPF programs differ slightly among front ends. Additionally, there are also
  1520  BPF samples in the kernel source tree (``samples/bpf/``) which parse the generated
  1521  object files and load the code directly through the system call interface.
  1522  
This and previous sections mainly focus on the iproute2 suite's BPF front end for
loading networking programs of XDP, tc or lwt type, since Cilium's programs are
implemented against this BPF loader. In the future, Cilium will be equipped with a
native BPF loader, but programs will remain compatible with the iproute2 suite
loader in order to facilitate development and debugging.
  1528  
  1529  All BPF program types supported by iproute2 share the same BPF loader logic
  1530  due to having a common loader back end implemented as a library (``lib/bpf.c``
  1531  in iproute2 source tree).
  1532  
  1533  The previous section on LLVM also covered some iproute2 parts related to writing
  1534  BPF C programs, and later sections in this document are related to tc and XDP
  1535  specific aspects when writing programs. Therefore, this section will rather focus
  1536  on usage examples for loading object files with iproute2 as well as some of the
  1537  generic mechanics of the loader. It does not try to provide a complete coverage
  1538  of all details, but enough for getting started.
  1539  
  1540  **1. Loading of XDP BPF object files.**
  1541  
  Given a BPF object file ``prog.o`` has been compiled for XDP, it can be loaded
  through ``ip`` to an XDP-supported netdevice called ``em1`` with the following
  command:
  1545  
  1546    .. code-block:: shell-session
  1547  
  1548      # ip link set dev em1 xdp obj prog.o
  1549  
  The above command assumes that the program code resides in the default section
  which is called ``prog`` in the XDP case. Should this not be the case, and the
  section be named differently, for example, ``foobar``, then the program needs
  to be loaded as:
  1554  
  1555    .. code-block:: shell-session
  1556  
  1557      # ip link set dev em1 xdp obj prog.o sec foobar
  1558  
  1559    Note that it is also possible to load the program out of the ``.text`` section.
  1560    Changing the minimal, stand-alone XDP drop program by removing the ``__section()``
  1561    annotation from the ``xdp_drop`` entry point would look like the following:
  1562  
  1563    .. code-block:: c
  1564  
  1565      #include <linux/bpf.h>
  1566  
  1567      #ifndef __section
  1568      # define __section(NAME)                  \
  1569         __attribute__((section(NAME), used))
  1570      #endif
  1571  
  1572      int xdp_drop(struct xdp_md *ctx)
  1573      {
  1574          return XDP_DROP;
  1575      }
  1576  
  1577      char __license[] __section("license") = "GPL";
  1578  
  It can then be loaded as follows:
  1580  
  1581    .. code-block:: shell-session
  1582  
  1583      # ip link set dev em1 xdp obj prog.o sec .text
  1584  
  By default, ``ip`` will throw an error in case an XDP program is already attached
  to the networking interface, to prevent it from being overridden by accident. In
  order to replace the currently running XDP program with a new one, the ``-force``
  option must be used:
  1589  
  1590    .. code-block:: shell-session
  1591  
  1592      # ip -force link set dev em1 xdp obj prog.o
  1593  
  Most XDP-enabled drivers today support an atomic replacement of the existing
  program with a new one without traffic interruption. There is always only a
  single program attached to an XDP-enabled driver for performance reasons,
  hence a chain of programs is not supported. However, as described in the
  previous section, partitioning of programs can be performed through tail
  calls to achieve a similar use case when necessary.
  1600  
  The ``ip link`` command will display an ``xdp`` flag if the interface has an XDP
  program attached. ``ip link | grep xdp`` can thus be used to find all interfaces
  that have XDP running. Further introspection facilities are provided through
  the detailed view with ``ip -d link``, and ``bpftool`` can be used to retrieve
  information about the attached program based on the BPF program ID shown in
  the ``ip link`` dump.
  1607  
  1608    In order to remove the existing XDP program from the interface, the following
  1609    command must be issued:
  1610  
  1611    .. code-block:: shell-session
  1612  
  1613      # ip link set dev em1 xdp off
  1614  
  In the case of switching a driver's operation mode from non-XDP to native XDP
  and vice versa, typically the driver needs to reconfigure its receive (and
  transmit) rings in order to ensure received packets are set up linearly
  within a single page for BPF to read and write into. However, once that is
  completed, most drivers only need to perform an atomic replacement of the
  program itself when a BPF program is requested to be swapped.
  1621  
  1622    In total, XDP supports three operation modes which iproute2 implements as well:
  1623    ``xdpdrv``, ``xdpoffload`` and ``xdpgeneric``.
  1624  
  ``xdpdrv`` stands for native XDP, meaning the BPF program is run directly in
  the driver's receive path at the earliest possible point in software. This is
  the normal / conventional XDP mode and requires drivers to implement XDP
  support, which all major 10G/40G/+ networking drivers in the upstream Linux
  kernel already provide.
  1630  
  ``xdpgeneric`` stands for generic XDP and is intended as an experimental test
  bed for drivers which do not yet support native XDP. Given the generic XDP hook
  in the ingress path comes at a much later point in time, when the packet has already
  entered the stack's main receive path as an ``skb``, the performance is significantly
  lower than with processing in ``xdpdrv`` mode. ``xdpgeneric`` is therefore for
  the most part only interesting for experimenting, less so for production environments.
  1637  
  Last but not least, the ``xdpoffload`` mode is implemented by SmartNICs such
  as those supported by Netronome's nfp driver and allows for offloading the entire
  BPF/XDP program into hardware, so that the program is run on each packet reception
  directly on the card. This provides even higher performance than running in
  native XDP, although not all BPF map types or BPF helper functions are available
  for use compared to native XDP. The BPF verifier will reject the program in
  such a case and report to the user what is unsupported. Other than staying in
  the realm of supported BPF features and helper functions, no special precautions
  have to be taken when writing BPF C programs.
  1647  
  When a command like ``ip link set dev em1 xdp obj [...]`` is used, the
  kernel will attempt to load the program first as native XDP, and in case the
  driver does not support native XDP, it will automatically fall back to generic
  XDP. Thus, for example, by explicitly using ``xdpdrv`` instead of ``xdp``, the
  kernel will only attempt to load the program as native XDP and fail in case
  the driver does not support it, which provides a guarantee that generic XDP
  is avoided altogether.
  1655  
  1656    Example for enforcing a BPF/XDP program to be loaded in native XDP mode,
  1657    dumping the link details and unloading the program again:
  1658  
  1659    .. code-block:: shell-session
  1660  
  1661       # ip -force link set dev em1 xdpdrv obj prog.o
  1662       # ip link show
  1663       [...]
  1664       6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000
  1665           link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  1666           prog/xdp id 1 tag 57cd311f2e27366b
  1667       [...]
  1668       # ip link set dev em1 xdpdrv off
  1669  
  1670    Same example now for forcing generic XDP, even if the driver would support
  1671    native XDP, and additionally dumping the BPF instructions of the attached
  1672    dummy program through bpftool:
  1673  
  1674    .. code-block:: shell-session
  1675  
  1676      # ip -force link set dev em1 xdpgeneric obj prog.o
  1677      # ip link show
  1678      [...]
  1679      6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000
  1680          link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  1681          prog/xdp id 4 tag 57cd311f2e27366b                <-- BPF program ID 4
  1682      [...]
  1683      # bpftool prog dump xlated id 4                       <-- Dump of instructions running on em1
  1684      0: (b7) r0 = 1
  1685      1: (95) exit
  1686      # ip link set dev em1 xdpgeneric off
  1687  
  And last but not least, offloaded XDP, where we additionally dump program
  information via bpftool for retrieving general metadata:
  1690  
  1691    .. code-block:: shell-session
  1692  
  1693       # ip -force link set dev em1 xdpoffload obj prog.o
  1694       # ip link show
  1695       [...]
  1696       6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
  1697           link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  1698           prog/xdp id 8 tag 57cd311f2e27366b
  1699       [...]
  1700       # bpftool prog show id 8
  1701       8: xdp  tag 57cd311f2e27366b dev em1                  <-- Also indicates a BPF program offloaded to em1
  1702           loaded_at Apr 11/20:38  uid 0
  1703           xlated 16B  not jited  memlock 4096B
  1704       # ip link set dev em1 xdpoffload off
  1705  
  Note that it is not possible to use ``xdpdrv`` and ``xdpgeneric`` or other
  modes at the same time, meaning only one of the XDP operation modes can be
  picked.

  A switch between different XDP modes, e.g. from generic to native or vice
  versa, is not atomically possible. Only switching programs within a specific
  operation mode is:
  1713  
  1714    .. code-block:: shell-session
  1715  
  1716       # ip -force link set dev em1 xdpgeneric obj prog.o
  1717       # ip -force link set dev em1 xdpoffload obj prog.o
  1718       RTNETLINK answers: File exists
  1719       # ip -force link set dev em1 xdpdrv obj prog.o
  1720       RTNETLINK answers: File exists
  1721       # ip -force link set dev em1 xdpgeneric obj prog.o    <-- Succeeds due to xdpgeneric
  1722       #
  1723  
  Switching between modes requires first leaving the current operation mode
  and then entering the new one:
  1726  
  1727    .. code-block:: shell-session
  1728  
  1729       # ip -force link set dev em1 xdpgeneric obj prog.o
  1730       # ip -force link set dev em1 xdpgeneric off
  1731       # ip -force link set dev em1 xdpoffload obj prog.o
  1732       # ip l
  1733       [...]
  1734       6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
  1735           link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
  1736           prog/xdp id 17 tag 57cd311f2e27366b
  1737       [...]
  1738       # ip -force link set dev em1 xdpoffload off
  1739  
  1740  **2. Loading of tc BPF object files.**
  1741  
  Given a BPF object file ``prog.o`` has been compiled for tc, it can be loaded
  through the tc command to a netdevice. Unlike XDP, there is no driver dependency
  for attaching BPF programs to the device. Here, the netdevice is called
  ``em1``, and with the following command the program can be attached to the networking
  ``ingress`` path of ``em1``:
  1747  
  1748    .. code-block:: shell-session
  1749  
  1750      # tc qdisc add dev em1 clsact
  1751      # tc filter add dev em1 ingress bpf da obj prog.o
  1752  
  The first step is to set up a ``clsact`` qdisc (Linux queueing discipline). ``clsact``
  is a dummy qdisc similar to the ``ingress`` qdisc, which can only hold classifiers
  and actions, but does not perform actual queueing. It is needed in order to attach
  the ``bpf`` classifier. The ``clsact`` qdisc provides two special hooks called
  ``ingress`` and ``egress``, to which the classifier can be attached. Both the ``ingress``
  and ``egress`` hooks are located in central receive and transmit locations in the
  networking data path, which every packet on the device passes through. The ``ingress``
  hook is called from ``__netif_receive_skb_core() -> sch_handle_ingress()`` in the
  kernel and the ``egress`` hook from ``__dev_queue_xmit() -> sch_handle_egress()``.
  1762  
  1763    The equivalent for attaching the program to the ``egress`` hook looks as follows:
  1764  
  1765    .. code-block:: shell-session
  1766  
  1767      # tc filter add dev em1 egress bpf da obj prog.o
  1768  
  1769    The ``clsact`` qdisc is processed lockless from ``ingress`` and ``egress``
  1770    direction and can also be attached to virtual, queue-less devices such as
  1771    ``veth`` devices connecting containers.
  1772  
  Next to the hook, the ``tc filter`` command selects ``bpf`` to be used in ``da``
  (direct-action) mode. ``da`` mode is recommended and should always be specified.
  It basically means that the ``bpf`` classifier does not need to call into external
  tc action modules, which are not necessary for ``bpf`` anyway, since all packet
  mangling, forwarding or other kinds of actions can already be performed inside
  the single BPF program which is to be attached, making it significantly
  faster.
  1780  
  1781    At this point, the program has been attached and is executed once packets traverse
  1782    the device. Like in XDP, should the default section name not be used, then it
  1783    can be specified during load, for example, in case of section ``foobar``:
  1784  
  1785    .. code-block:: shell-session
  1786  
  1787      # tc filter add dev em1 egress bpf da obj prog.o sec foobar
  1788  
  iproute2's BPF loader allows for using the same command line syntax across
  program types, hence ``obj prog.o sec foobar`` is the same syntax as for
  XDP mentioned earlier.
  1792  
  1793    The attached programs can be listed through the following commands:
  1794  
  1795    .. code-block:: shell-session
  1796  
  1797      # tc filter show dev em1 ingress
  1798      filter protocol all pref 49152 bpf
  1799      filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f
  1800  
  1801      # tc filter show dev em1 egress
  1802      filter protocol all pref 49152 bpf
  1803      filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714
  1804  
  The output of ``prog.o:[ingress]`` tells us that the program section ``ingress`` was
  loaded from the file ``prog.o``, and that ``bpf`` operates in ``direct-action`` mode.
  The program ``id`` and ``tag`` are appended in each case, where the latter denotes
  a hash over the instruction stream which can be correlated with the object file
  or ``perf`` reports with stack traces, etc. Last but not least, the ``id``
  represents the system-wide unique BPF program identifier that can be used along
  with ``bpftool`` to further inspect or dump the attached BPF program.
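
  For reference, a tc BPF object producing such an output could look like the
  following minimal sketch with one program per section name; this follows the
  iproute2 conventions covered in the LLVM section, and the pass-through
  program bodies are purely illustrative:

  .. code-block:: c

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    __section("ingress")
    int tc_ingress(struct __sk_buff *skb)
    {
        /* Packet continues up the stack unmodified. */
        return TC_ACT_OK;
    }

    __section("egress")
    int tc_egress(struct __sk_buff *skb)
    {
        /* Packet continues out of the device unmodified. */
        return TC_ACT_OK;
    }

    char __license[] __section("license") = "GPL";

  The two programs would then be attached to their respective hooks with
  ``sec ingress`` and ``sec egress``.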
  1812  
  tc can attach more than just a single BPF program; it also provides various other
  classifiers which can be chained together. However, attaching a single BPF program
  1815    is fully sufficient since all packet operations can be contained in the program
  1816    itself thanks to ``da`` (``direct-action``) mode, meaning the BPF program itself
  1817    will already return the tc action verdict such as ``TC_ACT_OK``, ``TC_ACT_SHOT``
  1818    and others. For optimal performance and flexibility, this is the recommended usage.
  1819  
  1820    In the above ``show`` command, tc also displays ``pref 49152`` and
  1821    ``handle 0x1`` next to the BPF related output. Both are auto-generated in
  1822    case they are not explicitly provided through the command line. ``pref``
  1823    denotes a priority number, which means that in case multiple classifiers are
  1824    attached, they will be executed based on ascending priority, and ``handle``
  1825    represents an identifier in case multiple instances of the same classifier have
  been loaded under the same ``pref``. Since, in the case of BPF, a single program is
  fully sufficient, ``pref`` and ``handle`` can typically be ignored.
  1828  
  Only in the case where it is planned to atomically replace the attached BPF
  programs is it recommended to explicitly specify ``pref`` and ``handle``
  a priori on initial load, so that they do not have to be queried at a later
  point in time for the ``replace`` operation. Thus, creation becomes:
  1833  
  1834    .. code-block:: shell-session
  1835  
  1836      # tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar
  1837  
  1838      # tc filter show dev em1 ingress
  1839      filter protocol all pref 1 bpf
  1840      filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396f
  1841  
  And for the atomic replacement, the following can be issued for updating the
  existing program at the ``ingress`` hook with the new BPF program from the file
  ``prog.o`` in section ``foobar``:
  1845  
  1846    .. code-block:: shell-session
  1847  
  1848      # tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar
  1849  
  Last but not least, in order to remove all attached programs from the ``ingress``
  and ``egress`` hooks respectively, the following can be used:
  1852  
  1853    .. code-block:: shell-session
  1854  
  1855      # tc filter del dev em1 ingress
  1856      # tc filter del dev em1 egress
  1857  
  1858    For removing the entire ``clsact`` qdisc from the netdevice, which implicitly also
  1859    removes all attached programs from the ``ingress`` and ``egress`` hooks, the
  1860    below command is provided:
  1861  
  1862    .. code-block:: shell-session
  1863  
  1864      # tc qdisc del dev em1 clsact
  1865  
  tc BPF programs can also be offloaded if the NIC and driver support it,
  similarly to XDP BPF programs. NICs supported by Netronome's nfp driver offer
  both types of BPF offload.
  1869  
  1870    .. code-block:: shell-session
  1871  
  1872      # tc qdisc add dev em1 clsact
  1873      # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
  1874      Error: TC offload is disabled on net device.
  1875      We have an error talking to the kernel
  1876  
  1877    If the above error is shown, then tc hardware offload first needs to be enabled
  1878    for the device through ethtool's ``hw-tc-offload`` setting:
  1879  
  1880    .. code-block:: shell-session
  1881  
  1882      # ethtool -K em1 hw-tc-offload on
  1883      # tc qdisc add dev em1 clsact
  1884      # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
  1885      # tc filter show dev em1 ingress
  1886      filter protocol all pref 1 bpf
  1887      filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366b
  1888  
  1889    The ``in_hw`` flag confirms that the program has been offloaded to the NIC.
  1890  
  Note that BPF offloads for both tc and XDP cannot be loaded at the same time;
  either the tc or the XDP offload option must be selected.
  1893  
  1894  **3. Testing BPF offload interface via netdevsim driver.**
  1895  
  The netdevsim driver, which is part of the Linux kernel, is a dummy driver
  which implements offload interfaces for XDP BPF and tc BPF programs and
  facilitates testing kernel changes or low-level user space programs
  implementing a control plane directly against the kernel's UAPI.
  1900  
  1901    A netdevsim device can be created as follows:
  1902  
  1903    .. code-block:: shell-session
  1904  
  1905      # modprobe netdevsim
  1906      // [ID] [PORT_COUNT]
  1907      # echo "1 1" > /sys/bus/netdevsim/new_device
  1908      # devlink dev
  1909      netdevsim/netdevsim1
  1910      # devlink port
  1911      netdevsim/netdevsim1/0: type eth netdev eth0 flavour physical
  1912      # ip l
  1913      [...]
  1914      4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
  1915          link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff
  1916  
  After that step, XDP BPF or tc BPF programs can be test-loaded as shown
  in the various examples earlier:
  1919  
  1920    .. code-block:: shell-session
  1921  
  1922      # ip -force link set dev eth0 xdpoffload obj prog.o
  1923      # ip l
  1924      [...]
  1925      4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
  1926          link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff
  1927          prog/xdp id 16 tag a04f5eef06a7f555
  1928  
These workflows cover the basic operations for loading XDP BPF and tc BPF
programs with iproute2.

There are various other advanced options for the BPF loader that apply to both XDP
and tc; some of them are listed here. In the examples only XDP is presented for
simplicity.
  1935  
  1936  **1. Verbose log output even on success.**
  1937  
  1938    The option ``verb`` can be appended for loading programs in order to dump the
  1939    verifier log, even if no error occurred:
  1940  
  1941    .. code-block:: shell-session
  1942  
  1943      # ip link set dev em1 xdp obj xdp-example.o verb
  1944  
  1945      Prog section 'prog' loaded (5)!
  1946       - Type:         6
  1947       - Instructions: 2 (0 over limit)
  1948       - License:      GPL
  1949  
  1950      Verifier analysis:
  1951  
  1952      0: (b7) r0 = 1
  1953      1: (95) exit
  1954      processed 2 insns
  1955  
**2. Load a program that is already pinned in the BPF file system.**
  1957  
  Instead of loading a program from an object file, iproute2 can also retrieve
  a program which some external entity pinned to the BPF file system, and
  attach it to the device:
  1961  
  1962    .. code-block:: shell-session
  1963  
  1964      # ip link set dev em1 xdp pinned /sys/fs/bpf/prog
  1965  
  1966    iproute2 can also use the short form that is relative to the detected mount
  1967    point of the BPF file system:
  1968  
  1969    .. code-block:: shell-session
  1970  
  1971      # ip link set dev em1 xdp pinned m:prog
  1972  
When loading BPF programs, iproute2 will automatically detect the mounted
file system instance in order to perform pinning of nodes. In case no mounted
BPF file system instance is found, then tc will automatically mount it
to the default location under ``/sys/fs/bpf/``.

In case an instance has already been found, it will be used and no additional
mount will be performed:
  1980  
  1981  .. code-block:: shell-session
  1982  
  1983      # mkdir /var/run/bpf
  1984      # mount --bind /var/run/bpf /var/run/bpf
  1985      # mount -t bpf bpf /var/run/bpf
  1986      # tc filter add dev em1 ingress bpf da obj tc-example.o sec prog
  1987      # tree /var/run/bpf
  1988      /var/run/bpf
  1989      +-- ip -> /run/bpf/tc/
  1990      +-- tc
  1991      |   +-- globals
  1992      |       +-- jmp_map
  1993      +-- xdp -> /run/bpf/tc/
  1994  
  1995      4 directories, 1 file
  1996  
By default tc will create an initial directory structure as shown above,
where all subsystem users will point to the same location through symbolic
links for the ``globals`` namespace, so that pinned BPF maps can be reused
among various BPF program types in iproute2. In case the file system instance
has already been mounted and a structure already exists, then tc will
not override it. This could be the case when separating ``lwt``, ``tc`` and
``xdp`` maps in order to not share ``globals`` among all of them.
  2004  
  2005  As briefly covered in the previous LLVM section, iproute2 will install a
  2006  header file upon installation which can be included through the standard
  2007  include path by BPF programs:
  2008  
  2009  .. code-block:: c
  2010  
  2011      #include <iproute2/bpf_elf.h>
  2012  
  2013  The purpose of this header file is to provide an API for maps and default section
  2014  names used by programs. It's a stable contract between iproute2 and BPF programs.
  2015  
  2016  The map definition for iproute2 is ``struct bpf_elf_map``. Its members have
  2017  been covered earlier in the LLVM section of this document.
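
For quick reference, the layout shipped in recent iproute2 releases looks as
follows; the installed ``iproute2/bpf_elf.h`` header remains the authoritative
definition:

.. code-block:: c

    struct bpf_elf_map {
        __u32 type;        /* BPF_MAP_TYPE_* */
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;          /* id for tail call sections */
        __u32 pinning;     /* e.g. PIN_GLOBAL_NS */
        __u32 inner_id;    /* for map-in-map types */
        __u32 inner_idx;
    };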
  2018  
When parsing the BPF object file, the iproute2 loader will walk through
all ELF sections. It initially fetches ancillary sections like ``maps`` and
``license``. For ``maps``, the ``struct bpf_elf_map`` array will be checked
for validity and, whenever needed, compatibility workarounds are performed.
Subsequently all maps are created with the user provided information, either
retrieved as a pinned object, or newly created and then pinned into the BPF
file system. Next, the loader will handle all program sections that contain
ELF relocation entries for maps, meaning that BPF instructions loading
map file descriptors into registers are rewritten so that the corresponding
map file descriptors are encoded into the instruction's immediate value, in
order for the kernel to be able to later convert them into map kernel
pointers. After that, all the programs themselves are created through the BPF
system call, and tail call maps, if present, are updated with the programs' file
descriptors.
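
To illustrate the relocation step, the rewritten instruction corresponds to the
``BPF_LD_MAP_FD()`` macro pattern known from the kernel tree: a 64-bit immediate
load whose source register field is set to ``BPF_PSEUDO_MAP_FD`` and which
carries the map file descriptor as immediate value. A sketch of that encoding:

.. code-block:: c

    /* A 64-bit immediate load occupies two BPF instructions; the second
     * one carries the upper 32 bits of the immediate value. */
    #define BPF_LD_IMM64_RAW(DST, SRC, IMM)                 \
        ((struct bpf_insn) {                                \
            .code    = BPF_LD | BPF_DW | BPF_IMM,           \
            .dst_reg = DST,                                 \
            .src_reg = SRC,                                 \
            .off     = 0,                                   \
            .imm     = (__u32) (IMM) }),                    \
        ((struct bpf_insn) {                                \
            .code    = 0, /* zero is reserved opcode */     \
            .dst_reg = 0,                                   \
            .src_reg = 0,                                   \
            .off     = 0,                                   \
            .imm     = ((__u64) (IMM)) >> 32 })

    /* BPF_PSEUDO_MAP_FD marks the immediate as a map file descriptor,
     * which the kernel replaces with the actual map pointer on load. */
    #define BPF_LD_MAP_FD(DST, MAP_FD)                      \
        BPF_LD_IMM64_RAW(DST, BPF_PSEUDO_MAP_FD, MAP_FD)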