.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bpf_dev:

Development Tools
=================

Current user space tooling, introspection facilities and kernel control knobs around
BPF are discussed in this section.

.. note:: The tooling and infrastructure around BPF is still rapidly evolving and
          thus may not provide a complete picture of all available tools.

Development Environment
-----------------------

A step by step guide for setting up a development environment for BPF can be found
below for both Fedora and Ubuntu. This will guide you through building, installing
and testing a development kernel as well as building and installing iproute2.

The step of manually building iproute2 and the Linux kernel is usually not necessary
given that major distributions already ship recent enough kernels by default, but
would be needed for testing bleeding edge versions or contributing BPF patches to
iproute2 and to the Linux kernel, respectively. Similarly, for debugging and
introspection purposes, building bpftool is optional, but recommended.

.. tabs::

    .. group-tab:: Fedora

        The following applies to Fedora 25 or later:

        .. code-block:: shell-session

            $ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \
              openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static

        .. note:: If you are running some other Fedora derivative and ``dnf`` is missing,
                  try using ``yum`` instead.

    .. group-tab:: Ubuntu

        The following applies to Ubuntu 17.04 or later:

        .. code-block:: shell-session

            $ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \
              clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \
              graphviz

    .. group-tab:: openSUSE Tumbleweed

        The following applies to openSUSE Tumbleweed and openSUSE Leap 15.0 or later:

        .. code-block:: shell-session

            $ sudo zypper install -y git gcc ncurses-devel libelf-devel bc libopenssl-devel \
              libcap-devel clang llvm graphviz bison flex glibc-devel-static

Compiling the Kernel
````````````````````

Development of new BPF features for the Linux kernel happens inside the ``net-next``
git tree, while the latest BPF fixes land in the ``net`` tree. The following command
will obtain the kernel source for the ``net-next`` tree through git:

.. code-block:: shell-session

    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next.git

If the git commit history is not of interest, then ``--depth 1`` will clone the
tree much faster by truncating the git history to the most recent commit only.

In case the ``net`` tree is of interest, it can be cloned from this url:

.. code-block:: shell-session

    $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git

There are dozens of tutorials on the Internet on how to build Linux kernels; one
good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild)
that can be followed with one of the two git trees mentioned above.

Make sure that the generated ``.config`` file contains the following ``CONFIG_*``
entries for running BPF. These entries are also needed for Cilium.
::

    CONFIG_CGROUP_BPF=y
    CONFIG_BPF=y
    CONFIG_BPF_SYSCALL=y
    CONFIG_NET_SCH_INGRESS=m
    CONFIG_NET_CLS_BPF=m
    CONFIG_NET_CLS_ACT=y
    CONFIG_BPF_JIT=y
    CONFIG_LWTUNNEL_BPF=y
    CONFIG_HAVE_EBPF_JIT=y
    CONFIG_BPF_EVENTS=y
    CONFIG_TEST_BPF=m

Some of the entries cannot be adjusted through ``make menuconfig``. For example,
``CONFIG_HAVE_EBPF_JIT`` is selected automatically if a given architecture comes
with an eBPF JIT. In this specific case, ``CONFIG_HAVE_EBPF_JIT`` is optional
but highly recommended. An architecture not having an eBPF JIT compiler will need
to fall back to the in-kernel interpreter with the cost of being less efficient
at executing BPF instructions.

Verifying the Setup
```````````````````

After you have booted into the newly compiled kernel, navigate to the BPF selftest
suite in order to test BPF functionality (current working directory points to
the root of the cloned git tree):

.. code-block:: shell-session

    $ cd tools/testing/selftests/bpf/
    $ make
    $ sudo ./test_verifier

The verifier tests print out all the current checks being performed. The summary
at the end of running all tests will dump information of test successes and
failures:

::

    Summary: 847 PASSED, 0 SKIPPED, 0 FAILED

.. note:: For kernel releases 4.16+ the BPF selftest has a dependency on LLVM 6.0+
          caused by the BPF function calls which do not need to be inlined
          anymore. See section :ref:`bpf_to_bpf_calls` or the cover letter mail
          from the kernel patch (https://lwn.net/Articles/741773/) for more information.
          Not every BPF program has a dependency on LLVM 6.0+ if it does not
          use this new feature. If your distribution does not provide LLVM 6.0+
          you may compile it by following the instruction in the :ref:`tooling_llvm`
          section.
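Before running the remaining selftests, it can be handy to double-check that the
booted kernel actually carries the ``CONFIG_*`` entries listed earlier. Below is a
minimal sketch; the helper name and the config file path are illustrative and may
need adjusting for your distribution:

```shell
# check_bpf_config FILE: print any of the required CONFIG_* options
# (subset shown here) that are neither builtin (=y) nor a module (=m).
check_bpf_config() {
    local file="$1" opt
    for opt in CONFIG_BPF CONFIG_BPF_SYSCALL CONFIG_NET_CLS_BPF \
               CONFIG_NET_CLS_ACT CONFIG_BPF_JIT; do
        grep -qE "^${opt}=[ym]" "$file" || echo "missing: ${opt}"
    done
}

# Typical usage on a running system (path varies by distribution):
#   check_bpf_config "/boot/config-$(uname -r)"
```

On kernels built with ``CONFIG_IKCONFIG_PROC``, ``/proc/config.gz`` can serve as
the input instead after decompressing it.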
In order to run through all BPF selftests, the following command is needed:

.. code-block:: shell-session

    $ sudo make run_tests

If you see any failures, please contact us on `Cilium Slack`_ with the full
test output.

Compiling iproute2
``````````````````

Similar to the ``net`` (fixes only) and ``net-next`` (new features) kernel trees,
the iproute2 git tree has two branches, namely ``master`` and ``net-next``. The
``master`` branch is based on the ``net`` tree and the ``net-next`` branch is
based against the ``net-next`` kernel tree. This is necessary, so that changes
in header files can be synchronized in the iproute2 tree.

In order to clone the iproute2 ``master`` branch, the following command can
be used:

.. code-block:: shell-session

    $ git clone https://git.kernel.org/pub/scm/network/iproute2/iproute2.git

Similarly, to clone into the mentioned ``net-next`` branch of iproute2, run the
following:

.. code-block:: shell-session

    $ git clone -b net-next https://git.kernel.org/pub/scm/network/iproute2/iproute2.git

After that, proceed with the build and installation:

.. code-block:: shell-session

    $ cd iproute2/
    $ ./configure --prefix=/usr
    TC schedulers
     ATM            no

    libc has setns: yes
    SELinux support: yes
    ELF support: yes
    libmnl support: no
    Berkeley DB: no

    docs: latex: no
     WARNING: no docs can be built from LaTeX files
     sgml2html: no
     WARNING: no HTML docs can be built from SGML
    $ make
    [...]
    $ sudo make install

Ensure that the ``configure`` script shows ``ELF support: yes``, so that iproute2
can process ELF files from LLVM's BPF back end. libelf was listed in the instructions
for installing the dependencies in case of Fedora and Ubuntu earlier.
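When the iproute2 build is scripted, for example in CI, the ``ELF support: yes``
check can be automated by capturing the ``configure`` output. This is a small
sketch under the assumption that the output format matches the transcript above;
the helper name and log file name are illustrative:

```shell
# check_elf_support FILE: succeed only if the saved configure output
# reports that libelf was detected ("ELF support: yes").
check_elf_support() {
    grep -q 'ELF support:[[:space:]]*yes' "$1"
}

# Typical usage:
#   ./configure --prefix=/usr | tee configure.log
#   check_elf_support configure.log || { echo 'libelf missing'; exit 1; }
```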
Compiling bpftool
`````````````````

bpftool is an essential tool for debugging and introspection of BPF programs
and maps. It is part of the kernel tree and available under ``tools/bpf/bpftool/``.

Make sure to have cloned either the ``net`` or ``net-next`` kernel tree as described
earlier. In order to build and install bpftool, the following steps are required:

.. code-block:: shell-session

    $ cd <kernel-tree>/tools/bpf/bpftool/
    $ make
    Auto-detecting system features:
    ...                        libbfd: [ on  ]
    ...        disassembler-four-args: [ OFF ]

      CC       xlated_dumper.o
      CC       prog.o
      CC       common.o
      CC       cgroup.o
      CC       main.o
      CC       json_writer.o
      CC       cfg.o
      CC       map.o
      CC       jit_disasm.o
      CC       disasm.o
    make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf'

    Auto-detecting system features:
    ...                        libelf: [ on  ]
    ...                           bpf: [ on  ]

      CC       libbpf.o
      CC       bpf.o
      CC       nlattr.o
      LD       libbpf-in.o
      LINK     libbpf.a
    make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf'
      LINK     bpftool
    $ sudo make install

.. _tooling_llvm:

LLVM
----

LLVM is currently the only compiler suite providing a BPF back end. gcc does
not support BPF at this point.

The BPF back end was merged into LLVM with the 3.7 release. Major distributions
enable the BPF back end by default when they package LLVM, therefore installing
clang and llvm is sufficient on most recent distributions to start compiling C
into BPF object files.

The typical workflow is that BPF programs are written in C, compiled by LLVM
into object / ELF files, which are parsed by user space BPF ELF loaders (such as
iproute2 or others), and pushed into the kernel through the BPF system call.
The kernel verifies the BPF instructions and JITs them, returning a new file
descriptor for the program, which then can be attached to a subsystem (e.g.
networking). If supported, the subsystem could then further offload the BPF
program to hardware (e.g. NIC).

For LLVM, BPF target support can be checked, for example, through the following:

.. code-block:: shell-session

    $ llc --version
    LLVM (http://llvm.org/):
      LLVM version 3.8.1
      Optimized build.
      Default target: x86_64-unknown-linux-gnu
      Host CPU: skylake

      Registered Targets:
        [...]
        bpf        - BPF (host endian)
        bpfeb      - BPF (big endian)
        bpfel      - BPF (little endian)
        [...]

By default, the ``bpf`` target uses the endianness of the CPU it compiles on,
meaning that if the CPU's endianness is little endian, the program is represented
in little endian format as well, and if the CPU's endianness is big endian,
the program is represented in big endian. This also matches the runtime behavior
of BPF, which is generic and uses the endianness of the CPU it runs on in order
not to disadvantage architectures of either endianness.

For cross-compilation, the two targets ``bpfeb`` and ``bpfel`` were introduced,
so that BPF programs can be compiled on a node running in one endianness (e.g.
little endian on x86) and run on a node with the other endianness (e.g. big
endian on arm). Note that the front end (clang) needs to run in the target
endianness as well.

Using ``bpf`` as a target is the preferred way in situations where no mixture of
endianness applies. For example, compilation on ``x86_64`` results in the same
output for the targets ``bpf`` and ``bpfel`` due to being little endian, therefore
scripts triggering a compilation also do not have to be endian aware.

A minimal, stand-alone XDP drop program might look like the following example
(``xdp-example.c``):

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return XDP_DROP;
    }

    char __license[] __section("license") = "GPL";

It can then be compiled and loaded into the kernel as follows:

.. code-block:: shell-session

    $ clang -O2 -Wall --target=bpf -c xdp-example.c -o xdp-example.o
    # ip link set dev em1 xdp obj xdp-example.o

.. note:: Attaching an XDP BPF program to a network device as above requires
          Linux 4.11 with a device that supports XDP, or Linux 4.12 or later.

For the generated object file, LLVM (>= 3.9) uses the official BPF machine value,
that is, ``EM_BPF`` (decimal: ``247`` / hex: ``0xf7``). In this example, the program
has been compiled with the ``bpf`` target under ``x86_64``, therefore ``LSB`` (as
opposed to ``MSB``) is shown regarding endianness:

.. code-block:: shell-session

    $ file xdp-example.o
    xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped

``readelf -a xdp-example.o`` will dump further information about the ELF file, which can
sometimes be useful for introspecting generated section headers, relocation entries
and the symbol table.

In the unlikely case where clang and LLVM need to be compiled from scratch, the
following commands can be used:

.. code-block:: shell-session

    $ git clone https://github.com/llvm/llvm-project.git
    $ cd llvm-project
    $ mkdir build
    $ cd build
    $ cmake -DLLVM_ENABLE_PROJECTS=clang -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF -G "Unix Makefiles" ../llvm
    $ make -j $(getconf _NPROCESSORS_ONLN)
    $ ./bin/llc --version
    LLVM (http://llvm.org/):
      LLVM version x.y.zsvn
      Optimized build.
      Default target: x86_64-unknown-linux-gnu
      Host CPU: skylake

      Registered Targets:
        bpf    - BPF (host endian)
        bpfeb  - BPF (big endian)
        bpfel  - BPF (little endian)
        x86    - 32-bit X86: Pentium-Pro and above
        x86-64 - 64-bit X86: EM64T and AMD64

    $ export PATH=$PWD/bin:$PATH   # add to ~/.bashrc

Make sure that ``--version`` mentions ``Optimized build.``, otherwise the
compilation time for programs when having LLVM in debugging mode will
significantly increase (e.g. by 10x or more).

For debugging, clang can generate the assembler output as follows:

.. code-block:: shell-session

    $ clang -O2 -S -Wall --target=bpf -c xdp-example.c -o xdp-example.S
    $ cat xdp-example.S
        .text
        .section    prog,"ax",@progbits
        .globl      xdp_drop
        .p2align    3
    xdp_drop:                             # @xdp_drop
    # BB#0:
        r0 = 1
        exit

        .section    license,"aw",@progbits
        .globl      __license             # @__license
    __license:
        .asciz      "GPL"

Starting from LLVM's release 6.0, there is also assembler parser support. You can
program using BPF assembler directly, then use llvm-mc to assemble it into an
object file. For example, you can assemble the ``xdp-example.S`` listed above back
into an object file using:

.. code-block:: shell-session

    $ llvm-mc -triple bpf -filetype=obj -o xdp-example.o xdp-example.S

Furthermore, more recent LLVM versions (>= 4.0) can also store debugging
information in dwarf format into the object file. This can be done through
the usual workflow by adding ``-g`` for compilation.

.. code-block:: shell-session

    $ clang -O2 -g -Wall --target=bpf -c xdp-example.c -o xdp-example.o
    $ llvm-objdump -S --no-show-raw-insn xdp-example.o

    xdp-example.o:        file format ELF64-BPF

    Disassembly of section prog:
    xdp_drop:
    ; {
        0:        r0 = 1
    ; return XDP_DROP;
        1:        exit

The ``llvm-objdump`` tool can then annotate the assembler output with the
original C code used in the compilation. The trivial example in this case
does not contain much C code, however, the line numbers shown as ``0:``
and ``1:`` correspond directly to the kernel's verifier log.

This means that in case BPF programs get rejected by the verifier, ``llvm-objdump``
can help to correlate the instructions back to the original C code, which is
highly useful for analysis.

.. code-block:: shell-session

    # ip link set dev em1 xdp obj xdp-example.o verb

    Prog section 'prog' loaded (5)!
     - Type:         6
     - Instructions: 2 (0 over limit)
     - License:      GPL

    Verifier analysis:

    0: (b7) r0 = 1
    1: (95) exit
    processed 2 insns

As can be seen in the verifier analysis, the ``llvm-objdump`` output dumps
the same BPF assembler code as the kernel.

Leaving out the ``--no-show-raw-insn`` option will also dump the raw
``struct bpf_insn`` as hex in front of the assembly:

.. code-block:: shell-session

    $ llvm-objdump -S xdp-example.o

    xdp-example.o:        file format ELF64-BPF

    Disassembly of section prog:
    xdp_drop:
    ; {
       0:       b7 00 00 00 01 00 00 00     r0 = 1
    ; return XDP_DROP;
       1:       95 00 00 00 00 00 00 00     exit

For LLVM IR debugging, the compilation process for BPF can be split into
two steps, generating a binary LLVM IR intermediate file ``xdp-example.bc``, which
can later on be passed to llc:

.. code-block:: shell-session

    $ clang -O2 -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
    $ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o

The generated LLVM IR can also be dumped in human readable format through:

.. code-block:: shell-session

    $ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -

LLVM is able to attach debug information such as the description of used data
types in the program to the generated BPF object file. By default this is in
DWARF format.

A heavily simplified version used by BPF is called BTF (BPF Type Format). The
resulting DWARF can be converted into BTF, which is later on loaded into the
kernel through BPF object loaders. The kernel will then verify the BTF data
for correctness and keep track of the data types the BTF data is containing.

BPF maps can then be annotated with key and value types out of the BTF data
such that a later dump of the map exports the map data along with the related
type information. This allows for better introspection, debugging and value
pretty printing. Note that BTF data is a generic debugging data format and
as such any DWARF to BTF converted data can be loaded (e.g. the kernel's vmlinux
DWARF data could be converted to BTF and loaded). The latter is in particular
useful for BPF tracing in the future.

In order to generate BTF from DWARF debugging information, elfutils (>= 0.173)
is needed. If that is not available, then adding the ``-mattr=dwarfris`` option
to the ``llc`` command is required during compilation:

.. code-block:: shell-session

    $ llc -march=bpf -mattr=help |& grep dwarfris
      dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
      [...]
The reason for using ``-mattr=dwarfris`` is that the flag ``dwarfris`` (``dwarf
relocation in section``) disables DWARF cross-section relocations between DWARF
and the ELF's symbol table since libdw does not have proper BPF relocation
support, and therefore tools like ``pahole`` would otherwise not be able to
properly dump structures from the object.

elfutils (>= 0.173) implements proper BPF relocation support and therefore
the same can be achieved without the ``-mattr=dwarfris`` option. Dumping
the structures from the object file could be done from either DWARF or BTF
information. ``pahole`` uses the LLVM emitted DWARF information at this
point, however, future ``pahole`` versions could rely on BTF if available.

For converting DWARF into BTF, a recent pahole version (>= 1.12) is required.
A recent pahole version can also be obtained from its official git repository
if not available from one of the distribution packages:

.. code-block:: shell-session

    $ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git

``pahole`` comes with the option ``-J`` to convert DWARF into BTF from an
object file. ``pahole`` can be probed for BTF support as follows (note that
the ``llvm-objcopy`` tool is required for ``pahole`` as well, so check its
presence, too):

.. code-block:: shell-session

    $ pahole --help | grep BTF
    -J, --btf_encode           Encode as BTF

Generating debugging information also requires the front end to generate
source level debug information by passing ``-g`` to the ``clang`` command
line. Note that ``-g`` is needed independently of whether ``llc``'s
``dwarfris`` option is used. Full example for generating the object file:

.. code-block:: shell-session

    $ clang -O2 -g -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
    $ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o

Alternatively, clang can be used on its own to build a BPF program with debugging
information (again, the dwarfris flag can be omitted when a proper elfutils
version is present):

.. code-block:: shell-session

    $ clang --target=bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o

After successful compilation ``pahole`` can be used to properly dump structures
of the BPF program based on the DWARF information:

.. code-block:: shell-session

    $ pahole xdp-example.o
    struct xdp_md {
            __u32                      data;                 /*     0     4 */
            __u32                      data_end;             /*     4     4 */
            __u32                      data_meta;            /*     8     4 */

            /* size: 12, cachelines: 1, members: 3 */
            /* last cacheline: 12 bytes */
    };

Through the option ``-J``, ``pahole`` can eventually generate the BTF from
DWARF. In the object file, the DWARF data will still be retained alongside the
newly added BTF data. Full ``clang`` and ``pahole`` example combined:

.. code-block:: shell-session

    $ clang --target=bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o
    $ pahole -J xdp-example.o

The presence of a ``.BTF`` section can be seen through the ``readelf`` tool:

.. code-block:: shell-session

    $ readelf -a xdp-example.o
    [...]
      [18] .BTF              PROGBITS         0000000000000000  00000671
    [...]

BPF loaders such as iproute2 will detect and load the BTF section, so that
BPF maps can be annotated with type information.

LLVM by default uses the BPF base instruction set for generating code
in order to make sure that the generated object file can also be loaded
with older kernels such as long-term stable kernels (e.g. 4.9+).
However, LLVM has a ``-mcpu`` selector for the BPF back end in order to
select different versions of the BPF instruction set, namely instruction
set extensions on top of the BPF base instruction set, in order to generate
more efficient and smaller code.

Available ``-mcpu`` options can be queried through:

.. code-block:: shell-session

    $ llc -march bpf -mcpu=help
    Available CPUs for this target:

      generic - Select the generic processor.
      probe   - Select the probe processor.
      v1      - Select the v1 processor.
      v2      - Select the v2 processor.
    [...]

The ``generic`` processor is the default processor, which is also the
base instruction set ``v1`` of BPF. Options ``v1`` and ``v2`` are typically
useful in an environment where the BPF program is being cross compiled
and the target host where the program is loaded differs from the one
where it is compiled (and thus available BPF kernel features might differ
as well).

The recommended ``-mcpu`` option, which is also used by Cilium internally, is
``-mcpu=probe``. Here, the LLVM BPF back end queries the kernel for availability
of BPF instruction set extensions and when found available, LLVM will use
them for compiling the BPF program whenever appropriate.

A full command line example with llc's ``-mcpu=probe``:

.. code-block:: shell-session

    $ clang -O2 -Wall --target=bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
    $ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o

Generally, LLVM IR generation is architecture independent. There are
however a few differences when using ``clang --target=bpf`` versus
leaving ``--target=bpf`` out and thus using clang's default target which,
depending on the underlying architecture, might be ``x86_64``, ``arm64``
or others.
Quoting from the kernel's ``Documentation/bpf/bpf_devel_QA.txt``:

* BPF programs may recursively include header file(s) with file scope
  inline assembly codes. The default target can handle this well, while
  bpf target may fail if bpf backend assembler does not understand
  these assembly codes, which is true in most cases.

* When compiled without -g, additional elf sections, e.g., ``.eh_frame``
  and ``.rela.eh_frame``, may be present in the object file with default
  target, but not with bpf target.

* The default target may turn a C switch statement into a switch table
  lookup and jump operation. Since the switch table is placed in the
  global read-only section, the bpf program will fail to load.
  The bpf target does not support switch table optimization. The clang
  option ``-fno-jump-tables`` can be used to disable switch table
  generation.

* For clang ``--target=bpf``, it is guaranteed that pointer or long /
  unsigned long types will always have a width of 64 bit, no matter
  whether underlying clang binary or default target (or kernel) is
  32 bit. However, when native clang target is used, then it will
  compile these types based on the underlying architecture's
  conventions, meaning in case of 32 bit architecture, pointer or
  long / unsigned long types e.g. in BPF context structure will have
  width of 32 bit while the BPF LLVM back end still operates in 64 bit.

The native target is mostly needed in tracing for the case of walking
the kernel's ``struct pt_regs`` that maps CPU registers, or other kernel
structures where the CPU's register width matters. In all other cases such
as networking, the use of ``clang --target=bpf`` is the preferred choice.

Also, LLVM started to support 32-bit subregisters and BPF ALU32 instructions
since LLVM's release 7.0. A new code generation attribute ``alu32`` is added.
When it is enabled, LLVM will try to use 32-bit subregisters whenever possible,
typically when there are operations on 32-bit types. The associated ALU
instructions with 32-bit subregisters will become ALU32 instructions. For
example, for the following sample code:

.. code-block:: shell-session

    $ cat 32-bit-example.c
        void cal(unsigned int *a, unsigned int *b, unsigned int *c)
        {
          unsigned int sum = *a + *b;
          *c = sum;
        }

At default code generation, the assembler will look like:

.. code-block:: shell-session

    $ clang --target=bpf -emit-llvm -S 32-bit-example.c
    $ llc -march=bpf 32-bit-example.ll
    $ cat 32-bit-example.s
    cal:
        r1 = *(u32 *)(r1 + 0)
        r2 = *(u32 *)(r2 + 0)
        r2 += r1
        *(u32 *)(r3 + 0) = r2
        exit

64-bit registers are used, hence the addition means 64-bit addition. Now, if you
enable the new 32-bit subregisters support by specifying ``-mattr=+alu32``, then
the assembler will look like:

.. code-block:: shell-session

    $ llc -march=bpf -mattr=+alu32 32-bit-example.ll
    $ cat 32-bit-example.s
    cal:
        w1 = *(u32 *)(r1 + 0)
        w2 = *(u32 *)(r2 + 0)
        w2 += w1
        *(u32 *)(r3 + 0) = w2
        exit

The ``w`` register, meaning a 32-bit subregister, will be used instead of the
64-bit ``r`` register.

Enabling 32-bit subregisters might help reduce type extension instruction
sequences. It could also help the kernel eBPF JIT compiler on 32-bit
architectures, where register pairs are used to model the 64-bit eBPF registers
and extra instructions are needed to manipulate the high 32 bits. A read from a
32-bit subregister is guaranteed to read from the low 32 bits only, while a
write still needs to clear the high 32 bits. Therefore, if the JIT compiler
knows that the definition of a register only has subregister reads, the
instructions for setting the high 32 bits of the destination can be eliminated.
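The zero-extension semantics of a subregister write can be modeled on the host
for illustration. The following is a sketch of the rule described above, not BPF
code, and the helper name is made up:

```shell
# write_w: model a BPF ALU32 ("w" register) write. The previous 64-bit
# register contents ($1) are discarded entirely; the 32-bit value ($2)
# is zero-extended into the full register, so the high 32 bits become 0.
write_w() {
    echo $(( $2 & 0xffffffff ))
}

# e.g.: reg=$(write_w $(( 0x1234beef00000000 )) 42)
#       the old high bits 0x1234beef are gone, only 42 remains
```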
When writing C programs for BPF, there are a couple of pitfalls to be aware
of, compared to usual application development with C. The following items
describe some of the differences for the BPF model:

1. **Everything needs to be inlined, there are no function calls (on older
   LLVM versions) or shared library calls available.**

   Shared libraries, etc. cannot be used with BPF. However, common library
   code used in BPF programs can be placed into header files and included in
   the main programs. For example, Cilium makes heavy use of it (see ``bpf/lib/``).
   This still allows for including header files, for example, from the kernel
   or other libraries and reusing their static inline functions or
   macros / definitions.

   Unless a recent kernel (4.16+) and LLVM (6.0+) is used where BPF to BPF
   function calls are supported, LLVM needs to compile and inline the
   entire code into a flat sequence of BPF instructions for a given program
   section. In such a case, best practice is to use an annotation like ``__inline``
   for every library function as shown below. The use of ``always_inline``
   is recommended, since the compiler could still decide to uninline large
   functions that are only annotated as ``inline``.

   In case the latter happens, LLVM will generate a relocation entry into
   the ELF file, which BPF ELF loaders such as iproute2 cannot resolve and
   will thus produce an error, since only BPF maps are valid relocation entries
   which loaders can process.

   .. code-block:: c

       #include <linux/bpf.h>

       #ifndef __section
       # define __section(NAME)                  \
          __attribute__((section(NAME), used))
       #endif

       #ifndef __inline
       # define __inline                         \
          inline __attribute__((always_inline))
       #endif

       static __inline int foo(void)
       {
           return XDP_DROP;
       }

       __section("prog")
       int xdp_drop(struct xdp_md *ctx)
       {
           return foo();
       }

       char __license[] __section("license") = "GPL";

2. **Multiple programs can reside inside a single C file in different sections.**

   C programs for BPF make heavy use of section annotations. A C file is
   typically structured into 3 or more sections. BPF ELF loaders use these
   names to extract and prepare the relevant information in order to load
   the programs and maps through the bpf system call. For example, iproute2
   uses ``maps`` and ``license`` as default section names to find metadata
   needed for map creation and the license for the BPF program, respectively.
   On program creation time the latter is pushed into the kernel as well,
   and enables some of the helper functions which are exposed as GPL only
   in case the program also holds a GPL compatible license, for example
   ``bpf_ktime_get_ns()``, ``bpf_probe_read()`` and others.

   The remaining section names are specific for BPF program code, for example,
   the below code has been modified to contain two program sections, ``ingress``
   and ``egress``. The toy example code demonstrates that both can share a map
   and common static inline helpers such as the ``account_data()`` function.

   The ``xdp-example.c`` example has been modified to a ``tc-example.c``
   example that can be loaded with tc and attached to a netdevice's ingress
   and egress hook. It accounts the transferred bytes into a map called
   ``acc_map``, which has two map slots, one for traffic accounted on the
   ingress hook, one on the egress hook.

   .. code-block:: c

       #include <linux/bpf.h>
       #include <linux/pkt_cls.h>
       #include <stdint.h>
       #include <iproute2/bpf_elf.h>

       #ifndef __section
       # define __section(NAME)                  \
          __attribute__((section(NAME), used))
       #endif

       #ifndef __inline
       # define __inline                         \
          inline __attribute__((always_inline))
       #endif

       #ifndef lock_xadd
       # define lock_xadd(ptr, val)              \
          ((void)__sync_fetch_and_add(ptr, val))
       #endif

       #ifndef BPF_FUNC
       # define BPF_FUNC(NAME, ...)              \
          (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
       #endif

       static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

       struct bpf_elf_map acc_map __section("maps") = {
           .type           = BPF_MAP_TYPE_ARRAY,
           .size_key       = sizeof(uint32_t),
           .size_value     = sizeof(uint32_t),
           .pinning        = PIN_GLOBAL_NS,
           .max_elem       = 2,
       };

       static __inline int account_data(struct __sk_buff *skb, uint32_t dir)
       {
           uint32_t *bytes;

           bytes = map_lookup_elem(&acc_map, &dir);
           if (bytes)
               lock_xadd(bytes, skb->len);

           return TC_ACT_OK;
       }

       __section("ingress")
       int tc_ingress(struct __sk_buff *skb)
       {
           return account_data(skb, 0);
       }

       __section("egress")
       int tc_egress(struct __sk_buff *skb)
       {
           return account_data(skb, 1);
       }

       char __license[] __section("license") = "GPL";

   The example also demonstrates a couple of other things which are useful
   to be aware of when developing programs. The code includes kernel headers,
   standard C headers and an iproute2 specific header containing the
   definition of ``struct bpf_elf_map``. iproute2 has a common BPF ELF loader
   and as such the definition of ``struct bpf_elf_map`` is the very same for
   XDP and tc typed programs.
   A ``struct bpf_elf_map`` entry defines a map in the program and contains
   all relevant information (such as key / value size, etc) needed to generate
   a map which is used from the two BPF programs. The structure must be placed
   into the ``maps`` section, so that the loader can find it. There can be
   multiple map declarations of this type with different variable names, but
   all must be annotated with ``__section("maps")``.

   The ``struct bpf_elf_map`` is specific to iproute2. Different BPF ELF
   loaders can have different formats, for example, the libbpf in the kernel
   source tree, which is mainly used by ``perf``, has a different specification.
   iproute2 guarantees backwards compatibility for ``struct bpf_elf_map``.
   Cilium follows the iproute2 model.

   The example also demonstrates how BPF helper functions are mapped into
   the C code and used. Here, ``map_lookup_elem()`` is defined by
   mapping this function into the ``BPF_FUNC_map_lookup_elem`` enum value
   which is exposed as a helper in ``uapi/linux/bpf.h``. When the program is later
   loaded into the kernel, the verifier checks whether the passed arguments
   are of the expected type and re-points the helper call into a real
   function call. Moreover, ``map_lookup_elem()`` also demonstrates how
   maps can be passed to BPF helper functions. Here, ``&acc_map`` from the
   ``maps`` section is passed as the first argument to ``map_lookup_elem()``.

   Since the defined array map is global, the accounting needs to use an
   atomic operation, which is defined as ``lock_xadd()``. LLVM maps
   ``__sync_fetch_and_add()`` as a built-in function to the BPF atomic
   add instruction, that is, ``BPF_STX | BPF_XADD | BPF_W`` for word sizes.

   Last but not least, the ``struct bpf_elf_map`` tells that the map is to
   be pinned as ``PIN_GLOBAL_NS``. This means that tc will pin the map
   into the BPF pseudo file system as a node.
   By default, it will be pinned
   to ``/sys/fs/bpf/tc/globals/acc_map`` for the given example. Due to the
   ``PIN_GLOBAL_NS``, the map will be placed under ``/sys/fs/bpf/tc/globals/``.
   ``globals`` acts as a global namespace that spans across object files.
   If the example used ``PIN_OBJECT_NS``, then tc would create a directory
   that is local to the object file. For example, different C files with
   BPF code could have the same ``acc_map`` definition as above with a
   ``PIN_GLOBAL_NS`` pinning. In that case, the map will be shared among
   BPF programs originating from various object files. ``PIN_NONE`` would
   mean that the map is not placed into the BPF file system as a node,
   and as a result will not be accessible from user space after tc quits. It
   would also mean that tc creates two separate map instances for each
   program, since it cannot retrieve a previously pinned map under that
   name. The ``acc_map`` part from the mentioned path is the name of the
   map as specified in the source code.

   Thus, upon loading of the ``ingress`` program, tc will find that no such
   map exists in the BPF file system and creates a new one. On success, the
   map will also be pinned, so that when the ``egress`` program is loaded
   through tc, it will find that such a map already exists in the BPF file
   system and will reuse it for the ``egress`` program. The loader also
   makes sure that if maps exist with the same name, their properties
   (key / value size, etc) match as well.

   Just as tc can retrieve the same map, third party applications can also
   use the ``BPF_OBJ_GET`` command from the bpf system call in order
   to create a new file descriptor pointing to the same map instance, which
   can then be used to lookup / update / delete map elements.

   The code can be compiled and loaded via iproute2 as follows:

   .. code-block:: shell-session

      $ clang -O2 -Wall --target=bpf -c tc-example.c -o tc-example.o

      # tc qdisc add dev em1 clsact
      # tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress
      # tc filter add dev em1 egress bpf da obj tc-example.o sec egress

      # tc filter show dev em1 ingress
      filter protocol all pref 49152 bpf
      filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f

      # tc filter show dev em1 egress
      filter protocol all pref 49152 bpf
      filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714

      # mount | grep bpf
      sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel)
      bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700)

      # tree /sys/fs/bpf/
      /sys/fs/bpf/
      +-- ip -> /sys/fs/bpf/tc/
      +-- tc
      |   +-- globals
      |       +-- acc_map
      +-- xdp -> /sys/fs/bpf/tc/

      4 directories, 1 file

   As soon as packets pass the ``em1`` device, counters from the BPF map will
   be incremented.

3. **There are no global variables allowed.**

   For the reasons already mentioned in point 1, BPF cannot have global variables
   as often used in normal C programs.

   However, there is a work-around in that the program can simply use a BPF map
   of type ``BPF_MAP_TYPE_PERCPU_ARRAY`` with just a single slot of arbitrary
   value size. This works, because during execution, BPF programs are guaranteed
   to never get preempted by the kernel and therefore can use the single map entry
   as a scratch buffer for temporary data, for example, to extend beyond the stack
   limitation. This also functions across tail calls, since it has the same
   guarantees with regards to preemption.

   Otherwise, for holding state across multiple BPF program runs, normal BPF
   maps can be used.

4. **There are no const strings or arrays allowed.**

   Defining ``const`` strings or other arrays in the BPF C program does not work
   for the same reasons as pointed out in points 1 and 3, namely that relocation
   entries would be generated in the ELF file which would be rejected by loaders due
   to not being part of the ABI towards loaders (loaders also cannot fix up such
   entries as it would require large rewrites of the already compiled BPF sequence).

   In the future, LLVM might detect these occurrences and throw an early error
   to the user.

   Helper functions such as ``trace_printk()`` can be worked around as follows:

   .. code-block:: c

      static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...);

      #ifndef printk
      # define printk(fmt, ...)                                      \
         ({                                                          \
            char ____fmt[] = fmt;                                    \
            trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__);   \
         })
      #endif

   The program can then use the macro naturally like ``printk("skb len:%u\n", skb->len);``.
   The output will then be written to the trace pipe. ``tc exec bpf dbg`` can be
   used to retrieve the messages from there.

   The use of the ``trace_printk()`` helper function has a couple of disadvantages
   and thus is not recommended for production usage. Constant strings like the
   ``"skb len:%u\n"`` need to be loaded into the BPF stack each time the helper
   function is called, and BPF helper functions are also limited to a maximum
   of 5 arguments. This leaves room for only 3 additional variables which can be
   passed for dumping.

   Therefore, despite being helpful for quick debugging, it is recommended (for networking
   programs) to use the ``skb_event_output()`` or the ``xdp_event_output()`` helper,
   respectively. They allow for passing custom structs from the BPF program to
   the perf event ring buffer along with an optional packet sample.
   For example,
   Cilium's monitor makes use of these helpers in order to implement a debugging
   framework, notifications for network policy violations, etc. These helpers pass
   the data through a lockless memory mapped per-CPU ``perf`` ring buffer, and
   are thus significantly faster than ``trace_printk()``.

5. **Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().**

   Since BPF programs cannot perform any function calls other than those to BPF
   helpers, common library code needs to be implemented as inline functions. In
   addition, LLVM provides some built-ins that the programs can use for
   constant sizes (here: ``n``) which will then always get inlined:

   .. code-block:: c

      #ifndef memset
      # define memset(dest, chr, n)   __builtin_memset((dest), (chr), (n))
      #endif

      #ifndef memcpy
      # define memcpy(dest, src, n)   __builtin_memcpy((dest), (src), (n))
      #endif

      #ifndef memmove
      # define memmove(dest, src, n)  __builtin_memmove((dest), (src), (n))
      #endif

   The ``memcmp()`` built-in had some corner cases where inlining did not take place
   due to an LLVM issue in the back end, and is therefore not recommended to be
   used until the issue is fixed.

6. **There are no loops available (yet).**

   The BPF verifier in the kernel checks that a BPF program does not contain
   loops by performing a depth first search of all possible program paths, in
   addition to other control flow graph validations. The purpose is to make sure
   that the program is always guaranteed to terminate.

   A very limited form of looping is available for constant upper loop bounds
   by using the ``#pragma unroll`` directive. Example code that is compiled to BPF:

   .. code-block:: c

      #pragma unroll
          for (i = 0; i < IPV6_MAX_HEADERS; i++) {
              switch (nh) {
              case NEXTHDR_NONE:
                  return DROP_INVALID_EXTHDR;
              case NEXTHDR_FRAGMENT:
                  return DROP_FRAG_NOSUPPORT;
              case NEXTHDR_HOP:
              case NEXTHDR_ROUTING:
              case NEXTHDR_AUTH:
              case NEXTHDR_DEST:
                  if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0)
                      return DROP_INVALID;

                  nh = opthdr.nexthdr;
                  if (nh == NEXTHDR_AUTH)
                      len += ipv6_authlen(&opthdr);
                  else
                      len += ipv6_optlen(&opthdr);
                  break;
              default:
                  *nexthdr = nh;
                  return len;
              }
          }

   Another possibility is to use tail calls by calling into the same program
   again and using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map for having a local
   scratch space. While dynamic, this form of looping is however limited
   to a maximum of 34 iterations (the initial program, plus 33 iterations from
   the tail calls).

   In the future, BPF may have some native, but limited form of implementing loops.

7. **Partitioning programs with tail calls.**

   Tail calls provide the flexibility to atomically alter program behavior during
   runtime by jumping from one BPF program into another. In order to select the
   next program, tail calls make use of program array maps (``BPF_MAP_TYPE_PROG_ARRAY``),
   and pass the map as well as the index of the next program to jump to. There is no
   return to the old program after the jump has been performed, and in case there was
   no program present at the given map index, then execution continues on the original
   program.

   For example, this can be used to implement various stages of a parser, where
   such stages could be updated with new parsing features during runtime.
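   The fall-through behavior when a map index is empty can be illustrated with
   a small user space analogy. The sketch below is not BPF code and all names
   in it are hypothetical; it merely models a program array as a function
   pointer table where an empty slot means execution continues in the caller:

   .. code-block:: c

      #include <stdio.h>

      /* Simulated "program array": a fixed-size table of handlers.
       * A NULL slot corresponds to a map index with no program attached. */
      typedef int (*prog_fn)(int pkt);
      static prog_fn prog_array[4];

      /* Analogue of a tail call: jump if a program is installed at idx,
       * otherwise report fall-through and let the caller continue. */
      static int tail_call_sim(int idx, int pkt, int *handled)
      {
          if (idx >= 0 && idx < 4 && prog_array[idx]) {
              *handled = 1;
              return prog_array[idx](pkt);
          }
          *handled = 0;   /* fall-through path */
          return 0;
      }

      static int sampler(int pkt) { return pkt + 1; }

      int main(void)
      {
          int handled, ret;

          /* Index 0 empty: falls through, like an absent map entry. */
          ret = tail_call_sim(0, 41, &handled);
          printf("empty slot: handled=%d\n", handled);

          /* "Graft" a program into index 0, then dispatch again. */
          prog_array[0] = sampler;
          ret = tail_call_sim(0, 41, &handled);
          printf("grafted: handled=%d ret=%d\n", handled, ret);
          return 0;
      }

   The real mechanism differs in one important way: a BPF tail call never
   returns to the calling program, whereas the simulation above is an ordinary
   function call.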
   Another use case is event notifications; for example, Cilium can opt in to packet
   drop notifications during runtime, where the ``skb_event_output()`` call is
   located inside the tail called program. Thus, during normal operations, the
   fall-through path will always be executed unless a program is added to the
   related map index, where the program then prepares the metadata and triggers
   the event notification to a user space daemon.

   Program array maps are quite flexible, enabling also individual actions to
   be implemented for programs located in each map index. For example, the root
   program attached to XDP or tc could perform an initial tail call to index 0
   of the program array map, performing traffic sampling, then jumping to index 1
   of the program array map, where firewalling policy is applied and the packet
   is either dropped or further processed in index 2 of the program array map, where
   it is mangled and sent out of an interface again. Jumps in the program array
   map can, of course, be arbitrary. The kernel will eventually execute the
   fall-through path when the maximum tail call limit has been reached.

   Minimal example extract of using tail calls:

   .. code-block:: c

      [...]

      #ifndef __stringify
      # define __stringify(X)   #X
      #endif

      #ifndef __section
      # define __section(NAME)                  \
         __attribute__((section(NAME), used))
      #endif

      #ifndef __section_tail
      # define __section_tail(ID, KEY)          \
         __section(__stringify(ID) "/" __stringify(KEY))
      #endif

      #ifndef BPF_FUNC
      # define BPF_FUNC(NAME, ...)              \
         (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
      #endif

      #define BPF_JMP_MAP_ID   1

      static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map,
                           uint32_t index);

      struct bpf_elf_map jmp_map __section("maps") = {
          .type           = BPF_MAP_TYPE_PROG_ARRAY,
          .id             = BPF_JMP_MAP_ID,
          .size_key       = sizeof(uint32_t),
          .size_value     = sizeof(uint32_t),
          .pinning        = PIN_GLOBAL_NS,
          .max_elem       = 1,
      };

      __section_tail(BPF_JMP_MAP_ID, 0)
      int looper(struct __sk_buff *skb)
      {
          printk("skb cb: %u\n", skb->cb[0]++);
          tail_call(skb, &jmp_map, 0);
          return TC_ACT_OK;
      }

      __section("prog")
      int entry(struct __sk_buff *skb)
      {
          skb->cb[0] = 0;
          tail_call(skb, &jmp_map, 0);
          return TC_ACT_OK;
      }

      char __license[] __section("license") = "GPL";

   When loading this toy program, tc will create the program array and pin it
   to the BPF file system in the global namespace under ``jmp_map``. The
   BPF ELF loader in iproute2 will also recognize sections that are marked as
   ``__section_tail()``. The provided ``id`` in ``struct bpf_elf_map`` will be
   matched against the id marker in the ``__section_tail()``, that is, ``BPF_JMP_MAP_ID``,
   and the program is therefore loaded at the user specified program array map index,
   which is ``0`` in this example. As a result, all provided tail call sections
   will be populated by the iproute2 loader into the corresponding maps. This mechanism
   is not specific to tc, but can be applied with any other BPF program type
   that iproute2 supports (such as XDP, lwt).

   The generated elf contains section headers describing the map id and the
   entry within that map:

   .. code-block:: shell-session

      $ llvm-objdump -S --no-show-raw-insn prog_array.o | less
      prog_array.o:   file format ELF64-BPF

      Disassembly of section 1/0:
      looper:
             0:       r6 = r1
             1:       r2 = *(u32 *)(r6 + 48)
             2:       r1 = r2
             3:       r1 += 1
             4:       *(u32 *)(r6 + 48) = r1
             5:       r1 = 0 ll
             7:       call -1
             8:       r1 = r6
             9:       r2 = 0 ll
            11:       r3 = 0
            12:       call 12
            13:       r0 = 0
            14:       exit
      Disassembly of section prog:
      entry:
             0:       r2 = 0
             1:       *(u32 *)(r1 + 48) = r2
             2:       r2 = 0 ll
             4:       r3 = 0
             5:       call 12
             6:       r0 = 0
             7:       exit

   In this case, the ``section 1/0`` indicates that the ``looper()`` function
   resides in map id ``1`` at position ``0``.

   The pinned map can be retrieved by user space applications (e.g. the Cilium daemon),
   but also by tc itself in order to update the map with new programs. Updates
   happen atomically; the initial entry programs that are triggered first from the
   various subsystems are also updated atomically.

   Example for tc to perform tail call map updates:

   .. code-block:: shell-session

      # tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec foo

   In case iproute2 is to update the pinned program array, the ``graft`` command
   can be used. By pointing it to ``globals/jmp_map``, tc will update the
   map at index / key ``0`` with a new program residing in the object file ``new.o``
   under section ``foo``.

8. **Limited stack space of maximum 512 bytes.**

   Stack space in BPF programs is limited to only 512 bytes, which needs to be
   taken into careful consideration when implementing BPF programs in C. However,
   as mentioned earlier in point 3, a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map with a
   single entry can be used in order to enlarge scratch buffer space.

9. **Use of BPF inline assembly possible.**

   LLVM 6.0 or later allows use of inline assembly for BPF for the rare cases where it
   might be needed. The following (nonsensical) toy example shows a 64-bit atomic
   add. Due to lack of documentation, the LLVM source code in ``lib/Target/BPF/BPFInstrInfo.td``
   as well as ``test/CodeGen/BPF/`` might be helpful for providing some additional
   examples. Test code:

   .. code-block:: c

      #include <linux/bpf.h>

      #ifndef __section
      # define __section(NAME)                  \
         __attribute__((section(NAME), used))
      #endif

      __section("prog")
      int xdp_test(struct xdp_md *ctx)
      {
          __u64 a = 2, b = 3, *c = &a;
          /* just a toy xadd example to show the syntax */
          asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c));
          return a;
      }

      char __license[] __section("license") = "GPL";

   The above program is compiled into the following sequence of BPF
   instructions:

   ::

      Verifier analysis:

      0: (b7) r1 = 2
      1: (7b) *(u64 *)(r10 -8) = r1
      2: (b7) r1 = 3
      3: (bf) r2 = r10
      4: (07) r2 += -8
      5: (db) lock *(u64 *)(r2 +0) += r1
      6: (79) r0 = *(u64 *)(r10 -8)
      7: (95) exit
      processed 8 insns (limit 131072), stack depth 8

10. **Remove struct padding by aligning members using #pragma pack.**

    In modern compilers, data structures are aligned by default to access memory
    efficiently. Structure members are packed to memory addresses and padding is
    added for proper alignment with the processor word size (e.g. 8-byte for
    64-bit processors, 4-byte for 32-bit processors). Because of this, the size of
    a struct may often grow larger than expected.

    .. code-block:: c

       struct called_info {
           u64 start;  // 8-byte
           u64 end;    // 8-byte
           u32 sector; // 4-byte
       }; // size of 20-byte ?
       printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte

       // Actual compiled composition of struct called_info
       // 0x0(0)                   0x8(8)
       //  ↓________________________↓
       //  |        start (8)       |
       //  |________________________|
       //  |         end  (8)       |
       //  |________________________|
       //  |  sector(4) |  PADDING  | <= address aligned to 8
       //  |____________|___________|    with 4-byte PADDING.

    The BPF verifier in the kernel checks the stack boundary so that a BPF program
    does not access memory outside of the boundary or an uninitialized stack area.
    Using a struct with padding as a map value will cause an
    ``invalid indirect read from stack`` failure on ``bpf_prog_load()``.

    Example code:

    .. code-block:: c

       struct called_info {
           u64 start;
           u64 end;
           u32 sector;
       };

       struct bpf_map_def SEC("maps") called_info_map = {
           .type = BPF_MAP_TYPE_HASH,
           .key_size = sizeof(long),
           .value_size = sizeof(struct called_info),
           .max_entries = 4096,
       };

       SEC("kprobe/submit_bio")
       int submit_bio_entry(struct pt_regs *ctx)
       {
           char fmt[] = "submit_bio(bio=0x%lx) called: %llu\n";
           u64 start_time = bpf_ktime_get_ns();
           long bio_ptr = PT_REGS_PARM1(ctx);
           struct called_info called_info = {
               .start = start_time,
               .end = 0,
               .sector = 0
           };

           bpf_map_update_elem(&called_info_map, &bio_ptr, &called_info, BPF_ANY);
           bpf_trace_printk(fmt, sizeof(fmt), bio_ptr, start_time);
           return 0;
       }

    Corresponding output on ``bpf_load_program()``::

       bpf_load_program() err=13
       0: (bf) r6 = r1
       ...
       19: (b7) r1 = 0
       20: (7b) *(u64 *)(r10 -72) = r1
       21: (7b) *(u64 *)(r10 -80) = r7
       22: (63) *(u32 *)(r10 -64) = r1
       ...
       30: (85) call bpf_map_update_elem#2
       invalid indirect read from stack off -80+20 size 24

    At ``bpf_prog_load()``, the eBPF verifier ``bpf_check()`` is called, and it
    checks the stack boundary by calling ``check_func_arg() -> check_stack_boundary()``.
    As the above error shows, ``struct called_info`` is compiled to a 24-byte size,
    and the message says that reading data from +20 is an invalid indirect read.
    As discussed earlier, the address 0x14(20) is the place where the PADDING is.

    .. code-block:: c

       // Actual compiled composition of struct called_info
       // 0x10(16)    0x14(20)    0x18(24)
       //  ↓____________↓___________↓
       //  |  sector(4) |  PADDING  | <= address aligned to 8
       //  |____________|___________|    with 4-byte PADDING.

    The ``check_stack_boundary()`` internally loops through every ``access_size``
    (24) bytes from the start pointer to make sure that they are within the stack
    boundary and that all elements of the stack are initialized. Since the padding
    isn't supposed to be used, it triggers the 'invalid indirect read from stack'
    failure. To avoid this kind of failure, the padding has to be removed from the
    struct.

    Removing the padding by using the ``#pragma pack(n)`` directive:

    .. code-block:: c

       #pragma pack(4)
       struct called_info {
           u64 start;  // 8-byte
           u64 end;    // 8-byte
           u32 sector; // 4-byte
       }; // size of 20-byte ?

       printf("size of %d-byte\n", sizeof(struct called_info)); // size of 20-byte

       // Actual compiled composition of packed struct called_info
       // 0x0(0)                   0x8(8)
       //  ↓________________________↓
       //  |        start (8)       |
       //  |________________________|
       //  |         end  (8)       |
       //  |________________________|
       //  |  sector(4) |             <= address aligned to 4
       //  |____________|                with no PADDING.
    By placing ``#pragma pack(4)`` before ``struct called_info``, the compiler aligns
    members of the struct to the smaller of 4 bytes and their natural alignment. As you
    can see, the size of ``struct called_info`` has shrunk to 20 bytes and the padding
    no longer exists.

    But removing the padding has downsides too. For example, the compiler will generate
    less optimized code. Since we've removed the padding, processors will conduct
    unaligned accesses to the structure and this might lead to performance degradation.
    Also, unaligned access might get rejected by the verifier on some architectures.

    However, there is a way to avoid the downsides of a packed structure. Simply adding
    an explicit padding member ``u32 pad`` at the end resolves the same problem without
    packing the structure.

    .. code-block:: c

       struct called_info {
           u64 start;  // 8-byte
           u64 end;    // 8-byte
           u32 sector; // 4-byte
           u32 pad;    // 4-byte
       }; // size of 24-byte ?

       printf("size of %d-byte\n", sizeof(struct called_info)); // size of 24-byte

       // Actual compiled composition of struct called_info with explicit padding
       // 0x0(0)                   0x8(8)
       //  ↓________________________↓
       //  |        start (8)       |
       //  |________________________|
       //  |         end  (8)       |
       //  |________________________|
       //  |  sector(4) |  pad (4)  | <= address aligned to 8
       //  |____________|___________|    with explicit PADDING.

11. **Accessing packet data via invalidated references**

    Some networking BPF helper functions such as ``bpf_skb_store_bytes`` might
    change the size of the packet data. As the verifier is not able to track such
    changes, any a priori reference to the data will be invalidated by the verifier.
    Therefore, the reference needs to be updated before accessing the data to
    avoid the verifier rejecting the program.
    To illustrate this, consider the following snippet:

    .. code-block:: c

       struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

       skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);

       if (ip4->protocol == IPPROTO_TCP) {
           // do something
       }

    The verifier will reject the snippet due to dereferencing the invalidated
    ``ip4->protocol``:

    ::

       R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=34,r=34,imm=0) R3=inv0
       R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
       R8=inv4294967162 R9=pkt(id=0,off=0,r=34,imm=0) R10=fp0,call_-1
       ...
       18: (85) call bpf_skb_store_bytes#9
       19: (7b) *(u64 *)(r10 -56) = r7
       R0=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=2,var_off=(0x0; 0x3))
       R8=inv4294967162 R9=inv(id=0) R10=fp0,call_-1 fp-48=mmmm???? fp-56=mmmmmmmm
       21: (61) r1 = *(u32 *)(r9 +23)
       R9 invalid mem access 'inv'

    To fix this, the reference to ``ip4`` has to be updated:

    .. code-block:: c

       struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

       skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);

       ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

       if (ip4->protocol == IPPROTO_TCP) {
           // do something
       }

iproute2
--------

There are various front ends for loading BPF programs into the kernel such as bcc,
perf, iproute2 and others. The Linux kernel source tree also provides a user space
library under ``tools/lib/bpf/``, which is mainly used and driven by perf for
loading BPF tracing programs into the kernel. However, the library itself is
generic and not limited to perf only. bcc is a toolkit providing many useful
BPF programs mainly for tracing that are loaded ad-hoc through a Python interface
embedding the BPF C code.
Syntax and semantics for implementing BPF programs
slightly differ among front ends in general, though. Additionally, there are also
BPF samples in the kernel source tree (``samples/bpf/``) which parse the generated
object files and load the code directly through the system call interface.

This and previous sections mainly focus on the iproute2 suite's BPF front end for
loading networking programs of XDP, tc or lwt type, since Cilium's programs are
implemented against this BPF loader. In the future, Cilium will be equipped with a
native BPF loader, but programs will remain compatible with loading through the
iproute2 suite in order to facilitate development and debugging.

All BPF program types supported by iproute2 share the same BPF loader logic
due to having a common loader back end implemented as a library (``lib/bpf.c``
in the iproute2 source tree).

The previous section on LLVM also covered some iproute2 parts related to writing
BPF C programs, and later sections in this document are related to tc and XDP
specific aspects when writing programs. Therefore, this section will rather focus
on usage examples for loading object files with iproute2 as well as some of the
generic mechanics of the loader. It does not try to provide complete coverage
of all details, but enough to get started.

**1. Loading of XDP BPF object files.**

Given a BPF object file ``prog.o`` that has been compiled for XDP, it can be loaded
through ``ip`` to an XDP-supported netdevice called ``em1`` with the following
command:

.. code-block:: shell-session

   # ip link set dev em1 xdp obj prog.o

The above command assumes that the program code resides in the default section
which is called ``prog`` in the XDP case.
Should this not be the case, and the
section be named differently, for example, ``foobar``, then the program needs
to be loaded as:

.. code-block:: shell-session

   # ip link set dev em1 xdp obj prog.o sec foobar

Note that it is also possible to load the program out of the ``.text`` section.
Changing the minimal, stand-alone XDP drop program by removing the ``__section()``
annotation from the ``xdp_drop`` entry point would look like the following:

.. code-block:: c

   #include <linux/bpf.h>

   #ifndef __section
   # define __section(NAME)                  \
      __attribute__((section(NAME), used))
   #endif

   int xdp_drop(struct xdp_md *ctx)
   {
       return XDP_DROP;
   }

   char __license[] __section("license") = "GPL";

And can be loaded as follows:

.. code-block:: shell-session

   # ip link set dev em1 xdp obj prog.o sec .text

By default, ``ip`` will throw an error in case an XDP program is already attached
to the networking interface, to prevent it from being overridden by accident. In
order to replace the currently running XDP program with a new one, the ``-force``
option must be used:

.. code-block:: shell-session

   # ip -force link set dev em1 xdp obj prog.o

Most XDP-enabled drivers today support an atomic replacement of the existing
program with a new one without traffic interruption. There is always only a
single program attached to an XDP-enabled driver due to performance reasons,
hence a chain of programs is not supported. However, as described in the
previous section, partitioning of programs can be performed through tail
calls to achieve a similar use case when necessary.

The ``ip link`` command will display an ``xdp`` flag if the interface has an XDP
program attached. ``ip link | grep xdp`` can thus be used to find all interfaces
that have XDP running.
Further introspection facilities are provided through
the detailed view with ``ip -d link``, and ``bpftool`` can be used to retrieve
information about the attached program based on the BPF program ID shown in
the ``ip link`` dump.

In order to remove the existing XDP program from the interface, the following
command must be issued:

.. code-block:: shell-session

   # ip link set dev em1 xdp off

In the case of switching a driver's operation mode from non-XDP to native XDP
and vice versa, typically the driver needs to reconfigure its receive (and
transmit) rings in order to ensure received packets are set up linearly
within a single page for BPF to read and write into. However, once that is
completed, most drivers then only need to perform an atomic replacement of
the program itself when a BPF program is requested to be swapped.

In total, XDP supports three operation modes which iproute2 implements as well:
``xdpdrv``, ``xdpoffload`` and ``xdpgeneric``.

``xdpdrv`` stands for native XDP, meaning the BPF program is run directly in
the driver's receive path at the earliest possible point in software. This is
the normal / conventional XDP mode and requires drivers to implement XDP
support, which all major 10G/40G/+ networking drivers in the upstream Linux
kernel already provide.

``xdpgeneric`` stands for generic XDP and is intended as an experimental test
bed for drivers which do not yet support native XDP. Given the generic XDP hook
in the ingress path comes at a much later point in time when the packet has
already entered the stack's main receive path as a ``skb``, the performance is
significantly lower than with processing in ``xdpdrv`` mode. ``xdpgeneric`` is
therefore for the most part only interesting for experimenting, less for
production environments.
Last but not least, the ``xdpoffload`` mode is implemented by SmartNICs such
as those supported by Netronome's nfp driver and allows for offloading the
entire BPF/XDP program into hardware, so the program is run on each packet
reception directly on the card. This provides even higher performance than
running in native XDP, although not all BPF map types or BPF helper functions
are available for use compared to native XDP. The BPF verifier will reject the
program in such a case and report to the user what is unsupported. Other than
staying in the realm of supported BPF features and helper functions, no special
precautions have to be taken when writing BPF C programs.

When a command like ``ip link set dev em1 xdp obj [...]`` is used, the kernel
will first attempt to load the program as native XDP, and in case the driver
does not support native XDP, it will automatically fall back to generic XDP.
If, for example, ``xdpdrv`` is used explicitly instead of ``xdp``, the kernel
will only attempt to load the program as native XDP and fail in case the driver
does not support it, which guarantees that generic XDP is avoided altogether.

Example for enforcing a BPF/XDP program to be loaded in native XDP mode,
dumping the link details and unloading the program again:

.. code-block:: shell-session

    # ip -force link set dev em1 xdpdrv obj prog.o
    # ip link show
    [...]
    6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
        prog/xdp id 1 tag 57cd311f2e27366b
    [...]
    # ip link set dev em1 xdpdrv off

Same example now for forcing generic XDP, even if the driver would support
native XDP, and additionally dumping the BPF instructions of the attached
dummy program through bpftool:
.. code-block:: shell-session

    # ip -force link set dev em1 xdpgeneric obj prog.o
    # ip link show
    [...]
    6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
        prog/xdp id 4 tag 57cd311f2e27366b                <-- BPF program ID 4
    [...]
    # bpftool prog dump xlated id 4                       <-- Dump of instructions running on em1
    0: (b7) r0 = 1
    1: (95) exit
    # ip link set dev em1 xdpgeneric off

And last but not least offloaded XDP, where we additionally dump program
information via bpftool for retrieving general metadata:

.. code-block:: shell-session

    # ip -force link set dev em1 xdpoffload obj prog.o
    # ip link show
    [...]
    6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
        prog/xdp id 8 tag 57cd311f2e27366b
    [...]
    # bpftool prog show id 8
    8: xdp tag 57cd311f2e27366b dev em1                   <-- Also indicates a BPF program offloaded to em1
        loaded_at Apr 11/20:38  uid 0
        xlated 16B  not jited  memlock 4096B
    # ip link set dev em1 xdpoffload off

Note that it is not possible to use ``xdpdrv`` and ``xdpgeneric`` or other
modes at the same time, meaning only one of the XDP operation modes can be
picked.

A switch between different XDP modes, e.g. from generic to native or vice
versa, is not atomically possible. Only switching programs within a specific
operation mode is:
.. code-block:: shell-session

    # ip -force link set dev em1 xdpgeneric obj prog.o
    # ip -force link set dev em1 xdpoffload obj prog.o
    RTNETLINK answers: File exists
    # ip -force link set dev em1 xdpdrv obj prog.o
    RTNETLINK answers: File exists
    # ip -force link set dev em1 xdpgeneric obj prog.o    <-- Succeeds due to xdpgeneric
    #

Switching between modes requires first leaving the current operation mode in
order to then enter the new one:

.. code-block:: shell-session

    # ip -force link set dev em1 xdpgeneric obj prog.o
    # ip -force link set dev em1 xdpgeneric off
    # ip -force link set dev em1 xdpoffload obj prog.o
    # ip l
    [...]
    6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000
        link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff
        prog/xdp id 17 tag 57cd311f2e27366b
    [...]
    # ip -force link set dev em1 xdpoffload off

**2. Loading of tc BPF object files.**

Given a BPF object file ``prog.o`` has been compiled for tc, it can be loaded
through the tc command onto a netdevice. Unlike XDP, there is no driver
dependency for supporting the attachment of BPF programs to the device. Here,
the netdevice is called ``em1``, and with the following command the program can
be attached to the ``ingress`` networking path of ``em1``:

.. code-block:: shell-session

    # tc qdisc add dev em1 clsact
    # tc filter add dev em1 ingress bpf da obj prog.o

The first step is to set up a ``clsact`` qdisc (Linux queueing discipline).
``clsact`` is a dummy qdisc similar to the ``ingress`` qdisc, which can only
hold classifiers and actions, but does not perform actual queueing. It is
needed in order to attach the ``bpf`` classifier. The ``clsact`` qdisc provides
two special hooks called ``ingress`` and ``egress``, to which the classifier
can be attached.
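The tc program loaded above has the same shape as the XDP examples earlier: a
single classifier function placed in an ELF section plus a license string. A
minimal sketch might look as follows (the section name ``ingress`` and the
function name ``tc_ingress`` are illustrative choices, not requirements; the
object is built with clang's BPF target as described in the LLVM section):

```c
#include <linux/bpf.h>
#include <linux/pkt_cls.h>  /* TC_ACT_* verdicts */

#ifndef __section
# define __section(NAME) \
    __attribute__((section(NAME), used))
#endif

/* Runs for every packet on the hook it is attached to; in da
 * (direct-action) mode the return value is the tc verdict. */
__section("ingress")
int tc_ingress(struct __sk_buff *skb)
{
    return TC_ACT_OK; /* let the packet continue through the stack */
}

char __license[] __section("license") = "GPL";
```

Since this is a BPF object rather than a host executable, it is compiled with
``clang -O2 -target bpf -c prog.c -o prog.o`` and attached with the ``tc
filter`` commands shown above.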
Both the ``ingress`` and ``egress`` hooks are located at central receive and
transmit locations in the networking data path, through which every packet on
the device passes. The ``ingress`` hook is called from
``__netif_receive_skb_core() -> sch_handle_ingress()`` in the kernel and the
``egress`` hook from ``__dev_queue_xmit() -> sch_handle_egress()``.

The equivalent for attaching the program to the ``egress`` hook looks as follows:

.. code-block:: shell-session

    # tc filter add dev em1 egress bpf da obj prog.o

The ``clsact`` qdisc is processed locklessly from both the ``ingress`` and
``egress`` direction and can also be attached to virtual, queue-less devices
such as ``veth`` devices connecting containers.

Next to the hook, the ``tc filter`` command selects ``bpf`` to be used in
``da`` (direct-action) mode. ``da`` mode is recommended and should always be
specified. It basically means that the ``bpf`` classifier does not need to call
into external tc action modules, which are not necessary for ``bpf`` anyway,
since all packet mangling, forwarding or other kinds of actions can already be
performed inside the single BPF program to be attached, making it significantly
faster.

At this point, the program has been attached and is executed once packets
traverse the device. Like in XDP, should the default section name not be used,
it can be specified during load, for example, in case of section ``foobar``:

.. code-block:: shell-session

    # tc filter add dev em1 egress bpf da obj prog.o sec foobar

iproute2's BPF loader allows for using the same command line syntax across
program types, hence ``obj prog.o sec foobar`` is the same syntax as with XDP
mentioned earlier.

The attached programs can be listed through the following commands:
.. code-block:: shell-session

    # tc filter show dev em1 ingress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f

    # tc filter show dev em1 egress
    filter protocol all pref 49152 bpf
    filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714

The output of ``prog.o:[ingress]`` tells that program section ``ingress`` was
loaded from the file ``prog.o``, and that ``bpf`` operates in ``direct-action``
mode. The program ``id`` and ``tag`` are appended in each case, where the
latter denotes a hash over the instruction stream which can be correlated with
the object file or ``perf`` reports with stack traces, etc. Last but not least,
the ``id`` represents the system-wide unique BPF program identifier that can be
used along with ``bpftool`` to further inspect or dump the attached BPF program.

tc can attach more than just a single BPF program; it also provides various
other classifiers which can be chained together. However, attaching a single
BPF program is fully sufficient, since all packet operations can be contained
in the program itself thanks to ``da`` (``direct-action``) mode, meaning the
BPF program itself will already return the tc action verdict such as
``TC_ACT_OK``, ``TC_ACT_SHOT`` and others. For optimal performance and
flexibility, this is the recommended usage.

In the above ``show`` command, tc also displays ``pref 49152`` and
``handle 0x1`` next to the BPF related output. Both are auto-generated in
case they are not explicitly provided through the command line.
``pref`` denotes a priority number, meaning that in case multiple classifiers
are attached, they will be executed in order of ascending priority, and
``handle`` represents an identifier in case multiple instances of the same
classifier have been loaded under the same ``pref``. Since for BPF a single
program is fully sufficient, ``pref`` and ``handle`` can typically be ignored.

Only in the case where it is planned to atomically replace the attached BPF
programs is it recommended to explicitly specify ``pref`` and ``handle`` a
priori on initial load, so that they do not have to be queried at a later point
in time for the ``replace`` operation. Thus, creation becomes:

.. code-block:: shell-session

    # tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar

    # tc filter show dev em1 ingress
    filter protocol all pref 1 bpf
    filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396f

And for the atomic replacement, the following can be issued for updating the
existing program at the ``ingress`` hook with the new BPF program from the file
``prog.o`` in section ``foobar``:

.. code-block:: shell-session

    # tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar

Last but not least, in order to remove all attached programs from the
``ingress`` and ``egress`` hooks respectively, the following can be used:

.. code-block:: shell-session

    # tc filter del dev em1 ingress
    # tc filter del dev em1 egress

For removing the entire ``clsact`` qdisc from the netdevice, which implicitly
also removes all attached programs from the ``ingress`` and ``egress`` hooks,
the following command is provided:
.. code-block:: shell-session

    # tc qdisc del dev em1 clsact

tc BPF programs can also be offloaded if the NIC and driver support it,
similarly to XDP BPF programs. NICs supported by Netronome's nfp driver offer
both types of BPF offload.

.. code-block:: shell-session

    # tc qdisc add dev em1 clsact
    # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
    Error: TC offload is disabled on net device.
    We have an error talking to the kernel

If the above error is shown, then tc hardware offload first needs to be enabled
for the device through ethtool's ``hw-tc-offload`` setting:

.. code-block:: shell-session

    # ethtool -K em1 hw-tc-offload on
    # tc qdisc add dev em1 clsact
    # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o
    # tc filter show dev em1 ingress
    filter protocol all pref 1 bpf
    filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366b

The ``in_hw`` flag confirms that the program has been offloaded to the NIC.

Note that BPF offloads for both tc and XDP cannot be loaded at the same time;
either the tc or the XDP offload option must be selected.

**3. Testing BPF offload interface via netdevsim driver.**

The netdevsim driver, which is part of the Linux kernel, provides a dummy
driver which implements offload interfaces for XDP BPF and tc BPF programs and
facilitates testing kernel changes or low-level user space programs
implementing a control plane directly against the kernel's UAPI.

A netdevsim device can be created as follows:
.. code-block:: shell-session

    # modprobe netdevsim
    // [ID] [PORT_COUNT]
    # echo "1 1" > /sys/bus/netdevsim/new_device
    # devlink dev
    netdevsim/netdevsim1
    # devlink port
    netdevsim/netdevsim1/0: type eth netdev eth0 flavour physical
    # ip l
    [...]
    4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff

After that step, XDP BPF or tc BPF programs can be test loaded as shown in the
various examples earlier:

.. code-block:: shell-session

    # ip -force link set dev eth0 xdpoffload obj prog.o
    # ip l
    [...]
    4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000
        link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff
        prog/xdp id 16 tag a04f5eef06a7f555

These two workflows are the basic operations to load XDP BPF and tc BPF
programs with iproute2, respectively.

There are various other advanced options for the BPF loader that apply to both
XDP and tc; some of them are listed here. In the examples, only XDP is
presented for simplicity.

**1. Verbose log output even on success.**

The option ``verb`` can be appended when loading programs in order to dump the
verifier log, even if no error occurred:

.. code-block:: shell-session

    # ip link set dev em1 xdp obj xdp-example.o verb

    Prog section 'prog' loaded (5)!
     - Type:         6
     - Instructions: 2 (0 over limit)
     - License:      GPL

    Verifier analysis:

    0: (b7) r0 = 1
    1: (95) exit
    processed 2 insns
**2. Load program that is already pinned in BPF file system.**

Instead of loading a program from an object file, iproute2 can also retrieve a
program from the BPF file system, in case some external entity pinned it there,
and attach it to the device:

.. code-block:: shell-session

    # ip link set dev em1 xdp pinned /sys/fs/bpf/prog

iproute2 can also use the short form that is relative to the detected mount
point of the BPF file system:

.. code-block:: shell-session

    # ip link set dev em1 xdp pinned m:prog

When loading BPF programs, iproute2 will automatically detect the mounted file
system instance in order to perform pinning of nodes. In case no mounted BPF
file system instance was found, tc will automatically mount it at the default
location under ``/sys/fs/bpf/``.

In case an instance has already been found, it will be used and no additional
mount will be performed:

.. code-block:: shell-session

    # mkdir /var/run/bpf
    # mount --bind /var/run/bpf /var/run/bpf
    # mount -t bpf bpf /var/run/bpf
    # tc filter add dev em1 ingress bpf da obj tc-example.o sec prog
    # tree /var/run/bpf
    /var/run/bpf
    +-- ip -> /run/bpf/tc/
    +-- tc
    |   +-- globals
    |       +-- jmp_map
    +-- xdp -> /run/bpf/tc/

    4 directories, 1 file

By default tc will create an initial directory structure as shown above, where
all subsystem users will point to the same location through symbolic links for
the ``globals`` namespace, so that pinned BPF maps can be reused among the
various BPF program types in iproute2. In case the file system instance has
already been mounted and an existing structure already exists, tc will not
override it. This could be the case for separating ``lwt``, ``tc`` and ``xdp``
maps in order to not share ``globals`` among them all.
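To picture how a map ends up in the ``globals`` namespace in the first place, a
map can request global pinning directly from its definition via the ``pinning``
member of ``struct bpf_elf_map``, as covered in the LLVM section. A sketch (the
map name ``map_sh`` is an arbitrary example):

```c
#include <stdint.h>
#include <linux/bpf.h>
#include <iproute2/bpf_elf.h>

#ifndef __section
# define __section(NAME) \
    __attribute__((section(NAME), used))
#endif

/* PIN_GLOBAL_NS pins the map under the globals directory, e.g.
 * /sys/fs/bpf/tc/globals/map_sh, so that other iproute2-loaded
 * programs can reuse it instead of creating their own copy. */
struct bpf_elf_map map_sh __section("maps") = {
    .type       = BPF_MAP_TYPE_ARRAY,
    .size_key   = sizeof(uint32_t),
    .size_value = sizeof(uint32_t),
    .pinning    = PIN_GLOBAL_NS,
    .max_elem   = 1,
};
```

When a later object file declares the same map with the same name and
``pinning``, the loader retrieves the already pinned map rather than creating a
new one, which is what enables map sharing across program types.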
As briefly covered in the previous LLVM section, iproute2 installs a header
file which BPF programs can include through the standard include path:

.. code-block:: c

    #include <iproute2/bpf_elf.h>

The purpose of this header file is to provide an API for maps and default
section names used by programs. It is a stable contract between iproute2 and
BPF programs.

The map definition for iproute2 is ``struct bpf_elf_map``. Its members have
been covered earlier in the LLVM section of this document.

When parsing the BPF object file, the iproute2 loader will walk through all ELF
sections. It initially fetches ancillary sections like ``maps`` and
``license``. For ``maps``, the ``struct bpf_elf_map`` array will be checked for
validity and, whenever needed, compatibility workarounds are performed.
Subsequently all maps are created with the user-provided information, either
retrieved as pinned objects, or newly created and then pinned into the BPF file
system. Next, the loader will handle all program sections that contain ELF
relocation entries for maps, meaning that BPF instructions loading map file
descriptors into registers are rewritten so that the corresponding map file
descriptors are encoded into the instructions' immediate value, in order for
the kernel to be able to convert them later on into map kernel pointers. After
that, all the programs themselves are created through the BPF system call, and
tail call maps, if present, are updated with the programs' file descriptors.