.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    http://docs.cilium.io

.. _bpf_guide:

***************************
BPF and XDP Reference Guide
***************************

.. note:: This documentation section is targeted at developers and users who
          want to understand BPF and XDP in great technical depth. While
          reading this reference guide may help broaden your understanding of
          Cilium, it is not a requirement to use Cilium. Please refer to the
          :ref:`gs_guide` and :ref:`arch_guide` for a higher level
          introduction.

BPF is a highly flexible and efficient virtual machine-like construct in the
Linux kernel that allows bytecode to be executed safely at various hook
points. It is used in a number of Linux kernel subsystems, most prominently
networking, tracing and security (e.g. sandboxing).

Although BPF has existed since 1992, this document covers the extended
Berkeley Packet Filter (eBPF) version which first appeared in kernel 3.18 and
renders the original version, nowadays referred to as "classic" BPF (cBPF),
mostly obsolete. cBPF is known to many as the packet filter language used by
tcpdump. Nowadays, the Linux kernel runs eBPF only and loaded cBPF bytecode is
transparently translated into an eBPF representation in the kernel before
program execution. This documentation generally uses the term BPF unless
explicit differences between eBPF and cBPF are being pointed out.

Even though the name Berkeley Packet Filter hints at a packet filtering
specific purpose, the instruction set is generic and flexible enough these
days that there are many use cases for BPF apart from networking. See
:ref:`bpf_users` for a list of projects which use BPF.

Cilium uses BPF heavily in its data path, see :ref:`arch_guide` for further
information. The goal of this chapter is to provide a BPF reference guide that
builds an understanding of BPF and its networking-specific use, including
loading BPF programs with tc (traffic control) and XDP (eXpress Data Path),
and that aids in developing Cilium's BPF templates.

BPF Architecture
================

BPF does not define itself by only providing its instruction set, but also by
offering further infrastructure around it such as maps which act as efficient
key / value stores, helper functions to interact with and leverage kernel
functionality, tail calls for calling into other BPF programs, security
hardening primitives, a pseudo file system for pinning objects (maps,
programs), and infrastructure for allowing BPF to be offloaded, for example,
to a network card.

LLVM provides a BPF back end, so that tools like clang can be used to compile
C into a BPF object file, which can then be loaded into the kernel. BPF is
deeply tied to the Linux kernel and allows for full programmability without
sacrificing native kernel performance.

Last but not least, the kernel subsystems making use of BPF are themselves
part of BPF's infrastructure. The two main subsystems discussed throughout
this document are tc and XDP, to which BPF programs can be attached. XDP BPF
programs are attached at the earliest networking driver stage and trigger a
run of the BPF program upon packet reception.
By definition, this achieves the best 67 possible packet processing performance since packets cannot get processed at an 68 even earlier point in software. However, since this processing occurs so early 69 in the networking stack, the stack has not yet extracted metadata out of the 70 packet. On the other hand, tc BPF programs are executed later in the kernel 71 stack, so they have access to more metadata and core kernel functionality. 72 Apart from tc and XDP programs, there are various other kernel subsystems as 73 well which use BPF such as tracing (kprobes, uprobes, tracepoints, etc). 74 75 The following subsections provide further details on individual aspects of the 76 BPF architecture. 77 78 Instruction Set 79 --------------- 80 81 BPF is a general purpose RISC instruction set and was originally designed for the 82 purpose of writing programs in a subset of C which can be compiled into BPF instructions 83 through a compiler back end (e.g. LLVM), so that the kernel can later on map them 84 through an in-kernel JIT compiler into native opcodes for optimal execution performance 85 inside the kernel. 86 87 The advantages for pushing these instructions into the kernel include: 88 89 * Making the kernel programmable without having to cross kernel / user space 90 boundaries. For example, BPF programs related to networking, as in the case of 91 Cilium, can implement flexible container policies, load balancing and other means 92 without having to move packets to user space and back into the kernel. State 93 between BPF programs and kernel / user space can still be shared through maps 94 whenever needed. 95 96 * Given the flexibility of a programmable data path, programs can be heavily optimized 97 for performance also by compiling out features that are not required for the use cases 98 the program solves. For example, if a container does not require IPv4, then the BPF 99 program can be built to only deal with IPv6 in order to save resources in the fast-path. 100 101 * In case of networking (e.g. tc and XDP), BPF programs can be updated atomically 102 without having to restart the kernel, system services or containers, and without 103 traffic interruptions. Furthermore, any program state can also be maintained 104 throughout updates via BPF maps. 105 106 * BPF provides a stable ABI towards user space, and does not require any third party 107 kernel modules. BPF is a core part of the Linux kernel that is shipped everywhere, 108 and guarantees that existing BPF programs keep running with newer kernel versions. 109 This guarantee is the same guarantee that the kernel provides for system calls with 110 regard to user space applications. Moreover, BPF programs are portable across 111 different architectures. 112 113 * BPF programs work in concert with the kernel, they make use of existing kernel 114 infrastructure (e.g. drivers, netdevices, tunnels, protocol stack, sockets) and 115 tooling (e.g. iproute2) as well as the safety guarantees which the kernel provides. 116 Unlike kernel modules, BPF programs are verified through an in-kernel verifier in 117 order to ensure that they cannot crash the kernel, always terminate, etc. XDP 118 programs, for example, reuse the existing in-kernel drivers and operate on the 119 provided DMA buffers containing the packet frames without exposing them or an entire 120 driver to user space as in other models. Moreover, XDP programs reuse the existing 121 stack instead of bypassing it. 
BPF can be considered a generic "glue code" to 122 kernel facilities for crafting programs to solve specific use cases. 123 124 The execution of a BPF program inside the kernel is always event-driven! Examples: 125 126 * A networking device which has a BPF program attached on its ingress path will 127 trigger the execution of the program once a packet is received. 128 129 * A kernel address which has a kprobe with a BPF program attached will trap once 130 the code at that address gets executed, which will then invoke the kprobe's 131 callback function for instrumentation, subsequently triggering the execution 132 of the attached BPF program. 133 134 BPF consists of eleven 64 bit registers with 32 bit subregisters, a program counter 135 and a 512 byte large BPF stack space. Registers are named ``r0`` - ``r10``. The 136 operating mode is 64 bit by default, the 32 bit subregisters can only be accessed 137 through special ALU (arithmetic logic unit) operations. The 32 bit lower subregisters 138 zero-extend into 64 bit when they are being written to. 139 140 Register ``r10`` is the only register which is read-only and contains the frame pointer 141 address in order to access the BPF stack space. The remaining ``r0`` - ``r9`` 142 registers are general purpose and of read/write nature. 143 144 A BPF program can call into a predefined helper function, which is defined by 145 the core kernel (never by modules). The BPF calling convention is defined as 146 follows: 147 148 * ``r0`` contains the return value of a helper function call. 149 * ``r1`` - ``r5`` hold arguments from the BPF program to the kernel helper function. 150 * ``r6`` - ``r9`` are callee saved registers that will be preserved on helper function call. 151 152 The BPF calling convention is generic enough to map directly to ``x86_64``, ``arm64`` 153 and other ABIs, thus all BPF registers map one to one to HW CPU registers, so that a 154 JIT only needs to issue a call instruction, but no additional extra moves for placing 155 function arguments. This calling convention was modeled to cover common call 156 situations without having a performance penalty. Calls with 6 or more arguments 157 are currently not supported. The helper functions in the kernel which are dedicated 158 to BPF (``BPF_CALL_0()`` to ``BPF_CALL_5()`` functions) are specifically designed 159 with this convention in mind. 160 161 Register ``r0`` is also the register containing the exit value for the BPF program. 162 The semantics of the exit value are defined by the type of program. Furthermore, when 163 handing execution back to the kernel, the exit value is passed as a 32 bit value. 164 165 Registers ``r1`` - ``r5`` are scratch registers, meaning the BPF program needs to 166 either spill them to the BPF stack or move them to callee saved registers if these 167 arguments are to be reused across multiple helper function calls. Spilling means 168 that the variable in the register is moved to the BPF stack. The reverse operation 169 of moving the variable from the BPF stack to the register is called filling. The 170 reason for spilling/filling is due to the limited number of registers. 171 172 Upon entering execution of a BPF program, register ``r1`` initially contains the 173 context for the program. The context is the input argument for the program (similar 174 to ``argc/argv`` pair for a typical C program). BPF is restricted to work on a single 175 context. 
The context is defined by the program type, for example, a networking 176 program can have a kernel representation of the network packet (``skb``) as the 177 input argument. 178 179 The general operation of BPF is 64 bit to follow the natural model of 64 bit 180 architectures in order to perform pointer arithmetics, pass pointers but also pass 64 181 bit values into helper functions, and to allow for 64 bit atomic operations. 182 183 The maximum instruction limit per program is restricted to 4096 BPF instructions, 184 which, by design, means that any program will terminate quickly. Although the 185 instruction set contains forward as well as backward jumps, the in-kernel BPF 186 verifier will forbid loops so that termination is always guaranteed. Since BPF 187 programs run inside the kernel, the verifier's job is to make sure that these are 188 safe to run, not affecting the system's stability. This means that from an instruction 189 set point of view, loops can be implemented, but the verifier will restrict that. 190 However, there is also a concept of tail calls that allows for one BPF program to 191 jump into another one. This, too, comes with an upper nesting limit of 32 calls, 192 and is usually used to decouple parts of the program logic, for example, into stages. 193 194 The instruction format is modeled as two operand instructions, which helps mapping 195 BPF instructions to native instructions during JIT phase. The instruction set is 196 of fixed size, meaning every instruction has 64 bit encoding. Currently, 87 instructions 197 have been implemented and the encoding also allows to extend the set with further 198 instructions when needed. The instruction encoding of a single 64 bit instruction on a 199 big-endian machine is defined as a bit sequence from most significant bit (MSB) to least 200 significant bit (LSB) of ``op:8``, ``dst_reg:4``, ``src_reg:4``, ``off:16``, ``imm:32``. 201 ``off`` and ``imm`` is of signed type. The encodings are part of the kernel headers and 202 defined in ``linux/bpf.h`` header, which also includes ``linux/bpf_common.h``. 203 204 ``op`` defines the actual operation to be performed. Most of the encoding for ``op`` 205 has been reused from cBPF. The operation can be based on register or immediate 206 operands. The encoding of ``op`` itself provides information on which mode to use 207 (``BPF_X`` for denoting register-based operations, and ``BPF_K`` for immediate-based 208 operations respectively). In the latter case, the destination operand is always 209 a register. Both ``dst_reg`` and ``src_reg`` provide additional information about 210 the register operands to be used (e.g. ``r0`` - ``r9``) for the operation. ``off`` 211 is used in some instructions to provide a relative offset, for example, for addressing 212 the stack or other buffers available to BPF (e.g. map values, packet data, etc), 213 or jump targets in jump instructions. ``imm`` contains a constant / immediate value. 214 215 The available ``op`` instructions can be categorized into various instruction 216 classes. These classes are also encoded inside the ``op`` field. The ``op`` field 217 is divided into (from MSB to LSB) ``code:4``, ``source:1`` and ``class:3``. ``class`` 218 is the more generic instruction class, ``code`` denotes a specific operational 219 code inside that class, and ``source`` tells whether the source operand is a register 220 or an immediate value. Possible instruction classes include: 221 222 * ``BPF_LD``, ``BPF_LDX``: Both classes are for load operations. 
``BPF_LD`` is 223 used for loading a double word as a special instruction spanning two instructions 224 due to the ``imm:32`` split, and for byte / half-word / word loads of packet data. 225 The latter was carried over from cBPF mainly in order to keep cBPF to BPF 226 translations efficient, since they have optimized JIT code. For native BPF 227 these packet load instructions are less relevant nowadays. ``BPF_LDX`` class 228 holds instructions for byte / half-word / word / double-word loads out of 229 memory. Memory in this context is generic and could be stack memory, map value 230 data, packet data, etc. 231 232 * ``BPF_ST``, ``BPF_STX``: Both classes are for store operations. Similar to ``BPF_LDX`` 233 the ``BPF_STX`` is the store counterpart and is used to store the data from a 234 register into memory, which, again, can be stack memory, map value, packet data, 235 etc. ``BPF_STX`` also holds special instructions for performing word and double-word 236 based atomic add operations, which can be used for counters, for example. The 237 ``BPF_ST`` class is similar to ``BPF_STX`` by providing instructions for storing 238 data into memory only that the source operand is an immediate value. 239 240 * ``BPF_ALU``, ``BPF_ALU64``: Both classes contain ALU operations. Generally, 241 ``BPF_ALU`` operations are in 32 bit mode and ``BPF_ALU64`` in 64 bit mode. 242 Both ALU classes have basic operations with source operand which is register-based 243 and an immediate-based counterpart. Supported by both are add (``+``), sub (``-``), 244 and (``&``), or (``|``), left shift (``<<``), right shift (``>>``), xor (``^``), 245 mul (``*``), div (``/``), mod (``%``), neg (``~``) operations. Also mov (``<X> := <Y>``) 246 was added as a special ALU operation for both classes in both operand modes. 247 ``BPF_ALU64`` also contains a signed right shift. ``BPF_ALU`` additionally 248 contains endianness conversion instructions for half-word / word / double-word 249 on a given source register. 250 251 * ``BPF_JMP``: This class is dedicated to jump operations. Jumps can be unconditional 252 and conditional. Unconditional jumps simply move the program counter forward, so 253 that the next instruction to be executed relative to the current instruction is 254 ``off + 1``, where ``off`` is the constant offset encoded in the instruction. Since 255 ``off`` is signed, the jump can also be performed backwards as long as it does not 256 create a loop and is within program bounds. Conditional jumps operate on both, 257 register-based and immediate-based source operands. If the condition in the jump 258 operations results in ``true``, then a relative jump to ``off + 1`` is performed, 259 otherwise the next instruction (``0 + 1``) is performed. This fall-through 260 jump logic differs compared to cBPF and allows for better branch prediction as it 261 fits the CPU branch predictor logic more naturally. Available conditions are 262 jeq (``==``), jne (``!=``), jgt (``>``), jge (``>=``), jsgt (signed ``>``), jsge 263 (signed ``>=``), jlt (``<``), jle (``<=``), jslt (signed ``<``), jsle (signed 264 ``<=``) and jset (jump if ``DST & SRC``). Apart from that, there are three 265 special jump operations within this class: the exit instruction which will leave 266 the BPF program and return the current value in ``r0`` as a return code, the call 267 instruction, which will issue a function call into one of the available BPF helper 268 functions, and a hidden tail call instruction, which will jump into a different 269 BPF program. 
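To make the encoding more tangible, the following sketch hand-assembles the
two instruction program ``r0 = 1; exit`` using ``struct bpf_insn`` from
``linux/bpf.h``, whose fields mirror the ``op:8``, ``dst_reg:4``,
``src_reg:4``, ``off:16``, ``imm:32`` layout described above. This is for
illustration only; a user space loader would pass such an array to the
``bpf()`` system call when loading a program:

::

    #include <linux/bpf.h>    /* also pulls in linux/bpf_common.h */

    /* "r0 = 1; exit" encoded by hand. Each struct bpf_insn is one 64 bit
     * instruction: code (op), dst_reg, src_reg, off and imm.
     */
    static const struct bpf_insn insns[] = {
        /* op 0xb7: BPF_ALU64 class, mov operation, immediate (BPF_K) operand */
        { .code = BPF_ALU64 | BPF_MOV | BPF_K,
          .dst_reg = BPF_REG_0, .src_reg = 0, .off = 0, .imm = 1 },
        /* op 0x95: BPF_JMP class, exit; hands the value in r0 back to the kernel */
        { .code = BPF_JMP | BPF_EXIT,
          .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
    };

These are the same two instructions which the verifier log of the XDP example
later in this guide renders as ``(b7) r0 = 1`` and ``(95) exit``.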
270 271 The Linux kernel is shipped with a BPF interpreter which executes programs assembled in 272 BPF instructions. Even cBPF programs are translated into eBPF programs transparently 273 in the kernel, except for architectures that still ship with a cBPF JIT and 274 have not yet migrated to an eBPF JIT. 275 276 Currently ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64`` and 277 ``arm`` architectures come with an in-kernel eBPF JIT compiler. 278 279 All BPF handling such as loading of programs into the kernel or creation of BPF maps 280 is managed through a central ``bpf()`` system call. It is also used for managing map 281 entries (lookup / update / delete), and making programs as well as maps persistent 282 in the BPF file system through pinning. 283 284 Helper Functions 285 ---------------- 286 287 Helper functions are a concept which enables BPF programs to consult a core kernel 288 defined set of function calls in order to retrieve / push data from / to the 289 kernel. Available helper functions may differ for each BPF program type, 290 for example, BPF programs attached to sockets are only allowed to call into 291 a subset of helpers compared to BPF programs attached to the tc layer. 292 Encapsulation and decapsulation helpers for lightweight tunneling constitute 293 an example of functions which are only available to lower tc layers, whereas 294 event output helpers for pushing notifications to user space are available to 295 tc and XDP programs. 296 297 Each helper function is implemented with a commonly shared function signature 298 similar to system calls. The signature is defined as: 299 300 :: 301 302 u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5) 303 304 The calling convention as described in the previous section applies to all 305 BPF helper functions. 306 307 The kernel abstracts helper functions into macros ``BPF_CALL_0()`` to ``BPF_CALL_5()`` 308 which are similar to those of system calls. The following example is an extract 309 from a helper function which updates map elements by calling into the 310 corresponding map implementation callbacks: 311 312 :: 313 314 BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key, 315 void *, value, u64, flags) 316 { 317 WARN_ON_ONCE(!rcu_read_lock_held()); 318 return map->ops->map_update_elem(map, key, value, flags); 319 } 320 321 const struct bpf_func_proto bpf_map_update_elem_proto = { 322 .func = bpf_map_update_elem, 323 .gpl_only = false, 324 .ret_type = RET_INTEGER, 325 .arg1_type = ARG_CONST_MAP_PTR, 326 .arg2_type = ARG_PTR_TO_MAP_KEY, 327 .arg3_type = ARG_PTR_TO_MAP_VALUE, 328 .arg4_type = ARG_ANYTHING, 329 }; 330 331 There are various advantages of this approach: while cBPF overloaded its 332 load instructions in order to fetch data at an impossible packet offset to 333 invoke auxiliary helper functions, each cBPF JIT needed to implement support 334 for such a cBPF extension. In case of eBPF, each newly added helper function 335 will be JIT compiled in a transparent and efficient way, meaning that the JIT 336 compiler only needs to emit a call instruction since the register mapping 337 is made in such a way that BPF register assignments already match the 338 underlying architecture's calling convention. This allows for easily extending 339 the core kernel with new helper functionality. All BPF helper functions are 340 part of the core kernel and cannot be extended or added through kernel modules. 341 342 The aforementioned function signature also allows the verifier to perform type 343 checks. 
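On the BPF program side, a commonly used convention, seen for example in
iproute2's BPF examples and in libbpf's helper headers, is to declare each
helper as a function pointer whose value is the helper's ID from
``enum bpf_func_id``. The sketch below assumes a tc-style program; which
helpers are actually callable still depends on the program type as described
above:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)  __attribute__((section(NAME), used))
    #endif

    /* Helper "prototypes": the pointer value is the function ID from
     * enum bpf_func_id; the verifier resolves the call against the helper's
     * bpf_func_proto at load time.
     */
    static __u64 (*bpf_ktime_get_ns)(void) = (void *) BPF_FUNC_ktime_get_ns;
    static __u32 (*bpf_get_prandom_u32)(void) = (void *) BPF_FUNC_get_prandom_u32;

    __section("prog")
    int use_helpers(struct __sk_buff *skb)
    {
        /* No helper arguments here; each result is returned in r0. The meaning
         * of this program's own return value depends on where it is attached.
         */
        return (int)((bpf_ktime_get_ns() ^ bpf_get_prandom_u32()) & 1);
    }

    char __license[] __section("license") = "GPL";

LLVM turns each of these calls into a BPF call instruction carrying the
function ID, which the verifier then matches against the corresponding
``bpf_func_proto``.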
The ``struct bpf_func_proto``, as shown in the earlier ``bpf_map_update_elem``
extract, is used to hand the verifier all the necessary information about the
helper, so that the verifier can make sure that the expected types from the
helper match the current contents of the BPF program's analyzed registers.

Argument types can range from passing in any kind of value up to restricted
contents such as a pointer / size pair for the BPF stack buffer, which the
helper should read from or write to. In the latter case, the verifier can also
perform additional checks, for example, whether the buffer was previously
initialized.

The list of available BPF helper functions is rather long and constantly
growing, for example, at the time of this writing, tc BPF programs can choose
from 38 different BPF helpers. The kernel's ``struct bpf_verifier_ops``
contains a ``get_func_proto`` callback function that provides the mapping of a
specific ``enum bpf_func_id`` to one of the available helpers for a given BPF
program type.

Maps
----

.. image:: images/bpf_map.png
    :align: center

Maps are efficient key / value stores that reside in kernel space. They can be
accessed from a BPF program in order to keep state among multiple BPF program
invocations. They can also be accessed through file descriptors from user
space and can be arbitrarily shared with other BPF programs or user space
applications.

BPF programs which share maps with each other are not required to be of the
same program type, for example, tracing programs can share maps with
networking programs. A single BPF program can currently access up to 64
different maps directly.

Map implementations are provided by the core kernel. There are generic maps
with per-CPU and non-per-CPU flavors that can read / write arbitrary data, but
there are also a few non-generic maps that are used along with helper
functions.

Generic maps currently available are ``BPF_MAP_TYPE_HASH``,
``BPF_MAP_TYPE_ARRAY``, ``BPF_MAP_TYPE_PERCPU_HASH``,
``BPF_MAP_TYPE_PERCPU_ARRAY``, ``BPF_MAP_TYPE_LRU_HASH``,
``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and ``BPF_MAP_TYPE_LPM_TRIE``. They all use
the same common set of BPF helper functions in order to perform lookup, update
or delete operations while implementing a different backend with differing
semantics and performance characteristics.

Non-generic maps that are currently in the kernel are
``BPF_MAP_TYPE_PROG_ARRAY``, ``BPF_MAP_TYPE_PERF_EVENT_ARRAY``,
``BPF_MAP_TYPE_CGROUP_ARRAY``, ``BPF_MAP_TYPE_STACK_TRACE``,
``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS``. For example,
``BPF_MAP_TYPE_PROG_ARRAY`` is an array map which holds other BPF programs,
while ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and ``BPF_MAP_TYPE_HASH_OF_MAPS`` both
hold pointers to other maps such that entire BPF maps can be atomically
replaced at runtime. These map types tackle specific issues which were
unsuitable to be implemented solely through a BPF helper function, since
additional (non-data) state needs to be held across BPF program invocations.
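Because maps are also reachable from user space via file descriptors, a
minimal user space sketch may help illustrate the flow: it creates a
``BPF_MAP_TYPE_HASH`` map and updates / looks up one element by invoking the
``bpf(2)`` system call directly (assuming a libc that exposes ``__NR_bpf``).
Real applications would normally use a loader library or tools such as
bpftool instead, and error handling is kept to a minimum here:

::

    #include <linux/bpf.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* Thin wrapper, since libc may not provide a bpf() wrapper of its own. */
    static long bpf_sys(int cmd, union bpf_attr *attr, unsigned int size)
    {
        return syscall(__NR_bpf, cmd, attr, size);
    }

    int main(void)
    {
        union bpf_attr attr;
        uint32_t key = 1;
        uint64_t value = 42, out = 0;
        long fd;

        /* Create a hash map with u32 keys and u64 values. */
        memset(&attr, 0, sizeof(attr));
        attr.map_type    = BPF_MAP_TYPE_HASH;
        attr.key_size    = sizeof(key);
        attr.value_size  = sizeof(value);
        attr.max_entries = 256;
        fd = bpf_sys(BPF_MAP_CREATE, &attr, sizeof(attr));
        if (fd < 0)
            return 1;

        /* Insert key 1 -> 42 ... */
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = (uint64_t)(unsigned long)&key;
        attr.value  = (uint64_t)(unsigned long)&value;
        attr.flags  = BPF_ANY;
        bpf_sys(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));

        /* ... and read it back through the same file descriptor. */
        memset(&attr, 0, sizeof(attr));
        attr.map_fd = fd;
        attr.key    = (uint64_t)(unsigned long)&key;
        attr.value  = (uint64_t)(unsigned long)&out;
        bpf_sys(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));

        printf("value: %llu\n", (unsigned long long)out);
        return 0;
    }

The same ``BPF_MAP_UPDATE_ELEM`` / ``BPF_MAP_LOOKUP_ELEM`` commands operate on
any map file descriptor, including one retrieved from the BPF file system
through ``BPF_OBJ_GET`` as described in the next section.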
Object Pinning
--------------

.. image:: images/bpf_fs.png
    :align: center

BPF maps and programs act as a kernel resource and can only be accessed
through file descriptors, backed by anonymous inodes in the kernel. This
brings a number of advantages, but also a number of disadvantages:

User space applications can make use of most file descriptor related APIs, and
file descriptor passing for Unix domain sockets works transparently, but at
the same time, file descriptors are limited to a process's lifetime, which
makes options like map sharing rather cumbersome to carry out.

This brings a number of complications for certain use cases such as iproute2,
where tc or XDP sets up and loads the program into the kernel and eventually
terminates itself. With that, access to maps from user space is gone as well,
although it could otherwise be useful, for example, when maps are shared
between ingress and egress locations of the data path. Also, third party
applications may wish to monitor or update map contents during BPF program
runtime.

To overcome this limitation, a minimal kernel space BPF file system has been
implemented, to which BPF maps and programs can be pinned, a process called
object pinning. The BPF system call has therefore been extended with two new
commands which can pin (``BPF_OBJ_PIN``) or retrieve (``BPF_OBJ_GET``) a
previously pinned object.

For instance, tools such as tc make use of this infrastructure for sharing
maps on ingress and egress. The BPF related file system is not a singleton; it
supports multiple mount instances, hard and soft links, etc.

Tail Calls
----------

.. image:: images/bpf_tailcall.png
    :align: center

Another concept that can be used with BPF is called tail calls. Tail calls can
be seen as a mechanism that allows one BPF program to call another, without
returning to the old program. Such a call has minimal overhead: unlike
function calls, it is implemented as a long jump which reuses the same stack
frame.

Such programs are verified independently of each other, thus for transferring
state, either per-CPU maps as scratch buffers or, in case of tc programs,
``skb`` fields such as the ``cb[]`` area must be used.

Only programs of the same type can be tail called, and they also need to match
in terms of JIT compilation, thus either JIT compiled or only interpreted
programs can be invoked, but not mixed together.

There are two components involved for carrying out tail calls: the first part
needs to set up a specialized map called a program array
(``BPF_MAP_TYPE_PROG_ARRAY``) that can be populated by user space with key /
value pairs, where the values are the file descriptors of the tail called BPF
programs; the second part is the ``bpf_tail_call()`` helper, which is passed
the context, a reference to the program array and the lookup key. The kernel
inlines this helper call directly into a specialized BPF instruction. Such a
program array is currently write-only from the user space side.

The kernel looks up the related BPF program from the passed file descriptor
and atomically replaces program pointers at the given map slot. When no map
entry has been found at the provided key, the kernel will just "fall through"
and continue execution of the old program with the instructions following the
``bpf_tail_call()``. Tail calls are a powerful utility, for example, parsing
network headers could be structured through tail calls.
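As a rough sketch of these two components on the program side, the following
tc-style fragment declares a program array and tail calls into slot ``0``.
The map declaration is modeled on iproute2's ``struct bpf_elf_map`` (its
authoritative definition lives in iproute2's ``bpf_elf.h``); other loaders use
their own map declaration format, so treat the layout as illustrative:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)  __attribute__((section(NAME), used))
    #endif

    /* Loader-specific map declaration, modeled on iproute2's struct bpf_elf_map. */
    struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;
        __u32 pinning;
    };

    /* Program array: slots hold file descriptors of other BPF programs. */
    struct bpf_elf_map jmp_map __section("maps") = {
        .type       = BPF_MAP_TYPE_PROG_ARRAY,
        .size_key   = sizeof(__u32),
        .size_value = sizeof(__u32),
        .max_elem   = 1,
    };

    static int (*bpf_tail_call)(void *ctx, void *map, __u32 index) =
        (void *) BPF_FUNC_tail_call;

    __section("prog")
    int entry(struct __sk_buff *skb)
    {
        /* On success this never returns; the target program continues with
         * the caller's stack frame. On a missing slot we fall through below.
         */
        bpf_tail_call(skb, &jmp_map, 0);
        return 0;
    }

    char __license[] __section("license") = "GPL";

User space would populate slot ``0`` with the file descriptor of another
already loaded program of the same type; if the slot is empty or has been
cleared, execution simply falls through to the instruction after
``bpf_tail_call()``.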
At runtime, functionality can then be added or replaced atomically, thus
altering the BPF program's execution behavior.

.. _bpf_to_bpf_calls:

BPF to BPF Calls
----------------

.. image:: images/bpf_call.png
    :align: center

Aside from BPF helper calls and BPF tail calls, a more recent feature that has
been added to the BPF core infrastructure is BPF to BPF calls. Before this
feature was introduced into the kernel, a typical BPF C program had to declare
any reusable code that, for example, resides in headers as ``always_inline``
such that when LLVM compiles and generates the BPF object file all these
functions were inlined and therefore duplicated many times in the resulting
object file, artificially inflating its code size:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    #ifndef __inline
    # define __inline                         \
       inline __attribute__((always_inline))
    #endif

    static __inline int foo(void)
    {
        return XDP_DROP;
    }

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }

    char __license[] __section("license") = "GPL";

The main reason why this was necessary was the lack of function call support
in the BPF program loader as well as in the verifier, interpreter and JITs.
Starting with Linux kernel 4.16 and LLVM 6.0 this restriction got lifted and
BPF programs no longer need to use ``always_inline`` everywhere. Thus, the BPF
example code shown previously can be rewritten more naturally as:

::

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    static int foo(void)
    {
        return XDP_DROP;
    }

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }

    char __license[] __section("license") = "GPL";

Mainstream BPF JIT compilers like ``x86_64`` and ``arm64`` support BPF to BPF
calls today, with others following in the near future. BPF to BPF calls are an
important performance optimization since they heavily reduce the generated BPF
code size and therefore make programs friendlier to a CPU's instruction cache.

The calling convention known from BPF helper functions applies to BPF to BPF
calls just as well, meaning ``r1`` up to ``r5`` are for passing arguments to
the callee and the result is returned in ``r0``. ``r1`` to ``r5`` are scratch
registers whereas ``r6`` to ``r9`` are preserved across calls the usual way.
The maximum number of nested calls, that is, allowed call frames, is ``8``.
A caller can pass pointers (e.g. to the caller's stack frame) down to the
callee, but never vice versa.

BPF to BPF calls are currently incompatible with the use of BPF tail calls,
since the latter requires reusing the current stack setup as-is, whereas the
former adds additional stack frames and thus changes the expected layout for
tail calls.

BPF JIT compilers emit separate images for each function body and later fix up
the function call addresses in the image in a final JIT pass. This has proven
to require minimal changes to the JITs in that they can treat BPF to BPF calls
as conventional BPF helper calls.

JIT
---

..
image:: images/bpf_jit.png 566 :align: center 567 568 The 64 bit ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64`` 569 and 32 bit ``arm``, ``x86_32`` architectures are all shipped with an in-kernel 570 eBPF JIT compiler, also all of them are feature equivalent and can be enabled 571 through: 572 573 :: 574 575 # echo 1 > /proc/sys/net/core/bpf_jit_enable 576 577 The 32 bit ``mips``, ``ppc`` and ``sparc`` architectures currently have a cBPF 578 JIT compiler. The mentioned architectures still having a cBPF JIT as well as all 579 remaining architectures supported by the Linux kernel which do not have a BPF JIT 580 compiler at all need to run eBPF programs through the in-kernel interpreter. 581 582 In the kernel's source tree, eBPF JIT support can be easily determined through 583 issuing a grep for ``HAVE_EBPF_JIT``: 584 585 :: 586 587 # git grep HAVE_EBPF_JIT arch/ 588 arch/arm/Kconfig: select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32 589 arch/arm64/Kconfig: select HAVE_EBPF_JIT 590 arch/powerpc/Kconfig: select HAVE_EBPF_JIT if PPC64 591 arch/mips/Kconfig: select HAVE_EBPF_JIT if (64BIT && !CPU_MICROMIPS) 592 arch/s390/Kconfig: select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES 593 arch/sparc/Kconfig: select HAVE_EBPF_JIT if SPARC64 594 arch/x86/Kconfig: select HAVE_EBPF_JIT if X86_64 595 596 JIT compilers speed up execution of the BPF program significantly since they 597 reduce the per instruction cost compared to the interpreter. Often instructions 598 can be mapped 1:1 with native instructions of the underlying architecture. This 599 also reduces the resulting executable image size and is therefore more 600 instruction cache friendly to the CPU. In particular in case of CISC instruction 601 sets such as ``x86``, the JITs are optimized for emitting the shortest possible 602 opcodes for a given instruction to shrink the total necessary size for the 603 program translation. 604 605 Hardening 606 --------- 607 608 BPF locks the entire BPF interpreter image (``struct bpf_prog``) as well 609 as the JIT compiled image (``struct bpf_binary_header``) in the kernel as 610 read-only during the program's lifetime in order to prevent the code from 611 potential corruptions. Any corruption happening at that point, for example, 612 due to some kernel bugs will result in a general protection fault and thus 613 crash the kernel instead of allowing the corruption to happen silently. 614 615 Architectures that support setting the image memory as read-only can be 616 determined through: 617 618 :: 619 620 $ git grep ARCH_HAS_SET_MEMORY | grep select 621 arch/arm/Kconfig: select ARCH_HAS_SET_MEMORY 622 arch/arm64/Kconfig: select ARCH_HAS_SET_MEMORY 623 arch/s390/Kconfig: select ARCH_HAS_SET_MEMORY 624 arch/x86/Kconfig: select ARCH_HAS_SET_MEMORY 625 626 The option ``CONFIG_ARCH_HAS_SET_MEMORY`` is not configurable, thanks to 627 which this protection is always built-in. Other architectures might follow 628 in the future. 629 630 In case of the ``x86_64`` JIT compiler, the JITing of the indirect jump from 631 the use of tail calls is realized through a retpoline in case ``CONFIG_RETPOLINE`` 632 has been set which is the default at the time of writing in most modern Linux 633 distributions. 634 635 In case of ``/proc/sys/net/core/bpf_jit_harden`` set to ``1`` additional 636 hardening steps for the JIT compilation take effect for unprivileged users. 
637 This effectively trades off their performance slightly by decreasing a 638 (potential) attack surface in case of untrusted users operating on the 639 system. The decrease in program execution still results in better performance 640 compared to switching to interpreter entirely. 641 642 Currently, enabling hardening will blind all user provided 32 bit and 64 bit 643 constants from the BPF program when it gets JIT compiled in order to prevent 644 JIT spraying attacks which inject native opcodes as immediate values. This is 645 problematic as these immediate values reside in executable kernel memory, 646 therefore a jump that could be triggered from some kernel bug would jump to 647 the start of the immediate value and then execute these as native instructions. 648 649 JIT constant blinding prevents this due to randomizing the actual instruction, 650 which means the operation is transformed from an immediate based source operand 651 to a register based one through rewriting the instruction by splitting the 652 actual load of the value into two steps: 1) load of a blinded immediate 653 value ``rnd ^ imm`` into a register, 2) xoring that register with ``rnd`` 654 such that the original ``imm`` immediate then resides in the register and 655 can be used for the actual operation. The example was provided for a load 656 operation, but really all generic operations are blinded. 657 658 Example of JITing a program with hardening disabled: 659 660 :: 661 662 # echo 0 > /proc/sys/net/core/bpf_jit_harden 663 664 ffffffffa034f5e9 + <x>: 665 [...] 666 39: mov $0xa8909090,%eax 667 3e: mov $0xa8909090,%eax 668 43: mov $0xa8ff3148,%eax 669 48: mov $0xa89081b4,%eax 670 4d: mov $0xa8900bb0,%eax 671 52: mov $0xa810e0c1,%eax 672 57: mov $0xa8908eb4,%eax 673 5c: mov $0xa89020b0,%eax 674 [...] 675 676 The same program gets constant blinded when loaded through BPF 677 as an unprivileged user in the case hardening is enabled: 678 679 :: 680 681 # echo 1 > /proc/sys/net/core/bpf_jit_harden 682 683 ffffffffa034f1e5 + <x>: 684 [...] 685 39: mov $0xe1192563,%r10d 686 3f: xor $0x4989b5f3,%r10d 687 46: mov %r10d,%eax 688 49: mov $0xb8296d93,%r10d 689 4f: xor $0x10b9fd03,%r10d 690 56: mov %r10d,%eax 691 59: mov $0x8c381146,%r10d 692 5f: xor $0x24c7200e,%r10d 693 66: mov %r10d,%eax 694 69: mov $0xeb2a830e,%r10d 695 6f: xor $0x43ba02ba,%r10d 696 76: mov %r10d,%eax 697 79: mov $0xd9730af,%r10d 698 7f: xor $0xa5073b1f,%r10d 699 86: mov %r10d,%eax 700 89: mov $0x9a45662b,%r10d 701 8f: xor $0x325586ea,%r10d 702 96: mov %r10d,%eax 703 [...] 704 705 Both programs are semantically the same, only that none of the 706 original immediate values are visible anymore in the disassembly of 707 the second program. 708 709 At the same time, hardening also disables any JIT kallsyms exposure 710 for privileged users, preventing that JIT image addresses are not 711 exposed to ``/proc/kallsyms`` anymore. 712 713 Moreover, the Linux kernel provides the option ``CONFIG_BPF_JIT_ALWAYS_ON`` 714 which removes the entire BPF interpreter from the kernel and permanently 715 enables the JIT compiler. This has been developed as part of a mitigation 716 in the context of Spectre v2 such that when used in a VM-based setting, 717 the guest kernel is not going to reuse the host kernel's BPF interpreter 718 when mounting an attack anymore. 
For container-based environments, the 719 ``CONFIG_BPF_JIT_ALWAYS_ON`` configuration option is optional, but in 720 case JITs are enabled there anyway, the interpreter may as well be compiled 721 out to reduce the kernel's complexity. Thus, it is also generally 722 recommended for widely used JITs in case of main stream architectures 723 such as ``x86_64`` and ``arm64``. 724 725 Last but not least, the kernel offers an option to disable the use of 726 the ``bpf(2)`` system call for unprivileged users through the 727 ``/proc/sys/kernel/unprivileged_bpf_disabled`` sysctl knob. This is 728 on purpose a one-time kill switch, meaning once set to ``1``, there is 729 no option to reset it back to ``0`` until a new kernel reboot. When 730 set only ``CAP_SYS_ADMIN`` privileged processes out of the initial 731 namespace are allowed to use the ``bpf(2)`` system call from that 732 point onwards. Upon start, Cilium sets this knob to ``1`` as well. 733 734 :: 735 736 # echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled 737 738 Offloads 739 -------- 740 741 .. image:: images/bpf_offload.png 742 :align: center 743 744 Networking programs in BPF, in particular for tc and XDP do have an 745 offload-interface to hardware in the kernel in order to execute BPF 746 code directly on the NIC. 747 748 Currently, the ``nfp`` driver from Netronome has support for offloading 749 BPF through a JIT compiler which translates BPF instructions to an 750 instruction set implemented against the NIC. This includes offloading 751 of BPF maps to the NIC as well, thus the offloaded BPF program can 752 perform map lookups, updates and deletions. 753 754 Toolchain 755 ========= 756 757 Current user space tooling, introspection facilities and kernel control knobs around 758 BPF are discussed in this section. Note, the tooling and infrastructure around BPF 759 is still rapidly evolving and thus may not provide a complete picture of all available 760 tools. 761 762 Development Environment 763 ----------------------- 764 765 A step by step guide for setting up a development environment for BPF can be found 766 below for both Fedora and Ubuntu. This will guide you through building, installing 767 and testing a development kernel as well as building and installing iproute2. 768 769 The step of manually building iproute2 and Linux kernel is usually not necessary 770 given that major distributions already ship recent enough kernels by default, but 771 would be needed for testing bleeding edge versions or contributing BPF patches to 772 iproute2 and to the Linux kernel, respectively. Similarly, for debugging and 773 introspection purposes building bpftool is optional, but recommended. 774 775 Fedora 776 `````` 777 778 The following applies to Fedora 25 or later: 779 780 :: 781 782 $ sudo dnf install -y git gcc ncurses-devel elfutils-libelf-devel bc \ 783 openssl-devel libcap-devel clang llvm graphviz bison flex glibc-static 784 785 .. note:: If you are running some other Fedora derivative and ``dnf`` is missing, 786 try using ``yum`` instead. 
787 788 Ubuntu 789 `````` 790 791 The following applies to Ubuntu 17.04 or later: 792 793 :: 794 795 $ sudo apt-get install -y make gcc libssl-dev bc libelf-dev libcap-dev \ 796 clang gcc-multilib llvm libncurses5-dev git pkg-config libmnl-dev bison flex \ 797 graphviz 798 799 openSUSE Tumbleweed 800 ``````````````````` 801 802 The following applies to openSUSE Tumbleweed and openSUSE Leap 15.0 or later: 803 804 :: 805 806 $ sudo zypper install -y git gcc ncurses-devel libelf-devel bc libopenssl-devel \ 807 libcap-devel clang llvm graphviz bison flex glibc-devel-static 808 809 Compiling the Kernel 810 ```````````````````` 811 812 Development of new BPF features for the Linux kernel happens inside the ``net-next`` 813 git tree, latest BPF fixes in the ``net`` tree. The following command will obtain 814 the kernel source for the ``net-next`` tree through git: 815 816 :: 817 818 $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next.git 819 820 If the git commit history is not of interest, then ``--depth 1`` will clone the 821 tree much faster by truncating the git history only to the most recent commit. 822 823 In case the ``net`` tree is of interest, it can be cloned from this url: 824 825 :: 826 827 $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/davem/net.git 828 829 There are dozens of tutorials in the Internet on how to build Linux kernels, one 830 good resource is the Kernel Newbies website (https://kernelnewbies.org/KernelBuild) 831 that can be followed with one of the two git trees mentioned above. 832 833 Make sure that the generated ``.config`` file contains the following ``CONFIG_*`` 834 entries for running BPF. These entries are also needed for Cilium. 835 836 :: 837 838 CONFIG_CGROUP_BPF=y 839 CONFIG_BPF=y 840 CONFIG_BPF_SYSCALL=y 841 CONFIG_NET_SCH_INGRESS=m 842 CONFIG_NET_CLS_BPF=m 843 CONFIG_NET_CLS_ACT=y 844 CONFIG_BPF_JIT=y 845 CONFIG_LWTUNNEL_BPF=y 846 CONFIG_HAVE_EBPF_JIT=y 847 CONFIG_BPF_EVENTS=y 848 CONFIG_TEST_BPF=m 849 850 Some of the entries cannot be adjusted through ``make menuconfig``. For example, 851 ``CONFIG_HAVE_EBPF_JIT`` is selected automatically if a given architecture does 852 come with an eBPF JIT. In this specific case, ``CONFIG_HAVE_EBPF_JIT`` is optional 853 but highly recommended. An architecture not having an eBPF JIT compiler will need 854 to fall back to the in-kernel interpreter with the cost of being less efficient 855 executing BPF instructions. 856 857 Verifying the Setup 858 ``````````````````` 859 860 After you have booted into the newly compiled kernel, navigate to the BPF selftest 861 suite in order to test BPF functionality (current working directory points to 862 the root of the cloned git tree): 863 864 :: 865 866 $ cd tools/testing/selftests/bpf/ 867 $ make 868 $ sudo ./test_verifier 869 870 The verifier tests print out all the current checks being performed. The summary 871 at the end of running all tests will dump information of test successes and 872 failures: 873 874 :: 875 876 Summary: 847 PASSED, 0 SKIPPED, 0 FAILED 877 878 .. note:: For kernel releases 4.16+ the BPF selftest has a dependency on LLVM 6.0+ 879 caused by the BPF function calls which do not need to be inlined 880 anymore. See section :ref:`bpf_to_bpf_calls` or the cover letter mail 881 from the kernel patch (https://lwn.net/Articles/741773/) for more information. 882 Not every BPF program has a dependency on LLVM 6.0+ if it does not 883 use this new feature. 
If your distribution does not provide LLVM 6.0+ 884 you may compile it by following the instruction in the :ref:`tooling_llvm` 885 section. 886 887 In order to run through all BPF selftests, the following command is needed: 888 889 :: 890 891 $ sudo make run_tests 892 893 If you see any failures, please contact us on Slack with the full test output. 894 895 Compiling iproute2 896 `````````````````` 897 898 Similar to the ``net`` (fixes only) and ``net-next`` (new features) kernel trees, 899 the iproute2 git tree has two branches, namely ``master`` and ``net-next``. The 900 ``master`` branch is based on the ``net`` tree and the ``net-next`` branch is 901 based against the ``net-next`` kernel tree. This is necessary, so that changes 902 in header files can be synchronized in the iproute2 tree. 903 904 In order to clone the iproute2 ``master`` branch, the following command can 905 be used: 906 907 :: 908 909 $ git clone git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git 910 911 Similarly, to clone into mentioned ``net-next`` branch of iproute2, run the 912 following: 913 914 :: 915 916 $ git clone -b net-next git://git.kernel.org/pub/scm/linux/kernel/git/iproute2/iproute2.git 917 918 After that, proceed with the build and installation: 919 920 :: 921 922 $ cd iproute2/ 923 $ ./configure --prefix=/usr 924 TC schedulers 925 ATM no 926 927 libc has setns: yes 928 SELinux support: yes 929 ELF support: yes 930 libmnl support: no 931 Berkeley DB: no 932 933 docs: latex: no 934 WARNING: no docs can be built from LaTeX files 935 sgml2html: no 936 WARNING: no HTML docs can be built from SGML 937 $ make 938 [...] 939 $ sudo make install 940 941 Ensure that the ``configure`` script shows ``ELF support: yes``, so that iproute2 942 can process ELF files from LLVM's BPF back end. libelf was listed in the instructions 943 for installing the dependencies in case of Fedora and Ubuntu earlier. 944 945 Compiling bpftool 946 ````````````````` 947 948 bpftool is an essential tool around debugging and introspection of BPF programs 949 and maps. It is part of the kernel tree and available under ``tools/bpf/bpftool/``. 950 951 Make sure to have cloned either the ``net`` or ``net-next`` kernel tree as described 952 earlier. In order to build and install bpftool, the following steps are required: 953 954 :: 955 956 $ cd <kernel-tree>/tools/bpf/bpftool/ 957 $ make 958 Auto-detecting system features: 959 ... libbfd: [ on ] 960 ... disassembler-four-args: [ OFF ] 961 962 CC xlated_dumper.o 963 CC prog.o 964 CC common.o 965 CC cgroup.o 966 CC main.o 967 CC json_writer.o 968 CC cfg.o 969 CC map.o 970 CC jit_disasm.o 971 CC disasm.o 972 make[1]: Entering directory '/home/foo/trees/net/tools/lib/bpf' 973 974 Auto-detecting system features: 975 ... libelf: [ on ] 976 ... bpf: [ on ] 977 978 CC libbpf.o 979 CC bpf.o 980 CC nlattr.o 981 LD libbpf-in.o 982 LINK libbpf.a 983 make[1]: Leaving directory '/home/foo/trees/bpf/tools/lib/bpf' 984 LINK bpftool 985 $ sudo make install 986 987 .. _tooling_llvm: 988 989 LLVM 990 ---- 991 992 LLVM is currently the only compiler suite providing a BPF back end. gcc does 993 not support BPF at this point. 994 995 The BPF back end was merged into LLVM's 3.7 release. Major distributions enable 996 the BPF back end by default when they package LLVM, therefore installing clang 997 and llvm is sufficient on most recent distributions to start compiling C 998 into BPF object files. 
999 1000 The typical workflow is that BPF programs are written in C, compiled by LLVM 1001 into object / ELF files, which are parsed by user space BPF ELF loaders (such as 1002 iproute2 or others), and pushed into the kernel through the BPF system call. 1003 The kernel verifies the BPF instructions and JITs them, returning a new file 1004 descriptor for the program, which then can be attached to a subsystem (e.g. 1005 networking). If supported, the subsystem could then further offload the BPF 1006 program to hardware (e.g. NIC). 1007 1008 For LLVM, BPF target support can be checked, for example, through the following: 1009 1010 :: 1011 1012 $ llc --version 1013 LLVM (http://llvm.org/): 1014 LLVM version 3.8.1 1015 Optimized build. 1016 Default target: x86_64-unknown-linux-gnu 1017 Host CPU: skylake 1018 1019 Registered Targets: 1020 [...] 1021 bpf - BPF (host endian) 1022 bpfeb - BPF (big endian) 1023 bpfel - BPF (little endian) 1024 [...] 1025 1026 By default, the ``bpf`` target uses the endianness of the CPU it compiles on, 1027 meaning that if the CPU's endianness is little endian, the program is represented 1028 in little endian format as well, and if the CPU's endianness is big endian, 1029 the program is represented in big endian. This also matches the runtime behavior 1030 of BPF, which is generic and uses the CPU's endianness it runs on in order 1031 to not disadvantage architectures in any of the format. 1032 1033 For cross-compilation, the two targets ``bpfeb`` and ``bpfel`` were introduced, 1034 thanks to that BPF programs can be compiled on a node running in one endianness 1035 (e.g. little endian on x86) and run on a node in another endianness format (e.g. 1036 big endian on arm). Note that the front end (clang) needs to run in the target 1037 endianness as well. 1038 1039 Using ``bpf`` as a target is the preferred way in situations where no mixture of 1040 endianness applies. For example, compilation on ``x86_64`` results in the same 1041 output for the targets ``bpf`` and ``bpfel`` due to being little endian, therefore 1042 scripts triggering a compilation also do not have to be endian aware. 1043 1044 A minimal, stand-alone XDP drop program might look like the following example 1045 (``xdp-example.c``): 1046 1047 :: 1048 1049 #include <linux/bpf.h> 1050 1051 #ifndef __section 1052 # define __section(NAME) \ 1053 __attribute__((section(NAME), used)) 1054 #endif 1055 1056 __section("prog") 1057 int xdp_drop(struct xdp_md *ctx) 1058 { 1059 return XDP_DROP; 1060 } 1061 1062 char __license[] __section("license") = "GPL"; 1063 1064 It can then be compiled and loaded into the kernel as follows: 1065 1066 :: 1067 1068 $ clang -O2 -Wall -target bpf -c xdp-example.c -o xdp-example.o 1069 # ip link set dev em1 xdp obj xdp-example.o 1070 1071 .. note:: Attaching an XDP BPF program to a network device as above requires 1072 Linux 4.11 with a device that supports XDP, or Linux 4.12 or later. 1073 1074 For the generated object file LLVM (>= 3.9) uses the official BPF machine value, 1075 that is, ``EM_BPF`` (decimal: ``247`` / hex: ``0xf7``). 
In this example, the program 1076 has been compiled with ``bpf`` target under ``x86_64``, therefore ``LSB`` (as opposed 1077 to ``MSB``) is shown regarding endianness: 1078 1079 :: 1080 1081 $ file xdp-example.o 1082 xdp-example.o: ELF 64-bit LSB relocatable, *unknown arch 0xf7* version 1 (SYSV), not stripped 1083 1084 ``readelf -a xdp-example.o`` will dump further information about the ELF file, which can 1085 sometimes be useful for introspecting generated section headers, relocation entries 1086 and the symbol table. 1087 1088 In the unlikely case where clang and LLVM need to be compiled from scratch, the 1089 following commands can be used: 1090 1091 :: 1092 1093 $ git clone http://llvm.org/git/llvm.git 1094 $ cd llvm/tools 1095 $ git clone --depth 1 http://llvm.org/git/clang.git 1096 $ cd ..; mkdir build; cd build 1097 $ cmake .. -DLLVM_TARGETS_TO_BUILD="BPF;X86" -DBUILD_SHARED_LIBS=OFF -DCMAKE_BUILD_TYPE=Release -DLLVM_BUILD_RUNTIME=OFF 1098 $ make -j $(getconf _NPROCESSORS_ONLN) 1099 1100 $ ./bin/llc --version 1101 LLVM (http://llvm.org/): 1102 LLVM version x.y.zsvn 1103 Optimized build. 1104 Default target: x86_64-unknown-linux-gnu 1105 Host CPU: skylake 1106 1107 Registered Targets: 1108 bpf - BPF (host endian) 1109 bpfeb - BPF (big endian) 1110 bpfel - BPF (little endian) 1111 x86 - 32-bit X86: Pentium-Pro and above 1112 x86-64 - 64-bit X86: EM64T and AMD64 1113 1114 $ export PATH=$PWD/bin:$PATH # add to ~/.bashrc 1115 1116 Make sure that ``--version`` mentions ``Optimized build.``, otherwise the 1117 compilation time for programs when having LLVM in debugging mode will 1118 significantly increase (e.g. by 10x or more). 1119 1120 For debugging, clang can generate the assembler output as follows: 1121 1122 :: 1123 1124 $ clang -O2 -S -Wall -target bpf -c xdp-example.c -o xdp-example.S 1125 $ cat xdp-example.S 1126 .text 1127 .section prog,"ax",@progbits 1128 .globl xdp_drop 1129 .p2align 3 1130 xdp_drop: # @xdp_drop 1131 # BB#0: 1132 r0 = 1 1133 exit 1134 1135 .section license,"aw",@progbits 1136 .globl __license # @__license 1137 __license: 1138 .asciz "GPL" 1139 1140 Starting from LLVM's release 6.0, there is also assembler parser support. You can 1141 program using BPF assembler directly, then use llvm-mc to assemble it into an 1142 object file. For example, you can assemble the xdp-example.S listed above back 1143 into object file using: 1144 1145 :: 1146 1147 $ llvm-mc -triple bpf -filetype=obj -o xdp-example.o xdp-example.S 1148 1149 Furthermore, more recent LLVM versions (>= 4.0) can also store debugging 1150 information in dwarf format into the object file. This can be done through 1151 the usual workflow by adding ``-g`` for compilation. 1152 1153 :: 1154 1155 $ clang -O2 -g -Wall -target bpf -c xdp-example.c -o xdp-example.o 1156 $ llvm-objdump -S -no-show-raw-insn xdp-example.o 1157 1158 xdp-example.o: file format ELF64-BPF 1159 1160 Disassembly of section prog: 1161 xdp_drop: 1162 ; { 1163 0: r0 = 1 1164 ; return XDP_DROP; 1165 1: exit 1166 1167 The ``llvm-objdump`` tool can then annotate the assembler output with the 1168 original C code used in the compilation. The trivial example in this case 1169 does not contain much C code, however, the line numbers shown as ``0:`` 1170 and ``1:`` correspond directly to the kernel's verifier log. 1171 1172 This means that in case BPF programs get rejected by the verifier, ``llvm-objdump`` 1173 can help to correlate the instructions back to the original C code, which is 1174 highly useful for analysis. 
::

    # ip link set dev em1 xdp obj xdp-example.o verb

    Prog section 'prog' loaded (5)!
     - Type:         6
     - Instructions: 2 (0 over limit)
     - License:      GPL

    Verifier analysis:

    0: (b7) r0 = 1
    1: (95) exit
    processed 2 insns

As can be seen in the verifier analysis, the ``llvm-objdump`` output dumps the
same BPF assembler code as the kernel.

Leaving out the ``-no-show-raw-insn`` option will also dump the raw
``struct bpf_insn`` as hex in front of the assembly:

::

    $ llvm-objdump -S xdp-example.o

    xdp-example.o:        file format ELF64-BPF

    Disassembly of section prog:
    xdp_drop:
    ; {
        0:       b7 00 00 00 01 00 00 00         r0 = 1
    ; return foo();
        1:       95 00 00 00 00 00 00 00         exit

For LLVM IR debugging, the compilation process for BPF can be split into two
steps, generating a binary LLVM IR intermediate file ``xdp-example.bc``, which
can later on be passed to llc:

::

    $ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc
    $ llc xdp-example.bc -march=bpf -filetype=obj -o xdp-example.o

The generated LLVM IR can also be dumped in human readable format through:

::

    $ clang -O2 -Wall -emit-llvm -S -c xdp-example.c -o -

LLVM is able to attach debug information such as the description of used data
types in the program to the generated BPF object file. By default this is in
DWARF format.

A heavily simplified version used by BPF is called BTF (BPF Type Format). The
resulting DWARF can be converted into BTF and is later on loaded into the
kernel through BPF object loaders. The kernel will then verify the BTF data
for correctness and keep track of the data types the BTF data contains.

BPF maps can then be annotated with key and value types out of the BTF data
such that a later dump of the map exports the map data along with the related
type information. This allows for better introspection, debugging and value
pretty printing. Note that BTF data is a generic debugging data format and as
such any DWARF to BTF converted data can be loaded (e.g. the kernel's vmlinux
DWARF data could be converted to BTF and loaded). The latter is particularly
useful for BPF tracing in the future.

In order to generate BTF from DWARF debugging information, elfutils (>= 0.173)
is needed. If that is not available, then adding the ``-mattr=dwarfris`` option
to the ``llc`` command is required during compilation:

::

    $ llc -march=bpf -mattr=help |& grep dwarfris
      dwarfris - Disable MCAsmInfo DwarfUsesRelocationsAcrossSections.
    [...]

The reason for using ``-mattr=dwarfris`` is that the flag ``dwarfris``
(``dwarf relocation in section``) disables DWARF cross-section relocations
between DWARF and the ELF's symbol table since libdw does not have proper BPF
relocation support, and therefore tools like ``pahole`` would otherwise not be
able to properly dump structures from the object.

elfutils (>= 0.173) implements proper BPF relocation support and therefore
the same can be achieved without the ``-mattr=dwarfris`` option. Dumping
the structures from the object file could be done from either DWARF or BTF
information.
``pahole`` uses the LLVM emitted DWARF information at this 1262 point, however, future ``pahole`` versions could rely on BTF if available. 1263 1264 For converting DWARF into BTF, a recent pahole version (>= 1.12) is required. 1265 A recent pahole version can also be obtained from its official git repository 1266 if not available from one of the distribution packages: 1267 1268 :: 1269 1270 $ git clone https://git.kernel.org/pub/scm/devel/pahole/pahole.git 1271 1272 ``pahole`` comes with the option ``-J`` to convert DWARF into BTF from an 1273 object file. ``pahole`` can be probed for BTF support as follows (note that 1274 the ``llvm-objcopy`` tool is required for ``pahole`` as well, so check its 1275 presence, too): 1276 1277 :: 1278 1279 $ pahole --help | grep BTF 1280 -J, --btf_encode Encode as BTF 1281 1282 Generating debugging information also requires the front end to generate 1283 source level debug information by passing ``-g`` to the ``clang`` command 1284 line. Note that ``-g`` is needed independently of whether ``llc``'s 1285 ``dwarfris`` option is used. Full example for generating the object file: 1286 1287 :: 1288 1289 $ clang -O2 -g -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc 1290 $ llc xdp-example.bc -march=bpf -mattr=dwarfris -filetype=obj -o xdp-example.o 1291 1292 Alternatively, by using clang only to build a BPF program with debugging 1293 information (again, the dwarfris flag can be omitted when having proper 1294 elfutils version): 1295 1296 :: 1297 1298 $ clang -target bpf -O2 -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o 1299 1300 After successful compilation ``pahole`` can be used to properly dump structures 1301 of the BPF program based on the DWARF information: 1302 1303 :: 1304 1305 $ pahole xdp-example.o 1306 struct xdp_md { 1307 __u32 data; /* 0 4 */ 1308 __u32 data_end; /* 4 4 */ 1309 __u32 data_meta; /* 8 4 */ 1310 1311 /* size: 12, cachelines: 1, members: 3 */ 1312 /* last cacheline: 12 bytes */ 1313 }; 1314 1315 Through the option ``-J`` ``pahole`` can eventually generate the BTF from 1316 DWARF. In the object file DWARF data will still be retained alongside the 1317 newly added BTF data. Full ``clang`` and ``pahole`` example combined: 1318 1319 :: 1320 1321 $ clang -target bpf -O2 -Wall -g -c -Xclang -target-feature -Xclang +dwarfris -c xdp-example.c -o xdp-example.o 1322 $ pahole -J xdp-example.o 1323 1324 The presence of a ``.BTF`` section can be seen through ``readelf`` tool: 1325 1326 :: 1327 1328 $ readelf -a xdp-example.o 1329 [...] 1330 [18] .BTF PROGBITS 0000000000000000 00000671 1331 [...] 1332 1333 BPF loaders such as iproute2 will detect and load the BTF section, so that 1334 BPF maps can be annotated with type information. 1335 1336 LLVM by default uses the BPF base instruction set for generating code 1337 in order to make sure that the generated object file can also be loaded 1338 with older kernels such as long-term stable kernels (e.g. 4.9+). 1339 1340 However, LLVM has a ``-mcpu`` selector for the BPF back end in order to 1341 select different versions of the BPF instruction set, namely instruction 1342 set extensions on top of the BPF base instruction set in order to generate 1343 more efficient and smaller code. 1344 1345 Available ``-mcpu`` options can be queried through: 1346 1347 :: 1348 1349 $ llc -march bpf -mcpu=help 1350 Available CPUs for this target: 1351 1352 generic - Select the generic processor. 1353 probe - Select the probe processor. 1354 v1 - Select the v1 processor. 
1355 v2 - Select the v2 processor. 1356 [...] 1357 1358 The ``generic`` processor is the default processor, which is also the 1359 base instruction set ``v1`` of BPF. Options ``v1`` and ``v2`` are typically 1360 useful in an environment where the BPF program is being cross compiled 1361 and the target host where the program is loaded differs from the one 1362 where it is compiled (and thus available BPF kernel features might differ 1363 as well). 1364 1365 The recommended ``-mcpu`` option which is also used by Cilium internally is 1366 ``-mcpu=probe``! Here, the LLVM BPF back end queries the kernel for availability 1367 of BPF instruction set extensions and when found available, LLVM will use 1368 them for compiling the BPF program whenever appropriate. 1369 1370 A full command line example with llc's ``-mcpu=probe``: 1371 1372 :: 1373 1374 $ clang -O2 -Wall -target bpf -emit-llvm -c xdp-example.c -o xdp-example.bc 1375 $ llc xdp-example.bc -march=bpf -mcpu=probe -filetype=obj -o xdp-example.o 1376 1377 Generally, LLVM IR generation is architecture independent. There are 1378 however a few differences when using ``clang -target bpf`` versus 1379 leaving ``-target bpf`` out and thus using clang's default target which, 1380 depending on the underlying architecture, might be ``x86_64``, ``arm64`` 1381 or others. 1382 1383 Quoting from the kernel's ``Documentation/bpf/bpf_devel_QA.txt``: 1384 1385 * BPF programs may recursively include header file(s) with file scope 1386 inline assembly codes. The default target can handle this well, while 1387 bpf target may fail if bpf backend assembler does not understand 1388 these assembly codes, which is true in most cases. 1389 1390 * When compiled without -g, additional elf sections, e.g., ``.eh_frame`` 1391 and ``.rela.eh_frame``, may be present in the object file with default 1392 target, but not with bpf target. 1393 1394 * The default target may turn a C switch statement into a switch table 1395 lookup and jump operation. Since the switch table is placed in the 1396 global read-only section, the bpf program will fail to load. 1397 The bpf target does not support switch table optimization. The clang 1398 option ``-fno-jump-tables`` can be used to disable switch table 1399 generation. 1400 1401 * For clang ``-target bpf``, it is guaranteed that pointer or long / 1402 unsigned long types will always have a width of 64 bit, no matter 1403 whether underlying clang binary or default target (or kernel) is 1404 32 bit. However, when native clang target is used, then it will 1405 compile these types based on the underlying architecture's 1406 conventions, meaning in case of 32 bit architecture, pointer or 1407 long / unsigned long types e.g. in BPF context structure will have 1408 width of 32 bit while the BPF LLVM back end still operates in 64 bit. 1409 1410 The native target is mostly needed in tracing for the case of walking 1411 the kernel's ``struct pt_regs`` that maps CPU registers, or other kernel 1412 structures where CPU's register width matters. In all other cases such 1413 as networking, the use of ``clang -target bpf`` is the preferred choice. 1414 1415 Also, LLVM started to support 32-bit subregisters and BPF ALU32 instructions since 1416 LLVM's release 7.0. A new code generation attribute ``alu32`` is added. When it is 1417 enabled, LLVM will try to use 32-bit subregisters whenever possible, typically 1418 when there are operations on 32-bit types. The associated ALU instructions with 1419 32-bit subregisters will become ALU32 instructions. 
For example, take the following sample code:

::

    $ cat 32-bit-example.c
    void cal(unsigned int *a, unsigned int *b, unsigned int *c)
    {
      unsigned int sum = *a + *b;
      *c = sum;
    }

With default code generation, the assembler output will look like:

::

    $ clang -target bpf -emit-llvm -S 32-bit-example.c
    $ llc -march=bpf 32-bit-example.ll
    $ cat 32-bit-example.s
    cal:
            r1 = *(u32 *)(r1 + 0)
            r2 = *(u32 *)(r2 + 0)
            r2 += r1
            *(u32 *)(r3 + 0) = r2
            exit

64-bit registers are used, hence the addition is a 64-bit addition. Now, if you
enable the new 32-bit subregister support by specifying ``-mattr=+alu32``, then
the assembler output will look like:

::

    $ llc -march=bpf -mattr=+alu32 32-bit-example.ll
    $ cat 32-bit-example.s
    cal:
            w1 = *(u32 *)(r1 + 0)
            w2 = *(u32 *)(r2 + 0)
            w2 += w1
            *(u32 *)(r3 + 0) = w2
            exit

The ``w`` registers, meaning 32-bit subregisters, are used instead of the
64-bit ``r`` registers.

Enabling 32-bit subregisters might help to reduce type extension instruction
sequences. It can also help the kernel's eBPF JIT compilers for 32-bit
architectures, which use register pairs to model the 64-bit eBPF registers and
need extra instructions to manipulate the upper 32 bits. A read from a 32-bit
subregister is guaranteed to read only the lower 32 bits, while a write still
needs to clear the upper 32 bits. Therefore, if the JIT compiler knows that the
definition of a register is only ever consumed through subregister reads, it
can eliminate the instructions which set the upper 32 bits of the destination.

When writing C programs for BPF, there are a couple of pitfalls to be aware
of, compared to usual application development with C. The following items
describe some of the differences for the BPF model:

1. **Everything needs to be inlined, there are no function calls (on older
   LLVM versions) or shared library calls available.**

   Shared libraries, etc. cannot be used with BPF. However, common library
   code used in BPF programs can be placed into header files and included in
   the main programs; Cilium, for example, makes heavy use of this (see
   ``bpf/lib/``). Header files from the kernel or other libraries can also be
   included in order to reuse their static inline functions or macros /
   definitions.

   Unless a recent kernel (4.16+) and LLVM (6.0+) are used, where BPF to BPF
   function calls are supported, LLVM needs to compile and inline the entire
   code into a flat sequence of BPF instructions for a given program section.
   In that case, best practice is to use an annotation like ``__inline`` for
   every library function as shown below. The use of ``always_inline`` is
   recommended, since the compiler could otherwise still decide to uninline
   large functions that are only annotated as ``inline``.

   If the latter happens, LLVM will generate a relocation entry into the ELF
   file, which BPF ELF loaders such as iproute2 cannot resolve and will thus
   produce an error, since only BPF maps are valid relocation entries which
   loaders can process.
1498 1499 :: 1500 1501 #include <linux/bpf.h> 1502 1503 #ifndef __section 1504 # define __section(NAME) \ 1505 __attribute__((section(NAME), used)) 1506 #endif 1507 1508 #ifndef __inline 1509 # define __inline \ 1510 inline __attribute__((always_inline)) 1511 #endif 1512 1513 static __inline int foo(void) 1514 { 1515 return XDP_DROP; 1516 } 1517 1518 __section("prog") 1519 int xdp_drop(struct xdp_md *ctx) 1520 { 1521 return foo(); 1522 } 1523 1524 char __license[] __section("license") = "GPL"; 1525 1526 2. **Multiple programs can reside inside a single C file in different sections.** 1527 1528 C programs for BPF make heavy use of section annotations. A C file is 1529 typically structured into 3 or more sections. BPF ELF loaders use these 1530 names to extract and prepare the relevant information in order to load 1531 the programs and maps through the bpf system call. For example, iproute2 1532 uses ``maps`` and ``license`` as default section name to find metadata 1533 needed for map creation and the license for the BPF program, respectively. 1534 On program creation time the latter is pushed into the kernel as well, 1535 and enables some of the helper functions which are exposed as GPL only 1536 in case the program also holds a GPL compatible license, for example 1537 ``bpf_ktime_get_ns()``, ``bpf_probe_read()`` and others. 1538 1539 The remaining section names are specific for BPF program code, for example, 1540 the below code has been modified to contain two program sections, ``ingress`` 1541 and ``egress``. The toy example code demonstrates that both can share a map 1542 and common static inline helpers such as the ``account_data()`` function. 1543 1544 The ``xdp-example.c`` example has been modified to a ``tc-example.c`` 1545 example that can be loaded with tc and attached to a netdevice's ingress 1546 and egress hook. It accounts the transferred bytes into a map called 1547 ``acc_map``, which has two map slots, one for traffic accounted on the 1548 ingress hook, one on the egress hook. 1549 1550 :: 1551 1552 #include <linux/bpf.h> 1553 #include <linux/pkt_cls.h> 1554 #include <stdint.h> 1555 #include <iproute2/bpf_elf.h> 1556 1557 #ifndef __section 1558 # define __section(NAME) \ 1559 __attribute__((section(NAME), used)) 1560 #endif 1561 1562 #ifndef __inline 1563 # define __inline \ 1564 inline __attribute__((always_inline)) 1565 #endif 1566 1567 #ifndef lock_xadd 1568 # define lock_xadd(ptr, val) \ 1569 ((void)__sync_fetch_and_add(ptr, val)) 1570 #endif 1571 1572 #ifndef BPF_FUNC 1573 # define BPF_FUNC(NAME, ...) 
\ 1574 (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME 1575 #endif 1576 1577 static void *BPF_FUNC(map_lookup_elem, void *map, const void *key); 1578 1579 struct bpf_elf_map acc_map __section("maps") = { 1580 .type = BPF_MAP_TYPE_ARRAY, 1581 .size_key = sizeof(uint32_t), 1582 .size_value = sizeof(uint32_t), 1583 .pinning = PIN_GLOBAL_NS, 1584 .max_elem = 2, 1585 }; 1586 1587 static __inline int account_data(struct __sk_buff *skb, uint32_t dir) 1588 { 1589 uint32_t *bytes; 1590 1591 bytes = map_lookup_elem(&acc_map, &dir); 1592 if (bytes) 1593 lock_xadd(bytes, skb->len); 1594 1595 return TC_ACT_OK; 1596 } 1597 1598 __section("ingress") 1599 int tc_ingress(struct __sk_buff *skb) 1600 { 1601 return account_data(skb, 0); 1602 } 1603 1604 __section("egress") 1605 int tc_egress(struct __sk_buff *skb) 1606 { 1607 return account_data(skb, 1); 1608 } 1609 1610 char __license[] __section("license") = "GPL"; 1611 1612 The example also demonstrates a couple of other things which are useful 1613 to be aware of when developing programs. The code includes kernel headers, 1614 standard C headers and an iproute2 specific header containing the 1615 definition of ``struct bpf_elf_map``. iproute2 has a common BPF ELF loader 1616 and as such the definition of ``struct bpf_elf_map`` is the very same for 1617 XDP and tc typed programs. 1618 1619 A ``struct bpf_elf_map`` entry defines a map in the program and contains 1620 all relevant information (such as key / value size, etc) needed to generate 1621 a map which is used from the two BPF programs. The structure must be placed 1622 into the ``maps`` section, so that the loader can find it. There can be 1623 multiple map declarations of this type with different variable names, but 1624 all must be annotated with ``__section("maps")``. 1625 1626 The ``struct bpf_elf_map`` is specific to iproute2. Different BPF ELF 1627 loaders can have different formats, for example, the libbpf in the kernel 1628 source tree, which is mainly used by ``perf``, has a different specification. 1629 iproute2 guarantees backwards compatibility for ``struct bpf_elf_map``. 1630 Cilium follows the iproute2 model. 1631 1632 The example also demonstrates how BPF helper functions are mapped into 1633 the C code and being used. Here, ``map_lookup_elem()`` is defined by 1634 mapping this function into the ``BPF_FUNC_map_lookup_elem`` enum value 1635 which is exposed as a helper in ``uapi/linux/bpf.h``. When the program is later 1636 loaded into the kernel, the verifier checks whether the passed arguments 1637 are of the expected type and re-points the helper call into a real 1638 function call. Moreover, ``map_lookup_elem()`` also demonstrates how 1639 maps can be passed to BPF helper functions. Here, ``&acc_map`` from the 1640 ``maps`` section is passed as the first argument to ``map_lookup_elem()``. 1641 1642 Since the defined array map is global, the accounting needs to use an 1643 atomic operation, which is defined as ``lock_xadd()``. LLVM maps 1644 ``__sync_fetch_and_add()`` as a built-in function to the BPF atomic 1645 add instruction, that is, ``BPF_STX | BPF_XADD | BPF_W`` for word sizes. 1646 1647 Last but not least, the ``struct bpf_elf_map`` tells that the map is to 1648 be pinned as ``PIN_GLOBAL_NS``. This means that tc will pin the map 1649 into the BPF pseudo file system as a node. By default, it will be pinned 1650 to ``/sys/fs/bpf/tc/globals/acc_map`` for the given example. Due to the 1651 ``PIN_GLOBAL_NS``, the map will be placed under ``/sys/fs/bpf/tc/globals/``. 
1652 ``globals`` acts as a global namespace that spans across object files. 1653 If the example used ``PIN_OBJECT_NS``, then tc would create a directory 1654 that is local to the object file. For example, different C files with 1655 BPF code could have the same ``acc_map`` definition as above with a 1656 ``PIN_GLOBAL_NS`` pinning. In that case, the map will be shared among 1657 BPF programs originating from various object files. ``PIN_NONE`` would 1658 mean that the map is not placed into the BPF file system as a node, 1659 and as a result will not be accessible from user space after tc quits. It 1660 would also mean that tc creates two separate map instances for each 1661 program, since it cannot retrieve a previously pinned map under that 1662 name. The ``acc_map`` part from the mentioned path is the name of the 1663 map as specified in the source code. 1664 1665 Thus, upon loading of the ``ingress`` program, tc will find that no such 1666 map exists in the BPF file system and creates a new one. On success, the 1667 map will also be pinned, so that when the ``egress`` program is loaded 1668 through tc, it will find that such map already exists in the BPF file 1669 system and will reuse that for the ``egress`` program. The loader also 1670 makes sure in case maps exist with the same name that also their properties 1671 (key / value size, etc) match. 1672 1673 Just like tc can retrieve the same map, also third party applications 1674 can use the ``BPF_OBJ_GET`` command from the bpf system call in order 1675 to create a new file descriptor pointing to the same map instance, which 1676 can then be used to lookup / update / delete map elements. 1677 1678 The code can be compiled and loaded via iproute2 as follows: 1679 1680 :: 1681 1682 $ clang -O2 -Wall -target bpf -c tc-example.c -o tc-example.o 1683 1684 # tc qdisc add dev em1 clsact 1685 # tc filter add dev em1 ingress bpf da obj tc-example.o sec ingress 1686 # tc filter add dev em1 egress bpf da obj tc-example.o sec egress 1687 1688 # tc filter show dev em1 ingress 1689 filter protocol all pref 49152 bpf 1690 filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[ingress] direct-action id 1 tag c5f7825e5dac396f 1691 1692 # tc filter show dev em1 egress 1693 filter protocol all pref 49152 bpf 1694 filter protocol all pref 49152 bpf handle 0x1 tc-example.o:[egress] direct-action id 2 tag b2fd5adc0f262714 1695 1696 # mount | grep bpf 1697 sysfs on /sys/fs/bpf type sysfs (rw,nosuid,nodev,noexec,relatime,seclabel) 1698 bpf on /sys/fs/bpf type bpf (rw,relatime,mode=0700) 1699 1700 # tree /sys/fs/bpf/ 1701 /sys/fs/bpf/ 1702 +-- ip -> /sys/fs/bpf/tc/ 1703 +-- tc 1704 | +-- globals 1705 | +-- acc_map 1706 +-- xdp -> /sys/fs/bpf/tc/ 1707 1708 4 directories, 1 file 1709 1710 As soon as packets pass the ``em1`` device, counters from the BPF map will 1711 be increased. 1712 1713 3. **There are no global variables allowed.** 1714 1715 For the reasons already mentioned in point 1, BPF cannot have global variables 1716 as often used in normal C programs. 1717 1718 However, there is a work-around in that the program can simply use a BPF map 1719 of type ``BPF_MAP_TYPE_PERCPU_ARRAY`` with just a single slot of arbitrary 1720 value size. This works, because during execution, BPF programs are guaranteed 1721 to never get preempted by the kernel and therefore can use the single map entry 1722 as a scratch buffer for temporary data, for example, to extend beyond the stack 1723 limitation. 
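
A minimal sketch of such a single-slot scratch map is shown below. It reuses
the ``__section`` / ``__inline`` macros and the ``map_lookup_elem()``
declaration from the ``tc-example.c`` listing above; the names ``scratch_map``
and ``struct scratch_buff`` as well as the 1024 byte value size are purely
illustrative assumptions and not part of the original example:

::

    /* Sketch only: assumes the macros and the map_lookup_elem()
     * declaration from tc-example.c above.
     */
    struct scratch_buff {
        uint8_t data[1024]; /* arbitrary value size beyond the 512 byte stack limit */
    };

    struct bpf_elf_map scratch_map __section("maps") = {
        .type       = BPF_MAP_TYPE_PERCPU_ARRAY,
        .size_key   = sizeof(uint32_t),
        .size_value = sizeof(struct scratch_buff),
        .pinning    = PIN_GLOBAL_NS,
        .max_elem   = 1,
    };

    static __inline struct scratch_buff *scratch_buff_get(void)
    {
        uint32_t key = 0;

        /* Returns the single slot of the current CPU (or NULL on
         * failure). Since BPF programs are not preempted during
         * execution, the slot can be used as temporary storage.
         */
        return map_lookup_elem(&scratch_map, &key);
    }
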
This also functions across tail calls, since it has the same 1724 guarantees with regards to preemption. 1725 1726 Otherwise, for holding state across multiple BPF program runs, normal BPF 1727 maps can be used. 1728 1729 4. **There are no const strings or arrays allowed.** 1730 1731 Defining ``const`` strings or other arrays in the BPF C program does not work 1732 for the same reasons as pointed out in sections 1 and 3, which is, that relocation 1733 entries will be generated in the ELF file which will be rejected by loaders due 1734 to not being part of the ABI towards loaders (loaders also cannot fix up such 1735 entries as it would require large rewrites of the already compiled BPF sequence). 1736 1737 In the future, LLVM might detect these occurrences and early throw an error 1738 to the user. 1739 1740 Helper functions such as ``trace_printk()`` can be worked around as follows: 1741 1742 :: 1743 1744 static void BPF_FUNC(trace_printk, const char *fmt, int fmt_size, ...); 1745 1746 #ifndef printk 1747 # define printk(fmt, ...) \ 1748 ({ \ 1749 char ____fmt[] = fmt; \ 1750 trace_printk(____fmt, sizeof(____fmt), ##__VA_ARGS__); \ 1751 }) 1752 #endif 1753 1754 The program can then use the macro naturally like ``printk("skb len:%u\n", skb->len);``. 1755 The output will then be written to the trace pipe. ``tc exec bpf dbg`` can be 1756 used to retrieve the messages from there. 1757 1758 The use of the ``trace_printk()`` helper function has a couple of disadvantages 1759 and thus is not recommended for production usage. Constant strings like the 1760 ``"skb len:%u\n"`` need to be loaded into the BPF stack each time the helper 1761 function is called, but also BPF helper functions are limited to a maximum 1762 of 5 arguments. This leaves room for only 3 additional variables which can be 1763 passed for dumping. 1764 1765 Therefore, despite being helpful for quick debugging, it is recommended (for networking 1766 programs) to use the ``skb_event_output()`` or the ``xdp_event_output()`` helper, 1767 respectively. They allow for passing custom structs from the BPF program to 1768 the perf event ring buffer along with an optional packet sample. For example, 1769 Cilium's monitor makes use of these helpers in order to implement a debugging 1770 framework, notifications for network policy violations, etc. These helpers pass 1771 the data through a lockless memory mapped per-CPU ``perf`` ring buffer, and 1772 is thus significantly faster than ``trace_printk()``. 1773 1774 5. **Use of LLVM built-in functions for memset()/memcpy()/memmove()/memcmp().** 1775 1776 Since BPF programs cannot perform any function calls other than those to BPF 1777 helpers, common library code needs to be implemented as inline functions. In 1778 addition, also LLVM provides some built-ins that the programs can use for 1779 constant sizes (here: ``n``) which will then always get inlined: 1780 1781 :: 1782 1783 #ifndef memset 1784 # define memset(dest, chr, n) __builtin_memset((dest), (chr), (n)) 1785 #endif 1786 1787 #ifndef memcpy 1788 # define memcpy(dest, src, n) __builtin_memcpy((dest), (src), (n)) 1789 #endif 1790 1791 #ifndef memmove 1792 # define memmove(dest, src, n) __builtin_memmove((dest), (src), (n)) 1793 #endif 1794 1795 The ``memcmp()`` built-in had some corner cases where inlining did not take place 1796 due to an LLVM issue in the back end, and is therefore not recommended to be 1797 used until the issue is fixed. 1798 1799 6. 
**There are no loops available (yet).** 1800 1801 The BPF verifier in the kernel checks that a BPF program does not contain 1802 loops by performing a depth first search of all possible program paths besides 1803 other control flow graph validations. The purpose is to make sure that the 1804 program is always guaranteed to terminate. 1805 1806 A very limited form of looping is available for constant upper loop bounds 1807 by using ``#pragma unroll`` directive. Example code that is compiled to BPF: 1808 1809 :: 1810 1811 #pragma unroll 1812 for (i = 0; i < IPV6_MAX_HEADERS; i++) { 1813 switch (nh) { 1814 case NEXTHDR_NONE: 1815 return DROP_INVALID_EXTHDR; 1816 case NEXTHDR_FRAGMENT: 1817 return DROP_FRAG_NOSUPPORT; 1818 case NEXTHDR_HOP: 1819 case NEXTHDR_ROUTING: 1820 case NEXTHDR_AUTH: 1821 case NEXTHDR_DEST: 1822 if (skb_load_bytes(skb, l3_off + len, &opthdr, sizeof(opthdr)) < 0) 1823 return DROP_INVALID; 1824 1825 nh = opthdr.nexthdr; 1826 if (nh == NEXTHDR_AUTH) 1827 len += ipv6_authlen(&opthdr); 1828 else 1829 len += ipv6_optlen(&opthdr); 1830 break; 1831 default: 1832 *nexthdr = nh; 1833 return len; 1834 } 1835 } 1836 1837 Another possibility is to use tail calls by calling into the same program 1838 again and using a ``BPF_MAP_TYPE_PERCPU_ARRAY`` map for having a local 1839 scratch space. While being dynamic, this form of looping however is limited 1840 to a maximum of 32 iterations. 1841 1842 In the future, BPF may have some native, but limited form of implementing loops. 1843 1844 7. **Partitioning programs with tail calls.** 1845 1846 Tail calls provide the flexibility to atomically alter program behavior during 1847 runtime by jumping from one BPF program into another. In order to select the 1848 next program, tail calls make use of program array maps (``BPF_MAP_TYPE_PROG_ARRAY``), 1849 and pass the map as well as the index to the next program to jump to. There is no 1850 return to the old program after the jump has been performed, and in case there was 1851 no program present at the given map index, then execution continues on the original 1852 program. 1853 1854 For example, this can be used to implement various stages of a parser, where 1855 such stages could be updated with new parsing features during runtime. 1856 1857 Another use case are event notifications, for example, Cilium can opt in packet 1858 drop notifications during runtime, where the ``skb_event_output()`` call is 1859 located inside the tail called program. Thus, during normal operations, the 1860 fall-through path will always be executed unless a program is added to the 1861 related map index, where the program then prepares the metadata and triggers 1862 the event notification to a user space daemon. 1863 1864 Program array maps are quite flexible, enabling also individual actions to 1865 be implemented for programs located in each map index. For example, the root 1866 program attached to XDP or tc could perform an initial tail call to index 0 1867 of the program array map, performing traffic sampling, then jumping to index 1 1868 of the program array map, where firewalling policy is applied and the packet 1869 either dropped or further processed in index 2 of the program array map, where 1870 it is mangled and sent out of an interface again. Jumps in the program array 1871 map can, of course, be arbitrary. The kernel will eventually execute the 1872 fall-through path when the maximum tail call limit has been reached. 1873 1874 Minimal example extract of using tail calls: 1875 1876 :: 1877 1878 [...] 
1879 1880 #ifndef __stringify 1881 # define __stringify(X) #X 1882 #endif 1883 1884 #ifndef __section 1885 # define __section(NAME) \ 1886 __attribute__((section(NAME), used)) 1887 #endif 1888 1889 #ifndef __section_tail 1890 # define __section_tail(ID, KEY) \ 1891 __section(__stringify(ID) "/" __stringify(KEY)) 1892 #endif 1893 1894 #ifndef BPF_FUNC 1895 # define BPF_FUNC(NAME, ...) \ 1896 (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME 1897 #endif 1898 1899 #define BPF_JMP_MAP_ID 1 1900 1901 static void BPF_FUNC(tail_call, struct __sk_buff *skb, void *map, 1902 uint32_t index); 1903 1904 struct bpf_elf_map jmp_map __section("maps") = { 1905 .type = BPF_MAP_TYPE_PROG_ARRAY, 1906 .id = BPF_JMP_MAP_ID, 1907 .size_key = sizeof(uint32_t), 1908 .size_value = sizeof(uint32_t), 1909 .pinning = PIN_GLOBAL_NS, 1910 .max_elem = 1, 1911 }; 1912 1913 __section_tail(JMP_MAP_ID, 0) 1914 int looper(struct __sk_buff *skb) 1915 { 1916 printk("skb cb: %u\n", skb->cb[0]++); 1917 tail_call(skb, &jmp_map, 0); 1918 return TC_ACT_OK; 1919 } 1920 1921 __section("prog") 1922 int entry(struct __sk_buff *skb) 1923 { 1924 skb->cb[0] = 0; 1925 tail_call(skb, &jmp_map, 0); 1926 return TC_ACT_OK; 1927 } 1928 1929 char __license[] __section("license") = "GPL"; 1930 1931 When loading this toy program, tc will create the program array and pin it 1932 to the BPF file system in the global namespace under ``jmp_map``. Also, the 1933 BPF ELF loader in iproute2 will also recognize sections that are marked as 1934 ``__section_tail()``. The provided ``id`` in ``struct bpf_elf_map`` will be 1935 matched against the id marker in the ``__section_tail()``, that is, ``JMP_MAP_ID``, 1936 and the program therefore loaded at the user specified program array map index, 1937 which is ``0`` in this example. As a result, all provided tail call sections 1938 will be populated by the iproute2 loader to the corresponding maps. This mechanism 1939 is not specific to tc, but can be applied with any other BPF program type 1940 that iproute2 supports (such as XDP, lwt). 1941 1942 The generated elf contains section headers describing the map id and the 1943 entry within that map: 1944 1945 :: 1946 1947 $ llvm-objdump -S --no-show-raw-insn prog_array.o | less 1948 prog_array.o: file format ELF64-BPF 1949 1950 Disassembly of section 1/0: 1951 looper: 1952 0: r6 = r1 1953 1: r2 = *(u32 *)(r6 + 48) 1954 2: r1 = r2 1955 3: r1 += 1 1956 4: *(u32 *)(r6 + 48) = r1 1957 5: r1 = 0 ll 1958 7: call -1 1959 8: r1 = r6 1960 9: r2 = 0 ll 1961 11: r3 = 0 1962 12: call 12 1963 13: r0 = 0 1964 14: exit 1965 Disassembly of section prog: 1966 entry: 1967 0: r2 = 0 1968 1: *(u32 *)(r1 + 48) = r2 1969 2: r2 = 0 ll 1970 4: r3 = 0 1971 5: call 12 1972 6: r0 = 0 1973 7: exi 1974 1975 In this case, the ``section 1/0`` indicates that the ``looper()`` function 1976 resides in the map id ``1`` at position ``0``. 1977 1978 The pinned map can be retrieved by a user space applications (e.g. Cilium daemon), 1979 but also by tc itself in order to update the map with new programs. Updates 1980 happen atomically, the initial entry programs that are triggered first from the 1981 various subsystems are also updated atomically. 1982 1983 Example for tc to perform tail call map updates: 1984 1985 :: 1986 1987 # tc exec bpf graft m:globals/jmp_map key 0 obj new.o sec foo 1988 1989 In case iproute2 would update the pinned program array, the ``graft`` command 1990 can be used. 
By pointing it to ``globals/jmp_map``, tc will update the map at index / key
``0`` with a new program residing in the object file ``new.o`` under section
``foo``.

8. **Limited stack space of maximum 512 bytes.**

   Stack space in BPF programs is limited to only 512 bytes, which needs to be
   taken into careful consideration when implementing BPF programs in C.
   However, as mentioned earlier in point 3, a ``BPF_MAP_TYPE_PERCPU_ARRAY``
   map with a single entry can be used in order to enlarge scratch buffer
   space.

9. **Use of BPF inline assembly possible.**

   LLVM 6.0 or later allows use of inline assembly for BPF for the rare cases
   where it might be needed. The following (nonsense) toy example shows a
   64 bit atomic add. Due to the lack of documentation, the LLVM source code in
   ``lib/Target/BPF/BPFInstrInfo.td`` as well as ``test/CodeGen/BPF/`` might be
   helpful for finding additional examples. Test code:

   ::

       #include <linux/bpf.h>

       #ifndef __section
       # define __section(NAME)                  \
          __attribute__((section(NAME), used))
       #endif

       __section("prog")
       int xdp_test(struct xdp_md *ctx)
       {
           __u64 a = 2, b = 3, *c = &a;
           /* just a toy xadd example to show the syntax */
           asm volatile("lock *(u64 *)(%0+0) += %1" : "=r"(c) : "r"(b), "0"(c));
           return a;
       }

       char __license[] __section("license") = "GPL";

   The above program is compiled into the following sequence of BPF
   instructions:

   ::

       Verifier analysis:

       0: (b7) r1 = 2
       1: (7b) *(u64 *)(r10 -8) = r1
       2: (b7) r1 = 3
       3: (bf) r2 = r10
       4: (07) r2 += -8
       5: (db) lock *(u64 *)(r2 +0) += r1
       6: (79) r0 = *(u64 *)(r10 -8)
       7: (95) exit
       processed 8 insns (limit 131072), stack depth 8

10. **Remove struct padding by aligning members with #pragma pack.**

    In modern compilers, data structures are aligned by default to access
    memory efficiently. Structure members are aligned to memory addresses that
    are multiples of their size, and padding is added to maintain this
    alignment. Because of this, the size of a struct may often grow larger
    than expected.

    ::

        struct called_info {
            u64 start;  // 8 bytes
            u64 end;    // 8 bytes
            u32 sector; // 4 bytes
        }; // size of 20 bytes?

        printf("size of %zu bytes\n", sizeof(struct called_info)); // size of 24 bytes

        // Actual compiled composition of struct called_info
        // 0x0(0)                  0x8(8)
        // ↓________________________↓
        // |        start (8)       |
        // |________________________|
        // |         end  (8)       |
        // |________________________|
        // |  sector(4) |  PADDING  | <= address aligned to 8
        // |____________|___________|    with 4-byte PADDING.

    The BPF verifier in the kernel checks the stack boundary, that is, that a
    BPF program does not access memory outside of the stack boundary or an
    uninitialized stack area. Using a struct with such padding as a map value
    will cause an ``invalid indirect read from stack`` failure on
    ``bpf_prog_load()``.
    Example code:

    ::

        struct called_info {
            u64 start;
            u64 end;
            u32 sector;
        };

        struct bpf_map_def SEC("maps") called_info_map = {
            .type = BPF_MAP_TYPE_HASH,
            .key_size = sizeof(long),
            .value_size = sizeof(struct called_info),
            .max_entries = 4096,
        };

        SEC("kprobe/submit_bio")
        int submit_bio_entry(struct pt_regs *ctx)
        {
            char fmt[] = "submit_bio(bio=0x%lx) called: %llu\n";
            u64 start_time = bpf_ktime_get_ns();
            long bio_ptr = PT_REGS_PARM1(ctx);
            struct called_info called_info = {
                    .start = start_time,
                    .end = 0,
                    .sector = 0
            };

            bpf_map_update_elem(&called_info_map, &bio_ptr, &called_info, BPF_ANY);
            bpf_trace_printk(fmt, sizeof(fmt), bio_ptr, start_time);
            return 0;
        }

        // On bpf_load_program
        bpf_load_program() err=13
        0: (bf) r6 = r1
        ...
        19: (b7) r1 = 0
        20: (7b) *(u64 *)(r10 -72) = r1
        21: (7b) *(u64 *)(r10 -80) = r7
        22: (63) *(u32 *)(r10 -64) = r1
        ...
        30: (85) call bpf_map_update_elem#2
        invalid indirect read from stack off -80+20 size 24

    At ``bpf_prog_load()``, an eBPF verifier ``bpf_check()`` is called, and it
    checks the stack boundary by calling
    ``check_func_arg() -> check_stack_boundary()``. As the above error shows,
    ``struct called_info`` is compiled to a size of 24 bytes, and the message
    says that reading data from offset +20 is an invalid indirect read. As
    discussed earlier, the address 0x14(20) is where the padding resides.

    ::

        // Actual compiled composition of struct called_info
        // 0x10(16)    0x14(20)    0x18(24)
        // ↓____________↓___________↓
        // |  sector(4) |  PADDING  | <= address aligned to 8
        // |____________|___________|    with 4-byte PADDING.

    ``check_stack_boundary()`` internally loops through every byte of the
    ``access_size`` (24) starting from the start pointer to make sure that it
    is within the stack boundary and that all elements of the stack are
    initialized. Since the padding is not supposed to be used, it results in
    the 'invalid indirect read from stack' failure. To avoid this kind of
    failure, it is necessary to remove the padding from the struct.

    Removing the padding by using the ``#pragma pack(n)`` directive:

    ::

        #pragma pack(4)
        struct called_info {
            u64 start;  // 8 bytes
            u64 end;    // 8 bytes
            u32 sector; // 4 bytes
        }; // size of 20 bytes?

        printf("size of %zu bytes\n", sizeof(struct called_info)); // size of 20 bytes

        // Actual compiled composition of packed struct called_info
        // 0x0(0)                  0x8(8)
        // ↓________________________↓
        // |        start (8)       |
        // |________________________|
        // |         end  (8)       |
        // |________________________|
        // |  sector(4) |             <= address aligned to 4
        // |____________|                with no PADDING.

    By placing ``#pragma pack(4)`` before ``struct called_info``, the compiler
    aligns the members of the struct to the minimum of 4 bytes and their
    natural alignment. As you can see, the size of ``struct called_info`` has
    shrunk to 20 bytes and the padding no longer exists.

    However, removing the padding has downsides as well. For example, the
    compiler will generate less optimized code. Since the padding has been
    removed, processors may have to perform unaligned accesses to the
    structure, which might lead to performance degradation.
    Unaligned access might also get rejected by the verifier on some
    architectures.

    However, there is a way to avoid the downsides of a packed structure.
    Simply adding an explicit padding member ``u32 pad`` at the end resolves
    the same problem without packing the structure.

    ::

        struct called_info {
            u64 start;  // 8 bytes
            u64 end;    // 8 bytes
            u32 sector; // 4 bytes
            u32 pad;    // 4 bytes
        }; // size of 24 bytes?

        printf("size of %zu bytes\n", sizeof(struct called_info)); // size of 24 bytes

        // Actual compiled composition of struct called_info with explicit padding
        // 0x0(0)                  0x8(8)
        // ↓________________________↓
        // |        start (8)       |
        // |________________________|
        // |         end  (8)       |
        // |________________________|
        // |  sector(4) |  pad (4)  | <= address aligned to 8
        // |____________|___________|    with explicit PADDING.

11. **Accessing packet data via invalidated references**

    Some networking BPF helper functions such as ``bpf_skb_store_bytes`` might
    change the size of the packet data. Since the verifier is not able to
    track such changes, any a priori reference to the data will be invalidated
    by the verifier. Therefore, the reference needs to be updated before
    accessing the data in order to avoid the verifier rejecting the program.

    To illustrate this, consider the following snippet:

    ::

        struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

        skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);

        if (ip4->protocol == IPPROTO_TCP) {
            // do something
        }

    The verifier will reject the snippet due to the dereference of the
    invalidated ``ip4->protocol``:

    ::

        R1=pkt_end(id=0,off=0,imm=0) R2=pkt(id=0,off=34,r=34,imm=0) R3=inv0
        R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=4294967295,var_off=(0x0; 0xffffffff))
        R8=inv4294967162 R9=pkt(id=0,off=0,r=34,imm=0) R10=fp0,call_-1
        ...
        18: (85) call bpf_skb_store_bytes#9
        19: (7b) *(u64 *)(r10 -56) = r7
        R0=inv(id=0) R6=ctx(id=0,off=0,imm=0) R7=inv(id=0,umax_value=2,var_off=(0x0; 0x3))
        R8=inv4294967162 R9=inv(id=0) R10=fp0,call_-1 fp-48=mmmm???? fp-56=mmmmmmmm
        21: (61) r1 = *(u32 *)(r9 +23)
        R9 invalid mem access 'inv'

    To fix this, the reference to ``ip4`` has to be updated:

    ::

        struct iphdr *ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

        skb_store_bytes(skb, l3_off + offsetof(struct iphdr, saddr), &new_saddr, 4, 0);

        ip4 = (struct iphdr *)(skb->data + ETH_HLEN);

        if (ip4->protocol == IPPROTO_TCP) {
            // do something
        }

iproute2
--------

There are various front ends for loading BPF programs into the kernel such as
bcc, perf, iproute2 and others. The Linux kernel source tree also provides a
user space library under ``tools/lib/bpf/``, which is mainly used and driven
by perf for loading BPF tracing programs into the kernel. However, the library
itself is generic and not limited to perf. bcc is a toolkit providing many
useful BPF programs mainly for tracing, which are loaded ad-hoc through a
Python interface embedding the BPF C code. Syntax and semantics for
implementing BPF programs slightly differ among front ends in general, though.
Additionally, there are also 2263 BPF samples in the kernel source tree (``samples/bpf/``) which parse the generated 2264 object files and load the code directly through the system call interface. 2265 2266 This and previous sections mainly focus on the iproute2 suite's BPF front end for 2267 loading networking programs of XDP, tc or lwt type, since Cilium's programs are 2268 implemented against this BPF loader. In future, Cilium will be equipped with a 2269 native BPF loader, but programs will still be compatible to be loaded through 2270 iproute2 suite in order to facilitate development and debugging. 2271 2272 All BPF program types supported by iproute2 share the same BPF loader logic 2273 due to having a common loader back end implemented as a library (``lib/bpf.c`` 2274 in iproute2 source tree). 2275 2276 The previous section on LLVM also covered some iproute2 parts related to writing 2277 BPF C programs, and later sections in this document are related to tc and XDP 2278 specific aspects when writing programs. Therefore, this section will rather focus 2279 on usage examples for loading object files with iproute2 as well as some of the 2280 generic mechanics of the loader. It does not try to provide a complete coverage 2281 of all details, but enough for getting started. 2282 2283 **1. Loading of XDP BPF object files.** 2284 2285 Given a BPF object file ``prog.o`` has been compiled for XDP, it can be loaded 2286 through ``ip`` to a XDP-supported netdevice called ``em1`` with the following 2287 command: 2288 2289 :: 2290 2291 # ip link set dev em1 xdp obj prog.o 2292 2293 The above command assumes that the program code resides in the default section 2294 which is called ``prog`` in XDP case. Should this not be the case, and the 2295 section is named differently, for example, ``foobar``, then the program needs 2296 to be loaded as: 2297 2298 :: 2299 2300 # ip link set dev em1 xdp obj prog.o sec foobar 2301 2302 Note that it is also possible to load the program out of the ``.text`` section. 2303 Changing the minimal, stand-alone XDP drop program by removing the ``__section()`` 2304 annotation from the ``xdp_drop`` entry point would look like the following: 2305 2306 :: 2307 2308 #include <linux/bpf.h> 2309 2310 #ifndef __section 2311 # define __section(NAME) \ 2312 __attribute__((section(NAME), used)) 2313 #endif 2314 2315 int xdp_drop(struct xdp_md *ctx) 2316 { 2317 return XDP_DROP; 2318 } 2319 2320 char __license[] __section("license") = "GPL"; 2321 2322 And can be loaded as follows: 2323 2324 :: 2325 2326 # ip link set dev em1 xdp obj prog.o sec .text 2327 2328 By default, ``ip`` will throw an error in case a XDP program is already attached 2329 to the networking interface, to prevent it from being overridden by accident. In 2330 order to replace the currently running XDP program with a new one, the ``-force`` 2331 option must be used: 2332 2333 :: 2334 2335 # ip -force link set dev em1 xdp obj prog.o 2336 2337 Most XDP-enabled drivers today support an atomic replacement of the existing 2338 program with a new one without traffic interruption. There is always only a 2339 single program attached to an XDP-enabled driver due to performance reasons, 2340 hence a chain of programs is not supported. However, as described in the 2341 previous section, partitioning of programs can be performed through tail 2342 calls to achieve a similar use case when necessary. 2343 2344 The ``ip link`` command will display an ``xdp`` flag if the interface has an XDP 2345 program attached. 
``ip link | grep xdp`` can thus be used to find all interfaces 2346 that have XDP running. Further introspection facilities are provided through 2347 the detailed view with ``ip -d link`` and ``bpftool`` can be used to retrieve 2348 information about the attached program based on the BPF program ID shown in 2349 the ``ip link`` dump. 2350 2351 In order to remove the existing XDP program from the interface, the following 2352 command must be issued: 2353 2354 :: 2355 2356 # ip link set dev em1 xdp off 2357 2358 In the case of switching a driver's operation mode from non-XDP to native XDP 2359 and vice versa, typically the driver needs to reconfigure its receive (and 2360 transmit) rings in order to ensure received packet are set up linearly 2361 within a single page for BPF to read and write into. However, once completed, 2362 then most drivers only need to perform an atomic replacement of the program 2363 itself when a BPF program is requested to be swapped. 2364 2365 In total, XDP supports three operation modes which iproute2 implements as well: 2366 ``xdpdrv``, ``xdpoffload`` and ``xdpgeneric``. 2367 2368 ``xdpdrv`` stands for native XDP, meaning the BPF program is run directly in 2369 the driver's receive path at the earliest possible point in software. This is 2370 the normal / conventional XDP mode and requires driver's to implement XDP 2371 support, which all major 10G/40G/+ networking drivers in the upstream Linux 2372 kernel already provide. 2373 2374 ``xdpgeneric`` stands for generic XDP and is intended as an experimental test 2375 bed for drivers which do not yet support native XDP. Given the generic XDP hook 2376 in the ingress path comes at a much later point in time when the packet already 2377 enters the stack's main receive path as a ``skb``, the performance is significantly 2378 less than with processing in ``xdpdrv`` mode. ``xdpgeneric`` therefore is for 2379 the most part only interesting for experimenting, less for production environments. 2380 2381 Last but not least, the ``xdpoffload`` mode is implemented by SmartNICs such 2382 as those supported by Netronome's nfp driver and allow for offloading the entire 2383 BPF/XDP program into hardware, thus the program is run on each packet reception 2384 directly on the card. This provides even higher performance than running in 2385 native XDP although not all BPF map types or BPF helper functions are available 2386 for use compared to native XDP. The BPF verifier will reject the program in 2387 such case and report to the user what is unsupported. Other than staying in 2388 the realm of supported BPF features and helper functions, no special precautions 2389 have to be taken when writing BPF C programs. 2390 2391 When a command like ``ip link set dev em1 xdp obj [...]`` is used, then the 2392 kernel will attempt to load the program first as native XDP, and in case the 2393 driver does not support native XDP, it will automatically fall back to generic 2394 XDP. Thus, for example, using explicitly ``xdpdrv`` instead of ``xdp``, the 2395 kernel will only attempt to load the program as native XDP and fail in case 2396 the driver does not support it, which provides a guarantee that generic XDP 2397 is avoided altogether. 2398 2399 Example for enforcing a BPF/XDP program to be loaded in native XDP mode, 2400 dumping the link details and unloading the program again: 2401 2402 :: 2403 2404 # ip -force link set dev em1 xdpdrv obj prog.o 2405 # ip link show 2406 [...] 
2407 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdp qdisc mq state UP mode DORMANT group default qlen 1000 2408 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff 2409 prog/xdp id 1 tag 57cd311f2e27366b 2410 [...] 2411 # ip link set dev em1 xdpdrv off 2412 2413 Same example now for forcing generic XDP, even if the driver would support 2414 native XDP, and additionally dumping the BPF instructions of the attached 2415 dummy program through bpftool: 2416 2417 :: 2418 2419 # ip -force link set dev em1 xdpgeneric obj prog.o 2420 # ip link show 2421 [...] 2422 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpgeneric qdisc mq state UP mode DORMANT group default qlen 1000 2423 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff 2424 prog/xdp id 4 tag 57cd311f2e27366b <-- BPF program ID 4 2425 [...] 2426 # bpftool prog dump xlated id 4 <-- Dump of instructions running on em1 2427 0: (b7) r0 = 1 2428 1: (95) exit 2429 # ip link set dev em1 xdpgeneric off 2430 2431 And last but not least offloaded XDP, where we additionally dump program 2432 information via bpftool for retrieving general metadata: 2433 2434 :: 2435 2436 # ip -force link set dev em1 xdpoffload obj prog.o 2437 # ip link show 2438 [...] 2439 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000 2440 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff 2441 prog/xdp id 8 tag 57cd311f2e27366b 2442 [...] 2443 # bpftool prog show id 8 2444 8: xdp tag 57cd311f2e27366b dev em1 <-- Also indicates a BPF program offloaded to em1 2445 loaded_at Apr 11/20:38 uid 0 2446 xlated 16B not jited memlock 4096B 2447 # ip link set dev em1 xdpoffload off 2448 2449 Note that it is not possible to use ``xdpdrv`` and ``xdpgeneric`` or other 2450 modes at the same time, meaning only one of the XDP operation modes must be 2451 picked. 2452 2453 A switch between different XDP modes e.g. from generic to native or vice 2454 versa is not atomically possible. Only switching programs within a specific 2455 operation mode is: 2456 2457 :: 2458 2459 # ip -force link set dev em1 xdpgeneric obj prog.o 2460 # ip -force link set dev em1 xdpoffload obj prog.o 2461 RTNETLINK answers: File exists 2462 # ip -force link set dev em1 xdpdrv obj prog.o 2463 RTNETLINK answers: File exists 2464 # ip -force link set dev em1 xdpgeneric obj prog.o <-- Succeeds due to xdpgeneric 2465 # 2466 2467 Switching between modes requires to first leave the current operation mode 2468 in order to then enter the new one: 2469 2470 :: 2471 2472 # ip -force link set dev em1 xdpgeneric obj prog.o 2473 # ip -force link set dev em1 xdpgeneric off 2474 # ip -force link set dev em1 xdpoffload obj prog.o 2475 # ip l 2476 [...] 2477 6: em1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 xdpoffload qdisc mq state UP mode DORMANT group default qlen 1000 2478 link/ether be:08:4d:b6:85:65 brd ff:ff:ff:ff:ff:ff 2479 prog/xdp id 17 tag 57cd311f2e27366b 2480 [...] 2481 # ip -force link set dev em1 xdpoffload off 2482 2483 **2. Loading of tc BPF object files.** 2484 2485 Given a BPF object file ``prog.o`` has been compiled for tc, it can be loaded 2486 through the tc command to a netdevice. Unlike XDP, there is no driver dependency 2487 for supporting attaching BPF programs to the device. 
Here, the netdevice is called 2488 ``em1``, and with the following command the program can be attached to the networking 2489 ``ingress`` path of ``em1``: 2490 2491 :: 2492 2493 # tc qdisc add dev em1 clsact 2494 # tc filter add dev em1 ingress bpf da obj prog.o 2495 2496 The first step is to set up a ``clsact`` qdisc (Linux queueing discipline). ``clsact`` 2497 is a dummy qdisc similar to the ``ingress`` qdisc, which can only hold classifier 2498 and actions, but does not perform actual queueing. It is needed in order to attach 2499 the ``bpf`` classifier. The ``clsact`` qdisc provides two special hooks called 2500 ``ingress`` and ``egress``, where the classifier can be attached to. Both ``ingress`` 2501 and ``egress`` hooks are located in central receive and transmit locations in the 2502 networking data path, where every packet on the device passes through. The ``ingress`` 2503 hook is called from ``__netif_receive_skb_core() -> sch_handle_ingress()`` in the 2504 kernel and the ``egress`` hook from ``__dev_queue_xmit() -> sch_handle_egress()``. 2505 2506 The equivalent for attaching the program to the ``egress`` hook looks as follows: 2507 2508 :: 2509 2510 # tc filter add dev em1 egress bpf da obj prog.o 2511 2512 The ``clsact`` qdisc is processed lockless from ``ingress`` and ``egress`` 2513 direction and can also be attached to virtual, queue-less devices such as 2514 ``veth`` devices connecting containers. 2515 2516 Next to the hook, the ``tc filter`` command selects ``bpf`` to be used in ``da`` 2517 (direct-action) mode. ``da`` mode is recommended and should always be specified. 2518 It basically means that the ``bpf`` classifier does not need to call into external 2519 tc action modules, which are not necessary for ``bpf`` anyway, since all packet 2520 mangling, forwarding or other kind of actions can already be performed inside 2521 the single BPF program which is to be attached, and is therefore significantly 2522 faster. 2523 2524 At this point, the program has been attached and is executed once packets traverse 2525 the device. Like in XDP, should the default section name not be used, then it 2526 can be specified during load, for example, in case of section ``foobar``: 2527 2528 :: 2529 2530 # tc filter add dev em1 egress bpf da obj prog.o sec foobar 2531 2532 iproute2's BPF loader allows for using the same command line syntax across 2533 program types, hence the ``obj prog.o sec foobar`` is the same syntax as with 2534 XDP mentioned earlier. 2535 2536 The attached programs can be listed through the following commands: 2537 2538 :: 2539 2540 # tc filter show dev em1 ingress 2541 filter protocol all pref 49152 bpf 2542 filter protocol all pref 49152 bpf handle 0x1 prog.o:[ingress] direct-action id 1 tag c5f7825e5dac396f 2543 2544 # tc filter show dev em1 egress 2545 filter protocol all pref 49152 bpf 2546 filter protocol all pref 49152 bpf handle 0x1 prog.o:[egress] direct-action id 2 tag b2fd5adc0f262714 2547 2548 The output of ``prog.o:[ingress]`` tells that program section ``ingress`` was 2549 loaded from the file ``prog.o``, and ``bpf`` operates in ``direct-action`` mode. 2550 The program ``id`` and ``tag`` is appended for each case, where the latter denotes 2551 a hash over the instruction stream which can be correlated with the object file 2552 or ``perf`` reports with stack traces, etc. 
Last but not least, the ``id`` 2553 represents the system-wide unique BPF program identifier that can be used along 2554 with ``bpftool`` to further inspect or dump the attached BPF program. 2555 2556 tc can attach more than just a single BPF program, it provides various other 2557 classifiers which can be chained together. However, attaching a single BPF program 2558 is fully sufficient since all packet operations can be contained in the program 2559 itself thanks to ``da`` (``direct-action``) mode, meaning the BPF program itself 2560 will already return the tc action verdict such as ``TC_ACT_OK``, ``TC_ACT_SHOT`` 2561 and others. For optimal performance and flexibility, this is the recommended usage. 2562 2563 In the above ``show`` command, tc also displays ``pref 49152`` and 2564 ``handle 0x1`` next to the BPF related output. Both are auto-generated in 2565 case they are not explicitly provided through the command line. ``pref`` 2566 denotes a priority number, which means that in case multiple classifiers are 2567 attached, they will be executed based on ascending priority, and ``handle`` 2568 represents an identifier in case multiple instances of the same classifier have 2569 been loaded under the same ``pref``. Since in case of BPF, a single program is 2570 fully sufficient, ``pref`` and ``handle`` can typically be ignored. 2571 2572 Only in the case where it is planned to atomically replace the attached BPF 2573 programs, it would be recommended to explicitly specify ``pref`` and ``handle`` 2574 a priori on initial load, so that they do not have to be queried at a later 2575 point in time for the ``replace`` operation. Thus, creation becomes: 2576 2577 :: 2578 2579 # tc filter add dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar 2580 2581 # tc filter show dev em1 ingress 2582 filter protocol all pref 1 bpf 2583 filter protocol all pref 1 bpf handle 0x1 prog.o:[foobar] direct-action id 1 tag c5f7825e5dac396f 2584 2585 And for the atomic replacement, the following can be issued for updating the 2586 existing program at ``ingress`` hook with the new BPF program from the file 2587 ``prog.o`` in section ``foobar``: 2588 2589 :: 2590 2591 # tc filter replace dev em1 ingress pref 1 handle 1 bpf da obj prog.o sec foobar 2592 2593 Last but not least, in order to remove all attached programs from the ``ingress`` 2594 respectively ``egress`` hook, the following can be used: 2595 2596 :: 2597 2598 # tc filter del dev em1 ingress 2599 # tc filter del dev em1 egress 2600 2601 For removing the entire ``clsact`` qdisc from the netdevice, which implicitly also 2602 removes all attached programs from the ``ingress`` and ``egress`` hooks, the 2603 below command is provided: 2604 2605 :: 2606 2607 # tc qdisc del dev em1 clsact 2608 2609 tc BPF programs can also be offloaded if the NIC and driver has support for it 2610 similarly as with XDP BPF programs. Netronome's nfp supported NICs offer both 2611 types of BPF offload. 2612 2613 :: 2614 2615 # tc qdisc add dev em1 clsact 2616 # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o 2617 Error: TC offload is disabled on net device. 
2618 We have an error talking to the kernel 2619 2620 If the above error is shown, then tc hardware offload first needs to be enabled 2621 for the device through ethtool's ``hw-tc-offload`` setting: 2622 2623 :: 2624 2625 # ethtool -K em1 hw-tc-offload on 2626 # tc qdisc add dev em1 clsact 2627 # tc filter replace dev em1 ingress pref 1 handle 1 bpf skip_sw da obj prog.o 2628 # tc filter show dev em1 ingress 2629 filter protocol all pref 1 bpf 2630 filter protocol all pref 1 bpf handle 0x1 prog.o:[classifier] direct-action skip_sw in_hw id 19 tag 57cd311f2e27366b 2631 2632 The ``in_hw`` flag confirms that the program has been offloaded to the NIC. 2633 2634 Note that BPF offloads for both tc and XDP cannot be loaded at the same time, 2635 either the tc or XDP offload option must be selected. 2636 2637 **3. Testing BPF offload interface via netdevsim driver.** 2638 2639 The netdevsim driver which is part of the Linux kernel provides a dummy driver 2640 which implements offload interfaces for XDP BPF and tc BPF programs and 2641 facilitates testing kernel changes or low-level user space programs 2642 implementing a control plane directly against the kernel's UAPI. 2643 2644 A netdevsim device can be created as follows: 2645 2646 :: 2647 2648 # modprobe netdevsim 2649 // [ID] [PORT_COUNT] 2650 # echo "1 1" > /sys/bus/netdevsim/new_device 2651 # devlink dev 2652 netdevsim/netdevsim1 2653 # devlink port 2654 netdevsim/netdevsim1/0: type eth netdev eth0 flavour physical 2655 # ip l 2656 [...] 2657 4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 2658 link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff 2659 2660 After that step, XDP BPF or tc BPF programs can be test loaded as shown 2661 in the various examples earlier: 2662 2663 :: 2664 2665 # ip -force link set dev eth0 xdpoffload obj prog.o 2666 # ip l 2667 [...] 2668 4: eth0: <BROADCAST,NOARP,UP,LOWER_UP> mtu 1500 xdpoffload qdisc noqueue state UNKNOWN mode DEFAULT group default qlen 1000 2669 link/ether 2a:d5:cd:08:d1:3f brd ff:ff:ff:ff:ff:ff 2670 prog/xdp id 16 tag a04f5eef06a7f555 2671 2672 These two workflows are the basic operations to load XDP BPF respectively tc BPF 2673 programs with iproute2. 2674 2675 There are other various advanced options for the BPF loader that apply both to XDP 2676 and tc, some of them are listed here. In the examples only XDP is presented for 2677 simplicity. 2678 2679 **1. Verbose log output even on success.** 2680 2681 The option ``verb`` can be appended for loading programs in order to dump the 2682 verifier log, even if no error occurred: 2683 2684 :: 2685 2686 # ip link set dev em1 xdp obj xdp-example.o verb 2687 2688 Prog section 'prog' loaded (5)! 2689 - Type: 6 2690 - Instructions: 2 (0 over limit) 2691 - License: GPL 2692 2693 Verifier analysis: 2694 2695 0: (b7) r0 = 1 2696 1: (95) exit 2697 processed 2 insns 2698 2699 **2. 
Load program that is already pinned in BPF file system.** 2700 2701 Instead of loading a program from an object file, iproute2 can also retrieve 2702 the program from the BPF file system in case some external entity pinned it 2703 there and attach it to the device: 2704 2705 :: 2706 2707 # ip link set dev em1 xdp pinned /sys/fs/bpf/prog 2708 2709 iproute2 can also use the short form that is relative to the detected mount 2710 point of the BPF file system: 2711 2712 :: 2713 2714 # ip link set dev em1 xdp pinned m:prog 2715 2716 When loading BPF programs, iproute2 will automatically detect the mounted 2717 file system instance in order to perform pinning of nodes. In case no mounted 2718 BPF file system instance was found, then tc will automatically mount it 2719 to the default location under ``/sys/fs/bpf/``. 2720 2721 In case an instance has already been found, then it will be used and no additional 2722 mount will be performed: 2723 2724 :: 2725 2726 # mkdir /var/run/bpf 2727 # mount --bind /var/run/bpf /var/run/bpf 2728 # mount -t bpf bpf /var/run/bpf 2729 # tc filter add dev em1 ingress bpf da obj tc-example.o sec prog 2730 # tree /var/run/bpf 2731 /var/run/bpf 2732 +-- ip -> /run/bpf/tc/ 2733 +-- tc 2734 | +-- globals 2735 | +-- jmp_map 2736 +-- xdp -> /run/bpf/tc/ 2737 2738 4 directories, 1 file 2739 2740 By default tc will create an initial directory structure as shown above, 2741 where all subsystem users will point to the same location through symbolic 2742 links for the ``globals`` namespace, so that pinned BPF maps can be reused 2743 among various BPF program types in iproute2. In case the file system instance 2744 has already been mounted and an existing structure already exists, then tc will 2745 not override it. This could be the case for separating ``lwt``, ``tc`` and 2746 ``xdp`` maps in order to not share ``globals`` among all. 2747 2748 As briefly covered in the previous LLVM section, iproute2 will install a 2749 header file upon installation which can be included through the standard 2750 include path by BPF programs: 2751 2752 :: 2753 2754 #include <iproute2/bpf_elf.h> 2755 2756 The purpose of this header file is to provide an API for maps and default section 2757 names used by programs. It's a stable contract between iproute2 and BPF programs. 2758 2759 The map definition for iproute2 is ``struct bpf_elf_map``. Its members have 2760 been covered earlier in the LLVM section of this document. 2761 2762 When parsing the BPF object file, the iproute2 loader will walk through 2763 all ELF sections. It initially fetches ancillary sections like ``maps`` and 2764 ``license``. For ``maps``, the ``struct bpf_elf_map`` array will be checked 2765 for validity and whenever needed, compatibility workarounds are performed. 2766 Subsequently all maps are created with the user provided information, either 2767 retrieved as a pinned object, or newly created and then pinned into the BPF 2768 file system. Next the loader will handle all program sections that contain 2769 ELF relocation entries for maps, meaning that BPF instructions loading 2770 map file descriptors into registers are rewritten so that the corresponding 2771 map file descriptors are encoded into the instructions immediate value, in 2772 order for the kernel to be able to convert them later on into map kernel 2773 pointers. After that all the programs themselves are created through the BPF 2774 system call, and tail called maps, if present, updated with the program's file 2775 descriptors. 
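Before moving on to bpftool, the following is a small, self-contained sketch
which ties the loader walkthrough together. It is illustrative only: the map
name ``pkt_count``, the section name ``ingress`` and the helper macros are
arbitrary choices in the style of the earlier LLVM section examples and not
part of Cilium's templates. The ``maps``, ``license`` and program sections are
exactly the ELF pieces the iproute2 loader walks, and the reference to
``pkt_count`` inside the ``map_lookup_elem()`` call is what produces the
relocation entry whose instruction is rewritten with the map's file
descriptor:

::

    #include <stdint.h>
    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>

    #include <iproute2/bpf_elf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    #ifndef BPF_FUNC
    # define BPF_FUNC(NAME, ...)              \
       (*NAME)(__VA_ARGS__) = (void *)BPF_FUNC_##NAME
    #endif

    static void *BPF_FUNC(map_lookup_elem, void *map, const void *key);

    /* Pinned by the iproute2 loader as /sys/fs/bpf/tc/globals/pkt_count,
     * so that other iproute2 loaded programs can reuse the same map.
     */
    struct bpf_elf_map __section("maps") pkt_count = {
        .type       = BPF_MAP_TYPE_ARRAY,
        .size_key   = sizeof(uint32_t),
        .size_value = sizeof(uint64_t),
        .pinning    = PIN_GLOBAL_NS,
        .max_elem   = 1,
    };

    __section("ingress") int cls_ingress(struct __sk_buff *skb)
    {
        uint32_t key = 0;
        uint64_t *count;

        /* The loader patches the map file descriptor into the
         * instruction which loads the reference to pkt_count.
         */
        count = map_lookup_elem(&pkt_count, &key);
        if (count)
            __sync_fetch_and_add(count, 1);

        return TC_ACT_OK;
    }

    char __license[] __section("license") = "GPL";

Such an object file can then be attached with the ``tc filter add dev em1
ingress bpf da obj prog.o sec ingress`` command shown earlier, with
``pkt_count`` ending up under the ``globals`` directory of the mounted BPF
file system.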
2776 2777 bpftool 2778 ------- 2779 2780 bpftool is the main introspection and debugging tool around BPF and developed 2781 and shipped along with the Linux kernel tree under ``tools/bpf/bpftool/``. 2782 2783 The tool can dump all BPF programs and maps that are currently loaded in 2784 the system, or list and correlate all BPF maps used by a specific program. 2785 Furthermore, it allows to dump the entire map's key / value pairs, or 2786 lookup, update, delete individual ones as well as retrieve a key's neighbor 2787 key in the map. Such operations can be performed based on BPF program or 2788 map IDs or by specifying the location of a BPF file system pinned program 2789 or map. The tool additionally also offers an option to pin maps or programs 2790 into the BPF file system. 2791 2792 For a quick overview of all BPF programs currently loaded on the host 2793 invoke the following command: 2794 2795 :: 2796 2797 # bpftool prog 2798 398: sched_cls tag 56207908be8ad877 2799 loaded_at Apr 09/16:24 uid 0 2800 xlated 8800B jited 6184B memlock 12288B map_ids 18,5,17,14 2801 399: sched_cls tag abc95fb4835a6ec9 2802 loaded_at Apr 09/16:24 uid 0 2803 xlated 344B jited 223B memlock 4096B map_ids 18 2804 400: sched_cls tag afd2e542b30ff3ec 2805 loaded_at Apr 09/16:24 uid 0 2806 xlated 1720B jited 1001B memlock 4096B map_ids 17 2807 401: sched_cls tag 2dbbd74ee5d51cc8 2808 loaded_at Apr 09/16:24 uid 0 2809 xlated 3728B jited 2099B memlock 4096B map_ids 17 2810 [...] 2811 2812 Similarly, to get an overview of all active maps: 2813 2814 :: 2815 2816 # bpftool map 2817 5: hash flags 0x0 2818 key 20B value 112B max_entries 65535 memlock 13111296B 2819 6: hash flags 0x0 2820 key 20B value 20B max_entries 65536 memlock 7344128B 2821 7: hash flags 0x0 2822 key 10B value 16B max_entries 8192 memlock 790528B 2823 8: hash flags 0x0 2824 key 22B value 28B max_entries 8192 memlock 987136B 2825 9: hash flags 0x0 2826 key 20B value 8B max_entries 512000 memlock 49352704B 2827 [...] 2828 2829 Note that for each command, bpftool also supports json based output by 2830 appending ``--json`` at the end of the command line. An additional 2831 ``--pretty`` improves the output to be more human readable. 2832 2833 :: 2834 2835 # bpftool prog --json --pretty 2836 2837 For dumping the post-verifier BPF instruction image of a specific BPF 2838 program, one starting point could be to inspect a specific program, e.g. 2839 attached to the tc ingress hook: 2840 2841 :: 2842 2843 # tc filter show dev cilium_host egress 2844 filter protocol all pref 1 bpf chain 0 2845 filter protocol all pref 1 bpf chain 0 handle 0x1 bpf_host.o:[from-netdev] \ 2846 direct-action not_in_hw id 406 tag e0362f5bd9163a0a jited 2847 2848 The program from the object file ``bpf_host.o``, section ``from-netdev`` has 2849 a BPF program ID of ``406`` as denoted in ``id 406``. Based on this information 2850 bpftool can provide some high-level metadata specific to the program: 2851 2852 :: 2853 2854 # bpftool prog show id 406 2855 406: sched_cls tag e0362f5bd9163a0a 2856 loaded_at Apr 09/16:24 uid 0 2857 xlated 11144B jited 7721B memlock 12288B map_ids 18,20,8,5,6,14 2858 2859 The program of ID 406 is of type ``sched_cls`` (``BPF_PROG_TYPE_SCHED_CLS``), 2860 has a ``tag`` of ``e0362f5bd9163a0a`` (SHA sum over the instruction sequence), 2861 it was loaded by root ``uid 0`` on ``Apr 09/16:24``. The BPF instruction 2862 sequence is ``11,144 bytes`` long and the JITed image ``7,721 bytes``. 
The 2863 program itself (excluding maps) consumes ``12,288 bytes`` that are accounted / 2864 charged against user ``uid 0``. And the BPF program uses the BPF maps with 2865 IDs ``18``, ``20``, ``8``, ``5``, ``6`` and ``14``. The latter IDs can further 2866 be used to get information or dump the map themselves. 2867 2868 Additionally, bpftool can issue a dump request of the BPF instructions the 2869 program runs: 2870 2871 :: 2872 2873 # bpftool prog dump xlated id 406 2874 0: (b7) r7 = 0 2875 1: (63) *(u32 *)(r1 +60) = r7 2876 2: (63) *(u32 *)(r1 +56) = r7 2877 3: (63) *(u32 *)(r1 +52) = r7 2878 [...] 2879 47: (bf) r4 = r10 2880 48: (07) r4 += -40 2881 49: (79) r6 = *(u64 *)(r10 -104) 2882 50: (bf) r1 = r6 2883 51: (18) r2 = map[id:18] <-- BPF map id 18 2884 53: (b7) r5 = 32 2885 54: (85) call bpf_skb_event_output#5656112 <-- BPF helper call 2886 55: (69) r1 = *(u16 *)(r6 +192) 2887 [...] 2888 2889 bpftool correlates BPF map IDs into the instruction stream as shown above 2890 as well as calls to BPF helpers or other BPF programs. 2891 2892 The instruction dump reuses the same 'pretty-printer' as the kernel's BPF 2893 verifier. Since the program was JITed and therefore the actual JIT image 2894 that was generated out of above ``xlated`` instructions is executed, it 2895 can be dumped as well through bpftool: 2896 2897 :: 2898 2899 # bpftool prog dump jited id 406 2900 0: push %rbp 2901 1: mov %rsp,%rbp 2902 4: sub $0x228,%rsp 2903 b: sub $0x28,%rbp 2904 f: mov %rbx,0x0(%rbp) 2905 13: mov %r13,0x8(%rbp) 2906 17: mov %r14,0x10(%rbp) 2907 1b: mov %r15,0x18(%rbp) 2908 1f: xor %eax,%eax 2909 21: mov %rax,0x20(%rbp) 2910 25: mov 0x80(%rdi),%r9d 2911 [...] 2912 2913 Mainly for BPF JIT developers, the option also exists to interleave the 2914 disassembly with the actual native opcodes: 2915 2916 :: 2917 2918 # bpftool prog dump jited id 406 opcodes 2919 0: push %rbp 2920 55 2921 1: mov %rsp,%rbp 2922 48 89 e5 2923 4: sub $0x228,%rsp 2924 48 81 ec 28 02 00 00 2925 b: sub $0x28,%rbp 2926 48 83 ed 28 2927 f: mov %rbx,0x0(%rbp) 2928 48 89 5d 00 2929 13: mov %r13,0x8(%rbp) 2930 4c 89 6d 08 2931 17: mov %r14,0x10(%rbp) 2932 4c 89 75 10 2933 1b: mov %r15,0x18(%rbp) 2934 4c 89 7d 18 2935 [...] 2936 2937 The same interleaving can be done for the normal BPF instructions which 2938 can sometimes be useful for debugging in the kernel: 2939 2940 :: 2941 2942 # bpftool prog dump xlated id 406 opcodes 2943 0: (b7) r7 = 0 2944 b7 07 00 00 00 00 00 00 2945 1: (63) *(u32 *)(r1 +60) = r7 2946 63 71 3c 00 00 00 00 00 2947 2: (63) *(u32 *)(r1 +56) = r7 2948 63 71 38 00 00 00 00 00 2949 3: (63) *(u32 *)(r1 +52) = r7 2950 63 71 34 00 00 00 00 00 2951 4: (63) *(u32 *)(r1 +48) = r7 2952 63 71 30 00 00 00 00 00 2953 5: (63) *(u32 *)(r1 +64) = r7 2954 63 71 40 00 00 00 00 00 2955 [...] 2956 2957 The basic blocks of a program can also be visualized with the help of 2958 ``graphviz``. For this purpose bpftool has a ``visual`` dump mode that 2959 generates a dot file instead of the plain BPF ``xlated`` instruction 2960 dump that can later be converted to a png file: 2961 2962 :: 2963 2964 # bpftool prog dump xlated id 406 visual &> output.dot 2965 $ dot -Tpng output.dot -o output.png 2966 2967 Another option would be to pass the dot file to dotty as a viewer, that 2968 is ``dotty output.dot``, where the result for the ``bpf_host.o`` program 2969 looks as follows (small extract): 2970 2971 .. 
image:: images/bpf_dot.png 2972 :align: center 2973 2974 Note that the ``xlated`` instruction dump provides the post-verifier BPF 2975 instruction image which means that it dumps the instructions as if they 2976 were to be run through the BPF interpreter. In the kernel, the verifier 2977 performs various rewrites of the original instructions provided by the 2978 BPF loader. 2979 2980 One example of rewrites is the inlining of helper functions in order to 2981 improve runtime performance, here in the case of a map lookup for hash 2982 tables: 2983 2984 :: 2985 2986 # bpftool prog dump xlated id 3 2987 0: (b7) r1 = 2 2988 1: (63) *(u32 *)(r10 -4) = r1 2989 2: (bf) r2 = r10 2990 3: (07) r2 += -4 2991 4: (18) r1 = map[id:2] <-- BPF map id 2 2992 6: (85) call __htab_map_lookup_elem#77408 <-+ BPF helper inlined rewrite 2993 7: (15) if r0 == 0x0 goto pc+2 | 2994 8: (07) r0 += 56 | 2995 9: (79) r0 = *(u64 *)(r0 +0) <-+ 2996 10: (15) if r0 == 0x0 goto pc+24 2997 11: (bf) r2 = r10 2998 12: (07) r2 += -4 2999 [...] 3000 3001 bpftool correlates calls to helper functions or BPF to BPF calls through 3002 kallsyms. Therefore, make sure that JITed BPF programs are exposed to 3003 kallsyms (``bpf_jit_kallsyms``) and that kallsyms addresses are not 3004 obfuscated (calls are otherwise shown as ``call bpf_unspec#0``): 3005 3006 :: 3007 3008 # echo 0 > /proc/sys/kernel/kptr_restrict 3009 # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms 3010 3011 BPF to BPF calls are correlated as well for both, interpreter as well 3012 as JIT case. In the latter, the tag of the subprogram is shown as 3013 call target. In each case, the ``pc+2`` is the pc-relative offset of 3014 the call target, which denotes the subprogram. 3015 3016 :: 3017 3018 # bpftool prog dump xlated id 1 3019 0: (85) call pc+2#__bpf_prog_run_args32 3020 1: (b7) r0 = 1 3021 2: (95) exit 3022 3: (b7) r0 = 2 3023 4: (95) exit 3024 3025 JITed variant of the dump: 3026 3027 :: 3028 3029 # bpftool prog dump xlated id 1 3030 0: (85) call pc+2#bpf_prog_3b185187f1855c4c_F 3031 1: (b7) r0 = 1 3032 2: (95) exit 3033 3: (b7) r0 = 2 3034 4: (95) exit 3035 3036 In the case of tail calls, the kernel maps them into a single instruction 3037 internally, bpftool will still correlate them as a helper call for ease 3038 of debugging: 3039 3040 :: 3041 3042 # bpftool prog dump xlated id 2 3043 [...] 3044 10: (b7) r2 = 8 3045 11: (85) call bpf_trace_printk#-41312 3046 12: (bf) r1 = r6 3047 13: (18) r2 = map[id:1] 3048 15: (b7) r3 = 0 3049 16: (85) call bpf_tail_call#12 3050 17: (b7) r1 = 42 3051 18: (6b) *(u16 *)(r6 +46) = r1 3052 19: (b7) r0 = 0 3053 20: (95) exit 3054 3055 # bpftool map show id 1 3056 1: prog_array flags 0x0 3057 key 4B value 4B max_entries 1 memlock 4096B 3058 3059 Dumping an entire map is possible through the ``map dump`` subcommand 3060 which iterates through all present map elements and dumps the key / 3061 value pairs. 
3062 3063 If no BTF (BPF Type Format) data is available for a given map, then 3064 the key / value pairs are dumped as hex: 3065 3066 :: 3067 3068 # bpftool map dump id 5 3069 key: 3070 f0 0d 00 00 00 00 00 00 0a 66 00 00 00 00 8a d6 3071 02 00 00 00 3072 value: 3073 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 3074 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3075 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3076 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3077 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3078 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3079 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3080 key: 3081 0a 66 1c ee 00 00 00 00 00 00 00 00 00 00 00 00 3082 01 00 00 00 3083 value: 3084 00 00 00 00 00 00 00 00 01 00 00 00 00 00 00 00 3085 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 3086 [...] 3087 Found 6 elements 3088 3089 However, with BTF, the map also holds debugging information about 3090 the key and value structures. For example, BTF in combination with 3091 BPF maps and the BPF_ANNOTATE_KV_PAIR() macro from iproute2 will 3092 result in the following dump (``test_xdp_noinline.o`` from kernel 3093 selftests): 3094 3095 :: 3096 3097 # cat tools/testing/selftests/bpf/test_xdp_noinline.c 3098 [...] 3099 struct ctl_value { 3100 union { 3101 __u64 value; 3102 __u32 ifindex; 3103 __u8 mac[6]; 3104 }; 3105 }; 3106 3107 struct bpf_map_def __attribute__ ((section("maps"), used)) ctl_array = { 3108 .type = BPF_MAP_TYPE_ARRAY, 3109 .key_size = sizeof(__u32), 3110 .value_size = sizeof(struct ctl_value), 3111 .max_entries = 16, 3112 .map_flags = 0, 3113 }; 3114 BPF_ANNOTATE_KV_PAIR(ctl_array, __u32, struct ctl_value); 3115 3116 [...] 3117 3118 The BPF_ANNOTATE_KV_PAIR() macro forces a map-specific ELF section 3119 containing an empty key and value, this enables the iproute2 BPF loader 3120 to correlate BTF data with that section and thus allows to choose the 3121 corresponding types out of the BTF for loading the map. 3122 3123 Compiling through LLVM and generating BTF through debugging information 3124 by ``pahole``: 3125 3126 :: 3127 3128 # clang [...] -O2 -target bpf -g -emit-llvm -c test_xdp_noinline.c -o - | 3129 llc -march=bpf -mcpu=probe -mattr=dwarfris -filetype=obj -o test_xdp_noinline.o 3130 # pahole -J test_xdp_noinline.o 3131 3132 Now loading into kernel and dumping the map via bpftool: 3133 3134 :: 3135 3136 # ip -force link set dev lo xdp obj test_xdp_noinline.o sec xdp-test 3137 # ip a 3138 1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 xdpgeneric/id:227 qdisc noqueue state UNKNOWN group default qlen 1000 3139 link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00 3140 inet 127.0.0.1/8 scope host lo 3141 valid_lft forever preferred_lft forever 3142 inet6 ::1/128 scope host 3143 valid_lft forever preferred_lft forever 3144 [...] 3145 # bpftool prog show id 227 3146 227: xdp tag a85e060c275c5616 gpl 3147 loaded_at 2018-07-17T14:41:29+0000 uid 0 3148 xlated 8152B not jited memlock 12288B map_ids 381,385,386,382,384,383 3149 # bpftool map dump id 386 3150 [{ 3151 "key": 0, 3152 "value": { 3153 "": { 3154 "value": 0, 3155 "ifindex": 0, 3156 "mac": [] 3157 } 3158 } 3159 },{ 3160 "key": 1, 3161 "value": { 3162 "": { 3163 "value": 0, 3164 "ifindex": 0, 3165 "mac": [] 3166 } 3167 } 3168 },{ 3169 [...] 3170 3171 Lookup, update, delete, and 'get next key' operations on the map for specific 3172 keys can be performed through bpftool as well. 
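All of these per-ID operations are ultimately issued through the ``bpf(2)``
system call, so the same lookups can also be driven from a small user space
control plane. The following sketch is not part of bpftool or Cilium; it
assumes libbpf's low-level wrappers from ``tools/lib/bpf`` and reuses the map
ID ``386`` together with the 4 byte key / 8 byte value layout of the
``ctl_array`` example above purely for illustration, keeping in mind that IDs
are only valid while the map is loaded:

::

    #include <stdio.h>
    #include <stdint.h>

    #include <bpf/bpf.h>

    int main(void)
    {
        uint32_t key = 0;
        uint64_t value = 0;
        int fd;

        /* Same ID resolution that bpftool performs for "id 386". */
        fd = bpf_map_get_fd_by_id(386);
        if (fd < 0) {
            perror("bpf_map_get_fd_by_id");
            return 1;
        }

        /* Roughly what "bpftool map lookup id 386 key 0 0 0 0" does. */
        if (bpf_map_lookup_elem(fd, &key, &value) < 0) {
            perror("bpf_map_lookup_elem");
            return 1;
        }

        printf("key %u -> value %llu\n", key, (unsigned long long)value);
        return 0;
    }

Running such a program requires the same privileges as bpftool itself, and
the ID has to refer to a map which is currently loaded in the system.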
3173 3174 BPF sysctls 3175 ----------- 3176 3177 The Linux kernel provides few sysctls that are BPF related and covered in this section. 3178 3179 * ``/proc/sys/net/core/bpf_jit_enable``: Enables or disables the BPF JIT compiler. 3180 3181 +-------+-------------------------------------------------------------------+ 3182 | Value | Description | 3183 +-------+-------------------------------------------------------------------+ 3184 | 0 | Disable the JIT and use only interpreter (kernel's default value) | 3185 +-------+-------------------------------------------------------------------+ 3186 | 1 | Enable the JIT compiler | 3187 +-------+-------------------------------------------------------------------+ 3188 | 2 | Enable the JIT and emit debugging traces to the kernel log | 3189 +-------+-------------------------------------------------------------------+ 3190 3191 As described in subsequent sections, ``bpf_jit_disasm`` tool can be used to 3192 process debugging traces when the JIT compiler is set to debugging mode (option ``2``). 3193 3194 * ``/proc/sys/net/core/bpf_jit_harden``: Enables or disables BPF JIT hardening. 3195 Note that enabling hardening trades off performance, but can mitigate JIT spraying 3196 by blinding out the BPF program's immediate values. For programs processed through 3197 the interpreter, blinding of immediate values is not needed / performed. 3198 3199 +-------+-------------------------------------------------------------------+ 3200 | Value | Description | 3201 +-------+-------------------------------------------------------------------+ 3202 | 0 | Disable JIT hardening (kernel's default value) | 3203 +-------+-------------------------------------------------------------------+ 3204 | 1 | Enable JIT hardening for unprivileged users only | 3205 +-------+-------------------------------------------------------------------+ 3206 | 2 | Enable JIT hardening for all users | 3207 +-------+-------------------------------------------------------------------+ 3208 3209 * ``/proc/sys/net/core/bpf_jit_kallsyms``: Enables or disables export of JITed 3210 programs as kernel symbols to ``/proc/kallsyms`` so that they can be used together 3211 with ``perf`` tooling as well as making these addresses aware to the kernel for 3212 stack unwinding, for example, used in dumping stack traces. The symbol names 3213 contain the BPF program tag (``bpf_prog_<tag>``). If ``bpf_jit_harden`` is enabled, 3214 then this feature is disabled. 3215 3216 +-------+-------------------------------------------------------------------+ 3217 | Value | Description | 3218 +-------+-------------------------------------------------------------------+ 3219 | 0 | Disable JIT kallsyms export (kernel's default value) | 3220 +-------+-------------------------------------------------------------------+ 3221 | 1 | Enable JIT kallsyms export for privileged users only | 3222 +-------+-------------------------------------------------------------------+ 3223 3224 * ``/proc/sys/kernel/unprivileged_bpf_disabled``: Enables or disable unprivileged 3225 use of the ``bpf(2)`` system call. The Linux kernel has unprivileged use of 3226 ``bpf(2)`` enabled by default, but once the switch is flipped, unprivileged use 3227 will be permanently disabled until the next reboot. This sysctl knob is a one-time 3228 switch, meaning if once set, then neither an application nor an admin can reset 3229 the value anymore. 
This knob does not affect any cBPF programs such as seccomp 3230 or traditional socket filters that do not use the ``bpf(2)`` system call for 3231 loading the program into the kernel. 3232 3233 +-------+-------------------------------------------------------------------+ 3234 | Value | Description | 3235 +-------+-------------------------------------------------------------------+ 3236 | 0 | Unprivileged use of bpf syscall enabled (kernel's default value) | 3237 +-------+-------------------------------------------------------------------+ 3238 | 1 | Unprivileged use of bpf syscall disabled | 3239 +-------+-------------------------------------------------------------------+ 3240 3241 Kernel Testing 3242 -------------- 3243 3244 The Linux kernel ships a BPF selftest suite, which can be found in the kernel 3245 source tree under ``tools/testing/selftests/bpf/``. 3246 3247 :: 3248 3249 $ cd tools/testing/selftests/bpf/ 3250 $ make 3251 # make run_tests 3252 3253 The test suite contains test cases against the BPF verifier, program tags, 3254 various tests against the BPF map interface and map types. It contains various 3255 runtime tests from C code for checking LLVM back end, and eBPF as well as cBPF 3256 asm code that is run in the kernel for testing the interpreter and JITs. 3257 3258 JIT Debugging 3259 ------------- 3260 3261 For JIT developers performing audits or writing extensions, each compile run 3262 can output the generated JIT image into the kernel log through: 3263 3264 :: 3265 3266 # echo 2 > /proc/sys/net/core/bpf_jit_enable 3267 3268 Whenever a new BPF program is loaded, the JIT compiler will dump the output, 3269 which can then be inspected with ``dmesg``, for example: 3270 3271 :: 3272 3273 [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f from=tcpdump pid=20583 3274 [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68 3275 [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00 3276 [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00 3277 [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00 3278 [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3 3279 3280 ``flen`` is the length of the BPF program (here, 6 BPF instructions), and ``proglen`` 3281 tells the number of bytes generated by the JIT for the opcode image (here, 70 bytes 3282 in size). ``pass`` means that the image was generated in 3 compiler passes, for 3283 example, ``x86_64`` can have various optimization passes to further reduce the image 3284 size when possible. ``image`` contains the address of the generated JIT image, ``from`` 3285 and ``pid`` the user space application name and PID respectively, which triggered the 3286 compilation process. The dump output for eBPF and cBPF JITs is the same format. 3287 3288 In the kernel tree under ``tools/bpf/``, there is a tool called ``bpf_jit_disasm``. 
It 3289 reads out the latest dump and prints the disassembly for further inspection: 3290 3291 :: 3292 3293 # ./bpf_jit_disasm 3294 70 bytes emitted from JIT compiler (pass:3, flen:6) 3295 ffffffffa0069c8f + <x>: 3296 0: push %rbp 3297 1: mov %rsp,%rbp 3298 4: sub $0x60,%rsp 3299 8: mov %rbx,-0x8(%rbp) 3300 c: mov 0x68(%rdi),%r9d 3301 10: sub 0x6c(%rdi),%r9d 3302 14: mov 0xd8(%rdi),%r8 3303 1b: mov $0xc,%esi 3304 20: callq 0xffffffffe0ff9442 3305 25: cmp $0x800,%eax 3306 2a: jne 0x0000000000000042 3307 2c: mov $0x17,%esi 3308 31: callq 0xffffffffe0ff945e 3309 36: cmp $0x1,%eax 3310 39: jne 0x0000000000000042 3311 3b: mov $0xffff,%eax 3312 40: jmp 0x0000000000000044 3313 42: xor %eax,%eax 3314 44: leaveq 3315 45: retq 3316 3317 Alternatively, the tool can also dump related opcodes along with the disassembly. 3318 3319 :: 3320 3321 # ./bpf_jit_disasm -o 3322 70 bytes emitted from JIT compiler (pass:3, flen:6) 3323 ffffffffa0069c8f + <x>: 3324 0: push %rbp 3325 55 3326 1: mov %rsp,%rbp 3327 48 89 e5 3328 4: sub $0x60,%rsp 3329 48 83 ec 60 3330 8: mov %rbx,-0x8(%rbp) 3331 48 89 5d f8 3332 c: mov 0x68(%rdi),%r9d 3333 44 8b 4f 68 3334 10: sub 0x6c(%rdi),%r9d 3335 44 2b 4f 6c 3336 14: mov 0xd8(%rdi),%r8 3337 4c 8b 87 d8 00 00 00 3338 1b: mov $0xc,%esi 3339 be 0c 00 00 00 3340 20: callq 0xffffffffe0ff9442 3341 e8 1d 94 ff e0 3342 25: cmp $0x800,%eax 3343 3d 00 08 00 00 3344 2a: jne 0x0000000000000042 3345 75 16 3346 2c: mov $0x17,%esi 3347 be 17 00 00 00 3348 31: callq 0xffffffffe0ff945e 3349 e8 28 94 ff e0 3350 36: cmp $0x1,%eax 3351 83 f8 01 3352 39: jne 0x0000000000000042 3353 75 07 3354 3b: mov $0xffff,%eax 3355 b8 ff ff 00 00 3356 40: jmp 0x0000000000000044 3357 eb 02 3358 42: xor %eax,%eax 3359 31 c0 3360 44: leaveq 3361 c9 3362 45: retq 3363 c3 3364 3365 More recently, ``bpftool`` adapted the same feature of dumping the BPF JIT 3366 image based on a given BPF program ID already loaded in the system (see 3367 bpftool section). 3368 3369 For performance analysis of JITed BPF programs, ``perf`` can be used as 3370 usual. As a prerequisite, JITed programs need to be exported through kallsyms 3371 infrastructure. 3372 3373 :: 3374 3375 # echo 1 > /proc/sys/net/core/bpf_jit_enable 3376 # echo 1 > /proc/sys/net/core/bpf_jit_kallsyms 3377 3378 Enabling or disabling ``bpf_jit_kallsyms`` does not require a reload of the 3379 related BPF programs. Next, a small workflow example is provided for profiling 3380 BPF programs. A crafted tc BPF program is used for demonstration purposes, 3381 where perf records a failed allocation inside ``bpf_clone_redirect()`` helper. 3382 Due to the use of direct write, ``bpf_try_make_head_writable()`` failed, which 3383 would then release the cloned ``skb`` again and return with an error message. 3384 ``perf`` thus records all ``kfree_skb`` events. 3385 3386 :: 3387 3388 # tc qdisc add dev em1 clsact 3389 # tc filter add dev em1 ingress bpf da obj prog.o sec main 3390 # tc filter show dev em1 ingress 3391 filter protocol all pref 49152 bpf 3392 filter protocol all pref 49152 bpf handle 0x1 prog.o:[main] direct-action id 1 tag 8227addf251b7543 3393 3394 # cat /proc/kallsyms 3395 [...] 3396 ffffffffc00349e0 t fjes_hw_init_command_registers [fjes] 3397 ffffffffc003e2e0 d __tracepoint_fjes_hw_stop_debug_err [fjes] 3398 ffffffffc0036190 t fjes_hw_epbuf_tx_pkt_send [fjes] 3399 ffffffffc004b000 t bpf_prog_8227addf251b7543 3400 3401 # perf record -a -g -e skb:kfree_skb sleep 60 3402 # perf script --kallsyms=/proc/kallsyms 3403 [...] 
3404 ksoftirqd/0 6 [000] 1004.578402: skb:kfree_skb: skbaddr=0xffff9d4161f20a00 protocol=2048 location=0xffffffffc004b52c 3405 7fffb8745961 bpf_clone_redirect (/lib/modules/4.10.0+/build/vmlinux) 3406 7fffc004e52c bpf_prog_8227addf251b7543 (/lib/modules/4.10.0+/build/vmlinux) 3407 7fffc05b6283 cls_bpf_classify (/lib/modules/4.10.0+/build/vmlinux) 3408 7fffb875957a tc_classify (/lib/modules/4.10.0+/build/vmlinux) 3409 7fffb8729840 __netif_receive_skb_core (/lib/modules/4.10.0+/build/vmlinux) 3410 7fffb8729e38 __netif_receive_skb (/lib/modules/4.10.0+/build/vmlinux) 3411 7fffb872ae05 process_backlog (/lib/modules/4.10.0+/build/vmlinux) 3412 7fffb872a43e net_rx_action (/lib/modules/4.10.0+/build/vmlinux) 3413 7fffb886176c __do_softirq (/lib/modules/4.10.0+/build/vmlinux) 3414 7fffb80ac5b9 run_ksoftirqd (/lib/modules/4.10.0+/build/vmlinux) 3415 7fffb80ca7fa smpboot_thread_fn (/lib/modules/4.10.0+/build/vmlinux) 3416 7fffb80c6831 kthread (/lib/modules/4.10.0+/build/vmlinux) 3417 7fffb885e09c ret_from_fork (/lib/modules/4.10.0+/build/vmlinux) 3418 3419 The stack trace recorded by ``perf`` will then show the ``bpf_prog_8227addf251b7543()`` 3420 symbol as part of the call trace, meaning that the BPF program with the 3421 tag ``8227addf251b7543`` was related to the ``kfree_skb`` event, and 3422 such program was attached to netdevice ``em1`` on the ingress hook as 3423 shown by tc. 3424 3425 Introspection 3426 ------------- 3427 3428 The Linux kernel provides various tracepoints around BPF and XDP which 3429 can be used for additional introspection, for example, to trace interactions 3430 of user space programs with the bpf system call. 3431 3432 Tracepoints for BPF: 3433 3434 :: 3435 3436 # perf list | grep bpf: 3437 bpf:bpf_map_create [Tracepoint event] 3438 bpf:bpf_map_delete_elem [Tracepoint event] 3439 bpf:bpf_map_lookup_elem [Tracepoint event] 3440 bpf:bpf_map_next_key [Tracepoint event] 3441 bpf:bpf_map_update_elem [Tracepoint event] 3442 bpf:bpf_obj_get_map [Tracepoint event] 3443 bpf:bpf_obj_get_prog [Tracepoint event] 3444 bpf:bpf_obj_pin_map [Tracepoint event] 3445 bpf:bpf_obj_pin_prog [Tracepoint event] 3446 bpf:bpf_prog_get_type [Tracepoint event] 3447 bpf:bpf_prog_load [Tracepoint event] 3448 bpf:bpf_prog_put_rcu [Tracepoint event] 3449 3450 Example usage with ``perf`` (alternatively to ``sleep`` example used here, 3451 a specific application like ``tc`` could be used here instead, of course): 3452 3453 :: 3454 3455 # perf record -a -e bpf:* sleep 10 3456 # perf script 3457 sock_example 6197 [005] 283.980322: bpf:bpf_map_create: map type=ARRAY ufd=4 key=4 val=8 max=256 flags=0 3458 sock_example 6197 [005] 283.980721: bpf:bpf_prog_load: prog=a5ea8fa30ea6849c type=SOCKET_FILTER ufd=5 3459 sock_example 6197 [005] 283.988423: bpf:bpf_prog_get_type: prog=a5ea8fa30ea6849c type=SOCKET_FILTER 3460 sock_example 6197 [005] 283.988443: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[06 00 00 00] val=[00 00 00 00 00 00 00 00] 3461 [...] 3462 sock_example 6197 [005] 288.990868: bpf:bpf_map_lookup_elem: map type=ARRAY ufd=4 key=[01 00 00 00] val=[14 00 00 00 00 00 00 00] 3463 swapper 0 [005] 289.338243: bpf:bpf_prog_put_rcu: prog=a5ea8fa30ea6849c type=SOCKET_FILTER 3464 3465 For the BPF programs, their individual program tag is displayed. 
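The program tag shown in the trace output, for example ``a5ea8fa30ea6849c``,
is the same tag that ``bpftool prog`` displays, so it can be fed back to
bpftool, for instance through ``bpftool prog show tag a5ea8fa30ea6849c``, in
order to correlate the traced ``bpf(2)`` activity with the corresponding
loaded program. Note that a tag is a hash over the instruction stream and may
therefore match more than one program if the same instructions were loaded
multiple times.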
3466 3467 For debugging, XDP also has a tracepoint that is triggered when exceptions are raised: 3468 3469 :: 3470 3471 # perf list | grep xdp: 3472 xdp:xdp_exception [Tracepoint event] 3473 3474 Exceptions are triggered in the following scenarios: 3475 3476 * The BPF program returned an invalid / unknown XDP action code. 3477 * The BPF program returned with ``XDP_ABORTED`` indicating a non-graceful exit. 3478 * The BPF program returned with ``XDP_TX``, but there was an error on transmit, 3479 for example, due to the port not being up, due to the transmit ring being full, 3480 due to allocation failures, etc. 3481 3482 Both tracepoint classes can also be inspected with a BPF program itself 3483 attached to one or more tracepoints, collecting further information 3484 in a map or punting such events to a user space collector through the 3485 ``bpf_perf_event_output()`` helper, for example. 3486 3487 Miscellaneous 3488 ------------- 3489 3490 BPF programs and maps are memory accounted against ``RLIMIT_MEMLOCK`` similar 3491 to ``perf``. The currently available size in unit of system pages which may be 3492 locked into memory can be inspected through ``ulimit -l``. The setrlimit system 3493 call man page provides further details. 3494 3495 The default limit is usually insufficient to load more complex programs or 3496 larger BPF maps, so that the BPF system call will return with ``errno`` 3497 of ``EPERM``. In such situations a workaround with ``ulimit -l unlimited`` or 3498 with a sufficiently large limit could be performed. The ``RLIMIT_MEMLOCK`` is 3499 mainly enforcing limits for unprivileged users. Depending on the setup, 3500 setting a higher limit for privileged users is often acceptable. 3501 3502 Program Types 3503 ============= 3504 3505 At the time of this writing, there are eighteen different BPF program types 3506 available, two of the main types for networking are further explained in below 3507 subsections, namely XDP BPF programs as well as tc BPF programs. Extensive 3508 usage examples for the two program types for LLVM, iproute2 or other tools 3509 are spread throughout the toolchain section and not covered here. Instead, 3510 this section focuses on their architecture, concepts and use cases. 3511 3512 XDP 3513 --- 3514 3515 XDP stands for eXpress Data Path and provides a framework for BPF that enables 3516 high-performance programmable packet processing in the Linux kernel. It runs 3517 the BPF program at the earliest possible point in software, namely at the moment 3518 the network driver receives the packet. 3519 3520 At this point in the fast-path the driver just picked up the packet from its 3521 receive rings, without having done any expensive operations such as allocating 3522 an ``skb`` for pushing the packet further up the networking stack, without 3523 having pushed the packet into the GRO engine, etc. Thus, the XDP BPF program 3524 is executed at the earliest point when it becomes available to the CPU for 3525 processing. 3526 3527 XDP works in concert with the Linux kernel and its infrastructure, meaning 3528 the kernel is not bypassed as in various networking frameworks that operate 3529 in user space only. Keeping the packet in kernel space has several major 3530 advantages: 3531 3532 * XDP is able to reuse all the upstream developed kernel networking drivers, 3533 user space tooling, or even other available in-kernel infrastructure such 3534 as routing tables, sockets, etc in BPF helper calls itself. 
3535 * Residing in kernel space, XDP has the same security model as the rest of 3536 the kernel for accessing hardware. 3537 * There is no need for crossing kernel / user space boundaries since the 3538 processed packet already resides in the kernel and can therefore flexibly 3539 forward packets into other in-kernel entities like namespaces used by 3540 containers or the kernel's networking stack itself. This is particularly 3541 relevant in times of Meltdown and Spectre. 3542 * Punting packets from XDP to the kernel's robust, widely used and efficient 3543 TCP/IP stack is trivially possible, allows for full reuse and does not 3544 require maintaining a separate TCP/IP stack as with user space frameworks. 3545 * The use of BPF allows for full programmability, keeping a stable ABI with 3546 the same 'never-break-user-space' guarantees as with the kernel's system 3547 call ABI and compared to modules it also provides safety measures thanks to 3548 the BPF verifier that ensures the stability of the kernel's operation. 3549 * XDP trivially allows for atomically swapping programs during runtime without 3550 any network traffic interruption or even kernel / system reboot. 3551 * XDP allows for flexible structuring of workloads integrated into 3552 the kernel. For example, it can operate in "busy polling" or "interrupt 3553 driven" mode. Explicitly dedicating CPUs to XDP is not required. There 3554 are no special hardware requirements and it does not rely on hugepages. 3555 * XDP does not require any third party kernel modules or licensing. It is 3556 a long-term architectural solution, a core part of the Linux kernel, and 3557 developed by the kernel community. 3558 * XDP is already enabled and shipped everywhere with major distributions 3559 running a kernel equivalent to 4.8 or higher and supports most major 10G 3560 or higher networking drivers. 3561 3562 As a framework for running BPF in the driver, XDP additionally ensures that 3563 packets are laid out linearly and fit into a single DMA'ed page which is 3564 readable and writable by the BPF program. XDP also ensures that additional 3565 headroom of 256 bytes is available to the program for implementing custom 3566 encapsulation headers with the help of the ``bpf_xdp_adjust_head()`` BPF helper 3567 or adding custom metadata in front of the packet through ``bpf_xdp_adjust_meta()``. 3568 3569 The framework contains XDP action codes further described in the section 3570 below which a BPF program can return in order to instruct the driver how 3571 to proceed with the packet, and it enables the possibility to atomically 3572 replace BPF programs running at the XDP layer. XDP is tailored for 3573 high-performance by design. BPF allows to access the packet data through 3574 'direct packet access' which means that the program holds data pointers 3575 directly in registers, loads the content into registers, respectively 3576 writes from there into the packet. 3577 3578 The packet representation in XDP that is passed to the BPF program as 3579 the BPF context looks as follows: 3580 3581 :: 3582 3583 struct xdp_buff { 3584 void *data; 3585 void *data_end; 3586 void *data_meta; 3587 void *data_hard_start; 3588 struct xdp_rxq_info *rxq; 3589 }; 3590 3591 ``data`` points to the start of the packet data in the page, and as the 3592 name suggests, ``data_end`` points to the end of the packet data. 
Since XDP 3593 allows for a headroom, ``data_hard_start`` points to the maximum possible 3594 headroom start in the page, meaning, when the packet should be encapsulated, 3595 then ``data`` is moved closer towards ``data_hard_start`` via ``bpf_xdp_adjust_head()``. 3596 The same BPF helper function also allows for decapsulation in which case 3597 ``data`` is moved further away from ``data_hard_start``. 3598 3599 ``data_meta`` initially points to the same location as ``data`` but 3600 ``bpf_xdp_adjust_meta()`` is able to move the pointer towards ``data_hard_start`` 3601 as well in order to provide room for custom metadata which is invisible to 3602 the normal kernel networking stack but can be read by tc BPF programs since 3603 it is transferred from XDP to the ``skb``. Vice versa, it can remove or reduce 3604 the size of the custom metadata through the same BPF helper function by 3605 moving ``data_meta`` away from ``data_hard_start`` again. ``data_meta`` can 3606 also be used solely for passing state between tail calls similarly to the 3607 ``skb->cb[]`` control block case that is accessible in tc BPF programs. 3608 3609 This gives the following relation respectively invariant for the ``struct xdp_buff`` 3610 packet pointers: ``data_hard_start`` <= ``data_meta`` <= ``data`` < ``data_end``. 3611 3612 The ``rxq`` field points to some additional per receive queue metadata which 3613 is populated at ring setup time (not at XDP runtime): 3614 3615 :: 3616 3617 struct xdp_rxq_info { 3618 struct net_device *dev; 3619 u32 queue_index; 3620 u32 reg_state; 3621 } ____cacheline_aligned; 3622 3623 The BPF program can retrieve ``queue_index`` as well as additional data 3624 from the netdevice itself such as ``ifindex``, etc. 3625 3626 **BPF program return codes** 3627 3628 After running the XDP BPF program, a verdict is returned from the program in 3629 order to tell the driver how to process the packet next. In the ``linux/bpf.h`` 3630 system header file all available return verdicts are enumerated: 3631 3632 :: 3633 3634 enum xdp_action { 3635 XDP_ABORTED = 0, 3636 XDP_DROP, 3637 XDP_PASS, 3638 XDP_TX, 3639 XDP_REDIRECT, 3640 }; 3641 3642 ``XDP_DROP`` as the name suggests will drop the packet right at the driver 3643 level without wasting any further resources. This is in particular useful 3644 for BPF programs implementing DDoS mitigation mechanisms or firewalling in 3645 general. The ``XDP_PASS`` return code means that the packet is allowed to 3646 be passed up to the kernel's networking stack. Meaning, the current CPU 3647 that was processing this packet now allocates a ``skb``, populates it, and 3648 passes it onwards into the GRO engine. This would be equivalent to the 3649 default packet handling behavior without XDP. With ``XDP_TX`` the BPF program 3650 has an efficient option to transmit the network packet out of the same NIC it 3651 just arrived on again. This is typically useful when few nodes are implementing, 3652 for example, firewalling with subsequent load balancing in a cluster and 3653 thus act as a hairpinned load balancer pushing the incoming packets back 3654 into the switch after rewriting them in XDP BPF. ``XDP_REDIRECT`` is similar 3655 to ``XDP_TX`` in that it is able to transmit the XDP packet, but through 3656 another NIC. Another option for the ``XDP_REDIRECT`` case is to redirect 3657 into a BPF cpumap, meaning, the CPUs serving XDP on the NIC's receive queues 3658 can continue to do so and push the packet for processing the upper kernel 3659 stack to a remote CPU. 
This is similar to ``XDP_PASS``, but with the ability 3660 that the XDP BPF program can keep serving the incoming high load as opposed 3661 to temporarily spend work on the current packet for pushing into upper 3662 layers. Last but not least, ``XDP_ABORTED`` which serves denoting an exception 3663 like state from the program and has the same behavior as ``XDP_DROP`` only 3664 that ``XDP_ABORTED`` passes the ``trace_xdp_exception`` tracepoint which 3665 can be additionally monitored to detect misbehavior. 3666 3667 **Use cases for XDP** 3668 3669 Some of the main use cases for XDP are presented in this subsection. The 3670 list is non-exhaustive and given the programmability and efficiency XDP 3671 and BPF enables, it can easily be adapted to solve very specific use 3672 cases. 3673 3674 * **DDoS mitigation, firewalling** 3675 3676 One of the basic XDP BPF features is to tell the driver to drop a packet 3677 with ``XDP_DROP`` at this early stage which allows for any kind of efficient 3678 network policy enforcement with having an extremely low per-packet cost. 3679 This is ideal in situations when needing to cope with any sort of DDoS 3680 attacks, but also more general allows to implement any sort of firewalling 3681 policies with close to no overhead in BPF e.g. in either case as stand alone 3682 appliance (e.g. scrubbing 'clean' traffic through ``XDP_TX``) or widely 3683 deployed on nodes protecting end hosts themselves (via ``XDP_PASS`` or 3684 cpumap ``XDP_REDIRECT`` for good traffic). Offloaded XDP takes this even 3685 one step further by moving the already small per-packet cost entirely 3686 into the NIC with processing at line-rate. 3687 3688 .. 3689 3690 * **Forwarding and load-balancing** 3691 3692 Another major use case of XDP is packet forwarding and load-balancing 3693 through either ``XDP_TX`` or ``XDP_REDIRECT`` actions. The packet can 3694 be arbitrarily mangled by the BPF program running in the XDP layer, 3695 even BPF helper functions are available for increasing or decreasing 3696 the packet's headroom in order to arbitrarily encapsulate respectively 3697 decapsulate the packet before sending it out again. With ``XDP_TX`` 3698 hairpinned load-balancers can be implemented that push the packet out 3699 of the same networking device it originally arrived on, or with the 3700 ``XDP_REDIRECT`` action it can be forwarded to another NIC for 3701 transmission. The latter return code can also be used in combination 3702 with BPF's cpumap to load-balance packets for passing up the local 3703 stack, but on remote, non-XDP processing CPUs. 3704 3705 .. 3706 3707 * **Pre-stack filtering / processing** 3708 3709 Besides policy enforcement, XDP can also be used for hardening the 3710 kernel's networking stack with the help of ``XDP_DROP`` case, meaning, 3711 it can drop irrelevant packets for a local node right at the earliest 3712 possible point before the networking stack sees them e.g. given we 3713 know that a node only serves TCP traffic, any UDP, SCTP or other L4 3714 traffic can be dropped right away. This has the advantage that packets 3715 do not need to traverse various entities like GRO engine, the kernel's 3716 flow dissector and others before it can be determined to drop them and 3717 thus this allows for reducing the kernel's attack surface. Thanks to 3718 XDP's early processing stage, this effectively 'pretends' to the kernel's 3719 networking stack that these packets have never been seen by the networking 3720 device. 
Additionally, if a potential bug in the stack's receive path 3721 got uncovered and would cause a 'ping of death' like scenario, XDP can be 3722 utilized to drop such packets right away without having to reboot the 3723 kernel or restart any services. Due to the ability to atomically swap 3724 such programs to enforce a drop of bad packets, no network traffic is 3725 even interrupted on a host. 3726 3727 Another use case for pre-stack processing is that given the kernel has not 3728 yet allocated an ``skb`` for the packet, the BPF program is free to modify 3729 the packet and, again, have it 'pretend' to the stack that it was received 3730 by the networking device this way. This allows for cases such as having 3731 custom packet mangling and encapsulation protocols where the packet can be 3732 decapsulated prior to entering GRO aggregation in which GRO otherwise would 3733 not be able to perform any sort of aggregation due to not being aware of 3734 the custom protocol. XDP also allows to push metadata (non-packet data) in 3735 front of the packet. This is 'invisible' to the normal kernel stack, can 3736 be GRO aggregated (for matching metadata) and later on processed in 3737 coordination with a tc ingress BPF program where it has the context of 3738 a ``skb`` available for e.g. setting various skb fields. 3739 3740 .. 3741 3742 * **Flow sampling, monitoring** 3743 3744 XDP can also be used for cases such as packet monitoring, sampling or any 3745 other network analytics, for example, as part of an intermediate node in 3746 the path or on end hosts in combination also with prior mentioned use cases. 3747 For complex packet analysis, XDP provides a facility to efficiently push 3748 network packets (truncated or with full payload) and custom metadata into 3749 a fast lockless per CPU memory mapped ring buffer provided from the Linux 3750 perf infrastructure to an user space application. This also allows for 3751 cases where only a flow's initial data can be analyzed and once determined 3752 as good traffic having the monitoring bypassed. Thanks to the flexibility 3753 brought by BPF, this allows for implementing any sort of custom monitoring 3754 or sampling. 3755 3756 .. 3757 3758 One example of XDP BPF production usage is Facebook's SHIV and Droplet 3759 infrastructure which implement their L4 load-balancing and DDoS countermeasures. 3760 Migrating their production infrastructure away from netfilter's IPVS 3761 (IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared 3762 to their previous IPVS setup. This was first presented at the netdev 2.1 3763 conference: 3764 3765 * Slides: https://www.netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf 3766 * Video: https://youtu.be/YEU2ClcGqts 3767 3768 Another example is the integration of XDP into Cloudflare's DDoS mitigation 3769 pipeline, which originally was using cBPF instead of eBPF for attack signature 3770 matching through iptables' ``xt_bpf`` module. Due to use of iptables this 3771 caused severe performance problems under attack where a user space bypass 3772 solution was deemed necessary but came with drawbacks as well such as needing 3773 to busy poll the NIC and expensive packet re-injection into the kernel's stack. 
3774 The migration over to eBPF and XDP combined best of both worlds by having 3775 high-performance programmable packet processing directly inside the kernel: 3776 3777 * Slides: https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP.pdf 3778 * Video: https://youtu.be/7OuOukmuivg 3779 3780 **XDP operation modes** 3781 3782 XDP has three operation modes where 'native' XDP is the default mode. When 3783 talked about XDP this mode is typically implied. 3784 3785 * **Native XDP** 3786 3787 This is the default mode where the XDP BPF program is run directly out 3788 of the networking driver's early receive path. Most widespread used NICs 3789 for 10G and higher support native XDP already. 3790 3791 .. 3792 3793 * **Offloaded XDP** 3794 3795 In the offloaded XDP mode the XDP BPF program is directly offloaded into 3796 the NIC instead of being executed on the host CPU. Thus, the already 3797 extremely low per-packet cost is pushed off the host CPU entirely and 3798 executed on the NIC, providing even higher performance than running in 3799 native XDP. This offload is typically implemented by SmartNICs 3800 containing multi-threaded, multicore flow processors where a in-kernel 3801 JIT compiler translates BPF into native instructions for the latter. 3802 Drivers supporting offloaded XDP usually also support native XDP for 3803 cases where some BPF helpers may not yet or only be available for the 3804 native mode. 3805 3806 .. 3807 3808 * **Generic XDP** 3809 3810 For drivers not implementing native or offloaded XDP yet, the kernel 3811 provides an option for generic XDP which does not require any driver 3812 changes since run at a much later point out of the networking stack. 3813 This setting is primarily targeted at developers who want to write and 3814 test programs against the kernel's XDP API, and will not operate at the 3815 performance rate of the native or offloaded modes. For XDP usage in a 3816 production environment either the native or offloaded mode is better 3817 suited and the recommended way to run XDP. 3818 3819 .. 3820 3821 **Driver support** 3822 3823 Since BPF and XDP is evolving quickly in terms of feature and driver support, 3824 the following lists native and offloaded XDP drivers as of kernel 4.17. 3825 3826 **Drivers supporting native XDP** 3827 3828 * **Broadcom** 3829 3830 * bnxt 3831 3832 .. 3833 3834 * **Cavium** 3835 3836 * thunderx 3837 3838 .. 3839 3840 * **Intel** 3841 3842 * ixgbe 3843 * ixgbevf 3844 * i40e 3845 3846 .. 3847 3848 * **Mellanox** 3849 3850 * mlx4 3851 * mlx5 3852 3853 .. 3854 3855 * **Netronome** 3856 3857 * nfp 3858 3859 .. 3860 3861 * **Others** 3862 3863 * tun 3864 * virtio_net 3865 3866 .. 3867 3868 * **Qlogic** 3869 3870 * qede 3871 3872 .. 3873 3874 * **Solarflare** 3875 3876 * sfc [1]_ 3877 3878 **Drivers supporting offloaded XDP** 3879 3880 * **Netronome** 3881 3882 * nfp [2]_ 3883 3884 Note that examples for writing and loading XDP programs are included in 3885 the toolchain section under the respective tools. 3886 3887 .. [1] XDP for sfc available via out of tree driver as of kernel 4.17, but 3888 will be upstreamed soon. 3889 .. [2] Some BPF helper functions such as retrieving the current CPU number 3890 will not be available in an offloaded setting. 3891 3892 tc (traffic control) 3893 -------------------- 3894 3895 Aside from other program types such as XDP, BPF can also be used out of the 3896 kernel's tc (traffic control) layer in the networking data path. 
On a high-level 3897 there are three major differences when comparing XDP BPF programs to tc BPF 3898 ones: 3899 3900 * The BPF input context is a ``sk_buff`` not a ``xdp_buff``. When the kernel's 3901 networking stack receives a packet, after the XDP layer, it allocates a buffer 3902 and parses the packet to store metadata about the packet. This representation 3903 is known as the ``sk_buff``. This structure is then exposed in the BPF input 3904 context so that BPF programs from the tc ingress layer can use the metadata that 3905 the stack extracts from the packet. This can be useful, but comes with an 3906 associated cost of the stack performing this allocation and metadata extraction, 3907 and handling the packet until it hits the tc hook. By definition, the ``xdp_buff`` 3908 doesn't have access to this metadata because the XDP hook is called before 3909 this work is done. This is a significant contributor to the performance 3910 difference between the XDP and tc hooks. 3911 3912 Therefore, BPF programs attached to the tc BPF hook can, for instance, read or 3913 write the skb's ``mark``, ``pkt_type``, ``protocol``, ``priority``, 3914 ``queue_mapping``, ``napi_id``, ``cb[]`` array, ``hash``, ``tc_classid`` or 3915 ``tc_index``, vlan metadata, the XDP transferred custom metadata and various 3916 other information. All members of the ``struct __sk_buff`` BPF context used 3917 in tc BPF are defined in the ``linux/bpf.h`` system header. 3918 3919 Generally, the ``sk_buff`` is of a completely different nature than 3920 ``xdp_buff`` where both come with advantages and disadvantages. For example, 3921 the ``sk_buff`` case has the advantage that it is rather straight forward to 3922 mangle its associated metadata, however, it also contains a lot of protocol 3923 specific information (e.g. GSO related state) which makes it difficult to 3924 simply switch protocols by solely rewriting the packet data. This is due to 3925 the stack processing the packet based on the metadata rather than having the 3926 cost of accessing the packet contents each time. Thus, additional conversion 3927 is required from BPF helper functions taking care that ``sk_buff`` internals 3928 are properly converted as well. The ``xdp_buff`` case however does not 3929 face such issues since it comes at such an early stage where the kernel 3930 has not even allocated an ``sk_buff`` yet, thus packet rewrites of any 3931 kind can be realized trivially. However, the ``xdp_buff`` case has the 3932 disadvantage that ``sk_buff`` metadata is not available for mangling 3933 at this stage. The latter is overcome by passing custom metadata from 3934 XDP BPF to tc BPF, though. In this way, the limitations of each program 3935 type can be overcome by operating complementary programs of both types 3936 as the use case requires. 3937 3938 .. 3939 3940 * Compared to XDP, tc BPF programs can be triggered out of ingress and also 3941 egress points in the networking data path as opposed to ingress only in 3942 the case of XDP. 3943 3944 The two hook points ``sch_handle_ingress()`` and ``sch_handle_egress()`` in 3945 the kernel are triggered out of ``__netif_receive_skb_core()`` and 3946 ``__dev_queue_xmit()``, respectively. The latter two are the main receive 3947 and transmit functions in the data path that, setting XDP aside, are triggered 3948 for every network packet going in or coming out of the node allowing for 3949 full visibility for tc BPF programs at these hook points. 3950 3951 .. 
* The tc BPF programs do not require any driver changes since they are run
  at hook points in generic layers in the networking stack. Therefore, they
  can be attached to any type of networking device.

  While this provides flexibility, it also trades off performance compared
  to running at the native XDP layer. However, tc BPF programs still come
  at the earliest point in the generic kernel's networking data path after
  GRO has been run but **before** any protocol processing, traditional
  iptables firewalling such as iptables PREROUTING or nftables ingress hooks
  or other packet processing takes place. Likewise on egress, tc BPF programs
  execute at the latest point before handing the packet to the driver itself
  for transmission, meaning **after** traditional iptables firewalling hooks
  like iptables POSTROUTING, but still before handing the packet to the
  kernel's GSO engine.

  One exception which does require driver changes, however, is offloaded tc
  BPF programs, typically provided by SmartNICs in a similar way as offloaded
  XDP, just with a differing set of features due to the differences in the
  BPF input context, helper functions and verdict codes.

..

BPF programs run in the tc layer are run from the ``cls_bpf`` classifier.
While the tc terminology describes the BPF attachment point as a "classifier",
this is a bit misleading since it under-represents what ``cls_bpf`` is
capable of, that is to say, a fully programmable packet processor which is
able not only to read the ``skb`` metadata and packet data, but also to
arbitrarily mangle both, and to terminate the tc processing with an action
verdict. ``cls_bpf`` can thus be regarded as a self-contained entity that
manages and executes tc BPF programs.

``cls_bpf`` can hold one or more tc BPF programs. In the case where Cilium
deploys ``cls_bpf`` programs, it attaches only a single program for a given
hook in ``direct-action`` mode. Typically, in the traditional tc scheme,
there is a split between classifier and action modules, where the classifier
has one or more actions attached to it that are triggered once the classifier
has a match. In the modern world of using tc in the software data path this
model does not scale well for complex packet processing. Given that tc BPF
programs attached to ``cls_bpf`` are fully self-contained, they effectively
fuse the parsing and action process together into a single unit. Thanks to
``cls_bpf``'s ``direct-action`` mode, ``cls_bpf`` will just return the tc
action verdict and terminate the processing pipeline immediately. This allows
for implementing scalable programmable packet processing in the networking
data path by avoiding linear iteration of actions. ``cls_bpf`` is the only
such "classifier" module in the tc layer capable of such a fast-path.

Like XDP BPF programs, tc BPF programs can be atomically updated at runtime
via ``cls_bpf`` without interrupting any network traffic or having to restart
services.

Both the tc ingress and the egress hook to which ``cls_bpf`` itself can be
attached are managed by a pseudo qdisc called ``sch_clsact``. This is a
drop-in replacement and proper superset of the ingress qdisc since it
is able to manage both the ingress and the egress tc hook.
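To make the above more concrete, the following is a minimal sketch of a tc
BPF program intended to be attached through ``cls_bpf`` in ``direct-action``
mode. It reads a member of the ``struct __sk_buff`` context and terminates
processing with a verdict on its own; the mark value ``42`` is purely a
hypothetical example of state set earlier in the pipeline, and the
``TC_ACT_*`` return codes are discussed further below.

::

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>

  #ifndef __section
  # define __section(NAME) __attribute__((section(NAME), used))
  #endif

  /* Drop packets carrying a hypothetical mark, pass everything else. */
  __section("ingress")
  int tc_mark_filter(struct __sk_buff *skb)
  {
      if (skb->mark == 42)
          return TC_ACT_SHOT;

      return TC_ACT_OK;
  }

  char __license[] __section("license") = "GPL";

With iproute2, such an object could, for example, be attached with
``tc filter add dev em1 ingress bpf da obj prog.o sec ingress``, where the
``da`` flag selects ``direct-action`` mode; the toolchain section discusses
the loader in detail.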
For tc's egress hook in ``__dev_queue_xmit()`` it is important to stress
that it is not executed under the kernel's qdisc root lock. Thus, both the tc
ingress and the egress hook are executed in a lockless manner in the
fast-path. In either case, preemption is disabled and execution happens under
RCU read side.

Typically on egress there are qdiscs attached to netdevices such as
``sch_mq``, ``sch_fq``, ``sch_fq_codel`` or ``sch_htb``, some of which are
classful qdiscs that contain subclasses and thus require a packet
classification mechanism to determine a verdict on where to demux the packet.
This is handled by a call to ``tcf_classify()`` which calls into tc
classifiers if present. ``cls_bpf`` can also be attached and used in such
cases. Such operation usually happens under the qdisc root lock and can be
subject to lock contention. The ``sch_clsact`` qdisc's egress hook comes at a
much earlier point, however, which does not fall under that lock and operates
completely independently of conventional egress qdiscs. Thus, for cases like
``sch_htb``, the ``sch_clsact`` qdisc could perform the heavy lifting packet
classification through tc BPF outside of the qdisc root lock, setting
``skb->mark`` or ``skb->priority`` from there, such that ``sch_htb`` only
requires a flat mapping without expensive packet classification under the
root lock, thus reducing contention.
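A sketch of this pre-classification idea is shown below, assuming a
hypothetical setup in which a later classful qdisc maps ``skb->priority``
to its traffic bands; the actual classification logic, here just a
placeholder check on the packet mark, would typically parse the packet or
consult a BPF map.

::

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>

  #ifndef __section
  # define __section(NAME) __attribute__((section(NAME), used))
  #endif

  /* Classify on the lockless sch_clsact egress hook so that only a
   * flat priority mapping is left to the real qdisc later on.
   */
  __section("egress")
  int tc_preclassify(struct __sk_buff *skb)
  {
      if (skb->mark == 0xbeef)    /* placeholder for real classification */
          skb->priority = 1;      /* hypothetical "bulk" band */
      else
          skb->priority = 0;      /* hypothetical default band */

      return TC_ACT_OK;
  }

  char __license[] __section("license") = "GPL";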
Offloaded tc BPF programs are supported for the case of ``sch_clsact`` in
combination with ``cls_bpf``, where the previously loaded BPF program was
JITed by a SmartNIC driver in order to run natively on the NIC. Only
``cls_bpf`` programs operating in ``direct-action`` mode are supported for
offload. ``cls_bpf`` only supports offloading a single program and cannot
offload multiple programs. Furthermore, only the ingress hook supports
offloading BPF programs.

One ``cls_bpf`` instance is able to hold multiple tc BPF programs internally.
If this is the case, then the ``TC_ACT_UNSPEC`` program return code will
continue execution with the next tc BPF program in that list. However, this
has the drawback that several programs would need to parse the packet over
and over again, resulting in degraded performance.

**BPF program return codes**

Both the tc ingress and egress hook share the same action return verdicts
that tc BPF programs can use. They are defined in the ``linux/pkt_cls.h``
system header:

::

  #define TC_ACT_UNSPEC         (-1)
  #define TC_ACT_OK               0
  #define TC_ACT_SHOT             2
  #define TC_ACT_STOLEN           4
  #define TC_ACT_REDIRECT         7

There are a few more action ``TC_ACT_*`` verdicts available in the system
header file which are also used in the two hooks. However, they share the
same semantics with the ones above. Meaning, from a tc BPF perspective,
``TC_ACT_OK`` and ``TC_ACT_RECLASSIFY`` have the same semantics, as well as
the three ``TC_ACT_STOLEN``, ``TC_ACT_QUEUED`` and ``TC_ACT_TRAP`` opcodes.
Therefore, for these cases we only describe ``TC_ACT_OK`` and the
``TC_ACT_STOLEN`` opcode for the two groups.

Starting out with ``TC_ACT_UNSPEC``: it has the meaning of "unspecified
action" and is used in three cases. First, when an offloaded tc BPF program
is attached and the tc ingress hook is run, the ``cls_bpf`` representation
for the offloaded program will return ``TC_ACT_UNSPEC``. Second, it is used
in order to continue with the next tc BPF program in ``cls_bpf`` for the
multi-program case. This also works in combination with offloaded tc BPF
programs from the first case, where the ``TC_ACT_UNSPEC`` from there
continues with a next tc BPF program running solely in the non-offloaded
case. Last but not least, ``TC_ACT_UNSPEC`` is also used in the single
program case to simply tell the kernel to continue with the ``skb`` without
additional side-effects. ``TC_ACT_UNSPEC`` is very similar to the
``TC_ACT_OK`` action code in the sense that both pass the ``skb`` onwards,
either to upper layers of the stack on ingress or down to the networking
device driver for transmission on egress, respectively. The only difference
to ``TC_ACT_OK`` is that ``TC_ACT_OK`` sets ``skb->tc_index`` based on the
classid the tc BPF program set. The latter is set out of the tc BPF program
itself through ``skb->tc_classid`` from the BPF context.

``TC_ACT_SHOT`` instructs the kernel to drop the packet, meaning, upper
layers of the networking stack will never see the ``skb`` on ingress and
similarly the packet will never be submitted for transmission on egress.
``TC_ACT_SHOT`` and ``TC_ACT_STOLEN`` are both similar in nature with few
differences: ``TC_ACT_SHOT`` will indicate to the kernel that the ``skb``
was released through ``kfree_skb()`` and return ``NET_XMIT_DROP`` to the
callers for immediate feedback, whereas ``TC_ACT_STOLEN`` will release
the ``skb`` through ``consume_skb()`` and pretend to upper layers that
the transmission was successful through ``NET_XMIT_SUCCESS``. perf's
drop monitor, which records traces of ``kfree_skb()``, will therefore
also not see any drop indications from ``TC_ACT_STOLEN`` since its
semantics are such that the ``skb`` has been "consumed" or queued but
certainly not "dropped".

Last but not least there is the ``TC_ACT_REDIRECT`` action which is available
for tc BPF programs as well. Together with the ``bpf_redirect()`` helper, it
allows redirecting the ``skb`` to the ingress or egress path of the same or
another device. Being able to inject the packet into another device's ingress
or egress direction allows for full flexibility in packet forwarding with
BPF. There are no requirements on the target networking device other than
being a networking device itself, there is no need to run another instance
of ``cls_bpf`` on the target device or other such restrictions.
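As an illustration of the ``TC_ACT_REDIRECT`` / ``bpf_redirect()`` pairing
described above, the following sketch forwards every packet arriving on the
attachment device into the egress path of another device. The target ifindex
is a hypothetical constant; a real program would typically take it from
configuration or a BPF map, and passing ``BPF_F_INGRESS`` as the flags
argument would inject into the target's ingress path instead. It would be
attached in ``direct-action`` mode just like the earlier sketch.

::

  #include <linux/bpf.h>
  #include <linux/pkt_cls.h>

  #ifndef __section
  # define __section(NAME) __attribute__((section(NAME), used))
  #endif

  /* Declaration of the bpf_redirect() helper for use from the program. */
  static int (*bpf_redirect)(__u32 ifindex, __u64 flags) =
      (void *) BPF_FUNC_redirect;

  #define TARGET_IFINDEX 42   /* hypothetical example device */

  __section("ingress")
  int tc_redirect_all(struct __sk_buff *skb)
  {
      /* bpf_redirect() returns TC_ACT_REDIRECT on success, which the
       * program hands back to the kernel as its verdict.
       */
      return bpf_redirect(TARGET_IFINDEX, 0);
  }

  char __license[] __section("license") = "GPL";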
**tc BPF FAQ**

This section contains a few miscellaneous question and answer pairs related
to tc BPF programs that are asked from time to time.

* **Question:** What about ``act_bpf`` as a tc action module, is it still
  relevant?
* **Answer:** Not really. Although ``cls_bpf`` and ``act_bpf`` share the same
  functionality for tc BPF programs, ``cls_bpf`` is more flexible since it is
  a proper superset of ``act_bpf``. The way tc works is that tc actions need
  to be attached to tc classifiers. In order to achieve the same flexibility
  as ``cls_bpf``, ``act_bpf`` would need to be attached to the
  ``cls_matchall`` classifier. As the name says, this matches on every packet
  in order to pass them through for attached tc action processing. For
  ``act_bpf``, this will result in less efficient packet processing than
  using ``cls_bpf`` in ``direct-action`` mode directly. If ``act_bpf`` is
  used in a setting with classifiers other than ``cls_bpf`` or
  ``cls_matchall``, then this will perform even worse due to the nature of
  operation of tc classifiers: if classifier A has a mismatch, then the
  packet is passed to classifier B, which reparses the packet, and so on.
  In the typical case there will thus be linear processing where the packet
  would need to traverse N classifiers in the worst case in order to find a
  match and execute ``act_bpf`` on that. Therefore, ``act_bpf`` has never
  been largely relevant. Additionally, ``act_bpf`` does not provide a tc
  offloading interface either, compared to ``cls_bpf``.

..

* **Question:** Is it recommended to use ``cls_bpf`` not in
  ``direct-action`` mode?
* **Answer:** No. The answer is similar to the one above in that using
  ``cls_bpf`` without ``direct-action`` mode is otherwise unable to scale
  for more complex processing. tc BPF can already do everything needed
  by itself in an efficient manner and thus there is no need for anything
  other than ``direct-action`` mode.

..

* **Question:** Is there any performance difference between offloaded
  ``cls_bpf`` and offloaded XDP?
* **Answer:** No. Both are JITed through the same compiler in the kernel
  which handles the offloading to the SmartNIC, and the loading mechanism
  for both is very similar as well. Thus, the BPF program gets translated
  into the same target instruction set in order to be able to run on the NIC
  natively. The two program types, tc BPF and XDP BPF, have a differing set
  of features, so depending on the use case one might be picked over the
  other due to the availability of certain helper functions in the offload
  case, for example.

**Use cases for tc BPF**

Some of the main use cases for tc BPF programs are presented in this
subsection. Also here, the list is non-exhaustive and, given the
programmability and efficiency of tc BPF, it can easily be tailored and
integrated into orchestration systems in order to solve very specific use
cases. While some use cases with XDP may overlap, tc BPF and XDP BPF are
mostly complementary to each other and both can also be used at the same
time, or one over the other, depending on which is most suitable for a given
problem to solve.

* **Policy enforcement for containers**

  One application which tc BPF programs are suitable for is to implement
  policy enforcement, custom firewalling or similar security measures for
  containers or pods, respectively. In the conventional case, container
  isolation is implemented through network namespaces with veth networking
  devices connecting the host's initial namespace with the dedicated
  container's namespace. Since one end of the veth pair has been moved into
  the container's namespace whereas the other end remains in the initial
  namespace of the host, all network traffic from the container has to pass
  through the host-facing veth device, allowing for attaching tc BPF
  programs on the tc ingress and egress hook of the veth.
  Network traffic going into the container will pass through the host-facing
  veth's tc egress hook, whereas network traffic coming from the container
  will pass through the host-facing veth's tc ingress hook.

  For virtual devices like veth devices XDP is unsuitable in this case since
  the kernel operates solely on an ``skb`` here, and generic XDP has a few
  limitations, one of them being that it does not operate on cloned
  ``skb``'s. Cloned ``skb``'s are heavily used by the TCP/IP stack in order
  to hold data segments for retransmission, and there the generic XDP hook
  would simply be bypassed instead. Moreover, generic XDP needs to linearize
  the entire ``skb``, resulting in heavily degraded performance. tc BPF on
  the other hand is more flexible as it specializes in the ``skb`` input
  context case and thus does not need to cope with the limitations of
  generic XDP.

..

* **Forwarding and load-balancing**

  The forwarding and load-balancing use case is quite similar to XDP,
  although slightly more targeted towards east-west container workloads
  rather than north-south traffic (though both technologies can be used in
  either case). Since XDP is only available on the ingress side, tc BPF
  programs allow for further use cases that apply in particular on egress.
  For example, container based traffic can already be NATed and
  load-balanced on the egress side through BPF out of the initial namespace
  such that this is done transparently to the container itself. Egress
  traffic is already based on the ``sk_buff`` structure due to the nature of
  the kernel's networking stack, so packet rewrites and redirects are
  suitable out of tc BPF. By utilizing the ``bpf_redirect()`` helper
  function, BPF can take over the forwarding logic to push the packet either
  into the ingress or egress path of another networking device. Thus, any
  bridge-like devices become unnecessary as well when tc BPF is used as the
  forwarding fabric.

..

* **Flow sampling, monitoring**

  Like in the XDP case, flow sampling and monitoring can be realized through
  a high-performance lockless per-CPU memory mapped perf ring buffer where
  the BPF program is able to push custom data, the full or truncated packet
  contents, or both up to a user space application. From the tc BPF program
  this is realized through the ``bpf_skb_event_output()`` BPF helper function
  which has the same function signature and semantics as
  ``bpf_xdp_event_output()``. Given that tc BPF programs can be attached to
  ingress and egress as opposed to only ingress in the XDP BPF case, and that
  the two tc hooks are at the lowest layer in the (generic) networking stack,
  this allows for bidirectional monitoring of all network traffic from a
  particular node. This is somewhat related to the cBPF case which tcpdump
  and Wireshark make use of, though without having to clone the ``skb``, and
  it is a lot more flexible in terms of programmability: BPF can, for
  example, already perform in-kernel aggregation rather than pushing
  everything up to user space, or add custom annotations to packets pushed
  into the ring buffer. The latter is also heavily used in Cilium where
  packet drops can be further annotated to correlate container labels and
  reasons for why a given packet had to be dropped (such as due to policy
  violation) in order to provide a richer context.
..

* **Packet scheduler pre-processing**

  The ``sch_clsact`` egress hook, which is called ``sch_handle_egress()``,
  runs right before taking the kernel's qdisc root lock, thus tc BPF
  programs can be utilized to perform all the heavy lifting packet
  classification and mangling before the packet is transmitted into a real
  full blown qdisc such as ``sch_htb``. This type of interaction of
  ``sch_clsact`` with a real qdisc like ``sch_htb`` coming later in the
  transmission phase allows for reducing the lock contention on transmission
  since ``sch_clsact``'s egress hook is executed without taking locks.

..

One concrete example user of tc BPF as well as XDP BPF programs is Cilium.
Cilium is open source software for transparently securing the network
connectivity between application services deployed using Linux container
management platforms like Docker and Kubernetes and operates at Layer 3/4
as well as Layer 7. At the heart of Cilium operates BPF in order to
implement the policy enforcement as well as load balancing and monitoring.

* Slides: https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
* Video: https://youtu.be/ilKlmTDdFgk
* GitHub: https://github.com/cilium/cilium

**Driver support**

Since tc BPF programs are triggered from the kernel's networking stack
and not directly out of the driver, they do not require any extra driver
modification and therefore can run on any networking device. The only
exception listed below is for offloading tc BPF programs to the NIC.

**Drivers supporting offloaded tc BPF**

* **Netronome**

  * nfp [2]_

Note that also here, examples for writing and loading tc BPF programs are
included in the toolchain section under the respective tools.

.. _bpf_users:

Further Reading
===============

The lists of docs, projects, talks, papers, and further reading material
mentioned here are likely not complete. Thus, feel free to open pull
requests to complete the list.

Kernel Developer FAQ
--------------------

Under ``Documentation/bpf/``, the Linux kernel provides two FAQ files that
are mainly targeted at kernel developers involved in the BPF subsystem.

* **BPF Devel FAQ:** this document provides mostly information around the
  patch submission process as well as BPF kernel tree, stable tree and bug
  reporting workflows, questions around BPF's extensibility and interaction
  with LLVM and more.

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/bpf/bpf_devel_QA.rst

..

* **BPF Design FAQ:** this document tries to answer frequently asked
  questions around BPF design decisions related to the instruction set,
  verifier, calling convention, JITs, etc.

  https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/Documentation/bpf/bpf_design_QA.rst

Projects using BPF
------------------

The following list includes a selection of open source projects making
use of BPF or providing tooling for BPF.
In this context the eBPF 4301 instruction set is specifically meant instead of projects utilizing the 4302 legacy cBPF: 4303 4304 **Tracing** 4305 4306 * **BCC** 4307 4308 BCC stands for BPF Compiler Collection and its key feature is to provide 4309 a set of easy to use and efficient kernel tracing utilities all based 4310 upon BPF programs hooking into kernel infrastructure based upon kprobes, 4311 kretprobes, tracepoints, uprobes, uretprobes as well as USDT probes. The 4312 collection provides close to hundred tools targeting different layers 4313 across the stack from applications, system libraries, to the various 4314 different kernel subsystems in order to analyze a system's performance 4315 characteristics or problems. Additionally, BCC provides an API in order 4316 to be used as a library for other projects. 4317 4318 https://github.com/iovisor/bcc 4319 4320 .. 4321 4322 * **bpftrace** 4323 4324 bpftrace is a DTrace-style dynamic tracing tool for Linux and uses LLVM 4325 as a back end to compile scripts to BPF-bytecode and makes use of BCC 4326 for interacting with the kernel's BPF tracing infrastructure. It provides 4327 a higher-level language for implementing tracing scripts compared to 4328 native BCC. 4329 4330 https://github.com/ajor/bpftrace 4331 4332 .. 4333 4334 * **perf** 4335 4336 The perf tool which is developed by the Linux kernel community as 4337 part of the kernel source tree provides a way to load tracing BPF 4338 programs through the conventional perf record subcommand where the 4339 aggregated data from BPF can be retrieved and post processed in 4340 perf.data for example through perf script and other means. 4341 4342 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/perf 4343 4344 .. 4345 4346 * **ply** 4347 4348 ply is a tracing tool that follows the 'Little Language' approach of 4349 yore, and compiles ply scripts into Linux BPF programs that are attached 4350 to kprobes and tracepoints in the kernel. The scripts have a C-like syntax, 4351 heavily inspired by DTrace and by extension awk. ply keeps dependencies 4352 to very minimum and only requires flex and bison at build time, only libc 4353 at runtime. 4354 4355 https://github.com/wkz/ply 4356 4357 .. 4358 4359 * **systemtap** 4360 4361 systemtap is a scripting language and tool for extracting, filtering and 4362 summarizing data in order to diagnose and analyze performance or functional 4363 problems. It comes with a BPF back end called stapbpf which translates 4364 the script directly into BPF without the need of an additional compiler 4365 and injects the probe into the kernel. Thus, unlike stap's kernel modules 4366 this does neither have external dependencies nor requires to load kernel 4367 modules. 4368 4369 https://sourceware.org/git/gitweb.cgi?p=systemtap.git;a=summary 4370 4371 .. 4372 4373 * **PCP** 4374 4375 Performance Co-Pilot (PCP) is a system performance and analysis framework 4376 which is able to collect metrics through a variety of agents as well as 4377 analyze collected systems' performance metrics in real-time or by using 4378 historical data. With pmdabcc, PCP has a BCC based performance metrics 4379 domain agent which extracts data from the kernel via BPF and BCC. 4380 4381 https://github.com/performancecopilot/pcp 4382 4383 .. 4384 4385 * **Weave Scope** 4386 4387 Weave Scope is a cloud monitoring tool collecting data about processes, 4388 networking connections or other system data by making use of BPF in combination 4389 with kprobes. 
Weave Scope works on top of the gobpf library in order to load 4390 BPF ELF files into the kernel, and comes with a tcptracer-bpf tool which 4391 monitors connect, accept and close calls in order to trace TCP events. 4392 4393 https://github.com/weaveworks/scope 4394 4395 .. 4396 4397 **Networking** 4398 4399 * **Cilium** 4400 4401 Cilium provides and transparently secures network connectivity and load-balancing 4402 between application workloads such as application containers or processes. Cilium 4403 operates at Layer 3/4 to provide traditional networking and security services 4404 as well as Layer 7 to protect and secure use of modern application protocols 4405 such as HTTP, gRPC and Kafka. It is integrated into orchestration frameworks 4406 such as Kubernetes and Mesos, and BPF is the foundational part of Cilium that 4407 operates in the kernel's networking data path. 4408 4409 https://github.com/cilium/cilium 4410 4411 .. 4412 4413 * **Suricata** 4414 4415 Suricata is a network IDS, IPS and NSM engine, and utilizes BPF as well as XDP 4416 in three different areas, that is, as BPF filter in order to process or bypass 4417 certain packets, as a BPF based load balancer in order to allow for programmable 4418 load balancing and for XDP to implement a bypass or dropping mechanism at high 4419 packet rates. 4420 4421 http://suricata.readthedocs.io/en/latest/capture-hardware/ebpf-xdp.html 4422 4423 https://github.com/OISF/suricata 4424 4425 .. 4426 4427 * **systemd** 4428 4429 systemd allows for IPv4/v6 accounting as well as implementing network access 4430 control for its systemd units based on BPF's cgroup ingress and egress hooks. 4431 Accounting is based on packets / bytes, and ACLs can be specified as address 4432 prefixes for allow / deny rules. More information can be found at: 4433 4434 http://0pointer.net/blog/ip-accounting-and-access-lists-with-systemd.html 4435 4436 https://github.com/systemd/systemd 4437 4438 .. 4439 4440 * **iproute2** 4441 4442 iproute2 offers the ability to load BPF programs as LLVM generated ELF files 4443 into the kernel. iproute2 supports both, XDP BPF programs as well as tc BPF 4444 programs through a common BPF loader backend. The tc and ip command line 4445 utilities enable loader and introspection functionality for the user. 4446 4447 https://git.kernel.org/pub/scm/network/iproute2/iproute2.git/ 4448 4449 .. 4450 4451 * **p4c-xdp** 4452 4453 p4c-xdp presents a P4 compiler backend targeting BPF and XDP. P4 is a domain 4454 specific language describing how packets are processed by the data plane of 4455 a programmable network element such as NICs, appliances or switches, and with 4456 the help of p4c-xdp P4 programs can be translated into BPF C programs which 4457 can be compiled by clang / LLVM and loaded as BPF programs into the kernel 4458 at XDP layer for high performance packet processing. 4459 4460 https://github.com/vmware/p4c-xdp 4461 4462 .. 4463 4464 **Others** 4465 4466 * **LLVM** 4467 4468 clang / LLVM provides the BPF back end in order to compile C BPF programs 4469 into BPF instructions contained in ELF files. The LLVM BPF back end is 4470 developed alongside with the BPF core infrastructure in the Linux kernel 4471 and maintained by the same community. clang / LLVM is a key part in the 4472 toolchain for developing BPF programs. 4473 4474 https://llvm.org/ 4475 4476 .. 
4477 4478 * **libbpf** 4479 4480 libbpf is a generic BPF library which is developed by the Linux kernel 4481 community as part of the kernel source tree and allows for loading and 4482 attaching BPF programs from LLVM generated ELF files into the kernel. 4483 The library is used by other kernel projects such as perf and bpftool. 4484 4485 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/lib/bpf 4486 4487 .. 4488 4489 * **bpftool** 4490 4491 bpftool is the main tool for introspecting and debugging BPF programs 4492 and BPF maps, and like libbpf is developed by the Linux kernel community. 4493 It allows for dumping all active BPF programs and maps in the system, 4494 dumping and disassembling BPF or JITed BPF instructions from a program 4495 as well as dumping and manipulating BPF maps in the system. bpftool 4496 supports interaction with the BPF filesystem, loading various program 4497 types from an object file into the kernel and much more. 4498 4499 https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/tools/bpf/bpftool 4500 4501 .. 4502 4503 * **gobpf** 4504 4505 gobpf provides go bindings for the bcc framework as well as low-level routines in 4506 order to load and use BPF programs from ELF files. 4507 4508 https://github.com/iovisor/gobpf 4509 4510 .. 4511 4512 * **ebpf_asm** 4513 4514 ebpf_asm provides an assembler for BPF programs written in an Intel-like assembly 4515 syntax, and therefore offers an alternative for writing BPF programs directly in 4516 assembly for cases where programs are rather small and simple without needing the 4517 clang / LLVM toolchain. 4518 4519 https://github.com/solarflarecom/ebpf_asm 4520 4521 .. 4522 4523 XDP Newbies 4524 ----------- 4525 4526 There are a couple of walk-through posts by David S. Miller to the xdp-newbies 4527 mailing list (http://vger.kernel.org/vger-lists.html#xdp-newbies), which explain 4528 various parts of XDP and BPF: 4529 4530 4. May 2017, 4531 BPF Verifier Overview, 4532 David S. Miller, 4533 https://www.spinics.net/lists/xdp-newbies/msg00185.html 4534 4535 3. May 2017, 4536 Contextually speaking..., 4537 David S. Miller, 4538 https://www.spinics.net/lists/xdp-newbies/msg00181.html 4539 4540 2. May 2017, 4541 bpf.h and you..., 4542 David S. Miller, 4543 https://www.spinics.net/lists/xdp-newbies/msg00179.html 4544 4545 1. Apr 2017, 4546 XDP example of the day, 4547 David S. Miller, 4548 https://www.spinics.net/lists/xdp-newbies/msg00009.html 4549 4550 BPF Newsletter 4551 -------------- 4552 4553 Alexander Alemayhu initiated a newsletter around BPF roughly once per week 4554 covering latest developments around BPF in Linux kernel land and its 4555 surrounding ecosystem in user space. 4556 4557 All BPF update newsletters (01 - 12) can be found here: 4558 4559 https://cilium.io/blog/categories/BPF%20Newsletter 4560 4561 Podcasts 4562 -------- 4563 4564 There have been a number of technical podcasts partially covering BPF. 4565 Incomplete list: 4566 4567 5. Feb 2017, 4568 Linux Networking Update from Netdev Conference, 4569 Thomas Graf, 4570 Software Gone Wild, Show 71, 4571 http://blog.ipspace.net/2017/02/linux-networking-update-from-netdev.html 4572 http://media.blubrry.com/ipspace/stream.ipspace.net/nuggets/podcast/Show_71-NetDev_Update.mp3 4573 4574 4. Jan 2017, 4575 The IO Visor Project, 4576 Brenden Blanco, 4577 OVS Orbit, Episode 23, 4578 https://ovsorbit.org/#e23 4579 https://ovsorbit.org/episode-23.mp3 4580 4581 3. 
Oct 2016, 4582 Fast Linux Packet Forwarding, 4583 Thomas Graf, 4584 Software Gone Wild, Show 64, 4585 http://blog.ipspace.net/2016/10/fast-linux-packet-forwarding-with.html 4586 http://media.blubrry.com/ipspace/stream.ipspace.net/nuggets/podcast/Show_64-Cilium_with_Thomas_Graf.mp3 4587 4588 2. Aug 2016, 4589 P4 on the Edge, 4590 John Fastabend, 4591 OVS Orbit, Episode 11, 4592 https://ovsorbit.org/#e11 4593 https://ovsorbit.org/episode-11.mp3 4594 4595 1. May 2016, 4596 Cilium, 4597 Thomas Graf, 4598 OVS Orbit, Episode 4, 4599 https://ovsorbit.org/#e4 4600 https://ovsorbit.benpfaff.org/episode-4.mp3 4601 4602 Blog posts 4603 ---------- 4604 4605 The following (incomplete) list includes blog posts around BPF, XDP and related projects: 4606 4607 34. May 2017, 4608 An entertaining eBPF XDP adventure, 4609 Suchakra Sharma, 4610 https://suchakra.wordpress.com/2017/05/23/an-entertaining-ebpf-xdp-adventure/ 4611 4612 33. May 2017, 4613 eBPF, part 2: Syscall and Map Types, 4614 Ferris Ellis, 4615 https://ferrisellis.com/posts/ebpf_syscall_and_maps/ 4616 4617 32. May 2017, 4618 Monitoring the Control Plane, 4619 Gary Berger, 4620 http://firstclassfunc.com/2017/05/monitoring-the-control-plane/ 4621 4622 31. Apr 2017, 4623 USENIX/LISA 2016 Linux bcc/BPF Tools, 4624 Brendan Gregg, 4625 http://www.brendangregg.com/blog/2017-04-29/usenix-lisa-2016-bcc-bpf-tools.html 4626 4627 30. Apr 2017, 4628 Liveblog: Cilium for Network and Application Security with BPF and XDP, 4629 Scott Lowe, 4630 http://blog.scottlowe.org//2017/04/18/black-belt-cilium/ 4631 4632 29. Apr 2017, 4633 eBPF, part 1: Past, Present, and Future, 4634 Ferris Ellis, 4635 https://ferrisellis.com/posts/ebpf_past_present_future/ 4636 4637 28. Mar 2017, 4638 Analyzing KVM Hypercalls with eBPF Tracing, 4639 Suchakra Sharma, 4640 https://suchakra.wordpress.com/2017/03/31/analyzing-kvm-hypercalls-with-ebpf-tracing/ 4641 4642 27. Jan 2017, 4643 Golang bcc/BPF Function Tracing, 4644 Brendan Gregg, 4645 http://www.brendangregg.com/blog/2017-01-31/golang-bcc-bpf-function-tracing.html 4646 4647 26. Dec 2016, 4648 Give me 15 minutes and I'll change your view of Linux tracing, 4649 Brendan Gregg, 4650 http://www.brendangregg.com/blog/2016-12-27/linux-tracing-in-15-minutes.html 4651 4652 25. Nov 2016, 4653 Cilium: Networking and security for containers with BPF and XDP, 4654 Daniel Borkmann, 4655 https://opensource.googleblog.com/2016/11/cilium-networking-and-security.html 4656 4657 24. Nov 2016, 4658 Linux bcc/BPF tcplife: TCP Lifespans, 4659 Brendan Gregg, 4660 http://www.brendangregg.com/blog/2016-11-30/linux-bcc-tcplife.html 4661 4662 23. Oct 2016, 4663 DTrace for Linux 2016, 4664 Brendan Gregg, 4665 http://www.brendangregg.com/blog/2016-10-27/dtrace-for-linux-2016.html 4666 4667 22. Oct 2016, 4668 Linux 4.9's Efficient BPF-based Profiler, 4669 Brendan Gregg, 4670 http://www.brendangregg.com/blog/2016-10-21/linux-efficient-profiler.html 4671 4672 21. Oct 2016, 4673 Linux bcc tcptop, 4674 Brendan Gregg, 4675 http://www.brendangregg.com/blog/2016-10-15/linux-bcc-tcptop.html 4676 4677 20. Oct 2016, 4678 Linux bcc/BPF Node.js USDT Tracing, 4679 Brendan Gregg, 4680 http://www.brendangregg.com/blog/2016-10-12/linux-bcc-nodejs-usdt.html 4681 4682 19. Oct 2016, 4683 Linux bcc/BPF Run Queue (Scheduler) Latency, 4684 Brendan Gregg, 4685 http://www.brendangregg.com/blog/2016-10-08/linux-bcc-runqlat.html 4686 4687 18. 
Oct 2016, 4688 Linux bcc ext4 Latency Tracing, 4689 Brendan Gregg, 4690 http://www.brendangregg.com/blog/2016-10-06/linux-bcc-ext4dist-ext4slower.html 4691 4692 17. Oct 2016, 4693 Linux MySQL Slow Query Tracing with bcc/BPF, 4694 Brendan Gregg, 4695 http://www.brendangregg.com/blog/2016-10-04/linux-bcc-mysqld-qslower.html 4696 4697 16. Oct 2016, 4698 Linux bcc Tracing Security Capabilities, 4699 Brendan Gregg, 4700 http://www.brendangregg.com/blog/2016-10-01/linux-bcc-security-capabilities.html 4701 4702 15. Sep 2016, 4703 Suricata bypass feature, 4704 Eric Leblond, 4705 https://www.stamus-networks.com/2016/09/28/suricata-bypass-feature/ 4706 4707 14. Aug 2016, 4708 Introducing the p0f BPF compiler, 4709 Gilberto Bertin, 4710 https://blog.cloudflare.com/introducing-the-p0f-bpf-compiler/ 4711 4712 13. Jun 2016, 4713 Ubuntu Xenial bcc/BPF, 4714 Brendan Gregg, 4715 http://www.brendangregg.com/blog/2016-06-14/ubuntu-xenial-bcc-bpf.html 4716 4717 12. Mar 2016, 4718 Linux BPF/bcc Road Ahead, March 2016, 4719 Brendan Gregg, 4720 http://www.brendangregg.com/blog/2016-03-28/linux-bpf-bcc-road-ahead-2016.html 4721 4722 11. Mar 2016, 4723 Linux BPF Superpowers, 4724 Brendan Gregg, 4725 http://www.brendangregg.com/blog/2016-03-05/linux-bpf-superpowers.html 4726 4727 10. Feb 2016, 4728 Linux eBPF/bcc uprobes, 4729 Brendan Gregg, 4730 http://www.brendangregg.com/blog/2016-02-08/linux-ebpf-bcc-uprobes.html 4731 4732 9. Feb 2016, 4733 Who is waking the waker? (Linux chain graph prototype), 4734 Brendan Gregg, 4735 http://www.brendangregg.com/blog/2016-02-05/ebpf-chaingraph-prototype.html 4736 4737 8. Feb 2016, 4738 Linux Wakeup and Off-Wake Profiling, 4739 Brendan Gregg, 4740 http://www.brendangregg.com/blog/2016-02-01/linux-wakeup-offwake-profiling.html 4741 4742 7. Jan 2016, 4743 Linux eBPF Off-CPU Flame Graph, 4744 Brendan Gregg, 4745 http://www.brendangregg.com/blog/2016-01-20/ebpf-offcpu-flame-graph.html 4746 4747 6. Jan 2016, 4748 Linux eBPF Stack Trace Hack, 4749 Brendan Gregg, 4750 http://www.brendangregg.com/blog/2016-01-18/ebpf-stack-trace-hack.html 4751 4752 1. Sep 2015, 4753 Linux Networking, Tracing and IO Visor, a New Systems Performance Tool for a Distributed World, 4754 Suchakra Sharma, 4755 https://thenewstack.io/comparing-dtrace-iovisor-new-systems-performance-platform-advance-linux-networking-virtualization/ 4756 4757 5. Aug 2015, 4758 BPF Internals - II, 4759 Suchakra Sharma, 4760 https://suchakra.wordpress.com/2015/08/12/bpf-internals-ii/ 4761 4762 4. May 2015, 4763 eBPF: One Small Step, 4764 Brendan Gregg, 4765 http://www.brendangregg.com/blog/2015-05-15/ebpf-one-small-step.html 4766 4767 3. May 2015, 4768 BPF Internals - I, 4769 Suchakra Sharma, 4770 https://suchakra.wordpress.com/2015/05/18/bpf-internals-i/ 4771 4772 2. Jul 2014, 4773 Introducing the BPF Tools, 4774 Marek Majkowski, 4775 https://blog.cloudflare.com/introducing-the-bpf-tools/ 4776 4777 1. May 2014, 4778 BPF - the forgotten bytecode, 4779 Marek Majkowski, 4780 https://blog.cloudflare.com/bpf-the-forgotten-bytecode/ 4781 4782 Talks 4783 ----- 4784 4785 The following (incomplete) list includes talks and conference papers 4786 related to BPF and XDP: 4787 4788 44. May 2017, 4789 PyCon 2017, Portland, 4790 Executing python functions in the linux kernel by transpiling to bpf, 4791 Alex Gartrell, 4792 https://www.youtube.com/watch?v=CpqMroMBGP4 4793 4794 43. 
May 2017, 4795 gluecon 2017, Denver, 4796 Cilium + BPF: Least Privilege Security on API Call Level for Microservices, 4797 Dan Wendlandt, 4798 http://gluecon.com/#agenda 4799 4800 42. May 2017, 4801 Lund Linux Con, Lund, 4802 XDP - eXpress Data Path, 4803 Jesper Dangaard Brouer, 4804 http://people.netfilter.org/hawk/presentations/LLC2017/XDP_DDoS_protecting_LLC2017.pdf 4805 4806 41. May 2017, 4807 Polytechnique Montreal, 4808 Trace Aggregation and Collection with eBPF, 4809 Suchakra Sharma, 4810 http://step.polymtl.ca/~suchakra/eBPF-5May2017.pdf 4811 4812 40. Apr 2017, 4813 DockerCon, Austin, 4814 Cilium - Network and Application Security with BPF and XDP, 4815 Thomas Graf, 4816 https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp 4817 4818 39. Apr 2017, 4819 NetDev 2.1, Montreal, 4820 XDP Mythbusters, 4821 David S. Miller, 4822 https://www.netdevconf.org/2.1/slides/apr7/miller-XDP-MythBusters.pdf 4823 4824 38. Apr 2017, 4825 NetDev 2.1, Montreal, 4826 Droplet: DDoS countermeasures powered by BPF + XDP, 4827 Huapeng Zhou, Doug Porter, Ryan Tierney, Nikita Shirokov, 4828 https://www.netdevconf.org/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf 4829 4830 37. Apr 2017, 4831 NetDev 2.1, Montreal, 4832 XDP in practice: integrating XDP in our DDoS mitigation pipeline, 4833 Gilberto Bertin, 4834 https://www.netdevconf.org/2.1/slides/apr6/bertin_Netdev-XDP.pdf 4835 4836 36. Apr 2017, 4837 NetDev 2.1, Montreal, 4838 XDP for the Rest of Us, 4839 Andy Gospodarek, Jesper Dangaard Brouer, 4840 https://www.netdevconf.org/2.1/slides/apr7/gospodarek-Netdev2.1-XDP-for-the-Rest-of-Us_Final.pdf 4841 4842 35. Mar 2017, 4843 SCALE15x, Pasadena, 4844 Linux 4.x Tracing: Performance Analysis with bcc/BPF, 4845 Brendan Gregg, 4846 https://www.slideshare.net/brendangregg/linux-4x-tracing-performance-analysis-with-bccbpf 4847 4848 34. Mar 2017, 4849 XDP Inside and Out, 4850 David S. Miller, 4851 https://github.com/iovisor/bpf-docs/raw/master/XDP_Inside_and_Out.pdf 4852 4853 33. Mar 2017, 4854 OpenSourceDays, Copenhagen, 4855 XDP - eXpress Data Path, Used for DDoS protection, 4856 Jesper Dangaard Brouer, 4857 https://github.com/iovisor/bpf-docs/raw/master/XDP_Inside_and_Out.pdf 4858 4859 32. Mar 2017, 4860 source{d}, Infrastructure 2017, Madrid, 4861 High-performance Linux monitoring with eBPF, 4862 Alfonso Acosta, 4863 https://www.youtube.com/watch?v=k4jqTLtdrxQ 4864 4865 31. Feb 2017, 4866 FOSDEM 2017, Brussels, 4867 Stateful packet processing with eBPF, an implementation of OpenState interface, 4868 Quentin Monnet, 4869 https://fosdem.org/2017/schedule/event/stateful_ebpf/ 4870 4871 30. Feb 2017, 4872 FOSDEM 2017, Brussels, 4873 eBPF and XDP walkthrough and recent updates, 4874 Daniel Borkmann, 4875 http://borkmann.ch/talks/2017_fosdem.pdf 4876 4877 29. Feb 2017, 4878 FOSDEM 2017, Brussels, 4879 Cilium - BPF & XDP for containers, 4880 Thomas Graf, 4881 https://fosdem.org/2017/schedule/event/cilium/ 4882 4883 28. Jan 2017, 4884 linuxconf.au, Hobart, 4885 BPF: Tracing and more, 4886 Brendan Gregg, 4887 https://www.slideshare.net/brendangregg/bpf-tracing-and-more 4888 4889 27. Dec 2016, 4890 USENIX LISA 2016, Boston, 4891 Linux 4.x Tracing Tools: Using BPF Superpowers, 4892 Brendan Gregg, 4893 https://www.slideshare.net/brendangregg/linux-4x-tracing-tools-using-bpf-superpowers 4894 4895 26. 
Nov 2016, 4896 Linux Plumbers, Santa Fe, 4897 Cilium: Networking & Security for Containers with BPF & XDP, 4898 Thomas Graf, 4899 http://www.slideshare.net/ThomasGraf5/clium-container-networking-with-bpf-xdp 4900 4901 25. Nov 2016, 4902 OVS Conference, Santa Clara, 4903 Offloading OVS Flow Processing using eBPF, 4904 William (Cheng-Chun) Tu, 4905 http://openvswitch.org/support/ovscon2016/7/1120-tu.pdf 4906 4907 24. Oct 2016, 4908 One.com, Copenhagen, 4909 XDP - eXpress Data Path, Intro and future use-cases, 4910 Jesper Dangaard Brouer, 4911 http://people.netfilter.org/hawk/presentations/xdp2016/xdp_intro_and_use_cases_sep2016.pdf 4912 4913 23. Oct 2016, 4914 Docker Distributed Systems Summit, Berlin, 4915 Cilium: Networking & Security for Containers with BPF & XDP, 4916 Thomas Graf, 4917 http://www.slideshare.net/Docker/cilium-bpf-xdp-for-containers-66969823 4918 4919 22. Oct 2016, 4920 NetDev 1.2, Tokyo, 4921 Data center networking stack, 4922 Tom Herbert, 4923 http://netdevconf.org/1.2/session.html?tom-herbert 4924 4925 21. Oct 2016, 4926 NetDev 1.2, Tokyo, 4927 Fast Programmable Networks & Encapsulated Protocols, 4928 David S. Miller, 4929 http://netdevconf.org/1.2/session.html?david-miller-keynote 4930 4931 20. Oct 2016, 4932 NetDev 1.2, Tokyo, 4933 XDP workshop - Introduction, experience, and future development, 4934 Tom Herbert, 4935 http://netdevconf.org/1.2/session.html?herbert-xdp-workshop 4936 4937 19. Oct 2016, 4938 NetDev1.2, Tokyo, 4939 The adventures of a Suricate in eBPF land, 4940 Eric Leblond, 4941 http://netdevconf.org/1.2/slides/oct6/10_suricata_ebpf.pdf 4942 4943 18. Oct 2016, 4944 NetDev1.2, Tokyo, 4945 cls_bpf/eBPF updates since netdev 1.1, 4946 Daniel Borkmann, 4947 http://borkmann.ch/talks/2016_tcws.pdf 4948 4949 17. Oct 2016, 4950 NetDev1.2, Tokyo, 4951 Advanced programmability and recent updates with tc’s cls_bpf, 4952 Daniel Borkmann, 4953 http://borkmann.ch/talks/2016_netdev2.pdf 4954 http://www.netdevconf.org/1.2/papers/borkmann.pdf 4955 4956 16. Oct 2016, 4957 NetDev 1.2, Tokyo, 4958 eBPF/XDP hardware offload to SmartNICs, 4959 Jakub Kicinski, Nic Viljoen, 4960 http://netdevconf.org/1.2/papers/eBPF_HW_OFFLOAD.pdf 4961 4962 15. Aug 2016, 4963 LinuxCon, Toronto, 4964 What Can BPF Do For You?, 4965 Brenden Blanco, 4966 https://events.linuxfoundation.org/sites/events/files/slides/iovisor-lc-bof-2016.pdf 4967 4968 14. Aug 2016, 4969 LinuxCon, Toronto, 4970 Cilium - Fast IPv6 Container Networking with BPF and XDP, 4971 Thomas Graf, 4972 https://www.slideshare.net/ThomasGraf5/cilium-fast-ipv6-container-networking-with-bpf-and-xdp 4973 4974 13. Aug 2016, 4975 P4, EBPF and Linux TC Offload, 4976 Dinan Gunawardena, Jakub Kicinski, 4977 https://de.slideshare.net/Open-NFP/p4-epbf-and-linux-tc-offload 4978 4979 12. Jul 2016, 4980 Linux Meetup, Santa Clara, 4981 eXpress Data Path, 4982 Brenden Blanco, 4983 http://www.slideshare.net/IOVisor/express-data-path-linux-meetup-santa-clara-july-2016 4984 4985 11. Jul 2016, 4986 Linux Meetup, Santa Clara, 4987 CETH for XDP, 4988 Yan Chan, Yunsong Lu, 4989 http://www.slideshare.net/IOVisor/ceth-for-xdp-linux-meetup-santa-clara-july-2016 4990 4991 10. May 2016, 4992 P4 workshop, Stanford, 4993 P4 on the Edge, 4994 John Fastabend, 4995 https://schd.ws/hosted_files/2016p4workshop/1d/Intel%20Fastabend-P4%20on%20the%20Edge.pdf 4996 4997 9. Mar 2016, 4998 Performance @Scale 2016, Menlo Park, 4999 Linux BPF Superpowers, 5000 Brendan Gregg, 5001 https://www.slideshare.net/brendangregg/linux-bpf-superpowers 5002 5003 8. 
Mar 2016, 5004 eXpress Data Path, 5005 Tom Herbert, Alexei Starovoitov, 5006 https://github.com/iovisor/bpf-docs/raw/master/Express_Data_Path.pdf 5007 5008 7. Feb 2016, 5009 NetDev1.1, Seville, 5010 On getting tc classifier fully programmable with cls_bpf, 5011 Daniel Borkmann, 5012 http://borkmann.ch/talks/2016_netdev.pdf 5013 http://www.netdevconf.org/1.1/proceedings/papers/On-getting-tc-classifier-fully-programmable-with-cls-bpf.pdf 5014 5015 6. Jan 2016, 5016 FOSDEM 2016, Brussels, 5017 Linux tc and eBPF, 5018 Daniel Borkmann, 5019 http://borkmann.ch/talks/2016_fosdem.pdf 5020 5021 5. Oct 2015, 5022 LinuxCon Europe, Dublin, 5023 eBPF on the Mainframe, 5024 Michael Holzheu, 5025 https://events.linuxfoundation.org/sites/events/files/slides/ebpf_on_the_mainframe_lcon_2015.pdf 5026 5027 4. Aug 2015, 5028 Tracing Summit, Seattle, 5029 LLTng's Trace Filtering and beyond (with some eBPF goodness, of course!), 5030 Suchakra Sharma, 5031 https://github.com/iovisor/bpf-docs/raw/master/ebpf_excerpt_20Aug2015.pdf 5032 5033 3. Jun 2015, 5034 LinuxCon Japan, Tokyo, 5035 Exciting Developments in Linux Tracing, 5036 Elena Zannoni, 5037 https://events.linuxfoundation.org/sites/events/files/slides/tracing-linux-ezannoni-linuxcon-ja-2015_0.pdf 5038 5039 2. Feb 2015, 5040 Collaboration Summit, Santa Rosa, 5041 BPF: In-kernel Virtual Machine, 5042 Alexei Starovoitov, 5043 https://events.linuxfoundation.org/sites/events/files/slides/bpf_collabsummit_2015feb20.pdf 5044 5045 1. Feb 2015, 5046 NetDev 0.1, Ottawa, 5047 BPF: In-kernel Virtual Machine, 5048 Alexei Starovoitov, 5049 http://netdevconf.org/0.1/sessions/15.html 5050 5051 0. Feb 2014, 5052 DevConf.cz, Brno, 5053 tc and cls_bpf: lightweight packet classifying with BPF, 5054 Daniel Borkmann, 5055 http://borkmann.ch/talks/2014_devconf.pdf 5056 5057 Further Documents 5058 ----------------- 5059 5060 - Dive into BPF: a list of reading material, 5061 Quentin Monnet 5062 (https://qmonnet.github.io/whirl-offload/2016/09/01/dive-into-bpf/) 5063 5064 - XDP - eXpress Data Path, 5065 Jesper Dangaard Brouer 5066 (https://prototype-kernel.readthedocs.io/en/latest/networking/XDP/index.html)