.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bpf_architect:

BPF Architecture
================

BPF does not define itself by only providing its instruction set, but also by
offering further infrastructure around it such as maps which act as efficient
key / value stores, helper functions to interact with and leverage kernel
functionality, tail calls for calling into other BPF programs, security
hardening primitives, a pseudo file system for pinning objects (maps,
programs), and infrastructure for allowing BPF to be offloaded, for example, to
a network card.

LLVM provides a BPF back end, so that tools like clang can be used to
compile C into a BPF object file, which can then be loaded into the kernel.
BPF is deeply tied to the Linux kernel and allows for full programmability
without sacrificing native kernel performance.

Last but not least, the kernel subsystems making use of BPF are also part of
BPF's infrastructure. The two main subsystems discussed throughout this
document are tc and XDP, to which BPF programs can be attached. XDP BPF programs
are attached at the earliest networking driver stage and trigger a run of the
BPF program upon packet reception. By definition, this achieves the best
possible packet processing performance since packets cannot get processed at an
even earlier point in software. However, since this processing occurs so early
in the networking stack, the stack has not yet extracted metadata out of the
packet. On the other hand, tc BPF programs are executed later in the kernel
stack, so they have access to more metadata and core kernel functionality.
Apart from tc and XDP programs, there are various other kernel subsystems as
well which use BPF such as tracing (kprobes, uprobes, tracepoints, etc).

The following subsections provide further details on individual aspects of the
BPF architecture.

Instruction Set
---------------

BPF is a general purpose RISC instruction set and was originally designed for
writing programs in a subset of C which can be compiled into BPF instructions
through a compiler back end (e.g. LLVM), so that the kernel can later on map them
through an in-kernel JIT compiler into native opcodes for optimal execution performance
inside the kernel.

The advantages of pushing these instructions into the kernel include:

* Making the kernel programmable without having to cross kernel / user space
  boundaries. For example, BPF programs related to networking, as in the case of
  Cilium, can implement flexible container policies, load balancing and other means
  without having to move packets to user space and back into the kernel. State
  between BPF programs and kernel / user space can still be shared through maps
  whenever needed.

* Given the flexibility of a programmable data path, programs can be heavily optimized
  for performance, for example by compiling out features that are not required for the
  use cases the program solves. For example, if a container does not require IPv4, then
  the BPF program can be built to only deal with IPv6 in order to save resources in the
  fast-path. A sketch illustrating this follows the list below.

* In case of networking (e.g. tc and XDP), BPF programs can be updated atomically
  without having to restart the kernel, system services or containers, and without
  traffic interruptions. Furthermore, any program state can also be maintained
  throughout updates via BPF maps.

* BPF provides a stable ABI towards user space, and does not require any third party
  kernel modules. BPF is a core part of the Linux kernel that is shipped everywhere,
  and guarantees that existing BPF programs keep running with newer kernel versions.
  This guarantee is the same guarantee that the kernel provides for system calls with
  regard to user space applications. Moreover, BPF programs are portable across
  different architectures.

* BPF programs work in concert with the kernel, they make use of existing kernel
  infrastructure (e.g. drivers, netdevices, tunnels, protocol stack, sockets) and
  tooling (e.g. iproute2) as well as the safety guarantees which the kernel provides.
  Unlike kernel modules, BPF programs are verified through an in-kernel verifier in
  order to ensure that they cannot crash the kernel, always terminate, etc. XDP
  programs, for example, reuse the existing in-kernel drivers and operate on the
  provided DMA buffers containing the packet frames without exposing them or an entire
  driver to user space as in other models. Moreover, XDP programs reuse the existing
  stack instead of bypassing it. BPF can be considered a generic "glue code" to
  kernel facilities for crafting programs to solve specific use cases.
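
To make the compile-out point above concrete, here is a minimal sketch of an XDP
program whose IPv4 handling can be removed at build time. The ``ENABLE_IPV4``
toggle, the hand-rolled ``bpf_htons()`` definition (which assumes a little-endian
target) and the drop-by-default policy are illustrative assumptions, not part of
Cilium's actual datapath:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/if_ether.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    /* Hypothetical build-time toggle; leave undefined to compile IPv4 out. */
    /* #define ENABLE_IPV4 */

    /* Constant byte swap; assumes a little-endian target. */
    #define bpf_htons(x) __builtin_bswap16(x)

    __section("prog")
    int xdp_l3_filter(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;

        /* Bounds check mandated by the verifier before touching packet data. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;

        if (eth->h_proto == bpf_htons(ETH_P_IPV6))
            return XDP_PASS;
    #ifdef ENABLE_IPV4
        if (eth->h_proto == bpf_htons(ETH_P_IP))
            return XDP_PASS;
    #endif
        /* Illustrative policy: anything else is dropped in the fast-path. */
        return XDP_DROP;
    }

    char __license[] __section("license") = "GPL";

Compiled without ``ENABLE_IPV4``, the IPv4 branch never reaches the BPF object
file, so the verifier and the JIT only ever see the IPv6 path.
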
The execution of a BPF program inside the kernel is always event-driven! Examples:

* A networking device which has a BPF program attached on its ingress path will
  trigger the execution of the program once a packet is received.

* A kernel address which has a kprobe with a BPF program attached will trap once
  the code at that address gets executed, which will then invoke the kprobe's
  callback function for instrumentation, subsequently triggering the execution
  of the attached BPF program.

BPF consists of eleven 64 bit registers with 32 bit subregisters, a program counter
and a 512 byte large BPF stack space. Registers are named ``r0`` - ``r10``. The
operating mode is 64 bit by default, the 32 bit subregisters can only be accessed
through special ALU (arithmetic logic unit) operations. The 32 bit lower subregisters
zero-extend into 64 bit when they are being written to.

Register ``r10`` is the only register which is read-only and contains the frame pointer
address in order to access the BPF stack space. The remaining ``r0`` - ``r9``
registers are general purpose and of read/write nature.

A BPF program can call into a predefined helper function, which is defined by
the core kernel (never by modules). The BPF calling convention is defined as
follows:

* ``r0`` contains the return value of a helper function call.
* ``r1`` - ``r5`` hold arguments from the BPF program to the kernel helper function.
* ``r6`` - ``r9`` are callee saved registers that will be preserved on helper function calls.
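
As an illustration of this convention, the hand-assembled fragment below (for
illustration only; real programs are normally compiled from C, and the
two-instruction ``BPF_LD_IMM64`` / ``BPF_PSEUDO_MAP_FD`` load of the map pointer
into ``r1`` is omitted) stores a key on the stack, calls the
``bpf_map_lookup_elem()`` helper and returns ``0``. All constants as well as
``struct bpf_insn`` come from the UAPI ``linux/bpf.h`` header:

.. code-block:: c

    #include <linux/bpf.h>

    struct bpf_insn fragment[] = {
        /* *(u32 *)(r10 - 4) = 0          -- key = 0 on the stack            */
        { .code = BPF_ST | BPF_MEM | BPF_W, .dst_reg = BPF_REG_10, .off = -4, .imm = 0 },
        /* r2 = r10; r2 += -4             -- r2 = second helper argument     */
        { .code = BPF_ALU64 | BPF_MOV | BPF_X, .dst_reg = BPF_REG_2, .src_reg = BPF_REG_10 },
        { .code = BPF_ALU64 | BPF_ADD | BPF_K, .dst_reg = BPF_REG_2, .imm = -4 },
        /* r0 = bpf_map_lookup_elem(r1, r2) -- r1 = first argument (the map) */
        { .code = BPF_JMP | BPF_CALL, .imm = BPF_FUNC_map_lookup_elem },
        /* r0 = 0; exit                    -- r0 carries the return value    */
        { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
        { .code = BPF_JMP | BPF_EXIT },
    };

Here ``r1`` carries the first helper argument (the map), ``r2`` the second (the
pointer to the key), and the helper's result is left in ``r0``.
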
The BPF calling convention is generic enough to map directly to ``x86_64``, ``arm64``
and other ABIs, thus all BPF registers map one to one to HW CPU registers, so that a
JIT only needs to issue a call instruction, but no additional moves for placing
function arguments. This calling convention was modeled to cover common call
situations without having a performance penalty. Calls with 6 or more arguments
are currently not supported. The helper functions in the kernel which are dedicated
to BPF (``BPF_CALL_0()`` to ``BPF_CALL_5()`` functions) are specifically designed
with this convention in mind.

Register ``r0`` is also the register containing the exit value for the BPF program.
The semantics of the exit value are defined by the type of program. Furthermore, when
handing execution back to the kernel, the exit value is passed as a 32 bit value.

Registers ``r1`` - ``r5`` are scratch registers, meaning the BPF program needs to
either spill them to the BPF stack or move them to callee saved registers if these
arguments are to be reused across multiple helper function calls. Spilling means
that the variable in the register is moved to the BPF stack. The reverse operation
of moving the variable from the BPF stack to the register is called filling. The
reason for spilling/filling is the limited number of registers.

Upon entering execution of a BPF program, register ``r1`` initially contains the
context for the program. The context is the input argument for the program (similar
to the ``argc/argv`` pair for a typical C program). BPF is restricted to work on a
single context. The context is defined by the program type, for example, a networking
program can have a kernel representation of the network packet (``skb``) as the
input argument.

The general operation of BPF is 64 bit to follow the natural model of 64 bit
architectures in order to perform pointer arithmetic, pass pointers as well as 64
bit values into helper functions, and to allow for 64 bit atomic operations.

The maximum instruction limit per program is restricted to 4096 BPF instructions,
which, by design, means that any program will terminate quickly. For kernels newer
than 5.1 this limit was lifted to 1 million BPF instructions. Although the
instruction set contains forward as well as backward jumps, the in-kernel BPF
verifier will forbid loops so that termination is always guaranteed. Since BPF
programs run inside the kernel, the verifier's job is to make sure that these are
safe to run and do not affect the system's stability. This means that from an
instruction set point of view, loops can be implemented, but the verifier will
restrict that. However, there is also a concept of tail calls that allows one BPF
program to jump into another one. This, too, comes with an upper nesting limit of
33 calls, and is usually used to decouple parts of the program logic, for example,
into stages.

The instruction format is modeled as two operand instructions, which helps mapping
BPF instructions to native instructions during the JIT phase. The instruction set is
of fixed size, meaning every instruction has a 64 bit encoding. Currently, 87 instructions
have been implemented and the encoding also allows extending the set with further
instructions when needed. The instruction encoding of a single 64 bit instruction on a
big-endian machine is defined as a bit sequence from most significant bit (MSB) to least
significant bit (LSB) of ``op:8``, ``dst_reg:4``, ``src_reg:4``, ``off:16``, ``imm:32``.
``off`` and ``imm`` are of signed type. The encodings are part of the kernel headers and
defined in the ``linux/bpf.h`` header, which also includes ``linux/bpf_common.h``.
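
This layout corresponds to ``struct bpf_insn`` from the UAPI ``linux/bpf.h``
header, where the ``op`` field is simply called ``code`` (comments abridged):

.. code-block:: c

    struct bpf_insn {
        __u8    code;       /* opcode (the op field above) */
        __u8    dst_reg:4;  /* destination register        */
        __u8    src_reg:4;  /* source register             */
        __s16   off;        /* signed offset               */
        __s32   imm;        /* signed immediate constant   */
    };
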
``op`` defines the actual operation to be performed. Most of the encoding for ``op``
has been reused from cBPF. The operation can be based on register or immediate
operands. The encoding of ``op`` itself provides information on which mode to use
(``BPF_X`` for denoting register-based operations, and ``BPF_K`` for immediate-based
operations respectively). In the latter case, the destination operand is always
a register. Both ``dst_reg`` and ``src_reg`` provide additional information about
the register operands to be used (e.g. ``r0`` - ``r9``) for the operation. ``off``
is used in some instructions to provide a relative offset, for example, for addressing
the stack or other buffers available to BPF (e.g. map values, packet data, etc),
or jump targets in jump instructions. ``imm`` contains a constant / immediate value.

The available ``op`` instructions can be categorized into various instruction
classes. These classes are also encoded inside the ``op`` field. The ``op`` field
is divided into (from MSB to LSB) ``code:4``, ``source:1`` and ``class:3``. ``class``
is the more generic instruction class, ``code`` denotes a specific operational
code inside that class, and ``source`` tells whether the source operand is a register
or an immediate value. Possible instruction classes include:

* ``BPF_LD``, ``BPF_LDX``: Both classes are for load operations. ``BPF_LD`` is
  used for loading a double word as a special instruction spanning two instructions
  due to the ``imm:32`` split, and for byte / half-word / word loads of packet data.
  The latter was carried over from cBPF mainly in order to keep cBPF to BPF
  translations efficient, since they have optimized JIT code. For native BPF
  these packet load instructions are less relevant nowadays. The ``BPF_LDX`` class
  holds instructions for byte / half-word / word / double-word loads out of
  memory. Memory in this context is generic and could be stack memory, map value
  data, packet data, etc.

* ``BPF_ST``, ``BPF_STX``: Both classes are for store operations. Similar to ``BPF_LDX``,
  ``BPF_STX`` is the store counterpart and is used to store the data from a
  register into memory, which, again, can be stack memory, map value, packet data,
  etc. ``BPF_STX`` also holds special instructions for performing word and double-word
  based atomic add operations, which can be used for counters, for example. The
  ``BPF_ST`` class is similar to ``BPF_STX`` in providing instructions for storing
  data into memory, except that the source operand is an immediate value.

* ``BPF_ALU``, ``BPF_ALU64``: Both classes contain ALU operations. Generally,
  ``BPF_ALU`` operations are in 32 bit mode and ``BPF_ALU64`` in 64 bit mode.
  Both ALU classes have basic operations with a register-based source operand as
  well as an immediate-based counterpart. Supported by both are add (``+``), sub (``-``),
  and (``&``), or (``|``), left shift (``<<``), right shift (``>>``), xor (``^``),
  mul (``*``), div (``/``), mod (``%``), neg (``~``) operations. Also mov (``<X> := <Y>``)
  was added as a special ALU operation for both classes in both operand modes.
  ``BPF_ALU64`` also contains a signed right shift. ``BPF_ALU`` additionally
  contains endianness conversion instructions for half-word / word / double-word
  on a given source register.

* ``BPF_JMP``: This class is dedicated to jump operations. Jumps can be unconditional
  and conditional. Unconditional jumps simply move the program counter forward, so
  that the next instruction to be executed relative to the current instruction is
  ``off + 1``, where ``off`` is the constant offset encoded in the instruction. Since
  ``off`` is signed, the jump can also be performed backwards as long as it does not
  create a loop and is within program bounds. Conditional jumps operate on both
  register-based and immediate-based source operands. If the condition in the jump
  operations results in ``true``, then a relative jump to ``off + 1`` is performed,
  otherwise the next instruction (``0 + 1``) is performed. This fall-through
  jump logic differs compared to cBPF and allows for better branch prediction as it
  fits the CPU branch predictor logic more naturally. Available conditions are
  jeq (``==``), jne (``!=``), jgt (``>``), jge (``>=``), jsgt (signed ``>``), jsge
  (signed ``>=``), jlt (``<``), jle (``<=``), jslt (signed ``<``), jsle (signed
  ``<=``) and jset (jump if ``DST & SRC``). Apart from that, there are three
  special jump operations within this class: the exit instruction which will leave
  the BPF program and return the current value in ``r0`` as a return code, the call
  instruction, which will issue a function call into one of the available BPF helper
  functions, and a hidden tail call instruction, which will jump into a different
  BPF program.

The Linux kernel is shipped with a BPF interpreter which executes programs assembled in
BPF instructions. Even cBPF programs are translated into eBPF programs transparently
in the kernel, except for architectures that still ship with a cBPF JIT and
have not yet migrated to an eBPF JIT.

Currently the ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64`` and
``arm`` architectures come with an in-kernel eBPF JIT compiler.

All BPF handling such as loading of programs into the kernel or creation of BPF maps
is managed through a central ``bpf()`` system call. It is also used for managing map
entries (lookup / update / delete), and making programs as well as maps persistent
in the BPF file system through pinning.
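
As a small user space illustration of the ``bpf()`` system call, the following
sketch creates an array map by filling in ``union bpf_attr`` directly instead of
going through a library such as libbpf; error handling is kept to a bare minimum:

.. code-block:: c

    #include <linux/bpf.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>
    #include <stdio.h>

    int main(void)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.map_type    = BPF_MAP_TYPE_ARRAY;
        attr.key_size    = sizeof(__u32);
        attr.value_size  = sizeof(__u64);
        attr.max_entries = 4;

        /* No glibc wrapper exists, so the raw syscall number is used. */
        int fd = syscall(__NR_bpf, BPF_MAP_CREATE, &attr, sizeof(attr));
        if (fd < 0) {
            perror("bpf(BPF_MAP_CREATE)");
            return 1;
        }
        printf("created array map, fd=%d\n", fd);
        return 0;
    }

The returned file descriptor can then be used with the ``BPF_MAP_LOOKUP_ELEM``
and ``BPF_MAP_UPDATE_ELEM`` commands, or pinned into the BPF file system as
described in the Object Pinning section.
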
Helper Functions
----------------

Helper functions are a concept which enables BPF programs to consult a core kernel
defined set of function calls in order to retrieve data from / push data to the
kernel. Available helper functions may differ for each BPF program type,
for example, BPF programs attached to sockets are only allowed to call into
a subset of helpers compared to BPF programs attached to the tc layer.
Encapsulation and decapsulation helpers for lightweight tunneling constitute
an example of functions which are only available to lower tc layers, whereas
event output helpers for pushing notifications to user space are available to
tc and XDP programs.

Each helper function is implemented with a commonly shared function signature
similar to system calls. The signature is defined as:

.. code-block:: c

    u64 fn(u64 r1, u64 r2, u64 r3, u64 r4, u64 r5)

The calling convention as described in the previous section applies to all
BPF helper functions.
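
From BPF C code, a helper is traditionally made callable by declaring a function
pointer whose value is the helper's ID from ``enum bpf_func_id``; clang then
emits a call instruction carrying that ID as its immediate. A minimal hand-rolled
sketch of this pattern (the same approach taken by older kernel samples and
iproute2 headers, nowadays usually provided by libbpf's ``bpf_helpers.h``) could
look like:

.. code-block:: c

    #include <linux/bpf.h>

    /* Helper "prototypes": the pointer value is the helper ID, which clang
     * turns into a call instruction carrying that ID as its immediate.
     */
    static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

    static __u64 (*bpf_ktime_get_ns)(void) =
        (void *) BPF_FUNC_ktime_get_ns;
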
The kernel abstracts helper functions into macros ``BPF_CALL_0()`` to ``BPF_CALL_5()``
which are similar to those of system calls. The following example is an extract
from a helper function which updates map elements by calling into the
corresponding map implementation callbacks:

.. code-block:: c

    BPF_CALL_4(bpf_map_update_elem, struct bpf_map *, map, void *, key,
               void *, value, u64, flags)
    {
        WARN_ON_ONCE(!rcu_read_lock_held());
        return map->ops->map_update_elem(map, key, value, flags);
    }

    const struct bpf_func_proto bpf_map_update_elem_proto = {
        .func      = bpf_map_update_elem,
        .gpl_only  = false,
        .ret_type  = RET_INTEGER,
        .arg1_type = ARG_CONST_MAP_PTR,
        .arg2_type = ARG_PTR_TO_MAP_KEY,
        .arg3_type = ARG_PTR_TO_MAP_VALUE,
        .arg4_type = ARG_ANYTHING,
    };

There are various advantages of this approach: while cBPF overloaded its
load instructions in order to fetch data at an impossible packet offset to
invoke auxiliary helper functions, each cBPF JIT needed to implement support
for such a cBPF extension. In case of eBPF, each newly added helper function
will be JIT compiled in a transparent and efficient way, meaning that the JIT
compiler only needs to emit a call instruction since the register mapping
is made in such a way that BPF register assignments already match the
underlying architecture's calling convention. This allows for easily extending
the core kernel with new helper functionality. All BPF helper functions are
part of the core kernel and cannot be extended or added through kernel modules.

The aforementioned function signature also allows the verifier to perform type
checks. The above ``struct bpf_func_proto`` is used to hand the verifier all the
necessary information about the helper, so that the verifier can make sure that
the expected types from the helper match the current contents of the BPF
program's analyzed registers.

Argument types can range from passing in any kind of value up to restricted
contents such as a pointer / size pair for the BPF stack buffer, which the
helper should read from or write to. In the latter case, the verifier can also
perform additional checks, for example, whether the buffer was previously
initialized.

The list of available BPF helper functions is rather long and constantly growing,
for example, at the time of this writing, tc BPF programs can choose from 38
different BPF helpers. The kernel's ``struct bpf_verifier_ops`` contains a
``get_func_proto`` callback function that provides the mapping of a specific
``enum bpf_func_id`` to one of the available helpers for a given BPF program
type.

Maps
----

.. image:: /images/bpf_map.png
    :align: center

Maps are efficient key / value stores that reside in kernel space. They can be
accessed from a BPF program in order to keep state among multiple BPF program
invocations. They can also be accessed through file descriptors from user space
and can be arbitrarily shared with other BPF programs or user space applications.

BPF programs which share maps with each other are not required to be of the same
program type, for example, tracing programs can share maps with networking programs.
A single BPF program can currently access up to 64 different maps directly.

Map implementations are provided by the core kernel. There are generic maps with
per-CPU and non-per-CPU flavor that can read / write arbitrary data, but there are
also a few non-generic maps that are used along with helper functions.

Generic maps currently available are ``BPF_MAP_TYPE_HASH``, ``BPF_MAP_TYPE_ARRAY``,
``BPF_MAP_TYPE_PERCPU_HASH``, ``BPF_MAP_TYPE_PERCPU_ARRAY``, ``BPF_MAP_TYPE_LRU_HASH``,
``BPF_MAP_TYPE_LRU_PERCPU_HASH`` and ``BPF_MAP_TYPE_LPM_TRIE``. They all use the
same common set of BPF helper functions in order to perform lookup, update or
delete operations while implementing a different backend with differing semantics
and performance characteristics.

Non-generic maps that are currently in the kernel are ``BPF_MAP_TYPE_PROG_ARRAY``,
``BPF_MAP_TYPE_PERF_EVENT_ARRAY``, ``BPF_MAP_TYPE_CGROUP_ARRAY``,
``BPF_MAP_TYPE_STACK_TRACE``, ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and
``BPF_MAP_TYPE_HASH_OF_MAPS``. For example, ``BPF_MAP_TYPE_PROG_ARRAY`` is an
array map which holds other BPF programs, while ``BPF_MAP_TYPE_ARRAY_OF_MAPS`` and
``BPF_MAP_TYPE_HASH_OF_MAPS`` both hold pointers to other maps such that entire
BPF maps can be atomically replaced at runtime. These map types tackle a
specific issue which was unsuitable to be implemented solely through a BPF helper
function since additional (non-data) state is required to be held across BPF
program invocations.
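
To illustrate how such a map is declared and used from BPF C, the following
sketch counts packets in a per-CPU array. It assumes the iproute2 loader
conventions; ``struct bpf_elf_map`` normally comes from iproute2's ``bpf_elf.h``
and is inlined here only to keep the sketch self-contained, and the map and
section names are illustrative:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    /* iproute2 loader convention, normally provided by bpf_elf.h. */
    struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;
        __u32 pinning;
    };

    static void *(*bpf_map_lookup_elem)(void *map, const void *key) =
        (void *) BPF_FUNC_map_lookup_elem;

    /* One counter slot, with a separate value per possible CPU. */
    struct bpf_elf_map pkt_count __section("maps") = {
        .type       = BPF_MAP_TYPE_PERCPU_ARRAY,
        .size_key   = sizeof(__u32),
        .size_value = sizeof(__u64),
        .max_elem   = 1,
    };

    __section("prog")
    int xdp_count(struct xdp_md *ctx)
    {
        __u32 key = 0;
        __u64 *count = bpf_map_lookup_elem(&pkt_count, &key);

        /* The verifier requires the NULL check before dereferencing. */
        if (count)
            (*count)++;
        return XDP_PASS;
    }

    char __license[] __section("license") = "GPL";

A user space lookup on such a per-CPU map returns one value slot per possible
CPU, which the reader then has to sum up.
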
Object Pinning
--------------

.. image:: /images/bpf_fs.png
    :align: center

BPF maps and programs act as a kernel resource and can only be accessed through
file descriptors, backed by anonymous inodes in the kernel. This brings both
advantages and a number of disadvantages:

User space applications can make use of most file descriptor related APIs,
file descriptor passing for Unix domain sockets works transparently, etc, but
at the same time, file descriptors are limited to a process's lifetime,
which makes options like map sharing rather cumbersome to carry out.

Thus, it brings a number of complications for certain use cases such as iproute2,
where tc or XDP sets up and loads the program into the kernel and eventually
terminates itself. With that, access to maps from the user space side is also
unavailable, where it could otherwise be useful, for example, when maps are
shared between ingress and egress locations of the data path. Also, third
party applications may wish to monitor or update map contents during BPF
program runtime.

To overcome this limitation, a minimal kernel space BPF file system has been
implemented, where BPF maps and programs can be pinned, a process called
object pinning. The BPF system call has therefore been extended with two new
commands which can pin (``BPF_OBJ_PIN``) or retrieve (``BPF_OBJ_GET``) a
previously pinned object.

For instance, tools such as tc make use of this infrastructure for sharing
maps on ingress and egress. The BPF related file system is not a singleton,
it does support multiple mount instances, hard and soft links, etc.
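
In user space, pinning and retrieving an object boils down to two thin wrappers
around the ``bpf(2)`` system call, sketched below. The path under ``/sys/fs/bpf``
is only an example and assumes the BPF file system is already mounted there:

.. code-block:: c

    #include <linux/bpf.h>
    #include <sys/syscall.h>
    #include <unistd.h>
    #include <string.h>

    /* Pin an existing map or program fd at the given path in the BPF fs. */
    static int bpf_obj_pin(int fd, const char *path)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (__u64)(unsigned long)path;
        attr.bpf_fd   = fd;

        return syscall(__NR_bpf, BPF_OBJ_PIN, &attr, sizeof(attr));
    }

    /* Retrieve a new fd for a previously pinned object, for example
     * bpf_obj_get("/sys/fs/bpf/tc/globals/some_map").
     */
    static int bpf_obj_get(const char *path)
    {
        union bpf_attr attr;

        memset(&attr, 0, sizeof(attr));
        attr.pathname = (__u64)(unsigned long)path;

        return syscall(__NR_bpf, BPF_OBJ_GET, &attr, sizeof(attr));
    }
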
Tail Calls
----------

.. image:: /images/bpf_tailcall.png
    :align: center

Another concept that can be used with BPF is called tail calls. Tail calls can
be seen as a mechanism that allows one BPF program to call another, without
returning to the old program. Such a call has minimal overhead since, unlike
function calls, it is implemented as a long jump, reusing the same stack frame.

Such programs are verified independently of each other, thus for transferring
state, either per-CPU maps as scratch buffers or, in case of tc programs, ``skb``
fields such as the ``cb[]`` area must be used.

Only programs of the same type can be tail called, and they also need to match
in terms of JIT compilation, thus either JIT compiled or only interpreted programs
can be invoked, but not mixed together.

There are two components involved for carrying out tail calls: the first part
needs to set up a specialized map called a program array (``BPF_MAP_TYPE_PROG_ARRAY``)
that can be populated by user space with key / value pairs, where the values are the
file descriptors of the tail called BPF programs; the second part is the
``bpf_tail_call()`` helper, which is passed the context, a reference to the program
array and the lookup key. The kernel then inlines this helper call directly into a
specialized BPF instruction. Such a program array is currently write-only from the
user space side.

The kernel looks up the related BPF program from the passed file descriptor
and atomically replaces the program pointer at the given map slot. When no map
entry has been found at the provided key, the kernel will just "fall through"
and continue execution of the old program with the instructions following
the ``bpf_tail_call()``. Tail calls are a powerful utility, for example,
parsing network headers can be structured through tail calls. During runtime,
functionality can be added or replaced atomically, thus altering the BPF
program's execution behavior.
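
A sketch of the BPF side of such a setup is shown below, reusing the iproute2
``struct bpf_elf_map`` convention from the Maps example; the program array's
slot layout (``PARSE_L3``) is a purely illustrative assumption, and user space
still has to install the target program's file descriptor into that slot:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    /* iproute2 loader convention, normally provided by bpf_elf.h. */
    struct bpf_elf_map {
        __u32 type;
        __u32 size_key;
        __u32 size_value;
        __u32 max_elem;
        __u32 flags;
        __u32 id;
        __u32 pinning;
    };

    static void (*bpf_tail_call)(void *ctx, void *map, __u32 index) =
        (void *) BPF_FUNC_tail_call;

    /* Program array holding the tail call targets; populated by user space. */
    struct bpf_elf_map jmp_table __section("maps") = {
        .type       = BPF_MAP_TYPE_PROG_ARRAY,
        .size_key   = sizeof(__u32),
        .size_value = sizeof(__u32),
        .max_elem   = 8,
    };

    #define PARSE_L3 0    /* hypothetical slot layout of the program array */

    __section("prog")
    int xdp_entry(struct xdp_md *ctx)
    {
        /* Jump to the program installed in slot PARSE_L3, if any. */
        bpf_tail_call(ctx, &jmp_table, PARSE_L3);

        /* Fall-through: the slot was empty, keep processing here. */
        return XDP_PASS;
    }

    char __license[] __section("license") = "GPL";

If slot ``PARSE_L3`` has not been populated, ``bpf_tail_call()`` falls through
and the program simply returns ``XDP_PASS``.
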
.. _bpf_to_bpf_calls:

BPF to BPF Calls
----------------

.. image:: /images/bpf_call.png
    :align: center

Aside from BPF helper calls and BPF tail calls, a more recent feature that has
been added to the BPF core infrastructure is BPF to BPF calls. Before this
feature was introduced into the kernel, a typical BPF C program had to declare
any reusable code that, for example, resides in headers as ``always_inline``
so that when LLVM compiled and generated the BPF object file, all these
functions were inlined and therefore duplicated many times in the resulting
object file, artificially inflating its code size:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    #ifndef __inline
    # define __inline                         \
       inline __attribute__((always_inline))
    #endif

    static __inline int foo(void)
    {
        return XDP_DROP;
    }

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }

    char __license[] __section("license") = "GPL";

The main reason why this was necessary was the lack of function call support
in the BPF program loader as well as verifier, interpreter and JITs. Starting
with Linux kernel 4.16 and LLVM 6.0 this restriction was lifted and BPF programs
no longer need to use ``always_inline`` everywhere. Thus, the previously shown
BPF example code can be rewritten more naturally as:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME)                  \
       __attribute__((section(NAME), used))
    #endif

    static int foo(void)
    {
        return XDP_DROP;
    }

    __section("prog")
    int xdp_drop(struct xdp_md *ctx)
    {
        return foo();
    }

    char __license[] __section("license") = "GPL";

Mainstream BPF JIT compilers like ``x86_64`` and ``arm64`` support BPF to BPF
calls today, with others following in the near future. BPF to BPF calls are an
important performance optimization since they heavily reduce the generated BPF
code size and are therefore friendlier to a CPU's instruction cache.

The calling convention known from BPF helper functions applies to BPF to BPF
calls just as well, meaning ``r1`` up to ``r5`` are for passing arguments to
the callee and the result is returned in ``r0``. ``r1`` to ``r5`` are scratch
registers whereas ``r6`` to ``r9`` are preserved across calls the usual way. The
maximum number of nested calls, that is, the number of allowed call frames, is ``8``.
A caller can pass pointers (e.g. to the caller's stack frame) down to the
callee, but never vice versa.
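
Note that LLVM may still decide to inline a small ``static`` function like
``foo()`` above on its own. If a real BPF to BPF call should be forced, for
example to keep the generated code size down when a function is used in many
places, ``__attribute__((noinline))`` can be used; a minimal sketch with a
hypothetical ``classify()`` function:

.. code-block:: c

    #include <linux/bpf.h>

    #ifndef __section
    # define __section(NAME) __attribute__((section(NAME), used))
    #endif

    /* noinline forces a separate function body, so the object file contains
     * an actual BPF to BPF call instead of an inlined copy of the function.
     */
    static __attribute__((noinline)) int classify(struct xdp_md *ctx)
    {
        /* data and data_end are u32 offsets in xdp_md, so this is plain
         * integer arithmetic: drop anything shorter than an Ethernet header.
         */
        return (ctx->data_end - ctx->data) < 14 ? XDP_DROP : XDP_PASS;
    }

    __section("prog")
    int xdp_classify(struct xdp_md *ctx)
    {
        return classify(ctx);
    }

    char __license[] __section("license") = "GPL";
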
BPF JIT compilers emit separate images for each function body and later fix
up the function call addresses in the image in a final JIT pass. This has
proven to require minimal changes to the JITs in that they can treat BPF to
BPF calls as conventional BPF helper calls.

Up to kernel 5.9, BPF tail calls and BPF subprograms excluded each other. BPF
programs that utilized tail calls could not benefit from reduced program
image size and faster load times. Linux kernel 5.10 finally allows users to bring
the best of both worlds and adds the ability to combine BPF subprograms with
tail calls.

This improvement comes with some restrictions, though. Mixing these two features
can cause a kernel stack overflow. To get an idea of what might happen, see the
picture below that illustrates the mix of bpf2bpf calls and tail calls:

.. image:: /images/bpf_tailcall_subprograms.png
    :align: center

A tail call, before the actual jump to the target program, will unwind only its
own current stack frame. As we can see in the example above, if a tail call occurs
from within a sub-function, the sub-function's (func1) stack frame will still be
present on the stack when program execution is at func2. Once the final
function (func3) terminates, all the previous stack frames will be unwound and
control gets back to the caller of the BPF program.

The kernel introduced additional logic for detecting this feature combination.
The stack size throughout the whole call chain is limited to 256
bytes per subprogram (note that if the verifier detects a bpf2bpf call, then
the main function is treated as a sub-function as well). In total, with this
restriction, the BPF program's call chain can consume at most 8KB of stack
space. This limit comes from the 256 bytes per stack frame multiplied by the
tail call count limit (33). Without this restriction, BPF programs would operate
on a 512-byte stack size, yielding 16KB in total for the maximum count of
tail calls, which would overflow the stack on some architectures.

One more thing to mention is that this feature combination is currently
supported only on the x86-64 architecture.

JIT
---

.. image:: /images/bpf_jit.png
    :align: center

The 64 bit ``x86_64``, ``arm64``, ``ppc64``, ``s390x``, ``mips64``, ``sparc64``
and 32 bit ``arm``, ``x86_32`` architectures are all shipped with an in-kernel
eBPF JIT compiler. All of them are feature equivalent and the JIT can be enabled
through:

.. code-block:: shell-session

    # echo 1 > /proc/sys/net/core/bpf_jit_enable

The 32 bit ``mips``, ``ppc`` and ``sparc`` architectures currently have a cBPF
JIT compiler. These architectures, which still have a cBPF JIT, as well as all
remaining architectures supported by the Linux kernel which do not have a BPF JIT
compiler at all, need to run eBPF programs through the in-kernel interpreter.

In the kernel's source tree, eBPF JIT support can be easily determined by
issuing a grep for ``HAVE_EBPF_JIT``:

.. code-block:: shell-session

    # git grep HAVE_EBPF_JIT arch/
    arch/arm/Kconfig:     select HAVE_EBPF_JIT if !CPU_ENDIAN_BE32
    arch/arm64/Kconfig:   select HAVE_EBPF_JIT
    arch/powerpc/Kconfig: select HAVE_EBPF_JIT if PPC64
    arch/mips/Kconfig:    select HAVE_EBPF_JIT if (64BIT && !CPU_MICROMIPS)
    arch/s390/Kconfig:    select HAVE_EBPF_JIT if PACK_STACK && HAVE_MARCH_Z196_FEATURES
    arch/sparc/Kconfig:   select HAVE_EBPF_JIT if SPARC64
    arch/x86/Kconfig:     select HAVE_EBPF_JIT if X86_64

JIT compilers speed up execution of the BPF program significantly since they
reduce the per instruction cost compared to the interpreter. Often instructions
can be mapped 1:1 with native instructions of the underlying architecture. This
also reduces the resulting executable image size and is therefore more
instruction cache friendly to the CPU. In particular, in the case of CISC
instruction sets such as ``x86``, the JITs are optimized for emitting the shortest
possible opcodes for a given instruction to shrink the total necessary size for the
program translation.

Hardening
---------

BPF locks the entire BPF interpreter image (``struct bpf_prog``) as well
as the JIT compiled image (``struct bpf_binary_header``) in the kernel as
read-only during the program's lifetime in order to prevent the code from
potential corruption. Any corruption happening at that point, for example,
due to some kernel bug, will result in a general protection fault and thus
crash the kernel instead of allowing the corruption to happen silently.

Architectures that support setting the image memory as read-only can be
determined through:

.. code-block:: shell-session

    $ git grep ARCH_HAS_SET_MEMORY | grep select
    arch/arm/Kconfig:    select ARCH_HAS_SET_MEMORY
    arch/arm64/Kconfig:  select ARCH_HAS_SET_MEMORY
    arch/s390/Kconfig:   select ARCH_HAS_SET_MEMORY
    arch/x86/Kconfig:    select ARCH_HAS_SET_MEMORY

The option ``CONFIG_ARCH_HAS_SET_MEMORY`` is not configurable, which means this
protection is always built in. Other architectures might follow in the future.

In case of the ``x86_64`` JIT compiler, the JITing of the indirect jump from
the use of tail calls is realized through a retpoline in case ``CONFIG_RETPOLINE``
has been set, which is the default at the time of writing in most modern Linux
distributions.

If ``/proc/sys/net/core/bpf_jit_harden`` is set to ``1``, additional
hardening steps for the JIT compilation take effect for unprivileged users.
This effectively trades off a slight amount of performance for a reduced
(potential) attack surface in case untrusted users operate on the system.
Even with this decrease, program execution is still faster than switching
to the interpreter entirely.

Currently, enabling hardening will blind all user provided 32 bit and 64 bit
constants from the BPF program when it gets JIT compiled in order to prevent
JIT spraying attacks which inject native opcodes as immediate values. This is
problematic as these immediate values reside in executable kernel memory,
therefore a jump that could be triggered from some kernel bug would jump to
the start of the immediate value and then execute these as native instructions.

JIT constant blinding prevents this by randomizing the actual instruction:
the operation is transformed from an immediate based source operand to a
register based one through rewriting the instruction and splitting the actual
load of the value into two steps: 1) load a blinded immediate value
``rnd ^ imm`` into a register, 2) xor that register with ``rnd``,
so that the original ``imm`` immediate then resides in the register and
can be used for the actual operation. The example was provided for a load
operation, but really all generic operations are blinded.

Example of JITing a program with hardening disabled:

.. code-block:: shell-session

    # echo 0 > /proc/sys/net/core/bpf_jit_harden

      ffffffffa034f5e9 + <x>:
      [...]
      39:   mov    $0xa8909090,%eax
      3e:   mov    $0xa8909090,%eax
      43:   mov    $0xa8ff3148,%eax
      48:   mov    $0xa89081b4,%eax
      4d:   mov    $0xa8900bb0,%eax
      52:   mov    $0xa810e0c1,%eax
      57:   mov    $0xa8908eb4,%eax
      5c:   mov    $0xa89020b0,%eax
      [...]

The same program gets its constants blinded when loaded through BPF
as an unprivileged user and hardening is enabled:

.. code-block:: shell-session

    # echo 1 > /proc/sys/net/core/bpf_jit_harden

      ffffffffa034f1e5 + <x>:
      [...]
      39:   mov    $0xe1192563,%r10d
      3f:   xor    $0x4989b5f3,%r10d
      46:   mov    %r10d,%eax
      49:   mov    $0xb8296d93,%r10d
      4f:   xor    $0x10b9fd03,%r10d
      56:   mov    %r10d,%eax
      59:   mov    $0x8c381146,%r10d
      5f:   xor    $0x24c7200e,%r10d
      66:   mov    %r10d,%eax
      69:   mov    $0xeb2a830e,%r10d
      6f:   xor    $0x43ba02ba,%r10d
      76:   mov    %r10d,%eax
      79:   mov    $0xd9730af,%r10d
      7f:   xor    $0xa5073b1f,%r10d
      86:   mov    %r10d,%eax
      89:   mov    $0x9a45662b,%r10d
      8f:   xor    $0x325586ea,%r10d
      96:   mov    %r10d,%eax
      [...]

Both programs are semantically the same, only that none of the
original immediate values are visible anymore in the disassembly of
the second program.

At the same time, hardening also disables any JIT kallsyms exposure
for privileged users, so that JIT image addresses are no longer
exposed to ``/proc/kallsyms``.

Moreover, the Linux kernel provides the option ``CONFIG_BPF_JIT_ALWAYS_ON``
which removes the entire BPF interpreter from the kernel and permanently
enables the JIT compiler. This has been developed as part of a mitigation
in the context of Spectre v2 such that when used in a VM-based setting,
the guest kernel is not going to reuse the host kernel's BPF interpreter
when mounting an attack anymore.
For container-based environments, the
``CONFIG_BPF_JIT_ALWAYS_ON`` configuration option is optional, but in
case JITs are enabled there anyway, the interpreter may as well be compiled
out to reduce the kernel's complexity. Thus, it is also generally
recommended for widely used JITs in case of mainstream architectures
such as ``x86_64`` and ``arm64``.

Last but not least, the kernel offers an option to disable the use of
the ``bpf(2)`` system call for unprivileged users through the
``/proc/sys/kernel/unprivileged_bpf_disabled`` sysctl knob. This is
on purpose a one-time kill switch, meaning once set to ``1``, there is
no option to reset it back to ``0`` until the next kernel reboot. When
set, only ``CAP_SYS_ADMIN`` privileged processes out of the initial
namespace are allowed to use the ``bpf(2)`` system call from that
point onwards. Upon start, Cilium sets this knob to ``1`` as well.

.. code-block:: shell-session

    # echo 1 > /proc/sys/kernel/unprivileged_bpf_disabled

Offloads
--------

.. image:: /images/bpf_offload.png
    :align: center

Networking programs in BPF, in particular for tc and XDP, do have an
offload interface to hardware in the kernel in order to execute BPF
code directly on the NIC.

Currently, the ``nfp`` driver from Netronome has support for offloading
BPF through a JIT compiler which translates BPF instructions to an
instruction set implemented against the NIC. This includes offloading
of BPF maps to the NIC as well, thus the offloaded BPF program can
perform map lookups, updates and deletions.

BPF sysctls
-----------

The Linux kernel provides a few sysctls that are BPF related; they are covered
in this section.

* ``/proc/sys/net/core/bpf_jit_enable``: Enables or disables the BPF JIT compiler.

  +-------+-------------------------------------------------------------------+
  | Value | Description                                                       |
  +-------+-------------------------------------------------------------------+
  | 0     | Disable the JIT and use only interpreter (kernel's default value) |
  +-------+-------------------------------------------------------------------+
  | 1     | Enable the JIT compiler                                           |
  +-------+-------------------------------------------------------------------+
  | 2     | Enable the JIT and emit debugging traces to the kernel log        |
  +-------+-------------------------------------------------------------------+

  As described in subsequent sections, the ``bpf_jit_disasm`` tool can be used to
  process debugging traces when the JIT compiler is set to debugging mode (option ``2``).

* ``/proc/sys/net/core/bpf_jit_harden``: Enables or disables BPF JIT hardening.
  Note that enabling hardening trades off performance, but can mitigate JIT spraying
  by blinding out the BPF program's immediate values. For programs processed through
  the interpreter, blinding of immediate values is not needed / performed.

  +-------+-------------------------------------------------------------------+
  | Value | Description                                                       |
  +-------+-------------------------------------------------------------------+
  | 0     | Disable JIT hardening (kernel's default value)                    |
  +-------+-------------------------------------------------------------------+
  | 1     | Enable JIT hardening for unprivileged users only                  |
  +-------+-------------------------------------------------------------------+
  | 2     | Enable JIT hardening for all users                                |
  +-------+-------------------------------------------------------------------+

* ``/proc/sys/net/core/bpf_jit_kallsyms``: Enables or disables the export of JITed
  programs as kernel symbols to ``/proc/kallsyms`` so that they can be used together
  with ``perf`` tooling as well as making these addresses known to the kernel for
  stack unwinding, for example, as used in dumping stack traces. The symbol names
  contain the BPF program tag (``bpf_prog_<tag>``). If ``bpf_jit_harden`` is enabled,
  then this feature is disabled.

  +-------+-------------------------------------------------------------------+
  | Value | Description                                                       |
  +-------+-------------------------------------------------------------------+
  | 0     | Disable JIT kallsyms export (kernel's default value)              |
  +-------+-------------------------------------------------------------------+
  | 1     | Enable JIT kallsyms export for privileged users only              |
  +-------+-------------------------------------------------------------------+

* ``/proc/sys/kernel/unprivileged_bpf_disabled``: Enables or disables unprivileged
  use of the ``bpf(2)`` system call. The Linux kernel has unprivileged use of
  ``bpf(2)`` enabled by default.

  Once the value is set to 1, unprivileged use will be permanently disabled until
  the next reboot; neither an application nor an admin can reset the value anymore.

  The value can also be set to 2, which means it can still be changed at runtime
  to 0 or 1 later, while disabling unprivileged use for now. This value was added
  in Linux 5.13. If ``BPF_UNPRIV_DEFAULT_OFF``
  is enabled in the kernel config, then this knob will default to 2 instead of 0.

  This knob does not affect any cBPF programs such as seccomp
  or traditional socket filters that do not use the ``bpf(2)`` system call for
  loading the program into the kernel.

  +-------+---------------------------------------------------------------------+
  | Value | Description                                                         |
  +-------+---------------------------------------------------------------------+
  | 0     | Unprivileged use of bpf syscall enabled (kernel's default value)    |
  +-------+---------------------------------------------------------------------+
  | 1     | Unprivileged use of bpf syscall disabled (until reboot)             |
  +-------+---------------------------------------------------------------------+
  | 2     | Unprivileged use of bpf syscall disabled                             |
  |       | (default if ``BPF_UNPRIV_DEFAULT_OFF`` is enabled in kernel config)  |
  +-------+---------------------------------------------------------------------+