.. only:: not (epub or latex or html)

    WARNING: You are looking at unreleased Cilium documentation.
    Please use the official rendered version released here:
    https://docs.cilium.io

.. _bpf_program:

Program Types
=============

At the time of this writing, there are eighteen different BPF program types
available. Two of the main types for networking, XDP BPF programs and tc BPF
programs, are further explained in the subsections below. Extensive usage
examples for the two program types with LLVM, iproute2 or other tools are
spread throughout the toolchain section and are not covered here. Instead,
this section focuses on their architecture, concepts and use cases.

XDP
---

XDP stands for eXpress Data Path and provides a framework for BPF that enables
high-performance programmable packet processing in the Linux kernel. It runs
the BPF program at the earliest possible point in software, namely at the moment
the network driver receives the packet.

At this point in the fast-path the driver has just picked up the packet from its
receive rings, without having done any expensive operations such as allocating
an ``skb`` for pushing the packet further up the networking stack, and without
having pushed the packet into the GRO engine. Thus, the XDP BPF program is
executed at the earliest point when the packet becomes available to the CPU
for processing.

XDP works in concert with the Linux kernel and its infrastructure, meaning
the kernel is not bypassed as in various networking frameworks that operate
in user space only. Keeping the packet in kernel space has several major
advantages:

* XDP is able to reuse all the upstream developed kernel networking drivers,
  user space tooling, and other available in-kernel infrastructure such
  as routing tables or sockets from its BPF helper calls.
* Residing in kernel space, XDP has the same security model as the rest of
  the kernel for accessing hardware.
* There is no need for crossing kernel / user space boundaries since the
  processed packet already resides in the kernel and can therefore be flexibly
  forwarded into other in-kernel entities like namespaces used by
  containers or the kernel's networking stack itself. This is particularly
  relevant in times of Meltdown and Spectre.
* Punting packets from XDP to the kernel's robust, widely used and efficient
  TCP/IP stack is trivially possible and allows for full reuse; there is no
  need to maintain a separate TCP/IP stack as with user space frameworks.
* The use of BPF allows for full programmability while keeping a stable ABI
  with the same 'never-break-user-space' guarantees as the kernel's system
  call ABI. Compared to kernel modules, it also provides safety measures
  thanks to the BPF verifier which ensures the stability of the kernel's
  operation.
* XDP trivially allows for atomically swapping programs at runtime without
  any interruption of network traffic or kernel / system reboot.
* XDP allows for flexible structuring of workloads integrated into
  the kernel. For example, it can operate in "busy polling" or "interrupt
  driven" mode. Explicitly dedicating CPUs to XDP is not required. There
  are no special hardware requirements and it does not rely on hugepages.
* XDP does not require any third party kernel modules or licensing. It is
  a long-term architectural solution, a core part of the Linux kernel, and
  developed by the kernel community.
* XDP is already enabled and shipped everywhere with major distributions
  running a kernel equivalent to 4.8 or higher and supports most major 10G
  or higher networking drivers.

As a framework for running BPF in the driver, XDP additionally ensures that
packets are laid out linearly and fit into a single DMA'ed page which is
readable and writable by the BPF program. XDP also ensures that additional
headroom of 256 bytes is available to the program for implementing custom
encapsulation headers with the help of the ``bpf_xdp_adjust_head()`` BPF helper
or adding custom metadata in front of the packet through ``bpf_xdp_adjust_meta()``.

The framework contains XDP action codes, further described in the section
below, which a BPF program can return in order to instruct the driver how
to proceed with the packet, and it makes it possible to atomically replace
BPF programs running at the XDP layer. XDP is tailored for high performance
by design. BPF allows the packet data to be accessed through 'direct packet
access', which means that the program holds data pointers directly in
registers, loads packet content into registers and writes from registers
back into the packet.

The packet representation in XDP that is passed to the BPF program as
the BPF context looks as follows:

.. code-block:: c

    struct xdp_buff {
        void *data;
        void *data_end;
        void *data_meta;
        void *data_hard_start;
        struct xdp_rxq_info *rxq;
    };

``data`` points to the start of the packet data in the page, and as the
name suggests, ``data_end`` points to the end of the packet data. Since XDP
allows for a headroom, ``data_hard_start`` points to the maximum possible
headroom start in the page, meaning, when the packet should be encapsulated,
``data`` is moved closer towards ``data_hard_start`` via ``bpf_xdp_adjust_head()``.
The same BPF helper function also allows for decapsulation, in which case
``data`` is moved further away from ``data_hard_start``.

``data_meta`` initially points to the same location as ``data``, but
``bpf_xdp_adjust_meta()`` is able to move this pointer towards ``data_hard_start``
as well in order to provide room for custom metadata which is invisible to
the normal kernel networking stack but can be read by tc BPF programs since
it is transferred from XDP to the ``skb``. Vice versa, the same helper can
remove or reduce the size of the custom metadata by moving ``data_meta``
away from ``data_hard_start`` again. ``data_meta`` can also be used solely
for passing state between tail calls, similar to the ``skb->cb[]`` control
block that is accessible in tc BPF programs.

This gives the following relation, or invariant, for the ``struct xdp_buff``
packet pointers: ``data_hard_start`` <= ``data_meta`` <= ``data`` < ``data_end``.

The ``rxq`` field points to additional per receive queue metadata which
is populated at ring setup time (not at XDP runtime):

.. code-block:: c

    struct xdp_rxq_info {
        struct net_device *dev;
        u32 queue_index;
        u32 reg_state;
    } ____cacheline_aligned;

The BPF program can retrieve ``queue_index`` as well as additional data
from the netdevice itself such as ``ifindex``, etc.
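
To make the pointer handling more concrete, the following is a minimal sketch
(not taken from Cilium) of how a program reserves and fills such a metadata
area. Note that BPF programs access the buffer through the UAPI
``struct xdp_md`` context rather than ``struct xdp_buff`` directly; the
``struct meta_info`` layout and the stored value are made-up placeholders:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Hypothetical metadata layout, invisible to the regular stack but
     * readable later from a tc BPF program via data_meta.
     */
    struct meta_info {
        __u32 mark;
    };

    SEC("xdp")
    int xdp_meta_example(struct xdp_md *ctx)
    {
        struct meta_info *meta;
        void *data, *data_meta;

        /* Reserve space in front of the packet data; a negative delta
         * grows the metadata area towards data_hard_start.
         */
        if (bpf_xdp_adjust_meta(ctx, -(int)sizeof(*meta)))
            return XDP_PASS;

        /* Pointers must be re-read after any adjust helper call. */
        data      = (void *)(long)ctx->data;
        data_meta = (void *)(long)ctx->data_meta;

        /* Verifier-mandated bounds check against data. */
        meta = data_meta;
        if ((void *)(meta + 1) > data)
            return XDP_PASS;

        meta->mark = 0x2a;  /* arbitrary example value */
        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";

After a successful ``bpf_xdp_adjust_meta()`` call the invariant above still
holds, which is why the pointers are re-read and bounds checked before the
metadata is written.
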

**BPF program return codes**

After running the XDP BPF program, a verdict is returned from the program in
order to tell the driver how to process the packet next. In the ``linux/bpf.h``
system header file all available return verdicts are enumerated:

.. code-block:: c

    enum xdp_action {
        XDP_ABORTED = 0,
        XDP_DROP,
        XDP_PASS,
        XDP_TX,
        XDP_REDIRECT,
    };

``XDP_DROP`` as the name suggests will drop the packet right at the driver
level without wasting any further resources. This is in particular useful
for BPF programs implementing DDoS mitigation mechanisms or firewalling in
general. The ``XDP_PASS`` return code means that the packet is allowed to
be passed up to the kernel's networking stack, meaning the current CPU
that was processing this packet now allocates an ``skb``, populates it, and
passes it onwards into the GRO engine. This is equivalent to the default
packet handling behavior without XDP. With ``XDP_TX`` the BPF program has
an efficient option to transmit the network packet out of the same NIC it
just arrived on again. This is typically useful when a few nodes implement,
for example, firewalling with subsequent load balancing in a cluster and
thus act as a hairpinned load balancer pushing the incoming packets back
into the switch after rewriting them in XDP BPF. ``XDP_REDIRECT`` is similar
to ``XDP_TX`` in that it is able to transmit the XDP packet, but through
another NIC. Another option for the ``XDP_REDIRECT`` case is to redirect
into a BPF cpumap, meaning, the CPUs serving XDP on the NIC's receive queues
can continue to do so and push the packet for processing by the upper kernel
stack to a remote CPU. This is similar to ``XDP_PASS``, but with the ability
for the XDP BPF program to keep serving the incoming high load instead of
temporarily spending work on the current packet for pushing it into upper
layers. Last but not least, ``XDP_ABORTED`` denotes an exception-like state
from the program and has the same behavior as ``XDP_DROP``, except that
``XDP_ABORTED`` passes the ``trace_xdp_exception`` tracepoint which can be
additionally monitored to detect misbehavior.
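
As a rough illustration of how such a verdict is returned in practice, the
sketch below parses just enough of the packet to drop all UDP traffic at the
driver and pass everything else up the stack. The drop-all-UDP policy is only
an illustrative placeholder:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/ip.h>
    #include <linux/in.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("xdp")
    int xdp_drop_udp(struct xdp_md *ctx)
    {
        void *data     = (void *)(long)ctx->data;
        void *data_end = (void *)(long)ctx->data_end;
        struct ethhdr *eth = data;
        struct iphdr *iph;

        /* Bounds checks keep the verifier happy before each access. */
        if ((void *)(eth + 1) > data_end)
            return XDP_PASS;
        if (eth->h_proto != bpf_htons(ETH_P_IP))
            return XDP_PASS;

        iph = (void *)(eth + 1);
        if ((void *)(iph + 1) > data_end)
            return XDP_PASS;

        /* Example policy: drop all UDP at the driver, pass the rest. */
        if (iph->protocol == IPPROTO_UDP)
            return XDP_DROP;

        return XDP_PASS;
    }

    char _license[] SEC("license") = "GPL";
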

**Use cases for XDP**

Some of the main use cases for XDP are presented in this subsection. The
list is non-exhaustive and given the programmability and efficiency XDP
and BPF enables, it can easily be adapted to solve very specific use
cases.

* **DDoS mitigation, firewalling**

  One of the basic XDP BPF features is to tell the driver to drop a packet
  with ``XDP_DROP`` at this early stage which allows for any kind of efficient
  network policy enforcement with an extremely low per-packet cost. This is
  ideal in situations when needing to cope with any sort of DDoS attacks, but
  it also more generally allows implementing any sort of firewalling policies
  with close to no overhead in BPF, e.g. either as a standalone appliance
  (e.g. scrubbing 'clean' traffic through ``XDP_TX``) or widely deployed on
  nodes protecting end hosts themselves (via ``XDP_PASS`` or cpumap
  ``XDP_REDIRECT`` for good traffic). Offloaded XDP takes this even one step
  further by moving the already small per-packet cost entirely into the NIC
  with processing at line-rate.

..

* **Forwarding and load-balancing**

  Another major use case of XDP is packet forwarding and load-balancing
  through either ``XDP_TX`` or ``XDP_REDIRECT`` actions. The packet can
  be arbitrarily mangled by the BPF program running in the XDP layer;
  BPF helper functions are available for increasing or decreasing
  the packet's headroom in order to arbitrarily encapsulate or
  decapsulate the packet before sending it out again. With ``XDP_TX``
  hairpinned load-balancers can be implemented that push the packet out
  of the same networking device it originally arrived on, or with the
  ``XDP_REDIRECT`` action it can be forwarded to another NIC for
  transmission. The latter return code can also be used in combination
  with BPF's cpumap to load-balance packets for passing up the local
  stack, but on remote, non-XDP processing CPUs.

..

* **Pre-stack filtering / processing**

  Besides policy enforcement, XDP can also be used for hardening the
  kernel's networking stack with the help of the ``XDP_DROP`` case, meaning
  it can drop packets that are irrelevant for the local node right at the
  earliest possible point before the networking stack sees them, e.g. given
  we know that a node only serves TCP traffic, any UDP, SCTP or other L4
  traffic can be dropped right away. This has the advantage that packets
  do not need to traverse various entities like the GRO engine, the kernel's
  flow dissector and others before it can be determined that they should be
  dropped, which reduces the kernel's attack surface. Thanks to XDP's early
  processing stage, this effectively 'pretends' to the kernel's networking
  stack that these packets have never been seen by the networking device.
  Additionally, if a potential bug in the stack's receive path got uncovered
  and would cause a 'ping of death' like scenario, XDP can be utilized to
  drop such packets right away without having to reboot the kernel or
  restart any services. Due to the ability to atomically swap such programs
  to enforce a drop of bad packets, no network traffic is even interrupted
  on a host.

  Another use case for pre-stack processing is that, given the kernel has not
  yet allocated an ``skb`` for the packet, the BPF program is free to modify
  the packet and, again, have it 'pretend' to the stack that it was received
  by the networking device this way. This allows for cases such as having
  custom packet mangling and encapsulation protocols where the packet can be
  decapsulated prior to entering GRO aggregation, where GRO otherwise would
  not be able to perform any sort of aggregation due to not being aware of
  the custom protocol. XDP also allows metadata (non-packet data) to be pushed
  in front of the packet. This is 'invisible' to the normal kernel stack, can
  be GRO aggregated (for matching metadata) and later on processed in
  coordination with a tc ingress BPF program where it has the context of
  an ``skb`` available for e.g. setting various skb fields.

..

* **Flow sampling, monitoring**

  XDP can also be used for cases such as packet monitoring, sampling or any
  other network analytics, for example, as part of an intermediate node in
  the path or on end hosts, also in combination with the previously mentioned
  use cases.
  For complex packet analysis, XDP provides a facility to efficiently push
  network packets (truncated or with full payload) and custom metadata into
  a fast lockless per-CPU memory mapped ring buffer provided by the Linux
  perf infrastructure to a user space application. This also allows for
  cases where only a flow's initial data is analyzed and, once determined
  to be good traffic, the monitoring is bypassed. Thanks to the flexibility
  brought by BPF, this allows for implementing any sort of custom monitoring
  or sampling.

..

One example of XDP BPF production usage is Facebook's SHIV and Droplet
infrastructure which implement their L4 load-balancing and DDoS countermeasures.
Migrating their production infrastructure away from netfilter's IPVS
(IP Virtual Server) over to XDP BPF allowed for a 10x speedup compared
to their previous IPVS setup. This was first presented at the netdev 2.1
conference:

* Slides: https://netdevconf.info/2.1/slides/apr6/zhou-netdev-xdp-2017.pdf
* Video: https://youtu.be/YEU2ClcGqts

Another example is the integration of XDP into Cloudflare's DDoS mitigation
pipeline, which originally was using cBPF instead of eBPF for attack signature
matching through iptables' ``xt_bpf`` module. Due to the use of iptables this
caused severe performance problems under attack where a user space bypass
solution was deemed necessary, but it came with drawbacks as well such as
needing to busy poll the NIC and expensive packet re-injection into the
kernel's stack. The migration over to eBPF and XDP combined the best of both
worlds by having high-performance programmable packet processing directly
inside the kernel:

* Slides: https://netdevconf.info/2.1/slides/apr6/bertin_Netdev-XDP.pdf
* Video: https://youtu.be/7OuOukmuivg

**XDP operation modes**

XDP has three operation modes where 'native' XDP is the default mode. When
XDP is mentioned, this mode is typically implied.

* **Native XDP**

  This is the default mode where the XDP BPF program is run directly out
  of the networking driver's early receive path. Most widely used NICs
  for 10G and higher support native XDP already.

..

* **Offloaded XDP**

  In the offloaded XDP mode the XDP BPF program is directly offloaded into
  the NIC instead of being executed on the host CPU. Thus, the already
  extremely low per-packet cost is pushed off the host CPU entirely and
  executed on the NIC, providing even higher performance than running in
  native XDP. This offload is typically implemented by SmartNICs
  containing multi-threaded, multicore flow processors where an in-kernel
  JIT compiler translates BPF into native instructions for the latter.
  Drivers supporting offloaded XDP usually also support native XDP for
  cases where some BPF helpers are not yet available for the offload or
  are only available in native mode.

..

* **Generic XDP**

  For drivers not implementing native or offloaded XDP yet, the kernel
  provides an option for generic XDP which does not require any driver
  changes since it runs at a much later point out of the networking stack.
  This setting is primarily targeted at developers who want to write and
  test programs against the kernel's XDP API, and will not operate at the
  performance rate of the native or offloaded modes.
  For XDP usage in a production environment either the native or offloaded
  mode is better suited and the recommended way to run XDP.

.. _xdp_drivers:

**Driver support**

**Drivers supporting native XDP**

A list of drivers supporting native XDP can be found in the table below. The
corresponding network driver name of an interface can be determined as follows:

.. code-block:: shell-session

    # ethtool -i eth0
    driver: nfp
    [...]

+------------------------+-------------------------+-------------+
| Vendor                 | Driver                  | XDP Support |
+========================+=========================+=============+
| Amazon                 | ena                     | >= 5.6      |
+------------------------+-------------------------+-------------+
| Aquantia               | atlantic                | >= 5.19     |
+------------------------+-------------------------+-------------+
| Broadcom               | bnxt_en                 | >= 4.11     |
+------------------------+-------------------------+-------------+
| Cavium                 | thunderx                | >= 4.12     |
+------------------------+-------------------------+-------------+
| Engleder               | tsnep                   | >= 6.3      |
|                        | (TSN Express Path)      |             |
+------------------------+-------------------------+-------------+
| Freescale              | dpaa                    | >= 5.11     |
|                        +-------------------------+-------------+
|                        | dpaa2                   | >= 5.0      |
|                        +-------------------------+-------------+
|                        | enetc                   | >= 5.13     |
|                        +-------------------------+-------------+
|                        | fec_enet                | >= 6.2      |
+------------------------+-------------------------+-------------+
| Fungible               | fun                     | >= 5.18     |
+------------------------+-------------------------+-------------+
| Google                 | gve                     | >= 6.4      |
+------------------------+-------------------------+-------------+
| Intel                  | ice                     | >= 5.5      |
|                        +-------------------------+-------------+
|                        | igb                     | >= 5.10     |
|                        +-------------------------+-------------+
|                        | igc                     | >= 5.13     |
|                        +-------------------------+-------------+
|                        | i40e                    | >= 4.13     |
|                        +-------------------------+-------------+
|                        | ixgbe                   | >= 4.12     |
|                        +-------------------------+-------------+
|                        | ixgbevf                 | >= 4.17     |
+------------------------+-------------------------+-------------+
| Marvell                | mvneta                  | >= 5.5      |
|                        +-------------------------+-------------+
|                        | mvpp2                   | >= 5.9      |
|                        +-------------------------+-------------+
|                        | otx2                    | >= 5.16     |
+------------------------+-------------------------+-------------+
| Mediatek               | mtk                     | >= 6.0      |
+------------------------+-------------------------+-------------+
| Mellanox               | mlx4                    | >= 4.8      |
|                        +-------------------------+-------------+
|                        | mlx5                    | >= 4.9      |
+------------------------+-------------------------+-------------+
| Microchip              | lan966x                 | >= 6.2      |
+------------------------+-------------------------+-------------+
| Microsoft              | hv_netvsc (Hyper-V)     | >= 5.6      |
|                        +-------------------------+-------------+
|                        | mana                    | >= 5.17     |
+------------------------+-------------------------+-------------+
| Netronome              | nfp                     | >= 4.10     |
+------------------------+-------------------------+-------------+
| Others                 | bonding                 | >= 5.15     |
|                        +-------------------------+-------------+
|                        | netdevsim               | >= 4.16     |
|                        +-------------------------+-------------+
|                        | tun/tap                 | >= 4.14     |
|                        +-------------------------+-------------+
|                        | virtio_net              | >= 4.10     |
|                        +-------------------------+-------------+
|                        | xen-netfront            | >= 5.9      |
|                        +-------------------------+-------------+
|                        | veth                    | >= 4.19     |
+------------------------+-------------------------+-------------+
| QLogic                 | qede                    | >= 4.10     |
+------------------------+-------------------------+-------------+
| Socionext              | netsec                  | >= 5.3      |
+------------------------+-------------------------+-------------+
| Solarflare             | SFC Efx                 | >= 5.5      |
+------------------------+-------------------------+-------------+
| STMicro                | stmmac                  | >= 5.13     |
+------------------------+-------------------------+-------------+
| Texas Instruments      | cpsw                    | >= 5.3      |
+------------------------+-------------------------+-------------+
| VMware                 | vmxnet3                 | >= 6.6      |
+------------------------+-------------------------+-------------+

**Drivers supporting offloaded XDP**

* **Netronome**

  * nfp [2]_

.. note::

    Examples for writing and loading XDP programs are included in the
    `bpf_dev` section under the respective tools.

.. [2] Some BPF helper functions such as retrieving the current CPU number
   will not be available in an offloaded setting.

tc (traffic control)
--------------------

Aside from other program types such as XDP, BPF can also be used out of the
kernel's tc (traffic control) layer in the networking data path. On a high
level there are three major differences when comparing XDP BPF programs to
tc BPF ones:

* The BPF input context is an ``sk_buff``, not an ``xdp_buff``. When the
  kernel's networking stack receives a packet, after the XDP layer, it
  allocates a buffer and parses the packet to store metadata about the packet.
  This representation is known as the ``sk_buff``. This structure is then
  exposed in the BPF input context so that BPF programs from the tc ingress
  layer can use the metadata that the stack extracts from the packet. This can
  be useful, but comes with an associated cost of the stack performing this
  allocation and metadata extraction, and handling the packet until it hits
  the tc hook. By definition, the ``xdp_buff`` doesn't have access to this
  metadata because the XDP hook is called before this work is done. This is a
  significant contributor to the performance difference between the XDP and
  tc hooks.

  Therefore, BPF programs attached to the tc BPF hook can, for instance, read
  or write the skb's ``mark``, ``pkt_type``, ``protocol``, ``priority``,
  ``queue_mapping``, ``napi_id``, ``cb[]`` array, ``hash``, ``tc_classid`` or
  ``tc_index``, vlan metadata, the XDP transferred custom metadata and various
  other information (a short sketch of such access follows after this list).
  All members of the ``struct __sk_buff`` BPF context used in tc BPF are
  defined in the ``linux/bpf.h`` system header.

  Generally, the ``sk_buff`` is of a completely different nature than the
  ``xdp_buff``, and both come with advantages and disadvantages. For example,
  the ``sk_buff`` case has the advantage that it is rather straightforward to
  mangle its associated metadata, however, it also contains a lot of protocol
  specific information (e.g. GSO related state) which makes it difficult to
  simply switch protocols by solely rewriting the packet data. This is because
  the stack processes the packet based on the metadata rather than paying the
  cost of accessing the packet contents each time. Thus, additional conversion
  is required from BPF helper functions taking care that ``sk_buff`` internals
  are properly converted as well. The ``xdp_buff`` case however does not
  face such issues since it comes at such an early stage where the kernel
  has not even allocated an ``sk_buff`` yet, thus packet rewrites of any
  kind can be realized trivially. However, the ``xdp_buff`` case has the
  disadvantage that ``sk_buff`` metadata is not available for mangling
  at this stage. This can be overcome by passing custom metadata from
  XDP BPF to tc BPF. In this way, the limitations of each program
  type can be overcome by operating complementary programs of both types
  as the use case requires.

..

* Compared to XDP, tc BPF programs can be triggered out of ingress and also
  egress points in the networking data path as opposed to ingress only in
  the case of XDP.

  The two hook points ``sch_handle_ingress()`` and ``sch_handle_egress()`` in
  the kernel are triggered out of ``__netif_receive_skb_core()`` and
  ``__dev_queue_xmit()``, respectively. The latter two are the main receive
  and transmit functions in the data path that, setting XDP aside, are
  triggered for every network packet going in or coming out of the node,
  allowing for full visibility for tc BPF programs at these hook points.

..

* The tc BPF programs do not require any driver changes since they are run
  at hook points in generic layers in the networking stack. Therefore, they
  can be attached to any type of networking device.

  While this provides flexibility, it also trades off performance compared
  to running at the native XDP layer. However, tc BPF programs still come
  at the earliest point in the generic kernel's networking data path after
  GRO has been run but **before** any protocol processing, traditional
  iptables firewalling such as iptables PREROUTING or nftables ingress hooks
  or other packet processing takes place. Likewise on egress, tc BPF programs
  execute at the latest point before handing the packet to the driver itself
  for transmission, meaning **after** traditional iptables firewalling hooks
  like iptables POSTROUTING, but still before handing the packet to the
  kernel's GSO engine.

  One exception which does require driver changes, however, is offloaded tc
  BPF programs, typically provided by SmartNICs in a similar way as offloaded
  XDP just with a differing set of features due to the differences in the BPF
  input context, helper functions and verdict codes.

..
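
As a small illustration of the first point, the sketch below reads the custom
metadata area that the earlier XDP example prepended and copies it into
``skb->mark``. The ``struct meta_info`` layout is the same hypothetical one
as before, and ``SEC("tc")`` assumes a recent libbpf:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    /* Must match the hypothetical layout the XDP program prepended. */
    struct meta_info {
        __u32 mark;
    };

    SEC("tc")
    int tc_read_meta(struct __sk_buff *skb)
    {
        void *data      = (void *)(long)skb->data;
        void *data_meta = (void *)(long)skb->data_meta;
        struct meta_info *meta = data_meta;

        /* The metadata area is empty unless an XDP program reserved one. */
        if ((void *)(meta + 1) > data)
            return TC_ACT_OK;

        /* Transfer the XDP-provided value into skb->mark, one of the
         * __sk_buff members writable from tc BPF programs.
         */
        skb->mark = meta->mark;
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
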

BPF programs run in the tc layer are run from the ``cls_bpf`` classifier.
While the tc terminology describes the BPF attachment point as a "classifier",
this is a bit misleading since it under-represents what ``cls_bpf`` is
capable of, that is to say, a fully programmable packet processor which can
not only read the ``skb`` metadata and packet data, but also arbitrarily
mangle both, and terminate the tc processing with an action verdict.
``cls_bpf`` can thus be regarded as a self-contained entity that manages and
executes tc BPF programs.

``cls_bpf`` can hold one or more tc BPF programs. In the case where Cilium
deploys ``cls_bpf`` programs, it attaches only a single program for a given hook
in ``direct-action`` mode. Typically, in the traditional tc scheme, there is a
split between classifier and action modules, where the classifier has one
or more actions attached to it that are triggered once the classifier has a
match.
In the modern world of using tc in the software data path this model
does not scale well for complex packet processing. Given tc BPF programs
attached to ``cls_bpf`` are fully self-contained, they effectively fuse the
parsing and action process together into a single unit. Thanks to ``cls_bpf``'s
``direct-action`` mode, ``cls_bpf`` will just return the tc action verdict and
terminate the processing pipeline immediately. This allows for implementing
scalable programmable packet processing in the networking data path by avoiding
linear iteration of actions. ``cls_bpf`` is the only such "classifier" module
in the tc layer capable of such a fast-path.

Like XDP BPF programs, tc BPF programs can be atomically updated at runtime
via ``cls_bpf`` without interrupting any network traffic or having to restart
services.

Both the tc ingress and the egress hook where ``cls_bpf`` itself can be
attached to are managed by a pseudo qdisc called ``sch_clsact``. This is a
drop-in replacement and proper superset of the ingress qdisc since it
is able to manage both the ingress and egress tc hooks. For tc's egress hook
in ``__dev_queue_xmit()`` it is important to stress that it is not executed
under the kernel's qdisc root lock. Thus, both tc ingress and egress hooks
are executed in a lockless manner in the fast-path. In either case, preemption
is disabled and execution happens under RCU read side.

Typically on egress there are qdiscs attached to netdevices such as ``sch_mq``,
``sch_fq``, ``sch_fq_codel`` or ``sch_htb``, some of which are classful
qdiscs that contain subclasses and thus require a packet classification
mechanism to determine a verdict on where to demux the packet. This is handled
by a call to ``tcf_classify()`` which calls into tc classifiers if present.
``cls_bpf`` can also be attached and used in such cases. Such operation usually
happens under the qdisc root lock and can be subject to lock contention. The
``sch_clsact`` qdisc's egress hook comes at a much earlier point, however, which
does not fall under that lock and operates completely independently from
conventional egress qdiscs. Thus for cases like ``sch_htb`` the ``sch_clsact``
qdisc could perform the heavy lifting packet classification through tc BPF
outside of the qdisc root lock, setting the ``skb->mark`` or ``skb->priority``
from there such that ``sch_htb`` only requires a flat mapping without expensive
packet classification under the root lock, thus reducing contention.
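
A minimal sketch of this pattern could look as follows; the classid values
``0x10001`` and ``0x10002`` are placeholders assuming an ``sch_htb`` setup
with classes ``1:1`` and ``1:2``:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/if_ether.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>
    #include <bpf/bpf_endian.h>

    SEC("tc")
    int tc_egress_classify(struct __sk_buff *skb)
    {
        /* Heavy classification would happen here, outside of the qdisc
         * root lock. The result is boiled down to a classid stored in
         * skb->priority so that sch_htb later only needs a flat lookup.
         */
        if (skb->protocol == bpf_htons(ETH_P_IP))
            skb->priority = 0x10001;    /* hypothetical htb class 1:1 */
        else
            skb->priority = 0x10002;    /* hypothetical htb class 1:2 */

        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
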

Offloaded tc BPF programs are supported for the case of ``sch_clsact`` in
combination with ``cls_bpf`` where the previously loaded BPF program was JITed
by a SmartNIC driver to be run natively on the NIC. Only ``cls_bpf``
programs operating in ``direct-action`` mode are supported to be offloaded.
``cls_bpf`` only supports offloading a single program and cannot offload
multiple programs. Furthermore only the ingress hook supports offloading
BPF programs.

One ``cls_bpf`` instance is able to hold multiple tc BPF programs internally.
If this is the case, then the ``TC_ACT_UNSPEC`` program return code will
continue execution with the next tc BPF program in that list. However, this
has the drawback that several programs would need to parse the packet over
and over again, resulting in degraded performance.

**BPF program return codes**

Both the tc ingress and egress hooks share the same action return verdicts
that tc BPF programs can use. They are defined in the ``linux/pkt_cls.h``
system header:

.. code-block:: c

    #define TC_ACT_UNSPEC         (-1)
    #define TC_ACT_OK               0
    #define TC_ACT_SHOT             2
    #define TC_ACT_STOLEN           4
    #define TC_ACT_REDIRECT         7

There are a few more action ``TC_ACT_*`` verdicts available in the system
header file which are also used in the two hooks. However, they share the
same semantics with the ones above. Meaning, from a tc BPF perspective,
``TC_ACT_OK`` and ``TC_ACT_RECLASSIFY`` have the same semantics, as do the
three ``TC_ACT_STOLEN``, ``TC_ACT_QUEUED`` and ``TC_ACT_TRAP`` opcodes.
Therefore, for these cases we only describe ``TC_ACT_OK`` and the
``TC_ACT_STOLEN`` opcode for the two groups.

Starting out with ``TC_ACT_UNSPEC``: it has the meaning of "unspecified action"
and is used in three cases, i) when an offloaded tc BPF program is attached
and the tc ingress hook is run where the ``cls_bpf`` representation for the
offloaded program will return ``TC_ACT_UNSPEC``, ii) in order to continue
with the next tc BPF program in ``cls_bpf`` for the multi-program case. The
latter also works in combination with offloaded tc BPF programs from point i)
where the ``TC_ACT_UNSPEC`` from there continues with a next tc BPF program
that runs solely in the non-offloaded case. Last but not least, iii)
``TC_ACT_UNSPEC`` is also used for the single program case to simply tell the
kernel to continue with the ``skb`` without additional side-effects.
``TC_ACT_UNSPEC`` is very similar to the ``TC_ACT_OK`` action code in the
sense that both pass the ``skb`` onwards, either to upper layers of the stack
on ingress or down to the networking device driver for transmission on egress,
respectively. The only difference to ``TC_ACT_OK`` is that ``TC_ACT_OK`` sets
``skb->tc_index`` based on the classid the tc BPF program set. The latter is
set out of the tc BPF program itself through ``skb->tc_classid`` from the BPF
context.

``TC_ACT_SHOT`` instructs the kernel to drop the packet, meaning, upper
layers of the networking stack will never see the ``skb`` on ingress and
similarly the packet will never be submitted for transmission on egress.
``TC_ACT_SHOT`` and ``TC_ACT_STOLEN`` are both similar in nature with few
differences: ``TC_ACT_SHOT`` will indicate to the kernel that the ``skb``
was released through ``kfree_skb()`` and return ``NET_XMIT_DROP`` to the
callers for immediate feedback, whereas ``TC_ACT_STOLEN`` will release
the ``skb`` through ``consume_skb()`` and pretend to upper layers that
the transmission was successful through ``NET_XMIT_SUCCESS``. The perf
drop monitor which records traces of ``kfree_skb()`` will therefore
also not see any drop indications from ``TC_ACT_STOLEN`` since its
semantics are such that the ``skb`` has been "consumed" or queued but
certainly not "dropped".

Last but not least there is the ``TC_ACT_REDIRECT`` action which is available
for tc BPF programs as well. This allows the ``skb`` to be redirected to the
same or another device's ingress or egress path together with the
``bpf_redirect()`` helper. Being able to inject the packet into another
device's ingress or egress direction allows for full flexibility in packet
forwarding with BPF. There are no requirements on the target networking device
other than being a networking device itself; there is no need to run another
instance of ``cls_bpf`` on the target device or other such restrictions.
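
The following sketch shows a typical use of this verdict from a tc BPF
program; the target ifindex is a made-up constant which in practice would
usually be resolved from a BPF map or configuration:

.. code-block:: c

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* Placeholder ifindex of the device to forward to. */
    #define TARGET_IFINDEX    4

    SEC("tc")
    int tc_redirect_example(struct __sk_buff *skb)
    {
        /* bpf_redirect() records the target and returns TC_ACT_REDIRECT;
         * passing BPF_F_INGRESS instead of 0 would inject the packet
         * into the target device's ingress path instead of its egress.
         */
        return bpf_redirect(TARGET_IFINDEX, 0);
    }

    char _license[] SEC("license") = "GPL";
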

**tc BPF FAQ**

This section contains a few miscellaneous questions and answers related to
tc BPF programs that are asked from time to time.

* **Question:** What about ``act_bpf`` as a tc action module, is it still relevant?
* **Answer:** Not really. Although ``cls_bpf`` and ``act_bpf`` share the same
  functionality for tc BPF programs, ``cls_bpf`` is more flexible since it is a
  proper superset of ``act_bpf``. The way tc works is that tc actions need to be
  attached to tc classifiers. In order to achieve the same flexibility as ``cls_bpf``,
  ``act_bpf`` would need to be attached to the ``cls_matchall`` classifier. As the
  name says, this will match on every packet in order to pass them through for attached
  tc action processing. For ``act_bpf``, this will result in less efficient packet
  processing than using ``cls_bpf`` in ``direct-action`` mode directly. If ``act_bpf``
  is used in a setting with other classifiers than ``cls_bpf`` or ``cls_matchall``
  then this will perform even worse due to the nature of operation of tc classifiers.
  Meaning, if classifier A has a mismatch, then the packet is passed to classifier
  B, reparsing the packet, etc, thus in the typical case there will be linear
  processing where the packet would need to traverse N classifiers in the worst
  case to find a match and execute ``act_bpf`` on that. Therefore, ``act_bpf`` has
  never been largely relevant. Additionally, ``act_bpf`` does not provide a tc
  offloading interface either, compared to ``cls_bpf``.

..

* **Question:** Is it recommended to use ``cls_bpf`` not in ``direct-action`` mode?
* **Answer:** No. The answer is similar to the one above in that this is otherwise
  unable to scale for more complex processing. tc BPF can already do everything needed
  by itself in an efficient manner and thus there is no need for anything other than
  ``direct-action`` mode.

..

* **Question:** Is there any performance difference between offloaded ``cls_bpf``
  and offloaded XDP?
* **Answer:** No. Both are JITed through the same compiler in the kernel which
  handles the offloading to the SmartNIC and the loading mechanism for both is
  very similar as well. Thus, the BPF program gets translated into the same target
  instruction set in order to be able to run on the NIC natively. The two tc BPF
  and XDP BPF program types have a differing set of features, so depending on the
  use case one might be picked over the other due to availability of certain helper
  functions in the offload case, for example.

**Use cases for tc BPF**

Some of the main use cases for tc BPF programs are presented in this subsection.
Also here, the list is non-exhaustive and given the programmability and efficiency
of tc BPF, it can easily be tailored and integrated into orchestration systems
in order to solve very specific use cases. While some use cases with XDP may overlap,
tc BPF and XDP BPF are mostly complementary to each other and both can also be
used at the same time or one over the other depending on which is most suitable
for a given problem to solve.

* **Policy enforcement for containers**

  One application which tc BPF programs are suitable for is to implement policy
  enforcement, custom firewalling or similar security measures for containers or
  pods, respectively. In the conventional case, container isolation is implemented
  through network namespaces with veth networking devices connecting the host's
  initial namespace with the dedicated container's namespace. Since one end of
  the veth pair has been moved into the container's namespace whereas the other
  end remains in the initial namespace of the host, all network traffic from the
  container has to pass through the host-facing veth device, allowing for attaching
  tc BPF programs on the tc ingress and egress hook of the veth. Network traffic
  going into the container will pass through the host-facing veth's tc egress
  hook whereas network traffic coming from the container will pass through the
  host-facing veth's tc ingress hook.

  For virtual devices like veth devices XDP is unsuitable in this case since the
  kernel operates solely on an ``skb`` here and generic XDP has a few limitations
  where it does not operate with cloned ``skb``'s. Cloned ``skb``'s are heavily
  used by the TCP/IP stack in order to hold data segments for retransmission,
  and the generic XDP hook would simply be bypassed for them. Moreover, generic
  XDP needs to linearize the entire ``skb`` resulting in heavily degraded
  performance. tc BPF on the other hand is more flexible as it specializes in
  the ``skb`` input context case and thus does not need to cope with the
  limitations of generic XDP.

..

* **Forwarding and load-balancing**

  The forwarding and load-balancing use case is quite similar to XDP, although
  slightly more targeted towards east-west container workloads rather than
  north-south traffic (though both technologies can be used in either case).
  Since XDP is only available on the ingress side, tc BPF programs allow for
  further use cases that apply in particular on egress, for example, container
  based traffic can already be NATed and load-balanced on the egress side
  through BPF out of the initial namespace such that this is done transparently
  to the container itself. Egress traffic is already based on the ``sk_buff``
  structure due to the nature of the kernel's networking stack, so packet
  rewrites and redirects are suitable out of tc BPF. By utilizing the
  ``bpf_redirect()`` helper function, BPF can take over the forwarding logic
  to push the packet either into the ingress or egress path of another
  networking device. Thus, any bridge-like devices become unnecessary to use
  as well by utilizing tc BPF as forwarding fabric.

..

* **Flow sampling, monitoring**

  Like in the XDP case, flow sampling and monitoring can be realized through a
  high-performance lockless per-CPU memory mapped perf ring buffer where the
  BPF program is able to push custom data, the full or truncated packet
  contents, or both up to a user space application. From the tc BPF program
  this is realized through the ``bpf_skb_event_output()`` BPF helper function
  which has the same function signature and semantics as
  ``bpf_xdp_event_output()`` (a sketch follows after this list).
  Given tc BPF programs can be attached to ingress and egress as opposed to
  only ingress in the XDP BPF case, plus the two tc hooks are at the lowest
  layer in the (generic) networking stack, this allows for bidirectional
  monitoring of all network traffic from a particular node. This might be
  somewhat related to the cBPF case which tcpdump and Wireshark make use of,
  though without having to clone the ``skb`` and while being a lot more
  flexible in terms of programmability where, for example, BPF can already
  perform in-kernel aggregation rather than pushing everything up to user
  space, as well as add custom annotations for packets pushed into the ring
  buffer. The latter is also heavily used in Cilium where packet drops can be
  further annotated to correlate container labels and reasons for why a given
  packet had to be dropped (such as due to policy violation) in order to
  provide a richer context.

..

* **Packet scheduler pre-processing**

  The ``sch_clsact``'s egress hook, which is called ``sch_handle_egress()``,
  runs right before taking the kernel's qdisc root lock, thus tc BPF programs
  can be utilized to perform all the heavy lifting packet classification
  and mangling before the packet is transmitted into a real full blown
  qdisc such as ``sch_htb``. This type of interaction of ``sch_clsact``
  with a real qdisc like ``sch_htb`` coming later in the transmission phase
  allows to reduce the lock contention on transmission since ``sch_clsact``'s
  egress hook is executed without taking locks.

..
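
A sketch of the flow sampling pattern from the tc side is shown below. From C
programs the event output helper is invoked as ``bpf_perf_event_output()``;
the map name, the metadata layout and the 64 byte sample cap are arbitrary
choices for illustration:

.. code-block:: c

    #include <linux/bpf.h>
    #include <linux/pkt_cls.h>
    #include <bpf/bpf_helpers.h>

    /* Per-CPU perf ring buffer that a user space reader consumes. */
    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
    } events SEC(".maps");

    /* Hypothetical custom annotation pushed in front of each sample. */
    struct sample_meta {
        __u32 ifindex;
        __u32 pkt_len;
    };

    SEC("tc")
    int tc_sample(struct __sk_buff *skb)
    {
        struct sample_meta meta = {
            .ifindex = skb->ifindex,
            .pkt_len = skb->len,
        };
        /* The upper 32 bits of the flags select how many packet bytes
         * to append after the metadata; cap the sample at 64 bytes.
         */
        __u32 cap = skb->len < 64 ? skb->len : 64;
        __u64 flags = BPF_F_CURRENT_CPU | ((__u64)cap << 32);

        bpf_perf_event_output(skb, &events, flags, &meta, sizeof(meta));
        return TC_ACT_OK;
    }

    char _license[] SEC("license") = "GPL";
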

One concrete example user of tc BPF but also XDP BPF programs is Cilium.
Cilium is open source software for transparently securing the network
connectivity between application services deployed using Linux container
management platforms like Docker and Kubernetes and operates at Layer 3/4
as well as Layer 7. At the heart of Cilium operates BPF in order to
implement the policy enforcement as well as load balancing and monitoring.

* Slides: https://www.slideshare.net/ThomasGraf5/dockercon-2017-cilium-network-and-application-security-with-bpf-and-xdp
* Video: https://youtu.be/ilKlmTDdFgk
* Github: https://github.com/cilium/cilium

**Driver support**

Since tc BPF programs are triggered from the kernel's networking stack
and not directly out of the driver, they do not require any extra driver
modification and therefore can run on any networking device. The only
exception listed below is for offloading tc BPF programs to the NIC.

**Drivers supporting offloaded tc BPF**

* **Netronome**

  * nfp [2]_

.. note::

    Examples for writing and loading tc BPF programs are included in the
    `bpf_dev` section under the respective tools.