.. SPDX-License-Identifier: GPL-2.0

.. _networking-filter:

=======================================================
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Although there are some distinct differences between BSD and Linux kernel
filtering, when we speak of BPF or LSF in the Linux context, we mean the
very same mechanism of filtering in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code,
send it to the kernel via the SO_ATTACH_FILTER option and, if your filter
code passes the kernel check on it, you then immediately begin filtering
data on that socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much, since when you close a socket
that has a filter on it the filter is automagically removed. The other
less common case may be adding a different filter on the same socket where
you had another filter that is still running: the kernel takes care of
removing the old one and placing your new one in its place, assuming your
filter has passed the checks. If it fails the checks, the old filter will
remain on that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
such as team driver, PTP code, etc. where BPF is being used.

.. [1] Documentation/userspace-api/seccomp_filter.rst

Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures::

    struct sock_filter {    /* Filter block */
        __u16   code;   /* Actual filter code */
        __u8    jt;     /* Jump true */
        __u8    jf;     /* Jump false */
        __u32   k;      /* Generic multiuse field */
    };

Such a structure is assembled as an array of 4-tuples, that contains
a code, jt, jf and k value. jt and jf are jump offsets and k a generic
value to be used for a provided code::

    struct sock_fprog {                     /* Required for SO_ATTACH_FILTER. */
        unsigned short             len;     /* Number of filter blocks */
        struct sock_filter __user *filter;
    };

For socket filtering, a pointer to this structure (as shown in
follow-up example) is being passed to the kernel through setsockopt(2).

Example
-------

::

    #include <sys/socket.h>
    #include <sys/types.h>
    #include <arpa/inet.h>
    #include <unistd.h>             /* close() */
    #include <linux/if_ether.h>
    #include <linux/filter.h>       /* struct sock_filter, struct sock_fprog */
    /* ... */

    /* ARRAY_SIZE() is a kernel macro; user space has to define it itself. */
    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    /* From the example above: tcpdump -i em1 port 22 -dd */
    struct sock_filter code[] = {
        { 0x28,  0,  0, 0x0000000c },
        { 0x15,  0,  8, 0x000086dd },
        { 0x30,  0,  0, 0x00000014 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0, 17, 0x00000011 },
        { 0x28,  0,  0, 0x00000036 },
        { 0x15, 14,  0, 0x00000016 },
        { 0x28,  0,  0, 0x00000038 },
        { 0x15, 12, 13, 0x00000016 },
        { 0x15,  0, 12, 0x00000800 },
        { 0x30,  0,  0, 0x00000017 },
        { 0x15,  2,  0, 0x00000084 },
        { 0x15,  1,  0, 0x00000006 },
        { 0x15,  0,  8, 0x00000011 },
        { 0x28,  0,  0, 0x00000014 },
        { 0x45,  6,  0, 0x00001fff },
        { 0xb1,  0,  0, 0x0000000e },
        { 0x48,  0,  0, 0x0000000e },
        { 0x15,  2,  0, 0x00000016 },
        { 0x48,  0,  0, 0x00000010 },
        { 0x15,  0,  1, 0x00000016 },
        { 0x06,  0,  0, 0x0000ffff },
        { 0x06,  0,  0, 0x00000000 },
    };

    struct sock_fprog bpf = {
        .len = ARRAY_SIZE(code),
        .filter = code,
    };

    sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0)
        /* ... bail out ... */

    ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
    if (ret < 0)
        /* ... bail out ... */

    /* ... */
    close(sock);

The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments,
and SO_LOCK_FILTER, which prevents the filter from being detached, takes
an integer value of 0 or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

* setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
* setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
* setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Writing a filter "by hand" can be an alternative when i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux extensions
that are not supported by libpcap's compiler, iii) a filter is too complex
to be cleanly implemented with libpcap's compiler, or iv) particular filter
codes should be optimized differently than libpcap's internal compiler
does. For example, xt_bpf and cls_bpf users might have requirements that
could result in more complex filter code, or one that cannot be expressed
with libpcap (e.g. different return codes for various code paths).
Moreover, BPF JIT implementors may wish to manually write test cases and
thus need low-level access to BPF code as well.

BPF engine and instruction set
------------------------------

Under tools/bpf/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for example scenarios mentioned in the
previous section. Asm-like syntax mentioned here has been implemented in
bpf_asm and will be used for further explanations (instead of dealing with
less readable opcodes directly, principles are the same). The syntax is
closely modelled after Steven McCanne's and Van Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  =======   ====================================================
  Element   Description
  =======   ====================================================
  A         32 bit wide accumulator
  X         32 bit wide X register
  M[]       16 x 32 bit wide misc registers aka "scratch memory
            store", addressable from 0 to 15
  =======   ====================================================

A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned)::

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.

The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes, as defined in linux/filter.h, stand for:

  ===========      ===================  =====================
  Instruction      Addressing mode      Description
  ===========      ===================  =====================
  ld               1, 2, 3, 4, 12       Load word into A
  ldi              4                    Load word into A
  ldh              1, 2                 Load half-word into A
  ldb              1, 2                 Load byte into A
  ldx              3, 4, 5, 12          Load word into X
  ldxi             4                    Load word into X
  ldxb             5                    Load byte into X

  st               3                    Store A into M[]
  stx              3                    Store X into M[]

  jmp              6                    Jump to label
  ja               6                    Jump to label
  jeq              7, 8, 9, 10          Jump on A == <x>
  jneq             9, 10                Jump on A != <x>
  jne              9, 10                Jump on A != <x>
  jlt              9, 10                Jump on A <  <x>
  jle              9, 10                Jump on A <= <x>
  jgt              7, 8, 9, 10          Jump on A >  <x>
  jge              7, 8, 9, 10          Jump on A >= <x>
  jset             7, 8, 9, 10          Jump on A &  <x>

  add              0, 4                 A + <x>
  sub              0, 4                 A - <x>
  mul              0, 4                 A * <x>
  div              0, 4                 A / <x>
  mod              0, 4                 A % <x>
  neg                                   -A
  and              0, 4                 A & <x>
  or               0, 4                 A | <x>
  xor              0, 4                 A ^ <x>
  lsh              0, 4                 A << <x>
  rsh              0, 4                 A >> <x>

  tax                                   Copy A into X
  txa                                   Copy X into A

  ret              4, 11                Return
  ===========      ===================  =====================

The next table shows addressing formats from the 2nd column:

  ===============  ===================  ===============================================
  Addressing mode  Syntax               Description
  ===============  ===================  ===============================================
  0                x/%x                 Register X
  1                [k]                  BHW at byte offset k in the packet
  2                [x + k]              BHW at the offset X + k in the packet
  3                M[k]                 Word at offset k in M[]
  4                #k                   Literal value stored in k
  5                4*([k]&0xf)          Lower nibble * 4 at byte offset k in the packet
  6                L                    Jump label L
  7                #k,Lt,Lf             Jump to Lt if true, otherwise jump to Lf
  8                x/%x,Lt,Lf           Jump to Lt if true, otherwise jump to Lf
  9                #k,Lt                Jump to Lt if predicate is true
  10               x/%x,Lt              Jump to Lt if predicate is true
  11               a/%a                 Accumulator A
  12               extension            BPF extension
  ===============  ===================  ===============================================
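
To make the roles of A, jt, jf and k more concrete, the following is a
minimal user-space sketch of an interpreter for just four instruction forms
(``ldh [k]``, ``ldb [k]``, ``jeq #k,Lt,Lf`` and ``ret #k``). It is an
illustration only, not the kernel's implementation: the names cbpf_insn,
cbpf_run, arp_prog, arp_frame and ipv4_frame are hypothetical, and the
frames are made-up 14-byte Ethernet headers. The opcode values are the
ones that appear in the tcpdump -dd output shown earlier.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for struct sock_filter: op:16, jt:8, jf:8, k:32. */
struct cbpf_insn {
    uint16_t code;
    uint8_t  jt;
    uint8_t  jf;
    uint32_t k;
};

/* Opcode values as seen in the tcpdump -dd output. */
enum {
    OP_LDH_ABS = 0x28,  /* ldh [k]        (addressing mode 1) */
    OP_LDB_ABS = 0x30,  /* ldb [k]        (addressing mode 1) */
    OP_JEQ_K   = 0x15,  /* jeq #k,Lt,Lf   (addressing mode 7) */
    OP_RET_K   = 0x06,  /* ret #k         (addressing mode 4) */
};

/*
 * Run prog over pkt and return the value of the executed ret instruction.
 * In this sketch an out-of-bounds load, an unknown opcode, or running off
 * the end of the program all return 0 (drop).
 */
uint32_t cbpf_run(const struct cbpf_insn *prog, size_t len,
                  const uint8_t *pkt, size_t pkt_len)
{
    uint32_t A = 0;     /* accumulator */
    size_t pc = 0;      /* index of the next instruction */

    while (pc < len) {
        const struct cbpf_insn *in = &prog[pc++];

        switch (in->code) {
        case OP_LDH_ABS:
            if (pkt_len < 2 || in->k > pkt_len - 2)
                return 0;
            /* Half-word in network byte order. */
            A = ((uint32_t)pkt[in->k] << 8) | pkt[in->k + 1];
            break;
        case OP_LDB_ABS:
            if (in->k >= pkt_len)
                return 0;
            A = pkt[in->k];
            break;
        case OP_JEQ_K:
            /* jt and jf are offsets relative to the next instruction. */
            pc += (A == in->k) ? in->jt : in->jf;
            break;
        case OP_RET_K:
            return in->k;
        default:
            return 0;
        }
    }
    return 0;
}

/* ldh [12]; jeq #0x806, +0, +1; ret #-1; ret #0 */
const struct cbpf_insn arp_prog[] = {
    { OP_LDH_ABS, 0, 0, 12 },
    { OP_JEQ_K,   0, 1, 0x0806 },
    { OP_RET_K,   0, 0, 0xffffffff },
    { OP_RET_K,   0, 0, 0 },
};

/* Only the EtherType at byte offset 12 matters for this program. */
const uint8_t arp_frame[14]  = { [12] = 0x08, [13] = 0x06 };
const uint8_t ipv4_frame[14] = { [12] = 0x08, [13] = 0x00 };
```

Calling cbpf_run() with arp_prog on arp_frame reaches the first ret and
yields 0xffffffff (accept), while ipv4_frame takes the jf offset and
yields 0 (drop).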

The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such BPF
extensions is loaded into A.

Possible BPF extensions are shown in the following table:

  ===================================   =================================================
  Extension                             Description
  ===================================   =================================================
  len                                   skb->len
  proto                                 skb->protocol
  type                                  skb->pkt_type
  poff                                  Payload start offset
  ifidx                                 skb->dev->ifindex
  nla                                   Netlink attribute of type X with offset A
  nlan                                  Nested Netlink attribute of type X with offset A
  mark                                  skb->mark
  queue                                 skb->queue_mapping
  hatype                                skb->dev->type
  rxhash                                skb->hash
  cpu                                   raw_smp_processor_id()
  vlan_tci                              skb_vlan_tag_get(skb)
  vlan_avail                            skb_vlan_tag_present(skb)
  vlan_tpid                             skb->vlan_proto
  rand                                  prandom_u32()
  ===================================   =================================================
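
As a sketch of how such an "overloaded" k looks in C, the BPF_STMT() and
BPF_JUMP() initializer macros and the SKF_AD_* constants from
<linux/filter.h> can express a ``ld vlan_tci`` load followed by a
comparison against 10. The array name vlan10 is hypothetical and the filter
is only an illustration of the encoding, not a recommended VLAN match.

```c
#include <linux/filter.h>

/*
 *          ld vlan_tci
 *          jeq #10, accept, drop
 * accept:  ret #-1
 * drop:    ret #0
 */
struct sock_filter vlan10[] = {
    /* Extension load: k is the negative base SKF_AD_OFF plus the
     * extension offset, used with an absolute word load. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, SKF_AD_OFF + SKF_AD_VLAN_TAG),
    /* jt = 0 falls through to accept, jf = 1 skips to drop. */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 10, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
    BPF_STMT(BPF_RET | BPF_K, 0),
};
```

Because k is an unsigned 32-bit field, the negative sum is stored as a
large unsigned value; reading it back as a signed integer gives the
extension offset relative to SKF_AD_OFF again.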

These extensions can also be prefixed with '#'.
Examples for low-level BPF:

**ARP packets**::

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

**IPv4 TCP packets**::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

**(Accelerated) VLAN w/ id 10**::

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

**icmp random packet sampling, 1 in 4**::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0

**SECCOMP filter example**::

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes, output that xt_bpf
and cls_bpf understand and that can be loaded directly. Example with above
ARP code::

    $ ./bpf_asm foo
    4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy and paste C-like output::

    $ ./bpf_asm -c foo
    { 0x28,  0,  0, 0x0000000c },
    { 0x15,  0,  1, 0x00000806 },
    { 0x06,  0,  0, 0xffffffff },
    { 0x06,  0,  0, 0000000000 },

In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets and to do BPF machine register dumps.

Starting bpf_dbg is trivial and just requires issuing::

    # ./bpf_dbg

In case input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0

  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

* load pcap foo.pcap

  Loads standard tcpdump pcap file.

* run [<n>]

  bpf passes:1 fails:9

  Runs through all packets from a pcap to count how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

* disassemble::

    l0:	ldh [12]
    l1:	jeq #0x800, l2, l5
    l2:	ldb [23]
    l3:	jeq #0x1, l4, l5
    l4:	ret #0xffff
    l5:	ret #0

  Prints out BPF code disassembly.

* dump::

    /* { op, jt, jf, k }, */
    { 0x28,  0,  0, 0x0000000c },
    { 0x15,  0,  3, 0x00000800 },
    { 0x30,  0,  0, 0x00000017 },
    { 0x15,  0,  1, 0x00000001 },
    { 0x06,  0,  0, 0x0000ffff },
    { 0x06,  0,  0, 0000000000 },

  Prints out C-style BPF code dump.

* breakpoint 0::

    breakpoint at: l0:	ldh [12]

* breakpoint 1::

    breakpoint at: l1:	jeq #0x800, l2, l5

  ...

  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  * run::

      -- register dump --
      pc:       [0]                       <-- program counter
      code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
      curr:     l0:	ldh [12]              <-- disassembly of current instruction
      A:        [00000000][0]             <-- content of A (hex, decimal)
      X:        [00000000][0]             <-- content of X (hex, decimal)
      M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
      -- packet dump --                   <-- Current packet from pcap (hex)
      len: 42
       0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
      16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
      32: 00 00 00 00 00 00 0a 3b 01 01
      (breakpoint)
      >

  * breakpoint::

      breakpoints: 0 1

    Prints currently set breakpoints.

* step [-<n>, +<n>]

  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

* select <n>

  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

* quit

  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS, RISC-V and s390 and can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
been previously enabled by root::

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers, doing audits etc, each compile run can output the generated
opcode image into the kernel log via::

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg::

    [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
    [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
    [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
    [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
    [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
    [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
setting any other value will result in failure. This is even the case for
setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
generally recommended approach instead.

In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump::

	# ./bpf_jit_disasm
	70 bytes emitted from JIT compiler (pass:3, flen:6)
	ffffffffa0069c8f + <x>:
	   0:	push   %rbp
	   1:	mov    %rsp,%rbp
	   4:	sub    $0x60,%rsp
	   8:	mov    %rbx,-0x8(%rbp)
	   c:	mov    0x68(%rdi),%r9d
	  10:	sub    0x6c(%rdi),%r9d
	  14:	mov    0xd8(%rdi),%r8
	  1b:	mov    $0xc,%esi
	  20:	callq  0xffffffffe0ff9442
	  25:	cmp    $0x800,%eax
	  2a:	jne    0x0000000000000042
	  2c:	mov    $0x17,%esi
	  31:	callq  0xffffffffe0ff945e
	  36:	cmp    $0x1,%eax
	  39:	jne    0x0000000000000042
	  3b:	mov    $0xffff,%eax
	  40:	jmp    0x0000000000000044
	  42:	xor    %eax,%eax
	  44:	leaveq
	  45:	retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers::

	# ./bpf_jit_disasm -o
	70 bytes emitted from JIT compiler (pass:3, flen:6)
	ffffffffa0069c8f + <x>:
	   0:	push   %rbp
		55
	   1:	mov    %rsp,%rbp
		48 89 e5
	   4:	sub    $0x60,%rsp
		48 83 ec 60
	   8:	mov    %rbx,-0x8(%rbp)
		48 89 5d f8
	   c:	mov    0x68(%rdi),%r9d
		44 8b 4f 68
	  10:	sub    0x6c(%rdi),%r9d
		44 2b 4f 6c
	  14:	mov    0xd8(%rdi),%r8
		4c 8b 87 d8 00 00 00
	  1b:	mov    $0xc,%esi
		be 0c 00 00 00
	  20:	callq  0xffffffffe0ff9442
		e8 1d 94 ff e0
	  25:	cmp    $0x800,%eax
		3d 00 08 00 00
	  2a:	jne    0x0000000000000042
		75 16
	  2c:	mov    $0x17,%esi
		be 17 00 00 00
	  31:	callq  0xffffffffe0ff945e
		e8 28 94 ff e0
	  36:	cmp    $0x1,%eax
		83 f8 01
	  39:	jne    0x0000000000000042
		75 07
	  3b:	mov    $0xffff,%eax
		b8 ff ff 00 00
	  40:	jmp    0x0000000000000044
		eb 02
	  42:	xor    %eax,%eax
		31 c0
	  44:	leaveq
		c9
	  45:	retq
		c3

For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------

Internally, for the kernel interpreter, a different instruction set
format with similar underlying principles from BPF described in previous
paragraphs is being used. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that a better performance can be achieved (more details later). This new
ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
originates from [e]xtended BPF is not the same as BPF extensions! While
eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
of BPF_LD | BPF_{B,H,W} | BPF_ABS instruction.)

It is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile into eBPF with an optional
GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
minimal performance overhead over two steps, that is, C -> eBPF -> native code.

Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
by using bpf_prog_create() for setting up the filter, resp.
bpf_prog_destroy() for destroying it. The macro
BPF_PROG_RUN(filter, ctx) transparently invokes eBPF interpreter or JITed
code to run the filter. 'filter' is a pointer to struct bpf_prog that we
got from bpf_prog_create(), and 'ctx' the given context (e.g.
skb pointer). All constraints and restrictions from bpf_check_classic() apply
before a conversion to the new layout is being done behind the scenes!

Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF
instruction set.

Some core changes of the new internal format:

- Number of registers increase from 2 to 10:

  The old format had two registers A and X, and a hidden frame pointer. The
  new layout extends this to be 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs are passing arguments to functions via registers
  the number of args from eBPF program to in-kernel function is restricted
  to 5 and one register is used to accept return value from an in-kernel
  function. Natively, x86_64 passes first 6 arguments in registers,
  aarch64/sparcv9/mips64 have 7 - 8 registers for arguments; x86_64 has 6
  callee saved registers, and aarch64/sparcv9/mips64 have 11 or more callee
  saved registers.

  Therefore, eBPF calling convention is defined as:

  * R0       - return value from in-kernel function, and exit value for eBPF program
  * R1 - R5  - arguments from eBPF program to in-kernel function
  * R6 - R9  - callee saved registers that in-kernel function will preserve
  * R10      - read-only frame pointer to access stack

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and eBPF calling convention maps directly to ABIs used by the kernel on
  64-bit architectures.

  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
  and may let more complex programs to be interpreted.

  R0 - R5 are scratch registers and eBPF program needs spill/fill them if
  necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions, it can only
  call predefined in-kernel functions, though.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit if they are being written to.
  That behavior maps directly to x86_64 and arm64 subregister definition, but
  makes other JITs more difficult.

  32-bit architectures run 64-bit internal BPF programs via interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  native instruction set and let the rest be interpreted.

  Operation is 64-bit, because on 64-bit architectures, pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
  32-bit eBPF registers would otherwise require defining a register-pair
  ABI, which would rule out a direct eBPF register to HW register mapping:
  the JIT would need to do combine/split/move operations for every
  register in and out of the function, which is complex, bug prone and slow.
  Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are being replaced with alternative constructs like
  ``if (cond) jump_true; /* else fall-through */``.

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the internal BPF program needs to
  place function arguments into R1 to R5 registers to satisfy calling
  convention, then the interpreter will take them from registers and pass
  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
  that are used for argument passing on given architecture, the JIT compiler
  doesn't need to emit extra moves. Function arguments will be in the correct
  registers and BPF_CALL instruction will be JITed as single 'call' HW
  instruction. This calling convention was picked to cover common call
  situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  a return value of the function. Since R6 - R9 are callee saved, their state
  is preserved across the call.

  For example, consider three C functions::

    u64 f1() { return (*_f2)(1); }
    u64 f2(u64 a) { return f3(a + 1, a); }
    u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64::

    f1:
        movl $1, %edi
        movq _f2(%rip), %rax
        jmp  *%rax
    f3:
        movq %rdi, %rax
        subq %rsi, %rax
        ret

  Function f2 in eBPF may look like::

    f2:
        bpf_mov R2, R1
        bpf_add R1, 1
        bpf_call f3
        bpf_exit

  If f2 is JITed and the pointer is stored to ``_f2``, the calls f1 -> f2 -> f3
  and returns will be seamless. Without JIT, __bpf_prog_run() interpreter needs
  to be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx' which is
  already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
  can call kernel functions with up to 5 arguments. Calls with 6 or more
  arguments are currently not supported, but these restrictions can be lifted
  if necessary in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, x86_64 JIT compiler can map them as ...

  ::

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
  and rbx, r12 - r15 are callee saved.

  Then the following internal BPF pseudo-program::

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64 may look like::

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which is in this example equivalent in C to::

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
  registers and place their return value into ``%rax`` which is R0 in eBPF.
  Prologue and epilogue are emitted by JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
  them across the calls as defined by calling convention.

  For example the following program is invalid::

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  An in-kernel eBPF verifier is used to validate internal BPF programs.

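The scratch-register rule above can be modelled in a few lines of C. This is
only a sketch for illustration, not the kernel verifier: the bitmask
representation and the names ``regmask_t``, ``model_entry()``,
``model_write()``, ``model_call()`` and ``reg_readable()`` are all assumptions
made for this example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model: bit n set means register Rn holds a readable value. */
typedef uint16_t regmask_t;

#define REG(n) ((regmask_t)(1u << (n)))

/* At program entry only R1 (ctx) and R10 (frame pointer) are readable. */
regmask_t model_entry(void)
{
	return REG(1) | REG(10);
}

/* Writing to Rn makes it readable. R10 is read-only, so n must be 0..9. */
regmask_t model_write(regmask_t state, int n)
{
	assert(n >= 0 && n <= 9);
	return (regmask_t)(state | REG(n));
}

/* After a call R1 - R5 are unreadable and R0 holds the return value. */
regmask_t model_call(regmask_t state)
{
	state &= (regmask_t)~(REG(1) | REG(2) | REG(3) | REG(4) | REG(5));
	return (regmask_t)(state | REG(0));
}

bool reg_readable(regmask_t state, int n)
{
	return (state & REG(n)) != 0;
}
```

In this model the invalid program corresponds to ``model_write(s, 1)``
followed by ``model_call(s)``, after which ``reg_readable(s, 1)`` is false;
a value kept in R6 instead would survive the call.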
Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and the new format are two operand instructions,
which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.

The input context pointer for invoking the interpreter function is generic;
its content is defined by a specific use case. For seccomp, register R1 points
to seccomp_data; for converted BPF filters, R1 points to a skb.

A program that is translated internally consists of the following elements::

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32

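The two field layouts can be written down as C structures to see that both
occupy 8 bytes. The sketch below is illustrative only: the struct names
``classic_insn`` and ``ebpf_insn`` and the helper ``make_insn()`` are made up
for this example, and the 4-bit fields rely on compiler support for
bit-fields on ``uint8_t`` (as provided by GCC and Clang).

```c
#include <assert.h>
#include <stdint.h>

/* Classic layout: op:16, jt:8, jf:8, k:32 */
struct classic_insn {
	uint16_t code;
	uint8_t  jt;
	uint8_t  jf;
	uint32_t k;
};

/* eBPF layout: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ebpf_insn {
	uint8_t code;
	uint8_t dst_reg:4;	/* bit-fields on uint8_t: GCC/Clang extension */
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

/* Hypothetical helper that fills in one eBPF instruction. */
struct ebpf_insn make_insn(uint8_t code, unsigned int dst, unsigned int src,
			   int16_t off, int32_t imm)
{
	struct ebpf_insn insn = { .code = code, .off = off, .imm = imm };

	assert(dst <= 10 && src <= 10);	/* only R0 - R10 exist */
	insn.dst_reg = dst;
	insn.src_reg = src;
	return insn;
}
```

Both structures are 8 bytes on the common ABIs, which is what allows a
classic instruction to be rewritten as one or more eBPF instructions of the
same unit size.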
So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
has room for new instructions. Some of them may use 16/24/32 byte encoding. New
instructions must be multiple of 8 bytes to preserve backward compatibility.

Internal BPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to new format.
For example, socket filters are not using ``exclusive add`` instruction, but
tracing filters may use it to maintain counters of events. Register R9 is not
used by socket filters either, but more complex filters may be running out of
registers and would have to resort to spill/fill to stack.

Internal BPF can be used as a generic assembler for last step performance
optimizations; socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In-kernel usage
may not be bounded by security considerations, since generated internal BPF code
may be optimizing internal code path and not being exposed to the user space.
Safety of internal BPF can come from a verifier (TBD). In such use cases as
described, it may be used as safe instruction set.

Just like the original BPF, the new format runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: first step does depth-first-search to disallow
loops and other CFG validation; second step starts from the first insn and
descends all possible paths. It simulates execution of every insn and observes
the state change of registers and stack.

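The first of the two steps can be sketched as a depth-first search that
rejects any edge leading back into the path currently being explored. This is
a simplified illustration and not the kernel's implementation: the ``node``
structure, the ``MAX_INSNS`` limit and the function names are assumptions,
and the reachability test stands in for the "other CFG validation".

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_INSNS 64

enum { UNSEEN, ON_PATH, DONE };

/* Hypothetical CFG node: up to two successor indexes, -1 means "none". */
struct node {
	int succ[2];
};

bool dfs_finds_back_edge(const struct node *prog, int n, int i, int *state)
{
	state[i] = ON_PATH;
	for (int k = 0; k < 2; k++) {
		int t = prog[i].succ[k];

		if (t < 0)
			continue;
		assert(t < n);
		if (state[t] == ON_PATH)
			return true;	/* edge back into the current path: loop */
		if (state[t] == UNSEEN && dfs_finds_back_edge(prog, n, t, state))
			return true;
	}
	state[i] = DONE;
	return false;
}

/* Returns true if the program is loop-free and every insn is reachable. */
bool cfg_ok(const struct node *prog, int n)
{
	int state[MAX_INSNS] = { UNSEEN };

	assert(n > 0 && n <= MAX_INSNS);
	if (dfs_finds_back_edge(prog, n, 0, state))
		return false;
	for (int i = 0; i < n; i++)
		if (state[i] == UNSEEN)
			return false;	/* unreachable instruction */
	return true;
}
```

A conditional jump is modelled as a node with two successors (fall-through
and target), an unconditional jump as a node with one, and an exit as a node
with none.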
eBPF opcode encoding
--------------------

eBPF is reusing most of the opcode encoding from classic to simplify conversion
of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
field is divided into three parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...

::

  BPF_K     0x00
  BPF_X     0x08

* in classic BPF, this means::

    BPF_SRC(code) == BPF_X - use register X as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

* in eBPF, this means::

    BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.

Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg

Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
operands for the comparisons instead.

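The field split and the BPF_ALU versus BPF_ALU64 distinction can be sketched
in C as follows. The macros use the masks implied by the diagram and tables
above; the helper ``alu_exec()`` is hypothetical and handles only BPF_ADD and
BPF_XOR. The sign extension of the immediate for 64-bit operations is stated
here as background knowledge about eBPF rather than taken from the text
above.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_CLASS(code) ((code) & 0x07)	/* three LSB bits */
#define BPF_SRC(code)   ((code) & 0x08)	/* 4th bit */
#define BPF_OP(code)    ((code) & 0xf0)	/* four MSB bits */

#define BPF_ALU   0x04
#define BPF_ALU64 0x07
#define BPF_K     0x00
#define BPF_X     0x08
#define BPF_ADD   0x00
#define BPF_XOR   0xa0

/* Hypothetical helper: apply one ADD or XOR and return the new dst_reg. */
uint64_t alu_exec(uint8_t code, uint64_t dst, uint64_t src_reg, int32_t imm)
{
	uint8_t cls = BPF_CLASS(code);
	/* BPF_X takes src_reg; BPF_K takes the immediate, sign-extended */
	uint64_t src = BPF_SRC(code) == BPF_X ? src_reg : (uint64_t)(int64_t)imm;
	uint64_t res;

	assert(cls == BPF_ALU || cls == BPF_ALU64);
	assert(BPF_OP(code) == BPF_ADD || BPF_OP(code) == BPF_XOR);

	res = BPF_OP(code) == BPF_ADD ? dst + src : dst ^ src;
	/* BPF_ALU works on the low 32 bits and zero-extends the result */
	return cls == BPF_ALU ? (uint32_t)res : res;
}
```

With these definitions ``BPF_ADD | BPF_X | BPF_ALU`` encodes as 0x0c and
wraps at 32 bits, while ``BPF_ADD | BPF_X | BPF_ALU64`` (0x0f) carries into
the upper half of the register.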
For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  |  mode  |  size  | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation::

 B  - 1 byte
 H  - 2 byte
 W  - 4 byte
 DW - 8 byte (eBPF only)

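As a small sketch, the size modifier can be extracted with a mask and turned
into a byte count with a lookup table. The helper name ``bpf_size_bytes()``
is an assumption for this example; the constants are the ones listed above.

```c
#include <assert.h>
#include <stdint.h>

#define BPF_CLASS(code) ((code) & 0x07)
#define BPF_SIZE(code)  ((code) & 0x18)	/* the two size bits */

#define BPF_W  0x00
#define BPF_H  0x08
#define BPF_B  0x10
#define BPF_DW 0x18

/* Hypothetical helper: width in bytes of a load/store instruction. */
int bpf_size_bytes(uint8_t code)
{
	/* indexed by BPF_SIZE(code) >> 3: W, H, B, DW */
	static const int bytes[4] = { 4, 2, 1, 8 };

	/* only the LD, LDX, ST and STX classes (0x00 - 0x03) carry a size */
	assert(BPF_CLASS(code) <= 0x03);
	return bytes[BPF_SIZE(code) >> 3];
}
```

Note that the encoding order (W, H, B, DW) is not sorted by width, which is
why a table is used instead of a shift.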
Mode modifier is one of::

  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS  0x20
  BPF_IND  0x40
  BPF_MEM  0x60
  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
  BPF_XADD 0xc0  /* eBPF only, exclusive add */

eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
(BPF_IND | <size> | BPF_LD) which are used to access packet data.

They had to be carried over from classic to have strong performance of
socket filters running in eBPF interpreter. These instructions can only
be used when interpreter context is a pointer to ``struct sk_buff`` and
have seven implicit operands. Register R6 is an implicit input that must
contain pointer to sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store the data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.

These instructions have implicit program exit condition as well. When
eBPF program is trying to access the data beyond the packet boundary,
the interpreter will abort the execution of the program. JIT compilers
therefore must preserve this property. src_reg and imm32 fields are
explicit inputs to these instructions.

For example::

  BPF_IND | BPF_W | BPF_LD means:

    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
    and R1 - R5 were scratched.

Unlike classic BPF instruction set, eBPF has generic load/store operations::

    BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
    BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
    BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
    BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
    BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg

Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.

eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
of two consecutive ``struct bpf_insn`` 8-byte blocks and is interpreted as a
single instruction that loads 64-bit immediate value into a dst_reg.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
32-bit immediate value into a register.

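A sketch of how a 64-bit immediate can be spread over the two 8-byte blocks
is shown below. The local ``ebpf_insn`` structure and the helper names are
assumptions for this example. The split used here (low 32 bits in the ``imm``
field of the first block, high 32 bits in the ``imm`` field of the second,
all other fields of the second block zero) reflects how the kernel encodes
this instruction as far as the author of this example is aware; the text
above does not spell it out.

```c
#include <assert.h>
#include <stdint.h>

/* Local stand-in for the 8-byte instruction block described above. */
struct ebpf_insn {
	uint8_t code;
	uint8_t dst_reg:4;	/* bit-fields on uint8_t: GCC/Clang extension */
	uint8_t src_reg:4;
	int16_t off;
	int32_t imm;
};

#define BPF_LD  0x00
#define BPF_DW  0x18
#define BPF_IMM 0x00

/* Hypothetical helper: encode "dst = value" into two consecutive blocks. */
void emit_ld_imm64(struct ebpf_insn out[2], unsigned int dst, uint64_t value)
{
	assert(dst <= 9);	/* R10 is read-only */
	out[0] = (struct ebpf_insn){ .code = BPF_LD | BPF_DW | BPF_IMM,
				     .dst_reg = dst,
				     .imm = (int32_t)(uint32_t)value };
	/* second block carries only the upper half of the immediate */
	out[1] = (struct ebpf_insn){ .imm = (int32_t)(uint32_t)(value >> 32) };
}

/* Hypothetical helper: recover the 64-bit immediate from the two blocks. */
uint64_t read_ld_imm64(const struct ebpf_insn in[2])
{
	return (uint64_t)(uint32_t)in[0].imm |
	       ((uint64_t)(uint32_t)in[1].imm << 32);
}
```

Anything that walks an instruction stream has to skip both blocks when it
sees opcode 0x18, since the second block is not an instruction of its own.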
eBPF verifier
-------------
The safety of the eBPF program is determined in two steps.

First step does DAG check to disallow loops and other CFG validation.
In particular it will detect programs that have unreachable instructions
(though classic BPF checker allows them).

Second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.

At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users.)

If register was never written to, it's not readable::

  bpf_mov R0 = R2
  bpf_exit

will be rejected, since R2 is unreadable at the start of the program.

After kernel function call, R1-R5 are reset to unreadable and
R0 has a return type of the function.

Since R6-R9 are callee saved, their state is preserved across the call.

::

  bpf_mov R6 = 1
  bpf_call foo
  bpf_mov R0 = R6
  bpf_exit

is a correct program. If there was R1 instead of R6, it would have been
rejected.

load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example::

 bpf_mov R1 = 1
 bpf_mov R2 = 2
 bpf_xadd *(u32 *)(R1 + 3) += R2
 bpf_exit

will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of instruction bpf_xadd.

At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``).
A callback is used to customize verifier to restrict eBPF program access to only
certain fields within ctx structure with specified size and alignment.

For example, the following insn::

  bpf_ld R0 = *(u32 *)(R6 + 8)

intends to load a word from address R6 + 8 and store it into R0.
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.

The verifier will allow eBPF program to read data from stack only after
it wrote into it.

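The PTR_TO_STACK case can be sketched as a simple predicate. This is an
illustration only: ``stack_access_ok()`` is a made-up name, the value 512 for
MAX_BPF_STACK is taken from the kernel headers as far as the author of this
example knows, and natural alignment (offset a multiple of the access size)
is assumed. The additional rule that a slot must be written before it is
read is not modelled.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_BPF_STACK 512	/* assumed value, see lead-in */

/*
 * Hypothetical check for an access of 'size' bytes at frame pointer + off.
 * Valid offsets lie in [-MAX_BPF_STACK, 0) and must be naturally aligned.
 */
bool stack_access_ok(int off, int size)
{
	assert(size == 1 || size == 2 || size == 4 || size == 8);

	if (off < -MAX_BPF_STACK || off + size > 0)
		return false;		/* outside the stack bounds */
	return off % size == 0;		/* misaligned otherwise */
}
```

For the instruction above, ``stack_access_ok(8, 4)`` is false because the
offset is positive, matching the "out of bounds" verdict in the text.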
Classic BPF verifier does similar check with M[0-15] memory slots.
For example::

  bpf_ld R0 = *(u32 *)(R10 - 4)
  bpf_exit

is an invalid program.
Though R10 is correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.

Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.

Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
The eBPF verifier will check that registers match argument constraints.
After the call register R0 will be set to return type of the function.

Function calls are the main mechanism to extend functionality of eBPF programs.
Socket filters may let programs call one set of functions, whereas tracing
filters may allow a completely different set.

If a function is made accessible to eBPF program, it needs to be thought through
from safety point of view. The verifier will guarantee that the function is
called with valid arguments.

seccomp vs socket filters have different security restrictions for classic BPF.
Seccomp solves this by two stage verifier: classic BPF verifier is followed
by seccomp verifier. In case of eBPF one configurable verifier is shared for
all use cases.

See details of eBPF verifier in kernel/bpf/verifier.c

Register value tracking
-----------------------
In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with ``struct bpf_reg_state``, defined in include/linux/
bpf_verifier.h, which unifies tracking of scalar and pointer values. Each
register state has a type, which is either NOT_INIT (the register has not been
written to), SCALAR_VALUE (some value which is not usable as a pointer), or a
pointer type. The types of pointers describe their base, as follows:

    PTR_TO_CTX
        Pointer to bpf_context.
    CONST_PTR_TO_MAP
        Pointer to struct bpf_map. "Const" because arithmetic
        on these pointers is forbidden.
    PTR_TO_MAP_VALUE
        Pointer to the value stored in a map element.
    PTR_TO_MAP_VALUE_OR_NULL
        Either a pointer to a map value, or NULL; map accesses
        (see section 'eBPF maps', below) return this type,
        which becomes a PTR_TO_MAP_VALUE when checked != NULL.
        Arithmetic on these pointers is forbidden.
    PTR_TO_STACK
        Frame pointer.
    PTR_TO_PACKET
        skb->data.
    PTR_TO_PACKET_END
        skb->data + headlen; arithmetic forbidden.
    PTR_TO_SOCKET
        Pointer to struct bpf_sock_ops, implicitly refcounted.
    PTR_TO_SOCKET_OR_NULL
        Either a pointer to a socket, or NULL; socket lookup
        returns this type, which becomes a PTR_TO_SOCKET when
        checked != NULL. PTR_TO_SOCKET is reference-counted,
        so programs must release the reference through the
        socket release function before the end of the program.
        Arithmetic on these pointers is forbidden.

However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
offset'. The former is used when an exactly-known value (e.g. an immediate
operand) is added to a pointer, while the latter is used for values which are
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
the range of possible values in the register.

The verifier's knowledge about the variable offset consists of:

* minimum and maximum values as unsigned
* minimum and maximum values as signed

* knowledge of the values of individual bits, in the form of a 'tnum': a u64
  'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
  1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
  mask and value; no bit should ever be 1 in both. For example, if a byte is read
  into a register from memory, the register's top 56 bits are known zero, while
  the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
  then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
  0x1ff), because of potential carries.

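The tnum example above can be reproduced with a short sketch. The functions
below are modelled on the OR and ADD operations of the kernel's tnum code as
the author of this example remembers them; treat the exact formulas and the
helper names ``tnum_make()`` and ``tnum_const()`` as assumptions rather than
as a copy of the kernel source.

```c
#include <assert.h>
#include <stdint.h>

struct tnum {
	uint64_t value;	/* bits known to be 1 */
	uint64_t mask;	/* bits whose value is unknown */
};

struct tnum tnum_make(uint64_t value, uint64_t mask)
{
	assert((value & mask) == 0);	/* no bit may be 1 in both */
	return (struct tnum){ value, mask };
}

/* A fully known constant has an empty mask. */
struct tnum tnum_const(uint64_t c)
{
	return tnum_make(c, 0);
}

struct tnum tnum_or(struct tnum a, struct tnum b)
{
	uint64_t v = a.value | b.value;		/* known 1 in either input */
	uint64_t mu = a.mask | b.mask;		/* unknown in either input */

	return tnum_make(v, mu & ~v);		/* a known 1 wins over unknown */
}

struct tnum tnum_add(struct tnum a, struct tnum b)
{
	uint64_t sm = a.mask + b.mask;
	uint64_t sv = a.value + b.value;
	uint64_t sigma = sm + sv;
	uint64_t chi = sigma ^ sv;		/* bits a carry might reach */
	uint64_t mu = chi | a.mask | b.mask;

	return tnum_make(sv & ~mu, mu);
}
```

Starting from ``tnum_make(0x0, 0xff)``, OR with the constant 0x40 gives
(0x40; 0xbf), and adding the constant 1 then gives (0x0; 0x1ff), as in the
text.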
Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.

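The unsigned case of this refinement can be sketched as follows. The
``ubounds`` structure and the two helper functions are hypothetical and only
cover an unsigned ``>`` comparison against a constant; the signed bounds and
the tnum are left out.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical subset of the tracked state: unsigned bounds only. */
struct ubounds {
	uint64_t umin;
	uint64_t umax;
};

/* State in the 'true' branch of "if reg > val". */
struct ubounds refine_jgt_true(struct ubounds b, uint64_t val)
{
	assert(val < UINT64_MAX);	/* reg > UINT64_MAX can never be true */
	if (b.umin < val + 1)
		b.umin = val + 1;
	return b;
}

/* State in the 'false' branch of "if reg > val", i.e. reg <= val. */
struct ubounds refine_jgt_false(struct ubounds b, uint64_t val)
{
	if (b.umax > val)
		b.umax = val;
	return b;
}
```

For a completely unknown scalar compared ``> 8`` this yields umin = 9 in the
true branch and umax = 8 in the false branch. If a refinement ever produces
umin > umax, the corresponding branch cannot be taken.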
PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all
pointers sharing that same variable offset. This is important for packet range
checks: after adding a variable to a packet pointer register A, if you then copy
it to another register B and then add a constant 4 to A, both registers will
share the same 'id' but the A will have a fixed offset of +4. Then if A is
bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is
now known to have a safe range of at least 4 bytes. See 'Direct packet access',
below, for more on PTR_TO_PACKET ranges.

The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup. This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.

As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses. For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.

The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
to all copies of the pointer returned from a socket lookup. This has similar
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
it also handles reference tracking for the pointer. PTR_TO_SOCKET implicitly
represents a reference to the corresponding ``struct sock``. To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and, in
the non-NULL case, pass the valid reference to the socket release function.

Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
data via skb->data and skb->data_end pointers.
Ex::

    1:  r4 = *(u32 *)(r1 +80)  /* load skb->data_end */
    2:  r3 = *(u32 *)(r1 +76)  /* load skb->data */
    3:  r5 = r3
    4:  r5 += 14
    5:  if r5 > r4 goto pc+16
    R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
    6:  r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */

this 2byte load from the packet is safe to do, since the program author
did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5 which
means that in the fall-through case the register R3 (which points to skb->data)
has at least 14 directly accessible bytes. The verifier marks it
as R3=pkt(id=0,off=0,r=14).
id=0 means that no additional variables were added to the register.
off=0 means that no additional constants were added.
r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok.
Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points
to the packet data, but constant 14 was added to the register, so
it now points to ``skb->data + 14`` and accessible range is [R5, R5 + 14 - 14)
which is zero bytes.
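
The ``off`` and ``r`` bookkeeping described above amounts to one inequality,
sketched here in C. The function name ``pkt_access_ok()`` and its parameter
names are assumptions for this example; it models only the range test, not
the 'id' tracking.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Hypothetical range test for a load of 'size' bytes at register + insn_off,
 * where the register is marked pkt(off=reg_off, r=range).
 */
bool pkt_access_ok(int reg_off, int range, int insn_off, int size)
{
	assert(size == 1 || size == 2 || size == 4 || size == 8);

	if (reg_off + insn_off < 0)
		return false;	/* before the start of the checked area */
	/* the last byte touched must still be inside [data, data + range) */
	return reg_off + insn_off + size <= range;
}
```

The load at insn #6 corresponds to ``pkt_access_ok(0, 14, 12, 2)``, which
holds, while any load through R5 (``reg_off`` = 14, ``range`` = 14) fails
because no bytes remain in the checked range.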

More complex packet access may look like::

    R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
    6:  r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
    7:  r4 = *(u8 *)(r3 +12)
    8:  r4 *= 14
    9:  r3 = *(u32 *)(r1 +76) /* load skb->data */
    10: r3 += r4
    11: r2 = r1
    12: r2 <<= 48
    13: r2 >>= 48
    14: r3 += r2
    15: r2 = r3
    16: r2 += 8
    17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */
    18: if r2 > r1 goto pc+2
    R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
    19: r1 = *(u8 *)(r3 +4)

The state of the register R3 is R3=pkt(id=2,off=0,r=8)
id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
offset within a packet and since the program author did
``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).
The verifier only allows 'add'/'sub' operations on packet registers. Any other
operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
cb3f0d56e docs: networking:... |
1234 1235 1236 |
Operation ``r3 += rX`` may overflow and become less than original skb->data, therefore the verifier has to prevent that. So when it sees ``r3 += rX`` |
0cbf47416 Documentation: de... |
1237 1238 1239 |
instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 against skb->data_end will not give us 'range' information, so attempts to read through the pointer will give "invalid access to packet" error. |
cb3f0d56e docs: networking:... |
1240 1241 |
Ex. after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is |
0cbf47416 Documentation: de... |
1242 1243 |
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits of the register are guaranteed to be zero, and nothing is known about the lower |
cb3f0d56e docs: networking:... |
1244 |
8 bits. After insn ``r4 *= 14`` the state becomes |
0cbf47416 Documentation: de... |
1245 1246 |
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit value by constant 14 will keep upper 52 bits as zero, also the least significant |
cb3f0d56e docs: networking:... |
1247 |
bit will be zero as 14 is even. Similarly ``r2 >>= 48`` will make |
0cbf47416 Documentation: de... |
1248 1249 1250 1251 |
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign extending. This logic is implemented in adjust_reg_min_max_vals() function, which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice versa) and adjust_scalar_min_max_vals() for operations on two scalars. |
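The numeric claims in the paragraph above can be cross-checked by brute force
outside the kernel. The sketch below is not how the verifier derives its
bounds (it tracks them symbolically per instruction); the function names are
made up for this illustration.

```c
#include <stdint.h>

/* Largest value of x * 14 over every possible 8-bit x. */
uint64_t max_scaled_u8(void)
{
	uint64_t max = 0;

	for (uint64_t x = 0; x <= 0xff; x++)
		if (x * 14 > max)
			max = x * 14;
	return max;
}

/*
 * OR of every possible x * 14. A bit that is clear in the result is clear
 * in every product, so it can be treated as known to be zero.
 */
uint64_t possible_bits_scaled_u8(void)
{
	uint64_t bits = 0;

	for (uint64_t x = 0; x <= 0xff; x++)
		bits |= x * 14;
	return bits;
}

/* Same shift pair as insns #12 and #13; shifts on uint64_t are logical. */
uint64_t low16(uint64_t v)
{
	v <<= 48;
	v >>= 48;
	return v;
}
```

``max_scaled_u8()`` matches umax_value=3570, the OR of all products has bit 0
and every bit from 12 upwards clear, and ``low16()`` never exceeds 65535.
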
The end result is that a bpf program author can access the packet directly
using normal C code as::

  void *data = (void *)(long)skb->data;
  void *data_end = (void *)(long)skb->data_end;
  struct eth_hdr *eth = data;
  struct iphdr *iph = data + sizeof(*eth);
  struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);

  /* one bounds check covers all three headers */
  if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
          return 0;
  if (eth->h_proto != htons(ETH_P_IP))
          return 0;
  if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
          return 0;
  /* port fields are in network byte order */
  if (udp->dest == htons(53) || udp->source == htons(9))
          ...;

which makes such programs easier to write compared to using the LD_ABS insn,
and significantly faster.

eBPF maps
---------
'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given type and attributes
  ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
  using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
  returns process-local file descriptor or negative error

- lookup key in a given map
  ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
  using attr->map_fd, attr->key, attr->value
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
  using attr->map_fd, attr->key, attr->value
  returns zero or negative error

- find and delete element by key in a given map
  ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
  using attr->map_fd, attr->key

- to delete map: close(fd)
  Exiting process will delete maps automatically

Userspace programs use this syscall to create/access maps that eBPF programs
are concurrently updating.

Maps can have different types: hash, array, bloom filter, radix-tree, etc.

The map is defined by:

  - type
  - max number of elements
  - key size in bytes
  - value size in bytes

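As a rough sketch of how the commands listed above are issued from C, the
wrappers below call the raw syscall through ``syscall(2)``, since glibc does
not export a ``bpf()`` function. The names ``sys_bpf``, ``map_create``,
``map_update`` and ``map_lookup`` are hypothetical helpers, and the code
assumes a Linux system with the ``<linux/bpf.h>`` UAPI header installed. Note
that through ``syscall()`` a failure shows up as -1 with ``errno`` set.

```c
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Hypothetical thin wrapper around the raw bpf syscall. */
long sys_bpf(int cmd, union bpf_attr *attr)
{
	return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* BPF_MAP_CREATE: returns a new file descriptor, or -1 with errno set. */
int map_create(uint32_t type, uint32_t key_size, uint32_t value_size,
	       uint32_t max_entries)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr)); /* unused fields must be zero */
	attr.map_type = type;
	attr.key_size = key_size;
	attr.value_size = value_size;
	attr.max_entries = max_entries;
	return (int)sys_bpf(BPF_MAP_CREATE, &attr);
}

/* BPF_MAP_UPDATE_ELEM: create or overwrite the entry for 'key'. */
int map_update(int fd, const void *key, const void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = fd;
	attr.key = (uint64_t)(uintptr_t)key;
	attr.value = (uint64_t)(uintptr_t)value;
	attr.flags = BPF_ANY;
	return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
}

/* BPF_MAP_LOOKUP_ELEM: on success the element is copied into 'value'. */
int map_lookup(int fd, const void *key, void *value)
{
	union bpf_attr attr;

	memset(&attr, 0, sizeof(attr));
	attr.map_fd = fd;
	attr.key = (uint64_t)(uintptr_t)key;
	attr.value = (uint64_t)(uintptr_t)value;
	return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
}
```

A caller would typically start with something like
``int fd = map_create(BPF_MAP_TYPE_ARRAY, 4, 8, 16);`` and release the map
with ``close(fd)``. Creating a map usually requires extra privileges, so the
call may fail for an unprivileged process.
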
Pruning
-------
The verifier does not actually walk all possible paths through the program. For
each new branch to analyse, the verifier looks at all the states it's previously
been in when at this instruction. If any of them contain the current state as a
subset, the branch is 'pruned' - that is, the fact that the previous state was
accepted implies the current state would be as well. For instance, if in the
previous state, r1 held a packet-pointer, and in the current state, r1 holds a
packet-pointer with a range as long or longer and at least as strict an
alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't
have been used by any path from that point, so any value in r2 (including
another NOT_INIT) is safe. The implementation is in the function regsafe().
Pruning considers not only the registers but also the stack (and any spilled
registers it may hold). They must all be safe for the branch to be pruned.
This is implemented in states_equal().

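The two cases mentioned above can be written down as a toy comparison. This is
a deliberately reduced sketch, not the kernel's regsafe(): the type and
function names are hypothetical, only NOT_INIT and packet-pointer registers
are modelled, and alignment, scalar bounds and the stack are left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, heavily reduced register state for illustration only. */
enum reg_kind { REG_NOT_INIT, REG_PKT_PTR };

struct reg_model {
	enum reg_kind kind;
	uint32_t range; /* REG_PKT_PTR only: verified accessible bytes */
};

/* True if everything that was safe with 'old' is also safe with 'cur'. */
bool reg_model_safe(struct reg_model old, struct reg_model cur)
{
	if (old.kind == REG_NOT_INIT)
		return true; /* the old path never read it, any value will do */
	if (cur.kind != REG_PKT_PTR)
		return false; /* old was a packet pointer, cur is not */
	return cur.range >= old.range; /* range as long or longer */
}

/* A branch may only be pruned if every register passes the check. */
bool regs_model_safe(const struct reg_model *old,
		     const struct reg_model *cur, int nregs)
{
	for (int i = 0; i < nregs; i++)
		if (!reg_model_safe(old[i], cur[i]))
			return false;
	return true;
}
```

In this model an old state with ``r=8`` covers a current state with ``r=14``,
but not the other way around.
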
Understanding eBPF verifier messages
------------------------------------

The following are a few examples of invalid eBPF programs and verifier error
messages as seen in the log:

Program with unreachable instructions::

  static struct bpf_insn prog[] = {
  BPF_EXIT_INSN(),
  BPF_EXIT_INSN(),
  };

Error::

  unreachable insn 1

Program that reads uninitialized register::

  BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
  BPF_EXIT_INSN(),

Error::

  0: (bf) r0 = r2
  R2 !read_ok

Program that doesn't initialize R0 before exiting::

  BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
  BPF_EXIT_INSN(),

Error::

  0: (bf) r2 = r1
  1: (95) exit
  R0 !read_ok

Program that accesses stack out of bounds::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 +8) = 0
  invalid stack off=8 size=8

Program that doesn't initialize stack before passing its address into function::

  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_EXIT_INSN(),

Error::

  0: (bf) r2 = r10
  1: (07) r2 += -8
  2: (b7) r1 = 0x0
  3: (85) call 1
  invalid indirect read from stack off -8+0 size 8

Program that uses invalid map_fd=0 while calling the map_lookup_elem() function::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 0x0
  4: (85) call 1
  fd 0 is not pointing to valid bpf_map

Program that doesn't check the return value of map_lookup_elem() before
accessing the map element::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 0x0
  4: (85) call 1
  5: (7a) *(u64 *)(r0 +0) = 0
  R0 invalid mem access 'map_value_or_null'

Program that correctly checks the map_lookup_elem() return value for NULL, but
accesses the memory with incorrect alignment::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 1
  4: (85) call 1
  5: (15) if r0 == 0x0 goto pc+1
   R0=map_ptr R10=fp
  6: (7a) *(u64 *)(r0 +4) = 0
  misaligned access off 4 size 8

Program that correctly checks the map_lookup_elem() return value for NULL and
accesses memory with correct alignment in one side of the 'if' branch, but fails
to do so in the other side of the 'if' branch::

  BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_LD_MAP_FD(BPF_REG_1, 0),
  BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
  BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
  BPF_EXIT_INSN(),
  BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
  BPF_EXIT_INSN(),

Error::

  0: (7a) *(u64 *)(r10 -8) = 0
  1: (bf) r2 = r10
  2: (07) r2 += -8
  3: (b7) r1 = 1
  4: (85) call 1
  5: (15) if r0 == 0x0 goto pc+2
   R0=map_ptr R10=fp
  6: (7a) *(u64 *)(r0 +0) = 0
  7: (95) exit

  from 5 to 8: R0=imm0 R10=fp
  8: (7a) *(u64 *)(r0 +0) = 1
  R0 invalid mem access 'imm'

Program that performs a socket lookup then sets the pointer to NULL without
checking it::

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
  BPF_MOV64_IMM(BPF_REG_0, 0),
  BPF_EXIT_INSN(),

Error::

  0: (b7) r2 = 0
  1: (63) *(u32 *)(r10 -8) = r2
  2: (bf) r2 = r10
  3: (07) r2 += -8
  4: (b7) r3 = 4
  5: (b7) r4 = 0
  6: (b7) r5 = 0
  7: (85) call bpf_sk_lookup_tcp#65
  8: (b7) r0 = 0
  9: (95) exit
  Unreleased reference id=1, alloc_insn=7

Program that performs a socket lookup but does not NULL-check the returned
value::

  BPF_MOV64_IMM(BPF_REG_2, 0),
  BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
  BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
  BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
  BPF_MOV64_IMM(BPF_REG_3, 4),
  BPF_MOV64_IMM(BPF_REG_4, 0),
  BPF_MOV64_IMM(BPF_REG_5, 0),
  BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
  BPF_EXIT_INSN(),

Error::

  0: (b7) r2 = 0
  1: (63) *(u32 *)(r10 -8) = r2
  2: (bf) r2 = r10
  3: (07) r2 += -8
  4: (b7) r3 = 4
  5: (b7) r4 = 0
  6: (b7) r5 = 0
  7: (85) call bpf_sk_lookup_tcp#65
  8: (95) exit
  Unreleased reference id=1, alloc_insn=7

Testing
-------

Next to the BPF toolchain, the kernel also ships a test module that contains
various test cases for classic and internal BPF that can be executed against
the BPF interpreter and JIT compiler. It can be found in lib/test_bpf.c and
enabled via Kconfig::

  CONFIG_TEST_BPF=m

After the module has been built and installed, the test suite can be executed
via insmod or modprobe against the 'test_bpf' module. Results of the test cases
including timings in nsec can be found in the kernel log (dmesg).

Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

- Jay Schulist <jschlst@samba.org>
- Daniel Borkmann <daniel@iogearbox.net>
- Alexei Starovoitov <ast@kernel.org>