.. SPDX-License-Identifier: GPL-2.0

.. _networking-filter:

=======================================================
Linux Socket Filtering aka Berkeley Packet Filter (BPF)
=======================================================

Introduction
------------

Linux Socket Filtering (LSF) is derived from the Berkeley Packet Filter.
Although there are some distinct differences between the BSD and Linux
kernel filtering, when we speak of BPF or LSF in Linux context, we
mean the very same mechanism of filtering in the Linux kernel.

BPF allows a user-space program to attach a filter onto any socket and
allow or disallow certain types of data to come through the socket. LSF
follows exactly the same filter code structure as BSD's BPF, so referring
to the BSD bpf.4 manpage is very helpful in creating filters.

On Linux, BPF is much simpler than on BSD. One does not have to worry
about devices or anything like that. You simply create your filter code,
send it to the kernel via the SO_ATTACH_FILTER option and if your filter
code passes the kernel check on it, you then immediately begin filtering
data on that socket.

You can also detach filters from your socket via the SO_DETACH_FILTER
option. This will probably not be used much since when you close a socket
that has a filter on it the filter is automagically removed. The other
less common case may be adding a different filter on the same socket where
you had another filter that is still running: the kernel takes care of
removing the old one and placing your new one in its place, assuming your
filter has passed the checks. If the new filter fails the checks, the old
filter remains on that socket.

The SO_LOCK_FILTER option allows locking the filter attached to a socket.
Once set, a filter cannot be removed or changed. This allows one process to
set up a socket, attach a filter, lock it, then drop privileges and be
assured that the filter will be kept until the socket is closed.

The biggest user of this construct might be libpcap. Issuing a high-level
filter command like `tcpdump -i em1 port 22` passes through the libpcap
internal compiler that generates a structure that can eventually be loaded
via SO_ATTACH_FILTER to the kernel. `tcpdump -i em1 port 22 -ddd`
displays what is being placed into this structure.

Although we were only speaking about sockets here, BPF in Linux is used
in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel
qdisc layer, SECCOMP-BPF (SECure COMPuting [1]_), and lots of other places
such as team driver, PTP code, etc where BPF is being used.

.. [1] Documentation/userspace-api/seccomp_filter.rst
Original BPF paper:

Steven McCanne and Van Jacobson. 1993. The BSD packet filter: a new
architecture for user-level packet capture. In Proceedings of the
USENIX Winter 1993 Conference Proceedings on USENIX Winter 1993
Conference Proceedings (USENIX'93). USENIX Association, Berkeley,
CA, USA, 2-2. [http://www.tcpdump.org/papers/bpf-usenix93.pdf]

Structure
---------

User space applications include <linux/filter.h> which contains the
following relevant structures::

    struct sock_filter {    /* Filter block */
            __u16   code;   /* Actual filter code */
            __u8    jt;     /* Jump true */
            __u8    jf;     /* Jump false */
            __u32   k;      /* Generic multiuse field */
    };

Such a structure is assembled as an array of 4-tuples, that contains
a code, jt, jf and k value. jt and jf are jump offsets and k a generic
value to be used for a provided code::

    struct sock_fprog {                     /* Required for SO_ATTACH_FILTER. */
            unsigned short             len; /* Number of filter blocks */
            struct sock_filter __user *filter;
    };

For socket filtering, a pointer to this structure (as shown in
follow-up example) is being passed to the kernel through setsockopt(2).

Example
-------
::

    #include <sys/socket.h>
    #include <sys/types.h>
    #include <arpa/inet.h>
    #include <unistd.h>             /* close() */
    #include <linux/filter.h>       /* struct sock_filter, struct sock_fprog */
    #include <linux/if_ether.h>
    /* ... */

    /* ARRAY_SIZE() is not provided by user space headers. */
    #define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

    /* From the example above: tcpdump -i em1 port 22 -dd */
    struct sock_filter code[] = {
            { 0x28,  0,  0, 0x0000000c },
            { 0x15,  0,  8, 0x000086dd },
            { 0x30,  0,  0, 0x00000014 },
            { 0x15,  2,  0, 0x00000084 },
            { 0x15,  1,  0, 0x00000006 },
            { 0x15,  0, 17, 0x00000011 },
            { 0x28,  0,  0, 0x00000036 },
            { 0x15, 14,  0, 0x00000016 },
            { 0x28,  0,  0, 0x00000038 },
            { 0x15, 12, 13, 0x00000016 },
            { 0x15,  0, 12, 0x00000800 },
            { 0x30,  0,  0, 0x00000017 },
            { 0x15,  2,  0, 0x00000084 },
            { 0x15,  1,  0, 0x00000006 },
            { 0x15,  0,  8, 0x00000011 },
            { 0x28,  0,  0, 0x00000014 },
            { 0x45,  6,  0, 0x00001fff },
            { 0xb1,  0,  0, 0x0000000e },
            { 0x48,  0,  0, 0x0000000e },
            { 0x15,  2,  0, 0x00000016 },
            { 0x48,  0,  0, 0x00000010 },
            { 0x15,  0,  1, 0x00000016 },
            { 0x06,  0,  0, 0x0000ffff },
            { 0x06,  0,  0, 0x00000000 },
    };

    struct sock_fprog bpf = {
            .len = ARRAY_SIZE(code),
            .filter = code,
    };

    sock = socket(PF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
    if (sock < 0)
            /* ... bail out ... */

    ret = setsockopt(sock, SOL_SOCKET, SO_ATTACH_FILTER, &bpf, sizeof(bpf));
    if (ret < 0)
            /* ... bail out ... */

    /* ... */
    close(sock);
The above example code attaches a socket filter for a PF_PACKET socket
in order to let all IPv4/IPv6 packets with port 22 pass. The rest will
be dropped for this socket.

The setsockopt(2) call to SO_DETACH_FILTER doesn't need any arguments,
and SO_LOCK_FILTER, which prevents the filter from being detached, takes
an integer value of 0 or 1.

Note that socket filters are not restricted to PF_PACKET sockets only,
but can also be used on other socket families.

Summary of system calls:

 * setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_DETACH_FILTER, &val, sizeof(val));
 * setsockopt(sockfd, SOL_SOCKET, SO_LOCK_FILTER,   &val, sizeof(val));

Normally, most use cases for socket filtering on packet sockets will be
covered by libpcap in high-level syntax, so as an application developer
you should stick to that. libpcap wraps its own layer around all that.

Writing a filter "by hand" can be an alternative if i) using/linking to
libpcap is not an option, ii) the required BPF filters use Linux extensions
that are not supported by libpcap's compiler, iii) a filter might be more
complex and not cleanly implementable with libpcap's compiler, or
iv) particular filter codes should be optimized differently than libpcap's
internal compiler does. For example, xt_bpf and cls_bpf users might have
requirements that could result in more complex filter code, or one that
cannot be expressed with libpcap (e.g. different return codes for various
code paths). Moreover, BPF JIT implementors may wish to manually write test
cases and thus need low-level access to BPF code as well.

BPF engine and instruction set
------------------------------

Under tools/bpf/ there's a small helper tool called bpf_asm which can
be used to write low-level filters for example scenarios mentioned in the
previous section. Asm-like syntax mentioned here has been implemented in
bpf_asm and will be used for further explanations (instead of dealing with
less readable opcodes directly, principles are the same). The syntax is
closely modelled after Steven McCanne's and Van Jacobson's BPF paper.

The BPF architecture consists of the following basic elements:

  =======  ====================================================
  Element  Description
  =======  ====================================================
  A        32 bit wide accumulator
  X        32 bit wide X register
  M[]      16 x 32 bit wide misc registers aka "scratch memory
           store", addressable from 0 to 15
  =======  ====================================================

A program that is translated by bpf_asm into "opcodes" is an array that
consists of the following elements (as already mentioned)::

  op:16, jt:8, jf:8, k:32

The element op is a 16 bit wide opcode that has a particular instruction
encoded. jt and jf are two 8 bit wide jump targets, one for condition
"jump if true", the other one "jump if false". Finally, element k
contains a miscellaneous argument that can be interpreted in different
ways depending on the given instruction in op.

The instruction set consists of load, store, branch, alu, miscellaneous
and return instructions that are also represented in bpf_asm syntax. This
table lists all available bpf_asm instructions and what their underlying
opcodes as defined in linux/filter.h stand for:

  ===========  ===============  =====================
  Instruction  Addressing mode  Description
  ===========  ===============  =====================
  ld           1, 2, 3, 4, 12   Load word into A
  ldi          4                Load word into A
  ldh          1, 2             Load half-word into A
  ldb          1, 2             Load byte into A
  ldx          3, 4, 5, 12      Load word into X
  ldxi         4                Load word into X
  ldxb         5                Load byte into X

  st           3                Store A into M[]
  stx          3                Store X into M[]

  jmp          6                Jump to label
  ja           6                Jump to label
  jeq          7, 8, 9, 10      Jump on A == <x>
  jneq         9, 10            Jump on A != <x>
  jne          9, 10            Jump on A != <x>
  jlt          9, 10            Jump on A <  <x>
  jle          9, 10            Jump on A <= <x>
  jgt          7, 8, 9, 10      Jump on A >  <x>
  jge          7, 8, 9, 10      Jump on A >= <x>
  jset         7, 8, 9, 10      Jump on A &  <x>

  add          0, 4             A + <x>
  sub          0, 4             A - <x>
  mul          0, 4             A * <x>
  div          0, 4             A / <x>
  mod          0, 4             A % <x>
  neg                           -A
  and          0, 4             A & <x>
  or           0, 4             A \| <x>
  xor          0, 4             A ^ <x>
  lsh          0, 4             A << <x>
  rsh          0, 4             A >> <x>

  tax                           Copy A into X
  txa                           Copy X into A

  ret          4, 11            Return
  ===========  ===============  =====================

The next table shows addressing formats from the 2nd column:

  ===============  ===========  ===============================================
  Addressing mode  Syntax       Description
  ===============  ===========  ===============================================
  0                x/%x         Register X
  1                [k]          BHW at byte offset k in the packet
  2                [x + k]      BHW at the offset X + k in the packet
  3                M[k]         Word at offset k in M[]
  4                #k           Literal value stored in k
  5                4*([k]&0xf)  Lower nibble * 4 at byte offset k in the packet
  6                L            Jump label L
  7                #k,Lt,Lf     Jump to Lt if true, otherwise jump to Lf
  8                x/%x,Lt,Lf   Jump to Lt if true, otherwise jump to Lf
  9                #k,Lt        Jump to Lt if predicate is true
  10               x/%x,Lt      Jump to Lt if predicate is true
  11               a/%a         Accumulator A
  12               extension    BPF extension
  ===============  ===========  ===============================================

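To make the roles of A, jt, jf and k more concrete, the following is a minimal
sketch of an interpreter for just four instructions (``ldh [k]``, ``ldb [k]``,
``jeq #k,Lt,Lf`` and ``ret #k``). It is an illustration only and not the
kernel's implementation: ``struct insn``, ``mini_bpf_run()`` and
``mini_bpf_self_check()`` are hypothetical names, the structure is a local
mirror of ``struct sock_filter`` so that no kernel headers are needed, and X,
M[] and all other instructions are left out. The filter is the ARP example in
its ``bpf_asm -c`` form and the packet bytes are the ones from the bpf_dbg
packet dump, both shown later in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Local mirror of struct sock_filter (assumption: same field layout). */
struct insn {
	uint16_t code;	/* opcode */
	uint8_t  jt;	/* jump offset if the condition is true */
	uint8_t  jf;	/* jump offset if the condition is false */
	uint32_t k;	/* generic argument */
};

/* Only the opcodes needed for the examples in this document. */
enum {
	OP_LDH_ABS = 0x28,	/* ldh [k]        */
	OP_LDB_ABS = 0x30,	/* ldb [k]        */
	OP_JEQ_K   = 0x15,	/* jeq #k, Lt, Lf */
	OP_RET_K   = 0x06,	/* ret #k         */
};

/*
 * Hypothetical helper: run 'prog' on 'pkt' and return the verdict.
 * Out-of-bounds loads and unsupported opcodes yield 0 (drop).
 */
static uint32_t mini_bpf_run(const struct insn *prog, size_t len,
			     const uint8_t *pkt, size_t pkt_len)
{
	uint32_t A = 0;		/* accumulator */
	size_t pc = 0;		/* index of the next instruction */

	while (pc < len) {
		const struct insn *i = &prog[pc++];

		switch (i->code) {
		case OP_LDH_ABS:	/* half-word in network byte order */
			if ((uint64_t)i->k + 2 > pkt_len)
				return 0;
			A = (uint32_t)pkt[i->k] << 8 | pkt[i->k + 1];
			break;
		case OP_LDB_ABS:
			if ((uint64_t)i->k + 1 > pkt_len)
				return 0;
			A = pkt[i->k];
			break;
		case OP_JEQ_K:		/* offsets are relative to the next insn */
			pc += (A == i->k) ? i->jt : i->jf;
			break;
		case OP_RET_K:
			return i->k;
		default:
			return 0;
		}
	}
	return 0;	/* ran past the end without a ret: drop */
}

/* ARP filter, as printed by "bpf_asm -c" in this document. */
static const struct insn arp_filter[] = {
	{ 0x28, 0, 0, 0x0000000c },	/* ldh [12]         */
	{ 0x15, 0, 1, 0x00000806 },	/* jne #0x806, drop */
	{ 0x06, 0, 0, 0xffffffff },	/* ret #-1          */
	{ 0x06, 0, 0, 0x00000000 },	/* drop: ret #0     */
};

/* 42 byte ARP packet from the bpf_dbg packet dump in this document. */
static const uint8_t arp_pkt[] = {
	0x00, 0x19, 0xcb, 0x55, 0x55, 0xa4, 0x00, 0x14,
	0xa4, 0x43, 0x78, 0x69, 0x08, 0x06, 0x00, 0x01,
	0x08, 0x00, 0x06, 0x04, 0x00, 0x01, 0x00, 0x14,
	0xa4, 0x43, 0x78, 0x69, 0x0a, 0x3b, 0x01, 0x26,
	0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x0a, 0x3b,
	0x01, 0x01,
};

/* Usage example: the ARP packet passes, an IPv4 ethertype is dropped. */
void mini_bpf_self_check(void)
{
	uint8_t ipv4_hdr[14] = { 0 };

	ipv4_hdr[12] = 0x08;	/* ethertype 0x0800 */
	ipv4_hdr[13] = 0x00;

	assert(mini_bpf_run(arp_filter, 4, arp_pkt, sizeof(arp_pkt)) == 0xffffffffu);
	assert(mini_bpf_run(arp_filter, 4, ipv4_hdr, sizeof(ipv4_hdr)) == 0);
}
```

Note how a jt or jf value of 0 simply falls through to the next instruction,
which is why ``jne #0x806, drop`` assembles to ``jt = 0, jf = 1``.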
The Linux kernel also has a couple of BPF extensions that are used along
with the class of load instructions by "overloading" the k argument with
a negative offset + a particular extension offset. The result of such a BPF
extension is loaded into A.

Possible BPF extensions are shown in the following table:

  ==========  =================================================
  Extension   Description
  ==========  =================================================
  len         skb->len
  proto       skb->protocol
  type        skb->pkt_type
  poff        Payload start offset
  ifidx       skb->dev->ifindex
  nla         Netlink attribute of type X with offset A
  nlan        Nested Netlink attribute of type X with offset A
  mark        skb->mark
  queue       skb->queue_mapping
  hatype      skb->dev->type
  rxhash      skb->hash
  cpu         raw_smp_processor_id()
  vlan_tci    skb_vlan_tag_get(skb)
  vlan_avail  skb_vlan_tag_present(skb)
  vlan_tpid   skb->vlan_proto
  rand        prandom_u32()
  ==========  =================================================

These extensions can also be prefixed with '#'.
Examples for low-level BPF:

ARP packets::

  ldh [12]
  jne #0x806, drop
  ret #-1
  drop: ret #0

IPv4 TCP packets::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #6, drop
  ret #-1
  drop: ret #0

(Accelerated) VLAN w/ id 10::

  ld vlan_tci
  jneq #10, drop
  ret #-1
  drop: ret #0

icmp random packet sampling, 1 in 4::

  ldh [12]
  jne #0x800, drop
  ldb [23]
  jneq #1, drop
  # get a random uint32 number
  ld rand
  mod #4
  jneq #1, drop
  ret #-1
  drop: ret #0

SECCOMP filter example::

  ld [4]                  /* offsetof(struct seccomp_data, arch) */
  jne #0xc000003e, bad    /* AUDIT_ARCH_X86_64 */
  ld [0]                  /* offsetof(struct seccomp_data, nr) */
  jeq #15, good           /* __NR_rt_sigreturn */
  jeq #231, good          /* __NR_exit_group */
  jeq #60, good           /* __NR_exit */
  jeq #0, good            /* __NR_read */
  jeq #1, good            /* __NR_write */
  jeq #5, good            /* __NR_fstat */
  jeq #9, good            /* __NR_mmap */
  jeq #14, good           /* __NR_rt_sigprocmask */
  jeq #13, good           /* __NR_rt_sigaction */
  jeq #35, good           /* __NR_nanosleep */
  bad: ret #0             /* SECCOMP_RET_KILL_THREAD */
  good: ret #0x7fff0000   /* SECCOMP_RET_ALLOW */

The above example code can be placed into a file (here called "foo"), and
then be passed to the bpf_asm tool for generating opcodes. This is output
that xt_bpf and cls_bpf understand and that can be loaded directly. Example
with above ARP code::

    $ ./bpf_asm foo
    4,40 0 0 12,21 0 1 2054,6 0 0 4294967295,6 0 0 0,

In copy and paste C-like output::

    $ ./bpf_asm -c foo
    { 0x28,  0,  0, 0x0000000c },
    { 0x15,  0,  1, 0x00000806 },
    { 0x06,  0,  0, 0xffffffff },
    { 0x06,  0,  0, 0000000000 },

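The opcode values in this output are not arbitrary: each one is an OR of a
few bit fields (instruction class, plus size and mode for loads, or operation
and source for jumps and ALU instructions). The following sketch splits the
three distinct opcodes of the ARP filter into these fields. The masks mirror
the BPF_CLASS(), BPF_SIZE(), BPF_MODE(), BPF_OP() and BPF_SRC() macros from
the kernel's UAPI headers; the ``CBPF_*`` names and both functions are
hypothetical and only chosen so that the example builds without those headers.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Bit fields of a classic BPF opcode (local copies of the UAPI masks). */
#define CBPF_CLASS(code) ((code) & 0x07)	/* ld, ldx, st, stx, alu, jmp, ret, misc */
#define CBPF_SIZE(code)  ((code) & 0x18)	/* loads: 0x00 word, 0x08 half, 0x10 byte */
#define CBPF_MODE(code)  ((code) & 0xe0)	/* loads: 0x00 imm, 0x20 abs, 0x40 ind, ... */
#define CBPF_OP(code)    ((code) & 0xf0)	/* alu/jmp operation, e.g. 0x10 jeq */
#define CBPF_SRC(code)   ((code) & 0x08)	/* alu/jmp operand: 0x00 k, 0x08 X */

/* Name of the instruction class stored in the low three bits. */
static const char *cbpf_class_name(uint16_t code)
{
	static const char *const names[] = {
		"ld", "ldx", "st", "stx", "alu", "jmp", "ret", "misc",
	};

	return names[CBPF_CLASS(code)];
}

/* Usage example: decode the opcodes of the ARP filter shown above. */
void cbpf_decode_self_check(void)
{
	/* 0x28: ldh [12] -> class ld, half-word, absolute packet offset */
	assert(strcmp(cbpf_class_name(0x28), "ld") == 0);
	assert(CBPF_SIZE(0x28) == 0x08);
	assert(CBPF_MODE(0x28) == 0x20);

	/* 0x15: jeq #k -> class jmp, operation jeq, operand taken from k */
	assert(strcmp(cbpf_class_name(0x15), "jmp") == 0);
	assert(CBPF_OP(0x15) == 0x10);
	assert(CBPF_SRC(0x15) == 0x00);

	/* 0x06: ret #k -> class ret, value taken from k */
	assert(strcmp(cbpf_class_name(0x06), "ret") == 0);
}
```

The same decomposition applies to the decimal form printed without ``-c``:
40, 21 and 6 are simply 0x28, 0x15 and 0x06.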
In particular, as usage with xt_bpf or cls_bpf can result in more complex BPF
filters that might not be obvious at first, it's good to test filters before
attaching to a live system. For that purpose, there's a small tool called
bpf_dbg under tools/bpf/ in the kernel source directory. This debugger allows
for testing BPF filters against given pcap files, single stepping through the
BPF code on the pcap's packets and dumping the BPF machine registers.

Starting bpf_dbg is trivial and just requires issuing::

    # ./bpf_dbg

In case input and output do not equal stdin/stdout, bpf_dbg takes an
alternative stdin source as a first argument, and an alternative stdout
sink as a second one, e.g. `./bpf_dbg test_in.txt test_out.txt`.

Other than that, a particular libreadline configuration can be set via
file "~/.bpf_dbg_init" and the command history is stored in the file
"~/.bpf_dbg_history".

Interaction in bpf_dbg happens through a shell that also has auto-completion
support (follow-up example commands starting with '>' denote bpf_dbg shell).
The usual workflow would be to ...

* load bpf 6,40 0 0 12,21 0 3 2048,48 0 0 23,21 0 1 1,6 0 0 65535,6 0 0 0

  Loads a BPF filter from standard output of bpf_asm, or transformed via
  e.g. ``tcpdump -iem1 -ddd port 22 | tr '\n' ','``. Note that for JIT
  debugging (next section), this command creates a temporary socket and
  loads the BPF code into the kernel. Thus, this will also be useful for
  JIT developers.

* load pcap foo.pcap

  Loads standard tcpdump pcap file.

* run [<n>]::

    bpf passes:1 fails:9

  Runs through all packets from a pcap to account how many passes and fails
  the filter will generate. A limit of packets to traverse can be given.

* disassemble::

    l0:	ldh [12]
    l1:	jeq #0x800, l2, l5
    l2:	ldb [23]
    l3:	jeq #0x1, l4, l5
    l4:	ret #0xffff
    l5:	ret #0

  Prints out BPF code disassembly.

* dump::

    /* { op, jt, jf, k }, */
    { 0x28,  0,  0, 0x0000000c },
    { 0x15,  0,  3, 0x00000800 },
    { 0x30,  0,  0, 0x00000017 },
    { 0x15,  0,  1, 0x00000001 },
    { 0x06,  0,  0, 0x0000ffff },
    { 0x06,  0,  0, 0000000000 },

  Prints out C-style BPF code dump.

* breakpoint 0::

    breakpoint at: l0:	ldh [12]

* breakpoint 1::

    breakpoint at: l1:	jeq #0x800, l2, l5

  ...

  Sets breakpoints at particular BPF instructions. Issuing a `run` command
  will walk through the pcap file continuing from the current packet and
  break when a breakpoint is being hit (another `run` will continue from
  the currently active breakpoint executing next instructions):

  * run::

      -- register dump --
      pc:       [0]                       <-- program counter
      code:     [40] jt[0] jf[0] k[12]    <-- plain BPF code of current instruction
      curr:     l0:	ldh [12]              <-- disassembly of current instruction
      A:        [00000000][0]             <-- content of A (hex, decimal)
      X:        [00000000][0]             <-- content of X (hex, decimal)
      M[0,15]:  [00000000][0]             <-- folded content of M (hex, decimal)
      -- packet dump --                   <-- Current packet from pcap (hex)
      len: 42
        0: 00 19 cb 55 55 a4 00 14 a4 43 78 69 08 06 00 01
       16: 08 00 06 04 00 01 00 14 a4 43 78 69 0a 3b 01 26
       32: 00 00 00 00 00 00 0a 3b 01 01
      (breakpoint)
      >

  * breakpoint::

      breakpoints: 0 1

    Prints currently set breakpoints.

* step [-<n>, +<n>]

  Performs single stepping through the BPF program from the current pc
  offset. Thus, on each step invocation, above register dump is issued.
  This can go forwards and backwards in time, a plain `step` will break
  on the next BPF instruction, thus +1. (No `run` needs to be issued here.)

* select <n>

  Selects a given packet from the pcap file to continue from. Thus, on
  the next `run` or `step`, the BPF program is being evaluated against
  the user pre-selected packet. Numbering starts just as in Wireshark
  with index 1.

* quit

  Exits bpf_dbg.

JIT compiler
------------

The Linux kernel has a built-in BPF JIT compiler for x86_64, SPARC,
PowerPC, ARM, ARM64, MIPS, RISC-V and s390, which can be enabled through
CONFIG_BPF_JIT. The JIT compiler is transparently invoked for each
attached filter from user space or for internal kernel users if it has
been previously enabled by root::

  echo 1 > /proc/sys/net/core/bpf_jit_enable

For JIT developers, doing audits etc, each compile run can output the generated
opcode image into the kernel log via::

  echo 2 > /proc/sys/net/core/bpf_jit_enable

Example output from dmesg::

    [ 3389.935842] flen=6 proglen=70 pass=3 image=ffffffffa0069c8f
    [ 3389.935847] JIT code: 00000000: 55 48 89 e5 48 83 ec 60 48 89 5d f8 44 8b 4f 68
    [ 3389.935849] JIT code: 00000010: 44 2b 4f 6c 4c 8b 87 d8 00 00 00 be 0c 00 00 00
    [ 3389.935850] JIT code: 00000020: e8 1d 94 ff e0 3d 00 08 00 00 75 16 be 17 00 00
    [ 3389.935851] JIT code: 00000030: 00 e8 28 94 ff e0 83 f8 01 75 07 b8 ff ff 00 00
    [ 3389.935852] JIT code: 00000040: eb 02 31 c0 c9 c3

When CONFIG_BPF_JIT_ALWAYS_ON is enabled, bpf_jit_enable is permanently set to 1 and
setting any other value than that will result in failure. This is even the case for
setting bpf_jit_enable to 2, since dumping the final JIT image into the kernel log
is discouraged and introspection through bpftool (under tools/bpf/bpftool/) is the
generally recommended approach instead.

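A program that depends on the JIT may want to check the current setting before
attaching filters. The following is a small sketch of how this could be done
from user space by reading the same sysctl file. The function names are
hypothetical, and the meaning of value 0 (JIT disabled) is an assumption that
is not spelled out in this section; values 1 and 2 are described above.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Human-readable meaning of a bpf_jit_enable value. */
static const char *bpf_jit_mode_desc(int value)
{
	switch (value) {
	case 0:
		return "JIT disabled";	/* assumption, see lead-in */
	case 1:
		return "JIT enabled";
	case 2:
		return "JIT enabled, image dumped to kernel log";
	default:
		return "unknown";
	}
}

/* Returns the current setting, or -1 if the file cannot be read or parsed. */
static int read_bpf_jit_enable(void)
{
	FILE *f = fopen("/proc/sys/net/core/bpf_jit_enable", "r");
	int value = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Usage example: report the setting of the running kernel, if available. */
void bpf_jit_report(void)
{
	int value = read_bpf_jit_enable();

	assert(strcmp(bpf_jit_mode_desc(1), "JIT enabled") == 0);
	printf("bpf_jit_enable: %d (%s)\n", value, bpf_jit_mode_desc(value));
}
```

On a kernel without CONFIG_BPF_JIT the file does not exist, which is why the
sketch reports -1 instead of failing.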
In the kernel source tree under tools/bpf/, there's bpf_jit_disasm for
generating disassembly out of the kernel log's hexdump::

    # ./bpf_jit_disasm
    70 bytes emitted from JIT compiler (pass:3, flen:6)
    ffffffffa0069c8f + <x>:
       0:	push   %rbp
       1:	mov    %rsp,%rbp
       4:	sub    $0x60,%rsp
       8:	mov    %rbx,-0x8(%rbp)
       c:	mov    0x68(%rdi),%r9d
      10:	sub    0x6c(%rdi),%r9d
      14:	mov    0xd8(%rdi),%r8
      1b:	mov    $0xc,%esi
      20:	callq  0xffffffffe0ff9442
      25:	cmp    $0x800,%eax
      2a:	jne    0x0000000000000042
      2c:	mov    $0x17,%esi
      31:	callq  0xffffffffe0ff945e
      36:	cmp    $0x1,%eax
      39:	jne    0x0000000000000042
      3b:	mov    $0xffff,%eax
      40:	jmp    0x0000000000000044
      42:	xor    %eax,%eax
      44:	leaveq
      45:	retq

Issuing option `-o` will "annotate" opcodes to resulting assembler
instructions, which can be very useful for JIT developers::

    # ./bpf_jit_disasm -o
    70 bytes emitted from JIT compiler (pass:3, flen:6)
    ffffffffa0069c8f + <x>:
       0:	push   %rbp
    	55
       1:	mov    %rsp,%rbp
    	48 89 e5
       4:	sub    $0x60,%rsp
    	48 83 ec 60
       8:	mov    %rbx,-0x8(%rbp)
    	48 89 5d f8
       c:	mov    0x68(%rdi),%r9d
    	44 8b 4f 68
      10:	sub    0x6c(%rdi),%r9d
    	44 2b 4f 6c
      14:	mov    0xd8(%rdi),%r8
    	4c 8b 87 d8 00 00 00
      1b:	mov    $0xc,%esi
    	be 0c 00 00 00
      20:	callq  0xffffffffe0ff9442
    	e8 1d 94 ff e0
      25:	cmp    $0x800,%eax
    	3d 00 08 00 00
      2a:	jne    0x0000000000000042
    	75 16
      2c:	mov    $0x17,%esi
    	be 17 00 00 00
      31:	callq  0xffffffffe0ff945e
    	e8 28 94 ff e0
      36:	cmp    $0x1,%eax
    	83 f8 01
      39:	jne    0x0000000000000042
    	75 07
      3b:	mov    $0xffff,%eax
    	b8 ff ff 00 00
      40:	jmp    0x0000000000000044
    	eb 02
      42:	xor    %eax,%eax
    	31 c0
      44:	leaveq
    	c9
      45:	retq
    	c3

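When reading annotated output like this, it helps to know how the branch
targets follow from the raw bytes. For the two-byte short jumps in the listing
(``75`` for jne, ``eb`` for jmp), the second byte is a signed displacement
relative to the address of the following instruction. The sketch below checks
this against the three short jumps above; ``short_jump_target()`` is a
hypothetical helper written for this illustration and is not part of
bpf_jit_disasm.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Target of an x86 short jump: the instruction is two bytes long and the
 * signed 8 bit displacement is added to the offset of the next instruction.
 */
static uint64_t short_jump_target(uint64_t insn_offset, uint8_t rel8)
{
	return insn_offset + 2 + (uint64_t)(int64_t)(int8_t)rel8;
}

/* Usage example: the three short jumps from the listing above. */
void short_jump_self_check(void)
{
	assert(short_jump_target(0x2a, 0x16) == 0x42);	/* 2a: jne, bytes 75 16 */
	assert(short_jump_target(0x39, 0x07) == 0x42);	/* 39: jne, bytes 75 07 */
	assert(short_jump_target(0x40, 0x02) == 0x44);	/* 40: jmp, bytes eb 02 */
}
```

Both failing comparisons therefore land on the ``xor %eax,%eax`` at offset
0x42, which corresponds to ``ret #0`` in the BPF program, while the successful
path skips it and returns 0xffff.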
For BPF JIT developers, bpf_jit_disasm, bpf_asm and bpf_dbg provide a useful
toolchain for developing and testing the kernel's JIT compiler.

BPF kernel internals
--------------------

Internally, for the kernel interpreter, a different instruction set
format with similar underlying principles from BPF described in previous
paragraphs is being used. However, the instruction set format is modelled
closer to the underlying architecture to mimic native instruction sets, so
that a better performance can be achieved (more details later). This new
ISA is called 'eBPF' or 'internal BPF' interchangeably. (Note: eBPF which
originates from [e]xtended BPF is not the same as BPF extensions! While
eBPF is an ISA, BPF extensions date back to classic BPF's 'overloading'
of BPF_LD \| BPF_{B,H,W} \| BPF_ABS instruction.)

It is designed to be JITed with one to one mapping, which can also open up
the possibility for GCC/LLVM compilers to generate optimized eBPF code through
an eBPF backend that performs almost as fast as natively compiled code.

The new instruction set was originally designed with the possible goal in
mind to write programs in "restricted C" and compile into eBPF with an optional
GCC/LLVM backend, so that it can just-in-time map to modern 64-bit CPUs with
minimal performance overhead over two steps, that is, C -> eBPF -> native code.

Currently, the new format is being used for running user BPF programs, which
includes seccomp BPF, classic socket filters, cls_bpf traffic classifier,
team driver's classifier for its load-balancing mode, netfilter's xt_bpf
extension, PTP dissector/classifier, and much more. They are all internally
converted by the kernel into the new instruction set representation and run
in the eBPF interpreter. For in-kernel handlers, this all works transparently
by using bpf_prog_create() for setting up the filter and bpf_prog_destroy()
for destroying it. The macro BPF_PROG_RUN(filter, ctx) transparently invokes
the eBPF interpreter or JITed code to run the filter. 'filter' is a pointer
to struct bpf_prog that we got from bpf_prog_create(), and 'ctx' the given
context (e.g. skb pointer). All constraints and restrictions from
bpf_check_classic() apply before a conversion to the new layout is being done
behind the scenes!

Currently, the classic BPF format is being used for JITing on most
32-bit architectures, whereas x86-64, aarch64, s390x, powerpc64,
sparc64, arm32, riscv64, riscv32 perform JIT compilation from eBPF
instruction set.

Some core changes of the new internal format:

- Number of registers increases from 2 to 10:

  The old format had two registers, A and X, and a hidden frame pointer. The
  new layout extends this to 10 internal registers and a read-only frame
  pointer. Since 64-bit CPUs pass arguments to functions via registers,
  the number of args from an eBPF program to an in-kernel function is
  restricted to 5, and one register is used to accept the return value from
  an in-kernel function. Natively, x86_64 passes the first 6 arguments in
  registers, aarch64/sparcv9/mips64 have 7 - 8 registers for arguments;
  x86_64 has 6 callee saved registers, and aarch64/sparcv9/mips64 have 11 or
  more callee saved registers.

  Therefore, eBPF calling convention is defined as:

    * R0      - return value from in-kernel function, and exit value for eBPF program
    * R1 - R5 - arguments from eBPF program to in-kernel function
    * R6 - R9 - callee saved registers that in-kernel function will preserve
    * R10     - read-only frame pointer to access stack

  Thus, all eBPF registers map one to one to HW registers on x86_64, aarch64,
  etc, and eBPF calling convention maps directly to ABIs used by the kernel on
  64-bit architectures.

  On 32-bit architectures JIT may map programs that use only 32-bit arithmetic
  and may leave more complex programs to be interpreted.

  R0 - R5 are scratch registers and an eBPF program needs to spill/fill them
  if necessary across calls. Note that there is only one eBPF program (== one
  eBPF main routine) and it cannot call other eBPF functions; it can only
  call predefined in-kernel functions.

- Register width increases from 32-bit to 64-bit:

  Still, the semantics of the original 32-bit ALU operations are preserved
  via 32-bit subregisters. All eBPF registers are 64-bit with 32-bit lower
  subregisters that zero-extend into 64-bit if they are being written to.
  That behavior maps directly to x86_64 and arm64 subregister definition, but
  makes other JITs more difficult.

  32-bit architectures run 64-bit internal BPF programs via interpreter.
  Their JITs may convert BPF programs that only use 32-bit subregisters into
  native instruction set and leave the rest to be interpreted.

  Operation is 64-bit because on 64-bit architectures pointers are also
  64-bit wide, and we want to pass 64-bit values in/out of kernel functions.
  32-bit eBPF registers would otherwise require a register-pair ABI, so a
  direct eBPF register to HW register mapping would not be possible and the
  JIT would need to do combine/split/move operations for every register in
  and out of the function, which is complex, bug prone and slow.
  Another reason is the use of atomic 64-bit counters.

- Conditional jt/jf targets replaced with jt/fall-through:

  While the original design has constructs such as ``if (cond) jump_true;
  else jump_false;``, they are being replaced with alternative constructs like
  ``if (cond) jump_true; /* else fall-through */``.

- Introduces bpf_call insn and register passing convention for zero overhead
  calls from/to other kernel functions:

  Before an in-kernel function call, the internal BPF program needs to
  place function arguments into R1 to R5 registers to satisfy calling
  convention, then the interpreter will take them from registers and pass
  to in-kernel function. If R1 - R5 registers are mapped to CPU registers
  that are used for argument passing on given architecture, the JIT compiler
  doesn't need to emit extra moves. Function arguments will be in the correct
  registers and BPF_CALL instruction will be JITed as single 'call' HW
  instruction. This calling convention was picked to cover common call
  situations without performance penalty.

  After an in-kernel function call, R1 - R5 are reset to unreadable and R0 has
  a return value of the function. Since R6 - R9 are callee saved, their state
  is preserved across the call.

  For example, consider three C functions::

    u64 f1() { return (*_f2)(1); }
    u64 f2(u64 a) { return f3(a + 1, a); }
    u64 f3(u64 a, u64 b) { return a - b; }

  GCC can compile f1, f3 into x86_64::

    f1:
        movl $1, %edi
        movq _f2(%rip), %rax
        jmp  *%rax
    f3:
        movq %rdi, %rax
        subq %rsi, %rax
        ret

  Function f2 in eBPF may look like::

    f2:
        bpf_mov R2, R1
        bpf_add R1, 1
        bpf_call f3
        bpf_exit

  If f2 is JITed and the pointer is stored to ``_f2``, the calls
  f1 -> f2 -> f3 and the returns will be seamless. Without JIT, the
  __bpf_prog_run() interpreter needs to be used to call into f2.

  For practical reasons all eBPF programs have only one argument 'ctx' which is
  already placed into R1 (e.g. on __bpf_prog_run() startup) and the programs
  can call kernel functions with up to 5 arguments. Calls with 6 or more arguments
  are currently not supported, but these restrictions can be lifted if necessary
  in the future.

  On 64-bit architectures all registers map to HW registers one to one. For
  example, x86_64 JIT compiler can map them as ...

  ::

    R0 - rax
    R1 - rdi
    R2 - rsi
    R3 - rdx
    R4 - rcx
    R5 - r8
    R6 - rbx
    R7 - r13
    R8 - r14
    R9 - r15
    R10 - rbp

  ... since x86_64 ABI mandates rdi, rsi, rdx, rcx, r8, r9 for argument passing
  and rbx, r12 - r15 are callee saved.

  Then the following internal BPF pseudo-program::

    bpf_mov R6, R1 /* save ctx */
    bpf_mov R2, 2
    bpf_mov R3, 3
    bpf_mov R4, 4
    bpf_mov R5, 5
    bpf_call foo
    bpf_mov R7, R0 /* save foo() return value */
    bpf_mov R1, R6 /* restore ctx for next call */
    bpf_mov R2, 6
    bpf_mov R3, 7
    bpf_mov R4, 8
    bpf_mov R5, 9
    bpf_call bar
    bpf_add R0, R7
    bpf_exit

  After JIT to x86_64 may look like::

    push %rbp
    mov %rsp,%rbp
    sub $0x228,%rsp
    mov %rbx,-0x228(%rbp)
    mov %r13,-0x220(%rbp)
    mov %rdi,%rbx
    mov $0x2,%esi
    mov $0x3,%edx
    mov $0x4,%ecx
    mov $0x5,%r8d
    callq foo
    mov %rax,%r13
    mov %rbx,%rdi
    mov $0x6,%esi
    mov $0x7,%edx
    mov $0x8,%ecx
    mov $0x9,%r8d
    callq bar
    add %r13,%rax
    mov -0x228(%rbp),%rbx
    mov -0x220(%rbp),%r13
    leaveq
    retq

  Which is in this example equivalent in C to::

    u64 bpf_filter(u64 ctx)
    {
        return foo(ctx, 2, 3, 4, 5) + bar(ctx, 6, 7, 8, 9);
    }

  In-kernel functions foo() and bar() with prototype: u64 (*)(u64 arg1, u64
  arg2, u64 arg3, u64 arg4, u64 arg5); will receive arguments in proper
  registers and place their return value into ``%rax`` which is R0 in eBPF.
  Prologue and epilogue are emitted by JIT and are implicit in the
  interpreter. R0-R5 are scratch registers, so eBPF program needs to preserve
  them across the calls as defined by calling convention.

  For example the following program is invalid::

    bpf_mov R1, 1
    bpf_call foo
    bpf_mov R0, R1
    bpf_exit

  After the call the registers R1-R5 contain junk values and cannot be read.
  An in-kernel eBPF verifier is used to validate internal BPF programs.

Also in the new design, eBPF is limited to 4096 insns, which means that any
program will terminate quickly and will only call a fixed number of kernel
functions. Original BPF and the new format are two operand instructions,
which helps to do one-to-one mapping between eBPF insn and x86 insn during JIT.

The input context pointer for invoking the interpreter function is generic,
its content is defined by a specific use case. For seccomp register R1 points
to seccomp_data, for converted BPF filters R1 points to a skb.

A program that is translated internally consists of the following elements::

  op:16, jt:8, jf:8, k:32    ==>    op:8, dst_reg:4, src_reg:4, off:16, imm:32

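The two layouts above can be pictured as C structures. The following is a
minimal sketch only: the names ``ex_classic_insn`` and ``ex_ebpf_insn`` are
illustrative stand-ins (the kernel's own definitions are ``struct sock_filter``
and ``struct bpf_insn``), and it assumes a compiler that packs the two 4-bit
fields into a single byte, as GCC and Clang do.

```c
#include <assert.h>
#include <stdint.h>

/* Classic BPF layout: op:16, jt:8, jf:8, k:32 */
struct ex_classic_insn {
	uint16_t code;		/* opcode */
	uint8_t  jt;		/* jump offset if the condition is true */
	uint8_t  jf;		/* jump offset if the condition is false */
	uint32_t k;		/* 32-bit immediate / generic field */
};

/* New layout: op:8, dst_reg:4, src_reg:4, off:16, imm:32 */
struct ex_ebpf_insn {
	uint8_t  code;		/* opcode */
	uint8_t  dst_reg:4;	/* destination register */
	uint8_t  src_reg:4;	/* source register */
	int16_t  off;		/* signed offset */
	int32_t  imm;		/* signed 32-bit immediate */
};

/* Both encodings occupy exactly one 8-byte slot. */
static_assert(sizeof(struct ex_classic_insn) == 8, "classic insn is 8 bytes");
static_assert(sizeof(struct ex_ebpf_insn) == 8, "eBPF insn is 8 bytes");
```

Keeping both formats at 8 bytes means a translated program needs no more
storage per slot than the classic one, although one classic instruction may
still expand to several new ones.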
So far 87 internal BPF instructions were implemented. 8-bit 'op' opcode field
has room for new instructions. Some of them may use 16/24/32 byte encoding. New
instructions must be multiple of 8 bytes to preserve backward compatibility.

Internal BPF is a general purpose RISC instruction set. Not every register and
every instruction are used during translation from original BPF to new format.
For example, socket filters are not using ``exclusive add`` instruction, but
tracing filters may do to maintain counters of events, for example. Register R9
is not used by socket filters either, but more complex filters may be running
out of registers and would have to resort to spill/fill to stack.

Internal BPF can be used as a generic assembler for last step performance
optimizations, socket filters and seccomp are using it as assembler. Tracing
filters may use it as assembler to generate code from kernel. In kernel usage
may not be bounded by security considerations, since generated internal BPF code
may be optimizing internal code path and not being exposed to the user space.
Safety of internal BPF can come from a verifier (TBD). In such use cases as
described, it may be used as safe instruction set.

Just like the original BPF, the new format runs within a controlled environment,
is deterministic and the kernel can easily prove that. The safety of the program
can be determined in two steps: first step does depth-first-search to disallow
loops and other CFG validation; second step starts from the first insn and
descends all possible paths. It simulates execution of every insn and observes
the state change of registers and stack.

eBPF opcode encoding
--------------------

eBPF is reusing most of the opcode encoding from classic to simplify conversion
of classic BPF to eBPF. For arithmetic and jump instructions the 8-bit 'code'
field is divided into three parts::

  +----------------+--------+--------------------+
  |   4 bits       |  1 bit |   3 bits           |
  | operation code | source | instruction class  |
  +----------------+--------+--------------------+
  (MSB)                                      (LSB)

Three LSB bits store instruction class which is one of:

  ===================     ===============
  Classic BPF classes     eBPF classes
  ===================     ===============
  BPF_LD    0x00          BPF_LD    0x00
  BPF_LDX   0x01          BPF_LDX   0x01
  BPF_ST    0x02          BPF_ST    0x02
  BPF_STX   0x03          BPF_STX   0x03
  BPF_ALU   0x04          BPF_ALU   0x04
  BPF_JMP   0x05          BPF_JMP   0x05
  BPF_RET   0x06          BPF_JMP32 0x06
  BPF_MISC  0x07          BPF_ALU64 0x07
  ===================     ===============

When BPF_CLASS(code) == BPF_ALU or BPF_JMP, 4th bit encodes source operand ...

::

  BPF_K     0x00
  BPF_X     0x08

* in classic BPF, this means::

    BPF_SRC(code) == BPF_X - use register X as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

* in eBPF, this means::

    BPF_SRC(code) == BPF_X - use 'src_reg' register as source operand
    BPF_SRC(code) == BPF_K - use 32-bit immediate as source operand

... and four MSB bits store operation code.

If BPF_CLASS(code) == BPF_ALU or BPF_ALU64 [ in eBPF ], BPF_OP(code) is one of::

  BPF_ADD   0x00
  BPF_SUB   0x10
  BPF_MUL   0x20
  BPF_DIV   0x30
  BPF_OR    0x40
  BPF_AND   0x50
  BPF_LSH   0x60
  BPF_RSH   0x70
  BPF_NEG   0x80
  BPF_MOD   0x90
  BPF_XOR   0xa0
  BPF_MOV   0xb0  /* eBPF only: mov reg to reg */
  BPF_ARSH  0xc0  /* eBPF only: sign extending shift right */
  BPF_END   0xd0  /* eBPF only: endianness conversion */

If BPF_CLASS(code) == BPF_JMP or BPF_JMP32 [ in eBPF ], BPF_OP(code) is one of::

  BPF_JA    0x00  /* BPF_JMP only */
  BPF_JEQ   0x10
  BPF_JGT   0x20
  BPF_JGE   0x30
  BPF_JSET  0x40
  BPF_JNE   0x50  /* eBPF only: jump != */
  BPF_JSGT  0x60  /* eBPF only: signed '>' */
  BPF_JSGE  0x70  /* eBPF only: signed '>=' */
  BPF_CALL  0x80  /* eBPF BPF_JMP only: function call */
  BPF_EXIT  0x90  /* eBPF BPF_JMP only: function return */
  BPF_JLT   0xa0  /* eBPF only: unsigned '<' */
  BPF_JLE   0xb0  /* eBPF only: unsigned '<=' */
  BPF_JSLT  0xc0  /* eBPF only: signed '<' */
  BPF_JSLE  0xd0  /* eBPF only: signed '<=' */

So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF
and eBPF. There are only two registers in classic BPF, so it means A += X.
In eBPF it means dst_reg = (u32) dst_reg + (u32) src_reg; similarly,
BPF_XOR | BPF_K | BPF_ALU means A ^= imm32 in classic BPF and analogous
dst_reg = (u32) dst_reg ^ (u32) imm32 in eBPF.

Classic BPF is using BPF_MISC class to represent A = X and X = A moves.
eBPF is using BPF_MOV | BPF_X | BPF_ALU code instead. Since there are no
BPF_MISC operations in eBPF, the class 7 is used as BPF_ALU64 to mean
exactly the same operations as BPF_ALU, but with 64-bit wide operands
instead. So BPF_ADD | BPF_X | BPF_ALU64 means 64-bit addition, i.e.:
dst_reg = dst_reg + src_reg

Classic BPF wastes the whole BPF_RET class to represent a single ``ret``
operation. Classic BPF_RET | BPF_K means copy imm32 into return register
and perform function exit. eBPF is modeled to match CPU, so BPF_JMP | BPF_EXIT
in eBPF means function exit only. The eBPF program needs to store return
value into register R0 before doing a BPF_EXIT. Class 6 in eBPF is used as
BPF_JMP32 to mean exactly the same operations as BPF_JMP, but with 32-bit wide
operands for the comparisons instead.
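The split of the 'code' byte and the difference between BPF_ALU and BPF_ALU64
can be sketched in C. This is an illustration, not kernel code: the ``EX_``
names are local stand-ins for the constants tabulated above, and only the ADD
and XOR operations are modelled.

```c
#include <assert.h>
#include <stdint.h>

/* Field masks for arithmetic and jump opcodes. */
#define EX_CLASS(code)	((code) & 0x07)	/* three LSB bits: instruction class */
#define EX_SRC(code)	((code) & 0x08)	/* 4th bit: source operand */
#define EX_OP(code)	((code) & 0xf0)	/* four MSB bits: operation code */

enum { EX_ALU = 0x04, EX_ALU64 = 0x07 };	/* instruction classes */
enum { EX_K = 0x00, EX_X = 0x08 };		/* source operand selector */
enum { EX_ADD = 0x00, EX_XOR = 0xa0 };		/* operation codes */

/*
 * Apply one ALU instruction with eBPF semantics. 'src' is the value of
 * src_reg (EX_X) or of the immediate (EX_K), already selected by the caller.
 */
static uint64_t ex_alu(uint8_t code, uint64_t dst, uint64_t src)
{
	int is64 = EX_CLASS(code) == EX_ALU64;
	uint64_t res;

	if (!is64) {		/* BPF_ALU: use the low 32 bits only */
		dst = (uint32_t)dst;
		src = (uint32_t)src;
	}
	switch (EX_OP(code)) {
	case EX_ADD: res = dst + src; break;
	case EX_XOR: res = dst ^ src; break;
	default:     return dst;	/* other operations are not modelled */
	}
	return is64 ? res : (uint32_t)res;	/* 32-bit results zero-extend */
}

void ex_alu_demo(void)
{
	/* The 32-bit addition wraps, the 64-bit addition does not. */
	assert(ex_alu(EX_ADD | EX_X | EX_ALU,   0xffffffffULL, 1) == 0);
	assert(ex_alu(EX_ADD | EX_X | EX_ALU64, 0xffffffffULL, 1) == 0x100000000ULL);
}
```

The same opcode byte therefore differs only in its class bits between the
32-bit and 64-bit form of an operation.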
783e327b6 Alexei Starovoitov net: filter: docu...	922
For load and store instructions the 8-bit 'code' field is divided as::

  +--------+--------+-------------------+
  | 3 bits | 2 bits |   3 bits          |
  | mode   | size   | instruction class |
  +--------+--------+-------------------+
  (MSB)                             (LSB)

Size modifier is one of ...

::

  BPF_W   0x00    /* word */
  BPF_H   0x08    /* half word */
  BPF_B   0x10    /* byte */
  BPF_DW  0x18    /* eBPF only, double word */

... which encodes size of load/store operation::

  B  - 1 byte
  H  - 2 byte
  W  - 4 byte
  DW - 8 byte (eBPF only)

Mode modifier is one of::

  BPF_IMM  0x00  /* used for 32-bit mov in classic BPF and 64-bit in eBPF */
  BPF_ABS  0x20
  BPF_IND  0x40
  BPF_MEM  0x60
  BPF_LEN  0x80  /* classic BPF only, reserved in eBPF */
  BPF_MSH  0xa0  /* classic BPF only, reserved in eBPF */
  BPF_XADD 0xc0  /* eBPF only, exclusive add */

eBPF has two non-generic instructions: (BPF_ABS | <size> | BPF_LD) and
(BPF_IND | <size> | BPF_LD) which are used to access packet data.

They had to be carried over from classic to have strong performance of
socket filters running in eBPF interpreter. These instructions can only
be used when interpreter context is a pointer to ``struct sk_buff`` and
have seven implicit operands. Register R6 is an implicit input that must
contain pointer to sk_buff. Register R0 is an implicit output which contains
the data fetched from the packet. Registers R1-R5 are scratch registers
and must not be used to store the data across BPF_ABS | BPF_LD or
BPF_IND | BPF_LD instructions.

These instructions have implicit program exit condition as well. When
eBPF program is trying to access the data beyond the packet boundary,
the interpreter will abort the execution of the program. JIT compilers
therefore must preserve this property. src_reg and imm32 fields are
explicit inputs to these instructions.

For example::

  BPF_IND | BPF_W | BPF_LD means:

    R0 = ntohl(*(u32 *) (((struct sk_buff *) R6)->data + src_reg + imm32))
    and R1 - R5 were scratched.

Unlike classic BPF instruction set, eBPF has generic load/store operations::

    BPF_MEM | <size> | BPF_STX:  *(size *) (dst_reg + off) = src_reg
    BPF_MEM | <size> | BPF_ST:   *(size *) (dst_reg + off) = imm32
    BPF_MEM | <size> | BPF_LDX:  dst_reg = *(size *) (src_reg + off)
    BPF_XADD | BPF_W  | BPF_STX: lock xadd *(u32 *)(dst_reg + off16) += src_reg
    BPF_XADD | BPF_DW | BPF_STX: lock xadd *(u64 *)(dst_reg + off16) += src_reg

Where size is one of: BPF_B or BPF_H or BPF_W or BPF_DW. Note that 1 and
2 byte atomic increments are not supported.

eBPF has one 16-byte instruction: BPF_LD | BPF_DW | BPF_IMM which consists
of two consecutive ``struct bpf_insn`` 8-byte blocks and interpreted as single
instruction that loads 64-bit immediate value into a dst_reg.
Classic BPF has similar instruction: BPF_LD | BPF_W | BPF_IMM which loads
32-bit immediate value into a register.
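
The 16-byte immediate load can be sketched as two consecutive 8-byte slots.
The sketch assumes that the low 32 bits of the value go into the 'imm' field
of the first slot and the high 32 bits into the 'imm' field of the second,
which is the convention of the kernel's BPF_LD_IMM64 helper macro. The struct
and function names (``ex_insn``, ``ex_ld_imm64``, ``ex_read_imm64``) are
hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdint.h>

struct ex_insn {		/* same field widths as one eBPF instruction */
	uint8_t  code;
	uint8_t  dst_reg:4;
	uint8_t  src_reg:4;
	int16_t  off;
	int32_t  imm;
};

#define EX_LD	0x00		/* instruction class */
#define EX_DW	0x18		/* size modifier: double word */
#define EX_IMM	0x00		/* mode modifier: immediate */

/* Fill two consecutive slots with one 64-bit immediate load into dst_reg. */
static void ex_ld_imm64(struct ex_insn pair[2], uint8_t dst_reg, uint64_t val)
{
	pair[0] = (struct ex_insn){ .code = EX_LD | EX_DW | EX_IMM,
				    .dst_reg = dst_reg,
				    .imm = (int32_t)(uint32_t)val };
	/* The second slot carries only the upper half of the immediate. */
	pair[1] = (struct ex_insn){ .imm = (int32_t)(uint32_t)(val >> 32) };
}

/* Reassemble the 64-bit value from the two 'imm' halves. */
static uint64_t ex_read_imm64(const struct ex_insn pair[2])
{
	return (uint64_t)(uint32_t)pair[0].imm |
	       ((uint64_t)(uint32_t)pair[1].imm << 32);
}

void ex_ld_imm64_demo(void)
{
	struct ex_insn pair[2];

	ex_ld_imm64(pair, 1, 0xffffffff80000000ULL);
	assert(ex_read_imm64(pair) == 0xffffffff80000000ULL);
}
```

Because the pair is read as a single instruction, a jump must never target
the second slot.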

eBPF verifier
-------------
The safety of the eBPF program is determined in two steps.

First step does DAG check to disallow loops and other CFG validation.
In particular it will detect programs that have unreachable instructions.
(though classic BPF checker allows them)

Second step starts from the first insn and descends all possible paths.
It simulates execution of every insn and observes the state change of
registers and stack.

At the start of the program the register R1 contains a pointer to context
and has type PTR_TO_CTX.
If verifier sees an insn that does R2=R1, then R2 has now type
PTR_TO_CTX as well and can be used on the right hand side of expression.
If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE,
since addition of two valid pointers makes invalid pointer.
(In 'secure' mode verifier will reject any type of pointer arithmetic to make
sure that kernel addresses don't leak to unprivileged users)

If register was never written to, it's not readable::

  bpf_mov R0 = R2
  bpf_exit

will be rejected, since R2 is unreadable at the start of the program.

After kernel function call, R1-R5 are reset to unreadable and
R0 has a return type of the function.

Since R6-R9 are callee saved, their state is preserved across the call.

::

  bpf_mov R6 = 1
  bpf_call foo
  bpf_mov R0 = R6
  bpf_exit

is a correct program. If there was R1 instead of R6, it would have
been rejected.

load/store instructions are allowed only with registers of valid types, which
are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked.
For example::

  bpf_mov R1 = 1
  bpf_mov R2 = 2
  bpf_xadd *(u32 *)(R1 + 3) += R2
  bpf_exit

will be rejected, since R1 doesn't have a valid pointer type at the time of
execution of instruction bpf_xadd.

At the start R1 type is PTR_TO_CTX (a pointer to generic ``struct bpf_context``).
A callback is used to customize verifier to restrict eBPF program access to only
certain fields within ctx structure with specified size and alignment.

For example, the following insn::

  bpf_ld R0 = *(u32 *)(R6 + 8)

intends to load a word from address R6 + 8 and store it into R0.
If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know
that offset 8 of size 4 bytes can be accessed for reading, otherwise
the verifier will reject the program.
If R6=PTR_TO_STACK, then access should be aligned and be within
stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8,
so it will fail verification, since it's out of bounds.

The verifier will allow eBPF program to read data from stack only after
it wrote into it.

Classic BPF verifier does similar check with M[0-15] memory slots.
For example::

  bpf_ld R0 = *(u32 *)(R10 - 4)
  bpf_exit

is an invalid program.
Though R10 is a correct read-only register and has type PTR_TO_STACK
and R10 - 4 is within stack bounds, there were no stores into that location.

Pointer register spill/fill is tracked as well, since four (R6-R9)
callee saved registers may not be enough for some programs.

Allowed function calls are customized with bpf_verifier_ops->get_func_proto().
The eBPF verifier will check that registers match argument constraints.
After the call register R0 will be set to return type of the function.

Function calls are the main mechanism to extend functionality of eBPF programs.
Socket filters may let programs call one set of functions, whereas tracing
filters may allow a completely different set.

If a function is made accessible to eBPF programs, it needs to be thought
through from a safety point of view. The verifier will guarantee that the
function is called with valid arguments.

seccomp vs socket filters have different security restrictions for classic BPF.
Seccomp solves this by two stage verifier: classic BPF verifier is followed
by seccomp verifier. In case of eBPF one configurable verifier is shared for
all use cases.

See details of eBPF verifier in kernel/bpf/verifier.c
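
The stack rules described in this section (bounds of [-MAX_BPF_STACK, 0),
aligned access, and no reads before a write) can be modelled with a small
user-space sketch. This is not the verifier's implementation: the ``ex_``
names are hypothetical, a byte-granular "written" map is assumed for
simplicity, and ``EX_MAX_STACK`` is assumed to be 512 to match MAX_BPF_STACK.

```c
#include <assert.h>
#include <stdbool.h>

#define EX_MAX_STACK 512	/* assumed value of MAX_BPF_STACK */

struct ex_stack {
	bool written[EX_MAX_STACK];	/* written[i]: byte at fp - 1 - i */
};

/* Offsets are relative to the frame pointer (R10); sizes are 1, 2, 4 or 8. */
static bool ex_access_ok(int off, int size)
{
	return off >= -EX_MAX_STACK &&	/* not below the stack */
	       off + size <= 0 &&	/* not at or above the frame pointer */
	       off % size == 0;		/* naturally aligned */
}

/* A store marks the touched bytes as initialised. */
static bool ex_stack_write(struct ex_stack *s, int off, int size)
{
	if (!ex_access_ok(off, size))
		return false;
	for (int i = 0; i < size; i++)
		s->written[-(off + i) - 1] = true;
	return true;
}

/* A load is accepted only if every byte was stored to earlier. */
static bool ex_stack_read(const struct ex_stack *s, int off, int size)
{
	if (!ex_access_ok(off, size))
		return false;
	for (int i = 0; i < size; i++)
		if (!s->written[-(off + i) - 1])
			return false;
	return true;
}

void ex_stack_demo(void)
{
	struct ex_stack s = { { false } };

	assert(!ex_stack_read(&s, 8, 4));	/* offset 8: out of bounds */
	assert(!ex_stack_read(&s, -4, 4));	/* R10 - 4: never written */
	assert(ex_stack_write(&s, -4, 4));
	assert(ex_stack_read(&s, -4, 4));	/* readable after the store */
}
```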

Register value tracking
-----------------------
In order to determine the safety of an eBPF program, the verifier must track
the range of possible values in each register and also in each stack slot.
This is done with ``struct bpf_reg_state``, defined in
include/linux/bpf_verifier.h, which unifies tracking of scalar and pointer
values. Each register state has a type, which is either NOT_INIT (the register
has not been written to), SCALAR_VALUE (some value which is not usable as a
pointer), or a pointer type. The types of pointers describe their base, as
follows:

    PTR_TO_CTX
        Pointer to bpf_context.
    CONST_PTR_TO_MAP
        Pointer to struct bpf_map. "Const" because arithmetic
        on these pointers is forbidden.
    PTR_TO_MAP_VALUE
        Pointer to the value stored in a map element.
    PTR_TO_MAP_VALUE_OR_NULL
        Either a pointer to a map value, or NULL; map accesses
        (see section 'eBPF maps', below) return this type,
        which becomes a PTR_TO_MAP_VALUE when checked != NULL.
        Arithmetic on these pointers is forbidden.
    PTR_TO_STACK
        Frame pointer.
    PTR_TO_PACKET
        skb->data.
    PTR_TO_PACKET_END
        skb->data + headlen; arithmetic forbidden.
    PTR_TO_SOCKET
        Pointer to struct bpf_sock_ops, implicitly refcounted.
    PTR_TO_SOCKET_OR_NULL
        Either a pointer to a socket, or NULL; socket lookup
        returns this type, which becomes a PTR_TO_SOCKET when
        checked != NULL. PTR_TO_SOCKET is reference-counted,
        so programs must release the reference through the
        socket release function before the end of the program.
        Arithmetic on these pointers is forbidden.

However, a pointer may be offset from this base (as a result of pointer
arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable
offset'. The former is used when an exactly-known value (e.g. an immediate
operand) is added to a pointer, while the latter is used for values which are
not exactly known. The variable offset is also used in SCALAR_VALUEs, to track
the range of possible values in the register.

The verifier's knowledge about the variable offset consists of:

* minimum and maximum values as unsigned
* minimum and maximum values as signed

* knowledge of the values of individual bits, in the form of a 'tnum': a u64
  'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown;
  1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both
  mask and value; no bit should ever be 1 in both. For example, if a byte is read
  into a register from memory, the register's top 56 bits are known zero, while
  the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we
  then OR this with 0x40, we get (0x40; 0xbf), then if we add 1 we get (0x0;
  0x1ff), because of potential carries.
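
The tnum arithmetic in the example above can be reproduced with a short
user-space sketch. The ``ex_tnum`` names are local to this example; the OR and
ADD rules are modelled on the kernel's tnum helpers, and the pairs are written
in the same (value; mask) order used in the text.

```c
#include <assert.h>
#include <stdint.h>

struct ex_tnum {
	uint64_t value;	/* 1s: bits known to be 1 */
	uint64_t mask;	/* 1s: bits whose value is unknown */
};

/* A fully known constant has an empty mask. */
static struct ex_tnum ex_tnum_const(uint64_t v)
{
	return (struct ex_tnum){ .value = v, .mask = 0 };
}

/* OR: a bit known to be 1 on either side is known to be 1 in the result. */
static struct ex_tnum ex_tnum_or(struct ex_tnum a, struct ex_tnum b)
{
	uint64_t v = a.value | b.value;
	uint64_t mu = a.mask | b.mask;

	return (struct ex_tnum){ .value = v, .mask = mu & ~v };
}

/* ADD: any bit that a carry from an unknown bit could reach is unknown. */
static struct ex_tnum ex_tnum_add(struct ex_tnum a, struct ex_tnum b)
{
	uint64_t sm = a.mask + b.mask;
	uint64_t sv = a.value + b.value;
	uint64_t sigma = sm + sv;
	uint64_t chi = sigma ^ sv;	/* bits changed by possible carries */
	uint64_t mu = chi | a.mask | b.mask;

	return (struct ex_tnum){ .value = sv & ~mu, .mask = mu };
}

void ex_tnum_demo(void)
{
	struct ex_tnum t = { .value = 0x0, .mask = 0xff };	/* byte load */

	t = ex_tnum_or(t, ex_tnum_const(0x40));
	assert(t.value == 0x40 && t.mask == 0xbf);

	t = ex_tnum_add(t, ex_tnum_const(1));
	assert(t.value == 0x0 && t.mask == 0x1ff);
}
```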

Besides arithmetic, the register state can also be updated by conditional
branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch
it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false'
branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or
BPF_JSGE) would instead update the signed minimum/maximum values. Information
from the signed and unsigned bounds can be combined; for instance if a value is
first tested < 8 and then tested s> 4, the verifier will conclude that the value
is also > 4 and s< 8, since the bounds prevent crossing the sign boundary.

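The unsigned half of that branch refinement can be sketched as follows. This
is a simplified model with hypothetical names (``ex_bounds``,
``ex_refine_jgt``); it tracks only the unsigned minimum and maximum and
assumes the compared immediate is below UINT64_MAX so that imm + 1 does not
wrap.

```c
#include <assert.h>
#include <stdint.h>

struct ex_bounds {
	uint64_t umin;	/* unsigned minimum value */
	uint64_t umax;	/* unsigned maximum value */
};

/*
 * Refine the bounds after an unsigned 'reg > imm' test.
 * 'taken' selects which branch is being followed.
 */
static struct ex_bounds ex_refine_jgt(struct ex_bounds b, uint64_t imm, int taken)
{
	if (taken) {
		if (b.umin < imm + 1)
			b.umin = imm + 1;	/* reg > imm, so reg >= imm + 1 */
	} else {
		if (b.umax > imm)
			b.umax = imm;		/* reg <= imm */
	}
	return b;
}

void ex_bounds_demo(void)
{
	struct ex_bounds any = { .umin = 0, .umax = UINT64_MAX };

	assert(ex_refine_jgt(any, 8, 1).umin == 9);	/* 'true' branch */
	assert(ex_refine_jgt(any, 8, 0).umax == 8);	/* 'false' branch */
}
```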
0cbf47416 Edward Cree Documentation: de...	1156 1157	PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all pointers sharing that same variable offset. This is important for packet range
68625b763 Wang YanQing bpf, doc: clarifi...	1158 1159 1160 1161 1162 1163	checks: after adding a variable to a packet pointer register A, if you then copy it to another register B and then add a constant 4 to A, both registers will share the same 'id' but the A will have a fixed offset of +4. Then if A is bounds-checked and found to be less than a PTR_TO_PACKET_END, the register B is now known to have a safe range of at least 4 bytes. See 'Direct packet access', below, for more on PTR_TO_PACKET ranges.
The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of
the pointer returned from a map lookup.  This means that when one copy is
checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs.

As well as range-checking, the tracked information is also used for enforcing
alignment of pointer accesses.  For instance, on most systems the packet pointer
is 2 bytes after a 4-byte alignment.  If a program adds 14 bytes to that to jump
over the Ethernet header, then reads IHL and adds (IHL * 4), the resulting
pointer will have a variable offset known to be 4n+2 for some n, so adding the 2
bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through
that pointer are safe.
The 'id' field is also used on PTR_TO_SOCKET and PTR_TO_SOCKET_OR_NULL, common
to all copies of the pointer returned from a socket lookup.  This has similar
behaviour to the handling for PTR_TO_MAP_VALUE_OR_NULL->PTR_TO_MAP_VALUE, but
it also handles reference tracking for the pointer.  PTR_TO_SOCKET implicitly
represents a reference to the corresponding ``struct sock``.  To ensure that the
reference is not leaked, it is imperative to NULL-check the reference and, in
the non-NULL case, pass the valid reference to the socket release function.
Direct packet access
--------------------
In cls_bpf and act_bpf programs the verifier allows direct access to the packet
data via skb->data and skb->data_end pointers.
Ex::

    1:  r4 = *(u32 *)(r1 +80)  /* load skb->data_end */
    2:  r3 = *(u32 *)(r1 +76)  /* load skb->data */
    3:  r5 = r3
    4:  r5 += 14
    5:  if r5 > r4 goto pc+16
    R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
    6:  r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */
This 2-byte load from the packet is safe to do, since the program author
did check ``if (skb->data + 14 > skb->data_end) goto err`` at insn #5, which
means that in the fall-through case the register R3 (which points to skb->data)
has at least 14 directly accessible bytes.  The verifier marks it
as R3=pkt(id=0,off=0,r=14).
id=0 means that no additional variables were added to the register.
off=0 means that no additional constants were added.
r=14 is the range of safe access, which means that bytes [R3, R3 + 14) are ok.
Note that R5 is marked as R5=pkt(id=0,off=14,r=14).  It also points
to the packet data, but constant 14 was added to the register, so
it now points to ``skb->data + 14`` and the accessible range is
[R5, R5 + 14 - 14), which is zero bytes.
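The relationship between 'off', 'r' and an access can be written down as a
one-line check.  This is a sketch under simplifying assumptions (id=0, no
variable offset); ``struct pkt_reg``, ``pkt_access_ok()`` and ``pkt_demo()``
are hypothetical names chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, reduced view of a PTR_TO_PACKET register with id=0. */
struct pkt_reg {
    uint32_t off;   /* constant already added to skb->data */
    uint32_t range; /* bytes from skb->data proven to lie before data_end */
};

/* A 'size'-byte access at (reg + insn_off) is safe if it ends within range. */
bool pkt_access_ok(struct pkt_reg reg, uint32_t insn_off, uint32_t size)
{
    return (uint64_t)reg.off + insn_off + size <= reg.range;
}

/* Register states from the listing above, after the check at insn #5. */
void pkt_demo(void)
{
    struct pkt_reg r3 = { .off = 0,  .range = 14 };
    struct pkt_reg r5 = { .off = 14, .range = 14 };

    assert(pkt_access_ok(r3, 12, 2));   /* insn #6: bytes 12 and 13 */
    assert(!pkt_access_ok(r3, 13, 2));  /* would touch byte 14 */
    assert(!pkt_access_ok(r5, 0, 1));   /* R5 has zero accessible bytes */
}
```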
More complex packet access may look like::

    R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp
    6:  r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */
    7:  r4 = *(u8 *)(r3 +12)
    8:  r4 *= 14
    9:  r3 = *(u32 *)(r1 +76) /* load skb->data */
    10:  r3 += r4
    11:  r2 = r1
    12:  r2 <<= 48
    13:  r2 >>= 48
    14:  r3 += r2
    15:  r2 = r3
    16:  r2 += 8
    17:  r1 = *(u32 *)(r1 +80) /* load skb->data_end */
    18:  if r2 > r1 goto pc+2
    R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp
    19:  r1 = *(u8 *)(r3 +4)
The state of the register R3 is R3=pkt(id=2,off=0,r=8).
id=2 means that two ``r3 += rX`` instructions were seen, so r3 points to some
offset within a packet and since the program author did
``if (r3 + 8 > r1) goto err`` at insn #18, the safe range is [R3, R3 + 8).

The verifier only allows 'add'/'sub' operations on packet registers.  Any other
operation will set the register state to 'SCALAR_VALUE' and it won't be
available for direct packet access.
Operation ``r3 += rX`` may overflow and become less than the original skb->data,
therefore the verifier has to prevent that.  So when it sees an ``r3 += rX``
instruction and rX is more than a 16-bit value, any subsequent bounds-check of r3
against skb->data_end will not give us 'range' information, so attempts to read
through the pointer will give an "invalid access to packet" error.

For example, after insn ``r4 = *(u8 *)(r3 +12)`` (insn #7 above) the state of r4 is
R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)), which means that the upper 56 bits
of the register are guaranteed to be zero, and nothing is known about the lower
8 bits.  After insn ``r4 *= 14`` the state becomes
R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit
value by constant 14 will keep the upper 52 bits as zero; also the least
significant bit will be zero as 14 is even.  Similarly ``r2 >>= 48`` will make
R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign
extending.  This logic is implemented in the adjust_reg_min_max_vals() function,
which calls adjust_ptr_min_max_vals() for adding a pointer to a scalar (or vice
versa) and adjust_scalar_min_max_vals() for operations on two scalars.
The end result is that a bpf program author can access the packet directly
using normal C code as::

    void *data = (void *)(long)skb->data;
    void *data_end = (void *)(long)skb->data_end;
    struct eth_hdr *eth = data;
    struct iphdr *iph = data + sizeof(*eth);
    struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph);

    if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end)
            return 0;
    if (eth->h_proto != htons(ETH_P_IP))
            return 0;
    if (iph->protocol != IPPROTO_UDP || iph->ihl != 5)
            return 0;
    if (udp->dest == 53 || udp->source == 9)
            ...;

which makes such programs easier to write compared to the LD_ABS insn
and significantly faster.
eBPF maps
---------
'maps' is a generic storage of different types for sharing data between kernel
and userspace.

The maps are accessed from user space via BPF syscall, which has commands:

- create a map with given type and attributes
  ``map_fd = bpf(BPF_MAP_CREATE, union bpf_attr *attr, u32 size)``
  using attr->map_type, attr->key_size, attr->value_size, attr->max_entries
  returns process-local file descriptor or negative error

- lookup key in a given map
  ``err = bpf(BPF_MAP_LOOKUP_ELEM, union bpf_attr *attr, u32 size)``
  using attr->map_fd, attr->key, attr->value
  returns zero and stores found elem into value or negative error

- create or update key/value pair in a given map
  ``err = bpf(BPF_MAP_UPDATE_ELEM, union bpf_attr *attr, u32 size)``
  using attr->map_fd, attr->key, attr->value
  returns zero or negative error

- find and delete element by key in a given map
  ``err = bpf(BPF_MAP_DELETE_ELEM, union bpf_attr *attr, u32 size)``
  using attr->map_fd, attr->key

- to delete map: close(fd)
  Exiting process will delete maps automatically

userspace programs use this syscall to create/access maps that eBPF programs
are concurrently updating.

maps can have different types: hash, array, bloom filter, radix-tree, etc.

The map is defined by:

  - type
  - max number of elements
  - key size in bytes
  - value size in bytes
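The commands above can be exercised from user space roughly as follows.  This
is a sketch, assuming a Linux system with <linux/bpf.h> installed and enough
privilege to create maps; libc offers no ``bpf()`` wrapper, so the call goes
through ``syscall(2)``, which reports errors as -1 with errno set rather than
as a negative return value.  The helper names (``sys_bpf()``,
``map_create_hash()``, ``map_update()``, ``map_lookup()``, ``map_delete()``,
``map_demo()``) are hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <linux/bpf.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Thin wrapper around the raw bpf syscall. */
static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* BPF_MAP_CREATE: returns a process-local fd, or -1 with errno set. */
int map_create_hash(uint32_t key_size, uint32_t value_size, uint32_t max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));   /* unused fields must be zero */
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;
    return (int)sys_bpf(BPF_MAP_CREATE, &attr);
}

/* BPF_MAP_UPDATE_ELEM: create or overwrite the key/value pair. */
int map_update(int fd, const void *key, const void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = (uint64_t)(uintptr_t)key;
    attr.value = (uint64_t)(uintptr_t)value;
    attr.flags = BPF_ANY;
    return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
}

/* BPF_MAP_LOOKUP_ELEM: on success the element is copied into *value. */
int map_lookup(int fd, const void *key, void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = (uint64_t)(uintptr_t)key;
    attr.value = (uint64_t)(uintptr_t)value;
    return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
}

/* BPF_MAP_DELETE_ELEM: find and delete the element by key. */
int map_delete(int fd, const void *key)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = fd;
    attr.key = (uint64_t)(uintptr_t)key;
    return (int)sys_bpf(BPF_MAP_DELETE_ELEM, &attr);
}

/*
 * Round trip: create, update, look up, delete, look up again.
 * Returns 0 on success, -1 if the kernel refuses to create the map
 * (for example, missing privileges), -2 on any unexpected result.
 */
int map_demo(void)
{
    uint32_t key = 1;
    uint64_t value = 42, out = 0;
    int fd = map_create_hash(sizeof(key), sizeof(value), 16);
    int ok;

    if (fd < 0)
        return -1;
    ok = map_update(fd, &key, &value) == 0 &&
         map_lookup(fd, &key, &out) == 0 && out == 42 &&
         map_delete(fd, &key) == 0 &&
         map_lookup(fd, &key, &out) == -1 && errno == ENOENT;
    close(fd);                        /* closing the last fd deletes the map */
    return ok ? 0 : -2;
}
```

Whether ``map_demo()`` returns 0 or -1 depends on the privileges of the calling
process and on the system's unprivileged BPF settings.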
Pruning
-------
The verifier does not actually walk all possible paths through the program.  For
each new branch to analyse, the verifier looks at all the states it's previously
been in when at this instruction.  If any of them contain the current state as a
subset, the branch is 'pruned' - that is, the fact that the previous state was
accepted implies the current state would be as well.  For instance, if in the
previous state, r1 held a packet-pointer, and in the current state, r1 holds a
packet-pointer with a range as long or longer and at least as strict an
alignment, then r1 is safe.  Similarly, if r2 was NOT_INIT before then it can't
have been used by any path from that point, so any value in r2 (including
another NOT_INIT) is safe.  The implementation is in the function regsafe().

Pruning considers not only the registers but also the stack (and any spilled
registers it may hold).  They must all be safe for the branch to be pruned.
This is implemented in states_equal().
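The subset test can be sketched as follows.  This is a heavily reduced model,
not the kernel's regsafe(): it ignores alignment, 'id's, signed bounds, tnums
and the stack, and the type and function names (``struct reg_state``,
``reg_safe()``, ``state_safe()``, ``pruning_demo()``) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

enum reg_type { NOT_INIT, SCALAR_VALUE, PTR_TO_PACKET };

/* Hypothetical, reduced register state. */
struct reg_state {
    enum reg_type type;
    uint64_t umin, umax; /* SCALAR_VALUE: unsigned bounds */
    uint32_t range;      /* PTR_TO_PACKET: bytes proven accessible */
};

/* True if a path already accepted with 'old' also covers 'cur'. */
bool reg_safe(const struct reg_state *old, const struct reg_state *cur)
{
    switch (old->type) {
    case NOT_INIT:
        /* The old path never read this register, so any value is fine. */
        return true;
    case SCALAR_VALUE:
        /* Every value 'cur' may hold was already considered under 'old'. */
        return cur->type == SCALAR_VALUE &&
               old->umin <= cur->umin && cur->umax <= old->umax;
    case PTR_TO_PACKET:
        /* 'cur' must permit at least the accesses 'old' permitted. */
        return cur->type == PTR_TO_PACKET && cur->range >= old->range;
    }
    return false;
}

/* All registers must be safe for the branch to be pruned. */
bool state_safe(const struct reg_state *old, const struct reg_state *cur,
                size_t nregs)
{
    for (size_t i = 0; i < nregs; i++)
        if (!reg_safe(&old[i], &cur[i]))
            return false;
    return true;
}

/* r1: packet pointer with a longer range; r2: NOT_INIT in the old state. */
void pruning_demo(void)
{
    struct reg_state old[2] = {
        { .type = PTR_TO_PACKET, .range = 8 },
        { .type = NOT_INIT },
    };
    struct reg_state cur[2] = {
        { .type = PTR_TO_PACKET, .range = 14 },
        { .type = SCALAR_VALUE, .umin = 0, .umax = 255 },
    };

    assert(state_safe(old, cur, 2));   /* covered: the branch can be pruned */
    assert(!state_safe(cur, old, 2));  /* the reverse does not hold */
}
```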
Understanding eBPF verifier messages
------------------------------------

The following are a few examples of invalid eBPF programs and verifier error
messages as seen in the log:

Program with unreachable instructions::

    static struct bpf_insn prog[] = {
    BPF_EXIT_INSN(),
    BPF_EXIT_INSN(),
    };

Error::

    unreachable insn 1
Program that reads uninitialized register::

    BPF_MOV64_REG(BPF_REG_0, BPF_REG_2),
    BPF_EXIT_INSN(),

Error::

    0: (bf) r0 = r2
    R2 !read_ok
Program that doesn't initialize R0 before exiting::

    BPF_MOV64_REG(BPF_REG_2, BPF_REG_1),
    BPF_EXIT_INSN(),

Error::

    0: (bf) r2 = r1
    1: (95) exit
    R0 !read_ok
Program that accesses stack out of bounds::

    BPF_ST_MEM(BPF_DW, BPF_REG_10, 8, 0),
    BPF_EXIT_INSN(),

Error::

    0: (7a) *(u64 *)(r10 +8) = 0
    invalid stack off=8 size=8

Program that doesn't initialize stack before passing its address into function::

    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_LD_MAP_FD(BPF_REG_1, 0),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_EXIT_INSN(),

Error::

    0: (bf) r2 = r10
    1: (07) r2 += -8
    2: (b7) r1 = 0x0
    3: (85) call 1
    invalid indirect read from stack off -8+0 size 8
Program that uses invalid map_fd=0 while calling to map_lookup_elem() function::

    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_LD_MAP_FD(BPF_REG_1, 0),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_EXIT_INSN(),

Error::

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (b7) r1 = 0x0
    4: (85) call 1
    fd 0 is not pointing to valid bpf_map

Program that doesn't check return value of map_lookup_elem() before accessing
map element::

    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_LD_MAP_FD(BPF_REG_1, 0),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
    BPF_EXIT_INSN(),

Error::

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (b7) r1 = 0x0
    4: (85) call 1
    5: (7a) *(u64 *)(r0 +0) = 0
    R0 invalid mem access 'map_value_or_null'

Program that correctly checks map_lookup_elem() returned value for NULL, but
accesses the memory with incorrect alignment::

    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_LD_MAP_FD(BPF_REG_1, 0),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 1),
    BPF_ST_MEM(BPF_DW, BPF_REG_0, 4, 0),
    BPF_EXIT_INSN(),

Error::

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (b7) r1 = 1
    4: (85) call 1
    5: (15) if r0 == 0x0 goto pc+1
     R0=map_ptr R10=fp
    6: (7a) *(u64 *)(r0 +4) = 0
    misaligned access off 4 size 8

Program that correctly checks map_lookup_elem() returned value for NULL and
accesses memory with correct alignment in one side of 'if' branch, but fails
to do so in the other side of 'if' branch::

    BPF_ST_MEM(BPF_DW, BPF_REG_10, -8, 0),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_LD_MAP_FD(BPF_REG_1, 0),
    BPF_RAW_INSN(BPF_JMP | BPF_CALL, 0, 0, 0, BPF_FUNC_map_lookup_elem),
    BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2),
    BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 0),
    BPF_EXIT_INSN(),
    BPF_ST_MEM(BPF_DW, BPF_REG_0, 0, 1),
    BPF_EXIT_INSN(),

Error::

    0: (7a) *(u64 *)(r10 -8) = 0
    1: (bf) r2 = r10
    2: (07) r2 += -8
    3: (b7) r1 = 1
    4: (85) call 1
    5: (15) if r0 == 0x0 goto pc+2
     R0=map_ptr R10=fp
    6: (7a) *(u64 *)(r0 +0) = 0
    7: (95) exit

    from 5 to 8: R0=imm0 R10=fp
    8: (7a) *(u64 *)(r0 +0) = 1
    R0 invalid mem access 'imm'
Program that performs a socket lookup then sets the pointer to NULL without
checking it::

    BPF_MOV64_IMM(BPF_REG_2, 0),
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_MOV64_IMM(BPF_REG_3, 4),
    BPF_MOV64_IMM(BPF_REG_4, 0),
    BPF_MOV64_IMM(BPF_REG_5, 0),
    BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
    BPF_MOV64_IMM(BPF_REG_0, 0),
    BPF_EXIT_INSN(),

Error::

    0: (b7) r2 = 0
    1: (63) *(u32 *)(r10 -8) = r2
    2: (bf) r2 = r10
    3: (07) r2 += -8
    4: (b7) r3 = 4
    5: (b7) r4 = 0
    6: (b7) r5 = 0
    7: (85) call bpf_sk_lookup_tcp#65
    8: (b7) r0 = 0
    9: (95) exit
    Unreleased reference id=1, alloc_insn=7

Program that performs a socket lookup but does not NULL-check the returned
value::

    BPF_MOV64_IMM(BPF_REG_2, 0),
    BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_2, -8),
    BPF_MOV64_REG(BPF_REG_2, BPF_REG_10),
    BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -8),
    BPF_MOV64_IMM(BPF_REG_3, 4),
    BPF_MOV64_IMM(BPF_REG_4, 0),
    BPF_MOV64_IMM(BPF_REG_5, 0),
    BPF_EMIT_CALL(BPF_FUNC_sk_lookup_tcp),
    BPF_EXIT_INSN(),

Error::

    0: (b7) r2 = 0
    1: (63) *(u32 *)(r10 -8) = r2
    2: (bf) r2 = r10
    3: (07) r2 += -8
    4: (b7) r3 = 4
    5: (b7) r4 = 0
    6: (b7) r5 = 0
    7: (85) call bpf_sk_lookup_tcp#65
    8: (95) exit
    Unreleased reference id=1, alloc_insn=7
Testing
-------

Next to the BPF toolchain, the kernel also ships a test module that contains
various test cases for classic and internal BPF that can be executed against
the BPF interpreter and JIT compiler.  It can be found in lib/test_bpf.c and
enabled via Kconfig::

    CONFIG_TEST_BPF=m

After the module has been built and installed, the test suite can be executed
via insmod or modprobe against the 'test_bpf' module.  Results of the test cases
including timings in nsec can be found in the kernel log (dmesg).
Misc
----

Also trinity, the Linux syscall fuzzer, has built-in support for BPF and
SECCOMP-BPF kernel fuzzing.

Written by
----------

The document was written in the hope that it is found useful and in order
to give potential BPF hackers or security auditors a better overview of
the underlying architecture.

- Jay Schulist <jschlst@samba.org>
- Daniel Borkmann <daniel@iogearbox.net>
- Alexei Starovoitov <ast@kernel.org>