Eric Lee / smarc-fsl-linux-kernel

09 Mar, 2016

7 commits

26e909311 samples/bpf: add map performance test ... Browse Code »

performance tests for hash map and per-cpu hash map
with and without pre-allocation

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 12:22:03 +0800
7dcc42b68 samples/bpf: stress test bpf_get_stackid ... Browse Code »

increase stress by also calling bpf_get_stackid() from
various *spin* functions

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 12:22:02 +0800
9d8b612d8 samples/bpf: add bpf map stress test ... Browse Code »

this test calls bpf programs from different contexts:
from inside of slub, from rcu, from pretty much everywhere,
since it kprobes all spin_lock functions.
It stresses the bpf hash and percpu map pre-allocation,
deallocation logic and call_rcu mechanisms.
User space part adding more stress by walking and deleting map elements.

Note that due to nature bpf_load.c the earlier kprobe+bpf programs are
already active while loader loads new programs, creates new kprobes and
attaches them.

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 12:22:02 +0800
c3f85cffc samples/bpf: test both pre-alloc and normal maps ... Browse Code »

extend test coveraged to include pre-allocated and run-time alloc maps

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 04:28:32 +0800
89b976070 samples/bpf: add map_flags to bpf loader ... Browse Code »

note old loader is compatible with new kernel.
map_flags are optional

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 04:28:32 +0800
3622e7e49 samples/bpf: move ksym_search() into library ... Browse Code »

move ksym search from offwaketime into library to be reused
in other tests

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 04:28:32 +0800
618ec9a7b samples/bpf: make map creation more verbose ... Browse Code »

map creation is typically the first one to fail when rlimits are
too low, not enough memory, etc
Make this failure scenario more verbose

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-03-09 04:28:32 +0800

20 Feb, 2016

1 commit

a6ffe7b9d samples/bpf: offwaketime example ... Browse Code »

This is simplified version of Brendan Gregg's offwaketime:
This program shows kernel stack traces and task names that were blocked and
"off-CPU", along with the stack traces and task names for the threads that woke
them, and the total elapsed time from when they blocked to when they were woken
up. The combined stacks, task names, and total time is summarized in kernel
context for efficiency.

Example:
$ sudo ./offwaketime | flamegraph.pl > demo.svg
Open demo.svg in the browser as FlameGraph visualization.

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-02-20 13:21:44 +0800

06 Feb, 2016

3 commits

3059303f5 samples/bpf: update tracex[23] examples to use per-cpu maps ... Browse Code »

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2016-02-06 16:34:36 +0800
df570f577 samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_ARRAY ... Browse Code »

A sanity test for BPF_MAP_TYPE_PERCPU_ARRAY

Signed-off-by: Ming Lei
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

tom.leiming@gmail.com
2016-02-06 16:34:36 +0800
e15596717 samples/bpf: unit test for BPF_MAP_TYPE_PERCPU_HASH ... Browse Code »

A sanity test for BPF_MAP_TYPE_PERCPU_HASH.

Signed-off-by: Martin KaFai Lau
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Martin KaFai Lau
2016-02-06 16:34:36 +0800

17 Nov, 2015

1 commit

30b50aa61 bpf: samples: exclude asm/sysreg.h for arm64 ... Browse Code »

commit 338d4f49d6f7114a017d294ccf7374df4f998edc
("arm64: kernel: Add support for Privileged Access Never") includes sysreg.h
into futex.h and uaccess.h. But, the inline assembly used by asm/sysreg.h is
incompatible with llvm so it will cause BPF samples build failure for ARM64.
Since sysreg.h is useless for BPF samples, just exclude it from Makefile via
defining __ASM_SYSREG_H.

Signed-off-by: Yang Shi
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Yang Shi
2015-11-17 03:40:49 +0800

03 Nov, 2015

1 commit

42984d7c1 bpf: add sample usages for persistent maps/progs ... Browse Code »

This patch adds a couple of stand-alone examples on how BPF_OBJ_PIN
and BPF_OBJ_GET commands can be used.

Example with maps:

# ./fds_example -F /sys/fs/bpf/m -P -m -k 1 -v 42
bpf: map fd:3 (Success)
bpf: pin ret:(0,Success)
bpf: fd:3 u->(1:42) ret:(0,Success)
# ./fds_example -F /sys/fs/bpf/m -G -m -k 1
bpf: get fd:3 (Success)
bpf: fd:3 l->(1):42 ret:(0,Success)
# ./fds_example -F /sys/fs/bpf/m -G -m -k 1 -v 24
bpf: get fd:3 (Success)
bpf: fd:3 u->(1:24) ret:(0,Success)
# ./fds_example -F /sys/fs/bpf/m -G -m -k 1
bpf: get fd:3 (Success)
bpf: fd:3 l->(1):24 ret:(0,Success)

# ./fds_example -F /sys/fs/bpf/m2 -P -m
bpf: map fd:3 (Success)
bpf: pin ret:(0,Success)
# ./fds_example -F /sys/fs/bpf/m2 -G -m -k 1
bpf: get fd:3 (Success)
bpf: fd:3 l->(1):0 ret:(0,Success)
# ./fds_example -F /sys/fs/bpf/m2 -G -m
bpf: get fd:3 (Success)

Example with progs:

# ./fds_example -F /sys/fs/bpf/p -P -p
bpf: prog fd:3 (Success)
bpf: pin ret:(0,Success)
bpf sock:4
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-11-03 11:48:39 +0800

01 Nov, 2015

1 commit

b75ec3af2 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net Browse Code »

David S. Miller
2015-11-01 12:15:30 +0800

28 Oct, 2015

1 commit

85ff8a43f bpf: sample: define aarch64 specific registers ... Browse Code »

Define aarch64 specific registers for building bpf samples correctly.

Signed-off-by: Yang Shi
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Yang Shi
2015-10-28 10:51:10 +0800

22 Oct, 2015

1 commit

39111695b samples: bpf: add bpf_perf_event_output example ... Browse Code »

Performance test and example of bpf_perf_event_output().
kprobe is attached to sys_write() and trivial bpf program streams
pid+cookie into userspace via PERF_COUNT_SW_BPF_OUTPUT event.

Usage:
$ sudo ./bld_x64/samples/bpf/trace_output
recv 2968913 events per sec

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-10-22 21:42:15 +0800

13 Oct, 2015

1 commit

bf5088773 bpf: add unprivileged bpf tests ... Browse Code »

Add new tests samples/bpf/test_verifier:

unpriv: return pointer
checks that pointer cannot be returned from the eBPF program

unpriv: add const to pointer
unpriv: add pointer to pointer
unpriv: neg pointer
checks that pointer arithmetic is disallowed

unpriv: cmp pointer with const
unpriv: cmp pointer with pointer
checks that comparison of pointers is disallowed
Only one case allowed 'void *value = bpf_map_lookup_elem(..); if (value == 0) ...'

unpriv: check that printk is disallowed
since bpf_trace_printk is not available to unprivileged

unpriv: pass pointer to helper function
checks that pointers cannot be passed to functions that expect integers
If function expects a pointer the verifier allows only that type of pointer.
Like 1st argument of bpf_map_lookup_elem() must be pointer to map.
(applies to non-root as well)

unpriv: indirectly pass pointer on stack to helper function
checks that pointer stored into stack cannot be used as part of key
passed into bpf_map_lookup_elem()

unpriv: mangle pointer on stack 1
unpriv: mangle pointer on stack 2
checks that writing into stack slot that already contains a pointer
is disallowed

unpriv: read pointer from stack in small chunks
checks that < 8 byte read from stack slot that contains a pointer is
disallowed

unpriv: write pointer into ctx
checks that storing pointers into skb->fields is disallowed

unpriv: write pointer into map elem value
checks that storing pointers into element values is disallowed
For example:
int bpf_prog(struct __sk_buff *skb)
{
u32 key = 0;
u64 *value = bpf_map_lookup_elem(&map, &key);
if (value)
*value = (u64) skb;
}
will be rejected.

unpriv: partial copy of pointer
checks that doing 32-bit register mov from register containing
a pointer is disallowed

unpriv: pass pointer to tail_call
checks that passing pointer as an index into bpf_tail_call
is disallowed

unpriv: cmp map pointer with zero
checks that comparing map pointer with constant is disallowed

unpriv: write into frame pointer
checks that frame pointer is read-only (applies to root too)

unpriv: cmp of frame pointer
checks that R10 cannot be using in comparison

unpriv: cmp of stack pointer
checks that Rx = R10 - imm is ok, but comparing Rx is not

unpriv: obfuscate stack pointer
checks that Rx = R10 - imm is ok, but Rx -= imm is not

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-10-13 10:13:37 +0800

18 Sep, 2015

1 commit

27b29f630 bpf: add bpf_redirect() helper ... Browse Code »

Existing bpf_clone_redirect() helper clones skb before redirecting
it to RX or TX of destination netdev.
Introduce bpf_redirect() helper that does that without cloning.

Benchmarked with two hosts using 10G ixgbe NICs.
One host is doing line rate pktgen.
Another host is configured as:
$ tc qdisc add dev $dev ingress
$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section clone_redirect_xmit drop
so it receives the packet on $dev and immediately xmits it on $dev + 1
The section 'clone_redirect_xmit' in tcbpf1_kern.o file has the program
that does bpf_clone_redirect() and performance is 2.0 Mpps

$ tc filter add dev $dev root pref 10 u32 match u32 0 0 flowid 1:2 \
action bpf run object-file tcbpf1_kern.o section redirect_xmit drop
which is using bpf_redirect() - 2.4 Mpps

and using cls_bpf with integrated actions as:
$ tc filter add dev $dev root pref 10 \
bpf run object-file tcbpf1_kern.o section redirect_xmit integ_act classid 1
performance is 2.5 Mpps

To summarize:
u32+act_bpf using clone_redirect - 2.0 Mpps
u32+act_bpf using redirect - 2.4 Mpps
cls_bpf using redirect - 2.5 Mpps

For comparison linux bridge in this setup is doing 2.1 Mpps
and ixgbe rx + drop in ip_rcv - 7.8 Mpps

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Acked-by: John Fastabend
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-09-18 12:09:07 +0800

13 Aug, 2015

1 commit

5ed3ccbd5 bpf: fix build warnings and add function read_trace_pipe() ... Browse Code »

There are two improvements in this patch:
1. Fix the build warnings;
2. Add function read_trace_pipe() to print the result on
the screen;

Before this patch, we can get the result through /sys/kernel/de
bug/tracing/trace_pipe and get nothing on the screen.
By applying this patch, the result can be printed on the screen.
$ ./tracex6
...
tracex6-705 [003] d..1 131.428593: : CPU-3 19981414
sshd-683 [000] d..1 131.428727: : CPU-0 221682321
sshd-683 [000] d..1 131.428821: : CPU-0 221808766
sshd-683 [000] d..1 131.428950: : CPU-0 221982984
sshd-683 [000] d..1 131.429045: : CPU-0 222111851
tracex6-705 [003] d..1 131.429168: : CPU-3 20757551
sshd-683 [000] d..1 131.429170: : CPU-0 222281240
sshd-683 [000] d..1 131.429261: : CPU-0 222403340
sshd-683 [000] d..1 131.429378: : CPU-0 222561024
...

Signed-off-by: Kaixu Xia
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Kaixu Xia
2015-08-13 07:39:12 +0800

10 Aug, 2015

1 commit

47efb3027 samples/bpf: example of get selected PMU counter value ... Browse Code »

This is a simple example and shows how to use the new ability
to get the selected Hardware PMU counter value.

Signed-off-by: Kaixu Xia
Signed-off-by: David S. Miller

Kaixu Xia
2015-08-10 13:50:06 +0800

27 Jul, 2015

1 commit

24b4d2abd ebpf: Allow dereferences of PTR_TO_STACK registers ... Browse Code »

mov %rsp, %r1 ; r1 = rsp
add $-8, %r1 ; r1 = rsp - 8
store_q $123, -8(%rsp) ; *(u64*)r1 = 123
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alex Gartrell
2015-07-27 15:54:10 +0800

09 Jul, 2015

1 commit

d912557b3 samples: bpf: enable trace samples for s390x ... Browse Code »

The trace bpf samples do not compile on s390x because they use x86
specific fields from the "pt_regs" structure.

Fix this and access the fields via new PT_REGS macros.

Signed-off-by: Michael Holzheu
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Michael Holzheu
2015-07-09 06:17:45 +0800

23 Jun, 2015

1 commit

0fb1170ee bpf: BPF based latency tracing ... Browse Code »

BPF offers another way to generate latency histograms. We attach
kprobes at trace_preempt_off and trace_preempt_on and calculate the
time it takes to from seeing the off/on transition.

The first array is used to store the start time stamp. The key is the
CPU id. The second array stores the log2(time diff). We need to use
static allocation here (array and not hash tables). The kprobes
hooking into trace_preempt_on|off should not calling any dynamic
memory allocation or free path. We need to avoid recursivly
getting called. Besides that, it reduces jitter in the measurement.

CPU 0
latency
1 -> 1
2 -> 3
4 -> 7
8 -> 15
16 -> 31
32 -> 63
64 -> 127
128 -> 255
256 -> 511
512 -> 1023 : 0
1024 -> 2047 : 0
2048 -> 4095
4096 -> 8191
8192 -> 16383
16384 -> 32767
32768 -> 65535
65536 -> 131071 : 179
131072 -> 262143 : 18
262144 -> 524287 : 4
524288 -> 1048575 : 1363
CPU 1
latency
1 -> 1
2 -> 3
4 -> 7
8 -> 15
16 -> 31
32 -> 63
64 -> 127
128 -> 255
256 -> 511
512 -> 1023 : 0
1024 -> 2047 : 0
2048 -> 4095
4096 -> 8191
8192 -> 16383
16384 -> 32767
32768 -> 65535
65536 -> 131071 : 29
131072 -> 262143 : 4
262144 -> 524287 : 1
524288 -> 1048575 : 364
CPU 2
latency
1 -> 1
2 -> 3
4 -> 7
8 -> 15
16 -> 31
32 -> 63
64 -> 127
128 -> 255
256 -> 511
512 -> 1023 : 0
1024 -> 2047 : 0
2048 -> 4095
4096 -> 8191
8192 -> 16383
16384 -> 32767
32768 -> 65535 : 59
65536 -> 131071 : 2
131072 -> 262143 : 0
262144 -> 524287 : 1
524288 -> 1048575 : 174
CPU 3
latency
1 -> 1
2 -> 3
4 -> 7
8 -> 15
16 -> 31
32 -> 63
64 -> 127
128 -> 255
256 -> 511
512 -> 1023 : 0
1024 -> 2047 : 0
2048 -> 4095
4096 -> 8191
8192 -> 16383
16384 -> 32767
32768 -> 65535 : 72
65536 -> 131071 : 32
131072 -> 262143 : 26
262144 -> 524287 : 12
524288 -> 1048575 : 298 : count distribution : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | | | | | : 166723 |*************************************** | : 19870 |*** | : 6324 | | : 1098 | | : 190 | | | | | | | | | | : count distribution : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | | | | | : 114042 |*************************************** | : 9587 |** | : 4140 | | : 673 | | : 179 | | | | | | | | | | : count distribution : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | | | | | : 40147 |*************************************** | : 2300 |* | : 828 | | : 178 | | | | | | | | | | | | : count distribution : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | : 0 | | | | | | : 29626 |*************************************** | : 2704 |** | : 1090 | | : 160 | | | | | | | | | | | |

All this is based on the trace3 examples written by
Alexei Starovoitov .

Signed-off-by: Daniel Wagner
Cc: Alexei Starovoitov
Cc: Alexei Starovoitov
Cc: "David S. Miller"
Cc: Daniel Borkmann
Cc: Ingo Molnar
Cc: linux-kernel@vger.kernel.org
Cc: netdev@vger.kernel.org
Acked-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Daniel Wagner
2015-06-23 21:09:58 +0800

16 Jun, 2015

1 commit

ffeedafbf bpf: introduce current->pid, tgid, uid, gid, comm accessors ... Browse Code »

eBPF programs attached to kprobes need to filter based on
current->pid, uid and other fields, so introduce helper functions:

u64 bpf_get_current_pid_tgid(void)
Return: current->tgid << 32 | current->pid

u64 bpf_get_current_uid_gid(void)
Return: current_gid << 32 | current_uid

bpf_get_current_comm(char *buf, int size_of_buf)
stores current->comm into buf

They can be used from the programs attached to TC as well to classify packets
based on current task fields.

Update tracex2 example to print histogram of write syscalls for each process
instead of aggregated for all.

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-06-16 06:53:50 +0800

07 Jun, 2015

2 commits

d691f9e8d bpf: allow programs to write to certain skb fields ... Browse Code »

allow programs read/write skb->mark, tc_index fields and
((struct qdisc_skb_cb *)cb)->data.

mark and tc_index are generically useful in TC.
cb[0]-cb[4] are primarily used to pass arguments from one
program to another called via bpf_tail_call() which can
be seen in sockex3_kern.c example.

All fields of 'struct __sk_buff' are readable to socket and tc_cls_act progs.
mark, tc_index are writeable from tc_cls_act only.
cb[0]-cb[4] are writeable by both sockets and tc_cls_act.

Add verifier tests and improve sample code.

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-06-07 17:01:33 +0800
3431205e0 bpf: make programs see skb->data == L2 for ingress and egress ... Browse Code »

eBPF programs attached to ingress and egress qdiscs see inconsistent skb->data.
For ingress L2 header is already pulled, whereas for egress it's present.
This is known to program writers which are currently forced to use
BPF_LL_OFF workaround.
Since programs don't change skb internal pointers it is safe to do
pull/push right around invocation of the program and earlier taps and
later pt->func() will not be affected.
Multiple taps via packet_rcv(), tpacket_rcv() are doing the same trick
around run_filter/BPF_PROG_RUN even if skb_shared.

This fix finally allows programs to use optimized LD_ABS/IND instructions
without BPF_LL_OFF for higher performance.
tc ingress + cls_bpf + samples/bpf/tcbpf1_kern.o
w/o JIT w/JIT
before 20.5 23.6 Mpps
after 21.8 26.6 Mpps

Old programs with BPF_LL_OFF will still work as-is.

We can now undo most of the earlier workaround commit:
a166151cbe33 ("bpf: fix bpf helpers to use skb->mac_header relative offsets")

Signed-off-by: Alexei Starovoitov
Acked-by: Jamal Hadi Salim
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-06-07 17:01:33 +0800

22 May, 2015

2 commits

530b2c861 samples/bpf: bpf_tail_call example for networking ... Browse Code »

Usage:
$ sudo ./sockex3
IP src.port -> dst.port bytes packets
127.0.0.1.42010 -> 127.0.0.1.12865 1568 8
127.0.0.1.59526 -> 127.0.0.1.33778 11422636 173070
127.0.0.1.33778 -> 127.0.0.1.59526 11260224828 341974
127.0.0.1.12865 -> 127.0.0.1.42010 1832 12
IP src.port -> dst.port bytes packets
127.0.0.1.42010 -> 127.0.0.1.12865 1568 8
127.0.0.1.59526 -> 127.0.0.1.33778 23198092 351486
127.0.0.1.33778 -> 127.0.0.1.59526 22972698518 698616
127.0.0.1.12865 -> 127.0.0.1.42010 1832 12

this example is similar to sockex2 in a way that it accumulates per-flow
statistics, but it does packet parsing differently.
sockex2 inlines full packet parser routine into single bpf program.
This sockex3 example have 4 independent programs that parse vlan, mpls, ip, ipv6
and one main program that starts the process.
bpf_tail_call() mechanism allows each program to be small and be called
on demand potentially multiple times, so that many vlan, mpls, ip in ip,
gre encapsulations can be parsed. These and other protocol parsers can
be added or removed at runtime. TLVs can be parsed in similar manner.
Note, tail_call_cnt dynamic check limits the number of tail calls to 32.

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-05-22 05:07:59 +0800
5bacd7805 samples/bpf: bpf_tail_call example for tracing ... Browse Code »

kprobe example that demonstrates how future seccomp programs may look like.
It attaches to seccomp_phase1() function and tail-calls other BPF programs
depending on syscall number.

Existing optimized classic BPF seccomp programs generated by Chrome look like:
if (sd.nr < 121) {
if (sd.nr < 57) {
if (sd.nr < 22) {
if (sd.nr < 7) {
if (sd.nr < 4) {
if (sd.nr < 1) {
check sys_read
} else {
if (sd.nr < 3) {
check sys_write and sys_open
} else {
check sys_close
}
}
} else {
} else {
} else {
} else {
} else {
}

the future seccomp using native eBPF may look like:
bpf_tail_call(&sd, &syscall_jmp_table, sd.nr);
which is simpler, faster and leaves more room for per-syscall checks.

Usage:
$ sudo ./tracex5
-366 [001] d... 4.870033: : read(fd=1, buf=00007f6d5bebf000, size=771)
-369 [003] d... 4.870066: : mmap
-369 [003] d... 4.870077: : syscall=110 (one of get/set uid/pid/gid)
-369 [003] d... 4.870089: : syscall=107 (one of get/set uid/pid/gid)
sh-369 [000] d... 4.891740: : read(fd=0, buf=00000000023d1000, size=512)
sh-369 [000] d... 4.891747: : write(fd=1, buf=00000000023d3000, size=512)
sh-369 [000] d... 4.891747: : read(fd=1, buf=00000000023d3000, size=512)

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-05-22 05:07:59 +0800

13 May, 2015

1 commit

b88c06e36 samples/bpf: fix in-source build of samples with clang ... Browse Code »

in-source build of 'make samples/bpf/' was incorrectly
using default compiler instead of invoking clang/llvm.
out-of-source build was ok.

Fixes: a80857822b0c ("samples: bpf: trivial eBPF program in C")
Signed-off-by: Brenden Blanco
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Brenden Blanco
2015-05-13 11:15:25 +0800

17 Apr, 2015

2 commits

725f9dcd5 bpf: fix two bugs in verification logic when accessing 'ctx' pointer ... Browse Code »

1.
first bug is a silly mistake. It broke tracing examples and prevented
simple bpf programs from loading.

In the following code:
if (insn->imm == 0 && BPF_SIZE(insn->code) == BPF_W) {
} else if (...) {
// this part should have been executed when
// insn->code == BPF_W and insn->imm != 0
}

Obviously it's not doing that. So simple instructions like:
r2 = *(u64 *)(r1 + 8)
will be rejected. Note the comments in the code around these branches
were and still valid and indicate the true intent.

Replace it with:
if (BPF_SIZE(insn->code) != BPF_W)
continue;

if (insn->imm == 0) {
} else if (...) {
// now this code will be executed when
// insn->code == BPF_W and insn->imm != 0
}

2.
second bug is more subtle.
If malicious code is using the same dest register as source register,
the checks designed to prevent the same instruction to be used with different
pointer types will fail to trigger, since we were assigning src_reg_type
when it was already overwritten by check_mem_access().
The fix is trivial. Just move line:
src_reg_type = regs[insn->src_reg].type;
before check_mem_access().
Add new 'access skb fields bad4' test to check this case.

Fixes: 9bac3d6d548e ("bpf: allow extended BPF programs access skb fields")
Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-04-17 02:08:49 +0800
a166151cb bpf: fix bpf helpers to use skb->mac_header relative offsets ... Browse Code »

For the short-term solution, lets fix bpf helper functions to use
skb->mac_header relative offsets instead of skb->data in order to
get the same eBPF programs with cls_bpf and act_bpf work on ingress
and egress qdisc path. We need to ensure that mac_header is set
before calling into programs. This is effectively the first option
from below referenced discussion.

More long term solution for LD_ABS|LD_IND instructions will be more
intrusive but also more beneficial than this, and implemented later
as it's too risky at this point in time.

I.e., we plan to look into the option of moving skb_pull() out of
eth_type_trans() and into netif_receive_skb() as has been suggested
as second option. Meanwhile, this solution ensures ingress can be
used with eBPF, too, and that we won't run into ABI troubles later.
For dealing with negative offsets inside eBPF helper functions,
we've implemented bpf_skb_clone_unwritable() to test for unwriteable
headers.

Reference: http://thread.gmane.org/gmane.linux.network/359129/focus=359694
Fixes: 608cd71a9c7c ("tc: bpf: generalize pedit action")
Fixes: 91bc4822c3d6 ("tc: bpf: add checksum helpers")
Signed-off-by: Alexei Starovoitov
Signed-off-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-04-17 02:08:49 +0800

16 Apr, 2015

1 commit

6c373ca89 Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next ... Browse Code »

Pull networking updates from David Miller:

1) Add BQL support to via-rhine, from Tino Reichardt.

2) Integrate SWITCHDEV layer support into the DSA layer, so DSA drivers
can support hw switch offloading. From Floria Fainelli.

3) Allow 'ip address' commands to initiate multicast group join/leave,
from Madhu Challa.

4) Many ipv4 FIB lookup optimizations from Alexander Duyck.

5) Support EBPF in cls_bpf classifier and act_bpf action, from Daniel
Borkmann.

6) Remove the ugly compat support in ARP for ugly layers like ax25,
rose, etc. And use this to clean up the neigh layer, then use it to
implement MPLS support. All from Eric Biederman.

7) Support L3 forwarding offloading in switches, from Scott Feldman.

8) Collapse the LOCAL and MAIN ipv4 FIB tables when possible, to speed
up route lookups even further. From Alexander Duyck.

9) Many improvements and bug fixes to the rhashtable implementation,
from Herbert Xu and Thomas Graf. In particular, in the case where
an rhashtable user bulk adds a large number of items into an empty
table, we expand the table much more sanely.

10) Don't make the tcp_metrics hash table per-namespace, from Eric
Biederman.

11) Extend EBPF to access SKB fields, from Alexei Starovoitov.

12) Split out new connection request sockets so that they can be
established in the main hash table. Much less false sharing since
hash lookups go direct to the request sockets instead of having to
go first to the listener then to the request socks hashed
underneath. From Eric Dumazet.

13) Add async I/O support for crytpo AF_ALG sockets, from Tadeusz Struk.

14) Support stable privacy address generation for RFC7217 in IPV6. From
Hannes Frederic Sowa.

15) Hash network namespace into IP frag IDs, also from Hannes Frederic
Sowa.

16) Convert PTP get/set methods to use 64-bit time, from Richard
Cochran.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1816 commits)
fm10k: Bump driver version to 0.15.2
fm10k: corrected VF multicast update
fm10k: mbx_update_max_size does not drop all oversized messages
fm10k: reset head instead of calling update_max_size
fm10k: renamed mbx_tx_dropped to mbx_tx_oversized
fm10k: update xcast mode before synchronizing multicast addresses
fm10k: start service timer on probe
fm10k: fix function header comment
fm10k: comment next_vf_mbx flow
fm10k: don't handle mailbox events in iov_event path and always process mailbox
fm10k: use separate workqueue for fm10k driver
fm10k: Set PF queues to unlimited bandwidth during virtualization
fm10k: expose tx_timeout_count as an ethtool stat
fm10k: only increment tx_timeout_count in Tx hang path
fm10k: remove extraneous "Reset interface" message
fm10k: separate PF only stats so that VF does not display them
fm10k: use hw->mac.max_queues for stats
fm10k: only show actual queues, not the maximum in hardware
fm10k: allow creation of VLAN on default vid
fm10k: fix unused warnings
...

Linus Torvalds
2015-04-16 00:00:47 +0800

07 Apr, 2015

1 commit

91bc4822c tc: bpf: add checksum helpers ... Browse Code »

Commit 608cd71a9c7c ("tc: bpf: generalize pedit action") has added the
possibility to mangle packet data to BPF programs in the tc pipeline.
This patch adds two helpers bpf_l3_csum_replace() and bpf_l4_csum_replace()
for fixing up the protocol checksums after the packet mangling.

It also adds 'flags' argument to bpf_skb_store_bytes() helper to avoid
unnecessary checksum recomputations when BPF programs adjusting l3/l4
checksums and documents all three helpers in uapi header.

Moreover, a sample program is added to show how BPF programs can make use
of the mangle and csum helpers.

Signed-off-by: Alexei Starovoitov
Acked-by: Daniel Borkmann
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-04-07 04:42:35 +0800

02 Apr, 2015

4 commits

9811e3535 samples/bpf: Add kmem_alloc()/free() tracker tool ... Browse Code »

One BPF program attaches to kmem_cache_alloc_node() and
remembers all allocated objects in the map.
Another program attaches to kmem_cache_free() and deletes
corresponding object from the map.

User space walks the map every second and prints any objects
which are older than 1 second.

Usage:

$ sudo tracex4

Then start few long living processes. The 'tracex4' will print
something like this:

obj 0xffff880465928000 is 13sec old was allocated at ip ffffffff8105dc32
obj 0xffff88043181c280 is 13sec old was allocated at ip ffffffff8105dc32
obj 0xffff880465848000 is 8sec old was allocated at ip ffffffff8105dc32
obj 0xffff8804338bc280 is 15sec old was allocated at ip ffffffff8105dc32

$ addr2line -fispe vmlinux ffffffff8105dc32
do_fork at fork.c:1665

As soon as processes exit the memory is reclaimed and 'tracex4'
prints nothing.

Similar experiment can be done with the __kmalloc()/kfree() pair.

Signed-off-by: Alexei Starovoitov
Cc: Andrew Morton
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: David S. Miller
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Masami Hiramatsu
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Steven Rostedt
Link: http://lkml.kernel.org/r/1427312966-8434-10-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar

Alexei Starovoitov
2015-04-02 19:25:51 +0800
5c7fc2d27 samples/bpf: Add IO latency analysis (iosnoop/heatmap) tool ... Browse Code »

BPF C program attaches to
blk_mq_start_request()/blk_update_request() kprobe events to
calculate IO latency.

For every completed block IO event it computes the time delta
in nsec and records in a histogram map:

map[log10(delta)*10]++

User space reads this histogram map every 2 seconds and prints
it as a 'heatmap' using gray shades of text terminal. Black
spaces have many events and white spaces have very few events.
Left most space is the smallest latency, right most space is
the largest latency in the range.

Usage:

$ sudo ./tracex3
and do 'sudo dd if=/dev/sda of=/dev/null' in other terminal.

Observe IO latencies and how different activity (like 'make
kernel') affects it.

Similar experiments can be done for network transmit latencies,
syscalls, etc.

'-t' flag prints the heatmap using normal ascii characters:

$ sudo ./tracex3 -t
heatmap of IO latency
# - many events with this latency
- few events
|1us |10us |100us |1ms |10ms |100ms |1s |10s
*ooo. *O.#. # 221
. *# . # 125
.. .o#*.. # 55
. . . . .#O # 37
.# # 175
.#*. # 37
# # 199
. . *#*. # 55
*#..* # 42
# # 266
...***Oo#*OO**o#* . # 629
# # 271
. .#o* o.*o* # 221
. . o* *#O.. # 50

Signed-off-by: Alexei Starovoitov
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: David S. Miller
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Masami Hiramatsu
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Steven Rostedt
Link: http://lkml.kernel.org/r/1427312966-8434-9-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar

Alexei Starovoitov
2015-04-02 19:25:51 +0800
d822a1926 samples/bpf: Add counting example for kfree_skb() function calls and the write() syscall ... Browse Code »

this example has two probes in one C file that attach to
different kprove events and use two different maps.

1st probe is x64 specific equivalent of dropmon. It attaches to
kfree_skb, retrevies 'ip' address of kfree_skb() caller and
counts number of packet drops at that 'ip' address. User space
prints 'location - count' map every second.

2nd probe attaches to kprobe:sys_write and computes a histogram
of different write sizes

Usage:
$ sudo tracex2
location 0xffffffff81695995 count 1
location 0xffffffff816d0da9 count 2

location 0xffffffff81695995 count 2
location 0xffffffff816d0da9 count 2

location 0xffffffff81695995 count 3
location 0xffffffff816d0da9 count 2

557145+0 records in
557145+0 records out
285258240 bytes (285 MB) copied, 1.02379 s, 279 MB/s
syscall write() stats
byte_size : count distribution
1 -> 1 : 3 | |
2 -> 3 : 0 | |
4 -> 7 : 0 | |
8 -> 15 : 0 | |
16 -> 31 : 2 | |
32 -> 63 : 3 | |
64 -> 127 : 1 | |
128 -> 255 : 1 | |
256 -> 511 : 0 | |
512 -> 1023 : 1118968 |************************************* |

Ctrl-C at any time. Kernel will auto cleanup maps and programs

$ addr2line -ape ./bld_x64/vmlinux 0xffffffff81695995
0xffffffff816d0da9 0xffffffff81695995:
./bld_x64/../net/ipv4/icmp.c:1038 0xffffffff816d0da9:
./bld_x64/../net/unix/af_unix.c:1231

Signed-off-by: Alexei Starovoitov
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: David S. Miller
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Masami Hiramatsu
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Steven Rostedt
Link: http://lkml.kernel.org/r/1427312966-8434-8-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar

Alexei Starovoitov
2015-04-02 19:25:50 +0800
b896c4f95 samples/bpf: Add simple non-portable kprobe filter example ... Browse Code »

tracex1_kern.c - C program compiled into BPF.

It attaches to kprobe:netif_receive_skb()

When skb->dev->name == "lo", it prints sample debug message into
trace_pipe via bpf_trace_printk() helper function.

tracex1_user.c - corresponding user space component that:
- loads BPF program via bpf() syscall
- opens kprobes:netif_receive_skb event via perf_event_open()
syscall
- attaches the program to event via ioctl(event_fd,
PERF_EVENT_IOC_SET_BPF, prog_fd);
- prints from trace_pipe

Note, this BPF program is non-portable. It must be recompiled
with current kernel headers. kprobe is not a stable ABI and
BPF+kprobe scripts may no longer be meaningful when kernel
internals change.

No matter in what way the kernel changes, neither the kprobe,
nor the BPF program can ever crash or corrupt the kernel,
assuming the kprobes, perf and BPF subsystem has no bugs.

The verifier will detect that the program is using
bpf_trace_printk() and the kernel will print 'this is a DEBUG
kernel' warning banner, which means that bpf_trace_printk()
should be used for debugging of the BPF program only.

Usage:
$ sudo tracex1
ping-19826 [000] d.s2 63103.382648: : skb ffff880466b1ca00 len 84
ping-19826 [000] d.s2 63103.382684: : skb ffff880466b1d300 len 84

ping-19826 [000] d.s2 63104.382533: : skb ffff880466b1ca00 len 84
ping-19826 [000] d.s2 63104.382594: : skb ffff880466b1d300 len 84

Signed-off-by: Alexei Starovoitov
Cc: Arnaldo Carvalho de Melo
Cc: Arnaldo Carvalho de Melo
Cc: Daniel Borkmann
Cc: David S. Miller
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Masami Hiramatsu
Cc: Namhyung Kim
Cc: Peter Zijlstra
Cc: Peter Zijlstra
Cc: Steven Rostedt
Link: http://lkml.kernel.org/r/1427312966-8434-7-git-send-email-ast@plumgrid.com
Signed-off-by: Ingo Molnar

Alexei Starovoitov
2015-04-02 19:25:50 +0800

18 Mar, 2015

1 commit

c24973957 bpf: allow BPF programs access 'protocol' and 'vlan_tci' fields ... Browse Code »

as a follow on to patch 70006af95515 ("bpf: allow eBPF access skb fields")
this patch allows 'protocol' and 'vlan_tci' fields to be accessible
from extended BPF programs.

The usage of 'protocol', 'vlan_present' and 'vlan_tci' fields is the same as
corresponding SKF_AD_PROTOCOL, SKF_AD_VLAN_TAG_PRESENT and SKF_AD_VLAN_TAG
accesses in classic BPF.

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-03-18 03:06:31 +0800

16 Mar, 2015

1 commit

614cd3bd3 samples: bpf: add skb->field examples and tests ... Browse Code »

- modify sockex1 example to count number of bytes in outgoing packets
- modify sockex2 example to count number of bytes and packets per flow
- add 4 stress tests that exercise 'skb->field' code path of verifier

Signed-off-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Alexei Starovoitov
2015-03-16 10:02:28 +0800

02 Mar, 2015

1 commit

f1a66f85b ebpf: export BPF_PSEUDO_MAP_FD to uapi ... Browse Code »

We need to export BPF_PSEUDO_MAP_FD to user space, as it's used in the
ELF BPF loader where instructions are being loaded that need map fixups.

An initial stage loads all maps into the kernel, and later on replaces
related instructions in the eBPF blob with BPF_PSEUDO_MAP_FD as source
register and the actual fd as immediate value.

The kernel verifier recognizes this keyword and replaces the map fd with
a real pointer internally.

Signed-off-by: Daniel Borkmann
Acked-by: Alexei Starovoitov
Signed-off-by: David S. Miller

Daniel Borkmann
2015-03-02 03:05:19 +0800