11 Mar, 2018

1 commit

  • [ upstream commit a316338cb71a3260201490e615f2f6d5c0d8fb2c ]

    trie_alloc() always needs to have BPF_F_NO_PREALLOC passed in via
    attr->map_flags, since it does not support preallocation yet. We
    check the flag, but we never copy it into trie->map.map_flags, which
    is later on exposed via fdinfo and used by loaders such as iproute2.
    The latter uses this in bpf_map_selfcheck_pinned() to test whether a
    pinned map has the same spec as the one from the BPF object file and,
    if not, bails out. That is currently always the case for lpm, since
    it exposes 0 as flags.

    Also copy over the flags in array_map_alloc() and stack_map_alloc().
    They always have to be 0 right now, but we should make sure not to
    miss copying them over at a later point in time, when we add actual
    flags for them to use.
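
    For illustration, a minimal sketch of the kind of attribute copy this
    fix is about (bpf_map_copy_attrs() is a hypothetical helper used here
    only to keep the sketch short; the actual fix is a plain assignment in
    trie_alloc(), array_map_alloc() and stack_map_alloc()):

      /* Copy the user-visible map attributes, including map_flags, into
       * the embedded struct bpf_map so that fdinfo and pinned-map checks
       * such as iproute2's bpf_map_selfcheck_pinned() see the real flags.
       */
      static void bpf_map_copy_attrs(struct bpf_map *map,
                                     const union bpf_attr *attr)
      {
              map->map_type    = attr->map_type;
              map->key_size    = attr->key_size;
              map->value_size  = attr->value_size;
              map->max_entries = attr->max_entries;
              map->map_flags   = attr->map_flags; /* was left at 0 for lpm */
      }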

    Fixes: b95a5c4db09b ("bpf: add a longest prefix match trie map implementation")
    Reported-by: Jarno Rajahalme
    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

05 Jul, 2017

1 commit

  • [ Upstream commit d407bd25a204bd66b7346dde24bd3d37ef0e0b05 ]

    This patch adds two helpers, bpf_map_area_alloc() and bpf_map_area_free(),
    that are to be used for map allocations. Using kmalloc() for very large
    allocations can cause excessive work within the page allocator, so i) fall
    back earlier to vmalloc() when the attempt is considered costly anyway,
    and even more importantly ii) don't trigger OOM killer with any of the
    allocators.

    Since this is based on a user space request, for example, when creating
    maps with element pre-allocation, we really want such requests to fail
    instead of killing other user space processes.

    Also, don't spam the kernel log with warnings should any of the allocations
    fail under pressure. Given that, we can make backend selection in
    bpf_map_area_alloc() generic, and convert all maps over to use this API
    for spots with potentially large allocation requests.

    Note that replacing the one kmalloc_array() is fine, as the overflow
    checks happen earlier in htab_map_alloc(), since it must also protect
    the multiplication for the vmalloc() path should kmalloc_array() fail.
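
    For illustration, a minimal sketch of such a pair of helpers under the
    policy described above (an approximation of the idea, not necessarily
    the exact upstream implementation):

      #include <linux/mm.h>
      #include <linux/slab.h>
      #include <linux/vmalloc.h>

      void *bpf_map_area_alloc(size_t size)
      {
              /* No retries, no OOM killer, no allocation-failure warnings. */
              const gfp_t flags = __GFP_NOWARN | __GFP_NORETRY | __GFP_ZERO;
              void *area;

              /* Only bother kmalloc() with non-costly sizes. */
              if (size <= (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER)) {
                      area = kmalloc(size, GFP_USER | flags);
                      if (area != NULL)
                              return area;
              }

              /* Fall back to vmalloc() for large or failed requests. */
              return __vmalloc(size, GFP_KERNEL | __GFP_HIGHMEM | flags,
                               PAGE_KERNEL);
      }

      void bpf_map_area_free(void *area)
      {
              kvfree(area); /* frees both kmalloc()ed and vmalloc()ed memory */
      }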

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

10 Sep, 2016

1 commit

    This work adds BPF_CALL_() macros and converts all the eBPF helper functions
    to use them, in a similar fashion to the SYSCALL_DEFINE() macros that are
    used today. The motivation for this is to hide all the register handling
    and all necessary casts from the user, so that it is done automatically in
    the background when adding a BPF_CALL_() call.

    This makes current helpers easier to review, makes it easier to write
    future helpers, avoids getting the casting mess wrong, and allows for
    extending all helpers at once (e.g. build time checks, etc). It also
    makes it easier to detect in code reviews when unused registers get
    instrumented in the code by accident, breaking compatibility with
    existing programs.

    BPF_CALL_() internals are quite similar to the SYSCALL_DEFINE() ones, with
    some fundamental differences; for example, when generating the actual
    helper function that carries all u64 regs, we need to fill in the unused
    regs, so that we always end up with 5 u64 regs as arguments.

    I reviewed the generated BPF_CALL_0() through BPF_CALL_5() variants in the
    .i results and they all look as expected. No sparse issues were spotted.
    We also let this sit for a
    few days with Fengguang's kbuild test robot, and there were no issues seen. On
    s390, it barked on the "uses dynamic stack allocation" notice, which is an old
    one from bpf_perf_event_output{,_tp}() reappearing here due to the conversion
    to the call wrapper, just telling that the perf raw record/frag sits on stack
    (gcc with s390's -mwarn-dynamicstack), but that's all. Did various runtime tests
    and they were fine as well. All eBPF helpers are now converted to use these
    macros, getting rid of a good chunk of all the raw castings.
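
    For illustration, this is roughly what a converted helper looks like
    (a sketch based on the bpf_map_lookup_elem() helper); the macro expands
    into the u64-register wrapper and the casts that previously had to be
    written by hand:

      #include <linux/bpf.h>
      #include <linux/filter.h>
      #include <linux/rcupdate.h>

      /* Declare only the real arguments; BPF_CALL_2() generates the wrapper
       * that maps r1/r2 onto them and pads the remaining, unused regs.
       */
      BPF_CALL_2(bpf_map_lookup_elem, struct bpf_map *, map, void *, key)
      {
              WARN_ON_ONCE(!rcu_read_lock_held());
              return (unsigned long) map->ops->map_lookup_elem(map, key);
      }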

    Signed-off-by: Daniel Borkmann
    Acked-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Daniel Borkmann
     

08 Jun, 2016

1 commit


30 May, 2016

1 commit

    In addition to being able to control the system wide maximum depth via
    /proc/sys/kernel/perf_event_max_stack, we are now able to ask for
    different depths per event, using perf_event_attr.sample_max_stack for
    that.

    This uses a u16 hole at the end of perf_event_attr, which applies when
    perf_event_attr.sample_type has PERF_SAMPLE_CALLCHAIN set: if
    sample_max_stack is zero, perf_event_max_stack is used; otherwise the
    value is bounds checked under callchain_mutex.
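
    A user space sketch of how the field can be used (the event type, config
    and sample period here are only illustrative; sample_max_stack requires
    headers from a kernel carrying this change):

      #include <linux/perf_event.h>
      #include <string.h>
      #include <sys/syscall.h>
      #include <unistd.h>

      static int open_cycles_event_with_deep_callchain(void)
      {
              struct perf_event_attr attr;

              memset(&attr, 0, sizeof(attr));
              attr.size = sizeof(attr);
              attr.type = PERF_TYPE_HARDWARE;
              attr.config = PERF_COUNT_HW_CPU_CYCLES;
              attr.sample_period = 100000;
              attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_CALLCHAIN;
              /* Per-event depth; 0 falls back to perf_event_max_stack. */
              attr.sample_max_stack = 512;

              /* current task, any CPU, no group leader, no flags */
              return syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);
      }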

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

26 May, 2016

1 commit

  • Pull perf updates from Ingo Molnar:
    "Mostly tooling and PMU driver fixes, but also a number of late updates
    such as the reworking of the call-chain size limiting logic to make
    call-graph recording more robust, plus tooling side changes for the
    new 'backwards ring-buffer' extension to the perf ring-buffer"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    perf record: Read from backward ring buffer
    perf record: Rename variable to make code clear
    perf record: Prevent reading invalid data in record__mmap_read
    perf evlist: Add API to pause/resume
    perf trace: Use the ptr->name beautifier as default for "filename" args
    perf trace: Use the fd->name beautifier as default for "fd" args
    perf report: Add srcline_from/to branch sort keys
    perf evsel: Record fd into perf_mmap
    perf evsel: Add overwrite attribute and check write_backward
    perf tools: Set buildid dir under symfs when --symfs is provided
    perf trace: Only auto set call-graph to "dwarf" when syscalls are being traced
    perf annotate: Sort list of recognised instructions
    perf annotate: Fix identification of ARM blt and bls instructions
    perf tools: Fix usage of max_stack sysctl
    perf callchain: Stop validating callchains by the max_stack sysctl
    perf trace: Fix exit_group() formatting
    perf top: Use machine->kptr_restrict_warned
    perf trace: Warn when trying to resolve kernel addresses with kptr_restrict=1
    perf machine: Do not bail out if not managing to read ref reloc symbol
    perf/x86/intel/p4: Trival indentation fix, remove space
    ...

    Linus Torvalds
     

18 May, 2016

1 commit

  • Pull networking updates from David Miller:
    "Highlights:

    1) Support SPI based w5100 devices, from Akinobu Mita.

    2) Partial Segmentation Offload, from Alexander Duyck.

    3) Add GMAC4 support to stmmac driver, from Alexandre TORGUE.

    4) Allow cls_flower stats offload, from Amir Vadai.

    5) Implement bpf blinding, from Daniel Borkmann.

    6) Optimize _ASYNC_ bit twiddling on sockets, unless the socket is
    actually using FASYNC these atomics are superfluous. From Eric
    Dumazet.

    7) Run TCP more preemptibly, also from Eric Dumazet.

    8) Support LED blinking, EEPROM dumps, and rxvlan offloading in mlx5e
    driver, from Gal Pressman.

    9) Allow creating ppp devices via rtnetlink, from Guillaume Nault.

    10) Improve BPF usage documentation, from Jesper Dangaard Brouer.

    11) Support tunneling offloads in qed, from Manish Chopra.

    12) aRFS offloading in mlx5e, from Maor Gottlieb.

    13) Add RFS and RPS support to SCTP protocol, from Marcelo Ricardo
    Leitner.

    14) Add MSG_EOR support to TCP, this allows controlling packet
    coalescing on application record boundaries for more accurate
    socket timestamp sampling. From Martin KaFai Lau.

    15) Fix alignment of 64-bit netlink attributes across the board, from
    Nicolas Dichtel.

    16) Per-vlan stats in bridging, from Nikolay Aleksandrov.

    17) Several conversions of drivers to ethtool ksettings, from Philippe
    Reynes.

    18) Checksum neutral ILA in ipv6, from Tom Herbert.

    19) Factorize all of the various marvell dsa drivers into one, from
    Vivien Didelot

    20) Add VF support to qed driver, from Yuval Mintz"

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1649 commits)
    Revert "phy dp83867: Fix compilation with CONFIG_OF_MDIO=m"
    Revert "phy dp83867: Make rgmii parameters optional"
    r8169: default to 64-bit DMA on recent PCIe chips
    phy dp83867: Make rgmii parameters optional
    phy dp83867: Fix compilation with CONFIG_OF_MDIO=m
    bpf: arm64: remove callee-save registers use for tmp registers
    asix: Fix offset calculation in asix_rx_fixup() causing slow transmissions
    switchdev: pass pointer to fib_info instead of copy
    net_sched: close another race condition in tcf_mirred_release()
    tipc: fix nametable publication field in nl compat
    drivers: net: Don't print unpopulated net_device name
    qed: add support for dcbx.
    ravb: Add missing free_irq() calls to ravb_close()
    qed: Remove a stray tab
    net: ethernet: fec-mpc52xx: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fec-mpc52xx: use phydev from struct net_device
    bpf, doc: fix typo on bpf_asm descriptions
    stmmac: hardware TX COE doesn't work when force_thresh_dma_mode is set
    net: ethernet: fs-enet: use phy_ethtool_{get|set}_link_ksettings
    net: ethernet: fs-enet: use phydev from struct net_device
    ...

    Linus Torvalds
     

17 May, 2016

1 commit

  • This makes perf_callchain_{user,kernel}() receive the max stack
    as context for the perf_callchain_entry, instead of accessing
    the global sysctl_perf_event_max_stack.
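
    A simplified sketch of the resulting interface (the struct layout below
    is abbreviated for illustration): the entry and its per-call limit travel
    together in a context, so arch code no longer reads the global sysctl.

      struct perf_callchain_entry_ctx {
              struct perf_callchain_entry *entry;
              u32                          max_stack; /* per-call limit */
              u32                          nr;        /* entries filled so far */
      };

      /* Arch implementations bound the stack walk by ctx->max_stack
       * instead of sysctl_perf_event_max_stack.
       */
      void perf_callchain_kernel(struct perf_callchain_entry_ctx *ctx,
                                 struct pt_regs *regs);
      void perf_callchain_user(struct perf_callchain_entry_ctx *ctx,
                               struct pt_regs *regs);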

    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: Alexei Starovoitov
    Cc: Brendan Gregg
    Cc: David Ahern
    Cc: Frederic Weisbecker
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/n/tip-kolmn1yo40p7jhswxwrc7rrd@git.kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

27 Apr, 2016

1 commit

    The default remains 127, which is good for most cases, and is not even
    hit most of the time, but for some cases, as reported by Brendan, 1024+
    deep frames are appearing on the radar for things like groovy and ruby.

    And in some workloads putting a _lower_ cap on this may make sense. A
    per-event limit still needs to be put in place, though.

    The new file is:

    # cat /proc/sys/kernel/perf_event_max_stack
    127

    Changing it:

    # echo 256 > /proc/sys/kernel/perf_event_max_stack
    # cat /proc/sys/kernel/perf_event_max_stack
    256

    But as soon as there is some event using callchains we get:

    # echo 512 > /proc/sys/kernel/perf_event_max_stack
    -bash: echo: write error: Device or resource busy
    #

    This works because we only allocate the callchain percpu data structures
    when there is a user, which is what allows the max to be changed easily;
    it's just a matter of having no callchain users at that point.

    Reported-and-Tested-by: Brendan Gregg
    Reviewed-by: Frederic Weisbecker
    Acked-by: Alexei Starovoitov
    Acked-by: David Ahern
    Cc: Adrian Hunter
    Cc: Alexander Shishkin
    Cc: He Kuang
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Milian Wolff
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: Wang Nan
    Cc: Zefan Li
    Link: http://lkml.kernel.org/r/20160426002928.GB16708@kernel.org
    Signed-off-by: Arnaldo Carvalho de Melo

    Arnaldo Carvalho de Melo
     

08 Apr, 2016

1 commit


09 Mar, 2016

2 commits

  • It was observed that calling bpf_get_stackid() from a kprobe inside
    slub or from spin_unlock causes similar deadlock as with hashmap,
    therefore convert stackmap to use pre-allocated memory.

    call_rcu is no longer a feasible mechanism, since delayed freeing
    causes bpf_get_stackid() to fail unpredictably when the number of actual
    stacks is significantly less than the user-requested max_entries.
    Since elements are no longer freed into slub, we can push elements into
    the freelist immediately and let them be recycled.
    However, the very unlikely race between user space map_lookup() and
    program-side recycling is possible:
    cpu0                                  cpu1
    ----                                  ----
    user does lookup(stackidX)
    starts copying ips into buffer
                                          delete(stackidX)
                                          calls bpf_get_stackid()
                                          which recycles the element and
                                          overwrites with new stack trace

    To avoid user space seeing a partial stack trace consisting of two
    merged stack traces, do bucket = xchg(, NULL); copy; xchg(,bucket);
    to preserve consistent stack trace delivery to user space.
    Now we can move memset(,0) of left-over element value from critical
    path of bpf_get_stackid() into slow-path of user space lookup.
    Also disallow lookup() from bpf program, since it's useless and
    program shouldn't be messing with collected stack trace.

    Note that similar race between user space lookup and kernel side updates
    is also present in hashmap, but it's not a new race. bpf programs were
    always allowed to modify hash and array map elements while user space
    is copying them.
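
    A minimal sketch of the lookup-side xchg dance described above
    (simplified; the field names buckets, nr and ip follow stackmap's
    layout, but error handling and sizing are abbreviated):

      static int stack_map_copy(struct bpf_stack_map *smap, u32 id,
                                void *value)
      {
              struct stack_map_bucket *bucket;
              u32 trace_len;

              /* Take the bucket out of the table: a concurrent
               * bpf_get_stackid() now sees an empty slot and cannot
               * recycle or overwrite it mid-copy.
               */
              bucket = xchg(&smap->buckets[id], NULL);
              if (!bucket)
                      return -ENOENT;

              trace_len = bucket->nr * sizeof(u64);
              memcpy(value, bucket->ip, trace_len);
              /* Zero the left-over bytes here, off the bpf_get_stackid()
               * fast path.
               */
              memset(value + trace_len, 0, smap->map.value_size - trace_len);

              /* Put the bucket back so future lookups still find it. */
              xchg(&smap->buckets[id], bucket);
              return 0;
      }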

    Fixes: d5a3b1f69186 ("bpf: introduce BPF_MAP_TYPE_STACK_TRACE")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     
  • Suggested-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

20 Feb, 2016

1 commit

    Add a new map type to store stack traces and a corresponding helper:

    bpf_get_stackid(ctx, map, flags) - walk user or kernel stack and return id
      @ctx: struct pt_regs*
      @map: pointer to stack_trace map
      @flags: bits 0-7 - number of stack frames to skip
              bit 8 - collect user stack instead of kernel
              bit 9 - compare stacks by hash only
              bit 10 - if two different stacks hash into the same stackid,
                       discard old
              other bits - reserved
      Return: >= 0 stackid on success or negative error

    stackid is a 32-bit integer handle that can be further combined with
    other data (including other stackids) and used as a key into maps.

    Userspace will access the stackmap using the standard lookup/delete
    syscall commands to retrieve the full stack trace for a given stackid.
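
    A hedged sketch of how a program could use the new map type and helper
    (the map name, attach point and include paths are illustrative, in the
    style of the samples/bpf programs):

      #include <uapi/linux/bpf.h>
      #include "bpf_helpers.h"

      /* 1024 slots, each holding up to 127 (PERF_MAX_STACK_DEPTH)
       * instruction pointers of a collected stack trace.
       */
      struct bpf_map_def SEC("maps") stack_traces = {
              .type        = BPF_MAP_TYPE_STACK_TRACE,
              .key_size    = sizeof(__u32),
              .value_size  = 127 * sizeof(__u64),
              .max_entries = 1024,
      };

      SEC("kprobe/kfree_skb")
      int trace_stacks(struct pt_regs *ctx)
      {
              /* Walk the kernel stack, skip no frames; BPF_F_USER_STACK
               * (bit 8) would collect the user stack instead.
               */
              int stackid = bpf_get_stackid(ctx, &stack_traces, 0);

              if (stackid >= 0) {
                      /* stackid can now be used as a key into other maps,
                       * e.g. to count how often this stack was hit.
                       */
              }
              return 0;
      }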

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: David S. Miller

    Alexei Starovoitov