20 Jan, 2021

2 commits

  • [ Upstream commit 69ca310f34168eae0ada434796bfc22fb4a0fa26 ]

    On some systems, some variant of the following splat is
    repeatedly seen. The common factor in all traces seems
    to be the entry point to task_file_seq_next(). With the
    patch, all warnings go away.

    rcu: INFO: rcu_sched self-detected stall on CPU
    rcu:     26-....: (20992 ticks this GP) idle=d7e/1/0x4000000000000002 softirq=81556231/81556231 fqs=4876
    (t=21033 jiffies g=159148529 q=223125)
    NMI backtrace for cpu 26
    CPU: 26 PID: 2015853 Comm: bpftool Kdump: loaded Not tainted 5.6.13-0_fbk4_3876_gd8d1f9bf80bb #1
    Hardware name: Quanta Twin Lakes MP/Twin Lakes Passive MP, BIOS F09_3A12 10/08/2018
    Call Trace:
    <IRQ>
    dump_stack+0x50/0x70
    nmi_cpu_backtrace.cold.6+0x13/0x50
    ? lapic_can_unplug_cpu.cold.30+0x40/0x40
    nmi_trigger_cpumask_backtrace+0xba/0xca
    rcu_dump_cpu_stacks+0x99/0xc7
    rcu_sched_clock_irq.cold.90+0x1b4/0x3aa
    ? tick_sched_do_timer+0x60/0x60
    update_process_times+0x24/0x50
    tick_sched_timer+0x37/0x70
    __hrtimer_run_queues+0xfe/0x270
    hrtimer_interrupt+0xf4/0x210
    smp_apic_timer_interrupt+0x5e/0x120
    apic_timer_interrupt+0xf/0x20
    </IRQ>
    RIP: 0010:get_pid_task+0x38/0x80
    Code: 89 f6 48 8d 44 f7 08 48 8b 00 48 85 c0 74 2b 48 83 c6 55 48 c1 e6 04 48 29 f0 74 19 48 8d 78 20 ba 01 00 00 00 f0 0f c1 50 20 d2 74 27 78 11 83 c2 01 78 0c 48 83 c4 08 c3 31 c0 48 83 c4 08
    RSP: 0018:ffffc9000d293dc8 EFLAGS: 00000202 ORIG_RAX: ffffffffffffff13
    RAX: ffff888637c05600 RBX: ffffc9000d293e0c RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000550 RDI: ffff888637c05620
    RBP: ffffffff8284eb80 R08: ffff88831341d300 R09: ffff88822ffd8248
    R10: ffff88822ffd82d0 R11: 00000000003a93c0 R12: 0000000000000001
    R13: 00000000ffffffff R14: ffff88831341d300 R15: 0000000000000000
    ? find_ge_pid+0x1b/0x20
    task_seq_get_next+0x52/0xc0
    task_file_seq_get_next+0x159/0x220
    task_file_seq_next+0x4f/0xa0
    bpf_seq_read+0x159/0x390
    vfs_read+0x8a/0x140
    ksys_read+0x59/0xd0
    do_syscall_64+0x42/0x110
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7f95ae73e76e
    Code: Bad RIP value.
    RSP: 002b:00007ffc02c1dbf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000000
    RAX: ffffffffffffffda RBX: 000000000170faa0 RCX: 00007f95ae73e76e
    RDX: 0000000000001000 RSI: 00007ffc02c1dc30 RDI: 0000000000000007
    RBP: 00007ffc02c1ec70 R08: 0000000000000005 R09: 0000000000000006
    R10: fffffffffffff20b R11: 0000000000000246 R12: 00000000019112a0
    R13: 0000000000000000 R14: 0000000000000007 R15: 00000000004283c0

    If unable to obtain the file structure for the current task,
    proceed to the next task number after the one returned from
    task_seq_get_next(), instead of the next task number from the
    original iterator.

    Also, save the stopping task number from task_seq_get_next()
    on failure in case of restarts.
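
    A minimal sketch of the resulting iteration step (hedged: simplified,
    with names taken from the trace above rather than the verbatim patch):

    again:
        task = task_seq_get_next(ns, &tid, true);
        if (!task) {
            info->tid = tid;     /* save the stopping tid for restarts */
            return NULL;
        }
        files = get_files_struct(task);
        if (!files) {
            put_task_struct(task);
            tid += 1;            /* advance from the returned tid, not from
                                  * the iterator's original tid */
            info->tid = tid;
            goto again;
        }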

    Fixes: eaaacd23910f ("bpf: Add task and task/file iterator targets")
    Signed-off-by: Jonathan Lemon
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201218185032.2464558-2-jonathan.lemon@gmail.com
    Signed-off-by: Sasha Levin

    Jonathan Lemon
     
  • [ Upstream commit 91b2db27d3ff9ad29e8b3108dfbf1e2f49fe9bd3 ]

    Simplify task_file_seq_get_next() by removing two in/out arguments: task
    and fstruct. Use info->task and info->files instead.

    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201120002833.2481110-1-songliubraving@fb.com
    Signed-off-by: Sasha Levin

    Song Liu
     

12 Dec, 2020

1 commit

  • Remove the bpf_ prefix, which causes these helpers to be reported in the
    verifier dump as bpf_bpf_this_cpu_ptr() and bpf_bpf_per_cpu_ptr(),
    respectively. Let's fix it while it is still possible, before the UAPI
    freezes on these helpers.
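
    For context, a sketch of why the prefix doubled: the UAPI helper list
    already prepends "bpf_" when helper name strings are generated, so an
    FN() entry must not carry the prefix itself (illustrative):

    /* name strings are built roughly as [BPF_FUNC_##x] = "bpf_" #x */
    FN(bpf_per_cpu_ptr),   /* wrong: dumped as bpf_bpf_per_cpu_ptr() */
    FN(per_cpu_ptr),       /* right: dumped as bpf_per_cpu_ptr()     */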

    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Linus Torvalds

    Andrii Nakryiko
     

11 Dec, 2020

1 commit

  • The 64-bit signed bounds should not affect the 32-bit signed bounds unless
    the verifier knows that the upper 32 bits are either all 1s or all 0s. For
    example, a register with smin_value == 1 need not have s32_min_value == 1,
    since smax_value could be larger than a 32-bit subregister can hold.
    The verifier refines the smax/s32_max return value from certain helpers in
    do_refine_retval_range(). Teach the verifier to recognize that the
    smin/s32_min value is also bounded. When both the smin and smax bounds fit
    into the 32-bit subregister, the verifier can propagate those bounds.
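
    A worked example of why the guard is needed (standalone C with
    illustrative values):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        /* 64-bit value satisfying smin_value == 1 (it is >= 1)... */
        int64_t r = 0x180000000LL;   /* 6442450944 */
        int32_t sub = (int32_t)r;    /* lower 32 bits: 0x80000000 */

        /* ...yet the signed 32-bit subregister is INT32_MIN, not >= 1 */
        printf("%d\n", sub);         /* prints -2147483648 */
        return 0;
    }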

    Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
    Reported-by: Jean-Philippe Brucker
    Acked-by: John Fastabend
    Signed-off-by: Alexei Starovoitov

    Alexei Starovoitov
     

15 Nov, 2020

1 commit

  • Currently the verifier enforces return code checks for subprograms in the
    same manner as it does for program entry points. This prevents returning
    arbitrary scalar values from subprograms. Scalar type of returned values
    is checked by btf_prepare_func_args() and hence it should be safe to
    allow only scalars for now. Relax return code checks for subprograms and
    allow any correct scalar values.
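
    An illustrative BPF C program that the relaxed check accepts (a sketch;
    section and function names are hypothetical):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    /* global function: verified separately, may now return any scalar */
    __attribute__((noinline)) int get_constant(__u64 val)
    {
        return val > 101 ? -1 : 95;
    }

    SEC("classifier")
    int entry(struct __sk_buff *skb)
    {
        /* the entry point itself is still subject to return code checks */
        return get_constant(skb->len) < 0 ? 0 : 1;
    }

    char _license[] SEC("license") = "GPL";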

    Fixes: 51c39bb1d5d10 ("bpf: Introduce function-by-function verification")
    Signed-off-by: Dmitrii Banshchikov
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201113171756.90594-1-me@ubique.spb.ru

    Dmitrii Banshchikov
     

11 Nov, 2020

1 commit

  • The unsigned variable datasec_id is assigned a return value from the call
    to check_pseudo_btf_id(), which may return a negative error code.

    This fixes the following coccicheck warning:

    ./kernel/bpf/verifier.c:9616:5-15: WARNING: Unsigned expression compared with zero: datasec_id > 0
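
    The bug class, demonstrated standalone (the stub stands in for
    check_pseudo_btf_id() and is hypothetical):

    #include <stdio.h>

    static int check_pseudo_btf_id_stub(void) { return -22; } /* -EINVAL */

    int main(void)
    {
        unsigned int datasec_id = check_pseudo_btf_id_stub();

        /* -22 wraps to 4294967274, so the error branch is never taken;
         * the check needs a signed type to catch negative returns */
        if (datasec_id > 0)
            printf("error missed: datasec_id = %u\n", datasec_id);
        return 0;
    }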

    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Reported-by: Tosk Robot
    Signed-off-by: Kaixu Xia
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Cc: Hao Luo
    Link: https://lore.kernel.org/bpf/1605071026-25906-1-git-send-email-kaixuxia@tencent.com

    Kaixu Xia
     

07 Nov, 2020

1 commit

  • The current logic checks if the name of the BTF type passed in
    attach_btf_id starts with "bpf_lsm_". This is not sufficient, as it also
    allows attachment to non-LSM hooks like the very function that performs
    this check, i.e. bpf_lsm_verify_prog.

    In order to ensure that this verification logic allows attachment to
    only LSM hooks, the LSM_HOOK definitions in lsm_hook_defs.h are used to
    generate a BTF_ID set. Upon verification, the attach_btf_id of the
    program being attached is checked for presence in this set.
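
    A sketch of the approach using the kernel's BTF-ID set infrastructure
    (simplified placement):

    BTF_SET_START(bpf_lsm_hooks)
    #define LSM_HOOK(RET, DEFAULT, NAME, ...) BTF_ID(func, bpf_lsm_##NAME)
    #include <linux/lsm_hook_defs.h>
    #undef LSM_HOOK
    BTF_SET_END(bpf_lsm_hooks)

    /* in bpf_lsm_verify_prog(): */
    if (!btf_id_set_contains(&bpf_lsm_hooks, prog->aux->attach_btf_id))
        return -EINVAL;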

    Fixes: 9e4e01dfd325 ("bpf: lsm: Implement attach, detach and execution")
    Signed-off-by: KP Singh
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201105230651.2621917-1-kpsingh@chromium.org

    KP Singh
     

06 Nov, 2020

2 commits

  • Zero-fill element values for all other cpus than current, just as
    when not using prealloc. This is the only way the bpf program can
    ensure known initial values for all cpus ('onallcpus' cannot be
    set when coming from the bpf program).

    The scenario is: bpf program inserts some elements in a per-cpu
    map, then deletes some (or userspace does). When later adding
    new elements using bpf_map_update_elem(), the bpf program can
    only set the value of the new elements for the current cpu.
    When prealloc is enabled, previously deleted elements are re-used.
    Without the fix, values for other cpus remain whatever they were
    when the re-used entry was previously freed.

    A selftest is added to validate correct operation in above
    scenario as well as in case of LRU per-cpu map element re-use.
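
    To illustrate the scenario from the BPF side (a sketch; map and section
    names are hypothetical):

    struct {
        __uint(type, BPF_MAP_TYPE_PERCPU_HASH); /* preallocated by default */
        __uint(max_entries, 16);
        __type(key, __u32);
        __type(value, __u64);
    } pcpu_map SEC(".maps");

    SEC("tp/syscalls/sys_enter_getpgid")
    int update(void *ctx)
    {
        __u32 key = 1;
        __u64 val = 42;

        /* sets the value for the current cpu only; with the fix, the
         * other cpus' slots of a re-used element read back as zero
         * instead of stale data */
        bpf_map_update_elem(&pcpu_map, &key, &val, BPF_ANY);
        return 0;
    }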

    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: David Verbeiren
    Signed-off-by: Alexei Starovoitov
    Acked-by: Matthieu Baerts
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201104112332.15191-1-david.verbeiren@tessares.net

    David Verbeiren
     
  • Fix build error when BPF_SYSCALL is not set/enabled but BPF_PRELOAD is
    by making BPF_PRELOAD depend on BPF_SYSCALL.

    ERROR: modpost: "bpf_preload_ops" [kernel/bpf/preload/bpf_preload.ko] undefined!

    Reported-by: kernel test robot
    Reported-by: Randy Dunlap
    Signed-off-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201105195109.26232-1-rdunlap@infradead.org

    Randy Dunlap
     

30 Oct, 2020

1 commit

  • Commit 3193c0836 ("bpf: Disable GCC -fgcse optimization for
    ___bpf_prog_run()") introduced a __no_fgcse macro that expands to a
    function scope __attribute__((optimize("-fno-gcse"))), to disable a
    GCC specific optimization that was causing trouble on x86 builds, and
    was not expected to have any positive effect in the first place.

    However, as the GCC manual documents, __attribute__((optimize))
    is not for production use, and causes all other optimization
    options to be forgotten for the function in question. This can
    cause all kinds of trouble, but in one particular reported case,
    it causes -fno-asynchronous-unwind-tables to be disregarded,
    resulting in .eh_frame info being emitted for the function.

    This reverts commit 3193c0836, and instead, it disables the -fgcse
    optimization for the entire source file, but only when building for
    X86 using GCC with CONFIG_BPF_JIT_ALWAYS_ON disabled. Note that the
    original commit states that CONFIG_RETPOLINE=n triggers the issue,
    whereas CONFIG_RETPOLINE=y performs better without the optimization,
    so it is kept disabled in both cases.
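
    For reference, the reverted construct looked roughly like this (hedged
    reconstruction):

    /* per-function attribute: GCC replaces, rather than extends, the
     * command-line optimization flags for the annotated function */
    #define __no_fgcse __attribute__((optimize("-fno-gcse")))

    /* the replacement passes -fno-gcse for the whole object file from the
     * Makefile instead, only for GCC on x86 with CONFIG_BPF_JIT_ALWAYS_ON
     * disabled */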

    Fixes: 3193c0836f20 ("bpf: Disable GCC -fgcse optimization for ___bpf_prog_run()")
    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Alexei Starovoitov
    Tested-by: Geert Uytterhoeven
    Reviewed-by: Nick Desaulniers
    Link: https://lore.kernel.org/lkml/CAMuHMdUg0WJHEcq6to0-eODpXPOywLot6UD2=GFHpzoj_hCoBQ@mail.gmail.com/
    Link: https://lore.kernel.org/bpf/20201028171506.15682-2-ardb@kernel.org

    Ard Biesheuvel
     

24 Oct, 2020

1 commit

  • Pull networking fixes from Jakub Kicinski:
    "Cross-tree/merge window issues:

    - rtl8150: don't incorrectly assign random MAC addresses; a fix late in
    the 5.9 cycle started depending on a return code from a function
    which changed with the 5.10 PR from the usb subsystem

    Current release regressions:

    - Revert "virtio-net: ethtool configurable RXCSUM", it was causing
    crashes at probe when control vq was not negotiated/available

    Previous release regressions:

    - ixgbe: fix probing of multi-port 10 Gigabit Intel NICs with an MDIO
    bus, only first device would be probed correctly

    - nexthop: Fix performance regression in nexthop deletion by
    effectively switching from recently added synchronize_rcu() to
    synchronize_rcu_expedited()

    - netsec: ignore 'phy-mode' device property on ACPI systems; the
    property is not populated correctly by the firmware, but firmware
    configures the PHY so just keep boot settings

    Previous releases - always broken:

    - tcp: fix to update snd_wl1 in bulk receiver fast path, addressing
    bulk transfers getting "stuck"

    - icmp: randomize the global rate limiter to prevent attackers from
    getting useful signal

    - r8169: fix operation under forced interrupt threading, make the
    driver always use hard irqs, even on RT, given the handler is light
    and only wants to schedule napi (and do so through a _irqoff()
    variant, preferably)

    - bpf: Enforce pointer id generation for all may-be-null register
    type to avoid pointers erroneously getting marked as null-checked

    - tipc: re-configure queue limit for broadcast link

    - net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN
    tunnels

    - fix various issues in chelsio inline tls driver

    Misc:

    - bpf: improve just-added bpf_redirect_neigh() helper api to support
    supplying nexthop by the caller - in case BPF program has already
    done a lookup we can avoid doing another one

    - remove unnecessary break statements

    - make MPTCP not select IPV6, but rather depend on it"

    * tag 'net-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (62 commits)
    tcp: fix to update snd_wl1 in bulk receiver fast path
    net: Properly typecast int values to set sk_max_pacing_rate
    netfilter: nf_fwd_netdev: clear timestamp in forwarding path
    ibmvnic: save changed mac address to adapter->mac_addr
    selftests: mptcp: depends on built-in IPv6
    Revert "virtio-net: ethtool configurable RXCSUM"
    rtnetlink: fix data overflow in rtnl_calcit()
    net: ethernet: mtk-star-emac: select REGMAP_MMIO
    net: hdlc_raw_eth: Clear the IFF_TX_SKB_SHARING flag after calling ether_setup
    net: hdlc: In hdlc_rcv, check to make sure dev is an HDLC device
    bpf, libbpf: Guard bpf inline asm from bpf_tail_call_static
    bpf, selftests: Extend test_tc_redirect to use modified bpf_redirect_neigh()
    bpf: Fix bpf_redirect_neigh helper api to support supplying nexthop
    mptcp: depends on IPV6 but not as a module
    sfc: move initialisation of efx->filter_sem to efx_init_struct()
    mpls: load mpls_gso after mpls_iptunnel
    net/sched: act_tunnel_key: fix OOB write in case of IPv6 ERSPAN tunnels
    net/sched: act_gate: Unlock ->tcfa_lock in tc_setup_flow_action()
    net: dsa: bcm_sf2: make const array static, makes object smaller
    mptcp: MPTCP_IPV6 should depend on IPV6 instead of selecting it
    ...

    Linus Torvalds
     

23 Oct, 2020

1 commit

  • Pull initial set_fs() removal from Al Viro:
    "Christoph's set_fs base series + fixups"

    * 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    fs: Allow a NULL pos pointer to __kernel_read
    fs: Allow a NULL pos pointer to __kernel_write
    powerpc: remove address space overrides using set_fs()
    powerpc: use non-set_fs based maccess routines
    x86: remove address space overrides using set_fs()
    x86: make TASK_SIZE_MAX usable from assembly code
    x86: move PAGE_OFFSET, TASK_SIZE & friends to page_{32,64}_types.h
    lkdtm: remove set_fs-based tests
    test_bitmap: remove user bitmap tests
    uaccess: add infrastructure for kernel builds with set_fs()
    fs: don't allow splice read/write without explicit ops
    fs: don't allow kernel reads and writes without iter ops
    sysctl: Convert to iter interfaces
    proc: add a read_iter method to proc proc_ops
    proc: cleanup the compat vs no compat file ops
    proc: remove a level of indentation in proc_get_inode

    Linus Torvalds
     

20 Oct, 2020

2 commits

  • The commit af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
    introduces RET_PTR_TO_BTF_ID_OR_NULL and
    the commit eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    introduces RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL.
    Note that for RET_PTR_TO_MEM_OR_BTF_ID_OR_NULL, the reg0->type
    could become PTR_TO_MEM_OR_NULL which is not covered by
    BPF_PROBE_MEM.

    BPF_REG_0 will then hold an _OR_NULL pointer type. This _OR_NULL
    pointer type requires the bpf program to do an explicit NULL check first.
    After the NULL check, the verifier marks all registers having
    the same reg->id as safe to use. However, the reg->id
    is not set for those new _OR_NULL return types. One way this
    can go wrong: checking NULL for one btf_id typed pointer would
    end up validating all other btf_id typed pointers, because
    all of them have id == 0. The tests later in this series exercise
    this path.

    To fix it, and to avoid similar issues in the future, this patch
    moves the id generation logic out of each individual RET type
    test in check_helper_call(). Instead, it does one
    reg_type_may_be_null() test and then does the id generation
    if needed.
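
    In essence (a sketch of the centralized check in check_helper_call()):

    /* one place instead of per-RET-type id assignment */
    if (reg_type_may_be_null(regs[BPF_REG_0].type))
        regs[BPF_REG_0].id = ++env->id_gen;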

    This patch also adds a WARN_ON_ONCE in mark_ptr_or_null_reg()
    to catch future breakage.

    The _OR_NULL pointer usage in the bpf_iter_reg.ctx_arg_info is
    fine because it just happens that the existing id generation after
    check_ctx_access() has covered it. It is also using the
    reg_type_may_be_null() to decide if id generation is needed or not.

    Fixes: af7ec1383361 ("bpf: Add bpf_skc_to_tcp6_sock() helper")
    Fixes: eaa6bcb71ef6 ("bpf: Introduce bpf_per_cpu_ptr()")
    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20201019194212.1050855-1-kafai@fb.com

    Martin KaFai Lau
     
  • A break is not needed if it is preceded by a return.

    Signed-off-by: Tom Rix
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201019173846.1021-1-trix@redhat.com

    Tom Rix
     

16 Oct, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing stack
    traversal to be limited in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multipath TCP (MPTCP), support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     

15 Oct, 2020

2 commits

  • The 64-bit JEQ/JNE handling in reg_set_min_max() was clearing reg->id in either
    true or false branch. In the case 'if (reg->id)' check was done on the other
    branch the counter part register would have reg->id == 0 when called into
    find_equal_scalars(). In such case the helper would incorrectly identify other
    registers with id == 0 as equivalent and propagate the state incorrectly.
    Fix it by preserving ID across reg_set_min_max().

    In other words any kind of comparison operator on the scalar register
    should preserve its ID to recognize:

    r1 = r2
    if (r1 == 20) {
      #1 here both r1 and r2 == 20
    } else if (r2 < 20) {
      #2 here both r1 and r2 < 20
    }

    The patch is addressing #1 case. The #2 was working correctly already.

    Fixes: 75748837b7e5 ("bpf: Propagate scalar ranges through register assignments.")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Tested-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201014175608.1416-1-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     
  • Pull objtool updates from Ingo Molnar:
    "Most of the changes are cleanups and reorganization to make the
    objtool code more arch-agnostic. This is in preparation for non-x86
    support.

    Other changes:

    - KASAN fixes

    - Handle unreachable trap after call to noreturn functions better

    - Ignore unreachable fake jumps

    - Misc smaller fixes & cleanups"

    * tag 'objtool-core-2020-10-13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits)
    perf build: Allow nested externs to enable BUILD_BUG() usage
    objtool: Allow nested externs to enable BUILD_BUG()
    objtool: Permit __kasan_check_{read,write} under UACCESS
    objtool: Ignore unreachable trap after call to noreturn functions
    objtool: Handle calling non-function symbols in other sections
    objtool: Ignore unreachable fake jumps
    objtool: Remove useless tests before save_reg()
    objtool: Decode unwind hint register depending on architecture
    objtool: Make unwind hint definitions available to other architectures
    objtool: Only include valid definitions depending on source file type
    objtool: Rename frame.h -> objtool.h
    objtool: Refactor jump table code to support other architectures
    objtool: Make relocation in alternative handling arch dependent
    objtool: Abstract alternative special case handling
    objtool: Move macros describing structures to arch-dependent code
    objtool: Make sync-check consider the target architecture
    objtool: Group headers to check in a single list
    objtool: Define 'struct orc_entry' only when needed
    objtool: Skip ORC entry creation for non-text sections
    objtool: Move ORC logic out of check()
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-10-12

    The main changes are:

    1) The BPF verifier improvements to track register allocation pattern, from Alexei and Yonghong.

    2) libbpf relocation support for different size load/store, from Andrii.

    3) bpf_redirect_peer() helper and support for inner map array with different max_entries, from Daniel.

    4) BPF support for per-cpu variables, from Hao.

    5) sockmap improvements, from John.
    ====================

    Signed-off-by: Jakub Kicinski

    Jakub Kicinski
     

12 Oct, 2020

1 commit

  • Recent work in f4d05259213f ("bpf: Add map_meta_equal map ops") and 134fede4eecf
    ("bpf: Relax max_entries check for most of the inner map types") added support
    for dynamic inner max elements for most map-in-map types. Exceptions were maps
    like array or prog array where the map_gen_lookup() callback uses the maps'
    max_entries field as a constant when emitting instructions.

    We recently implemented Maglev consistent hashing into Cilium's load balancer
    which uses map-in-map with an outer map being hash and inner being array holding
    the Maglev backend table for each service. This has been designed this way in
    order to reduce overall memory consumption given the outer hash map allows to
    avoid preallocating a large, flat memory area for all services. Also, the
    number of service mappings is not always known a priori.

    The use case for dynamic inner array map entries is to further reduce memory
    overhead, for example, some services might just have a small number of back
    ends while others could have a large number. Right now the Maglev backend table
    for small and large number of backends would need to have the same inner array
    map entries which adds a lot of unneeded overhead.

    Dynamic inner array map entries can be realized by avoiding the inlined code
    generation for their lookup. The lookup will still be efficient since it will
    be calling into array_map_lookup_elem() directly and thus avoiding retpoline.
    The patch adds a BPF_F_INNER_MAP flag to map creation which therefore skips
    inline code generation and relaxes array_map_meta_equal() check to ignore both
    maps' max_entries. This also still allows to have faster lookups for map-in-map
    when BPF_F_INNER_MAP is not specified and hence dynamic max_entries not needed.
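
    A sketch of the intended usage with libbpf BTF-defined maps (map names
    hypothetical):

    struct inner_arr {
        __uint(type, BPF_MAP_TYPE_ARRAY);
        __uint(map_flags, BPF_F_INNER_MAP); /* skip inlined lookup, allow
                                             * differing max_entries */
        __uint(max_entries, 1);
        __type(key, int);
        __type(value, int);
    } inner1 SEC(".maps");

    struct {
        __uint(type, BPF_MAP_TYPE_HASH_OF_MAPS);
        __uint(max_entries, 128);
        __type(key, int);
        __array(values, struct inner_arr);
    } outer SEC(".maps");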

    Example code generation where inner map is dynamic sized array:

    # bpftool p d x i 125
    int handle__sys_enter(void * ctx):
    ; int handle__sys_enter(void *ctx)
    0: (b4) w1 = 0
    ; int key = 0;
    1: (63) *(u32 *)(r10 -4) = r1
    2: (bf) r2 = r10
    ;
    3: (07) r2 += -4
    ; inner_map = bpf_map_lookup_elem(&outer_arr_dyn, &key);
    4: (18) r1 = map[id:468]
    6: (07) r1 += 272
    7: (61) r0 = *(u32 *)(r2 +0)
    8: (35) if r0 >= 0x3 goto pc+5
    9: (67) r0 <
    [...]

    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201010234006.7075-4-daniel@iogearbox.net

    Daniel Borkmann
     

10 Oct, 2020

2 commits

  • Under register pressure, LLVM may spill registers with bounds into the stack.
    The verifier has to track them through spill/fill otherwise many kinds of bound
    errors will be seen. The spill/fill of induction variables was already
    happening. This patch extends this logic from tracking spill/fill of a constant
    into any bounded register. There is no need to track spill/fill of unbounded,
    since no new information will be retrieved from the stack during register fill.

    Though extra stack difference could cause state pruning to be less effective, no
    adverse effects were seen from this patch on selftests and on cilium programs.
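
    Conceptually, the spill side becomes (a hedged sketch; helper names
    follow the verifier's conventions and are assumptions):

    /* on a 64-bit stack spill, keep the full register state not only for
     * constants but for any bounded scalar */
    if (reg->type == SCALAR_VALUE && size == BPF_REG_SIZE &&
        register_is_bounded(reg))
        save_register_state(state, spi, reg);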

    Signed-off-by: Yonghong Song
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20201009011240.48506-3-alexei.starovoitov@gmail.com

    Yonghong Song
     
  • The LLVM register allocator may use two different registers to represent the
    same virtual register. In such a case the following pattern can be observed:
    1047: (bf) r9 = r6
    1048: (a5) if r6 < 0x1000 goto pc+1
    1050: ...
    1051: (a5) if r9 < 0x2 goto pc+66
    1052: ...
    1053: (bf) r2 = r9 /* r2 needs to have upper and lower bounds */

    This is normal behavior of the greedy register allocator.
    Slides 137+ of the following talk explain why regalloc introduces such register copies:
    http://llvm.org/devmtg/2018-04/slides/Yatsina-LLVM%20Greedy%20Register%20Allocator.pdf
    There is no way to tell llvm not to do this.
    Hence the verifier has to recognize such patterns.

    In order to track this information without backtracking allocate ID
    for scalars in a similar way as it's done for find_good_pkt_pointers().

    When the verifier encounters r9 = r6 assignment it will assign the same ID
    to both registers. Later if either register range is narrowed via conditional
    jump propagate the register state into the other register.

    Clear register ID in adjust_reg_min_max_vals() for any alu instruction. The
    register ID is ignored for scalars in regsafe() and doesn't affect state
    pruning. mark_reg_unknown() clears the ID. It's used to process call, endian
    and other instructions. Hence ID is explicitly cleared only in
    adjust_reg_min_max_vals() and in 32-bit mov.

    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Acked-by: Andrii Nakryiko
    Acked-by: John Fastabend
    Link: https://lore.kernel.org/bpf/20201009011240.48506-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

08 Oct, 2020

2 commits

  • Simon reported an issue with the current scalar32_min_max_or() implementation.
    That is, compared to the other 32-bit subreg tracking functions, the code in
    scalar32_min_max_or() stands out in that it uses the 64-bit registers instead
    of the 32-bit ones. This leads to bounds tracking issues, for example:

    [...]
    8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    8: (79) r1 = *(u64 *)(r0 +0)
    R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    9: (b7) r0 = 1
    10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    10: (18) r2 = 0x600000002
    12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    12: (ad) if r1 < r2 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: (95) exit
    14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    14: (25) if r1 > 0x0 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: (95) exit
    16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    16: (47) r1 |= 0
    17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x1; 0x700000000),s32_max_value=1,u32_max_value=1) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    [...]

    The bound tests on the map value force the upper unsigned bound to be 25769803777
    in 64 bit (0b11000000000000000000000000000000001) and then lower one to be 1. By
    using OR they are truncated and thus result in the range [1,1] for the 32 bit reg
    tracker. This is incorrect given the only thing we know is that the value must be
    positive and thus 2147483647 (0b1111111111111111111111111111111) at max for the
    subregs. Fix it by using the {u,s}32_{min,max}_value vars instead. This also
    makes sense: for example, when we update dst_reg->s32_{min,max}_value in
    the else branch, we need to use the newly computed dst_reg->u32_{min,max}_value,
    as we know that these are positive. Previously, in the else branch the 64-bit
    values of umin_value=1 and umax_value=32212254719 were used and the latter got
    truncated to 1 as the upper bound there. After the fix the subreg range is now correct:

    [...]
    8: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    8: (79) r1 = *(u64 *)(r0 +0)
    R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R10=fp0 fp-8=mmmmmmmm
    9: R0=map_value(id=0,off=0,ks=4,vs=48,imm=0) R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    9: (b7) r0 = 1
    10: R0_w=inv1 R1_w=inv(id=0) R10=fp0 fp-8=mmmmmmmm
    10: (18) r2 = 0x600000002
    12: R0_w=inv1 R1_w=inv(id=0) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    12: (ad) if r1 < r2 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: R0_w=inv1 R1_w=inv(id=0,umin_value=25769803778) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    13: (95) exit
    14: R0_w=inv1 R1_w=inv(id=0,umax_value=25769803777,var_off=(0x0; 0x7ffffffff)) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    14: (25) if r1 > 0x0 goto pc+1
    R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: R0_w=inv1 R1_w=inv(id=0,umax_value=0,var_off=(0x0; 0x7fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    15: (95) exit
    16: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=25769803777,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    16: (47) r1 |= 0
    17: R0_w=inv1 R1_w=inv(id=0,umin_value=1,umax_value=32212254719,var_off=(0x0; 0x77fffffff),u32_max_value=2147483647) R2_w=inv25769803778 R10=fp0 fp-8=mmmmmmmm
    [...]
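
    A simplified sketch of the fixed function, matching the description
    above (constant-operand case elided):

    static void scalar32_min_max_or(struct bpf_reg_state *dst_reg,
                                    struct bpf_reg_state *src_reg)
    {
        struct tnum var32_off = tnum_subreg(dst_reg->var_off);
        s32 smin_val = src_reg->s32_min_value;
        u32 umin_val = src_reg->u32_min_value;

        /* OR only sets bits: umin can only grow, umax follows var_off */
        dst_reg->u32_min_value = max(dst_reg->u32_min_value, umin_val);
        dst_reg->u32_max_value = var32_off.value | var32_off.mask;
        if (dst_reg->s32_min_value < 0 || smin_val < 0) {
            /* sign bit unknown: give up on the signed 32-bit bounds */
            dst_reg->s32_min_value = S32_MIN;
            dst_reg->s32_max_value = S32_MAX;
        } else {
            /* ORing two non-negatives stays non-negative, so the new
             * u32 bounds can be reused as signed bounds */
            dst_reg->s32_min_value = dst_reg->u32_min_value;
            dst_reg->s32_max_value = dst_reg->u32_max_value;
        }
    }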

    Fixes: 3f50f132d840 ("bpf: Verifier, do explicit ALU32 bounds tracking")
    Reported-by: Simon Scannell
    Signed-off-by: Daniel Borkmann
    Reviewed-by: John Fastabend
    Acked-by: Alexei Starovoitov

    Daniel Borkmann
     
  • Fix build errors in kernel/bpf/verifier.c when CONFIG_NET is
    not enabled.

    ../kernel/bpf/verifier.c:3995:13: error: ‘btf_sock_ids’ undeclared here (not in a function); did you mean ‘bpf_sock_ops’?
    .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],

    ../kernel/bpf/verifier.c:3995:26: error: ‘BTF_SOCK_TYPE_SOCK_COMMON’ undeclared here (not in a function); did you mean ‘PTR_TO_SOCK_COMMON’?
    .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
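
    One way to read the fix (a hedged sketch): guard the net-only BTF IDs
    behind CONFIG_NET, e.g.:

    #ifdef CONFIG_NET
        .btf_id = &btf_sock_ids[BTF_SOCK_TYPE_SOCK_COMMON],
    #endif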

    Fixes: 1df8f55a37bd ("bpf: Enable bpf_skc_to_* sock casting helper to networking prog type")
    Signed-off-by: Randy Dunlap
    Signed-off-by: Alexei Starovoitov
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201007021613.13646-1-rdunlap@infradead.org

    Randy Dunlap
     

06 Oct, 2020

2 commits

  • Rejecting non-native endian BTF overlapped with the addition
    of support for it.

    The rest were more simple overlapping changes, except the
    renesas ravb binding update, which had to follow a file
    move as well as a YAML conversion.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Recent improvements in LOCKDEP highlighted a potential A-A deadlock with
    pcpu_freelist in NMI:

    ./tools/testing/selftests/bpf/test_progs -t stacktrace_build_id_nmi

    [ 18.984807] ================================
    [ 18.984807] WARNING: inconsistent lock state
    [ 18.984808] 5.9.0-rc6-01771-g1466de1330e1 #2967 Not tainted
    [ 18.984809] --------------------------------
    [ 18.984809] inconsistent {INITIAL USE} -> {IN-NMI} usage.
    [ 18.984810] test_progs/1990 [HC2[2]:SC0[0]:HE0:SE1] takes:
    [ 18.984810] ffffe8ffffc219c0 (&head->lock){....}-{2:2}, at: __pcpu_freelist_pop+0xe3/0x180
    [ 18.984813] {INITIAL USE} state was registered at:
    [ 18.984814] lock_acquire+0x175/0x7c0
    [ 18.984814] _raw_spin_lock+0x2c/0x40
    [ 18.984815] __pcpu_freelist_pop+0xe3/0x180
    [ 18.984815] pcpu_freelist_pop+0x31/0x40
    [ 18.984816] htab_map_alloc+0xbbf/0xf40
    [ 18.984816] __do_sys_bpf+0x5aa/0x3ed0
    [ 18.984817] do_syscall_64+0x2d/0x40
    [ 18.984818] entry_SYSCALL_64_after_hwframe+0x44/0xa9
    [ 18.984818] irq event stamp: 12
    [...]
    [ 18.984822] other info that might help us debug this:
    [ 18.984823] Possible unsafe locking scenario:
    [ 18.984823]
    [ 18.984824] CPU0
    [ 18.984824] ----
    [ 18.984824] lock(&head->lock);
    [ 18.984826] <Interrupt>
    [ 18.984826] lock(&head->lock);
    [ 18.984827]
    [ 18.984828] *** DEADLOCK ***
    [ 18.984828]
    [ 18.984829] 2 locks held by test_progs/1990:
    [...]
    [ 18.984838]
    [ 18.984838] dump_stack+0x9a/0xd0
    [ 18.984839] lock_acquire+0x5c9/0x7c0
    [ 18.984839] ? lock_release+0x6f0/0x6f0
    [ 18.984840] ? __pcpu_freelist_pop+0xe3/0x180
    [ 18.984840] _raw_spin_lock+0x2c/0x40
    [ 18.984841] ? __pcpu_freelist_pop+0xe3/0x180
    [ 18.984841] __pcpu_freelist_pop+0xe3/0x180
    [ 18.984842] pcpu_freelist_pop+0x17/0x40
    [ 18.984842] ? lock_release+0x6f0/0x6f0
    [ 18.984843] __bpf_get_stackid+0x534/0xaf0
    [ 18.984843] bpf_prog_1fd9e30e1438d3c5_oncpu+0x73/0x350
    [ 18.984844] bpf_overflow_handler+0x12f/0x3f0

    This is because pcpu_freelist_head.lock is accessed in both NMI and
    non-NMI context. Fix this issue by using raw_spin_trylock() in NMI.

    Since NMI interrupts non-NMI context, when NMI context tries to lock the
    raw_spinlock, non-NMI context of the same CPU may already have locked a
    lock and is blocked from unlocking the lock. For a system with N CPUs,
    there could be N NMIs at the same time, and they may block N non-NMI
    raw_spinlocks. This is tricky for pcpu_freelist_push(), where unlike
    _pop(), a failing _push() means leaking memory. This issue is more likely
    to trigger on a non-SMP system.

    Fix this issue with an extra list, pcpu_freelist.extralist. The extralist
    is primarily used to take a _push() when raw_spin_trylock() fails on all
    the per-CPU lists. It should be empty most of the time. The following
    table summarizes the behavior of pcpu_freelist in NMI and non-NMI:

    non-NMI pop(): use _lock(); check per CPU lists first;
    if all per CPU lists are empty, check extralist;
    if extralist is empty, return NULL.

    non-NMI push(): use _lock(); only push to per CPU lists.

    NMI pop(): use _trylock(); check per CPU lists first;
    if all per CPU lists are locked or empty, check extralist;
    if extralist is locked or empty, return NULL.

    NMI push(): use _trylock(); check per CPU lists first;
    if all per CPU lists are locked; try push to extralist;
    if extralist is also locked, keep trying on per CPU lists.
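
    A sketch of the NMI-safe push path described above (simplified; the
    iteration over the per-CPU lists is elided):

    static void pcpu_freelist_push_nmi(struct pcpu_freelist *s,
                                       struct pcpu_freelist_node *node)
    {
        while (1) {
            struct pcpu_freelist_head *head = this_cpu_ptr(s->freelist);

            /* per-CPU lists first (iteration over cpus elided) */
            if (raw_spin_trylock(&head->lock)) {
                node->next = head->first;
                head->first = node;
                raw_spin_unlock(&head->lock);
                return;
            }
            /* all per-CPU lists locked: fall back to the extralist */
            if (raw_spin_trylock(&s->extralist.lock)) {
                node->next = s->extralist.first;
                s->extralist.first = node;
                raw_spin_unlock(&s->extralist.lock);
                return;
            }
            /* extralist also locked: keep retrying the per-CPU lists */
        }
    }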

    Reported-by: Alexei Starovoitov
    Signed-off-by: Song Liu
    Signed-off-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Link: https://lore.kernel.org/bpf/20201005165838.3735218-1-songliubraving@fb.com

    Song Liu
     

05 Oct, 2020

1 commit

  • Replace /* fallthrough */ comments with the new pseudo-keyword
    macro fallthrough [1].

    [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through
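
    The transformation, illustrated (case labels hypothetical):

    switch (cmd) {
    case CMD_PREPARE:
        prepare();
        fallthrough;    /* replaces the old comment-style annotation */
    case CMD_RUN:
        run();
        break;
    }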

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Yonghong Song
    Link: https://lore.kernel.org/bpf/20201002234217.GA12280@embeddedor

    Gustavo A. R. Silva
     

03 Oct, 2020

4 commits

  • We are missing a reference drop for the case when we are doing
    BPF_PROG_BIND_MAP on a map that's already held by the program.
    There is an 'if (ret) bpf_map_put(map)' below which doesn't trigger
    because we don't consider this an error.
    Let's add the missing bpf_map_put() for this specific condition.
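
    The shape of the fix, as a sketch (simplified from the description
    above):

    for (i = 0; i < prog->aux->used_map_cnt; i++) {
        if (used_maps_old[i] == map) {
            bpf_map_put(map);   /* drop the ref taken by this syscall;
                                 * the program already holds the map */
            goto out_unlock;
        }
    }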

    Fixes: ef15314aa5de ("bpf: Add BPF_PROG_BIND_MAP syscall")
    Reported-by: Alexei Starovoitov
    Signed-off-by: Stanislav Fomichev
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20201003002544.3601440-1-sdf@google.com

    Stanislav Fomichev
     
  • Add bpf_this_cpu_ptr() to help access percpu vars on this cpu. This
    helper always returns a valid pointer, so there is no need to check the
    returned value for NULL. Also note that all programs run with
    preemption disabled, which means that the returned pointer is stable
    for the entire execution of the program.
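
    Example usage from BPF C (a sketch; runqueues is the percpu ksym used
    by the selftests and needs vmlinux.h for struct rq):

    extern const struct rq runqueues __ksym;

    SEC("raw_tp/sys_enter")
    int on_enter(void *ctx)
    {
        /* never NULL: always resolves for the current cpu */
        struct rq *rq = bpf_this_cpu_ptr(&runqueues);

        bpf_printk("cpu=%d", rq->cpu);
        return 0;
    }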

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-6-haoluo@google.com

    Hao Luo
     
  • Add bpf_per_cpu_ptr() to help bpf programs access percpu vars.
    bpf_per_cpu_ptr() has the same semantic as per_cpu_ptr() in the kernel
    except that it may return NULL. This happens when the cpu parameter is
    out of range. So the caller must check the returned value.
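
    Usage therefore needs the NULL check (same assumed percpu ksym as
    above):

    struct rq *rq = bpf_per_cpu_ptr(&runqueues, cpu);

    if (!rq)    /* cpu was out of range */
        return 0;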

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-5-haoluo@google.com

    Hao Luo
     
  • Pseudo_btf_id is a type of ld_imm insn that associates a btf_id to a
    ksym so that further dereferences on the ksym can use the BTF info
    to validate accesses. Internally, when seeing a pseudo_btf_id ld insn,
    the verifier reads the btf_id stored in the insn[0]'s imm field and
    marks the dst_reg as PTR_TO_BTF_ID. The btf_id points to a VAR_KIND,
    which is encoded in btf_vmlinux by pahole. If the VAR is not of a struct
    type, the dst reg will be marked as PTR_TO_MEM instead of PTR_TO_BTF_ID
    and the mem_size is resolved to the size of the VAR's type.

    From the VAR btf_id, the verifier can also read the address of the
    ksym's corresponding kernel var from kallsyms and use that to fill
    dst_reg.

    Therefore, the proper functionality of pseudo_btf_id depends on (1)
    kallsyms and (2) the encoding of kernel global VARs in pahole, which
    should be available since pahole v1.18.
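
    From the program side this surfaces as typed ksym externs (a sketch;
    the symbol name is hypothetical):

    /* libbpf turns this into a pseudo_btf_id ld_imm64 against vmlinux BTF */
    extern const int some_kernel_int __ksym;

    SEC("raw_tp/sys_enter")
    int read_ksym(void *ctx)
    {
        /* direct, BTF-checked read of the kernel variable */
        bpf_printk("val=%d", some_kernel_int);
        return 0;
    }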

    Signed-off-by: Hao Luo
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/20200929235049.2533242-2-haoluo@google.com

    Hao Luo
     

01 Oct, 2020

2 commits

  • Currently, a perf event in a perf event array is removed from the array
    when the map fd used to add the event is closed. This behavior makes it
    difficult to share perf events with a perf event array.

    Introduce a new map flag, BPF_F_PRESERVE_ELEMS, that keeps the perf events
    open. With this flag set, perf events in the array are not
    removed when the original map fd is closed. Instead, the perf event will
    stay in the map until 1) it is explicitly removed from the array; or 2)
    the array is freed.
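
    Declared from BPF C, this looks like the following (a sketch mirroring
    the selftest style):

    struct {
        __uint(type, BPF_MAP_TYPE_PERF_EVENT_ARRAY);
        __uint(map_flags, BPF_F_PRESERVE_ELEMS); /* events outlive the fd */
        __uint(key_size, sizeof(int));
        __uint(value_size, sizeof(int));
        __uint(max_entries, 64);
    } events SEC(".maps");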

    Signed-off-by: Song Liu
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200930224927.1936644-2-songliubraving@fb.com

    Song Liu
     
  • With its use in BPF, the cookie generator can be called very frequently
    in particular when used out of cgroup v2 hooks (e.g. connect / sendmsg)
    and attached to the root cgroup, for example, when used in v1/v2 mixed
    environments. In particular, when there's a high churn on sockets in the
    system there can be many parallel requests to the bpf_get_socket_cookie()
    and bpf_get_netns_cookie() helpers which then cause contention on the
    atomic counter.

    As similarly done in f991bd2e1421 ("fs: introduce a per-cpu last_ino
    allocator"), add a small helper library that both can use for the 64 bit
    counters. Given this can be called from different contexts, we also need
    to deal with potential nested calls even though in practice they are
    considered extremely rare. One idea as suggested by Eric Dumazet was
    to use a reverse counter for this situation since we don't expect 64 bit
    overflows anyways; that way, we can avoid bigger gaps in the 64 bit
    counter space compared to just batch-wise increase. Even on machines
    with small number of cores (e.g. 4) the cookie generation shrinks from
    min/max/med/avg (ns) of 22/50/40/38.9 down to 10/35/14/17.3 when run
    in parallel from multiple CPUs.
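
    A simplified sketch of the scheme (field names and the batch size are
    assumptions, not the exact kernel code):

    #define COOKIE_LOCAL_BATCH 4096

    u64 gen_cookie_next(struct gen_cookie *gc)
    {
        struct pcpu_gen_cookie *local = this_cpu_ptr(gc->local);
        u64 val;

        if (likely(local_inc_return(&local->nesting) == 1)) {
            /* common case: hand out from a per-cpu batch, touching the
             * shared forward counter only once per batch */
            val = local->last;
            if (unlikely((val & (COOKIE_LOCAL_BATCH - 1)) == 0))
                val = atomic64_add_return(COOKIE_LOCAL_BATCH,
                                          &gc->forward_last) -
                      COOKIE_LOCAL_BATCH;
            local->last = ++val;
        } else {
            /* nested call (e.g. NMI): allocate downward from the top of
             * the 64-bit space so it never collides with forward batches */
            val = (u64)atomic64_dec_return(&gc->reverse_last);
        }
        local_dec(&local->nesting);
        return val;
    }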

    Signed-off-by: Daniel Borkmann
    Signed-off-by: Alexei Starovoitov
    Reviewed-by: Eric Dumazet
    Acked-by: Martin KaFai Lau
    Cc: Eric Dumazet
    Link: https://lore.kernel.org/bpf/8a80b8d27d3c49f9a14e1d5213c19d8be87d1dc8.1601477936.git.daniel@iogearbox.net

    Daniel Borkmann
     

30 Sep, 2020

4 commits

  • Eelco reported we can't properly access arguments if the tracing
    program is attached to an extension program.

    Having following program:

    SEC("classifier/test_pkt_md_access")
    int test_pkt_md_access(struct __sk_buff *skb)

    with its extension:

    SEC("freplace/test_pkt_md_access")
    int test_pkt_md_access_new(struct __sk_buff *skb)

    and tracing that extension with:

    SEC("fentry/test_pkt_md_access_new")
    int BPF_PROG(fentry, struct sk_buff *skb)

    It's not possible to access skb argument in the fentry program,
    with following error from verifier:

    ; int BPF_PROG(fentry, struct sk_buff *skb)
    0: (79) r1 = *(u64 *)(r1 +0)
    invalid bpf_context access off=0 size=8

    The problem is that btf_ctx_access() gets the context type for the
    traced program, which in this case is the extension.

    But when we trace an extension program, we want to get the context
    type of the program that the extension is attached to, so we can
    access the arguments properly in the trace program.

    This version of the patch is tweaked slightly from Jiri's original one,
    since the refactoring in the previous patches means we have to get the
    target prog type from the new variable in prog->aux instead of directly
    from the target prog.

    Reported-by: Eelco Chaudron
    Suggested-by: Jiri Olsa
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355278.48470.17057040257274725638.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • This enables support for attaching freplace programs to multiple attach
    points. It does this by amending the UAPI for bpf_link_create with a target
    btf ID that can be used to supply the new attachment point along with the
    target program fd. The target must be compatible with the target that was
    supplied at program load time.

    The implementation reuses the checks that were factored out of
    check_attach_btf_id() to ensure compatibility between the BTF types of the
    old and new attachment. If these match, a new bpf_tracing_link will be
    created for the new attach target, allowing multiple attachments to
    co-exist simultaneously.

    The code could theoretically support multiple-attach of other types of
    tracing programs as well, but since I don't have a use case for any of
    those, there is no API support for doing so.
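
    From user space, re-attachment could look roughly like this (libbpf
    low-level API; the target_btf_id option field is the one introduced by
    this series):

    DECLARE_LIBBPF_OPTS(bpf_link_create_opts, opts,
                        .target_btf_id = btf_id_of_new_target);

    /* freplace prog previously loaded against a compatible target */
    int link_fd = bpf_link_create(freplace_prog_fd, new_target_prog_fd,
                                  0 /* attach_type */, &opts);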

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355169.48470.17165680973640685368.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • In preparation for allowing multiple attachments of freplace programs, move
    the references to the target program and trampoline into the
    bpf_tracing_link structure when that is created. To do this atomically,
    introduce a new mutex in prog->aux to protect writing to the two pointers
    to target prog and trampoline, and rename the members to make it clear that
    they are related.

    With this change, it is no longer possible to attach the same tracing
    program multiple times (detaching in-between), since the reference from the
    tracing program to the target disappears on the first attach. However,
    since the next patch will let the caller supply an attach target, that will
    also make it possible to attach to the same place multiple times.

    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Acked-by: Andrii Nakryiko
    Link: https://lore.kernel.org/bpf/160138355059.48470.2503076992210324984.stgit@toke.dk

    Toke Høiland-Jørgensen
     
  • The Makefile in bpf/preload builds a local copy of libbpf, but does not
    properly clean up after itself. This can lead to subsequent compilation
    failures, since the feature detection cache is kept around which can lead
    subsequent detection to fail.

    Fix this by properly setting clean-files, and while we're at it, also add a
    .gitignore for the directory to ignore the build artifacts.

    Fixes: d71fa5c9763c ("bpf: Add kernel module with user mode driver that populates bpffs.")
    Signed-off-by: Toke Høiland-Jørgensen
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200927193005.8459-1-toke@redhat.com

    Toke Høiland-Jørgensen
     

29 Sep, 2020

3 commits

  • A helper is added to allow seq file writing of kernel data
    structures using vmlinux BTF. Its signature is

    long bpf_seq_printf_btf(struct seq_file *m, struct btf_ptr *ptr,
    u32 btf_ptr_size, u64 flags);

    Flags and struct btf_ptr definitions/use are identical to the
    bpf_snprintf_btf helper, and the helper returns 0 on success
    or a negative error value.
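
    Typical use from a task iterator program (a sketch based on the
    signature above):

    SEC("iter/task")
    int dump_task(struct bpf_iter__task *ctx)
    {
        struct seq_file *seq = ctx->meta->seq;
        struct task_struct *task = ctx->task;
        struct btf_ptr ptr = {};

        if (!task)
            return 0;

        ptr.ptr = task;
        ptr.type_id = bpf_core_type_id_kernel(struct task_struct);
        /* pretty-print the whole task_struct via vmlinux BTF */
        bpf_seq_printf_btf(seq, &ptr, sizeof(ptr), 0);
        return 0;
    }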

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-8-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • BPF iter size is limited to PAGE_SIZE; if we wish to display BTF-based
    representations of larger kernel data structures such as task_struct,
    this will be insufficient.

    Suggested-by: Alexei Starovoitov
    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-6-git-send-email-alan.maguire@oracle.com

    Alan Maguire
     
  • A helper is added to support tracing kernel type information in BPF
    using the BPF Type Format (BTF). Its signature is

    long bpf_snprintf_btf(char *str, u32 str_size, struct btf_ptr *ptr,
    u32 btf_ptr_size, u64 flags);

    struct btf_ptr * specifies

    - a pointer to the data to be traced
    - the BTF id of the type of data pointed to
    - a flags field is provided for future use; these flags
    are not to be confused with the BTF_F_* flags
    below that control how the btf_ptr is displayed. The
    flags member of struct btf_ptr may be used to
    disambiguate types in kernel versus module BTF, etc.;
    the main distinction is that these flags relate to the type
    and the information needed to identify it, not to how it
    is displayed.

    For example a BPF program with a struct sk_buff *skb
    could do the following:

    static struct btf_ptr b = { };

    b.ptr = skb;
    b.type_id = __builtin_btf_type_id(struct sk_buff, 1);
    bpf_snprintf_btf(str, sizeof(str), &b, sizeof(b), 0);

    Default output looks like this:

    (struct sk_buff){
    .transport_header = (__u16)65535,
    .mac_header = (__u16)65535,
    .end = (sk_buff_data_t)192,
    .head = (unsigned char *)0x000000007524fd8b,
    .data = (unsigned char *)0x000000007524fd8b,
    .truesize = (unsigned int)768,
    .users = (refcount_t){
    .refs = (atomic_t){
    .counter = (int)1,
    },
    },
    }

    Flags modifying display are as follows:

    - BTF_F_COMPACT: no formatting around type information
    - BTF_F_NONAME: no struct/union member names/types
    - BTF_F_PTR_RAW: show raw (unobfuscated) pointer values;
    equivalent to %px.
    - BTF_F_ZERO: show zero-valued struct/union members;
    they are not displayed by default

    Signed-off-by: Alan Maguire
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/1601292670-1616-4-git-send-email-alan.maguire@oracle.com

    Alan Maguire