01 Mar, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-02-28

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 41 non-merge commits during the last 7 day(s) which contain
    a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

    The main changes are:

    1) BPF and Real-Time nicely co-exist.

    2) bpftool feature improvements.

    3) retrieve bpf_sk_storage via INET_DIAG.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Feb, 2020

4 commits

  • This patch adds INET_DIAG support to bpf_sk_storage.

    1. Although this series adds bpf_sk_storage diag capability to inet sk,
    bpf_sk_storage is in general applicable to all fullsock. Hence, the
    bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller
    will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
    the argument.

    2. The request will be like:
       INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in a later patch)
           SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
           SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
           ......

    Considering there could be multiple bpf_sk_storages in a sk,
    instead of reusing INET_DIAG_INFO ("ss -i"), the user can select
    specific bpf_sk_storages to dump by specifying an array of
    SK_DIAG_BPF_STORAGE_REQ_MAP_FD (see the userspace sketch after
    point 4 below).

    If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
    INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
    of a sk.

    3. The reply will be like:
       INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in a later patch)
           SK_DIAG_BPF_STORAGE (nla_nest)
               SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
               SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
           SK_DIAG_BPF_STORAGE (nla_nest)
               SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
               SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
           ......

    4. Unlike other INET_DIAG info of a sk, which is pretty static, the size
    required to dump the bpf_sk_storage(s) of a sk is dynamic, since the
    system may add more bpf_sk_storage_maps over time. It is hard to set a
    static min_dump_alloc size.

    Hence, this series learns it at runtime and adjusts
    cb->min_dump_alloc as it iterates all sk(s) of the system. The
    "unsigned int *res_diag_size" argument in bpf_sk_storage_diag_put()
    is for this purpose.

    The next patch will update the cb->min_dump_alloc as it
    iterates the sk(s).
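
    As a rough illustration of the request layout in point 2, a userspace
    dumper could nest the attributes like this (sketch only; it assumes
    libmnl, that the INET_DIAG_REQ_SK_BPF_STORAGES and
    SK_DIAG_BPF_STORAGE_REQ_MAP_FD constants from the later patches are
    available, and that nlh already holds a prepared inet_diag dump
    request):

    #include <libmnl/libmnl.h>
    #include <linux/inet_diag.h>
    #include <linux/sock_diag.h>

    /* Ask the kernel to dump the bpf_sk_storage of one specific map. */
    static void req_sk_bpf_storage(struct nlmsghdr *nlh, int map_fd)
    {
            struct nlattr *nest;

            nest = mnl_attr_nest_start(nlh, INET_DIAG_REQ_SK_BPF_STORAGES);
            /* Zero or more MAP_FD attributes; none at all means "dump every
             * bpf_sk_storage of the sk". */
            mnl_attr_put_u32(nlh, SK_DIAG_BPF_STORAGE_REQ_MAP_FD, map_fd);
            mnl_attr_nest_end(nlh, nest);
    }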

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com

    Martin KaFai Lau
     
  • The mptcp conflict was overlapping additions.

    The SMC conflict was an addition and a removal happening at the same
    time.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    This issue was found with the help of Coccinelle.
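
    For illustration, the conversion pattern looks like this (sketch; the
    struct names are the placeholder ones from above, and struct_size() is
    the usual helper for sizing the allocation):

    /* Before: zero-length array, a GNU C90 extension. */
    struct foo {
            int stuff;
            struct boo array[0];
    };

    /* After: C99 flexible array member. */
    struct foo {
            int stuff;
            struct boo array[];
    };

    /* Allocation sites are unchanged in spirit: */
    struct foo *p = kzalloc(struct_size(p, array, n), GFP_KERNEL);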

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor

    Gustavo A. R. Silva
     
  • Pull audit fixes from Paul Moore:
    "Two fixes for problems found by syzbot:

    - Moving audit filter structure fields into a union caused some
    problems in the code which populates that filter structure.

    We keep the union (that idea is a good one), but we are fixing the
    code so that it doesn't needlessly set fields in the union and mess
    up the error handling.

    - The audit_receive_msg() function wasn't validating user input as
    well as it should in all cases, we add the necessary checks"

    * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: always check the netlink payload length in audit_receive_msg()
    audit: fix error handling in audit_data_to_entry()

    Linus Torvalds
     

27 Feb, 2020

2 commits

  • Pull tracing and bootconfig updates:
    "Fixes and changes to bootconfig before it goes live in a release.

    Change in API of bootconfig (before it comes live in a release):
    - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
    exists
    - Set CONFIG_BOOT_CONFIG to 'n' by default
    - Show error if "bootconfig" on cmdline but not compiled in
    - Prevent redefining the same value
    - Have a way to append values
    - Added a SELECT BLK_DEV_INITRD to fix a build failure

    Synthetic event fixes:
    - Switch to raw_smp_processor_id() for recording the CPU value in a
    preempt section (the actual value does not matter)
    - Fix samples always recording u64 values
    - Fix endianness
    - Check number of values matches number of fields
    - Fix a printing bug

    Fix of trace_printk() breaking postponed start up tests

    Make a function static that is only used in a single file"

    * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
    bootconfig: Add append value operator support
    bootconfig: Prohibit re-defining value on same key
    bootconfig: Print array as multiple commands for legacy command line
    bootconfig: Reject subkey and value on same parent key
    tools/bootconfig: Remove unneeded error message silencer
    bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
    bootconfig: Set CONFIG_BOOT_CONFIG=n by default
    tracing: Clear trace_state when starting trace
    bootconfig: Mark boot_config_checksum() static
    tracing: Disable trace_printk() on post poned tests
    tracing: Have synthetic event test use raw_smp_processor_id()
    tracing: Fix number printing bug in print_synth_event()
    tracing: Check that number of vals matches number of synth event fields
    tracing: Make synth_event trace functions endian-correct
    tracing: Make sure synth_event_trace() example always uses u64

    Linus Torvalds
     
  • When queueing a signal, we increment both the users count of pending
    signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
    of the user struct itself (because we keep a reference to the user in
    the signal structure in order to correctly account for it when freeing).

    That turns out to be fairly expensive, because both of them are atomic
    updates, and particularly under extreme signal handling pressure on big
    machines, you can get a lot of cache contention on the user struct.
    That can then cause horrid cacheline ping-pong when you do these
    multiple accesses.

    So change the reference counting to only pin the user for the _first_
    pending signal, and to unpin it when the last pending signal is
    dequeued. That means that when a user sees a lot of concurrent signal
    queuing - which is the only situation when this matters - the only
    atomic access needed is generally the 'sigpending' count update.
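
    A minimal sketch of the idea (illustrative helpers, not the exact
    kernel diff):

    static struct user_struct *sigqueue_pin_user(struct user_struct *user)
    {
            /* Only the 0 -> 1 transition takes a reference on the user. */
            if (atomic_inc_return(&user->sigpending) == 1)
                    get_uid(user);
            return user;
    }

    static void sigqueue_unpin_user(struct user_struct *user)
    {
            /* Dropping the last pending signal releases the reference. */
            if (atomic_dec_and_test(&user->sigpending))
                    free_uid(user);
    }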

    This was noticed because of a particularly odd timing artifact on a
    dual-socket 96C/192T Cascade Lake platform: when you get into bad
    contention, on that machine the contention for some reason seems to be
    much worse when it happens in the upper 32-byte half of the cacheline.

    As a result, the kernel test robot will-it-scale 'signal1' benchmark had
    an odd performance regression simply due to random alignment of the
    'struct user_struct' (and pointed to a completely unrelated and
    apparently nonsensical commit for the regression).

    Avoiding the double increments (and decrements on the dequeueing side,
    of course) makes for much less contention and hugely improved
    performance on that will-it-scale microbenchmark.

    Quoting Feng Tang:

    "It makes a big difference, that the performance score is tripled! bump
    from original 17000 to 54000. Also the gap between 5.0-rc6 and
    5.0-rc6+Jiri's patch is reduced to around 2%"

    [ The "2% gap" is the odd cacheline placement difference on that
    platform: under the extreme contention case, the effect of which half
    of the cacheline was hot was 5%, so with the reduced contention the
    odd timing artifact is reduced too ]

    It does help in the non-contended case too, but is not nearly as
    noticeable.

    Reported-and-tested-by: Feng Tang
    Cc: Eric W. Biederman
    Cc: Huang, Ying
    Cc: Philip Li
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Feb, 2020

1 commit

    Commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by
    default") also changed CONFIG_BOOTTIME_TRACING to select
    CONFIG_BOOT_CONFIG so that boot-time tracing shows up on the menu,
    which introduced a wrong dependency on BLK_DEV_INITRD as below.

    WARNING: unmet direct dependencies detected for BOOT_CONFIG
      Depends on [n]: BLK_DEV_INITRD [=n]
      Selected by [y]:
      - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

    Make CONFIG_BOOT_CONFIG select CONFIG_BLK_DEV_INITRD to fix this
    error, and make CONFIG_BOOTTIME_TRACING=n by default, so that both
    boot-time tracing and boot configuration are off by default but
    still appear on the menu list.

    Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2

    Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
    Reported-by: Randy Dunlap
    Compiled-tested-by: Randy Dunlap
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

25 Feb, 2020

19 commits

    In an RT kernel, down_read_trylock() cannot be used from NMI context, and
    up_read_non_owner() is another problematic issue.

    So in such a configuration, simply elide the annotated stackmap and
    just report the raw IPs.

    In the longer term, it might be possible to provide an atomic-friendly
    version of the page cache traversal which will at least provide the info
    if the pages are resident and don't need to be paged in.

    [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
    callback and add a comment ]

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.708960317@linutronix.de

    David Miller
     
  • The LPM trie map cannot be used in contexts like perf, kprobes and tracing
    as this map type dynamically allocates memory.

    The memory allocation happens with a raw spinlock held which is a truly
    spinning lock on a PREEMPT RT enabled kernel which disables preemption and
    interrupts.

    As RT does not allow memory allocation from such a section for various
    reasons, convert the raw spinlock to a regular spinlock.

    On a RT enabled kernel these locks are substituted by 'sleeping' spinlocks
    which provide the proper protection but keep the code preemptible.

    On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
    not cause any functional change.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.602129531@linutronix.de

    Thomas Gleixner
     
  • PREEMPT_RT forbids certain operations like memory allocations (even with
    GFP_ATOMIC) from atomic contexts. This is required because even with
    GFP_ATOMIC the memory allocator calls into code paths which acquire locks
    with long held lock sections. To ensure the deterministic behaviour these
    locks are regular spinlocks, which are converted to 'sleepable' spinlocks
    on RT. The only true atomic contexts on an RT kernel are the low level
    hardware handling, scheduling, low level interrupt handling, NMIs etc. None
    of these contexts should ever do memory allocations.

    As regular device interrupt handlers and soft interrupts are forced into
    thread context, the existing code which does

      spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();

    just works.

    In theory the BPF locks could be converted to regular spinlocks as well,
    but the bucket locks and percpu_freelist locks can be taken from arbitrary
    contexts (perf, kprobes, tracepoints) which are required to be atomic
    contexts even on RT. These mechanisms require preallocated maps, so there
    is no need to invoke memory allocations within the lock held sections.

    BPF maps which need dynamic allocation are only used from (forced) thread
    context on RT and can therefore use regular spinlocks which in turn allows
    to invoke memory allocations from the lock held section.

    To achieve this make the hash bucket lock a union of a raw and a regular
    spinlock and initialize and lock/unlock either the raw spinlock for
    preallocated maps or the regular variant for maps which require memory
    allocations.

    On a non RT kernel this distinction is neither possible nor required.
    spinlock maps to raw_spinlock and the extra code and conditional is
    optimized out by the compiler. No functional change.
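
    A condensed sketch of the resulting shape (illustrative, not the
    literal diff; htab_is_prealloc() is the existing prealloc check):

    struct bucket {
            struct hlist_nulls_head head;
            union {
                    raw_spinlock_t raw_lock;   /* preallocated maps */
                    spinlock_t     lock;       /* maps that allocate at runtime */
            };
    };

    static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
    {
            return !IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab);
    }

    static inline void htab_lock_bucket(const struct bpf_htab *htab,
                                        struct bucket *b, unsigned long *pflags)
    {
            unsigned long flags;

            if (htab_use_raw_lock(htab))
                    raw_spin_lock_irqsave(&b->raw_lock, flags);
            else
                    spin_lock_irqsave(&b->lock, flags);
            *pflags = flags;
    }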

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.509685912@linutronix.de

    Thomas Gleixner
     
  • As a preparation for making the BPF locking RT friendly, factor out the
    hash bucket lock operations into inline functions. This allows to do the
    necessary RT modification in one place instead of sprinkling it all over
    the place. No functional change.

    The now unused htab argument of the lock/unlock functions will be used in
    the next step which adds PREEMPT_RT support.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.420416916@linutronix.de

    Thomas Gleixner
     
  • The required protection is that the caller cannot be migrated to a
    different CPU as these functions end up in places which take either a hash
    bucket lock or might trigger a kprobe inside the memory allocator. Both
    scenarios can lead to deadlocks. The deadlock prevention is per CPU by
    incrementing a per CPU variable which temporarily blocks the invocation of
    BPF programs from perf and kprobes.

    Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs
    with the new helper functions. These functions are already prepared to make
    BPF work on PREEMPT_RT enabled kernels. No functional change for !RT
    kernels.
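
    For reference, the helpers referred to above have roughly this shape
    (sketch; assuming they are the bpf_{disable,enable}_instrumentation()
    pair introduced earlier in the series):

    static inline void bpf_disable_instrumentation(void)
    {
            migrate_disable();                      /* stay on this CPU */
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    this_cpu_inc(bpf_prog_active);  /* block nested BPF from perf/kprobes */
            else
                    __this_cpu_inc(bpf_prog_active);
    }

    static inline void bpf_enable_instrumentation(void)
    {
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    this_cpu_dec(bpf_prog_active);
            else
                    __this_cpu_dec(bpf_prog_active);
            migrate_enable();
    }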

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.317843926@linutronix.de

    Thomas Gleixner
     
  • The required protection is that the caller cannot be migrated to a
    different CPU as these places take either a hash bucket lock or might
    trigger a kprobe inside the memory allocator. Both scenarios can lead to
    deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
    variable which temporarily blocks the invocation of BPF programs from perf
    and kprobes.

    Replace the open coded preempt_disable/enable() and this_cpu_inc/dec()
    pairs with the new recursion prevention helpers to prepare BPF to work on
    PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable
    in the helpers map to preempt_disable/enable(), i.e. no functional change.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.211208533@linutronix.de

    Thomas Gleixner
     
    Use migrate_disable/enable() instead of preemption disable/enable to
    reflect the purpose. This allows PREEMPT_RT to substitute it with an
    actual migration disable implementation. On non-RT kernels this still
    maps to preempt_disable/enable().

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.891428873@linutronix.de

    David Miller
     
  • All of these cases are strictly of the form:

      preempt_disable();
      BPF_PROG_RUN(...);
      preempt_enable();

    Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
    with:

      migrate_disable();
      BPF_PROG_RUN(...);
      migrate_enable();

    On non RT enabled kernels this maps to preempt_disable/enable() and on RT
    enabled kernels this solely prevents migration, which is sufficient as
    there is no requirement to prevent reentrancy to any BPF program from a
    preempting task. The only requirement is that the program stays on the same
    CPU.

    Therefore, this is a trivially correct transformation.

    The seccomp loop does not need protection over the loop. It only needs
    protection per BPF filter program.

    [ tglx: Converted to bpf_prog_run_pin_on_cpu() ]
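
    A sketch of the wrapper described above (illustrative; the real helper
    lives in the BPF headers):

    static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
                                              const void *ctx)
    {
            u32 ret;

            migrate_disable();              /* preempt_disable() on !RT kernels */
            ret = BPF_PROG_RUN(prog, ctx);
            migrate_enable();
            return ret;
    }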

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de

    David Miller
     
  • pcpu_freelist_populate() is disabling interrupts and then iterates over the
    possible CPUs. The reason why this disables interrupts is to silence
    lockdep because the invoked ___pcpu_freelist_push() takes spin locks.

    Neither the interrupt disabling nor the locking are required in this
    function because it's called during initialization and the resulting map is
    not yet visible to anything.

    Split out the actual push assignment into an inline function, call it from
    the loop and remove the interrupt disable.
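
    The split-out helper looks roughly like this (sketch; the helper name
    is illustrative):

    /* Plain, lockless push: only valid while the map is being initialized
     * and is not yet visible to anybody else. */
    static void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
                                        struct pcpu_freelist_node *node)
    {
            node->next = head->first;
            head->first = node;
    }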

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.365930116@linutronix.de

    Thomas Gleixner
     
  • If an element is freed via RCU then recursion into BPF instrumentation
    functions is not a concern. The element is already detached from the map
    and the RCU callback does not hold any locks on which a kprobe, perf event
    or tracepoint attached BPF program could deadlock.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de

    Thomas Gleixner
     
    The BPF invocation from the perf event overflow handler does not require
    disabling preemption because it is called from NMI or at least hard
    interrupt context, which is already non-preemptible.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.151953573@linutronix.de

    Thomas Gleixner
     
  • Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is
    invoked from a trace point via __DO_TRACE() which already disables
    preemption _before_ invoking any of the functions which are attached to a
    trace point.

    Remove it and add a cant_sleep() check.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de

    Thomas Gleixner
     
    trace_call_bpf() no longer disables preemption on its own.
    All callers of this function have to do it explicitly.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Thomas Gleixner

    Alexei Starovoitov
     
  • All callers are built in. No point to export this.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov

    Thomas Gleixner
     
  • __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

    This is redundant because __bpf_trace_run() is invoked from a trace point
    via __DO_TRACE() which already disables preemption _before_ invoking any of
    the functions which are attached to a trace point.

    Remove it and add a cant_sleep() check.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de

    Thomas Gleixner
     
  • The comment where the bucket lock is acquired says:

    /* bpf_map_update_elem() can be called in_irq() */

    which is not really helpful and aside from that it does not explain the
    subtle details of the hash bucket locks, especially in the context of BPF
    and perf, kprobes and tracing.

    Add a comment at the top of the file which explains the protection scopes
    and the details how potential deadlocks are prevented.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de

    Thomas Gleixner
     
    Aside from the general unsafety of run-time map allocation for
    instrumentation type programs, RT enabled kernels have another constraint:

    The instrumentation programs are invoked with preemption disabled, but the
    memory allocator spinlocks cannot be acquired in atomic context because
    they are converted to 'sleeping' spinlocks on RT.

    Therefore enforce map preallocation for these programs types when RT is
    enabled.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.648784007@linutronix.de

    Thomas Gleixner
     
  • The assumption that only programs attached to perf NMI events can deadlock
    on memory allocators is wrong. Assume the following simplified callchain:

      kmalloc() from regular non BPF context
        cache empty
          freelist empty
            lock(zone->lock);
              tracepoint or kprobe
                BPF()
                  update_elem()
                    lock(bucket)
                      kmalloc()
                        cache empty
                          freelist empty
                            lock(zone->lock);

    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.540542802@linutronix.de

    Thomas Gleixner
     
  • This patch ensures that we always check the netlink payload length
    in audit_receive_msg() before we take any action on the payload
    itself.

    Cc: stable@vger.kernel.org
    Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
    Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore

    Paul Moore
     

23 Feb, 2020

3 commits

  • Commit 219ca39427bf ("audit: use union for audit_field values since
    they are mutually exclusive") combined a number of separate fields in
    the audit_field struct into a single union. Generally this worked
    just fine because they are generally mutually exclusive.
    Unfortunately in audit_data_to_entry() the overlap can be a problem
    when a specific error case is triggered that causes the error path
    code to attempt to cleanup an audit_field struct and the cleanup
    involves attempting to free a stored LSM string (the lsm_str field).
    Currently the code always has a non-NULL value in the
    audit_field.lsm_str field as the top of the for-loop transfers a
    value into audit_field.val (both .lsm_str and .val are part of the
    same union); if audit_data_to_entry() fails and the audit_field
    struct is specified to contain a LSM string, but the
    audit_field.lsm_str has not yet been properly set, the error handling
    code will attempt to free the bogus audit_field.lsm_str value that
    was set with audit_field.val at the top of the for-loop.

    This patch corrects this by ensuring that the audit_field.val is only
    set when needed (it is cleared when the audit_field struct is
    allocated with kcalloc()). It also corrects a few other issues to
    ensure that in case of error the proper error code is returned.

    Cc: stable@vger.kernel.org
    Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
    Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull irq fixes from Thomas Gleixner:
    "Two fixes for the irq core code which are follow ups to the recent MSI
    fixes:

    - The WARN_ON which was put into the MSI setaffinity callback for
    paranoia reasons actually triggered via a callchain which escaped
    when all the possible ways to reach that code were analyzed.

    The proc/irq/$N/*affinity interfaces have a quirk which came in
    when ALPHA moved to the generic interface: In case that the written
    affinity mask does not contain any online CPU it calls into ALPHAs
    magic auto affinity setting code.

    A few years later this mechanism was also made available to x86 for
    no good reasons and in a way which circumvents all sanity checks
    for interrupts which cannot have their affinity set from process
    context on X86 due to the way the X86 interrupt delivery works.

    It would be possible to make this work properly, but there is no
    point in doing so. If the interrupt is not yet started then the
    affinity setting has no effect and if it is started already then it
    is already assigned to an online CPU so there is no point to
    randomly move it to some other CPU. Just return EINVAL as the code
    has done before that change forever.

    - The new MSI quirk bit in the irq domain flags turned out to be
    already occupied, which escaped the author and the reviewers
    because the already in use bits were 0,6,2,3,4,5 listed in that
    order.

    That bit 6 was simply overlooked because the ordering was straight
    forward linear otherwise. So the new bit ended up being a
    duplicate.

    Fix it up by switching the oddball 6 to the obvious 1"

    * tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/irqdomain: Make sure all irq domain flags are distinct
    genirq/proc: Reject invalid affinity masks (again)

    Linus Torvalds
     
  • Pull s390 fixes from Vasily Gorbik:

    - Remove the ieee_emulation_warnings sysctl which is dead code.

    - Avoid triggering rebuild of the kernel during make install.

    - Enable protected virtualization guest support in default configs.

    - Fix cio_ignore seq_file .next function to increase position index.
    And use kobj_to_dev instead of container_of in cio code.

    - Fix storage block address lists to contain absolute addresses in qdio
    code.

    - A few clang warnings and spelling fixes.

    * tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/qdio: fill SBALEs with absolute addresses
    s390/qdio: fill SL with absolute addresses
    s390: remove obsolete ieee_emulation_warnings
    s390: make 'install' not depend on vmlinux
    s390/kaslr: Fix casts in get_random
    s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
    s390/pkey/zcrypt: spelling s/crytp/crypt/
    s390/cio: use kobj_to_dev() API
    s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
    s390/cio: cio_ignore_proc_seq_next should increase position index

    Linus Torvalds
     

22 Feb, 2020

5 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-02-21

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 25 non-merge commits during the last 4 day(s) which contain
    a total of 33 files changed, 2433 insertions(+), 161 deletions(-).

    The main changes are:

    1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
    with reuseport BPF programs, from Jakub Sitnicki.

    2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
    to specify the tracepoint/function dynamically, from Eelco Chaudron.

    3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
    guided optimizations, from Daniel Xu.

    4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.

    5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
    BTF defined maps, from Andrii Nakryiko.

    6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
    been observed that the former fails in envs with low memlock, from Yonghong Song.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit 736b46027eb4 ("net: Add ID (if needed) to sock_reuseport and expose
    reuseport_lock") has introduced lazy generation of reuseport group IDs that
    survive group resize.

    By comparing the identifier we check if BPF reuseport program is not trying
    to select a socket from a BPF map that belongs to a different reuseport
    group than the one the packet is for.

    Because SOCKARRAY used to be the only BPF map type that can be used with
    reuseport BPF, it was possible to delay the generation of reuseport group
    ID until a socket from the group was inserted into BPF map for the first
    time.

    Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
    either generate the reuseport ID on map update, like SOCKARRAY does, or
    allocate an ID from the start when reuseport group gets created.

    This patch takes the latter approach to keep sockmap free of calls into
    reuseport code. This streamlines the reuseport_id access as its lifetime
    now matches the longevity of reuseport object.

    The cost of this simplification, however, is that we allocate reuseport IDs
    for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
    setups. With the way identifiers are currently generated, we can have at
    most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
    get close to the limit, we can switch to a u64 counter like sk_cookie.

    Another change is that we now always call into SOCKARRAY logic to unlink
    the socket from the map when unhashing or closing the socket. Previously we
    did it only when at least one socket from the group was in a BPF map.

    It is worth noting that this doesn't conflict with sockmap tear-down in
    case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
    group. sockmap tear-down happens first:

    prot->unhash
    `- tcp_bpf_unhash
       |- tcp_bpf_remove
       |  `- while (sk_psock_link_pop(psock))
       |     `- sk_psock_unlink
       |        `- sock_map_delete_from_link
       |           `- __sock_map_delete
       |              `- sock_map_unref
       |                 `- sk_psock_put
       |                    `- sk_psock_drop
       |                       `- rcu_assign_sk_user_data(sk, NULL)
       `- inet_unhash
          `- reuseport_detach_sock
             `- bpf_sk_reuseport_detach
                `- WRITE_ONCE(sk->sk_user_data, NULL)

    Suggested-by: Martin Lau
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com

    Jakub Sitnicki
     
    SOCKMAP & SOCKHASH now support storing references to listening
    sockets. Nothing keeps us from using these map types as a collection of
    sockets to select from in BPF reuseport programs. Whitelist the map types
    with the bpf_sk_select_reuseport helper.

    The restriction that the socket has to be a member of a reuseport group
    still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
    set are not a valid target and we signal it with -EINVAL.

    The main benefit from this change is that, in contrast to
    REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose the restriction that a
    listening socket can be in just one BPF map at a time.
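
    A minimal sk_reuseport program using a SOCKMAP could look like this
    (sketch; the map name and the fixed key are illustrative, real programs
    typically derive the key from the packet):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_SOCKMAP);
            __uint(max_entries, 16);
            __type(key, __u32);
            __type(value, __u64);
    } redir_map SEC(".maps");

    SEC("sk_reuseport")
    int select_sock(struct sk_reuseport_md *reuse_md)
    {
            __u32 key = 0;

            /* Fails with -EINVAL if the socket in the map has no
             * sk_reuseport_cb, as described above. */
            if (bpf_sk_select_reuseport(reuse_md, &redir_map, &key, 0))
                    return SK_DROP;
            return SK_PASS;
    }

    char _license[] SEC("license") = "GPL";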

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com

    Jakub Sitnicki
     
  • Pull networking fixes from David Miller:

    1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
    Cong Wang.

    2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
    is finished, from Magnus Karlsson.

    3) Set table field properly to RT_TABLE_COMPAT when necessary, from
    Jethro Beekman.

    4) NLA_STRING attributes are not necessarily NULL terminated, deal with
    that in IFLA_ALT_IFNAME. From Eric Dumazet.

    5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

    6) Handle mtu==0 devices properly in wireguard, from Jason A.
    Donenfeld.

    7) Fix several lockdep warnings in bonding, from Taehee Yoo.

    8) Fix cls_flower port blocking, from Jason Baron.

    9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

    10) Fix RDMA race in qede driver, from Michal Kalderon.

    11) Fix several false lockdep warnings by adding conditions to
    list_for_each_entry_rcu(), from Madhuparna Bhowmik.

    12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

    13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

    14) Hey, variables declared in a switch statement before any case
    statements are not initialized. I learn something every day. Get
    rid of this stuff in several parts of the networking code, from Kees
    Cook.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
    bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
    bnxt_en: Improve device shutdown method.
    net: netlink: cap max groups which will be considered in netlink_bind()
    net: thunderx: workaround BGX TX Underflow issue
    ionic: fix fw_status read
    net: disable BRIDGE_NETFILTER by default
    net: macb: Properly handle phylink on at91rm9200
    s390/qeth: fix off-by-one in RX copybreak check
    s390/qeth: don't warn for napi with 0 budget
    s390/qeth: vnicc Fix EOPNOTSUPP precedence
    openvswitch: Distribute switch variables for initialization
    net: ip6_gre: Distribute switch variables for initialization
    net: core: Distribute switch variables for initialization
    udp: rehash on disconnect
    net/tls: Fix to avoid gettig invalid tls record
    bpf: Fix a potential deadlock with bpf_map_do_batch
    bpf: Do not grab the bucket spinlock by default on htab batch ops
    ice: Wait for VF to be reset/ready before configuration
    ice: Don't tell the OS that link is going down
    ice: Don't reject odd values of usecs set by user
    ...

    Linus Torvalds
     
  • No users remain, so kill these off before we grow new ones.

    Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Cc: Deepa Dinamani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

21 Feb, 2020

5 commits

    Set CONFIG_BOOT_CONFIG=n by default. This also warns the user if
    CONFIG_BOOT_CONFIG=n but "bootconfig" is given on the kernel
    command line.

    Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2

    Suggested-by: Steven Rostedt
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
    Clear the trace_state data structure when starting a trace in the
    __synth_event_trace_start() internal function.

    Currently trace_state is initialized only in the
    synth_event_trace_start() API, but the trace_state in
    synth_event_trace() and synth_event_trace_array() is on the stack
    without initialization.
    This means those APIs will see wrong parameters and
    will skip the closing process in __synth_event_trace_end()
    because trace_state->disabled may be !0.
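
    The fix itself is essentially a one-liner at the top of
    __synth_event_trace_start() (sketch):

    /* Clear the caller-provided state up front so that
     * __synth_event_trace_end() never sees stale flags. */
    memset(trace_state, '\0', sizeof(*trace_state));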

    Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2

    Reviewed-by: Tom Zanussi
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
    The tracing selftests check various aspects of the tracing infrastructure,
    and one is filtering. If trace_printk() is active during a self test, it can
    cause the filtering to fail, which will disable that part of the trace.

    To keep the selftests from failing because of trace_printk() calls,
    trace_printk() checks the variable tracing_selftest_running, and if set, it
    does not write to the tracing buffer.

    As some tracers were registered earlier in boot, the selftest they triggered
    would fail because not all the infrastructure was set up for the full
    selftest. Thus, some of the tests were postponed to when their
    infrastructure was ready (namely file system code). The postpone code did
    not set the tracing_selftest_running variable, and could fail if a
    trace_printk() was added and executed during their run.

    Cc: stable@vger.kernel.org
    Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • The test code that tests synthetic event creation pushes in as one of its
    test fields the current CPU using "smp_processor_id()". As this is just
    something to see if the value is correctly passed in, and the actual CPU
    used does not matter, use raw_smp_processor_id(); otherwise, with debug
    preemption enabled, a warning is triggered because smp_processor_id() is
    called while preemption is enabled.

    Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home

    Reviewed-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Fix a varargs-related bug in print_synth_event() which resulted in
    strange output and oopses on 32-bit x86 systems. The problem is that
    trace_seq_printf() expects the varargs to match the format string, but
    print_synth_event() was always passing u64 values regardless. This
    results in unspecified behavior when unpacking with va_arg() in
    trace_seq_printf().

    Add a function that takes the size into account when calling
    trace_seq_printf().

    Before:

    modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null)

    After:

    insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598
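
    The added helper roughly switches on the field size before handing the
    value to trace_seq_printf() (sketch; name and exact signature are
    illustrative):

    static void print_synth_event_num_val(struct trace_seq *s, char *print_fmt,
                                          char *name, int size, u64 val)
    {
            switch (size) {
            case 1:
                    trace_seq_printf(s, print_fmt, name, (u8)val);
                    break;
            case 2:
                    trace_seq_printf(s, print_fmt, name, (u16)val);
                    break;
            case 4:
                    trace_seq_printf(s, print_fmt, name, (u32)val);
                    break;
            default:
                    trace_seq_printf(s, print_fmt, name, val);
                    break;
            }
    }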

    Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org

    Reported-by: Steven Rostedt (VMware)
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi