01 Mar, 2020

1 commit

  • Alexei Starovoitov says:

    ====================
    pull-request: bpf-next 2020-02-28

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 41 non-merge commits during the last 7 day(s) which contain
    a total of 49 files changed, 1383 insertions(+), 499 deletions(-).

    The main changes are:

    1) BPF and Real-Time nicely co-exist.

    2) bpftool feature improvements.

    3) retrieve bpf_sk_storage via INET_DIAG.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

28 Feb, 2020

4 commits

  • This patch adds INET_DIAG support to bpf_sk_storage.

    1. Although this series adds bpf_sk_storage diag capability to inet sk,
    bpf_sk_storage is in general applicable to all fullsock. Hence, the
    bpf_sk_storage logic will operate on SK_DIAG_* nlattr. The caller
    will pass in its specific nesting nlattr (e.g. INET_DIAG_*) as
    the argument.

    2. The request will be like:
       INET_DIAG_REQ_SK_BPF_STORAGES (nla_nest) (defined in a later patch)
           SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
           SK_DIAG_BPF_STORAGE_REQ_MAP_FD (nla_put_u32)
           ......

    Considering there could be multiple bpf_sk_storages in a sk,
    instead of reusing INET_DIAG_INFO ("ss -i"), the user can select
    specific bpf_sk_storages to dump by specifying an array of
    SK_DIAG_BPF_STORAGE_REQ_MAP_FD (see the userspace sketch after
    point 4 below).

    If no SK_DIAG_BPF_STORAGE_REQ_MAP_FD is specified (i.e. an empty
    INET_DIAG_REQ_SK_BPF_STORAGES), it will dump all bpf_sk_storages
    of a sk.

    3. The reply will be like:
       INET_DIAG_BPF_SK_STORAGES (nla_nest) (defined in a later patch)
           SK_DIAG_BPF_STORAGE (nla_nest)
               SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
               SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
           SK_DIAG_BPF_STORAGE (nla_nest)
               SK_DIAG_BPF_STORAGE_MAP_ID (nla_put_u32)
               SK_DIAG_BPF_STORAGE_MAP_VALUE (nla_reserve_64bit)
           ......

    4. Unlike other INET_DIAG info of a sk, which is pretty static, the size
    required to dump the bpf_sk_storage(s) of a sk is dynamic, since the
    system may add more bpf_sk_storage_maps over time. It is hard to set a
    static min_dump_alloc size.

    Hence, this series learns it at runtime and adjusts
    cb->min_dump_alloc as it iterates all sk(s) of the system. The
    "unsigned int *res_diag_size" argument in bpf_sk_storage_diag_put()
    is for this purpose.

    The next patch will update the cb->min_dump_alloc as it
    iterates the sk(s).
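
    As a rough illustration of the request layout in point 2, a userspace
    dumper could nest the attributes like this (sketch only; it assumes
    libmnl, that the INET_DIAG_REQ_SK_BPF_STORAGES and
    SK_DIAG_BPF_STORAGE_REQ_MAP_FD constants from the later patches are
    available, and that nlh already holds a prepared inet_diag dump
    request):

    #include <libmnl/libmnl.h>
    #include <linux/inet_diag.h>
    #include <linux/sock_diag.h>

    /* Ask the kernel to dump the bpf_sk_storage of one specific map. */
    static void req_sk_bpf_storage(struct nlmsghdr *nlh, int map_fd)
    {
            struct nlattr *nest;

            nest = mnl_attr_nest_start(nlh, INET_DIAG_REQ_SK_BPF_STORAGES);
            /* Zero or more MAP_FD attributes; none at all means "dump every
             * bpf_sk_storage of the sk". */
            mnl_attr_put_u32(nlh, SK_DIAG_BPF_STORAGE_REQ_MAP_FD, map_fd);
            mnl_attr_nest_end(nlh, nest);
    }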

    Signed-off-by: Martin KaFai Lau
    Signed-off-by: Alexei Starovoitov
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200225230421.1975729-1-kafai@fb.com

    Martin KaFai Lau
     
  • The mptcp conflict was overlapping additions.

    The SMC conflict was an addition and a removal happening at the same
    time.

    Signed-off-by: David S. Miller

    David S. Miller
     
  • The current codebase makes use of the zero-length array language
    extension to the C90 standard, but the preferred mechanism to declare
    variable-length types such as these ones is a flexible array member[1][2],
    introduced in C99:

    struct foo {
            int stuff;
            struct boo array[];
    };

    By making use of the mechanism above, we will get a compiler warning
    in case the flexible array does not occur last in the structure, which
    will help us prevent some kind of undefined behavior bugs from being
    inadvertently introduced[3] to the codebase from now on.

    Also, notice that dynamic memory allocations won't be affected by
    this change:

    "Flexible array members have incomplete type, and so the sizeof operator
    may not be applied. As a quirk of the original implementation of
    zero-length arrays, sizeof evaluates to zero."[1]

    This issue was found with the help of Coccinelle.
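
    For illustration, the conversion pattern looks like this (sketch; the
    struct names are the placeholder ones from above, and struct_size() is
    the usual helper for sizing the allocation):

    /* Before: zero-length array, a GNU C90 extension. */
    struct foo {
            int stuff;
            struct boo array[0];
    };

    /* After: C99 flexible array member. */
    struct foo {
            int stuff;
            struct boo array[];
    };

    /* Allocation sites are unchanged in spirit: */
    struct foo *p = kzalloc(struct_size(p, array, n), GFP_KERNEL);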

    [1] https://gcc.gnu.org/onlinedocs/gcc/Zero-Length.html
    [2] https://github.com/KSPP/linux/issues/21
    [3] commit 76497732932f ("cxgb3/l2t: Fix undefined behaviour")

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/20200227001744.GA3317@embeddedor

    Gustavo A. R. Silva
     
  • Pull audit fixes from Paul Moore:
    "Two fixes for problems found by syzbot:

    - Moving audit filter structure fields into a union caused some
    problems in the code which populates that filter structure.

    We keep the union (that idea is a good one), but we are fixing the
    code so that it doesn't needlessly set fields in the union and mess
    up the error handling.

    - The audit_receive_msg() function wasn't validating user input as
    well as it should in all cases, we add the necessary checks"

    * tag 'audit-pr-20200226' of git://git.kernel.org/pub/scm/linux/kernel/git/pcmoore/audit:
    audit: always check the netlink payload length in audit_receive_msg()
    audit: fix error handling in audit_data_to_entry()

    Linus Torvalds
     

27 Feb, 2020

2 commits

  • Pull tracing and bootconfig updates:
    "Fixes and changes to bootconfig before it goes live in a release.

    Change in API of bootconfig (before it comes live in a release):
    - Have a magic value "BOOTCONFIG" in initrd to know a bootconfig
    exists
    - Set CONFIG_BOOT_CONFIG to 'n' by default
    - Show error if "bootconfig" on cmdline but not compiled in
    - Prevent redefining the same value
    - Have a way to append values
    - Added a SELECT BLK_DEV_INITRD to fix a build failure

    Synthetic event fixes:
    - Switch to raw_smp_processor_id() for recording the CPU value in a
    preempt section (the actual value does not matter)
    - Fix samples always recording u64 values
    - Fix endianness
    - Check number of values matches number of fields
    - Fix a printing bug

    Fix of trace_printk() breaking postponed start up tests

    Make a function static that is only used in a single file"

    * tag 'trace-v5.6-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    bootconfig: Fix CONFIG_BOOTTIME_TRACING dependency issue
    bootconfig: Add append value operator support
    bootconfig: Prohibit re-defining value on same key
    bootconfig: Print array as multiple commands for legacy command line
    bootconfig: Reject subkey and value on same parent key
    tools/bootconfig: Remove unneeded error message silencer
    bootconfig: Add bootconfig magic word for indicating bootconfig explicitly
    bootconfig: Set CONFIG_BOOT_CONFIG=n by default
    tracing: Clear trace_state when starting trace
    bootconfig: Mark boot_config_checksum() static
    tracing: Disable trace_printk() on post poned tests
    tracing: Have synthetic event test use raw_smp_processor_id()
    tracing: Fix number printing bug in print_synth_event()
    tracing: Check that number of vals matches number of synth event fields
    tracing: Make synth_event trace functions endian-correct
    tracing: Make sure synth_event_trace() example always uses u64

    Linus Torvalds
     
  • When queueing a signal, we increment both the users count of pending
    signals (for RLIMIT_SIGPENDING tracking) and we increment the refcount
    of the user struct itself (because we keep a reference to the user in
    the signal structure in order to correctly account for it when freeing).

    That turns out to be fairly expensive, because both of them are atomic
    updates, and particularly under extreme signal handling pressure on big
    machines, you can get a lot of cache contention on the user struct.
    That can then cause horrid cacheline ping-pong when you do these
    multiple accesses.

    So change the reference counting to only pin the user for the _first_
    pending signal, and to unpin it when the last pending signal is
    dequeued. That means that when a user sees a lot of concurrent signal
    queuing - which is the only situation when this matters - the only
    atomic access needed is generally the 'sigpending' count update.
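
    A minimal sketch of the idea (illustrative helpers, not the exact
    kernel diff):

    static struct user_struct *sigqueue_pin_user(struct user_struct *user)
    {
            /* Only the 0 -> 1 transition takes a reference on the user. */
            if (atomic_inc_return(&user->sigpending) == 1)
                    get_uid(user);
            return user;
    }

    static void sigqueue_unpin_user(struct user_struct *user)
    {
            /* Dropping the last pending signal releases the reference. */
            if (atomic_dec_and_test(&user->sigpending))
                    free_uid(user);
    }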

    This was noticed because of a particularly odd timing artifact on a
    dual-socket 96C/192T Cascade Lake platform: when you get into bad
    contention, on that machine the contention for some reason seems to be
    much worse when it happens in the upper 32-byte half of the cacheline.

    As a result, the kernel test robot will-it-scale 'signal1' benchmark had
    an odd performance regression simply due to random alignment of the
    'struct user_struct' (and pointed to a completely unrelated and
    apparently nonsensical commit for the regression).

    Avoiding the double increments (and decrements on the dequeueing side,
    of course) makes for much less contention and hugely improved
    performance on that will-it-scale microbenchmark.

    Quoting Feng Tang:

    "It makes a big difference, that the performance score is tripled! bump
    from original 17000 to 54000. Also the gap between 5.0-rc6 and
    5.0-rc6+Jiri's patch is reduced to around 2%"

    [ The "2% gap" is the odd cacheline placement difference on that
    platform: under the extreme contention case, the effect of which half
    of the cacheline was hot was 5%, so with the reduced contention the
    odd timing artifact is reduced too ]

    It does help in the non-contended case too, but is not nearly as
    noticeable.

    Reported-and-tested-by: Feng Tang
    Cc: Eric W. Biederman
    Cc: Huang, Ying
    Cc: Philip Li
    Cc: Andi Kleen
    Cc: Jiri Olsa
    Cc: Peter Zijlstra
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

26 Feb, 2020

1 commit

    Commit d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by
    default") also changed CONFIG_BOOTTIME_TRACING to select
    CONFIG_BOOT_CONFIG so that boot-time tracing shows up on the menu,
    which introduced a wrong dependency on BLK_DEV_INITRD as below.

    WARNING: unmet direct dependencies detected for BOOT_CONFIG
      Depends on [n]: BLK_DEV_INITRD [=n]
      Selected by [y]:
      - BOOTTIME_TRACING [=y] && TRACING_SUPPORT [=y] && FTRACE [=y] && TRACING [=y]

    Make CONFIG_BOOT_CONFIG select CONFIG_BLK_DEV_INITRD to fix this
    error, and make CONFIG_BOOTTIME_TRACING=n by default, so that both
    boot-time tracing and boot configuration are off by default but
    still appear on the menu list.

    Link: http://lkml.kernel.org/r/158264140162.23842.11237423518607465535.stgit@devnote2

    Fixes: d8a953ddde5e ("bootconfig: Set CONFIG_BOOT_CONFIG=n by default")
    Reported-by: Randy Dunlap
    Compiled-tested-by: Randy Dunlap
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     

25 Feb, 2020

19 commits

    In an RT kernel, down_read_trylock() cannot be used from NMI context, and
    up_read_non_owner() is another problematic issue.

    So in such a configuration, simply elide the annotated stackmap and
    just report the raw IPs.

    In the longer term, it might be possible to provide an atomic-friendly
    version of the page cache traversal which will at least provide the info
    if the pages are resident and don't need to be paged in.

    [ tglx: Use IS_ENABLED() to avoid the #ifdeffery, fixup the irq work
    callback and add a comment ]

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.708960317@linutronix.de

    David Miller
     
  • The LPM trie map cannot be used in contexts like perf, kprobes and tracing
    as this map type dynamically allocates memory.

    The memory allocation happens with a raw spinlock held which is a truly
    spinning lock on a PREEMPT RT enabled kernel which disables preemption and
    interrupts.

    As RT does not allow memory allocation from such a section for various
    reasons, convert the raw spinlock to a regular spinlock.

    On a RT enabled kernel these locks are substituted by 'sleeping' spinlocks
    which provide the proper protection but keep the code preemptible.

    On a non-RT kernel regular spinlocks map to raw spinlocks, i.e. this does
    not cause any functional change.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.602129531@linutronix.de

    Thomas Gleixner
     
  • PREEMPT_RT forbids certain operations like memory allocations (even with
    GFP_ATOMIC) from atomic contexts. This is required because even with
    GFP_ATOMIC the memory allocator calls into code paths which acquire locks
    with long held lock sections. To ensure the deterministic behaviour these
    locks are regular spinlocks, which are converted to 'sleepable' spinlocks
    on RT. The only true atomic contexts on an RT kernel are the low level
    hardware handling, scheduling, low level interrupt handling, NMIs etc. None
    of these contexts should ever do memory allocations.

    As regular device interrupt handlers and soft interrupts are forced into
    thread context, the existing code which does

      spin_lock*(); alloc(GFP_ATOMIC); spin_unlock*();

    just works.

    In theory the BPF locks could be converted to regular spinlocks as well,
    but the bucket locks and percpu_freelist locks can be taken from arbitrary
    contexts (perf, kprobes, tracepoints) which are required to be atomic
    contexts even on RT. These mechanisms require preallocated maps, so there
    is no need to invoke memory allocations within the lock held sections.

    BPF maps which need dynamic allocation are only used from (forced) thread
    context on RT and can therefore use regular spinlocks which in turn allows
    to invoke memory allocations from the lock held section.

    To achieve this make the hash bucket lock a union of a raw and a regular
    spinlock and initialize and lock/unlock either the raw spinlock for
    preallocated maps or the regular variant for maps which require memory
    allocations.

    On a non RT kernel this distinction is neither possible nor required.
    spinlock maps to raw_spinlock and the extra code and conditional is
    optimized out by the compiler. No functional change.
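
    A condensed sketch of the resulting shape (illustrative, not the
    literal diff; htab_is_prealloc() is the existing prealloc check):

    struct bucket {
            struct hlist_nulls_head head;
            union {
                    raw_spinlock_t raw_lock;   /* preallocated maps */
                    spinlock_t     lock;       /* maps that allocate at runtime */
            };
    };

    static inline bool htab_use_raw_lock(const struct bpf_htab *htab)
    {
            return !IS_ENABLED(CONFIG_PREEMPT_RT) || htab_is_prealloc(htab);
    }

    static inline void htab_lock_bucket(const struct bpf_htab *htab,
                                        struct bucket *b, unsigned long *pflags)
    {
            unsigned long flags;

            if (htab_use_raw_lock(htab))
                    raw_spin_lock_irqsave(&b->raw_lock, flags);
            else
                    spin_lock_irqsave(&b->lock, flags);
            *pflags = flags;
    }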

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.509685912@linutronix.de

    Thomas Gleixner
     
  • As a preparation for making the BPF locking RT friendly, factor out the
    hash bucket lock operations into inline functions. This allows to do the
    necessary RT modification in one place instead of sprinkling it all over
    the place. No functional change.

    The now unused htab argument of the lock/unlock functions will be used in
    the next step which adds PREEMPT_RT support.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.420416916@linutronix.de

    Thomas Gleixner
     
  • The required protection is that the caller cannot be migrated to a
    different CPU as these functions end up in places which take either a hash
    bucket lock or might trigger a kprobe inside the memory allocator. Both
    scenarios can lead to deadlocks. The deadlock prevention is per CPU by
    incrementing a per CPU variable which temporarily blocks the invocation of
    BPF programs from perf and kprobes.

    Replace the open coded preempt_[dis|en]able and __this_cpu_[inc|dec] pairs
    with the new helper functions. These functions are already prepared to make
    BPF work on PREEMPT_RT enabled kernels. No functional change for !RT
    kernels.
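
    For reference, the helpers referred to above have roughly this shape
    (sketch; assuming they are the bpf_{disable,enable}_instrumentation()
    pair introduced earlier in the series):

    static inline void bpf_disable_instrumentation(void)
    {
            migrate_disable();                      /* stay on this CPU */
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    this_cpu_inc(bpf_prog_active);  /* block nested BPF from perf/kprobes */
            else
                    __this_cpu_inc(bpf_prog_active);
    }

    static inline void bpf_enable_instrumentation(void)
    {
            if (IS_ENABLED(CONFIG_PREEMPT_RT))
                    this_cpu_dec(bpf_prog_active);
            else
                    __this_cpu_dec(bpf_prog_active);
            migrate_enable();
    }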

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.317843926@linutronix.de

    Thomas Gleixner
     
  • The required protection is that the caller cannot be migrated to a
    different CPU as these places take either a hash bucket lock or might
    trigger a kprobe inside the memory allocator. Both scenarios can lead to
    deadlocks. The deadlock prevention is per CPU by incrementing a per CPU
    variable which temporarily blocks the invocation of BPF programs from perf
    and kprobes.

    Replace the open coded preempt_disable/enable() and this_cpu_inc/dec()
    pairs with the new recursion prevention helpers to prepare BPF to work on
    PREEMPT_RT enabled kernels. On a non-RT kernel the migrate disable/enable
    in the helpers map to preempt_disable/enable(), i.e. no functional change.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145644.211208533@linutronix.de

    Thomas Gleixner
     
    Use migrate_disable/enable() instead of preemption disable/enable to
    reflect the purpose. This allows PREEMPT_RT to substitute it with an
    actual migration disable implementation. On non-RT kernels this still
    maps to preempt_disable/enable().

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.891428873@linutronix.de

    David Miller
     
  • All of these cases are strictly of the form:

      preempt_disable();
      BPF_PROG_RUN(...);
      preempt_enable();

    Replace this with bpf_prog_run_pin_on_cpu() which wraps BPF_PROG_RUN()
    with:

      migrate_disable();
      BPF_PROG_RUN(...);
      migrate_enable();

    On non RT enabled kernels this maps to preempt_disable/enable() and on RT
    enabled kernels this solely prevents migration, which is sufficient as
    there is no requirement to prevent reentrancy to any BPF program from a
    preempting task. The only requirement is that the program stays on the same
    CPU.

    Therefore, this is a trivially correct transformation.

    The seccomp loop does not need protection over the loop. It only needs
    protection per BPF filter program.

    [ tglx: Converted to bpf_prog_run_pin_on_cpu() ]
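
    A sketch of the wrapper described above (illustrative; the real helper
    lives in the BPF headers):

    static inline u32 bpf_prog_run_pin_on_cpu(const struct bpf_prog *prog,
                                              const void *ctx)
    {
            u32 ret;

            migrate_disable();              /* preempt_disable() on !RT kernels */
            ret = BPF_PROG_RUN(prog, ctx);
            migrate_enable();
            return ret;
    }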

    Signed-off-by: David S. Miller
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.691493094@linutronix.de

    David Miller
     
  • pcpu_freelist_populate() is disabling interrupts and then iterates over the
    possible CPUs. The reason why this disables interrupts is to silence
    lockdep because the invoked ___pcpu_freelist_push() takes spin locks.

    Neither the interrupt disabling nor the locking are required in this
    function because it's called during initialization and the resulting map is
    not yet visible to anything.

    Split out the actual push assignment into an inline function, call it from
    the loop and remove the interrupt disable.
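
    The split-out helper looks roughly like this (sketch; the helper name
    is illustrative):

    /* Plain, lockless push: only valid while the map is being initialized
     * and is not yet visible to anybody else. */
    static void pcpu_freelist_push_node(struct pcpu_freelist_head *head,
                                        struct pcpu_freelist_node *node)
    {
            node->next = head->first;
            head->first = node;
    }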

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.365930116@linutronix.de

    Thomas Gleixner
     
  • If an element is freed via RCU then recursion into BPF instrumentation
    functions is not a concern. The element is already detached from the map
    and the RCU callback does not hold any locks on which a kprobe, perf event
    or tracepoint attached BPF program could deadlock.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.259118710@linutronix.de

    Thomas Gleixner
     
    The BPF invocation from the perf event overflow handler does not require
    disabling preemption because it is called from NMI or at least hard
    interrupt context, which is already non-preemptible.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.151953573@linutronix.de

    Thomas Gleixner
     
  • Similar to __bpf_trace_run this is redundant because __bpf_trace_run() is
    invoked from a trace point via __DO_TRACE() which already disables
    preemption _before_ invoking any of the functions which are attached to a
    trace point.

    Remove it and add a cant_sleep() check.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145643.059995527@linutronix.de

    Thomas Gleixner
     
    trace_call_bpf() no longer disables preemption on its own.
    All callers of this function have to do it explicitly.

    Signed-off-by: Alexei Starovoitov
    Acked-by: Thomas Gleixner

    Alexei Starovoitov
     
  • All callers are built in. No point to export this.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov

    Thomas Gleixner
     
  • __bpf_trace_run() disables preemption around the BPF_PROG_RUN() invocation.

    This is redundant because __bpf_trace_run() is invoked from a trace point
    via __DO_TRACE() which already disables preemption _before_ invoking any of
    the functions which are attached to a trace point.

    Remove it and add a cant_sleep() check.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.847220186@linutronix.de

    Thomas Gleixner
     
  • The comment where the bucket lock is acquired says:

    /* bpf_map_update_elem() can be called in_irq() */

    which is not really helpful and aside from that it does not explain the
    subtle details of the hash bucket locks, especially in the context of BPF
    and perf, kprobes and tracing.

    Add a comment at the top of the file which explains the protection scopes
    and the details how potential deadlocks are prevented.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.755793061@linutronix.de

    Thomas Gleixner
     
    Aside from the general unsafety of run-time map allocation for
    instrumentation type programs, RT enabled kernels have another constraint:

    The instrumentation programs are invoked with preemption disabled, but the
    memory allocator spinlocks cannot be acquired in atomic context because
    they are converted to 'sleeping' spinlocks on RT.

    Therefore enforce map preallocation for these programs types when RT is
    enabled.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.648784007@linutronix.de

    Thomas Gleixner
     
  • The assumption that only programs attached to perf NMI events can deadlock
    on memory allocators is wrong. Assume the following simplified callchain:

      kmalloc() from regular non BPF context
        cache empty
          freelist empty
            lock(zone->lock);
              tracepoint or kprobe
                BPF()
                  update_elem()
                    lock(bucket)
                      kmalloc()
                        cache empty
                          freelist empty
                            lock(zone->lock);

    Signed-off-by: Alexei Starovoitov
    Link: https://lore.kernel.org/bpf/20200224145642.540542802@linutronix.de

    Thomas Gleixner
     
  • This patch ensures that we always check the netlink payload length
    in audit_receive_msg() before we take any action on the payload
    itself.

    Cc: stable@vger.kernel.org
    Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
    Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore

    Paul Moore
     

23 Feb, 2020

3 commits

  • Commit 219ca39427bf ("audit: use union for audit_field values since
    they are mutually exclusive") combined a number of separate fields in
    the audit_field struct into a single union. Generally this worked
    just fine because they are generally mutually exclusive.
    Unfortunately in audit_data_to_entry() the overlap can be a problem
    when a specific error case is triggered that causes the error path
    code to attempt to cleanup an audit_field struct and the cleanup
    involves attempting to free a stored LSM string (the lsm_str field).
    Currently the code always has a non-NULL value in the
    audit_field.lsm_str field as the top of the for-loop transfers a
    value into audit_field.val (both .lsm_str and .val are part of the
    same union); if audit_data_to_entry() fails and the audit_field
    struct is specified to contain a LSM string, but the
    audit_field.lsm_str has not yet been properly set, the error handling
    code will attempt to free the bogus audit_field.lsm_str value that
    was set with audit_field.val at the top of the for-loop.

    This patch corrects this by ensuring that the audit_field.val is only
    set when needed (it is cleared when the audit_field struct is
    allocated with kcalloc()). It also corrects a few other issues to
    ensure that in case of error the proper error code is returned.

    Cc: stable@vger.kernel.org
    Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
    Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore

    Paul Moore
     
  • Pull irq fixes from Thomas Gleixner:
    "Two fixes for the irq core code which are follow ups to the recent MSI
    fixes:

    - The WARN_ON which was put into the MSI setaffinity callback for
    paranoia reasons actually triggered via a callchain which escaped
    when all the possible ways to reach that code were analyzed.

    The proc/irq/$N/*affinity interfaces have a quirk which came in
    when ALPHA moved to the generic interface: In case that the written
    affinity mask does not contain any online CPU it calls into ALPHAs
    magic auto affinity setting code.

    A few years later this mechanism was also made available to x86 for
    no good reasons and in a way which circumvents all sanity checks
    for interrupts which cannot have their affinity set from process
    context on X86 due to the way the X86 interrupt delivery works.

    It would be possible to make this work properly, but there is no
    point in doing so. If the interrupt is not yet started then the
    affinity setting has no effect and if it is started already then it
    is already assigned to an online CPU so there is no point to
    randomly move it to some other CPU. Just return EINVAL as the code
    has done before that change forever.

    - The new MSI quirk bit in the irq domain flags turned out to be
    already occupied, which escaped the author and the reviewers
    because the already in use bits were 0,6,2,3,4,5 listed in that
    order.

    That bit 6 was simply overlooked because the ordering was straight
    forward linear otherwise. So the new bit ended up being a
    duplicate.

    Fix it up by switching the oddball 6 to the obvious 1"

    * tag 'irq-urgent-2020-02-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    genirq/irqdomain: Make sure all irq domain flags are distinct
    genirq/proc: Reject invalid affinity masks (again)

    Linus Torvalds
     
  • Pull s390 fixes from Vasily Gorbik:

    - Remove the ieee_emulation_warnings sysctl which is dead code.

    - Avoid triggering rebuild of the kernel during make install.

    - Enable protected virtualization guest support in default configs.

    - Fix cio_ignore seq_file .next function to increase position index.
    And use kobj_to_dev instead of container_of in cio code.

    - Fix storage block address lists to contain absolute addresses in qdio
    code.

    - A few clang warnings and spelling fixes.

    * tag 's390-5.6-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390/qdio: fill SBALEs with absolute addresses
    s390/qdio: fill SL with absolute addresses
    s390: remove obsolete ieee_emulation_warnings
    s390: make 'install' not depend on vmlinux
    s390/kaslr: Fix casts in get_random
    s390/mm: Explicitly compare PAGE_DEFAULT_KEY against zero in storage_key_init_range
    s390/pkey/zcrypt: spelling s/crytp/crypt/
    s390/cio: use kobj_to_dev() API
    s390/defconfig: enable CONFIG_PROTECTED_VIRTUALIZATION_GUEST
    s390/cio: cio_ignore_proc_seq_next should increase position index

    Linus Torvalds
     

22 Feb, 2020

5 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-02-21

    The following pull-request contains BPF updates for your *net-next* tree.

    We've added 25 non-merge commits during the last 4 day(s) which contain
    a total of 33 files changed, 2433 insertions(+), 161 deletions(-).

    The main changes are:

    1) Allow for adding TCP listen sockets into sock_map/hash so they can be used
    with reuseport BPF programs, from Jakub Sitnicki.

    2) Add a new bpf_program__set_attach_target() helper for adding libbpf support
    to specify the tracepoint/function dynamically, from Eelco Chaudron.

    3) Add bpf_read_branch_records() BPF helper which helps use cases like profile
    guided optimizations, from Daniel Xu.

    4) Enable bpf_perf_event_read_value() in all tracing programs, from Song Liu.

    5) Relax BTF mandatory check if only used for libbpf itself e.g. to process
    BTF defined maps, from Andrii Nakryiko.

    6) Move BPF selftests -mcpu compilation attribute from 'probe' to 'v3' as it has
    been observed that the former fails in envs with low memlock, from Yonghong Song.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Commit 736b46027eb4 ("net: Add ID (if needed) to sock_reuseport and expose
    reuseport_lock") has introduced lazy generation of reuseport group IDs that
    survive group resize.

    By comparing the identifier we check if BPF reuseport program is not trying
    to select a socket from a BPF map that belongs to a different reuseport
    group than the one the packet is for.

    Because SOCKARRAY used to be the only BPF map type that can be used with
    reuseport BPF, it was possible to delay the generation of reuseport group
    ID until a socket from the group was inserted into BPF map for the first
    time.

    Now that SOCK{MAP,HASH} can be used with reuseport BPF we have two options,
    either generate the reuseport ID on map update, like SOCKARRAY does, or
    allocate an ID from the start when reuseport group gets created.

    This patch takes the latter approach to keep sockmap free of calls into
    reuseport code. This streamlines the reuseport_id access as its lifetime
    now matches the longevity of reuseport object.

    The cost of this simplification, however, is that we allocate reuseport IDs
    for all SO_REUSEPORT users. Even those that don't use SOCKARRAY in their
    setups. With the way identifiers are currently generated, we can have at
    most S32_MAX reuseport groups, which hopefully is sufficient. If we ever
    get close to the limit, we can switch to a u64 counter like sk_cookie.

    Another change is that we now always call into SOCKARRAY logic to unlink
    the socket from the map when unhashing or closing the socket. Previously we
    did it only when at least one socket from the group was in a BPF map.

    It is worth noting that this doesn't conflict with sockmap tear-down in
    case a socket is in a SOCK{MAP,HASH} and belongs to a reuseport
    group. sockmap tear-down happens first:

    prot->unhash
    `- tcp_bpf_unhash
       |- tcp_bpf_remove
       |  `- while (sk_psock_link_pop(psock))
       |     `- sk_psock_unlink
       |        `- sock_map_delete_from_link
       |           `- __sock_map_delete
       |              `- sock_map_unref
       |                 `- sk_psock_put
       |                    `- sk_psock_drop
       |                       `- rcu_assign_sk_user_data(sk, NULL)
       `- inet_unhash
          `- reuseport_detach_sock
             `- bpf_sk_reuseport_detach
                `- WRITE_ONCE(sk->sk_user_data, NULL)

    Suggested-by: Martin Lau
    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200218171023.844439-10-jakub@cloudflare.com

    Jakub Sitnicki
     
    SOCKMAP & SOCKHASH now support storing references to listening
    sockets. Nothing keeps us from using these map types as a collection of
    sockets to select from in BPF reuseport programs. Whitelist the map types
    with the bpf_sk_select_reuseport helper.

    The restriction that the socket has to be a member of a reuseport group
    still applies. Sockets in SOCKMAP/SOCKHASH that don't have sk_reuseport_cb
    set are not a valid target and we signal it with -EINVAL.

    The main benefit from this change is that, in contrast to
    REUSEPORT_SOCKARRAY, SOCK{MAP,HASH} don't impose the restriction that a
    listening socket can be in just one BPF map at a time.
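
    A minimal sk_reuseport program using a SOCKMAP could look like this
    (sketch; the map name and the fixed key are illustrative, real programs
    typically derive the key from the packet):

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>

    struct {
            __uint(type, BPF_MAP_TYPE_SOCKMAP);
            __uint(max_entries, 16);
            __type(key, __u32);
            __type(value, __u64);
    } redir_map SEC(".maps");

    SEC("sk_reuseport")
    int select_sock(struct sk_reuseport_md *reuse_md)
    {
            __u32 key = 0;

            /* Fails with -EINVAL if the socket in the map has no
             * sk_reuseport_cb, as described above. */
            if (bpf_sk_select_reuseport(reuse_md, &redir_map, &key, 0))
                    return SK_DROP;
            return SK_PASS;
    }

    char _license[] SEC("license") = "GPL";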

    Signed-off-by: Jakub Sitnicki
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/20200218171023.844439-9-jakub@cloudflare.com

    Jakub Sitnicki
     
  • Pull networking fixes from David Miller:

    1) Limit xt_hashlimit hash table size to avoid OOM or hung tasks, from
    Cong Wang.

    2) Fix deadlock in xsk by publishing global consumer pointers when NAPI
    is finished, from Magnus Karlsson.

    3) Set table field properly to RT_TABLE_COMPAT when necessary, from
    Jethro Beekman.

    4) NLA_STRING attributes are not necessarily NULL terminated, deal with
    that in IFLA_ALT_IFNAME. From Eric Dumazet.

    5) Fix checksum handling in atlantic driver, from Dmitry Bezrukov.

    6) Handle mtu==0 devices properly in wireguard, from Jason A.
    Donenfeld.

    7) Fix several lockdep warnings in bonding, from Taehee Yoo.

    8) Fix cls_flower port blocking, from Jason Baron.

    9) Sanitize internal map names in libbpf, from Toke Høiland-Jørgensen.

    10) Fix RDMA race in qede driver, from Michal Kalderon.

    11) Fix several false lockdep warnings by adding conditions to
    list_for_each_entry_rcu(), from Madhuparna Bhowmik.

    12) Fix sleep in atomic in mlx5 driver, from Huy Nguyen.

    13) Fix potential deadlock in bpf_map_do_batch(), from Yonghong Song.

    14) Hey, variables declared in a switch statement before any case
    statements are not initialized. I learn something every day. Get
    rid of this stuff in several parts of the networking code, from Kees
    Cook.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (99 commits)
    bnxt_en: Issue PCIe FLR in kdump kernel to cleanup pending DMAs.
    bnxt_en: Improve device shutdown method.
    net: netlink: cap max groups which will be considered in netlink_bind()
    net: thunderx: workaround BGX TX Underflow issue
    ionic: fix fw_status read
    net: disable BRIDGE_NETFILTER by default
    net: macb: Properly handle phylink on at91rm9200
    s390/qeth: fix off-by-one in RX copybreak check
    s390/qeth: don't warn for napi with 0 budget
    s390/qeth: vnicc Fix EOPNOTSUPP precedence
    openvswitch: Distribute switch variables for initialization
    net: ip6_gre: Distribute switch variables for initialization
    net: core: Distribute switch variables for initialization
    udp: rehash on disconnect
    net/tls: Fix to avoid gettig invalid tls record
    bpf: Fix a potential deadlock with bpf_map_do_batch
    bpf: Do not grab the bucket spinlock by default on htab batch ops
    ice: Wait for VF to be reset/ready before configuration
    ice: Don't tell the OS that link is going down
    ice: Don't reject odd values of usecs set by user
    ...

    Linus Torvalds
     
  • No users remain, so kill these off before we grow new ones.

    Link: http://lkml.kernel.org/r/20200110154232.4104492-3-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Acked-by: Thomas Gleixner
    Cc: Deepa Dinamani
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

21 Feb, 2020

5 commits

    Set CONFIG_BOOT_CONFIG=n by default. This also warns the user if
    CONFIG_BOOT_CONFIG=n but "bootconfig" is given on the kernel
    command line.

    Link: http://lkml.kernel.org/r/158220111291.26565.9036889083940367969.stgit@devnote2

    Suggested-by: Steven Rostedt
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
    Clear the trace_state data structure when starting a trace in the
    __synth_event_trace_start() internal function.

    Currently trace_state is initialized only in the
    synth_event_trace_start() API, but the trace_state in
    synth_event_trace() and synth_event_trace_array() is on the stack
    without initialization.
    This means those APIs will see wrong parameters and
    will skip the closing process in __synth_event_trace_end()
    because trace_state->disabled may be !0.
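
    The fix itself is essentially a one-liner at the top of
    __synth_event_trace_start() (sketch):

    /* Clear the caller-provided state up front so that
     * __synth_event_trace_end() never sees stale flags. */
    memset(trace_state, '\0', sizeof(*trace_state));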

    Link: http://lkml.kernel.org/r/158193315899.8868.1781259176894639952.stgit@devnote2

    Reviewed-by: Tom Zanussi
    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Steven Rostedt (VMware)

    Masami Hiramatsu
     
    The tracing selftests check various aspects of the tracing infrastructure,
    and one is filtering. If trace_printk() is active during a self test, it can
    cause the filtering to fail, which will disable that part of the trace.

    To keep the selftests from failing because of trace_printk() calls,
    trace_printk() checks the variable tracing_selftest_running, and if set, it
    does not write to the tracing buffer.

    As some tracers were registered earlier in boot, the selftest they triggered
    would fail because not all the infrastructure was set up for the full
    selftest. Thus, some of the tests were postponed to when their
    infrastructure was ready (namely file system code). The postpone code did
    not set the tracing_selftest_running variable, and could fail if a
    trace_printk() was added and executed during their run.

    Cc: stable@vger.kernel.org
    Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • The test code that tests synthetic event creation pushes in as one of its
    test fields the current CPU using "smp_processor_id()". As this is just
    something to see if the value is correctly passed in, and the actual CPU
    used does not matter, use raw_smp_processor_id(); otherwise, with debug
    preemption enabled, a warning is triggered because smp_processor_id() is
    called while preemption is enabled.

    Link: http://lkml.kernel.org/r/20200220162950.35162579@gandalf.local.home

    Reviewed-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Steven Rostedt (VMware)
     
  • Fix a varargs-related bug in print_synth_event() which resulted in
    strange output and oopses on 32-bit x86 systems. The problem is that
    trace_seq_printf() expects the varargs to match the format string, but
    print_synth_event() was always passing u64 values regardless. This
    results in unspecified behavior when unpacking with va_arg() in
    trace_seq_printf().

    Add a function that takes the size into account when calling
    trace_seq_printf().

    Before:

    modprobe-1731 [003] .... 919.039758: gen_synth_test: next_pid_field=777(null)next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=3(null)my_string_field=thneed my_int_field=598(null)

    After:

    insmod-1136 [001] .... 36.634590: gen_synth_test: next_pid_field=777 next_comm_field=hula hoops ts_ns=1000000 ts_ms=1000 cpu=1 my_string_field=thneed my_int_field=598
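
    The added helper roughly switches on the field size before handing the
    value to trace_seq_printf() (sketch; name and exact signature are
    illustrative):

    static void print_synth_event_num_val(struct trace_seq *s, char *print_fmt,
                                          char *name, int size, u64 val)
    {
            switch (size) {
            case 1:
                    trace_seq_printf(s, print_fmt, name, (u8)val);
                    break;
            case 2:
                    trace_seq_printf(s, print_fmt, name, (u16)val);
                    break;
            case 4:
                    trace_seq_printf(s, print_fmt, name, (u32)val);
                    break;
            default:
                    trace_seq_printf(s, print_fmt, name, val);
                    break;
            }
    }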

    Link: http://lkml.kernel.org/r/a9b59eb515dbbd7d4abe53b347dccf7a8e285657.1581720155.git.zanussi@kernel.org

    Reported-by: Steven Rostedt (VMware)
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)

    Tom Zanussi