17 May, 2020

1 commit

  • Currently informational messages within block trace do not have PID
    information of the process reporting the message included. With BFQ it
    is sometimes useful to have the information and there's no good reason
    to omit the information from the trace. So just fill in pid information
    when generating note message.

    Signed-off-by: Jan Kara
    Reviewed-by: Chaitanya Kulkarni
    Acked-by: Paolo Valente
    Signed-off-by: Jens Axboe

    Jan Kara
     

20 Apr, 2020

4 commits

  • Pull time namespace fix from Thomas Gleixner:
    "An update for the proc interface of time namespaces: Use symbolic
    names instead of clockid numbers. The usability nuisance of numbers
    was noticed by Michael when polishing the man page"

    * tag 'timers-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    proc, time/namespace: Show clock symbolic names in /proc/pid/timens_offsets

    Linus Torvalds
     
  • Pull irq fixes from Thomas Gleixner:
    "A set of fixes/updates for the interrupt subsystem:

    - Remove setup_irq() and remove_irq(). All users have been converted
    so remove them before new users surface.

    - A set of bugfixes for various interrupt chip drivers

    - Add a few missing static attributes to address sparse warnings"

    * tag 'irq-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    irqchip/irq-bcm7038-l1: Make bcm7038_l1_of_init() static
    irqchip/irq-mvebu-icu: Make legacy_bindings static
    irqchip/meson-gpio: Fix HARDIRQ-safe -> HARDIRQ-unsafe lock order
    irqchip/sifive-plic: Fix maximum priority threshold value
    irqchip/ti-sci-inta: Fix processing of masked irqs
    irqchip/mbigen: Free msi_desc on device teardown
    irqchip/gic-v4.1: Update effective affinity of virtual SGIs
    irqchip/gic-v4.1: Add support for VPENDBASER's Dirty+Valid signaling
    genirq: Remove setup_irq() and remove_irq()

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "Two fixes for the scheduler:

    - Work around an uninitialized variable warning where GCC can't
    figure it out.

    - Allow 'isolcpus=' to skip unknown subparameters so that older
    kernels work with the commandline of a newer kernel. Improve the
    error output while at it"

    * tag 'sched-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/vtime: Work around an unitialized variable warning
    sched/isolation: Allow "isolcpus=" to skip unknown sub-parameters

    Linus Torvalds
     
  • Pull RCU fix from Thomas Gleixner:
    "A single bugfix for RCU to prevent taking a lock in NMI context"

    * tag 'core-urgent-2020-04-19' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rcu: Don't acquire lock in NMI handler in rcu_nmi_enter_common()

    Linus Torvalds
     

19 Apr, 2020

1 commit

  • Pull thread fixes from Christian Brauner:
    "A few fixes and minor improvements:

    - Correctly validate the cgroup file descriptor when clone3() is used
    with CLONE_INTO_CGROUP.

    - Check that a new enough version of struct clone_args is passed
    which supports the cgroup file descriptor argument when
    CLONE_INTO_CGROUP is set in the flags argument.

    - Catch nonsensical struct clone_args layouts at build time.

    - Catch extensions of struct clone_args without updating the uapi
    visible size definitions at build time.

    - Check whether the signal is valid early in kill_pid_usb_asyncio()
    before doing further work.

    - Replace open-coded rcu_read_lock()+kill_pid_info()+rcu_read_unlock()
    sequence in kill_something_info() with kill_proc_info() which is a
    dedicated helper to do just that"

    * tag 'for-linus-2020-04-18' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
    clone3: add build-time CLONE_ARGS_SIZE_VER* validity checks
    clone3: add a check for the user struct size if CLONE_INTO_CGROUP is set
    clone3: fix cgroup argument sanity check
    signal: use kill_proc_info instead of kill_pid_info in kill_something_info
    signal: check sig before setting info in kill_pid_usb_asyncio

    Linus Torvalds
     

17 Apr, 2020

1 commit

  • Pull networking fixes from David Miller:

    1) Disable RISCV BPF JIT builds when !MMU, from Björn Töpel.

    2) nf_tables leaves dangling pointer after free, fix from Eric Dumazet.

    3) Out of boundary write in __xsk_rcv_memcpy(), fix from Li RongQing.

    4) Adjust icmp6 message source address selection when routes have a
    preferred source address set, from Tim Stallard.

    5) Be sure to validate HSR protocol version when creating new links,
    from Taehee Yoo.

    6) CAP_NET_ADMIN should be sufficient to manage l2tp tunnels even in
    non-initial namespaces, from Michael Weiß.

    7) Missing release firmware call in mlx5, from Eran Ben Elisha.

    8) Fix variable type in macsec_changelink(), caught by KASAN. Fix from
    Taehee Yoo.

    9) Fix pause frame negotiation in marvell phy driver, from Clemens
    Gruber.

    10) Record RX queue early enough in tun packet paths such that XDP
    programs will see the correct RX queue index, from Gilberto Bertin.

    11) Fix double unlock in mptcp, from Florian Westphal.

    12) Fix offset overflow in ARM bpf JIT, from Luke Nelson.

    13) marvell10g needs to soft reset PHY when coming out of low power
    mode, from Russell King.

    14) Fix MTU setting regression in stmmac for some chip types, from
    Florian Fainelli.

    * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (101 commits)
    amd-xgbe: Use __napi_schedule() in BH context
    mISDN: make dmril and dmrim static
    net: stmmac: dwmac-sunxi: Provide TX and RX fifo sizes
    net: dsa: mt7530: fix tagged frames pass-through in VLAN-unaware mode
    tipc: fix incorrect increasing of link window
    Documentation: Fix tcp_challenge_ack_limit default value
    net: tulip: make early_486_chipsets static
    dt-bindings: net: ethernet-phy: add desciption for ethernet-phy-id1234.d400
    ipv6: remove redundant assignment to variable err
    net/rds: Use ERR_PTR for rds_message_alloc_sgs()
    net: mscc: ocelot: fix untagged packet drops when enslaving to vlan aware bridge
    selftests/bpf: Check for correct program attach/detach in xdp_attach test
    libbpf: Fix type of old_fd in bpf_xdp_set_link_opts
    libbpf: Always specify expected_attach_type on program load if supported
    xsk: Add missing check on user supplied headroom size
    mac80211: fix channel switch trigger from unknown mesh peer
    mac80211: fix race in ieee80211_register_hw()
    net: marvell10g: soft-reset the PHY when coming out of low power
    net: marvell10g: report firmware version
    net/cxgb4: Check the return from t4_query_params properly
    ...

    Linus Torvalds
     

16 Apr, 2020

1 commit

  • Michael Kerrisk suggested replacing numeric clock IDs with symbolic names.

    Now the content of these files looks like this:
    $ cat /proc/774/timens_offsets
    monotonic 864000 0
    boottime 1728000 0

    For setting offsets, both representations of clocks (numeric and symbolic)
    can be used.

    As for compatibility, it is acceptable to change things as long as
    userspace doesn't care. The format of timens_offsets files is very new and
    there are no userspace tools yet which rely on this format.

    But three projects (crun, util-linux and criu) rely on the interface for
    setting time offsets, which is why the numeric clock IDs must remain
    supported on write.

    Fixes: 04a8682a71be ("fs/proc: Introduce /proc/pid/timens_offsets")
    Suggested-by: Michael Kerrisk
    Signed-off-by: Andrei Vagin
    Signed-off-by: Thomas Gleixner
    Tested-by: Michael Kerrisk
    Acked-by: Michael Kerrisk
    Cc: Andrew Morton
    Cc: Eric W. Biederman
    Cc: Dmitry Safonov
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20200411154031.642557-1-avagin@gmail.com

    Andrei Vagin
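
The write-side compatibility described in the commit above (symbolic and numeric clock names both accepted) can be sketched in user-space C. This is only an illustration, not the kernel's parser: the helper name parse_clock_field() and the SK_* constants are assumptions made for the example, with the constants set to the Linux values of CLOCK_MONOTONIC (1) and CLOCK_BOOTTIME (7).

```c
#include <stdlib.h>
#include <string.h>

/* Assumed constants mirroring the Linux clock IDs for the two clocks. */
enum { SK_CLOCK_MONOTONIC = 1, SK_CLOCK_BOOTTIME = 7 };

/*
 * Hypothetical helper: map the first field of a timens_offsets line to
 * a clock ID. Accepts the symbolic name or the numeric ID; returns -1
 * for anything else.
 */
static int parse_clock_field(const char *field)
{
    char *end;
    long id;

    if (strcmp(field, "monotonic") == 0)
        return SK_CLOCK_MONOTONIC;
    if (strcmp(field, "boottime") == 0)
        return SK_CLOCK_BOOTTIME;

    /* Fall back to the older numeric representation. */
    id = strtol(field, &end, 10);
    if (end == field || *end != '\0')
        return -1;
    if (id == SK_CLOCK_MONOTONIC || id == SK_CLOCK_BOOTTIME)
        return (int)id;
    return -1;
}
```

With this shape, existing tools that write numeric IDs keep working while new users can write the names that the file now shows on read.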
     

15 Apr, 2020

8 commits

  • Work around this warning:

    kernel/sched/cputime.c: In function ‘kcpustat_field’:
    kernel/sched/cputime.c:1007:6: warning: ‘val’ may be used uninitialized in this function [-Wmaybe-uninitialized]

    because GCC can't see that val is used only when err is 0.

    Acked-by: Peter Zijlstra
    Signed-off-by: Borislav Petkov
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200327214334.GF8015@zn.tnic

    Borislav Petkov
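
The warning pattern in the commit above can be reproduced in miniature. The sketch below is not the kernel code: read_field() and field_or_default() are hypothetical names, and the explicit initialization is one common way to quiet -Wmaybe-uninitialized, not necessarily the change that was applied to kcpustat_field().

```c
#include <stdint.h>

/* Hypothetical reader: writes *out only on success (return value 0). */
static int read_field(int ok, uint64_t *out)
{
    if (!ok)
        return -1;
    *out = 42;
    return 0;
}

static uint64_t field_or_default(int ok, uint64_t fallback)
{
    /*
     * val is only consumed when err == 0, but a compiler that cannot
     * prove this may warn. Initializing it explicitly avoids the
     * warning without changing behaviour.
     */
    uint64_t val = 0;
    int err = read_field(ok, &val);

    if (err)
        return fallback;
    return val; /* only reached when err == 0 */
}
```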
     
  • The "isolcpus=" parameter allows sub-parameters before the cpulist is
    specified, and if the parser detects an unknown sub-parameter the whole
    parameter will be ignored.

    This design is incompatible with itself when new sub-parameters are added.
    An older kernel will not recognize the new sub-parameter and will
    invalidate the whole parameter so the CPU isolation will not take
    effect. It emits a warning:

    isolcpus: Error, unknown flag

    The better and compatible way is to allow "isolcpus=" to skip unknown
    sub-parameters, so that even if new sub-parameters are added an older
    kernel will still behave as usual, even with the new sub-parameter
    specified on the command line.

    Ideally this should have been there when the first sub-parameter for
    "isolcpus=" was introduced.

    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Xu
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20200403223517.406353-1-peterx@redhat.com

    Peter Xu
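
The skip-unknown-flags behaviour described in the commit above might look roughly like the following user-space sketch. The function name parse_isol_flags(), the SK_* flag bits, and the choice of "nohz" and "domain" as the known flags are assumptions for illustration; the kernel's own parser differs in detail.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

enum { SK_FLAG_NOHZ = 1, SK_FLAG_DOMAIN = 2 }; /* illustrative flag bits */

/*
 * Consume leading "flag," prefixes. Known flags set bits in *flags and
 * unknown ones are skipped. Returns a pointer to the start of the
 * cpulist, or NULL if a flag contains an illegal character.
 */
static const char *parse_isol_flags(const char *str, unsigned *flags)
{
    *flags = 0;
    while (isalpha((unsigned char)*str)) {
        if (strncmp(str, "nohz,", 5) == 0) {
            *flags |= SK_FLAG_NOHZ;
            str += 5;
            continue;
        }
        if (strncmp(str, "domain,", 7) == 0) {
            *flags |= SK_FLAG_DOMAIN;
            str += 7;
            continue;
        }
        /* Unknown flag: scan to the next comma, validating characters. */
        while (*str && *str != ',') {
            if (!isalpha((unsigned char)*str) && *str != '_')
                return NULL;
            str++;
        }
        if (*str == ',')
            str++;
    }
    return str;
}
```

A command line written for a newer kernel, such as "nohz,future_flag,domain,1-3", then still yields the cpulist "1-3" with both known flags set.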
     
  • CLONE_ARGS_SIZE_VER* macros are defined explicitly and not via
    the offsets of the relevant struct clone_args fields, which makes
    it rather error-prone, so it probably makes sense to add some
    compile-time checks for them (including the one that breaks
    on struct clone_args extension as a reminder to add a relevant
    size macro and a similar check). Function copy_clone_args_from_user
    seems to be a good place for such checks.

    Signed-off-by: Eugene Syromiatnikov
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200412202658.GA31499@asgard.redhat.com
    Signed-off-by: Christian Brauner

    Eugene Syromiatnikov
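
The kind of compile-time check proposed in the commit above can be sketched in plain C11. The struct below is an illustrative mirror of the clone_args layout and the SK_* names are assumptions made for the example; _Static_assert stands in for whatever build-time assertion the kernel uses.

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative mirror of the struct clone_args layout (all 64-bit fields). */
struct clone_args_sketch {
    uint64_t flags;
    uint64_t pidfd;
    uint64_t child_tid;
    uint64_t parent_tid;
    uint64_t exit_signal;
    uint64_t stack;
    uint64_t stack_size;
    uint64_t tls;
    uint64_t set_tid;
    uint64_t set_tid_size;
    uint64_t cgroup;
};

/* Explicitly defined sizes, as the commit message describes. */
#define SK_SIZE_VER0 64 /* ends after tls */
#define SK_SIZE_VER1 80 /* ends after set_tid_size */
#define SK_SIZE_VER2 88 /* ends after cgroup */

/* Offset of the first byte past member m of type T. */
#define SK_OFFSETOFEND(T, m) (offsetof(T, m) + sizeof(((T *)0)->m))

_Static_assert(SK_OFFSETOFEND(struct clone_args_sketch, tls) == SK_SIZE_VER0,
               "VER0 size does not match the struct layout");
_Static_assert(SK_OFFSETOFEND(struct clone_args_sketch, set_tid_size) == SK_SIZE_VER1,
               "VER1 size does not match the struct layout");
_Static_assert(SK_OFFSETOFEND(struct clone_args_sketch, cgroup) == SK_SIZE_VER2,
               "VER2 size does not match the struct layout");
/* Breaks the build when the struct grows without a new size macro. */
_Static_assert(sizeof(struct clone_args_sketch) == SK_SIZE_VER2,
               "struct extended without updating the size macros");
```

Adding a field to the struct without a matching size macro makes the last assertion fail at build time, which is the reminder the commit message asks for.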
     
  • Passing CLONE_INTO_CGROUP with an under-sized structure (one that does
    not fully contain the cgroup field) seems like garbage input, especially
    considering the fact that fd 0 is a valid descriptor.

    Signed-off-by: Eugene Syromiatnikov
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200412203123.GA5869@asgard.redhat.com
    Signed-off-by: Christian Brauner

    Eugene Syromiatnikov
     
  • Checking that the cgroup field value of struct clone_args is less than 0
    is useless, as it is defined as an unsigned 64-bit integer. Moreover,
    it doesn't catch the situations where its higher bits are lost during
    the assignment to the cgroup field of the internal
    struct kernel_clone_args (where it is declared as a signed 32-bit
    integer), so it is still possible to pass garbage there. A check
    against INT_MAX solves both these issues.

    Fixes: ef2c41cf38a7559b ("clone3: allow spawning processes into cgroups")
    Signed-off-by: Eugene Syromiatnikov
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/20200412202533.GA29554@asgard.redhat.com
    Signed-off-by: Christian Brauner

    Eugene Syromiatnikov
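
The reasoning in the commit above can be shown with a small stand-alone function. The name cgroup_fd_from_user() is hypothetical; the point is only that a single comparison against INT_MAX rejects both values that would become negative and values whose upper bits would be silently dropped when narrowed to int.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Hypothetical helper: narrow a user-supplied unsigned 64-bit value to
 * an int file descriptor. Returns 0 on success, -1 if the value cannot
 * be represented as a non-negative int.
 */
static int cgroup_fd_from_user(uint64_t user_val, int *fd)
{
    /* "user_val < 0" could never be true for an unsigned type. */
    if (user_val > INT_MAX)
        return -1;
    *fd = (int)user_val;
    return 0;
}
```

For example, 0x100000003 would truncate to fd 3 without the check, while fd 0 remains a valid input.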
     
  • …g 'snapshot' operation

    A traced event can trigger the 'snapshot' operation (i.e. call
    snapshot_trigger() or snapshot_count_trigger()) after
    register_snapshot_trigger() has completed registration but before it has
    allocated the buffer for the 'snapshot' event trigger. In this rare case,
    the 'snapshot' operation detects the missing buffer, so make
    register_snapshot_trigger() allocate the buffer first.

    trigger-snapshot.tc in kselftest reproduces the issue on slow vm:
    -----------------------------------------------------------
    cat trace
    ...
    ftracetest-3028 [002] .... 236.784290: sched_process_fork: comm=ftracetest pid=3028 child_comm=ftracetest child_pid=3036
    <...>-2875 [003] .... 240.460335: tracing_snapshot_instance_cond: *** SNAPSHOT NOT ALLOCATED ***
    <...>-2875 [003] .... 240.460338: tracing_snapshot_instance_cond: *** stopping trace here! ***
    -----------------------------------------------------------

    Link: http://lkml.kernel.org/r/20200414015145.66236-1-yangx.jy@cn.fujitsu.com

    Cc: stable@vger.kernel.org
    Fixes: 93e31ffbf417a ("tracing: Add 'snapshot' event trigger command")
    Signed-off-by: Xiao Yang <yangx.jy@cn.fujitsu.com>
    Signed-off-by: Steven Rostedt (VMware) <rostedt@goodmis.org>

    Xiao Yang
     
  • This issue was detected by using the Coccinelle software:

    kernel/bpf/verifier.c:1259:16-21: WARNING: conversion to bool not needed here

    The conversion to bool is unneeded, remove it.

    Reported-by: Hulk Robot
    Signed-off-by: Zou Wei
    Signed-off-by: Daniel Borkmann
    Acked-by: Song Liu
    Link: https://lore.kernel.org/bpf/1586779076-101346-1-git-send-email-zou_wei@huawei.com

    Zou Wei
     
  • The VM_MAYWRITE flag set during the initial memory mapping determines
    whether already mmap()'ed pages can later be remapped as writable through
    an mprotect() call. To prevent a user application from rewriting the
    contents of a BPF map that was memory-mapped as read-only and subsequently
    frozen, remove the VM_MAYWRITE flag completely on an initially read-only
    mapping.

    Alternatively, we could treat any memory-mapping on unfrozen map as writable
    and bump writecnt instead. But there is little legitimate reason to map
    BPF map as read-only and then re-mmap() it as writable through mprotect(),
    instead of just mmap()'ing it as read/write from the very beginning.

    Also, at the suggestion of Jann Horn, drop unnecessary refcounting in mmap
    operations. We can just rely on VMA holding reference to BPF map's file
    properly.

    Fixes: fc9702273e2e ("bpf: Add mmap() support for BPF_MAP_TYPE_ARRAY")
    Reported-by: Jann Horn
    Signed-off-by: Andrii Nakryiko
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Jann Horn
    Link: https://lore.kernel.org/bpf/20200410202613.3679837-1-andriin@fb.com

    Andrii Nakryiko
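
The flag logic in the commit above can be modelled with plain bit operations. This is a toy model, not the BPF mmap handler: the SK_VM_* values and both function names are assumptions for the sketch.

```c
/* Illustrative flag bits standing in for the kernel's VM_* flags. */
#define SK_VM_WRITE    0x02u
#define SK_VM_MAYWRITE 0x20u

/*
 * Applied once when the mapping is created: a mapping that is not
 * writable now loses the ability to become writable later.
 */
static unsigned flags_at_mmap(unsigned vm_flags)
{
    if (!(vm_flags & SK_VM_WRITE))
        vm_flags &= ~SK_VM_MAYWRITE;
    return vm_flags;
}

/* A later mprotect(PROT_WRITE) is allowed only if MAYWRITE survived. */
static int may_upgrade_to_write(unsigned vm_flags)
{
    return (vm_flags & SK_VM_MAYWRITE) != 0;
}
```

In this model a mapping created read-only can never be upgraded, while one created read/write keeps its permission.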
     

14 Apr, 2020

2 commits


13 Apr, 2020

6 commits

  • signal.c provides kill_proc_info, we can use it instead of kill_pid_info
    in kill_something_info func gracefully.

    Signed-off-by: Zhiqiang Liu
    Acked-by: Oleg Nesterov
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/80236965-f0b5-c888-95ff-855bdec75bb3@huawei.com
    Signed-off-by: Christian Brauner

    Zhiqiang Liu
     
  • In kill_pid_usb_asyncio, if signal is not valid, we do not need to
    set info struct.

    Signed-off-by: Zhiqiang Liu
    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/r/f525fd08-1cf7-fb09-d20c-4359145eb940@huawei.com
    Signed-off-by: Christian Brauner

    Zhiqiang Liu
     
  • Pull time(keeping) updates from Thomas Gleixner:

    - Fix the time_for_children symlink in /proc/$PID/ so it properly
    reflects that it is part of the 'time' namespace

    - Add the missing userns limit for the allowed number of time
    namespaces, which was half defined but the actual array member was
    not added. This went unnoticed as the array has an excessive empty
    member at the end but introduced a user visible regression as the
    output was corrupted.

    - Prevent further silent ucount corruption by adding a BUILD_BUG_ON()
    to catch half updated data.

    * tag 'timers-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ucount: Make sure ucounts in /proc/sys/user don't regress again
    time/namespace: Add max_time_namespaces ucount
    time/namespace: Fix time_for_children symlink

    Linus Torvalds
     
  • Pull scheduler fixes/updates from Thomas Gleixner:

    - Deduplicate the average computations in the scheduler core and the
    fair class code.

    - Fix a race between runtime distribution and assignment which can
    cause exceeding the quota by up to 70%.

    - Prevent negative results in the imbalance calculation

    - Remove a stale warning in the workqueue code which can be triggered
    since the call site was moved out of preempt disabled code. It's a
    false positive.

    - Deduplicate the print macros for procfs

    - Add the uclamp values to the SCHED_DEBUG procfs output for completeness

    * tag 'sched-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/debug: Add task uclamp values to SCHED_DEBUG procfs
    sched/debug: Factor out printing formats into common macros
    sched/debug: Remove redundant macro define
    sched/core: Remove unused rq::last_load_update_tick
    workqueue: Remove the warning in wq_worker_sleeping()
    sched/fair: Fix negative imbalance in imbalance calculation
    sched/fair: Fix race between runtime distribution and assignment
    sched/fair: Align rq->avg_idle and rq->avg_scan_cost

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "Three fixes/updates for perf:

    - Fix the perf event cgroup tracking which tries to track the cgroup
    even for disabled events.

    - Add Ice Lake server support for uncore events

    - Disable pagefaults when retrieving the physical address in the
    sampling code"

    * tag 'perf-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf/core: Disable page faults when getting phys address
    perf/x86/intel/uncore: Add Ice Lake server uncore support
    perf/cgroup: Correct indirection in perf_less_group_idx()
    perf/core: Fix event cgroup tracking

    Linus Torvalds
     
  • Pull locking fixes from Thomas Gleixner:
    "Three small fixes/updates for the locking core code:

    - Plug a task struct reference leak in the percpu rwsem
    implementation.

    - Document the refcount interaction with PID_MAX_LIMIT

    - Improve the 'invalid wait context' data dump in lockdep so it
    contains all information which is required to decode the problem"

    * tag 'locking-urgent-2020-04-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/lockdep: Improve 'invalid wait context' splat
    locking/refcount: Document interaction with PID_MAX_LIMIT
    locking/percpu-rwsem: Fix a task_struct refcount

    Linus Torvalds
     

12 Apr, 2020

1 commit


11 Apr, 2020

6 commits

  • Merge yet more updates from Andrew Morton:

    - Almost all of the rest of MM (memcg, slab-generic, slab, pagealloc,
    gup, hugetlb, pagemap, memremap)

    - Various other things (hfs, ocfs2, kmod, misc, seqfile)

    * akpm: (34 commits)
    ipc/util.c: sysvipc_find_ipc() should increase position index
    kernel/gcov/fs.c: gcov_seq_next() should increase position index
    fs/seq_file.c: seq_read(): add info message about buggy .next functions
    drivers/dma/tegra20-apb-dma.c: fix platform_get_irq.cocci warnings
    change email address for Pali Rohár
    selftests: kmod: test disabling module autoloading
    selftests: kmod: fix handling test numbers above 9
    docs: admin-guide: document the kernel.modprobe sysctl
    fs/filesystems.c: downgrade user-reachable WARN_ONCE() to pr_warn_once()
    kmod: make request_module() return an error when autoloading is disabled
    mm/memremap: set caching mode for PCI P2PDMA memory to WC
    mm/memory_hotplug: add pgprot_t to mhp_params
    powerpc/mm: thread pgprot_t through create_section_mapping()
    x86/mm: introduce __set_memory_prot()
    x86/mm: thread pgprot_t through init_memory_mapping()
    mm/memory_hotplug: rename mhp_restrictions to mhp_params
    mm/memory_hotplug: drop the flags field from struct mhp_restrictions
    mm/special: create generic fallbacks for pte_special() and pte_mkspecial()
    mm/vma: introduce VM_ACCESS_FLAGS
    mm/vma: define a default value for VM_DATA_DEFAULT_FLAGS
    ...

    Linus Torvalds
     
  • If seq_file .next function does not change position index, read after
    some lseek can generate unexpected output.

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Acked-by: Peter Oberparleiter
    Cc: Al Viro
    Cc: Davidlohr Bueso
    Cc: Ingo Molnar
    Cc: Manfred Spraul
    Cc: NeilBrown
    Cc: Steven Rostedt
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/f65c6ee7-bd00-f910-2f8a-37cc67e4ff88@virtuozzo.com
    Signed-off-by: Linus Torvalds

    Vasily Averin
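
The rule behind the commit above (a .next-style callback must always advance the position index) can be sketched with a toy iterator. Nothing here is the seq_file API: sk_items and sk_next() are hypothetical names, and a long long stands in for the position type.

```c
#include <stddef.h>

static const int sk_items[] = { 10, 20, 30 };
#define SK_N_ITEMS (sizeof(sk_items) / sizeof(sk_items[0]))

/*
 * Hypothetical iterator step in the style of a .next callback:
 * returns the next element, or NULL at the end of the sequence.
 */
static const int *sk_next(const int *cur, long long *pos)
{
    (void)cur;
    ++*pos; /* always advance, even when the end is reached */
    if (*pos < 0 || (size_t)*pos >= SK_N_ITEMS)
        return NULL;
    return &sk_items[*pos];
}
```

If the increment were skipped on the final call, a reader that seeks and resumes from the stored position could see the last element again.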
     
  • Patch series "module autoloading fixes and cleanups", v5.

    This series fixes a bug where request_module() was reporting success to
    kernel code when module autoloading had been completely disabled via
    'echo > /proc/sys/kernel/modprobe'.

    It also addresses the issues raised on the original thread
    (https://lkml.kernel.org/lkml/20200310223731.126894-1-ebiggers@kernel.org/T/#u)
    by documenting the modprobe sysctl, adding a self-test for the empty path
    case, and downgrading a user-reachable WARN_ONCE().

    This patch (of 4):

    It's long been possible to disable kernel module autoloading completely
    (while still allowing manual module insertion) by setting
    /proc/sys/kernel/modprobe to the empty string.

    This can be preferable to setting it to a nonexistent file since it
    avoids the overhead of an attempted execve(), avoids potential
    deadlocks, and avoids the call to security_kernel_module_request() and
    thus on SELinux-based systems eliminates the need to write SELinux rules
    to dontaudit module_request.

    However, when module autoloading is disabled in this way,
    request_module() returns 0. This is broken because callers expect 0 to
    mean that the module was successfully loaded.

    Apparently this was never noticed because this method of disabling
    module autoloading isn't used much, and also most callers don't use the
    return value of request_module() since it's always necessary to check
    whether the module registered its functionality or not anyway.

    But improperly returning 0 can indeed confuse a few callers, for example
    get_fs_type() in fs/filesystems.c where it causes a WARNING to be hit:

    if (!fs && (request_module("fs-%.*s", len, name) == 0)) {
    fs = __get_fs_type(name, len);
    WARN_ONCE(!fs, "request_module fs-%.*s succeeded, but still no fs?\n", len, name);
    }

    This is easily reproduced with:

    echo > /proc/sys/kernel/modprobe
    mount -t NONEXISTENT none /

    It causes:

    request_module fs-NONEXISTENT succeeded, but still no fs?
    WARNING: CPU: 1 PID: 1106 at fs/filesystems.c:275 get_fs_type+0xd6/0xf0
    [...]

    This should actually use pr_warn_once() rather than WARN_ONCE(), since
    it's also user-reachable if userspace immediately unloads the module.
    Regardless, request_module() should correctly return an error when it
    fails. So let's make it return -ENOENT, which matches the error when
    the modprobe binary doesn't exist.

    I've also sent patches to document and test this case.

    Signed-off-by: Eric Biggers
    Signed-off-by: Andrew Morton
    Reviewed-by: Kees Cook
    Reviewed-by: Jessica Yu
    Acked-by: Luis Chamberlain
    Cc: Alexei Starovoitov
    Cc: Greg Kroah-Hartman
    Cc: Jeff Vander Stoep
    Cc: Ben Hutchings
    Cc: Josh Triplett
    Cc:
    Link: http://lkml.kernel.org/r/20200310223731.126894-1-ebiggers@kernel.org
    Link: http://lkml.kernel.org/r/20200312202552.241885-1-ebiggers@kernel.org
    Signed-off-by: Linus Torvalds

    Eric Biggers
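
The return-value change described in the commit above can be reduced to a few lines. This is a stand-in, not the kernel function: request_module_sketch() is a hypothetical name and the success path is stubbed out.

```c
#include <errno.h>

/*
 * Hypothetical stand-in for the autoload entry point: returns 0 on
 * success and a negative errno value otherwise.
 */
static int request_module_sketch(const char *modprobe_path)
{
    /* An empty helper path means autoloading is disabled: report it. */
    if (modprobe_path[0] == '\0')
        return -ENOENT;

    /* The real code would go on to run the usermode helper here. */
    return 0;
}
```

A caller written like the get_fs_type() excerpt, which only proceeds when the return value is 0, then no longer reaches its warning when autoloading is disabled.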
     
  • printk_deferred(), similarly to printk_safe/printk_nmi, does not
    immediately attempt to print a new message on the consoles, avoiding
    calls into non-reentrant kernel paths, e.g. scheduler or timekeeping,
    which potentially can deadlock the system.

    Those printk() flavors, instead, rely on per-CPU flush irq_work to print
    messages from safer contexts. For the same reasons (recursive scheduler or
    timekeeping calls) printk() uses per-CPU irq_work in order to wake up
    user space syslog/kmsg readers.

    However, only printk_safe/printk_nmi do make sure that per-CPU areas
    have been initialised and that it's safe to modify per-CPU irq_work.
    This means that, for instance, should printk_deferred() be invoked "too
    early", that is before per-CPU areas are initialised, printk_deferred()
    will perform illegal per-CPU access.

    Lech Perczak [0] reports that after commit 1b710b1b10ef ("char/random:
    silence a lockdep splat with printk()") user-space syslog/kmsg readers
    are not able to read new kernel messages.

    The reason is printk_deferred() being called too early (as was pointed
    out by Petr and John).

    Fix printk_deferred() and do not queue per-CPU irq_work before per-CPU
    areas are initialized.

    Link: https://lore.kernel.org/lkml/aa0732c6-5c4e-8a8b-a1c1-75ebe3dca05b@camlintechnologies.com/
    Reported-by: Lech Perczak
    Signed-off-by: Sergey Senozhatsky
    Tested-by: Jann Horn
    Reviewed-by: Petr Mladek
    Cc: Greg Kroah-Hartman
    Cc: Theodore Ts'o
    Cc: John Ogness
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Pull proc fix from Eric Biederman:
    "A brown paper bag slipped through my proc changes, and syzkaller
    caught it when the code ended up in your tree.

    I have opted to fix it the simplest cleanest way I know how, so there
    is no reasonable chance for the bug to repeat"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
    proc: Use a dedicated lock in struct pid

    Linus Torvalds
     
  • Pull more power management updates from Rafael Wysocki:
    "Rework compat ioctl handling in the user space hibernation interface
    (Christoph Hellwig) and fix a typo in a function name in the cpuidle
    haltpoll driver (Yihao Wu)"

    * tag 'pm-5.7-rc1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpuidle-haltpoll: Fix small typo
    PM / sleep: handle the compat case in snapshot_set_swap_area()
    PM / sleep: move SNAPSHOT_SET_SWAP_AREA handling into a helper

    Linus Torvalds
     

10 Apr, 2020

3 commits

  • Daniel Borkmann says:

    ====================
    pull-request: bpf 2020-04-10

    The following pull-request contains BPF updates for your *net* tree.

    We've added 13 non-merge commits during the last 7 day(s) which contain
    a total of 13 files changed, 137 insertions(+), 43 deletions(-).

    The main changes are:

    1) JIT code emission fixes for riscv and arm32, from Luke Nelson and Xi Wang.

    2) Disable vmlinux BTF info if GCC_PLUGIN_RANDSTRUCT is used, from Slava Bacherikov.

    3) Fix oob write in AF_XDP when meta data is used, from Li RongQing.

    4) Fix bpf_get_link_xdp_id() handling on single prog when flags are specified,
    from Andrey Ignatov.

    5) Fix sk_assign() BPF helper for request sockets that can have sk_reuseport
    field uninitialized, from Joe Stringer.

    6) Fix mprotect() test case for the BPF LSM, from KP Singh.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     
  • Pull module updates from Jessica Yu:
    "Only a small cleanup this time around: a trivial conversion of
    zero-length arrays to flexible arrays"

    * tag 'modules-for-v5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/jeyu/linux:
    kernel: module: Replace zero-length array with flexible-array member

    Linus Torvalds
     
  • syzbot wrote:
    > ========================================================
    > WARNING: possible irq lock inversion dependency detected
    > 5.6.0-syzkaller #0 Not tainted
    > --------------------------------------------------------
    > swapper/1/0 just changed the state of lock:
    > ffffffff898090d8 (tasklist_lock){.+.?}-{2:2}, at: send_sigurg+0x9f/0x320 fs/fcntl.c:840
    > but this lock took another, SOFTIRQ-unsafe lock in the past:
    > (&pid->wait_pidfd){+.+.}-{2:2}
    >
    >
    > and interrupts could create inverse lock ordering between them.
    >
    >
    > other info that might help us debug this:
    > Possible interrupt unsafe locking scenario:
    >
    >        CPU0                    CPU1
    >        ----                    ----
    >   lock(&pid->wait_pidfd);
    >                                local_irq_disable();
    >                                lock(tasklist_lock);
    >                                lock(&pid->wait_pidfd);
    >   <Interrupt>
    >     lock(tasklist_lock);
    >
    > *** DEADLOCK ***
    >
    > 4 locks held by swapper/1/0:

    The problem is that because wait_pidfd.lock is taken under the tasklist
    lock, it must always be taken with irqs disabled, as tasklist_lock can be
    taken from interrupt context and if wait_pidfd.lock was already taken this
    would create a lock order inversion.

    Oleg suggested just disabling irqs where I have added extra calls to
    wait_pidfd.lock. That should be safe and I think the code will eventually
    do that. It was rightly pointed out by Christian that sharing the
    wait_pidfd.lock was a premature optimization.

    It is also true that my pre-merge window testing was insufficient. So
    remove the premature optimization and give struct pid a dedicated lock of
    its own for struct pid things. I have verified that lockdep sees all 3
    paths where we take the new pid->lock and lockdep does not complain.

    It is my current day dream that one day pid->lock can be used to guard the
    task lists as well and then the tasklist_lock won't need to be held to
    deliver signals. That will require taking pid->lock with irqs disabled.

    Acked-by: Christian Brauner
    Link: https://lore.kernel.org/lkml/00000000000011d66805a25cd73f@google.com/
    Cc: Oleg Nesterov
    Cc: Christian Brauner
    Reported-by: syzbot+343f75cdeea091340956@syzkaller.appspotmail.com
    Reported-by: syzbot+832aabf700bc3ec920b9@syzkaller.appspotmail.com
    Reported-by: syzbot+f675f964019f884dbd0f@syzkaller.appspotmail.com
    Reported-by: syzbot+a9fb1457d720a55d6dc5@syzkaller.appspotmail.com
    Fixes: 7bc3e6e55acf ("proc: Use a list of inodes to flush from proc")
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

09 Apr, 2020

2 commits

  • The commit 2e05ea5cdc1a ("dma-mapping: implement dma_map_single_attrs using
    dma_map_page_attrs") removed the "dma_debug_page" enum value, but failed to
    update the type2name string table. This causes the dma allocation type to
    be displayed incorrectly.
    Fix it by removing the "page" string from the type2name string table and
    switching to named initializers.

    Before (dma_alloc_coherent()):
    k3-ringacc 4b800000.ringacc: scather-gather idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable
    k3-ringacc 4b800000.ringacc: scather-gather idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable

    After:
    k3-ringacc 4b800000.ringacc: coherent idx 2208 P=d1140000 N=d114 D=d1140000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable
    k3-ringacc 4b800000.ringacc: coherent idx 2216 P=d1150000 N=d115 D=d1150000 L=40 DMA_BIDIRECTIONAL dma map error check not applicable

    Fixes: 2e05ea5cdc1a ("dma-mapping: implement dma_map_single_attrs using dma_map_page_attrs")
    Signed-off-by: Grygorii Strashko
    Signed-off-by: Christoph Hellwig

    Grygorii Strashko
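
The named-initializer approach mentioned in the commit above is a standard C idiom for keeping a string table aligned with its enum. The sketch below uses illustrative names (sk_dma_type, sk_type2name) and does not reproduce the kernel's table or its strings.

```c
/* Illustrative enum; SK_NR_TYPES is the number of real entries. */
enum sk_dma_type {
    SK_SINGLE,
    SK_SG,
    SK_COHERENT,
    SK_RESOURCE,
    SK_NR_TYPES
};

/*
 * Each string is bound to its enum value by a designated initializer,
 * so removing or reordering an enum value cannot shift the names the
 * way it can with a purely positional table.
 */
static const char *const sk_type2name[SK_NR_TYPES] = {
    [SK_SINGLE]   = "single",
    [SK_SG]       = "scatter-gather",
    [SK_COHERENT] = "coherent",
    [SK_RESOURCE] = "resource",
};
```

With a positional table, dropping one enum value without dropping its string leaves every later entry off by one, which matches the mislabelled output shown in the "Before" sample.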
     
  • The upper 32 bits of the physical address get truncated inadvertently
    when dma_direct_get_required_mask() invokes phys_to_dma_direct().
    This results in dma_addressing_limited() returning an incorrect value
    when used on platforms with LPAE enabled.
    Fix it here by explicitly type casting 'max_pfn' to phys_addr_t
    in order to prevent overflow of the intermediate value while evaluating
    '(max_pfn - 1) << PAGE_SHIFT'.

    Signed-off-by: Kishon Vijay Abraham I
    Signed-off-by: Christoph Hellwig

    Kishon Vijay Abraham I
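
The overflow described in the commit above is easy to reproduce with fixed-width types. In this sketch uint32_t models a 32-bit unsigned long and uint64_t models a 64-bit phys_addr_t; the function names and SK_PAGE_SHIFT (4 KiB pages) are assumptions for the example.

```c
#include <stdint.h>

#define SK_PAGE_SHIFT 12 /* assumed 4 KiB pages */

/* The shift is evaluated in 32 bits, so the upper bits are lost. */
static uint64_t last_page_addr_truncated(uint32_t max_pfn)
{
    return (max_pfn - 1) << SK_PAGE_SHIFT;
}

/* Widening before the shift keeps the full physical address. */
static uint64_t last_page_addr_widened(uint32_t max_pfn)
{
    return (uint64_t)(max_pfn - 1) << SK_PAGE_SHIFT;
}
```

For max_pfn = 0x200000 (8 GiB of memory), the first version yields 0xFFFFF000 while the second yields 0x1FFFFF000.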
     

08 Apr, 2020

4 commits

  • The 'invalid wait context' splat doesn't print all the information
    required to reconstruct / validate the error, specifically the
    irq-context state is missing.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The following commit:

    7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem")

    introduced task_struct memory leaks due to messing up the task_struct
    refcount.

    At the beginning of percpu_rwsem_wake_function(), get_task_struct() is
    called, but if the trylock fails, the waiter remains in the waitqueue.
    percpu_rwsem_wake_function() then runs again and calls get_task_struct()
    again, increasing the refcount once more, while put_task_struct() is only
    called once, when the trylock succeeds.

    Fix it by adjusting percpu_rwsem_wake_function() a bit, while still
    guarding against percpu_rwsem_wait() observing !private, terminating the
    wait and doing a quick exit() while percpu_rwsem_wake_function() then
    calls wake_up_process(p), which would be a use-after-free.

    Fixes: 7f26482a872c ("locking/percpu-rwsem: Remove the embedded rwsem")
    Suggested-by: Peter Zijlstra
    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200330213002.2374-1-cai@lca.pw

    Qian Cai
     
  • Requested and effective uclamp values can be a bit tricky to decipher when
    playing with cgroup hierarchies. Add them to a task's procfs when
    SCHED_DEBUG is enabled.

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com

    Valentin Schneider
     
  • The printing macros in debug.c keep redefining the same output
    format. Collect each output format in a single definition, and reuse that
    definition in the other macros. While at it, add a layer of parentheses and
    replace printf's with the newly introduced macros.

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com

    Valentin Schneider