20 Apr, 2019

13 commits

  • Separate print_modules() and hard lockup error message.

    Before the patch:

    NMI watchdog: Watchdog detected hard LOCKUP on cpu 1Modules linked in: nls_cp437

    Link: http://lkml.kernel.org/r/20190412062557.2700-1-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The help text for CONFIG_ARCH_HAS_KCOV is stale, and describes the
    feature as being enabled only for x86_64, when it is now enabled for
    several architectures, including arm, arm64, powerpc, and s390.

    Let's remove that stale help text, and update it along the lines of that
    for ARCH_HAS_FORTIFY_SOURCE, better describing when an architecture
    should select CONFIG_ARCH_HAS_KCOV.

    Link: http://lkml.kernel.org/r/20190412102733.5154-1-mark.rutland@arm.com
    Signed-off-by: Mark Rutland
    Acked-by: Dmitry Vyukov
    Cc: Kees Cook
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     
  • During !CONFIG_CGROUP reclaim, we expand the inactive list size if it's
    thrashing on the node that is about to be reclaimed. But when cgroups
    are enabled, we suddenly ignore the node scope and use the cgroup scope
    only. The result is that pressure bleeds between NUMA nodes depending
    on whether cgroups are merely compiled into Linux. This behavioral
    difference is unexpected and undesirable.

    When the refault adaptivity of the inactive list was first introduced,
    there were no statistics at the lruvec level - the intersection of node
    and memcg - so it was better than nothing.

    But now that we have that infrastructure, use lruvec_page_state() to
    make the list balancing decision always NUMA aware.

    [hannes@cmpxchg.org: fix bisection hole]
    Link: http://lkml.kernel.org/r/20190417155241.GB23013@cmpxchg.org
    Link: http://lkml.kernel.org/r/20190412144438.2645-1-hannes@cmpxchg.org
    Fixes: 2a2e48854d70 ("mm: vmscan: fix IO/refault regression in cache workingset transition")
    Signed-off-by: Johannes Weiner
    Reviewed-by: Shakeel Butt
    Cc: Roman Gushchin
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • has_unmovable_pages() is used by allocating CMA and gigantic pages as
    well as the memory hotplug. The latter doesn't know how to offline a CMA
    pool properly now, but if an unused (free) CMA page is encountered, then
    has_unmovable_pages() happily considers it free memory and
    propagates this up the call chain. Memory offlining code then frees the
    page without a proper CMA teardown, which leads to accounting issues.
    Moreover if the same memory range is onlined again then the memory never
    gets back to the CMA pool.

    State after memory offline:

    # grep cma /proc/vmstat
    nr_free_cma 205824

    # cat /sys/kernel/debug/cma/cma-kvm_cma/count
    209920

    Also, kmemleak still thinks those memory addresses are reserved (see
    below), although they have already been used by the buddy allocator
    after onlining. This patch fixes the situation by treating CMA
    pageblocks as unmovable except when has_unmovable_pages() is called as
    part of CMA allocation.

    Offlined Pages 4096
    kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
    Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    create_object+0x344/0x380
    __kmalloc_node+0x3ec/0x860
    kvmalloc_node+0x58/0x110
    seq_read+0x41c/0x620
    __vfs_read+0x3c/0x70
    vfs_read+0xbc/0x1a0
    ksys_read+0x7c/0x140
    system_call+0x5c/0x70
    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xc000201cc8000000 (size 13757317120):
    kmemleak: comm "swapper/0", pid 0, jiffies 4294937297
    kmemleak: min_count = -1
    kmemleak: count = 0
    kmemleak: flags = 0x5
    kmemleak: checksum = 0
    kmemleak: backtrace:
    cma_declare_contiguous+0x2a4/0x3b0
    kvm_cma_reserve+0x11c/0x134
    setup_arch+0x300/0x3f8
    start_kernel+0x9c/0x6e8
    start_here_common+0x1c/0x4b0
    kmemleak: Automatic memory scanning thread ended

    [cai@lca.pw: use is_migrate_cma_page() and update commit log]
    Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • Silly sizeof(pointer) vs sizeof(uint8_t[]) bug.
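
    For illustration only, a minimal sketch of this class of bug (the
    array and function names are hypothetical, not taken from the test):
    sizeof on an array gives the size of the whole array, while sizeof on
    a pointer to it gives only the size of the pointer.

    ```c
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical byte pattern, standing in for the array in the test. */
    static const uint8_t pattern[16] = {0};

    /* sizeof applied to the array itself: size of all 16 elements. */
    size_t bytes_of_array(void)
    {
        return sizeof(pattern);
    }

    /* sizeof applied to a pointer to the same data: size of the pointer. */
    size_t bytes_of_pointer(void)
    {
        const uint8_t *p = pattern;

        return sizeof(p);
    }
    ```

    Comparing or copying sizeof(p) bytes would silently cover only the
    first few bytes of the pattern.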

    Link: http://lkml.kernel.org/r/20190414123009.GA12971@avx2
    Fixes: e483b0208784 ("proc: test /proc/*/maps, smaps, smaps_rollup, statm")
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • F29 bans mapping the first 64KB even for root, making the test fail.
    Iterate from address 0 until mmap() works.

    Gentoo (root):

    openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
    mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0

    Gentoo (non-root):

    openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
    mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EPERM (Operation not permitted)
    mmap(0x1000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x1000

    F29 (root):

    openat(AT_FDCWD, "/dev/zero", O_RDONLY) = 3
    mmap(NULL, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x1000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x2000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x3000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x4000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x5000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x6000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x7000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x8000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x9000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0xa000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0xb000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0xc000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0xd000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0xe000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0xf000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = -1 EACCES (Permission denied)
    mmap(0x10000, 4096, PROT_NONE, MAP_PRIVATE|MAP_FIXED, 3, 0) = 0x10000

    Now all proc tests succeed on F29 if run as root, at last!
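
    The iteration can be sketched roughly as below. This is not the test's
    actual code: map_lowest_page() and its limit parameter are assumptions
    made for the example. Note that MAP_FIXED replaces whatever is already
    mapped at the address, so the limit should stay below anything the
    process has mapped.

    ```c
    #include <fcntl.h>
    #include <stdint.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /*
     * Map one PROT_NONE page of /dev/zero at the lowest address the system
     * allows, trying 0, then one page higher, and so on below `limit`.
     * Returns the mapped address (which may be address 0), or MAP_FAILED.
     */
    void *map_lowest_page(uintptr_t limit)
    {
        long page = sysconf(_SC_PAGESIZE);
        int fd = open("/dev/zero", O_RDONLY);
        void *found = MAP_FAILED;

        if (page <= 0 || fd < 0) {
            if (fd >= 0)
                close(fd);
            return MAP_FAILED;
        }

        for (uintptr_t va = 0; va < limit; va += (uintptr_t)page) {
            void *p = mmap((void *)va, (size_t)page, PROT_NONE,
                           MAP_PRIVATE | MAP_FIXED, fd, 0);
            if (p != MAP_FAILED) {
                found = p;      /* first address that is not rejected */
                break;
            }
        }

        close(fd);              /* the mapping, if any, outlives the fd */
        return found;
    }
    ```

    With a limit of 0x100000, the F29 trace above would stop at 0x10000.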

    Link: http://lkml.kernel.org/r/20190414123612.GB12971@avx2
    Signed-off-by: Alexey Dobriyan
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Commit 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
    depends on skipping vmstat entries with empty name introduced in
    7aaf77272358 ("mm: don't show nr_indirectly_reclaimable in
    /proc/vmstat") but reverted in b29940c1abd7 ("mm: rename and change
    semantics of nr_indirectly_reclaimable_bytes").

    So skipping no longer works and /proc/vmstat has misformatted lines " 0".

    This patch simply shows debug counters "nr_tlb_remote_*" for UP.

    Link: http://lkml.kernel.org/r/155481488468.467.4295519102880913454.stgit@buzz
    Fixes: 58bc4c34d249 ("mm/vmstat.c: skip NR_TLB_REMOTE_FLUSH* properly")
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Vlastimil Babka
    Cc: Roman Gushchin
    Cc: Jann Horn
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • When adding memory by probing a memory block in the sysfs interface,
    there is an obvious issue where we will unlock the device_hotplug_lock
    when we failed to take it.

    That issue was introduced in 8df1d0e4a265 ("mm/memory_hotplug: make
    add_memory() take the device_hotplug_lock").

    We should drop out in time when failing to take the device_hotplug_lock.
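
    A simplified userspace model of the pattern, with a hypothetical lock
    type and function names (the real code uses the device hotplug lock
    helpers): when taking the lock fails, return right away instead of
    jumping to the path that unlocks.

    ```c
    #include <errno.h>
    #include <stdbool.h>

    /* Hypothetical stand-in for a lock whose acquisition can fail. */
    struct demo_lock {
        int depth;          /* outstanding acquisitions; must end at 0 */
        bool contended;     /* when true, demo_try_lock() fails */
    };

    int demo_try_lock(struct demo_lock *l)
    {
        if (l->contended)
            return -EAGAIN; /* the lock was NOT taken */
        l->depth++;
        return 0;
    }

    void demo_unlock(struct demo_lock *l)
    {
        l->depth--;
    }

    /* Probe a memory block; returns 0 on success or a negative errno. */
    int demo_probe(struct demo_lock *l)
    {
        int ret = demo_try_lock(l);

        if (ret)
            return ret;     /* drop out here: unlocking would unbalance it */

        /* ... add the memory block while holding the lock ... */

        demo_unlock(l);
        return 0;
    }
    ```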

    Link: http://lkml.kernel.org/r/1554696437-9593-1-git-send-email-zhongjiang@huawei.com
    Fixes: 8df1d0e4a265 ("mm/memory_hotplug: make add_memory() take the device_hotplug_lock")
    Signed-off-by: zhong jiang
    Reported-by: Yang yingliang
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     
  • The igrab() in shmem_unuse() looks good, but we forgot that it gives no
    protection against concurrent unmounting: a point made by Konstantin
    Khlebnikov eight years ago, and then fixed in 2.6.39 by 778dd893ae78
    ("tmpfs: fix race between umount and swapoff"). The current 5.1-rc
    swapoff is liable to hit "VFS: Busy inodes after unmount of tmpfs.
    Self-destruct in 5 seconds. Have a nice day..." followed by GPF.

    Once again, give up on using igrab(); but don't go back to making such
    heavy-handed use of shmem_swaplist_mutex as last time: that would spoil
    the new design, and I expect could deadlock inside shmem_swapin_page().

    Instead, shmem_unuse() just raises a "stop_eviction" count in the
    shmem-specific inode, and shmem_evict_inode() waits for that to go down
    to 0.
    Call it "stop_eviction" rather than "swapoff_busy" because it can be put
    to use for others later (huge tmpfs patches expect to use it).

    That simplifies shmem_unuse(), protecting it from both unlink and
    unmount; and in practice lets it locate all the swap in its first try.
    But do not rely on that: there's still a theoretical case, when
    shmem_writepage() might have been preempted after its get_swap_page(),
    before making the swap entry visible to swapoff.

    [hughd@google.com: remove incorrect list_del()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904091133570.1898@eggly.anvils
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081259400.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The old try_to_unuse() implementation was driven by find_next_to_unuse(),
    which terminated as soon as all the swap had been freed.

    Add inuse_pages checks now (alongside signal_pending()) to stop scanning
    mms and swap_map once finished.

    The same ought to be done in shmem_unuse() too, but never was before,
    and needs a different interface: so leave it as is for now.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081258200.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • SWAP_UNUSE_MAX_TRIES 3 appeared to work well in earlier testing, but
    further testing has proved it to be a source of unnecessary swapoff
    EBUSY failures (which can then be followed by unmount EBUSY failures).

    When mmget_not_zero() or shmem's igrab() fails, there is an mm exiting
    or inode being evicted, freeing up swap independent of try_to_unuse().
    Those typically completed much sooner than the old quadratic swapoff,
    but now it's more common that swapoff may need to wait for them.

    It's possible to move those cases from init_mm.mmlist and shmem_swaplist
    to separate "exiting" swaplists, and try_to_unuse() then wait for those
    lists to be emptied; but we've not bothered with that in the past, and
    don't want to risk missing some other forgotten case. So just revert to
    cycling around until the swap is gone, without any retries limit.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081256170.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: "Alex Xu (Hello71)"
    Cc: Huang Ying
    Cc: Kelley Nielsen
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Vineeth Pillai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Swapfile "type" was passed all the way down to shmem_unuse_inode(), but
    then forgotten from shmem_find_swap_entries(): with the result that
    removing one swapfile would try to free up all the swap from shmem - no
    problem when only one swapfile anyway, but counter-productive when more,
    causing swapoff to be unnecessarily OOM-killed when it should succeed.
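
    As a rough sketch of the missing check (struct demo_swap_ent and
    demo_find_swap_entries() are simplified, hypothetical stand-ins, not
    the kernel's swap entry encoding): entries that live on a different
    swapfile than the one being removed are skipped.

    ```c
    #include <stddef.h>

    /* Hypothetical, simplified swap entry: which swapfile, and where in it. */
    struct demo_swap_ent {
        unsigned int type;
        unsigned long offset;
    };

    /*
     * Copy into `out` only the entries that live on swapfile `type`.
     * Returns the number of entries collected (at most `max`).
     */
    size_t demo_find_swap_entries(const struct demo_swap_ent *all, size_t n,
                                  unsigned int type,
                                  struct demo_swap_ent *out, size_t max)
    {
        size_t found = 0;

        for (size_t i = 0; i < n && found < max; i++) {
            if (all[i].type != type)
                continue;   /* belongs to another swapfile: leave it alone */
            out[found++] = all[i];
        }
        return found;
    }
    ```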

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1904081254470.1523@eggly.anvils
    Fixes: b56a2d8af914 ("mm: rid swapoff of quadratic complexity")
    Signed-off-by: Hugh Dickins
    Cc: Konstantin Khlebnikov
    Cc: "Alex Xu (Hello71)"
    Cc: Vineeth Pillai
    Cc: Kelley Nielsen
    Cc: Rik van Riel
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Commit 51dedad06b5f ("kasan, slab: make freelist stored without tags")
    calls kasan_reset_tag() for off-slab slab management object leading to
    freelist being stored non-tagged.

    However, cache_grow_begin() calls alloc_slabmgmt(), which calls
    kmem_cache_alloc_node(), which assigns a tag for the address and stores
    it in the shadow address. As a result, it causes endless errors below
    during boot due to drain_freelist() -> slab_destroy() ->
    kasan_slab_free() which compares already untagged freelist against the
    stored tag in the shadow address.

    Since off-slab slab management object freelist is such a special case,
    just store it tagged. Non-off-slab management object freelist is still
    stored untagged which has not been assigned a tag and should not cause
    any other troubles with this inconsistency.

    BUG: KASAN: double-free or invalid-free in slab_destroy+0x84/0x88
    Pointer tag: [ff], memory tag: [99]

    CPU: 0 PID: 1376 Comm: kworker/0:4 Tainted: G W 5.1.0-rc3+ #8
    Hardware name: HPE Apollo 70 /C01_APACHE_MB , BIOS L50_5.13_1.0.6 07/10/2018
    Workqueue: cgroup_destroy css_killed_work_fn
    Call trace:
    print_address_description+0x74/0x2a4
    kasan_report_invalid_free+0x80/0xc0
    __kasan_slab_free+0x204/0x208
    kasan_slab_free+0xc/0x18
    kmem_cache_free+0xe4/0x254
    slab_destroy+0x84/0x88
    drain_freelist+0xd0/0x104
    __kmem_cache_shrink+0x1ac/0x224
    __kmemcg_cache_deactivate+0x1c/0x28
    memcg_deactivate_kmem_caches+0xa0/0xe8
    memcg_offline_kmem+0x8c/0x3d4
    mem_cgroup_css_offline+0x24c/0x290
    css_killed_work_fn+0x154/0x618
    process_one_work+0x9cc/0x183c
    worker_thread+0x9b0/0xe38
    kthread+0x374/0x390
    ret_from_fork+0x10/0x18

    Allocated by task 1625:
    __kasan_kmalloc+0x168/0x240
    kasan_slab_alloc+0x18/0x20
    kmem_cache_alloc_node+0x1f8/0x3a0
    cache_grow_begin+0x4fc/0xa24
    cache_alloc_refill+0x2f8/0x3e8
    kmem_cache_alloc+0x1bc/0x3bc
    sock_alloc_inode+0x58/0x334
    alloc_inode+0xb8/0x164
    new_inode_pseudo+0x20/0xec
    sock_alloc+0x74/0x284
    __sock_create+0xb0/0x58c
    sock_create+0x98/0xb8
    __sys_socket+0x60/0x138
    __arm64_sys_socket+0xa4/0x110
    el0_svc_handler+0x2c0/0x47c
    el0_svc+0x8/0xc

    Freed by task 1625:
    __kasan_slab_free+0x114/0x208
    kasan_slab_free+0xc/0x18
    kfree+0x1a8/0x1e0
    single_release+0x7c/0x9c
    close_pdeo+0x13c/0x43c
    proc_reg_release+0xec/0x108
    __fput+0x2f8/0x784
    ____fput+0x1c/0x28
    task_work_run+0xc0/0x1b0
    do_notify_resume+0xb44/0x1278
    work_pending+0x8/0x10

    The buggy address belongs to the object at ffff809681b89e00
    which belongs to the cache kmalloc-128 of size 128
    The buggy address is located 0 bytes inside of
    128-byte region [ffff809681b89e00, ffff809681b89e80)
    The buggy address belongs to the page:
    page:ffff7fe025a06e00 count:1 mapcount:0 mapping:01ff80082000fb00
    index:0xffff809681b8fe04
    flags: 0x17ffffffc000200(slab)
    raw: 017ffffffc000200 ffff7fe025a06d08 ffff7fe022ef7b88 01ff80082000fb00
    raw: ffff809681b8fe04 ffff809681b80000 00000001000000e0 0000000000000000
    page dumped because: kasan: bad access detected
    page allocated via order 0, migratetype Unmovable, gfp_mask
    0x2420c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_COMP|__GFP_THISNODE)
    prep_new_page+0x4e0/0x5e0
    get_page_from_freelist+0x4ce8/0x50d4
    __alloc_pages_nodemask+0x738/0x38b8
    cache_grow_begin+0xd8/0xa24
    ____cache_alloc_node+0x14c/0x268
    __kmalloc+0x1c8/0x3fc
    ftrace_free_mem+0x408/0x1284
    ftrace_free_init_mem+0x20/0x28
    kernel_init+0x24/0x548
    ret_from_fork+0x10/0x18

    Memory state around the buggy address:
    ffff809681b89c00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    ffff809681b89d00: fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
    >ffff809681b89e00: 99 99 99 99 99 99 99 99 fe fe fe fe fe fe fe fe
    ^
    ffff809681b89f00: 43 43 43 43 43 fe fe fe fe fe fe fe fe fe fe fe
    ffff809681b8a000: 6d fe fe fe fe fe fe fe fe fe fe fe fe fe fe fe
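
    The mismatch in the report ("Pointer tag: [ff], memory tag: [99]") can
    be modelled in a few lines. This is a simplified userspace sketch that
    assumes the tag lives in the top byte of a 64-bit address and that a
    reset pointer carries tag 0xff; the demo_* helpers are hypothetical
    and are not the kernel's KASAN functions.

    ```c
    #include <stdbool.h>
    #include <stdint.h>

    #define DEMO_TAG_SHIFT   56
    #define DEMO_TAG_NATIVE  0xffu  /* tag carried by a reset pointer */

    /* Replace the top byte of `addr` with `tag`. */
    uint64_t demo_set_tag(uint64_t addr, uint8_t tag)
    {
        uint64_t mask = (uint64_t)0xff << DEMO_TAG_SHIFT;

        return (addr & ~mask) | ((uint64_t)tag << DEMO_TAG_SHIFT);
    }

    /* Extract the tag stored in the top byte. */
    uint8_t demo_get_tag(uint64_t addr)
    {
        return (uint8_t)(addr >> DEMO_TAG_SHIFT);
    }

    /* Drop the assigned tag, as done for a freelist stored without tags. */
    uint64_t demo_reset_tag(uint64_t addr)
    {
        return demo_set_tag(addr, DEMO_TAG_NATIVE);
    }

    /* Model of the free-time check: pointer tag must equal the memory tag. */
    bool demo_free_tag_ok(uint64_t ptr, uint8_t memory_tag)
    {
        return demo_get_tag(ptr) == memory_tag;
    }
    ```

    A pointer tagged 0x99 passes the check against memory tag 0x99; the
    same pointer after demo_reset_tag() carries 0xff and fails it.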

    Link: http://lkml.kernel.org/r/20190403022858.97584-1-cai@lca.pw
    Fixes: 51dedad06b5f ("kasan, slab: make freelist stored without tags")
    Signed-off-by: Qian Cai
    Reviewed-by: Andrey Konovalov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

19 Apr, 2019

2 commits

  • Pull arm64 fix from Catalin Marinas:
    "Avoid compiler uninitialised warning introduced by recent arm64 futex
    fix"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
    arm64: futex: Restore oldval initialization to work around buggy compilers

    Linus Torvalds
     
  • Commit 045afc24124d ("arm64: futex: Fix FUTEX_WAKE_OP atomic ops with
    non-zero result value") removed oldval's zero initialization in
    arch_futex_atomic_op_inuser because it is not necessary. Unfortunately,
    Android's arm64 GCC 4.9.4 [1] does not agree:

    ../kernel/futex.c: In function 'do_futex':
    ../kernel/futex.c:1658:17: warning: 'oldval' may be used uninitialized
    in this function [-Wmaybe-uninitialized]
    return oldval == cmparg;
    ^
    In file included from ../kernel/futex.c:73:0:
    ../arch/arm64/include/asm/futex.h:53:6: note: 'oldval' was declared here
    int oldval, ret, tmp;
    ^

    GCC fails to follow that when ret is non-zero, futex_atomic_op_inuser
    returns right away, avoiding the uninitialized use that it claims.
    Restoring the zero initialization works around this issue.
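
    The shape of the code in question, as a standalone sketch (demo_op()
    and demo_op_and_compare() are hypothetical names; the real code is the
    arm64 futex helper): oldval is only read when the operation succeeded,
    so the initializer is redundant for correctness and exists purely to
    keep such compilers quiet.

    ```c
    #include <errno.h>

    /* Hypothetical helper: writes *oldval only when it succeeds. */
    int demo_op(int fail, int current, int *oldval)
    {
        if (fail)
            return -EFAULT; /* *oldval is left untouched */
        *oldval = current;
        return 0;
    }

    /* Returns a negative errno on failure, else whether oldval == cmparg. */
    int demo_op_and_compare(int fail, int current, int cmparg)
    {
        int oldval = 0;     /* not needed for correctness, silences the warning */
        int ret = demo_op(fail, current, &oldval);

        if (ret)
            return ret;     /* oldval is never read on this path */
        return oldval == cmparg;
    }
    ```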

    [1]: https://android.googlesource.com/platform/prebuilts/gcc/linux-x86/aarch64/aarch64-linux-android-4.9/

    Cc: stable@vger.kernel.org
    Fixes: 045afc24124d ("arm64: futex: Fix FUTEX_WAKE_OP atomic ops with non-zero result value")
    Reviewed-by: Greg Kroah-Hartman
    Signed-off-by: Nathan Chancellor
    Signed-off-by: Catalin Marinas

    Nathan Chancellor
     

18 Apr, 2019

10 commits

  • As stated in the original commit for pidfd_send_signal(), we don't allow
    signaling processes through O_PATH file descriptors since it is
    semantically equivalent to a write on the pidfd.

    We already correctly error out right now and return EBADF if an O_PATH
    fd is passed. This is because we use file->f_op to detect whether a
    pidfd is passed and O_PATH fds have their file->f_op set to empty_fops
    in do_dentry_open() and thus fail the test.

    Thus, there is no regression. It's just semantically correct to use
    fdget() and return an error right from there instead of taking a
    reference and returning an error later.
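
    A toy model of the existing f_op test described above, for
    illustration only. The demo_* types and functions are hypothetical
    userspace stand-ins and do not reproduce the kernel's struct file or
    fdget() API.

    ```c
    #include <errno.h>
    #include <stdbool.h>

    struct demo_file_ops { int unused; };
    struct demo_file { const struct demo_file_ops *f_op; };

    /* Two distinct operation tables, so their addresses differ. */
    static const struct demo_file_ops demo_pidfd_fops = { 0 };
    static const struct demo_file_ops demo_empty_fops = { 0 };

    /* Model of open: an O_PATH-style open always gets the empty table. */
    struct demo_file demo_open_pidfd(bool o_path)
    {
        struct demo_file f = { o_path ? &demo_empty_fops : &demo_pidfd_fops };

        return f;
    }

    /* Model of the check: only a file using the pidfd table is accepted. */
    int demo_signal_check(const struct demo_file *f)
    {
        return f->f_op == &demo_pidfd_fops ? 0 : -EBADF;
    }
    ```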

    Signed-off-by: Christian Brauner
    Acked-by: Oleg Nesterov
    Cc: Arnd Bergmann
    Cc: "Eric W. Biederman"
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Jann Horn
    Cc: David Howells
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Andy Lutomirski
    Cc: Andrew Morton
    Cc: Oleg Nesterov
    Cc: Aleksa Sarai
    Cc: Al Viro
    Signed-off-by: Linus Torvalds

    Christian Brauner
     
  • Pull s390 bug fixes from Martin Schwidefsky:

    - Fix overwrite of the initial ramdisk due to misuse of IS_ENABLED

    - Fix integer overflow in the dasd driver resulting in incorrect number
    of blocks for large devices

    - Fix a lockdep false positive in the 3270 driver

    - Fix a deadlock in the zcrypt driver

    - Fix incorrect debug feature entries in the pkey api

    - Fix inline assembly constraints fallout with CONFIG_KASAN=y
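
    As a generic illustration of the integer overflow class mentioned for
    the dasd driver (the geometry parameters and function names here are
    hypothetical, not the driver's code): a block count computed in 32
    bits wraps for large devices, while widening before the multiplication
    keeps the full value.

    ```c
    #include <stdint.h>

    /* Product computed entirely in 32 bits: wraps modulo 2^32. */
    uint32_t demo_blocks_32(uint32_t cylinders, uint32_t heads, uint32_t sectors)
    {
        return cylinders * heads * sectors;
    }

    /* Widen the first operand so the whole product is computed in 64 bits. */
    uint64_t demo_blocks_64(uint32_t cylinders, uint32_t heads, uint32_t sectors)
    {
        return (uint64_t)cylinders * heads * sectors;
    }
    ```

    For 30000000 cylinders, 15 heads and 12 sectors the true product is
    5400000000, which the 32-bit version truncates to 1105032704.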

    * tag 's390-5.1-3' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux:
    s390: correct some inline assembly constraints
    s390/pkey: add one more argument space for debug feature entry
    s390/zcrypt: fix possible deadlock situation on ap queue remove
    s390/3270: fix lockdep false positive on view->lock
    s390/dasd: Fix capacity calculation for large volumes
    s390/mem_detect: Use IS_ENABLED(CONFIG_BLK_DEV_INITRD)

    Linus Torvalds
     
  • Pull AFS fixes from David Howells:

    - Stop using the deprecated get_seconds().

    - Don't make tracepoint strings const as the section they go in isn't
    read-only.

    - Differentiate failure due to unmarshalling from other failure cases.
    We shouldn't abort with RXGEN_CC/SS_UNMARSHAL if it's not due to
    unmarshalling.

    - Add a missing unlock_page().

    - Fix the interaction between receiving a notification from a server
    that it has invalidated all outstanding callback promises and a
    client call that we're in the middle of making that will get a new
    promise.

    * tag 'afs-fixes-20190413' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs:
    afs: Fix in-progess ops to ignore server-level callback invalidation
    afs: Unlock pages for __pagevec_release()
    afs: Differentiate abort due to unmarshalling from other errors
    afs: Avoid section confusion in CM_NAME
    afs: avoid deprecated get_seconds()

    Linus Torvalds
     
  • Pull crypto fix from Herbert Xu:
    "Fix a bug in the implementation of the x86 accelerated version of
    poly1305"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: x86/poly1305 - fix overflow during partial reduction

    Linus Torvalds
     
  • Pull drm fixes from Dave Airlie:
    "Since Easter is looming for me, I'm just pushing whatever is in my
    tree, I'll see what else turns up and maybe I'll send another pull
    early next week if there is anything.

    tegra:
    - stream id programming fix
    - avoid divide by 0 for bad hdmi audio setup code

    ttm:
    - Hugepages fix
    - refcount imbalance in error path fix

    amdgpu:
    - GPU VM fixes for Vega/RV
    - DC AUX fix for active DP-DVI dongles
    - DC fix for multihead regression"

    * tag 'drm-fixes-2019-04-18' of git://anongit.freedesktop.org/drm/drm:
    drm/tegra: hdmi: Setup audio only if configured
    drm/amd/display: If one stream full updates, full update all planes
    drm/amdgpu/gmc9: fix VM_L2_CNTL3 programming
    drm/amdgpu: shadow in shadow_list without tbo.mem.start cause page fault in sriov TDR
    gpu: host1x: Program stream ID to bypass without SMMU
    drm/amd/display: extending AUX SW Timeout
    drm/ttm: fix dma_fence refcount imbalance on error path
    drm/ttm: fix incrementing the page pointer for huge pages
    drm/ttm: fix start page for huge page check in ttm_put_pages()
    drm/ttm: fix out-of-bounds read in ttm_put_pages() v2

    Linus Torvalds
     
  • - GPUVM fixes for vega/RV and shadow buffers
    - TTM fixes for hugepages
    - TTM fix for refcount imbalance in error path
    - DC AUX fix for some active DP-DVI dongles
    - DC fix for multihead VT switch regression

    Signed-off-by: Dave Airlie
    From: Alex Deucher
    Link: https://patchwork.freedesktop.org/patch/msgid/20190415051703.3377-1-alexander.deucher@amd.com

    Dave Airlie
     
  • drm/tegra: Fixes for v5.1-rc6

    This contains a follow-up fix for the stream ID programming and a fix
    for a regression on older Tegra devices (Tegra20 and Tegra30) that are
    running into a division by zero trying to enable audio over HDMI.

    Signed-off-by: Dave Airlie

    From: Thierry Reding
    Link: https://patchwork.freedesktop.org/patch/msgid/20190417073525.21680-1-thierry.reding@gmail.com

    Dave Airlie
     
  • Pull smb3 fixes from Steve French:
    "Five small SMB3 fixes, all also for stable - an important fix for an
    oplock (lease) bug, a handle leak, and three bugs spotted by KASAN"

    * tag '5.1-rc5-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
    CIFS: keep FileInfo handle live during oplock break
    cifs: fix handle leak in smb2_query_symlink()
    cifs: Fix lease buffer length error
    cifs: Fix use-after-free in SMB2_read
    cifs: Fix use-after-free in SMB2_write

    Linus Torvalds
     
  • Pull IPMI fixes from Corey Minyard:
    "Fixes for some bugs caused by recent changes. One crash if you feed bad
    data to the module parameters, one BUG that sometimes occurs when a
    user closes the connection, and one bug that causes the driver to not
    work if the configuration information only comes in from SMBIOS"

    * tag 'for-linus-5.1-2' of git://github.com/cminyard/linux-ipmi:
    ipmi: fix sleep-in-atomic in free_user at cleanup SRCU user->release_barrier
    ipmi: ipmi_si_hardcode.c: init si_type array to fix a crash
    ipmi: Fix failure on SMBIOS specified devices

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Handle init flow failures properly in iwlwifi driver, from Shahar S
    Matityahu.

    2) mac80211 TXQs need to be unscheduled on powersave start, from Felix
    Fietkau.

    3) SKB memory accounting fix in A-MSDU aggregation, from Felix Fietkau.

    4) Increase RCU lock hold time in mlx5 FPGA code, from Saeed Mahameed.

    5) Avoid checksum complete with XDP in mlx5, also from Saeed.

    6) Fix netdev feature clobbering in ibmvnic driver, from Thomas Falcon.

    7) Partial sent TLS record leak fix from Jakub Kicinski.

    8) Reject zero size iova range in vhost, from Jason Wang.

    9) Allow pending work to complete before clcsock release from Karsten
    Graul.

    10) Fix XDP handling max MTU in thunderx, from Matteo Croce.

    11) A lot of protocols look at the sa_family field of a sockaddr before
    validating its length is large enough, from Tetsuo Handa.

    12) Don't write to free'd pointer in qede ptp error path, from Colin Ian
    King.

    13) Have to recompile IP options in ipv4_link_failure because it can be
    invoked from ARP, from Stephen Suryaputra.

    14) Doorbell handling fixes in qed from Denis Bolotin.

    15) Revert net-sysfs kobject register leak fix, it causes new problems.
    From Wang Hai.

    16) Spectre v1 fix in ATM code, from Gustavo A. R. Silva.

    17) Fix put of BROPT_VLAN_STATS_PER_PORT in bridging code, from Nikolay
    Aleksandrov.
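
    Item 11 describes a common pattern. A minimal userspace sketch of it,
    assuming a hypothetical helper demo_check_family() rather than any
    particular protocol's code: the supplied length is validated before
    sa_family is read.

    ```c
    #include <errno.h>
    #include <stddef.h>
    #include <sys/socket.h>

    /*
     * Validate a caller-supplied address before looking inside it.
     * Returns 0 if sa_family may be read and matches `want`,
     * otherwise a negative errno.
     */
    int demo_check_family(const struct sockaddr *addr, size_t addr_len,
                          sa_family_t want)
    {
        /* Length first: sa_family must lie entirely within the buffer. */
        if (addr_len < offsetof(struct sockaddr, sa_family) +
                       sizeof(addr->sa_family))
            return -EINVAL;

        if (addr->sa_family != want)
            return -EAFNOSUPPORT;

        return 0;
    }
    ```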

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (111 commits)
    socket: fix compat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW
    tcp: tcp_grow_window() needs to respect tcp_space()
    ocelot: Clean up stats update deferred work
    ocelot: Don't sleep in atomic context (irqs_disabled())
    net: bridge: fix netlink export of vlan_stats_per_port option
    qed: fix spelling mistake "faspath" -> "fastpath"
    tipc: set sysctl_tipc_rmem and named_timeout right range
    tipc: fix link established but not in session
    net: Fix missing meta data in skb with vlan packet
    net: atm: Fix potential Spectre v1 vulnerabilities
    net/core: work around section mismatch warning for ptp_classifier
    net: bridge: fix per-port af_packet sockets
    bnx2x: fix spelling mistake "dicline" -> "decline"
    route: Avoid crash from dereferencing NULL rt->from
    MAINTAINERS: normalize Woojung Huh's email address
    bonding: fix event handling for stacked bonds
    Revert "net-sysfs: Fix memory leak in netdev_register_kobject"
    rtnetlink: fix rtnl_valid_stats_req() nlmsg_len check
    qed: Fix the DORQ's attentions handling
    qed: Fix missing DORQ attentions
    ...

    Linus Torvalds
     

17 Apr, 2019

15 commits

  • free_user() could be called in atomic context.

    This patch pushes the free operation off into a workqueue.

    Example:

    BUG: sleeping function called from invalid context at kernel/workqueue.c:2856
    in_atomic(): 1, irqs_disabled(): 0, pid: 177, name: ksoftirqd/27
    CPU: 27 PID: 177 Comm: ksoftirqd/27 Not tainted 4.19.25-3 #1
    Hardware name: AIC 1S-HV26-08/MB-DPSB04-06, BIOS IVYBV060 10/21/2015
    Call Trace:
    dump_stack+0x5c/0x7b
    ___might_sleep+0xec/0x110
    __flush_work+0x48/0x1f0
    ? try_to_del_timer_sync+0x4d/0x80
    _cleanup_srcu_struct+0x104/0x140
    free_user+0x18/0x30 [ipmi_msghandler]
    ipmi_free_recv_msg+0x3a/0x50 [ipmi_msghandler]
    deliver_response+0xbd/0xd0 [ipmi_msghandler]
    deliver_local_response+0xe/0x30 [ipmi_msghandler]
    handle_one_recv_msg+0x163/0xc80 [ipmi_msghandler]
    ? dequeue_entity+0xa0/0x960
    handle_new_recv_msgs+0x15c/0x1f0 [ipmi_msghandler]
    tasklet_action_common.isra.22+0x103/0x120
    __do_softirq+0xf8/0x2d7
    run_ksoftirqd+0x26/0x50
    smpboot_thread_fn+0x11d/0x1e0
    kthread+0x103/0x140
    ? sort_range+0x20/0x20
    ? kthread_destroy_worker+0x40/0x40
    ret_from_fork+0x1f/0x40

    Fixes: 77f8269606bf ("ipmi: fix use-after-free of user->release_barrier.rda")

    Reported-by: Konstantin Khlebnikov
    Signed-off-by: Corey Minyard
    Cc: stable@vger.kernel.org # 5.0
    Cc: Yang Yingliang

    Corey Minyard
     
  • Inline assembly code changed in this patch should really use "Q"
    constraint "Memory reference without index register and with short
    displacement". The kernel build with kasan instrumentation enabled
    might occasionally break otherwise (due to stack instrumentation).

    Signed-off-by: Vasily Gorbik
    Signed-off-by: Martin Schwidefsky

    Vasily Gorbik
     
  • The audio configuration is only valid if the HDMI codec has been
    properly set up. Do not attempt to set up audio before that happens
    because it causes a division by zero.

    Note that this is only problematic on Tegra20 and Tegra30. Later chips
    implement the division instructions which return zero when dividing by
    zero and don't throw an exception.
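
    The guard amounts to the following pattern. This is a hypothetical
    sketch: demo_audio_divider() and its parameters are made up for the
    example and do not reflect the driver's actual formula.

    ```c
    #include <errno.h>

    /*
     * Compute a divider only once a sample rate has been configured.
     * Returns 0 and stores the result in *out, or -EINVAL if audio is
     * not set up yet (in which case *out is left untouched).
     */
    int demo_audio_divider(unsigned int pixel_clock, unsigned int sample_rate,
                           unsigned int *out)
    {
        if (sample_rate == 0)
            return -EINVAL; /* not configured: skip instead of dividing by zero */

        *out = pixel_clock / sample_rate;
        return 0;
    }
    ```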

    Fixes: db5adf4d6dce ("drm/tegra: hdmi: Fix audio to work with any pixel clock rate")
    Reported-by: Marcel Ziswiler
    Tested-by: Dmitry Osipenko
    Signed-off-by: Thierry Reding

    Thierry Reding
     
  • It looks like the new socket options only work correctly
    for native execution, but in case of compat mode fall back
    to the old behavior as we ignore the 'old_timeval' flag.

    Rework so we treat SO_RCVTIMEO_NEW/SO_SNDTIMEO_NEW the
    same way in compat and native 32-bit mode.

    Cc: Deepa Dinamani
    Fixes: a9beb86ae6e5 ("sock: Add SO_RCVTIMEO_NEW and SO_SNDTIMEO_NEW")
    Signed-off-by: Arnd Bergmann
    Acked-by: Deepa Dinamani
    Signed-off-by: David S. Miller

    Arnd Bergmann
     
  • For some reason, tcp_grow_window() correctly tests if enough room
    is present before attempting to increase tp->rcv_ssthresh,
    but does not prevent it from growing past tcp_space().

    This is causing hard to debug issues, like failing
    the (__tcp_select_window(sk) >= tp->rcv_wnd) test
    in __tcp_ack_snd_check(), causing ACK delays and possibly
    slow flows.

    Depending on tcp_rmem[2], MTU, skb->len/skb->truesize ratio,
    we can see the problem happening on "netperf -t TCP_RR -- -r 2000,2000"
    after about 60 round trips, when the active side no longer sends
    immediate acks.
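
    The missing clamp can be sketched as follows. This is an illustration
    in userspace C, not the patch itself: the function names are
    hypothetical and plain ints stand in for the socket fields
    (tp->rcv_ssthresh, the window clamp, and the value of tcp_space()).

    ```c
    /* Smaller of two ints. */
    int min_int(int a, int b)
    {
    	return a < b ? a : b;
    }

    /*
     * Grow rcv_ssthresh by up to `incr`, but never past the smaller of
     * window_clamp and the currently available receive space.
     */
    int grow_rcv_ssthresh(int rcv_ssthresh, int incr, int window_clamp, int space)
    {
    	int room = min_int(window_clamp, space) - rcv_ssthresh;

    	if (room <= 0)
    		return rcv_ssthresh;	/* no room: leave it unchanged */

    	return rcv_ssthresh + min_int(room, incr);
    }
    ```

    With the clamp in place, the threshold can reach the available space
    but cannot overshoot it, whatever increment is requested.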

    This bug predates git history.

    Signed-off-by: Eric Dumazet
    Acked-by: Soheil Hassas Yeganeh
    Acked-by: Neal Cardwell
    Acked-by: Wei Wang
    Signed-off-by: David S. Miller

    Eric Dumazet
     
  • This is preventive cleanup that may save troubles later.
    No need to cancel repeatedly queued work if code is properly
    refactored.
    Don't let the ethtool -s process interfere with the stat workqueue
    scheduling.

    Signed-off-by: Claudiu Manoil
    Signed-off-by: David S. Miller

    Claudiu Manoil
     
  • Preemption disabled at:
    dev_set_rx_mode+0x1c/0x38
    Call trace:
    dump_backtrace+0x0/0x3d0
    show_stack+0x14/0x20
    dump_stack+0xac/0xe4
    ___might_sleep+0x164/0x238
    __might_sleep+0x50/0x88
    kmem_cache_alloc+0x17c/0x1d0
    ocelot_set_rx_mode+0x108/0x188 [mscc_ocelot_common]
    __dev_set_rx_mode+0x58/0xa0
    dev_set_rx_mode+0x24/0x38

    Fixes: a556c76adc05 ("net: mscc: Add initial Ocelot switch support")

    Signed-off-by: Claudiu Manoil
    Signed-off-by: David S. Miller

    Claudiu Manoil
     
  • Since the introduction of the vlan_stats_per_port option, its netlink
    export has been broken because I made a typo and used the ifla
    attribute instead of the bridge option to retrieve its state.
    Sysfs export is fine, only netlink export has been affected.

    Fixes: 9163a0fc1f0c0 ("net: bridge: add support for per-port vlan stats")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov
     
  • There is a spelling mistake in a DP_INFO message, fix it.

    Signed-off-by: Colin Ian King
    Reviewed-by: Mukesh Ojha
    Signed-off-by: David S. Miller

    Colin Ian King
     
  • We find that sysctl_tipc_rmem and named_timeout do not have the right minimum
    setting. sysctl_tipc_rmem should be larger than zero, like sysctl_tcp_rmem.
    And named_timeout, as a timeout setting, should not be less than zero.
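
    The effect of a lower bound can be sketched in userspace C as below.
    In the kernel, such bounds are typically declared in the sysctl table
    and enforced by proc_dointvec_minmax(); the `bounded_int` type and
    `bounded_int_write()` here are hypothetical stand-ins for illustration
    only.

    ```c
    #include <errno.h>

    /* Hypothetical stand-in for a sysctl entry with a lower bound. */
    struct bounded_int {
    	int value;
    	int min;
    };

    /* Store new_value only if it is not below the entry's minimum. */
    int bounded_int_write(struct bounded_int *entry, int new_value)
    {
    	if (new_value < entry->min)
    		return -EINVAL;	/* rejected, old value stays in place */

    	entry->value = new_value;
    	return 0;
    }
    ```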

    Fixes: cc79dd1ba9c10 ("tipc: change socket buffer overflow control to respect sk_rcvbuf")
    Fixes: a5325ae5b8bff ("tipc: add name distributor resiliency queue")
    Signed-off-by: Jie Liu
    Reported-by: Qiang Ning
    Reviewed-by: Zhiqiang Liu
    Reviewed-by: Miaohe Lin
    Signed-off-by: David S. Miller

    Jie Liu
     
  • According to the link FSM, when a link endpoint receives a RESET_MSG (a
    traditional one without the stopping bit) from its peer, it moves to
    PEER_RESET state and raises a LINK_DOWN event which then resets the
    link itself. Its state will become ESTABLISHING after the reset event
    and the link will be re-established soon after this endpoint starts to
    send ACTIVATE_MSG to the peer.

    There is no problem with this mechanism, however the link resetting has
    cleared the link 'in_session' flag (along with the other important link
    data such as: the link 'mtu') that was correctly set up at the 1st step
    (i.e. when this endpoint received the peer RESET_MSG). As a result, the
    link will become ESTABLISHED, but the 'in_session' flag is not set, and
    all STATE_MSG from its peer will be dropped at the link_validate_msg().
    This means the link is not synced and will sooner or later fail.

    Since the link reset action is obviously needed for a new link session
    (this is also true in the other situations), the problem here is that
    the link is re-established a bit too early when the link endpoints are
    not really in-sync yet. The commit forces a resync as already done in
    the previous commit 91986ee166cf ("tipc: fix link session and
    re-establish issues") by simply varying the link 'peer_session' value
    at the link_reset().

    Acked-by: Jon Maloy
    Signed-off-by: Tuong Lien
    Signed-off-by: David S. Miller

    Tuong Lien
     
  • skb_reorder_vlan_header() should move XDP meta data with ethernet header
    if XDP meta data exists.

    Fixes: de8f3a83b0a0 ("bpf: add meta pointer for direct access")
    Signed-off-by: Yuya Kusakabe
    Signed-off-by: Takeru Hayasaka
    Co-developed-by: Takeru Hayasaka
    Reviewed-by: Toshiaki Makita
    Signed-off-by: David S. Miller

    Yuya Kusakabe
     
  • arg is controlled by user-space, hence leading to a potential
    exploitation of the Spectre variant 1 vulnerability.

    This issue was detected with the help of Smatch:

    net/atm/lec.c:715 lec_mcast_attach() warn: potential spectre issue 'dev_lec' [r] (local cap)

    Fix this by sanitizing arg before using it to index dev_lec.

    Notice that given that speculation windows are large, the policy is
    to kill the speculation on the first load and not worry if it can be
    completed with a dependent load/store [1].

    [1] https://lore.kernel.org/lkml/20180423164740.GY17484@dhcp22.suse.cz/
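
    In the kernel this kind of sanitizing is done with array_index_nospec()
    from <linux/nospec.h>. The userspace sketch below only illustrates the
    branchless masking idea behind it; `index_mask()` and
    `sanitize_index()` are hypothetical names, and a userspace compiler
    gives no guarantee about speculation behavior.

    ```c
    #include <limits.h>

    /*
     * All-ones mask when index < size, zero otherwise, computed without a
     * branch. Assumes index and size are both <= LONG_MAX and that
     * right-shifting a negative long is an arithmetic shift (true for GCC
     * and Clang).
     */
    unsigned long index_mask(unsigned long index, unsigned long size)
    {
    	return (unsigned long)(~(long)(index | (size - 1UL - index)) >>
    			       (sizeof(long) * CHAR_BIT - 1));
    }

    /* In-range indexes pass through unchanged, out-of-range ones become 0. */
    unsigned long sanitize_index(unsigned long index, unsigned long size)
    {
    	return index & index_mask(index, size);
    }
    ```

    The usual bounds check still comes first; the mask only ensures that a
    mispredicted branch cannot use an out-of-range index for the first load.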

    Signed-off-by: Gustavo A. R. Silva
    Signed-off-by: David S. Miller

    Gustavo A. R. Silva
     
  • The routine ptp_classifier_init() uses an initializer for an
    automatic struct type variable which refers to an __initdata
    symbol. This is perfectly legal, but may trigger a section
    mismatch warning when running the compiler in -fpic mode, due
    to the fact that the initializer may be emitted into an anonymous
    .data section that lacks the __init annotation. So work around it
    by using assignments instead.
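
    The shape of the workaround, sketched in userspace C. The struct,
    the table contents, and the function name are hypothetical
    placeholders rather than the kernel's definitions, and the section
    behavior itself cannot be reproduced outside a kernel build.

    ```c
    /* Hypothetical stand-ins for a filter program and its instruction table. */
    struct filter_prog {
    	unsigned short len;
    	const int *insns;
    };

    static const int ptp_insns[] = { 1, 2, 3 };	/* placeholder contents */

    /*
     * Fill the automatic variable with assignments rather than
     * `struct filter_prog prog = { .len = ..., .insns = ptp_insns };`,
     * so no separate initializer image is required for it.
     */
    struct filter_prog make_prog(void)
    {
    	struct filter_prog prog;

    	prog.len = sizeof(ptp_insns) / sizeof(ptp_insns[0]);
    	prog.insns = ptp_insns;
    	return prog;
    }
    ```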

    Signed-off-by: Ard Biesheuvel
    Signed-off-by: Gerald Schaefer
    Signed-off-by: David S. Miller

    Ard Biesheuvel
     
  • When the commit below was introduced it changed two visible things:
    - the skb was no longer passed through the protocol handlers with the
    original device
    - the skb was passed up the stack with skb->dev = bridge

    The first change broke af_packet sockets on bridge ports. For example we
    use them for hostapd which listens for ETH_P_PAE packets on the ports.
    We discussed two possible fixes:
    - create a clone and pass it through NF_HOOK(), act on the original skb
    based on the result
    - somehow signal to the caller from the okfn() that it was called,
    meaning the skb is ok to be passed, which this patch is trying to
    implement via returning 1 from the bridge link-local okfn()

    Note that we rely on the fact that NF_QUEUE/STOLEN would return 0 and
    drop/error would return < 0 thus the okfn() is called only when the
    return was 1, so we signal to the caller that it was called by preserving
    the return value from nf_hook().
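
    The signalling convention can be modelled in userspace C as below.
    This is a sketch under stated assumptions: the verdict enum and all
    function names are hypothetical, and only the return values (< 0 for
    drop/error, 0 for queue/stolen, 1 when okfn() ran) follow the
    description above.

    ```c
    /* Hypothetical verdicts; the kernel's NF_* constants are not reproduced. */
    enum verdict { V_DROP, V_ACCEPT, V_QUEUE, V_STOLEN };

    /* < 0 for drop/error, 0 for queued or stolen, 1 when the hook accepts. */
    int run_hook(enum verdict v)
    {
    	switch (v) {
    	case V_ACCEPT:
    		return 1;
    	case V_QUEUE:
    	case V_STOLEN:
    		return 0;
    	default:
    		return -1;
    	}
    }

    /* okfn for link-local frames: returning 1 tells the caller it ran. */
    int link_local_okfn(void)
    {
    	return 1;
    }

    /* Calls okfn only on accept and passes its return value through. */
    int hook_then_okfn(enum verdict v)
    {
    	int ret = run_hook(v);

    	if (ret == 1)
    		ret = link_local_okfn();
    	return ret;
    }

    /* Caller side: nonzero means the frame may continue up the stack. */
    int may_pass_up(enum verdict v)
    {
    	return hook_then_okfn(v) == 1;
    }
    ```

    Because okfn() itself returns 1, the caller needs no extra flag: a
    return value of 1 can only mean that the hook accepted the frame and
    okfn() was invoked.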

    Fixes: 8626c56c8279 ("bridge: fix potential use-after-free when hook returns QUEUE or STOLEN verdict")
    Signed-off-by: Nikolay Aleksandrov
    Signed-off-by: David S. Miller

    Nikolay Aleksandrov