05 Aug, 2020

1 commit

  • commit 5c72feee3e45b40a3c96c7145ec422899d0e8964 upstream.

    When handling a page fault, we drop mmap_sem to start async readahead so
    that we don't block on IO submission with mmap_sem held. However, there is
    no point in dropping mmap_sem when readahead is disabled. Handle that case
    to avoid pointlessly dropping mmap_sem and retrying the fault. This was
    actually reported to block mlockall(MCL_CURRENT) indefinitely.
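
    A minimal sketch of the control flow described above (hypothetical stand-in
    types and names, not the mm/filemap.c code): when readahead is disabled there
    is no IO to submit, so the lock is kept and no retry is needed.

        #include <stdbool.h>

        struct ra_state { unsigned int ra_pages; };   /* hypothetical readahead state */

        /* Returns true if the mmap lock was dropped and the fault must be retried. */
        static bool start_async_readahead(struct ra_state *ra)
        {
                if (!ra->ra_pages)      /* readahead disabled: nothing to submit, */
                        return false;   /* keep the lock held, no pointless retry */

                /* ... drop the lock here and submit the readahead IO ... */
                return true;
        }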

    Fixes: 6b4c9f446981 ("filemap: drop the mmap_sem for all blocking operations")
    Reported-by: Minchan Kim
    Reported-by: Robert Stupp
    Signed-off-by: Jan Kara
    Signed-off-by: Andrew Morton
    Reviewed-by: Josef Bacik
    Reviewed-by: Minchan Kim
    Link: http://lkml.kernel.org/r/20200212101356.30759-1-jack@suse.cz
    Signed-off-by: Linus Torvalds
    Cc: SeongJae Park
    Signed-off-by: Greg Kroah-Hartman

    Jan Kara
     

29 Jul, 2020

4 commits

  • commit 594cced14ad3903166c8b091ff96adac7552f0b3 upstream.

    khugepaged has to drop the mmap lock several times while collapsing a page.
    The situation can change while the lock is dropped, so we need to
    re-validate that the VMA is still in place and the PMD is still subject
    to collapse.

    But we miss one corner case: while collapsing an anonymous page, the VMA
    could be replaced with a file VMA. If the file VMA doesn't have any
    private pages, we get a NULL pointer dereference:

    general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
    anon_vma_lock_write include/linux/rmap.h:120 [inline]
    collapse_huge_page mm/khugepaged.c:1110 [inline]
    khugepaged_scan_pmd mm/khugepaged.c:1349 [inline]
    khugepaged_scan_mm_slot mm/khugepaged.c:2110 [inline]
    khugepaged_do_scan mm/khugepaged.c:2193 [inline]
    khugepaged+0x3bba/0x5a10 mm/khugepaged.c:2238

    The fix is to make sure that the VMA is anonymous in
    hugepage_vma_revalidate(). The helper is only used for collapsing
    anonymous pages.
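
    A self-contained sketch of that revalidation rule, using a made-up stand-in
    struct rather than the kernel's vm_area_struct: after the lock has been
    re-taken, only a still-anonymous VMA may be collapsed.

        #include <stdbool.h>
        #include <stddef.h>

        struct vma_stub {                 /* hypothetical stand-in for vm_area_struct */
                void *anon_vma;           /* set for anonymous mappings               */
                const void *vm_ops;       /* set for file-backed mappings             */
        };

        /* Revalidate after the mmap lock was dropped and re-taken. */
        static bool vma_still_anonymous(const struct vma_stub *vma)
        {
                if (vma == NULL)
                        return false;     /* the VMA went away entirely       */
                if (vma->anon_vma == NULL || vma->vm_ops != NULL)
                        return false;     /* replaced by a file VMA: bail out */
                return true;
        }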

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Reported-by: syzbot+ed318e8b790ca72c5ad0@syzkaller.appspotmail.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Yang Shi
    Cc:
    Link: http://lkml.kernel.org/r/20200722121439.44328-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     
  • commit d38a2b7a9c939e6d7329ab92b96559ccebf7b135 upstream.

    If the kmem_cache refcount is greater than one, we should not mark the
    root kmem_cache as dying. If we incorrectly mark the root kmem_cache as
    dying, the non-root kmem_cache can never be destroyed, which results in a
    memory leak when the memcg is destroyed. The issue can be reproduced with
    the following steps.

    1) Use kmem_cache_create() to create a new kmem_cache named A.
    2) Coincidentally, the kmem_cache A is an alias for kmem_cache B,
    so only B's refcount is increased.
    3) Use kmem_cache_destroy() to destroy the kmem_cache A; this just
    decreases B's refcount but marks B as dying.
    4) Create a new memory cgroup and allocate memory from the kmem_cache
    B. This leads to the creation of a non-root kmem_cache.
    5) When the memory cgroup created in step 4) is destroyed, the
    non-root kmem_cache can never be destroyed.

    Repeating steps 4) and 5) leaks more and more memory. So mark the root
    kmem_cache as dying only when its refcount reaches zero.
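
    A toy, self-contained sketch of that rule (made-up names, not the
    mm/slab_common.c code): the dying flag is only set once the last reference
    to the root cache is gone.

        #include <stdbool.h>

        struct cache_stub {          /* hypothetical stand-in for a root kmem_cache */
                int  refcount;       /* > 1 while aliases of the cache exist        */
                bool dying;
        };

        /* Returns true when the caller should go on and actually destroy the cache. */
        static bool cache_put(struct cache_stub *s)
        {
                if (--s->refcount != 0)
                        return false;    /* still aliased (cache B behind alias A): not dying */
                s->dying = true;         /* last reference gone: safe to mark it dying        */
                return true;
        }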

    Fixes: 92ee383f6daa ("mm: fix race between kmem_cache destroy, create and deactivate")
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Roman Gushchin
    Cc: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Shakeel Butt
    Cc:
    Link: http://lkml.kernel.org/r/20200716165103.83462-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     
  • commit 8d22a9351035ef2ff12ef163a1091b8b8cf1e49c upstream.

    It was hard to keep a test running, moving tasks between memcgs with
    move_charge_at_immigrate, while swapping: mem_cgroup_id_get_many()'s
    refcount is discovered to be 0 (supposedly impossible), so it is then
    forced to REFCOUNT_SATURATED, and after thousands of warnings in quick
    succession, the test is at last put out of misery by being OOM killed.

    This is because of the way moved_swap accounting was saved up until the
    task move gets completed in __mem_cgroup_clear_mc(), deferred from when
    mem_cgroup_move_swap_account() actually exchanged old and new ids.
    Concurrent activity can free up swap quicker than the task is scanned,
    bringing the id refcount down to 0 (which should only be possible when
    offlining).

    Just skip that optimization: do that part of the accounting immediately.

    Fixes: 615d66c37c75 ("mm: memcontrol: fix memcg id ref counter on swap charge move")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Cc: Johannes Weiner
    Cc: Alex Shi
    Cc: Shakeel Butt
    Cc: Michal Hocko
    Cc:
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2007071431050.4726@eggly.anvils
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 246c320a8cfe0b11d81a4af38fa9985ef0cc9a4c upstream.

    A VMA with the VM_GROWSDOWN or VM_GROWSUP flag set can change its size under
    mmap_read_lock(). This can lead to a race with __do_munmap():

    Thread A                          Thread B
    __do_munmap()
      detach_vmas_to_be_unmapped()
      mmap_write_downgrade()
                                      expand_downwards()
                                        vma->vm_start = address;
                                        // The VMA now overlaps with
                                        // VMAs detached by the Thread A
                                      // page fault populates expanded part
                                      // of the VMA
      unmap_region()
        // Zaps pagetables partly
        // populated by Thread B

    Similar race exists for expand_upwards().

    The fix is to avoid downgrading mmap_lock in __do_munmap() if the detached
    VMAs are next to a VM_GROWSDOWN or VM_GROWSUP VMA.
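
    A self-contained sketch of that check, with made-up flag values and a
    stand-in struct (not the mm/mmap.c code): the write lock is kept whenever a
    neighbouring VMA could still grow into the detached range.

        #include <stdbool.h>
        #include <stddef.h>

        #define VM_GROWSDOWN 0x0100UL     /* illustrative values only */
        #define VM_GROWSUP   0x0200UL

        struct vma_stub { unsigned long vm_flags; };

        /* prev/next are the VMAs adjacent to the range being unmapped (may be NULL). */
        static bool may_downgrade_mmap_lock(const struct vma_stub *prev,
                                            const struct vma_stub *next)
        {
                if (next && (next->vm_flags & VM_GROWSDOWN))
                        return false;     /* next VMA could expand down into the hole */
                if (prev && (prev->vm_flags & VM_GROWSUP))
                        return false;     /* prev VMA could expand up into the hole   */
                return true;              /* safe to zap pages under the read lock    */
        }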

    [akpm@linux-foundation.org: s/mmap_sem/mmap_lock/ in comment]

    Fixes: dd2283f2605e ("mm: mmap: zap pages with read mmap_sem in munmap")
    Reported-by: Jann Horn
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Oleg Nesterov
    Cc: Matthew Wilcox
    Cc: [4.20+]
    Link: http://lkml.kernel.org/r/20200709105309.42495-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

09 Jul, 2020

5 commits

  • commit b9e20f0da1f5c9c68689450a8cb436c9486434c8 upstream.

    Hugh reports:

    "While stressing compaction, one run oopsed on NULL capc->cc in
    __free_one_page()'s task_capc(zone): compact_zone_order() had been
    interrupted, and a page was being freed in the return from interrupt.

    Though you would not expect it from the source, both gccs I was using
    (4.8.1 and 7.5.0) had chosen to compile compact_zone_order() with the
    ".cc = &cc" implemented by mov %rbx,-0xb0(%rbp) immediately before
    callq compact_zone - long after the "current->capture_control =
    &capc". An interrupt in between those finds capc->cc NULL (zeroed by
    an earlier rep stos).

    This could presumably be fixed by a barrier() before setting
    current->capture_control in compact_zone_order(); but would also need
    more care on return from compact_zone(), in order not to risk leaking
    a page captured by interrupt just before capture_control is reset.

    Maybe that is the preferable fix, but I felt safer for task_capc() to
    exclude the rather surprising possibility of capture at interrupt
    time"

    I have checked that gcc10 also behaves the same.

    The advantage of the fix in compact_zone_order() is that we don't add
    another test to the page freeing hot path, and that it might prevent
    future problems because we stop exposing pointers to uninitialized
    structures in the current task.

    So this patch implements the suggestion for compact_zone_order() with
    barrier() (and WRITE_ONCE() to prevent store tearing) for setting
    current->capture_control, and prevents page leaking with
    WRITE_ONCE/READ_ONCE in the proper order.
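
    A self-contained userspace sketch of that publish/unpublish ordering
    (barrier, WRITE_ONCE and READ_ONCE are re-defined locally just for the
    illustration; the kernel code differs in detail):

        #include <stdio.h>
        #include <stddef.h>

        #define barrier()        __asm__ __volatile__("" ::: "memory")
        #define WRITE_ONCE(x, v) (*(volatile __typeof__(x) *)&(x) = (v))
        #define READ_ONCE(x)     (*(volatile __typeof__(x) *)&(x))

        struct capture_control { void *page; };

        /* stand-in for current->capture_control, which an interrupt may inspect */
        static struct capture_control *capture_control;

        static void *compact_sketch(void)
        {
                struct capture_control capc = { .page = NULL };
                void *page;

                barrier();                          /* capc must be fully initialized ... */
                WRITE_ONCE(capture_control, &capc); /* ... before it is made visible      */

                /* ... compaction runs here; an interrupt may capture a freed page ... */

                WRITE_ONCE(capture_control, NULL);  /* hide the capture control first ... */
                barrier();
                page = READ_ONCE(capc.page);        /* ... then read it, so a page captured
                                                       by a late interrupt is not leaked  */
                return page;
        }

        int main(void)
        {
                printf("captured page: %p\n", compact_sketch());
                return 0;
        }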

    Link: http://lkml.kernel.org/r/20200616082649.27173-1-vbabka@suse.cz
    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Signed-off-by: Vlastimil Babka
    Reported-by: Hugh Dickins
    Suggested-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Cc: Alex Shi
    Cc: Li Wang
    Cc: Mel Gorman
    Cc: [5.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • commit 6467552ca64c4ddd2b83ed73192107d7145f533b upstream.

    Dan reports:

    The patch 5e1f0f098b46: "mm, compaction: capture a page under direct
    compaction" from Mar 5, 2019, leads to the following Smatch complaint:

    mm/compaction.c:2321 compact_zone_order()
    error: we previously assumed 'capture' could be null (see line 2313)

    mm/compaction.c
    2288 static enum compact_result compact_zone_order(struct zone *zone, int order,
    2289 gfp_t gfp_mask, enum compact_priority prio,
    2290 unsigned int alloc_flags, int classzone_idx,
    2291 struct page **capture)
    ^^^^^^^

    2313 if (capture)
    ^^^^^^^
    Check for NULL

    2314 current->capture_control = &capc;
    2315
    2316 ret = compact_zone(&cc, &capc);
    2317
    2318 VM_BUG_ON(!list_empty(&cc.freepages));
    2319 VM_BUG_ON(!list_empty(&cc.migratepages));
    2320
    2321 *capture = capc.page;
    ^^^^^^^^
    Unchecked dereference.

    2322 current->capture_control = NULL;
    2323

    In practice this is not an issue, as the only caller path passes non-NULL
    capture:

    __alloc_pages_direct_compact()
    struct page *page = NULL;
    try_to_compact_pages(capture = &page);
    compact_zone_order(capture = capture);

    So let's remove the unnecessary check, which should also make Smatch happy.

    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Reported-by: Dan Carpenter
    Suggested-by: Andrew Morton
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Acked-by: Mel Gorman
    Link: http://lkml.kernel.org/r/18b0df3c-0589-d96c-23fa-040798fee187@suse.cz
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     
  • [ Upstream commit a68ee0573991e90af2f1785db309206408bad3e5 ]

    There is no need to copy SLUB_STATS items from the root memcg cache to new
    memcg cache copies. Doing so could result in stack overruns, because the
    store function only accepts 0 to clear a stat and returns an error for
    everything else, while the show method prints out the whole stat.

    The mismatch between the lengths returned by the show and store methods
    then shows up in memcg_propagate_slab_attrs():

    else if (root_cache->max_attr_size < ARRAY_SIZE(mbuf))
            buf = mbuf;

    max_attr_size is only 2 from slab_attr_store(); mbuf[64] is then used in
    show_stat() later, where a bunch of sprintf() calls would overrun the stack
    variable. Fix it by always allocating a page-sized buffer to be used in
    show_stat() if SLUB_STATS=y, which should only be used for debugging anyway.

    # echo 1 > /sys/kernel/slab/fs_cache/shrink
    BUG: KASAN: stack-out-of-bounds in number+0x421/0x6e0
    Write of size 1 at addr ffffc900256cfde0 by task kworker/76:0/53251

    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
    Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
    Call Trace:
    number+0x421/0x6e0
    vsnprintf+0x451/0x8e0
    sprintf+0x9e/0xd0
    show_stat+0x124/0x1d0
    alloc_slowpath_show+0x13/0x20
    __kmem_cache_create+0x47a/0x6b0

    addr ffffc900256cfde0 is located in stack of task kworker/76:0/53251 at offset 0 in frame:
    process_one_work+0x0/0xb90

    this frame has 1 object:
    [32, 72) 'lockdep_map'

    Memory state around the buggy address:
    ffffc900256cfc80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ffffc900256cfd00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    >ffffc900256cfd80: 00 00 00 00 00 00 00 00 00 00 00 00 f1 f1 f1 f1
    ^
    ffffc900256cfe00: 00 00 00 00 00 f2 f2 f2 00 00 00 00 00 00 00 00
    ffffc900256cfe80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================
    Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: __kmem_cache_create+0x6ac/0x6b0
    Workqueue: memcg_kmem_cache memcg_kmem_cache_create_func
    Call Trace:
    __kmem_cache_create+0x6ac/0x6b0

    Fixes: 107dab5c92d5 ("slub: slub-specific propagation changes")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Glauber Costa
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200429222356.4322-1-cai@lca.pw
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     
  • [ Upstream commit 52f23478081ae0dcdb95d1650ea1e7d52d586829 ]

    slub_debug is able to fix a corrupted slab freelist/page. However,
    alloc_debug_processing() only checks the validity of the current and next
    freepointer during the allocation path. As a result, once some objects have
    their freepointers corrupted, deactivate_slab() may trigger a page fault.

    The trace below is from a test kernel module run with 'slub_debug=PUF,kmalloc-128
    slub_nomerge'. The test module corrupts the freepointer of one free object
    on purpose; unfortunately, deactivate_slab() does not detect this when
    iterating the freechain.

    BUG: unable to handle page fault for address: 00000000123456f8
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    ... ...
    RIP: 0010:deactivate_slab.isra.92+0xed/0x490
    ... ...
    Call Trace:
    ___slab_alloc+0x536/0x570
    __slab_alloc+0x17/0x30
    __kmalloc+0x1d9/0x200
    ext4_htree_store_dirent+0x30/0xf0
    htree_dirblock_to_tree+0xcb/0x1c0
    ext4_htree_fill_tree+0x1bc/0x2d0
    ext4_readdir+0x54f/0x920
    iterate_dir+0x88/0x190
    __x64_sys_getdents+0xa6/0x140
    do_syscall_64+0x49/0x170
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Therefore, this patch adds an extra consistency check in deactivate_slab().
    Once an object's freepointer is found to be corrupted, that object and all
    objects following it are isolated.
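
    A self-contained toy version of that kind of consistency check (not the
    mm/slub.c implementation): while walking a freelist, any stored pointer that
    falls outside the slab or off an object boundary truncates the chain there.

        #include <stdbool.h>
        #include <stddef.h>
        #include <stdint.h>

        #define OBJ_SIZE 128u    /* assumed object size, as in kmalloc-128 */

        /* A free object stores the pointer to the next free object at its start. */
        static bool freeptr_valid(const void *slab_base, size_t slab_size, const void *fp)
        {
                uintptr_t base = (uintptr_t)slab_base, p = (uintptr_t)fp;

                if (fp == NULL)
                        return true;                          /* NULL terminates the list  */
                return p >= base && p < base + slab_size      /* must point into the slab  */
                        && (p - base) % OBJ_SIZE == 0;        /* and at an object boundary */
        }

        /* Count free objects, cutting the chain at the first corrupted freepointer. */
        static size_t count_free(void *slab_base, size_t slab_size, void *freelist)
        {
                size_t n = 0;

                while (freelist && n < slab_size / OBJ_SIZE) {
                        void *next = *(void **)freelist;      /* read the stored freepointer */

                        if (!freeptr_valid(slab_base, slab_size, next)) {
                                *(void **)freelist = NULL;    /* isolate the corrupted tail  */
                                next = NULL;
                        }
                        n++;
                        freelist = next;
                }
                return n;
        }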

    [akpm@linux-foundation.org: fix build with CONFIG_SLAB_DEBUG=n]
    Signed-off-by: Dongli Zhang
    Signed-off-by: Andrew Morton
    Cc: Joe Jin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200331031450.12182-1-dongli.zhang@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Dongli Zhang
     
  • [ Upstream commit 243bce09c91b0145aeaedd5afba799d81841c030 ]

    Chris Murphy reports that a slightly overcommitted load, testing swap
    and zram along with i915, splats and keeps on splatting, when it had
    better fail less noisily:

    gnome-shell: page allocation failure: order:0,
    mode:0x400d0(__GFP_IO|__GFP_FS|__GFP_COMP|__GFP_RECLAIMABLE),
    nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 2 PID: 1155 Comm: gnome-shell Not tainted 5.7.0-1.fc33.x86_64 #1
    Call Trace:
    dump_stack+0x64/0x88
    warn_alloc.cold+0x75/0xd9
    __alloc_pages_slowpath.constprop.0+0xcfa/0xd30
    __alloc_pages_nodemask+0x2df/0x320
    alloc_slab_page+0x195/0x310
    allocate_slab+0x3c5/0x440
    ___slab_alloc+0x40c/0x5f0
    __slab_alloc+0x1c/0x30
    kmem_cache_alloc+0x20e/0x220
    xas_nomem+0x28/0x70
    add_to_swap_cache+0x321/0x400
    __read_swap_cache_async+0x105/0x240
    swap_cluster_readahead+0x22c/0x2e0
    shmem_swapin+0x8e/0xc0
    shmem_swapin_page+0x196/0x740
    shmem_getpage_gfp+0x3a2/0xa60
    shmem_read_mapping_page_gfp+0x32/0x60
    shmem_get_pages+0x155/0x5e0 [i915]
    __i915_gem_object_get_pages+0x68/0xa0 [i915]
    i915_vma_pin+0x3fe/0x6c0 [i915]
    eb_add_vma+0x10b/0x2c0 [i915]
    i915_gem_do_execbuffer+0x704/0x3430 [i915]
    i915_gem_execbuffer2_ioctl+0x1ea/0x3e0 [i915]
    drm_ioctl_kernel+0x86/0xd0 [drm]
    drm_ioctl+0x206/0x390 [drm]
    ksys_ioctl+0x82/0xc0
    __x64_sys_ioctl+0x16/0x20
    do_syscall_64+0x5b/0xf0
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Reported on 5.7, but it goes back really to 3.1: when
    shmem_read_mapping_page_gfp() was implemented for use by i915, and
    allowed for __GFP_NORETRY and __GFP_NOWARN flags in most places, but
    missed swapin's "& GFP_KERNEL" mask for page tree node allocation in
    __read_swap_cache_async() - that was to mask off HIGHUSER_MOVABLE bits
    from what page cache uses, but GFP_RECLAIM_MASK is now what's needed.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=208085
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2006151330070.11064@eggly.anvils
    Fixes: 68da9f055755 ("tmpfs: pass gfp to shmem_getpage_gfp")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Matthew Wilcox (Oracle)
    Reported-by: Chris Murphy
    Analyzed-by: Vlastimil Babka
    Analyzed-by: Matthew Wilcox
    Tested-by: Chris Murphy
    Cc: [3.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

01 Jul, 2020

2 commits

  • commit 3a98990ae2150277ed34d3b248c60e68bf2244b2 upstream.

    We should put the css reference when the memory allocation fails.

    Link: http://lkml.kernel.org/r/20200614122653.98829-1-songmuchun@bytedance.com
    Fixes: f0a3a24b532d ("mm: memcg/slab: rework non-root kmem_cache lifecycle management")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     
  • commit 8982ae527fbef170ef298650c15d55a9ccd33973 upstream.

    The kzfree() function is normally used to clear some sensitive
    information, like encryption keys, in the buffer before freeing it back to
    the pool. memset() is currently used for the buffer clearing. However
    unlikely, there is still a non-zero probability that the compiler may
    choose to optimize away the memory clearing, especially if LTO is used in
    the future.

    To make sure that this optimization can never happen, memzero_explicit(),
    which was introduced in v3.18, is now used in kzfree() to future-proof it.
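
    A userspace approximation of the same idea, assuming nothing beyond standard
    C and a GCC/Clang-style compiler barrier (the kernel's memzero_explicit() and
    kzfree() are implemented differently):

        #include <stdlib.h>
        #include <string.h>

        /* Zero a buffer in a way the optimizer cannot simply elide: the empty asm
         * with a "memory" clobber forces the memset() result to be materialized. */
        static void explicit_bzero_sketch(void *ptr, size_t len)
        {
                memset(ptr, 0, len);
                __asm__ __volatile__("" : : "r"(ptr) : "memory");
        }

        /* kzfree()-like helper: clear sensitive contents, then free the buffer. */
        static void free_sensitive(void *ptr, size_t len)
        {
                if (!ptr)
                        return;
                explicit_bzero_sketch(ptr, len);
                free(ptr);
        }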

    Link: http://lkml.kernel.org/r/20200616154311.12314-2-longman@redhat.com
    Fixes: 3ef0e5ba4673 ("slab: introduce kzfree()")
    Signed-off-by: Waiman Long
    Acked-by: Michal Hocko
    Cc: David Howells
    Cc: Jarkko Sakkinen
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Joe Perches
    Cc: Matthew Wilcox
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Dan Carpenter
    Cc: "Jason A . Donenfeld"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     

22 Jun, 2020

4 commits

  • commit da97f2d56bbd880b4138916a7ef96f9881a551b2 upstream.

    Now that deferred pages are initialized with interrupts enabled, we can
    replace touch_nmi_watchdog() with cond_resched(), as it was before
    3a2d7fa8a3d5.

    For now, we cannot do the same in deferred_grow_zone(), as it still
    initializes pages with interrupts disabled.

    This change fixes the RCU problem described in
    https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com

    [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
    [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
    [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
    [ 1.760091] NMI backtrace for cpu 1
    [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
    [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
    [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
    [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
    [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
    [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
    [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
    [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
    [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
    [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
    [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
    [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
    [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 1.760091] Call Trace:
    [ 1.760091] deferred_init_pages+0x8f/0xbf
    [ 1.760091] deferred_init_memmap+0x184/0x29d
    [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
    [ 1.760091] kthread+0x112/0x130
    [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
    [ 1.760091] ret_from_fork+0x35/0x40
    [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Reported-by: Yiqian Wei
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Tested-by: David Hildenbrand
    Reviewed-by: Daniel Jordan
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: James Morris
    Cc: Kirill Tkhai
    Cc: Sasha Levin
    Cc: Shile Zhang
    Cc: Vlastimil Babka
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit 117003c32771df617acf66e140fbdbdeb0ac71f5 upstream.

    Patch series "initialize deferred pages with interrupts enabled", v4.

    Keep interrupts enabled during deferred page initialization in order to
    make code more modular and allow jiffies to update.

    Original approach, and discussion can be found here:
    http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com

    This patch (of 3):

    deferred_init_memmap() disables interrupts the entire time, so it calls
    touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
    will run with interrupts enabled, at which point cond_resched() should be
    used instead.

    deferred_grow_zone() makes the same watchdog calls through code shared
    with deferred init but will continue to run with interrupts disabled, so
    it can't call cond_resched().

    Pull the watchdog calls up to these two places to allow the first to be
    changed later, independently of the second. The frequency reduces from
    twice per pageblock (init and free) to once per max order block.

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Signed-off-by: Daniel Jordan
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Shile Zhang
    Cc: Kirill Tkhai
    Cc: James Morris
    Cc: Sasha Levin
    Cc: Yiqian Wei
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Jordan
     
  • commit 3d060856adfc59afb9d029c233141334cfaba418 upstream.

    Initializing struct pages is a long task and keeping interrupts disabled
    for the duration of this operation introduces a number of problems.

    1. jiffies are not updated for a long period of time, and thus an incorrect
    time is reported. See the proposed solution and discussion here:
    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
    2. It prevents further improving deferred page initialization by allowing
    intra-node multi-threading.

    We are keeping interrupts disabled to solve a rather theoretical problem
    that was never observed in the real world (see 3a2d7fa8a3d5).

    Let's keep interrupts enabled. In case we ever encounter a scenario where
    an interrupt thread wants to allocate a large amount of memory this early
    in boot, we can deal with that by growing the zone (see deferred_grow_zone())
    by the needed amount before starting the deferred_init_memmap() threads.

    Before:
    [ 1.232459] node 0 initialised, 12058412 pages in 1ms

    After:
    [ 1.632580] node 0 initialised, 12051227 pages in 436ms

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Reported-by: Shile Zhang
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: James Morris
    Cc: Kirill Tkhai
    Cc: Sasha Levin
    Cc: Yiqian Wei
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit c444eb564fb16645c172d550359cb3d75fe8a040 upstream.

    Write-protect anon page faults require an accurate mapcount to decide
    whether to break the COW or not. This is implemented in the THP path with
    reuse_swap_page() ->
    page_trans_huge_map_swapcount()/page_trans_huge_mapcount().

    If the COW triggers while the other processes sharing the page are
    under a huge pmd split, to do an accurate reading, we must ensure the
    mapcount isn't computed while it's being transferred from the head
    page to the tail pages.

    reuse_swap_page() already runs serialized by the page lock, so it's
    enough to add the page lock around __split_huge_pmd_locked too, in
    order to add the missing serialization.

    Note: the commit in "Fixes" is just there to facilitate the backporting,
    because the code before that commit didn't try to do an accurate THP
    mapcount calculation; it instead used page_count() to decide whether
    to COW or not. Both the page_count and the pin_count are THP-wide
    refcounts, so they're inaccurate if used in
    reuse_swap_page(). Reverting that commit (besides the unrelated fix to
    the local anon_vma assignment) would also have opened the window for
    memory corruption side effects in certain workloads, as documented in
    that commit's header.

    Signed-off-by: Andrea Arcangeli
    Suggested-by: Jann Horn
    Reported-by: Jann Horn
    Acked-by: Kirill A. Shutemov
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrea Arcangeli
     

17 Jun, 2020

3 commits

  • commit dde3c6b72a16c2db826f54b2d49bdea26c3534a2 upstream.

    syzkaller reports a memory leak when kobject_init_and_add() returns an
    error in sysfs_slab_add() [1].

    When this happens, kobject_put() is not called for the corresponding
    kobject, which potentially leads to a memory leak.

    This patch fixes the issue by calling kobject_put() even if
    kobject_init_and_add() fails.

    [1]
    BUG: memory leak
    unreferenced object 0xffff8880a6d4be88 (size 8):
    comm "syz-executor.3", pid 946, jiffies 4295772514 (age 18.396s)
    hex dump (first 8 bytes):
    70 69 64 5f 33 00 ff ff pid_3...
    backtrace:
    kstrdup+0x35/0x70 mm/util.c:60
    kstrdup_const+0x3d/0x50 mm/util.c:82
    kvasprintf_const+0x112/0x170 lib/kasprintf.c:48
    kobject_set_name_vargs+0x55/0x130 lib/kobject.c:289
    kobject_add_varg lib/kobject.c:384 [inline]
    kobject_init_and_add+0xd8/0x170 lib/kobject.c:473
    sysfs_slab_add+0x1d8/0x290 mm/slub.c:5811
    __kmem_cache_create+0x50a/0x570 mm/slub.c:4384
    create_cache+0x113/0x1e0 mm/slab_common.c:407
    kmem_cache_create_usercopy+0x1a1/0x260 mm/slab_common.c:505
    kmem_cache_create+0xd/0x10 mm/slab_common.c:564
    create_pid_cachep kernel/pid_namespace.c:54 [inline]
    create_pid_namespace kernel/pid_namespace.c:96 [inline]
    copy_pid_ns+0x77c/0x8f0 kernel/pid_namespace.c:148
    create_new_namespaces+0x26b/0xa30 kernel/nsproxy.c:95
    unshare_nsproxy_namespaces+0xa7/0x1e0 kernel/nsproxy.c:229
    ksys_unshare+0x3d2/0x770 kernel/fork.c:2969
    __do_sys_unshare kernel/fork.c:3037 [inline]
    __se_sys_unshare kernel/fork.c:3035 [inline]
    __x64_sys_unshare+0x2d/0x40 kernel/fork.c:3035
    do_syscall_64+0xa1/0x530 arch/x86/entry/common.c:295

    Fixes: 80da026a8e5d ("mm/slub: fix slab double-free in case of duplicate sysfs filename")
    Reported-by: Hulk Robot
    Signed-off-by: Wang Hai
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200602115033.1054-1-wanghai38@huawei.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wang Hai
     
  • commit 17839856fd588f4ab6b789f482ed3ffd7c403e1f upstream.

    Doing a "get_user_pages()" on a copy-on-write page for reading can be
    ambiguous: the page can be COW'ed at any time afterwards, and the
    direction of a COW event isn't defined.

    Yes, whoever writes to it will generally do the COW, but if the thread
    that did the get_user_pages() unmapped the page before the write (and
    that could happen due to memory pressure in addition to any outright
    action), the writer could also just take over the old page instead.

    End result: the get_user_pages() call might result in a page pointer
    that is no longer associated with the original VM, and is associated
    with - and controlled by - another VM having taken it over instead.

    So when doing a get_user_pages() on a COW mapping, the only really safe
    thing to do would be to break the COW when getting the page, even when
    only getting it for reading.

    At the same time, some users simply don't even care.

    For example, the perf code wants to look up the page not because it
    cares about the page, but because the code simply wants to look up the
    physical address of the access for informational purposes, and doesn't
    really care about races when a page might be unmapped and remapped
    elsewhere.

    This adds logic to force a COW event by setting FOLL_WRITE on any
    copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
    pointer as a result.

    The current semantics end up being:

    - __get_user_pages_fast(): no change. If you don't ask for a write,
    you won't break COW. You'd better know what you're doing.

    - get_user_pages_fast(): the fast-case "look it up in the page tables
    without anything getting mmap_sem" now refuses to follow a read-only
    page, since it might need COW breaking. Which happens in the slow
    path - the fast path doesn't know if the memory might be COW or not.

    - get_user_pages() (including the slow-path fallback for gup_fast()):
    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
    very similar semantics to FOLL_FORCE.

    If it turns out that we want finer granularity (ie "only break COW when
    it might actually matter" - things like the zero page are special and
    don't need to be broken) we might need to push these semantics deeper
    into the lookup fault path. So if people care enough, it's possible
    that we might end up adding a new internal FOLL_BREAK_COW flag to go
    with the internal FOLL_COW flag we already have for tracking "I had a
    COW".

    Alternatively, if it turns out that different callers might want to
    explicitly control the forced COW break behavior, we might even want to
    make such a flag visible to the users of get_user_pages() instead of
    using the above default semantics.

    But for now, this is mostly commentary on the issue (this commit message
    being a lot bigger than the patch, and that patch in turn is almost all
    comments), with that minimal "enable COW breaking early" logic using the
    existing FOLL_WRITE behavior.
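
    A self-contained sketch of that flag fixup, using made-up flag values (the
    real gup.c logic is more involved): a FOLL_GET/FOLL_PIN lookup on a private,
    potentially-writable mapping gets upgraded to FOLL_WRITE so the COW break
    happens up front.

        /* Illustrative flag values only; not the kernel's definitions. */
        #define VM_SHARED    0x01u
        #define VM_MAYWRITE  0x02u

        #define FOLL_WRITE   0x01u
        #define FOLL_GET     0x04u
        #define FOLL_PIN     0x08u

        /* A COW mapping is private (not VM_SHARED) but may become writable. */
        static unsigned int gup_fixup_flags(unsigned int vm_flags, unsigned int gup_flags)
        {
                unsigned int is_cow = (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;

                if (is_cow && (gup_flags & (FOLL_GET | FOLL_PIN)))
                        gup_flags |= FOLL_WRITE;  /* force the COW break during the lookup */

                return gup_flags;
        }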

    [ It might be worth noting that we've always had this ambiguity, and it
    could arguably be seen as a user-space issue.

    You only get private COW mappings that could break either way in
    situations where user space is doing cooperative things (ie fork()
    before an execve() etc), but it _is_ surprising and very subtle, and
    fork() is supposed to give you independent address spaces.

    So let's treat this as a kernel issue and make the semantics of
    get_user_pages() easier to understand. Note that obviously a true
    shared mapping will still get a page that can change under us, so this
    does _not_ mean that get_user_pages() somehow returns any "stable"
    page ]

    Reported-by: Jann Horn
    Tested-by: Christoph Hellwig
    Acked-by: Oleg Nesterov
    Acked-by: Kirill Shutemov
    Acked-by: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • [ Upstream commit d4eaa2837851db2bfed572898bfc17f9a9f9151e ]

    For kvmalloc'ed data object that contains sensitive information like
    cryptographic keys, we need to make sure that the buffer is always cleared
    before freeing it. Using memset() alone for buffer clearing may not
    provide certainty as the compiler may compile it away. To be sure, the
    special memzero_explicit() has to be used.

    This patch introduces a new kvfree_sensitive() for freeing those sensitive
    data objects allocated by kvmalloc(). The relevant places where
    kvfree_sensitive() can be used are modified to use it.

    Fixes: 4f0882491a14 ("KEYS: Avoid false positive ENOMEM error on key read")
    Suggested-by: Linus Torvalds
    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Eric Biggers
    Acked-by: David Howells
    Cc: Jarkko Sakkinen
    Cc: James Morris
    Cc: "Serge E. Hallyn"
    Cc: Joe Perches
    Cc: Matthew Wilcox
    Cc: David Rientjes
    Cc: Uladzislau Rezki
    Link: http://lkml.kernel.org/r/20200407200318.11711-1-longman@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     

07 Jun, 2020

1 commit

  • commit 5bfea2d9b17f1034a68147a8b03b9789af5700f9 upstream.

    The original code in mm/mremap.c checks for a huge pmd with:

    if (is_swap_pmd(*old_pmd) || pmd_trans_huge(*old_pmd)) {

    However, a DAX-mapped nvdimm is mapped as a huge page (by default) but it
    is not a transparent huge page (_PAGE_PSE | _PAGE_DEVMAP). This commit
    changes the condition to cover that case.

    This addresses CVE-2020-10757.

    Fixes: 5c7fb56e5e3f ("mm, dax: dax-pmd vs thp-pmd vs hugetlbfs-pmd")
    Cc:
    Reported-by: Fan Yang
    Signed-off-by: Fan Yang
    Tested-by: Fan Yang
    Tested-by: Dan Williams
    Reviewed-by: Dan Williams
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Fan Yang
     

03 Jun, 2020

1 commit

  • [ Upstream commit 2f33a706027c94cd4f70fcd3e3f4a17c1ce4ea4b ]

    When collapse_file() calls try_to_release_page(), it has already isolated
    the page: so if releasing buffers happens to fail (as it sometimes does),
    remember to putback_lru_page(): otherwise that page is left unreclaimable
    and unfreeable, and the file extent uncollapsible.

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005231837500.1766@eggly.anvils
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

27 May, 2020

1 commit

  • commit 33cd65e73abd693c00c4156cf23677c453b41b3b upstream.

    During early boot, while KASAN is not yet initialized, it is possible to
    enter the reporting code path and end up in kasan_report().

    While uninitialized, the branch there prevents generating any reports;
    however, under certain circumstances when branches are being traced
    (TRACE_BRANCH_PROFILING), we may recurse deeply enough to cause kernel
    reboots without warning.

    To prevent similar issues in the future, we should disable branch tracing
    for the core runtime.

    [elver@google.com: remove duplicate DISABLE_BRANCH_PROFILING, per Qian Cai]
    Link: https://lore.kernel.org/lkml/20200517011732.GE24705@shao2-debian/
    Link: http://lkml.kernel.org/r/20200522075207.157349-1-elver@google.com
    Reported-by: kernel test robot
    Signed-off-by: Marco Elver
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Alexander Potapenko
    Cc: Andrey Ryabinin
    Cc: Qian Cai
    Cc:
    Link: http://lkml.kernel.org/r//20200517011732.GE24705@shao2-debian/
    Link: http://lkml.kernel.org/r/20200519182459.87166-1-elver@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Marco Elver
     

20 May, 2020

1 commit

  • [ Upstream commit ea0dfeb4209b4eab954d6e00ed136bc6b48b380d ]

    Recent commit 71725ed10c40 ("mm: huge tmpfs: try to split_huge_page()
    when punching hole") has allowed syzkaller to probe deeper, uncovering a
    long-standing lockdep issue between the irq-unsafe shmlock_user_lock,
    the irq-safe xa_lock on mapping->i_pages, and shmem inode's info->lock
    which nests inside xa_lock (or tree_lock) since 4.8's shmem_uncharge().

    user_shm_lock(), servicing SysV shmctl(SHM_LOCK), wants
    shmlock_user_lock while its caller shmem_lock() holds info->lock with
    interrupts disabled; but hugetlbfs_file_setup() calls user_shm_lock()
    with interrupts enabled, and might be interrupted by a writeback endio
    wanting xa_lock on i_pages.

    This may not risk an actual deadlock, since shmem inodes do not take
    part in writeback accounting, but there are several easy ways to avoid
    it.

    Requiring interrupts disabled for shmlock_user_lock would be easy, but
    it's a high-level global lock for which that seems inappropriate.
    Instead, recall that the use of info->lock to guard info->flags in
    shmem_lock() dates from pre-3.1 days, when races with SHMEM_PAGEIN and
    SHMEM_TRUNCATE could occur: nowadays it serves no purpose, the only flag
    added or removed is VM_LOCKED itself, and calls to shmem_lock() on an inode
    are already serialized by the caller.

    Take info->lock out of the chain and the possibility of deadlock or
    lockdep warning goes away.

    Fixes: 4595ef88d136 ("shmem: make shmem_inode_info::lock irq-safe")
    Reported-by: syzbot+c8a8197c8852f566b9d9@syzkaller.appspotmail.com
    Reported-by: syzbot+40b71e145e73f78f81ad@syzkaller.appspotmail.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Yang Shi
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2004161707410.16322@eggly.anvils
    Link: https://lore.kernel.org/lkml/000000000000e5838c05a3152f53@google.com/
    Link: https://lore.kernel.org/lkml/0000000000003712b305a331d3b1@google.com/
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

14 May, 2020

5 commits

  • [ Upstream commit 6bd87eec23cbc9ed222bed0f5b5b02bf300e9a8d ]

    Cache a copy of the name for the life time of the backing_dev_info
    structure so that we can reference it even after unregistering.

    Fixes: 68f23b89067f ("memcg: fix a crash in wb_workfn when a device disappears")
    Reported-by: Yufen Yu
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     
  • [ Upstream commit eb7ae5e06bb6e6ac6bb86872d27c43ebab92f6b2 ]

    bdi_dev_name is not a fast path function, move it out of line. This
    prepares for using it from modular callers without having to export
    an implementation detail like bdi_unknown_name.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jan Kara
    Reviewed-by: Greg Kroah-Hartman
    Reviewed-by: Bart Van Assche
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Christoph Hellwig
     
  • commit 11d6761218d19ca06ae5387f4e3692c4fa9e7493 upstream.

    When I run my memcg testcase, which creates lots of memcgs, I find
    unexpected out-of-memory logs even though there is still plenty of free
    memory available. The error log is

    mkdir: cannot create directory 'foo.65533': Cannot allocate memory

    The reason is that when we try to create more than MEM_CGROUP_ID_MAX
    memcgs, an -ENOMEM errno is set by mem_cgroup_css_alloc(), but the right
    errno should be -ENOSPC "No space left on device", which is an
    appropriate errno for userspace's failed mkdir.

    As the errno really misled me, we should make it right. After this
    patch, the error log will be

    mkdir: cannot create directory 'foo.65533': No space left on device

    [akpm@linux-foundation.org: s/EBUSY/ENOSPC/, per Michal]
    Fixes: 73f576c04b94 ("mm: memcontrol: fix cgroup creation failure after many small jobs")
    Suggested-by: Matthew Wilcox
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vladimir Davydov
    Link: http://lkml.kernel.org/r/20200407063621.GA18914@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/1586192163-20099-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yafang Shao
     
  • commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.

    Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an
    external fragmentation event occurs") adds a boost_watermark() function
    which increases the min watermark in a zone by at least
    pageblock_nr_pages or the number of pages in a page block.

    On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
    512M. It does this regardless of the number of pages managed in the zone
    or the likelihood of success.

    This can put the zone immediately under water in terms of allocating
    pages from the zone, and can cause a small machine to fail immediately
    due to OoM. Unlike set_recommended_min_free_kbytes(), which
    substantially increases min_free_kbytes and is tied to THP,
    boost_watermark() can be called even if THP is not active.

    The problem is most likely to appear on architectures such as Arm64
    where pageblock_nr_pages is very large.

    It is desirable to run the kdump capture kernel in as small a space as
    possible to avoid wasting memory. In some architectures, such as Arm64,
    there are restrictions on where the capture kernel can run, and
    therefore, the space available. A capture kernel running in 768M can
    fail due to OoM immediately after boost_watermark() sets the min in zone
    DMA32, where most of the memory is, to 512M. It fails even though there
    is over 500M of free memory. With boost_watermark() suppressed, the
    capture kernel can run successfully in 448M.

    This patch limits boost_watermark() to boosting a zone's min watermark
    only when there are enough pages that the boost will produce positive
    results. In this case that is estimated to be four times as many pages
    as pageblock_nr_pages.
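
    A worked sketch of that threshold with the numbers from this report (an
    illustration of the rule as described, not the mm/page_alloc.c code):

        #include <stdbool.h>
        #include <stdio.h>

        /* Boost only when the zone manages at least 4 * pageblock_nr_pages pages. */
        static bool worth_boosting(unsigned long managed_pages,
                                   unsigned long pageblock_nr_pages)
        {
                return managed_pages >= 4 * pageblock_nr_pages;
        }

        int main(void)
        {
                /* arm64, 64K base pages, 512M pageblocks: pageblock_nr_pages = 8192 */
                unsigned long pageblock = 8192;
                unsigned long zone_768m = (768UL << 20) / (64UL << 10); /* 12288 pages */

                /* 12288 < 4 * 8192 = 32768, so the 768M capture kernel is not boosted */
                printf("boost 768M zone: %s\n",
                       worth_boosting(zone_768m, pageblock) ? "yes" : "no");
                return 0;
        }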

    Mel said:

    : There is no harm in marking it stable. Clearly it does not happen very
    : often but it's not impossible. 32-bit x86 is a lot less common now
    : which would previously have been vulnerable to triggering this easily.
    : ppc64 has a larger base page size but typically only has one zone.
    : arm64 is likely the most vulnerable, particularly when CMA is
    : configured with a small movable zone.

    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Henry Willard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Henry Willard
     
  • commit e84fe99b68ce353c37ceeecc95dce9696c976556 upstream.

    Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
    e.g., while booting up.

    watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
    Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
    RIP: __pageblock_pfn_to_page+0x134/0x1c0
    Call Trace:
    set_zone_contiguous+0x56/0x70
    page_alloc_init_late+0x166/0x176
    kernel_init_freeable+0xfa/0x255
    kernel_init+0xa/0x106
    ret_from_fork+0x35/0x40

    The issue becomes visible when having a lot of memory (e.g., 4TB)
    assigned to a single NUMA node - a system that can easily be created
    using QEMU. Inside VMs on a hypervisor with quite some memory
    overcommit, this is fairly easy to trigger.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Baoquan He
    Reviewed-by: Shile Zhang
    Acked-by: Michal Hocko
    Cc: Kirill Tkhai
    Cc: Shile Zhang
    Cc: Pavel Tatashin
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Alexander Duyck
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc:
    Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

10 May, 2020

1 commit

  • commit b2a84de2a2deb76a6a51609845341f508c518c03 upstream.

    Commit dcde237319e6 ("mm: Avoid creating virtual address aliases in
    brk()/mmap()/mremap()") changed mremap() so that only the 'old' address
    is untagged, leaving the 'new' address in the form it was passed from
    userspace. This prevents the unexpected creation of aliasing virtual
    mappings in userspace, but looks a bit odd when you read the code.

    Add a comment justifying the untagging behaviour in mremap().

    Reported-by: Linus Torvalds
    Acked-by: Linus Torvalds
    Reviewed-by: Catalin Marinas
    Signed-off-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     

02 May, 2020

1 commit

  • commit 94b7cc01da5a3cc4f3da5e0ff492ef008bb555d6 upstream.

    Syzbot reported the below lockdep splat:

    WARNING: possible irq lock inversion dependency detected
    5.6.0-rc7-syzkaller #0 Not tainted
    --------------------------------------------------------
    syz-executor.0/10317 just changed the state of lock:
    ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: spin_lock include/linux/spinlock.h:338 [inline]
    ffff888021d16568 (&(&info->lock)->rlock){+.+.}, at: shmem_mfill_atomic_pte+0x1012/0x21c0 mm/shmem.c:2407
    but this lock was taken by another, SOFTIRQ-safe lock in the past:
    (&(&xa->xa_lock)->rlock#5){..-.}

    and interrupts could create inverse lock ordering between them.

    other info that might help us debug this:
    Possible interrupt unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&(&info->lock)->rlock);
                                   local_irq_disable();
                                   lock(&(&xa->xa_lock)->rlock#5);
                                   lock(&(&info->lock)->rlock);
      <Interrupt>
        lock(&(&xa->xa_lock)->rlock#5);

     *** DEADLOCK ***

    The full report is quite lengthy, please see:

    https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2004152007370.13597@eggly.anvils/T/#m813b412c5f78e25ca8c6c7734886ed4de43f241d

    This is because CPU 0 holds info->lock with IRQs enabled in the
    userfaultfd_copy path, while CPU 1 is splitting a THP and holds xa_lock
    and info->lock in IRQ-disabled context at the same time. If a softirq
    comes in to acquire xa_lock, the deadlock is triggered.

    The fix is to acquire/release info->lock with the *_irq variants instead
    of plain spin_{lock,unlock} to make it softirq safe.

    Fixes: 4c27fe4c4c84 ("userfaultfd: shmem: add shmem_mcopy_atomic_pte for userfaultfd support")
    Reported-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+e27980339d305f2dbfd9@syzkaller.appspotmail.com
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/1587061357-122619-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Yang Shi
     

29 Apr, 2020

3 commits

  • commit 56df70a63ed5d989c1d36deee94cae14342be6e9 upstream.

    find_mergeable_vma() can return NULL. In that case we crash when we later
    access vm_mm (its offset is 0x40) in write_protect_page(). This did happen
    on our server. The following call trace was captured on kernel 4.19, with
    the commit below applied and KSM zero pages enabled on our server.

    commit e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring")

    So add a vma check to fix it.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
    Oops: 0000 [#1] SMP NOPTI
    CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9
    RIP: try_to_merge_one_page+0xc7/0x760
    Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4
    60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8
    8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48
    RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246
    RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000
    RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000
    RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577
    R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000
    R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40
    FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    ksm_scan_thread+0x115e/0x1960
    kthread+0xf5/0x130
    ret_from_fork+0x1f/0x30

    [songmuchun@bytedance.com: if the vma is out of date, just exit]
    Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
    [akpm@linux-foundation.org: add the conventional braces, replace /** with /*]
    Fixes: e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring")
    Co-developed-by: Xiongchun Duan
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kirill Tkhai
    Cc: Hugh Dickins
    Cc: Yang Shi
    Cc: Claudio Imbrenda
    Cc: Markus Elfring
    Cc:
    Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
    Link: http://lkml.kernel.org/r/20200414132905.83819-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     
  • commit 3c1d7e6ccb644d517a12f73a7ff200870926f865 upstream.

    Our machine encountered a panic (addressing exception) after running for a
    long time, and the calltrace is:

    RIP: hugetlb_fault+0x307/0xbe0
    RSP: 0018:ffff9567fc27f808 EFLAGS: 00010286
    RAX: e800c03ff1258d48 RBX: ffffd3bb003b69c0 RCX: e800c03ff1258d48
    RDX: 17ff3fc00eda72b7 RSI: 00003ffffffff000 RDI: e800c03ff1258d48
    RBP: ffff9567fc27f8c8 R08: e800c03ff1258d48 R09: 0000000000000080
    R10: ffffaba0704c22a8 R11: 0000000000000001 R12: ffff95c87b4b60d8
    R13: 00005fff00000000 R14: 0000000000000000 R15: ffff9567face8074
    FS: 00007fe2d9ffb700(0000) GS:ffff956900e40000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: ffffd3bb003b69c0 CR3: 000000be67374000 CR4: 00000000003627e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    follow_hugetlb_page+0x175/0x540
    __get_user_pages+0x2a0/0x7e0
    __get_user_pages_unlocked+0x15d/0x210
    __gfn_to_pfn_memslot+0x3c5/0x460 [kvm]
    try_async_pf+0x6e/0x2a0 [kvm]
    tdp_page_fault+0x151/0x2d0 [kvm]
    ...
    kvm_arch_vcpu_ioctl_run+0x330/0x490 [kvm]
    kvm_vcpu_ioctl+0x309/0x6d0 [kvm]
    do_vfs_ioctl+0x3f0/0x540
    SyS_ioctl+0xa1/0xc0
    system_call_fastpath+0x22/0x27

    For 1G hugepages, huge_pte_offset() wants to return NULL or pudp, but it
    may return a wrong 'pmdp' if there is a race. Please look at the
    following code snippet:

    ...
    pud = pud_offset(p4d, addr);
    if (sz != PUD_SIZE && pud_none(*pud))
            return NULL;
    /* hugepage or swap? */
    if (pud_huge(*pud) || !pud_present(*pud))
            return (pte_t *)pud;

    pmd = pmd_offset(pud, addr);
    if (sz != PMD_SIZE && pmd_none(*pmd))
            return NULL;
    /* hugepage or swap? */
    if (pmd_huge(*pmd) || !pmd_present(*pmd))
            return (pte_t *)pmd;
    ...

    The following sequence would trigger this bug:

    - CPU0: sz = PUD_SIZE and *pud = 0 , continue
    - CPU0: "pud_huge(*pud)" is false
    - CPU1: calling hugetlb_no_page and set *pud to xxxx8e7(PRESENT)
    - CPU0: "!pud_present(*pud)" is false, continue
    - CPU0: pmd = pmd_offset(pud, addr) and maybe return a wrong pmdp

    However, we want CPU0 to return NULL or pudp in this case.

    We must make sure there is exactly one dereference of pud and pmd.
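
    A self-contained illustration of the "dereference exactly once" rule, with
    READ_ONCE re-defined locally for the sketch (the hugetlb fix itself operates
    on the pud/pmd entries, not on this toy variable):

        #include <stdbool.h>
        #include <stdint.h>

        #define READ_ONCE(x) (*(volatile __typeof__(x) *)&(x))

        /* An entry that another CPU may change from 0 (none) to present at any time. */
        static uint64_t shared_entry;

        static bool entry_usable(void)
        {
                /* Take one snapshot and test only the snapshot; re-reading *pud/*pmd
                 * between the "none" and "present"/"huge" checks is what allowed the
                 * race described above. */
                uint64_t entry = READ_ONCE(shared_entry);

                if (entry == 0)
                        return false;          /* none: nothing mapped             */
                return (entry & 0x1) != 0;     /* all later tests use the snapshot */
        }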

    Signed-off-by: Longpeng
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Reviewed-by: Jason Gunthorpe
    Cc: Matthew Wilcox
    Cc: Sean Christopherson
    Cc:
    Link: http://lkml.kernel.org/r/20200413010342.771-1-longpeng2@huawei.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Longpeng
     
  • commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.

    remap_vmalloc_range() has had various issues with the bounds checks it
    promises to perform ("This function checks that addr is a valid
    vmalloc'ed area, and that it is big enough to cover the vma") over time,
    e.g.:

    - not detecting pgoff<<PAGE_SHIFT overflow
    - not detecting (pgoff<<PAGE_SHIFT)+usize overflow
    - not checking whether addr and addr+(pgoff<<PAGE_SHIFT) are the same
    vmalloc allocation
    - comparing a potentially wildly out-of-bounds pointer with the end of
    the vmalloc region
    Signed-off-by: Andrew Morton
    Cc: stable@vger.kernel.org
    Cc: Alexei Starovoitov
    Cc: Daniel Borkmann
    Cc: Martin KaFai Lau
    Cc: Song Liu
    Cc: Yonghong Song
    Cc: Andrii Nakryiko
    Cc: John Fastabend
    Cc: KP Singh
    Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

17 Apr, 2020

1 commit

  • commit 9b8b17541f13809d06f6f873325305ddbb760e3e upstream.

    If a cgroup violates its memory.high constraints, we may end up unduly
    penalising it. For example, for the following hierarchy:

    A: max high, 20 usage
    A/B: 9 high, 10 usage
    A/C: max high, 10 usage

    We would end up doing the following calculation below when calculating
    high delay for A/B:

    A/B: 10 - 9 = 1...
    A: 20 - PAGE_COUNTER_MAX = 21, so set max_overage to 21.

    This gets worse with higher disparities in usage in the parent.

    I have no idea how this disappeared from the final version of the patch,
    but it is certainly Not Good(tm). This wasn't obvious in testing because,
    for a simple cgroup hierarchy with only one child, the result is usually
    roughly the same. It's only in more complex hierarchies that things go
    really awry (although even then, the effects are limited to at most 2
    seconds in schedule_timeout_killable).

    [chris@chrisdown.name: changelog]
    Fixes: e26733e0d0ec ("mm, memcg: throttle allocators based on ancestral memory.high")
    Signed-off-by: Jakub Kicinski
    Signed-off-by: Chris Down
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: [5.4.x]
    Link: http://lkml.kernel.org/r/20200331152424.GA1019937@chrisdown.name
    Signed-off-by: Linus Torvalds
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Jakub Kicinski
     

13 Apr, 2020

1 commit

  • commit 1ad53d9fa3f6168ebcf48a50e08b170432da2257 upstream.

    Under CONFIG_SLAB_FREELIST_HARDENED=y, the obfuscation was relatively weak
    in that the ptr and ptr address were usually so close that the first XOR
    would result in an almost entirely 0-byte value[1], leaving most of the
    "secret" number ultimately being stored after the third XOR. A single
    blind memory content exposure of the freelist was generally sufficient to
    learn the secret.

    Add a swab() call to mix bits a little more. This is a cheap way (1
    cycle) to make attacks need more than a single exposure to learn the
    secret (or to know _where_ the exposure is in memory).

    kmalloc-32 freelist walk, before:

    ptr              ptr_addr             stored value      secret
    ffff90c22e019020@ffff90c22e019000 is 86528eb656b3b5bd (86528eb656b3b59d)
    ffff90c22e019040@ffff90c22e019020 is 86528eb656b3b5fd (86528eb656b3b59d)
    ffff90c22e019060@ffff90c22e019040 is 86528eb656b3b5bd (86528eb656b3b59d)
    ffff90c22e019080@ffff90c22e019060 is 86528eb656b3b57d (86528eb656b3b59d)
    ffff90c22e0190a0@ffff90c22e019080 is 86528eb656b3b5bd (86528eb656b3b59d)
    ...

    after:

    ptr              ptr_addr             stored value      secret
    ffff9eed6e019020@ffff9eed6e019000 is 793d1135d52cda42 (86528eb656b3b59d)
    ffff9eed6e019040@ffff9eed6e019020 is 593d1135d52cda22 (86528eb656b3b59d)
    ffff9eed6e019060@ffff9eed6e019040 is 393d1135d52cda02 (86528eb656b3b59d)
    ffff9eed6e019080@ffff9eed6e019060 is 193d1135d52cdae2 (86528eb656b3b59d)
    ffff9eed6e0190a0@ffff9eed6e019080 is f93d1135d52cdac2 (86528eb656b3b59d)

    [1] https://blog.infosectcbr.com.au/2020/03/weaknesses-in-linux-kernel-heap.html
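
    A userspace illustration of the scheme (the stored value is ptr ^ secret ^
    ptr_addr, and after the patch the address is byte-swapped first;
    __builtin_bswap64() stands in for the kernel's swab(), and the ptr,
    ptr_addr and secret values are taken from the "before" walk above):

    #include <stdint.h>
    #include <stdio.h>

    static uint64_t obfuscate_old(uint64_t ptr, uint64_t ptr_addr, uint64_t secret)
    {
            return ptr ^ secret ^ ptr_addr;                    /* pre-fix scheme */
    }

    static uint64_t obfuscate_new(uint64_t ptr, uint64_t ptr_addr, uint64_t secret)
    {
            return ptr ^ secret ^ __builtin_bswap64(ptr_addr); /* with the swab() */
    }

    int main(void)
    {
            uint64_t ptr      = 0xffff90c22e019020ULL;
            uint64_t ptr_addr = 0xffff90c22e019000ULL;
            uint64_t secret   = 0x86528eb656b3b59dULL;

            /* "old" reproduces the first stored value in the walk above. */
            printf("old: %016llx\n", (unsigned long long)obfuscate_old(ptr, ptr_addr, secret));
            printf("new: %016llx\n", (unsigned long long)obfuscate_new(ptr, ptr_addr, secret));
            return 0;
    }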

    Fixes: 2482ddec670f ("mm: add SLUB free list pointer obfuscation")
    Reported-by: Silvio Cesare
    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Link: http://lkml.kernel.org/r/202003051623.AF4F8CB@keescook
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

08 Apr, 2020

1 commit

  • commit aa9f7d5172fac9bf1f09e678c35e287a40a7b7dd upstream.

    Using an empty (malformed) nodelist that is not caught during mount option
    parsing leads to a stack-out-of-bounds access.

    The option string that was used was: "mpol=prefer:,". However,
    MPOL_PREFERRED requires a single node number, which is not being provided
    here.

    Add a check that 'nodes' is not empty after parsing for MPOL_PREFERRED's
    nodeid.
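
    A userspace sketch of the added guard (the parse helper below is
    illustrative, not mm/mempolicy.c itself): after splitting
    "prefer:<nodelist>", reject the option when the nodelist part is empty or
    non-numeric instead of continuing with an empty node set.

    #include <stdio.h>
    #include <string.h>

    /* Returns 0 on success, -1 on a malformed "prefer:<nodelist>" option. */
    static int parse_prefer(const char *opt, int *nodeid)
    {
            const char *sep = strchr(opt, ':');

            if (!sep)
                    return -1;
            sep++;
            /* The fix: MPOL_PREFERRED needs exactly one node, so an empty
             * nodelist must be rejected rather than silently accepted. */
            if (*sep == '\0' || *sep == ',' || sscanf(sep, "%d", nodeid) != 1)
                    return -1;
            return 0;
    }

    int main(void)
    {
            int node;

            printf("\"prefer:0\" -> %d\n", parse_prefer("prefer:0", &node));
            printf("\"prefer:,\" -> %d\n", parse_prefer("prefer:,", &node));
            return 0;
    }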

    Fixes: 095f1fc4ebf3 ("mempolicy: rework shmem mpol parsing and display")
    Reported-by: Entropy Moe
    Reported-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Tested-by: syzbot+b055b1a6b2b958707a21@syzkaller.appspotmail.com
    Cc: Lee Schermerhorn
    Link: http://lkml.kernel.org/r/89526377-7eb6-b662-e1d8-4430928abde9@infradead.org
    Signed-off-by: Linus Torvalds
    Cc: Guenter Roeck
    Signed-off-by: Greg Kroah-Hartman

    Randy Dunlap
     

01 Apr, 2020

3 commits

  • commit 8380ce479010f2f779587b462a9b4681934297c3 upstream.

    Depending on CONFIG_VMAP_STACK and the THREAD_SIZE / PAGE_SIZE ratio the
    space for task stacks can be allocated using __vmalloc_node_range(),
    alloc_pages_node() and kmem_cache_alloc_node().

    In the first and second cases the page->mem_cgroup pointer is set, but in
    the third it is not: memcg membership of a slab page should be determined
    using the memcg_from_slab_page() function, which looks at
    page->slab_cache->memcg_params.memcg. In this case, using
    mod_memcg_page_state() (as in account_kernel_stack()) is incorrect: the
    page->mem_cgroup pointer is NULL even for pages charged to a non-root
    memory cgroup.

    It can lead to kernel_stack per-memcg counters permanently showing 0 on
    some architectures (depending on the configuration).

    In order to fix it, let's introduce a mod_memcg_obj_state() helper, which
    takes a pointer to a kernel object as its first argument, uses
    mem_cgroup_from_obj() to get an RCU-protected memcg pointer and calls
    mod_memcg_state(). This handles all possible configurations
    (CONFIG_VMAP_STACK and various THREAD_SIZE/PAGE_SIZE values) without
    spilling any memcg/kmem specifics into fork.c.
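
    A compilable userspace sketch of the helper's shape (every type and
    function here is a stand-in; only the call sequence (RCU read lock,
    mem_cgroup_from_obj(), mod_memcg_state()) mirrors the description above):

    #include <stdio.h>

    struct mem_cgroup { const char *name; };

    static void rcu_read_lock(void)   { /* no-op stand-in */ }
    static void rcu_read_unlock(void) { /* no-op stand-in */ }

    /* Stand-in: resolve a kernel object (page-, vmalloc- or slab-backed)
     * to its owning memcg. */
    static struct mem_cgroup *mem_cgroup_from_obj(void *p)
    {
            static struct mem_cgroup memcg = { "task-stack-memcg" };

            return p ? &memcg : NULL;
    }

    static void mod_memcg_state(struct mem_cgroup *memcg, const char *idx, int val)
    {
            printf("%s: %s += %d\n", memcg->name, idx, val);
    }

    /* Shape of the new helper: object pointer in, counter update out. */
    static void mod_memcg_obj_state(void *p, const char *idx, int val)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
            memcg = mem_cgroup_from_obj(p);
            if (memcg)
                    mod_memcg_state(memcg, idx, val);
            rcu_read_unlock();
    }

    int main(void)
    {
            int stack_obj;

            mod_memcg_obj_state(&stack_obj, "kernel_stack_kb", 16);
            return 0;
    }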

    Note: This is a special version of the patch created for stable
    backports. It contains code from the following two patches:
    - mm: memcg/slab: introduce mem_cgroup_from_obj()
    - mm: fork: fix kernel_stack memcg stats for various stack implementations

    [guro@fb.com: introduce mem_cgroup_from_obj()]
    Link: http://lkml.kernel.org/r/20200324004221.GA36662@carbon.dhcp.thefacebook.com
    Fixes: 4d96ba353075 ("mm: memcg/slab: stop setting page->mem_cgroup pointer for slab pages")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Bharata B Rao
    Cc: Shakeel Butt
    Cc:
    Link: http://lkml.kernel.org/r/20200303233550.251375-1-guro@fb.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     
  • commit b943f045a9af9fd02f923e43fe8d7517e9961701 upstream.

    Fix the crash like this:

    BUG: Kernel NULL pointer dereference on read at 0x00000000
    Faulting instruction address: 0xc000000000c3447c
    Oops: Kernel access of bad area, sig: 11 [#1]
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    CPU: 11 PID: 7519 Comm: lt-ndctl Not tainted 5.6.0-rc7-autotest #1
    ...
    NIP [c000000000c3447c] vmemmap_populated+0x98/0xc0
    LR [c000000000088354] vmemmap_free+0x144/0x320
    Call Trace:
    section_deactivate+0x220/0x240
    __remove_pages+0x118/0x170
    arch_remove_memory+0x3c/0x150
    memunmap_pages+0x1cc/0x2f0
    devm_action_release+0x30/0x50
    release_nodes+0x2f8/0x3e0
    device_release_driver_internal+0x168/0x270
    unbind_store+0x130/0x170
    drv_attr_store+0x44/0x60
    sysfs_kf_write+0x68/0x80
    kernfs_fop_write+0x100/0x290
    __vfs_write+0x3c/0x70
    vfs_write+0xcc/0x240
    ksys_write+0x7c/0x140
    system_call+0x5c/0x68

    The crash is due to a NULL pointer dereference at

    test_bit(idx, ms->usage->subsection_map);

    because ms->usage is NULL in pfn_section_valid().

    With commit d41e2f3bd546 ("mm/hotplug: fix hot remove failure in
    SPARSEMEM|!VMEMMAP case") section_mem_map is set to NULL after
    depopulate_section_memmap(). This was done so that pfn_to_page() can work
    correctly with a kernel config that disables SPARSEMEM_VMEMMAP. With that
    config pfn_to_page() does

    __section_mem_map_addr(__sec) + __pfn;

    where

    static inline struct page *__section_mem_map_addr(struct mem_section *section)
    {
            unsigned long map = section->section_mem_map;
            map &= SECTION_MAP_MASK;
            return (struct page *)map;
    }

    Now with SPARSEMEM_VMEMMAP enabled, mem_section->usage->subsection_map is
    used to check the pfn validity (pfn_valid()). Since section_deactivate()
    releases mem_section->usage if a section is fully deactivated, a
    pfn_valid() check after a subsection deactivation causes a kernel crash.

    static inline int pfn_valid(unsigned long pfn)
    {
            ...
            return early_section(ms) || pfn_section_valid(ms, pfn);
    }

    where

    static inline int pfn_section_valid(struct mem_section *ms, unsigned long pfn)
    {
            int idx = subsection_map_index(pfn);

            return test_bit(idx, ms->usage->subsection_map);
    }

    Avoid this by clearing SECTION_HAS_MEM_MAP when mem_section->usage is
    freed. For architectures like ppc64 where large pages are used for the
    vmemmap mapping (16MB), a single vmemmap mapping can cover multiple
    sections. Hence, before a vmemmap mapping page can be freed, the kernel
    needs to make sure there are no valid sections within that mapping.
    Clearing the section valid bit before depopulate_section_memmap() enables
    this.
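
    A compilable userspace sketch of the ordering fix (the flag value, types
    and helpers are stand-ins; only the "clear the valid bit before freeing
    ->usage" ordering mirrors the patch):

    #include <stdio.h>
    #include <stdlib.h>

    #define SECTION_HAS_MEM_MAP 0x2UL               /* illustrative flag value */

    struct mem_section { unsigned long section_mem_map; unsigned long *usage; };

    static int section_valid(const struct mem_section *ms)
    {
            return (ms->section_mem_map & SECTION_HAS_MEM_MAP) != 0;
    }

    static int pfn_section_valid(const struct mem_section *ms, int idx)
    {
            /* Would crash if ->usage were freed while the section still
             * looked valid. */
            return (ms->usage[0] >> idx) & 1;
    }

    static void section_deactivate(struct mem_section *ms)
    {
            /* The fix: mark the section invalid *before* freeing ->usage,
             * so pfn_valid()-style checks bail out at section_valid()
             * instead of dereferencing a stale pointer. */
            ms->section_mem_map &= ~SECTION_HAS_MEM_MAP;
            free(ms->usage);
            ms->usage = NULL;
    }

    int main(void)
    {
            struct mem_section ms = { SECTION_HAS_MEM_MAP, calloc(1, sizeof(unsigned long)) };

            ms.usage[0] = 0x1;
            printf("before: valid=%d pfn ok=%d\n", section_valid(&ms),
                   section_valid(&ms) && pfn_section_valid(&ms, 0));
            section_deactivate(&ms);
            printf("after:  valid=%d (usage is no longer dereferenced)\n",
                   section_valid(&ms));
            return 0;
    }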

    [aneesh.kumar@linux.ibm.com: add comment]
    Link: http://lkml.kernel.org/r/20200326133235.343616-1-aneesh.kumar@linux.ibm.com
    Link: http://lkml.kernel.org/r/20200325031914.107660-1-aneesh.kumar@linux.ibm.com
    Fixes: d41e2f3bd546 ("mm/hotplug: fix hot remove failure in SPARSEMEM|!VMEMMAP case")
    Reported-by: Sachin Sant
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Michael Ellerman
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Aneesh Kumar K.V
     
  • commit d795a90e2ba024dbf2f22107ae89c210b98b08b8 upstream.

    claim_swapfile() currently keeps the inode locked when it succeeds, or
    when the file is already a swapfile (returning -EBUSY). On the other
    error cases it leaves the inode unlocked.

    This inconsistency between the lock state and the return value is quite
    confusing and actually causes a bad unlock balance, as shown below, in
    the "bad_swap" section of __do_sys_swapon().

    This commit fixes the issue by moving the inode_lock() and the
    IS_SWAPFILE check out of claim_swapfile(). The inode is unlocked in the
    "bad_swap_unlock_inode" section, so that it is guaranteed to be unlocked
    by the time "bad_swap" is reached. Error handling code after the locking
    now jumps to "bad_swap_unlock_inode" instead of "bad_swap".
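
    A userspace sketch of the restructured error path (the pthread mutex and
    the helper below stand in for the inode lock and __do_sys_swapon(); only
    the "single unlock label before the common cleanup" pattern mirrors the
    fix):

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t inode_lock_sim = PTHREAD_MUTEX_INITIALIZER;

    static int swapon_sim(int already_swapfile, int setup_fails)
    {
            int error = 0;

            pthread_mutex_lock(&inode_lock_sim);    /* taken by the caller now,
                                                       not inside the claim helper */
            if (already_swapfile) {
                    error = -EBUSY;
                    goto bad_swap_unlock_inode;
            }
            if (setup_fails) {
                    error = -EINVAL;
                    goto bad_swap_unlock_inode;
            }
            pthread_mutex_unlock(&inode_lock_sim);
            return 0;

    bad_swap_unlock_inode:
            pthread_mutex_unlock(&inode_lock_sim);  /* lock state is unambiguous */
            /* bad_swap: common cleanup would continue here */
            return error;
    }

    int main(void)
    {
            printf("busy: %d\n", swapon_sim(1, 0));
            printf("fail: %d\n", swapon_sim(0, 1));
            printf("ok:   %d\n", swapon_sim(0, 0));
            return 0;
    }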

    =====================================
    WARNING: bad unlock balance detected!
    5.5.0-rc7+ #176 Not tainted
    -------------------------------------
    swapon/4294 is trying to release lock (&sb->s_type->i_mutex_key) at: __do_sys_swapon+0x94b/0x3550
    but there are no more locks to release!

    other info that might help us debug this:
    no locks held by swapon/4294.

    stack backtrace:
    CPU: 5 PID: 4294 Comm: swapon Not tainted 5.5.0-rc7-BTRFS-ZNS+ #176
    Hardware name: ASUS All Series/H87-PRO, BIOS 2102 07/29/2014
    Call Trace:
    dump_stack+0xa1/0xea
    print_unlock_imbalance_bug.cold+0x114/0x123
    lock_release+0x562/0xed0
    up_write+0x2d/0x490
    __do_sys_swapon+0x94b/0x3550
    __x64_sys_swapon+0x54/0x80
    do_syscall_64+0xa4/0x4b0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7f15da0a0dc7

    Fixes: 1638045c3677 ("mm: set S_SWAPFILE on blockdev swap devices")
    Signed-off-by: Naohiro Aota
    Signed-off-by: Andrew Morton
    Tested-by: Qais Youef
    Reviewed-by: Andrew Morton
    Reviewed-by: Darrick J. Wong
    Cc: Christoph Hellwig
    Cc:
    Link: http://lkml.kernel.org/r/20200206090132.154869-1-naohiro.aota@wdc.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naohiro Aota
     

25 Mar, 2020

1 commit

  • commit 763802b53a427ed3cbd419dbba255c414fdd9e7c upstream.

    Commit 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in
    __purge_vmap_area_lazy()") introduced a call to vmalloc_sync_all() in
    the vunmap() code-path. While this change was necessary to maintain
    correctness on x86-32-pae kernels, it also adds additional cycles for
    architectures that don't need it.

    Specifically on x86-64 with CONFIG_VMAP_STACK=y some people reported
    severe performance regressions in micro-benchmarks because it now also
    calls the x86-64 implementation of vmalloc_sync_all() on vunmap(). But
    the vmalloc_sync_all() implementation on x86-64 is only needed for newly
    created mappings.

    To avoid the unnecessary work on x86-64 and to gain the performance
    back, split up vmalloc_sync_all() into two functions:

    * vmalloc_sync_mappings(), and
    * vmalloc_sync_unmappings()

    Most call-sites to vmalloc_sync_all() only care about new mappings being
    synchronized. The only exception is the new call-site added in the
    above-mentioned commit.
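
    A trivial, self-contained sketch of how call-sites pick a side after the
    split (the bodies are stand-ins; only the two function names come from the
    description above):

    #include <stdio.h>

    /* Stand-in implementations; the real ones are architecture-specific. */
    static void vmalloc_sync_mappings(void)   { puts("sync newly created mappings"); }
    static void vmalloc_sync_unmappings(void) { puts("sync unmappings"); }

    /* Most call-sites only need new mappings synchronized... */
    static void after_new_mapping(void) { vmalloc_sync_mappings(); }

    /* ...while the vunmap()/__purge_vmap_area_lazy() path needs the other half. */
    static void after_vunmap(void)      { vmalloc_sync_unmappings(); }

    int main(void)
    {
            after_new_mapping();
            after_vunmap();
            return 0;
    }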

    Shile Zhang directed us to a report of an 80% regression in reaim
    throughput.

    Fixes: 3f8fd02b1bf1 ("mm/vmalloc: Sync unmappings in __purge_vmap_area_lazy()")
    Reported-by: kernel test robot
    Reported-by: Shile Zhang
    Signed-off-by: Joerg Roedel
    Signed-off-by: Andrew Morton
    Tested-by: Borislav Petkov
    Acked-by: Rafael J. Wysocki [GHES]
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc:
    Link: http://lkml.kernel.org/r/20191009124418.8286-1-joro@8bytes.org
    Link: https://lists.01.org/hyperkitty/list/lkp@lists.01.org/thread/4D3JPPHBNOSPFK2KEPC6KGKS6J25AIDB/
    Link: http://lkml.kernel.org/r/20191113095530.228959-1-shile.zhang@linux.alibaba.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Joerg Roedel