08 Apr, 2022

1 commit

  • commit ae085d7f9365de7da27ab5c0d16b12d51ea7fca9 upstream.

    The objcg is not cleared and put for a kfence object when it is freed,
    which could lead to memory leak for struct obj_cgroup and wrong
    statistics of NR_SLAB_RECLAIMABLE_B or NR_SLAB_UNRECLAIMABLE_B.

    Since the last freed object's objcg is not cleared,
    mem_cgroup_from_obj() could return the wrong memcg when this kfence
    object, which is not charged to any objcgs, is reallocated to other
    users.

    A real world issue [1] is caused by this bug.
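    A minimal sketch of the idea behind the fix, assuming SLAB's free path
    looks roughly like the snippet below and that memcg_slab_free_hook() is
    the usual objcg-uncharge helper:

    if (is_kfence_address(objp)) {
            kmemleak_free_recursive(objp, cachep->flags);
            /* clear and put the objcg before handing the object to kfence */
            memcg_slab_free_hook(cachep, &objp, 1);
            __kfence_free(objp);
            return;
    }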

    Link: https://lore.kernel.org/all/000000000000cabcb505dae9e577@google.com/ [1]
    Reported-by: syzbot+f8c45ccc7d5d45fc5965@syzkaller.appspotmail.com
    Fixes: d3fb45f370d9 ("mm, kfence: insert KFENCE hooks for SLAB")
    Signed-off-by: Muchun Song
    Cc: Dmitry Vyukov
    Cc: Marco Elver
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     

23 Mar, 2022

1 commit

  • commit 029c4628b2eb2ca969e9bf979b05dc18d8d5575e upstream.

    In our testing, a livelock task was found. Through sysrq printing, the same
    stack was found every time, as follows:

    __swap_duplicate+0x58/0x1a0
    swapcache_prepare+0x24/0x30
    __read_swap_cache_async+0xac/0x220
    read_swap_cache_async+0x58/0xa0
    swapin_readahead+0x24c/0x628
    do_swap_page+0x374/0x8a0
    __handle_mm_fault+0x598/0xd60
    handle_mm_fault+0x114/0x200
    do_page_fault+0x148/0x4d0
    do_translation_fault+0xb0/0xd4
    do_mem_abort+0x50/0xb0

    The reason for the livelock is that swapcache_prepare() always returns
    EEXIST, indicating that SWAP_HAS_CACHE has not been cleared, so that it
    cannot jump out of the loop. We suspect that the task that clears the
    SWAP_HAS_CACHE flag never gets a chance to run. We try to lower the
    priority of the task stuck in a livelock so that the task that clears
    the SWAP_HAS_CACHE flag will run. The results show that the system
    returns to normal after the priority is lowered.

    In our testing, multiple real-time tasks are bound to the same core, and
    the task in the livelock is the highest priority task of the core, so
    the livelocked task cannot be preempted.

    Although cond_resched() is used by __read_swap_cache_async, it is an
    empty function in the preemptive system and cannot achieve the purpose
    of releasing the CPU. A high-priority task cannot release the CPU
    unless preempted by a higher-priority task. But when this task is
    already the highest priority task on this core, other tasks will not be
    able to be scheduled. So we think we should replace cond_resched() with
    schedule_timeout_uninterruptible(1). schedule_timeout_uninterruptible(1)
    will call set_current_state() first to set the task state, so the task will be
    removed from the running queue, so as to achieve the purpose of giving
    up the CPU and prevent it from running in kernel mode for too long.

    (akpm: ugly hack becomes uglier. But it fixes the issue in a
    backportable-to-stable fashion while we hopefully work on something
    better)
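    A sketch of the retry loop in __read_swap_cache_async() after the change
    (details elided; this is not the exact upstream diff):

    for (;;) {
            ...
            err = swapcache_prepare(entry);
            if (!err)
                    break;
            ...
            /*
             * Another task holds SWAP_HAS_CACHE; was cond_resched(), which
             * is a no-op for the highest-priority RT task on this core.
             */
            schedule_timeout_uninterruptible(1);
    }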

    Link: https://lkml.kernel.org/r/20220221111749.1928222-1-cgel.zte@gmail.com
    Signed-off-by: Guo Ziliang
    Reported-by: Zeal Robot
    Reviewed-by: Ran Xiaokai
    Reviewed-by: Jiang Xuexin
    Reviewed-by: Yang Yang
    Acked-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Johannes Weiner
    Cc: Roger Quadros
    Cc: Ziliang Guo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Guo Ziliang
     

09 Mar, 2022

4 commits

  • commit f2b277c4d1c63a85127e8aa2588e9cc3bd21cb99 upstream.

    Wangyong reports: after enabling tmpfs filesystem to support transparent
    hugepage with the following command:

    echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled

    the docker program tries to add F_SEAL_WRITE through the following
    command, but it fails unexpectedly with errno EBUSY:

    fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.

    That is because memfd_tag_pins() and memfd_wait_for_pins() were never
    updated for shmem huge pages: checking page_mapcount() against
    page_count() is hopeless on THP subpages - they need to check
    total_mapcount() against page_count() on THP heads only.

    Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
    (compared != 1): either can be justified, but given the non-atomic
    total_mapcount() calculation, it is better now to be strict. Bear in
    mind that total_mapcount() itself scans all of the THP subpages, when
    choosing to take an XA_CHECK_SCHED latency break.

    Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
    page has been swapped out since memfd_tag_pins(), then its refcount must
    have fallen, and so it can safely be untagged.
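    Conceptually, the per-page check becomes the following (a sketch of the
    idea only, not the exact diff):

    page = thp_head(page);
    /* any reference beyond the page cache's own means a possible pin */
    if (page_count(page) - total_mapcount(page) != 1)
            xas_set_mark(&xas, MEMFD_TAG_PINNED);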

    Link: https://lkml.kernel.org/r/a4f79248-df75-2c8c-3df-ba3317ccb5da@google.com
    Signed-off-by: Hugh Dickins
    Reported-by: Zeal Robot
    Reported-by: wangyong
    Cc: Mike Kravetz
    Cc: Matthew Wilcox (Oracle)
    Cc: CGEL ZTE
    Cc: Kirill A. Shutemov
    Cc: Song Liu
    Cc: Yang Yang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     
  • commit 0708a0afe291bdfe1386d74d5ec1f0c27e8b9168 upstream.

    syzkaller was recently triggering an oversized kvmalloc() warning via
    xdp_umem_create().

    The triggered warning was added back in 7661809d493b ("mm: don't allow
    oversized kvmalloc() calls"). The rationale for the warning for huge
    kvmalloc sizes was a reaction to a security bug where the size was
    more than UINT_MAX but not everything was prepared to handle unsigned
    long sizes.

    Anyway, the AF_XDP related call trace from this syzkaller report was:

    kvmalloc include/linux/mm.h:806 [inline]
    kvmalloc_array include/linux/mm.h:824 [inline]
    kvcalloc include/linux/mm.h:829 [inline]
    xdp_umem_pin_pages net/xdp/xdp_umem.c:102 [inline]
    xdp_umem_reg net/xdp/xdp_umem.c:219 [inline]
    xdp_umem_create+0x6a5/0xf00 net/xdp/xdp_umem.c:252
    xsk_setsockopt+0x604/0x790 net/xdp/xsk.c:1068
    __sys_setsockopt+0x1fd/0x4e0 net/socket.c:2176
    __do_sys_setsockopt net/socket.c:2187 [inline]
    __se_sys_setsockopt net/socket.c:2184 [inline]
    __x64_sys_setsockopt+0xb5/0x150 net/socket.c:2184
    do_syscall_x64 arch/x86/entry/common.c:50 [inline]
    do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Björn mentioned that requests for >2GB allocation can still be valid:

    The structure that is being allocated is the page-pinning accounting.
    AF_XDP has an internal limit of U32_MAX pages, which is *a lot*, but
    still fewer than what memcg allows (PAGE_COUNTER_MAX is LONG_MAX/
    PAGE_SIZE on 64 bit systems). [...]

    I could just change from U32_MAX to INT_MAX, but as I stated earlier
    that has a hacky feeling to it. [...] From my perspective, the code
    isn't broken, with the memcg limits in consideration. [...]

    Linus says:

    [...] Pretty much every time this has come up, the kernel warning has
    shown that yes, the code was broken and there really wasn't a reason
    for doing allocations that big.

    Of course, some people would be perfectly fine with the allocation
    failing, they just don't want the warning. I didn't want __GFP_NOWARN
    to shut it up originally because I wanted people to see all those
    cases, but these days I think we can just say "yeah, people can shut
    it up explicitly by saying 'go ahead and fail this allocation, don't
    warn about it'".

    So enough time has passed that by now I'd certainly be ok with [it].

    Thus allow call-sites to silence such userspace triggered splats if the
    allocation requests have __GFP_NOWARN. For xdp_umem_pin_pages()'s call
    to kvcalloc() this is already the case, so nothing else needed there.
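    A sketch of the resulting size check in kvmalloc_node() (assuming that is
    where the warning lives):

    /* Don't even allow crazy sizes */
    if (unlikely(size > INT_MAX)) {
            WARN_ON_ONCE(!(flags & __GFP_NOWARN));
            return NULL;
    }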

    Fixes: 7661809d493b ("mm: don't allow oversized kvmalloc() calls")
    Reported-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
    Suggested-by: Linus Torvalds
    Signed-off-by: Daniel Borkmann
    Tested-by: syzbot+11421fbbff99b989670e@syzkaller.appspotmail.com
    Cc: Björn Töpel
    Cc: Magnus Karlsson
    Cc: Willy Tarreau
    Cc: Andrew Morton
    Cc: Alexei Starovoitov
    Cc: Andrii Nakryiko
    Cc: Jakub Kicinski
    Cc: David S. Miller
    Link: https://lore.kernel.org/bpf/CAJ+HfNhyfsT5cS_U9EC213ducHs9k9zNxX9+abqC0kTrPbQ0gg@mail.gmail.com
    Link: https://lore.kernel.org/bpf/20211201202905.b9892171e3f5b9a60f9da251@linux-foundation.org
    Reviewed-by: Leon Romanovsky
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     
  • [ Upstream commit 26dca996ea7b1ac7008b6b6063fc88b849e3ac3e ]

    KASAN's quarantine might save its metadata inside freed objects. As
    this happens after the memory is zeroed by the slab allocator when
    init_on_free is enabled, the memory coming out of quarantine is not
    properly zeroed.

    This causes lib/test_meminit.c tests to fail with Generic KASAN.

    Zero the metadata when the object is removed from quarantine.
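    A sketch of the idea in the quarantine free path, assuming KASAN's free
    metadata sits at offset 0 inside the object:

    /* re-zero the in-object metadata written after the allocator's memset */
    if (slab_want_init_on_free(cache) &&
        cache->kasan_info.free_meta_offset == 0)
            memzero_explicit(meta, sizeof(*meta));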

    Link: https://lkml.kernel.org/r/2805da5df4b57138fdacd671f5d227d58950ba54.1640037083.git.andreyknvl@google.com
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Marco Elver
    Cc: Alexander Potapenko
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Andrey Konovalov
     
  • [ Upstream commit 60115fa54ad7b913b7cb5844e6b7ffeb842d55f2 ]

    Yongqiang reports a kmemleak panic during module insmod/rmmod with KASAN
    enabled (without KASAN_VMALLOC) on x86 [1].

    When the module area allocates memory, its kmemleak_object is created
    successfully, but the KASAN shadow memory of the module allocation is not
    ready, so when kmemleak scans the module's pointer, it will panic due to
    missing shadow memory for the KASAN check.

    module_alloc
      __vmalloc_node_range
        kmemleak_vmalloc
                                kmemleak_scan
                                  update_checksum
      kasan_module_alloc
        kmemleak_ignore

    Note, there is no problem if KASAN_VMALLOC is enabled, since the module
    area's entire shadow memory is preallocated. Thus, the bug only exists on
    architectures that dynamically allocate the module area's shadow memory
    per module load; for now, only x86/arm64/s390 are involved.

    Add a VM_DEFER_KMEMLEAK flag and defer the kmemleak registration of the
    vmalloc'ed object in module_alloc() to fix this issue.

    [1] https://lore.kernel.org/all/6d41e2b9-4692-5ec4-b1cd-cbe29ae89739@huawei.com/
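    A sketch of how an arch module_alloc() ends up using the new flag (x86
    specifics simplified; treat the exact argument list as an assumption):

    p = __vmalloc_node_range(size, MODULE_ALIGN, MODULES_VADDR, MODULES_END,
                             gfp_mask, PAGE_KERNEL, VM_DEFER_KMEMLEAK,
                             NUMA_NO_NODE, __builtin_return_address(0));
    /* kmemleak registration happens only once the KASAN shadow exists */
    if (p && (kasan_module_alloc(p, size, gfp_mask) < 0)) {
            vfree(p);
            return NULL;
    }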

    [wangkefeng.wang@huawei.com: fix build]
    Link: https://lkml.kernel.org/r/20211125080307.27225-1-wangkefeng.wang@huawei.com
    [akpm@linux-foundation.org: simplify ifdefs, per Andrey]
    Link: https://lkml.kernel.org/r/CA+fCnZcnwJHUQq34VuRxpdoY6_XbJCDJ-jopksS5Eia4PijPzw@mail.gmail.com

    Link: https://lkml.kernel.org/r/20211124142034.192078-1-wangkefeng.wang@huawei.com
    Fixes: 793213a82de4 ("s390/kasan: dynamic shadow mem allocation for modules")
    Fixes: 39d114ddc682 ("arm64: add KASAN support")
    Fixes: bebf56a1b176 ("kasan: enable instrumentation of global variables")
    Signed-off-by: Kefeng Wang
    Reported-by: Yongqiang Liu
    Cc: Andrey Konovalov
    Cc: Andrey Ryabinin
    Cc: Dmitry Vyukov
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Cc: Alexander Gordeev
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: Alexander Potapenko
    Cc: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Kefeng Wang
     

02 Mar, 2022

2 commits

  • commit c94afc46cae7ad41b2ad6a99368147879f4b0e56 upstream.

    memblock.{reserved,memory}.regions may be allocated using kmalloc() in
    memblock_double_array(). Use kfree() to release these kmalloced regions
    indicated by memblock_{reserved,memory}_in_slab.
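    A sketch of the change in memblock_discard(), assuming the existing
    late-free path is kept for the memblock-allocated case:

    if (memblock_reserved_in_slab)
            kfree(memblock.reserved.regions);
    else
            memblock_free_late(addr, size);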

    Signed-off-by: Miaohe Lin
    Fixes: 3010f876500f ("mm: discard memblock data later")
    Signed-off-by: Mike Rapoport
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • When a THP is present in the page cache, we can return it several times,
    leading to userspace seeing the same data repeatedly if doing a read()
    that crosses a 64-page boundary. This is probably not a security issue
    (since the data all comes from the same file), but it can be interpreted
    as a transient data corruption issue. Fortunately, it is very rare as
    it can only occur when CONFIG_READ_ONLY_THP_FOR_FS is enabled, and it can
    only happen to executables. We don't often call read() on executables.

    This bug is fixed differently in v5.17 by commit 6b24ca4a1a8d
    ("mm: Use multi-index entries in the page cache"). That commit is
    unsuitable for backporting, so fix this in the clearest way. It
    sacrifices a little performance for clarity, but this should never
    be a performance path in these kernel versions.

    Fixes: cbd59c48ae2b ("mm/filemap: use head pages in generic_file_buffered_read")
    Cc: stable@vger.kernel.org # v5.15, v5.16
    Link: https://lore.kernel.org/r/df3b5d1c-a36b-2c73-3e27-99e74983de3a@suse.cz/
    Analyzed-by: Adam Majer
    Analyzed-by: Dirk Mueller
    Bisected-by: Takashi Iwai
    Reported-by: Vlastimil Babka
    Tested-by: Vlastimil Babka
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox (Oracle)
     

23 Feb, 2022

1 commit

  • commit 80d47f5de5e311cbc0d01ebb6ee684e8f4c196c6 upstream.

    Oded Gabbay reports that enabling NUMA balancing causes corruption with
    his Gaudi accelerator test load:

    "All the details are in the bug, but the bottom line is that somehow,
    this patch causes corruption when the numa balancing feature is
    enabled AND we don't use process affinity AND we use GUP to pin pages
    so our accelerator can DMA to/from system memory.

    Either disabling numa balancing, using process affinity to bind to
    specific numa-node or reverting this patch causes the bug to
    disappear"

    and Oded bisected the issue to commit 09854ba94c6a ("mm: do_wp_page()
    simplification").

    Now, the NUMA balancing shouldn't actually be changing the writability
    of a page, and as such shouldn't matter for COW. But it appears it
    does. Suspicious.

    However, regardless of that, the condition for enabling NUMA faults in
    change_pte_range() is nonsensical. It uses "page_mapcount(page)" to
    decide if a COW page should be NUMA-protected or not, and that makes
    absolutely no sense.

    The number of mappings a page has is irrelevant: not only does GUP get a
    reference to a page as in Oded's case, but the other mappings might be
    paged out and the only reference to them would be in the page count.

    Since we should never try to NUMA-balance a page that we can't move
    anyway due to other references, just fix the code to use 'page_count()'.
    Oded confirms that that fixes his issue.
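    Conceptually, the NUMA-protection check in change_pte_range() becomes the
    following (a sketch of the idea only; the surrounding conditions are
    elided):

    /* was: page_mapcount(page) > 1, which misses GUP references */
    if (page_count(page) != 1)
            continue;       /* extra references: not migratable, skip it */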

    Now, this does imply that something in NUMA balancing ends up changing
    page protections (other than the obvious one of making the page
    inaccessible to get the NUMA faulting information). Otherwise the COW
    simplification wouldn't matter - since doing the GUP on the page would
    make sure it's writable.

    The cause of that permission change would be good to figure out too,
    since it clearly results in spurious COW events - but fixing the
    nonsensical test that just happened to work before is obviously the
    CorrectThing(tm) to do regardless.

    Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=215616
    Link: https://lore.kernel.org/all/CAFCwf10eNmwq2wD71xjUhqkvv5+_pJMR1nPug2RqNDcFT4H86Q@mail.gmail.com/
    Reported-and-tested-by: Oded Gabbay
    Cc: David Hildenbrand
    Cc: Peter Xu
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

16 Feb, 2022

1 commit

  • commit 0764db9b49c932b89ee4d9e3236dff4bb07b4a66 upstream.

    Alexander reported a circular lock dependency revealed by the mmap1 ltp
    test:

    LOCKDEP_CIRCULAR (suite: ltp, case: mtest06 (mmap1))
    WARNING: possible circular locking dependency detected
    5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1 Not tainted
    ------------------------------------------------------
    mmap1/202299 is trying to acquire lock:
    00000001892c0188 (css_set_lock){..-.}-{2:2}, at: obj_cgroup_release+0x4a/0xe0
    but task is already holding lock:
    00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #1 (&sighand->siglock){-.-.}-{2:2}:
    __lock_acquire+0x604/0xbd8
    lock_acquire.part.0+0xe2/0x238
    lock_acquire+0xb0/0x200
    _raw_spin_lock_irqsave+0x6a/0xd8
    __lock_task_sighand+0x90/0x190
    cgroup_freeze_task+0x2e/0x90
    cgroup_migrate_execute+0x11c/0x608
    cgroup_update_dfl_csses+0x246/0x270
    cgroup_subtree_control_write+0x238/0x518
    kernfs_fop_write_iter+0x13e/0x1e0
    new_sync_write+0x100/0x190
    vfs_write+0x22c/0x2d8
    ksys_write+0x6c/0xf8
    __do_syscall+0x1da/0x208
    system_call+0x82/0xb0
    -> #0 (css_set_lock){..-.}-{2:2}:
    check_prev_add+0xe0/0xed8
    validate_chain+0x736/0xb20
    __lock_acquire+0x604/0xbd8
    lock_acquire.part.0+0xe2/0x238
    lock_acquire+0xb0/0x200
    _raw_spin_lock_irqsave+0x6a/0xd8
    obj_cgroup_release+0x4a/0xe0
    percpu_ref_put_many.constprop.0+0x150/0x168
    drain_obj_stock+0x94/0xe8
    refill_obj_stock+0x94/0x278
    obj_cgroup_charge+0x164/0x1d8
    kmem_cache_alloc+0xac/0x528
    __sigqueue_alloc+0x150/0x308
    __send_signal+0x260/0x550
    send_signal+0x7e/0x348
    force_sig_info_to_task+0x104/0x180
    force_sig_fault+0x48/0x58
    __do_pgm_check+0x120/0x1f0
    pgm_check_handler+0x11e/0x180
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0                          CPU1
    ----                          ----
    lock(&sighand->siglock);
                                  lock(css_set_lock);
                                  lock(&sighand->siglock);
    lock(css_set_lock);
    *** DEADLOCK ***
    2 locks held by mmap1/202299:
    #0: 00000000ca3b3818 (&sighand->siglock){-.-.}-{2:2}, at: force_sig_info_to_task+0x38/0x180
    #1: 00000001892ad560 (rcu_read_lock){....}-{1:2}, at: percpu_ref_put_many.constprop.0+0x0/0x168
    stack backtrace:
    CPU: 15 PID: 202299 Comm: mmap1 Not tainted 5.17.0-20220113.rc0.git0.f2211f194038.300.fc35.s390x+debug #1
    Hardware name: IBM 3906 M04 704 (LPAR)
    Call Trace:
    dump_stack_lvl+0x76/0x98
    check_noncircular+0x136/0x158
    check_prev_add+0xe0/0xed8
    validate_chain+0x736/0xb20
    __lock_acquire+0x604/0xbd8
    lock_acquire.part.0+0xe2/0x238
    lock_acquire+0xb0/0x200
    _raw_spin_lock_irqsave+0x6a/0xd8
    obj_cgroup_release+0x4a/0xe0
    percpu_ref_put_many.constprop.0+0x150/0x168
    drain_obj_stock+0x94/0xe8
    refill_obj_stock+0x94/0x278
    obj_cgroup_charge+0x164/0x1d8
    kmem_cache_alloc+0xac/0x528
    __sigqueue_alloc+0x150/0x308
    __send_signal+0x260/0x550
    send_signal+0x7e/0x348
    force_sig_info_to_task+0x104/0x180
    force_sig_fault+0x48/0x58
    __do_pgm_check+0x120/0x1f0
    pgm_check_handler+0x11e/0x180
    INFO: lockdep is turned off.

    In this example a slab allocation from __send_signal() caused a
    refilling and draining of a percpu objcg stock, resulting in the release
    of another, unrelated objcg. The objcg release path requires taking the
    css_set_lock, which is used to synchronize objcg lists.

    This can create a circular dependency with the sighandler lock, which is
    taken with the locked css_set_lock by the freezer code (to freeze a
    task).

    In general it seems that using css_set_lock to synchronize objcg lists
    makes any slab allocations and deallocation with the locked css_set_lock
    and any intervened locks risky.

    To fix the problem and make the code more robust let's stop using
    css_set_lock to synchronize objcg lists and use a new dedicated spinlock
    instead.
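    A sketch of the change, assuming a file-local lock in mm/memcontrol.c:

    static DEFINE_SPINLOCK(objcg_lock);

    /*
     * objcg list manipulation (e.g. in obj_cgroup_release() and during
     * reparenting) now nests under objcg_lock instead of css_set_lock.
     */
    spin_lock_irqsave(&objcg_lock, flags);
    list_del(&objcg->list);
    spin_unlock_irqrestore(&objcg_lock, flags);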

    Link: https://lkml.kernel.org/r/Yfm1IHmoGdyUR81T@carbon.dhcp.thefacebook.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Roman Gushchin
    Reported-by: Alexander Egorenkov
    Tested-by: Alexander Egorenkov
    Reviewed-by: Waiman Long
    Acked-by: Tejun Heo
    Reviewed-by: Shakeel Butt
    Reviewed-by: Jeremy Linton
    Tested-by: Jeremy Linton
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Roman Gushchin
     

09 Feb, 2022

2 commits

  • commit c10a0f877fe007021d70f9cada240f42adc2b5db upstream.

    When using devm_request_free_mem_region() and devm_memremap_pages() to
    add ZONE_DEVICE memory, if the requested free mem region's end pfn is
    huge (e.g., 0x400000000), node_end_pfn() will also be huge (see
    move_pfn_range_to_zone()). Thus it creates a huge hole between
    node_start_pfn() and node_end_pfn().

    We found that on some AMD APUs, amdkfd requested such a free mem region
    and created a huge hole. In such a case, the following code snippet was
    just doing busy test_bit() looping on the huge hole.

    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            struct page *page = pfn_to_online_page(pfn);

            if (!page)
                    continue;
            ...
    }

    So we got a soft lockup:

    watchdog: BUG: soft lockup - CPU#6 stuck for 26s! [bash:1221]
    CPU: 6 PID: 1221 Comm: bash Not tainted 5.15.0-custom #1
    RIP: 0010:pfn_to_online_page+0x5/0xd0
    Call Trace:
    ? kmemleak_scan+0x16a/0x440
    kmemleak_write+0x306/0x3a0
    ? common_file_perm+0x72/0x170
    full_proxy_write+0x5c/0x90
    vfs_write+0xb9/0x260
    ksys_write+0x67/0xe0
    __x64_sys_write+0x1a/0x20
    do_syscall_64+0x3b/0xc0
    entry_SYSCALL_64_after_hwframe+0x44/0xae
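    A sketch of the kind of fix applied here, assuming kmemleak_scan() walks
    populated zones instead of the node's full pfn span so the hole is never
    iterated:

    for_each_populated_zone(zone) {
            unsigned long start_pfn = zone->zone_start_pfn;
            unsigned long end_pfn = zone_end_pfn(zone);

            for (pfn = start_pfn; pfn < end_pfn; pfn++) {
                    struct page *page = pfn_to_online_page(pfn);

                    if (!page)
                            continue;
                    /* only scan pages belonging to this zone */
                    if (page_zone(page) != zone)
                            continue;
                    ...
            }
    }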

    I did some tests with the patch.

    (1) amdgpu module unloaded

    before the patch:

    real 0m0.976s
    user 0m0.000s
    sys 0m0.968s

    after the patch:

    real 0m0.981s
    user 0m0.000s
    sys 0m0.973s

    (2) amdgpu module loaded

    before the patch:

    real 0m35.365s
    user 0m0.000s
    sys 0m35.354s

    after the patch:

    real 0m1.049s
    user 0m0.000s
    sys 0m1.042s

    Link: https://lkml.kernel.org/r/20211108140029.721144-1-lang.yu@amd.com
    Signed-off-by: Lang Yu
    Acked-by: David Hildenbrand
    Acked-by: Catalin Marinas
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Lang Yu
     
  • commit fb5222aae64fe25e5f3ebefde8214dcf3ba33ca5 upstream.

    Patch series "page table check fixes and cleanups", v5.

    This patch (of 4):

    The pte entry that is used in pte_advanced_tests() is never removed from
    the page table at the end of the test.

    The issue is detected by page_table_check; to reproduce, compile the
    kernel with the following configs:

    CONFIG_DEBUG_VM_PGTABLE=y
    CONFIG_PAGE_TABLE_CHECK=y
    CONFIG_PAGE_TABLE_CHECK_ENFORCED=y

    During the boot the following BUG is printed:

    debug_vm_pgtable: [debug_vm_pgtable ]: Validating architecture page table helpers
    ------------[ cut here ]------------
    kernel BUG at mm/page_table_check.c:162!
    invalid opcode: 0000 [#1] PREEMPT SMP PTI
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.16.0-11413-g2c271fe77d52 #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.15.0-0-g2dd4b9b3f840-prebuilt.qemu.org 04/01/2014
    ...

    The entry should be properly removed from the page table before the page
    is released to the free list.
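    A sketch of the fix at the end of pte_advanced_tests() (argument names
    assumed from the 5.x code):

    pte = ptep_get(ptep);
    WARN_ON(pte_young(pte));

    /* clear the entry so the page is returned to the allocator unmapped */
    ptep_get_and_clear_full(mm, vaddr, ptep, 1);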

    Link: https://lkml.kernel.org/r/20220131203249.2832273-1-pasha.tatashin@soleen.com
    Link: https://lkml.kernel.org/r/20220131203249.2832273-2-pasha.tatashin@soleen.com
    Fixes: a5c3b9ffb0f4 ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers")
    Signed-off-by: Pasha Tatashin
    Reviewed-by: Zi Yan
    Tested-by: Zi Yan
    Acked-by: David Rientjes
    Reviewed-by: Anshuman Khandual
    Cc: Paul Turner
    Cc: Wei Xu
    Cc: Greg Thelen
    Cc: Ingo Molnar
    Cc: Will Deacon
    Cc: Mike Rapoport
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Aneesh Kumar K.V
    Cc: Jiri Slaby
    Cc: Muchun Song
    Cc: Hugh Dickins
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pasha Tatashin
     

05 Feb, 2022

1 commit

  • commit c36c04c2e132fc39f6b658bf607aed4425427fd7 upstream.

    This reverts commit 54d516b1d62ff8f17cee2da06e5e4706a0d00b8a

    That commit did a refactoring that effectively combined fast and slow
    gup paths (again). And that was again incorrect, for two reasons:

    a) Fast gup and slow gup get reference counts on pages in different
    ways and with different goals: see Linus' writeup in commit
    cd1adf1b63a1 ("Revert "mm/gup: remove try_get_page(), call
    try_get_compound_head() directly""), and

    b) try_grab_compound_head() also has a specific check for
    "FOLL_LONGTERM && !is_pinned(page)", that assumes that the caller
    can fall back to slow gup. This resulted in new failures, as
    recently reported by Will McVicker [1].

    But (a) has problems too, even though they may not have been reported
    yet. So just revert this.

    Link: https://lore.kernel.org/r/20220131203504.3458775-1-willmcvicker@google.com [1]
    Fixes: 54d516b1d62f ("mm/gup: small refactoring: simplify try_grab_page()")
    Reported-and-tested-by: Will McVicker
    Cc: Christoph Hellwig
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Christian Borntraeger
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: stable@vger.kernel.org # 5.15
    Signed-off-by: John Hubbard
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    John Hubbard
     

29 Jan, 2022

3 commits

  • commit 5b3be698a872c490dbed524f3e2463701ab21339 upstream.

    Commit 11192d9c124d ("memcg: flush stats only if updated") added
    tracking of memcg stats updates which is used by the readers to flush
    only if the updates are over a certain threshold. However each
    individual update can correspond to a large value change for a given
    stat. For example adding or removing a hugepage to an LRU changes the
    stat by thp_nr_pages (512 on x86_64).

    Treating such a THP update as a single event can, in theory, leave the
    stat off by (thp_nr_pages * nr_cpus * CHARGE_BATCH) before a flush.

    To handle such scenarios, this patch adds consideration of the stat
    update value as well, instead of just the update event. In addition, let
    the async flusher unconditionally flush the stats to put a time limit on
    the stats skew; hopefully far fewer readers would then need to flush.
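    A sketch of the update-side accounting described above (the per-cpu
    counter and threshold names are assumptions):

    x = __this_cpu_add_return(stats_updates, abs(val));
    if (x > MEMCG_CHARGE_BATCH) {
            atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold);
            __this_cpu_write(stats_updates, 0);
    }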

    Link: https://lkml.kernel.org/r/20211118065350.697046-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Michal Koutný"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Ivan Babrou
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     
  • commit fd25a9e0e23b995fd0ba5e2f00a1099452cbc3cf upstream.

    The memcg stats can be flushed in multiple contexts and potentially in
    parallel too. For example, multiple parallel user space readers for
    memcg stats will contend on the rstat locks with each other. There is
    no need for that. We just need one flusher and everyone else can
    benefit.

    In addition after aa48e47e3906 ("memcg: infrastructure to flush memcg
    stats") the kernel periodically flush the memcg stats from the root, so,
    the other flushers will potentially have much less work to do.
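    A sketch of the single-flusher pattern described above (lock name
    assumed):

    static void __mem_cgroup_flush_stats(void)
    {
            unsigned long flag;

            /* one flusher at a time; everyone else reuses its result */
            if (!spin_trylock_irqsave(&stats_flush_lock, flag))
                    return;

            cgroup_rstat_flush_irqsafe(root_mem_cgroup->css.cgroup);
            spin_unlock_irqrestore(&stats_flush_lock, flag);
    }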

    Link: https://lkml.kernel.org/r/20211001190040.48086-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: "Michal Koutný"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Ivan Babrou
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     
  • commit 11192d9c124d58d66449b163ed0d2cdff03761a1 upstream.

    At the moment, the kernel flushes the memcg stats on every refault and
    also on every reclaim iteration. Although rstat maintains a per-cpu
    update tree, on a flush the kernel still has to go through every cpu's
    rstat update tree to check if there is anything to flush. This patch
    adds tracking on the stats update side to make the flush side smarter,
    skipping the flush if there has been no update.

    The stats update codepath is very sensitive performance wise for many
    workloads and benchmarks. So, we can not follow what the commit
    aa48e47e3906 ("memcg: infrastructure to flush memcg stats") did which
    was triggering async flush through queue_work() and caused a lot
    performance regression reports. That got reverted by the commit
    1f828223b799 ("memcg: flush lruvec stats in the refault").

    In this patch we kept the stats update codepath very minimal and let the
    stats reader side flush the stats only when the updates are over a
    specific threshold. For now the threshold is (nr_cpus * CHARGE_BATCH).
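    A sketch of the reader-side check (counter name assumed): the flush is
    skipped entirely while no updates have accumulated:

    if (!atomic_read(&stats_flush_threshold))
            return;

    cgroup_rstat_flush(root_mem_cgroup->css.cgroup);
    atomic_set(&stats_flush_threshold, 0);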

    To evaluate the impact of this patch, an 8 GiB tmpfs file is created on
    a system with swap-on-zram and the file was pushed to swap through
    memory.force_empty interface. On reading the whole file, the memcg stat
    flush in the refault code path is triggered. With this patch, we
    observed 63% reduction in the read time of 8 GiB file.

    Link: https://lkml.kernel.org/r/20211001190040.48086-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Reviewed-by: "Michal Koutný"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Ivan Babrou
    Signed-off-by: Greg Kroah-Hartman

    Shakeel Butt
     

27 Jan, 2022

4 commits

  • commit 87c01d57fa23de82fff593a7d070933d08755801 upstream.

    hmm_range_fault() can be used instead of get_user_pages() for devices
    which allow faulting however unlike get_user_pages() it will return an
    error when used on a VM_MIXEDMAP range.

    To make hmm_range_fault() more closely match get_user_pages() remove
    this restriction. This requires dealing with the !ARCH_HAS_PTE_SPECIAL
    case in hmm_vma_handle_pte(). Rather than replicating the logic of
    vm_normal_page() call it directly and do a check for the zero pfn
    similar to what get_user_pages() currently does.

    Also add a test to hmm selftest to verify functionality.
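    A sketch of the resulting special-pte handling in hmm_vma_handle_pte(),
    mirroring what get_user_pages() accepts:

    if (!vm_normal_page(walk->vma, addr, pte) &&
        !pte_devmap(pte) &&
        !is_zero_pfn(pte_pfn(pte))) {
            if (hmm_pte_need_fault(hmm_vma_walk, pfn_req_flags, 0)) {
                    pte_unmap(ptep);
                    return -EFAULT;
            }
            *hmm_pfn = HMM_PFN_ERROR;
            goto out;
    }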

    Link: https://lkml.kernel.org/r/20211104012001.2555676-1-apopple@nvidia.com
    Fixes: da4c3c735ea4 ("mm/hmm/mirror: helper to snapshot CPU page table")
    Signed-off-by: Alistair Popple
    Reviewed-by: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Zi Yan
    Cc: Ralph Campbell
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Alistair Popple
     
  • commit 62c9827cbb996c2c04f615ecd783ce28bcea894b upstream.

    Fix a data race in commit 779750d20b93 ("shmem: split huge pages beyond
    i_size under memory pressure").

    Here are call traces causing race:

    Call Trace 1:
    shmem_unused_huge_shrink+0x3ae/0x410
    ? __list_lru_walk_one.isra.5+0x33/0x160
    super_cache_scan+0x17c/0x190
    shrink_slab.part.55+0x1ef/0x3f0
    shrink_node+0x10e/0x330
    kswapd+0x380/0x740
    kthread+0xfc/0x130
    ? mem_cgroup_shrink_node+0x170/0x170
    ? kthread_create_on_node+0x70/0x70
    ret_from_fork+0x1f/0x30

    Call Trace 2:
    shmem_evict_inode+0xd8/0x190
    evict+0xbe/0x1c0
    do_unlinkat+0x137/0x330
    do_syscall_64+0x76/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    A simple explanation:

    Imagine there are 3 items in the local list (@list). In the first
    traversal, A is not deleted from @list.

    1) A->B->C
       ^
       |
       pos (leave)

    In the second traversal, B is deleted from @list. Concurrently, A is
    deleted from @list through shmem_evict_inode() since the last reference
    count of the inode is dropped by another thread. Then the @list is corrupted.

    2) A->B->C
       ^  ^
       |  |
     evict pos (drop)

    We should make sure the inode is either on the global list or deleted from
    any local list before iput().

    Fixed by moving inodes back to global list before we put them.
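    A sketch of the move-back path in shmem_unused_huge_shrink() before the
    final iput() (lock and field names assumed):

    spin_lock(&sbinfo->shrinklist_lock);
    list_move(&info->shrinklist, &sbinfo->shrinklist);
    sbinfo->shrinklist_len++;
    spin_unlock(&sbinfo->shrinklist_lock);

    iput(inode);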

    [akpm@linux-foundation.org: coding style fixes]

    Link: https://lkml.kernel.org/r/20211125064502.99983-1-ligang.bdlg@bytedance.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Gang Li
    Reviewed-by: Muchun Song
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gang Li
     
  • commit c4dc63f0032c77464fbd4e7a6afc22fa6913c4a7 upstream.

    In kdump kernel of x86_64, page allocation failure is observed:

    kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
    Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
    Workqueue: events_unbound async_run_entry_fn
    Call Trace:

    dump_stack_lvl+0x48/0x5e
    warn_alloc.cold+0x72/0xd6
    __alloc_pages_slowpath.constprop.0+0xc69/0xcd0
    __alloc_pages+0x1df/0x210
    new_slab+0x389/0x4d0
    ___slab_alloc+0x58f/0x770
    __slab_alloc.constprop.0+0x4a/0x80
    kmem_cache_alloc_trace+0x24b/0x2c0
    sr_probe+0x1db/0x620
    ......
    device_add+0x405/0x920
    ......
    __scsi_add_device+0xe5/0x100
    ata_scsi_scan_host+0x97/0x1d0
    async_run_entry_fn+0x30/0x130
    process_one_work+0x1e8/0x3c0
    worker_thread+0x50/0x3b0
    ? rescuer_thread+0x350/0x350
    kthread+0x16b/0x190
    ? set_kthread_struct+0x40/0x40
    ret_from_fork+0x22/0x30

    Mem-Info:
    ......

    The above failure happened when calling kmalloc() to allocate a buffer
    with GFP_DMA. It requests a slab page from the DMA zone while there are
    no managed pages at all in that zone.

    sr_probe()
    --> get_capabilities()
    --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);

    Because in the current kernel, dma-kmalloc will be created as long as
    CONFIG_ZONE_DMA is enabled. However, kdump kernel of x86_64 doesn't have
    managed pages on DMA zone since commit 6f599d84231f ("x86/kdump: Always
    reserve the low 1M when the crashkernel option is specified"). The
    failure can be always reproduced.

    For now, let's mute the warning of allocation failure if requesting pages
    from the DMA zone while it has no managed pages.
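    A sketch of the muting, assuming it sits in warn_alloc()'s early-return
    check:

    if ((gfp_mask & __GFP_NOWARN) ||
        !__ratelimit(&nopage_rs) ||
        ((gfp_mask & __GFP_DMA) && !has_managed_dma()))
            return;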

    [akpm@linux-foundation.org: fix warning]

    Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
    Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
    Signed-off-by: Baoquan He
    Acked-by: John Donnelly
    Reviewed-by: Hyeonggon Yoo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Borislav Petkov
    Cc: Christoph Hellwig
    Cc: David Hildenbrand
    Cc: David Laight
    Cc: Marek Szyprowski
    Cc: Robin Murphy
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     
  • commit 62b3107073646e0946bd97ff926832bafb846d17 upstream.

    Patch series "Handle warning of allocation failure on DMA zone w/o
    managed pages", v4.

    **Problem observed:
    On x86_64, when crash is triggered and entering into kdump kernel, page
    allocation failure can always be seen.

    ---------------------------------
    DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
    swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 0 PID: 1 Comm: swapper/0
    Call Trace:
    dump_stack+0x7f/0xa1
    warn_alloc.cold+0x72/0xd6
    ......
    __alloc_pages+0x24d/0x2c0
    ......
    dma_atomic_pool_init+0xdb/0x176
    do_one_initcall+0x67/0x320
    ? rcu_read_lock_sched_held+0x3f/0x80
    kernel_init_freeable+0x290/0x2dc
    ? rest_init+0x24f/0x24f
    kernel_init+0xa/0x111
    ret_from_fork+0x22/0x30
    Mem-Info:
    ------------------------------------

    ***Root cause:
    The current kernel assumes that the DMA zone must have managed pages and
    tries to request pages from it if CONFIG_ZONE_DMA is enabled, but this is
    not always true. E.g. in the kdump kernel of x86_64, only the low 1M is
    present and locked down at a very early stage of boot, so that this low 1M
    won't be added into the buddy allocator to become managed pages of the DMA
    zone. This exception will always cause page allocation failure if a page
    is requested from the DMA zone.

    ***Investigation:
    This failure happens since below commit merged into linus's tree.
    1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
    23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
    f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
    7c321eb2b843 x86/kdump: Remove the backup region handling
    6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified

    Before them, on x86_64, the low 640K area was reused by the kdump kernel:
    the content of the low 640K area was copied into a backup region for
    dumping before jumping into kdump. Then, except for the firmware-reserved
    regions in [0, 640K], the remaining area was added into the buddy
    allocator to become available managed pages of the DMA zone.

    However, after the above commits were applied, in the kdump kernel of
    x86_64 the low 1M is reserved by memblock but not released to the buddy
    allocator. So any later page allocation requested from the DMA zone will fail.

    At the beginning, if crashkernel is reserved, the low 1M needs to be
    locked down because AMD SME encrypts memory, making the old backup region
    mechanism impossible when switching into the kdump kernel.

    Later, it was also observed that there are BIOSes corrupting memory
    under 1M. To solve this, in commit f1d4d47c5851, the entire region of
    low 1M is always reserved after the real mode trampoline is allocated.

    Besides, recently, an Intel engineer mentioned that their TDX (Trusted
    Domain Extensions), which is under development in the kernel, also needs
    to lock down the low 1M. So we can't simply revert the above commits to
    fix the page allocation failure from the DMA zone, as someone suggested.

    ***Solution:
    Currently, only DMA atomic pool and dma-kmalloc will initialize and
    request page allocation with GFP_DMA during bootup.

    So only initialize the DMA atomic pool when the DMA zone has available
    managed pages; otherwise just skip the initialization.

    For dma-kmalloc(), for the time being, let's mute the warning of
    allocation failure if requesting pages from the DMA zone while it has no
    managed pages. Meanwhile, change the code to use the dma_alloc_xx/dma_map_xx
    APIs to replace kmalloc(GFP_DMA), or do not use GFP_DMA when calling
    kmalloc() if not necessary. Christoph is posting patches to fix those
    under drivers/scsi/. Finally, we can remove the need for dma-kmalloc()
    as people suggested.

    This patch (of 3):

    Some places in the current kernel assume that the DMA zone must have
    managed pages if CONFIG_ZONE_DMA is enabled, but this is not always true.
    E.g. in the kdump kernel of x86_64, only the low 1M is present and locked
    down at a very early stage of boot, so there are no managed pages at all
    in the DMA zone. This exception will always cause page allocation failure
    if a page is requested from the DMA zone.

    Here add function has_managed_dma() and the relevant helper functions to
    check if there's DMA zone with managed pages. It will be used in later
    patches.
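    A sketch of the helper, assuming it ends up next to the other zone
    helpers in mm/page_alloc.c:

    bool has_managed_dma(void)
    {
    #ifdef CONFIG_ZONE_DMA
            struct pglist_data *pgdat;

            for_each_online_pgdat(pgdat) {
                    struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                    if (managed_zone(zone))
                            return true;
            }
    #endif
            return false;
    }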

    Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
    Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
    Signed-off-by: Baoquan He
    Reviewed-by: David Hildenbrand
    Acked-by: John Donnelly
    Cc: Christoph Hellwig
    Cc: Christoph Lameter
    Cc: Hyeonggon Yoo
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: David Laight
    Cc: Borislav Petkov
    Cc: Marek Szyprowski
    Cc: Robin Murphy
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     

05 Jan, 2022

1 commit

  • commit ebb3f994dd92f8fb4d70c7541091216c1e10cb71 upstream.

    DAMON debugfs interface increases the reference counts of 'struct pid's
    for targets from the 'target_ids' file write callback
    ('dbgfs_target_ids_write()'), but decreases the counts only in DAMON
    monitoring termination callback ('dbgfs_before_terminate()').

    Therefore, when 'target_ids' file is repeatedly written without DAMON
    monitoring start/termination, the reference count is not decreased and
    therefore memory for the 'struct pid' cannot be freed. This commit
    fixes this issue by decreasing the reference counts when 'target_ids' is
    written.

    Link: https://lkml.kernel.org/r/20211229124029.23348-1-sj@kernel.org
    Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
    Signed-off-by: SeongJae Park
    Cc: [5.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    SeongJae Park
     

29 Dec, 2021

5 commits

  • commit 0129ab1f268b6cf88825eae819b9b84aa0a85634 upstream.

    Hulk robot reported a kmemleak problem:

    unreferenced object 0xffff93d1d8cc02e8 (size 248):
    comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s)
    hex dump (first 32 bytes):
    00 40 85 19 d4 93 ff ff 00 10 00 00 00 00 00 00 .@..............
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    seq_open+0x2a/0x80
    full_proxy_open+0x167/0x1e0
    do_dentry_open+0x1e1/0x3a0
    path_openat+0x961/0xa20
    do_filp_open+0xae/0x120
    do_sys_openat2+0x216/0x2f0
    do_sys_open+0x57/0x80
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    unreferenced object 0xffff93d419854000 (size 4096):
    comm "cat", pid 23327, jiffies 4624670141 (age 495992.217s)
    hex dump (first 32 bytes):
    6b 66 65 6e 63 65 2d 23 32 35 30 3a 20 30 78 30 kfence-#250: 0x0
    30 30 30 30 30 30 30 37 35 34 62 64 61 31 32 2d 0000000754bda12-
    backtrace:
    seq_read_iter+0x313/0x440
    seq_read+0x14b/0x1a0
    full_proxy_read+0x56/0x80
    vfs_read+0xa5/0x1b0
    ksys_read+0xa0/0xf0
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    I find that we can easily reproduce this problem with the following
    commands:

    cat /sys/kernel/debug/kfence/objects
    echo scan > /sys/kernel/debug/kmemleak
    cat /sys/kernel/debug/kmemleak

    The leaked memory is allocated in the stack below:

    do_syscall_64
      do_sys_open
        do_dentry_open
          full_proxy_open
            seq_open                 ---> alloc seq_file
      vfs_read
        full_proxy_read
          seq_read
            seq_read_iter
              traverse               ---> alloc seq_buf

    And it should have been released in the following process:

    do_syscall_64
      syscall_exit_to_user_mode
        exit_to_user_mode_prepare
          task_work_run
            ____fput
              __fput
                full_proxy_release   ---> free here

    However, the release function corresponding to file_operations is not
    implemented in kfence. As a result, a memory leak occurs. Therefore,
    the solution to this problem is to implement the corresponding release
    function.
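    A sketch of the missing piece, assuming the debugfs file_operations in
    mm/kfence/core.c:

    static const struct file_operations objects_fops = {
            .open = open_objects,
            .read = seq_read,
            .llseek = seq_lseek,
            .release = seq_release, /* frees the seq_file and its buffer */
    };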

    Link: https://lkml.kernel.org/r/20211206133628.2822545-1-libaokun1@huawei.com
    Fixes: 0ce20dd84089 ("mm: add Kernel Electric-Fence infrastructure")
    Signed-off-by: Baokun Li
    Reported-by: Hulk Robot
    Acked-by: Marco Elver
    Reviewed-by: Kefeng Wang
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Yu Kuai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baokun Li
     
  • commit 34796417964b8d0aef45a99cf6c2d20cebe33733 upstream.

    DAMON debugfs interface iterates current monitoring targets in
    'dbgfs_target_ids_read()' while holding the corresponding
    'kdamond_lock'. However, it also destructs the monitoring targets in
    'dbgfs_before_terminate()' without holding the lock. This can result in
    a use_after_free bug. This commit avoids the race by protecting the
    destruction with the corresponding 'kdamond_lock'.
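    A sketch of the protected destruction in dbgfs_before_terminate()
    (details such as the pid puts are elided):

    mutex_lock(&ctx->kdamond_lock);
    damon_for_each_target_safe(t, next, ctx)
            damon_destroy_target(t);
    mutex_unlock(&ctx->kdamond_lock);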

    Link: https://lkml.kernel.org/r/20211221094447.2241-1-sj@kernel.org
    Reported-by: Sangwoo Bae
    Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
    Signed-off-by: SeongJae Park
    Cc: [5.15.x]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    SeongJae Park
     
  • commit 2a57d83c78f889bf3f54eede908d0643c40d5418 upstream.

    Hulk Robot reported a panic in put_page_testzero() when testing
    madvise() with MADV_SOFT_OFFLINE. The BUG() is triggered when retrying
    get_any_page(). This is because we keep the MF_COUNT_INCREASED flag in
    the second try but the refcount is not increased.

    page dumped because: VM_BUG_ON_PAGE(page_ref_count(page) == 0)
    ------------[ cut here ]------------
    kernel BUG at include/linux/mm.h:737!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 5 PID: 2135 Comm: sshd Tainted: G B 5.16.0-rc6-dirty #373
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    RIP: release_pages+0x53f/0x840
    Call Trace:
    free_pages_and_swap_cache+0x64/0x80
    tlb_flush_mmu+0x6f/0x220
    unmap_page_range+0xe6c/0x12c0
    unmap_single_vma+0x90/0x170
    unmap_vmas+0xc4/0x180
    exit_mmap+0xde/0x3a0
    mmput+0xa3/0x250
    do_exit+0x564/0x1470
    do_group_exit+0x3b/0x100
    __do_sys_exit_group+0x13/0x20
    __x64_sys_exit_group+0x16/0x20
    do_syscall_64+0x34/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    Modules linked in:
    ---[ end trace e99579b570fe0649 ]---
    RIP: 0010:release_pages+0x53f/0x840

    Link: https://lkml.kernel.org/r/20211221074908.3910286-1-liushixin2@huawei.com
    Fixes: b94e02822deb ("mm,hwpoison: try to narrow window race for free pages")
    Signed-off-by: Liu Shixin
    Reported-by: Hulk Robot
    Reviewed-by: Oscar Salvador
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Liu Shixin
     
  • commit e37e7b0b3bd52ec4f8ab71b027bcec08f57f1b3b upstream.

    When a memory error hits a tail page of a free hugepage,
    __page_handle_poison() is expected to be called to isolate the error in
    4kB unit, but it's not called due to the outdated if-condition in
    memory_failure_hugetlb(). This loses the chance to isolate the error in
    the finer unit, so it's not optimal. Drop the condition.

    This "(p != head && TestSetPageHWPoison(head)" condition is based on the
    old semantics of PageHWPoison on hugepage (where PG_hwpoison flag was
    set on the subpage), so it's not necessray any more. By getting to set
    PG_hwpoison on head page for hugepages, concurrent error events on
    different subpages in a single hugepage can be prevented by
    TestSetPageHWPoison(head) at the beginning of memory_failure_hugetlb().
    So dropping the condition should not reopen the race window originally
    mentioned in commit b985194c8c0a ("hwpoison, hugetlb:
    lock_page/unlock_page does not match for handling a free hugepage")

    [naoya.horiguchi@linux.dev: fix "HardwareCorrupted" counter]
    Link: https://lkml.kernel.org/r/20211220084851.GA1460264@u2004

    Link: https://lkml.kernel.org/r/20211210110208.879740-1-naoya.horiguchi@linux.dev
    Signed-off-by: Naoya Horiguchi
    Reported-by: Fei Luo
    Reviewed-by: Mike Kravetz
    Cc: [5.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Naoya Horiguchi
     
  • commit 338635340669d5b317c7e8dcf4fff4a0f3651d87 upstream.

    alloc_pages_vma() may try to allocate a THP page on the local NUMA node
    first:

    page = __alloc_pages_node(hpage_node,
            gfp | __GFP_THISNODE | __GFP_NORETRY, order);

    And if the allocation fails it retries allowing remote memory:

    if (!page && (gfp & __GFP_DIRECT_RECLAIM))
            page = __alloc_pages_node(hpage_node,
                    gfp, order);

    However, this retry allocation completely ignores memory policy nodemask
    allowing allocation to escape restrictions.

    The first appearance of this bug seems to be the commit ac5b2c18911f
    ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings").

    The bug disappeared later in the commit 89c83fb539f9 ("mm, thp:
    consolidate THP gfp handling into alloc_hugepage_direct_gfpmask") and
    reappeared again in slightly different form in the commit 76e654cc91bb
    ("mm, page_alloc: allow hugepage fallback to remote nodes when
    madvised")

    Fix this by passing correct nodemask to the __alloc_pages() call.
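    A sketch of the retry path with the policy nodemask applied:

    if (!page && (gfp & __GFP_DIRECT_RECLAIM))
            page = __alloc_pages(gfp, order, hpage_node, nmask);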

    The demonstration/reproducer of the problem:

    $ mount -oremount,size=4G,huge=always /dev/shm/
    $ echo always > /sys/kernel/mm/transparent_hugepage/defrag
    $ cat mbind_thp.c
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <assert.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <numaif.h>

    #define SIZE 2ULL << 30
    int main(int argc, char **argv)
    {
            int fd;
            unsigned long long i;
            char *addr;
            pid_t pid;
            char buf[100];
            unsigned long nodemask = 1;

            fd = open("/dev/shm/test", O_RDWR|O_CREAT);
            assert(fd > 0);
            assert(ftruncate(fd, SIZE) == 0);

            addr = mmap(NULL, SIZE, PROT_READ|PROT_WRITE,
                        MAP_SHARED, fd, 0);

            assert(mbind(addr, SIZE, MPOL_BIND, &nodemask, 2, MPOL_MF_STRICT|MPOL_MF_MOVE)==0);
            for (i = 0; i < SIZE; i+=4096) {
                    addr[i] = 1;
            }
            pid = getpid();
            snprintf(buf, sizeof(buf), "grep shm /proc/%d/numa_maps", pid);
            system(buf);
            sleep(10000);

            return 0;
    }
    $ gcc mbind_thp.c -o mbind_thp -lnuma
    $ numactl -H
    available: 2 nodes (0-1)
    node 0 cpus: 0 2
    node 0 size: 1918 MB
    node 0 free: 1595 MB
    node 1 cpus: 1 3
    node 1 size: 2014 MB
    node 1 free: 1731 MB
    node distances:
    node   0   1
      0:  10  20
      1:  20  10
    $ rm -f /dev/shm/test; taskset -c 0 ./mbind_thp
    7fd970a00000 bind:0 file=/dev/shm/test dirty=524288 active=0 N0=396800 N1=127488 kernelpagesize_kB=4

    Link: https://lkml.kernel.org/r/20211208165343.22349-1-arbn@yandex-team.com
    Fixes: ac5b2c18911f ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrey Ryabinin
     

14 Dec, 2021

3 commits

  • commit 3c376dfafbf7a8ea0dea212d095ddd83e93280bb upstream.

    Initialize min_ratio if it is set during bdi unregistration. This can
    prevent problems that may occur when a bdi is removed without resetting
    min_ratio.

    For example.
    1) insert external sdcard
    2) set external sdcard's min_ratio 70
    3) remove external sdcard without setting min_ratio 0
    4) insert external sdcard
    5) set external sdcard's min_ratio 70 << error occur(can't set)

    Because when an sdcard is removed, the present bdi_min_ratio value will
    remain. Currently, the only way to reset bdi_min_ratio is to reboot.
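    A sketch of the fix in bdi_unregister(), assuming the existing
    bdi_set_min_ratio() helper is reused:

    /* drop this bdi's contribution to the global min_ratio pool */
    if (bdi->min_ratio)
            bdi_set_min_ratio(bdi, 0);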

    [akpm@linux-foundation.org: tweak comment and coding style]

    Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com
    Signed-off-by: Manjong Lee
    Acked-by: Peter Zijlstra (Intel)
    Cc: Changheun Lee
    Cc: Jens Axboe
    Cc: Christoph Hellwig
    Cc: Matthew Wilcox
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Manjong Lee
     
  • commit 005a79e5c254c3f60ec269a459cc41b55028c798 upstream.

    On big-endian s390, the alloc/free_traces attributes produce endless
    output, because of always 0 idx in slab_debugfs_show().

    idx is de-referenced from *v, which points to a loff_t value, with

    unsigned int idx = *(unsigned int *)v;

    This will only give the upper 32 bits on big-endian, which remain 0.

    Instead of only fixing this de-reference, during discussion it seemed
    more appropriate to change the seq_ops so that they use an explicit
    iterator in the private loc_track struct.

    This patch adds idx to loc_track, which will also fix the endianness
    bug.
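    A sketch of the struct change (field placement assumed):

    struct loc_track {
            unsigned long max;
            unsigned long count;
            struct location *loc;
            loff_t idx;     /* explicit iterator for the seq_ops */
    };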

    Link: https://lore.kernel.org/r/20211117193932.4049412-1-gerald.schaefer@linux.ibm.com
    Link: https://lkml.kernel.org/r/20211126171848.17534-1-gerald.schaefer@linux.ibm.com
    Fixes: 64dd68497be7 ("mm: slub: move sysfs slab alloc/free interfaces to debugfs")
    Signed-off-by: Gerald Schaefer
    Reported-by: Steffen Maier
    Acked-by: Vlastimil Babka
    Cc: Faiyaz Mohammed
    Cc: Greg Kroah-Hartman
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Gerald Schaefer
     
  • commit 70e9274805fccfd175d0431a947bfd11ee7df40e upstream.

    Because DAMON sleeps in uninterruptible mode, /proc/loadavg reports fake
    load while DAMON is turned on, though it is doing nothing. This can
    confuse users [1]. To avoid this, this commit makes DAMON sleep in
    idle mode.

    [1] https://lore.kernel.org/all/11868371.O9o76ZdvQC@natalenko.name/

    Link: https://lkml.kernel.org/r/20211126145015.15862-3-sj@kernel.org
    Fixes: 2224d8485492 ("mm: introduce Data Access MONitor (DAMON)")
    Reported-by: Oleksandr Natalenko
    Signed-off-by: SeongJae Park
    Tested-by: Oleksandr Natalenko
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    SeongJae Park
     

25 Nov, 2021

6 commits

  • commit a4a118f2eead1d6c49e00765de89878288d4b890 upstream.

    When __unmap_hugepage_range() calls to huge_pmd_unshare() succeed, a TLB
    flush is missing. This TLB flush must be performed before releasing the
    i_mmap_rwsem, in order to prevent an unshared PMDs page from being
    released and reused before the TLB flush took place.

    Arguably, a comprehensive solution would use mmu_gather interface to
    batch the TLB flushes and the PMDs page release, however it is not an
    easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call
    huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2)
    deferring the release of the page reference for the PMDs page until
    after i_mmap_rwsem is dropped can confuse huge_pmd_unshare() into
    thinking PMDs are shared when they are not.

    Fix __unmap_hugepage_range() by adding the missing TLB flush, and
    forcing a flush when unshare is successful.

    Fixes: 24669e58477e ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages") # 3.6
    Signed-off-by: Nadav Amit
    Reviewed-by: Mike Kravetz
    Cc: Aneesh Kumar K.V
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Nadav Amit
     
  • commit d78f3853f831eee46c6dbe726debf3be9e9c0d05 upstream.

    DAMON debugfs is supposed to protect dbgfs_ctxs, dbgfs_nr_ctxs, and
    dbgfs_dirs using damon_dbgfs_lock. However, some of the code is
    accessing the variables without the protection. This fixes it by
    protecting all such accesses.

    Link: https://lkml.kernel.org/r/20211110145758.16558-3-sj@kernel.org
    Fixes: 75c1c2b53c78 ("mm/damon/dbgfs: support multiple contexts")
    Signed-off-by: SeongJae Park
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    SeongJae Park
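    A minimal sketch of the pattern, assuming the lock and variables named
    in the description (the function shown is illustrative, not the exact
    list of call sites the patch touches):

        mutex_lock(&damon_dbgfs_lock);
        /*
         * dbgfs_ctxs, dbgfs_nr_ctxs and dbgfs_dirs must only be read or
         * updated while damon_dbgfs_lock is held.
         */
        ret = dbgfs_rm_context(name);
        mutex_unlock(&damon_dbgfs_lock);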
     
  • commit db7a347b26fe05d2e8c115bb24dfd908d0252bc3 upstream.

    Patch series "DAMON fixes".

    This patch (of 2):

    DAMON users can trigger the warning below in '__alloc_pages()' by
    invoking write() on some DAMON debugfs files with an arbitrarily high
    count argument, because the DAMON debugfs interface allocates buffers
    based on the user-specified 'count'.

        if (unlikely(order >= MAX_ORDER)) {
                WARN_ON_ONCE(!(gfp & __GFP_NOWARN));
                return NULL;
        }

    Because the DAMON debugfs interface code already checks for failure of
    the 'kmalloc()', this commit simply suppresses the warning by adding the
    '__GFP_NOWARN' flag (see the sketch at the end of this entry).

    Link: https://lkml.kernel.org/r/20211110145758.16558-1-sj@kernel.org
    Link: https://lkml.kernel.org/r/20211110145758.16558-2-sj@kernel.org
    Fixes: 4bc05954d007 ("mm/damon: implement a debugfs-based user space interface")
    Signed-off-by: SeongJae Park
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    SeongJae Park
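    A minimal sketch of the change (buffer name and size arithmetic are
    illustrative; only the added __GFP_NOWARN is the point):

        /*
         * 'count' comes straight from userspace, so an arbitrarily large
         * value would otherwise trip the order >= MAX_ORDER warning above.
         */
        kbuf = kmalloc(count + 1, GFP_KERNEL | __GFP_NOWARN);
        if (!kbuf)
                return -ENOMEM;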
     
  • commit 825c43f50e3aa811a291ffcb40e02fbf6d91ba86 upstream.

    The kmap_local conversion broke the ARM architecture, because the new
    code assumes that all PTEs used for creating kmaps form a linear array
    in memory, and uses array indexing to look up the kmap PTE belonging to
    a certain kmap index.

    On ARM, this cannot work, not only because the PTE pages may be
    non-adjacent in memory, but also because ARM/!LPAE interleaves hardware
    entries and extended entries (carrying software-only bits) in a way that
    is not compatible with array indexing.

    Fortunately, this only seems to affect configurations with more than 8
    CPUs, due to the way the per-CPU kmap slots are organized in memory.

    Work around this by permitting an architecture to set a Kconfig symbol
    that signifies that the kmap PTEs do not form a linear array in memory,
    so that the only way to locate the appropriate one is to walk the page
    tables (see the sketch at the end of this entry).

    Link: https://lore.kernel.org/linux-arm-kernel/20211026131249.3731275-1-ardb@kernel.org/
    Link: https://lkml.kernel.org/r/20211116094737.7391-1-ardb@kernel.org
    Fixes: 2a15ba82fa6c ("ARM: highmem: Switch to generic kmap atomic")
    Signed-off-by: Ard Biesheuvel
    Reported-by: Quanyang Wang
    Reviewed-by: Linus Walleij
    Acked-by: Russell King (Oracle)
    Cc: Thomas Gleixner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ard Biesheuvel
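    A hedged sketch of the mechanism in the generic kmap_local code (the
    Kconfig symbol and helper names are quoted from memory and should be
    treated as illustrative):

        static pte_t *__kmap_pte;

        static pte_t *kmap_get_pte(unsigned long vaddr, int idx)
        {
                if (IS_ENABLED(CONFIG_KMAP_LOCAL_NON_LINEAR_PTE_ARRAY))
                        /*
                         * The architecture says its kmap PTEs are not a
                         * linear array, so walk the page tables every time.
                         */
                        return virt_to_kpte(vaddr);

                if (!__kmap_pte)        /* cache the base PTE on first use */
                        __kmap_pte = virt_to_kpte(vaddr);
                return &__kmap_pte[-idx];
        }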
     
  • commit cc30042df6fcc82ea18acf0dace831503e60a0b7 upstream.

    Currently in the is_continue case in hugetlb_mcopy_atomic_pte(), if we
    bail out using "goto out_release_unlock;" in the cases where idx >=
    size, or !huge_pte_none(), the code will detect that new_pagecache_page
    == false and so call restore_reserve_on_error(). In this case,
    restore_reserve_on_error() deletes the reservation, and the following
    call to remove_inode_hugepages() increments h->resv_huge_pages, causing
    a 100% reproducible leak.

    We should treat the is_continue case similarly to adding a page into the
    pagecache and set new_pagecache_page to true, to indicate that there is
    no reservation to restore on the error path and that we need not call
    restore_reserve_on_error(). Rename new_pagecache_page to
    page_in_pagecache to make that clear (see the sketch at the end of this
    entry).

    Link: https://lkml.kernel.org/r/20211117193825.378528-1-almasrymina@google.com
    Fixes: c7b1850dfb41 ("hugetlb: don't pass page cache pages to restore_reserve_on_error")
    Signed-off-by: Mina Almasry
    Reported-by: James Houghton
    Reviewed-by: Mike Kravetz
    Cc: Wei Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mina Almasry
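    A minimal sketch of the flag handling in hugetlb_mcopy_atomic_pte()
    (condensed; variable placement and the error-path guard are
    illustrative):

        bool page_in_pagecache = false; /* was: new_pagecache_page */

        if (is_continue) {
                /* The page already lives in the page cache, so there is no
                 * reservation to restore if we bail out later. */
                page = find_lock_page(mapping, idx);
                if (!page)
                        goto out;
                page_in_pagecache = true;
        }

        /* ... on the out_release_unlock error path ... */
        if (!page_in_pagecache)
                restore_reserve_on_error(h, dst_vma, dst_addr, page);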
     
  • commit 34dbc3aaf5d9e89ba6cc5e24add9458c21ab1950 upstream.

    When kmemleak is enabled for SLOB, the system does not boot and does not
    print anything to the console. At a very early stage in the boot
    process we hit infinite recursion from kmemleak_init() and eventually
    the kernel crashes.

    kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
    kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
    valid for SLOB.

    Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB (see the
    sketch at the end of this entry).

    Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
    Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
    Signed-off-by: Rustam Kovhaev
    Acked-by: Vlastimil Babka
    Reviewed-by: Muchun Song
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Catalin Marinas
    Cc: Greg Kroah-Hartman
    Cc: Glauber Costa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Rustam Kovhaev
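    A hedged sketch of the mm/slab.h change (the exact flag list in the
    upstream patch may differ; the point is that SLAB_NOLEAKTRACE must
    survive the masking in kmem_cache_create_usercopy() so kmemleak does not
    recurse into its own caches):

        /* SLOB branch of the per-allocator flag selection */
        #define SLAB_CACHE_FLAGS (SLAB_NOLEAKTRACE | SLAB_RECLAIM_ACCOUNT | \
                                  SLAB_TEMPORARY | SLAB_ACCOUNT)

        /* Common flags available with the current configuration */
        #define CACHE_CREATE_MASK (SLAB_CORE_FLAGS | SLAB_DEBUG_FLAGS | SLAB_CACHE_FLAGS)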
     

19 Nov, 2021

5 commits

  • commit 60e2793d440a3ec95abb5d6d4fc034a4b480472d upstream.

    Any allocation failure during the #PF path will return with VM_FAULT_OOM,
    which in turn results in pagefault_out_of_memory(). This can happen for
    two different reasons: a) memcg is out of memory and we rely on
    mem_cgroup_oom_synchronize() to perform the memcg OOM handling, or b) a
    normal allocation fails.

    The latter is quite problematic because allocation paths already trigger
    out_of_memory() and the page allocator tries really hard not to fail
    allocations. In any case, if the OOM killer has already been invoked
    there is no reason to invoke it again from the #PF path, especially when
    the OOM condition might be gone by that time and we have no way to find
    out other than by allocating again.

    Moreover, if the allocation failed and the OOM killer hasn't been
    invoked, then we are unlikely to do the right thing from the #PF context
    because we have already lost the allocation context and restrictions,
    and therefore might OOM-kill a task from a different NUMA domain.

    This all suggests that there is no legitimate reason to trigger
    out_of_memory() from pagefault_out_of_memory(), so drop it. Just to be
    sure that no #PF path returns with VM_FAULT_OOM without having attempted
    an allocation, print a warning that this is happening before we restart
    the #PF (see the sketch at the end of this entry).

    [VvS: a #PF allocation can hit the limit of the cgroup v1 kmem
    controller. This is a local problem related to memcg; however, it causes
    unnecessary global OOM kills that are repeated over and over again and
    escalate into a real disaster. This has been broken since kmem
    accounting was introduced for cgroup v1 (3.8). There was no kmem-specific
    reclaim for the separate limit, so the only way to handle the kmem hard
    limit was to return ENOMEM. Upstream, the problem will be fixed by
    removing the outdated kmem limit; however, stable and LTS kernels cannot
    do that and are still affected. This patch fixes the problem and should
    be backported into stable/LTS.]

    Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Vasily Averin
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Tetsuo Handa
    Cc: Uladzislau Rezki
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
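    A hedged sketch of what pagefault_out_of_memory() is reduced to (the
    warning text and the rate limiting are paraphrased from memory):

        void pagefault_out_of_memory(void)
        {
                static DEFINE_RATELIMIT_STATE(pfoom_rs, DEFAULT_RATELIMIT_INTERVAL,
                                              DEFAULT_RATELIMIT_BURST);

                /* The memcg OOM path is still synchronized here ... */
                if (mem_cgroup_oom_synchronize(true))
                        return;

                if (fatal_signal_pending(current))
                        return;

                /* ... but out_of_memory() is gone: warn and let the #PF retry. */
                if (__ratelimit(&pfoom_rs))
                        pr_warn("Huh VM_FAULT_OOM leaked out to the #PF handler. Retrying PF\n");
        }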
     
  • commit 0b28179a6138a5edd9d82ad2687c05b3773c387b upstream.

    Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.

    Memory cgroup charging allows killed or exiting tasks to exceed the hard
    limit. This can be misused to trigger a global OOM from inside a
    memcg-limited container. On the other hand, if memcg fails an allocation
    made from inside the #PF handler, it triggers a global OOM from inside
    pagefault_out_of_memory().

    To prevent these problems this patchset:
    (a) removes execution of out_of_memory() from
    pagefault_out_of_memory(), because nobody can explain why it is
    necessary;
    (b) allows memcg to fail allocations of dying/killed tasks.

    This patch (of 3):

    Any allocation failure during the #PF path will return with VM_FAULT_OOM,
    which in turn results in pagefault_out_of_memory(), which in turn
    executes out_of_memory() and can kill a random task.

    An allocation might fail when the current task is the oom victim and
    there are no memory reserves left. The OOM killer is already handled at
    the page allocator level for the global OOM and at the charging level
    for the memcg one. Both have much more information about the scope of
    allocation/charge request. This means that either the OOM killer has
    been invoked properly and didn't lead to the allocation success or it
    has been skipped because it couldn't have been invoked. In both cases
    triggering it from here is pointless and even harmful.

    It makes much more sense to let the killed task die rather than to wake
    up an eternally hungry oom-killer and send him to choose a fatter victim
    for breakfast.

    Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
    Signed-off-by: Vasily Averin
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Tetsuo Handa
    Cc: Uladzislau Rezki
    Cc: Vladimir Davydov
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
     
  • commit a4ebf1b6ca1e011289677239a2a361fde4a88076 upstream.

    Memory cgroup charging allows killed or exiting tasks to exceed the hard
    limit. It is assumed that the amount of memory charged by those tasks is
    bounded and that most of the memory will be released while the task is
    exiting. This resembles the heuristic for the global OOM situation,
    where such tasks get access to memory reserves. There is no global
    memory shortage at the memcg level, so the memcg heuristic can be more
    permissive.

    The above assumption is overly optimistic though. E.g. vmalloc can scale
    to really large requests and the heuristic would allow that. We used to
    have an early break in the vmalloc allocator for killed tasks, but this
    was reverted by commit b8c8a338f75e ("Revert "vmalloc: back off when the
    current task is killed""). There are likely other similar code paths
    which do not check for fatal signals in an allocation-and-charge loop.
    Also, some kernel objects charged to a memcg are not bound to a process
    lifetime.

    It has been observed that it is not really hard to trigger these
    bypasses and cause a global OOM situation.

    One potential way to address these runaways would be to limit the amount
    of excess (similar to the global OOM with limited oom reserves). This
    is certainly possible but it is not really clear how much of an excess
    is desirable and still protects from global OOMs as that would have to
    consider the overall memcg configuration.

    This patch addresses the problem by removing the heuristic altogether.
    Bypass is only allowed for requests which either cannot fail or where
    failure is not desirable while the excess should still be limited (e.g.
    atomic requests). Implementation-wise, a killed or dying task fails to
    charge if it has already passed the OOM killer stage. That should give
    all forms of reclaim a chance to restore the limit before the failure
    (ENOMEM) and tell the caller to back off.

    In addition, this patch renames the should_force_charge() helper to
    task_is_dying() because its use is no longer associated with forced
    charging (see the sketch at the end of this entry).

    This patch depends on pagefault_out_of_memory() to not trigger
    out_of_memory(), because then a memcg failure can unwind to VM_FAULT_OOM
    and cause a global OOM killer.

    Link: https://lkml.kernel.org/r/8f5cebbb-06da-4902-91f0-6566fc4b4203@virtuozzo.com
    Signed-off-by: Vasily Averin
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Uladzislau Rezki
    Cc: Vlastimil Babka
    Cc: Shakeel Butt
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vasily Averin
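    A hedged sketch of the charge-path change in mm/memcontrol.c (heavily
    condensed from memory; 'passed_oom', the labels and the mem_cgroup_oom()
    arguments are illustrative):

        /* was should_force_charge(); the name no longer implies a bypass */
        static inline bool task_is_dying(void)
        {
                return tsk_is_oom_victim(current) || fatal_signal_pending(current) ||
                        (current->flags & PF_EXITING);
        }

        /* inside the charge retry loop, once the memcg OOM killer has run: */
        if (passed_oom && task_is_dying())
                goto nomem;     /* return -ENOMEM instead of exceeding the limit */

        if (mem_cgroup_oom(mem_over_limit, gfp_mask, order) == OOM_SUCCESS) {
                passed_oom = true;
                goto retry;
        }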
     
  • commit d417b49fff3e2f21043c834841e8623a6098741d upstream.

    It is not safe to check page->index without holding the page lock. It
    can change if the page is moved between the swap cache and the page
    cache for a shmem file, for example. There is a VM_BUG_ON below which
    checks that page->index is correct after taking the page lock.

    Link: https://lkml.kernel.org/r/20210818144932.940640-1-willy@infradead.org
    Fixes: 5c211ba29deb ("mm: add and use find_lock_entries")
    Signed-off-by: Matthew Wilcox (Oracle)
    Reported-by:
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Matthew Wilcox (Oracle)
     
  • [ Upstream commit afe8605ca45424629fdddfd85984b442c763dc47 ]

    There is a possible race window between zs_pool_dec_isolated() and
    zs_unregister_migration(), because wait_for_isolated_drain() checks the
    isolated count without holding class->lock and there is no ordering
    inside zs_pool_dec_isolated(). Thus the race window below is possible:

    zs_pool_dec_isolated                        zs_unregister_migration
      check pool->destroying != 0
                                                  pool->destroying = true;
                                                  smp_mb();
                                                  wait_for_isolated_drain()
                                                    wait for pool->isolated_pages == 0
      atomic_long_dec(&pool->isolated_pages);
      atomic_long_read(&pool->isolated_pages) == 0

    Since we observe pool->destroying (as false) before the atomic_long_dec()
    of pool->isolated_pages, waking up pool->migration_wait is missed.

    Fix this by ensuring that the check of pool->destroying happens after
    the atomic_long_dec(&pool->isolated_pages) (see the sketch at the end of
    this entry).

    Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com
    Fixes: 701d678599d0 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
    Signed-off-by: Miaohe Lin
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Henry Burns
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Miaohe Lin
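    A hedged sketch of the reordering in zs_pool_dec_isolated() (condensed;
    the barrier choice mirrors the pairing described above and should be
    read as illustrative):

        static void zs_pool_dec_isolated(struct zs_pool *pool)
        {
                VM_BUG_ON(atomic_long_read(&pool->isolated_pages) <= 0);
                atomic_long_dec(&pool->isolated_pages);
                /*
                 * Check pool->destroying only after the decrement; paired
                 * with the smp_mb() in zs_unregister_migration(), the waiter
                 * either sees isolated_pages == 0 or we see destroying ==
                 * true and issue the wakeup.
                 */
                smp_mb__after_atomic();
                if (atomic_long_read(&pool->isolated_pages) == 0 && pool->destroying)
                        wake_up_all(&pool->migration_wait);
        }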