20 Jan, 2021

8 commits

  • This is the 5.10.9 stable release

    * tag 'v5.10.9': (153 commits)
    Linux 5.10.9
    netfilter: nf_nat: Fix memleak in nf_nat_init
    netfilter: conntrack: fix reading nf_conntrack_buckets
    ...

    Signed-off-by: Jason Liu

    Jason Liu
     
  • This is the 5.10.7 stable release

    * tag 'v5.10.7': (144 commits)
    Linux 5.10.7
    scsi: target: Fix XCOPY NAA identifier lookup
    rtlwifi: rise completion at the last step of firmware callback
    ...

    Signed-off-by: Jason Liu

    Jason Liu
     
  • This is the 5.10.5 stable release

    * tag 'v5.10.5': (63 commits)
    Linux 5.10.5
    device-dax: Fix range release
    ext4: avoid s_mb_prefetch to be zero in individual scenarios
    ...

    Signed-off-by: Jason Liu

    Jason Liu
     
  • commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf upstream.

    acquire_slab() fails if there is contention on the freelist of the page
    (probably because some other CPU is concurrently freeing an object from
    the page). In that case, it might make sense to look for a different page
    (since there might be more remote frees to the page from other CPUs, and
    we don't want contention on struct page).

    However, the current code accidentally stops looking at the partial list
    completely in that case. Especially on kernels without CONFIG_NUMA set,
    this means that get_partial() fails and new_slab_objects() falls back to
    new_slab(), allocating new pages. This could lead to an unnecessary
    increase in memory fragmentation.

    Link: https://lkml.kernel.org/r/20201228130853.1871516-1-jannh@google.com
    Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop")
    Signed-off-by: Jann Horn
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
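
    The control-flow slip described above is easy to illustrate outside the
    kernel. Below is a minimal userspace analogy, not the SLUB code; the names
    (try_acquire, scan_partial) are invented. A failed try-acquire should move
    on to the next candidate rather than abandon the whole scan:

    #include <stdbool.h>
    #include <stdio.h>

    /* Pretend slot 0 is "contended" and cannot be acquired right now. */
    static bool try_acquire(int slot)
    {
        return slot != 0;
    }

    static int scan_partial(int nslots)
    {
        for (int i = 0; i < nslots; i++) {
            if (!try_acquire(i))
                continue;   /* keep scanning; giving up here was the bug pattern */
            return i;       /* got a candidate */
        }
        return -1;          /* only now fall back to allocating something new */
    }

    int main(void)
    {
        printf("acquired slot %d\n", scan_partial(4));
        return 0;
    }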
     
  • [ Upstream commit feb889fb40fafc6933339cf1cca8f770126819fb ]

    So technically there is nothing wrong with adding a pinned page to the
    swap cache, but the pinning obviously means that the page can't actually
    be free'd right now anyway, so it's a bit pointless.

    However, the real problem is not with it being a bit pointless: the real
    issue is that after we've added it to the swap cache, we'll try to unmap
    the page. That will succeed, because the code in mm/rmap.c doesn't know
    or care about pinned pages.

    Even the unmapping isn't fatal per se, since the page will stay around
    in memory due to the pinning, and we do hold the connection to it using
    the swap cache. But when we then touch it next and take a page fault,
    the logic in do_swap_page() will map it back into the process as a
    possibly read-only page, and we'll then break the page association on
    the next COW fault.

    Honestly, this issue could have been fixed in any of those other places:
    (a) we could refuse to unmap a pinned page (which makes conceptual
    sense), or (b) we could make sure to re-map a pinned page writably in
    do_swap_page(), or (c) we could just make do_wp_page() not COW the
    pinned page (which was what we historically did before that "mm:
    do_wp_page() simplification" commit).

    But while all of them are equally valid models for breaking this chain,
    not putting pinned pages into the swap cache in the first place is the
    simplest one by far.

    It's also the safest one: the reason why do_wp_page() was changed in the
    first place was that getting the "can I re-use this page" wrong is so
    fraught with errors. If you do it wrong, you end up with an incorrectly
    shared page.

    As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or
    do_swap_page() would be a serious bug since it is only a (very good)
    heuristic. Re-using the page requires a hard black-and-white rule with
    no room for ambiguity.

    In contrast, saying "this page is very likely dma pinned, so let's not
    add it to the swap cache and try to unmap it" is an obviously safe thing
    to do, and if the heuristic might very rarely be a false positive, no
    harm is done.

    Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
    Reported-and-tested-by: Martin Raiber
    Cc: Pavel Begunkov
    Cc: Jens Axboe
    Cc: Peter Xu
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit eb351d75ce1e75b4f793d609efac08426ca50acd upstream.

    Fix the build error:

    mm/process_vm_access.c:277:5: error: implicit declaration of function 'in_compat_syscall'; did you mean 'in_ia32_syscall'? [-Werror=implicit-function-declaration]

    Fixes: 38dc5079da7081e "Fix compat regression in process_vm_rw()"
    Reported-by: syzbot+5b0d0de84d6c65b8dd2b@syzkaller.appspotmail.com
    Cc: Kyle Huey
    Cc: Jens Axboe
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrew Morton
     
  • commit 0eb98f1588c2cc7a79816d84ab18a55d254f481c upstream.

    The huge page size is encoded for VM_FAULT_HWPOISON errors only. So if
    we return VM_FAULT_HWPOISON, huge page size would just be ignored.

    Link: https://lkml.kernel.org/r/20210107123449.38481-1-linmiaohe@huawei.com
    Fixes: aa50d3a7aa81 ("Encode huge page size for VM_FAULT_HWPOISON errors")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit c22ee5284cf58017fa8c6d21d8f8c68159b6faab upstream.

    In the VM_MAP_PUT_PAGES case, we should put the pages and free the page
    array in vfree(). But we forgot to set area->nr_pages in vmap(), so we
    fail to put the pages in __vunmap() because area->nr_pages is 0 (a minimal
    sketch of why the count matters follows this entry).

    Link: https://lkml.kernel.org/r/20210107123541.39206-1-linmiaohe@huawei.com
    Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
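
    A minimal userspace sketch of why the count matters; this is an analogy
    for area->nr_pages with invented names (mapping, mapping_free), not the
    vmalloc code. If the owner of a page array never records how many entries
    it has, the teardown path has nothing to iterate over and the pages leak:

    #include <stdlib.h>

    struct mapping {
        void **pages;
        unsigned int nr_pages;  /* must be set at map time, or teardown frees nothing */
    };

    static void mapping_free(struct mapping *m)
    {
        for (unsigned int i = 0; i < m->nr_pages; i++)
            free(m->pages[i]);  /* analogous to putting each page in __vunmap() */
        free(m->pages);
        free(m);
    }

    int main(void)
    {
        struct mapping *m = malloc(sizeof(*m));

        m->nr_pages = 2;        /* the assignment that was missing in the bug */
        m->pages = malloc(m->nr_pages * sizeof(void *));
        for (unsigned int i = 0; i < m->nr_pages; i++)
            m->pages[i] = malloc(4096);
        mapping_free(m);
        return 0;
    }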
     

13 Jan, 2021

1 commit

  • commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.

    Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
    logic") we've had some very occasional reports of BUG_ON(PageWriteback)
    in write_cache_pages(), which we thought we already fixed in commit
    073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

    But syzbot just reported another one, even with that commit in place.

    And it turns out that there's a simpler way to trigger the BUG_ON() than
    the one Hugh found with page re-use. It all boils down to the fact that
    the page writeback is ostensibly serialized by the page lock, but that
    isn't actually really true.

    Yes, the people _setting_ writeback all do so under the page lock, but
    the actual clearing of the bit - and waking up any waiters - happens
    without any page lock.

    This gives us this fairly simple race condition:

    CPU1 = end previous writeback
    CPU2 = start new writeback under page lock
    CPU3 = write_cache_pages()

    CPU1                        CPU2                        CPU3
    ----                        ----                        ----

    end_page_writeback()
      test_clear_page_writeback(page)
      ... delayed...

                                lock_page();
                                set_page_writeback()
                                unlock_page()

                                                            lock_page()
                                                            wait_on_page_writeback();

      wake_up_page(page, PG_writeback);
      .. wakes up CPU3 ..

                                                            BUG_ON(PageWriteback(page));

    where the BUG_ON() happens because we woke up the PG_writeback bit
    because of the _previous_ writeback, but a new one had already been
    started because the clearing of the bit wasn't actually atomic wrt the
    actual wakeup or serialized by the page lock.

    The reason this didn't use to happen was that the old logic in waiting
    on a page bit would just loop if it ever saw the bit set again.

    The nice proper fix would probably be to get rid of the whole "wait for
    writeback to clear, and then set it" logic in the writeback path, and
    replace it with an atomic "wait-to-set" (ie the same as we have for page
    locking: we set the page lock bit with a single "lock_page()", not with
    "wait for lock bit to clear and then set it").

    However, our current model for writeback is that the waiting for the
    writeback bit is done by the generic VFS code (ie write_cache_pages()),
    but the actual setting of the writeback bit is done much later by the
    filesystem ".writepages()" function.

    IOW, to make the writeback bit have that same kind of "wait-to-set"
    behavior as we have for page locking, we'd have to change our roughly
    ~50 different writeback functions. Painful.

    Instead, just make "wait_on_page_writeback()" loop on the very unlikely
    situation that the PG_writeback bit is still set, basically re-instating
    the old behavior. This is very non-optimal in case of contention, but
    since we only ever set the bit under the page lock, that situation is
    controlled.

    Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
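
    The entry above re-instates a "wait, then re-check" loop. Below is a
    minimal userspace sketch of that pattern using a condition variable; it is
    an analogy with invented names (writeback, wait_for_writeback_clear), not
    the page waitqueue code. The key point is the while loop: a wakeup meant
    for a previous writeback must not be trusted if the flag was set again.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static bool writeback;

    static void wait_for_writeback_clear(void)
    {
        pthread_mutex_lock(&lock);
        while (writeback)               /* loop: one wakeup does not mean "clear" */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    static void *end_writeback(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        writeback = false;
        pthread_cond_broadcast(&cond);  /* analogous to wake_up_page() */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        writeback = true;
        pthread_create(&t, NULL, end_writeback, NULL);
        wait_for_writeback_clear();
        pthread_join(t, NULL);
        puts("writeback cleared");
        return 0;
    }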
     

06 Jan, 2021

2 commits

  • commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.

    VMware observed a performance regression during memmap init on their
    platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
    iterate over memblock regions rather that check each PFN") causing it.

    Before the commit:

    [0.033176] Normal zone: 1445888 pages used for memmap
    [0.033176] Normal zone: 89391104 pages, LIFO batch:63
    [0.035851] ACPI: PM-Timer IO Port: 0x448

    With the commit:

    [0.026874] Normal zone: 1445888 pages used for memmap
    [0.026875] Normal zone: 89391104 pages, LIFO batch:63
    [2.028450] ACPI: PM-Timer IO Port: 0x448

    The root cause is the current memmap defer init doesn't work as expected.

    Before, memmap_init_zone() did the memmap init of one whole zone: it
    initialized all the low zones of a numa node in full, but deferred the
    memmap init of the last zone in that numa node. However, since commit
    73a6e474cb376, memmap_init() iterates over the memblock regions inside one
    zone and calls memmap_init_zone() to do the memmap init for each region.

    E.g., on VMware's system, the memory layout is as below; there are two
    memory regions in node 2. The current code mistakenly initializes the
    whole 1st region [mem 0xab00000000-0xfcffffffff], and only then applies
    the memmap defer to the 2nd region [mem 0x10000000000-0x1033fffffff],
    initializing just one memory section of it. In fact, we only expect one
    memory section's memmap to be initialized up front. That is why memmap
    init takes so much longer.

    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
    [ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
    [ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
    [ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
    [ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

    Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
    down the real zone end pfn so that defer_init() can use it to judge
    whether defer need be taken in zone wide.

    Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
    Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
    Signed-off-by: Baoquan He
    Reported-by: Rahul Gopakumar
    Reviewed-by: Mike Rapoport
    Cc: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     
  • commit e7dd91c456a8cdbcd7066997d15e36d14276a949 upstream.

    syzbot reported the deadlock here [1]. The issue is in hugetlb cow
    error handling when there are not enough huge pages for the faulting
    task which took the original reservation. It is possible that other
    (child) tasks could have consumed pages associated with the reservation.
    In this case, we want the task which took the original reservation to
    succeed. So, we unmap any associated pages in children so that they can
    be used by the faulting task that owns the reservation.

    The unmapping code needs to hold i_mmap_rwsem in write mode. However,
    due to commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd
    sharing synchronization") we are already holding i_mmap_rwsem in read
    mode when hugetlb_cow is called.

    Technically, i_mmap_rwsem does not need to be held in read mode for COW
    mappings as they can not share pmd's. Modifying the fault code to not
    take i_mmap_rwsem in read mode for COW (and other non-sharable) mappings
    is too involved for a stable fix.

    Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem before
    unmapping. This is OK as it is technically not needed. They are
    reacquired after unmapping as expected by calling code. Since this is
    done in an uncommon error path, the overhead of dropping and reacquiring
    mutexes is acceptable.

    While making changes, remove redundant BUG_ON after unmap_ref_private.

    [1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com

    Link: https://lkml.kernel.org/r/4c5781b8-3b00-761e-c0c7-c5edebb6ec1a@oracle.com
    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Signed-off-by: Mike Kravetz
    Reported-by: syzbot+5eee4145df3c15e96625@syzkaller.appspotmail.com
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: "Aneesh Kumar K . V"
    Cc: Davidlohr Bueso
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

04 Jan, 2021

1 commit

  • This is the 5.10.4 stable release

    * tag 'v5.10.4': (717 commits)
    Linux 5.10.4
    x86/CPU/AMD: Save AMD NodeId as cpu_die_id
    drm/edid: fix objtool warning in drm_cvt_modes()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/gpu/drm/imx/dcss/dcss-plane.c
    drivers/media/i2c/ov5640.c

    Jason Liu
     

30 Dec, 2020

13 commits

  • commit dcf5aedb24f899d537e21c18ea552c780598d352 upstream.

    Use temporary slots in reclaim function to avoid possible race when
    freeing those.

    While at it, make sure we check CLAIMED flag under page lock in the
    reclaim function to make sure we are not racing with z3fold_alloc().

    Link: https://lkml.kernel.org/r/20201209145151.18994-4-vitaly.wool@konsulko.com
    Signed-off-by: Vitaly Wool
    Cc:
    Cc: Mike Galbraith
    Cc: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • commit fc5488651c7d840c9cad9b0f273f2f31bd03413a upstream.

    Patch series "z3fold: stability / rt fixes".

    Address z3fold stability issues under stress load, primarily in the
    reclaim and free aspects. Besides, it fixes the locking problems that
    were only seen in real-time kernel configuration.

    This patch (of 3):

    There used to be two places in the code where slots could be freed: when
    freeing the last allocated handle from the slots, and when releasing the
    z3fold header these slots are linked to. The logic to decide whether to
    free certain slots was complicated and error prone in both functions, and
    it led to failures in the RT case.

    To fix that, make free_handle() the single point of freeing slots.

    Link: https://lkml.kernel.org/r/20201209145151.18994-1-vitaly.wool@konsulko.com
    Link: https://lkml.kernel.org/r/20201209145151.18994-2-vitaly.wool@konsulko.com
    Signed-off-by: Vitaly Wool
    Tested-by: Mike Galbraith
    Cc: Sebastian Andrzej Siewior
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • [ Upstream commit 597c892038e08098b17ccfe65afd9677e6979800 ]

    On 2-node NUMA hosts we see bursts of kswapd reclaim and subsequent
    pressure spikes and stalls from cache refaults while there is plenty of
    free memory in the system.

    Usually, kswapd is woken up when all eligible nodes in an allocation are
    full. But the code related to watermark boosting can wake kswapd on one
    full node while the other one is mostly empty. This may be justified to
    fight fragmentation, but is currently unconditionally done whether
    watermark boosting is occurring or not.

    In our case, many of our workloads' throughput scales with available
    memory, and pure utilization is a more tangible concern than trends
    around longer-term fragmentation. As a result we generally disable
    watermark boosting.

    Wake kswapd only when watermark boosting is requested.

    Link: https://lkml.kernel.org/r/20201020175833.397286-1-hannes@cmpxchg.org
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit 7fc2513aa237e2ce239ab54d7b04d1d79b317110 ]

    Preserve the error code from region_add() instead of returning success.

    Link: https://lkml.kernel.org/r/X9NGZWnZl5/Mt99R@mwanda
    Fixes: 0db9d74ed884 ("hugetlb: disable region_add file_region coalescing")
    Signed-off-by: Dan Carpenter
    Reviewed-by: Mike Kravetz
    Reviewed-by: David Hildenbrand
    Cc: Mina Almasry
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Dan Carpenter
     
  • [ Upstream commit 1e8aaedb182d6ddffc894b832e4962629907b3e0 ]

    madvise_inject_error() uses get_user_pages_fast to translate the address
    we specified to a page. After [1], we drop the extra reference count for
    memory_failure() path. That commit says that memory_failure wanted to
    keep the pin in order to take the page out of circulation.

    The truth is that we need to keep the page pinned, otherwise the page
    might be re-used after the put_page() and we can end up messing with
    someone else's memory.

    E.g:

    CPU0 (process X)                    CPU1
    madvise_inject_error
      get_user_pages
      put_page
                                        page gets reclaimed
                                        process Y allocates the page
      memory_failure
      // We mess with process Y memory

    madvise() is meant to operate on the caller's own address space, so
    messing with pages that do not belong to us seems like the wrong thing to
    do. To avoid that, let us keep the page pinned for memory_failure() as
    well.

    Pages for DAX mappings will release this extra refcount in
    memory_failure_dev_pagemap.

    [1] ("23e7b5c2e271: mm, madvise_inject_error:
    Let memory_failure() optionally take a page reference")

    Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
    Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
    Signed-off-by: Oscar Salvador
    Suggested-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Oscar Salvador
     
  • [ Upstream commit c041098c690fe53cea5d20c62f128a4f7a5c19fe ]

    The size of a vm area can be affected by the presence or absence of the
    guard page. In particular, when VM_NO_GUARD is not set, the actual
    accessible size is the recorded size minus the guard page.

    Currently kasan does not take this into account during the poison
    operation, and in particular it tries to poison the guard page as well.

    This approach, even if incorrect, does not cause an issue today because
    the tags for the guard page are simply written into shadow memory. With
    the future introduction of Tag-Based KASAN, the guard page is inaccessible
    by nature, so the tag write to this page triggers a fault.

    Fix the kasan shadow poisoning size by invoking get_vm_area_size() instead
    of accessing the field in the data structure directly, so the guard page
    is excluded (a minimal sketch of such an accessor follows this entry).

    Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
    Fixes: d98c9e83b5e7c ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
    Signed-off-by: Vincenzo Frascino
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vincenzo Frascino
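
    A minimal sketch of the accessor idea under the semantics described in the
    entry above (a guard page is present unless a NO_GUARD flag is set). This
    is a userspace mock with invented types, not the kernel's vm_struct or
    get_vm_area_size():

    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define VM_NO_GUARD 0x1UL

    struct vm_area_mock {
        unsigned long size;     /* includes the guard page unless VM_NO_GUARD is set */
        unsigned long flags;
    };

    /* Accessible size: subtract the guard page when one is present. */
    static unsigned long vm_area_mock_size(const struct vm_area_mock *a)
    {
        if (!(a->flags & VM_NO_GUARD))
            return a->size - PAGE_SIZE;
        return a->size;
    }

    int main(void)
    {
        struct vm_area_mock a = { .size = 5 * PAGE_SIZE, .flags = 0 };

        /* Poisoning a.size bytes would touch the guard page; use the accessor. */
        printf("poison %lu bytes, not %lu\n", vm_area_mock_size(&a), a.size);
        return 0;
    }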
     
  • [ Upstream commit 0a7dd4e901b8a4ee040ba953900d1d7120b34ee5 ]

    When multiple locks are acquired, they should be released in reverse
    order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
    case.

    s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
    s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);

    This unlock sequence, though allowed, is not optimal. If a waiter is
    present, mutex_unlock() will need to go through the slowpath of waking
    up the waiter with preemption disabled. Fix that by releasing the
    spinlock first before the mutex.

    Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
    Fixes: e36176be1c39 ("mm/vmalloc: rework vmap_area_lock")
    Signed-off-by: Waiman Long
    Reviewed-by: Uladzislau Rezki (Sony)
    Reviewed-by: David Hildenbrand
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
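
    A minimal userspace sketch of the release ordering described above. A
    pthread mutex and spinlock stand in for vmap_purge_lock and
    vmap_area_lock; userspace has no notion of disabled preemption, so this
    only illustrates the nesting: take the mutex, then the spinlock, and
    release the spinlock before the mutex.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t purge_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_spinlock_t area_lock;

    static void s_start_like(void)
    {
        pthread_mutex_lock(&purge_lock);    /* outer, sleeping lock */
        pthread_spin_lock(&area_lock);      /* inner, non-sleeping lock */
    }

    static void s_stop_like(void)
    {
        pthread_spin_unlock(&area_lock);    /* drop the inner lock first... */
        pthread_mutex_unlock(&purge_lock);  /* ...then the outer one */
    }

    int main(void)
    {
        pthread_spin_init(&area_lock, PTHREAD_PROCESS_PRIVATE);
        s_start_like();
        puts("critical section");
        s_stop_like();
        pthread_spin_destroy(&area_lock);
        return 0;
    }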
     
  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code to check the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
    secondary MMU's page table before the check. More specifically for those
    secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start() like kvm.

    However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
    absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
    access check before trying to unmap the page. So, at worst the reclaim
    will miss accesses in a very short window if we remove page table access
    check in unmapping code.

    There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
    reclaim. From memcg reclaim, page_referenced() only accounts accesses from
    processes in the same memcg as the target page, but the unmapping code
    considers accesses from all processes, decreasing the effectiveness of
    memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     
  • [ Upstream commit eefbfa7fd678805b38a46293e78543f98f353d3e ]

    The rcu_read_lock/unlock can only guarantee that the memcg will not be
    freed; it cannot guarantee that css_get() on the memcg succeeds.

    If the whole process of a cgroup offlining completes between reading an
    objcg->memcg pointer and bumping the css reference on another CPU, and
    there are exactly 0 external references to this memory cgroup (how did we
    get to obj_cgroup_charge() then?), css_get() can change the ref counter
    from 0 back to 1 (a sketch of the "increment only if non-zero" pattern
    follows this entry).

    Link: https://lkml.kernel.org/r/20201028035013.99711-2-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
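
    The usual remedy for this class of race is a "tryget" that refuses to
    resurrect a zero reference count. Below is a minimal C11 sketch of the
    increment-only-if-non-zero pattern; it is generic, not the cgroup css API:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Bump the count only if it is still non-zero; never revive a dying object. */
    static bool refcount_inc_not_zero(atomic_uint *ref)
    {
        unsigned int old = atomic_load(ref);

        while (old != 0) {
            if (atomic_compare_exchange_weak(ref, &old, old + 1))
                return true;    /* got a reference */
            /* the failed CAS reloaded 'old'; retry */
        }
        return false;           /* already at zero: the caller must fall back */
    }

    int main(void)
    {
        atomic_uint live = 1, dying = 0;

        printf("live: %d, dying: %d\n",
               refcount_inc_not_zero(&live), refcount_inc_not_zero(&dying));
        return 0;
    }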
     
  • [ Upstream commit 2f7659a314736b32b66273dbf91c19874a052fde ]

    Consider the following memcg hierarchy.

             root
            /    \
           A      B

    If we failed to get the reference on objcg of memcg A, the
    get_obj_cgroup_from_current can return the wrong objcg for the root
    memcg.

    Link: https://lkml.kernel.org/r/20201029164429.58703-1-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
     
  • [ Upstream commit 4509b42c38963f495b49aa50209c34337286ecbe ]

    These functions accomplish the same thing but have different
    implementations.

    unpin_user_page() has a bug where it calls mod_node_page_state() after
    calling put_page(), which creates a risk that the page could have been
    hot-unplugged from the system.

    Fix this by using put_compound_head() as the only implementation.

    __unpin_devmap_managed_user_page() and related can be deleted as well in
    favour of the simpler, but slower, version in put_compound_head() that has
    an extra atomic page_ref_sub, but always calls put_page() which internally
    contains the special devmap code.

    Move put_compound_head() to be directly after try_grab_compound_head() so
    people can find it in future.

    Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com
    Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Ira Weiny
    Reviewed-by: Jan Kara
    CC: Joao Martins
    CC: Jonathan Corbet
    CC: Dan Williams
    CC: Dave Chinner
    CC: Christoph Hellwig
    CC: Jane Chu
    CC: "Kirill A. Shutemov"
    CC: Michal Hocko
    CC: Mike Kravetz
    CC: Shuah Khan
    CC: Muchun Song
    CC: Vlastimil Babka
    CC: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                   CPU 1
    get_user_pages_fast()
      internal_get_user_pages_fast()
                                            copy_page_range()
                                              pte_alloc_map_lock()
                                                copy_present_page()
                                                  atomic_read(has_pinned) == 0
                                                  page_maybe_dma_pinned() == false
      atomic_set(has_pinned, 1);
      gup_pgd_range()
        gup_pte_range()
          pte_t pte = gup_get_pte(ptep)
          pte_access_permitted(pte)
          try_grab_compound_head()
                                                  pte = pte_wrprotect(pte)
                                                  set_pte_at();
                                              pte_unmap_unlock()
      // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()")

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
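
    A minimal C11 sketch of the seqcount collision check described above, with
    invented names (write_protect_seq, writer_begin, lockless_walk). It shows
    the shape of the technique, not the kernel's seqcount_t or GUP code: the
    writer makes the count odd while it runs, and the reader re-checks the
    count and falls back to the slow path if a writer overlapped.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_uint write_protect_seq;

    /* Writer side: analogous to wrapping copy_p4d_range() during fork(). */
    static void writer_begin(void) { atomic_fetch_add(&write_protect_seq, 1); }
    static void writer_end(void)   { atomic_fetch_add(&write_protect_seq, 1); }

    /* Reader side: analogous to the check around gup_pgd_range(). */
    static bool lockless_walk(void)
    {
        unsigned int seq = atomic_load(&write_protect_seq);

        if (seq & 1)
            return false;       /* a writer is active: use the slow path */

        /* ... the lockless work would happen here ... */

        if (atomic_load(&write_protect_seq) != seq)
            return false;       /* a writer ran concurrently: discard, fall back */
        return true;
    }

    int main(void)
    {
        printf("no writer: %d\n", lockless_walk());
        writer_begin();
        printf("writer active: %d\n", lockless_walk());
        writer_end();
        return 0;
    }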
     
  • [ Upstream commit c28b1fc70390df32e29991eedd52bd86e7aba080 ]

    Patch series "Add a seqcount between gup_fast and copy_page_range()", v4.

    As discussed and suggested by Linus use a seqcount to close the small race
    between gup_fast and copy_page_range().

    Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
    in this case and it doesn't trigger any lockdeps.

    I was able to test it using two threads, one forking and the other using
    ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep
    made the window large enough to reliably hit to test the logic.

    This patch (of 2):

    The next patch in this series makes the lockless flow a little more
    complex, so move the entire block into a new function and remove a level
    of indentation. Tidy a bit of cruft:

    - addr is always the same as start, so use start

    - Use the modern check_add_overflow() for computing end = start + len

    - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to
    avoid shift overflow, make the variables unsigned long to avoid coding
    casts in both places. nr_pinned was missing its cast

    - The handling of ret and nr_pinned can be streamlined a bit

    No functional change.

    Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Jan Kara
    Reviewed-by: John Hubbard
    Reviewed-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
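
    On the check_add_overflow() point in the list above: a minimal userspace
    sketch using GCC/Clang's __builtin_add_overflow, which is what an
    overflow-checked "end = start + len" boils down to (the kernel macro
    itself is not used here):

    #include <stdio.h>

    int main(void)
    {
        unsigned long start = (unsigned long)-4096;  /* near the top of the range */
        unsigned long len = 8192;
        unsigned long end;

        if (__builtin_add_overflow(start, len, &end)) {
            puts("start + len overflows: reject the request");
            return 1;
        }
        printf("end = %lu\n", end);
        return 0;
    }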
     

12 Dec, 2020

3 commits

  • Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    added compound_nr counter to first tail struct page, overlaying with
    page->mapping. The overlay itself is fine, but while freeing gigantic
    hugepages via free_contig_range(), a "bad page" check will trigger for
    non-NULL page->mapping on the first tail page:

    BUG: Bad page state in process bash pfn:380001
    page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
    aops:0x0
    flags: 0x3ffff00000000000()
    raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
    raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
    page dumped because: non-NULL mapping
    Modules linked in:
    CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
    Hardware name: IBM 3906 M03 703 (LPAR)
    Call Trace:
    show_stack+0x6e/0xe8
    dump_stack+0x90/0xc8
    bad_page+0xd6/0x130
    free_pcppages_bulk+0x26a/0x800
    free_unref_page+0x6e/0x90
    free_contig_range+0x94/0xe8
    update_and_free_page+0x1c4/0x2c8
    free_pool_huge_page+0x11e/0x138
    set_max_huge_pages+0x228/0x300
    nr_hugepages_store_common+0xb8/0x130
    kernfs_fop_write+0xd2/0x218
    vfs_write+0xb0/0x2b8
    ksys_write+0xac/0xe0
    system_call+0xe6/0x288
    Disabling lock debugging due to kernel taint

    This is because only the compound_order is cleared in
    destroy_compound_gigantic_page(), and compound_nr is set to
    1U << order == 1 for order 0 in set_compound_order(page, 0).

    Fix this by explicitly clearing compound_nr for first tail page after
    calling set_compound_order(page, 0).

    Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
    Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Heiko Carstens
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
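
    A minimal sketch of the bookkeeping pitfall with a mock struct rather than
    struct page: an order of 0 stores a count of 1U << 0 == 1, so resetting
    the order alone does not clear the overlaid count; it has to be zeroed
    explicitly, which is what the fix adds.

    #include <stdio.h>

    struct compound_mock {
        unsigned char order;
        unsigned int nr;        /* overlays another field in the real struct page */
    };

    static void set_order(struct compound_mock *p, unsigned int order)
    {
        p->order = order;
        p->nr = 1U << order;    /* order 0 still leaves nr == 1 */
    }

    int main(void)
    {
        struct compound_mock tail = { 0 };

        set_order(&tail, 9);    /* gigantic-page-style setup */
        set_order(&tail, 0);    /* "destroy": order cleared... */
        printf("nr after set_order(0): %u\n", tail.nr);  /* ...but nr is 1 */
        tail.nr = 0;            /* the explicit clear the fix introduces */
        return 0;
    }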
     
    We hit this issue in our internal test. When generic KASAN is enabled, a
    kfree()'d object is put into the per-cpu quarantine first. If the cpu goes
    offline, the object still remains in that per-cpu quarantine. If we call
    kmem_cache_destroy() now, SLUB will report an "Objects remaining" error.

    =============================================================================
    BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
    CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2b0
    show_stack+0x18/0x68
    dump_stack+0xfc/0x168
    slab_err+0xac/0xd4
    __kmem_cache_shutdown+0x1e4/0x3c8
    kmem_cache_destroy+0x68/0x130
    test_version_show+0x84/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    do_el0_svc+0x38/0xa0
    el0_sync_handler+0x170/0x178
    el0_sync+0x174/0x180
    INFO: Object 0x(____ptrval____) @offset=15848
    INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
    stack_trace_save+0x9c/0xd0
    set_track+0x64/0xf0
    alloc_debug_processing+0x104/0x1a0
    ___slab_alloc+0x628/0x648
    __slab_alloc.isra.0+0x2c/0x58
    kmem_cache_alloc+0x560/0x588
    test_version_show+0x98/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    kmem_cache_destroy test_module_slab: Slab cache still has objects

    Register a cpu hotplug function to remove all objects in the offline
    per-cpu quarantine when cpu is going offline. Set a per-cpu variable to
    indicate this cpu is offline.

    [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
    Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com

    Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee
    Signed-off-by: Zqiang
    Suggested-by: Dmitry Vyukov
    Reported-by: Guangye Yang
    Reviewed-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Matthias Brugger
    Cc: Nicholas Tang
    Cc: Miles Chen
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuan-Ying Lee
     
    Revert commit 3351b16af494 ("mm/filemap: add static for function
    __add_to_page_cache_locked") due to an incompatibility with
    ALLOW_ERROR_INJECTION which results in build errors.

    Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
    Tested-by: Justin Forbes
    Tested-by: Greg Thelen
    Acked-by: Alexei Starovoitov
    Cc: Michal Kubecek
    Cc: Alex Shi
    Cc: Souptick Joarder
    Cc: Daniel Borkmann
    Cc: Josef Bacik
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2020

1 commit

    Jann spotted a security hole due to a race in the mm ownership check.

    If the task is sharing the mm_struct but goes through execve() before
    mm_access(), it could skip the process_madvise_behavior_valid check. That
    allows *any* advice hint to reach into the remote process.

    This patch removes the mm ownership check. With it, we lose the ability
    for a local process to give *any* advice hint via the vector interface for
    some reason (e.g., performance). Since there is no concrete example of
    such use upstream yet, it is better to remove the ability for now and
    revisit when such new advice comes up.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Reported-by: Jann Horn
    Suggested-by: Jann Horn
    Signed-off-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

07 Dec, 2020

7 commits

    On success, mmap should return the begin address of the newly mapped
    area, but the patch "mm: mmap: merge vma after call_mmap() if possible"
    assigns the return value addr from the vm_start of the newly merged vma.
    Users of mmap will get a wrong address if the vma is merged after
    call_mmap(). We fix this by moving the assignment to addr before merging
    the vma (see the sketch after this entry).

    We have a driver which changes vm_flags, and this bug is found by our
    testcases.

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Signed-off-by: Liu Zixian
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: David Hildenbrand
    Cc: Miaohe Lin
    Cc: Hongxiang Lou
    Cc: Hu Shiyuan
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
    Signed-off-by: Linus Torvalds

    Liu Zixian
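
    A minimal sketch of the ordering fix referenced above, with invented names
    (mock_vma, try_merge); it is not the mmap code itself. Record the address
    to return before the merge, because merging can move vm_start:

    #include <stdio.h>

    struct mock_vma {
        unsigned long vm_start;
        unsigned long vm_end;
    };

    /* A merge may extend the vma to a lower start address. */
    static void try_merge(struct mock_vma *vma)
    {
        vma->vm_start -= 0x1000;
    }

    static unsigned long do_map(struct mock_vma *vma)
    {
        unsigned long addr = vma->vm_start;  /* capture before merging (the fix) */

        try_merge(vma);
        return addr;            /* returning vma->vm_start here was the bug */
    }

    int main(void)
    {
        struct mock_vma vma = { .vm_start = 0x20000, .vm_end = 0x30000 };

        printf("mmap returns %#lx (vm_start is now %#lx)\n",
               do_map(&vma), vma.vm_start);
        return 0;
    }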
     
    Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware. It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    - 63.28% 0.01% [kernel] [k] worker_thread
    - 49.97% worker_thread
    - 52.64% process_one_work
    - 62.08% css_killed_work_fn
    - hugetlb_cgroup_css_offline
    41.52% _raw_spin_lock
    - 2.82% _cond_resched
    rcu_all_qs
    2.66% PageHuge
    - 0.57% schedule
    - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation counts
    when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup for cases with reservation
    counts. Unfortunately, the check on reservation counts was incorrectly
    added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed: the hstate index is not reinitialized each time through the
    do-while loop. Fix this as well (see the sketch after this entry).

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
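
    A minimal sketch of the second bug mentioned above (the hstate index not
    being reinitialized), with generic names rather than the hugetlb code: the
    inner index must start from 0 on every pass of the outer do-while,
    otherwise later passes skip all but the last element.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_ITEMS 3

    static int usage[NR_ITEMS] = { 2, 0, 1 };

    static bool have_usage(void)
    {
        for (int i = 0; i < NR_ITEMS; i++)
            if (usage[i])
                return true;
        return false;
    }

    int main(void)
    {
        int passes = 0;

        do {
            /* The fix: restart from 0 each pass, not where the last pass stopped. */
            for (int idx = 0; idx < NR_ITEMS; idx++)
                if (usage[idx])
                    usage[idx]--;   /* analogous to moving one count to the parent */
            passes++;
        } while (have_usage());

        printf("drained in %d passes\n", passes);
        return 0;
    }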
     
  • mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

    Signed-off-by: Alex Shi
    Signed-off-by: Andrew Morton
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
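
    A minimal userspace sketch of the deferral pattern: a pthread spinlock and
    free() stand in for the kernel spinlock and kvfree() (userspace free()
    does not sleep, so this only illustrates the ordering). Detach the pointer
    under the lock, free it after dropping the lock.

    #include <pthread.h>
    #include <stdlib.h>

    struct box {
        pthread_spinlock_t lock;
        void *buf;
    };

    static void replace_buf(struct box *b, void *newbuf)
    {
        void *old;

        pthread_spin_lock(&b->lock);
        old = b->buf;               /* detach under the lock */
        b->buf = newbuf;
        pthread_spin_unlock(&b->lock);

        free(old);                  /* the possibly-sleeping free happens unlocked */
    }

    int main(void)
    {
        struct box b;

        pthread_spin_init(&b.lock, PTHREAD_PROCESS_PRIVATE);
        b.buf = malloc(64);
        replace_buf(&b, malloc(128));
        free(b.buf);
        pthread_spin_destroy(&b.lock);
        return 0;
    }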
     
    While I was doing zram testing, I found that decompression sometimes
    failed since the compression buffer was corrupted. With investigation, I
    found that the commit below calls cond_resched() unconditionally, so it
    could cause a problem in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending calling code) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some arm32
    platforms, it has been a headache to maintain since it abused APIs[1]
    (e.g., unmap_kernel_range in atomic context).

    Since we are approaching the deprecation of 32-bit machines, the config
    option has only been available for builtin builds since v5.8, and lastly
    it has not been the default option in zsmalloc, it's time to drop the
    option for better maintenance.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen, but confusingly it neither got shrunk by
    the reclaimer (the host was under very tight memory) nor by dropping
    caches. The vmcore shows there are over 14M negative dentry objects on the
    lru, but tracing shows they were not scanned at all.

    Further investigation shows the memcg's vfs shrinker_map bit is not set.
    So the reclaimer and cache dropping just skip calling the vfs shrinker,
    and we had to reboot the hosts to get the memory back.

    I didn't manage to come up with a reproducer in test environment, and
    the problem can't be reproduced after rebooting. But it seems there is
    race between shrinker map bit clear and reparenting by code inspection.
    The hypothesis is elaborated as below.

    The memcg hierarchy on our production environment looks like:

               root
              /    \
        system      user

    The main workloads are running under user slice's children, and it
    creates and removes memcg frequently. So reparenting happens very often
    under user slice, but no task is under user slice directly.

    So with the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A                                   CPU B
    reparent
        dst->nr_items == 0
                                            shrinker:
                                                total_objects == 0
        add src->nr_items to dst
        set_bit
                                                return SHRINK_EMPTY
                                                clear_bit
    child memcg offline
        replace child's kmemcg_id with
        parent's (in memcg_offline_kmem())
                                            list_lru_del() between shrinker runs
                                                see parent's kmemcg_id
                                                dec dst->nr_items
    reparent again
        dst->nr_items may go negative
        due to concurrent list_lru_del()
                                            The second run of shrinker:
                                                read nr_items without any
                                                synchronization, so it may
                                                see intermediate negative
                                                nr_items then total_objects
                                                may return 0 coincidently

                                                keep the bit cleared
        dst->nr_items != 0
        skip set_bit
        add src->nr_items to dst

    After this point dst->nr_item may never go zero, so reparenting will not
    set shrinker_map bit anymore. And since there is no task under user
    slice directly, so no new object will be added to its lru to set the
    shrinker map bit either. That bit is kept cleared forever.

    How does list_lru_del() race with reparenting? It is because reparenting
    replaces children's kmemcg_id with the parent's without the protection of
    nlru->lock, so list_lru_del() may see the parent's kmemcg_id but actually
    delete items from the child's lru, while dec'ing the parent's nr_items; so
    the parent's nr_items may go negative, as commit 2788cf0c401c ("memcg:
    reparent list_lrus and free kmemcg_id on css offline") says.

    Since it is impossible for dst->nr_items to go negative and src->nr_items
    to go to zero at the same time, it seems we can set the shrinker map bit
    iff src->nr_items != 0. We could synchronize list_lru_count_one() and
    reparenting with nlru->lock, but checking src->nr_items in reparenting is
    the simplest approach and avoids lock contention.

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating of exceeding one of memory.max limits), the allocation
    should fail, instead of falling back to non-accounted mode.

    To make the code more readable, move memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into bodies of these
    hooks.

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
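
    A minimal sketch of the shape of the proposed fix, with an invented object
    and refcount rather than struct page: take a reference before clearing the
    flag and waking, and drop it only after the wakeup, so the wake path never
    touches a freed object.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct obj {
        atomic_int refcount;
        atomic_int writeback;
    };

    static void put_obj(struct obj *o)
    {
        if (atomic_fetch_sub(&o->refcount, 1) == 1)
            free(o);                        /* last reference gone */
    }

    static void wake_waiters(struct obj *o)
    {
        (void)o;                            /* stand-in for wake_up_page() */
    }

    static void end_writeback(struct obj *o)
    {
        atomic_fetch_add(&o->refcount, 1);  /* get_page() before the clear */
        atomic_store(&o->writeback, 0);     /* test_clear_page_writeback() analogue */
        wake_waiters(o);                    /* the object is guaranteed alive here */
        put_obj(o);                         /* put_page() after wake_up_page() */
    }

    int main(void)
    {
        struct obj *o = malloc(sizeof(*o));

        atomic_init(&o->refcount, 1);       /* the "page cache" reference */
        atomic_init(&o->writeback, 1);
        end_writeback(o);
        put_obj(o);                         /* drop the cache reference */
        return 0;
    }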
     

23 Nov, 2020

1 commit

  • The calculation of the end page index was incorrect, leading to a
    regression of 70% when running stress-ng.

    With this fix, we instead see a performance improvement of 3%.

    Fixes: e6e88712e43b ("mm: optimise madvise WILLNEED")
    Reported-by: kernel test robot
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Xing Zhengjun
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Feng Tang
    Cc: "Chen, Rong A"
    Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)