20 Jan, 2021

8 commits

  • This is the 5.10.9 stable release

    * tag 'v5.10.9': (153 commits)
    Linux 5.10.9
    netfilter: nf_nat: Fix memleak in nf_nat_init
    netfilter: conntrack: fix reading nf_conntrack_buckets
    ...

    Signed-off-by: Jason Liu

    Jason Liu
     
  • This is the 5.10.7 stable release

    * tag 'v5.10.7': (144 commits)
    Linux 5.10.7
    scsi: target: Fix XCOPY NAA identifier lookup
    rtlwifi: rise completion at the last step of firmware callback
    ...

    Signed-off-by: Jason Liu

    Jason Liu
     
  • This is the 5.10.5 stable release

    * tag 'v5.10.5': (63 commits)
    Linux 5.10.5
    device-dax: Fix range release
    ext4: avoid s_mb_prefetch to be zero in individual scenarios
    ...

    Signed-off-by: Jason Liu

    Jason Liu
     
  • commit 8ff60eb052eeba95cfb3efe16b08c9199f8121cf upstream.

    acquire_slab() fails if there is contention on the freelist of the page
    (probably because some other CPU is concurrently freeing an object from
    the page). In that case, it might make sense to look for a different page
    (since there might be more remote frees to the page from other CPUs, and
    we don't want contention on struct page).

    However, the current code accidentally stops looking at the partial list
    completely in that case. Especially on kernels without CONFIG_NUMA set,
    this means that get_partial() fails and new_slab_objects() falls back to
    new_slab(), allocating new pages. This could lead to an unnecessary
    increase in memory fragmentation.

    Link: https://lkml.kernel.org/r/20201228130853.1871516-1-jannh@google.com
    Fixes: 7ced37197196 ("slub: Acquire_slab() avoid loop")
    Signed-off-by: Jann Horn
    Acked-by: David Rientjes
    Acked-by: Joonsoo Kim
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
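
    The control-flow slip described above is easy to illustrate outside the
    kernel. Below is a minimal userspace analogy, not the SLUB code; the names
    (try_acquire, scan_partial) are invented. A failed try-acquire should move
    on to the next candidate rather than abandon the whole scan:

    #include <stdbool.h>
    #include <stdio.h>

    /* Pretend slot 0 is "contended" and cannot be acquired right now. */
    static bool try_acquire(int slot)
    {
        return slot != 0;
    }

    static int scan_partial(int nslots)
    {
        for (int i = 0; i < nslots; i++) {
            if (!try_acquire(i))
                continue;   /* keep scanning; giving up here was the bug pattern */
            return i;       /* got a candidate */
        }
        return -1;          /* only now fall back to allocating something new */
    }

    int main(void)
    {
        printf("acquired slot %d\n", scan_partial(4));
        return 0;
    }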
     
  • [ Upstream commit feb889fb40fafc6933339cf1cca8f770126819fb ]

    So technically there is nothing wrong with adding a pinned page to the
    swap cache, but the pinning obviously means that the page can't actually
    be free'd right now anyway, so it's a bit pointless.

    However, the real problem is not with it being a bit pointless: the real
    issue is that after we've added it to the swap cache, we'll try to unmap
    the page. That will succeed, because the code in mm/rmap.c doesn't know
    or care about pinned pages.

    Even the unmapping isn't fatal per se, since the page will stay around
    in memory due to the pinning, and we do hold the connection to it using
    the swap cache. But when we then touch it next and take a page fault,
    the logic in do_swap_page() will map it back into the process as a
    possibly read-only page, and we'll then break the page association on
    the next COW fault.

    Honestly, this issue could have been fixed in any of those other places:
    (a) we could refuse to unmap a pinned page (which makes conceptual
    sense), or (b) we could make sure to re-map a pinned page writably in
    do_swap_page(), or (c) we could just make do_wp_page() not COW the
    pinned page (which was what we historically did before that "mm:
    do_wp_page() simplification" commit).

    But while all of them are equally valid models for breaking this chain,
    not putting pinned pages into the swap cache in the first place is the
    simplest one by far.

    It's also the safest one: the reason why do_wp_page() was changed in the
    first place was that getting the "can I re-use this page" wrong is so
    fraught with errors. If you do it wrong, you end up with an incorrectly
    shared page.

    As a result, using "page_maybe_dma_pinned()" in either do_wp_page() or
    do_swap_page() would be a serious bug since it is only a (very good)
    heuristic. Re-using the page requires a hard black-and-white rule with
    no room for ambiguity.

    In contrast, saying "this page is very likely dma pinned, so let's not
    add it to the swap cache and try to unmap it" is an obviously safe thing
    to do, and if the heuristic might very rarely be a false positive, no
    harm is done.

    Fixes: 09854ba94c6a ("mm: do_wp_page() simplification")
    Reported-and-tested-by: Martin Raiber
    Cc: Pavel Begunkov
    Cc: Jens Axboe
    Cc: Peter Xu
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     
  • commit eb351d75ce1e75b4f793d609efac08426ca50acd upstream.

    Fix the build error:

    mm/process_vm_access.c:277:5: error: implicit declaration of function 'in_compat_syscall'; did you mean 'in_ia32_syscall'? [-Werror=implicit-function-declaration]

    Fixes: 38dc5079da7081e "Fix compat regression in process_vm_rw()"
    Reported-by: syzbot+5b0d0de84d6c65b8dd2b@syzkaller.appspotmail.com
    Cc: Kyle Huey
    Cc: Jens Axboe
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Andrew Morton
     
  • commit 0eb98f1588c2cc7a79816d84ab18a55d254f481c upstream.

    The huge page size is encoded for VM_FAULT_HWPOISON errors only. So if
    we return VM_FAULT_HWPOISON, huge page size would just be ignored.

    Link: https://lkml.kernel.org/r/20210107123449.38481-1-linmiaohe@huawei.com
    Fixes: aa50d3a7aa81 ("Encode huge page size for VM_FAULT_HWPOISON errors")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
     
  • commit c22ee5284cf58017fa8c6d21d8f8c68159b6faab upstream.

    In the VM_MAP_PUT_PAGES case, we should put the pages and free the page
    array in vfree(). But we forgot to set area->nr_pages in vmap(), so we
    fail to put the pages in __vunmap() because area->nr_pages is 0 (a minimal
    sketch of why the count matters follows this entry).

    Link: https://lkml.kernel.org/r/20210107123541.39206-1-linmiaohe@huawei.com
    Fixes: b944afc9d64d ("mm: add a VM_MAP_PUT_PAGES flag for vmap")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Reviewed-by: Uladzislau Rezki (Sony)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Miaohe Lin
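
    A minimal userspace sketch of why the count matters; this is an analogy
    for area->nr_pages with invented names (mapping, mapping_free), not the
    vmalloc code. If the owner of a page array never records how many entries
    it has, the teardown path has nothing to iterate over and the pages leak:

    #include <stdlib.h>

    struct mapping {
        void **pages;
        unsigned int nr_pages;  /* must be set at map time, or teardown frees nothing */
    };

    static void mapping_free(struct mapping *m)
    {
        for (unsigned int i = 0; i < m->nr_pages; i++)
            free(m->pages[i]);  /* analogous to putting each page in __vunmap() */
        free(m->pages);
        free(m);
    }

    int main(void)
    {
        struct mapping *m = malloc(sizeof(*m));

        m->nr_pages = 2;        /* the assignment that was missing in the bug */
        m->pages = malloc(m->nr_pages * sizeof(void *));
        for (unsigned int i = 0; i < m->nr_pages; i++)
            m->pages[i] = malloc(4096);
        mapping_free(m);
        return 0;
    }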
     

13 Jan, 2021

1 commit

  • commit c2407cf7d22d0c0d94cf20342b3b8f06f1d904e7 upstream.

    Ever since commit 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common()
    logic") we've had some very occasional reports of BUG_ON(PageWriteback)
    in write_cache_pages(), which we thought we already fixed in commit
    073861ed77b6 ("mm: fix VM_BUG_ON(PageTail) and BUG_ON(PageWriteback)").

    But syzbot just reported another one, even with that commit in place.

    And it turns out that there's a simpler way to trigger the BUG_ON() than
    the one Hugh found with page re-use. It all boils down to the fact that
    the page writeback is ostensibly serialized by the page lock, but that
    isn't actually really true.

    Yes, the people _setting_ writeback all do so under the page lock, but
    the actual clearing of the bit - and waking up any waiters - happens
    without any page lock.

    This gives us this fairly simple race condition:

    CPU1 = end previous writeback
    CPU2 = start new writeback under page lock
    CPU3 = write_cache_pages()

    CPU1                        CPU2                        CPU3
    ----                        ----                        ----

    end_page_writeback()
      test_clear_page_writeback(page)
      ... delayed...

                                lock_page();
                                set_page_writeback()
                                unlock_page()

                                                            lock_page()
                                                            wait_on_page_writeback();

      wake_up_page(page, PG_writeback);
      .. wakes up CPU3 ..

                                                            BUG_ON(PageWriteback(page));

    where the BUG_ON() happens because we woke up the PG_writeback bit
    because of the _previous_ writeback, but a new one had already been
    started because the clearing of the bit wasn't actually atomic wrt the
    actual wakeup or serialized by the page lock.

    The reason this didn't use to happen was that the old logic in waiting
    on a page bit would just loop if it ever saw the bit set again.

    The nice proper fix would probably be to get rid of the whole "wait for
    writeback to clear, and then set it" logic in the writeback path, and
    replace it with an atomic "wait-to-set" (ie the same as we have for page
    locking: we set the page lock bit with a single "lock_page()", not with
    "wait for lock bit to clear and then set it").

    However, our current model for writeback is that the waiting for the
    writeback bit is done by the generic VFS code (ie write_cache_pages()),
    but the actual setting of the writeback bit is done much later by the
    filesystem ".writepages()" function.

    IOW, to make the writeback bit have that same kind of "wait-to-set"
    behavior as we have for page locking, we'd have to change our roughly
    ~50 different writeback functions. Painful.

    Instead, just make "wait_on_page_writeback()" loop on the very unlikely
    situation that the PG_writeback bit is still set, basically re-instating
    the old behavior. This is very non-optimal in case of contention, but
    since we only ever set the bit under the page lock, that situation is
    controlled.

    Reported-by: syzbot+2fc0712f8f8b8b8fa0ef@syzkaller.appspotmail.com
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Acked-by: Hugh Dickins
    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
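
    The entry above re-instates a "wait, then re-check" loop. Below is a
    minimal userspace sketch of that pattern using a condition variable; it is
    an analogy with invented names (writeback, wait_for_writeback_clear), not
    the page waitqueue code. The key point is the while loop: a wakeup meant
    for a previous writeback must not be trusted if the flag was set again.

    #include <pthread.h>
    #include <stdbool.h>
    #include <stdio.h>

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
    static bool writeback;

    static void wait_for_writeback_clear(void)
    {
        pthread_mutex_lock(&lock);
        while (writeback)               /* loop: one wakeup does not mean "clear" */
            pthread_cond_wait(&cond, &lock);
        pthread_mutex_unlock(&lock);
    }

    static void *end_writeback(void *arg)
    {
        (void)arg;
        pthread_mutex_lock(&lock);
        writeback = false;
        pthread_cond_broadcast(&cond);  /* analogous to wake_up_page() */
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void)
    {
        pthread_t t;

        writeback = true;
        pthread_create(&t, NULL, end_writeback, NULL);
        wait_for_writeback_clear();
        pthread_join(t, NULL);
        puts("writeback cleared");
        return 0;
    }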
     

06 Jan, 2021

2 commits

  • commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.

    VMware observed a performance regression during memmap init on their
    platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
    iterate over memblock regions rather that check each PFN") causing it.

    Before the commit:

    [0.033176] Normal zone: 1445888 pages used for memmap
    [0.033176] Normal zone: 89391104 pages, LIFO batch:63
    [0.035851] ACPI: PM-Timer IO Port: 0x448

    With the commit:

    [0.026874] Normal zone: 1445888 pages used for memmap
    [0.026875] Normal zone: 89391104 pages, LIFO batch:63
    [2.028450] ACPI: PM-Timer IO Port: 0x448

    The root cause is the current memmap defer init doesn't work as expected.

    Before, memmap_init_zone() did the memmap init of one whole zone: it
    initialized all the low zones of a numa node in full, but deferred the
    memmap init of the last zone in that numa node. However, since commit
    73a6e474cb376, memmap_init() iterates over the memblock regions inside one
    zone and calls memmap_init_zone() to do the memmap init for each region.

    E.g., on VMware's system, the memory layout is as below; there are two
    memory regions in node 2. The current code mistakenly initializes the
    whole 1st region [mem 0xab00000000-0xfcffffffff], and only then applies
    the memmap defer to the 2nd region [mem 0x10000000000-0x1033fffffff],
    initializing just one memory section of it. In fact, we only expect one
    memory section's memmap to be initialized up front. That is why memmap
    init takes so much longer.

    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
    [ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
    [ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
    [ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
    [ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

    Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
    down the real zone end pfn so that defer_init() can use it to judge
    whether defer need be taken in zone wide.

    Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
    Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
    Signed-off-by: Baoquan He
    Reported-by: Rahul Gopakumar
    Reviewed-by: Mike Rapoport
    Cc: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     
  • commit e7dd91c456a8cdbcd7066997d15e36d14276a949 upstream.

    syzbot reported the deadlock here [1]. The issue is in hugetlb cow
    error handling when there are not enough huge pages for the faulting
    task which took the original reservation. It is possible that other
    (child) tasks could have consumed pages associated with the reservation.
    In this case, we want the task which took the original reservation to
    succeed. So, we unmap any associated pages in children so that they can
    be used by the faulting task that owns the reservation.

    The unmapping code needs to hold i_mmap_rwsem in write mode. However,
    due to commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd
    sharing synchronization") we are already holding i_mmap_rwsem in read
    mode when hugetlb_cow is called.

    Technically, i_mmap_rwsem does not need to be held in read mode for COW
    mappings as they can not share pmd's. Modifying the fault code to not
    take i_mmap_rwsem in read mode for COW (and other non-sharable) mappings
    is too involved for a stable fix.

    Instead, we simply drop the hugetlb_fault_mutex and i_mmap_rwsem before
    unmapping. This is OK as it is technically not needed. They are
    reacquired after unmapping as expected by calling code. Since this is
    done in an uncommon error path, the overhead of dropping and reacquiring
    mutexes is acceptable.

    While making changes, remove redundant BUG_ON after unmap_ref_private.

    [1] https://lkml.kernel.org/r/000000000000b73ccc05b5cf8558@google.com

    Link: https://lkml.kernel.org/r/4c5781b8-3b00-761e-c0c7-c5edebb6ec1a@oracle.com
    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Signed-off-by: Mike Kravetz
    Reported-by: syzbot+5eee4145df3c15e96625@syzkaller.appspotmail.com
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: "Aneesh Kumar K . V"
    Cc: Davidlohr Bueso
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz
     

04 Jan, 2021

1 commit

  • This is the 5.10.4 stable release

    * tag 'v5.10.4': (717 commits)
    Linux 5.10.4
    x86/CPU/AMD: Save AMD NodeId as cpu_die_id
    drm/edid: fix objtool warning in drm_cvt_modes()
    ...

    Signed-off-by: Jason Liu

    Conflicts:
    drivers/gpu/drm/imx/dcss/dcss-plane.c
    drivers/media/i2c/ov5640.c

    Jason Liu
     

30 Dec, 2020

13 commits

  • commit dcf5aedb24f899d537e21c18ea552c780598d352 upstream.

    Use temporary slots in reclaim function to avoid possible race when
    freeing those.

    While at it, make sure we check CLAIMED flag under page lock in the
    reclaim function to make sure we are not racing with z3fold_alloc().

    Link: https://lkml.kernel.org/r/20201209145151.18994-4-vitaly.wool@konsulko.com
    Signed-off-by: Vitaly Wool
    Cc:
    Cc: Mike Galbraith
    Cc: Sebastian Andrzej Siewior
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • commit fc5488651c7d840c9cad9b0f273f2f31bd03413a upstream.

    Patch series "z3fold: stability / rt fixes".

    Address z3fold stability issues under stress load, primarily in the
    reclaim and free aspects. Besides, it fixes the locking problems that
    were only seen in real-time kernel configuration.

    This patch (of 3):

    There used to be two places in the code where slots could be freed: when
    freeing the last allocated handle from the slots, and when releasing the
    z3fold header these slots are linked to. The logic to decide whether to
    free certain slots was complicated and error prone in both functions, and
    it led to failures in the RT case.

    To fix that, make free_handle() the single point of freeing slots.

    Link: https://lkml.kernel.org/r/20201209145151.18994-1-vitaly.wool@konsulko.com
    Link: https://lkml.kernel.org/r/20201209145151.18994-2-vitaly.wool@konsulko.com
    Signed-off-by: Vitaly Wool
    Tested-by: Mike Galbraith
    Cc: Sebastian Andrzej Siewior
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vitaly Wool
     
  • [ Upstream commit 597c892038e08098b17ccfe65afd9677e6979800 ]

    On 2-node NUMA hosts we see bursts of kswapd reclaim and subsequent
    pressure spikes and stalls from cache refaults while there is plenty of
    free memory in the system.

    Usually, kswapd is woken up when all eligible nodes in an allocation are
    full. But the code related to watermark boosting can wake kswapd on one
    full node while the other one is mostly empty. This may be justified to
    fight fragmentation, but is currently unconditionally done whether
    watermark boosting is occurring or not.

    In our case, many of our workloads' throughput scales with available
    memory, and pure utilization is a more tangible concern than trends
    around longer-term fragmentation. As a result we generally disable
    watermark boosting.

    Wake kswapd only when watermark boosting is requested.

    Link: https://lkml.kernel.org/r/20201020175833.397286-1-hannes@cmpxchg.org
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     
  • [ Upstream commit 7fc2513aa237e2ce239ab54d7b04d1d79b317110 ]

    Preserve the error code from region_add() instead of returning success.

    Link: https://lkml.kernel.org/r/X9NGZWnZl5/Mt99R@mwanda
    Fixes: 0db9d74ed884 ("hugetlb: disable region_add file_region coalescing")
    Signed-off-by: Dan Carpenter
    Reviewed-by: Mike Kravetz
    Reviewed-by: David Hildenbrand
    Cc: Mina Almasry
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Dan Carpenter
     
  • [ Upstream commit 1e8aaedb182d6ddffc894b832e4962629907b3e0 ]

    madvise_inject_error() uses get_user_pages_fast to translate the address
    we specified to a page. After [1], we drop the extra reference count for
    memory_failure() path. That commit says that memory_failure wanted to
    keep the pin in order to take the page out of circulation.

    The truth is that we need to keep the page pinned, otherwise the page
    might be re-used after the put_page() and we can end up messing with
    someone else's memory.

    E.g:

    CPU0 (process X)                    CPU1
    madvise_inject_error
      get_user_pages
      put_page
                                        page gets reclaimed
                                        process Y allocates the page
      memory_failure
      // We mess with process Y memory

    madvise() is meant to operate on the caller's own address space, so
    messing with pages that do not belong to us seems like the wrong thing to
    do. To avoid that, let us keep the page pinned for memory_failure() as
    well.

    Pages for DAX mappings will release this extra refcount in
    memory_failure_dev_pagemap.

    [1] ("23e7b5c2e271: mm, madvise_inject_error:
    Let memory_failure() optionally take a page reference")

    Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
    Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
    Signed-off-by: Oscar Salvador
    Suggested-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Oscar Salvador
     
  • [ Upstream commit c041098c690fe53cea5d20c62f128a4f7a5c19fe ]

    The size of a vm area can be affected by the presence or absence of the
    guard page. In particular, when VM_NO_GUARD is not set, the actual
    accessible size is the recorded size minus the guard page.

    Currently kasan does not take this into account during the poison
    operation, and in particular it tries to poison the guard page as well.

    This approach, even if incorrect, does not cause an issue today because
    the tags for the guard page are simply written into shadow memory. With
    the future introduction of Tag-Based KASAN, the guard page is inaccessible
    by nature, so the tag write to this page triggers a fault.

    Fix the kasan shadow poisoning size by invoking get_vm_area_size() instead
    of accessing the field in the data structure directly, so the guard page
    is excluded (a minimal sketch of such an accessor follows this entry).

    Link: https://lkml.kernel.org/r/20201027160213.32904-1-vincenzo.frascino@arm.com
    Fixes: d98c9e83b5e7c ("kasan: fix crashes on access to memory mapped by vm_map_ram()")
    Signed-off-by: Vincenzo Frascino
    Cc: Andrey Konovalov
    Cc: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Vincenzo Frascino
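
    A minimal sketch of the accessor idea under the semantics described in the
    entry above (a guard page is present unless a NO_GUARD flag is set). This
    is a userspace mock with invented types, not the kernel's vm_struct or
    get_vm_area_size():

    #include <stdio.h>

    #define PAGE_SIZE   4096UL
    #define VM_NO_GUARD 0x1UL

    struct vm_area_mock {
        unsigned long size;     /* includes the guard page unless VM_NO_GUARD is set */
        unsigned long flags;
    };

    /* Accessible size: subtract the guard page when one is present. */
    static unsigned long vm_area_mock_size(const struct vm_area_mock *a)
    {
        if (!(a->flags & VM_NO_GUARD))
            return a->size - PAGE_SIZE;
        return a->size;
    }

    int main(void)
    {
        struct vm_area_mock a = { .size = 5 * PAGE_SIZE, .flags = 0 };

        /* Poisoning a.size bytes would touch the guard page; use the accessor. */
        printf("poison %lu bytes, not %lu\n", vm_area_mock_size(&a), a.size);
        return 0;
    }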
     
  • [ Upstream commit 0a7dd4e901b8a4ee040ba953900d1d7120b34ee5 ]

    When multiple locks are acquired, they should be released in reverse
    order. For s_start() and s_stop() in mm/vmalloc.c, that is not the
    case.

    s_start: mutex_lock(&vmap_purge_lock); spin_lock(&vmap_area_lock);
    s_stop : mutex_unlock(&vmap_purge_lock); spin_unlock(&vmap_area_lock);

    This unlock sequence, though allowed, is not optimal. If a waiter is
    present, mutex_unlock() will need to go through the slowpath of waking
    up the waiter with preemption disabled. Fix that by releasing the
    spinlock first before the mutex.

    Link: https://lkml.kernel.org/r/20201213180843.16938-1-longman@redhat.com
    Fixes: e36176be1c39 ("mm/vmalloc: rework vmap_area_lock")
    Signed-off-by: Waiman Long
    Reviewed-by: Uladzislau Rezki (Sony)
    Reviewed-by: David Hildenbrand
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
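
    A minimal userspace sketch of the release ordering described above. A
    pthread mutex and spinlock stand in for vmap_purge_lock and
    vmap_area_lock; userspace has no notion of disabled preemption, so this
    only illustrates the nesting: take the mutex, then the spinlock, and
    release the spinlock before the mutex.

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t purge_lock = PTHREAD_MUTEX_INITIALIZER;
    static pthread_spinlock_t area_lock;

    static void s_start_like(void)
    {
        pthread_mutex_lock(&purge_lock);    /* outer, sleeping lock */
        pthread_spin_lock(&area_lock);      /* inner, non-sleeping lock */
    }

    static void s_stop_like(void)
    {
        pthread_spin_unlock(&area_lock);    /* drop the inner lock first... */
        pthread_mutex_unlock(&purge_lock);  /* ...then the outer one */
    }

    int main(void)
    {
        pthread_spin_init(&area_lock, PTHREAD_PROCESS_PRIVATE);
        s_start_like();
        puts("critical section");
        s_stop_like();
        pthread_spin_destroy(&area_lock);
        return 0;
    }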
     
  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code to check the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
    secondary MMU's page table before the check. More specifically for those
    secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start() like kvm.

    However memory reclaim is the only user of !(TTU_IGNORE_ACCESS) or the
    absence of TTU_IGNORE_ACCESS and it explicitly performs the page table
    access check before trying to unmap the page. So, at worst the reclaim
    will miss accesses in a very short window if we remove page table access
    check in unmapping code.

    There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
    reclaim. From memcg reclaim, page_referenced() only accounts accesses from
    processes in the same memcg as the target page, but the unmapping code
    considers accesses from all processes, decreasing the effectiveness of
    memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     
  • [ Upstream commit eefbfa7fd678805b38a46293e78543f98f353d3e ]

    The rcu_read_lock/unlock can only guarantee that the memcg will not be
    freed; it cannot guarantee that css_get() on the memcg succeeds.

    If the whole process of a cgroup offlining completes between reading an
    objcg->memcg pointer and bumping the css reference on another CPU, and
    there are exactly 0 external references to this memory cgroup (how did we
    get to obj_cgroup_charge() then?), css_get() can change the ref counter
    from 0 back to 1 (a sketch of the "increment only if non-zero" pattern
    follows this entry).

    Link: https://lkml.kernel.org/r/20201028035013.99711-2-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
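
    The usual remedy for this class of race is a "tryget" that refuses to
    resurrect a zero reference count. Below is a minimal C11 sketch of the
    increment-only-if-non-zero pattern; it is generic, not the cgroup css API:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Bump the count only if it is still non-zero; never revive a dying object. */
    static bool refcount_inc_not_zero(atomic_uint *ref)
    {
        unsigned int old = atomic_load(ref);

        while (old != 0) {
            if (atomic_compare_exchange_weak(ref, &old, old + 1))
                return true;    /* got a reference */
            /* the failed CAS reloaded 'old'; retry */
        }
        return false;           /* already at zero: the caller must fall back */
    }

    int main(void)
    {
        atomic_uint live = 1, dying = 0;

        printf("live: %d, dying: %d\n",
               refcount_inc_not_zero(&live), refcount_inc_not_zero(&dying));
        return 0;
    }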
     
  • [ Upstream commit 2f7659a314736b32b66273dbf91c19874a052fde ]

    Consider the following memcg hierarchy.

             root
            /    \
           A      B

    If we failed to get the reference on objcg of memcg A, the
    get_obj_cgroup_from_current can return the wrong objcg for the root
    memcg.

    Link: https://lkml.kernel.org/r/20201029164429.58703-1-songmuchun@bytedance.com
    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Muchun Song
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Shakeel Butt
    Cc: Joonsoo Kim
    Cc: Yafang Shao
    Cc: Chris Down
    Cc: Christian Brauner
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Kees Cook
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Suren Baghdasaryan
    Cc: Adrian Reber
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Muchun Song
     
  • [ Upstream commit 4509b42c38963f495b49aa50209c34337286ecbe ]

    These functions accomplish the same thing but have different
    implementations.

    unpin_user_page() has a bug where it calls mod_node_page_state() after
    calling put_page(), which creates a risk that the page could have been
    hot-unplugged from the system.

    Fix this by using put_compound_head() as the only implementation.

    __unpin_devmap_managed_user_page() and related can be deleted as well in
    favour of the simpler, but slower, version in put_compound_head() that has
    an extra atomic page_ref_sub, but always calls put_page() which internally
    contains the special devmap code.

    Move put_compound_head() to be directly after try_grab_compound_head() so
    people can find it in future.

    Link: https://lkml.kernel.org/r/0-v1-6730d4ee0d32+40e6-gup_combine_put_jgg@nvidia.com
    Fixes: 1970dc6f5226 ("mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting")
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Reviewed-by: Ira Weiny
    Reviewed-by: Jan Kara
    CC: Joao Martins
    CC: Jonathan Corbet
    CC: Dan Williams
    CC: Dave Chinner
    CC: Christoph Hellwig
    CC: Jane Chu
    CC: "Kirill A. Shutemov"
    CC: Michal Hocko
    CC: Mike Kravetz
    CC: Shuah Khan
    CC: Muchun Song
    CC: Vlastimil Babka
    CC: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
     
  • [ Upstream commit 57efa1fe5957694fa541c9062de0a127f0b9acb0 ]

    Since commit 70e806e4e645 ("mm: Do early cow for pinned pages during
    fork() for ptes") pages under a FOLL_PIN will not be write protected
    during COW for fork. This means that pages returned from
    pin_user_pages(FOLL_WRITE) should not become write protected while the pin
    is active.

    However, there is a small race where get_user_pages_fast(FOLL_PIN) can
    establish a FOLL_PIN at the same time copy_present_page() is write
    protecting it:

    CPU 0                                   CPU 1
    get_user_pages_fast()
      internal_get_user_pages_fast()
                                            copy_page_range()
                                              pte_alloc_map_lock()
                                                copy_present_page()
                                                  atomic_read(has_pinned) == 0
                                                  page_maybe_dma_pinned() == false
      atomic_set(has_pinned, 1);
      gup_pgd_range()
        gup_pte_range()
          pte_t pte = gup_get_pte(ptep)
          pte_access_permitted(pte)
          try_grab_compound_head()
                                                  pte = pte_wrprotect(pte)
                                                  set_pte_at();
                                              pte_unmap_unlock()
      // GUP now returns with a write protected page

    The first attempt to resolve this by using the write protect caused
    problems (and was missing a barrier), see commit f3c64eda3e50 ("mm: avoid
    early COW write protect games during fork()")

    Instead wrap copy_p4d_range() with the write side of a seqcount and check
    the read side around gup_pgd_range(). If there is a collision then
    get_user_pages_fast() fails and falls back to slow GUP.

    Slow GUP is safe against this race because copy_page_range() is only
    called while holding the exclusive side of the mmap_lock on the src
    mm_struct.

    [akpm@linux-foundation.org: coding style fixes]
    Link: https://lore.kernel.org/r/CAHk-=wi=iCnYCARbPGjkVJu9eyYeZ13N64tZYLdOB8CP5Q_PLw@mail.gmail.com

    Link: https://lkml.kernel.org/r/2-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Fixes: f3c64eda3e50 ("mm: avoid early COW write protect games during fork()")
    Signed-off-by: Jason Gunthorpe
    Suggested-by: Linus Torvalds
    Reviewed-by: John Hubbard
    Reviewed-by: Jan Kara
    Reviewed-by: Peter Xu
    Acked-by: "Ahmed S. Darwish" [seqcount_t parts]
    Cc: Andrea Arcangeli
    Cc: "Aneesh Kumar K.V"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Leon Romanovsky
    Cc: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
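
    A minimal C11 sketch of the seqcount collision check described above, with
    invented names (write_protect_seq, writer_begin, lockless_walk). It shows
    the shape of the technique, not the kernel's seqcount_t or GUP code: the
    writer makes the count odd while it runs, and the reader re-checks the
    count and falls back to the slow path if a writer overlapped.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>

    static atomic_uint write_protect_seq;

    /* Writer side: analogous to wrapping copy_p4d_range() during fork(). */
    static void writer_begin(void) { atomic_fetch_add(&write_protect_seq, 1); }
    static void writer_end(void)   { atomic_fetch_add(&write_protect_seq, 1); }

    /* Reader side: analogous to the check around gup_pgd_range(). */
    static bool lockless_walk(void)
    {
        unsigned int seq = atomic_load(&write_protect_seq);

        if (seq & 1)
            return false;       /* a writer is active: use the slow path */

        /* ... the lockless work would happen here ... */

        if (atomic_load(&write_protect_seq) != seq)
            return false;       /* a writer ran concurrently: discard, fall back */
        return true;
    }

    int main(void)
    {
        printf("no writer: %d\n", lockless_walk());
        writer_begin();
        printf("writer active: %d\n", lockless_walk());
        writer_end();
        return 0;
    }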
     
  • [ Upstream commit c28b1fc70390df32e29991eedd52bd86e7aba080 ]

    Patch series "Add a seqcount between gup_fast and copy_page_range()", v4.

    As discussed and suggested by Linus use a seqcount to close the small race
    between gup_fast and copy_page_range().

    Ahmed confirms that raw_write_seqcount_begin() is the correct API to use
    in this case and it doesn't trigger any lockdeps.

    I was able to test it using two threads, one forking and the other using
    ibv_reg_mr() to trigger GUP fast. Modifying copy_page_range() to sleep
    made the window large enough to reliably hit to test the logic.

    This patch (of 2):

    The next patch in this series makes the lockless flow a little more
    complex, so move the entire block into a new function and remove a level
    of indentation. Tidy a bit of cruft:

    - addr is always the same as start, so use start

    - Use the modern check_add_overflow() for computing end = start + len

    - nr_pinned/pages << PAGE_SHIFT needs the LHS to be unsigned long to
    avoid shift overflow, make the variables unsigned long to avoid coding
    casts in both places. nr_pinned was missing its cast

    - The handling of ret and nr_pinned can be streamlined a bit

    No functional change.

    Link: https://lkml.kernel.org/r/0-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Link: https://lkml.kernel.org/r/1-v4-908497cf359a+4782-gup_fork_jgg@nvidia.com
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Jan Kara
    Reviewed-by: John Hubbard
    Reviewed-by: Peter Xu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Jason Gunthorpe
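
    On the check_add_overflow() point in the list above: a minimal userspace
    sketch using GCC/Clang's __builtin_add_overflow, which is what an
    overflow-checked "end = start + len" boils down to (the kernel macro
    itself is not used here):

    #include <stdio.h>

    int main(void)
    {
        unsigned long start = (unsigned long)-4096;  /* near the top of the range */
        unsigned long len = 8192;
        unsigned long end;

        if (__builtin_add_overflow(start, len, &end)) {
            puts("start + len overflows: reject the request");
            return 1;
        }
        printf("end = %lu\n", end);
        return 0;
    }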
     

12 Dec, 2020

3 commits

  • Commit 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    added compound_nr counter to first tail struct page, overlaying with
    page->mapping. The overlay itself is fine, but while freeing gigantic
    hugepages via free_contig_range(), a "bad page" check will trigger for
    non-NULL page->mapping on the first tail page:

    BUG: Bad page state in process bash pfn:380001
    page:00000000c35f0856 refcount:0 mapcount:0 mapping:00000000126b68aa index:0x0 pfn:0x380001
    aops:0x0
    flags: 0x3ffff00000000000()
    raw: 3ffff00000000000 0000000000000100 0000000000000122 0000000100000000
    raw: 0000000000000000 0000000000000000 ffffffff00000000 0000000000000000
    page dumped because: non-NULL mapping
    Modules linked in:
    CPU: 6 PID: 616 Comm: bash Not tainted 5.10.0-rc7-next-20201208 #1
    Hardware name: IBM 3906 M03 703 (LPAR)
    Call Trace:
    show_stack+0x6e/0xe8
    dump_stack+0x90/0xc8
    bad_page+0xd6/0x130
    free_pcppages_bulk+0x26a/0x800
    free_unref_page+0x6e/0x90
    free_contig_range+0x94/0xe8
    update_and_free_page+0x1c4/0x2c8
    free_pool_huge_page+0x11e/0x138
    set_max_huge_pages+0x228/0x300
    nr_hugepages_store_common+0xb8/0x130
    kernfs_fop_write+0xd2/0x218
    vfs_write+0xb0/0x2b8
    ksys_write+0xac/0xe0
    system_call+0xe6/0x288
    Disabling lock debugging due to kernel taint

    This is because only the compound_order is cleared in
    destroy_compound_gigantic_page(), and compound_nr is set to
    1U << order == 1 for order 0 in set_compound_order(page, 0).

    Fix this by explicitly clearing compound_nr for first tail page after
    calling set_compound_order(page, 0).

    Link: https://lkml.kernel.org/r/20201208182813.66391-2-gerald.schaefer@linux.ibm.com
    Fixes: 1378a5ee451a ("mm: store compound_nr as well as compound_order")
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Matthew Wilcox (Oracle)
    Cc: Heiko Carstens
    Cc: Mike Kravetz
    Cc: Christian Borntraeger
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
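
    A minimal sketch of the bookkeeping pitfall with a mock struct rather than
    struct page: an order of 0 stores a count of 1U << 0 == 1, so resetting
    the order alone does not clear the overlaid count; it has to be zeroed
    explicitly, which is what the fix adds.

    #include <stdio.h>

    struct compound_mock {
        unsigned char order;
        unsigned int nr;        /* overlays another field in the real struct page */
    };

    static void set_order(struct compound_mock *p, unsigned int order)
    {
        p->order = order;
        p->nr = 1U << order;    /* order 0 still leaves nr == 1 */
    }

    int main(void)
    {
        struct compound_mock tail = { 0 };

        set_order(&tail, 9);    /* gigantic-page-style setup */
        set_order(&tail, 0);    /* "destroy": order cleared... */
        printf("nr after set_order(0): %u\n", tail.nr);  /* ...but nr is 1 */
        tail.nr = 0;            /* the explicit clear the fix introduces */
        return 0;
    }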
     
    We hit this issue in our internal test. When generic KASAN is enabled, a
    kfree()'d object is put into the per-cpu quarantine first. If the cpu goes
    offline, the object still remains in that per-cpu quarantine. If we call
    kmem_cache_destroy() now, SLUB will report an "Objects remaining" error.

    =============================================================================
    BUG test_module_slab (Not tainted): Objects remaining in test_module_slab on __kmem_cache_shutdown()
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Slab 0x(____ptrval____) objects=34 used=1 fp=0x(____ptrval____) flags=0x2ffff00000010200
    CPU: 3 PID: 176 Comm: cat Tainted: G B 5.10.0-rc1-00007-g4525c8781ec0-dirty #10
    Hardware name: linux,dummy-virt (DT)
    Call trace:
    dump_backtrace+0x0/0x2b0
    show_stack+0x18/0x68
    dump_stack+0xfc/0x168
    slab_err+0xac/0xd4
    __kmem_cache_shutdown+0x1e4/0x3c8
    kmem_cache_destroy+0x68/0x130
    test_version_show+0x84/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    do_el0_svc+0x38/0xa0
    el0_sync_handler+0x170/0x178
    el0_sync+0x174/0x180
    INFO: Object 0x(____ptrval____) @offset=15848
    INFO: Allocated in test_version_show+0x98/0xf0 age=8188 cpu=6 pid=172
    stack_trace_save+0x9c/0xd0
    set_track+0x64/0xf0
    alloc_debug_processing+0x104/0x1a0
    ___slab_alloc+0x628/0x648
    __slab_alloc.isra.0+0x2c/0x58
    kmem_cache_alloc+0x560/0x588
    test_version_show+0x98/0xf0
    module_attr_show+0x40/0x60
    sysfs_kf_seq_show+0x128/0x1c0
    kernfs_seq_show+0xa0/0xb8
    seq_read+0x1f0/0x7e8
    kernfs_fop_read+0x70/0x338
    vfs_read+0xe4/0x250
    ksys_read+0xc8/0x180
    __arm64_sys_read+0x44/0x58
    el0_svc_common.constprop.0+0xac/0x228
    kmem_cache_destroy test_module_slab: Slab cache still has objects

    Register a cpu hotplug function to remove all objects in the offline
    per-cpu quarantine when cpu is going offline. Set a per-cpu variable to
    indicate this cpu is offline.

    [qiang.zhang@windriver.com: fix slab double free when cpu-hotplug]
    Link: https://lkml.kernel.org/r/20201204102206.20237-1-qiang.zhang@windriver.com

    Link: https://lkml.kernel.org/r/1606895585-17382-2-git-send-email-Kuan-Ying.Lee@mediatek.com
    Signed-off-by: Kuan-Ying Lee
    Signed-off-by: Zqiang
    Suggested-by: Dmitry Vyukov
    Reported-by: Guangye Yang
    Reviewed-by: Dmitry Vyukov
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Matthias Brugger
    Cc: Nicholas Tang
    Cc: Miles Chen
    Cc: Qian Cai
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kuan-Ying Lee
     
    Revert commit 3351b16af494 ("mm/filemap: add static for function
    __add_to_page_cache_locked") due to an incompatibility with
    ALLOW_ERROR_INJECTION which results in build errors.

    Link: https://lkml.kernel.org/r/CAADnVQJ6tmzBXvtroBuEH6QA0H+q7yaSKxrVvVxhqr3KBZdEXg@mail.gmail.com
    Tested-by: Justin Forbes
    Tested-by: Greg Thelen
    Acked-by: Alexei Starovoitov
    Cc: Michal Kubecek
    Cc: Alex Shi
    Cc: Souptick Joarder
    Cc: Daniel Borkmann
    Cc: Josef Bacik
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

09 Dec, 2020

1 commit

    Jann spotted a security hole due to a race in the mm ownership check.

    If the task is sharing the mm_struct but goes through execve() before
    mm_access(), it could skip the process_madvise_behavior_valid check. That
    allows *any* advice hint to reach into the remote process.

    This patch removes the mm ownership check. With it, we lose the ability
    for a local process to give *any* advice hint via the vector interface for
    some reason (e.g., performance). Since there is no concrete example of
    such use upstream yet, it is better to remove the ability for now and
    revisit when such new advice comes up.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Reported-by: Jann Horn
    Suggested-by: Jann Horn
    Signed-off-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

07 Dec, 2020

7 commits

    On success, mmap should return the begin address of the newly mapped
    area, but the patch "mm: mmap: merge vma after call_mmap() if possible"
    assigns the return value addr from the vm_start of the newly merged vma.
    Users of mmap will get a wrong address if the vma is merged after
    call_mmap(). We fix this by moving the assignment to addr before merging
    the vma (see the sketch after this entry).

    We have a driver which changes vm_flags, and this bug is found by our
    testcases.

    Fixes: d70cec898324 ("mm: mmap: merge vma after call_mmap() if possible")
    Signed-off-by: Liu Zixian
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: David Hildenbrand
    Cc: Miaohe Lin
    Cc: Hongxiang Lou
    Cc: Hu Shiyuan
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201203085350.22624-1-liuzixian4@huawei.com
    Signed-off-by: Linus Torvalds

    Liu Zixian
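
    A minimal sketch of the ordering fix referenced above, with invented names
    (mock_vma, try_merge); it is not the mmap code itself. Record the address
    to return before the merge, because merging can move vm_start:

    #include <stdio.h>

    struct mock_vma {
        unsigned long vm_start;
        unsigned long vm_end;
    };

    /* A merge may extend the vma to a lower start address. */
    static void try_merge(struct mock_vma *vma)
    {
        vma->vm_start -= 0x1000;
    }

    static unsigned long do_map(struct mock_vma *vma)
    {
        unsigned long addr = vma->vm_start;  /* capture before merging (the fix) */

        try_merge(vma);
        return addr;            /* returning vma->vm_start here was the bug */
    }

    int main(void)
    {
        struct mock_vma vma = { .vm_start = 0x20000, .vm_end = 0x30000 };

        printf("mmap returns %#lx (vm_start is now %#lx)\n",
               do_map(&vma), vma.vm_start);
        return 0;
    }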
     
    Adrian Moreno was running a kubernetes 1.19 + containerd/docker workload
    using hugetlbfs. In this environment the issue is reproduced by:

    - Start a simple pod that uses the recently added HugePages medium
    feature (pod yaml attached)

    - Start a DPDK app. It doesn't need to run successfully (as in transfer
    packets) nor interact with real hardware. It seems just initializing
    the EAL layer (which handles hugepage reservation and locking) is
    enough to trigger the issue

    - Delete the Pod (or let it "Complete").

    This would result in a kworker thread going into a tight loop (top output):

    1425 root 20 0 0 0 0 R 99.7 0.0 5:22.45 kworker/28:7+cgroup_destroy

    'perf top -g' reports:

    - 63.28% 0.01% [kernel] [k] worker_thread
    - 49.97% worker_thread
    - 52.64% process_one_work
    - 62.08% css_killed_work_fn
    - hugetlb_cgroup_css_offline
    41.52% _raw_spin_lock
    - 2.82% _cond_resched
    rcu_all_qs
    2.66% PageHuge
    - 0.57% schedule
    - 0.57% __schedule

    We are spinning in the do-while loop in hugetlb_cgroup_css_offline.
    Worse yet, we are holding the master cgroup lock (cgroup_mutex) while
    infinitely spinning. Little else can be done on the system as the
    cgroup_mutex can not be acquired.

    Do note that the issue can be reproduced by simply offlining a hugetlb
    cgroup containing pages with reservation counts.

    The loop in hugetlb_cgroup_css_offline is moving page counts from the
    cgroup being offlined to the parent cgroup. This is done for each
    hstate, and is repeated until hugetlb_cgroup_have_usage returns false.
    The routine moving counts (hugetlb_cgroup_move_parent) is only moving
    'usage' counts. The routine hugetlb_cgroup_have_usage is checking for
    both 'usage' and 'reservation' counts. What to do with reservation counts
    when reparenting was discussed here:

    https://lore.kernel.org/linux-kselftest/CAHS8izMFAYTgxym-Hzb_JmkTK1N_S9tGN71uS6MFV+R7swYu5A@mail.gmail.com/

    The decision was made to leave a zombie cgroup for cases with reservation
    counts. Unfortunately, the check on reservation counts was incorrectly
    added to hugetlb_cgroup_have_usage.

    To fix the issue, simply remove the check for reservation counts. While
    fixing this issue, a related bug in hugetlb_cgroup_css_offline was
    noticed: the hstate index is not reinitialized each time through the
    do-while loop. Fix this as well (see the sketch after this entry).

    Fixes: 1adc4d419aa2 ("hugetlb_cgroup: add interface for charge/uncharge hugetlb reservations")
    Reported-by: Adrian Moreno
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Adrian Moreno
    Reviewed-by: Shakeel Butt
    Cc: Mina Almasry
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Sandipan Das
    Cc: Shuah Khan
    Cc:
    Link: https://lkml.kernel.org/r/20201203220242.158165-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
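
    A minimal sketch of the second bug mentioned above (the hstate index not
    being reinitialized), with generic names rather than the hugetlb code: the
    inner index must start from 0 on every pass of the outer do-while,
    otherwise later passes skip all but the last element.

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_ITEMS 3

    static int usage[NR_ITEMS] = { 2, 0, 1 };

    static bool have_usage(void)
    {
        for (int i = 0; i < NR_ITEMS; i++)
            if (usage[i])
                return true;
        return false;
    }

    int main(void)
    {
        int passes = 0;

        do {
            /* The fix: restart from 0 each pass, not where the last pass stopped. */
            for (int idx = 0; idx < NR_ITEMS; idx++)
                if (usage[idx])
                    usage[idx]--;   /* analogous to moving one count to the parent */
            passes++;
        } while (have_usage());

        printf("drained in %d passes\n", passes);
        return 0;
    }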
     
  • mm/filemap.c:830:14: warning: no previous prototype for `__add_to_page_cache_locked' [-Wmissing-prototypes]

    Signed-off-by: Alex Shi
    Signed-off-by: Andrew Morton
    Cc: Souptick Joarder
    Link: https://lkml.kernel.org/r/1604661895-5495-1-git-send-email-alex.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Alex Shi
     
  • We can't call kvfree() with a spin lock held, so defer it. Fixes a
    might_sleep() runtime warning.

    Fixes: 873d7bcfd066 ("mm/swapfile.c: use kvzalloc for swap_info_struct allocation")
    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc:
    Link: https://lkml.kernel.org/r/20201202151549.10350-1-qcai@redhat.com
    Signed-off-by: Linus Torvalds

    Qian Cai
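
    A minimal userspace sketch of the deferral pattern: a pthread spinlock and
    free() stand in for the kernel spinlock and kvfree() (userspace free()
    does not sleep, so this only illustrates the ordering). Detach the pointer
    under the lock, free it after dropping the lock.

    #include <pthread.h>
    #include <stdlib.h>

    struct box {
        pthread_spinlock_t lock;
        void *buf;
    };

    static void replace_buf(struct box *b, void *newbuf)
    {
        void *old;

        pthread_spin_lock(&b->lock);
        old = b->buf;               /* detach under the lock */
        b->buf = newbuf;
        pthread_spin_unlock(&b->lock);

        free(old);                  /* the possibly-sleeping free happens unlocked */
    }

    int main(void)
    {
        struct box b;

        pthread_spin_init(&b.lock, PTHREAD_PROCESS_PRIVATE);
        b.buf = malloc(64);
        replace_buf(&b, malloc(128));
        free(b.buf);
        pthread_spin_destroy(&b.lock);
        return 0;
    }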
     
    While I was doing zram testing, I found that decompression sometimes
    failed since the compression buffer was corrupted. With investigation, I
    found that the commit below calls cond_resched() unconditionally, so it
    could cause a problem in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending calling code) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some arm32
    platforms, it has been a headache to maintain since it abused APIs[1]
    (e.g., unmap_kernel_range in atomic context).

    Since we are approaching the deprecation of 32-bit machines, the config
    option has only been available for builtin builds since v5.8, and lastly
    it has not been the default option in zsmalloc, it's time to drop the
    option for better maintenance.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    When investigating a slab cache bloat problem, a significant amount of
    negative dentry cache was seen, but confusingly it neither got shrunk by
    the reclaimer (the host was under very tight memory) nor by dropping
    caches. The vmcore shows there are over 14M negative dentry objects on the
    lru, but tracing shows they were not scanned at all.

    Further investigation shows the memcg's vfs shrinker_map bit is not set.
    So the reclaimer and cache dropping just skip calling the vfs shrinker,
    and we had to reboot the hosts to get the memory back.

    I didn't manage to come up with a reproducer in test environment, and
    the problem can't be reproduced after rebooting. But it seems there is
    race between shrinker map bit clear and reparenting by code inspection.
    The hypothesis is elaborated as below.

    The memcg hierarchy on our production environment looks like:

               root
              /    \
        system      user

    The main workloads are running under user slice's children, and it
    creates and removes memcg frequently. So reparenting happens very often
    under user slice, but no task is under user slice directly.

    So with the frequent reparenting and tight memory pressure, the below
    hypothetical race condition may happen:

    CPU A                                   CPU B
    reparent
        dst->nr_items == 0
                                            shrinker:
                                                total_objects == 0
        add src->nr_items to dst
        set_bit
                                                return SHRINK_EMPTY
                                                clear_bit
    child memcg offline
        replace child's kmemcg_id with
        parent's (in memcg_offline_kmem())
                                            list_lru_del() between shrinker runs
                                                see parent's kmemcg_id
                                                dec dst->nr_items
    reparent again
        dst->nr_items may go negative
        due to concurrent list_lru_del()
                                            The second run of shrinker:
                                                read nr_items without any
                                                synchronization, so it may
                                                see intermediate negative
                                                nr_items then total_objects
                                                may return 0 coincidently

                                                keep the bit cleared
        dst->nr_items != 0
        skip set_bit
        add src->nr_items to dst

    After this point dst->nr_item may never go zero, so reparenting will not
    set shrinker_map bit anymore. And since there is no task under user
    slice directly, so no new object will be added to its lru to set the
    shrinker map bit either. That bit is kept cleared forever.

    How does list_lru_del() race with reparenting? It is because reparenting
    replaces children's kmemcg_id with the parent's without the protection of
    nlru->lock, so list_lru_del() may see the parent's kmemcg_id but actually
    delete items from the child's lru, while dec'ing the parent's nr_items; so
    the parent's nr_items may go negative, as commit 2788cf0c401c ("memcg:
    reparent list_lrus and free kmemcg_id on css offline") says.

    Since it is impossible for dst->nr_items to go negative and src->nr_items
    to go to zero at the same time, it seems we can set the shrinker map bit
    iff src->nr_items != 0. We could synchronize list_lru_count_one() and
    reparenting with nlru->lock, but checking src->nr_items in reparenting is
    the simplest approach and avoids lock contention.

    Fixes: fae91d6d8be5 ("mm/list_lru.c: set bit in memcg shrinker bitmap on first list_lru item appearance")
    Suggested-by: Roman Gushchin
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill Tkhai
    Cc: Vladimir Davydov
    Cc: [4.19]
    Link: https://lkml.kernel.org/r/20201202171749.264354-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Commit 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches
    for all allocations") introduced a regression into the handling of the
    obj_cgroup_charge() return value. If a non-zero value is returned
    (indicating of exceeding one of memory.max limits), the allocation
    should fail, instead of falling back to non-accounted mode.

    To make the code more readable, move memcg_slab_pre_alloc_hook() and
    memcg_slab_post_alloc_hook() calling conditions into bodies of these
    hooks.

    Fixes: 10befea91b61 ("mm: memcg/slab: use a single set of kmem_caches for all allocations")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Link: https://lkml.kernel.org/r/20201127161828.GD840171@carbon.dhcp.thefacebook.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

25 Nov, 2020

1 commit

  • Twice now, when exercising ext4 looped on shmem huge pages, I have crashed
    on the PF_ONLY_HEAD check inside PageWaiters(): ext4_finish_bio() calling
    end_page_writeback() calling wake_up_page() on tail of a shmem huge page,
    no longer an ext4 page at all.

    The problem is that PageWriteback is not accompanied by a page reference
    (as the NOTE at the end of test_clear_page_writeback() acknowledges): as
    soon as TestClearPageWriteback has been done, that page could be removed
    from page cache, freed, and reused for something else by the time that
    wake_up_page() is reached.

    https://lore.kernel.org/linux-mm/20200827122019.GC14765@casper.infradead.org/
    Matthew Wilcox suggested avoiding or weakening the PageWaiters() tail
    check; but I'm paranoid about even looking at an unreferenced struct page,
    lest its memory might itself have already been reused or hotremoved (and
    wake_up_page_bit() may modify that memory with its ClearPageWaiters()).

    Then on crashing a second time, realized there's a stronger reason against
    that approach. If my testing just occasionally crashes on that check,
    when the page is reused for part of a compound page, wouldn't it be much
    more common for the page to get reused as an order-0 page before reaching
    wake_up_page()? And on rare occasions, might that reused page already be
    marked PageWriteback by its new user, and already be waited upon? What
    would that look like?

    It would look like BUG_ON(PageWriteback) after wait_on_page_writeback()
    in write_cache_pages() (though I have never seen that crash myself).

    Matthew Wilcox explaining this to himself:
    "page is allocated, added to page cache, dirtied, writeback starts,

    --- thread A ---
    filesystem calls end_page_writeback()
    test_clear_page_writeback()
    --- context switch to thread B ---
    truncate_inode_pages_range() finds the page, it doesn't have writeback set,
    we delete it from the page cache. Page gets reallocated, dirtied, writeback
    starts again. Then we call write_cache_pages(), see
    PageWriteback() set, call wait_on_page_writeback()
    --- context switch back to thread A ---
    wake_up_page(page, PG_writeback);
    ... thread B is woken, but because the wakeup was for the old use of
    the page, PageWriteback is still set.

    Devious"

    And prior to 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    this would have been much less likely: before that, wake_page_function()'s
    non-exclusive case would stop walking and not wake if it found Writeback
    already set again; whereas now the non-exclusive case proceeds to wake.

    I have not thought of a fix that does not add a little overhead: the
    simplest fix is for end_page_writeback() to get_page() before calling
    test_clear_page_writeback(), then put_page() after wake_up_page().

    Was there a chance of missed wakeups before, since a page freed before
    reaching wake_up_page() would have PageWaiters cleared? I think not,
    because each waiter does hold a reference on the page. This bug comes
    when the old use of the page, the one we do TestClearPageWriteback on,
    had *no* waiters, so no additional page reference beyond the page cache
    (and whoever racily freed it). The reuse of the page has a waiter
    holding a reference, and its own PageWriteback set; but the belated
    wake_up_page() has woken the reuse to hit that BUG_ON(PageWriteback).

    Reported-by: syzbot+3622cea378100f45d59f@syzkaller.appspotmail.com
    Reported-by: Qian Cai
    Fixes: 2a9127fcf229 ("mm: rewrite wait_on_page_bit_common() logic")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v5.8+
    Signed-off-by: Linus Torvalds

    Hugh Dickins
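
    A minimal sketch of the shape of the proposed fix, with an invented object
    and refcount rather than struct page: take a reference before clearing the
    flag and waking, and drop it only after the wakeup, so the wake path never
    touches a freed object.

    #include <stdatomic.h>
    #include <stdlib.h>

    struct obj {
        atomic_int refcount;
        atomic_int writeback;
    };

    static void put_obj(struct obj *o)
    {
        if (atomic_fetch_sub(&o->refcount, 1) == 1)
            free(o);                        /* last reference gone */
    }

    static void wake_waiters(struct obj *o)
    {
        (void)o;                            /* stand-in for wake_up_page() */
    }

    static void end_writeback(struct obj *o)
    {
        atomic_fetch_add(&o->refcount, 1);  /* get_page() before the clear */
        atomic_store(&o->writeback, 0);     /* test_clear_page_writeback() analogue */
        wake_waiters(o);                    /* the object is guaranteed alive here */
        put_obj(o);                         /* put_page() after wake_up_page() */
    }

    int main(void)
    {
        struct obj *o = malloc(sizeof(*o));

        atomic_init(&o->refcount, 1);       /* the "page cache" reference */
        atomic_init(&o->writeback, 1);
        end_writeback(o);
        put_obj(o);                         /* drop the cache reference */
        return 0;
    }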
     

23 Nov, 2020

1 commit

  • The calculation of the end page index was incorrect, leading to a
    regression of 70% when running stress-ng.

    With this fix, we instead see a performance improvement of 3%.

    Fixes: e6e88712e43b ("mm: optimise madvise WILLNEED")
    Reported-by: kernel test robot
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Xing Zhengjun
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Feng Tang
    Cc: "Chen, Rong A"
    Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)