17 Jun, 2021

17 commits

  • I see a "virt_to_phys used for non-linear address" warning from
    check_usemap_section_nr() on arm64 platforms.

    In the current implementation of NODE_DATA, if CONFIG_NEED_MULTIPLE_NODES=y,
    pglist_data is dynamically allocated and assigned to node_data[].

    For example, in arch/arm64/include/asm/mmzone.h:

    extern struct pglist_data *node_data[];
    #define NODE_DATA(nid) (node_data[(nid)])

    If CONFIG_NEED_MULTIPLE_NODES=n, pglist_data is defined as a global
    variable named "contig_page_data".

    For example, in include/linux/mmzone.h:

    extern struct pglist_data contig_page_data;
    #define NODE_DATA(nid) (&contig_page_data)

    If CONFIG_DEBUG_VIRTUAL is not enabled, __pa() can handle both
    dynamically allocated linear addresses and symbol addresses. However,
    if (CONFIG_DEBUG_VIRTUAL=y && CONFIG_NEED_MULTIPLE_NODES=n) we can see
    the "virt_to_phys used for non-linear address" warning, because
    &contig_page_data is not a linear address on arm64.

    Warning message:

    virt_to_phys used for non-linear address: (contig_page_data+0x0/0x1c00)
    WARNING: CPU: 0 PID: 0 at arch/arm64/mm/physaddr.c:15 __virt_to_phys+0x58/0x68
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Tainted: G W 5.13.0-rc1-00074-g1140ab592e2e #3
    Hardware name: linux,dummy-virt (DT)
    pstate: 600000c5 (nZCv daIF -PAN -UAO -TCO BTYPE=--)
    Call trace:
    __virt_to_phys+0x58/0x68
    check_usemap_section_nr+0x50/0xfc
    sparse_init_nid+0x1ac/0x28c
    sparse_init+0x1c4/0x1e0
    bootmem_init+0x60/0x90
    setup_arch+0x184/0x1f0
    start_kernel+0x78/0x488

    To fix it, create a small function to handle both translations.
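
    A minimal sketch of such a helper (the name and placement are
    illustrative, not necessarily what the final patch uses): use
    __pa_symbol() for the statically defined contig_page_data and plain
    __pa() for dynamically allocated node data.

        /* Sketch only: translate a pglist_data pointer to a physical
         * address, covering both the static and the dynamic case.
         */
        static inline phys_addr_t pgdat_to_phys(struct pglist_data *pgdat)
        {
        #ifndef CONFIG_NEED_MULTIPLE_NODES
                return __pa_symbol(&contig_page_data);  /* kernel symbol */
        #else
                return __pa(pgdat);                     /* linear address */
        #endif
        }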

    Link: https://lkml.kernel.org/r/1623058729-27264-1-git-send-email-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Cc: Mike Rapoport
    Cc: Baoquan He
    Cc: Kazu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
    When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
    fail, but the first VM_BUG_ON_PAGE() only checks page_mapcount(), so it
    may miss the failure when the head page is unmapped but another subpage
    is still mapped. The second DEBUG_VM BUG(), which checks the total
    mapcount, would then catch it. This may cause some confusion.

    As this is not a fatal issue, consolidate the two DEBUG_VM checks into
    one VM_WARN_ON_ONCE_PAGE().
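
    A sketch of what the consolidated check could look like in unmap_page()
    (exact placement in mm/huge_memory.c is an assumption here):

        /* Sketch: one warn-once replaces the two DEBUG_VM-only assertions;
         * it fires if the head or any subpage is still mapped afterwards.
         */
        try_to_unmap(page, ttu_flags);
        VM_WARN_ON_ONCE_PAGE(page_mapped(page), page);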

    [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
    Signed-off-by: Yang Shi
    Reviewed-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • There is a race between THP unmapping and truncation, when truncate sees
    pmd_none() and skips the entry, after munmap's zap_huge_pmd() cleared
    it, but before its page_remove_rmap() gets to decrement
    compound_mapcount: generating false "BUG: Bad page cache" reports that
    the page is still mapped when deleted. This commit fixes that, but not
    in the way I hoped.

    The first attempt used try_to_unmap(page, TTU_SYNC|TTU_IGNORE_MLOCK)
    instead of unmap_mapping_range() in truncate_cleanup_page(): it has
    often been an annoyance that we usually call unmap_mapping_range() with
    no pages locked, but there apply it to a single locked page.
    try_to_unmap() looks more suitable for a single locked page.

    However, try_to_unmap_one() contains a VM_BUG_ON_PAGE(!pvmw.pte,page):
    it is used to insert THP migration entries, but not used to unmap THPs.
    Copy zap_huge_pmd() and add THP handling now? Perhaps, but their TLB
    needs are different, I'm too ignorant of the DAX cases, and couldn't
    decide how far to go for anon+swap. Set that aside.

    The second attempt took a different tack: make no change in truncate.c,
    but modify zap_huge_pmd() to insert an invalidated huge pmd instead of
    clearing it initially, then pmd_clear() between page_remove_rmap() and
    unlocking at the end. Nice. But powerpc blows that approach out of the
    water, with its serialize_against_pte_lookup(), and interesting pgtable
    usage. It would need serious help to get working on powerpc (with a
    minor optimization issue on s390 too). Set that aside.

    Just add an "if (page_mapped(page)) synchronize_rcu();" or other such
    delay, after unmapping in truncate_cleanup_page()? Perhaps, but though
    that's likely to reduce or eliminate the number of incidents, it would
    give less assurance of whether we had identified the problem correctly.

    This successful iteration introduces "unmap_mapping_page(page)" instead
    of try_to_unmap(), and goes the usual unmap_mapping_range_tree() route,
    with an addition to details. Then zap_pmd_range() watches for this
    case, and does spin_unlock(pmd_lock) if so - just like
    page_vma_mapped_walk() now does in the PVMW_SYNC case. Not pretty, but
    safe.

    Note that unmap_mapping_page() is doing a VM_BUG_ON(!PageLocked) to
    assert its interface; but currently that's only used to make sure that
    page->mapping is stable, and zap_pmd_range() doesn't care if the page is
    locked or not. Along these lines, in invalidate_inode_pages2_range()
    move the initial unmap_mapping_range() out from under page lock, before
    then calling unmap_mapping_page() under page lock if still mapped.
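
    A sketch of the shape unmap_mapping_page() might take (the single_page
    field is an assumed addition to zap_details for illustration; details
    may differ from the actual patch):

        void unmap_mapping_page(struct page *page)
        {
                struct address_space *mapping = page->mapping;
                struct zap_details details = { };

                VM_BUG_ON(!PageLocked(page));   /* keeps page->mapping stable */
                VM_BUG_ON(PageTail(page));

                details.check_mapping = mapping;
                details.first_index = page->index;
                details.last_index = page->index + thp_nr_pages(page) - 1;
                details.single_page = page;     /* assumed new field */

                i_mmap_lock_write(mapping);
                if (unlikely(!RB_EMPTY_ROOT(&mapping->i_mmap.rb_root)))
                        unmap_mapping_range_tree(&mapping->i_mmap, &details);
                i_mmap_unlock_write(mapping);
        }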

    Link: https://lkml.kernel.org/r/a2a4a148-cdd8-942c-4ef8-51b77f643dbe@google.com
    Fixes: fc127da085c2 ("truncate: handle file thp")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Anon THP tails were already supported, but memory-failure may need to
    use page_address_in_vma() on file THP tails, which its page->mapping
    check did not permit: fix it.

    hughd adds: no current usage is known to hit the issue, but this does
    fix a subtle trap in a general helper: best fixed in stable sooner than
    later.
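
    The shape of the fix is presumably to compare against the head page's
    mapping in page_address_in_vma(), roughly as sketched here:

        /* Sketch: a THP tail does not carry the file's mapping in its own
         * page->mapping, so the comparison must use the head page.
         */
        if (!vma->vm_file ||
            vma->vm_file->f_mapping != compound_head(page)->mapping)
                return -EFAULT;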

    Link: https://lkml.kernel.org/r/a0d9b53-bf5d-8bab-ac5-759dc61819c1@google.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Jue Wang
    Signed-off-by: Hugh Dickins
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jue Wang
     
  • Running certain tests with a DEBUG_VM kernel would crash within hours,
    on the total_mapcount BUG() in split_huge_page_to_list(), while trying
    to free up some memory by punching a hole in a shmem huge page: split's
    try_to_unmap() was unable to find all the mappings of the page (which,
    on a !DEBUG_VM kernel, would then keep the huge page pinned in memory).

    When that BUG() was changed to a WARN(), it would later crash on the
    VM_BUG_ON_VMA(end < vma->vm_start || start >= vma->vm_end, vma) in
    mm/internal.h:vma_address(), used by rmap_walk_file() for
    try_to_unmap().

    vma_address() is usually correct, but there's a wraparound case when the
    vm_start address is unusually low, but vm_pgoff not so low:
    vma_address() chooses max(start, vma->vm_start), but that decides on the
    wrong address, because start has become almost ULONG_MAX.

    Rewrite vma_address() to be more careful about vm_pgoff; move the
    VM_BUG_ON_VMA() out of it, returning -EFAULT for errors, so that it can
    be safely used from page_mapped_in_vma() and page_address_in_vma() too.

    Add vma_address_end() to apply similar care to end address calculation,
    in page_vma_mapped_walk() and page_mkclean_one() and try_to_unmap_one();
    though it raises a question of whether callers would do better to supply
    pvmw->end to page_vma_mapped_walk() - I chose not, for a smaller patch.

    An irritation is that their apparent generality breaks down on KSM
    pages, which cannot be located by the page->index that page_to_pgoff()
    uses: as commit 4b0ece6fa016 ("mm: migrate: fix remove_migration_pte()
    for ksm pages") once discovered. I dithered over the best thing to do
    about that, and have ended up with a VM_BUG_ON_PAGE(PageKsm) in both
    vma_address() and vma_address_end(); though the only place in danger of
    using it on them was try_to_unmap_one().

    Sidenote: vma_address() and vma_address_end() now use compound_nr() on a
    head page, instead of thp_size(): to make the right calculation on a
    hugetlbfs page, whether or not THPs are configured. try_to_unmap() is
    used on hugetlbfs pages, but perhaps the wrong calculation never
    mattered.
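
    A sketch of the more careful calculation (close to, but not necessarily
    identical with, the final code):

        /* Sketch: return the virtual address of this page in the vma, or
         * -EFAULT if the page (including its compound tails) does not
         * overlap the vma at all.
         */
        static inline unsigned long
        vma_address(struct page *page, struct vm_area_struct *vma)
        {
                pgoff_t pgoff;
                unsigned long address;

                VM_BUG_ON_PAGE(PageKsm(page), page); /* KSM page->index unusable */
                pgoff = page_to_pgoff(page);
                if (pgoff >= vma->vm_pgoff) {
                        address = vma->vm_start +
                                ((pgoff - vma->vm_pgoff) << PAGE_SHIFT);
                        /* Check for address beyond vma (or wrapped through 0?) */
                        if (address < vma->vm_start || address >= vma->vm_end)
                                address = -EFAULT;
                } else if (PageHead(page) &&
                           pgoff + compound_nr(page) - 1 >= vma->vm_pgoff) {
                        /* Test above avoids possibility of wrap to 0 on 32-bit */
                        address = vma->vm_start;
                } else {
                        address = -EFAULT;
                }
                return address;
        }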

    Link: https://lkml.kernel.org/r/caf1c1a3-7cfb-7f8f-1beb-ba816e932825@google.com
    Fixes: a8fa41ad2f6f ("mm, rmap: check all VMAs that PTE-mapped THP can be part of")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
    (!unmap_success): with dump_page() showing mapcount:1, but then its raw
    struct page output showing _mapcount ffffffff i.e. mapcount 0.

    And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
    it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
    and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
    all indicative of some mapcount difficulty in development here perhaps.
    But the !CONFIG_DEBUG_VM path handles the failures correctly and
    silently.

    I believe the problem is that once a racing unmap has cleared pte or
    pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
    from try_to_unmap() before the racing task has reached decrementing
    mapcount.

    Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
    follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
    TTU_SYNC to the options, and passing that from unmap_page().

    When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
    for both: the slight overhead added should rarely matter, except perhaps
    if splitting sparsely-populated multiply-mapped shmem. Once confident
    that bugs are fixed, TTU_SYNC here can be removed, and the race
    tolerated.
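
    A sketch of how the new flag might be threaded through (names follow
    the changelog; the exact surrounding code is illustrative):

        /* In unmap_page(): ask for a synchronous walk while splitting. */
        ttu_flags |= TTU_SYNC;

        /* In try_to_unmap_one(): turn that into a PVMW_SYNC page table
         * walk, so a racing zap is waited for instead of being skipped.
         */
        if (flags & TTU_SYNC)
                pvmw.flags = PVMW_SYNC;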

    Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
    Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: Kirill A. Shutemov
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Most callers of is_huge_zero_pmd() supply a pmd already verified
    present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
    migration entry, in which the pfn is encoded differently from a present
    pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
    since L1TF forced us to protect against that); or perhaps even crash in
    pmd_page() applied to a swap-like entry.

    Make it safe by adding pmd_present() check into is_huge_zero_pmd()
    itself; and make it quicker by saving huge_zero_pfn, so that
    is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.

    __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
    but is unnecessary now that is_huge_zero_pmd() checks present.
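
    A sketch of the hardened helper, assuming a cached huge_zero_pfn (the
    exact ordering of the checks is illustrative):

        /* Sketch: never call pmd_page() here; compare against a cached pfn
         * and insist on pmd_present(), so migration entries cannot match.
         */
        static inline bool is_huge_zero_pmd(pmd_t pmd)
        {
                return pmd_present(pmd) &&
                       READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
        }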

    Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.

    Here is v2 batch of long-standing THP bug fixes that I had not got
    around to sending before, but prompted now by Wang Yugui's report
    https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
    they have done no harm, but have *not* fixed that issue: something more
    is needed and I have no idea of what.

    This patch (of 7):

    Stressing huge tmpfs page migration racing hole punch often crashed on
    the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
    kernel; or shortly afterwards, on a bad dereference in
    __split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd
    migration entries in the non-anonymous case.

    Full disclosure: those particular experiments were on a kernel with more
    relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
    vanilla kernel: it is conceivable that stricter locking happens to avoid
    those cases, or makes them less likely; but __split_huge_pmd_locked()
    already allowed for pmd migration entries when handling anonymous THPs,
    so this commit brings the shmem and file THP handling into line.

    And while there: use old_pmd rather than _pmd, as in the following
    blocks; and make it clearer to the eye that the !vma_is_anonymous()
    block is self-contained, making an early return after accounting for
    unmapping.

    Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
    Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Cc: Wang Yugui
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Naoya Horiguchi
    Cc: Alistair Popple
    Cc: Ralph Campbell
    Cc: Zi Yan
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Jue Wang
    Cc: Peter Xu
    Cc: Jan Kara
    Cc: Shakeel Butt
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
    We noticed that a hung task can happen in a corner-case but practical
    scenario when CONFIG_PREEMPT_NONE is enabled, as follows.

    Process 0                       Process 1                      Process 2..Inf
    split_huge_page_to_list
        unmap_page
            split_huge_pmd_address
                                    __migration_entry_wait(head)
                                                                   __migration_entry_wait(tail)
        remap_page (roll back)
            remove_migration_ptes
                rmap_walk_anon
                    cond_resched

    Here __migration_entry_wait(tail) occurs in kernel space, e.g., in
    copy_to_user in fstat, which will immediately fault again without
    rescheduling, and thus fully occupy the CPU.

    When there are too many processes performing __migration_entry_wait on
    the tail page, remap_page will never be done after cond_resched.

    Make __migration_entry_wait operate on the compound head page, so that
    it waits for remap_page to complete, whether the THP is split
    successfully or rolled back.

    Note that put_and_wait_on_page_locked helps to drop the page reference
    acquired with get_page_unless_zero, as soon as the page is on the wait
    queue, before actually waiting. So splitting the THP is only prevented
    for a brief interval.
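
    The core of the change is presumably a one-liner in
    __migration_entry_wait(), sketched here:

        /* Sketch: operate on the compound head, whose page lock is held
         * across the split or rollback, so the waiter sleeps until
         * remap_page() has completed instead of spinning on the tail.
         */
        page = migration_entry_to_page(entry);
        page = compound_head(page);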

    Link: https://lkml.kernel.org/r/b9836c1dd522e903891760af9f0c86a2cce987eb.1623144009.git.xuyu@linux.alibaba.com
    Fixes: ba98828088ad ("thp: add option to setup migration entries during PMD split")
    Suggested-by: Hugh Dickins
    Signed-off-by: Gang Deng
    Signed-off-by: Xu Yu
    Acked-by: Kirill A. Shutemov
    Acked-by: Hugh Dickins
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xu Yu
     
  • Fixes build with CONFIG_SLAB_FREELIST_HARDENED=y.

    Hopefully. But it's the right thing to do anyway.

    Fixes: 1ad53d9fa3f61 ("slub: improve bit diffusion for freelist ptr obfuscation")
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=213417
    Reported-by:
    Acked-by: Kees Cook
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
    Our syzkaller triggered the "BUG_ON(!list_empty(&inode->i_wb_list))" in
    clear_inode():

    kernel BUG at fs/inode.c:519!
    Internal error: Oops - BUG: 0 [#1] SMP
    Modules linked in:
    Process syz-executor.0 (pid: 249, stack limit = 0x00000000a12409d7)
    CPU: 1 PID: 249 Comm: syz-executor.0 Not tainted 4.19.95
    Hardware name: linux,dummy-virt (DT)
    pstate: 80000005 (Nzcv daif -PAN -UAO)
    pc : clear_inode+0x280/0x2a8
    lr : clear_inode+0x280/0x2a8
    Call trace:
    clear_inode+0x280/0x2a8
    ext4_clear_inode+0x38/0xe8
    ext4_free_inode+0x130/0xc68
    ext4_evict_inode+0xb20/0xcb8
    evict+0x1a8/0x3c0
    iput+0x344/0x460
    do_unlinkat+0x260/0x410
    __arm64_sys_unlinkat+0x6c/0xc0
    el0_svc_common+0xdc/0x3b0
    el0_svc_handler+0xf8/0x160
    el0_svc+0x10/0x218
    Kernel panic - not syncing: Fatal exception

    A crash dump of this problem shows that someone called __munlock_pagevec
    to clear the page's LRU flag without lock_page: do_mmap -> mmap_region
    -> do_munmap -> munlock_vma_pages_range -> __munlock_pagevec.

    As a result, memory_failure will call identify_page_state without
    wait_on_page_writeback. And after truncate_error_page clears the mapping
    of this page, end_page_writeback won't call sb_clear_inode_writeback to
    clear inode->i_wb_list. That will trigger the BUG_ON in clear_inode!

    Fix it by also checking PageWriteback to help determine whether we
    should skip wait_on_page_writeback.
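
    A sketch of where the extra check would go in memory_failure(); the
    surrounding shortcut is reconstructed from memory and should be treated
    as an assumption:

        /* Sketch: only take the shortcut past the LRU-based handling (and
         * its wait_on_page_writeback()) when the page is not under
         * writeback.
         */
        if (!PageTransTail(p) && !PageLRU(p) && !PageWriteback(p))
                goto identify_page_state;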

    Link: https://lkml.kernel.org/r/20210604084705.3729204-1-yangerkun@huawei.com
    Fixes: 0bc1f8b0682c ("hwpoison: fix the handling path of the victimized page frame that belong to non-LRU")
    Signed-off-by: yangerkun
    Acked-by: Naoya Horiguchi
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Oscar Salvador
    Cc: Yu Kuai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yangerkun
     
  • The routine restore_reserve_on_error is called to restore reservation
    information when an error occurs after page allocation. The routine
    alloc_huge_page modifies the mapping reserve map and potentially the
    reserve count during allocation. If code calling alloc_huge_page
    encounters an error after allocation and needs to free the page, the
    reservation information needs to be adjusted.

    Currently, restore_reserve_on_error only takes action on pages for which
    the reserve count was adjusted (HPageRestoreReserve flag). There is
    nothing wrong with these adjustments. However, alloc_huge_page ALWAYS
    modifies the reserve map during allocation even if the reserve count is
    not adjusted. This can cause issues as observed during development of
    this patch [1].

    One specific series of operations causing an issue is:

    - Create a shared hugetlb mapping
    Reservations for all pages created by default

    - Fault in a page in the mapping
    Reservation exists so reservation count is decremented

    - Punch a hole in the file/mapping at index previously faulted
    Reservation and any associated pages will be removed

    - Allocate a page to fill the hole
    No reservation entry, so reserve count unmodified
    Reservation entry added to map by alloc_huge_page

    - Error after allocation and before instantiating the page
    Reservation entry remains in map

    - Allocate a page to fill the hole
    Reservation entry exists, so decrement reservation count

    This will cause a reservation count underflow as the reservation count
    was decremented twice for the same index.

    A user would observe a very large number for HugePages_Rsvd in
    /proc/meminfo. This would also likely cause subsequent allocations of
    hugetlb pages to fail as it would 'appear' that all pages are reserved.

    This sequence of operations is unlikely to happen; however, it was
    easily reproduced and observed using hacked-up code as described in [1].

    Address the issue by having the routine restore_reserve_on_error take
    action on pages where HPageRestoreReserve is not set. In this case, we
    need to remove any reserve map entry created by alloc_huge_page. A new
    helper routine vma_del_reservation assists with this operation.

    There are three callers of alloc_huge_page which do not currently call
    restore_reserve_on_error before freeing a page on error paths. Add
    those missing calls.

    [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/

    Link: https://lkml.kernel.org/r/20210607204510.22617-1-mike.kravetz@oracle.com
    Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Mina Almasry
    Cc: Axel Rasmussen
    Cc: Peter Xu
    Cc: Muchun Song
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • It turns out that SLUB redzoning ("slub_debug=Z") checks from
    s->object_size rather than from s->inuse (which is normally bumped to
    make room for the freelist pointer), so a cache created with an object
    size less than 24 would have the freelist pointer written beyond
    s->object_size, causing the redzone to be corrupted by the freelist
    pointer. This was very visible with "slub_debug=ZF":

    BUG test (Tainted: G B ): Right Redzone overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
    INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
    INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620

    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): 00 00 00 00 00 f6 f4 a5 ........
    Redzone (____ptrval____): 40 1d e8 1a aa @....
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    Adjust the offset to stay within s->object_size.

    (Note that no caches in this size range are known to exist in the
    kernel currently.)

    Link: https://lkml.kernel.org/r/20210608183955.280836-4-keescook@chromium.org
    Link: https://lore.kernel.org/linux-mm/20200807160627.GA1420741@elver.google.com/
    Link: https://lore.kernel.org/lkml/0f7dd7b2-7496-5e2d-9488-2ec9f8e90441@suse.cz/
    Fixes: 89b83f282d8b ("slub: avoid redzone when choosing freepointer location")
    Link: https://lore.kernel.org/lkml/CANpmjNOwZ5VpKQn+SYWovTkFB4VsT-RPwyENBmaK0dLcpqStkA@mail.gmail.com
    Signed-off-by: Kees Cook
    Reported-by: Marco Elver
    Reported-by: "Lin, Zhenpeng"
    Tested-by: Marco Elver
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Pekka Enberg
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The redzone area for SLUB exists between s->object_size and s->inuse
    (which is at least the word-aligned object_size). If a cache were
    created with an object_size smaller than sizeof(void *), the in-object
    stored freelist pointer would overwrite the redzone (e.g. with boot
    param "slub_debug=ZF"):

    BUG test (Tainted: G B ): Right Redzone overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff957ead1c05de-0xffff957ead1c05df @offset=1502. First byte 0x1a instead of 0xbb
    INFO: Slab 0xffffef3950b47000 objects=170 used=170 fp=0x0000000000000000 flags=0x8000000000000200
    INFO: Object 0xffff957ead1c05d8 @offset=1496 fp=0xffff957ead1c0620

    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
    Redzone (____ptrval____): 1a aa ..
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    Store the freelist pointer out of line when object_size is smaller than
    sizeof(void *) and redzoning is enabled.

    Additionally remove the "smaller than sizeof(void *)" check under
    CONFIG_DEBUG_VM in kmem_cache_sanity_check() as it is now redundant:
    SLAB and SLOB both handle small sizes.

    (Note that no caches within this size range are known to exist in the
    kernel currently.)
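
    A sketch of the adjusted condition in calculate_sizes(); the surrounding
    code is paraphrased and the exact shape of the final patch may differ:

        /* Sketch: also move the freelist pointer out of line when the
         * object is too small to hold it without clobbering the redzone.
         */
        if (((flags & (SLAB_TYPESAFE_BY_RCU | SLAB_POISON)) ||
             ((flags & SLAB_RED_ZONE) && s->object_size < sizeof(void *)) ||
             s->ctor)) {
                s->offset = size;       /* freelist pointer after the object */
                size += sizeof(void *);
        }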

    Link: https://lkml.kernel.org/r/20210608183955.280836-3-keescook@chromium.org
    Fixes: 81819f0fc828 ("SLUB core")
    Signed-off-by: Kees Cook
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: "Lin, Zhenpeng"
    Cc: Marco Elver
    Cc: Pekka Enberg
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Patch series "Actually fix freelist pointer vs redzoning", v4.

    This fixes redzoning vs the freelist pointer (both for middle-position
    and very small caches). Both are "theoretical" fixes, in that I see no
    evidence of such small-sized caches actually being used in the kernel, but
    that's no reason to let the bugs continue to exist, especially since
    people doing local development keep tripping over it. :)

    This patch (of 3):

    Instead of repeating "Redzone" and "Poison", clarify which sides of
    those zones got tripped. Additionally fix column alignment in the
    trailer.

    Before:

    BUG test (Tainted: G B ): Redzone overwritten
    ...
    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
    Redzone (____ptrval____): 1a aa ..
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    After:

    BUG test (Tainted: G B ): Right Redzone overwritten
    ...
    Redzone (____ptrval____): bb bb bb bb bb bb bb bb ........
    Object (____ptrval____): f6 f4 a5 40 1d e8 ...@..
    Redzone (____ptrval____): 1a aa ..
    Padding (____ptrval____): 00 00 00 00 00 00 00 00 ........

    The earlier commits that slowly resulted in the "Before" reporting were:

    d86bd1bece6f ("mm/slub: support left redzone")
    ffc79d288000 ("slub: use print_hex_dump")
    2492268472e7 ("SLUB: change error reporting format to follow lockdep loosely")

    Link: https://lkml.kernel.org/r/20210608183955.280836-1-keescook@chromium.org
    Link: https://lkml.kernel.org/r/20210608183955.280836-2-keescook@chromium.org
    Link: https://lore.kernel.org/lkml/cfdb11d7-fb8e-e578-c939-f7f5fb69a6bd@suse.cz/
    Signed-off-by: Kees Cook
    Acked-by: Vlastimil Babka
    Cc: Marco Elver
    Cc: "Lin, Zhenpeng"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Roman Gushchin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
    I found this by pure code review: pte_same_as_swp(), used by unuse_vma(),
    didn't take the uffd-wp bit into account when comparing ptes.
    pte_same_as_swp() returning a false negative could cause failure to
    swapoff swap ptes that were wr-protected by userfaultfd.
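
    A sketch of the fix, assuming the comparison strips both the soft-dirty
    and the uffd-wp swap pte bits before pte_same():

        /* Sketch: clear auxiliary swap-pte bits before comparing. */
        static inline pte_t pte_swp_clear_flags(pte_t pte)
        {
                if (pte_swp_soft_dirty(pte))
                        pte = pte_swp_clear_soft_dirty(pte);
                if (pte_swp_uffd_wp(pte))
                        pte = pte_swp_clear_uffd_wp(pte);
                return pte;
        }

        static inline int pte_same_as_swp(pte_t pte, pte_t swp_pte)
        {
                return pte_same(pte_swp_clear_flags(pte), swp_pte);
        }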

    Link: https://lkml.kernel.org/r/20210603180546.9083-1-peterx@redhat.com
    Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Peter Xu
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: [5.7+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
    When a hugetlb page fault (under an overcommitting situation) races with
    memory_failure(), VM_BUG_ON_PAGE() is triggered as follows:

    CPU0:                             CPU1:

                                      gather_surplus_pages()
                                        page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                        zero = put_page_testzero(page)
                                        VM_BUG_ON_PAGE(!zero, page)
                                        enqueue_huge_page(h, page)
    put_page(page)

    __get_hwpoison_page() only checks the page refcount before taking an
    additional one for memory error handling, which is not enough because
    there's a time window where compound pages have non-zero refcount during
    hugetlb page initialization.

    So make __get_hwpoison_page() check the page status a bit more for
    hugetlb pages, with get_hwpoison_huge_page(). Checking hugetlb-specific
    flags under hugetlb_lock makes sure that the hugetlb page is not in a
    transient state. It's notable that another new function,
    HWPoisonHandlable(), is helpful to prevent a race against other
    transient page states (like a generic compound page just before PageHuge
    becomes true).

    Link: https://lkml.kernel.org/r/20210603233632.2964832-2-nao.horiguchi@gmail.com
    Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
    Signed-off-by: Naoya Horiguchi
    Reported-by: Muchun Song
    Acked-by: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Tony Luck
    Cc: [5.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

05 Jun, 2021

7 commits

  • The userfaultfd hugetlb tests cause a resv_huge_pages underflow. This
    happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
    an index for which we already have a page in the cache. When this
    happens, we allocate a second page, double consuming the reservation,
    and then fail to insert the page into the cache and return -EEXIST.

    To fix this, we first check if there is a page in the cache which
    already consumed the reservation, and return -EEXIST immediately if so.
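
    A sketch of that early check in hugetlb_mcopy_atomic_pte(); the helper
    and variable names are assumptions for illustration:

        /* Sketch: a page already in the cache for a shared mapping consumed
         * the reservation when it was inserted; allocating again would
         * consume it twice, so report -EEXIST instead.
         */
        if (vm_shared &&
            hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
                ret = -EEXIST;
                goto out;
        }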

    There is still a rare condition where we fail to copy the page contents
    AND race with a call to hugetlb_no_page() for this index, and again we
    will underflow resv_huge_pages. That is fixed in a more complicated
    patch not targeted for -stable.

    Test:

    Hacked the code locally such that resv_huge_pages underflows produce a
    warning, then:

    ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
    ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

    Both tests succeed and produce no warnings. After the test runs number
    of free/resv hugepages is correct.

    [mike.kravetz@oracle.com: changelog fixes]

    Link: https://lkml.kernel.org/r/20210528004649.85298-1-almasrymina@google.com
    Fixes: 8fb5debc5fcd ("userfaultfd: hugetlbfs: add hugetlb_mcopy_atomic_pte for userfaultfd support")
    Signed-off-by: Mina Almasry
    Reviewed-by: Mike Kravetz
    Cc: Axel Rasmussen
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mina Almasry
     
  • Fix gcc W=1 warning:

    mm/kasan/init.c:228: warning: Function parameter or member 'shadow_start' not described in 'kasan_populate_early_shadow'
    mm/kasan/init.c:228: warning: Function parameter or member 'shadow_end' not described in 'kasan_populate_early_shadow'

    Link: https://lkml.kernel.org/r/20210603140700.3045298-1-yukuai3@huawei.com
    Signed-off-by: Yu Kuai
    Acked-by: Andrey Ryabinin
    Cc: Zhang Yi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Kuai
     
  • When memory_failure() or soft_offline_page() is called on a tail page of
    some hugetlb page, "BUG: unable to handle page fault" error can be
    triggered.

    remove_hugetlb_page() dereferences page->lru, so it's assumed that the
    page points to a head page; but one of the callers,
    dissolve_free_huge_page(), provides remove_hugetlb_page() with a 'page'
    which could be a tail page. So pass 'head' to it instead.
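
    The fix is presumably a one-line change in dissolve_free_huge_page(),
    sketched here (the argument values are illustrative):

        /* Sketch: 'page' may be a tail; page->lru is only meaningful on the
         * head page, so hand the head to remove_hugetlb_page().
         */
        struct page *head = compound_head(page);

        remove_hugetlb_page(h, head, false);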

    Link: https://lkml.kernel.org/r/20210526235257.2769473-1-nao.horiguchi@gmail.com
    Fixes: 6eb4e88a6d27 ("hugetlb: create remove_hugetlb_page() to separate functionality")
    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Mike Kravetz
    Reviewed-by: Muchun Song
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Miaohe Lin
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    Recently we found that a lot of MemFree is left in /proc/meminfo after
    doing a lot of page soft offlining, which is not quite correct.

    Before Oscar's rework of soft offline for free pages [1], if we soft
    offline free pages, these pages are left in the buddy allocator with the
    HWPoison flag set, and NR_FREE_PAGES is not updated immediately. So the
    difference between NR_FREE_PAGES and the real number of available free
    pages can also be big at the beginning.

    However, with the workload running, when we subsequently catch a
    HWPoison page in any of the alloc functions, we remove it from buddy,
    update NR_FREE_PAGES and try again, so NR_FREE_PAGES gets closer and
    closer to the real number of available free pages. (regardless of
    unpoison_memory())

    Now, when offlining free pages, after a successful call to
    take_page_off_buddy(), the page no longer belongs to the buddy allocator
    and will not be used any more; but we missed accounting NR_FREE_PAGES in
    this situation, and there is no chance for it to be updated later.

    Do the update in take_page_off_buddy() like rmqueue() does, but avoid
    double counting if someone already called set_migratetype_isolate() on
    the page.
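
    A sketch of the accounting update in take_page_off_buddy(); the exact
    code is an assumption modelled on the description above:

        /* Sketch: after the page is taken off the free lists, subtract it
         * from the free page counters, unless the pageblock is isolated, in
         * which case the counters were already adjusted at isolation time.
         */
        if (!is_migrate_isolate(migratetype))
                __mod_zone_freepage_state(zone, -1, migratetype);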

    [1]: commit 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")

    Link: https://lkml.kernel.org/r/20210526075247.11130-1-dinghui@sangfor.com.cn
    Fixes: 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages")
    Signed-off-by: Ding Hui
    Suggested-by: Naoya Horiguchi
    Reviewed-by: Oscar Salvador
    Acked-by: David Hildenbrand
    Acked-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ding Hui
     
  • In pmd/pud_advanced_tests(), the vaddr is aligned up to the next pmd/pud
    entry, and so it does not match the given pmdp/pudp and (aligned down)
    pfn any more.

    For s390, this results in memory corruption, because the IDTE
    instruction used e.g. in xxx_get_and_clear() will take the vaddr for
    some calculations, in combination with the given pmdp. It will then end
    up with a wrong table origin, ending on ...ff8, and some of those
    wrongly set low-order bits will also select a wrong pagetable level for
    the index addition. IDTE could therefore invalidate (or 0x20) something
    outside of the page tables, depending on the wrongly picked index, which
    in turn depends on the random vaddr.

    As a result, we sometimes see "BUG task_struct (Not tainted): Padding
    overwritten" on s390, where one 0x5a padding value got overwritten with
    0x7a.

    Fix this by aligning down, similar to how the pmd/pud_aligned pfns are
    calculated.

    Link: https://lkml.kernel.org/r/20210525130043.186290-2-gerald.schaefer@linux.ibm.com
    Fixes: a5c3b9ffb0f40 ("mm/debug_vm_pgtable: add tests validating advanced arch page table helpers")
    Signed-off-by: Gerald Schaefer
    Reviewed-by: Anshuman Khandual
    Cc: Vineet Gupta
    Cc: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: [5.9+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Since wait_event() uses TASK_UNINTERRUPTIBLE by default, waiting for an
    allocation counts towards load. However, for KFENCE, this does not make
    any sense, since there is no busy work we're awaiting.

    Instead, use TASK_IDLE via wait_event_idle() to not count towards load.

    BugLink: https://bugzilla.suse.com/show_bug.cgi?id=1185565
    Link: https://lkml.kernel.org/r/20210521083209.3740269-1-elver@google.com
    Fixes: 407f1d8c1b5f ("kfence: await for allocation using wait_event")
    Signed-off-by: Marco Elver
    Cc: Mel Gorman
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: David Laight
    Cc: Hillf Danton
    Cc: [5.12+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
  • This reverts commit f685a533a7fab35c5d069dcd663f59c8e4171a75.

    The MIPS cache flush logic needs to know whether the mapping was already
    established to decide how to flush caches. This is done by checking the
    valid bit in the PTE. The commit above breaks this logic by setting the
    valid in the PTE in new mappings, which causes kernel crashes.

    Link: https://lkml.kernel.org/r/20210526094335.92948-1-tsbogend@alpha.franken.de
    Fixes: f685a533a7f ("MIPS: make userspace mapping young by default")
    Reported-by: Zhou Yanjie
    Signed-off-by: Thomas Bogendoerfer
    Cc: Huang Pei
    Cc: Nicholas Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Bogendoerfer
     

23 May, 2021

4 commits

  • In commit d6995da31122 ("hugetlb: use page.private for hugetlb specific
    page flags") the use of PagePrivate to indicate a reservation count
    should be restored at free time was changed to the hugetlb specific flag
    HPageRestoreReserve. Changes to a userfaultfd error path as well as a
    VM_BUG_ON() in remove_inode_hugepages() were overlooked.

    Users could see incorrect hugetlb reserve counts if they experience an
    error with a UFFDIO_COPY operation. Specifically, this would be the
    result of an unlikely copy_huge_page_from_user error. There is not an
    increased chance of hitting the VM_BUG_ON.

    Link: https://lkml.kernel.org/r/20210521233952.236434-1-mike.kravetz@oracle.com
    Fixes: d6995da31122 ("hugetlb: use page.private for hugetlb specific page flags")
    Signed-off-by: Mike Kravetz
    Reviewed-by: Mina Almasry
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: Naoya Horiguchi
    Cc: David Hildenbrand
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Mina Almasry
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • With CONFIG_DEBUG_PAGEALLOC enabled, the kernel should also untag the
    object pointer, as done in get_freepointer().

    Failing to do so reportedly leads to SLUB freelist corruptions that
    manifest as boot-time crashes.

    Link: https://lkml.kernel.org/r/20210514072228.534418-1-glider@google.com
    Signed-off-by: Alexander Potapenko
    Cc: Marco Elver
    Cc: Vincenzo Frascino
    Cc: Andrey Ryabinin
    Cc: Andrey Konovalov
    Cc: Elliot Berman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • While reviewing [1] I came across commit d3378e86d182 ("mm/gup: check
    page posion status for coredump.") and noticed that this patch is broken
    in two ways. First it doesn't really prevent hwpoison pages from being
    dumped because hwpoison pages can be marked asynchronously at any time
    after the check. Secondly, and more importantly, the patch introduces a
    ref count leak because get_dump_page takes a reference on the page which
    is not released.

    It also seems that the patch was merged incorrectly, because there were
    follow-up changes not included, as well as discussions on how to address
    the underlying problem [2].

    Therefore revert the original patch.

    Link: http://lkml.kernel.org/r/20210429122519.15183-4-david@redhat.com [1]
    Link: http://lkml.kernel.org/r/57ac524c-b49a-99ec-c1e4-ef5027bfb61b@redhat.com [2]
    Link: https://lkml.kernel.org/r/20210505135407.31590-1-mhocko@kernel.org
    Fixes: d3378e86d182 ("mm/gup: check page posion status for coredump.")
    Signed-off-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Aili Yao
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • clang sometimes decides not to inline shuffle_zone(), but it calls a
    __meminit function. Without the extra __meminit annotation we get this
    warning:

    WARNING: modpost: vmlinux.o(.text+0x2a86d4): Section mismatch in reference from the function shuffle_zone() to the function .meminit.text:__shuffle_zone()
    The function shuffle_zone() references
    the function __meminit __shuffle_zone().
    This is often because shuffle_zone lacks a __meminit
    annotation or the annotation of __shuffle_zone is wrong.

    shuffle_free_memory() did not show the same problem in my tests, but it
    could happen in theory as well, so mark both as __meminit.

    Link: https://lkml.kernel.org/r/20210514135952.2928094-1-arnd@kernel.org
    Signed-off-by: Arnd Bergmann
    Reviewed-by: David Hildenbrand
    Reviewed-by: Nathan Chancellor
    Cc: Nick Desaulniers
    Cc: Arnd Bergmann
    Cc: Wei Yang
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
     

15 May, 2021

6 commits

    iomap_max_page_shift is expected to contain a page shift, so it can't be
    a 'bool'; it has to be an 'unsigned int'.

    And fix the default value: P4D_SHIFT is the value for when huge iomap is
    allowed.

    However, on some architectures (eg: powerpc book3s/64), P4D_SHIFT is not
    a constant, so it can't be used to initialise a static variable. So,
    initialise iomap_max_page_shift with the maximum shift supported by the
    architecture; it is gated by P4D_SHIFT in vmap_try_huge_p4d() anyway.

    Link: https://lkml.kernel.org/r/ad2d366015794a9f21320dcbdd0a8eb98979e9df.1620898113.git.christophe.leroy@csgroup.eu
    Fixes: bbc180a5adb0 ("mm: HUGE_VMAP arch support cleanup")
    Signed-off-by: Christophe Leroy
    Reviewed-by: Nicholas Piggin
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christophe Leroy
     
  • This reverts commit 3e96b6a2e9ad929a3230a22f4d64a74671a0720b. General
    Protection Fault in rmap_walk_ksm() under memory pressure:
    remove_rmap_item_from_tree() needs to take page lock, of course.

    Link: https://lkml.kernel.org/r/alpine.LSU.2.11.2105092253500.1127@eggly.anvils
    Signed-off-by: Hugh Dickins
    Cc: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Consider the following sequence of events:

    1. Userspace issues a UFFD ioctl, which ends up calling into
    shmem_mfill_atomic_pte(). We successfully account the blocks, we
    shmem_alloc_page(), but then the copy_from_user() fails. We return
    -ENOENT. We don't release the page we allocated.
    2. Our caller detects this error code, tries the copy_from_user() after
    dropping the mmap_lock, and retries, calling back into
    shmem_mfill_atomic_pte().
    3. Meanwhile, let's say another process filled up the tmpfs being used.
    4. So shmem_mfill_atomic_pte() fails to account blocks this time, and
    immediately returns - without releasing the page.

    This triggers a BUG_ON in our caller, which asserts that the page
    should always be consumed, unless -ENOENT is returned.

    To fix this, detect if we have such a "dangling" page when accounting
    fails, and if so, release it before returning.
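
    A sketch of the fix in shmem_mfill_atomic_pte(), assuming
    shmem_inode_acct_block() is the accounting step that fails on the retry:

        /* Sketch: on the retry path *pagep carries the previously allocated
         * page; if block accounting now fails, release it rather than
         * returning with it still held (which trips the caller's BUG_ON).
         */
        if (!shmem_inode_acct_block(inode, 1)) {
                if (unlikely(*pagep)) {
                        put_page(*pagep);
                        *pagep = NULL;
                }
                return -ENOMEM;
        }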

    Link: https://lkml.kernel.org/r/20210428230858.348400-1-axelrasmussen@google.com
    Fixes: cb658a453b93 ("userfaultfd: shmem: avoid leaking blocks and used blocks in UFFDIO_COPY")
    Signed-off-by: Axel Rasmussen
    Reported-by: Hugh Dickins
    Acked-by: Hugh Dickins
    Reviewed-by: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Axel Rasmussen
     
  • Paul E. McKenney reported [1] that commit 1f0723a4c0df ("mm, slub: enable
    slub_debug static key when creating cache with explicit debug flags")
    results in the lockdep complaint:

    ======================================================
    WARNING: possible circular locking dependency detected
    5.12.0+ #15 Not tainted
    ------------------------------------------------------
    rcu_torture_sta/109 is trying to acquire lock:
    ffffffff96063cd0 (cpu_hotplug_lock){++++}-{0:0}, at: static_key_enable+0x9/0x20

    but task is already holding lock:
    ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250

    which lock already depends on the new lock.

    the existing dependency chain (in reverse order) is:

    -> #1 (slab_mutex){+.+.}-{3:3}:
    lock_acquire+0xb9/0x3a0
    __mutex_lock+0x8d/0x920
    slub_cpu_dead+0x15/0xf0
    cpuhp_invoke_callback+0x17a/0x7c0
    cpuhp_invoke_callback_range+0x3b/0x80
    _cpu_down+0xdf/0x2a0
    cpu_down+0x2c/0x50
    device_offline+0x82/0xb0
    remove_cpu+0x1a/0x30
    torture_offline+0x80/0x140
    torture_onoff+0x147/0x260
    kthread+0x10a/0x140
    ret_from_fork+0x22/0x30

    -> #0 (cpu_hotplug_lock){++++}-{0:0}:
    check_prev_add+0x8f/0xbf0
    __lock_acquire+0x13f0/0x1d80
    lock_acquire+0xb9/0x3a0
    cpus_read_lock+0x21/0xa0
    static_key_enable+0x9/0x20
    __kmem_cache_create+0x38d/0x430
    kmem_cache_create_usercopy+0x146/0x250
    kmem_cache_create+0xd/0x10
    rcu_torture_stats+0x79/0x280
    kthread+0x10a/0x140
    ret_from_fork+0x22/0x30

    other info that might help us debug this:

    Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(slab_mutex);
                                   lock(cpu_hotplug_lock);
                                   lock(slab_mutex);
      lock(cpu_hotplug_lock);

    *** DEADLOCK ***

    1 lock held by rcu_torture_sta/109:
    #0: ffffffff96173c28 (slab_mutex){+.+.}-{3:3}, at: kmem_cache_create_usercopy+0x2d/0x250

    stack backtrace:
    CPU: 3 PID: 109 Comm: rcu_torture_sta Not tainted 5.12.0+ #15
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
    Call Trace:
    dump_stack+0x6d/0x89
    check_noncircular+0xfe/0x110
    ? lock_is_held_type+0x98/0x110
    check_prev_add+0x8f/0xbf0
    __lock_acquire+0x13f0/0x1d80
    lock_acquire+0xb9/0x3a0
    ? static_key_enable+0x9/0x20
    ? mark_held_locks+0x49/0x70
    cpus_read_lock+0x21/0xa0
    ? static_key_enable+0x9/0x20
    static_key_enable+0x9/0x20
    __kmem_cache_create+0x38d/0x430
    kmem_cache_create_usercopy+0x146/0x250
    ? rcu_torture_stats_print+0xd0/0xd0
    kmem_cache_create+0xd/0x10
    rcu_torture_stats+0x79/0x280
    ? rcu_torture_stats_print+0xd0/0xd0
    kthread+0x10a/0x140
    ? kthread_park+0x80/0x80
    ret_from_fork+0x22/0x30

    This is because there's one order of locking from the hotplug callbacks:

    lock(cpu_hotplug_lock); // from hotplug machinery itself
    lock(slab_mutex); // in e.g. slab_mem_going_offline_callback()

    And commit 1f0723a4c0df made the reverse sequence possible:
    lock(slab_mutex); // in kmem_cache_create_usercopy()
    lock(cpu_hotplug_lock); // kmem_cache_open() -> static_key_enable()

    The simplest fix is to move static_key_enable() to a place before slab_mutex is
    taken. That means kmem_cache_create_usercopy() in mm/slab_common.c which is not
    ideal for SLUB-specific code, but the #ifdef CONFIG_SLUB_DEBUG makes it
    at least self-contained and obvious.
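
    A sketch of the move, assuming the static key is flipped in
    kmem_cache_create_usercopy() just before slab_mutex is taken:

        #ifdef CONFIG_SLUB_DEBUG
                /* Sketch: if the caller passes any debug flag explicitly,
                 * enable the slub_debug static key here, before slab_mutex
                 * (and so before cpu_hotplug_lock can nest inside it).
                 */
                if (flags & SLAB_DEBUG_FLAGS)
                        static_branch_enable(&slub_debug_enabled);
        #endif

                mutex_lock(&slab_mutex);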

    [1] https://lore.kernel.org/lkml/20210502171827.GA3670492@paulmck-ThinkPad-P17-Gen-1/

    Link: https://lkml.kernel.org/r/20210504120019.26791-1-vbabka@suse.cz
    Fixes: 1f0723a4c0df ("mm, slub: enable slub_debug static key when creating cache with explicit debug flags")
    Signed-off-by: Vlastimil Babka
    Reported-by: Paul E. McKenney
    Tested-by: Paul E. McKenney
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    When reworking early cow of pinned hugetlb pages, we moved huge_ptep_get()
    upwards but overlooked a side effect: huge_ptep_get() now fetches the pte
    before wr-protection. After moving it upwards, we need an explicit
    wr-protect of the child pte, or we will keep the write bit set in the
    child process, which could cause data corruption where the child can
    write to the original page directly.

    This issue can also be exposed by "memfd_test hugetlbfs" kselftest.
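
    A sketch of the fix in copy_hugetlb_page_range(), on the assumption that
    the child's entry is built from the value fetched before the parent was
    wr-protected:

        if (cow) {
                /* Parent pte is wr-protected in place... */
                huge_ptep_set_wrprotect(src, addr, src_pte);
                /* ...and the already-fetched value used for the child must
                 * be wr-protected explicitly too (sketch of the one-liner).
                 */
                entry = huge_pte_wrprotect(entry);
        }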

    Link: https://lkml.kernel.org/r/20210503234356.9097-3-peterx@redhat.com
    Fixes: 4eae4efa2c299 ("hugetlb: do early cow when page pinned on src mm")
    Signed-off-by: Peter Xu
    Reviewed-by: Mike Kravetz
    Cc: Hugh Dickins
    Cc: Joel Fernandes (Google)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm/hugetlb: Fix issues on file sealing and fork", v2.

    Hugh reported an issue with F_SEAL_FUTURE_WRITE not being applied
    correctly to hugetlbfs, which I can easily verify using the memfd_test
    program; it seems that the program is hardly ever run with hugetlbfs
    pages (shmem being the default).

    Meanwhile I found another, probably even more severe, issue: hugetlb
    fork won't wr-protect child cow pages, so the child can potentially
    write to parent private pages. Patch 2 addresses that.

    After this series applied, "memfd_test hugetlbfs" should start to pass.

    This patch (of 2):

    F_SEAL_FUTURE_WRITE has been missing for hugetlb since the first day.
    There is a test program for that and it fails constantly.

    $ ./memfd_test hugetlbfs
    memfd-hugetlb: CREATE
    memfd-hugetlb: BASIC
    memfd-hugetlb: SEAL-WRITE
    memfd-hugetlb: SEAL-FUTURE-WRITE
    mmap() didn't fail as expected
    Aborted (core dumped)

    I think it's probably because no one is really running the hugetlbfs test.

    Fix it by also checking F_SEAL_FUTURE_WRITE in hugetlbfs_file_mmap(), as
    we do in shmem_mmap(). Generalize a helper for that.
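
    A sketch of such a generalized helper, modelled on what shmem_mmap()
    already does (name and placement are assumptions):

        /* Sketch: shared writable mmap()s are refused outright; shared
         * read-only mmap()s additionally drop VM_MAYWRITE so a later
         * mprotect(PROT_WRITE) cannot reintroduce write access.
         */
        static inline int seal_check_future_write(int seals,
                                                  struct vm_area_struct *vma)
        {
                if (seals & F_SEAL_FUTURE_WRITE) {
                        if ((vma->vm_flags & VM_SHARED) &&
                            (vma->vm_flags & VM_WRITE))
                                return -EPERM;
                        if (vma->vm_flags & VM_SHARED)
                                vma->vm_flags &= ~(VM_MAYWRITE);
                }
                return 0;
        }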

    Link: https://lkml.kernel.org/r/20210503234356.9097-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210503234356.9097-2-peterx@redhat.com
    Fixes: ab3948f58ff84 ("mm/memfd: add an F_SEAL_FUTURE_WRITE seal to memfd")
    Signed-off-by: Peter Xu
    Reported-by: Hugh Dickins
    Reviewed-by: Mike Kravetz
    Cc: Joel Fernandes (Google)
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

07 May, 2021

6 commits

  • succed -> succeed in mm/hugetlb.c
    wil -> will in mm/mempolicy.c
    wit -> with in mm/page_alloc.c
    Retruns -> Returns in mm/page_vma_mapped.c
    confict -> conflict in mm/secretmem.c
    No functionality changed.

    Link: https://lkml.kernel.org/r/20210408140027.60623-1-lujialin4@huawei.com
    Signed-off-by: Lu Jialin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lu Jialin
     
  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     
  • There is a spelling mistake in a comment. Fix it.

    Link: https://lkml.kernel.org/r/20210317094158.5762-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • The last user (/dev/kmem) is gone. Let's drop it.

    Link: https://lkml.kernel.org/r/20210324102351.6932-4-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: Hillf Danton
    Cc: Matthew Wilcox
    Cc: Oleksiy Avramchenko
    Cc: Steven Rostedt
    Cc: Minchan Kim
    Cc: huang ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "drivers/char: remove /dev/kmem for good".

    Exploring /dev/kmem and /dev/mem in the context of memory hot(un)plug and
    memory ballooning, I started questioning the existence of /dev/kmem.

    Comparing it with the /proc/kcore implementation, it does not seem to be
    able to deal with things like

    a) Pages unmapped from the direct mapping (e.g., to be used by secretmem)
    -> kern_addr_valid(). virt_addr_valid() is not sufficient.

    b) Special cases like gart aperture memory that is not to be touched
    -> mem_pfn_is_ram()

    Unless I am missing something, it's at least broken in some cases and might
    fault/crash the machine.

    Looks like its existence has been questioned before, in 2005 and 2010 [1];
    after ~11 additional years, it might make sense to revive the discussion.

    CONFIG_DEVKMEM is only enabled in a single defconfig (on purpose or by
    mistake?). All distributions disable it: in Ubuntu it has been disabled
    for more than 10 years, in Debian since 2.6.31, in Fedora at least
    starting with FC3, in RHEL starting with RHEL4, in SUSE starting from
    15sp2, and OpenSUSE has it disabled as well.

    1) /dev/kmem was popular for rootkits [2] before it got disabled
    basically everywhere. Ubuntu documents [3] "There is no modern user of
    /dev/kmem any more beyond attackers using it to load kernel rootkits.".
    RHEL documents in a BZ [5] "it served no practical purpose other than to
    serve as a potential security problem or to enable binary module drivers
    to access structures/functions they shouldn't be touching"

    2) /proc/kcore is a decent interface to have a controlled way to read
    kernel memory for debugging purposes. (will need some extensions to
    deal with memory offlining/unplug, memory ballooning, and poisoned
    pages, though)

    3) It might be useful for corner-case debugging [1]. KDB/KGDB might be a
    better fit, especially for writing random memory; it is harder to shoot
    yourself in the foot.

    4) "Kernel Memory Editor" [4] hasn't seen any updates since 2000 and seems
    to be incompatible with 64bit [1]. For educational purposes,
    /proc/kcore might be used to monitor value updates -- or older
    kernels can be used.

    5) It's broken on arm64, and therefore, completely disabled there.

    Looks like it's essentially unused and has been replaced by better
    suited interfaces for individual tasks (/proc/kcore, KDB/KGDB). Let's
    just remove it.

    [1] https://lwn.net/Articles/147901/
    [2] https://www.linuxjournal.com/article/10505
    [3] https://wiki.ubuntu.com/Security/Features#A.2Fdev.2Fkmem_disabled
    [4] https://sourceforge.net/projects/kme/
    [5] https://bugzilla.redhat.com/show_bug.cgi?id=154796

    Link: https://lkml.kernel.org/r/20210324102351.6932-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210324102351.6932-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Kees Cook
    Cc: Linus Torvalds
    Cc: Greg Kroah-Hartman
    Cc: "Alexander A. Klimov"
    Cc: Alexander Viro
    Cc: Alexandre Belloni
    Cc: Andrew Lunn
    Cc: Andrey Zhizhikin
    Cc: Arnd Bergmann
    Cc: Benjamin Herrenschmidt
    Cc: Brian Cain
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Chris Zankel
    Cc: Corentin Labbe
    Cc: "David S. Miller"
    Cc: "Eric W. Biederman"
    Cc: Geert Uytterhoeven
    Cc: Gerald Schaefer
    Cc: Greentime Hu
    Cc: Gregory Clement
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Hillf Danton
    Cc: huang ying
    Cc: Ingo Molnar
    Cc: Ivan Kokshaysky
    Cc: "James E.J. Bottomley"
    Cc: James Troup
    Cc: Jiaxun Yang
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Kairui Song
    Cc: Krzysztof Kozlowski
    Cc: Kuninori Morimoto
    Cc: Liviu Dudau
    Cc: Lorenzo Pieralisi
    Cc: Luc Van Oostenryck
    Cc: Luis Chamberlain
    Cc: Matthew Wilcox
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Mike Rapoport
    Cc: Mikulas Patocka
    Cc: Minchan Kim
    Cc: Niklas Schnelle
    Cc: Oleksiy Avramchenko
    Cc: openrisc@lists.librecores.org
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: "Pavel Machek (CIP)"
    Cc: Pavel Machek
    Cc: "Peter Zijlstra (Intel)"
    Cc: Pierre Morel
    Cc: Randy Dunlap
    Cc: Richard Henderson
    Cc: Rich Felker
    Cc: Robert Richter
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sam Ravnborg
    Cc: Sebastian Andrzej Siewior
    Cc: Sebastian Hesselbarth
    Cc: sparclinux@vger.kernel.org
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Steven Rostedt
    Cc: Sudeep Holla
    Cc: Theodore Dubois
    Cc: Thomas Bogendoerfer
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Viresh Kumar
    Cc: William Cohen
    Cc: Xiaoming Ni
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • fix some typos and code style problems in mm.

    gfp.h: s/MAXNODES/MAX_NUMNODES
    mmzone.h: s/then/than
    rmap.c: s/__vma_split()/__vma_adjust()
    swap.c: s/__mod_zone_page_stat/__mod_zone_page_state, s/is is/is
    swap_state.c: s/whoes/whose
    z3fold.c: code style problem fix in z3fold_unregister_migration
    zsmalloc.c: s/of/or, s/give/given

    Link: https://lkml.kernel.org/r/20210419083057.64820-1-luoshijie1@huawei.com
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shijie Luo