30 Dec, 2020

1 commit

  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code to check the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
    secondary MMU's page table before the check. More specifically, this
    affects secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start(), like KVM.

    However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
    the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page
    table access check before trying to unmap the page. So, at worst,
    reclaim will miss accesses in a very short window if we remove the page
    table access check from the unmapping code.

    There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
    reclaim. In memcg reclaim, page_referenced() only accounts accesses from
    processes in the same memcg as the target page, but the unmapping code
    considers accesses from all processes, which decreases the effectiveness
    of memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.
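
    For reference, this is roughly the shape of the access check that the
    unmapping code used to perform for !(TTU_IGNORE_ACCESS); treat it as an
    illustrative sketch of the try_to_unmap_one() path (inside the
    page_vma_mapped_walk() loop), not the exact upstream diff:

      /* sketch: old behaviour in try_to_unmap_one(), mm/rmap.c */
      if (!(flags & TTU_IGNORE_ACCESS)) {
              /* page was referenced through this mapping recently: give up */
              if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                      ret = false;
                      page_vma_mapped_walk_done(&pvmw);
                      break;
              }
      }

    Always assuming TTU_IGNORE_ACCESS means this block (and the flag itself)
    can simply be dropped.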

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     

23 Nov, 2020

1 commit

  • Alexander reported a syzkaller / KASAN finding on s390, see below for
    complete output.

    In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
    freed in some cases. In the case of userfaultfd_missing(), this will
    happen after calling handle_userfault(), which might have released the
    mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will
    access an unstable vma->vm_mm, which could have been freed or re-used
    already.

    For all architectures other than s390 this has no negative impact,
    because pte_free() simply frees the page and ignores the passed-in mm.
    The SPARC32 implementation would also access mm->page_table_lock in
    pte_free(), but there is no THP support on SPARC32, so the buggy code
    path is not used there.

    For s390, the mm->context.pgtable_list is being used to maintain the 2K
    pagetable fragments, and operating on an already freed or even re-used
    mm could result in various more or less subtle bugs due to list /
    pagetable corruption.

    Fix this by calling pte_free() before handle_userfault(), similar to how
    it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
    non-huge_zero_page case.

    Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for
    userfaultfd_missing() faults") actually introduced both the
    do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
    changes w.r.t. calling handle_userfault(), but only the latter put the
    pte_free() before calling handle_userfault().
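
    A minimal sketch of the reordering, simplified from the
    do_huge_pmd_anonymous_page() huge-zero-page branch (surrounding error
    handling and locking details abbreviated):

      if (userfaultfd_missing(vma)) {
              spin_unlock(vmf->ptl);
              /* free the preallocated pagetable while vma->vm_mm is still
               * guaranteed stable, i.e. before handle_userfault() may
               * release mmap_lock */
              pte_free(vma->vm_mm, pgtable);
              ret = handle_userfault(vmf, VM_UFFD_MISSING);
              VM_BUG_ON(ret & VM_FAULT_FALLBACK);
      }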

    BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334

    CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
    Hardware name: IBM 3906 M04 701 (KVM/Linux)
    Call Trace:
    do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    create_huge_pmd mm/memory.c:4256 [inline]
    __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
    handle_mm_fault+0x288/0x748 mm/memory.c:4607
    do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
    do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
    pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
    copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
    raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
    _copy_from_user+0x48/0xa8 lib/usercopy.c:16
    copy_from_user include/linux/uaccess.h:192 [inline]
    __do_sys_sigaltstack kernel/signal.c:4064 [inline]
    __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Allocated by task 9334:
    slab_alloc_node mm/slub.c:2891 [inline]
    slab_alloc mm/slub.c:2899 [inline]
    kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
    vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
    __split_vma+0xba/0x560 mm/mmap.c:2742
    split_vma+0xca/0x108 mm/mmap.c:2800
    mlock_fixup+0x4ae/0x600 mm/mlock.c:550
    apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
    do_mlock+0x1aa/0x718 mm/mlock.c:711
    __do_sys_mlock2 mm/mlock.c:738 [inline]
    __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Freed by task 9333:
    slab_free mm/slub.c:3142 [inline]
    kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
    __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
    vma_merge+0x87e/0xce0 mm/mmap.c:1209
    userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
    __fput+0x22c/0x7a8 fs/file_table.c:281
    task_work_run+0x200/0x320 kernel/task_work.c:151
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
    system_call+0xe6/0x28c arch/s390/kernel/entry.S:416

    The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
    The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
    The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
    raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
    raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
    page dumped because: kasan: bad access detected
    page->mem_cgroup:0000000096959501

    Memory state around the buggy address:
    00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
    >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
    00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
    Reported-by: Alexander Egorenkov
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: [4.3+]
    Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

17 Oct, 2020

6 commits

  • It is reported that the following bug is triggered if the HDD is used as
    swap device,

    [ 5758.157556] BUG: kernel NULL pointer dereference, address: 0000000000000007
    [ 5758.165331] #PF: supervisor write access in kernel mode
    [ 5758.171161] #PF: error_code(0x0002) - not-present page
    [ 5758.176894] PGD 0 P4D 0
    [ 5758.179721] Oops: 0002 [#1] SMP PTI
    [ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S --------- --- 5.9.0-0.rc3.1.tst.el8.x86_64 #1
    [ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
    [ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
    [ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 24 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
    [ 5758.234478] RSP: 0018:ffffb147442d7af0 EFLAGS: 00010246
    [ 5758.240309] RAX: 0000000000000000 RBX: 000000000014b217 RCX: ffffb14779fd9000
    [ 5758.248281] RDX: 000000000014b217 RSI: ffff9c52f2ab1400 RDI: 000000000014b217
    [ 5758.256246] RBP: ffffe00c51168080 R08: ffffe00c5116fe08 R09: ffff9c52fffd3000
    [ 5758.264208] R10: ffffe00c511537c8 R11: ffff9c52fffd3c90 R12: 0000000000000000
    [ 5758.272172] R13: ffffe00c51170000 R14: ffffe00c51170000 R15: ffffe00c51168040
    [ 5758.280134] FS: 0000000000000000(0000) GS:ffff9c52f2a80000(0000) knlGS:0000000000000000
    [ 5758.289163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 5758.295575] CR2: 0000000000000007 CR3: 0000000022a0e003 CR4: 00000000000606e0
    [ 5758.303538] Call Trace:
    [ 5758.306273] split_huge_page_to_list+0x88b/0x950
    [ 5758.311433] deferred_split_scan+0x1ca/0x310
    [ 5758.316202] do_shrink_slab+0x12c/0x2a0
    [ 5758.320491] shrink_slab+0x20f/0x2c0
    [ 5758.324482] shrink_node+0x240/0x6c0
    [ 5758.328469] balance_pgdat+0x2d1/0x550
    [ 5758.332652] kswapd+0x201/0x3c0
    [ 5758.336157] ? finish_wait+0x80/0x80
    [ 5758.340147] ? balance_pgdat+0x550/0x550
    [ 5758.344525] kthread+0x114/0x130
    [ 5758.348126] ? kthread_park+0x80/0x80
    [ 5758.352214] ret_from_fork+0x22/0x30
    [ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
    [ 5758.412673] CR2: 0000000000000007
    [ 0.000000] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 (mockbuild@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020

    After further digging it's found that the following race condition exists in the
    original implementation,

    CPU1                                          CPU2
    ----                                          ----
    deferred_split_scan()
      split_huge_page(page) /* page isn't compound head */
        split_huge_page_to_list(page, NULL)
          __split_huge_page(page, )
            ClearPageCompound(head)
            /* unlock all subpages except page (not head) */
                                                  add_to_swap(head) /* not THP */
                                                    get_swap_page(head)
                                                    add_to_swap_cache(head, )
                                                      SetPageSwapCache(head)
          if PageSwapCache(head)
            split_swap_cluster(/* swap entry of head */)
              /* Deref sis->cluster_info: NULL accessing! */

    So, in split_huge_page_to_list(), PageSwapCache() is called on the
    already split and unlocked "head", which may have been added to the swap
    cache on another CPU, and split_swap_cluster() may therefore be called
    incorrectly.

    To fix the race, the call to split_swap_cluster() is moved into
    __split_huge_page(), before all subpages are unlocked, so that the
    result of PageSwapCache() is stable.
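
    A rough sketch of where the call ends up (abbreviated from the tail of
    __split_huge_page() in mm/huge_memory.c; not the exact upstream diff):

      remap_page(head);

      /* the subpages, including head, are still locked here */
      if (PageSwapCache(head)) {
              swp_entry_t entry = { .val = page_private(head) };

              split_swap_cluster(entry);
      }

      for (i = 0; i < HPAGE_PMD_NR; i++) {
              struct page *subpage = head + i;

              if (subpage == page)
                      continue;
              unlock_page(subpage);
              put_page(subpage);
      }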

    Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
    Reported-by: Rafael Aquini
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Tested-by: Rafael Aquini
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201009073647.1531083-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Ask the page how many subpages it has instead of assuming it's PMD size.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Acked-by: "Huang, Ying"
    Link: https://lkml.kernel.org/r/20200908195539.25896-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Ask the page what size it is instead of assuming it's PMD size.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • File THPs may now be of arbitrary size, and we can't rely on that size
    after doing the split so remember the number of pages before we start the
    split.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • File THPs may now be of arbitrary order.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The implementation of split_page_owner() prefers a count rather than the
    old order of the page. When we support a variable size THP, we won't
    have the order at this point, but we will have the number of pages.
    So change the interface to what the caller and callee would prefer.
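
    A sketch of the interface change (prototypes abbreviated; intended as an
    illustration rather than the exact diff):

      /* before: the callee was given the compound page order */
      void split_page_owner(struct page *page, unsigned int order);

      /* after: the callee is given the number of subpages */
      void split_page_owner(struct page *page, unsigned int nr);

      /* e.g. in split_page(), mm/page_alloc.c */
      split_page_owner(page, 1 << order);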

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Oct, 2020

1 commit

    Instead of converting adjust_next between bytes and a page count, let's
    just store the virtual address in adjust_next.

    Also, this patch fixes one typo in the comment of vma_adjust_trans_huge().

    [vbabka@suse.cz: changelog tweak]

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mike Kravetz
    Link: http://lkml.kernel.org/r/20200828081031.11306-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

28 Sep, 2020

1 commit

  • Pinned pages shouldn't be write-protected when fork() happens, because
    follow up copy-on-write on these pages could cause the pinned pages to
    be replaced by random newly allocated pages.

    For huge PMDs, we split the huge pmd if pinning is detected, so that
    future handling will be done at the PTE level (with our latest changes,
    each of the small pages will be copied). We can achieve this by letting
    copy_huge_pmd() return -EAGAIN for pinned pages, so that we fall
    through in copy_pmd_range() and finally land in the next
    copy_pte_range() call.
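
    A rough sketch of that fallthrough in copy_pmd_range() (simplified; the
    argument lists are abbreviated since the exact signatures change in this
    series):

      /* sketch: inside the pmd loop of copy_pmd_range(), mm/memory.c */
      if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) ||
          pmd_devmap(*src_pmd)) {
              int err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
                                      addr, dst_vma, src_vma);
              if (err == -ENOMEM)
                      return -ENOMEM;
              if (!err)
                      continue;
              /* -EAGAIN: the PMD was split, fall through to copy PTEs */
      }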

    Huge PUDs are even more special - so far they do not support anonymous
    pages. But they can actually be handled the same way as huge PMDs, even
    if splitting huge PUDs means erasing the PUD entries. It guarantees
    that the follow-up fault-ins will remap the same pages in either the
    parent or the child later.

    This might not be the most efficient way, but it should be easy and
    clean enough. It should be fine, since we're tackling a very rare case
    just to make sure userspace that pinned some THPs will still work even
    without MADV_DONTFORK after fork().

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

20 Sep, 2020

1 commit

  • A migrating transparent huge page has to already be unmapped. Otherwise,
    the page could be modified while it is being copied to a new page and data
    could be lost. The function __split_huge_pmd() checks for a PMD migration
    entry before calling __split_huge_pmd_locked() leading one to think that
    __split_huge_pmd_locked() can handle splitting a migrating PMD.

    However, the code always increments the page->_mapcount and adjusts the
    memory control group accounting assuming the page is mapped.

    Also, if the PMD entry is a migration PMD entry, the call to
    is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
    of migration_entry_to_pfn(pmd_to_swp_entry(pmd)). Fix these problems by
    checking for a PMD migration entry.
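
    The shape of the fix in __split_huge_pmd_locked(), heavily abbreviated
    and intended only as a sketch (not the exact diff):

      pmd_migration = is_pmd_migration_entry(old_pmd);
      if (pmd_migration) {
              swp_entry_t entry = pmd_to_swp_entry(old_pmd);

              page = migration_entry_to_page(entry);
      } else {
              /* pmd_page()/is_huge_zero_pmd() are only valid for a
               * present PMD, not for a migration entry */
              page = pmd_page(old_pmd);
      }
      ...
      /* a migration entry is not mapped: skip the per-subpage _mapcount
       * and memcg accounting for it */
      if (!pmd_migration)
              atomic_inc(&page[i]._mapcount);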

    Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Reviewed-by: Zi Yan
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Cc: Ben Skeggs
    Cc: Shuah Khan
    Cc: [4.14+]
    Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

05 Sep, 2020

2 commits

  • Merge emailed patches from Peter Xu:
    "This is a small series that I picked up from Linus's suggestion to
    simplify cow handling (and also make it more strict) by checking
    against page refcounts rather than mapcounts.

    This makes uffd-wp work again (verified by running upmapsort)"

    Note: this is horrendously bad timing, and making this kind of
    fundamental vm change after -rc3 is not at all how things should work.
    The saving grace is that it really is a nice simplification:

    8 files changed, 29 insertions(+), 120 deletions(-)

    The reason for the bad timing is that it turns out that commit
    17839856fd58 ("gup: document and work around 'COW can break either way'
    issue") broke not just UFFD functionality (as Peter noticed), but Mikulas
    Patocka also reports that it caused issues for strace when running in a
    DAX environment with ext4 on a persistent memory setup.

    And we can't just revert that commit without re-introducing the original
    issue that is a potential security hole, so making COW stricter (and in
    the process much simpler) is a step to then undoing the forced COW that
    broke other uses.

    Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

    * emailed patches from Peter Xu :
    mm: Add PGREUSE counter
    mm/gup: Remove enfornced COW mechanism
    mm/ksm: Remove reuse_ksm_page()
    mm: do_wp_page() simplification

    Linus Torvalds
     
  • With the more strict (but greatly simplified) page reuse logic in
    do_wp_page(), we can safely go back to the world where cow is not
    enforced with writes.

    This essentially reverts commit 17839856fd58 ("gup: document and work
    around 'COW can break either way' issue"). There are some context
    differences due to some changes later on around it:

    2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
    376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

    Some lines moved back and forth with those, but this revert patch should
    have stripped out and covered all the enforced cow bits anyways.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

04 Sep, 2020

1 commit

  • When a huge page is split into normal pages, part of the head page flags
    are transferred to the tail pages. However, the PG_arch_* flags are not
    part of the preserved set.

    PG_arch_2 is used by the arm64 MTE support to mark pages that have valid
    tags. The absence of such flag would cause the arm64 set_pte_at() to
    clear the tags in order to avoid stale tags exposed to user or the
    swapping out hooks to ignore the tags. Not preserving PG_arch_2 on huge
    page splitting leads to tag corruption in the tail pages.

    Preserve the newly added PG_arch_2 flag in __split_huge_page_tail().
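
    The preservation itself is a one-flag addition to the set of flags
    copied from head to tail, roughly (abbreviated from
    __split_huge_page_tail(), mm/huge_memory.c):

      page_tail->flags |= (head->flags &
                      ((1L << PG_referenced) |
                       (1L << PG_swapbacked) |
                       (1L << PG_swapcache) |
                       (1L << PG_mlocked) |
                       (1L << PG_uptodate) |
                       (1L << PG_active) |
                       (1L << PG_workingset) |
                       (1L << PG_locked) |
                       (1L << PG_unevictable) |
      #ifdef CONFIG_64BIT
                       (1L << PG_arch_2) |     /* e.g. arm64 MTE tag state */
      #endif
                       (1L << PG_dirty)));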

    Signed-off-by: Catalin Marinas
    Cc: Andrew Morton

    Catalin Marinas
     

13 Aug, 2020

2 commits

    Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
    anon-THP"), the CoW page fault path for THP has been rewritten and
    debug_cow is not used anymore. So, just remove it.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Cc: Kirill A. Shutemov
    Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    In the current implementation, a newly created or swapped-in anonymous
    page starts out on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list
    are demoted to the inactive list. Hence, pages on the active list aren't
    protected at all.

    Following is an example of this situation.

    Assume that there are 50 hot pages on the active list. Numbers denote
    the number of pages on the active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. Like the file LRU, newly created or
    swapped-in anonymous pages will be inserted into the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the
    size of the inactive list but less than the size of the total
    (active + inactive) list. To solve this potential issue, a following
    patch will apply workingset detection similar to the one that's already
    applied to the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

3 commits

  • Rationale:
    Reduces attack surface on kernel devs opening the links for MITM
    as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
      For each file:
        If not .svg:
          For each line:
            If doesn't contain `xmlns`:
              For each link, `http://[^# ]*(?:\w|/)`:
                If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
                  If both the HTTP and HTTPS versions
                  return 200 OK and serve the same content:
                    Replace HTTP with HTTPS.

    [akpm@linux-foundation.org: fix amd.com URL, per Vlastimil]

    Signed-off-by: Alexander A. Klimov
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200713164345.36088-1-grandmaster@al2klimov.de
    Signed-off-by: Linus Torvalds

    Alexander A. Klimov
     
  • After previous cleanup, extent is the minimal step for both source and
    destination. This means when extent is HPAGE_PMD_SIZE or PMD_SIZE,
    old_addr and new_addr are properly aligned too.

    Since these two functions are only invoked in move_page_tables, it is safe
    to remove the check now.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Sean Christopherson
    Cc: Thomas Hellstrom
    Cc: Thomas Hellstrom (VMware)
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/20200708095028.41706-4-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "mm/mremap: cleanup move_page_tables() a little", v5.

    move_page_tables() tries to move page table by PMD or PTE.

    The root reason is that if it tries to move a PMD, both the old and new
    ranges should be PMD aligned. But the current code calculates the old
    and new ranges separately, which leads to some redundant checks and
    calculations.

    This cleanup tries to consolidate the range check in one place to reduce
    some extra range handling.

    This patch (of 3):

    old_end is passed to these two functions to check whether there is enough
    space to do the move, while this check is done before invoking these
    functions.

    These two functions would only be invoked when extent meets the
    requirement, and there is one check before invoking them:

      if (extent > old_end - old_addr)
              extent = old_end - old_addr;

    This implies (old_end - old_addr) won't fail the check in these two
    functions.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Cc: Thomas Hellstrom (VMware)
    Cc: Anshuman Khandual
    Cc: Sean Christopherson
    Cc: Wei Yang
    Cc: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Thomas Hellstrom
    Link: http://lkml.kernel.org/r/20200710092835.56368-1-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200710092835.56368-2-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200708095028.41706-1-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200708095028.41706-2-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

06 Jun, 2020

1 commit

  • Pull powerpc updates from Michael Ellerman:

    - Support for userspace to send requests directly to the on-chip GZIP
    accelerator on Power9.

    - Rework of our lockless page table walking (__find_linux_pte()) to
    make it safe against parallel page table manipulations without
    relying on an IPI for serialisation.

    - A series of fixes & enhancements to make our machine check handling
    more robust.

    - Lots of plumbing to add support for "prefixed" (64-bit) instructions
    on Power10.

    - Support for using huge pages for the linear mapping on 8xx (32-bit).

    - Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
    driver.

    - Removal of some obsolete 40x platforms and associated cruft.

    - Initial support for booting on Power10.

    - Lots of other small features, cleanups & fixes.

    Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
    Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
    Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
    JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
    Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
    R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
    Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
    Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
    Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
    Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
    Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
    Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
    Wolfram Sang, Xiongfeng Wang.

    * tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
    powerpc/pseries: Make vio and ibmebus initcalls pseries specific
    cxl: Remove dead Kconfig options
    powerpc: Add POWER10 architected mode
    powerpc/dt_cpu_ftrs: Add MMA feature
    powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
    powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
    powerpc: Add support for ISA v3.1
    powerpc: Add new HWCAP bits
    powerpc/64s: Don't set FSCR bits in INIT_THREAD
    powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
    powerpc/64s: Don't let DT CPU features set FSCR_DSCR
    powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
    powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
    powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
    powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
    powerpc/module_64: Consolidate ftrace code
    powerpc/32: Disable KASAN with pages bigger than 16k
    powerpc/uaccess: Don't set KUEP by default on book3s/32
    powerpc/uaccess: Don't set KUAP by default on book3s/32
    powerpc/8xx: Reduce time spent in allow_user_access() and friends
    ...

    Linus Torvalds
     

05 Jun, 2020

1 commit

  • Fixes coccicheck warnings:

    mm/zbud.c:246:1-20: WARNING: Assignment of 0/1 to bool variable
    mm/mremap.c:777:2-8: WARNING: Assignment of 0/1 to bool variable
    mm/huge_memory.c:525:9-10: WARNING: return of 0/1 in function 'is_transparent_hugepage' with return type bool

    Reported-by: Hulk Robot
    Signed-off-by: Zou Wei
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1586835930-47076-1-git-send-email-zou_wei@huawei.com
    Signed-off-by: Linus Torvalds

    Zou Wei
     

04 Jun, 2020

9 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
    Since commit 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
    arrival"), a THP no longer stays in a pagevec. So the optimization made
    by commit d965432234db ("thp: increase split_huge_page() success rate"),
    which tried to unpin munlocked THPs from the pagevec by draining it,
    doesn't make sense anymore.

    Draining lru cache before isolating THP in mlock path is also unnecessary.
    b676b293fb48 ("mm, thp: fix mapped pages avoiding unevictable list on
    mlock") added it and 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped
    file huge pages") accidentally carried it over after the above
    optimization went in.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/1585946493-7531-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.
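
    For the anon-THP fault path this boils down to something like the
    following (a simplified sketch of do_huge_pmd_anonymous_page() and its
    helper, not the exact diff):

      /* charge right after allocation, before any "commit" step */
      page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
      if (unlikely(!page)) {
              count_vm_event(THP_FAULT_FALLBACK);
              return VM_FAULT_FALLBACK;
      }
      if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
              put_page(page);
              count_vm_event(THP_FAULT_FALLBACK);
              return VM_FAULT_FALLBACK;
      }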

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a
    small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
    native NR_ANON_THPS accounting sites.

    [hannes@cmpxchg.org: fixes]
    Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Tested-by: Naresh Kamboju
    Reviewed-by: Joonsoo Kim
    Acked-by: Randy Dunlap [build-tested]
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit order (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and before anybody could split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently we have different copy-on-write semantics for anon- and
    file-THP. For anon-THP we try to allocate huge page on the write fault,
    but on file-THP we split PMD and allocate 4k page.

    Arguably, the file-THP semantics are more desirable: we don't
    necessarily want to unshare the full PMD range from the parent on the
    first access. This is the primary reason THP is unusable for some
    workloads, like Redis.

    The original THP refcounting didn't allow PTE-mapped compound pages, so
    we had no option but to allocate a huge page on CoW (with fallback to
    512 4k pages).

    The current refcounting doesn't have such limitations and we can cut a lot
    of complex code out of fault path.

    khugepaged is now able to recover THP from such ranges if the
    configuration allows.
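
    The resulting write-fault logic is roughly the following (a simplified
    sketch of the new do_huge_pmd_wp_page(), with page locking and other
    details omitted):

      /* reuse the THP if we are the only mapper, otherwise split the
       * PMD and let the fault be retried on 4k PTEs */
      if (reuse_swap_page(page, NULL)) {
              pmd_t entry = pmd_mkyoung(orig_pmd);

              entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
              if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
                      update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
              return VM_FAULT_WRITE;
      }

      __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
      return VM_FAULT_FALLBACK;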

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-8-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Write-protect anon page faults require an accurate mapcount to decide
    whether to break the COW or not. This is implemented in the THP path with
    reuse_swap_page() ->
    page_trans_huge_map_swapcount()/page_trans_huge_mapcount().

    If the COW triggers while the other processes sharing the page are
    under a huge pmd split, to do an accurate reading, we must ensure the
    mapcount isn't computed while it's being transferred from the head
    page to the tail pages.

    reuse_swap_cache() already runs serialized by the page lock, so it's
    enough to add the page lock around __split_huge_pmd_locked too, in
    order to add the missing serialization.

    Note: the commit in "Fixes" is just to facilitate the backporting,
    because the code before such commit didn't try to do an accurate THP
    mapcount calculation and it instead used the page_count() to decide if
    to COW or not. Both the page_count and the pin_count are THP-wide
    refcounts, so they're inaccurate if used in
    reuse_swap_page(). Reverting such commit (besides the unrelated fix to
    the local anon_vma assignment) would have also opened the window for
    memory corruption side effects to certain workloads as documented in
    such commit header.

    Signed-off-by: Andrea Arcangeli
    Suggested-by: Jann Horn
    Reported-by: Jann Horn
    Acked-by: Kirill A. Shutemov
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Jun, 2020

1 commit

  • Doing a "get_user_pages()" on a copy-on-write page for reading can be
    ambiguous: the page can be COW'ed at any time afterwards, and the
    direction of a COW event isn't defined.

    Yes, whoever writes to it will generally do the COW, but if the thread
    that did the get_user_pages() unmapped the page before the write (and
    that could happen due to memory pressure in addition to any outright
    action), the writer could also just take over the old page instead.

    End result: the get_user_pages() call might result in a page pointer
    that is no longer associated with the original VM, and is associated
    with - and controlled by - another VM having taken it over instead.

    So when doing a get_user_pages() on a COW mapping, the only really safe
    thing to do would be to break the COW when getting the page, even when
    only getting it for reading.

    At the same time, some users simply don't even care.

    For example, the perf code wants to look up the page not because it
    cares about the page, but because the code simply wants to look up the
    physical address of the access for informational purposes, and doesn't
    really care about races when a page might be unmapped and remapped
    elsewhere.

    This adds logic to force a COW event by setting FOLL_WRITE on any
    copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
    pointer as a result.

    The current semantics end up being:

    - __get_user_pages_fast(): no change. If you don't ask for a write,
    you won't break COW. You'd better know what you're doing.

    - get_user_pages_fast(): the fast-case "look it up in the page tables
    without anything getting mmap_sem" now refuses to follow a read-only
    page, since it might need COW breaking. Which happens in the slow
    path - the fast path doesn't know if the memory might be COW or not.

    - get_user_pages() (including the slow-path fallback for gup_fast()):
    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
    very similar semantics to FOLL_FORCE.

    If it turns out that we want finer granularity (ie "only break COW when
    it might actually matter" - things like the zero page are special and
    don't need to be broken) we might need to push these semantics deeper
    into the lookup fault path. So if people care enough, it's possible
    that we might end up adding a new internal FOLL_BREAK_COW flag to go
    with the internal FOLL_COW flag we already have for tracking "I had a
    COW".

    Alternatively, if it turns out that different callers might want to
    explicitly control the forced COW break behavior, we might even want to
    make such a flag visible to the users of get_user_pages() instead of
    using the above default semantics.

    But for now, this is mostly commentary on the issue (this commit message
    being a lot bigger than the patch, and that patch in turn is almost all
    comments), with that minimal "enable COW breaking early" logic using the
    existing FOLL_WRITE behavior.
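
    As a sketch of that logic (paraphrasing the idea rather than quoting the
    exact helper added to mm/gup.c, with is_cow_mapping() spelled out
    inline):

      /* a private mapping whose pages may still be shared via COW */
      static inline bool is_cow_mapping(vm_flags_t flags)
      {
              return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
      }

      /* force an early COW break whenever a page pointer will be taken */
      if (is_cow_mapping(vma->vm_flags) &&
          (foll_flags & (FOLL_GET | FOLL_PIN)))
              foll_flags |= FOLL_WRITE;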

    [ It might be worth noting that we've always had this ambiguity, and it
    could arguably be seen as a user-space issue.

    You only get private COW mappings that could break either way in
    situations where user space is doing cooperative things (ie fork()
    before an execve() etc), but it _is_ surprising and very subtle, and
    fork() is supposed to give you independent address spaces.

    So let's treat this as a kernel issue and make the semantics of
    get_user_pages() easier to understand. Note that obviously a true
    shared mapping will still get a page that can change under us, so this
    does _not_ mean that get_user_pages() somehow returns any "stable"
    page ]

    Reported-by: Jann Horn
    Tested-by: Christoph Hellwig
    Acked-by: Oleg Nesterov
    Acked-by: Kirill Shutemov
    Acked-by: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Apr, 2020

5 commits

    For both swap and page migration, we use bit 2 of the entry to identify
    whether the entry is uffd write-protected. It plays a similar role to
    the existing soft dirty bit in swap entries, but only for keeping the
    uffd-wp tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry to the PTE bit we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared, otherwise even with the
    _PAGE_UFFD_WP bit we can't trap it at all.

    In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
    That can lead to data mismatch if the page that we are going to write
    protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
    also applies/removes the uffd-wp bit even for the swap entries.
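
    A rough sketch of the swap-entry handling in change_pte_range()
    (mm/mprotect.c); the pte_swp_*uffd_wp() accessors are the helpers this
    series introduces, the rest is abbreviated:

      /* oldpte is a non-present (swap/migration) entry */
      swp_entry_t entry = pte_to_swp_entry(oldpte);
      pte_t newpte = oldpte;

      if (is_write_migration_entry(entry)) {
              make_migration_entry_read(&entry);
              newpte = swp_entry_to_pte(entry);
              if (pte_swp_soft_dirty(oldpte))
                      newpte = pte_swp_mksoft_dirty(newpte);
      }
      if (uffd_wp)
              newpte = pte_swp_mkuffd_wp(newpte);
      else if (uffd_wp_resolve)
              newpte = pte_swp_clear_uffd_wp(newpte);

      if (!pte_same(oldpte, newpte))
              set_pte_at(vma->vm_mm, addr, pte, newpte);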

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • UFFD_EVENT_FORK support for uffd-wp should be already there, except that
    we should clean the uffd-wp bit if uffd fork event is not enabled. Detect
    that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
    by VM_UFFD_WP. Do this for both small PTEs and huge PMDs.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp and make sure the two new flags
    are exclusively used. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.

    Do this change for both PTEs and huge PMDs. Then we can start to identify
    which PTE/PMD is write protected by general (e.g., COW or soft dirty
    tracking), and which is for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    Now that we have _PAGE_UFFD_WP, a special case is when a page is both
    protected by the general COW logic and also userfault-wp. Here the
    userfault-wp will have higher priority and will be handled first. Only
    after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
    the general COW. These are the steps on what will happen with such a
    page:

    1. CPU accesses write protected shared page (so both protected by
    general COW and uffd-wp), blocked by uffd-wp first because in
    do_wp_page we'll handle uffd-wp first, so it has higher priority
    than general COW.

    2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
    to remove the uffd-wp bit upon the PTE/PMD. However here we
    still keep the write bit cleared. Notify the blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry, during retry it'll notice it was not with the uffd-wp bit
    this time but it is still write protected by general COW, then
    it'll go though the COW path in the fault handler, copy the page,
    apply write bit where necessary, and retry again.

    4. The CPU will be able to access this page with write bit set.
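
    At the PTE level the two flags translate into roughly the following
    (a sketch of change_pte_range() for a present PTE; the huge PMD path in
    change_huge_pmd() mirrors it with the pmd_*uffd_wp() helpers):

      if (uffd_wp) {
              /* userfaultfd write-protect: set the marker, drop write */
              ptent = pte_wrprotect(ptent);
              ptent = pte_mkuffd_wp(ptent);
      } else if (uffd_wp_resolve) {
              /* unprotect: clear the marker; the write bit is recovered
               * by the fault path (e.g. COW) as described above */
              ptent = pte_clear_uffd_wp(ptent);
      }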

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
    change_protection() was used by either the NUMA or the mprotect() code;
    there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protect() and all these helpers to
    replace these parameters. Then we can avoid passing multiple parameters
    multiple times along the way.

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.
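
    A sketch of the consolidated interface (flag names are the ones this
    series adds; the prototype is abbreviated):

      #define MM_CP_DIRTY_ACCT        (1UL << 0)
      #define MM_CP_PROT_NUMA         (1UL << 1)
      /* later in the series: MM_CP_UFFD_WP, MM_CP_UFFD_WP_RESOLVE */

      unsigned long change_protection(struct vm_area_struct *vma,
                      unsigned long start, unsigned long end,
                      pgprot_t newprot, unsigned long cp_flags);

      /* callers pass e.g. MM_CP_PROT_NUMA instead of two booleans */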

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)