30 Dec, 2020

1 commit

  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code to check the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
    secondary MMU's page table before the check. More specifically, this
    affects secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start(), like KVM.

    However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
    the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page
    table access check before trying to unmap the page. So, at worst,
    reclaim will miss accesses in a very short window if we remove the page
    table access check from the unmapping code.

    There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
    reclaim. In memcg reclaim, page_referenced() only accounts accesses from
    processes in the same memcg as the target page, but the unmapping code
    considers accesses from all processes, which decreases the effectiveness
    of memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.
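
    For reference, this is roughly the shape of the access check that the
    unmapping code used to perform for !(TTU_IGNORE_ACCESS); treat it as an
    illustrative sketch of the try_to_unmap_one() path (inside the
    page_vma_mapped_walk() loop), not the exact upstream diff:

      /* sketch: old behaviour in try_to_unmap_one(), mm/rmap.c */
      if (!(flags & TTU_IGNORE_ACCESS)) {
              /* page was referenced through this mapping recently: give up */
              if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                      ret = false;
                      page_vma_mapped_walk_done(&pvmw);
                      break;
              }
      }

    Always assuming TTU_IGNORE_ACCESS means this block (and the flag itself)
    can simply be dropped.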

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     

23 Nov, 2020

1 commit

  • Alexander reported a syzkaller / KASAN finding on s390, see below for
    complete output.

    In do_huge_pmd_anonymous_page(), the pre-allocated pagetable will be
    freed in some cases. In the case of userfaultfd_missing(), this will
    happen after calling handle_userfault(), which might have released the
    mmap_lock. Therefore, the following pte_free(vma->vm_mm, pgtable) will
    access an unstable vma->vm_mm, which could have been freed or re-used
    already.

    For all architectures other than s390 this has no negative impact,
    because pte_free() simply frees the page and ignores the passed-in mm.
    The SPARC32 implementation would also access mm->page_table_lock in
    pte_free(), but there is no THP support on SPARC32, so the buggy code
    path is not used there.

    For s390, the mm->context.pgtable_list is being used to maintain the 2K
    pagetable fragments, and operating on an already freed or even re-used
    mm could result in various more or less subtle bugs due to list /
    pagetable corruption.

    Fix this by calling pte_free() before handle_userfault(), similar to how
    it is already done in __do_huge_pmd_anonymous_page() for the WRITE /
    non-huge_zero_page case.

    Commit 6b251fc96cf2c ("userfaultfd: call handle_userfault() for
    userfaultfd_missing() faults") actually introduced both the
    do_huge_pmd_anonymous_page() and the __do_huge_pmd_anonymous_page()
    changes w.r.t. calling handle_userfault(), but only the latter put the
    pte_free() before calling handle_userfault().
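
    A minimal sketch of the reordering, simplified from the
    do_huge_pmd_anonymous_page() huge-zero-page branch (surrounding error
    handling and locking details abbreviated):

      if (userfaultfd_missing(vma)) {
              spin_unlock(vmf->ptl);
              /* free the preallocated pagetable while vma->vm_mm is still
               * guaranteed stable, i.e. before handle_userfault() may
               * release mmap_lock */
              pte_free(vma->vm_mm, pgtable);
              ret = handle_userfault(vmf, VM_UFFD_MISSING);
              VM_BUG_ON(ret & VM_FAULT_FALLBACK);
      }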

    BUG: KASAN: use-after-free in do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    Read of size 8 at addr 00000000962d6988 by task syz-executor.0/9334

    CPU: 1 PID: 9334 Comm: syz-executor.0 Not tainted 5.10.0-rc1-syzkaller-07083-g4c9720875573 #0
    Hardware name: IBM 3906 M04 701 (KVM/Linux)
    Call Trace:
    do_huge_pmd_anonymous_page+0xcda/0xd90 mm/huge_memory.c:744
    create_huge_pmd mm/memory.c:4256 [inline]
    __handle_mm_fault+0xe6e/0x1068 mm/memory.c:4480
    handle_mm_fault+0x288/0x748 mm/memory.c:4607
    do_exception+0x394/0xae0 arch/s390/mm/fault.c:479
    do_dat_exception+0x34/0x80 arch/s390/mm/fault.c:567
    pgm_check_handler+0x1da/0x22c arch/s390/kernel/entry.S:706
    copy_from_user_mvcos arch/s390/lib/uaccess.c:111 [inline]
    raw_copy_from_user+0x3a/0x88 arch/s390/lib/uaccess.c:174
    _copy_from_user+0x48/0xa8 lib/usercopy.c:16
    copy_from_user include/linux/uaccess.h:192 [inline]
    __do_sys_sigaltstack kernel/signal.c:4064 [inline]
    __s390x_sys_sigaltstack+0xc8/0x240 kernel/signal.c:4060
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Allocated by task 9334:
    slab_alloc_node mm/slub.c:2891 [inline]
    slab_alloc mm/slub.c:2899 [inline]
    kmem_cache_alloc+0x118/0x348 mm/slub.c:2904
    vm_area_dup+0x9c/0x2b8 kernel/fork.c:356
    __split_vma+0xba/0x560 mm/mmap.c:2742
    split_vma+0xca/0x108 mm/mmap.c:2800
    mlock_fixup+0x4ae/0x600 mm/mlock.c:550
    apply_vma_lock_flags+0x2c6/0x398 mm/mlock.c:619
    do_mlock+0x1aa/0x718 mm/mlock.c:711
    __do_sys_mlock2 mm/mlock.c:738 [inline]
    __s390x_sys_mlock2+0x86/0xa8 mm/mlock.c:728
    system_call+0xe0/0x28c arch/s390/kernel/entry.S:415

    Freed by task 9333:
    slab_free mm/slub.c:3142 [inline]
    kmem_cache_free+0x7c/0x4b8 mm/slub.c:3158
    __vma_adjust+0x7b2/0x2508 mm/mmap.c:960
    vma_merge+0x87e/0xce0 mm/mmap.c:1209
    userfaultfd_release+0x412/0x6b8 fs/userfaultfd.c:868
    __fput+0x22c/0x7a8 fs/file_table.c:281
    task_work_run+0x200/0x320 kernel/task_work.c:151
    tracehook_notify_resume include/linux/tracehook.h:188 [inline]
    do_notify_resume+0x100/0x148 arch/s390/kernel/signal.c:538
    system_call+0xe6/0x28c arch/s390/kernel/entry.S:416

    The buggy address belongs to the object at 00000000962d6948 which belongs to the cache vm_area_struct of size 200
    The buggy address is located 64 bytes inside of 200-byte region [00000000962d6948, 00000000962d6a10)
    The buggy address belongs to the page: page:00000000313a09fe refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x962d6 flags: 0x3ffff00000000200(slab)
    raw: 3ffff00000000200 000040000257e080 0000000c0000000c 000000008020ba00
    raw: 0000000000000000 000f001e00000000 ffffffff00000001 0000000096959501
    page dumped because: kasan: bad access detected
    page->mem_cgroup:0000000096959501

    Memory state around the buggy address:
    00000000962d6880: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    00000000962d6900: 00 fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb
    >00000000962d6980: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
    ^
    00000000962d6a00: fb fb fc fc fc fc fc fc fc fc 00 00 00 00 00 00
    00000000962d6a80: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
    ==================================================================

    Fixes: 6b251fc96cf2c ("userfaultfd: call handle_userfault() for userfaultfd_missing() faults")
    Reported-by: Alexander Egorenkov
    Signed-off-by: Gerald Schaefer
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: [4.3+]
    Link: https://lkml.kernel.org/r/20201110190329.11920-1-gerald.schaefer@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     

17 Oct, 2020

6 commits

  • It is reported that the following bug is triggered if the HDD is used as
    swap device,

    [ 5758.157556] BUG: kernel NULL pointer dereference, address: 0000000000000007
    [ 5758.165331] #PF: supervisor write access in kernel mode
    [ 5758.171161] #PF: error_code(0x0002) - not-present page
    [ 5758.176894] PGD 0 P4D 0
    [ 5758.179721] Oops: 0002 [#1] SMP PTI
    [ 5758.183614] CPU: 10 PID: 316 Comm: kswapd1 Kdump: loaded Tainted: G S --------- --- 5.9.0-0.rc3.1.tst.el8.x86_64 #1
    [ 5758.196717] Hardware name: Intel Corporation S2600CP/S2600CP, BIOS SE5C600.86B.02.01.0002.082220131453 08/22/2013
    [ 5758.208176] RIP: 0010:split_swap_cluster+0x47/0x60
    [ 5758.213522] Code: c1 e3 06 48 c1 eb 0f 48 8d 1c d8 48 89 df e8 d0 20 6a 00 80 63 07 fb 48 85 db 74 16 48 89 df c6 07 00 66 66 66 90 31 c0 5b c3 24 25 07 00 00 00 fb 31 c0 5b c3 b8 f0 ff ff ff 5b c3 66 0f 1f
    [ 5758.234478] RSP: 0018:ffffb147442d7af0 EFLAGS: 00010246
    [ 5758.240309] RAX: 0000000000000000 RBX: 000000000014b217 RCX: ffffb14779fd9000
    [ 5758.248281] RDX: 000000000014b217 RSI: ffff9c52f2ab1400 RDI: 000000000014b217
    [ 5758.256246] RBP: ffffe00c51168080 R08: ffffe00c5116fe08 R09: ffff9c52fffd3000
    [ 5758.264208] R10: ffffe00c511537c8 R11: ffff9c52fffd3c90 R12: 0000000000000000
    [ 5758.272172] R13: ffffe00c51170000 R14: ffffe00c51170000 R15: ffffe00c51168040
    [ 5758.280134] FS: 0000000000000000(0000) GS:ffff9c52f2a80000(0000) knlGS:0000000000000000
    [ 5758.289163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 5758.295575] CR2: 0000000000000007 CR3: 0000000022a0e003 CR4: 00000000000606e0
    [ 5758.303538] Call Trace:
    [ 5758.306273] split_huge_page_to_list+0x88b/0x950
    [ 5758.311433] deferred_split_scan+0x1ca/0x310
    [ 5758.316202] do_shrink_slab+0x12c/0x2a0
    [ 5758.320491] shrink_slab+0x20f/0x2c0
    [ 5758.324482] shrink_node+0x240/0x6c0
    [ 5758.328469] balance_pgdat+0x2d1/0x550
    [ 5758.332652] kswapd+0x201/0x3c0
    [ 5758.336157] ? finish_wait+0x80/0x80
    [ 5758.340147] ? balance_pgdat+0x550/0x550
    [ 5758.344525] kthread+0x114/0x130
    [ 5758.348126] ? kthread_park+0x80/0x80
    [ 5758.352214] ret_from_fork+0x22/0x30
    [ 5758.356203] Modules linked in: fuse zram rfkill sunrpc intel_rapl_msr intel_rapl_common sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp mgag200 iTCO_wdt crct10dif_pclmul iTCO_vendor_support drm_kms_helper crc32_pclmul ghash_clmulni_intel syscopyarea sysfillrect sysimgblt fb_sys_fops cec rapl joydev intel_cstate ipmi_si ipmi_devintf drm intel_uncore i2c_i801 ipmi_msghandler pcspkr lpc_ich mei_me i2c_smbus mei ioatdma ip_tables xfs libcrc32c sr_mod sd_mod cdrom t10_pi sg igb ahci libahci i2c_algo_bit crc32c_intel libata dca wmi dm_mirror dm_region_hash dm_log dm_mod
    [ 5758.412673] CR2: 0000000000000007
    [ 0.000000] Linux version 5.9.0-0.rc3.1.tst.el8.x86_64 (mockbuild@x86-vm-15.build.eng.bos.redhat.com) (gcc (GCC) 8.3.1 20191121 (Red Hat 8.3.1-5), GNU ld version 2.30-79.el8) #1 SMP Wed Sep 9 16:03:34 EDT 2020

    After further digging it's found that the following race condition exists in the
    original implementation,

    CPU1                                          CPU2
    ----                                          ----
    deferred_split_scan()
      split_huge_page(page) /* page isn't compound head */
        split_huge_page_to_list(page, NULL)
          __split_huge_page(page, )
            ClearPageCompound(head)
            /* unlock all subpages except page (not head) */
                                                  add_to_swap(head) /* not THP */
                                                    get_swap_page(head)
                                                    add_to_swap_cache(head, )
                                                      SetPageSwapCache(head)
          if PageSwapCache(head)
            split_swap_cluster(/* swap entry of head */)
              /* Deref sis->cluster_info: NULL accessing! */

    So, in split_huge_page_to_list(), PageSwapCache() is called on the
    already split and unlocked "head", which may have been added to the swap
    cache on another CPU, and split_swap_cluster() may therefore be called
    incorrectly.

    To fix the race, the call to split_swap_cluster() is moved into
    __split_huge_page(), before all subpages are unlocked, so that the
    result of PageSwapCache() is stable.
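
    A rough sketch of where the call ends up (abbreviated from the tail of
    __split_huge_page() in mm/huge_memory.c; not the exact upstream diff):

      remap_page(head);

      /* the subpages, including head, are still locked here */
      if (PageSwapCache(head)) {
              swp_entry_t entry = { .val = page_private(head) };

              split_swap_cluster(entry);
      }

      for (i = 0; i < HPAGE_PMD_NR; i++) {
              struct page *subpage = head + i;

              if (subpage == page)
                      continue;
              unlock_page(subpage);
              put_page(subpage);
      }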

    Fixes: 59807685a7e77 ("mm, THP, swap: support splitting THP for THP swap out")
    Reported-by: Rafael Aquini
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Tested-by: Rafael Aquini
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Link: https://lkml.kernel.org/r/20201009073647.1531083-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Ask the page how many subpages it has instead of assuming it's PMD size.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Acked-by: "Huang, Ying"
    Link: https://lkml.kernel.org/r/20200908195539.25896-8-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Ask the page what size it is instead of assuming it's PMD size.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-7-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • File THPs may now be of arbitrary size, and we can't rely on that size
    after doing the split so remember the number of pages before we start the
    split.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • File THPs may now be of arbitrary order.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The implementation of split_page_owner() prefers a count rather than the
    old order of the page. When we support a variable size THP, we won't
    have the order at this point, but we will have the number of pages.
    So change the interface to what the caller and callee would prefer.
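
    A sketch of the interface change (prototypes abbreviated; intended as an
    illustration rather than the exact diff):

      /* before: the callee was given the compound page order */
      void split_page_owner(struct page *page, unsigned int order);

      /* after: the callee is given the number of subpages */
      void split_page_owner(struct page *page, unsigned int nr);

      /* e.g. in split_page(), mm/page_alloc.c */
      split_page_owner(page, 1 << order);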

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

14 Oct, 2020

1 commit

    Instead of converting adjust_next between bytes and a page count, let's
    just store the virtual address in adjust_next.

    Also, this patch fixes one typo in the comment of vma_adjust_trans_huge().

    [vbabka@suse.cz: changelog tweak]

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Mike Kravetz
    Link: http://lkml.kernel.org/r/20200828081031.11306-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

13 Oct, 2020

1 commit

  • Pull arm64 updates from Will Deacon:
    "There's quite a lot of code here, but much of it is due to the
    addition of a new PMU driver as well as some arm64-specific selftests
    which is an area where we've traditionally been lagging a bit.

    In terms of exciting features, this includes support for the Memory
    Tagging Extension which narrowly missed 5.9, hopefully allowing
    userspace to run with use-after-free detection in production on CPUs
    that support it. Work is ongoing to integrate the feature with KASAN
    for 5.11.

    Another change that I'm excited about (assuming they get the hardware
    right) is preparing the ASID allocator for sharing the CPU page-table
    with the SMMU. Those changes will also come in via Joerg with the
    IOMMU pull.

    We do stray outside of our usual directories in a few places, mostly
    due to core changes required by MTE. Although much of this has been
    Acked, there were a couple of places where we unfortunately didn't get
    any review feedback.

    Other than that, we ran into a handful of minor conflicts in -next,
    but nothing that should pose any issues.

    Summary:

    - Userspace support for the Memory Tagging Extension introduced by
    Armv8.5. Kernel support (via KASAN) is likely to follow in 5.11.

    - Selftests for MTE, Pointer Authentication and FPSIMD/SVE context
    switching.

    - Fix and subsequent rewrite of our Spectre mitigations, including
    the addition of support for PR_SPEC_DISABLE_NOEXEC.

    - Support for the Armv8.3 Pointer Authentication enhancements.

    - Support for ASID pinning, which is required when sharing
    page-tables with the SMMU.

    - MM updates, including treating flush_tlb_fix_spurious_fault() as a
    no-op.

    - Perf/PMU driver updates, including addition of the ARM CMN PMU
    driver and also support to handle CPU PMU IRQs as NMIs.

    - Allow prefetchable PCI BARs to be exposed to userspace using normal
    non-cacheable mappings.

    - Implementation of ARCH_STACKWALK for unwinding.

    - Improve reporting of unexpected kernel traps due to BPF JIT
    failure.

    - Improve robustness of user-visible HWCAP strings and their
    corresponding numerical constants.

    - Removal of TEXT_OFFSET.

    - Removal of some unused functions, parameters and prototypes.

    - Removal of MPIDR-based topology detection in favour of firmware
    description.

    - Cleanups to handling of SVE and FPSIMD register state in
    preparation for potential future optimisation of handling across
    syscalls.

    - Cleanups to the SDEI driver in preparation for support in KVM.

    - Miscellaneous cleanups and refactoring work"

    * tag 'arm64-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux: (148 commits)
    Revert "arm64: initialize per-cpu offsets earlier"
    arm64: random: Remove no longer needed prototypes
    arm64: initialize per-cpu offsets earlier
    kselftest/arm64: Check mte tagged user address in kernel
    kselftest/arm64: Verify KSM page merge for MTE pages
    kselftest/arm64: Verify all different mmap MTE options
    kselftest/arm64: Check forked child mte memory accessibility
    kselftest/arm64: Verify mte tag inclusion via prctl
    kselftest/arm64: Add utilities and a test to validate mte memory
    perf: arm-cmn: Fix conversion specifiers for node type
    perf: arm-cmn: Fix unsigned comparison to less than zero
    arm64: dbm: Invalidate local TLB when setting TCR_EL1.HD
    arm64: mm: Make flush_tlb_fix_spurious_fault() a no-op
    arm64: Add support for PR_SPEC_DISABLE_NOEXEC prctl() option
    arm64: Pull in task_stack_page() to Spectre-v4 mitigation code
    KVM: arm64: Allow patching EL2 vectors even with KASLR is not enabled
    arm64: Get rid of arm64_ssbd_state
    KVM: arm64: Convert ARCH_WORKAROUND_2 to arm64_get_spectre_v4_state()
    KVM: arm64: Get rid of kvm_arm_have_ssbd()
    KVM: arm64: Simplify handling of ARCH_WORKAROUND_2
    ...

    Linus Torvalds
     

28 Sep, 2020

1 commit

  • Pinned pages shouldn't be write-protected when fork() happens, because
    follow up copy-on-write on these pages could cause the pinned pages to
    be replaced by random newly allocated pages.

    For huge PMDs, we split the huge pmd if pinning is detected, so that
    future handling will be done at the PTE level (with our latest changes,
    each of the small pages will be copied). We can achieve this by letting
    copy_huge_pmd() return -EAGAIN for pinned pages, so that we fall
    through in copy_pmd_range() and finally land in the next
    copy_pte_range() call.
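
    A rough sketch of that fallthrough in copy_pmd_range() (simplified; the
    argument lists are abbreviated since the exact signatures change in this
    series):

      /* sketch: inside the pmd loop of copy_pmd_range(), mm/memory.c */
      if (is_swap_pmd(*src_pmd) || pmd_trans_huge(*src_pmd) ||
          pmd_devmap(*src_pmd)) {
              int err = copy_huge_pmd(dst_mm, src_mm, dst_pmd, src_pmd,
                                      addr, dst_vma, src_vma);
              if (err == -ENOMEM)
                      return -ENOMEM;
              if (!err)
                      continue;
              /* -EAGAIN: the PMD was split, fall through to copy PTEs */
      }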

    Huge PUDs are even more special - so far they do not support anonymous
    pages. But they can actually be handled the same way as huge PMDs, even
    if splitting huge PUDs means erasing the PUD entries. It guarantees
    that the follow-up fault-ins will remap the same pages in either the
    parent or the child later.

    This might not be the most efficient way, but it should be easy and
    clean enough. It should be fine, since we're tackling a very rare case
    just to make sure userspace that pinned some THPs will still work even
    without MADV_DONTFORK after fork().

    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

20 Sep, 2020

1 commit

  • A migrating transparent huge page has to already be unmapped. Otherwise,
    the page could be modified while it is being copied to a new page and data
    could be lost. The function __split_huge_pmd() checks for a PMD migration
    entry before calling __split_huge_pmd_locked() leading one to think that
    __split_huge_pmd_locked() can handle splitting a migrating PMD.

    However, the code always increments the page->_mapcount and adjusts the
    memory control group accounting assuming the page is mapped.

    Also, if the PMD entry is a migration PMD entry, the call to
    is_huge_zero_pmd(*pmd) is incorrect because it calls pmd_pfn(pmd) instead
    of migration_entry_to_pfn(pmd_to_swp_entry(pmd)). Fix these problems by
    checking for a PMD migration entry.
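
    The shape of the fix in __split_huge_pmd_locked(), heavily abbreviated
    and intended only as a sketch (not the exact diff):

      pmd_migration = is_pmd_migration_entry(old_pmd);
      if (pmd_migration) {
              swp_entry_t entry = pmd_to_swp_entry(old_pmd);

              page = migration_entry_to_page(entry);
      } else {
              /* pmd_page()/is_huge_zero_pmd() are only valid for a
               * present PMD, not for a migration entry */
              page = pmd_page(old_pmd);
      }
      ...
      /* a migration entry is not mapped: skip the per-subpage _mapcount
       * and memcg accounting for it */
      if (!pmd_migration)
              atomic_inc(&page[i]._mapcount);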

    Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Reviewed-by: Zi Yan
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Alistair Popple
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Bharata B Rao
    Cc: Ben Skeggs
    Cc: Shuah Khan
    Cc: [4.14+]
    Link: https://lkml.kernel.org/r/20200903183140.19055-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

05 Sep, 2020

2 commits

  • Merge emailed patches from Peter Xu:
    "This is a small series that I picked up from Linus's suggestion to
    simplify cow handling (and also make it more strict) by checking
    against page refcounts rather than mapcounts.

    This makes uffd-wp work again (verified by running upmapsort)"

    Note: this is horrendously bad timing, and making this kind of
    fundamental vm change after -rc3 is not at all how things should work.
    The saving grace is that it really is a nice simplification:

    8 files changed, 29 insertions(+), 120 deletions(-)

    The reason for the bad timing is that it turns out that commit
    17839856fd58 ("gup: document and work around 'COW can break either way'
    issue") broke not just UFFD functionality (as Peter noticed), but Mikulas
    Patocka also reports that it caused issues for strace when running in a
    DAX environment with ext4 on a persistent memory setup.

    And we can't just revert that commit without re-introducing the original
    issue that is a potential security hole, so making COW stricter (and in
    the process much simpler) is a step to then undoing the forced COW that
    broke other uses.

    Link: https://lore.kernel.org/lkml/alpine.LRH.2.02.2009031328040.6929@file01.intranet.prod.int.rdu2.redhat.com/

    * emailed patches from Peter Xu :
    mm: Add PGREUSE counter
    mm/gup: Remove enfornced COW mechanism
    mm/ksm: Remove reuse_ksm_page()
    mm: do_wp_page() simplification

    Linus Torvalds
     
  • With the more strict (but greatly simplified) page reuse logic in
    do_wp_page(), we can safely go back to the world where cow is not
    enforced with writes.

    This essentially reverts commit 17839856fd58 ("gup: document and work
    around 'COW can break either way' issue"). There are some context
    differences due to some changes later on around it:

    2170ecfa7688 ("drm/i915: convert get_user_pages() --> pin_user_pages()", 2020-06-03)
    376a34efa4ee ("mm/gup: refactor and de-duplicate gup_fast() code", 2020-06-03)

    Some lines moved back and forth with those, but this revert patch should
    have stripped out and covered all the enforced cow bits anyways.

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Xu
    Signed-off-by: Linus Torvalds

    Peter Xu
     

04 Sep, 2020

1 commit

  • When a huge page is split into normal pages, part of the head page flags
    are transferred to the tail pages. However, the PG_arch_* flags are not
    part of the preserved set.

    PG_arch_2 is used by the arm64 MTE support to mark pages that have valid
    tags. The absence of such flag would cause the arm64 set_pte_at() to
    clear the tags in order to avoid stale tags exposed to user or the
    swapping out hooks to ignore the tags. Not preserving PG_arch_2 on huge
    page splitting leads to tag corruption in the tail pages.

    Preserve the newly added PG_arch_2 flag in __split_huge_page_tail().
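
    The preservation itself is a one-flag addition to the set of flags
    copied from head to tail, roughly (abbreviated from
    __split_huge_page_tail(), mm/huge_memory.c):

      page_tail->flags |= (head->flags &
                      ((1L << PG_referenced) |
                       (1L << PG_swapbacked) |
                       (1L << PG_swapcache) |
                       (1L << PG_mlocked) |
                       (1L << PG_uptodate) |
                       (1L << PG_active) |
                       (1L << PG_workingset) |
                       (1L << PG_locked) |
                       (1L << PG_unevictable) |
      #ifdef CONFIG_64BIT
                       (1L << PG_arch_2) |     /* e.g. arm64 MTE tag state */
      #endif
                       (1L << PG_dirty)));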

    Signed-off-by: Catalin Marinas
    Cc: Andrew Morton

    Catalin Marinas
     

13 Aug, 2020

2 commits

    Since commit 3917c80280c93a7123f ("thp: change CoW semantics for
    anon-THP"), the CoW page fault path for THP has been rewritten and
    debug_cow is not used anymore. So, just remove it.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Zi Yan
    Cc: Kirill A. Shutemov
    Link: http://lkml.kernel.org/r/1592270980-116062-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    In the current implementation, a newly created or swapped-in anonymous
    page starts out on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list
    are demoted to the inactive list. Hence, pages on the active list aren't
    protected at all.

    Following is an example of this situation.

    Assume that there are 50 hot pages on the active list. Numbers denote
    the number of pages on the active/inactive list (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. Like the file LRU, newly created or
    swapped-in anonymous pages will be inserted into the inactive list. They
    are promoted to the active list if enough references happen. This simple
    modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on active list would be protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the
    size of the inactive list but less than the size of the total
    (active + inactive) list. To solve this potential issue, a following
    patch will apply workingset detection similar to the one that's already
    applied to the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

3 commits

  • Rationale:
    Reduces attack surface on kernel devs opening the links for MITM
    as HTTPS traffic is much harder to manipulate.

    Deterministic algorithm:
      For each file:
        If not .svg:
          For each line:
            If doesn't contain `xmlns`:
              For each link, `http://[^# ]*(?:\w|/)`:
                If neither `gnu\.org/license`, nor `mozilla\.org/MPL`:
                  If both the HTTP and HTTPS versions
                  return 200 OK and serve the same content:
                    Replace HTTP with HTTPS.

    [akpm@linux-foundation.org: fix amd.com URL, per Vlastimil]

    Signed-off-by: Alexander A. Klimov
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200713164345.36088-1-grandmaster@al2klimov.de
    Signed-off-by: Linus Torvalds

    Alexander A. Klimov
     
  • After previous cleanup, extent is the minimal step for both source and
    destination. This means when extent is HPAGE_PMD_SIZE or PMD_SIZE,
    old_addr and new_addr are properly aligned too.

    Since these two functions are only invoked in move_page_tables, it is safe
    to remove the check now.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Anshuman Khandual
    Cc: Matthew Wilcox
    Cc: Peter Xu
    Cc: Sean Christopherson
    Cc: Thomas Hellstrom
    Cc: Thomas Hellstrom (VMware)
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Link: http://lkml.kernel.org/r/20200708095028.41706-4-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Patch series "mm/mremap: cleanup move_page_tables() a little", v5.

    move_page_tables() tries to move page table by PMD or PTE.

    The root reason is that if it tries to move a PMD, both the old and new
    ranges should be PMD aligned. But the current code calculates the old
    and new ranges separately, which leads to some redundant checks and
    calculations.

    This cleanup tries to consolidate the range check in one place to reduce
    some extra range handling.

    This patch (of 3):

    old_end is passed to these two functions to check whether there is enough
    space to do the move, while this check is done before invoking these
    functions.

    These two functions would only be invoked when extent meets the
    requirement, and there is one check before invoking them:

      if (extent > old_end - old_addr)
              extent = old_end - old_addr;

    This implies (old_end - old_addr) won't fail the check in these two
    functions.

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Tested-by: Dmitry Osipenko
    Acked-by: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Yang Shi
    Cc: Thomas Hellstrom (VMware)
    Cc: Anshuman Khandual
    Cc: Sean Christopherson
    Cc: Wei Yang
    Cc: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Thomas Hellstrom
    Link: http://lkml.kernel.org/r/20200710092835.56368-1-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200710092835.56368-2-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200708095028.41706-1-richard.weiyang@linux.alibaba.com
    Link: http://lkml.kernel.org/r/20200708095028.41706-2-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

06 Jun, 2020

1 commit

  • Pull powerpc updates from Michael Ellerman:

    - Support for userspace to send requests directly to the on-chip GZIP
    accelerator on Power9.

    - Rework of our lockless page table walking (__find_linux_pte()) to
    make it safe against parallel page table manipulations without
    relying on an IPI for serialisation.

    - A series of fixes & enhancements to make our machine check handling
    more robust.

    - Lots of plumbing to add support for "prefixed" (64-bit) instructions
    on Power10.

    - Support for using huge pages for the linear mapping on 8xx (32-bit).

    - Remove obsolete Xilinx PPC405/PPC440 support, and an associated sound
    driver.

    - Removal of some obsolete 40x platforms and associated cruft.

    - Initial support for booting on Power10.

    - Lots of other small features, cleanups & fixes.

    Thanks to: Alexey Kardashevskiy, Alistair Popple, Andrew Donnellan,
    Andrey Abramov, Aneesh Kumar K.V, Balamuruhan S, Bharata B Rao, Bulent
    Abali, Cédric Le Goater, Chen Zhou, Christian Zigotzky, Christophe
    JAILLET, Christophe Leroy, Dmitry Torokhov, Emmanuel Nicolet, Erhard F.,
    Gautham R. Shenoy, Geoff Levand, George Spelvin, Greg Kurz, Gustavo A.
    R. Silva, Gustavo Walbon, Haren Myneni, Hari Bathini, Joel Stanley,
    Jordan Niethe, Kajol Jain, Kees Cook, Leonardo Bras, Madhavan
    Srinivasan., Mahesh Salgaonkar, Markus Elfring, Michael Neuling, Michal
    Simek, Nathan Chancellor, Nathan Lynch, Naveen N. Rao, Nicholas Piggin,
    Oliver O'Halloran, Paul Mackerras, Pingfan Liu, Qian Cai, Ram Pai,
    Raphael Moreira Zinsly, Ravi Bangoria, Sam Bobroff, Sandipan Das, Segher
    Boessenkool, Stephen Rothwell, Sukadev Bhattiprolu, Tyrel Datwyler,
    Wolfram Sang, Xiongfeng Wang.

    * tag 'powerpc-5.8-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (299 commits)
    powerpc/pseries: Make vio and ibmebus initcalls pseries specific
    cxl: Remove dead Kconfig options
    powerpc: Add POWER10 architected mode
    powerpc/dt_cpu_ftrs: Add MMA feature
    powerpc/dt_cpu_ftrs: Enable Prefixed Instructions
    powerpc/dt_cpu_ftrs: Advertise support for ISA v3.1 if selected
    powerpc: Add support for ISA v3.1
    powerpc: Add new HWCAP bits
    powerpc/64s: Don't set FSCR bits in INIT_THREAD
    powerpc/64s: Save FSCR to init_task.thread.fscr after feature init
    powerpc/64s: Don't let DT CPU features set FSCR_DSCR
    powerpc/64s: Don't init FSCR_DSCR in __init_FSCR()
    powerpc/32s: Fix another build failure with CONFIG_PPC_KUAP_DEBUG
    powerpc/module_64: Use special stub for _mcount() with -mprofile-kernel
    powerpc/module_64: Simplify check for -mprofile-kernel ftrace relocations
    powerpc/module_64: Consolidate ftrace code
    powerpc/32: Disable KASAN with pages bigger than 16k
    powerpc/uaccess: Don't set KUEP by default on book3s/32
    powerpc/uaccess: Don't set KUAP by default on book3s/32
    powerpc/8xx: Reduce time spent in allow_user_access() and friends
    ...

    Linus Torvalds
     

05 Jun, 2020

1 commit

  • Fixes coccicheck warnings:

    mm/zbud.c:246:1-20: WARNING: Assignment of 0/1 to bool variable
    mm/mremap.c:777:2-8: WARNING: Assignment of 0/1 to bool variable
    mm/huge_memory.c:525:9-10: WARNING: return of 0/1 in function 'is_transparent_hugepage' with return type bool

    Reported-by: Hulk Robot
    Signed-off-by: Zou Wei
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1586835930-47076-1-git-send-email-zou_wei@huawei.com
    Signed-off-by: Linus Torvalds

    Zou Wei
     

04 Jun, 2020

9 commits

  • Merge more updates from Andrew Morton:
    "More mm/ work, plenty more to come

    Subsystems affected by this patch series: slub, memcg, gup, kasan,
    pagealloc, hugetlb, vmscan, tools, mempolicy, memblock, hugetlbfs,
    thp, mmap, kconfig"

    * akpm: (131 commits)
    arm64: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    x86: mm: use ARCH_HAS_DEBUG_WX instead of arch defined
    riscv: support DEBUG_WX
    mm: add DEBUG_WX support
    drivers/base/memory.c: cache memory blocks in xarray to accelerate lookup
    mm/thp: rename pmd_mknotpresent() as pmd_mkinvalid()
    powerpc/mm: drop platform defined pmd_mknotpresent()
    mm: thp: don't need to drain lru cache when splitting and mlocking THP
    hugetlbfs: get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
    sparc32: register memory occupied by kernel as memblock.memory
    include/linux/memblock.h: fix minor typo and unclear comment
    mm, mempolicy: fix up gup usage in lookup_node
    tools/vm/page_owner_sort.c: filter out unneeded line
    mm: swap: memcg: fix memcg stats for huge pages
    mm: swap: fix vmstats for huge pages
    mm: vmscan: limit the range of LRU type balancing
    mm: vmscan: reclaim writepage is IO cost
    mm: vmscan: determine anon/file pressure balance at the reclaim root
    mm: balance LRU lists based on relative thrashing
    mm: only count actual rotations as LRU reclaim cost
    ...

    Linus Torvalds
     
    Since commit 8f182270dfec ("mm/swap.c: flush lru pvecs on compound page
    arrival"), a THP no longer stays in a pagevec. So the optimization made
    by commit d965432234db ("thp: increase split_huge_page() success rate"),
    which tried to unpin munlocked THPs from the pagevec by draining it,
    doesn't make sense anymore.

    Draining lru cache before isolating THP in mlock path is also unnecessary.
    b676b293fb48 ("mm, thp: fix mapped pages avoiding unevictable list on
    mlock") added it and 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped
    file huge pages") accidentally carried it over after the above
    optimization went in.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/1585946493-7531-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.
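
    For the anon-THP fault path this boils down to something like the
    following (a simplified sketch of do_huge_pmd_anonymous_page() and its
    helper, not the exact diff):

      /* charge right after allocation, before any "commit" step */
      page = alloc_hugepage_vma(gfp, vma, haddr, HPAGE_PMD_ORDER);
      if (unlikely(!page)) {
              count_vm_event(THP_FAULT_FALLBACK);
              return VM_FAULT_FALLBACK;
      }
      if (mem_cgroup_charge(page, vma->vm_mm, gfp)) {
              put_page(page);
              count_vm_event(THP_FAULT_FALLBACK);
              return VM_FAULT_FALLBACK;
      }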

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a
    small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
    native NR_ANON_THPS accounting sites.

    [hannes@cmpxchg.org: fixes]
    Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Tested-by: Naresh Kamboju
    Reviewed-by: Joonsoo Kim
    Acked-by: Randy Dunlap [build-tested]
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit order (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and before anybody could split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Currently we have different copy-on-write semantics for anon- and
    file-THP. For anon-THP we try to allocate huge page on the write fault,
    but on file-THP we split PMD and allocate 4k page.

    Arguably, the file-THP semantics are more desirable: we don't
    necessarily want to unshare the full PMD range from the parent on the
    first access. This is the primary reason THP is unusable for some
    workloads, like Redis.

    The original THP refcounting didn't allow PTE-mapped compound pages, so
    we had no option but to allocate a huge page on CoW (with fallback to
    512 4k pages).

    The current refcounting doesn't have such limitations and we can cut a lot
    of complex code out of fault path.

    khugepaged is now able to recover THP from such ranges if the
    configuration allows.
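
    The resulting write-fault logic is roughly the following (a simplified
    sketch of the new do_huge_pmd_wp_page(), with page locking and other
    details omitted):

      /* reuse the THP if we are the only mapper, otherwise split the
       * PMD and let the fault be retried on 4k PTEs */
      if (reuse_swap_page(page, NULL)) {
              pmd_t entry = pmd_mkyoung(orig_pmd);

              entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);
              if (pmdp_set_access_flags(vma, haddr, vmf->pmd, entry, 1))
                      update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);
              return VM_FAULT_WRITE;
      }

      __split_huge_pmd(vma, vmf->pmd, vmf->address, false, NULL);
      return VM_FAULT_FALLBACK;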

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-8-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Write-protect anon page faults require an accurate mapcount to decide
    whether to break the COW or not. This is implemented in the THP path with
    reuse_swap_page() ->
    page_trans_huge_map_swapcount()/page_trans_huge_mapcount().

    If the COW triggers while the other processes sharing the page are
    under a huge pmd split, to do an accurate reading, we must ensure the
    mapcount isn't computed while it's being transferred from the head
    page to the tail pages.

    reuse_swap_cache() already runs serialized by the page lock, so it's
    enough to add the page lock around __split_huge_pmd_locked too, in
    order to add the missing serialization.

    Note: the commit in "Fixes" is just to facilitate the backporting,
    because the code before such commit didn't try to do an accurate THP
    mapcount calculation and it instead used the page_count() to decide if
    to COW or not. Both the page_count and the pin_count are THP-wide
    refcounts, so they're inaccurate if used in
    reuse_swap_page(). Reverting such commit (besides the unrelated fix to
    the local anon_vma assignment) would have also opened the window for
    memory corruption side effects to certain workloads as documented in
    such commit header.

    Signed-off-by: Andrea Arcangeli
    Suggested-by: Jann Horn
    Reported-by: Jann Horn
    Acked-by: Kirill A. Shutemov
    Fixes: 6d0a07edd17c ("mm: thp: calculate the mapcount correctly for THP pages during WP faults")
    Cc: stable@vger.kernel.org
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Jun, 2020

1 commit

  • Doing a "get_user_pages()" on a copy-on-write page for reading can be
    ambiguous: the page can be COW'ed at any time afterwards, and the
    direction of a COW event isn't defined.

    Yes, whoever writes to it will generally do the COW, but if the thread
    that did the get_user_pages() unmapped the page before the write (and
    that could happen due to memory pressure in addition to any outright
    action), the writer could also just take over the old page instead.

    End result: the get_user_pages() call might result in a page pointer
    that is no longer associated with the original VM, and is associated
    with - and controlled by - another VM having taken it over instead.

    So when doing a get_user_pages() on a COW mapping, the only really safe
    thing to do would be to break the COW when getting the page, even when
    only getting it for reading.

    At the same time, some users simply don't even care.

    For example, the perf code wants to look up the page not because it
    cares about the page, but because the code simply wants to look up the
    physical address of the access for informational purposes, and doesn't
    really care about races when a page might be unmapped and remapped
    elsewhere.

    This adds logic to force a COW event by setting FOLL_WRITE on any
    copy-on-write mapping when FOLL_GET (or FOLL_PIN) is used to get a page
    pointer as a result.

    The current semantics end up being:

    - __get_user_pages_fast(): no change. If you don't ask for a write,
    you won't break COW. You'd better know what you're doing.

    - get_user_pages_fast(): the fast-case "look it up in the page tables
    without anything getting mmap_sem" now refuses to follow a read-only
    page, since it might need COW breaking. Which happens in the slow
    path - the fast path doesn't know if the memory might be COW or not.

    - get_user_pages() (including the slow-path fallback for gup_fast()):
    for a COW mapping, turn on FOLL_WRITE for FOLL_GET/FOLL_PIN, with
    very similar semantics to FOLL_FORCE.

    If it turns out that we want finer granularity (ie "only break COW when
    it might actually matter" - things like the zero page are special and
    don't need to be broken) we might need to push these semantics deeper
    into the lookup fault path. So if people care enough, it's possible
    that we might end up adding a new internal FOLL_BREAK_COW flag to go
    with the internal FOLL_COW flag we already have for tracking "I had a
    COW".

    Alternatively, if it turns out that different callers might want to
    explicitly control the forced COW break behavior, we might even want to
    make such a flag visible to the users of get_user_pages() instead of
    using the above default semantics.

    But for now, this is mostly commentary on the issue (this commit message
    being a lot bigger than the patch, and that patch in turn is almost all
    comments), with that minimal "enable COW breaking early" logic using the
    existing FOLL_WRITE behavior.
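
    As a sketch of that logic (paraphrasing the idea rather than quoting the
    exact helper added to mm/gup.c, with is_cow_mapping() spelled out
    inline):

      /* a private mapping whose pages may still be shared via COW */
      static inline bool is_cow_mapping(vm_flags_t flags)
      {
              return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
      }

      /* force an early COW break whenever a page pointer will be taken */
      if (is_cow_mapping(vma->vm_flags) &&
          (foll_flags & (FOLL_GET | FOLL_PIN)))
              foll_flags |= FOLL_WRITE;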

    [ It might be worth noting that we've always had this ambiguity, and it
    could arguably be seen as a user-space issue.

    You only get private COW mappings that could break either way in
    situations where user space is doing cooperative things (ie fork()
    before an execve() etc), but it _is_ surprising and very subtle, and
    fork() is supposed to give you independent address spaces.

    So let's treat this as a kernel issue and make the semantics of
    get_user_pages() easier to understand. Note that obviously a true
    shared mapping will still get a page that can change under us, so this
    does _not_ mean that get_user_pages() somehow returns any "stable"
    page ]

    Reported-by: Jann Horn
    Tested-by: Christoph Hellwig
    Acked-by: Oleg Nesterov
    Acked-by: Kirill Shutemov
    Acked-by: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

08 Apr, 2020

5 commits

    For both swap and page migration, we use bit 2 of the entry to identify
    whether the entry is uffd write-protected. It plays a similar role to
    the existing soft dirty bit in swap entries, but only for keeping the
    uffd-wp tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry to the PTE bit we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared, otherwise even with the
    _PAGE_UFFD_WP bit we can't trap it at all.

    In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
    That can lead to data mismatch if the page that we are going to write
    protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
    also applies/removes the uffd-wp bit even for the swap entries.
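
    A rough sketch of the swap-entry handling in change_pte_range()
    (mm/mprotect.c); the pte_swp_*uffd_wp() accessors are the helpers this
    series introduces, the rest is abbreviated:

      /* oldpte is a non-present (swap/migration) entry */
      swp_entry_t entry = pte_to_swp_entry(oldpte);
      pte_t newpte = oldpte;

      if (is_write_migration_entry(entry)) {
              make_migration_entry_read(&entry);
              newpte = swp_entry_to_pte(entry);
              if (pte_swp_soft_dirty(oldpte))
                      newpte = pte_swp_mksoft_dirty(newpte);
      }
      if (uffd_wp)
              newpte = pte_swp_mkuffd_wp(newpte);
      else if (uffd_wp_resolve)
              newpte = pte_swp_clear_uffd_wp(newpte);

      if (!pte_same(oldpte, newpte))
              set_pte_at(vma->vm_mm, addr, pte, newpte);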

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • UFFD_EVENT_FORK support for uffd-wp should be already there, except that
    we should clean the uffd-wp bit if uffd fork event is not enabled. Detect
    that to avoid _PAGE_UFFD_WP being set even if the VMA is not being tracked
    by VM_UFFD_WP. Do this for both small PTEs and huge PMDs.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp and make sure the two new flags
    are exclusively used. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.

    Do this change for both PTEs and huge PMDs. Then we can start to identify
    which PTE/PMD is write protected by general (e.g., COW or soft dirty
    tracking), and which is for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    Now that we have _PAGE_UFFD_WP, a special case is when a page is both
    protected by the general COW logic and also userfault-wp. Here the
    userfault-wp will have higher priority and will be handled first. Only
    after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
    the general COW. These are the steps on what will happen with such a
    page:

    1. CPU accesses write protected shared page (so both protected by
    general COW and uffd-wp), blocked by uffd-wp first because in
    do_wp_page we'll handle uffd-wp first, so it has higher priority
    than general COW.

    2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
    to remove the uffd-wp bit upon the PTE/PMD. However here we
    still keep the write bit cleared. Notify the blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry, during retry it'll notice it was not with the uffd-wp bit
    this time but it is still write protected by general COW, then
    it'll go though the COW path in the fault handler, copy the page,
    apply write bit where necessary, and retry again.

    4. The CPU will be able to access this page with write bit set.
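
    At the PTE level the two flags translate into roughly the following
    (a sketch of change_pte_range() for a present PTE; the huge PMD path in
    change_huge_pmd() mirrors it with the pmd_*uffd_wp() helpers):

      if (uffd_wp) {
              /* userfaultfd write-protect: set the marker, drop write */
              ptent = pte_wrprotect(ptent);
              ptent = pte_mkuffd_wp(ptent);
      } else if (uffd_wp_resolve) {
              /* unprotect: clear the marker; the write bit is recovered
               * by the fault path (e.g. COW) as described above */
              ptent = pte_clear_uffd_wp(ptent);
      }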

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
    change_protection() was used by either the NUMA or the mprotect() code;
    there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protect() and all these helpers to
    replace these parameters. Then we can avoid passing multiple parameters
    multiple times along the way.

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.
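
    A sketch of the consolidated interface (flag names are the ones this
    series adds; the prototype is abbreviated):

      #define MM_CP_DIRTY_ACCT        (1UL << 0)
      #define MM_CP_PROT_NUMA         (1UL << 1)
      /* later in the series: MM_CP_UFFD_WP, MM_CP_UFFD_WP_RESOLVE */

      unsigned long change_protection(struct vm_area_struct *vma,
                      unsigned long start, unsigned long end,
                      pgprot_t newprot, unsigned long cp_flags);

      /* callers pass e.g. MM_CP_PROT_NUMA instead of two booleans */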

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)