18 May, 2022

1 commit

  • commit 478d134e9506c7e9bfe2830ed03dd85e97966313 upstream.

    Kernel panic when injecting memory_failure for the global huge_zero_page,
    when CONFIG_DEBUG_VM is enabled, as follows.

    Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
    page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
    head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
    flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
    raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
    raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
    page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
    ------------[ cut here ]------------
    kernel BUG at mm/huge_memory.c:2499!
    invalid opcode: 0000 [#1] PREEMPT SMP PTI
    CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
    Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
    RIP: 0010:split_huge_page_to_list+0x66a/0x880
    Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
    RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
    RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
    RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
    R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
    R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
    FS: 00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    Call Trace:
    try_to_split_thp_page+0x3a/0x130
    memory_failure+0x128/0x800
    madvise_inject_error.cold+0x8b/0xa1
    __x64_sys_madvise+0x54/0x60
    do_syscall_64+0x35/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae
    RIP: 0033:0x7fc3754f8bf9
    Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
    RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
    RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
    RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
    RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
    R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
    R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000

    We think that raising a BUG is overkill for splitting the
    huge_zero_page: the huge_zero_page cannot be reached from normal paths
    other than memory failure, but memory failure is a valid caller. So
    replace the BUG with a WARN plus a -EBUSY return, and the panic above
    won't happen again.
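
    For illustration, a minimal sketch of the shape of the fix in
    split_huge_page_to_list() (hedged; the exact upstream diff may differ
    in detail):

        /* Refuse, rather than BUG, when asked to split the huge zero page. */
        if (is_huge_zero_page(head)) {
                pr_warn_ratelimited("Called split_huge_page() for huge zero page\n");
                return -EBUSY;
        }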

    Link: https://lkml.kernel.org/r/f35f8b97377d5d3ede1bc5ac3114da888c57cbce.1651052574.git.xuyu@linux.alibaba.com
    Fixes: d173d5417fb6 ("mm/memory-failure.c: skip huge_zero_page in memory_failure()")
    Fixes: 6a46079cf57a ("HWPOISON: The high level memory error handler in the VM v7")
    Signed-off-by: Xu Yu
    Suggested-by: Yang Shi
    Reported-by: kernel test robot
    Reviewed-by: Naoya Horiguchi
    Reviewed-by: Yang Shi
    Reviewed-by: Miaohe Lin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Xu Yu
     

29 Oct, 2021

1 commit

  • When handling a shmem page fault, a THP with a corrupted subpage could
    be PMD-mapped if certain conditions are satisfied. But the kernel is
    supposed to send SIGBUS when trying to map a hwpoisoned page.

    There are two paths which may do PMD map: fault around and regular
    fault.

    Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
    codepaths") the problem was even worse in the fault-around path: the
    THP could be PMD-mapped as long as the VMA fit, regardless of which
    subpage was accessed and corrupted. After that commit, the THP can
    still be PMD-mapped as long as the head page is not corrupted.

    In the regular fault path the THP can be PMD-mapped as long as the
    corrupted subpage is not the one being accessed and the VMA fits.

    This loophole could be fixed by iterating over every subpage to check
    whether any of them is hwpoisoned, but that is somewhat costly in the
    page fault path.

    So introduce a new page flag, HasHWPoisoned, on the first tail page.
    It indicates that the THP has hwpoisoned subpage(s). It is set if any
    subpage of the THP is found hwpoisoned by memory failure after the
    refcount has been bumped successfully, and cleared when the THP is
    freed or split.

    The soft offline path doesn't need this, since the soft offline handler
    just marks a subpage hwpoisoned when the subpage is migrated
    successfully, and shmem THPs don't get split and then migrated at all
    in that path.
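
    As an illustration, a hedged sketch of how the PMD-mapping fault path
    can consult the new flag before installing a huge mapping (placement
    and details follow the description above, not necessarily the exact
    upstream diff):

        /* Before mapping the whole THP with a PMD in the fault path: */
        if (unlikely(PageHasHWPoisoned(page)))
                return VM_FAULT_FALLBACK;  /* fall back to PTE mapping; the bad subpage gets SIGBUS */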

    Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Yang Shi
    Reviewed-by: Naoya Horiguchi
    Suggested-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     

19 Oct, 2021

1 commit

  • Decrease the nr_thps counter in the file's mapping to ensure that the
    page cache won't be dropped excessively on file write access if the
    page has already been split.

    I've tried a test scenario: run a big binary, the kernel remaps it with
    THPs, then force a THP split with /sys/kernel/debug/split_huge_pages.
    During any further open of that binary with O_RDWR or O_WRONLY the
    kernel drops the page cache for it, because of the non-zero nr_thps
    counter.
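
    A hedged sketch of the shape of the fix in split_huge_page_to_list():
    after a successful split of a regular (non-shmem) file THP, keep the
    mapping's THP counter in sync.

        /* head is the THP being split, mapping is its address_space */
        if (mapping && !PageSwapBacked(head))
                filemap_nr_thps_dec(mapping);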

    Link: https://lkml.kernel.org/r/20211012120237.2600-1-m.szyprowski@samsung.com
    Signed-off-by: Marek Szyprowski
    Fixes: 09d91cda0e82 ("mm,thp: avoid writes to file with THP in pagecache")
    Fixes: 06d3eff62d9d ("mm/thp: fix node page state in split_huge_page_to_list()")
    Acked-by: Matthew Wilcox (Oracle)
    Reviewed-by: Yang Shi
    Cc:
    Cc: Song Liu
    Cc: Rik van Riel
    Cc: "Kirill A . Shutemov"
    Cc: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marek Szyprowski
     

04 Sep, 2021

2 commits

  • Before commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling"),
    TLB flushing was done in do_huge_pmd_numa_page() itself via
    flush_tlb_range().

    After that commit, the TLB flushing is done in migrate_pages(), as in
    the following code path:

      do_huge_pmd_numa_page
        migrate_misplaced_page
          migrate_pages

    So the TLB flushing code in do_huge_pmd_numa_page() has become
    unnecessary; delete it to simplify the code. This is only a code
    cleanup, there's no visible performance difference.

    The mmu_notifier_invalidate_range() in do_huge_pmd_numa_page() is
    deleted too, because migrate_pages() takes care of that as well when
    the CPU TLB is flushed.

    Link: https://lkml.kernel.org/r/20210720065529.716031-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Zi Yan
    Reviewed-by: Yang Shi
    Cc: Dan Carpenter
    Cc: Mel Gorman
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vasily Gorbik
    Cc: Paolo Bonzini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • A successful shmem_fallocate() guarantees that the extent has been
    reserved, even beyond i_size when the FALLOC_FL_KEEP_SIZE flag was used.
    But that guarantee is broken by shmem_unused_huge_shrink()'s attempts to
    split huge pages and free their excess beyond i_size; and by other uses of
    split_huge_page() near i_size.

    It's sad to add a shmem inode field just for this, but I did not find a
    better way to keep the guarantee. A flag to say KEEP_SIZE has been used
    would be cheaper, but I'm averse to unclearable flags. The fallocend
    field is not perfect either (many disjoint ranges might be fallocated),
    but good enough; and gains another use later on.
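
    An illustrative, hedged sketch of the approach described above (treat
    the exact field and helper names as assumptions):

        /* New field in struct shmem_inode_info (illustrative):
         *      pgoff_t fallocend;      highest page offset ever fallocated
         */
        static pgoff_t shmem_fallocend(struct inode *inode, pgoff_t eof)
        {
                /* Never shrink or split back past the furthest fallocated
                 * end, only past i_size. */
                return max(eof, SHMEM_I(inode)->fallocend);
        }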

    Link: https://lkml.kernel.org/r/ca9a146-3a59-6cd3-7f28-e9a044bb1052@google.com
    Fixes: 779750d20b93 ("shmem: split huge pages beyond i_size under memory pressure")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

12 Jul, 2021

1 commit

  • Parallel developments in mm/rmap.c have left behind some out-of-date
    comments: try_to_migrate_one() also accepts TTU_SYNC (already commented
    in try_to_migrate() itself), and try_to_migrate() returns nothing at
    all.

    TTU_SPLIT_FREEZE has just been deleted, so reword the comment about it
    in mm/huge_memory.c; and TTU_IGNORE_ACCESS was removed in 5.11, so
    delete the "recently referenced" comment from try_to_unmap_one() (once
    upon a time the comment was near the removed codeblock, but they drifted
    apart).

    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Reviewed-by: Alistair Popple
    Link: https://lore.kernel.org/lkml/563ce5b2-7a44-5b4d-1dfd-59a0e65932a9@google.com/
    Cc: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: Christoph Hellwig
    Cc: Yang Shi
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

02 Jul, 2021

3 commits

  • Migration is currently implemented as a mode of operation for
    try_to_unmap_one(), generally specified by passing the TTU_MIGRATION
    flag or, in the case of splitting a huge anonymous page,
    TTU_SPLIT_FREEZE.

    However it does not have much in common with the rest of the unmap
    functionality of try_to_unmap_one(), so splitting it out into a
    separate function reduces the complexity of try_to_unmap_one() and
    makes it more readable.

    Several simplifications can also be made in try_to_migrate_one() based on
    the following observations:

    - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
    - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
    - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

    TTU_SPLIT_FREEZE is a special case of migration used when splitting an
    anonymous page. This is most easily dealt with by calling the correct
    function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
    for PageAnon or try_to_unmap().
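
    Roughly, the resulting unmap_page() in mm/huge_memory.c looks like the
    hedged sketch below (simplified from the code as of this series):

        static void unmap_page(struct page *page)
        {
                enum ttu_flags ttu_flags = TTU_RMAP_LOCKED |
                                           TTU_SPLIT_HUGE_PMD | TTU_SYNC;

                /* Anon pages need migration entries preserved across the
                 * split; file pages can simply be unmapped and faulted
                 * back on demand. */
                if (PageAnon(page))
                        try_to_migrate(page, ttu_flags);
                else
                        try_to_unmap(page, ttu_flags | TTU_IGNORE_MLOCK);
        }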

    Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Both migration and device private pages use special swap entries that
    are manipulated by a range of inline functions. The arguments to these
    are somewhat inconsistent, so rework them to remove the flag-type
    arguments and to make the arguments similar for both read and write
    entry creation.
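
    A hedged usage sketch of the reworked helpers: the read/write
    distinction moves into the function name, and the only argument is the
    pfn/offset.

        swp_entry_t entry;

        if (pte_write(pteval))
                entry = make_writable_migration_entry(page_to_pfn(page));
        else
                entry = make_readable_migration_entry(page_to_pfn(page));
        swp_pte = swp_entry_to_pte(entry);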

    Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Ralph Campbell
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Patch series "Add support for SVM atomics in Nouveau", v11.

    Introduction
    ============

    Some devices have features such as atomic PTE bits that can be used to
    implement atomic access to system memory. To support atomic operations to
    a shared virtual memory page such a device needs access to that page which
    is exclusive of the CPU. This series introduces a mechanism to
    temporarily unmap pages granting exclusive access to a device.

    These changes are required to support OpenCL atomic operations in Nouveau
    to shared virtual memory (SVM) regions allocated with the
    CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
    OpenCL SVM feature is available at
    https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/OpenCL_API.html#_shared_virtual_memory.

    Implementation
    ==============

    Exclusive device access is implemented by adding a new swap entry type
    (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
    difference is that on fault the original entry is immediately restored by
    the fault handler instead of waiting.

    Restoring the entry triggers calls to MMU notifiers, which allows a
    device driver to revoke the atomic access permission from the GPU
    before the CPU finalises the entry.

    Patches
    =======

    Patches 1 & 2 refactor existing migration and device private entry
    functions.

    Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
    functionality into separate functions - try_to_migrate_one() and
    try_to_munlock_one().

    Patch 5 renames some existing code but does not introduce functionality.

    Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

    Patch 7 contains the bulk of the implementation for device exclusive
    memory.

    Patch 8 contains some additions to the HMM selftests to ensure everything
    works as expected.

    Patch 9 is a cleanup for the Nouveau SVM implementation.

    Patch 10 contains the implementation of atomic access for the Nouveau
    driver.

    Testing
    =======

    This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
    which checks that GPU atomic accesses to system memory are atomic.
    Without this series the test fails as there is no way of write-protecting
    the page mapping which results in the device clobbering CPU writes. For
    reference the test is available at
    https://ozlabs.org/~apopple/opencl_svm_atomics/

    Further testing has been performed by adding support for testing exclusive
    access to the hmm-tests kselftests.

    This patch (of 10):

    Remove multiple similar inline functions for dealing with different types
    of special swap entries.

    Both migration and device private swap entries use the swap offset to
    store a pfn. Instead of multiple inline functions to obtain a struct
    page for each swap entry type, use a common function
    pfn_swap_entry_to_page(). Also open-code the various entry_to_pfn()
    functions, as this results in shorter code that is easier to
    understand.
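
    A hedged usage sketch: a single helper now maps either kind of
    pfn-carrying swap entry back to its struct page.

        swp_entry_t entry = pmd_to_swp_entry(orig_pmd);
        /* Works for both migration and device private entries. */
        struct page *page = pfn_swap_entry_to_page(entry);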

    Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
    Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Peter Xu
    Cc: Shakeel Butt
    Cc: Ben Skeggs
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     

01 Jul, 2021

11 commits

  • Using MAX_INPUT_BUF_SZ as the maximum length of the string makes fortify
    complain as it thinks the string might be longer than the buffer, and if
    it is, we will end up with a "string" that is missing a NUL terminator.
    It's trivial to show that 'tok' points to a NUL-terminated string which is
    less than MAX_INPUT_BUF_SZ in length, so we may as well just use strcpy()
    and avoid the warning.
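
    A hedged sketch of the copy in the debugfs write handler (variable
    names illustrative):

        /* tok was produced by strsep() on a buffer of at most
         * MAX_INPUT_BUF_SZ bytes, so it is NUL-terminated and fits. */
        strcpy(file_path, tok);   /* was: strncpy(file_path, tok, MAX_INPUT_BUF_SZ) */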

    Link: https://lkml.kernel.org/r/20210615200242.1716568-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • THP splitting's unmap_page() only sets TTU_SPLIT_FREEZE when PageAnon, and
    migration entries are only inserted when TTU_MIGRATION (unused here) or
    TTU_SPLIT_FREEZE is set: so it's just a waste of time for remap_page() to
    search for migration entries to remove when !PageAnon.
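
    A hedged sketch of the resulting early return in remap_page():

        static void remap_page(struct page *page, unsigned int nr)
        {
                /* Only anon THPs had migration entries installed by unmap_page(). */
                if (!PageAnon(page))
                        return;
                if (PageTransHuge(page))
                        remove_migration_ptes(page, page, true);
                /* ... otherwise walk the nr subpages ... */
        }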

    Link: https://lkml.kernel.org/r/f987bc44-f28e-688d-2424-b4722153ed8@google.com
    Fixes: baa355fd3314 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390
    support both NUMA balancing and THP. But S390 doesn't support THP
    migration, so NUMA balancing actually can't migrate any misplaced
    pages there.

    Skip making the PMD PROT_NONE in that case; otherwise CPU cycles may
    be wasted on pointless NUMA hinting faults on S390.
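
    A hedged sketch of the check, in the prot_numa path of
    change_huge_pmd():

        /* NUMA hinting is pointless if the THP cannot be migrated anyway. */
        if (prot_numa && !thp_migration_supported())
                return 1;       /* leave the PMD untouched */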

    Link: https://lkml.kernel.org/r/20210518200801.7413-8-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Acked-by: Mel Gorman
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vasily Gorbik
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • When THP NUMA fault support was added, THP migration was not supported
    yet, so ad hoc THP migration was implemented in the NUMA fault
    handling. Since v4.14 THP migration has been supported, so it doesn't
    make much sense to keep a separate THP migration implementation rather
    than using the generic migration code.

    This patch reworks the NUMA fault handling to use the generic
    migration implementation to migrate the misplaced page. There is no
    functional change.

    After the refactor the flow of NUMA fault handling looks just like its
    PTE counterpart:
      Acquire ptl
      Prepare for migration (elevate page refcount)
      Release ptl
      Isolate page from lru and elevate page refcount
      Migrate the misplaced THP

    If migration fails, just restore the old normal PMD.

    In the old code the anon_vma lock was needed to serialize THP
    migration against THP split, but the THP code has been reworked a lot
    since then and the anon_vma lock no longer seems to be required to
    avoid the race. The page refcount elevation while holding the ptl
    should prevent the THP from being split.

    Use migrate_misplaced_page() for both base page and THP NUMA hinting
    faults and remove all the dead and duplicate code.
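
    A heavily simplified, hedged sketch of the reworked
    do_huge_pmd_numa_page() flow (illustrative only; locking details and
    error handling omitted):

        vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
        page = vm_normal_page_pmd(vma, haddr, pmd);
        get_page(page);                 /* prepare for migration under ptl */
        spin_unlock(vmf->ptl);

        if (migrate_misplaced_page(page, vma, target_nid))
                return 0;               /* THP migrated by the generic code */

        /* Migration failed: re-establish the old PMD with normal protection. */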

    [dan.carpenter@oracle.com: fix a double unlock bug]
    Link: https://lkml.kernel.org/r/YLX8uYN01JmfLnlK@mwanda

    Link: https://lkml.kernel.org/r/20210518200801.7413-4-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Dan Carpenter
    Acked-by: Mel Gorman
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vasily Gorbik
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Patch series "mm: thp: use generic THP migration for NUMA hinting fault", v3.

    When THP NUMA fault support was added, THP migration was not supported
    yet, so ad hoc THP migration was implemented in the NUMA fault
    handling. Since v4.14 THP migration has been supported, so it doesn't
    make much sense to keep a separate THP migration implementation rather
    than using the generic migration code. It is definitely a maintenance
    burden to keep two THP migration implementations for different code
    paths, and it is more error prone. Using the generic THP migration
    implementation allows us to remove the duplicate code and some hacks
    needed by the old ad hoc implementation.

    A quick grep shows that x86_64, PowerPC (book3s), ARM64 and S390
    support both THP and NUMA balancing. Most of them support THP
    migration, except for S390. Zi Yan tried to add THP migration support
    for S390 before, but it was not accepted due to the design of the S390
    PMD. For the discussion, please see:
    https://lkml.org/lkml/2018/4/27/953.

    Per the discussion with Gerald Schaefer in v1, it is acceptable to
    skip huge PMDs for S390 for now.

    I saw there were some hacks around gup in the git history, but I
    didn't figure out whether they have been removed, since I still see
    FOLL_NUMA code in the current gup implementation and it seems useful.

    Patch #1 ~ #2 are preparation patches.
    Patch #3 is the real meat.
    Patch #4 ~ #6 keep counters and behaviors consistent with before.
    Patch #7 skips changing huge PMDs to prot_none if THP migration is not
    supported.

    Test
    ----
    Did some tests to measure the latency of do_huge_pmd_numa_page. The
    test VM has 80 vcpus and 64G memory. The test creates 2 processes that
    together consume 128G of memory, which incurs memory pressure and
    causes THP splits. It also creates 80 processes to hog the CPUs, and
    the memory consumer processes are bound to different nodes
    periodically in order to increase NUMA faults.

    The below test script is used:

    echo 3 > /proc/sys/vm/drop_caches

    # Run stress-ng for 24 hours
    ./stress-ng/stress-ng --vm 2 --vm-bytes 64G --timeout 24h &
    PID=$!

    ./stress-ng/stress-ng --cpu $NR_CPUS --timeout 24h &

    # Wait for vm stressors forked
    sleep 5

    PID_1=`pgrep -P $PID | awk 'NR == 1'`
    PID_2=`pgrep -P $PID | awk 'NR == 2'`

    JOB1=`pgrep -P $PID_1`
    JOB2=`pgrep -P $PID_2`

    # Bind load jobs to different nodes periodically to force generate
    # cross node memory access
    while [ -d "/proc/$PID" ]
    do
        taskset -apc 8 $JOB1
        taskset -apc 8 $JOB2
        sleep 300
        taskset -apc 58 $JOB1
        taskset -apc 58 $JOB2
        sleep 300
    done

    With the above test, the histogram of the latency of
    do_huge_pmd_numa_page is shown below. Since the number of
    do_huge_pmd_numa_page calls varies drastically between runs (probably
    due to the scheduler), I converted the raw numbers to percentages.

    patched base
    @us[stress-ng]:
    [0] 3.57% 0.16%
    [1] 55.68% 18.36%
    [2, 4) 10.46% 40.44%
    [4, 8) 7.26% 17.82%
    [8, 16) 21.12% 13.41%
    [16, 32) 1.06% 4.27%
    [32, 64) 0.56% 4.07%
    [64, 128) 0.16% 0.35%
    [128, 256) < 0.1% < 0.1%
    [256, 512) < 0.1% < 0.1%
    [512, 1K) < 0.1% < 0.1%
    [1K, 2K) < 0.1% < 0.1%
    [2K, 4K) < 0.1% < 0.1%
    [4K, 8K) < 0.1% < 0.1%
    [8K, 16K) < 0.1% < 0.1%
    [16K, 32K) < 0.1% < 0.1%
    [32K, 64K) < 0.1% < 0.1%

    Per the result, the patched kernel is even slightly better than the
    base kernel. I think this is because the lock contention against THP
    split is lower than in the base kernel due to the refactor.

    To exclude the effect of THP splits, I also tested without memory
    pressure. No obvious regression was spotted. Below is the test result
    *without* memory pressure.

    patched base
    @us[stress-ng]:
    [0] 7.97% 18.4%
    [1] 69.63% 58.24%
    [2, 4) 4.18% 2.63%
    [4, 8) 0.22% 0.17%
    [8, 16) 1.03% 0.92%
    [16, 32) 0.14% < 0.1%
    [32, 64) < 0.1% < 0.1%
    [64, 128) < 0.1% < 0.1%
    [128, 256) < 0.1% < 0.1%
    [256, 512) 0.45% 1.19%
    [512, 1K) 15.45% 17.27%
    [1K, 2K) < 0.1% < 0.1%
    [2K, 4K) < 0.1% < 0.1%
    [4K, 8K) < 0.1% < 0.1%
    [8K, 16K) 0.86% 0.88%
    [16K, 32K) < 0.1% 0.15%
    [32K, 64K) < 0.1% < 0.1%
    [64K, 128K) < 0.1% < 0.1%
    [128K, 256K) < 0.1% < 0.1%

    The series also survived a series of tests that exercise NUMA balancing
    migrations by Mel.

    This patch (of 7):

    Add orig_pmd to struct vm_fault so that the "orig_pmd" parameter used
    by the huge page fault handlers can be removed, just like its PTE
    counterpart.

    Link: https://lkml.kernel.org/r/20210518200801.7413-1-shy828301@gmail.com
    Link: https://lkml.kernel.org/r/20210518200801.7413-2-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Acked-by: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Zi Yan
    Cc: Huang Ying
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Gerald Schaefer
    Cc: Heiko Carstens
    Cc: Vasily Gorbik
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • We tried to do something similar in b569a1760782 ("userfaultfd: wp:
    drop _PAGE_UFFD_WP properly when fork") previously, but it didn't get
    it all right. A few fixes around the code path:

    1. We were referencing VM_UFFD_WP in vm_flags on the _old_ vma rather
       than the new vma. That was overlooked in b569a1760782, so it won't
       work as expected. Thanks to the recent rework of the fork code
       (7a4830c380f3a8b3), we can easily get the new vma now, so switch
       the checks to that.

    2. Dropping the uffd-wp bit in copy_huge_pmd() could be wrong if the
       huge pmd is a migration huge pmd. When that happens, instead of
       pmd_uffd_wp() we should use pmd_swp_uffd_wp(). The fix is simply
       to handle the two cases separately.

    3. We forgot to carry over the uffd-wp bit for a write migration huge
       pmd entry. This also happens in copy_huge_pmd(), where we convert
       a write huge migration entry into a read one.

    4. In copy_nonpresent_pte(), drop uffd-wp if necessary for swap ptes.

    5. In copy_present_page(), when COW is enforced at fork(), we also
       need to pass over the uffd-wp bit if VM_UFFD_WP is armed on the
       new vma and the pte to be copied has the uffd-wp bit set.

    Remove the comment in copy_present_pte() about this. It won't help
    much to comment only there, and commenting everywhere would be
    overkill. Let's assume the commit messages will help.
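
    A hedged sketch of the migration-entry handling in copy_huge_pmd()
    after these fixes (simplified; helper names as in the current tree):

        if (is_writable_migration_entry(entry)) {
                /* Point 3: downgrade to a read migration entry, but keep uffd-wp. */
                entry = make_readable_migration_entry(swp_offset(entry));
                pmd = swp_entry_to_pmd(entry);
                if (pmd_swp_uffd_wp(*src_pmd))
                        pmd = pmd_swp_mkuffd_wp(pmd);
                set_pmd_at(src_mm, addr, src_pmd, pmd);
        }
        /* Points 1 and 2: check VM_UFFD_WP on the new (dst) vma, and use
         * the swap-pmd variant of the uffd-wp helpers for migration pmds. */
        if (!userfaultfd_wp(dst_vma))
                pmd = pmd_swp_clear_uffd_wp(pmd);
        set_pmd_at(dst_mm, addr, dst_pmd, pmd);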

    [peterx@redhat.com: fix a few thp pmd missing uffd-wp bit]
    Link: https://lkml.kernel.org/r/20210428225030.9708-4-peterx@redhat.com

    Link: https://lkml.kernel.org/r/20210428225030.9708-3-peterx@redhat.com
    Fixes: b569a1760782f ("userfaultfd: wp: drop _PAGE_UFFD_WP properly when fork")
    Signed-off-by: Peter Xu
    Cc: Jerome Glisse
    Cc: Mike Rapoport
    Cc: Alexander Viro
    Cc: Andrea Arcangeli
    Cc: Axel Rasmussen
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Joe Perches
    Cc: Kirill A. Shutemov
    Cc: Lokesh Gidra
    Cc: Mike Kravetz
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Patch series "mm/uffd: Misc fix for uffd-wp and one more test".

    This series tries to fix some corner-case bugs for uffd-wp on either
    THP or fork(). It then introduces a new test with pagemap/pageout.

    Patch layout:

    Patch 1: cleanup for THP, it'll slightly simplify the follow up patches
    Patch 2-4: misc fixes for uffd-wp here and there; please refer to each patch
    Patch 5: add pagemap support for uffd-wp
    Patch 6: add pagemap/pageout test for uffd-wp

    The last test introduced can also verify some of the fixes in the
    previous patches, as it will fail without them. However it's not easy
    to verify all the changes in patches 2-4, but hopefully they can still
    be properly reviewed.

    Note that considering the ongoing uffd-wp shmem & hugetlbfs work,
    patch 5 will be incomplete as it is missing e.g. the hugetlbfs part
    and the special swap pte detection. However that's not needed in this
    series, and since that other series is still under review, this series
    does not depend on it (the last test only runs with anonymous memory,
    not file-backed). So this series can be merged even before that one.

    This patch (of 6):

    The huge zero page is handled via a special path in copy_huge_pmd(),
    but it should share most of its code with a normal THP page. Try to
    share more code with it by removing the special path. The only
    leftover so far is the huge zero page refcounting
    (mm_get_huge_zero_page()), because that is done separately with a
    global counter.

    This prepares for a future patch which modifies the huge pmd to be
    installed, so that we don't need to duplicate it explicitly for the
    huge zero page case as well.

    Link: https://lkml.kernel.org/r/20210428225030.9708-1-peterx@redhat.com
    Link: https://lkml.kernel.org/r/20210428225030.9708-2-peterx@redhat.com
    Signed-off-by: Peter Xu
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Axel Rasmussen
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Alexander Viro
    Cc: Brian Geffon
    Cc: "Dr . David Alan Gilbert"
    Cc: Joe Perches
    Cc: Lokesh Gidra
    Cc: Mina Almasry
    Cc: Oliver Upton
    Cc: Shaohua Li
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Cc: Wang Qing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • If other processes are mapping any other subpages of the hugepage,
    i.e. in the pte-mapped THP case, page_mapcount() will incorrectly
    return 1. We would then discard the page while other processes are
    still mapping it. Fix it by using total_mapcount(), which can tell
    whether other processes are still mapping it.
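
    A hedged sketch of the check in madvise_free_huge_pmd():

        /* Skip the discard unless we are the only mapper of every subpage. */
        if (total_mapcount(page) != 1)          /* was: page_mapcount(page) != 1 */
                goto out;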

    Link: https://lkml.kernel.org/r/20210511134857.1581273-6-linmiaohe@huawei.com
    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Reviewed-by: Yang Shi
    Signed-off-by: Miaohe Lin
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Commit aa88b68c3b1d ("thp: keep huge zero page pinned until tlb
    flush") introduced tlb_remove_page() for the huge zero page to keep it
    pinned until the flush is complete and to prevent the page from being
    split under us. But the huge zero page has been kept pinned until all
    relevant mm_users reach zero ever since commit 6fcb52a56ff6 ("thp:
    reduce usage of huge zero page's atomic counter"). So
    tlb_remove_page_size() for the huge zero pmd is unnecessary now.

    Link: https://lkml.kernel.org/r/20210511134857.1581273-5-linmiaohe@huawei.com
    Reviewed-by: Yang Shi
    Acked-by: David Hildenbrand
    Signed-off-by: Miaohe Lin
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Since commit 99cb0dbd47a1 ("mm,thp: add read-only THP support for
    (non-shmem) FS"), read-only THP file mapping is supported, but that
    commit forgot to add a check for it in transparent_hugepage_enabled().
    To fix it, add a check for read-only THP file mappings and also
    introduce the helper transhuge_vma_enabled() to check whether THP is
    enabled for a given vma, to reduce duplicated code. Rename
    transparent_hugepage_enabled to transparent_hugepage_active to make
    the code easier to follow, as suggested by David Hildenbrand.
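
    The new helper is roughly the following hedged sketch (compare the
    upstream patch for the exact form):

        static inline bool transhuge_vma_enabled(struct vm_area_struct *vma,
                                                 unsigned long vm_flags)
        {
                /* Explicitly disabled through madvise or prctl. */
                if ((vm_flags & VM_NOHUGEPAGE) ||
                    test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
                        return false;
                return true;
        }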

    [linmiaohe@huawei.com: define transhuge_vma_enabled next to transhuge_vma_suitable]
    Link: https://lkml.kernel.org/r/20210514093007.4117906-1-linmiaohe@huawei.com

    Link: https://lkml.kernel.org/r/20210511134857.1581273-4-linmiaohe@huawei.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: David Hildenbrand
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Now that we can represent the location of ->deferred_list instead of
    ->mapping + ->index, make use of it to improve readability.

    Link: https://lkml.kernel.org/r/20210511134857.1581273-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Yang Shi
    Reviewed-by: David Hildenbrand
    Cc: Alexey Dobriyan
    Cc: "Aneesh Kumar K . V"
    Cc: Anshuman Khandual
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Minchan Kim
    Cc: Ralph Campbell
    Cc: Rik van Riel
    Cc: Song Liu
    Cc: William Kucharski
    Cc: Zi Yan
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

17 Jun, 2021

4 commits

  • When debugging the bug reported by Wang Yugui [1], try_to_unmap() may
    fail, but the first VM_BUG_ON_PAGE() only checks page_mapcount(), so
    it may miss the failure when the head page is unmapped while another
    subpage is still mapped. The second, DEBUG_VM-only BUG() that checks
    the total mapcount would then catch it. This can cause some confusion.

    As this is not a fatal issue, consolidate the two DEBUG_VM checks into
    one VM_WARN_ON_ONCE_PAGE().

    [1] https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Link: https://lkml.kernel.org/r/d0f0db68-98b8-ebfb-16dc-f29df24cf012@google.com
    Signed-off-by: Yang Shi
    Reviewed-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Stressing huge tmpfs often crashed on unmap_page()'s VM_BUG_ON_PAGE
    (!unmap_success): with dump_page() showing mapcount:1, but then its raw
    struct page output showing _mapcount ffffffff i.e. mapcount 0.

    And even if that particular VM_BUG_ON_PAGE(!unmap_success) is removed,
    it is immediately followed by a VM_BUG_ON_PAGE(compound_mapcount(head)),
    and further down an IS_ENABLED(CONFIG_DEBUG_VM) total_mapcount BUG():
    all indicative of some mapcount difficulty in development here perhaps.
    But the !CONFIG_DEBUG_VM path handles the failures correctly and
    silently.

    I believe the problem is that once a racing unmap has cleared pte or
    pmd, try_to_unmap_one() may skip taking the page table lock, and emerge
    from try_to_unmap() before the racing task has reached decrementing
    mapcount.

    Instead of abandoning the unsafe VM_BUG_ON_PAGE(), and the ones that
    follow, use PVMW_SYNC in try_to_unmap_one() in this case: adding
    TTU_SYNC to the options, and passing that from unmap_page().

    When CONFIG_DEBUG_VM, or for non-debug too? Consensus is to do the same
    for both: the slight overhead added should rarely matter, except perhaps
    if splitting sparsely-populated multiply-mapped shmem. Once confident
    that bugs are fixed, TTU_SYNC here can be removed, and the race
    tolerated.
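
    A hedged sketch of the two sides of the change:

        /* unmap_page() in mm/huge_memory.c: make the rmap walk synchronous */
        enum ttu_flags ttu_flags = TTU_RMAP_LOCKED | TTU_SPLIT_HUGE_PMD |
                                   TTU_SYNC;

        /* try_to_unmap_one() in mm/rmap.c: honour TTU_SYNC via the pvmw walk */
        if (flags & TTU_SYNC)
                pvmw.flags = PVMW_SYNC;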

    Link: https://lkml.kernel.org/r/c1e95853-8bcd-d8fd-55fa-e7f2488e78f@google.com
    Fixes: fec89c109f3a ("thp: rewrite freeze_page()/unfreeze_page() with generic rmap walkers")
    Signed-off-by: Hugh Dickins
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: Kirill A. Shutemov
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Yang Shi
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Most callers of is_huge_zero_pmd() supply a pmd already verified
    present; but a few (notably zap_huge_pmd()) do not - it might be a pmd
    migration entry, in which the pfn is encoded differently from a present
    pmd: which might pass the is_huge_zero_pmd() test (though not on x86,
    since L1TF forced us to protect against that); or perhaps even crash in
    pmd_page() applied to a swap-like entry.

    Make it safe by adding pmd_present() check into is_huge_zero_pmd()
    itself; and make it quicker by saving huge_zero_pfn, so that
    is_huge_zero_pmd() will not need to do that pmd_page() lookup each time.

    __split_huge_pmd_locked() checked pmd_trans_huge() before: that worked,
    but is unnecessary now that is_huge_zero_pmd() checks present.
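
    The resulting check is roughly the hedged sketch below, with
    huge_zero_pfn cached when the huge zero page is allocated:

        static inline bool is_huge_zero_pmd(pmd_t pmd)
        {
                return pmd_present(pmd) &&
                       READ_ONCE(huge_zero_pfn) == pmd_pfn(pmd);
        }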

    Link: https://lkml.kernel.org/r/21ea9ca-a1f5-8b90-5e88-95fb1c49bbfa@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Alistair Popple
    Cc: Jan Kara
    Cc: Jue Wang
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Cc: Wang Yugui
    Cc: Zi Yan
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Patch series "mm/thp: fix THP splitting unmap BUGs and related", v10.

    Here is v2 batch of long-standing THP bug fixes that I had not got
    around to sending before, but prompted now by Wang Yugui's report
    https://lore.kernel.org/linux-mm/20210412180659.B9E3.409509F4@e16-tech.com/

    Wang Yugui has tested a rollup of these fixes applied to 5.10.39, and
    they have done no harm, but have *not* fixed that issue: something more
    is needed and I have no idea of what.

    This patch (of 7):

    Stressing huge tmpfs page migration racing hole punch often crashed on
    the VM_BUG_ON(!pmd_present) in pmdp_huge_clear_flush(), with DEBUG_VM=y
    kernel; or shortly afterwards, on a bad dereference in
    __split_huge_pmd_locked() when DEBUG_VM=n. They forgot to allow for pmd
    migration entries in the non-anonymous case.

    Full disclosure: those particular experiments were on a kernel with more
    relaxed mmap_lock and i_mmap_rwsem locking, and were not repeated on the
    vanilla kernel: it is conceivable that stricter locking happens to avoid
    those cases, or makes them less likely; but __split_huge_pmd_locked()
    already allowed for pmd migration entries when handling anonymous THPs,
    so this commit brings the shmem and file THP handling into line.

    And while there: use old_pmd rather than _pmd, as in the following
    blocks; and make it clearer to the eye that the !vma_is_anonymous()
    block is self-contained, making an early return after accounting for
    unmapping.

    Link: https://lkml.kernel.org/r/af88612-1473-2eaa-903-8d1a448b26@google.com
    Link: https://lkml.kernel.org/r/dd221a99-efb3-cd1d-6256-7e646af29314@google.com
    Fixes: e71769ae5260 ("mm: enable thp migration for shmem thp")
    Signed-off-by: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Yang Shi
    Cc: Wang Yugui
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Naoya Horiguchi
    Cc: Alistair Popple
    Cc: Ralph Campbell
    Cc: Zi Yan
    Cc: Miaohe Lin
    Cc: Minchan Kim
    Cc: Jue Wang
    Cc: Peter Xu
    Cc: Jan Kara
    Cc: Shakeel Butt
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

07 May, 2021

1 commit

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

06 May, 2021

9 commits

  • The shrinker map management is not purely memcg specific; it sits at
    the intersection between memory cgroups and shrinkers. It is the
    allocation and assignment of a structure, and the only memcg-specific
    bit is that the map is stored in a memcg structure. So move the
    shrinker_maps handling code into vmscan.c for tighter integration with
    the shrinker code, and remove the "memcg_" prefix. There is no
    functional change.

    Link: https://lkml.kernel.org/r/20210311190845.9708-3-shy828301@gmail.com
    Signed-off-by: Yang Shi
    Acked-by: Vlastimil Babka
    Acked-by: Kirill Tkhai
    Acked-by: Roman Gushchin
    Reviewed-by: Shakeel Butt
    Cc: Dave Chinner
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Further extend <debugfs>/split_huge_pages to accept
    "<path>,<off_start>,<off_end>" for file-backed THP split tests, since
    tmpfs may have files backed by THPs that are mapped nowhere.

    Update the selftest program to test file-backed THP split too.

    Link: https://lkml.kernel.org/r/20210331235309.332292-2-zi.yan@sent.com
    Signed-off-by: Zi Yan
    Suggested-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: "Kirill A . Shutemov"
    Cc: Shuah Khan
    Cc: John Hubbard
    Cc: Sandipan Das
    Cc: David Hildenbrand
    Cc: Mika Penttila
    Cc: David Rientjes
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • We did not have a direct user interface for splitting the compound
    page backing a THP, and there is no need for one unless we want to
    expose THP implementation details to users. Make
    <debugfs>/split_huge_pages accept a new command to do that.

    By writing "<pid>,<vaddr_start>,<vaddr_end>" to
    <debugfs>/split_huge_pages, THPs within the given virtual address
    range of the process with the given pid are split. It is used to test
    the split_huge_page function. In addition, a selftest program is
    added to tools/testing/selftests/vm to exercise the interface by
    splitting PMD THPs and PTE-mapped THPs.

    This does not change the old behavior, i.e., writing 1 to the
    interface still splits all THPs in the system.

    Link: https://lkml.kernel.org/r/20210331235309.332292-1-zi.yan@sent.com
    Signed-off-by: Zi Yan
    Reviewed-by: Yang Shi
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: John Hubbard
    Cc: "Kirill A . Shutemov"
    Cc: Matthew Wilcox
    Cc: Mika Penttila
    Cc: Sandipan Das
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • It is preferable to use the helper function migration_entry_to_page()
    to get the page via the migration entry. We also get the PageLocked()
    check there for free.

    Link: https://lkml.kernel.org/r/20210318122722.13135-7-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • The !PageCompound() check limits the page to being a head or tail
    page, while !PageHead() further limits it to a head page only. So the
    !PageHead() check is equivalent here.

    Link: https://lkml.kernel.org/r/20210318122722.13135-5-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • The current code that checks whether migrating a misplaced transhuge
    page is needed is pretty hard to follow. Rework it and add a comment
    to make its logic clearer and improve readability.

    Link: https://lkml.kernel.org/r/20210318122722.13135-4-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Zi Yan
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • It's guaranteed that huge_zero_page will not be NULL if
    huge_zero_refcount is increased successfully.

    When READ_ONCE(huge_zero_page) is returned, there must be a
    huge_zero_page, so the return value can be replaced with 'true' when
    we do not care about the value of huge_zero_page itself.

    We can thus make the function return bool and save the READ_ONCE cpu
    cycles, as the return value is only used to check whether
    huge_zero_page exists.

    Link: https://lkml.kernel.org/r/20210318122722.13135-3-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Zi Yan
    Reviewed-by: Peter Xu
    Cc: Aneesh Kumar K.V
    Cc: Matthew Wilcox
    Cc: Michel Lespinasse
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: William Kucharski
    Cc: Yang Shi
    Cc: yuleixzhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Patch series "Some cleanups for huge_memory", v3.

    This series contains cleanups that rework some function logic to make
    it more readable, use helper functions, and so on. More details can
    be found in the respective changelogs.

    This patch (of 6):

    The current implementation of vma_adjust_trans_huge() contains some
    duplicated code. Add a helper function to get rid of it and make the
    code more succinct.

    Link: https://lkml.kernel.org/r/20210318122722.13135-1-linmiaohe@huawei.com
    Link: https://lkml.kernel.org/r/20210318122722.13135-2-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Peter Xu
    Cc: Zi Yan
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Vlastimil Babka
    Cc: Peter Xu
    Cc: yuleixzhang
    Cc: Michel Lespinasse
    Cc: Aneesh Kumar K.V
    Cc: Ralph Campbell
    Cc: Thomas Hellström (Intel)
    Cc: Yang Shi
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • There is no need to use a new local variable ret2 to get the return
    value of handle_userfault(). Use ret directly to make the code more
    succinct.

    Link: https://lkml.kernel.org/r/20210210072409.60587-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Reviewed-by: Andrew Morton
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     

14 Mar, 2021

2 commits

  • Rename mem_cgroup_split_huge_fixup to split_page_memcg and explicitly
    pass in the number of pages as an argument.

    In this way the interface name is more generic and can be used by
    potential new users. In addition, the complete information (memcg and
    flags) needs to be set on the tail pages.

    Link: https://lkml.kernel.org/r/20210304074053.65527-2-zhouguanghui1@huawei.com
    Signed-off-by: Zhou Guanghui
    Acked-by: Johannes Weiner
    Reviewed-by: Zi Yan
    Reviewed-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Nicholas Piggin
    Cc: Kefeng Wang
    Cc: Hanjun Guo
    Cc: Tianhong Ding
    Cc: Weilong Chen
    Cc: Rui Xiang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhou Guanghui
     
  • We've got quite a few places (pte, pmd, pud) that explicitly check
    whether we should break the COW right now during fork(). It's easier
    to provide a helper, especially before we do the same thing for
    hugetlbfs.

    Since we'll reference is_cow_mapping() in mm.h, move it there too.
    Actually it suits mm.h better, since internal.h is mm/-only while mm.h
    is exported to the whole kernel. With that, we should expect another
    patch to use is_cow_mapping() wherever we can across the kernel, since
    we use it quite a lot but always as raw checks against VM_* flags.

    Link: https://lkml.kernel.org/r/20210217233547.93892-4-peterx@redhat.com
    Signed-off-by: Peter Xu
    Reviewed-by: Jason Gunthorpe
    Cc: Alexey Dobriyan
    Cc: Andrea Arcangeli
    Cc: Christoph Hellwig
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: David Gibson
    Cc: Gal Pressman
    Cc: Jan Kara
    Cc: Jann Horn
    Cc: Kirill Shutemov
    Cc: Kirill Tkhai
    Cc: Matthew Wilcox
    Cc: Miaohe Lin
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Roland Scheidegger
    Cc: VMware Graphics
    Cc: Wei Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Xu
     

27 Feb, 2021

1 commit

  • Patch series "mm,thp,shm: limit shmem THP alloc gfp_mask", v6.

    The allocation flags of anonymous transparent huge pages can be
    controlled through the files in
    /sys/kernel/mm/transparent_hugepage/defrag, which can help keep the
    system from getting bogged down in the page reclaim and compaction
    code when many THPs are being allocated simultaneously.

    However, the gfp_mask for shmem THP allocations was not limited by
    those configuration settings, and some workloads ended up with all
    CPUs stuck on the LRU lock in the page reclaim code, trying to
    allocate dozens of THPs simultaneously.

    This patch applies the same configured limitation to shmem hugepage
    allocations, to prevent that from happening.

    This way a THP defrag setting of "never" or "defer+madvise" will result in
    quick allocation failures without direct reclaim when no 2MB free pages
    are available.

    With this patch applied, THP allocations for tmpfs will be a little more
    aggressive than today for files mmapped with MADV_HUGEPAGE, and a little
    less aggressive for files that are not mmapped or mapped without that
    flag.

    This patch (of 4):

    Controlling the gfp_mask of THP allocations through the knobs in sysfs
    allows users to determine the balance between how aggressively the system
    tries to allocate THPs at fault time, and how much the application may end
    up stalling attempting those allocations.

    Link: https://lkml.kernel.org/r/20201124194925.623931-1-riel@surriel.com
    Link: https://lkml.kernel.org/r/20201124194925.623931-2-riel@surriel.com
    Signed-off-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Xu Yu
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Matthew Wilcox (Oracle)
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

25 Feb, 2021

3 commits

  • Differentiate between hardware not supporting hugepages and the user
    disabling THP via
    'echo never > /sys/kernel/mm/transparent_hugepage/enabled'.

    For a devdax namespace, the kernel handles this via the
    supported_alignment attribute and fails to initialize the namespace if
    the namespace align value is not supported on the platform.

    For an fsdax namespace, the kernel will continue to initialize the
    namespace. This can result in the kernel creating a huge pte entry
    even though the hardware doesn't support it.

    We do want hugepage support with pmem even if the end user disabled
    THP via the sysfs file (/sys/kernel/mm/transparent_hugepage/enabled).
    Hence differentiate between hardware/firmware lacking support and a
    user-controlled disabling of THP, and prevent a huge fault only if the
    hardware lacks hugepage support.

    Link: https://lkml.kernel.org/r/20210205023956.417587-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Dan Williams
    Cc: "Kirill A . Shutemov"
    Cc: Jan Kara
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • The return value of set_huge_zero_page() is always ignored, so drop
    it.

    Link: https://lkml.kernel.org/r/20210203084816.46307-1-linmiaohe@huawei.com
    Signed-off-by: Miaohe Lin
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • When set_pmd_at() is called in do_huge_pmd_anonymous_page(), a new TLB
    entry can be added by software on the MIPS platform.

    Add update_mmu_cache_pmd() where the pmd entry is set.
    update_mmu_cache_pmd() is defined as empty except on the arc/mips
    platforms, so this patch has no effect on platforms other than
    arc/mips.
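
    A hedged sketch of the change in the huge-pmd fault paths:

        set_pmd_at(vma->vm_mm, haddr, vmf->pmd, entry);
        /* No-op on most architectures; needed where the TLB is
         * software-managed (arc/mips). */
        update_mmu_cache_pmd(vma, vmf->address, vmf->pmd);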

    Link: http://lkml.kernel.org/r/1592990792-1923-2-git-send-email-maobibo@loongson.cn
    Signed-off-by: Bibo Mao
    Cc: Anshuman Khandual
    Cc: Daniel Silsby
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Paul Burton
    Cc: Ralf Baechle
    Cc: Thomas Bogendoerfer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bibo Mao