30 Dec, 2020

1 commit

  • [ Upstream commit 013339df116c2ee0d796dd8bfb8f293a2030c063 ]

    Since commit 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic
    v2"), the code to check the secondary MMU's page table access bit is
    broken for !(TTU_IGNORE_ACCESS) because the page is unmapped from the
    secondary MMU's page table before the check. This specifically affects
    secondary MMUs which unmap the memory in
    mmu_notifier_invalidate_range_start(), like KVM.

    However, memory reclaim is the only user of !(TTU_IGNORE_ACCESS), i.e.
    the absence of TTU_IGNORE_ACCESS, and it explicitly performs the page
    table access check before trying to unmap the page. So, at worst, the
    reclaim will miss accesses in a very short window if we remove the page
    table access check from the unmapping code.

    There is an unintended consequence of !(TTU_IGNORE_ACCESS) for memcg
    reclaim. In memcg reclaim, page_referenced() only accounts for accesses
    from processes in the same memcg as the target page, but the unmapping
    code considers accesses from all processes, decreasing the effectiveness
    of memcg reclaim.

    The simplest solution is to always assume TTU_IGNORE_ACCESS in unmapping
    code.
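
    For reference, a sketch of the kind of young-bit test that becomes
    unnecessary in try_to_unmap_one() once TTU_IGNORE_ACCESS is always
    assumed (the structure is illustrative, not the exact diff):

    if (!(flags & TTU_IGNORE_ACCESS)) {
            /* page recently referenced through this mapping: give up */
            if (ptep_clear_flush_young_notify(vma, address, pvmw.pte)) {
                    ret = false;
                    page_vma_mapped_walk_done(&pvmw);
                    break;
            }
    }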

    Link: https://lkml.kernel.org/r/20201104231928.1494083-1-shakeelb@google.com
    Fixes: 369ea8242c0f ("mm/rmap: update to new mmu_notifier semantic v2")
    Signed-off-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Shakeel Butt
     

15 Nov, 2020

1 commit

  • Qian Cai reported the following BUG in [1]

    LTP: starting move_pages12
    BUG: unable to handle page fault for address: ffffffffffffffe0
    ...
    RIP: 0010:anon_vma_interval_tree_iter_first+0xa2/0x170 avc_start_pgoff at mm/interval_tree.c:63
    Call Trace:
    rmap_walk_anon+0x141/0xa30 rmap_walk_anon at mm/rmap.c:1864
    try_to_unmap+0x209/0x2d0 try_to_unmap at mm/rmap.c:1763
    migrate_pages+0x1005/0x1fb0
    move_pages_and_store_status.isra.47+0xd7/0x1a0
    __x64_sys_move_pages+0xa5c/0x1100
    do_syscall_64+0x5f/0x310
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Hugh Dickins diagnosed this as a migration bug caused by code introduced
    to use i_mmap_rwsem for pmd sharing synchronization. Specifically, the
    routine unmap_and_move_huge_page() is always passing the TTU_RMAP_LOCKED
    flag to try_to_unmap() while holding i_mmap_rwsem. This is wrong for
    anon pages as the anon_vma_lock should be held in this case. Further
    analysis suggested that i_mmap_rwsem was not required to be held at all
    when calling try_to_unmap for anon pages as an anon page could never be
    part of a shared pmd mapping.

    Discussion also revealed that the hack in hugetlb_page_mapping_lock_write
    to drop page lock and acquire i_mmap_rwsem is wrong. There is no way to
    keep mapping valid while dropping page lock.

    This patch does the following:

    - Do not take i_mmap_rwsem and set TTU_RMAP_LOCKED for anon pages when
    calling try_to_unmap.

    - Remove the hacky code in hugetlb_page_mapping_lock_write. The routine
    will now simply do a 'trylock' while still holding the page lock. If
    the trylock fails, it will return NULL. This could impact the
    callers:

    - migration calling code will receive -EAGAIN and retry up to the
    hard coded limit (10).

    - memory error code will treat the page as BUSY. This will force
    mapping tasks to be killed (SIGKILL) instead of receiving SIGBUS.

    Do note that this change in behavior only happens when there is a
    race. None of the standard kernel testing suites actually hit this
    race, but it is possible.
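
    Roughly, the simplified routine becomes the following (a sketch of the
    behavior described above, not the exact code):

    struct address_space *hugetlb_page_mapping_lock_write(struct page *hpage)
    {
            struct address_space *mapping = page_mapping(hpage);

            if (!mapping)
                    return NULL;

            /* The page lock is still held; only try-lock i_mmap_rwsem. */
            if (i_mmap_trylock_write(mapping))
                    return mapping;         /* locked for write */

            return NULL;    /* contended: caller retries or treats page as busy */
    }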

    [1] https://lore.kernel.org/lkml/20200708012044.GC992@lca.pw/
    [2] https://lore.kernel.org/linux-mm/alpine.LSU.2.11.2010071833100.2214@eggly.anvils/

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Reported-by: Qian Cai
    Suggested-by: Hugh Dickins
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc:
    Link: https://lkml.kernel.org/r/20201105195058.78401-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

17 Oct, 2020

1 commit

  • Ask the page what size it is instead of assuming it's PMD size. Do this
    for anon pages as well as file pages for when someone decides to support
    that. Leave the assumption alone for pages which are PMD mapped; we don't
    currently grow THPs beyond PMD size, so we don't need to change this code
    yet.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-9-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

06 Sep, 2020

1 commit

  • During memory migration a pte is temporarily replaced with a migration
    swap pte. Some pte bits from the existing mapping such as the soft-dirty
    and uffd write-protect bits are preserved by copying these to the
    temporary migration swap pte.

    However these bits are not stored at the same location for swap and
    non-swap ptes. Therefore testing these bits requires using the
    appropriate helper function for the given pte type.

    Unfortunately several code locations were found where the wrong helper
    function is being used to test soft_dirty and uffd_wp bits which leads to
    them getting incorrectly set or cleared during page-migration.

    Fix these by using the correct tests based on pte type.
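
    As a hedged illustration of the distinction (the pte_* accessors are the
    standard helpers; the wrapper functions below are hypothetical):

    /* The soft-dirty/uffd-wp bits live in different places for present
     * ptes and swap/migration ptes, so the test must match the pte type. */
    static bool demo_pte_soft_dirty(pte_t pte)
    {
            if (is_swap_pte(pte))
                    return pte_swp_soft_dirty(pte); /* swap entry layout */
            return pte_soft_dirty(pte);             /* present pte layout */
    }

    static bool demo_pte_uffd_wp(pte_t pte)
    {
            if (is_swap_pte(pte))
                    return pte_swp_uffd_wp(pte);
            return pte_uffd_wp(pte);
    }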

    Fixes: a5430dda8a3a ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Fixes: 8c3328f1f36a ("mm/migrate: migrate_vma() unmap page from vma while collecting pages")
    Fixes: f45ec5ff16a7 ("userfaultfd: wp: support swap and page migration")
    Signed-off-by: Alistair Popple
    Signed-off-by: Andrew Morton
    Reviewed-by: Peter Xu
    Cc: Jérôme Glisse
    Cc: John Hubbard
    Cc: Ralph Campbell
    Cc: Alistair Popple
    Cc:
    Link: https://lkml.kernel.org/r/20200825064232.10023-2-alistair@popple.id.au
    Signed-off-by: Linus Torvalds

    Alistair Popple
     

15 Aug, 2020

2 commits

  • mm->tlb_flush_batched could be accessed concurrently as noticed by
    KCSAN,

    BUG: KCSAN: data-race in flush_tlb_batched_pending / try_to_unmap_one

    write to 0xffff93f754880bd0 of 1 bytes by task 822 on cpu 6:
    try_to_unmap_one+0x59a/0x1ab0
    set_tlb_ubc_flush_pending at mm/rmap.c:635
    (inlined by) try_to_unmap_one at mm/rmap.c:1538
    rmap_walk_anon+0x296/0x650
    rmap_walk+0xdf/0x100
    try_to_unmap+0x18a/0x2f0
    shrink_page_list+0xef6/0x2870
    shrink_inactive_list+0x316/0x880
    shrink_lruvec+0x8dc/0x1380
    shrink_node+0x317/0xd80
    balance_pgdat+0x652/0xd90
    kswapd+0x396/0x8d0
    kthread+0x1e0/0x200
    ret_from_fork+0x27/0x50

    read to 0xffff93f754880bd0 of 1 bytes by task 6364 on cpu 4:
    flush_tlb_batched_pending+0x29/0x90
    flush_tlb_batched_pending at mm/rmap.c:682
    change_p4d_range+0x5dd/0x1030
    change_pte_range at mm/mprotect.c:44
    (inlined by) change_pmd_range at mm/mprotect.c:212
    (inlined by) change_pud_range at mm/mprotect.c:240
    (inlined by) change_p4d_range at mm/mprotect.c:260
    change_protection+0x222/0x310
    change_prot_numa+0x3e/0x60
    task_numa_work+0x219/0x350
    task_work_run+0xed/0x140
    prepare_exit_to_usermode+0x2cc/0x2e0
    ret_from_intr+0x32/0x42

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 4 PID: 6364 Comm: mtest01 Tainted: G W L 5.5.0-next-20200210+ #5
    Hardware name: HPE ProLiant DL385 Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019

    flush_tlb_batched_pending() runs under the PTL but the write does not;
    however, mm->tlb_flush_batched is only a bool, so the value is unlikely
    to be torn by the race. Thus, mark it as an intentional data race by
    using the data_race() macro.
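
    A minimal sketch of the annotation using the generic data_race() macro
    (the exact call sites are illustrative):

    /* writer side (set_tlb_ubc_flush_pending), not under the PTL */
    data_race(mm->tlb_flush_batched = true);

    /* reader side (flush_tlb_batched_pending), called under the PTL */
    if (data_race(mm->tlb_flush_batched)) {
            flush_tlb_mm(mm);
            mm->tlb_flush_batched = false;
    }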

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Cc: Marco Elver
    Link: http://lkml.kernel.org/r/1581450783-8262-1-git-send-email-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
     
  • The thp prefix is more frequently used than hpage and we should be
    consistent between the various functions.

    [akpm@linux-foundation.org: fix mm/migrate.c]

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

1 commit

  • Commit c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
    synchronization") requires callers of huge_pte_alloc to hold i_mmap_rwsem
    in at least read mode. This is because the explicit locking in
    huge_pmd_share (called by huge_pte_alloc) was removed. When restructuring
    the code, the call to huge_pte_alloc in the else block at the beginning of
    hugetlb_fault was missed.

    Unfortunately, that else clause is exercised when there is no page table
    entry. This will likely lead to a call to huge_pmd_share. If
    huge_pmd_share thinks pmd sharing is possible, it will traverse the
    mapping tree (i_mmap) without holding i_mmap_rwsem. If someone else is
    modifying the tree, bad things such as addressing exceptions or worse
    could happen.

    Simply remove the else clause. It should have been removed previously.
    The code following the else will call huge_pte_alloc with the appropriate
    locking.

    To prevent this type of issue in the future, add routines to assert that
    i_mmap_rwsem is held, and call these routines in huge pmd sharing
    routines.
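
    A sketch of what such assertion helpers can look like with lockdep
    (names and exact form are assumptions, not necessarily the routines
    added by the patch):

    static inline void i_mmap_assert_locked(struct address_space *mapping)
    {
            lockdep_assert_held(&mapping->i_mmap_rwsem);
    }

    static inline void i_mmap_assert_write_locked(struct address_space *mapping)
    {
            lockdep_assert_held_write(&mapping->i_mmap_rwsem);
    }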

    Fixes: c0d0381ade79 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
    Suggested-by: Matthew Wilcox
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A.Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc:
    Link: http://lkml.kernel.org/r/e670f327-5cf9-1959-96e4-6dc7cc30d3d5@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

10 Jun, 2020

1 commit

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

2 commits

  • With rmap memcg locking already in place for NR_ANON_MAPPED, it's just a
    small step to remove the MEMCG_RSS_HUGE wart and switch memcg to the
    native NR_ANON_THPS accounting sites.

    [hannes@cmpxchg.org: fixes]
    Link: http://lkml.kernel.org/r/20200512121750.GA397968@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Tested-by: Naresh Kamboju
    Reviewed-by: Joonsoo Kim
    Acked-by: Randy Dunlap [build-tested]
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-12-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.
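
    A rough sketch of the rmap-side accounting under lock_page_memcg(),
    mirroring how NR_FILE_MAPPED is handled (the wrapper is hypothetical and
    the counter math is simplified):

    static void demo_add_anon_rmap(struct page *page)
    {
            lock_page_memcg(page);          /* stabilize page->mem_cgroup */
            if (atomic_inc_and_test(&page->_mapcount))
                    __mod_lruvec_page_state(page, NR_ANON_MAPPED, 1);
            unlock_page_memcg(page);
    }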

    v2: fix temporary accounting bug by switching rmap<->commit order (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

08 Apr, 2020

4 commits

    I recently built the RISC-V port with LLVM trunk, which has introduced a
    new warning when casting from a pointer to an enum of a smaller size.
    This patch simply casts to a long in the middle to stop the warning. I'd
    be surprised if this is the only such cast in the kernel, but it's the
    only one I saw.
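
    A self-contained illustration of the pattern (the enum here is made up;
    the point is the intermediate cast through long):

    enum demo_flags { DEMO_A = 1, DEMO_B = 2 };     /* hypothetical */

    static void *demo_pack(enum demo_flags f)
    {
            return (void *)(long)f;                 /* widen to long first */
    }

    static enum demo_flags demo_unpack(void *arg)
    {
            /* Casting the pointer directly to the enum trips the LLVM
             * warning about casts to a smaller type; go through long. */
            return (enum demo_flags)(long)arg;
    }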

    Signed-off-by: Palmer Dabbelt
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
    Signed-off-by: Linus Torvalds

    Palmer Dabbelt
     
    For both swap and page migration, we use bit 2 of the entry to identify
    whether the entry is uffd write-protected. It plays a similar role to
    the existing soft-dirty bit in swap entries, but only for keeping the
    uffd-wp tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry to the PTE bit we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared, otherwise even with the
    _PAGE_UFFD_WP bit we can't trap it at all.

    In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
    That can lead to data mismatch if the page that we are going to write
    protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
    also applies/removes the uffd-wp bit even for the swap entries.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • This reverts commit 4e4a9eb921332b9d1 ("mm/rmap.c: reuse mergeable
    anon_vma as parent when fork").

    In dup_mmap(), anon_vma_fork() is called to attach an anon_vma, and the
    parameter 'tmp' (i.e., the new vma of the child) has the same ->vm_next
    and ->vm_prev as its parent vma. That causes the anon_vma used by the
    parent to be mistakenly shared with the child (in anon_vma_clone(), the
    code added by that commit does this reuse work).

    Besides this issue, the design of reusing an anon_vma from a vma which
    has gone through fork should be avoided ([1]). So, this patch reverts
    that commit and keeps the logic of reusing anon_vma consistent for
    fork/split/merge of vmas.

    Reusing an anon_vma within the process is fine. But if a vma has gone
    through fork(), then that vma's anon_vma should not be shared with its
    neighbor vma. As explained in [1], when a vma has gone through fork(),
    the check list_is_singular(vma->anon_vma_chain) will be false, and the
    anon_vma will not be shared.

    With the current issue, an example clarifies this further. The parent
    process does the following two steps:

    1. p_vma_1 is created and p_anon_vma_1 is prepared;

    2. p_vma_2 is created and shares p_anon_vma_1 (this is allowed, because
    p_vma_1 didn't go through fork()); then the parent process does fork():

    3. c_vma_1 is dup'd from p_vma_1, and has its own c_anon_vma_1 prepared;
    at this point, c_vma_1->anon_vma_chain has two items, one for
    p_anon_vma_1 and one for c_anon_vma_1;

    4. c_vma_2 is dup'd from p_vma_2; it is not allowed to share
    c_anon_vma_1, because c_vma_1->anon_vma_chain has two items.

    [1] commit d0e9fe1758f2 ("Simplify and comment on anon_vma re-use for
    anon_vma_prepare()") explains the test of "list_is_singular()".

    Fixes: 4e4a9eb92133 ("mm/rmap.c: reuse mergeable anon_vma as parent when fork")
    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Matthew Wilcox
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/1581150928-3214-3-git-send-email-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
     

03 Apr, 2020

3 commits

  • Patch series "hugetlbfs: use i_mmap_rwsem for more synchronization", v2.

    While discussing the issue with huge_pte_offset [1], I remembered that
    there were more outstanding hugetlb races. These issues are:

    1) For shared pmds, huge PTE pointers returned by huge_pte_alloc can become
    invalid via a call to huge_pmd_unshare by another thread.
    2) hugetlbfs page faults can race with truncation causing invalid global
    reserve counts and state.

    A previous attempt was made to use i_mmap_rwsem in this manner as
    described at [2]. However, those patches were reverted starting with [3]
    due to locking issues.

    To effectively use i_mmap_rwsem to address the above issues it needs to be
    held (in read mode) during page fault processing. However, during fault
    processing we need to lock the page we will be adding. Lock ordering
    requires we take page lock before i_mmap_rwsem. Waiting until after
    taking the page lock is too late in the fault process for the
    synchronization we want to do.

    To address this lock ordering issue, the following patches change the lock
    ordering for hugetlb pages. This is not too invasive as hugetlbfs
    processing is done separate from core mm in many places. However, I don't
    really like this idea. Much ugliness is contained in the new routine
    hugetlb_page_mapping_lock_write() of patch 1.

    The only other way I can think of to address these issues is by catching
    all the races. After catching a race, cleanup, backout, retry ... etc,
    as needed. This can get really ugly, especially for huge page
    reservations. At one time, I started writing some of the reservation
    backout code for page faults and it got so ugly and complicated I went
    down the path of adding synchronization to avoid the races. Any other
    suggestions would be welcome.

    [1] https://lore.kernel.org/linux-mm/1582342427-230392-1-git-send-email-longpeng2@huawei.com/
    [2] https://lore.kernel.org/linux-mm/20181222223013.22193-1-mike.kravetz@oracle.com/
    [3] https://lore.kernel.org/linux-mm/20190103235452.29335-1-mike.kravetz@oracle.com
    [4] https://lore.kernel.org/linux-mm/1584028670.7365.182.camel@lca.pw/
    [5] https://lore.kernel.org/lkml/20200312183142.108df9ac@canb.auug.org.au/

    This patch (of 2):

    While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:
    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with
    the ptep.
    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is called.

    One problem with this scheme is that it requires taking i_mmap_rwsem
    before taking the page lock during page faults. This is not the order
    specified in the rest of mm code. Handling of hugetlbfs pages is mostly
    isolated today. Therefore, we use this alternative locking order for
    PageHuge() pages.

    mapping->i_mmap_rwsem
    hugetlb_fault_mutex (hugetlbfs specific page fault mutex)
    page->flags PG_locked (lock_page)

    To help with lock ordering issues, hugetlb_page_mapping_lock_write() is
    introduced to write lock the i_mmap_rwsem associated with a page.

    In most cases it is easy to get address_space via vma->vm_file->f_mapping.
    However, in the case of migration or memory errors for anon pages we do
    not have an associated vma. A new routine _get_hugetlb_page_mapping()
    will use anon_vma to get address_space in these cases.
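
    A simplified sketch of the resulting fault-path ordering (error handling
    and the fault-mutex details are omitted; treat this as pseudocode for
    the locking rules above):

    struct address_space *mapping = vma->vm_file->f_mapping;
    struct hstate *h = hstate_vma(vma);
    pte_t *ptep;

    i_mmap_lock_read(mapping);              /* 1. i_mmap_rwsem (read) */
    ptep = huge_pte_alloc(mm, address, huge_page_size(h));
    /* 2. hugetlb_fault_mutex, 3. lock_page(), then handle the fault;
     * the shared pmd cannot be unshared while the semaphore is held. */
    i_mmap_unlock_read(mapping);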

    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Link: http://lkml.kernel.org/r/20200316205756.146666-2-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Currently the declaration and definition for is_vma_temporary_stack()
    are scattered. Let's make is_vma_temporary_stack() available for general
    use and also drop the declaration from include/linux/huge_mm.h, which is
    no longer required. While at it, rename it to vma_is_temporary_stack()
    in line with existing helpers. This should not cause any functional
    change.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS
    scheme tends to overflow too easily, each tail page increments the head
    page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number
    of huge pages that can be pinned.

    This patch removes that limitation, by using an exact form of pin counting
    for compound pages of order > 1. The "order > 1" is required because this
    approach uses the 3rd struct page in the compound page, and order 1
    compound pages only have two pages, so that won't work there.

    A new struct page field, hpage_pinned_refcount, has been added, replacing
    a padding field in the union (so no new space is used).

    This enhancement also has a useful side effect: huge pages and compound
    pages (of order > 1) do not suffer from the "potential false positives"
    problem that is discussed in the page_dma_pinned() comment block. That is
    because these compound pages have extra space for tracking things, so they
    get exact pin counts instead of overloading page->_refcount.

    Documentation/core-api/pin_user_pages.rst is updated accordingly.
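
    A sketch of the pin bookkeeping split (the helper below is hypothetical,
    not the actual gup code):

    static void demo_record_pin(struct page *head)
    {
            if (PageCompound(head) && compound_order(head) > 1) {
                    /* exact pin count, stored in the 3rd struct page */
                    atomic_inc(compound_pincount_ptr(head));
                    page_ref_inc(head);     /* plus one ordinary reference */
            } else {
                    /* small pages: overload _refcount with a large bias */
                    page_ref_add(head, GUP_PIN_COUNTING_BIAS);
            }
    }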

    Suggested-by: Jan Kara
    Signed-off-by: John Hubbard
    Signed-off-by: Andrew Morton
    Reviewed-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ira Weiny
    Cc: Jérôme Glisse
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Dan Williams
    Cc: Dave Chinner
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com
    Signed-off-by: Linus Torvalds

    John Hubbard
     

02 Dec, 2019

1 commit

  • Adding fully unmapped pages into deferred split queue is not productive:
    these pages are about to be freed or they are pinned and cannot be split
    anyway.

    Link: http://lkml.kernel.org/r/20190913091849.11151-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Yang Shi
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Dec, 2019

4 commits

    __page_check_anon_rmap() just calls two BUG_ON()s protected by
    CONFIG_DEBUG_VM; the #ifdef can be eliminated by using VM_BUG_ON_PAGE().
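
    For illustration, the pattern being cleaned up (the condition shown is
    one plausible example of such an assertion):

    /* Before: needs the preprocessor guard */
    #ifdef CONFIG_DEBUG_VM
            BUG_ON(page_to_pgoff(page) != linear_page_index(vma, address));
    #endif

    /* After: compiled away automatically when CONFIG_DEBUG_VM is off */
    VM_BUG_ON_PAGE(page_to_pgoff(page) != linear_page_index(vma, address), page);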

    Link: http://lkml.kernel.org/r/1573157346-111316-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Replace DESTROY_BY_RCU with SLAB_TYPESAFE_BY_RCU because
    SLAB_DESTROY_BY_RCU has been renamed to SLAB_TYPESAFE_BY_RCU by commit
    5f0d5a3ae7cf ("mm: Rename SLAB_DESTROY_BY_RCU to SLAB_TYPESAFE_BY_RCU")

    Link: http://lkml.kernel.org/r/20191017093554.22562-1-miles.chen@mediatek.com
    Signed-off-by: Miles Chen
    Cc: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miles Chen
     
    In __anon_vma_prepare(), we try to find an existing anon_vma and reuse
    it if possible, while on fork the logic is different.

    Since commit 5beb49305251 ("mm: change anon_vma linking to fix
    multi-process server scalability issue"), anon_vma_clone() allocates a
    new anon_vma for the child process. But the logic here allocates a new
    anon_vma for each vma, even when in the parent this vma is mergeable and
    shares the same anon_vma with its sibling. This may help the scalability
    issue, but it is not necessary, especially after the interval tree is
    used.

    Commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
    tries to reuse some anon_vmas by counting child anon_vmas and attached
    vmas. For mergeable anon_vmas, we can simply reuse them without going
    through that logic.

    After this change, a kernel build test shows anon_vma allocation reduced
    by 20%, and sys time in the same test reduced by 11.6%.

    Origin:

    real 2m50.467s
    user 17m52.002s
    sys 1m51.953s

    real 2m48.662s
    user 17m55.464s
    sys 1m50.553s

    real 2m51.143s
    user 17m59.687s
    sys 1m53.600s

    Patched:

    real 2m39.933s
    user 17m1.835s
    sys 1m38.802s

    real 2m39.321s
    user 17m1.634s
    sys 1m39.206s

    real 2m39.575s
    user 17m1.420s
    sys 1m38.845s

    Link: http://lkml.kernel.org/r/20191011072256.16275-2-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Acked-by: Konstantin Khlebnikov
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Qian Cai
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
    Before commit 7a3ef208e662 ("mm: prevent endless growth of anon_vma
    hierarchy"), anon_vma_clone() didn't change dst->anon_vma. After this
    commit, anon_vma_clone() tries to reuse an existing one on forking.

    But that commit goes a little further than needed for the non-fork case.
    anon_vma_clone() is called from __vma_split(), __split_vma(), copy_vma()
    and anon_vma_fork(). For the first three callers, the purpose is to get
    a copy of src, and we don't expect to touch dst->anon_vma even if it is
    NULL.

    After that commit, however, it is possible to reuse an anon_vma when
    dst->anon_vma is NULL. This is not what we intend.

    This patch stops reuse of anon_vma for non-fork cases.

    Link: http://lkml.kernel.org/r/20191011072256.16275-1-richardw.yang@linux.intel.com
    Fixes: 7a3ef208e662 ("mm: prevent endless growth of anon_vma hierarchy")
    Signed-off-by: Wei Yang
    Acked-by: Konstantin Khlebnikov
    Cc: Kirill A. Shutemov
    Cc: "Jérôme Glisse"
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Qian Cai
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

19 Oct, 2019

1 commit

    Include the header that provides the definition of is_vma_temporary_stack
    to fix the following sparse warning:

    mm/rmap.c:1673:6: warning: symbol 'is_vma_temporary_stack' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20191009151155.27763-1-ben.dooks@codethink.co.uk
    Signed-off-by: Ben Dooks
    Reviewed-by: Qian Cai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ben Dooks
     

25 Sep, 2019

4 commits

  • This patch is (hopefully) the first step to enable THP for non-shmem
    filesystems.

    This patch enables an application to put part of its text sections to THP
    via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

    We tried to reuse the logic for THP on tmpfs.

    Currently, write is not supported for non-shmem THP. khugepaged will only
    process vma with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
    (see ksys_mmap_pgoff). The only way to create vma with VM_DENYWRITE is
    execve(). This requirement limits non-shmem THP to text sections.

    The next patch will handle writes, which would only happen when all the
    vmas with VM_DENYWRITE are unmapped.

    An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
    feature.

    [songliubraving@fb.com: fix build without CONFIG_SHMEM]
    Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
    [songliubraving@fb.com: fix double unlock in collapse_file()]
    Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
    Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
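
    For example, both expressions below give the number of base pages in a
    (possibly compound) page:

    unsigned long nr_old = 1UL << compound_order(page);    /* old spelling */
    unsigned long nr_new = compound_nr(page);              /* new helper   */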

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Patch series "Make working with compound pages easier", v2.

    These three patches add three helpers and convert the appropriate
    places to use them.

    This patch (of 3):

    It's unnecessarily hard to find out the size of a potentially huge page.
    Replace 'PAGE_SIZE << compound_order(page)' with page_size(page).
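
    For example, both expressions below give the total size in bytes of a
    (possibly huge) page:

    unsigned long sz_old = PAGE_SIZE << compound_order(page);  /* old */
    unsigned long sz_new = page_size(page);                    /* new */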

    Link: http://lkml.kernel.org/r/20190721104612.19120-2-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Fixes gcc '-Wunused-but-set-variable' warning:

    mm/rmap.c: In function page_mkclean_one:
    mm/rmap.c:906:17: warning: variable cstart set but not used [-Wunused-but-set-variable]

    It is not used any more since
    commit cdb07bdea28e ("mm/rmap.c: remove redundant variable cend")

    Link: http://lkml.kernel.org/r/20190724141453.38536-1-yuehaibing@huawei.com
    Signed-off-by: YueHaibing
    Reported-by: Hulk Robot
    Reviewed-by: Mike Kravetz
    Reviewed-by: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    YueHaibing
     

14 Aug, 2019

1 commit

  • When migrating an anonymous private page to a ZONE_DEVICE private page,
    the source page->mapping and page->index fields are copied to the
    destination ZONE_DEVICE struct page and the page_mapcount() is
    increased. This is so rmap_walk() can be used to unmap and migrate the
    page back to system memory.

    However, try_to_unmap_one() computes the subpage pointer from a swap
    pte, which yields an invalid page pointer, and a kernel panic results,
    such as:

    BUG: unable to handle page fault for address: ffffea1fffffffc8

    Currently, only single pages can be migrated to device private memory so
    no subpage computation is needed and it can be set to "page".

    [rcampbell@nvidia.com: add comment]
    Link: http://lkml.kernel.org/r/20190724232700.23327-4-rcampbell@nvidia.com
    Link: http://lkml.kernel.org/r/20190719192955.30462-4-rcampbell@nvidia.com
    Fixes: a5430dda8a3a1c ("mm/migrate: support un-addressable ZONE_DEVICE page in migration")
    Signed-off-by: Ralph Campbell
    Cc: "Jérôme Glisse"
    Cc: "Kirill A. Shutemov"
    Cc: Mike Kravetz
    Cc: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Andrea Arcangeli
    Cc: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Ira Weiny
    Cc: Jan Kara
    Cc: Lai Jiangshan
    Cc: Logan Gunthorpe
    Cc: Martin Schwidefsky
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

15 May, 2019

4 commits

    We already have pra.mapcount, so there is no need to call page_mapped(),
    which may do some complicated computation for a compound page.

    Link: http://lkml.kernel.org/r/20190404054828.2731-1-sjhuang@iluvatar.ai
    Signed-off-by: Huang Shijie
    Acked-by: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
    This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also
    as a result of kernel activities (memory compression, reclaim,
    migration, ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the
    change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifiers to inspect the new vma page protection.

    The MMU_NOTIFY_UNMAP event is always the safe default, as users of mmu
    notifiers should assume that every mapping for the range is going away
    when that event happens. A later patch converts the mm call paths to use
    a more appropriate event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    <...
    -mmu_notifier_range_init(E1, E2, E3, E4)
    +mmu_notifier_range_init(E1, MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    <...
    -mmu_notifier_range_init(E1, E2, E3, E4)
    +mmu_notifier_range_init(E1, MMU_NOTIFY_UNMAP, 0, VMA, E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    <...
    -mmu_notifier_range_init(E1, E2, E3, E4)
    +mmu_notifier_range_init(E1, MMU_NOTIFY_UNMAP, 0, NULL, E2, E3, E4)
    ...>
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • MADV_DONTNEED is handled with mmap_sem taken in read mode. We call
    page_mkclean without holding mmap_sem.

    MADV_DONTNEED implies that pages in the region are unmapped and
    subsequent access to the pages in that range is handled as a new page
    fault. This implies that if we don't have parallel access to the region
    when MADV_DONTNEED is run, we expect those ranges to be unallocated.

    With respect to page_mkclean(), we need to make sure that we don't break
    the MADV_DONTNEED semantics. MADV_DONTNEED checks for pmd_none without
    holding the pmd lock. This implies we skip the pmd if we temporarily
    mark the pmd none. Avoid doing that while marking the page clean.

    Keep the sequence the same for dax too, even though we don't support
    MADV_DONTNEED for dax mappings.

    The bug was noticed by code review and I didn't observe any failures w.r.t
    test run. This is similar to

    commit 58ceeb6bec86d9140f9d91d71a710e963523d063
    Author: Kirill A. Shutemov
    Date: Thu Apr 13 14:56:26 2017 -0700

    thp: fix MADV_DONTNEED vs. MADV_FREE race

    commit ced108037c2aa542b3ed8b7afd1576064ad1362a
    Author: Kirill A. Shutemov
    Date: Thu Apr 13 14:56:20 2017 -0700

    thp: fix MADV_DONTNEED vs. numa balancing race

    Link: http://lkml.kernel.org/r/20190321040610.14226-1-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Andrew Morton
    Cc: Dan Williams
    Cc:"Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

06 Mar, 2019

1 commit

  • We have common pattern to access lru_lock from a page pointer:
    zone_lru_lock(page_zone(page))

    Which is silly, because it unfolds to this:
    &NODE_DATA(page_to_nid(page))->node_zones[page_zonenum(page)]->zone_pgdat->lru_lock
    while we can simply do
    &NODE_DATA(page_to_nid(page))->lru_lock

    Remove the zone_lru_lock() function, since it only complicates things.
    Use the 'page_pgdat(page)->lru_lock' pattern instead.
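
    For example (a sketch, assuming the pgdat-based lru_lock of this era):

    pg_data_t *pgdat = page_pgdat(page);

    spin_lock_irq(&pgdat->lru_lock);
    /* ... add or remove the page on its LRU list ... */
    spin_unlock_irq(&pgdat->lru_lock);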

    [aryabinin@virtuozzo.com: a slightly better version of __split_huge_page()]
    Link: http://lkml.kernel.org/r/20190301121651.7741-1-aryabinin@virtuozzo.com
    Link: http://lkml.kernel.org/r/20190228083329.31892-2-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Rik van Riel
    Cc: William Kucharski
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

10 Jan, 2019

1 commit

  • The conversion to use a structure for mmu_notifier_invalidate_range_*()
    unintentionally changed the usage in try_to_unmap_one() to init the
    'struct mmu_notifier_range' with vma->vm_start instead of @address,
    i.e. it invalidates the wrong address range. Revert to the correct
    address range.

    Manifests as KVM use-after-free WARNINGs and subsequent "BUG: Bad page
    state in process X" errors when reclaiming from a KVM guest due to KVM
    removing the wrong pages from its own mappings.

    Reported-by: leozinho29_eu@hotmail.com
    Reported-by: Mike Galbraith
    Reported-and-tested-by: Adam Borowski
    Reviewed-by: Jérôme Glisse
    Reviewed-by: Pankaj gupta
    Cc: Christian König
    Cc: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krčmář
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Cc: Andrew Morton
    Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
    Signed-off-by: Sean Christopherson
    Signed-off-by: Linus Torvalds

    Sean Christopherson
     

09 Jan, 2019

1 commit

    This reverts commit b43a9990055958e70347c56f90ea2ae32c67334c.

    The reverted commit caused issues with migration and poisoning of anon
    huge pages. With it, the LTP move_pages12 test triggers an "unable to
    handle kernel NULL pointer" BUG with a stack similar to:

    RIP: 0010:down_write+0x1b/0x40
    Call Trace:
    migrate_pages+0x81f/0xb90
    __ia32_compat_sys_migrate_pages+0x190/0x190
    do_move_pages_to_node.isra.53.part.54+0x2a/0x50
    kernel_move_pages+0x566/0x7b0
    __x64_sys_move_pages+0x24/0x30
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The purpose of the reverted patch was to fix some long existing races
    with huge pmd sharing. It used i_mmap_rwsem for this purpose with the
    idea that this could also be used to address truncate/page fault races
    with another patch. Further analysis has determined that i_mmap_rwsem
    can not be used to address all these hugetlbfs synchronization issues.
    Therefore, revert this patch while working an another approach to the
    underlying issues.

    Link: http://lkml.kernel.org/r/20190103235452.29335-2-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reported-by: Jan Stancek
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

29 Dec, 2018

3 commits

  • While looking at BUGs associated with invalid huge page map counts, it was
    discovered and observed that a huge pte pointer could become 'invalid' and
    point to another task's page table. Consider the following:

    A task takes a page fault on a shared hugetlbfs file and calls
    huge_pte_alloc to get a ptep. Suppose the returned ptep points to a
    shared pmd.

    Now, another task truncates the hugetlbfs file. As part of truncation, it
    unmaps everyone who has the file mapped. If the range being truncated is
    covered by a shared pmd, huge_pmd_unshare will be called. For all but the
    last user of the shared pmd, huge_pmd_unshare will clear the pud pointing
    to the pmd. If the task in the middle of the page fault is not the last
    user, the ptep returned by huge_pte_alloc now points to another task's
    page table or worse. This leads to bad things such as incorrect page
    map/reference counts or invalid memory references.

    To fix, expand the use of i_mmap_rwsem as follows:

    - i_mmap_rwsem is held in read mode whenever huge_pmd_share is called.
    huge_pmd_share is only called via huge_pte_alloc, so callers of
    huge_pte_alloc take i_mmap_rwsem before calling. In addition, callers
    of huge_pte_alloc continue to hold the semaphore until finished with the
    ptep.

    - i_mmap_rwsem is held in write mode whenever huge_pmd_unshare is
    called.

    [mike.kravetz@oracle.com: add explicit check for mapping != null]
    Link: http://lkml.kernel.org/r/20181218223557.5202-2-mike.kravetz@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Naoya Horiguchi
    Cc: "Aneesh Kumar K . V"
    Cc: Andrea Arcangeli
    Cc: Davidlohr Bueso
    Cc: Prakash Sangappa
    Cc: Colin Ian King
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    This function has been identical to __page_set_anon_rmap() since the
    time it was introduced (8 years ago). The patch removes the function and
    makes its users use __page_set_anon_rmap() instead.

    Link: http://lkml.kernel.org/r/154504875359.30235.6237926369392564851.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jerome Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
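
    A simplified sketch of the idea (the field set and init signature shown
    here are minimal and illustrative):

    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
    };

    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... modify the page tables ... */
    mmu_notifier_invalidate_range_end(&range);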

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

01 Dec, 2018

1 commit

  • The term "freeze" is used in several ways in the kernel, and in mm it
    has the particular meaning of forcing page refcount temporarily to 0.
    freeze_page() is just too confusing a name for a function that unmaps a
    page: rename it unmap_page(), and rename unfreeze_page() remap_page().

    Went to change the mention of freeze_page() added later in mm/rmap.c,
    but found it to be incorrect: ordinary page reclaim reaches there too;
    but the substance of the comment still seems correct, so edit it down.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Oct, 2018

1 commit

  • The page migration code employs try_to_unmap() to try and unmap the source
    page. This is accomplished by using rmap_walk to find all vmas where the
    page is mapped. This search stops when page mapcount is zero. For shared
    PMD huge pages, the page map count is always 1 no matter the number of
    mappings. Shared mappings are tracked via the reference count of the PMD
    page. Therefore, try_to_unmap stops prematurely and does not completely
    unmap all mappings of the source page.

    This problem can result in data corruption, as writes to the original
    source page can happen after the contents of the page are copied to the
    target page. Hence, data is lost.

    This problem was originally seen as DB corruption of shared global areas
    after a huge page was soft offlined due to ECC memory errors. DB
    developers noticed they could reproduce the issue by (hotplug) offlining
    memory used to back huge pages. A simple testcase can reproduce the
    problem by creating a shared PMD mapping (note that this must be at least
    PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
    migrate_pages() to migrate process pages between nodes while continually
    writing to the huge pages being migrated.

    To fix, have the try_to_unmap_one routine check for huge PMD sharing by
    calling huge_pmd_unshare for hugetlbfs huge pages. If it is a shared
    mapping it will be 'unshared' which removes the page table entry and drops
    the reference on the PMD page. After this, flush caches and TLB.

    mmu notifiers are called before locking page tables, but we can not be
    sure of PMD sharing until page tables are locked. Therefore, check for
    the possibility of PMD sharing before locking so that notifiers can
    prepare for the worst possible case.
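
    A rough sketch of the ordering described above, using the
    range-adjustment helper this patch introduces (signatures assumed from
    the description; details simplified):

    struct hstate *h = hstate_vma(vma);
    unsigned long start = address, end = address + huge_page_size(h);

    /* Widen the notifier range in case a PUD-sized shared pmd goes away. */
    adjust_range_if_pmd_sharing_possible(vma, &start, &end);
    mmu_notifier_invalidate_range_start(mm, start, end);

    /* ... with the page table lock held, call huge_pmd_unshare(); if it
     * succeeds, flush caches and the TLB for the whole shared range ... */

    mmu_notifier_invalidate_range_end(mm, start, end);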

    Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
    [mike.kravetz@oracle.com: make _range_in_vma() a static inline]
    Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
    Fixes: 39dde65c9940 ("shared page table for hugetlb page")
    Signed-off-by: Mike Kravetz
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: Jerome Glisse
    Cc: Mike Kravetz
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mike Kravetz