17 Oct, 2020

1 commit

  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     

12 Oct, 2020

1 commit

  • When memory is hotplug added or removed, min_free_kbytes should be
    recalculated based on what is expected by khugepaged. Currently after
    hotplug, min_free_kbytes is reset to a lower default, and the higher
    default set when THP is enabled is lost.

    This change restores min_free_kbytes as expected for THP consumers.
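
    A hedged sketch of the idea (function and helper names reflect my reading
    of the fix and may differ in detail): khugepaged exposes a hook that the
    watermark-recalculation path can call after hotplug, so the
    THP-recommended minimum is re-applied on top of the generic default.

    /* Sketch: re-apply khugepaged's recommended min_free_kbytes after
     * memory hotplug has recalculated the default watermarks.
     */
    void khugepaged_min_free_kbytes_update(void)
    {
        mutex_lock(&khugepaged_mutex);
        if (khugepaged_enabled() && khugepaged_thread)
            set_recommended_min_free_kbytes();
        mutex_unlock(&khugepaged_mutex);
    }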

    [vijayb@linux.microsoft.com: v5]
    Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com

    Fixes: f000565adb77 ("thp: set recommended min free kbytes")
    Signed-off-by: Vijay Balakrishna
    Signed-off-by: Andrew Morton
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Allen Pais
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Song Liu
    Cc:
    Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com
    Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com
    Signed-off-by: Linus Torvalds

    Vijay Balakrishna
     

11 Oct, 2020

1 commit

  • There have been elusive reports of filemap_fault() hitting its
    VM_BUG_ON_PAGE(page_to_pgoff(page) != offset, page) on kernels built
    with CONFIG_READ_ONLY_THP_FOR_FS=y.

    Suren has hit it on a kernel with CONFIG_READ_ONLY_THP_FOR_FS=y and
    CONFIG_NUMA is not set: and he has analyzed it down to how khugepaged
    without NUMA reuses the same huge page after collapse_file() failed
    (whereas NUMA targets its allocation to the respective node each time).
    And most of us were usually testing with CONFIG_NUMA=y kernels.

    khugepaged:
      collapse_file(old start)
      new_page = khugepaged_alloc_page(hpage)
      __SetPageLocked(new_page)
      new_page->index = start // hpage->index=old offset
      new_page->mapping = mapping
      xas_store(&xas, new_page)

    racing task:
      filemap_fault
      page = find_get_page(mapping, offset)
      // if offset falls inside hpage then
      // compound_head(page) == hpage
      lock_page_maybe_drop_mmap()
      __lock_page(page)

    khugepaged:
      // collapse fails
      xas_store(&xas, old page)
      new_page->mapping = NULL
      unlock_page(new_page)

      collapse_file(new start)
      new_page = khugepaged_alloc_page(hpage)
      __SetPageLocked(new_page)
      new_page->index = start // hpage->index=new offset
      new_page->mapping = mapping // mapping becomes valid again

    racing task:
      // since compound_head(page) == hpage
      // page_to_pgoff(page) got changed
      VM_BUG_ON_PAGE(page_to_pgoff(page) != offset)

    An initial patch replaced __SetPageLocked() by lock_page(), which did
    fix the race which Suren illustrates above. But testing showed that it's
    not good enough: if the racing task's __lock_page() gets delayed long
    after its find_get_page(), then it may follow collapse_file(new start)'s
    successful final unlock_page(), and crash on the same VM_BUG_ON_PAGE.

    It could be fixed by relaxing filemap_fault()'s VM_BUG_ON_PAGE to a
    check and retry (as is done for mapping), with similar relaxations in
    find_lock_entry() and pagecache_get_page(): but it's not obvious what
    else might get caught out; and khugepaged non-NUMA appears to be unique
    in exposing a page to page cache, then revoking, without going through
    a full cycle of freeing before reuse.

    Instead, have non-NUMA khugepaged_prealloc_page() release the old page
    if anyone else has a reference to it (1% of cases when I tested).
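
    A minimal sketch of that behaviour (hedged; the exact refcount test and
    surrounding control flow in the real patch may differ):

    /* Sketch: in the !NUMA khugepaged_prealloc_page(), drop the previously
     * allocated hpage if someone else still holds a reference to it,
     * instead of re-exposing the same compound page to the page cache.
     */
    if (*hpage && page_count(*hpage) > 1) {
        put_page(*hpage);
        *hpage = NULL;
    }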

    Although never reported on huge tmpfs, I believe its find_lock_entry()
    has been at similar risk; but huge tmpfs does not rely on khugepaged
    for its normal working nearly so much as READ_ONLY_THP_FOR_FS does.

    Reported-by: Denis Lisov
    Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=206569
    Link: https://lore.kernel.org/linux-mm/?q=20200219144635.3b7417145de19b65f258c943%40linux-foundation.org
    Reported-by: Qian Cai
    Link: https://lore.kernel.org/linux-xfs/?q=20200616013309.GB815%40lca.pw
    Reported-and-analyzed-by: Suren Baghdasaryan
    Fixes: 87c460a0bded ("mm/khugepaged: collapse_shmem() without freezing new_page")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v4.9+
    Reviewed-by: Matthew Wilcox (Oracle)
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

06 Sep, 2020

1 commit

  • collapse_file() in khugepaged passes PAGE_SIZE as the number of pages to
    be read to page_cache_sync_readahead(). The intent was probably to read
    a single page. Fix it to use the number of pages to the end of the
    window instead.
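
    A hedged sketch of the corrected call (variable names assumed from
    context):

    /* Sketch: in collapse_file(), read ahead up to the end of the collapse
     * window rather than passing PAGE_SIZE as a page count.
     */
    page_cache_sync_readahead(mapping, &file->f_ra, file,
                              index, end - index);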

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: David Howells
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Song Liu
    Acked-by: Yang Shi
    Acked-by: Pankaj Gupta
    Cc: Eric Biggers
    Link: https://lkml.kernel.org/r/20200903140844.14194-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    David Howells
     

22 Aug, 2020

1 commit

  • syzbot crashes on the VM_BUG_ON_MM(khugepaged_test_exit(mm), mm) in
    __khugepaged_enter(): yes, when one thread is about to dump core, has set
    core_state, and is waiting for others, another might do something calling
    __khugepaged_enter(), which now crashes because I lumped the core_state
    test (known as "mmget_still_valid") into khugepaged_test_exit(). I still
    think it's best to lump them together, so just in this exceptional case,
    check mm->mm_users directly instead of khugepaged_test_exit().
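
    A hedged sketch of the resulting assertion in __khugepaged_enter():

    /* Sketch: only assert that the mm is still live; do not trip on a core
     * dump in progress, which khugepaged_test_exit() now also reports.
     */
    VM_BUG_ON_MM(atomic_read(&mm->mm_users) == 0, mm);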

    Fixes: bbe98f9cadff ("khugepaged: khugepaged_test_exit() check mmget_still_valid()")
    Reported-by: syzbot
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Song Liu
    Cc: Mike Kravetz
    Cc: Eric Dumazet
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008141503370.18085@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

13 Aug, 2020

1 commit

  • In the current implementation, a newly created or swapped-in anonymous
    page starts out on the active list. Growing the active list results in
    rebalancing the active/inactive lists, so old pages on the active list
    are demoted to the inactive list. Hence, a page on the active list isn't
    protected at all.

    Following is an example of this situation.

    Assume that there are 50 hot pages on the active list. Numbers denote the
    number of pages on the active/inactive lists (active | inactive).

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(uo) | 50(h)

    3. workload: another 50 newly created (used-once) pages
    50(uo) | 50(uo), swap-out 50(h)

    This patch tries to fix this issue. As with the file LRU, newly created
    or swapped-in anonymous pages will be inserted into the inactive list.
    They are promoted to the active list if enough references happen. This
    simple modification changes the above example as follows.

    1. 50 hot pages on active list
    50(h) | 0

    2. workload: 50 newly created (used-once) pages
    50(h) | 50(uo)

    3. workload: another 50 newly created (used-once) pages
    50(h) | 50(uo), swap-out 50(uo)

    As you can see, hot pages on the active list are now protected.

    Note that this implementation has a drawback: a page cannot be promoted
    and will be swapped out if its re-access interval is greater than the
    size of the inactive list but less than the size of the total
    (active + inactive) list. To solve this potential issue, a following
    patch will apply workingset detection similar to the one that's already
    applied to the file LRU.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Link: http://lkml.kernel.org/r/1595490560-15117-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

08 Aug, 2020

4 commits

  • Move collapse_huge_page()'s mmget_still_valid() check into
    khugepaged_test_exit() itself. collapse_huge_page() is used for anon THP
    only, and earned its mmget_still_valid() check because it inserts a huge
    pmd entry in place of the page table's pmd entry; whereas
    collapse_file()'s retract_page_tables() or collapse_pte_mapped_thp()
    merely clears the page table's pmd entry. But core dumping without mmap
    lock must have been as open to mistaking a racily cleared pmd entry for a
    page table at physical page 0, as exit_mmap() was. And we certainly have
    no interest in mapping as a THP once dumping core.
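
    A hedged sketch of the combined test described above:

    /* Sketch: treat "mm is exiting" and "core dump in progress" the same
     * way for khugepaged's purposes.
     */
    static inline int khugepaged_test_exit(struct mm_struct *mm)
    {
        return atomic_read(&mm->mm_users) == 0 || !mmget_still_valid(mm);
    }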

    Fixes: 59ea6d06cfa9 ("coredump: fix race condition between collapse_huge_page() and core dumping")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Song Liu
    Cc: Mike Kravetz
    Cc: Kirill A. Shutemov
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021217020.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Only once have I seen this scenario (and forgot even to notice what forced
    the eventual crash): a sequence of "BUG: Bad page map" alerts from
    vm_normal_page(), from zap_pte_range() servicing exit_mmap();
    pmd:00000000, pte values corresponding to data in physical page 0.

    The pte mappings being zapped in this case were supposed to be from a huge
    page of ext4 text (but could as well have been shmem): my belief is that
    it was racing with collapse_file()'s retract_page_tables(), found *pmd
    pointing to a page table, locked it, but *pmd had become 0 by the time
    start_pte was decided.

    In most cases, that possibility is excluded by holding mmap lock; but
    exit_mmap() proceeds without mmap lock. Most of what's run by khugepaged
    checks khugepaged_test_exit() after acquiring mmap lock:
    khugepaged_collapse_pte_mapped_thps() and hugepage_vma_revalidate() do so,
    for example. But retract_page_tables() did not: fix that.

    The fix is for retract_page_tables() to check khugepaged_test_exit(),
    after acquiring mmap lock, before doing anything to the page table.
    Getting the mmap lock serializes with __mmput(), which briefly takes and
    drops it in __khugepaged_exit(); then the khugepaged_test_exit() check on
    mm_users makes sure we don't touch the page table once exit_mmap() might
    reach it, since exit_mmap() will be proceeding without mmap lock, not
    expecting anyone to be racing with it.
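
    A hedged sketch of the resulting ordering in retract_page_tables()
    (declarations and error handling elided):

    /* Sketch: only clear the pmd and free the page table once we know
     * exit_mmap() cannot be racing with us.
     */
    if (mmap_write_trylock(mm)) {
        if (!khugepaged_test_exit(mm)) {
            spinlock_t *ptl = pmd_lock(mm, pmd);

            _pmd = pmdp_collapse_flush(vma, addr, pmd);
            spin_unlock(ptl);
            mm_dec_nr_ptes(mm);
            pte_free(mm, pmd_pgtable(_pmd));
        }
        mmap_write_unlock(mm);
    }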

    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [4.8+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021215400.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When retract_page_tables() removes a page table to make way for a huge
    pmd, it holds huge page lock, i_mmap_lock_write, mmap_write_trylock and
    pmd lock; but when collapse_pte_mapped_thp() does the same (to handle the
    case when the original mmap_write_trylock had failed), only
    mmap_write_trylock and pmd lock are held.

    That's not enough. One machine has twice crashed under load, with "BUG:
    spinlock bad magic" and GPF on 6b6b6b6b6b6b6b6b. Examining the second
    crash, page_vma_mapped_walk_done()'s spin_unlock of pvmw->ptl (serving
    page_referenced() on a file THP, that had found a page table at *pmd)
    discovers that the page table page and its lock have already been freed by
    the time it comes to unlock.

    Follow the example of retract_page_tables(), but we only need one of huge
    page lock or i_mmap_lock_write to secure against this: because it's the
    narrower lock, and because it simplifies collapse_pte_mapped_thp() to know
    the hpage earlier, choose to rely on huge page lock here.
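
    A hedged sketch of the chosen locking (details and error paths elided):

    /* Sketch: find and lock the huge page up front, hold its lock across
     * the pte checks and the pmd clear, then release it.
     */
    hpage = find_lock_page(vma->vm_file->f_mapping,
                           linear_page_index(vma, haddr));
    if (!hpage)
        return;
    /* ... verify the ptes map this hpage, clear the pmd, free the table ... */
    unlock_page(hpage);
    put_page(hpage);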

    Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021213070.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • pmdp_collapse_flush() should be given the start address at which the huge
    page is mapped, haddr: it was given addr, which at that point has been
    used as a local variable, incremented to the end address of the extent.

    Found by source inspection while chasing a hugepage locking bug, which I
    then could not explain by this. At first I thought this was very bad;
    then saw that all of the page translations that were not flushed would
    actually still point to the right pages afterwards, so harmless; then
    realized that I know nothing of how different architectures and models
    cache intermediate paging structures, so maybe it matters after all -
    particularly since the page table concerned is immediately freed.

    Much easier to fix than to think about.
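
    The fix itself, as a hedged one-line sketch:

    /* Sketch: flush for the huge-page-aligned start of the extent, not the
     * already-advanced loop cursor.
     */
    _pmd = pmdp_collapse_flush(vma, haddr, pmd);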

    Fixes: 27e1f8273113 ("khugepaged: enable collapse pmd for pte-mapped THP")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Song Liu
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2008021204390.27773@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Jul, 2020

1 commit

  • khugepaged has to drop mmap lock several times while collapsing a page.
    The situation can change while the lock is dropped and we need to
    re-validate that the VMA is still in place and the PMD is still subject
    for collapse.

    But we miss one corner case: while collapsing anonymous pages, the VMA
    could be replaced with a file VMA. If the file VMA doesn't have any
    private pages we get a NULL pointer dereference:

    general protection fault, probably for non-canonical address 0xdffffc0000000000: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000000-0x0000000000000007]
    anon_vma_lock_write include/linux/rmap.h:120 [inline]
    collapse_huge_page mm/khugepaged.c:1110 [inline]
    khugepaged_scan_pmd mm/khugepaged.c:1349 [inline]
    khugepaged_scan_mm_slot mm/khugepaged.c:2110 [inline]
    khugepaged_do_scan mm/khugepaged.c:2193 [inline]
    khugepaged+0x3bba/0x5a10 mm/khugepaged.c:2238

    The fix is to make sure that the VMA is anonymous in
    hugepage_vma_revalidate(). The helper is only used for collapsing
    anonymous pages.
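
    A hedged sketch of the added check in hugepage_vma_revalidate():

    /* Sketch: an anonymous VMA is expected here; bail out if the VMA has
     * been replaced by a file mapping while the lock was dropped.
     */
    if (!vma->anon_vma || vma->vm_ops)
        return SCAN_VMA_CHECK;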

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Reported-by: syzbot+ed318e8b790ca72c5ad0@syzkaller.appspotmail.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Yang Shi
    Cc:
    Link: http://lkml.kernel.org/r/20200722121439.44328-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
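
    For illustration, a hedged sketch of what the conversion produces at a
    typical call site:

    /* before */
    down_read(&mm->mmap_sem);
    /* ... walk the VMAs ... */
    up_read(&mm->mmap_sem);

    /* after */
    mmap_read_lock(mm);
    /* ... walk the VMAs ... */
    mmap_read_unlock(mm);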

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

12 commits

  • They're the same function, and for the purpose of all callers they are
    equivalent to lru_cache_add().

    [akpm@linux-foundation.org: fix it for local_lock changes]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Joonsoo Kim
    Link: http://lkml.kernel.org/r/20200520232525.798933-5-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Swapin faults were the last event to charge pages after they had already
    been put on the LRU list. Now that we charge directly on swapin, the
    lrucare portion of the charge code is unused.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Cc: Shakeel Butt
    Link: http://lkml.kernel.org/r/20200508183105.225460-19-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • With the page->mapping requirement gone from memcg, we can charge anon and
    file-thp pages in one single step, right after they're allocated.

    This removes two out of three API calls - especially the tricky commit
    step that needed to happen at just the right time between when the page is
    "set up" and when it's "published" - somewhat vague and fluid concepts
    that varied by page type. All we need is a freshly allocated page and a
    memcg context to charge.

    v2: prevent double charges on pre-allocated hugepages in khugepaged

    [hannes@cmpxchg.org: Fix crash - *hpage could be ERR_PTR instead of NULL]
    Link: http://lkml.kernel.org/r/20200512215813.GA487759@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200508183105.225460-13-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains a private MEMCG_RSS counter. This divergence from the
    generic VM accounting means unnecessary code overhead, and creates a
    dependency for memcg that page->mapping is set up at the time of charging,
    so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counter of NR_ANON_MAPPED. We use
    lock_page_memcg() to stabilize page->mem_cgroup during rmap changes, the
    same way we do for NR_FILE_MAPPED.

    With the previous patch removing MEMCG_CACHE and the private NR_SHMEM
    counter, this patch finally eliminates the need to have page->mapping set
    up at charge time. However, we need to have page->mem_cgroup set up by
    the time rmap runs and does the accounting, so switch the commit and the
    rmap callbacks around.

    v2: fix temporary accounting bug by switching rmap<->commit (Joonsoo)

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-11-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Memcg maintains private MEMCG_CACHE and NR_SHMEM counters. This
    divergence from the generic VM accounting means unnecessary code overhead,
    and creates a dependency for memcg that page->mapping is set up at the
    time of charging, so that page types can be told apart.

    Convert the generic accounting sites to mod_lruvec_page_state and friends
    to maintain the per-cgroup vmstat counters of NR_FILE_PAGES and NR_SHMEM.
    The page is already locked in these places, so page->mem_cgroup is stable;
    we only need minimal tweaks of two mem_cgroup_migrate() calls to ensure
    it's set up in time.

    Then replace MEMCG_CACHE with NR_FILE_PAGES and delete the private
    NR_SHMEM accounting sites.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Joonsoo Kim
    Cc: Alex Shi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-10-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The memcg charging API carries a boolean @compound parameter that tells
    whether the page we're dealing with is a hugepage.
    mem_cgroup_commit_charge() has another boolean @lrucare that indicates
    whether the page needs LRU locking or not while charging. The majority of
    callsites know those parameters at compile time, which results in a lot of
    naked "false, false" argument lists. This makes for cryptic code and is a
    breeding ground for subtle mistakes.

    Thankfully, the huge page state can be inferred from the page itself and
    doesn't need to be passed along. This is safe because charging completes
    before the page is published and before anybody may split it.

    Simplify the callsites by removing @compound, and let memcg infer the
    state by using hpage_nr_pages() unconditionally. That function does
    PageTransHuge() to identify huge pages, which also helpfully asserts that
    nobody passes in tail pages by accident.

    The following patches will introduce a new charging API, best not to carry
    over unnecessary weight.

    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Reviewed-by: Alex Shi
    Reviewed-by: Joonsoo Kim
    Reviewed-by: Shakeel Butt
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Roman Gushchin
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200508183105.225460-4-hannes@cmpxchg.org
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • 'max_ptes_shared' specifies how many pages can be shared across multiple
    processes. Exceeding the number blocks the collapse:

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_shared

    A higher value may increase memory footprint for some workloads.

    By default, at least half of the pages have to be unshared.
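
    A hedged sketch of how the knob gates collapse during the pte scan (the
    scan-result name is modeled on the other khugepaged limits):

    /* Sketch: count ptes whose page is mapped by more than one process and
     * refuse to collapse once the tunable is exceeded.
     */
    if (page_mapcount(page) > 1 &&
        ++shared > khugepaged_max_ptes_shared) {
        result = SCAN_EXCEED_SHARED_PTE;
        goto out;
    }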

    [colin.king@canonical.com: fix several spelling mistakes]
    Link: http://lkml.kernel.org/r/20200420084241.65433-1-colin.king@canonical.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Colin Ian King
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-9-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We can collapse PTE-mapped compound pages. We only need to avoid handling
    them more than once: lock/unlock the page only once if it's present in
    the PMD range multiple times, as it is handled at the compound level. The
    same goes for LRU isolation and putback.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-7-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The page can be included into collapse as long as it doesn't have extra
    pins (from GUP or otherwise).

    Logic to check the refcount is moved to a separate function. For pages in
    swap cache, add compound_nr(page) to the expected refcount, in order to
    handle the compound page case. This is in preparation for the following
    patch.

    VM_BUG_ON_PAGE() was removed from __collapse_huge_page_copy() as the
    invariant it checks is no longer valid: the source can be mapped multiple
    times now.
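
    A hedged sketch of the separated refcount check (helper name assumed):

    /* Sketch: a page is only collapsible if every reference is accounted
     * for by its mappings, plus the swap cache references for a compound
     * page that sits there.
     */
    static bool is_refcount_suitable(struct page *page)
    {
        int expected_refcount = total_mapcount(page);

        if (PageSwapCache(page))
            expected_refcount += compound_nr(page);

        return page_count(page) == expected_refcount;
    }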

    [yang.shi@linux.alibaba.com: remove error message when checking external pins]
    Link: http://lkml.kernel.org/r/1589317383-9595-1-git-send-email-yang.shi@linux.alibaba.com
    [cai@lca.pw: fix set-but-not-used warning]
    Link: http://lkml.kernel.org/r/20200521145644.GA6367@ovpn-112-192.phx2.redhat.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Reviewed-by: John Hubbard
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-6-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • collapse_huge_page() tries to swap in pages that are part of the PMD
    range. A just-swapped-in page goes through the LRU add cache, and the
    cache takes an extra reference on the page.

    The extra reference can cause the collapse to fail: the following
    __collapse_huge_page_isolate() checks the refcount and aborts the
    collapse when it sees an unexpected refcount.

    The fix is to drain local LRU add cache in
    __collapse_huge_page_swapin() if we successfully swapped in any pages.
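
    A hedged sketch of the fix at the end of __collapse_huge_page_swapin():

    /* Sketch: once anything was swapped in, flush this CPU's LRU add cache
     * so the new pages don't carry a stray reference into the isolate step.
     */
    if (swapped_in)
        lru_add_drain();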

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-5-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Having a page in the LRU add cache offsets the page refcount and gives a
    false negative on PageLRU(), which reduces the collapse success rate.

    Drain all LRU add caches before scanning. This happens relatively rarely
    and should not disturb the system too much.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • __collapse_huge_page_swapin() checks the number of referenced PTEs to
    decide if the memory range is hot enough to justify swap-in.

    We have a few problems with the approach:

    - It is way too late: we can do the check much earlier and save time.
    khugepaged_scan_pmd() already knows if we have any pages to swap in
    and the number of referenced pages.

    - It stops the collapse altogether if there are not enough referenced
    pages, not only the swap-in.

    Fix it by making the right check early. We can also avoid additional
    page table scanning if khugepaged_scan_pmd() hasn't found any swap
    entries.
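
    A hedged sketch of the reworked call site (signatures simplified):

    /* Sketch: khugepaged_scan_pmd() already counted swap ptes ("unmapped")
     * and referenced ptes; only attempt swap-in, and its extra page table
     * walk, when there is actually something to swap in.
     */
    if (unmapped && !__collapse_huge_page_swapin(mm, vma, address,
                                                 pmd, referenced))
        goto out_nolock;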

    Fixes: 0db501f7a34c ("mm, thp: convert from optimistic swapin collapsing to conservative")
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Tested-by: Zi Yan
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Acked-by: Yang Shi
    Cc: Andrea Arcangeli
    Cc: John Hubbard
    Cc: Mike Kravetz
    Cc: Ralph Campbell
    Link: http://lkml.kernel.org/r/20200416160026.16538-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

29 May, 2020

1 commit

  • When collapse_file() calls try_to_release_page(), it has already isolated
    the page: so if releasing buffers happens to fail (as it sometimes does),
    remember to putback_lru_page(): otherwise that page is left unreclaimable
    and unfreeable, and the file extent uncollapsible.

    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Acked-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Rik van Riel
    Cc: [5.4+]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2005231837500.1766@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

08 Apr, 2020

3 commits

  • Don't collapse the huge PMD if there are any userfault write-protected
    small PTEs. The problem is that the write protection is at small-page
    granularity and there's no way to keep all this write-protection
    information if the small pages are merged into a huge PMD.

    The same thing needs to be considered for swap entries and migration
    entries. So do the check as well, disregarding khugepaged_max_ptes_swap.
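
    A hedged sketch of the scan-time check (the swap/migration entry case is
    handled analogously with the swap pte variant of the test):

    /* Sketch: refuse to collapse if any small pte carries the userfaultfd
     * write-protect bit, since it cannot be preserved at pmd granularity.
     */
    if (pte_uffd_wp(pteval)) {
        result = SCAN_PTE_UFFD_WP;
        goto out_unmap;
    }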

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag PG_swapbacked.
    This makes page_is_file_cache() inconsistent with its comments, so the
    function is renamed to page_is_file_lru() to make them consistent again.
    All of this is put in one patch as one logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert it;
    instead it set CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same value as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

03 Apr, 2020

2 commits

  • Currently the declaration and definition of is_vma_temporary_stack() are
    scattered. Let's make the is_vma_temporary_stack() helper available for
    general use and also drop the declaration from include/linux/huge_mm.h,
    which is no longer required. While at it, rename it to
    vma_is_temporary_stack() in line with existing helpers. This should not
    cause any functional change.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1582782965-3274-4-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Patch series "mm/vma: some more minor changes", v2.

    The motivation here is to consolidate VMA flags and helpers in the
    generic memory header and reduce code duplication wherever applicable.
    If there are other possible similar instances which might be missing
    here, please do let me know. I will be happy to incorporate them.

    This patch (of 3):

    Move VM_NO_KHUGEPAGED into the generic header (include/linux/mm.h). This
    just makes sure that no VMA flag is scattered in individual function
    files any longer. While at it, fix an old comment which is no longer
    valid. This should not cause any functional change.

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1582782965-3274-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

02 Dec, 2019

1 commit

  • For non-shmem file THPs, khugepaged only collapses read-only .text
    mappings (VM_DENYWRITE). These pages should not be dirty except in the
    case where the file hasn't been flushed since the first write.

    Call filemap_flush() in collapse_file() to accelerate the writeback in
    such cases.

    Link: http://lkml.kernel.org/r/20191106060930.2571389-3-songliubraving@fb.com
    Signed-off-by: Song Liu
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

16 Nov, 2019

1 commit

  • In collapse_file(), for the !is_shmem case, the current check cannot
    guarantee that the locked page is up to date. Specifically,
    xas_unlock_irq() should not be called before lock_page() and get_page();
    and it is necessary to recheck PageUptodate() after locking the page.

    With this bug and CONFIG_READ_ONLY_THP_FOR_FS=y, madvise(HUGE)'ed .text
    may contain corrupted data. This is because khugepaged mistakenly
    collapses some not up-to-date sub pages into a huge page, and assumes
    the huge page is up to date. This will NOT corrupt data on disk, because
    the page is read-only and never written back. Fix this by properly
    checking PageUptodate() after locking the page. This check replaces
    "VM_BUG_ON_PAGE(!PageUptodate(page), page);".

    Also, move the PageDirty() check after locking the page. Current
    khugepaged should not try to collapse a dirty file THP, because it is
    limited to read-only .text. The only case we hit a dirty page here is
    when the page hasn't been flushed since the first write. Bail out and
    retry when this happens.
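
    A hedged sketch of the corrected ordering for the !is_shmem case:

    /* Sketch: lock and pin the page before dropping the xarray lock ... */
    if (!trylock_page(page)) {
        result = SCAN_PAGE_LOCK;
        goto xa_locked;
    }
    get_page(page);
    xas_unlock_irq(&xas);

    /* ... then re-validate it under the page lock before collapsing */
    if (unlikely(!PageUptodate(page)) || PageDirty(page)) {
        result = SCAN_FAIL;
        goto out_unlock;
    }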

    syzbot reported bug on previous version of this patch.

    Link: http://lkml.kernel.org/r/20191106060930.2571389-2-songliubraving@fb.com
    Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
    Signed-off-by: Song Liu
    Reported-by: syzbot+efb9e48b9fbdc49bb34a@syzkaller.appspotmail.com
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

07 Nov, 2019

1 commit

  • I got some khugepaged spew on a 32bit x86:

    BUG: sleeping function called from invalid context at include/linux/mmu_notifier.h:346
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 25, name: khugepaged
    INFO: lockdep is turned off.
    CPU: 1 PID: 25 Comm: khugepaged Not tainted 5.4.0-rc5-elk+ #206
    Hardware name: System manufacturer P5Q-EM/P5Q-EM, BIOS 2203 07/08/2009
    Call Trace:
    dump_stack+0x66/0x8e
    ___might_sleep.cold.96+0x95/0xa6
    __might_sleep+0x2e/0x80
    collapse_huge_page.isra.51+0x5ac/0x1360
    khugepaged+0x9a9/0x20f0
    kthread+0xf5/0x110
    ret_from_fork+0x2e/0x38

    Looks like it's due to CONFIG_HIGHPTE=y pte_offset_map()->kmap_atomic()
    vs. mmu_notifier_invalidate_range_start(). Let's do the naive approach
    and just reorder the two operations.

    Link: http://lkml.kernel.org/r/20191029201513.GG1208@intel.com
    Fixes: 810e24e009cf71 ("mm/mmu_notifiers: annotate with might_sleep()")
    Signed-off-by: Ville Syrjälä
    Reviewed-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Jérôme Glisse
    Cc: Ralph Campbell
    Cc: Ira Weiny
    Cc: Jason Gunthorpe
    Cc: Daniel Vetter
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ville Syrjälä
     

25 Sep, 2019

5 commits

  • khugepaged needs the exclusive mmap_sem to access the page table. When it
    fails to lock mmap_sem, the page will fault in as a pte-mapped THP. As
    the page is already a THP, khugepaged will not handle this pmd again.

    This patch enables khugepaged to retry collapsing the page table.

    struct mm_slot (in khugepaged.c) is extended with an array, containing
    addresses of pte-mapped THPs. We use array here for simplicity. We can
    easily replace it with more advanced data structures when needed.

    In khugepaged_scan_mm_slot(), if the mm contains a pte-mapped THP, we try
    to collapse the page table.

    Since the collapse may happen at a later time, some pages may already
    have faulted in. collapse_pte_mapped_thp() is added to properly handle
    these pages. collapse_pte_mapped_thp() also double checks whether all
    ptes in this pmd map to the same THP. This is necessary because some
    subpages of the THP may be replaced, for example by uprobe. In such
    cases, it is not possible to collapse the pmd.
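
    A hedged sketch of the bookkeeping added to struct mm_slot (the array
    size is illustrative):

    /* Sketch: remember pte-mapped THPs that could not be collapsed yet, so
     * khugepaged_scan_mm_slot() can retry them later.
     */
    #define MAX_PTE_MAPPED_THP 8

    struct mm_slot {
        struct hlist_node hash;
        struct list_head mm_node;
        struct mm_struct *mm;

        /* pte-mapped THP in this mm */
        int nr_pte_mapped_thp;
        unsigned long pte_mapped_thp[MAX_PTE_MAPPED_THP];
    };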

    [kirill.shutemov@linux.intel.com: add comments for retract_page_tables()]
    Link: http://lkml.kernel.org/r/20190816145443.6ard3iilytc6jlgv@box
    Link: http://lkml.kernel.org/r/20190815164525.1848545-6-songliubraving@fb.com
    Signed-off-by: Song Liu
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Kirill A. Shutemov
    Suggested-by: Johannes Weiner
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • In the previous patch, an application could put part of its text section
    in THP via madvise(). These THPs will be protected from writes while the
    application is still running (TXTBSY). However, after the application
    exits, the file is available for writes.

    This patch avoids writes to file THP by dropping page cache for the file
    when the file is open for write. A new counter nr_thps is added to struct
    address_space. In do_dentry_open(), if the file is open for write and
    nr_thps is non-zero, we drop page cache for the whole file.
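
    A hedged sketch of the check in do_dentry_open() (the nr_thps accessor
    name is assumed):

    /* Sketch: a file that has THPs in its page cache and is now being
     * opened for write gets its page cache dropped, so writes never see a
     * file THP.
     */
    if ((f->f_mode & FMODE_WRITE) && filemap_nr_thps(inode->i_mapping))
        truncate_pagecache(inode, 0);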

    Link: http://lkml.kernel.org/r/20190801184244.3169074-8-songliubraving@fb.com
    Signed-off-by: Song Liu
    Reported-by: kbuild test robot
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • This patch is (hopefully) the first step to enable THP for non-shmem
    filesystems.

    This patch enables an application to put part of its text sections into
    THP via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

    We tried to reuse the logic for THP on tmpfs.

    Currently, write is not supported for non-shmem THP. khugepaged will only
    process VMAs with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
    (see ksys_mmap_pgoff()). The only way to create a VMA with VM_DENYWRITE
    is execve(). This requirement limits non-shmem THP to text sections.

    The next patch will handle writes, which would only happen when all the
    VMAs with VM_DENYWRITE are unmapped.

    An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
    feature.

    [songliubraving@fb.com: fix build without CONFIG_SHMEM]
    Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
    [songliubraving@fb.com: fix double unlock in collapse_file()]
    Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
    Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • The next patch will add khugepaged support for non-shmem files. This
    patch renames these two functions to reflect the new functionality:

    collapse_shmem() => collapse_file()
    khugepaged_scan_shmem() => khugepaged_scan_file()

    Link: http://lkml.kernel.org/r/20190801184244.3169074-6-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Transparent Huge Pages are currently stored in i_pages as pointers to
    consecutive subpages. This patch changes that to storing consecutive
    pointers to the head page in preparation for storing huge pages more
    efficiently in i_pages.
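
    A hedged sketch of the lookup convention this enables (helper shape as I
    understand it; details may differ):

    /* Sketch: every index covered by a THP now stores the head page, so a
     * page cache lookup returns the head and the caller derives the
     * subpage from the index.
     */
    static inline struct page *find_subpage(struct page *head, pgoff_t index)
    {
        if (PageHuge(head))
            return head;

        return head + (index & (compound_nr(head) - 1));
    }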

    Large parts of this are "inspired" by Kirill's patch
    https://lore.kernel.org/lkml/20170126115819.58875-2-kirill.shutemov@linux.intel.com/

    Kirill and Huang Ying contributed several fixes.

    [willy@infradead.org: use compound_nr, squish uninit-var warning]
    Link: http://lkml.kernel.org/r/20190731210400.7419-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jan Kara
    Reviewed-by: Kirill Shutemov
    Reviewed-by: Song Liu
    Tested-by: Song Liu
    Tested-by: William Kucharski
    Reviewed-by: William Kucharski
    Tested-by: Qian Cai
    Tested-by: Mikhail Gavrilov
    Cc: Hugh Dickins
    Cc: Chris Wilson
    Cc: Song Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)