06 Dec, 2018

4 commits

  • commit 006d3ff27e884f80bd7d306b041afc415f63598f upstream.

    Huge tmpfs testing, on 32-bit kernel with lockdep enabled, showed that
    __split_huge_page() was using i_size_read() while holding the irq-safe
    lru_lock and page tree lock, but the 32-bit i_size_read() uses an
    irq-unsafe seqlock which should not be nested inside them.

    Instead, read the i_size earlier in split_huge_page_to_list(), and pass
    the end offset down to __split_huge_page(): all while holding head page
    lock, which is enough to prevent truncation of that extent before the
    page tree lock has been taken.
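
    A minimal sketch of the reordering described above, using the function
    names from the upstream patch (locking details elided):

        /* split_huge_page_to_list(), with the head page already locked */
        end = -1;                       /* anon: no truncation to worry about */
        if (!PageAnon(head))
                end = DIV_ROUND_UP(i_size_read(mapping->host), PAGE_SIZE);
        ...
        /* only now take the irq-safe lru_lock and the page tree lock */
        __split_huge_page(page, list, end, flags);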

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261520070.2275@eggly.anvils
    Fixes: baa355fd33142 ("thp: file pages support for split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 173d9d9fd3ddae84c110fea8aedf1f26af6be9ec upstream.

    Huge tmpfs stress testing has occasionally hit shmem_undo_range()'s
    VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page).

    Move the setting of mapping and index up before the page_ref_unfreeze()
    in __split_huge_page_tail() to fix this: so that a page cache lookup
    cannot get a reference while the tail's mapping and index are unstable.

    In fact, might as well move them up before the smp_wmb(): I don't see an
    actual need for that, but if I'm missing something, this way round is
    safer than the other, and no less efficient.

    You might argue that VM_BUG_ON_PAGE(page_to_pgoff(page) != index, page) is
    misplaced, and should be left until after the trylock_page(); but left as
    it is, it has not crashed since, and gives more stringent assurance.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261516380.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Requires: 605ca5ede764 ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Jerome Glisse
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     
  • commit 605ca5ede7643a01f4c4a15913f9714ac297f8a6 upstream.

    THP split makes a non-atomic change of tail page flags. This is almost
    OK because tail pages are locked and isolated, but it breaks recent
    changes in page locking: the non-atomic operation could clear the
    PG_waiters bit.

    As a result, a concurrent get_page_unless_zero() -> lock_page() sequence
    might block forever, especially if the page is truncated later.

    The fix is trivial: clone the flags before unfreezing the page reference
    counter.

    This race has existed since commit 62906027091f ("mm: add PageWaiters
    indicating tasks are waiting for a page bit"), while the unsafe unfreeze
    itself was added in commit 8df651c7059e ("thp: cleanup
    split_huge_page()").

    clear_compound_head() must also be called before unfreezing the page
    reference, because a successful get_page_unless_zero() may be followed
    by put_page(), which needs a correct compound_head().

    Also replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(),
    which is made especially for this and has the semantics of
    smp_store_release().
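
    A minimal sketch of the resulting order in __split_huge_page_tail()
    (simplified; the mapping/index placement is the companion fix listed
    above):

        /* tail refcount is still frozen at 0: nobody can grab the page yet */
        page_tail->flags |= (head->flags & ...);    /* clone flags while frozen */
        clear_compound_head(page_tail);
        page_tail->index = head->index + tail;
        page_tail->mapping = head->mapping;
        smp_wmb();                      /* publish before the refcount goes live */
        page_ref_unfreeze(page_tail, 1 + ...);      /* release semantics */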

    Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Konstantin Khlebnikov
     
  • commit 906f9cdfc2a0800f13683f9e4ebdfd08c12ee81b upstream.

    The term "freeze" is used in several ways in the kernel, and in mm it
    has the particular meaning of forcing page refcount temporarily to 0.
    freeze_page() is just too confusing a name for a function that unmaps a
    page: rename it unmap_page(), and rename unfreeze_page() remap_page().

    I went to change the mention of freeze_page() added later in mm/rmap.c,
    but found that comment to be incorrect: ordinary page reclaim reaches
    there too. The substance of the comment still seems correct, though, so
    edit it down.

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1811261514080.2275@eggly.anvils
    Fixes: e9b61f19858a5 ("thp: reintroduce split_huge_page()")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Jerome Glisse
    Cc: Konstantin Khlebnikov
    Cc: Matthew Wilcox
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Hugh Dickins
     

20 Oct, 2018

1 commit

  • commit eb66ae030829605d61fbef1909ce310e29f78821 upstream.

    Jann Horn points out that our TLB flushing was subtly wrong for the
    mremap() case. What makes mremap() special is that we don't follow the
    usual "add page to list of pages to be freed, then flush tlb, and then
    free pages". No, mremap() obviously just _moves_ the page from one page
    table location to another.

    That matters, because mremap() thus doesn't directly control the
    lifetime of the moved page with a freelist: instead, the lifetime of the
    page is controlled by the page table locking, that serializes access to
    the entry.

    As a result, we need to flush the TLB not just before releasing the lock
    for the source location (to avoid any concurrent accesses to the entry),
    but also before we release the destination page table lock (to avoid the
    TLB being flushed after somebody else has already done something to that
    page).

    This also makes the whole "need_flush" logic unnecessary, since we now
    always end up flushing the TLB for every valid entry.
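
    A hedged sketch of the ordering the fix enforces in the pte-moving path
    (simplified, not the literal diff):

        /* after moving ptes from the old page table to the new one */
        if (force_flush)
                flush_tlb_range(vma, old_addr, old_end); /* before dropping either lock */
        if (new_ptl != old_ptl)
                spin_unlock(new_ptl);
        pte_unmap_unlock(old_pte - 1, old_ptl);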

    Reported-and-tested-by: Jann Horn
    Acked-by: Will Deacon
    Tested-by: Ingo Molnar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Linus Torvalds
     

18 Oct, 2018

1 commit

  • commit bfba8e5cf28f413aa05571af493871d74438979f upstream.

    Inside set_pmd_migration_entry() we are holding page table locks, and
    thus we cannot sleep, so we cannot call invalidate_range_start/end().

    So remove the calls to mmu_notifier_invalidate_range_start/end(),
    because they are already made inside the function that calls
    set_pmd_migration_entry() (see try_to_unmap_one()).
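
    A minimal sketch of the resulting structure (the caller shown is the one
    named above; details simplified):

        /* caller, e.g. try_to_unmap_one(): may sleep, brackets the operation */
        mmu_notifier_invalidate_range_start(mm, start, end);
        ...
        /* runs under the page table lock, must not sleep, and therefore
         * issues no invalidate_range_start/end() of its own */
        set_pmd_migration_entry(&pvmw, page);
        ...
        mmu_notifier_invalidate_range_end(mm, start, end);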

    Link: http://lkml.kernel.org/r/20181012181056.7864-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reported-by: Andrea Arcangeli
    Reviewed-by: Zi Yan
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jérôme Glisse
     

13 Oct, 2018

1 commit

  • commit e125fe405abedc1dc8a5b2229b80ee91c1434015 upstream.

    A transparent huge page is represented by a single entry on an LRU list.
    Therefore, we can only make unevictable an entire compound page, not
    individual subpages.

    If a user tries to mlock() part of a huge page, we want the rest of the
    page to be reclaimable.

    We handle this by keeping PTE-mapped huge pages on normal LRU lists: the
    PMD on the border of a VM_LOCKED VMA will be split into a PTE table.

    The introduction of THP migration breaks[1] the rules around mlocking
    THP pages. If we have a single PMD mapping of the page in an mlocked
    VMA, the page will get mlocked regardless of PTE mappings of the page.

    For tmpfs/shmem it's easy to fix by checking PageDoubleMap() in
    remove_migration_pmd().

    Anon THP pages can only be shared between processes via fork(). An
    mlocked page can only be shared if the parent mlocked it before forking;
    otherwise CoW will be triggered on mlock().

    For anon THP, we can fix the issue by munlocking the page when removing
    the PTE migration entry for the page. PTEs for the page will always come
    after the mlocked PMD: rmap walks VMAs from oldest to newest.

    Test-case:

        #include <unistd.h>
        #include <sys/mman.h>
        #include <sys/wait.h>
        #include <numaif.h>
        #include <stdio.h>

        int main(void)
        {
                unsigned long nodemask = 4;
                void *addr;

                addr = mmap((void *)0x20000000UL, 2UL << 20, PROT_READ | PROT_WRITE,
                                MAP_PRIVATE | MAP_ANONYMOUS | MAP_LOCKED, -1, 0);

                if (fork()) {
                        wait(NULL);
                        return 0;
                }

                mlock(addr, 4UL << 10);
                mbind(addr, 2UL << 20, MPOL_PREFERRED | MPOL_F_RELATIVE_NODES,
                                &nodemask, 4, MPOL_MF_MOVE);

                return 0;
        }

    [1] https://lkml.kernel.org/r/CAOMGZ=G52R-30rZvhGxEbkTw7rLLwBGadVYeo--iizcD3upL3A@mail.gmail.com

    Link: http://lkml.kernel.org/r/20180917133816.43995-1-kirill.shutemov@linux.intel.com
    Fixes: 616b8371539a ("mm: thp: enable thp migration in generic path")
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Vegard Nossum
    Reviewed-by: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

25 Jul, 2018

1 commit

  • commit e1f1b1572e8db87a56609fd05bef76f98f0e456a upstream.

    __split_huge_pmd_locked() must check if the cleared huge pmd was dirty,
    and propagate that to PageDirty: otherwise, data may be lost when a huge
    tmpfs page is modified then split then reclaimed.

    How has this taken so long to be noticed? Because there was no problem
    when the huge page is written by a write system call (shmem_write_end()
    calls set_page_dirty()), nor when the page is allocated for a write fault
    (fault_dirty_shared_page() calls set_page_dirty()); but when allocated
    for a read fault (which MAP_POPULATE simulates), nothing calls
    set_page_dirty().
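
    A hedged userspace sketch of the access pattern described above (the
    tmpfs path, sizes, and missing error handling are illustrative
    assumptions; tmpfs must be mounted with huge pages enabled for a THP to
    be involved):

        #include <fcntl.h>
        #include <string.h>
        #include <sys/mman.h>
        #include <unistd.h>

        int main(void)
        {
                size_t len = 2UL << 20;         /* one THP on x86_64 */
                int fd = open("/dev/shm/thp-test", O_RDWR | O_CREAT, 0600);

                ftruncate(fd, len);
                /* MAP_POPULATE prefaults the mapping as a read fault,
                 * so nothing calls set_page_dirty() */
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_SHARED | MAP_POPULATE, fd, 0);

                memset(p, 0x5a, len);           /* dirty the page via the huge pmd only */
                /* without the fix, a later split followed by reclaim could
                 * lose this data; with it, __split_huge_pmd_locked()
                 * propagates the pmd dirty bit to PageDirty */
                munmap(p, len);
                close(fd);
                return 0;
        }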

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1807111741430.1106@eggly.anvils
    Fixes: d21b9e57c74c ("thp: handle file pages in split_huge_pmd()")
    Signed-off-by: Hugh Dickins
    Reported-by: Ashwin Chaugule
    Reviewed-by: Yang Shi
    Reviewed-by: Kirill A. Shutemov
    Cc: "Huang, Ying"
    Cc: [4.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

05 Jun, 2018

1 commit

  • commit 2d077d4b59924acd1f5180c6fb73b57f4771fde6 upstream.

    Swapping load on huge=always tmpfs (with khugepaged tuned up to be very
    eager, but I'm not sure that is relevant) soon hung uninterruptibly,
    waiting for page lock in shmem_getpage_gfp()'s find_lock_entry(), most
    often when "cp -a" was trying to write to a smallish file. Debug showed
    that the page in question was not locked, and page->mapping NULL by now,
    but page->index consistent with having been in a huge page before.

    Reproduced in minutes on a 4.15 kernel, even with 4.17's 605ca5ede764
    ("mm/huge_memory.c: reorder operations in __split_huge_page_tail()") added
    in; but took hours to reproduce on a 4.17 kernel (no idea why).

    The culprit proved to be the __ClearPageDirty() on tails beyond i_size in
    __split_huge_page(): that non-atomic bit operation may have been safe
    when 4.8's baa355fd3314 ("thp: file pages support for split_huge_page()")
    introduced it, but is liable to erase PageWaiters after 4.10's
    62906027091f ("mm: add PageWaiters indicating tasks are waiting for a
    page bit").

    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1805291841070.3197@eggly.anvils
    Fixes: 62906027091f ("mm: add PageWaiters indicating tasks are waiting for a page bit")
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Nicholas Piggin
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Hugh Dickins
     

30 May, 2018

1 commit

  • [ Upstream commit 9d3c3354bb85bab4d865fe95039443f09a4c8394 ]

    Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fall back instead of oom killing processes.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

29 Mar, 2018

1 commit

  • commit fa41b900c30b45fab03783724932dc30cd46a6be upstream.

    deferred_split_scan() gets called from the reclaim path. Waiting for the
    page lock there may lead to deadlock.

    Replace lock_page() with trylock_page() and skip the page if we fail to
    lock it. We will get to the page on the next scan.
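
    A minimal sketch of the change, simplified from the loop in
    deferred_split_scan():

        if (!trylock_page(page))
                goto next;      /* skip: lock_page() could deadlock under reclaim */
        /* split_huge_page() removes the page from the list on success */
        if (!split_huge_page(page))
                split++;
        unlock_page(page);
    next:
        put_page(page);         /* drop the reference taken earlier in the scan */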

    Link: http://lkml.kernel.org/r/20180315150747.31945-1-kirill.shutemov@linux.intel.com
    Fixes: 9a982250f773 ("thp: introduce deferred_split_huge_page()")
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

05 Dec, 2017

1 commit

  • commit a8f97366452ed491d13cf1e44241bc0b5740b1f0 upstream.

    Currently, we unconditionally make the page table entry dirty in
    touch_pmd(). It may result in a false-positive can_follow_write_pmd().

    We can avoid the situation by only making the page table entry dirty if
    the caller asks for write access -- FOLL_WRITE.

    The patch also changes touch_pud() in the same way.
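
    A minimal sketch of the resulting touch_pmd() behaviour (simplified;
    touch_pud() is analogous):

        _pmd = pmd_mkyoung(*pmd);
        if (flags & FOLL_WRITE)
                _pmd = pmd_mkdirty(_pmd);       /* dirty only on FOLL_WRITE */
        if (pmdp_set_access_flags(vma, addr & HPAGE_PMD_MASK,
                                  pmd, _pmd, flags & FOLL_WRITE))
                update_mmu_cache_pmd(vma, addr, pmd);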

    Signed-off-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Kirill A. Shutemov
     

03 Nov, 2017

1 commit

  • We need to deposit pre-allocated PTE page table when a PMD migration
    entry is copied in copy_huge_pmd(). Otherwise, we will leak the
    pre-allocated page and cause a NULL pointer dereference later in
    zap_huge_pmd().

    The missing counters during PMD migration entry copy process are added
    as well.

    The bug report is here: https://lkml.org/lkml/2017/10/29/214

    Link: http://lkml.kernel.org/r/20171030144636.4836-1-zi.yan@sent.com
    Fixes: 84c3fc4e9c563 ("mm: thp: check pmd migration entry in common path")
    Signed-off-by: Zi Yan
    Reported-by: Fengguang Wu
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     

09 Sep, 2017

4 commits

  • The soft dirty bit is designed to be preserved across page migration.
    This patch makes it work in the same manner for thp migration too.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • When THP migration is being used, memory management code needs to handle
    pmd migration entries properly. This patch uses !pmd_present() or
    is_swap_pmd() (depending on whether pmd_none() needs separate code or
    not) to check pmd migration entries at the places where a pmd entry is
    present.

    Since pmd-related code uses split_huge_page(), split_huge_pmd(),
    pmd_trans_huge(), pmd_trans_unstable(), or
    pmd_none_or_trans_huge_or_clear_bad(), this patch:

    1. adds pmd migration entry split code in split_huge_pmd(),

    2. takes care of pmd migration entries whenever pmd_trans_huge() is present,

    3. makes pmd_none_or_trans_huge_or_clear_bad() pmd migration entry aware.

    Since split_huge_page() uses split_huge_pmd() and pmd_trans_unstable()
    is equivalent to pmd_none_or_trans_huge_or_clear_bad(), we do not change
    them.

    As of this commit, a pmd entry can be one of:
    1. pointing to a pte page,
    2. is_swap_pmd(),
    3. pmd_trans_huge(),
    4. pmd_devmap(), or
    5. pmd_none().
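
    A minimal sketch of the check pattern this adds at such call sites
    (simplified from a zap_pmd_range()-style walker):

        if (is_swap_pmd(*pmd) || pmd_trans_huge(*pmd) || pmd_devmap(*pmd)) {
                /* huge or migration-entry pmd: take the pmd lock and handle
                 * it (split, zap, or wait for migration) instead of assuming
                 * a pte page is present */
                ...
        }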

    Signed-off-by: Zi Yan
    Cc: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • Add thp migration's core code, including conversions between a PMD entry
    and a swap entry, setting PMD migration entry, removing PMD migration
    entry, and waiting on PMD migration entries.

    This patch makes it possible to support thp migration. If you fail to
    allocate a destination page as a thp, you simply split the source thp as
    we do now, and then enter the normal page migration. If you succeed in
    allocating a destination thp, you enter thp migration. Subsequent patches
    actually enable thp migration for each caller of page migration by
    allowing its get_new_page() callback to allocate thps.

    [zi.yan@cs.rutgers.edu: fix gcc-4.9.0 -Wmissing-braces warning]
    Link: http://lkml.kernel.org/r/A0ABA698-7486-46C3-B209-E95A9048B22C@cs.rutgers.edu
    [akpm@linux-foundation.org: fix x86_64 allnoconfig warning]
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zi Yan
     
  • TTU_MIGRATION is used to convert a pte into a migration entry until the
    thp split completes. This behavior conflicts with the thp migration added
    by later patches, so let's introduce a new TTU flag specifically for
    freezing.

    try_to_unmap() is used both for thp split (via freeze_page()) and page
    migration (via __unmap_and_move()). In freeze_page(), ttu_flag given
    for head page is like below (assuming anonymous thp):

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
    TTU_MIGRATION | TTU_SPLIT_HUGE_PMD)

    and ttu_flag given for tail pages is:

    (TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS | TTU_RMAP_LOCKED | \
    TTU_MIGRATION)

    __unmap_and_move() calls try_to_unmap() with ttu_flag:

    (TTU_MIGRATION | TTU_IGNORE_MLOCK | TTU_IGNORE_ACCESS)

    Now I'm trying to insert a branch for thp migration at the top of
    try_to_unmap_one() like below

    static int try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
                    unsigned long address, void *arg)
    {
            ...
            /* PMD-mapped THP migration entry */
            if (!pvmw.pte && (flags & TTU_MIGRATION)) {
                    if (!PageAnon(page))
                            continue;

                    set_pmd_migration_entry(&pvmw, page);
                    continue;
            }
            ...
    }

    but then try_to_unmap() for tail pages called by thp split can go into
    the thp migration code path (which converts the *pmd* into a migration
    entry), while the expectation is to freeze the thp (which converts
    *pte*s into migration entries).

    I detected this failure as a "bad page state" error in a testcase where
    split_huge_page() is called from queue_pages_pte_range().

    Link: http://lkml.kernel.org/r/20170717193955.20207-4-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Acked-by: Kirill A. Shutemov
    Cc: "H. Peter Anvin"
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

4 commits

  • Huge pages help to reduce the TLB miss rate, but they have a higher cache
    footprint, which can sometimes cause problems. For example, when clearing
    a huge page on an x86_64 platform, the cache footprint is 2M. But a Xeon
    E5 v3 2699 CPU has 18 cores, 36 threads, and only 45M of LLC (last level
    cache). That is, on average, there is 2.5M of LLC per core and 1.25M per
    thread.

    If cache pressure is heavy while clearing the huge page, and we clear the
    huge page from beginning to end, it is possible that the beginning of the
    huge page is evicted from the cache by the time we finish clearing its
    end. And the application may well access the beginning of the huge page
    right after it has been cleared.

    To help with this situation, this patch changes the order in which
    sub-pages are cleared. In quite a few situations we know the address that
    the application will access after we clear the huge page, for example, in
    a page fault handler. Instead of clearing the huge page from beginning to
    end, we clear the sub-pages farthest from the to-be-accessed sub-page
    first, and clear that sub-page last. This makes the to-be-accessed
    sub-page the most cache-hot, and the sub-pages around it more cache-hot
    too. If we cannot know the address the application will access, the
    beginning of the huge page is assumed to be the address the application
    will access.

    With this patch, throughput increases ~28.3% in the vm-scalability
    anon-w-seq test case with 72 processes on a 2-socket Xeon E5 v3 2699
    system (36 cores, 72 threads). The test case creates 72 processes; each
    process mmaps a big anonymous memory area and writes to it from beginning
    to end. For each process, the other processes can be seen as background
    workload that generates heavy cache pressure. At the same time, the cache
    miss rate dropped from ~33.4% to ~31.7%, the IPC (instructions per cycle)
    increased from 0.56 to 0.74, and the time spent in user space was reduced
    by ~7.9%.

    Christopher Lameter suggested clearing the bytes inside a sub-page from
    end to beginning too, but tests show no visible performance difference,
    perhaps because the size of a page is small compared with the cache size.

    Thanks to Andi Kleen for proposing to use the to-be-accessed address to
    determine the order in which sub-pages are cleared.

    The hugetlbfs access address could be improved; that will be done in
    another patch.
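
    A hedged userspace illustration of the clearing order described above
    (pure C, not the kernel implementation; the helper name and the fixed
    4K/2M sizes are assumptions):

        #include <stdlib.h>
        #include <string.h>

        #define SUBPAGE_SIZE 4096UL
        #define NR_SUBPAGES  512UL              /* 2M huge page on x86_64 */

        /* Clear sub-pages farthest from 'target' first and 'target' itself
         * last, so the to-be-accessed sub-page and its neighbours end up
         * the most cache-hot. */
        static void clear_huge_region(char *base, size_t target)
        {
                size_t l = 0, r = NR_SUBPAGES;  /* [l, r) still to be cleared */

                while (l < r) {
                        /* pick whichever uncleared end is farther from target */
                        size_t i = (target - l > r - 1 - target) ? l++ : --r;

                        memset(base + i * SUBPAGE_SIZE, 0, SUBPAGE_SIZE);
                }
        }

        int main(void)
        {
                char *buf = aligned_alloc(2UL << 20, NR_SUBPAGES * SUBPAGE_SIZE);

                clear_huge_region(buf, 17);     /* pretend the fault hit sub-page 17 */
                free(buf);
                return 0;
        }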

    [ying.huang@intel.com: improve readability of clear_huge_page()]
    Link: http://lkml.kernel.org/r/20170830051842.1397-1-ying.huang@intel.com
    Link: http://lkml.kernel.org/r/20170815014618.15842-1-ying.huang@intel.com
    Suggested-by: Andi Kleen
    Signed-off-by: "Huang, Ying"
    Acked-by: Jan Kara
    Reviewed-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Nadia Yvette Chambers
    Cc: Matthew Wilcox
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Shaohua Li
    Cc: Christopher Lameter
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • After adding swap-out support for THP (Transparent Huge Page), it is
    possible that a THP in the swap cache (partly swapped out) needs to be
    split. To split such a THP, the swap cluster backing the THP needs to be
    split too; that is, the CLUSTER_FLAG_HUGE flag needs to be cleared for
    the swap cluster. This patch implements that.

    And because THP swap write-out needs the THP to stay a huge page during
    writing, the PageWriteback flag is checked before splitting.

    Link: http://lkml.kernel.org/r/20170724051840.2309-8-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • After adding support for delaying THP (Transparent Huge Page) splitting
    until after swap-out, it is possible that some page table mappings of the
    THP are turned into swap entries. So reuse_swap_page() needs to check the
    swap count in addition to the map count, as before. This patch does that.

    In the huge PMD write protect fault handler, in addition to the page map
    count, the swap count needs to be checked too, so the page lock needs to
    be acquired as well when calling reuse_swap_page(), in addition to the
    page table lock.

    [ying.huang@intel.com: silence a compiler warning]
    Link: http://lkml.kernel.org/r/87bmnzizjy.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20170724051840.2309-4-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: "Kirill A . Shutemov"
    Cc: Dan Williams
    Cc: Jens Axboe
    Cc: Michal Hocko
    Cc: Ross Zwisler [for brd.c, zram_drv.c, pmem.c]
    Cc: Vishal L Verma
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • attribute_group structures are not supposed to change at runtime. All
    functions working with attribute_group provided by <linux/sysfs.h> work
    with const attribute_group, so mark the non-const structs as const.

    Link: http://lkml.kernel.org/r/1501157240-3876-1-git-send-email-arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     

25 Aug, 2017

1 commit


19 Aug, 2017

1 commit

  • Wenwei Tao has noticed that our current assumption -- that the oom victim
    is dying, will never make any further visible changes, and so the
    oom_reaper can tear it down -- is not entirely true.

    __task_will_free_mem considers a task dying when SIGNAL_GROUP_EXIT is
    set, but do_group_exit sends SIGKILL to all threads _after_ the flag is
    set. So there is a race window during which some threads won't yet have
    fatal_signal_pending set while the oom_reaper could already start
    unmapping the address space. Moreover, some paths might not check for
    fatal signals before each PF/g-u-p/copy_from_user.

    We already have protection for oom_reaper vs. PF races by checking
    MMF_UNSTABLE. This has, however, been checked only for kernel threads
    (use_mm users), which can outlive the oom victim. A simple fix would be
    to extend the current check in handle_mm_fault to all tasks, but that
    wouldn't be sufficient, because the current check assumes that a kernel
    thread would bail out after EFAULT from get_user*/copy_from_user and
    never re-read the same address, which would now succeed because the PF
    path has already established the page tables. This seems to be the case
    for the only existing use_mm user currently (the virtio driver), but it
    is rather fragile in general.

    This is even more fragile for more complex paths such as
    generic_perform_write, which can re-read the same address multiple times
    (e.g. iov_iter_copy_from_user_atomic failing and then
    iov_iter_fault_in_readable on retry).

    Therefore we have to implement MMF_UNSTABLE protection in a robust way
    and never make potentially corrupted content visible. That requires
    hooking deeper into the PF path and checking for the flag _every time_
    before a pte for anonymous memory is established (that means all
    !VM_SHARED mappings).

    The corruption can be triggered artificially
    (http://lkml.kernel.org/r/201708040646.v746kkhC024636@www262.sakura.ne.jp)
    but there doesn't seem to be any real-life bug report. The race window
    should be quite tight and hard to trigger most of the time.
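
    A minimal sketch of the check described above as it would sit in the
    anonymous fault paths (the exact helper shape is an assumption based on
    this description):

        /* before installing a pte for a !VM_SHARED mapping */
        if (unlikely(test_bit(MMF_UNSTABLE, &mm->flags)))
                return VM_FAULT_SIGBUS; /* the oom_reaper may have unmapped
                                         * part of this address space: never
                                         * expose possibly corrupted content */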

    Link: http://lkml.kernel.org/r/20170807113839.16695-3-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Wenwei Tao
    Tested-by: Tetsuo Handa
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Argangeli
    Cc: David Rientjes
    Cc: Oleg Nesterov
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Aug, 2017

3 commits

  • Merge commit:

    040cca3ab2f6 ("Merge branch 'linus' into locking/core, to resolve conflicts")

    overlooked the fact that do_huge_pmd_numa_page() now does two TLB
    flushes. Commit:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and commit:

    a9b802500ebb ("Revert "mm: numa: defer TLB flush for THP migration as long as possible"")

    Both moved the TLB flush around, but slightly differently; the end result
    is that what was one flush became two.

    Clean this up.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Conflicts:
    include/linux/mm_types.h
    mm/huge_memory.c

    I removed the smp_mb__before_spinlock() like the following commit does:

    8b1b436dd1cc ("mm, locking: Rework {set,clear,mm}_tlb_flush_pending()")

    and fixed up the affected commits.

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • While deferring TLB flushes is a good practice, the reverted patch caused
    pending TLB flushes to be checked while the page-table lock is not taken.
    As a result, on architectures with a weak memory model (PPC), Linux may
    miss a memory barrier, miss the fact that TLB flushes are pending, and
    (in theory) cause memory corruption.

    Since the alternative of using smp_mb__after_unlock_lock() was
    considered a bit open-coded, and the performance impact is expected to
    be small, the previous patch is reverted.

    This reverts b0943d61b8fa ("mm: numa: defer TLB flush for THP migration
    as long as possible").

    Link: http://lkml.kernel.org/r/20170802000818.4760-4-namit@vmware.com
    Signed-off-by: Nadav Amit
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Minchan Kim
    Cc: Sergey Senozhatsky
    Cc: Andy Lutomirski
    Cc: "David S. Miller"
    Cc: Andrea Arcangeli
    Cc: Heiko Carstens
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Russell King
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nadav Amit
     

10 Aug, 2017

1 commit

  • Commit:

    af2c1401e6f9 ("mm: numa: guarantee that tlb_flush_pending updates are visible before page table updates")

    added smp_mb__before_spinlock() to set_tlb_flush_pending(). I think we
    can solve the same problem without this barrier.

    If instead we mandate that mm_tlb_flush_pending() is used while
    holding the PTL we're guaranteed to observe prior
    set_tlb_flush_pending() instances.

    For this to work we need to rework migrate_misplaced_transhuge_page()
    a little and move the test up into do_huge_pmd_numa_page().

    NOTE: this relies on flush_tlb_range() to guarantee:

    (1) it ensures that prior page table updates are visible to the
    page table walker and
    (2) it ensures that subsequent memory accesses are only made
    visible after the invalidation has completed

    This is required for architectures that implement TRANSPARENT_HUGEPAGE
    (arc, arm, arm64, mips, powerpc, s390, sparc, x86) or otherwise use
    mm_tlb_flush_pending() in their page-table operations (arm, arm64,
    x86).

    This appears true for:

    - arm (DSB ISB before and after),
    - arm64 (DSB ISHST before, and DSB ISH after),
    - powerpc (PTESYNC before and after),
    - s390 and x86 TLB invalidate are serializing instructions

    But I failed to understand the situation for:

    - arc, mips, sparc

    Now SPARC64 is a wee bit special in that flush_tlb_range() is a no-op
    and it flushes the TLBs using arch_{enter,leave}_lazy_mmu_mode()
    inside the PTL. It still needs to guarantee the PTL unlock happens
    _after_ the invalidate completes.

    Vineet, Ralf and Dave could you guys please have a look?

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Ralf Baechle
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Vineet Gupta
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Jul, 2017

3 commits

  • To swap out a THP (Transparent Huge Page), before splitting the THP, the
    swap cluster will be allocated and the THP will be added into the swap
    cache. But it is possible that the THP cannot be split, in which case we
    must delete the THP from the swap cache and free the swap cluster. To
    avoid that, this patch checks first whether the THP can be split. The
    check can only be done racily, but it is good enough for most cases.
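
    A minimal sketch of the early check (the surrounding swap-out path is
    simplified):

        if (!can_split_huge_page(page, NULL))
                goto fallback;  /* don't allocate a swap cluster and add the
                                 * THP to the swap cache only to tear both
                                 * down again */
        /* allocate the swap cluster, add the THP to the swap cache, split */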

    With the patch, the swap out throughput improves 3.6% (from about
    4.16GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    Link: http://lkml.kernel.org/r/20170515112522.32457-5-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov [for can_split_huge_page()]
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Shaohua Li
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that we
    cannot saturate the disk bandwidth with a single logical CPU when doing
    page swap-out, even on a high-end server machine, because the performance
    of the storage device has improved faster than that of a single logical
    CPU. And it seems that the trend will not change in the near future. On
    the other hand, THP is becoming more and more popular because of
    increased memory sizes. So it becomes necessary to optimize THP swap
    performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for swap reads, which are usually 4k random IO.
    This will improve the performance of THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by applications. The 2M of continuous pages will be
    freed up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because the speed at which khugepaged collapses normal pages into a
    THP is quite slow. After a THP is split during swap-out, it will take
    quite a long time for the normal pages to collapse back into a THP
    after being swapped in. High THP utilization also helps the efficiency
    of page-based memory management.

    There are some concerns regarding THP swap-in, mainly because the
    possibly enlarged read/write IO size (for swap in/out) may put more
    overhead on the storage device. To deal with that, THP swap-in should be
    turned on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMAs with MADV_HUGEPAGE, etc.

    This patchset is the first step of THP swap support. The plan is to delay
    splitting the THP step by step, and finally to avoid splitting the THP
    during swap-out altogether and to swap the THP out/in as a whole.

    As the first step, in this patchset, huge page splitting is delayed from
    almost the first step of swap-out to after allocating the swap space for
    the THP and adding the THP into the swap cache. This will reduce lock
    acquiring/releasing for the locks used for swap cache management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, huge page splitting is delayed from almost the first step
    of swap-out to after allocating the swap space for the THP (Transparent
    Huge Page) and adding the THP into the swap cache. This will batch the
    corresponding operations, thus improving THP swap-out throughput.

    This is the first step of the THP swap optimization. The plan is to delay
    splitting the THP step by step and finally avoid splitting the THP
    altogether.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on the x86_64 architecture (512). For other
    architectures that want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the architecture's
    Kconfig file. In effect, this will enlarge the swap cluster size by 2
    times on x86_64, which may make it harder to find a free cluster when the
    swap space becomes fragmented, and so may reduce continuous swap space
    allocation and sequential writes in theory. The performance tests in 0day
    show no regressions caused by this.

    In future THP swap optimization, some information about the swapped-out
    THP (such as the compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support charging
    or uncharging a swap cluster backing a THP as a whole.

    Swap cluster allocate/free functions are added to allocate/free a swap
    cluster for a THP. A fairly simple algorithm is used for swap cluster
    allocation: only the first swap device in the priority list is tried for
    allocating the swap cluster. If that is not successful, the function
    fails and the caller falls back to allocating a single swap slot instead.
    This works well enough for normal cases. If the difference in the number
    of free swap clusters among multiple swap devices is significant, it is
    possible that some THPs are split earlier than necessary; for example,
    this could be caused by a big size difference among multiple swap
    devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be
    enhanced in the future with a multi-order radix tree, but because we will
    split the THP soon during swap-out, that optimization doesn't make much
    sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock is held while allocating
    the swap cluster, adding the THP into the swap cache, and splitting the
    THP. So in code paths other than swap-out, if the THP needs to be split,
    PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Reinette reported the following crash:

    BUG: Bad page state in process log2exe pfn:57600
    page:ffffea00015d8000 count:0 mapcount:0 mapping: (null) index:0x20200
    flags: 0x4000000000040019(locked|uptodate|dirty|swapbacked)
    raw: 4000000000040019 0000000000000000 0000000000020200 00000000ffffffff
    raw: ffffea00015d8020 ffffea00015d8020 0000000000000000 0000000000000000
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags: 0x1(locked)
    Modules linked in: rfcomm 8021q bnep intel_rapl x86_pkg_temp_thermal coretemp efivars btusb btrtl btbcm pwm_lpss_pci snd_hda_codec_hdmi btintel pwm_lpss snd_hda_codec_realtek snd_soc_skl snd_hda_codec_generic snd_soc_skl_ipc spi_pxa2xx_platform snd_soc_sst_ipc snd_soc_sst_dsp i2c_designware_platform i2c_designware_core snd_hda_ext_core snd_soc_sst_match snd_hda_intel snd_hda_codec mei_me snd_hda_core mei snd_soc_rt286 snd_soc_rl6347a snd_soc_core efivarfs
    CPU: 1 PID: 354 Comm: log2exe Not tainted 4.12.0-rc7-test-test #19
    Hardware name: Intel corporation NUC6CAYS/NUC6CAYB, BIOS AYAPLCEL.86A.0027.2016.1108.1529 11/08/2016
    Call Trace:
    bad_page+0x16a/0x1f0
    free_pages_check_bad+0x117/0x190
    free_hot_cold_page+0x7b1/0xad0
    __put_page+0x70/0xa0
    madvise_free_huge_pmd+0x627/0x7b0
    madvise_free_pte_range+0x6f8/0x1150
    __walk_page_range+0x6b5/0xe30
    walk_page_range+0x13b/0x310
    madvise_free_page_range.isra.16+0xad/0xd0
    madvise_free_single_vma+0x2e4/0x470
    SyS_madvise+0x8ce/0x1450

    If somebody frees the page under us and we hold the last reference to it,
    put_page() will attempt to free the page before unlocking it.

    The fix is a trivial reordering of the operations.
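
    A minimal sketch of the reorder, simplified from madvise_free_huge_pmd():

        /* buggy order: put_page(page); unlock_page(page); -- could free a
         * locked page when ours is the last reference */
        unlock_page(page);
        put_page(page);         /* safe even if this frees the page */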

    Dave said:
    "I came up with the exact same patch. For posterity, here's the test
    case, generated by syzkaller and trimmed down by Reinette:

    https://www.sr71.net/~dave/intel/log2.c

    And the config that helps detect this:

    https://www.sr71.net/~dave/intel/config-log2"

    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Link: http://lkml.kernel.org/r/20170628101249.17879-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Reinette Chatre
    Acked-by: Dave Hansen
    Acked-by: Michal Hocko
    Acked-by: Minchan Kim
    Cc: Huang Ying
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Jun, 2017

1 commit

  • In do_huge_pmd_numa_page(), we attempt to handle a migrating thp pmd by
    waiting until the pmd is unlocked before we return and retry. However,
    we can race with migrate_misplaced_transhuge_page():

    // do_huge_pmd_numa_page                   // migrate_misplaced_transhuge_page()
    // Holds 0 refs on page                    // Holds 2 refs on page

    vmf->ptl = pmd_lock(vma->vm_mm, vmf->pmd);
    /* ... */
    if (pmd_trans_migrating(*vmf->pmd)) {
            page = pmd_page(*vmf->pmd);
            spin_unlock(vmf->ptl);
                                               ptl = pmd_lock(mm, pmd);
                                               if (page_count(page) != 2) {
                                                       /* roll back */
                                               }
                                               /* ... */
                                               mlock_migrate_page(new_page, page);
                                               /* ... */
                                               spin_unlock(ptl);
                                               put_page(page);
                                               put_page(page); // page freed here
            wait_on_page_locked(page);
            goto out;
    }

    This can result in the freed page having its waiters flag set
    unexpectedly, which trips the PAGE_FLAGS_CHECK_AT_PREP checks in the
    page alloc/free functions. This has been observed on arm64 KVM guests.

    We can avoid this by having do_huge_pmd_numa_page() take a reference on
    the page before dropping the pmd lock, mirroring what we do in
    __migration_entry_wait().

    When we hit the race, migrate_misplaced_transhuge_page() will see the
    reference and abort the migration, as it may do today in other cases.
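
    A minimal sketch of the fix in do_huge_pmd_numa_page() (simplified from
    the upstream change):

        if (unlikely(pmd_trans_migrating(*vmf->pmd))) {
                page = pmd_page(*vmf->pmd);
                if (!get_page_unless_zero(page))        /* pin before unlocking */
                        goto out_unlock;
                spin_unlock(vmf->ptl);
                wait_on_page_locked(page);
                put_page(page);
                goto out;
        }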

    Fixes: b8916634b77bffb2 ("mm: Prevent parallel splits during THP migration")
    Link: http://lkml.kernel.org/r/1497349722-6731-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Mark Rutland
    Signed-off-by: Will Deacon
    Acked-by: Steve Capper
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mark Rutland
     

09 May, 2017

2 commits

  • Although all architectures use a deposited page table for THP on
    anonymous VMAs, some architectures (s390 and powerpc) require the
    deposited storage even for file backed VMAs due to quirks of their MMUs.

    This patch adds support for depositing a table in DAX PMD fault handling
    path for archs that require it. Other architectures should see no
    functional changes.

    Link: http://lkml.kernel.org/r/20170411174233.21902-3-oohall@gmail.com
    Signed-off-by: Oliver O'Halloran
    Cc: Reza Arbab
    Cc: Balbir Singh
    Cc: linux-nvdimm@ml01.01.org
    Cc: Oliver O'Halloran
    Cc: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     
  • Depending on the flags of the PMD being zapped there may or may not be a
    deposited pgtable to be freed. In two of the three cases this is open
    coded while the third uses the zap_deposited_table() helper. This patch
    converts the others to use the helper to clean things up a bit.

    Link: http://lkml.kernel.org/r/20170411174233.21902-2-oohall@gmail.com
    Cc: Reza Arbab
    Cc: Balbir Singh
    Cc: linux-nvdimm@ml01.01.org
    Cc: Oliver O'Halloran
    Cc: Aneesh Kumar K.V
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver O'Halloran
     

04 May, 2017

4 commits

  • try_to_unmap() returns SWAP_SUCCESS or SWAP_FAIL so it's suitable for
    boolean return. This patch changes it.

    Link: http://lkml.kernel.org/r/1489555493-14659-8-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Naoya Horiguchi
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When memory pressure is high, we free MADV_FREE pages. If the pages are
    not dirty in the pte, they can be freed immediately; otherwise we can't
    reclaim them, so we put the pages back on the anonymous LRU list (by
    setting the SwapBacked flag) and they will be reclaimed in the normal
    swapout way.

    We use the normal page reclaim policy. Since MADV_FREE pages are put into
    the inactive file list, such pages and inactive file pages are reclaimed
    according to their age. This is expected, because we don't want to
    reclaim too many MADV_FREE pages before used-once pages.

    Based on Minchan's original patch

    [minchan@kernel.org: clean up lazyfree page handling]
    Link: http://lkml.kernel.org/r/20170303025237.GB3503@bbox
    Link: http://lkml.kernel.org/r/14b8eb1d3f6bf6cc492833f183ac8c304e560484.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Signed-off-by: Minchan Kim
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • madvise()'s MADV_FREE indicates that pages are 'lazyfree'. They are still
    anonymous pages, but they can be freed without pageout. To distinguish
    them from normal anonymous pages, we clear their SwapBacked flag.

    MADV_FREE pages can be freed without pageout, so they are pretty much
    like used-once file pages. For such pages, we'd like to reclaim them once
    there is memory pressure. Also, it might be unfair to always reclaim
    MADV_FREE pages before used-once file pages, yet we definitely want to
    reclaim them before other anonymous and file pages.

    To speed up MADV_FREE page reclaim, we put the pages into the
    LRU_INACTIVE_FILE list. The rationale is that the LRU_INACTIVE_FILE list
    is tiny nowadays and should be full of used-once file pages. Reclaiming
    MADV_FREE pages will not interfere much with anonymous and active file
    pages. And the inactive file pages and MADV_FREE pages are reclaimed
    according to their age, so we don't reclaim too many MADV_FREE pages
    either. Putting the MADV_FREE pages into the LRU_INACTIVE_FILE list also
    means we can reclaim the pages without swap support. This idea was
    suggested by Johannes.
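
    A hedged userspace sketch of the MADV_FREE semantics described above (the
    sizes are illustrative; MADV_FREE needs a 4.5+ kernel and a libc that
    exposes the flag):

        #include <string.h>
        #include <sys/mman.h>

        int main(void)
        {
                size_t len = 64UL << 20;
                char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

                memset(p, 1, len);              /* pages become dirty anonymous */
                madvise(p, len, MADV_FREE);     /* lazyfree: reclaimable without pageout */
                /* until reclaim happens, reads still see 1; once the kernel
                 * reclaims a page, a later read faults in a zero page, and
                 * any write cancels the lazyfree state for that page */
                munmap(p, len);
                return 0;
        }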

    This patch doesn't move MADV_FREE pages to the LRU_INACTIVE_FILE list
    yet, to avoid bisect failure; the next patch will do it.

    The patch is based on Minchan's original patch.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/2f87063c1e9354677b7618c647abde77b07561e5.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • There are a few places where the code assumes anonymous pages have the
    SwapBacked flag set. MADV_FREE pages are anonymous pages, but we are
    going to add them to the LRU_INACTIVE_FILE list and clear the SwapBacked
    flag for them. The assumption doesn't hold any more, so fix those places.

    Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

14 Apr, 2017

3 commits

  • Both MADV_DONTNEED and MADV_FREE are handled with down_read(mmap_sem).

    It's critical to not clear pmd intermittently while handling MADV_FREE
    to avoid race with MADV_DONTNEED:

    CPU0:                              CPU1:
                                       madvise_free_huge_pmd()
                                        pmdp_huge_get_and_clear_full()
    madvise_dontneed()
     zap_pmd_range()
      pmd_trans_huge(*pmd) == 0 (without ptl)
      // skip the pmd
                                        set_pmd_at();
                                        // pmd is re-established

    This results in MADV_DONTNEED skipping the pmd, leaving it not cleared.
    It violates the MADV_DONTNEED interface and can result in userspace
    misbehaviour.

    Basically it's the same race as with numa balancing in
    change_huge_pmd(), but a bit simpler to mitigate: we don't need to
    preserve dirty/young flags here due to MADV_FREE functionality.

    [kirill.shutemov@linux.intel.com: Urgh... Power is special again]
    Link: http://lkml.kernel.org/r/20170303102636.bhd2zhtpds4mt62a@black.fi.intel.com
    Link: http://lkml.kernel.org/r/20170302151034.27829-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Minchan Kim
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • In the prot_numa case, we are under down_read(mmap_sem). It's critical
    not to clear the pmd intermittently, to avoid racing with MADV_DONTNEED,
    which is also under down_read(mmap_sem):

    CPU0:                              CPU1:
                                       change_huge_pmd(prot_numa=1)
                                        pmdp_huge_get_and_clear_notify()
    madvise_dontneed()
     zap_pmd_range()
      pmd_trans_huge(*pmd) == 0 (without ptl)
      // skip the pmd
                                        set_pmd_at();
                                        // pmd is re-established

    The race makes MADV_DONTNEED miss the huge pmd and not clear it, which
    may break userspace.

    Found by code analysis; never seen triggered.

    Link: http://lkml.kernel.org/r/20170302151034.27829-3-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Patch series "thp: fix few MADV_DONTNEED races"

    For MADV_DONTNEED to work properly with huge pages, it's critical to not
    clear pmd intermittently unless you hold down_write(mmap_sem).

    Otherwise MADV_DONTNEED can miss the THP which can lead to userspace
    breakage.

    See example of such race in commit message of patch 2/4.

    All these races were found by code inspection. I haven't seen them
    triggered. I don't think it's worth applying these to stable@.

    This patch (of 4):

    Restructure code in preparation for a fix.

    Link: http://lkml.kernel.org/r/20170302151034.27829-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov