23 Mar, 2018

2 commits

  • Commit 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and
    madvised allocations") changed the page allocator to no longer detect
    thp allocations based on __GFP_NORETRY.

    It did not, however, modify the mem cgroup try_charge() path to avoid
    oom kill for either khugepaged collapsing or thp faulting. It is never
    expected to oom kill a process to allocate a hugepage for thp; reclaim
    is governed by the thp defrag mode and MADV_HUGEPAGE, but allocations
    (and charging) should fall back instead of oom killing processes.
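
    A hedged sketch of the kind of fix described (the call signature
    matches the v4.16-era mem_cgroup_try_charge(); placement and labels
    are illustrative):

    /* Charge the THP with __GFP_NORETRY so the memcg path fails fast
     * and we fall back to small pages instead of oom killing. */
    if (unlikely(mem_cgroup_try_charge(new_page, mm,
                                       gfp | __GFP_NORETRY, &memcg, true))) {
            result = SCAN_CGROUP_CHARGE_FAIL;
            goto out_nolock;
    }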

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803191409420.124411@chino.kir.corp.google.com
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
khugepaged is not yet able to convert PTE-mapped huge pages back to
    PMD-mapped ones, so we do not collapse such pages; see the check in
    khugepaged_scan_pmd().

    But if, between khugepaged_scan_pmd() and
    __collapse_huge_page_isolate(), somebody manages to instantiate a THP
    in the range and then split the PMD back to PTEs, we have a problem:
    VM_BUG_ON_PAGE(PageCompound(page)) gets triggered.

    This is possible because we drop mmap_sem during collapse in order to
    re-take it for write.

    Replace the VM_BUG_ON() with a graceful collapse failure.
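
    A hedged sketch of the graceful failure (SCAN_PAGE_COMPOUND follows
    khugepaged's existing scan-result naming; treat the exact placement
    as illustrative):

    /* In __collapse_huge_page_isolate(): bail out instead of BUG-ing
     * when a compound page shows up under us. */
    if (PageCompound(page)) {
            result = SCAN_PAGE_COMPOUND;
            goto out;
    }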

    Link: http://lkml.kernel.org/r/20180315152353.27989-1-kirill.shutemov@linux.intel.com
    Fixes: b1caa957ae6d ("khugepaged: ignore pmd tables with THP mapped with ptes")
    Signed-off-by: Kirill A. Shutemov
    Cc: Laura Abbott
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

01 Feb, 2018

2 commits

In the current design, khugepaged needs to acquire mmap_sem before
    scanning an mm. But in some corner cases khugepaged may scan a process
    which is modifying its memory mapping, so khugepaged blocks in an
    uninterruptible state. The process might hold mmap_sem for a long time
    when modifying a huge memory space, and this can trigger the
    khugepaged hang below:

    INFO: task khugepaged:270 blocked for more than 120 seconds.
    Tainted: G E 4.9.65-006.ali3000.alios7.x86_64 #1
    "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
    khugepaged D 0 270 2 0x00000000 
    ffff883f3deae4c0 0000000000000000 ffff883f610596c0 ffff883f7d359440
    ffff883f63818000 ffffc90019adfc78 ffffffff817079a5 d67e5aa8c1860a64
    0000000000000246 ffff883f7d359440 ffffc90019adfc88 ffff883f610596c0
    Call Trace:
    schedule+0x36/0x80
    rwsem_down_read_failed+0xf0/0x150
    call_rwsem_down_read_failed+0x18/0x30
    down_read+0x20/0x40
    khugepaged+0x476/0x11d0
    kthread+0xe6/0x100
    ret_from_fork+0x25/0x30

It is pointless to simply block khugepaged waiting for the semaphore,
    so replace down_read() with down_read_trylock() and move on to scan
    the next mm quickly instead of blocking, giving other processes more
    chances to install THP. khugepaged can then come back to scan the
    skipped mm when it has finished the current full_scan round.

    The change also appears to improve khugepaged efficiency a little
    bit.
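
    A minimal sketch of the pattern (the bail-out label is illustrative
    of khugepaged's scan loop):

    /* Skip a contended mm instead of sleeping on its semaphore; the
     * skipped mm is revisited on the next full_scan round. */
    if (unlikely(!down_read_trylock(&mm->mmap_sem)))
            goto breakouterloop_mmap_sem;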

    Below is the test result when running LTP on a 24 cores 4GB memory 2
    nodes NUMA VM:

                              pristine   w/ trylock
    full_scan                      197          187
    pages_collapsed                 21           26
    thp_fault_alloc              40818        44466
    thp_fault_fallback           18413        16679
    thp_collapse_alloc              21          150
    thp_collapse_alloc_failed       14           16
    thp_file_alloc                 369          369

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: tweak comment]
    [arnd@arndb.de: avoid uninitialized variable use]
    Link: http://lkml.kernel.org/r/20171215125129.2948634-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/1513281203-54878-1-git-send-email-yang.s@alibaba-inc.com
    Signed-off-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
Several users of unmap_mapping_range() would prefer to express their
    range in pages rather than bytes. Unfortunately, on a 32-bit kernel,
    you have to remember to cast your page number to a 64-bit type before
    shifting it, and four places in the current tree didn't remember to
    do that. That's a sign of a bad interface.

    Conveniently, unmap_mapping_range() actually converts from bytes into
    pages, so hoist the guts of unmap_mapping_range() into a new function
    unmap_mapping_pages() and convert the callers which want to use pages.
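
    A hedged sketch of the resulting interface and a converted caller
    (the signature matches the helper as merged; treat the details as
    illustrative):

    /* New helper: range expressed in pages, no shift at the call site */
    void unmap_mapping_pages(struct address_space *mapping,
                             pgoff_t start, pgoff_t nr, bool even_cows);

    /* A caller no longer needs the error-prone 64-bit cast and shift:
     *   unmap_mapping_range(mapping, (loff_t)pgoff << PAGE_SHIFT,
     *                       (loff_t)nr << PAGE_SHIFT, 0);
     * becomes:
     */
    unmap_mapping_pages(mapping, pgoff, nr, false);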

    Link: http://lkml.kernel.org/r/20171206142627.GD32044@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reported-by: "zhangyi (F)"
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     

30 Nov, 2017

1 commit

  • This reverts commit 152e93af3cfe2d29d8136cc0a02a8612507136ee.

    It was a nice cleanup in theory, but as Nicolai Stange points out, we do
    need to make the page dirty for the copy-on-write case even when we
    didn't end up making it writable, since the dirty bit is what we use to
    check that we've gone through a COW cycle.

    Reported-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

28 Nov, 2017

1 commit

  • Currently we make page table entries dirty all the time regardless of
    access type and don't even consider if the mapping is write-protected.
    The reasoning is that we don't really need dirty tracking on THP and
    making the entry dirty upfront may save some time on first write to the
    page.

Unfortunately, such an approach may result in a false-positive
    can_follow_write_pmd() for the huge zero page or a read-only shmem
    file.

Let's make the page dirty only if we are about to write to the page
    anyway (as we do for small pages).

    I've restructured the code to make the entry dirty inside
    maybe_p[mu]d_mkwrite(). It also takes into account whether the vma is
    write-protected.
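
    A hedged sketch of the restructured helper (the dirty parameter and
    body follow the description above; treat the exact code as
    illustrative):

    static pmd_t maybe_pmd_mkwrite(pmd_t pmd, struct vm_area_struct *vma,
                                   bool dirty)
    {
            if (likely(vma->vm_flags & VM_WRITE)) {
                    pmd = pmd_mkwrite(pmd);
                    if (dirty)
                            pmd = pmd_mkdirty(pmd);
            }
            /* read-only mapping: leave the entry clean */
            return pmd;
    }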

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

16 Nov, 2017

1 commit

  • Let's add wrappers for ->nr_ptes with the same interface as for nr_pmd
    and nr_pud.

The patch also makes nr_ptes accounting dependent on CONFIG_MMU. Page
    table accounting doesn't make sense if you don't have page tables.

    It's preparation for consolidation of page-table counters in mm_struct.
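
    A hedged sketch of what such wrappers look like (assuming nr_ptes is
    an atomic_long_t in mm_struct, as in the kernel of that era):

    static inline void mm_inc_nr_ptes(struct mm_struct *mm)
    {
    #ifdef CONFIG_MMU
            atomic_long_inc(&mm->nr_ptes);
    #endif
    }

    static inline void mm_dec_nr_ptes(struct mm_struct *mm)
    {
    #ifdef CONFIG_MMU
            atomic_long_dec(&mm->nr_ptes);
    #endif
    }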

    Link: http://lkml.kernel.org/r/20171006100651.44742-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

02 Nov, 2017

1 commit

  • Many source files in the tree are missing licensing information, which
    makes it harder for compliance tools to determine the correct license.

    By default all files without license information are under the default
    license of the kernel, which is GPL version 2.

Update the files which contain no license information with the
    'GPL-2.0' SPDX license identifier. The SPDX identifier is a legally
    binding shorthand, which can be used instead of the full boilerplate
    text.
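
    For a C source file, the added tag is a single comment line at the
    top of the file:

    // SPDX-License-Identifier: GPL-2.0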

This patch is based on work done by Thomas Gleixner, Kate Stewart and
    Philippe Ombredanne.

    How this work was done:

Patches were generated and checked against linux-4.14-rc6 for a subset
    of the use cases:
    - file had no licensing information in it,
    - file was a */uapi/* one with no licensing information in it,
    - file was a */uapi/* one with existing licensing information,

    Further patches will be generated in subsequent months to fix up cases
    where non-standard license headers were used, and references to license
    had to be inferred by heuristics based on keywords.

The analysis to determine which SPDX License Identifier to be applied
    to a file was done in a spreadsheet of side-by-side results from the
    output of two independent scanners (ScanCode & Windriver) producing
    SPDX tag:value files created by Philippe Ombredanne. Philippe
    prepared the base worksheet, and did an initial spot review of a few
    1000 files.

    The 4.13 kernel was the starting point of the analysis with 60,537 files
    assessed. Kate Stewart did a file by file comparison of the scanner
    results in the spreadsheet to determine which SPDX license identifier(s)
    to be applied to the file. She confirmed any determination that was not
    immediately clear with lawyers working with the Linux Foundation.

The criteria used to select files for SPDX license identifier tagging
    were:
    - Files considered eligible had to be source code files.
    - Make and config files were included as candidates if they contained
    >5 lines of source.
- File already had some variant of a license header in it (even if
    <SPDX>).
    Reviewed-by: Philippe Ombredanne
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

11 Jul, 2017

1 commit

PR_SET_THP_DISABLE has a rather subtle semantic. It doesn't affect any
    existing mapping because it only updates mm->def_flags, which is a
    template for new mappings.

The mappings created after prctl(PR_SET_THP_DISABLE) have the
    VM_NOHUGEPAGE flag set. This can be quite surprising for all those
    applications which do not follow the prctl(); fork() & exec() pattern
    and want to control their own THP behavior.

    Another usecase when the immediate semantic of the prctl might be useful
    is a combination of pre- and post-copy migration of containers with
    CRIU. In this case CRIU populates a part of a memory region with data
    that was saved during the pre-copy stage. Afterwards, the region is
    registered with userfaultfd and CRIU expects to get page faults for the
    parts of the region that were not yet populated. However, khugepaged
    collapses the pages and the expected page faults do not occur.

In the more general case, prctl(PR_SET_THP_DISABLE) could be used as a
    temporary mechanism for enabling/disabling THP process wide.

    Implementation-wise, a new MMF_DISABLE_THP flag is added. This flag is
    tested when the decision whether to use huge pages is made, either
    during page fault or at the time of THP collapse.

    It should be noted that the new implementation makes
    PR_SET_THP_DISABLE a master override of any per-VMA setting, which
    was not the case previously.
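
    A hedged sketch of both sides (MMF_DISABLE_THP and the prctl call are
    as named above; the kernel-side check placement is illustrative):

    /* Kernel side (sketch): consulted both at fault time and by
     * khugepaged at collapse time. */
    if (test_bit(MMF_DISABLE_THP, &vma->vm_mm->flags))
            return false;   /* THP disabled process wide via prctl */

    /* Userspace side: takes effect immediately, no re-exec needed */
    prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);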

    Fixes: a0715cc22601 ("mm, thp: add VM_INIT_DEF_MASK and PRCTL_THP_DISABLE")
    Link: http://lkml.kernel.org/r/1496415802-30944-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Michal Hocko
    Signed-off-by: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Arnd Bergmann
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 Jun, 2017

1 commit

  • This is a partial revert of commit 338a16ba1549 ("mm, thp: copying user
    pages must schedule on collapse") which added a cond_resched() to
    __collapse_huge_page_copy().

With CONFIG_HIGHPTE, __collapse_huge_page_copy is called in atomic
    context (the pte page is mapped with kmap_atomic()) and thus
    scheduling is not possible. CONFIG_HIGHPTE is only a possible config
    on arm and i386.

    Although need_resched has been shown to be set for over 100 jiffies
    while doing the iteration in __collapse_huge_page_copy, this is better
    than doing

if (in_atomic())
            cond_resched()

    to cover only non-CONFIG_HIGHPTE configs.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1706191341550.97821@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Larry Finger
    Tested-by: Larry Finger
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 May, 2017

2 commits

  • We have encountered need_resched warnings in __collapse_huge_page_copy()
    while doing {clear,copy}_user_highpage() over HPAGE_PMD_NR source pages.

    mm->mmap_sem is held for write, but the iteration is well bounded.

    Reschedule as needed.
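
    A hedged sketch of the shape of the change (a simplification of the
    copy loop in __collapse_huge_page_copy()):

    /* Copy HPAGE_PMD_NR small pages into the new huge page; mmap_sem is
     * held for write but the iteration is bounded, so yield as needed. */
    for (_pte = pte; _pte < pte + HPAGE_PMD_NR;
         _pte++, page++, address += PAGE_SIZE) {
            copy_user_highpage(page, src_page, address, vma);
            cond_resched();
    }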

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1705101426380.109808@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
One return case of __collapse_huge_page_swapin() does not invoke the
    tracepoint, while every other return case does. This commit adds a
    tracepoint invocation for that case.

    Link: http://lkml.kernel.org/r/20170507101813.30187-1-sj38.park@gmail.com
    Signed-off-by: SeongJae Park
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    SeongJae Park
     

04 May, 2017

2 commits

Following the discussion[1], it seems every PageFlags function returns
    bool at this point, so we don't need the double negation any more.
    Although it's not a problem to keep it, it confuses future users into
    applying the double negation too.

    Remove that possibility.

    [1] https://marc.info/?l=linux-kernel&m=148881578820434
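
    An illustrative before/after (hypothetical call site; any boolean
    PageFlags test follows the same pattern):

    /* before: double negation to squash a non-zero int down to 0/1 */
    bool referenced = !!PageReferenced(page);

    /* after: PageFlags functions already return bool */
    bool referenced = PageReferenced(page);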

Frankly speaking, I would like every PageFlags function to return bool
    instead of int; it would make things clear. AFAIR, Chen Gang had
    tried it, but I don't know why it was not merged at the time.

    http://lkml.kernel.org/r/1469336184-1904-1-git-send-email-chengang@emindsoft.com.cn

    Link: http://lkml.kernel.org/r/1488868597-32222-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Chen Gang
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
There are a few places where the code assumes anonymous pages should
    have the SwapBacked flag set. MADV_FREE pages are anonymous pages,
    but we are going to add them to the LRU_INACTIVE_FILE list and clear
    the SwapBacked flag for them. The assumption doesn't hold any more,
    so fix those places.
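
    An illustrative sketch of the kind of assumption being relaxed
    (hypothetical check; the real call sites vary):

    /* before: anonymous was assumed to imply swap-backed */
    VM_BUG_ON_PAGE(!PageSwapBacked(page), page);

    /* after: an anonymous page may legitimately be !PageSwapBacked when
     * it was lazily freed with MADV_FREE, so test both bits instead of
     * asserting */
    bool lazyfree = PageAnon(page) && !PageSwapBacked(page);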

    Link: http://lkml.kernel.org/r/3945232c0df3dd6c4ef001976f35a95f18dcb407.1487965799.git.shli@fb.com
    Signed-off-by: Shaohua Li
    Acked-by: Johannes Weiner
    Acked-by: Hillf Danton
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

02 Mar, 2017

2 commits

We are going to split <linux/sched/coredump.h> out of <linux/sched.h>,
    which will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/coredump.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c
    files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps
    to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'
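
    For reference, the helper itself is essentially a one-line wrapper
    (sketch; the merged version also carries a kerneldoc comment):

    static inline void mmgrab(struct mm_struct *mm)
    {
            /* Pin the mm_struct itself (mm_count), not the address space */
            atomic_inc(&mm->mm_count);
    }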

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

11 Jan, 2017

2 commits

  • The flag was introduced by commit 78afd5612deb ("mm: add
    __GFP_OTHER_NODE flag") to allow proper accounting of remote node
    allocations done by kernel daemons on behalf of a process - e.g.
    khugepaged.

After "mm: fix remote numa hits statistics" we neither need nor
    actually use the flag, so we can safely remove it, because all
    allocations which are satisfied from their "home" node are accounted
    properly.

    [mhocko@suse.com: fix build]
    Link: http://lkml.kernel.org/r/20170106122225.GK5556@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20170102153057.9451-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Taku Izumi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
With THP page cache, when trying to build a huge page from regular pte
    pages, we just clear the pmd entry. We will take another fault and at
    that point we will find the huge page in the radix tree, thereby
    using the huge page to complete the page fault.

    The second fault path will allocate the needed pgtable_t page for
    archs like ppc64, so there is no need to deposit the same in the
    collapse path. Depositing it in the collapse path results in a
    pgtable_t memory leak, and also gives errors like:

    BUG: non-zero nr_ptes on freeing mm: 3

    Fixes: 953c66c2b22a ("mm: THP page cache support for ppc64")
    Link: http://lkml.kernel.org/r/20161212163428.6780-2-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Michael Ellerman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

15 Dec, 2016

4 commits

  • This fixes several interlinked problems with the iterators in the
    presence of multiorder entries.

    1. radix_tree_iter_next() would only advance by one slot, which would
    result in the iterators returning the same entry more than once if
    there were sibling entries.

    2. radix_tree_next_slot() could return an internal pointer instead of
    a user pointer if a tagged multiorder entry was immediately followed by
    an entry of lower order.

    3. radix_tree_next_slot() expanded to a lot more code than it used to
    when multiorder support was compiled in. And I wasn't comfortable with
    entry_to_node() being in a header file.

    Fixing radix_tree_iter_next() for the presence of sibling entries
    necessarily involves examining the contents of the radix tree, so we now
    need to pass 'slot' to radix_tree_iter_next(), and we need to change the
    calling convention so it is called *before* dropping the lock which
    protects the tree. Also rename it to radix_tree_iter_resume(), as some
    people thought it was necessary to call radix_tree_iter_next() each time
    around the loop.

    radix_tree_next_slot() becomes closer to how it looked before multiorder
    support was introduced. It only checks to see if the next entry in the
    chunk is a sibling entry or a pointer to a node; this should be rare
    enough that handling this case out of line is not a performance impact
    (and such impact is amortised by the fact that the entry we just
    processed was a multiorder entry). Also, radix_tree_next_slot() used to
    force a new chunk lookup for untagged entries, which is more expensive
    than the out of line sibling entry skipping.
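
    A hedged sketch of the new calling convention (simplified; the
    surrounding locking is illustrative of a pre-xarray page cache user):

    radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, start) {
            /* ... process the entry at 'slot' ... */

            /* Must be called *before* dropping the lock that protects
             * the tree; it also skips over any sibling entries. */
            slot = radix_tree_iter_resume(slot, &iter);
            spin_unlock_irq(&mapping->tree_lock);
            cond_resched();
            spin_lock_irq(&mapping->tree_lock);
    }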

    Link: http://lkml.kernel.org/r/1480369871-5271-55-git-send-email-mawilcox@linuxonhyperv.com
    Signed-off-by: Matthew Wilcox
    Tested-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Ross Zwisler
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
Add an orig_pte field to the vm_fault structure to allow
    ->page_mkwrite handlers to fully handle the fault.

    This also saves us some passing of extra arguments around.

    Link: http://lkml.kernel.org/r/1479460644-25076-8-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
struct vm_fault already has a pgoff entry. Use it instead of passing
    pgoff as a separate argument and then assigning it later.

    Link: http://lkml.kernel.org/r/1479460644-25076-4-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Reviewed-by: Ross Zwisler
    Acked-by: Kirill A. Shutemov
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
Currently we have two different structures for passing fault
    information around - struct vm_fault and struct fault_env. DAX will
    need more information in struct vm_fault to handle its faults, so the
    content of that structure would become even closer to fault_env.
    Furthermore it would need to generate struct fault_env to be able to
    call some of the generic functions. So at this point I don't think
    there's much use in keeping these two structures separate. Just embed
    into struct vm_fault all that is needed to use it for both purposes.
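
    A hedged sketch of the merged structure (field selection is
    illustrative, based on what the two structures carried at the time):

    struct vm_fault {
            struct vm_area_struct *vma;   /* from fault_env */
            unsigned int flags;           /* FAULT_FLAG_xxx */
            pgoff_t pgoff;                /* logical page offset */
            unsigned long address;        /* faulting address */
            pmd_t *pmd;                   /* from fault_env */
            pte_t *pte;                   /* from fault_env */
            pte_t orig_pte;               /* value of PTE at fault time */
            struct page *page;            /* set by fault handlers */
            /* ... */
    };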

    Link: http://lkml.kernel.org/r/1479460644-25076-2-git-send-email-jack@suse.cz
    Signed-off-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Cc: Ross Zwisler
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     

13 Dec, 2016

4 commits

Add an arch-specific callback in the generic THP page cache code that
    will deposit and withdraw the preallocated page table. Archs like
    ppc64 use this preallocated table to store the hash pte slot
    information.
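
    A hedged sketch of the hook (arch_needs_pgtable_deposit() is my best
    reading of the callback's name; the fallback and call site are
    simplified):

    /* Default: most architectures do not need a deposit for file THP */
    #ifndef arch_needs_pgtable_deposit
    #define arch_needs_pgtable_deposit() (false)
    #endif

    /* Fault path (sketch): deposit only when the arch asks for it */
    if (arch_needs_pgtable_deposit())
            pgtable_trans_huge_deposit(vma->vm_mm, vmf->pmd, pgtable);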

    Testing:
    kernel build of the patch series on tmpfs mounted with option huge=always

    The related thp stat:
    thp_fault_alloc 72939
    thp_fault_fallback 60547
    thp_collapse_alloc 603
    thp_collapse_alloc_failed 0
    thp_file_alloc 253763
    thp_file_mapped 4251
    thp_split_page 51518
    thp_split_page_failed 1
    thp_deferred_split_page 73566
    thp_split_pmd 665
    thp_zero_page_alloc 3
    thp_zero_page_alloc_failed 0

    [akpm@linux-foundation.org: remove unneeded parentheses, per Kirill]
    Link: http://lkml.kernel.org/r/20161113150025.17942-2-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
The bug in khugepaged fixed earlier in this series shows that radix
    tree slot replacement is fragile; and it will become more so when not
    only NULL -> NULL transitions need to be caught but transitions from
    and to exceptional entries as well. We need checks.

    Re-implement radix_tree_replace_slot() on top of the sanity-checked
    __radix_tree_replace(). This requires existing callers to also pass the
    radix tree root, but it'll warn us when somebody replaces slots with
    contents that need proper accounting (transitions between NULL entries,
    real entries, exceptional entries) and where a replacement through the
    slot pointer would corrupt the radix tree node counts.
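
    A hedged sketch of the updated call (the root argument is the new
    requirement described above; names follow the pre-xarray page cache):

    /* before: no way to sanity-check what the slot transition does */
    radix_tree_replace_slot(slot, new_page);

    /* after: the root lets __radix_tree_replace() verify and account
     * the transition (NULL / real / exceptional entries) */
    radix_tree_replace_slot(&mapping->page_tree, slot, new_page);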

    Link: http://lkml.kernel.org/r/20161117193021.GB23430@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Suggested-by: Jan Kara
    Reviewed-by: Jan Kara
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
The radix tree counts valid entries in each tree node. Entries stored
    in the tree cannot be removed by simply storing NULL in the slot, or
    the internal counters will be off and the node never gets freed
    again.

    When collapsing a shmem page fails, restore the holes that were filled
    with radix_tree_insert() with a proper radix tree deletion.
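
    A hedged sketch of the fix's shape (a simplified rollback step; the
    iteration and locking details are illustrative):

    /* On collapse failure, undo the earlier radix_tree_insert()s with
     * real deletions so node->count stays correct. */
    spin_lock_irq(&mapping->tree_lock);
    radix_tree_delete(&mapping->page_tree, index);
    spin_unlock_irq(&mapping->tree_lock);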

    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Link: http://lkml.kernel.org/r/20161117191138.22769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Patch series "mm: workingset: radix tree subtleties & single-page file
    refaults", v3.

    This is another revision of the radix tree / workingset patches based on
    feedback from Jan and Kirill.

    This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow
    entries without radix tree node"). That patch fixed an issue that was
    caused mainly by the page cache sneaking special shadow page entries
    into the radix tree and relying on subtleties in the radix tree code to
    make that work. The fix also had to stop tracking refaults for
    single-page files because shadow pages stored as direct pointers in
    radix_tree_root->rnode weren't properly handled during tree extension.

These patches make the radix tree code explicitly support and track
    such special entries, to eliminate the subtleties and to restore the
    thrash detection for single-page files.

    This patch (of 9):

    When a radix tree iteration drops the tree lock, another thread might
    swoop in and free the node holding the current slot. The iteration
    needs to do another tree lookup from the current index to continue.

    [kirill.shutemov@linux.intel.com: re-lookup for replacement]
    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Link: http://lkml.kernel.org/r/20161117191138.22769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

01 Dec, 2016

1 commit

  • Commit b46e756f5e47 ("thp: extract khugepaged from mm/huge_memory.c")
    moved code from huge_memory.c to khugepaged.c. Some of this code should
    be compiled only when CONFIG_SYSFS is enabled but the condition around
    this code was not moved into khugepaged.c.

    The result is a compilation error when CONFIG_SYSFS is disabled:

    mm/built-in.o: In function `khugepaged_defrag_store': khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store'
    mm/built-in.o: In function `khugepaged_defrag_show': khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show'

    This commit adds the #ifdef CONFIG_SYSFS around the code related to
    sysfs.
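
    A hedged sketch of the guard (function names are taken from the error
    messages above; the flag constant and bodies are my best reading and
    should be treated as illustrative):

    #ifdef CONFIG_SYSFS
    static ssize_t khugepaged_defrag_show(struct kobject *kobj,
                                          struct kobj_attribute *attr,
                                          char *buf)
    {
            return single_hugepage_flag_show(kobj, attr, buf,
                            TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
    }
    /* ... khugepaged_defrag_store() and the attribute tables likewise ... */
    #endif /* CONFIG_SYSFS */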

    Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Jérémy Lefaure
    Acked-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérémy Lefaure
     

20 Sep, 2016

2 commits

Currently, khugepaged does not permit swapin if there are not enough
    young pages in a THP. The problem is that when a THP does not have
    enough young pages, khugepaged leaks mapped ptes.

    This patch prohibits leaking mapped ptes.

    Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
hugepage_vma_revalidate() tries to re-check whether we should still
    try to collapse small pages into a huge one after re-acquiring
    mmap_sem.

    The problem Dmitry Vyukov reported[1] is that the vma found by
    hugepage_vma_revalidate() can be suitable for huge pages, but not the
    same vma we had before dropping mmap_sem. And dereferencing the
    original vma can lead to fun results...

    Let's use the vma hugepage_vma_revalidate() found instead of assuming
    it's the same as what we had before the lock was dropped.

    [1] http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com
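
    A hedged sketch of the interface change (the out-parameter shape
    follows the description; details are illustrative):

    /* before: caller keeps using its possibly-stale vma */
    result = hugepage_vma_revalidate(mm, address);

    /* after: the revalidated vma is handed back to the caller */
    result = hugepage_vma_revalidate(mm, address, &vma);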

    Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vegard Nossum
    Cc: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: Greg Thelen
    Cc: Suleiman Souhlal
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: syzkaller
    Cc: Kostya Serebryany
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

29 Jul, 2016

4 commits

After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page
    faults in areas where madvise(MADV_HUGEPAGE) was used to try as hard
    as khugepaged, as the process has indicated that it benefits from
    THPs and is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
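
    For reference, a hedged sketch of the distinction being introduced
    (the definitions follow the description above; the exact flag
    composition may differ from the merged version):

    /* no reclaim at all - the page fault default */
    #define GFP_TRANSHUGE_LIGHT ((GFP_HIGHUSER_MOVABLE | __GFP_COMP | \
                                  __GFP_NOMEMALLOC | __GFP_NOWARN) & \
                                 ~__GFP_RECLAIM)

    /* direct reclaim only (no kswapd wakeup) - the khugepaged default */
    #define GFP_TRANSHUGE (GFP_TRANSHUGE_LIGHT | __GFP_DIRECT_RECLAIM)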

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

* get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit 444eb2a449ef ("mm: thp: set THP
    defrag by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones, but this is unavoidable without
    caching all nodes traversed so far. The documentation and interface
    to userspace are the same from a configuration perspective and will
    be similar in behaviour unless node-local allocation requests were
    also limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
There are now a number of accounting oddities, such as mapped file
    pages being accounted for on the node while the total number of file
    pages is accounted on the zone. This can be coped with to some
    extent, but it's confusing, so this patch moves the relevant
    file-based accounting to the node. Due to throttling logic in the
    page allocator for reliable OOM detection, it is still necessary to
    track dirty and writeback pages on a per-zone basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but it's a heavier calculation across
    multiple cache lines that is much more frequent than the retry
    checks.

    Other than the LRU counters, this is mostly a mechanical patch but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but using per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later but is easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab
    pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

6 commits

To detect whether khugepaged swapin is worthwhile, this patch checks
    the number of young pages: at least half of HPAGE_PMD_NR should be
    young for the swapin to proceed.
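
    A hedged sketch of such a threshold check (the surrounding bookkeeping
    is elided; treat the placement as illustrative):

    /* Only swap in when enough of the range was recently referenced */
    if (referenced < HPAGE_PMD_NR / 2)
            return false;   /* not worth the swapin effort */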

    Link: http://lkml.kernel.org/r/1468109451-1615-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Suggested-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Boaz Harrosh
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
After fixing the swapin issues, the comments stayed as in the old
    version. This patch updates them.

    Link: http://lkml.kernel.org/r/1468109345-32258-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Boaz Harrosh
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we
    can just clear the pmd and let subsequent page faults reconstruct the
    page table.

    But Power makes use of the deposited page table to address an MMU
    quirk.

    Let's hide THP page cache, including huge tmpfs, behind a separate
    config option, so it can be forbidden on Power.

    We can revert the patch later, once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
This patch extends khugepaged to support collapse of tmpfs/shmem
    pages. We share a fair amount of infrastructure with anon-THP
    collapse.

    A few design points:

    - First we look for a VMA which can be suitable for mapping a huge
    page;

    - If the VMA maps a shmem file, the rest of the scan/collapse
    operations operate on the page cache, not on page tables as in the
    anon VMA case.

    - khugepaged_scan_shmem() finds a range which is suitable for a huge
    page. The scan is lockless and shouldn't disturb the system too much.

    - once the candidate for collapse is found, collapse_shmem() attempts
    to create a huge page:

    + scan over the radix tree, making the range point to the new huge
    page;

    + the new huge page is not-uptodate, locked and frozen (refcount
    is 0), so nobody can touch it until we say so.

    + we swap in pages during the scan. khugepaged_scan_shmem()
    filters out ranges with more than khugepaged_max_ptes_swap
    swapped-out pages. That's HPAGE_PMD_NR/8 by default.

    + old pages are isolated, unmapped and put on a local list so they
    can be restored if the collapse fails.

    - if the collapse succeeds, we retract pte page tables from VMAs
    where a huge page mapping is possible. The huge page will be mapped
    as a PMD on the next minor fault into the range.

    Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem)
    first: there is no point keeping it inside the function.

    Link: http://lkml.kernel.org/r/1466021202-61880-33-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
The khugepaged implementation grew to the point where it deserves a
    separate file in the source tree.

    Let's move it to mm/khugepaged.c.

    Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov