12 Jan, 2017

2 commits

  • commit 59749e6ce53735d8b696763742225f126e94603f upstream.

    The radix tree counts valid entries in each tree node. Entries stored
    in the tree cannot be removed by simply storing NULL in the slot, or
    the internal counters will be off and the node never gets freed again.

    When collapsing a shmem page fails, restore the holes that were filled
    with radix_tree_insert() by doing a proper radix tree deletion.
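    A minimal sketch of the difference (illustrative only, not the actual
    rollback code in collapse_shmem()):

        /* wrong: the node's entry count stays inflated, node never freed */
        radix_tree_replace_slot(slot, NULL);

        /* right: a real deletion also updates the per-node counters */
        radix_tree_delete(&mapping->page_tree, index);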

    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Link: http://lkml.kernel.org/r/20161117191138.22769-3-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Reported-by: Jan Kara
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     
  • commit 91a45f71078a6569ec3ca5bef74e1ab58121d80e upstream.

    Patch series "mm: workingset: radix tree subtleties & single-page file
    refaults", v3.

    This is another revision of the radix tree / workingset patches based on
    feedback from Jan and Kirill.

    This is a follow-up to d3798ae8c6f3 ("mm: filemap: don't plant shadow
    entries without radix tree node"). That patch fixed an issue that was
    caused mainly by the page cache sneaking special shadow page entries
    into the radix tree and relying on subtleties in the radix tree code to
    make that work. The fix also had to stop tracking refaults for
    single-page files because shadow pages stored as direct pointers in
    radix_tree_root->rnode weren't properly handled during tree extension.

    These patches make the radix tree code explicitly support and track
    such special entries, to eliminate the subtleties and to restore the
    thrash detection for single-page files.

    This patch (of 9):

    When a radix tree iteration drops the tree lock, another thread might
    swoop in and free the node holding the current slot. The iteration
    needs to do another tree lookup from the current index to continue.
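    A simplified sketch of the pattern (not the exact diff; the flag and
    local variable names are assumptions):

        restart:
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_for_each_slot(slot, &mapping->page_tree, &iter, index) {
                /* ... work on the entry at 'slot' ... */
                if (need_to_drop_lock) {
                        index = iter.index;     /* remember where we were */
                        spin_unlock_irq(&mapping->tree_lock);
                        /* sleep, allocate, swap in, ... */
                        goto restart;           /* the old node may be gone */
                }
        }
        spin_unlock_irq(&mapping->tree_lock);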

    [kirill.shutemov@linux.intel.com: re-lookup for replacement]
    Fixes: f3f0e1d2150b ("khugepaged: add support of collapse for tmpfs/shmem pages")
    Link: http://lkml.kernel.org/r/20161117191138.22769-2-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Jan Kara
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Johannes Weiner
     

01 Dec, 2016

1 commit

  • Commit b46e756f5e47 ("thp: extract khugepaged from mm/huge_memory.c")
    moved code from huge_memory.c to khugepaged.c. Some of this code should
    be compiled only when CONFIG_SYSFS is enabled but the condition around
    this code was not moved into khugepaged.c.

    The result is a compilation error when CONFIG_SYSFS is disabled:

    mm/built-in.o: In function `khugepaged_defrag_store':
    khugepaged.c:(.text+0x2d095): undefined reference to `single_hugepage_flag_store'
    mm/built-in.o: In function `khugepaged_defrag_show':
    khugepaged.c:(.text+0x2d0ab): undefined reference to `single_hugepage_flag_show'

    This commit adds #ifdef CONFIG_SYSFS guards around the sysfs-related
    code.
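    The guard looks roughly like this (a sketch; the handler body and flag
    name are approximations, not copied from the patch):

        #ifdef CONFIG_SYSFS
        static ssize_t khugepaged_defrag_show(struct kobject *kobj,
                                              struct kobj_attribute *attr,
                                              char *buf)
        {
                return single_hugepage_flag_show(kobj, attr, buf,
                                TRANSPARENT_HUGEPAGE_DEFRAG_KHUGEPAGED_FLAG);
        }
        /* ... khugepaged_defrag_store() and the other sysfs attributes ... */
        #endif /* CONFIG_SYSFS */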

    Link: http://lkml.kernel.org/r/20161114203448.24197-1-jeremy.lefaure@lse.epita.fr
    Signed-off-by: Jérémy Lefaure
    Acked-by: Kirill A. Shutemov
    Acked-by: Hillf Danton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérémy Lefaure
     

20 Sep, 2016

2 commits

    Currently, khugepaged does not permit swapin unless there are enough
    young pages in a THP. The problem is that when a THP does not have
    enough young pages, khugepaged leaks mapped ptes.

    This patch prevents those mapped ptes from being leaked.
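    A rough illustration of the kind of leak (a hypothetical simplification,
    not the actual diff): the bail-out path has to undo the pte mapping
    taken for the scan.

        pte = pte_offset_map(pmd, address);
        /* ... count young pages in the range ... */
        if (!swapin_worthwhile) {
                pte_unmap(pte);         /* the fix: do not leak the mapping */
                return false;
        }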

    Link: http://lkml.kernel.org/r/1472820276-7831-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Suggested-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Kirill A. Shutemov
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    hugepage_vma_revalidate() re-checks whether we should still try to
    collapse small pages into a huge one after re-acquiring mmap_sem.

    The problem Dmitry Vyukov reported[1] is that the vma found by
    hugepage_vma_revalidate() can be suitable for huge pages, but not be the
    same vma we had before dropping mmap_sem. And dereferencing the original
    vma can lead to fun results.

    Let's use the vma that hugepage_vma_revalidate() found instead of
    assuming it's the same as the one we had before the lock was dropped.

    [1] http://lkml.kernel.org/r/CACT4Y+Z3gigBvhca9kRJFcjX0G70V_nRhbwKBU+yGoESBDKi9Q@mail.gmail.com
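    A sketch of the resulting call site (assuming the helper is reworked to
    hand back the vma it validated; the exact signature is an assumption):

        result = hugepage_vma_revalidate(mm, address, &vma);
        if (result)
                goto out;
        /* from here on use 'vma' as returned by revalidation, never the
         * pointer saved before mmap_sem was dropped */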

    Link: http://lkml.kernel.org/r/20160907122559.GA6542@black.fi.intel.com
    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dmitry Vyukov
    Reviewed-by: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Vegard Nossum
    Cc: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Andrey Ryabinin
    Cc: Greg Thelen
    Cc: Suleiman Souhlal
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: syzkaller
    Cc: Kostya Serebryany
    Cc: Alexander Potapenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

29 Jul, 2016

4 commits

  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.
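    As a rough sketch of how the resulting masks compose (simplified, not
    the exact definitions from the patch):

        /* page-fault default: no reclaim at all */
        gfp = GFP_TRANSHUGE_LIGHT;

        /* khugepaged with "defrag" enabled (the default): allow direct
         * reclaim, but do not add __GFP_NORETRY */
        gfp = khugepaged_defrag() ? GFP_TRANSHUGE : GFP_TRANSHUGE_LIGHT;

        /* non-madvised faults that may reclaim/compact: keep it lightweight */
        gfp = GFP_TRANSHUGE | __GFP_NORETRY;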

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • As reclaim is now per-node based, convert zone_reclaim to be
    node_reclaim. It is possible that a node will be reclaimed multiple
    times if it has multiple zones but this is unavoidable without caching
    all nodes traversed so far. The documentation and interface to
    userspace are the same from a configuration perspective and will be
    similar in behaviour unless the node-local allocation requests were also
    limited to lower zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-24-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    There are now a number of accounting oddities such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting to the
    node. Due to throttling logic in the page allocator for reliable OOM
    detection, it is still necessary to track dirty and writeback pages on a
    per-zone basis.
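    For example (a sketch of the split, not actual hunks from the patch):

        /* file pages now live in the node counters ... */
        __inc_node_page_state(page, NR_FILE_PAGES);

        /* ... while dirty/writeback throttling still needs a zone counter */
        __inc_zone_page_state(page, NR_ZONE_WRITE_PENDING);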

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to keep the LRU counters
    on a per-zone basis, but that is a heavier calculation across multiple
    cache lines and is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but use per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later, but it keeps this patch easier to review.

    In the event that there is excessive overhead on 32-bit systems due to
    the LRU lists being per-node, there are two potential solutions:

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persists for long periods of time, the same
    highmem pages get scanned again and again. The idea would be that lowmem
    reclaim keeps those pages on a separate list until a reclaim for highmem
    pages arrives and splices the highmem pages back onto the LRU. It could
    potentially be implemented similarly to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linearly scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

6 commits

    To detect whether khugepaged swapin is worthwhile, this patch checks the
    number of young pages: at least half of HPAGE_PMD_NR pages in the range
    should be young before swapin is attempted.
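    A minimal sketch of the check (illustrative; the 'referenced' counter
    name is an assumption):

        /* only swap in if at least half of the HPAGE_PMD_NR ptes are young */
        if (referenced < HPAGE_PMD_NR / 2)
                return false;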

    Link: http://lkml.kernel.org/r/1468109451-1615-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Suggested-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Boaz Harrosh
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    After the swapin fixes, some comments still described the old behaviour.
    This patch brings them up to date.

    Link: http://lkml.kernel.org/r/1468109345-32258-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Boaz Harrosh
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we can
    just clear the pmd and let the following page faults reconstruct the
    page table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide the THP page cache, including huge tmpfs, under a separate
    config option, so it can be forbidden on Power.

    We can revert the patch later, once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    This patch extends khugepaged to support collapse of tmpfs/shmem pages.
    We share a fair amount of infrastructure with anon-THP collapse.

    A few design points:

    - First we look for a VMA which can be suitable for mapping a huge
    page;

    - If the VMA maps a shmem file, the rest of the scan/collapse operations
    work on the page cache, not on page tables as in the anon VMA case.

    - khugepaged_scan_shmem() finds a range which is suitable for a huge
    page. The scan is lockless and shouldn't disturb the system too much.

    - Once a candidate for collapse is found, collapse_shmem() attempts
    to create a huge page:

    + scan over the radix tree, making the range point to the new huge page;

    + the new huge page is not uptodate, locked and frozen (refcount
    is 0), so nobody can touch it until we say so;

    + we swap in pages during the scan. khugepaged_scan_shmem()
    filters out ranges with more than khugepaged_max_ptes_swap
    swapped-out pages. It's HPAGE_PMD_NR/8 by default (see the sketch
    after this list);

    + old pages are isolated, unmapped and put on a local list so they
    can be restored if the collapse fails.

    - If the collapse succeeds, we retract pte page tables from the VMAs
    where a huge page mapping is possible. The huge page will be mapped
    as a PMD on the next minor fault into the range.
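    The swap filter mentioned above, as a minimal sketch (illustrative;
    khugepaged_max_ptes_swap is the tunable named in the text, the rest is
    assumed):

        /* skip ranges that would require too much swapin */
        if (++swap > khugepaged_max_ptes_swap) {
                result = SCAN_EXCEED_SWAP_PTE;
                break;
        }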

    Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem)
    first: there's no point keeping it inside the function.
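    Roughly, the call site becomes (a sketch; the exact argument list of
    khugepaged_alloc_page() after the change is an assumption):

        up_read(&mm->mmap_sem);         /* done by the caller now */
        new_page = khugepaged_alloc_page(hpage, gfp, node);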

    Link: http://lkml.kernel.org/r/1466021202-61880-33-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    The khugepaged implementation has grown to the point where it deserves a
    separate source file.

    Let's move it to mm/khugepaged.c.

    Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov