27 Jul, 2016

40 commits

  • Randy reported the build error below.

    > In file included from ../include/linux/balloon_compaction.h:48:0,
    > from ../mm/balloon_compaction.c:11:
    > ../include/linux/compaction.h:237:51: warning: 'struct node' declared inside parameter list [enabled by default]
    > static inline int compaction_register_node(struct node *node)
    > ../include/linux/compaction.h:237:51: warning: its scope is only this definition or declaration, which is probably not what you want [enabled by default]
    > ../include/linux/compaction.h:242:54: warning: 'struct node' declared inside parameter list [enabled by default]
    > static inline void compaction_unregister_node(struct node *node)
    >

    It was caused by the non-lru page migration code, which needs compaction.h,
    while compaction.h doesn't include the headers it needs to be standalone.

    I think the proper header for non-lru page migration is migrate.h rather
    than compaction.h, because migrate.h already pulls in the definitions that
    non-lru page migration needs indirectly, such as isolate_mode_t,
    migrate_mode and MIGRATEPAGE_SUCCESS.

    [akpm@linux-foundation.org: revert mm-balloon-use-general-non-lru-movable-page-feature-fix.patch temp fix]
    Link: http://lkml.kernel.org/r/20160610003304.GE29779@bbox
    Signed-off-by: Minchan Kim
    Reported-by: Randy Dunlap
    Cc: Konstantin Khlebnikov
    Cc: Vlastimil Babka
    Cc: Gioh Kim
    Cc: Rafael Aquini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • To decide whether khugepaged swapin is worthwhile, this patch checks the
    number of young pages in the range: at least half of HPAGE_PMD_NR pages
    should be young before swapping in.
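
    A minimal sketch of the heuristic (the helper below is hypothetical; the
    real patch threads a 'referenced' count through the khugepaged scan):

        static bool swapin_is_worthwhile(struct vm_area_struct *vma,
                                         unsigned long haddr, pte_t *pte)
        {
                int referenced = 0, i;

                for (i = 0; i < HPAGE_PMD_NR; i++) {
                        pte_t entry = pte[i];
                        struct page *page;

                        if (!pte_present(entry))
                                continue;
                        page = vm_normal_page(vma, haddr + i * PAGE_SIZE, entry);
                        if (page && (pte_young(entry) || page_is_young(page) ||
                                     PageReferenced(page)))
                                referenced++;
                }
                /* Swap in only if at least half the range looks recently used. */
                return referenced >= HPAGE_PMD_NR / 2;
        }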

    Link: http://lkml.kernel.org/r/1468109451-1615-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Suggested-by: Minchan Kim
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Boaz Harrosh
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • After the swapin fixes, some comment lines were left over from the old
    version. This patch updates the comments.

    Link: http://lkml.kernel.org/r/1468109345-32258-1-git-send-email-ebru.akagunduz@gmail.com
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Boaz Harrosh
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Add info about tmpfs/shmem with huge pages.

    Link: http://lkml.kernel.org/r/1466021202-61880-38-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Even if the user asked to always allocate huge pages (huge=always), we
    should be able to free up some memory by splitting huge pages which are
    partly beyond i_size, when memory pressure comes or once we hit the limit
    on filesystem size (-o size=).

    In order to do this we maintain a per-superblock list of inodes which
    potentially have huge pages on the border of the file size.

    A per-fs shrinker can reclaim memory by splitting such pages.

    If we hit -ENOSPC during shmem_getpage_gfp(), we try to split a page to
    free up space on the filesystem and retry the allocation if the split
    succeeds.

    Link: http://lkml.kernel.org/r/1466021202-61880-37-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we can
    just clear the pmd and let subsequent page faults reconstruct the page
    table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide THP page cache, including huge tmpfs, behind a separate config
    option, so it can be forbidden on Power.

    We can revert the patch later, once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This patch extends khugepaged to support collapse of tmpfs/shmem pages.
    We share a fair amount of infrastructure with the anon-THP collapse path.

    A few design points:

    - First we look for a VMA which is suitable for mapping a huge page;

    - If the VMA maps a shmem file, the rest of the scan/collapse operations
    operate on the page cache, not on page tables as in the anon VMA case.

    - khugepaged_scan_shmem() finds a range which is suitable for a huge
    page. The scan is lockless and shouldn't disturb the system too much.

    - Once a candidate for collapse is found, collapse_shmem() attempts
    to create a huge page:

    + scan over the radix tree, making the range point to the new huge
    page;

    + the new huge page is not-uptodate, locked and frozen (refcount
    is 0), so nobody can touch it until we say so.

    + we swap in pages during the scan. khugepaged_scan_shmem()
    filters out ranges with more than khugepaged_max_ptes_swap
    swapped-out pages. It's HPAGE_PMD_NR/8 by default.

    + old pages are isolated, unmapped and put on a local list so they
    can be restored if the collapse fails.

    - If the collapse succeeds, we retract pte page tables from VMAs where a
    huge page mapping is possible. The huge page will be mapped as a PMD on
    the next minor fault into the range.

    Link: http://lkml.kernel.org/r/1466021202-61880-35-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to need to call shmem_charge() under tree_lock to get the
    accounting right when collapsing small tmpfs pages into a huge one.

    The problem is that tree_lock is irq-safe and lockdep is not happy that
    we take an irq-unsafe lock under an irq-safe one[1].

    Let's convert the lock to irq-safe.

    [1] https://gist.github.com/kiryl/80c0149e03ed35dfaf26628b8e03cdbc
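
    A minimal sketch of the conversion, assuming the lock being made irq-safe
    is shmem's per-inode info->lock (the helper name is hypothetical):

        /* Accounting that may now be called with the irq-safe tree_lock held:
         * using the irq-saving lock variants keeps lockdep happy. */
        static void shmem_account_alloced(struct shmem_inode_info *info, long pages)
        {
                unsigned long flags;

                spin_lock_irqsave(&info->lock, flags);
                info->alloced += pages;
                spin_unlock_irqrestore(&info->lock, flags);
        }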

    Link: http://lkml.kernel.org/r/1466021202-61880-34-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Both variants of khugepaged_alloc_page() do up_read(&mm->mmap_sem) first,
    so there is no point in keeping it inside the function.

    Link: http://lkml.kernel.org/r/1466021202-61880-33-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The khugepaged implementation has grown to the point where it deserves a
    separate source file.

    Let's move it to mm/khugepaged.c.

    Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's wire up existing madvise() hugepage hints for file mappings.

    MADV_HUGEPAGE advises shmem to allocate a huge page on page fault in the
    VMA. It only has an effect if the filesystem is mounted with huge=advise
    or huge=within_size.

    MADV_NOHUGEPAGE prevents a huge page from being allocated on page fault
    in the VMA. It doesn't prevent a huge page from being allocated by other
    means, e.g. a page fault into a different mapping or a write(2) into the
    file.
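
    A minimal userspace sketch of the hints (the tmpfs mount point and file
    name are hypothetical; the mount must use huge=advise or huge=within_size
    for MADV_HUGEPAGE to matter):

        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define LEN (4UL << 20)         /* multiple of the 2MB huge page size */

        int main(void)
        {
                int fd = open("/mnt/tmpfs/data", O_RDWR | O_CREAT, 0600);
                void *p;

                if (fd < 0 || ftruncate(fd, LEN))
                        return 1;
                p = mmap(NULL, LEN, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                if (p == MAP_FAILED)
                        return 1;
                /* Ask for huge pages on faults in this VMA ... */
                if (madvise(p, LEN, MADV_HUGEPAGE))
                        perror("madvise(MADV_HUGEPAGE)");
                /* ... or forbid them: madvise(p, LEN, MADV_NOHUGEPAGE); */
                ((char *)p)[0] = 1;     /* the first fault may now install a PMD */
                close(fd);
                return 0;
        }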

    Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Here's the basic implementation of huge page support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can, and tries to insert it
    into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if there's
    space for it;

    - shmem_undo_range() removes a huge page if it lies fully within the
    range. A partial truncate of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may give inconsistent results depending on which
    pages happened to be allocated.

    - no need to change shmem_fault(): core mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Provide a shmem_get_unmapped_area method in file_operations, called at
    mmap time to decide the mapping address. It could be conditional on
    CONFIG_TRANSPARENT_HUGEPAGE, but save #ifdefs in other places by making
    it unconditional.

    shmem_get_unmapped_area() first calls the usual mm->get_unmapped_area
    (which we treat as a black box, highly dependent on architecture and
    config and executable layout). Lots of conditions, and in most cases it
    just goes with the address it chose; but when our huge stars are
    rightly aligned, yet that did not provide a suitable address, go back to
    ask for a larger arena, within which to align the mapping suitably.

    There have to be some direct calls to shmem_get_unmapped_area(), not via
    the file_operations: because of the way shmem_zero_setup() is called to
    create a shmem object late in the mmap sequence, when MAP_SHARED is
    requested with MAP_ANONYMOUS or /dev/zero. Though this only matters
    when /proc/sys/vm/shmem_huge has been set.

    Link: http://lkml.kernel.org/r/1466021202-61880-29-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Hugh Dickins
    Signed-off-by: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch adds a new mount option, "huge=". It can have the following values:

    - "always":
    Attempt to allocate huge pages every time we need a new page;

    - "never":
    Do not allocate huge pages;

    - "within_size":
    Only allocate huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

    - "advise:
    Only allocate huge pages if requested with fadvise()/madvise();

    Default is "never" for now.

    "mount -o remount,huge= /mountpoint" works fine after mount: remounting
    huge=never will not attempt to break up huge pages at all, just stop
    more from being allocated.

    No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
    is the appropriate option to protect those who don't want the new bloat,
    and with which we shall share some pmd code.

    Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
    invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
    explicit).

    Allow enabling THP only if the machine has_transparent_hugepage().

    But what about Shmem with no user-visible mount? SysV SHM, memfds,
    shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
    objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob
    /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
    huge on those.

    And allow shmem_enabled two further values:

    - "deny":
    For use in emergencies, to force the huge option off from
    all mounts;
    - "force":
    Force the huge option on for all - very useful for testing;

    Based on patch by Hugh Dickins.
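
    A minimal userspace sketch of using the new option (the mount point is
    hypothetical; requires CONFIG_TRANSPARENT_HUGEPAGE and root):

        #include <stdio.h>
        #include <sys/mount.h>

        int main(void)
        {
                /* Mount a tmpfs that tries a huge page for every new page. */
                if (mount("tmpfs", "/mnt/huge-tmpfs", "tmpfs", 0,
                          "huge=always,size=1G")) {
                        perror("mount");
                        return 1;
                }
                /* Later, stop allocating new huge pages without unmounting:
                 * mount("tmpfs", "/mnt/huge-tmpfs", "tmpfs", MS_REMOUNT,
                 *       "huge=never");
                 */
                return 0;
        }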

    Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add ShmemHugePages and ShmemPmdMapped fields to meminfo and smaps.
    They show how much shmem/tmpfs memory is allocated with huge pages and
    how much of it is mapped with huge PMDs.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For shmem/tmpfs we only need to tweak truncate_inode_page() and
    invalidate_mapping_pages().

    truncate_inode_pages_range() and invalidate_inode_pages2_range() are
    adjusted to use page_to_pgoff().

    Link: http://lkml.kernel.org/r/1466021202-61880-26-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • For now, we have HPAGE_PMD_NR entries in the radix tree for every huge
    page. That's suboptimal; it will be changed to use Matthew's multi-order
    entries later.

    'add' operation is not changed, because we don't need it to implement
    hugetmpfs: shmem uses its own implementation.

    Link: http://lkml.kernel.org/r/1466021202-61880-25-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The new helper is similar to radix_tree_maybe_preload(), but tries to
    preload the number of nodes required to insert (1 << order) contiguous,
    naturally aligned elements.

    This is required to push huge pages into pagecache.
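
    A minimal sketch of a call site (assuming the helper is named
    radix_tree_maybe_preload_order() and takes a gfp mask and an order; the
    surrounding function is hypothetical):

        /* Reserve enough radix-tree nodes for 1 << HPAGE_PMD_ORDER contiguous,
         * naturally aligned slots, then insert the huge page under tree_lock. */
        static int add_huge_page_sketch(struct address_space *mapping,
                                        struct page *page, pgoff_t index, gfp_t gfp)
        {
                int err = radix_tree_maybe_preload_order(gfp, HPAGE_PMD_ORDER);

                if (err)
                        return err;

                spin_lock_irq(&mapping->tree_lock);
                /* ... point HPAGE_PMD_NR slots at the huge page ... */
                spin_unlock_irq(&mapping->tree_lock);

                radix_tree_preload_end();
                return 0;
        }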

    Link: http://lkml.kernel.org/r/1466021202-61880-24-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • These flags are in use for file THP.

    Link: http://lkml.kernel.org/r/1466021202-61880-23-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • This prepares vmscan for file huge pages. We cannot write out huge
    pages, so we need to split them on the way out.
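
    A minimal sketch of the idea (hypothetical helper; the real change lives
    inside shrink_page_list()):

        /* Before paging out a file-backed THP, split it so its 4k pages can
         * be written individually. Returns 0 when the page can be written. */
        static int split_file_thp_for_writeout(struct page *page,
                                               struct list_head *page_list)
        {
                if (!PageTransHuge(page) || PageAnon(page))
                        return 0;
                /* The caller holds the page lock, as split_huge_page_to_list()
                 * requires. */
                return split_huge_page_to_list(page, page_list);
        }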

    Link: http://lkml.kernel.org/r/1466021202-61880-22-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with anon THP, we only mlock file huge pages if we can prove that the
    page is not mapped with PTEs. This way we can avoid an mlock leak into a
    non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check whether the page
    may be PTE-mapped. PG_double_map is set by page_add_file_rmap() when the
    page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The basic scheme is the same as for anon THP.

    Main differences:

    - File pages are in the radix tree, so head->_count is offset by
    HPAGE_PMD_NR. The count gets distributed to the small pages during split.

    - mapping->tree_lock prevents non-lockless access to pages under split
    over the radix tree;

    - Lockless access is prevented by setting the head->_count to 0 during
    split;

    - After split, some pages can be beyond i_size. We drop them from the
    radix tree.

    - We don't set up migration entries, we just unmap the pages. This helps
    in handling the case when i_size is in the middle of the page: there is
    no need to handle unmapping of pages beyond i_size manually.

    Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • vma_adjust_trans_huge() splits the pmd if it crosses a VMA boundary.
    During the split we munlock the huge page, which requires an rmap walk;
    rmap wants to take the lock on its own.

    Let's move vma_adjust_trans_huge() outside i_mmap_rwsem to fix this.

    Link: http://lkml.kernel.org/r/1466021202-61880-19-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • change_huge_pmd() has an assert which is not relevant for file pages.
    For a shared mapping it's perfectly fine to have the page table entry
    writable, without an explicit mkwrite.

    Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • copy_page_range() has a check for "Don't copy ptes where a page fault
    will fill them correctly." It works at the VMA level, so we still copy
    all page table entries from private mappings, even if they map page
    cache.

    We can simplify copy_huge_pmd() a bit by skipping file PMDs.

    We don't map file private pages with PMDs, so they can only map page
    cache. It's safe to skip them as they can be re-faulted later.
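
    A minimal sketch of the check (simplified; not the full copy_huge_pmd()):

        int copy_huge_pmd(struct mm_struct *dst_mm, struct mm_struct *src_mm,
                          pmd_t *dst_pmd, pmd_t *src_pmd, unsigned long addr,
                          struct vm_area_struct *vma)
        {
                /* A file-backed PMD only maps page cache; skip copying it and
                 * let the child re-fault the range later. */
                if (!vma_is_anonymous(vma))
                        return 0;

                /* ... anon THP copy path continues here ... */
                return 0;
        }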

    Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • File COW for THP is handled at the pte level: just split the pmd.

    It's not clear how beneficial allocating huge pages on COW faults would
    be, and it would require some code to make it work.

    I think at some point we can consider teaching khugepaged to collapse
    pages in COW mappings, but allocating huge on fault is probably
    overkill.

    Link: http://lkml.kernel.org/r/1466021202-61880-16-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Splitting a THP PMD is simple: just unmap it, as in the DAX case. This
    way we avoid the memory overhead of allocating a page table just to
    deposit it.

    It's probably a good idea to try to allocate the page table with
    GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
    but clearing the pmd should be good enough for now.

    Unlike DAX, we also remove the page from rmap and drop the reference.
    pmd_young() is transferred to PageReferenced().
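
    A minimal sketch of the file/DAX branch in __split_huge_pmd_locked()
    (simplified; the exact helpers used here are an assumption):

        if (!vma_is_anonymous(vma)) {
                pmd_t old;
                struct page *page;

                old = pmdp_huge_clear_flush(vma, haddr, pmd);
                if (vma_is_dax(vma))
                        return;                   /* DAX: nothing more to drop */

                page = pmd_page(old);
                if (pmd_dirty(old))
                        set_page_dirty(page);
                if (pmd_young(old))
                        SetPageReferenced(page);  /* carry the young bit over */
                page_remove_rmap(page, true);     /* drop the compound mapping */
                put_page(page);                   /* and the PMD's reference */
                return;
        }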

    Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • split_huge_pmd() for file mappings (and DAX too) is implemented by just
    clearing the pmd entry, as we can re-fill this area from the page cache
    at the pte level later.

    This means we don't need to deposit page tables when a file THP is
    mapped, and therefore we shouldn't try to withdraw a page table when
    zap_huge_pmd() tears down a file THP PMD.
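
    A minimal sketch of the corresponding logic in zap_huge_pmd() (simplified;
    only the anon case has a deposited page table to withdraw and free):

        if (vma_is_anonymous(vma)) {
                pgtable_t pgtable = pgtable_trans_huge_withdraw(tlb->mm, pmd);

                pte_free(tlb->mm, pgtable);
                atomic_long_dec(&tlb->mm->nr_ptes);
        }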

    Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • THP_FILE_ALLOC: how many times a huge page was allocated and put into
    the page cache.

    THP_FILE_MAPPED: how many times a file huge page was mapped.

    Link: http://lkml.kernel.org/r/1466021202-61880-13-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • With postponed page table allocation we have a chance to set up huge
    pages. do_set_pte() calls do_set_pmd() if the following criteria are met
    (see the sketch after the list):

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment.
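
    A minimal sketch of the gate (hypothetical helper; field names follow the
    fault_env context structure used by this series, described further down):

        static bool can_map_as_pmd(struct fault_env *fe, struct page *page)
        {
                unsigned long haddr = fe->address & HPAGE_PMD_MASK;

                return PageTransCompound(page) &&               /* compound page  */
                       pmd_none(*fe->pmd) &&                    /* nothing mapped */
                       haddr >= fe->vma->vm_start &&            /* VMA covers the */
                       haddr + HPAGE_PMD_SIZE <= fe->vma->vm_end; /* whole PMD    */
        }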

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimizations later.

    The PG_double_map optimization doesn't work for file pages, since the
    lifecycle of file pages is different compared to anon pages: a file page
    can be mapped again at any time.
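
    A minimal sketch of the naive accounting on the map side (not the exact
    upstream code, which lives in page_add_file_rmap()):

        /* When a file THP is mapped with a PMD, bump the per-4k _mapcount of
         * every subpage as well as the compound mapcount. */
        static void file_thp_add_rmap_sketch(struct page *page)
        {
                int i;

                VM_BUG_ON_PAGE(!PageTransHuge(page), page);
                for (i = 0; i < HPAGE_PMD_NR; i++)
                        atomic_inc(&page[i]._mapcount);
                atomic_inc(compound_mapcount_ptr(page));
        }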

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea (and most of the code) is again borrowed from Hugh's patchset
    on huge tmpfs[1].

    Instead of allocating the pte page table upfront, we postpone this until
    we have the page to map in hand. This approach opens the possibility to
    map the page as huge if the filesystem supports it.

    Compared to Hugh's patch, I've pushed the page table allocation a bit
    further: into do_set_pte(). This way we can postpone the allocation even
    in the faultaround case, without moving do_fault_around() after
    __do_fault().

    do_set_pte() got renamed to alloc_set_pte(), as it can now allocate a
    page table if required.

    [1] http://lkml.kernel.org/r/alpine.LSU.2.11.1502202015090.14414@eggly.anvils

    Link: http://lkml.kernel.org/r/1466021202-61880-10-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea is borrowed from Peter's patch from the patchset on speculative
    page faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the faultaround
    code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org
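
    A minimal sketch of the kind of context structure this introduces (the
    field list is approximate, not an exact copy of the kernel's struct
    fault_env):

        struct fault_env {
                struct vm_area_struct *vma;     /* faulting VMA */
                unsigned long address;          /* faulting address */
                unsigned int flags;             /* FAULT_FLAG_xxx */
                pmd_t *pmd;                     /* pmd entry for the address */
                pte_t *pte;                     /* pte, once mapped */
                spinlock_t *ptl;                /* page table lock, if taken */
                pgtable_t prealloc_pte;         /* preallocated page table */
        };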

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We always have vma->vm_mm around.

    Link: http://lkml.kernel.org/r/1466021202-61880-8-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add description of THP handling into unevictable-lru.txt.

    Link: http://lkml.kernel.org/r/1466021202-61880-7-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Vlastimil noted[1] that the pmd can no longer be valid after we drop
    mmap_sem. We need to recheck it once mmap_sem is taken again.

    [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz
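
    A minimal sketch of the recheck (hypothetical helper; in khugepaged the
    result is compared against the pmd found before mmap_sem was dropped):

        static bool khugepaged_range_still_valid(struct mm_struct *mm,
                                                 unsigned long address,
                                                 pmd_t *expected_pmd)
        {
                if (hugepage_vma_revalidate(mm, address))
                        return false;           /* VMA no longer suitable */
                /* The pmd may have been zapped or replaced while unlocked. */
                return mm_find_pmd(mm, address) == expected_pmd;
        }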

    Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • After the introduction of the vma-revalidation function, a locking
    inconsistency occurred because the code path was directed to the wrong
    label. This patch directs it to the correct label and fixes the
    inconsistency.

    Related commit that caused the inconsistency:
    http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

    Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Kirill A. Shutemov
    Cc: Stephen Rothwell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Currently khugepaged does swapin readahead under down_write. This patch
    makes it do swapin readahead under down_read instead of down_write.

    The patch was tested with a test program that allocates 800MB of memory,
    writes to it, and then sleeps. The system was forced to swap it all out.
    Afterwards, the test program touches the area by writing to it, skipping
    one page in every 20 pages of the area.

    [akpm@linux-foundation.org: update comment to match new code]
    [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revlidate() is useless]
    Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com
    Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com
    Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • This patch adds swapin readahead to improve the THP collapse rate. When
    khugepaged scans pages, a few of the pages may be in the swap area.

    With the patch, khugepaged can collapse 4kB pages into a THP when there
    are up to max_ptes_swap swap ptes in a 2MB range.

    The patch was tested with a test program that allocates 400MB of memory,
    writes to it, and then sleeps. The system was forced to swap it all out.
    Afterwards, the test program touches the area by writing to it, skipping
    one page in every 20 pages of the area.

    Without the patch, the system did no swapin readahead. The THP rate was
    65% of the program's memory and did not change over time.

    With this patch, after 10 minutes of waiting, khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Introduce a new sysfs integer knob,
    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap, which makes
    the swapin readahead check optimistic, to increase the THP collapse rate.
    Before bringing swapped-out pages back into memory, khugepaged counts
    them and allows the swapin only up to a certain number. It also reports
    the number of unmapped ptes via tracepoints.
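
    A minimal userspace sketch of tuning the knob (the value 32 is just an
    example):

        #include <stdio.h>

        int main(void)
        {
                FILE *f = fopen("/sys/kernel/mm/transparent_hugepage/"
                                "khugepaged/max_ptes_swap", "w");

                if (!f) {
                        perror("fopen");
                        return 1;
                }
                fprintf(f, "%d\n", 32);
                return fclose(f) ? 1 : 0;
        }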

    [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
    [sfr@canb.auug.org.au: build fix]
    Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz