01 Feb, 2017

1 commit

  • commit 8310d48b125d19fcd9521d83b8293e63eb1646aa upstream.

    In commit 19be0eaffa3a ("mm: remove gup_flags FOLL_WRITE games from
    __get_user_pages()"), the mm code was changed from unsetting FOLL_WRITE
    after a COW was resolved to setting the (newly introduced) FOLL_COW
    instead. Simultaneously, the check in gup.c was updated to still allow
    writes with FOLL_FORCE set if FOLL_COW had also been set.

    However, a similar check in huge_memory.c was forgotten. As a result,
    remote memory writes to read-only regions of memory backed by transparent
    huge pages cause an infinite loop in the kernel (handle_mm_fault sets
    FOLL_COW and returns 0 causing a retry, but follow_trans_huge_pmd bails
    out immediately because `(flags & FOLL_WRITE) && !pmd_write(*pmd)` is
    true).

    While in this state the process is still SIGKILLable, but little else
    works (e.g. no ptrace attach, no other signals). This is easily
    reproduced with the following code (assuming THP is set to "always"):

    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <assert.h>
    #include <string.h>
    #include <stdio.h>
    #include <stdint.h>
    #include <sys/wait.h>

    #define TEST_SIZE 5 * 1024 * 1024

    int main(void) {
            int status;
            pid_t child;
            int fd = open("/proc/self/mem", O_RDWR);
            void *addr = mmap(NULL, TEST_SIZE, PROT_READ,
                              MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
            assert(addr != MAP_FAILED);
            pid_t parent_pid = getpid();
            if ((child = fork()) == 0) {
                    void *addr2 = mmap(NULL, TEST_SIZE, PROT_READ | PROT_WRITE,
                                       MAP_ANONYMOUS | MAP_PRIVATE, 0, 0);
                    assert(addr2 != MAP_FAILED);
                    memset(addr2, 'a', TEST_SIZE);
                    pwrite(fd, addr2, TEST_SIZE, (uintptr_t)addr);
                    return 0;
            }
            assert(child == waitpid(child, &status, 0));
            assert(WIFEXITED(status) && WEXITSTATUS(status) == 0);
            return 0;
    }

    Fix this by updating follow_trans_huge_pmd in huge_memory.c analogously
    to the update in gup.c in the original commit. The same pattern exists
    in follow_devmap_pmd. However, we should not be able to reach that
    check with FOLL_COW set, so add WARN_ONCE to make sure we notice if we
    ever do.
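
    As a rough sketch (not the verbatim patch), the fix mirrors the
    can_follow_write_pte() helper that commit 19be0eaffa3a added on the
    gup.c side, so a forced write through a read-only THP is allowed once
    the COW has been broken and FOLL_COW recorded:

    /* sketch only: FOLL_FORCE may write to a ro pmd once COW broke it */
    static inline bool can_follow_write_pmd(pmd_t pmd, unsigned int flags)
    {
            return pmd_write(pmd) ||
                   ((flags & FOLL_FORCE) && (flags & FOLL_COW) &&
                    pmd_dirty(pmd));
    }

    follow_trans_huge_pmd() then uses this helper instead of the bare
    pmd_write() check quoted above.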

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170106015025.GA38411@juliacomputing.com
    Signed-off-by: Keno Fischer
    Acked-by: Kirill A. Shutemov
    Cc: Greg Thelen
    Cc: Nicholas Piggin
    Cc: Willy Tarreau
    Cc: Oleg Nesterov
    Cc: Kees Cook
    Cc: Andy Lutomirski
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Keno Fischer
     

20 Jan, 2017

1 commit

  • commit 20f664aabeb88d582b623a625f83b0454fa34f07 upstream.

    Andreas reported that [1] made a test in jemalloc hang in THP mode on
    arm64:

    http://lkml.kernel.org/r/mvmmvfy37g1.fsf@hawking.suse.de

    The problem is that the page fault handler currently doesn't support
    dirty-bit emulation of a pmd on architectures without a hardware dirty
    bit, so the application gets stuck until the VM marks the pmd dirty.

    How the emulation works depends on the architecture. In the case of
    arm64, when a pte is first set up it is marked PTE_RDONLY, so that a page
    fault is triggered on the first store access, giving the kernel a chance
    to mark the pte dirty. Once the page fault occurs, the VM marks the pmd
    dirty and the arch code for setting the pmd clears PTE_RDONLY so the
    application can proceed.

    IOW, if the VM doesn't mark the pmd dirty, the application hangs forever
    in repeated faults (i.e. store accesses to a pmd that stays PTE_RDONLY).

    This patch enables pmd dirty-bit emulation for those architectures.
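
    A minimal sketch of the shape of the change in the huge-pmd access-fault
    path (variable and helper names assumed from the description above, not
    quoted from the patch):

    /* sketch: on a store fault, emulate the dirty bit in software */
    bool write = fault_flags & FAULT_FLAG_WRITE;
    pmd_t entry = pmd_mkyoung(orig_pmd);

    if (write)
            entry = pmd_mkdirty(entry);
    if (pmdp_set_access_flags(vma, haddr, pmd, entry, write))
            update_mmu_cache_pmd(vma, haddr, pmd);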

    [1] b8d3c4c3009d, mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called

    Fixes: b8d3c4c3009d ("mm/huge_memory.c: don't split THP page when MADV_FREE syscall is called")
    Link: http://lkml.kernel.org/r/1482506098-6149-1-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: Andreas Schwab
    Tested-by: Andreas Schwab
    Acked-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Jason Evans
    Cc: Will Deacon
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minchan Kim
     

30 Nov, 2016

1 commit

  • Linus found there still is a race in mremap after commit 5d1904204c99
    ("mremap: fix race between mremap() and page cleanning").

    As described by Linus:
    "the issue is that another thread might make the pte be dirty (in the
    hardware walker, so no locking of ours will make any difference)
    *after* we checked whether it was dirty, but *before* we removed it
    from the page tables"

    Fix it by moving the check after we removed it from the page table.
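
    A simplified sketch of the resulting ordering in move_ptes() (helper
    names as in that code path; locking and error handling omitted):

    /* remove the pte from the page table first ... */
    pte = ptep_get_and_clear(mm, old_addr, old_pte);
    /*
     * ... only then look at the dirty bit: the hardware walker can no
     * longer dirty this pte behind our back.
     */
    if (pte_present(pte) && pte_dirty(pte))
            force_flush = true;
    pte = move_pte(pte, new_vma->vm_page_prot, old_addr, new_addr);
    set_pte_at(mm, new_addr, new_pte, pte);
    ...
    if (force_flush)
            flush_tlb_range(vma, old_end - len, old_end); /* before dropping the PTL */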

    Suggested-by: Linus Torvalds
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

18 Nov, 2016

1 commit

  • Prior to 3.15, there was a race between zap_pte_range() and
    page_mkclean() where writes to a page could be lost. Dave Hansen
    discovered by inspection that there is a similar race between
    move_ptes() and page_mkclean().

    We've been able to reproduce the issue by enlarging the race window with
    a msleep(), but have not been able to hit it without modifying the code.
    So, we think it's a real issue, but it is difficult or impossible to hit
    in practice.

    The zap_pte_range() issue is fixed by commit 1cf35d47712d ("mm: split
    'tlb_flush_mmu()' into tlb flushing and memory freeing parts"), and this
    patch fixes the race between page_mkclean() and mremap().

    Here is one possible way to hit the race: suppose a process mmapped a
    file with READ | WRITE and SHARED, it has two threads and they are bound
    to 2 different CPUs, e.g. CPU1 and CPU2. mmap returned X, then thread
    1 did a write to addr X so that CPU1 now has a writable TLB for addr X
    on it. Thread 2 starts mremapping from addr X to Y while thread 1
    cleaned the page and then did another write to the old addr X again.
    The 2nd write from thread 1 could succeed, but the value will get lost.

    thread 1                                thread 2
    (bound to CPU1)                         (bound to CPU2)

    1: write 1 to addr X to get a
       writeable TLB on this CPU

                                            2: mremap starts

                                            3: move_ptes emptied PTE for addr X
                                               and setup new PTE for addr Y and
                                               then dropped PTL for X and Y

    4: page laundering for N by doing
       fadvise FADV_DONTNEED. When done,
       pageframe N is deemed clean.

    5: *write 2 to addr X

                                            6: tlb flush for addr X

    7: munmap (Y, pagesize) to make the
       page unmapped

    8: fadvise with FADV_DONTNEED again
       to kick the page off the pagecache

    9: pread the page from file to verify
       the value. If 1 is there, it means
       we have lost the written 2.

    *The write may or may not cause a segmentation fault; it depends on
    whether the stale TLB entry is still present on that CPU.

    Please note that this is only one specific way in which the race could
    occur; it does not mean the race can only occur in exactly the above
    configuration. For example, more than 2 threads could be involved,
    fadvise() could be done in another thread, etc.

    Anonymous pages can race between mremap() and page reclaim in a similar
    way. THP: a huge PMD is moved by mremap to a new huge PMD, then the new
    huge PMD gets unmapped/split/paged out before the TLB flush for the old
    huge PMD in move_page_tables() has happened, and we could still write
    data to it. Normal anonymous pages are in a similar situation.

    To fix this, check for any dirty PTE in move_ptes()/move_huge_pmd() and,
    if one is found, do the flush before dropping the PTL. If we did the
    flush for every move_ptes()/move_huge_pmd() call then we would not need
    to do the flush in move_page_tables() for the whole range. But if we
    didn't, we still need to do the whole-range flush.

    Alternatively, we could track which parts of the range were flushed in
    move_ptes()/move_huge_pmd() and which were not, to avoid flushing the
    whole range in move_page_tables(). But that would require multiple TLB
    flushes for the different sub-ranges and should be less efficient than a
    single whole-range flush.

    A kernel build test on my Sandybridge desktop doesn't show any noticeable
    change.
    v4.9-rc4:
    real 5m14.048s
    user 32m19.800s
    sys 4m50.320s

    With this commit:
    real 5m13.888s
    user 32m19.330s
    sys 4m51.200s

    Reported-by: Dave Hansen
    Signed-off-by: Aaron Lu
    Signed-off-by: Linus Torvalds

    Aaron Lu
     

08 Oct, 2016

3 commits

    vma->vm_page_prot is read locklessly from the rmap_walk while it may be
    updated concurrently; this patch prevents the risk of reading
    intermediate values.

    Link: http://lkml.kernel.org/r/1474660305-19222-1-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Jan Vorlicek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, the process only needs to touch
    the global counter in two cases:

    1. The first time it uses the global huge zero page;
    2. When mm_users of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, a kthread is not eligible to use the huge
    zero page either. Since no kthread uses the huge zero page today, there
    is no difference after applying this patch. But if that is not desired,
    I can change it to trigger when mm_count reaches zero.
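
    A rough sketch of how such a per-mm flag gates access to the global
    counter (flag and helper names taken from the description above; the
    exact code may differ):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
            /* fast path: this mm already holds its reference */
            if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    return READ_ONCE(huge_zero_page);

            if (!get_huge_zero_page())      /* touches the global counter */
                    return NULL;

            /* raced with another thread of this mm: drop the extra reference */
            if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
                    put_huge_zero_page();

            return READ_ONCE(huge_zero_page);
    }

    The matching put of the global reference then happens once per mm, when
    its mm_users count drops to zero.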

    Test case used on a Haswell EP machine:

    usemem -n 72 --readonly -j 0x200000 100G

    This spawns 72 processes, each of which mmaps 100G of anonymous space
    and then does read-only access to that space sequentially with a step of
    2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
    This feature relies on both mmap virtual address and FS block (i.e.
    physical address) to be aligned by the pmd page size. Users can use
    mkfs options to specify FS to align block allocations. However,
    aligning mmap address requires code changes to existing applications for
    providing a pmd-aligned address to mmap().

    For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
    It calls mmap() with a NULL address, which needs to be changed to
    provide a pmd-aligned address for testing with DAX pmd mappings.
    Changing all applications that call mmap() with NULL is undesirable.
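
    For reference, this is roughly what each such application would have to
    do by hand to get a pmd-aligned address (a self-contained sketch that
    assumes a 2MB pmd size and uses anonymous memory; a real DAX user would
    mmap the file descriptor with MAP_SHARED instead):

    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>

    #define PMD_ALIGN (2UL * 1024 * 1024)   /* assumed pmd page size */

    /* Over-allocate, then trim so the returned address is pmd-aligned. */
    static void *mmap_pmd_aligned(size_t len)
    {
            size_t padded = len + PMD_ALIGN;
            uintptr_t raw = (uintptr_t)mmap(NULL, padded, PROT_READ | PROT_WRITE,
                                            MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if ((void *)raw == MAP_FAILED)
                    return MAP_FAILED;

            uintptr_t aligned = (raw + PMD_ALIGN - 1) & ~(PMD_ALIGN - 1);
            if (aligned > raw)                      /* give back the unused head */
                    munmap((void *)raw, aligned - raw);
            if (raw + padded > aligned + len)       /* ... and the unused tail */
                    munmap((void *)(aligned + len), raw + padded - (aligned + len));
            return (void *)aligned;
    }

    int main(void)
    {
            void *p = mmap_pmd_aligned(16 * PMD_ALIGN);
            printf("pmd-aligned mapping at %p\n", p);
            return p == MAP_FAILED;
    }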

    Add thp_get_unmapped_area(), which can be called by filesystem's
    get_unmapped_area to align an mmap address by the pmd size for a DAX
    file. It calls the default handler, mm->get_unmapped_area(), to find a
    range and then aligns it for a DAX file.

    The patch is based on Matthew Wilcox's change that allows adding support
    for the pud page size easily.

    [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
    Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Reviewed-by: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     

30 Sep, 2016

1 commit


26 Sep, 2016

1 commit

  • The NUMA balancing logic uses an arch-specific PROT_NONE page table flag
    defined by pte_protnone() or pmd_protnone() to mark PTEs or huge page
    PMDs respectively as requiring balancing upon a subsequent page fault.
    User-defined PROT_NONE memory regions which also have this flag set will
    not normally invoke the NUMA balancing code as do_page_fault() will send
    a segfault to the process before handle_mm_fault() is even called.

    However if access_remote_vm() is invoked to access a PROT_NONE region of
    memory, handle_mm_fault() is called via faultin_page() and
    __get_user_pages() without any access checks being performed, meaning
    the NUMA balancing logic is incorrectly invoked on a non-NUMA memory
    region.

    A simple means of triggering this problem is to access PROT_NONE mmap'd
    memory using /proc/self/mem which reliably results in the NUMA handling
    functions being invoked when CONFIG_NUMA_BALANCING is set.
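
    A minimal sketch of that trigger, written from the description above
    (hypothetical reproducer, not the bugzilla code; on an affected kernel
    with CONFIG_NUMA_BALANCING this reaches the BUG_ON path discussed below):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 2UL * 1024 * 1024;
            char buf[4096];

            /* user-defined PROT_NONE region: p[te|md]_protnone() is set */
            void *addr = mmap(NULL, len, PROT_NONE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (addr == MAP_FAILED)
                    return 1;

            int fd = open("/proc/self/mem", O_RDONLY);
            if (fd < 0)
                    return 1;

            /* remote access: faultin_page() runs without the usual access checks */
            ssize_t n = pread(fd, buf, sizeof(buf), (uintptr_t)addr);
            printf("pread returned %zd\n", n);
            return 0;
    }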

    This issue was reported in bugzilla (issue 99101) which includes some
    simple repro code.

    There are BUG_ON() checks in do_numa_page() and do_huge_pmd_numa_page()
    added at commit c0e7cad to avoid accidentally provoking strange
    behaviour by attempting to apply NUMA balancing to pages that are in
    fact PROT_NONE. The BUG_ON()'s are consistently triggered by the repro.

    This patch moves the PROT_NONE check into mm/memory.c rather than
    invoking BUG_ON() as faulting in these pages via faultin_page() is a
    valid reason for reaching the NUMA check with the PROT_NONE page table
    flag set and is therefore not always a bug.

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=99101
    Reported-by: Trevor Saunders
    Signed-off-by: Lorenzo Stoakes
    Acked-by: Rik van Riel
    Cc: Andrew Morton
    Cc: Mel Gorman
    Signed-off-by: Linus Torvalds

    Lorenzo Stoakes
     

22 Sep, 2016

1 commit


14 Sep, 2016

1 commit

  • Commit:

    4d9424669946 ("mm: convert p[te|md]_mknonnuma and remaining page table manipulations")

    changed NUMA balancing from _PAGE_NUMA to using PROT_NONE, and was quickly
    found to introduce a regression with NUMA grouping.

    It was followed up by these commits:

    53da3bc2ba9e ("mm: fix up numa read-only thread grouping logic")
    bea66fbd11af ("mm: numa: group related processes based on VMA flags instead of page table flags")
    b191f9b106ea ("mm: numa: preserve PTE write permissions across a NUMA hinting fault")

    The first two of those commits try alternate approaches to NUMA
    grouping, which apparently do not work as well as looking at the PTE
    write permissions.

    The latter patch preserves the PTE write permissions across a NUMA
    protection fault. However, it forgets to revert the condition for
    whether or not to group tasks together back to what it was before
    v3.19, even though the information is now preserved in the page tables
    once again.

    This patch brings the NUMA grouping heuristic back to what it was
    before commit 4d9424669946, which the changelogs of subsequent
    commits suggest worked best.

    We have all the information again. We should probably use it.

    Signed-off-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: aarcange@redhat.com
    Cc: linux-mm@kvack.org
    Cc: mgorman@suse.de
    Link: http://lkml.kernel.org/r/20160908213053.07c992a9@annuminas.surriel.com
    Signed-off-by: Ingo Molnar

    Rik van Riel
     

10 Sep, 2016

1 commit

    Attempting to dump /proc/<pid>/smaps for a process with pmd dax mappings
    currently results in the following VM_BUG_ONs:

    kernel BUG at mm/huge_memory.c:1105!
    task: ffff88045f16b140 task.stack: ffff88045be14000
    RIP: 0010: follow_trans_huge_pmd+0x2cb/0x340
    [..]
    Call Trace:
      smaps_pte_range+0xa0/0x4b0
      ? vsnprintf+0x255/0x4c0
      __walk_page_range+0x1fe/0x4d0
      walk_page_vma+0x62/0x80
      show_smap+0xa6/0x2b0

    kernel BUG at fs/proc/task_mmu.c:585!
    RIP: 0010: smaps_pte_range+0x499/0x4b0
    Call Trace:
      ? vsnprintf+0x255/0x4c0
      __walk_page_range+0x1fe/0x4d0
      walk_page_vma+0x62/0x80
      show_smap+0xa6/0x2b0

    These locations are sanity checking page flags that must be set for an
    anonymous transparent huge page, but are not set for the zone_device
    pages associated with dax mappings.

    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Acked-by: Andrew Morton
    Signed-off-by: Dan Williams

    Dan Williams
     

27 Aug, 2016

1 commit

  • While adding proper userfaultfd_wp support with bits in pagetable and
    swap entry to avoid false positives WP userfaults through swap/fork/
    KSM/etc, I've been adding a framework that mostly mirrors soft dirty.

    So I noticed one place where I had to add uffd_wp support to the
    pagetables that wasn't covered by soft_dirty, and I think it should have
    been.

    Example: in the THP migration code migrate_misplaced_transhuge_page()
    pmd_mkdirty is called unconditionally after mk_huge_pmd.

    entry = mk_huge_pmd(new_page, vma->vm_page_prot);
    entry = maybe_pmd_mkwrite(pmd_mkdirty(entry), vma);

    That sets soft dirty too (it's a false positive for soft dirty; the soft
    dirty bit could be more fine-grained and transfer the bit like uffd_wp
    will do: pmd/pte_uffd_wp() enforces the invariant that when it's set,
    pmd/pte_write is not set).

    However, in the THP split there's no unconditional pmd_mkdirty after
    mk_huge_pmd, and pte_swp_mksoft_dirty isn't called after the migration
    entry is created. The code sets the dirty bit in the struct page
    instead of setting it in the pagetable (which is fully equivalent as far
    as the real dirty bit is concerned, as the whole point of pagetable bits
    is to be eventually flushed out to the page, but that is not equivalent
    for the soft-dirty bit, which gets lost in translation).

    This was found by code review only and totally untested as I'm working
    to actually replace soft dirty and I don't have time to test potential
    soft dirty bugfixes as well :).

    Transfer the soft_dirty from pmd to pte during THP splits.
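
    A simplified sketch of the transfer in __split_huge_pmd_locked() (the
    surrounding loop and the other pte bits are omitted):

    bool soft_dirty = pmd_soft_dirty(old_pmd);
    ...
    /* for each small pte (or migration entry) created from the huge pmd */
    if (soft_dirty)
            entry = freeze ? pte_swp_mksoft_dirty(entry)    /* migration entry */
                           : pte_mksoft_dirty(entry);       /* present pte */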

    This fix avoids losing the soft_dirty bit and avoids userland memory
    corruption in the checkpoint.

    Fixes: eef1b3ba053aa6 ("thp: implement split_huge_pmd()")
    Link: http://lkml.kernel.org/r/1471610515-30229-2-git-send-email-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Acked-by: Pavel Emelyanov
    Cc: "Kirill A. Shutemov"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

29 Jul, 2016

5 commits

  • After the previous patch, we can distinguish costly allocations that
    should be really lightweight, such as THP page faults, with
    __GFP_NORETRY. This means we don't need to recognize khugepaged
    allocations via PF_KTHREAD anymore. We can also change THP page faults
    in areas where madvise(MADV_HUGEPAGE) was used to try as hard as
    khugepaged, as the process has indicated that it benefits from THPs and
    is willing to pay some initial latency costs.

    We can also make the flags handling less cryptic by distinguishing
    GFP_TRANSHUGE_LIGHT (no reclaim at all, default mode in page fault) from
    GFP_TRANSHUGE (only direct reclaim, khugepaged default). Adding
    __GFP_NORETRY or __GFP_KSWAPD_RECLAIM is done where needed.

    The patch effectively changes the current GFP_TRANSHUGE users as
    follows:

    * get_huge_zero_page() - the zero page lifetime should be relatively
    long and it's shared by multiple users, so it's worth spending some
    effort on it. We use GFP_TRANSHUGE, and __GFP_NORETRY is not added.
    This also restores direct reclaim to this allocation, which was
    unintentionally removed by commit e4a49efe4e7e ("mm: thp: set THP defrag
    by default to madvise and add a stall-free defrag option")

    * alloc_hugepage_khugepaged_gfpmask() - this is khugepaged, so latency
    is not an issue. So if khugepaged "defrag" is enabled (the default), do
    reclaim via GFP_TRANSHUGE without __GFP_NORETRY. We can remove the
    PF_KTHREAD check from page alloc.

    As a side-effect, khugepaged will now no longer check if the initial
    compaction was deferred or contended. This is OK, as khugepaged sleep
    times between collapse attempts are long enough to prevent noticeable
    disruption, so we should allow it to spend some effort.

    * migrate_misplaced_transhuge_page() - already was masking out
    __GFP_RECLAIM, so just convert to GFP_TRANSHUGE_LIGHT which is
    equivalent.

    * alloc_hugepage_direct_gfpmask() - vma's with VM_HUGEPAGE (via madvise)
    are now allocating without __GFP_NORETRY. Other vma's keep using
    __GFP_NORETRY if direct reclaim/compaction is at all allowed (by default
    it's allowed only for madvised vma's). The rest is conversion to
    GFP_TRANSHUGE(_LIGHT).
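
    Putting those rules together, the gfp selection described above looks
    roughly like this (a paraphrase of the changelog, with made-up condition
    names, not the actual code):

    gfp_t gfp;

    if (khugepaged || vma_madvised)
            gfp = GFP_TRANSHUGE;                    /* direct reclaim, no __GFP_NORETRY */
    else if (defrag_allowed_for_this_vma)
            gfp = GFP_TRANSHUGE | __GFP_NORETRY;    /* try once, back off quickly */
    else
            gfp = GFP_TRANSHUGE_LIGHT;              /* page-fault default: no reclaim */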

    [mhocko@suse.com: suggested GFP_TRANSHUGE_LIGHT]
    Link: http://lkml.kernel.org/r/20160721073614.24395-7-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    The definition of the return value of madvise_free_huge_pmd() was not
    clear before. Following Minchan Kim's suggestion, change the return type
    to bool and return true if we do MADV_FREE successfully on the entire
    pmd page; otherwise return false. Comments are added too.

    Link: http://lkml.kernel.org/r/1467135452-16688-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jerome Marchand
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
    There are now a number of accounting oddities, such as mapped file pages
    being accounted for on the node while the total number of file pages is
    accounted on the zone. This can be coped with to some extent but it's
    confusing, so this patch moves the relevant file-based accounting. Due
    to throttling logic in the page allocator for reliable OOM detection, it
    is still necessary to track dirty and writeback pages on a per-zone
    basis.

    [mgorman@techsingularity.net: fix NR_ZONE_WRITE_PENDING accounting]
    Link: http://lkml.kernel.org/r/1468404004-5085-5-git-send-email-mgorman@techsingularity.net
    Link: http://lkml.kernel.org/r/1467970510-21195-20-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Hillf Danton
    Acked-by: Johannes Weiner
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This moves the LRU lists from the zone to the node and related data such
    as counters, tracing, congestion tracking and writeback tracking.

    Unfortunately, due to reclaim and compaction retry logic, it is
    necessary to account for the number of LRU pages at both the zone and
    the node level. Most reclaim logic is based on the node counters, but
    the retry logic uses the zone counters, which do not distinguish
    inactive and active sizes. It would be possible to leave the LRU
    counters on a per-zone basis, but it's a heavier calculation across
    multiple cache lines that is much more frequent than the retry checks.

    Other than the LRU counters, this is mostly a mechanical patch, but note
    that it introduces a number of anomalies. For example, the scans are
    per-zone but use per-node counters. We also mark a node as congested
    when a zone is congested. This causes weird problems that are fixed
    later, but it is easier to review this way.

    In the event that there is excessive overhead on 32-bit systems due to
    the nodes being on LRU then there are two potential solutions

    1. Long-term isolation of highmem pages when reclaim is lowmem

    When pages are skipped, they are immediately added back onto the LRU
    list. If lowmem reclaim persisted for long periods of time, the same
    highmem pages get continually scanned. The idea would be that lowmem
    keeps those pages on a separate list until a reclaim for highmem pages
    arrives that splices the highmem pages back onto the LRU. It potentially
    could be implemented similar to the UNEVICTABLE list.

    That would reduce the skip rate, with the potential corner case that
    highmem pages have to be scanned and reclaimed to free lowmem slab pages.

    2. Linear scan lowmem pages if the initial LRU shrink fails

    This will break LRU ordering but may be preferable and faster during
    memory pressure than skipping LRU pages.

    Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Node-based reclaim requires node-based LRUs and locking. This is a
    preparation patch that just moves the lru_lock to the node so later
    patches are easier to review. It is a mechanical change but note this
    patch makes contention worse because the LRU lock is hotter and direct
    reclaim and kswapd can contend on the same lock even when reclaiming
    from different zones.

    Link: http://lkml.kernel.org/r/1467970510-21195-3-git-send-email-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

27 Jul, 2016

22 commits

  • To make the comments consistent with the already changed code.

    Link: http://lkml.kernel.org/r/1466200004-6196-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • For file mappings, we don't deposit page tables on THP allocation
    because it's not strictly required to implement split_huge_pmd(): we can
    just clear pmd and let following page faults to reconstruct the page
    table.

    But Power makes use of the deposited page table to address an MMU quirk.

    Let's hide the THP page cache, including huge tmpfs, under a separate
    config option, so it can be forbidden on Power.

    We can revert the patch later once a solution for Power is found.

    Link: http://lkml.kernel.org/r/1466021202-61880-36-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    The khugepaged implementation grew to the point where it deserves a
    separate file in the source tree.

    Let's move it to mm/khugepaged.c.

    Link: http://lkml.kernel.org/r/1466021202-61880-32-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's wire up existing madvise() hugepage hints for file mappings.

    MADV_HUGEPAGE advises shmem to allocate a huge page on page fault in the
    VMA. It only has effect if the filesystem is mounted with huge=advise
    or huge=within_size.

    MADV_NOHUGEPAGE prevents a huge page from being allocated on page fault
    in the VMA. It doesn't prevent a huge page from being allocated by other
    means, e.g. a page fault into a different mapping or a write(2) into the
    file.
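
    A small self-contained example of the hint on a shmem-backed (shared
    anonymous) mapping; whether a huge page is actually used still depends
    on the huge= mount option or the shmem_enabled sysfs knob described in
    the entries below:

    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>

    int main(void)
    {
            size_t len = 4UL * 1024 * 1024;
            /* shared anonymous memory is backed by shmem */
            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_SHARED | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                    return 1;
            if (madvise(p, len, MADV_HUGEPAGE))     /* hint: use huge pages here */
                    perror("madvise");
            memset(p, 1, len);                      /* fault the range in */
            /* ShmemPmdMapped in /proc/meminfo shows whether a huge page got mapped */
            getchar();
            return 0;
    }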

    Link: http://lkml.kernel.org/r/1466021202-61880-31-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Here's a basic implementation of huge page support for shmem/tmpfs.

    It's all pretty straightforward:

    - shmem_getpage() allocates a huge page if it can and tries to insert it
    into the radix tree with shmem_add_to_page_cache();

    - shmem_add_to_page_cache() puts the page onto the radix tree if there's
    space for it;

    - shmem_undo_range() removes huge pages if they are fully within the
    range. A partial truncate of a huge page zeroes out that part of the THP.

    This has a visible effect on fallocate(FALLOC_FL_PUNCH_HOLE)
    behaviour. As we don't really create a hole in this case,
    lseek(SEEK_HOLE) may have inconsistent results depending on which
    pages happened to be allocated.

    - no need to change shmem_fault(): core mm will map a compound page as
    huge if the VMA is suitable;

    Link: http://lkml.kernel.org/r/1466021202-61880-30-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    This patch adds a new mount option "huge=". It can have the following
    values:

    - "always":
    Attempt to allocate huge pages every time we need a new page;

    - "never":
    Do not allocate huge pages;

    - "within_size":
    Only allocate huge page if it will be fully within i_size.
    Also respect fadvise()/madvise() hints;

    - "advise:
    Only allocate huge pages if requested with fadvise()/madvise();

    Default is "never" for now.

    "mount -o remount,huge= /mountpoint" works fine after mount: remounting
    huge=never will not attempt to break up huge pages at all, just stop
    more from being allocated.

    No new config option: put this under CONFIG_TRANSPARENT_HUGEPAGE, which
    is the appropriate option to protect those who don't want the new bloat,
    and with which we shall share some pmd code.

    Prohibit the option when !CONFIG_TRANSPARENT_HUGEPAGE, just as mpol is
    invalid without CONFIG_NUMA (was hidden in mpol_parse_str(): make it
    explicit).

    Allow enabling THP only if the machine has_transparent_hugepage().

    But what about Shmem with no user-visible mount? SysV SHM, memfds,
    shared anonymous mmaps (of /dev/zero or MAP_ANONYMOUS), GPU drivers' DRM
    objects, Ashmem. Though unlikely to suit all usages, provide sysfs knob
    /sys/kernel/mm/transparent_hugepage/shmem_enabled to experiment with
    huge on those.

    And allow shmem_enabled two further values:

    - "deny":
    For use in emergencies, to force the huge option off from
    all mounts;
    - "force":
    Force the huge option on for all - very useful for testing;

    Based on patch by Hugh Dickins.

    Link: http://lkml.kernel.org/r/1466021202-61880-28-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's add ShmemHugePages and ShmemPmdMapped fields into meminfo and
    smaps. It indicates how many times we allocate and map shmem THP.

    NR_ANON_TRANSPARENT_HUGEPAGES is renamed to NR_ANON_THPS.

    Link: http://lkml.kernel.org/r/1466021202-61880-27-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • As with anon THP, we only mlock file huge pages if we can prove that the
    page is not mapped with PTE. This way we can avoid mlock leak into
    non-mlocked vma on split.

    We rely on PageDoubleMap() under lock_page() to check whether the page
    may be PTE mapped. PG_double_map is set by page_add_file_rmap() when
    the page is mapped with PTEs.

    Link: http://lkml.kernel.org/r/1466021202-61880-21-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Basic scheme is the same as for anon THP.

    Main differences:

    - File pages are on radix-tree, so we have head->_count offset by
    HPAGE_PMD_NR. The count got distributed to small pages during split.

    - mapping->tree_lock prevents non-lockless access to pages under split
    over radix-tree;

    - Lockless access is prevented by setting the head->_count to 0 during
    split;

    - After split, some pages can be beyond i_size. We drop them from
    radix-tree.

    - We don't set up migration entries. Just unmap the pages. This helps
    handle the case when i_size is in the middle of the page: no need to
    handle unmapping of pages beyond i_size manually.

    Link: http://lkml.kernel.org/r/1466021202-61880-20-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    change_huge_pmd() has an assert which is not relevant for file pages.
    For a shared mapping it's perfectly fine to have the page table entry
    writable, without an explicit mkwrite.

    Link: http://lkml.kernel.org/r/1466021202-61880-18-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • copy_page_range() has a check for "Don't copy ptes where a page fault
    will fill them correctly." It works on VMA level. We still copy all
    page table entries from private mappings, even if they map page cache.

    We can simplify copy_huge_pmd() a bit by skipping file PMDs.

    We don't map file private pages with PMDs, so they can only map page
    cache. It's safe to skip them as they can be re-faulted later.

    Link: http://lkml.kernel.org/r/1466021202-61880-17-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Splitting a THP PMD is simple: just unmap it as in the DAX case. This
    way we can avoid the memory overhead of a deposited page table
    allocation.

    It's probably a good idea to try to allocate a page table with
    GFP_ATOMIC in __split_huge_pmd_locked() to avoid refaulting the area,
    but clearing the pmd should be good enough for now.

    Unlike DAX, we also remove the page from rmap and drop the reference.
    pmd_young() is transferred to PageReferenced().

    Link: http://lkml.kernel.org/r/1466021202-61880-15-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    split_huge_pmd() for file mappings (and DAX too) is implemented by just
    clearing the pmd entry, as we can re-fill this area from page cache on
    the pte level later.

    This means we don't need to deposit page tables when file THP is mapped.
    Therefore we shouldn't try to withdraw a page table on zap_huge_pmd() of
    a file THP PMD.

    Link: http://lkml.kernel.org/r/1466021202-61880-14-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    With postponed page table allocation we have a chance to set up huge
    pages. do_set_pte() calls do_set_pmd() if the following criteria are met
    (see the sketch below):

    - the page is compound;
    - the pmd entry is pmd_none();
    - the vma has suitable size and alignment;
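
    A sketch of the gate (simplified; `fe' is the fault context structure
    mentioned two entries below, and the size/alignment check actually lives
    inside do_set_pmd()):

    /* in the pte-setting path of the fault handler */
    if (pmd_none(*fe->pmd) && PageTransCompound(page)) {
            int ret = do_set_pmd(fe, page);         /* try to map the whole THP */
            if (ret != VM_FAULT_FALLBACK)
                    return ret;
    }
    /* otherwise fall back to setting up a single 4k pte */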

    Link: http://lkml.kernel.org/r/1466021202-61880-12-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Naive approach: on mapping/unmapping the page as compound we update
    ->_mapcount on each 4k page. That's not efficient, but it's not obvious
    how we can optimize this. We can look into optimization later.

    The PG_double_map optimization doesn't work for file pages, since the
    lifecycle of file pages is different compared to anon pages: a file page
    can be mapped again at any time.

    Link: http://lkml.kernel.org/r/1466021202-61880-11-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The idea borrowed from Peter's patch from patchset on speculative page
    faults[1]:

    Instead of passing around the endless list of function arguments,
    replace the lot with a single structure so we can change context without
    endless function signature changes.

    The changes are mostly mechanical, with the exception of the faultaround
    code: filemap_map_pages() got reworked a bit.

    This patch is preparation for the next one.

    [1] http://lkml.kernel.org/r/20141020222841.302891540@infradead.org

    Link: http://lkml.kernel.org/r/1466021202-61880-9-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Vlastimil noted[1] that the pmd may no longer be valid after we drop
    mmap_sem. We need to recheck it once mmap_sem is taken again.

    [1] http://lkml.kernel.org/r/12918dcd-a695-c6f4-e06f-69141c5f357f@suse.cz

    Link: http://lkml.kernel.org/r/1466021202-61880-6-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    After creating the revalidate vma function, a locking inconsistency
    occurred due to directing the code path to the wrong label. This patch
    directs it to the correct label and fixes the inconsistency.

    Related commit that caused inconsistency:
    http://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git/commit/?id=da4360877094368f6dfe75bbe804b0f0a5d575b0

    Link: http://lkml.kernel.org/r/1464956884-4644-1-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-4-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Vlastimil Babka
    Cc: Sergey Senozhatsky
    Cc: Kirill A. Shutemov
    Cc: Stephen Rothwell
    Cc: Rik van Riel
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    Currently khugepaged does swapin readahead under down_write. This patch
    makes the swapin readahead happen under down_read instead of down_write.

    The patch was tested with a test program that allocates 800MB of memory,
    writes to it, and then sleeps. The system was forced to swap it all out.
    Afterwards, the test program touches the area by writing; it skips a
    page in every 20 pages of the area.

    [akpm@linux-foundation.org: update comment to match new code]
    [kirill.shutemov@linux.intel.com: passing 'vma' to hugepage_vma_revlidate() is useless]
    Link: http://lkml.kernel.org/r/20160530095058.GA53044@black.fi.intel.com
    Link: http://lkml.kernel.org/r/1466021202-61880-3-git-send-email-kirill.shutemov@linux.intel.com
    Link: http://lkml.kernel.org/r/1464335964-6510-4-git-send-email-ebru.akagunduz@gmail.com
    Link: http://lkml.kernel.org/r/1466021202-61880-2-git-send-email-kirill.shutemov@linux.intel.com
    Signed-off-by: Ebru Akagunduz
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    This patch adds swapin readahead to improve the THP collapse rate. When
    khugepaged scans pages, a few of them may be in the swap area.

    With the patch, khugepaged can collapse 4kB pages into a THP when there
    are up to max_ptes_swap swap ptes in a 2MB range.

    The patch was tested with a test program that allocates 400B of memory,
    writes to it, and then sleeps. I forced the system to swap it all out.
    Afterwards, the test program touches the area by writing; it skips a
    page in every 20 pages of the area.

    Without the patch, the system did no swapin readahead. The THP rate was
    65% of the program's memory and did not change over time.

    With this patch, after 10 minutes of waiting khugepaged had collapsed
    99% of the program's memory.

    [kirill.shutemov@linux.intel.com: trivial cleanup of exit path of the function]
    [kirill.shutemov@linux.intel.com: __collapse_huge_page_swapin(): drop unused 'pte' parameter]
    [kirill.shutemov@linux.intel.com: do not hold anon_vma lock during swap in]
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    Introduce a new sysfs integer knob
    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap, which
    bounds the optimistic swapin readahead check used to increase the THP
    collapse rate. Before bringing swapped-out pages back into memory,
    khugepaged checks them and allows up to this number. It also prints out
    the number of unmapped ptes using tracepoints.

    [vdavydov@parallels.com: fix scan not aborted on SCAN_EXCEED_SWAP_PTE]
    [sfr@canb.auug.org.au: build fix]
    Link: http://lkml.kernel.org/r/20160616154503.65806e12@canb.auug.org.au
    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Naoya Horiguchi
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Xie XiuQi
    Cc: Cyrill Gorcunov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
    This allows an arch which needs to do special handling with respect to
    different page sizes when flushing the TLB to implement the same in the
    mmu gather.

    Link: http://lkml.kernel.org/r/1465049193-22197-3-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V