30 Oct, 2014

2 commits

  • If an anonymous mapping is not allowed to fault thp memory and then
    madvise(MADV_HUGEPAGE) is used after fault, khugepaged will never
    collapse this memory into thp memory.

    This occurs because the madvise(2) handler for thp, hugepage_madvise(),
    clears VM_NOHUGEPAGE on the stack and it isn't stored in vma->vm_flags
    until the final action of madvise_behavior(). This causes
    khugepaged_enter_vma_merge() to be a no-op in hugepage_madvise() when
    the vma had previously had VM_NOHUGEPAGE set.

    Fix this by passing the correct vma flags to the khugepaged mm slot
    handler. There's no chance khugepaged can run on this vma until after
    madvise_behavior() returns since we hold mm->mmap_sem.

    It would be possible to clear VM_NOHUGEPAGE directly from vma->vm_flags
    in hugepage_madvise(), but I didn't want to introduce special case
    behavior into madvise_behavior(). I think it's best to just let it
    always set vma->vm_flags itself.
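
    A minimal sketch of the resulting flow in hugepage_madvise(), assuming the
    reworked khugepaged_enter_vma_merge() takes the updated flags (the exact
    prototypes may differ from the patch):

    case MADV_HUGEPAGE:
            *vm_flags &= ~VM_NOHUGEPAGE;
            *vm_flags |= VM_HUGEPAGE;
            /*
             * Register with khugepaged based on the flags computed here,
             * not on vma->vm_flags, which madvise_behavior() only updates
             * after this handler returns.
             */
            if (unlikely(khugepaged_enter_vma_merge(vma, *vm_flags)))
                    return -ENOMEM;
            break;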

    Signed-off-by: David Rientjes
    Reported-by: Suleiman Souhlal
    Cc: "Kirill A. Shutemov"
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • A compound page should be freed by put_page() or free_pages() with the
    correct order; otherwise its tail pages are leaked.

    The compound order can be obtained with compound_order(), or
    HPAGE_PMD_ORDER can be used in our case. Some people would argue the
    latter is faster, but I prefer the former, which is more general.
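
    A minimal sketch of the intended freeing path as a standalone helper (the
    naming is illustrative, not the exact hunk from the patch):

    static void release_huge_zero_page(struct page *zero_page)
    {
            /*
             * compound_order() reads the order recorded in the compound
             * page (HPAGE_PMD_ORDER for a THP); freeing with order 0
             * instead would leak every tail page.
             */
            __free_pages(zero_page, compound_order(zero_page));
    }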

    This bug was observed not just on our servers (the worst case we saw is
    11G leaked on a 48G machine) but also on our workstations running an
    Ubuntu-based distro.

    $ cat /proc/vmstat | grep thp_zero_page_alloc
    thp_zero_page_alloc 55
    thp_zero_page_alloc_failed 0

    This means there is (thp_zero_page_alloc - 1) * (2M - 4K) memory leaked.
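
    With the example output above, that is 54 * (2M - 4K), or roughly 108M,
    leaked on this particular machine.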

    Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
    Signed-off-by: Yu Zhao
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Bob Liu
    Cc: [3.8+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yu Zhao
     

10 Oct, 2014

3 commits

  • Dump the contents of the relevant mm_struct when we hit the bug condition.

    Signed-off-by: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • Trivially convert a few VM_BUG_ON calls to VM_BUG_ON_VMA to extract
    more information when they trigger.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Sasha Levin
    Reviewed-by: Naoya Horiguchi
    Cc: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Vlastimil Babka
    Cc: Michel Lespinasse
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • When allocating a huge page for collapsing, khugepaged currently holds
    mmap_sem for reading on the mm where collapsing occurs. Afterwards the
    read lock is dropped before the write lock is taken on the same mmap_sem.

    Holding mmap_sem during the whole huge page allocation is therefore
    useless; the vma needs to be rechecked after taking the write lock
    anyway. Furthermore, huge page allocation might involve a rather long
    sync compaction, and thus block mmap_sem writers and, for example, affect
    workloads that perform frequent m(un)map or mprotect operations.

    This patch simply releases the read lock before allocating a huge page.
    It also deletes an outdated comment that assumed vma must be stable, as it
    was using alloc_hugepage_vma(). This is no longer true since commit
    9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target node").

    Signed-off-by: Vlastimil Babka
    Cc: Minchan Kim
    Acked-by: Mel Gorman
    Cc: Joonsoo Kim
    Cc: Michal Nazarewicz
    Cc: Naoya Horiguchi
    Cc: Christoph Lameter
    Cc: Rik van Riel
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

03 Oct, 2014

1 commit

  • This patch reverts 1ba6e0b50b ("mm: numa: split_huge_page: transfer the
    NUMA type from the pmd to the pte"). If a huge page is being split due
    to a protection change and the tail will be in a PROT_NONE vma, then
    NUMA hinting PTEs are temporarily created in the protected VMA.

    VM_RW|VM_PROTNONE
    |-----------------|
    ^
    split here

    In the specific case above, it should get fixed up by change_pte_range()
    but there is a window of opportunity for weirdness to happen. Similarly,
    if a huge page is shrunk and split during a protection update but before
    pmd_numa is cleared then a pte_numa can be left behind.

    Instead of adding complexity to deal with the case, this patch simply
    does not mark PTEs NUMA when splitting a huge page. NUMA hinting faults
    will not be triggered for the split range, which is a marginal cost in
    comparison to the complexity of dealing with the corner cases during THP
    split.

    Cc: stable@vger.kernel.org
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

09 Aug, 2014

1 commit

  • These patches rework memcg charge lifetime to integrate more naturally
    with the lifetime of user pages. This drastically simplifies the code and
    reduces charging and uncharging overhead. The most expensive part of
    charging and uncharging is the page_cgroup bit spinlock, which is removed
    entirely after this series.

    Here are the top-10 profile entries of a stress test that reads a 128G
    sparse file on a freshly booted box, without even a dedicated cgroup (i.e.
    executing in the root memcg). Before:

    15.36% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.31% cat [kernel.kallsyms] [k] memset
    11.48% cat [kernel.kallsyms] [k] do_mpage_readpage
    4.23% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.38% cat [kernel.kallsyms] [k] put_page
    2.32% cat [kernel.kallsyms] [k] __mem_cgroup_commit_charge
    2.18% kswapd0 [kernel.kallsyms] [k] __mem_cgroup_uncharge_common
    1.92% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.86% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.62% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn

    After:

    15.67% cat [kernel.kallsyms] [k] copy_user_generic_string
    13.48% cat [kernel.kallsyms] [k] memset
    11.42% cat [kernel.kallsyms] [k] do_mpage_readpage
    3.98% cat [kernel.kallsyms] [k] get_page_from_freelist
    2.46% cat [kernel.kallsyms] [k] put_page
    2.13% kswapd0 [kernel.kallsyms] [k] shrink_page_list
    1.88% cat [kernel.kallsyms] [k] __radix_tree_lookup
    1.67% cat [kernel.kallsyms] [k] __pagevec_lru_add_fn
    1.39% kswapd0 [kernel.kallsyms] [k] free_pcppages_bulk
    1.30% cat [kernel.kallsyms] [k] kfree

    As you can see, the memcg footprint has shrunk quite a bit.

    text data bss dec hex filename
    37970 9892 400 48262 bc86 mm/memcontrol.o.old
    35239 9892 400 45531 b1db mm/memcontrol.o

    This patch (of 4):

    The memcg charge API charges pages before they are rmapped - i.e. have an
    actual "type" - and so every callsite needs its own set of charge and
    uncharge functions to know what type is being operated on. Worse,
    uncharge has to happen from a context that is still type-specific, rather
    than at the end of the page's lifetime with exclusive access, and so
    requires a lot of synchronization.

    Rewrite the charge API to provide a generic set of try_charge(),
    commit_charge() and cancel_charge() transaction operations, much like
    what's currently done for swap-in:

    mem_cgroup_try_charge() attempts to reserve a charge, reclaiming
    pages from the memcg if necessary.

    mem_cgroup_commit_charge() commits the page to the charge once it
    has a valid page->mapping and PageAnon() reliably tells the type.

    mem_cgroup_cancel_charge() aborts the transaction.

    This reduces the charge API and enables subsequent patches to
    drastically simplify uncharging.

    As pages need to be committed after rmap is established but before they
    are added to the LRU, page_add_new_anon_rmap() must stop doing LRU
    additions again. Revive lru_cache_add_active_or_unevictable().
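
    A rough usage sketch of the transactional API described above, with the
    argument lists abbreviated from memory rather than copied from the patch:

    struct mem_cgroup *memcg;
    int error;

    error = mem_cgroup_try_charge(page, mm, gfp_mask, &memcg);
    if (error)
            return error;

    /* ... install the page: rmap is set up, page->mapping is valid ... */

    mem_cgroup_commit_charge(page, memcg, false);
    lru_cache_add_active_or_unevictable(page, vma);

    /* on failure before the commit: mem_cgroup_cancel_charge(page, memcg) */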

    [hughd@google.com: fix shmem_unuse]
    [hughd@google.com: Add comments on the private use of -EAGAIN]
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Hugh Dickins
    Cc: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

07 Aug, 2014

4 commits

  • Commit 9f1b868a13ac ("mm: thp: khugepaged: add policy for finding target
    node") improved the previous khugepaged logic, which allocated a
    transparent hugepage from the node of the first page being collapsed.

    However, it is still possible to collapse pages to remote memory which
    may suffer from additional access latency. With the current policy, it
    is possible that 255 pages (with PAGE_SHIFT == 12) will be collapsed
    remotely if the majority are allocated from that node.

    When zone_reclaim_mode is enabled, it means the VM should make every
    attempt to allocate locally to prevent NUMA performance degradation. In
    this case, we do not want to collapse hugepages to remote nodes that
    would suffer from increased access latency. Thus, when
    zone_reclaim_mode is enabled, only allow collapsing to nodes with
    RECLAIM_DISTANCE or less.
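
    A simplified sketch of the policy check, using the node-load bookkeeping
    added by 9f1b868a13ac (the names here are approximate):

    static bool khugepaged_scan_abort(int nid)
    {
            int i;

            /* placement is only restricted when zone_reclaim_mode is set */
            if (!zone_reclaim_mode)
                    return false;

            /* abort the collapse if a node already seen is too far away */
            for (i = 0; i < MAX_NUMNODES; i++) {
                    if (!khugepaged_node_load[i])
                            continue;
                    if (node_distance(nid, i) > RECLAIM_DISTANCE)
                            return true;
            }
            return false;
    }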

    There is no functional change for systems that disable
    zone_reclaim_mode.

    Signed-off-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrea Arcangeli
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: "Kirill A. Shutemov"
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Transparent huge page charges prefer falling back to regular pages
    rather than spending a lot of time in direct reclaim.

    Desired reclaim behavior is usually declared in the gfp mask, but THP
    charges use GFP_KERNEL and then rely on the fact that OOM is disabled
    for THP charges, and that OOM-disabled charges don't retry reclaim.
    Needless to say, this is anything but obvious and quite error prone.

    Convert THP charges to use GFP_TRANSHUGE instead, which implies
    __GFP_NORETRY, to indicate the low-latency requirement.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In some architectures like x86, atomic_add() is a full memory barrier.
    In that case, an additional smp_mb() is just a waste of time. This
    patch replaces that smp_mb() by smp_mb__after_atomic() which will avoid
    the redundant memory barrier in some architectures.
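
    The pattern, in sketch form (the variable names are illustrative, not the
    exact hunk from the patch):

    atomic_add(tail_refcount, &page_tail->_count);
    /*
     * atomic_add() is already a full barrier on x86, so a plain smp_mb()
     * would be redundant there; smp_mb__after_atomic() emits a barrier
     * only on architectures where atomic_add() does not imply one.
     */
    smp_mb__after_atomic();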

    With a 3.16-rc1 based kernel, this patch reduced the execution time of
    breaking 1000 transparent huge pages from 38,245us to 30,964us. A
    reduction of 19% which is quite sizeable. It also reduces the %cpu time
    of the __split_huge_page_refcount function in the perf profile from
    2.18% to 1.15%.

    Signed-off-by: Waiman Long
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Scott J Norton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • In __split_huge_page_map(), the check for page_mapcount(page) is
    invariant within the for loop. Because the macro is implemented using
    atomic_read(), the redundant check cannot be optimized away by the
    compiler, leading to unnecessary reads of the page structure.

    This patch moves the invariant bug check out of the loop so that it is
    done only once. On a 3.16-rc1 based kernel, a microbenchmark that broke
    up 1000 transparent huge pages using munmap() took 38,245us with the
    patch and 38,548us without it, a performance gain of about 1%.
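
    The shape of the change, with an illustrative condition standing in for
    the real one in __split_huge_page_map():

    /* before: re-evaluated on every iteration, one atomic_read() each time */
    for (i = 0; i < HPAGE_PMD_NR; i++) {
            BUG_ON(page_mapcount(page) != expected);    /* illustrative */
            /* ... populate one pte of the split page ... */
    }

    /* after: the value cannot change under the locks held here */
    BUG_ON(page_mapcount(page) != expected);            /* illustrative */
    for (i = 0; i < HPAGE_PMD_NR; i++) {
            /* ... populate one pte of the split page ... */
    }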

    Signed-off-by: Waiman Long
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Scott J Norton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     

24 Jun, 2014

2 commits

  • Trinity has reported:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: __lock_acquire (kernel/locking/lockdep.c:3070 (discriminator 1))
    CPU: 6 PID: 16173 Comm: trinity-c364 Tainted: G W
    3.15.0-rc1-next-20140415-sasha-00020-gaa90d09 #398
    lock_acquire (arch/x86/include/asm/current.h:14
    kernel/locking/lockdep.c:3602)
    _raw_spin_lock (include/linux/spinlock_api_smp.h:143
    kernel/locking/spinlock.c:151)
    remove_migration_pte (mm/migrate.c:137)
    rmap_walk (mm/rmap.c:1628 mm/rmap.c:1699)
    remove_migration_ptes (mm/migrate.c:224)
    migrate_pages (mm/migrate.c:922 mm/migrate.c:960 mm/migrate.c:1126)
    migrate_misplaced_page (mm/migrate.c:1733)
    __handle_mm_fault (mm/memory.c:3762 mm/memory.c:3812 mm/memory.c:3925)
    handle_mm_fault (mm/memory.c:3948)
    __get_user_pages (mm/memory.c:1851)
    __mlock_vma_pages_range (mm/mlock.c:255)
    __mm_populate (mm/mlock.c:711)
    SyS_mlockall (include/linux/mm.h:1799 mm/mlock.c:817 mm/mlock.c:791)

    I believe this comes about because, whereas collapsing and splitting THP
    functions take anon_vma lock in write mode (which excludes concurrent
    rmap walks), faulting THP functions (write protection and misplaced
    NUMA) do not - and mostly they do not need to.

    But they do use a pmdp_clear_flush(), set_pmd_at() sequence which, for
    an instant (indeed, for a long instant, given the inter-CPU TLB flush in
    there), leaves *pmd neither present nor trans_huge.

    Which can confuse a concurrent rmap walk, as when removing migration
    ptes, seen in the dumped trace. Although that rmap walk has a 4k page
    to insert, anon_vmas containing THPs are in no way segregated from
    4k-page anon_vmas, so the 4k-intent mm_find_pmd() does need to cope with
    that instant when a trans_huge pmd is temporarily absent.

    I don't think we need to strengthen the locking at the THP end: it's
    easily handled with an ACCESS_ONCE() before testing both conditions.

    And since mm_find_pmd() had only one caller who wanted a THP rather than
    a pmd, let's slightly repurpose it to fail when it hits a THP or
    non-present pmd, and open code split_huge_page_address() again.
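
    A sketch of the reworked tail of mm_find_pmd(), per the description above
    (details may differ slightly from the committed version):

    pmd_t pmde;
    ...
    pmd = pmd_offset(pud, address);
    /*
     * Snapshot the pmd once: a concurrent huge-page fault may be in the
     * middle of a pmdp_clear_flush()/set_pmd_at() sequence, so *pmd can
     * transiently be neither present nor trans_huge.
     */
    pmde = ACCESS_ONCE(*pmd);
    if (!pmd_present(pmde) || pmd_trans_huge(pmde))
            pmd = NULL;
    return pmd;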

    Signed-off-by: Hugh Dickins
    Reported-by: Sasha Levin
    Acked-by: Kirill A. Shutemov
    Cc: Konstantin Khlebnikov
    Cc: Mel Gorman
    Cc: Bob Liu
    Cc: Christoph Lameter
    Cc: Dave Jones
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Trinity has for over a year been reporting a CONFIG_DEBUG_PAGEALLOC oops
    in copy_page_rep() called from copy_user_huge_page() called from
    do_huge_pmd_wp_page().

    I believe this is a DEBUG_PAGEALLOC false positive, due to the source
    page being split, and a tail page freed, while copy is in progress; and
    not a problem without DEBUG_PAGEALLOC, since the pmd_same() check will
    prevent a miscopy from being made visible.

    Fix by adding get_user_huge_page() and put_user_huge_page(): reducing to
    the usual get_page() and put_page() on head page in the usual config;
    but get and put references to all of the tail pages when
    DEBUG_PAGEALLOC.
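
    A condensed sketch of the get side (the put side mirrors it with
    put_page()):

    static void get_user_huge_page(struct page *page)
    {
            if (IS_ENABLED(CONFIG_DEBUG_PAGEALLOC)) {
                    struct page *endpage = page + HPAGE_PMD_NR;

                    /* pin every subpage so a racing split cannot free one mid-copy */
                    for (; page < endpage; page++)
                            get_page(page);
            } else {
                    get_page(page);
            }
    }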

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Dave Jones
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

19 Apr, 2014

1 commit

  • Sasha Levin has reported two THP BUGs[1][2]. I believe both of them
    have the same root cause. Let's look to them one by one.

    The first bug[1] is "kernel BUG at mm/huge_memory.c:1829!". It's
    BUG_ON(mapcount != page_mapcount(page)) in __split_huge_page(). From my
    testing I see that page_mapcount() is higher than mapcount here.

    I think it happens due to a race between zap_huge_pmd() and
    page_check_address_pmd(). page_check_address_pmd() misses the PMD which
    is under zap:

    CPU0                                        CPU1
                                                zap_huge_pmd()
                                                  pmdp_get_and_clear()
    __split_huge_page()
      anon_vma_interval_tree_foreach()
        __split_huge_page_splitting()
          page_check_address_pmd()
            mm_find_pmd()
              /*
               * We check if PMD present without taking ptl: no
               * serialization against zap_huge_pmd(). We miss this PMD,
               * it's not accounted to 'mapcount' in __split_huge_page().
               */
              pmd_present(pmd) == 0

      BUG_ON(mapcount != page_mapcount(page)) // CRASH!!!

                                                page_remove_rmap(page)
                                                  atomic_add_negative(-1, &page->_mapcount)

    The second bug[2] is "kernel BUG at mm/huge_memory.c:1371!".
    It's VM_BUG_ON_PAGE(!PageHead(page), page) in zap_huge_pmd().

    This happens in a similar way:

    CPU0                                        CPU1
                                                zap_huge_pmd()
                                                  pmdp_get_and_clear()
                                                  page_remove_rmap(page)
                                                    atomic_add_negative(-1, &page->_mapcount)
    __split_huge_page()
      anon_vma_interval_tree_foreach()
        __split_huge_page_splitting()
          page_check_address_pmd()
            mm_find_pmd()
              pmd_present(pmd) == 0 /* The same comment as above */
      /*
       * No crash this time since we already decremented page->_mapcount in
       * zap_huge_pmd().
       */
      BUG_ON(mapcount != page_mapcount(page))

      /*
       * We split the compound page here into small pages without
       * serialization against zap_huge_pmd()
       */
      __split_huge_page_refcount()
        VM_BUG_ON_PAGE(!PageHead(page), page); // CRASH!!!

    So my understanding is that the problem is the pmd_present() check in
    mm_find_pmd(), done without taking the page table lock.

    The bug was introduced by my commit 117b0791ac42. Sorry for that. :(

    Let's open code mm_find_pmd() in page_check_address_pmd() and do the
    check under page table lock.
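
    Roughly, the open-coded lookup becomes (a sketch, taking the split PMD
    lock from 117b0791ac42 before the presence check):

    pmd = pmd_offset(pud, address);

    *ptl = pmd_lock(mm, pmd);
    if (!pmd_present(*pmd))
            goto unlock;
    if (pmd_page(*pmd) != page)
            goto unlock;
    /* ... the splitting/normal checks then run with *ptl still held ... */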

    Note that __page_check_address() does the same for PTE entries
    if sync != 0.

    I've stress tested the split and zap code paths for 36+ hours by now and
    don't see crashes with the patch applied. Before the patch, the crashes
    were triggered much sooner.

    [2] https://lkml.kernel.org/g/

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Sasha Levin
    Tested-by: Sasha Levin
    Cc: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Dave Jones
    Cc: Vlastimil Babka
    Cc: [3.13+]

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

18 Apr, 2014

1 commit

  • Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE.
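
    For example, a converted call site looks like this (illustrative):

    set_user_nice(current, MAX_NICE);   /* was: set_user_nice(current, 19);  */
    set_user_nice(current, MIN_NICE);   /* was: set_user_nice(current, -20); */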

    Signed-off-by: Dongsheng Yang
    Acked-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com
    Cc: devel@driverdev.osuosl.org
    Cc: devicetree@vger.kernel.org
    Cc: fcoe-devel@open-fcoe.org
    Cc: linux390@de.ibm.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: nbd-general@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: openipmi-developer@lists.sourceforge.net
    Cc: qla2xxx-upstream@qlogic.com
    Cc: linux-arch@vger.kernel.org
    [ Consolidated the patches, twiddled the changelog. ]
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     

08 Apr, 2014

2 commits

  • mem_cgroup_newpage_charge is used only for charging anonymous memory so
    it is better to rename it to mem_cgroup_charge_anon.

    mem_cgroup_cache_charge is used for file backed memory so rename it to
    mem_cgroup_charge_file.

    Signed-off-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The main motivation behind this patch is to provide a way to disable THP
    for jobs where the code cannot be modified, and using a malloc hook with
    madvise is not an option (i.e. statically allocated data). This patch
    allows us to do just that, without affecting other jobs running on the
    system.

    We need to do this sort of thing for jobs where THP hurts performance,
    due to the possibility of increased remote memory accesses that can be
    created by situations such as the following:

    When you touch 1 byte of an untouched, contiguous 2MB chunk, a THP will
    be handed out, and the THP will be stuck on whatever node the chunk was
    originally referenced from. If many remote nodes need to do work on
    that same chunk, they'll be making remote accesses.

    With THP disabled, 4K pages can be handed out to separate nodes as
    they're needed, greatly reducing the amount of remote accesses to
    memory.

    This patch is based on some of my work combined with some
    suggestions/patches given by Oleg Nesterov. The main goal here is to
    add a prctl switch to allow us to disable THP on a per-mm_struct
    basis.
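
    For reference, the switch this series adds is driven from userspace
    roughly as follows (PR_SET_THP_DISABLE and its value are taken from the
    later patches in the series):

    #include <stdio.h>
    #include <sys/prctl.h>

    #ifndef PR_SET_THP_DISABLE
    #define PR_SET_THP_DISABLE 41           /* assumed value from this series */
    #endif

    int main(void)
    {
            /* disable THP for this process's mm before the real work starts */
            if (prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0))
                    perror("prctl(PR_SET_THP_DISABLE)");
            return 0;
    }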

    Here's a bit of test data with the new patch in place...

    First with the flag unset:

    # perf stat -a ./prctl_wrapper_mmv3 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 0
    Set thp_disabled state to 0
    Process pid = 18027

    PF/
    MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
    TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
    512 1.120 0.060 0.000 0.110 0.110 0.000 28571428864 -9223372036854775808 55803572 23

    Performance counter stats for './prctl_wrapper_mmv3_hack 0 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    273719072.841402 task-clock # 641.026 CPUs utilized [100.00%]
    1,008,986 context-switches # 0.000 M/sec [100.00%]
    7,717 CPU-migrations # 0.000 M/sec [100.00%]
    1,698,932 page-faults # 0.000 M/sec
    355,222,544,890,379 cycles # 1.298 GHz [100.00%]
    536,445,412,234,588 stalled-cycles-frontend # 151.02% frontend cycles idle [100.00%]
    409,110,531,310,223 stalled-cycles-backend # 115.17% backend cycles idle [100.00%]
    148,286,797,266,411 instructions # 0.42 insns per cycle
    # 3.62 stalled cycles per insn [100.00%]
    27,061,793,159,503 branches # 98.867 M/sec [100.00%]
    1,188,655,196 branch-misses # 0.00% of all branches

    427.001706337 seconds time elapsed

    Now with the flag set:

    # perf stat -a ./prctl_wrapper_mmv3 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g
    Setting thp_disabled for this task...
    thp_disable: 1
    Set thp_disabled state to 1
    Process pid = 144957

    PF/
    MAX MIN TOTCPU/ TOT_PF/ TOT_PF/ WSEC/
    TYPE: CPUS WALL WALL SYS USER TOTCPU CPU WALL_SEC SYS_SEC CPU NODES
    512 0.620 0.260 0.250 0.320 0.570 0.001 51612901376 128000000000 100806448 23

    Performance counter stats for './prctl_wrapper_mmv3_hack 1 ./thp_pthread -C 0 -m 0 -c 512 -b 256g':

    138789390.540183 task-clock # 641.959 CPUs utilized [100.00%]
    534,205 context-switches # 0.000 M/sec [100.00%]
    4,595 CPU-migrations # 0.000 M/sec [100.00%]
    63,133,119 page-faults # 0.000 M/sec
    147,977,747,269,768 cycles # 1.066 GHz [100.00%]
    200,524,196,493,108 stalled-cycles-frontend # 135.51% frontend cycles idle [100.00%]
    105,175,163,716,388 stalled-cycles-backend # 71.07% backend cycles idle [100.00%]
    180,916,213,503,160 instructions # 1.22 insns per cycle
    # 1.11 stalled cycles per insn [100.00%]
    26,999,511,005,868 branches # 194.536 M/sec [100.00%]
    714,066,351 branch-misses # 0.00% of all branches

    216.196778807 seconds time elapsed

    As with previous versions of the patch, we're getting about a 2x
    performance increase here. Here's a link to the test case I used, along
    with the little wrapper to activate the flag:

    http://oss.sgi.com/projects/memtests/thp_pthread_mmprctlv3.tar.gz

    This patch (of 3):

    Revert commit 8e72033f2a48 and add in code to fix up any issues caused
    by the revert.

    The revert is necessary because hugepage_madvise would return -EINVAL
    when VM_NOHUGEPAGE is set, which will break subsequent chunks of this
    patch set.

    Here's a snip of an e-mail from Gerald detailing the original purpose of
    this code, and providing justification for the revert:

    "The intent of commit 8e72033f2a48 was to guard against any future
    programming errors that may result in an madvice(MADV_HUGEPAGE) on
    guest mappings, which would crash the kernel.

    Martin suggested adding the bit to arch/s390/mm/pgtable.c, if
    8e72033f2a48 was to be reverted, because that check will also prevent
    a kernel crash in the case described above, it will now send a
    SIGSEGV instead.

    This would now also allow to do the madvise on other parts, if
    needed, so it is a more flexible approach. One could also say that
    it would have been better to do it this way right from the
    beginning..."

    Signed-off-by: Alex Thorlton
    Suggested-by: Oleg Nesterov
    Tested-by: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Paolo Bonzini
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Andrea Arcangeli
    Cc: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
     

04 Apr, 2014

1 commit

  • I've realized that there's no need for do_huge_pmd_wp_zero_page_fallback().
    We can just split zero page with split_huge_page_pmd() and return
    VM_FAULT_FALLBACK. handle_pte_fault() will handle write-protection
    fault for us.
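
    The resulting fallback is roughly (sketch):

    if (is_huge_zero_pmd(orig_pmd)) {
            /*
             * Split the zero-page pmd into ordinary ptes and let
             * handle_pte_fault() resolve the write-protection fault.
             */
            split_huge_page_pmd(vma, address, pmd);
            return VM_FAULT_FALLBACK;
    }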

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

04 Mar, 2014

1 commit

  • Daniel Borkmann reported a VM_BUG_ON assertion failing:

    ------------[ cut here ]------------
    kernel BUG at mm/mlock.c:528!
    invalid opcode: 0000 [#1] SMP
    Modules linked in: ccm arc4 iwldvm [...]
    video
    CPU: 3 PID: 2266 Comm: netsniff-ng Not tainted 3.14.0-rc2+ #8
    Hardware name: LENOVO 2429BP3/2429BP3, BIOS G4ET37WW (1.12 ) 05/29/2012
    task: ffff8801f87f9820 ti: ffff88002cb44000 task.ti: ffff88002cb44000
    RIP: 0010:[] [] munlock_vma_pages_range+0x2e0/0x2f0
    Call Trace:
    do_munmap+0x18f/0x3b0
    vm_munmap+0x41/0x60
    SyS_munmap+0x22/0x30
    system_call_fastpath+0x1a/0x1f
    RIP munlock_vma_pages_range+0x2e0/0x2f0
    ---[ end trace a0088dcf07ae10f2 ]---

    because munlock_vma_pages_range() thinks it's unexpectedly in the middle
    of a THP page. This can be reproduced with default config since 3.11
    kernels. A reproducer can be found in the kernel's selftest directory
    for networking by running ./psock_tpacket.

    The problem is that an order=2 compound page (allocated by
    alloc_one_pg_vec_page()) is part of the munlocked VM_MIXEDMAP vma (mapped
    by packet_mmap()) and is mistaken for a THP page, assumed to be order=9.

    The checks for THP in munlock came with commit ff6a6da60b89 ("mm:
    accelerate munlock() treatment of THP pages"), i.e. since 3.9, but did
    not trigger a bug. It just makes munlock_vma_pages_range() skip such
    compound pages until the next 512-pages-aligned page, when it encounters
    a head page. This is however not a problem for vma's where mlocking has
    no effect anyway, but it can distort the accounting.

    Since commit 7225522bb429 ("mm: munlock: batch non-THP page isolation
    and munlock+putback using pagevec") this can trigger a VM_BUG_ON in
    PageTransHuge() check.

    This patch fixes the issue by adding VM_MIXEDMAP flag to VM_SPECIAL, a
    list of flags that make vma's non-mlockable and non-mergeable. The
    reasoning is that VM_MIXEDMAP vma's are similar to VM_PFNMAP, which is
    already on the VM_SPECIAL list, and both are intended for non-LRU pages
    where mlocking makes no sense anyway. Related Lkml discussion can be
    found in [2].

    [1] tools/testing/selftests/net/psock_tpacket
    [2] https://lkml.org/lkml/2014/1/10/427
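
    The fix itself is essentially a one-liner in include/linux/mm.h
    (approximate form):

    /* vmas with these flags are neither mergeable nor mlockable */
    #define VM_SPECIAL (VM_IO | VM_DONTEXPAND | VM_PFNMAP | VM_MIXEDMAP)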

    Signed-off-by: Vlastimil Babka
    Signed-off-by: Daniel Borkmann
    Reported-by: Daniel Borkmann
    Tested-by: Daniel Borkmann
    Cc: Thomas Hellstrom
    Cc: John David Anglin
    Cc: HATAYAMA Daisuke
    Cc: Konstantin Khlebnikov
    Cc: Carsten Otte
    Cc: Jared Hulbert
    Tested-by: Hannes Frederic Sowa
    Cc: Kirill A. Shutemov
    Acked-by: Rik van Riel
    Cc: Andrea Arcangeli
    Cc: [3.11.x+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

26 Feb, 2014

1 commit

  • Masayoshi Mizuma reported a bug where an application hangs when running
    under its memcg limit. It happens on a write-protection fault to the
    huge zero page.
    If we successfully allocate a huge page to replace zero page but hit the
    memcg limit we need to split the zero page with split_huge_page_pmd()
    and fallback to small pages.

    The other part of the problem is that VM_FAULT_OOM has a special meaning
    in the do_huge_pmd_wp_page() context. __handle_mm_fault() expects the
    page to be split if it sees VM_FAULT_OOM and it will retry page fault
    handling. This causes an infinite loop if the page was not split.

    do_huge_pmd_wp_zero_page_fallback() can return VM_FAULT_OOM if it failed
    to allocate one small page, so fallback to small pages will not help.

    The solution for this part is to replace VM_FAULT_OOM with
    VM_FAULT_FALLBACK if fallback is required.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Masayoshi Mizuma
    Reviewed-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Feb, 2014

1 commit

  • Archs like ppc64 don't do a TLB flush in the set_pte/pmd functions when
    using a hash table MMU, for various reasons (the flush is handled as part
    of the PTE modification when necessary).

    ppc64 thus doesn't implement flush_tlb_range for hash based MMUs.

    Additionally, ppc64 requires the TLB flushing to be batched within ptl
    locks.

    The reason to do that is to ensure that the hash page table is in sync
    with the Linux page table.

    We track the hpte index in the Linux pte, and if we clear them without
    flushing the hash and drop the ptl lock, another cpu can update the pte
    and we can end up with a duplicate entry in the hash table, which is
    fatal.

    We also want to keep set_pte_at simpler by not requiring it to do a hash
    flush, for performance reasons. We do that by assuming that set_pte_at()
    is never *ever* called on a PTE that is already valid.

    This was the case until the NUMA code went in which broke that assumption.

    Fix that by introducing a new pair of helpers to set _PAGE_NUMA in a
    way similar to ptep/pmdp_set_wrprotect(), with a generic implementation
    using set_pte_at() and a powerpc specific one using the appropriate
    mechanism needed to keep the hash table in sync.

    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Aneesh Kumar K.V
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

28 Jan, 2014

1 commit

  • Pull powerpc mremap fix from Ben Herrenschmidt:
    "This is the patch that I had sent after -rc8 and which we decided to
    wait before merging. It's based on a different tree than my -next
    branch (it needs some pre-reqs that were in -rc4 or so while my -next
    is based on -rc1) so I left it as a separate branch for your to pull.
    It's identical to the request I did 2 or 3 weeks back.

    This fixes crashes in mremap with THP on powerpc.

    The fix however requires a small change in the generic code. It moves
    a condition into a helper we can override from the arch which is
    harmless, but it *also* slightly changes the order of the set_pmd and
    the withdraw & deposit, which should be fine according to Kirill (who
    wrote that code) but I agree -rc8 is a bit late...

    It was acked by Kirill and Andrew told me to just merge it via powerpc"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc/thp: Fix crash on mremap

    Linus Torvalds
     

24 Jan, 2014

3 commits

  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However no observable impact of that difference has been observed during
    testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.

    Also, several instances of missing ";" at EOL are fixed.
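
    For mm/huge_memory.c, for instance, the change has this shape
    (illustrative):

    /* before: misleading for built-in-only code, and missing the ';' */
    module_init(hugepage_init)

    /* after: explicit level-4 initcall registration */
    subsys_initcall(hugepage_init);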

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     
  • min_free_kbytes may be raised during THP's initialization. Sometimes,
    this will change the value which was set by the user. Showing this
    message will clarify this confusion.

    Only show this message when changing a value which was set by the user
    according to Michal Hocko's suggestion.

    Show the old value of min_free_kbytes according to Dave Hansen's
    suggestion. This will give user the chance to restore old value of
    min_free_kbytes.

    Signed-off-by: Han Pingtian
    Reviewed-by: Michal Hocko
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Han Pingtian
     
  • Most of the VM_BUG_ON assertions are performed on a page. Usually, when
    one of these assertions fails we'll get a BUG_ON with a call stack and
    the registers.

    Based on recent requests to add a small piece of code that dumps the
    page at various VM_BUG_ON sites, I've noticed that the page dump is
    quite useful to people debugging issues in mm.

    This patch adds a VM_BUG_ON_PAGE(cond, page) which beyond doing what
    VM_BUG_ON() does, also dumps the page before executing the actual
    BUG_ON.
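
    A minimal sketch of such a macro (the in-tree version may also report the
    failed condition):

    #define VM_BUG_ON_PAGE(cond, page)                                \
            do {                                                      \
                    if (unlikely(cond)) {                             \
                            dump_page(page);  /* dump before dying */ \
                            BUG();                                    \
                    }                                                 \
            } while (0)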

    [akpm@linux-foundation.org: fix up includes]
    Signed-off-by: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

15 Jan, 2014

1 commit

  • This patch fixes the crash below:

    NIP [c00000000004cee4] .__hash_page_thp+0x2a4/0x440
    LR [c0000000000439ac] .hash_page+0x18c/0x5e0
    ...
    Call Trace:
    [c000000736103c40] [00001ffffb000000] 0x1ffffb000000(unreliable)
    [437908.479693] [c000000736103d50] [c0000000000439ac] .hash_page+0x18c/0x5e0
    [437908.479699] [c000000736103e30] [c00000000000924c] .do_hash_page+0x4c/0x58

    On ppc64 we use the pgtable for storing the hpte slot information and
    store the address of the pgtable at a constant offset (PTRS_PER_PMD)
    from the pmd. On mremap, when we switch the pmd, we need to withdraw
    and deposit the pgtable again, so that we find the pgtable at the
    PTRS_PER_PMD offset from the new pmd.

    We also want to move the withdraw and deposit before the set_pmd so
    that, when a page fault finds the pmd as trans huge, we can be sure
    that the pgtable can be located at the expected offset.
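
    The intended ordering in move_huge_pmd() is roughly (sketch):

    pmd = pmdp_get_and_clear(mm, old_addr, old_pmd);
    /*
     * Re-deposit the preallocated pgtable at the new pmd before the pmd
     * becomes visible, so a fault that sees it as trans huge can rely on
     * finding the pgtable at the expected offset.
     */
    pgtable = pgtable_trans_huge_withdraw(mm, old_pmd);
    pgtable_trans_huge_deposit(mm, new_pmd, pgtable);
    set_pmd_at(mm, new_addr, new_pmd, pmd);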

    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Benjamin Herrenschmidt

    Aneesh Kumar K.V
     

12 Jan, 2014

1 commit

  • We see General Protection Fault on RSI in copy_page_rep: that RSI is
    what you get from a NULL struct page pointer.

    RIP: 0010:[] [] copy_page_rep+0x5/0x10
    RSP: 0000:ffff880136e15c00 EFLAGS: 00010286
    RAX: ffff880000000000 RBX: ffff880136e14000 RCX: 0000000000000200
    RDX: 6db6db6db6db6db7 RSI: db73880000000000 RDI: ffff880dd0c00000
    RBP: ffff880136e15c18 R08: 0000000000000200 R09: 000000000005987c
    R10: 000000000005987c R11: 0000000000000200 R12: 0000000000000001
    R13: ffffea00305aa000 R14: 0000000000000000 R15: 0000000000000000
    FS: 00007f195752f700(0000) GS:ffff880c7fc20000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000093010000 CR3: 00000001458e1000 CR4: 00000000000027e0
    Call Trace:
    copy_user_huge_page+0x93/0xab
    do_huge_pmd_wp_page+0x710/0x815
    handle_mm_fault+0x15d8/0x1d70
    __do_page_fault+0x14d/0x840
    do_page_fault+0x2f/0x90
    page_fault+0x22/0x30

    do_huge_pmd_wp_page() tests is_huge_zero_pmd(orig_pmd) four times: but
    since shrink_huge_zero_page() can free the huge_zero_page, and we have
    no hold of our own on it here (except where the fourth test holds
    page_table_lock and has checked pmd_same), it's possible for it to
    answer yes the first time, but no to the second or third test. Change
    all those last three to tests for NULL page.

    (Note: this is not the same issue as trinity's DEBUG_PAGEALLOC BUG
    in copy_page_rep with RSI: ffff88009c422000, reported by Sasha Levin
    in https://lkml.org/lkml/2013/3/29/103. I believe that one is due
    to the source page being split, and a tail page freed, while copy
    is in progress; and not a problem without DEBUG_PAGEALLOC, since
    the pmd_same check will prevent a miscopy from being made visible.)

    Fixes: 97ae17497e99 ("thp: implement refcounting for huge zero page")
    Signed-off-by: Hugh Dickins
    Cc: stable@vger.kernel.org # v3.10 v3.11 v3.12
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Jan, 2014

1 commit

  • Sasha Levin reported the following warning being triggered

    WARNING: CPU: 28 PID: 35287 at mm/huge_memory.c:887 copy_huge_pmd+0x145/ 0x3a0()
    Call Trace:
    copy_huge_pmd+0x145/0x3a0
    copy_page_range+0x3f2/0x560
    dup_mmap+0x2c9/0x3d0
    dup_mm+0xad/0x150
    copy_process+0xa68/0x12e0
    do_fork+0x96/0x270
    SyS_clone+0x16/0x20
    stub_clone+0x69/0x90

    This warning was introduced by "mm: numa: Avoid unnecessary disruption
    of NUMA hinting during migration" for paranoia reasons, but the warning
    is bogus. I was thinking of parallel races between NUMA hinting faults
    and forks, but this warning would also be triggered by a parallel
    reclaim splitting a THP during a fork. Remove the bogus warning.

    Signed-off-by: Mel Gorman
    Reported-by: Sasha Levin
    Cc: Alex Thorlton
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

19 Dec, 2013

7 commits

  • THP migration can fail for a variety of reasons. Avoid flushing the TLB
    to deal with THP migration races until the copy is ready to start.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are a few subtle races, between change_protection_range (used by
    mprotect and change_prot_numa) on one side, and NUMA page migration and
    compaction on the other side.

    The basic race is that there is a time window between when the PTE gets
    made non-present (PROT_NONE or NUMA), and the TLB is flushed.

    During that time, a CPU may continue writing to the page.

    This is fine most of the time, however compaction or the NUMA migration
    code may come in, and migrate the page away.

    When that happens, the CPU may continue writing, through the cached
    translation, to what is no longer the current memory location of the
    process.

    This only affects x86, which has a somewhat optimistic pte_accessible.
    All other architectures appear to be safe, and will either always flush,
    or flush whenever there is a valid mapping, even with no permissions
    (SPARC).

    The basic race looks like this:

    CPU A                        CPU B                         CPU C

                                                               load TLB entry
    make entry PTE/PMD_NUMA
                                 fault on entry
                                                               read/write old page
                                 start migrating page
                                 change PTE/PMD to new page
                                                               read/write old page [*]
    flush TLB
                                                               reload TLB from new entry
                                                               read/write new page
                                                               lose data

    [*] the old page may belong to a new user at this point!

    The obvious fix is to flush remote TLB entries, by making sure that
    pte_accessible is aware of the fact that PROT_NONE and PROT_NUMA memory
    may still be accessible if there is a TLB flush pending for the mm.

    This should fix both NUMA migration and compaction.
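
    On x86 the resulting check looks roughly like this (a sketch of the
    post-fix pte_accessible(), assuming the mm_tlb_flush_pending() helper
    added by the patch):

    static inline bool pte_accessible(struct mm_struct *mm, pte_t a)
    {
            if (pte_flags(a) & _PAGE_PRESENT)
                    return true;

            /* non-present, but a CPU may still hold a valid TLB entry */
            if ((pte_flags(a) & (_PAGE_PROTNONE | _PAGE_NUMA)) &&
                mm_tlb_flush_pending(mm))
                    return true;

            return false;
    }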

    [mgorman@suse.de: fix build]
    Signed-off-by: Rik van Riel
    Signed-off-by: Mel Gorman
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • do_huge_pmd_numa_page() handles the case where there is parallel THP
    migration. However, by the time it is checked the NUMA hinting
    information has already been disrupted. This patch adds an earlier
    check with some helpers.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • On a protection change it is no longer clear if the page should be still
    accessible. This patch clears the NUMA hinting fault bits on a
    protection change.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The anon_vma lock prevents parallel THP splits and any associated
    complexity that arises when handling splits during THP migration. This
    patch checks if the lock was successfully acquired and bails from THP
    migration if it failed for any reason.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If the PMD is flushed then a parallel fault in handle_mm_fault() will
    enter the pmd_none and do_huge_pmd_anonymous_page() path where it'll
    attempt to insert a huge zero page. This is wasteful so the patch
    avoids clearing the PMD when setting pmd_numa.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Base pages are unmapped and flushed from cache and TLB during normal
    page migration and replaced with a migration entry that causes any
    parallel NUMA hinting fault or gup to block until migration completes.

    THP does not unmap pages due to a lack of support for migration entries
    at a PMD level. This allows races with get_user_pages and
    get_user_pages_fast which commit 3f926ab945b6 ("mm: Close races between
    THP migration and PMD numa clearing") made worse by introducing a
    pmd_clear_flush().

    This patch forces get_user_page (fast and normal) on a pmd_numa page to
    go through the slow get_user_page path where it will serialise against
    THP migration and properly account for the NUMA hinting fault. On the
    migration side the page table lock is taken for each PTE update.
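
    In the fast-gup pmd walk, the effect is roughly this extra bail-out
    (sketch):

    if (pmd_none(pmd) || pmd_trans_splitting(pmd) || pmd_numa(pmd))
            return 0;       /* fall back to the slow get_user_pages() path */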

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Alex Thorlton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

13 Dec, 2013

1 commit

  • Andrey Vagin reported a crash on VM_BUG_ON() in pgtable_pmd_page_dtor()
    with the following backtrace:

    free_pgd_range+0x2bf/0x410
    free_pgtables+0xce/0x120
    unmap_region+0xe0/0x120
    do_munmap+0x249/0x360
    move_vma+0x144/0x270
    SyS_mremap+0x3b9/0x510
    system_call_fastpath+0x16/0x1b

    The crash can be reproduced with this test case:

    #define _GNU_SOURCE
    #include <sys/mman.h>

    #define MB (1024 * 1024UL)
    #define GB (1024 * MB)

    int main(int argc, char **argv)
    {
            char *p;
            int i;

            p = mmap((void *) GB, 10 * MB, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
            for (i = 0; i < 10 * MB; i += 4096)
                    p[i] = 1;
            mremap(p, 10 * MB, 10 * MB, MREMAP_FIXED | MREMAP_MAYMOVE,
                   (void *)(2 * GB));
            return 0;
    }

    Due to the split PMD lock, we now store preallocated PTE tables for THP
    pages per PMD table. This means we need to move them to the other PMD
    table if the huge PMD is moved there.

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Andrey Vagin
    Tested-by: Andrey Vagin
    Reviewed-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

15 Nov, 2013

1 commit

  • Only trivial cases left. Let's convert them altogether.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov