09 Oct, 2012

40 commits

  • The thp page table pre-allocation code currently assumes that pgtable_t is
    of type "struct page *". This may not be true for all architectures, so
    this patch removes that assumption by replacing the functions
    prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
    can be defined in an architecture-specific way.

    It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
    operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be
    no functional change introduced by this patch.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Cleanup patch in preparation for transparent hugepage support on s390.
    Adding new architectures to the TRANSPARENT_HUGEPAGE config option can
    make the "depends" line rather ugly, like "depends on (X86 || (S390 &&
    64BIT)) && MMU".

    This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead. x86 already has
    MMU "def_bool y", so the MMU check is superfluous there and
    HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig.
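
    As a sketch, the resulting Kconfig arrangement might look like this (only
    the option names come from the text above; placement and surrounding
    structure are illustrative):

```kconfig
# In mm/Kconfig (illustrative placement):
config HAVE_ARCH_TRANSPARENT_HUGEPAGE
	bool

config TRANSPARENT_HUGEPAGE
	bool "Transparent Hugepage Support"
	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
	# ...rest of the option unchanged...

# In arch/x86/Kconfig, where MMU is already "def_bool y":
config X86
	def_bool y
	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
	# ...other selects...
```

    Each new architecture then adds one "select" line instead of growing the
    "depends" expression.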

    Signed-off-by: Gerald Schaefer
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Fix an anon_vma locking issue in the following situation:

    - vma has no anon_vma
    - next has an anon_vma
    - vma is being shrunk / next is being expanded, due to an mprotect call

    We need to take next's anon_vma lock to avoid races with rmap users (such
    as page migration) while next is being expanded.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Since it is called in start_khugepaged

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Use khugepaged_enabled to see whether thp is enabled

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Merge khugepaged_loop into khugepaged

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • They are used to abstract the difference between NUMA enabled and NUMA
    disabled to make the code more readable

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • If NUMA is enabled, we can release the page in the page pre-allocation
    operation, which reduces the CONFIG_NUMA-dependent code

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • There are two pre-alloc operations in these two functions; the difference
    is:
    - in khugepaged_loop, it is allowed to sleep if the page allocation fails
    - in khugepaged_do_scan, it exits immediately if the page allocation fails

    Actually, in khugepaged_do_scan, we can allow the pre-alloc to sleep on
    the first failure, and then the operation in khugepaged_loop can be removed

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • If NUMA is disabled, hpage is used for page pre-allocation, so there are
    two cases for hpage:

    - it is non-NULL, meaning the page has not yet been consumed
    - otherwise, the page has been consumed

    If NUMA is enabled, hpage is just used as an alloc-failure indicator, not
    a real page; NULL means no failure has been triggered.

    So, we can release the page only if !IS_ERR_OR_NULL

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Add a kthread_should_stop() check to the conditions used to wake up on
    khugepaged_wait; then kthread_stop() is enough to make the thread exit

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Now that khugepaged creation and cancellation are completely serialized
    under the protection of khugepaged_mutex, it is impossible for multiple
    khugepaged entities to be running

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Currently, khugepaged_mutex is used in a complex and hard-to-understand
    way; actually, it is just used to serialize start_khugepaged and
    khugepaged, for these reasons:

    - khugepaged_thread is shared between them
    - the thp disable path (echo never > transparent_hugepage/enabled) is
    nonblocking, so we need to protect khugepaged_thread to get a stable
    running state

    These can be avoided by:

    - using the lock to serialize thread creation and cancellation
    - making the thp disable path unable to finish until the thread exits

    Then khugepaged_thread is fully controlled by start_khugepaged, and
    khugepaged is happy without the lock

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • The check is unnecessary since, if mm_slot_cache or mm_slots_hash
    initialization fails, no sysfs interface will be created

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • THP_COLLAPSE_ALLOC is double-counted if NUMA is disabled, since it has
    already been counted in khugepaged_alloc_hugepage

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Make sure the #endif that terminates the standard #ifndef / #define /
    #endif construct gets labeled, and gets positioned at the end of the file
    as is normally the case.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The core page allocator ensures that page flags are zeroed when freeing
    pages via free_pages_check. A number of architectures (ARM, PPC, MIPS)
    rely on this property to treat new pages as dirty with respect to the data
    cache and perform the appropriate flushing before mapping the pages into
    userspace.

    This can lead to cache synchronisation problems when using hugepages,
    since the allocator keeps its own pool of pages above the usual page
    allocator and does not reset the page flags when freeing a page into the
    pool.

    This patch adds a new architecture hook, arch_clear_hugepage_flags, so
    that architectures which rely on the page flags being in a particular
    state for fresh allocations can adjust the flags accordingly when a page
    is freed into the pool.

    Signed-off-by: Will Deacon
    Cc: Michal Hocko
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.

    Signed-off-by: Davidlohr Bueso
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Fix the return value when the kswapd kernel thread fails to be created.
    Also, raise the error message's priority to KERN_ERR.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • While registering an MMU notifier, a new mmu_notifier_mm instance will be
    allocated and later freed if the current mm_struct's mmu_notifier_mm has
    already been initialized. That causes some overhead. This patch
    eliminates it.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • With an RCU based mmu_notifier implementation, any callout to
    mmu_notifier_invalidate_range_{start,end}() or
    mmu_notifier_invalidate_page() would not be allowed to call schedule()
    as that could potentially allow a modification to the mmu_notifier
    structure while it is currently being used.

    Since srcu allocates 4 machine words per instance per cpu, we may end up
    with memory exhaustion if we use one srcu per mm. So all mms share a
    global srcu. Note that during heavy mmu_notifier activity the exit &
    unregister paths might hang for longer periods, but that is tolerable for
    current mmu_notifier clients.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Haggai Eran
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • There is a bug in set_pte_at_notify(): it always sets the pte to the
    new page before releasing the old page in the secondary MMU. At this
    point the process will access the new page, but the secondary MMU
    still accesses the old page, so memory is inconsistent between them.

    The below scenario shows the bug more clearly:

    at the beginning: *p = 0, and p is write-protected by KSM or shared with
    parent process

    CPU 0 (host)                          CPU 1 (secondary MMU)

    write 1 to p to trigger COW;
    set_pte_at_notify() is called:
    *pte = new_page + W;
    /* the W bit of pte is set */

    *p = 1; /* pte is valid, so no #PF */

                                          the secondary MMU reads p, but
                                          gets:
                                          *p == 0;

                                          /*
                                           * !!!!!!
                                           * the host has already set p to 1,
                                           * but the secondary MMU still gets
                                           * the old value 0
                                           */

    call mmu_notifier_change_pte() to
    release the old page in the
    secondary MMU

    We can fix it by releasing the old page first, then setting the pte to
    the new page.

    Note: the new page will be used in the secondary MMU before it is
    mapped into the page table of the process, but this is safe because it
    is protected by the page table lock; there is no race to change the pte.

    [akpm@linux-foundation.org: add comment from Andrea]
    Signed-off-by: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier
    related damage v3") introduced a potential memory corruption.
    shmem_alloc_page() uses a pseudo vma and it has one significant unique
    combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.

    get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
    and mpol_cond_put() DOES decrease a policy ref when a policy has
    MPOL_F_SHARED. Therefore, when a cpuset update race occurs,
    alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the
    reference count and frees the policy prematurely.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When shared_policy_replace() fails to allocate, new->policy is not freed
    correctly by mpol_set_shared_policy(). The problem is that the shared
    mempolicy code directly calls kmem_cache_free() in multiple places, where
    it is easy to make a mistake.

    This patch creates an sp_free wrapper function and uses it. The bug was
    introduced pre-git age (IOW, before 2.6.12-rc2).

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shared_policy_replace()'s use of sp_alloc() is unsafe: 1) sp_node cannot
    be dereferenced if sp->lock is not held, and 2) another thread can modify
    sp_node between the spin_unlock for allocating a new sp node and the next
    spin_lock. The bug was introduced before 2.6.12-rc2.

    Kosaki's original patch for this problem was to allocate an sp node and
    policy within shared_policy_replace and initialise them when the lock is
    reacquired. I was not keen on this approach because it partially
    duplicates sp_alloc(). As the paths where sp->lock is taken are not that
    performance critical, this patch converts sp->lock to sp->mutex so it can
    sleep when calling sp_alloc().

    [kosaki.motohiro@jp.fujitsu.com: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Dave Jones' system call fuzz testing tool "trinity" triggered the
    following bug with slab debugging enabled:

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b

    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    The problem is that the structure is being prematurely freed due to a
    reference count imbalance. In the following case, mbind(addr, len) should
    replace the memory policies of both vma1 and vma2, and thus they come to
    share the same mempolicy, and the new mempolicy will have the
    MPOL_F_SHARED flag.

    +-------------------+-------------------+
    |       vma1        |    vma2(shmem)    |
    +-------------------+-------------------+
    |                                       |
    addr                                addr+len

    alloc_pages_vma() uses the get_vma_policy() and mpol_cond_put() pair for
    maintaining the mempolicy reference count. The current rule is that
    get_vma_policy() only increments the refcount for a shmem VMA and
    mpol_cond_put() only decrements the refcount if the policy has
    MPOL_F_SHARED.

    In the above case, vma1 is not a shmem vma, yet vma->policy has
    MPOL_F_SHARED! The reference count will be decreased even though it was
    not increased whenever alloc_page_vma() is called. This has been broken
    since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting]
    in 2008.

    There is another serious bug with the sharing of memory policies.
    Currently, the mempolicy rebind logic (called from cpuset rebinding)
    ignores the refcount of the mempolicy and overrides it forcibly. Thus,
    any mempolicy sharing may cause mempolicy corruption. The bug was
    introduced by commit [68860ec1: cpusets: automatic numa mempolicy
    rebinding].

    Ideally, the shared policy handling would be rewritten to either
    properly handle COW of the policy structures or at least reference count
    MPOL_F_SHARED based exclusively on information within the policy.
    However, this patch takes the easier approach of disabling any policy
    sharing between VMAs. Each new range allocated with sp_alloc will
    allocate a new policy, set the reference count to 1 and drop the
    reference count of the old policy. This increases the memory footprint
    but is not expected to be a major problem as mbind() is unlikely to be
    used for fine-grained ranges. It is also inefficient because it means
    we allocate a new policy even in cases where mbind_range() could use the
    new_policy passed to it. However, it is more straightforward, and the
    change should be invisible to the user.

    [mgorman@suse.de: Edited changelog]
    Reported-by: Dave Jones
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit 05f144a0d5c2 ("mm: mempolicy: Let vma_merge and vma_split handle
    vma->vm_policy linkages") removed the vma->vm_policy update code, but
    that is the purpose of mbind_range(). Now mbind_range() is virtually a
    no-op, and while it does not allow memory corruption, it is not the
    right fix. This patch is a revert.

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • While compaction is migrating pages to free up large contiguous blocks
    for allocation it races with other allocation requests that may steal
    these blocks or break them up. This patch alters direct compaction to
    capture a suitable free page as soon as it becomes available to reduce
    this race. It uses similar logic to split_free_page() to ensure that
    watermarks are still obeyed.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If allocation fails after compaction then compaction may be deferred for
    a number of allocation attempts. If there are subsequent failures,
    compact_defer_shift is increased to defer for longer periods. This
    patch uses that information to scale the number of pages reclaimed with
    compact_defer_shift until allocations succeed again. The rationale is
    that reclaiming the normal number of pages still allowed compaction to
    fail, and its success depends on the number of free pages. If it is
    failing, reclaim more pages until it succeeds again.

    Note that this is not implying that VM reclaim is not reclaiming enough
    pages or that its logic is broken. try_to_free_pages() always asks for
    SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
    what it does. Direct reclaim stops normally with this check.

        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                goto out;

    should_continue_reclaim() delays when that check is made until a minimum
    number of pages for reclaim/compaction has been reclaimed. It is
    possible that this patch could instead set nr_to_reclaim in
    try_to_free_pages() and drive it from there, but that behaves
    differently, and not necessarily for the better. If driven from
    do_try_to_free_pages(), it is also possible that priorities will rise.

    When they reach DEF_PRIORITY-2, it will also start stalling and setting
    pages for immediate reclaim, which is more disruptive than desirable in
    this case. That is a more wide-reaching change that could cause
    another regression related to THP requests causing interactive jitter.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allocation success rates have been far lower since 3.4 due to commit
    fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled").
    This commit was introduced for good reasons and it was known in advance
    that the success rates would suffer but it was justified on the grounds
    that the high allocation success rates were achieved by aggressive
    reclaim. Success rates are expected to suffer even more in 3.6 due to
    commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") which testing has shown to severely reduce allocation success
    rates under load - to 0% in one case.

    This series aims to improve the allocation success rates without
    regressing the benefits of commit fe2c2a106663. The series is based on
    latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
    away.

    Patch 1 updates a stale comment seeing as I was in the general area.

    Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
    of recent failures.

    Patch 3 captures suitable high-order pages freed by compaction to reduce
    races with parallel allocation requests.

    Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
    start off where it left] to enable compaction again

    Patch 5 identifies when compaction is taking too long due to contention
    and aborts.

    STRESS-HIGHALLOC
                      3.6-rc1-akpm      full-series
    Pass 1           36.00 ( 0.00%)    51.00 (15.00%)
    Pass 2           42.00 ( 0.00%)    63.00 (21.00%)
    while Rested     86.00 ( 0.00%)    86.00 ( 0.00%)

    From

    http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html

    I know that the allocation success rate in 3.3.6 was 78% in comparison
    to 36% in the current akpm tree. With the full series applied, the
    success rates are up to around 51%, with some variability in the results.
    This is not as high a success rate, but it does not reclaim excessively,
    which is a key point.

    MMTests Statistics: vmstat
                               3.6-rc1-akpm    full-series
    Page Ins                        3050912        3078892
    Page Outs                       8033528        8039096
    Swap Ins                              0              0
    Swap Outs                             0              0

    Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
    there were 71881 pages swapped out.

    Direct pages scanned              70942         122976
    Kswapd pages scanned            1366300        1520122
    Kswapd pages reclaimed          1366214        1484629
    Direct pages reclaimed            70936         105716
    Kswapd efficiency                   99%            97%
    Kswapd velocity                1072.550       1182.615
    Direct efficiency                   99%            85%
    Direct velocity                  55.690         95.672

    The kswapd velocity changes very little, as expected. kswapd velocity is
    around the 1000 pages/sec mark, whereas in kernel 3.3.6 with the high
    allocation success rates it was 8140 pages/second. Direct velocity is
    higher as a result of patch 2 of the series, but this is expected and is
    acceptable. The direct reclaim and kswapd velocities change very little.

    If these get accepted for merging then there is a difficulty in how they
    should be handled. 7db8889a ("mm: have order > 0 compaction start off
    where it left") is broken but it is already in 3.6-rc1 and needs to be
    fixed. However, if just patch 4 from this series is applied then Jim
    Schutt's workload is known to break again as his workload also requires
    patch 5. While it would be preferred to have all these patches in 3.6 to
    improve compaction in general, it would at least be acceptable if just
    patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
    compaction completely. On the face of it, that would force the
    __GFP_NO_KSWAPD patches to be merged at the same time, but I can do a
    version of this series with the __GFP_NO_KSWAPD change reverted and then
    rebase it on top of this series. That might be best overall, because I
    note that the __GFP_NO_KSWAPD patch should have removed
    deferred_compaction from page_alloc.c but didn't, and fixing that causes
    collisions with this series.

    This patch:

    The comment about order applied when the check was order >
    PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
    use compaction for all allocation orders"). Fixing the comment while I'm
    in the general area.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • People get confused by find_vma_prepare(), because it doesn't care what
    it returns in its output args when its callers won't be interested.

    Clarify by passing in end-of-range address too, and returning failure if
    any existing vma overlaps the new range: instead of returning an ambiguous
    vma which most callers then must check. find_vma_links() is a clearer
    name.

    This does revert 2.6.27's dfe195fb79e88 ("mm: fix uninitialized variables
    for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of
    those releases too eager to shout about uninitialized variables: only
    copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences.

    [hughd@google.com: fix warning, remove BUG()]
    Signed-off-by: Hugh Dickins
    Cc: Benny Halevy
    Acked-by: Hillf Danton
    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When writing a new file with a 2048-byte buffer, such as write(fd, buffer,
    2048), it will call generic_perform_write() twice for every page:

    write_begin
    mark_page_accessed(page)
    write_end

    write_begin
    mark_page_accessed(page)
    write_end

    Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be
    added to the active_list even though they have been accessed twice,
    because they are not PageLRU(page). But when the 14th page comes, all
    pages in the lru-pvecs will be moved to the inactive_list (by
    __lru_cache_add()) in the first write_begin(); now the 14th page *is*
    PageLRU(page). And after the second write_end(), only the 14th page
    will be in the active_list.

    In a Hadoop environment, we do run into this situation: after writing a
    file, we find that only the 14th, 28th, 42nd... pages are in the
    active_list and the others are in the inactive_list. Now kswapd works,
    shrinks the inactive_list, and the file only has the 14th, 28th... pages
    in memory; the readahead request size is broken down to only 52k (13*4k),
    and the system's performance falls dramatically.

    This problem can also be reproduced with the steps below (the machine
    has 8G memory):

    1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
    2. cat another 7.5G file to /dev/null
    3. vmtouch -m 1G -v /test/file.out, it will show:

    /test/file.out
    [oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144

    the 'o' means some pages are in memory while others are not.

    The solution for this problem is simple: the 14th page should be added to
    lru_add_pvecs before mark_page_accessed(), just like the other pages.

    [akpm@linux-foundation.org: tweak comment]
    [akpm@linux-foundation.org: grab better comment from the v3 patch]
    Signed-off-by: Robin Dong
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Reviewed-by: Johannes Weiner
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Dong
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off the
    VMA; it has since lost its original meaning but still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. It seems
    nobody cares about it; it is not exported to userspace directly, it only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Rename VM_NODUMP to VM_DONTDUMP: this name matches other negative flags:
    VM_DONTEXPAND, VM_DONTCOPY. Currently this flag is used only by
    sys_madvise. The next patch will use it to replace the outdated flag
    VM_RESERVED.

    Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL
    (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently the kernel sets mm->exe_file during sys_execve() and then tracks
    the number of vmas with the VM_EXECUTABLE flag in mm->num_exe_file_vmas;
    as soon as this counter drops to zero, the kernel resets mm->exe_file to
    NULL. Plus, it resets mm->exe_file at the last mmput(), when mm->mm_users
    drops to zero.

    A VMA with the VM_EXECUTABLE flag appears after mapping a file with the
    MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or after
    vma splitting, because sys_mmap ignores this flag. Usually a binfmt
    module sets mm->exe_file and mmaps executable vmas with this file; they
    hold mm->exe_file while the task is running.

    A comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
    where all this stuff was introduced:

    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA. Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs. So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped. This avoids pinning the mounted filesystem.

    exe_file's VMA accounting is hooked into every file mmap/munmap and VMA
    split/merge just to handle a hypothetical case: an mm that has already
    unmapped all of its executable files but is still alive would otherwise
    pin the filesystem and prevent it from being unmounted.

    Nobody seems to depend on this behaviour currently. We can try to remove
    this logic and keep mm->exe_file until the final mmput().

    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via the new sys_prctl(PR_SET_MM_EXE_FILE). Via this syscall a
    task can also change its mm->exe_file and unpin the mountpoint explicitly.
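    The userspace-visible face of mm->exe_file is the /proc/pid/exe symlink
    that this bookkeeping exists to serve; a small illustrative reader (not
    part of the patch):

    ```c
    /* Resolve /proc/self/exe, which the kernel answers from mm->exe_file
     * rather than by walking VMAs for the first executable mapping. */
    #include <stdio.h>
    #include <unistd.h>
    #include <limits.h>

    int main(void)
    {
        char buf[PATH_MAX];
        ssize_t n = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
        if (n < 0) {
            perror("readlink");
            return 1;
        }
        buf[n] = '\0';
        printf("exe: %s\n", buf);
        return 0;
    }
    ```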

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Some security modules and oprofile still use VM_EXECUTABLE for retrieving
    a task's executable file. After this patch they will use mm->exe_file
    directly. mm->exe_file is protected with mm->mmap_sem, so the locking
    stays the same.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Chris Metcalf [arch/tile]
    Acked-by: Tetsuo Handa [tomoyo]
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Acked-by: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Move the actual pte filling for non-linear file mappings into a new
    special VMA operation: ->remap_pages().

    A filesystem must implement this method to get non-linear mapping
    support; if it uses filemap_fault(), it can simply use
    generic_file_remap_pages().

    Device drivers can now also implement this method and obtain non-linear
    VMA support.
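    The wiring described above can be sketched with stand-in types (these are
    local simplifications, not the real kernel headers): a filesystem that
    already uses filemap_fault() just points the new operation at
    generic_file_remap_pages().

    ```c
    /* Simplified model of the ->remap_pages() vm operation. */
    #include <stdio.h>

    struct vm_area_struct;          /* opaque stand-in */

    struct vm_operations_struct {
        int (*fault)(struct vm_area_struct *vma);
        int (*remap_pages)(struct vm_area_struct *vma,
                           unsigned long addr, unsigned long size,
                           unsigned long pgoff);
    };

    /* Stand-ins for the generic helpers named in the changelog. */
    static int filemap_fault(struct vm_area_struct *vma)
    {
        (void)vma;
        return 0;
    }

    static int generic_file_remap_pages(struct vm_area_struct *vma,
                                        unsigned long addr,
                                        unsigned long size,
                                        unsigned long pgoff)
    {
        (void)vma;
        printf("remap addr=%#lx size=%lu pgoff=%lu\n", addr, size, pgoff);
        return 0;
    }

    static const struct vm_operations_struct my_fs_vm_ops = {
        .fault       = filemap_fault,
        .remap_pages = generic_file_remap_pages,
    };

    int main(void)
    {
        /* sys_remap_file_pages() ends up calling through the hook: */
        return my_fs_vm_ops.remap_pages(NULL, 0x1000, 4096, 2);
    }
    ```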

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Merge VM_INSERTPAGE into VM_MIXEDMAP. A VM_MIXEDMAP VMA can mix pure-pfn
    ptes, special ptes and normal ptes.

    Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
    VM_PFNMAP. If a driver populates the whole VMA at mmap() time, it
    probably does not expect page faults.

    This patch removes the special check from vma_wants_writenotify() which
    disables page write tracking for VMAs populated via vm_insert_page().
    The BDI below the mapped file should not use dirty accounting; moreover,
    do_wp_page() can handle this case.

    vm_insert_page() still marks the VMA after its first use. Usually it is
    called from an f_op->mmap() handler under the mm->mmap_sem write-lock, so
    it is able to change vma->vm_flags. The caller must set VM_MIXEDMAP at
    mmap time if it wants to call this function from other places, for
    example from a page-fault handler.
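    The fork-time decision can be modelled with the real flag values (the
    helper itself is local to this sketch, not a kernel function): after the
    merge, VM_MIXEDMAP joins VM_PFNMAP in the eager-copy case of
    copy_page_range().

    ```c
    /* Illustrative flag logic for the fork-copy decision. */
    #include <assert.h>
    #include <stdio.h>

    #define VM_PFNMAP   0x00000400UL
    #define VM_MIXEDMAP 0x10000000UL

    /* copy_page_range() copies page tables eagerly for these VMAs. */
    static int needs_eager_copy(unsigned long vm_flags)
    {
        return (vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) != 0;
    }

    int main(void)
    {
        assert(needs_eager_copy(VM_PFNMAP));
        assert(needs_eager_copy(VM_MIXEDMAP));   /* new after this patch */
        assert(!needs_eager_copy(0));
        puts("mixedmap copied like pfnmap");
        return 0;
    }
    ```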

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Combine several arch-specific vma flags into one.

    before patch:

                0x00000200      0x01000000      0x20000000      0x40000000
    x86         VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
    powerpc     -               -               VM_SAO          -
    parisc      VM_GROWSUP      -               -               -
    ia64        VM_GROWSUP      -               -               -
    nommu       -               VM_MAPPED_COPY  -               -
    others      -               -               -               -

    after patch:

                0x00000200      0x01000000      0x20000000      0x40000000
    x86         -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
    powerpc     -               VM_SAO          -               -
    parisc      -               VM_GROWSUP      -               -
    ia64        -               VM_GROWSUP      -               -
    nommu       -               VM_MAPPED_COPY  -               -
    others      -               VM_ARCH_1       -               -

    And voila! One completely free bit.
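    The consolidation in the tables above can be stated as a tiny check (flag
    values copied from the tables; the per-arch defines are this sketch's own
    shorthand): every architecture reinterprets the single VM_ARCH_1 bit,
    which frees 0x00000200 entirely.

    ```c
    /* All arch-specific VMA flags now share one bit, VM_ARCH_1. */
    #include <assert.h>
    #include <stdio.h>

    #define VM_ARCH_1 0x01000000UL      /* arch-specific meaning */

    #define VM_PAT         VM_ARCH_1    /* x86 */
    #define VM_SAO         VM_ARCH_1    /* powerpc */
    #define VM_GROWSUP     VM_ARCH_1    /* parisc, ia64 */
    #define VM_MAPPED_COPY VM_ARCH_1    /* nommu */

    int main(void)
    {
        /* All arch flags collapse onto the same bit... */
        assert(VM_PAT == VM_SAO && VM_SAO == VM_GROWSUP
               && VM_GROWSUP == VM_MAPPED_COPY);
        /* ...so the old arch user of 0x00000200 is gone and that bit
         * is completely free. */
        printf("VM_ARCH_1 = %#lx\n", VM_ARCH_1);
        return 0;
    }
    ```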

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Replace the generic vma-flag VM_PFN_AT_MMAP with the x86-only VM_PAT.

    We can pass the mapping address from remap_pfn_range() into
    track_pfn_vma_new(), and collect all PAT-related logic together in
    arch/x86/.

    This patch also restores the original frustration-free is_cow_mapping()
    check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73
    ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

    The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
    because that case is already handled by VM_PFNMAP in the VM_NO_THP
    bit-mask.
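    For reference, the restored predicate is small enough to reproduce here as
    a self-contained copy (with the flag values inlined): a mapping is
    copy-on-write when it may be written but is not shared.

    ```c
    /* The is_cow_mapping() check restored by this patch. */
    #include <assert.h>
    #include <stdio.h>

    #define VM_SHARED   0x00000008UL
    #define VM_MAYWRITE 0x00000020UL

    static int is_cow_mapping(unsigned long flags)
    {
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }

    int main(void)
    {
        assert(is_cow_mapping(VM_MAYWRITE));               /* private, writable */
        assert(!is_cow_mapping(VM_SHARED | VM_MAYWRITE));  /* shared mapping */
        assert(!is_cow_mapping(0));                        /* read-only private */
        puts("is_cow_mapping ok");
        return 0;
    }
    ```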

    [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov