09 Oct, 2012

40 commits

  • The thp page table pre-allocation code currently assumes that pgtable_t is
    of type "struct page *". This may not be true for all architectures, so
    this patch removes that assumption by replacing the functions
    prepare_pmd_huge_pte() and get_pmd_huge_pte() with two new functions that
    can be defined in an architecture-specific way.

    It also removes two VM_BUG_ON checks for page_count() and page_mapcount()
    operating on a pgtable_t. Apart from the VM_BUG_ON removal, there will be
    no functional change introduced by this patch.

    Signed-off-by: Gerald Schaefer
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Cleanup patch in preparation for transparent hugepage support on s390.
    Adding new architectures to the TRANSPARENT_HUGEPAGE config option can
    make the "depends" line rather ugly, like "depends on (X86 || (S390 &&
    64BIT)) && MMU".

    This patch adds a HAVE_ARCH_TRANSPARENT_HUGEPAGE instead. x86 already has
    MMU "def_bool y", so the MMU check is superfluous there and
    HAVE_ARCH_TRANSPARENT_HUGEPAGE can be selected in arch/x86/Kconfig.
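
    As a sketch, the resulting Kconfig arrangement might look like this (only
    the option names come from the text above; placement and surrounding
    structure are illustrative):

```kconfig
# In mm/Kconfig (illustrative placement):
config HAVE_ARCH_TRANSPARENT_HUGEPAGE
	bool

config TRANSPARENT_HUGEPAGE
	bool "Transparent Hugepage Support"
	depends on HAVE_ARCH_TRANSPARENT_HUGEPAGE
	# ...rest of the option unchanged...

# In arch/x86/Kconfig, where MMU is already "def_bool y":
config X86
	def_bool y
	select HAVE_ARCH_TRANSPARENT_HUGEPAGE
	# ...other selects...
```

    Each new architecture then adds one "select" line instead of growing the
    "depends" expression.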

    Signed-off-by: Gerald Schaefer
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Andrea Arcangeli
    Cc: Andi Kleen
    Cc: Hugh Dickins
    Cc: Hillf Danton
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gerald Schaefer
     
  • Fix an anon_vma locking issue in the following situation:

    - vma has no anon_vma
    - next has an anon_vma
    - vma is being shrunk / next is being expanded, due to an mprotect call

    We need to take next's anon_vma lock to avoid races with rmap users (such
    as page migration) while next is being expanded.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Since it is called in start_khugepaged

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Use khugepaged_enabled to see whether thp is enabled

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Merge khugepaged_loop into khugepaged

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • They are used to abstract the difference between NUMA enabled and NUMA
    disabled to make the code more readable

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • If NUMA is enabled, we can release the page in the page pre-allocation
    operation, which reduces the CONFIG_NUMA-dependent code

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • There are two pre-alloc operations in these two functions; the difference
    is:
    - in khugepaged_loop, it is allowed to sleep if the page allocation fails
    - in khugepaged_do_scan, it exits immediately if the page allocation fails

    Actually, in khugepaged_do_scan, we can allow the pre-alloc to sleep on
    the first failure, and then the operation in khugepaged_loop can be removed

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • If NUMA is disabled, hpage is used for page pre-allocation, so there are
    two cases for hpage:

    - it is non-NULL, meaning the page has not yet been consumed
    - otherwise, the page has been consumed

    If NUMA is enabled, hpage is just used as an alloc-failure indicator, not
    a real page; NULL means no failure has been triggered.

    So, we can release the page only if !IS_ERR_OR_NULL

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Add a kthread_should_stop() check to the conditions used to wake up on
    khugepaged_wait; then kthread_stop() is enough to make the thread exit

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Now that khugepaged creation and cancellation are completely serialized
    under the protection of khugepaged_mutex, it is impossible for multiple
    khugepaged entities to be running

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Currently, khugepaged_mutex is used in a complex and hard-to-understand
    way; actually, it is just used to serialize start_khugepaged and
    khugepaged, for these reasons:

    - khugepaged_thread is shared between them
    - the thp disable path (echo never > transparent_hugepage/enabled) is
    nonblocking, so we need to protect khugepaged_thread to get a stable
    running state

    These can be avoided by:

    - using the lock to serialize thread creation and cancellation
    - making the thp disable path unable to finish until the thread exits

    Then khugepaged_thread is fully controlled by start_khugepaged, and
    khugepaged is happy without the lock

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • The check is unnecessary since, if mm_slot_cache or mm_slots_hash
    initialization fails, no sysfs interface will be created

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • THP_COLLAPSE_ALLOC is double-counted if NUMA is disabled, since it has
    already been counted in khugepaged_alloc_hugepage

    Signed-off-by: Xiao Guangrong
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Make sure the #endif that terminates the standard #ifndef / #define /
    #endif construct gets labeled, and gets positioned at the end of the file
    as is normally the case.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • The core page allocator ensures that page flags are zeroed when freeing
    pages via free_pages_check. A number of architectures (ARM, PPC, MIPS)
    rely on this property to treat new pages as dirty with respect to the data
    cache and perform the appropriate flushing before mapping the pages into
    userspace.

    This can lead to cache synchronisation problems when using hugepages,
    since the allocator keeps its own pool of pages above the usual page
    allocator and does not reset the page flags when freeing a page into the
    pool.

    This patch adds a new architecture hook, arch_clear_hugepage_flags, so
    that architectures which rely on the page flags being in a particular
    state for fresh allocations can adjust the flags accordingly when a page
    is freed into the pool.

    Signed-off-by: Will Deacon
    Cc: Michal Hocko
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Will Deacon
     
  • The deprecated /proc/<pid>/oom_adj is scheduled for removal this month.

    Signed-off-by: Davidlohr Bueso
    Acked-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Fix the return value when the kswapd kernel thread fails to be created.
    Also, raise the error message's priority to KERN_ERR.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • While registering an MMU notifier, a new mmu_notifier_mm instance will be
    allocated and later freed if the current mm_struct's mmu_notifier_mm has
    already been initialized. That causes some overhead. This patch
    eliminates it.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • With an RCU based mmu_notifier implementation, any callout to
    mmu_notifier_invalidate_range_{start,end}() or
    mmu_notifier_invalidate_page() would not be allowed to call schedule()
    as that could potentially allow a modification to the mmu_notifier
    structure while it is currently being used.

    Since srcu allocates 4 machine words per instance per cpu, we may end up
    with memory exhaustion if we use one srcu per mm. So all mms share a
    global srcu. Note that during heavy mmu_notifier activity the exit &
    unregister paths might hang for longer periods, but that is tolerable for
    current mmu_notifier clients.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Haggai Eran
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
     
  • There is a bug in set_pte_at_notify(): it always sets the pte to the
    new page before releasing the old page in the secondary MMU. At this
    point the process will access the new page, but the secondary MMU
    still accesses the old page, so memory is inconsistent between them.

    The below scenario shows the bug more clearly:

    at the beginning: *p = 0, and p is write-protected by KSM or shared with
    parent process

    CPU 0 (host)                          CPU 1 (secondary MMU)

    write 1 to p to trigger COW;
    set_pte_at_notify() is called:
    *pte = new_page + W;
    /* the W bit of pte is set */

    *p = 1; /* pte is valid, so no #PF */

                                          the secondary MMU reads p, but
                                          gets:
                                          *p == 0;

                                          /*
                                           * !!!!!!
                                           * the host has already set p to 1,
                                           * but the secondary MMU still gets
                                           * the old value 0
                                           */

    call mmu_notifier_change_pte() to
    release the old page in the
    secondary MMU

    We can fix it by releasing the old page first, then setting the pte to
    the new page.

    Note: the new page will be used in the secondary MMU before it is
    mapped into the page table of the process, but this is safe because it
    is protected by the page table lock; there is no race to change the pte.

    [akpm@linux-foundation.org: add comment from Andrea]
    Signed-off-by: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     
  • Commit cc9a6c877661 ("cpuset: mm: reduce large amounts of memory barrier
    related damage v3") introduced a potential memory corruption.
    shmem_alloc_page() uses a pseudo vma and it has one significant unique
    combination, vma->vm_ops=NULL and vma->policy->flags & MPOL_F_SHARED.

    get_vma_policy() does NOT increase a policy ref when vma->vm_ops=NULL
    and mpol_cond_put() DOES decrease a policy ref when a policy has
    MPOL_F_SHARED. Therefore, when a cpuset update race occurs,
    alloc_pages_vma() falls in 'goto retry_cpuset' path, decrements the
    reference count and frees the policy prematurely.

    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When shared_policy_replace() fails to allocate, new->policy is not freed
    correctly by mpol_set_shared_policy(). The problem is that the shared
    mempolicy code directly calls kmem_cache_free() in multiple places, where
    it is easy to make a mistake.

    This patch creates an sp_free wrapper function and uses it. The bug was
    introduced pre-git age (IOW, before 2.6.12-rc2).

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shared_policy_replace()'s use of sp_alloc() is unsafe: 1) sp_node cannot
    be dereferenced if sp->lock is not held, and 2) another thread can modify
    sp_node between the spin_unlock for allocating a new sp node and the next
    spin_lock. The bug was introduced before 2.6.12-rc2.

    Kosaki's original patch for this problem was to allocate an sp node and
    policy within shared_policy_replace and initialise them when the lock is
    reacquired. I was not keen on this approach because it partially
    duplicates sp_alloc(). As the paths where sp->lock is taken are not that
    performance critical, this patch converts sp->lock to sp->mutex so it can
    sleep when calling sp_alloc().

    [kosaki.motohiro@jp.fujitsu.com: Original patch]
    Signed-off-by: Mel Gorman
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Dave Jones' system call fuzz testing tool "trinity" triggered the
    following bug with slab debugging enabled:

    =============================================================================
    BUG numa_policy (Not tainted): Poison overwritten
    -----------------------------------------------------------------------------

    INFO: 0xffff880146498250-0xffff880146498250. First byte 0x6a instead of 0x6b
    INFO: Allocated in mpol_new+0xa3/0x140 age=46310 cpu=6 pid=32154
    __slab_alloc+0x3d3/0x445
    kmem_cache_alloc+0x29d/0x2b0
    mpol_new+0xa3/0x140
    sys_mbind+0x142/0x620
    system_call_fastpath+0x16/0x1b

    INFO: Freed in __mpol_put+0x27/0x30 age=46268 cpu=6 pid=32154
    __slab_free+0x2e/0x1de
    kmem_cache_free+0x25a/0x260
    __mpol_put+0x27/0x30
    remove_vma+0x68/0x90
    exit_mmap+0x118/0x140
    mmput+0x73/0x110
    exit_mm+0x108/0x130
    do_exit+0x162/0xb90
    do_group_exit+0x4f/0xc0
    sys_exit_group+0x17/0x20
    system_call_fastpath+0x16/0x1b

    INFO: Slab 0xffffea0005192600 objects=27 used=27 fp=0x (null) flags=0x20000000004080
    INFO: Object 0xffff880146498250 @offset=592 fp=0xffff88014649b9d0

    The problem is that the structure is being prematurely freed due to a
    reference count imbalance. In the following case, mbind(addr, len) should
    replace the memory policies of both vma1 and vma2, and thus they come to
    share the same mempolicy, and the new mempolicy will have the
    MPOL_F_SHARED flag.

    +-------------------+-------------------+
    |       vma1        |    vma2(shmem)    |
    +-------------------+-------------------+
    |                                       |
    addr                                addr+len

    alloc_pages_vma() uses the get_vma_policy() and mpol_cond_put() pair for
    maintaining the mempolicy reference count. The current rule is that
    get_vma_policy() only increments the refcount for a shmem VMA and
    mpol_cond_put() only decrements the refcount if the policy has
    MPOL_F_SHARED.

    In the above case, vma1 is not a shmem vma, yet vma->policy has
    MPOL_F_SHARED! The reference count will be decreased even though it was
    not increased whenever alloc_page_vma() is called. This has been broken
    since commit [52cd3b07: mempolicy: rework mempolicy Reference Counting]
    in 2008.

    There is another serious bug with the sharing of memory policies.
    Currently, the mempolicy rebind logic (called from cpuset rebinding)
    ignores the refcount of the mempolicy and overrides it forcibly. Thus,
    any mempolicy sharing may cause mempolicy corruption. The bug was
    introduced by commit [68860ec1: cpusets: automatic numa mempolicy
    rebinding].

    Ideally, the shared policy handling would be rewritten to either
    properly handle COW of the policy structures or at least reference count
    MPOL_F_SHARED based exclusively on information within the policy.
    However, this patch takes the easier approach of disabling any policy
    sharing between VMAs. Each new range allocated with sp_alloc will
    allocate a new policy, set the reference count to 1 and drop the
    reference count of the old policy. This increases the memory footprint
    but is not expected to be a major problem as mbind() is unlikely to be
    used for fine-grained ranges. It is also inefficient because it means
    we allocate a new policy even in cases where mbind_range() could use the
    new_policy passed to it. However, it is more straightforward, and the
    change should be invisible to the user.

    [mgorman@suse.de: Edited changelog]
    Reported-by: Dave Jones
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit 05f144a0d5c2 ("mm: mempolicy: Let vma_merge and vma_split handle
    vma->vm_policy linkages") removed the vma->vm_policy update code, but
    that is the purpose of mbind_range(). Now mbind_range() is virtually a
    no-op, and while it does not allow memory corruption, it is not the
    right fix. This patch is a revert.

    [mgorman@suse.de: Edited changelog]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Cc: Christoph Lameter
    Cc: Josh Boyer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • While compaction is migrating pages to free up large contiguous blocks
    for allocation it races with other allocation requests that may steal
    these blocks or break them up. This patch alters direct compaction to
    capture a suitable free page as soon as it becomes available to reduce
    this race. It uses similar logic to split_free_page() to ensure that
    watermarks are still obeyed.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If allocation fails after compaction then compaction may be deferred for
    a number of allocation attempts. If there are subsequent failures,
    compact_defer_shift is increased to defer for longer periods. This
    patch uses that information to scale the number of pages reclaimed with
    compact_defer_shift until allocations succeed again. The rationale is
    that reclaiming the normal number of pages still allowed compaction to
    fail, and its success depends on the number of free pages. If it is
    failing, reclaim more pages until it succeeds again.

    Note that this is not implying that VM reclaim is not reclaiming enough
    pages or that its logic is broken. try_to_free_pages() always asks for
    SWAP_CLUSTER_MAX pages to be reclaimed regardless of order and that is
    what it does. Direct reclaim stops normally with this check.

        if (sc->nr_reclaimed >= sc->nr_to_reclaim)
                goto out;

    should_continue_reclaim() delays when that check is made until a minimum
    number of pages for reclaim/compaction has been reclaimed. It is
    possible that this patch could instead set nr_to_reclaim in
    try_to_free_pages() and drive it from there, but that behaves
    differently, and not necessarily for the better. If driven from
    do_try_to_free_pages(), it is also possible that priorities will rise.

    When they reach DEF_PRIORITY-2, it will also start stalling and setting
    pages for immediate reclaim, which is more disruptive than desirable in
    this case. That is a more wide-reaching change that could cause
    another regression related to THP requests causing interactive jitter.

    [akpm@linux-foundation.org: fix build]
    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Allocation success rates have been far lower since 3.4 due to commit
    fe2c2a106663 ("vmscan: reclaim at order 0 when compaction is enabled").
    This commit was introduced for good reasons and it was known in advance
    that the success rates would suffer but it was justified on the grounds
    that the high allocation success rates were achieved by aggressive
    reclaim. Success rates are expected to suffer even more in 3.6 due to
    commit 7db8889ab05b ("mm: have order > 0 compaction start off where it
    left") which testing has shown to severely reduce allocation success
    rates under load - to 0% in one case.

    This series aims to improve the allocation success rates without
    regressing the benefits of commit fe2c2a106663. The series is based on
    latest mmotm and takes into account the __GFP_NO_KSWAPD flag is going
    away.

    Patch 1 updates a stale comment seeing as I was in the general area.

    Patch 2 updates reclaim/compaction to reclaim pages scaled on the number
    of recent failures.

    Patch 3 captures suitable high-order pages freed by compaction to reduce
    races with parallel allocation requests.

    Patch 4 fixes the upstream commit [7db8889a: mm: have order > 0 compaction
    start off where it left] to enable compaction again

    Patch 5 identifies when compaction is taking too long due to contention
    and aborts.

    STRESS-HIGHALLOC
                      3.6-rc1-akpm      full-series
    Pass 1           36.00 ( 0.00%)    51.00 (15.00%)
    Pass 2           42.00 ( 0.00%)    63.00 (21.00%)
    while Rested     86.00 ( 0.00%)    86.00 ( 0.00%)

    From

    http://www.csn.ul.ie/~mel/postings/mmtests-20120424/global-dhp__stress-highalloc-performance-ext3/hydra/comparison.html

    I know that the allocation success rate in 3.3.6 was 78% in comparison
    to 36% in the current akpm tree. With the full series applied, the
    success rates are up to around 51%, with some variability in the results.
    This is not as high a success rate, but it does not reclaim excessively,
    which is a key point.

    MMTests Statistics: vmstat
                               3.6-rc1-akpm    full-series
    Page Ins                        3050912        3078892
    Page Outs                       8033528        8039096
    Swap Ins                              0              0
    Swap Outs                             0              0

    Note that swap in/out rates remain at 0. In 3.3.6 with 78% success rates
    there were 71881 pages swapped out.

    Direct pages scanned              70942         122976
    Kswapd pages scanned            1366300        1520122
    Kswapd pages reclaimed          1366214        1484629
    Direct pages reclaimed            70936         105716
    Kswapd efficiency                   99%            97%
    Kswapd velocity                1072.550       1182.615
    Direct efficiency                   99%            85%
    Direct velocity                  55.690         95.672

    The kswapd velocity changes very little, as expected. kswapd velocity is
    around the 1000 pages/sec mark, whereas in kernel 3.3.6 with the high
    allocation success rates it was 8140 pages/second. Direct velocity is
    higher as a result of patch 2 of the series, but this is expected and is
    acceptable. The direct reclaim and kswapd velocities change very little.

    If these get accepted for merging then there is a difficulty in how they
    should be handled. 7db8889a ("mm: have order > 0 compaction start off
    where it left") is broken but it is already in 3.6-rc1 and needs to be
    fixed. However, if just patch 4 from this series is applied then Jim
    Schutt's workload is known to break again as his workload also requires
    patch 5. While it would be preferred to have all these patches in 3.6 to
    improve compaction in general, it would at least be acceptable if just
    patches 4 and 5 were merged to 3.6 to fix a known problem without breaking
    compaction completely. On the face of it, that would force the
    __GFP_NO_KSWAPD patches to be merged at the same time, but I can do a
    version of this series with the __GFP_NO_KSWAPD change reverted and then
    rebase it on top of this series. That might be best overall, because I
    note that the __GFP_NO_KSWAPD patch should have removed
    deferred_compaction from page_alloc.c but didn't, and fixing that causes
    collisions with this series.

    This patch:

    The comment about order applied when the check was order >
    PAGE_ALLOC_COSTLY_ORDER which has not been the case since c5a73c3d ("thp:
    use compaction for all allocation orders"). Fixing the comment while I'm
    in the general area.

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • People get confused by find_vma_prepare(), because it doesn't care what
    it returns in its output args when its callers won't be interested.

    Clarify by passing in end-of-range address too, and returning failure if
    any existing vma overlaps the new range: instead of returning an ambiguous
    vma which most callers then must check. find_vma_links() is a clearer
    name.

    This does revert 2.6.27's dfe195fb79e88 ("mm: fix uninitialized variables
    for find_vma_prepare callers"), but it looks like gcc 4.3.0 was one of
    those releases too eager to shout about uninitialized variables: only
    copy_vma() warns with 4.5.1 and 4.7.1, which a BUG on error silences.

    [hughd@google.com: fix warning, remove BUG()]
    Signed-off-by: Hugh Dickins
    Cc: Benny Halevy
    Acked-by: Hillf Danton
    Signed-off-by: Hugh Dickins
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • When writing a new file with a 2048-byte buffer, such as write(fd, buffer,
    2048), it will call generic_perform_write() twice for every page:

    write_begin
    mark_page_accessed(page)
    write_end

    write_begin
    mark_page_accessed(page)
    write_end

    Pages 1-13 will be added to lru-pvecs in write_begin() and will *NOT* be
    added to the active_list even though they have been accessed twice,
    because they are not PageLRU(page). But when the 14th page comes, all
    pages in the lru-pvecs will be moved to the inactive_list (by
    __lru_cache_add()) in the first write_begin(); now the 14th page *is*
    PageLRU(page). And after the second write_end(), only the 14th page
    will be in the active_list.

    In a Hadoop environment, we do run into this situation: after writing a
    file, we find that only the 14th, 28th, 42nd... pages are in the
    active_list and the others are in the inactive_list. Now kswapd works,
    shrinks the inactive_list, and the file only has the 14th, 28th... pages
    in memory; the readahead request size is broken down to only 52k (13*4k),
    and the system's performance falls dramatically.

    This problem can also be reproduced with the steps below (the machine
    has 8G memory):

    1. dd if=/dev/zero of=/test/file.out bs=1024 count=1048576
    2. cat another 7.5G file to /dev/null
    3. vmtouch -m 1G -v /test/file.out, it will show:

    /test/file.out
    [oooooooooooooooooooOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOOO] 187847/262144

    the 'o' means some pages are in memory while others are not.

    The solution for this problem is simple: the 14th page should be added to
    lru_add_pvecs before mark_page_accessed(), just like the other pages.

    [akpm@linux-foundation.org: tweak comment]
    [akpm@linux-foundation.org: grab better comment from the v3 patch]
    Signed-off-by: Robin Dong
    Reviewed-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Reviewed-by: Johannes Weiner
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Dong
     
  • A long time ago, in v2.4, VM_RESERVED kept the swapout process off the
    VMA; it has since lost its original meaning but still has some effects:

     | effect                 | alternative flags
    -+------------------------+---------------------------------------------
    1| account as reserved_vm | VM_IO
    2| skip in core dump      | VM_IO, VM_DONTDUMP
    3| do not merge or expand | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP
    4| do not mlock           | VM_IO, VM_DONTEXPAND, VM_HUGETLB, VM_PFNMAP

    This patch removes the reserved_vm counter from mm_struct. It seems
    nobody cares about it; it is not exported to userspace directly, it only
    reduces the total_vm shown in proc.

    Thus VM_RESERVED can be replaced with VM_IO or pair VM_DONTEXPAND | VM_DONTDUMP.

    remap_pfn_range() and io_remap_pfn_range() set VM_IO|VM_DONTEXPAND|VM_DONTDUMP.
    remap_vmalloc_range() sets VM_DONTEXPAND | VM_DONTDUMP.

    [akpm@linux-foundation.org: drivers/vfio/pci/vfio_pci.c fixup]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Rename VM_NODUMP to VM_DONTDUMP: this name matches other negative flags:
    VM_DONTEXPAND, VM_DONTCOPY. Currently this flag is used only by
    sys_madvise. The next patch will use it to replace the outdated flag
    VM_RESERVED.

    Also forbid madvise(MADV_DODUMP) for special kernel mappings VM_SPECIAL
    (VM_IO | VM_DONTEXPAND | VM_RESERVED | VM_PFNMAP)

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Currently the kernel sets mm->exe_file during sys_execve() and then tracks
    the number of vmas with the VM_EXECUTABLE flag in mm->num_exe_file_vmas;
    as soon as this counter drops to zero, the kernel resets mm->exe_file to
    NULL. Plus, it resets mm->exe_file at the last mmput(), when mm->mm_users
    drops to zero.

    A VMA with the VM_EXECUTABLE flag appears after mapping a file with the
    MAP_EXECUTABLE flag; such vmas can appear only at sys_execve() or after
    vma splitting, because sys_mmap ignores this flag. Usually a binfmt
    module sets mm->exe_file and mmaps executable vmas with this file; they
    hold mm->exe_file while the task is running.

    A comment from v2.6.25-6245-g925d1c4 ("procfs task exe symlink"),
    where all this stuff was introduced:

    > The kernel implements readlink of /proc/pid/exe by getting the file from
    > the first executable VMA. Then the path to the file is reconstructed and
    > reported as the result.
    >
    > Because of the VMA walk the code is slightly different on nommu systems.
    > This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    > walking the VMAs to find the first executable file-backed VMA we store a
    > reference to the exec'd file in the mm_struct.
    >
    > That reference would prevent the filesystem holding the executable file
    > from being unmounted even after unmapping the VMAs. So we track the number
    > of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    > unmapped. This avoids pinning the mounted filesystem.

    exe_file's VMA accounting is hooked into every file mmap/munmap and VMA
    split/merge just to handle a hypothetical case: an mm that has already
    unmapped all of its executable files but is still alive would otherwise
    pin the filesystem and prevent it from being unmounted.

    Nobody seems to depend on this behaviour currently. We can try to remove
    this logic and keep mm->exe_file until the final mmput().

    mm->exe_file is still protected with mm->mmap_sem, because we want to
    change it via the new sys_prctl(PR_SET_MM_EXE_FILE). Via this syscall a
    task can also change its mm->exe_file and unpin the mountpoint explicitly.
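    The userspace-visible face of mm->exe_file is the /proc/pid/exe symlink
    that this bookkeeping exists to serve; a small illustrative reader (not
    part of the patch):

    ```c
    /* Resolve /proc/self/exe, which the kernel answers from mm->exe_file
     * rather than by walking VMAs for the first executable mapping. */
    #include <stdio.h>
    #include <unistd.h>
    #include <limits.h>

    int main(void)
    {
        char buf[PATH_MAX];
        ssize_t n = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
        if (n < 0) {
            perror("readlink");
            return 1;
        }
        buf[n] = '\0';
        printf("exe: %s\n", buf);
        return 0;
    }
    ```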

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Some security modules and oprofile still use VM_EXECUTABLE for retrieving
    a task's executable file. After this patch they will use mm->exe_file
    directly. mm->exe_file is protected with mm->mmap_sem, so the locking
    stays the same.

    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Chris Metcalf [arch/tile]
    Acked-by: Tetsuo Handa [tomoyo]
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Acked-by: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Move the actual pte filling for non-linear file mappings into a new
    special VMA operation: ->remap_pages().

    A filesystem must implement this method to get non-linear mapping
    support; if it uses filemap_fault(), it can simply use
    generic_file_remap_pages().

    Device drivers can now also implement this method and obtain non-linear
    VMA support.
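    The wiring described above can be sketched with stand-in types (these are
    local simplifications, not the real kernel headers): a filesystem that
    already uses filemap_fault() just points the new operation at
    generic_file_remap_pages().

    ```c
    /* Simplified model of the ->remap_pages() vm operation. */
    #include <stdio.h>

    struct vm_area_struct;          /* opaque stand-in */

    struct vm_operations_struct {
        int (*fault)(struct vm_area_struct *vma);
        int (*remap_pages)(struct vm_area_struct *vma,
                           unsigned long addr, unsigned long size,
                           unsigned long pgoff);
    };

    /* Stand-ins for the generic helpers named in the changelog. */
    static int filemap_fault(struct vm_area_struct *vma)
    {
        (void)vma;
        return 0;
    }

    static int generic_file_remap_pages(struct vm_area_struct *vma,
                                        unsigned long addr,
                                        unsigned long size,
                                        unsigned long pgoff)
    {
        (void)vma;
        printf("remap addr=%#lx size=%lu pgoff=%lu\n", addr, size, pgoff);
        return 0;
    }

    static const struct vm_operations_struct my_fs_vm_ops = {
        .fault       = filemap_fault,
        .remap_pages = generic_file_remap_pages,
    };

    int main(void)
    {
        /* sys_remap_file_pages() ends up calling through the hook: */
        return my_fs_vm_ops.remap_pages(NULL, 0x1000, 4096, 2);
    }
    ```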

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Merge VM_INSERTPAGE into VM_MIXEDMAP. A VM_MIXEDMAP VMA can mix pure-pfn
    ptes, special ptes and normal ptes.

    Now copy_page_range() always copies a VM_MIXEDMAP VMA on fork, like
    VM_PFNMAP. If a driver populates the whole VMA at mmap() time, it
    probably does not expect page faults.

    This patch removes the special check from vma_wants_writenotify() which
    disables page write tracking for VMAs populated via vm_insert_page().
    The BDI below the mapped file should not use dirty accounting; moreover,
    do_wp_page() can handle this case.

    vm_insert_page() still marks the VMA after its first use. Usually it is
    called from an f_op->mmap() handler under the mm->mmap_sem write-lock, so
    it is able to change vma->vm_flags. The caller must set VM_MIXEDMAP at
    mmap time if it wants to call this function from other places, for
    example from a page-fault handler.
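    The fork-time decision can be modelled with the real flag values (the
    helper itself is local to this sketch, not a kernel function): after the
    merge, VM_MIXEDMAP joins VM_PFNMAP in the eager-copy case of
    copy_page_range().

    ```c
    /* Illustrative flag logic for the fork-copy decision. */
    #include <assert.h>
    #include <stdio.h>

    #define VM_PFNMAP   0x00000400UL
    #define VM_MIXEDMAP 0x10000000UL

    /* copy_page_range() copies page tables eagerly for these VMAs. */
    static int needs_eager_copy(unsigned long vm_flags)
    {
        return (vm_flags & (VM_PFNMAP | VM_MIXEDMAP)) != 0;
    }

    int main(void)
    {
        assert(needs_eager_copy(VM_PFNMAP));
        assert(needs_eager_copy(VM_MIXEDMAP));   /* new after this patch */
        assert(!needs_eager_copy(0));
        puts("mixedmap copied like pfnmap");
        return 0;
    }
    ```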

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Combine several arch-specific vma flags into one.

    before patch:

                0x00000200      0x01000000      0x20000000      0x40000000
    x86         VM_NOHUGEPAGE   VM_HUGEPAGE     -               VM_PAT
    powerpc     -               -               VM_SAO          -
    parisc      VM_GROWSUP      -               -               -
    ia64        VM_GROWSUP      -               -               -
    nommu       -               VM_MAPPED_COPY  -               -
    others      -               -               -               -

    after patch:

                0x00000200      0x01000000      0x20000000      0x40000000
    x86         -               VM_PAT          VM_HUGEPAGE     VM_NOHUGEPAGE
    powerpc     -               VM_SAO          -               -
    parisc      -               VM_GROWSUP      -               -
    ia64        -               VM_GROWSUP      -               -
    nommu       -               VM_MAPPED_COPY  -               -
    others      -               VM_ARCH_1       -               -

    And voila! One completely free bit.
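    The consolidation in the tables above can be stated as a tiny check (flag
    values copied from the tables; the per-arch defines are this sketch's own
    shorthand): every architecture reinterprets the single VM_ARCH_1 bit,
    which frees 0x00000200 entirely.

    ```c
    /* All arch-specific VMA flags now share one bit, VM_ARCH_1. */
    #include <assert.h>
    #include <stdio.h>

    #define VM_ARCH_1 0x01000000UL      /* arch-specific meaning */

    #define VM_PAT         VM_ARCH_1    /* x86 */
    #define VM_SAO         VM_ARCH_1    /* powerpc */
    #define VM_GROWSUP     VM_ARCH_1    /* parisc, ia64 */
    #define VM_MAPPED_COPY VM_ARCH_1    /* nommu */

    int main(void)
    {
        /* All arch flags collapse onto the same bit... */
        assert(VM_PAT == VM_SAO && VM_SAO == VM_GROWSUP
               && VM_GROWSUP == VM_MAPPED_COPY);
        /* ...so the old arch user of 0x00000200 is gone and that bit
         * is completely free. */
        printf("VM_ARCH_1 = %#lx\n", VM_ARCH_1);
        return 0;
    }
    ```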

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Replace the generic vma-flag VM_PFN_AT_MMAP with the x86-only VM_PAT.

    We can pass the mapping address from remap_pfn_range() into
    track_pfn_vma_new(), and collect all PAT-related logic together in
    arch/x86/.

    This patch also restores the original frustration-free is_cow_mapping()
    check in remap_pfn_range(), as it was before commit v2.6.28-rc8-88-g3c8bb73
    ("x86: PAT: store vm_pgoff for all linear_over_vma_region mappings - v3").

    The is_linear_pfn_mapping() checks can be removed from mm/huge_memory.c,
    because that case is already handled by VM_PFNMAP in the VM_NO_THP
    bit-mask.
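    For reference, the restored predicate is small enough to reproduce here as
    a self-contained copy (with the flag values inlined): a mapping is
    copy-on-write when it may be written but is not shared.

    ```c
    /* The is_cow_mapping() check restored by this patch. */
    #include <assert.h>
    #include <stdio.h>

    #define VM_SHARED   0x00000008UL
    #define VM_MAYWRITE 0x00000020UL

    static int is_cow_mapping(unsigned long flags)
    {
        return (flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;
    }

    int main(void)
    {
        assert(is_cow_mapping(VM_MAYWRITE));               /* private, writable */
        assert(!is_cow_mapping(VM_SHARED | VM_MAYWRITE));  /* shared mapping */
        assert(!is_cow_mapping(0));                        /* read-only private */
        puts("is_cow_mapping ok");
        return 0;
    }
    ```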

    [suresh.b.siddha@intel.com: Reset the VM_PAT flag as part of untrack_pfn_vma()]
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Suresh Siddha
    Cc: Venkatesh Pallipadi
    Cc: H. Peter Anvin
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: Hugh Dickins
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov