13 Jan, 2012

3 commits

  • Move CMPXCHG_DOUBLE and rename it to HAVE_CMPXCHG_DOUBLE so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Move CMPXCHG_LOCAL and rename it to HAVE_CMPXCHG_LOCAL so architectures
    can simply select the option if it is supported.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • While implementing cmpxchg_double() on s390 I realized that we don't set
    CONFIG_CMPXCHG_LOCAL despite the fact that we have support for it.

    However, setting that option would increase the size of struct page by
    eight bytes on 64 bit, which we certainly do not want. It also doesn't
    make sense that the mere presence of a cpu feature should increase the
    size of struct page.

    Besides that, it looks like the dependency on CMPXCHG_LOCAL is wrong and
    that it should depend on CMPXCHG_DOUBLE instead.

    This patch:

    If an architecture supports CMPXCHG_LOCAL, this shouldn't automatically
    result in a larger struct page when the SLUB allocator is used. Instead,
    introduce a new config option "HAVE_ALIGNED_STRUCT_PAGE" which can be
    selected if a double-word-aligned struct page is required. Also update
    the x86 Kconfig so that it works as before.

    Signed-off-by: Heiko Carstens
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
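
    A minimal sketch of the resulting struct page alignment rule, assuming
    the usual __aligned() helper; the field list is abbreviated and this is
    not the literal patch:

    /* Align struct page to a double word only when an architecture
     * explicitly selects CONFIG_HAVE_ALIGNED_STRUCT_PAGE (e.g. because
     * SLUB wants to use cmpxchg_double() on it), instead of tying the
     * alignment to cmpxchg_local()/CMPXCHG_LOCAL support.
     */
    struct page {
            unsigned long flags;
            /* ... remaining fields elided ... */
    }
    #ifdef CONFIG_HAVE_ALIGNED_STRUCT_PAGE
            __aligned(2 * sizeof(unsigned long))
    #endif
    ;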
     

12 Jan, 2012

3 commits

  • * 'x86-mm-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/numa: Add constraints check for nid parameters
    mm, x86: Remove debug_pagealloc_enabled
    x86/mm: Initialize high mem before free_all_bootmem()
    arch/x86/kernel/e820.c: quiet sparse noise about plain integer as NULL pointer
    arch/x86/kernel/e820.c: Eliminate bubble sort from sanitize_e820_map()
    x86: Fix mmap random address range
    x86, mm: Unify zone_sizes_init()
    x86, mm: Prepare zone_sizes_init() for unification
    x86, mm: Use max_low_pfn for ZONE_NORMAL on 64-bit
    x86, mm: Wrap ZONE_DMA32 with CONFIG_ZONE_DMA32
    x86, mm: Use max_pfn instead of highend_pfn
    x86, mm: Move zone init from paging_init() on 64-bit
    x86, mm: Use MAX_DMA_PFN for ZONE_DMA on 32-bit

    Linus Torvalds
     
  • * 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
    slub: disallow changing cpu_partial from userspace for debug caches
    slub: add missed accounting
    slub: Extract get_freelist from __slab_alloc
    slub: Switch per cpu partial page support off for debugging
    slub: fix a possible memleak in __slab_alloc()
    slub: fix slub_max_order Documentation
    slub: add missed accounting
    slab: add taint flag outputting to debug paths.
    slub: add taint flag outputting to debug paths
    slab: introduce slab_max_order kernel parameter
    slab: rename slab_break_gfp_order to slab_max_order

    Linus Torvalds
     
  • Pekka Enberg
     

11 Jan, 2012

34 commits

  • * 'writeback-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/wfg/linux:
    writeback: move MIN_WRITEBACK_PAGES to fs-writeback.c
    writeback: balanced_rate cannot exceed write bandwidth
    writeback: do strict bdi dirty_exceeded
    writeback: avoid tiny dirty poll intervals
    writeback: max, min and target dirty pause time
    writeback: dirty ratelimit - think time compensation
    btrfs: fix dirtied pages accounting on sub-page writes
    writeback: fix dirtied pages accounting on redirty
    writeback: fix dirtied pages accounting on sub-page writes
    writeback: charge leaked page dirties to active tasks
    writeback: Include all dirty inodes in background writeback

    Linus Torvalds
     
    vmap_area->private is a void *, but the field is not used for various
    purposes; it only ever holds a vm_struct pointer. Change it to a
    struct vm_struct * with a matching name to improve readability and type
    checking.

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
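
    A sketch of the resulting structure, assuming the new member is simply
    named vm; unrelated fields are elided:

    struct vmap_area {
            unsigned long va_start;
            unsigned long va_end;
            /* ... */
            struct vm_struct *vm;   /* was: void *private, only ever a vm_struct */
    };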
     
    It is the cursor page, not the tag page, that we should process; this
    looks like a typo.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
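
    A sketch of the idea in the lumpy-isolation loop; the variable names
    (nr_swap_pages, cursor_page) follow common vmscan conventions and are
    assumptions here:

    /* When there is no swap, a swap-backed page (anon, shmem, tmpfs)
     * cannot be reclaimed, so give up on this contiguous block early.
     */
    if (nr_swap_pages <= 0 && PageSwapBacked(cursor_page))
            break;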
     
  • lru_to_page is not used in mm/migrate.c.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Mel Gorman
    Acked-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
    If we have to hand the newly allocated huge page back to the page
    allocator, for any reason, the counter that was changed should be
    restored.

    This affects only s390 at present.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
    mempool modifies gfp_mask so that the backing allocator doesn't try too
    hard or trigger warning messages when there's a pool to fall back on. In
    addition, for the first try, it removes __GFP_WAIT and __GFP_IO, so that
    it doesn't trigger reclaim or wait when the allocation can be fulfilled
    from the pool; however, when that allocation fails and the pool is empty
    too, it waits for the pool to be replenished before retrying.

    Allocation which could have succeeded after a bit of reclaim has to wait
    on the reserved items and it's not like mempool doesn't retry with
    __GFP_WAIT and IO. It just does that *after* someone returns an element,
    pointlessly delaying things.

    Fix it by retrying immediately if the first round of allocation attempts
    w/o __GFP_WAIT and IO fails.

    [akpm@linux-foundation.org: shorten the lock hold time]
    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
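
    A simplified sketch of the retry flow described above; it mirrors the
    description rather than quoting mempool_alloc():

    /* First pass: be opportunistic, don't reclaim or wait. */
    gfp_t gfp_temp = gfp_mask & ~(__GFP_WAIT | __GFP_IO);

    repeat_alloc:
            element = pool->alloc(gfp_temp, pool->pool_data);
            if (element)
                    return element;

            /* ... try to take an element from the reserved pool ... */

            if (gfp_temp != gfp_mask) {
                    /* Pool is empty and the opportunistic attempt failed:
                     * retry immediately with the caller's full gfp mask
                     * instead of sleeping until an element is returned.
                     */
                    gfp_temp = gfp_mask;
                    goto repeat_alloc;
            }

            /* Only when even that fails, wait on pool->wait and retry. */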
     
    mempool_destroy() is a thin wrapper around free_pool(). The only thing it
    adds is BUG_ON(pool->curr_nr != pool->min_nr). The intention seems to be
    to enforce that all allocated elements are freed; however, the BUG_ON()
    can't achieve that (it doesn't know anything about objects above min_nr)
    and is incorrect, as mempool_resize() is allowed to leave the pool
    extended but not filled. Furthermore, panicking is way worse than any
    memory leak and there are better debug tools to track memory leaks.

    Drop the BUG_ON() from mempool_destroy() and, as that leaves the function
    identical to free_pool(), replace it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • mempool_alloc/free() use undocumented smp_mb()'s. The code is slightly
    broken and misleading.

    The lockless part is in mempool_free(). It wants to determine whether the
    item being freed needs to be returned to the pool or backing allocator
    without grabbing pool->lock. Two things need to be guaranteed for correct
    operation.

    1. pool->curr_nr + #allocated should never dip below pool->min_nr.
    2. Waiters shouldn't be left dangling.

    For #1, the only necessary condition is that the curr_nr visible at free is
    from after the allocation of the element being freed (details in the
    comment). For most cases, this is true without any barrier but there can
    be fringe cases where the allocated pointer is passed to the freeing task
    without going through memory barriers. To cover this case, wmb is
    necessary before returning from allocation and rmb is necessary before
    reading curr_nr. IOW,

    ALLOCATING TASK                    FREEING TASK

    update pool state after alloc;
    wmb();
    pass pointer to freeing task;
                                       read pointer;
                                       rmb();
                                       read pool state to free;

    The current code doesn't have wmb after pool update during allocation and
    may theoretically, on machines where unlock doesn't behave as full wmb,
    lead to pool depletion and deadlock. smp_wmb() needs to be added after
    successful allocation from reserved elements and smp_mb() in
    mempool_free() can be replaced with smp_rmb().

    For #2, the waiter needs to add itself to waitqueue and then check the
    wait condition and the waker needs to update the wait condition and then
    wake up. Because waitqueue operations always go through full spinlock
    synchronization, there is no need for extra memory barriers.

    Furthermore, mempool_alloc() is already holding pool->lock when it decides
    that it needs to wait. There is no reason to do unlock - add waitqueue -
    test condition again. It can simply add itself to waitqueue while holding
    pool->lock and then unlock and sleep.

    This patch adds smp_wmb() after successful allocation from the reserved
    pool, replaces smp_mb() in mempool_free() with smp_rmb(), and extends
    pool->lock over the waitqueue addition. More importantly, it explains
    what the memory barriers do and how the lockless testing is correct.

    -v2: Oleg pointed out that unlock doesn't imply wmb. Added explicit
    smp_wmb() after successful allocation from reserved pool and
    updated comments accordingly.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
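
    A sketch of the barrier pairing described above, split into the
    allocation and free sides; simplified and not the literal patch:

    /* Allocation side: after taking an element from the reserved pool,
     * make the updated curr_nr visible before the pointer can reach a
     * freeing task.
     */
    element = pool->elements[--pool->curr_nr];
    spin_unlock_irqrestore(&pool->lock, flags);
    smp_wmb();                      /* pairs with smp_rmb() below */

    /* Free side: read curr_nr only after the pointer has been observed,
     * so the value seen is from after the matching allocation.
     */
    smp_rmb();                      /* was smp_mb(); a read barrier suffices */
    if (unlikely(pool->curr_nr < pool->min_nr)) {
            /* return the element to the pool under pool->lock */
    }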
     
  • migration_entry_wait() can also be called from hugetlb_fault() now.
    Remove the incorrect comment.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • mpol_equal() logically returns a boolean. Use a bool type to slightly
    improve readability.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    The computation of pgoff is incorrect; at least the

    (vma->vm_pgoff >> PAGE_SHIFT)

    term is wrong. Fix it with the available helper when HPAGE_SIZE is
    involved in the page cache lookup.

    [akpm@linux-foundation.org: use vma_hugecache_offset() directly, per Michal]
    Signed-off-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • It's pointless to continue reclaiming when we have no swap space and lots
    of anon pages in the inactive list.

    Without this patch, it is possible when swap is disabled to continue
    trying to reclaim when there are only anonymous pages in the system even
    though that will not make any progress.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
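
    A sketch of the check, in the style of should_continue_reclaim(); the
    exact placement is an assumption:

    /* Only count inactive anon pages as pending reclaim work when there
     * is swap to move them to; without swap, scanning them cannot make
     * progress.
     */
    inactive_lru_pages = zone_page_state(zone, NR_INACTIVE_FILE);
    if (nr_swap_pages > 0)
            inactive_lru_pages += zone_page_state(zone, NR_INACTIVE_ANON);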
     
  • The loop that frees pages to the page allocator while bootstrapping tries
    to free higher-order blocks only when the starting address is aligned to
    that block size. Otherwise it will free all pages on that node
    one-by-one.

    Change it to free individual pages up to the first aligned block and then
    try higher-order frees from there.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
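
    A sketch of the freeing strategy (idea only, not the patch): free single
    pages up to the first BITS_PER_LONG-aligned pfn, then switch to
    block-sized frees:

    /* Free order-0 pages until the cursor is aligned ... */
    while (start < end && (start & (BITS_PER_LONG - 1))) {
            __free_pages_bootmem(pfn_to_page(start), 0);
            start++;
    }
    /* ... then free whole aligned blocks in one go. */
    while (start + BITS_PER_LONG <= end) {
            __free_pages_bootmem(pfn_to_page(start), ilog2(BITS_PER_LONG));
            start += BITS_PER_LONG;
    }
    /* Any remaining tail pages are freed one by one. */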
     
    The area node_bootmem_map represents is aligned to BITS_PER_LONG, and all
    bits in any aligned word of that map are valid. When the represented area
    extends beyond the end of the node, the non-existent pages will be marked
    as reserved.

    As a result, when freeing a page block, doing an explicit range check for
    whether that block is within the node's range is redundant as the bitmap
    is consulted anyway to see whether all pages in the block are unreserved.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __free_pages_bootmem() used to special-case higher-order frees to save
    individual page checking with free_pages_bulk().

    Nowadays, both zero order and non-zero order frees use free_pages(), which
    checks each individual page anyway, and so there is little point in making
    the distinction anymore. The higher-order loop will work just fine for
    zero order pages.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
    oom_score_adj is used to guard processes from the OOM killer. One
    problem is that it is inherited at fork(): when a daemon sets
    oom_score_adj and spawns children, it is hard to know where the value was
    set.

    This patch adds three tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable these tracepoints. Filtering may be useful, for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. Also, this change helps to improve page fault performance
    because it has stronger locality of reference.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
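
    A sketch of the simplified lookup, using the vm_prev link the entry
    describes:

    struct vm_area_struct *
    find_vma_prev(struct mm_struct *mm, unsigned long addr,
                  struct vm_area_struct **pprev)
    {
            struct vm_area_struct *vma;

            /* vm_prev (from the doubly linked vma list) gives the previous
             * vma directly; no need to rediscover it by walking the rbtree.
             */
            vma = find_vma(mm, addr);
            *pprev = vma ? vma->vm_prev : NULL;
            return vma;
    }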
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some pte.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lot of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy practically for the whole
    duration of mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    #define _GNU_SOURCE             /* for the 5-argument mremap() */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    /* SIZE is left undefined in the original message; any large,
     * 2MB-multiple value works for the exercise. */
    #define SIZE (256UL*1024*1024)

    int main(void)
    {
            static struct timeval oldstamp, newstamp;
            long diffsec;
            char *p, *p2, *p3, *p4;

            if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);

            memset(p, 0xff, SIZE);
            printf("%p\n", p);
            memset(p2, 0xff, SIZE);
            memset(p3, 0x77, 4096);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
            if (p4 != p3)
                    perror("mremap"), exit(1);
            p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
            if (p4 != p+SIZE/2)
                    perror("mremap"), exit(1);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            printf("ok\n");

            return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed how free_pages is calculated, but
    forgot that we used to subtract ((1 << order) - 1) from it, so we ended up
    with an off-by-two error when calculating free_pages.

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
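
    A sketch of the restored accounting in the watermark check (simplified;
    the names follow __zone_watermark_ok()):

    /* The (1 << order) - 1 pages beyond the first one must again be
     * subtracted before comparing against the watermark, as was done
     * before commit 88f5acf88ae6.
     */
    free_pages -= (1 << order) - 1;
    if (free_pages <= min + lowmem_reserve)
            return false;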
     
  • The first entry of bdata->node_bootmem_map holds the data for
    bdata->node_min_pfn up to bdata->node_min_pfn + BITS_PER_LONG - 1. So the
    test for freeing all pages of a single map entry can be slightly relaxed.

    Moreover use DIV_ROUND_UP in another place instead of open coding it.

    Signed-off-by: Uwe Kleine-König
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
    Once a pfn has been isolated, it no longer needs to be scanned and
    isolated if another round is necessary, so push the isolate_migratepages
    search base of the given compact_control one step ahead.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
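
    A minimal sketch of the hint in grab_cache_page_write_begin(); the
    surrounding code is abbreviated:

    /* Pages grabbed for write_begin are about to be dirtied, so tell the
     * allocator via __GFP_WRITE (used by the per-zone dirty limits).
     */
    gfp_t gfp_mask = mapping_gfp_mask(mapping) | __GFP_WRITE;

    page = __page_cache_alloc(gfp_mask);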
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504 Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                  seconds                       nr_vmscan_write
                  (stddev)               min|       median|          max
    xfs
    vanilla:   549.747(  3.492)        0.000|       0.000|        0.000
    patched:   550.996(  3.802)        0.000|       0.000|        0.000

    fuse-ntfs
    vanilla:  1183.094( 53.178)    54349.000|   59341.000|    65163.000
    patched:   558.049( 17.914)        0.000|       0.000|       43.000

    btrfs
    vanilla:   573.679( 14.015)   156657.000|  460178.000|   606926.000
    patched:   563.365( 11.368)        0.000|       0.000|     1362.000

    ext4
    vanilla:   561.197( 15.782)        0.000| 2725438.000|  4143837.000
    patched:   568.806( 17.496)        0.000|       0.000|        0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
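
    A sketch of the placement decision in the allocator's zonelist walk;
    zone_dirty_ok() is the per-zone limit check this work introduces, and the
    loop context is abbreviated:

    for_each_zone_zonelist_nodemask(zone, z, zonelist, high_zoneidx, nodemask) {
            /* The caller told us the page will be dirtied soon; skip zones
             * that already hold their fair share of dirty pages and fall
             * through to a lower zone instead.
             */
            if ((gfp_mask & __GFP_WRITE) && !zone_dirty_ok(zone))
                    continue;

            /* ... usual watermark checks and allocation ... */
    }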
     
  • The next patch will introduce per-zone dirty limiting functions in
    addition to the traditional global dirty limiting.

    Rename determine_dirtyable_memory() to global_dirtyable_memory() before
    adding the zone-specific version, and fix up its documentation.

    Also, move the functions to determine the dirtyable memory and the
    function to calculate the dirty limit based on that together so that their
    relationship is more apparent and that they can be commented on as a
    group.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                      |
    |          |  |                      |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+  --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
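
    A sketch of the global side of the accounting; dirty_balance_reserve
    stands for the summed per-zone reserves and the helper name follows the
    neighbouring entry rather than quoting the patch:

    static unsigned long global_dirtyable_memory(void)
    {
            unsigned long x;

            x = global_page_state(NR_FREE_PAGES) + global_reclaimable_pages();

            /* Lowmem reserves and high watermarks are kept free by the
             * allocator and kswapd; they are not available for dirty page
             * cache, so exclude them from the dirtyable total.
             */
            x -= min(x, dirty_balance_reserve);

            return x + 1;   /* make sure we never return 0 */
    }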
     
    If we need to understand a usecase, the calling program's name is
    critically important. Show it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    David Rientjes
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Previously POSIX_FADV_DONTNEED would start writeback for the entire file
    when the bdi was not write congested. This negatively impacts performance
    if the file contains dirty pages outside of the requested range. This
    change uses __filemap_fdatawrite_range() to only initiate writeback for
    the requested range.

    Signed-off-by: Shawn Bohrer
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shawn Bohrer
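
    A sketch of the fadvise path after the change; the surrounding switch in
    the fadvise implementation is abbreviated:

    case POSIX_FADV_DONTNEED:
            /* Write back only the requested range rather than the whole
             * file, then drop the now-clean pages in that range.
             */
            if (!bdi_write_congested(mapping->backing_dev_info))
                    __filemap_fdatawrite_range(mapping, offset, endbyte,
                                               WB_SYNC_NONE);
            /* ... invalidate_mapping_pages() on the range follows ... */
            break;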
     
    Disable slub debug facilities and allocate slabs at minimal order when
    debug_guardpage_minorder > 0 to increase the probability of catching
    random memory corruption via a CPU exception.

    Signed-off-by: Stanislaw Gruszka
    Cc: "Rafael J. Wysocki"
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: Mel Gorman
    Cc: Stanislaw Gruszka
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • With CONFIG_DEBUG_PAGEALLOC configured, the CPU will generate an exception
    on access (read,write) to an unallocated page, which permits us to catch
    code which corrupts memory. However the kernel is trying to maximise
    memory usage, hence there are usually few free pages in the system and
    buggy code usually corrupts some crucial data.

    This patch changes the buddy allocator to keep more free/protected pages
    and to interlace free/protected and allocated pages to increase the
    probability of catching corruption.

    When the kernel is compiled with CONFIG_DEBUG_PAGEALLOC,
    debug_guardpage_minorder defines the minimum order used by the page
    allocator to grant a request. The requested size will be returned with
    the remaining pages used as guard pages.

    The default value of debug_guardpage_minorder is zero: no change from
    current behaviour.

    [akpm@linux-foundation.org: tweak documentation, s/flg/flag/]
    Signed-off-by: Stanislaw Gruszka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: "Rafael J. Wysocki"
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislaw Gruszka
     
  • handle_mm_fault() passes 'faulted' address to hugetlb_fault(). This
    address is not aligned to a hugepage boundary.

    Most of the functions for hugetlb pages are aware of that and calculate an
    alignment themselves. However some functions such as
    copy_user_huge_page() and clear_huge_page() don't handle alignment by
    themselves.

    This patch makes hugetlb_fault() fix the alignment and pass an aligned
    address (the address of the faulted hugepage) to those functions.

    [akpm@linux-foundation.org: use &=]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
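
    The fix boils down to masking the address in hugetlb_fault(); a one-line
    sketch, with h being the hstate as usual:

    /* Helpers such as copy_user_huge_page()/clear_huge_page() expect a
     * hugepage-aligned address, so align the faulting address first.
     */
    address &= huge_page_mask(h);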
     
    Let's make it clear that we cannot race with other fault handlers thanks
    to the hugetlb (global) mutex. Also make it clear that we want to keep
    the pte_same checks anyway, to make a transition away from the global
    mutex easier.

    Signed-off-by: Michal Hocko
    Cc: Hillf Danton
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Currently we are not rechecking pte_same in hugetlb_cow after we take the
    ptl lock again in the page allocation failure code path, and we simply
    retry. This is not an issue at the moment because the hugetlb fault path
    is protected by hugetlb_instantiation_mutex, so we cannot race.

    The original page is locked and so we cannot race even with the page
    migration.

    Let's add the pte_same check anyway as we want to be consistent with the
    other check later in this function and be safe if we ever remove the
    mutex.

    [mhocko@suse.cz: reworded the changelog]
    Signed-off-by: Hillf Danton
    Signed-off-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
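
    A sketch of the recheck in hugetlb_cow()'s allocation-failure path;
    simplified rather than the literal patch:

    spin_lock(&mm->page_table_lock);
    ptep = huge_pte_offset(mm, address & huge_page_mask(h));
    if (likely(pte_same(huge_ptep_get(ptep), pte))) {
            /* The pte did not change while the lock was dropped, so it is
             * safe to retry the copy-on-write.
             */
    }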
     
  • Colin Cross reported;

    Under the following conditions, __alloc_pages_slowpath can loop forever:
    gfp_mask & __GFP_WAIT is true
    gfp_mask & __GFP_FS is false
    reclaim and compaction make no progress
    order
    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Pekka Enberg
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman