11 Jan, 2012

40 commits

  • commit 0c6967b5a0 ("serial:blackfin: rename Blackfin serial driver to
    bfin_uart.c") renamed the file, update the pattern.

    Signed-off-by: Joe Perches
    Acked-by: Sonic Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 4860c73804c ("staging: Move media drivers to staging/media") moved
    the files; update the F: patterns.

    Signed-off-by: Joe Perches
    Acked-by: Mauro Carvalho Chehab
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 61cf45d0199 ("encrypted-keys: create encrypted-keys directory")
    moved the files; update the patterns.

    Signed-off-by: Joe Perches
    Cc: Mimi Zohar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 1fe003fd424 ("greth: Move the Aeroflex Gaisler driver") moved the
    files; update the patterns.

    Signed-off-by: Joe Perches
    Cc: Kristoffer Glembo
    Cc: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit a88394cfb58 ("ewrk3/tulip: Move the DEC - Tulip drivers") moved the
    files; update the patterns.

    Signed-off-by: Joe Perches
    Acked-by: Grant Grundler
    Cc: Jeff Kirsher
    Cc: Tobias Ringstrom
    Cc: Grant Grundler
    Cc: David Davies
    Cc: David Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 38576af1f8c ("mmc: sdhci: make sdhci-of device drivers self
    registered") moved the files around. Update the patterns.

    Signed-off-by: Joe Perches
    Cc: Shawn Guo
    Cc: Chris Ball
    Acked-by: Anton Vorontsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • commit 8959e74399c ("mfd: Delete ab3550 driver") removed the driver;
    update the patterns.

    Signed-off-by: Joe Perches
    Acked-by: Linus Walleij
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit f8fc729870ee ("[media] marvell-cam: Move cafe-ccic into its own
    directory") moved the files, update the pattern.

    Signed-off-by: Joe Perches
    Cc: Jonathan Corbet
    Acked-by: Mauro Carvalho Chehab
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit c103de240439d ("gpio: reorganize drivers") renamed the file; update
    the pattern.

    Signed-off-by: Joe Perches
    Cc: Grant Likely
    Cc: Michael Buesch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Commit c103de240439df ("gpio: reorganize drivers") renamed the files;
    update the patterns.

    Signed-off-by: Joe Perches
    Acked-by: Grant Likely
    Acked-by: Michael Hennerich
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Track renames and missing or deleted files.

    Signed-off-by: Joe Perches
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • I happen to have had a commit to various network drivers since the big
    renaming/reorg which happened to drivers/net recently. This means that I
    now appear to be among the top few commit signers (by percentage) for many
    of them, so I am getting sent all sorts of stuff while the people who are
    actually involved with the driver are not. E.g. (to pick one at random):

    $ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
    "David S. Miller" (commit_signer:5/7=71%)
    Ian Campbell (commit_signer:2/7=29%)
    Eric Dumazet (commit_signer:1/7=14%)
    Jeff Kirsher (commit_signer:1/7=14%)
    Jiri Pirko (commit_signer:1/7=14%)
    netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
    linux-kernel@vger.kernel.org (open list)

    With the following patch the renames are followed and the result appears
    much more sensible:

    $ ./scripts/get_maintainer.pl -f drivers/net/ethernet/nvidia/forcedeth.c
    "David S. Miller" (commit_signer:31/34=91%)
    Joe Perches (commit_signer:11/34=32%)
    Szymon Janc (commit_signer:5/34=15%)
    Jiri Pirko (commit_signer:3/34=9%)
    Paul (commit_signer:2/34=6%)
    netdev@vger.kernel.org (open list:NETWORKING DRIVERS)
    linux-kernel@vger.kernel.org (open list)

    Signed-off-by: Ian Campbell
    Acked-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Campbell
     
  • vmap_area->private is a void *, but the field is not used for various
    purposes; it only ever holds a vm_struct pointer. So change it to a
    vm_struct * with a more descriptive name, to improve readability and type
    checking.
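
    A minimal userspace sketch of what the typed pointer buys (simplified,
    stand-in structures, not the kernel's definitions; the new field name is
    only illustrative):

        #include <stdio.h>

        struct vm_struct {
            void *addr;
            unsigned long size;
        };

        struct vmap_area {
            unsigned long va_start;
            unsigned long va_end;
            /* was: void *private; -- it only ever held a vm_struct */
            struct vm_struct *vm;   /* typed: misuse is now a compile error */
        };

        int main(void)
        {
            struct vm_struct area = { .addr = (void *)0x1000, .size = 4096 };
            struct vmap_area va = {
                .va_start = 0x1000, .va_end = 0x2000, .vm = &area
            };

            /* no cast needed at the use site any more */
            printf("mapped size: %lu\n", va.vm->size);
            return 0;
        }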

    Signed-off-by: Minchan Kim
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • It is the cursor page, not the tag page, that we should process; the
    existing code looks like a typo.

    Signed-off-by: Hillf Danton
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Lumpy reclaim does well to stop at a PageAnon when there's no swap, but
    better is to stop at any PageSwapBacked, which includes shmem/tmpfs too.
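
    An illustrative-only sketch of the changed stop test (toy structures and
    flags, not the kernel's page flags or isolation code):

        #include <stdbool.h>
        #include <stdio.h>

        struct page { bool anon; bool swap_backed; };

        /* old check: with no swap, only anonymous pages ended the lumpy scan */
        static bool stop_old(const struct page *p, bool have_swap)
        {
            return !have_swap && p->anon;
        }

        /* new check: any swap-backed page ends it, shmem/tmpfs included */
        static bool stop_new(const struct page *p, bool have_swap)
        {
            return !have_swap && p->swap_backed;
        }

        int main(void)
        {
            struct page shmem = { .anon = false, .swap_backed = true };

            printf("old stops at shmem: %d, new stops at shmem: %d\n",
                   stop_old(&shmem, false), stop_new(&shmem, false));
            return 0;
        }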

    Signed-off-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • lru_to_page is not used in mm/migrate.c.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Mel Gorman
    Acked-by: Kyungmin Park
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • If we have to hand the newly allocated huge page back to the page
    allocator for any reason, the counter that was changed should be restored.

    This affects only s390 at present.

    Signed-off-by: Hillf Danton
    Reviewed-by: Michal Hocko
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • mempool modifies gfp_mask so that the backing allocator doesn't try too
    hard or trigger a warning message when there's a pool to fall back on. In
    addition, for the first try, it removes __GFP_WAIT and IO, so that it
    doesn't trigger reclaim or wait when allocation can be fulfilled from
    pool; however, when that allocation fails and pool is empty too, it waits
    for the pool to be replenished before retrying.

    Allocation which could have succeeded after a bit of reclaim has to wait
    on the reserved items and it's not like mempool doesn't retry with
    __GFP_WAIT and IO. It just does that *after* someone returns an element,
    pointlessly delaying things.

    Fix it by retrying immediately if the first round of allocation attempts
    w/o __GFP_WAIT and IO fails.
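
    A toy sketch of the retry policy described above, with stand-in flags and
    a fake backing allocator (none of this is mempool's actual code):

        #include <stdio.h>
        #include <stdlib.h>

        #define GFP_WAIT 0x1    /* stand-ins for __GFP_WAIT / __GFP_IO */
        #define GFP_IO   0x2

        /* fake backing allocator: pretend the cheap, non-blocking attempt
         * fails, while the blocking one (which may reclaim) succeeds */
        static void *backing_alloc(unsigned flags)
        {
            if (!(flags & GFP_WAIT))
                return NULL;
            return malloc(32);
        }

        static void *pool_alloc(unsigned gfp)
        {
            unsigned relaxed = gfp & ~(GFP_WAIT | GFP_IO);
            void *p;

            p = backing_alloc(relaxed);     /* opportunistic first try */
            if (p)
                return p;
            /* pool is empty too (pool handling omitted): retry right away
             * with the original flags instead of sleeping until an element
             * is returned to the pool */
            if (gfp & GFP_WAIT)
                p = backing_alloc(gfp);
            return p;
        }

        int main(void)
        {
            void *p = pool_alloc(GFP_WAIT | GFP_IO);

            printf(p ? "allocated after immediate retry\n"
                     : "would have to sleep on the pool\n");
            free(p);
            return 0;
        }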

    [akpm@linux-foundation.org: shorten the lock hold time]
    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • mempool_destroy() is a thin wrapper around free_pool(). The only thing it
    adds is BUG_ON(pool->curr_nr != pool->min_nr). The intention seems to be
    to enforce that all allocated elements are freed; however, the BUG_ON()
    can't achieve that (it doesn't know anything about objects above min_nr),
    and it is incorrect anyway, as mempool_resize() is allowed to leave the
    pool extended but not filled. Furthermore, panicking is far worse than any
    memory leak, and there are better debug tools for tracking memory leaks.

    Drop the BUG_ON() from mempool_destroy() and, as that leaves the function
    identical to free_pool(), replace it.

    Signed-off-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • mempool_alloc/free() use undocumented smp_mb()'s. The code is slightly
    broken and misleading.

    The lockless part is in mempool_free(). It wants to determine whether the
    item being freed needs to be returned to the pool or backing allocator
    without grabbing pool->lock. Two things need to be guaranteed for correct
    operation.

    1. pool->curr_nr + #allocated should never dip below pool->min_nr.
    2. Waiters shouldn't be left dangling.

    For #1, the only necessary condition is that the curr_nr visible at free is
    from after the allocation of the element being freed (details in the
    comment). For most cases, this is true without any barrier but there can
    be fringe cases where the allocated pointer is passed to the freeing task
    without going through memory barriers. To cover this case, wmb is
    necessary before returning from allocation and rmb is necessary before
    reading curr_nr. IOW,

    ALLOCATING TASK                     FREEING TASK

    update pool state after alloc;
    wmb();
    pass pointer to freeing task;
                                        read pointer;
                                        rmb();
                                        read pool state to free;

    The current code doesn't have wmb after pool update during allocation and
    may theoretically, on machines where unlock doesn't behave as full wmb,
    lead to pool depletion and deadlock. smp_wmb() needs to be added after
    successful allocation from reserved elements and smp_mb() in
    mempool_free() can be replaced with smp_rmb().
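
    A minimal C11 sketch of the wmb/rmb pairing above, using userspace fences
    as stand-ins for the kernel barriers (not mempool code): the allocating
    side publishes its state before handing over the pointer, and the freeing
    side reads the pointer before looking at that state.

        #include <stdatomic.h>
        #include <pthread.h>
        #include <stdio.h>

        static int pool_state;                  /* "pool state" stand-in */
        static _Atomic(int *) shared_ptr;       /* pointer handed over */

        static void *alloc_side(void *arg)
        {
            (void)arg;
            pool_state = 42;                            /* update pool state */
            atomic_thread_fence(memory_order_release);  /* ~ smp_wmb() */
            atomic_store_explicit(&shared_ptr, &pool_state,
                                  memory_order_relaxed); /* pass pointer */
            return NULL;
        }

        static void *free_side(void *arg)
        {
            int *p;

            (void)arg;
            while (!(p = atomic_load_explicit(&shared_ptr,
                                              memory_order_relaxed)))
                ;                                       /* read pointer */
            atomic_thread_fence(memory_order_acquire);  /* ~ smp_rmb() */
            printf("pool state seen as %d\n", *p);      /* read pool state */
            return NULL;
        }

        int main(void)
        {
            pthread_t a, b;

            pthread_create(&b, NULL, free_side, NULL);
            pthread_create(&a, NULL, alloc_side, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);
            return 0;
        }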

    For #2, the waiter needs to add itself to waitqueue and then check the
    wait condition and the waker needs to update the wait condition and then
    wake up. Because waitqueue operations always go through full spinlock
    synchronization, there is no need for extra memory barriers.

    Furthermore, mempool_alloc() is already holding pool->lock when it decides
    that it needs to wait. There is no reason to do unlock - add waitqueue -
    test condition again. It can simply add itself to waitqueue while holding
    pool->lock and then unlock and sleep.

    This patch adds smp_wmb() after successful allocation from the reserved
    pool, replaces smp_mb() in mempool_free() with smp_rmb(), and extends
    pool->lock over the waitqueue addition. More importantly, it explains what memory
    barriers do and how the lockless testing is correct.

    -v2: Oleg pointed out that unlock doesn't imply wmb. Added explicit
    smp_wmb() after successful allocation from reserved pool and
    updated comments accordingly.

    Signed-off-by: Tejun Heo
    Cc: Oleg Nesterov
    Cc: "Paul E. McKenney"
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • migration_entry_wait() can also be called from hugetlb_fault() now.
    Remove the incorrect comment.

    Signed-off-by: Wang Sheng-Hui
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Sheng-Hui
     
  • mpol_equal() logically returns a boolean. Use a bool type to slightly
    improve readability.

    Signed-off-by: KOSAKI Motohiro
    Cc: Stephen Wilson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The computation of pgoff is incorrect, at least where

    (vma->vm_pgoff >> PAGE_SHIFT)

    is involved. Fix it by using the available helper for HPAGE_SIZE-based
    page cache lookups.

    [akpm@linux-foundation.org: use vma_hugecache_offset() directly, per Michal]
    Signed-off-by: Hillf Danton
    Cc: Mel Gorman
    Cc: Michal Hocko
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • It's pointless to continue reclaiming when we have no swap space and lots
    of anon pages in the inactive list.

    Without this patch, it is possible when swap is disabled to continue
    trying to reclaim when there are only anonymous pages in the system even
    though that will not make any progress.

    Signed-off-by: Minchan Kim
    Cc: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Cc: Johannes Weiner
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • The loop that frees pages to the page allocator while bootstrapping tries
    to free higher-order blocks only when the starting address is aligned to
    that block size. Otherwise it will free all pages on that node
    one-by-one.

    Change it to free individual pages up to the first aligned block and then
    try higher-order frees from there.
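
    A simplified sketch of the loop structure described above, with printf
    standing in for the actual page frees and BITS_PER_LONG pages assumed as
    the block size:

        #include <stdio.h>

        #define BITS_PER_LONG 64UL

        static void free_range(unsigned long start, unsigned long end)
        {
            unsigned long pfn = start;

            /* individual pages up to the first aligned block boundary */
            while (pfn < end && (pfn & (BITS_PER_LONG - 1)))
                printf("free order-0 page, pfn %lu\n", pfn++);

            /* then whole aligned blocks while they fit */
            while (pfn + BITS_PER_LONG <= end) {
                printf("free block of %lu pages at pfn %lu\n",
                       BITS_PER_LONG, pfn);
                pfn += BITS_PER_LONG;
            }

            /* trailing partial block, again page by page */
            while (pfn < end)
                printf("free order-0 page, pfn %lu\n", pfn++);
        }

        int main(void)
        {
            /* unaligned start: pfns 5..63 go out one by one, then blocks */
            free_range(5, 200);
            return 0;
        }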

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The area node_bootmem_map represents is aligned to BITS_PER_LONG, and all
    bits in any aligned word of that map are valid. When the represented area
    extends beyond the end of the node, the non-existent pages will be marked
    as reserved.

    As a result, when freeing a page block, doing an explicit range check for
    whether that block is within the node's range is redundant as the bitmap
    is consulted anyway to see whether all pages in the block are unreserved.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • __free_pages_bootmem() used to special-case higher-order frees to save
    individual page checking with free_pages_bulk().

    Nowadays, both zero order and non-zero order frees use free_pages(), which
    checks each individual page anyway, and so there is little point in making
    the distinction anymore. The higher-order loop will work just fine for
    zero order pages.

    Signed-off-by: Johannes Weiner
    Cc: Uwe Kleine-König
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • oom_score_adj is used to guard processes from the OOM killer. One problem
    is that it is inherited at fork(): when a daemon sets oom_score_adj and
    then creates children, it's hard to know where the value was set.

    This patch adds three tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable the relevant tracepoints. Filtering may be
    useful, for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. Also, this change helps to improve page fault performance
    because it has stronger locality of reference.
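
    A sketch of the simplification with a toy doubly linked list (not the
    kernel's vm_area_struct or find_vma_prev()): with a back pointer, the
    previous element no longer has to be rediscovered by walking from the
    head.

        #include <stdio.h>
        #include <stddef.h>

        struct vma {
            unsigned long start, end;
            struct vma *next, *prev;    /* like vm_next / vm_prev */
        };

        /* old style: walk from the head, remembering the predecessor */
        static struct vma *find_prev_by_walk(struct vma *head, struct vma *target)
        {
            struct vma *prev = NULL;

            for (struct vma *v = head; v; prev = v, v = v->next)
                if (v == target)
                    return prev;
            return NULL;
        }

        /* new style: the back pointer makes it a single dereference */
        static struct vma *find_prev(struct vma *target)
        {
            return target->prev;
        }

        int main(void)
        {
            struct vma a = { .start = 0, .end = 4096 };
            struct vma b = { .start = 4096, .end = 8192, .prev = &a };

            a.next = &b;
            printf("walk: %p  direct: %p\n",
                   (void *)find_prev_by_walk(&a, &b), (void *)find_prev(&b));
            return 0;
        }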

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serializing properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some ptes.

    This patch adds an anon_vma_moveto_tail() function to force the dst vma at
    the end of the list before mremap starts to solve the problem.

    If the mremap is very large and there are a lot of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy for practically the whole
    duration of mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    /* The includes and the SIZE definition below were not part of the
     * original snippet; they are added so the program builds.  The SIZE
     * value is an assumption -- any multiple of 2MB will do. */
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (128UL*1024*1024)

    int main()
    {
        /* left over from a timing variant of the test; unused here */
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");

        /* move the second half of p onto p3, then move it back */
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);
        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Commit 88f5acf88ae6 ("mm: page allocator: adjust the per-cpu counter
    threshold when memory is low") changed the way free_pages is calculated,
    but forgot that we used to do free_pages - ((1 << order) - 1), so we ended
    up with an off-by-two when calculating free_pages.

    Reported-by: Wang Sheng-Hui
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The first entry of bdata->node_bootmem_map holds the data for
    bdata->node_min_pfn up to bdata->node_min_pfn + BITS_PER_LONG - 1. So the
    test for freeing all pages of a single map entry can be slightly relaxed.

    Moreover use DIV_ROUND_UP in another place instead of open coding it.

    Signed-off-by: Uwe Kleine-König
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Uwe Kleine-König
     
  • Once isolated, the current pfn will no longer need to be scanned and
    isolated if another round is necessary, so push the isolate_migratepages
    search base of the given compact_control one step ahead.

    Signed-off-by: Hillf Danton
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hillf Danton
     
  • Tell the page allocator that pages allocated for a buffered write are
    expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Tell the page allocator that pages allocated through
    grab_cache_page_write_begin() are expected to become dirty soon.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Reviewed-by: Minchan Kim
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The maximum number of dirty pages that exist in the system at any time is
    determined by a number of pages considered dirtyable and a user-configured
    percentage of those, or an absolute number in bytes.

    This number of dirtyable pages is the sum of memory provided by all the
    zones in the system minus their lowmem reserves and high watermarks, so
    that the system can retain a healthy number of free pages without having
    to reclaim dirty pages.

    But there is a flaw in that we have a zoned page allocator which does not
    care about the global state but rather the state of individual memory
    zones. And right now there is nothing that prevents one zone from filling
    up with dirty pages while other zones are spared, which frequently leads
    to situations where kswapd, in order to restore the watermark of free
    pages, does indeed have to write pages from that zone's LRU list. This
    can interfere so badly with IO from the flusher threads that major
    filesystems (btrfs, xfs, ext4) mostly ignore write requests from reclaim
    already, taking away the VM's only possibility to keep such a zone
    balanced, aside from hoping the flushers will soon clean pages from that
    zone.

    Enter per-zone dirty limits. They are to a zone's dirtyable memory what
    the global limit is to the global amount of dirtyable memory, and try to
    make sure that no single zone receives more than its fair share of the
    globally allowed dirty pages in the first place. As the number of pages
    considered dirtyable excludes the zones' lowmem reserves and high
    watermarks, the maximum number of dirty pages in a zone is such that the
    zone can always be balanced without requiring page cleaning.

    As this is a placement decision in the page allocator and pages are
    dirtied only after the allocation, this patch allows allocators to pass
    __GFP_WRITE when they know in advance that the page will be written to and
    become dirty soon. The page allocator will then attempt to allocate from
    the first zone of the zonelist - which on NUMA is determined by the task's
    NUMA memory policy - that has not exceeded its dirty limit.

    At first glance, it would appear that the diversion to lower zones can
    increase pressure on them, but this is not the case. With a full high
    zone, allocations will be diverted to lower zones eventually, so it is
    more of a shift in timing of the lower zone allocations. Workloads that
    previously could fit their dirty pages completely in the higher zone may
    be forced to allocate from lower zones, but the amount of pages that
    "spill over" are limited themselves by the lower zones' dirty constraints,
    and thus unlikely to become a problem.

    For now, the problem of unfair dirty page distribution remains for NUMA
    configurations where the zones allowed for allocation are in sum not big
    enough to trigger the global dirty limits, wake up the flusher threads and
    remedy the situation. Because of this, an allocation that could not
    succeed on any of the considered zones is allowed to ignore the dirty
    limits before going into direct reclaim or even failing the allocation,
    until a future patch changes the global dirty throttling and flusher
    thread activation so that they take individual zone states into account.

    Test results

    15M DMA + 3246M DMA32 + 504M Normal = 3765M memory
    40% dirty ratio
    16G USB thumb drive
    10 runs of dd if=/dev/zero of=disk/zeroes bs=32k count=$((10 << 15))

                 seconds             nr_vmscan_write
                 (stddev)            min |      median |         max
    xfs
    vanilla:   549.747 ( 3.492)        0.000 |       0.000 |       0.000
    patched:   550.996 ( 3.802)        0.000 |       0.000 |       0.000

    fuse-ntfs
    vanilla:  1183.094 (53.178)    54349.000 |   59341.000 |   65163.000
    patched:   558.049 (17.914)        0.000 |       0.000 |      43.000

    btrfs
    vanilla:   573.679 (14.015)   156657.000 |  460178.000 |  606926.000
    patched:   563.365 (11.368)        0.000 |       0.000 |    1362.000

    ext4
    vanilla:   561.197 (15.782)        0.000 | 2725438.000 | 4143837.000
    patched:   568.806 (17.496)        0.000 |       0.000 |       0.000

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Tested-by: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The next patch will introduce per-zone dirty limiting functions in
    addition to the traditional global dirty limiting.

    Rename determine_dirtyable_memory() to global_dirtyable_memory() before
    adding the zone-specific version, and fix up its documentation.

    Also, move the functions to determine the dirtyable memory and the
    function to calculate the dirty limit based on that together so that their
    relationship is more apparent and that they can be commented on as a
    group.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Reviewed-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Rik van Riel
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-zone dirty limits try to distribute page cache pages allocated for
    writing across zones in proportion to the individual zone sizes, to reduce
    the likelihood of reclaim having to write back individual pages from the
    LRU lists in order to make progress.

    This patch:

    The amount of dirtyable pages should not include the full number of free
    pages: there is a number of reserved pages that the page allocator and
    kswapd always try to keep free.

    The closer (reclaimable pages - dirty pages) is to the number of reserved
    pages, the more likely it becomes for reclaim to run into dirty pages:

    +----------+ ---
    |   anon   |  |
    +----------+  |
    |          |  |
    |          |  -- dirty limit new    -- flusher new
    |   file   |  |                     |
    |          |  |                     |
    |          |  -- dirty limit old    -- flusher old
    |          |  |
    +----------+ --- reclaim
    | reserved |
    +----------+
    |  kernel  |
    +----------+

    This patch introduces a per-zone dirty reserve that takes both the lowmem
    reserve as well as the high watermark of the zone into account, and a
    global sum of those per-zone values that is subtracted from the global
    amount of dirtyable pages. The lowmem reserve is unavailable to page
    cache allocations and kswapd tries to keep the high watermark free. We
    don't want to end up in a situation where reclaim has to clean pages in
    order to balance zones.

    Not treating reserved pages as dirtyable on a global level is only a
    conceptual fix. In reality, dirty pages are not distributed equally
    across zones and reclaim runs into dirty pages on a regular basis.

    But it is important to get this right before tackling the problem on a
    per-zone level, where the distance between reclaim and the dirty pages is
    mostly much smaller in absolute numbers.

    [akpm@linux-foundation.org: fix highmem build]
    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Reviewed-by: Michal Hocko
    Reviewed-by: Minchan Kim
    Acked-by: Mel Gorman
    Cc: KAMEZAWA Hiroyuki
    Cc: Christoph Hellwig
    Cc: Wu Fengguang
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Shaohua Li
    Cc: Chris Mason
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • If we need to know the use case, the caller's program name is critically
    important. Show it.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Johannes Weiner
    David Rientjes
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Calling alloc_pages_exact_node() means the allocation only passes the
    zonelist of a single node into the page allocator. If that node isn't
    online, its zonelist may never have been initialized, causing a strange
    oops that may not immediately be clear.

    I recently debugged an issue where node 0 wasn't online and an allocator
    was passing 0 to alloc_pages_exact_node() and it resulted in a NULL
    pointer on zonelist->_zoneref. If CONFIG_DEBUG_VM is enabled, though, it
    would be nice to catch this a bit earlier.

    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes