Eric Lee / smarc-fsl-linux-kernel

01 Nov, 2011

40 commits

584cff54e mm/mmap.c: eliminate the ret variable from mm_take_all_locks() ... Browse Code »

The ret variable is really not needed in mm_take_all_locks().

Signed-off-by: Kautuk Consul
Reviewed-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kautuk Consul
2011-11-01 08:30:49 +0800
09f363c73 vmscan: fix shrinker callback bug in fs/super.c ... Browse Code »
129

The callback must not return -1 when nr_to_scan is zero. Fix the bug in
fs/super.c and add this requirement to the callback specification.

Signed-off-by: Mikulas Patocka
Cc: Dave Chinner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mikulas Patocka
2011-11-01 08:30:49 +0800
20c8c6289 mm-add-comment-explaining-task-state-setting-in-bdi_forker_thread-fix ... Browse Code »

fiddle wording

Cc: Jan Kara
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrew Morton
2011-11-01 08:30:49 +0800
99ef0315f ksm: fix the comment of try_to_unmap_one() ... Browse Code »

try_to_unmap_one() is called by try_to_unmap_ksm(), too.

Signed-off-by: Wanlong Gao
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wanlong Gao
2011-11-01 08:30:49 +0800
de7d2b567 mm/vmalloc.c: report more vmalloc failures ... Browse Code »

Some vmalloc failure paths do not report OOM conditions.

Add warn_alloc_failed, which also does a dump_stack, to those failure
paths.

This allows more site specific vmalloc failure logging message printks to
be removed.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2011-11-01 08:30:48 +0800
f0dfcde09 kswapd: assign new_order and new_classzone_idx after wakeup in sleeping ... Browse Code »

There 2 places to read pgdat in kswapd. One is return from a successful
balance, another is waked up from kswapd sleeping. The new_order and
new_classzone_idx represent the balance input order and classzone_idx.

But current new_order and new_classzone_idx are not assigned after
kswapd_try_to_sleep(), that will cause a bug in the following scenario.

1: after a successful balance, kswapd goes to sleep, and new_order = 0;
new_classzone_idx = __MAX_NR_ZONES - 1;

2: kswapd waked up with order = 3 and classzone_idx = ZONE_NORMAL

3: in the balance_pgdat() running, a new balance wakeup happened with
order = 5, and classzone_idx = ZONE_NORMAL

4: the first wakeup(order = 3) finished successufly, return order = 3
but, the new_order is still 0, so, this balancing will be treated as a
failed balance. And then the second tighter balancing will be missed.

So, to avoid the above problem, the new_order and new_classzone_idx need
to be assigned for later successful comparison.

Signed-off-by: Alex Shi
Acked-by: Mel Gorman
Reviewed-by: Minchan Kim
Tested-by: Pádraig Brady
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex,Shi
2011-11-01 08:30:48 +0800
d1f0ece6c mm/memblock.c: small function definition fixes ... Browse Code »

warning: function 'memblock_memory_can_coalesce'
with external linkage has definition.

Signed-off-by: Jonghwan Choi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jonghwan Choi
2011-11-01 08:30:48 +0800
d2ebd0f6b kswapd: avoid unnecessary rebalance after an unsuccessful balancing ... Browse Code »

In commit 215ddd66 ("mm: vmscan: only read new_classzone_idx from pgdat
when reclaiming successfully") , Mel Gorman said kswapd is better to sleep
after a unsuccessful balancing if there is tighter reclaim request pending
in the balancing. But in the following scenario, kswapd do something that
is not matched our expectation. The patch fixes this issue.

1, Read pgdat request A (classzone_idx, order = 3)
2, balance_pgdat()
3, During pgdat, a new pgdat request B (classzone_idx, order = 5) is placed
4, balance_pgdat() returns but failed since returned order = 0
5, pgdat of request A assigned to balance_pgdat(), and do balancing again.
While the expectation behavior of kswapd should try to sleep.

Signed-off-by: Alex Shi
Reviewed-by: Tim Chen
Acked-by: Mel Gorman
Tested-by: Pádraig Brady
Cc: Rik van Riel
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alex,Shi
2011-11-01 08:30:48 +0800
64212ec56 debug-pagealloc: add support for highmem pages ... Browse Code »

This adds support for highmem pages poisoning and verification to the
debug-pagealloc feature for no-architecture support.

[akpm@linux-foundation.org: remove unneeded preempt_disable/enable]
Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-11-01 08:30:48 +0800
3ee9a4f08 mm: neaten warn_alloc_failed ... Browse Code »

Add __attribute__((format (printf...) to the function to validate format
and arguments. Use vsprintf extension %pV to avoid any possible message
interleaving. Coalesce format string. Convert printks/pr_warning to
pr_warn.

[akpm@linux-foundation.org: use the __printf() macro]
Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2011-11-01 08:30:48 +0800
06d5e032a include/asm-generic/page.h: calculate virt_to_page and page_to_virt via predefined macro ... Browse Code »

On NOMMU architectures, if physical memory doesn't start from 0,
ARCH_PFN_OFFSET is defined to generate page index in mem_map array.
Because virtual address is equal to physical address, PAGE_OFFSET is
always 0. virt_to_page and page_to_virt should not index page by
PAGE_OFFSET directly.

Signed-off-by: Sonic Zhang
Cc: Greg Ungerer
Cc: Geert Uytterhoeven
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sonic Zhang
2011-11-01 08:30:48 +0800
37a1c49a9 thp: mremap support and TLB optimization ... Browse Code »

This adds THP support to mremap (decreases the number of split_huge_page()
calls).

Here are also some benchmarks with a proggy like this:

===
#define _GNU_SOURCE
#include
#include
#include
#include
#include

#define SIZE (5UL*1024*1024*1024)

int main()
{
static struct timeval oldstamp, newstamp;
long diffsec;
char *p, *p2, *p3, *p4;
if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
perror("memalign"), exit(1);
if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
perror("memalign"), exit(1);

memset(p, 0xff, SIZE);
memset(p2, 0xff, SIZE);
memset(p3, 0x77, 4096);
gettimeofday(&oldstamp, NULL);
p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
gettimeofday(&newstamp, NULL);
diffsec = newstamp.tv_sec - oldstamp.tv_sec;
diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
printf("usec %ld\n", diffsec);
if (p == MAP_FAILED || p4 != p3)
//if (p == MAP_FAILED)
perror("mremap"), exit(1);
if (memcmp(p4, p2, SIZE))
printf("mremap bug\n"), exit(1);
printf("ok\n");

return 0;
}
===

THP on

Performance counter stats for './largepage13' (3 runs):

69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

7.314454164 seconds time elapsed ( +- 0.023% )

THP off

Performance counter stats for './largepage13' (3 runs):

1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

9.693655243 seconds time elapsed ( +- 0.069% )

grep thp /proc/vmstat
thp_fault_alloc 35849
thp_fault_fallback 0
thp_collapse_alloc 3
thp_collapse_alloc_failed 0
thp_split 0

thp_split 0 confirms no thp split despite plenty of hugepages allocated.

The measurement of only the mremap time (so excluding the 3 long
memset and final long 10GB memory accessing memcmp):

THP on

usec 14824
usec 14862
usec 14859

THP off

usec 256416
usec 255981
usec 255847

With an older kernel without the mremap optimizations (the below patch
optimizes the non THP version too).

THP on

usec 392107
usec 390237
usec 404124

THP off

usec 444294
usec 445237
usec 445820

I guess with a threaded program that sends more IPI on large SMP it'd
create an even larger difference.

All debug options are off except DEBUG_VM to avoid skewing the
results.

The only problem for native 2M mremap like it happens above both the
source and destination address must be 2M aligned or the hugepmd can't be
moved without a split but that is an hardware limitation.

[akpm@linux-foundation.org: coding-style nitpicking]
Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-11-01 08:30:48 +0800
7b6efc2bc mremap: avoid sending one IPI per page ... Browse Code »

This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
flush_tlb_range() at the end of the loop, to avoid sending one IPI for
each page.

The mmu_notifier_invalidate_range_start/end section is enlarged
accordingly but this is not going to fundamentally change things. It was
more by accident that the region under mremap was for the most part still
available for secondary MMUs: the primary MMU was never allowed to
reliably access that region for the duration of the mremap (modulo
trapping SIGSEGV on the old address range which sounds unpractical and
flakey). If users wants secondary MMUs not to lose access to a large
region under mremap they should reduce the mremap size accordingly in
userland and run multiple calls. Overall this will run faster so it's
actually going to reduce the time the region is under mremap for the
primary MMU which should provide a net benefit to apps.

For KVM this is a noop because the guest physical memory is never
mremapped, there's just no point it ever moving it while guest runs. One
target of this optimization is JVM GC (so unrelated to the mmu notifier
logic).

Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-11-01 08:30:48 +0800
ebed48460 mremap: check for overflow using deltas ... Browse Code »

Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
those are reasonable requirements but the check remains obscure and it
looks more like an off by one error than an overflow check. This I feel
will improve readability.

Signed-off-by: Andrea Arcangeli
Acked-by: Johannes Weiner
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrea Arcangeli
2011-11-01 08:30:47 +0800
666167205 memblock: add NO_BOOTMEM config symbol ... Browse Code »

With the NO_BOOTMEM symbol added architectures may now use the following
syntax to tell that they do not need bootmem:

select NO_BOOTMEM

This is much more convinient than adding a new kconfig symbol which was
otherwise required.

Adding this symbol does not conflict with the architctures that already
define their own symbol.

Signed-off-by: Sam Ravnborg
Cc: Yinghai Lu
Acked-by: Tejun Heo
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sam Ravnborg
2011-11-01 08:30:47 +0800
0a93ebef6 memblock: add memblock_start_of_DRAM() ... Browse Code »

SPARC32 require access to the start address. Add a new helper
memblock_start_of_DRAM() to give access to the address of the first
memblock - which contains the lowest address.

The awkward name was chosen to match the already present
memblock_end_of_DRAM().

Signed-off-by: Sam Ravnborg
Cc: "David S. Miller"
Cc: Yinghai Lu
Acked-by: Tejun Heo
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sam Ravnborg
2011-11-01 08:30:47 +0800
f5252e009 mm: avoid null pointer access in vm_struct via /proc/vmallocinfo ... Browse Code »
2

The /proc/vmallocinfo shows information about vmalloc allocations in
vmlist that is a linklist of vm_struct. It, however, may access pages
field of vm_struct where a page was not allocated. This results in a null
pointer access and leads to a kernel panic.

Why this happens: In __vmalloc_node_range() called from vmalloc(), newly
allocated vm_struct is added to vmlist at __get_vm_area_node() and then,
some fields of vm_struct such as nr_pages and pages are set at
__vmalloc_area_node(). In other words, it is added to vmlist before it is
fully initialized. At the same time, when the /proc/vmallocinfo is read,
it accesses the pages field of vm_struct according to the nr_pages field
at show_numa_info(). Thus, a null pointer access happens.

The patch adds the newly allocated vm_struct to the vmlist *after* it is
fully initialized. So, it can avoid accessing the pages field with
unallocated page when show_numa_info() is called.

Signed-off-by: Mitsuo Hayasaka
Cc: Andrew Morton
Cc: David Rientjes
Cc: Namhyung Kim
Cc: "Paul E. McKenney"
Cc: Jeremy Fitzhardinge
Cc:
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mitsuo Hayasaka
2011-11-01 08:30:47 +0800
8c5fb8ead mm/debug-pagealloc.c: use memchr_inv ... Browse Code »

Use newly introduced memchr_inv() for page verification.

Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-11-01 08:30:47 +0800
798248206 lib/string.c: introduce memchr_inv() ... Browse Code »

memchr_inv() is mainly used to check whether the whole buffer is filled
with just a specified byte.

The function name and prototype are stolen from logfs and the
implementation is from SLUB.

Signed-off-by: Akinobu Mita
Acked-by: Christoph Lameter
Acked-by: Pekka Enberg
Cc: Matt Mackall
Acked-by: Joern Engel
Cc: Marcin Slusarz
Cc: Eric Dumazet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-11-01 08:30:47 +0800
77311139f mm/debug-pagealloc.c: use plain __ratelimit() instead of printk_ratelimit() ... Browse Code »

printk_ratelimit() should not be used, because it shares ratelimiting
state with all other unrelated printk_ratelimit() callsites.

Signed-off-by: Akinobu Mita
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Akinobu Mita
2011-11-01 08:30:47 +0800
16fb95123 vmscan: count pages into balanced for zone with good watermark ... Browse Code »

It's possible a zone watermark is ok when entering the balance_pgdat()
loop, while the zone is within the requested classzone_idx. Count pages
from this zone into `balanced'. In this way, we can skip shrinking zones
too much for high order allocation.

Signed-off-by: Shaohua Li
Acked-by: Mel Gorman
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2011-11-01 08:30:47 +0800
49ea7eb65 mm: vmscan: immediately reclaim end-of-LRU dirty pages when writeback completes ... Browse Code »

When direct reclaim encounters a dirty page, it gets recycled around the
LRU for another cycle. This patch marks the page PageReclaim similar to
deactivate_page() so that the page gets reclaimed almost immediately after
the page gets cleaned. This is to avoid reclaiming clean pages that are
younger than a dirty page encountered at the end of the LRU that might
have been something like a use-once page.

Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:47 +0800
92df3a723 mm: vmscan: throttle reclaim if encountering too many dirty pages under writeback ... Browse Code »

Workloads that are allocating frequently and writing files place a large
number of dirty pages on the LRU. With use-once logic, it is possible for
them to reach the end of the LRU quickly requiring the reclaimer to scan
more to find clean pages. Ordinarily, processes that are dirtying memory
will get throttled by dirty balancing but this is a global heuristic and
does not take into account that LRUs are maintained on a per-zone basis.
This can lead to a situation whereby reclaim is scanning heavily, skipping
over a large number of pages under writeback and recycling them around the
LRU consuming CPU.

This patch checks how many of the number of pages isolated from the LRU
were dirty and under writeback. If a percentage of them under writeback,
the process will be throttled if a backing device or the zone is
congested. Note that this applies whether it is anonymous or file-backed
pages that are under writeback meaning that swapping is potentially
throttled. This is intentional due to the fact if the swap device is
congested, scanning more pages and dispatching more IO is not going to
help matters.

The percentage that must be in writeback depends on the priority. At
default priority, all of them must be dirty. At DEF_PRIORITY-1, 50% of
them must be, DEF_PRIORITY-2, 25% etc. i.e. as pressure increases the
greater the likelihood the process will get throttled to allow the flusher
threads to make some progress.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Acked-by: Johannes Weiner
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:46 +0800
f84f6e2b0 mm: vmscan: do not writeback filesystem pages in kswapd except in high priority ... Browse Code »

It is preferable that no dirty pages are dispatched for cleaning from the
page reclaim path. At normal priorities, this patch prevents kswapd
writing pages.

However, page reclaim does have a requirement that pages be freed in a
particular zone. If it is failing to make sufficient progress (reclaiming
< SWAP_CLUSTER_MAX at any priority priority), the priority is raised to
scan more pages. A priority of DEF_PRIORITY - 3 is considered to be the
point where kswapd is getting into trouble reclaiming pages. If this
priority is reached, kswapd will dispatch pages for writing.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Johannes Weiner
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:46 +0800
966dbde2c ext4: warn if direct reclaim tries to writeback pages ... Browse Code »

Direct reclaim should never writeback pages. Warn if an attempt is made.

Signed-off-by: Mel Gorman
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Johannes Weiner
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:46 +0800
94054fa3f xfs: warn if direct reclaim tries to writeback pages ... Browse Code »

Direct reclaim should never writeback pages. For now, handle the
situation and warn about it. Ultimately, this will be a BUG_ON.

Signed-off-by: Mel Gorman
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Johannes Weiner
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:46 +0800
a18bba061 mm: vmscan: remove dead code related to lumpy reclaim waiting on pages under writeback ... Browse Code »

Lumpy reclaim worked with two passes - the first which queued pages for IO
and the second which waited on writeback. As direct reclaim can no longer
write pages there is some dead code. This patch removes it but direct
reclaim will continue to wait on pages under writeback while in
synchronous reclaim mode.

Signed-off-by: Mel Gorman
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Johannes Weiner
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:46 +0800
ee72886d8 mm: vmscan: do not writeback filesystem pages in direct reclaim ... Browse Code »

Testing from the XFS folk revealed that there is still too much I/O from
the end of the LRU in kswapd. Previously it was considered acceptable by
VM people for a small number of pages to be written back from reclaim with
testing generally showing about 0.3% of pages reclaimed were written back
(higher if memory was low). That writing back a small number of pages is
ok has been heavily disputed for quite some time and Dave Chinner
explained it well;

It doesn't have to be a very high number to be a problem. IO
is orders of magnitude slower than the CPU time it takes to
flush a page, so the cost of making a bad flush decision is
very high. And single page writeback from the LRU is almost
always a bad flush decision.

To complicate matters, filesystems respond very differently to requests
from reclaim according to Christoph Hellwig;

xfs tries to write it back if the requester is kswapd
ext4 ignores the request if it's a delayed allocation
btrfs ignores the request

As a result, each filesystem has different performance characteristics
when under memory pressure and there are many pages being dirtied. In
some cases, the request is ignored entirely so the VM cannot depend on the
IO being dispatched.

The objective of this series is to reduce writing of filesystem-backed
pages from reclaim, play nicely with writeback that is already in progress
and throttle reclaim appropriately when writeback pages are encountered.
The assumption is that the flushers will always write pages faster than if
reclaim issues the IO.

A secondary goal is to avoid the problem whereby direct reclaim splices
two potentially deep call stacks together.

There is a potential new problem as reclaim has less control over how long
before a page in a particularly zone or container is cleaned and direct
reclaimers depend on kswapd or flusher threads to do the necessary work.
However, as filesystems sometimes ignore direct reclaim requests already,
it is not expected to be a serious issue.

Patch 1 disables writeback of filesystem pages from direct reclaim
entirely. Anonymous pages are still written.

Patch 2 removes dead code in lumpy reclaim as it is no longer able
to synchronously write pages. This hurts lumpy reclaim but
there is an expectation that compaction is used for hugepage
allocations these days and lumpy reclaim's days are numbered.

Patches 3-4 add warnings to XFS and ext4 if called from
direct reclaim. With patch 1, this "never happens" and is
intended to catch regressions in this logic in the future.

Patch 5 disables writeback of filesystem pages from kswapd unless
the priority is raised to the point where kswapd is considered
to be in trouble.

Patch 6 throttles reclaimers if too many dirty pages are being
encountered and the zones or backing devices are congested.

Patch 7 invalidates dirty pages found at the end of the LRU so they
are reclaimed quickly after being written back rather than
waiting for a reclaimer to find them

I consider this series to be orthogonal to the writeback work but it is
worth noting that the writeback work affects the viability of patch 8 in
particular.

I tested this on ext4 and xfs using fs_mark, a simple writeback test based
on dd and a micro benchmark that does a streaming write to a large mapping
(exercises use-once LRU logic) followed by streaming writes to a mix of
anonymous and file-backed mappings. The command line for fs_mark when
botted with 512M looked something like

./fs_mark -d /tmp/fsmark-2676 -D 100 -N 150 -n 150 -L 25 -t 1 -S0 -s 10485760

The number of files was adjusted depending on the amount of available
memory so that the files created was about 3xRAM. For multiple threads,
the -d switch is specified multiple times.

The test machine is x86-64 with an older generation of AMD processor with
4 cores. The underlying storage was 4 disks configured as RAID-0 as this
was the best configuration of storage I had available. Swap is on a
separate disk. Dirty ratio was tuned to 40% instead of the default of
20%.

Testing was run with and without monitors to both verify that the patches
were operating as expected and that any performance gain was real and not
due to interference from monitors.

Here is a summary of results based on testing XFS.

512M1P-xfs Files/s mean 32.69 ( 0.00%) 34.44 ( 5.08%)
512M1P-xfs Elapsed Time fsmark 51.41 48.29
512M1P-xfs Elapsed Time simple-wb 114.09 108.61
512M1P-xfs Elapsed Time mmap-strm 113.46 109.34
512M1P-xfs Kswapd efficiency fsmark 62% 63%
512M1P-xfs Kswapd efficiency simple-wb 56% 61%
512M1P-xfs Kswapd efficiency mmap-strm 44% 42%
512M-xfs Files/s mean 30.78 ( 0.00%) 35.94 (14.36%)
512M-xfs Elapsed Time fsmark 56.08 48.90
512M-xfs Elapsed Time simple-wb 112.22 98.13
512M-xfs Elapsed Time mmap-strm 219.15 196.67
512M-xfs Kswapd efficiency fsmark 54% 56%
512M-xfs Kswapd efficiency simple-wb 54% 55%
512M-xfs Kswapd efficiency mmap-strm 45% 44%
512M-4X-xfs Files/s mean 30.31 ( 0.00%) 33.33 ( 9.06%)
512M-4X-xfs Elapsed Time fsmark 63.26 55.88
512M-4X-xfs Elapsed Time simple-wb 100.90 90.25
512M-4X-xfs Elapsed Time mmap-strm 261.73 255.38
512M-4X-xfs Kswapd efficiency fsmark 49% 50%
512M-4X-xfs Kswapd efficiency simple-wb 54% 56%
512M-4X-xfs Kswapd efficiency mmap-strm 37% 36%
512M-16X-xfs Files/s mean 60.89 ( 0.00%) 65.22 ( 6.64%)
512M-16X-xfs Elapsed Time fsmark 67.47 58.25
512M-16X-xfs Elapsed Time simple-wb 103.22 90.89
512M-16X-xfs Elapsed Time mmap-strm 237.09 198.82
512M-16X-xfs Kswapd efficiency fsmark 45% 46%
512M-16X-xfs Kswapd efficiency simple-wb 53% 55%
512M-16X-xfs Kswapd efficiency mmap-strm 33% 33%

Up until 512-4X, the FSmark improvements were statistically significant.
For the 4X and 16X tests the results were within standard deviations but
just barely. The time to completion for all tests is improved which is an
important result. In general, kswapd efficiency is not affected by
skipping dirty pages.

1024M1P-xfs
1024M1P-xfs
1024M1P-xfs
1024M1P-xfs
1024M1P-xfs
1024M1P-xfs
1024M1P-xfs
1024M-xfs
1024M-xfs
1024M-xfs
1024M-xfs
1024M-xfs
1024M-xfs
1024M-xfs
1024M-4X-xfs
1024M-4X-xfs
1024M-4X-xfs
1024M-4X-xfs
1024M-4X-xfs
1024M-4X-xfs
1024M-4X-xfs
1024M-16X-xfs
1024M-16X-xfs
1024M-16X-xfs
1024M-16X-xfs
1024M-16X-xfs
1024M-16X-xfs
1024M-16X-xfs Files/s mean 39.09 ( 0.00%) 41.15 ( 5.01%) Elapsed Time fsmark 84.14 80.41 Elapsed Time simple-wb 210.77 184.78 Elapsed Time mmap-strm 162.00 160.34 Kswapd efficiency fsmark 69% 75% Kswapd efficiency simple-wb 71% 77% Kswapd efficiency mmap-strm 43% 44% Files/s mean 35.45 ( 0.00%) 37.00 ( 4.19%) Elapsed Time fsmark 94.59 91.00 Elapsed Time simple-wb 229.84 195.08 Elapsed Time mmap-strm 405.38 440.29 Kswapd efficiency fsmark 79% 71% Kswapd efficiency simple-wb 74% 74% Kswapd efficiency mmap-strm 39% 42% Files/s mean 32.63 ( 0.00%) 35.05 ( 6.90%) Elapsed Time fsmark 103.33 97.74 Elapsed Time simple-wb 204.48 178.57 Elapsed Time mmap-strm 528.38 511.88 Kswapd efficiency fsmark 81% 70% Kswapd efficiency simple-wb 73% 72% Kswapd efficiency mmap-strm 39% 38% Files/s mean 42.65 ( 0.00%) 42.97 ( 0.74%) Elapsed Time fsmark 103.11 99.11 Elapsed Time simple-wb 200.83 178.24 Elapsed Time mmap-strm 397.35 459.82 Kswapd efficiency fsmark 84% 69% Kswapd efficiency simple-wb 74% 73% Kswapd efficiency mmap-strm 39% 40%

All FSMark tests up to 16X had statistically significant improvements.
For the most part, tests are completing faster with the exception of the
streaming writes to a mixture of anonymous and file-backed mappings which
were slower in two cases

In the cases where the mmap-strm tests were slower, there was more
swapping due to dirty pages being skipped. The number of additional pages
swapped is almost identical to the fewer number of pages written from
reclaim. In other words, roughly the same number of pages were reclaimed
but swapping was slower. As the test is a bit unrealistic and stresses
memory heavily, the small shift is acceptable.

4608M1P-xfs Files/s mean 29.75 ( 0.00%) 30.96 ( 3.91%)
4608M1P-xfs Elapsed Time fsmark 512.01 492.15
4608M1P-xfs Elapsed Time simple-wb 618.18 566.24
4608M1P-xfs Elapsed Time mmap-strm 488.05 465.07
4608M1P-xfs Kswapd efficiency fsmark 93% 86%
4608M1P-xfs Kswapd efficiency simple-wb 88% 84%
4608M1P-xfs Kswapd efficiency mmap-strm 46% 45%
4608M-xfs Files/s mean 27.60 ( 0.00%) 28.85 ( 4.33%)
4608M-xfs Elapsed Time fsmark 555.96 532.34
4608M-xfs Elapsed Time simple-wb 659.72 571.85
4608M-xfs Elapsed Time mmap-strm 1082.57 1146.38
4608M-xfs Kswapd efficiency fsmark 89% 91%
4608M-xfs Kswapd efficiency simple-wb 88% 82%
4608M-xfs Kswapd efficiency mmap-strm 48% 46%
4608M-4X-xfs Files/s mean 26.00 ( 0.00%) 27.47 ( 5.35%)
4608M-4X-xfs Elapsed Time fsmark 592.91 564.00
4608M-4X-xfs Elapsed Time simple-wb 616.65 575.07
4608M-4X-xfs Elapsed Time mmap-strm 1773.02 1631.53
4608M-4X-xfs Kswapd efficiency fsmark 90% 94%
4608M-4X-xfs Kswapd efficiency simple-wb 87% 82%
4608M-4X-xfs Kswapd efficiency mmap-strm 43% 43%
4608M-16X-xfs Files/s mean 26.07 ( 0.00%) 26.42 ( 1.32%)
4608M-16X-xfs Elapsed Time fsmark 602.69 585.78
4608M-16X-xfs Elapsed Time simple-wb 606.60 573.81
4608M-16X-xfs Elapsed Time mmap-strm 1549.75 1441.86
4608M-16X-xfs Kswapd efficiency fsmark 98% 98%
4608M-16X-xfs Kswapd efficiency simple-wb 88% 82%
4608M-16X-xfs Kswapd efficiency mmap-strm 44% 42%

Unlike the other tests, the fsmark results are not statistically
significant but the min and max times are both improved and for the most
part, tests completed faster.

There are other indications that this is an improvement as well. For
example, in the vast majority of cases, there were fewer pages scanned by
direct reclaim implying in many cases that stalls due to direct reclaim
are reduced. KSwapd is scanning more due to skipping dirty pages which is
unfortunate but the CPU usage is still acceptable

In an earlier set of tests, I used blktrace and in almost all cases
throughput throughout the entire test was higher. However, I ended up
discarding those results as recording blktrace data was too heavy for my
liking.

On a laptop, I plugged in a USB stick and ran a similar tests of tests
using it as backing storage. A desktop environment was running and for
the entire duration of the tests, firefox and gnome terminal were
launching and exiting to vaguely simulate a user.

1024M-xfs Files/s mean 0.41 ( 0.00%) 0.44 ( 6.82%)
1024M-xfs Elapsed Time fsmark 2053.52 1641.03
1024M-xfs Elapsed Time simple-wb 1229.53 768.05
1024M-xfs Elapsed Time mmap-strm 4126.44 4597.03
1024M-xfs Kswapd efficiency fsmark 84% 85%
1024M-xfs Kswapd efficiency simple-wb 92% 81%
1024M-xfs Kswapd efficiency mmap-strm 60% 51%
1024M-xfs Avg wait ms fsmark 5404.53 4473.87
1024M-xfs Avg wait ms simple-wb 2541.35 1453.54
1024M-xfs Avg wait ms mmap-strm 3400.25 3852.53

The mmap-strm results were hurt because firefox launching had a tendency
to push the test out of memory. On the postive side, firefox launched
marginally faster with the patches applied. Time to completion for many
tests was faster but more importantly - the "Avg wait" time as measured by
iostat was far lower implying the system would be more responsive. It was
also the case that "Avg wait ms" on the root filesystem was lower. I
tested it manually and while the system felt slightly more responsive
while copying data to a USB stick, it was marginal enough that it could be
my imagination.

This patch: do not writeback filesystem pages in direct reclaim.

When kswapd is failing to keep zones above the min watermark, a process
will enter direct reclaim in the same manner kswapd does. If a dirty page
is encountered during the scan, this page is written to backing storage
using mapping->writepage.

This causes two problems. First, it can result in very deep call stacks,
particularly if the target storage or filesystem are complex. Some
filesystems ignore write requests from direct reclaim as a result. The
second is that a single-page flush is inefficient in terms of IO. While
there is an expectation that the elevator will merge requests, this does
not always happen. Quoting Christoph Hellwig;

The elevator has a relatively small window it can operate on,
and can never fix up a bad large scale writeback pattern.

This patch prevents direct reclaim writing back filesystem pages by
checking if current is kswapd. Anonymous pages are still written to swap
as there is not the equivalent of a flusher thread for anonymous pages.
If the dirty pages cannot be written back, they are placed back on the LRU
lists. There is now a direct dependency on dirty page balancing to
prevent too many pages in the system being dirtied which would prevent
reclaim making forward progress.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Cc: Dave Chinner
Cc: Christoph Hellwig
Cc: Johannes Weiner
Cc: Wu Fengguang
Cc: Jan Kara
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Alex Elder
Cc: Theodore Ts'o
Cc: Chris Mason
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Mel Gorman
2011-11-01 08:30:46 +0800
e10d59f2c mm: add comments to explain mm_struct fields ... Browse Code »

Add comments to explain the page statistics field in the mm_struct.

[akpm@linux-foundation.org: add missing ;]
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2011-11-01 08:30:46 +0800
bc3e53f68 mm: distinguish between mlocked and pinned pages ... Browse Code »
43

Some kernel components pin user space memory (infiniband and perf) (by
increasing the page count) and account that memory as "mlocked".

The difference between mlocking and pinning is:

A. mlocked pages are marked with PG_mlocked and are exempt from
swapping. Page migration may move them around though.
They are kept on a special LRU list.

B. Pinned pages cannot be moved because something needs to
directly access physical memory. They may not be on any
LRU list.

I recently saw an mlockalled process where mm->locked_vm became
bigger than the virtual size of the process (!) because some
memory was accounted for twice:

Once when the page was mlocked and once when the Infiniband
layer increased the refcount because it needt to pin the RDMA
memory.

This patch introduces a separate counter for pinned pages and
accounts them seperately.

Signed-off-by: Christoph Lameter
Cc: Mike Marciniszyn
Cc: Roland Dreier
Cc: Sean Hefty
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Lameter
2011-11-01 08:30:46 +0800
f11c0ca50 mm: vmscan: drop nr_force_scan[] from get_scan_count ... Browse Code »

The nr_force_scan[] tuple holds the effective scan numbers for anon and
file pages in case the situation called for a forced scan and the
regularly calculated scan numbers turned out zero.

However, the effective scan number can always be assumed to be
SWAP_CLUSTER_MAX right before the division into anon and file. The
numerators and denominator are properly set up for all cases, be it force
scan for just file, just anon, or both, to do the right thing.

Signed-off-by: Johannes Weiner
Reviewed-by: Minchan Kim
Acked-by: KAMEZAWA Hiroyuki
Reviewed-by: Michal Hocko
Cc: Ying Han
Cc: Balbir Singh
Cc: KOSAKI Motohiro
Cc: Daisuke Nishimura
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-11-01 08:30:46 +0800
4f31888c1 mm: output a list of loaded modules when we hit bad_page() ... Browse Code »

When we get a bad_page bug report, it's useful to see what modules the
user had loaded.

Signed-off-by: Dave Jones
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Dave Jones
2011-11-01 08:30:45 +0800
f5fc870da tmpfs: add "tmpfs" to the Kconfig prompt to make it obvious. ... Browse Code »

Add the leading word "tmpfs" to the Kconfig string to make it blindingly
obvious that this selection refers to tmpfs.

Signed-off-by: Robert P. J. Day
Acked-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Robert P. J. Day
2011-11-01 08:30:45 +0800
43362a497 oom: fix race while temporarily setting current's oom_score_adj ... Browse Code »

test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
current's oom_score_adj for ksm and swapoff without requiring an
additional per-process flag.

Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
then reinstate the previous value is racy since it's possible that
userspace can set the value to something else itself before the old value
is reinstated. That results in userspace setting current's oom_score_adj
to a different value and then the kernel immediately setting it back to
its previous value without notification.

To fix this, a new compare_swap_oom_score_adj() function is introduced
with the same semantics as the compare and swap CAS instruction, or
CMPXCHG on x86. It is used to reinstate the previous value of
oom_score_adj if and only if the present value is the same as the old
value.

Signed-off-by: David Rientjes
Cc: Oleg Nesterov
Cc: Ying Han
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2011-11-01 08:30:45 +0800
c9f01245b oom: remove oom_disable_count ... Browse Code »

This removes mm->oom_disable_count entirely since it's unnecessary and
currently buggy. The counter was intended to be per-process but it's
currently decremented in the exit path for each thread that exits, causing
it to underflow.

The count was originally intended to prevent oom killing threads that
share memory with threads that cannot be killed since it doesn't lead to
future memory freeing. The counter could be fixed to represent all
threads sharing the same mm, but it's better to remove the count since:

- it is possible that the OOM_DISABLE thread sharing memory with the
victim is waiting on that thread to exit and will actually cause
future memory freeing, and

- there is no guarantee that a thread is disabled from oom killing just
because another thread sharing its mm is oom disabled.

Signed-off-by: David Rientjes
Reported-by: Oleg Nesterov
Reviewed-by: Oleg Nesterov
Cc: Ying Han
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2011-11-01 08:30:45 +0800
7b0d44fa4 oom: avoid killing kthreads if they assume the oom killed thread's mm ... Browse Code »

After selecting a task to kill, the oom killer iterates all processes and
kills all other threads that share the same mm_struct in different thread
groups. It would not otherwise be helpful to kill a thread if its memory
would not be subsequently freed.

A kernel thread, however, may assume a user thread's mm by using
use_mm(). This is only temporary and should not result in sending a
SIGKILL to that kthread.

This patch ensures that only user threads and not kthreads are sent a
SIGKILL if they share the same mm_struct as the oom killed task.

Signed-off-by: David Rientjes
Reviewed-by: Michal Hocko
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2011-11-01 08:30:45 +0800
f660daac4 oom: thaw threads if oom killed thread is frozen before deferring ... Browse Code »

If a thread has been oom killed and is frozen, thaw it before returning to
the page allocator. Otherwise, it can stay frozen indefinitely and no
memory will be freed.

Signed-off-by: David Rientjes
Reported-by: Konstantin Khlebnikov
Cc: Oleg Nesterov
Cc: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki
Cc: "Rafael J. Wysocki"
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

David Rientjes
2011-11-01 08:30:45 +0800
d08c429b0 mm/page-writeback.c: document bdi_min_ratio ... Browse Code »

Looks like someone got distracted after adding the comment characters.

Signed-off-by: Johannes Weiner
Acked-by: Peter Zijlstra
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Johannes Weiner
2011-11-01 08:30:45 +0800
3da367c3e vmscan: add block plug for page reclaim ... Browse Code »

per-task block plug can reduce block queue lock contention and increase
request merge. Currently page reclaim doesn't support it. I originally
thought page reclaim doesn't need it, because kswapd thread count is
limited and file cache write is done at flusher mostly.

When I test a workload with heavy swap in a 4-node machine, each CPU is
doing direct page reclaim and swap. This causes block queue lock
contention. In my test, without below patch, the CPU utilization is about
2% ~ 7%. With the patch, the CPU utilization is about 1% ~ 3%. Disk
throughput isn't changed. This should improve normal kswapd write and
file cache write too (increase request merge for example), but might not
be so obvious as I explain above.

Signed-off-by: Shaohua Li
Cc: Jens Axboe
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Shaohua Li
2011-11-01 08:30:45 +0800
3fa36acbc radix_tree: clean away saw_unset_tag leftovers ... Browse Code »

radix_tree_tag_get()'s BUG (when it sees a tag after saw_unset_tag) was
unsafe and removed in 2.6.34, but the pointless saw_unset_tag left behind.

Remove it now, and return 0 as soon as we see unset tag - we already rely
upon the root tag to be correct, returning 0 immediately if it's not set.

Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hugh Dickins
2011-11-01 08:30:45 +0800