25 May, 2010

6 commits

  • For now we have only global isolation versus memory control group
    isolation; do not allow the reclaim entry function to set an arbitrary
    page isolation callback, as we do not need that flexibility.

    And since we already pass around the group descriptor for the memory
    control group isolation case, just use it to decide which one of the two
    isolator functions to use.

    The decisions can be merged into nearby branches, so no extra cost there.
    In fact, we save the indirect calls.
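
    A minimal standalone sketch of the idea, with illustrative names rather
    than the actual kernel interfaces: instead of letting callers install an
    arbitrary isolation callback in the scan control, branch on the presence
    of the memory control group descriptor and call one of the two known
    isolators directly.

      struct mem_cgroup;                        /* opaque group descriptor */
      struct list_head;                         /* destination page list */

      static unsigned long isolate_pages_global(unsigned long nr,
                                                struct list_head *dst)
      {
              return nr;                        /* stands in for the global isolator */
      }

      static unsigned long isolate_pages_memcg(unsigned long nr,
                                               struct list_head *dst,
                                               struct mem_cgroup *memcg)
      {
              return nr;                        /* stands in for the memcg isolator */
      }

      static unsigned long isolate_pages(unsigned long nr, struct list_head *dst,
                                         struct mem_cgroup *memcg)
      {
              /* one predictable branch instead of an indirect call */
              if (memcg)
                      return isolate_pages_memcg(nr, dst, memcg);
              return isolate_pages_global(nr, dst);
      }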

    Signed-off-by: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This scan control is abused to communicate a return value from
    shrink_zones(). Write this idiomatically and remove the knob.
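
    A tiny sketch of what "idiomatically" means here (illustrative, not the
    actual diff): the result is returned from the function instead of being
    smuggled out through a field of the scan control structure.

      #include <stdbool.h>

      /* before: a knob in the scan control carried the answer back */
      struct scan_control { bool all_unreclaimable; };

      /* after: the answer is simply the return value */
      static bool shrink_zones(void)
      {
              bool all_unreclaimable = true;
              /* ... scan zones; clear the flag when any zone makes progress ... */
              return all_unreclaimable;
      }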

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • When vmscan is in lumpy reclaim mode, it has to ignore the referenced bit
    in order to create contiguous free pages, but the current
    page_check_references() doesn't.

    Fix it.
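
    A standalone sketch of the intended behaviour (names are illustrative,
    not the exact kernel code): when reclaim runs in lumpy mode to build
    contiguous free ranges, the reference check must not rescue a page merely
    because it was recently referenced.

      enum page_references { PAGEREF_RECLAIM, PAGEREF_KEEP, PAGEREF_ACTIVATE };

      struct scan_control { int lumpy_reclaim_mode; };

      static enum page_references
      page_check_references(const struct scan_control *sc, int referenced)
      {
              /* lumpy reclaim: ignore the referenced bit entirely */
              if (sc->lumpy_reclaim_mode)
                      return PAGEREF_RECLAIM;

              return referenced ? PAGEREF_ACTIVATE : PAGEREF_RECLAIM;
      }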

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_scan_ratio() calculates a percentage, and if that percentage is < 1%
    it rounds it down to 0%, causing us to completely skip scanning anon/file
    pages for reclaim even when the total number of anon/file pages is very
    large.

    To avoid the underflow, we no longer use a percentage; instead we directly
    calculate how many pages should be scanned. This way we still get several
    pages scanned for ratios below 1%.

    This has some benefits:

    1. It increases our calculation precision.

    2. It makes scanning smoother. Without this, if percent[x] underflows,
    shrink_zone() doesn't scan any pages and then suddenly scans all pages
    once priority reaches zero. With this, shrink_zone() gets a chance to
    scan some pages even when priority isn't zero.

    Note, this patch doesn't really change the logic, it just increases
    precision. For a system with a lot of memory this might slightly change
    behavior. For example, in a sequential file read workload we don't swap
    any anon pages without the patch; with it, if the anon memory size is
    bigger than 16G, we will see one anon page swapped. The 16G is calculated
    as PAGE_SIZE * priority(4096) * (fp/ap), where fp/ap is assumed to be
    1024, which is common in this workload. So the impact does not sound like
    a big deal (see the quick check below).
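
    A quick standalone check of the 16G figure (not kernel code; 4K pages and
    the quoted values are assumed):

      #include <stdio.h>

      int main(void)
      {
              unsigned long long page_size = 4096;  /* assuming 4K pages */
              unsigned long long priority  = 4096;  /* value quoted above */
              unsigned long long fp_per_ap = 1024;  /* fp/ap assumed for the workload */

              /* PAGE_SIZE * priority * (fp/ap) */
              unsigned long long bytes = page_size * priority * fp_per_ap;
              printf("%llu GiB\n", bytes >> 30);    /* prints 16 */
              return 0;
      }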

    Signed-off-by: Shaohua Li
    Cc: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Currently, vmscan.c defines the isolation modes for __isolate_lru_page().
    Memory compaction needs access to these modes for isolating pages for
    migration. This patch exports them.
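
    A sketch of what "exporting" amounts to (names and values are
    illustrative of the era, not quoted from the patch): the mode constants
    move from mm/vmscan.c into a shared header so the compaction code can
    pass them to __isolate_lru_page().

      /* previously private to mm/vmscan.c, now visible to compaction */
      #define ISOLATE_INACTIVE 0    /* isolate inactive pages only */
      #define ISOLATE_ACTIVE   1    /* isolate active pages only */
      #define ISOLATE_BOTH     2    /* isolate both active and inactive pages */

      struct page;
      int __isolate_lru_page(struct page *page, int mode, int file);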

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing all
    old disallowed bits later. But along the way, the allocator may find that
    there is no node left to allocate memory from.

    The reason is that while cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)

    task1                      task1's mpol    task2
    alloc page                 1
      alloc on node0? NO       1
                               1               change mems from 1 to 0
                               1               rebind task1's mpol
                               0-1               set new bits
                               0                 clear disallowed bits
      alloc on node1? NO       0
      ...
    can't alloc page
      goto oom

    This patch fixes the problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). A variable tells the write-side task that a read-side
    task is reading the nodemask, and the write-side task clears the newly
    disallowed nodes only after the read-side task finishes its current
    memory allocation; see the sketch below.
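
    A conceptual standalone sketch of the grow-then-shrink ordering
    (illustrative names and C11 atomics, not the kernel's actual interfaces):
    the writer first ORs in the newly allowed nodes, waits until no allocator
    is in the middle of an allocation, and only then clears the no-longer-
    allowed nodes, so a concurrent reader never observes an empty mask.

      #include <stdatomic.h>

      static atomic_int   readers_in_alloc;      /* read side: "I am allocating" */
      static atomic_ulong mems_allowed = 2;      /* one bit per node; node 1 allowed */

      static void alloc_begin(void) { atomic_fetch_add(&readers_in_alloc, 1); }
      static void alloc_end(void)   { atomic_fetch_sub(&readers_in_alloc, 1); }

      static void rebind_mems(unsigned long new_mask)
      {
              /* step 1: expand -- old|new is never empty for concurrent readers */
              atomic_fetch_or(&mems_allowed, new_mask);

              /* step 2: shrink lazily, once current allocations have finished */
              while (atomic_load(&readers_in_alloc))
                      ;                          /* real code sleeps/retries */
              atomic_store(&mems_allowed, new_mask);
      }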

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

07 Apr, 2010

1 commit

  • Shaohua Li reported that his tmpfs streaming I/O test can lead to OOM.
    The test uses a 6G tmpfs on a system with 3G of memory. The tmpfs holds 6
    copies of the kernel source, and the test runs kbuild against each copy.
    His investigation shows that the test produces a lot of rotated anon
    pages and very few file pages, so get_scan_ratio calculates percent[0]
    (i.e. the scanning percentage for anon) to be zero. Actually percent[0]
    should be a big value, but our calculation rounds it down to zero.

    We had the same problem even before commit 84b18490 ("vmscan:
    get_scan_ratio() cleanup"), but the old logic could rescue the
    percent[0]==0 case when priority==0, which hid the real issue. I didn't
    think mere streaming I/O could produce a percent[0]==0 && priority==0
    situation, but I was wrong.

    So we definitely have to fix the tmpfs streaming I/O issue, but for now
    revert the regression commit first.

    This reverts commit 84b18490d1f1bc7ed5095c929f78bc002eb70f26.

    Signed-off-by: KOSAKI Motohiro
    Reported-by: Shaohua Li
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    The percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming their availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of the conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

7 commits

  • The VM currently assumes that an inactive, mapped and referenced file page
    is in use and promotes it to the active list.

    However, every mapped file page starts out like this and thus a problem
    arises when workloads create a stream of such pages that are used only for
    a short time. By flooding the active list with those pages, the VM
    quickly gets into trouble finding eligible reclaim candidates. The result
    is long allocation latencies and eviction of the wrong pages.

    This patch reuses the PG_referenced page flag (used for unmapped file
    pages) to implement a usage detection that scales with the speed of LRU
    list cycling (i.e. memory pressure).

    If the scanner encounters such a page, the flag is set and the page is
    cycled once more on the inactive list. Only if it comes back with another
    page table reference is it activated; otherwise it is reclaimed as 'not
    recently used cache'.

    This effectively changes the minimum lifetime of a used-once mapped file
    page from a full memory cycle to an inactive list cycle, which allows it
    to occur in linear streams without affecting the stable working set of the
    system.
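
    A standalone sketch of the described decision (illustrative names, not
    the kernel function): a mapped file page found on the inactive list with
    a fresh page table reference is only marked on the first encounter; it is
    activated on the second, and reclaimed if it shows up with no new
    reference at all.

      #include <stdbool.h>

      enum verdict { RECLAIM, KEEP_ON_INACTIVE, ACTIVATE };

      struct page_state {
              bool referenced_pte;    /* young bit seen in a page table */
              bool referenced_flag;   /* PG_referenced set on a previous pass */
      };

      static enum verdict check_mapped_file_page(struct page_state *p)
      {
              if (!p->referenced_pte)
                      return RECLAIM;            /* no new reference: evict */

              if (p->referenced_flag)
                      return ACTIVATE;           /* second reference: really in use */

              p->referenced_flag = true;         /* first reference: mark it ...   */
              return KEEP_ON_INACTIVE;           /* ... and cycle it once more     */
      }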

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • page_mapping_inuse() is a historic predicate function for pages that are
    about to be reclaimed or deactivated.

    According to it, a page is in use when it is mapped into page tables OR
    part of swap cache OR backing an mmapped file.

    This function is used in combination with page_referenced(), which checks
    for young bits in ptes and the page descriptor itself for the
    PG_referenced bit. Thus, checking for unmapped swap cache pages is
    meaningless as PG_referenced is not set for anonymous pages and unmapped
    pages do not have young ptes. The test makes no difference.

    Protecting file pages that are not by themselves mapped but are part of a
    mapped file is also a historic leftover for short-lived things like the
    exec() code in libc. However, the VM now does reference accounting and
    activation of pages at unmap time and thus the special treatment on
    reclaim is obsolete.

    This patch drops page_mapping_inuse() and switches the two callsites to
    use page_mapped() directly.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The used-once mapped file page detection patchset.

    It is meant to help workloads with large amounts of shortly used file
    mappings, like rtorrent hashing a file or git when dealing with loose
    objects (git gc on a bigger site?).

    Right now, the VM activates referenced mapped file pages on first
    encounter on the inactive list and it takes a full memory cycle to
    reclaim them again. When those pages dominate memory, the system
    no longer has a meaningful notion of 'working set' and is required
    to give up the active list to make reclaim progress. Obviously,
    this results in rather bad scanning latencies and the wrong pages
    being reclaimed.

    This patch makes the VM be more careful about activating mapped file
    pages in the first place. The minimum granted lifetime without
    another memory access becomes an inactive list cycle instead of the
    full memory cycle, which is more natural given the mentioned loads.

    This test resembles a hashing rtorrent process. Sequentially, 32MB
    chunks of a file are mapped into memory, hashed (sha1) and unmapped
    again. While this happens, every 5 seconds a process is launched and
    its execution time taken:

    python2.4 -c 'import pydoc'
    old: max=2.31s mean=1.26s (0.34)
    new: max=1.25s mean=0.32s (0.32)

    find /etc -type f
    old: max=2.52s mean=1.44s (0.43)
    new: max=1.92s mean=0.12s (0.17)

    vim -c ':quit'
    old: max=6.14s mean=4.03s (0.49)
    new: max=3.48s mean=2.41s (0.25)

    mplayer --help
    old: max=8.08s mean=5.74s (1.02)
    new: max=3.79s mean=1.32s (0.81)

    overall hash time (stdev):
    old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
    new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)

    I also tested kernbench with regular IO streaming in the background to
    see whether the delayed activation of frequently used mapped file
    pages had a negative impact on performance in the presence of pressure
    on the inactive list. The patch made no significant difference in
    timing, neither for kernbench nor for the streaming IO throughput.

    The first patch submission raised concerns about the cost of the extra
    faults for actually activated pages on machines that have no hardware
    support for young page table entries.

    I created an artificial worst case scenario on an ARM machine with
    around 300MHz and 64MB of memory to figure out the dimensions
    involved. The test would mmap a file of 20MB, then

    1. touch all its pages to fault them in
    2. force one full scan cycle on the inactive file LRU
    -- old: mapping pages activated
    -- new: mapping pages inactive
    3. touch the mapping pages again
    -- old and new: fault exceptions to set the young bits
    4. force another full scan cycle on the inactive file LRU
    5. touch the mapping pages one last time
    -- new: fault exceptions to set the young bits

    The test showed an overall increase of 6% in time over 100 iterations
    of the above (old: ~212sec, new: ~225sec). 13 secs total overhead /
    (100 * 5k pages), ignoring the execution time of the test itself,
    makes for about 25us overhead for every page that gets actually
    activated. Note:

    1. File mapping the size of one third of main memory, _completely_
    in active use across memory pressure - i.e., most pages referenced
    within one LRU cycle. This should be rare to non-existent,
    especially on such embedded setups.

    2. Many huge activation batches. Those batches only occur when the
    working set fluctuates. If it changes completely between every full
    LRU cycle, you have problematic reclaim overhead anyway.

    3. Access of activated pages at maximum speed: sequential loads from
    every single page without doing anything in between. In reality,
    the extra faults will get distributed between actual operations on
    the data.

    So even if a workload manages to get the VM into the situation of
    activating a third of memory in one go on such a setup, it will take
    2.2 seconds instead of 2.1 without the patch.

    Comparing the numbers (and my user-experience over several months),
    I think this change is an overall improvement to the VM.

    Patch 1 is only refactoring to break up that ugly compound conditional
    in shrink_page_list() and make it easy to document and add new checks
    in a readable fashion.

    Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not
    strictly related to #3, but it was in the original submission and is a
    net simplification, so I kept it.

    Patch 3 implements used-once detection of mapped file pages.

    This patch:

    Moving the big conditional into its own predicate function makes the code
    a bit easier to read and allows for better commenting on the checks
    one-by-one.

    This is just cleaning up, no semantics should have been changed.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Rik van Riel
    Cc: Minchan Kim
    Cc: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit e815af95 ("change all_unreclaimable zone member to flags") changed
    the all_unreclaimable member to a bit flag, but that had an undesirable
    side effect: free_one_page() is one of the hottest paths in the Linux
    kernel, and increasing the number of atomic ops in it can reduce kernel
    performance a bit.

    Thus, this patch partially reverts that commit; at the least,
    all_unreclaimable shouldn't share a memory word with the other zone
    flags.

    [akpm@linux-foundation.org: fix patch interaction]
    Signed-off-by: KOSAKI Motohiro
    Cc: David Rientjes
    Cc: Wu Fengguang
    Cc: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Huang Shijie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim
    context annotation, but it didn't annotate zone reclaim. This patch does
    it.

    The point is that commit cf40bd16fd annotated __alloc_pages_direct_reclaim,
    but zone reclaim doesn't use __alloc_pages_direct_reclaim.

    The current call graph is

      __alloc_pages_nodemask
         get_page_from_freelist
             zone_reclaim()
         __alloc_pages_slowpath
             __alloc_pages_direct_reclaim
                 try_to_free_pages

    Actually, if zone_reclaim_mode=1, the VM never calls
    __alloc_pages_direct_reclaim under usual VM pressure.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Minchan Kim
    Acked-by: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_scan_ratio() should contain all scan-ratio related calculations.
    Thus, this patch moves some of those calculations into get_scan_ratio().

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Kswapd checks that zone has sufficient pages free via zone_watermark_ok().

    If any zone doesn't have enough pages, we set all_zones_ok to zero.
    !all_zones_ok makes kswapd retry rather than sleep.

    I think the watermark check before shrink_zone() is pointless. Only after
    kswapd has tried to shrink the zone is the check meaningful.

    Move the check to after the call to shrink_zone().

    [akpm@linux-foundation.org: fix comment, layout]
    Signed-off-by: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Rik van Riel
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

17 Jan, 2010

1 commit

  • Commit f50de2d3 (vmscan: have kswapd sleep for a short interval and double
    check it should be asleep) can cause kswapd to enter an infinite loop if
    running on a single-CPU system. If all zones are unreclaimable,
    sleeping_prematurely() returns 1 and kswapd calls balance_pgdat() again,
    but that is totally meaningless: balance_pgdat() can't do anything about
    unreclaimable zones!

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Reported-by: Will Newton
    Reviewed-by: Minchan Kim
    Reviewed-by: Rik van Riel
    Tested-by: Will Newton
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

16 Dec, 2009

11 commits

  • Simplify the code for shrink_inactive_list().

    Signed-off-by: Huang Shijie
    Reviewed-by: KOSAKI Motohiro
    Reviewed-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • In AIM7 runs, recent kernels start swapping out anonymous pages well
    before they should. This is due to shrink_list falling through to
    shrink_inactive_list if !inactive_anon_is_low(zone, sc), when all we
    really wanted to do is pre-age some anonymous pages to give them extra
    time to be referenced while on the inactive list.

    The obvious fix is to make sure that shrink_list does not fall through to
    scanning/reclaiming inactive pages when we called it to scan one of the
    active lists.

    This change should be safe because the loop in shrink_zone ensures that we
    will still shrink the anon and file inactive lists whenever we should.

    [kosaki.motohiro@jp.fujitsu.com: inactive_file_is_low() should be inactive_anon_is_low()]
    Reported-by: Larry Woodman
    Signed-off-by: Rik van Riel
    Acked-by: Johannes Weiner
    Cc: Tomasz Chmielewski
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Fix a small inconsistency between ">" and ">=".

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Now all callers of reclaim use swap_cluster_max as SWAP_CLUSTER_MAX, so we
    can remove it entirely.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In the old days we didn't have sc.nr_to_reclaim, which led to misuse of
    sc.swap_cluster_max.

    A huge sc.swap_cluster_max brings unnecessary OOM risk and no performance
    benefit.

    Now we can stop this insanity.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shrink_all_zones() was introduced by commit d6277db4ab (swsusp: rework
    memory shrinker) as a hibernation performance improvement, and
    sc.swap_cluster_max was introduced by commit a06fe4d307 (Speed freeing
    memory for suspend).

    Commit a06fe4d307 said:

    Without the patch:
    Freed 14600 pages in 1749 jiffies = 32.61 MB/s (Anomolous!)
    Freed 88563 pages in 14719 jiffies = 23.50 MB/s
    Freed 205734 pages in 32389 jiffies = 24.81 MB/s

    With the patch:
    Freed 68252 pages in 496 jiffies = 537.52 MB/s
    Freed 116464 pages in 569 jiffies = 798.54 MB/s
    Freed 209699 pages in 705 jiffies = 1161.89 MB/s

    At that time, the patch was well worth it. However, modern hardware
    trends and recent VM improvements have eroded its value. For several
    reasons, I think we should remove shrink_all_zones() altogether.

    In detail:

    1) In the old days, shrink_zone()'s slowness was mainly caused by stupid
    io-throttling when there was no I/O congestion, but the current
    shrink_zone() is sane and not slow.

    2) shrink_all_zones() tries to shrink all pages at once, but that doesn't
    work well on NUMA systems.
    example)
    The system has 4GB of memory, each node has 2GB, and hibernation needs 1GB.

    optimal)
    steal 500MB from each node.
    shrink_all_zones)
    steal 1GB from node-0.

    Oh, the cache balancing logic is broken. ;)
    Unfortunately, desktop systems have moved to NUMA nowadays.
    (Side note: if hibernation required 2GB, shrink_all_zones() could never
    succeed on the above machine.)

    3) If a node has several pages with I/O in flight, shrink_all_zones()
    produces pretty bad results.

    scenario) hibernation needs 1GB

    1) shrink_all_zones() tries to reclaim 1GB from node-0
    2) but it only reclaims 990MB
    3) stupidly, shrink_all_zones() then tries to reclaim 1GB from node-1
    4) it reclaims 990MB

    Oh well, it reclaimed nearly twice as much as required. Current
    shrink_zone(), on the other hand, has sane bail-out logic, so it doesn't
    do overkill reclaim, and we lose that risk of shrink_all_zones().

    4) The SplitLRU VM always maintains the active/inactive ratio very
    carefully. Shrinking only the inactive lists breaks that assumption and
    creates unnecessary OOM risk; it is obviously suboptimal.

    Now shrink_all_memory() is only a wrapper around do_try_to_free_pages(),
    which brings good reviewability and debuggability and solves the above
    problems; a sketch of the resulting shape follows below.

    Side note: unifying the reclaim logic has two good side effects.
    - It fixes a recursive-reclaim bug in shrink_all_memory(): it forgot to
    use PF_MEMALLOC, which meant the system could get stuck in a deadlock.
    - shrink_all_memory() now has lockdep awareness, which brings good
    debuggability.
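
    An illustrative sketch of the resulting shape (not the literal kernel
    code; the structure fields shown are assumptions): shrink_all_memory()
    just fills in a scan control and delegates to the common reclaim entry
    point.

      struct scan_control {
              unsigned long nr_to_reclaim;
              int hibernation_mode;
      };

      /* stands in for the common reclaim path shared with direct reclaim */
      static unsigned long do_try_to_free_pages(struct scan_control *sc)
      {
              return sc->nr_to_reclaim;
      }

      unsigned long shrink_all_memory(unsigned long nr_to_reclaim)
      {
              struct scan_control sc = {
                      .nr_to_reclaim    = nr_to_reclaim,
                      .hibernation_mode = 1,
              };
              /* the real code also sets PF_MEMALLOC around the call to avoid
                 recursing into reclaim from within reclaim */
              return do_try_to_free_pages(&sc);
      }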

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Acked-by: Rafael J. Wysocki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, sc.swap_cluster_max has two meanings:

    1) the reclaim batch size, as isolate_lru_pages()'s argument
    2) the reclaim bail-out threshold

    The two meanings are pretty unrelated, so let's separate them. This patch
    doesn't change any behavior.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Rik van Riel
    Reviewed-by: Minchan Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If reclaim fails to make sufficient progress, the priority is raised.
    Once the priority is higher, kswapd starts waiting on congestion.
    However, if the zone is below the min watermark then kswapd needs to
    continue working without delay as there is a danger of an increased rate
    of GFP_ATOMIC allocation failure.

    This patch changes the conditions under which kswapd waits on congestion
    by only going to sleep if the min watermarks are being met.

    [mel@csn.ul.ie: add stats to track how relevant the logic is]
    [mel@csn.ul.ie: make kswapd only check its own zones and rename the relevant counters]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • After kswapd balances all zones in a pgdat, it goes to sleep. In the
    event of no IO congestion, kswapd can go to sleep very shortly after the
    high watermark was reached. If there is a constant stream of allocations
    from parallel processes, it can mean that kswapd went to sleep too quickly
    and the high watermark is not being maintained for a sufficient length of
    time.

    This patch makes kswapd go to sleep as a two-stage process. It first
    tries to sleep for HZ/10. If it is woken up by another process or the
    high watermark is no longer met, it's considered a premature sleep and
    kswapd continues work. Otherwise it goes fully to sleep.

    This adds more counters to distinguish between fast and slow breaches of
    watermarks. A "fast" premature sleep is one where the low watermark was
    hit in a very short time after kswapd going to sleep. A "slow" premature
    sleep indicates that the high watermark was breached after a very short
    interval.
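
    An illustrative sketch of the two-stage sleep (stubbed and standalone;
    names roughly follow the text, not necessarily the kernel):

      #include <stdbool.h>

      #define HZ 100

      static bool sleeping_prematurely(void) { return false; } /* watermark re-check, stubbed */
      static void sleep_for(int ticks)       { (void)ticks; }  /* schedule_timeout() stand-in */
      static void sleep_until_woken(void)    { }

      static void kswapd_try_to_sleep(void)
      {
              sleep_for(HZ / 10);              /* stage 1: short nap */

              if (sleeping_prematurely())
                      return;                  /* woken or watermark breached: keep working */

              sleep_until_woken();             /* stage 2: commit to a full sleep */
      }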

    Signed-off-by: Mel Gorman
    Cc: Frans Pop
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Commit 543ade1fc9 ("Streamline generic_file_* interfaces and filemap
    cleanups") removed generic_file_write() in filemap. Change the comment in
    vmscan pageout() to __generic_file_aio_write().

    Signed-off-by: Vincent Li
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • When memory is hot-removed, its node must be cleared in N_HIGH_MEMORY if
    there are no present pages left.

    In such a situation, kswapd must also be stopped since it has nothing left
    to do.

    Signed-off-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: Yasunori Goto
    Cc: Mel Gorman
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Mel Gorman
    Cc: Randy Dunlap
    Cc: Nishanth Aravamudan
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Adam Litke
    Cc: Andy Whitcroft
    Cc: Eric Whitney
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 Oct, 2009

3 commits

  • Isolators putting a page back to the LRU do not hold the page lock, and if
    the page is mlocked, another thread might munlock it concurrently.

    Expecting this, the putback code re-checks the evictability of a page when
    it just moved it to the unevictable list in order to correct its decision.

    The problem, however, is that ordering is not guaranteed between setting
    PG_lru when moving the page to the list and checking PG_mlocked
    afterwards:

    #0:                                   #1:

    spin_lock()
                                          if (TestClearPageMlocked())
                                            if (PageLRU())
                                              move to evictable list
    SetPageLRU()
    spin_unlock()
    if (!PageMlocked())
      move to evictable list

    The PageMlocked() check may get reordered before SetPageLRU() in #0,
    resulting in #0 not moving the still mlocked page, and in #1 failing to
    isolate and move the page as well. The page is now stranded on the
    unevictable list.

    The race condition is very unlikely. The consequence currently is one
    page falling off the reclaim grid and eventually getting freed with
    PG_unevictable set, which triggers a warning in the page allocator.

    TestClearPageMlocked() in #1 already provides full memory barrier
    semantics.

    This patch adds an explicit full barrier to force ordering between
    SetPageLRU() and PageMlocked() so that either one of the competitors
    rescues the page.
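
    A standalone C11 model of the fix (not the kernel code; PG_lru and
    PG_mlocked are modelled as atomics): the full fence between publishing
    PG_lru and re-reading PG_mlocked guarantees that at least one of the two
    threads observes the other's update and rescues the page.

      #include <stdatomic.h>
      #include <stdbool.h>

      static atomic_bool pg_lru;                 /* page starts off the LRU */
      static atomic_bool pg_mlocked = true;      /* page starts out mlocked */
      static atomic_int  moved_to_evictable;

      static void putback_path(void)             /* #0 */
      {
              atomic_store(&pg_lru, true);               /* SetPageLRU()          */
              atomic_thread_fence(memory_order_seq_cst); /* the added full barrier */
              if (!atomic_load(&pg_mlocked))
                      atomic_fetch_add(&moved_to_evictable, 1);
      }

      static void munlock_path(void)             /* #1 */
      {
              /* TestClearPageMlocked(): an atomic RMW, already a full barrier */
              if (atomic_exchange(&pg_mlocked, false) && atomic_load(&pg_lru))
                      atomic_fetch_add(&moved_to_evictable, 1);
      }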

    Signed-off-by: Johannes Weiner
    Reviewed-by: KOSAKI Motohiro
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Lee Schermerhorn
    Cc: Peter Zijlstra
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It is possible to have pages that are !Anon but SwapBacked, and some apps
    could create a huge number of such pages with MAP_SHARED|MAP_ANONYMOUS.
    These pages go onto the ANON LRU list and hence shall not be protected: we
    only care about mapped executable files. Failing to do so may trigger
    OOM.

    Tested-by: Christian Borntraeger
    Reviewed-by: Rik van Riel
    Signed-off-by: Wu Fengguang
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • Commit 8aa7e847d (Fix congestion_wait() sync/async vs read/write
    confusion) replaced WRITE with BLK_RW_ASYNC. Unfortunately, concurrent mm
    development accidentally left one spot unchanged.

    This patch fixes it too.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Jens Axboe
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

26 Sep, 2009

2 commits

  • * 'writeback' of git://git.kernel.dk/linux-2.6-block:
    writeback: writeback_inodes_sb() should use bdi_start_writeback()
    writeback: don't delay inodes redirtied by a fast dirtier
    writeback: make the super_block pinning more efficient
    writeback: don't resort for a single super_block in move_expired_inodes()
    writeback: move inodes from one super_block together
    writeback: get rid to incorrect references to pdflush in comments
    writeback: improve readability of the wb_writeback() continue/break logic
    writeback: cleanup writeback_single_inode()
    writeback: kupdate writeback shall not stop when more io is possible
    writeback: stop background writeback when below background threshold
    writeback: balance_dirty_pages() shall write more than dirtied pages
    fs: Fix busyloop in wb_writeback()

    Linus Torvalds
     
  • Signed-off-by: Jens Axboe

    Jens Axboe
     

24 Sep, 2009

3 commits

  • * 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (21 commits)
    HWPOISON: Enable error_remove_page on btrfs
    HWPOISON: Add simple debugfs interface to inject hwpoison on arbitary PFNs
    HWPOISON: Add madvise() based injector for hardware poisoned pages v4
    HWPOISON: Enable error_remove_page for NFS
    HWPOISON: Enable .remove_error_page for migration aware file systems
    HWPOISON: The high level memory error handler in the VM v7
    HWPOISON: Add PR_MCE_KILL prctl to control early kill behaviour per process
    HWPOISON: shmem: call set_page_dirty() with locked page
    HWPOISON: Define a new error_remove_page address space op for async truncation
    HWPOISON: Add invalidate_inode_page
    HWPOISON: Refactor truncate to allow direct truncating of page v2
    HWPOISON: check and isolate corrupted free pages v2
    HWPOISON: Handle hardware poisoned pages in try_to_unmap
    HWPOISON: Use bitmask/action code for try_to_unmap behaviour
    HWPOISON: x86: Add VM_FAULT_HWPOISON handling to x86 page fault handler v2
    HWPOISON: Add poison check to page fault handling
    HWPOISON: Add basic support for poisoned pages in fault handler v3
    HWPOISON: Add new SIGBUS error codes for hardware poison signals
    HWPOISON: Add support for poison swap entries v2
    HWPOISON: Export some rmap vma locking to outside world
    ...

    Linus Torvalds
     
  • It's unused.

    It isn't needed -- read or write flag is already passed and sysctl
    shouldn't care about the rest.

    It _was_ used in two places at arch/frv for some reason.

    Signed-off-by: Alexey Dobriyan
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Al Viro
    Cc: Ralf Baechle
    Cc: Martin Schwidefsky
    Cc: Ingo Molnar
    Cc: "David S. Miller"
    Cc: James Morris
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Implement reclaim from groups over their soft limit

    Permit reclaim from memory cgroups on contention (via the direct reclaim
    path).

    Memory cgroup soft limit reclaim finds the group that exceeds its soft
    limit by the largest number of pages, reclaims pages from it, and then
    reinserts the cgroup into its correct place in the rbtree.

    Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
    loops in case all swap is turned off. The code has been refactored and
    the loop check (loop < 2) has been enhanced for soft limits. For soft
    limits, we try to do more targeted reclaim. Instead of bailing out after
    two loops, the routine now reclaims memory proportional to the amount by
    which the soft limit is exceeded. The proportion has been empirically
    determined; a sketch of the victim selection is given below.
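
    A minimal sketch of the selection policy described above (standalone and
    illustrative: the kernel keeps the groups in an rbtree ordered by excess,
    while this sketch simply scans an array):

      struct group { unsigned long usage, soft_limit; };

      static long excess(const struct group *g)
      {
              return (long)(g->usage - g->soft_limit);
      }

      /* pick the group exceeding its soft limit by the largest number of pages */
      static struct group *pick_victim(struct group *groups, int n)
      {
              struct group *victim = 0;
              int i;

              for (i = 0; i < n; i++)
                      if (excess(&groups[i]) > 0 &&
                          (!victim || excess(&groups[i]) > excess(victim)))
                              victim = &groups[i];

              return victim;   /* reclaim from it, then reinsert at its new rank */
      }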

    [akpm@linux-foundation.org: build fix]
    [kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
    [nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Acked-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

22 Sep, 2009

5 commits

  • Commit 084f71ae5c (kill page_queue_congested()) removed
    page_queue_congested(). Remove the page_queue_congested() comment in
    vmscan pageout() too.

    Signed-off-by: Vincent Li
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • For mem_cgroup, shrink_zone() may call shrink_list() with nr_to_scan=1, in
    which case shrink_list() _still_ calls isolate_pages() with the much
    larger SWAP_CLUSTER_MAX. It effectively scales up the inactive list scan
    rate by up to 32 times.

    For example, with 16k inactive pages and DEF_PRIORITY=12, (16k >> 12)=4.
    So when shrink_zone() expects to scan 4 pages in the active/inactive list,
    the active list will be scanned 4 pages, while the inactive list will be
    (over) scanned SWAP_CLUSTER_MAX=32 pages in effect. And that could break
    the balance between the two lists.

    It can further impact the scan of anon active list, due to the anon
    active/inactive ratio rebalance logic in balance_pgdat()/shrink_zone():

    inactive anon list over scanned => inactive_anon_is_low() == TRUE
    => shrink_active_list()
    => active anon list over scanned

    So the end result may be

    - anon inactive => over scanned
    - anon active => over scanned (maybe not as much)
    - file inactive => over scanned
    - file active => under scanned (relatively)

    The accesses to nr_saved_scan are not lock protected and so not 100%
    accurate; however, we can tolerate small errors and the resulting small
    imbalance in scan rates between zones.
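
    A sketch of the batching involved (reconstructed and illustrative, not
    quoted from the patch): small per-call scan requests accumulate in
    nr_saved_scan and are only flushed as a full SWAP_CLUSTER_MAX batch,
    instead of each tiny request being rounded up to a full batch.

      #define SWAP_CLUSTER_MAX 32UL

      static unsigned long nr_scan_try_batch(unsigned long nr_to_scan,
                                             unsigned long *nr_saved_scan)
      {
              unsigned long nr = *nr_saved_scan + nr_to_scan;

              if (nr >= SWAP_CLUSTER_MAX) {
                      *nr_saved_scan = 0;      /* flush a full batch */
                      return nr;
              }

              *nr_saved_scan = nr;             /* keep saving up */
              return 0;                        /* scan nothing this time */
      }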

    Cc: Rik van Riel
    Reviewed-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Reviewed-by: Minchan Kim
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Wu Fengguang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wu Fengguang
     
  • The name `zone_nr_pages' can be misread as the zone's (total) number of
    pages, but it actually returns the number of pages on one of the zone's
    LRU lists.

    Signed-off-by: Vincent Li
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vincent Li
     
  • Enlighten the reader of this code about what reference count makes a page
    cache page freeable.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Make page_has_private() return a true boolean value and remove the double
    negations from the two callsites using it for arithmetic.

    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Reviewed-by: Christoph Lameter
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner