Eric Lee / smarc-fsl-linux-kernel

0a47de52d thp: add native_set_pmd_at ... Browse Code »

Used by paravirt and not paravirt set_pmd_at.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:39 +0800

8dd60a3a6 thp: clear compound mapping ... Browse Code »

Clear compound mapping for anonymous compound pages like it already
happens for regular anonymous pages. But crash if mapping is set for any
tail page, also the PageAnon check is meaningless for tail pages. This
check only makes sense for the head page, for tail page it can only hide
bugs and we definitely don't want to hide bugs.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:39 +0800

a5b338f2b thp: update futex compound knowledge ... Browse Code »

Futex code is smarter than most other gup_fast O_DIRECT code and knows
about the compound internals. However now doing a put_page(head_page)
will not release the pin on the tail page taken by gup-fast, leading to
all sort of refcounting bugchecks. Getting a stable head_page is a little
tricky.

page_head = page is there because if this is not a tail page it's also the
page_head. Only in case this is a tail page, compound_head is called,
otherwise it's guaranteed unnecessary. And if it's a tail page
compound_head has to run atomically inside irq disabled section
__get_user_pages_fast before returning. Otherwise ->first_page won't be a
stable pointer.

Disableing irq before __get_user_page_fast and releasing irq after running
compound_head is needed because if __get_user_page_fast returns == 1, it
means the huge pmd is established and cannot go away from under us.
pmdp_splitting_flush_notify in __split_huge_page_splitting will have to
wait for local_irq_enable before the IPI delivery can return. This means
__split_huge_page_refcount can't be running from under us, and in turn
when we run compound_head(page) we're not reading a dangling pointer from
tailpage->first_page. Then after we get to stable head page, we are
always safe to call compound_lock and after taking the compound lock on
head page we can finally re-check if the page returned by gup-fast is
still a tail page. in which case we're set and we didn't need to split
the hugepage in order to take a futex on it.

Signed-off-by: Andrea Arcangeli
Acked-by: Mel Gorman
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:39 +0800

a95a82e96 thp: put_page: recheck PageHead after releasing the compound_lock ... Browse Code »

After releasing the compound_lock split_huge_page can still run and release the
page before put_page_testzero runs.

Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:39 +0800

918070634 thp: alter compound get_page/put_page ... Browse Code »

Alter compound get_page/put_page to keep references on subpages too, in
order to allow __split_huge_page_refcount to split an hugepage even while
subpages have been pinned by one of the get_user_pages() variants.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:39 +0800

e9da73d67 thp: compound_lock ... Browse Code »

Add a new compound_lock() needed to serialize put_page against
__split_huge_page_refcount().

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:38 +0800

a826e4224 thp: mm: define MADV_HUGEPAGE ... Browse Code »

Define MADV_HUGEPAGE.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Arnd Bergmann
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:38 +0800

1c9bf22c0 thp: transparent hugepage support documentation ... Browse Code »

Documentation/vm/transhuge.txt

Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:38 +0800

4e9f64c42 thp: fix bad_page to show the real reason the page is bad ... Browse Code »

page_count shows the count of the head page, but the actual check is done
on the tail page, so show what is really being checked.

Signed-off-by: Andrea Arcangeli
Acked-by: Rik van Riel
Acked-by: Mel Gorman
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:38 +0800

ae52a2adb thp: ksm: free swap when swapcache page is replaced ... Browse Code »

When a swapcache page is replaced by a ksm page, it's best to free that
swap immediately.

Reported-by: Andrea Arcangeli
Signed-off-by: Hugh Dickins
Signed-off-by: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:38 +0800

240c879f2 writeback: avoid unnecessary determine_dirtyable_memory call ... Browse Code »

I think determine_dirtyable_memory() is a rather costly function since it
need many atomic reads for gathering zone/global page state. But when we
use vm_dirty_bytes && dirty_background_bytes, we don't need that costly
calculation.

This patch eliminates such unnecessary overhead.

NOTE : newly added if condition might add overhead in normal path.
But it should be _really_ small because anyway we need the
access both vm_dirty_bytes and dirty_background_bytes so it is
likely to hit the cache.

[akpm@linux-foundation.org: fix used-uninitialised warning]
Signed-off-by: Minchan Kim
Cc: Wu Fengguang
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:38 +0800

ecb256f81 mm: set correct numa_zonelist_order string when configured on the kernel command line ... Browse Code »

When numa_zonelist_order parameter is set to "node" or "zone" on the
command line it's still showing as "default" in sysctl. That's because
early_param parsing function changes only user_zonelist_order variable.
Fix this by copying user-provided string to numa_zonelist_order if it was
successfully parsed.

Signed-off-by: Volodymyr G Lukiianyk
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

dc83edd94 mm: kswapd: use the classzone idx that kswapd was using for sleeping_prematurely() ... Browse Code »

When kswapd is woken up for a high-order allocation, it takes account of
the highest usable zone by the caller (the classzone idx). During
allocation, this index is used to select the lowmem_reserve[] that should
be applied to the watermark calculation in zone_watermark_ok().

When balancing a node, kswapd considers the highest unbalanced zone to be
the classzone index. This will always be at least be the callers
classzone_idx and can be higher. However, sleeping_prematurely() always
considers the lowest zone (e.g. ZONE_DMA) to be the classzone index.
This means that sleeping_prematurely() can consider a zone to be balanced
that is unusable by the allocation request that originally woke kswapd.
This patch changes sleeping_prematurely() to use a classzone_idx matching
the value it used in balance_pgdat().

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: Eric B Munson
Cc: KAMEZAWA Hiroyuki
Cc: Simon Kirby
Cc: KOSAKI Motohiro
Cc: Shaohua Li
Cc: Dave Hansen
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

355b09c47 mm: kswapd: treat zone->all_unreclaimable in sleeping_prematurely similar to balance_pgdat() ... Browse Code »

After DEF_PRIORITY, balance_pgdat() considers all_unreclaimable zones to
be balanced but sleeping_prematurely does not. This can force kswapd to
stay awake longer than it should. This patch fixes it.

Signed-off-by: Mel Gorman
Reviewed-by: Eric B Munson
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Simon Kirby
Cc: KOSAKI Motohiro
Cc: Shaohua Li
Cc: Dave Hansen
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

4d40502ea mm: kswapd: reset kswapd_max_order and classzone_idx after reading ... Browse Code »

When kswapd wakes up, it reads its order and classzone from pgdat and
calls balance_pgdat. While its awake, it potentially reclaimes at a high
order and a low classzone index. This might have been a once-off that was
not required by subsequent callers. However, because the pgdat values
were not reset, they remain artifically high while balance_pgdat() is
running and potentially kswapd enters a second unnecessary reclaim cycle.
Reset the pgdat order and classzone index after reading.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Eric B Munson
Cc: Simon Kirby
Cc: KOSAKI Motohiro
Cc: Shaohua Li
Cc: Dave Hansen
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

0abdee2bd mm: kswapd: use the order that kswapd was reclaiming at for sleeping_prematurely() ... Browse Code »

Before kswapd goes to sleep, it uses sleeping_prematurely() to check if
there was a race pushing a zone below its watermark. If the race
happened, it stays awake. However, balance_pgdat() can decide to reclaim
at order-0 if it decides that high-order reclaim is not working as
expected. This information is not passed back to sleeping_prematurely().
The impact is that kswapd remains awake reclaiming pages long after it
should have gone to sleep. This patch passes the adjusted order to
sleeping_prematurely and uses the same logic as balance_pgdat to decide if
it's ok to go to sleep.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Eric B Munson
Cc: Simon Kirby
Cc: KOSAKI Motohiro
Cc: Shaohua Li
Cc: Dave Hansen
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

1741c8775 mm: kswapd: keep kswapd awake for high-order allocations until a percentage of the node is balanced ... Browse Code »

When reclaiming for high-orders, kswapd is responsible for balancing a
node but it should not reclaim excessively. It avoids excessive reclaim
by considering if any zone in a node is balanced then the node is
balanced. In the cases where there are imbalanced zone sizes (e.g.
ZONE_DMA with both ZONE_DMA32 and ZONE_NORMAL), kswapd can go to sleep
prematurely as just one small zone was balanced.

This alters the sleep logic of kswapd slightly. It counts the number of
pages that make up the balanced zones. If the total number of balanced
pages is more than a quarter of the zone, kswapd will go back to sleep.
This should keep a node balanced without reclaiming an excessive number of
pages.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Eric B Munson
Cc: Simon Kirby
Cc: KOSAKI Motohiro
Cc: Shaohua Li
Cc: Dave Hansen
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

995047488 mm: kswapd: stop high-order balancing when any suitable zone is balanced ... Browse Code »

Simon Kirby reported the following problem

We're seeing cases on a number of servers where cache never fully
grows to use all available memory. Sometimes we see servers with 4 GB
of memory that never seem to have less than 1.5 GB free, even with a
constantly-active VM. In some cases, these servers also swap out while
this happens, even though they are constantly reading the working set
into memory. We have been seeing this happening for a long time; I
don't think it's anything recent, and it still happens on 2.6.36.

After some debugging work by Simon, Dave Hansen and others, the prevaling
theory became that kswapd is reclaiming order-3 pages requested by SLUB
too aggressive about it.

There are two apparent problems here. On the target machine, there is a
small Normal zone in comparison to DMA32. As kswapd tries to balance all
zones, it would continually try reclaiming for Normal even though DMA32
was balanced enough for callers. The second problem is that
sleeping_prematurely() does not use the same logic as balance_pgdat() when
deciding whether to sleep or not. This keeps kswapd artifically awake.

A number of tests were run and the figures from previous postings will
look very different for a few reasons. One, the old figures were forcing
my network card to use GFP_ATOMIC in attempt to replicate Simon's problem.
Second, I previous specified slub_min_order=3 again in an attempt to
reproduce Simon's problem. In this posting, I'm depending on Simon to say
whether his problem is fixed or not and these figures are to show the
impact to the ordinary cases. Finally, the "vmscan" figures are taken
from /proc/vmstat instead of the tracepoints. There is less information
but recording is less disruptive.

The first test of relevance was postmark with a process running in the
background reading a large amount of anonymous memory in blocks. The
objective was to vaguely simulate what was happening on Simon's machine
and it's memory intensive enough to have kswapd awake.

POSTMARK
traceonly kanyzone
Transactions per second: 156.00 ( 0.00%) 153.00 (-1.96%)
Data megabytes read per second: 21.51 ( 0.00%) 21.52 ( 0.05%)
Data megabytes written per second: 29.28 ( 0.00%) 29.11 (-0.58%)
Files created alone per second: 250.00 ( 0.00%) 416.00 (39.90%)
Files create/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)
Files deleted alone per second: 520.00 ( 0.00%) 420.00 (-23.81%)
Files delete/transact per second: 79.00 ( 0.00%) 76.00 (-3.95%)

MMTests Statistics: duration
User/Sys Time Running Test (seconds) 16.58 17.4
Total Elapsed Time (seconds) 218.48 222.47

VMstat Reclaim Statistics: vmscan
Direct reclaims 0 4
Direct reclaim pages scanned 0 203
Direct reclaim pages reclaimed 0 184
Kswapd pages scanned 326631 322018
Kswapd pages reclaimed 312632 309784
Kswapd low wmark quickly 1 4
Kswapd high wmark quickly 122 475
Kswapd skip congestion_wait 1 0
Pages activated 700040 705317
Pages deactivated 212113 203922
Pages written 9875 6363

Total pages scanned 326631 322221
Total pages reclaimed 312632 309968
%age total pages scanned/reclaimed 95.71% 96.20%
%age total pages scanned/written 3.02% 1.97%

proc vmstat: Faults
Major Faults 300 254
Minor Faults 645183 660284
Page ins 493588 486704
Page outs 4960088 4986704
Swap ins 1230 661
Swap outs 9869 6355

Performance is mildly affected because kswapd is no longer doing as much
work and the background memory consumer process is getting in the way.
Note that kswapd scanned and reclaimed fewer pages as it's less aggressive
and overall fewer pages were scanned and reclaimed. Swap in/out is
particularly reduced again reflecting kswapd throwing out fewer pages.

The slight performance impact is unfortunate here but it looks like a
direct result of kswapd being less aggressive. As the bug report is about
too many pages being freed by kswapd, it may have to be accepted for now.

The second test is a streaming IO benchmark that was previously used by
Johannes to show regressions in page reclaim.

MICRO
traceonly kanyzone
User/Sys Time Running Test (seconds) 29.29 28.87
Total Elapsed Time (seconds) 492.18 488.79

VMstat Reclaim Statistics: vmscan
Direct reclaims 2128 1460
Direct reclaim pages scanned 2284822 1496067
Direct reclaim pages reclaimed 148919 110937
Kswapd pages scanned 15450014 16202876
Kswapd pages reclaimed 8503697 8537897
Kswapd low wmark quickly 3100 3397
Kswapd high wmark quickly 1860 7243
Kswapd skip congestion_wait 708 801
Pages activated 9635 9573
Pages deactivated 1432 1271
Pages written 223 1130

Total pages scanned 17734836 17698943
Total pages reclaimed 8652616 8648834
%age total pages scanned/reclaimed 48.79% 48.87%
%age total pages scanned/written 0.00% 0.01%

proc vmstat: Faults
Major Faults 165 221
Minor Faults 9655785 9656506
Page ins 3880 7228
Page outs 37692940 37480076
Swap ins 0 69
Swap outs 19 15

Again fewer pages are scanned and reclaimed as expected and this time the
test completed faster. Note that kswapd is hitting its watermarks faster
(low and high wmark quickly) which I expect is due to kswapd reclaiming
fewer pages.

I also ran fs-mark, iozone and sysbench but there is nothing interesting
to report in the figures. Performance is not significantly changed and
the reclaim statistics look reasonable.

Tgis patch:

When the allocator enters its slow path, kswapd is woken up to balance the
node. It continues working until all zones within the node are balanced.
For order-0 allocations, this makes perfect sense but for higher orders it
can have unintended side-effects. If the zone sizes are imbalanced,
kswapd may reclaim heavily within a smaller zone discarding an excessive
number of pages. The user-visible behaviour is that kswapd is awake and
reclaiming even though plenty of pages are free from a suitable zone.

This patch alters the "balance" logic for high-order reclaim allowing
kswapd to stop if any suitable zone becomes balanced to reduce the number
of pages it reclaims from other zones. kswapd still tries to ensure that
order-0 watermarks for all zones are met before sleeping.

Signed-off-by: Mel Gorman
Reviewed-by: Minchan Kim
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Eric B Munson
Cc: Simon Kirby
Cc: KOSAKI Motohiro
Cc: Shaohua Li
Cc: Dave Hansen
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:37 +0800

c585a2678 mm: remove likely() from grab_cache_page_write_begin() ... Browse Code »

Running the annotated branch profiler on a box doing average work
(firefox, evolution, xchat, distcc farm), the likely() used in
grab_cache_page_write_begin() was incorrect most of the time:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
1924262 71332401 97 grab_cache_page_write_begin filemap.c 2206

Adding a trace_printk() and running the function tracer limited to
just this function I can see:

gconfd-2-2696 [000] 4467.268935: grab_cache_page_write_begin: page= (null) mapping=ffff8800676a9460 index=7
gconfd-2-2696 [000] 4467.268946: grab_cache_page_write_begin
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

e20e87795 mm: remove unlikely() from page_mapping() ... Browse Code »

page_mapping() has a unlikely that the mapping has PAGE_MAPPING_ANON set.
But running the annotated branch profiler on a normal desktop system doing
vairous tasks (xchat, evolution, firefox, distcc), it is not really that
unlikely that the mapping here will have the PAGE_MAPPING_ANON flag set:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
35935762 1270265395 97 page_mapping mm.h 659
1306198001 143659 0 page_mapping mm.h 657
203131478 121586 0 page_mapping mm.h 657
5415491 1116 0 page_mapping mm.h 657
74899487 1116 0 page_mapping mm.h 657
203132845 224 0 page_mapping mm.h 659
5415464 27 0 page_mapping mm.h 659
13552 0 0 page_mapping mm.h 657
13552 0 0 page_mapping mm.h 659
242630 0 0 page_mapping mm.h 657
242630 0 0 page_mapping mm.h 659
74899487 0 0 page_mapping mm.h 659

The page_mapping() is a static inline, which is why it shows up multiple
times.

The unlikely in page_mapping() was correct a total of 1909540379 times and
incorrect 1270533123 times, with a 39% being incorrect. With this much of
an error, it's best to simply remove the unlikely and have the compiler
and branch prediction figure this out.

Signed-off-by: Steven Rostedt
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Rik van Riel
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

088e54658 mm: remove likely() from mapping_unevictable() ... Browse Code »

The mapping_unevictable() has a likely() around the mapping parameter.
This mapping parameter comes from page_mapping() which has an unlikely()
that the page will be set as PAGE_MAPPING_ANON, and if so, it will return
NULL. One would think that this unlikely() means that the mapping
returned by page_mapping() would not be NULL, but where page_mapping() is
used just above mapping_unevictable(), that unlikely() is incorrect most
of the time. This means that the "likely(mapping)" in
mapping_unevictable() is incorrect most of the time.

Running the annotated branch profiler on my main box which runs firefox,
evolution, xchat and is part of my distcc farm, I had this:

correct incorrect % Function File Line
------- --------- - -------- ---- ----
12872836 1269443893 98 mapping_unevictable pagemap.h 51
35935762 1270265395 97 page_mapping mm.h 659
1306198001 143659 0 page_mapping mm.h 657
203131478 121586 0 page_mapping mm.h 657
5415491 1116 0 page_mapping mm.h 657
74899487 1116 0 page_mapping mm.h 657
203132845 224 0 page_mapping mm.h 659
5415464 27 0 page_mapping mm.h 659
13552 0 0 page_mapping mm.h 657
13552 0 0 page_mapping mm.h 659
242630 0 0 page_mapping mm.h 657
242630 0 0 page_mapping mm.h 659
74899487 0 0 page_mapping mm.h 659

The page_mapping() is a static inline, which is why it shows up multiple
times. The mapping_unevictable() is also a static inline but seems to be
used only once in my setup.

The unlikely in page_mapping() was correct a total of 1909540379 times and
incorrect 1270533123 times, with a 39% being incorrect. Perhaps this is
enough to remove the unlikely from page_mapping() as well.

Signed-off-by: Steven Rostedt
Reviewed-by: KOSAKI Motohiro
Acked-by: Nick Piggin
Acked-by: Rik van Riel
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

ddf9c6d47 vmalloc: remove redundant unlikely() ... Browse Code »

IS_ERR() already implies unlikely(), so it can be omitted here.

Signed-off-by: Tobias Klauser
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

1e50df39f mempolicy: remove tasklist_lock from migrate_pages ... Browse Code »

Today, tasklist_lock in migrate_pages doesn't protect anything.
rcu_read_lock() provide enough protection from pid hash walk.

Signed-off-by: KOSAKI Motohiro
Reported-by: Peter Zijlstra
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

53a7706d5 mlock: do not hold mmap_sem for extended periods of time ... Browse Code »

__get_user_pages gets a new 'nonblocking' parameter to signal that the
caller is prepared to re-acquire mmap_sem and retry the operation if
needed. This is used to split off long operations if they are going to
block on a disk transfer, or when we detect contention on the mmap_sem.

[akpm@linux-foundation.org: remove ref to rwsem_is_contended()]
Signed-off-by: Michel Lespinasse
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

5fdb20021 mm: move VM_LOCKED check to __mlock_vma_pages_range() ... Browse Code »

Use a single code path for faulting in pages during mlock.

The reason to have it in this patch series is that I did not want to
update both code paths in a later change that releases mmap_sem when
blocking on disk.

Signed-off-by: Michel Lespinasse
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

110d74a92 mm: add FOLL_MLOCK follow_page flag. ... Browse Code »

Move the code to mlock pages from __mlock_vma_pages_range() to
follow_page().

This allows __mlock_vma_pages_range() to not have to break down work into
16-page batches.

An additional motivation for doing this within the present patch series is
that it'll make it easier for a later chagne to drop mmap_sem when
blocking on disk (we'd like to be able to resume at the page that was read
from disk instead of at the start of a 16-page batch).

Signed-off-by: Michel Lespinasse
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:36 +0800

fed067da4 mlock: only hold mmap_sem in shared mode when faulting in pages ... Browse Code »

Currently mlock() holds mmap_sem in exclusive mode while the pages get
faulted in. In the case of a large mlock, this can potentially take a
very long time, during which various commands such as 'ps auxw' will
block. This makes sysadmins unhappy:

real 14m36.232s
user 0m0.003s
sys 0m0.015s
(output from 'time ps auxw' while a 20GB file was being mlocked without
being previously preloaded into page cache)

I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages, in small batches, releasing mmap_sem when we
block on disk access or when we detect some contention.

This patch:

Before this change, mlock() holds mmap_sem in exclusive mode while the
pages get faulted in. In the case of a large mlock, this can potentially
take a very long time. Various things will block while mmap_sem is held,
including 'ps auxw'. This can make sysadmins angry.

I propose that mlock() could release mmap_sem after the VM_LOCKED bits
have been set in all appropriate VMAs. Then a second pass could be done
to actually mlock the pages with mmap_sem held for reads only. We need to
recheck the vma flags after we re-acquire mmap_sem, but this is easy.

In the case where a vma has been munlocked before mlock completes, pages
that were already marked as PageMlocked() are handled by the munlock()
call, and mlock() is careful to not mark new page batches as PageMlocked()
after the munlock() call has cleared the VM_LOCKED vma flags. So, the end
result will be identical to what'd happen if munlock() had executed after
the mlock() call.

In a later change, I will allow the second pass to release mmap_sem when
blocking on disk accesses or when it is otherwise contended, so that it
won't be held for long periods of time even in shared mode.

Signed-off-by: Michel Lespinasse
Tested-by: Valdis Kletnieks
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: KOSAKI Motohiro
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:35 +0800

5ecfda041 mlock: avoid dirtying pages and triggering writeback ... Browse Code »

When faulting in pages for mlock(), we want to break COW for anonymous or
file pages within VM_WRITABLE, non-VM_SHARED vmas. However, there is no
need to write-fault into VM_SHARED vmas since shared file pages can be
mlocked first and dirtied later, when/if they actually get written to.
Skipping the write fault is desirable, as we don't want to unnecessarily
cause these pages to be dirtied and queued for writeback.

Signed-off-by: Michel Lespinasse
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Kosaki Motohiro
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: Theodore Tso
Cc: Michael Rubin
Cc: Suleiman Souhlal
Cc: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:35 +0800

72ddc8f72 do_wp_page: clarify dirty_page handling ... Browse Code »

Reorganize the code so that dirty pages are handled closer to the place
that makes them dirty (handling write fault into shared, writable VMAs).
No behavior changes.

Signed-off-by: Michel Lespinasse
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Kosaki Motohiro
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: Theodore Tso
Cc: Michael Rubin
Cc: Suleiman Souhlal
Cc: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:35 +0800

b009c024f do_wp_page: remove the 'reuse' flag ... Browse Code »

mlocking a shared, writable vma currently causes the corresponding pages
to be marked as dirty and queued for writeback. This seems rather
unnecessary given that the pages are not being actually modified during
mlock. It is understood that for non-shared mappings (file or anon) we
want to use a write fault in order to break COW, but there is just no such
need for shared mappings.

The first two patches in this series do not introduce any behavior change.
The intent there is to make it obvious that dirtying file pages is only
done in the (writable, shared) case. I think this clarifies the code, but
I wouldn't mind dropping these two patches if there is no consensus about
them.

The last patch is where we actually avoid dirtying shared mappings during
mlock. Note that as a side effect of this, we won't call page_mkwrite()
for the mappings that define it, and won't be pre-allocating data blocks
at the FS level if the mapped file was sparsely allocated. My
understanding is that mlock does not need to provide such guarantee, as
evidenced by the fact that it never did for the filesystems that don't
define page_mkwrite() - including some common ones like ext3. However, I
would like to gather feedback on this from filesystem people as a
precaution. If this turns out to be a showstopper, maybe block
preallocation can be added back on using a different interface.

Large shared mlocks are getting significantly (>2x) faster in my tests, as
the disk can be fully used for reading the file instead of having to share
between this and writeback.

This patch:

Reorganize the code to remove the 'reuse' flag. No behavior changes.

Signed-off-by: Michel Lespinasse
Cc: Hugh Dickins
Cc: Rik van Riel
Cc: Kosaki Motohiro
Cc: Peter Zijlstra
Cc: Nick Piggin
Cc: Theodore Tso
Cc: Michael Rubin
Cc: Suleiman Souhlal
Cc: Dave Chinner
Cc: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:35 +0800

212260aa0 mm: clear PageError bit in msync & fsync ... Browse Code »

Temporary IO failures, eg. due to loss of both multipath paths, can
permanently leave the PageError bit set on a page, resulting in msync or
fsync returning -EIO over and over again, even if IO is now getting to the
disk correctly.

We already clear the AS_ENOSPC and AS_IO bits in mapping->flags in the
filemap_fdatawait_range function. Also clearing the PageError bit on the
page allows subsequent msync or fsync calls on this file to return without
an error, if the subsequent IO succeeds.

Unfortunately data written out in the msync or fsync call that returned
-EIO can still get lost, because the page dirty bit appears to not get
restored on IO error. However, the alternative could be potentially all
of memory filling up with uncleanable dirty pages, hanging the system, so
there is no nice choice here...

Signed-off-by: Rik van Riel
Acked-by: Valerie Aurora
Acked-by: Jeff Layton
Cc: Theodore Ts'o
Acked-by: Jan Kara
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:35 +0800

dabb16f63 oom: allow a non-CAP_SYS_RESOURCE proces to oom_score_adj down ... Browse Code »

We'd like to be able to oom_score_adj a process up/down as it
enters/leaves the foreground. Currently, it is not possible to oom_adj
down without CAP_SYS_RESOURCE. This patch allows a task to decrease its
oom_score_adj back to the value that a CAP_SYS_RESOURCE thread set it to
or its inherited value at fork. Assuming the thread that has forked it
has oom_score_adj of 0, each process could decrease it back from 0 upon
activation unless a CAP_SYS_RESOURCE thread elevated it to something
higher.

Alternative considered:

* a setuid binary
* a daemon with CAP_SYS_RESOURCE

Since you don't wan't all processes to be able to reduce their oom_adj, a
setuid or daemon implementation would be complex. The alternatives also
have much higher overhead.

This patch updated from original patch based on feedback from David
Rientjes.

Signed-off-by: Mandeep Singh Baines
Acked-by: David Rientjes
Cc: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Cc: Ying Han
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:35 +0800

d0a21265d mm: unify module_alloc code for vmalloc ... Browse Code »

Four architectures (arm, mips, sparc, x86) use __vmalloc_area() for
module_init(). Much of the code is duplicated and can be generalized in a
globally accessible function, __vmalloc_node_range().

__vmalloc_node() now calls into __vmalloc_node_range() with a range of
[VMALLOC_START, VMALLOC_END) for functionally equivalent behavior.

Each architecture may then use __vmalloc_node_range() directly to remove
the duplication of code.

Signed-off-by: David Rientjes
Cc: Christoph Lameter
Cc: Russell King
Cc: Ralf Baechle
Cc: "David S. Miller"
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:34 +0800

ec3f64fc9 mm: remove gfp mask from pcpu_get_vm_areas ... Browse Code »

pcpu_get_vm_areas() only uses GFP_KERNEL allocations, so remove the gfp_t
formal and use the mask internally.

Signed-off-by: David Rientjes
Cc: Christoph Lameter
Cc: Tejun Heo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:34 +0800

e5a5623b2 mm: remove unused get_vm_area_node ... Browse Code »

get_vm_area_node() is unused in the kernel and can thus be removed.

Signed-off-by: David Rientjes
Cc: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:34 +0800

f3a310bc4 mm: vmscan: rename lumpy_mode to reclaim_mode ... Browse Code »

With compaction being used instead of lumpy reclaim, the name lumpy_mode
and associated variables is a bit misleading. Rename lumpy_mode to
reclaim_mode which is a better fit. There is no functional change.

Signed-off-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Acked-by: Johannes Weiner
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:34 +0800

9927af740 mm: compaction: perform a faster migration scan when migrating asynchronously ... Browse Code »

try_to_compact_pages() is initially called to only migrate pages
asychronously and kswapd always compacts asynchronously. Both are being
optimistic so it is important to complete the work as quickly as possible
to minimise stalls.

This patch alters the scanner when asynchronous to only consider
MIGRATE_MOVABLE pageblocks as migration candidates. This reduces stalls
when allocating huge pages while not impairing allocation success rates as
a full scan will be performed if necessary after direct reclaim.

Signed-off-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Acked-by: Johannes Weiner
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:34 +0800

7f0f24967 mm: migration: cleanup migrate_pages API by matching types for offlining and sync ... Browse Code »

With the introduction of the boolean sync parameter, the API looks a
little inconsistent as offlining is still an int. Convert offlining to a
bool for the sake of being tidy.

Signed-off-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Acked-by: Johannes Weiner
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:34 +0800

77f1fe6b0 mm: migration: allow migration to operate asynchronously and avoid synchronous c… ... Browse Code »

…ompaction in the faster path

Migration synchronously waits for writeback if the initial passes fails.
Callers of memory compaction do not necessarily want this behaviour if the
caller is latency sensitive or expects that synchronous migration is not
going to have a significantly better success rate.

This patch adds a sync parameter to migrate_pages() allowing the caller to
indicate if wait_on_page_writeback() is allowed within migration or not.
For reclaim/compaction, try_to_compact_pages() is first called
asynchronously, direct reclaim runs and then try_to_compact_pages() is
called synchronously as there is a greater expectation that it'll succeed.

[akpm@linux-foundation.org: build/merge fix]
Signed-off-by: Mel Gorman <mel@csn.ul.ie>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
Cc: Rik van Riel <riel@redhat.com>
Acked-by: Johannes Weiner <hannes@cmpxchg.org>
Cc: Andy Whitcroft <apw@shadowen.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

2011-01-14 09:32:34 +0800

3e7d34497 mm: vmscan: reclaim order-0 and use compaction instead of lumpy reclaim ... Browse Code »

Lumpy reclaim is disruptive. It reclaims a large number of pages and
ignores the age of the pages it reclaims. This can incur significant
stalls and potentially increase the number of major faults.

Compaction has reached the point where it is considered reasonably stable
(meaning it has passed a lot of testing) and is a potential candidate for
displacing lumpy reclaim. This patch introduces an alternative to lumpy
reclaim whe compaction is available called reclaim/compaction. The basic
operation is very simple - instead of selecting a contiguous range of
pages to reclaim, a number of order-0 pages are reclaimed and then
compaction is later by either kswapd (compact_zone_order()) or direct
compaction (__alloc_pages_direct_compact()).

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: use conventional task_struct naming]
Signed-off-by: Mel Gorman
Cc: Andrea Arcangeli
Cc: KOSAKI Motohiro
Cc: Rik van Riel
Acked-by: Johannes Weiner
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

2011-01-14 09:32:33 +0800

GITLAB

Eric Lee / smarc-fsl-linux-kernel

14 Jan, 2011