07 Mar, 2010
31 commits
-
swap_duplicate()'s loop appears to miss out on returning the error code
from __swap_duplicate(), except when that's -ENOMEM. In fact this is
intentional: prior to -ENOMEM for swap_count_continuation,
swap_duplicate() was void (and the case only occurs when copy_one_pte()
hits a corrupt pte). But that's surprising behaviour, which certainly
deserves a comment.
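For reference, a sketch of the loop in question, close to the mm/swapfile.c code of this era: -EEXIST and -ENOENT from __swap_duplicate() simply fall out of the loop and are not returned; only a failed swap count continuation propagates -ENOMEM.

int swap_duplicate(swp_entry_t entry)
{
	int err = 0;

	/* -EEXIST/-ENOENT end the loop with err == 0 and are swallowed */
	while (!err && __swap_duplicate(entry, 1) == -ENOMEM)
		err = add_swap_count_continuation(entry, GFP_ATOMIC);
	return err;
}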
Signed-off-by: Hugh Dickins
Reported-by: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The noMMU version of get_user_pages() fails to pin the last page when the
start address isn't page-aligned. The patch fixes this in a way that
makes find_extend_vma() congruent to its MMU cousin.
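The arithmetic behind the off-by-one, as an illustrative sketch only (not the patch itself): with an unaligned start, a range spans one page more than a plain len >> PAGE_SHIFT suggests.

/* pages spanned by [start, start + len) when start may be unaligned */
static unsigned long nr_pages_spanned(unsigned long start, unsigned long len)
{
	unsigned long first = start >> PAGE_SHIFT;
	unsigned long last = (start + len - 1) >> PAGE_SHIFT;

	return last - first + 1;	/* not simply len >> PAGE_SHIFT */
}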
Signed-off-by: Steven J. Magnani
Acked-by: Paul Mundt
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The VM currently assumes that an inactive, mapped and referenced file page
is in use and promotes it to the active list.

However, every mapped file page starts out like this and thus a problem
arises when workloads create a stream of such pages that are used only for
a short time. By flooding the active list with those pages, the VM
quickly gets into trouble finding eligible reclaim candidates. The result
is long allocation latencies and eviction of the wrong pages.

This patch reuses the PG_referenced page flag (used for unmapped file
pages) to implement a usage detection that scales with the speed of LRU
list cycling (i.e. memory pressure).

If the scanner encounters those pages, the flag is set and the page cycled
again on the inactive list. Only if it returns with another page table
reference is it activated. Otherwise it is reclaimed as 'not recently
used cache'.
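A condensed sketch of the resulting used-once logic in the reclaim scanner (simplified from the patch's page_check_references(); the lumpy-reclaim and VM_LOCKED special cases are omitted):

static enum page_references page_check_references(struct page *page,
						  struct scan_control *sc)
{
	unsigned long vm_flags;
	int referenced_ptes = page_referenced(page, 1, sc->mem_cgroup, &vm_flags);
	int referenced_page = TestClearPageReferenced(page);

	if (referenced_ptes) {
		if (PageAnon(page))
			return PAGEREF_ACTIVATE;
		/* mapped file page seen once: flag it, cycle it again */
		SetPageReferenced(page);
		if (referenced_page)
			return PAGEREF_ACTIVATE;	/* used twice: promote */
		return PAGEREF_KEEP;
	}
	if (referenced_page)
		return PAGEREF_RECLAIM_CLEAN;
	return PAGEREF_RECLAIM;		/* 'not recently used cache' */
}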
This effectively changes the minimum lifetime of a used-once mapped file
page from a full memory cycle to an inactive list cycle, which allows it
to occur in linear streams without affecting the stable working set of the
system.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
page_mapping_inuse() is a historic predicate function for pages that are
about to be reclaimed or deactivated.

According to it, a page is in use when it is mapped into page tables OR
part of swap cache OR backing an mmapped file.
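The predicate being removed looked roughly like this:

static int page_mapping_inuse(struct page *page)
{
	struct address_space *mapping;

	/* page is in somebody's page tables */
	if (page_mapped(page))
		return 1;

	/* page is part of the swap cache */
	if (PageSwapCache(page))
		return 1;

	mapping = page_mapping(page);
	if (!mapping)
		return 0;

	/* the file the page backs is mmapped by somebody */
	return mapping_mapped(mapping);
}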
This function is used in combination with page_referenced(), which checks
for young bits in the ptes and for the PG_referenced bit in the page
descriptor itself. Thus, checking for unmapped swap cache pages is
meaningless as PG_referenced is not set for anonymous pages and unmapped
pages do not have young ptes. The test makes no difference.

Protecting file pages that are not by themselves mapped but are part of a
mapped file is also a historic leftover for short-lived things like the
exec() code in libc. However, the VM now does reference accounting and
activation of pages at unmap time and thus the special treatment on
reclaim is obsolete.

This patch drops page_mapping_inuse() and switches the two callsites to
use page_mapped() directly.

Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The used-once mapped file page detection patchset.
It is meant to help workloads with large amounts of shortly used file
mappings, like rtorrent hashing a file or git when dealing with loose
objects (git gc on a bigger site?).

Right now, the VM activates referenced mapped file pages on first
encounter on the inactive list and it takes a full memory cycle to
reclaim them again. When those pages dominate memory, the system
no longer has a meaningful notion of 'working set' and is required
to give up the active list to make reclaim progress. Obviously,
this results in rather bad scanning latencies and the wrong pages
being reclaimed.

This patch makes the VM more careful about activating mapped file
pages in the first place. The minimum granted lifetime without
another memory access becomes an inactive list cycle instead of the
full memory cycle, which is more natural given the mentioned loads.

This test resembles a hashing rtorrent process. Sequentially, 32MB
chunks of a file are mapped into memory, hashed (sha1) and unmapped
again. While this happens, every 5 seconds a process is launched and
its execution time taken:

python2.4 -c 'import pydoc'
old: max=2.31s mean=1.26s (0.34)
new: max=1.25s mean=0.32s (0.32)

find /etc -type f
old: max=2.52s mean=1.44s (0.43)
new: max=1.92s mean=0.12s (0.17)

vim -c ':quit'
old: max=6.14s mean=4.03s (0.49)
new: max=3.48s mean=2.41s (0.25)

mplayer --help
old: max=8.08s mean=5.74s (1.02)
new: max=3.79s mean=1.32s (0.81)

overall hash time (stdev):
old: time=1192.30 (12.85) thruput=25.78mb/s (0.27)
new: time=1060.27 (32.58) thruput=29.02mb/s (0.88) (-11%)

I also tested kernbench with regular IO streaming in the background to
see whether the delayed activation of frequently used mapped file
pages had a negative impact on performance in the presence of pressure
on the inactive list. The patch made no significant difference in
timing, neither for kernbench nor for the streaming IO throughput.

The first patch submission raised concerns about the cost of the extra
faults for actually activated pages on machines that have no hardware
support for young page table entries.

I created an artificial worst case scenario on an ARM machine with
around 300MHz and 64MB of memory to figure out the dimensions
involved. The test would mmap a file of 20MB, then

1. touch all its pages to fault them in
2. force one full scan cycle on the inactive file LRU
-- old: mapping pages activated
-- new: mapping pages inactive
3. touch the mapping pages again
-- old and new: fault exceptions to set the young bits
4. force another full scan cycle on the inactive file LRU
5. touch the mapping pages one last time
-- new: fault exceptions to set the young bits
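A hypothetical userspace harness for the above, for illustration only ("testfile" and the buffer sizes are assumptions; steps 2 and 4 cannot be forced directly from userspace, so they are approximated here by generating memory pressure with a large anonymous buffer; error checking omitted):

#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAP_SZ   (20UL << 20)	/* 20MB file mapping */
#define PRESS_SZ (48UL << 20)	/* pressure buffer for a 64MB box */

static void touch_pages(char *p, size_t len, size_t pagesize)
{
	size_t off;

	for (off = 0; off < len; off += pagesize)
		p[off]++;	/* faults the page in / sets its young bit */
}

int main(void)
{
	size_t pagesize = sysconf(_SC_PAGESIZE);
	int fd = open("testfile", O_RDWR);
	char *map = mmap(NULL, MAP_SZ, PROT_READ | PROT_WRITE,
			 MAP_SHARED, fd, 0);
	char *pressure = malloc(PRESS_SZ);

	touch_pages(map, MAP_SZ, pagesize);	/* 1. fault everything in */
	memset(pressure, 1, PRESS_SZ);		/* 2. approximate an LRU cycle */
	touch_pages(map, MAP_SZ, pagesize);	/* 3. re-reference the mapping */
	memset(pressure, 2, PRESS_SZ);		/* 4. approximate another cycle */
	touch_pages(map, MAP_SZ, pagesize);	/* 5. final touch */
	return 0;
}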
The test showed an overall increase of 6% in time over 100 iterations
of the above (old: ~212sec, new: ~225sec). 13 secs total overhead /
(100 * 5k pages), ignoring the execution time of the test itself,
makes for about 25us overhead for every page that actually gets
activated. Note:

1. File mapping the size of one third of main memory, _completely_
in active use across memory pressure - i.e., most pages referenced
within one LRU cycle. This should be rare to non-existent,
especially on such embedded setups.

2. Many huge activation batches. Those batches only occur when the
working set fluctuates. If it changes completely between every full
LRU cycle, you have problematic reclaim overhead anyway.

3. Access of activated pages at maximum speed: sequential loads from
every single page without doing anything in between. In reality,
the extra faults will get distributed between actual operations on
the data.So even if a workload manages to get the VM into the situation of
activating a third of memory in one go on such a setup, it will take
2.2 seconds instead 2.1 without the patch.Comparing the numbers (and my user-experience over several months),
I think this change is an overall improvement to the VM.Patch 1 is only refactoring to break up that ugly compound conditional
in shrink_page_list() and make it easy to document and add new checks
in a readable fashion.

Patch 2 gets rid of the obsolete page_mapping_inuse(). It's not
strictly related to #3, but it was in the original submission and is a
net simplification, so I kept it.

Patch 3 implements used-once detection of mapped file pages.
This patch:
Moving the big conditional into its own predicate function makes the code
a bit easier to read and allows for better commenting on the checks
one-by-one.

This is just cleaning up, no semantics should have been changed.
Signed-off-by: Johannes Weiner
Reviewed-by: Rik van Riel
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Lee Schermerhorn
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
free_area_init_nodes() emits pfn ranges for all zones on the system.
There may be no pages on a higher zone, however, due to memory limitations
or the use of the mem= kernel parameter. For example:

Zone PFN ranges:
DMA      0x00000001 -> 0x00001000
DMA32    0x00001000 -> 0x00100000
Normal   0x00100000 -> 0x00100000

The implementation copies the previous zone's highest pfn, if any, as the
next zone's lowest pfn. If its highest pfn is then greater than the
amount of addressable memory, the upper memory limit is used instead.
Thus, both the lowest and highest possible pfn for higher zones without
memory may be the same.

The pfn range for zones without memory is now shown as "empty" instead.
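A sketch of the new output logic in free_area_init_nodes():

printk("  %-8s ", zone_names[i]);
if (arch_zone_lowest_possible_pfn[i] == arch_zone_highest_possible_pfn[i])
	printk("empty\n");	/* zone spans no pages */
else
	printk("0x%08lx -> 0x%08lx\n",
	       arch_zone_lowest_possible_pfn[i],
	       arch_zone_highest_possible_pfn[i]);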
Signed-off-by: David Rientjes
Cc: Mel Gorman
Reviewed-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are quite a few GFP_KERNEL memory allocations made during
suspend/hibernation and resume that may cause the system to hang, because
the I/O operations they depend on cannot be completed due to the
underlying devices being suspended.

Avoid this problem by clearing the __GFP_IO and __GFP_FS bits in
gfp_allowed_mask before suspend/hibernation and restoring the original
values of these bits in gfp_allowed_mask during the subsequent resume.
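Roughly the mechanism, as a sketch (helper names approximate): the page allocator already filters every allocation through gfp_allowed_mask, so masking the bits there turns all allocations into GFP_NOIO-style requests for the duration.

static gfp_t saved_gfp_mask;

static void suspend_restrict_gfp_mask(void)	/* before suspend */
{
	saved_gfp_mask = clear_gfp_allowed_mask(__GFP_IO | __GFP_FS);
}

static void resume_restore_gfp_mask(void)	/* after resume */
{
	set_gfp_allowed_mask(saved_gfp_mask);
}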
[akpm@linux-foundation.org: fix CONFIG_PM=n linkage]
Signed-off-by: Rafael J. Wysocki
Reported-by: Maxim Levitsky
Cc: Sebastian Ott
Cc: Benjamin Herrenschmidt
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There's an off-by-one disagreement between mkswap and swapon about the
meaning of swap_header last_page: mkswap (in all versions I've looked at:
util-linux-ng and BusyBox and old util-linux; probably as far back as
1999) consistently means the offset (in page units) of the last page of
the swap area, whereas kernel sys_swapon (as far back as 2.2 and 2.3)
strangely takes it to mean the size (in page units) of the swap area.

This disagreement is the safe way round; but it's worrying people, and
loses us one page of swap.

The fix is not just to add one to nr_good_pages: we need to get maxpages
(the size of the swap_map array) right before that; and though that is an
unsigned long, be careful not to overflow the unsigned int p->max which
later holds it (probably why header uses __u32 last_page instead of size).
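A simplified sketch of the corrected arithmetic (the real code derives the maximum encodable offset via a pte round trip; this is not the literal patch):

/* last_page is the offset of the last page: pages 0..last_page */
maxpages = swp_offset(swp_entry(0, ~0UL)) + 1;	/* highest offset + 1 */
if (maxpages > swap_header->info.last_page + 1)
	maxpages = swap_header->info.last_page + 1;	/* was: last_page */
p->max = maxpages;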
Why did we subtract one from the maximum swp_offset to calculate maxpages?
Though it was probably me who made that change in 2.4.10, I don't get it:
and now we should be adding one (without risk of overflow in this case).

Fix the handling of swap_header badpages: it could have overrun the
swap_map when a very large swap area was used on a more limited
architecture.

Remove pre-initializations of swap_header, nr_good_pages and maxpages:
those date from when sys_swapon was supporting other versions of header.

Reported-by: Nitin Gupta
Reported-by: Jarkko Lavinen
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When a VMA is in an inconsistent state during setup or teardown, the worst
that can happen is that the rmap code will not be able to find the page.

The mapping is in the process of being torn down (PTEs just got
invalidated by munmap), or set up (no PTEs have been instantiated yet).

It is also impossible for the rmap code to follow a pointer to an already
freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
teardown code needs to take before the VMA is removed from the anon_vma
chain.

Hence, we should not need the VM_LOCK_RMAP locking at all.
Signed-off-by: Rik van Riel
Cc: Nick Piggin
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Cc: Andrea Arcangeli
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When the parent process breaks the COW on a page, both the original,
which is mapped in the child, and the new page, which is mapped in the
parent, end up in that same anon_vma. Generally this won't be a problem,
but for some workloads it could preserve the O(N) rmap scanning
complexity.

A simple fix is to ensure that, when a page mapped in the child gets
reused in do_wp_page, because we already are the exclusive owner, the
page gets moved to our own exclusive child's anon_vma.
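Condensed from the patch, the relevant hunk in do_wp_page():

if (reuse_swap_page(old_page)) {
	/*
	 * The page is all ours. Move it to our anon_vma so the rmap
	 * code will not search our parent or siblings.
	 */
	page_move_anon_rmap(old_page, vma, address);
}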
Signed-off-by: Rik van Riel
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When an anonymous page is inherited from a parent process, the
vma->anon_vma can differ from the page anon_vma. This can trip up
__page_check_anon_rmap, which is indirectly called from do_swap_page().

Remove that obsolete check to prevent an oops.
Signed-off-by: Rik van Riel
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Reviewed-by: Minchan Kim
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The old anon_vma code can lead to scalability issues with heavily forking
workloads. Specifically, each anon_vma will be shared between the parent
process and all its child processes.

In a workload with 1000 child processes and a VMA with 1000 anonymous
pages per process that get COWed, this leads to a system with a million
anonymous pages in the same anon_vma, each of which is mapped in just one
of the 1000 processes. However, the current rmap code needs to walk them
all, leading to O(N) scanning complexity for each page.

This can result in systems where one CPU is walking the page tables of
1000 processes in page_referenced_one, while all other CPUs are stuck on
the anon_vma lock. This leads to catastrophic failure for a benchmark
like AIM7, where the total number of processes can reach in the tens of
thousands. Real workloads are still a factor of 10 less process intensive
than AIM7, but they are catching up.

This patch changes the way anon_vmas and VMAs are linked, which allows us
to associate multiple anon_vmas with a VMA. At fork time, each child
process gets its own anon_vmas, in which its COWed pages will be
instantiated. The parent's anon_vma is also linked to the VMA, because
non-COWed pages could be present in any of the children.
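The link object that makes this possible, essentially as introduced by the patch: one entry per (VMA, anon_vma) pair.

struct anon_vma_chain {
	struct vm_area_struct *vma;
	struct anon_vma *anon_vma;
	struct list_head same_vma;	/* all anon_vmas of this VMA */
	struct list_head same_anon_vma;	/* all VMAs in this anon_vma */
};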
This reduces rmap scanning complexity to O(1) for the pages of the 1000
child processes, with O(N) complexity for at most 1/N pages in the system.
This reduces the average scanning cost in heavily forking workloads from
O(N) to 2.

The only real complexity in this patch stems from the fact that linking a
VMA to anon_vmas now involves memory allocations. This means vma_adjust
can fail, if it needs to attach a VMA to anon_vma structures. This in
turn means error handling needs to be added to the calling functions.

A second source of complexity is that, because there can be multiple
anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
"the" anon_vma lock. To prevent the rmap code from walking up an
incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
to make sure it is impossible to compile a kernel that needs both symbolic
values for the same bitflag.

Some test results:
Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
box with 16GB RAM and not quite enough IO), the system ends up running
>99% in system time, with every CPU on the same anon_vma lock in the
pageout code.

With these changes, AIM7 hits the cross-over point around 29.7k users.
This happens with ~99% IO wait time, there never seems to be any spike in
system time. The anon_vma lock contention appears to be resolved.

[akpm@linux-foundation.org: cleanups]
Signed-off-by: Rik van Riel
Cc: KOSAKI Motohiro
Cc: Larry Woodman
Cc: Lee Schermerhorn
Cc: Minchan Kim
Cc: Andrea Arcangeli
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm/memcontrol.c:2548:32: warning: Using plain integer as NULL pointer
Signed-off-by: Thiago Farina
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This fixes inefficient page-by-page reads on POSIX_FADV_RANDOM.
POSIX_FADV_RANDOM used to set ra_pages=0, which leads to poor performance:
a 16K read will be carried out in 4 _sync_ 1-page reads.

In other places, ra_pages==0 means
- it's ramfs/tmpfs/hugetlbfs/sysfs/configfs
- some IO error happened
where multi-page read IO won't help or should be avoided.

POSIX_FADV_RANDOM actually wants different semantics: to disable the
*heuristic* readahead algorithm, and to use a dumb one which faithfully
submits read IO for whatever the application requests.

So introduce a flag FMODE_RANDOM for POSIX_FADV_RANDOM.
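The two halves of the change, closely following the patch: fadvise sets the mode bit instead of zeroing ra_pages, and the readahead entry point bypasses the heuristics for such files.

/* mm/fadvise.c */
case POSIX_FADV_RANDOM:
	spin_lock(&file->f_lock);
	file->f_mode |= FMODE_RANDOM;
	spin_unlock(&file->f_lock);
	break;

/* mm/readahead.c, page_cache_sync_readahead(): be dumb */
if (filp && (filp->f_mode & FMODE_RANDOM)) {
	force_page_cache_readahead(mapping, filp, offset, req_size);
	return;
}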
Note that the random hint is not likely to help random read performance
noticeably. And it may be too permissive on huge request sizes (its IO
size is not limited by read_ahead_kb).

In Quentin's report (http://lkml.org/lkml/2009/12/24/145), the overall
(NFS read) performance of the application increased by 313%!

Tested-by: Quentin Barnes
Signed-off-by: Wu Fengguang
Cc: Nick Piggin
Cc: Andi Kleen
Cc: Steven Whitehouse
Cc: David Howells
Cc: Jonathan Corbet
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Trond Myklebust
Cc: Chuck Lever
Cc: [2.6.33.x]
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
commit 01b1ae63c2 ("memcg: simple migration handling") removed
mem_cgroup_uncharge_cache_page() call from migrate_page_copy. Local
variable `anon' is now unused.

Signed-off-by: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, do_migrate_pages() has a very long comment that is not
indented properly. It is easy to misread it as the function's opening
comment. This patch fixes the indentation.

Note: this patch doesn't break the 80 column rule. I guess the original
author intended this indentation, but an accident corrupted it.

Signed-off-by: KOSAKI Motohiro
Reviewed-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A memmap is a directory in sysfs which includes 3 text files: start, end
and type. For example:

start: 0x100000
end: 0x7e7b1cff
type: System RAM

The interface firmware_map_add was not called explicitly. Remove it and
add the function firmware_map_add_hotplug as the hotplug interface of
memmap.

Each memory entry has a memmap in sysfs, but when we hot-add new memory,
sysfs does not export a memmap entry for it. We add a call in the
function add_memory to the function firmware_map_add_hotplug.

Add a new function add_sysfs_fw_map_entry() to create the memmap entry;
it will be called when the memmap is initialized and when memory is
hot-added.
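The hotplug hook, as a sketch:

/* drivers/firmware/memmap.c */
int firmware_map_add_hotplug(u64 start, u64 end, const char *type);

/* called from add_memory() for the hot-added range: */
firmware_map_add_hotplug(start, start + size, "System RAM");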
[akpm@linux-foundation.org: un-kernedoc a no longer kerneldoc comment]
Signed-off-by: Shaohui Zheng
Acked-by: Andi Kleen
Acked-by: Yasunori Goto
Reviewed-by: Wu Fengguang
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Strangely, current mbind() doesn't merge a vma with the neighbor vma
although it's possible. Unfortunately, many vmas can reduce
performance... This patch fixes it.
reproduced program
----------------------------------------------------------------
#include <numa.h>
#include <numaif.h>
#include <sys/mman.h>
#include <stdio.h>
#include <unistd.h>
#include <stdlib.h>
#include <string.h>

static unsigned long pagesize;

int main(int argc, char** argv)
{
	void* addr;
	int ch;
	int node;
	struct bitmask *nmask = numa_allocate_nodemask();
	int err;
	int node_set = 0;
	char buf[128];

	while ((ch = getopt(argc, argv, "n:")) != -1){
		switch (ch){
		case 'n':
			node = strtol(optarg, NULL, 0);
			numa_bitmask_setbit(nmask, node);
			node_set = 1;
			break;
		default:
			;
		}
	}
	argc -= optind;
	argv += optind;

	if (!node_set)
		numa_bitmask_setbit(nmask, 0);

	pagesize = getpagesize();
	addr = mmap(NULL, pagesize*3, PROT_READ|PROT_WRITE,
		    MAP_ANON|MAP_PRIVATE, 0, 0);
	if (addr == MAP_FAILED)
		perror("mmap "), exit(1);
	fprintf(stderr, "pid = %d \n" "addr = %p\n", getpid(), addr);

	/* make page populate */
	memset(addr, 0, pagesize*3);

	/* first mbind */
	err = mbind(addr+pagesize, pagesize, MPOL_BIND, nmask->maskp,
		    nmask->size, MPOL_MF_MOVE_ALL);
	if (err)
		perror("mbind1 "), exit(1);

	/* second mbind */
	err = mbind(addr, pagesize*3, MPOL_DEFAULT, NULL, 0, 0);
	if (err)
		perror("mbind2 "), exit(1);

	sprintf(buf, "cat /proc/%d/maps", getpid());
	system(buf);
	return 0;
}
----------------------------------------------------------------

result without this patch
addr = 0x7fe26ef09000
[snip]
7fe26ef09000-7fe26ef0a000 rw-p 00000000 00:00 0
7fe26ef0a000-7fe26ef0b000 rw-p 00000000 00:00 0
7fe26ef0b000-7fe26ef0c000 rw-p 00000000 00:00 0
7fe26ef0c000-7fe26ef0d000 rw-p 00000000 00:00 0

=> 0x7fe26ef09000-0x7fe26ef0c000 have three vmas.

result with this patch
addr = 0x7fc9ebc76000
[snip]
7fc9ebc76000-7fc9ebc7a000 rw-p 00000000 00:00 0
7fffbe690000-7fffbe6a5000 rw-p 00000000 00:00 0 [stack]

=> 0x7fc9ebc76000-0x7fc9ebc7a000 have only one vma.
[minchan.kim@gmail.com: fix file offset passed to vma_merge()]
Signed-off-by: KOSAKI Motohiro
Reviewed-by: Christoph Lameter
Cc: Nick Piggin
Cc: Hugh Dickins
Cc: Andrea Arcangeli
Cc: Mel Gorman
Cc: Lee Schermerhorn
Signed-off-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
commit e815af95 ("change all_unreclaimable zone member to flags") changed
the all_unreclaimable member to a bit flag. But it had an undesirable
side effect: free_one_page() is one of the hottest paths in the Linux
kernel, and increasing the atomic ops in it can reduce kernel performance
a bit.

Thus, this patch partially reverts that commit; at least
all_unreclaimable shouldn't share a memory word with the other zone flags.

[akpm@linux-foundation.org: fix patch interaction]
Signed-off-by: KOSAKI Motohiro
Cc: David Rientjes
Cc: Wu Fengguang
Cc: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Huang Shijie
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
free_hot_page() is just a wrapper around free_hot_cold_page() with
parameter 'cold = 0'. After adding a clear comment for
free_hot_cold_page(), it is reasonable to remove a level of call.
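The wrapper being removed; callers now say free_hot_cold_page(page, 0) directly:

void free_hot_page(struct page *page)
{
	free_hot_cold_page(page, 0);
}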
[akpm@linux-foundation.org: fix build]
Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Cc: Americo Wang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move a call of trace_mm_page_free_direct() from free_hot_page() to
free_hot_cold_page(). It is clearer and closer to kmemcheck_free_shadow(),
as is done in function __free_pages_ok().

Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
trace_mm_page_free_direct() is called in function __free_pages(). But it
is called again in free_hot_page() if order == 0, producing duplicate
records in the trace file for the mm_page_free_direct event. As below:

K-PID CPU# TIMESTAMP FUNCTION
gnome-terminal-1567 [000] 4415.246466: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246468: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.246506: mm_page_alloc: page=ffffea0003db9f40 pfn=1155800 order=0 migratetype=0 gfp_flags=GFP_KERNEL
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0
gnome-terminal-1567 [000] 4415.255557: mm_page_free_direct: page=ffffea0003db9f40 pfn=1155800 order=0

This patch removes the first call and adds a call to
trace_mm_page_free_direct() in __free_pages_ok().

Signed-off-by: Li Hong
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Ingo Molnar
Cc: Larry Woodman
Cc: Peter Zijlstra
Cc: Li Ming Chun
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Commit cf40bd16fd ("lockdep: annotate reclaim context") introduced reclaim
context annotation. But it didn't annotate zone reclaim. This patch does
it.

The point is, commit cf40bd16fd annotates __alloc_pages_direct_reclaim but
zone reclaim doesn't use __alloc_pages_direct_reclaim.

The current call graph is

__alloc_pages_nodemask
   get_page_from_freelist
       zone_reclaim()
   __alloc_pages_slowpath
       __alloc_pages_direct_reclaim
           try_to_free_pages

Actually, if zone_reclaim_mode=1, the VM never calls
__alloc_pages_direct_reclaim under usual VM pressure.
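A sketch of the annotation added to the zone reclaim path, mirroring what the direct reclaim path already does:

static int __zone_reclaim(struct zone *zone, gfp_t gfp_mask, unsigned int order)
{
	struct task_struct *p = current;

	p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
	lockdep_set_current_reclaim_state(gfp_mask);

	/* ... shrink_zone()/shrink_slab() work elided ... */

	lockdep_clear_current_reclaim_state();
	p->flags &= ~(PF_MEMALLOC | PF_SWAPWRITE);
	return 0;	/* sketch */
}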
Signed-off-by: KOSAKI Motohiro
Reviewed-by: Minchan Kim
Acked-by: Nick Piggin
Cc: Peter Zijlstra
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
get_scan_ratio() should have all the scan-ratio related calculations.
Thus, this patch moves some calculations into get_scan_ratio().

Signed-off-by: KOSAKI Motohiro
Reviewed-by: Rik van Riel
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Kswapd checks that zone has sufficient pages free via zone_watermark_ok().
If any zone doesn't have enough pages, we set all_zones_ok to zero.
!all_zones_ok makes kswapd retry rather than sleep.

I think the watermark check before shrink_zone() is pointless. Only after
kswapd has tried to shrink the zone is the check meaningful.

Move the check to after the call to shrink_zone().
[akpm@linux-foundation.org: fix comment, layout]
Signed-off-by: Minchan Kim
Reviewed-by: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Rik van Riel
Reviewed-by: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Make sure the compiler won't do weird things with limits, e.g. fetching
them twice may return 2 different values after writable limits are
implemented.

I.e. either use the rlimit helpers added in
3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
fetching rlimits") or ACCESS_ONCE if not applicable.
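The two sanctioned patterns, for illustration (RLIMIT_AS and RLIMIT_NOFILE are chosen arbitrarily here):

/* helper reads the limit exactly once */
unsigned long as_lim = rlimit(RLIMIT_AS);

/* or, where the helpers don't fit, force a single fetch */
unsigned long nofile =
	ACCESS_ONCE(current->signal->rlim[RLIMIT_NOFILE].rlim_cur);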
Signed-off-by: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, mlock_vma_pages_range() only returns len or 0, so the current
error handling in mmap_region() is meaninglessly complex.

This patch simplifies it and makes it consistent with the brk() code.
Signed-off-by: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Lee Schermerhorn
Cc: Rik van Riel
Cc: KAMEZAWA Hiroyuki
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently, mlock_vma_pages_range() never returns a negative value, so we
can remove some worthless error checks.

Signed-off-by: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Lee Schermerhorn
Cc: Rik van Riel
Cc: KAMEZAWA Hiroyuki
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
A frequent question from users about memory management is how many swap
entries are used per process. And this information will give some
hints to the oom-killer.

Although we can count the number of swap entries per process by scanning
/proc/<pid>/smaps, this is very slow and not good for a usual process
information handler which works like 'ps' or 'top' (ps and top are
already slow enough..).

This patch adds a counter of swap entries to mm_counter, updated at each
swap event. The information is exported via the /proc/<pid>/status file as:

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB
Reviewed-by: Minchan Kim
Reviewed-by: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Considering the nature of per-mm stats, they are a shared object among
threads and can be a cache-miss point in the page fault path.

This patch adds a per-thread cache for mm_counter. The RSS value will be
counted into a struct in task_struct and synchronized with the mm's one
at events.

Now, in this patch, the event is the number of calls to handle_mm_fault.
The per-thread value is added to the mm every 64 calls.
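A sketch of the synchronization point (names close to the patch): each task accumulates counter deltas locally and folds them into the shared mm every 64 faults.

#define TASK_RSS_EVENTS_THRESH	(64)

static void check_sync_rss_stat(struct task_struct *task)
{
	if (unlikely(task != current))
		return;
	if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
		sync_task_rss_stat(task, task->mm);	/* fold and reset */
}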
A rough estimation with a small benchmark on parallel threads (2 threads)
shows:

[before]
4.5 cache-miss/faults
[after]
4.0 cache-miss/faults
Anyway, the most contended object is mmap_sem if the number of threads grows.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Presently, the per-mm statistics counters are defined by macros in sched.h.

This patch modifies them to be
- defined in mm.h as inline functions
- backed by an array instead of macro name generation

This patch is for reducing the size of a future patch that modifies the
per-mm counter implementation.
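A sketch of the new shape (the SPLIT_PTLOCKS variant): an enum indexes an array, and the accessors become inline functions in mm.h.

enum {
	MM_FILEPAGES,
	MM_ANONPAGES,
	NR_MM_COUNTERS
};

static inline void inc_mm_counter(struct mm_struct *mm, int member)
{
	atomic_long_inc(&mm->rss_stat.count[member]);
}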
Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Mar, 2010
1 commit
-
* 'slab-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/slab-2.6:
SLUB: Fix per-cpu merge conflict
failslab: add ability to filter slab caches
slab: fix regression in touched logic
dma kmalloc handling fixes
slub: remove impossible condition
slab: initialize unused alien cache entry as NULL at alloc_alien_cache().
SLUB: Make slub statistics use this_cpu_inc
SLUB: this_cpu: Remove slub kmem_cache fields
SLUB: Get rid of dynamic DMA kmalloc cache allocation
SLUB: Use this_cpu operations in slub
05 Mar, 2010
1 commit
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: (52 commits)
init: Open /dev/console from rootfs
mqueue: fix typo "failues" -> "failures"
mqueue: only set error codes if they are really necessary
mqueue: simplify do_open() error handling
mqueue: apply mathematics distributivity on mq_bytes calculation
mqueue: remove unneeded info->messages initialization
mqueue: fix mq_open() file descriptor leak on user-space processes
fix race in d_splice_alias()
set S_DEAD on unlink() and non-directory rename() victims
vfs: add NOFOLLOW flag to umount(2)
get rid of ->mnt_parent in tomoyo/realpath
hppfs can use existing proc_mnt, no need for do_kern_mount() in there
Mirror MS_KERNMOUNT in ->mnt_flags
get rid of useless vfsmount_lock use in put_mnt_ns()
Take vfsmount_lock to fs/internal.h
get rid of insanity with namespace roots in tomoyo
take check for new events in namespace (guts of mounts_poll()) to namespace.c
Don't mess with generic_permission() under ->d_lock in hpfs
sanitize const/signedness for udf
nilfs: sanitize const/signedness in dealing with ->d_name.name
...

Fix up fairly trivial (famous last words...) conflicts in
drivers/infiniband/core/uverbs_main.c and security/tomoyo/realpath.c
04 Mar, 2010
4 commits
-
The slab tree adds a percpu variable usage case (commit
9dfc6e68bfe6ee452efb1a4e9ca26a9007f2b864 "SLUB: Use this_cpu operations in
slub"), but the percpu tree removes the prefixing of percpu variables (commit
dd17c8f72993f9461e9c19250e3f155d6d99df22 "percpu: remove per_cpu__ prefix"),
thus causing the following compilation error:

CC mm/slub.o
mm/slub.c: In function ‘alloc_kmem_cache_cpus’:
mm/slub.c:2078: error: implicit declaration of function ‘per_cpu_var’
mm/slub.c:2078: warning: assignment makes pointer from integer without a cast
make[1]: *** [mm/slub.o] Error 1

Signed-off-by: Pekka Enberg
-
No one is calling this anymore as everyone switched to
invalidate_mapping_pages a long time ago. Also update a few
references to it in comments. nfs has two more, but I can't
easily figure out what they are actually referring to, so I left
them as-is.

Signed-off-by: Christoph Hellwig
Signed-off-by: Al Viro -
…l/git/tip/linux-2.6-tip
* 'x86-bootmem-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (30 commits)
early_res: Need to save the allocation name in drop_range_partial()
sparsemem: Fix compilation on PowerPC
early_res: Add free_early_partial()
x86: Fix non-bootmem compilation on PowerPC
core: Move early_res from arch/x86 to kernel/
x86: Add find_fw_memmap_area
Move round_up/down to kernel.h
x86: Make 32bit support NO_BOOTMEM
early_res: Enhance check_and_double_early_res
x86: Move back find_e820_area to e820.c
x86: Add find_early_area_size
x86: Separate early_res related code from e820.c
x86: Move bios page reserve early to head32/64.c
sparsemem: Put mem map for one node together.
sparsemem: Put usemap for one node together
x86: Make 64 bit use early_res instead of bootmem before slab
x86: Only call dma32_reserve_bootmem 64bit !CONFIG_NUMA
x86: Make early_node_mem get mem > 4 GB if possible
x86: Dynamically increase early_res array size
x86: Introduce max_early_res and early_res_count
...
03 Mar, 2010
1 commit
-
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
percpu: add __percpu sparse annotations to what's left
percpu: add __percpu sparse annotations to fs
percpu: add __percpu sparse annotations to core kernel subsystems
local_t: Remove leftover local.h
this_cpu: Remove pageset_notifier
this_cpu: Page allocator conversion
percpu, x86: Generic inc / dec percpu instructions
local_t: Move local.h include to ringbuffer.c and ring_buffer_benchmark.c
module: Use this_cpu_xx to dynamically allocate counters
local_t: Remove cpu_local_xx macros
percpu: refactor the code in pcpu_[de]populate_chunk()
percpu: remove compile warnings caused by __verify_pcpu_ptr()
percpu: make accessors check for percpu pointer in sparse
percpu: add __percpu for sparse.
percpu: make access macros universal
percpu: remove per_cpu__ prefix.
02 Mar, 2010
2 commits
-
* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next-2.6: (1341 commits)
virtio_net: remove forgotten assignment
be2net: fix tx completion polling
sis190: fix cable detect via link status poll
net: fix protocol sk_buff field
bridge: Fix build error when IGMP_SNOOPING is not enabled
bnx2x: Tx barriers and locks
scm: Only support SCM_RIGHTS on unix domain sockets.
vhost-net: restart tx poll on sk_sndbuf full
vhost: fix get_user_pages_fast error handling
vhost: initialize log eventfd context pointer
vhost: logging thinko fix
wireless: convert to use netdev_for_each_mc_addr
ethtool: do not set some flags, if others failed
ipoib: returned back addrlen check for mc addresses
netlink: Adding inode field to /proc/net/netlink
axnet_cs: add new id
bridge: Make IGMP snooping depend upon BRIDGE.
bridge: Add multicast count/interval sysfs entries
bridge: Add hash elasticity/max sysfs entries
bridge: Add multicast_snooping sysfs toggle
...

Trivial conflicts in Documentation/feature-removal-schedule.txt
-
Stephen reported that a build (powerpc ppc64_defconfig) produced these
warnings:

mm/sparse.c: In function 'sparse_init':
mm/sparse.c:488: warning: unused variable 'map_count'
mm/sparse.c:484: warning: unused variable 'size2'
mm/sparse.c:481: warning: unused variable 'map_map'
mm/sparse.c: At top level:
mm/sparse.c:442: warning: 'sparse_early_mem_maps_alloc_node' defined but not used

Introduced by commit 9bdac914240759457175ac0d6529a37d2820bc4d
("sparsemem: Put mem map for one node together").

Conditionalize the bits appropriately based on the setting of
CONFIG_SPARSEMEM_ALLOC_MEM_MAP_TOGETHER.

Reported-by: Stephen Rothwell
Tested-by: Stephen Rothwell
Signed-off-by: Yinghai Lu
LKML-Reference:
Signed-off-by: H. Peter Anvin