03 Apr, 2009

16 commits

  • The current mem_cgroup_cache_charge() is a bit complicated, especially
    in the case of shmem's swap-in.

    This patch cleans it up by using try_charge_swapin and commit_charge_swapin.
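
    As a rough sketch of the resulting two-phase pattern (the exact signatures
    here are assumptions, not copied from the patch), the caller first reserves
    the charge and only commits it once the page is safely in place:

    struct mem_cgroup *memcg = NULL;

    if (mem_cgroup_try_charge_swapin(mm, page, GFP_KERNEL, &memcg))
            goto out_fail;                          /* reservation failed */
    /* ... install the page (pte / swap cache) as needed ... */
    mem_cgroup_commit_charge_swapin(page, memcg);   /* make the charge final */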

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • It has been pointed out that swap_cgroup's message at swapon() is
    unnecessary, because

    * it can be calculated very easily if all the necessary information is
    written in Kconfig, and

    * there is no need to annoy people at every swapon().

    Besides, memory usage per swp_entry is now reduced from 8 bytes (64-bit)
    to 2 bytes, which I think is reasonably small.

    Reported-by: Hugh Dickins
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Use the CSS ID for records in swap_cgroup. With this, on a 64-bit machine,
    the size of a swap_cgroup record goes down from 8 bytes to 2 bytes.

    This means that, when 2GB of swap is equipped (assuming a 4096-byte page size),

    the size of swap_cgroup goes from 2G/4k * 8 = 4MB
                                   to 2G/4k * 2 = 1MB.

    The reduction is large. Of course, there are trade-offs: the CSS ID lookup
    adds overhead to swap-in/swap-out/swap-free.

    But in general,
    - swap is a resource which users tend to avoid using.
    - If swap is never used, the swap_cgroup area is not used.
    - Traditional manuals say the size of swap should be proportional to the
      size of memory, and machine memory sizes keep increasing.

    I think reducing the size of swap_cgroup makes sense.
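
    A minimal sketch of the shrunken record, assuming the per-entry structure
    simply stores the CSS ID of the owning memcg:

    /* Assumed layout: one record per swp_entry, indexed by swap offset. */
    struct swap_cgroup {
            unsigned short id;      /* CSS ID of the memcg owning this entry */
    };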

    Note:
    - The ID->CSS lookup routine takes no locks; it runs under the RCU read side.
    - A memcg can become obsolete at rmdir() but is not freed while a refcount
      from swap_cgroup remains.

    Changelog v4->v5:
    - reworked on to memcg-charge-swapcache-to-proper-memcg.patch
    Changelog ->v4:
    - fixed not configured case.
    - deleted unnecessary comments.
    - fixed NULL pointer bug.
    - fixed message in dmesg.

    [nishimura@mxp.nes.nec.co.jp: css_tryget can be called twice in !PageCgroupUsed case]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Hugh Dickins
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • memcg_test.txt says at 4.1:

    This swap-in is one of the most complicated cases. In do_swap_page(),
    the following events occur when the pte is unchanged.

    (1) the page (SwapCache) is looked up.
    (2) lock_page()
    (3) try_charge_swapin()
    (4) reuse_swap_page() (may call delete_swap_cache())
    (5) commit_charge_swapin()
    (6) swap_free().

    Consider the following situations, for example.

    (A) The page has not been charged before (2) and reuse_swap_page()
    doesn't call delete_from_swap_cache().
    (B) The page has not been charged before (2) and reuse_swap_page()
    calls delete_from_swap_cache().
    (C) The page has been charged before (2) and reuse_swap_page() doesn't
    call delete_from_swap_cache().
    (D) The page has been charged before (2) and reuse_swap_page() calls
    delete_from_swap_cache().

    memory.usage/memsw.usage changes to this page/swp_entry will be:

    Case          (A)      (B)      (C)      (D)
    Event
    Before (2)    0/ 1     0/ 1     1/ 1     1/ 1
    ===========================================
    (3)          +1/+1    +1/+1    +1/+1    +1/+1
    (4)            -       0/ 0      -      -1/ 0
    (5)           0/-1     0/ 0    -1/-1     0/ 0
    (6)            -       0/-1      -       0/-1
    ===========================================
    Result        1/ 1     1/ 1     1/ 1     1/ 1

    In all cases, the charges for this page should end up as 1/ 1.

    In case (D), mem_cgroup_try_get_from_swapcache() returns NULL
    (because lookup_swap_cgroup() returns NULL), so "+1/+1" at (3) means
    a charge to the memcg ("foo") to which "current" belongs.
    OTOH, "-1/0" at (4) and "0/-1" at (6) mean uncharges from the memcg ("baa")
    to which the page had been charged.

    So, if "foo" and "baa" are different (for example because of a task move),
    this charge is effectively moved from "baa" to "foo".

    I think this is unexpected behavior.

    This patch fixes this by modifying mem_cgroup_try_get_from_swapcache()
    to return the memcg to which the swapcache page has been charged if the
    PCG_USED bit is set.
    IIUC, checking the PCG_USED bit of a swapcache page is safe under the page lock.

    Signed-off-by: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Currently, mem_cgroup_calc_mapped_ratio() is not used at all, so it can be
    removed; KAMEZAWA-san suggested this.

    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add RSS and swap to OOM output from memcg

    Display memcg values like failcnt, usage and limit when an OOM occurs due
    to memcg.

    Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
    Daisuke Nishimura and KOSAKI Motohiro for review.

    Sample output
    -------------

    Task in /a/x killed as a result of limit of /a
    memory: usage 1048576kB, limit 1048576kB, failcnt 4183
    memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

    [akpm@linux-foundation.org: compilation fix]
    [akpm@linux-foundation.org: fix kerneldoc and whitespace]
    [akpm@linux-foundation.org: add printk facility level]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • This patch tries to fix OOM killer problems caused by hierarchy.
    Currently, memcg itself has an OOM kill function (in oom_kill.c) and tries
    to kill a task in the memcg.

    But when hierarchy is used, this is broken and the correct task cannot
    be killed. For example, in the following cgroup tree

    /groupA/   hierarchy=1, limit=1G
        01     no limit
        02     no limit

    the memory usage of all tasks under /groupA, /groupA/01 and /groupA/02 is
    limited to groupA's 1GB, but the OOM killer only kills tasks in groupA itself.

    This patch makes the bad process be selected from all tasks under the
    hierarchy. BTW, currently, oom_jiffies is updated only for groupA in the
    above case; the oom_jiffies of the whole tree should be updated.

    To see how oom_jiffies is used, please check mem_cgroup_oom_called()
    callers.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: const fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • As pointed out, shrinking a memcg's limit should return -EBUSY after a
    reasonable number of retries. This patch tries to fix the current behavior
    of shrink_usage.

    Before looking into the "shrink should return -EBUSY" problem, we should fix
    the hierarchical reclaim code. It compares the current usage and the current
    limit, but that only makes sense when the kernel reclaims memory because a
    limit was hit. This is also a problem.

    What this patch does:

    1. Add a new argument "shrink" to hierarchical reclaim. If "shrink==true",
    hierarchical reclaim returns immediately and the caller checks whether the
    kernel should shrink more or not.
    (While shrinking memory, usage is always smaller than limit, so a
    usage < limit check is useless.)

    2. To adjust to the above change, two changes in "shrink"'s retry path:
    2-a. retry_count depends on the number of children, because the kernel
    visits the children under the hierarchy one by one.
    2-b. Rather than checking the return value of hierarchical reclaim's
    progress, compare usage-before-shrink with usage-after-shrink.
    If usage-before-shrink
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Clean up the memory.stat file routine and show "total" hierarchical stats.

    This patch
    - renames get_all_zonestat to get_local_zonestat.
    - removes the old mem_cgroup_stat_desc, which is only for per-cpu stats.
    - adds mcs_stat to cover both per-cpu and per-lru stats.
    - adds "total" stats for the hierarchy (*)
    - adds a callback system to scan all memcgs under a root.
    == "total" entries are added.
    [kamezawa@localhost ~]$ cat /opt/cgroup/xxx/memory.stat
    cache 0
    rss 0
    pgpgin 0
    pgpgout 0
    inactive_anon 0
    active_anon 0
    inactive_file 0
    active_file 0
    unevictable 0
    hierarchical_memory_limit 50331648
    hierarchical_memsw_limit 9223372036854775807
    total_cache 65536
    total_rss 192512
    total_pgpgin 218
    total_pgpgout 155
    total_inactive_anon 0
    total_active_anon 135168
    total_inactive_file 61440
    total_active_file 4096
    total_unevictable 0
    ==
    (*) Maybe users could calculate the hierarchical stats with their own
    program in userland, but if it can be done in a clean way in the kernel,
    it's worth showing, I think.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Assign a CSS ID to each memcg and use css_get_next() for scanning the
    hierarchy.

    Assume the following tree:

    group_A (ID=3)
        /01 (ID=4)
            /0A (ID=7)
        /02 (ID=10)
    group_B (ID=5)

    and a task in group_A/01/0A hits the limit at group_A.

    Reclaim will be done in the following order (round-robin):
    group_A(3) -> group_A/01 (4) -> group_A/01/0A (7) -> group_A/02(10)
    -> group_A -> .....

    The round-robin is by ID. The last visited cgroup is recorded, and reclaim
    restarts from it the next time it runs.
    (A smarter algorithm could be implemented.)

    No cgroup_mutex or hierarchy_mutex is required.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In the following situation, with the memory subsystem,

    /groupA   use_hierarchy==1
        /01   some tasks
        /02   some tasks
        /03   some tasks
        /04   empty

    when tasks under 01/02/03 hit the limit on /groupA, hierarchical reclaim
    is triggered and the kernel walks the tree under groupA. In this case,
    rmdir /groupA/04 frequently fails with -EBUSY because of temporary
    refcounts taken by the kernel.

    In general, a cgroup can be rmdir'd if it has no child groups and
    no tasks. Frequent failures of rmdir() are not useful to users
    (and in most cases the reason for -EBUSY is unknown to them).

    This patch modifies the above behavior by
    - retrying if the css refcount is held by someone.
    - adding a return value to pre_destroy() that allows a subsystem to
      say "we're really busy!"

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • It is a fairly common operation to have a pointer to a work and to need a
    pointer to the delayed work it is contained in. In particular, all
    delayed works which want to rearm themselves will have to do that. So it
    would seem fair to offer a helper function for this operation.
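
    The helper boils down to a container_of() wrapper; a minimal sketch:

    static inline struct delayed_work *to_delayed_work(struct work_struct *work)
    {
            return container_of(work, struct delayed_work, work);
    }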

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Jean Delvare
    Acked-by: Ingo Molnar
    Cc: "David S. Miller"
    Cc: Herbert Xu
    Cc: Benjamin Herrenschmidt
    Cc: Martin Schwidefsky
    Cc: Greg KH
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • The calculation of the value nr in do_xip_mapping_read is incorrect. If
    the copy requires more than one iteration in the do-while loop, the copied
    variable will be non-zero. The maximum length that may be passed to the
    call to copy_to_user(buf+copied, xip_mem+offset, nr) is len-copied, but the
    check only compares against (nr > len).
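
    In other words, the per-iteration clamp must account for what has already
    been copied; a hedged sketch of the corrected check, using the variable
    names from the description above:

    /* nr: how much this iteration may hand to copy_to_user() */
    if (nr > len - copied)
            nr = len - copied;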

    This bug is the cause for the heap corruption Carsten has been chasing
    for so long:

    *** glibc detected *** /bin/bash: free(): invalid next size (normal): 0x00000000800e39f0 ***
    ======= Backtrace: =========
    /lib64/libc.so.6[0x200000b9b44]
    /lib64/libc.so.6(cfree+0x8e)[0x200000bdade]
    /bin/bash(free_buffered_stream+0x32)[0x80050e4e]
    /bin/bash(close_buffered_stream+0x1c)[0x80050ea4]
    /bin/bash(unset_bash_input+0x2a)[0x8001c366]
    /bin/bash(make_child+0x1d4)[0x8004115c]
    /bin/bash[0x8002fc3c]
    /bin/bash(execute_command_internal+0x656)[0x8003048e]
    /bin/bash(execute_command+0x5e)[0x80031e1e]
    /bin/bash(execute_command_internal+0x79a)[0x800305d2]
    /bin/bash(execute_command+0x5e)[0x80031e1e]
    /bin/bash(reader_loop+0x270)[0x8001efe0]
    /bin/bash(main+0x1328)[0x8001e960]
    /lib64/libc.so.6(__libc_start_main+0x100)[0x200000592a8]
    /bin/bash(clearerr+0x5e)[0x8001c092]

    With this bug fix the commit 0e4a9b59282914fe057ab17027f55123964bc2e2
    "ext2/xip: refuse to change xip flag during remount with busy inodes" can
    be removed again.

    Cc: Carsten Otte
    Cc: Nick Piggin
    Cc: Jared Hulbert
    Cc:
    Signed-off-by: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Schwidefsky
     
  • Even though vmstat_work is marked deferrable, there are still benefits to
    aligning it. For certain applications we want to keep OS jitter as low as
    possible and aligning timers and work so they occur together can reduce
    their overall impact.

    Signed-off-by: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anton Blanchard
     
  • Fix a number of issues with the per-MM VMA patch:

    (1) Make mmap_pages_allocated an atomic_long_t, just in case this is used on
    a NOMMU system with more than 2G pages. Makes no difference on a 32-bit
    system.

    (2) Report vma->vm_pgoff * PAGE_SIZE as a 64-bit value, not a 32-bit value,
    lest it overflow.

    (3) Move the allocation of the vm_area_struct slab back to fork.c.

    (4) Use KMEM_CACHE() for both vm_area_struct and vm_region slabs.

    (5) Use BUG_ON() rather than if () BUG().

    (6) Make the default validate_nommu_regions() a static inline rather than a
    #define.

    (7) Make free_page_series()'s objection to pages with a refcount != 1 more
    informative.

    (8) Adjust the __put_nommu_region() banner comment to indicate that the
    semaphore must be held for writing.

    (9) Limit the number of warnings about munmaps of non-mmapped regions.

    Reported-by: Andrew Morton
    Signed-off-by: David Howells
    Cc: Greg Ungerer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • This fixes a build failure with generic debug pagealloc:

    mm/debug-pagealloc.c: In function 'set_page_poison':
    mm/debug-pagealloc.c:8: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'clear_page_poison':
    mm/debug-pagealloc.c:13: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: In function 'page_poison':
    mm/debug-pagealloc.c:18: error: 'struct page' has no member named 'debug_flags'
    mm/debug-pagealloc.c: At top level:
    mm/debug-pagealloc.c:120: error: redefinition of 'kernel_map_pages'
    include/linux/mm.h:1278: error: previous definition of 'kernel_map_pages' was here
    mm/debug-pagealloc.c: In function 'kernel_map_pages':
    mm/debug-pagealloc.c:122: error: 'debug_pagealloc_enabled' undeclared (first use in this function)

    The failure is fixed by:

    - adding debug_flags to struct page
    - defining the DEBUG_PAGEALLOC config option for all architectures

    Signed-off-by: Akinobu Mita
    Reported-by: Alexander Beregalov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     

01 Apr, 2009

24 commits

  • Synopsis: if shmem_writepage calls swap_writepage directly, most shmem
    swap loads benefit, and a catastrophic interaction between SLUB and some
    flash storage is avoided.

    shmem_writepage() has always been peculiar in making no attempt to write:
    it has just transferred a shmem page from file cache to swap cache, then
    let that page make its way around the LRU again before being written and
    freed.

    The idea was that people use tmpfs because they want those pages to stay
    in RAM; so although we give it an overflow to swap, we should resist
    writing too soon, giving those pages a second chance before they can be
    reclaimed.

    That was always questionable, and I've toyed with this patch for years;
    but never had a clear justification to depart from the original design.

    It became more questionable in 2.6.28, when the split LRU patches classed
    shmem and tmpfs pages as SwapBacked rather than as file_cache: that in
    itself gives them more resistance to reclaim than normal file pages. I
    prepared this patch for 2.6.29, but the merge window arrived before I'd
    completed gathering statistics to justify sending it in.

    Then while comparing SLQB against SLUB, running SLUB on a laptop I'd
    habitually used with SLAB, I found SLUB to run my tmpfs kbuild swapping
    tests five times slower than SLAB or SLQB - other machines slower too, but
    nowhere near so bad. Simpler "cp -a" swapping tests showed the same.

    slub_max_order=0 brings sanity to all, but heavy swapping is too far from
    normal to justify such a tuning. The crucial factor on that laptop turns
    out to be that I'm using an SD card for swap. What happens is this:

    By default, SLUB uses order-2 pages for shmem_inode_cache (and many other
    fs inodes), so creating tmpfs files under memory pressure brings lumpy
    reclaim into play. One subpage of the order is chosen from the bottom of
    the LRU as usual, then the other three picked out from their random
    positions on the LRUs.

    In a tmpfs load, many of these pages will be ones which already passed
    through shmem_writepage, so already have swap allocated. And though their
    offsets on swap were probably allocated sequentially, now that the pages
    are picked off at random, their swap offsets are scattered.

    But the flash storage on the SD card is very sensitive to having its
    writes merged: once swap is written at scattered offsets, performance
    falls apart. Rotating disk seeks increase too, but less disastrously.

    So: stop giving shmem/tmpfs pages a second pass around the LRU, write them
    out to swap as soon as their swap has been allocated.

    It's surely possible to devise an artificial load which runs faster the
    old way, one whose sizing is such that the tmpfs pages on their second
    pass are the ones that are wanted again, and other pages not.

    But I've not yet found such a load: on all machines, under the loads I've
    tried, immediate swap_writepage speeds up shmem swapping: especially when
    using the SLUB allocator (and more effectively than slub_max_order=0), but
    also with the others; and it also reduces the variance between runs. How
    much faster varies widely: a factor of five is rare, 5% is common.

    One load which might have suffered: imagine a swapping shmem load in a
    limited mem_cgroup on a machine with plenty of memory. Before 2.6.29 the
    swapcache was not charged, and such a load would have run quickest with
    the shmem swapcache never written to swap. But now swapcache is charged,
    so even this load benefits from shmem_writepage directly to swap.

    Apologies for the #ifndef CONFIG_SWAP swap_writepage() stub in swap.h:
    it's silly because that will never get called; but refactoring shmem.c
    sensibly according to CONFIG_SWAP will be a separate task.

    Signed-off-by: Hugh Dickins
    Acked-by: Pekka Enberg
    Acked-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • try_to_free_pages() is used for the direct reclaim of up to
    SWAP_CLUSTER_MAX pages when watermarks are low. The caller of
    alloc_pages_nodemask() can specify a nodemask of nodes that are allowed to
    be used, but this is not passed to try_to_free_pages(). This can lead to
    unnecessary reclaim of pages that are unusable by the caller and, in the
    worst case, to allocation failure because progress was not made where it
    is needed.

    This patch passes the nodemask used for alloc_pages_nodemask() to
    try_to_free_pages().
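
    A sketch of the resulting call shape (the exact signature is an assumption
    based on the description):

    /* The allocator's nodemask is now threaded through to direct reclaim. */
    unsigned long try_to_free_pages(struct zonelist *zonelist, int order,
                                    gfp_t gfp_mask, nodemask_t *nodemask);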

    Reviewed-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • When a shrinker has a negative number of objects to delete, the symbol
    name of the shrinker should be printed, not shrink_slab. This also makes
    the error message slightly more informative.

    Cc: Ingo Molnar
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Make CONFIG_UNEVICTABLE_LRU available when CONFIG_MMU=n. There's no logical
    reason it shouldn't be available, and it can be used for ramfs.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • The mlock() facility does not exist for NOMMU since all mappings are
    effectively locked anyway, so we don't make the bits available when
    they're not useful.

    Signed-off-by: David Howells
    Reviewed-by: KOSAKI Motohiro
    Cc: Peter Zijlstra
    Cc: Greg Ungerer
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Lee Schermerhorn
    Cc: Enrik Berkhan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • x86 has debug_kmap_atomic_prot(), which is an error-checking function for
    kmap_atomic. It is useful for the other architectures, although it needs
    CONFIG_TRACE_IRQFLAGS_SUPPORT.

    This patch exposes it to the other architectures.

    Signed-off-by: Akinobu Mita
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • Change the page_mkwrite prototype to take a struct vm_fault, and return
    VM_FAULT_xxx flags. There should be no functional change.
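
    Sketched from the description (treat the exact signature as an assumption),
    the hook changes roughly like this:

    /* old: int (*page_mkwrite)(struct vm_area_struct *vma, struct page *page); */
    /* new: takes the fault descriptor and returns VM_FAULT_xxx flags           */
    int (*page_mkwrite)(struct vm_area_struct *vma, struct vm_fault *vmf);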

    This makes it possible to return much more detailed error information to
    the VM (and can also provide more information, e.g. the virtual_address,
    to the driver, which might be important in some special cases).

    This is required for a subsequent fix. And will also make it easier to
    merge page_mkwrite() with fault() in future.

    Signed-off-by: Nick Piggin
    Cc: Chris Mason
    Cc: Trond Myklebust
    Cc: Miklos Szeredi
    Cc: Steven Whitehouse
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Artem Bityutskiy
    Cc: Felix Blyakher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Addresses http://bugzilla.kernel.org/show_bug.cgi?id=9838

    On i386, HZ=1000, jiffies_to_clock_t() converts time in a somewhat strange
    way from the user's point of view:

    # echo 500 >/proc/sys/vm/dirty_writeback_centisecs
    # cat /proc/sys/vm/dirty_writeback_centisecs
    499

    So, we have 5000 jiffies converted to only 499 clock ticks and reported
    back.

    TICK_NSEC = 999848
    ACTHZ = 256039
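
    A rough back-of-the-envelope calculation (assuming the conversion multiplies
    by the real tick length in nanoseconds and truncates): 5000 jiffies *
    999848 ns = 4,999,240,000 ns = 499.924 centiseconds, which truncates to 499.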

    Keeping the in-kernel variable in the units passed from userspace would fix
    the issue, of course, but this probably won't be right for every sysctl.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Peter Zijlstra
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • CONFIG_DEBUG_PAGEALLOC is now supported by x86, powerpc, sparc64, and
    s390. This patch implements it for the rest of the architectures by
    filling the pages with poison byte patterns after free_pages() and
    verifying the poison patterns before alloc_pages().

    This generic version cannot detect invalid page accesses immediately, but
    an invalid read may cause a bad dereference through poisoned memory, and an
    invalid write can be detected after a (possibly long) delay, when the
    poison pattern is next verified.
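
    A minimal sketch of the idea (the poison byte value and helper names here
    are assumptions, and highmem is ignored for brevity):

    #define PAGE_POISON 0xaa        /* assumed poison byte pattern */

    static void poison_page(struct page *page)
    {
            void *addr = page_address(page);        /* lowmem assumed */
            memset(addr, PAGE_POISON, PAGE_SIZE);
    }

    static bool page_is_poisoned(struct page *page)
    {
            unsigned char *addr = page_address(page);
            size_t i;

            for (i = 0; i < PAGE_SIZE; i++)
                    if (addr[i] != PAGE_POISON)
                            return false;   /* someone wrote here after free */
            return true;
    }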

    Signed-off-by: Akinobu Mita
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akinobu Mita
     
  • I noticed there are many places doing a copy_from_user() that follows a
    kmalloc():

    dst = kmalloc(len, GFP_KERNEL);
    if (!dst)
            return -ENOMEM;
    if (copy_from_user(dst, src, len)) {
            kfree(dst);
            return -EFAULT;
    }

    memdup_user() is a wrapper of the above code. With this new function, we
    don't have to write 'len' twice, which can lead to typos/mistakes. It
    also produces smaller code and kernel text.
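
    Roughly, the helper wraps that pattern and reports failure via ERR_PTR()
    (a sketch; the real implementation may differ in allocation details):

    void *memdup_user(const void __user *src, size_t len)
    {
            void *p;

            p = kmalloc(len, GFP_KERNEL);
            if (!p)
                    return ERR_PTR(-ENOMEM);

            if (copy_from_user(p, src, len)) {
                    kfree(p);
                    return ERR_PTR(-EFAULT);
            }

            return p;
    }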

    A quick grep shows 250+ places where memdup_user() *may* be used. I'll
    prepare a patchset to do this conversion.

    Signed-off-by: Li Zefan
    Cc: KOSAKI Motohiro
    Cc: Americo Wang
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • chg is unsigned, so it cannot be less than 0.

    Also, since region_chg returns long, let vma_needs_reservation() forward
    this to alloc_huge_page(), and store it as long as well; all callers cast
    it to long anyway.

    Signed-off-by: Roel Kluin
    Cc: Andy Whitcroft
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • pagevec_swap_free() is now unused.

    Signed-off-by: KOSAKI Motohiro
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The pagevec_swap_free() at the end of shrink_active_list() was introduced
    in 68a22394 "vmscan: free swap space on swap-in/activation" when
    shrink_active_list() was still rotating referenced active pages.

    In 7e9cd48 "vmscan: fix pagecache reclaim referenced bit check" this was
    changed, the rotating removed but the pagevec_swap_free() after the
    rotation loop was forgotten, applying now to the pagevec of the
    deactivation loop instead.

    Now swap space is freed for deactivated pages. And only for those that
    happen to be on the pagevec after the deactivation loop.

    Complete 7e9cd48 and remove the rest of the swap freeing.

    Signed-off-by: Johannes Weiner
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • In shrink_active_list() after the deactivation loop, we strip buffer heads
    from the potentially remaining pages in the pagevec.

    Currently, this drops the zone's lru lock for stripping, only to reacquire
    it again afterwards to update statistics.

    It is not necessary to strip the pages before updating the stats, so move
    the whole thing out of the protected region and save the extra locking.

    Signed-off-by: Johannes Weiner
    Reviewed-by: MinChan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Add a helper function account_page_dirtied(). Use that from two
    callsites. reiser4 adds a function which adds a third callsite.

    Signed-off-by: Edward Shishkin
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Shishkin
     
  • During page allocation, there are two stages of direct reclaim that are
    applied to each zone in the preferred list. The first stage, using
    zone_reclaim(), reclaims unmapped file-backed pages and slab pages if over
    defined limits, as these are cheaper to reclaim. The caller specifies the
    order of the target allocation, but the scan control is not being correctly
    initialised.

    The impact is that the correct number of pages is being reclaimed but
    lumpy reclaim is not being applied. This increases the chances that a full
    direct reclaim via try_to_free_pages() is required.

    This patch initialises the order field of the scan control as requested by
    the caller.

    [mel@csn.ul.ie: rewrote changelog]
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Cc: Andy Whitcroft
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • At first glance, mark_page_accessed() in follow_page() seems a bit strange.
    It seems pte_mkyoung() would be more consistent with other kernel code.

    However, it is intentional. The original commit log said:

    ------------------------------------------------
    commit 9e45f61d69be9024a2e6bef3831fb04d90fac7a8
    Author: akpm
    Date: Fri Aug 15 07:24:59 2003 +0000

    [PATCH] Use mark_page_accessed() in follow_page()

    Touching a page via follow_page() counts as a reference so we should be
    either setting the referenced bit in the pte or running mark_page_accessed().

    Altering the pte is tricky because we haven't implemented an atomic
    pte_mkyoung(). And mark_page_accessed() is better anyway because it has more
    aging state: it can move the page onto the active list.

    BKrev: 3f3c8acbplT8FbwBVGtth7QmnqWkIw
    ------------------------------------------------

    The atomicity issue is still true nowadays. Adding a comment helps readers
    understand the intention of the code.

    [akpm@linux-foundation.org: clarify text]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • shrink_inactive_list() scans in sc->swap_cluster_max chunks until it hits
    the scan limit it was passed.

    shrink_inactive_list()
    {
            do {
                    isolate_pages(swap_cluster_max)
                    shrink_page_list()
            } while (nr_scanned < max_scan);
    }

    This assumes that swap_cluster_max is not bigger than the scan limit
    because the latter is checked only after at least one iteration.

    In shrink_all_memory() sc->swap_cluster_max is initialized to the overall
    reclaim goal in the beginning but not decreased while reclaim is making
    progress which leads to subsequent calls to shrink_inactive_list()
    reclaiming way too much in the one iteration that is done unconditionally.

    Always set sc->swap_cluster_max to the proper goal before doing
    shrink_all_zones()
    shrink_list()
    shrink_inactive_list().

    While the current shrink_all_memory() happily reclaims more than actually
    requested, this patch fixes it to never exceed the goal:

    unpatched
    wanted=10000 reclaimed=13356
    wanted=10000 reclaimed=19711
    wanted=10000 reclaimed=10289
    wanted=10000 reclaimed=17306
    wanted=10000 reclaimed=10700
    wanted=10000 reclaimed=10004
    wanted=10000 reclaimed=13301
    wanted=10000 reclaimed=10976
    wanted=10000 reclaimed=10605
    wanted=10000 reclaimed=10088
    wanted=10000 reclaimed=15000

    patched
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=9599
    wanted=10000 reclaimed=8476
    wanted=10000 reclaimed=8326
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=9919
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=9624
    wanted=10000 reclaimed=10000
    wanted=10000 reclaimed=10000
    wanted=8500 reclaimed=8092
    wanted=316 reclaimed=316

    Signed-off-by: Johannes Weiner
    Reviewed-by: MinChan Kim
    Acked-by: Nigel Cunningham
    Acked-by: "Rafael J. Wysocki"
    Reviewed-by: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Commit a79311c14eae4bb946a97af25f3e1b17d625985d "vmscan: bail out of
    direct reclaim after swap_cluster_max pages" moved the nr_reclaimed
    counter into the scan control to accumulate the number of all reclaimed
    pages in a reclaim invocation.

    shrink_all_memory() can use the same mechanism; this increases code
    consistency and readability.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: MinChan Kim
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Johannes Weiner
    Cc: "Rafael J. Wysocki"
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim
     
  • Commit bf3f3bc5e734706730c12a323f9b2068052aa1f0 (mm: don't
    mark_page_accessed in fault path) only removed the mark_page_accessed()
    call in filemap_fault().

    Therefore, swap-backed pages and file-backed pages have inconsistent
    behavior; mark_page_accessed() should be removed from do_swap_page() as well.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Impact: cleanup

    In almost all cases, for_each_zone() is used together with populated_zone(),
    because almost no function needs memoryless-node information.
    Therefore, for_each_populated_zone() helps simplify the code.

    This patch has no functional change.
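
    A sketch of what such a wrapper can look like (the exact definition here is
    an assumption):

    /* Iterate only over zones that actually contain pages. */
    #define for_each_populated_zone(zone)                   \
            for_each_zone(zone)                             \
                    if (!populated_zone(zone))              \
                            ;       /* skip empty zone */   \
                    else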

    [akpm@linux-foundation.org: small cleanup]
    Signed-off-by: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Reviewed-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • sc.may_swap does not only influence reclaiming of anon pages but pages
    mapped into pagetables in general, which also includes mapped file pages.

    In shrink_page_list():

    if (!sc->may_swap && page_mapped(page))
            goto keep_locked;

    For anon pages, this makes sense as they are always mapped and reclaiming
    them always requires swapping.

    But mapped file pages are skipped here as well and it has nothing to do
    with swapping.

    The real effect of the knob is whether mapped pages are unmapped and
    reclaimed or not. Rename it to `may_unmap' to have its name match its
    actual meaning more precisely.

    Signed-off-by: Johannes Weiner
    Reviewed-by: MinChan Kim
    Reviewed-by: KOSAKI Motohiro
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is no need to call int_sqrt() if the argument is 0.

    Signed-off-by: Cyrill Gorcunov
    Cc: Pekka Enberg
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     
  • vmap's dirty_list is unused. It's meant for optimizing flushing, but Nick
    hasn't written that code yet, so we don't need it until it's actually
    needed.

    This patch removes vmap_block's dirty_list and codes related to it.

    Signed-off-by: MinChan Kim
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    MinChan Kim