08 Feb, 2008

40 commits

  • This is a patch for the Compaq ASIC3 multi-function chip, found in many
    PDAs (iPAQs, HTCs...).

    It is a simplified version of Paul Sokolovsky's first proposal [1]. With
    this code, it is basically a GPIO and IRQ expander. My plan is to add more
    features once this patch gets reviewed and accepted.

    [1] http://lkml.org/lkml/2007/5/1/46

    Signed-off-by: Samuel Ortiz
    Cc: Paul Sokolovsky
    Cc: Ben Dooks
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Samuel Ortiz
     
  • Update cpuset documentation to match the October 2007 "Fix cpusets
    update_cpumask" changes that now apply changes to a cpuset's 'cpus' allowed
    mask immediately to the cpus_allowed of the tasks in that cpuset.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • There's one place that works with task pids - it's the "tasks" file in
    cgroups. The read/write handlers assume that the pid values go to/come from
    user space and are thus virtual pids, i.e. pids as they are seen from
    inside a namespace.

    Tune the code accordingly.
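
    As a rough sketch (not the literal diff), the change amounts to swapping
    the global-pid helpers for their namespace-aware counterparts in the two
    handlers; the surrounding variable names here are illustrative:

    /* write side: resolve the pid the user wrote in the writer's namespace */
    tsk = find_task_by_vpid(pid);           /* rather than find_task_by_pid() */

    /* read side: report each task's pid as seen from inside the namespace */
    pidarray[n++] = task_pid_vnr(tsk);      /* rather than tsk->pid */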

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Acked-by: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • - Narrow the scope of callback_mutex in scan_for_empty_cpusets().

    - Avoid rewriting the cpus, mems of cpusets except when it is likely that
    we'll be changing them.

    - Have remove_tasks_in_empty_cpuset() also check for empty mems.

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Various minor formatting and comment tweaks to Cliff Wickman's
    [PATCH_3_of_3]_cpusets__update_cpumask_revision.patch

    I had had "iff", meaning "if and only if", in a comment. However, except
    for ancient mathematicians, the abbreviation "iff" was a tad too cryptic.
    Cliff changed it to "if", presumably figuring that the "iff" was a typo.
    However, it was the "only if" half of the conjunction that was most
    interesting. Reword to emphasize the "only if" aspect.

    The locking comment for remove_tasks_in_empty_cpuset() was wrong; it said
    callback_mutex had to be held on entry. The opposite is true.

    Several mentions of attach_task() in comments needed to be
    changed to cgroup_attach_task().

    A comment about notify_on_release was no longer relevant,
    as the line of code it had commented, namely:
    set_bit(CS_RELEASED_RESOURCE, &parent->flags);
    is no longer present in that place in the cpuset.c code.

    Similarly a comment about notify_on_release before the
    scan_for_empty_cpusets() routine was no longer relevant.

    Removed extra parentheses and unnecessary return statement.

    Renamed attach_task() to cpuset_attach() in various comments.

    Removed comment about not needing memory migration, as it seems the migration
    is done anyway, via the cpuset_attach() callback from cgroup_attach_task().

    Signed-off-by: Paul Jackson
    Acked-by: Cliff Wickman
    Cc: David Rientjes
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Jackson
     
  • Some of the comments in kernel/cpuset.c were stale following the
    transition to control groups; this patch updates them to more closely
    match reality.

    Signed-off-by: Paul Menage
    Acked-by: Paul Jackson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Use the new function cgroup_scan_tasks() to step through all tasks in a
    cpuset.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • This patch corrects a situation that occurs when one disables all the cpus in
    a cpuset.

    Currently, the disabled (cpu-less) cpuset inherits the cpus of its parent,
    which is incorrect because it may then overlap its cpu-exclusive sibling.

    Tasks of an empty cpuset should be moved to the cpuset which is the parent of
    their current cpuset. Or if the parent cpuset has no cpus, to its parent,
    etc.

    And the empty cpuset should be released (if it is flagged notify_on_release).

    Depends on the cgroup_scan_tasks() function (proposed by David Rientjes) to
    iterate through all tasks in the cpu-less cpuset. We are deliberately
    avoiding a walk of the tasklist.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Provide cgroup_scan_tasks(), which iterates through every task in a cgroup,
    calling a test function and a process function for each; the process
    function is called without holding the css_set_lock lock.

    The idea is David Rientjes', predicting that such a function will make it
    much easier in the future to extend things that require access to each
    task in a cgroup without holding the lock.
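
    A minimal sketch of how a caller might use it; the cgroup_scanner field
    names and callback signatures below are assumptions for illustration, not
    quoted from the patch:

    /* visit every task; return non-zero to hand p to the process callback */
    static int want_task(struct task_struct *p, struct cgroup_scanner *scan)
    {
            return 1;
    }

    /* called without css_set_lock held */
    static void touch_task(struct task_struct *p, struct cgroup_scanner *scan)
    {
            /* e.g. rebind the task's allowed cpus or memories */
    }

    struct cgroup_scanner scan = {
            .cg           = cgrp,           /* the cgroup to walk */
            .test_task    = want_task,
            .process_task = touch_task,
    };
    cgroup_scan_tasks(&scan);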

    [akpm@linux-foundation.org: cleanup]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Cliff Wickman
    Cc: Paul Menage
    Cc: Paul Jackson
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • Documentation updates for memory controller.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Based on the discussion at http://lkml.org/lkml/2007/12/20/383, it was felt
    that control_type might not be a good thing to implement right away. We
    can add this flexibility at a later point when required.

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Now, the LRU is per-zone.

    Then, the lru_lock can be (should be) per-zone, too.
    This patch implements a per-zone lru_lock.

    lru_lock is placed into the mem_cgroup_per_zone struct.

    lock can be accessed by
    mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
    &mz->lru_lock

    or
    mz = page_cgroup_zoneinfo(page_cgroup);
    &mz->lru_lock
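
    For illustration, the locking pattern around a per-zone LRU operation then
    looks roughly like this (a sketch, not code from the patch):

    struct mem_cgroup_per_zone *mz;
    unsigned long flags;

    mz = page_cgroup_zoneinfo(pc);          /* pc: the page's page_cgroup */
    spin_lock_irqsave(&mz->lru_lock, flags);
    /* ... move pc on mz's active/inactive list, update per-zone counters ... */
    spin_unlock_irqrestore(&mz->lru_lock, flags);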

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-zone LRU for the memory cgroup.
    It makes use of the mem_cgroup_per_zone struct for the per-zone LRU lists.

    LRU can be accessed by

    mz = mem_cgroup_zoneinfo(mem_cgroup, node, zone);
    &mz->active_list
    &mz->inactive_list

    or
    mz = page_cgroup_zoneinfo(page_cgroup);
    &mz->active_list
    &mz->inactive_list
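
    A sketch of what moving a page between the lists looks like with this
    layout (illustrative; assumes the page_cgroup carries its list head in
    pc->lru):

    struct mem_cgroup_per_zone *mz = page_cgroup_zoneinfo(pc);

    if (active)
            list_move(&pc->lru, &mz->active_list);
    else
            list_move(&pc->lru, &mz->inactive_list);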

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …solate global/cgroup lru activity

    When using the memory controller, there are 2 levels of memory reclaim:
    1. zone memory reclaim, because of system/zone memory shortage.
    2. memory cgroup memory reclaim, because of hitting the limit.

    These two can be distinguished by the sc->mem_cgroup parameter
    (the scan_global_lru() macro).

    This patch tries to make the memory cgroup reclaim routine avoid affecting
    system/zone memory reclaim. It inserts scan_global_lru() checks and hooks
    into the memory cgroup reclaim support functions.

    This patch can be a help for isolating system LRU activity from group LRU
    activity, and it shows what additional functions are necessary (a usage
    sketch follows the list below).

    * mem_cgroup_calc_mapped_ratio() ... calculate the mapped ratio for a
      cgroup.
    * mem_cgroup_reclaim_imbalance() ... calculate the active/inactive balance
      in a cgroup.
    * mem_cgroup_calc_reclaim_active() ... calculate the number of active pages
      to be scanned at this priority in a mem_cgroup.
    * mem_cgroup_calc_reclaim_inactive() ... calculate the number of inactive
      pages to be scanned at this priority in a mem_cgroup.
    * mem_cgroup_all_unreclaimable() ... check whether all of a cgroup's pages
      are unreclaimable.
    * mem_cgroup_get_reclaim_priority() ...
    * mem_cgroup_note_reclaim_priority() ... record the reclaim priority
      (temporarily).
    * mem_cgroup_remember_reclaim_priority() ... record the reclaim priority as
      zone->prev_priority. This value is used for calculating reclaim_mapped.
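
    A sketch of the kind of branch this inserts into the reclaim path
    (illustrative only; the helper's exact parameter list is an assumption,
    not quoted from the patch):

    if (scan_global_lru(sc)) {
            /* system/zone reclaim: use the zone's own statistics */
            nr_active = zone_page_state(zone, NR_ACTIVE);
    } else {
            /* cgroup reclaim: ask the memory cgroup instead */
            nr_active = mem_cgroup_calc_reclaim_active(sc->mem_cgroup,
                                                       zone, priority);
    }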

    [akpm@linux-foundation.org: fix unused var warning]
    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • … pages to be scanned per cgroup

    Define functions for calculating the number of pages to be scanned on each
    zone/LRU.

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Functions to remember reclaim priority per cgroup (as zone->prev_priority)

    [akpm@linux-foundation.org: build fixes]
    [akpm@linux-foundation.org: more build fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • …ve imbalance per cgroup

    Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Cc: "Eric W. Biederman" <ebiederm@xmission.com>
    Cc: Balbir Singh <balbir@linux.vnet.ibm.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Herbert Poetzl <herbert@13thfloor.at>
    Cc: Kirill Korotaev <dev@sw.ru>
    Cc: Nick Piggin <nickpiggin@yahoo.com.au>
    Cc: Paul Menage <menage@google.com>
    Cc: Pavel Emelianov <xemul@openvz.org>
    Cc: Peter Zijlstra <a.p.zijlstra@chello.nl>
    Cc: Vaidyanathan Srinivasan <svaidy@linux.vnet.ibm.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    KAMEZAWA Hiroyuki
     
  • Define a function for calculating the mapped_ratio in a memory cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds per-zone status to the memory cgroup. These values are
    often read (as per-zone values) by page reclaim.

    In the current design, each per-zone stat is just an unsigned long, not an
    atomic value, because it is modified only under lru_lock (so atomic ops
    are not necessary).

    This patch adds ACTIVE and INACTIVE per-zone status values.

    For handling per-zone status, this patch adds
    struct mem_cgroup_per_zone {
    ...
    }
    and some helper functions. This will be useful for adding per-zone objects
    to mem_cgroup.
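
    A sketch of the shape this implies (field and enum names are assumptions
    for illustration, not quoted from the patch):

    enum mem_cgroup_zstat_index {
            MEM_CGROUP_ZSTAT_ACTIVE,        /* pages on the active list */
            MEM_CGROUP_ZSTAT_INACTIVE,      /* pages on the inactive list */
            NR_MEM_CGROUP_ZSTAT,
    };

    struct mem_cgroup_per_zone {
            /* plain unsigned long: only ever modified under lru_lock */
            unsigned long count[NR_MEM_CGROUP_ZSTAT];
    };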

    This patch changes the memory controller's early_init to 0 so that
    kmalloc() can be called during initialization.

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add macros to get the node_id and zone_id of a page_cgroup. These will be
    used in the per-zone-xxx patches and others.
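
    A sketch of such accessors, written here as inline helpers (names and form
    are assumptions; the patch may spell them as macros):

    static inline int page_cgroup_nid(struct page_cgroup *pc)
    {
            return page_to_nid(pc->page);
    }

    static inline enum zone_type page_cgroup_zid(struct page_cgroup *pc)
    {
            return page_zonenum(pc->page);
    }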

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This is used to detect whether a scan_control scans the global LRU or a
    mem_cgroup LRU. It is compiled to a static value (1) when the memory
    controller is not configured, which should make the meaning obvious.
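
    A sketch of what such a macro looks like (the config symbol is an
    assumption; at the time the memory controller option was
    CONFIG_CGROUP_MEM_CONT):

    #ifdef CONFIG_CGROUP_MEM_CONT
    /* a non-NULL sc->mem_cgroup means we reclaim on behalf of a cgroup */
    #define scan_global_lru(sc)     (!(sc)->mem_cgroup)
    #else
    #define scan_global_lru(sc)     (1)
    #endif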

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add pre_destroy handler for mem_cgroup and try to make mem_cgroup empty at
    rmdir().

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add a handler "pre_destroy" to cgroup_subsys. It is called before
    cgroup_rmdir() checks all subsystems' refcnts.

    I think this is useful for subsystems which hold extra refs even when there
    are no tasks in the cgroup. By adding pre_destroy(), the kernel keeps the
    rule "destroy() against a subsystem is called only when refcnt == 0" and
    allows css refs to be held by objects other than tasks.
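
    For example, a subsystem wires the new callback up in its cgroup_subsys
    definition; the sketch below uses a hypothetical "foo" subsystem for
    illustration:

    struct cgroup_subsys foo_subsys = {
            .name        = "foo",
            .create      = foo_create,
            .pre_destroy = foo_pre_destroy,  /* release extra css refs here */
            .destroy     = foo_destroy,      /* still only called at refcnt 0 */
            .subsys_id   = foo_subsys_id,
    };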

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Herbert Poetzl
    Cc: Kirill Korotaev
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Pavel Emelianov
    Cc: Peter Zijlstra
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Show the accounted information of a memory cgroup via the memory.stat
    file.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix printk warning]
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add statistics accounting infrastructure for the memory controller. All
    accounting information is stored per-cpu, so the caller does not have to
    take a lock or use atomic ops. This will be used by the memory.stat file
    later. (A sketch of the implied per-cpu layout follows the list below.)

    CACHE includes swapcache for now. I'd like to divide it into
    PAGECACHE and SWAPCACHE later.

    This patch adds 3 functions for accounting:
    * __mem_cgroup_stat_add() ... for the usual case.
    * __mem_cgroup_stat_add_safe ... for calling in an irq-disabled section.
    * mem_cgroup_read_stat() ... for reading a stat value.
    * renamed PAGECACHE to CACHE (because it may include swapcache *now*)
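
    A sketch of the implied per-cpu layout (names are illustrative
    assumptions, not quoted from the patch):

    enum mem_cgroup_stat_index {
            MEM_CGROUP_STAT_CACHE,          /* page cache + swap cache, for now */
            MEM_CGROUP_STAT_RSS,            /* anonymous / mapped pages */
            MEM_CGROUP_STAT_NSTATS,
    };

    struct mem_cgroup_stat_cpu {
            s64 count[MEM_CGROUP_STAT_NSTATS];
    } ____cacheline_aligned_in_smp;

    struct mem_cgroup_stat {
            struct mem_cgroup_stat_cpu cpustat[NR_CPUS];
    };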

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix smp_processor_id-in-preemptible]
    [akpm@linux-foundation.org: uninline things]
    [akpm@linux-foundation.org: remove dead code]
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Cc: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Remember whether a page_cgroup is on the active_list in
    page_cgroup->flags.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The memcgroup regime relies upon a cgroup reclaiming pages from itself within
    add_to_page_cache: which may involve some waiting. Whereas shmem and tmpfs
    rely upon using add_to_page_cache while holding a spinlock: when it cannot
    wait. The consequence is that when a cgroup reaches its limit, shmem_getpage
    just hangs - unless there is outside memory pressure too, neither kswapd nor
    radix_tree_preload get it out of the retry loop.

    In most cases we can mem_cgroup_cache_charge the page waitably first, to
    attach the page_cgroup in advance, so add_to_page_cache will do no more than
    increment a count; then mem_cgroup_uncharge_page after (in both success and
    failure cases) to balance the books again.

    And where there used to be a congestion_wait for kswapd (recently made
    redundant by radix_tree_preload), use mem_cgroup_cache_charge with NULL page
    to go through a cycle of allocation and freeing, without accounting to any
    particular page, and without updating the statistics vector. This brings the
    cgroup below its limit so the next try usually succeeds.
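
    A minimal sketch of the charge-first pattern described above (error
    handling and the "failed" label are illustrative, not the shmem code
    itself):

    /* charge while we are still allowed to wait */
    error = mem_cgroup_cache_charge(page, current->mm, gfp);
    if (error)
            goto failed;

    /* under the spinlock this can no longer wait; it only bumps a count */
    error = add_to_page_cache(page, mapping, idx, GFP_NOWAIT);

    /* in both the success and failure cases, balance the books again */
    mem_cgroup_uncharge_page(page);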

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Tidy up mem_cgroup_charge_common before extending it. Adjust some comments,
    but mainly clean up its loop: I've an aversion to loops full of continues,
    then a break or a goto at the bottom. And the is_atomic test should be on the
    __GFP_WAIT bit, not GFP_ATOMIC bits.

    Signed-off-by: Hugh Dickins
    Cc: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Hugh Dickins noticed that we were using rcu_dereference() without
    rcu_read_lock() in the cache charging routine. The patch below fixes this
    problem.

    Signed-off-by: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add a flag to page_cgroup to remember "this page is charged as cache."
    Cache here includes page cache and swap cache. This is useful for
    implementing precise accounting in the memory cgroup.

    TODO: distinguish page cache and swap cache.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: YAMAMOTO Takashi
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds an interface, "memory.force_empty". Any write to this file
    will drop all charges in this cgroup if there are no tasks under it.

    %echo 1 > /....../memory.force_empty

    will drop all charges of the memory cgroup if the cgroup has no tasks.

    This is useful for making rmdir() against a memory cgroup succeed.

    Tested and worked well on an x86_64/fake-NUMA system.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Because NODE_DATA(node)->node_zonelists[] is guaranteed to contain all
    necessary zones, it is not necessary to use for_each_online_node().

    Moreover, for_each_online_node() makes the reclaim routine always start
    from node 0, which is not good. This patch makes reclaim start from the
    caller's node and just use the usual (default) zonelist order.

    [akpm@linux-foundation.org: fix warning]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • If we're charging rss and we're charging cache, it seems obvious that we
    should be charging swapcache - as has been done. But in practice that
    doesn't work out so well: both swapin readahead and swapoff leave the
    majority of pages charged to the wrong cgroup (the cgroup that happened to
    read them in, rather than the cgroup to which they belong).

    (Which is why unuse_pte's GFP_KERNEL while holding pte lock never showed up
    as a problem: no allocation was ever done there, every page read being
    already charged to the cgroup which initiated the swapoff.)

    It all works rather better if we leave the charging to do_swap_page and
    unuse_pte, and do nothing for swapcache itself: revert mm/swap_state.c to
    what it was before the memory-controller patches. This also speeds up
    significantly a contained process working at its limit: because it no
    longer needs to keep waiting for swap writeback to complete.

    Is it unfair that swap pages become uncharged once they're unmapped, even
    though they're still clearly private to particular cgroups? For a short
    while, yes; but PageReclaim arranges for those pages to go to the end of
    the inactive list and be reclaimed soon if necessary.

    shmem/tmpfs pages are a distinct case: their charging also benefits from
    this change, but their second life on the lists as swapcache pages may
    prove more unfair - that I need to check next.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • mem_cgroup_charge_common shows a tendency to OOM without good reason, when
    a memhog goes well beyond its rss limit but with plenty of swap available.
    Seen on x86 but not on PowerPC; seen when the next patch omits swapcache
    from memcgroup, but we presume it can happen without.

    mem_cgroup_isolate_pages is not quite satisfying reclaim's criteria for OOM
    avoidance. Already it has to scan beyond the nr_to_scan limit when it
    finds a !LRU page or an active page when handling inactive or an inactive
    page when handling active. It needs to do exactly the same when it finds a
    page from the wrong zone (the x86 tests had two zones, the PowerPC tests
    had only one).

    Don't increment scan and then decrement it in these cases, just move the
    incrementation down. Fix recent off-by-one when checking against
    nr_to_scan. Cut out "Check if the meta page went away from under us",
    presumably left over from early debugging: no amount of such checks could
    save us if this list really were being updated without locking.

    This change does make the unlimited scan while holding two spinlocks
    even worse - bad for latency and bad for containment; but that's a
    separate issue which is better left to be fixed a little later.

    Signed-off-by: Hugh Dickins
    Cc: Pavel Emelianov
    Acked-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • This patch makes mem_cgroup_isolate_pages():

    - ignore !PageLRU pages.
    - fix the bug that isolation makes no progress once it finds a page with
      page_zone(page) != zone (just increment scan in this case).

    kswapd and memory migration remove a page from the LRU list when they
    handle a page for reclaim/migration.

    Because __isolate_lru_page() doesn't move !PageLRU pages, it is safe to
    avoid touching a !PageLRU() page and its page_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • While using the memory control cgroup, page migration under it works as
    follows:
    ==
    1. uncharge all refs at try_to_unmap().
    2. charge refs again at remove_migration_ptes().
    ==
    This is simple but has the following problems:
    ==
    The page is uncharged and charged back again if *mapped*.
    - This means that the cgroup before migration can be different from the
      one after migration.
    - If the page is not mapped but charged as page cache, the charge is just
      ignored (because it is not mapped, it will not be uncharged before
      migration). This is a memory leak.
    ==
    This patch tries to keep the memory cgroup across page migration by
    increasing one refcnt during it. 3 functions are added:

    mem_cgroup_prepare_migration() --- increase refcnt of page->page_cgroup
    mem_cgroup_end_migration() --- decrease refcnt of page->page_cgroup
    mem_cgroup_page_migration() --- copy page->page_cgroup from old page to
    new page.

    During migration
    - the old page is under PG_locked.
    - the new page is under PG_locked, too.
    - neither the old page nor the new page is on the LRU.

    These 3 facts guarantee that page_cgroup migration has no race.
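
    For illustration, the ordering this describes around the migration of one
    page is roughly the following (call sites and the success check are
    assumptions, not quoted from the patch):

    mem_cgroup_prepare_migration(page);     /* take a ref on page->page_cgroup */

    /* ... try_to_unmap(), copy the page, remove_migration_ptes() ... */

    if (migrated)
            mem_cgroup_page_migration(page, newpage); /* move the page_cgroup */

    mem_cgroup_end_migration(page);         /* drop the ref taken above */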

    Tested and worked well in x86_64/fake-NUMA box.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch adds the following functions:
    - clear_page_cgroup(page, pc)
    - page_cgroup_assign_new_page_group(page, pc)

    Mainly for cleanup.

    The manner of "check page->cgroup again after lock_page_cgroup()" is
    implemented in a straightforward way.

    A comment in mem_cgroup_uncharge() will be removed by the force-empty
    patch.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The current kswapd (and try_to_free_pages) code has an oddity where the
    code will wait on IO, even if there is no IO in flight. This problem is
    notable especially when the system scans through many unfreeable pages,
    causing unnecessary stalls in the VM.

    Additionally, tasks without __GFP_FS or __GFP_IO in the direct reclaim path
    will sleep if a significant number of pages are encountered that should be
    written out. This gives kswapd a chance to write out those pages, while
    the direct reclaim task sleeps.

    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Creates a helper function to return non-zero if a task is a member of a
    memory controller:

    int task_in_mem_cgroup(const struct task_struct *task,
    const struct mem_cgroup *mem);

    When the OOM killer is constrained by the memory controller, the exclusion
    of tasks that are not members of that controller was previously misplaced:
    it appeared in the badness scoring function. Such tasks should instead be
    excluded during the tasklist scan in select_bad_process().
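
    A sketch of how the filter then reads inside the tasklist scan
    (illustrative placement):

    /* skip tasks outside the memory controller that triggered the OOM */
    if (mem && !task_in_mem_cgroup(p, mem))
            continue;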

    [akpm@linux-foundation.org: build fix]
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes