10 Jan, 2009

1 commit

  • * git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-2.6-nommu:
    NOMMU: Support XIP on initramfs
    NOMMU: Teach kobjsize() about VMA regions.
    FLAT: Don't attempt to expand the userspace stack to fill the space allocated
    FDPIC: Don't attempt to expand the userspace stack to fill the space allocated
    NOMMU: Improve procfs output using per-MM VMAs
    NOMMU: Make mmap allocation page trimming behaviour configurable.
    NOMMU: Make VMAs per MM as for MMU-mode linux
    NOMMU: Delete askedalloc and realalloc variables
    NOMMU: Rename ARM's struct vm_region
    NOMMU: Fix cleanup handling in ramfs_nommu_get_unmapped_area()

    Linus Torvalds
     

09 Jan, 2009

39 commits

  • Update the memory controller to use its hierarchy_mutex rather than
    calling cgroup_lock() to protect against cgroup_mkdir()/cgroup_rmdir()
    occurring in its hierarchy.

    Signed-off-by: Paul Menage
    Tested-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Menage
     
  • Currently, the following can be seen even when swap accounting is enabled:

    1. Create groups 01 and 02.
    2. Allocate a "file" on tmpfs from a task in group 01.
    3. Swap the "file" out (by memory pressure).
    4. Read the "file" from a task in group 02.
    5. The charge for the "file" is moved to group 02.

    This is not ideal behavior, and happens because SwapCache loaded by
    read-ahead is not taken into account.

    This patch fixes shmem's SwapCache behavior:
    - remove mem_cgroup_cache_charge_swapin().
    - add a SwapCache handler routine to mem_cgroup_cache_charge().
      With this, shmem's file cache is charged at add_to_page_cache()
      with GFP_NOWAIT.
    - pass the SwapCache page to shrink_mem_cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Now, a page can be deleted from SwapCache while do_swap_page() is
    running. memcg-fix-swap-accounting-leak-v3.patch handles that, but LRU
    handling is still broken (the behavior above broke the assumption made
    by the memcg-synchronized-lru patch).

    This patch fixes LRU handling (especially the per-zone counters).
    When charging SwapCache:
    - remove the page_cgroup from the LRU if it is not in use.
    - add the page_cgroup to the LRU if it is not yet linked to one.

    Reported-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • From: KAMEZAWA Hiroyuki

    css_tryget() is newly added; it lets us find out whether a css is alive
    ("alive" here means "rmdir/destroy" has not been called) and take a
    reference on it in a very safe way.

    This patch replaces css_get() with css_tryget() in the places where I
    cannot explain why css_get() is safe, and removes the memcg->obsolete
    flag.

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • 1. Fix a double-free BUG in the error path of mem_cgroup_create();
    mem_cgroup_free() itself frees the per-zone info.
    2. Simplify the refcnt of memcg: take one reference at creation and
    free the memcg when the refcount drops to zero (sketched below).
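
    A minimal userspace sketch of that lifetime rule (the struct and
    function names are illustrative, not the kernel's; the real memcg code
    uses its own counter and locking):

        #include <stdio.h>
        #include <stdlib.h>

        struct obj {
            int refcnt;                 /* one reference is held from creation */
        };

        struct obj *obj_create(void)
        {
            struct obj *o = calloc(1, sizeof(*o));
            if (o)
                o->refcnt = 1;          /* creation takes the first reference */
            return o;
        }

        void obj_get(struct obj *o) { o->refcnt++; }

        void obj_put(struct obj *o)
        {
            if (--o->refcnt == 0)       /* last reference gone: free here only */
                free(o);
        }

        int main(void)
        {
            struct obj *o = obj_create();
            if (!o)
                return 1;
            obj_get(o);                 /* another user takes a reference */
            obj_put(o);                 /* that user is done */
            obj_put(o);                 /* creation reference dropped: freed */
            printf("freed on last put\n");
            return 0;
        }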

    Reviewed-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Paul Menage
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Fix the swap-in charge operation of memcg.

    Memcg has hooks into the swap-out path and checks whether a SwapCache
    page is really unused. That check depends on the contents of struct
    page: if PageAnon(page) && page_mapped(page), the page is recognized
    as still in use.

    However, reuse_swap_page() calls delete_from_swap_cache() before any
    rmap is established. Then, in the following sequence

    (page fault with WRITE)
    try_charge() (charge += PAGE_SIZE)
    commit_charge() (check whether page_cgroup is in use or not)
    reuse_swap_page()
    -> delete_from_swap_cache()
    -> mem_cgroup_uncharge_swapcache() (charge -= PAGE_SIZE)
    ......

    the new charge is uncharged almost immediately. To avoid this, move
    commit_charge() to after page_mapcount() goes up to 1. The sequence
    then becomes

    try_charge() (usage += PAGE_SIZE)
    reuse_swap_page() (may do usage -= PAGE_SIZE if PCG_USED is set)
    commit_charge() (if page_cgroup is not marked PCG_USED, add the new charge)

    and accounting is correct.

    Changelog (v2) -> (v3)
    - fixed an invalid charge when swp_entry==0.
    - updated documentation.
    Changelog (v1) -> (v2)
    - fixed a comment.

    [nishimura@mxp.nes.nec.co.jp: swap accounting leak doc fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Tested-by: Balbir Singh
    Cc: Hugh Dickins
    Cc: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • mem_cgroup_hierarchical_reclaim() now works properly even when
    !use_hierarchy (thanks to memcg-hierarchy-avoid-unnecessary-reclaim.patch),
    so it should be used in many places instead of
    try_to_free_mem_cgroup_pages().

    The only exception is force_empty, since the group has no children in
    that case.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    On the other hand, the page fault path does down_read(mm->mmap_sem) and
    calls mem_cgroup_try_charge_xxx(), which may eventually call
    mem_cgroup_out_of_memory(), and mem_cgroup_out_of_memory() calls
    cgroup_lock(). This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, a deadlock can happen (sketched below).

    This patch avoids the deadlock by:
    - removing cgroup_lock() from mem_cgroup_out_of_memory().
    - defining a new mutex (memcg_tasklist) and serializing
    mem_cgroup_move_task() (the ->attach handler of the memory cgroup) and
    mem_cgroup_out_of_memory().
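
    A minimal pthread sketch of the inversion, with two plain mutexes
    standing in for cgroup_mutex and mmap_sem (the read side is simplified
    to a mutex; names and paths are illustrative only):

        #include <pthread.h>

        static pthread_mutex_t cgroup_mutex = PTHREAD_MUTEX_INITIALIZER;
        static pthread_mutex_t mmap_sem     = PTHREAD_MUTEX_INITIALIZER;

        /* cpuset_attach() path: cgroup_mutex is already held, then
         * mpol_rebind_mm() takes mmap_sem.                           */
        static void attach_path(void)
        {
            pthread_mutex_lock(&cgroup_mutex);
            pthread_mutex_lock(&mmap_sem);      /* order: A then B */
            /* ... rebind mempolicy ... */
            pthread_mutex_unlock(&mmap_sem);
            pthread_mutex_unlock(&cgroup_mutex);
        }

        /* page-fault path: mmap_sem is held, then the memcg OOM path
         * used to take cgroup_lock().                                 */
        static void fault_path(void)
        {
            pthread_mutex_lock(&mmap_sem);
            pthread_mutex_lock(&cgroup_mutex);  /* order: B then A (inverted) */
            /* ... mem_cgroup_out_of_memory() ... */
            pthread_mutex_unlock(&cgroup_mutex);
            pthread_mutex_unlock(&mmap_sem);
        }

        int main(void)
        {
            /* Sequential calls are fine; two threads running these paths
             * concurrently could deadlock because of the opposite order. */
            attach_path();
            fault_path();
            return 0;
        }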

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • After the previous patch, mem_cgroup_try_charge() is not used by
    anyone, so we can remove it.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • I think triggering OOM in mem_cgroup_prepare_migration() would be a
    bit of an overkill. Returning -ENOMEM is enough for
    mem_cgroup_prepare_migration(); the caller handles that case anyway.

    Signed-off-by: Daisuke Nishimura
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Show "real" limit of memcg. This helps my debugging and maybe useful for
    users.

    While testing hierarchy like this

    mount -t cgroup -o memory none /cgroup
    mkdir /cgroup/A
    set use_hierarchy==1 on "A"
    mkdir /cgroup/A/01
    mkdir /cgroup/A/01/02
    mkdir /cgroup/A/01/03
    mkdir /cgroup/A/01/03/04
    mkdir /cgroup/A/08
    mkdir /cgroup/A/08/01
    ....
    and set each group's own limit; then the "real" limit of each memcg is
    unclear. This patch shows the real limit by checking all ancestors
    (see the sketch below).

    Changelog: (v1) -> (v2)
    - remove "if" and use "min(a,b)"

    Acked-by: Balbir Singh
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Currently, the inactive_ratio of a memcg is calculated when the limit
    is set, because page_alloc.c does so and the current implementation is
    a straightforward port.

    However, memcg recently gained the hierarchy feature. Under
    hierarchical restriction, the memory limit is decided not only by the
    current cgroup's memory.limit_in_bytes but also by the parents' limits
    and the siblings' memory usage.

    The optimal inactive_ratio therefore changes frequently, so
    calculating it every time is better (see the sketch below).
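
    A sketch of recomputing the ratio on demand from whatever limit is
    currently effective, instead of caching it at limit-set time. The
    formula here is an assumption borrowed from what the global allocator
    derives from zone size (roughly the integer square root of
    10 * size-in-GB, with a floor of 1):

        #include <math.h>
        #include <stdio.h>

        /* Assumed formula, mirroring the global allocator's derivation from
         * zone size: ratio = sqrt(10 * size_in_GB), at least 1.            */
        unsigned int calc_inactive_ratio(unsigned long long limit_bytes)
        {
            unsigned long long gb = limit_bytes >> 30;
            unsigned int ratio = (unsigned int)sqrt((double)(10 * gb));

            return ratio ? ratio : 1;
        }

        int main(void)
        {
            /* Recomputed from the effective limit each time it is needed. */
            printf("%u\n", calc_inactive_ratio(4ULL << 30));  /* 4G -> 6 */
            return 0;
        }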

    Tested-by: KAMEZAWA Hiroyuki
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, /proc/sys/vm/swappiness can change the swappiness ratio for
    global reclaim, but memcg reclaim has no tuning parameter of its own.

    In general, the optimal swappiness depends on the workload (e.g. HPC
    workloads need lower swappiness than others).

    Per-cgroup swappiness therefore improves administrator tunability.

    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, mem_cgroup doesn't have its own lock, and almost none of
    its members need one (e.g. mem_cgroup->info is protected by the zone
    lock, mem_cgroup->stat is a per-cpu variable).

    However, there is one explicit exception: mem_cgroup->prev_priority
    needs a lock but is not protected by one. Luckily, this is NOT a bug,
    because prev_priority isn't used by the current reclaim code.

    However, we plan to use prev_priority again in the future, so fixing
    this now is better.

    In addition, we plan to reuse this lock for another member; the name
    "reclaim_param_lock" is therefore better than "prev_priority_lock".

    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Rename scan_global_lru() to scanning_global_lru().

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Add the following fields to the memory.stat file:

    - inactive_ratio
    - recent_rotated_anon
    - recent_rotated_file
    - recent_scanned_anon
    - recent_scanned_file

    Acked-by: Rik van Riel
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • get_scan_ratio() now returns the correct value for memcg reclaim as
    well, so mem_cgroup_calc_reclaim() can be removed.

    With this, memcg reclaim gets the same anon/file reclaim balancing
    capability as global reclaim.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Introduce the mem_cgroup_per_zone::reclaim_stat member and its
    statistics-collecting function.

    Now, get_scan_ratio() can calculate the correct value for memcg reclaim.

    [hugh@veritas.com: avoid reclaim_stat oops when disabled]
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Introduce mem_cgroup_zone_nr_pages(). It is called by the
    zone_nr_pages() helper function.

    This patch doesn't have any behavior change.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • inactive_anon_is_low() is a key component of active/inactive anon
    balancing on reclaim, but the current function only considers global
    reclaim.

    Therefore, we need the following ugly scan_global_lru() condition:

    if (lru == LRU_ACTIVE_ANON &&
        (!scan_global_lru(sc) || inactive_anon_is_low(zone))) {
            shrink_active_list(nr_to_scan, zone, sc, priority, file);
            return 0;
    }

    which causes memcg reclaim to always deactivate pages when
    shrink_list() is called. Introduce mem_cgroup_inactive_anon_is_low()
    to improve the active/inactive anon balancing of memcg.
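
    For reference, a hedged standalone sketch of what the check itself
    looks like; the memcg variant would read per-group LRU counters instead
    of zone counters, and the field names here are stand-ins:

        struct anon_lru {
            unsigned long active;          /* active anon pages in the group   */
            unsigned long inactive;        /* inactive anon pages in the group */
            unsigned int  inactive_ratio;  /* target active:inactive balance   */
        };

        /* "Low" when the active list outweighs inactive * ratio, i.e. some
         * active anon pages should be deactivated on this reclaim pass.    */
        int inactive_anon_is_low(const struct anon_lru *l)
        {
            return l->inactive * l->inactive_ratio < l->active;
        }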

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Cyrill Gorcunov
    Cc: "Pekka Enberg"
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y, page_cgroup::mem_cgroup can be
    NULL, so checking for NULL is better.

    A later patch uses this function.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, get_scan_ratio() always calculates the balancing value for
    global reclaim, and memcg reclaim doesn't use it, so it has no
    scan_global_lru() condition.

    However, we plan to expand get_scan_ratio() to be usable for memcg
    too, later. So put the global-reclaim-only code in get_scan_ratio()
    inside an explicit scan_global_lru() condition.

    This patch doesn't have any functional change.

    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add zone_nr_pages() helper function.

    It is used by a later patch. This patch doesn't have any functional
    change.

    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Add the zone_reclaim_stat struct for a later enhancement.

    A later patch uses it. This patch doesn't make any behavior change (yet).

    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • inactive_anon_is_low() is called only from vmscan, so it can be moved
    to vmscan.c.

    This patch doesn't have any functional change.

    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • If hierarchy is not used, no tree-walk is necessary.

    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • The css's refcnt is dropped before the end of the following access.
    Hold it until the access ends.

    Reported-by: Li Zefan
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • There are scattered calls of res_counter_check_under_limit(), and most
    of them don't take mem+swap accounting into account.

    Define mem_cgroup_check_under_limit() and avoid direct use of
    res_counter_check_under_limit() (a sketch of the wrapper follows).
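
    A minimal sketch of such a wrapper, with stubbed-out counters standing
    in for the real res_counter API (the field and helper layout here is a
    simplifying assumption):

        #include <stdbool.h>

        struct res_counter { unsigned long long usage, limit; };

        bool res_counter_check_under_limit(const struct res_counter *c)
        {
            return c->usage < c->limit;
        }

        struct mem_cgroup {
            struct res_counter res;     /* memory          */
            struct res_counter memsw;   /* memory + swap   */
        };

        bool do_swap_account = true;    /* stand-in for the config/boot knob */

        /* One place that answers "is this group under its limit?", taking
         * mem+swap accounting into account when it is enabled.            */
        bool mem_cgroup_check_under_limit(const struct mem_cgroup *mem)
        {
            if (do_swap_account)
                return res_counter_check_under_limit(&mem->res) &&
                       res_counter_check_under_limit(&mem->memsw);
            return res_counter_check_under_limit(&mem->res);
        }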

    Reported-by: Daisuke Nishimura
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Remove unnecessary code (fragments of not-implemented functionality).

    Reported-by: Nikanth Karthikesan
    Signed-off-by: Nikanth Karthikesan
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nikanth Karthikesan
     
  • My patch memcg-fix-gfp_mask-of-callers-of-charge.patch changed the
    gfp_mask of the callers of charge to GFP_HIGHUSER_MOVABLE, to show what
    will happen at memory reclaim.

    But in recent discussion it was NACKed because it looks ugly.

    This patch reverts that and adds some cleanup to the gfp_mask of the
    callers of charge. There is no behavior change, but it needs review
    before it generates hunks (conflicts) deeper in the queue.

    This patch also adds an explanation of the meaning of the gfp_mask
    passed to the charge functions in memcontrol.h.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Hugh Dickins
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • The check_under_limit logic was wrong: the check should be against
    mem_over_limit rather than mem.

    Reported-by: Li Zefan
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Badari Pulavarty
    Cc: Jan Blunck
    Cc: Hirokazu Takahashi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Current mmotm has a new OOM function, pagefault_out_of_memory(). It
    was added to select a bad process rather than killing current.

    When memcg hits its limit and calls OOM at page fault time, this
    handler is called and system-wide OOM handling happens (meaning the
    kernel panics if panic_on_oom is true....).

    To avoid such overkill, check memcg's recent behavior before starting
    system-wide OOM.

    This patch also fixes things to guarantee that we "don't account
    against a process with TIF_MEMDIE". This is necessary for smooth OOM.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Badari Pulavarty
    Cc: Jan Blunck
    Cc: Hirokazu Takahashi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Don't enable multiple hierarchy support by default. This patch
    introduces a features element that can be set to enable the nested
    depth hierarchy feature. The feature can only be enabled for a cgroup
    that has no children.

    Signed-off-by: Balbir Singh
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: David Rientjes
    Cc: Pavel Emelianov
    Cc: Dhaval Giani
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • This patch introduces hierarchical reclaim. When an ancestor goes over
    its limit, the charging routine points to the parent that is above its
    limit. The reclaim process then starts from the last scanned child of the
    ancestor and reclaims until the ancestor goes below its limit.
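
    A rough standalone model of that loop (the flat child array, the
    reclaim stub, and all names are illustrative assumptions; the real walk
    is over the cgroup tree and calls into vmscan):

        #include <stdbool.h>

        #define MAX_CHILDREN 8

        struct memcg {
            unsigned long usage, limit;        /* hierarchical usage vs. limit  */
            struct memcg *child[MAX_CHILDREN];
            int nr_children;
            int last_scanned_child;            /* where the previous walk ended */
        };

        /* Stub for "reclaim some pages charged below this child"; a real
         * implementation would call into vmscan.  Returns pages freed.    */
        unsigned long reclaim_some(struct memcg *victim)
        {
            unsigned long freed = victim->usage < 32 ? victim->usage : 32;
            victim->usage -= freed;
            return freed;
        }

        /* Reclaim from the children of an over-limit ancestor, resuming at
         * the last scanned child, until the ancestor is back under limit.  */
        bool hierarchical_reclaim(struct memcg *ancestor)
        {
            int i;

            for (i = 0; i < ancestor->nr_children; i++) {
                int idx = (ancestor->last_scanned_child + i) % ancestor->nr_children;

                ancestor->last_scanned_child = idx;
                ancestor->usage -= reclaim_some(ancestor->child[idx]);
                if (ancestor->usage <= ancestor->limit)
                    return true;               /* back under limit: stop early */
            }
            return ancestor->usage <= ancestor->limit;
        }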

    [akpm@linux-foundation.org: coding-style fixes]
    [d-nishimura@mtf.biglobe.ne.jp: mem_cgroup_from_res_counter should handle both mem->res and mem->memsw]
    Signed-off-by: Balbir Singh
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: David Rientjes
    Cc: Pavel Emelianov
    Cc: Dhaval Giani
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • Add support for building hierarchies in resource counters. Cgroups
    allows us to build a deep hierarchy, but we currently don't link the
    resource counters belonging to the memory controller control groups in
    the same fashion as the corresponding cgroup entries in the cgroup
    hierarchy. This patch provides the infrastructure for resource
    counters that have the same hierarchy as their cgroup counterparts.

    This set of patches is based on the resource counter hierarchy patches
    posted by Pavel Emelianov.

    NOTE: Building hierarchies is expensive; deeper hierarchies imply
    charging all the way up to the root. It is known that hierarchies are
    expensive, so the user needs to be careful and aware of the trade-offs
    before creating very deep ones.
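
    A simplified userspace model of linked counters (fields and helpers are
    stand-ins, not the actual res_counter API): charging walks from the
    group's counter up to the root and rolls back on failure, which is also
    why deep hierarchies make every charge more expensive.

        #include <stdbool.h>
        #include <stddef.h>

        struct res_counter {
            unsigned long long usage, limit;
            struct res_counter *parent;         /* NULL at the root */
        };

        /* Try to charge 'val' at 'cnt' and every ancestor up to the root.
         * On failure, roll back the ancestors that were already charged.   */
        bool res_counter_charge(struct res_counter *cnt, unsigned long long val)
        {
            struct res_counter *c, *undo;

            for (c = cnt; c != NULL; c = c->parent) {
                if (c->usage + val > c->limit)
                    goto rollback;
                c->usage += val;
            }
            return true;

        rollback:
            for (undo = cnt; undo != c; undo = undo->parent)
                undo->usage -= val;
            return false;
        }

        void res_counter_uncharge(struct res_counter *cnt, unsigned long long val)
        {
            struct res_counter *c;

            for (c = cnt; c != NULL; c = c->parent)
                c->usage -= val;
        }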

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Balbir Singh
    Cc: YAMAMOTO Takashi
    Cc: Paul Menage
    Cc: Li Zefan
    Cc: David Rientjes
    Cc: Pavel Emelianov
    Cc: Dhaval Giani
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • We check whether mem_cgroup is disabled by testing
    mem_cgroup_subsys.disabled. I think this now has more references than
    expected.

    Replacing

    if (mem_cgroup_subsys.disabled)

    with

    if (mem_cgroup_disabled())

    reads better, I think.
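
    A minimal sketch of the wrapper pattern, with a stubbed subsystem
    struct (the real struct and header placement differ; this only shows
    why call sites read better):

        #include <stdbool.h>

        struct cgroup_subsys { int disabled; };     /* stub of the real struct */

        struct cgroup_subsys mem_cgroup_subsys;

        /* Call sites test the helper instead of poking the field directly,
         * so the check has one spelling and one place to change.           */
        static inline bool mem_cgroup_disabled(void)
        {
            return mem_cgroup_subsys.disabled != 0;
        }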

    [kamezawa.hiroyu@jp.fujitsu.com: fix typo]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hirokazu Takahashi
     
  • A big patch changing memcg's LRU semantics.

    Now,
    - page_cgroup is linked to mem_cgroup's own LRU (per zone).

    - the LRU of page_cgroup is not synchronous with the global LRU.

    - page and page_cgroup are one-to-one and statically allocated.

    - to find which LRU a page_cgroup is on, you have to check
      pc->mem_cgroup, as in
      lru = page_cgroup_zoneinfo(pc, nid_of_pc, zid_of_pc);

    - SwapCache is handled.

    And when we handle the LRU list of page_cgroup, we do the following:

    pc = lookup_page_cgroup(page);
    lock_page_cgroup(pc); .....................(1)
    mz = page_cgroup_zoneinfo(pc);
    spin_lock(&mz->lru_lock);
    .....add to LRU
    spin_unlock(&mz->lru_lock);
    unlock_page_cgroup(pc);

    But (1) is a spinlock and we have to worry about deadlock with
    zone->lru_lock, so trylock() is used at (1) now. Without (1), we can't
    trust that "mz" is correct.

    This is an attempt to remove this dirty nesting of locks.
    This patch changes mz->lru_lock to be zone->lru_lock.
    The sequence above then becomes

    spin_lock(&zone->lru_lock);   # in vmscan.c or swap.c via global LRU
    mem_cgroup_add/remove/etc_lru() {
        pc = lookup_page_cgroup(page);
        mz = page_cgroup_zoneinfo(pc);
        if (PageCgroupUsed(pc)) {
            ....add to LRU
        }
    }
    spin_unlock(&zone->lru_lock); # in vmscan.c or swap.c via global LRU

    This is much simpler.
    (*) We're safe even if we don't take lock_page_cgroup(pc), because:
    1. pc->mem_cgroup can be modified only
       - at charge.
       - at account_move().
    2. At charge, the PCG_USED bit is not set before pc->mem_cgroup is
       fixed.
    3. At account_move(), the page is isolated and not on any LRU.

    Pros.
    - easy to maintain.
    - memcg can make use of the laziness of pagevec.
    - we don't have to duplicate the LRU/Active/Unevictable bits in
      page_cgroup.
    - the LRU status of memcg is synchronized with the global LRU's.
    - the number of locks is reduced.
    - account_move() is simplified very much.
    Cons.
    - may increase the cost of LRU rotation.
      (no impact if memcg is not configured.)

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • This patch implements a per-cgroup limit on the usage of memory+swap.
    However, there is SwapCache; double counting of swap-cache and
    swap-entry is avoided.

    The mem+swap controller works as follows:
    - memory usage is limited by memory.limit_in_bytes.
    - memory + swap usage is limited by memory.memsw_limit_in_bytes.

    This has the following benefits.
    - A user can limit the total resource usage of mem+swap.

    Without this, because the memory resource controller doesn't take care
    of swap usage, a process can exhaust all the swap (by a memory leak).
    We can avoid that case.

    And swap is a shared resource, but it cannot be reclaimed (brought back
    to memory) until it is used. This characteristic can be trouble when
    the memory is divided into parts by cpuset or memcg.
    Assume groups A and B.
    After some applications run, the system can end up as:

    Group A -- very large free memory space but occupies 99% of swap.
    Group B -- under memory shortage but cannot use swap... it's nearly full.

    The ability to set an appropriate swap limit for each group is required.

    Maybe someone wonders "why mem+swap rather than just swap?"

    - The global LRU (kswapd) can swap out arbitrary pages. Swap-out means
    moving an account from memory to swap... there is no change in the
    usage of mem+swap.

    In other words, when we want to limit the usage of swap without
    affecting the global LRU, a mem+swap limit is better than just
    limiting swap.

    The accounting-target information is stored in swap_cgroup, which is a
    per-swap-entry record.

    Charging is done as follows (a toy model of these transitions follows
    the list):

    map
    - charge page and memsw.

    unmap
    - uncharge page/memsw if not SwapCache.

    swap-out (__delete_from_swap_cache)
    - uncharge page
    - record mem_cgroup information into swap_cgroup.

    swap-in (do_swap_page)
    - charged as page and memsw.
      The record in swap_cgroup is cleared, and memsw accounting is
      decremented.

    swap-free (swap_free())
    - if the swap entry is freed, memsw is uncharged by PAGE_SIZE.
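
    A toy model of those transitions for a single group; this is only an
    illustration of the rules above. In the real code the per-entry
    swap_cgroup record decides whose memsw charge is dropped, while here
    everything is attributed to one group:

        struct memsw_usage {
            long mem;       /* counted against memory.limit_in_bytes */
            long memsw;     /* counted against the mem+swap limit    */
        };

        /* map: the page is charged as both memory and mem+swap. */
        void on_map(struct memsw_usage *u)       { u->mem++; u->memsw++; }

        /* unmap (not SwapCache): both charges are dropped. */
        void on_unmap(struct memsw_usage *u)     { u->mem--; u->memsw--; }

        /* swap-out: the memory charge is dropped; memsw stays because the
         * data now lives in a swap slot recorded in swap_cgroup.          */
        void on_swap_out(struct memsw_usage *u)  { u->mem--; }

        /* swap-in: the page is charged as mem+memsw again and the memsw
         * held by the cleared swap_cgroup record is decremented, so for a
         * single group the net effect is a memory charge only.            */
        void on_swap_in(struct memsw_usage *u)   { u->mem++; }

        /* swap_free(): a swap entry that still carries a memsw record is
         * released, so memsw is uncharged.                                */
        void on_swap_free(struct memsw_usage *u) { u->memsw--; }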

    Some people work in never-swap environments and consider swap to be
    something bad. For such people, this mem+swap controller extension is
    just overhead; the overhead can be avoided by a config or boot option
    (see Kconfig; details are not in this patch).

    TODO:
    - maybe more optimization can be done in the swap-in path (but it is
    not very safe). For now we just do simple accounting at this stage.

    [nishimura@mxp.nes.nec.co.jp: make resize limit hold mutex]
    [hugh@veritas.com: memswap controller core swapcache fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Pavel Emelyanov
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • For accounting swap, we need at least one record per swap entry.

    This patch adds the following functions:
    - swap_cgroup_swapon() .... called from swapon.
    - swap_cgroup_swapoff() ... called at the end of swapoff.

    - swap_cgroup_record() .... record information about a swap entry.
    - swap_cgroup_lookup() .... look up information about a swap entry.

    This patch only implements "how to record information"; there is no
    actual method for limiting the usage of swap yet. These routines use a
    flat table for record and lookup. A "wise" lookup structure like a
    radix tree requires memory allocation when adding new records, but
    swap-out is usually called under memory shortage (or when memcg hits
    its limit), so static allocation is used here. (Dynamic allocation
    would not be very hard, but it adds memory allocation to the
    memory-shortage path.)

    Note1: This uses a pointer to record information, which means 8 bytes
    per swap entry. I think we can reduce this once we create an "id of
    cgroup" in the range 0-65535 or 0-255 (a sketch of the flat table
    follows).

    Reported-by: Daisuke Nishimura
    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Reported-by: Hugh Dickins
    Reported-by: Balbir Singh
    Reported-by: Andrew Morton
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Pavel Emelianov
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki