17 Dec, 2009
1 commit
-
* 'hwpoison' of git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6: (34 commits)
HWPOISON: Remove stray phrase in a comment
HWPOISON: Try to allocate migration page on the same node
HWPOISON: Don't do early filtering if filter is disabled
HWPOISON: Add a madvise() injector for soft page offlining
HWPOISON: Add soft page offline support
HWPOISON: Undefine short-hand macros after use to avoid namespace conflict
HWPOISON: Use new shake_page in memory_failure
HWPOISON: Use correct name for MADV_HWPOISON in documentation
HWPOISON: mention HWPoison in Kconfig entry
HWPOISON: Use get_user_page_fast in hwpoison madvise
HWPOISON: add an interface to switch off/on all the page filters
HWPOISON: add memory cgroup filter
memcg: add accessor to mem_cgroup.css
memcg: rename and export try_get_mem_cgroup_from_page()
HWPOISON: add page flags filter
mm: export stable page flags
HWPOISON: limit hwpoison injector to known page types
HWPOISON: add fs/device filters
HWPOISON: return 0 to indicate success reliably
HWPOISON: make semantics of IGNORED/DELAYED clear
...
16 Dec, 2009
12 commits
-
Variable `progress' isn't used in mem_cgroup_resize_limit() any more.
Remove it.

[akpm@linux-foundation.org: cleanup]
Signed-off-by: Bob Liu
Cc: Daisuke Nishimura
Reviewed-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
memcg_tasklist was introduced at commit 7f4d454d (memcg: avoid deadlock
caused by race between oom and cpuset_attach) instead of cgroup_mutex to
fix a deadlock problem. The cgroup_mutex, which was removed by the
commit, in mem_cgroup_out_of_memory() was originally introduced at commit
c7ba5c9e (Memory controller: OOM handling).

IIUC, the intention of this cgroup_mutex was to prevent task move during
select_bad_process() so that situations like below can be avoided.

Assume cgroup "foo" has exceeded its limit and is about to trigger oom.
1. Process A, which has been in cgroup "baa" and uses large memory, is just
moved to cgroup "foo". Process A can be a candidate for being killed.
2. Process B, which has been in cgroup "foo" and uses large memory, is just
moved from cgroup "foo". Process B can be excluded from the candidates for
being killed.

But these race windows exist anyway even if we hold a lock, because
__mem_cgroup_try_charge() decides whether it should trigger oom or not
outside of the lock. So the original cgroup_mutex in
mem_cgroup_out_of_memory() and thus the current memcg_tasklist have no use. And
IMHO, those races are not so critical for users.

This patch removes it and makes the code simpler.
Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
task_in_mem_cgroup(), which is called by select_bad_process() to check
whether a task can be a candidate for being oom-killed from memcg's limit,
checks "curr->use_hierarchy" ("curr" is the mem_cgroup the task belongs
to).

But this check returns true (a false positive) when:
/aa use_hierarchy == 0
/aa/00 use_hierarchy == 1
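The false positive can be reproduced with a small user-space model. This is a Python sketch, not kernel code: the `Group` class and both check functions are hypothetical, and it assumes the limit is hit at /aa while the task lives in /aa/00. The second function, which consults the flag of the group that hit its limit, is only one plausible correction and is not taken from this message.

```python
class Group:
    """Hypothetical stand-in for a mem_cgroup."""
    def __init__(self, name, use_hierarchy, parent=None):
        self.name = name
        self.use_hierarchy = use_hierarchy
        self.parent = parent

def is_self_or_descendant(group, ancestor):
    # Walk up the parent chain looking for `ancestor`.
    while group is not None:
        if group is ancestor:
            return True
        group = group.parent
    return False

def in_memcg_by_curr_flag(curr, mem):
    # Mirrors the check described above: the task's own group decides.
    if curr.use_hierarchy:
        return is_self_or_descendant(curr, mem)
    return curr is mem

def in_memcg_by_limit_flag(curr, mem):
    # Assumed alternative: the group that hit its limit decides.
    if mem.use_hierarchy:
        return is_self_or_descendant(curr, mem)
    return curr is mem

aa = Group("/aa", use_hierarchy=False)
aa_00 = Group("/aa/00", use_hierarchy=True, parent=aa)
```

With this layout, `in_memcg_by_curr_flag(aa_00, aa)` reports the task as a candidate even though /aa does not use hierarchy, while `in_memcg_by_limit_flag(aa_00, aa)` does not.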
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
mem_cgroup_move_parent() calls try_charge first and cancel_charge on
failure. IMHO, charge/uncharge (especially charge) is a high-cost operation,
so we should avoid it as far as possible.

This patch tries to delay try_charge in mem_cgroup_move_parent() by
re-ordering the checks it does.

And this patch renames mem_cgroup_move_account() to
__mem_cgroup_move_account(), changes the return value of
__mem_cgroup_move_account() from int to void, and adds a new
wrapper (mem_cgroup_move_account()), which checks whether a @pc is valid
for moving account and calls __mem_cgroup_move_account().

This patch removes the last caller of trylock_page_cgroup(), so it removes
its definition too.
Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There are some places calling both res_counter_uncharge() and css_put() to
cancel the charge and the refcnt we have got by mem_cgroup_try_charge().

This patch introduces mem_cgroup_cancel_charge() and calls it in those
places.
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Daisuke Nishimura
Reviewed-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In global VM, FILE_MAPPED is used but memcg uses MAPPED_FILE. This makes
grep difficult. Replace memcg's MAPPED_FILE with FILE_MAPPED.

And in global VM, mapped shared memory is accounted into FILE_MAPPED.
But memcg doesn't. Fix it.
Note:
page_is_file_cache() just checks SwapBacked or not.
So, we need to check PageAnon.
Cc: Balbir Singh
Reviewed-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is a patch for coalescing access to res_counter at charging by percpu
caching. At charge, memcg charges 64 pages and remembers it in the percpu
cache. Because it's a cache, it is drained/flushed if necessary.

This version uses the public percpu area.
2 benefits of using the public percpu area:
1. Sum of stocked charge in the system is limited to # of cpus
not to the number of memcg. This shows better synchronization.
2. drain code for flush/cpuhotplug is very easy (and quick)

The most important point of this patch is that we never touch res_counter
in the fast path. The res_counter is a system-wide shared counter which is
modified very frequently. We shouldn't touch it as far as we can for avoiding
false sharing.

On an x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per cpu by taskset and see sum of the number of page faults
in 60secs.

[without memcg config]
40156968 page-faults # 0.085 M/sec ( +- 0.046% )
27.67 cache-miss/faults
[root cgroup]
36659599 page-faults # 0.077 M/sec ( +- 0.247% )
31.58 cache miss/faults
[in a child cgroup]
18444157 page-faults # 0.039 M/sec ( +- 0.133% )
69.96 cache miss/faults
[ + coalescing uncharge patch]
27133719 page-faults # 0.057 M/sec ( +- 0.155% )
47.16 cache miss/faults
[ + coalescing uncharge patch + this patch ]
34224709 page-faults # 0.072 M/sec ( +- 0.173% )
34.69 cache miss/faults

Changelog (since Oct/2):
- updated comments
- replaced get_cpu_var() with __get_cpu_var() if possible.
- removed mutex for system-wide drain. adds a counter instead of it.
- removed CONFIG_HOTPLUG_CPU

Changelog (old):
- rebased onto the latest mmotm
- moved charge size check before __GFP_WAIT check for avoiding unnecessary
- added asynchronous flush routine.
- fixed bugs pointed out by Nishimura-san.

[akpm@linux-foundation.org: tweak comments]
[nishimura@mxp.nes.nec.co.jp: don't do INIT_WORK() repeatedly against the same work_struct]
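The stocking idea can be sketched in user space. This is a minimal Python model, not the kernel implementation: `ResCounter`, `CpuStock`, `charge_one_page()` and `drain_stock()` are hypothetical names, pages are counted as plain integers, and the batch of 64 pages is taken from the description above. The `ops` field counts how often the shared counter is touched.

```python
BATCH_PAGES = 64  # batch size taken from the commit description

class ResCounter:
    """Shared counter; every call to charge/uncharge counts as one 'op'."""
    def __init__(self, limit):
        self.limit = limit
        self.usage = 0
        self.ops = 0

    def charge(self, pages):
        self.ops += 1
        if self.usage + pages > self.limit:
            return False
        self.usage += pages
        return True

    def uncharge(self, pages):
        self.ops += 1
        self.usage -= pages

class CpuStock:
    """Per-cpu cache of pages already charged to one counter."""
    def __init__(self):
        self.owner = None
        self.pages = 0

def drain_stock(stock):
    # Return unused pre-charged pages to the shared counter.
    if stock.owner is not None and stock.pages:
        stock.owner.uncharge(stock.pages)
    stock.owner = None
    stock.pages = 0

def charge_one_page(counter, stock):
    # Fast path: consume from the local stock, shared counter untouched.
    if stock.owner is counter and stock.pages > 0:
        stock.pages -= 1
        return True
    # Slow path: charge a whole batch and stock the remainder.
    if counter.charge(BATCH_PAGES):
        if stock.owner is not counter:
            drain_stock(stock)
            stock.owner = counter
        stock.pages += BATCH_PAGES - 1
        return True
    # Near the limit: fall back to charging a single page.
    return counter.charge(1)
```

In this model, 100 single-page charges touch the shared counter twice instead of 100 times; draining afterwards returns the 28 unused pages.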
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In a massively parallel environment, res_counter can be a performance
bottleneck. One strong technique to reduce lock contention is reducing
calls by coalescing some amount of calls into one.

Considering charge/uncharge characteristics,
- charge is done one by one via demand-paging.
- uncharge is done by
- in chunk at munmap, truncate, exit, execve...
- one by one via vmscan/paging.

It seems we have a chance to coalesce uncharges for improving scalability
at unmap/truncation.

This patch is for coalescing uncharge. For avoiding scattering memcg's
structure to functions under /mm, this patch adds memcg batch uncharge
information to the task. A reason for per-task batching is for making use
of caller's context information. We do batched uncharge (delayed
uncharge) when truncation/unmap occurs but do direct uncharge when
uncharge is called by memory reclaim (vmscan.c).

The degree of coalescing depends on callers
- at invalidate/truncate... pagevec size
- at unmap ....ZAP_BLOCK_SIZE
(memory itself will be freed in this degree.)
Then, we won't coalesce too much.

On an x86-64 8cpu server, I tested overheads of memcg at page fault by
running a program which does map/fault/unmap in a loop. Running
a task per cpu by taskset and see sum of the number of page faults
in 60secs.

[without memcg config]
40156968 page-faults # 0.085 M/sec ( +- 0.046% )
27.67 cache-miss/faults
[root cgroup]
36659599 page-faults # 0.077 M/sec ( +- 0.247% )
31.58 miss/faults
[in a child cgroup]
18444157 page-faults # 0.039 M/sec ( +- 0.133% )
69.96 miss/faults
[child with this patch]
27133719 page-faults # 0.057 M/sec ( +- 0.155% )
47.16 miss/faults

We can see some amount of improvement.
(root cgroup isn't affected by this patch)
Another patch for "charge" will follow this and above will be improved more.

Changelog(since 2009/10/02):
- renamed field of memcg_batch (as pages to bytes, memsw to memsw_bytes)
- some clean up and commentary/description updates.
- added initialize code to copy_process(). (possible bug fix)

Changelog(old):
- fixed !CONFIG_MEM_CGROUP case.
- rebased onto the latest mmotm + softlimit fix patches.
- unified patch for callers
- added comments.
- make ->do_batch as bool.
- removed css_get() et al. We don't need it.
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A memory cgroup has a memory.memsw.usage_in_bytes file. It shows the sum
of the usage of pages and swapents in the cgroup. Presently the root
cgroup's memsw.usage_in_bytes shows the wrong value - the number of
swapents is not added.

So take MEM_CGROUP_STAT_SWAPOUT into account.
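The arithmetic can be illustrated with a short Python sketch. It is a model under stated assumptions, not the kernel code: `root_usage_in_bytes()` and the `stats` dictionary are hypothetical, the keys stand for the MEM_CGROUP_STAT_CACHE, MEM_CGROUP_STAT_RSS and MEM_CGROUP_STAT_SWAPOUT counters (in pages), and a 4096-byte page is assumed.

```python
PAGE_SIZE = 4096  # assumption: 4 KiB pages

def root_usage_in_bytes(stats, memsw=False):
    """Derive the root cgroup's usage from per-statistic page counts."""
    pages = stats["cache"] + stats["rss"]
    if memsw:
        # memsw usage must also count swapped-out entries.
        pages += stats["swapout"]
    return pages * PAGE_SIZE

stats = {"cache": 10, "rss": 20, "swapout": 5}
```

Leaving out the `swapout` term for the memsw case reproduces the wrong value described above: the two usage files would report the same number even though swap is in use.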
Signed-off-by: Kirill A. Shutemov
Reviewed-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
So that an outside user can free the reference count grabbed by
try_get_mem_cgroup_from_page().
CC: KOSAKI Motohiro
CC: Hugh Dickins
CC: Daisuke Nishimura
CC: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen
-
So that the hwpoison injector can get mem_cgroup for arbitrary page
and thus know whether it is owned by some mem_cgroup task(s).

[AK: Merged with latest git tree]
CC: KOSAKI Motohiro
CC: Hugh Dickins
CC: Daisuke Nishimura
CC: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Wu Fengguang
Signed-off-by: Andi Kleen
-
But ksm swapping does require one small change in mem cgroup handling.
When do_swap_page()'s call to ksm_might_need_to_copy() does indeed
substitute a duplicate page to accommodate a different anon_vma (or a the
!PageSwapCache check in mem_cgroup_try_charge_swapin().

That was returning success without charging, on the assumption that
pte_same() would fail after, which is not the case here. Originally I
proposed that success, so that an unshrinkable mem cgroup at its limit
would not fail unnecessarily; but that's a minor point, and there are
plenty of other places where we may fail an overallocation which might
later prove unnecessary. So just go ahead and do what all the other
exceptions do: proceed to charge current mm.
Signed-off-by: Hugh Dickins
Cc: Izik Eidus
Cc: Andrea Arcangeli
Cc: Chris Wright
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Dec, 2009
1 commit
-
That is "success", "unknown", "through", "performance", "[re|un]mapping",
"access", "default", "reasonable", "[con]currently", "temperature",
"channel", "[un]used", "application", "example", "hierarchy", "therefore",
"[over|under]flow", "contiguous", "threshold", "enough" and others.
Signed-off-by: André Goddard Rosa
Signed-off-by: Jiri Kosina
09 Nov, 2009
1 commit
-
This patch was generated by
git grep -E -i -l '[Aa]quire' | xargs -r perl -p -i -e 's/([Aa])quire/$1cquire/'
and the cumsumed was found by checking the diff for aquire.
Signed-off-by: Uwe Kleine-König
Signed-off-by: Jiri Kosina
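For readers who want to try the substitution on a string rather than on a tree of files, the Perl expression above can be mirrored with Python's `re.sub`. This is only an illustration of the same pattern; the `fix_acquire()` helper is a hypothetical name and is not part of the patch.

```python
import re

def fix_acquire(text):
    """Insert the missing 'c' while keeping the case of the first letter."""
    return re.sub(r"([Aa])quire", r"\1cquire", text)
```

Words that are already spelled correctly are left alone, because in "acquire" the letter after the leading "a" is "c", so the pattern does not match.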
02 Oct, 2009
3 commits
-
In charge/uncharge/reclaim path, usage_in_excess is calculated repeatedly
and it takes res_counter's spin_lock every time.

This patch removes unnecessary calls to res_counter_soft_limit_excess().
Reviewed-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch cleans up/fixes memcg's uncharge soft limit path.
Problems:
Now, res_counter_charge()/uncharge() handles softlimit information at
charge/uncharge and softlimit-check is done when the event counter per memcg
goes over limit. Now, the event counter per memcg is updated only when
memory usage is over soft limit. Here, considering hierarchical memcg
management, ancestors should be taken care of.

Now, ancestors (hierarchy) are handled in charge() but not in uncharge().
This is not good.

Problems:
1. memcg's event counter is incremented only when softlimit hits. That's bad.
It makes the event counter hard to reuse for other purposes.

2. At uncharge, only the lowest level res_counter is handled. This is a bug.
Because the ancestors' event counters are not incremented, children should
take care of them.

3. res_counter_uncharge()'s 3rd argument is NULL in most cases.
ops under res_counter->lock should be small. No "if" statement is better.

Fixes:
* Removed soft_limit_xx pointer and checks in charge and uncharge.
Do-check-only-when-necessary scheme works well enough without them.

* make event-counter of memcg incremented at every charge/uncharge.
(per-cpu area will be accessed soon anyway)

* All ancestors are checked at soft-limit-check. This is necessary because
the ancestors' event counters may never be modified. Then, they should be
checked at the same time.
Reviewed-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
__mem_cgroup_largest_soft_limit_node() returns a mem_cgroup_per_zone "mz"
with incremented mz->mem->css's refcnt. Then, the caller of this function
has to call css_put(mz->mem->css).

But, mz can be !NULL even if "not found" i.e. without css_get(). By
this, css->refcnt will go down to minus.

This may cause various things...one of the results will be an
infinite loop in css_tryget() as this.

INFO: RCU detected CPU 0 stall (t=10000 jiffies)
sending NMI to all CPUs:
NMI backtrace for cpu 0
CPU 0:<> [] trace_hardirqs_off+0xd/0x10
[] flat_send_IPI_mask+0x90/0xb0
[] flat_send_IPI_all+0x69/0x70
[] arch_trigger_all_cpu_backtrace+0x62/0xa0
[] __rcu_pending+0x7e/0x370
[] rcu_check_callbacks+0x47/0x130
[] update_process_times+0x46/0x70
[] tick_sched_timer+0x60/0x160
[] ? tick_sched_timer+0x0/0x160
[] __run_hrtimer+0xba/0x150
[] hrtimer_interrupt+0xd5/0x1b0
[] ? trace_hardirqs_off_thunk+0x3a/0x3c
[] smp_apic_timer_interrupt+0x6d/0x9b
[] apic_timer_interrupt+0x13/0x20
[] ? mem_cgroup_walk_tree+0x156/0x180
[] ? mem_cgroup_walk_tree+0x73/0x180
[] ? mem_cgroup_walk_tree+0x32/0x180
[] ? mem_cgroup_get_local_stat+0x0/0x110
[] ? mem_control_stat_show+0x14b/0x330
[] ? cgroup_seqfile_show+0x3d/0x60

Above shows CPU0 caught in css_tryget()'s infinite loop because
of bad refcnt.

This is a fix to set mz=NULL at the top of retry path.
Signed-off-by: KAMEZAWA Hiroyuki
Acked-by: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
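The stale-pointer problem described above can be shown with a small Python model of the retry loop. This is a sketch, not the kernel function: `largest_node()`, the list used in place of the tree, and the `tryget` callback are hypothetical, and the `reset_on_retry` switch exists only to compare the behaviour with and without the reset.

```python
def largest_node(tree, tryget, reset_on_retry=True):
    """Pop candidates (largest last) until one yields a reference.

    Returns the candidate that was successfully referenced, or None.
    """
    mz = None
    while True:
        if reset_on_retry:
            mz = None  # forget the candidate left over from a failed attempt
        if not tree:
            break      # nothing left to look at
        mz = tree.pop()
        if tryget(mz):
            break      # reference taken, mz is safe to hand back
    return mz
```

Without the reset, a candidate whose reference could not be taken is still returned once the tree runs empty, so the caller drops a reference that was never acquired.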
24 Sep, 2009
9 commits
-
We now count MEM_CGROUP_STAT_SWAPOUT, so we can show swap usage. It would
be useful for users to show swap usage in memory.stat file, because they
don't need to calculate memsw.usage - res.usage to know swap usage.
Signed-off-by: Daisuke Nishimura
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Reduce the resource counter overhead (mostly spinlock) associated with the
root cgroup. This is a part of the several patches to reduce mem cgroup
overhead. I had posted other approaches earlier (including using percpu
counters). Those patches will be a natural addition and will be added
iteratively on top of these.

The patch stops resource counter accounting for the root cgroup. The data
for display is derived from the statistics we maintain via
mem_cgroup_charge_statistics() (which is more scalable). What happens today
is that we do double accounting, once using res_counter_charge() and once
using mem_cgroup_charge_statistics(). For the root, since we don't
implement limits any more, we don't need to track every charge via
res_counter_charge() and check for limit being exceeded and reclaim.

The main mem->res usage_in_bytes can be derived by summing the cache and
rss usage data from memory statistics (MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE). However, for memsw->res usage_in_bytes, we need
additional data about swapped out memory. This patch adds a
MEM_CGROUP_STAT_SWAPOUT and uses that along with MEM_CGROUP_STAT_RSS and
MEM_CGROUP_STAT_CACHE to derive the memsw data. This data is computed
recursively when hierarchy is enabled.

The test results I see on a 24 way show that
1. The lock contention disappears from /proc/lock_stat
2. The results of the test are comparable to running with
cgroup_disable=memory.

Here is a sample of my program runs
Without Patch
Performance counter stats for '/home/balbir/parallel_pagefault':
7192804.124144 task-clock-msecs # 23.937 CPUs
424691 context-switches # 0.000 M/sec
267 CPU-migrations # 0.000 M/sec
28498113 page-faults # 0.004 M/sec
5826093739340 cycles # 809.989 M/sec
408883496292 instructions # 0.070 IPC
7057079452 cache-references # 0.981 M/sec
3036086243 cache-misses # 0.422 M/sec

300.485365680 seconds time elapsed
With cgroup_disable=memory
Performance counter stats for '/home/balbir/parallel_pagefault':
7182183.546587 task-clock-msecs # 23.915 CPUs
425458 context-switches # 0.000 M/sec
203 CPU-migrations # 0.000 M/sec
92545093 page-faults # 0.013 M/sec
6034363609986 cycles # 840.185 M/sec
437204346785 instructions # 0.072 IPC
6636073192 cache-references # 0.924 M/sec
2358117732 cache-misses # 0.328 M/sec

300.320905827 seconds time elapsed
With this patch applied
Performance counter stats for '/home/balbir/parallel_pagefault':
7191619.223977 task-clock-msecs # 23.955 CPUs
422579 context-switches # 0.000 M/sec
88 CPU-migrations # 0.000 M/sec
91946060 page-faults # 0.013 M/sec
5957054385619 cycles # 828.333 M/sec
1058117350365 instructions # 0.178 IPC
9161776218 cache-references # 1.274 M/sec
1920494280 cache-misses # 0.267 M/sec

300.218764862 seconds time elapsed

Data from Prarit (kernel compile with make -j64 on a 64
CPU/32G machine)

For a single run
Without patch
real 27m8.988s
user 87m24.916s
sys 382m6.037s

With patch
real 4m18.607s
user 84m58.943s
sys 50m52.682s

With config turned off
real 4m54.972s
user 90m13.456s
sys 50m19.711s

NOTE: The data looks counterintuitive due to the increased performance
with the patch, even over the config being turned off. We probably need
more runs, but so far all testing has shown that the patches definitely
help.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Balbir Singh
Cc: Prarit Bhargava
Cc: Andi Kleen
Reviewed-by: KAMEZAWA Hiroyuki
Reviewed-by: Daisuke Nishimura
Cc: KOSAKI Motohiro
Cc: Paul Menage
Cc: Li Zefan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Implement reclaim from groups over their soft limit

Permit reclaim from memory cgroups on contention (via the direct reclaim
path).

Memory cgroup soft limit reclaim finds the group that exceeds its soft
limit by the largest number of pages and reclaims pages from it and then
reinserts the cgroup into its correct place in the rbtree.

Add additional checks to mem_cgroup_hierarchical_reclaim() to detect long
loops in case all swap is turned off. The code has been refactored and
the loop check (loop < 2) has been enhanced for soft limits. For soft
limits, we try to do more targeted reclaim. Instead of bailing out after
two loops, the routine now reclaims memory proportional to the size by
which the soft limit is exceeded. The proportion has been empirically
determined.

[akpm@linux-foundation.org: build fix]
[kamezawa.hiroyu@jp.fujitsu.com: fix softlimit css refcnt handling]
[nishimura@mxp.nes.nec.co.jp: refcount of the "victim" should be decremented before exiting the loop]
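The pick-largest, reclaim, reinsert cycle can be modelled in a few lines of Python. This is a sketch under stated assumptions, not the kernel code: a sorted list kept with `bisect` stands in for the RB-tree, `SoftLimitTree` and `soft_limit_reclaim()` are hypothetical names, usage is counted in pages, and the proportional reclaim mentioned above is simplified to "reclaim up to `nr_pages` of the excess".

```python
import bisect

class SoftLimitTree:
    """Sorted list standing in for the RB-tree; entries are (excess, name)."""
    def __init__(self):
        self._entries = []

    def update(self, name, usage, soft_limit):
        # Drop any stale entry, then re-add only while over the soft limit.
        self._entries = [e for e in self._entries if e[1] != name]
        excess = usage - soft_limit
        if excess > 0:
            bisect.insort(self._entries, (excess, name))

    def largest(self):
        # The last entry has the largest excess.
        return self._entries[-1][1] if self._entries else None

def soft_limit_reclaim(tree, groups, nr_pages):
    """Reclaim up to nr_pages from the worst offender; return its name."""
    victim = tree.largest()
    if victim is None:
        return None
    group = groups[victim]
    excess = group["usage"] - group["soft_limit"]
    group["usage"] -= min(nr_pages, excess)
    # Reinsert at the correct place, or drop it if it is back under the limit.
    tree.update(victim, group["usage"], group["soft_limit"])
    return victim
```

After each reclaim the victim is re-sorted by its remaining excess, so the next call may pick a different group.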
Signed-off-by: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Li Zefan
Acked-by: KOSAKI Motohiro
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Refactor mem_cgroup_hierarchical_reclaim()

Refactor the arguments passed to mem_cgroup_hierarchical_reclaim() into
flags, so that new parameters don't have to be passed as we make the
reclaim routine more flexible
Signed-off-by: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Organize cgroups over soft limit in a RB-Tree

Introduce an RB-Tree for storing memory cgroups that are over their soft
limit. The overall goal is to

1. Add a memory cgroup to the RB-Tree when the soft limit is exceeded.
We are careful about updates; updates take place only after a particular
time interval has passed
2. Remove the node from the RB-Tree when the usage goes below the soft
limit

The next set of patches will exploit the RB-Tree to get the group that is
over its soft limit by the largest amount and reclaim from it, when we
face memory contention.

[hugh.dickins@tiscali.co.uk: CONFIG_CGROUP_MEM_RES_CTLR=y CONFIG_PREEMPT=y fails to boot]
Signed-off-by: Balbir Singh
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: KOSAKI Motohiro
Signed-off-by: Hugh Dickins
Cc: Jiri Slaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add an interface to allow get/set of soft limits. Soft limits for the memory
plus swap controller (memsw) are currently not supported. Resource
counters have been enhanced to support soft limits and a new type
RES_SOFT_LIMIT has been added. Unlike hard limits, soft limits can be
directly set and do not need any reclaim or checks before setting them to
a newer value.

Kamezawa-San raised a question as to whether soft limit should belong to
res_counter. Since all resources understand the basic concepts of hard
and soft limits, it is justified to add soft limits here. Soft limits are
a generic resource usage feature, even file system quotas support soft
limits.
Signed-off-by: Balbir Singh
Cc: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add comments for the reason of smp_wmb() in mem_cgroup_commit_charge().
[akpm@linux-foundation.org: coding-style fixes]
Cc: Daisuke Nishimura
Cc: Balbir Singh
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Change the memory cgroup to remove the overhead associated with accounting
all pages in the root cgroup. As a side-effect, we can no longer set a
memory hard limit in the root cgroup.

A new flag to track whether the page has been accounted or not has been
added as well. Flags are now set atomically for page_cgroup,
pcg_default_flags is now obsolete and removed.

[akpm@linux-foundation.org: fix a few documentation glitches]
Signed-off-by: Balbir Singh
Signed-off-by: Daisuke Nishimura
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Li Zefan
Cc: Paul Menage
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Alter the ss->can_attach and ss->attach functions to be able to deal with
a whole threadgroup at a time, for use in cgroup_attach_proc. (This is a
pre-patch to cgroup-procs-writable.patch.)

Currently, the new mode of the attach function can only tell the subsystem
about the old cgroup of the threadgroup leader. No subsystem currently
needs that information for each thread that's being moved, but if one were
to be added (for example, one that counts tasks within a group) this bit
would need to be reworked a bit to tell the subsystem the right
information.

[hidave.darkstar@gmail.com: fix build]
Signed-off-by: Ben Blum
Signed-off-by: Paul Menage
Acked-by: Li Zefan
Reviewed-by: Matt Helsley
Cc: "Eric W. Biederman"
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Young
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Sep, 2009
1 commit
-
Remove double negations where the operand is already boolean.
Signed-off-by: Johannes Weiner
Cc: Mel Gorman
Reviewed-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Jul, 2009
1 commit
-
After commit ec64f51545fffbc4cb968f0cea56341a4b07e85a ("cgroup: fix
frequent -EBUSY at rmdir"), cgroup's rmdir (especially against memcg)
doesn't return -EBUSY because of temporary ref counts. That commit expects all
refs after pre_destroy() to be temporary but...they weren't. Then, rmdir can
wait permanently. This patch tries to fix that and makes the following changes.

- set CGRP_WAIT_ON_RMDIR flag before pre_destroy().
- clear CGRP_WAIT_ON_RMDIR flag when the subsys finds racy case.
if there are sleeping ones, wakes them up.
- rmdir() sleeps only when CGRP_WAIT_ON_RMDIR flag is set.
Tested-by: Daisuke Nishimura
Reported-by: Daisuke Nishimura
Reviewed-by: Paul Menage
Acked-by: Balbir Singh
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Jul, 2009
1 commit
-
Commit 1faa16d22877f4839bd433547d770c676d1d964c accidentally broke
the bdi congestion wait queue logic, causing us to wait on congestion
for WRITE (== 1) when we really wanted BLK_RW_ASYNC (== 0) instead.
Signed-off-by: Jens Axboe
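The mix-up is easy to see once the two sets of constants sit side by side. The Python sketch below is only an illustration: the values of WRITE and BLK_RW_ASYNC come from the message above, while READ, BLK_RW_SYNC, the `congestion_queues` table and `queue_waited_on()` are assumptions introduced for the example.

```python
READ, WRITE = 0, 1                # request direction (READ is an assumption)
BLK_RW_ASYNC, BLK_RW_SYNC = 0, 1  # queue index (BLK_RW_SYNC is an assumption)

# Hypothetical stand-in: one congestion wait queue per index.
congestion_queues = {
    BLK_RW_ASYNC: "async waiters",
    BLK_RW_SYNC: "sync waiters",
}

def queue_waited_on(index):
    """Return the queue a caller ends up sleeping on for a given index."""
    return congestion_queues[index]
```

Passing a direction constant where a queue index is expected still yields a valid queue, just the wrong one, which is why the mistake does not fail loudly.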
19 Jun, 2009
5 commits
-
Try to fix memcg's lru rotation sanity: make memcg use the same logic as
the global LRU does.

Now, when __isolate_lru_page() returns -EBUSY, the page is rotated to the
tail of LRU in global LRU's isolate LRU pages. But in memcg, it's not
handled. This makes memcg behave the same as global LRU and rotate
the LRU if the page is busy.
Signed-off-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Acked-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: Mel Gorman
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
A user can set memcg.limit_in_bytes == memcg.memsw.limit_in_bytes when the
user just wants to limit the total size of applications, in other words,
is not very interested in memory usage itself. In this case, swap-out will
be done only by global-LRU.

But, under the current implementation, memory.limit_in_bytes is checked
first and try_to_free_page() may do swap-out. But that swap-out is
useless for memsw.limit_in_bytes and the thread may hit the limit again.

This patch tries to fix the current behavior in the memory.limit ==
memsw.limit case. And documentation is updated to explain the behavior of
this special case.
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Li Zefan
Cc: Dhaval Giani
Cc: YAMAMOTO Takashi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This patch fixes mis-accounting of swap usage in memcg.

In the current implementation, memcg's swap account is uncharged only when
swap is completely freed. But there are several cases where swap cannot
be freed cleanly. To handle that, this patch makes memcg
uncharge the swap account when swap has no references other than cache.

By this, memcg's swap entry accounting can be fully synchronous with the
application's behavior.

This patch also changes memcg's hooks for swap-out.
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Acked-by: Balbir Singh
Cc: Hugh Dickins
Cc: Johannes Weiner
Cc: Li Zefan
Cc: Dhaval Giani
Cc: YAMAMOTO Takashi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We don't need to check do_swap_account in the case that the function which
checks do_swap_account will never get called if do_swap_account == 0.
Signed-off-by: Li Zefan
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add file RSS tracking per memory cgroup

We currently don't track file RSS, the RSS we report is actually anon RSS.
All the file mapped pages come in through the page cache and get
accounted there. This patch adds support for accounting file RSS pages.
It should

1. Help improve the metrics reported by the memory resource controller
2. Form the basis for a future shared memory accounting heuristic
that has been proposed by Kamezawa.

Unfortunately, we cannot rename the existing "rss" keyword used in
memory.stat to "anon_rss". We do, however, add "mapped_file" data and hope to
educate the end user through documentation.

[hugh.dickins@tiscali.co.uk: fix mem_cgroup_update_mapped_file_stat oops]
Signed-off-by: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Dhaval Giani
Cc: Daisuke Nishimura
Cc: YAMAMOTO Takashi
Cc: KOSAKI Motohiro
Cc: David Rientjes
Signed-off-by: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Jun, 2009
1 commit
-
When the file LRU lists are dominated by streaming IO pages, evict those
pages first, before considering evicting other pages.

This should be safe from deadlocks or performance problems
because only three things can happen to an inactive file page:

1) referenced twice and promoted to the active list
2) evicted by the pageout code
3) under IO, after which it will get evicted or promoted

The pages freed in this way can either be reused for streaming IO, or
allocated for something else. If the pages are used for streaming IO,
this pageout pattern continues. Otherwise, we will fall back to the
normal pageout pattern.
Signed-off-by: Rik van Riel
Reported-by: Elladan
Cc: KOSAKI Motohiro
Cc: Peter Zijlstra
Cc: Lee Schermerhorn
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 May, 2009
2 commits
-
Fix build warning, "mem_cgroup_is_obsolete defined but not used" when
CONFIG_DEBUG_VM is not set. Also avoid checking for !mem again and again.
Signed-off-by: Nikanth Karthikesan
Acked-by: Pekka Enberg
Acked-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
mapping->tree_lock can be acquired from interrupt context. Then,
the following deadlock can occur.

Assume "A" as a page.
CPU0:
lock_page_cgroup(A)
interrupted
-> take mapping->tree_lock.
CPU1:
take mapping->tree_lock
-> lock_page_cgroup(A)

This patch tries to fix the above deadlock by moving memcg's hook out of
mapping->tree_lock. charge/uncharge of pagecache/swapcache is protected
by page lock, not tree_lock.

After this patch, lock_page_cgroup() is not called under mapping->tree_lock.
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
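The deadlock above is a classic lock-order inversion, which can be checked mechanically. The Python sketch below is illustrative only: `has_lock_inversion()` is a hypothetical helper that looks at pairs of "lock held, lock taken next" and only detects two-lock cycles; the lock names are labels, not kernel objects.

```python
def has_lock_inversion(orders):
    """orders: (held_lock, lock_taken_next) pairs seen on any CPU."""
    seen = set(orders)
    # An inversion exists when some pair also appears in reverse.
    return any((b, a) in seen for (a, b) in seen if a != b)

before = [
    ("page_cgroup", "tree_lock"),  # CPU0: interrupt takes tree_lock
    ("tree_lock", "page_cgroup"),  # CPU1: memcg hook runs under tree_lock
]
after = [
    ("page_cgroup", "tree_lock"),  # only the interrupt case remains
]
```

Once the memcg hook no longer runs under tree_lock, the second ordering disappears and the remaining one cannot deadlock on its own.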
03 May, 2009
2 commits
-
Current mem_cgroup_shrink_usage() has two problems.
1. It doesn't call mem_cgroup_out_of_memory and doesn't update
last_oom_jiffies, so pagefault_out_of_memory invokes global OOM.

2. Considering hierarchy, shrinking has to be done from the
mem_over_limit, not from the memcg which the page would be charged to.

mem_cgroup_try_charge_swapin() does all of these things properly, so we
use it and call cancel_charge_swapin when it succeeds.

The name "shrink_usage" is not appropriate for this behavior, so we
change it too.
Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Dhaval Giani
Cc: Daisuke Nishimura
Cc: YAMAMOTO Takashi
Cc: KOSAKI Motohiro
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This is a bugfix for commit 3c776e64660028236313f0e54f3a9945764422df
("memcg: charge swapcache to proper memcg").

Used bit of swapcache is solid under page lock, but considering
move_account, pc->mem_cgroup is not.

We need lock_page_cgroup() anyway.
Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds