Doug / smarc-fsl-linux-kernel | Embedian Git Server

13 Mar, 2010

40 commits

93c59907c copy_signal() cleanup: clean thread_group_cputime_init() ... Browse Code »

Remove unneeded initializations in thread_group_cputime_init() and in
posix_cpu_timers_init_group(). They are useless after kmem_cache_zalloc()
was used in copy_signal().

Signed-off-by: Veaceslav Falico
Acked-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Veaceslav Falico
2010-03-13 07:52:39 +0800
4dd66e69d copy_signal() cleanup: kill taskstats_tgid_init() and acct_init_pacct() ... Browse Code »

Kill unused functions taskstats_tgid_init() and acct_init_pacct() because
we don't use them anywhere after using kmem_cache_zalloc() in
copy_signal().

Signed-off-by: Veaceslav Falico
Cc: Roland McGrath
Cc: Oleg Nesterov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Veaceslav Falico
2010-03-13 07:52:39 +0800
a56704ef6 copy_signal() cleanup: use zalloc and remove initializations ... Browse Code »

Use kmem_cache_zalloc() on signal creation and remove unneeded
initialization lines in copy_signal().

Signed-off-by: Veaceslav Falico
Acked-by: Oleg Nesterov
Cc: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Veaceslav Falico
2010-03-13 07:52:39 +0800
e34112e39 m32r: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't, which is consistent with all architectures using the
modern ptrace code.

The old code only disables the breakpoints on PTRACE_KILL, while after
this patch this also happens for PTRACE_CONT and PTRACE_SYSCALL which
matches the behaviour of the other architetures. I think this is a
bugfixes, but please double verify this is correct.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Hirokazu Takata
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
290ba3aef cris arch-v32: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

The way breakpoints are disabled is entirely inconsistent currently, I
tried to make some sense of it, but I suspect all of the content of
ptrace_disable should be moved into user_disable_single_step, this
defintively needs some revisting as the current patch changes behaviour in
not quite designed ways.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Mikael Starvik
Cc: Jesper Nilsson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
8313809ef cris arch-v10: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT and
PTRACE_KILL. This also makes PTRACE_SINGLESTEP return -EIO while it
previously succeeded despite not actually causing any kind of single
stepping.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Mikael Starvik
Cc: Jesper Nilsson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
6d75ca102 xtensa: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Chris Zankel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
1bd095083 um: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

XXX: I'm not sure arch_has_single_step() is placed in the exactly correct
location, please verify in which of the ptrace headers it should really
be.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Jeff Dike
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
55436c916 mips: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT and
PTRACE_KILL.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Acked-by: Ralf Baechle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
fa1ac57a3 microblaze: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT and
PTRACE_KILL. This also makes PTRACE_SINGLESTEP return -EIO while it
previously succeeded despite not actually causing any kind of single
stepping.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Acked-by: Michal Simek
Cc: John Williams
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:39 +0800
7a0fde8b3 m68knommu: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. m68knommu already defines the
nessecary user_enable_single_step and user_disable_single_step functions
for this.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Acked-by: Greg Ungerer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
857fb252a h8300: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Yoshinori Sato
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
1d8393171 avr32: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't which is consistent with all architectures using the
modern ptrace code.

Currently avr32 doesn't implement any code to disable single stepping when
one of the non-syscall requests is called which seems wrong, but I've left
it as-is for now.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Acked-by: Haavard Skinnemoen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
440e6ca79 arm: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't and the single stepping disable only happens if the
tracee process isn't a zombie yet, which is consistent with all
architectures using the modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Cc: Russell King
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
fd341abba alpha: use generic ptrace_resume code ... Browse Code »

Use the generic ptrace_resume code for PTRACE_SYSCALL, PTRACE_CONT,
PTRACE_KILL and PTRACE_SINGLESTEP. This implies defining
arch_has_single_step in and implementing the
user_enable_single_step and user_disable_single_step functions, which also
causes the breakpoint information to be cleared on fork, which could be
considered a bug fix.

Also the TIF_SYSCALL_TRACE thread flag is now cleared on PTRACE_KILL which
it previously wasn't, which is consistent with all architectures using the
modern ptrace code.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Acked-by: Matt Turner
Cc: Ivan Kokshaysky
Cc: Richard Henderson
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
dacbe41f7 ptrace: move user_enable_single_step & co prototypes to linux/ptrace.h ... Browse Code »

While in theory user_enable_single_step/user_disable_single_step/
user_enable_blockstep could also be provided as an inline or macro there's
no good reason to do so, and having the prototype in one places keeps code
size and confusion down.

Roland said:

The original thought there was that user_enable_single_step() et al
might well be only an instruction or three on a sane machine (as if we
have any of those!), and since there is only one call site inlining
would be beneficial. But I agree that there is no strong reason to care
about inlining it.

As to the arch changes, there is only one thought I'd add to the
record. It was always my thinking that for an arch where
PTRACE_SINGLESTEP does text-modifying breakpoint insertion,
user_enable_single_step() should not be provided. That is,
arch_has_single_step()=>true means that there is an arch facility with
"pure" semantics that does not have any unexpected side effects.
Inserting a breakpoint might do very unexpected strange things in
multi-threaded situations. Aside from that, it is a peculiar side
effect that user_{enable,disable}_single_step() should cause COW
de-sharing of text pages and so forth. For PTRACE_SINGLESTEP, all these
peculiarities are the status quo ante for that arch, so having
arch_ptrace() itself do those is one thing. But for building other
things in the future, it is nicer to have a uniform "pure" semantics
that arch-independent code can expect.

OTOH, all such arch issues are really up to the arch maintainer. As
of today, there is nothing but ptrace using user_enable_single_step() et
al so it's a distinction without a practical difference. If/when there
are other facilities that use user_enable_single_step() and might care,
the affected arch's can revisit the question when someone cares about
the quality of the arch support for said new facility.

Signed-off-by: Christoph Hellwig
Cc: Oleg Nesterov
Cc: Roland McGrath
Acked-by: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
b3c1e01a0 ptrace: use ptrace_request() in the remaining architectures ... Browse Code »

Use ptrace_request() in the three remaining architectures that didn't use it
(m68knommu, h8300, microblaze). This means:

- ptrace_request now handles PTRACE_{PEEK,POKE}{TEXT,DATA} and PTRACE_DETATCH
calls that were previously called directly, or in case of h8300 even open
coded.
- adds new support for PTRACE_SETOPTIONS/PTRACE_GETEVENTMSG/
PTRACE_GETSIGINFO/PTRACE_SETSIGINFO

Signed-off-by: Christoph Hellwig
Cc: Geert Uytterhoeven
Cc: Yoshinori Sato
Cc: Oleg Nesterov
Cc: Michal Simek
Acked-by: Greg Ungerer
Acked-by: Roland McGrath
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2010-03-13 07:52:38 +0800
7baab93f9 nodemask: fix the declaration of NODEMASK_ALLOC() ... Browse Code »

we can't declarate two variable at the same scope by NODEMASK_ALLOC().

This patch fixes it.

Signed-off-by: Miao Xie
Cc: David Rientjes
Cc: Lee Schermerhorn
Cc: Nick Piggin
Cc: Paul Menage
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Miao Xie
2010-03-13 07:52:38 +0800
a38374b8b memcg: update maintainer list ... Browse Code »

Nishimura-san have been working for memcg very good. His review and tests
give us much improvements and account migraiton which he is now
challenging is really important.

He is a stakeholder.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-13 07:52:38 +0800
867578cbc memcg: fix oom kill behavior ... Browse Code »

In current page-fault code,

handle_mm_fault()
-> ...
-> mem_cgroup_charge()
-> map page or handle error.
-> check return code.

If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
called. But if it's caused by memcg, OOM should have been already
invoked.

Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That
patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
page_fault_out_of_memory from being invoked in near future.

But Nishimura-san reported that check by jiffies is not enough when the
system is terribly heavy.

This patch changes memcg's oom logic as.
* If memcg causes OOM-kill, continue to retry.
* remove jiffies check which is used now.
* add memcg-oom-lock which works like perzone oom lock.
* If current is killed(as a process), bypass charge.

Something more sophisticated can be added but this pactch does
fundamental things.
TODO:
- add oom notifier
- add permemcg disable-oom-kill flag and freezer at oom.
- more chances for wake up oom waiter (when changing memory limit etc..)

Reviewed-by: Daisuke Nishimura
Tested-by: Daisuke Nishimura
Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: David Rientjes
Signed-off-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-13 07:52:38 +0800
0263c12c1 memcg: fix typos in memcg_test.txt ... Browse Code »

Signed-off-by: Kirill A. Shutemov
Reviewed-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:38 +0800
1e111452d memcg: update memcg_test.txt to describe memory thresholds ... Browse Code »

Decription of sanity check for memory thresholds.

Signed-off-by: Kirill A. Shutemov
Acked-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:38 +0800
1d8fd973a cgroups: add simple listener of cgroup events to documentation ... Browse Code »

An example of cgroup notification API usage.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
a0a4db548 cgroups: remove events before destroying subsystem state objects ... Browse Code »

Events should be removed after rmdir of cgroup directory, but before
destroying subsystem state objects. Let's take reference to cgroup
directory dentry to do that.

Signed-off-by: Kirill A. Shutemov
Acked-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Acked-by: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
4ab78683c cgroups: fix race between userspace and kernelspace ... Browse Code »

Notify userspace about cgroup removing only after rmdir of cgroup
directory to avoid race between userspace and kernelspace.

eventfd are used to notify about two types of event:
- control file-specific, like crossing memory threshold;
- cgroup removing.

To understand what really happen, userspace can check if the cgroup still
exists. To avoid race beetween userspace and kernelspace we have to
notify userspace about cgroup removing only after rmdir of cgroup
directory.

Signed-off-by: Kirill A. Shutemov
Reviewed-by: KAMEZAWA Hiroyuki
Cc: Paul Menage
Acked-by: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
daaf1e688 memcg: handle panic_on_oom=always case ... Browse Code »

Presently, if panic_on_oom=2, the whole system panics even if the oom
happend in some special situation (as cpuset, mempolicy....). Then,
panic_on_oom=2 means painc_on_oom_always.

Now, memcg doesn't check panic_on_oom flag. This patch adds a check.

BTW, how it's useful ?

kdump+panic_on_oom=2 is the last tool to investigate what happens in
oom-ed system. When a task is killed, the sysytem recovers and there will
be few hint to know what happnes. In mission critical system, oom should
never happen. Then, panic_on_oom=2+kdump is useful to avoid next OOM by
knowing precise information via snapshot.

TODO:
- For memcg, it's for isolate system's memory usage, oom-notiifer and
freeze_at_oom (or rest_at_oom) should be implemented. Then, management
daemon can do similar jobs (as kdump) or taking snapshot per cgroup.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: David Rientjes
Cc: Nick Piggin
Reviewed-by: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-13 07:52:37 +0800
1080d7a30 memcg: update memcg_test.txt ... Browse Code »

Update memcg_test.txt to describe how to test the move-charge feature.

Signed-off-by: Daisuke Nishimura
Acked-by: KAMEZAWA Hiroyuki
Acked-by: Balbir Singh
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:37 +0800
d2265e6fa memcg : share event counter rather than duplicate ... Browse Code »

Memcg has 2 eventcountes which counts "the same" event. Just usages are
different from each other. This patch tries to reduce event counter.

Now logic uses "only increment, no reset" counter and masks for each
checks. Softlimit chesk was done per 1000 evetns. So, the similar check
can be done by !(new_counter & 0x3ff). Threshold check was done per 100
events. So, the similar check can be done by (!new_counter & 0x7f)

ALL event checks are done right after EVENT percpu counter is updated.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Kirill A. Shutemov
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-13 07:52:37 +0800
430e48631 memcg: update threshold and softlimit at commit ... Browse Code »

Presently, move_task does "batched" precharge. Because res_counter or
css's refcnt are not-scalable jobs for memcg, try_charge_().. tend to be
done in batched manner if allowed.

Now, softlimit and threshold check their event counter in try_charge, but
the charge is not a per-page event. And event counter is not updated at
charge(). Moreover, precharge doesn't pass "page" to try_charge() and
softlimit tree will be never updated until uncharge() causes an event."

So the best place to check the event counter is commit_charge(). This is
per-page event by its nature. This patch move checks to there.

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Kirill A. Shutemov
Cc: Balbir Singh
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-13 07:52:37 +0800
c62b1a3b3 memcg: use generic percpu instead of private implementation ... Browse Code »

When per-cpu counter for memcg was implemneted, dynamic percpu allocator
was not very good. But now, we have good one and useful macros. This
patch replaces memcg's private percpu counter implementation with generic
dynamic percpu allocator.

The benefits are
- We can remove private implementation.
- The counters will be NUMA-aware. (Current one is not...)
- This patch makes sizeof struct mem_cgroup smaller. Then,
struct mem_cgroup may be fit in page size on small config.
- About basic performance aspects, see below.

[Before]
# size mm/memcontrol.o
text data bss dec hex filename
24373 2528 4132 31033 7939 mm/memcontrol.o

[page-fault-throuput test on 8cpu/SMP in root cgroup]
# /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

Performance counter stats for './multi-fault-fork 8' (5 runs):

45878618 page-faults ( +- 0.110% )
602635826 cache-misses ( +- 0.105% )

61.005373262 seconds time elapsed ( +- 0.004% )

Then cache-miss/page fault = 13.14

[After]
#size mm/memcontrol.o
text data bss dec hex filename
23913 2528 4132 30573 776d mm/memcontrol.o
# /root/bin/perf stat -a -e page-faults,cache-misses --repeat 5 ./multi-fault-fork 8

Performance counter stats for './multi-fault-fork 8' (5 runs):

48179400 page-faults ( +- 0.271% )
588628407 cache-misses ( +- 0.136% )

61.004615021 seconds time elapsed ( +- 0.004% )

Then cache-miss/page fault = 12.22

Text size is reduced.
This performance improvement is not big and will be invisible in real world
applications. But this result shows this patch has some good effect even
on (small) SMP.

Here is a test program I used.

1. fork() processes on each cpus.
2. do page fault repeatedly on each process.
3. after 60secs, kill all childredn and exit.

(3 is necessary for getting stable data, this is improvement from previous one.)

#define _GNU_SOURCE
#include
#include
#include
#include
#include
#include
#include
#include

/*
* For avoiding contention in page table lock, FAULT area is
* sparse. If FAULT_LENGTH is too large for your cpus, decrease it.
*/
#define FAULT_LENGTH (2 * 1024 * 1024)
#define PAGE_SIZE 4096
#define MAXNUM (128)

void alarm_handler(int sig)
{
}

void *worker(int cpu, int ppid)
{
void *start, *end;
char *c;
cpu_set_t set;
int i;

CPU_ZERO(&set);
CPU_SET(cpu, &set);
sched_setaffinity(0, sizeof(set), &set);

start = mmap(NULL, FAULT_LENGTH, PROT_READ|PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
if (start == MAP_FAILED) {
perror("mmap");
exit(1);
}
end = start + FAULT_LENGTH;

pause();
//fprintf(stderr, "run%d", cpu);
while (1) {
for (c = (char*)start; (void *)c < end; c += PAGE_SIZE)
*c = 0;
madvise(start, FAULT_LENGTH, MADV_DONTNEED);
}
return NULL;
}

int main(int argc, char *argv[])
{
int num, i, ret, pid, status;
int pids[MAXNUM];

if (argc < 2)
return 0;

setpgid(0, 0);
signal(SIGALRM, alarm_handler);
num = atoi(argv[1]);
pid = getpid();

for (i = 0; i < num; ++i) {
ret = fork();
if (!ret) {
worker(i, pid);
exit(0);
}
pids[i] = ret;
}
sleep(1);
kill(-pid, SIGALRM);
sleep(60);
for (i = 0; i < num; i++)
kill(pids[i], SIGKILL);
for (i = 0; i < num; i++)
waitpid(pids[i], &status, 0);
return 0;
}

Signed-off-by: KAMEZAWA Hiroyuki
Cc: Daisuke Nishimura
Cc: Balbir Singh
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

KAMEZAWA Hiroyuki
2010-03-13 07:52:37 +0800
6a6135b64 memcg: typo in comment to mem_cgroup_print_oom_info() ... Browse Code »

s/mem_cgroup_print_mem_info/mem_cgroup_print_oom_info/

Signed-off-by: Kirill A. Shutemov
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
2e72b6347 memcg: implement memory thresholds ... Browse Code »

It allows to register multiple memory and memsw thresholds and gets
notifications when it crosses.

To register a threshold application need:
- create an eventfd;
- open memory.usage_in_bytes or memory.memsw.usage_in_bytes;
- write string like " " to
cgroup.event_control.

Application will be notified through eventfd when memory usage crosses
threshold in any direction.

It's applicable for root and non-root cgroup.

It uses stats to track memory usage, simmilar to soft limits. It checks
if we need to send event to userspace on every 100 page in/out. I guess
it's good compromise between performance and accuracy of thresholds.

[akpm@linux-foundation.org: coding-style fixes]
[nishimura@mxp.nes.nec.co.jp: fix documentation merge issue]
Signed-off-by: Kirill A. Shutemov
Cc: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Vladislav Buzov
Cc: Daisuke Nishimura
Cc: Alexander Shishkin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
378ce724b memcg: rework usage of stats by soft limit ... Browse Code »

Instead of incrementing counter on each page in/out and comparing it with
constant, we set counter to constant, decrement counter on each page
in/out and compare it with zero. We want to make comparing as fast as
possible. On many RISC systems (probably not only RISC) comparing with
zero is more effective than comparing with a constant, since not every
constant can be immediate operand for compare instruction.

Also, I've renamed MEM_CGROUP_STAT_EVENTS to MEM_CGROUP_STAT_SOFTLIMIT,
since really it's not a generic counter.

Signed-off-by: Kirill A. Shutemov
Cc: Li Zefan
Cc: KAMEZAWA Hiroyuki
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Vladislav Buzov
Cc: Daisuke Nishimura
Cc: Alexander Shishkin
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
104f39284 memcg: extract mem_group_usage() from mem_cgroup_read() ... Browse Code »

Helper to get memory or mem+swap usage of the cgroup.

Signed-off-by: Kirill A. Shutemov
Acked-by: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Vladislav Buzov
Cc: Daisuke Nishimura
Cc: Alexander Shishkin
Cc: Davide Libenzi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
0dea11687 cgroup: implement eventfd-based generic API for notifications ... Browse Code »

This patchset introduces eventfd-based API for notifications in cgroups
and implements memory notifications on top of it.

It uses statistics in memory controler to track memory usage.

Output of time(1) on building kernel on tmpfs:

Root cgroup before changes:
make -j2 506.37 user 60.93s system 193% cpu 4:52.77 total
Non-root cgroup before changes:
make -j2 507.14 user 62.66s system 193% cpu 4:54.74 total
Root cgroup after changes (0 thresholds):
make -j2 507.13 user 62.20s system 193% cpu 4:53.55 total
Non-root cgroup after changes (0 thresholds):
make -j2 507.70 user 64.20s system 193% cpu 4:55.70 total
Root cgroup after changes (1 thresholds, never crossed):
make -j2 506.97 user 62.20s system 193% cpu 4:53.90 total
Non-root cgroup after changes (1 thresholds, never crossed):
make -j2 507.55 user 64.08s system 193% cpu 4:55.63 total

This patch:

Introduce the write-only file "cgroup.event_control" in every cgroup.

To register new notification handler you need:
- create an eventfd;
- open a control file to be monitored. Callbacks register_event() and
unregister_event() must be defined for the control file;
- write " " to cgroup.event_control.
Interpretation of args is defined by control file implementation;

eventfd will be woken up by control file implementation or when the
cgroup is removed.

To unregister notification handler just close eventfd.

If you need notification functionality for a control file you have to
implement callbacks register_event() and unregister_event() in the
struct cftype.

[kamezawa.hiroyu@jp.fujitsu.com: Kconfig fix]
Signed-off-by: Kirill A. Shutemov
Reviewed-by: KAMEZAWA Hiroyuki
Paul Menage
Cc: Li Zefan
Cc: Balbir Singh
Cc: Pavel Emelyanov
Cc: Dan Malek
Cc: Vladislav Buzov
Cc: Daisuke Nishimura
Cc: Alexander Shishkin
Cc: Davide Libenzi
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kirill A. Shutemov
2010-03-13 07:52:37 +0800
483c30b51 memcg: improve performance in moving swap charge ... Browse Code »

Try to reduce overheads in moving swap charge by:

- Adds a new function(__mem_cgroup_put), which takes "count" as a arg and
decrement mem->refcnt by "count".
- Removed res_counter_uncharge, css_put, and mem_cgroup_put from the path
of moving swap account, and consolidate all of them into mem_cgroup_clear_mc.
We cannot do that about mc.to->refcnt.

These changes reduces the overhead from 1.35sec to 0.9sec to move charges
of 1G anonymous memory(including 500MB swap) in my test environment.

Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:36 +0800
024914477 memcg: move charges of anonymous swap ... Browse Code »

This patch is another core part of this move-charge-at-task-migration
feature. It enables moving charges of anonymous swaps.

To move the charge of swap, we need to exchange swap_cgroup's record.

In current implementation, swap_cgroup's record is protected by:

- page lock: if the entry is on swap cache.
- swap_lock: if the entry is not on swap cache.

This works well in usual swap-in/out activity.

But this behavior make the feature of moving swap charge check many
conditions to exchange swap_cgroup's record safely.

So I changed modification of swap_cgroup's recored(swap_cgroup_record())
to use xchg, and define a new function to cmpxchg swap_cgroup's record.

This patch also enables moving charge of non pte_present but not uncharged
swap caches, which can be exist on swap-out path, by getting the target
pages via find_get_page() as do_mincore() does.

[kosaki.motohiro@jp.fujitsu.com: fix ia64 build]
[akpm@linux-foundation.org: fix typos]
Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Daisuke Nishimura
Signed-off-by: KOSAKI Motohiro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:36 +0800
8033b97c9 memcg: avoid oom during moving charge ... Browse Code »

This move-charge-at-task-migration feature has extra charges on
"to"(pre-charges) and "from"(left-over charges) during moving charge.
This means unnecessary oom can happen.

This patch tries to avoid such oom.

Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:36 +0800
854ffa8d1 memcg: improve performance in moving charge ... Browse Code »

Try to reduce overheads in moving charge by:

- Instead of calling res_counter_uncharge() against the old cgroup in
__mem_cgroup_move_account() everytime, call res_counter_uncharge() at the end
of task migration once.
- removed css_get(&to->css) from __mem_cgroup_move_account() because callers
should have already called css_get(). And removed css_put(&to->css) too,
which was called by callers of move_account on success of move_account.
- Instead of calling __mem_cgroup_try_charge(), i.e. res_counter_charge(),
repeatedly, call res_counter_charge(PAGE_SIZE * count) in can_attach() if
possible.
- Instead of calling css_get()/css_put() repeatedly, make use of coalesce
__css_get()/__css_put() if possible.

These changes reduces the overhead from 1.7sec to 0.6sec to move charges
of 1G anonymous memory in my test environment.

Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:36 +0800
4ffef5fef memcg: move charges of anonymous page ... Browse Code »

This patch is the core part of this move-charge-at-task-migration feature.
It implements functions to move charges of anonymous pages mapped only by
the target task.

Implementation:
- define struct move_charge_struct and a valuable of it(mc) to remember the
count of pre-charges and other information.
- At can_attach(), get anon_rss of the target mm, call __mem_cgroup_try_charge()
repeatedly and count up mc.precharge.
- At attach(), parse the page table, find a target page to be move, and call
mem_cgroup_move_account() about the page.
- Cancel all precharges if mc.precharge > 0 on failure or at the end of
task move.

[akpm@linux-foundation.org: a little simplification]
Signed-off-by: Daisuke Nishimura
Cc: Balbir Singh
Acked-by: KAMEZAWA Hiroyuki
Cc: Li Zefan
Cc: Paul Menage
Cc: Daisuke Nishimura
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Daisuke Nishimura
2010-03-13 07:52:36 +0800