24 Feb, 2013

1 commit

  • Currently, when a memcg oom occurs, the oom dump messages still report
    global state and provide little useful information for users. This patch
    prints more targeted memcg page statistics for the memcg oom and takes the
    hierarchy into consideration:

    Based on Michal's advice, we take the hierarchy into consideration: suppose
    we trigger an OOM on A's limit

        root_memcg
            |
            A (use_hierarchy=1)
           / \
          B   C
          |
          D

    then the printed info will be:

    Memory cgroup stats for /A:...
    Memory cgroup stats for /A/B:...
    Memory cgroup stats for /A/C:...
    Memory cgroup stats for /A/B/D:...

    Following are samples of oom output:

    (1) Before change:

    mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
    mal-80 cpuset=/ mems_allowed=0
    Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
    Call Trace:
    [] dump_header+0x83/0x1ca
    ..... (call trace)
    [] page_fault+0x28/0x30
    <<<<<<<<<<<<<<<<<<<<< memcg specific information
    Task in /A/B/D killed as a result of limit of /A
    memory: usage 101376kB, limit 101376kB, failcnt 57
    memory+swap: usage 101376kB, limit 101376kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
    <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
    Mem-Info:
    Node 0 DMA per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    ......
    CPU 3: hi: 0, btch: 1 usd: 0
    Node 0 DMA32 per-cpu:
    CPU 0: hi: 186, btch: 31 usd: 173
    ......
    CPU 3: hi: 186, btch: 31 usd: 130
    <<<<<<<<<<<<<<<<<<<<< print global page state
    active_anon:92963 inactive_anon:40777 isolated_anon:0
    active_file:33027 inactive_file:51718 isolated_file:0
    unevictable:0 dirty:3 writeback:0 unstable:0
    free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
    mapped:20278 shmem:35971 pagetables:5885 bounce:0
    free_cma:0
    <<<<<<<<<<<<<<<<<<<<< print per zone page state
    Node 0 DMA free:15836kB ... all_unreclaimable? no
    lowmem_reserve[]: 0 3175 3899 3899
    Node 0 DMA32 free:2888564kB ... all_unreclaimable? no
    lowmem_reserve[]: 0 0 724 724
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
    Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
    120710 total pagecache pages
    0 pages in swap cache
    <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 499708kB
    Total swap = 499708kB
    1040368 pages RAM
    58678 pages reserved
    169065 pages shared
    173632 pages non-shared
    [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
    [ 2693] 0 2693 6005 1324 17 0 0 god
    [ 2754] 0 2754 6003 1320 16 0 0 god
    [ 2811] 0 2811 5992 1304 18 0 0 god
    [ 2874] 0 2874 6005 1323 18 0 0 god
    [ 2935] 0 2935 8720 7742 21 0 0 mal-30
    [ 2976] 0 2976 21520 17577 42 0 0 mal-80
    Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
    Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB

    We can see that the messages dumped by show_free_areas() are lengthy and
    provide very limited information about the memcg that just hit the oom.

    (2) After change
    mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
    mal-80 cpuset=/ mems_allowed=0
    Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
    Call Trace:
    [] dump_header+0x83/0x1d1
    .......(call trace)
    [] page_fault+0x28/0x30
    Task in /A/B/D killed as a result of limit of /A
    <<<<<<<<<<<<<<<<<<<<< memcg specific information
    memory: usage 102400kB, limit 102400kB, failcnt 140
    memory+swap: usage 102400kB, limit 102400kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
    Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
    [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
    [ 2260] 0 2260 6006 1325 18 0 0 god
    [ 2383] 0 2383 6003 1319 17 0 0 god
    [ 2503] 0 2503 6004 1321 18 0 0 god
    [ 2622] 0 2622 6004 1321 16 0 0 god
    [ 2695] 0 2695 8720 7741 22 0 0 mal-30
    [ 2704] 0 2704 21520 17839 43 0 0 mal-80
    Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
    Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB

    This version provides more targeted memcg information in the "Memory cgroup
    stats for XXX" section.

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     

13 Dec, 2012

3 commits

  • out_of_memory() will already cause current to schedule if it has not been
    killed, so doing it again in pagefault_out_of_memory() is redundant.
    Remove it.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • To lock the entire system from parallel oom killing, it's possible to pass
    in a zonelist with all zones rather than using for_each_populated_zone()
    for the iteration. This obsoletes try_set_system_oom() and
    clear_system_oom() so that they can be removed.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

3 commits

  • test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
    specify that current should be killed first if an oom condition occurs in
    between the two calls.

    The usage is

    short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
    ...
    compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);

    to store the thread's oom_score_adj, temporarily change it to the maximum
    score possible, and then restore the old value if it is still the same.

    This happens to still be racy, however, if the user writes
    OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
    The compare_swap_oom_score_adj() will then incorrectly reset the old value
    prior to the write of OOM_SCORE_ADJ_MAX.

    To fix this, introduce a new oom_flags_t member in struct signal_struct
    that will be used for per-thread oom killer flags. KSM and swapoff can
    now use a bit in this member to specify that threads should be killed
    first in oom conditions without playing around with oom_score_adj.

    This also allows the correct oom_score_adj to always be shown when reading
    /proc/pid/oom_score.
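
    As an illustration of the flag-based approach, here is a minimal userspace
    sketch (not the kernel code; the struct, the OOM_FLAG_ORIGIN bit and the
    helpers below are assumptions made up for this example): setting and
    clearing a dedicated bit never touches oom_score_adj, so a concurrent
    userspace write can no longer be clobbered.

    #include <stdio.h>

    struct signal_like {                /* stand-in for struct signal_struct */
            short oom_score_adj;        /* value userspace reads and writes */
            unsigned int oom_flags;     /* kernel-private per-process flags */
    };

    #define OOM_FLAG_ORIGIN (1U << 0)   /* hypothetical "kill me first" bit */

    static void set_oom_origin(struct signal_like *s)
    {
            s->oom_flags |= OOM_FLAG_ORIGIN;
    }

    static void clear_oom_origin(struct signal_like *s)
    {
            s->oom_flags &= ~OOM_FLAG_ORIGIN;
    }

    static int oom_task_origin(const struct signal_like *s)
    {
            return s->oom_flags & OOM_FLAG_ORIGIN;
    }

    int main(void)
    {
            struct signal_like sig = { .oom_score_adj = 0, .oom_flags = 0 };

            set_oom_origin(&sig);        /* e.g. around a KSM or swapoff operation */
            sig.oom_score_adj = 250;     /* userspace write in the critical window */
            clear_oom_origin(&sig);      /* flag dropped; oom_score_adj stays 250 */

            printf("oom_score_adj=%d origin=%d\n",
                   sig.oom_score_adj, oom_task_origin(&sig));
            return 0;
    }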

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
    so this range can be represented by the signed short type with no
    functional change. The extra space this frees up in struct signal_struct
    will be used for per-thread oom kill flags in the next patch.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Exiting threads, those with PF_EXITING set, can pagefault and require
    memory before they can make forward progress. This happens, for instance,
    when a process must fault task->robust_list, a userspace structure, before
    detaching its memory.

    These threads also aren't guaranteed to get access to memory reserves
    unless they are oom killed or killed from userspace. The oom killer won't
    grant memory reserves if threads other than current are also exiting and
    stalling at the same point; this prevents needlessly killing processes
    when others are already exiting.

    Instead of special casing all the possible situations between PF_EXITING
    getting set and a thread detaching its mm where it may allocate memory,
    which probably wouldn't get updated when a change is made to the exit
    path, the solution is to give all exiting threads access to memory
    reserves if they call the oom killer. This allows them to quickly
    allocate, detach their mm, and free the memory it represents.

    Summary of Luigi's bug report:

    : He had an oom condition where threads were faulting on task->robust_list
    : and repeatedly called the oom killer but it would defer killing a thread
    : because it saw other PF_EXITING threads. This can happen anytime we need
    : to allocate memory after setting PF_EXITING and before detaching our mm;
    : if there are other threads in the same state then the oom killer won't do
    : anything unless one of them happens to be killed from userspace.
    :
    : So instead of only deferring for PF_EXITING and !task->robust_list, it's
    : better to just give them access to memory reserves to prevent a potential
    : livelock so that any other faults that may be introduced in the future in
    : the exit path don't cause the same problem (and hopefully we don't allow
    : too many of those!).

    Signed-off-by: David Rientjes
    Acked-by: Minchan Kim
    Tested-by: Luigi Semenzato
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

1 commit


01 Aug, 2012

8 commits

  • By globally defining check_panic_on_oom(), the memcg oom handler can be
    moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the
    oom killer and cleans up the code.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
    try to reduce the amount of time the readside is held for oom kills. This
    makes the interface with the memcg oom handler more consistent since it
    now never needs to take tasklist_lock unnecessarily.

    The only time the oom killer now takes tasklist_lock is when iterating the
    children of the selected task, everything else is protected by
    rcu_read_lock().

    This requires that a reference to the selected process, p, is grabbed
    before calling oom_kill_process(). It may release it and grab a reference
    on another one of p's threads if !p->mm, but it also guarantees that it
    will release the reference before returning.

    [hughd@google.com: fix duplicate put_task_struct()]
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The global oom killer is serialized by the per-zonelist
    try_set_zonelist_oom() which is used in the page allocator. Concurrent
    oom kills are thus a rare event and only occur in systems using
    mempolicies and with a large number of nodes.

    Memory controller oom kills, however, can frequently be concurrent since
    there is no serialization once the oom killer is called for oom conditions
    in several different memcgs in parallel.

    This creates a massive contention on tasklist_lock since the oom killer
    requires the readside for the tasklist iteration. If several memcgs are
    calling the oom killer, this lock can be held for a substantial amount of
    time, especially if threads continue to enter it as other threads are
    exiting.

    Since the exit path grabs the writeside of the lock with irqs disabled in
    a few different places, this can cause a soft lockup on cpus as a result
    of tasklist_lock starvation.

    The kernel lacks unfair writelocks, and successful calls to the oom killer
    usually result in at least one thread entering the exit path, so an
    alternative solution is needed.

    This patch introduces a separate oom handler for memcgs so that they do
    not require tasklist_lock for as much time. Instead, it iterates only
    over the threads attached to the oom memcg and grabs a reference to the
    selected thread before calling oom_kill_process() to ensure it doesn't
    prematurely exit.

    This still requires tasklist_lock for the tasklist dump, iterating
    children of the selected process, and killing all other threads on the
    system sharing the same memory as the selected victim. So while this
    isn't a complete solution to tasklist_lock starvation, it significantly
    reduces the amount of time that it is held.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch introduces a helper function to process each thread during the
    iteration over the tasklist. A new return type, enum oom_scan_t, is
    defined to determine the future behavior of the iteration:

    - OOM_SCAN_OK: continue scanning the thread and find its badness,

    - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
    ineligible,

    - OOM_SCAN_ABORT: abort the iteration and return, or

    - OOM_SCAN_SELECT: always select this thread with the highest badness
    possible.

    There is no functional change with this patch. This new helper function
    will be used in the next patch in the memory controller.
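
    A minimal, self-contained sketch of the control flow these values describe
    (toy code: only the enum names come from the changelog above; the
    scan_task() helper and its inputs are invented for illustration and are
    not the kernel's actual scan helper):

    #include <stdbool.h>
    #include <stdio.h>

    enum oom_scan_t {
            OOM_SCAN_OK,        /* keep scanning and compute this thread's badness */
            OOM_SCAN_CONTINUE,  /* ineligible for oom kill, skip it */
            OOM_SCAN_ABORT,     /* abort the iteration and return */
            OOM_SCAN_SELECT,    /* select this thread with the highest badness possible */
    };

    struct task_info {
            bool eligible;       /* may this thread be oom killed at all? */
            bool already_dying;  /* a previously selected victim is still exiting */
            bool prefer_kill;    /* marked as a preferred victim */
    };

    static enum oom_scan_t scan_task(const struct task_info *t)
    {
            if (!t->eligible)
                    return OOM_SCAN_CONTINUE;
            if (t->already_dying)
                    return OOM_SCAN_ABORT;
            if (t->prefer_kill)
                    return OOM_SCAN_SELECT;
            return OOM_SCAN_OK;
    }

    int main(void)
    {
            struct task_info t = { .eligible = true };
            printf("decision=%d\n", scan_task(&t));   /* 0: OOM_SCAN_OK */
            return 0;
    }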

    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The number of ptes and swap entries are used in the oom killer's badness
    heuristic, so they should be shown in the tasklist dump.

    This patch adds those fields and replaces the cpu and oom_adj values that
    are currently emitted. Cpu isn't interesting, and oom_adj is deprecated and
    will be removed later this year; the same information is already displayed
    as oom_score_adj, which is used internally.

    At the same time, make the documentation a little more clear to state this
    information is helpful to determine why the oom killer chose the task it
    did to kill.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • When /proc/sys/vm/oom_kill_allocating_task is enabled, the oom killer
    immediately kills current when it is invoked, to avoid a potentially
    expensive tasklist scan on large systems.

    Currently, however, it is not checking current's oom_score_adj value which
    may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
    killing.

    This patch avoids killing current in such a condition and simply falls
    back to the tasklist scan since memory still needs to be freed.
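
    The decision can be sketched as follows (illustrative toy code under
    assumptions, not the kernel implementation; the oom_choose() helper is
    invented for this example):

    #include <stdbool.h>
    #include <stdio.h>

    #define OOM_SCORE_ADJ_MIN (-1000)

    /* Decide what to do when the oom killer is invoked. */
    static const char *oom_choose(bool kill_allocating_task, int current_oom_score_adj)
    {
            if (kill_allocating_task && current_oom_score_adj != OOM_SCORE_ADJ_MIN)
                    return "kill current (the allocating task)";
            return "fall back to the full tasklist scan";
    }

    int main(void)
    {
            printf("%s\n", oom_choose(true, OOM_SCORE_ADJ_MIN));  /* current is oom-disabled */
            printf("%s\n", oom_choose(true, 0));
            return 0;
    }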

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The oom killer currently schedules away from current in an uninterruptible
    sleep if it does not have access to memory reserves. It's possible that
    current was killed because it shares memory with the oom killed thread or
    because it was killed by the user in the interim, however.

    This patch only schedules away from current if it does not have a pending
    kill, i.e. if it does not share memory with the oom killed thread. It's
    possible that it will immediately retry its memory allocation and fail,
    but it will immediately be given access to memory reserves if it calls the
    oom killer again.

    This prevents the delay of memory freeing when threads that share memory
    with the oom killed thread get unnecessarily scheduled.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jun, 2012

2 commits

  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The divide in p->signal->oom_score_adj * totalpages / 1000 within
    oom_badness() was causing an overflow of the signed long data type.

    This adds both the root bias and p->signal->oom_score_adj before doing the
    normalization which fixes the issue and also cleans up the calculation.

    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Jun, 2012

1 commit

  • If the privileges given to root threads (3% of allowable memory) or a
    negative value of /proc/pid/oom_score_adj happen to exceed the amount of
    rss of a thread, its badness score overflows as a result of commit
    a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only
    for userspace").

    Fix this by making the type signed and return 1, meaning the thread is
    still eligible for kill, if the value is negative.

    Reported-by: Dave Jones
    Acked-by: Oleg Nesterov
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 May, 2012

1 commit

  • The oom_score_adj scale ranges from -1000 to 1000 and represents the
    proportion of memory available to the process at allocation time. This
    means an oom_score_adj value of 300, for example, will bias a process as
    though it was using an extra 30.0% of available memory and a value of
    -350 will discount 35.0% of available memory from its usage.

    The oom killer badness heuristic also uses this scale to report the oom
    score for each eligible process in determining the "best" process to
    kill. Thus, it can only differentiate each process's memory usage by
    0.1% of system RAM.

    On large systems, this can end up being a large amount of memory: 256MB
    on 256GB systems, for example.

    This can be fixed by having the badness heuristic use the actual
    memory usage in scoring threads and then normalizing it to the
    oom_score_adj scale for userspace. This results in better comparison
    between eligible threads for kill and no change from the userspace
    perspective.
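
    A toy arithmetic sketch of the idea (the numbers and layout below are
    illustrative assumptions, not the kernel's oom_badness()):

    #include <stdio.h>

    #define OOM_SCORE_ADJ_MAX 1000

    int main(void)
    {
            unsigned long totalpages = 64UL * 1024 * 1024;  /* 256GB of 4KB pages */
            unsigned long task_pages = 1UL * 1024 * 1024;   /* task using 4GB */
            long long oom_score_adj  = 300;                 /* +30.0% bias from userspace */

            /* Score internally with actual page counts plus the scaled bias... */
            long long points = task_pages +
                               oom_score_adj * (long long)totalpages / OOM_SCORE_ADJ_MAX;

            /* ...and only normalize back to the 0..1000 scale for userspace. */
            long long oom_score = points * OOM_SCORE_ADJ_MAX / (long long)totalpages;

            printf("points=%lld pages, reported oom_score=%lld\n", points, oom_score);
            return 0;
    }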

    Suggested-by: KOSAKI Motohiro
    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 May, 2012

1 commit


24 Mar, 2012

1 commit

  • Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
    of force_sig(SIGKILL). With the recent changes we do not need force_ to
    kill the CLONE_NEWPID tasks.

    And this is more correct. force_sig() can race with the exiting thread
    even if oom_kill_task() checks p->mm != NULL, while
    do_send_sig_info(group => true) kills the whole process.

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

22 Mar, 2012

6 commits

  • The oom killer typically displays the allocation order at the time of oom
    as a part of its diagnostic messages (for global, cpuset, and mempolicy
    ooms).

    The memory controller may also pass the charge order to the oom killer so
    it can emit the same information. This is useful in determining how large
    the memory allocation is that triggered the oom killer.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The oom killer chooses not to kill a thread if:

    - an eligible thread has already been oom killed and has yet to exit,
    and

    - an eligible thread is exiting but has yet to free all its memory and
    is not the thread attempting to currently allocate memory.

    SysRq+F manually invokes the global oom killer to kill a memory-hogging
    task. This is normally done as a last resort to free memory when no
    progress is being made or to test the oom killer itself.

    For both uses, we always want to kill a thread and never defer. This
    patch causes SysRq+F to always kill an eligible thread and can be used to
    force a kill even if another oom killed thread has failed to exit.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Acked-by: Pekka Enberg
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • printk_ratelimit() uses the global ratelimit state for all printks. The
    oom killer should not be subjected to this state just because another
    subsystem or driver may be flooding the kernel log.

    This patch introduces printk ratelimiting specifically for the oom killer.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If a thread is chosen for oom kill and is already PF_EXITING, then the oom
    killer simply sets TIF_MEMDIE and returns. This allows the thread to have
    access to memory reserves so that it may quickly exit. This logic is
    preceded by a comment saying there's no need to alarm the sysadmin.
    This patch adds truth to that statement.

    There's no need to emit any warning about the oom condition if the thread
    is already exiting since it will not be killed. In this condition, just
    silently return from the oom killer since it's only giving access to memory
    reserves and is otherwise a no-op.

    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • oom_kill_task() has a single caller, so fold it into its parent function,
    oom_kill_process(). Slightly reduces the number of lines in the oom
    killer.

    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • oom_kill_task() returns non-zero iff the chosen process does not have any
    threads with an attached ->mm.

    In such a case, it's better to just return to the page allocator and retry
    the allocation because memory could have been freed in the interim and the
    oom condition may no longer exist. It's unnecessary to loop in the oom
    killer and find another thread to kill.

    This allows both oom_kill_task() and oom_kill_process() to be converted to
    void functions. If the oom condition persists, the oom killer will be
    recalled.

    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

13 Jan, 2012

2 commits


11 Jan, 2012

1 commit

  • oom_score_adj is used to guard processes from the OOM killer. One problem
    is that it's inherited at fork(). When a daemon sets oom_score_adj and
    creates children, it's hard to know where the value was set.

    This patch adds three tracepoints useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable the relevant trace points. Filtering may be
    useful, for example:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

22 Dec, 2011

1 commit

  • * master: (848 commits)
    SELinux: Fix RCU deref check warning in sel_netport_insert()
    binary_sysctl(): fix memory leak
    mm/vmalloc.c: remove static declaration of va from __get_vm_area_node
    ipmi_watchdog: restore settings when BMC reset
    oom: fix integer overflow of points in oom_badness
    memcg: keep root group unchanged if creation fails
    nilfs2: potential integer overflow in nilfs_ioctl_clean_segments()
    nilfs2: unbreak compat ioctl
    cpusets: stall when updating mems_allowed for mempolicy or disjoint nodemask
    evm: prevent racing during tfm allocation
    evm: key must be set once during initialization
    mmc: vub300: fix type of firmware_rom_wait_states module parameter
    Revert "mmc: enable runtime PM by default"
    mmc: sdhci: remove "state" argument from sdhci_suspend_host
    x86, dumpstack: Fix code bytes breakage due to missing KERN_CONT
    IB/qib: Correct sense on freectxts increment and decrement
    RDMA/cma: Verify private data length
    cgroups: fix a css_set not found bug in cgroup_attach_proc
    oprofile: Fix uninitialized memory access when writing to writing to oprofilefs
    Revert "xen/pv-on-hvm kexec: add xs_reset_watches to shutdown watches from old kernel"
    ...

    Conflicts:
    kernel/cgroup_freezer.c

    Rafael J. Wysocki
     

21 Dec, 2011

1 commit

  • An integer overflow will happen on 64-bit archs if a task's sum of rss,
    swapents and nr_ptes exceeds the value (2^31)/1000. This was introduced by
    commit

    f755a04 oom: use pte pages in OOM score

    where the oom score computation was divided into several steps and is no
    longer computed as one expression in unsigned long (rss, swapents and
    nr_ptes are unsigned long), while the result value assigned to points (an
    int) is in the range (1..1000). So there could be an int overflow while
    computing

    176 points *= 1000;

    and points may end up negative, meaning the oom score for a memory-hog task
    will be one.

    196 if (points <= 0)
            return 1;
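
    The effect can be reproduced in a few lines of self-contained C (the page
    counts below are made up; this is a demonstration of the truncation and
    overflow described above, not the kernel code):

    #include <stdio.h>

    int main(void)
    {
            /* Sum of rss + swapents + nr_ptes in pages. Anything above
             * (2^31)/1000 ~= 2.1 million pages (about 8GB with 4KB pages)
             * overflows once multiplied by 1000 in a 32-bit int. */
            unsigned long rss = 3UL * 1024 * 1024, swapents = 0, nr_ptes = 0;

            int points = rss + swapents + nr_ptes;  /* 3,145,728 still fits in an int */
            points *= 1000;                         /* exceeds INT_MAX, wraps around */

            printf("points = %d\n", points);        /* typically a large negative value */
            return 0;
    }
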
    Acked-by: KOSAKI Motohiro
    Acked-by: Oleg Nesterov
    Acked-by: David Rientjes
    Cc: [2.6.36+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Frantisek Hrbata
     

22 Nov, 2011

1 commit

  • thaw_process() now has only internal users - system and cgroup
    freezers. Remove the unnecessary return value, rename, unexport and
    collapse __thaw_process() into it. This will help further updates to
    the freezer code.

    -v3: oom_kill grew a use of thaw_process() while this patch was
    pending. Convert it to use __thaw_task() for now. In the longer
    term, this should be handled by allowing tasks to die if killed
    even if they're frozen.

    -v2: minor style update as suggested by Matt.

    Signed-off-by: Tejun Heo
    Cc: Paul Menage
    Cc: Matt Helsley

    Tejun Heo
     

16 Nov, 2011

1 commit

  • Commit c9f01245 ("oom: remove oom_disable_count") has removed the
    oom_disable_count counter which has been used for early break out from
    oom_badness so we could never select a task with oom_score_adj set to
    OOM_SCORE_ADJ_MIN (oom disabled).

    Now that the counter is gone we always go through the heuristic calculation
    and always return a non-zero positive value. This means that we can end up
    killing a task with OOM disabled because it is indistinguishable from
    regular tasks using 1% of memory (resp. CAP_SYS_ADMIN tasks using 3%), or
    from tasks with oom_score_adj set but OOM enabled.

    Let's break out early if the task should have OOM disabled.

    Signed-off-by: Michal Hocko
    Acked-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Ying Han
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     

01 Nov, 2011

4 commits

  • test_set_oom_score_adj() was introduced in 72788c385604 ("oom: replace
    PF_OOM_ORIGIN with toggling oom_score_adj") to temporarily elevate
    current's oom_score_adj for ksm and swapoff without requiring an
    additional per-process flag.

    Using that function to both set oom_score_adj to OOM_SCORE_ADJ_MAX and
    then reinstate the previous value is racy since it's possible that
    userspace can set the value to something else itself before the old value
    is reinstated. That results in userspace setting current's oom_score_adj
    to a different value and then the kernel immediately setting it back to
    its previous value without notification.

    To fix this, a new compare_swap_oom_score_adj() function is introduced
    with the same semantics as the compare and swap CAS instruction, or
    CMPXCHG on x86. It is used to reinstate the previous value of
    oom_score_adj if and only if the present value is the same as the old
    value.
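
    A minimal sketch of that restore-if-unchanged pattern (a single-threaded
    userspace illustration under assumptions; the real helper operates on the
    task's oom_score_adj with proper synchronization):

    #include <stdio.h>

    #define OOM_SCORE_ADJ_MAX 1000

    static int oom_score_adj;   /* stands in for current->signal->oom_score_adj */

    /* Reinstate @old only if the present value still equals @expected. */
    static void compare_swap_score(int expected, int old)
    {
            if (oom_score_adj == expected)
                    oom_score_adj = old;
    }

    int main(void)
    {
            int saved = oom_score_adj;          /* save the old value... */
            oom_score_adj = OOM_SCORE_ADJ_MAX;  /* ...and temporarily raise it */

            /* If userspace rewrote the value meanwhile, the restore is skipped. */
            compare_swap_score(OOM_SCORE_ADJ_MAX, saved);

            printf("oom_score_adj=%d\n", oom_score_adj);
            return 0;
    }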

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This removes mm->oom_disable_count entirely since it's unnecessary and
    currently buggy. The counter was intended to be per-process but it's
    currently decremented in the exit path for each thread that exits, causing
    it to underflow.

    The count was originally intended to prevent oom killing threads that
    share memory with threads that cannot be killed since it doesn't lead to
    future memory freeing. The counter could be fixed to represent all
    threads sharing the same mm, but it's better to remove the count since:

    - it is possible that the OOM_DISABLE thread sharing memory with the
    victim is waiting on that thread to exit and will actually cause
    future memory freeing, and

    - there is no guarantee that a thread is disabled from oom killing just
    because another thread sharing its mm is oom disabled.

    Signed-off-by: David Rientjes
    Reported-by: Oleg Nesterov
    Reviewed-by: Oleg Nesterov
    Cc: Ying Han
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • After selecting a task to kill, the oom killer iterates all processes and
    kills all other threads that share the same mm_struct in different thread
    groups. It would not otherwise be helpful to kill a thread if its memory
    would not be subsequently freed.

    A kernel thread, however, may assume a user thread's mm by using
    use_mm(). This is only temporary and should not result in sending a
    SIGKILL to that kthread.

    This patch ensures that only user threads and not kthreads are sent a
    SIGKILL if they share the same mm_struct as the oom killed task.

    Signed-off-by: David Rientjes
    Reviewed-by: Michal Hocko
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If a thread has been oom killed and is frozen, thaw it before returning to
    the page allocator. Otherwise, it can stay frozen indefinitely and no
    memory will be freed.

    Signed-off-by: David Rientjes
    Reported-by: Konstantin Khlebnikov
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: KAMEZAWA Hiroyuki
    Cc: "Rafael J. Wysocki"
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes