17 Jun, 2009

3 commits

  • When a task is chosen for oom kill and is found to be PF_EXITING,
    __oom_kill_task() is called to elevate the task's timeslice and give it
    access to memory reserves so that it may quickly exit.

    This privilege is unnecessary, however, if the task has already detached
    its mm. Although it's possible for the mm to become detached later since
    task_lock() is not held, __oom_kill_task() will simply be a no-op in such
    circumstances.

    Subsequently, it is no longer necessary to warn about killing mm-less
    tasks since it is a no-op.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This moves the check for OOM_DISABLE to the badness heuristic so it is
    only necessary to hold task_lock() once. If the mm's oom_adj is
    OOM_DISABLE, the score is 0, which is also correctly exported via
    /proc/pid/oom_score.
    This requires that tasks with badness scores of 0 are prohibited from
    being oom killed, which makes sense since they would not allow for future
    memory freeing anyway.

    Since the oom_adj value is a characteristic of an mm and not a task, it is
    no longer necessary to check the oom_adj value for threads sharing the
    same memory (except when simply issuing SIGKILLs for threads in other
    thread groups).

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 May, 2009

1 commit

  • When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
    pointer for tasks that have detached mm's since task_lock() is not held
    during the tasklist scan. Add the task_lock().

    Acked-by: Nick Piggin
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 May, 2009

1 commit

  • When /proc/sys/vm/oom_kill_allocating_task is set for large systems that
    want to avoid the lengthy tasklist scan, it's possible to livelock if
    current is ineligible for oom kill. This normally happens when it is set
    to OOM_DISABLE, but is also possible if any threads are sharing the same
    ->mm with a different tgid.

    So change __out_of_memory() to fall back to the full task-list scan if it
    was unable to kill `current'.

    Cc: Nick Piggin
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Apr, 2009

1 commit

  • Add RSS and swap to OOM output from memcg

    Display memcg values like failcnt, usage and limit when an OOM occurs due
    to memcg.

    Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
    Daisuke Nishimura and KOSAKI Motohiro for review.

    Sample output
    -------------

    Task in /a/x killed as a result of limit of /a
    memory: usage 1048576kB, limit 1048576kB, failcnt 4183
    memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

    [akpm@linux-foundation.org: compilation fix]
    [akpm@linux-foundation.org: fix kerneldoc and whitespace]
    [akpm@linux-foundation.org: add printk facility level]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

09 Jan, 2009

2 commits

  • mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually calls
    mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls
    cgroup_lock(). This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, deadlock can happen.

    This patch avoids this deadlock by:
    - remove cgroup_lock() from mem_cgroup_out_of_memory().
    - define new mutex (memcg_tasklist) and serialize mem_cgroup_move_task()
    (->attach handler of memory cgroup) and mem_cgroup_out_of_memory.

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Current mmotm has a new OOM entry point, pagefault_out_of_memory(). It
    was added to select a bad process rather than always killing current.

    When a memcg hits its limit and invokes the OOM killer at page fault
    time, this handler is called and system-wide OOM handling happens
    (meaning the kernel panics if panic_on_oom is true....)

    To avoid that overkill, check the memcg's recent behavior before
    starting a system-wide OOM.

    This patch also fixes the code to guarantee that we don't account
    against a process with TIF_MEMDIE set. This is necessary for smooth
    OOM handling.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Badari Pulavarty
    Cc: Jan Blunck
    Cc: Hirokazu Takahashi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

3 commits

  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • zone_scan_mutex is actually a spinlock, so name it appropriately.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

14 Nov, 2008

3 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed the conflicts above by using the non-'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

14 Aug, 2008

1 commit

  • Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the
    flags of the target process if that is not the current process and it
    is trying to change its own flags in a different way at the same time.

    __capable() is using neither atomic ops nor locking to protect t->flags. This
    patch removes __capable() and introduces has_capability() that doesn't set
    PF_SUPERPRIV on the process being queried.

    This patch further splits security_ptrace() in two:

    (1) security_ptrace_may_access(). This passes judgement on whether one
    process may access another only (PTRACE_MODE_ATTACH for ptrace() and
    PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
    current is the parent.

    (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
    and takes only a pointer to the parent process. current is the child.

    In Smack and commoncap, this uses has_capability() to determine whether
    the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
    This does not set PF_SUPERPRIV.

    Two of the instances of __capable() actually only act on current, and so have
    been changed to calls to capable().

    Of the places that were using __capable():

    (1) The OOM killer calls __capable() thrice when weighing the killability of a
    process. All of these now use has_capability().

    (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
    whether the parent was allowed to trace any process. As mentioned above,
    these have been split. For PTRACE_ATTACH and /proc, capable() is now
    used, and for PTRACE_TRACEME, has_capability() is used.

    (3) cap_safe_nice() only ever saw current, so now uses capable().

    (4) smack_setprocattr() rejected accesses to tasks other than current just
    after calling __capable(), so the order of these two tests has been
    switched and capable() is used instead.

    (5) In smack_file_send_sigiotask(), we need to allow privileged processes to
    receive SIGIO on files they're manipulating.

    (6) In smack_task_wait(), we let a process wait for a privileged process,
    whether or not the process doing the waiting is privileged.

    I've tested this with the LTP SELinux and syscalls testscripts.

    Signed-off-by: David Howells
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Acked-by: Andrew G. Morgan
    Acked-by: Al Viro
    Signed-off-by: James Morris

    David Howells
     

28 Apr, 2008

3 commits

  • In commit 4c4a22148909e4c003562ea7ffe0a06e26919e3c, we moved the
    memcontroller-related code from badness() to select_bad_process(), so the
    parameter 'mem' in badness() is unused now.

    Signed-off-by: Li Zefan
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a subtraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if it was found that accessing zone->node is
    significant which may be the case on workloads where nodemasks are heavily
    used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro, for_each_zone_zonelist(), is introduced that iterates
    through each zone allowed by the GFP flags in the selected zonelist.

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Apr, 2008

1 commit

  • When I used a test program to fork mass processes and immediately move them to
    a cgroup where the memory limit is low enough to trigger oom kill, I got oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000808
    IP: [] _spin_lock_irqsave+0x8/0x18
    PGD 4c95f067 PUD 4406c067 PMD 0
    Oops: 0002 [1] SMP
    CPU 2
    Modules linked in:

    Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5
    RIP: 0010:[] [] _spin_lock_irqsave+0x8/0x18
    RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002
    RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3
    RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808
    RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900
    R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140
    R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200
    FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0)
    Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0
    ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0
    ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0
    Call Trace:
    [] ? force_sig_info+0x25/0xb9
    [] ? oom_kill_task+0x77/0xe2
    [] ? mem_cgroup_out_of_memory+0x55/0x67
    [] ? mem_cgroup_charge_common+0xec/0x202
    [] ? handle_mm_fault+0x24e/0x77f
    [] ? default_wake_function+0x0/0xe
    [] ? get_user_pages+0x2ce/0x3af
    [] ? mem_cgroup_charge_common+0x2d/0x202
    [] ? make_pages_present+0x8e/0xa4
    [] ? mmap_region+0x373/0x429
    [] ? do_mmap_pgoff+0x2ff/0x364
    [] ? sys_mmap+0xe5/0x111
    [] ? tracesys+0xdc/0xe1

    Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00
    RIP [] _spin_lock_irqsave+0x8/0x18
    RSP
    CR2: 0000000000000808
    ---[ end trace c3702fa668021ea4 ]---

    It's reproducible on an x86_64 box, but doesn't happen on x86_32.

    This is because tsk->sighand is not guarded by RCU, so we have to
    hold tasklist_lock, just as what out_of_memory() does.

    Signed-off-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

05 Mar, 2008

1 commit

  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

08 Feb, 2008

3 commits

  • Adds a new sysctl, 'oom_dump_tasks', that enables the kernel to produce a
    dump of all system tasks (excluding kernel threads) when performing an
    OOM-killing. Information includes pid, uid, tgid, vm size, rss, cpu,
    oom_adj score, and name.

    This is helpful for determining why there was an OOM condition and which
    rogue task caused it.

    It is configurable so that large systems, such as those with several
    thousand tasks, do not incur a performance penalty associated with dumping
    data they may not desire.

    If an OOM was triggered as a result of a memory controller, the tasklist
    shall be filtered to exclude tasks that are not a member of the same
    cgroup.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Creates a helper function to return non-zero if a task is a member of a
    memory controller:

    int task_in_mem_cgroup(const struct task_struct *task,
    const struct mem_cgroup *mem);

    When the OOM killer is constrained by the memory controller, the exclusion
    of tasks that are not a member of that controller was previously misplaced
    and appeared in the badness scoring function. It should be excluded
    during the tasklist scan in select_bad_process() instead.

    [akpm@linux-foundation.org: build fix]
    Cc: Christoph Lameter
    Cc: Balbir Singh
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Out of memory handling for cgroups over their limit. A task from the
    cgroup over limit is chosen using the existing OOM logic and killed.

    TODO:
    1. As discussed in the OLS BOF session, consider implementing a user
    space policy for OOM handling.

    [akpm@linux-foundation.org: fix build due to oom-killer changes]
    Signed-off-by: Pavel Emelianov
    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Cc: Peter Zijlstra
    Cc: "Eric W. Biederman"
    Cc: Nick Piggin
    Cc: Kirill Korotaev
    Cc: Herbert Poetzl
    Cc: David Rientjes
    Cc: Vaidyanathan Srinivasan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

06 Feb, 2008

2 commits

  • Root processes are considered more important when out of memory and killing
    processes. The check for CAP_SYS_ADMIN was augmented with a check for
    uid==0 or euid==0.

    There are several possible ways to look at this:

    1. uid comparisons are unnecessary, trust CAP_SYS_ADMIN
    alone. However CAP_SYS_RESOURCE is the one that really
    means "give me extra resources" so allow for that as
    well.
    2. Any privileged code should be protected, but uid is not
    an indication of privilege. So we should check whether
    any capabilities are raised.
    3. uid==0 makes processes on the host as well as in containers
    more important, so we should keep the existing checks.
    4. uid==0 makes processes only on the host more important,
    even without any capabilities. So we should be keeping
    the (uid==0||euid==0) check but only when
    userns==&init_user_ns.

    I'm following number 1 here.

    Signed-off-by: Serge Hallyn
    Cc: Andrew Morgan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     
  • The patch supports legacy (32-bit) capability userspace, and where possible
    translates 32-bit capabilities to/from userspace and the VFS to 64-bit
    kernel space capabilities. If a capability set cannot be compressed into
    32-bits for consumption by user space, the system call fails, with -ERANGE.

    FWIW libcap-2.00 supports this change (and earlier capability formats)

    http://www.kernel.org/pub/linux/libs/security/linux-privs/kernel-2.6/

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: use get_task_comm()]
    [ezk@cs.sunysb.edu: build fix]
    [akpm@linux-foundation.org: do not initialise statics to 0 or NULL]
    [akpm@linux-foundation.org: unused var]
    [serue@us.ibm.com: export __cap_ symbols]
    Signed-off-by: Andrew G. Morgan
    Cc: Stephen Smalley
    Acked-by: Serge Hallyn
    Cc: Chris Wright
    Cc: James Morris
    Cc: Casey Schaufler
    Signed-off-by: Erez Zadok
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morgan
     

26 Jan, 2008

1 commit

  • Move the task_struct members specific to rt scheduling together.
    A future optimization could be to put sched_entity and sched_rt_entity
    into a union.

    Signed-off-by: Peter Zijlstra
    CC: Srivatsa Vaddagiri
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Oct, 2007

4 commits

  • The task_struct->pid member is going to be deprecated, so start
    using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
    the kernel.

    The first thing to start with is the pid, printed to dmesg - in
    this case we may safely use task_pid_nr(). Besides, printks account for
    more (much more) than half of all the explicit pid usage.

    [akpm@linux-foundation.org: git-drm went and changed lots of stuff]
    Signed-off-by: Pavel Emelyanov
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • With pid namespaces this field is now dangerous to use explicitly, so hide
    it behind the helpers.

    Also the pid and pgrp fields of task_struct and signal_struct are to be
    deprecated. Unfortunately this patch cannot be sent right now as that
    leads to tons of warnings, so start isolating them, and deprecate later.

    Actually the p->tgid == pid check has to be changed to
    has_group_leader_pid(), but Oleg pointed out that in the posix cpu
    timers case these are the same, and thread_group_leader() is preferable.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • mm/oom_kill.c: Convert list_for_each to list_for_each_entry in
    oom_kill_process()

    Signed-off-by: Matthias Kaehlcke
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
     
  • is_init() is an ambiguous name for the pid==1 check. Split it into
    is_global_init() and is_container_init().

    A cgroup init has its tsk->pid == 1.

    A global init also has its tsk->pid == 1 and its active pid namespace
    is the init_pid_ns. But rather than check the active pid namespace,
    compare the task structure with 'init_pid_ns.child_reaper', which is
    initialized during boot to the /sbin/init process and never changes.

    Changelog:

    2.6.22-rc4-mm2-pidns1:
    - Use 'init_pid_ns.child_reaper' to determine if a given task is the
    global init (/sbin/init) process. This would improve performance
    and remove dependence on the task_pid().

    2.6.21-mm2-pidns2:

    - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
    ppc,avr32}/traps.c for the _exception() call to is_global_init().
    This way, we kill only the cgroup if the cgroup's init has a
    bug rather than force a kernel panic.

    [akpm@linux-foundation.org: fix comment]
    [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
    [bunk@stusta.de: kernel/pid.c: remove unused exports]
    [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
     

17 Oct, 2007

4 commits

  • There's no reason to sleep in try_set_zone_oom() or clear_zonelist_oom() if
    the lock can't be acquired; it will be available soon enough once the zonelist
    scanning is done. All other threads waiting for the OOM killer are also
    contingent on the exiting task being able to acquire the lock in
    clear_zonelist_oom() so it doesn't make sense to put it to sleep.

    Cc: Andrea Arcangeli
    Cc: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since no task descriptor's 'cpuset' field is dereferenced in the execution of
    the OOM killer anymore, it is no longer necessary to take callback_mutex.

    [akpm@linux-foundation.org: restore cpuset_lock for other patches]
    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Instead of testing for overlap in the memory nodes of the nearest
    exclusive ancestor of both current and the candidate task, it is better to
    simply test for intersection between the task's mems_allowed in their task
    descriptors. This does not require taking callback_mutex since it is only
    used as a hint in the badness scoring.

    Tasks that do not have an intersection in their mems_allowed with the current
    task are not explicitly restricted from being OOM killed because it is quite
    possible that the candidate task has allocated memory there before and has
    since changed its mems_allowed.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Suppresses the extraneous stack and memory dump when a parallel OOM killing
    has been found. There's no need to fill the ring buffer with this information
    if it's already been printed and the condition that triggered the previous OOM
    killer has not yet been alleviated.

    Cc: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes