28 May, 2010

1 commit

  • It's pointless to try to kill current if select_bad_process() did not find
    an eligible task to kill in mem_cgroup_out_of_memory() since it's
    guaranteed that current is a member of the memcg that is oom and it is, by
    definition, unkillable.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: Li Zefan
    Cc: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surrounding. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

13 Mar, 2010

2 commits

  • In current page-fault code,

    handle_mm_fault()
    -> ...
    -> mem_cgroup_charge()
    -> map page or handle error.
    -> check return code.

    If page fault's return code is VM_FAULT_OOM, page_fault_out_of_memory() is
    called. But if it's caused by memcg, OOM should have been already
    invoked.

    Then, I added a patch: a636b327f731143ccc544b966cfd8de6cb6d72c6. That
    patch records last_oom_jiffies for memcg's sub-hierarchy and prevents
    page_fault_out_of_memory from being invoked in near future.

    But Nishimura-san reported that the check by jiffies is not enough when the
    system is under terribly heavy load.

    This patch changes memcg's oom logic as follows:
    * If memcg causes OOM-kill, continue to retry.
    * remove jiffies check which is used now.
    * add memcg-oom-lock which works like perzone oom lock.
    * If current is killed(as a process), bypass charge.

    Something more sophisticated can be added but this patch does
    fundamental things.
    TODO:
    - add oom notifier
    - add permemcg disable-oom-kill flag and freezer at oom.
    - more chances for wake up oom waiter (when changing memory limit etc..)

    Reviewed-by: Daisuke Nishimura
    Tested-by: Daisuke Nishimura
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • Presently, if panic_on_oom=2, the whole system panics even if the oom
    happened in some special situation (such as cpuset, mempolicy, ...). That is,
    panic_on_oom=2 means panic_on_oom_always.

    Now, memcg doesn't check panic_on_oom flag. This patch adds a check.

    BTW, how is it useful?

    kdump+panic_on_oom=2 is the last tool to investigate what happens in an
    oom-ed system. When a task is killed, the system recovers and there will
    be few hints to know what happened. In a mission-critical system, oom should
    never happen. Then, panic_on_oom=2+kdump is useful to avoid the next OOM by
    knowing precise information via snapshot.

    TODO:
    - For memcg, which is for isolating the system's memory usage, oom-notifier and
    freeze_at_oom (or rest_at_oom) should be implemented. Then, a management
    daemon can do similar jobs (as kdump) or take a snapshot per cgroup.

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Balbir Singh
    Cc: David Rientjes
    Cc: Nick Piggin
    Reviewed-by: Daisuke Nishimura
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Mar, 2010

1 commit

  • Presently, per-mm statistics counter is defined by macro in sched.h

    This patch modifies it to
    - be defined in mm.h as inline functions
    - use an array instead of creating names via macro.

    This patch is for reducing patch size in future patch to modify
    implementation of per-mm counter.
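
As an illustration only (not taken from the patch), the change described above can be sketched in user-space C: the counters live in one array indexed by an enum and are accessed through inline helpers, instead of one macro-generated field per counter. The names `mm_model`, `MM_FILEPAGES`, `MM_ANONPAGES` and the helper names are assumptions made for this sketch, and a plain `long` stands in for whatever counter type the kernel uses.

```c
#include <assert.h>

/* Counter indices; the names are illustrative assumptions. */
enum { MM_FILEPAGES, MM_ANONPAGES, NR_MM_COUNTERS };

/* Stand-in for struct mm_struct: one array instead of one field per counter. */
struct mm_model {
    long counters[NR_MM_COUNTERS];
};

static inline long get_mm_counter(const struct mm_model *mm, int member)
{
    return mm->counters[member];
}

static inline void add_mm_counter(struct mm_model *mm, int member, long value)
{
    mm->counters[member] += value;
}

static inline void inc_mm_counter(struct mm_model *mm, int member)
{
    add_mm_counter(mm, member, 1);
}

static inline void dec_mm_counter(struct mm_model *mm, int member)
{
    add_mm_counter(mm, member, -1);
}
```

With this layout, a later change to how a counter is stored only has to touch the helpers, which is the patch-size reduction the message refers to.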

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Feb, 2010

1 commit

  • Presently the oom-killer is memcg aware and it finds the worst process
    from processes under memcg(s) in oom. Then, it kills victim's child
    first.

    It may kill a child in another cgroup and may not be any help for
    recovery. And it will break the assumption users have.

    This patch fixes it.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Balbir Singh
    Reviewed-by: Daisuke Nishimura
    Acked-by: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

16 Dec, 2009

4 commits

  • task_in_mem_cgroup(), which is called by select_bad_process() to check
    whether a task can be a candidate for being oom-killed from memcg's limit,
    checks "curr->use_hierarchy"("curr" is the mem_cgroup the task belongs
    to).

    But this check returns true (a false positive) when:

    /aa use_hierarchy == 0
    /aa/00 use_hierarchy == 1
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Fix node-oriented allocation handling in oom-kill.c. I myself think of this
    as a bugfix, not as an enhancement.

    These days, things have changed:
    - alloc_pages() takes a nodemask as an argument, via __alloc_pages_nodemask().
    - mempolicy doesn't maintain its own private zonelists.
    (And cpuset doesn't use nodemask for __alloc_pages_nodemask())

    So, current oom-killer's check function is wrong.

    This patch does
    - check nodemask; if a nodemask is given and it doesn't cover all of
    node_states[N_HIGH_MEMORY], this is CONSTRAINT_MEMORY_POLICY.
    - scan all zonelists under the nodemask; if it hits cpuset's wall,
    this failure is from cpuset.
    And
    - modifies the caller of out_of_memory not to call oom if __GFP_THISNODE.
    This doesn't change "current" behavior. If callers use __GFP_THISNODE
    it should handle "page allocation failure" by itself.

    - handle __GFP_NOFAIL+__GFP_THISNODE path.
    This is something like a FIXME but this gfpmask is not used now.
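
The constraint check listed above can be modelled in a few lines of user-space C. This is a simplified sketch, not the kernel code: node sets are plain bitmasks, `constrained_alloc_model()` and its parameters are hypothetical names, and the cpuset test is reduced to "some scanned node lies outside the cpuset". The __GFP_THISNODE handling is left out.

```c
#include <assert.h>
#include <stddef.h>

enum oom_constraint {
    CONSTRAINT_NONE,
    CONSTRAINT_CPUSET,
    CONSTRAINT_MEMORY_POLICY
};

/*
 * nodemask:     nodes the allocation was restricted to, or NULL if none
 * memory_nodes: nodes that have memory (models node_states[N_HIGH_MEMORY])
 * cpuset_nodes: nodes the current cpuset allows (an assumption of this model)
 */
enum oom_constraint
constrained_alloc_model(const unsigned long *nodemask,
                        unsigned long memory_nodes,
                        unsigned long cpuset_nodes)
{
    unsigned long scanned = memory_nodes;

    if (nodemask) {
        /* A nodemask that leaves out some memory node comes from mempolicy. */
        if ((*nodemask & memory_nodes) != memory_nodes)
            return CONSTRAINT_MEMORY_POLICY;
        scanned &= *nodemask;
    }

    /* Hitting a node the cpuset forbids means the failure is from cpuset. */
    if (scanned & ~cpuset_nodes)
        return CONSTRAINT_CPUSET;

    return CONSTRAINT_NONE;
}
```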

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Cc: Daisuke Nishimura
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     
  • In a typical oom analysis scenario, we frequently want to know whether the
    killed process has a memory leak or not at the first step. This patch
    adds vsz and rss information to the oom log to help this analysis and to
    save debugging time.

    example:
    ===================================================================
    rsyslogd invoked oom-killer: gfp_mask=0x201da, order=0, oom_adj=0
    Pid: 1308, comm: rsyslogd Not tainted 2.6.32-rc6 #24
    Call Trace:
    [] ?_spin_unlock+0x2b/0x40
    [] oom_kill_process+0xbe/0x2b0

    (snip)

    492283 pages non-shared
    Out of memory: kill process 2341 (memhog) score 527276 or a child
    Killed process 2341 (memhog) vsz:1054552kB, anon-rss:970588kB, file-rss:4kB
    ===========================================================================
    ^
    |
    here

    [rientjes@google.com: fix race, add pid & comm to message]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The oom killer header, including information such as the allocation order
    and gfp mask, current's cpuset and memory controller, call trace, and VM
    state information is currently only shown when the oom killer has selected
    a task to kill.

    This information is omitted, however, when the oom killer panics either
    because of panic_on_oom sysctl settings or when no killable task was
    found. It is still relevant to know crucial pieces of information such as
    the allocation order and VM state when diagnosing such issues, especially
    at boot.

    This patch displays the oom killer header whenever it panics so that bug
    reports can include pertinent information to debug the issue, if possible.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

22 Sep, 2009

4 commits

  • Current oom_kill doesn't only kill the victim process, but also kills all
    that share the same mm. It means a vfork parent will be killed.

    This is definitely incorrect. Another process has another oom_adj. We
    shouldn't ignore their oom_adj (it might be OOM_DISABLE).

    The following caller hits the minefield.

    ===============================
        switch (constraint) {
        case CONSTRAINT_MEMORY_POLICY:
            oom_kill_process(current, gfp_mask, order, 0, NULL,
                    "No available memory (MPOL_BIND)");
            break;

    Note: force_sig(SIGKILL) sends SIGKILL to all threads in the process.
    We don't need to care about multiple threads here.

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The oom-killer kills a process, not a task, so oom_score should be calculated
    per process too. This improves consistency and speeds up
    select_bad_process().

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, the OOM logic call flow is as follows.

    __out_of_memory()
        select_bad_process()    for each task
            badness()           calculate badness of one task
        oom_kill_process()      search child
            oom_kill_task()     kill target task and mm shared tasks with it

    For example, process-A has two threads, thread-A and thread-B, it uses a
    lot of memory, and each thread has the following oom_adj and oom_score.

    thread-A: oom_adj = OOM_DISABLE, oom_score = 0
    thread-B: oom_adj = 0, oom_score = very-high

    Then select_bad_process() selects thread-B, but oom_kill_task() refuses to
    kill the task because thread-A has OOM_DISABLE. Thus __out_of_memory()
    calls select_bad_process() again, but select_bad_process() selects the same
    task. It means the kernel falls into livelock.

    The fact is, select_bad_process() must select a killable task. Otherwise
    the OOM logic goes into livelock.

    The root cause is that oom_adj shouldn't be a per-thread value. It should be
    a per-process value because the OOM killer kills a process, not a thread. Thus
    this patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct signal_struct. It naturally prevents
    select_bad_process() from choosing the wrong task.
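
A minimal user-space sketch of why a per-process oom_adj removes the livelock: once the value belongs to the process, the selection loop can skip an unkillable process outright, so it never hands the kill step a victim that will be refused. `process_model` and `select_bad_process_model()` are hypothetical names, and `score` is only a stand-in for the badness() result.

```c
#include <assert.h>
#include <stddef.h>

#define OOM_DISABLE (-17)

struct process_model {
    int oom_adj;            /* one value per process, as after the patch */
    unsigned long score;    /* stand-in for the badness() result */
};

/* Return the killable process with the highest score, or NULL if none. */
struct process_model *
select_bad_process_model(struct process_model *procs, size_t n)
{
    struct process_model *chosen = NULL;

    for (size_t i = 0; i < n; i++) {
        if (procs[i].oom_adj == OOM_DISABLE)
            continue;   /* never pick a process that cannot be killed */
        if (!chosen || procs[i].score > chosen->score)
            chosen = &procs[i];
    }
    return chosen;
}
```

With per-thread values, the equivalent loop could return thread-B even though its sibling thread-A makes the whole process unkillable, which is the retry loop described above.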

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Just as the swapoff system call allocates many pages of RAM to various
    processes, perhaps triggering OOM, so "echo 2 >/sys/kernel/mm/ksm/run"
    (unmerge) is liable to allocate many pages of RAM to various processes,
    perhaps triggering OOM; and each is normally run from a modest admin
    process (swapoff or shell), easily repeated until it succeeds.

    So treat unmerge_and_remove_all_rmap_items() in the same way that we treat
    try_to_unuse(): generalize PF_SWAPOFF to PF_OOM_ORIGIN, and bracket both
    with that, to ask the OOM killer to kill them first, to prevent them from
    spawning more and more OOM kills.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

19 Aug, 2009

1 commit

  • The commit 2ff05b2b (oom: move oom_adj value) moved the oom_adj value to
    the mm_struct. It was a very good first step toward sanitizing OOM.

    However, Paul Menage reported that the commit causes a regression in his job
    scheduler. The current OOM logic can kill an OOM_DISABLED process.

    Why? His program has code similar to the following.

    ...
    set_oom_adj(OOM_DISABLE); /* The job scheduler never killed by oom */
    ...
    if (vfork() == 0) {
        set_oom_adj(0); /* Invoked child can be killed */
        execve("foo-bar-cmd");
    }
    ....

    A vfork() parent and child share the same mm_struct, so the above
    set_oom_adj(0) doesn't only change oom_adj for the vfork() child, it also
    changes oom_adj for the vfork() parent. Then the vfork() parent (the job
    scheduler) lost its OOM immunity and was killed.
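
The aliasing problem can be reproduced with a tiny model. This sketch assumes the layout introduced by commit 2ff05b2b, where oom_adj lives in the mm; `mm_model`, `task_model` and `set_oom_adj_model()` are hypothetical stand-ins, the last one mirroring the set_oom_adj() call in the snippet above.

```c
#include <assert.h>

#define OOM_DISABLE (-17)

struct mm_model   { int oom_adj; };         /* oom_adj stored in the mm */
struct task_model { struct mm_model *mm; }; /* vfork() tasks share one mm */

/* Hypothetical helper: writes through to the (possibly shared) mm. */
void set_oom_adj_model(struct task_model *t, int adj)
{
    t->mm->oom_adj = adj;
}
```

When parent and child point at the same `mm_model`, the child's `set_oom_adj_model(&child, 0)` silently clears the parent's OOM_DISABLE, which is the regression being reverted.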

    Actually, the fork-setting-exec idiom is very frequently used in userland programs.
    We must not break this assumption.

    Then, this patch reverts commit 2ff05b2b and related commits.

    Reverted commit list
    ---------------------
    - commit 2ff05b2b4e (oom: move oom_adj value from task_struct to mm_struct)
    - commit 4d8b9135c3 (oom: avoid unnecessary mm locking and scanning for OOM_DISABLE)
    - commit 8123681022 (oom: only oom kill exiting tasks with attached memory)
    - commit 933b787b57 (mm: copy over oom_adj value at fork time)

    Signed-off-by: KOSAKI Motohiro
    Cc: Paul Menage
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: Rik van Riel
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Nick Piggin
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

17 Jun, 2009

3 commits

  • When a task is chosen for oom kill and is found to be PF_EXITING,
    __oom_kill_task() is called to elevate the task's timeslice and give it
    access to memory reserves so that it may quickly exit.

    This privilege is unnecessary, however, if the task has already detached
    its mm. Although it's possible for the mm to become detached later since
    task_lock() is not held, __oom_kill_task() will simply be a no-op in such
    circumstances.

    Subsequently, it is no longer necessary to warn about killing mm-less
    tasks since it is a no-op.

    Signed-off-by: David Rientjes
    Acked-by: Rik van Riel
    Cc: Balbir Singh
    Cc: Minchan Kim
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This moves the check for OOM_DISABLE to the badness heuristic so it is
    only necessary to hold task_lock() once. If the mm is OOM_DISABLE, the
    score is 0, which is also correctly exported via /proc/pid/oom_score.
    This requires that tasks with badness scores of 0 are prohibited from
    being oom killed, which makes sense since they would not allow for future
    memory freeing anyway.

    Since the oom_adj value is a characteristic of an mm and not a task, it is
    no longer necessary to check the oom_adj value for threads sharing the
    same memory (except when simply issuing SIGKILLs for threads in other
    thread groups).

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The per-task oom_adj value is a characteristic of its mm more than the
    task itself since it's not possible to oom kill any thread that shares the
    mm. If a task were to be killed while attached to an mm that could not be
    freed because another thread were set to OOM_DISABLE, it would have
    needlessly been terminated since there is no potential for future memory
    freeing.

    This patch moves oomkilladj (now more appropriately named oom_adj) from
    struct task_struct to struct mm_struct. This requires task_lock() on a
    task to check its oom_adj value to protect against exec, but it's already
    necessary to take the lock when dereferencing the mm to find the total VM
    size for the badness heuristic.

    This fixes a livelock if the oom killer chooses a task and another thread
    sharing the same memory has an oom_adj value of OOM_DISABLE. This occurs
    because oom_kill_task() repeatedly returns 1 and refuses to kill the
    chosen task while select_bad_process() will repeatedly choose the same
    task during the next retry.

    Taking task_lock() in select_bad_process() to check for OOM_DISABLE and in
    oom_kill_task() to check for threads sharing the same memory will be
    removed in the next patch in this series where it will no longer be
    necessary.

    Writing to /proc/pid/oom_adj for a kthread will now return -EINVAL since
    these threads are immune from oom killing already. They simply report an
    oom_adj value of OOM_DISABLE.

    Cc: Nick Piggin
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

29 May, 2009

1 commit

  • When /proc/sys/vm/oom_dump_tasks is enabled, it is possible to get a NULL
    pointer for tasks that have detached mm's since task_lock() is not held
    during the tasklist scan. Add the task_lock().

    Acked-by: Nick Piggin
    Acked-by: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

07 May, 2009

1 commit

  • When /proc/sys/vm/oom_kill_allocating_task is set for large systems that
    want to avoid the lengthy tasklist scan, it's possible to livelock if
    current is ineligible for oom kill. This normally happens when it is set
    to OOM_DISABLE, but is also possible if any threads are sharing the same
    ->mm with a different tgid.

    So change __out_of_memory() to fall back to the full task-list scan if it
    was unable to kill `current'.
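
The fallback can be sketched as follows. This is a simplified model under stated assumptions: `task_model`, `oom_kill_model()` and `out_of_memory_model()` are hypothetical names, and the "full scan" here simply takes the first killable task, whereas the real scan ranks tasks by badness.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct task_model {
    bool killable;  /* false models OOM_DISABLE or a shared, unkillable ->mm */
    bool killed;
};

/* Returns 0 if the task was killed, 1 if the kill was refused. */
int oom_kill_model(struct task_model *t)
{
    if (!t->killable)
        return 1;
    t->killed = true;
    return 0;
}

/*
 * Try `current` first when the sysctl is set; if that is refused, fall
 * back to scanning the whole list instead of retrying `current` forever.
 */
struct task_model *
out_of_memory_model(struct task_model *current_task, bool kill_allocating_task,
                    struct task_model *tasks, size_t n)
{
    if (kill_allocating_task && oom_kill_model(current_task) == 0)
        return current_task;

    for (size_t i = 0; i < n; i++)
        if (oom_kill_model(&tasks[i]) == 0)
            return &tasks[i];

    return NULL;
}
```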

    Cc: Nick Piggin
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 Apr, 2009

1 commit

  • Add RSS and swap to OOM output from memcg

    Display memcg values like failcnt, usage and limit when an OOM occurs due
    to memcg.

    Thanks to Johannes Weiner, Li Zefan, David Rientjes, Kamezawa Hiroyuki,
    Daisuke Nishimura and KOSAKI Motohiro for review.

    Sample output
    -------------

    Task in /a/x killed as a result of limit of /a
    memory: usage 1048576kB, limit 1048576kB, failcnt 4183
    memory+swap: usage 1400964kB, limit 9007199254740991kB, failcnt 0

    [akpm@linux-foundation.org: compilation fix]
    [akpm@linux-foundation.org: fix kerneldoc and whitespace]
    [akpm@linux-foundation.org: add printk facility level]
    Signed-off-by: Balbir Singh
    Cc: KAMEZAWA Hiroyuki
    Cc: Daisuke Nishimura
    Cc: Li Zefan
    Cc: Paul Menage
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

01 Apr, 2009

1 commit


09 Jan, 2009

2 commits

  • mpol_rebind_mm(), which can be called from cpuset_attach(), does
    down_write(mm->mmap_sem). This means down_write(mm->mmap_sem) can be
    called under cgroup_mutex.

    OTOH, page fault path does down_read(mm->mmap_sem) and calls
    mem_cgroup_try_charge_xxx(), which may eventually call
    mem_cgroup_out_of_memory(). And mem_cgroup_out_of_memory() calls
    cgroup_lock(). This means cgroup_lock() can be called under
    down_read(mm->mmap_sem).

    If those two paths race, deadlock can happen.

    This patch avoids this deadlock by:
    - removing cgroup_lock() from mem_cgroup_out_of_memory().
    - defining a new mutex (memcg_tasklist) and serializing mem_cgroup_move_task()
    (->attach handler of memory cgroup) and mem_cgroup_out_of_memory().

    Signed-off-by: Daisuke Nishimura
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daisuke Nishimura
     
  • Current mmotm has a new oom function, pagefault_out_of_memory(). It was
    added to select a bad process rather than killing current.

    When memcg hits its limit and calls OOM at page fault, this handler is called and
    system-wide oom handling happens. (This means the kernel panics if panic_on_oom is
    true....)

    To avoid overkill, check memcg's recent behavior before starting
    system-wide-oom.

    This patch also fixes things to guarantee "don't account against a process with
    TIF_MEMDIE". This is necessary for smooth OOM.

    [akpm@linux-foundation.org: build fix]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Li Zefan
    Cc: Balbir Singh
    Cc: Daisuke Nishimura
    Cc: Badari Pulavarty
    Cc: Jan Blunck
    Cc: Hirokazu Takahashi
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

07 Jan, 2009

3 commits

  • When cpusets are enabled, it's necessary to print the triggering task's
    set of allowable nodes so the subsequently printed meminfo can be
    interpreted correctly.

    We also print the task's cpuset name for informational purposes.

    [rientjes@google.com: task lock current before dereferencing cpuset]
    Cc: Paul Menage
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • zone_scan_mutex is actually a spinlock, so name it appropriately.

    Signed-off-by: David Rientjes
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Rather than have the pagefault handler kill a process directly if it gets
    a VM_FAULT_OOM, have it call into the OOM killer.

    With increasingly sophisticated oom behaviour (cpusets, memory cgroups,
    oom killing throttling, oom priority adjustment or selective disabling,
    panic on oom, etc), it's silly to unconditionally kill the faulting
    process at page fault time. Create a hook for pagefault oom path to call
    into instead.

    Only converted x86 and uml so far.

    [akpm@linux-foundation.org: make __out_of_memory() static]
    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Nick Piggin
    Cc: Jeff Dike
    Acked-by: Ingo Molnar
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

14 Nov, 2008

3 commits

  • Conflicts:
    security/keys/internal.h
    security/keys/process_keys.c
    security/keys/request_key.c

    Fixed conflicts above by using the non 'tsk' versions.

    Signed-off-by: James Morris

    James Morris
     
  • Use RCU to access another task's creds and to release a task's own creds.
    This means that it will be possible for the credentials of a task to be
    replaced without another task (a) requiring a full lock to read them, and (b)
    seeing deallocated memory.

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     
  • Separate the task security context from task_struct. At this point, the
    security data is temporarily embedded in the task_struct with two pointers
    pointing to it.

    Note that the Alpha arch is altered as it refers to (E)UID and (E)GID in
    entry.S via asm-offsets.

    With comment fixes Signed-off-by: Marc Dionne

    Signed-off-by: David Howells
    Acked-by: James Morris
    Acked-by: Serge Hallyn
    Signed-off-by: James Morris

    David Howells
     

11 Nov, 2008

1 commit


07 Nov, 2008

2 commits


14 Aug, 2008

1 commit

  • Fix the setting of PF_SUPERPRIV by __capable() as it could corrupt the flags
    of the target process if that is not the current process and it is trying to
    change its own flags in a different way at the same time.

    __capable() is using neither atomic ops nor locking to protect t->flags. This
    patch removes __capable() and introduces has_capability() that doesn't set
    PF_SUPERPRIV on the process being queried.
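
The distinction can be illustrated with a small model. This is a sketch, not the kernel implementation: `task_model`, `has_capability_model()` and `capable_model()` are hypothetical names, and capabilities are reduced to a single bitmask. The point is only that the query helper never writes to the flags of the task it inspects.

```c
#include <assert.h>
#include <stdbool.h>

#define PF_SUPERPRIV 0x00000100  /* flag value used for illustration */

struct task_model {
    unsigned int flags;
    unsigned int caps;  /* bitmask of held capabilities (a simplification) */
};

/* Query only: safe to call on any task because it never writes t->flags. */
bool has_capability_model(const struct task_model *t, int cap)
{
    return (t->caps >> cap) & 1u;
}

/*
 * Acts on the caller's own task only, so updating its flags cannot race
 * with that task changing its own flags at the same time.
 */
bool capable_model(struct task_model *current_task, int cap)
{
    if (has_capability_model(current_task, cap)) {
        current_task->flags |= PF_SUPERPRIV;
        return true;
    }
    return false;
}
```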

    This patch further splits security_ptrace() in two:

    (1) security_ptrace_may_access(). This passes judgement on whether one
    process may access another only (PTRACE_MODE_ATTACH for ptrace() and
    PTRACE_MODE_READ for /proc), and takes a pointer to the child process.
    current is the parent.

    (2) security_ptrace_traceme(). This passes judgement on PTRACE_TRACEME only,
    and takes only a pointer to the parent process. current is the child.

    In Smack and commoncap, this uses has_capability() to determine whether
    the parent will be permitted to use PTRACE_ATTACH if normal checks fail.
    This does not set PF_SUPERPRIV.

    Two of the instances of __capable() actually only act on current, and so have
    been changed to calls to capable().

    Of the places that were using __capable():

    (1) The OOM killer calls __capable() thrice when weighing the killability of a
    process. All of these now use has_capability().

    (2) cap_ptrace() and smack_ptrace() were using __capable() to check to see
    whether the parent was allowed to trace any process. As mentioned above,
    these have been split. For PTRACE_ATTACH and /proc, capable() is now
    used, and for PTRACE_TRACEME, has_capability() is used.

    (3) cap_safe_nice() only ever saw current, so now uses capable().

    (4) smack_setprocattr() rejected accesses to tasks other than current just
    after calling __capable(), so the order of these two tests have been
    switched and capable() is used instead.

    (5) In smack_file_send_sigiotask(), we need to allow privileged processes to
    receive SIGIO on files they're manipulating.

    (6) In smack_task_wait(), we let a process wait for a privileged process,
    whether or not the process doing the waiting is privileged.

    I've tested this with the LTP SELinux and syscalls testscripts.

    Signed-off-by: David Howells
    Acked-by: Serge Hallyn
    Acked-by: Casey Schaufler
    Acked-by: Andrew G. Morgan
    Acked-by: Al Viro
    Signed-off-by: James Morris

    David Howells
     

28 Apr, 2008

3 commits

  • In commit 4c4a22148909e4c003562ea7ffe0a06e26919e3c, we moved the
    memcontroller-related code from badness() to select_bad_process(), so the
    parameter 'mem' in badness() is unused now.

    Signed-off-by: Li Zefan
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Filtering zonelists requires very frequent use of zone_idx(). This is costly
    as it involves a lookup of another structure and a subtraction operation. As
    the zone_idx is often required, it should be quickly accessible. The node idx
    could also be stored here if it was found that accessing zone->node is
    significant which may be the case on workloads where nodemasks are heavily
    used.

    This patch introduces a struct zoneref to store a zone pointer and a zone
    index. The zonelist then consists of an array of these struct zonerefs which
    are looked up as necessary. Helpers are given for accessing the zone index as
    well as the node index.
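
A minimal sketch of the data structure described above, assuming simplified stand-in types: each zonelist entry caches the zone index next to the zone pointer, so reading the index needs no lookup through the zone. The `_model` type names and the `node` field are assumptions of this sketch, and the helper names are chosen to read naturally rather than quoted from the patch.

```c
#include <assert.h>

struct zone_model { int node; };    /* stand-in for struct zone */

/* A zonelist entry: the zone pointer plus its cached zone index. */
struct zoneref_model {
    struct zone_model *zone;
    int zone_idx;
};

static inline struct zone_model *zonelist_zone(const struct zoneref_model *z)
{
    return z->zone;
}

static inline int zonelist_zone_idx(const struct zoneref_model *z)
{
    return z->zone_idx;     /* no lookup through the zone itself */
}

static inline int zonelist_node_idx(const struct zoneref_model *z)
{
    return z->zone->node;   /* still dereferences the zone in this model */
}
```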

    [kamezawa.hiroyu@jp.fujitsu.com: Suggested struct zoneref instead of embedding information in pointers]
    [hugh@veritas.com: mm-have-zonelist: fix memcg ooms]
    [hugh@veritas.com: just return do_try_to_free_pages]
    [hugh@veritas.com: do_try_to_free_pages gfp_mask redundant]
    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Nick Piggin
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Currently a node has two sets of zonelists, one for each zone type in the
    system and a second set for GFP_THISNODE allocations. Based on the zones
    allowed by a gfp mask, one of these zonelists is selected. All of these
    zonelists consume memory and occupy cache lines.

    This patch replaces the multiple zonelists per-node with two zonelists. The
    first contains all populated zones in the system, ordered by distance, for
    fallback allocations when the target/preferred node has no free pages. The
    second contains all populated zones in the node suitable for GFP_THISNODE
    allocations.

    An iterator macro called for_each_zone_zonelist() is introduced that iterates
    through each zone allowed by the GFP flags in the selected zonelist.
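
The shape of such an iterator can be sketched in user-space C. This is a model under stated assumptions, not the kernel macro: `zone_entry`, `for_each_zone_zonelist_model()` and `count_allowed_zones()` are hypothetical names, the list is terminated by a NULL name, and "allowed by the GFP flags" is reduced to "zone index not above a given maximum".

```c
#include <assert.h>
#include <stddef.h>

struct zone_entry {
    const char *name;   /* NULL terminates the list */
    int idx;            /* zone type index */
};

/* Visit every entry whose zone index is allowed, i.e. <= highidx. */
#define for_each_zone_zonelist_model(z, list, highidx)      \
    for ((z) = (list); (z)->name != NULL; (z)++)            \
        if ((z)->idx <= (highidx))

int count_allowed_zones(const struct zone_entry *list, int highidx)
{
    const struct zone_entry *z;
    int n = 0;

    for_each_zone_zonelist_model(z, list, highidx)
        n++;
    return n;
}
```

Callers written against the macro do not need to know which zonelist was selected or how it is filtered, only the highest zone index they may use.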

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Signed-off-by: Lee Schermerhorn
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Christoph Lameter
    Cc: Hugh Dickins
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

16 Apr, 2008

1 commit

  • When I used a test program to fork mass processes and immediately move them to
    a cgroup where the memory limit is low enough to trigger oom kill, I got oops:

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000808
    IP: [] _spin_lock_irqsave+0x8/0x18
    PGD 4c95f067 PUD 4406c067 PMD 0
    Oops: 0002 [1] SMP
    CPU 2
    Modules linked in:

    Pid: 11973, comm: a.out Not tainted 2.6.25-rc7 #5
    RIP: 0010:[] [] _spin_lock_irqsave+0x8/0x18
    RSP: 0018:ffff8100448c7c30 EFLAGS: 00010002
    RAX: 0000000000000202 RBX: 0000000000000009 RCX: 000000000001c9f3
    RDX: 0000000000000100 RSI: 0000000000000001 RDI: 0000000000000808
    RBP: ffff81007e444080 R08: 0000000000000000 R09: ffff8100448c7900
    R10: ffff81000105f480 R11: 00000100ffffffff R12: ffff810067c84140
    R13: 0000000000000001 R14: ffff8100441d0018 R15: ffff81007da56200
    FS: 00007f70eb1856f0(0000) GS:ffff81007fbad3c0(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 0000000000000808 CR3: 000000004498a000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Process a.out (pid: 11973, threadinfo ffff8100448c6000, task ffff81007da533e0)
    Stack: ffffffff8023ef5a 00000000000000d0 ffffffff80548dc0 00000000000000d0
    ffff810067c84140 ffff81007e444080 ffffffff8026cef9 00000000000000d0
    ffff8100441d0000 00000000000000d0 ffff8100441d0000 ffff8100505445c0
    Call Trace:
    [] ? force_sig_info+0x25/0xb9
    [] ? oom_kill_task+0x77/0xe2
    [] ? mem_cgroup_out_of_memory+0x55/0x67
    [] ? mem_cgroup_charge_common+0xec/0x202
    [] ? handle_mm_fault+0x24e/0x77f
    [] ? default_wake_function+0x0/0xe
    [] ? get_user_pages+0x2ce/0x3af
    [] ? mem_cgroup_charge_common+0x2d/0x202
    [] ? make_pages_present+0x8e/0xa4
    [] ? mmap_region+0x373/0x429
    [] ? do_mmap_pgoff+0x2ff/0x364
    [] ? sys_mmap+0xe5/0x111
    [] ? tracesys+0xdc/0xe1

    Code: 00 00 01 48 8b 3c 24 e9 46 d4 dd ff f0 ff 07 48 8b 3c 24 e9 3a d4 dd ff fe 07 48 8b 3c 24 e9 2f d4 dd ff 9c 58 fa ba 00 01 00 00 66 0f c1 17 38 f2 74 06 f3 90 8a 17 eb f6 c3 fa b8 00 01 00
    RIP [] _spin_lock_irqsave+0x8/0x18
    RSP
    CR2: 0000000000000808
    ---[ end trace c3702fa668021ea4 ]---

    It's reproducible on an x86_64 box, but doesn't happen on x86_32.

    This is because tsk->sighand is not guarded by RCU, so we have to
    hold tasklist_lock, just as what out_of_memory() does.

    Signed-off-by: Li Zefan
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Paul Menage
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     

20 Mar, 2008

1 commit


05 Mar, 2008

1 commit

  • Rename Memory Controller to Memory Resource Controller. Reflect the same
    changes in the CONFIG definition for the Memory Resource Controller. Group
    together the config options for Resource Counters and Memory Resource
    Controller.

    Signed-off-by: Balbir Singh
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh