31 Jan, 2014

1 commit

  • A bonus of 3% of system memory is sometimes excessive in comparison to
    other processes.

    With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
    killer tries to avoid killing privileged tasks by subtracting 3% of
    overall memory (system or cgroup) from their per-task consumption. But
    as a result, all root tasks that consume less than 3% of overall memory
    are considered equal, and so it only takes 33+ privileged tasks pushing
    the system out of memory for the OOM killer to do something stupid and
    kill dhclient or other root-owned processes. For example, on a 32G
    machine it can't tell the difference between the 1M agetty and the 10G
    fork bomb member.

    The changelog describes this 3% boost as the equivalent to the global
    overcommit limit being 3% higher for privileged tasks, but this is not
    the same as discounting 3% of overall memory from _every privileged task
    individually_ during OOM selection.

    Replace the 3% of system memory bonus with a 3% of current memory usage
    bonus.

    By giving root tasks a bonus that is proportional to their actual size,
    they remain comparable even when relatively small. In the example
    above, the OOM killer will discount the 1M agetty's 256 badness points
    down to about 249, and the 10G fork bomb's 262144 points down to about
    254280 points, and make the right choice, instead of discounting both to
    0 and killing agetty because it's first in the task list.
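
    As a rough illustration of the proportional bonus, the following C
    sketch subtracts 3% of a task's own points using integer arithmetic.
    The helper name discount_root_points() is hypothetical and not taken
    from the patch. With it, 256 points become 249 and 262144 points become
    254280, so the two tasks stay clearly distinguishable (a 30% discount
    would give roughly 179 and 183500).

```c
/* Sketch only: discount_root_points() is a hypothetical helper name. */
long discount_root_points(long points)
{
    /* Subtract 3% of the task's own points; integer division truncates. */
    return points - (points * 3) / 100;
}
```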

    Signed-off-by: David Rientjes
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

24 Jan, 2014

1 commit

  • When two threads have the same badness score, it's preferable to kill
    the thread group leader so that the actual process name is printed to
    the kernel log rather than the thread group name which may be shared
    amongst several processes.

    This was the behavior when select_bad_process() used to do
    for_each_process(), but it now iterates threads instead and leads to
    ambiguity.
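
    The tie-breaking rule can be sketched in plain C as follows. The
    struct and function names are hypothetical stand-ins, not kernel code:
    on equal scores the current choice is kept only if it is already a
    thread group leader, so a leader is not displaced by a later thread
    with the same score.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a thread seen during the scan. */
struct thread_sketch {
    unsigned long points;   /* badness score */
    bool is_group_leader;   /* true for the thread group leader */
};

/* Return the index of the chosen victim, or -1 if the array is empty. */
int select_victim(const struct thread_sketch *t, size_t n)
{
    int chosen = -1;

    for (size_t i = 0; i < n; i++) {
        if (chosen >= 0) {
            if (t[i].points < t[chosen].points)
                continue;
            /* On a tie, keep the current choice if it is a group leader. */
            if (t[i].points == t[chosen].points &&
                t[chosen].is_group_leader)
                continue;
        }
        chosen = (int)i;
    }
    return chosen;
}
```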

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

22 Jan, 2014

3 commits

  • find_lock_task_mm() expects to be called under RCU or the tasklist lock,
    but it seems that at least oom_unkillable_task()->task_in_mem_cgroup()
    and mem_cgroup_out_of_memory()->oom_badness() can call it without either.

    Perhaps we could fix the callers, but this patch simply takes the RCU
    read lock inside find_lock_task_mm(). This also allows one of its
    callers, oom_kill_process(), to be simplified a bit.

    Signed-off-by: Oleg Nesterov
    Cc: Sergey Dyasly
    Cc: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • At least out_of_memory() calls has_intersects_mems_allowed() without
    even rcu_read_lock(); this is obviously buggy.

    Add the necessary rcu_read_lock(). This means that we cannot simply
    return from the loop; we need "bool ret" and "break".

    While at it, swap the names of task_struct's (the argument and the
    local). This cleans up the code a little bit and avoids the unnecessary
    initialization.
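
    The "bool ret" plus "break" shape can be sketched in userspace C as
    below. The lock functions are stand-ins that only count nesting depth,
    and any_match() is a hypothetical predicate loop, not the kernel
    function; the point is that every exit path passes through the unlock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-ins for rcu_read_lock()/rcu_read_unlock(); they only
 * track nesting depth so that the pairing can be checked. */
static int rcu_depth;
static void rcu_read_lock_sketch(void)   { rcu_depth++; }
static void rcu_read_unlock_sketch(void) { rcu_depth--; }

/* Hypothetical predicate loop: does any element equal "wanted"? */
bool any_match(const int *vals, size_t n, int wanted)
{
    bool ret = false;

    rcu_read_lock_sketch();
    for (size_t i = 0; i < n; i++) {
        if (vals[i] == wanted) {
            ret = true;
            break;      /* not "return true": the lock must be dropped */
        }
    }
    rcu_read_unlock_sketch();
    return ret;
}
```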

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change oom_kill.c to use for_each_thread() rather than the racy
    while_each_thread() which can loop forever if we race with exit.

    Note also that most users would have been buggy even if
    while_each_thread() were fine, because the task can exit even _before_
    rcu_read_lock().

    Fortunately the new for_each_thread() only requires the stable
    task_struct, so this change fixes both problems.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Nov, 2013

1 commit

  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.
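
    A userspace analogue of this conversion, using C11 atomics rather than
    the kernel's atomic_long_t helpers, might look like the sketch below.
    struct mm_sketch and the mm_sketch_*() functions are hypothetical names
    used only for illustration; the counter can be updated without holding
    any lock.

```c
#include <stdatomic.h>

/* Hypothetical, reduced stand-in for struct mm_struct. */
struct mm_sketch {
    atomic_long nr_ptes;    /* number of page table pages */
};

void mm_sketch_init(struct mm_sketch *mm)
{
    atomic_init(&mm->nr_ptes, 0);
}

/* Lock-free updates: no page_table_lock is needed around these. */
void mm_sketch_inc_ptes(struct mm_sketch *mm)
{
    atomic_fetch_add(&mm->nr_ptes, 1);
}

void mm_sketch_dec_ptes(struct mm_sketch *mm)
{
    atomic_fetch_sub(&mm->nr_ptes, 1);
}

long mm_sketch_read_ptes(struct mm_sketch *mm)
{
    return atomic_load(&mm->nr_ptes);
}
```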

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Oct, 2013

1 commit

  • Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") assumed that only a few places that can trigger a
    memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
    readahead. But there are many more and it's impractical to annotate
    them all.

    First of all, we don't want to invoke the OOM killer when the failed
    allocation is gracefully handled, so defer the actual kill to the end of
    the fault handling as well. As an added bonus, this simplifies the code
    quite a bit.

    Second, since a failed allocation might not be the abrupt end of the
    fault, the memcg OOM handler needs to be re-entrant until the fault
    finishes for subsequent allocation attempts. If an allocation is
    attempted after the task already OOMed, allow it to bypass the limit so
    that it can quickly finish the fault and invoke the OOM killer.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Sep, 2013

1 commit

  • The memcg OOM handling is incredibly fragile and can deadlock. When a
    task fails to charge memory, it invokes the OOM killer and loops right
    there in the charge code until it succeeds. Comparably, any other task
    that enters the charge path at this point will go to a waitqueue right
    then and there and sleep until the OOM situation is resolved. The problem
    is that these tasks may hold filesystem locks and the mmap_sem; locks that
    the selected OOM victim may need to exit.

    For example, in one reported case, the task invoking the OOM killer was
    about to charge a page cache page during a write(), which holds the
    i_mutex. The OOM killer selected a task that was just entering truncate()
    and trying to acquire the i_mutex:

    OOM invoking task:
    mem_cgroup_handle_oom+0x241/0x3b0
    mem_cgroup_cache_charge+0xbe/0xe0
    add_to_page_cache_locked+0x4c/0x140
    add_to_page_cache_lru+0x22/0x50
    grab_cache_page_write_begin+0x8b/0xe0
    ext3_write_begin+0x88/0x270
    generic_file_buffered_write+0x116/0x290
    __generic_file_aio_write+0x27c/0x480
    generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
    do_sync_write+0xea/0x130
    vfs_write+0xf3/0x1f0
    sys_write+0x51/0x90
    system_call_fastpath+0x18/0x1d

    OOM kill victim:
    do_truncate+0x58/0xa0 # takes i_mutex
    do_last+0x250/0xa30
    path_openat+0xd7/0x440
    do_filp_open+0x49/0xa0
    do_sys_open+0x106/0x240
    sys_open+0x20/0x30
    system_call_fastpath+0x18/0x1d

    The OOM handling task will retry the charge indefinitely while the OOM
    killed task is not releasing any resources.

    A similar scenario can happen when the kernel OOM killer for a memcg is
    disabled and a userspace task is in charge of resolving OOM situations.
    In this case, ALL tasks that enter the OOM path will be made to sleep on
    the OOM waitqueue and wait for userspace to free resources or increase
    the group's limit. But a userspace OOM handler is prone to deadlock
    itself on the locks held by the waiting tasks. For example one of the
    sleeping tasks may be stuck in a brk() call with the mmap_sem held for
    writing but the userspace handler, in order to pick an optimal victim,
    may need to read files from /proc/, which tries to acquire the same
    mmap_sem for reading and deadlocks.

    This patch changes the way tasks behave after detecting a memcg OOM and
    makes sure nobody loops or sleeps with locks held:

    1. When OOMing in a user fault, invoke the OOM killer and restart the
    fault instead of looping on the charge attempt. This way, the OOM
    victim can not get stuck on locks the looping task may hold.

    2. When OOMing in a user fault but somebody else is handling it
    (either the kernel OOM killer or a userspace handler), don't go to
    sleep in the charge context. Instead, remember the OOMing memcg in
    the task struct and then fully unwind the page fault stack with
    -ENOMEM. pagefault_out_of_memory() will then call back into the
    memcg code to check if the -ENOMEM came from the memcg, and then
    either put the task to sleep on the memcg's OOM waitqueue or just
    restart the fault. The OOM victim can no longer get stuck on any
    lock a sleeping task may hold.

    Debugged by Michal Hocko.

    Signed-off-by: Johannes Weiner
    Reported-by: azurIt
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2013

1 commit


24 Feb, 2013

1 commit

  • Currently, when a memcg OOM happens, the OOM dump messages still show
    global state and provide little useful information to users. This patch
    prints more targeted memcg page statistics for memcg OOMs.

    Based on Michal's advice, we take the hierarchy into consideration.
    Suppose we trigger an OOM on A's limit:

        root_memcg
            |
            A (use_hierarchy=1)
           / \
          B   C
          |
          D

    then the printed info will be:

    Memory cgroup stats for /A:...
    Memory cgroup stats for /A/B:...
    Memory cgroup stats for /A/C:...
    Memory cgroup stats for /A/B/D:...

    Following are samples of oom output:

    (1) Before change:

    mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
    mal-80 cpuset=/ mems_allowed=0
    Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
    Call Trace:
    [] dump_header+0x83/0x1ca
    ..... (call trace)
    [] page_fault+0x28/0x30
    <<<<<<<<<<<<<<<<<<<<< memcg specific information
    Task in /A/B/D killed as a result of limit of /A
    memory: usage 101376kB, limit 101376kB, failcnt 57
    memory+swap: usage 101376kB, limit 101376kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
    <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
    Mem-Info:
    Node 0 DMA per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    ......
    CPU 3: hi: 0, btch: 1 usd: 0
    Node 0 DMA32 per-cpu:
    CPU 0: hi: 186, btch: 31 usd: 173
    ......
    CPU 3: hi: 186, btch: 31 usd: 130
    <<<<<<<<<<<<<<<<<<<<< print global page state
    active_anon:92963 inactive_anon:40777 isolated_anon:0
    active_file:33027 inactive_file:51718 isolated_file:0
    unevictable:0 dirty:3 writeback:0 unstable:0
    free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
    mapped:20278 shmem:35971 pagetables:5885 bounce:0
    free_cma:0
    <<<<<<<<<<<<<<<<<<<<< print per zone page state
    Node 0 DMA free:15836kB ... all_unreclaimable? no
    lowmem_reserve[]: 0 3175 3899 3899
    Node 0 DMA32 free:2888564kB ... all_unreclaimable? no
    lowmem_reserve[]: 0 0 724 724
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
    Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
    120710 total pagecache pages
    0 pages in swap cache
    <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 499708kB
    Total swap = 499708kB
    1040368 pages RAM
    58678 pages reserved
    169065 pages shared
    173632 pages non-shared
    [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
    [ 2693] 0 2693 6005 1324 17 0 0 god
    [ 2754] 0 2754 6003 1320 16 0 0 god
    [ 2811] 0 2811 5992 1304 18 0 0 god
    [ 2874] 0 2874 6005 1323 18 0 0 god
    [ 2935] 0 2935 8720 7742 21 0 0 mal-30
    [ 2976] 0 2976 21520 17577 42 0 0 mal-80
    Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
    Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB

    We can see that the messages dumped by show_free_areas() are lengthy and
    provide little useful information about the memcg that just hit OOM.

    (2) After change
    mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
    mal-80 cpuset=/ mems_allowed=0
    Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
    Call Trace:
    [] dump_header+0x83/0x1d1
    .......(call trace)
    [] page_fault+0x28/0x30
    Task in /A/B/D killed as a result of limit of /A
    <<<<<<<<<<<<<<<<<<<<< memcg specific information
    memory: usage 102400kB, limit 102400kB, failcnt 140
    memory+swap: usage 102400kB, limit 102400kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
    Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
    [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
    [ 2260] 0 2260 6006 1325 18 0 0 god
    [ 2383] 0 2383 6003 1319 17 0 0 god
    [ 2503] 0 2503 6004 1321 18 0 0 god
    [ 2622] 0 2622 6004 1321 16 0 0 god
    [ 2695] 0 2695 8720 7741 22 0 0 mal-30
    [ 2704] 0 2704 21520 17839 43 0 0 mal-80
    Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
    Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB

    This version provides more targeted information for the memcg in the
    "Memory cgroup stats for XXX" section.

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     

13 Dec, 2012

3 commits

  • out_of_memory() will already cause current to schedule if it has not been
    killed, so doing it again in pagefault_out_of_memory() is redundant.
    Remove it.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • To lock the entire system from parallel oom killing, it's possible to pass
    in a zonelist with all zones rather than using for_each_populated_zone()
    for the iteration. This obsoletes try_set_system_oom() and
    clear_system_oom() so that they can be removed.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory, so we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

3 commits

  • test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
    specify that current should be killed first if an oom condition occurs in
    between the two calls.

    The usage is

    short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
    ...
    compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);

    to store the thread's oom_score_adj, temporarily change it to the maximum
    score possible, and then restore the old value if it is still the same.

    This happens to still be racy, however, if the user writes
    OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
    The compare_swap_oom_score_adj() will then incorrectly reset the old value
    prior to the write of OOM_SCORE_ADJ_MAX.

    To fix this, introduce a new oom_flags_t member in struct signal_struct
    that will be used for per-thread oom killer flags. KSM and swapoff can
    now use a bit in this member to specify that threads should be killed
    first in oom conditions without playing around with oom_score_adj.

    This also allows the correct oom_score_adj to always be shown when reading
    /proc/pid/oom_score.
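
    The idea of a dedicated flag bit can be sketched in userspace C as
    follows. All names here (the _sketch types, OOM_FLAG_ORIGIN_SKETCH, and
    the three helpers) are hypothetical and only model the approach:
    marking a task as "kill first" never reads or writes oom_score_adj, so
    a concurrent write from userspace cannot be clobbered.

```c
#include <stdbool.h>

typedef unsigned short oom_flags_sketch_t;  /* stand-in for oom_flags_t */
#define OOM_FLAG_ORIGIN_SKETCH ((oom_flags_sketch_t)0x1)

/* Hypothetical, reduced stand-in for struct signal_struct. */
struct signal_sketch {
    short oom_score_adj;            /* user-visible value, untouched here */
    oom_flags_sketch_t oom_flags;   /* kernel-internal flags */
};

/* Mark the task as preferred victim without touching oom_score_adj. */
void set_oom_origin(struct signal_sketch *sig)
{
    sig->oom_flags |= OOM_FLAG_ORIGIN_SKETCH;
}

void clear_oom_origin(struct signal_sketch *sig)
{
    sig->oom_flags &= (oom_flags_sketch_t)~OOM_FLAG_ORIGIN_SKETCH;
}

bool is_oom_origin(const struct signal_sketch *sig)
{
    return (sig->oom_flags & OOM_FLAG_ORIGIN_SKETCH) != 0;
}
```

    A write to oom_score_adj between set_oom_origin() and
    clear_oom_origin() survives, because neither helper saves or restores
    the old value.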

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
    so this range can be represented by the signed short type with no
    functional change. The extra space this frees up in struct signal_struct
    will be used for per-thread oom kill flags in the next patch.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Exiting threads, those with PF_EXITING set, can pagefault and require
    memory before they can make forward progress. This happens, for instance,
    when a process must fault task->robust_list, a userspace structure, before
    detaching its memory.

    These threads also aren't guaranteed to get access to memory reserves
    unless oom killed or killed from userspace. The oom killer won't grant
    memory reserves if other threads are also exiting other than current and
    stalling at the same point. This prevents needlessly killing processes
    when others are already exiting.

    Instead of special casing all the possible situations between PF_EXITING
    getting set and a thread detaching its mm where it may allocate memory,
    which probably wouldn't get updated when a change is made to the exit
    path, the solution is to give all exiting threads access to memory
    reserves if they call the oom killer. This allows them to quickly
    allocate, detach their mm, and free the memory it represents.

    Summary of Luigi's bug report:

    : He had an oom condition where threads were faulting on task->robust_list
    : and repeatedly called the oom killer but it would defer killing a thread
    : because it saw other PF_EXITING threads. This can happen anytime we need
    : to allocate memory after setting PF_EXITING and before detaching our mm;
    : if there are other threads in the same state then the oom killer won't do
    : anything unless one of them happens to be killed from userspace.
    :
    : So instead of only deferring for PF_EXITING and !task->robust_list, it's
    : better to just give them access to memory reserves to prevent a potential
    : livelock so that any other faults that may be introduced in the future in
    : the exit path don't cause the same problem (and hopefully we don't allow
    : too many of those!).

    Signed-off-by: David Rientjes
    Acked-by: Minchan Kim
    Tested-by: Luigi Semenzato
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

1 commit


01 Aug, 2012

8 commits

  • By globally defining check_panic_on_oom(), the memcg oom handler can be
    moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the
    oom killer and cleans up the code.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
    try to reduce the amount of time the readside is held for oom kills. This
    makes the interface with the memcg oom handler more consistent since it
    now never needs to take tasklist_lock unnecessarily.

    The only time the oom killer now takes tasklist_lock is when iterating the
    children of the selected task; everything else is protected by
    rcu_read_lock().

    This requires that a reference to the selected process, p, is grabbed
    before calling oom_kill_process(). It may release it and grab a reference
    on another one of p's threads if !p->mm, but it also guarantees that it
    will release the reference before returning.

    [hughd@google.com: fix duplicate put_task_struct()]
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The global oom killer is serialized by the per-zonelist
    try_set_zonelist_oom() which is used in the page allocator. Concurrent
    oom kills are thus a rare event and only occur in systems using
    mempolicies and with a large number of nodes.

    Memory controller oom kills, however, can frequently be concurrent since
    there is no serialization once the oom killer is called for oom conditions
    in several different memcgs in parallel.

    This creates a massive contention on tasklist_lock since the oom killer
    requires the readside for the tasklist iteration. If several memcgs are
    calling the oom killer, this lock can be held for a substantial amount of
    time, especially if threads continue to enter it as other threads are
    exiting.

    Since the exit path grabs the writeside of the lock with irqs disabled in
    a few different places, this can cause a soft lockup on cpus as a result
    of tasklist_lock starvation.

    The kernel lacks unfair writelocks, and successful calls to the oom killer
    usually result in at least one thread entering the exit path, so an
    alternative solution is needed.

    This patch introduces a separate oom handler for memcgs so that they do
    not require tasklist_lock for as much time. Instead, it iterates only
    over the threads attached to the oom memcg and grabs a reference to the
    selected thread before calling oom_kill_process() to ensure it doesn't
    prematurely exit.

    This still requires tasklist_lock for the tasklist dump, iterating
    children of the selected process, and killing all other threads on the
    system sharing the same memory as the selected victim. So while this
    isn't a complete solution to tasklist_lock starvation, it significantly
    reduces the amount of time that it is held.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch introduces a helper function to process each thread during the
    iteration over the tasklist. A new return type, enum oom_scan_t, is
    defined to determine the future behavior of the iteration:

    - OOM_SCAN_OK: continue scanning the thread and find its badness,

    - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
    ineligible,

    - OOM_SCAN_ABORT: abort the iteration and return, or

    - OOM_SCAN_SELECT: always select this thread with the highest badness
    possible.

    There is no functional change with this patch. This new helper function
    will be used in the next patch in the memory controller.
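
    How a scan loop might act on the four return values is sketched below
    in userspace C. The enum mirrors the names listed above; struct
    scan_entry and scan_select() are hypothetical, and each entry simply
    carries the verdict the helper would have returned for that thread.

```c
#include <stddef.h>

enum oom_scan_sketch {
    OOM_SCAN_OK,        /* score the thread normally */
    OOM_SCAN_CONTINUE,  /* skip: ineligible */
    OOM_SCAN_ABORT,     /* stop scanning and select nothing */
    OOM_SCAN_SELECT,    /* pick this thread with the highest score */
};

/* Hypothetical per-thread input to the loop. */
struct scan_entry {
    enum oom_scan_sketch verdict;   /* what the helper returned */
    unsigned long points;           /* badness, used for OOM_SCAN_OK */
};

/* Return the index of the selected entry, or -1 if none or aborted. */
int scan_select(const struct scan_entry *e, size_t n)
{
    int chosen = -1;
    unsigned long chosen_points = 0;

    for (size_t i = 0; i < n; i++) {
        switch (e[i].verdict) {
        case OOM_SCAN_SELECT:
            chosen = (int)i;
            chosen_points = (unsigned long)-1;  /* cannot be beaten */
            continue;
        case OOM_SCAN_CONTINUE:
            continue;
        case OOM_SCAN_ABORT:
            return -1;
        case OOM_SCAN_OK:
            break;
        }
        if (e[i].points > chosen_points) {
            chosen = (int)i;
            chosen_points = e[i].points;
        }
    }
    return chosen;
}
```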

    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The number of ptes and swap entries are used in the oom killer's badness
    heuristic, so they should be shown in the tasklist dump.

    This patch adds those fields and replaces the cpu and oom_adj values that
    are currently emitted. Cpu isn't interesting, and oom_adj is deprecated
    and will be removed later this year; the same information is already
    displayed as oom_score_adj, which is used internally.

    At the same time, make the documentation a little more clear to state this
    information is helpful to determine why the oom killer chose the task it
    did to kill.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • /proc/sys/vm/oom_kill_allocating_task will immediately kill current when
    the oom killer is called to avoid a potentially expensive tasklist scan
    for large systems.

    Currently, however, it is not checking current's oom_score_adj value which
    may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
    killing.

    This patch avoids killing current in such a condition and simply falls
    back to the tasklist scan since memory still needs to be freed.
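
    The added check can be sketched as a small predicate. The function
    name and its parameters are hypothetical; only the OOM_SCORE_ADJ_MIN
    value of -1000 and the fall-back behaviour come from the descriptions
    in this log.

```c
#include <stdbool.h>

#define OOM_SCORE_ADJ_MIN (-1000)

/* Hypothetical predicate: may the allocating task be killed directly?
 * A false result means the caller falls back to the tasklist scan. */
bool can_kill_current_directly(bool oom_kill_allocating_task,
                               short current_oom_score_adj)
{
    if (!oom_kill_allocating_task)
        return false;
    /* A task at OOM_SCORE_ADJ_MIN is exempt from oom killing. */
    return current_oom_score_adj != OOM_SCORE_ADJ_MIN;
}
```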

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The oom killer currently schedules away from current in an uninterruptible
    sleep if it does not have access to memory reserves. It's possible that
    current was killed because it shares memory with the oom killed thread or
    because it was killed by the user in the interim, however.

    This patch only schedules away from current if it does not have a pending
    kill, i.e. if it does not share memory with the oom killed thread. It's
    possible that it will immediately retry its memory allocation and fail,
    but it will immediately be given access to memory reserves if it calls the
    oom killer again.

    This prevents the delay of memory freeing when threads that share memory
    with the oom killed thread get unnecessarily scheduled.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jun, 2012

2 commits

  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The divide in p->signal->oom_score_adj * totalpages / 1000 within
    oom_badness() was causing an overflow of the signed long data type.

    This adds both the root bias and p->signal->oom_score_adj before doing the
    normalization which fixes the issue and also cleans up the calculation.
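
    One way to read this is sketched below: the root bias and the
    oom_score_adj value are summed first, and the result is scaled by
    totalpages / 1000 in a single step. The helper name is hypothetical,
    and dividing before multiplying is an assumption about how the
    calculation is arranged, not a quote of the patch.

```c
/* Hypothetical helper: adjustment (in pages) added to a task's points. */
long oom_adjustment(long oom_score_adj, int is_root, long totalpages)
{
    long adj = oom_score_adj;

    if (is_root)
        adj -= 30;      /* root bias: 3% on the -1000..1000 scale */

    /* Divide first so the product stays small relative to long. */
    return adj * (totalpages / 1000);
}
```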

    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Jun, 2012

1 commit

  • If the privileges given to root threads (3% of allowable memory) or a
    negative value of /proc/pid/oom_score_adj happen to exceed the amount of
    rss of a thread, its badness score overflows as a result of commit
    a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only
    for userspace").

    Fix this by making the type signed and return 1, meaning the thread is
    still eligible for kill, if the value is negative.
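
    A minimal sketch of the clamping, with a hypothetical function name.
    Treating a score of exactly zero like a negative one is an assumption
    made here so that the result is always at least 1.

```c
/* Hypothetical helper: keep a task eligible even if bonuses exceed its rss. */
long clamp_badness(long points)
{
    /* Signed arithmetic lets the value go negative instead of wrapping. */
    return points > 0 ? points : 1;
}
```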

    Reported-by: Dave Jones
    Acked-by: Oleg Nesterov
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
     

30 May, 2012

1 commit

  • The oom_score_adj scale ranges from -1000 to 1000 and represents the
    proportion of memory available to the process at allocation time. This
    means an oom_score_adj value of 300, for example, will bias a process as
    though it was using an extra 30.0% of available memory and a value of
    -350 will discount 35.0% of available memory from its usage.

    The oom killer badness heuristic also uses this scale to report the oom
    score for each eligible process in determining the "best" process to
    kill. Thus, it can only differentiate each process's memory usage by
    0.1% of system RAM.

    On large systems, this can end up being a large amount of memory: 256MB
    on 256GB systems, for example.

    This can be fixed by having the badness heuristic use the actual
    memory usage in scoring threads and then normalize it to the
    oom_score_adj scale for userspace. This results in better comparison
    between eligible threads for kill and no change from the userspace
    perspective.
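
    The loss of resolution is easy to see with a small calculation. The
    sketch below uses a hypothetical normalize_points() helper and assumes
    4 KiB pages, so a 256GB system has 67108864 pages: tasks using 100000
    and 130000 pages both normalize to 1, although their raw page counts
    differ by more than 100MB.

```c
/* Hypothetical helper: map a raw badness value in pages to 0..1000. */
long normalize_points(unsigned long points, unsigned long totalpages)
{
    return (long)(points * 1000 / totalpages);
}
```

    Comparing the raw page counts and normalizing only when reporting to
    userspace keeps such tasks distinguishable during selection.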

    Suggested-by: KOSAKI Motohiro
    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

03 May, 2012

1 commit


24 Mar, 2012

1 commit

  • Change oom_kill_task() to use do_send_sig_info(SEND_SIG_FORCED) instead
    of force_sig(SIGKILL). With the recent changes we do not need force_ to
    kill the CLONE_NEWPID tasks.

    And this is more correct. force_sig() can race with the exiting thread
    even if oom_kill_task() checks p->mm != NULL, while
    do_send_sig_info(group => true) kills the whole process.

    Signed-off-by: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Anton Vorontsov
    Cc: "Eric W. Biederman"
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

22 Mar, 2012

6 commits

  • The oom killer typically displays the allocation order at the time of oom
    as a part of its diagnostic messages (for global, cpuset, and mempolicy
    ooms).

    The memory controller may also pass the charge order to the oom killer so
    it can emit the same information. This is useful in determining how large
    the memory allocation is that triggered the oom killer.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The oom killer chooses not to kill a thread if:

    - an eligible thread has already been oom killed and has yet to exit,
    and

    - an eligible thread is exiting but has yet to free all its memory and
    is not the thread attempting to currently allocate memory.

    SysRq+F manually invokes the global oom killer to kill a memory-hogging
    task. This is normally done as a last resort to free memory when no
    progress is being made or to test the oom killer itself.

    For both uses, we always want to kill a thread and never defer. This
    patch causes SysRq+F to always kill an eligible thread and can be used to
    force a kill even if another oom killed thread has failed to exit.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Acked-by: Pekka Enberg
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • printk_ratelimit() uses the global ratelimit state for all printks. The
    oom killer should not be subjected to this state just because another
    subsystem or driver may be flooding the kernel log.

    This patch introduces printk ratelimiting specifically for the oom killer.

    Signed-off-by: David Rientjes
    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • If a thread is chosen for oom kill and is already PF_EXITING, then the oom
    killer simply sets TIF_MEMDIE and returns. This allows the thread to have
    access to memory reserves so that it may quickly exit. This logic is
    preceded by a comment saying there's no need to alarm the sysadmin.
    This patch adds truth to that statement.

    There's no need to emit any warning about the oom condition if the thread
    is already exiting since it will not be killed. In this condition, just
    silently return from the oom killer since it's only giving access to
    memory reserves and is otherwise a no-op.

    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • oom_kill_task() has a single caller, so fold it into its parent function,
    oom_kill_process(). Slightly reduces the number of lines in the oom
    killer.

    Acked-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • oom_kill_task() returns non-zero iff the chosen process does not have any
    threads with an attached ->mm.

    In such a case, it's better to just return to the page allocator and retry
    the allocation because memory could have been freed in the interim and the
    oom condition may no longer exist. It's unnecessary to loop in the oom
    killer and find another thread to kill.

    This allows both oom_kill_task() and oom_kill_process() to be converted to
    void functions. If the oom condition persists, the oom killer will be
    recalled.

    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
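
The printk ratelimiting change in the 22 Mar, 2012 batch above is easier to follow with a small model. The following is a userspace sketch under stated assumptions, not the kernel's ratelimit implementation: the struct, its fields, and ratelimit_allow() are hypothetical names, and time is passed in as a plain counter.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-subsystem ratelimit state. */
struct ratelimit {
    uint64_t interval;  /* window length, in arbitrary time units */
    unsigned burst;     /* messages allowed per window */
    uint64_t begin;     /* start of the current window */
    unsigned printed;   /* messages already emitted in this window */
};

/* Returns true if a message may be emitted at time 'now'. */
static bool ratelimit_allow(struct ratelimit *rs, uint64_t now)
{
    /* Open a fresh window once the previous one has elapsed. */
    if (now - rs->begin >= rs->interval) {
        rs->begin = now;
        rs->printed = 0;
    }

    if (rs->printed < rs->burst) {
        rs->printed++;
        return true;
    }
    return false;
}
```

If a noisy driver exhausts the burst of a shared state, the next oom message checked against that same state is dropped. Checked against its own state, it still gets through, which is the behavior the commit is after.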
     

13 Jan, 2012

2 commits


11 Jan, 2012

1 commit

  • oom_score_adj is used for guarding processes from the OOM killer. One
    problem is that it is inherited at fork(). When a daemon sets
    oom_score_adj and creates children, it's hard to know where the value
    was set.

    This patch adds three tracepoints that are useful for debugging:
    - creating a new task
    - renaming a task (exec)
    - setting oom_score_adj

    To debug, users need to enable the tracepoints. Filtering may be useful, e.g.:

    # EVENT=/sys/kernel/debug/tracing/events/task/
    # echo "oom_score_adj != 0" > $EVENT/task_newtask/filter
    # echo "oom_score_adj != 0" > $EVENT/task_rename/filter
    # echo 1 > $EVENT/enable
    # EVENT=/sys/kernel/debug/tracing/events/oom/
    # echo 1 > $EVENT/enable

    The output will look like this:
    # grep oom /sys/kernel/debug/tracing/trace
    bash-7699 [007] d..3 5140.744510: oom_score_adj_update: pid=7699 comm=bash oom_score_adj=-1000
    bash-7699 [007] ...1 5151.818022: task_newtask: pid=7729 comm=bash clone_flags=1200011 oom_score_adj=-1000
    ls-7729 [003] ...2 5151.818504: task_rename: pid=7729 oldcomm=bash newcomm=ls oom_score_adj=-1000
    bash-7699 [002] ...1 5175.701468: task_newtask: pid=7730 comm=bash clone_flags=1200011 oom_score_adj=-1000
    grep-7730 [007] ...2 5175.701993: task_rename: pid=7730 oldcomm=bash newcomm=grep oom_score_adj=-1000

    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
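
To follow an inherited oom_score_adj through trace output like the sample in the 11 Jan, 2012 commit above, a reader might extract the pid and value from each line. The sketch below assumes the "key=value" layout shown in that sample; field_int() and parse_trace_line() are hypothetical helpers, not part of any kernel or tracing tool.

```c
#include <stdio.h>
#include <string.h>

/* Find "key" in the line and parse the integer that follows it.
 * Returns 1 on success, 0 if the key or a number is missing. */
static int field_int(const char *line, const char *key, int *out)
{
    const char *p = strstr(line, key);

    return p && sscanf(p + strlen(key), "%d", out) == 1;
}

/* Extract pid and oom_score_adj from one trace line.
 * The leading space in each key avoids matching the event name
 * "oom_score_adj_update" or other fields that merely end in "pid". */
static int parse_trace_line(const char *line, int *pid, int *adj)
{
    return field_int(line, " pid=", pid) &&
           field_int(line, " oom_score_adj=", adj);
}
```

Applied to the first sample line, this yields pid 7699 and oom_score_adj -1000. Grouping later task_newtask lines by the same value shows which children inherited it.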