14 Dec, 2014

2 commits

  • After the previous patch we can remove the PT_TRACE_EXIT check in
    oom_scan_process_thread(). It was added to handle the case when
    coredumping was "frozen" by ptrace, but it doesn't really work. If
    nothing else, we would need to check all threads which could share the
    same ->mm to make it more or less correct.

    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Cc: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • oom_kill.c assumes that a PF_EXITING task will exit and free its memory
    soon. This is wrong in many ways, and one important case is the
    coredump: a task can sleep in exit_mm() "forever" while the coredumping
    sub-thread can need more memory.

    Change the PF_EXITING checks to take SIGNAL_GROUP_COREDUMP into account;
    we add a new trivial helper for that (see the sketch after this entry).

    Note: this is only the first step; this patch doesn't try to solve other
    problems. The SIGNAL_GROUP_COREDUMP check is obviously racy: a task can
    participate in a coredump after it was already observed in the
    PF_EXITING state, so TIF_MEMDIE (which also blocks the oom-killer) can
    still be wrongly set. fatal_signal_pending() can be true because of
    SIGNAL_GROUP_COREDUMP, so out_of_memory() and
    mem_cgroup_out_of_memory() shouldn't blindly trust it. And even the
    name/usage of the new helper is confusing: an exiting thread can only
    free its ->mm if it is the only/last task in the thread group.

    [akpm@linux-foundation.org: add comment]
    Signed-off-by: Oleg Nesterov
    Cc: Cong Wang
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: "Rafael J. Wysocki"
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
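
    The entry does not name the new helper; in mainline it became
    task_will_free_mem() in include/linux/oom.h. A minimal userspace model
    of the check, with flag values and struct layout as illustrative
    stand-ins for the kernel definitions:

        /* A task is only assumed to free its memory soon if it is exiting
         * AND its thread group is not dumping core. */
        #include <stdbool.h>
        #include <stdio.h>

        #define PF_EXITING            0x00000004u  /* illustrative values */
        #define SIGNAL_GROUP_COREDUMP 0x00000008u

        struct signal { unsigned int flags; };
        struct task   { unsigned int flags; struct signal signal; };

        static bool task_will_free_mem(const struct task *task)
        {
            return (task->flags & PF_EXITING) &&
                   !(task->signal.flags & SIGNAL_GROUP_COREDUMP);
        }

        int main(void)
        {
            struct task plain_exit = { .flags = PF_EXITING };
            struct task coredump   = { .flags = PF_EXITING,
                                       .signal = { .flags = SIGNAL_GROUP_COREDUMP } };

            printf("plain exit -> %d\n", task_will_free_mem(&plain_exit)); /* 1 */
            printf("coredump   -> %d\n", task_will_free_mem(&coredump));   /* 0 */
            return 0;
        }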
     

12 Dec, 2014

1 commit

  • Pull cgroup update from Tejun Heo:
    "cpuset got simplified a bit. cgroup core got a fix on unified
    hierarchy and grew some effective css related interfaces which will be
    used for blkio support for writeback IO traffic which is currently
    being worked on"

    * 'for-3.19' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: implement cgroup_get_e_css()
    cgroup: add cgroup_subsys->css_e_css_changed()
    cgroup: add cgroup_subsys->css_released()
    cgroup: fix the async css offline wait logic in cgroup_subtree_control_write()
    cgroup: restructure child_subsys_mask handling in cgroup_subtree_control_write()
    cgroup: separate out cgroup_calc_child_subsys_mask() from cgroup_refresh_child_subsys_mask()
    cpuset: lock vs unlock typo
    cpuset: simplify cpuset_node_allowed API
    cpuset: convert callback_mutex to a spinlock

    Linus Torvalds
     

11 Dec, 2014

1 commit

  • None of the mem_cgroup_same_or_subtree() callers actually require it to
    take the RCU lock, either because they hold it themselves or they have css
    references. Remove it.

    To make the API change clear, rename the leftover helper to
    mem_cgroup_is_descendant() to match cgroup_is_descendant().

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

27 Oct, 2014

1 commit

  • The current cpuset API for checking whether a zone/node is allowed to
    allocate from looks rather awkward. We have hardwall and softwall
    versions of cpuset_node_allowed, with the softwall version doing
    literally the same as the hardwall version if __GFP_HARDWALL is passed
    to it in the gfp flags. If it isn't, the softwall version may check the
    given node against the enclosing hardwall cpuset, which requires taking
    the callback lock.

    Such a distinction was introduced by commit 02a0e53d8227 ("cpuset:
    rework cpuset_zone_allowed api"). Before it, we had a single version
    whose behavior was determined by the __GFP_HARDWALL flag. The purpose of
    the commit was to avoid sleep-in-atomic bugs when someone would
    mistakenly call the function without __GFP_HARDWALL for an atomic
    allocation. The suffixes introduced were intended to make callers think
    before using the function.

    However, since the callback lock was converted from a mutex to a
    spinlock by the previous patch, the softwall check function cannot
    sleep, and these precautions are no longer necessary.

    So let's simplify the API back to the single check (sketched after this
    entry).

    Suggested-by: David Rientjes
    Signed-off-by: Vladimir Davydov
    Acked-by: Christoph Lameter
    Acked-by: Zefan Li
    Signed-off-by: Tejun Heo

    Vladimir Davydov
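
    A rough userspace sketch of the simplified API shape described above:
    one cpuset_node_allowed() check whose strictness depends on
    __GFP_HARDWALL. The cpuset-hierarchy walk is stubbed out and the flag
    value is illustrative; only the control flow mirrors the changelog.

        #include <stdbool.h>

        #define __GFP_HARDWALL 0x20000u  /* illustrative bit value */

        /* Stand-ins for the real cpuset hierarchy walk. */
        static bool node_in_current_cpuset(int node)   { return node == 0; }
        static bool node_in_nearest_hardwall(int node) { return node <= 1; }

        /* Single check: __GFP_HARDWALL restricts the allocation to the task's
         * own cpuset; without it the allocation may fall back to the nearest
         * enclosing hardwall cpuset. */
        static bool cpuset_node_allowed(int node, unsigned int gfp_mask)
        {
            if (gfp_mask & __GFP_HARDWALL)
                return node_in_current_cpuset(node);
            return node_in_nearest_hardwall(node);
        }

        int main(void)
        {
            return (cpuset_node_allowed(1, 0) &&
                    !cpuset_node_allowed(1, __GFP_HARDWALL)) ? 0 : 1;
        }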
     

22 Oct, 2014

1 commit

  • The PM freezer relies on having all tasks frozen by the time devices are
    getting frozen, so that no task will touch them while they are being
    frozen. But the OOM killer is allowed to kill an already frozen task in
    order to handle an OOM situation. To protect against late wake-ups, the
    OOM killer is disabled after all tasks are frozen. This, however, still
    leaves a window open when a killed task didn't manage to die by the time
    freeze_processes finishes.

    Reduce the race window by checking all tasks after the OOM killer has
    been disabled. This is unfortunately still not completely race free,
    because oom_killer_disable cannot stop an already ongoing OOM killer, so
    a task might still wake up from the fridge and get killed without
    freeze_processes noticing. Full synchronization of OOM and freezer is,
    however, too heavyweight for this highly unlikely case.

    Introduce and check an oom_kills counter which gets incremented early,
    when the allocator enters the __alloc_pages_may_oom path, and only check
    all tasks if the counter changes during the freezing attempt (see the
    sketch after this entry). The counter is updated that early to reduce
    the race window, since the allocator has already checked
    oom_killer_disabled, which is set by the PM-freezing code. A false
    positive will push the PM-freezer onto a slow path, but that is not a
    big deal.

    Changes since v1
    - push the re-check loop out of freeze_processes into
      check_frozen_processes and invert the condition to make the code more
      readable, as per Rafael

    Fixes: f660daac474c6f (oom: thaw threads if oom killed thread is frozen before deferring)
    Cc: 3.2+ # 3.2+
    Signed-off-by: Michal Hocko
    Signed-off-by: Rafael J. Wysocki

    Michal Hocko
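
    A rough userspace model of the counter idea. The names oom_kills and
    check_frozen_processes come from the changelog; note_oom_kill() and the
    task scan are illustrative stand-ins for the allocator's OOM path and
    the PM freezer.

        #include <stdatomic.h>
        #include <stdbool.h>

        /* Bumped early in the allocator's OOM path so the freezer can detect
         * a kill that raced with freezing. */
        static atomic_int oom_kills;

        static void note_oom_kill(void)
        {
            atomic_fetch_add(&oom_kills, 1);
        }

        /* Stand-in for the expensive "are all tasks still frozen?" scan. */
        static bool all_tasks_frozen(void)
        {
            return true;
        }

        /* Called after the OOM killer has been disabled: only pay for the
         * full re-scan if the counter moved during the freezing attempt; a
         * false positive just means one trip through the slow path. */
        static bool check_frozen_processes(int oom_kills_before_freeze)
        {
            if (atomic_load(&oom_kills) == oom_kills_before_freeze)
                return true;
            return all_tasks_frozen();
        }

        int main(void)
        {
            int snapshot = atomic_load(&oom_kills);

            note_oom_kill();  /* simulate an OOM kill racing with the freezer */
            return check_frozen_processes(snapshot) ? 0 : 1;
        }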
     

10 Oct, 2014

1 commit

  • Page reclaim tests zone_is_reclaim_dirty(), but the site that actually
    sets this state does zone_set_flag(zone, ZONE_TAIL_LRU_DIRTY), sending
    the reader through layers of indirection just to track down a simple
    bit.

    Remove all zone flag wrappers and just use bitops against zone->flags
    directly (as sketched after this entry). It's just as readable and the
    lines are barely any longer.

    Also rename ZONE_TAIL_LRU_DIRTY to ZONE_DIRTY to match ZONE_WRITEBACK, and
    remove the zone_flags_t typedef.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
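
    A small userspace illustration of the before/after shape: plain bitops
    on zone->flags instead of a per-flag wrapper. The enum values, struct
    and bit helpers are stand-ins for the kernel definitions.

        #include <stdbool.h>
        #include <stdio.h>

        enum zone_flags { ZONE_DIRTY, ZONE_WRITEBACK };  /* bit numbers */

        struct zone { unsigned long flags; };

        static void set_zone_bit(unsigned int nr, unsigned long *addr)
        {
            *addr |= 1UL << nr;
        }

        static bool test_zone_bit(unsigned int nr, const unsigned long *addr)
        {
            return (*addr >> nr) & 1UL;
        }

        int main(void)
        {
            struct zone z = { 0 };

            /* was: zone_set_flag(&z, ZONE_TAIL_LRU_DIRTY); then later
             *      zone_is_reclaim_dirty(&z) */
            set_zone_bit(ZONE_DIRTY, &z.flags);
            printf("dirty=%d writeback=%d\n",
                   test_zone_bit(ZONE_DIRTY, &z.flags),
                   test_zone_bit(ZONE_WRITEBACK, &z.flags));
            return 0;
        }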
     

07 Aug, 2014

3 commits

  • The oom killer scans each process and determines whether it is eligible
    for oom kill or whether the oom killer should abort because of
    concurrent memory freeing. It will abort when an eligible process is
    found to have TIF_MEMDIE set, meaning it has already been oom killed and
    we're waiting for it to exit.

    Processes with task->mm == NULL should not be considered because they
    are either kthreads or have already detached their memory and killing
    them would not lead to memory freeing. That memory is only freed after
    exit_mm() has returned, however, and not when task->mm is first set to
    NULL.

    Clear TIF_MEMDIE after exit_mm()'s mmput() so that an oom killed process
    is no longer considered for oom kill, but only until exit_mm() has
    returned. This was fragile in the past because it relied on
    exit_notify() to be reached before no longer considering TIF_MEMDIE
    processes.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • try_set_zonelist_oom() and clear_zonelist_oom() are not named properly
    to imply that they require locking semantics to avoid out_of_memory()
    being reordered.

    zone_scan_lock is required for both functions to ensure that there is
    proper locking synchronization.

    Rename try_set_zonelist_oom() to oom_zonelist_trylock() and rename
    clear_zonelist_oom() to oom_zonelist_unlock() to imply there is proper
    locking semantics.

    At the same time, convert oom_zonelist_trylock() to return bool instead
    of int since only success and failure are tested.

    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • With memoryless node support being worked on, it's possible, as an
    optimization, that a node may not have a non-NULL zonelist. When
    CONFIG_NUMA is enabled and node 0 is memoryless, this means the zonelist
    for first_online_node may become NULL.

    The oom killer requires a zonelist that includes all memory zones for
    the sysrq trigger and pagefault out of memory handler.

    Ensure that a non-NULL zonelist is always passed to the oom killer.

    [akpm@linux-foundation.org: fix non-numa build]
    Signed-off-by: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

31 Jan, 2014

1 commit

  • A 3% of system memory bonus is sometimes too excessive in comparison to
    other processes.

    With commit a63d83f427fb ("oom: badness heuristic rewrite"), the OOM
    killer tries to avoid killing privileged tasks by subtracting 3% of
    overall memory (system or cgroup) from their per-task consumption. But
    as a result, all root tasks that consume less than 3% of overall memory
    are considered equal, and so it only takes 33+ privileged tasks pushing
    the system out of memory for the OOM killer to do something stupid and
    kill dhclient or other root-owned processes. For example, on a 32G
    machine it can't tell the difference between the 1M agetty and the 10G
    fork bomb member.

    The changelog describes this 3% boost as the equivalent to the global
    overcommit limit being 3% higher for privileged tasks, but this is not
    the same as discounting 3% of overall memory from _every privileged task
    individually_ during OOM selection.

    Replace the 3% of system memory bonus with a 3% of current memory usage
    bonus (the two formulas are sketched after this entry).

    By giving root tasks a bonus that is proportional to their actual size,
    they remain comparable even when relatively small. In the example
    above, the OOM killer will discount the 1M agetty's 256 badness points
    down to 179, and the 10G fork bomb's 262144 points down to 183500 points
    and make the right choice, instead of discounting both to 0 and killing
    agetty because it's first in the task list.

    Signed-off-by: David Rientjes
    Reported-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
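
    A hedged sketch of the two bonus styles compared above: the old flat
    discount of 3% of all system memory versus the new discount of 3% of the
    task's own points. Clamping and oom_score_adj handling are omitted; the
    numbers are illustrative.

        /* points are task memory usage in pages, as in oom_badness(). */
        static long old_root_bonus(long points, long totalpages)
        {
            return points - totalpages * 3 / 100;  /* flat 3% of system memory */
        }

        static long new_root_bonus(long points)
        {
            return points - points * 3 / 100;      /* 3% of the task's own usage */
        }

        int main(void)
        {
            long totalpages = 8L << 20;   /* ~32 GB worth of 4 KiB pages */
            long small = 256;             /* ~1 MB resident */
            long large = 1L << 17;        /* ~512 MB resident */

            /* Old scheme: both root tasks fall below zero and become
             * indistinguishable; new scheme: their ordering is preserved. */
            return (old_root_bonus(small, totalpages) < 0 &&
                    old_root_bonus(large, totalpages) < 0 &&
                    new_root_bonus(large) > new_root_bonus(small)) ? 0 : 1;
        }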
     

24 Jan, 2014

1 commit

  • When two threads have the same badness score, it's preferable to kill
    the thread group leader so that the actual process name is printed to
    the kernel log rather than the thread group name which may be shared
    amongst several processes.

    This was the behavior when select_bad_process() used to do
    for_each_process(), but it now iterates threads instead and leads to
    ambiguity.

    Signed-off-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Cc: Greg Thelen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

22 Jan, 2014

3 commits

  • find_lock_task_mm() expects to be called under the rcu or tasklist lock,
    but it seems that at least oom_unkillable_task()->task_in_mem_cgroup()
    and mem_cgroup_out_of_memory()->oom_badness() can call it locklessly.

    Perhaps we could fix the callers, but this patch simply adds the rcu
    lock into find_lock_task_mm(). This also allows us to simplify one of
    its callers, oom_kill_process(), a bit.

    Signed-off-by: Oleg Nesterov
    Cc: Sergey Dyasly
    Cc: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • At least out_of_memory() calls has_intersects_mems_allowed() without
    even rcu_read_lock(); this is obviously buggy.

    Add the necessary rcu_read_lock(). This means that we can no longer
    simply return from the loop; we need a "bool ret" and "break" (the
    pattern is sketched after this entry).

    While at it, swap the names of the task_structs (the argument and the
    local). This cleans up the code a little bit and avoids an unnecessary
    initialization.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
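
    The "bool ret" plus "break" shape mentioned above, in a generic
    userspace form (a plain mutex stands in for rcu_read_lock(); everything
    else is illustrative): once the loop body runs under a lock, early
    returns become an assignment plus break so the unlock always executes.

        #include <pthread.h>
        #include <stdbool.h>

        static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

        static bool matches(int value, int wanted) { return value == wanted; }

        static bool find_match(const int *values, int n, int wanted)
        {
            bool ret = false;
            int i;

            pthread_mutex_lock(&lock);
            for (i = 0; i < n; i++) {
                if (matches(values[i], wanted)) {
                    ret = true;
                    break;  /* not "return true": the unlock below must run */
                }
            }
            pthread_mutex_unlock(&lock);
            return ret;
        }

        int main(void)
        {
            int v[] = { 1, 3, 5 };

            return find_match(v, 3, 5) ? 0 : 1;
        }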
     
  • Change oom_kill.c to use for_each_thread() rather than the racy
    while_each_thread(), which can loop forever if we race with exit.

    Note also that most users were buggy even where while_each_thread() was
    fine: the task can exit even _before_ rcu_read_lock().

    Fortunately the new for_each_thread() only requires the stable
    task_struct, so this change fixes both problems.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Sergey Dyasly
    Tested-by: Sergey Dyasly
    Reviewed-by: Sameer Nanda
    Cc: "Eric W. Biederman"
    Cc: Frederic Weisbecker
    Cc: Mandeep Singh Baines
    Cc: "Ma, Xindong"
    Reviewed-by: Michal Hocko
    Cc: "Tu, Xiaobing"
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Nov, 2013

1 commit

  • With split page table lock for PMD level we can't hold mm->page_table_lock
    while updating nr_ptes.

    Let's convert it to atomic_long_t to avoid races.

    Signed-off-by: Kirill A. Shutemov
    Tested-by: Alex Thorlton
    Cc: Ingo Molnar
    Cc: Naoya Horiguchi
    Cc: "Eric W . Biederman"
    Cc: "Paul E . McKenney"
    Cc: Al Viro
    Cc: Andi Kleen
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Dave Jones
    Cc: David Howells
    Cc: Frederic Weisbecker
    Cc: Johannes Weiner
    Cc: Kees Cook
    Cc: Mel Gorman
    Cc: Michael Kerrisk
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Robin Holt
    Cc: Sedat Dilek
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

17 Oct, 2013

1 commit

  • Commit 3812c8c8f395 ("mm: memcg: do not trap chargers with full
    callstack on OOM") assumed that only a few places that can trigger a
    memcg OOM situation do not return VM_FAULT_OOM, like optional page cache
    readahead. But there are many more and it's impractical to annotate
    them all.

    First of all, we don't want to invoke the OOM killer when the failed
    allocation is gracefully handled, so defer the actual kill to the end of
    the fault handling as well. This simplifies the code quite a bit for
    added bonus.

    Second, since a failed allocation might not be the abrupt end of the
    fault, the memcg OOM handler needs to be re-entrant until the fault
    finishes for subsequent allocation attempts. If an allocation is
    attempted after the task already OOMed, allow it to bypass the limit so
    that it can quickly finish the fault and invoke the OOM killer.

    Reported-by: azurIt
    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

13 Sep, 2013

1 commit

  • The memcg OOM handling is incredibly fragile and can deadlock. When a
    task fails to charge memory, it invokes the OOM killer and loops right
    there in the charge code until it succeeds. Comparably, any other task
    that enters the charge path at this point will go to a waitqueue right
    then and there and sleep until the OOM situation is resolved. The problem
    is that these tasks may hold filesystem locks and the mmap_sem; locks that
    the selected OOM victim may need to exit.

    For example, in one reported case, the task invoking the OOM killer was
    about to charge a page cache page during a write(), which holds the
    i_mutex. The OOM killer selected a task that was just entering truncate()
    and trying to acquire the i_mutex:

    OOM invoking task:
    mem_cgroup_handle_oom+0x241/0x3b0
    mem_cgroup_cache_charge+0xbe/0xe0
    add_to_page_cache_locked+0x4c/0x140
    add_to_page_cache_lru+0x22/0x50
    grab_cache_page_write_begin+0x8b/0xe0
    ext3_write_begin+0x88/0x270
    generic_file_buffered_write+0x116/0x290
    __generic_file_aio_write+0x27c/0x480
    generic_file_aio_write+0x76/0xf0 # takes ->i_mutex
    do_sync_write+0xea/0x130
    vfs_write+0xf3/0x1f0
    sys_write+0x51/0x90
    system_call_fastpath+0x18/0x1d

    OOM kill victim:
    do_truncate+0x58/0xa0 # takes i_mutex
    do_last+0x250/0xa30
    path_openat+0xd7/0x440
    do_filp_open+0x49/0xa0
    do_sys_open+0x106/0x240
    sys_open+0x20/0x30
    system_call_fastpath+0x18/0x1d

    The OOM handling task will retry the charge indefinitely while the OOM
    killed task is not releasing any resources.

    A similar scenario can happen when the kernel OOM killer for a memcg is
    disabled and a userspace task is in charge of resolving OOM situations.
    In this case, ALL tasks that enter the OOM path will be made to sleep on
    the OOM waitqueue and wait for userspace to free resources or increase
    the group's limit. But a userspace OOM handler is prone to deadlock
    itself on the locks held by the waiting tasks. For example one of the
    sleeping tasks may be stuck in a brk() call with the mmap_sem held for
    writing but the userspace handler, in order to pick an optimal victim,
    may need to read files from /proc/, which tries to acquire the same
    mmap_sem for reading and deadlocks.

    This patch changes the way tasks behave after detecting a memcg OOM and
    makes sure nobody loops or sleeps with locks held:

    1. When OOMing in a user fault, invoke the OOM killer and restart the
    fault instead of looping on the charge attempt. This way, the OOM
    victim can not get stuck on locks the looping task may hold.

    2. When OOMing in a user fault but somebody else is handling it
    (either the kernel OOM killer or a userspace handler), don't go to
    sleep in the charge context. Instead, remember the OOMing memcg in
    the task struct and then fully unwind the page fault stack with
    -ENOMEM. pagefault_out_of_memory() will then call back into the
    memcg code to check if the -ENOMEM came from the memcg, and then
    either put the task to sleep on the memcg's OOM waitqueue or just
    restart the fault. The OOM victim can no longer get stuck on any
    lock a sleeping task may hold.

    Debugged by Michal Hocko.

    Signed-off-by: Johannes Weiner
    Reported-by: azurIt
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

15 Jul, 2013

1 commit


24 Feb, 2013

1 commit

  • Currently, when a memcg oom happens, the oom dump messages still show
    only global state and provide little useful info for users. This patch
    prints more pointed memcg page statistics for a memcg oom.

    Based on Michal's advice, we take the hierarchy into consideration:
    suppose we trigger an OOM on A's limit

        root_memcg
            |
            A (use_hierarchy=1)
           / \
          B   C
          |
          D

    then the printed info will be:

    Memory cgroup stats for /A:...
    Memory cgroup stats for /A/B:...
    Memory cgroup stats for /A/C:...
    Memory cgroup stats for /A/B/D:...

    Following are samples of oom output:

    (1) Before change:

    mal-80 invoked oom-killer:gfp_mask=0xd0, order=0, oom_score_adj=0
    mal-80 cpuset=/ mems_allowed=0
    Pid: 2976, comm: mal-80 Not tainted 3.7.0+ #10
    Call Trace:
    [] dump_header+0x83/0x1ca
    ..... (call trace)
    [] page_fault+0x28/0x30
    <<<<<<<<<<<<<<<<<<<<< memcg specific information
    Task in /A/B/D killed as a result of limit of /A
    memory: usage 101376kB, limit 101376kB, failcnt 57
    memory+swap: usage 101376kB, limit 101376kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
    <<<<<<<<<<<<<<<<<<<<< print per cpu pageset stat
    Mem-Info:
    Node 0 DMA per-cpu:
    CPU 0: hi: 0, btch: 1 usd: 0
    ......
    CPU 3: hi: 0, btch: 1 usd: 0
    Node 0 DMA32 per-cpu:
    CPU 0: hi: 186, btch: 31 usd: 173
    ......
    CPU 3: hi: 186, btch: 31 usd: 130
    <<<<<<<<<<<<<<<<<<<<< print global page state
    active_anon:92963 inactive_anon:40777 isolated_anon:0
    active_file:33027 inactive_file:51718 isolated_file:0
    unevictable:0 dirty:3 writeback:0 unstable:0
    free:729995 slab_reclaimable:6897 slab_unreclaimable:6263
    mapped:20278 shmem:35971 pagetables:5885 bounce:0
    free_cma:0
    <<<<<<<<<<<<<<<<<<<<< print per zone page state
    Node 0 DMA free:15836kB ... all_unreclaimable? no
    lowmem_reserve[]: 0 3175 3899 3899
    Node 0 DMA32 free:2888564kB ... all_unrelaimable? no
    lowmem_reserve[]: 0 0 724 724
    lowmem_reserve[]: 0 0 0 0
    Node 0 DMA: 1*4kB (U) ... 3*4096kB (M) = 15836kB
    Node 0 DMA32: 41*4kB (UM) ... 702*4096kB (MR) = 2888316kB
    120710 total pagecache pages
    0 pages in swap cache
    <<<<<<<<<<<<<<<<<<<<< print global swap cache stat
    Swap cache stats: add 0, delete 0, find 0/0
    Free swap = 499708kB
    Total swap = 499708kB
    1040368 pages RAM
    58678 pages reserved
    169065 pages shared
    173632 pages non-shared
    [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
    [ 2693] 0 2693 6005 1324 17 0 0 god
    [ 2754] 0 2754 6003 1320 16 0 0 god
    [ 2811] 0 2811 5992 1304 18 0 0 god
    [ 2874] 0 2874 6005 1323 18 0 0 god
    [ 2935] 0 2935 8720 7742 21 0 0 mal-30
    [ 2976] 0 2976 21520 17577 42 0 0 mal-80
    Memory cgroup out of memory: Kill process 2976 (mal-80) score 665 or sacrifice child
    Killed process 2976 (mal-80) total-vm:86080kB, anon-rss:69964kB, file-rss:344kB

    We can see that the messages dumped by show_free_areas() are lengthy and
    provide only limited info for the memcg that just hit the oom.

    (2) After change
    mal-80 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
    mal-80 cpuset=/ mems_allowed=0
    Pid: 2704, comm: mal-80 Not tainted 3.7.0+ #10
    Call Trace:
    [] dump_header+0x83/0x1d1
    .......(call trace)
    [] page_fault+0x28/0x30
    Task in /A/B/D killed as a result of limit of /A
    <<<<<<<<<<<<<<<<<<<<< memcg specific information
    memory: usage 102400kB, limit 102400kB, failcnt 140
    memory+swap: usage 102400kB, limit 102400kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740991kB, failcnt 0
    Memory cgroup stats for /A: cache:32KB rss:30984KB mapped_file:0KB swap:0KB inactive_anon:6912KB active_anon:24072KB inactive_file:32KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/B: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/C: cache:0KB rss:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:0KB inactive_file:0KB active_file:0KB unevictable:0KB
    Memory cgroup stats for /A/B/D: cache:32KB rss:71352KB mapped_file:0KB swap:0KB inactive_anon:6656KB active_anon:64696KB inactive_file:16KB active_file:16KB unevictable:0KB
    [ pid ] uid tgid total_vm rss nr_ptes swapents oom_score_adj name
    [ 2260] 0 2260 6006 1325 18 0 0 god
    [ 2383] 0 2383 6003 1319 17 0 0 god
    [ 2503] 0 2503 6004 1321 18 0 0 god
    [ 2622] 0 2622 6004 1321 16 0 0 god
    [ 2695] 0 2695 8720 7741 22 0 0 mal-30
    [ 2704] 0 2704 21520 17839 43 0 0 mal-80
    Memory cgroup out of memory: Kill process 2704 (mal-80) score 669 or sacrifice child
    Killed process 2704 (mal-80) total-vm:86080kB, anon-rss:71016kB, file-rss:340kB

    This version provides more pointed info for the memcg in the "Memory
    cgroup stats for XXX" section.

    Signed-off-by: Sha Zhengju
    Acked-by: Michal Hocko
    Cc: KAMEZAWA Hiroyuki
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sha Zhengju
     

13 Dec, 2012

3 commits

  • out_of_memory() will already cause current to schedule if it has not been
    killed, so doing it again in pagefault_out_of_memory() is redundant.
    Remove it.

    Signed-off-by: David Rientjes
    Acked-by: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • To lock the entire system from parallel oom killing, it's possible to pass
    in a zonelist with all zones rather than using for_each_populated_zone()
    for the iteration. This obsoletes try_set_system_oom() and
    clear_system_oom() so that they can be removed.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • N_HIGH_MEMORY stands for the nodes that have normal or high memory.
    N_MEMORY stands for the nodes that have any memory.

    The code here needs to handle the nodes which have memory; we should
    use N_MEMORY instead.

    Signed-off-by: Lai Jiangshan
    Acked-by: Hillf Danton
    Signed-off-by: Wen Congyang
    Cc: Christoph Lameter
    Cc: Lin Feng
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lai Jiangshan
     

12 Dec, 2012

3 commits

  • test_set_oom_score_adj() and compare_swap_oom_score_adj() are used to
    specify that current should be killed first if an oom condition occurs in
    between the two calls.

    The usage is

    short oom_score_adj = test_set_oom_score_adj(OOM_SCORE_ADJ_MAX);
    ...
    compare_swap_oom_score_adj(OOM_SCORE_ADJ_MAX, oom_score_adj);

    to store the thread's oom_score_adj, temporarily change it to the maximum
    score possible, and then restore the old value if it is still the same.

    This happens to still be racy, however, if the user writes
    OOM_SCORE_ADJ_MAX to /proc/pid/oom_score_adj in between the two calls.
    The compare_swap_oom_score_adj() will then incorrectly reset the old value
    prior to the write of OOM_SCORE_ADJ_MAX.

    To fix this, introduce a new oom_flags_t member in struct signal_struct
    that will be used for per-thread oom killer flags. KSM and swapoff can
    now use a bit in this member to specify that threads should be killed
    first in oom conditions without playing around with oom_score_adj.

    This also allows the correct oom_score_adj to always be shown when
    reading /proc/pid/oom_score (the flag-based approach is sketched after
    this entry).

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
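
    A userspace model of the flag-based replacement described above. The
    helper names loosely follow what this became in mainline
    (set_current_oom_origin() and friends); the struct layout and bit value
    here are illustrative.

        #include <stdbool.h>

        #define OOM_FLAG_ORIGIN 0x1u  /* illustrative bit value */

        struct signal { unsigned int oom_flags; short oom_score_adj; };
        struct task   { struct signal signal; };

        /* Mark/unmark a task as the preferred OOM victim without touching its
         * user-visible oom_score_adj, so /proc/pid/oom_score stays accurate
         * and the write() race described above disappears. */
        static void set_oom_origin(struct task *t)
        {
            t->signal.oom_flags |= OOM_FLAG_ORIGIN;
        }

        static void clear_oom_origin(struct task *t)
        {
            t->signal.oom_flags &= ~OOM_FLAG_ORIGIN;
        }

        static bool oom_task_origin(const struct task *t)
        {
            return t->signal.oom_flags & OOM_FLAG_ORIGIN;
        }

        int main(void)
        {
            struct task t = { { 0, 0 } };
            bool marked;

            set_oom_origin(&t);   /* e.g. around swapoff() or a KSM unmerge */
            marked = oom_task_origin(&t);
            clear_oom_origin(&t);
            return (marked && !oom_task_origin(&t)) ? 0 : 1;
        }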
     
  • The maximum oom_score_adj is 1000 and the minimum oom_score_adj is -1000,
    so this range can be represented by the signed short type with no
    functional change. The extra space this frees up in struct signal_struct
    will be used for per-thread oom kill flags in the next patch.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Cc: Anton Vorontsov
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Exiting threads, those with PF_EXITING set, can pagefault and require
    memory before they can make forward progress. This happens, for instance,
    when a process must fault task->robust_list, a userspace structure, before
    detaching its memory.

    These threads also aren't guaranteed to get access to memory reserves
    unless oom killed or killed from userspace. The oom killer won't grant
    memory reserves if other threads are also exiting other than current and
    stalling at the same point. This prevents needlessly killing processes
    when others are already exiting.

    Instead of special casing all the possible situations between PF_EXITING
    getting set and a thread detaching its mm where it may allocate memory,
    which probably wouldn't get updated when a change is made to the exit
    path, the solution is to give all exiting threads access to memory
    reserves if they call the oom killer. This allows them to quickly
    allocate, detach their mm, and free the memory it represents.

    Summary of Luigi's bug report:

    : He had an oom condition where threads were faulting on task->robust_list
    : and repeatedly called the oom killer but it would defer killing a thread
    : because it saw other PF_EXITING threads. This can happen anytime we need
    : to allocate memory after setting PF_EXITING and before detaching our mm;
    : if there are other threads in the same state then the oom killer won't do
    : anything unless one of them happens to be killed from userspace.
    :
    : So instead of only deferring for PF_EXITING and !task->robust_list, it's
    : better to just give them access to memory reserves to prevent a potential
    : livelock so that any other faults that may be introduced in the future in
    : the exit path don't cause the same problem (and hopefully we don't allow
    : too many of those!).

    Signed-off-by: David Rientjes
    Acked-by: Minchan Kim
    Tested-by: Luigi Semenzato
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Oct, 2012

1 commit


01 Aug, 2012

8 commits

  • By globally defining check_panic_on_oom(), the memcg oom handler can be
    moved entirely to mm/memcontrol.c. This removes the ugly #ifdef in the
    oom killer and cleans up the code.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since exiting tasks require write_lock_irq(&tasklist_lock) several times,
    try to reduce the amount of time the readside is held for oom kills. This
    makes the interface with the memcg oom handler more consistent since it
    now never needs to take tasklist_lock unnecessarily.

    The only time the oom killer now takes tasklist_lock is when iterating the
    children of the selected task, everything else is protected by
    rcu_read_lock().

    This requires that a reference to the selected process, p, is grabbed
    before calling oom_kill_process(). It may release it and grab a reference
    on another one of p's threads if !p->mm, but it also guarantees that it
    will release the reference before returning.

    [hughd@google.com: fix duplicate put_task_struct()]
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Reviewed-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The global oom killer is serialized by the per-zonelist
    try_set_zonelist_oom() which is used in the page allocator. Concurrent
    oom kills are thus a rare event and only occur in systems using
    mempolicies and with a large number of nodes.

    Memory controller oom kills, however, can frequently be concurrent since
    there is no serialization once the oom killer is called for oom conditions
    in several different memcgs in parallel.

    This creates a massive contention on tasklist_lock since the oom killer
    requires the readside for the tasklist iteration. If several memcgs are
    calling the oom killer, this lock can be held for a substantial amount of
    time, especially if threads continue to enter it as other threads are
    exiting.

    Since the exit path grabs the writeside of the lock with irqs disabled in
    a few different places, this can cause a soft lockup on cpus as a result
    of tasklist_lock starvation.

    The kernel lacks unfair writelocks, and successful calls to the oom killer
    usually result in at least one thread entering the exit path, so an
    alternative solution is needed.

    This patch introduces a separate oom handler for memcgs so that they do
    not require tasklist_lock for as much time. Instead, it iterates only
    over the threads attached to the oom memcg and grabs a reference to the
    selected thread before calling oom_kill_process() to ensure it doesn't
    prematurely exit.

    This still requires tasklist_lock for the tasklist dump, iterating
    children of the selected process, and killing all other threads on the
    system sharing the same memory as the selected victim. So while this
    isn't a complete solution to tasklist_lock starvation, it significantly
    reduces the amount of time that it is held.

    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • This patch introduces a helper function to process each thread during the
    iteration over the tasklist. A new return type, enum oom_scan_t, is
    defined to determine the future behavior of the iteration:

    - OOM_SCAN_OK: continue scanning the thread and find its badness,

    - OOM_SCAN_CONTINUE: do not consider this thread for oom kill, it's
    ineligible,

    - OOM_SCAN_ABORT: abort the iteration and return, or

    - OOM_SCAN_SELECT: always select this thread with the highest badness
    possible.

    There is no functional change with this patch. This new helper function
    will be used in the next patch in the memory controller. (A schematic
    rendering of the enum and scan loop follows this entry.)

    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Reviewed-by: Michal Hocko
    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Reviewed-by: Sha Zhengju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
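
    A schematic userspace rendering of the enum and the scan loop it drives.
    The enum values come from the changelog; the decision logic inside
    oom_scan_task() and the task fields are placeholders, only the control
    flow is the point.

        #include <stdio.h>

        enum oom_scan_t {
            OOM_SCAN_OK,        /* keep going, compute this task's badness */
            OOM_SCAN_CONTINUE,  /* ineligible, skip it */
            OOM_SCAN_ABORT,     /* stop the whole scan */
            OOM_SCAN_SELECT,    /* force-select with the highest badness */
        };

        struct task { int badness; int exiting; int unkillable; };

        static enum oom_scan_t oom_scan_task(const struct task *t)
        {
            if (t->unkillable)
                return OOM_SCAN_CONTINUE;
            if (t->exiting)
                return OOM_SCAN_ABORT;  /* someone is already freeing memory */
            return OOM_SCAN_OK;
        }

        static const struct task *select_bad_task(const struct task *tasks, int n)
        {
            const struct task *chosen = NULL;
            int i;

            for (i = 0; i < n; i++) {
                switch (oom_scan_task(&tasks[i])) {
                case OOM_SCAN_SELECT:
                    return &tasks[i];
                case OOM_SCAN_CONTINUE:
                    continue;
                case OOM_SCAN_ABORT:
                    return NULL;
                case OOM_SCAN_OK:
                    if (!chosen || tasks[i].badness > chosen->badness)
                        chosen = &tasks[i];
                    break;
                }
            }
            return chosen;
        }

        int main(void)
        {
            struct task tasks[] = { { 10, 0, 0 }, { 50, 0, 1 }, { 30, 0, 0 } };
            const struct task *victim = select_bad_task(tasks, 3);

            printf("victim badness: %d\n", victim ? victim->badness : -1); /* 30 */
            return 0;
        }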
     
  • Sanity:

    CONFIG_CGROUP_MEM_RES_CTLR -> CONFIG_MEMCG
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP -> CONFIG_MEMCG_SWAP
    CONFIG_CGROUP_MEM_RES_CTLR_SWAP_ENABLED -> CONFIG_MEMCG_SWAP_ENABLED
    CONFIG_CGROUP_MEM_RES_CTLR_KMEM -> CONFIG_MEMCG_KMEM

    [mhocko@suse.cz: fix missed bits]
    Cc: Glauber Costa
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Tejun Heo
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The number of ptes and swap entries are used in the oom killer's badness
    heuristic, so they should be shown in the tasklist dump.

    This patch adds those fields and replaces the cpu and oom_adj values
    that are currently emitted. Cpu isn't interesting, and oom_adj is
    deprecated and will be removed later this year; the same information is
    already displayed as oom_score_adj, which is used internally.

    At the same time, make the documentation a little more clear to state this
    information is helpful to determine why the oom killer chose the task it
    did to kill.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • /proc/sys/vm/oom_kill_allocating_task will immediately kill current when
    the oom killer is called to avoid a potentially expensive tasklist scan
    for large systems.

    Currently, however, it is not checking current's oom_score_adj value which
    may be OOM_SCORE_ADJ_MIN, meaning that it has been disabled from oom
    killing.

    This patch avoids killing current in such a condition and simply falls
    back to the tasklist scan since memory still needs to be freed.

    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The oom killer currently schedules away from current in an uninterruptible
    sleep if it does not have access to memory reserves. It's possible that
    current was killed because it shares memory with the oom killed thread or
    because it was killed by the user in the interim, however.

    This patch only schedules away from current if it does not have a pending
    kill, i.e. if it does not share memory with the oom killed thread. It's
    possible that it will immediately retry its memory allocation and fail,
    but it will immediately be given access to memory reserves if it calls the
    oom killer again.

    This prevents the delay of memory freeing when threads that share memory
    with the oom killed thread get unnecessarily scheduled.

    Signed-off-by: David Rientjes
    Cc: Oleg Nesterov
    Acked-by: KOSAKI Motohiro
    Acked-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

21 Jun, 2012

2 commits

  • Fix kernel-doc warnings such as

    Warning(../mm/page_cgroup.c:432): No description found for parameter 'id'
    Warning(../mm/page_cgroup.c:432): Excess function parameter 'mem' description in 'swap_cgroup_record'

    Signed-off-by: Wanpeng Li
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The divide in p->signal->oom_score_adj * totalpages / 1000 within
    oom_badness() was causing an overflow of the signed long data type.

    This adds both the root bias and p->signal->oom_score_adj before doing
    the normalization, which fixes the issue and also cleans up the
    calculation.

    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

09 Jun, 2012

1 commit

  • If the privileges given to root threads (3% of allowable memory) or a
    negative value of /proc/pid/oom_score_adj happen to exceed the amount of
    rss of a thread, its badness score overflows as a result of commit
    a7f638f999ff ("mm, oom: normalize oom scores to oom_score_adj scale only
    for userspace").

    Fix this by making the type signed and returning 1, meaning the thread
    is still eligible for kill, if the value is negative (sketched after
    this entry).

    Reported-by: Dave Jones
    Acked-by: Oleg Nesterov
    Signed-off-by: David Rientjes
    Signed-off-by: Linus Torvalds

    David Rientjes
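
    A minimal sketch of the fix described above: compute the score in a
    signed type and clamp negative results to 1 so the task stays eligible.
    Inputs and the bonus are illustrative, not the full oom_badness()
    arithmetic.

        static long badness(long rss_pages, long bonus_pages, long adj_pages)
        {
            long points = rss_pages - bonus_pages + adj_pages;

            /* A large root bonus or a negative oom_score_adj can push points
             * below zero; return 1 (still eligible, lowest non-zero score)
             * instead of letting the value wrap when treated as unsigned. */
            return points > 0 ? points : 1;
        }

        int main(void)
        {
            return badness(100, 1000, 0) == 1 ? 0 : 1;  /* tiny root task */
        }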
     

30 May, 2012

1 commit

  • The oom_score_adj scale ranges from -1000 to 1000 and represents the
    proportion of memory available to the process at allocation time. This
    means an oom_score_adj value of 300, for example, will bias a process as
    though it was using an extra 30.0% of available memory and a value of
    -350 will discount 35.0% of available memory from its usage.

    The oom killer badness heuristic also uses this scale to report the oom
    score for each eligible process in determining the "best" process to
    kill. Thus, it can only differentiate each process's memory usage by
    0.1% of system RAM.

    On large systems, this can end up being a large amount of memory: 256MB
    on 256GB systems, for example.

    This can be fixed by having the badness heuristic use the actual memory
    usage in scoring threads and then normalizing it to the oom_score_adj
    scale for userspace (see the sketch after this entry). This results in
    better comparison between eligible threads for kill and no change from
    the userspace perspective.

    Suggested-by: KOSAKI Motohiro
    Tested-by: Dave Jones
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
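
    A hedged sketch of the split described above: compare raw usage (full
    resolution) internally, and convert to the oom_score_adj scale only when
    reporting to userspace. The names, rounding and numbers are
    illustrative.

        #include <stdio.h>

        /* Internal comparison value: raw pages, so tasks a few megabytes
         * apart remain distinguishable. */
        static long badness_points(long rss, long swapents, long nr_ptes)
        {
            return rss + swapents + nr_ptes;
        }

        /* Userspace-visible score: one unit is 0.1% of system RAM, i.e.
         * 256 MB on a 256 GB machine. */
        static long oom_score_for_proc(long points, long totalpages, long adj)
        {
            return points * 1000 / totalpages + adj;
        }

        int main(void)
        {
            long totalpages = 64L << 20;            /* ~256 GB of 4 KiB pages */
            long a = badness_points(70000, 0, 50);  /* ~274 MB */
            long b = badness_points(68000, 0, 50);  /* ~266 MB */

            printf("internal: %ld vs %ld, /proc: %ld vs %ld\n",
                   a, b,
                   oom_score_for_proc(a, totalpages, 0),
                   oom_score_for_proc(b, totalpages, 0)); /* both report 1 */
            return 0;
        }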
     

03 May, 2012

1 commit