02 Apr, 2016

1 commit

  • Commit bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using
    simpler way") has simplified the check for tasks already enqueued for
    the oom reaper by checking tsk->oom_reaper_list != NULL. This check is
    not sufficient because the tsk might be the head of the queue without
    any other tasks queued and then we would simply lockup looping on the
    same task. Fix the condition by checking for the head as well.
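
    A minimal userspace sketch of the corrected condition, assuming a
    simplified singly linked queue. struct task, reaper_head and
    enqueue_for_reaper() are hypothetical stand-ins for illustration, not the
    kernel's code, and locking is omitted. It covers the case the message
    describes: a lone queued task is the head and has a NULL link.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for task_struct; only the list link is modeled. */
struct task {
	struct task *oom_reaper_list;	/* next queued task, NULL for the last one */
};

/* Head of the singly linked queue (models the global oom_reaper_list). */
static struct task *reaper_head;

/* Returns true if tsk was queued, false if it was detected as already queued. */
static bool enqueue_for_reaper(struct task *tsk)
{
	/*
	 * A task that is alone on the queue is the head and has a NULL link,
	 * so checking the link alone would miss it.
	 */
	if (tsk == reaper_head || tsk->oom_reaper_list)
		return false;

	tsk->oom_reaper_list = reaper_head;
	reaper_head = tsk;
	return true;
}
```

    With only the link check, the second enqueue of a lone task would make
    the task point at itself, which is the lockup described above.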

    Fixes: bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using simpler way")
    Signed-off-by: Michal Hocko
    Acked-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

26 Mar, 2016

8 commits

  • "oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
    to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
    by simply checking tsk->oom_reaper_list != NULL.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
    address space" oom_reaper will call exit_oom_victim on the target task
    after it is done. This might however race with the PM freezer:

    CPU0                            CPU1                            CPU2
    freeze_processes
      try_to_freeze_tasks
                                    # Allocation request
                                    out_of_memory
      oom_killer_disable
                                      wake_oom_reaper(P1)
                                                                    __oom_reap_task
                                                                      exit_oom_victim(P1)
        wait_event(oom_victims==0)
      [...]
                                    do_exit(P1)
                                      perform IO/interfere with the freezer

    which breaks the oom_killer_disable semantic. We no longer have a
    guarantee that the oom victim won't interfere with the freezer because
    it might be anywhere on the way to do_exit while the freezer thinks the
    task has already terminated. It might trigger IO or touch devices which
    are frozen already.

    In order to close this race, make the oom_reaper thread freezable. This
    will work because
      a) an already running oom_reaper will block the freezer from entering
         the quiescent state
      b) wake_oom_reaper will not wake up the reaper after it has been
         frozen
      c) the only way to call exit_oom_victim after try_to_freeze_tasks
         is from the oom victim's context when we know that further
         interference shouldn't be possible

    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Entries are only added/removed from oom_reaper_list at head so we can
    use a singly linked list and hence save a word in task_struct.

    Signed-off-by: Vladimir Davydov
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Tetsuo has reported that oom_kill_allocating_task=1 will cause
    oom_reaper_list corruption because oom_kill_process doesn't follow
    standard OOM exclusion (aka ignores TIF_MEMDIE) and allows the same
    task to be enqueued multiple times, e.g. by sacrificing the same child
    multiple times.

    This patch fixes the issue by introducing a new MMF_OOM_KILLED mm flag
    which is set in oom_kill_process atomically and oom reaper is disabled
    if the flag was already set.
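
    One way to picture the atomic "set the flag and learn whether it was
    already set" step, sketched with C11 atomics rather than the kernel's bit
    operations. struct mm, the bit position and mark_oom_killed() are
    assumptions made for illustration only.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define MMF_OOM_KILLED_BIT (1UL << 0)	/* hypothetical bit position */

/* Hypothetical stand-in for mm_struct; only the flags word is modeled. */
struct mm {
	atomic_ulong flags;
};

/*
 * Sets the flag atomically. Returns true only for the caller that set it
 * first; later callers see the old value and must not queue the task again.
 */
static bool mark_oom_killed(struct mm *mm)
{
	unsigned long old = atomic_fetch_or(&mm->flags, MMF_OOM_KILLED_BIT);

	return !(old & MMF_OOM_KILLED_BIT);
}
```

    Only the first caller would go on to wake the reaper; a second kill of
    the same mm sees the flag and skips the enqueue.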

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • wake_oom_reaper has allowed only 1 oom victim to be queued. The main
    reason for that was the simplicity as other solutions would require some
    way of queuing. The current approach is racy and that was deemed
    sufficient as the oom_reaper is considered a best effort approach to
    help with oom handling when the OOM victim cannot terminate in a
    reasonable time. The race could lead to missing an oom victim which can
    get stuck

    out_of_memory
      wake_oom_reaper
        cmpxchg // OK
                            oom_reaper
                              oom_reap_task
                                __oom_reap_task
    oom_victim terminates
                                  atomic_inc_not_zero // fail
    out_of_memory
      wake_oom_reaper
        cmpxchg // fails
                              task_to_reap = NULL

    This race requires 2 OOM invocations in a short time period which is not
    very likely but certainly not impossible. E.g. the original victim
    might have not released a lot of memory for some reason.

    The situation would improve considerably if wake_oom_reaper used a more
    robust queuing. This is what this patch implements. This means adding
    oom_reaper_list list_head into task_struct (eat a hole before embedded
    thread_struct for that purpose) and an oom_reaper_lock spinlock for
    queuing synchronization. wake_oom_reaper will then add the task on the
    queue and oom_reaper will dequeue it.

    Signed-off-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Inform about the successful/failed oom_reaper attempts and dump all the
    held locks to tell us more who is blocking the progress.

    [akpm@linux-foundation.org: fix CONFIG_MMU=n build]
    Signed-off-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When oom_reaper manages to unmap all the eligible vmas there shouldn't
    be much freeable memory held by the oom victim left anymore so it
    makes sense to clear the TIF_MEMDIE flag for the victim and allow the
    OOM killer to select another task.

    The lack of TIF_MEMDIE also means that the victim cannot access memory
    reserves anymore but that shouldn't be a problem because it would get
    the access again if it needs to allocate and hits the OOM killer again
    due to the fatal_signal_pending resp. PF_EXITING check. We can safely
    hide the task from the OOM killer because it is clearly not a good
    candidate anymore as everything reclaimable has been torn down already.

    This patch makes it possible to cap the time an OOM victim can keep
    TIF_MEMDIE and thus hold off further global OOM killer actions, provided
    the oom reaper is able to take mmap_sem for the associated mm struct.
    This is not guaranteed now but further steps should make sure that
    waiting for mmap_sem for write is killable, which will help to reduce
    such lock contention. This is not done by this patch.

    Note that exit_oom_victim might be called on a remote task from
    __oom_reap_task now so we have to check and clear the flag atomically
    otherwise we might race and underflow oom_victims or wake up waiters too
    early.
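
    A minimal sketch of why the check and the clear have to be one atomic
    step, written with C11 atomics. struct victim, the bit value and
    exit_oom_victim_sketch() are hypothetical; the real code operates on
    thread flags and also wakes up waiters.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TIF_MEMDIE_BIT (1UL << 0)	/* hypothetical bit value */

/* Hypothetical stand-in for the victim task; only the flags are modeled. */
struct victim {
	atomic_ulong flags;
};

/* Models the global count of tasks currently marked as OOM victims. */
static atomic_int oom_victims;

/*
 * Clears the flag and drops the counter only if this caller was the one
 * that cleared it, so a racing second caller cannot underflow the count.
 */
static bool exit_oom_victim_sketch(struct victim *v)
{
	unsigned long old = atomic_fetch_and(&v->flags, ~TIF_MEMDIE_BIT);

	if (!(old & TIF_MEMDIE_BIT))
		return false;	/* somebody else already cleared it */

	atomic_fetch_sub(&oom_victims, 1);
	return true;
}
```

    If the reaper and the exiting victim both call this, exactly one of them
    decrements the counter.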

    Signed-off-by: Michal Hocko
    Suggested-by: Johannes Weiner
    Suggested-by: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patch (of 5):

    This is based on the idea from Mel Gorman discussed during LSFMM 2015
    and independently brought up by Oleg Nesterov.

    The OOM killer currently kills only a single task in the hope that
    the task will terminate in a reasonable time and free up its
    memory. Such a task (oom victim) gets access to memory reserves
    via mark_oom_victim to allow forward progress should there be a need
    for additional memory during the exit path.

    It has been shown (e.g. by Tetsuo Handa) that it is not that hard to
    construct workloads which break the core assumption mentioned above and
    the OOM victim might take unbounded amount of time to exit because it
    might be blocked in the uninterruptible state waiting for an event (e.g.
    lock) which is blocked by another task looping in the page allocator.

    This patch reduces the probability of such a lockup by introducing a
    specialized kernel thread (oom_reaper) which tries to reclaim additional
    memory by preemptively reaping the anonymous or swapped out memory owned
    by the oom victim under the assumption that such memory won't be needed
    when its owner is killed and kicked from the userspace anyway. There is
    one notable exception to this, though: if the OOM victim was in the
    process of coredumping, the result would be incomplete. This is
    considered a reasonable constraint because the overall system health is
    more important than debuggability of a particular application.

    A kernel thread has been chosen because we need a reliable way of
    invocation so workqueue context is not appropriate because all the
    workers might be busy (e.g. allocating memory). Kswapd which sounds
    like another good fit is not appropriate as well because it might get
    blocked on locks during reclaim as well.

    oom_reaper has to take mmap_sem on the target task for reading so the
    solution is not 100% reliable because the semaphore might be held or
    blocked for write, but the probability is reduced considerably compared
    to basically any lock blocking forward progress as described above. In
    order to avoid blocking on the lock without any forward progress we use
    only a trylock and retry 10 times with a short sleep in between. Users of
    mmap_sem which need it for write should be carefully reviewed to use
    _killable waiting as much as possible and to reduce allocation requests
    done with the lock held to the absolute minimum to reduce the risk even
    further.
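
    The bounded trylock-and-retry loop can be sketched as follows. The retry
    count of 10 comes from the text above; struct reap_ops, the callbacks and
    the demo context are hypothetical, and the actual lock and sleep are
    hidden behind the two hooks so that the sketch stays self-contained.

```c
#include <stdbool.h>

#define MAX_REAP_RETRIES 10	/* retry count taken from the text above */

/* Hypothetical hooks: the real reaper uses a trylock and a short sleep. */
struct reap_ops {
	bool (*try_reap)(void *ctx);	/* false means the trylock failed */
	void (*backoff)(void *ctx);	/* short sleep between attempts */
};

/* Returns true once an attempt succeeds, false after all retries fail. */
static bool reap_with_retries(const struct reap_ops *ops, void *ctx)
{
	for (int attempt = 0; attempt < MAX_REAP_RETRIES; attempt++) {
		if (ops->try_reap(ctx))
			return true;
		ops->backoff(ctx);
	}
	return false;
}

/* Demo context: the lock is "contended" for the first busy_for attempts. */
struct demo_ctx {
	int busy_for;
	int attempts;
	int sleeps;
};

static bool demo_try_reap(void *ctx)
{
	struct demo_ctx *d = ctx;

	return ++d->attempts > d->busy_for;
}

static void demo_backoff(void *ctx)
{
	struct demo_ctx *d = ctx;

	d->sleeps++;
}
```

    Because the loop never blocks on the lock itself, the reaper gives up
    after a bounded number of attempts instead of joining the lockup.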

    The API between oom killer and oom reaper is quite trivial.
    wake_oom_reaper updates mm_to_reap with cmpxchg to guarantee only
    NULL->mm transition and oom_reaper clears this atomically once it is done
    with the work. This means that only a single mm_struct can be reaped at
    a time. As the operation is potentially disruptive we are trying to
    limit it to the necessary minimum and the reaper blocks any updates while
    it operates on an mm. mm_struct is pinned by mm_count to allow parallel
    exit_mmap and a race is detected by atomic_inc_not_zero(mm_users).
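
    The NULL->mm handoff can be sketched with a C11 compare-and-exchange.
    This is a userspace illustration under assumptions: struct mm,
    wake_reaper() and reaper_done() are hypothetical names, and reference
    counting of the mm is left out.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for mm_struct. */
struct mm {
	int id;
};

/* The single slot shared by the OOM killer and the reaper (starts NULL). */
static _Atomic(struct mm *) mm_to_reap;

/* OOM killer side: succeeds only for the NULL -> mm transition. */
static bool wake_reaper(struct mm *mm)
{
	struct mm *expected = NULL;

	return atomic_compare_exchange_strong(&mm_to_reap, &expected, mm);
}

/* Reaper side: releases the slot once the work is done. */
static void reaper_done(void)
{
	atomic_store(&mm_to_reap, NULL);
}
```

    A second victim offered while the slot is occupied is simply dropped,
    which is the single-victim limitation a later commit on this page
    replaces with a queue.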

    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Suggested-by: Mel Gorman
    Acked-by: Mel Gorman
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

18 Mar, 2016

3 commits

  • While oom_killer_disable() is called by freeze_processes() after all
    user threads except the current thread are frozen, it is possible that
    kernel threads invoke the OOM killer and send SIGKILL to the current
    thread due to sharing the thawed victim's memory. Therefore, checking
    for SIGKILL is preferable to checking TIF_MEMDIE.

    Signed-off-by: Tetsuo Handa
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Kernel style prefers a single string over split strings when the string is
    'user-visible'.

    Miscellanea:

    - Add a missing newline
    - Realign arguments

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • When the OOM killer scans tasks and encounters a PF_EXITING one, it
    force-selects that task regardless of the score. The problem is that if
    that task got stuck waiting for some state the allocation site is
    holding, the OOM reaper can not move on to the next best victim.

    Frankly, I don't even know why we check for exiting tasks in the OOM
    killer. We've tried direct reclaim at least 15 times by the time we
    decide the system is OOM, there was plenty of time to exit and free
    memory; and a task might exit voluntarily right after we issue a kill.
    This is testing pure noise. Remove it.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Mar, 2016

1 commit

  • It would be useful to translate gfp_flags into string representation
    when printing in case of an OOM, especially as the flags have been
    undergoing some changes recently and the script ./scripts/gfp-translate
    needs a matching source version to be accurate.

    Example output:

    a.out invoked oom-killer: gfp_mask=0x24280ca(GFP_HIGHUSER_MOVABLE|GFP_ZERO), order=0, oom_score_adj=0
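
    The general idea of turning a flag mask into names can be sketched in
    plain C. The table below is hypothetical (the DEMO_* names and values are
    made up and are not GFP flags), and the kernel's own implementation is
    not reproduced here.

```c
#include <stddef.h>
#include <stdio.h>

struct flag_name {
	unsigned long mask;
	const char *name;
};

/* Hypothetical flag table; real GFP flag names and values differ. */
static const struct flag_name demo_flags[] = {
	{ 0x1, "DEMO_WAIT" },
	{ 0x2, "DEMO_IO" },
	{ 0x4, "DEMO_ZERO" },
};

/* Writes "NAME|NAME|0x<rest>" into buf; output is cut short if buf is small. */
static void flags_to_string(unsigned long flags, char *buf, size_t len)
{
	size_t used = 0;

	if (len == 0)
		return;
	buf[0] = '\0';

	for (size_t i = 0; i < sizeof(demo_flags) / sizeof(demo_flags[0]); i++) {
		int n;

		if (!(flags & demo_flags[i].mask))
			continue;
		n = snprintf(buf + used, len - used, "%s%s",
			     used ? "|" : "", demo_flags[i].name);
		if (n < 0 || (size_t)n >= len - used)
			return;		/* error or truncated: stop here */
		used += (size_t)n;
		flags &= ~demo_flags[i].mask;
	}

	/* Any bits without a name are printed as a hex remainder. */
	if (flags)
		snprintf(buf + used, len - used, "%s%#lx",
			 used ? "|" : "", flags);
}
```

    Printing both the raw mask and the decoded names, as in the example
    output above, keeps the message useful even when a name table is out of
    date.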

    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Cc: Minchan Kim
    Cc: Sasha Levin
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

15 Jan, 2016

1 commit

  • Currently looking at /proc/<pid>/status or statm, there is no way to
    distinguish shmem pages from pages mapped to a regular file (shmem pages
    are mapped to /dev/zero), even though their implication in actual memory
    use is quite different.

    The internal accounting currently counts shmem pages together with
    regular files. As a preparation to extend the userspace interfaces,
    this patch adds MM_SHMEMPAGES counter to mm_rss_stat to account for
    shmem pages separately from MM_FILEPAGES. The next patch will expose it
    to userspace - this patch doesn't change the exported values yet, by
    adding up MM_SHMEMPAGES to MM_FILEPAGES at places where MM_FILEPAGES was
    used before. The only user-visible change after this patch is the OOM
    killer message that separates the reported "shmem-rss" from "file-rss".
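
    A small model of the accounting change, assuming a reduced counter set.
    The enum below is a simplification for illustration and
    exported_file_pages() is a hypothetical helper; it only shows how
    existing "file" values stay unchanged by adding the new counter back in.

```c
/* Simplified model of the per-mm RSS counters used in this sketch. */
enum rss_counter {
	MM_FILEPAGES,
	MM_ANONPAGES,
	MM_SHMEMPAGES,
	NR_RSS_COUNTERS
};

struct rss_stat {
	long count[NR_RSS_COUNTERS];
};

/* What unchanged interfaces keep reporting as "file" pages: file + shmem. */
static long exported_file_pages(const struct rss_stat *s)
{
	return s->count[MM_FILEPAGES] + s->count[MM_SHMEMPAGES];
}
```

    Code that wants the split, such as the OOM killer message, reads the two
    counters separately instead of calling the summing helper.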

    [vbabka@suse.cz: forward-porting, tweak changelog]
    Signed-off-by: Jerome Marchand
    Signed-off-by: Vlastimil Babka
    Acked-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

13 Dec, 2015

1 commit

  • It's possible that an oom killed victim shares an ->mm with the init
    process and thus oom_kill_process() would end up trying to kill init as
    well.

    This has been shown in practice:

    Out of memory: Kill process 9134 (init) score 3 or sacrifice child
    Killed process 9134 (init) total-vm:1868kB, anon-rss:84kB, file-rss:572kB
    Kill process 1 (init) sharing same memory
    ...
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x00000009

    And this will result in a kernel panic.

    If a process is forked by init and selected for oom kill while still
    sharing init_mm, then it's likely this system is in an unrecoverable state.
    However, it's better not to try to kill init and allow the machine to
    panic due to unkillable processes.

    [rientjes@google.com: rewrote changelog]
    [akpm@linux-foundation.org: fix inverted test, per Ben]
    Signed-off-by: Chen Jie
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Ben Hutchings
    Cc: Li Zefan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Jie
     

07 Nov, 2015

1 commit

  • Introduce is_sysrq_oom helper function indicating oom kill triggered
    by sysrq to improve readability.

    No functional changes.
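
    A sketch of what such a helper might look like, assuming the order == -1
    convention for forced kills described in the 09 Sep, 2015 entry on this
    page. struct oom_control_sketch is a reduced stand-in with only the one
    field the check needs.

```c
#include <stdbool.h>

/* Reduced stand-in for struct oom_control; only the order is modeled. */
struct oom_control_sketch {
	int order;	/* allocation order, or -1 for a forced kill */
};

/* True when the OOM kill was requested explicitly rather than by an allocation. */
static bool is_sysrq_oom(const struct oom_control_sketch *oc)
{
	return oc->order == -1;
}
```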

    Signed-off-by: Yaowei Bai
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     

06 Nov, 2015

7 commits

  • Both "child->mm == mm" and "p->mm != mm" checks in oom_kill_process() are
    wrong. task->mm can be NULL if the task is the exited group leader. This
    means in particular that "kill sharing same memory" loop can miss a
    process with a zombie leader which uses the same ->mm.

    Note: the process_has_mm(child, p->mm) check is still not 100% correct,
    p->mm can be NULL too. This is minor, but probably deserves a fix or a
    comment anyway.
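
    The idea behind checking the whole thread group instead of only the
    leader can be sketched like this. The types are hypothetical userspace
    stand-ins, and process_shares_mm_sketch() only illustrates the approach
    described above.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct, a thread, and a thread group. */
struct mm {
	int id;
};

struct thread {
	struct mm *mm;		/* NULL once the thread has exited */
};

struct process {
	struct thread *threads;	/* threads[0] is the group leader */
	size_t nr_threads;
};

/*
 * Looks at every thread rather than just the leader: the first thread that
 * still has an mm decides, so a zombie leader no longer hides the process.
 */
static bool process_shares_mm_sketch(const struct process *p,
				     const struct mm *mm)
{
	for (size_t i = 0; i < p->nr_threads; i++) {
		const struct mm *t_mm = p->threads[i].mm;

		if (t_mm)
			return t_mm == mm;
	}
	return false;
}
```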

    [akpm@linux-foundation.org: document process_shares_mm() a bit]
    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Purely cosmetic, but the complex "if" condition looks annoying to me.
    Especially because it is not consistent with OOM_SCORE_ADJ_MIN check
    which adds another if/continue.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The fatal_signal_pending() was added to suppress unnecessary "sharing same
    memory" message, but it can't 100% help anyway because it can be
    false-negative; SIGKILL can be already dequeued.

    And worse, it can be false-positive due to exec or coredump. exec is
    mostly fine, but coredump is not. It is possible that the group leader
    has the pending SIGKILL because its sub-thread originated the coredump, in
    this case we must not skip this process.

    We could probably add the additional ->group_exit_task check but this
    patch just removes the wrong check along with pr_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NULL-terminated, so the worst race scenario would
    only be during update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • oom_kill_process() sends SIGKILL to other thread groups sharing victim's
    mm. But printing

    "Kill process %d (%s) sharing same memory\n"

    lines makes no sense if they already have pending SIGKILL. This patch
    reduces the "Kill process" lines by printing that line with info level
    only if SIGKILL is not pending.

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • In the for_each_process() loop in oom_kill_process(), we are comparing
    the address of the OOM victim's mm without holding a reference to that
    mm. If there are a lot of processes to compare or a lot of "Kill process
    %d (%s) sharing same memory" messages to print, the for_each_process()
    loop could take a very long time.

    It is possible that meanwhile the OOM victim exits and releases its mm,
    and then an mm is allocated at the same address and assigned to some
    unrelated process. When we hit such a race, the unrelated process will be
    killed by mistake. To make sure that the OOM victim's mm does not go away
    until the for_each_process() loop finishes, get a reference on the OOM
    victim's mm before calling task_unlock(victim).

    [oleg@redhat.com: several fixes]
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • It was confirmed that a local unprivileged user can consume all memory
    reserves and hang the system by exploiting the time lag between the OOM
    killer setting TIF_MEMDIE on an OOM victim and sending SIGKILL to that
    victim, because printk() inside the for_each_process() loop in
    oom_kill_process() can take many seconds when there are many thread
    groups sharing the same memory.

    Before starting oom-depleter process:

    Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
    Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB

    As of invoking the OOM killer:

    Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
    Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB

    Between the thread group leader getting TIF_MEMDIE and receiving SIGKILL:

    Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB

    The oom-depleter's thread group leader which got TIF_MEMDIE started
    memset() in user space after the OOM killer set TIF_MEMDIE, and it was
    free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space
    until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is
    set, the oom-depleter can terminate without touching memory reserves.

    Although the possibility of hitting this time lag is very small for 3.19
    and earlier kernels because TIF_MEMDIE is set immediately before sending
    SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
    step between and allow memory allocations which are not needed for
    terminating the OOM victim.

    Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

09 Sep, 2015

4 commits

  • The "killed" variable in out_of_memory() can be removed since the call to
    oom_kill_process() where we should block to allow the process time to
    exit is obvious.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Sysrq+f is used to kill a process either for debug or when the VM is
    otherwise unresponsive.

    It is not intended to trigger a panic when no process may be killed.

    Avoid panicking the system for sysrq+f when no processes are killed.

    Signed-off-by: David Rientjes
    Suggested-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The force_kill member of struct oom_control isn't needed if an order of -1
    is used instead. This is the same as order == -1 in struct
    compact_control which requires full memory compaction.

    This patch introduces no functional change.

    Signed-off-by: David Rientjes
    Cc: Sergey Senozhatsky
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • There are essential elements to an oom context that are passed around to
    multiple functions.

    Organize these elements into a new struct, struct oom_control, that
    specifies the context for an oom condition.

    This patch introduces no functional change.

    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Jun, 2015

7 commits

  • In oom_kill_process(), the variable 'points' is unsigned int. Print it as
    such.

    Signed-off-by: Wang Long
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wang Long
     
  • The zonelist locking and the oom_sem are two overlapping locks that are
    used to serialize global OOM killing against different things.

    The historical zonelist locking serializes OOM kills from allocations with
    overlapping zonelists against each other to prevent killing more tasks
    than necessary in the same memory domain. Only when neither tasklists nor
    zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
    bound to separate nodes) are OOM kills allowed to execute in parallel.

    The younger oom_sem is a read-write lock to serialize OOM killing against
    the PM code trying to disable the OOM killer altogether.

    However, the OOM killer is a fairly cold error path, there is really no
    reason to optimize for highly performant and concurrent OOM kills. And
    the oom_sem is just flat-out redundant.

    Replace both locking schemes with a single global mutex serializing OOM
    kills regardless of context.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Disabling the OOM killer needs to exclude allocators from entering, not
    existing victims from exiting.

    Right now the only waiter is suspend code, which achieves quiescence by
    disabling the OOM killer. But later on we want to add waits that hold
    the lock instead to stop new victims from showing up.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • It turns out that the mechanism to wait for exiting OOM victims is less
    generic than it looks: it won't issue wakeups unless the OOM killer is
    disabled.

    The reason this check was added was the thought that, since only the OOM
    disabling code would wait on this queue, wakeup operations could be
    saved when that specific consumer is known to be absent.

    However, this is quite the handgrenade. Later attempts to reuse the
    waitqueue for other purposes will lead to completely unexpected bugs and
    the failure mode will appear seemingly illogical. Generally, providers
    shouldn't make unnecessary assumptions about consumers.

    This could have been replaced with waitqueue_active(), but it only saves
    a few instructions in one of the coldest paths in the kernel. Simply
    remove it.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
    can clear it concurrently. Use clear_thread_flag() directly.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rename unmark_oom_victim() to exit_oom_victim(). Marking and unmarking
    are related in functionality, but the interface is not symmetrical at
    all: one is an internal OOM killer function used during the killing, the
    other is for an OOM victim to signal its own death on exit later on.
    This has locking implications, see follow-up changes.

    While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
    is easier on the eye.

    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Setting oom_killer_disabled to false is atomic, there is no need for
    further synchronization with ongoing allocations trying to OOM-kill.

    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Cc: Dave Chinner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

16 Apr, 2015

1 commit


15 Apr, 2015

1 commit

  • If the kernel panics due to an OOM caused by a cgroup reaching its limit
    when 'compulsory panic_on_oom' is enabled, then we will only see that the
    OOM happened because of "compulsory panic_on_oom is enabled" but this
    doesn't tell the difference between mempolicy and memcg. And dumping
    system wide information is plain wrong and more confusing. This patch
    provides the information of the cgroup whose limit triggered the panic.

    Signed-off-by: Balasubramani Vivekanandan
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balasubramani Vivekanandan
     

12 Feb, 2015

4 commits

  • Dave noticed that an unprivileged process can allocate a significant
    amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
    oom-killer and memory cgroup. The trick is to allocate a lot of PMD page
    tables. The Linux kernel doesn't account PMD tables to the process, only
    PTE tables.

    The test case below uses a few tricks to allocate a lot of PMD page
    tables while keeping VmRSS and VmPTE low. oom_score for the process will
    be 0.

    #include <errno.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <sys/prctl.h>

    #define PUD_SIZE (1UL << 30)
    #define PMD_SIZE (1UL << 21)

    #define NR_PUD 130000

    int main(void)
    {
            char *addr = NULL;
            unsigned long i;

            /* Disable transparent huge pages for this process. */
            prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
            for (i = 0; i < NR_PUD ; i++) {
                    /* Map a new 1 GiB region, hinted after the previous one. */
                    addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
                    if (addr == MAP_FAILED) {
                            perror("mmap");
                            break;
                    }
                    /* Touch one byte so page tables get allocated. */
                    *addr = 'x';
                    /* Unmap the first 2 MiB, then map it again in place. */
                    munmap(addr, PMD_SIZE);
                    addr = mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
                                MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0);
                    if (addr == MAP_FAILED)
                            perror("re-mmap"), exit(1);
            }
            printf("PID %d consumed %lu KiB in PMD page tables\n",
                   getpid(), i * 4096 >> 10);
            return pause();
    }

    The patch addresses the issue by accounting PMD tables to the process the
    same way we account PTE tables.

    The main places where PMD tables are accounted are __pmd_alloc() and
    free_pmd_range(). But there are a few corner cases:

    - HugeTLB can share PMD page tables. The patch handles this by accounting
    the table to all processes that share it.

    - x86 PAE pre-allocates a few PMD tables on fork.

    - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust the sanity
    check on exit(2).

    Accounting only happens on configurations where the PMD page table level
    is present (PMD is not folded). As with nr_ptes we use a per-mm counter.
    The counter value is used to calculate the baseline for the badness score
    by the oom-killer.
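
    As a rough userspace model of that last point (not the kernel code), the
    baseline can be pictured as a sum of per-mm page counts that now includes
    PMD tables. The struct, its field names other than nr_ptes, and the
    function are hypothetical; the RSS and swap terms are an assumption about
    the general shape of the heuristic and are not stated in this changelog.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-mm counters; all values in pages */
struct mm_model {
    unsigned long rss;       /* resident pages (assumed term) */
    unsigned long swapents;  /* swapped-out pages (assumed term) */
    unsigned long nr_ptes;   /* PTE page tables */
    unsigned long nr_pmds;   /* PMD page tables, newly accounted */
};

/* Baseline for the badness score; PMD tables count only when not folded */
unsigned long badness_baseline(const struct mm_model *mm, int pmd_folded)
{
    unsigned long points;

    assert(mm != NULL);
    points = mm->rss + mm->swapents + mm->nr_ptes;
    if (!pmd_folded)
        points += mm->nr_pmds;
    return points;
}
```

    In this model a process holding 130000 PMD tables and almost nothing else
    scores about 130000 pages instead of close to zero.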

    Signed-off-by: Kirill A. Shutemov
    Reported-by: Dave Hansen
    Cc: Hugh Dickins
    Reviewed-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: David Rientjes
    Tested-by: Sedat Dilek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Commit 5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM
    suspend") has left a race window when the OOM killer manages to
    note_oom_kill after freeze_processes checks the counter. The race
    window is quite small and really unlikely, and a partial solution was
    deemed sufficient at the time of submission.

    Tejun wasn't happy about this partial solution though and insisted on a
    full solution. That requires the full OOM and freezer's task freezing
    exclusion, though. This is done by this patch which introduces oom_sem
    RW lock and turns oom_killer_disable() into a full OOM barrier.

    oom_killer_disabled check is moved from the allocation path to the OOM
    level and we take oom_sem for reading for both the check and the whole
    OOM invocation.

    oom_killer_disable() takes oom_sem for writing so it waits for all
    currently running OOM killer invocations. Then it disables all further
    OOMs by setting oom_killer_disabled and checks for any oom victims.
    Victims are counted via mark_tsk_oom_victim resp. unmark_oom_victim. The
    last victim wakes up all waiters enqueued by oom_killer_disable().
    Therefore this function acts as the full OOM barrier.

    The page fault path is covered now as well although it was assumed to be
    safe before. As per Tejun, "We used to have freezing points deep in file
    system code which may be reachable from page fault." so it would be
    better and more robust to not rely on freezing points here. The same
    applies to the memcg OOM killer.

    out_of_memory tells the caller whether the OOM was allowed to trigger and
    the callers are supposed to handle the situation. The page allocation
    path simply fails the allocation same as before. The page fault path will
    retry the fault (more on that later) and Sysrq OOM trigger will simply
    complain to the log.

    Normally there wouldn't be any unfrozen user tasks after
    try_to_freeze_tasks so the function will not block. But if there was an
    OOM killer racing with try_to_freeze_tasks and the OOM victim didn't
    finish yet then we have to wait for it. This should complete in a finite
    time, though, because

    - the victim cannot loop in the page fault handler (it would die
    on the way out from the exception)
    - it cannot loop in the page allocator because all the further
    allocation would fail and __GFP_NOFAIL allocations are not
    acceptable at this stage
    - it shouldn't be blocked on any locks held by frozen tasks
    (try_to_freeze expects lockless context) and kernel threads and
    work queues are not frozen yet

    Signed-off-by: Michal Hocko
    Suggested-by: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_kill_process only sets the TIF_MEMDIE flag and sends a signal to the
    victim. This is basically a noop when the task is frozen, though, because
    the task sleeps in uninterruptible sleep. The victim is eventually thawed
    later when oom_scan_process_thread meets the task again in a later OOM
    invocation so the OOM killer doesn't livelock. But this is less than
    optimal.

    Let's add __thaw_task into mark_tsk_oom_victim after we set TIF_MEMDIE to
    the victim. We are not checking whether the task is frozen because that
    would be racy and __thaw_task does that already. oom_scan_process_thread
    doesn't need to care about freezer anymore as TIF_MEMDIE and freezer are
    excluded completely now.

    Signed-off-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This patchset addresses a race which was described in the changelog for
    5695be142e20 ("OOM, PM: OOM killed task shouldn't escape PM suspend"):

    : PM freezer relies on having all tasks frozen by the time devices are
    : getting frozen so that no task will touch them while they are getting
    : frozen. But OOM killer is allowed to kill an already frozen task in order
    : to handle OOM situation. In order to protect from late wake ups OOM
    : killer is disabled after all tasks are frozen. This, however, still keeps
    : a window open when a killed task didn't manage to die by the time
    : freeze_processes finishes.

    The original patch hasn't closed the race window completely because that
    would require a more complex solution, as can be seen from this patchset.

    The primary motivation was to close the race condition between OOM killer
    and PM freezer _completely_. As Tejun pointed out, the more unlikely the
    race condition is, the harder it would be to debug weird bugs deep in the
    PM freezer, where the debugging options are reduced considerably. I can
    only speculate what might happen when a task is still runnable
    unexpectedly.

    On the plus side, and as a side effect, the oom enable/disable has a better
    (full barrier) semantic without polluting hot paths.

    I have tested the series in KVM with 100M RAM:
    - many small tasks (20M anon mmap) which are triggering OOM continually
    - s2ram which resumes automatically is triggered in a loop
        echo processors > /sys/power/pm_test
        while true
        do
            echo mem > /sys/power/state
            sleep 1s
        done
    - simple module which allocates and frees 20M in 8K chunks. If it sees
      freezing(current) then it tries another round of allocation before calling
      try_to_freeze
    - debugging messages of PM stages and OOM killer enable/disable/fail added
      and unmark_oom_victim is delayed by 1s after it clears TIF_MEMDIE and before
      it wakes up waiters.
    - rebased on top of the current mmotm which means some necessary updates
      in mm/oom_kill.c. mark_tsk_oom_victim is now called under task_lock but
      I think this should be OK because __thaw_task shouldn't interfere with any
      locking down wake_up_process. Oleg?

    As expected there are no OOM killed tasks after oom is disabled and
    allocations requested by the kernel thread are failing after all the tasks
    are frozen and OOM disabled. I wasn't able to catch a race where
    oom_killer_disable would really have to wait but I kinda expected the race
    is really unlikely.

    [ 242.609330] Killed process 2992 (mem_eater) total-vm:24412kB, anon-rss:2164kB, file-rss:4kB
    [ 243.628071] Unmarking 2992 OOM victim. oom_victims: 1
    [ 243.636072] (elapsed 2.837 seconds) done.
    [ 243.641985] Trying to disable OOM killer
    [ 243.643032] Waiting for concurent OOM victims
    [ 243.644342] OOM killer disabled
    [ 243.645447] Freezing remaining freezable tasks ... (elapsed 0.005 seconds) done.
    [ 243.652983] Suspending console(s) (use no_console_suspend to debug)
    [ 243.903299] kmem_eater: page allocation failure: order:1, mode:0x204010
    [...]
    [ 243.992600] PM: suspend of devices complete after 336.667 msecs
    [ 243.993264] PM: late suspend of devices complete after 0.660 msecs
    [ 243.994713] PM: noirq suspend of devices complete after 1.446 msecs
    [ 243.994717] ACPI: Preparing to enter system sleep state S3
    [ 243.994795] PM: Saving platform NVS memory
    [ 243.994796] Disabling non-boot CPUs ...

    The first 2 patches are simple cleanups for OOM. They should go in
    regardless of the rest IMO.

    Patches 3 and 4 are trivial printk -> pr_info conversions and they should
    go in ditto.

    The main patch is the last one and I would appreciate acks from Tejun and
    Rafael. I think the OOM part should be OK (except for __thaw_task vs.
    task_lock where a look from Oleg would be appreciated) but I am not so
    sure I haven't screwed anything in the freezer code. I have found several
    surprises there.

    This patch (of 5):

    This patch is just preparatory and it doesn't introduce any functional
    change.

    Note:
    I am utterly unhappy about lowmemory killer abusing TIF_MEMDIE just to
    wait for the oom victim and to prevent new kills. This is
    just a side effect of the flag. The primary meaning is to give the oom
    victim access to the memory reserves and that shouldn't be necessary
    here.

    Signed-off-by: Michal Hocko
    Cc: Tejun Heo
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Oleg Nesterov
    Cc: Cong Wang
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko