08 Oct, 2016

13 commits

  • We have received a hard-to-explain oom report from a customer. The oom
    killer triggered even though there was a lot of free memory:

    PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
    PoolThread cpuset=/ mems_allowed=0-7
    Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1
    Call Trace:
    dump_trace+0x75/0x300
    dump_stack+0x69/0x6f
    dump_header+0x8e/0x110
    oom_kill_process+0xa6/0x350
    out_of_memory+0x2b7/0x310
    __alloc_pages_slowpath+0x7dd/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_anonymous_page+0x13e/0x300
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30
    [...]
    active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
    active_file:13093 inactive_file:222506 isolated_file:0
    unevictable:262144 dirty:2 writeback:0 unstable:0
    free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
    mapped:261139 shmem:166297 pagetables:2228282 bounce:0
    [...]
    Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 2892 775542 775542
    Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 772650 772650
    Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0

    So all nodes but Node 0 have a lot of free memory, which suggests that
    memory should be available, especially with mems_allowed=0-7. One
    could speculate that a massive process managed to terminate and free
    up a lot of memory while racing with the above allocation request.
    Although this is highly unlikely, it cannot be ruled out.

    Further debugging, however, showed that the faulting process had a
    mempolicy (not a cpuset) binding it to Node 0. We cannot see that
    information in the report, though. mems_allowed turned out to be more
    confusing than really helpful.

    Fix this by always printing the nodemask. It is either the mempolicy
    mask (when non-null) or the one defined by the cpusets. The new output
    for the above oom report would be

    PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0

    This patch doesn't touch show_mem or the node filtering based on the
    cpuset node mask because mempolicy is always a subset of cpusets and
    seeing the full cpuset oom context might be helpful for tuning more
    specific mempolicies inside cpusets (e.g. when they turn out to be too
    restrictive). To avoid ugly ifdefs the mask is printed even for !NUMA
    configurations, but this should be OK (a single node will be printed).
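
    A minimal sketch of the idea (names follow the mainline oom code, but
    the exact report layout here is illustrative):

    /* prefer the mempolicy nodemask, fall back to the cpuset one */
    static void dump_header(struct oom_control *oc, struct task_struct *p)
    {
            nodemask_t *nm = oc->nodemask ? oc->nodemask
                                          : &cpuset_current_mems_allowed;

            pr_warn("%s invoked oom-killer: gfp_mask=%#x(%pGg), nodemask=%*pbl, order=%d, oom_score_adj=%hd\n",
                    current->comm, oc->gfp_mask, &oc->gfp_mask,
                    nodemask_pr_args(nm), oc->order,
                    current->signal->oom_score_adj);
            /* ... the rest of the report is unchanged ... */
    }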

    Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Sellami Abdelkader
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Sellami Abdelkader
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
    raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
    to warn when we raced with disabling the OOM killer.

    Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend
    properly" introduced a timeout for oom_killer_disable(). Even if we
    raced with disabling the OOM killer and the system is OOM livelocked,
    the OOM killer will be enabled eventually (in 20 seconds by default) and
    the OOM livelock will be solved. Therefore, we no longer need to warn
    when we raced with disabling the OOM killer.

    Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Since lumpy reclaim is gone there is no source of higher order pages
    if CONFIG_COMPACTION=n except for order-0 page reclaim, which is
    unreliable for that purpose to say the least. Hitting an OOM for
    !costly higher order requests is therefore not all that hard to
    imagine. We try hard not to invoke the OOM killer as much as possible,
    but there is simply no reliable way to detect whether more reclaim
    retries make sense.

    Disabling COMPACTION is not widespread, but it seems that some users
    might have disabled the feature without realizing the full consequences
    (mostly along with disabling THP, because compaction used to be mainly
    a THP thing). This patch just adds a note when the OOM killer was
    triggered by a higher order request with compaction disabled. This will
    help us identify possible misconfiguration right from the oom report,
    which is easier than always keeping in mind that somebody might have
    disabled COMPACTION without a good reason.
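
    The note itself can be a one-liner in the oom report path; a sketch
    (its exact placement in dump_header is an assumption here):

    /* flag a likely misconfiguration right in the oom report */
    if (!IS_ENABLED(CONFIG_COMPACTION) && oc->order)
            pr_warn("COMPACTION is disabled!!!\n");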

    Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The oom reaper was skipped for an mm which is shared with a kernel
    thread (aka use_mm()). The primary concern was that such a kthread
    might want to read from the userspace memory and see a zero page as a
    result of the oom reaper action. This is no longer a problem after
    "mm: make sure that kthreads will not refault oom reaped memory"
    because any attempt to fault in while MMF_UNSTABLE is set will result
    in SIGBUS and so the target user should see an error. This means that
    we can finally allow the oom reaper also for tasks which share their
    mm with kthreads.

    Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are only few use_mm() users in the kernel right now. Most of them
    write to the target memory but vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares the mm with
    the vhost kernel thread because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device,
    which is not tied to the mm owner's life cycle in general. The device
    fd can be inherited or passed over to another process, which means
    that we really have to be careful about unexpected memory corruption
    because, unlike for normal oom victims, the result will be visible
    outside of the oom victim context.

    Make sure that no kthread context (user of use_mm) can ever see
    corrupted data because of the oom reaper: hook into the page fault
    path by checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm will set
    the flag before it starts unmapping the address space, while the flag
    is checked after the page fault has been handled. If the flag is set
    then SIGBUS is triggered, so any g-u-p user will get an error code.

    Regular tasks do not need this protection because all tasks which
    share the mm are killed when the mm is reaped, so the corruption will
    not outlive them.

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke oom reaper for tasks with mm shared with
    kthreads yet.
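
    A sketch of both sides of the handshake (abbreviated; the fault-side
    check sits at the end of the generic handle_mm_fault path):

    /* oom reaper side: mark the address space unstable before unmapping */
    set_bit(MMF_UNSTABLE, &mm->flags);

    /* page fault side: kthreads must not see reaped (zeroed) memory */
    ret = __handle_mm_fault(vma, address, flags);
    if (unlikely((current->flags & PF_KTHREAD) &&
                 !(ret & VM_FAULT_ERROR) &&
                 test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
            ret = VM_FAULT_SIGBUS; /* g-u-p users get an error, not zeros */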

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are no users of exit_oom_victim on a !current task anymore, so
    enforce that the API always works on the current task.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") worked around an existing race between
    oom_killer_disable and the oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have a stable mm
    available and hide the oom-reaped mm by the MMF_OOM_SKIP flag. So
    let's remove exit_oom_victim from the reaper; the race described in
    the above commit doesn't exist anymore.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    usecase because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on the way to exit for an
    unbounded amount of time). The OOM killer can cope with that by
    checking mm flags and moving on to another victim, but we cannot do
    the same for oom_killer_disable as we would lose the guarantee of no
    further interference of the victim with the rest of the system. What
    we can do instead is to cap the maximum time oom_killer_disable waits
    for victims. The only current user of this function (pm suspend)
    already has a concept of a timeout for back off, so we can reuse the
    same value there.

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.
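
    A sketch of the resulting API (assuming the pm suspend caller passes
    its existing freeze_timeout_msecs value):

    bool oom_killer_disable(signed long timeout)
    {
            signed long ret;

            if (mutex_lock_killable(&oom_lock))
                    return false;
            oom_killer_disabled = true;
            mutex_unlock(&oom_lock);

            /* give up after the timeout instead of waiting forever */
            ret = wait_event_interruptible_timeout(oom_victims_wait,
                            !atomic_read(&oom_victims), timeout);
            if (ret <= 0) {
                    oom_killer_enable();
                    return false;
            }
            return true;
    }

    /* pm suspend side (sketch) */
    if (!oom_killer_disable(msecs_to_jiffies(freeze_timeout_msecs)))
            return -EBUSY;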

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.
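
    A minimal sketch of the nommu fallback (the rename from
    MMF_OOM_REAPED to MMF_OOM_SKIP is assumed to have happened already):

    static inline void __mmput(struct mm_struct *mm)
    {
            /* ... tear the address space down ... */

            /* the memory is gone now, hide the mm from the oom killer */
            set_bit(MMF_OOM_SKIP, &mm->flags);

            /* ... */
    }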

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_reap_task has to call exit_oom_victim in order to make sure that
    the oom victim will not block the oom killer forever. This is,
    however, opening new problems (e.g. oom_killer_disable exclusion - see
    commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race")). exit_oom_victim should ideally be called
    only from the victim's context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
      exit_mm
        tsk->mm = NULL;
        mmput
          __mmput
            exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm in the
    signal_struct and bind it to the signal struct lifetime for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a
    reliable reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via
    tsk->signal->oom_mm.

    Increasing the size of the signal_struct for something as unlikely as
    the oom killer is far from ideal, but this approach will make the code
    much more reasonable, and long term we might even want to move
    task->mm into the signal_struct anyway. In the next step we might want
    to make the oom killer exclusion and access to memory reserves
    completely independent, which would also be nice.
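
    A sketch of the pinning (mm_count, not mm_users, so only the
    mm_struct itself is kept alive, not its memory):

    static void mark_oom_victim(struct task_struct *tsk)
    {
            struct mm_struct *mm = tsk->mm;

            /* ... */

            /* oom_mm is bound to the signal struct lifetime */
            if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
                    atomic_inc(&mm->mm_count);

            /* ... */
    }

    /* the reference is dropped again (mmdrop) when the signal struct
     * is freed */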

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
    OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
    But the usefulness of the flag is rather limited and actually never
    shown in practice. If the flag is set, it means that the holder of
    mm->mmap_sem cannot call up_write() due to presumably being blocked at
    unkillable wait waiting for other thread's memory allocation. But since
    one of threads sharing that mm will queue that mm immediately via
    task_will_free_mem() shortcut (otherwise, oom_badness() will select the
    same mm again due to oom_score_adj value unchanged), retrying
    MMF_OOM_NOT_REAPABLE mm is unlikely helpful.

    Let's always set MMF_OOM_REAPED.

    Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Patch series "fortify oom killer even more", v2.

    This patch (of 9):

    __oom_reap_task() can be simplified a bit if it receives a valid mm
    from oom_reap_task(), which also uses that mm when __oom_reap_task()
    fails. We can drop one find_lock_task_mm() call and also make the
    __oom_reap_task() code flow easier to follow. Moreover, this will make
    a later patch in the series easier to review. Pinning the mm's
    mm_count for a longer time is not really harmful because it will not
    pin much memory.

    This patch doesn't introduce any functional change.

    Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Attempt to demystify the task_will_free_mem() loop.

    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When selecting an oom victim, we use the same heuristic for both memory
    cgroup and global oom. The only difference is the scope of tasks to
    select the victim from. So we could just export an iterator over all
    memcg tasks and keep all oom related logic in oom_kill.c, but instead we
    duplicate pieces of it in memcontrol.c reusing some initially private
    functions of oom_kill.c in order to not duplicate all of it. That looks
    ugly and error prone, because any modification of select_bad_process
    should also be propagated to mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom heuristic related code private
    to oom_kill.c and make oom_kill.c use exported memcg functions when it's
    really necessary (like in case of iterating over memcg tasks).

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Aug, 2016

1 commit

  • mm/oom_kill.c: In function `task_will_free_mem':
    mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function

    If __task_will_free_mem() is never called inside the for_each_process()
    loop, ret will not be initialized.

    Fixes: 1af8bb43269563e4 ("mm, oom: fortify task_will_free_mem()")
    Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

29 Jul, 2016

8 commits

  • "mm, oom: fortify task_will_free_mem" has dropped task_lock around
    task_will_free_mem in oom_kill_process bacause it assumed that a
    potential race when the selected task exits will not be a problem as the
    oom_reaper will call exit_oom_victim.

    Tetsuo was objecting that nommu doesn't have oom_reaper so the race
    would be still possible. The code would be racy and lockup prone
    theoretically in other aspects without the oom reaper anyway so I didn't
    considered this a big deal. But it seems that further changes I am
    planning in this area will benefit from stable task->mm in this path as
    well. So let's drop find_lock_task_mm from task_will_free_mem and call
    it from under task_lock as we did previously. Just pull the task->mm !=
    NULL check inside the function.

    Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The only case where the oom_reaper is not triggered for the oom victim
    is when it shares the memory with a kernel thread (aka use_mm) or with
    the global init. After "mm, oom: skip vforked tasks from being
    selected" the victim cannot be a vforked task of the global init so we
    are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users
    are quite rare as well.

    In order to help forward progress for the OOM killer, make sure that
    this really rare case will not get in the way - we do this by hiding the
    mm from the oom killer by setting MMF_OOM_REAPED flag for it.
    oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
    MMF_OOM_REAPED flag set to catch these oom victims.

    After this patch we should guarantee forward progress for the OOM
    killer even when the selected victim is sharing memory with a kernel
    thread or global init, as long as the victim's mm is still alive.

    Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The oom_reaper relies on the mmap_sem for read to do its job. Many
    places which might block readers have been converted to use
    down_write_killable and that has reduced the chances of contention a
    lot. Some paths where the mmap_sem is held for write can take other
    locks and they might either be not prepared to fail due to a fatal
    signal pending or be too impractical to change.

    This patch introduces the MMF_OOM_NOT_REAPABLE flag which gets set
    after the first attempt to reap a task's mm fails. If the flag is
    already present on a subsequent failure then we set MMF_OOM_REAPED to
    hide this mm from the oom killer completely so it can go and choose
    another victim.

    As a result, the risk of an OOM deadlock - when the oom victim would
    be blocked indefinitely and so the oom killer could not make any
    progress - should be mitigated considerably, while we still try really
    hard to perform all reclaim attempts and stay predictable in behavior.

    Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The 0-day robot has encountered the following:

    Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
    Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

    oom_reaper is trying to reap the same task again and again.

    This is possible only when the oom killer is bypassed via
    task_will_free_mem, because we skip over tasks with MMF_OOM_REAPED
    already set during select_bad_process. Teach task_will_free_mem to
    skip over MMF_OOM_REAPED tasks as well because they are unlikely to
    free anything more.

    Analyzed by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • task_will_free_mem is rather weak. It doesn't really tell whether the
    task has a chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step towards
    making it more robust for multi-threaded applications, so now we know
    that the whole process is going down and will probably drop the mm.

    This patch builds on top for more complex scenarios where mm is shared
    between different processes - CLONE_VM without CLONE_SIGHAND, or in kernel
    use_mm().

    Make sure that all processes sharing the mm are killed or exiting. This
    will allow us to replace try_oom_reaper by wake_oom_reaper because
    task_will_free_mem implies the task is reapable now. Therefore all paths
    which bypass the oom killer are now reapable and so they shouldn't lock up
    the oom killer.
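
    A sketch of the strengthened check (process_shares_mm is an existing
    oom_kill.c helper; error handling abbreviated):

    static bool task_will_free_mem(struct task_struct *task)
    {
            struct mm_struct *mm = task->mm;
            struct task_struct *p;
            bool ret = true;

            if (!__task_will_free_mem(task))
                    return false;

            /* every process sharing the mm must be going down as well */
            rcu_read_lock();
            for_each_process(p) {
                    if (!process_shares_mm(p, mm))
                            continue;
                    if (same_thread_group(task, p))
                            continue;
                    ret = __task_will_free_mem(p);
                    if (!ret)
                            break;
            }
            rcu_read_unlock();

            return ret;
    }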

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Currently oom_kill_process skips both the oom reaper and SIGKILL if a
    process sharing the same mm is unkillable via OOM_ADJUST_MIN. After
    "mm, oom_adj: make sure processes sharing mm have same view of
    oom_score_adj" all such processes share the same value, so we
    shouldn't see such a task at all (oom_badness would rule them out).

    We can still encounter an oom disabled vforked task, which has to be
    killed as well if we want the other tasks sharing the mm to be
    reapable, because it can access the memory before doing exec. Killing
    such a task should be acceptable because it is highly unlikely it has
    done anything useful - it cannot modify any memory before it calls
    exec. An alternative would be to keep the task alive, skip the oom
    reaper, and risk all the weird corner cases where the OOM killer
    cannot make forward progress because the oom victim hung somewhere on
    the way to exit.

    [rientjes@google.com - drop printk when OOM_SCORE_ADJ_MIN killed task
    the setting is inherently racy and we cannot do much about it without
    introducing locks in hot paths]
    Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • vforked tasks are not really sitting on any memory. They share the mm
    with the parent until they exec into new code. Until then the task is
    just pinning the address space. The OOM killer will kill the vforked
    task along with its parent, but we can still end up selecting a
    vforked task when the parent wouldn't be selected - e.g. init doing a
    vfork to launch a task, or a vforked task being a child of an oom
    unkillable task with an updated oom_score_adj making it killable.

    Add a new helper to check whether a task is in a vfork sharing memory
    with its parent and use it in oom_badness to skip over these tasks.
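
    The helper boils down to checking vfork_done together with the
    parent's mm; a sketch:

    static inline bool in_vfork(struct task_struct *tsk)
    {
            bool ret;

            /* RCU protects ->real_parent if CLONE_VM was used along
             * with CLONE_PARENT */
            rcu_read_lock();
            ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
            rcu_read_unlock();

            return ret;
    }

    /* in oom_badness(), sketch: skip such tasks entirely */
    if (in_vfork(p))
            return 0;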

    Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_score_adj is shared within a thread group (via struct signal) but
    this is not sufficient to cover processes sharing the mm (CLONE_VM
    without CLONE_SIGHAND) and so we can easily end up in a situation
    where some processes update their oom_score_adj and confuse the oom
    killer. In the worst case some of those processes might hide from the
    oom killer altogether via OOM_SCORE_ADJ_MIN while others are eligible.
    The OOM killer would then pick the eligible ones but wouldn't be
    allowed to kill the others sharing the same mm, so the mm - and hence
    the memory - would never be released.

    It would be ideal to have the oom_score_adj per mm_struct because that is
    the natural entity OOM killer considers. But this will not work because
    some programs are doing

    vfork()
    set_oom_adj()
    exec()

    We can achieve the same effect, though. The oom_score_adj write
    handler can set oom_score_adj for all processes sharing the same mm if
    the task is not in the middle of a vfork. As a result all the
    processes will share the same oom_score_adj. The current
    implementation is rather pessimistic and by default checks all
    existing processes if there is more than one holder of the mm, because
    we do not have any reliable way to check for external users yet.
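
    A sketch of the write handler side (assuming a process_shares_mm()
    helper is visible here; locking details abbreviated):

    /* propagate the new value to every process sharing the mm */
    rcu_read_lock();
    for_each_process(p) {
            if (same_thread_group(task, p))
                    continue;
            /* do not touch kernel threads or the global init */
            if (p->flags & PF_KTHREAD || is_global_init(p))
                    continue;

            task_lock(p);
            if (!p->vfork_done && process_shares_mm(p, mm))
                    p->signal->oom_score_adj = oom_adj;
            task_unlock(p);
    }
    rcu_read_unlock();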

    Link: http://lkml.kernel.org/r/1466426628-15074-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

27 Jul, 2016

4 commits

  • Tetsuo is worried that mmput_async might still lead to a premature new
    oom victim selection due to the following race:

    __oom_reap_task                       exit_mm
      find_lock_task_mm
      atomic_inc(mm->mm_users) # = 2
      task_unlock
                                            task_lock
                                            task->mm = NULL
                                            up_read(&mm->mmap_sem)
                < somebody write locks mmap_sem >
                                            task_unlock
                                            mmput
                                              atomic_dec_and_test # = 1
                                            exit_oom_victim
      down_read_trylock # failed - no reclaim
      mmput_async # Takes unpredictable amount of time
                < new OOM situation >

    the final __mmput will be executed in a delayed context which might
    happen far in the future. Such a race is highly unlikely because the
    write holder of mmap_sem would have to be an external task (all direct
    holders are already killed or exiting) and it usually has to pin
    mm_users in order to do anything reasonable.

    We can, however, make sure that mmput_async is only called when we do
    not back off and actually reap some memory. That would reduce the
    impact of the delayed __mmput because the real content would already
    be freed. Pin mm_count to keep the mm alive after we drop task_lock
    and before we try to get mmap_sem. If the mmap_sem trylock succeeds we
    can try to grab the mm_users reference and then go on with unmapping
    the address space.

    It is not clear whether this race is possible at all, but it is better
    to be more robust and not pin mm_users unless we are sure we are
    actually doing some real work during __oom_reap_task.
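
    A sketch of the resulting ordering in __oom_reap_task (labels and
    error handling abbreviated):

    p = find_lock_task_mm(tsk);
    if (!p)
            goto unlock_oom;
    mm = p->mm;
    atomic_inc(&mm->mm_count);      /* pin the mm_struct, not its memory */
    task_unlock(p);

    if (!down_read_trylock(&mm->mmap_sem)) {
            ret = false;
            goto mm_drop;           /* mmdrop() the pinned mm_count */
    }

    /*
     * Take the mm_users reference only once we know we will reap
     * something, so a delayed __mmput has little left to free.
     */
    if (!atomic_inc_not_zero(&mm->mm_users)) {
            up_read(&mm->mmap_sem);
            goto mm_drop;
    }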

    Link: http://lkml.kernel.org/r/1465306987-30297-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • oom_scan_process_thread() does not use the totalpages argument.
    oom_badness() does.

    Link: http://lkml.kernel.org/r/1463796041-7889-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • It's a part of oom context just like allocation order and nodemask, so
    let's move it to oom_control instead of passing it in the argument list.

    Link: http://lkml.kernel.org/r/40e03fd7aaf1f55c75d787128d6d17c5a71226c2.1464358556.git.vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Not used since oom_lock was introduced.

    Link: http://lkml.kernel.org/r/1464358093-22663-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

25 Jun, 2016

2 commits

  • Since commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper
    managed to unmap the address space") changed to use find_lock_task_mm()
    for finding a mm_struct to reap, it is guaranteed that mm->mm_users > 0
    because find_lock_task_mm() returns a task_struct with ->mm != NULL.
    Therefore, we can safely use atomic_inc().

    Link: http://lkml.kernel.org/r/1465024759-8074-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit e2fe14564d33 ("oom_reaper: close race with exiting task")
    reduced the frequency of needlessly selecting the next OOM victim, but
    was calling mmput_async() even when atomic_inc_not_zero() failed.

    Link: http://lkml.kernel.org/r/1464423365-5555-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

04 Jun, 2016

1 commit

  • Oleg has noted that the siglock usage in try_oom_reaper is both
    pointless and dangerous: signal_group_exit can be checked locklessly,
    and sighand becomes NULL in __exit_signal so we can crash.

    Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
    Link: http://lkml.kernel.org/r/1464679423-30218-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Suggested-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

28 May, 2016

2 commits

  • Tetsuo has reported:
    Out of memory: Kill process 443 (oleg's-test) score 855 or sacrifice child
    Killed process 443 (oleg's-test) total-vm:493248kB, anon-rss:423880kB, file-rss:4kB, shmem-rss:0kB
    sh invoked oom-killer: gfp_mask=0x24201ca(GFP_HIGHUSER_MOVABLE|__GFP_COLD), order=0, oom_score_adj=0
    sh cpuset=/ mems_allowed=0
    CPU: 2 PID: 1 Comm: sh Not tainted 4.6.0-rc7+ #51
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
    Call Trace:
    dump_stack+0x85/0xc8
    dump_header+0x5b/0x394
    oom_reaper: reaped process 443 (oleg's-test), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    In other words:

    __oom_reap_task                exit_mm
      atomic_inc_not_zero
                                     tsk->mm = NULL
                                     mmput
                                       atomic_dec_and_test # > 0
                                     exit_oom_victim # New victim will be
                                                     # selected

                                   # no TIF_MEMDIE task so we can select a new one
      unmap_page_range # to release the memory

    The race exists even without the oom_reaper because anybody who pins
    the address space and gets preempted might race with exit_mm, but the
    oom_reaper made this race more probable.

    We can address the oom_reaper part by using oom_lock for
    __oom_reap_task because this would guarantee that a new oom victim
    will not be selected if the oom reaper might race with the exit path.
    This doesn't solve the original issue, though, because somebody else
    still might be pinning mm_users and so __mmput won't be called to
    release the memory. But that is not really reliably solvable because
    the task will get out of the OOM killer's sight as soon as it is
    unhashed from the task_list, and so we cannot guarantee a new victim
    won't be selected.

    [akpm@linux-foundation.org: fix use of unused `mm', Per Stephen]
    [akpm@linux-foundation.org: coding-style fixes]
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Link: http://lkml.kernel.org/r/1464271493-20008-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • If the current process is exiting, we don't invoke oom killer, instead
    we give it access to memory reserves and try to reap its mm in case
    nobody is going to use it. There's a mistake in the code performing
    this check - we just ignore any process of the same thread group no
    matter if it is exiting or not - see try_oom_reaper. Fix it.

    Link: http://lkml.kernel.org/r/1464087628-7318-1-git-send-email-vdavydov@virtuozzo.com
    Fixes: 3ef22dfff239 ("oom, oom_reaper: try to reap tasks which skip regular OOM killer path")
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

21 May, 2016

3 commits

  • Since commit 3a5dda7a17cf ("oom: prevent unnecessary oom kills or kernel
    panics"), select_bad_process() is using for_each_process_thread().

    Since oom_unkillable_task() scans all threads in the caller's thread
    group and oom_task_origin() scans signal_struct of the caller's thread
    group, we don't need to call oom_unkillable_task() and oom_task_origin()
    on each thread. Also, since !mm test will be done later at
    oom_badness(), we don't need to do !mm test on each thread. Therefore,
    we only need to do TIF_MEMDIE test on each thread.

    Although the original code was correct it was quite inefficient,
    because each thread group was scanned num_threads times, which can be
    a lot especially for processes with many threads. Even though the OOM
    path is extremely cold it is always good to be as efficient as
    possible when we are inside rcu_read_lock() - aka an unpreemptible
    context.

    If we track the number of TIF_MEMDIE threads inside signal_struct, we
    don't need to do the TIF_MEMDIE test on each thread. This allows
    select_bad_process() to use for_each_process().

    This patch adds a counter to signal_struct for tracking how many
    TIF_MEMDIE threads are in a given thread group, and check it at
    oom_scan_process_thread() so that select_bad_process() can use
    for_each_process() rather than for_each_process_thread().
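
    A sketch of the shape this takes (the field name and placement are
    assumptions based on the description above):

    struct signal_struct {
            /* ... */
            atomic_t oom_victims; /* # of TIF_MEMDIE threads in group */
            /* ... */
    };

    /* oom_scan_process_thread() can then test the whole group at once */
    if (!is_sysrq_oom(oc) && atomic_read(&task->signal->oom_victims))
            return OOM_SCAN_ABORT;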

    [mhocko@suse.com: do not blow the signal_struct size]
    Link: http://lkml.kernel.org/r/20160520075035.GF19172@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/201605182230.IDC73435.MVSOHLFOQFOJtF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Tetsuo has properly noted that the mmput slow path might get blocked
    waiting for another party (e.g. exit_aio waits for an IO). If that
    happens the oom_reaper would be put out of the way and would not be
    able to process the next oom victim. We should strive to make this
    context as reliable and as independent of other subsystems as
    possible.

    Introduce mmput_async which will perform the slow path from an async
    (WQ) context. This delays the operation but that shouldn't be a
    problem because the oom_reaper has, for most cases, already reclaimed
    the victim's address space as much as possible and the remaining
    context shouldn't pin too much memory anymore. The only exception is
    when the mmap_sem trylock has failed, which shouldn't happen too
    often.

    The issue is only theoretical but not impossible.
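
    A sketch of the helper (assuming an async_put_work member is added to
    mm_struct by this patch):

    static void mmput_async_fn(struct work_struct *work)
    {
            struct mm_struct *mm = container_of(work, struct mm_struct,
                                                async_put_work);
            __mmput(mm);
    }

    void mmput_async(struct mm_struct *mm)
    {
            if (atomic_dec_and_test(&mm->mm_users)) {
                    /* defer the expensive teardown to a workqueue */
                    INIT_WORK(&mm->async_put_work, mmput_async_fn);
                    schedule_work(&mm->async_put_work);
            }
    }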

    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit 36324a990cf5 ("oom: clear TIF_MEMDIE after oom_reaper managed
    to unmap the address space") not only clears TIF_MEMDIE for the oom
    reaped task but also sets OOM_SCORE_ADJ_MIN for the target task to
    hide it from the oom killer. This works in simple cases but it is not
    sufficient for (unlikely) cases where the mm is shared between
    independent processes (as they do not share the signal struct). If
    the mm had only a small amount of memory which could be reaped then
    another task sharing the mm could be selected and that wouldn't help
    to move out of the oom situation.

    Introduce the MMF_OOM_REAPED mm flag which is checked in oom_badness
    (same as OOM_SCORE_ADJ_MIN) and the task is skipped if the flag is
    set. Set the flag after __oom_reap_task is done with a task. This
    forces select_bad_process() to ignore all already oom reaped tasks
    and ensures that no such task is sacrificed for its parent.

    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

20 May, 2016

3 commits

  • Right now the oom reaper will clear TIF_MEMDIE only for tasks which were
    successfully reaped. This is the safest option because we know that
    such an oom victim would only block forward progress of the oom killer
    without a good reason because it is highly unlikely it would release
    much more memory. Basically most of its memory has been already torn
    down.

    We can relax this assumption to catch more corner cases though.

    The first obvious one is when the oom victim clears its mm and gets
    stuck later on. The oom_reaper would back off on find_lock_task_mm
    returning NULL. We can safely try to clear TIF_MEMDIE in this case
    because such a task would be ignored by the oom killer anyway. Most of
    the time the flag will have been cleared by then anyway.

    The less obvious one is when the oom reaper fails due to mmap_sem
    contention. Even if we clear TIF_MEMDIE for this task then it is not
    very likely that we would select another task too easily because we
    haven't reaped the last victim and so it would still be the #1
    candidate. There is a rare race condition possible when the current
    victim terminates before the next select_bad_process, but considering
    that oom_reap_task has retried several times before giving up, this
    sounds like a borderline concern.

    After this patch we should have a guarantee that the OOM killer will
    not be blocked for an unbounded amount of time for most cases.

    Signed-off-by: Michal Hocko
    Cc: Raushaniya Maksudova
    Cc: Michael S. Tsirkin
    Cc: Paul E. McKenney
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Daniel Vetter
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • If either the current task is already killed or PF_EXITING, or a
    selected task is PF_EXITING, then the oom killer is suppressed and so
    is the oom reaper. This patch adds try_oom_reaper which checks the
    given task and queues it for the oom reaper if that is safe to do,
    meaning that the task doesn't share the mm with an alive process.

    This might help to release the memory pressure while the task tries to
    exit.

    [akpm@linux-foundation.org: fix nommu build]
    Signed-off-by: Michal Hocko
    Cc: Raushaniya Maksudova
    Cc: Michael S. Tsirkin
    Cc: Paul E. McKenney
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Daniel Vetter
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_may_oom is the central place to decide when the
    out_of_memory should be invoked. This is a good approach for most
    checks there because they are page allocator specific and the allocation
    fails right after for all of them.

    The notable exception is the GFP_NOFS context, which fakes
    did_some_progress and keeps the page allocator looping even though
    there couldn't have been any progress from the OOM killer. This patch
    doesn't change that behavior because we are not ready to allow those
    allocation requests to fail yet (and maybe we will face the reality
    that we will never manage to safely fail these requests). Instead the
    __GFP_FS check is moved down to out_of_memory and prevents OOM victim
    selection there. There are two reasons for that:

    - OOM notifiers might release some memory even from this context
      as none of the registered notifiers seems to be FS related
    - this might help a dying thread to get access to memory reserves
      and move on, which will make the behavior more consistent with
      the case when the task gets killed from a different context

    Keep a comment in __alloc_pages_may_oom to make sure we do not forget
    how GFP_NOFS is special and that we really want to do something about
    it.
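
    In code, the check lands in out_of_memory after the notifiers have had
    their chance (a sketch; the __GFP_NOFAIL exclusion is an assumption
    about requests which may not fail):

    /* give the oom notifiers a chance to release memory first */
    blocking_notifier_call_chain(&oom_notify_list, 0, &freed);
    if (freed > 0)
            return true;

    /*
     * The OOM killer does not compensate for IO-less reclaim.
     * A zero gfp_mask (e.g. the pagefault path) is deliberately
     * excluded from the filter.
     */
    if (oc->gfp_mask && !(oc->gfp_mask & (__GFP_FS | __GFP_NOFAIL)))
            return true;    /* bail out without selecting a victim */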

    Note to the current oom_notifier users:

    The observable difference for you is that oom notifiers cannot depend
    on any fs locks because we could deadlock. Not that this would be
    allowed today, because that would just lock up the machine in most
    cases and rule out the OOM killer along the way. Another difference is
    that callbacks might be invoked sooner now because GFP_NOFS is a
    weaker reclaim context and so there could be reclaimable memory which
    is just not reachable yet. That would require GFP_NOFS-only loads,
    which are really rare, and more importantly the observable result
    would be the dropping of reconstructible objects and a potential
    performance drop, which is not such a big deal when we are struggling
    to fulfill other important allocation requests.

    Signed-off-by: Michal Hocko
    Cc: Raushaniya Maksudova
    Cc: Michael S. Tsirkin
    Cc: Paul E. McKenney
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Cc: Daniel Vetter
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Apr, 2016

1 commit

  • Commit bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using
    simpler way") has simplified the check for tasks already enqueued for
    the oom reaper by checking tsk->oom_reaper_list != NULL. This check is
    not sufficient because the tsk might be the head of the queue without
    any other tasks queued, and then we would simply lock up looping on
    the same task. Fix the condition by checking for the head as well.
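
    The fixed check in wake_oom_reaper then looks roughly like:

    static void wake_oom_reaper(struct task_struct *tsk)
    {
            if (!oom_reaper_th)
                    return;

            /* tsk already queued? (it may also be the list head) */
            if (tsk == oom_reaper_list || tsk->oom_reaper_list)
                    return;

            get_task_struct(tsk);

            spin_lock(&oom_reaper_lock);
            tsk->oom_reaper_list = oom_reaper_list;
            oom_reaper_list = tsk;
            spin_unlock(&oom_reaper_lock);
            wake_up(&oom_reaper_wait);
    }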

    Fixes: bb29902a7515 ("oom, oom_reaper: protect oom_reaper_list using simpler way")
    Signed-off-by: Michal Hocko
    Acked-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

26 Mar, 2016

2 commits

  • "oom, oom_reaper: disable oom_reaper for oom_kill_allocating_task" tried
    to protect oom_reaper_list using MMF_OOM_KILLED flag. But we can do it
    by simply checking tsk->oom_reaper_list != NULL.

    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • After "oom: clear TIF_MEMDIE after oom_reaper managed to unmap the
    address space" oom_reaper will call exit_oom_victim on the target task
    after it is done. This might however race with the PM freezer:

    CPU0                    CPU1                    CPU2
    freeze_processes
      try_to_freeze_tasks
                            # Allocation request
                            out_of_memory
      oom_killer_disable
                              wake_oom_reaper(P1)
                                                    __oom_reap_task
                                                      exit_oom_victim(P1)
        wait_event(oom_victims==0)
    [...]
                            do_exit(P1)
                              perform IO/interfere with the freezer

    which breaks the oom_killer_disable semantics. We no longer have a
    guarantee that the oom victim won't interfere with the freezer because
    it might be anywhere on the way to do_exit while the freezer thinks
    the task has already terminated. It might trigger IO or touch devices
    which are frozen already.

    In order to close this race, make the oom_reaper thread freezable (a
    sketch follows the list below). This will work because
      a) an already running oom_reaper will block the freezer from
         entering the quiescent state
      b) wake_oom_reaper will not wake up the reaper after it has been
         frozen
      c) the only way to call exit_oom_victim after try_to_freeze_tasks
         is from the oom victim's context, when we know that further
         interference shouldn't be possible
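
    The kthread side of this is small; a sketch:

    static int oom_reaper(void *unused)
    {
            set_freezable();        /* participate in the PM freezer */

            while (true) {
                    struct task_struct *tsk = NULL;

                    /* freezable sleep: the reaper parks while tasks
                     * are being frozen */
                    wait_event_freezable(oom_reaper_wait,
                                         oom_reaper_list != NULL);

                    spin_lock(&oom_reaper_lock);
                    if (oom_reaper_list != NULL) {
                            tsk = oom_reaper_list;
                            oom_reaper_list = tsk->oom_reaper_list;
                    }
                    spin_unlock(&oom_reaper_lock);

                    if (tsk)
                            oom_reap_task(tsk);
            }

            return 0;
    }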

    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko