16 May, 2018

1 commit

  • commit 27ae357fa82be5ab73b2ef8d39dcb8ca2563483a upstream.

    Since exit_mmap() is done without the protection of mm->mmap_sem, it is
    possible for the oom reaper to concurrently operate on an mm until
    MMF_OOM_SKIP is set.

    This allows munlock_vma_pages_all() to concurrently run while the oom
    reaper is operating on a vma. Since munlock_vma_pages_range() depends
    on clearing VM_LOCKED from vm_flags before actually doing the munlock to
    determine if any other vmas are locking the same memory, the check for
    VM_LOCKED in the oom reaper is racy.

    This is especially noticeable on architectures such as powerpc where
    clearing a huge pmd requires serialize_against_pte_lookup(). If the pmd
    is zapped by the oom reaper during follow_page_mask() after the check
    for pmd_none() is bypassed, this ends up dereferencing a NULL ptl or
    causing a kernel oops.

    Fix this by manually freeing all possible memory from the mm before
    doing the munlock and then setting MMF_OOM_SKIP. The oom reaper cannot
    run on the mm anymore, so the munlock is safe to do in exit_mmap(). It
    also matches the logic that the oom reaper currently uses for
    determining when to set MMF_OOM_SKIP itself, so there's no new risk of
    excessive oom killing.
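
    A minimal sketch of the reordered exit_mmap() tail described above;
    names follow the upstream oom code, and the exact body differs:

        /* reap first, then forbid the reaper, then munlock safely */
        if (unlikely(mm_is_oom_victim(mm))) {
                __oom_reap_task_mm(mm);    /* free what the reaper would */
                /* the reaper will never touch this mm again */
                set_bit(MMF_OOM_SKIP, &mm->flags);
                /* wait out any reaper still holding mmap_sem */
                down_write(&mm->mmap_sem);
                up_write(&mm->mmap_sem);
        }
        /* only now is it safe to clear VM_LOCKED and munlock */
        if (mm->locked_vm) {
                for (vma = mm->mmap; vma; vma = vma->vm_next)
                        if (vma->vm_flags & VM_LOCKED)
                                munlock_vma_pages_all(vma);
        }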

    This fixes CVE-2018-1000200.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1804241526320.238665@chino.kir.corp.google.com
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: David Rientjes
    Suggested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: [4.14+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Rientjes
     

20 Dec, 2017

1 commit

  • commit 4837fe37adff1d159904f0c013471b1ecbcb455e upstream.

    David Rientjes has reported the following memory corruption while the
    oom reaper tries to unmap the victim's address space:

    BUG: Bad page map in process oom_reaper pte:6353826300000000 pmd:00000000
    addr:00007f50cab1d000 vm_flags:08100073 anon_vma:ffff9eea335603f0 mapping: (null) index:7f50cab1d
    file: (null) fault: (null) mmap: (null) readpage: (null)
    CPU: 2 PID: 1001 Comm: oom_reaper
    Call Trace:
    unmap_page_range+0x1068/0x1130
    __oom_reap_task_mm+0xd5/0x16b
    oom_reaper+0xff/0x14c
    kthread+0xc1/0xe0

    Tetsuo Handa has noticed that the synchronization inside exit_mmap is
    insufficient. We only synchronize with the oom reaper if
    tsk_is_oom_victim, which is not true when the final __mmput is called
    from a different context than the oom victim's exit path. This can
    trivially happen from the context of any task which has grabbed a
    reference to the mm (e.g. to read a /proc/<pid>/ file which requires
    the mm, etc.).

    The race would look like this:

    oom_reaper                oom_victim                task
                                                        mmget_not_zero
                              do_exit
                                mmput
    __oom_reap_task_mm                                  mmput
                                                          __mmput
                                                            exit_mmap
                                                              remove_vma
      unmap_page_range

    Fix this issue by providing a new mm_is_oom_victim() helper which
    operates on the mm struct rather than a task. Any context which
    operates on a remote mm struct should use this helper in place of
    tsk_is_oom_victim. The flag is set in mark_oom_victim and never cleared
    so it is stable in the exit_mmap path.
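
    A sketch of the new helper per the description above (upstream adds an
    MMF_OOM_VICTIM mm flag; treat the details as illustrative):

        static inline bool mm_is_oom_victim(struct mm_struct *mm)
        {
                return test_bit(MMF_OOM_VICTIM, &mm->flags);
        }

        /* mark_oom_victim() sets the bit once; it is never cleared */
        set_bit(MMF_OOM_VICTIM, &mm->flags);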

    Debugged by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/20171210095130.17110-1-mhocko@kernel.org
    Fixes: 212925802454 ("mm: oom: let oom_reap_task and exit_mmap run concurrently")
    Signed-off-by: Michal Hocko
    Reported-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Tetsuo Handa
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

05 Dec, 2017

1 commit

  • commit 687cb0884a714ff484d038e9190edc874edcf146 upstream.

    tlb_gather_mmu(&tlb, mm, 0, -1) means gathering the whole virtual
    memory space. In this case, tlb->fullmm is true. Some arches, such as
    arm64, don't flush the TLB when tlb->fullmm is true:

    commit 5a7862e83000 ("arm64: tlbflush: avoid flushing when fullmm == 1").

    This causes TLB entries to leak.

    Will clarifies his patch:
    "Basically, we tag each address space with an ASID (PCID on x86) which
    is resident in the TLB. This means we can elide TLB invalidation when
    pulling down a full mm because we won't ever assign that ASID to
    another mm without doing TLB invalidation elsewhere (which actually
    just nukes the whole TLB).

    I think that means that we could potentially not fault on a kernel
    uaccess, because we could hit in the TLB"

    There could be a window between complete_signal() sending an IPI to
    other cores and all threads sharing this mm actually being kicked off
    their cores. In this window, the oom reaper may call
    tlb_flush_mmu_tlbonly() to flush the TLB and then free pages. However,
    due to the above problem, the TLB entries are not really flushed on
    arm64, so other threads can still access these pages through stale TLB
    entries. Moreover, a copy_to_user() can also write to these pages
    without generating a page fault, causing use-after-free bugs.

    This patch gathers each vma instead of gathering the full vm space, so
    tlb->fullmm is not true. The behavior of the oom reaper becomes
    similar to munmapping before do_exit, which should be safe for all
    arches.
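
    A sketch of the resulting loop in __oom_reap_task_mm(), assuming the
    mmu_gather setup simply moves inside the per-vma walk:

        for (vma = mm->mmap; vma; vma = vma->vm_next) {
                /* per-vma gather: tlb->fullmm stays false */
                tlb_gather_mmu(&tlb, mm, vma->vm_start, vma->vm_end);
                unmap_page_range(&tlb, vma, vma->vm_start, vma->vm_end,
                                 NULL);
                tlb_finish_mmu(&tlb, vma->vm_start, vma->vm_end);
        }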

    Link: http://lkml.kernel.org/r/20171107095453.179940-1-wangnan0@huawei.com
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Wang Nan
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Cc: Minchan Kim
    Cc: Will Deacon
    Cc: Bob Liu
    Cc: Ingo Molnar
    Cc: Roman Gushchin
    Cc: Konstantin Khlebnikov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Wang Nan
     

04 Oct, 2017

1 commit

    Andrea has noticed that the oom_reaper doesn't invalidate the range via
    mmu notifiers (mmu_notifier_invalidate_range_start/end), which can
    corrupt the memory of a kvm guest, for example.

    tlb_flush_mmu_tlbonly already invokes mmu notifiers, but that is not
    sufficient, as per Andrea:

    "mmu_notifier_invalidate_range cannot be used in replacement of
    mmu_notifier_invalidate_range_start/end. For KVM
    mmu_notifier_invalidate_range is a noop and rightfully so. A MMU
    notifier implementation has to implement either ->invalidate_range
    method or the invalidate_range_start/end methods, not both. And if you
    implement invalidate_range_start/end like KVM is forced to do, calling
    mmu_notifier_invalidate_range in common code is a noop for KVM.

    For those MMU notifiers that can get away only implementing
    ->invalidate_range, the ->invalidate_range is implicitly called by
    mmu_notifier_invalidate_range_end(). And only those secondary MMUs
    that share the same pagetable with the primary MMU (like AMD iommuv2)
    can get away only implementing ->invalidate_range"

    As the callback is allowed to sleep and the implementation is out of
    the MM's hands, it is safer to simply bail out if there is an mmu
    notifier registered. In order not to fail too early, make the
    mm_has_notifiers check under the oom_lock, and have a little nap before
    failing, to give the current oom victim some more time to exit.
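
    A sketch of the bail-out in the oom reaper, per the above (the sleep
    length is illustrative):

        /* under oom_lock, with mm->mmap_sem held for reading */
        if (mm_has_notifiers(mm)) {
                up_read(&mm->mmap_sem);
                schedule_timeout_idle(HZ);  /* give the victim more time */
                goto unlock_oom;
        }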

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170913113427.2291-1-mhocko@kernel.org
    Fixes: aac453635549 ("mm, oom: introduce oom reaper")
    Signed-off-by: Michal Hocko
    Reported-by: Andrea Arcangeli
    Reviewed-by: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

07 Sep, 2017

2 commits

    This is required purely because exit_aio() may block, so exit_mmap()
    may never start if oom_reap_task cannot run on an mm with
    mm_users == 0.

    At the same time, if the OOM reaper didn't wait at all for the memory
    of the current OOM candidate to be freed by exit_mmap->unmap_vmas, it
    would generate spurious OOM kills.

    If it weren't for exit_aio and similar blocking functions in the last
    mmput, it would be enough to change oom_reap_task(), in the case it
    finds mm_users == 0, to wait for a timeout or for __mmput to set
    MMF_OOM_SKIP itself. But exit_mmap is not the only problem here, so
    the concurrency of exit_mmap and oom_reap_task is apparently warranted.

    It's a non-standard runtime: exit_mmap() runs without the mmap_sem,
    while oom_reap_task runs with the mmap_sem for reading, as usual (kind
    of like MADV_DONTNEED).

    The race between the two is solved with a combination of
    tsk_is_oom_victim() (serialized by task_lock) and MMF_OOM_SKIP
    (serialized by a dummy down_write/up_write cycle on the same lines of
    the ksm_exit method).

    If oom_reap_task() is running concurrently during exit_mmap, exit_mmap
    will wait for it to finish in down_write (before taking down mm
    structures that would make oom_reap_task fail with a use after free).

    If exit_mmap comes first, oom_reap_task() will skip the mm because
    MMF_OOM_SKIP is already set; in that case all memory has already been
    freed, and furthermore the mm data structures may already have been
    taken down by free_pgtables.
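
    A sketch of the exit_mmap() side of this scheme, per the description
    above (placement relative to free_pgtables follows the changelog, not
    a verbatim copy of the patch):

        if (unlikely(tsk_is_oom_victim(current))) {
                /* the reaper will never look at this mm again ... */
                set_bit(MMF_OOM_SKIP, &mm->flags);
                /* ... and a dummy write cycle waits out a reaper in flight */
                down_write(&mm->mmap_sem);
                up_write(&mm->mmap_sem);
        }
        free_pgtables(&tlb, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING);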

    [aarcange@redhat.com: incremental one liner]
    Link: http://lkml.kernel.org/r/20170726164319.GC29716@redhat.com
    [rientjes@google.com: remove unused mmput_async]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1708141733130.50317@chino.kir.corp.google.com
    [aarcange@redhat.com: microoptimization]
    Link: http://lkml.kernel.org/r/20170817171240.GB5066@redhat.com
    Link: http://lkml.kernel.org/r/20170726162912.GA29716@redhat.com
    Fixes: 26db62f179d1 ("oom: keep mm of the killed task available")
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: David Rientjes
    Reported-by: David Rientjes
    Tested-by: David Rientjes
    Reviewed-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    For ages we have been relying on the TIF_MEMDIE thread flag to mark OOM
    victims and then, among other things, to give these threads full access
    to memory reserves. There are a few shortcomings to this
    implementation, though.

    First of all, and most seriously, full access to memory reserves is
    quite dangerous because it leaves no safety room for the system to
    operate and potentially take last-resort emergency steps to move on.

    Secondly, this flag is per task_struct while the OOM killer operates at
    mm_struct granularity, so all processes sharing the given mm are
    killed. Giving full access to all these task_structs could lead to
    quick memory reserves depletion. We have tried to reduce this risk by
    giving TIF_MEMDIE only to the main thread and the currently allocating
    task, but that doesn't really solve the problem, while it surely opens
    up room for corner cases - e.g. GFP_NO{FS,IO} requests might loop
    inside the allocator without access to memory reserves because a
    particular thread was not the group leader.

    Now that we have the oom reaper, and all oom victims are reapable after
    1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the
    kthreads"), we can be more conservative and grant only partial access
    to memory reserves, because there is a reasonable chance of parallel
    memory freeing. We still want some access to reserves, because we do
    not want other consumers to eat up the victim's freed memory. oom
    victims will still contend with __GFP_HIGH users, but those shouldn't
    be so aggressive as to starve oom victims completely.

    Introduce the ALLOC_OOM flag and give all tsk_is_oom_victim tasks
    access to half of the reserves. This makes access to the reserves
    independent of which task has passed through mark_oom_victim. Also
    drop any usage of TIF_MEMDIE from the page allocator proper and replace
    it with tsk_is_oom_victim, which finally makes page_alloc.c completely
    TIF_MEMDIE-free.

    CONFIG_MMU=n doesn't have oom reaper so let's stick to the original
    ALLOC_NO_WATERMARKS approach.
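
    A sketch of the flag and the watermark adjustment described above (the
    surrounding __zone_watermark_ok() logic is elided):

        #ifdef CONFIG_MMU
        #define ALLOC_OOM       0x100                   /* partial reserves */
        #else
        #define ALLOC_OOM       ALLOC_NO_WATERMARKS     /* no oom reaper */
        #endif

                /* oom victims get access to half of the reserves */
                if (alloc_flags & ALLOC_OOM)
                        min -= min / 2;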

    There is a demand to make the oom killer memcg aware which will imply
    many tasks killed at once. This change will allow such a usecase
    without worrying about complete memory reserves depletion.

    Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

11 Jul, 2017

1 commit

    During the debugging of the problem described in
    https://lkml.org/lkml/2017/5/17/542 and fixed by Tetsuo Handa in
    https://lkml.org/lkml/2017/5/19/383, I've found that the existing debug
    output is not really useful for understanding issues related to the oom
    reaper.

    So, I assume, that adding some tracepoints might help with debugging of
    similar issues.

    Trace the following events:
    1) a process is marked as an oom victim,
    2) a process is added to the oom reaper list,
    3) the oom reaper starts reaping process's mm,
    4) the oom reaper finished reaping,
    5) the oom reaper skips reaping.

    How does it work in practice? Below is an example which shows how the
    problem mentioned above can be found: one process is added twice to the
    oom_reaper list:

    $ cd /sys/kernel/debug/tracing
    $ echo "oom:mark_victim" > set_event
    $ echo "oom:wake_reaper" >> set_event
    $ echo "oom:skip_task_reaping" >> set_event
    $ echo "oom:start_task_reaping" >> set_event
    $ echo "oom:finish_task_reaping" >> set_event
    $ cat trace_pipe
    allocate-502 [001] .... 91.836405: mark_victim: pid=502
    allocate-502 [001] .N.. 91.837356: wake_reaper: pid=502
    allocate-502 [000] .N.. 91.871149: wake_reaper: pid=502
    oom_reaper-23 [000] .... 91.871177: start_task_reaping: pid=502
    oom_reaper-23 [000] .N.. 91.879511: finish_task_reaping: pid=502
    oom_reaper-23 [000] .... 91.879580: skip_task_reaping: pid=502

    Link: http://lkml.kernel.org/r/20170530185231.GA13412@castle
    Signed-off-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

07 Jul, 2017

1 commit

    Show the count of oom killer invocations in /proc/vmstat and the count
    of processes killed in a memory cgroup in the "memory.events" knob (in
    memory.oom_control for cgroup v1).

    Also describe the difference between "oom" and "oom_kill" in the memory
    cgroup documentation. Currently, oom in a memory cgroup kills tasks
    only when the shortage happens inside a page fault.

    These counters help in monitoring oom kills; until now the only way was
    grepping for magic words in the kernel log.
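
    With this in place, monitoring becomes a matter of reading the
    counters; the output below is illustrative, not captured from a real
    system:

    $ grep oom_kill /proc/vmstat
    oom_kill 2
    $ grep oom /sys/fs/cgroup/<group>/memory.events
    oom 2
    oom_kill 2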

    [akpm@linux-foundation.org: fix for mem_cgroup_count_vm_event() rename]
    [akpm@linux-foundation.org: fix comment, per Konstantin]
    Link: http://lkml.kernel.org/r/149570810989.203600.9492483715840752937.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Roman Guschin
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

04 May, 2017

1 commit

    Tetsuo has reported that the sysrq-triggered OOM killer prints
    misleading information when no tasks are selected:

    sysrq: SysRq : Manual OOM execution
    Out of memory: Kill process 4468 ((agetty)) score 0 or sacrifice child
    Killed process 4468 ((agetty)) total-vm:43704kB, anon-rss:1760kB, file-rss:0kB, shmem-rss:0kB
    sysrq: SysRq : Manual OOM execution
    Out of memory: Kill process 4469 (systemd-cgroups) score 0 or sacrifice child
    Killed process 4469 (systemd-cgroups) total-vm:10704kB, anon-rss:120kB, file-rss:0kB, shmem-rss:0kB
    sysrq: SysRq : Manual OOM execution
    sysrq: OOM request ignored because killer is disabled
    sysrq: SysRq : Manual OOM execution
    sysrq: OOM request ignored because killer is disabled
    sysrq: SysRq : Manual OOM execution
    sysrq: OOM request ignored because killer is disabled

    The real reason is that there are no eligible tasks for the OOM killer
    to select, but since commit 7c5f64f84483 ("mm: oom: deduplicate victim
    selection code for memcg and global oom") the semantics of
    out_of_memory have changed without moom_callback being updated.

    This patch updates moom_callback to report that no task was eligible,
    which covers both the oom-killer-disabled case and the
    no-eligible-tasks case. In order to help distinguish the first case
    from the second, add a printk to both oom_killer_{enable,disable}.
    This information is useful on its own because it might help in
    debugging potential memory allocation failures.
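
    A sketch of the added messages, assuming they follow the changelog
    wording:

        /* oom_killer_enable() / oom_killer_disable() */
        pr_info("OOM killer enabled.\n");
        pr_info("OOM killer disabled.\n");

        /* moom_callback(), when out_of_memory() finds nothing to kill */
        pr_info("OOM request ignored. No task eligible\n");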

    Fixes: 7c5f64f84483 ("mm: oom: deduplicate victim selection code for memcg and global oom")
    Link: http://lkml.kernel.org/r/20170404134705.6361-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

02 Mar, 2017

3 commits

    We are going to split a new header out of <linux/sched.h>, which will
    have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just maps back to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    We are going to split a new header out of <linux/sched.h>, which will
    have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder file that just maps back to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
    We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just maps to
    <linux/sched.h> to make this patch obviously correct and bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
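
    For reference, a sketch of the helper matching the mechanical
    conversion above:

        static inline void mmgrab(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_count);
        }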

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

25 Feb, 2017

1 commit

  • Commit 82e7d3abec86 ("oom: print nodemask in the oom report") implicitly
    sets the allocation nodemask to cpuset_current_mems_allowed when there
    is no effective mempolicy. cpuset_current_mems_allowed is only
    effective when cpusets are enabled, which is also printed by
    dump_header(), so setting the nodemask to cpuset_current_mems_allowed is
    redundant and prevents debugging issues where ac->nodemask is not set
    properly in the page allocator.

    This provides better debugging output since
    cpuset_print_current_mems_allowed() is already provided.

    [rientjes@google.com: newline per Hillf]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701200158300.88321@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1701191454470.2381@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Suggested-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

23 Feb, 2017

5 commits

    The logic for whether we can reap pages from a VMA should match what we
    have in madvise_dontneed(). In particular, we should skip VM_PFNMAP
    VMAs, but we don't now.

    Let's just extract the condition under which we can shoot down pages
    from a VMA with MADV_DONTNEED into a separate function and use it in
    both places.
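
    A sketch of the extracted helper (assuming the flag set mirrors
    madvise_dontneed(); the upstream name is can_madv_dontneed_vma()):

        static inline bool can_madv_dontneed_vma(struct vm_area_struct *vma)
        {
                return !(vma->vm_flags & (VM_LOCKED|VM_HUGETLB|VM_PFNMAP));
        }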

    Link: http://lkml.kernel.org/r/20170118122429.43661-4-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    Passing details == NULL would give the same functionality as
    .check_swap_entries == true, so drop the flag.

    Link: http://lkml.kernel.org/r/20170118122429.43661-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    The only user of ignore_dirty is the oom reaper, but it doesn't really
    use it.

    ignore_dirty only has an effect on file pages mapped with a dirty pte.
    But the oom reaper skips shared VMAs, so there's no way it can find a
    dirty file pte in them.

    Link: http://lkml.kernel.org/r/20170118122429.43661-1-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
    __alloc_pages_may_oom makes sure to skip the OOM killer depending on
    the allocation request. This includes lowmem requests, costly high
    order requests and others. For a long time __GFP_NOFAIL acted as an
    override for all those rules. This is not documented, and it can be
    quite surprising as well. E.g. GFP_NOFS requests are not invoking the
    OOM killer, but GFP_NOFS|__GFP_NOFAIL does, so if we try to convert
    some of the existing open-coded loops around the allocator to nofail
    requests (and we have done that in the past), such a change would have
    a non-trivial side effect which is far from obvious. Note that the
    primary motivation for skipping the OOM killer is to prevent its
    premature invocation.

    The exception was added by commit 82553a937f12 ("oom: invoke oom killer
    for __GFP_NOFAIL"). The changelog points out that the oom killer has
    to be invoked, otherwise the request would loop forever. But this
    argument is rather weak, because the OOM killer doesn't really
    guarantee forward progress for those exceptional cases:

    - it will hardly help to form a costly-order page, and the attempt can
    even result in a system panic when there is no oom-killable task left.
    I believe we certainly do not want to put the system down just because
    there is a nasty driver asking for an order-9 page with GFP_NOFAIL, not
    realizing all the consequences. It is much better for this request to
    loop forever than to cause massive system disruption

    - lowmem is also highly unlikely to be freed by the OOM killer

    - a GFP_NOFS request could trigger while there is still a lot of memory
    pinned by filesystems.

    This patch simply removes the __GFP_NOFAIL special case in order to
    have clearer semantics, without surprising side effects.

    Signed-off-by: Michal Hocko
    Reported-by: Nils Holland
    Acked-by: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Hillf Danton
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    show_mem() allows filtering out node-specific data which is irrelevant
    to the allocation request, via SHOW_MEM_FILTER_NODES. The filtering is
    done in skip_free_areas_node, which skips all nodes that are not in the
    mems_allowed of the current process. This works as expected most of
    the time, because the nodemask shouldn't be outside of the allocating
    task's, but there are some exceptions. E.g. memory hotplug might want
    to request allocations from outside of the allowed nodes (see
    new_node_page).

    Get rid of this hardcoded behavior, push the allocation mask down the
    show_mem path, and use it instead of cpuset_current_mems_allowed. A
    NULL nodemask is interpreted as cpuset_current_mems_allowed.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20170117091543.25850-5-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

08 Oct, 2016

13 commits

    We have received a hard-to-explain oom report from a customer. The oom
    triggered despite there being a lot of free memory:

    PoolThread invoked oom-killer: gfp_mask=0x280da, order=0, oom_adj=0, oom_score_adj=0
    PoolThread cpuset=/ mems_allowed=0-7
    Pid: 30055, comm: PoolThread Tainted: G E X 3.0.101-80-default #1
    Call Trace:
    dump_trace+0x75/0x300
    dump_stack+0x69/0x6f
    dump_header+0x8e/0x110
    oom_kill_process+0xa6/0x350
    out_of_memory+0x2b7/0x310
    __alloc_pages_slowpath+0x7dd/0x820
    __alloc_pages_nodemask+0x1e9/0x200
    alloc_pages_vma+0xe1/0x290
    do_anonymous_page+0x13e/0x300
    do_page_fault+0x1fd/0x4c0
    page_fault+0x25/0x30
    [...]
    active_anon:1135959151 inactive_anon:1051962 isolated_anon:0
    active_file:13093 inactive_file:222506 isolated_file:0
    unevictable:262144 dirty:2 writeback:0 unstable:0
    free:432672819 slab_reclaimable:7917 slab_unreclaimable:95308
    mapped:261139 shmem:166297 pagetables:2228282 bounce:0
    [...]
    Node 0 DMA free:15896kB min:0kB low:0kB high:0kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15672kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 2892 775542 775542
    Node 0 DMA32 free:2783784kB min:28kB low:32kB high:40kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:2961572kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:0kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 772650 772650
    Node 0 Normal free:8120kB min:8160kB low:10200kB high:12240kB active_anon:779334960kB inactive_anon:2198744kB active_file:0kB inactive_file:180kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:791193600kB mlocked:131072kB dirty:0kB writeback:0kB mapped:372940kB shmem:361480kB slab_reclaimable:4536kB slab_unreclaimable:68472kB kernel_stack:10104kB pagetables:1414820kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:2280 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 1 Normal free:476718144kB min:8192kB low:10240kB high:12288kB active_anon:307623696kB inactive_anon:283620kB active_file:10392kB inactive_file:69908kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:257208kB shmem:189896kB slab_reclaimable:3868kB slab_unreclaimable:44756kB kernel_stack:1848kB pagetables:1369432kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 2 Normal free:386002452kB min:8192kB low:10240kB high:12288kB active_anon:398563752kB inactive_anon:68184kB active_file:10292kB inactive_file:29936kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:32084kB shmem:776kB slab_reclaimable:6888kB slab_unreclaimable:60056kB kernel_stack:8208kB pagetables:1282880kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 3 Normal free:196406760kB min:8192kB low:10240kB high:12288kB active_anon:587445640kB inactive_anon:164396kB active_file:5716kB inactive_file:709844kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:291776kB shmem:111416kB slab_reclaimable:5152kB slab_unreclaimable:44516kB kernel_stack:2168kB pagetables:1455956kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 4 Normal free:425338880kB min:8192kB low:10240kB high:12288kB active_anon:359695204kB inactive_anon:43216kB active_file:5748kB inactive_file:14772kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:24708kB shmem:1120kB slab_reclaimable:1884kB slab_unreclaimable:41060kB kernel_stack:1856kB pagetables:1100208kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 5 Normal free:11140kB min:8192kB low:10240kB high:12288kB active_anon:784240872kB inactive_anon:1217164kB active_file:28kB inactive_file:48kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:11408kB shmem:0kB slab_reclaimable:2008kB slab_unreclaimable:49220kB kernel_stack:1360kB pagetables:531600kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:1202 all_unreclaimable? yes
    lowmem_reserve[]: 0 0 0 0
    Node 6 Normal free:243395332kB min:8192kB low:10240kB high:12288kB active_anon:542015544kB inactive_anon:40208kB active_file:968kB inactive_file:8484kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:0kB writeback:0kB mapped:19992kB shmem:496kB slab_reclaimable:1672kB slab_unreclaimable:37052kB kernel_stack:2088kB pagetables:750264kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0
    Node 7 Normal free:10768kB min:8192kB low:10240kB high:12288kB active_anon:784916936kB inactive_anon:192316kB active_file:19228kB inactive_file:56852kB unevictable:131072kB isolated(anon):0kB isolated(file):0kB present:794296320kB mlocked:131072kB dirty:4kB writeback:0kB mapped:34440kB shmem:4kB slab_reclaimable:5660kB slab_unreclaimable:36100kB kernel_stack:1328kB pagetables:1007968kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
    lowmem_reserve[]: 0 0 0 0

    So all nodes but Node 0 have a lot of free memory, which should suggest
    that memory is available, especially when mems_allowed=0-7. One could
    speculate that a massive process managed to terminate and free up a lot
    of memory while racing with the above allocation request. Although
    this is highly unlikely, it cannot be ruled out.

    Further debugging, however, has shown that the faulting process had a
    mempolicy (not a cpuset) binding it to Node 0. We cannot see that
    information from the report, though. mems_allowed turned out to be
    more confusing than really helpful.

    Fix this by always printing the nodemask. It is either the mempolicy
    mask (and non-null) or the one defined by the cpusets. The new output
    for the above oom report would be:

    PoolThread invoked oom-killer: gfp_mask=0x280da(GFP_HIGHUSER_MOVABLE|__GFP_ZERO), nodemask=0, order=0, oom_adj=0, oom_score_adj=0

    This patch doesn't touch show_mem and the node filtering based on the
    cpuset node mask, because mempolicy is always a subset of cpusets, and
    seeing the full cpuset oom context might be helpful for tuning more
    specific mempolicies inside cpusets (e.g. when they turn out to be too
    restrictive). To prevent ugly ifdefs, the mask is printed even for
    !NUMA configurations, but this should be OK (a single node will be
    printed).

    Link: http://lkml.kernel.org/r/20160930214146.28600-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Sellami Abdelkader
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Sellami Abdelkader
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
    raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
    to warn when we raced with disabling the OOM killer.

    Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend
    properly" introduced a timeout for oom_killer_disable(). Even if we
    raced with disabling the OOM killer and the system is OOM livelocked,
    the OOM killer will be enabled eventually (in 20 seconds by default) and
    the OOM livelock will be solved. Therefore, we no longer need to warn
    when we raced with disabling the OOM killer.

    Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Since lumpy reclaim is gone, there is no source of higher-order pages
    if CONFIG_COMPACTION=n, except for order-0 page reclaim, which is
    unreliable for that purpose to say the least. Hitting an OOM for
    !costly higher-order requests is therefore not all that hard to
    imagine. We try hard not to invoke the OOM killer as much as possible,
    but there is simply no reliable way to detect whether more reclaim
    retries make sense.

    Disabling COMPACTION is not widespread, but it seems that some users
    might have disabled the feature without realizing the full consequences
    (mostly along with disabling THP, because compaction used to be mainly
    a THP thing). This patch just adds a note if the OOM killer was
    triggered by a higher-order request with compaction disabled. This
    will help us identify possible misconfiguration right from the oom
    report, which is easier than always keeping in mind that somebody might
    have disabled COMPACTION without a good reason.

    Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The oom reaper was skipped for an mm which is shared with a kernel
    thread (aka use_mm()). The primary concern was that such a kthread
    might want to read from the userspace memory and see a zero page as a
    result of the oom reaper's action. This is no longer a problem after
    "mm: make sure that kthreads will not refault oom reaped memory",
    because any attempt to fault in memory while MMF_UNSTABLE is set will
    result in SIGBUS, so the target user should see an error. This means
    that we can finally allow the oom reaper also for tasks which share
    their mm with kthreads.

    Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    There are only a few use_mm() users in the kernel right now. Most of
    them write to the target memory, but the vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares its mm with
    the vhost kernel thread, because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task's context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device,
    which is not tied to the mm owner's life cycle in general. The device
    fd can be inherited or passed over to another process, which means that
    we really have to be careful about unexpected memory corruption,
    because unlike for normal oom victims the result will be visible
    outside of the oom victim's context.

    Make sure that no kthread context (users of use_mm) can ever see
    corrupted data because of the oom reaper, by hooking into the page
    fault path and checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm
    will set the flag before it starts unmapping the address space, while
    the flag is checked after the page fault has been handled. If the flag
    is set then SIGBUS is triggered, so any g-u-p user will get an error
    code.

    Regular tasks do not need this protection, because all tasks sharing
    the mm are killed when the mm is reaped, and so the corruption will not
    outlive them.
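
    A sketch of the check in the page fault path, per the description above
    (the exact placement in handle_mm_fault() is elided):

        /* kthreads (use_mm) must never see memory zapped by the reaper */
        if (unlikely((current->flags & PF_KTHREAD) &&
                     !(ret & VM_FAULT_ERROR) &&
                     test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
                ret = VM_FAULT_SIGBUS;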

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke oom reaper for tasks with mm shared with
    kthreads yet.

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    There are no users of exit_oom_victim on a !current task anymore, so
    restrict the API to always work on the current task.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") worked around an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper, because we have a stable mm
    available and can hide the oom-reaped mm via the MMF_OOM_SKIP flag. So
    let's remove that exit_oom_victim call; the race described in the above
    commit doesn't exist anymore.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    use case, because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on the way to exit for an
    unbounded amount of time). The OOM killer can cope with that by
    checking mm flags and moving on to another victim, but we cannot do the
    same for oom_killer_disable, as we would lose the guarantee of no
    further interference of the victim with the rest of the system. What
    we can do instead is to cap the maximum time oom_killer_disable waits
    for victims. The only current user of this function (pm suspend)
    already has a concept of a timeout for back off, so we can reuse the
    same value there.

    Let's also drop set_freezable for the oom_reaper kthread; it is no
    longer needed, as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    oom_reap_task has to call exit_oom_victim in order to make sure that
    the oom victim will not block the oom killer forever. This, however,
    opens up new problems (e.g. oom_killer_disable exclusion - see commit
    74070542099c ("oom, suspend: fix oom_reaper vs. oom_killer_disable
    race")). Ideally, exit_oom_victim should only be called from the
    victim's context.

    One way to achieve this would be to rely on per-mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
      exit_mm
        tsk->mm = NULL;
        mmput
          __mmput
            exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO, which might get
    blocked due to lack of memory, and who knows what else is lurking
    there.

    This patch takes a different approach. We remember tsk->mm in the
    signal_struct and bind it to the signal_struct's lifetime for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread no
    longer have to rely on find_lock_task_mm, and they will have a reliable
    reference to the mm struct. As a result, all the oom-specific
    communication inside the OOM killer can be done via
    tsk->signal->oom_mm.

    Increasing the size of signal_struct for something as unlikely as the
    oom killer is far from ideal, but this approach will make the code much
    more reasonable, and long term we even might want to move task->mm into
    the signal_struct anyway. As a next step we might want to make the oom
    killer exclusion and access to memory reserves completely independent,
    which would also be nice.
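
    A sketch of how mark_oom_victim() pins the mm into the signal_struct,
    per the above (the reference is dropped when the signal_struct dies):

        /* oom_mm is bound to the signal_struct's life time */
        if (!cmpxchg(&tsk->signal->oom_mm, NULL, mm))
                atomic_inc(&mm->mm_count);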

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • "mm, oom_reaper: do not attempt to reap a task twice" tried to give the
    OOM reaper one more chance to retry using MMF_OOM_NOT_REAPABLE flag.
    But the usefulness of the flag is rather limited and actually never
    shown in practice. If the flag is set, it means that the holder of
    mm->mmap_sem cannot call up_write() due to presumably being blocked at
    unkillable wait waiting for other thread's memory allocation. But since
    one of threads sharing that mm will queue that mm immediately via
    task_will_free_mem() shortcut (otherwise, oom_badness() will select the
    same mm again due to oom_score_adj value unchanged), retrying
    MMF_OOM_NOT_REAPABLE mm is unlikely helpful.

    Let's always set MMF_OOM_REAPED.

    Link: http://lkml.kernel.org/r/1472119394-11342-3-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Patch series "fortify oom killer even more", v2.

    This patch (of 9):

    __oom_reap_task() can be simplified a bit if it receives a valid mm
    from oom_reap_task(), which also uses that mm when __oom_reap_task()
    fails. We can drop one find_lock_task_mm() call and also make the
    __oom_reap_task() code flow easier to follow. Moreover, this will make
    a later patch in the series easier to review. Pinning the mm's
    mm_count for a longer time is not really harmful, because it will not
    pin much memory.

    This patch doesn't introduce any functional change.

    Link: http://lkml.kernel.org/r/1472119394-11342-2-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Attempt to demystify the task_will_free_mem() loop.

    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    When selecting an oom victim, we use the same heuristic for both memory
    cgroup and global oom. The only difference is the scope of tasks from
    which to select the victim. So we could just export an iterator over
    all memcg tasks and keep all oom-related logic in oom_kill.c, but
    instead we duplicate pieces of it in memcontrol.c, reusing some
    initially-private functions of oom_kill.c in order not to duplicate all
    of it. That looks ugly and error prone, because any modification of
    select_bad_process should also be propagated to
    mem_cgroup_out_of_memory.

    Let's rework this as follows: keep all oom heuristic related code
    private to oom_kill.c, and make oom_kill.c use exported memcg functions
    when it's really necessary (like when iterating over memcg tasks).
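
    The exported iterator has roughly this shape (a sketch; upstream names
    it mem_cgroup_scan_tasks()):

        /* call @fn for each task in @memcg, stop when it returns non-zero */
        int mem_cgroup_scan_tasks(struct mem_cgroup *memcg,
                                  int (*fn)(struct task_struct *, void *),
                                  void *arg);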

    Link: http://lkml.kernel.org/r/1470056933-7505-1-git-send-email-vdavydov@virtuozzo.com
    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

12 Aug, 2016

1 commit

  • mm/oom_kill.c: In function `task_will_free_mem':
    mm/oom_kill.c:767: warning: `ret' may be used uninitialized in this function

    If __task_will_free_mem() is never called inside the for_each_process()
    loop, ret will not be initialized.
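
    The fix is a one-liner; a sketch, assuming the sensible default when no
    other process shares the mm is "true":

        bool ret = true;    /* was uninitialized: if the loop never runs,
                             * the earlier per-task check already passed */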

    Fixes: 1af8bb43269563e4 ("mm, oom: fortify task_will_free_mem()")
    Link: http://lkml.kernel.org/r/1470255599-24841-1-git-send-email-geert@linux-m68k.org
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     

29 Jul, 2016

7 commits

  • "mm, oom: fortify task_will_free_mem" has dropped task_lock around
    task_will_free_mem in oom_kill_process bacause it assumed that a
    potential race when the selected task exits will not be a problem as the
    oom_reaper will call exit_oom_victim.

    Tetsuo was objecting that nommu doesn't have oom_reaper so the race
    would be still possible. The code would be racy and lockup prone
    theoretically in other aspects without the oom reaper anyway so I didn't
    considered this a big deal. But it seems that further changes I am
    planning in this area will benefit from stable task->mm in this path as
    well. So let's drop find_lock_task_mm from task_will_free_mem and call
    it from under task_lock as we did previously. Just pull the task->mm !=
    NULL check inside the function.

    Link: http://lkml.kernel.org/r/1467201562-6709-1-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The only case where the oom_reaper is not triggered for an oom victim
    is when the victim shares its memory with a kernel thread (aka use_mm)
    or with the global init. After "mm, oom: skip vforked tasks from being
    selected" the victim cannot be a vforked task of the global init, so we
    are left with clone(CLONE_VM) (without CLONE_SIGHAND). use_mm() users
    are quite rare as well.

    In order to help forward progress for the OOM killer, make sure that
    this really rare case will not get in the way - we do this by hiding
    the mm from the oom killer, setting the MMF_OOM_REAPED flag for it.
    oom_scan_process_thread will ignore any TIF_MEMDIE task if it has
    MMF_OOM_REAPED set, to catch these oom victims.

    After this patch we should guarantee forward progress for the OOM
    killer even when the selected victim is sharing memory with a kernel
    thread or with the global init, as long as the victim's mm is still
    alive.

    Link: http://lkml.kernel.org/r/1466426628-15074-11-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    The oom_reaper relies on taking the mmap_sem for read to do its job.
    Many places which might block readers have been converted to use
    down_write_killable, and that has reduced the chances of contention a
    lot. But some paths where the mmap_sem is held for write can take
    other locks, and they might either be not prepared to fail due to a
    fatal signal pending, or be too impractical to change.

    This patch introduces the MMF_OOM_NOT_REAPABLE flag, which gets set
    after the first attempt to reap a task's mm fails. If the flag is
    present after the failure then we set MMF_OOM_REAPED to hide this mm
    from the oom killer completely, so it can go and choose another victim.

    As a result, the risk of an OOM deadlock, where the oom victim would be
    blocked indefinitely so the oom killer cannot make any progress, should
    be mitigated considerably, while we still try really hard to perform
    all reclaim attempts and stay predictable in behavior.

    Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The 0-day robot has encountered the following:

    Out of memory: Kill process 3914 (trinity-c0) score 167 or sacrifice child
    Killed process 3914 (trinity-c0) total-vm:55864kB, anon-rss:1512kB, file-rss:1088kB, shmem-rss:25616kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26488kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:26900kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:27296kB
    oom_reaper: reaped process 3914 (trinity-c0), now anon-rss:0kB, file-rss:0kB, shmem-rss:28148kB

    The oom_reaper is trying to reap the same task again and again.

    This is possible only when the oom killer is bypassed because of
    task_will_free_mem, since we skip over tasks with MMF_OOM_REAPED
    already set during select_bad_process. Teach task_will_free_mem to
    skip over MMF_OOM_REAPED tasks as well, because they are unlikely to
    free anything more.

    Analyzed by Tetsuo Handa.

    Link: http://lkml.kernel.org/r/1466426628-15074-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    task_will_free_mem is rather weak. It doesn't really tell whether the
    task has a chance to drop its mm. 98748bd72200 ("oom: consider
    multi-threaded tasks in task_will_free_mem") made a first step towards
    making it more robust for multi-threaded applications, so now we know
    that the whole process is going down and will probably drop the mm.

    This patch builds on top of that for more complex scenarios where the
    mm is shared between different processes - CLONE_VM without
    CLONE_SIGHAND, or in-kernel use_mm().

    Make sure that all processes sharing the mm are killed or exiting; a
    sketch of the check follows below. This will allow us to replace
    try_oom_reaper with wake_oom_reaper, because task_will_free_mem now
    implies the task is reapable. Therefore all paths which bypass the oom
    killer are now reapable, and so they shouldn't lock up the oom killer.
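
    A sketch of the strengthened check, per the above (process_shares_mm()
    and __task_will_free_mem() as in mm/oom_kill.c; error handling elided):

        /* every process sharing the mm must be dying too */
        rcu_read_lock();
        for_each_process(p) {
                if (!process_shares_mm(p, mm))
                        continue;
                if (same_thread_group(task, p))
                        continue;
                ret = __task_will_free_mem(p);
                if (!ret)
                        break;
        }
        rcu_read_unlock();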

    Link: http://lkml.kernel.org/r/1466426628-15074-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    Currently oom_kill_process skips both the oom reaper and SIGKILL if a
    process sharing the same mm is unkillable via OOM_ADJUST_MIN. After
    "mm, oom_adj: make sure processes sharing mm have same view of
    oom_score_adj", all such processes share the same value, so we
    shouldn't see such a task at all (oom_badness would rule them out).

    We can still encounter an oom-disabled vforked task, which has to be
    killed as well if we want the other tasks sharing the mm to be
    reapable, because it can access the memory before doing exec. Killing
    such a task should be acceptable, because it is highly unlikely to have
    done anything useful: it cannot modify any memory before it calls exec.
    An alternative would be to keep the task alive and skip the oom reaper,
    and risk all the weird corner cases where the OOM killer cannot make
    forward progress because the oom victim hung somewhere on the way to
    exit.

    [rientjes@google.com: drop the printk when an OOM_SCORE_ADJ_MIN task is
    killed; the setting is inherently racy and we cannot do much about it
    without introducing locks in hot paths]
    Link: http://lkml.kernel.org/r/1466426628-15074-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
    vforked tasks are not really sitting on any memory. They share the mm
    with the parent until they exec into new code; until then they are just
    pinning the address space. The OOM killer will kill the vforked task
    along with its parent, but we can still end up selecting a vforked task
    when the parent wouldn't be selected - e.g. init doing a vfork to
    launch a task, or a vforked child of an oom-unkillable task whose
    oom_score_adj has been updated to make it killable.

    Add a new helper to check whether a task is in a vfork, sharing memory
    with its parent, and use it in oom_badness to skip over these tasks.
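
    A sketch of the helper, per the description above (upstream calls it
    in_vfork(); the RCU protection covers ->real_parent):

        static bool in_vfork(struct task_struct *tsk)
        {
                bool ret;

                rcu_read_lock();
                ret = tsk->vfork_done && tsk->real_parent->mm == tsk->mm;
                rcu_read_unlock();

                return ret;
        }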

    Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko