09 Jan, 2021

1 commit


07 Jan, 2021

1 commit

  • Free the pages in parallel for a task that receives SIGKILL from the ULMK
    process, using the oom_reaper. Freeing the pages early returns them to
    the buddy system well in advance.

    Add the boot param reap_mem_when_killed_by=, which names the process
    whose kill signal causes the target's memory to be reaped by the oom
    reaper.

    As an example, when reap_mem_when_killed_by=lmkd, all processes that
    receive the kill signal from lmkd are added to the oom reaper.

    Leaving this param unset disables the feature.
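
    The decision described above can be sketched as follows. This is a
    minimal user-space model in Python, not the kernel code: the function
    name and its arguments are hypothetical, and it assumes that "the kill
    signal" means SIGKILL and that the sender is matched by process name.

```python
import signal


def should_queue_for_reaper(sender_comm, sig, reap_mem_when_killed_by=None):
    """Return True if the target's memory would be handed to the oom reaper."""
    # An unset (or empty) boot param leaves the feature disabled.
    if not reap_mem_when_killed_by:
        return False
    # Assumption: only SIGKILL sent by the configured process name qualifies.
    return sig == signal.SIGKILL and sender_comm == reap_mem_when_killed_by
```

    With reap_mem_when_killed_by=lmkd, a SIGKILL from lmkd qualifies in this
    model, while the same signal from any other sender does not.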

    Change-Id: I21adb95de5e380a80d7eb0b87d9b5b553f52e28a
    Bug: 171763461
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Isaac J. Manjarres

    Charan Teja Reddy
     

14 Oct, 2020

1 commit

  • Currently __set_oom_adj loops through all processes in the system to keep
    oom_score_adj and oom_score_adj_min in sync between processes sharing
    their mm. This is done for any task with more than one mm_users, which
    includes processes with multiple threads (sharing mm and signals).
    However for such processes the loop is unnecessary because their signal
    structure is shared as well.

    Android updates oom_score_adj whenever a task changes its role
    (background/foreground/...) or binds to/unbinds from a service, making it
    more/less important. Such operations can happen frequently. We noticed
    that updates to oom_score_adj became more expensive and after further
    investigation found out that the patch mentioned in "Fixes" introduced a
    regression. Using Pixel 4 with a typical Android workload, write time to
    oom_score_adj increased from ~3.57us to ~362us. Moreover this regression
    linearly depends on the number of multi-threaded processes running on the
    system.

    Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
    (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK). Change __set_oom_adj to use
    MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
    update should be synchronized between multiple processes. To prevent
    races between clone() and __set_oom_adj(), when oom_score_adj of the
    process being cloned might be modified from userspace, we use
    oom_adj_mutex. Its scope is changed to global.

    The combination of (CLONE_VM && !CLONE_THREAD) is rarely used except for
    the case of vfork(). To prevent performance regressions of vfork(), we
    skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
    specified. Clearing the MMF_MULTIPROCESS flag (when the last process
    sharing the mm exits) is left out of this patch to keep it simple and
    because it is believed that this threading model is rare. Should there
    ever be a need for optimizing that case as well, it can be done by hooking
    into the exit path, likely following the mm_update_next_owner pattern.

    With the combination of (CLONE_VM && !CLONE_THREAD && !CLONE_VFORK) being
    quite rare, the regression is gone after the change is applied.
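
    The flag test described above can be sketched in Python. The CLONE_*
    values are the ones from the Linux UAPI headers; the helper name is
    hypothetical and the sketch only models the condition, not the locking.

```python
CLONE_VM = 0x00000100      # child shares the parent's address space
CLONE_VFORK = 0x00004000   # parent sleeps until the child releases the mm
CLONE_THREAD = 0x00010000  # child joins the parent's thread group


def marks_mm_multiprocess(clone_flags):
    """True when the new task is a separate process sharing the mm."""
    shares_mm = bool(clone_flags & CLONE_VM)
    is_thread = bool(clone_flags & CLONE_THREAD)
    is_vfork = bool(clone_flags & CLONE_VFORK)
    # Threads share the signal struct, so no cross-process sync is needed;
    # vfork() is skipped to avoid slowing it down.
    return shares_mm and not is_thread and not is_vfork
```

    Under this model an ordinary thread (CLONE_VM | CLONE_THREAD) and a
    vfork() child (CLONE_VM | CLONE_VFORK) both leave the mm unmarked; only
    a bare CLONE_VM clone sets the flag.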

    [surenb@google.com: v3]
    Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

    Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
    Reported-by: Tim Murray
    Suggested-by: Michal Hocko
    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Andrew Morton
    Acked-by: Christian Brauner
    Acked-by: Michal Hocko
    Acked-by: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Eugene Syromiatnikov
    Cc: Christian Kellner
    Cc: Adrian Reber
    Cc: Shakeel Butt
    Cc: Aleksa Sarai
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Cc: Alexey Gladkov
    Cc: Michel Lespinasse
    Cc: Daniel Jordan
    Cc: Andrei Vagin
    Cc: Bernd Edlinger
    Cc: John Johansen
    Cc: Yafang Shao
    Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
    Debugged-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     

13 Aug, 2020

2 commits

  • When the OOM killer finds a victim and tries to kill it, if the victim is
    already exiting, the task mm will be NULL and no process will be killed.
    But dump_header() has already been executed by then, so it looks strange
    to dump so much information without killing a process. We'd better show
    some helpful information to indicate why this happens.

    Suggested-by: David Rientjes
    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200721010127.17238-1-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • Recently we found an issue in our production environment: when memcg
    oom is triggered, the oom killer doesn't choose the process with the
    largest resident memory but chooses the first scanned process. Note that
    all processes in this memcg have the same oom_score_adj, so the oom
    killer should choose the process with the largest resident memory.

    Below is part of the oom info, which is enough to analyze this issue.
    [7516987.983223] memory: usage 16777216kB, limit 16777216kB, failcnt 52843037
    [7516987.983224] memory+swap: usage 16777216kB, limit 9007199254740988kB, failcnt 0
    [7516987.983225] kmem: usage 301464kB, limit 9007199254740988kB, failcnt 0
    [...]
    [7516987.983293] [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    [7516987.983510] [ 5740] 0 5740 257 1 32768 0 -998 pause
    [7516987.983574] [58804] 0 58804 4594 771 81920 0 -998 entry_point.bas
    [7516987.983577] [58908] 0 58908 7089 689 98304 0 -998 cron
    [7516987.983580] [58910] 0 58910 16235 5576 163840 0 -998 supervisord
    [7516987.983590] [59620] 0 59620 18074 1395 188416 0 -998 sshd
    [7516987.983594] [59622] 0 59622 18680 6679 188416 0 -998 python
    [7516987.983598] [59624] 0 59624 1859266 5161 548864 0 -998 odin-agent
    [7516987.983600] [59625] 0 59625 707223 9248 983040 0 -998 filebeat
    [7516987.983604] [59627] 0 59627 416433 64239 774144 0 -998 odin-log-agent
    [7516987.983607] [59631] 0 59631 180671 15012 385024 0 -998 python3
    [7516987.983612] [61396] 0 61396 791287 3189 352256 0 -998 client
    [7516987.983615] [61641] 0 61641 1844642 29089 946176 0 -998 client
    [7516987.983765] [ 9236] 0 9236 2642 467 53248 0 -998 php_scanner
    [7516987.983911] [42898] 0 42898 15543 838 167936 0 -998 su
    [7516987.983915] [42900] 1000 42900 3673 867 77824 0 -998 exec_script_vr2
    [7516987.983918] [42925] 1000 42925 36475 19033 335872 0 -998 python
    [7516987.983921] [57146] 1000 57146 3673 848 73728 0 -998 exec_script_J2p
    [7516987.983925] [57195] 1000 57195 186359 22958 491520 0 -998 python2
    [7516987.983928] [58376] 1000 58376 275764 14402 290816 0 -998 rosmaster
    [7516987.983931] [58395] 1000 58395 155166 4449 245760 0 -998 rosout
    [7516987.983935] [58406] 1000 58406 18285584 3967322 37101568 0 -998 data_sim
    [7516987.984221] oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null),cpuset=3aa16c9482ae3a6f6b78bda68a55d32c87c99b985e0f11331cddf05af6c4d753,mems_allowed=0-1,oom_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184,task_memcg=/kubepods/podf1c273d3-9b36-11ea-b3df-246e9693c184/1f246a3eeea8f70bf91141eeaf1805346a666e225f823906485ea0b6c37dfc3d,task=pause,pid=5740,uid=0
    [7516987.984254] Memory cgroup out of memory: Killed process 5740 (pause) total-vm:1028kB, anon-rss:4kB, file-rss:0kB, shmem-rss:0kB
    [7516988.092344] oom_reaper: reaped process 5740 (pause), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    We can see that the first scanned process 5740 (pause) was killed, but
    its rss is only one page. That is because, when we calculate the oom
    badness in oom_badness(), we always ignore negative points and convert
    all of them to 1. Since all the processes in the targeted memcg have the
    same oom_score_adj value of -998, their points are all negative. As a
    result, the first scanned process is killed.

    The oom_score_adj (-998) in this memcg is set by kubelet, because it is a
    Guaranteed pod, which has higher priority to prevent it from being killed
    by system oom.

    To fix this issue, we should make the calculation of oom points more
    accurate. We can achieve that by converting chosen_point from 'unsigned
    long' to 'long'.
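
    The effect can be reproduced with a small Python model. It is a sketch
    under stated assumptions, not the kernel implementation: the score counts
    resident pages only (swap and page tables are ignored), a 4kB page size
    is assumed, and the function names are hypothetical.

```python
def oom_points(rss_pages, oom_score_adj, totalpages):
    """Simplified badness: resident pages plus the oom_score_adj bias."""
    return rss_pages + oom_score_adj * (totalpages // 1000)


def select_victim(tasks, totalpages, clamp_negative):
    """Return the name of the task with the highest points.

    tasks is a list of (name, rss_pages, oom_score_adj) in scan order.
    clamp_negative=True models the old behaviour of turning every
    non-positive score into 1.
    """
    chosen, chosen_points = None, None
    for name, rss_pages, adj in tasks:
        points = oom_points(rss_pages, adj, totalpages)
        if clamp_negative:
            points = max(points, 1)
        # Strictly greater: on a tie the earlier-scanned task stays chosen.
        if chosen is None or points > chosen_points:
            chosen, chosen_points = name, points
    return chosen


# Two tasks from the report above; the 16777216kB limit is 4194304 pages.
TASKS = [("pause", 1, -998), ("data_sim", 3967322, -998)]
TOTALPAGES = 16777216 // 4
```

    With clamping, both tasks score 1 and the first scanned one (pause) wins;
    with signed points, data_sim has the larger (less negative) score and is
    selected instead.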

    [cai@lca.pw: reported an issue in the previous version]
    [mhocko@suse.com: fixed the issue reported by Cai]
    [mhocko@suse.com: add the comment in proc_oom_score()]
    [laoar.shao@gmail.com: v3]
    Link: http://lkml.kernel.org/r/1594396651-9931-1-git-send-email-laoar.shao@gmail.com

    Signed-off-by: Yafang Shao
    Signed-off-by: Andrew Morton
    Tested-by: Naresh Kamboju
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/1594309987-9919-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Linus Torvalds

    Yafang Shao
     

08 Aug, 2020

1 commit

  • In order to prepare for per-object slab memory accounting, convert
    NR_SLAB_RECLAIMABLE and NR_SLAB_UNRECLAIMABLE vmstat items to bytes.

    To make it obvious, rename them to NR_SLAB_RECLAIMABLE_B and
    NR_SLAB_UNRECLAIMABLE_B (similar to NR_KERNEL_STACK_KB).

    Internally, global and per-node counters are stored in pages, while memcg
    and lruvec counters are stored in bytes. This scheme may look weird, but
    only for now. Once slab pages are shared between multiple cgroups, global
    and node counters will reflect the total number of slab pages, whereas
    memcg and lruvec counters will be used for per-memcg slab memory
    tracking, which will take separate kernel objects into account.
    Keeping global and node counters in pages helps to avoid additional
    overhead.

    The size of slab memory shouldn't exceed 4GB on 32-bit machines, so it
    will fit into the atomic_long_t we use for vmstats.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Michal Hocko
    Cc: Tejun Heo
    Link: http://lkml.kernel.org/r/20200623174037.3951353-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

11 Jun, 2020

1 commit

  • Switch the function documentation to kerneldoc comments, and add
    WARN_ON_ONCE asserts that the calling thread is a kernel thread and does
    not have ->mm set (or has ->mm set in the case of unuse_mm).

    Also give the functions a kthread_ prefix to better document the use case.

    [hch@lst.de: fix a comment typo, cover the newly merged use_mm/unuse_mm caller in vfio]
    Link: http://lkml.kernel.org/r/20200416053158.586887-3-hch@lst.de
    [sfr@canb.auug.org.au: powerpc/vas: fix up for {un}use_mm() rename]
    Link: http://lkml.kernel.org/r/20200422163935.5aa93ba5@canb.auug.org.au

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Tested-by: Jens Axboe
    Reviewed-by: Jens Axboe
    Acked-by: Felix Kuehling
    Acked-by: Greg Kroah-Hartman [usb]
    Acked-by: Haren Myneni
    Cc: Alex Deucher
    Cc: Al Viro
    Cc: Felipe Balbi
    Cc: Jason Wang
    Cc: "Michael S. Tsirkin"
    Cc: Zhenyu Wang
    Cc: Zhi Wang
    Link: http://lkml.kernel.org/r/20200404094101.672954-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

10 Jun, 2020

3 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Convert comments that reference old mmap_sem APIs to reference
    corresponding new mmap locking APIs instead.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-12-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
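
    To show what the rule does to a single call site, here is a rough Python
    approximation using a regular expression. It is only an illustration of
    the rename table above and not a substitute for spatch, which matches on
    the parsed C expression rather than on text; the helper name is
    hypothetical.

```python
import re

# Old rwsem call -> new mmap locking API, as listed in the rule above.
RENAMES = {
    "init_rwsem": "mmap_init_lock",
    "down_write": "mmap_write_lock",
    "down_write_killable": "mmap_write_lock_killable",
    "down_write_trylock": "mmap_write_trylock",
    "up_write": "mmap_write_unlock",
    "downgrade_write": "mmap_write_downgrade",
    "down_read": "mmap_read_lock",
    "down_read_killable": "mmap_read_lock_killable",
    "down_read_trylock": "mmap_read_trylock",
    "up_read": "mmap_read_unlock",
}

# Longest names first so that e.g. down_write_killable is tried before
# down_write. The captured expression is whatever precedes ->mmap_sem.
_CALL_RE = re.compile(
    r"\b(" + "|".join(sorted(RENAMES, key=len, reverse=True)) + r")"
    r"\(\s*&\s*([^()]+?)\s*->\s*mmap_sem\s*\)"
)


def convert_line(line):
    """Rewrite old mmap_sem rwsem calls on one line of C source."""
    return _CALL_RE.sub(lambda m: f"{RENAMES[m.group(1)]}({m.group(2)})", line)
```

    For example, convert_line("down_read(&mm->mmap_sem);") yields
    "mmap_read_lock(mm);", while calls on other rwsems are left untouched.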

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

04 Jun, 2020

1 commit

  • classzone_idx is just a different name for high_zoneidx now. So,
    integrate them and add some comments to struct alloc_context in order to
    reduce future confusion about the meaning of this variable.

    The accessor, ac_classzone_idx(), is also removed since it isn't needed
    after the integration.

    In addition to the integration, this patch also renames high_zoneidx to
    highest_zoneidx since it conveys the meaning more precisely.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Ye Xiaolong
    Link: http://lkml.kernel.org/r/1587095923-7515-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

01 Feb, 2020

1 commit

  • When a process cannot be oom reaped, for whatever reason, the list of
    locks that are held is currently dumped to the kernel log.

    Much more interesting is the stack trace of the victim that cannot be
    reaped. If the stack trace is dumped, we can find related occurrences in
    the same kernel code and hopefully solve the issue that is making it
    wedged.

    Dump the stack trace when a process fails to be oom reaped.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2001141519280.200484@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

05 Jan, 2020

1 commit

  • pr_err() expects kB, but mm_pgtables_bytes() returns the number of bytes.
    As everything else is printed in kB, I chose to fix the value rather than
    the string.

    Before:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1878] 1000 1878 217253 151144 1269760 0 0 python
    ...
    Out of memory: Killed process 1878 (python) total-vm:869012kB, anon-rss:604572kB, file-rss:4kB, shmem-rss:0kB, UID:1000 pgtables:1269760kB oom_score_adj:0

    After:

    [ pid ] uid tgid total_vm rss pgtables_bytes swapents oom_score_adj name
    ...
    [ 1436] 1000 1436 217253 151890 1294336 0 0 python
    ...
    Out of memory: Killed process 1436 (python) total-vm:869012kB, anon-rss:607516kB, file-rss:44kB, shmem-rss:0kB, UID:1000 pgtables:1264kB oom_score_adj:0
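
    The numbers above can be checked with a one-line conversion. This is a
    minimal sketch in Python that assumes the fix amounts to dividing the
    byte count by 1024 (a right shift by 10); the helper name is
    hypothetical.

```python
def pgtables_kb(pgtables_bytes):
    """Convert a page table size in bytes to kB, rounding down."""
    return pgtables_bytes >> 10
```

    For the "After" sample, 1294336 bytes of page tables become the 1264kB
    shown in the Killed process message.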

    Link: http://lkml.kernel.org/r/20191211202830.1600-1-idryomov@gmail.com
    Fixes: 70cb6d267790 ("mm/oom: add oom_score_adj and pgtables to Killed process message")
    Signed-off-by: Ilya Dryomov
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Edward Chron
    Cc: David Rientjes
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ilya Dryomov
     

26 Sep, 2019

1 commit

  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an app
    from scratch is a cold start, while resuming an existing app is a hot
    start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as well
    as faster so we are trying to make hot start more likely than cold start.

    To make hot starts more likely, Android userspace manages the order in
    which apps should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd (low memory killer daemon). They are likely to be killed by
    lmkd if the system has to reclaim memory. In that sense they are similar
    to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements but those performance improvements
    will vary based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were dominant consumers of memory on the system.
    However, they were not significant consumers of swap even though they are
    good candidates for swap. On investigation, swapping out only begins
    once the low zone watermark is hit and kswapd wakes up, but the overall
    allocation rate in the system might trip lmkd thresholds and cause a
    cached process to be killed. (We measured the performance of swapping out
    vs. zapping the memory by killing a process. Unsurprisingly, zapping is
    10x faster, even though we use zram, which is much faster than real
    storage.) A kill from lmkd will therefore often satisfy the high zone
    watermark, resulting in very few pages actually being moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel’s LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it gives the platform many opportunities to use the information it has
    to optimize memory efficiency.

    To achieve the goal, the patchset introduces two new options for madvise.
    One is MADV_COLD, which deactivates active pages, and the other is
    MADV_PAGEOUT, which reclaims private pages instantly. These new options
    complement MADV_DONTNEED and MADV_FREE by adding non-destructive ways to
    gain some free memory space. MADV_PAGEOUT is similar to MADV_DONTNEED in
    that it hints to the kernel that the memory region is not currently
    needed and should be reclaimed immediately; MADV_COLD is similar to
    MADV_FREE in that it hints to the kernel that the memory region is not
    currently needed and should be reclaimed when memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it could
    give a hint to the kernel that the pages can be reclaimed when memory
    pressure happens but the data should be preserved for future use. This
    could reduce workingset eviction and so end up increasing performance.

    This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not expected
    to be used in the near future. The hint can help the kernel decide which
    pages to evict early during memory pressure.

    It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the head of
    the inactive file LRU, because MADV_COLD has slightly different
    semantics. MADV_FREE means it's okay to discard the pages under memory
    pressure because their content is *garbage*, so freeing such pages has
    almost zero overhead: we don't need to swap them out, and a later access
    causes just a minor fault. Thus it makes sense to put those freeable
    pages on the inactive file LRU to compete with other used-once pages. It
    makes sense from an implementation point of view too, because the memory
    is no longer swap-backed until it is re-dirtied. It even gives a bonus:
    the pages can be reclaimed on a swapless system. However, MADV_COLD
    doesn't mean garbage, so reclaiming those pages requires swap-out/in in
    the end, which is a bigger cost. Since VM LRU aging is designed around a
    cost model, cold anonymous pages are better placed on the inactive anon
    LRU list, not the file LRU. Furthermore, that helps avoid unnecessary
    scanning if the system doesn't have a swap device. Let's start with the
    simpler way, without adding complexity at this moment. However, keep in
    mind the caveat that workloads with a lot of page cache are likely to
    ignore MADV_COLD on anonymous memory, because we rarely age anonymous LRU
    lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.
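
    From userspace the hint is passed like any other madvise(2) advice. The
    sketch below uses Python's mmap.madvise() (Python 3.8+). It is
    Linux-specific: the fallback value 20 is MADV_COLD in the Linux UAPI
    headers, the helper name is hypothetical, and kernels without MADV_COLD
    support are expected to reject the call, which the helper reports as
    False.

```python
import mmap

# Use the constant if this Python build exposes it, else the Linux value.
MADV_COLD = getattr(mmap, "MADV_COLD", 20)


def mark_cold(buf):
    """Hint that the whole mapping is not expected to be used soon.

    Returns True if the kernel accepted the hint, False otherwise.
    """
    try:
        buf.madvise(MADV_COLD)
    except OSError:
        return False
    return True


region = mmap.mmap(-1, 4 * mmap.PAGESIZE)  # anonymous mapping
region[:5] = b"hello"
accepted = mark_cold(region)
```

    Whether or not the hint is accepted, the contents of the region are
    preserved, so region[:5] still reads back as b"hello" afterwards.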

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

25 Sep, 2019

5 commits

  • constrained_alloc() calculates the size of the oom domain by using
    node_spanned_pages, which is incorrect because this is the full range of
    physical memory that the numa node occupies rather than the memory that
    backs that range, which is represented by node_present_pages.

    Sparsely populated nodes (e.g. after memory hot remove or simply sparse
    due to memory layout) can have a really large difference between the two.
    This shouldn't cause any real user-observable problems because the oom
    killer calculates a ratio against totalpages and used memory cannot
    exceed present pages, but it is confusing and wrong from the code point
    of view.
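
    The difference between the two counts can be illustrated with a toy
    Python model. The node layout below is invented for illustration, the
    function names are hypothetical, and the ranges are assumed not to
    overlap.

```python
def spanned_pages(populated_ranges):
    """Pages between the node's lowest and highest pfn, holes included."""
    start = min(lo for lo, _ in populated_ranges)
    end = max(hi for _, hi in populated_ranges)
    return end - start


def present_pages(populated_ranges):
    """Pages actually backed by memory, holes excluded."""
    return sum(hi - lo for lo, hi in populated_ranges)


# Hypothetical sparse node: two small populated [start, end) pfn ranges
# separated by a large hole.
SPARSE_NODE = [(0, 1000), (900000, 901000)]
```

    For this layout the node spans 901000 pages but only 2000 are present,
    which is why sizing the oom domain from the spanned count overstates it.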

    Link: http://lkml.kernel.org/r/20190829163443.899-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: David Hildenbrand
    Reviewed-by: David Hildenbrand
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Commit ac311a14c682 ("oom: decouple mems_allowed from
    oom_unkillable_task") changed has_intersects_mems_allowed() to
    oom_cpuset_eligible(), but didn't change the comment.

    Link: http://lkml.kernel.org/r/1566959929-10638-1-git-send-email-wang.yi59@zte.com.cn
    Signed-off-by: Yi Wang
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yi Wang
     
  • For an OOM event: print oom_score_adj value for the OOM Killed process to
    document what the oom score adjust value was at the time the process was
    OOM Killed. The adjustment value can be set by user code and it affects
    the resulting oom_score so it is used to influence kill process selection.

    When eligible tasks are not printed (sysctl oom_dump_tasks = 0) printing
    this value is the only documentation of the value for the process being
    killed. Having this value on the Killed process message is useful to
    document if a misconfiguration occurred or to confirm that the
    oom_score_adj configuration applies as expected.

    An example which illustrates both misconfiguration and validation that the
    oom_score_adj was applied as expected is:

    Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692
    (systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB,
    shmem-rss:0kB pgtables:22kB oom_score_adj:1000

    The systemd-udevd is a critical system application that should have an
    oom_score_adj of -1000. It was misconfigured to have an adjustment of 1000
    making it a highly favored OOM kill target process. The output documents
    both the misconfiguration and the fact that the process was correctly
    targeted by OOM due to the misconfiguration. This can be quite helpful for
    triage and problem determination.

    The addition of the pgtables_bytes shows page table usage by the process
    and is a useful measure of the memory size of the process.
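
    For triage, the new field can be pulled out of the log with a short
    script. The following Python sketch parses the sample message quoted
    above; the regular expression and helper name are illustrative, and it
    assumes the process name contains no closing parenthesis.

```python
import re

KILLED_RE = re.compile(
    r"Killed process (?P<pid>\d+) \((?P<comm>[^)]*)\)"
    r".*?oom_score_adj:(?P<adj>-?\d+)"
)


def parse_killed_line(line):
    """Return (pid, comm, oom_score_adj), or None if the line has no match."""
    match = KILLED_RE.search(line)
    if match is None:
        return None
    return int(match.group("pid")), match.group("comm"), int(match.group("adj"))


# The example message from this commit, joined onto one line.
SAMPLE = (
    "Aug 14 23:00:02 testserver kernel: Out of memory: Killed process 2692 "
    "(systemd-udevd) total-vm:1056800kB, anon-rss:1052760kB, file-rss:4kB, "
    "shmem-rss:0kB pgtables:22kB oom_score_adj:1000"
)
```

    Messages from kernels that predate this change carry no oom_score_adj
    field, so the helper returns None for them.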

    Link: http://lkml.kernel.org/r/20190822173157.1569-1-echron@arista.com
    Signed-off-by: Edward Chron
    Acked-by: Michal Hocko
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Edward Chron
     
  • Masoud Sharbiani noticed that commit 29ef680ae7c21110 ("memcg, oom: move
    out_of_memory back to the charge path") broke memcg OOM called from
    __xfs_filemap_fault() path. It turned out that try_charge() is retrying
    forever without making forward progress because mem_cgroup_oom(GFP_NOFS)
    cannot invoke the OOM killer due to commit 3da88fb3bacfaa33 ("mm, oom:
    move GFP_NOFS check to out_of_memory").

    Allowing a forced charge because the memcg OOM killer cannot be invoked
    will lead to a global OOM situation. Also, just returning -ENOMEM will be
    risky because the OOM path is lost and some paths (e.g. get_user_pages())
    will leak -ENOMEM. Therefore, invoking the memcg OOM killer (despite
    GFP_NOFS) is the only choice we have for now.

    Until 29ef680ae7c21110, we were able to invoke memcg OOM killer when
    GFP_KERNEL reclaim failed [1]. But since 29ef680ae7c21110, we need to
    invoke memcg OOM killer when GFP_NOFS reclaim failed [2]. Although in the
    past we did invoke memcg OOM killer for GFP_NOFS [3], we might get
    premature memcg OOM reports due to this patch.

    [1]

    leaker invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), nodemask=(null), order=0, oom_score_adj=0
    CPU: 0 PID: 2746 Comm: leaker Not tainted 4.18.0+ #19
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    Call Trace:
    dump_stack+0x63/0x88
    dump_header+0x67/0x27a
    ? mem_cgroup_scan_tasks+0x91/0xf0
    oom_kill_process+0x210/0x410
    out_of_memory+0x10a/0x2c0
    mem_cgroup_out_of_memory+0x46/0x80
    mem_cgroup_oom_synchronize+0x2e4/0x310
    ? high_work_func+0x20/0x20
    pagefault_out_of_memory+0x31/0x76
    mm_fault_error+0x55/0x115
    ? handle_mm_fault+0xfd/0x220
    __do_page_fault+0x433/0x4e0
    do_page_fault+0x22/0x30
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
    RIP: 0033:0x4009f0
    Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
    RSP: 002b:00007ffe29ae96f0 EFLAGS: 00010206
    RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001ce1000
    RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
    RBP: 000000000000000c R08: 0000000000000000 R09: 00007f94be09220d
    R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
    R13: 0000000000000003 R14: 00007f949d845000 R15: 0000000002800000
    Task in /leaker killed as a result of limit of /leaker
    memory: usage 524288kB, limit 524288kB, failcnt 158965
    memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 2016kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /leaker: cache:844KB rss:521136KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:132KB writeback:0KB inactive_anon:0KB active_anon:521224KB inactive_file:1012KB active_file:8KB unevictable:0KB
    Memory cgroup out of memory: Kill process 2746 (leaker) score 998 or sacrifice child
    Killed process 2746 (leaker) total-vm:536704kB, anon-rss:521176kB, file-rss:1208kB, shmem-rss:0kB
    oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    [2]

    leaker invoked oom-killer: gfp_mask=0x600040(GFP_NOFS), nodemask=(null), order=0, oom_score_adj=0
    CPU: 1 PID: 2746 Comm: leaker Not tainted 4.18.0+ #20
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    Call Trace:
    dump_stack+0x63/0x88
    dump_header+0x67/0x27a
    ? mem_cgroup_scan_tasks+0x91/0xf0
    oom_kill_process+0x210/0x410
    out_of_memory+0x109/0x2d0
    mem_cgroup_out_of_memory+0x46/0x80
    try_charge+0x58d/0x650
    ? __radix_tree_replace+0x81/0x100
    mem_cgroup_try_charge+0x7a/0x100
    __add_to_page_cache_locked+0x92/0x180
    add_to_page_cache_lru+0x4d/0xf0
    iomap_readpages_actor+0xde/0x1b0
    ? iomap_zero_range_actor+0x1d0/0x1d0
    iomap_apply+0xaf/0x130
    iomap_readpages+0x9f/0x150
    ? iomap_zero_range_actor+0x1d0/0x1d0
    xfs_vm_readpages+0x18/0x20 [xfs]
    read_pages+0x60/0x140
    __do_page_cache_readahead+0x193/0x1b0
    ondemand_readahead+0x16d/0x2c0
    page_cache_async_readahead+0x9a/0xd0
    filemap_fault+0x403/0x620
    ? alloc_set_pte+0x12c/0x540
    ? _cond_resched+0x14/0x30
    __xfs_filemap_fault+0x66/0x180 [xfs]
    xfs_filemap_fault+0x27/0x30 [xfs]
    __do_fault+0x19/0x40
    __handle_mm_fault+0x8e8/0xb60
    handle_mm_fault+0xfd/0x220
    __do_page_fault+0x238/0x4e0
    do_page_fault+0x22/0x30
    ? page_fault+0x8/0x30
    page_fault+0x1e/0x30
    RIP: 0033:0x4009f0
    Code: 03 00 00 00 e8 71 fd ff ff 48 83 f8 ff 49 89 c6 74 74 48 89 c6 bf c0 0c 40 00 31 c0 e8 69 fd ff ff 45 85 ff 7e 21 31 c9 66 90 0f be 14 0e 01 d3 f7 c1 ff 0f 00 00 75 05 41 c6 04 0e 2a 48 83
    RSP: 002b:00007ffda45c9290 EFLAGS: 00010206
    RAX: 000000000000001b RBX: 0000000000000000 RCX: 0000000001a1e000
    RDX: 0000000000000000 RSI: 000000007fffffe5 RDI: 0000000000000000
    RBP: 000000000000000c R08: 0000000000000000 R09: 00007f6d061ff20d
    R10: 0000000000000002 R11: 0000000000000246 R12: 00000000000186a0
    R13: 0000000000000003 R14: 00007f6ce59b2000 R15: 0000000002800000
    Task in /leaker killed as a result of limit of /leaker
    memory: usage 524288kB, limit 524288kB, failcnt 7221
    memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 1944kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /leaker: cache:3632KB rss:518232KB rss_huge:0KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:518408KB inactive_file:3908KB active_file:12KB unevictable:0KB
    Memory cgroup out of memory: Kill process 2746 (leaker) score 992 or sacrifice child
    Killed process 2746 (leaker) total-vm:536704kB, anon-rss:518264kB, file-rss:1188kB, shmem-rss:0kB
    oom_reaper: reaped process 2746 (leaker), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB

    [3]

    leaker invoked oom-killer: gfp_mask=0x50, order=0, oom_score_adj=0
    leaker cpuset=/ mems_allowed=0
    CPU: 1 PID: 3206 Comm: leaker Not tainted 3.10.0-957.27.2.el7.x86_64 #1
    Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    Call Trace:
    dump_stack+0x19/0x1b
    dump_header+0x90/0x229
    ? find_lock_task_mm+0x56/0xc0
    ? try_get_mem_cgroup_from_mm+0x28/0x60
    oom_kill_process+0x254/0x3d0
    mem_cgroup_oom_synchronize+0x546/0x570
    ? mem_cgroup_charge_common+0xc0/0xc0
    pagefault_out_of_memory+0x14/0x90
    mm_fault_error+0x6a/0x157
    __do_page_fault+0x3c8/0x4f0
    do_page_fault+0x35/0x90
    page_fault+0x28/0x30
    Task in /leaker killed as a result of limit of /leaker
    memory: usage 524288kB, limit 524288kB, failcnt 20628
    memory+swap: usage 524288kB, limit 9007199254740988kB, failcnt 0
    kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
    Memory cgroup stats for /leaker: cache:840KB rss:523448KB rss_huge:0KB mapped_file:0KB swap:0KB inactive_anon:0KB active_anon:523448KB inactive_file:464KB active_file:376KB unevictable:0KB
    Memory cgroup out of memory: Kill process 3206 (leaker) score 970 or sacrifice child
    Killed process 3206 (leaker) total-vm:536692kB, anon-rss:523304kB, file-rss:412kB, shmem-rss:0kB

    Bisected by Masoud Sharbiani.

    Link: http://lkml.kernel.org/r/cbe54ed1-b6ba-a056-8899-2dc42526371d@i-love.sakura.ne.jp
    Fixes: 3da88fb3bacfaa33 ("mm, oom: move GFP_NOFS check to out_of_memory") [necessary after 29ef680ae7c21110]
    Signed-off-by: Tetsuo Handa
    Reported-by: Masoud Sharbiani
    Tested-by: Masoud Sharbiani
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: [4.19+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • In the event of an oom kill, useful information about the killed process
    is printed to dmesg. Users, especially system administrators, will find
    it useful to immediately see the UID of the process.

    We already print uid when dumping eligible tasks so it is not overly hard
    to find that information in the oom report. However this information is
    unavailable when dumping of eligible tasks is disabled.

    In the following example, abuse_the_ram is the name of a program that
    attempts to iteratively allocate all available memory until it is stopped
    by force.

    Current message:

    Out of memory: Killed process 35389 (abuse_the_ram)
    total-vm:133718232kB, anon-rss:129624980kB, file-rss:0kB,
    shmem-rss:0kB

    Patched message:

    Out of memory: Killed process 2739 (abuse_the_ram),
    total-vm:133880028kB, anon-rss:129754836kB, file-rss:0kB,
    shmem-rss:0kB, UID:0

    [akpm@linux-foundation.org: s/UID %d/UID:%u/ in printk]
    Link: http://lkml.kernel.org/r/1560362273-534-1-git-send-email-jsavitz@redhat.com
    Signed-off-by: Joel Savitz
    Suggested-by: David Rientjes
    Acked-by: Rafael Aquini
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joel Savitz
     

13 Jul, 2019

5 commits

  • Since commit bbbe48029720 ("mm, oom: remove 'prefer children over
    parent' heuristic") removed the

    "%s: Kill process %d (%s) score %u or sacrifice child\n"

    line, oc->chosen_points is no longer used after select_bad_process().

    Link: http://lkml.kernel.org/r/1560853435-15575-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Shakeel Butt
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Commit ef08e3b4981a ("[PATCH] cpusets: confine oom_killer to
    mem_exclusive cpuset") introduced a heuristic where a potential
    oom-killer victim is skipped if the intersection of the mems_allowed of
    the potential victim and of current (the process that triggered the oom)
    is empty, on the grounds that killing such a victim most probably will
    not help the current allocating process.

    However, commit 7887a3da753e ("[PATCH] oom: cpuset hint") changed the
    heuristic to merely decrease the oom_badness score of such a potential
    victim, on the grounds that the cpuset of such processes might have
    changed and they may previously have allocated memory on mems that the
    current allocating process can allocate from.

    Unintentionally, 7887a3da753e ("[PATCH] oom: cpuset hint") introduced a
    side effect: because oom_badness is also exposed to user space through
    /proc/[pid]/oom_score, readers with different cpusets can read different
    oom_score values for the same process.

    Later, commit 6cf86ac6f36b ("oom: filter tasks not sharing the same
    cpuset") fixed the side effect introduced by 7887a3da753e by moving the
    cpuset intersection back to only the oom-killer context and out of
    oom_badness. However, the combination of ab290adbaf8f ("oom: make
    oom_unkillable_task() helper function") and 26ebc984913b ("oom:
    /proc/<pid>/oom_score treat kernel thread honestly") unintentionally
    brought the cpuset intersection check back into the oom_badness
    calculation function.

    Besides doing the cpuset/mempolicy intersection from oom_badness, the
    memcg oom context also does the cpuset/mempolicy intersection, which is
    quite wrong and was caught by syzkaller with the following report:

    kasan: CONFIG_KASAN_INLINE enabled
    kasan: GPF could be caused by NULL-ptr deref or user memory access
    general protection fault: 0000 [#1] PREEMPT SMP KASAN
    CPU: 0 PID: 28426 Comm: syz-executor.5 Not tainted 5.2.0-rc3-next-20190607
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS
    Google 01/01/2011
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
    RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
    RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
    Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
    00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
    85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
    RSP: 0018:ffff888000127490 EFLAGS: 00010a03
    RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
    RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
    RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
    R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
    R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
    FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000607304 CR3: 000000009237e000 CR4: 00000000001426f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600
    Call Trace:
    oom_evaluate_task+0x49/0x520 mm/oom_kill.c:321
    mem_cgroup_scan_tasks+0xcc/0x180 mm/memcontrol.c:1169
    select_bad_process mm/oom_kill.c:374 [inline]
    out_of_memory mm/oom_kill.c:1088 [inline]
    out_of_memory+0x6b2/0x1280 mm/oom_kill.c:1035
    mem_cgroup_out_of_memory+0x1ca/0x230 mm/memcontrol.c:1573
    mem_cgroup_oom mm/memcontrol.c:1905 [inline]
    try_charge+0xfbe/0x1480 mm/memcontrol.c:2468
    mem_cgroup_try_charge+0x24d/0x5e0 mm/memcontrol.c:6073
    mem_cgroup_try_charge_delay+0x1f/0xa0 mm/memcontrol.c:6088
    do_huge_pmd_wp_page_fallback+0x24f/0x1680 mm/huge_memory.c:1201
    do_huge_pmd_wp_page+0x7fc/0x2160 mm/huge_memory.c:1359
    wp_huge_pmd mm/memory.c:3793 [inline]
    __handle_mm_fault+0x164c/0x3eb0 mm/memory.c:4006
    handle_mm_fault+0x3b7/0xa90 mm/memory.c:4053
    do_user_addr_fault arch/x86/mm/fault.c:1455 [inline]
    __do_page_fault+0x5ef/0xda0 arch/x86/mm/fault.c:1521
    do_page_fault+0x71/0x57d arch/x86/mm/fault.c:1552
    page_fault+0x1e/0x30 arch/x86/entry/entry_64.S:1156
    RIP: 0033:0x400590
    Code: 06 e9 49 01 00 00 48 8b 44 24 10 48 0b 44 24 28 75 1f 48 8b 14 24 48
    8b 7c 24 20 be 04 00 00 00 e8 f5 56 00 00 48 8b 74 24 08 06 e9 1e 01
    00 00 48 8b 44 24 08 48 8b 14 24 be 04 00 00 00 8b
    RSP: 002b:00007fff7bc49780 EFLAGS: 00010206
    RAX: 0000000000000001 RBX: 0000000000760000 RCX: 0000000000000000
    RDX: 0000000000000000 RSI: 000000002000cffc RDI: 0000000000000001
    RBP: fffffffffffffffe R08: 0000000000000000 R09: 0000000000000000
    R10: 0000000000000075 R11: 0000000000000246 R12: 0000000000760008
    R13: 00000000004c55f2 R14: 0000000000000000 R15: 00007fff7bc499b0
    Modules linked in:
    ---[ end trace a65689219582ffff ]---
    RIP: 0010:__read_once_size include/linux/compiler.h:194 [inline]
    RIP: 0010:has_intersects_mems_allowed mm/oom_kill.c:84 [inline]
    RIP: 0010:oom_unkillable_task mm/oom_kill.c:168 [inline]
    RIP: 0010:oom_unkillable_task+0x180/0x400 mm/oom_kill.c:155
    Code: c1 ea 03 80 3c 02 00 0f 85 80 02 00 00 4c 8b a3 10 07 00 00 48 b8 00
    00 00 00 00 fc ff df 4d 8d 74 24 10 4c 89 f2 48 c1 ea 03 3c 02 00 0f
    85 67 02 00 00 49 8b 44 24 10 4c 8d a0 68 fa ff ff
    RSP: 0018:ffff888000127490 EFLAGS: 00010a03
    RAX: dffffc0000000000 RBX: ffff8880a4cd5438 RCX: ffffffff818dae9c
    RDX: 100000000c3cc602 RSI: ffffffff818dac8d RDI: 0000000000000001
    RBP: ffff8880001274d0 R08: ffff888000086180 R09: ffffed1015d26be0
    R10: ffffed1015d26bdf R11: ffff8880ae935efb R12: 8000000061e63007
    R13: 0000000000000000 R14: 8000000061e63017 R15: 1ffff11000024ea6
    FS: 00005555561f5940(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000001b2f823000 CR3: 000000009237e000 CR4: 00000000001426f0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000600

    The fix is to decouple the cpuset/mempolicy intersection check from
    oom_unkillable_task() and make sure cpuset/mempolicy intersection check is
    only done in the global oom context.

    [shakeelb@google.com: change function name and update comment]
    Link: http://lkml.kernel.org/r/20190628152421.198994-3-shakeelb@google.com
    Link: http://lkml.kernel.org/r/20190624212631.87212-3-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Reported-by: syzbot+d0fc9d3c166bc5e4a94b@syzkaller.appspotmail.com
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • oom_unkillable_task() can be called from three different contexts, i.e.
    global OOM, memcg OOM and the oom_score procfs interface. At the moment
    oom_unkillable_task() does a task_in_mem_cgroup() check on the given
    process. Since there is no reason to perform the task_in_mem_cgroup()
    check for global OOM and the oom_score procfs interface, those contexts
    provide a NULL memcg and skip the task_in_mem_cgroup() check. For the
    memcg OOM context, however, oom_unkillable_task() is always called from
    mem_cgroup_scan_tasks(), so the task_in_mem_cgroup() check is redundant
    and effectively dead code. So, just remove the task_in_mem_cgroup()
    check altogether.

    Link: http://lkml.kernel.org/r/20190624212631.87212-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Signed-off-by: Tetsuo Handa
    Acked-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Paul Jackson
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • dump_tasks() traverses all the existing processes even for the memcg OOM
    context which is not only unnecessary but also wasteful. This imposes a
    long RCU critical section even from a contained context which can be quite
    disruptive.

    Change dump_tasks() to be aligned with select_bad_process and use
    mem_cgroup_scan_tasks to selectively traverse only processes of the target
    memcg hierarchy during memcg OOM.

    Link: http://lkml.kernel.org/r/20190617231207.160865-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc: Vladimir Davydov
    Cc: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Paul Jackson
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Since commit c03cd7738a83 ("cgroup: Include dying leaders with live
    threads in PROCS iterations") corrected how CSS_TASK_ITER_PROCS works,
    mem_cgroup_scan_tasks() can use CSS_TASK_ITER_PROCS in order to check
    only one thread from each thread group.

    [penguin-kernel@I-love.SAKURA.ne.jp: remove thread group leader check in oom_evaluate_task()]
    Link: http://lkml.kernel.org/r/1560853257-14934-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Link: http://lkml.kernel.org/r/c763afc8-f0ae-756a-56a7-395f625b95fc@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
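
The cpuset/mempolicy fix in the 13 Jul, 2019 group above (the commit by Shakeel Butt that decouples the intersection check from oom_unkillable_task()) can be sketched as follows. This is a minimal userspace model, not kernel code: `task_model`, `oom_context`, `cpuset_eligible()` and `is_candidate()` are hypothetical names, and a plain bitmask stands in for the kernel's nodemask type. It only illustrates the rule stated in the commit message, namely that the intersection check is applied in the global OOM context and skipped for memcg OOM.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified model: one bit per NUMA node. */
typedef unsigned long nodemask_bits;

struct task_model {
	nodemask_bits mems_allowed;	/* nodes the task may allocate from */
};

struct oom_context {
	bool memcg_oom;			/* true when a memcg limit triggered the OOM */
	nodemask_bits allocator_mems;	/* mems_allowed of the allocating task */
};

/* True when the candidate shares at least one node with the allocating task. */
static bool cpuset_eligible(const struct task_model *t,
			    const struct oom_context *oc)
{
	return (t->mems_allowed & oc->allocator_mems) != 0;
}

/*
 * Candidate filter: the intersection check applies to global OOM only.
 * For a memcg OOM, every task of the memcg stays a candidate.
 */
bool is_candidate(const struct task_model *t, const struct oom_context *oc)
{
	assert(t != NULL && oc != NULL);
	if (!oc->memcg_oom && !cpuset_eligible(t, oc))
		return false;
	return true;
}
```

With a task restricted to node 1 and an allocating task restricted to node 0, `is_candidate()` rejects the task for a global OOM but keeps it for a memcg OOM.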
     

29 Jun, 2019

1 commit

  • In dump_oom_summary(), oc->constraint is used to show
    oom_constraint_text, but it has not been set at that point, so its value
    is always the default value 0. We should initialize it beforehand.

    Below is the output when a memcg oom occurs,

    before this patch:
    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=7997,uid=0

    after this patch:
    oom-kill:constraint=CONSTRAINT_MEMCG,nodemask=(null), cpuset=/,mems_allowed=0,oom_memcg=/foo,task_memcg=/foo,task=bash,pid=13681,uid=0

    Link: http://lkml.kernel.org/r/1560522038-15879-1-git-send-email-laoar.shao@gmail.com
    Fixes: ef8444ea01d7 ("mm, oom: reorganize the oom report in dump_header")
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Cc: Wind Yu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
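
The effect described in the 29 Jun, 2019 commit above can be sketched in userspace C. The enum values and the `oom_constraint_text` table mirror the names that appear in the log output; `oom_control_model`, `set_constraint()` and `constraint_text()` are hypothetical simplifications introduced here for illustration. The point is only that a zero-initialized field indexes the first table entry, so the report prints CONSTRAINT_NONE until the constraint is set.

```c
#include <assert.h>
#include <stddef.h>

enum oom_constraint {
	CONSTRAINT_NONE,		/* value 0, what an unset field reads as */
	CONSTRAINT_CPUSET,
	CONSTRAINT_MEMORY_POLICY,
	CONSTRAINT_MEMCG,
};

static const char * const oom_constraint_text[] = {
	[CONSTRAINT_NONE] = "CONSTRAINT_NONE",
	[CONSTRAINT_CPUSET] = "CONSTRAINT_CPUSET",
	[CONSTRAINT_MEMORY_POLICY] = "CONSTRAINT_MEMORY_POLICY",
	[CONSTRAINT_MEMCG] = "CONSTRAINT_MEMCG",
};

/* Simplified stand-in for the OOM control structure; only two fields. */
struct oom_control_model {
	int is_memcg_oom;
	enum oom_constraint constraint;
};

/* Hypothetical helper: classify the OOM before any report is printed. */
void set_constraint(struct oom_control_model *oc)
{
	assert(oc != NULL);
	oc->constraint = oc->is_memcg_oom ? CONSTRAINT_MEMCG : CONSTRAINT_NONE;
}

/* Text that a summary line would show for the current constraint value. */
const char *constraint_text(const struct oom_control_model *oc)
{
	return oom_constraint_text[oc->constraint];
}
```

Calling `constraint_text()` on a zero-initialized structure returns the CONSTRAINT_NONE string even for a memcg OOM; after `set_constraint()` it returns the CONSTRAINT_MEMCG string, matching the before/after output quoted in the commit.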
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

1 commit

  • CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the change
    is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of mmu notifier
    should assume that everything in the range is going away when that event
    happens. A later patch converts the mm call paths to use a more
    appropriate event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

06 Mar, 2019

2 commits

  • Since setting global init process to some memory cgroup is technically
    possible, oom_kill_memcg_member() must check it.

    Tasks in /test1 are going to be killed due to memory.oom.group set
    Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b

    #include <stdio.h>
    #include <string.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <sys/stat.h>
    #include <fcntl.h>

    int main(int argc, char *argv[])
    {
            static char buffer[10485760];
            static int pipe_fd[2] = { EOF, EOF };
            unsigned int i;
            int fd;
            char buf[64] = { };
            if (pipe(pipe_fd))
                    return 1;
            if (chdir("/sys/fs/cgroup/"))
                    return 1;
            /* Enable the memory controller and create a group-kill memcg. */
            fd = open("cgroup.subtree_control", O_WRONLY);
            write(fd, "+memory", 7);
            close(fd);
            mkdir("test1", 0755);
            fd = open("test1/memory.oom.group", O_WRONLY);
            write(fd, "1", 1);
            close(fd);
            /* Move the global init process and this process into test1. */
            fd = open("test1/cgroup.procs", O_WRONLY);
            write(fd, "1", 1);
            snprintf(buf, sizeof(buf) - 1, "%d", getpid());
            write(fd, buf, strlen(buf));
            close(fd);
            /* Limit the memcg to five buffers' worth of memory. */
            snprintf(buf, sizeof(buf) - 1, "%lu", sizeof(buffer) * 5);
            fd = open("test1/memory.max", O_WRONLY);
            write(fd, buf, strlen(buf));
            close(fd);
            /* Ten children each touch one buffer once the pipe is closed. */
            for (i = 0; i < 10; i++)
                    if (fork() == 0) {
                            char c;
                            close(pipe_fd[1]);
                            read(pipe_fd[0], &c, 1);
                            memset(buffer, 0, sizeof(buffer));
                            sleep(3);
                            _exit(0);
                    }
            close(pipe_fd[0]);
            close(pipe_fd[1]);
            sleep(3);
            return 0;
    }

    [ 37.052923][ T9185] a.out invoked oom-killer: gfp_mask=0xcc0(GFP_KERNEL), order=0, oom_score_adj=0
    [ 37.056169][ T9185] CPU: 4 PID: 9185 Comm: a.out Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.059205][ T9185] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.062954][ T9185] Call Trace:
    [ 37.063976][ T9185] dump_stack+0x67/0x95
    [ 37.065263][ T9185] dump_header+0x51/0x570
    [ 37.066619][ T9185] ? trace_hardirqs_on+0x3f/0x110
    [ 37.068171][ T9185] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.069967][ T9185] oom_kill_process+0x18d/0x210
    [ 37.071515][ T9185] out_of_memory+0x11b/0x380
    [ 37.072936][ T9185] mem_cgroup_out_of_memory+0xb6/0xd0
    [ 37.074601][ T9185] try_charge+0x790/0x820
    [ 37.076021][ T9185] mem_cgroup_try_charge+0x42/0x1d0
    [ 37.077629][ T9185] mem_cgroup_try_charge_delay+0x11/0x30
    [ 37.079370][ T9185] do_anonymous_page+0x105/0x5e0
    [ 37.080939][ T9185] __handle_mm_fault+0x9cb/0x1070
    [ 37.082485][ T9185] handle_mm_fault+0x1b2/0x3a0
    [ 37.083819][ T9185] ? handle_mm_fault+0x47/0x3a0
    [ 37.085181][ T9185] __do_page_fault+0x255/0x4c0
    [ 37.086529][ T9185] do_page_fault+0x28/0x260
    [ 37.087788][ T9185] ? page_fault+0x8/0x30
    [ 37.088978][ T9185] page_fault+0x1e/0x30
    [ 37.090142][ T9185] RIP: 0033:0x7f8b183aefe0
    [ 37.091433][ T9185] Code: 20 f3 44 0f 7f 44 17 d0 f3 44 0f 7f 47 30 f3 44 0f 7f 44 17 c0 48 01 fa 48 83 e2 c0 48 39 d1 74 a3 66 0f 1f 84 00 00 00 00 00 44 0f 7f 01 66 44 0f 7f 41 10 66 44 0f 7f 41 20 66 44 0f 7f 41
    [ 37.096917][ T9185] RSP: 002b:00007fffc5d329e8 EFLAGS: 00010206
    [ 37.098615][ T9185] RAX: 00000000006010e0 RBX: 0000000000000008 RCX: 0000000000c30000
    [ 37.100905][ T9185] RDX: 00000000010010c0 RSI: 0000000000000000 RDI: 00000000006010e0
    [ 37.103349][ T9185] RBP: 0000000000000000 R08: 00007f8b188f4740 R09: 0000000000000000
    [ 37.105797][ T9185] R10: 00007fffc5d32420 R11: 00007f8b183aef40 R12: 0000000000000005
    [ 37.108228][ T9185] R13: 0000000000000000 R14: ffffffffffffffff R15: 0000000000000000
    [ 37.110840][ T9185] memory: usage 51200kB, limit 51200kB, failcnt 125
    [ 37.113045][ T9185] memory+swap: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.115808][ T9185] kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
    [ 37.117660][ T9185] Memory cgroup stats for /test1: cache:0KB rss:49484KB rss_huge:30720KB shmem:0KB mapped_file:0KB dirty:0KB writeback:0KB inactive_anon:0KB active_anon:49700KB inactive_file:0KB active_file:0KB unevictable:0KB
    [ 37.123371][ T9185] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,oom_memcg=/test1,task_memcg=/test1,task=a.out,pid=9188,uid=0
    [ 37.128158][ T9185] Memory cgroup out of memory: Killed process 9188 (a.out) total-vm:14456kB, anon-rss:10324kB, file-rss:504kB, shmem-rss:0kB
    [ 37.132710][ T9185] Tasks in /test1 are going to be killed due to memory.oom.group set
    [ 37.132833][ T54] oom_reaper: reaped process 9188 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.135498][ T9185] Memory cgroup out of memory: Killed process 1 (systemd) total-vm:43400kB, anon-rss:1228kB, file-rss:3992kB, shmem-rss:0kB
    [ 37.143434][ T9185] Memory cgroup out of memory: Killed process 9182 (a.out) total-vm:14456kB, anon-rss:76kB, file-rss:588kB, shmem-rss:0kB
    [ 37.144328][ T54] oom_reaper: reaped process 1 (systemd), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.147585][ T9185] Memory cgroup out of memory: Killed process 9183 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157222][ T9185] Memory cgroup out of memory: Killed process 9184 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157259][ T9185] Memory cgroup out of memory: Killed process 9185 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157291][ T9185] Memory cgroup out of memory: Killed process 9186 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:508kB, shmem-rss:0kB
    [ 37.157306][ T54] oom_reaper: reaped process 9183 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.157328][ T9185] Memory cgroup out of memory: Killed process 9187 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.157452][ T9185] Memory cgroup out of memory: Killed process 9189 (a.out) total-vm:14456kB, anon-rss:6228kB, file-rss:512kB, shmem-rss:0kB
    [ 37.158733][ T9185] Memory cgroup out of memory: Killed process 9190 (a.out) total-vm:14456kB, anon-rss:552kB, file-rss:512kB, shmem-rss:0kB
    [ 37.160083][ T54] oom_reaper: reaped process 9186 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.160187][ T54] oom_reaper: reaped process 9189 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.206941][ T54] oom_reaper: reaped process 9185 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.212300][ T9185] Memory cgroup out of memory: Killed process 9191 (a.out) total-vm:14456kB, anon-rss:4180kB, file-rss:512kB, shmem-rss:0kB
    [ 37.212317][ T54] oom_reaper: reaped process 9190 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.218860][ T9185] Memory cgroup out of memory: Killed process 9192 (a.out) total-vm:14456kB, anon-rss:1080kB, file-rss:512kB, shmem-rss:0kB
    [ 37.227667][ T54] oom_reaper: reaped process 9192 (a.out), now anon-rss:0kB, file-rss:0kB, shmem-rss:0kB
    [ 37.292323][ T9193] abrt-hook-ccpp (9193) used greatest stack depth: 10480 bytes left
    [ 37.351843][ T1] Kernel panic - not syncing: Attempted to kill init! exitcode=0x0000008b
    [ 37.354833][ T1] CPU: 7 PID: 1 Comm: systemd Kdump: loaded Not tainted 5.0.0-rc4-next-20190131 #280
    [ 37.357876][ T1] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/13/2018
    [ 37.361685][ T1] Call Trace:
    [ 37.363239][ T1] dump_stack+0x67/0x95
    [ 37.365010][ T1] panic+0xfc/0x2b0
    [ 37.366853][ T1] do_exit+0xd55/0xd60
    [ 37.368595][ T1] do_group_exit+0x47/0xc0
    [ 37.370415][ T1] get_signal+0x32a/0x920
    [ 37.372449][ T1] ? _raw_spin_unlock_irqrestore+0x3d/0x70
    [ 37.374596][ T1] do_signal+0x32/0x6e0
    [ 37.376430][ T1] ? exit_to_usermode_loop+0x26/0x9b
    [ 37.378418][ T1] ? prepare_exit_to_usermode+0xa8/0xd0
    [ 37.380571][ T1] exit_to_usermode_loop+0x3e/0x9b
    [ 37.382588][ T1] prepare_exit_to_usermode+0xa8/0xd0
    [ 37.384594][ T1] ? page_fault+0x8/0x30
    [ 37.386453][ T1] retint_user+0x8/0x18
    [ 37.388160][ T1] RIP: 0033:0x7f42c06974a8
    [ 37.389922][ T1] Code: Bad RIP value.
    [ 37.391788][ T1] RSP: 002b:00007ffc3effd388 EFLAGS: 00010213
    [ 37.394075][ T1] RAX: 000000000000000e RBX: 00007ffc3effd390 RCX: 0000000000000000
    [ 37.396963][ T1] RDX: 000000000000002a RSI: 00007ffc3effd390 RDI: 0000000000000004
    [ 37.399550][ T1] RBP: 00007ffc3effd680 R08: 0000000000000000 R09: 0000000000000000
    [ 37.402334][ T1] R10: 00000000ffffffff R11: 0000000000000246 R12: 0000000000000001
    [ 37.404890][ T1] R13: ffffffffffffffff R14: 0000000000000884 R15: 000056460b1ac3b0

    Link: http://lkml.kernel.org/r/201902010336.x113a4EO027170@www262.sakura.ne.jp
    Fixes: 3d8b38eb81cac813 ("mm, oom: introduce memory.oom.group")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Roman Gushchin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Since the start of the git history of Linux, the kernel, after selecting
    the worst process to be oom-killed, has preferred to kill its child (if
    the child does not share mm with the parent). Later it was changed to
    prefer to kill the child that is worst. If the parent is still the worst
    then the parent will be killed.

    This heuristic assumes that the children did less work than their parent
    and that by killing one of them, less work will be lost. However, this
    is very workload dependent. A workload that benefits from this heuristic
    can use oom_score_adj to prefer children to be killed before the parent.

    select_bad_process() has already selected the worst process in the
    system/memcg. There is no need to recheck the badness of its children
    in the hope of finding a worse candidate. That's a lot of unneeded racy
    work. The heuristic is also dangerous because it makes fork-bomb-like
    workloads recover much later, since we constantly pick and kill
    processes which are not memory hogs. So, let's remove this whole
    heuristic.

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/20190121215850.221745-2-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Cc: David Rientjes
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
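
The check required by the first commit of the 06 Mar, 2019 group above (group kill must not target the global init process) can be sketched as a userspace model. `task_model`, `is_global_init_model()` and `kill_memcg_members()` are hypothetical names, "pid 1" stands in for the kernel's notion of global init, and skipping tasks whose oom_score_adj is at the minimum is an assumption about the surrounding logic rather than something the commit message states.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define OOM_SCORE_ADJ_MIN (-1000)	/* lowest oom_score_adj value */

/* Hypothetical, simplified task record. */
struct task_model {
	int pid;
	int oom_score_adj;
	bool killed;
};

/* Model assumption: the global init process is the task with pid 1. */
static bool is_global_init_model(const struct task_model *t)
{
	return t->pid == 1;
}

/*
 * Group kill over all members of a memcg: skip tasks marked unkillable
 * through oom_score_adj and skip global init. Returns the number of
 * tasks that were marked killed.
 */
int kill_memcg_members(struct task_model *tasks, size_t n)
{
	int killed = 0;

	assert(tasks != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		if (tasks[i].oom_score_adj == OOM_SCORE_ADJ_MIN)
			continue;
		if (is_global_init_model(&tasks[i]))
			continue;
		tasks[i].killed = true;
		killed++;
	}
	return killed;
}
```

In the scenario from the reproducer, pid 1 sits in the same memcg as the a.out children; with the extra check it survives the group kill instead of triggering the "Attempted to kill init!" panic.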
     

02 Feb, 2019

2 commits

  • Syzbot instance running on upstream kernel found a use-after-free bug in
    oom_kill_process. On further inspection it seems like the process
    selected to be oom-killed has exited even before reaching
    read_lock(&tasklist_lock) in oom_kill_process(). More specifically the
    tsk->usage is 1 which is due to get_task_struct() in oom_evaluate_task()
    and the put_task_struct within for_each_thread() frees the tsk and
    for_each_thread() tries to access the tsk. The easiest fix is to do
    get/put across the for_each_thread() on the selected task.

    Now the next question is should we continue with the oom-kill as the
    previously selected task has exited? However before adding more
    complexity and heuristics, let's answer why we even look at the children
    of oom-kill selected task? The select_bad_process() has already selected
    the worst process in the system/memcg. Due to race, the selected
    process might not be the worst at the kill time but does that matter?
    The userspace can use the oom_score_adj interface to prefer children to
    be killed before the parent. I looked at the history but it seems like
    this is there before git history.

    Link: http://lkml.kernel.org/r/20190121215850.221745-1-shakeelb@google.com
    Reported-by: syzbot+7fbbfa368521945f0e3d@syzkaller.appspotmail.com
    Fixes: 6b0c81b3be11 ("mm, oom: reduce dependency on tasklist_lock")
    Signed-off-by: Shakeel Butt
    Reviewed-by: Roman Gushchin
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Arkadiusz reported that enabling memcg's group oom killing causes
    strange memcg statistics where there is no task in a memcg even though
    the number of tasks in that memcg is not 0. It turned out that there is
    a bug in wake_oom_reaper() which allows enqueuing the same task twice,
    which makes it impossible to decrease the number of tasks in that memcg
    due to a refcount leak.

    This bug has existed since the OOM reaper became invokable from the
    task_will_free_mem(current) path in out_of_memory() in Linux 4.7,
    T1@P1     |T2@P1     |T3@P1     |OOM reaper
    ----------+----------+----------+------------
                                     # Processing an OOM victim in a different memcg domain.
                          try_charge()
                            mem_cgroup_out_of_memory()
                              mutex_lock(&oom_lock)
               try_charge()
                 mem_cgroup_out_of_memory()
                   mutex_lock(&oom_lock)
    try_charge()
      mem_cgroup_out_of_memory()
        mutex_lock(&oom_lock)
                              out_of_memory()
                                oom_kill_process(P1)
                                  do_send_sig_info(SIGKILL, @P1)
                                  mark_oom_victim(T1@P1)
                                  wake_oom_reaper(T1@P1) # T1@P1 is enqueued.
                              mutex_unlock(&oom_lock)
                   out_of_memory()
                     mark_oom_victim(T2@P1)
                     wake_oom_reaper(T2@P1) # T2@P1 is enqueued.
                   mutex_unlock(&oom_lock)
        out_of_memory()
          mark_oom_victim(T1@P1)
          wake_oom_reaper(T1@P1) # T1@P1 is enqueued again due to oom_reaper_list == T2@P1 && T1@P1->oom_reaper_list == NULL.
        mutex_unlock(&oom_lock)
                                     # Completed processing an OOM victim in a different memcg domain.
                                     spin_lock(&oom_reaper_lock)
                                     # T1P1 is dequeued.
                                     spin_unlock(&oom_reaper_lock)

    but memcg's group oom killing made it easier to trigger this bug by
    calling wake_oom_reaper() on the same task from one out_of_memory()
    request.

    Fix this bug using the approach taken by commit 855b018325737f76 ("oom,
    oom_reaper: disable oom_reaper for oom_kill_allocating_task"). As a
    side effect, this patch also avoids enqueuing multiple threads sharing
    memory via the task_will_free_mem(current) path.
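
    The double enqueue can be reproduced in a small single-threaded model.
    The sketch below is not the kernel code: the structures, field names and
    the plain boolean flag are assumptions, and the kernel would need an
    atomic test-and-set rather than the unsynchronized check shown here. It
    contrasts a guard that only looks at the list head and the task's link
    pointer with a guard that records "already queued" on the shared mm.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for the kernel structures; all names are illustrative. */
struct mm { bool reap_queued; };
struct task { struct mm *mm; struct task *next; };

static struct task *reaper_list;	/* head of a singly linked queue */

/*
 * Old-style guard: refuses a task only if it is the list head or already
 * links to a successor. A task sitting at the tail (next == NULL) that is
 * no longer the head slips through and is enqueued a second time.
 */
static bool enqueue_buggy(struct task *t)
{
	if (t == reaper_list || t->next)
		return false;
	t->next = reaper_list;
	reaper_list = t;
	return true;
}

/*
 * Flag-based guard: the mm remembers that it has been queued, so neither
 * the same task nor another thread sharing that mm is enqueued again.
 */
static bool enqueue_once(struct task *t)
{
	if (t->mm->reap_queued)
		return false;
	t->mm->reap_queued = true;
	t->next = reaper_list;
	reaper_list = t;
	return true;
}
```

    Enqueuing T1, then T2, then T1 again succeeds three times with
    enqueue_buggy() but only once with enqueue_once() when both tasks share
    one mm.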

    Link: http://lkml.kernel.org/r/e865a044-2c10-9858-f4ef-254bc71d6cc2@i-love.sakura.ne.jp
    Link: http://lkml.kernel.org/r/5ee34fc6-1485-34f8-8790-903ddabaa809@i-love.sakura.ne.jp
    Fixes: af8e15cc85a25315 ("oom, oom_reaper: do not enqueue task if it is on the oom_reaper_list head")
    Signed-off-by: Tetsuo Handa
    Reported-by: Arkadiusz Miskiewicz
    Tested-by: Arkadiusz Miskiewicz
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Cc: Tejun Heo
    Cc: Aleksa Sarai
    Cc: Jay Kamat
    Cc: Johannes Weiner
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

29 Dec, 2018

4 commits

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
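
    The general pattern can be sketched as follows. The structure and
    function names here are hypothetical and are not claimed to match the
    kernel's definitions; the point is only that a callback taking one
    pointer keeps its signature when a field is added later.

```c
#include <stdbool.h>

/* Hypothetical parameter bundle; the kernel's real fields may differ. */
struct range_args {
	unsigned long start;	/* first address of the range */
	unsigned long end;	/* one past the last address */
	bool blockable;		/* whether the callback may sleep */
};

/*
 * The callback receives a single pointer. Adding another field to
 * struct range_args does not change this prototype or its callers.
 */
static unsigned long invalidate_start(const struct range_args *r)
{
	return r->end - r->start;	/* bytes covered by the range */
}
```

    A caller would fill the structure once, for example with designated
    initializers, and pass its address to both the start and end callbacks.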

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The current oom report doesn't display the victim's memcg context during
    a global OOM situation. While this information is not strictly needed, it
    can be really helpful in containerized environments to locate which
    container has lost a process. Now that we have a single line for the oom
    context, we can trivially add both the oom memcg (this can be either
    global_oom or a specific memcg which hits its hard limit) and task_memcg,
    which is the victim's memcg.

    Below is the single line output in the oom report after this patch.

    - global oom context information:

    oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,global_oom,task_memcg=,task=,pid=,uid=

    - memcg oom context information:

    oom-kill:constraint=,nodemask=,cpuset=,mems_allowed=,oom_memcg=,task_memcg=,task=,pid=,uid=

    [penguin-kernel@I-love.SAKURA.ne.jp: use pr_cont() in mem_cgroup_print_oom_context()]
    Link: http://lkml.kernel.org/r/201812190723.wBJ7NdkN032628@www262.sakura.ne.jp
    Link: http://lkml.kernel.org/r/1542799799-36184-2-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Andrea Arcangeli
    Cc: Tetsuo Handa
    Cc: Roman Gushchin
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     
  • OOM report contains several sections. The first one is the allocation
    context that has triggered the OOM. Then we have cpuset context followed
    by the stack trace of the OOM path. The third one is the OOM memory
    information, followed by the current memory state of all system tasks.
    Finally, we show the oom eligible tasks and the information about the
    chosen oom victim.

    One thing that makes parsing more awkward than necessary is that we do not
    have a single, easily parsable line about the oom context. This patch
    reorganizes the oom report into:

    1) who invoked oom and what was the allocation request

    [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

    2) OOM stack trace

    [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
    [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
    [ 515.906821] Call Trace:
    [ 515.908062] dump_stack+0x5a/0x73
    [ 515.909311] dump_header+0x55/0x28c
    [ 515.914260] oom_kill_process+0x2d8/0x300
    [ 515.916708] out_of_memory+0x145/0x4a0
    [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16
    [ 515.919157] __alloc_pages_nodemask+0x277/0x290
    [ 515.920367] filemap_fault+0x3d0/0x6c0
    [ 515.921529] ? filemap_map_pages+0x2b8/0x420
    [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4]
    [ 515.923884] __do_fault+0x20/0x80
    [ 515.925032] __handle_mm_fault+0xbc0/0xe80
    [ 515.926195] handle_mm_fault+0xfa/0x210
    [ 515.927357] __do_page_fault+0x233/0x4c0
    [ 515.928506] do_page_fault+0x32/0x140
    [ 515.929646] ? page_fault+0x8/0x30
    [ 515.930770] page_fault+0x1e/0x30

    3) OOM memory information

    [ 515.958093] Mem-Info:
    [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
    active_file:4402672 inactive_file:483963 isolated_file:1344
    unevictable:0 dirty:4886753 writeback:0 unstable:0
    slab_reclaimable:148442 slab_unreclaimable:18741
    mapped:1347 shmem:1347 pagetables:58669 bounce:0
    free:88663 free_pcp:0 free_cma:0
    ...

    4) current memory state of all system tasks

    [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal
    [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad
    [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd
    [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd
    [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd
    [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance
    [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd
    [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager
    [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned
    [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic
    [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic
    [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic
    [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic
    [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic

    5) oom context (constraints and the chosen victim).

    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

    An admin can easily get the full oom context in a single line, which
    makes parsing much easier.
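
    To illustrate why a single line helps, here is a minimal parsing sketch
    for the oom-kill line shown in section 5. It assumes the line starts at
    "oom-kill:" (any log timestamp already stripped) and that the extracted
    values contain no commas; the struct and helper names are made up for
    this example.

```c
#include <stdio.h>
#include <string.h>

struct oom_ctx {
	char constraint[32];
	char task[32];
	int pid;
	int uid;
};

/*
 * Copy the value of "key=" (up to the next comma or newline) into out.
 * The key must start the line or follow ',' or ':' so that, for example,
 * "pid" is not matched inside a longer field name.
 * Returns 0 on success, -1 if the key is missing or the value is too long.
 */
static int get_field(const char *line, const char *key, char *out, size_t outlen)
{
	size_t klen = strlen(key);
	const char *p = line;

	while ((p = strstr(p, key)) != NULL) {
		int at_start = (p == line) || p[-1] == ',' || p[-1] == ':';

		if (at_start && p[klen] == '=') {
			size_t n;

			p += klen + 1;
			n = strcspn(p, ",\n");
			if (n >= outlen)
				return -1;
			memcpy(out, p, n);
			out[n] = '\0';
			return 0;
		}
		p += klen;
	}
	return -1;
}

/* Parse the fields of interest; returns 0 on success, -1 on malformed input. */
static int parse_oom_kill(const char *line, struct oom_ctx *ctx)
{
	char num[16];

	if (strncmp(line, "oom-kill:", 9) != 0)
		return -1;
	if (get_field(line, "constraint", ctx->constraint, sizeof(ctx->constraint)) ||
	    get_field(line, "task", ctx->task, sizeof(ctx->task)))
		return -1;
	if (get_field(line, "pid", num, sizeof(num)) ||
	    sscanf(num, "%d", &ctx->pid) != 1)
		return -1;
	if (get_field(line, "uid", num, sizeof(num)) ||
	    sscanf(num, "%d", &ctx->uid) != 1)
		return -1;
	return 0;
}
```

    Fields such as nodemask or mems_allowed may themselves contain commas,
    so a complete parser would need field-specific handling; this sketch
    deliberately reads only constraint, task, pid and uid.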

    Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Roman Gushchin
    Cc: Tetsuo Handa
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     
  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here:
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomics, with
    prevention of potential store-to-read tearing as a bonus.
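
    A userspace analogue of this conversion, using C11 atomics instead of
    the kernel's atomic types, might look like the sketch below. The counter
    name and the set/add helpers are assumptions chosen for illustration.

```c
#include <stdatomic.h>

/* Userspace stand-in for the page counter; the name is illustrative. */
static atomic_long totalram_pages_counter;

/* Readers call an accessor instead of touching a plain global variable. */
static inline unsigned long totalram_pages(void)
{
	return (unsigned long)atomic_load(&totalram_pages_counter);
}

/* Updates are single atomic operations, so no separate lock is needed. */
static inline void totalram_pages_add(long count)
{
	atomic_fetch_add(&totalram_pages_counter, count);
}

static inline void totalram_pages_set(long val)
{
	atomic_store(&totalram_pages_counter, val);
}
```

    Because every load and store goes through an atomic operation, a reader
    cannot observe a half-written value, which is the tearing the commit
    message refers to.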

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     

24 Oct, 2018

1 commit

  • …iederm/user-namespace

    Pull siginfo updates from Eric Biederman:
    "I have been slowly sorting out siginfo and this is the culmination of
    that work.

    The primary result is in several ways the signal infrastructure has
    been made less error prone. The code has been updated so that manually
    specifying SEND_SIG_FORCED is never necessary. The conversion to the
    new siginfo sending functions is now complete, which makes it
    difficult to send a signal without filling in the proper siginfo
    fields.

    At the tail end of the patchset comes the optimization of decreasing
    the size of struct siginfo in the kernel from 128 bytes to about 48
    bytes on 64bit. The fundamental observation that enables this is by
    definition none of the known ways to use struct siginfo uses the extra
    bytes.

    This comes at the cost of a small user space observable difference.
    For the rare case of siginfo being injected into the kernel only what
    can be copied into kernel_siginfo is delivered to the destination, the
    rest of the bytes are set to 0. For cases where the signal and the
    si_code are known this is safe, because we know those bytes are not
    used. For cases where the signal and si_code combination is unknown
    the bits that won't fit into struct kernel_siginfo are tested to
    verify they are zero, and the send fails if they are not.

    I made an extensive search through userspace code and I could not find
    anything that would break because of the above change. If it turns out
    I did break something it will take just the revert of a single change
    to restore kernel_siginfo to the same size as userspace siginfo.

    Testing did reveal dependencies on preferring the signo passed to
    sigqueueinfo over si->signo, so I bit the bullet and added the
    complexity necessary to handle that case.

    Testing also revealed bad things can happen if a negative signal
    number is passed into the system calls. Something no sane application
    will do but something a malicious program or a fuzzer might do. So I
    have fixed the code that performs the bounds checks to ensure negative
    signal numbers are handled"
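
    The truncating copy with a zero check on the bytes that do not fit, as
    described in the pull message above, can be sketched roughly as follows.
    The sizes come from the text (128 bytes and about 48 bytes); the struct
    and function names and the layout_known parameter are assumptions, not
    the kernel's API.

```c
#include <stddef.h>
#include <string.h>

#define USER_INFO_SIZE   128	/* size seen by userspace, per the text */
#define KERNEL_INFO_SIZE 48	/* approximate in-kernel size, per the text */

struct user_info   { unsigned char bytes[USER_INFO_SIZE]; };
struct kernel_info { unsigned char bytes[KERNEL_INFO_SIZE]; };

/*
 * Copy the leading bytes into the smaller structure. When the layout is
 * not known (layout_known == 0), reject input whose trailing bytes are
 * not all zero. Returns 0 on success, -1 on rejection.
 */
static int copy_info(struct kernel_info *to, const struct user_info *from,
		     int layout_known)
{
	size_t i;

	if (!layout_known) {
		for (i = KERNEL_INFO_SIZE; i < USER_INFO_SIZE; i++)
			if (from->bytes[i] != 0)
				return -1;
	}
	memcpy(to->bytes, from->bytes, KERNEL_INFO_SIZE);
	return 0;
}
```

    For a known layout the trailing bytes are simply dropped, matching the
    statement that those bytes are never used in that case.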

    * 'siginfo-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace: (80 commits)
    signal: Guard against negative signal numbers in copy_siginfo_from_user32
    signal: Guard against negative signal numbers in copy_siginfo_from_user
    signal: In sigqueueinfo prefer sig not si_signo
    signal: Use a smaller struct siginfo in the kernel
    signal: Distinguish between kernel_siginfo and siginfo
    signal: Introduce copy_siginfo_from_user and use it's return value
    signal: Remove the need for __ARCH_SI_PREABLE_SIZE and SI_PAD_SIZE
    signal: Fail sigqueueinfo if si_signo != sig
    signal/sparc: Move EMT_TAGOVF into the generic siginfo.h
    signal/unicore32: Use force_sig_fault where appropriate
    signal/unicore32: Generate siginfo in ucs32_notify_die
    signal/unicore32: Use send_sig_fault where appropriate
    signal/arc: Use force_sig_fault where appropriate
    signal/arc: Push siginfo generation into unhandled_exception
    signal/ia64: Use force_sig_fault where appropriate
    signal/ia64: Use the force_sig(SIGSEGV,...) in ia64_rt_sigreturn
    signal/ia64: Use the generic force_sigsegv in setup_frame
    signal/arm/kvm: Use send_sig_mceerr
    signal/arm: Use send_sig_fault where appropriate
    signal/arm: Use force_sig_fault where appropriate
    ...

    Linus Torvalds
     

12 Sep, 2018

1 commit

  • Now that siginfo is never allocated for SIGKILL and SIGSTOP there is
    no difference between SEND_SIG_PRIV and SEND_SIG_FORCED for SIGKILL
    and SIGSTOP. This makes SEND_SIG_FORCED unnecessary and redundant in
    the presence of SIGKILL and SIGSTOP. Therefore change users of
    SEND_SIG_FORCED that are sending SIGKILL or SIGSTOP to use
    SEND_SIG_PRIV instead.

    This removes the last users of SEND_SIG_FORCED.

    Reviewed-by: Thomas Gleixner
    Signed-off-by: "Eric W. Biederman"

    Eric W. Biederman
     

05 Sep, 2018

2 commits

  • Commit 93065ac753e4 ("mm, oom: distinguish blockable mode for mmu
    notifiers") has added an ability to skip over vmas with blockable mmu
    notifiers. This however didn't call tlb_finish_mmu as it should.

    As a result, inc_tlb_flush_pending has been called without its pairing
    dec_tlb_flush_pending, and all callers of mm_tlb_flush_pending would
    flush even though this is not really needed. This alone is not harmful,
    and it seems there shouldn't be any such callers for oom victims at all,
    but there is no real reason to skip tlb_finish_mmu on early skip either,
    so call it.
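
    The pairing issue can be modelled with a plain counter. In the sketch
    below gather_begin() and gather_finish() are stand-ins for the begin and
    finish steps that increment and decrement the pending count; the structs
    and the loop are assumptions for illustration and do not reproduce the
    reaper code.

```c
/* Illustrative stand-ins; names and fields are assumptions. */
struct mm_sketch  { int tlb_flush_pending; };
struct vma_sketch { int skip; };

static void gather_begin(struct mm_sketch *mm)  { mm->tlb_flush_pending++; }
static void gather_finish(struct mm_sketch *mm) { mm->tlb_flush_pending--; }

/* Walk the vmas; every begin is matched by a finish, even on early skip. */
static int reap(struct mm_sketch *mm, const struct vma_sketch *vmas, int n)
{
	int i, reaped = 0;

	for (i = 0; i < n; i++) {
		gather_begin(mm);
		if (vmas[i].skip) {
			gather_finish(mm);	/* without this the count leaks */
			continue;
		}
		reaped++;
		gather_finish(mm);
	}
	return reaped;
}
```

    With the finish call on the skip path, the pending count returns to zero
    after the walk regardless of how many vmas were skipped.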

    [mhocko@suse.com: new changelog]
    Link: http://lkml.kernel.org/r/b752d1d5-81ad-7a35-2394-7870641be51c@i-love.sakura.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • When the memcg OOM killer runs out of killable tasks, it currently
    prints a WARN with no further OOM context. This has caused some user
    confusion.

    Warnings indicate a kernel problem. In a reported case, however, the
    situation was triggered by a nonsensical memcg configuration (hard limit
    set to 0). But without any VM context this wasn't obvious from the
    report, and it took some back and forth on the mailing list to identify
    what is actually a trivial issue.

    Handle this OOM condition like we handle it in the global OOM killer:
    dump the full OOM context and tell the user we ran out of tasks.

    This way the user can identify misconfigurations easily by themselves
    and rectify the problem - without having to go through the hassle of
    running into an obscure but unsettling warning, finding the appropriate
    kernel mailing list and waiting for a kernel developer to remote-analyze
    that the memcg configuration caused this.

    If users cannot make sense of why the OOM killer was triggered or why it
    failed, they will still report it to the mailing list, we know that from
    experience. So in case there is an actual kernel bug causing this,
    kernel developers will very likely hear about it.

    Link: http://lkml.kernel.org/r/20180821160406.22578-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     

23 Aug, 2018

1 commit

  • Merge more updates from Andrew Morton:

    - the rest of MM

    - procfs updates

    - various misc things

    - more y2038 fixes

    - get_maintainer updates

    - lib/ updates

    - checkpatch updates

    - various epoll updates

    - autofs updates

    - hfsplus

    - some reiserfs work

    - fatfs updates

    - signal.c cleanups

    - ipc/ updates

    * emailed patches from Andrew Morton: (166 commits)
    ipc/util.c: update return value of ipc_getref from int to bool
    ipc/util.c: further variable name cleanups
    ipc: simplify ipc initialization
    ipc: get rid of ids->tables_initialized hack
    lib/rhashtable: guarantee initial hashtable allocation
    lib/rhashtable: simplify bucket_table_alloc()
    ipc: drop ipc_lock()
    ipc/util.c: correct comment in ipc_obtain_object_check
    ipc: rename ipcctl_pre_down_nolock()
    ipc/util.c: use ipc_rcu_putref() for failues in ipc_addid()
    ipc: reorganize initialization of kern_ipc_perm.seq
    ipc: compute kern_ipc_perm.id under the ipc lock
    init/Kconfig: remove EXPERT from CHECKPOINT_RESTORE
    fs/sysv/inode.c: use ktime_get_real_seconds() for superblock stamp
    adfs: use timespec64 for time conversion
    kernel/sysctl.c: fix typos in comments
    drivers/rapidio/devices/rio_mport_cdev.c: remove redundant pointer md
    fork: don't copy inconsistent signal handler state to child
    signal: make get_signal() return bool
    signal: make sigkill_pending() return bool
    ...

    Linus Torvalds