06 Nov, 2015

40 commits

  • get_mergeable_page() can only return NULL (also in case of errors) or the
    pinned mergeable page; it never returns an error value distinct from
    NULL. This optimizes away the unnecessary error check in the caller (see
    the sketch after this entry).

    Add a return after the "out:" label in the callee to make it more
    readable.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
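
    A sketch of the resulting caller-side simplification (assuming a call
    site like the one in unstable_tree_search_insert(); simplified, not the
    exact diff):

    struct page *tree_page;

    tree_page = get_mergeable_page(tree_rmap_item);
    if (!tree_page)         /* was: if (IS_ERR_OR_NULL(tree_page)) */
            return NULL;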
     
  • Doing the VM_MERGEABLE check after the page == kpage check won't provide
    any meaningful benefit. The !vma->anon_vma check of find_mergeable_vma is
    the only superfluous bit in using find_mergeable_vma because the !PageAnon
    check of try_to_merge_one_page() implicitly checks for that, but it still
    looks cleaner to share the same find_mergeable_vma().

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This just uses the helper function to clean up the assumption about the
    hlist_node internals.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • The stable_nodes can become stale at any time if the underlying pages get
    freed. The stable_node gets collected and removed from the stable rbtree
    if that is detected during the rbtree lookups.

    Don't fail the lookup if running into stale stable_nodes; just restart
    the lookup after collecting the stale stable_nodes. Otherwise the CPU
    time spent in the preparation stage is wasted and the lookup must be
    repeated at the next loop, potentially failing a second time on a second
    stale stable_node.

    If we don't prune aggressively we delay the merging of the unstable node
    candidates and at the same time we delay the freeing of the stale
    stable_nodes. Keeping stale stable_nodes around wastes memory and it
    can't provide any benefit.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add a cond_resched() to the KSM rmap walk; while at it, add it to the
    file and anon rmap walks too.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Hugh Dickins
    Cc: Petr Holasek
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Before the previous patch ("memcg: unify slab and other kmem pages
    charging"), __mem_cgroup_from_kmem had to handle two types of kmem - slab
    pages and pages allocated with alloc_kmem_pages - because only the latter
    stored the owning memcg in the page struct. Now we can unify it, and
    since that makes the function tiny, we can fold it into
    mem_cgroup_from_kmem.

    [hughd@google.com: move mem_cgroup_from_kmem into list_lru.c]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • We have memcg_kmem_charge and memcg_kmem_uncharge methods for charging and
    uncharging kmem pages to memcg, but currently they are not used for
    charging slab pages (i.e. they are only used for charging pages allocated
    with alloc_kmem_pages). The only reason why the slab subsystem uses
    special helpers, memcg_charge_slab and memcg_uncharge_slab, is that it
    needs to charge to the memcg of kmem cache while memcg_charge_kmem charges
    to the memcg that the current task belongs to.

    To remove this diversity, this patch adds an extra argument to
    __memcg_kmem_charge that can be a pointer to a memcg or NULL. If it is
    not NULL, the function tries to charge to the memcg it points to,
    otherwise it charges to the current context. Next, it makes the slab
    subsystem use this function to charge slab pages.

    Since memcg_charge_kmem and memcg_uncharge_kmem helpers are now used only
    in __memcg_kmem_charge and __memcg_kmem_uncharge, they are inlined. Since
    __memcg_kmem_charge stores a pointer to the memcg in the page struct, we
    don't need memcg_uncharge_slab anymore and can use free_kmem_pages.
    Besides, one can now detect which memcg a slab page belongs to by reading
    /proc/kpagecgroup.

    Note, this patch switches slab to charge-after-alloc design. Since this
    design is already used for all other memcg charges, it should not make any
    difference.

    [hannes@cmpxchg.org: better to have an outer function than a magic parameter for the memcg lookup]
    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Signed-off-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Charging kmem pages proceeds in two steps. First, we try to charge the
    allocation size to the memcg the current task belongs to, then we allocate
    a page and "commit" the charge storing the pointer to the memcg in the
    page struct.

    Such a design looks overcomplicated, because there is not much sense in
    trying to charge the allocation before actually allocating a page: we
    won't be able to consume much memory over the limit even if we charge
    after doing the actual allocation; besides, we already charge user pages
    post factum, so being pedantic with kmem pages just looks pointless.

    So this patch simplifies the design by merging the "charge" and the
    "commit" steps into the same function, which takes the allocated page
    (the resulting flow is sketched after this entry).

    Also, rename the charge and uncharge methods to memcg_kmem_charge and
    memcg_kmem_uncharge and make the charge method return error code instead
    of bool to conform to mem_cgroup_try_charge.

    Signed-off-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
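
    A sketch of the resulting charge-after-alloc flow (assumed signatures:
    memcg_kmem_charge(page, gfp, order) returning 0 or -errno, with
    memcg_kmem_uncharge(page, order) as its counterpart; simplified, not the
    exact kernel source):

    struct page *page;

    page = alloc_pages(gfp_mask, order);
    if (!page)
            return NULL;

    /* Charge only after the allocation succeeded; on failure, undo it. */
    if (memcg_kmem_charge(page, gfp_mask, order)) {
            __free_pages(page, order);
            return NULL;
    }
    return page;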
     
  • If kernelcore was not specified, or the kernelcore size is zero
    (required_movablecore >= totalpages), or the kernelcore size is larger
    than totalpages, there is no ZONE_MOVABLE. We should fill the zone with
    both kernel memory and movable memory.

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Tang Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • This function is called in very hot paths and merely does a few loads
    for a validity check. Let's inline it so that we can save the function
    call overhead.

    (akpm: this is cosmetic - the compiler already inlines vmacache_valid_mm())

    Signed-off-by: Davidlohr Bueso
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • Change HIGHMEM_ZONE to be the same as the DMA_ZONE macro.

    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yalin wang
     
  • Srinivas Kandagatla reported bad page messages when trying to remove the
    bottom 2MB on an ARM based IFC6410 board

    BUG: Bad page state in process swapper pfn:fffa8
    page:ef7fb500 count:0 mapcount:0 mapping: (null) index:0x0
    flags: 0x96640253(locked|error|dirty|active|arch_1|reclaim|mlocked)
    page dumped because: PAGE_FLAGS_CHECK_AT_FREE flag(s) set
    bad because of flags:
    flags: 0x200041(locked|active|mlocked)
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 3.19.0-rc3-00007-g412f9ba-dirty #816
    Hardware name: Qualcomm (Flattened Device Tree)
    unwind_backtrace
    show_stack
    dump_stack
    bad_page
    free_pages_prepare
    free_hot_cold_page
    __free_pages
    free_highmem_page
    mem_init
    start_kernel
    Disabling lock debugging due to kernel taint

    Removing the lower 2MB left the start of the lowmem zone no longer page
    block aligned. IFC6410 uses CONFIG_FLATMEM, where alloc_node_mem_map
    allocates memory for the mem_map. alloc_node_mem_map will offset for
    unaligned nodes with the assumption the pfn/page translation functions
    will account for the offset. The functions for CONFIG_FLATMEM do not
    offset however, resulting in overrunning the memmap array. Just use the
    allocated memmap without any offset when running with CONFIG_FLATMEM to
    avoid the overrun.

    Signed-off-by: Laura Abbott
    Signed-off-by: Laura Abbott
    Reported-by: Srinivas Kandagatla
    Tested-by: Srinivas Kandagatla
    Acked-by: Vlastimil Babka
    Tested-by: Bjorn Andersson
    Cc: Santosh Shilimkar
    Cc: Russell King
    Cc: Kevin Hilman
    Cc: Arnd Bergman
    Cc: Stephen Boyd
    Cc: Andy Gross
    Cc: Mel Gorman
    Cc: Steven Rostedt
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laura Abbott
     
  • With x86_64 (config http://ozlabs.org/~akpm/config-akpm2.txt) and old gcc
    (4.4.4), drivers/base/node.c:node_read_meminfo() is using 2344 bytes of
    stack. Uninlining node_page_state() reduces this to 440 bytes.

    The stack consumption issue is fixed by newer gcc (4.8.4) however with
    that compiler this patch reduces the node.o text size from 7314 bytes to
    4578.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Make the argument order of __install_special_mapping() match its caller,
    so the caller can pass its register arguments straight through to the
    callee untouched.

    On most architectures at least the first five arguments are passed in
    registers, so this change will have an effect on most architectures.

    With -O2, __install_special_mapping() may be inlined on most
    architectures, but with -Os it should not be. So this change can give a
    little better performance for -Os, at least.

    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • (1) For !CONFIG_BUG cases, the bug call is a no-op, so we couldn't
    care less and the change is ok.

    (2) ppc and mips, which HAVE_ARCH_BUG_ON, do not rely on branch
    predictions as it seems to be pointless[1] and thus callers should not
    be trying to push an optimization in the first place.

    (3) For CONFIG_BUG and !HAVE_ARCH_BUG_ON cases, BUG_ON() already contains
    an unlikely compiler hint.

    Hence, we can drop the unlikely annotation around BUG_ON() conditions
    (see the sketch after this entry).

    [1] http://lkml.iu.edu/hypermail/linux/kernel/1101.3/02289.html

    Signed-off-by: Geliang Tang
    Acked-by: Davidlohr Bueso
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geliang Tang
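
    For reference, a sketch of roughly how the generic (!HAVE_ARCH_BUG_ON)
    definition reads; the condition is already wrapped in unlikely(), so a
    caller writing BUG_ON(unlikely(cond)) adds nothing:

    #define BUG_ON(condition) \
            do { if (unlikely(condition)) BUG(); } while (0)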
     
  • When fget() fails we can return -EBADF directly.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • It is still a little better to remove it, although it would be optimized
    away by "-O2" anyway.

    Signed-off-by: Chen Gang
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • This came up when implementing HIGHMEM/PAE40 for ARC. The kmap() /
    kmap_atomic() generated code seemed needlessly bloated due to the way the
    PageHighMem() macro is implemented. It derives the exact zone for the
    page and then does pointer subtraction with the first zone to infer the
    zone_type. The pointer arithmetic in turn generates the code bloat.

    PageHighMem(page)
    is_highmem(page_zone(page))
    zone_off = (char *)zone - (char *)zone->zone_pgdat->node_zones

    Instead, use is_highmem_idx() to work on the zone_type available in the
    page flags (a macro-level sketch follows this entry).

    ----- Before -----
    80756348: mov_s r13,r0
    8075634a: ld_s r2,[r13,0]
    8075634c: lsr_s r2,r2,30
    8075634e: mpy r2,r2,0x2a4
    80756352: add_s r2,r2,0x80aef880
    80756358: ld_s r3,[r2,28]
    8075635a: sub_s r2,r2,r3
    8075635c: breq r2,0x2a4,80756378
    80756364: breq r2,0x548,80756378

    ----- After -----
    80756330: mov_s r13,r0
    80756332: ld_s r2,[r13,0]
    80756334: lsr_s r2,r2,30
    80756336: sub_s r2,r2,1
    80756338: brlo r2,2,80756348

    For x86 defconfig build (32 bit only) it saves around 900 bytes.
    For ARC defconfig with HIGHMEM, it saved around 2K bytes.

    ---->8-------
    ./scripts/bloat-o-meter x86/vmlinux-defconfig-pre x86/vmlinux-defconfig-post
    add/remove: 0/0 grow/shrink: 0/36 up/down: 0/-934 (-934)
    function old new delta
    saveable_page 162 154 -8
    saveable_highmem_page 154 146 -8
    skb_gro_reset_offset 147 131 -16
    ...
    ...
    __change_page_attr_set_clr 1715 1678 -37
    setup_data_read 434 394 -40
    mon_bin_event 1967 1927 -40
    swsusp_save 1148 1105 -43
    _set_pages_array 549 493 -56
    ---->8-------

    e.g. For ARC kmap()

    Signed-off-by: Vineet Gupta
    Acked-by: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Hugh Dickins
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Jennifer Herbert
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
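
    A macro-level sketch of the change (the wrapper names below are
    illustrative, not the kernel's own):

    /* before: derive the zone pointer, then do pointer arithmetic on it */
    static inline int page_is_highmem_old(const struct page *page)
    {
            return is_highmem(page_zone(page));
    }

    /* after: use the zone index already encoded in page->flags */
    static inline int page_is_highmem_new(const struct page *page)
    {
            return is_highmem_idx(page_zonenum(page));
    }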
     
  • Both "child->mm == mm" and "p->mm != mm" checks in oom_kill_process() are
    wrong. task->mm can be NULL if the task is the exited group leader. This
    means in particular that "kill sharing same memory" loop can miss a
    process with a zombie leader which uses the same ->mm.

    Note: the process_shares_mm(child, p->mm) check is still not 100%
    correct, since p->mm can be NULL too. This is minor, but probably
    deserves a fix or a comment anyway (the helper is sketched after this
    entry).

    [akpm@linux-foundation.org: document process_shares_mm() a bit]
    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
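
    A sketch of the mm-sharing check this implies (assumed form): look at
    every thread of the process rather than only the group leader, because
    the leader's ->mm may already be NULL:

    static bool process_shares_mm(struct task_struct *p,
                                  struct mm_struct *mm)
    {
            struct task_struct *t;

            for_each_thread(p, t) {
                    struct mm_struct *t_mm = READ_ONCE(t->mm);

                    /* All live threads share one mm, so the first thread
                     * that still has an mm decides the answer. */
                    if (t_mm)
                            return t_mm == mm;
            }
            return false;
    }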
     
  • Purely cosmetic, but the complex "if" condition looks annoying to me.
    Especially because it is not consistent with OOM_SCORE_ADJ_MIN check
    which adds another if/continue.

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Hillf Danton
    Cc: Tetsuo Handa
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The fatal_signal_pending() check was added to suppress an unnecessary
    "sharing same memory" message, but it can't fully help anyway because it
    can be a false negative: SIGKILL can already have been dequeued.

    And worse, it can be false-positive due to exec or coredump. exec is
    mostly fine, but coredump is not. It is possible that the group leader
    has the pending SIGKILL because its sub-thread originated the coredump, in
    this case we must not skip this process.

    We could probably add the additional ->group_exit_task check but this
    patch just removes the wrong check along with pr_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: David Rientjes
    Acked-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Kyle Walker
    Cc: Stanislav Kozina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cosmetic, but expand_upwards() and expand_downwards() overuse
    vma->vm_mm; a local variable makes sense, imho.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "mm->locked_vm += grow" and vm_stat_account() in acct_stack_growth() are
    not safe; multiple threads using the same ->mm can do this at the same
    time, trying to expand different vmas under down_read(mmap_sem). This
    means that one of the "locked_vm += grow" changes can be lost and we can
    miss munlock_vma_pages_all() later (the lost-update race is sketched
    after this entry).

    Move this code into the caller(s) under mm->page_table_lock. All other
    updates to ->locked_vm hold mmap_sem for writing.

    Signed-off-by: Oleg Nesterov
    Acked-by: Hugh Dickins
    Cc: Andrey Konovalov
    Cc: Davidlohr Bueso
    Cc: "Kirill A. Shutemov"
    Cc: Sasha Levin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
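
    The lost update above is the classic non-atomic read-modify-write race.
    A standalone userspace sketch (plain C with pthreads, not kernel code)
    showing increments getting lost when two threads update a shared counter
    without exclusive locking:

    #include <pthread.h>
    #include <stdio.h>

    static long locked_vm;                  /* stand-in for mm->locked_vm */

    static void *grow(void *arg)
    {
            int i;

            for (i = 0; i < 1000000; i++)
                    locked_vm += 1;         /* non-atomic: load, add, store */
            return NULL;
    }

    int main(void)
    {
            pthread_t a, b;

            pthread_create(&a, NULL, grow, NULL);
            pthread_create(&b, NULL, grow, NULL);
            pthread_join(a, NULL);
            pthread_join(b, NULL);

            /* Usually prints less than 2000000: some "+= 1" updates were
             * lost, just like concurrent "locked_vm += grow" done under
             * down_read(mmap_sem). */
            printf("locked_vm = %ld (expected 2000000)\n", locked_vm);
            return 0;
    }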
     
  • If the user sets "movablecore=xx" to a large number, the corepages
    calculation will overflow (wrap around). Fix the problem. A sketch of
    the arithmetic follows this entry.

    Signed-off-by: Xishi Qiu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Tang Chen
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Tang Chen
    Cc: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
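
    A standalone sketch of the arithmetic (illustrative numbers and variable
    names, not the kernel source): subtracting an oversized movablecore
    request from totalpages wraps around, so the request must be clamped
    first:

    #include <stdio.h>

    int main(void)
    {
            unsigned long totalpages = 1UL << 20;           /* e.g. 4GB in 4K pages */
            unsigned long required_movablecore = 1UL << 22; /* user asked for more */
            unsigned long corepages;

            corepages = totalpages - required_movablecore;  /* unsigned wrap-around */
            printf("unclamped: corepages = %lu\n", corepages);

            /* The fix: never let the request exceed what actually exists. */
            if (required_movablecore > totalpages)
                    required_movablecore = totalpages;
            corepages = totalpages - required_movablecore;
            printf("clamped:   corepages = %lu\n", corepages);
            return 0;
    }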
     
  • In zone_reclaimable_pages(), `nr' is returned by a function which is
    declared as returning "unsigned long", so declare it such. Negative
    values are meaningless here.

    In zone_pagecache_reclaimable() we should also declare `delta' and
    `nr_pagecache_reclaimable' as being unsigned longs because they're used to
    store the values returned by zone_page_state() and
    zone_unmapped_file_pages() which also happen to return unsigned integers.

    [akpm@linux-foundation.org: make zone_pagecache_reclaimable() return ulong rather than long]
    Signed-off-by: Alexandru Moise
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandru Moise
     
  • The oom killer takes task_lock() in a couple of places solely to protect
    printing the task's comm.

    A process's comm, including current's comm, may change due to
    /proc/pid/comm or PR_SET_NAME.

    The comm will always be NULL-terminated, so the worst race scenario would
    only be during update. We can tolerate a comm being printed that is in
    the middle of an update to avoid taking the lock.

    Other locations in the kernel have already dropped task_lock() when
    printing comm, so this is consistent.

    Signed-off-by: David Rientjes
    Suggested-by: Oleg Nesterov
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Sergey Senozhatsky
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Compaction returns prematurely with COMPACT_PARTIAL when contended or has
    fatal signal pending. This is ok for the callers, but might be misleading
    in the traces, as the usual reason to return COMPACT_PARTIAL is that we
    think the allocation should succeed. After this patch we distinguish the
    premature ending condition in the mm_compaction_finished and
    mm_compaction_end tracepoints.

    The contended status covers the following reasons:
    - lock contention or need_resched() detected in async compaction
    - fatal signal pending
    - too many pages isolated in the zone (only for async compaction)
    Further distinguishing the exact reason seems unnecessary for now.

    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Some compaction tracepoints use zone->name to print which zone is being
    compacted. This works for in-kernel printing, but not userspace trace
    printing of raw captured trace such as via trace-cmd report.

    This patch uses zone_idx() instead of zone->name as the raw value, and
    when printing, converts the zone_type to string using the appropriate EM()
    macros and some ugly tricks to overcome the problem that half the values
    depend on CONFIG_ options and one does not simply use #ifdef inside of
    #define.

    trace-cmd output before:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=partial

    after:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=Normal order=9 ret=partial

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Valentin Rothberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Some compaction tracepoints convert the integer return values to strings
    using the compaction_status_string array. This works for in-kernel
    printing, but not userspace trace printing of raw captured trace such as
    via trace-cmd report.

    This patch converts the private array to the appropriate tracepoint
    macros that result in proper userspace support (the pattern is sketched
    after this entry).

    trace-cmd output before:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=

    after:
    transhuge-stres-4235 [000] 453.149280: mm_compaction_finished: node=0
    zone=ffffffff81815d7a order=9 ret=partial

    Signed-off-by: Vlastimil Babka
    Reviewed-by: Steven Rostedt
    Cc: Joonsoo Kim
    Cc: Ingo Molnar
    Cc: Mel Gorman
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
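
    An abbreviated sketch of the usual tracepoint pattern for this (trimmed;
    the real header lists every compaction status value): define the
    value/name pairs once, export the enum values with TRACE_DEFINE_ENUM()
    so trace-cmd knows them, and print with __print_symbolic():

    #define COMPACTION_STATUS                       \
            EM(COMPACT_SKIPPED,     "skipped")      \
            EM(COMPACT_CONTINUE,    "continue")     \
            EM(COMPACT_PARTIAL,     "partial")      \
            EMe(COMPACT_COMPLETE,   "complete")

    /* First expansion: make the numeric values known to the tracing core. */
    #undef EM
    #undef EMe
    #define EM(a, b)        TRACE_DEFINE_ENUM(a);
    #define EMe(a, b)       TRACE_DEFINE_ENUM(a);

    COMPACTION_STATUS

    /* Second expansion: {value, "name"} pairs for printing. */
    #undef EM
    #undef EMe
    #define EM(a, b)        {a, b},
    #define EMe(a, b)       {a, b}

    /* ... and in TP_printk():
     *         __print_symbolic(__entry->status, COMPACTION_STATUS)
     */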
     
  • oom_kill_process() sends SIGKILL to other thread groups sharing victim's
    mm. But printing

    "Kill process %d (%s) sharing same memory\n"

    lines makes no sense if they already have pending SIGKILL. This patch
    reduces the "Kill process" lines by printing that line with info level
    only if SIGKILL is not pending.

    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • At the for_each_process() loop in oom_kill_process(), we are comparing
    address of OOM victim's mm without holding a reference to that mm. If
    there are a lot of processes to compare or a lot of "Kill process %d (%s)
    sharing same memory" messages to print, for_each_process() loop could take
    very long time.

    It is possible that meanwhile the OOM victim exits and releases its mm,
    and then mm is allocated with the same address and assigned to some
    unrelated process. When we hit such a race, the unrelated process will
    be killed by error. To make sure that the OOM victim's mm does not go
    away until the for_each_process() loop finishes, get a reference on the
    OOM victim's mm before calling task_unlock(victim) (the pattern is
    sketched after this entry).

    [oleg@redhat.com: several fixes]
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
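
    A simplified sketch of the pattern (not the exact patch): pin the
    victim's mm with a reference before dropping task_lock(), so the mm
    cannot be freed and its address recycled while the loop runs; the
    sharing test can be a helper like the one sketched a few entries above
    or a raw p->mm comparison:

    mm = victim->mm;
    atomic_inc(&mm->mm_count);      /* pin; paired with mmdrop() below */
    task_unlock(victim);

    rcu_read_lock();
    for_each_process(p) {
            if (!process_shares_mm(p, mm))
                    continue;
            /* ... print the message and send SIGKILL to the sharer ... */
    }
    rcu_read_unlock();

    mmdrop(mm);                     /* release the pin */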
     
  • It was confirmed that a local unprivileged user can consume all memory
    reserves and hang up the system by exploiting the time lag between the
    OOM killer setting TIF_MEMDIE on an OOM victim and sending SIGKILL to
    that victim, because the printk() inside the for_each_process() loop in
    oom_kill_process() can consume many seconds when there are many thread
    groups sharing the same memory.

    Before starting oom-depleter process:

    Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
    Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB

    As of invoking the OOM killer:

    Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
    Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB

    Between the thread group leader got TIF_MEMDIE and receives SIGKILL:

    Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB

    The oom-depleter's thread group leader which got TIF_MEMDIE started
    memset() in user space after the OOM killer set TIF_MEMDIE, and it was
    free to abuse ALLOC_NO_WATERMARKS by TIF_MEMDIE for memset() in user space
    until SIGKILL is delivered. If SIGKILL is delivered before TIF_MEMDIE is
    set, the oom-depleter can terminate without touching memory reserves.

    Although the possibility of hitting this time lag is very small for 3.19
    and earlier kernels because TIF_MEMDIE is set immediately before sending
    SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
    step between and allow memory allocations which are not needed for
    terminating the OOM victim.

    Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock")
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: [4.0+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Make mem_cgroup_inactive_anon_is_low return bool due to this particular
    function only using either one or zero as its return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • Make inactive_anon/file_is_low return bool due to these particular
    functions only using either one or zero as their return value.

    No functional change.

    Signed-off-by: Yaowei Bai
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
     
  • max_ptes_swap specifies how many pages can be brought in from swap when
    collapsing a group of pages into a transparent huge page.

    /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_swap

    A higher value can cause excessive swap IO and waste memory. A lower
    value can prevent THPs from being collapsed, resulting in fewer pages
    being collapsed into THPs, and lower memory access performance.

    Signed-off-by: Ebru Akagunduz
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Oleg Nesterov
    Cc: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ebru Akagunduz
     
  • Since commit 6539cc053869 ("mm: memcontrol: fold mem_cgroup_do_charge()"),
    the order to pass to mem_cgroup_oom() is calculated by passing the
    number of pages to get_order() instead of the expected size in bytes.
    AFAICT, it only affects the value displayed in the oom warning message.
    This patch fixes this (a worked example of the arithmetic follows this
    entry).

    Michal said:

    : We haven't noticed that just because the OOM is enabled only for page
    : faults of order-0 (single page) and get_order work just fine. Thanks for
    : noticing this. If we ever start triggering OOM on different orders this
    : would be broken.

    Signed-off-by: Jerome Marchand
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
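
    A standalone worked example of the unit mix-up (this get_order() is a
    simplified userspace stand-in for the kernel helper): passing a page
    count where a byte size is expected yields order 0 for anything up to a
    page, which is why only the printed order was wrong for the usual
    single-page faults:

    #include <stdio.h>

    #define PAGE_SHIFT 12
    #define PAGE_SIZE  (1UL << PAGE_SHIFT)

    /* Smallest order such that 2^order pages hold 'size' bytes. */
    static int get_order(unsigned long size)
    {
            unsigned long pages = (size + PAGE_SIZE - 1) >> PAGE_SHIFT;
            int order = 0;

            while ((1UL << order) < pages)
                    order++;
            return order;
    }

    int main(void)
    {
            unsigned long nr_pages = 8;

            /* Bug: a page count passed where a byte size is expected. */
            printf("get_order(nr_pages)             = %d\n",
                   get_order(nr_pages));                    /* 0 */
            /* Intended: the size in bytes. */
            printf("get_order(nr_pages * PAGE_SIZE) = %d\n",
                   get_order(nr_pages * PAGE_SIZE));        /* 3 */
            /* For nr_pages == 1 both give 0, matching the note above. */
            return 0;
    }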
     
  • Currently the kernel prints out the result of every single unpoison
    event, which is not necessary because unpoison is purely a testing
    feature and testers can get little or no information from lots of lines
    of unpoison log storm. So this patch ratelimits the printk in
    unpoison_memory().

    This patch introduces a file-local ratelimit_state, which adds 64 bytes
    to memory-failure.o. If we applied pr_info_ratelimited() at each of the
    8 callsites instead, 256 bytes would be added, so it's a win. (The
    ratelimit pattern is sketched after this entry.)

    Signed-off-by: Naoya Horiguchi
    Cc: Andi Kleen
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
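
    A minimal sketch of the pattern described (names are illustrative; the
    patch's actual helper macro may differ): one file-local ratelimit_state
    shared by the unpoison messages, checked before each pr_info():

    #include <linux/ratelimit.h>
    #include <linux/printk.h>

    /* At most DEFAULT_RATELIMIT_BURST messages per
     * DEFAULT_RATELIMIT_INTERVAL; the rest are dropped. */
    static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
                                  DEFAULT_RATELIMIT_BURST);

    #define unpoison_pr_info(fmt, pfn)              \
    ({                                              \
            if (__ratelimit(&unpoison_rs))          \
                    pr_info(fmt, pfn);              \
    })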
     
  • filemap_fdatawait() is a function to wait for on-going writeback to
    complete but also consume and clear error status of the mapping set during
    writeback.

    The latter functionality is critical for applications to detect writeback
    error with system calls like fsync(2)/fdatasync(2).

    However filemap_fdatawait() is also used by sync(2) or FIFREEZE ioctl,
    which don't check error status of individual mappings.

    As a result, fsync() may not be able to detect writeback error if events
    happen in the following order:

    Application                          System admin
    ----------------------------------------------------------
    write data on page cache
                                         Run sync command
                                         writeback completes with error
                                         filemap_fdatawait() clears error
    fsync returns success
    (but the data is not on disk)

    This patch adds filemap_fdatawait_keep_errors() for call sites where
    writeback errors are not handled, so that they don't clear the error
    status (a sketch of such a helper follows this entry).

    Signed-off-by: Jun'ichi Nomura
    Acked-by: Andi Kleen
    Reviewed-by: Tejun Heo
    Cc: Fengguang Wu
    Cc: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Junichi Nomura
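
    A sketch of what such a helper can look like (the body is an assumption,
    not the exact kernel source; it presumes an internal
    __filemap_fdatawait_range() that only waits): wait for writeback just
    like filemap_fdatawait(), but skip the step that reports and clears
    AS_EIO/AS_ENOSPC:

    void filemap_fdatawait_keep_errors(struct address_space *mapping)
    {
            loff_t i_size = i_size_read(mapping->host);

            if (i_size == 0)
                    return;

            /* Wait for writeback on the whole file ... */
            __filemap_fdatawait_range(mapping, 0, i_size - 1);
            /* ... but do NOT call filemap_check_errors(), so the
             * AS_EIO/AS_ENOSPC flags survive for a later fsync(). */
    }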
     
  • Introduce is_via_compact_memory() helper indicating compacting via
    /proc/sys/vm/compact_memory to improve readability.

    To catch this situation in __compaction_suitable, use order as parameter
    directly instead of using struct compact_control.

    This patch has no functional changes (the helper is sketched after this
    entry).

    Signed-off-by: Yaowei Bai
    Cc: Mel Gorman
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
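
    A sketch of the helper (assumed form, relying on the convention that the
    /proc/sys/vm/compact_memory path requests compaction with order == -1):

    static inline bool is_via_compact_memory(int order)
    {
            /* The sysctl trigger asks to compact everything; it is
             * encoded as the otherwise impossible order of -1. */
            return order == -1;
    }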
     
  • Delete the unnecessary if/return to let inactive_anon_is_low_global()
    return the comparison directly (see the sketch after this entry).

    No functional changes.

    Signed-off-by: Yaowei Bai
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yaowei Bai
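
    A sketch of the simplified function (the body is abridged and assumes
    the usual inactive_ratio comparison):

    static bool inactive_anon_is_low_global(struct zone *zone)
    {
            unsigned long active = zone_page_state(zone, NR_ACTIVE_ANON);
            unsigned long inactive = zone_page_state(zone, NR_INACTIVE_ANON);

            /* Previously: if (cond) return true; return false; */
            return inactive * zone->inactive_ratio < active;
    }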