08 Oct, 2016

40 commits

  • So they are CONFIG_DEBUG_VM-only and more informative.

    Cc: Al Viro
    Cc: David S. Miller
    Cc: Hugh Dickins
    Cc: Jens Axboe
    Cc: Joe Perches
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Santosh Shilimkar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Commit c32b3cbe0d06 ("oom, PM: make OOM detection in the freezer path
    raceless") inserted a WARN_ON() into pagefault_out_of_memory() in order
    to warn when we raced with disabling the OOM killer.

    Now, patch "oom, suspend: fix oom_killer_disable vs. pm suspend
    properly" introduced a timeout for oom_killer_disable(). Even if we
    raced with disabling the OOM killer and the system is OOM livelocked,
    the OOM killer will be enabled eventually (in 20 seconds by default) and
    the OOM livelock will be solved. Therefore, we no longer need to warn
    when we raced with disabling the OOM killer.

    Link: http://lkml.kernel.org/r/1473442120-7246-1-git-send-email-penguin-kernel@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Fragmentation index and the vm.extfrag_threshold sysctl are meant as a
    heuristic to prevent excessive compaction for costly orders (i.e. THP).
    It's unlikely to make any difference for non-costly orders, especially
    with the default threshold. But we cannot afford any uncertainty for
    the non-costly orders where the only alternative to successful
    reclaim/compaction is OOM. After the recent patches we are guaranteed
    maximum effort without heuristics from compaction before deciding OOM,
    and fragindex is the last remaining heuristic. Therefore skip fragindex
    altogether for non-costly orders.

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20160926162025.21555-5-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
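
    A minimal sketch of the resulting check (simplified from the
    compaction-suitability test; the exact hunk may differ):

    if (order > PAGE_ALLOC_COSTLY_ORDER) {
        int fragindex = fragmentation_index(zone, order);

        /* only costly orders back off based on the heuristic */
        if (fragindex >= 0 && fragindex <= sysctl_extfrag_threshold)
            return COMPACT_NOT_SUITABLE_ZONE;
    }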
     
  • The compaction_zonelist_suitable() function tries to determine if
    compaction will be able to proceed after sufficient reclaim, i.e.
    whether there are enough reclaimable pages to provide enough order-0
    freepages for compaction.

    This addition of reclaimable pages to the free pages works well for the
    order-0 watermark check, but in the fragmentation index check we only
    consider truly free pages. Thus we can get a fragindex value close to
    0, which indicates failure due to lack of memory, and wrongly decide
    that compaction won't be suitable even after reclaim.

    Instead of trying to somehow adjust fragindex for reclaimable pages,
    let's just skip it from compaction_zonelist_suitable().

    Link: http://lkml.kernel.org/r/20160926162025.21555-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    should_reclaim_retry() makes decisions based on no_progress_loops,
    so it makes sense to also update the counter there. This is also
    consistent with should_compact_retry() and compaction_retries. No
    functional change.

    [hillf.zj@alibaba-inc.com: fix missing pointer dereferences]
    Link: http://lkml.kernel.org/r/20160926162025.21555-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Hillf Danton
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    Several people have reported premature OOMs for order-2 allocations
    (stack) due to the OOM rework in 4.7. In the reported scenario (a
    parallel kernel build and dd writing to two drives) many pageblocks get
    marked as Unmovable and the compaction free scanner struggles to
    isolate free pages. Joonsoo Kim pointed out that the free scanner
    skips pageblocks that are not movable to prevent filling them and
    forcing non-movable allocations to fall back to other pageblocks. Such
    a heuristic makes sense to help prevent long-term fragmentation, but
    premature OOMs are the relatively more urgent problem. As a
    compromise, this patch disables the heuristic only for the ultimate
    compaction priority.

    Link: http://lkml.kernel.org/r/20160906135258.18335-5-vbabka@suse.cz
    Reported-by: Ralf-Peter Rohbeck
    Reported-by: Arkadiusz Miskiewicz
    Reported-by: Olaf Hering
    Suggested-by: Joonsoo Kim
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The new ultimate compaction priority disables some heuristics, which may
    result in excessive cost. This is fine for non-costly orders where we
    want to try hard before resorting to OOM, but might be disruptive for
    costly orders which do not trigger OOM and should generally have some
    fallback. Thus, we disable the full priority for costly orders.

    Suggested-by: Michal Hocko
    Link: http://lkml.kernel.org/r/20160906135258.18335-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During reclaim/compaction loop, compaction priority can be increased by
    the should_compact_retry() function, but the current code is not
    optimal. Priority is only increased when compaction_failed() is true,
    which means that compaction has scanned the whole zone. This may not
    happen even after multiple attempts with a lower priority due to
    parallel activity, so we might needlessly struggle on the lower
    priorities and possibly run out of compaction retry attempts in the
    process.

    After this patch we are guaranteed at least one attempt at the highest
    compaction priority even if we exhaust all retries at the lower
    priorities.

    Link: http://lkml.kernel.org/r/20160906135258.18335-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
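
    The policy described above, modeled as a small self-contained helper
    (hypothetical function, not the kernel's should_compact_retry(); note
    that a higher compaction priority is the numerically lower
    COMPACT_PRIO_* value):

    enum compact_priority {
        COMPACT_PRIO_SYNC_FULL,     /* highest priority, heuristics disabled */
        COMPACT_PRIO_SYNC_LIGHT,
        COMPACT_PRIO_ASYNC,         /* lowest priority */
    };

    static int should_retry_compaction(int *retries, int max_retries,
                                       enum compact_priority *prio)
    {
        if (++*retries <= max_retries)
            return 1;                       /* keep trying at this priority */
        if (*prio > COMPACT_PRIO_SYNC_FULL) {
            (*prio)--;                      /* escalate the priority ...     */
            *retries = 0;                   /* ... and reset the budget      */
            return 1;
        }
        return 0;                           /* highest priority tried, give up */
    }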
     
  • Patch series "reintroduce compaction feedback for OOM decisions".

    After several people reported OOMs for order-2 allocations in 4.7 due
    to Michal Hocko's OOM rework, he reverted the part that considered
    compaction feedback [1] in the decisions to retry reclaim/compaction.
    This was to provide a fix quickly for 4.8 rc and 4.7 stable series,
    while mmotm had an almost complete solution that instead improved
    compaction reliability.

    This series completes the mmotm solution and reintroduces the compaction
    feedback into OOM decisions. The first two patches restore the state of
    mmotm before the temporary solution was merged, the last patch should be
    the missing piece for reliability. The third patch restricts the
    hardened compaction to non-costly orders, since costly orders don't
    result in OOMs in the first place.

    [1] http://marc.info/?i=20160822093249.GA14916%40dhcp22.suse.cz%3E

    This patch (of 4):

    Commit 6b4e3181d7bd ("mm, oom: prevent premature OOM killer invocation
    for high order request") was intended as a quick fix of OOM regressions
    for 4.8 and stable 4.7.x kernels. For a better long-term solution, we
    still want to consider compaction feedback, which should be possible
    after some more improvements in the following patches.

    This reverts commit 6b4e3181d7bd5ca5ab6f45929e4a5ffa7ab4ab7f.

    Link: http://lkml.kernel.org/r/20160906135258.18335-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Rik van Riel
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
    After using the offset of the swap entry as the key of the swap cache,
    page_index() becomes exactly the same as page_file_index(). So
    page_file_index() is removed and the callers are changed to use
    page_index() instead.

    Link: http://lkml.kernel.org/r/1473270649-27229-2-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Trond Myklebust
    Cc: Anna Schumaker
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Ross Zwisler
    Cc: Eric Dumazet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • This patch is to improve the performance of swap cache operations when
    the type of the swap device is not 0. Originally, the whole swap entry
    value is used as the key of the swap cache, even though there is one
    radix tree for each swap device. If the type of the swap device is not
    0, the height of the radix tree of the swap cache will be increased
    unnecessarily, especially on 64-bit architectures. For example, for a 1GB
    swap device on the x86_64 architecture, the height of the radix tree of
    the swap cache is 11. But if the offset of the swap entry is used as
    the key of the swap cache, the height of the radix tree of the swap
    cache is 4. The increased height causes unnecessary radix tree
    descending and increased cache footprint.

    This patch reduces the height of the radix tree of the swap cache via
    using the offset of the swap entry instead of the whole swap entry value
    as the key of the swap cache. In 32 processes sequential swap out test
    case on a Xeon E5 v3 system with RAM disk as swap, the lock contention
    for the spinlock of the swap cache is reduced from 20.15% to 12.19%,
    when the type of the swap device is 1.

    Use the whole swap entry as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 10.37,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 9.78,

    Use the swap offset as key,

    perf-profile.calltrace.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list: 6.25,
    perf-profile.calltrace.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_node_memcg: 5.94,

    Link: http://lkml.kernel.org/r/1473270649-27229-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: Joonsoo Kim
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Aaron Lu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
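
    A sketch of the central idea (illustrative fragment, not the full
    patch): index the per-type radix tree by the swap offset rather than
    by the whole swap entry value.

    /* e.g. in __add_to_swap_cache() */
    struct address_space *address_space = swap_address_space(entry);

    /* before: radix_tree_insert(&address_space->page_tree, entry.val, page); */
    int error = radix_tree_insert(&address_space->page_tree,
                                  swp_offset(entry), page);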
     
    vm_insert_mixed(), unlike vm_insert_pfn_prot() and vmf_insert_pfn_pmd(),
    fails to check the pgprot_t it uses for the mapping against the one
    recorded in the memtype tracking tree. Add the missing call to
    track_pfn_insert() to preclude cases where incompatible aliased mappings
    are established for a given physical address range.

    Link: http://lkml.kernel.org/r/147328717909.35069.14256589123570653697.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Cc: David Airlie
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • If store_mem_state() is called to online memory which is already online,
    it will return 1, the value it got from device_online().

    This is wrong because store_mem_state() is a device_attribute .store
    function. Thus a non-negative return value represents input bytes read.

    Set the return value to -EINVAL in this case.

    Link: http://lkml.kernel.org/r/1472743777-24266-1-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Cc: Greg Kroah-Hartman
    Cc: Vlastimil Babka
    Cc: Vitaly Kuznetsov
    Cc: David Rientjes
    Cc: Yaowei Bai
    Cc: Joonsoo Kim
    Cc: Dan Williams
    Cc: Xishi Qiu
    Cc: David Vrabel
    Cc: Chen Yucong
    Cc: Andrew Banman
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
    mem_cgroup_count_precharge() and mem_cgroup_move_charge() both call
    walk_page_range() on the range 0 to ~0UL; neither provides a pte_hole
    callback, which causes the current implementation to skip non-vma
    regions. This is all fine, but follow-up changes would like to make
    walk_page_range() more generic, so it is better to be explicit about
    which range to traverse. Let's use highest_vm_end to explicitly
    traverse only user mmaped memory.

    [mhocko@kernel.org: rewrote changelog]
    Link: http://lkml.kernel.org/r/1472655897-22532-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
  • The global zero page is used to satisfy an anonymous read fault. If
    THP(Transparent HugePage) is enabled then the global huge zero page is
    used. The global huge zero page uses an atomic counter for reference
    counting and is allocated/freed dynamically according to its counter
    value.

    CPU time spent on that counter will greatly increase if there are a lot
    of processes doing anonymous read faults. This patch proposes a way to
    reduce the access to the global counter so that the CPU load can be
    reduced accordingly.

    To do this, a new flag of the mm_struct is introduced:
    MMF_USED_HUGE_ZERO_PAGE. With this flag, a process only needs to touch
    the global counter in two cases:

    1. the first time it uses the global huge zero page;
    2. when mm_users of its mm_struct reaches zero.

    Note that right now, the huge zero page is eligible to be freed as soon
    as its last use goes away. With this patch, the page will not be
    eligible to be freed until the exit of the last process from which it
    was ever used.

    And with the use of mm_users, kthreads are not eligible to use the
    huge zero page either. Since no kthread uses the huge zero page today,
    there is no difference after applying this patch. But if that is not
    desired, this can be changed to trigger when mm_count reaches zero
    instead.

    Case used for test on Haswell EP:

    usemem -n 72 --readonly -j 0x200000 100G

    Which spawns 72 processes and each will mmap 100G anonymous space and
    then do read only access to that space sequentially with a step of 2MB.

    CPU cycles from perf report for base commit:
    54.03% usemem [kernel.kallsyms] [k] get_huge_zero_page
    CPU cycles from perf report for this commit:
    0.11% usemem [kernel.kallsyms] [k] mm_get_huge_zero_page

    Performance(throughput) of the workload for base commit: 1784430792
    Performance(throughput) of the workload for this commit: 4726928591
    164% increase.

    Runtime of the workload for base commit: 707592 us
    Runtime of the workload for this commit: 303970 us
    50% drop.

    Link: http://lkml.kernel.org/r/fe51a88f-446a-4622-1363-ad1282d71385@intel.com
    Signed-off-by: Aaron Lu
    Cc: Sergey Senozhatsky
    Cc: "Kirill A. Shutemov"
    Cc: Dave Hansen
    Cc: Tim Chen
    Cc: Huang Ying
    Cc: Vlastimil Babka
    Cc: Jerome Marchand
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Ebru Akagunduz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
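
    A sketch of the per-mm fast path (close in spirit to the patch but not
    guaranteed to match it line for line; get/put_huge_zero_page stand for
    the pre-existing global-refcount helpers):

    struct page *mm_get_huge_zero_page(struct mm_struct *mm)
    {
        /* fast path: this mm already holds a reference */
        if (test_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            return READ_ONCE(huge_zero_page);

        if (!get_huge_zero_page())
            return NULL;

        /* lost a race with another thread of this mm: drop the extra ref */
        if (test_and_set_bit(MMF_USED_HUGE_ZERO_PAGE, &mm->flags))
            put_huge_zero_page();

        return READ_ONCE(huge_zero_page);
    }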
     
  • Trying to walk all of virtual memory requires architecture specific
    knowledge. On x86_64, addresses must be sign extended from bit 48,
    whereas on arm64 the top VA_BITS of address space have their own set of
    page tables.

    clear_refs_write() calls walk_page_range() on the range 0 to ~0UL, it
    provides a test_walk() callback that only expects to be walking over
    VMAs. Currently walk_pmd_range() will skip memory regions that don't
    have a VMA, reporting them as a hole.

    As this call only expects to walk user address space, make it walk 0 to
    'highest_vm_end'.

    Link: http://lkml.kernel.org/r/1472655792-22439-1-git-send-email-james.morse@arm.com
    Signed-off-by: James Morse
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    James Morse
     
    In the current kernel code, we only call node_set_state(cpu_to_node(cpu),
    N_CPU) when a cpu is hotplugged. But we do not set the node state for
    N_CPU when the cpus are brought online during boot.

    So this could lead to failure when we check whether a node contains a
    cpu with node_state(node_id, N_CPU).

    One use case is in the node_reclaim() function:

    /*
     * Only run node reclaim on the local node or on nodes that do not
     * have associated processors. This will favor the local processor
     * over remote processors and spread off node memory allocations
     * as wide as possible.
     */
    if (node_state(pgdat->node_id, N_CPU) && pgdat->node_id != numa_node_id())
        return NODE_RECLAIM_NOSCAN;

    I instrumented the kernel to call this function after boot and it always
    returns 0 on an x86 desktop machine until I apply the attached patch.

    int num_cpu_node(void)
    {
        int i, nr_cpu_nodes = 0;

        for_each_node(i) {
            if (node_state(i, N_CPU))
                ++nr_cpu_nodes;
        }

        return nr_cpu_nodes;
    }

    Fix this by checking each node for online CPUs when we initialize
    vmstat, which is responsible for maintaining node state.

    Link: http://lkml.kernel.org/r/20160829175922.GA21775@linux.intel.com
    Signed-off-by: Tim Chen
    Acked-by: David Rientjes
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Cc: Tim Chen
    Cc: Ying
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: Dan Williams
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tim Chen
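
    A sketch of one way to express the fix (the helper name is
    illustrative, and the actual patch may iterate nodes rather than
    CPUs): mark the node of every CPU that is already online when vmstat
    is initialized.

    static void __init init_cpu_node_state(void)
    {
        int cpu;

        for_each_online_cpu(cpu)
            node_set_state(cpu_to_node(cpu), N_CPU);
    }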
     
  • To support DAX pmd mappings with unmodified applications, filesystems
    need to align an mmap address by the pmd size.

    Call thp_get_unmapped_area() from f_op->get_unmapped_area.

    Note, there is no change in behavior for a non-DAX file.

    Link: http://lkml.kernel.org/r/1472497881-9323-3-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Cc: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
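
    Illustrative wiring (a sketch of what a DAX-capable filesystem's
    file_operations ends up containing, not an exact hunk from the patch):

    static const struct file_operations dax_capable_file_operations = {
        .mmap              = generic_file_mmap,
        .get_unmapped_area = thp_get_unmapped_area,  /* pmd-align DAX mmaps */
        /* ...read/write/fsync ops as usual... */
    };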
     
  • When CONFIG_FS_DAX_PMD is set, DAX supports mmap() using pmd page size.
    This feature relies on both mmap virtual address and FS block (i.e.
    physical address) to be aligned by the pmd page size. Users can use
    mkfs options to specify FS to align block allocations. However,
    aligning mmap address requires code changes to existing applications for
    providing a pmd-aligned address to mmap().

    For instance, fio with "ioengine=mmap" performs I/Os with mmap() [1].
    It calls mmap() with a NULL address, which needs to be changed to
    provide a pmd-aligned address for testing with DAX pmd mappings.
    Changing all applications that call mmap() with NULL is undesirable.

    Add thp_get_unmapped_area(), which can be called by filesystem's
    get_unmapped_area to align an mmap address by the pmd size for a DAX
    file. It calls the default handler, mm->get_unmapped_area(), to find a
    range and then aligns it for a DAX file.

    The patch is based on Matthew Wilcox's change that allows adding support
    of the pud page size easily.

    [1]: https://github.com/axboe/fio/blob/master/engines/mmap.c
    Link: http://lkml.kernel.org/r/1472497881-9323-2-git-send-email-toshi.kani@hpe.com
    Signed-off-by: Toshi Kani
    Reviewed-by: Dan Williams
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Kirill A. Shutemov
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Theodore Ts'o
    Cc: Andreas Dilger
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
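
    A simplified model of the alignment trick (hypothetical helper name;
    the real thp_get_unmapped_area() adds more overflow and error
    checking): get a PMD-size-padded range from the default handler, then
    shift the start so the address is congruent with the file offset
    modulo PMD_SIZE.

    static unsigned long pmd_aligned_unmapped_area(struct file *filp,
            unsigned long len, unsigned long pgoff, unsigned long flags)
    {
        unsigned long off = pgoff << PAGE_SHIFT;
        unsigned long addr;

        /* over-allocate by one PMD so there is room to slide forward */
        addr = current->mm->get_unmapped_area(filp, 0, len + PMD_SIZE,
                                              pgoff, flags);
        if (IS_ERR_VALUE(addr))
            return addr;

        /* make (addr - off) a multiple of PMD_SIZE */
        addr += (off - addr) & (PMD_SIZE - 1);
        return addr;
    }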
     
    This patch randomly performs mlock/mlock2 on a given memory region and
    verifies that the RLIMIT_MEMLOCK limitation works properly.

    Suggested-by: David Rientjes
    Link: http://lkml.kernel.org/r/1473325970-11393-4-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Eric B Munson
    Cc: Simon Guo
    Cc: Mel Gorman
    Cc: Alexey Klimov
    Cc: Andrea Arcangeli
    Cc: Thierry Reding
    Cc: Mike Kravetz
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
  • Function seek_to_smaps_entry() can be useful for other selftest
    functionalities, so move it out to header file.

    Link: http://lkml.kernel.org/r/1473325970-11393-3-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Shuah Khan
    Cc: Vlastimil Babka
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Eric B Munson
    Cc: Simon Guo
    Cc: Mel Gorman
    Cc: Alexey Klimov
    Cc: Andrea Arcangeli
    Cc: Thierry Reding
    Cc: Mike Kravetz
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
    This patch adds an mlock() test for multiple invocations on the same
    address area, and verifies that this does not break the RLIMIT_MEMLOCK
    accounting.

    Link: http://lkml.kernel.org/r/1472554781-9835-5-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
    Prepare mlock2.h so that its functionality can be reused.

    Link: http://lkml.kernel.org/r/1472554781-9835-4-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
     
    When a vma already has the flags VM_LOCKED|VM_LOCKONFAULT (set by
    invoking mlock2(,MLOCK_ONFAULT)), it can again be populated with
    mlock(), which sets VM_LOCKED only.

    There is a hole in mlock_fixup() which increases mm->locked_vm twice
    even though the two operations are on the same vma and both set
    VM_LOCKED.

    The issue can be reproduced by the following code:

    mlock2(p, 1024 * 64, MLOCK_ONFAULT); //VM_LOCKED|VM_LOCKONFAULT
    mlock(p, 1024 * 64); //VM_LOCKED

    Then check the VmLck field in /proc/pid/status: it grows to 128k for a
    single 64k region.

    When a vma's vm_flags change and the new vm_flags include VM_LOCKED, it
    is not necessarily a "newly locked" vma. This patch corrects the bug
    by preventing mm->locked_vm from being incremented when the old
    vm_flags already include VM_LOCKED.

    Link: http://lkml.kernel.org/r/1472554781-9835-3-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Acked-by: Kirill A. Shutemov
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
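
    A self-contained user-space version of the reproducer above
    (illustrative; assumes kernel/libc headers new enough to define
    SYS_mlock2, and skips error handling). Without the fix, VmLck reports
    128 kB for this single 64 kB region; with it, 64 kB.

    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #ifndef MLOCK_ONFAULT
    #define MLOCK_ONFAULT 0x01
    #endif

    int main(void)
    {
        size_t len = 64 * 1024;
        char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        char cmd[64];

        syscall(SYS_mlock2, p, len, MLOCK_ONFAULT); /* VM_LOCKED|VM_LOCKONFAULT */
        mlock(p, len);                              /* VM_LOCKED, same region   */

        snprintf(cmd, sizeof(cmd), "grep VmLck /proc/%d/status", getpid());
        return system(cmd);
    }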
     
    In do_mlock(), the check against the locked memory limit has a hole
    which makes the following sequence fail at step 3):

    1) The user has a memory chunk at addressA of 50k, and the user's
    memlock rlimit is 64k.
    2) mlock(addressA, 30k)
    3) mlock(addressA, 40k)

    The 3rd step should have been allowed, since the 40k request intersects
    the 30k locked at step 2), so the 3rd call effectively mlocks only the
    extra 10k of memory.

    This patch checks the vma to calculate the actual "new" mlock size, if
    necessary, and adjusts the logic to fix this issue.

    [akpm@linux-foundation.org: clean up comment layout]
    [wei.guo.simon@gmail.com: correct a typo in count_mm_mlocked_page_nr()]
    Link: http://lkml.kernel.org/r/1473325970-11393-2-git-send-email-wei.guo.simon@gmail.com
    Link: http://lkml.kernel.org/r/1472554781-9835-2-git-send-email-wei.guo.simon@gmail.com
    Signed-off-by: Simon Guo
    Cc: Alexey Klimov
    Cc: Eric B Munson
    Cc: Geert Uytterhoeven
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Shuah Khan
    Cc: Simon Guo
    Cc: Thierry Reding
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Simon Guo
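
    The three steps above, as a runnable user-space check (illustrative,
    minimal error handling). Before the fix the second mlock() fails with
    ENOMEM even though only about 10k of it is new; after the fix it
    succeeds.

    #include <errno.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/resource.h>

    int main(void)
    {
        struct rlimit rl = { 64 * 1024, 64 * 1024 };
        char *p;

        setrlimit(RLIMIT_MEMLOCK, &rl);          /* step 1: 64k memlock limit */
        p = mmap(NULL, 50 * 1024, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        printf("mlock 30k: %d\n", mlock(p, 30 * 1024));           /* step 2 */
        printf("mlock 40k: %d (%s)\n", mlock(p, 40 * 1024),
               strerror(errno));                                  /* step 3 */
        return 0;
    }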
     
    Since lumpy reclaim is gone there is no source of higher order pages
    if CONFIG_COMPACTION=n except for order-0 page reclaim, which is
    unreliable for that purpose to say the least. Hitting an OOM for
    !costly higher order requests is therefore not that hard to imagine.
    We are trying hard not to invoke the OOM killer as much as possible but
    there is simply no reliable way to detect whether more reclaim retries
    make sense.

    Disabling COMPACTION is not widespread but it seems that some users
    might have disabled the feature without realizing the full consequences
    (mostly along with disabling THP, because compaction used to be mainly
    a THP thing). This patch just adds a note if the OOM killer was
    triggered by a higher order request with compaction disabled. This
    will help us identify possible misconfiguration right from the oom
    report, which is easier than always having to keep in mind that
    somebody might have disabled COMPACTION without a good reason.

    Link: http://lkml.kernel.org/r/20160830111632.GD23963@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • File pages use a set of radix tree tags (DIRTY, TOWRITE, WRITEBACK,
    etc.) to accelerate finding the pages with a specific tag in the radix
    tree during inode writeback. But for anonymous pages in the swap cache,
    there is no inode writeback. So there is no need to find the pages with
    some writeback tags in the radix tree. It is not necessary to touch
    radix tree writeback tags for pages in the swap cache.

    Per Rik van Riel's suggestion, a new flag AS_NO_WRITEBACK_TAGS is
    introduced for address spaces which don't need to update the writeback
    tags. The flag is set for swap caches. It may be used for DAX file
    systems, etc.

    With this patch, the swap out bandwidth improved 22.3% (from ~1.2GB/s to
    ~1.48GB/s) in the vm-scalability swap-w-seq test case with 8 processes.
    The test is done on a Xeon E5 v3 system. The swap device used is a RAM
    simulated PMEM (persistent memory) device. The improvement comes from
    the reduced contention on the swap cache radix tree lock. To test
    sequential swapping out, the test case uses 8 processes, which
    sequentially allocate and write to the anonymous pages until RAM and
    part of the swap device is used up.

    Details of the comparison are as follows:

                 base                 base+patch
         ----------------     --------------------------
              %stddev          %change       %stddev
        2506952 ±  2%          +28.1%   3212076 ±  7%  vm-scalability.throughput
        1207402 ±  7%          +22.3%   1476578 ±  6%  vmstat.swap.so
          10.86 ± 12%          -23.4%      8.31 ± 16%  perf-profile.cycles-pp._raw_spin_lock_irq.__add_to_swap_cache.add_to_swap_cache.add_to_swap.shrink_page_list
          10.82 ± 13%          -33.1%      7.24 ± 14%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__remove_mapping.shrink_page_list.shrink_inactive_list.shrink_zone_memcg
          10.36 ± 11%         -100.0%      0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.__test_set_page_writeback.bdev_write_page.__swap_writepage.swap_writepage
          10.52 ± 12%         -100.0%      0.00 ± -1%  perf-profile.cycles-pp._raw_spin_lock_irqsave.test_clear_page_writeback.end_page_writeback.page_endio.pmem_rw_page

    Link: http://lkml.kernel.org/r/1472578089-5560-1-git-send-email-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: Tejun Heo
    Cc: Wu Fengguang
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
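
    A sketch of how such a flag can be plumbed through (the helper names
    here are assumptions made for illustration; the changelog itself only
    names the AS_NO_WRITEBACK_TAGS flag):

    static inline void mapping_set_no_writeback_tags(struct address_space *m)
    {
        set_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
    }

    static inline bool mapping_use_writeback_tags(struct address_space *m)
    {
        return !test_bit(AS_NO_WRITEBACK_TAGS, &m->flags);
    }

    /* Swap cache address spaces opt out at init time; the
     * test_set/test_clear_page_writeback() paths then only take the tree
     * lock and touch the WRITEBACK radix tree tag when
     * mapping_use_writeback_tags(mapping) is true. */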
     
    In ___alloc_bootmem_node_nopanic(), replace kzalloc() with
    kzalloc_node() in order to preferentially allocate memory within the
    given node when slab is available.

    Link: http://lkml.kernel.org/r/1f487f12-6af4-5e4f-a28c-1de2361cdcd8@zoho.com
    Signed-off-by: zijun_hu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
  • Fix the following bugs:

    - the same ARCH_LOW_ADDRESS_LIMIT statements are duplicated between the
    header and the relevant source file

    - ARCH_LOW_ADDRESS_LIMIT, which may be defined by the architecture in
    asm/processor.h, is not reliably preferred over the default in
    linux/bootmem.h, since the former header isn't included by the latter

    Link: http://lkml.kernel.org/r/e046aeaa-e160-6d9e-dc1b-e084c2fd999f@zoho.com
    Signed-off-by: zijun_hu
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zijun_hu
     
    Currently a significant amount of memory is reserved only in a kernel
    booted to capture a kernel dump using the fa_dump method.

    Kernels compiled with CONFIG_DEFERRED_STRUCT_PAGE_INIT will initialize
    only a certain amount of memory per node at boot. That amount takes
    into account the dentry and inode cache sizes. Currently the cache
    sizes are calculated based on the total system memory, including the
    reserved memory. However, such a kernel, when booted as the fadump
    kernel, will not be able to allocate the required amount of memory for
    the dentry and inode caches. This results in crashes like the one
    shown below.

    Hence only implement arch_reserved_kernel_pages() for CONFIG_FA_DUMP
    configurations. The amount reserved will be subtracted while sizing
    the large caches, which avoids crashes like the one below on large
    systems such as 32 TB systems.

    Dentry cache hash table entries: 536870912 (order: 16, 4294967296 bytes)
    vmalloc: allocation failure, allocated 4097114112 of 17179934720 bytes
    swapper/0: page allocation failure: order:0, mode:0x2080020(GFP_ATOMIC)
    CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.6-master+ #3
    Call Trace:
    dump_stack+0xb0/0xf0 (unreliable)
    warn_alloc_failed+0x114/0x160
    __vmalloc_node_range+0x304/0x340
    __vmalloc+0x6c/0x90
    alloc_large_system_hash+0x1b8/0x2c0
    inode_init+0x94/0xe4
    vfs_caches_init+0x8c/0x13c
    start_kernel+0x50c/0x578
    start_here_common+0x20/0xa8

    Link: http://lkml.kernel.org/r/1472476010-4709-4-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
    The total reserved memory in a system is accounted but not available
    for use outside mm/memblock.c. By exposing the total reserved memory,
    systems can better calculate the size of large hashes.

    Link: http://lkml.kernel.org/r/1472476010-4709-3-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
     
  • Currently arch specific code can reserve memory blocks but
    alloc_large_system_hash() may not take it into consideration when sizing
    the hashes. This can lead to bigger hashes than required and leave no
    memory available for other purposes. This is specifically true for
    systems with CONFIG_DEFERRED_STRUCT_PAGE_INIT enabled.

    One approach to solve this problem would be to walk through the memblock
    regions and calculate the available memory and base the size of hash
    system on the available memory.

    The other approach would be to depend on the architecture to provide the
    number of pages that are reserved. This change provides hooks to allow
    the architecture to provide the required info.

    Link: http://lkml.kernel.org/r/1472476010-4709-2-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Srikar Dronamraju
    Suggested-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Michael Ellerman
    Cc: Mahesh Salgaonkar
    Cc: Hari Bathini
    Cc: Dave Hansen
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Srikar Dronamraju
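
    The shape of such a hook, sketched (whether the generic default is a
    weak symbol or an #ifdef fallback is a detail of the actual patch): an
    architecture overrides it to report the pages it reserved, and
    alloc_large_system_hash() subtracts that amount before sizing the
    caches.

    /* generic fallback: no arch-specific reservation to account for */
    unsigned long arch_reserved_kernel_pages(void)
    {
        return 0;
    }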
     
  • Use the existing enums instead of hardcoded index when looking at the
    zonelist. This makes it more readable. No functionality change by this
    patch.

    Link: http://lkml.kernel.org/r/1472227078-24852-1-git-send-email-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
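
    The kind of change this makes, shown on a typical call site
    (illustrative before/after, not an exact hunk):

    zonelist = &NODE_DATA(nid)->node_zonelists[0];                  /* before */
    zonelist = &NODE_DATA(nid)->node_zonelists[ZONELIST_FALLBACK];  /* after  */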
     
    The oom reaper was skipped for an mm which is shared with a kernel
    thread (via use_mm()). The primary concern was that such a kthread
    might want to read from the userspace memory and see a zero page as a
    result of the oom reaper's action. This is no longer a problem after
    "mm: make sure that kthreads will not refault oom reaped memory"
    because any attempt to fault in while MMF_UNSTABLE is set will result
    in SIGBUS, so the target user should see an error. This means that we
    can finally allow the oom reaper also for tasks which share their mm
    with kthreads.

    Link: http://lkml.kernel.org/r/1472119394-11342-10-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There are only few use_mm() users in the kernel right now. Most of them
    write to the target memory, but the vhost driver relies on
    copy_from_user/get_user from a kernel thread context. This makes it
    impossible to reap the memory of an oom victim which shares the mm with
    the vhost kernel thread because it could see a zero page unexpectedly
    and theoretically make an incorrect decision visible outside of the
    killed task context.

    To quote Michael S. Tsirkin:
    : Getting an error from __get_user and friends is handled gracefully.
    : Getting zero instead of a real value will cause userspace
    : memory corruption.

    The vhost kernel thread is bound to an open fd of the vhost device,
    which is not tied to the mm owner's life cycle in general. The device
    fd can be inherited or passed over to another process, which means that
    we really have to be careful about unexpected memory corruption because
    unlike for normal oom victims the result will be visible outside of the
    oom victim context.

    Make sure that no kthread context (users of use_mm) can ever see
    corrupted data because of the oom reaper; hook into the page fault
    path by checking the MMF_UNSTABLE mm flag. __oom_reap_task_mm will set
    the flag before it starts unmapping the address space, while the flag
    is checked after the page fault has been handled. If the flag is set
    then SIGBUS is triggered so any g-u-p user will get an error code.

    Regular tasks do not need this protection because all which share the mm
    are killed when the mm is reaped and so the corruption will not outlive
    them.

    This patch shouldn't have any visible effect at this moment because the
    OOM killer doesn't invoke oom reaper for tasks with mm shared with
    kthreads yet.

    Link: http://lkml.kernel.org/r/1472119394-11342-9-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: "Michael S. Tsirkin"
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
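
    A sketch of the check described above (simplified; the exact condition
    and placement inside the fault path may differ): a kthread faulting
    into an mm that the oom reaper has started to unmap gets SIGBUS
    instead of silently seeing a zeroed page.

    if (unlikely((current->flags & PF_KTHREAD) &&
                 !(ret & VM_FAULT_ERROR) &&
                 test_bit(MMF_UNSTABLE, &vma->vm_mm->flags)))
        ret = VM_FAULT_SIGBUS;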
     
  • There are no users of exit_oom_victim on !current task anymore so enforce
    the API to always work on the current.

    Link: http://lkml.kernel.org/r/1472119394-11342-8-git-send-email-mhocko@kernel.org
    Signed-off-by: Tetsuo Handa
    Signed-off-by: Michal Hocko
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
    Commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race") worked around an existing race between
    oom_killer_disable and oom_reaper by adding another round of
    try_to_freeze_tasks after the oom killer was disabled. This was the
    easiest thing to do for a late 4.7 fix. Let's fix it properly now.

    After "oom: keep mm of the killed task available" we no longer have to
    call exit_oom_victim from the oom reaper because we have a stable mm
    available and hide the oom-reaped mm with the MMF_OOM_SKIP flag. So
    let's remove exit_oom_victim; the race described in the above commit
    then no longer exists.

    Unfortunately this alone is not sufficient for the oom_killer_disable
    usecase because now we do not have any reliable way to reach
    exit_oom_victim (the victim might get stuck on a way to exit for an
    unbounded amount of time). OOM killer can cope with that by checking mm
    flags and move on to another victim but we cannot do the same for
    oom_killer_disable as we would lose the guarantee of no further
    interference of the victim with the rest of the system. What we can do
    instead is to cap the maximum time the oom_killer_disable waits for
    victims. The only current user of this function (pm suspend) already
    has a concept of timeout for back off so we can reuse the same value
    there.

    Let's drop set_freezable for the oom_reaper kthread because it is no
    longer needed as the reaper doesn't wake or thaw any processes.

    Link: http://lkml.kernel.org/r/1472119394-11342-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • After "oom: keep mm of the killed task available" we can safely detect
    an oom victim by checking task->signal->oom_mm so we do not need the
    signal_struct counter anymore so let's get rid of it.

    This alone wouldn't be sufficient for nommu archs because
    exit_oom_victim doesn't hide the process from the oom killer anymore.
    We can, however, mark the mm with a MMF flag in __mmput. We can reuse
    MMF_OOM_REAPED and rename it to a more generic MMF_OOM_SKIP.

    Link: http://lkml.kernel.org/r/1472119394-11342-6-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Lockdep complains that __mmdrop is not safe from the softirq context:

    =================================
    [ INFO: inconsistent lock state ]
    4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949 Tainted: G W
    ---------------------------------
    inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
    swapper/1/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (pgd_lock){+.?...}, at: pgd_free+0x19/0x6b
    {SOFTIRQ-ON-W} state was registered at:
    __lock_acquire+0xa06/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    __change_page_attr_set_clr+0x2a5/0xacd
    change_page_attr_set_clr+0x16f/0x32c
    set_memory_nx+0x37/0x3a
    free_init_pages+0x9e/0xc7
    alternative_instructions+0xa2/0xb3
    check_bugs+0xe/0x2d
    start_kernel+0x3ce/0x3ea
    x86_64_start_reservations+0x2a/0x2c
    x86_64_start_kernel+0x17a/0x18d
    irq event stamp: 105916
    hardirqs last enabled at (105916): free_hot_cold_page+0x37e/0x390
    hardirqs last disabled at (105915): free_hot_cold_page+0x2c1/0x390
    softirqs last enabled at (105878): _local_bh_enable+0x42/0x44
    softirqs last disabled at (105879): irq_exit+0x6f/0xd1

    other info that might help us debug this:
    Possible unsafe locking scenario:

    CPU0
    ----
    lock(pgd_lock);

    lock(pgd_lock);

    *** DEADLOCK ***

    1 lock held by swapper/1/0:
    #0: (rcu_callback){......}, at: rcu_process_callbacks+0x390/0x800

    stack backtrace:
    CPU: 1 PID: 0 Comm: swapper/1 Tainted: G W 4.6.0-oomfortification2-00011-geeb3eadeab96-dirty #949
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Debian-1.8.2-1 04/01/2014
    Call Trace:

    print_usage_bug.part.25+0x259/0x268
    mark_lock+0x381/0x567
    __lock_acquire+0x993/0x196e
    lock_acquire+0x139/0x1e1
    _raw_spin_lock+0x32/0x41
    pgd_free+0x19/0x6b
    __mmdrop+0x25/0xb9
    __put_task_struct+0x103/0x11e
    delayed_put_task_struct+0x157/0x15e
    rcu_process_callbacks+0x660/0x800
    __do_softirq+0x1ec/0x4d5
    irq_exit+0x6f/0xd1
    smp_apic_timer_interrupt+0x42/0x4d
    apic_timer_interrupt+0x8e/0xa0

    arch_cpu_idle+0xf/0x11
    default_idle_call+0x32/0x34
    cpu_startup_entry+0x20c/0x399
    start_secondary+0xfe/0x101

    Moreover, commit a79e53d85683 ("x86/mm: Fix pgd_lock deadlock") was
    explicit that pgd_lock must not be taken from irq context. This means
    that __mmdrop called from free_signal_struct has to be postponed to a
    user context. We already have a similar mechanism for mmput_async so
    we can use it here as well. This is safe because mm_count is pinned by
    mm_users.

    This fixes a bug introduced by "oom: keep mm of the killed task available".

    Link: http://lkml.kernel.org/r/1472119394-11342-5-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
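
    A sketch of the deferral, following the existing mmput_async pattern
    (field and helper names are assumptions based on that pattern): the
    final mm_count drop is punted to a workqueue so pgd_free() never runs
    in softirq context.

    static void mmdrop_async_fn(struct work_struct *work)
    {
        struct mm_struct *mm = container_of(work, struct mm_struct,
                                            async_put_work);
        __mmdrop(mm);
    }

    static void mmdrop_async(struct mm_struct *mm)
    {
        if (unlikely(atomic_dec_and_test(&mm->mm_count))) {
            INIT_WORK(&mm->async_put_work, mmdrop_async_fn);
            schedule_work(&mm->async_put_work);
        }
    }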
     
    oom_reap_task has to call exit_oom_victim in order to make sure that
    the oom victim will not block the oom killer forever. This is,
    however, opening new problems (e.g. oom_killer_disable exclusion - see
    commit 74070542099c ("oom, suspend: fix oom_reaper vs.
    oom_killer_disable race")). Ideally, exit_oom_victim should only be
    called from the victim's context.

    One way to achieve this would be to rely on per mm_struct flags. We
    already have MMF_OOM_REAPED to hide a task from the oom killer since
    "mm, oom: hide mm which is shared with kthread or global init". The
    problem is that the exit path:

    do_exit
    exit_mm
    tsk->mm = NULL;
    mmput
    __mmput
    exit_oom_victim

    doesn't guarantee that exit_oom_victim will get called in a bounded
    amount of time. At least exit_aio depends on IO which might get blocked
    due to lack of memory and who knows what else is lurking there.

    This patch takes a different approach. We remember tsk->mm into the
    signal_struct and bind it to the signal struct life time for all oom
    victims. __oom_reap_task_mm as well as oom_scan_process_thread do not
    have to rely on find_lock_task_mm anymore and they will have a reliable
    reference to the mm struct. As a result all the oom specific
    communication inside the OOM killer can be done via tsk->signal->oom_mm.

    Increasing the signal_struct for something as unlikely as the oom killer
    is far from ideal but this approach will make the code much more
    reasonable and long term we even might want to move task->mm into the
    signal_struct anyway. In the next step we might want to make the oom
    killer exclusion and access to memory reserves completely independent
    which would be also nice.

    Link: http://lkml.kernel.org/r/1472119394-11342-4-git-send-email-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: Tetsuo Handa
    Cc: Oleg Nesterov
    Cc: David Rientjes
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko