18 Mar, 2016

40 commits

  • There is a mixture of pr_warning and pr_warn uses in mm. Use pr_warn
    consistently.

    Miscellanea:

    - Coalesce formats
    - Realign arguments
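
    A hypothetical before/after (not taken from the actual patch) showing
    the kind of conversion, format coalescing and argument realignment
    involved:

      /* before: split format string, deprecated pr_warning() */
      pr_warning("mm: failed to allocate %zu bytes"
                 " for some structure\n", size);

      /* after: pr_warn() with the format coalesced onto one line */
      pr_warn("mm: failed to allocate %zu bytes for some structure\n", size);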

    Signed-off-by: Joe Perches
    Acked-by: Tejun Heo [percpu]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • ZONE_DEVICE (merged in 4.3) and ZONE_CMA (proposed) are examples of new
    mm zones that are bumping up against the current maximum limit of 4
    zones, i.e. 2 bits in page->flags for the GFP_ZONE_TABLE.

    The GFP_ZONE_TABLE poses an interesting constraint since
    include/linux/gfp.h gets included by the 32-bit portion of a 64-bit
    build. We need to be careful to only build the table for zones that
    have a corresponding gfp_t flag. GFP_ZONES_SHIFT is introduced for this
    purpose. This patch does not attempt to solve the problem of adding a
    new zone that also has a corresponding GFP_ flag.

    Vlastimil points out that ZONE_DEVICE, by depending on x86_64 and
    SPARSEMEM_VMEMMAP, implies that SECTIONS_WIDTH is zero. In other words,
    even though ZONE_DEVICE does not fit in GFP_ZONE_TABLE it is free to
    consume another bit in page->flags (expand ZONES_WIDTH) with room to
    spare.
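
    For context, a sketch from memory (simplified, not the exact hunk) of
    how gfp_zone() extracts a zone from the packed table, with
    GFP_ZONES_SHIFT giving the per-entry width:

      static inline enum zone_type gfp_zone(gfp_t flags)
      {
              enum zone_type z;
              int bit = (__force int) (flags & GFP_ZONEMASK);

              /* each GFP_ZONEMASK combination occupies GFP_ZONES_SHIFT
               * bits in the packed GFP_ZONE_TABLE constant */
              z = (GFP_ZONE_TABLE >> (bit * GFP_ZONES_SHIFT)) &
                      ((1 << GFP_ZONES_SHIFT) - 1);
              return z;
      }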

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=110931
    Fixes: 033fbae988fc ("mm: ZONE_DEVICE for "device memory"")
    Signed-off-by: Dan Williams
    Reported-by: Mark
    Reported-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Dave Hansen
    Cc: Sudip Mukherjee
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • - Do not take memcg_limit_mutex for resetting limits - the cgroup cannot
    be altered from userspace anymore, so no need to protect them.

    - Use plain page_counter_limit() for resetting ->memory and ->memsw
    limits instead of the mem_cgroup_resize_* helpers - we enlarge the
    limits, so no special handling is needed (see the sketch after this
    list).

    - Reset ->swap and ->tcpmem limits as well.
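
    A minimal sketch of what the reset then looks like (assuming the
    page_counter API of that time; not the literal hunk):

      /* in mem_cgroup_css_reset(), roughly: raise every limit to
       * "unlimited" with the plain setter - no mutex, no resize helper */
      page_counter_limit(&memcg->memory, PAGE_COUNTER_MAX);
      page_counter_limit(&memcg->swap, PAGE_COUNTER_MAX);
      page_counter_limit(&memcg->memsw, PAGE_COUNTER_MAX);
      page_counter_limit(&memcg->kmem, PAGE_COUNTER_MAX);
      page_counter_limit(&memcg->tcpmem, PAGE_COUNTER_MAX);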

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • online_pages() simply returns an error value if
    memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what
    we want for successfully onlining the target pages. This patch aims to
    print more failure information in online_pages(), like offline_pages()
    does.

    This patch also converts printk(KERN_) to pr_(), and moves
    __offline_pages() to not print failure information with KERN_INFO
    according to David Rientjes's suggestion[1].

    [1] https://lkml.org/lkml/2016/2/24/1094

    Signed-off-by: Chen Yucong
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Yucong
     
  • Commit 647757197cd3 ("mm: clarify __GFP_NOFAIL deprecation status") was
    incomplete and didn't remove the comment about __GFP_NOFAIL being
    deprecated in buffered_rmqueue.

    Let's get rid of this leftover but keep the WARN_ON_ONCE for order > 1
    because we should really discourage using __GFP_NOFAIL with higher-order
    allocations, as those are just too subtle.

    Signed-off-by: Michal Hocko
    Reviewed-by: Nikolay Borisov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • CMA allocation should be guaranteed to succeed by definition, but,
    unfortunately, it sometimes fails. The problem is hard to track down,
    because it is related to page reference manipulation and we don't have
    any facility to analyze it.

    This patch adds tracepoints to track page reference manipulation. With
    them, we can find the exact reason for a failure and fix the problem.
    The following is an example of tracepoint output. (Note: this example
    is from a stale version that printed flags as a number; the recent
    version prints them as a human-readable string.)

    -9018 [004] 92.678375: page_ref_set: pfn=0x17ac9 flags=0x0 count=1 mapcount=0 mapping=(nil) mt=4 val=1
    -9018 [004] 92.678378: kernel_stack:
    => get_page_from_freelist (ffffffff81176659)
    => __alloc_pages_nodemask (ffffffff81176d22)
    => alloc_pages_vma (ffffffff811bf675)
    => handle_mm_fault (ffffffff8119e693)
    => __do_page_fault (ffffffff810631ea)
    => trace_do_page_fault (ffffffff81063543)
    => do_async_page_fault (ffffffff8105c40a)
    => async_page_fault (ffffffff817581d8)
    [snip]
    -9018 [004] 92.678379: page_ref_mod: pfn=0x17ac9 flags=0x40048 count=2 mapcount=1 mapping=0xffff880015a78dc1 mt=4 val=1
    [snip]
    ...
    ...
    -9131 [001] 93.174468: test_pages_isolated: start_pfn=0x17800 end_pfn=0x17c00 fin_pfn=0x17ac9 ret=fail
    [snip]
    -9018 [004] 93.174843: page_ref_mod_and_test: pfn=0x17ac9 flags=0x40068 count=0 mapcount=0 mapping=0xffff880015a78dc1 mt=4 val=-1 ret=1
    => release_pages (ffffffff8117c9e4)
    => free_pages_and_swap_cache (ffffffff811b0697)
    => tlb_flush_mmu_free (ffffffff81199616)
    => tlb_finish_mmu (ffffffff8119a62c)
    => exit_mmap (ffffffff811a53f7)
    => mmput (ffffffff81073f47)
    => do_exit (ffffffff810794e9)
    => do_group_exit (ffffffff81079def)
    => SyS_exit_group (ffffffff81079e74)
    => entry_SYSCALL_64_fastpath (ffffffff817560b6)

    This output shows that the problem comes from the exit path. In the
    exit path, to improve performance, pages are not freed immediately;
    they are gathered and processed in batches. During this process,
    migration is not possible and the CMA allocation fails. This problem is
    hard to find without this page reference tracepoint facility.

    Enabling this feature bloats the kernel text by about 30 KB in my
    configuration.

    text data bss dec hex filename
    12127327 2243616 1507328 15878271 f2487f vmlinux_disabled
    12157208 2258880 1507328 15923416 f2f8d8 vmlinux_enabled

    Note that, due to a header file dependency problem between mm.h and
    tracepoint.h, this feature has to open-code the static key functions
    for tracepoints, as proposed by Steven Rostedt in the following link.

    https://lkml.org/lkml/2015/12/9/699
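
    From memory, the open-coded guard looks roughly like the following;
    treat it as a sketch rather than the exact header contents:

      /* include/linux/page_ref.h, roughly: test the tracepoint's static
       * key directly instead of pulling tracepoint.h into mm.h */
      #ifdef CONFIG_DEBUG_PAGE_REF
      extern struct tracepoint __tracepoint_page_ref_mod;
      #define page_ref_tracepoint_active(t) static_key_false(&(t).key)
      extern void __page_ref_mod(struct page *page, int v);
      #else
      #define page_ref_tracepoint_active(t) false
      static inline void __page_ref_mod(struct page *page, int v) { }
      #endif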

    [arnd@arndb.de: crypto/async_pq: use __free_page() instead of put_page()]
    [iamjoonsoo.kim@lge.com: fix build failure for xtensa]
    [akpm@linux-foundation.org: tweak Kconfig text, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Acked-by: Steven Rostedt
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • The success of CMA allocation largely depends on the success of
    migration, and a key factor there is the page reference count. Until
    now, page references have been manipulated by directly calling atomic
    functions, so we cannot track who manipulates them and where. That
    makes it hard to find the actual reason for a CMA allocation failure.
    CMA allocation should be guaranteed to succeed, so finding the
    offending place is really important.

    In this patch, call sites where the page reference is manipulated are
    converted to the newly introduced wrapper functions. This is a
    preparation step for adding a tracepoint to each page reference
    manipulation function. With this facility, we can easily find the
    reason for a CMA allocation failure. There is no functional change in
    this patch.

    In addition, this patch also converts the reference read sites. This
    will help a later step that renames page._count to something else and
    prevents later attempts to access it directly (suggested by Andrew).
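
    A sketch of what one of the wrappers looks like (simplified; the field
    and helper names are as I recall them, so treat them as assumptions):

      /* include/linux/page_ref.h, simplified: the atomic op is unchanged,
       * the tracepoint hook is new and guarded by a static key */
      static inline void page_ref_inc(struct page *page)
      {
              atomic_inc(&page->_count);
              if (page_ref_tracepoint_active(__tracepoint_page_ref_mod))
                      __page_ref_mod(page, 1);
      }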

    Signed-off-by: Joonsoo Kim
    Acked-by: Michal Nazarewicz
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Sergey Senozhatsky
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • THP defrag is enabled by default to direct reclaim/compact but not wake
    kswapd in the event of a THP allocation failure. The problem is that
    THP allocation requests potentially enter reclaim/compaction. This
    potentially incurs a severe stall that is not guaranteed to be offset by
    reduced TLB misses. While there has been considerable effort to reduce
    the impact of reclaim/compaction, it is still a high cost and workloads
    that should fit in memory fail to do so. Specifically, a simple
    anon/file streaming workload will enter direct reclaim on NUMA at least
    even though the working set size is 80% of RAM. It's been years and
    it's time to throw in the towel.

    First, this patch defines THP defrag as follows:

    madvise: A failed allocation will direct reclaim/compact if the application requests it
    never: Neither reclaim/compact nor wake kswapd
    defer: A failed allocation will wake kswapd/kcompactd
    always: A failed allocation will direct reclaim/compact (historical behaviour)

    khugepaged defrag will enter direct reclaim/compaction but not wake kswapd.

    Next it sets the default defrag option to be "madvise" to only enter
    direct reclaim/compaction for applications that specifically requested
    it.

    Lastly, it removes a check from the page allocator slowpath that is
    related to __GFP_THISNODE to allow "defer" to work. The callers that
    really care are slub/slab, and they are updated accordingly. The slab
    one may be surprising because it also corrects a comment, as kswapd was
    never woken up by that path.

    This means that by default a THP fault will no longer stall for most
    applications, which is the ideal for most users: they get THP if it is
    immediately available. There are still options for users who prefer a
    stall at startup of a new application: either restore the historical
    behaviour with "always" or pick a half-way point with "defer", where
    kswapd does some of the work in the background and wakes kcompactd if
    necessary. THP defrag for khugepaged remains enabled and will enter
    direct reclaim/compaction but not wake kswapd or kcompactd.

    After this patch a THP allocation failure will quickly fall back and
    rely on khugepaged to recover the situation at some time in the future.
    In some cases, this will reduce THP usage, but the benefit of THP is
    hard to measure and not a universal win, whereas a stall for
    reclaim/compaction is definitely measurable and can be painful.
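
    A hypothetical sketch of how the four modes map onto the reclaim bits
    of the fault-path gfp mask (the identifier names are illustrative, not
    the ones used by the patch):

      static gfp_t thp_fault_gfp(bool vma_madvised)
      {
              /* start from GFP_TRANSHUGE with both reclaim bits cleared */
              gfp_t gfp = GFP_TRANSHUGE &
                          ~(__GFP_DIRECT_RECLAIM | __GFP_KSWAPD_RECLAIM);

              switch (thp_defrag_mode) {      /* assumed global */
              case THP_DEFRAG_ALWAYS:         /* historical behaviour */
                      return gfp | __GFP_DIRECT_RECLAIM;
              case THP_DEFRAG_DEFER:          /* wake kswapd/kcompactd */
                      return gfp | __GFP_KSWAPD_RECLAIM;
              case THP_DEFRAG_MADVISE:        /* new default */
                      return vma_madvised ? gfp | __GFP_DIRECT_RECLAIM : gfp;
              default:                        /* "never" */
                      return gfp;
              }
      }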

    The first test for this is using "usemem" to read a large file and write
    a large anonymous mapping (to avoid the zero page) multiple times. The
    total size of the mappings is 80% of RAM and the benchmark simply
    measures how long it takes to complete. It uses multiple threads to see
    if that is a factor. On UMA, the performance is almost identical so is
    not reported but on NUMA, we see this

    usemem
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean System-1 102.86 ( 0.00%) 46.81 ( 54.50%)
    Amean System-4 37.85 ( 0.00%) 34.02 ( 10.12%)
    Amean System-7 48.12 ( 0.00%) 46.89 ( 2.56%)
    Amean System-12 51.98 ( 0.00%) 56.96 ( -9.57%)
    Amean System-21 80.16 ( 0.00%) 79.05 ( 1.39%)
    Amean System-30 110.71 ( 0.00%) 107.17 ( 3.20%)
    Amean System-48 127.98 ( 0.00%) 124.83 ( 2.46%)
    Amean Elapsd-1 185.84 ( 0.00%) 105.51 ( 43.23%)
    Amean Elapsd-4 26.19 ( 0.00%) 25.58 ( 2.33%)
    Amean Elapsd-7 21.65 ( 0.00%) 21.62 ( 0.16%)
    Amean Elapsd-12 18.58 ( 0.00%) 17.94 ( 3.43%)
    Amean Elapsd-21 17.53 ( 0.00%) 16.60 ( 5.33%)
    Amean Elapsd-30 17.45 ( 0.00%) 17.13 ( 1.84%)
    Amean Elapsd-48 15.40 ( 0.00%) 15.27 ( 0.82%)

    For a single thread, the benchmark completes 43.23% faster with this
    patch applied, with smaller benefits as the thread count increases.
    Similarly, notice the large reduction in system CPU usage in most
    cases. The overall CPU time is

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 10357.65 10438.33
    System 3988.88 3543.94
    Elapsed 2203.01 1634.41

    Which is substantial. Now, the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 128458477 278352931
    Major Faults 2174976 225
    Swap Ins 16904701 0
    Swap Outs 17359627 0
    Allocation stalls 43611 0
    DMA allocs 0 0
    DMA32 allocs 19832646 19448017
    Normal allocs 614488453 580941839
    Movable allocs 0 0
    Direct pages scanned 24163800 0
    Kswapd pages scanned 0 0
    Kswapd pages reclaimed 0 0
    Direct pages reclaimed 20691346 0
    Compaction stalls 42263 0
    Compaction success 938 0
    Compaction failures 41325 0

    This patch eliminates almost all swapping and direct reclaim activity.
    There is still overhead but it's from NUMA balancing which does not
    identify that it's pointless trying to do anything with this workload.

    I also tried the thpscale benchmark which forces a corner case where
    compaction can be used heavily and measures the latency of whether base
    or huge pages were used

    thpscale Fault Latencies
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Amean fault-base-1 5288.84 ( 0.00%) 2817.12 ( 46.73%)
    Amean fault-base-3 6365.53 ( 0.00%) 3499.11 ( 45.03%)
    Amean fault-base-5 6526.19 ( 0.00%) 4363.06 ( 33.15%)
    Amean fault-base-7 7142.25 ( 0.00%) 4858.08 ( 31.98%)
    Amean fault-base-12 13827.64 ( 0.00%) 10292.11 ( 25.57%)
    Amean fault-base-18 18235.07 ( 0.00%) 13788.84 ( 24.38%)
    Amean fault-base-24 21597.80 ( 0.00%) 24388.03 (-12.92%)
    Amean fault-base-30 26754.15 ( 0.00%) 19700.55 ( 26.36%)
    Amean fault-base-32 26784.94 ( 0.00%) 19513.57 ( 27.15%)
    Amean fault-huge-1 4223.96 ( 0.00%) 2178.57 ( 48.42%)
    Amean fault-huge-3 2194.77 ( 0.00%) 2149.74 ( 2.05%)
    Amean fault-huge-5 2569.60 ( 0.00%) 2346.95 ( 8.66%)
    Amean fault-huge-7 3612.69 ( 0.00%) 2997.70 ( 17.02%)
    Amean fault-huge-12 3301.75 ( 0.00%) 6727.02 (-103.74%)
    Amean fault-huge-18 6696.47 ( 0.00%) 6685.72 ( 0.16%)
    Amean fault-huge-24 8000.72 ( 0.00%) 9311.43 (-16.38%)
    Amean fault-huge-30 13305.55 ( 0.00%) 9750.45 ( 26.72%)
    Amean fault-huge-32 9981.71 ( 0.00%) 10316.06 ( -3.35%)

    The average time to fault pages is substantially reduced in the
    majority of cases, but with the obvious caveat that fewer THPs are
    actually used in this adverse workload

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Percentage huge-1 0.71 ( 0.00%) 14.04 (1865.22%)
    Percentage huge-3 10.77 ( 0.00%) 33.05 (206.85%)
    Percentage huge-5 60.39 ( 0.00%) 38.51 (-36.23%)
    Percentage huge-7 45.97 ( 0.00%) 34.57 (-24.79%)
    Percentage huge-12 68.12 ( 0.00%) 40.07 (-41.17%)
    Percentage huge-18 64.93 ( 0.00%) 47.82 (-26.35%)
    Percentage huge-24 62.69 ( 0.00%) 44.23 (-29.44%)
    Percentage huge-30 43.49 ( 0.00%) 55.38 ( 27.34%)
    Percentage huge-32 50.72 ( 0.00%) 51.90 ( 2.35%)

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 37429143 47564000
    Major Faults 1916 1558
    Swap Ins 1466 1079
    Swap Outs 2936863 149626
    Allocation stalls 62510 3
    DMA allocs 0 0
    DMA32 allocs 6566458 6401314
    Normal allocs 216361697 216538171
    Movable allocs 0 0
    Direct pages scanned 25977580 17998
    Kswapd pages scanned 0 3638931
    Kswapd pages reclaimed 0 207236
    Direct pages reclaimed 8833714 88
    Compaction stalls 103349 5
    Compaction success 270 4
    Compaction failures 103079 1

    Note again that while this does swap as it's an aggressive workload, the
    direct reclaim activity and allocation stalls are substantially reduced.
    There is some kswapd activity but ftrace showed that the kswapd activity
    was due to normal wakeups from 4K pages being allocated.
    Compaction-related stalls and activity are almost eliminated.

    I also tried the stutter benchmark. For this, I do not have figures for
    NUMA but it's something that does impact UMA so I'll report what is
    available

    stutter
    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Min mmap 7.3571 ( 0.00%) 7.3438 ( 0.18%)
    1st-qrtle mmap 7.5278 ( 0.00%) 17.9200 (-138.05%)
    2nd-qrtle mmap 7.6818 ( 0.00%) 21.6055 (-181.25%)
    3rd-qrtle mmap 11.0889 ( 0.00%) 21.8881 (-97.39%)
    Max-90% mmap 27.8978 ( 0.00%) 22.1632 ( 20.56%)
    Max-93% mmap 28.3202 ( 0.00%) 22.3044 ( 21.24%)
    Max-95% mmap 28.5600 ( 0.00%) 22.4580 ( 21.37%)
    Max-99% mmap 29.6032 ( 0.00%) 25.5216 ( 13.79%)
    Max mmap 4109.7289 ( 0.00%) 4813.9832 (-17.14%)
    Mean mmap 12.4474 ( 0.00%) 19.3027 (-55.07%)

    This benchmark is trying to fault an anonymous mapping while there is a
    heavy IO load -- a scenario that desktop users used to complain about
    frequently. This shows a mix because the ideal case of mapping with THP
    is not hit as often. However, note that 99% of the mappings complete
    13.79% faster. The CPU usage here is particularly interesting

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    User 67.50 0.99
    System 1327.88 91.30
    Elapsed 2079.00 2128.98

    And once again we look at the reclaim figures

    4.4.0 4.4.0
    kcompactd-v1r1 nodefrag-v1r3
    Minor Faults 335241922 1314582827
    Major Faults 715 819
    Swap Ins 0 0
    Swap Outs 0 0
    Allocation stalls 532723 0
    DMA allocs 0 0
    DMA32 allocs 1822364341 1177950222
    Normal allocs 1815640808 1517844854
    Movable allocs 0 0
    Direct pages scanned 21892772 0
    Kswapd pages scanned 20015890 41879484
    Kswapd pages reclaimed 19961986 41822072
    Direct pages reclaimed 21892741 0
    Compaction stalls 1065755 0
    Compaction success 514 0
    Compaction failures 1065241 0

    Allocation stalls and all direct reclaim activity are eliminated, as
    are compaction-related stalls.

    THP gives impressive gains in some cases but only if they are quickly
    available. We're not going to reach the point where they are completely
    free, so let's finally take the costs out of the fast paths and defer
    them to kswapd, kcompactd and khugepaged, where they belong.

    Signed-off-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • If an oom killed thread calls mempool_alloc(), it is possible that it'll
    loop forever if there are no elements on the freelist since
    __GFP_NOMEMALLOC prevents it from accessing needed memory reserves in
    oom conditions.

    Only set __GFP_NOMEMALLOC if there are elements on the freelist. If
    there are no free elements, allow allocations without the bit set so
    that memory reserves can be accessed if needed.

    Additionally, using mempool_alloc() with __GFP_NOMEMALLOC is not
    supported since the implementation can loop forever without accessing
    memory reserves when needed.
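
    A sketch of the idea inside mempool_alloc() (simplified, not the
    literal hunk):

      /* keep emergency reserves off-limits only while the pool still has
       * elements to fall back on; once it runs dry, let the allocation
       * dip into reserves instead of looping forever */
      gfp_t gfp_temp = gfp_mask;

      if (likely(pool->curr_nr))
              gfp_temp |= __GFP_NOMEMALLOC;

      element = pool->alloc(gfp_temp, pool->pool_data);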

    Signed-off-by: David Rientjes
    Cc: Greg Thelen
    Cc: Michal Hocko
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Since __GFP_NOACCOUNT was removed by commit 20b5c3039863 ("Revert 'gfp:
    add __GFP_NOACCOUNT'"), its description is not necessary.

    Signed-off-by: Satoru Takeuchi
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Satoru Takeuchi
     
  • In machines with 140G of memory and enterprise flash storage, we have
    seen read and write bursts routinely exceed the kswapd watermarks and
    cause thundering herds in direct reclaim. Unfortunately, the only way
    to tune kswapd aggressiveness is through adjusting min_free_kbytes - the
    system's emergency reserves - which is entirely unrelated to the
    system's latency requirements. In order to get kswapd to maintain a
    250M buffer of free memory, the emergency reserves need to be set to 1G.
    That is a lot of memory wasted for no good reason.

    On the other hand, it's reasonable to assume that allocation bursts and
    overall allocation concurrency scale with memory capacity, so it makes
    sense to make kswapd aggressiveness a function of that as well.

    Change the kswapd watermark scale factor from the currently fixed 25% of
    the tunable emergency reserve to a tunable 0.1% of memory.

    Beyond 1G of memory, this will produce bigger watermark steps than the
    current formula in default settings. Ensure that the new formula never
    chooses steps smaller than that, i.e. 25% of the emergency reserve.

    On a 140G machine, this raises the default watermark steps - the
    distance between min and low, and low and high - from 16M to 143M.
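
    A quick sanity check of that figure (my arithmetic, not from the
    changelog):

      0.1% of 140 GB = 0.001 * 140 * 1024 MB ~= 143 MB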

    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There are a few things about the *pte_alloc*() helpers worth cleaning
    up:

    - The 'vma' argument is unused; let's drop it.

    - Most __pte_alloc() callers do a speculative check for pmd_none()
    before taking the ptl; let's introduce a pte_alloc() macro which does
    the check (sketched below).

    The only direct user of __pte_alloc left is userfaultfd, which has
    different expectations about atomicity wrt the pmd.

    - pte_alloc_map() and pte_alloc_map_lock() are redefined using
    pte_alloc().
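
    Roughly, the new macro and the redefinitions look like this (a sketch
    from memory, not the exact header):

      #define pte_alloc(mm, pmd, address)                             \
              (unlikely(pmd_none(*(pmd))) && __pte_alloc(mm, pmd, address))

      #define pte_alloc_map(mm, pmd, address)                         \
              (pte_alloc(mm, pmd, address) ?                          \
                      NULL : pte_offset_map(pmd, address))

      #define pte_alloc_map_lock(mm, pmd, address, ptlp)              \
              (pte_alloc(mm, pmd, address) ?                          \
                      NULL : pte_offset_map_lock(mm, pmd, address, ptlp))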

    [sudeep.holla@arm.com: fix build for arm64 hugetlbpage]
    [sfr@canb.auug.org.au: fix arch/arm/mm/mmu.c some more]
    Signed-off-by: Kirill A. Shutemov
    Cc: Dave Hansen
    Signed-off-by: Sudeep Holla
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
    statistics protocol, corresponding to 'Available' in /proc/meminfo.

    It indicates to the hypervisor how big the balloon can be inflated
    without pushing the guest system to swap.

    Signed-off-by: Igor Redko
    Signed-off-by: Denis V. Lunev
    Reviewed-by: Roman Kagan
    Cc: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Redko
     
  • Add a new field, VIRTIO_BALLOON_S_AVAIL, to virtio_balloon memory
    statistics protocol, corresponding to 'Available' in /proc/meminfo.

    It indicates to the hypervisor how big the balloon can be inflated
    without pushing the guest system to swap. This metric would be very
    useful in VM orchestration software to improve memory management of
    different VMs under overcommit.

    This patch (of 2):

    Factor out calculation of the available memory counter into a separate
    exportable function, in order to be able to use it in other parts of the
    kernel.

    In particular, it appears to be a relevant metric to report to the
    hypervisor via the virtio-balloon statistics interface (in a follow-up
    patch).
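
    A sketch of how the driver side can then report it (the helper names
    are from my memory of drivers/virtio/virtio_balloon.c and should be
    treated as assumptions):

      /* in update_balloon_stats(), roughly: convert the newly exported
       * available-memory estimate to bytes and add it to the stat array */
      s64 available = si_mem_available();

      update_stat(vb, idx++, VIRTIO_BALLOON_S_AVAIL,
                  pages_to_bytes(available));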

    Signed-off-by: Igor Redko
    Signed-off-by: Denis V. Lunev
    Reviewed-by: Roman Kagan
    Cc: Michael S. Tsirkin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Igor Redko
     
  • MEMORY_HOTPLUG already depends on ARCH_ENABLE_MEMORY_HOTPLUG, which is
    selected by the supported architectures, so the extra arch dependency
    is unnecessary.

    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • With the THP refcounting work, there is no need to mark PMDs as
    splitting.

    (ARC was missed in the sweeping arch change, as THP support was likely
    not present in the original baseline.)

    Signed-off-by: Vineet Gupta
    Cc: Kirill A. Shutemov

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • We remove one instance of flush_tlb_range here. It was added by commit
    f714f4f20e59 ("mm: numa: call MMU notifiers on THP migration"), but
    pmdp_huge_clear_flush_notify should have done the required flush for
    us. Hence remove the extra flush.

    Signed-off-by: Aneesh Kumar K.V
    Cc: Mel Gorman
    Cc: "Kirill A. Shutemov"
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Get list of VMA flags up-to-date and sort it to match VM_* definition
    order.

    [vbabka@suse.cz: add a note above vmaflag definitions to update the names when changing]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Currently we have two copies of the same code which implements memory
    overcommitment logic. Let's move it into mm/util.c and hence avoid
    duplication. No functional changes here.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • The max_map_count sysctl is unrelated to the scheduler. Move its bits
    from include/linux/sched/sysctl.h to include/linux/mm.h.

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • Count how many times we put a THP on the split queue. Currently, this
    happens on partial unmap of a THP.

    A rapidly growing value can indicate that an application behaves
    unfriendly wrt THP: it often faults in a huge page and then unmaps part
    of it. This leads to unnecessary memory fragmentation and the
    application may require tuning.

    The event also can help with debugging kernel [mis-]behaviour.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • A page is activated on refault if the refault distance stored in the
    corresponding shadow entry is less than the number of active file pages.
    Since active file pages can't occupy more than half memory, we assume
    that the maximal effective refault distance can't be greater than half
    the number of present pages and size the shadow nodes lru list
    appropriately. Generally speaking, this assumption is correct, but it
    can result in wasting a considerable chunk of memory on stale shadow
    nodes in case the portion of file pages is small, e.g. if a workload
    mostly uses anonymous memory.

    To sort this out, we need to compute the size of the shadow nodes lru
    based not on the maximal possible size, but on the current size of the
    file cache. We could take the size of the active file lru as the
    maximal refault distance, but the active lru is pretty unstable - it
    can shrink dramatically at runtime, possibly disrupting the workingset
    detection logic.

    Instead we assume that the maximal refault distance equals half the
    total number of file cache pages. This will protect us against active
    file lru size fluctuations while still being correct, because the size
    of the active lru is normally kept lower than the size of the inactive
    lru.
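
    A purely illustrative sketch of the sizing rule described above (the
    variable names and the per-node divisor are assumptions, not the actual
    code in mm/workingset.c):

      /* cap the shadow-node LRU relative to the current file cache:
       * max refault distance ~ half of the file pages, and each radix
       * tree node holds up to RADIX_TREE_MAP_SIZE shadow entries */
      unsigned long file_pages = nr_active_file + nr_inactive_file;
      unsigned long max_distance = file_pages / 2;
      unsigned long max_nodes = max_distance / RADIX_TREE_MAP_SIZE;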

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Allocation of radix_tree_node objects can be easily triggered from
    userspace, so we should account them to the memory cgroup. Besides, we
    need them accounted in order to make the shadow node shrinker memcg
    aware (see mm/workingset.c).

    A tricky thing about accounting radix_tree_node objects is that they are
    mostly allocated through radix_tree_preload(), so we can't just set
    SLAB_ACCOUNT for radix_tree_node_cachep - that would likely result in a
    lot of unrelated cgroups using objects from each other's caches.

    One way to overcome this would be making radix tree preloads per memcg,
    but that would probably look cumbersome and overcomplicated.

    Instead, we make radix_tree_node_alloc() first try to allocate from the
    cache with __GFP_ACCOUNT, no matter if the caller has preloaded or not,
    and only if it fails fall back on using per cpu preloads. This should
    make most allocations accounted.
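
    A simplified sketch of that allocation order (the fallback helper name
    is made up for the sketch):

      static struct radix_tree_node *node_alloc_sketch(gfp_t gfp_mask)
      {
              struct radix_tree_node *node;

              /* try the accounted slab allocation first, whether or not
               * the caller preloaded; __GFP_NOWARN since failure is fine */
              node = kmem_cache_alloc(radix_tree_node_cachep,
                              gfp_mask | __GFP_ACCOUNT | __GFP_NOWARN);
              if (node)
                      return node;

              /* only then dip into the per-cpu preload pool */
              return preload_pop();           /* hypothetical helper */
      }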

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • As kmem accounting is now either enabled for all cgroups or disabled
    system-wide, there's no point in having memcg_kmem_online() helper -
    instead one can use memcg_kmem_enabled() and mem_cgroup_online(), as
    shrink_slab() now does.

    There are only two places left where this helper is used -
    __memcg_kmem_charge() and memcg_create_kmem_cache(). The former can
    only be called if memcg_kmem_enabled() returned true. Since the cgroup
    it operates on is online, mem_cgroup_is_root() check will be enough.

    memcg_create_kmem_cache() can't use mem_cgroup_online() helper instead
    of memcg_kmem_online(), because it relies on the fact that in
    memcg_offline_kmem() memcg->kmem_state is changed before
    memcg_deactivate_kmem_caches() is called, but there we can just
    open-code the check.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • It's just convenient to implement a memcg aware shrinker when you know
    that shrink_control->memcg != NULL unless memcg_kmem_enabled() returns
    false.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Workingset code was recently made memcg aware, but shadow node shrinker
    is still global. As a result, one small cgroup can consume all memory
    available for shadow nodes, possibly hurting other cgroups by reclaiming
    their shadow nodes, even though reclaim distances stored in its shadow
    nodes have no effect. To avoid this, we need to make shadow node
    shrinker memcg aware.

    The actual work is done in patch 6 of the series. Patches 1 and 2
    prepare memcg/shrinker infrastructure for the change. Patch 3 is just a
    collateral cleanup. Patch 4 makes radix_tree_node accounted, which is
    necessary for making shadow node shrinker memcg aware. Patch 5 reduces
    shadow nodes overhead in case workload mostly uses anonymous pages.

    This patch:

    Currently, in the legacy hierarchy, kmem accounting is off for all
    cgroups by default and must be enabled explicitly by writing something
    to memory.kmem.limit_in_bytes. Since we don't support reclaim on
    hitting the kmem limit, nor do we have any plans to implement it, the
    value written is likely to be -1, just to enable kmem accounting and
    limit kernel memory consumption via memory.limit_in_bytes along with
    user memory.

    This user API was introduced when the implementation of kmem accounting
    lacked slab shrinker support and hence was useless in practice. Things
    have changed since then - slab shrinkers were made memcg aware, the
    accounting overhead seems to be negligible, and a failure to charge a
    kmem allocation should not have critical consequences, because we only
    account those kernel objects that should be safe to fail. That's why
    kmem accounting is enabled by default for all cgroups in the default
    hierarchy, which will eventually replace the legacy one.

    The ability to enable kmem accounting for some cgroups while keeping it
    disabled for others is getting difficult to maintain. E.g. to make
    shadow node shrinker memcg aware (see mm/workingset.c), we need to know
    the relationship between the number of shadow nodes allocated for a
    cgroup and the size of its lru list. If kmem accounting is enabled for
    all cgroups there is no problem, but what should we do if kmem
    accounting is enabled only for half of the cgroups? We have no other
    choice but to use global lru stats while scanning the root cgroup's
    shadow nodes, but that would be wrong if kmem accounting were enabled
    for all cgroups (which is the case if the unified hierarchy is used),
    in which case we should use the lru stats of the root cgroup's lruvec.

    That being said, let's enable kmem accounting for all memory cgroups by
    default. If one finds it unstable or too costly, it can always be
    disabled system-wide by passing cgroup.memory=nokmem to the kernel at
    boot time.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Sometimes gcc mysteriously doesn't inline
    very small functions we expect to be inlined. See

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

    With this .config:
    http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
    the following functions get deinlined many times.
    Examples of disassembly:

    (43 copies, 141 calls):
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    f0 80 0f 08 lock orb $0x8,(%rdi)
    5d pop %rbp
    c3 retq

    (10 copies, 134 calls):
    48 8b 07 mov (%rdi),%rax
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    48 c1 e8 0b shr $0xb,%rax
    83 e0 01 and $0x1,%eax
    5d pop %rbp
    c3 retq

    This patch fixes this via s/inline/__always_inline/.
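
    The change itself is mechanical; a generic illustration (not one of the
    actual hunks):

      /* before: at -Os the compiler may decide this is not worth inlining */
      static inline void example_set_flag(struct page *page)
      {
              set_bit(PG_referenced, &page->flags);
      }

      /* after: force the single-instruction body to be inlined */
      static __always_inline void example_set_flag(struct page *page)
      {
              set_bit(PG_referenced, &page->flags);
      }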

    Code size decrease after the patch is ~7k:

    text data bss dec hex filename
    92125002 20826048 36417536 149368586 8e72f0a vmlinux
    92118087 20826112 36417536 149361735 8e71447 vmlinux7_pageops_after

    Signed-off-by: Denys Vlasenko
    Cc: Ingo Molnar
    Cc: Thomas Graf
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • With both gcc 4.7.2 and 4.9.2, sometimes gcc mysteriously doesn't inline
    very small functions we expect to be inlined. See

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=66122

    With this .config:
    http://busybox.net/~vda/kernel_config_OPTIMIZE_INLINING_and_Os,
    set_buffer_foo(), clear_buffer_foo() and similar functions get deinlined
    about 60 times. Examples of disassembly:

    (14 copies, 43 calls):
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    f0 80 0f 20 lock orb $0x20,(%rdi)
    5d pop %rbp
    c3 retq
    (3 copies, 34 calls):
    48 8b 07 mov (%rdi),%rax
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    48 c1 e8 05 shr $0x5,%rax
    83 e0 01 and $0x1,%eax
    5d pop %rbp
    c3 retq
    (5 copies, 13 calls):
    55 push %rbp
    48 89 e5 mov %rsp,%rbp
    f0 80 0f 40 lock orb $0x40,(%rdi)
    5d pop %rbp
    c3 retq

    This patch fixes this via s/inline/__always_inline/.
    This decreases vmlinux by about 3 kbytes.

    text data bss dec hex filename
    88200439 19905208 36421632 144527279 89d4faf vmlinux2
    88197239 19905240 36421632 144524111 89d434f vmlinux

    Signed-off-by: Denys Vlasenko
    Cc: Ingo Molnar
    Cc: Thomas Graf
    Cc: Peter Zijlstra
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denys Vlasenko
     
  • This adds two command line options:

    -c|--cgroup path|@inode Walk only pages owned by this memory cgroup
    -C|--list-cgroup Show memory cgroup inodes

    [vdavydov@virtuozzo.com: opt_cgroup should be uint64_t. Fix conflicts with "tools/vm/page-types.c: support swap entry"]
    Signed-off-by: Konstantin Khlebnikov
    Cc: Naoya Horiguchi
    Reviewed-by: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
  • Similarly to direct reclaim/compaction, kswapd attempts to combine
    reclaim and compaction to make a memory allocation of a given order
    available.

    The details differ from direct reclaim, e.g. in having the high
    watermark as a goal. The code involved in kswapd's reclaim/compaction
    decisions has evolved to be quite complex.

    Testing reveals that it doesn't actually work in at least one scenario,
    and closer inspection suggests that it could be greatly simplified
    without compromising on the goal (make a high-order page available) or
    efficiency (don't reclaim too much). The simplification relies on doing
    all compaction in kcompactd, which is simply woken up when the high
    watermarks are reached by kswapd's reclaim.

    The scenario where kswapd compaction doesn't work was found with mmtests
    test stress-highalloc configured to attempt order-9 allocations without
    direct reclaim, just waking up kswapd. There was no compaction attempt
    from kswapd during the whole test. Some added instrumentation shows
    what happens:

    - balance_pgdat() sets end_zone to Normal, as it's not balanced
    - reclaim is attempted on DMA zone, which sets nr_attempted to 99, but
    it cannot reclaim anything, so sc.nr_reclaimed is 0
    - for zones DMA32 and Normal, kswapd_shrink_zone uses testorder=0, so
    it merely checks if high watermarks were reached for base pages.
    This is true, so no reclaim is attempted. For DMA, testorder=0
    wasn't used, as compaction_suitable() returned COMPACT_SKIPPED
    - even though the pgdat_needs_compaction flag wasn't set to false, no
    compaction happens due to the condition sc.nr_reclaimed >
    nr_attempted being false (as 0 < 99)
    - priority-- due to nr_reclaimed being 0; repeat until priority reaches
    0. pgdat_balanced() is false, as only the small zone DMA appears
    balanced (curiously, in that check the watermark appears OK and
    compaction_suitable() returns COMPACT_PARTIAL, because a lower
    classzone_idx is used there)

    Now, even if it was decided that reclaim shouldn't be attempted on the
    DMA zone, the scenario would be the same, as (sc.nr_reclaimed=0 >
    nr_attempted=0) is also false. The condition really should use >= as
    the comment suggests. Then there is a mismatch in the check for setting
    pgdat_needs_compaction to false using low watermark, while the rest uses
    high watermark, and who knows what other subtlety. Hopefully this
    demonstrates that this is unsustainable.

    Luckily we can simplify this a lot. The reclaim/compaction decisions
    make sense for the direct reclaim scenario, but in kswapd our primary
    goal is to reach the high watermark in order-0 pages. Afterwards we can
    attempt compaction just once. Unlike direct reclaim, we don't reclaim
    extra pages (over the high watermark); the current code already
    disallows it for good reasons.

    After this patch, we simply wake up kcompactd to process the pgdat,
    after we have either succeeded or failed to reach the high watermarks in
    kswapd, which goes to sleep. We pass kswapd's order and classzone_idx,
    so kcompactd can apply the same criteria to determine which zones are
    worth compacting. Note that we use the classzone_idx from
    wakeup_kswapd(), not balanced_classzone_idx which can include higher
    zones that kswapd tried to balance too, but didn't consider them in
    pgdat_balanced().
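
    A sketch of the resulting hand-off (simplified; not the exact code in
    kswapd_try_to_sleep()):

      /* before napping, pass the order and classzone_idx of the original
       * request to the per-node compaction daemon so it can apply the
       * same criteria */
      wakeup_kcompactd(pgdat, alloc_order, classzone_idx);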

    Since kswapd now cannot create high-order pages itself, we need to
    adjust how it determines the zones to be balanced. The key element here
    is adding a "highorder" parameter to zone_balanced, which, when set to
    false, makes it consider only order-0 watermark instead of the desired
    higher order (this was done previously by kswapd_shrink_zone(), but not
    elsewhere). False is passed, for example, in pgdat_balanced().
    Importantly, wakeup_kswapd() uses true to make sure kswapd, and thus
    kcompactd, are woken up for a high-order allocation failure.

    The last thing is to decide what to do with pageblock_skip bitmap
    handling. Compaction maintains a pageblock_skip bitmap to record
    pageblocks where isolation recently failed. This bitmap can be reset in
    three ways:

    1) direct compaction is restarting after going through the full deferred cycle

    2) kswapd goes to sleep, and some other direct compaction has previously
    finished scanning the whole zone and set zone->compact_blockskip_flush.
    Note that a successful direct compaction clears this flag.

    3) compaction was invoked manually via trigger in /proc

    The case 2) is somewhat fuzzy to begin with, but after introducing
    kcompactd we should update it. The check for direct compaction in 1),
    and the one to set the flush flag in 2), use current_is_kswapd(), which
    doesn't work for kcompactd. Thus, this patch adds bool direct_compaction to
    compact_control to use in 2). For the case 1) we remove the check
    completely - unlike the former kswapd compaction, kcompactd does use the
    deferred compaction functionality, so flushing tied to restarting from
    deferred compaction makes sense here.

    Note that when kswapd goes to sleep, kcompactd is woken up, so it will
    see the flushed pageblock_skip bits. This is different from when the
    former kswapd compaction observed the bits and I believe it makes more
    sense. Kcompactd can afford to be more thorough than a direct
    compaction trying to limit allocation latency, or kswapd whose primary
    goal is to reclaim.

    For testing, I used stress-highalloc configured to do order-9
    allocations with GFP_NOWAIT|__GFP_HIGH|__GFP_COMP, so they relied just
    on kswapd/kcompactd reclaim/compaction (the interfering kernel builds in
    phases 1 and 2 work as usual):

    stress-highalloc
    4.5-rc1+before 4.5-rc1+after
    -nodirect -nodirect
    Success 1 Min 1.00 ( 0.00%) 5.00 (-66.67%)
    Success 1 Mean 1.40 ( 0.00%) 6.20 (-55.00%)
    Success 1 Max 2.00 ( 0.00%) 7.00 (-16.67%)
    Success 2 Min 1.00 ( 0.00%) 5.00 (-66.67%)
    Success 2 Mean 1.80 ( 0.00%) 6.40 (-52.38%)
    Success 2 Max 3.00 ( 0.00%) 7.00 (-16.67%)
    Success 3 Min 34.00 ( 0.00%) 62.00 ( 1.59%)
    Success 3 Mean 41.80 ( 0.00%) 63.80 ( 1.24%)
    Success 3 Max 53.00 ( 0.00%) 65.00 ( 2.99%)

    User 3166.67 3181.09
    System 1153.37 1158.25
    Elapsed 1768.53 1799.37

    4.5-rc1+before 4.5-rc1+after
    -nodirect -nodirect
    Direct pages scanned 32938 32797
    Kswapd pages scanned 2183166 2202613
    Kswapd pages reclaimed 2152359 2143524
    Direct pages reclaimed 32735 32545
    Percentage direct scans 1% 1%
    THP fault alloc 579 612
    THP collapse alloc 304 316
    THP splits 0 0
    THP fault fallback 793 778
    THP collapse fail 11 16
    Compaction stalls 1013 1007
    Compaction success 92 67
    Compaction failures 920 939
    Page migrate success 238457 721374
    Page migrate failure 23021 23469
    Compaction pages isolated 504695 1479924
    Compaction migrate scanned 661390 8812554
    Compaction free scanned 13476658 84327916
    Compaction cost 262 838

    After this patch we see improvements in allocation success rate
    (especially for phase 3) along with increased compaction activity. The
    compaction stalls (direct compaction) in the interfering kernel builds
    (probably THP's) also decreased somewhat thanks to kcompactd activity,
    yet THP alloc successes improved a bit.

    Note that elapsed and user time isn't so useful for this benchmark,
    because of the background interference being unpredictable. It's just
    to quickly spot some major unexpected differences. System time is
    somewhat more useful and that didn't increase.

    Also (after adjusting mmtests' ftrace monitor):

    Time kswapd awake 2547781 2269241
    Time kcompactd awake 0 119253
    Time direct compacting 939937 557649
    Time kswapd compacting 0 0
    Time kcompactd compacting 0 119099

    The decrease in overall time spent compacting appears not to match the
    increased compaction stats. I suspect the tasks get rescheduled and,
    since the ftrace monitor doesn't see that, the reported time is wall
    time, not CPU time. But arguably direct compactors care about overall
    latency anyway; whether busy compacting or waiting for CPU doesn't
    matter. And that latency seems to be almost halved.

    It's also interesting how much time kswapd spent awake just going
    through all the priorities and failing to even try compacting, over and
    over.

    We can also configure stress-highalloc to perform both direct
    reclaim/compaction and wakeup kswapd/kcompactd, by using
    GFP_KERNEL|__GFP_HIGH|__GFP_COMP:

    stress-highalloc
    4.5-rc1+before 4.5-rc1+after
    -direct -direct
    Success 1 Min 4.00 ( 0.00%) 9.00 (-50.00%)
    Success 1 Mean 8.00 ( 0.00%) 10.00 (-19.05%)
    Success 1 Max 12.00 ( 0.00%) 11.00 ( 15.38%)
    Success 2 Min 4.00 ( 0.00%) 9.00 (-50.00%)
    Success 2 Mean 8.20 ( 0.00%) 10.00 (-16.28%)
    Success 2 Max 13.00 ( 0.00%) 11.00 ( 8.33%)
    Success 3 Min 75.00 ( 0.00%) 74.00 ( 1.33%)
    Success 3 Mean 75.60 ( 0.00%) 75.20 ( 0.53%)
    Success 3 Max 77.00 ( 0.00%) 76.00 ( 0.00%)

    User 3344.73 3246.04
    System 1194.24 1172.29
    Elapsed 1838.04 1836.76

    4.5-rc1+before 4.5-rc1+after
    -direct -direct
    Direct pages scanned 125146 120966
    Kswapd pages scanned 2119757 2135012
    Kswapd pages reclaimed 2073183 2108388
    Direct pages reclaimed 124909 120577
    Percentage direct scans 5% 5%
    THP fault alloc 599 652
    THP collapse alloc 323 354
    THP splits 0 0
    THP fault fallback 806 793
    THP collapse fail 17 16
    Compaction stalls 2457 2025
    Compaction success 906 518
    Compaction failures 1551 1507
    Page migrate success 2031423 2360608
    Page migrate failure 32845 40852
    Compaction pages isolated 4129761 4802025
    Compaction migrate scanned 11996712 21750613
    Compaction free scanned 214970969 344372001
    Compaction cost 2271 2694

    In this scenario, this patch doesn't change the overall success rate as
    direct compaction already tries all it can. There's however significant
    reduction in direct compaction stalls (that is, the number of
    allocations that went into direct compaction). The number of successes
    (i.e. direct compaction stalls that ended up with successful
    allocation) is reduced by the same number. This means the offload to
    kcompactd is working as expected, and direct compaction is reduced
    either due to detecting contention, or compaction deferred by kcompactd.
    In the previous version of this patchset there was some apparent
    reduction of success rate, but the changes in this version (such as
    using sync compaction only), new baseline kernel, and/or averaging
    results from 5 executions (my bet), made this go away.

    Ftrace-based stats seem to roughly agree:

    Time kswapd awake 2532984 2326824
    Time kcompactd awake 0 257916
    Time direct compacting 864839 735130
    Time kswapd compacting 0 0
    Time kcompactd compacting 0 257585

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • We can reuse the nid we've determined instead of repeated pfn_to_nid()
    usages. Also zone_to_nid() should be a bit cheaper in general than
    pfn_to_nid().

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Memory compaction can be currently performed in several contexts:

    - kswapd balancing a zone after a high-order allocation failure
    - direct compaction to satisfy a high-order allocation, including THP
    page fault attempts
    - khugepaged trying to collapse a hugepage
    - manually from /proc

    The purpose of compaction is two-fold. The obvious purpose is to
    satisfy a (pending or future) high-order allocation, and is easy to
    evaluate. The other purpose is to keep overall memory fragmentation low
    and help the anti-fragmentation mechanism. Success wrt the latter
    purpose is more difficult to evaluate.

    The current situation wrt the purposes has a few drawbacks:

    - compaction is invoked only when a high-order page or hugepage is not
    available (or manually). This might be too late for the purposes of
    keeping memory fragmentation low.
    - direct compaction increases latency of allocations. Again, it would
    be better if compaction was performed asynchronously to keep
    fragmentation low, before the allocation itself comes.
    - (a special case of the previous) the cost of compaction during THP
    page faults can easily offset the benefits of THP.
    - kswapd compaction appears to be complex, fragile and not working in
    some scenarios. It could also end up compacting for a high-order
    allocation request when it should be reclaiming memory for a later
    order-0 request.

    To improve the situation, we should be able to benefit from an
    equivalent of kswapd, but for compaction - i.e. a background thread
    which responds to fragmentation and the need for high-order allocations
    (including hugepages) somewhat proactively.

    One possibility is to extend the responsibilities of kswapd, which could
    however complicate its design too much. It should be better to let
    kswapd handle reclaim, as order-0 allocations are often more critical
    than high-order ones.

    Another possibility is to extend khugepaged, but this kthread is a
    single instance and tied to THP configs.

    This patch goes with the option of a new set of per-node kthreads called
    kcompactd, and lays the foundations, without introducing any new
    tunables. The lifecycle mimics kswapd kthreads, including the memory
    hotplug hooks.

    For compaction, kcompactd uses the standard compaction_suitable() and
    compact_finished() criteria and the deferred compaction functionality.
    Unlike direct compaction, it uses only sync compaction, as there's no
    allocation latency to minimize.

    This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
    compact/reclaim loop for high-order pages will be replaced by waking up
    kcompactd in the next patch, which describes what's wrong with the old
    approach.

    Waking up of the kcompactd threads is also tied to kswapd activity and
    follows these rules:
    - we don't want to affect any fastpaths, so wake up kcompactd only from
    the slowpath, as it's done for kswapd
    - if kswapd is doing reclaim, it's more important than compaction, so
    don't invoke kcompactd until kswapd goes to sleep
    - the target order used for kswapd is passed to kcompactd

    Possible future uses for kcompactd include the ability to wake it up
    on demand in special situations, such as when hugepages are not
    available (currently not done due to __GFP_NO_KSWAPD) or when a
    fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
    possible to perform periodic compaction with kcompactd.
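
    The daemon's main loop is, schematically, the usual kthread pattern (a
    sketch; the helper names are as I recall them from mm/compaction.c and
    should be treated as assumptions):

      static int kcompactd(void *p)
      {
              pg_data_t *pgdat = (pg_data_t *)p;

              set_freezable();
              while (!kthread_should_stop()) {
                      /* sleep until kswapd (or hotplug) asks for work */
                      wait_event_freezable(pgdat->kcompactd_wait,
                                      kcompactd_work_requested(pgdat));

                      /* sync-compact the node for the requested order,
                       * honouring compaction_suitable() and deferral */
                      kcompactd_do_work(pgdat);
              }
              return 0;
      }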

    [arnd@arndb.de: fix build errors with kcompactd]
    [paul.gortmaker@windriver.com: don't use modular references for non modular code]
    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Arnd Bergmann
    Signed-off-by: Paul Gortmaker
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • During work on kcompactd integration I have spotted a confusing check
    of balanced_classzone_idx, which I believe is bogus.

    The balanced_classzone_idx is filled by balance_pgdat() as the highest
    zone it attempted to balance. This was introduced by commit dc83edd941f4
    ("mm: kswapd: use the classzone idx that kswapd was using for
    sleeping_prematurely()").

    The intention is that (as expressed in today's function names), the
    value used for kswapd_shrink_zone() calls in balance_pgdat() is the same
    as for the decisions in kswapd_try_to_sleep().

    An unwanted side-effect of that commit was breaking the checks in
    kswapd() whether there was another kswapd_wakeup with a tighter (=lower)
    classzone_idx. Commits 215ddd6664ce ("mm: vmscan: only read
    new_classzone_idx from pgdat when reclaiming successfully") and
    d2ebd0f6b895 ("kswapd: avoid unnecessary rebalance after an unsuccessful
    balancing") tried to fixed, but apparently introduced a bogus check that
    this patch removes.

    Consider zone indexes X < Y < Z, where:
    - Z is the value used for the first kswapd wakeup.
    - Y is returned as balanced_classzone_idx, which means zones with index higher
    than Y (including Z) were found to be unreclaimable.
    - X is the value used for the second kswapd wakeup

    The new wakeup with value X means that kswapd is now supposed to
    balance harder all zones with index < Z, but it will instead go to
    sleep and won't read the new value X. This is subtly wrong.

    The effect of this patch is that kswapd will react better in some
    situations, where e.g. the first wakeup is for ZONE_DMA32, the second
    is for ZONE_DMA, and due to an unreclaimable ZONE_NORMAL. Before this
    patch, kswapd would go to sleep instead of reclaiming ZONE_DMA harder.
    I expect these situations are very rare, and there is more value in the
    better maintainability gained by removing the confusing and bogus
    check.

    Signed-off-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Rik van Riel
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.

    Signed-off-by: Joonsoo Kim
    Cc: Benjamin Herrenschmidt
    Acked-by: Chris Metcalf
    Cc: Christian Borntraeger
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Cc: Christian Borntraeger
    Cc: Benjamin Herrenschmidt
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.

    [akpm@linux-foundation.org: export _debug_pagealloc_enabled to modules]
    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Takashi Iwai
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Christian Borntraeger
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.

    [akpm@linux-foundation.org: clean up code, per Christian]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Christian Borntraeger
    Cc: Benjamin Herrenschmidt
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Pekka Enberg
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • As CONFIG_DEBUG_PAGEALLOC can be enabled/disabled via kernel
    parameters, we can optimize some cases by checking the enablement
    state.

    This is follow-up work for Christian's "Optimize CONFIG_DEBUG_PAGEALLOC":

    https://lkml.org/lkml/2016/1/27/194

    The remaining work is to make sparc aware of this, but that doesn't
    look easy to me, so I skip it in this series.

    This patch (of 5):

    We can disable debug_pagealloc processing even if the code is compiled
    with CONFIG_DEBUG_PAGEALLOC. This patch changes the code to query
    whether it is enabled or not at runtime.
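
    The runtime check used throughout the series is simply (as I recall it
    from include/linux/mm.h; a sketch):

      /* skip the expensive debug path unless debug_pagealloc was actually
       * enabled on the kernel command line */
      if (!debug_pagealloc_enabled())
              return;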

    [akpm@linux-foundation.org: update comment, per David. Adjust comment to use 80 cols]
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Christian Borntraeger
    Acked-by: David Rientjes
    Cc: Benjamin Herrenschmidt
    Cc: Takashi Iwai
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • /proc/pid/pagemap (pte_to_pagemap_entry() internally) already reports
    swap entries, so let's make the in-kernel utility aware of them.

    Signed-off-by: Naoya Horiguchi
    Cc: Vladimir Davydov
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi