30 Apr, 2019

1 commit

  • Make hibernate handle unmapped pages on the direct map when
    CONFIG_ARCH_HAS_SET_ALIAS=y is set. The functions guarded by this option
    allow pages to be set to invalid configurations, so hibernate should now
    check whether pages have valid mappings and handle the case where they
    are unmapped when doing a hibernate save operation.

    Previously this check was only done when CONFIG_DEBUG_PAGEALLOC=y was
    configured. The change does not appear to have a significant impact on
    hibernation performance. The speed of the save operation before this
    change was measured at 819.02 MB/s, and after at 813.32 MB/s.

    Before:
    [ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)

    After:
    [ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)
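
    The following is a minimal user-space sketch of the idea above, not the
    kernel implementation: before a page is copied into the hibernation
    image, check whether it is currently mapped and, if not, temporarily map
    it for the copy. All names here (fake_page, copy_page_into_image,
    save_page) are hypothetical stand-ins.

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    struct fake_page {
        bool mapped;                /* models presence in the direct map */
        unsigned char data[4096];
    };

    static void copy_page_into_image(unsigned char *dst, const struct fake_page *p)
    {
        memcpy(dst, p->data, sizeof(p->data));
    }

    /* Save one page, handling the case where it is currently unmapped. */
    static void save_page(unsigned char *dst, struct fake_page *p)
    {
        if (p->mapped) {
            copy_page_into_image(dst, p);
            return;
        }
        p->mapped = true;           /* temporarily make the mapping valid */
        copy_page_into_image(dst, p);
        p->mapped = false;          /* restore the unmapped state */
    }

    int main(void)
    {
        static unsigned char image[4096];
        struct fake_page page = { .mapped = false };

        save_page(image, &page);
        printf("page saved, mapped afterwards: %d\n", page.mapped);
        return 0;
    }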

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Pavel Machek
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

27 Apr, 2019

4 commits

  • Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
    removed setting of the ALLOC_NOFRAGMENT flag. Bring it back.

    The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so
    that allocations are spread across local zones to avoid fragmentation
    due to mixing pageblocks as long as possible.

    Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com
    Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL,
    and the 'zone' pointer is unconditionally dereferenced in
    alloc_flags_nofragment(). Bail out on a NULL zone to avoid a potential
    crash. Currently we don't see any crashes only because
    alloc_flags_nofragment() has another bug which allows the compiler to
    optimize away all accesses to 'zone'.
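
    A minimal sketch of the guard being added, with hypothetical types and a
    hypothetical flag value standing in for the kernel's; the point is simply
    the early bail-out on a NULL zone before any dereference.

    #include <stdio.h>

    #define ALLOC_NOFRAGMENT 0x4    /* hypothetical flag value */

    struct zone { int idx; };

    static unsigned int alloc_flags_nofragment(const struct zone *zone)
    {
        unsigned int flags = 0;

        if (!zone)                  /* preferred zone may be NULL: don't dereference */
            return flags;

        if (zone->idx > 0)          /* placeholder for the real "lower zone" logic */
            flags |= ALLOC_NOFRAGMENT;
        return flags;
    }

    int main(void)
    {
        struct zone z = { .idx = 2 };

        printf("flags(NULL)=%#x flags(&z)=%#x\n",
               alloc_flags_nofragment(NULL), alloc_flags_nofragment(&z));
        return 0;
    }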

    Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
    Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • During the development of commit 5e1f0f098b46 ("mm, compaction: capture
    a page under direct compaction"), a paranoid check was added to ensure
    that if a captured page was available after compaction that it was
    consistent with the final state of compaction. The intent was to catch
    serious programming bugs such as using a stale page pointer and causing
    corruption problems.

    However, it is possible to get a captured page even if compaction was
    unsuccessful if an interrupt triggered and happened to free pages in
    interrupt context that got merged into a suitable high-order page. It's
    highly unlikely, but Li Wang did report the following warning on s390
    occurring when testing OOM handling. Note that the warning is slightly
    edited for clarity.

    WARNING: CPU: 0 PID: 9783 at mm/page_alloc.c:3777 __alloc_pages_direct_compact+0x182/0x190
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs
    lockd grace fscache sunrpc pkey ghash_s390 prng xts aes_s390
    des_s390 des_generic sha512_s390 zcrypt_cex4 zcrypt vmur binfmt_misc
    ip_tables xfs libcrc32c dasd_fba_mod qeth_l2 dasd_eckd_mod dasd_mod
    qeth qdio lcs ctcm ccwgroup fsm dm_mirror dm_region_hash dm_log
    dm_mod
    CPU: 0 PID: 9783 Comm: copy.sh Kdump: loaded Not tainted 5.1.0-rc5 #1

    This patch simply removes the check entirely instead of trying to be
    clever about pages freed from interrupt context. If a serious
    programming error was introduced, it is highly likely to be caught by
    prep_new_page() instead.

    Link: http://lkml.kernel.org/r/20190419085133.GH18914@techsingularity.net
    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Signed-off-by: Mel Gorman
    Reported-by: Li Wang
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small
    amounts of memory when an external fragmentation event occurs") "broke"
    memory management on parisc.

    The machine is not NUMA, but the DISCONTIG model creates three pgdats for
    the following ranges even though it is a UMA machine:

    0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
    1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB
    2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB

    Mikulas reported:

    With the patch 1c30844d2, the kernel will incorrectly reclaim the
    first zone when it fills up, ignoring the fact that there are two
    completely free zones. Basically, it limits cache size to 1GiB.

    For example, if I run:
    # dd if=/dev/sda of=/dev/null bs=1M count=2048

    - with the proper kernel, there should be "Buffers - 2GiB"
    when this command finishes. With the patch 1c30844d2, buffers
    will consume just 1GiB or slightly more, because the kernel was
    incorrectly reclaiming them.

    The page allocator and reclaim make the assumption that pgdats really
    represent NUMA nodes and that zones represent ranges, and make decisions
    on that basis. Watermark boosting for small pgdats leads to unexpected
    results even though this would have behaved reasonably on SPARSEMEM.

    DISCONTIG is essentially deprecated and even parisc plans to move to
    SPARSEMEM, so there is no need to be fancy; this patch simply disables
    watermark boosting by default on DISCONTIGMEM.

    Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reported-by: Mikulas Patocka
    Tested-by: Mikulas Patocka
    Acked-by: Vlastimil Babka
    Cc: James Bottomley
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Apr, 2019

1 commit

  • has_unmovable_pages() is used when allocating CMA and gigantic pages as
    well as by memory hotplug. The latter doesn't currently know how to
    offline a CMA pool properly, but if an unused (free) CMA page is
    encountered, has_unmovable_pages() happily considers it free memory and
    propagates this up the call chain. The memory offlining code then frees
    the page without a proper CMA tear down, which leads to accounting
    issues. Moreover, if the same memory range is onlined again, the memory
    never gets back to the CMA pool.

    State after memory offline:

    # grep cma /proc/vmstat
    nr_free_cma 205824

    # cat /sys/kernel/debug/cma/cma-kvm_cma/count
    209920

    Also, as shown below, kmemleak still thinks those memory addresses are
    reserved even though they have already been used by the buddy allocator
    after onlining. This patch fixes the situation by treating CMA
    pageblocks as unmovable except when has_unmovable_pages() is called as
    part of a CMA allocation.

    Offlined Pages 4096
    kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
    Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    create_object+0x344/0x380
    __kmalloc_node+0x3ec/0x860
    kvmalloc_node+0x58/0x110
    seq_read+0x41c/0x620
    __vfs_read+0x3c/0x70
    vfs_read+0xbc/0x1a0
    ksys_read+0x7c/0x140
    system_call+0x5c/0x70
    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xc000201cc8000000 (size 13757317120):
    kmemleak: comm "swapper/0", pid 0, jiffies 4294937297
    kmemleak: min_count = -1
    kmemleak: count = 0
    kmemleak: flags = 0x5
    kmemleak: checksum = 0
    kmemleak: backtrace:
    cma_declare_contiguous+0x2a4/0x3b0
    kvm_cma_reserve+0x11c/0x134
    setup_arch+0x300/0x3f8
    start_kernel+0x9c/0x6e8
    start_here_common+0x1c/0x4b0
    kmemleak: Automatic memory scanning thread ended
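
    A hedged user-space model of the resulting policy: a CMA pageblock is
    reported as containing unmovable pages unless the check is being done on
    behalf of a CMA allocation itself. The enum, flag and function names
    below are illustrative, not the kernel's.

    #include <stdbool.h>
    #include <stdio.h>

    enum block_type { MIGRATE_MOVABLE_BLOCK, MIGRATE_CMA_BLOCK };

    #define CHECK_FOR_CMA_ALLOC 0x1     /* caller is the CMA allocator itself */

    static bool block_has_unmovable_pages(enum block_type type, unsigned int flags)
    {
        if (type == MIGRATE_CMA_BLOCK)
            /* Treat CMA blocks as unmovable for hotplug/offline callers so a
             * free CMA page is never silently handed back to the buddy pool. */
            return !(flags & CHECK_FOR_CMA_ALLOC);
        return false;                   /* purely movable block in this toy model */
    }

    int main(void)
    {
        printf("offline sees CMA block as unmovable: %d\n",
               block_has_unmovable_pages(MIGRATE_CMA_BLOCK, 0));
        printf("CMA allocation sees CMA block as unmovable: %d\n",
               block_has_unmovable_pages(MIGRATE_CMA_BLOCK, CHECK_FOR_CMA_ALLOC));
        return 0;
    }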

    [cai@lca.pw: use is_migrate_cma_page() and update commit log]
    Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

30 Mar, 2019

1 commit

  • Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
    memory to zones until online") introduced move_pfn_range_to_zone() which
    calls memmap_init_zone() during onlining a memory block.
    memmap_init_zone() will reset pagetype flags and makes migrate type to
    be MOVABLE.

    However, __offline_pages() also calls undo_isolate_page_range() after
    offline_isolated_pages() to do the same thing. Since commit
    2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
    __first_valid_page() to skip offline pages, undo_isolate_page_range()
    here just wastes CPU cycles looping over the offlined PFN range while
    doing nothing, because __first_valid_page() will return NULL as
    offline_isolated_pages() has already marked all memory sections within
    the pfn range as offline via offline_mem_sections().

    Also, after calling the "useless" undo_isolate_page_range() here, the
    code reaches the point of no return by notifying MEM_OFFLINE. Those
    pages will be marked MIGRATE_MOVABLE again once they are onlined. The
    only thing left to do is to decrease the zone's counter of isolated
    pageblocks, which would otherwise make some page allocation paths,
    introduced by the above commit, slower.

    Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
    on ppc64, an "int" should still be enough to represent the number of
    pageblocks there. Fix an incorrect comment along the way.

    [cai@lca.pw: v4]
    Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
    Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Vlastimil Babka
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

13 Mar, 2019

1 commit

  • As all the memblock allocation functions return NULL in case of error
    rather than panic(), the duplicates with _nopanic suffix can be removed.

    Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Petr Mladek [printk]
    Cc: Catalin Marinas
    Cc: Christophe Leroy
    Cc: Christoph Hellwig
    Cc: "David S. Miller"
    Cc: Dennis Zhou
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Guo Ren [c-sky]
    Cc: Heiko Carstens
    Cc: Juergen Gross [Xen]
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Paul Burton
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Rob Herring
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

06 Mar, 2019

14 commits

  • This function is only used by built-in code, which makes perfect sense
    given its purpose.

    Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.
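
    For illustration, a kernel-doc comment in the expected shape, including
    the "Return:" section whose absence triggers the warnings above; the
    function itself is a stand-alone example rather than one of the mm/
    helpers.

    #include <stdlib.h>
    #include <string.h>

    /**
     * dup_string - duplicate a NUL-terminated string
     * @s: the string to duplicate
     *
     * Allocates a new buffer and copies @s into it.
     *
     * Return: pointer to the newly allocated copy, or %NULL on allocation failure.
     */
    char *dup_string(const char *s)
    {
        size_t len = strlen(s) + 1;
        char *p = malloc(len);

        if (p)
            memcpy(p, s, len);
        return p;
    }

    int main(void)
    {
        char *copy = dup_string("kernel-doc");

        free(copy);
        return 0;
    }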

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The number of online NUMA nodes can't be negative either. This doesn't
    save space, as the variable is only used in a 32-bit context, but do it
    anyway for consistency.

    Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There are two early memory allocations that use
    memblock_alloc_node_nopanic() and do not check its return value.

    While this happens very early during boot and the chances that the
    allocation will fail are slim, it is still worth having proper checks
    for allocation errors.
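
    A minimal sketch of the pattern being added, under the assumption of a
    hypothetical early allocator: the return value is checked and the failure
    is reported instead of being silently ignored. early_alloc_node() and the
    abort() stand-in for panic() are illustrative only.

    #include <stdio.h>
    #include <stdlib.h>

    static void *early_alloc_node(size_t size, int nid)
    {
        (void)nid;                  /* node hint ignored in this toy model */
        return malloc(size);        /* may return NULL, just like the real thing */
    }

    static void *early_alloc_node_checked(size_t size, int nid)
    {
        void *p = early_alloc_node(size, nid);

        if (!p) {                   /* the check the patch adds */
            fprintf(stderr, "early allocation of %zu bytes on node %d failed\n",
                    size, nid);
            abort();                /* stand-in for panic() during early boot */
        }
        return p;
    }

    int main(void)
    {
        free(early_alloc_node_checked(4096, 0));
        return 0;
    }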

    Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • No functional change.

    Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Pekka Enberg
    Acked-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     
  • Compaction is inherently race-prone as a suitable page freed during
    compaction can be allocated by any parallel task. This patch uses a
    capture_control structure to isolate a page immediately when it is freed
    by a direct compactor in the slow path of the page allocator. The
    intent is to avoid redundant scanning.

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
    Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
    Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
    Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
    Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
    Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
    Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
    Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
    Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)

    Latency is only moderately affected, but the devil is in the details. A
    closer examination indicates that base page fault latency is reduced but
    latency of huge pages is increased as greater care is taken to succeed.
    Part of the "problem" is that allocation success rates are close to 100%
    even when under pressure and compaction gets harder.

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
    Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
    Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
    Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
    Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
    Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
    Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
    Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)

    And scan rates are reduced as expected by 6% for the migration scanner
    and 29% for the free scanner indicating that there is less redundant
    work.

    Compaction migrate scanned 20815362 19573286
    Compaction free scanned 16352612 11510663
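
    A user-space sketch of the capture idea, with illustrative structures
    rather than the kernel's: a direct compactor publishes a capture slot for
    the order it wants, and the free path deposits a matching page there
    instead of returning it to the free lists, so the compactor does not have
    to rescan for it.

    #include <stdio.h>
    #include <stddef.h>

    struct page { int order; };

    struct capture_control {
        int wanted_order;           /* order the compactor is trying to satisfy */
        struct page *page;          /* filled in by the free path when it matches */
    };

    /* One capture slot "per task" in this toy model. */
    static struct capture_control *current_capture;

    static void free_page_to_buddy(struct page *p)
    {
        /* If a compactor is waiting for a page of this order, hand it over. */
        if (current_capture && !current_capture->page &&
            current_capture->wanted_order == p->order) {
            current_capture->page = p;
            return;
        }
        /* Otherwise the page would go back onto the free lists (omitted). */
    }

    int main(void)
    {
        struct capture_control capc = { .wanted_order = 9, .page = NULL };
        struct page huge = { .order = 9 };

        current_capture = &capc;    /* compactor enters the slow path */
        free_page_to_buddy(&huge);  /* a parallel free of a suitable page */
        current_capture = NULL;

        printf("captured: %s\n", capc.page ? "yes" : "no");
        return 0;
    }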

    [mgorman@techsingularity.net: remove redundant check]
    Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
    Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When pageblocks get fragmented, watermarks are artificially boosted to
    reclaim pages to avoid further fragmentation events. However,
    compaction is often either fragmentation-neutral or moving movable pages
    away from unmovable/reclaimable pages. As the true watermarks are
    preserved, allow compaction to ignore the boost factor.

    The expected impact is very slight as the main benefit is that
    compaction is slightly more likely to succeed when the system has been
    fragmented very recently. On both 1-socket and 2-socket machines for
    THP-intensive allocation during fragmentation the success rate was
    increased by less than 1%, which is marginal. However, detailed tracing
    indicated that migration failures due to a premature ENOMEM triggered by
    watermark checks were eliminated.

    Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the current implementation, there are two places that isolate a range
    of pages: __offline_pages() and alloc_contig_range(). During this
    procedure, they drain pages on the pcp list.

    Below is a brief call flow:

    __offline_pages()/alloc_contig_range()
        start_isolate_page_range()
            set_migratetype_isolate()
                drain_all_pages()
        drain_all_pages()
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Move the memcg_kmem_enabled() checks into the memcg kmem
    charge/uncharge functions, so the users don't have to explicitly check
    that condition.

    This is purely a code cleanup patch without any functional change. Only
    the order of checks in memcg_charge_slab() can potentially be changed,
    but functionally it will be the same. This should not matter as
    memcg_charge_slab() is not in the hot path.
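
    A minimal sketch of the shape of the cleanup, using illustrative names:
    the "is kmem accounting enabled?" check lives inside the charge/uncharge
    helpers, so call sites no longer repeat it.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    static bool kmem_accounting_enabled = true;   /* stand-in for a static key */
    static long charged_bytes;

    static int charge_kmem(size_t size)
    {
        if (!kmem_accounting_enabled)   /* check folded into the helper */
            return 0;
        charged_bytes += (long)size;
        return 0;
    }

    static void uncharge_kmem(size_t size)
    {
        if (!kmem_accounting_enabled)
            return;
        charged_bytes -= (long)size;
    }

    int main(void)
    {
        /* Callers just charge/uncharge; no explicit enabled check at each site. */
        charge_kmem(4096);
        uncharge_kmem(4096);
        printf("outstanding: %ld\n", charged_bytes);
        return 0;
    }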

    Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though this is implicitly understood, it is always
    better to have macros for it. Replace these open encodings of an invalid
    node number with the global macro NUMA_NO_NODE. This helps remove
    NUMA-related assumptions like 'invalid node' from various places,
    redirecting them to a common definition.
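
    An illustrative before/after of the replacement: NUMA_NO_NODE is the real
    macro name (defined as -1), while the describe_node() helper is just an
    example call site, not one of the converted drivers.

    #include <stdio.h>

    #define NUMA_NO_NODE (-1)

    static const char *describe_node(int nid)
    {
        /* Before: if (nid == -1) ...  After: */
        if (nid == NUMA_NO_NODE)
            return "no node preference";
        return "specific node";
    }

    int main(void)
    {
        printf("%s\n", describe_node(NUMA_NO_NODE));
        printf("%s\n", describe_node(0));
        return 0;
    }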

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • When pages are freed at a higher order, the time the buddy allocator
    spends coalescing pages can be reduced. With a section size of 256MB,
    the hot add latency of a single section improves from 50-60 ms to less
    than 1 ms, a roughly 60x improvement. External providers of the online
    callback are modified to align with the change.

    [arunks@codeaurora.org: v11]
    Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
    [akpm@linux-foundation.org: remove unused local, per Arun]
    [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
    [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
    [arunks@codeaurora.org: v8]
    Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
    [arunks@codeaurora.org: v9]
    Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
    Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: Alexander Duyck
    Cc: K. Y. Srinivasan
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Greg Kroah-Hartman
    Cc: Mathieu Malaterre
    Cc: "Kirill A. Shutemov"
    Cc: Souptick Joarder
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Srivatsa Vaddagiri
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • KASAN does not play well with page poisoning (CONFIG_PAGE_POISONING).
    It triggers false positives in the allocation path:

    BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
    Read of size 8 at addr ffff88881f800000 by task swapper/0
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    __asan_report_load8_noabort+0x19/0x20
    memchr_inv+0x2ea/0x330
    kernel_poison_pages+0x103/0x3d5
    get_page_from_freelist+0x15e7/0x4d90

    because KASAN has not yet unpoisoned the shadow page for the allocation
    before it calls memchr_inv(), which only finds a stale poison pattern.

    There are also false positives in the free path:

    BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
    Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
    CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    check_memory_region+0x22d/0x250
    memset+0x28/0x40
    kernel_poison_pages+0x29e/0x3d5
    __free_pages_ok+0x75f/0x13e0

    because KASAN adds poisoned redzones around slab objects, while page
    poisoning needs to poison the whole page.

    Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

22 Feb, 2019

1 commit

  • Yury Norov reported that an arm64 KVM instance could not boot since
    v5.0-rc1 and that the problem could be addressed by reverting the patches

    1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")

    The problem is that a division by zero error is possible if boosting
    occurs very early in boot if the system has very little memory. This
    patch avoids the division by zero error.

    Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reported-by: Yury Norov
    Tested-by: Yury Norov
    Tested-by: Will Deacon
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Feb, 2019

1 commit

  • This patch replaces the size + 1 value, introduced with the recent fix
    for 1-byte allocs, with a constant value.

    The idea here is to reduce code overhead as the previous logic would have
    to read size into a register, then increment it, and write it back to
    whatever field was being used. By using a constant we can avoid those
    memory reads and arithmetic operations in favor of just encoding the
    maximum value into the operation itself.

    Fixes: 2c2ade81741c ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

15 Feb, 2019

1 commit

  • The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
    number of references that we might need to create in the fastpath later,
    the bump-allocation fastpath only has to modify the non-atomic bias value
    that tracks the number of extra references we hold instead of the atomic
    refcount. The maximum number of allocations we can serve (under the
    assumption that no allocation is made with size 0) is nc->size, so that's
    the bias used.

    However, even when all memory in the allocation has been given away, a
    reference to the page is still held; and in the `offset < 0` slowpath, the
    page may be reused if everyone else has dropped their references.
    This means that the necessary number of references is actually
    `nc->size+1`.

    Luckily, from a quick grep, it looks like the only path that can call
    page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
    requires CAP_NET_ADMIN in the init namespace and is only intended to be
    used for kernel testing and fuzzing.

    To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
    `offset < 0` path, below the virt_to_page() call, and then repeatedly call
    writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
    with a vector consisting of 15 elements containing 1 byte each.
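
    A user-space model of the bias accounting described above, with
    illustrative names: the cache pre-charges size + 1 references so that,
    even after handing out one reference per smallest possible allocation, it
    still owns one reference of its own and the backing page cannot be freed
    underneath it.

    #include <stdio.h>

    struct frag_cache {
        unsigned int size;     /* bytes in the backing "page" */
        unsigned int offset;   /* bump-allocation cursor, counts down */
        unsigned int bias;     /* references the cache still owns */
    };

    static void cache_init(struct frag_cache *c, unsigned int size)
    {
        c->size = size;
        c->offset = size;
        /* Pre-charge size + 1 references: one per possible 1-byte allocation,
         * plus one that the cache keeps for itself (the point of the fix). */
        c->bias = size + 1;
    }

    static int frag_alloc(struct frag_cache *c, unsigned int bytes)
    {
        if (bytes == 0 || c->offset < bytes)
            return -1;         /* the real slow path would refill or reuse */
        c->offset -= bytes;
        c->bias--;             /* transfer one pre-charged reference to the caller */
        return (int)c->offset;
    }

    int main(void)
    {
        struct frag_cache c;

        cache_init(&c, 8);
        while (frag_alloc(&c, 1) >= 0)
            ;
        /* Prints 1: the cache keeps a reference even when fully consumed. */
        printf("references still owned by the cache: %u\n", c.bias);
        return 0;
    }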

    Signed-off-by: Jann Horn
    Signed-off-by: David S. Miller

    Jann Horn
     

29 Jan, 2019

1 commit

  • This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

    The underlying assumption that one sparse section belongs to a single
    NUMA node doesn't really hold. Robert Shteynfeld has reported a boot
    failure. The boot log was not captured but his memory layout is as
    follows:

    Early memory node ranges
    node 1: [mem 0x0000000000001000-0x0000000000090fff]
    node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    node 1: [mem 0x0000000100000000-0x0000001423ffffff]
    node 0: [mem 0x0000001424000000-0x0000002023ffffff]

    This means that node0 starts in the middle of a memory section which is
    also in node1. memmap_init_zone tries to initialize the padding of a
    section even when it is outside of the given pfn range, because there
    are code paths (e.g. memory hotplug) which assume that a full memory
    section is always initialized.

    In this particular case, though, such a range is already initialized and
    most likely already managed by the page allocator. Scribbling over
    those pages corrupts the internal state and likely blows up when any of
    those pages gets used.

    Reported-by: Robert Shteynfeld
    Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
    Cc: stable@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jan, 2019

1 commit

  • syzbot reported the following regression in the latest merge window and
    it was confirmed by Qian Cai that a similar bug was visible from a
    different context.

    ======================================================
    WARNING: possible circular locking dependency detected
    4.20.0+ #297 Not tainted
    ------------------------------------------------------
    syz-executor0/8529 is trying to acquire lock:
    000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
    __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

    but task is already holding lock:
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
    include/linux/spinlock.h:329 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
    mm/page_alloc.c:2548 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
    mm/page_alloc.c:3021 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
    mm/page_alloc.c:3050 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
    mm/page_alloc.c:3072 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
    get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

    It appears to be a false positive in that the only way the lock ordering
    should be inverted is if kswapd is waking itself and the wakeup
    allocates debugging objects which should already be allocated if it's
    kswapd doing the waking. Nevertheless, the possibility exists and so
    it's best to avoid the problem.

    This patch flags a zone as needing a kswapd wakeup using the,
    surprisingly, unused zone flag field. The flag is read without the lock
    held to do the wakeup. It's possible that the flag setting context is
    not the same as the flag clearing context, or for small races to occur.
    However, each race possibility is harmless and there is no visible
    degradation in fragmentation treatment.

    While zone->flag could have continued to be unused, there is potential
    for moving some existing fields into the flags field instead.
    Particularly read-mostly ones like zone->initialized and
    zone->contiguous.
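
    A hedged sketch of the deferred-wakeup pattern, using C11 atomics in
    place of the kernel's bitops: the flag is set while the lock would be
    held, and the wakeup is issued later, after the lock is dropped, by
    testing and clearing the flag. The flag name and helpers are
    illustrative.

    #include <stdatomic.h>
    #include <stdio.h>

    #define ZONE_BOOSTED_WATERMARK (1u << 0)    /* illustrative flag bit */

    struct zone_model {
        atomic_uint flags;
    };

    /* Called with the zone lock held in the real code: just record the need. */
    static void note_kswapd_needed(struct zone_model *z)
    {
        atomic_fetch_or(&z->flags, ZONE_BOOSTED_WATERMARK);
    }

    /* Called after the lock is dropped: clear the flag and wake if it was set. */
    static void maybe_wake_kswapd(struct zone_model *z)
    {
        unsigned int old = atomic_fetch_and(&z->flags, ~ZONE_BOOSTED_WATERMARK);

        if (old & ZONE_BOOSTED_WATERMARK)
            printf("wakeup issued without the zone lock held\n");
    }

    int main(void)
    {
        struct zone_model z;

        atomic_init(&z.flags, 0);
        note_kswapd_needed(&z);     /* done while the zone lock is held */
        maybe_wake_kswapd(&z);      /* done after the lock has been dropped */
        maybe_wake_kswapd(&z);      /* no-op if the flag is already clear */
        return 0;
    }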

    Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Tested-by: Qian Cai
    Cc: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Dec, 2018

13 commits

  • Model call chain after should_failslab(). Likewise, we can now use a
    kprobe to override the return value of should_fail_alloc_page() and inject
    allocation failures into alloc_page*().

    This will allow injecting allocation failures using the BCC tools even
    without building the kernel with CONFIG_FAIL_PAGE_ALLOC and booting it
    with a fail_page_alloc= parameter, which incurs some overhead even when
    failures are not being injected. On the other hand, this patch adds an
    unconditional call to should_fail_alloc_page() in the page allocation
    hotpath. That overhead should be rather negligible with
    CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.

    [vbabka@suse.cz: changelog addition]
    Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
    Signed-off-by: Benjamin Poirier
    Acked-by: Vlastimil Babka
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Poirier
     
  • drain_all_pages is documented to drain per-cpu pages for a given zone (if
    non-NULL). The current implementation doesn't match the description
    though. It will drain all pcp pages for all zones that happen to have
    cached pages on the same cpu as the given zone. This will lead to
    premature pcp cache draining for zones that are not of any interest to the
    caller - e.g. compaction, hwpoison or memory offline.

    This forces the page allocator to take locks and potential lock contention
    as a result.

    There is no real reason for this sub-optimal implementation. Replace
    per-cpu work item with a dedicated structure which contains a pointer to
    the zone and pass it over to the worker. This will get the zone
    information all the way down to the worker function and do the right job.
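
    A minimal sketch of the refactoring, with illustrative structures: the
    work item carries a pointer to the zone being drained, so the worker only
    touches that zone's pcp pages instead of draining every zone that happens
    to have cached pages.

    #include <stdio.h>

    struct zone_model {
        const char *name;
        int pcp_pages;
    };

    /* The dedicated per-cpu work structure: the work plus the zone it targets. */
    struct drain_work {
        struct zone_model *zone;    /* NULL means "drain everything" */
    };

    static void drain_local_pages_worker(struct drain_work *w,
                                         struct zone_model *zones, int nr)
    {
        for (int i = 0; i < nr; i++) {
            if (w->zone && w->zone != &zones[i])
                continue;           /* skip zones the caller is not interested in */
            zones[i].pcp_pages = 0;
        }
    }

    int main(void)
    {
        struct zone_model zones[] = { { "DMA32", 7 }, { "Normal", 42 } };
        struct drain_work w = { .zone = &zones[1] };    /* only drain Normal */

        drain_local_pages_worker(&w, zones, 2);
        printf("DMA32 pcp=%d, Normal pcp=%d\n",
               zones[0].pcp_pages, zones[1].pcp_pages);
        return 0;
    }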

    [akpm@linux-foundation.org: avoid 80-col tricks]
    [mhocko@suse.com: refactor the whole changelog]
    Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • When CONFIG_KASAN is enabled on large-memory SMP systems, the deferred
    page initialization can take a long time. Below are the reported init
    times on an 8-socket, 96-core, 4TB IvyBridge system.

    1) Non-debug kernel without CONFIG_KASAN
    [ 8.764222] node 1 initialised, 132086516 pages in 7027ms

    2) Debug kernel with CONFIG_KASAN
    [ 146.288115] node 1 initialised, 132075466 pages in 143052ms

    So the page init time in a debug kernel was 20X that of the non-debug
    kernel. The long init time can be problematic as the page initialization
    is done with interrupts disabled. In this particular case, it caused the
    appearance of the following warning messages as well as NMI backtraces
    of all the cores that were doing the initialization.

    [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
    [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
    [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
    [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
    [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
    [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
    [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
    [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

    The long init time was mainly caused by the call to kasan_free_pages() to
    poison the newly initialized pages. On a 4TB system, we are talking about
    almost 500GB of memory probably on the same node.

    In reality, we may not need to poison the newly initialized pages before
    they are ever allocated. So KASAN poisoning of freed pages before the
    completion of deferred memory initialization is now disabled. Those pages
    will be properly poisoned when they are allocated or freed after deferred
    pages initialization is done.

    With this change, the new page initialization time became:

    [ 21.948010] node 1 initialised, 132075466 pages in 18702ms

    This was still about double the non-debug kernel time, but was much
    better than before.

    Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in
    code. If someone adds an extra migrate type, they may forget to enlarge
    NR_PAGEBLOCK_BITS, so some way of catching this is needed.

    NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros are spread
    over two different header files with a reverse dependency, which makes
    it a little hard to refer to MIGRATE_TYPES in pageblock-flags.h. This
    patch adds a compile-time reminder of the relation.
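
    A hedged illustration of the kind of compile-time reminder intended here,
    using a C11 static assertion and toy values rather than the kernel's
    macros: the build fails if the reserved bits can no longer represent all
    migrate types.

    #include <stdio.h>

    enum migratetype_model {
        MT_UNMOVABLE,
        MT_MOVABLE,
        MT_RECLAIMABLE,
        MT_CMA,
        MT_ISOLATE,
        MT_TYPES                    /* number of migrate types in this toy model */
    };

    #define PB_MIGRATE_BITS 3       /* bits reserved for the migrate type */

    /* Fails to compile if someone adds a migrate type without enlarging the
     * bit field, which is exactly the mistake the patch wants to catch. */
    _Static_assert((1 << PB_MIGRATE_BITS) >= MT_TYPES,
                   "not enough pageblock bits for all migrate types");

    int main(void)
    {
        printf("%d migrate types fit in %d bits\n", MT_TYPES, PB_MIGRATE_BITS);
        return 0;
    }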

    Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
     
  • Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
    free_area_init_core_hotplug"), some functions changed to only be called
    during system initialization. Concretely, free_area_init_node() and the
    functions that hang off it.

    Also, some variables are no longer used after the system has gone
    through initialization, so this can be considered a late clean-up for
    that patch.

    This patch changes the functions from __meminit to __init, and the
    variables from __meminitdata to __initdata.

    In return, we get some KBs back:

    Before:
    Freeing unused kernel image memory: 2472K

    After:
    Freeing unused kernel image memory: 2480K

    Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Alexander Duyck
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of
    each node's highest zone is initialized before defer stage.

    static_init_pgcnt is used to store the number of pages like this:

    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     pgdat->node_spanned_pages);

    because we don't want to overflow zone's range.

    But this is not necessary, since defer_init() is called like this:

    memmap_init_zone()
        for pfn in [start_pfn, end_pfn)
            defer_init(pfn, end_pfn)

    In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
    stop before calling defer_init().

    BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
    since nr_initialised is zone based instead of node based. Even if
    node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
    could have fewer pages than PAGES_PER_SECTION.

    Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Alexander Duyck
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The OOM report contains several sections. The first one is the
    allocation context that has triggered the OOM. Then we have the cpuset
    context followed by the stack trace of the OOM path. The third one is
    the OOM memory information, followed by the current memory state of all
    system tasks. Finally, we show the oom-eligible tasks and the
    information about the chosen oom victim.

    One thing that makes parsing more awkward than necessary is that we do
    not have a single and easily parsable line about the oom context. This
    patch reorganizes the oom report into:

    1) who invoked oom and what was the allocation request

    [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

    2) OOM stack trace

    [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
    [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
    [ 515.906821] Call Trace:
    [ 515.908062] dump_stack+0x5a/0x73
    [ 515.909311] dump_header+0x55/0x28c
    [ 515.914260] oom_kill_process+0x2d8/0x300
    [ 515.916708] out_of_memory+0x145/0x4a0
    [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16
    [ 515.919157] __alloc_pages_nodemask+0x277/0x290
    [ 515.920367] filemap_fault+0x3d0/0x6c0
    [ 515.921529] ? filemap_map_pages+0x2b8/0x420
    [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4]
    [ 515.923884] __do_fault+0x20/0x80
    [ 515.925032] __handle_mm_fault+0xbc0/0xe80
    [ 515.926195] handle_mm_fault+0xfa/0x210
    [ 515.927357] __do_page_fault+0x233/0x4c0
    [ 515.928506] do_page_fault+0x32/0x140
    [ 515.929646] ? page_fault+0x8/0x30
    [ 515.930770] page_fault+0x1e/0x30

    3) OOM memory information

    [ 515.958093] Mem-Info:
    [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
    active_file:4402672 inactive_file:483963 isolated_file:1344
    unevictable:0 dirty:4886753 writeback:0 unstable:0
    slab_reclaimable:148442 slab_unreclaimable:18741
    mapped:1347 shmem:1347 pagetables:58669 bounce:0
    free:88663 free_pcp:0 free_cma:0
    ...

    4) current memory state of all system tasks

    [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal
    [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad
    [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd
    [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd
    [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd
    [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance
    [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd
    [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager
    [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned
    [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic
    [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic
    [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic
    [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic
    [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic

    5) oom context (constraints and the chosen victim).

    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

    An admin can easily get the full oom context on a single line, which
    makes parsing much easier.

    Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Roman Gushchin
    Cc: Tetsuo Handa
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     
  • and propagate it down the call stack.

    Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Those strings are in fact immutable.

    Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

    The kernel reduces the probability of such events by increasing the
    watermark sizes by calling set_recommended_min_free_kbytes early in the
    lifetime of the system. This works reasonably well in general but if
    there are enough sparsely populated pageblocks then the problem can still
    occur as enough memory is free overall and kswapd stays asleep.

    This patch introduces a watermark_boost_factor sysctl that allows a zone
    watermark to be temporarily boosted when an external fragmentation
    causing event occurs. The boosting will stall allocations that would
    decrease free memory below the boosted low watermark, and kswapd is
    woken, if the calling context allows it, to reclaim an amount of memory
    relative to the size of the high watermark and the
    watermark_boost_factor until the boost is cleared. When kswapd finishes,
    it wakes kcompactd at the pageblock order to clean some of the
    pageblocks that may have been affected by the fragmentation event.
    kswapd avoids any writeback, slab shrinkage and swap from reclaim
    context during this operation to avoid excessive system disruption in
    the name of fragmentation avoidance. Care is taken so that kswapd will
    do normal reclaim work if the system is really low on memory.
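
    A user-space sketch of the mechanism, with illustrative names and
    scaling: a fragmentation event temporarily raises the zone's effective
    watermark by a boost derived from the factor, allocations that would dip
    below the boosted watermark are stalled, and the boost is cleared once
    kswapd has finished.

    #include <stdbool.h>
    #include <stdio.h>

    struct zone_model {
        unsigned long low_wmark;        /* pages */
        unsigned long watermark_boost;  /* temporary addition, 0 when idle */
    };

    static void boost_on_fragmentation_event(struct zone_model *z,
                                             unsigned long boost_factor)
    {
        z->watermark_boost = z->low_wmark * boost_factor / 10000;
    }

    static bool allocation_ok(const struct zone_model *z, unsigned long free_pages)
    {
        /* Allocations that would dip below the boosted watermark are stalled. */
        return free_pages > z->low_wmark + z->watermark_boost;
    }

    static void kswapd_finished(struct zone_model *z)
    {
        z->watermark_boost = 0;         /* boost is cleared after reclaim */
    }

    int main(void)
    {
        struct zone_model z = { .low_wmark = 1000, .watermark_boost = 0 };

        boost_on_fragmentation_event(&z, 15000);    /* illustrative factor */
        printf("alloc with 2000 free during boost: %s\n",
               allocation_ok(&z, 2000) ? "ok" : "stalled");
        kswapd_finished(&z);
        printf("alloc with 2000 free after boost cleared: %s\n",
               allocation_ok(&z, 2000) ? "ok" : "stalled");
        return 0;
    }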

    This was evaluated using the same workloads as "mm, page_alloc: Spread
    allocations across zones before introducing fragmentation".

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)
    4.20-rc3+patch1-4: 18421 (98% reduction)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%)
    Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%*

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%)

    Note that external fragmentation causing events are massively reduced by
    this patch whether in comparison to the previous kernel or the vanilla
    kernel. The fault latency for huge pages appears to be increased but
    that is only because THP allocations were successful with the patch
    applied.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)
    4.20-rc3+patch1-4: 13464 (95% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%)
    Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%)
    Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%)
    Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%*

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%)

    As before, massive reduction in external fragmentation events, some jitter
    on latencies and an increase in THP allocation success rates.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)
    4.20-rc3+patch1-4: 14263 (93% reduction)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%)
    Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%)

    There is a 93% reduction in fragmentation causing events, there is a big
    reduction in the huge page fault latency and allocation success rate is
    higher.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)
    4.20-rc3+patch1-4: 11095 (93% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%*
    Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%)

    There is a large reduction in fragmentation events with some jitter around
    the latencies and success rates. As before, the high THP allocation
    success rate does mean the system is under a lot of pressure. However, as
    the fragmentation events are reduced, it would be expected that the
    long-term allocation success rate would be higher.

    Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
    into alloc_flags. This is a preparation patch only that avoids having to
    pass gfp_mask through a long callchain in a future patch.

    Note that the setting in the fast path happens in alloc_flags_nofragment()
    and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
    That's true in this patch but is not true later so it's done now for
    easier review to show where the flag needs to be recorded.

    No functional change.
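
    A minimal sketch of the flag copy, with illustrative bit values (the real
    __GFP_KSWAPD_RECLAIM and ALLOC_KSWAPD definitions live in the kernel
    headers): if the gfp mask allows waking kswapd, record that in
    alloc_flags.

    #include <stdio.h>

    #define GFP_KSWAPD_RECLAIM_BIT (1u << 3)    /* illustrative gfp bit */
    #define ALLOC_KSWAPD_BIT       (1u << 5)    /* illustrative alloc_flags bit */

    static unsigned int record_kswapd_flag(unsigned int gfp_mask,
                                           unsigned int alloc_flags)
    {
        if (gfp_mask & GFP_KSWAPD_RECLAIM_BIT)
            alloc_flags |= ALLOC_KSWAPD_BIT;
        return alloc_flags;
    }

    int main(void)
    {
        printf("%#x\n", record_kswapd_flag(GFP_KSWAPD_RECLAIM_BIT, 0));
        printf("%#x\n", record_kswapd_flag(0, 0));
        return 0;
    }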

    [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
    Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
    Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is a preparation patch only, no functional change.

    Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Fragmentation avoidance improvements", v5.

    It has been noted before that fragmentation avoidance (aka
    anti-fragmentation) is not perfect. Given sufficient time or an adverse
    workload, memory gets fragmented and the long-term success of high-order
    allocations degrades. This series defines an adverse workload, a
    definition of external fragmentation events (including serious ones) and
    a series of patches that reduces the level of those fragmentation
    events.

    The details of the workload and the consequences are described in more
    detail in the changelogs. However, from patch 1, this is a high-level
    summary of the adverse workload. The exact details are found in the
    mmtests implementation.

    The broad details of the workload are as follows;

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch)
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed
    3. Warm up a number of fio read-only threads accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll fault back in old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with less
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup

    Overall the series reduces external fragmentation causing events by over 94%
    on 1 and 2 socket machines, which in turn impacts high-order allocation
    success rates over the long term. There are differences in latencies and
    high-order allocation success rates. Latencies are a mixed bag as they
    are vulnerable to exact system state and whether allocations succeeded
    so they are treated as a secondary metric.

    Patch 1 uses lower zones if they are populated and have free memory
    instead of fragmenting a higher zone. It's special cased to
    handle a Normal->DMA32 fallback with the reasons explained
    in the changelog.

    Patch 2-4 boosts watermarks temporarily when an external fragmentation
    event occurs. kswapd wakes to reclaim a small amount of old memory
    and then wakes kcompactd on completion to recover the system
    slightly. This introduces some overhead in the slowpath. The level
    of boosting can be tuned or disabled depending on the tolerance
    for fragmentation vs allocation latency.

    Patch 5 stalls some movable allocation requests to let kswapd from patch 4
    make some progress. The duration of the stalls is very low but it
    is possible to tune the system to avoid fragmentation events if
    larger stalls can be tolerated.

    The bulk of the improvement in fragmentation avoidance is from patches
    1-4 but patch 5 can deal with a rare corner case and provides the option
    of tuning a system for THP allocation success rates in exchange for
    some stalls to control fragmentation.

    This patch (of 5):

    The page allocator zone lists are iterated based on the watermarks of each
    zone which does not take anti-fragmentation into account. On x86, node 0
    may have multiple zones while other nodes have one zone. A consequence is
    that tasks running on node 0 may fragment ZONE_NORMAL even though
    ZONE_DMA32 has plenty of free memory. This patch special cases the
    allocator fast path such that it'll try an allocation from a lower local
    zone before fragmenting a higher zone. In this case, stealing of
    pageblocks or orders larger than a pageblock is still allowed in the fast
    path as they are uninteresting from a fragmentation point of view.
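
    The following is a minimal, hypothetical model of that fast-path
    decision, not the patch itself; the names zone_model, ZIDX_* and
    ALLOC_NOFRAG_HINT are invented for illustration. The hint is only worth
    setting when the preferred zone is ZONE_NORMAL and the node also has a
    populated ZONE_DMA32 with free memory above its watermark; the first
    fast-path attempt then avoids migratetype-stealing fallbacks and, if it
    fails, the hint is dropped and allocation proceeds as before.

    #include <stdbool.h>

    /* Illustrative zone indices for an x86 node 0 with two zones. */
    enum zone_idx_model { ZIDX_DMA32, ZIDX_NORMAL };

    struct zone_model {
            bool populated;
            unsigned long free_pages;
            unsigned long low_watermark;
    };

    /* Hint for the first fast-path attempt: avoid fallbacks that would mix
     * migratetypes within a pageblock of the preferred (higher) zone. */
    #define ALLOC_NOFRAG_HINT 0x1u

    static unsigned int nofrag_hint_model(const struct zone_model zones[2],
                                          enum zone_idx_model preferred)
    {
            const struct zone_model *dma32 = &zones[ZIDX_DMA32];

            /* Only the Normal->DMA32 case is special cased. */
            if (preferred != ZIDX_NORMAL)
                    return 0;
            /* No point if the lower zone cannot satisfy the request anyway. */
            if (!dma32->populated || dma32->free_pages <= dma32->low_watermark)
                    return 0;
            return ALLOC_NOFRAG_HINT;
    }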

    This was evaluated using a benchmark designed to fragment memory before
    attempting THP allocations. It's implemented in mmtests as the following
    configurations

    configs/config-global-dhp__workload_thpfioscale
    configs/config-global-dhp__workload_thpfioscale-defrag
    configs/config-global-dhp__workload_thpfioscale-madvhugepage

    e.g. from mmtests
    ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

    The broad details of the workload are as follows:

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch).
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed.
    3. Warm up a number of fio read-only processes accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll refault old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with less
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds.
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup the test files.

    Note that due to the use of IO and page cache, this benchmark is not
    suitable for running on large machines where the time to fragment memory
    may be excessive. Also note that while this is one mix that generates
    fragmentation, it is not the only such mix. Workloads that are more
    slab-intensive, or configurations where SLUB is used with high-order
    pages, may yield different results.

    When the page allocator fragments memory, it records the event using the
    mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
    a pageblock order (order-9 on 64-bit x86) then it's considered to be an
    "external fragmentation event" that may cause issues in the future.
    Hence, the primary metric here is the number of external fragmentation
    events that occur with order < 9. The secondary metrics are allocation
    latency and huge page allocation success rates, but note that differences
    in latencies and success rates can themselves affect the number of
    external fragmentation events, which is why they are secondary metrics.
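
    For clarity, the counting rule for the primary metric can be written as
    a one-line predicate. This is only a sketch of the classification (the
    real counting was done from the ftrace output by mmtests) and the names
    are illustrative.

    #include <stdbool.h>

    /* 2MB pageblock / 4K base pages => order 9 on 64-bit x86. */
    #define PAGEBLOCK_ORDER_MODEL 9

    /* An mm_page_alloc_extfrag record counts as an "external fragmentation
     * event" when the fallback happened at an order below pageblock order. */
    static bool is_extfrag_event_model(unsigned int fallback_order)
    {
            return fallback_order < PAGEBLOCK_ORDER_MODEL;
    }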

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)

    thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Amean     fault-base-1      662.92 (   0.00%)     653.58 *   1.41%*
    Amean     fault-huge-1        0.00 (   0.00%)       0.00 (   0.00%)

    thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Percentage huge-1              0.00 (   0.00%)       0.00 (   0.00%)

    Fault latencies are slightly reduced while allocation success rates remain
    at zero as this configuration does not make any special effort to allocate
    THP, and fio is heavily active at the time, either filling memory or
    keeping pages resident. However, a 49% reduction of serious fragmentation
    events reduces the chances of external fragmentation being a problem in
    the future.

    Vlastimil asked during review for a breakdown of the allocation types
    that are falling back.

    vanilla
       3816 MIGRATE_UNMOVABLE
     800845 MIGRATE_MOVABLE
         33 MIGRATE_RECLAIMABLE

    patch
        735 MIGRATE_UNMOVABLE
     408135 MIGRATE_MOVABLE
         42 MIGRATE_RECLAIMABLE

    The majority of the fallbacks are due to movable allocations, and this is
    consistent for the workload throughout the series, so the breakdown will
    not be presented again.

    Movable fallbacks are sometimes considered "ok" because the pages can be
    migrated. The problem is that they can fill an unmovable/reclaimable
    pageblock, causing those allocations to fall back later and polluting
    pageblocks with pages that cannot move. If there is a movable fallback,
    it is almost guaranteed to affect an unmovable/reclaimable pageblock, and
    while that might not be enough to actually cause an unmovable/reclaimable
    fallback in the future, we cannot know that in advance, so the patch takes
    the only option available to it. Hence, it is important to control movable
    fallbacks. This point is also consistent throughout the series and will
    not be repeated.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)

    thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Amean     fault-base-1     1495.14 (   0.00%)    1467.55 (   1.85%)
    Amean     fault-huge-1     1098.48 (   0.00%)    1127.11 (  -2.61%)

    thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Percentage huge-1             78.57 (   0.00%)      77.64 (  -1.18%)

    Fragmentation events were reduced quite a bit although this is known
    to be a little variable. The latencies and allocation success rates
    are similar but they were already quite high.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)

    thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Amean     fault-base-5     1350.05 (   0.00%)    1346.45 (   0.27%)
    Amean     fault-huge-5     4181.01 (   0.00%)    3418.60 (  18.24%)

    thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Percentage huge-5              1.15 (   0.00%)       0.78 ( -31.88%)

    The reduction of external fragmentation events is slight and this is
    partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
    ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
    allocations can now spill over to remote nodes instead of fragmenting
    local memory.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)

    thpfioscale Fault Latencies
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Amean     fault-base-5     6138.97 (   0.00%)    6217.43 (  -1.28%)
    Amean     fault-huge-5     2294.28 (   0.00%)    3163.33 * -37.88%*

    thpfioscale Percentage Faults Huge
                                   4.20.0-rc3             4.20.0-rc3
                                      vanilla           lowzone-v5r8
    Percentage huge-5             96.82 (   0.00%)      95.14 (  -1.74%)

    There was a slight reduction in external fragmentation events although the
    latencies were higher. The allocation success rate is high enough that the
    system is under heavy pressure, with quite a lot of parallel reclaim and
    compaction activity. For this patch there is also a certain degree of luck
    in whether processes start on node 0 or not, but the relevance of that is
    reduced later in the series.

    Overall, the patch reduces the number of external-fragmentation-causing
    events, so THP allocation success over long periods of time should be
    improved for this adverse workload.

    Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Zi Yan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman