29 Jan, 2019

1 commit

  • This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

    The underlying assumption that one sparse section belongs to a single
    NUMA node doesn't really hold. Robert Shteynfeld has reported a boot
    failure. The boot log was not captured but his memory layout is as
    follows:

    Early memory node ranges
    node 1: [mem 0x0000000000001000-0x0000000000090fff]
    node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    node 1: [mem 0x0000000100000000-0x0000001423ffffff]
    node 0: [mem 0x0000001424000000-0x0000002023ffffff]

    This means that node0 starts in the middle of a memory section which is
    also in node1. memmap_init_zone tries to initialize the padding of a
    section even when it is outside of the given pfn range, because there are
    code paths (e.g. memory hotplug) which assume that a full memory section
    is always initialized.

    In this particular case, though, such a range is already initialized and
    most likely already managed by the page allocator. Scribbling over
    those pages corrupts the internal state and likely blows up when any of
    those pages get used.
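
    A back-of-the-envelope check of the reported layout shows the mismatch
    directly (assuming the x86_64 defaults of PAGE_SHIFT = 12 and
    SECTION_SIZE_BITS = 27, i.e. 32768 pages per sparse section; this is only
    illustrative arithmetic, not code from the patch):

    /* node 0 starts at physical address 0x1424000000 */
    unsigned long node0_start_pfn = 0x1424000000UL >> 12; /* = 0x1424000 */
    unsigned long pages_per_section = 1UL << (27 - 12);   /* = 0x8000   */

    /*
     * 0x1424000 / 0x8000 = 644.5: node 0 begins exactly half way through
     * sparse section 644, whose lower half belongs to node 1, so the
     * one-section-one-node assumption is violated.
     */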

    Reported-by: Robert Shteynfeld
    Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
    Cc: stable@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

09 Jan, 2019

1 commit

  • syzbot reported the following regression in the latest merge window and
    it was confirmed by Qian Cai that a similar bug was visible from a
    different context.

    ======================================================
    WARNING: possible circular locking dependency detected
    4.20.0+ #297 Not tainted
    ------------------------------------------------------
    syz-executor0/8529 is trying to acquire lock:
    000000005e7fb829 (&pgdat->kswapd_wait){....}, at:
    __wake_up_common_lock+0x19e/0x330 kernel/sched/wait.c:120

    but task is already holding lock:
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: spin_lock
    include/linux/spinlock.h:329 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_bulk
    mm/page_alloc.c:2548 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: __rmqueue_pcplist
    mm/page_alloc.c:3021 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue_pcplist
    mm/page_alloc.c:3050 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at: rmqueue
    mm/page_alloc.c:3072 [inline]
    000000009bb7bae0 (&(&zone->lock)->rlock){-.-.}, at:
    get_page_from_freelist+0x1bae/0x52a0 mm/page_alloc.c:3491

    It appears to be a false positive in that the only way the lock ordering
    should be inverted is if kswapd is waking itself and the wakeup
    allocates debugging objects which should already be allocated if it's
    kswapd doing the waking. Nevertheless, the possibility exists and so
    it's best to avoid the problem.

    This patch flags a zone as needing a kswapd wakeup using the, surprisingly,
    unused zone flags field. The flag is read without the lock held to do
    the wakeup. It's possible that the flag-setting context is not the same
    as the flag-clearing context, or for small races to occur. However, each
    race possibility is harmless and there is no visible degradation in
    fragmentation treatment.

    While zone->flags could have continued to be unused, there is potential
    for moving some existing fields into the flags field instead,
    particularly read-mostly ones like zone->initialized and
    zone->contiguous.
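
    A minimal sketch of the pattern described above (identifiers are
    illustrative and may not match the patch exactly): the flag is set while
    zone->lock is held, and the wakeup is issued only after the lock has been
    dropped.

    /* under zone->lock, when a fragmentation event requests a boost */
    set_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);

    /* later, after zone->lock has been released */
    if (test_and_clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))
            wakeup_kswapd(zone, 0, 0, zone_idx(zone));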

    Link: http://lkml.kernel.org/r/20190103225712.GJ31517@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Reported-by: syzbot+93d94a001cfbce9e60e1@syzkaller.appspotmail.com
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Tested-by: Qian Cai
    Cc: Dmitry Vyukov
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

29 Dec, 2018

24 commits

  • Model call chain after should_failslab(). Likewise, we can now use a
    kprobe to override the return value of should_fail_alloc_page() and inject
    allocation failures into alloc_page*().

    This will allow injecting allocation failures using the BCC tools even
    without building the kernel with CONFIG_FAIL_PAGE_ALLOC and booting it with
    a fail_page_alloc= parameter, which incurs some overhead even when failures
    are not being injected. On the other hand, this patch adds an
    unconditional call to should_fail_alloc_page() to the page allocation
    hotpath. That overhead should be rather negligible with
    CONFIG_FAIL_PAGE_ALLOC=n when there's no kprobe attached, though.
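
    A hedged sketch of the shape such a hook takes when modeled after
    should_failslab(); the exact signature and helper name in the patch may
    differ:

    #include <linux/error-injection.h>

    noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
            /* __should_fail_alloc_page() stands in for the existing
             * CONFIG_FAIL_PAGE_ALLOC logic (or a stub when it is disabled) */
            return __should_fail_alloc_page(gfp_mask, order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);

    With ALLOW_ERROR_INJECTION() in place, a kprobe/BPF based override can
    force the function to return true and so fail the allocation on demand.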

    [vbabka@suse.cz: changelog addition]
    Link: http://lkml.kernel.org/r/20181214074330.18917-1-bpoirier@suse.com
    Signed-off-by: Benjamin Poirier
    Acked-by: Vlastimil Babka
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Benjamin Poirier
     
  • drain_all_pages is documented to drain per-cpu pages for a given zone (if
    non-NULL). The current implementation doesn't match the description
    though. It will drain all pcp pages for all zones that happen to have
    cached pages on the same cpu as the given zone. This will lead to
    premature pcp cache draining for zones that are not of any interest to the
    caller - e.g. compaction, hwpoison or memory offline.

    This forces the page allocator to take locks, with potential lock contention
    as a result.

    There is no real reason for this sub-optimal implementation. Replace the
    per-cpu work item with a dedicated structure which contains a pointer to
    the zone and pass it over to the worker. This gets the zone information
    all the way down to the worker function so it can do the right job.
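
    A hedged sketch of the dedicated per-cpu structure described above (names
    are illustrative):

    struct pcpu_drain {
            struct zone *zone;
            struct work_struct work;
    };
    static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);

    static void drain_local_pages_wq(struct work_struct *work)
    {
            struct pcpu_drain *drain =
                    container_of(work, struct pcpu_drain, work);

            /* the zone pointer now reaches the worker */
            drain_local_pages(drain->zone);
    }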

    [akpm@linux-foundation.org: avoid 80-col tricks]
    [mhocko@suse.com: refactor the whole changelog]
    Link: http://lkml.kernel.org/r/20181212142550.61686-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
    page initialization can take a long time. Below are the reported init
    times on an 8-socket 96-core 4TB IvyBridge system.

    1) Non-debug kernel without CONFIG_KASAN
    [ 8.764222] node 1 initialised, 132086516 pages in 7027ms

    2) Debug kernel with CONFIG_KASAN
    [ 146.288115] node 1 initialised, 132075466 pages in 143052ms

    So the page init time in a debug kernel was about 20X that of the non-debug
    kernel. The long init time can be problematic as the page initialization is
    done with interrupts disabled. In this particular case, it caused the
    appearance of the following warning messages as well as NMI backtraces of
    all the cores that were doing the initialization.

    [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
    [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
    [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
    [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
    [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
    [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
    [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
    [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

    The long init time was mainly caused by the call to kasan_free_pages() to
    poison the newly initialized pages. On a 4TB system, we are talking about
    almost 500GB of memory probably on the same node.

    In reality, we may not need to poison the newly initialized pages before
    they are ever allocated. So KASAN poisoning of freed pages before the
    completion of deferred memory initialization is now disabled. Those pages
    will be properly poisoned when they are allocated or freed after deferred
    pages initialization is done.

    With this change, the new page initialization time became:

    [ 21.948010] node 1 initialised, 132075466 pages in 18702ms

    This was still about double the non-debug kernel time, but was much
    better than before.
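
    A hedged sketch of the guard described above; deferred_pages_done() is an
    illustrative placeholder for however the implementation detects that
    deferred initialization has finished:

    static void kasan_free_nondeferred_pages(struct page *page, int order)
    {
            /* skip poisoning until deferred struct page init has completed;
             * deferred_pages_done() is a placeholder, not a real kernel API */
            if (IS_ENABLED(CONFIG_DEFERRED_STRUCT_PAGE_INIT) &&
                !deferred_pages_done())
                    return;

            kasan_free_pages(page, order);
    }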

    Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Currently, NR_PAGEBLOCK_BITS and MIGRATE_TYPES are not associated in code.
    If someone adds an extra migrate type, they may forget to enlarge
    NR_PAGEBLOCK_BITS. Hence some way of enforcing the relation is required.

    NR_PAGEBLOCK_BITS depends on MIGRATE_TYPES, but these macros are spread
    over two different header files with a reverse dependency, so it is a
    little hard to refer to MIGRATE_TYPES in pageblock-flags.h. This patch
    enforces the relation at compile time.
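
    One possible form of such a compile-time check (hedged; the actual patch
    may express the relation differently), placed in a function that already
    sees both headers:

    /* enough pageblock migratetype bits must exist for every migrate type */
    BUILD_BUG_ON(MIGRATE_TYPES > (1 << PB_migratetype_bits));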

    Link: http://lkml.kernel.org/r/1544508709-11358-1-git-send-email-kernelfans@gmail.com
    Signed-off-by: Pingfan Liu
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pingfan Liu
     
  • Since commit 03e85f9d5f1 ("mm/page_alloc: Introduce
    free_area_init_core_hotplug"), some functions changed to only be called
    during system initialization. Concretely, free_area_init_node() and the
    functions that hang from it.

    Also, some variables are no longer used after the system has gone
    through initialization. So this could be considered as a late clean-up
    for that patch.

    This patch changes the functions from __meminit to __init, and the
    variables from __meminitdata to __initdata.

    In return, we get some KBs back:

    Before:
    Freeing unused kernel image memory: 2472K

    After:
    Freeing unused kernel image memory: 2480K

    Link: http://lkml.kernel.org/r/20181204111507.4808-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Alexander Duyck
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • When DEFERRED_STRUCT_PAGE_INIT is configured, only the first section of
    each node's highest zone is initialized before the deferred stage.

    static_init_pgcnt is used to store the number of pages like this:

    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
                                     pgdat->node_spanned_pages);

    because we don't want to overflow the zone's range.

    But this is not necessary, since defer_init() is called like this:

    memmap_init_zone()
        for pfn in [start_pfn, end_pfn)
            defer_init(pfn, end_pfn)

    In case (pgdat->node_spanned_pages < PAGES_PER_SECTION), the loop would
    stop before calling defer_init().

    BTW, comparing PAGES_PER_SECTION with node_spanned_pages is not correct,
    since nr_initialised is zone based instead of node based. Even if
    node_spanned_pages is bigger than PAGES_PER_SECTION, its highest zone
    may have fewer pages than PAGES_PER_SECTION.
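
    A hedged sketch of the loop shape referred to above (argument lists are
    abbreviated, matching the pseudocode earlier in this entry), showing why
    the min_t() clamp is unnecessary: defer_init() is only consulted for pfns
    that are inside the zone anyway.

    for (pfn = start_pfn; pfn < end_pfn; pfn++) {
            /* decide whether the rest is left to the deferred-init threads */
            if (defer_init(pfn, end_pfn))
                    break;
            __init_single_page(pfn_to_page(pfn), pfn, zone, nid);
    }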

    Link: http://lkml.kernel.org/r/20181122094807.6985-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Alexander Duyck
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • The OOM report contains several sections. The first one is the allocation
    context that has triggered the OOM. Then we have the cpuset context followed
    by the stack trace of the OOM path. The third one is the OOM memory
    information, followed by the current memory state of all system tasks.
    At last, we show the oom-eligible tasks and the information about the
    chosen oom victim.

    One thing that makes parsing more awkward than necessary is that we do not
    have a single and easily parsable line about the oom context. This patch
    is reorganizing the oom report to

    1) who invoked oom and what was the allocation request

    [ 515.902945] tuned invoked oom-killer: gfp_mask=0x6200ca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=0

    2) OOM stack trace

    [ 515.904273] CPU: 24 PID: 1809 Comm: tuned Not tainted 4.20.0-rc3+ #3
    [ 515.905518] Hardware name: Inspur SA5212M4/YZMB-00370-107, BIOS 4.1.10 11/14/2016
    [ 515.906821] Call Trace:
    [ 515.908062] dump_stack+0x5a/0x73
    [ 515.909311] dump_header+0x55/0x28c
    [ 515.914260] oom_kill_process+0x2d8/0x300
    [ 515.916708] out_of_memory+0x145/0x4a0
    [ 515.917932] __alloc_pages_slowpath+0x7d2/0xa16
    [ 515.919157] __alloc_pages_nodemask+0x277/0x290
    [ 515.920367] filemap_fault+0x3d0/0x6c0
    [ 515.921529] ? filemap_map_pages+0x2b8/0x420
    [ 515.922709] ext4_filemap_fault+0x2c/0x40 [ext4]
    [ 515.923884] __do_fault+0x20/0x80
    [ 515.925032] __handle_mm_fault+0xbc0/0xe80
    [ 515.926195] handle_mm_fault+0xfa/0x210
    [ 515.927357] __do_page_fault+0x233/0x4c0
    [ 515.928506] do_page_fault+0x32/0x140
    [ 515.929646] ? page_fault+0x8/0x30
    [ 515.930770] page_fault+0x1e/0x30

    3) OOM memory information

    [ 515.958093] Mem-Info:
    [ 515.959647] active_anon:26501758 inactive_anon:1179809 isolated_anon:0
    active_file:4402672 inactive_file:483963 isolated_file:1344
    unevictable:0 dirty:4886753 writeback:0 unstable:0
    slab_reclaimable:148442 slab_unreclaimable:18741
    mapped:1347 shmem:1347 pagetables:58669 bounce:0
    free:88663 free_pcp:0 free_cma:0
    ...

    4) current memory state of all system tasks

    [ 516.079544] [ 744] 0 744 9211 1345 114688 82 0 systemd-journal
    [ 516.082034] [ 787] 0 787 31764 0 143360 92 0 lvmetad
    [ 516.084465] [ 792] 0 792 10930 1 110592 208 -1000 systemd-udevd
    [ 516.086865] [ 1199] 0 1199 13866 0 131072 112 -1000 auditd
    [ 516.089190] [ 1222] 0 1222 31990 1 110592 157 0 smartd
    [ 516.091477] [ 1225] 0 1225 4864 85 81920 43 0 irqbalance
    [ 516.093712] [ 1226] 0 1226 52612 0 258048 426 0 abrtd
    [ 516.112128] [ 1280] 0 1280 109774 55 299008 400 0 NetworkManager
    [ 516.113998] [ 1295] 0 1295 28817 37 69632 24 0 ksmtuned
    [ 516.144596] [ 10718] 0 10718 2622484 1721372 15998976 267219 0 panic
    [ 516.145792] [ 10719] 0 10719 2622484 1164767 9818112 53576 0 panic
    [ 516.146977] [ 10720] 0 10720 2622484 1174361 9904128 53709 0 panic
    [ 516.148163] [ 10721] 0 10721 2622484 1209070 10194944 54824 0 panic
    [ 516.149329] [ 10722] 0 10722 2622484 1745799 14774272 91138 0 panic

    5) oom context (constraints and the chosen victim).

    oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0-1,task=panic,pid=10737,uid=0

    An admin can easily get the full oom context on a single line, which
    makes parsing much easier.
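
    The single-line context is ordinary printk output; a hedged sketch of
    roughly how such a line can be emitted (constraint_name, cpuset_name,
    victim and uid are illustrative placeholders, not identifiers from the
    patch):

    pr_info("oom-kill:constraint=%s,nodemask=%*pbl,cpuset=%s,mems_allowed=%*pbl,task=%s,pid=%d,uid=%d\n",
            constraint_name, nodemask_pr_args(nodemask), cpuset_name,
            nodemask_pr_args(&current->mems_allowed),
            victim->comm, victim->pid, uid);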

    Link: http://lkml.kernel.org/r/1542799799-36184-1-git-send-email-ufo19890607@gmail.com
    Signed-off-by: yuzhoujian
    Acked-by: Michal Hocko
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: "Kirill A . Shutemov"
    Cc: Roman Gushchin
    Cc: Tetsuo Handa
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    yuzhoujian
     
  • and propagate it down the call stack.

    Link: http://lkml.kernel.org/r/20181124091411.GC10969@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Those strings are in fact immutable.

    Link: http://lkml.kernel.org/r/20181124090327.GA10877@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • An external fragmentation event was previously described as

    When the page allocator fragments memory, it records the event using
    the mm_page_alloc_extfrag event. If the fallback_order is smaller
    than a pageblock order (order-9 on 64-bit x86) then it's considered
    an event that will cause external fragmentation issues in the future.

    The kernel reduces the probability of such events by increasing the
    watermark sizes by calling set_recommended_min_free_kbytes early in the
    lifetime of the system. This works reasonably well in general but if
    there are enough sparsely populated pageblocks then the problem can still
    occur as enough memory is free overall and kswapd stays asleep.

    This patch introduces a watermark_boost_factor sysctl that allows a zone
    watermark to be temporarily boosted when an external fragmentation-causing
    event occurs. The boosting will stall allocations that would decrease
    free memory below the boosted low watermark, and kswapd is woken, if the
    calling context allows it, to reclaim an amount of memory relative to the
    size of the high watermark and the watermark_boost_factor until the boost is
    cleared. When kswapd finishes, it wakes kcompactd at the pageblock order
    to clean some of the pageblocks that may have been affected by the
    fragmentation event. kswapd avoids any writeback, slab shrinkage and swap
    from reclaim context during this operation to avoid excessive system
    disruption in the name of fragmentation avoidance. Care is taken so that
    kswapd will do normal reclaim work if the system is really low on memory.

    This was evaluated using the same workloads as "mm, page_alloc: Spread
    allocations across zones before introducing fragmentation".

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)
    4.20-rc3+patch1-4: 18421 (98% reduction)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Amean fault-base-1 653.58 ( 0.00%) 652.71 ( 0.13%)
    Amean fault-huge-1 0.00 ( 0.00%) 178.93 * -99.00%*

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-1 0.00 ( 0.00%) 5.12 ( 100.00%)

    Note that external fragmentation causing events are massively reduced by
    this patch whether in comparison to the previous kernel or the vanilla
    kernel. The fault latency for huge pages appears to be increased but that
    is only because THP allocations were successful with the patch applied.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)
    4.20-rc3+patch1-4: 13464 (95% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Min fault-base-1 912.00 ( 0.00%) 905.00 ( 0.77%)
    Min fault-huge-1 127.00 ( 0.00%) 135.00 ( -6.30%)
    Amean fault-base-1 1467.55 ( 0.00%) 1481.67 ( -0.96%)
    Amean fault-huge-1 1127.11 ( 0.00%) 1063.88 * 5.61%*

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-1 77.64 ( 0.00%) 83.46 ( 7.49%)

    As before, massive reduction in external fragmentation events, some jitter
    on latencies and an increase in THP allocation success rates.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)
    4.20-rc3+patch1-4: 14263 (93% reduction)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Amean fault-base-5 1346.45 ( 0.00%) 1306.87 ( 2.94%)
    Amean fault-huge-5 3418.60 ( 0.00%) 1348.94 ( 60.54%)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-5 0.78 ( 0.00%) 7.91 ( 910.64%)

    There is a 93% reduction in fragmentation causing events, there is a big
    reduction in the huge page fault latency and allocation success rate is
    higher.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)
    4.20-rc3+patch1-4: 11095 (93% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Amean fault-base-5 6217.43 ( 0.00%) 7419.67 * -19.34%*
    Amean fault-huge-5 3163.33 ( 0.00%) 3263.80 ( -3.18%)

    4.20.0-rc3 4.20.0-rc3
    lowzone-v5r8 boost-v5r8
    Percentage huge-5 95.14 ( 0.00%) 87.98 ( -7.53%)

    There is a large reduction in fragmentation events with some jitter around
    the latencies and success rates. As before, the high THP allocation
    success rate does mean the system is under a lot of pressure. However, as
    the fragmentation events are reduced, it would be expected that the
    long-term allocation success rate would be higher.
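
    A hedged sketch of the boosting step (the names follow the changelog;
    the clamping and exact fields in the real implementation may differ):

    static inline void boost_watermark(struct zone *zone)
    {
            unsigned long max_boost;

            if (!watermark_boost_factor)
                    return;

            /* boost by one pageblock per event, capped relative to the high
             * watermark scaled by watermark_boost_factor (in 1/10000ths) */
            max_boost = mult_frac(high_wmark_pages(zone),
                                  watermark_boost_factor, 10000);
            zone->watermark_boost = min(zone->watermark_boost + pageblock_nr_pages,
                                        max_boost);
    }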

    Link: http://lkml.kernel.org/r/20181123114528.28802-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is a preparation patch that copies the GFP flag __GFP_KSWAPD_RECLAIM
    into alloc_flags. It is a preparation patch only, in that it avoids having
    to pass gfp_mask through a long call chain in a future patch.

    Note that the setting in the fast path happens in alloc_flags_nofragment()
    and it may be claimed that this has nothing to do with ALLOC_NO_FRAGMENT.
    That's true in this patch but is not true later so it's done now for
    easier review to show where the flag needs to be recorded.

    No functional change.
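
    A hedged sketch of the flag copy (the flag's numeric value is
    illustrative):

    #define ALLOC_KSWAPD 0x200 /* allow waking of kswapd */

    /* in alloc_flags_nofragment(), mirroring __GFP_KSWAPD_RECLAIM */
    if (gfp_mask & __GFP_KSWAPD_RECLAIM)
            alloc_flags |= ALLOC_KSWAPD;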

    [mgorman@techsingularity.net: ALLOC_KSWAPD flag needs to be applied in the !CONFIG_ZONE_DMA32 case]
    Link: http://lkml.kernel.org/r/20181126143503.GO23260@techsingularity.net
    Link: http://lkml.kernel.org/r/20181123114528.28802-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This is a preparation patch only, no functional change.

    Link: http://lkml.kernel.org/r/20181123114528.28802-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Zi Yan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Patch series "Fragmentation avoidance improvements", v5.

    It has been noted before that fragmentation avoidance (aka
    anti-fragmentation) is not perfect. Given sufficient time or an adverse
    workload, memory gets fragmented and the long-term success of high-order
    allocations degrades. This series defines an adverse workload, a definition
    of external fragmentation events (including serious ones) and a set of
    changes that reduces the level of those fragmentation events.

    The details of the workload and the consequences are described in more
    detail in the changelogs. However, from patch 1, this is a high-level
    summary of the adverse workload. The exact details are found in the
    mmtests implementation.

    The broad details of the workload are as follows;

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch)
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed
    3. Warm up a number of fio read-only threads accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll fault back in old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with less
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup

    Overall the series reduces external fragmentation causing events by over 94%
    on 1 and 2 socket machines, which in turn impacts high-order allocation
    success rates over the long term. There are differences in latencies and
    high-order allocation success rates. Latencies are a mixed bag as they
    are vulnerable to exact system state and whether allocations succeeded
    so they are treated as a secondary metric.

    Patch 1 uses lower zones if they are populated and have free memory
    instead of fragmenting a higher zone. It's special cased to
    handle a Normal->DMA32 fallback with the reasons explained
    in the changelog.

    Patch 2-4 boosts watermarks temporarily when an external fragmentation
    event occurs. kswapd wakes to reclaim a small amount of old memory
    and then wakes kcompactd on completion to recover the system
    slightly. This introduces some overhead in the slowpath. The level
    of boosting can be tuned or disabled depending on the tolerance
    for fragmentation vs allocation latency.

    Patch 5 stalls some movable allocation requests to let kswapd from patch 4
    make some progress. The duration of the stalls is very low but it
    is possible to tune the system to avoid fragmentation events if
    larger stalls can be tolerated.

    The bulk of the improvement in fragmentation avoidance is from patches
    1-4 but patch 5 can deal with a rare corner case and provides the option
    of tuning a system for THP allocation success rates in exchange for
    some stalls to control fragmentation.

    This patch (of 5):

    The page allocator zone lists are iterated based on the watermarks of each
    zone which does not take anti-fragmentation into account. On x86, node 0
    may have multiple zones while other nodes have one zone. A consequence is
    that tasks running on node 0 may fragment ZONE_NORMAL even though
    ZONE_DMA32 has plenty of free memory. This patch special cases the
    allocator fast path such that it'll try an allocation from a lower local
    zone before fragmenting a higher zone. In this case, stealing of
    pageblocks or orders larger than a pageblock are still allowed in the fast
    path as they are uninteresting from a fragmentation point of view.

    This was evaluated using a benchmark designed to fragment memory before
    attempting THP allocations. It's implemented in mmtests as the following
    configurations

    configs/config-global-dhp__workload_thpfioscale
    configs/config-global-dhp__workload_thpfioscale-defrag
    configs/config-global-dhp__workload_thpfioscale-madvhugepage

    e.g. from mmtests
    ./run-mmtests.sh --run-monitor --config configs/config-global-dhp__workload_thpfioscale test-run-1

    The broad details of the workload are as follows;

    1. Create an XFS filesystem (not specified in the configuration but done
    as part of the testing for this patch).
    2. Start 4 fio threads that write a number of 64K files inefficiently.
    Inefficiently means that files are created on first access and not
    created in advance (fio parameter create_on_open=1) and fallocate
    is not used (fallocate=none). With multiple IO issuers this creates
    a mix of slab and page cache allocations over time. The total size
    of the files is 150% physical memory so that the slabs and page cache
    pages get mixed.
    3. Warm up a number of fio read-only processes accessing the same files
    created in step 2. This part runs for the same length of time it
    took to create the files. It'll refault old data and further
    interleave slab and page cache allocations. As it's now low on
    memory due to step 2, fragmentation occurs as pageblocks get
    stolen.
    4. While step 3 is still running, start a process that tries to allocate
    75% of memory as huge pages with a number of threads. The number of
    threads is based on a (NR_CPUS_SOCKET - NR_FIO_THREADS)/4 to avoid THP
    threads contending with fio, any other threads or forcing cross-NUMA
    scheduling. Note that the test has not been used on a machine with less
    than 8 cores. The benchmark records whether huge pages were allocated
    and what the fault latency was in microseconds.
    5. Measure the number of events potentially causing external fragmentation,
    the fault latency and the huge page allocation success rate.
    6. Cleanup the test files.

    Note that due to the use of IO and page cache, this benchmark is not
    suitable for running on large machines where the time to fragment memory
    may be excessive. Also note that while this is one mix that generates
    fragmentation, it is not the only mix that generates fragmentation.
    Differences in workload that are more slab-intensive or whether SLUB is
    used with high-order pages may yield different results.

    When the page allocator fragments memory, it records the event using the
    mm_page_alloc_extfrag ftrace event. If the fallback_order is smaller than
    a pageblock order (order-9 on 64-bit x86) then it's considered to be an
    "external fragmentation event" that may cause issues in the future.
    Hence, the primary metric here is the number of external fragmentation
    events that occur with order < 9. The secondary metrics are allocation
    latency and huge page allocation success rates, but note that differences
    in latencies and success rates can themselves affect the number of
    external fragmentation events, which is why they are secondary metrics.

    1-socket Skylake machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 1 THP allocating thread
    --------------------------------------

    4.20-rc3 extfrag events < order 9: 804694
    4.20-rc3+patch: 408912 (49% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-1 662.92 ( 0.00%) 653.58 * 1.41%*
    Amean fault-huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)

    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-1 0.00 ( 0.00%) 0.00 ( 0.00%)

    Fault latencies are slightly reduced while allocation success rates remain
    at zero as this configuration does not make any special effort to allocate
    THP and fio is heavily active at the time and either filling memory or
    keeping pages resident. However, a 49% reduction of serious fragmentation
    events reduces the chances of external fragmentation being a problem in
    the future.

    Vlastimil asked during review for a breakdown of the allocation types
    that are falling back.

    vanilla
    3816 MIGRATE_UNMOVABLE
    800845 MIGRATE_MOVABLE
    33 MIGRATE_RECLAIMABLE

    patch
    735 MIGRATE_UNMOVABLE
    408135 MIGRATE_MOVABLE
    42 MIGRATE_RECLAIMABLE

    The majority of the fallbacks are due to movable allocations and this is
    consistent for the workload throughout the series so will not be presented
    again as the primary source of fallbacks are movable allocations.

    Movable fallbacks are sometimes considered "ok" to fallback because they
    can be migrated. The problem is that they can fill an
    unmovable/reclaimable pageblock causing those allocations to fallback
    later and polluting pageblocks with pages that cannot move. If there is a
    movable fallback, it is pretty much guaranteed to affect an
    unmovable/reclaimable pageblock and while it might not be enough to
    actually cause an unmovable/reclaimable fallback in the future, we cannot
    know that in advance so the patch takes the only option available to it.
    Hence, it's important to control them. This point is also consistent
    throughout the series and will not be repeated.

    1-socket Skylake machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 291392
    4.20-rc3+patch: 191187 (34% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-1 1495.14 ( 0.00%) 1467.55 ( 1.85%)
    Amean fault-huge-1 1098.48 ( 0.00%) 1127.11 ( -2.61%)

    thpfioscale Percentage Faults Huge
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-1 78.57 ( 0.00%) 77.64 ( -1.18%)

    Fragmentation events were reduced quite a bit although this is known
    to be a little variable. The latencies and allocation success rates
    are similar but they were already quite high.

    2-socket Haswell machine
    config-global-dhp__workload_thpfioscale XFS (no special madvise)
    4 fio threads, 5 THP allocating threads
    ----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 215698
    4.20-rc3+patch: 200210 (7% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-5 1350.05 ( 0.00%) 1346.45 ( 0.27%)
    Amean fault-huge-5 4181.01 ( 0.00%) 3418.60 ( 18.24%)

    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-5 1.15 ( 0.00%) 0.78 ( -31.88%)

    The reduction of external fragmentation events is slight and this is
    partially due to the removal of __GFP_THISNODE in commit ac5b2c18911f
    ("mm: thp: relax __GFP_THISNODE for MADV_HUGEPAGE mappings") as THP
    allocations can now spill over to remote nodes instead of fragmenting
    local memory.

    2-socket Haswell machine
    global-dhp__workload_thpfioscale-madvhugepage-xfs (MADV_HUGEPAGE)
    -----------------------------------------------------------------

    4.20-rc3 extfrag events < order 9: 166352
    4.20-rc3+patch: 147463 (11% reduction)

    thpfioscale Fault Latencies
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Amean fault-base-5 6138.97 ( 0.00%) 6217.43 ( -1.28%)
    Amean fault-huge-5 2294.28 ( 0.00%) 3163.33 * -37.88%*

    thpfioscale Percentage Faults Huge
    4.20.0-rc3 4.20.0-rc3
    vanilla lowzone-v5r8
    Percentage huge-5 96.82 ( 0.00%) 95.14 ( -1.74%)

    There was a slight reduction in external fragmentation events although the
    latencies were higher. The allocation success rate is high enough that
    the system is struggling and there is quite a lot of parallel reclaim and
    compaction activity. There is also a certain degree of luck on whether
    processes start on node 0 or not for this patch but the relevance is
    reduced later in the series.

    Overall, the patch reduces the number of external fragmentation causing
    events so the success of THP over long periods of time would be improved
    for this adverse workload.
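
    A hedged sketch of the fast-path special case (simplified; the real
    helper guards the Normal->DMA32 fallback with additional checks, and the
    flag name is illustrative):

    static inline unsigned int alloc_flags_nofragment(struct zone *zone)
    {
    #ifdef CONFIG_ZONE_DMA32
            /* Only ZONE_NORMAL is worth protecting from fragmentation here;
             * lower zones either do not exist or are the fallback target. */
            if (zone_idx(zone) != ZONE_NORMAL)
                    return 0;

            /* If a populated ZONE_DMA32 exists on this node, let the fast
             * path try it before stealing pageblocks in ZONE_NORMAL. */
            if (!populated_zone(&zone->zone_pgdat->node_zones[ZONE_DMA32]))
                    return 0;
    #endif
            return ALLOC_NOFRAGMENT;
    }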

    Link: http://lkml.kernel.org/r/20181123114528.28802-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Zi Yan
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There are multiple places that free a page; they all do the same things,
    so a common function can be used to reduce code duplication.

    It also avoids having a bug fixed in one function but left unfixed in
    another.

    Link: http://lkml.kernel.org/r/20181119134834.17765-3-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Cc: Alexander Duyck
    Cc: Ilias Apalodimas
    Cc: Jesper Dangaard Brouer
    Cc: Mel Gorman
    Cc: Pankaj gupta
    Cc: Pawel Staszewski
    Cc: Tariq Toukan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • page_frag_free() calls __free_pages_ok() to free the page back to the buddy
    allocator. This is OK for high order pages, but for order-0 pages it misses
    the optimization opportunity of using Per-Cpu-Pages (PCP) and can cause
    zone lock contention when called frequently.

    Pawel Staszewski recently shared his result of 'how Linux kernel handles
    normal traffic'[1] and from perf data, Jesper Dangaard Brouer found the
    lock contention comes from page allocator:

    mlx5e_poll_tx_cq
    |
    --16.34%--napi_consume_skb
    |
    |--12.65%--__free_pages_ok
    | |
    | --11.86%--free_one_page
    | |
    | |--10.10%--queued_spin_lock_slowpath
    | |
    | --0.65%--_raw_spin_lock
    |
    |--1.55%--page_frag_free
    |
    --1.44%--skb_release_data

    Jesper explained how it happened: mlx5 driver RX-page recycle mechanism is
    not effective in this workload and pages have to go through the page
    allocator. The lock contention happens during mlx5 DMA TX completion
    cycle. And the page allocator cannot keep up at these speeds.[2]

    I thought that __free_pages_ok() was mostly freeing high order pages and
    that this was lock contention for high order pages, but Jesper
    explained in detail that __free_pages_ok() here is actually freeing
    order-0 pages because mlx5 is using order-0 pages to satisfy its page pool
    allocation requests.[3]

    The free path as pointed out by Jesper is:
    skb_free_head()
    -> skb_free_frag()
    -> page_frag_free()
    And the pages being freed on this path are order-0 pages.

    Fix this by doing the same thing as __page_frag_cache_drain() - send
    the page being freed to the PCP if it's an order-0 page, or directly to
    the buddy allocator if it is a high order page.

    With this change, Paweł has not noticed lock contention so far in his
    workload, and Jesper has measured a 7% performance improvement with a micro
    benchmark, with the lock contention gone. Ilias' test on a 'low' speed 1Gbit
    interface on a cortex-a53 shows an ~11% performance boost testing with
    64-byte packets, and __free_pages_ok() disappeared from perf top.

    [1]: https://www.spinics.net/lists/netdev/msg531362.html
    [2]: https://www.spinics.net/lists/netdev/msg531421.html
    [3]: https://www.spinics.net/lists/netdev/msg531556.html
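
    A hedged sketch of the fix described above (not the literal patch):

    void page_frag_free(void *addr)
    {
            struct page *page = virt_to_head_page(addr);

            if (unlikely(put_page_testzero(page))) {
                    unsigned int order = compound_order(page);

                    /* order-0 pages go through the per-cpu lists (PCP),
                     * higher orders go straight back to the buddy */
                    if (order == 0)
                            free_unref_page(page);
                    else
                            __free_pages_ok(page, order);
            }
    }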

    [akpm@linux-foundation.org: add comment]
    Link: http://lkml.kernel.org/r/20181120014544.GB10657@intel.com
    Signed-off-by: Aaron Lu
    Reported-by: Pawel Staszewski
    Analysed-by: Jesper Dangaard Brouer
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Acked-by: Jesper Dangaard Brouer
    Acked-by: Ilias Apalodimas
    Tested-by: Ilias Apalodimas
    Acked-by: Alexander Duyck
    Acked-by: Tariq Toukan
    Acked-by: Pankaj gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • In the enum migratetype definition, MIGRATE_MOVABLE is before
    MIGRATE_RECLAIMABLE. Change the order of them to match the enumeration's
    order.

    Link: http://lkml.kernel.org/r/20181121085821.3442-1-sjhuang@iluvatar.ai
    Signed-off-by: Huang Shijie
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • Now that totalram_pages and managed_pages are atomic variables, there is no
    need for the managed_page_count spinlock. The lock really only provided a
    weak consistency guarantee: it wasn't used for anything but the updates,
    and no reader actually cares about all the values being updated in sync.

    Link: http://lkml.kernel.org/r/1542090790-21750-5-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Konstantin Khlebnikov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • totalram_pages and totalhigh_pages are made static inline functions.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomic, with
    preventing potential store-to-read tearing as a bonus.
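
    A hedged sketch of the accessor shape this results in:

    extern atomic_long_t _totalram_pages;

    static inline unsigned long totalram_pages(void)
    {
            return (unsigned long)atomic_long_read(&_totalram_pages);
    }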

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/1542090790-21750-4-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • totalram_pages, zone->managed_pages and totalhigh_pages updates are
    protected by managed_page_count_lock, but readers never care about it.
    Convert these variables to atomic to avoid readers potentially seeing a
    store tear.

    This patch converts zone->managed_pages. Subsequent patches will convert
    totalram_pages, totalhigh_pages and eventually managed_page_count_lock
    will be removed.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 So it seems
    better to remove the lock and convert the variables to atomic, with
    preventing potential store-to-read tearing as a bonus.

    Link: http://lkml.kernel.org/r/1542090790-21750-3-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Suggested-by: Michal Hocko
    Suggested-by: Vlastimil Babka
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • Patch series "mm: convert totalram_pages, totalhigh_pages and managed
    pages to atomic", v5.

    This series converts totalram_pages, totalhigh_pages and
    zone->managed_pages to atomic variables.

    totalram_pages, zone->managed_pages and totalhigh_pages updates are
    protected by managed_page_count_lock, but readers never care about it.
    Convert these variables to atomic to avoid readers potentially seeing a
    store tear.

    The main motivation was that managed_page_count_lock handling was
    complicating things. It was discussed at length here,
    https://lore.kernel.org/patchwork/patch/995739/#1181785 It seems better
    to remove the lock and convert the variables to atomic. With the change,
    preventing potential store-to-read tearing comes as a bonus.

    This patch (of 4):

    This is in preparation for a later patch which converts totalram_pages and
    zone->managed_pages to atomic variables. Please note that re-reading the
    value might lead to a different value and as such it could lead to
    unexpected behavior. There are no known bugs as a result of the current
    code but it is better to prevent them in principle.

    Link: http://lkml.kernel.org/r/1542090790-21750-2-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Konstantin Khlebnikov
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • per_cpu_pageset is cleared by memset, so it is not necessary to reset it
    again.

    Link: http://lkml.kernel.org/r/20181021023920.5501-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Heiko has complained that his log is swamped by warnings from
    has_unmovable_pages

    [ 20.536664] page dumped because: has_unmovable_pages
    [ 20.536792] page:000003d081ff4080 count:1 mapcount:0 mapping:000000008ff88600 index:0x0 compound_mapcount: 0
    [ 20.536794] flags: 0x3fffe0000010200(slab|head)
    [ 20.536795] raw: 03fffe0000010200 0000000000000100 0000000000000200 000000008ff88600
    [ 20.536796] raw: 0000000000000000 0020004100000000 ffffffff00000001 0000000000000000
    [ 20.536797] page dumped because: has_unmovable_pages
    [ 20.536814] page:000003d0823b0000 count:1 mapcount:0 mapping:0000000000000000 index:0x0
    [ 20.536815] flags: 0x7fffe0000000000()
    [ 20.536817] raw: 07fffe0000000000 0000000000000100 0000000000000200 0000000000000000
    [ 20.536818] raw: 0000000000000000 0000000000000000 ffffffff00000001 0000000000000000

    which are not triggered by the memory hotplug but rather CMA allocator.
    The original idea behind dumping the page state for all call paths was
    that these messages would be helpful when debugging failures. From the
    above it seems that this is not the case for the CMA path because we are
    lacking much more context. E.g. the second reported page might be a CMA
    allocated page. It is still interesting to see a slab page in the CMA area,
    but it is hard to tell whether this is a bug from the above output alone.

    Address this issue by dumping the page state only on request. Both
    start_isolate_page_range and has_unmovable_pages already have an argument
    to ignore hwpoison pages, so make this argument more generic, turn it
    into flags and allow callers to combine non-default modes into a mask.
    While we are at it, it is questionable for the has_unmovable_pages call
    from is_pageblock_removable_nolock (the sysfs removable file) to report
    the failure, so drop the reporting from there as well.
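
    A hedged sketch of turning the bool argument into a flags mask (flag
    names and values are illustrative):

    /* isolation / has_unmovable_pages() modes, combinable into a mask */
    #define SKIP_HWPOISON   0x1     /* ignore hwpoisoned pages */
    #define REPORT_FAILURE  0x2     /* dump the offending page on failure */

    bool has_unmovable_pages(struct zone *zone, struct page *page, int count,
                             int migratetype, int flags);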

    Link: http://lkml.kernel.org/r/20181218092802.31429-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Heiko Carstens
    Reviewed-by: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • There is only very limited information printed when the memory offlining
    fails:

    [ 1984.506184] rac1 kernel: memory offlining [mem 0x82600000000-0x8267fffffff] failed due to signal backoff

    This tells us that the failure was triggered by userspace intervention
    but it doesn't tell us much more about the underlying reason. It might be
    that the page migration fails repeatedly and the userspace timeout
    expires and sends a signal, or it might be that some of the earlier steps
    (isolation, memory notifier) take too long.

    If the migration fails then it would be really helpful to see which page
    failed and its state. The same applies to the isolation phase. If we fail
    to isolate a page from the allocator then knowing the state of the page
    would be helpful as well.

    Dump the page state that fails to get isolated or migrated. This will
    tell us more about the failure and what to focus on during debugging.

    [akpm@linux-foundation.org: add missing printk arg]
    [mhocko@suse.com: tweak dump_page() `reason' text]
    Link: http://lkml.kernel.org/r/20181116083020.20260-6-mhocko@kernel.org
    Link: http://lkml.kernel.org/r/20181107101830.17405-6-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Reviewed-by: Anshuman Khandual
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Tag-based KASAN doesn't check memory accesses through pointers tagged with
    0xff. When page_address is used to get pointer to memory that corresponds
    to some page, the tag of the resulting pointer gets set to 0xff, even
    though the allocated memory might have been tagged differently.

    For slab pages it's impossible to recover the correct tag to return from
    page_address, since the page might contain multiple slab objects tagged
    with different values, and we can't know in advance which one of them is
    going to get accessed. For non slab pages however, we can recover the tag
    in page_address, since the whole page was marked with the same tag.

    This patch adds tagging to non slab memory allocated with pagealloc. To
    set the tag of the pointer returned from page_address, the tag gets stored
    to page->flags when the memory gets allocated.
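
    A hedged sketch of storing and recovering the tag via page->flags (the
    shift and mask macro names are treated as illustrative here):

    static inline u8 page_kasan_tag(const struct page *page)
    {
            return (page->flags >> KASAN_TAG_PGSHIFT) & KASAN_TAG_MASK;
    }

    static inline void page_kasan_tag_set(struct page *page, u8 tag)
    {
            page->flags &= ~(KASAN_TAG_MASK << KASAN_TAG_PGSHIFT);
            page->flags |= ((unsigned long)tag & KASAN_TAG_MASK) <<
                           KASAN_TAG_PGSHIFT;
    }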

    Link: http://lkml.kernel.org/r/d758ddcef46a5abc9970182b9137e2fbee202a2c.1544099024.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Andrey Ryabinin
    Reviewed-by: Dmitry Vyukov
    Acked-by: Will Deacon
    Cc: Christoph Lameter
    Cc: Mark Rutland
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

22 Dec, 2018

2 commits

  • While playing with gigantic hugepages and memory_hotplug, I triggered
    the following #PF when "cat memoryX/removable":

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:has_unmovable_pages+0x154/0x210
    Call Trace:
    is_mem_section_removable+0x7d/0x100
    removable_show+0x90/0xb0
    dev_attr_show+0x1c/0x50
    sysfs_kf_seq_show+0xca/0x1b0
    seq_read+0x133/0x380
    __vfs_read+0x26/0x180
    vfs_read+0x89/0x140
    ksys_read+0x42/0x90
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The reason is that we do not pass the head page to page_hstate(), and so
    the call to compound_order() in page_hstate() returns 0, so we end up
    checking all hstates' sizes for a match with PAGE_SIZE.

    Obviously, we do not find any hstate matching that size, and we return
    NULL. Then, we dereference that NULL pointer in
    hugepage_migration_supported() and we got the #PF from above.

    Fix that by getting the head page before calling page_hstate().

    Also, since gigantic pages span several pageblocks, re-adjust the logic
    for skipping pages. While at it, we can also get rid of the
    round_up().
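
    A hedged sketch of the fix and the adjusted skip logic inside
    has_unmovable_pages() (simplified from the description above):

    if (PageHuge(page)) {
            struct page *head = compound_head(page);
            unsigned int skip_pages;

            /* pass the head page so page_hstate() sees the real order */
            if (!hugepage_migration_supported(page_hstate(head)))
                    return true;

            /* skip the remaining tail pages, which may span several
             * pageblocks for gigantic pages */
            skip_pages = (1 << compound_order(head)) - (page - head);
            iter += skip_pages - 1;
            continue;
    }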

    [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
    Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
    Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Vlastimil Babka
    Cc: Pavel Tatashin
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • If memory end is not aligned with the sparse memory section boundary,
    the mapping of such a section is only partly initialized. This may lead
    to VM_BUG_ON due to uninitialized struct page access from
    is_mem_section_removable() or test_pages_in_a_zone() function triggered
    by memory_hotplug sysfs handlers:

    Here are the panic examples:
    CONFIG_DEBUG_VM=y
    CONFIG_DEBUG_VM_PGFLAGS=y

    kernel parameter mem=2050M
    --------------------------
    page:000003d082008000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    ( test_pages_in_a_zone+0xde/0x160)
    show_valid_zones+0x5c/0x190
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    test_pages_in_a_zone+0xde/0x160
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    kernel parameter mem=3075M
    --------------------------
    page:000003d08300c000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    ( is_mem_section_removable+0xb4/0x190)
    show_mem_removable+0x9a/0xd8
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    is_mem_section_removable+0xb4/0x190
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    Fix the problem by initializing the last memory section of each zone in
    memmap_init_zone() till the very end, even if it goes beyond the zone end.

    Michal said:

    : This has always been a problem AFAIU. It just went unnoticed because we
    : have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
    : zeroing memory during allocation in vmemmap") and so the above test
    : would simply skip these ranges as belonging to zone 0 or provided
    : garbage.
    :
    : So I guess we do care for post f7f99100d8d9 kernels mostly and
    : therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
    : allocation in vmemmap")
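
    A hedged sketch of the change (this appears to be the memmap_init_zone()
    adjustment that the 29 Jan entry at the top of this log later reverts):

    #ifdef CONFIG_SPARSEMEM
            /* Initialize the remainder of the last section as well, so that
             * paths which assume fully initialized sections (e.g. memory
             * hotplug) do not trip over poisoned struct pages. */
            while (end_pfn % PAGES_PER_SECTION) {
                    __init_single_page(pfn_to_page(end_pfn), end_pfn, zone, nid);
                    end_pfn++;
            }
    #endif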

    Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: Mikhail Zaslonko
    Reviewed-by: Gerald Schaefer
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Reported-by: Mikhail Gavrilov
    Tested-by: Mikhail Gavrilov
    Cc: Dave Hansen
    Cc: Alexander Duyck
    Cc: Pasha Tatashin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mikhail Zaslonko
     

01 Dec, 2018

1 commit

  • init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
    'zone_idx(zone) + 1' unconditionally. This is correct in the normal
    case, but not exact in the hot-plug situation.

    This function is used in two places:

    * free_area_init_core()
    * move_pfn_range_to_zone()

    In the first case, we are sure the zone index increases monotonically,
    while in the second one, this is under the user's control.

    One way to reproduce this is:
    ----------------------------

    1. create a virtual machine with empty node1

    -m 4G,slots=32,maxmem=32G \
    -smp 4,maxcpus=8 \
    -numa node,nodeid=0,mem=4G,cpus=0-3 \
    -numa node,nodeid=1,mem=0G,cpus=4-7

    2. hot-add cpu 3-7

    cpu-add [3-7]

    3. hot-add memory to node1

    object_add memory-backend-ram,id=ram0,size=1G
    device_add pc-dimm,id=dimm0,memdev=ram0,node=1

    4. online memory with the following order

    echo online_movable > memory47/state
    echo online > memory40/state

    After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
    instead of (ZONE_MOVABLE + 1).

    Michal said:
    "Having an incorrect nr_zones might result in all sorts of problems
    which would be quite hard to debug (e.g. reclaim not considering the
    movable zone). I do not expect many users would suffer from this it
    but still this is trivial and obviously right thing to do so
    backporting to the stable tree shouldn't be harmful (last famous
    words)"
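
    A hedged sketch of the fix in init_currently_empty_zone(): only grow
    nr_zones, never pull it back below an already-online zone.

    /* instead of the unconditional: pgdat->nr_zones = zone_idx(zone) + 1; */
    if (zone_idx(zone) + 1 > pgdat->nr_zones)
            pgdat->nr_zones = zone_idx(zone) + 1;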

    Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

19 Nov, 2018

2 commits

  • Konstantin has noticed that kvmalloc might trigger the following
    warning:

    WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
    [...]
    Call Trace:
    fragmentation_index+0x76/0x90
    compaction_suitable+0x4f/0xf0
    shrink_node+0x295/0x310
    node_reclaim+0x205/0x250
    get_page_from_freelist+0x649/0xad0
    __alloc_pages_nodemask+0x12a/0x2a0
    kmalloc_large_node+0x47/0x90
    __kmalloc_node+0x22b/0x2e0
    kvmalloc_node+0x3e/0x70
    xt_alloc_table_info+0x3a/0x80 [x_tables]
    do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
    nf_setsockopt+0x44/0x60
    SyS_setsockopt+0x6f/0xc0
    do_syscall_64+0x67/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The problem is that we only check for an out-of-bounds order in the slow
    path, while node reclaim can already happen from the fast path. This
    could be fixed by making sure that kvmalloc never uses kmalloc for
    requests larger than KMALLOC_MAX_SIZE, but it also shows that the code
    is rather fragile. A recent UBSAN report underlines that:

    UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
    shift exponent 51 is too large for 32-bit type 'int'
    CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xd2/0x148 lib/dump_stack.c:113
    ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
    __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
    __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
    zone_watermark_fast mm/page_alloc.c:3216 [inline]
    get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
    __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
    alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
    alloc_pages include/linux/gfp.h:509 [inline]
    __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
    dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
    raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
    raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
    fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
    fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
    block_ioctl+0x105/0x150 fs/block_dev.c:1883
    vfs_ioctl fs/ioctl.c:46 [inline]
    do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
    ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
    __do_sys_ioctl fs/ioctl.c:709 [inline]
    __se_sys_ioctl fs/ioctl.c:707 [inline]
    __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
    do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Note that this is not a kvmalloc path. It simply shows that the fast
    path depends on a sanitized order as well. Therefore move the order
    check to the fast path.
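
    As a rough illustration of moving the order check ahead of the fast
    path, here is a standalone C model; the MAX_ORDER value and the helper
    functions are hypothetical stand-ins, not the page allocator's real
    code.

    #include <stdbool.h>
    #include <stdio.h>

    #define MAX_ORDER 11    /* illustrative; config-dependent in the kernel */

    static bool fast_path(unsigned int order) { return order <= 3; }
    static bool slow_path(unsigned int order) { return order < MAX_ORDER; }

    static bool alloc_pages_model(unsigned int order)
    {
            /* reject out-of-bounds orders before either path runs, so code
             * reached from the fast path (node reclaim, watermark checks)
             * never sees a bogus order and never shifts by it */
            if (order >= MAX_ORDER)
                    return false;

            return fast_path(order) || slow_path(order);
    }

    int main(void)
    {
            printf("order 51: %s\n", alloc_pages_model(51) ? "ok" : "rejected");
            printf("order 2:  %s\n", alloc_pages_model(2)  ? "ok" : "rejected");
            return 0;
    }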

    Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Konstantin Khlebnikov
    Reported-by: Kyungtae Kim
    Acked-by: Vlastimil Babka
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Byoungyoung Lee
    Cc: "Dae R. Jeong"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Page state checks are racy. Under a heavy memory workload (e.g. stress
    -m 200 -t 2h) it is quite easy to hit a race window when the page is
    allocated but its state is not fully populated yet. A debugging patch to
    dump the struct page state shows

    has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
    page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
    flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

    Note that the state has been checked for both PageLRU and PageSwapBacked
    already. Closing this race completely would require some sort of retry
    logic. This can be tricky and error prone (think of potential endless
    or long-running loops).

    Work around this problem at least for movable zones. Such a zone should
    only contain movable pages. Commit 15c30bc09085 ("mm, memory_hotplug:
    make has_unmovable_pages more robust") has shown that this is not
    strictly true, though. Bootmem pages should be marked reserved, so we
    can move the original check after the PageReserved check. Pages from
    other zones are still prone to races, but we do not even pretend that
    memory hotremove works for those, so premature failure doesn't hurt
    that much.
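
    The reordering described above can be modeled with a small standalone C
    sketch; the struct, the flag fields and the zone constant are simplified
    stand-ins for the kernel's types.

    #include <stdbool.h>
    #include <stdio.h>

    #define ZONE_MOVABLE 3          /* illustrative zone index */

    struct page_model {
            bool reserved;          /* models PageReserved() */
            int  zone;              /* zone index the page belongs to */
    };

    /* returns true if the page would block memory offlining */
    static bool is_unmovable(const struct page_model *page)
    {
            if (page->reserved)
                    return true;    /* e.g. bootmem pages */

            /* movable zone: trust the zone, skip the racy per-page checks */
            if (page->zone == ZONE_MOVABLE)
                    return false;

            /* ... the remaining (racy) PageLRU/PageSwapBacked-style checks
             * would go here for other zones ... */
            return false;
    }

    int main(void)
    {
            struct page_model bootmem = { .reserved = true,  .zone = ZONE_MOVABLE };
            struct page_model movable = { .reserved = false, .zone = ZONE_MOVABLE };

            printf("bootmem page unmovable: %d\n", is_unmovable(&bootmem));
            printf("movable page unmovable: %d\n", is_unmovable(&movable));
            return 0;
    }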

    Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
    Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
    Signed-off-by: Michal Hocko
    Reported-by: Baoquan He
    Tested-by: Baoquan He
    Acked-by: Baoquan He
    Reviewed-by: Oscar Salvador
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

31 Oct, 2018

6 commits

  • When memblock allocation APIs are called with align = 0, the alignment
    is implicitly set to SMP_CACHE_BYTES.

    Implicit alignment is done deep in the memblock allocator and can come
    as a surprise. Not that such an alignment would be wrong even when it is
    used unintentionally, but it is better to be explicit for the sake of
    clarity and the principle of least surprise. A small model of the
    now-removed implicit promotion follows the semantic patch below.

    Replace all such uses of memblock APIs with the 'align' parameter
    explicitly set to SMP_CACHE_BYTES and stop implicit alignment assignment
    in the memblock internal allocation functions.

    For the cases where memblock APIs are used via helper functions, e.g.
    iommu_arena_new_node() on Alpha, the helper functions were detected with
    Coccinelle's help and then manually examined and updated where
    appropriate.

    The direct memblock APIs users were updated using the semantic patch below:

    @@
    expression size, min_addr, max_addr, nid;
    @@
    (
    |
    - memblock_alloc_try_nid_raw(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid_raw(size, SMP_CACHE_BYTES, min_addr, max_addr,
    nid)
    |
    - memblock_alloc_try_nid_nopanic(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid_nopanic(size, SMP_CACHE_BYTES, min_addr, max_addr,
    nid)
    |
    - memblock_alloc_try_nid(size, 0, min_addr, max_addr, nid)
    + memblock_alloc_try_nid(size, SMP_CACHE_BYTES, min_addr, max_addr, nid)
    |
    - memblock_alloc(size, 0)
    + memblock_alloc(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_raw(size, 0)
    + memblock_alloc_raw(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_from(size, 0, min_addr)
    + memblock_alloc_from(size, SMP_CACHE_BYTES, min_addr)
    |
    - memblock_alloc_nopanic(size, 0)
    + memblock_alloc_nopanic(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_low(size, 0)
    + memblock_alloc_low(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_low_nopanic(size, 0)
    + memblock_alloc_low_nopanic(size, SMP_CACHE_BYTES)
    |
    - memblock_alloc_from_nopanic(size, 0, min_addr)
    + memblock_alloc_from_nopanic(size, SMP_CACHE_BYTES, min_addr)
    |
    - memblock_alloc_node(size, 0, nid)
    + memblock_alloc_node(size, SMP_CACHE_BYTES, nid)
    )
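
    The following standalone C fragment models the implicit promotion that
    this change removes from the memblock internals; the function and the
    SMP_CACHE_BYTES value are illustrative stand-ins.

    #include <stdio.h>

    #define SMP_CACHE_BYTES 64UL    /* illustrative cache-line size */

    /* old internal behaviour, now dropped: align = 0 silently became
     * SMP_CACHE_BYTES; callers are expected to spell it out instead */
    static unsigned long effective_align(unsigned long align)
    {
            return align ? align : SMP_CACHE_BYTES;
    }

    int main(void)
    {
            printf("align 0   -> %lu (implicit promotion)\n", effective_align(0));
            printf("align 128 -> %lu\n", effective_align(128));
            return 0;
    }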

    [mhocko@suse.com: changelog update]
    [akpm@linux-foundation.org: coding-style fixes]
    [rppt@linux.ibm.com: fix missed uses of implicit alignment]
    Link: http://lkml.kernel.org/r/20181016133656.GA10925@rapoport-lnx
    Link: http://lkml.kernel.org/r/1538687224-17535-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Suggested-by: Michal Hocko
    Acked-by: Paul Burton [MIPS]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: Matt Turner
    Cc: Michal Simek
    Cc: Richard Weinberger
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Move remaining definitions and declarations from include/linux/bootmem.h
    into include/linux/memblock.h and remove the redundant header.

    The includes were replaced with the semantic patch below, followed by a
    semi-automated removal of the resulting duplicated
    '#include <linux/memblock.h>' directives.

    @@
    @@
    - #include <linux/bootmem.h>
    + #include <linux/memblock.h>

    [sfr@canb.auug.org.au: dma-direct: fix up for the removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181002185342.133d1680@canb.auug.org.au
    [sfr@canb.auug.org.au: powerpc: fix up for removal of linux/bootmem.h]
    Link: http://lkml.kernel.org/r/20181005161406.73ef8727@canb.auug.org.au
    [sfr@canb.auug.org.au: x86/kaslr, ACPI/NUMA: fix for linux/bootmem.h removal]
    Link: http://lkml.kernel.org/r/20181008190341.5e396491@canb.auug.org.au
    Link: http://lkml.kernel.org/r/1536927045-23536-30-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Signed-off-by: Stephen Rothwell
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The conversion is done using

    sed -i 's@__free_pages_bootmem@memblock_free_pages@' \
    $(git grep -l __free_pages_bootmem)

    Link: http://lkml.kernel.org/r/1536927045-23536-27-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The conversion is done using

    sed -i 's@free_all_bootmem@memblock_free_all@' \
    $(git grep -l free_all_bootmem)

    Link: http://lkml.kernel.org/r/1536927045-23536-26-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The conversion is done using

    sed -i 's@memblock_virt_alloc@memblock_alloc@g' \
    $(git grep -l memblock_virt_alloc)

    Link: http://lkml.kernel.org/r/1536927045-23536-8-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • All architectures use memblock for early memory management, so there is
    no need for the CONFIG_HAVE_MEMBLOCK configuration option.

    [rppt@linux.vnet.ibm.com: of/fdt: fixup #ifdefs]
    Link: http://lkml.kernel.org/r/20180919103457.GA20545@rapoport-lnx
    [rppt@linux.vnet.ibm.com: csky: fixups after bootmem removal]
    Link: http://lkml.kernel.org/r/20180926112744.GC4628@rapoport-lnx
    [rppt@linux.vnet.ibm.com: remove stale #else and the code it protects]
    Link: http://lkml.kernel.org/r/1538067825-24835-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1536927045-23536-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Tested-by: Jonathan Cameron
    Cc: Catalin Marinas
    Cc: Chris Zankel
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Kroah-Hartman
    Cc: Guan Xuetao
    Cc: Ingo Molnar
    Cc: "James E.J. Bottomley"
    Cc: Jonas Bonn
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Burton
    Cc: Richard Kuo
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Serge Semin
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

27 Oct, 2018

3 commits

  • When checking for valid pfns in zero_resv_unavail(), it is not necessary
    to verify that every pfn within a pageblock_nr_pages range is valid;
    only the first one needs to be checked. This is because memory for
    struct pages is allocated in contiguous chunks that contain
    pageblock_nr_pages struct pages, as the sketch below illustrates.
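
    A standalone C model of the loop shape this implies is shown below;
    pfn_valid() and pageblock_nr_pages are simplified stand-ins, not the
    kernel's implementation.

    #include <stdbool.h>
    #include <stdio.h>

    #define pageblock_nr_pages 512UL        /* illustrative */

    static bool pfn_valid_model(unsigned long pfn) { return pfn < 0x100000UL; }

    static unsigned long zero_unavail_model(unsigned long start, unsigned long end)
    {
            unsigned long pfn, zeroed = 0;

            for (pfn = start; pfn < end; pfn++) {
                    /* struct pages come in contiguous pageblock-sized chunks,
                     * so checking the first pfn of each block is enough */
                    if (!(pfn & (pageblock_nr_pages - 1)) && !pfn_valid_model(pfn)) {
                            pfn |= pageblock_nr_pages - 1;  /* skip this block */
                            continue;
                    }
                    zeroed++;       /* stands in for zeroing the struct page */
            }
            return zeroed;
    }

    int main(void)
    {
            printf("zeroed %lu struct pages\n", zero_unavail_model(0xff000, 0x101000));
            return 0;
    }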

    Link: http://lkml.kernel.org/r/20181002143821.5112-3-msys.mizuma@gmail.com
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Masayoshi Mizuma
    Reviewed-by: Masayoshi Mizuma
    Acked-by: Naoya Horiguchi
    Reviewed-by: Oscar Salvador
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "mm: Fix for movable_node boot option", v3.

    This patch series contains a fix for the movable_node boot option issue
    which was introduced by commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM
    regions into memblock.reserved").

    The commit breaks the option because it turned memory gap ranges into
    reserved memblock regions, so the node is marked as a Normal zone even
    if the SRAT marks its memory as hot-pluggable.

    The first and second patches fix the original issue that the commit
    tried to address; the third reverts the commit.

    This patch (of 3):

    There is a kernel panic that is triggered when reading /proc/kpageflags on
    the kernel booted with kernel parameter 'memmap=nn[KMG]!ss[KMG]':

    BUG: unable to handle kernel paging request at fffffffffffffffe
    PGD 9b20e067 P4D 9b20e067 PUD 9b210067 PMD 0
    Oops: 0000 [#1] SMP PTI
    CPU: 2 PID: 1728 Comm: page-types Not tainted 4.17.0-rc6-mm1-v4.17-rc6-180605-0816-00236-g2dfb086ef02c+ #160
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.fc28 04/01/2014
    RIP: 0010:stable_page_flags+0x27/0x3c0
    Code: 00 00 00 0f 1f 44 00 00 48 85 ff 0f 84 a0 03 00 00 41 54 55 49 89 fc 53 48 8b 57 08 48 8b 2f 48 8d 42 ff 83 e2 01 48 0f 44 c7 8b 00 f6 c4 01 0f 84 10 03 00 00 31 db 49 8b 54 24 08 4c 89 e7
    RSP: 0018:ffffbbd44111fde0 EFLAGS: 00010202
    RAX: fffffffffffffffe RBX: 00007fffffffeff9 RCX: 0000000000000000
    RDX: 0000000000000001 RSI: 0000000000000202 RDI: ffffed1182fff5c0
    RBP: ffffffffffffffff R08: 0000000000000001 R09: 0000000000000001
    R10: ffffbbd44111fed8 R11: 0000000000000000 R12: ffffed1182fff5c0
    R13: 00000000000bffd7 R14: 0000000002fff5c0 R15: ffffbbd44111ff10
    FS: 00007efc4335a500(0000) GS:ffff93a5bfc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: fffffffffffffffe CR3: 00000000b2a58000 CR4: 00000000001406e0
    Call Trace:
    kpageflags_read+0xc7/0x120
    proc_reg_read+0x3c/0x60
    __vfs_read+0x36/0x170
    vfs_read+0x89/0x130
    ksys_pread64+0x71/0x90
    do_syscall_64+0x5b/0x160
    entry_SYSCALL_64_after_hwframe+0x44/0xa9
    RIP: 0033:0x7efc42e75e23
    Code: 09 00 ba 9f 01 00 00 e8 ab 81 f4 ff 66 2e 0f 1f 84 00 00 00 00 00 90 83 3d 29 0a 2d 00 00 75 13 49 89 ca b8 11 00 00 00 0f 05 3d 01 f0 ff ff 73 34 c3 48 83 ec 08 e8 db d3 01 00 48 89 04 24

    According to a kernel bisection, this problem became visible due to
    commit f7f99100d8d9, which changed how struct pages are initialized.

    Memblock layout affects the pfn ranges covered by node/zone. Consider
    that we have a VM with 2 NUMA nodes and each node has 4GB memory, and the
    default (no memmap= given) memblock layout is like below:

    MEMBLOCK configuration:
    memory size = 0x00000001fff75c00 reserved size = 0x000000000300c000
    memory.cnt = 0x4
    memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
    memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
    memory[0x2] [0x0000000100000000-0x000000013fffffff], 0x0000000040000000 bytes on node 0 flags: 0x0
    memory[0x3] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
    ...

    If you give memmap=1G!4G (so it just covers memory[0x2]),
    the range [0x100000000-0x13fffffff] is gone:

    MEMBLOCK configuration:
    memory size = 0x00000001bff75c00 reserved size = 0x000000000300c000
    memory.cnt = 0x3
    memory[0x0] [0x0000000000001000-0x000000000009efff], 0x000000000009e000 bytes on node 0 flags: 0x0
    memory[0x1] [0x0000000000100000-0x00000000bffd6fff], 0x00000000bfed7000 bytes on node 0 flags: 0x0
    memory[0x2] [0x0000000140000000-0x000000023fffffff], 0x0000000100000000 bytes on node 1 flags: 0x0
    ...

    This causes node 0's pfn range to shrink because it is calculated from
    the address range of memblock.memory, so some of the struct pages in
    the gap range are left uninitialized.

    We have a function zero_resv_unavail() which zeroes the struct pages
    outside memblock.memory, but currently it covers only the reserved
    unavailable ranges (i.e. memblock.reserved && !memblock.memory). This
    patch extends it to cover all unavailable ranges, which fixes the
    reported issue, as modeled in the sketch below.
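
    The extension can be illustrated with the memblock dump above. The
    following standalone C sketch walks the gaps between memblock.memory
    ranges (the pfn values mirror the memmap=1G!4G layout shown earlier)
    and counts the struct pages that now get zeroed; it is only a model,
    not the kernel's code.

    #include <stdio.h>

    struct range { unsigned long start, end; };     /* [start, end) in pfns */

    int main(void)
    {
            /* memblock.memory with memmap=1G!4G: [0x100000, 0x140000) is gone */
            const struct range memory[] = {
                    { 0x000001, 0x00009f },
                    { 0x000100, 0x0bffd7 },
                    { 0x140000, 0x240000 },
            };
            unsigned long next_pfn = 0, gap_pages = 0;

            for (unsigned int i = 0; i < sizeof(memory) / sizeof(memory[0]); i++) {
                    if (memory[i].start > next_pfn)
                            gap_pages += memory[i].start - next_pfn; /* zero these */
                    next_pfn = memory[i].end;
            }

            printf("struct pages to zero in the gaps: %lu\n", gap_pages);
            return 0;
    }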

    Link: http://lkml.kernel.org/r/20181002143821.5112-2-msys.mizuma@gmail.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Masayoshi Mizuma
    Tested-by: Oscar Salvador
    Tested-by: Masayoshi Mizuma
    Reviewed-by: Pavel Tatashin
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • memmap_init_zone() is getting complex because it is called from different
    contexts (during boot and from memory hotplug) and because it must
    handle some architecture quirks. One of them is mirrored memory.

    Move the code that decides whether to skip mirrored memory out of
    memmap_init_zone() and into a separate function.
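
    A minimal standalone sketch of that factoring is shown below; the region
    table and the helper's body are simplified stand-ins (the real helper is
    the overlap_memmap_init() mentioned in the notes that follow).

    #include <stdbool.h>
    #include <stdio.h>

    struct region { unsigned long start_pfn, end_pfn; bool mirrored; };

    static const struct region regions[] = {
            { 0x00000, 0x10000, false },
            { 0x10000, 0x18000, true  },    /* mirrored: the init loop skips these */
    };

    /* the skip decision lives in its own helper instead of inline in the
     * memmap init loop */
    static bool skip_pfn_at_init(unsigned long pfn)
    {
            for (unsigned int i = 0; i < sizeof(regions) / sizeof(regions[0]); i++)
                    if (pfn >= regions[i].start_pfn && pfn < regions[i].end_pfn)
                            return regions[i].mirrored;
            return false;
    }

    int main(void)
    {
            unsigned long pfn, initialized = 0;

            for (pfn = 0; pfn < 0x18000; pfn++)
                    if (!skip_pfn_at_init(pfn))
                            initialized++;  /* init the struct page here */

            printf("initialized %lu struct pages\n", initialized);
            return 0;
    }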

    [pasha.tatashin@oracle.com: uninline overlap_memmap_init()]
    Link: http://lkml.kernel.org/r/20180726193509.3326-4-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180724235520.10200-4-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Cc: Abdul Haleem
    Cc: Baoquan He
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Cc: Steven Sistare
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin