06 Apr, 2019

1 commit

  • [ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]

    KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
    It triggers false positives in the allocation path:

    BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
    Read of size 8 at addr ffff88881f800000 by task swapper/0
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    __asan_report_load8_noabort+0x19/0x20
    memchr_inv+0x2ea/0x330
    kernel_poison_pages+0x103/0x3d5
    get_page_from_freelist+0x15e7/0x4d90

    because KASAN has not yet unpoisoned the shadow for the page being
    allocated when kernel_poison_pages() runs memchr_inv(), so the check
    only finds a stale poison pattern.

    There are also false positives in the free path:

    BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
    Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
    CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    check_memory_region+0x22d/0x250
    memset+0x28/0x40
    kernel_poison_pages+0x29e/0x3d5
    __free_pages_ok+0x75f/0x13e0

    because KASAN adds poisoned redzones around slab objects, while page
    poisoning needs to poison the whole page.
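
    As an illustration of the kind of fix this calls for (a hedged sketch, not
    necessarily the literal upstream diff), the whole-page accesses in the page
    poisoning code can be wrapped in the kernel's kasan_disable_current() /
    kasan_enable_current() helpers so the stale shadow is ignored:

    #include <linux/highmem.h>
    #include <linux/kasan.h>
    #include <linux/mm.h>
    #include <linux/poison.h>
    #include <linux/string.h>

    /* Sketch: poison a whole page without tripping KASAN.  KASAN still
     * tracks the page as freed (or keeps redzones around slab objects),
     * so the raw whole-page memset must run with reporting disabled.
     */
    static void poison_page(struct page *page)
    {
            void *addr = kmap_atomic(page);

            kasan_disable_current();
            memset(addr, PAGE_POISON, PAGE_SIZE);
            kasan_enable_current();
            kunmap_atomic(addr);
    }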

    Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Qian Cai
     

24 Mar, 2019

1 commit

  • [ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ]

    The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
    number of references that we might need to create in the fastpath later,
    the bump-allocation fastpath only has to modify the non-atomic bias value
    that tracks the number of extra references we hold instead of the atomic
    refcount. The maximum number of allocations we can serve (under the
    assumption that no allocation is made with size 0) is nc->size, so that's
    the bias used.

    However, even when all memory in the allocation has been given away, a
    reference to the page is still held; and in the `offset < 0` slowpath, the
    page may be reused if everyone else has dropped their references.
    This means that the necessary number of references is actually
    `nc->size+1`.
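
    As a conceptual sketch of the scheme described above (the field names follow
    struct page_frag_cache, but the helper below is hypothetical and simplified,
    not the upstream page_frag_alloc()):

    /* Hypothetical model: pre-charge the page with one reference per
     * fragment we may hand out plus the one reference the cache itself
     * keeps, i.e. size + 1 in total; the fastpath then only decrements
     * the non-atomic pagecnt_bias mirror.
     */
    static void frag_cache_charge(struct page_frag_cache *nc,
                                  struct page *page, unsigned int size)
    {
            /* the freshly allocated page already holds one reference */
            page_ref_add(page, size);       /* total references: size + 1 */

            nc->pagecnt_bias = size + 1;    /* non-atomic mirror for the fastpath */
            nc->offset = size;
    }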

    Luckily, from a quick grep, it looks like the only path that can call
    page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
    requires CAP_NET_ADMIN in the init namespace and is only intended to be
    used for kernel testing and fuzzing.

    To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
    `offset < 0` path, below the virt_to_page() call, and then repeatedly call
    writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
    with a vector consisting of 15 elements containing 1 byte each.

    Signed-off-by: Jann Horn
    Signed-off-by: David S. Miller
    Signed-off-by: Sasha Levin

    Jann Horn
     

13 Feb, 2019

1 commit

  • [ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]

    When CONFIG_KASAN is enabled on large memory SMP systems, the deferred
    pages initialization can take a long time. Below were the reported init
    times on an 8-socket 96-core 4TB IvyBridge system.

    1) Non-debug kernel without CONFIG_KASAN
    [ 8.764222] node 1 initialised, 132086516 pages in 7027ms

    2) Debug kernel with CONFIG_KASAN
    [ 146.288115] node 1 initialised, 132075466 pages in 143052ms

    So the page init time in a debug kernel was about 20X that of the
    non-debug kernel. The long init time can be problematic as the page
    initialization is done with interrupts disabled. In this particular case,
    it caused the appearance of the following warning messages as well as NMI
    backtraces of all the cores that were doing the initialization.

    [ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
    [ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
    [ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
    [ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
    [ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
    [ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
    [ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
    [ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)

    The long init time was mainly caused by the call to kasan_free_pages() to
    poison the newly initialized pages. On a 4TB system, we are talking about
    almost 500GB of memory probably on the same node.

    In reality, we may not need to poison the newly initialized pages before
    they are ever allocated. So KASAN poisoning of freed pages before the
    completion of deferred memory initialization is now disabled. Those pages
    will be properly poisoned when they are allocated or freed after deferred
    pages initialization is done.

    With this change, the new page initialization time became:

    [ 21.948010] node 1 initialised, 132075466 pages in 18702ms

    This was still about double the non-debug kernel time, but was much
    better than before.
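
    Conceptually, the change gates the KASAN poison-on-free step behind
    "deferred initialization has finished". A hedged sketch of that idea (the
    kasan_free_nondeferred_pages() name and the static key are illustrative
    assumptions, not necessarily the exact upstream interface):

    /* Sketch: skip KASAN poisoning of freed pages while the deferred
     * struct-page initialization is still running; poison normally once
     * it has completed and the static key has been switched off.
     */
    #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
    DECLARE_STATIC_KEY_TRUE(deferred_pages);

    static inline void kasan_free_nondeferred_pages(struct page *page,
                                                    unsigned int order)
    {
            if (!static_branch_unlikely(&deferred_pages))
                    kasan_free_pages(page, order);
    }
    #else
    #define kasan_free_nondeferred_pages(p, o)      kasan_free_pages(p, o)
    #endif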

    Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
    Signed-off-by: Waiman Long
    Reviewed-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Michal Hocko
    Cc: Pasha Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Waiman Long
     

31 Jan, 2019

1 commit

  • commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.

    This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.

    The underlying assumption that one sparse section belongs to a single
    NUMA node doesn't really hold. Robert Shteynfeld has reported a boot
    failure. The boot log was not captured but his memory layout is as
    follows:

    Early memory node ranges
    node 1: [mem 0x0000000000001000-0x0000000000090fff]
    node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
    node 1: [mem 0x0000000100000000-0x0000001423ffffff]
    node 0: [mem 0x0000001424000000-0x0000002023ffffff]

    This means that node0 starts in the middle of a memory section which is
    also in node1. memmap_init_zone tries to initialize the padding of a
    section even when it is outside of the given pfn range, because there are
    code paths (e.g. memory hotplug) which assume that the full memory
    section is always initialized.

    In this particular case, though, such a range is already initialized and
    most likely already managed by the page allocator. Scribbling over
    those pages corrupts the internal state and likely blows up when any of
    those pages gets used.

    Reported-by: Robert Shteynfeld
    Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
    Cc: stable@kernel.org
    Signed-off-by: Michal Hocko
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Michal Hocko
     

29 Dec, 2018

2 commits

  • commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream.

    While playing with gigantic hugepages and memory_hotplug, I triggered
    the following #PF when "cat memoryX/removable":

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
    #PF error: [normal kernel read fault]
    PGD 0 P4D 0
    Oops: 0000 [#1] SMP PTI
    CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:has_unmovable_pages+0x154/0x210
    Call Trace:
    is_mem_section_removable+0x7d/0x100
    removable_show+0x90/0xb0
    dev_attr_show+0x1c/0x50
    sysfs_kf_seq_show+0xca/0x1b0
    seq_read+0x133/0x380
    __vfs_read+0x26/0x180
    vfs_read+0x89/0x140
    ksys_read+0x42/0x90
    do_syscall_64+0x5b/0x180
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The reason is that we do not pass the head page to page_hstate(), so the
    call to compound_order() in page_hstate() returns 0 and we end up checking
    all hstates for a size matching PAGE_SIZE.

    Obviously, we do not find any hstate matching that size, and we return
    NULL. Then, we dereference that NULL pointer in
    hugepage_migration_supported() and we get the #PF from above.

    Fix that by getting the head page before calling page_hstate().

    Also, since gigantic pages span several pageblocks, re-adjust the logic
    for skipping pages. While we are at it, we can also get rid of the
    round_up().
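
    A hedged sketch of the shape of the fix (illustrative only; the iter
    variable and the unmovable label are borrowed from the surrounding
    has_unmovable_pages() scan and are assumptions here): take the compound
    head before asking for the hstate, and skip the whole huge page at once:

    if (PageHuge(page)) {
            struct page *head = compound_head(page);
            unsigned int skip_pages;

            /* page may be a tail page; page_hstate() needs the head */
            if (!hugepage_migration_supported(page_hstate(head)))
                    goto unmovable;

            /* gigantic pages span several pageblocks: skip them all */
            skip_pages = (1 << compound_order(head)) - (page - head);
            iter += skip_pages - 1;
            continue;
    }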

    [osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
    Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
    Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: David Hildenbrand
    Cc: Vlastimil Babka
    Cc: Pavel Tatashin
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Oscar Salvador
     
  • commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream.

    If memory end is not aligned with the sparse memory section boundary,
    the mapping of such a section is only partly initialized. This may lead
    to VM_BUG_ON due to uninitialized struct page access from
    is_mem_section_removable() or test_pages_in_a_zone() function triggered
    by memory_hotplug sysfs handlers:

    Here are the panic examples:
    CONFIG_DEBUG_VM=y
    CONFIG_DEBUG_VM_PGFLAGS=y

    kernel parameter mem=2050M
    --------------------------
    page:000003d082008000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    ( test_pages_in_a_zone+0xde/0x160)
    show_valid_zones+0x5c/0x190
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    test_pages_in_a_zone+0xde/0x160
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    kernel parameter mem=3075M
    --------------------------
    page:000003d08300c000 is uninitialized and poisoned
    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
    Call Trace:
    ( is_mem_section_removable+0xb4/0x190)
    show_mem_removable+0x9a/0xd8
    dev_attr_show+0x34/0x70
    sysfs_kf_seq_show+0xc8/0x148
    seq_read+0x204/0x480
    __vfs_read+0x32/0x178
    vfs_read+0x82/0x138
    ksys_read+0x5a/0xb0
    system_call+0xdc/0x2d8
    Last Breaking-Event-Address:
    is_mem_section_removable+0xb4/0x190
    Kernel panic - not syncing: Fatal exception: panic_on_oops

    Fix the problem by initializing the last memory section of each zone in
    memmap_init_zone() till the very end, even if it goes beyond the zone end.

    Michal said:

    : This has always been a problem AFAIU. It just went unnoticed because we
    : have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
    : zeroing memory during allocation in vmemmap") and so the above test
    : would simply skip these ranges as belonging to zone 0 or provided
    : garbage.
    :
    : So I guess we do care for post f7f99100d8d9 kernels mostly and
    : therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
    : allocation in vmemmap")
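
    As a sketch of the fix described above (illustrative, not the literal
    upstream hunk; the end_pfn name is assumed from the memmap_init_zone()
    context), the loop bound is simply extended to the section boundary:

    #ifdef CONFIG_SPARSEMEM
            /*
             * Sketch: if the zone does not end on a section boundary, keep
             * initializing struct pages up to the end of that last section,
             * so memory_hotplug handlers never see uninitialized entries.
             * PAGES_PER_SECTION is a power of two, so ALIGN() rounds end_pfn
             * up to the next section boundary.
             */
            end_pfn = ALIGN(end_pfn, PAGES_PER_SECTION);
    #endif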

    Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: Mikhail Zaslonko
    Reviewed-by: Gerald Schaefer
    Suggested-by: Michal Hocko
    Acked-by: Michal Hocko
    Reported-by: Mikhail Gavrilov
    Tested-by: Mikhail Gavrilov
    Cc: Dave Hansen
    Cc: Alexander Duyck
    Cc: Pasha Tatashin
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Mikhail Zaslonko
     

17 Dec, 2018

1 commit

  • [ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]

    init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
    'zone_idx(zone) + 1' unconditionally. This is correct in the normal
    case, but not exact in the hot-plug situation.

    This function is used in two places:

    * free_area_init_core()
    * move_pfn_range_to_zone()

    In the first case, we are sure the zone index increases monotonically,
    while in the second one it is under user control.

    One way to reproduce this is:
    ----------------------------

    1. create a virtual machine with empty node1

    -m 4G,slots=32,maxmem=32G \
    -smp 4,maxcpus=8 \
    -numa node,nodeid=0,mem=4G,cpus=0-3 \
    -numa node,nodeid=1,mem=0G,cpus=4-7

    2. hot-add cpu 3-7

    cpu-add [3-7]

    3. hot-add memory to node1

    object_add memory-backend-ram,id=ram0,size=1G
    device_add pc-dimm,id=dimm0,memdev=ram0,node=1

    4. online memory in the following order

    echo online_movable > memory47/state
    echo online > memory40/state

    After this, node1 will have its nr_zones equal to (ZONE_NORMAL + 1)
    instead of (ZONE_MOVABLE + 1).

    Michal said:
    "Having an incorrect nr_zones might result in all sorts of problems
    which would be quite hard to debug (e.g. reclaim not considering the
    movable zone). I do not expect many users would suffer from this, but
    still this is trivial and obviously the right thing to do, so
    backporting to the stable tree shouldn't be harmful (last famous
    words)"

    Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
    Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
    Signed-off-by: Wei Yang
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Signed-off-by: Sasha Levin

    Wei Yang
     

01 Dec, 2018

2 commits

  • [ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]

    Konstantin has noticed that kvmalloc might trigger the following
    warning:

    WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
    [...]
    Call Trace:
    fragmentation_index+0x76/0x90
    compaction_suitable+0x4f/0xf0
    shrink_node+0x295/0x310
    node_reclaim+0x205/0x250
    get_page_from_freelist+0x649/0xad0
    __alloc_pages_nodemask+0x12a/0x2a0
    kmalloc_large_node+0x47/0x90
    __kmalloc_node+0x22b/0x2e0
    kvmalloc_node+0x3e/0x70
    xt_alloc_table_info+0x3a/0x80 [x_tables]
    do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
    nf_setsockopt+0x44/0x60
    SyS_setsockopt+0x6f/0xc0
    do_syscall_64+0x67/0x120
    entry_SYSCALL_64_after_hwframe+0x3d/0xa2

    The problem is that we only check for an out-of-bound order in the slow
    path and the node reclaim might happen from the fast path already. This
    is fixable by making sure that kvmalloc doesn't ever use kmalloc for
    requests that are larger than KMALLOC_MAX_SIZE, but this also shows that
    the code is rather fragile. A recent UBSAN report just underlines that
    with the following report:

    UBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
    shift exponent 51 is too large for 32-bit type 'int'
    CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0xd2/0x148 lib/dump_stack.c:113
    ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
    __ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
    __zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
    zone_watermark_fast mm/page_alloc.c:3216 [inline]
    get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
    __alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
    alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
    alloc_pages include/linux/gfp.h:509 [inline]
    __get_free_pages+0x12/0x60 mm/page_alloc.c:4414
    dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
    raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
    raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
    fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
    fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
    __blkdev_driver_ioctl block/ioctl.c:303 [inline]
    blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
    block_ioctl+0x105/0x150 fs/block_dev.c:1883
    vfs_ioctl fs/ioctl.c:46 [inline]
    do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
    ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
    __do_sys_ioctl fs/ioctl.c:709 [inline]
    __se_sys_ioctl fs/ioctl.c:707 [inline]
    __x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
    do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Note that this is not a kvmalloc path. It is just that the fast path
    really depends on having a sanitized order as well. Therefore move the
    order check to the fast path.
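
    The resulting check is roughly the following (a hedged sketch; the warning
    details may differ from the upstream hunk) and sits at the very top of
    __alloc_pages_nodemask():

    /*
     * Sketch: reject insane orders before any fast-path work (node
     * reclaim, watermark checks) gets a chance to shift by them.
     */
    if (unlikely(order >= MAX_ORDER)) {
            WARN_ON_ONCE(!(gfp_mask & __GFP_NOWARN));
            return NULL;
    }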

    Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
    Signed-off-by: Michal Hocko
    Reported-by: Konstantin Khlebnikov
    Reported-by: Kyungtae Kim
    Acked-by: Vlastimil Babka
    Cc: Balbir Singh
    Cc: Mel Gorman
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Byoungyoung Lee
    Cc: "Dae R. Jeong"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     
  • [ Upstream commit 9d7899999c62c1a81129b76d2a6ecbc4655e1597 ]

    Page state checks are racy. Under a heavy memory workload (e.g. stress
    -m 200 -t 2h) it is quite easy to hit a race window when the page is
    allocated but its state is not fully populated yet. A debugging patch to
    dump the struct page state shows

    has_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
    page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
    flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)

    Note that the state has been checked for both PageLRU and PageSwapBacked
    already. Closing this race completely would require some sort of retry
    logic. This can be tricky and error prone (think of potential endless
    or long taking loops).

    Work around this problem for movable zones at least. Such a zone should
    only contain movable pages. Commit 15c30bc09085 ("mm, memory_hotplug:
    make has_unmovable_pages more robust") has told us that this is not
    strictly true though. Bootmem pages should be marked reserved, however,
    so we can move the original check after the PageReserved check. Pages
    from other zones are still prone to races, but we do not even pretend
    that memory hotremove works for those, so premature failure doesn't hurt
    that much.
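
    A sketch of the reordered checks (illustrative; the page/zone names and
    the unmovable label are borrowed from the has_unmovable_pages() scan as
    assumptions):

    /*
     * Sketch: bootmem/reserved pages are never movable, so test that
     * first; after that, anything inside ZONE_MOVABLE is assumed
     * movable, so the racy per-page state checks are not reached.
     */
    if (PageReserved(page))
            goto unmovable;

    if (zone_idx(zone) == ZONE_MOVABLE)
            continue;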

    Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
    Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
    Signed-off-by: Michal Hocko
    Reported-by: Baoquan He
    Tested-by: Baoquan He
    Acked-by: Baoquan He
    Reviewed-by: Oscar Salvador
    Acked-by: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Michal Hocko
     

09 Oct, 2018

1 commit

    Remove the leftover pglist_data::numabalancing_migrate_lock and its
    initialization; we stopped using this lock with:

    efaffc5e40ae ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration")

    [ mingo: Rewrote the changelog. ]

    Signed-off-by: Srikar Dronamraju
    Acked-by: Mel Gorman
    Cc: Linus Torvalds
    Cc: Linux-MM
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

02 Oct, 2018

1 commit

  • Rate limiting of page migrations due to automatic NUMA balancing was
    introduced to mitigate the worst-case scenario of migrating at high
    frequency due to false sharing or slowly ping-ponging between nodes.
    Since then, a lot of effort was spent on correctly identifying these
    pages and avoiding unnecessary migrations and the safety net may no longer
    be required.

    Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
    avoids spreading STREAM tasks wide prematurely. However, once the task
    was properly placed, it delayed migrating the memory due to rate limiting.
    Increasing the limit fixed the problem for him.

    Currently, the limit is hard-coded and does not account for the real
    capabilities of the hardware. Even if an estimate was attempted, it would
    not properly account for the number of memory controllers and it could
    not account for the amount of bandwidth used for normal accesses. Rather
    than fudging, this patch simply eliminates the rate limiting.

    However, Jirka reports that a STREAM configuration using multiple
    processes achieved similar performance to 4.16. In local tests, this patch
    improved performance of STREAM relative to the baseline but it is somewhat
    machine-dependent. Most workloads show little or no performance
    difference, implying that there is not a heavy reliance on the throttling
    mechanism and it is safe to remove.

    STREAM on 2-socket machine
    4.19.0-rc5 4.19.0-rc5
    numab-v1r1 noratelimit-v1r1
    MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%)
    MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%)
    MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%)
    MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24%)

    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: Jirka Hladky
    Cc: Linus Torvalds
    Cc: Linux-MM
    Cc: Srikar Dronamraju
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
    Signed-off-by: Ingo Molnar

    Mel Gorman
     

05 Sep, 2018

1 commit

  • When scanning for movable pages, filter out Hugetlb pages if hugepage
    migration is not supported. Without this we hit an infinite loop in
    __offline_pages() where we do:

    pfn = scan_movable_pages(start_pfn, end_pfn);
    if (pfn) { /* We have movable pages */
            ret = do_migrate_range(pfn, end_pfn);
            goto repeat;
    }

    Fix this by checking hugepage_migration_supported() both in
    has_unmovable_pages(), which is the primary backoff mechanism for page
    offlining, and, for consistency reasons, also in scan_movable_pages(),
    because it doesn't make any sense to return the pfn of a non-migratable
    huge page.

    This issue was revealed by, but not caused by, 72b39cfc4d75 ("mm,
    memory_hotplug: do not fail offlining too early").
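
    A hedged sketch of the scan_movable_pages() side of the check
    (illustrative; the surrounding pfn loop is paraphrased rather than quoted):

    /* Sketch: never report a hugetlb page as movable if its hstate
     * cannot actually be migrated, otherwise __offline_pages() keeps
     * getting the same pfn back and loops forever.
     */
    if (PageHuge(page)) {
            struct page *head = compound_head(page);

            if (hugepage_migration_supported(page_hstate(head)))
                    return pfn;
            /* skip the rest of this (possibly gigantic) huge page */
            pfn = round_up(pfn + 1, 1 << compound_order(head)) - 1;
            continue;
    }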

    Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
    Fixes: 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
    Signed-off-by: Aneesh Kumar K.V
    Reported-by: Haren Myneni
    Acked-by: Michal Hocko
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     

30 Aug, 2018

1 commit


24 Aug, 2018

1 commit

    A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
    allocate a page that was just freed on its way to being soft-offlined.
    This is undesirable because soft-offline (which is about a corrected
    error) is less aggressive than hard-offline (which is about an
    uncorrected error), and we can make soft-offline fail and keep using the
    page for a good reason like "system is busy."

    Two main changes of this patch are:

    - setting the migrate type of the target page to MIGRATE_ISOLATE. As done
    in free_unref_page_commit(), this makes the kernel bypass the pcplist
    when freeing the page, so we can assume that the page is in the freelist
    just after put_page() returns,

    - setting PG_hwpoison on the free page under zone->lock, which protects
    the freelists. This allows us to avoid setting PG_hwpoison on a page
    that has already been decided to be allocated soon.
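
    A hedged sketch of that second point (set_hwpoison_free_buddy_page() is
    named in the akpm note below; the body here is illustrative and
    simplified, not the exact upstream code):

    /*
     * Sketch: mark a page that is sitting in the buddy freelists as
     * hwpoisoned while holding zone->lock, so the allocator cannot hand
     * it out in parallel.  Returns true if the page was still free.
     */
    bool set_hwpoison_free_buddy_page(struct page *page)
    {
            struct zone *zone = page_zone(page);
            unsigned long flags;
            bool hwpoisoned = false;

            spin_lock_irqsave(&zone->lock, flags);
            if (PageBuddy(page)) {          /* still free, not reallocated */
                    SetPageHWPoison(page);
                    hwpoisoned = true;
            }
            spin_unlock_irqrestore(&zone->lock, flags);

            return hwpoisoned;
    }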

    [akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
    Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Reported-by: Xishi Qiu
    Tested-by: Mike Kravetz
    Cc: Michal Hocko
    Cc:
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

23 Aug, 2018

5 commits

    Currently, whenever a new node is created/re-used from the memhotplug
    path, we call free_area_init_node()->free_area_init_core(). But there is
    some code that we do not really need to run when we are coming from such
    a path.

    free_area_init_core() performs the following actions:

    1) Initializes pgdat internals, such as spinlock, waitqueues and more.
    2) Accounts nr_all_pages and nr_kernel_pages. These values are used later
    on when creating hash tables.
    3) Accounts the number of managed_pages per zone, subtracting dma_reserved
    and memmap pages.
    4) Initializes some fields of the zone structure data
    5) Calls init_currently_empty_zone to initialize all the freelists
    6) Calls memmap_init to initialize all pages belonging to a certain zone

    When called from the memhotplug path, free_area_init_core() only performs
    actions #1 and #4.

    Action #2 is pointless as the zones do not have any pages: since either
    the node was freed or we are re-using it, either way all zones belonging
    to this node should have 0 pages. For the same reason, action #3 always
    results in managed_pages being 0.

    Actions #5 and #6 are performed later on when onlining the pages:
    online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
    online_pages()->move_pfn_range_to_zone()->memmap_init_zone()

    This patch does two things:

    First, it moves the node/zone initialization to their own functions,
    which allows us to create a small version of free_area_init_core where we
    only perform:

    1) Initialization of pgdat internals, such as spinlock, waitqueues and more
    4) Initialization of some fields of the zone structure data

    These two functions are: pgdat_init_internals() and zone_init_internals().

    The second thing this patch does is introduce
    free_area_init_core_hotplug(), the memhotplug version of
    free_area_init_core():

    Currently, we call free_area_init_node() from the memhotplug path. In
    there, we set some of the pgdat's fields and call
    calculate_node_totalpages(), which calculates the number of pages the
    node has.

    Since the node is either new or we are re-using it, the zones belonging
    to this node should not have any pages, so there is no point in
    calculating this now.

    Actually, we re-set these values to 0 later on with the calls to:

    reset_node_managed_pages()
    reset_node_present_pages()

    The # of pages per node and the # of pages per zone will be calculated when
    onlining the pages:

    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
    online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()

    Also, since free_area_init_core/free_area_init_node will now only get
    called during early init, let us replace __paginginit with __init, so
    their code gets freed up.
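
    A compact sketch of what the hotplug-only variant boils down to
    (illustrative; the exact zone_init_internals() parameters are an
    assumption):

    /* Sketch: the hotplug path only needs actions #1 and #4 above --
     * pgdat internals plus the per-zone bookkeeping fields -- because
     * page accounting happens later, when the pages are onlined.
     */
    void __ref free_area_init_core_hotplug(int nid)
    {
            enum zone_type z;
            pg_data_t *pgdat = NODE_DATA(nid);

            pgdat_init_internals(pgdat);
            for (z = 0; z < MAX_NR_ZONES; z++)
                    zone_init_internals(&pgdat->node_zones[z], z, nid, 0);
    }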

    [osalvador@techadventures.net: fix section usage]
    Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
    [osalvador@suse.de: v6]
    Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
    Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
    Let us move the code guarded by CONFIG_DEFERRED_STRUCT_PAGE_INIT into an
    inline function. Not having an #ifdef in the function makes the code more
    readable.
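
    The usual shape of this kind of cleanup is sketched below (the
    pgdat_set_deferred_range() helper name is an illustrative assumption):

    /* Sketch: keep the #ifdef in one place and give the caller a plain
     * function call in both configurations.
     */
    #ifdef CONFIG_DEFERRED_STRUCT_PAGE_INIT
    static inline void pgdat_set_deferred_range(pg_data_t *pgdat)
    {
            pgdat->first_deferred_pfn = ULONG_MAX;
    }
    #else
    static inline void pgdat_set_deferred_range(pg_data_t *pgdat) {}
    #endif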

    Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
    __paginginit is the same thing as __meminit, except that for platforms
    without sparsemem it is defined as __init.

    Remove __paginginit and use __meminit. Use __ref in one single function
    that merges __meminit and __init sections: setup_usemap().

    Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
    zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
    have inline functions to access this field in order to avoid #ifdefs in C
    files.
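
    For example, accessors of roughly this shape (an illustrative sketch; the
    zone_to_nid()/zone_set_nid() names and bodies are assumptions here):

    #ifdef CONFIG_NUMA
    static inline int zone_to_nid(struct zone *zone)
    {
            return zone->node;
    }

    static inline void zone_set_nid(struct zone *zone, int nid)
    {
            zone->node = nid;
    }
    #else
    static inline int zone_to_nid(struct zone *zone)
    {
            return 0;
    }

    static inline void zone_set_nid(struct zone *zone, int nid) {}
    #endif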

    Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Joonsoo Kim
    Cc: Mel Gorman
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "Refactor free_area_init_core and add
    free_area_init_core_hotplug", v6.

    This patchset does three things:

    1) Clean up/refactor free_area_init_core/free_area_init_node
    by moving the ifdefery out of the functions.
    2) Move the pgdat/zone initialization in free_area_init_core to its
    own function.
    3) Introduce free_area_init_core_hotplug, a small subset of
    free_area_init_core, which is only called from the memhotplug code path.
    In this way, we have:

    free_area_init_core: called during early initialization
    free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)

    This patch (of 5):

    Moving the #ifdefs out of the function makes it easier to follow.

    Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
    Signed-off-by: Oscar Salvador
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Acked-by: Vlastimil Babka
    Cc: Pasha Tatashin
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Joonsoo Kim
    Cc: Dan Williams
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

18 Aug, 2018

4 commits

    To improve the page allocator's performance for order-0 pages, each CPU
    has a Per-CPU-Pageset (PCP) per zone. Whenever an order-0 page is needed,
    the PCP will be checked first before asking Buddy for pages. When the PCP
    is used up, a batch of pages will be fetched from Buddy to improve
    performance, and the size of the batch can affect performance.

    The zone's batch size was last doubled over ten years ago by commit
    ba56e91c9401 ("mm: page_alloc: increase size of per-cpu-pages"). Since
    then, CPUs have evolved a lot and their cache sizes have also increased.

    Dave Hansen is concerned that the current batch size doesn't fit well
    with modern hardware and suggested that I do two things: first, use a
    page allocator intensive benchmark, e.g. will-it-scale/page_fault1, to
    find out how performance changes with different batch sizes on various
    machines and then choose a new default batch size; second, see how this
    new batch size works with other workloads.

    In the first test, we saw performance gains on high-core-count systems
    and little to no effect on older systems with more modest core counts.
    From this phase's test data, two candidates were chosen: 63 and 127.

    In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
    and more will-it-scale sub-tests are run to see how these two candidates
    work with those workloads, and a new default is decided according to the
    results.

    Most test results are flat. will-it-scale/page_fault2 process mode has a
    10%-18% performance increase on 4-socket Skylake and Broadwell.
    vm-scalability/lru-file-mmap-read has a 17%-47% performance increase for
    4-socket servers, while for 2-socket servers it caused a 3%-8% performance
    drop. Further analysis showed that, with a larger pcp->batch and thus
    larger pcp->high (the relationship pcp->high = 6 * pcp->batch is
    maintained in this patch), zone lock contention shifted to LRU add side
    lock contention and that caused the performance drop. This performance
    drop might be mitigated by others' work on optimizing the LRU lock.

    Another downside of increasing pcp->batch is that, when the PCP is used
    up and a batch of pages needs to be fetched from Buddy, the larger batch
    means that the refill can take longer than before. My understanding is
    that this doesn't affect the slowpath, where direct reclaim and compaction
    dominate. For the fastpath, throughput is a win (according to
    will-it-scale/page_fault1) but worst-case latency can be larger now.

    Overall, I think doubling the batch size from 31 to 63 is relatively safe
    and provides a good performance boost for high-core-count systems.
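
    For reference, the batch/high relationship quoted above is kept in one
    place; a hedged sketch (close to, but not claimed to be, the literal
    pageset_set_batch() in mm/page_alloc.c):

    /* Sketch: pcp->high follows pcp->batch with the 6x factor mentioned
     * above; pageset_update() stores the pair into the per-cpu pageset.
     */
    static void pageset_set_batch(struct per_cpu_pageset *p, unsigned long batch)
    {
            pageset_update(&p->pcp, 6 * batch, max(1UL, 1 * batch));
    }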

    The two phases' test results are listed below (all tests were done with
    THP disabled).

    Phase one(will-it-scale/page_fault1) test results:

    Skylake-EX: increased batch size has a good effect on zone->lock
    contention, though LRU contention will rise at the same time and
    limited the final performance increase.

    batch score change zone_contention lru_contention total_contention
    31 15345900 +0.00% 64% 8% 72%
    53 17903847 +16.67% 32% 38% 70%
    63 17992886 +17.25% 24% 45% 69%
    73 18022825 +17.44% 10% 61% 71%
    119 18023401 +17.45% 4% 66% 70%
    127 18029012 +17.48% 3% 66% 69%
    137 18036075 +17.53% 4% 66% 70%
    165 18035964 +17.53% 2% 67% 69%
    188 18101105 +17.95% 2% 67% 69%
    223 18130951 +18.15% 2% 67% 69%
    255 18118898 +18.07% 2% 67% 69%
    267 18101559 +17.96% 2% 67% 69%
    299 18160468 +18.34% 2% 68% 70%
    320 18139845 +18.21% 2% 67% 69%
    393 18160869 +18.34% 2% 68% 70%
    424 18170999 +18.41% 2% 68% 70%
    458 18144868 +18.24% 2% 68% 70%
    467 18142366 +18.22% 2% 68% 70%
    498 18154549 +18.30% 1% 68% 69%
    511 18134525 +18.17% 1% 69% 70%

    Broadwell-EX: similar pattern as Skylake-EX.

    batch score change zone_contention lru_contention total_contention
    31 16703983 +0.00% 67% 7% 74%
    53 18195393 +8.93% 43% 28% 71%
    63 18288885 +9.49% 38% 33% 71%
    73 18344329 +9.82% 35% 37% 72%
    119 18535529 +10.96% 24% 46% 70%
    127 18513596 +10.83% 23% 48% 71%
    137 18514327 +10.84% 23% 48% 71%
    165 18511840 +10.82% 22% 49% 71%
    188 18593478 +11.31% 17% 53% 70%
    223 18601667 +11.36% 17% 52% 69%
    255 18774825 +12.40% 12% 58% 70%
    267 18754781 +12.28% 9% 60% 69%
    299 18892265 +13.10% 7% 63% 70%
    320 18873812 +12.99% 8% 62% 70%
    393 18891174 +13.09% 6% 64% 70%
    424 18975108 +13.60% 6% 64% 70%
    458 18932364 +13.34% 8% 62% 70%
    467 18960891 +13.51% 5% 65% 70%
    498 18944526 +13.41% 5% 64% 69%
    511 18960839 +13.51% 5% 64% 69%

    Skylake-EP: although increased batch reduced zone->lock contention, but
    the effect is not as good as EX: zone->lock contention is still as high as
    20% with a very high batch value instead of 1% on Skylake-EX or 5% on
    Broadwell-EX. Also, total_contention actually decreased with a higher
    batch but that doesn't translate to performance increase.

    batch score change zone_contention lru_contention total_contention
    31 9554867 +0.00% 66% 3% 69%
    53 9855486 +3.15% 63% 3% 66%
    63 9980145 +4.45% 62% 4% 66%
    73 10092774 +5.63% 62% 5% 67%
    119 10310061 +7.90% 45% 19% 64%
    127 10342019 +8.24% 42% 19% 61%
    137 10358182 +8.41% 42% 21% 63%
    165 10397060 +8.81% 37% 24% 61%
    188 10341808 +8.24% 34% 26% 60%
    223 10349135 +8.31% 31% 27% 58%
    255 10327189 +8.08% 28% 29% 57%
    267 10344204 +8.26% 27% 29% 56%
    299 10325043 +8.06% 25% 30% 55%
    320 10310325 +7.91% 25% 31% 56%
    393 10293274 +7.73% 21% 31% 52%
    424 10311099 +7.91% 21% 32% 53%
    458 10321375 +8.02% 21% 32% 53%
    467 10303881 +7.84% 21% 32% 53%
    498 10332462 +8.14% 20% 33% 53%
    511 10325016 +8.06% 20% 32% 52%

    Broadwell-EP: zone->lock and lru lock had an agreement to make sure
    performance doesn't increase and they successfully managed to keep total
    contention at 70%.

    batch score change zone_contention lru_contention total_contention
    31 10121178 +0.00% 19% 50% 69%
    53 10142366 +0.21% 6% 63% 69%
    63 10117984 -0.03% 11% 58% 69%
    73 10123330 +0.02% 7% 63% 70%
    119 10108791 -0.12% 2% 67% 69%
    127 10166074 +0.44% 3% 66% 69%
    137 10141574 +0.20% 3% 66% 69%
    165 10154499 +0.33% 2% 68% 70%
    188 10124921 +0.04% 2% 67% 69%
    223 10137399 +0.16% 2% 67% 69%
    255 10143289 +0.22% 0% 68% 68%
    267 10123535 +0.02% 1% 68% 69%
    299 10140952 +0.20% 0% 68% 68%
    320 10163170 +0.41% 0% 68% 68%
    393 10000633 -1.19% 0% 69% 69%
    424 10087998 -0.33% 0% 69% 69%
    458 10187116 +0.65% 0% 69% 69%
    467 10146790 +0.25% 0% 69% 69%
    498 10197958 +0.76% 0% 69% 69%
    511 10152326 +0.31% 0% 69% 69%

    Haswell-EP: similar to Broadwell-EP.

    batch score change zone_contention lru_contention total_contention
    31 10442205 +0.00% 14% 48% 62%
    53 10442255 +0.00% 5% 57% 62%
    63 10452059 +0.09% 6% 57% 63%
    73 10482349 +0.38% 5% 59% 64%
    119 10454644 +0.12% 3% 60% 63%
    127 10431514 -0.10% 3% 59% 62%
    137 10423785 -0.18% 3% 60% 63%
    165 10481216 +0.37% 2% 61% 63%
    188 10448755 +0.06% 2% 61% 63%
    223 10467144 +0.24% 2% 61% 63%
    255 10480215 +0.36% 2% 61% 63%
    267 10484279 +0.40% 2% 61% 63%
    299 10466450 +0.23% 2% 61% 63%
    320 10452578 +0.10% 2% 61% 63%
    393 10499678 +0.55% 1% 62% 63%
    424 10481454 +0.38% 1% 62% 63%
    458 10473562 +0.30% 1% 62% 63%
    467 10484269 +0.40% 0% 62% 62%
    498 10505599 +0.61% 0% 62% 62%
    511 10483395 +0.39% 0% 62% 62%

    Westmere-EP: contention is pretty small so not interesting. Note that too
    high a batch value could hurt performance.

    batch score change zone_contention lru_contention total_contention
    31 4831523 +0.00% 2% 3% 5%
    53 4834086 +0.05% 2% 4% 6%
    63 4834262 +0.06% 2% 3% 5%
    73 4832851 +0.03% 2% 4% 6%
    119 4830534 -0.02% 1% 3% 4%
    127 4827461 -0.08% 1% 4% 5%
    137 4827459 -0.08% 1% 3% 4%
    165 4820534 -0.23% 0% 4% 4%
    188 4817947 -0.28% 0% 3% 3%
    223 4809671 -0.45% 0% 3% 3%
    255 4802463 -0.60% 0% 4% 4%
    267 4801634 -0.62% 0% 3% 3%
    299 4798047 -0.69% 0% 3% 3%
    320 4793084 -0.80% 0% 3% 3%
    393 4785877 -0.94% 0% 3% 3%
    424 4782911 -1.01% 0% 3% 3%
    458 4779346 -1.08% 0% 3% 3%
    467 4780306 -1.06% 0% 3% 3%
    498 4780589 -1.05% 0% 3% 3%
    511 4773724 -1.20% 0% 3% 3%

    Skylake-Desktop: similar to Westmere-EP, nothing interesting.

    batch score change zone_contention lru_contention total_contention
    31 3906608 +0.00% 2% 3% 5%
    53 3940164 +0.86% 2% 3% 5%
    63 3937289 +0.79% 2% 3% 5%
    73 3940201 +0.86% 2% 3% 5%
    119 3933240 +0.68% 2% 3% 5%
    127 3930514 +0.61% 2% 4% 6%
    137 3938639 +0.82% 0% 3% 3%
    165 3908755 +0.05% 0% 3% 3%
    188 3905621 -0.03% 0% 3% 3%
    223 3903015 -0.09% 0% 4% 4%
    255 3889480 -0.44% 0% 3% 3%
    267 3891669 -0.38% 0% 4% 4%
    299 3898728 -0.20% 0% 4% 4%
    320 3894547 -0.31% 0% 4% 4%
    393 3875137 -0.81% 0% 4% 4%
    424 3874521 -0.82% 0% 3% 3%
    458 3880432 -0.67% 0% 4% 4%
    467 3888715 -0.46% 0% 3% 3%
    498 3888633 -0.46% 0% 4% 4%
    511 3875305 -0.80% 0% 5% 5%

    Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
    contention is higher than other desktops.

    batch score change zone_contention lru_contention total_contention
    31 3511158 +0.00% 2% 5% 7%
    53 3555445 +1.26% 2% 6% 8%
    63 3561082 +1.42% 2% 6% 8%
    73 3547218 +1.03% 2% 6% 8%
    119 3571319 +1.71% 1% 7% 8%
    127 3549375 +1.09% 0% 6% 6%
    137 3560233 +1.40% 0% 6% 6%
    165 3555176 +1.25% 2% 6% 8%
    188 3551501 +1.15% 0% 8% 8%
    223 3531462 +0.58% 0% 7% 7%
    255 3570400 +1.69% 0% 7% 7%
    267 3532235 +0.60% 1% 8% 9%
    299 3562326 +1.46% 0% 6% 6%
    320 3553569 +1.21% 0% 8% 8%
    393 3539519 +0.81% 0% 7% 7%
    424 3549271 +1.09% 0% 8% 8%
    458 3528885 +0.50% 0% 8% 8%
    467 3526554 +0.44% 0% 7% 7%
    498 3525302 +0.40% 0% 9% 9%
    511 3527556 +0.47% 0% 8% 8%

    Sandybridge-Desktop: the 0% contention isn't accurate but is caused by
    the dropped fractional part. Since multiple contention paths' contentions
    are all under 1% here, with some arithmetic operations like addition the
    final deviation could be as large as 3%.

    batch score change zone_contention lru_contention total_contention
    31 1744495 +0.00% 0% 0% 0%
    53 1755341 +0.62% 0% 0% 0%
    63 1758469 +0.80% 0% 0% 0%
    73 1759626 +0.87% 0% 0% 0%
    119 1770417 +1.49% 0% 0% 0%
    127 1768252 +1.36% 0% 0% 0%
    137 1767848 +1.34% 0% 0% 0%
    165 1765088 +1.18% 0% 0% 0%
    188 1766918 +1.29% 0% 0% 0%
    223 1767866 +1.34% 0% 0% 0%
    255 1768074 +1.35% 0% 0% 0%
    267 1763187 +1.07% 0% 0% 0%
    299 1765620 +1.21% 0% 0% 0%
    320 1767603 +1.32% 0% 0% 0%
    393 1764612 +1.15% 0% 0% 0%
    424 1758476 +0.80% 0% 0% 0%
    458 1758593 +0.81% 0% 0% 0%
    467 1757915 +0.77% 0% 0% 0%
    498 1753363 +0.51% 0% 0% 0%
    511 1755548 +0.63% 0% 0% 0%

    Phase two test results:
    Note: all percent changes are against the base (batch=31).

    ebizzy.throughput (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0%
    lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1%
    lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6%
    lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1%
    lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4%
    lkp-skl-d01 264126 262791 -0.5% 264113 +0.0%
    lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1%
    lkp-sb02 98937 98937 +0.0% 99030 +0.1%

    kbuild.buildtime (less is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1%
    lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1%
    lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1%
    lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4%
    lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1%
    lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7%
    lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1%

    netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0%
    lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1%
    lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1%
    lk-bdw-ep2 365764 365933 +0.0% 365951 +0.1%
    lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1%
    lkp-skl-d01 77314 77303 -0.0% 77375 +0.1%
    lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7%
    lkp-sb02 29990 30137 +0.5% 30103 +0.4%

    oltp.transactions (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4%
    lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0%
    lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7%
    lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4%
    lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0%
    lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0%
    lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8%

    pigz.throughput (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2%
    lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8%
    lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0%
    lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7%
    lkp-wsm-ep2 2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3%
    lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1%
    lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0%
    lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0%

    will-it-scale.malloc1.processes (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0%
    lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0%
    lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3%
    lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0%
    lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6%
    lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0%
    lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2%
    lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6%
    lkp-sb02 589286 589284 -0.0% 588101 -0.2%

    will-it-scale.malloc1.threads (higher is better)
    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2%
    lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4%
    lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9%
    lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9%
    lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2%
    lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1%
    lkp-skl-d01 175277 173343 -1.1% 173082 -1.3%
    lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8%
    lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4%

    will-it-scale.malloc2.processes (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0%
    lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0%
    lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1%
    lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1%
    lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1%
    lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1%
    lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0%
    lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0%
    lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0%

    will-it-scale.malloc2.threads (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0%
    lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0%
    lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 +0.2%
    lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3%
    lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0%
    lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0%
    lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0%
    lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0%
    lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0%

    will-it-scale.page_fault2.processes (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0%
    lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0%
    lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3%
    lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9%
    lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7%
    lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5%
    lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9%
    lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6%
    lkp-sb02 842616 837844 -0.6% 833921 -1.0%

    will-it-scale.page_fault2.threads

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1%
    lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9%
    lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0%
    lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8%
    lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3%
    lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9%
    lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4%
    lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5%
    lkp-sb02 750626 748615 -0.3% 746621 -0.5%

    will-it-scale.page_fault3.processes (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2%
    lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2%
    lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0%
    lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3%
    lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3%
    lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9%
    lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5%
    lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5%
    lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4%

    will-it-scale.page_fault3.threads (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1865820±3% 1900917±2% +1.9% 1826245±4% -2.1%
    lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8%
    lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9%
    lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6%
    lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5%
    lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9%
    lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9%
    lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9%
    lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6%

    will-it-scale.read1.processes (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable)
    lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2%
    lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1%
    lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6%
    lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2%
    lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2%
    lkp-skl-d01 10969487 10972650 +0.0% no data
    lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8%
    lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5%

    will-it-scale.read1.threads (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable)
    lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3%
    lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6%
    lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3%
    lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1%
    lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7%
    lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1%
    lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1%
    lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0%

    will-it-scale.write1.processes (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0%
    lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4%
    lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1%
    lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9%
    lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7%
    lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9%
    lkp-skl-d01 8886557 8971422 +1.0% 8818366 -0.8%
    lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4%
    lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6%

    will-it-scale.write1.threads (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7%
    lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6%
    lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7%
    lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1%
    lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8%
    lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8%
    lkp-skl-d01 8346174 8235695 -1.3% no data
    lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3%
    lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0%

    vm-scalability.anon-r-rand.throughput (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1%
    lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9%
    lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4%
    lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7%
    lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5%
    lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2%
    lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8%
    lkp-sb02 570572 567854 -0.5% 568449 -0.4%

    vm-scalability.anon-r-rand-mt.throughput (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1%
    lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9%
    lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6%
    lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9%
    lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0%
    lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0%
    lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1%
    lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9%
    lkp-sb02 567137 567346 +0.0% 566144 -0.2%

    vm-scalability.lru-file-mmap-read.throughput (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7%
    lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9%
    lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5%
    lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1%
    lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7%
    lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3%
    lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9%
    lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4%
    lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3%

    vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)

    machine batch=31 batch=63 batch=127
    lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6%
    lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2%
    lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0%
    lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1%
    lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1%
    lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1%
    lkp-skl-d01 696689 695537 -0.2% 694106 -0.4%
    lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2%
    lkp-sb02 258939 258787 -0.1% 258199 -0.3%

    Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Suggested-by: Dave Hansen
    Acked-by: Michal Hocko
    Acked-by: Jesper Dangaard Brouer
    Cc: Huang Ying
    Cc: Kemi Wang
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • There is no real reason to blow up just because the caller doesn't know
    that __get_free_pages cannot return highmem pages. Simply fix that up
    silently. Even if we have some confused users, such a fixup will not be
    harmful.
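
    The akpm note below names the mechanism; as a hedged sketch (close to, but
    not claimed to be, the literal function):

    unsigned long __get_free_pages(gfp_t gfp_mask, unsigned int order)
    {
            struct page *page;

            /*
             * __get_free_pages() hands back a kernel virtual address, so it
             * can never return a highmem page; silently mask the flag off
             * instead of blowing up on confused callers.
             */
            page = alloc_pages(gfp_mask & ~__GFP_HIGHMEM, order);
            if (!page)
                    return 0;
            return (unsigned long) page_address(page);
    }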

    [akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
    Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Jiankang Chen
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Yisheng Xie
    Cc: Hanjun Guo
    Cc: Kefeng Wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • __alloc_pages_slowpath() has for a long time contained code to ignore
    node restrictions from memory policies for high priority allocations.
    The current code that resets the zonelist iterator however does
    effectively nothing after commit 7810e6781e0f ("mm, page_alloc: do not
    break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
    Even before that commit, mempolicy restrictions were still not ignored,
    as they are passed in ac->nodemask which is untouched by the code.

    We can either remove the code, or make it work as intended. Since
    ac->nodemask can be set from task's mempolicy via alloc_pages_current()
    and thus also alloc_pages(), it may indeed affect kernel allocations,
    and it makes sense to ignore it to allow progress for high priority
    allocations.

    Thus, this patch resets ac->nodemask to NULL in such cases. This
    assumes all callers can handle it (i.e. there are no guarantees as in
    the case of __GFP_THISNODE) which seems to be the case. The same
    assumption is already present in check_retry_cpuset() for some time.

    The expected effect is that high priority kernel allocations in the
    context of userspace tasks (e.g. OOM victims) restricted by mempolicies
    will have higher chance to succeed if they are restricted to nodes with
    depleted memory, while there are other nodes with free memory left.

    It's not a new intention, but for the first time the code will match the
    intention, AFAICS. It was intended by commit 183f6371aac2 ("mm: ignore
    mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
    really worked, as mempolicy restriction was already encoded in nodemask,
    not zonelist, at that time.

    So originally that was for ALLOC_NO_WATERMARK only. Then it was
    adjusted by e46e7b77c909 ("mm, page_alloc: recalculate the preferred
    zoneref if the context can ignore memory policies") and cd04ae1e2dc8
    ("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
    current state. So even GFP_ATOMIC would now ignore mempolicies after
    the initial attempts fail - if the code worked as people thought it
    does.

    Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Mel Gorman
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The role of zero_resv_unavail() is to make sure that every struct page
    that is allocated but is not backed by memory accessible to the kernel
    is zeroed and not left in some uninitialized state.

    Since struct pages are allocated in blocks (2M pages in the x86 case),
    we can skip pageblock_nr_pages at a time when the first one is found to
    be invalid.
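
    A sketch of the skip in zero_resv_unavail() (loop body only; variable
    setup and accounting omitted):

    for (pfn = PFN_DOWN(start); pfn < PFN_UP(end); pfn++) {
            if (!pfn_valid(ALIGN_DOWN(pfn, pageblock_nr_pages))) {
                    /*
                     * struct pages are allocated in blocks, so if the
                     * start of this block has no memmap, none of it
                     * does: jump ahead to the next block.
                     */
                    pfn = ALIGN_DOWN(pfn, pageblock_nr_pages) +
                            pageblock_nr_pages - 1;
                    continue;
            }
            mm_zero_struct_page(pfn_to_page(pfn));
            pgcnt++;
    }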

    This optimization may help since now on x86 every hole in e820 maps is
    marked as reserved in memblock, and thus will go through this function.

    This function is called before sched_clock() is initialized, so I used
    my x86 early boot clock patches to measure the performance improvement.

    With a 1T hole on an i7-8700 we currently spend 0.606918s of boot time
    here, but with this optimization only 0.001103s.

    Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Oscar Salvador
    Reviewed-by: Naoya Horiguchi
    Cc: Pasha Tatashin
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Dan Williams
    Cc: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

15 Aug, 2018

1 commit

  • Pull power management updates from Rafael Wysocki:
    "These add a new framework for CPU idle time injection, to be used by
    all of the idle injection code in the kernel in the future, fix some
    issues and add a number of relatively small extensions in multiple
    places.

    Specifics:

    - Add a new framework for CPU idle time injection (Daniel Lezcano).

    - Add AVS support to the armada-37xx cpufreq driver (Gregory
    CLEMENT).

    - Add support for current CPU frequency reporting to the ACPI CPPC
    cpufreq driver (George Cherian).

    - Rework the cooling device registration in the imx6q/thermal driver
    (Bastian Stender).

    - Make the pcc-cpufreq driver refuse to work with dynamic scaling
    governors on systems with many CPUs to avoid scalability issues
    with it (Rafael Wysocki).

    - Fix the intel_pstate driver to report different maximum CPU
    frequencies on systems where they really are different and to
    ignore the turbo active ratio if hardware-managed P-states (HWP)
    are in use; make it use the match_string() helper (Xie Yisheng,
    Srinivas Pandruvada).

    - Fix a minor deferred probe issue in the qcom-kryo cpufreq driver
    (Niklas Cassel).

    - Add a tracepoint for the tracking of frequency limits changes (from
    Android) to the cpufreq core (Ruchi Kandoi).

    - Fix a circular lock dependency between CPU hotplug and sysfs
    locking in the cpufreq core reported by lockdep (Waiman Long).

    - Avoid excessive error reports on driver registration failures in
    the ARM cpuidle driver (Sudeep Holla).

    - Add a new device links flag to the driver core to make links go
    away automatically on supplier driver removal (Vivek Gautam).

    - Eliminate potential race condition between system-wide power
    management transitions and system shutdown (Pingfan Liu).

    - Add a quirk to save NVS memory on system suspend for the ASUS 1025C
    laptop (Willy Tarreau).

    - Make more systems use suspend-to-idle (instead of ACPI S3) by
    default (Tristian Celestin).

    - Get rid of stack VLA usage in the low-level hibernation code on
    64-bit x86 (Kees Cook).

    - Fix error handling in the hibernation core and mark an expected
    fall-through switch in it (Chengguang Xu, Gustavo Silva).

    - Extend the generic power domains (genpd) framework to support
    attaching a device to a power domain by name (Ulf Hansson).

    - Fix device reference counting and user limits initialization in the
    devfreq core (Arvind Yadav, Matthias Kaehlcke).

    - Fix a few issues in the rk3399_dmc devfreq driver and improve its
    documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).

    - Drop a redundant error message from the exynos-ppmu devfreq driver
    (Markus Elfring)"

    * tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
    PM / reboot: Eliminate race between reboot and suspend
    PM / hibernate: Mark expected switch fall-through
    cpufreq: intel_pstate: Ignore turbo active ratio in HWP
    cpufreq: Fix a circular lock dependency problem
    cpu/hotplug: Add a cpus_read_trylock() function
    x86/power/hibernate_64: Remove VLA usage
    cpufreq: trace frequency limits change
    cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
    cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
    cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
    cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
    cpufreq: armada-37xx: Add AVS support
    dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
    PM / devfreq: rk3399_dmc: Fix duplicated opp table on reload.
    PM / devfreq: Init user limits from OPP limits, not viceversa
    PM / devfreq: rk3399_dmc: fix spelling mistakes.
    PM / devfreq: rk3399_dmc: do not print error when get supply and clk defer.
    dt-bindings: devfreq: rk3399_dmc: move interrupts to be optional.
    PM / devfreq: rk3399_dmc: remove wait for dcf irq event.
    dt-bindings: clock: add rk3399 DDR3 standard speed bins.
    ...

    Linus Torvalds
     

14 Aug, 2018

1 commit

  • Merge changes in the PM core, system-wide PM infrastructure, generic
    power domains (genpd) framework, ACPI PM infrastructure and cpuidle
    for 4.19.

    * pm-core:
    driver core: Add flag to autoremove device link on supplier unbind
    driver core: Rename flag AUTOREMOVE to AUTOREMOVE_CONSUMER

    * pm-domains:
    PM / Domains: Introduce dev_pm_domain_attach_by_name()
    PM / Domains: Introduce option to attach a device by name to genpd
    PM / Domains: dt: Add a power-domain-names property

    * pm-sleep:
    PM / reboot: Eliminate race between reboot and suspend
    PM / hibernate: Mark expected switch fall-through
    x86/power/hibernate_64: Remove VLA usage
    PM / hibernate: cast PAGE_SIZE to int when comparing with error code

    * acpi-pm:
    ACPI / PM: save NVS memory for ASUS 1025C laptop
    ACPI / PM: Default to s2idle in all machines supporting LP S0

    * pm-cpuidle:
    ARM: cpuidle: silence error on driver registration failure

    Rafael J. Wysocki
     

06 Aug, 2018

2 commits

  • At present, "systemctl suspend" and "shutdown" can run in parallel. A
    system can suspend after devices_shutdown() and resume, and then the
    shutdown task goes on to power off. This causes many devices to not be
    really shut off. Hence, replace reboot_mutex with
    system_transition_mutex (renamed from pm_mutex) to achieve the
    exclusion. Renaming pm_mutex to system_transition_mutex also better
    reflects the purpose of the mutex.
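
    A sketch of the exclusion on the reboot side (which call sites take the
    renamed mutex is an assumption here; the point is simply that
    reboot/shutdown and suspend now serialize on one lock):

    SYSCALL_DEFINE4(reboot, int, magic1, int, magic2, unsigned int, cmd,
                    void __user *, arg)
    {
            int ret = 0;

            /*
             * Previously this took a private reboot_mutex, so a suspend
             * could slip in between devices_shutdown() and power-off.
             */
            mutex_lock(&system_transition_mutex);

            /* ... magic number checks and LINUX_REBOOT_CMD_* handling ... */

            mutex_unlock(&system_transition_mutex);
            return ret;
    }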

    Signed-off-by: Pingfan Liu
    Acked-by: Pavel Machek
    Signed-off-by: Rafael J. Wysocki

    Pingfan Liu
     
  • free_reserved_area() takes pointers as arguments to show which addresses
    should be freed. However, it does this in a somewhat ambiguous way. If it
    gets a kernel direct map address, it always works. However, if it gets an
    address that is part of the kernel image alias mapping, it can fail.

    It fails if all of the following happen:
    * The specified address is part of the kernel image alias
    * Poisoning is requested (forcing a memset())
    * The address is in a read-only portion of the kernel image

    The memset() fails on the read-only mapping, of course.
    free_reserved_area() *is* called both on the direct map and on kernel image
    alias addresses. We've just lucked out thus far that the kernel image
    alias areas it gets used on are read-write. I'm fairly sure this has been
    just a happy accident.

    It is quite easy to make free_reserved_area() work for all cases: just
    convert the address to a direct map address before doing the memset(), and
    do this unconditionally. There is little chance of a regression here
    because we previously did a virt_to_page() on the address for the memset,
    so we know these are not highmem pages for which virt_to_page() would fail.
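
    A sketch of the unconditional conversion inside free_reserved_area()
    (loop body only, following the description above):

    for (pos = start; pos < end; pos += PAGE_SIZE, pages++) {
            struct page *page = virt_to_page(pos);
            void *direct_map_addr;

            /*
             * 'pos' may be a kernel image alias; page_address() returns
             * the direct map address, which is always writable, so
             * poison that rather than the (possibly read-only) alias.
             */
            direct_map_addr = page_address(page);
            if ((unsigned int)poison <= 0xFF)
                    memset(direct_map_addr, poison, PAGE_SIZE);

            free_reserved_page(page);
    }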

    Signed-off-by: Dave Hansen
    Signed-off-by: Thomas Gleixner
    Cc: keescook@google.com
    Cc: aarcange@redhat.com
    Cc: jgross@suse.com
    Cc: jpoimboe@redhat.com
    Cc: gregkh@linuxfoundation.org
    Cc: peterz@infradead.org
    Cc: hughd@google.com
    Cc: torvalds@linux-foundation.org
    Cc: bp@alien8.de
    Cc: luto@kernel.org
    Cc: ak@linux.intel.com
    Cc: Kees Cook
    Cc: Andrea Arcangeli
    Cc: Juergen Gross
    Cc: Josh Poimboeuf
    Cc: Greg Kroah-Hartman
    Cc: Peter Zijlstra
    Cc: Hugh Dickins
    Cc: Linus Torvalds
    Cc: Borislav Petkov
    Cc: Andy Lutomirski
    Cc: Andi Kleen
    Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com

    Dave Hansen
     

17 Jul, 2018

1 commit

  • Moving zero_resv_unavail() before memmap_init_zone() caused a
    regression on x86-32.

    The cause is that we access struct pages before they are allocated when
    CONFIG_FLAT_NODE_MEM_MAP is used.

    free_area_init_nodes()
      zero_resv_unavail()
        mm_zero_struct_page(pfn_to_page(pfn));

    Tested-by: Matt Hart
    Acked-by: Michal Hocko
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

15 Jul, 2018

1 commit

  • We must zero struct pages for memory that is not backed by physical
    memory, or that the kernel does not have access to.

    Recently, there was a change which zeroed all memmap for all holes in
    e820. Unfortunately, it introduced a bug that is discussed here:

    https://www.spinics.net/lists/linux-mm/msg156764.html

    Linus also saw this bug on his machine and confirmed that reverting
    commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
    memblock.reserved") fixes the issue.

    The problem is that we incorrectly zero some struct pages after they
    were setup.

    The fix is to zero unavailable struct pages prior to the initialization
    of struct pages.

    A more detailed fix should come later that would avoid double zeroing
    cases: one in __init_single_page(), the other one in
    zero_resv_unavail().

    Fixes: 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

15 Jun, 2018

1 commit

  • mm/*.c files use symbolic and octal styles for permissions.

    Using octal and not symbolic permissions is preferred by many as more
    readable.

    https://lkml.org/lkml/2016/8/2/1945

    Prefer the direct use of octal for permissions.

    Done using
    $ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
    and some typing.

    Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
    44
    After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
    86
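
    For illustration, a typical conversion looks like this (hypothetical
    call site, not an actual hunk from the patch):

    /* before: symbolic macros */
    module_param(debug_level, int, S_IRUGO | S_IWUSR);

    /* after: the equivalent octal value */
    module_param(debug_level, int, 0644);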

    Miscellanea:

    o Whitespace neatening around these conversions.

    Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
    Signed-off-by: Joe Perches
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     

08 Jun, 2018

8 commits

  • In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
    allocations that can ignore memory policies. The zonelist is obtained
    from current CPU's node. This is a problem for __GFP_THISNODE
    allocations that want to allocate on a different node, e.g. because the
    allocating thread has been migrated to a different CPU.

    This has been observed to break SLAB in our 4.4-based kernel, because
    there it relies on __GFP_THISNODE working as intended. If a slab page
    is put on the wrong node's list, then further list manipulations may
    corrupt the list, because page_to_nid() is used to determine which
    node's list_lock should be locked, and thus we may take the wrong lock
    and race.

    Current SLAB implementation seems to be immune by luck thanks to commit
    511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
    arbitrary node") but there may be others assuming that __GFP_THISNODE
    works as promised.

    We can fix it by simply removing the zonelist reset completely. There
    is actually no reason to reset it, because memory policies and cpusets
    don't affect the zonelist choice in the first place. This was different
    when commit 183f6371aac2 ("mm: ignore mempolicies when using
    ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
    own restricted zonelists.
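
    A sketch of the hunk in __alloc_pages_slowpath(), with the removed line
    shown as a comment (context abbreviated; treat as illustrative):

    if (!(alloc_flags & ALLOC_CPUSET) || reserve_flags) {
            /*
             * Removed: rebuilding the zonelist from the current CPU's
             * node, which broke __GFP_THISNODE for migrated tasks and
             * never had anything to do with mempolicies anyway:
             *
             *      ac->zonelist = node_zonelist(numa_node_id(), gfp_mask);
             */
            ac->preferred_zoneref = first_zones_zonelist(ac->zonelist,
                            ac->high_zoneidx, ac->nodemask);
    }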

    We might consider this for 4.17 although I don't know if there's
    anything currently broken.

    SLAB is currently not affected, but in kernels older than 4.7 that don't
    yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
    allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
    ones I'll have to check.

    So stable backports should be more important, but will have to be
    reviewed carefully, as the code went through many changes. BTW I think
    that also the ac->preferred_zoneref reset is currently useless if we
    don't also reset ac->nodemask from a mempolicy to NULL first (which we
    probably should for the OOM victims etc?), but I would leave that for a
    separate patch.

    Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
    Acked-by: Mel Gorman
    Cc: Michal Hocko
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • This gives us five words of space in a single union in struct page. The
    compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
    systems, but that does not seem likely to cause any trouble.

    Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • Now that we can represent the location of 'deferred_list' in C instead of
    comments, make use of that ability.

    Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Vlastimil Babka
    Acked-by: Kirill A. Shutemov
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • We're already using a union of many fields here, so stop abusing the
    _mapcount and make page_type its own field. That implies renaming some of
    the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
    back the PG_buddy, PG_balloon and PG_kmemcg names.

    As suggested by Kirill, make page_type a bitmask. Because it starts out
    life as -1 (thanks to sharing the storage with _mapcount), setting a page
    flag means clearing the appropriate bit. This gives us space for probably
    twenty or so extra bits (depending on how paranoid we want to be about
    _mapcount underflow).
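
    A sketch of the convention (constants and the PG_buddy case shown as an
    illustration; the real header generates these helpers per type):

    #define PAGE_TYPE_BASE  0xf0000000      /* page_type starts out as -1 */
    #define PG_buddy        0x00000080

    /* A page has the type while its bit is *cleared* and the base bits
     * are still intact. */
    static inline int PageBuddy(struct page *page)
    {
            return (page->page_type & (PAGE_TYPE_BASE | PG_buddy)) ==
                    PAGE_TYPE_BASE;
    }

    static inline void __SetPageBuddy(struct page *page)
    {
            page->page_type &= ~PG_buddy;   /* "setting" clears the bit */
    }

    static inline void __ClearPageBuddy(struct page *page)
    {
            page->page_type |= PG_buddy;
    }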

    Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: Dave Hansen
    Cc: Jérôme Glisse
    Cc: Lai Jiangshan
    Cc: Martin Schwidefsky
    Cc: Pekka Enberg
    Cc: Randy Dunlap
    Cc: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • finalise_ac() has a parameter, order, which is not used at all. Remove
    it.

    Signed-off-by: Huaisheng Ye
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huaisheng Ye
     
  • is_pageblock_removable_nolock() is not used outside of
    mm/memory_hotplug.c. Move it next to its unique caller,
    is_mem_section_removable(), and make it static.

    Remove the prototype from the header to silence a gcc warning (W=1):

    mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]

    Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
    Signed-off-by: Mathieu Malaterre
    Suggested-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathieu Malaterre
     
  • While revisiting my Btrfs swapfile series [1], I introduced a situation
    in which reclaim would lock i_rwsem, and even though the swapon() path
    clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
    complaints from lockdep. It turns out that the rework of the fs_reclaim
    annotation was broken: if the current task has PF_MEMALLOC set, we don't
    acquire the dummy fs_reclaim lock, but when reclaiming we always check
    this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
    fix this by moving the fs_reclaim_{acquire,release}() outside of the
    memalloc_noreclaim_{save,restore}(), although kswapd is slightly
    different. After applying this, I got the expected lockdep splats.

    1: https://lwn.net/Articles/625412/
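
    A sketch of the reordering in the direct reclaim path (fragment;
    surrounding code omitted):

    unsigned int noreclaim_flag;

    /* Before: PF_MEMALLOC is already set when fs_reclaim_acquire() runs,
     * so lockdep never records the dummy lock. */
    noreclaim_flag = memalloc_noreclaim_save();
    fs_reclaim_acquire(gfp_mask);
    /* ... do reclaim ... */
    fs_reclaim_release(gfp_mask);
    memalloc_noreclaim_restore(noreclaim_flag);

    /* After: annotate first, then set PF_MEMALLOC, and release in the
     * opposite order. */
    fs_reclaim_acquire(gfp_mask);
    noreclaim_flag = memalloc_noreclaim_save();
    /* ... do reclaim ... */
    memalloc_noreclaim_restore(noreclaim_flag);
    fs_reclaim_release(gfp_mask);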

    Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
    Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
    Signed-off-by: Omar Sandoval
    Reviewed-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Tetsuo Handa
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Omar Sandoval
     
  • Highmem's realsize always equals freesize, so there is no need for a
    separate variable to record it.

    Link: http://lkml.kernel.org/r/20180413083859.65888-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     

26 May, 2018

1 commit

  • Oscar has reported:
    : Due to an unfortunate setting with movablecore, memblocks containing bootmem
    : memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
    : So while trying to remove that memory, the system failed in do_migrate_range
    : and __offline_pages never returned.
    :
    : This can be reproduced by running
    : qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
    : and movablecore=4G kernel command line
    :
    : linux kernel: BIOS-provided physical RAM map:
    : linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
    : linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
    : linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
    : linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
    : linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
    : linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
    : linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
    : linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
    : linux kernel: NX (Execute Disable) protection: active
    : linux kernel: SMBIOS 2.8 present.
    : linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
    : linux kernel: Hypervisor detected: KVM
    : linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
    : linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
    : linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
    :
    : linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
    : linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
    : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
    : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
    : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
    : linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
    : linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
    : linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
    : linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
    : linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
    : linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
    :
    : zoneinfo shows that the zone movable is placed into both numa nodes:
    : Node 0, zone Movable
    : pages free 160140
    : min 1823
    : low 2278
    : high 2733
    : spanned 262144
    : present 262144
    : managed 245670
    : Node 1, zone Movable
    : pages free 448427
    : min 3827
    : low 4783
    : high 5739
    : spanned 524288
    : present 524288
    : managed 515766

    Note how only Node 0 has a hotpluggable memory region, which would rule
    it out from the early memblock allocations (most likely memmap). Node 1
    will surely contain memmaps on the same node, and those would prevent
    offlining from succeeding. So this is arguably a configuration issue.
    One could argue that we should be more clever and rule out early
    allocations from the movable zone. That would be correct, but it is
    probably not worth the effort considering what a hack movablecore is.

    Anyway, we could do better for those cases. We rely on
    start_isolate_page_range() resp. has_unmovable_pages() to do their job.
    The former isolates the whole range to be offlined so that we do not
    allocate from it anymore, and the latter makes sure we are not
    stumbling over non-migratable pages.

    has_unmovable_pages() is overly optimistic, however. It doesn't check
    all the pages if we are within zone_movable, because we rely on those
    pages always being migratable. As it turns out, we are still not
    perfect there. While bootmem pages in zone_movable sound like a clear
    bug that should be fixed, let's remove the optimization for now and, in
    the meantime, warn if we encounter unmovable pages in zone_movable.
    That should help for now at least.
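
    The shape of the change, as a sketch (tail of has_unmovable_pages()
    only; the ZONE_MOVABLE early return above it is simply dropped):

    /* ... per-page checks; anything unmovable now jumps here ... */

    unmovable:
            WARN_ON_ONCE(zone_idx(zone) == ZONE_MOVABLE);
            return true;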

    Btw. this wasn't a real problem until commit 72b39cfc4d75 ("mm,
    memory_hotplug: do not fail offlining too early") because we used to
    have a small number of retries and then failed. This turned out to be
    too fragile though.

    Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reported-by: Oscar Salvador
    Tested-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc: Reza Arbab
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

25 May, 2018

1 commit

  • This reverts the following commits that change CMA design in MM.

    3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")

    1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA")

    bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")

    Ville reported the following error on i386.

    Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
    microcode: microcode updated early to revision 0x4, date = 2013-06-28
    Initializing CPU#0
    Initializing HighMem for node 0 (000377fe:00118000)
    Initializing Movable for node 0 (00000001:00118000)
    BUG: Bad page state in process swapper pfn:377fe
    page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
    flags: 0x80000000()
    raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
    page dumped because: nonzero mapcount
    Modules linked in:
    CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
    Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
    Call Trace:
    dump_stack+0x60/0x96
    bad_page+0x9a/0x100
    free_pages_check_bad+0x3f/0x60
    free_pcppages_bulk+0x29d/0x5b0
    free_unref_page_commit+0x84/0xb0
    free_unref_page+0x3e/0x70
    __free_pages+0x1d/0x20
    free_highmem_page+0x19/0x40
    add_highpages_with_active_regions+0xab/0xeb
    set_highmem_pages_init+0x66/0x73
    mem_init+0x1b/0x1d7
    start_kernel+0x17a/0x363
    i386_start_kernel+0x95/0x99
    startup_32_smp+0x164/0x168

    The reason for this error is that the span of MOVABLE_ZONE is extended
    to the whole node span for future CMA initialization, and normal memory
    is wrongly freed here. I submitted a fix and it seems to work, but
    another problem happened.

    It is too late in the cycle to fix the latter problem, so I decided to
    revert the series.

    Reported-by: Ville Syrjälä
    Acked-by: Laura Abbott
    Acked-by: Michal Hocko
    Cc: Andrew Morton
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Linus Torvalds

    Joonsoo Kim