07 Oct, 2020

1 commit

  • commit c1d0da83358a2316d9be7f229f26126dbaa07468 upstream.

    Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies on the node side: a memory21 link appears in both the
    node1 and node2 directories.

    This is wrong, but it doesn't prevent the system from running. However,
    when one of these memory blocks is later hot-unplugged and then
    hot-plugged, the system detects an inconsistency in the sysfs layout and
    a BUG_ON() is raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when a node's memory is registered,
    the range used can overlap another node's range; thus the memory block is
    registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node layouts are broken due to these multiple
    links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node() should not rely on the
    system state to detect whether the link operation is triggered by a
    hot-plug operation or not. This is addressed by patches 1 and 2 of this
    series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or happening at boot time.

    Make it generic to hotplug operations and rename it to meminit_context.

    There is no functional change introduced by this patch.
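
    For reference, the rename boils down to the following sketch (the
    MEMMAP_*/MEMINIT_* enumerator names follow the convention used upstream
    at the time):

    /* Before: tied to the memmap initialization wording. */
    enum memmap_context {
            MEMMAP_EARLY,
            MEMMAP_HOTPLUG,
    };

    /* After: generic to any memory init path (boot vs. hotplug). */
    enum meminit_context {
            MEMINIT_EARLY,
            MEMINIT_HOTPLUG,
    };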

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Laurent Dufour
     

26 Aug, 2020

2 commits

  • commit 88e8ac11d2ea3acc003cf01bb5a38c8aa76c3cfd upstream.

    The following race is observed with repeated online and offline cycles,
    and a delay between two successive onlines of memory blocks of the
    movable zone.

    P1: Online the first memory block in the movable zone. The pcp struct
    values are initialized to default values, i.e., pcp->high = 0 and
    pcp->batch = 1.

    P2: Allocate pages from the movable zone.

    P1: Try to online the second memory block in the movable zone; it enters
    online_pages() but has not yet called zone_pcp_update().

    P2: This process enters the exit path and tries to release its order-0
    pages to the pcp lists through free_unref_page_commit(). As pcp->high = 0
    and pcp->count = 1, it proceeds to call free_pcppages_bulk().

    P1: Update the pcp values; the new pcp values become, say, pcp->high = 378
    and pcp->batch = 63.

    P2: Read the pcp's batch value using READ_ONCE() and pass it to
    free_pcppages_bulk(); the values passed here are batch = 63, count = 1.

    P2: Since the number of pages in the pcp lists is less than ->batch, it
    gets stuck in the while (list_empty(list)) loop with interrupts disabled,
    and the core hangs.

    Avoid this by ensuring free_pcppages_bulk() is called with the proper
    count of pcp list pages.
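
    Concretely, the fix clamps the count passed to free_pcppages_bulk() so it
    can never exceed what is actually on the pcp lists; a rough sketch of the
    change in mm/page_alloc.c:

    static void free_pcppages_bulk(struct zone *zone, int count,
                                   struct per_cpu_pages *pcp)
    {
            /* ... */

            /*
             * Ensure a proper count is passed; otherwise the
             * while (list_empty(list)) loop below can spin forever
             * with interrupts disabled.
             */
            count = min(pcp->count, count);

            /* ... existing bulk-free logic ... */
    }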

    The mentioned race is somewhat easily reproducible without [1] because
    the pcp values are not updated for the first memory block onlined, and
    thus there is a wide enough race window for P2 between alloc+free and the
    pcp struct values update through the onlining of the second memory block.

    With [1], the race still exists but it is very narrow as we update the pcp
    struct values for the first memory block online itself.

    This is not limited to the movable zone, it could also happen in cases
    with the normal zone (e.g., hotplug to a node that only has DMA memory, or
    no other memory yet).

    [1]: https://patchwork.kernel.org/patch/11696389/

    Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Vinayak Menon
    Cc: [2.6+]
    Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Charan Teja Reddy
     
  • commit e08d3fdfe2dafa0331843f70ce1ff6c1c4900bf4 upstream.

    The lowmem_reserve arrays provide a means of applying pressure against
    allocations from lower zones that were targeted at higher zones. Its
    values are a function of the number of pages managed by higher zones and
    are assigned by a call to the setup_per_zone_lowmem_reserve() function.

    The function is initially called at boot time by the function
    init_per_zone_wmark_min() and may be called later by accesses of the
    /proc/sys/vm/lowmem_reserve_ratio sysctl file.

    The function init_per_zone_wmark_min() was moved up from a module_init to
    a core_initcall to resolve a sequencing issue with khugepaged.
    Unfortunately this created a sequencing issue with CMA page accounting.

    The CMA pages are added to the managed page count of a zone when
    cma_init_reserved_areas() is called at boot also as a core_initcall. This
    makes it uncertain whether the CMA pages will be added to the managed page
    counts of their zones before or after the call to
    init_per_zone_wmark_min() as it becomes dependent on link order. With the
    current link order the pages are added to the managed count after the
    lowmem_reserve arrays are initialized at boot.

    This means the lowmem_reserve values at boot may be lower than the values
    used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
    ratio values are unchanged.

    In many cases the difference is not significant, but for example
    an ARM platform with 1GB of memory and the following memory layout

    cma: Reserved 256 MiB at 0x0000000030000000
    Zone ranges:
    DMA [mem 0x0000000000000000-0x000000002fffffff]
    Normal empty
    HighMem [mem 0x0000000030000000-0x000000003fffffff]

    would result in 0 lowmem_reserve for the DMA zone. This would allow
    userspace to deplete the DMA zone easily.

    Funnily enough

    $ cat /proc/sys/vm/lowmem_reserve_ratio

    would fix up the situation because, as a side effect, it forces a call to
    setup_per_zone_lowmem_reserve().

    This commit breaks the link order dependency by invoking
    init_per_zone_wmark_min() as a postcore_initcall, so that the CMA pages
    have a chance to be properly accounted in their zone(s), allowing the
    lowmem_reserve arrays to receive consistent values.
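
    In code terms this is a one-line change at the bottom of mm/page_alloc.c,
    roughly:

    /* was: core_initcall(init_per_zone_wmark_min) */
    postcore_initcall(init_per_zone_wmark_min)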

    Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
    Signed-off-by: Doug Berger
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Jason Baron
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc:
    Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Doug Berger
     

22 Jun, 2020

3 commits

  • commit da97f2d56bbd880b4138916a7ef96f9881a551b2 upstream.

    Now that deferred pages are initialized with interrupts enabled we can
    replace touch_nmi_watchdog() with cond_resched(), as it was before
    3a2d7fa8a3d5.

    For now, we cannot do the same in deferred_grow_zone() as it still
    initializes pages with interrupts disabled.
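
    The change itself is tiny; in the main loop of deferred_init_memmap() the
    watchdog call becomes a reschedule point, roughly:

    while (spfn < epfn) {
            nr_pages += deferred_init_maxorder(&i, zone, &spfn, &epfn);
            cond_resched();         /* was: touch_nmi_watchdog() */
    }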

    This change fixes the RCU problem described in
    https://lkml.kernel.org/r/20200401104156.11564-2-david@redhat.com

    [ 60.474005] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
    [ 60.475000] rcu: 1-...0: (0 ticks this GP) idle=02a/1/0x4000000000000000 softirq=1/1 fqs=15000
    [ 60.475000] rcu: (detected by 0, t=60002 jiffies, g=-1199, q=1)
    [ 60.475000] Sending NMI from CPU 0 to CPUs 1:
    [ 1.760091] NMI backtrace for cpu 1
    [ 1.760091] CPU: 1 PID: 20 Comm: pgdatinit0 Not tainted 4.18.0-147.9.1.el8_1.x86_64 #1
    [ 1.760091] Hardware name: Red Hat KVM, BIOS 1.13.0-1.module+el8.2.0+5520+4e5817f3 04/01/2014
    [ 1.760091] RIP: 0010:__init_single_page.isra.65+0x10/0x4f
    [ 1.760091] Code: 48 83 cf 63 48 89 f8 0f 1f 40 00 48 89 c6 48 89 d7 e8 6b 18 80 ff 66 90 5b c3 31 c0 b9 10 00 00 00 49 89 f8 48 c1 e6 33 f3 ab 07 00 00 00 48 c1 e2 36 41 c7 40 34 01 00 00 00 48 c1 e0 33 41
    [ 1.760091] RSP: 0000:ffffba783123be40 EFLAGS: 00000006
    [ 1.760091] RAX: 0000000000000000 RBX: fffffad34405e300 RCX: 0000000000000000
    [ 1.760091] RDX: 0000000000000000 RSI: 0010000000000000 RDI: fffffad34405e340
    [ 1.760091] RBP: 0000000033f3177e R08: fffffad34405e300 R09: 0000000000000002
    [ 1.760091] R10: 000000000000002b R11: ffff98afb691a500 R12: 0000000000000002
    [ 1.760091] R13: 0000000000000000 R14: 000000003f03ea00 R15: 000000003e10178c
    [ 1.760091] FS: 0000000000000000(0000) GS:ffff9c9ebeb00000(0000) knlGS:0000000000000000
    [ 1.760091] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1.760091] CR2: 00000000ffffffff CR3: 000000a1cf20a001 CR4: 00000000003606e0
    [ 1.760091] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 1.760091] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    [ 1.760091] Call Trace:
    [ 1.760091] deferred_init_pages+0x8f/0xbf
    [ 1.760091] deferred_init_memmap+0x184/0x29d
    [ 1.760091] ? deferred_free_pages.isra.97+0xba/0xba
    [ 1.760091] kthread+0x112/0x130
    [ 1.760091] ? kthread_flush_work_fn+0x10/0x10
    [ 1.760091] ret_from_fork+0x35/0x40
    [ 89.123011] node 0 initialised, 1055935372 pages in 88650ms

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Reported-by: Yiqian Wei
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Tested-by: David Hildenbrand
    Reviewed-by: Daniel Jordan
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: James Morris
    Cc: Kirill Tkhai
    Cc: Sasha Levin
    Cc: Shile Zhang
    Cc: Vlastimil Babka
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-4-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     
  • commit 117003c32771df617acf66e140fbdbdeb0ac71f5 upstream.

    Patch series "initialize deferred pages with interrupts enabled", v4.

    Keep interrupts enabled during deferred page initialization in order to
    make code more modular and allow jiffies to update.

    Original approach, and discussion can be found here:
    http://lkml.kernel.org/r/20200311123848.118638-1-shile.zhang@linux.alibaba.com

    This patch (of 3):

    deferred_init_memmap() disables interrupts the entire time, so it calls
    touch_nmi_watchdog() periodically to avoid soft lockup splats. Soon it
    will run with interrupts enabled, at which point cond_resched() should be
    used instead.

    deferred_grow_zone() makes the same watchdog calls through code shared
    with deferred init but will continue to run with interrupts disabled, so
    it can't call cond_resched().

    Pull the watchdog calls up to these two places to allow the first to be
    changed later, independently of the second. The frequency reduces from
    twice per pageblock (init and free) to once per max order block.

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Signed-off-by: Daniel Jordan
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: Shile Zhang
    Cc: Kirill Tkhai
    Cc: James Morris
    Cc: Sasha Levin
    Cc: Yiqian Wei
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-2-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Daniel Jordan
     
  • commit 3d060856adfc59afb9d029c233141334cfaba418 upstream.

    Initializing struct pages is a long task and keeping interrupts disabled
    for the duration of this operation introduces a number of problems.

    1. jiffies are not updated for a long period of time, and thus incorrect
    time is reported. See the proposed solution and discussion here:
    lkml/20200311123848.118638-1-shile.zhang@linux.alibaba.com
    2. It prevents further improving deferred page initialization through
    intra-node multi-threading.

    We are keeping interrupts disabled to solve a rather theoretical problem
    that was never observed in the real world (see 3a2d7fa8a3d5).

    Let's keep interrupts enabled. In case we ever encounter a scenario where
    an interrupt thread wants to allocate a large amount of memory this early
    in boot, we can deal with that by growing the zone (see
    deferred_grow_zone()) by the needed amount before starting the
    deferred_init_memmap() threads.

    Before:
    [ 1.232459] node 0 initialised, 12058412 pages in 1ms

    After:
    [ 1.632580] node 0 initialised, 12051227 pages in 436ms

    Fixes: 3a2d7fa8a3d5 ("mm: disable interrupts while initializing deferred pages")
    Reported-by: Shile Zhang
    Signed-off-by: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Dan Williams
    Cc: James Morris
    Cc: Kirill Tkhai
    Cc: Sasha Levin
    Cc: Yiqian Wei
    Cc: [4.17+]
    Link: http://lkml.kernel.org/r/20200403140952.17177-3-pasha.tatashin@soleen.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Pavel Tatashin
     

14 May, 2020

2 commits

  • commit 14f69140ff9c92a0928547ceefb153a842e8492c upstream.

    Commit 1c30844d2dfe ("mm: reclaim small amounts of memory when an
    external fragmentation event occurs") adds a boost_watermark() function
    which increases the min watermark in a zone by at least
    pageblock_nr_pages or the number of pages in a page block.

    On Arm64, with 64K pages and 512M huge pages, this is 8192 pages or
    512M. It does this regardless of the number of pages managed in the zone
    or the likelihood of success.

    This can put the zone immediately under water in terms of allocating
    pages from the zone, and can cause a small machine to fail immediately
    due to OoM. Unlike set_recommended_min_free_kbytes(), which
    substantially increases min_free_kbytes and is tied to THP,
    boost_watermark() can be called even if THP is not active.

    The problem is most likely to appear on architectures such as Arm64
    where pageblock_nr_pages is very large.

    It is desirable to run the kdump capture kernel in as small a space as
    possible to avoid wasting memory. In some architectures, such as Arm64,
    there are restrictions on where the capture kernel can run, and
    therefore, the space available. A capture kernel running in 768M can
    fail due to OoM immediately after boost_watermark() sets the min in zone
    DMA32, where most of the memory is, to 512M. It fails even though there
    is over 500M of free memory. With boost_watermark() suppressed, the
    capture kernel can run successfully in 448M.

    This patch limits boost_watermark() to boosting a zone's min watermark
    only when there are enough pages that the boost will produce positive
    results. In this case that is estimated to be four times as many pages
    as pageblock_nr_pages.
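
    The guard added to boost_watermark() is essentially the following sketch:

    static inline void boost_watermark(struct zone *zone)
    {
            unsigned long max_boost;

            if (!watermark_boost_factor)
                    return;

            /*
             * Don't bother in zones that are unlikely to produce results.
             * On small machines, including kdump capture kernels running
             * in a small area, boosting the watermark can cause an out of
             * memory situation immediately.
             */
            if ((pageblock_nr_pages * 4) > zone_managed_pages(zone))
                    return;

            /* ... existing boosting logic ... */
    }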

    Mel said:

    : There is no harm in marking it stable. Clearly it does not happen very
    : often but it's not impossible. 32-bit x86 is a lot less common now
    : which would previously have been vulnerable to triggering this easily.
    : ppc64 has a larger base page size but typically only has one zone.
    : arm64 is likely the most vulnerable, particularly when CMA is
    : configured with a small movable zone.

    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Henry Willard
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Mel Gorman
    Cc: Vlastimil Babka
    Cc:
    Link: http://lkml.kernel.org/r/1588294148-6586-1-git-send-email-henry.willard@oracle.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Henry Willard
     
  • commit e84fe99b68ce353c37ceeecc95dce9696c976556 upstream.

    Without CONFIG_PREEMPT, it can happen that we get soft lockups detected,
    e.g., while booting up.

    watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [swapper/0:1]
    CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.6.0-next-20200331+ #4
    Hardware name: Red Hat KVM, BIOS 1.11.1-4.module+el8.1.0+4066+0f1aadab 04/01/2014
    RIP: __pageblock_pfn_to_page+0x134/0x1c0
    Call Trace:
    set_zone_contiguous+0x56/0x70
    page_alloc_init_late+0x166/0x176
    kernel_init_freeable+0xfa/0x255
    kernel_init+0xa/0x106
    ret_from_fork+0x35/0x40

    The issue becomes visible when having a lot of memory (e.g., 4TB)
    assigned to a single NUMA node - a system that can easily be created
    using QEMU. Inside VMs on a hypervisor with quite some memory
    overcommit, this is fairly easy to trigger.
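
    The fix is a single cond_resched() in the pageblock loop of
    set_zone_contiguous(), roughly:

    for (; block_start_pfn < zone_end_pfn(zone);
                    block_start_pfn = block_end_pfn,
                    block_end_pfn += pageblock_nr_pages) {

            block_end_pfn = min(block_end_pfn, zone_end_pfn(zone));

            if (!__pageblock_pfn_to_page(block_start_pfn,
                                         block_end_pfn, zone))
                    return;
            cond_resched();         /* give the soft lockup detector a chance */
    }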

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Pankaj Gupta
    Reviewed-by: Baoquan He
    Reviewed-by: Shile Zhang
    Acked-by: Michal Hocko
    Cc: Kirill Tkhai
    Cc: Shile Zhang
    Cc: Pavel Tatashin
    Cc: Daniel Jordan
    Cc: Michal Hocko
    Cc: Alexander Duyck
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc:
    Link: http://lkml.kernel.org/r/20200416073417.5003-1-david@redhat.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

11 Feb, 2020

1 commit

  • commit e822969cab48b786b64246aad1a3ba2a774f5d23 upstream.

    Patch series "mm: fix max_pfn not falling on section boundary", v2.

    Playing with different memory sizes for an x86-64 guest, I discovered that
    some memmaps (highest section if max_mem does not fall on the section
    boundary) are marked as being valid and online, but contain garbage. We
    have to properly initialize these memmaps.

    Looking at /proc/kpageflags and friends, I found some more issues,
    partially related to this.

    This patch (of 3):

    If max_pfn is not aligned to a section boundary, we can easily run into
    BUGs. This can e.g., be triggered on x86-64 under QEMU by specifying a
    memory size that is not a multiple of 128MB (e.g., 4097MB, but also
    4160MB). I was told that on real HW, we can easily have this scenario
    (esp., one of the main reasons sub-section hotadd of devmem was added).

    The issue is that we have a valid memmap (pfn_valid()) for the whole
    section, and the whole section will be marked "online".
    pfn_to_online_page() will succeed, but the memmap contains garbage.

    E.g., doing a "./page-types -r -a 0x144001" when QEMU was started with "-m
    4160M" - (see tools/vm/page-types.c):

    [ 200.476376] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 200.477500] #PF: supervisor read access in kernel mode
    [ 200.478334] #PF: error_code(0x0000) - not-present page
    [ 200.479076] PGD 59614067 P4D 59614067 PUD 59616067 PMD 0
    [ 200.479557] Oops: 0000 [#4] SMP NOPTI
    [ 200.479875] CPU: 0 PID: 603 Comm: page-types Tainted: G D W 5.5.0-rc1-next-20191209 #93
    [ 200.480646] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu4
    [ 200.481648] RIP: 0010:stable_page_flags+0x4d/0x410
    [ 200.482061] Code: f3 ff 41 89 c0 48 b8 00 00 00 00 01 00 00 00 45 84 c0 0f 85 cd 02 00 00 48 8b 53 08 48 8b 2b 48f
    [ 200.483644] RSP: 0018:ffffb139401cbe60 EFLAGS: 00010202
    [ 200.484091] RAX: fffffffffffffffe RBX: fffffbeec5100040 RCX: 0000000000000000
    [ 200.484697] RDX: 0000000000000001 RSI: ffffffff9535c7cd RDI: 0000000000000246
    [ 200.485313] RBP: ffffffffffffffff R08: 0000000000000000 R09: 0000000000000000
    [ 200.485917] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000144001
    [ 200.486523] R13: 00007ffd6ba55f48 R14: 00007ffd6ba55f40 R15: ffffb139401cbf08
    [ 200.487130] FS: 00007f68df717580(0000) GS:ffff9ec77fa00000(0000) knlGS:0000000000000000
    [ 200.487804] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 200.488295] CR2: fffffffffffffffe CR3: 0000000135d48000 CR4: 00000000000006f0
    [ 200.488897] Call Trace:
    [ 200.489115] kpageflags_read+0xe9/0x140
    [ 200.489447] proc_reg_read+0x3c/0x60
    [ 200.489755] vfs_read+0xc2/0x170
    [ 200.490037] ksys_pread64+0x65/0xa0
    [ 200.490352] do_syscall_64+0x5c/0xa0
    [ 200.490665] entry_SYSCALL_64_after_hwframe+0x49/0xbe

    But it can be triggered much easier via "cat /proc/kpageflags > /dev/null"
    after cold/hot plugging a DIMM to such a system:

    [root@localhost ~]# cat /proc/kpageflags > /dev/null
    [ 111.517275] BUG: unable to handle page fault for address: fffffffffffffffe
    [ 111.517907] #PF: supervisor read access in kernel mode
    [ 111.518333] #PF: error_code(0x0000) - not-present page
    [ 111.518771] PGD a240e067 P4D a240e067 PUD a2410067 PMD 0

    This patch fixes that by at least zero-ing out that memmap (so e.g.,
    page_to_pfn() will not crash). Commit 907ec5fca3dc ("mm: zero remaining
    unavailable struct pages") tried to fix a similar issue, but forgot to
    consider this special case.

    After this patch, there are still problems to solve. E.g., not all of
    these pages falling into a memory hole will actually get initialized later
    and set PageReserved - they are only zeroed out - but at least the
    immediate crashes are gone. A follow-up patch will take care of this.
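
    The zeroing amounts to extending the range handled for the last section up
    to the next section boundary, roughly (assuming the helpers of that era,
    zero_resv_unavail() and zero_pfn_range()):

    /*
     * Early sections always have a fully populated memmap for the whole
     * section - see pfn_valid(). If the last section has holes at the end
     * and is marked "online", make sure its memmap is in a well-defined
     * (zeroed) state.
     */
    pgcnt += zero_pfn_range(PFN_DOWN(next),
                            round_up(max_pfn, PAGES_PER_SECTION));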

    Link: http://lkml.kernel.org/r/20191211163201.17179-2-david@redhat.com
    Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
    Signed-off-by: David Hildenbrand
    Tested-by: Daniel Jordan
    Cc: Naoya Horiguchi
    Cc: Pavel Tatashin
    Cc: Andrew Morton
    Cc: Steven Sistare
    Cc: Michal Hocko
    Cc: Daniel Jordan
    Cc: Bob Picco
    Cc: Oscar Salvador
    Cc: Alexey Dobriyan
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Stephen Rothwell
    Cc: [4.15+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    David Hildenbrand
     

23 Jan, 2020

1 commit

  • commit 8e57f8acbbd121ecfb0c9dc13b8b030f86c6bd3b upstream.

    Commit 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable
    debugging") has introduced a static key to reduce overhead when
    debug_pagealloc is compiled in but not enabled. It relied on the
    assumption that jump_label_init() is called before parse_early_param()
    as in start_kernel(), so when the "debug_pagealloc=on" option is parsed,
    it is safe to enable the static key.

    However, it turns out multiple architectures call parse_early_param()
    earlier from their setup_arch(). x86 also calls jump_label_init() even
    earlier, so no issue was found while testing the commit, but the same is
    not true for, e.g., ppc64 and s390, where the kernel would not boot with
    debug_pagealloc=on, as found by our QA.

    To fix this without tricky changes to init code of multiple
    architectures, this patch partially reverts the static key conversion
    from 96a2b03f281d. Init-time and non-fastpath calls (such as in arch
    code) of debug_pagealloc_enabled() will again test a simple bool
    variable. Fastpath mm code is converted to a new
    debug_pagealloc_enabled_static() variant that relies on the static key,
    which is enabled in a well-defined point in mm_init() where it's
    guaranteed that jump_label_init() has been called, regardless of
    architecture.
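
    The resulting pair of helpers looks roughly like this sketch of the
    include/linux/mm.h side:

    extern bool _debug_pagealloc_enabled_early;
    DECLARE_STATIC_KEY_FALSE(_debug_pagealloc_enabled);

    /* Safe from init/arch code: a plain bool, no static key needed. */
    static inline bool debug_pagealloc_enabled(void)
    {
            return IS_ENABLED(CONFIG_DEBUG_PAGEALLOC) &&
                    _debug_pagealloc_enabled_early;
    }

    /* Fast-path variant: only valid once mm_init() has enabled the key. */
    static inline bool debug_pagealloc_enabled_static(void)
    {
            if (!IS_ENABLED(CONFIG_DEBUG_PAGEALLOC))
                    return false;

            return static_branch_unlikely(&_debug_pagealloc_enabled);
    }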

    [sfr@canb.auug.org.au: export _debug_pagealloc_enabled_early]
    Link: http://lkml.kernel.org/r/20200106164944.063ac07b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20191219130612.23171-1-vbabka@suse.cz
    Fixes: 96a2b03f281d ("mm, debug_pagelloc: use static keys to enable debugging")
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Stephen Rothwell
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Vlastimil Babka
     

07 Nov, 2019

2 commits

  • While investigating a bug related to higher atomic allocation failures,
    we noticed the failure warnings positively drowning the console, and in
    our case triggering lockup warnings because of a serial console too slow
    to handle all that output.

    But even if we had a faster console, it's unclear what additional
    information the current level of repetition provides.

    Allocation failures happen for three reasons: The machine is OOM, the VM
    is failing to handle reasonable requests, or somebody is making
    unreasonable requests (and didn't acknowledge their opportunism with
    __GFP_NOWARN). Having the memory dump, a callstack, and the ratelimit
    stats on skipped failure warnings should provide enough information to
    let users/admins/developers know whether something is wrong and point
    them in the right direction for debugging, bpftracing etc.

    Limit allocation failure warnings to one spew every ten seconds.
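
    In warn_alloc() this amounts to replacing the default ratelimit parameters
    with an explicit interval and burst, roughly:

    /*
     * was: DEFINE_RATELIMIT_STATE(nopage_rs, DEFAULT_RATELIMIT_INTERVAL,
     *                             DEFAULT_RATELIMIT_BURST);
     */
    static DEFINE_RATELIMIT_STATE(nopage_rs, 10 * HZ, 1);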

    Link: http://lkml.kernel.org/r/20191028194906.26899-1-hannes@cmpxchg.org
    Signed-off-by: Johannes Weiner
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Deferred memory initialisation updates zone->managed_pages during the
    initialisation phase but before that finishes, the per-cpu page
    allocator (pcpu) calculates the number of pages allocated/freed in
    batches as well as the maximum number of pages allowed on a per-cpu
    list. As zone->managed_pages is not up to date yet, the pcpu
    initialisation calculates inappropriately low batch and high values.

    This increases zone lock contention quite severely in some cases with
    the degree of severity depending on how many CPUs share a local zone and
    the size of the zone. A private report indicated that kernel build
    times were excessive with extremely high system CPU usage. A perf
    profile indicated that a large chunk of time was lost on zone->lock
    contention.

    This patch recalculates the pcpu batch and high values after deferred
    initialisation completes for every populated zone in the system. It was
    tested on a 2-socket AMD EPYC 2 machine using a kernel compilation
    workload -- allmodconfig and all available CPUs.
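
    The recalculation happens in page_alloc_init_late(), once all deferred
    init threads have completed, roughly:

    /* Block until all are initialised */
    wait_for_completion(&pgdat_init_all_done_comp);

    /*
     * The number of managed pages has changed during the initialisation,
     * so the pcpu batch and high limits need to be updated or they will
     * stay artificially small.
     */
    for_each_populated_zone(zone)
            zone_pcp_update(zone);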

    mmtests configuration: config-workload-kernbench-max Configuration was
    modified to build on a fresh XFS partition.

    kernbench
                              5.4.0-rc3             5.4.0-rc3
                                vanilla          resetpcpu-v2
    Amean     user-256    13249.50 (   0.00%)  16401.31 * -23.79%*
    Amean     syst-256    14760.30 (   0.00%)   4448.39 *  69.86%*
    Amean     elsp-256      162.42 (   0.00%)    119.13 *  26.65%*
    Stddev    user-256       42.97 (   0.00%)     19.15 (  55.43%)
    Stddev    syst-256      336.87 (   0.00%)      6.71 (  98.01%)
    Stddev    elsp-256        2.46 (   0.00%)      0.39 (  84.03%)

                         5.4.0-rc3    5.4.0-rc3
                           vanilla resetpcpu-v2
    Duration User         39766.24     49221.79
    Duration System       44298.10     13361.67
    Duration Elapsed        519.11       388.87

    The patch reduces system CPU usage by 69.86% and total build time by
    26.65%. The variance of system CPU usage is also much reduced.

    Before the patch, the breakdown of batch and high values over all zones
    was:

    256 batch: 1
    256 batch: 63
    512 batch: 7
    256 high: 0
    256 high: 378
    512 high: 42

    512 pcpu pagesets had a batch limit of 7 and a high limit of 42. After
    the patch:

    256 batch: 1
    768 batch: 63
    256 high: 0
    768 high: 378

    [mgorman@techsingularity.net: fix merge/linkage snafu]
    Link: http://lkml.kernel.org/r/20191023084705.GD3016@techsingularity.net
    Link: http://lkml.kernel.org/r/20191021094808.28824-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: David Hildenbrand
    Cc: Matt Fleming
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Cc: Qian Cai
    Cc: [4.1+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

15 Oct, 2019

1 commit

  • Commit b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when
    compaction may not succeed") has chnaged the allocator to bail out from
    the allocator early to prevent from a potentially excessive memory
    reclaim. __GFP_RETRY_MAYFAIL is designed to retry the allocation,
    reclaim and compaction loop as long as there is a reasonable chance to
    make forward progress. Neither COMPACT_SKIPPED nor COMPACT_DEFERRED at
    the INIT_COMPACT_PRIORITY compaction attempt gives this feedback.

    The most obvious affected subsystem is hugetlbfs which allocates huge
    pages based on an admin request (or via admin configured overcommit). I
    have done a simple test which tries to allocate half of the memory for
    hugetlb pages while the memory is full of a clean page cache. This is
    not an unusual situation because we try to cache as much of the memory
    as possible and sysctl/sysfs interface to allocate huge pages is there
    for flexibility to allocate hugetlb pages at any time.

    System has 1GB of RAM and we are requesting 515MB worth of hugetlb pages
    after the memory is prefilled by a clean page cache:

    root@test1:~# cat hugetlb_test.sh

    set -x
    echo 0 > /proc/sys/vm/nr_hugepages
    echo 3 > /proc/sys/vm/drop_caches
    echo 1 > /proc/sys/vm/compact_memory
    dd if=/mnt/data/file-1G of=/dev/null bs=$((4<<10))
    TS=$(date +%s)
    echo 256 > /proc/sys/vm/nr_hugepages
    cat /proc/sys/vm/nr_hugepages

    The results for 2 consecutive runs on clean 5.3

    root@test1:~# sh hugetlb_test.sh
    + echo 0
    + echo 3
    + echo 1
    + dd if=/mnt/data/file-1G of=/dev/null bs=4096
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB) copied, 21.0694 s, 51.0 MB/s
    + date +%s
    + TS=1569905284
    + echo 256
    + cat /proc/sys/vm/nr_hugepages
    256
    root@test1:~# sh hugetlb_test.sh
    + echo 0
    + echo 3
    + echo 1
    + dd if=/mnt/data/file-1G of=/dev/null bs=4096
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB) copied, 21.7548 s, 49.4 MB/s
    + date +%s
    + TS=1569905311
    + echo 256
    + cat /proc/sys/vm/nr_hugepages
    256

    Now with b39d0ee2632d applied

    root@test1:~# sh hugetlb_test.sh
    + echo 0
    + echo 3
    + echo 1
    + dd if=/mnt/data/file-1G of=/dev/null bs=4096
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB) copied, 20.1815 s, 53.2 MB/s
    + date +%s
    + TS=1569905516
    + echo 256
    + cat /proc/sys/vm/nr_hugepages
    11
    root@test1:~# sh hugetlb_test.sh
    + echo 0
    + echo 3
    + echo 1
    + dd if=/mnt/data/file-1G of=/dev/null bs=4096
    262144+0 records in
    262144+0 records out
    1073741824 bytes (1.1 GB) copied, 21.9485 s, 48.9 MB/s
    + date +%s
    + TS=1569905541
    + echo 256
    + cat /proc/sys/vm/nr_hugepages
    12

    The success rate went down by a factor of 20!

    Although hugetlb allocation requests might fail, and it is reasonable to
    expect them to under extremely fragmented memory or heavy memory pressure,
    the above situation is not that case.

    Fix the regression by reverting back to the previous behavior for
    __GFP_RETRY_MAYFAIL requests and disabling the bail-out heuristic for
    those requests.

    Mike said:

    : hugetlbfs allocations are commonly done via sysctl/sysfs shortly after
    : boot where this may not be as much of an issue. However, I am aware of at
    : least three use cases where allocations are made after the system has been
    : up and running for quite some time:
    :
    : - DB reconfiguration. If sysctl/sysfs fails to get required number of
    : huge pages, system is rebooted to perform allocation after boot.
    :
    : - VM provisioning. If unable get required number of huge pages, fall
    : back to base pages.
    :
    : - An application that does not preallocate pool, but rather allocates
    : pages at fault time for optimal NUMA locality.
    :
    : In all cases, I would expect b39d0ee2632d to cause regressions and
    : noticable behavior changes.
    :
    : My quick/limited testing in
    : https://lkml.kernel.org/r/3468b605-a3a9-6978-9699-57c52a90bd7e@oracle.com
    : was insufficient. It was also mentioned that if something like
    : b39d0ee2632d went forward, I would like exemptions for __GFP_RETRY_MAYFAIL
    : requests as in this patch.

    [mhocko@suse.com: reworded changelog]
    Link: http://lkml.kernel.org/r/20191007075548.12456-1-mhocko@kernel.org
    Fixes: b39d0ee2632d ("mm, page_alloc: avoid expensive reclaim when compaction may not succeed")
    Signed-off-by: David Rientjes
    Signed-off-by: Michal Hocko
    Reviewed-by: Mike Kravetz
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

08 Oct, 2019

1 commit

    On architectures like s390, arch_free_page() could mark the page unused
    (set_page_unused()) and any later access would trigger a kernel panic.
    Fix it by moving arch_free_page() after all calls that may still access
    the page's contents.

    Hardware name: IBM 2964 N96 400 (z/VM 6.4.0)
    Krnl PSW : 0404e00180000000 0000000026c2b96e (__free_pages_ok+0x34e/0x5d8)
    R:0 T:1 IO:0 EX:0 Key:0 M:1 W:0 P:0 AS:3 CC:2 PM:0 RI:0 EA:3
    Krnl GPRS: 0000000088d43af7 0000000000484000 000000000000007c 000000000000000f
    000003d080012100 000003d080013fc0 0000000000000000 0000000000100000
    00000000275cca48 0000000000000100 0000000000000008 000003d080010000
    00000000000001d0 000003d000000000 0000000026c2b78a 000000002717fdb0
    Krnl Code: 0000000026c2b95c: ec1100b30659 risbgn %r1,%r1,0,179,6
    0000000026c2b962: e32014000036 pfd 2,1024(%r1)
    #0000000026c2b968: d7ff10001000 xc 0(256,%r1),0(%r1)
    >0000000026c2b96e: 41101100 la %r1,256(%r1)
    0000000026c2b972: a737fff8 brctg %r3,26c2b962
    0000000026c2b976: d7ff10001000 xc 0(256,%r1),0(%r1)
    0000000026c2b97c: e31003400004 lg %r1,832
    0000000026c2b982: ebff1430016a asi 5168(%r1),-1
    Call Trace:
    __free_pages_ok+0x16a/0x5d8)
    memblock_free_all+0x206/0x290
    mem_init+0x58/0x120
    start_kernel+0x2b0/0x570
    startup_continue+0x6a/0xc0
    INFO: lockdep is turned off.
    Last Breaking-Event-Address:
    __free_pages_ok+0x372/0x5d8
    Kernel panic - not syncing: Fatal exception: panic_on_oops
    00: HCPGIR450W CP entered; disabled wait PSW 00020001 80000000 00000000 26A2379C

    In the past, only kernel_poison_pages() would trigger this but it needs
    "page_poison=on" kernel cmdline, and I suspect nobody tested that on
    s390. Recently, kernel_init_free_pages() (commit 6471384af2a6 ("mm:
    security: introduce init_on_alloc=1 and init_on_free=1 boot options"))
    was added and could trigger this as well.
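
    The fix reorders free_pages_prepare() so that arch_free_page() runs only
    after every step that may still touch the page's contents, roughly:

    static __always_inline bool free_pages_prepare(struct page *page,
                                    unsigned int order, bool check_free)
    {
            /* ... checks, KASAN, poisoning, init_on_free handling ... */

            /*
             * arch_free_page() can make the page's contents inaccessible.
             * s390 does this. So nothing which can access the page's
             * contents should happen after this point.
             */
            arch_free_page(page, order);

            /* ... kernel_map_pages() etc. ... */
            return true;
    }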

    [akpm@linux-foundation.org: add comment]
    Link: http://lkml.kernel.org/r/1569613623-16820-1-git-send-email-cai@lca.pw
    Fixes: 8823b1dbc05f ("mm/page_poison.c: enable PAGE_POISONING as a separate option")
    Fixes: 6471384af2a6 ("mm: security: introduce init_on_alloc=1 and init_on_free=1 boot options")
    Signed-off-by: Qian Cai
    Reviewed-by: Heiko Carstens
    Acked-by: Christian Borntraeger
    Acked-by: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Vasily Gorbik
    Cc: Alexander Duyck
    Cc: [5.3+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

29 Sep, 2019

2 commits

  • Merge hugepage allocation updates from David Rientjes:
    "We (mostly Linus, Andrea, and myself) have been discussing offlist how
    to implement a sane default allocation strategy for hugepages on NUMA
    platforms.

    With these reverts in place, the page allocator will happily allocate
    a remote hugepage immediately rather than try to make a local hugepage
    available. This incurs a substantial performance degradation when
    memory compaction would have otherwise made a local hugepage
    available.

    This series reverts those reverts and attempts to propose a more sane
    default allocation strategy specifically for hugepages. Andrea
    acknowledges this is likely to fix the swap storms that he originally
    reported that resulted in the patches that removed __GFP_THISNODE from
    hugepage allocations.

    The immediate goal is to return 5.3 to the behavior the kernel has
    implemented over the past several years so that remote hugepages are
    not immediately allocated when local hugepages could have been made
    available because the increased access latency is untenable.

    The next goal is to introduce a sane default allocation strategy for
    hugepages allocations in general regardless of the configuration of
    the system so that we prevent thrashing of local memory when
    compaction is unlikely to succeed and can prefer remote hugepages over
    remote native pages when the local node is low on memory."

    Note on timing: this reverts the hugepage VM behavior changes that got
    introduced fairly late in the 5.3 cycle, and that fixed a huge
    performance regression for certain loads that had been around since
    4.18.

    Andrea had this note:

    "The regression of 4.18 was that it was taking hours to start a VM
    where 3.10 was only taking a few seconds, I reported all the details
    on lkml when it was finally tracked down in August 2018.

    https://lore.kernel.org/linux-mm/20180820032640.9896-2-aarcange@redhat.com/

    __GFP_THISNODE in MADV_HUGEPAGE made the above enterprise vfio
    workload degrade like in the "current upstream" above. And it still
    would have been that bad as above until 5.3-rc5"

    where the bad behavior ends up happening as you fill up a local node,
    and without that change, you'd get into the nasty swap storm behavior
    due to compaction working overtime to make room for more memory on the
    nodes.

    As a result 5.3 got the two performance fix reverts in rc5.

    However, David Rientjes then noted that those performance fixes in turn
    regressed performance for other loads - although not quite to the same
    degree. He suggested reverting the reverts and instead replacing them
    with two small changes to how hugepage allocations are done (patch
    descriptions rephrased by me):

    - "avoid expensive reclaim when compaction may not succeed": just admit
    that the allocation failed when you're trying to allocate a huge-page
    and compaction wasn't successful.

    - "allow hugepage fallback to remote nodes when madvised": when that
    node-local huge-page allocation failed, retry without forcing the
    local node.

    but by then I judged it too late to replace the fixes for a 5.3 release.
    So 5.3 was released with behavior that harked back to the pre-4.18 logic.

    But now we're in the merge window for 5.4, and we can see if this
    alternate model fixes not just the horrendous swap storm behavior, but
    also restores the performance regression that the late reverts caused.

    Fingers crossed.

    * emailed patches from David Rientjes :
    mm, page_alloc: allow hugepage fallback to remote nodes when madvised
    mm, page_alloc: avoid expensive reclaim when compaction may not succeed
    Revert "Revert "Revert "mm, thp: consolidate THP gfp handling into alloc_hugepage_direct_gfpmask""
    Revert "Revert "mm, thp: restore node-local hugepage allocations""

    Linus Torvalds
     
  • Memory compaction has a couple significant drawbacks as the allocation
    order increases, specifically:

    - isolate_freepages() is responsible for finding free pages to use as
    migration targets and is implemented as a linear scan of memory
    starting at the end of a zone,

    - failing order-0 watermark checks in memory compaction does not account
    for how far below the watermarks the zone actually is: to enable
    migration, there must be *some* free memory available. Per the above,
    watermarks are not always sufficient if isolate_freepages() cannot
    find the free memory but it could require hundreds of MBs of reclaim to
    even reach this threshold (read: potentially very expensive reclaim with
    no indication compaction can be successful), and

    - if compaction at this order has failed recently so that it does not even
    run as a result of deferred compaction, looping through reclaim can often
    be pointless.

    For hugepage allocations, these are quite substantial drawbacks because
    these are very high order allocations (order-9 on x86) and falling back to
    doing reclaim can potentially be *very* expensive without any indication
    that compaction would even be successful.

    Reclaim itself is unlikely to free entire pageblocks and certainly no
    reliance should be put on it to do so in isolation (recall lumpy reclaim).
    This means we should avoid reclaim and simply fail hugepage allocation if
    compaction is deferred.

    It is also not helpful to thrash a zone by doing excessive reclaim if
    compaction may not be able to access that memory. If order-0 watermarks
    fail and the allocation order is sufficiently large, it is likely better
    to fail the allocation rather than thrashing the zone.
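
    In __alloc_pages_slowpath() the resulting bail-out looks roughly like this
    simplified sketch (the exact gating in the final commit also considers the
    gfp flags of the request):

    page = __alloc_pages_direct_compact(gfp_mask, order, alloc_flags, ac,
                                        INIT_COMPACT_PRIORITY,
                                        &compact_result);
    if (page)
            goto got_pg;

    /*
     * If compaction was skipped (not enough order-0 pages) or deferred
     * (it recently failed at this order), reclaim is unlikely to help;
     * fail the hugepage allocation instead of thrashing the zone.
     */
    if (order >= pageblock_order &&
        (compact_result == COMPACT_SKIPPED ||
         compact_result == COMPACT_DEFERRED))
            goto nopage;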

    Signed-off-by: David Rientjes
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Cc: Stefan Priebe - Profihost AG
    Cc: "Kirill A. Shutemov"
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

25 Sep, 2019

4 commits

  • A later patch makes THP deferred split shrinker memcg aware, but it needs
    page->mem_cgroup information in THP destructor, which is called after
    mem_cgroup_uncharge() now.

    So move mem_cgroup_uncharge() from __page_cache_release() to compound page
    destructor, which is called by both THP and other compound pages except
    HugeTLB. And call it in __put_single_page() for single order page.
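
    The move is small; the compound destructor and the single-page path each
    gain an uncharge, roughly:

    void free_compound_page(struct page *page)
    {
            mem_cgroup_uncharge(page);  /* moved out of __page_cache_release() */
            __free_pages_ok(page, compound_order(page));
    }

    static void __put_single_page(struct page *page)
    {
            __page_cache_release(page);
            mem_cgroup_uncharge(page);
            free_unref_page(page);
    }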

    Link: http://lkml.kernel.org/r/1565144277-36240-3-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: "Kirill A . Shutemov"
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Patch series "Make deferred split shrinker memcg aware", v6.

    Currently THP deferred split shrinker is not memcg aware, this may cause
    premature OOM with some configuration. For example the below test would
    run into premature OOM easily:

    $ cgcreate -g memory:thp
    $ echo 4G > /sys/fs/cgroup/memory/thp/memory/limit_in_bytes
    $ cgexec -g memory:thp transhuge-stress 4000

    transhuge-stress comes from kernel selftest.

    It is easy to hit OOM, but there are still a lot THP on the deferred split
    queue, memcg direct reclaim can't touch them since the deferred split
    shrinker is not memcg aware.

    Convert deferred split shrinker memcg aware by introducing per memcg
    deferred split queue. The THP should be on either per node or per memcg
    deferred split queue if it belongs to a memcg. When the page is
    immigrated to the other memcg, it will be immigrated to the target memcg's
    deferred split queue too.

    Reuse the second tail page's deferred_list for per memcg list since the
    same THP can't be on multiple deferred split queues.

    Make deferred split shrinker not depend on memcg kmem since it is not
    slab. It doesn't make sense to not shrink THP even though memcg kmem is
    disabled.

    With the above change the test demonstrated above doesn't trigger OOM even
    though with cgroup.memory=nokmem.

    This patch (of 4):

    Put split_queue, split_queue_lock and split_queue_len into a struct in
    order to reduce code duplication when we convert deferred_split to memcg
    aware in the later patches.
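
    The struct introduced here is essentially:

    struct deferred_split {
            spinlock_t split_queue_lock;
            struct list_head split_queue;
            unsigned long split_queue_len;
    };

    so that pglist_data (and, in the later patches, mem_cgroup) can embed one
    instance instead of carrying the three fields separately.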

    Link: http://lkml.kernel.org/r/1565144277-36240-2-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Suggested-by: "Kirill A . Shutemov"
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Kirill Tkhai
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Cc: Shakeel Butt
    Cc: David Rientjes
    Cc: Qian Cai
    Cc: Vladimir Davydov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • Mike Kravetz reports that "hugetlb allocations could stall for minutes or
    hours when should_compact_retry() would return true more often than it
    should. Specifically, this was in the case where compact_result was
    COMPACT_DEFERRED and COMPACT_PARTIAL_SKIPPED and no progress was being
    made."

    The problem is that the compaction_withdrawn() test in
    should_compact_retry() includes compaction outcomes that are only possible
    on low compaction priority, and results in a retry without increasing the
    priority. This may result in further reclaim, and more incomplete
    compaction attempts.

    With this patch, compaction priority is raised when possible, or
    should_compact_retry() returns false.

    The COMPACT_SKIPPED result doesn't really fit together with the other
    outcomes in compaction_withdrawn(), as that's a result caused by
    insufficient order-0 pages, not due to low compaction priority. With this
    patch, it is moved to a new compaction_needs_reclaim() function, and for
    that outcome we keep the current logic of retrying if it looks like
    reclaim will be able to help.
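
    The new helper is roughly:

    static inline bool compaction_needs_reclaim(enum compact_result result)
    {
            /*
             * Compaction backed off due to the order-0 watermark check,
             * so regular reclaim has to try harder and free something first.
             */
            return result == COMPACT_SKIPPED;
    }

    while compaction_withdrawn() keeps only the outcomes where raising the
    compaction priority, rather than more reclaim, is the sensible response.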

    Link: http://lkml.kernel.org/r/20190806014744.15446-4-mike.kravetz@oracle.com
    Reported-by: Mike Kravetz
    Signed-off-by: Vlastimil Babka
    Signed-off-by: Mike Kravetz
    Tested-by: Mike Kravetz
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Replace 1 << compound_order(page) with compound_nr(page). Minor
    improvements in readability.
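
    At the time of this change the helper is simply:

    static inline unsigned long compound_nr(struct page *page)
    {
            return 1UL << compound_order(page);
    }

    so the conversion does not change behaviour.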

    Link: http://lkml.kernel.org/r/20190721104612.19120-4-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Reviewed-by: Andrew Morton
    Reviewed-by: Ira Weiny
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

22 Sep, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

17 Sep, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - MAINTAINERS: Add Mark Rutland as perf submaintainer, Juri Lelli and
    Vincent Guittot as scheduler submaintainers. Add Dietmar Eggemann,
    Steven Rostedt, Ben Segall and Mel Gorman as scheduler reviewers.

    As perf and the scheduler is getting bigger and more complex,
    document the status quo of current responsibilities and interests,
    and spread the review pain^H^H^H^H fun via an increase in the Cc:
    linecount generated by scripts/get_maintainer.pl. :-)

    - Add another series of patches that brings the -rt (PREEMPT_RT) tree
    closer to mainline: split the monolithic CONFIG_PREEMPT dependencies
    into a new CONFIG_PREEMPTION category that will allow the eventual
    introduction of CONFIG_PREEMPT_RT. Still a few more hundred patches
    to go though.

    - Extend the CPU cgroup controller with uclamp.min and uclamp.max to
    allow the finer shaping of CPU bandwidth usage.

    - Micro-optimize energy-aware wake-ups from O(CPUS^2) to O(CPUS).

    - Improve the behavior of high CPU count, high thread count
    applications running under cpu.cfs_quota_us constraints.

    - Improve balancing with SCHED_IDLE (SCHED_BATCH) tasks present.

    - Improve CPU isolation housekeeping CPU allocation NUMA locality.

    - Fix deadline scheduler bandwidth calculations and logic when cpusets
    rebuilds the topology, or when it gets deadline-throttled while it's
    being offlined.

    - Convert the cpuset_mutex to percpu_rwsem, to allow it to be used from
    setscheduler() system calls without creating global serialization.
    Add new synchronization between cpuset topology-changing events and
    the deadline acceptance tests in setscheduler(), which were broken
    before.

    - Rework the active_mm state machine to be less confusing and more
    optimal.

    - Rework (simplify) the pick_next_task() slowpath.

    - Improve load-balancing on AMD EPYC systems.

    - ... and misc cleanups, smaller fixes and improvements - please see
    the Git log for more details.

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (53 commits)
    sched/psi: Correct overly pessimistic size calculation
    sched/fair: Speed-up energy-aware wake-ups
    sched/uclamp: Always use 'enum uclamp_id' for clamp_id values
    sched/uclamp: Update CPU's refcount on TG's clamp changes
    sched/uclamp: Use TG's clamps to restrict TASK's clamps
    sched/uclamp: Propagate system defaults to the root group
    sched/uclamp: Propagate parent clamps
    sched/uclamp: Extend CPU's cgroup controller
    sched/topology: Improve load balancing on AMD EPYC systems
    arch, ia64: Make NUMA select SMP
    sched, perf: MAINTAINERS update, add submaintainers and reviewers
    sched/fair: Use rq_lock/unlock in online_fair_sched_group
    cpufreq: schedutil: fix equation in comment
    sched: Rework pick_next_task() slow-path
    sched: Allow put_prev_task() to drop rq->lock
    sched/fair: Expose newidle_balance()
    sched: Add task_struct pointer to sched_class::set_curr_task
    sched: Rework CPU hotplug task selection
    sched/{rt,deadline}: Fix set_next_task vs pick_next_task
    sched: Fix kerneldoc comment for ia64_set_curr_task
    ...

    Linus Torvalds
     

03 Sep, 2019

1 commit

  • SD_BALANCE_{FORK,EXEC} and SD_WAKE_AFFINE are stripped in sd_init()
    for any sched domains with a NUMA distance greater than 2 hops
    (RECLAIM_DISTANCE). The idea being that it's expensive to balance
    across domains that far apart.

    However, as is rather unfortunately explained in:

    commit 32e45ff43eaf ("mm: increase RECLAIM_DISTANCE to 30")

    the value for RECLAIM_DISTANCE is based on node distance tables from
    2011-era hardware.

    Current AMD EPYC machines have the following NUMA node distances:

    node distances:
    node    0    1    2    3    4    5    6    7
      0:   10   16   16   16   32   32   32   32
      1:   16   10   16   16   32   32   32   32
      2:   16   16   10   16   32   32   32   32
      3:   16   16   16   10   32   32   32   32
      4:   32   32   32   32   10   16   16   16
      5:   32   32   32   32   16   10   16   16
      6:   32   32   32   32   16   16   10   16
      7:   32   32   32   32   16   16   16   10

    where 2 hops is 32.

    The result is that the scheduler fails to load balance properly across
    NUMA nodes on different sockets -- 2 hops apart.

    For example, pinning 16 busy threads to NUMA nodes 0 (CPUs 0-7) and 4
    (CPUs 32-39) like so,

    $ numactl -C 0-7,32-39 ./spinner 16

    causes all threads to fork and remain on node 0 until the active
    balancer kicks in after a few seconds and forcibly moves some threads
    to node 4.

    Override node_reclaim_distance for AMD Zen.
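
    The override is a single assignment in the x86 Zen init path, roughly
    (assuming the hook is in arch/x86/kernel/cpu/amd.c as upstream):

    static void init_amd_zn(struct cpuinfo_x86 *c)
    {
            set_cpu_cap(c, X86_FEATURE_ZEN);

    #ifdef CONFIG_NUMA
            node_reclaim_distance = 32;
    #endif
            /* ... */
    }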

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mel Gorman
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Suravee.Suthikulpanit@amd.com
    Cc: Thomas Gleixner
    Cc: Thomas.Lendacky@amd.com
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20190808195301.13222-3-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     

25 Aug, 2019

1 commit

  • After commit 907ec5fca3dc ("mm: zero remaining unavailable struct
    pages"), struct page of reserved memory is zeroed. This causes
    page->flags to be 0 and fixes issues related to reading
    /proc/kpageflags, for example, of reserved memory.

    The VM_BUG_ON() in move_freepages_block(), however, assumes that
    page_zone() is meaningful even for reserved memory. That assumption is
    no longer true after the aforementioned commit.

    There's no reason why move_freepages_block() should be testing the
    legitimacy of page_zone() for reserved memory; its scope is limited only
    to pages on the zone's freelist.

    Note that pfn_valid() can be true for reserved memory: there is a
    backing struct page. The check for page_to_nid(page) is also buggy but
    reserved memory normally only appears on node 0 so the zeroing doesn't
    affect this.

    Move the debug checks to after verifying PageBuddy is true. This
    isolates the scope of the checks to only be for buddy pages which are on
    the zone's freelist which move_freepages_block() is operating on. In
    this case, an incorrect node or zone is a bug worthy of being warned
    about (and the examination of struct page is acceptable because this
    memory is not reserved).

    Why does move_freepages_block() get called on reserved memory? It's
    simply math after finding a valid free page from the per-zone free area
    to use as fallback. We find the beginning and end of the pageblock of
    the valid page and that can bring us into memory that was reserved per
    the e820. pfn_valid() is still true (it's backed by a struct page), but
    since it's zero'd we shouldn't make any inferences here about comparing
    its node or zone. The current node check just happens to succeed most
    of the time by luck because reserved memory typically appears on node 0.

    The fix here is to validate that we actually have buddy pages before
    testing if there's any type of zone or node strangeness going on.
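
    The reordering can be sketched with a small self-contained program (the
    struct and helpers below are simplified stand-ins for struct page,
    PageBuddy(), page_zone() and page_to_nid(), not the kernel code): only
    pages that are actually on the buddy freelist get their zone and node
    debug-checked, so zeroed reserved pages are skipped before any inference
    is drawn from them.

    #include <assert.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Simplified stand-ins; the real code works on struct page with
     * PageBuddy(), page_zone() and page_to_nid(). */
    struct fake_page {
            bool buddy;     /* actually on the zone's freelist? */
            int  zone;      /* 0 when the struct page was zeroed (reserved memory) */
            int  nid;
    };

    /* Model of the reordered debug checks: skip non-buddy pages first, and
     * only then sanity-check zone and node. */
    static int move_range(struct fake_page *pages, int n, int zone, int nid)
    {
            int i, moved = 0;

            for (i = 0; i < n; i++) {
                    if (!pages[i].buddy)
                            continue;       /* zeroed reserved pages never reach the checks */
                    assert(pages[i].zone == zone);
                    assert(pages[i].nid == nid);
                    moved++;
            }
            return moved;
    }

    int main(void)
    {
            struct fake_page pb[4] = {
                    { true,  1, 0 },        /* genuine buddy page in zone 1, node 0 */
                    { false, 0, 0 },        /* zeroed reserved page: skipped, no false positive */
                    { true,  1, 0 },
                    { false, 0, 0 },
            };

            printf("moved %d buddy pages\n", move_range(pb, 4, 1, 0));
            return 0;
    }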

    We noticed it almost immediately after bringing 907ec5fca3dc in on
    CONFIG_DEBUG_VM builds. It depends on finding specific free pages in
    the per-zone free area where the math in move_freepages() will bring the
    start or end pfn into reserved memory and wanting to claim that entire
    pageblock as a new migratetype. So the path will be rare, require
    CONFIG_DEBUG_VM, and require fallback to a different migratetype.

    Some struct pages of reserved memory were already zeroed before
    907ec5fca3dc, so it theoretically could trigger before this commit. I
    think it's rare enough, and behind a config option that most people
    don't run, that others may not have noticed. I wouldn't argue against a
    stable tag, and the backport should be easy enough, but I probably
    wouldn't single out a commit that this is fixing.

    Mel said:

    : The overhead of the debugging check is higher with this patch although
    : it'll only affect debug builds and the path is not particularly hot.
    : If this was a concern, I think it would be reasonable to simply remove
    : the debugging check as the zone boundaries are checked in
    : move_freepages_block and we never expect a zone/node to be smaller than
    : a pageblock and stuck in the middle of another zone.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.1908122036560.10779@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Mel Gorman
    Cc: Naoya Horiguchi
    Cc: Masayoshi Mizuma
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

20 Aug, 2019

1 commit

  • The dev field in struct dev_pagemap is only used to print dev_name in two
    places, which are at best nice to have. Just remove the field and thus
    the name in those two messages.

    Link: https://lore.kernel.org/r/20190818090557.17853-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ira Weiny
    Reviewed-by: Dan Williams
    Tested-by: Bharata B Rao
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

19 Jul, 2019

4 commits

  • The libnvdimm sub-system has suffered a series of hacks and broken
    workarounds for the memory-hotplug implementation's awkward
    section-aligned (128MB) granularity.

    For example the following backtrace is emitted when attempting
    arch_add_memory() with physical address ranges that intersect 'System
    RAM' (RAM) with 'Persistent Memory' (PMEM) within a given section:

    # cat /proc/iomem | grep -A1 -B1 Persistent\ Memory
    100000000-1ffffffff : System RAM
    200000000-303ffffff : Persistent Memory (legacy)
    304000000-43fffffff : System RAM
    440000000-23ffffffff : Persistent Memory
    2400000000-43bfffffff : Persistent Memory
    2400000000-43bfffffff : namespace2.0

    WARNING: CPU: 38 PID: 928 at arch/x86/mm/init_64.c:850 add_pages+0x5c/0x60
    [..]
    RIP: 0010:add_pages+0x5c/0x60
    [..]
    Call Trace:
    devm_memremap_pages+0x460/0x6e0
    pmem_attach_disk+0x29e/0x680 [nd_pmem]
    ? nd_dax_probe+0xfc/0x120 [libnvdimm]
    nvdimm_bus_probe+0x66/0x160 [libnvdimm]

    It was discovered that the problem goes beyond RAM vs PMEM collisions, as
    some platforms produce PMEM vs PMEM collisions within a given section.
    The libnvdimm workaround for that case revealed that the libnvdimm
    section-alignment-padding implementation has been broken for a long
    while.

    A fix for that long-standing breakage introduces as many problems as it
    solves as it would require a backward-incompatible change to the
    namespace metadata interpretation. Instead of that dubious route [1],
    address the root problem in the memory-hotplug implementation.

    Note that EEXIST is no longer treated as success, as that is how
    sparse_add_section() reports subsection collisions; it was also obviated
    by recent changes that perform the request_region() for 'System RAM'
    before arch_add_memory() in the add_memory() sequence.

    [1] https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com

    [osalvador@suse.de: fix deactivate_section for early sections]
    Link: http://lkml.kernel.org/r/20190715081549.32577-2-osalvador@suse.de
    Link: http://lkml.kernel.org/r/156092354368.979959.6232443923440952359.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Oscar Salvador
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Given there are no more usages of is_dev_zone() outside of 'ifdef
    CONFIG_ZONE_DEVICE' protection, kill off the compilation helper.

    Link: http://lkml.kernel.org/r/156092353211.979959.1489004866360828964.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Wei Yang
    Acked-by: David Hildenbrand
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Logan Gunthorpe
    Cc: Jane Chu
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Prepare for hot{plug,remove} of sub-ranges of a section by tracking a
    sub-section active bitmask, each bit representing a PMD_SIZE span within
    the architecture's memory hotplug section.

    The implication of a partially populated section is that pfn_valid()
    needs to go beyond a valid_section() check and either determine that the
    section is an "early section", or read the sub-section active ranges
    from the bitmask. The expectation is that the bitmask (subsection_map)
    fits in the same cacheline as the valid_section() / early_section()
    data, so the incremental performance overhead to pfn_valid() should be
    negligible.

    The rationale for using early_section() to short-circuit the
    subsection_map check is that there are legacy code paths that use
    pfn_valid() at section granularity before validating the pfn against
    pgdat data. So, the early_section() check allows those traditional
    assumptions to persist while also permitting subsection_map to tell the
    truth for purposes of populating the unused portions of early sections
    with PMEM and other ZONE_DEVICE mappings.
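
    To make the granularity concrete, here is a small sketch of the
    pfn-to-subsection arithmetic, assuming x86-64 values (4K pages, 128MB
    sections, 2MB/PMD_SIZE subsections); the real kernel derives the
    equivalents from SECTION_SIZE_BITS and PMD_SHIFT, and the constant names
    below exist only for the example.

    #include <stdint.h>
    #include <stdio.h>

    /* Assumed x86-64 geometry: 4K pages, 128MB sections, 2MB subsections. */
    #define PAGE_SHIFT              12
    #define SECTION_SIZE_BITS       27      /* 128MB */
    #define SUBSECTION_SHIFT        21      /* PMD_SIZE = 2MB */
    #define PAGES_PER_SECTION       (1UL << (SECTION_SIZE_BITS - PAGE_SHIFT))
    #define PAGES_PER_SUBSECTION    (1UL << (SUBSECTION_SHIFT - PAGE_SHIFT))
    #define SUBSECTIONS_PER_SECTION (1UL << (SECTION_SIZE_BITS - SUBSECTION_SHIFT))

    int main(void)
    {
            /* e.g. a PMEM range starting 6MB into the section at 4GB */
            uint64_t pfn = (4ULL << 30 >> PAGE_SHIFT) + 3 * PAGES_PER_SUBSECTION;
            uint64_t section    = pfn >> (SECTION_SIZE_BITS - PAGE_SHIFT);
            uint64_t subsection = (pfn & (PAGES_PER_SECTION - 1)) / PAGES_PER_SUBSECTION;

            printf("pfn %#llx -> section %llu, subsection bit %llu of %lu\n",
                   (unsigned long long)pfn, (unsigned long long)section,
                   (unsigned long long)subsection, SUBSECTIONS_PER_SECTION);
            return 0;
    }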

    Link: http://lkml.kernel.org/r/156092350874.979959.18185938451405518285.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Qian Cai
    Tested-by: Jane Chu
    Tested-by: Aneesh Kumar K.V [ppc64]
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jeff Moyer
    Cc: Jérôme Glisse
    Cc: Jonathan Corbet
    Cc: Mike Rapoport
    Cc: Toshi Kani
    Cc: Wei Yang
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Sub-section memory hotplug support", v10.

    The memory hotplug section is an arbitrary / convenient unit for memory
    hotplug. 'Section-size' units have bled into the user interface
    ('memblock' sysfs) and cannot be changed without breaking existing
    userspace. The section-size constraint, while mostly benign for typical
    memory hotplug, has wreaked and continues to wreak havoc with 'device-memory'
    use cases, persistent memory (pmem) in particular. Recall that pmem
    uses devm_memremap_pages(), and subsequently arch_add_memory(), to
    allocate a 'struct page' memmap for pmem. However, it does not use the
    'bottom half' of memory hotplug, i.e. never marks pmem pages online and
    never exposes the userspace memblock interface for pmem. This leaves an
    opening to redress the section-size constraint.

    To date, the libnvdimm subsystem has attempted to inject padding to
    satisfy the internal constraints of arch_add_memory(). Beyond
    complicating the code, leading to bugs [2], wasting memory, and limiting
    configuration flexibility, the padding hack is broken when the platform
    changes the physical memory alignment of pmem from one boot to the
    next. Device failure (intermittent or permanent) and physical
    reconfiguration are events that can cause the platform firmware to
    change the physical placement of pmem on a subsequent boot, and device
    failure is an everyday event in a data-center.

    It turns out that sections are only a hard requirement of the
    user-facing interface for memory hotplug, and with a bit more
    infrastructure, sub-section arch_add_memory() support can be added for
    kernel internal usages like devm_memremap_pages(). Here is an analysis
    of the design assumptions in the current code and how they are
    addressed in the new implementation:

    Current design assumptions:

    - Sections that describe boot memory (early sections) are never
    unplugged / removed.

    - pfn_valid(), in the CONFIG_SPARSEMEM_VMEMMAP=y case, devolves to a
    valid_section() check

    - __add_pages() and helper routines assume all operations occur in
    PAGES_PER_SECTION units.

    - The memblock sysfs interface only comprehends full sections

    New design assumptions:

    - Sections are instrumented with a sub-section bitmask to track (on
    x86) individual 2MB sub-divisions of a 128MB section.

    - Partially populated early sections can be extended with additional
    sub-sections, and those sub-sections can be removed with
    arch_remove_memory(). With this in place we no longer lose usable
    memory capacity to padding.

    - pfn_valid() is updated to look deeper than valid_section() to also
    check the active-sub-section mask. This indication is in the same
    cacheline as the valid_section() so the performance impact is
    expected to be negligible. So far the lkp robot has not reported any
    regressions.

    - Outside of the core vmemmap population routines which are replaced,
    other helper routines like shrink_{zone,pgdat}_span() are updated to
    handle the smaller granularity. Core memory hotplug routines that
    deal with online memory are not touched.

    - The existing memblock sysfs user api guarantees / assumptions are not
    touched since this capability is limited to !online
    !memblock-sysfs-accessible sections.

    Meanwhile the issue reports continue to roll in from users that do not
    understand when and how the 128MB constraint will bite them. The current
    implementation relied on being able to support at least one misaligned
    namespace, but that immediately falls over on any moderately complex
    namespace creation attempt. Beyond the initial problem of 'System RAM'
    colliding with pmem, and the unsolvable problem of physical alignment
    changes, Linux is now being exposed to platforms that collide pmem ranges
    with other pmem ranges by default [3]. In short, devm_memremap_pages()
    has pushed the venerable section-size constraint past the breaking point,
    and the simplicity of section-aligned arch_add_memory() is no longer
    tenable.

    These patches are exposed to the kbuild robot on a subsection-v10 branch
    [4], and a preview of the unit test for this functionality is available
    on the 'subsection-pending' branch of ndctl [5].

    [2]: https://lore.kernel.org/r/155000671719.348031.2347363160141119237.stgit@dwillia2-desk3.amr.corp.intel.com
    [3]: https://github.com/pmem/ndctl/issues/76
    [4]: https://git.kernel.org/pub/scm/linux/kernel/git/djbw/nvdimm.git/log/?h=subsection-v10
    [5]: https://github.com/pmem/ndctl/commit/7c59b4867e1c

    This patch (of 13):

    Towards enabling memory hotplug to track partial population of a section,
    introduce 'struct mem_section_usage'.

    A pointer to a 'struct mem_section_usage' instance replaces the existing
    pointer to a 'pageblock_flags' bitmap. Effectively it adds one more
    'unsigned long' beyond the 'pageblock_flags' (usemap) allocation to house
    a new 'subsection_map' bitmap. The new bitmap enables the memory
    hot{plug,remove} implementation to act on incremental sub-divisions of a
    section.
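
    A simplified sketch of the new container (field names follow the
    description above; the 64-subsections-per-section geometry is an
    assumption for x86-64, and the flexible pageblock_flags array is shown
    only schematically):

    #include <limits.h>
    #include <stdio.h>

    #define SUBSECTIONS_PER_SECTION 64      /* assumed: 128MB section / 2MB subsections */

    /* Sketch of the container described above: the section's usemap pointer
     * now refers to this structure rather than to a bare pageblock_flags bitmap. */
    struct mem_section_usage {
            unsigned long subsection_map[SUBSECTIONS_PER_SECTION /
                                         (sizeof(unsigned long) * CHAR_BIT)];
            unsigned long pageblock_flags[];        /* the usemap, as before */
    };

    int main(void)
    {
            /* On 64-bit this is exactly one extra unsigned long ahead of the usemap. */
            printf("subsection_map overhead: %zu bytes\n",
                   sizeof(struct mem_section_usage));
            return 0;
    }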

    SUBSECTION_SHIFT is defined as global constant instead of per-architecture
    value like SECTION_SIZE_BITS in order to allow cross-arch compatibility of
    subsection users. Specifically a common subsection size allows for the
    possibility that persistent memory namespace configurations be made
    compatible across architectures.

    The primary motivation for this functionality is to support platforms that
    mix "System RAM" and "Persistent Memory" within a single section, or
    multiple PMEM ranges with different mapping lifetimes within a single
    section. The section restriction for hotplug has caused an ongoing saga
    of hacks and bugs for devm_memremap_pages() users.

    Beyond the fixups to teach existing paths how to retrieve the 'usemap'
    from a section, and updates to usemap allocation path, there are no
    expected behavior changes.

    Link: http://lkml.kernel.org/r/156092349845.979959.73333291612799019.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Tested-by: Aneesh Kumar K.V [ppc64]
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Logan Gunthorpe
    Cc: Pavel Tatashin
    Cc: David Hildenbrand
    Cc: Jérôme Glisse
    Cc: Mike Rapoport
    Cc: Jane Chu
    Cc: Pavel Tatashin
    Cc: Jonathan Corbet
    Cc: Qian Cai
    Cc: Logan Gunthorpe
    Cc: Toshi Kani
    Cc: Jeff Moyer
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Jason Gunthorpe
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

17 Jul, 2019

1 commit

  • Patch series "mm/vmscan: calculate reclaimed slab in all reclaim paths".

    This patchset fixes issues with how reclaimed slab caches are accounted.

    There are currently six different reclaim paths:
    - kswapd reclaim path
    - node reclaim path
    - hibernate preallocate memory reclaim path
    - direct reclaim path
    - memcg reclaim path
    - memcg softlimit reclaim path

    The slab caches reclaimed in these paths are only calculated in the
    first three of these paths. The issues are explained in detail in patch
    #2. We should calculate the reclaimed slab caches in every reclaim path.
    In order to do that, the struct reclaim_state is placed into the struct
    shrink_control.

    In the node reclaim path, there is another issue with shrinking slab, which
    is addressed in "mm/vmscan: shrink slab in node reclaim"
    (https://lore.kernel.org/linux-mm/1559874946-22960-1-git-send-email-laoar.shao@gmail.com/).

    This patch (of 2):

    The struct reclaim_state is used to record how many slab caches are
    reclaimed in one reclaim path. The struct shrink_control is used to
    control one reclaim path. So we'd better put reclaim_state into
    shrink_control.
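
    The idea can be sketched with simplified stand-in structures (the field
    and function names below are illustrative, not the kernel's exact
    definitions): the accounting lives next to the control structure that
    every reclaim path already passes around, so every path sees the same
    reclaimed-slab count.

    #include <stdio.h>

    /* Simplified stand-ins for the kernel structures discussed above; the
     * field set is illustrative, not the kernel's exact layout. */
    struct reclaim_state {
            unsigned long reclaimed_slab;   /* pages reclaimed from slab caches */
    };

    struct shrink_control {
            int nid;
            struct reclaim_state *reclaim_state;    /* carried with the control struct */
    };

    /* A shrinker accounts what it freed through the control struct, so every
     * reclaim path (kswapd, direct, memcg, ...) shares the same bookkeeping. */
    static void shrink_slab_model(struct shrink_control *sc, unsigned long freed)
    {
            if (sc->reclaim_state)
                    sc->reclaim_state->reclaimed_slab += freed;
    }

    int main(void)
    {
            struct reclaim_state rs = { 0 };
            struct shrink_control sc = { .nid = 0, .reclaim_state = &rs };

            shrink_slab_model(&sc, 32);
            printf("reclaimed_slab = %lu pages\n", rs.reclaimed_slab);
            return 0;
    }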

    [laoar.shao@gmail.com: remove reclaim_state assignment from __perform_reclaim()]
    Link: http://lkml.kernel.org/r/1561381582-13697-1-git-send-email-laoar.shao@gmail.com
    Link: http://lkml.kernel.org/r/1561112086-6169-2-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Reviewed-by: Andrew Morton
    Reviewed-by: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are hold overs from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

7 commits

  • Patch series "add init_on_alloc/init_on_free boot options", v10.

    Provide init_on_alloc and init_on_free boot options.

    These are aimed at preventing possible information leaks and making the
    control-flow bugs that depend on uninitialized values more deterministic.

    Enabling either of the options guarantees that the memory returned by the
    page allocator and SL[AU]B is initialized with zeroes. The SLOB allocator
    isn't supported at the moment, as its emulation of kmem caches complicates
    correct handling of SLAB_TYPESAFE_BY_RCU caches.

    Enabling init_on_free also guarantees that pages and heap objects are
    initialized right after they're freed, so it won't be possible to access
    stale data by using a dangling pointer.

    As suggested by Michal Hocko, right now we don't let heap users
    disable initialization for certain allocations. There's not enough
    evidence that doing so can speed up real-life cases, and introducing ways
    to opt-out may result in things going out of control.

    This patch (of 2):

    The new options are needed to prevent possible information leaks and make
    control-flow bugs that depend on uninitialized values more deterministic.

    This is expected to be on by default on Android and Chrome OS, and it
    gives anyone else the opportunity to use it on other distros via the
    boot args. (The init_on_free feature is regularly requested by folks
    whose threat models include memory forensics.)

    init_on_alloc=1 makes the kernel initialize newly allocated pages and heap
    objects with zeroes. Initialization is done at allocation time at the
    places where checks for __GFP_ZERO are performed.

    init_on_free=1 makes the kernel initialize freed pages and heap objects
    with zeroes upon their deletion. This helps to ensure sensitive data
    doesn't leak via use-after-free accesses.

    Both init_on_alloc=1 and init_on_free=1 guarantee that the allocator
    returns zeroed memory. The two exceptions are slab caches with
    constructors and SLAB_TYPESAFE_BY_RCU flag. Those are never
    zero-initialized to preserve their semantics.

    Both init_on_alloc and init_on_free default to zero, but those defaults
    can be overridden with CONFIG_INIT_ON_ALLOC_DEFAULT_ON and
    CONFIG_INIT_ON_FREE_DEFAULT_ON.

    If either SLUB poisoning or page poisoning is enabled, those options take
    precedence over init_on_alloc and init_on_free: initialization is only
    applied to unpoisoned allocations.
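
    A rough user-space model of the decision logic described above (plain
    booleans stand in for the kernel's static keys and Kconfig defaults, and
    the helper names are illustrative rather than the kernel's):

    #include <stdbool.h>
    #include <stdio.h>

    #define GFP_ZERO 0x1u   /* stand-in for __GFP_ZERO */

    /* Boot-time switches; in the kernel these are static keys seeded from
     * CONFIG_INIT_ON_{ALLOC,FREE}_DEFAULT_ON and the kernel command line. */
    static bool init_on_alloc = true;
    static bool init_on_free;
    static bool poisoning_enabled;  /* SLUB/page poisoning takes precedence */

    static bool want_zero_on_alloc(unsigned int gfp_flags)
    {
            if (poisoning_enabled)
                    return false;   /* poisoned allocations are not also zeroed */
            return init_on_alloc || (gfp_flags & GFP_ZERO);
    }

    static bool want_zero_on_free(void)
    {
            return init_on_free && !poisoning_enabled;
    }

    int main(void)
    {
            printf("zero on alloc (without __GFP_ZERO): %d\n", want_zero_on_alloc(0));
            printf("zero on free:                       %d\n", want_zero_on_free());
            return 0;
    }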

    Slowdown for the new features compared to init_on_free=0, init_on_alloc=0:

    hackbench, init_on_free=1: +7.62% sys time (st.err 0.74%)
    hackbench, init_on_alloc=1: +7.75% sys time (st.err 2.14%)

    Linux build with -j12, init_on_free=1: +8.38% wall time (st.err 0.39%)
    Linux build with -j12, init_on_free=1: +24.42% sys time (st.err 0.52%)
    Linux build with -j12, init_on_alloc=1: -0.13% wall time (st.err 0.42%)
    Linux build with -j12, init_on_alloc=1: +0.57% sys time (st.err 0.40%)

    The slowdown for init_on_free=0, init_on_alloc=0 compared to the baseline
    is within the standard error.

    The new features are also going to pave the way for hardware memory
    tagging (e.g. arm64's MTE), which will require both on_alloc and on_free
    hooks to set the tags for heap objects. With MTE, tagging will have the
    same cost as memory initialization.

    Although init_on_free is rather costly, there are paranoid use-cases where
    in-memory data lifetime is desired to be minimized. There are various
    arguments for/against the realism of the associated threat models, but
    given that we'll need the infrastructure for MTE anyway, and there are
    people who want wipe-on-free behavior no matter what the performance cost,
    it seems reasonable to include it in this series.

    [glider@google.com: v8]
    Link: http://lkml.kernel.org/r/20190626121943.131390-2-glider@google.com
    [glider@google.com: v9]
    Link: http://lkml.kernel.org/r/20190627130316.254309-2-glider@google.com
    [glider@google.com: v10]
    Link: http://lkml.kernel.org/r/20190628093131.199499-2-glider@google.com
    Link: http://lkml.kernel.org/r/20190617151050.92663-2-glider@google.com
    Signed-off-by: Alexander Potapenko
    Acked-by: Kees Cook
    Acked-by: Michal Hocko [page and dmapool parts]
    Acked-by: James Morris
    Cc: Christoph Lameter
    Cc: Masahiro Yamada
    Cc: "Serge E. Hallyn"
    Cc: Nick Desaulniers
    Cc: Kostya Serebryany
    Cc: Dmitry Vyukov
    Cc: Sandeep Patil
    Cc: Laura Abbott
    Cc: Randy Dunlap
    Cc: Jann Horn
    Cc: Mark Rutland
    Cc: Marco Elver
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Potapenko
     
  • CONFIG_NUMA on 64-bit CPUs currently enables hashdist unconditionally even
    when booting on single-node machines. This causes the large system hashes
    to be allocated with vmalloc, and mapped with small pages.

    This change clears hashdist if only one node has come up with memory.
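
    A minimal sketch of the decision being changed (the helper name and its
    argument are hypothetical; the real logic feeds the hashdist setting
    consumed by alloc_large_system_hash()):

    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical model: hashdist decides whether large system hashes are
     * vmalloc'ed (NUMA-interleaved, small pages) or come from the linear map. */
    static bool hashdist = true;            /* default with CONFIG_NUMA on 64-bit */

    static void fixup_hashdist(int nodes_with_memory)
    {
            /* The change described above: a single-node boot gets linear-map
             * (large page) hashes even though CONFIG_NUMA is enabled. */
            if (nodes_with_memory <= 1)
                    hashdist = false;
    }

    int main(void)
    {
            fixup_hashdist(1);
            printf("hashdist after a single-node boot: %d (0 => linear map)\n", hashdist);
            return 0;
    }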

    This results in the important large inode and dentry hashes using memblock
    allocations. All others are within 4MB in size up to about 128GB of RAM,
    which allows them to be allocated from the linear map on most non-NUMA
    images.

    Other big hashes like futex and TCP should eventually be moved over to the
    same style of allocation as those vfs caches that use HASH_EARLY if
    !hashdist, so they don't exceed MAX_ORDER on very large non-NUMA images.

    This brings dTLB misses for linux kernel tree `git diff` from ~45,000 to
    ~8,000 on a Kaby Lake KVM guest with 8MB dentry hash and mitigations=off
    (performance is in the noise, under 1% difference, page tables are likely
    to be well cached for this workload).

    Link: http://lkml.kernel.org/r/20190605144814.29319-2-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • The kernel currently clamps large system hashes to MAX_ORDER when hashdist
    is not set, which is rather arbitrary.

    vmalloc space is limited on 32-bit machines, but this shouldn't result in
    much more of it being used, because small physical memory already limits
    the system hash sizes.

    Include "vmalloc" or "linear" in the kernel log message.

    Link: http://lkml.kernel.org/r/20190605144814.29319-1-npiggin@gmail.com
    Signed-off-by: Nicholas Piggin
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     
  • When debug_pagealloc is enabled, we currently allocate the page_ext
    array to mark guard pages with the PAGE_EXT_DEBUG_GUARD flag. Now that
    we have the page_type field in struct page, we can use that instead, as
    guard pages are neither PageSlab nor mapped to userspace. This reduces
    memory overhead when debug_pagealloc is enabled and there are no other
    features requiring the page_ext array.
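
    A simplified model of the switch (the names and bit layout here are only
    illustrative; the kernel's page_type encoding differs in detail): the
    guard marker lives in a field that struct page already has, so no
    separately allocated page_ext array is needed when debug_pagealloc is
    the only consumer.

    #include <stdbool.h>
    #include <stdio.h>

    #define PG_GUARD (1u << 0)      /* illustrative bit in a page_type-like word */

    struct fake_page {
            unsigned int page_type; /* a field struct page already has, unlike page_ext */
    };

    static void set_page_guard(struct fake_page *p)   { p->page_type |= PG_GUARD; }
    static void clear_page_guard(struct fake_page *p) { p->page_type &= ~PG_GUARD; }

    static bool page_is_guard(const struct fake_page *p)
    {
            return p->page_type & PG_GUARD;
    }

    int main(void)
    {
            struct fake_page p = { 0 };

            set_page_guard(&p);
            printf("guard marked:  %d\n", page_is_guard(&p));
            clear_page_guard(&p);
            printf("guard cleared: %d\n", !page_is_guard(&p));
            return 0;
    }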

    Link: http://lkml.kernel.org/r/20190603143451.27353-4-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Matthew Wilcox
    Cc: "Kirill A. Shutemov"
    Cc: Mel Gorman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • The page allocator checks struct pages for expected state (mapcount,
    flags etc) as pages are being allocated (check_new_page()) and freed
    (free_pages_check()) to provide some defense against errors in page
    allocator users.

    Prior to commits 479f854a207c ("mm, page_alloc: defer debugging checks of
    pages allocated from the PCP") and 4db7548ccbd9 ("mm, page_alloc: defer
    debugging checks of freed pages until a PCP drain"), this happened
    for order-0 pages as they were allocated from or freed to the per-cpu
    caches (pcplists). Since those are fast paths, the checks are now
    performed only when pages are moved between pcplists and global free
    lists. This however lowers the chances of catching errors soon enough.

    In order to increase the chances of the checks to catch errors, the
    kernel has to be rebuilt with CONFIG_DEBUG_VM, which also enables
    multiple other internal debug checks (VM_BUG_ON() etc), which is
    suboptimal when the goal is to catch errors in mm users, not in mm code
    itself.

    To catch some wrong users of the page allocator we have
    CONFIG_DEBUG_PAGEALLOC, which is designed to have virtually no overhead
    unless enabled at boot time. Memory corruptions from writing to freed
    pages often have the same underlying causes (use-after-free, double free)
    as corruption of the corresponding struct pages, so this existing debugging
    functionality is a good fit to extend by also performing struct page checks
    at least as often as if CONFIG_DEBUG_VM were enabled.

    Specifically, after this patch, when debug_pagealloc is enabled on boot,
    and CONFIG_DEBUG_VM disabled, pages are checked when allocated from or
    freed to the pcplists *in addition* to being moved between pcplists and
    free lists. When both debug_pagealloc and CONFIG_DEBUG_VM are enabled,
    pages are checked when being moved between pcplists and free lists *in
    addition* to when allocated from or freed to the pcplists.

    When debug_pagealloc is not enabled on boot, the overhead in fast paths
    should be virtually none thanks to the use of a static key.

    Link: http://lkml.kernel.org/r/20190603143451.27353-3-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Patch series "debug_pagealloc improvements".

    I have been recently debugging some pcplist corruptions, where it would be
    useful to perform struct page checks immediately as pages are allocated
    from and freed to pcplists, which is now only possible by rebuilding the
    kernel with CONFIG_DEBUG_VM (details in Patch 2 changelog).

    To make this kind of debugging simpler in future on a distro kernel, I
    have improved CONFIG_DEBUG_PAGEALLOC so that it has even smaller overhead
    when not enabled at boot time (Patch 1) and also when enabled (Patch 3),
    and extended it to perform the struct page checks more often when enabled
    (Patch 2). Now it can be configured in when building a distro kernel
    without extra overhead, and debugging page use after free or double free
    can be enabled simply by rebooting with debug_pagealloc=on.

    This patch (of 3):

    CONFIG_DEBUG_PAGEALLOC has been redesigned by 031bc5743f15
    ("mm/debug-pagealloc: make debug-pagealloc boottime configurable") to
    allow it to be always enabled in a distro kernel while only performing its
    expensive functionality when booted with debug_pagealloc=on. We can
    further reduce the overhead when not boot-enabled (including page
    allocator fast paths) using static keys. This patch introduces one for
    debug_pagealloc core functionality, and another for the optional guard
    page functionality (enabled by booting with debug_guardpage_minorder=X).
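
    In user space the pattern looks roughly like the sketch below, with a
    plain boolean standing in for the kernel's static key (in the kernel the
    test compiles to a patchable branch via static_branch_unlikely(), so the
    disabled case costs essentially nothing in the page allocator fast
    paths):

    #include <stdbool.h>
    #include <stdio.h>

    /* User-space stand-in: a plain bool plays the role of the static key; in
     * the kernel the test is a patched branch via static_branch_unlikely(),
     * so the disabled case adds essentially no fast-path cost. */
    static bool debug_pagealloc_enabled;

    static void free_page_fastpath(unsigned long pfn)
    {
            if (debug_pagealloc_enabled) {
                    /* expensive work: unmap the page / run extra struct page checks */
                    printf("debug_pagealloc: checking and unmapping pfn %lu\n", pfn);
            }
            /* ... the normal free path continues ... */
    }

    int main(void)
    {
            free_page_fastpath(42);         /* boot default: no overhead taken */
            debug_pagealloc_enabled = true; /* as if booted with debug_pagealloc=on */
            free_page_fastpath(42);
            return 0;
    }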

    Link: http://lkml.kernel.org/r/20190603143451.27353-2-vbabka@suse.cz
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     
  • Previously, totalram_pages was a global variable. Currently,
    totalram_pages is a static inline function defined in include/linux/mm.h.
    However, the function is also marked as EXPORT_SYMBOL, which is at best an
    odd combination. Because there is no point in exporting a static inline
    function from a public header, this commit removes the
    EXPORT_SYMBOL() marking. It will still be possible to use the function in
    modules because all the symbols it depends on are exported.
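
    For reference, a simplified sketch of the shape involved (the counter
    name below is illustrative): because the helper is a static inline in a
    public header, its body is compiled into every caller, and only the
    symbol it reads has to be visible to modules, so EXPORT_SYMBOL() on the
    inline itself adds nothing.

    #include <stdio.h>

    /* Illustrative counter name; the real helper reads an exported
     * atomic_long_t from include/linux/mm.h. */
    long _totalram_pages_counter = 4096;

    /* Being static inline in a header, the body is compiled into every caller;
     * only the counter symbol itself must be exported for modules. */
    static inline unsigned long totalram_pages(void)
    {
            return (unsigned long)_totalram_pages_counter;
    }

    int main(void)
    {
            printf("totalram_pages() = %lu\n", totalram_pages());
            return 0;
    }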

    Link: http://lkml.kernel.org/r/20190710141031.15642-1-efremov@linux.com
    Fixes: ca79b0c211af6 ("mm: convert totalram_pages and totalhigh_pages variables to atomic")
    Signed-off-by: Denis Efremov
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Oscar Salvador
    Cc: Pavel Tatashin
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Denis Efremov
     

05 Jul, 2019

1 commit

  • Commit 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time
    instead of doing larger sections") is causing a regression on some
    systems when the kernel is booted as Xen dom0.

    The system will just hang in early boot.

    The reason is an endless loop in get_page_from_freelist() when the first
    zone looked at has no free memory: deferred_grow_zone() always
    returns true due to the following code snippet:

    /* If the zone is empty somebody else may have cleared out the zone */
    if (!deferred_init_mem_pfn_range_in_zone(&i, zone, &spfn, &epfn,
                                             first_deferred_pfn)) {
            pgdat->first_deferred_pfn = ULONG_MAX;
            pgdat_resize_unlock(pgdat, &flags);
            return true;
    }

    This in turn results in the endless loop, as get_page_from_freelist()
    assumes forward progress can be made by doing some more struct page
    initialization.

    Link: http://lkml.kernel.org/r/20190620160821.4210-1-jgross@suse.com
    Fixes: 0e56acae4b4d ("mm: initialize MAX_ORDER_NR_PAGES at a time instead of doing larger sections")
    Signed-off-by: Juergen Gross
    Suggested-by: Alexander Duyck
    Acked-by: Alexander Duyck
    Cc: Michal Hocko
    Cc: Pavel Tatashin
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Juergen Gross
     

03 Jul, 2019

1 commit