03 Aug, 2022

1 commit

  • commit 9282012fc0aa248b77a69f5eb802b67c5a16bb13 upstream.

    There was a report that a task was stuck waiting in
    throttle_direct_reclaim. The pgscan_direct_throttle counter in vmstat
    was increasing.

    This is a bug where zone_watermark_fast returns true even when the
    number of free pages is very low. Commit f27ce0e14088 ("page_alloc:
    consider highatomic reserve in watermark fast") changed the watermark
    fast path to take the highatomic reserve into account, but it did not
    handle the negative-value case that can happen when the reserved
    highatomic pageblocks are bigger than the actual free pages.

    If the watermark is considered OK despite the negative value, order-0
    allocating contexts will consume all free pages without direct reclaim,
    and the free pages may finally be depleted except for the highatomic
    reserve.

    Then allocating contexts may fall into throttle_direct_reclaim. This
    symptom can easily happen on a system where the min watermark is low
    and other reclaimers such as kswapd do not free pages quickly.

    Handle the negative case by using MIN.

    Link: https://lkml.kernel.org/r/20220725095212.25388-1-jaewon31.kim@samsung.com
    Fixes: f27ce0e14088 ("page_alloc: consider highatomic reserve in watermark fast")
    Signed-off-by: Jaewon Kim
    Reported-by: GyeongHwan Hong
    Acked-by: Mel Gorman
    Cc: Minchan Kim
    Cc: Baoquan He
    Cc: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Yong-Taek Lee
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Jaewon Kim
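
    A minimal illustrative sketch of the clamp described in the entry above
    (stand-in names, not the exact upstream hunk; min() is the usual kernel
    macro):

    /* Order-0 fast-path idea: "reserved" may exceed the real free count. */
    static bool order0_watermark_ok(long free_pages, long reserved,
                                    long mark, long lowmem_reserve)
    {
            long usable_free = free_pages;

            /* Clamp instead of letting usable_free go negative. */
            usable_free -= min(usable_free, reserved);

            return usable_free > mark + lowmem_reserve;
    }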
     

09 Jun, 2022

1 commit

  • commit c572e4888ad1be123c1516ec577ad30a700bbec4 upstream.

    Peter Pavlisko reported the following problem on kernel bugzilla 216007.

    When I try to extract an uncompressed tar archive (2.6 million
    files, 760.3 GiB in size) on newly created (empty) XFS file system,
    after first low tens of gigabytes extracted the process hangs in
    iowait indefinitely. One CPU core is 100% occupied with iowait,
    the other CPU core is idle (on 2-core Intel Celeron G1610T).

    It was bisected to c9fa563072e1 ("xfs: use alloc_pages_bulk_array() for
    buffers") but XFS is only the messenger. The problem is that nothing is
    waking kswapd to reclaim pages at a time when the PCP lists cannot be
    refilled until some reclaim happens. The bulk allocator checks whether
    there are already some pages in the array; the original intent was that
    a bulk allocation did not necessarily need all the requested pages and
    that it was best to return as quickly as possible.

    This was fine for the first user of the API, but both NFS and XFS
    require the requested number of pages to be available before making
    progress. Both could be adjusted to call the page allocator directly if
    a bulk allocation fails, but that puts a burden on users of the API.
    Adjust the semantics to attempt at least one allocation via
    __alloc_pages() before returning, so kswapd is woken if necessary.

    It was reported via bugzilla that the patch addressed the problem and that
    the tar extraction completed successfully. This may also address bug
    215975 but has yet to be confirmed.

    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=216007
    BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=215975
    Link: https://lkml.kernel.org/r/20220526091210.GC3441@techsingularity.net
    Fixes: 387ba26fb1cb ("mm/page_alloc: add a bulk page allocator")
    Signed-off-by: Mel Gorman
    Cc: "Darrick J. Wong"
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: Vlastimil Babka
    Cc: Jesper Dangaard Brouer
    Cc: Chuck Lever
    Cc: [5.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Mel Gorman
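
    The semantic change can be pictured with a hedged sketch like the one
    below (not the upstream diff; gfp, preferred_nid, nodemask, page_list,
    page_array and nr_populated mirror the bulk allocator's usual locals):
    when the per-cpu lists cannot supply anything, fall back to one regular
    allocation so the normal kswapd wakeup logic still runs.

    failed:
            page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
            if (page) {
                    if (page_list)
                            list_add(&page->lru, page_list);
                    else
                            page_array[nr_populated] = page;
                    nr_populated++;
            }
            goto out;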
     

27 Apr, 2022

1 commit

  • commit ca831f29f8f25c97182e726429b38c0802200c8f upstream.

    Arthur Marsh reported we would hit the error below when building the
    kernel with gcc-12:

    CC mm/page_alloc.o
    mm/page_alloc.c: In function `mem_init_print_info':
    mm/page_alloc.c:8173:27: error: comparison between two arrays [-Werror=array-compare]
    8173 | if (start < end && size > adj) \
    |

    In C++20, comparison between two arrays is deprecated, and gcc now warns
    about it.

    Link: https://lkml.kernel.org/r/20211125130928.32465-1-sxwjean@me.com
    Signed-off-by: Xiongwei Song
    Reported-by: Arthur Marsh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Khem Raj
    Signed-off-by: Greg Kroah-Hartman

    Xiongwei Song
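
    A stand-alone illustration of the warning (not the kernel macro itself;
    _sinittext/_einittext here just model linker-provided section bounds):

    extern char _sinittext[], _einittext[];

    static long init_text_size(void)
    {
            /* gcc-12: comparison between two arrays [-Werror=array-compare]
             *     if (_sinittext < _einittext) ...
             * Taking element addresses compares pointers, not arrays: */
            if (&_sinittext[0] < &_einittext[0])
                    return _einittext - _sinittext;
            return 0;
    }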
     

20 Apr, 2022

1 commit

  • commit e553f62f10d93551eb883eca227ac54d1a4fad84 upstream.

    Since commit 6aa303defb74 ("mm, vmscan: only allocate and reclaim from
    zones with pages managed by the buddy allocator") only zones with free
    memory are included in a built zonelist. This is problematic when e.g.
    all memory of a zone has been ballooned out when zonelists are being
    rebuilt.

    The decision whether to rebuild the zonelists when onlining new memory
    is made based on populated_zone() returning 0 for the zone the memory
    will be added to. The new zone is added to the zonelists only if it
    has free memory pages (managed_zone() returns a non-zero value) after
    the memory has been onlined. This implies that onlining memory will
    always free the added pages to the allocator immediately, but this is
    not true in all cases: when e.g. running as a Xen guest, the onlined
    new memory will be added only to the ballooned memory list; it will be
    freed only when the guest is ballooned up afterwards.

    Another problem with using managed_zone() to decide whether a zone is
    added to the zonelists is that a zone with all of its memory in use
    will in fact be removed from all zonelists in case the zonelists
    happen to be rebuilt.

    Use populated_zone() when building a zonelist as it has been done before
    that commit.

    There was a report that QubesOS (based on Xen) is hitting this problem.
    Xen has switched to use the zone device functionality in kernel 5.9 and
    QubesOS wants to use memory hotplugging for guests in order to be able
    to start a guest with minimal memory and expand it as needed. This was
    the report leading to the patch.

    Link: https://lkml.kernel.org/r/20220407120637.9035-1-jgross@suse.com
    Fixes: 6aa303defb74 ("mm, vmscan: only allocate and reclaim from zones with pages managed by the buddy allocator")
    Signed-off-by: Juergen Gross
    Reported-by: Marek Marczykowski-Górecki
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Cc: Marek Marczykowski-Górecki
    Reviewed-by: Wei Yang
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Juergen Gross
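
    A rough sketch of the zone-selection loop after the fix (hedged, close
    to but not claimed to be the exact upstream code): a zone is added to
    the zonelist when it has present pages, not only when it currently has
    buddy-managed pages.

    do {
            zone_type--;
            zone = pgdat->node_zones + zone_type;
            if (populated_zone(zone)) {     /* was: managed_zone(zone) */
                    zoneref_set_zone(zone, &zonerefs[nr_zones++]);
                    check_highest_zone(zone_type);
            }
    } while (zone_type);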
     

08 Apr, 2022

1 commit

  • commit ddbc84f3f595cf1fc8234a191193b5d20ad43938 upstream.

    ZONE_MOVABLE uses the remaining memory in each node. Its starting pfn
    is also aligned to MAX_ORDER_NR_PAGES. It is possible for the remaining
    memory in a node to be less than MAX_ORDER_NR_PAGES, meaning there is
    not enough room for ZONE_MOVABLE on that node.

    Unfortunately this condition is not checked for. This leads to
    zone_movable_pfn[] getting set to a pfn greater than the last pfn in a
    node.

    calculate_node_totalpages() then sets zone->present_pages to be greater
    than zone->spanned_pages which is invalid, as spanned_pages represents
    the maximum number of pages in a zone assuming no holes.

    Subsequently it is possible free_area_init_core() will observe a zone of
    size zero with present pages. In this case it will skip setting up the
    zone, including the initialisation of free_lists[].

    However populated_zone() checks zone->present_pages to see if a zone has
    memory available. This is used by iterators such as
    walk_zones_in_node(). pagetypeinfo_showfree() uses this to walk the
    free_list of each zone in each node, which are assumed to be initialised
    due to the zone not being empty.

    As free_area_init_core() never initialised the free_lists[] this results
    in the following kernel crash when trying to read /proc/pagetypeinfo:

    BUG: kernel NULL pointer dereference, address: 0000000000000000
    #PF: supervisor read access in kernel mode
    #PF: error_code(0x0000) - not-present page
    PGD 0 P4D 0
    Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC NOPTI
    CPU: 0 PID: 456 Comm: cat Not tainted 5.16.0 #461
    Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.14.0-2 04/01/2014
    RIP: 0010:pagetypeinfo_show+0x163/0x460
    Code: 9e 82 e8 80 57 0e 00 49 8b 06 b9 01 00 00 00 4c 39 f0 75 16 e9 65 02 00 00 48 83 c1 01 48 81 f9 a0 86 01 00 0f 84 48 02 00 00 8b 00 4c 39 f0 75 e7 48 c7 c2 80 a2 e2 82 48 c7 c6 79 ef e3 82
    RSP: 0018:ffffc90001c4bd10 EFLAGS: 00010003
    RAX: 0000000000000000 RBX: ffff88801105f638 RCX: 0000000000000001
    RDX: 0000000000000001 RSI: 000000000000068b RDI: ffff8880163dc68b
    RBP: ffffc90001c4bd90 R08: 0000000000000001 R09: ffff8880163dc67e
    R10: 656c6261766f6d6e R11: 6c6261766f6d6e55 R12: ffff88807ffb4a00
    R13: ffff88807ffb49f8 R14: ffff88807ffb4580 R15: ffff88807ffb3000
    FS: 00007f9c83eff5c0(0000) GS:ffff88807dc00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000000 CR3: 0000000013c8e000 CR4: 0000000000350ef0
    Call Trace:
    seq_read_iter+0x128/0x460
    proc_reg_read_iter+0x51/0x80
    new_sync_read+0x113/0x1a0
    vfs_read+0x136/0x1d0
    ksys_read+0x70/0xf0
    __x64_sys_read+0x1a/0x20
    do_syscall_64+0x3b/0xc0
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Fix this by checking that the aligned zone_movable_pfn[] does not exceed
    the end of the node, and if it does skip creating a movable zone on this
    node.

    Link: https://lkml.kernel.org/r/20220215025831.2113067-1-apopple@nvidia.com
    Fixes: 2a1e274acf0b ("Create the ZONE_MOVABLE zone")
    Signed-off-by: Alistair Popple
    Acked-by: David Hildenbrand
    Acked-by: Mel Gorman
    Cc: John Hubbard
    Cc: Zi Yan
    Cc: Anshuman Khandual
    Cc: Oscar Salvador
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Alistair Popple
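
    A hedged sketch of the check (stand-in for the actual hunk): after
    rounding the start of ZONE_MOVABLE up to MAX_ORDER_NR_PAGES, verify it
    still lies inside the node, otherwise skip the movable zone there.

    zone_movable_pfn[nid] = roundup(usable_startpfn, MAX_ORDER_NR_PAGES);
    if (zone_movable_pfn[nid] >= end_pfn)
            zone_movable_pfn[nid] = 0;      /* no ZONE_MOVABLE on this node */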
     

27 Jan, 2022

2 commits

  • commit c4dc63f0032c77464fbd4e7a6afc22fa6913c4a7 upstream.

    In the x86_64 kdump kernel, a page allocation failure is observed:

    kworker/u2:2: page allocation failure: order:0, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 0 PID: 55 Comm: kworker/u2:2 Not tainted 5.16.0-rc4+ #5
    Hardware name: AMD Dinar/Dinar, BIOS RDN1505B 06/05/2013
    Workqueue: events_unbound async_run_entry_fn
    Call Trace:

    dump_stack_lvl+0x48/0x5e
    warn_alloc.cold+0x72/0xd6
    __alloc_pages_slowpath.constprop.0+0xc69/0xcd0
    __alloc_pages+0x1df/0x210
    new_slab+0x389/0x4d0
    ___slab_alloc+0x58f/0x770
    __slab_alloc.constprop.0+0x4a/0x80
    kmem_cache_alloc_trace+0x24b/0x2c0
    sr_probe+0x1db/0x620
    ......
    device_add+0x405/0x920
    ......
    __scsi_add_device+0xe5/0x100
    ata_scsi_scan_host+0x97/0x1d0
    async_run_entry_fn+0x30/0x130
    process_one_work+0x1e8/0x3c0
    worker_thread+0x50/0x3b0
    ? rescuer_thread+0x350/0x350
    kthread+0x16b/0x190
    ? set_kthread_struct+0x40/0x40
    ret_from_fork+0x22/0x30

    Mem-Info:
    ......

    The above failure happened when calling kmalloc() to allocate a buffer
    with GFP_DMA. It requests a slab page from the DMA zone while there are
    no managed pages at all in that zone.

    sr_probe()
    --> get_capabilities()
    --> buffer = kmalloc(512, GFP_KERNEL | GFP_DMA);

    In the current kernel, dma-kmalloc will be created as long as
    CONFIG_ZONE_DMA is enabled. However, the x86_64 kdump kernel has had no
    managed pages in the DMA zone since commit 6f599d84231f ("x86/kdump:
    Always reserve the low 1M when the crashkernel option is specified").
    The failure can always be reproduced.

    For now, let's mute the allocation-failure warning when pages are
    requested from the DMA zone while it has no managed pages.

    [akpm@linux-foundation.org: fix warning]

    Link: https://lkml.kernel.org/r/20211223094435.248523-4-bhe@redhat.com
    Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
    Signed-off-by: Baoquan He
    Acked-by: John Donnelly
    Reviewed-by: Hyeonggon Yoo
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: Borislav Petkov
    Cc: Christoph Hellwig
    Cc: David Hildenbrand
    Cc: David Laight
    Cc: Marek Szyprowski
    Cc: Robin Murphy
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
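
    A conceptual sketch only (not the exact upstream diff; it assumes the
    has_managed_dma() helper added by the companion patch described in the
    next entry below): skip the "page allocation failure" warning when the
    request targeted the DMA zone and that zone has no managed pages at all.

    if (!((gfp_mask & __GFP_DMA) && !has_managed_dma()))
            warn_alloc(gfp_mask, nodemask,
                       "page allocation failure: order:%u", order);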
     
  • commit 62b3107073646e0946bd97ff926832bafb846d17 upstream.

    Patch series "Handle warning of allocation failure on DMA zone w/o
    managed pages", v4.

    ***Problem observed:
    On x86_64, when a crash is triggered and the system enters the kdump
    kernel, a page allocation failure can always be seen.

    ---------------------------------
    DMA: preallocated 128 KiB GFP_KERNEL pool for atomic allocations
    swapper/0: page allocation failure: order:5, mode:0xcc1(GFP_KERNEL|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0
    CPU: 0 PID: 1 Comm: swapper/0
    Call Trace:
    dump_stack+0x7f/0xa1
    warn_alloc.cold+0x72/0xd6
    ......
    __alloc_pages+0x24d/0x2c0
    ......
    dma_atomic_pool_init+0xdb/0x176
    do_one_initcall+0x67/0x320
    ? rcu_read_lock_sched_held+0x3f/0x80
    kernel_init_freeable+0x290/0x2dc
    ? rest_init+0x24f/0x24f
    kernel_init+0xa/0x111
    ret_from_fork+0x22/0x30
    Mem-Info:
    ------------------------------------

    ***Root cause:
    The current kernel assumes that the DMA zone must have managed pages and
    tries to request pages from it if CONFIG_ZONE_DMA is enabled, but this
    is not always true. E.g. in the x86_64 kdump kernel, only the low 1M is
    present and it is locked down at a very early stage of boot, so the low
    1M is never added to the buddy allocator to become managed pages of the
    DMA zone. This will always cause a page allocation failure if a page is
    requested from the DMA zone.

    ***Investigation:
    This failure has happened since the commits below were merged into
    Linus's tree:
    1a6a9044b967 x86/setup: Remove CONFIG_X86_RESERVE_LOW and reservelow= options
    23721c8e92f7 x86/crash: Remove crash_reserve_low_1M()
    f1d4d47c5851 x86/setup: Always reserve the first 1M of RAM
    7c321eb2b843 x86/kdump: Remove the backup region handling
    6f599d84231f x86/kdump: Always reserve the low 1M when the crashkernel option is specified

    Before them, on x86_64 the low 640K area was reused by the kdump
    kernel: the content of the low 640K area was copied into a backup
    region for dumping before jumping into kdump. Then, except for the
    firmware-reserved regions in [0, 640K], the remaining area was added to
    the buddy allocator to become available managed pages of the DMA zone.

    However, after the above commits were applied, in the x86_64 kdump
    kernel the low 1M is reserved by memblock but never released to the
    buddy allocator. So any later page allocation requested from the DMA
    zone will fail.

    At the beginning, if crashkernel is reserved, the low 1M needs to be
    locked down because AMD SME encrypts memory, making the old backup
    region mechanism impossible when switching into the kdump kernel.

    Later, it was also observed that there are BIOSes corrupting memory
    under 1M. To solve this, in commit f1d4d47c5851 the entire low 1M
    region is always reserved after the real mode trampoline is allocated.

    Besides, an Intel engineer recently mentioned that TDX (Trust Domain
    Extensions), which is under development in the kernel, also needs to
    lock down the low 1M. So we can't simply revert the above commits to
    fix the page allocation failure from the DMA zone, as someone suggested.

    ***Solution:
    Currently, only the DMA atomic pool and dma-kmalloc will initialize and
    request page allocations with GFP_DMA during bootup.

    So only initialize the DMA atomic pool when the DMA zone has available
    managed pages, otherwise just skip the initialization.

    For dma-kmalloc(), for the time being, let's mute the warning of
    allocation failure if pages are requested from the DMA zone while it
    has no managed pages. Meanwhile, change code to use the
    dma_alloc_xx/dma_map_xx APIs to replace kmalloc(GFP_DMA), or do not use
    GFP_DMA when calling kmalloc() if it is not necessary. Christoph is
    posting patches to fix those under drivers/scsi/. Finally, we can
    remove the need for dma-kmalloc() as people suggested.

    This patch (of 3):

    Some places in the current kernel assume that the DMA zone must have
    managed pages if CONFIG_ZONE_DMA is enabled, but this is not always
    true. E.g. in the x86_64 kdump kernel, only the low 1M is present and
    it is locked down at a very early stage of boot, so there are no
    managed pages at all in the DMA zone. This will always cause a page
    allocation failure if a page is requested from the DMA zone.

    Add the function has_managed_dma() and the relevant helper functions to
    check whether there is a DMA zone with managed pages. It will be used
    in later patches.

    Link: https://lkml.kernel.org/r/20211223094435.248523-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20211223094435.248523-2-bhe@redhat.com
    Fixes: 6f599d84231f ("x86/kdump: Always reserve the low 1M when the crashkernel option is specified")
    Signed-off-by: Baoquan He
    Reviewed-by: David Hildenbrand
    Acked-by: John Donnelly
    Cc: Christoph Hellwig
    Cc: Christoph Lameter
    Cc: Hyeonggon Yoo
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Vlastimil Babka
    Cc: David Laight
    Cc: Borislav Petkov
    Cc: Marek Szyprowski
    Cc: Robin Murphy
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
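
    A sketch of the new helper, built only from the existing
    for_each_online_pgdat() and managed_zone() helpers (close to, but not
    claimed to be, the exact upstream implementation):

    #ifdef CONFIG_ZONE_DMA
    bool has_managed_dma(void)
    {
            struct pglist_data *pgdat;

            for_each_online_pgdat(pgdat) {
                    struct zone *zone = &pgdat->node_zones[ZONE_DMA];

                    if (managed_zone(zone))
                            return true;
            }
            return false;
    }
    #endif /* CONFIG_ZONE_DMA */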
     

29 Oct, 2021

2 commits

  • When handling a shmem page fault, a THP with a corrupted subpage could
    be PMD-mapped if certain conditions are satisfied. But the kernel is
    supposed to send SIGBUS when trying to map a hwpoisoned page.

    There are two paths which may do PMD map: fault around and regular
    fault.

    Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
    codepaths") things were even worse in the fault-around path: the THP
    could be PMD-mapped as long as the VMA fit, regardless of which subpage
    was accessed and corrupted. After this commit, the THP can still be
    PMD-mapped as long as the head page is not corrupted.

    In the regular fault path the THP could be PMD mapped as long as the
    corrupted page is not accessed and the VMA fits.

    This loophole could be fixed by iterating every subpage to check if any
    of them is hwpoisoned or not, but it is somewhat costly in page fault
    path.

    So introduce a new page flag called HasHWPoisoned on the first tail
    page. It indicates that the THP has hwpoisoned subpage(s). It is set if
    any subpage of the THP is found hwpoisoned by memory failure and after
    the refcount is bumped successfully; it is cleared when the THP is
    freed or split.

    The soft offline path doesn't need this since the soft offline handler
    just marks a subpage hwpoisoned when the subpage is migrated
    successfully. But a shmem THP doesn't get split and then migrated at
    all.

    Link: https://lkml.kernel.org/r/20211020210755.23964-3-shy828301@gmail.com
    Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
    Signed-off-by: Yang Shi
    Reviewed-by: Naoya Horiguchi
    Suggested-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Matthew Wilcox
    Cc: Oscar Salvador
    Cc: Peter Xu
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
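
    A hedged sketch of how the new flag is meant to be used (illustrative,
    not necessarily the exact upstream hunks): memory failure sets it on
    the first tail page, and the fault path refuses to PMD-map such a THP
    so the access falls back to PTE mapping and the usual SIGBUS handling.

    /* memory-failure side, after the THP refcount is pinned: */
    SetPageHasHWPoisoned(compound_head(page));

    /* fault side, before installing a PMD mapping of the THP: */
    if (unlikely(PageHasHWPoisoned(page)))
            return ret;     /* back off; do not PMD-map this THP */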
     
  • Commit 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in
    __vmalloc_area_node()") switched to the bulk page allocator for the
    order-0 allocations backing vmalloc. However, the bulk page allocator
    does not support __GFP_ACCOUNT allocations, and there are several users
    of kvmalloc(__GFP_ACCOUNT).

    For now, make __GFP_ACCOUNT allocations bypass the bulk page allocator.
    In the future, if there is a workload that can be significantly improved
    by the bulk page allocator with __GFP_ACCOUNT support, we can revisit
    the decision.

    Link: https://lkml.kernel.org/r/20211014151607.2171970-1-shakeelb@google.com
    Fixes: 5c1f4e690eec ("mm/vmalloc: switch to bulk allocator in __vmalloc_area_node()")
    Signed-off-by: Shakeel Butt
    Reported-by: Vasily Averin
    Tested-by: Vasily Averin
    Acked-by: David Hildenbrand
    Acked-by: Michal Hocko
    Acked-by: Roman Gushchin
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
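
    A hedged sketch of the bypass (not the exact diff): inside the bulk
    allocator, punt accounted allocations to the regular single-page path,
    which knows how to charge the memory cgroup.

    /* Bulk allocator does not support memcg accounting (yet). */
    if (unlikely(gfp & __GFP_ACCOUNT))
            goto failed;    /* falls back to a single __alloc_pages() call */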
     

09 Sep, 2021

4 commits

  • If preparing to free an unref page fails, the pcp page migratetype is
    never set. Thus we will get rubbish from get_pcppage_migratetype() and
    might list_del(&page->lru) again after it's already been deleted from
    the list, leading to complaints about data corruption.

    Link: https://lkml.kernel.org/r/20210902115447.57050-1-linmiaohe@huawei.com
    Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
    Signed-off-by: Miaohe Lin
    Acked-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Reviewed-by: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miaohe Lin
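
    A conceptual sketch of the fix (the exact helper name and signature are
    taken as assumptions here): when the preparation step rejects a page,
    drop it from the list right away and skip it, so the later stage never
    consumes an unset pcp migratetype or list_del()s the page twice.

    if (!free_unref_page_prepare(page, pfn, 0)) {
            list_del(&page->lru);
            continue;
    }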
     
  • Merge more updates from Andrew Morton:
    "147 patches, based on 7d2a07b769330c34b4deabeed939325c77a7ec2f.

    Subsystems affected by this patch series: mm (memory-hotplug, rmap,
    ioremap, highmem, cleanups, secretmem, kfence, damon, and vmscan),
    alpha, percpu, procfs, misc, core-kernel, MAINTAINERS, lib,
    checkpatch, epoll, init, nilfs2, coredump, fork, pids, criu, kconfig,
    selftests, ipc, and scripts"

    * emailed patches from Andrew Morton : (94 commits)
    scripts: check_extable: fix typo in user error message
    mm/workingset: correct kernel-doc notations
    ipc: replace costly bailout check in sysvipc_find_ipc()
    selftests/memfd: remove unused variable
    Kconfig.debug: drop selecting non-existing HARDLOCKUP_DETECTOR_ARCH
    configs: remove the obsolete CONFIG_INPUT_POLLDEV
    prctl: allow to setup brk for et_dyn executables
    pid: cleanup the stale comment mentioning pidmap_init().
    kernel/fork.c: unexport get_{mm,task}_exe_file
    coredump: fix memleak in dump_vma_snapshot()
    fs/coredump.c: log if a core dump is aborted due to changed file permissions
    nilfs2: use refcount_dec_and_lock() to fix potential UAF
    nilfs2: fix memory leak in nilfs_sysfs_delete_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_create_snapshot_group
    nilfs2: fix memory leak in nilfs_sysfs_delete_##name##_group
    nilfs2: fix memory leak in nilfs_sysfs_create_##name##_group
    nilfs2: fix NULL pointer in nilfs_##name##_attr_release
    nilfs2: fix memory leak in nilfs_sysfs_create_device_group
    trap: cleanup trap_init()
    init: move usermodehelper_enable() to populate_rootfs()
    ...

    Linus Torvalds
     
  • Patch series "mm/memory_hotplug: "auto-movable" online policy and memory groups", v3.

    I. Goal

    The goal of this series is improving in-kernel auto-online support. It
    tackles the fundamental problems that:

    1) We can create zone imbalances when onlining all memory blindly to
    ZONE_MOVABLE, in the worst case crashing the system. We have to know
    upfront how much memory we are going to hotplug such that we can
    safely enable auto-onlining of all hotplugged memory to ZONE_MOVABLE
    via "online_movable". This is far from practical and only applicable in
    limited setups -- like inside VMs under the RHV/oVirt hypervisor which
    will never hotplug more than 3 times the boot memory (and the
    limitation is only in place due to the Linux limitation).

    2) We see more setups that implement dynamic VM resizing, hot(un)plugging
    memory to resize VM memory. In these setups, we might hotplug a lot of
    memory, but it might happen in various small steps in both directions
    (e.g., 2 GiB -> 8 GiB -> 4 GiB -> 16 GiB ...). virtio-mem is the
    primary driver of this upstream right now, performing such dynamic
    resizing NUMA-aware via multiple virtio-mem devices.

    Onlining all hotplugged memory to ZONE_NORMAL means we basically have
    no hotunplug guarantees. Onlining all to ZONE_MOVABLE means we can
    easily run into zone imbalances when growing a VM. We want a mixture,
    and we want as much memory as reasonable/configured in ZONE_MOVABLE.
    Details regarding zone imbalances can be found at [1].

    3) Memory devices consist of 1..X memory block devices, however, the
    kernel doesn't really track the relationship. Consequently, also user
    space has no idea. We want to make per-device decisions.

    As one example, for memory hotunplug it doesn't make sense to use a
    mixture of zones within a single DIMM: we want all MOVABLE if
    possible, otherwise all !MOVABLE, because any !MOVABLE part will easily
    block the whole DIMM from getting hotunplugged.

    As another example, virtio-mem operates on individual units that span
    1..X memory blocks. Similar to a DIMM, we want a unit to either be all
    MOVABLE or !MOVABLE. A "unit" can be thought of like a DIMM, however,
    all units of a virtio-mem device logically belong together and are
    managed (added/removed) by a single driver. We want as much memory of
    a virtio-mem device to be MOVABLE as possible.

    4) We want memory onlining to be done right from the kernel while adding
    memory, not triggered by user space via udev rules; for example, this
    is required for fast memory hotplug for drivers that add individual
    memory blocks, like virtio-mem. We want a way to configure a policy in
    the kernel and avoid implementing advanced policies in user space.

    The auto-onlining support we have in the kernel is not sufficient. All we
    have is a) online everything MOVABLE (online_movable) b) online everything
    !MOVABLE (online_kernel) c) keep zones contiguous (online). This series
    allows configuring c) to mean instead "online movable if possible
    according to the configuration, driven by a maximum MOVABLE:KERNEL ratio"
    -- a new onlining policy.

    II. Approach

    This series does 3 things:

    1) Introduces the "auto-movable" online policy that initially operates on
    individual memory blocks only. It uses a maximum MOVABLE:KERNEL ratio
    to make a decision whether a memory block will be onlined to
    ZONE_MOVABLE or not. However, in the basic form, hotplugged KERNEL
    memory does not allow for more MOVABLE memory (details in the
    patches). CMA memory is treated like MOVABLE memory.

    2) Introduces static (e.g., DIMM) and dynamic (e.g., virtio-mem) memory
    groups and uses group information to make decisions in the
    "auto-movable" online policy across memory blocks of a single memory
    device (modeled as memory group). More details can be found in patch
    #3 or in the DIMM example below.

    3) Maximizes ZONE_MOVABLE memory within dynamic memory groups, by
    allowing ZONE_NORMAL memory within a dynamic memory group to allow for
    more ZONE_MOVABLE memory within the same memory group. The target use
    case is dynamic VM resizing using virtio-mem. See the virtio-mem
    example below.

    I remember that the basic idea of using a ratio to implement a policy in
    the kernel was once mentioned by Vitaly Kuznetsov, but I might be wrong (I
    lost the pointer to that discussion).

    For me, the main use case is using it along with virtio-mem (and DIMMs /
    ppc64 dlpar where necessary) for dynamic resizing of VMs, increasing the
    amount of memory we can hotunplug reliably again if we might eventually
    hotplug a lot of memory to a VM.

    III. Target Usage

    The target usage will be:

    1) Linux boots with "mhp_default_online_type=offline"

    2) User space (e.g., systemd unit) configures memory onlining (according
    to a config file and system properties), for example:
    * Setting memory_hotplug.online_policy=auto-movable
    * Setting memory_hotplug.auto_movable_ratio=301
    * Setting memory_hotplug.auto_movable_numa_aware=true

    3) User space enables auto onlining via "echo online >
    /sys/devices/system/memory/auto_online_blocks"

    4) User space triggers manual onlining of all already-offline memory
    blocks (go over offline memory blocks and set them to "online")

    IV. Example

    For DIMMs, hotplugging 4 GiB DIMMs to a 4 GiB VM with a configured ratio of
    301% results in the following layout:
    Memory block 0-15: DMA32 (early)
    Memory block 32-47: Normal (early)
    Memory block 48-79: Movable (DIMM 0)
    Memory block 80-111: Movable (DIMM 1)
    Memory block 112-143: Movable (DIMM 2)
    Memory block 144-175: Normal (DIMM 3)
    Memory block 176-207: Normal (DIMM 4)
    ... all Normal
    (-> hotplugged Normal memory does not allow for more Movable memory)

    For virtio-mem, using a simple, single virtio-mem device with a 4 GiB VM
    will result in the following layout:
    Memory block 0-15: DMA32 (early)
    Memory block 32-47: Normal (early)
    Memory block 48-143: Movable (virtio-mem, first 12 GiB)
    Memory block 144: Normal (virtio-mem, next 128 MiB)
    Memory block 145-147: Movable (virtio-mem, next 384 MiB)
    Memory block 148: Normal (virtio-mem, next 128 MiB)
    Memory block 149-151: Movable (virtio-mem, next 384 MiB)
    ... Normal/Movable mixture as above
    (-> hotplugged Normal memory allows for more Movable memory within
    the same device)

    Which gives us maximum flexibility when dynamically growing/shrinking a
    VM in smaller steps.

    V. Doc Update

    I'll update the memory-hotplug.rst documentation once the overhaul [1] is
    upstream. Until then, details can be found in patch #2.

    VI. Future Work

    1) Use memory groups for ppc64 dlpar
    2) Being able to specify a portion of (early) kernel memory that will be
    excluded from the ratio. Like "128 MiB globally/per node" are excluded.

    This might be helpful when starting VMs with an extremely small memory
    footprint (e.g., 128 MiB) and hotplugging memory later -- not wanting
    the first hotplugged units to be onlined to ZONE_MOVABLE. One
    alternative would be a trigger to not consider ZONE_DMA memory
    in the ratio. We'll have to see if this is really required.
    3) Indicate to user space that MOVABLE might be a bad idea -- especially
    relevant when memory ballooning without support for balloon compaction
    is active.

    This patch (of 9):

    For implementing a new memory onlining policy, which determines when to
    online memory blocks to ZONE_MOVABLE semi-automatically, we need the
    number of present early (boot) pages -- present pages excluding hotplugged
    pages. Let's track these pages per zone.

    Pass a page instead of the zone to adjust_present_page_count(), similar
    to adjust_managed_page_count(), and derive the zone from the page.

    It's worth noting that a memory block to be offlined/onlined is either
    completely "early" or "not early". add_memory() and friends can only add
    complete memory blocks and we only online/offline complete (individual)
    memory blocks.

    Link: https://lkml.kernel.org/r/20210806124715.17090-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20210806124715.17090-2-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Vitaly Kuznetsov
    Cc: "Michael S. Tsirkin"
    Cc: Jason Wang
    Cc: Marek Kedzierski
    Cc: Hui Zhu
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Anshuman Khandual
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mike Rapoport
    Cc: "Rafael J. Wysocki"
    Cc: Len Brown
    Cc: Pavel Tatashin
    Cc: Greg Kroah-Hartman
    Cc: Rafael J. Wysocki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
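
    A conceptual sketch of the ratio decision at the heart of the
    "auto-movable" policy (stand-in names, not the in-kernel
    implementation): with auto_movable_ratio=301, up to 3.01 units of
    ZONE_MOVABLE memory are allowed per unit of !MOVABLE (kernel) memory.

    static bool ratio_allows_movable(unsigned long kernel_pages,
                                     unsigned long movable_pages,
                                     unsigned long nr_pages,
                                     unsigned long ratio)
    {
            /* Would onlining nr_pages as MOVABLE still keep the ratio? */
            return (movable_pages + nr_pages) * 100 <= kernel_pages * ratio;
    }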
     
  • Patch series "mm: remove pfn_valid_within() and CONFIG_HOLES_IN_ZONE".

    After recent updates to freeing unused parts of the memory map, no
    architecture can have holes in the memory map within a pageblock. This
    makes pfn_valid_within() check and CONFIG_HOLES_IN_ZONE configuration
    option redundant.

    The first patch removes them both in a mechanical way and the second patch
    simplifies memory_hotplug::test_pages_in_a_zone(), which had
    pfn_valid_within() surrounded by more logic than a simple if.

    This patch (of 2):

    After the recent changes in freeing of the unused parts of the memory map
    and the rework of pfn_valid() on arm and arm64, there are no architectures
    that can have holes in the memory map within a pageblock, so nothing can
    enable CONFIG_HOLES_IN_ZONE, which guards the non-trivial implementation
    of pfn_valid_within().

    With that, pfn_valid_within() is always hardwired to 1 and can be
    completely removed.

    Remove calls to pfn_valid_within() and CONFIG_HOLES_IN_ZONE.

    Link: https://lkml.kernel.org/r/20210713080035.7464-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210713080035.7464-2-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

04 Sep, 2021

9 commits

  • Under normal circumstances, migrate_pages() returns the number of pages
    migrated. In error conditions, it returns an error code. When returning
    an error code, there is no way to know how many pages were migrated or not
    migrated.

    Make migrate_pages() return how many pages are demoted successfully for
    all cases, including when encountering errors. Page reclaim behavior will
    depend on this in subsequent patches.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-3-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-4-ying.huang@intel.com
    Signed-off-by: Yang Shi
    Signed-off-by: Dave Hansen
    Signed-off-by: "Huang, Ying"
    Suggested-by: Oscar Salvador [optional parameter]
    Reviewed-by: Yang Shi
    Reviewed-by: Zi Yan
    Cc: Michal Hocko
    Cc: Wei Xu
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: David Rientjes
    Cc: Greg Thelen
    Cc: Keith Busch
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
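
    A hedged sketch of the caller-side pattern this enables (the prototype
    details and the demotion helpers are abridged/assumed): the new output
    parameter reports how many pages were actually demoted even when the
    call as a whole returns an error code.

    unsigned int nr_succeeded = 0;
    int err;

    err = migrate_pages(&demote_pages, alloc_demote_page, NULL,
                        0 /* private */, MIGRATE_ASYNC, MR_DEMOTION,
                        &nr_succeeded);
    /* nr_succeeded is meaningful whether or not err indicates failure */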
     
  • Patch series "Migrate Pages in lieu of discard", v11.

    We're starting to see systems with more and more kinds of memory such as
    Intel's implementation of persistent memory.

    Let's say you have a system with some DRAM and some persistent memory.
    Today, once DRAM fills up, reclaim will start and some of the DRAM
    contents will be thrown out. Allocations will, at some point, start
    falling over to the slower persistent memory.

    That has two nasty properties. First, the newer allocations can end up in
    the slower persistent memory. Second, reclaimed data in DRAM are just
    discarded even if there are gobs of space in persistent memory that could
    be used.

    This patchset implements a solution to these problems. At the end of the
    reclaim process in shrink_page_list() just before the last page refcount
    is dropped, the page is migrated to persistent memory instead of being
    dropped.

    While I've talked about a DRAM/PMEM pairing, this approach would function
    in any environment where memory tiers exist.

    This is not perfect. It "strands" pages in slower memory and never brings
    them back to fast DRAM. Huang Ying has follow-on work which repurposes
    NUMA balancing to promote hot pages back to DRAM.

    This is also all based on an upstream mechanism that allows persistent
    memory to be onlined and used as if it were volatile:

    http://lkml.kernel.org/r/20190124231441.37A4A305@viggo.jf.intel.com

    With that, the DRAM and PMEM in each socket will be represented as 2
    separate NUMA nodes, with the CPUs sitting in the DRAM node. So the
    general inter-NUMA demotion mechanism introduced in the patchset can
    migrate the cold DRAM pages to the PMEM node.

    We have tested the patchset with postgresql and pgbench. On a 2-socket
    server machine with DRAM and PMEM, the kernel with the patchset can
    improve the pgbench score by up to 22.1% compared with that of the
    DRAM only + disk case. This comes from the reduced disk read throughput
    (which is reduced by up to 70.8%).

    == Open Issues ==

    * Pages under memory policies and cpusets that, for instance, restrict
    allocations to DRAM can be demoted to PMEM whenever they opt in to this
    new mechanism. A cgroup-level API to opt-in or opt-out of
    these migrations will likely be required as a follow-on.
    * Could be more aggressive about where anon LRU scanning occurs
    since it no longer necessarily involves I/O. get_scan_count()
    for instance says: "If we have no swap space, do not bother
    scanning anon pages"

    This patch (of 9):

    Prepare for the kernel to auto-migrate pages to other memory nodes with a
    node migration table. This allows creating a single migration target for
    each NUMA node to enable the kernel to do NUMA page migrations instead of
    simply discarding colder pages. A node with no target is a "terminal
    node", so reclaim acts normally there. The migration target does not
    fundamentally _need_ to be a single node, but this implementation starts
    there to limit complexity.

    When memory fills up on a node, memory contents can be automatically
    migrated to another node. The biggest problems are knowing when to
    migrate and to where the migration should be targeted.

    The most straightforward way to generate the "to where" list would be to
    follow the page allocator fallback lists. Those lists already tell us,
    if memory is full, where to look next. It would also be logical to move
    memory in that order.

    But, the allocator fallback lists have a fatal flaw: most nodes appear in
    all the lists. This would potentially lead to migration cycles (A->B,
    B->A, A->B, ...).

    Instead of using the allocator fallback lists directly, keep a separate
    node migration ordering. But, reuse the same data used to generate page
    allocator fallback in the first place: find_next_best_node().

    This means that the firmware data used to populate node distances
    essentially dictates the ordering for now. It should also be
    architecture-neutral since all NUMA architectures have a working
    find_next_best_node().

    RCU is used to allow lock-less reads of node_demotion[] and to prevent
    demotion cycles from being observed. If multiple reads of node_demotion[]
    are performed, a single rcu_read_lock() must be held over all reads to
    ensure no cycles are observed. Details are as follows.

    === What does RCU provide? ===

    Imagine a simple loop which walks down the demotion path looking
    for the last node:

    terminal_node = start_node;
    while (node_demotion[terminal_node] != NUMA_NO_NODE) {
            terminal_node = node_demotion[terminal_node];
    }

    The initial values are:

    node_demotion[0] = 1;
    node_demotion[1] = NUMA_NO_NODE;

    and are updated to:

    node_demotion[0] = NUMA_NO_NODE;
    node_demotion[1] = 0;

    What guarantees that the cycle is not observed:

    node_demotion[0] = 1;
    node_demotion[1] = 0;

    and would loop forever?

    With RCU, a rcu_read_lock/unlock() can be placed around the loop. Since
    the write side does a synchronize_rcu(), the loop that observed the old
    contents is known to be complete before the synchronize_rcu() has
    completed.

    RCU, combined with disable_all_migrate_targets(), ensures that the old
    migration state is not visible by the time __set_migration_target_nodes()
    is called.

    === What does READ_ONCE() provide? ===

    READ_ONCE() forbids the compiler from merging or reordering successive
    reads of node_demotion[]. This ensures that any updates are *eventually*
    observed.

    Consider the above loop again. The compiler could theoretically read the
    entirety of node_demotion[] into local storage (registers) and never go
    back to memory, and *permanently* observe bad values for node_demotion[].

    Note: RCU does not provide any universal compiler-ordering
    guarantees:

    https://lore.kernel.org/lkml/20150921204327.GH4029@linux.vnet.ibm.com/

    This code is unused for now. It will be called later in the
    series.

    Link: https://lkml.kernel.org/r/20210721063926.3024591-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-1-ying.huang@intel.com
    Link: https://lkml.kernel.org/r/20210715055145.195411-2-ying.huang@intel.com
    Signed-off-by: Dave Hansen
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Yang Shi
    Reviewed-by: Zi Yan
    Reviewed-by: Oscar Salvador
    Cc: Michal Hocko
    Cc: Wei Xu
    Cc: David Rientjes
    Cc: Dan Williams
    Cc: David Hildenbrand
    Cc: Greg Thelen
    Cc: Keith Busch
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Hansen
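
    A sketch of a reader that follows the rules above (close in spirit to
    the helper the series adds; treat names as assumptions): READ_ONCE()
    keeps the compiler from caching node_demotion[], and the RCU read-side
    section pairs with the writer's synchronize_rcu() after
    disable_all_migrate_targets().

    static int next_demotion_node(int node)
    {
            int target;

            rcu_read_lock();
            target = READ_ONCE(node_demotion[node]);
            rcu_read_unlock();

            return target;
    }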
     
  • The obsolete in_interrupt() also includes task context with BH disabled;
    it's better to use in_task() instead.

    Link: https://lkml.kernel.org/r/877caa99-1994-5545-92d2-d0bb2e394182@virtuozzo.com
    Signed-off-by: Vasily Averin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vasily Averin
     
  • alloc_node_mem_map() is only called from free_area_init_node(), which is
    an __init function.

    Make the actual alloc_node_mem_map() also __init and its stub version
    static inline.

    Link: https://lkml.kernel.org/r/20210716064124.31865-1-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • When compiling with -Werror, cc1 will warn that 'zone_id' may be used
    uninitialized in this function.

    Initialize zone_id to 0.

    It's safe to assume that if the code reaches this point it has at least
    one NUMA node with memory, so there is no need for an assertion before
    init_unavailable_range().

    Link: https://lkml.kernel.org/r/20210716210336.1114114-1-npache@redhat.com
    Fixes: 122e093c1734 ("mm/page_alloc: fix memory map initialization for descending nodes")
    Signed-off-by: Nico Pache
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nico Pache
     
  • There are several places that allocate memory for the memory map:
    alloc_node_mem_map() for FLATMEM, sparse_buffer_init() and
    __populate_section_memmap() for SPARSEMEM.

    The memory allocated in the FLATMEM case is zeroed and it is never
    poisoned, regardless of CONFIG_PAGE_POISON setting.

    The memory allocated in the SPARSEMEM cases is not zeroed and it is
    implicitly poisoned inside memblock if CONFIG_PAGE_POISON is set.

    Introduce a memmap_alloc() wrapper for the memblock allocators that will
    be used for both the FLATMEM and SPARSEMEM cases and will make memory map
    zeroing and poisoning consistent across memory models.

    Link: https://lkml.kernel.org/r/20210714123739.16493-4-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Cc: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
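
    A hedged sketch of such a wrapper (close to, but not claimed to be, the
    exact upstream implementation): pick the raw memblock allocator and do
    the struct-page poisoning in one place.

    void * __init memmap_alloc(phys_addr_t size, phys_addr_t align,
                               phys_addr_t min_addr, int nid, bool exact_nid)
    {
            void *ptr;

            if (exact_nid)
                    ptr = memblock_alloc_exact_nid_raw(size, align, min_addr,
                                                       MEMBLOCK_ALLOC_ACCESSIBLE,
                                                       nid);
            else
                    ptr = memblock_alloc_try_nid_raw(size, align, min_addr,
                                                     MEMBLOCK_ALLOC_ACCESSIBLE,
                                                     nid);

            if (ptr && size > 0)
                    page_init_poison(ptr, size);

            return ptr;
    }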
     
  • Patch series "mm: ensure consistency of memory map poisoning".

    Currently memory map allocation for FLATMEM case does not poison the
    struct pages regardless of CONFIG_PAGE_POISON setting.

    This happens because allocation of the memory map for the FLATMEM and
    SPARSEMEM cases uses different memblock functions, and those used for the
    SPARSEMEM case (namely memblock_alloc_try_nid_raw() and
    memblock_alloc_exact_nid_raw()) implicitly poison the allocated memory.

    Another side effect of this implicit poisoning is that early setup code
    that uses the same functions to allocate memory burns cycles for the
    memory poisoning even if it was not intended.

    These patches introduce memmap_alloc() wrapper that ensure that the memory
    map allocation is consistent for different memory models.

    This patch (of 4):

    Currently memory map for the holes is initialized only when SPARSEMEM
    memory model is used. Yet, even with FLATMEM there could be holes in the
    physical memory layout that have memory map entries.

    For instance, the memory reserved using e820 API on i386 or
    "reserved-memory" nodes in device tree would not appear in memblock.memory
    and hence the struct pages for such holes will be skipped during memory
    map initialization.

    These struct pages will be zeroed because the memory map for FLATMEM
    systems is allocated with memblock_alloc_node() that clears the allocated
    memory. While zeroed struct pages do not cause immediate problems, the
    correct behaviour is to initialize every page using __init_single_page().
    Besides, enabling page poison for FLATMEM case will trigger
    PF_POISONED_CHECK() unless the memory map is properly initialized.

    Make sure init_unavailable_range() is called for both SPARSEMEM and
    FLATMEM so that struct pages representing memory holes would appear as
    PG_Reserved with any memory layout.

    [rppt@kernel.org: fix microblaze]
    Link: https://lkml.kernel.org/r/YQWW3RCE4eWBuMu/@kernel.org

    Link: https://lkml.kernel.org/r/20210714123739.16493-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210714123739.16493-2-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: David Hildenbrand
    Tested-by: Guenter Roeck
    Cc: Michal Simek
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Print the NR_KERNEL_MISC_RECLAIMABLE stat from show_free_areas() so users
    can check whether the shrinker is working correctly and see the current
    memory usage.

    Link: https://lkml.kernel.org/r/20210813104725.4562-1-liuhailong@oppo.com
    Signed-off-by: liuhailong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    liuhailong
     
  • A recent lockdep report included these lines:

    [ 96.177910] 3 locks held by containerd/770:
    [ 96.177934] #0: ffff88810815ea28 (&mm->mmap_lock#2){++++}-{3:3},
    at: do_user_addr_fault+0x115/0x770
    [ 96.177999] #1: ffffffff82915020 (rcu_read_lock){....}-{1:2}, at:
    get_swap_device+0x33/0x140
    [ 96.178057] #2: ffffffff82955ba0 (fs_reclaim){+.+.}-{0:0}, at:
    __fs_reclaim_acquire+0x5/0x30

    While it was not useful to that bug report to know where the reclaim lock
    had been acquired, it might be useful under other circumstances. Allow
    the caller of __fs_reclaim_acquire to specify the instruction pointer to
    use.

    Link: https://lkml.kernel.org/r/20210719185709.1755149-1-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle)
    Cc: Omar Sandoval
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Boqun Feng
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

21 Aug, 2021

1 commit

  • When placing pages on a pcp list, migratetype values over
    MIGRATE_PCPTYPES get added to the MIGRATE_MOVABLE pcp list.

    However, the actual migratetype is preserved in the page and should
    not be changed to MIGRATE_MOVABLE or the page may end up on the wrong
    free_list.

    The impact is that HIGHATOMIC or CMA pages getting bulk freed from the
    PCP lists could potentially end up on the wrong buddy list. There are
    various consequences but minimally NR_FREE_CMA_PAGES accounting could
    get screwed up.

    [mgorman@techsingularity.net: changelog update]

    Link: https://lkml.kernel.org/r/20210811182917.2607994-1-opendmb@gmail.com
    Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
    Signed-off-by: Doug Berger
    Acked-by: Vlastimil Babka
    Acked-by: Mel Gorman
    Cc: "Peter Zijlstra (Intel)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Doug Berger
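
    A conceptual sketch (variable names are illustrative, not the upstream
    diff): decide the pcp list with a separate variable and leave the
    page's recorded migratetype alone, so the eventual buddy free puts the
    page on the correct free_list and the CMA/HIGHATOMIC accounting stays
    right.

    migratetype = pcplist_type = get_pcppage_migratetype(page);
    if (unlikely(migratetype >= MIGRATE_PCPTYPES)) {
            if (unlikely(is_migrate_isolate(migratetype))) {
                    free_one_page(page_zone(page), page, pfn, 0,
                                  migratetype, FPI_NONE);
                    return;
            }
            pcplist_type = MIGRATE_MOVABLE;   /* pcp list choice only */
    }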
     

24 Jul, 2021

1 commit

  • To reproduce the failure we need the following system:

    - kernel command: page_poison=1 init_on_free=0 init_on_alloc=0

    - kernel config:
    * CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y
    * CONFIG_INIT_ON_FREE_DEFAULT_ON=y
    * CONFIG_PAGE_POISONING=y

    Resulting in:

    0000000085629bdd: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    0000000022861832: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00000000c597f5b0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    CPU: 11 PID: 15195 Comm: bash Kdump: loaded Tainted: G U O 5.13.1-gentoo-x86_64 #1
    Hardware name: System manufacturer System Product Name/PRIME Z370-A, BIOS 2801 01/13/2021
    Call Trace:
    dump_stack+0x64/0x7c
    __kernel_unpoison_pages.cold+0x48/0x84
    post_alloc_hook+0x60/0xa0
    get_page_from_freelist+0xdb8/0x1000
    __alloc_pages+0x163/0x2b0
    __get_free_pages+0xc/0x30
    pgd_alloc+0x2e/0x1a0
    mm_init+0x185/0x270
    dup_mm+0x6b/0x4f0
    copy_process+0x190d/0x1b10
    kernel_clone+0xba/0x3b0
    __do_sys_clone+0x8f/0xb0
    do_syscall_64+0x68/0x80
    entry_SYSCALL_64_after_hwframe+0x44/0xae

    Before commit 51cba1ebc60d ("init_on_alloc: Optimize static branches")
    init_on_alloc never enabled the static branch by default. It could only
    be enabled explicitly by init_mem_debugging_and_hardening().

    But after commit 51cba1ebc60d, a static branch can already be enabled
    by default, and there was no code to ever disable it. That caused the
    page_poison=1 / init_on_free=1 conflict.

    This change extends init_mem_debugging_and_hardening() to also disable
    the default-enabled static branches when needed.

    Link: https://lkml.kernel.org/r/20210714031935.4094114-1-keescook@chromium.org
    Link: https://lore.kernel.org/r/20210712215816.1512739-1-slyfox@gentoo.org
    Fixes: 51cba1ebc60d ("init_on_alloc: Optimize static branches")
    Signed-off-by: Sergei Trofimovich
    Signed-off-by: Kees Cook
    Co-developed-by: Kees Cook
    Reported-by: Mikhail Morfikov
    Reported-by:
    Tested-by:
    Reviewed-by: David Hildenbrand
    Cc: Alexander Potapenko
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergei Trofimovich
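
    A conceptual sketch of the added handling (hedged, not the exact hunk;
    page_poisoning_requested stands in for however the conflict is
    detected): when page poisoning wins, explicitly turn the default-on
    static branches back off rather than assuming they start disabled.

    if (page_poisoning_requested) {
            pr_info("mem auto-init: CONFIG_PAGE_POISONING is on, will take precedence over init_on_alloc/init_on_free\n");
            static_branch_disable(&init_on_alloc);
            static_branch_disable(&init_on_free);
    }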
     

16 Jul, 2021

4 commits

  • The author of commit b3b64ebd3822 ("mm/page_alloc: do bulk array
    bounds check after checking populated elements") was possibly
    confused by the mixture of return values throughout the function.

    The API contract is clear that the function "Returns the number of pages
    on the list or array." It does not list zero as a unique return value with
    a special meaning. Therefore zero is a plausible return value only if
    @nr_pages is zero or less.

    Clean up the return logic to make it clear that the returned value is
    always the total number of pages in the array/list, not the number of
    pages that were allocated during this call.

    The only change in behavior with this patch is the value returned if
    prepare_alloc_pages() fails. To match the API contract, the number of
    pages currently in the array/list is returned in this case.

    The call site in __page_pool_alloc_pages_slow() also seems to be confused
    on this matter. It should be attended to by someone who is familiar with
    that code.

    [mel@techsingularity.net: Return nr_populated if 0 pages are requested]

    Link: https://lkml.kernel.org/r/20210713152100.10381-4-mgorman@techsingularity.net
    Signed-off-by: Chuck Lever
    Signed-off-by: Mel Gorman
    Acked-by: Jesper Dangaard Brouer
    Cc: Desmond Cheong Zhi Xi
    Cc: Zhang Qiang
    Cc: Yanfei Xu
    Cc: Matteo Croce
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chuck Lever
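
    A hedged sketch of the resulting contract (not the exact function body;
    prepare_alloc_pages() arguments abridged): every exit path reports the
    total number of pages now sitting in the list/array, including pages
    the caller had already populated.

    if (unlikely(!prepare_alloc_pages(gfp, 0, preferred_nid, nodemask,
                                      &ac, &alloc_gfp, &alloc_flags)))
            goto out;               /* no longer an unconditional "return 0" */

    /* ... allocation loop populates page_list/page_array ... */

    out:
            return nr_populated;    /* total on the list/array, not just new pages */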
     
  • If the array passed in is already partially populated, we should return
    "nr_populated" even when failing at the argument-preparation stage.

    Link: https://lkml.kernel.org/r/20210713152100.10381-3-mgorman@techsingularity.net
    Signed-off-by: Yanfei Xu
    Signed-off-by: Mel Gorman
    Link: https://lore.kernel.org/r/20210709102855.55058-1-yanfei.xu@windriver.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yanfei Xu
     
  • Syzbot is reporting potential deadlocks due to pagesets.lock when
    PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is as
    follows

    __alloc_pages_bulk()
    local_lock_irqsave(&pagesets.lock, flags)
    Reported-by: Desmond Cheong Zhi Xi
    Reported-by: "Zhang, Qiang"
    Reported-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
    Tested-by: syzbot+127fd7828d6eeb611703@syzkaller.appspotmail.com
    Acked-by: Rafael Aquini
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • This reverts commit f7173090033c70886d925995e9dfdfb76dbb2441.

    Fix an unresolved symbol error when CONFIG_DEBUG_INFO_BTF=y:

    LD vmlinux
    BTFIDS vmlinux
    FAILED unresolved symbol should_fail_alloc_page
    make: *** [Makefile:1199: vmlinux] Error 255
    make: *** Deleting file 'vmlinux'

    Link: https://lkml.kernel.org/r/20210708191128.153796-1-mcroce@linux.microsoft.com
    Fixes: f7173090033c ("mm/page_alloc: make should_fail_alloc_page() static")
    Signed-off-by: Matteo Croce
    Acked-by: Mel Gorman
    Tested-by: John Hubbard
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Vlastimil Babka
    Cc: Dan Streetman
    Cc: Yang Shi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matteo Croce
     

11 Jul, 2021

1 commit

  • Commit dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to
    local_lock") folded in a workaround patch for pahole that was unable to
    deal with zero-sized percpu structures.

    A superior workaround is achieved with commit a0b8200d06ad ("kbuild:
    skip per-CPU BTF generation for pahole v1.18-v1.21").

    This patch reverts the dummy field and the pahole version check.

    Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton : (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

2 commits

  • make W=1 generates the following warning for mm/page_alloc.c

    mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
    noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    ^~~~~~~~~~~~~~~~~~~~~~

    This function is deliberately split out for BPF to allow errors to be
    injected. The function is not used anywhere else so it is local to the
    file. Make it static, which should still allow error injection to be used,
    similar to how block/blk-core.c:should_fail_bio() works.
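
    A minimal sketch of the resulting pattern (simplified and illustrative;
    the real function defers to the existing fault_attr machinery in
    mm/page_alloc.c): keeping the function noinline and whitelisting it with
    ALLOW_ERROR_INJECTION() preserves the BPF attach point even though the
    symbol is now static, mirroring should_fail_bio().

    #include <linux/error-injection.h>
    #include <linux/fault-inject.h>
    #include <linux/gfp.h>

    static DECLARE_FAULT_ATTR(fail_page_alloc);

    /*
     * Static, but deliberately noinline and whitelisted for error
     * injection so BPF can still override the return value, just as
     * block/blk-core.c does for should_fail_bio().
     */
    static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
            return should_fail(&fail_page_alloc, 1 << order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);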

    Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Fix some spelling mistakes in comments:
    each having differents usage ==> each has a different usage
    statments ==> statements
    adresses ==> addresses
    aggresive ==> aggressive
    datas ==> data
    posion ==> poison
    higer ==> higher
    precisly ==> precisely
    wont ==> won't
    We moves tha ==> We move the
    endianess ==> endianness

    Link: https://lkml.kernel.org/r/20210519065853.7723-2-thunder.leizhen@huawei.com
    Signed-off-by: Zhen Lei
    Reviewed-by: Souptick Joarder
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhen Lei
     

01 Jul, 2021

1 commit

  • In [1], Jann Horn points out a possible race between
    prep_compound_gigantic_page and __page_cache_add_speculative. The root
    cause of the possible race is prep_compound_gigantic_page unconditionally
    setting the ref count of pages to zero. It does this because
    prep_compound_gigantic_page is handed a 'group' of pages from an allocator
    and needs to convert that group of pages to a compound page. The ref
    count of each page in this 'group' is one as set by the allocator.
    However, the ref count of compound page tail pages must be zero.

    The potential race comes about when ref counted pages are returned from
    the allocator. When this happens, other mm code could also take a
    reference on the page. __page_cache_add_speculative is one such example.
    Therefore, prep_compound_gigantic_page can not just set the ref count of
    pages to zero as it does today. Doing so would lose the reference taken
    by any other code. This would lead to BUGs in code checking ref counts
    and could possibly even lead to memory corruption.

    There are two possible ways to address this issue.

    1) Make all allocators of gigantic groups of pages be able to return a
    properly constructed compound page.

    2) Make prep_compound_gigantic_page be more careful when constructing a
    compound page.

    This patch takes approach 2.

    In prep_compound_gigantic_page, use cmpxchg to only set ref count to zero
    if it is one. If the cmpxchg fails, call synchronize_rcu() in the hope
    that the extra ref count will be dropped during an RCU grace period. This
    is not a performance-critical code path and the wait should be
    acceptable. If the ref count is still inflated after the grace period,
    then undo any modifications made and return an error.
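
    A simplified sketch of that approach (hypothetical helper name and
    shape, condensed from the description above; the real patch carries
    more bookkeeping), where page_ref_freeze() performs the cmpxchg from
    one to zero:

    #include <linux/mm.h>
    #include <linux/page_ref.h>
    #include <linux/rcupdate.h>

    /* Prepare one tail page while converting a group of ref-counted
     * pages, as handed out by the allocator, into a compound page. */
    static bool prep_gigantic_tail(struct page *head, struct page *p)
    {
            /* Only drop the ref count 1 -> 0 if nobody else holds a ref. */
            if (!page_ref_freeze(p, 1)) {
                    /* A speculative ref (e.g. __page_cache_add_speculative)
                     * is in flight: wait a grace period, then try once more. */
                    synchronize_rcu();
                    if (!page_ref_freeze(p, 1))
                            return false;   /* still inflated: undo and fail */
            }
            set_compound_head(p, head);
            return true;
    }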

    Currently prep_compound_gigantic_page is type void and does not return
    errors. Modify the two callers to check for and handle error returns. On
    error, the caller must free the 'group' of pages as they can not be used
    to form a gigantic page. After freeing pages, the runtime caller
    (alloc_fresh_huge_page) will retry the allocation once. Boot time
    allocations can not be retried.

    The routine prep_compound_page also unconditionally sets the ref count of
    compound page tail pages to zero. However, in this case the buddy
    allocator is constructing a compound page from freshly allocated pages.
    The ref count on those freshly allocated pages is already zero, so the
    set_page_count(p, 0) is unnecessary and could lead to confusion. Just
    remove it.

    [1] https://lore.kernel.org/linux-mm/CAG48ez23q0Jy9cuVnwAe7t_fdhMk2S7N5Hdi-GLcCeq5bsfLxw@mail.gmail.com/

    Link: https://lkml.kernel.org/r/20210622021423.154662-3-mike.kravetz@oracle.com
    Fixes: 58a84aa92723 ("thp: set compound tail page _count to zero")
    Signed-off-by: Mike Kravetz
    Reported-by: Jann Horn
    Cc: Youquan Song
    Cc: Andrea Arcangeli
    Cc: Jan Kara
    Cc: John Hubbard
    Cc: "Kirill A . Shutemov"
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Muchun Song
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     

30 Jun, 2021

7 commits

  • Merge misc updates from Andrew Morton:
    "191 patches.

    Subsystems affected by this patch series: kthread, ia64, scripts,
    ntfs, squashfs, ocfs2, kernel/watchdog, and mm (gup, pagealloc, slab,
    slub, kmemleak, dax, debug, pagecache, gup, swap, memcg, pagemap,
    mprotect, bootmem, dma, tracing, vmalloc, kasan, initialization,
    pagealloc, and memory-failure)"

    * emailed patches from Andrew Morton : (191 commits)
    mm,hwpoison: make get_hwpoison_page() call get_any_page()
    mm,hwpoison: send SIGBUS with error virutal address
    mm/page_alloc: split pcp->high across all online CPUs for cpuless nodes
    mm/page_alloc: allow high-order pages to be stored on the per-cpu lists
    mm: replace CONFIG_FLAT_NODE_MEM_MAP with CONFIG_FLATMEM
    mm: replace CONFIG_NEED_MULTIPLE_NODES with CONFIG_NUMA
    docs: remove description of DISCONTIGMEM
    arch, mm: remove stale mentions of DISCONIGMEM
    mm: remove CONFIG_DISCONTIGMEM
    m68k: remove support for DISCONTIGMEM
    arc: remove support for DISCONTIGMEM
    arc: update comment about HIGHMEM implementation
    alpha: remove DISCONTIGMEM and NUMA
    mm/page_alloc: move free_the_page
    mm/page_alloc: fix counting of managed_pages
    mm/page_alloc: improve memmap_pages dbg msg
    mm: drop SECTION_SHIFT in code comments
    mm/page_alloc: introduce vm.percpu_pagelist_high_fraction
    mm/page_alloc: limit the number of pages on PCP lists when reclaim is active
    mm/page_alloc: scale the number of pages that are batch freed
    ...

    Linus Torvalds
     
  • Dave Hansen reported the following about Feng Tang's tests on a machine
    with persistent memory onlined as a DRAM-like device.

    Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
    ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
    use" mode and being managed via the buddy just like the normal RAM.

    The PMEM zones are big ones:

    present 65011712 = 248 G
    high 134595 = 525 M

    The PMEM nodes, of course, don't have any CPUs in them.

    With your series, the pcp->high value per-cpu is 69584 pages or about
    270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
    worst-case memory in the pcps per zone, or roughly 10% of the size of
    the zone.

    This should not cause a problem as such although it could trigger reclaim
    due to pages being stored on per-cpu lists for CPUs remote to a node. It
    is not possible to treat cpuless nodes exactly the same as normal nodes
    but the worst-case scenario can be mitigated by splitting pcp->high across
    all online CPUs for cpuless memory nodes.
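
    A rough sketch of the mitigation (function and variable names here are
    illustrative, not the exact upstream code), assuming pcp->high is
    derived from a per-zone budget:

    #include <linux/cpumask.h>
    #include <linux/mmzone.h>
    #include <linux/topology.h>

    /*
     * Split the zone's pcp->high budget between the CPUs local to the
     * zone's node; if the node has no CPUs (e.g. a cpuless PMEM node),
     * fall back to dividing it across every online CPU instead.
     */
    static unsigned long split_pcp_high(struct zone *zone, unsigned long total_pages)
    {
            unsigned int nr_split_cpus;

            nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)));
            if (!nr_split_cpus)
                    nr_split_cpus = num_online_cpus();

            return total_pages / nr_split_cpus;
    }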

    Link: https://lkml.kernel.org/r/20210616110743.GK30378@techsingularity.net
    Suggested-by: Dave Hansen
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Dave Hansen
    Cc: Hillf Danton
    Cc: Michal Hocko
    Cc: "Tang, Feng"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The per-cpu page allocator (PCP) only stores order-0 pages. This means
    that all THP and "cheap" high-order allocations, including SLUB, contend on
    the zone->lock. This patch extends the PCP allocator to store THP and
    "cheap" high-order pages. Note that struct per_cpu_pages increases in
    size to 256 bytes (4 cache lines) on x86-64.
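
    Conceptually, each (migratetype, order) pair now gets its own PCP
    list, with one extra slot shared by THP-sized pages; a simplified
    sketch of such a mapping (illustrative, not the exact upstream
    helper):

    #include <linux/mmzone.h>

    /* One PCP list per (migratetype, order) for cheap orders, plus a
     * single shared list for THP-sized (> PAGE_ALLOC_COSTLY_ORDER) pages. */
    static inline unsigned int order_to_pindex(int migratetype, int order)
    {
            int base = order;

            /* Fold every THP-sized allocation into the one extra slot. */
            if (order > PAGE_ALLOC_COSTLY_ORDER)
                    base = PAGE_ALLOC_COSTLY_ORDER + 1;

            return (MIGRATE_PCPTYPES * base) + migratetype;
    }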

    Note that this is not necessarily a universal performance win because of
    how it is implemented. High-order pages can cause pcp->high to be
    exceeded prematurely for lower orders; for example, a large number of
    THP pages being freed could release order-0 pages from the PCP lists.
    Hence, much depends on the allocation/free pattern as observed by a single
    CPU to determine if caching helps or hurts a particular workload.

    That said, basic performance testing passed. The following is a netperf
    UDP_STREAM test which hits the relevant patches as some of the network
    allocations are high-order.

    netperf-udp
    5.13.0-rc2 5.13.0-rc2
    mm-pcpburst-v3r4 mm-pcphighorder-v1r7
    Hmean send-64 261.46 ( 0.00%) 266.30 * 1.85%*
    Hmean send-128 516.35 ( 0.00%) 536.78 * 3.96%*
    Hmean send-256 1014.13 ( 0.00%) 1034.63 * 2.02%*
    Hmean send-1024 3907.65 ( 0.00%) 4046.11 * 3.54%*
    Hmean send-2048 7492.93 ( 0.00%) 7754.85 * 3.50%*
    Hmean send-3312 11410.04 ( 0.00%) 11772.32 * 3.18%*
    Hmean send-4096 13521.95 ( 0.00%) 13912.34 * 2.89%*
    Hmean send-8192 21660.50 ( 0.00%) 22730.72 * 4.94%*
    Hmean send-16384 31902.32 ( 0.00%) 32637.50 * 2.30%*

    Functionally, a patch like this is necessary to make bulk allocation of
    high-order pages work with similar performance to order-0 bulk
    allocations. The bulk allocator is not updated in this series as it would
    have to be determined by bulk allocation users how they want to track the
    order of pages allocated with the bulk allocator.

    Link: https://lkml.kernel.org/r/20210611135753.GC30378@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Zi Yan
    Cc: Dave Hansen
    Cc: Michal Hocko
    Cc: Jesper Dangaard Brouer
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
    After removal of the DISCONTIGMEM memory model, the FLAT_NODE_MEM_MAP
    configuration option is equivalent to FLATMEM.

    Drop CONFIG_FLAT_NODE_MEM_MAP and use CONFIG_FLATMEM instead.
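
    As an illustration of the kind of change involved (a hypothetical
    hunk, not quoted from the patch), the rename is a mechanical
    substitution of the config guard, e.g. around pglist_data's
    node_mem_map:

    /* Before */
    #ifdef CONFIG_FLAT_NODE_MEM_MAP
            struct page *node_mem_map;
    #endif

    /* After */
    #ifdef CONFIG_FLATMEM
            struct page *node_mem_map;
    #endif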

    Link: https://lkml.kernel.org/r/20210608091316.3622-10-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: Arnd Bergmann
    Acked-by: David Hildenbrand
    Cc: Geert Uytterhoeven
    Cc: Ivan Kokshaysky
    Cc: Jonathan Corbet
    Cc: Matt Turner
    Cc: Richard Henderson
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    After removal of DISCONTIGMEM, the NEED_MULTIPLE_NODES and NUMA
    configuration options are equivalent.

    Drop CONFIG_NEED_MULTIPLE_NODES and use CONFIG_NUMA instead.

    Done with

    $ sed -i 's/CONFIG_NEED_MULTIPLE_NODES/CONFIG_NUMA/' \
    $(git grep -wl CONFIG_NEED_MULTIPLE_NODES)
    $ sed -i 's/NEED_MULTIPLE_NODES/NUMA/' \
    $(git grep -wl NEED_MULTIPLE_NODES)

    with manual tweaks afterwards.

    [rppt@linux.ibm.com: fix arm boot crash]
    Link: https://lkml.kernel.org/r/YMj9vHhHOiCVN4BF@linux.ibm.com

    Link: https://lkml.kernel.org/r/20210608091316.3622-9-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: Arnd Bergmann
    Acked-by: David Hildenbrand
    Cc: Geert Uytterhoeven
    Cc: Ivan Kokshaysky
    Cc: Jonathan Corbet
    Cc: Matt Turner
    Cc: Richard Henderson
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • There are no architectures that support DISCONTIGMEM left.

    Remove the configuration option and the dead code it was guarding in the
    generic memory management code.

    Link: https://lkml.kernel.org/r/20210608091316.3622-6-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: Arnd Bergmann
    Acked-by: David Hildenbrand
    Cc: Geert Uytterhoeven
    Cc: Ivan Kokshaysky
    Cc: Jonathan Corbet
    Cc: Matt Turner
    Cc: Richard Henderson
    Cc: Vineet Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "Allow high order pages to be stored on PCP", v2.

    The per-cpu page allocator (PCP) only handles order-0 pages. With the
    series "Use local_lock for pcp protection and reduce stat overhead" and
    "Calculate pcp->high based on zone sizes and active CPUs", it's now
    feasible to store high-order pages on PCP lists.

    This small series allows PCP to store "cheap" orders where cheap is
    determined by PAGE_ALLOC_COSTLY_ORDER and THP-sized allocations.

    This patch (of 2):

    In the next patch, free_compound_page is going to use the common helper
    free_the_page. This patch moves the definition to ease review. No
    functional change.

    Link: https://lkml.kernel.org/r/20210603142220.10851-1-mgorman@techsingularity.net
    Link: https://lkml.kernel.org/r/20210603142220.10851-2-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Jesper Dangaard Brouer
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman