06 Jan, 2021

1 commit

  • commit dc2da7b45ffe954a0090f5d0310ed7b0b37d2bd2 upstream.

    VMware observed a performance regression during memmap init on their
    platform, and bisected to commit 73a6e474cb376 ("mm: memmap_init:
    iterate over memblock regions rather that check each PFN") causing it.

    Before the commit:

    [0.033176] Normal zone: 1445888 pages used for memmap
    [0.033176] Normal zone: 89391104 pages, LIFO batch:63
    [0.035851] ACPI: PM-Timer IO Port: 0x448

    With the commit:

    [0.026874] Normal zone: 1445888 pages used for memmap
    [0.026875] Normal zone: 89391104 pages, LIFO batch:63
    [2.028450] ACPI: PM-Timer IO Port: 0x448

    The root cause is that the current memmap defer init doesn't work as
    expected.

    Before, memmap_init_zone() was used to do the memmap init of one whole
    zone: all low zones of a NUMA node were initialized eagerly, while the
    memmap init of the last zone in that node was deferred. However, since
    commit 73a6e474cb376, memmap_init() iterates over the memblock regions
    inside one zone and calls memmap_init_zone() to do the memmap init for
    each region.

    E.g., on VMware's system the memory layout is as below; there are two
    memory regions in node 2. The current code mistakenly initializes the
    whole 1st region [mem 0xab00000000-0xfcffffffff], and only then applies
    the memmap deferral to initialize just one memory section of the 2nd
    region [mem 0x10000000000-0x1033fffffff]. In fact, we expect only one
    memory section's memmap to be initialized eagerly in this zone; that is
    why so much more time is spent here.

    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
    [ 0.008842] ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
    [ 0.008843] ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x55ffffffff]
    [ 0.008844] ACPI: SRAT: Node 1 PXM 1 [mem 0x5600000000-0xaaffffffff]
    [ 0.008844] ACPI: SRAT: Node 2 PXM 2 [mem 0xab00000000-0xfcffffffff]
    [ 0.008845] ACPI: SRAT: Node 2 PXM 2 [mem 0x10000000000-0x1033fffffff]

    Now, let's add a parameter 'zone_end_pfn' to memmap_init_zone() to pass
    down the real zone end pfn, so that defer_init() can use it to judge
    whether deferred init should be applied zone-wide.
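
    To make the decision concrete, here is a self-contained C sketch of the
    idea (all names and pfn values below are illustrative, not the kernel's
    defer_init()): passing a per-region end pfn instead of the real zone end
    pfn disables deferral for every region that ends before the zone does.

        /*
         * Userspace model of the "defer or not" decision described above
         * (illustrative only; not the kernel implementation).
         */
        #include <stdbool.h>
        #include <stdio.h>

        /* Ranges that end before the zone end are always initialized eagerly;
         * only the range reaching the zone end gets the deferred treatment. */
        static bool defer_memmap_init(unsigned long end_pfn,
                                      unsigned long zone_end_pfn)
        {
            if (end_pfn < zone_end_pfn)
                return false;
            return true;
        }

        int main(void)
        {
            unsigned long region1_end = 0x0fd00000;  /* illustrative pfn */
            unsigned long zone_end    = 0x10340000;  /* illustrative pfn */

            /* Buggy behaviour: a per-region end is passed, so region 1 is
             * fully initialized up front (defer = 0). */
            printf("region end passed: defer=%d\n",
                   defer_memmap_init(region1_end, zone_end));

            /* Fixed behaviour: the real zone end is passed down, so the
             * deferral can kick in zone-wide (defer = 1). */
            printf("zone end passed:   defer=%d\n",
                   defer_memmap_init(zone_end, zone_end));
            return 0;
        }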

    Link: https://lkml.kernel.org/r/20201223080811.16211-1-bhe@redhat.com
    Link: https://lkml.kernel.org/r/20201223080811.16211-2-bhe@redhat.com
    Fixes: commit 73a6e474cb376 ("mm: memmap_init: iterate over memblock regions rather that check each PFN")
    Signed-off-by: Baoquan He
    Reported-by: Rahul Gopakumar
    Reviewed-by: Mike Rapoport
    Cc: David Hildenbrand
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Baoquan He
     

30 Dec, 2020

1 commit

  • [ Upstream commit 597c892038e08098b17ccfe65afd9677e6979800 ]

    On 2-node NUMA hosts we see bursts of kswapd reclaim and subsequent
    pressure spikes and stalls from cache refaults while there is plenty of
    free memory in the system.

    Usually, kswapd is woken up when all eligible nodes in an allocation are
    full. But the code related to watermark boosting can wake kswapd on one
    full node while the other one is mostly empty. This may be justified to
    fight fragmentation, but is currently unconditionally done whether
    watermark boosting is occurring or not.

    In our case, many of our workloads' throughput scales with available
    memory, and pure utilization is a more tangible concern than trends
    around longer-term fragmentation. As a result we generally disable
    watermark boosting.

    Only wake kswapd for watermark boosting when boosting is actually
    requested.
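
    As a hedged illustration of the behavioural change (a toy model with
    invented names, not the kernel's boost/wakeup code), the wakeup is simply
    made conditional on a boost having been applied:

        #include <stdbool.h>
        #include <stdio.h>

        struct zone_model {
            unsigned long watermark_boost;   /* 0 means no boost in effect */
        };

        static bool kswapd_woken;

        static void wakeup_kswapd_model(void)
        {
            kswapd_woken = true;
        }

        static void boost_and_maybe_wake(struct zone_model *z, bool boost_enabled)
        {
            if (boost_enabled)
                z->watermark_boost += 1;     /* pretend a boost was applied */

            /* The change: no boost in effect => no boost-related wakeup. */
            if (z->watermark_boost)
                wakeup_kswapd_model();
        }

        int main(void)
        {
            struct zone_model z = { .watermark_boost = 0 };

            boost_and_maybe_wake(&z, false);
            printf("boosting disabled -> kswapd woken: %d\n", kswapd_woken);

            boost_and_maybe_wake(&z, true);
            printf("boosting enabled  -> kswapd woken: %d\n", kswapd_woken);
            return 0;
        }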

    Link: https://lkml.kernel.org/r/20201020175833.397286-1-hannes@cmpxchg.org
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Johannes Weiner
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Johannes Weiner
     

19 Nov, 2020

1 commit

  • The ethernet driver may allocate an skb (and skb->data) via
    napi_alloc_skb(). This ends up in page_frag_alloc(), which allocates
    skb->data from page_frag_cache->va.

    Under memory pressure, page_frag_cache->va may be allocated as a
    pfmemalloc page. As a result, skb->pfmemalloc is always true because
    skb->data comes from page_frag_cache->va. The skb will be dropped if the
    sock (receiver) does not have SOCK_MEMALLOC. This is expected behaviour
    under memory pressure.

    However, once the kernel is no longer under memory pressure (suppose a
    large amount of memory pages has just been reclaimed), page_frag_alloc()
    may still re-use the prior pfmemalloc page_frag_cache->va to allocate
    skb->data. As a result, skb->pfmemalloc stays true until
    page_frag_cache->va is re-allocated, even though the kernel is no longer
    under memory pressure.

    Here is how the kernel runs into the issue.

    1. The kernel is under memory pressure and the allocation of
    PAGE_FRAG_CACHE_MAX_ORDER in __page_frag_cache_refill() fails. Instead,
    a pfmemalloc page is allocated for page_frag_cache->va.

    2. All skb->data from page_frag_cache->va (pfmemalloc) will have
    skb->pfmemalloc=true. The skb will always be dropped by a sock without
    SOCK_MEMALLOC. This is expected behaviour.

    3. Suppose a large amount of pages is reclaimed and the kernel is no
    longer under memory pressure. We expect the skb->pfmemalloc drops to
    stop.

    4. Unfortunately, page_frag_alloc() does not proactively re-allocate
    page_frag_cache->va and keeps re-using the prior pfmemalloc page, so
    skb->pfmemalloc remains true even though the kernel is no longer under
    memory pressure.

    Fix this by freeing and re-allocating the page instead of recycling it.
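
    A minimal userspace sketch of the idea behind the fix (invented names;
    not the real page_frag_alloc()): a pfmemalloc backing page is dropped and
    the cache refilled instead of being recycled indefinitely.

        #include <stdbool.h>
        #include <stdio.h>
        #include <stdlib.h>

        struct frag_cache_model {
            void *va;          /* backing buffer for frag allocations */
            bool  pfmemalloc;  /* came from emergency reserves? */
        };

        static void refill_cache(struct frag_cache_model *c, bool under_pressure)
        {
            c->va = malloc(4096);
            /* Under pressure the refill is served from emergency reserves. */
            c->pfmemalloc = under_pressure;
        }

        static void *frag_alloc(struct frag_cache_model *c, bool under_pressure)
        {
            /* The fix's idea: don't recycle a pfmemalloc page; free it and
             * refill so a later allocation can get a normal page again. */
            if (c->va && c->pfmemalloc) {
                free(c->va);
                c->va = NULL;
            }
            if (!c->va)
                refill_cache(c, under_pressure);
            return c->va;
        }

        int main(void)
        {
            struct frag_cache_model c = { 0 };

            frag_alloc(&c, true);   /* refilled under pressure: pfmemalloc */
            frag_alloc(&c, false);  /* pressure gone: cache gets a clean page */
            printf("still pfmemalloc after pressure ended: %d\n", c.pfmemalloc);
            free(c.va);
            return 0;
        }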

    References: https://lore.kernel.org/lkml/20201103193239.1807-1-dongli.zhang@oracle.com/
    References: https://lore.kernel.org/linux-mm/20201105042140.5253-1-willy@infradead.org/
    Suggested-by: Matthew Wilcox (Oracle)
    Cc: Aruna Ramakrishna
    Cc: Bert Barbe
    Cc: Rama Nichanamatlu
    Cc: Venkat Venkatsubra
    Cc: Manjunath Patil
    Cc: Joe Jin
    Cc: SRINIVAS
    Fixes: 79930f5892e1 ("net: do not deplete pfmemalloc reserve")
    Signed-off-by: Dongli Zhang
    Acked-by: Vlastimil Babka
    Reviewed-by: Eric Dumazet
    Link: https://lore.kernel.org/r/20201115201029.11903-1-dongli.zhang@oracle.com
    Signed-off-by: Jakub Kicinski

    Dongli Zhang
     

17 Oct, 2020

13 commits

  • Merge more updates from Andrew Morton:
    "155 patches.

    Subsystems affected by this patch series: mm (dax, debug, thp,
    readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
    core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
    binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
    romfs, and fault-injection"

    * emailed patches from Andrew Morton : (155 commits)
    lib, uaccess: add failure injection to usercopy functions
    lib, include/linux: add usercopy failure capability
    ROMFS: support inode blocks calculation
    ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
    sched.h: drop in_ubsan field when UBSAN is in trap mode
    scripts/gdb/tasks: add headers and improve spacing format
    scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
    kernel/relay.c: drop unneeded initialization
    panic: dump registers on panic_on_warn
    rapidio: fix the missed put_device() for rio_mport_add_riodev
    rapidio: fix error handling path
    nilfs2: fix some kernel-doc warnings for nilfs2
    autofs: harden ioctl table
    ramfs: fix nommu mmap with gaps in the page cache
    mm: remove the now-unnecessary mmget_still_valid() hack
    mm/gup: take mmap_lock in get_dump_page()
    binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
    coredump: rework elf/elf_fdpic vma_dump_size() into common helper
    coredump: refactor page range dumping into common helper
    coredump: let dump_emit() bail out on short writes
    ...

    Linus Torvalds
     
  • The current page_order() can only be called on pages in the buddy
    allocator. For compound pages, you have to use compound_order(). This is
    confusing and led to a bug, so rename page_order() to buddy_order().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • __free_pages_core() is used when exposing fresh memory to the buddy during
    system boot and when onlining memory in generic_online_page().

    generic_online_page() is used in two cases:

    1. Direct memory onlining in online_pages().
    2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
    balloon and virtio-mem), when parts of a section are kept
    fake-offline to be fake-onlined later on.

    In 1, we already place pages to the tail of the freelist. Pages will be
    freed to MIGRATE_ISOLATE lists first and moved to the tail of the
    freelists via undo_isolate_page_range().

    In 2, we currently don't implement a proper rule. In case of virtio-mem,
    where we currently always online MAX_ORDER - 1 pages, the pages will be
    placed to the HEAD of the freelist - undesirable. While the Hyper-V
    balloon calls generic_online_page() with single pages, usually it will
    call it on successive single pages in a larger block.

    The pages are fresh, so place them to the tail of the freelist and avoid
    the PCP. In __free_pages_core(), remove the now superfluous call to
    set_page_refcounted() and add a comment regarding page initialization and
    the refcount.

    Note: In 2. we currently don't shuffle. If ever relevant (page shuffling
    is usually of limited use in virtualized environments), we might want to
    shuffle after a sequence of generic_online_page() calls in the relevant
    callers.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Acked-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Mike Rapoport
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Scott Cheloha
    Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Whenever we move pages between freelists via move_to_free_list()/
    move_freepages_block(), we don't actually touch the pages:
    1. Page isolation doesn't actually touch the pages, it simply isolates
    pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
    When undoing isolation, we move the pages back to the target list.
    2. Page stealing (steal_suitable_fallback()) moves free pages directly
    between lists without touching them.
    3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
    free pages directly between freelists without touching them.

    We already place pages to the tail of the freelists when undoing
    isolation via __putback_isolated_page(), so let's do it in any case
    (e.g., if order <= pageblock_order) and document the behavior.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Acked-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mike Rapoport
    Cc: Scott Cheloha
    Cc: Michael Ellerman
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • __putback_isolated_page() already documents that pages will be placed to
    the tail of the freelist - this is, however, not the case for "order >=
    MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
    all existing users.

    This change affects two users:
    - free page reporting
    - page isolation, when undoing the isolation (including memory onlining).

    This behavior is desirable for pages that haven't really been touched
    lately, so exactly the two users that don't actually read/write page
    content, but rather move untouched pages.

    The new behavior is especially desirable for memory onlining, where we
    allow allocation of newly onlined pages via undo_isolate_page_range() in
    online_pages(). Right now, we always place them to the head of the
    freelist, resulting in undesirable behavior: Assume we add individual
    memory chunks via add_memory() and online them right away to the NORMAL
    zone. We create a dependency chain of unmovable allocations e.g., via the
    memmap. The memmap of the next chunk will be placed onto previous chunks
    - if the last block cannot get offlined+removed, all dependent ones cannot
    get offlined+removed. While this can already be observed with individual
    DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
    DLPAR).

    Document that this should only be used for optimizations, and no code
    should rely on this behavior for correctness (in case the order of the
    freelists ever changes).

    We don't need to worry about page shuffling: memory onlining already
    properly shuffles after onlining. Free page reporting doesn't care about
    physically contiguous ranges, and there are already cases where page
    isolation will simply move (physically close) free pages to (currently)
    the head of the freelists via move_freepages_block() instead of
    shuffling. If this ever becomes relevant, we should shuffle the whole
    zone when undoing isolation of larger ranges, and after
    free_contig_range().
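
    A self-contained toy model of the head-vs-tail placement these patches
    are about (not mm/page_alloc.c; the list and names below are invented):
    pages parked at the tail are handed out last, so fresh or untouched
    pages stay unused for as long as possible.

        #include <stdio.h>

        struct fpage { int id; struct fpage *next; };

        struct freelist { struct fpage *head, *tail; };

        static void add_to_head(struct freelist *fl, struct fpage *p)
        {
            p->next = fl->head;
            fl->head = p;
            if (!fl->tail)
                fl->tail = p;
        }

        static void add_to_tail(struct freelist *fl, struct fpage *p)
        {
            p->next = NULL;
            if (fl->tail)
                fl->tail->next = p;
            else
                fl->head = p;
            fl->tail = p;
        }

        static struct fpage *alloc_first(struct freelist *fl)
        {
            struct fpage *p = fl->head;    /* allocation takes from the head */

            if (p) {
                fl->head = p->next;
                if (!fl->head)
                    fl->tail = NULL;
            }
            return p;
        }

        int main(void)
        {
            struct freelist fl = { 0 };
            struct fpage old = { .id = 1 }, fresh = { .id = 2 };

            /* Old behaviour: the fresh page goes to the head, allocated first. */
            add_to_tail(&fl, &old);
            add_to_head(&fl, &fresh);
            printf("head placement: next allocation gets page %d\n",
                   alloc_first(&fl)->id);

            /* New behaviour: the fresh page goes to the tail, allocated last. */
            fl.head = fl.tail = NULL;
            add_to_tail(&fl, &old);
            add_to_tail(&fl, &fresh);
            printf("tail placement: next allocation gets page %d\n",
                   alloc_first(&fl)->id);
            return 0;
        }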

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Alexander Duyck
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mike Rapoport
    Cc: Scott Cheloha
    Cc: Michael Ellerman
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.

    When adding separate memory blocks via add_memory*() and onlining them
    immediately, the metadata (especially the memmap) of the next block will
    be placed onto one of the just added+onlined blocks. This creates a
    chain of unmovable allocations: if the last memory block cannot get
    offlined+removed, neither can all dependent ones. We directly end up
    with unmovable allocations all over the place.

    This can be observed quite easily using virtio-mem, however, it can also
    be observed when using DIMMs. The freshly onlined pages will usually be
    placed at the head of the freelists, meaning they will be allocated
    next, usually making the just-added memory immediately un-removable.
    The fresh pages are cold, so preferring to allocate other pages (that
    might be hot) also feels like the natural thing to do.

    It also applies to the Hyper-V balloon, the Xen balloon, and ppc64
    DLPAR: when adding separate, successive memory blocks, each memory block
    will have unmovable allocations on it - for example, gigantic pages will
    fail to allocate.

    While ZONE_NORMAL doesn't provide any guarantee that memory can get
    offlined+removed again (any kind of fragmentation with unmovable
    allocations is possible), there are many scenarios (hotplugging a lot of
    memory, running a workload, hot-unplugging some memory/as much as
    possible) where we can offline+remove quite a lot with this patchset.

    a) To visualize the problem, a very simple example:

    Start a VM with 4GB and 8GB of virtio-mem memory:

    [root@localhost ~]# lsmem
    RANGE SIZE STATE REMOVABLE BLOCK
    0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
    0x0000000100000000-0x000000033fffffff 9G online yes 32-103

    Memory block size: 128M
    Total online memory: 12G
    Total offline memory: 0B

    Then try to unplug as much as possible using virtio-mem. Observe which
    memory blocks are still around. Without this patch set:

    [root@localhost ~]# lsmem
    RANGE SIZE STATE REMOVABLE BLOCK
    0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
    0x0000000100000000-0x000000013fffffff 1G online yes 32-39
    0x0000000148000000-0x000000014fffffff 128M online yes 41
    0x0000000158000000-0x000000015fffffff 128M online yes 43
    0x0000000168000000-0x000000016fffffff 128M online yes 45
    0x0000000178000000-0x000000017fffffff 128M online yes 47
    0x0000000188000000-0x0000000197ffffff 256M online yes 49-50
    0x00000001a0000000-0x00000001a7ffffff 128M online yes 52
    0x00000001b0000000-0x00000001b7ffffff 128M online yes 54
    0x00000001c0000000-0x00000001c7ffffff 128M online yes 56
    0x00000001d0000000-0x00000001d7ffffff 128M online yes 58
    0x00000001e0000000-0x00000001e7ffffff 128M online yes 60
    0x00000001f0000000-0x00000001f7ffffff 128M online yes 62
    0x0000000200000000-0x0000000207ffffff 128M online yes 64
    0x0000000210000000-0x0000000217ffffff 128M online yes 66
    0x0000000220000000-0x0000000227ffffff 128M online yes 68
    0x0000000230000000-0x0000000237ffffff 128M online yes 70
    0x0000000240000000-0x0000000247ffffff 128M online yes 72
    0x0000000250000000-0x0000000257ffffff 128M online yes 74
    0x0000000260000000-0x0000000267ffffff 128M online yes 76
    0x0000000270000000-0x0000000277ffffff 128M online yes 78
    0x0000000280000000-0x0000000287ffffff 128M online yes 80
    0x0000000290000000-0x0000000297ffffff 128M online yes 82
    0x00000002a0000000-0x00000002a7ffffff 128M online yes 84
    0x00000002b0000000-0x00000002b7ffffff 128M online yes 86
    0x00000002c0000000-0x00000002c7ffffff 128M online yes 88
    0x00000002d0000000-0x00000002d7ffffff 128M online yes 90
    0x00000002e0000000-0x00000002e7ffffff 128M online yes 92
    0x00000002f0000000-0x00000002f7ffffff 128M online yes 94
    0x0000000300000000-0x0000000307ffffff 128M online yes 96
    0x0000000310000000-0x0000000317ffffff 128M online yes 98
    0x0000000320000000-0x0000000327ffffff 128M online yes 100
    0x0000000330000000-0x000000033fffffff 256M online yes 102-103

    Memory block size: 128M
    Total online memory: 8.1G
    Total offline memory: 0B

    With this patch set:

    [root@localhost ~]# lsmem
    RANGE SIZE STATE REMOVABLE BLOCK
    0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
    0x0000000100000000-0x000000013fffffff 1G online yes 32-39

    Memory block size: 128M
    Total online memory: 4G
    Total offline memory: 0B

    All memory can get unplugged and all memory blocks can get removed. Of
    course, no workload ran and the system was basically idle, but it
    highlights the issue - the fairly deterministic chain of unmovable
    allocations. When a huge page for the 2MB memmap is needed, a
    just-onlined 4MB page will be split. The remaining 2MB page will be used
    for the memmap of the next memory block. So one memory block will hold
    the memmap of the two following memory blocks. Finally the pages of the
    last-onlined memory block will get used for the next bigger allocations -
    if any allocation is unmovable, all dependent memory blocks cannot get
    unplugged and removed until that allocation is gone.

    Note that with bigger memory blocks (e.g., 256MB), *all* memory
    blocks are dependent and none can get unplugged again!

    b) Experiment with memory intensive workload

    I performed an experiment with an older version of this patch set
    (before we used undo_isolate_page_range() in online_pages()): hotplug
    56GB to a VM with an initial 4GB, onlining all memory to ZONE_NORMAL
    right from the kernel when adding it. I then ran various memory
    intensive workloads that consumed most system memory for a total of 45
    minutes. Once finished, I tried to unplug as much memory as possible.

    With this change, I was able to remove, via virtio-mem (adding
    individual 128MB memory blocks), 413 out of 448 added memory blocks;
    via individual (256MB) DIMMs, 380 out of 448 added memory blocks. (I
    don't have any numbers without this patchset, but looking at the above
    example, it's at most half of the 448 memory blocks for virtio-mem, and
    most probably none for DIMMs.)

    Again, there are workloads that might behave very differently due to the
    nature of ZONE_NORMAL.

    This change also affects (besides memory onlining):
    - Other users of undo_isolate_page_range(): Pages are always placed to the
    tail.
    -- When memory offlining fails
    -- When memory isolation fails after having isolated some pageblocks
    -- When alloc_contig_range() either succeeds or fails
    - Other users of __putback_isolated_page(): Pages are always placed to the
    tail.
    -- Free page reporting
    - Other users of __free_pages_core()
    -- AFAIK, any memory that is getting exposed to the buddy during boot.
    IIUC we will now usually allocate memory from lower addresses within
    a zone first (especially during boot).
    - Other users of generic_online_page()
    -- Hyper-V balloon

    This patch (of 5):

    Let's prepare for additional flags and avoid long parameter lists of
    bools. Follow-up patches will also make use of the flags in
    __free_pages_ok().
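
    As a sketch of the "flags instead of a growing list of bools" pattern
    this patch prepares for (the FPI_MODEL_* names below are invented for
    illustration, not the flags actually added to page_alloc):

        #include <stdio.h>

        typedef unsigned int fpi_flags_t;

        #define FPI_MODEL_NONE        ((fpi_flags_t)0)
        #define FPI_MODEL_TO_TAIL     ((fpi_flags_t)1 << 0)  /* freelist tail  */
        #define FPI_MODEL_SKIP_NOTIFY ((fpi_flags_t)1 << 1)  /* skip notifier  */

        static void free_one_page_model(int pfn, fpi_flags_t flags)
        {
            printf("free pfn %d: tail=%d skip_notify=%d\n", pfn,
                   !!(flags & FPI_MODEL_TO_TAIL),
                   !!(flags & FPI_MODEL_SKIP_NOTIFY));
        }

        int main(void)
        {
            /* One parameter carries any combination of behaviors ...      */
            free_one_page_model(42, FPI_MODEL_TO_TAIL | FPI_MODEL_SKIP_NOTIFY);
            /* ... instead of free_one_page_model(42, true, true, false).  */
            free_one_page_model(43, FPI_MODEL_NONE);
            return 0;
        }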

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Mike Rapoport
    Cc: Matthew Wilcox
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Scott Cheloha
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • On the memory onlining path, we want to start with MIGRATE_ISOLATE, to
    un-isolate the pages after memory onlining is complete. Let's allow
    passing in the migratetype.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: Dan Williams
    Cc: Mike Rapoport
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Michel Lespinasse
    Cc: Charan Teja Reddy
    Cc: Mel Gorman
    Link: https://lkml.kernel.org/r/20200819175957.28465-10-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Commit ac5d2539b238 ("mm: meminit: reduce number of times pageblocks are
    set during struct page init") moved the actual zone range check, leaving
    only the alignment check for pageblocks.

    Let's drop the stale comment and make the pageblock check easier to read.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Mel Gorman
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-9-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Callers no longer need the number of isolated pageblocks. Let's simplify.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-7-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • offline_pages() is the only user. __offline_isolated_pages() never gets
    called with ranges that contain memory holes and we no longer care about
    the return value. Drop the return value handling and all pfn_valid()
    checks.

    Update the documentation.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • This patch changes the way we set and handle in-use poisoned pages. Until
    now, poisoned pages were released to the buddy allocator, trusting that
    the checks that take place at allocation time would act as a safety net and
    would skip that page.

    This has proved to be wrong, as there are some pfn walkers out there,
    like compaction, that only care about the page being in a buddy
    freelist.

    Although this might not be the only user, having poisoned pages in the
    buddy allocator seems like a bad idea, as we should only have free pages that
    are ready and meant to be used as such.

    Before explaining the taken approach, let us break down the kind of pages
    we can soft offline.

    - Anonymous THP (after the split, they end up being 4K pages)
    - Hugetlb
    - Order-0 pages (that can be either migrated or invalidated)

    * Normal pages (order-0 and anon-THP)

    - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
    - If they are mapped/dirty, we do the isolate-and-migrate dance.

    Either way, we do not call put_page() directly from those paths.
    Instead, we keep the page and send it to page_handle_poison() to perform
    the right handling.

    page_handle_poison sets the HWPoison flag and does the last put_page.

    Down the chain, we placed a check for HWPoison pages in
    free_pages_prepare() that just skips any poisoned page, so those pages
    do not end up in any pcplist/freelist.

    After that, we set the refcount on the page to 1 and we increment
    the poisoned pages counter.

    If we see that the check in free_pages_prepare creates trouble, we can
    always do what we do for free pages:

    - wait until the page hits buddy's freelists
    - take it off, and flag it

    The downside of the above approach is that we could race with an
    allocation, so by the time we want to take the page off the buddy, the
    page has been already allocated so we cannot soft offline it.
    But the user could always retry it.

    * Hugetlb pages

    - We isolate-and-migrate them

    After the migration has been successful, we call dissolve_free_huge_page,
    and we set HWPoison on the page if we succeed.
    Hugetlb has a slightly different handling though.

    While for non-hugetlb pages we cared about closing the race with an
    allocation, doing so for hugetlb pages requires quite some additional
    and intrusive code (we would need to hook in free_huge_page and some other
    places).
    So I decided not to make the code overly complicated and to just fail
    normally if the page gets allocated in the meantime.

    We can always build on top of this.

    As a bonus, because of the way we now handle in-use pages, we no longer
    need the put-as-isolation-migratetype dance that was guarding against
    poisoned pages ending up in pcplists.
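
    A minimal userspace model of the check described above (invented names;
    not the kernel's free_pages_prepare()): the free path simply refuses to
    put a hardware-poisoned page back on any list.

        #include <stdbool.h>
        #include <stdio.h>

        struct page_model {
            bool hwpoison;
            bool on_freelist;
        };

        static bool free_page_prepare_model(struct page_model *p)
        {
            if (p->hwpoison)
                return false;          /* don't free: keep the page out of reach */
            return true;
        }

        static void free_page_model(struct page_model *p)
        {
            if (free_page_prepare_model(p))
                p->on_freelist = true; /* normal pages go back to the allocator */
        }

        int main(void)
        {
            struct page_model bad = { .hwpoison = true }, good = { 0 };

            free_page_model(&bad);
            free_page_model(&good);
            printf("poisoned on freelist: %d, healthy on freelist: %d\n",
                   bad.on_freelist, good.on_freelist);
            return 0;
        }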

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200922135650.1634-10-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • When trying to soft-offline a free page, we need to first take it off
    the buddy allocator. Once we know it is out of reach, we can safely flag
    it as poisoned.

    take_page_off_buddy will be used to take a page meant to be poisoned off
    the buddy allocator. take_page_off_buddy calls break_down_buddy_pages,
    which splits a higher-order page in case our page belongs to one.

    Once the page is under our control, we call page_handle_poison to set it
    as poisoned and grab a refcount on it.
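
    The splitting step can be sketched as follows (a self-contained model
    with invented names, not the kernel's break_down_buddy_pages()): halve
    the free block repeatedly, returning the half that does not contain the
    target pfn, until only the target page is left.

        #include <stdio.h>

        static void give_back_to_buddy(unsigned long pfn, unsigned int order)
        {
            printf("  returned to freelist: pfn %lu, order %u\n", pfn, order);
        }

        static void take_page_off_buddy_model(unsigned long block_pfn,
                                              unsigned int block_order,
                                              unsigned long target_pfn)
        {
            unsigned long cur = block_pfn;
            unsigned int order = block_order;

            while (order > 0) {
                unsigned long half = 1UL << (order - 1);

                if (target_pfn < cur + half) {
                    give_back_to_buddy(cur + half, order - 1); /* upper half unused */
                } else {
                    give_back_to_buddy(cur, order - 1);        /* lower half unused */
                    cur += half;
                }
                order--;
            }
            printf("  isolated target: pfn %lu (order-0, ready to poison)\n", cur);
        }

        int main(void)
        {
            /* Target pfn 1029 inside a free order-4 block starting at pfn 1024. */
            take_page_off_buddy_model(1024, 4, 1029);
            return 0;
        }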

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200922135650.1634-9-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • The implementation of split_page_owner() prefers a count rather than the
    old order of the page. When we support a variable size THP, we won't
    have the order at this point, but we will have the number of pages.
    So change the interface to what the caller and callee would prefer.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: SeongJae Park
    Acked-by: Kirill A. Shutemov
    Cc: Huang Ying
    Link: https://lkml.kernel.org/r/20200908195539.25896-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

16 Oct, 2020

1 commit

  • Pull networking updates from Jakub Kicinski:

    - Add redirect_neigh() BPF packet redirect helper, allowing to limit
    stack traversal in common container configs and improving TCP
    back-pressure.

    Daniel reports ~10Gbps => ~15Gbps single stream TCP performance gain.

    - Expand netlink policy support and improve policy export to user
    space. (Ge)netlink core performs request validation according to
    declared policies. Expand the expressiveness of those policies
    (min/max length and bitmasks). Allow dumping policies for particular
    commands. This is used for feature discovery by user space (instead
    of kernel version parsing or trial and error).

    - Support IGMPv3/MLDv2 multicast listener discovery protocols in
    bridge.

    - Allow more than 255 IPv4 multicast interfaces.

    - Add support for Type of Service (ToS) reflection in SYN/SYN-ACK
    packets of TCPv6.

    - In Multi-path TCP (MPTCP) support concurrent transmission of data on
    multiple subflows in a load balancing scenario. Enhance advertising
    addresses via the RM_ADDR/ADD_ADDR options.

    - Support SMC-Dv2 version of SMC, which enables multi-subnet
    deployments.

    - Allow more calls to same peer in RxRPC.

    - Support two new Controller Area Network (CAN) protocols - CAN-FD and
    ISO 15765-2:2016.

    - Add xfrm/IPsec compat layer, solving the 32bit user space on 64bit
    kernel problem.

    - Add TC actions for implementing MPLS L2 VPNs.

    - Improve nexthop code - e.g. handle various corner cases when nexthop
    objects are removed from groups better, skip unnecessary
    notifications and make it easier to offload nexthops into HW by
    converting to a blocking notifier.

    - Support adding and consuming TCP header options by BPF programs,
    opening the doors for easy experimental and deployment-specific TCP
    option use.

    - Reorganize TCP congestion control (CC) initialization to simplify
    life of TCP CC implemented in BPF.

    - Add support for shipping BPF programs with the kernel and loading
    them early on boot via the User Mode Driver mechanism, hence reusing
    all the user space infra we have.

    - Support sleepable BPF programs, initially targeting LSM and tracing.

    - Add bpf_d_path() helper for returning full path for given 'struct
    path'.

    - Make bpf_tail_call compatible with bpf-to-bpf calls.

    - Allow BPF programs to call map_update_elem on sockmaps.

    - Add BPF Type Format (BTF) support for type and enum discovery, as
    well as support for using BTF within the kernel itself (current use
    is for pretty printing structures).

    - Support listing and getting information about bpf_links via the bpf
    syscall.

    - Enhance kernel interfaces around NIC firmware update. Allow
    specifying overwrite mask to control if settings etc. are reset
    during update; report expected max time operation may take to users;
    support firmware activation without machine reboot incl. limits of
    how much impact reset may have (e.g. dropping link or not).

    - Extend ethtool configuration interface to report IEEE-standard
    counters, to limit the need for per-vendor logic in user space.

    - Adopt or extend devlink use for debug, monitoring, fw update in many
    drivers (dsa loop, ice, ionic, sja1105, qed, mlxsw, mv88e6xxx,
    dpaa2-eth).

    - In mlxsw expose critical and emergency SFP module temperature alarms.
    Refactor port buffer handling to make the defaults more suitable and
    support setting these values explicitly via the DCBNL interface.

    - Add XDP support for Intel's igb driver.

    - Support offloading TC flower classification and filtering rules to
    mscc_ocelot switches.

    - Add PTP support for Marvell Octeontx2 and PP2.2 hardware, as well as
    fixed interval period pulse generator and one-step timestamping in
    dpaa-eth.

    - Add support for various auth offloads in WiFi APs, e.g. SAE (WPA3)
    offload.

    - Add Lynx PHY/PCS MDIO module, and convert various drivers which have
    this HW to use it. Convert mvpp2 to split PCS.

    - Support Marvell Prestera 98DX3255 24-port switch ASICs, as well as
    7-port Mediatek MT7531 IP.

    - Add initial support for QCA6390 and IPQ6018 in ath11k WiFi driver,
    and wcn3680 support in wcn36xx.

    - Improve performance for packets which don't require much offloads on
    recent Mellanox NICs by 20% by making multiple packets share a
    descriptor entry.

    - Move chelsio inline crypto drivers (for TLS and IPsec) from the
    crypto subtree to drivers/net. Move MDIO drivers out of the phy
    directory.

    - Clean up a lot of W=1 warnings, reportedly the actively developed
    subsections of networking drivers should now build W=1 warning free.

    - Make sure drivers don't use in_interrupt() to dynamically adapt their
    code. Convert tasklets to use new tasklet_setup API (sadly this
    conversion is not yet complete).

    * tag 'net-next-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2583 commits)
    Revert "bpfilter: Fix build error with CONFIG_BPFILTER_UMH"
    net, sockmap: Don't call bpf_prog_put() on NULL pointer
    bpf, selftest: Fix flaky tcp_hdr_options test when adding addr to lo
    bpf, sockmap: Add locking annotations to iterator
    netfilter: nftables: allow re-computing sctp CRC-32C in 'payload' statements
    net: fix pos incrementment in ipv6_route_seq_next
    net/smc: fix invalid return code in smcd_new_buf_create()
    net/smc: fix valid DMBE buffer sizes
    net/smc: fix use-after-free of delayed events
    bpfilter: Fix build error with CONFIG_BPFILTER_UMH
    cxgb4/ch_ipsec: Replace the module name to ch_ipsec from chcr
    net: sched: Fix suspicious RCU usage while accessing tcf_tunnel_info
    bpf: Fix register equivalence tracking.
    rxrpc: Fix loss of final ack on shutdown
    rxrpc: Fix bundle counting for exclusive connections
    netfilter: restore NF_INET_NUMHOOKS
    ibmveth: Identify ingress large send packets.
    ibmveth: Switch order of ibmveth_helper calls.
    cxgb4: handle 4-tuple PEDIT to NAT mode translation
    selftests: Add VRF route leaking tests
    ...

    Linus Torvalds
     

14 Oct, 2020

11 commits

  • for_each_memblock() is used to iterate over memblock.memory in a few
    places that use data from memblock_region rather than the memory ranges.

    Introduce separate for_each_mem_region() and
    for_each_reserved_mem_region() to improve encapsulation of memblock
    internals from its users.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Ingo Molnar [x86]
    Acked-by: Thomas Bogendoerfer [MIPS]
    Acked-by: Miguel Ojeda [.clang-format]
    Cc: Andy Lutomirski
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-18-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Currently for_each_mem_range() and for_each_mem_range_rev() iterators are
    the most generic way to traverse memblock regions. As such, they have 8
    parameters and they are hardly convenient to users. Most users choose to
    utilize one of their wrappers and the only user that actually needs most
    of the parameters is memblock itself.

    To avoid yet another naming for memblock iterators, rename the existing
    for_each_mem_range[_rev]() to __for_each_mem_range[_rev]() and add new
    for_each_mem_range[_rev]() wrappers that take only index, start and end
    parameters.

    The new wrapper nicely fits into init_unavailable_mem() and will be used
    in upcoming changes to simplify memblock traversals.
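
    As an illustration of the wrapper idea (toy macros over a plain array;
    the names below are invented, not memblock's real iterators), the fully
    parameterized form stays available for internal use while callers get a
    three-parameter convenience wrapper:

        #include <stdio.h>

        struct mem_region_model { unsigned long base, size; };

        static struct mem_region_model regions_model[] = {
            { 0x00000000, 0x1000 },
            { 0x10000000, 0x4000 },
        };
        #define NR_REGIONS_MODEL 2

        /* "Full" iterator: every knob is a parameter. */
        #define __for_each_mem_range_model(i, regions, nr, p_start, p_end)     \
            for ((i) = 0; (i) < (nr) &&                                        \
                 ((*(p_start) = (regions)[i].base),                            \
                  (*(p_end)   = (regions)[i].base + (regions)[i].size), 1);    \
                 (i)++)

        /* Convenience wrapper with just index, start and end. */
        #define for_each_mem_range_model(i, p_start, p_end)                    \
            __for_each_mem_range_model(i, regions_model, NR_REGIONS_MODEL,     \
                                       p_start, p_end)

        int main(void)
        {
            unsigned long start, end;
            int i;

            for_each_mem_range_model(i, &start, &end)
                printf("range %d: [%#lx-%#lx)\n", i, start, end);
            return 0;
        }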

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Acked-by: Thomas Bogendoerfer [MIPS]
    Cc: Andy Lutomirski
    Cc: Baoquan He
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christoph Hellwig
    Cc: Daniel Axtens
    Cc: Dave Hansen
    Cc: Emil Renner Berthing
    Cc: Hari Bathini
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Jonathan Cameron
    Cc: Marek Szyprowski
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Miguel Ojeda
    Cc: Palmer Dabbelt
    Cc: Paul Mackerras
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Link: https://lkml.kernel.org/r/20200818151634.14343-11-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Here is a very rare race which leaks memory:

    Page P0 is allocated to the page cache. Page P1 is free.

    Thread A: find_get_entry():
              xas_load() returns P0
    Thread B: removes P0 from the page cache;
              P0 finds its buddy P1
    Thread C: alloc_pages(GFP_KERNEL, 1) returns P0;
              P0 has refcount 1
    Thread A: page_cache_get_speculative(P0);
              P0 has refcount 2
    Thread C: __free_pages(P0);
              P0 has refcount 1
    Thread A: put_page(P0);
              P1 is not freed

    Fix this by freeing all the pages in __free_pages() that won't be freed
    by the call to put_page(). It's usually not a good idea to split a page,
    but this is a very unlikely scenario.
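
    A self-contained model of the fix's logic (invented names; not the
    actual __free_pages() patch): when the reference count doesn't drop to
    zero, the remaining sub-pages of the non-compound allocation are freed
    explicitly, so only the first page stays pinned by the speculative
    reference.

        #include <stdio.h>

        struct page_model { int refcount; int freed; };

        static int put_page_testzero(struct page_model *p)
        {
            return --p->refcount == 0;
        }

        static void free_the_pages(struct page_model *p, unsigned int order)
        {
            unsigned long i;

            for (i = 0; i < (1UL << order); i++)
                p[i].freed = 1;
        }

        static void free_pages_model(struct page_model *page, unsigned int order)
        {
            if (put_page_testzero(page)) {
                free_the_pages(page, order);        /* normal case */
            } else {
                /* Speculative reference pins only page[0]; free the rest now. */
                while (order-- > 0)
                    free_the_pages(page + (1UL << order), order);
            }
        }

        int main(void)
        {
            /* Order-1 block; the speculating thread still holds a ref on P0. */
            struct page_model pages[2] = { { .refcount = 2 }, { 0 } };

            free_pages_model(pages, 1);
            printf("P0 freed: %d (left to the other reference holder)\n",
                   pages[0].freed);
            printf("P1 freed: %d (no longer leaked)\n", pages[1].freed);
            return 0;
        }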

    Fixes: e286781d5f2e ("mm: speculative page references")
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Mike Rapoport
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Link: https://lkml.kernel.org/r/20200926213919.26642-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Previously 'for_next_zone_zonelist_nodemask' macro parameter 'zlist' was
    unused so this patch removes it.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200917211906.30059-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • __perform_reclaim()'s single caller expects it to return 'unsigned long',
    hence change its return value and a local variable to 'unsigned long'.

    Suggested-by: Andrew Morton
    Signed-off-by: Yanfei Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200916022138.16740-1-yanfei.xu@windriver.com
    Signed-off-by: Linus Torvalds

    Yanfei Xu
     
  • finalise_ac() is just an 'epilogue' for prepare_alloc_pages(), so there
    is no need to keep both; the content of finalise_ac() can be merged into
    prepare_alloc_pages(). This makes __alloc_pages_nodemask() more
    readable.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Link: https://lkml.kernel.org/r/20200916110118.6537-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Previously, in '__init early_init_on_alloc' and '__init
    early_init_on_free', the return values from kstrtobool() were not
    handled properly, which could result in a garbage value being read from
    the variable 'bool_result'. This patch fixes the error handling.
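
    The corrected pattern, sketched with a userspace stand-in for
    kstrtobool() (the parse_bool() helper below is invented for
    illustration): only use the parsed value when the parse succeeded,
    otherwise keep the default instead of reading a garbage value.

        #include <stdbool.h>
        #include <stdio.h>
        #include <string.h>

        /* Stand-in for kstrtobool(): 0 on success, -1 on unrecognized input. */
        static int parse_bool(const char *s, bool *res)
        {
            if (!strcmp(s, "1") || !strcmp(s, "on"))  { *res = true;  return 0; }
            if (!strcmp(s, "0") || !strcmp(s, "off")) { *res = false; return 0; }
            return -1;
        }

        static bool setting_enabled = false;     /* default */

        static void early_param_model(const char *arg)
        {
            bool bool_result;

            /* The fix: check the return value before trusting bool_result. */
            if (parse_bool(arg, &bool_result) == 0)
                setting_enabled = bool_result;
            /* else: keep the default rather than an uninitialized value */
        }

        int main(void)
        {
            early_param_model("garbage");
            printf("after bad input:  %d\n", setting_enabled);
            early_param_model("on");
            printf("after good input: %d\n", setting_enabled);
            return 0;
        }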

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200916214125.28271-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Previously the flags check was split into two separate checks with two
    separate branches. Since the presence of either flag has the same effect
    on the flow, the checks can be merged and one branch avoided.

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200911092310.31136-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Previously the variable 'tmp' was initialized but not read before being
    reassigned, so the initialization can be removed.

    [akpm@linux-foundation.org: remove `tmp' altogether]

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200904132422.17387-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • In has_unmovable_pages(), the page parameter is not always the first
    page within a pageblock (see how the page pointer is passed in from
    start_isolate_page_range() after calling __first_valid_page()), so the
    check for unmovable pages could span two pageblocks.

    After this patch, the check is confined to one pageblock no matter
    whether the page is the first one in it or not, obeying the semantics of
    this function.

    This issue is found by code inspection.

    Michal said "this might lead to false negatives when an unrelated block
    would cause an isolation failure".

    Signed-off-by: Li Xinhai
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: David Hildenbrand
    Link: https://lkml.kernel.org/r/20200824065811.383266-1-lixinhai.lxh@gmail.com
    Signed-off-by: Linus Torvalds

    Li Xinhai
     
  • Patch series "mm / virtio-mem: support ZONE_MOVABLE", v5.

    When introducing virtio-mem, the semantics of ZONE_MOVABLE were rather
    unclear, which is why we special-cased ZONE_MOVABLE such that partially
    plugged blocks would never end up in ZONE_MOVABLE.

    Now that the semantics are much clearer (and are documented in patch #6),
    let's support partially plugged memory blocks in ZONE_MOVABLE, allowing
    partially plugged memory blocks to be onlined to ZONE_MOVABLE and also
    unplugging from such memory blocks. This avoids surprises when onlining
    of memory blocks suddenly fails, just because they are not completely
    populated by virtio-mem (yet).

    This is especially helpful for testing, but also paves the way for
    virtio-mem optimizations, allowing more memory to get reliably unplugged.

    Cleanup has_unmovable_pages() and set_migratetype_isolate(), providing
    better documentation of how ZONE_MOVABLE interacts with different kinds of
    unmovable pages (memory offlining vs. alloc_contig_range()).

    This patch (of 6):

    Let's move the split comment regarding bootmem allocations and memory
    holes, especially in the context of ZONE_MOVABLE, to the PageReserved()
    check.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Cc: Michal Hocko
    Cc: Michael S. Tsirkin
    Cc: Mike Kravetz
    Cc: Pankaj Gupta
    Cc: Jason Wang
    Cc: Mike Rapoport
    Cc: Qian Cai
    Link: http://lkml.kernel.org/r/20200816125333.7434-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20200816125333.7434-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

12 Oct, 2020

1 commit

  • When memory is hotplug-added or removed, min_free_kbytes should be
    recalculated based on what is expected by khugepaged. Currently, after
    hotplug, min_free_kbytes is set back to a lower default, and the higher
    default set when THP is enabled is lost.

    This change restores min_free_kbytes as expected for THP consumers.

    [vijayb@linux.microsoft.com: v5]
    Link: https://lkml.kernel.org/r/1601398153-5517-1-git-send-email-vijayb@linux.microsoft.com

    Fixes: f000565adb77 ("thp: set recommended min free kbytes")
    Signed-off-by: Vijay Balakrishna
    Signed-off-by: Andrew Morton
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Allen Pais
    Cc: Andrea Arcangeli
    Cc: "Kirill A. Shutemov"
    Cc: Oleg Nesterov
    Cc: Song Liu
    Cc:
    Link: https://lkml.kernel.org/r/1600305709-2319-2-git-send-email-vijayb@linux.microsoft.com
    Link: https://lkml.kernel.org/r/1600204258-13683-1-git-send-email-vijayb@linux.microsoft.com
    Signed-off-by: Linus Torvalds

    Vijay Balakrishna
     

04 Oct, 2020

1 commit

  • The memalloc_nocma_{save/restore} APIs can be used to skip page
    allocation from the CMA area, but there is a missing case and a page
    from the CMA area could be allocated even when the APIs are used. This
    patch handles that case to fix the potential issue.

    For now, these APIs are used to prevent long-term pinning of CMA pages.
    When long-term pinning is requested for a CMA page, it is migrated to a
    non-CMA page before pinning. This non-CMA page is allocated by using the
    memalloc_nocma_{save/restore} APIs. If the APIs don't work as intended,
    a CMA page is allocated and pinned for a long time. This long-term pin
    of a CMA page causes cma_alloc() failures and can result in wrong
    behaviour of device drivers that use cma_alloc().

    The missing case is allocation from the pcplist. The MIGRATE_MOVABLE
    pcplist could hold pages from the CMA area, so we need to skip it if
    ALLOC_CMA isn't specified.
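
    An illustrative model of the missing case (invented names; not the real
    rmqueue()/pcplist code): when ALLOC_CMA is not set and the pcplist may
    hold CMA pages, fall back to the buddy path, which can filter CMA pages
    out.

        #include <stdbool.h>
        #include <stdio.h>

        #define ALLOC_CMA_MODEL 0x1

        enum source_model { FROM_PCPLIST, FROM_BUDDY };

        static bool pcplist_may_contain_cma = true;   /* MIGRATE_MOVABLE pcplist */

        static enum source_model rmqueue_model(unsigned int alloc_flags)
        {
            /* The fix's idea: skip the pcplist when CMA pages are excluded
             * but might be sitting on it; use the buddy allocator instead. */
            if (!(alloc_flags & ALLOC_CMA_MODEL) && pcplist_may_contain_cma)
                return FROM_BUDDY;
            return FROM_PCPLIST;
        }

        int main(void)
        {
            printf("nocma allocation served from:  %s\n",
                   rmqueue_model(0) == FROM_BUDDY ? "buddy" : "pcplist");
            printf("normal allocation served from: %s\n",
                   rmqueue_model(ALLOC_CMA_MODEL) == FROM_BUDDY ? "buddy"
                                                                : "pcplist");
            return 0;
        }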

    Fixes: 8510e69c8efe ("mm/page_alloc: fix memalloc_nocma_{save/restore} APIs")
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Mel Gorman
    Link: https://lkml.kernel.org/r/1601429472-12599-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     

27 Sep, 2020

1 commit

  • Patch series "mm: fix memory to node bad links in sysfs", v3.

    Sometimes, firmware may expose interleaved memory layout like this:

    Early memory node ranges
    node 1: [mem 0x0000000000000000-0x000000011fffffff]
    node 2: [mem 0x0000000120000000-0x000000014fffffff]
    node 1: [mem 0x0000000150000000-0x00000001ffffffff]
    node 0: [mem 0x0000000200000000-0x000000048fffffff]
    node 2: [mem 0x0000000490000000-0x00000007ffffffff]

    In that case, we can see memory blocks assigned to multiple nodes in
    sysfs:

    $ ls -l /sys/devices/system/memory/memory21
    total 0
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node1 -> ../../node/node1
    lrwxrwxrwx 1 root root 0 Aug 24 05:27 node2 -> ../../node/node2
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 online
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_device
    -r--r--r-- 1 root root 65536 Aug 24 05:27 phys_index
    drwxr-xr-x 2 root root 0 Aug 24 05:27 power
    -r--r--r-- 1 root root 65536 Aug 24 05:27 removable
    -rw-r--r-- 1 root root 65536 Aug 24 05:27 state
    lrwxrwxrwx 1 root root 0 Aug 24 05:25 subsystem -> ../../../../bus/memory
    -rw-r--r-- 1 root root 65536 Aug 24 05:25 uevent
    -r--r--r-- 1 root root 65536 Aug 24 05:27 valid_zones

    The same applies in the node's directory with a memory21 link in both
    the node1 and node2's directory.

    This is wrong but doesn't prevent the system from running. However,
    when one of these memory blocks is later hot-unplugged and then
    hot-plugged again, the system detects an inconsistency in the sysfs
    layout and a BUG_ON() is raised:

    kernel BUG at /Users/laurent/src/linux-ppc/mm/memory_hotplug.c:1084!
    LE PAGE_SIZE=64K MMU=Hash SMP NR_CPUS=2048 NUMA pSeries
    Modules linked in: rpadlpar_io rpaphp pseries_rng rng_core vmx_crypto gf128mul binfmt_misc ip_tables x_tables xfs libcrc32c crc32c_vpmsum autofs4
    CPU: 8 PID: 10256 Comm: drmgr Not tainted 5.9.0-rc1+ #25
    Call Trace:
    add_memory_resource+0x23c/0x340 (unreliable)
    __add_memory+0x5c/0xf0
    dlpar_add_lmb+0x1b4/0x500
    dlpar_memory+0x1f8/0xb80
    handle_dlpar_errorlog+0xc0/0x190
    dlpar_store+0x198/0x4a0
    kobj_attr_store+0x30/0x50
    sysfs_kf_write+0x64/0x90
    kernfs_fop_write+0x1b0/0x290
    vfs_write+0xe8/0x290
    ksys_write+0xdc/0x130
    system_call_exception+0x160/0x270
    system_call_common+0xf0/0x27c

    This has been seen on PowerPC LPAR.

    The root cause of this issue is that when a node's memory is
    registered, the range used can overlap another node's range, and thus
    the memory block is registered to multiple nodes in sysfs.

    There are two issues here:

    (a) The sysfs memory and node's layouts are broken due to these
    multiple links

    (b) The link errors in link_mem_sections() should not lead to a system
    panic.

    To address (a), register_mem_sect_under_node() should not rely on the
    system state to detect whether the link operation is triggered by a
    hotplug operation or not. This is addressed by patches 1 and 2 of this
    series.

    Issue (b) will be addressed separately.

    This patch (of 2):

    The memmap_context enum is used to detect whether a memory operation is
    due to a hot-add operation or happening at boot time.

    Make it general to the hotplug operation and rename it as
    meminit_context.

    There is no functional change introduced by this patch.

    Suggested-by: David Hildenbrand
    Signed-off-by: Laurent Dufour
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J . Wysocki"
    Cc: Nathan Lynch
    Cc: Scott Cheloha
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc:
    Link: https://lkml.kernel.org/r/20200915094143.79181-1-ldufour@linux.ibm.com
    Link: https://lkml.kernel.org/r/20200915132624.9723-1-ldufour@linux.ibm.com
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     

02 Sep, 2020

1 commit

  • Daniel Borkmann says:

    ====================
    pull-request: bpf-next 2020-09-01

    The following pull-request contains BPF updates for your *net-next* tree.

    There are two small conflicts when pulling, resolve as follows:

    1) Merge conflict in tools/lib/bpf/libbpf.c between 88a82120282b ("libbpf: Factor
    out common ELF operations and improve logging") in bpf-next and 1e891e513e16
    ("libbpf: Fix map index used in error message") in net-next. Resolve by taking
    the hunk in bpf-next:

    [...]
    scn = elf_sec_by_idx(obj, obj->efile.btf_maps_shndx);
    data = elf_sec_data(obj, scn);
    if (!scn || !data) {
    pr_warn("elf: failed to get %s map definitions for %s\n",
    MAPS_ELF_SEC, obj->path);
    return -EINVAL;
    }
    [...]

    2) Merge conflict in drivers/net/ethernet/mellanox/mlx5/core/en/xsk/rx.c between
    9647c57b11e5 ("xsk: i40e: ice: ixgbe: mlx5: Test for dma_need_sync earlier for
    better performance") in bpf-next and e20f0dbf204f ("net/mlx5e: RX, Add a prefetch
    command for small L1_CACHE_BYTES") in net-next. Resolve the two locations by retaining
    net_prefetch() and taking xsk_buff_dma_sync_for_cpu() from bpf-next. Should look like:

    [...]
    xdp_set_data_meta_invalid(xdp);
    xsk_buff_dma_sync_for_cpu(xdp, rq->xsk_pool);
    net_prefetch(xdp->data);
    [...]

    We've added 133 non-merge commits during the last 14 day(s) which contain
    a total of 246 files changed, 13832 insertions(+), 3105 deletions(-).

    The main changes are:

    1) Initial support for sleepable BPF programs along with bpf_copy_from_user() helper
    for tracing to reliably access user memory, from Alexei Starovoitov.

    2) Add BPF infra for writing and parsing TCP header options, from Martin KaFai Lau.

    3) bpf_d_path() helper for returning full path for given 'struct path', from Jiri Olsa.

    4) AF_XDP support for shared umems between devices and queues, from Magnus Karlsson.

    5) Initial prep work for full BPF-to-BPF call support in libbpf, from Andrii Nakryiko.

    6) Generalize bpf_sk_storage map & add local storage for inodes, from KP Singh.

    7) Implement sockmap/hash updates from BPF context, from Lorenz Bauer.

    8) BPF xor verification for scalar types & add BPF link iterator, from Yonghong Song.

    9) Use target's prog type for BPF_PROG_TYPE_EXT prog verification, from Udip Pant.

    10) Rework BPF tracing samples to use libbpf loader, from Daniel T. Lee.

    11) Fix xdpsock sample to really cycle through all buffers, from Weqaar Janjua.

    12) Improve type safety for tun/veth XDP frame handling, from Maciej Żenczykowski.

    13) Various smaller cleanups and improvements all over the place.
    ====================

    Signed-off-by: David S. Miller

    David S. Miller
     

29 Aug, 2020

1 commit

  • 'static' and 'static noinline' function attributes make no guarantees
    that gcc/clang won't optimize them. The compiler may decide to inline a
    'static' function, and in such a case ALLOW_ERROR_INJECTION becomes
    meaningless. The compiler could have inlined __add_to_page_cache_locked()
    in one callsite and not in another. In such a case injecting errors into
    it would cause unpredictable behavior. It's worse with 'static noinline',
    which won't be inlined but can still be optimized: the compiler may
    decide to remove one argument or constant-propagate the value depending
    on the callsite.

    To avoid such issues, make sure that these functions are global
    noinline.
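
    The attribute pattern the fix relies on, sketched in plain C (the
    function name below is invented; the kernel uses its noinline macro
    rather than the raw attribute):

        #include <stdio.h>

        /* 'static' (or even 'static noinline') would let the compiler inline
         * or specialize this per call site; a global noinline definition is
         * a single, stable, out-of-line entry point for every caller. */
        __attribute__((noinline)) int injectable_step(int value)
        {
            return value * 2;        /* stands in for the real work */
        }

        int main(void)
        {
            printf("%d\n", injectable_step(21));
            return 0;
        }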

    Fixes: af3b854492f3 ("mm/page_alloc.c: allow error injection")
    Fixes: cfcbfb1382db ("mm/filemap.c: enable error injection at add_to_page_cache()")
    Signed-off-by: Alexei Starovoitov
    Signed-off-by: Daniel Borkmann
    Reviewed-by: Josef Bacik
    Link: https://lore.kernel.org/bpf/20200827220114.69225-2-alexei.starovoitov@gmail.com

    Alexei Starovoitov
     

22 Aug, 2020

2 commits

  • The following race is observed with repeated online and offline
    operations, and a delay between two successive onlines, of memory blocks
    in the movable zone.

    P1: Online the first memory block in the movable zone. The pcp struct
        values are initialized to default values, i.e., pcp->high = 0 and
        pcp->batch = 1.

    P2: Allocate pages from the movable zone.

    P1: Try to online the second memory block in the movable zone; it has
        entered online_pages() but has not yet called zone_pcp_update().

    P2: This process enters the exit path and tries to release its order-0
        pages to the pcp lists through free_unref_page_commit(). As
        pcp->high = 0 and pcp->count = 1, it proceeds to call
        free_pcppages_bulk().

    P1: Update the pcp values; the new pcp values are now, say,
        pcp->high = 378 and pcp->batch = 63.

    P2: Read the pcp's batch value using READ_ONCE() and pass it to
        free_pcppages_bulk(); the pcp values passed here are batch = 63,
        count = 1. Since the number of pages on the pcp lists is less than
        ->batch, it gets stuck in the while (list_empty(list)) loop with
        interrupts disabled, and the core hangs.

    Avoid this by ensuring that free_pcppages_bulk() is called with a count that
    matches the number of pages actually on the pcp lists.
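
    In code terms, the guard looks roughly like the following sketch of
    free_pcppages_bulk() (a kernel-context fragment with the body elided; only
    the clamp is the point):

    static void free_pcppages_bulk(struct zone *zone, int count,
                                   struct per_cpu_pages *pcp)
    {
            /*
             * Never try to free more pages than the pcp lists actually hold;
             * otherwise the list-draining loop below can spin forever on
             * empty lists with interrupts disabled.
             */
            count = min(count, pcp->count);

            /* ... drain up to 'count' pages from the pcp lists ... */
    }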

    The mentioned race is fairly easy to reproduce without [1] because the pcp
    values are not updated when the first memory block is onlined, which leaves
    P2 a wide race window between the alloc+free and the pcp struct update that
    happens when the second memory block is onlined.

    With [1], the race still exists, but the window is very narrow because the
    pcp struct values are updated when the first memory block itself is onlined.

    This is not limited to the movable zone; it could also happen with the
    normal zone (e.g., hotplug to a node that only has DMA memory, or no other
    memory yet).

    [1]: https://patchwork.kernel.org/patch/11696389/

    Fixes: 5f8dcc21211a ("page-allocator: split per-cpu list into one-list-per-migrate-type")
    Signed-off-by: Charan Teja Reddy
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Vinayak Menon
    Cc: [2.6+]
    Link: http://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org
    Signed-off-by: Linus Torvalds

    Charan Teja Reddy
     
  • The lowmem_reserve arrays provide a means of applying pressure against
    allocations from lower zones that were targeted at higher zones. Its
    values are a function of the number of pages managed by higher zones and
    are assigned by a call to the setup_per_zone_lowmem_reserve() function.

    The function is initially called at boot time by the function
    init_per_zone_wmark_min() and may be called later by accesses of the
    /proc/sys/vm/lowmem_reserve_ratio sysctl file.

    The function init_per_zone_wmark_min() was moved up from a module_init to
    a core_initcall to resolve a sequencing issue with khugepaged.
    Unfortunately this created a sequencing issue with CMA page accounting.

    The CMA pages are added to the managed page count of a zone when
    cma_init_reserved_areas() is called at boot also as a core_initcall. This
    makes it uncertain whether the CMA pages will be added to the managed page
    counts of their zones before or after the call to
    init_per_zone_wmark_min() as it becomes dependent on link order. With the
    current link order the pages are added to the managed count after the
    lowmem_reserve arrays are initialized at boot.

    This means the lowmem_reserve values at boot may be lower than the values
    used later if /proc/sys/vm/lowmem_reserve_ratio is accessed even if the
    ratio values are unchanged.

    In many cases the difference is not significant, but for example
    an ARM platform with 1GB of memory and the following memory layout

    cma: Reserved 256 MiB at 0x0000000030000000
    Zone ranges:
    DMA [mem 0x0000000000000000-0x000000002fffffff]
    Normal empty
    HighMem [mem 0x0000000030000000-0x000000003fffffff]

    would result in 0 lowmem_reserve for the DMA zone. This would allow
    userspace to deplete the DMA zone easily.

    Funnily enough,

    $ cat /proc/sys/vm/lowmem_reserve_ratio

    would fix up the situation because, as a side effect, it forces a call to
    setup_per_zone_lowmem_reserve().

    This commit breaks the link-order dependency by invoking
    init_per_zone_wmark_min() as a postcore_initcall, so that the CMA pages have
    a chance to be properly accounted in their zone(s), allowing the
    lowmem_reserve arrays to receive consistent values.
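
    In mm/page_alloc.c this amounts to changing the registration level (shown
    here as a sketch; the function itself is unchanged):

    /*
     * Run after the core_initcall that adds the CMA pages to the zones'
     * managed page counts, so the lowmem_reserve arrays are computed from
     * the final numbers.  Previously: core_initcall(init_per_zone_wmark_min).
     */
    postcore_initcall(init_per_zone_wmark_min);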

    Fixes: bc22af74f271 ("mm: update min_free_kbytes from khugepaged after core initialization")
    Signed-off-by: Doug Berger
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Jason Baron
    Cc: David Rientjes
    Cc: "Kirill A. Shutemov"
    Cc:
    Link: http://lkml.kernel.org/r/1597423766-27849-1-git-send-email-opendmb@gmail.com
    Signed-off-by: Linus Torvalds

    Doug Berger
     

15 Aug, 2020

1 commit

  • Patch series "THP prep patches".

    These are some generic cleanups and improvements which I would like merged
    into mmotm soon. The first one should be a performance improvement for all
    users of compound pages, and the others are aimed at getting code to compile
    away when CONFIG_TRANSPARENT_HUGEPAGE is disabled (i.e. small systems). They
    are also better documented and less confusing than the current prefix mixture
    of compound, hpage and thp.

    This patch (of 7):

    This removes a few instructions from functions which need to know how many
    pages are in a compound page. The storage used is either page->mapping on
    64-bit or page->index on 32-bit. Both of these are fine to overlay on
    tail pages.
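
    A rough sketch of the resulting accessors (close to what the patch adds; the
    exact struct page field placement is as described above):

    /* Cache the number of pages when the compound page is set up ... */
    static inline void set_compound_order(struct page *page, unsigned int order)
    {
            page[1].compound_order = order;
            page[1].compound_nr = 1U << order;  /* stored in the first tail page */
    }

    /* ... so readers do a single load instead of recomputing 1 << order. */
    static inline unsigned long compound_nr(struct page *page)
    {
            if (!PageHead(page))
                    return 1;
            return page[1].compound_nr;
    }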

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: William Kucharski
    Reviewed-by: Zi Yan
    Cc: David Hildenbrand
    Cc: Mike Kravetz
    Cc: "Kirill A. Shutemov"
    Link: http://lkml.kernel.org/r/20200629151959.15779-1-willy@infradead.org
    Link: http://lkml.kernel.org/r/20200629151959.15779-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

13 Aug, 2020

2 commits

  • There is a well-defined standard migration target callback. Use it
    directly.
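
    A sketch of the pattern, e.g. in the alloc_contig_range() migration path,
    where the standard callback plus a migration_target_control replaces a
    per-caller wrapper (the gfp mask and node below are illustrative):

    struct migration_target_control mtc = {
            .nid = zone_to_nid(cc->zone),   /* preferred target node */
            .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
    };

    /* Pass the standard callback directly instead of a local new_page()
     * wrapper; 'private' carries the control structure describing where
     * the target pages should be allocated. */
    ret = migrate_pages(&cc->migratepages, alloc_migration_target, NULL,
                        (unsigned long)&mtc, cc->mode, MR_CONTIG_RANGE);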

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-8-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Drop the repeated words "them" and "that", and change "the the" to "to the".

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Zi Yan
    Link: http://lkml.kernel.org/r/20200801173822.14973-10-rdunlap@infradead.org
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

08 Aug, 2020

1 commit

  • Currently, the memalloc_nocma_{save/restore} API, which prevents the CMA
    area from being used for page allocation, is implemented using
    current_gfp_context(). However, this implementation has two problems.

    First, it doesn't work for the allocation fastpath. The fastpath uses the
    original gfp_mask, since current_gfp_context() was introduced to control
    reclaim, which only happens on the slowpath. So pages can still be allocated
    from the CMA area through the fastpath even when the
    memalloc_nocma_{save/restore} APIs are used. Currently there is just one
    user of these APIs, and it has a fallback method that prevents an actual
    problem.

    Second, clearing __GFP_MOVABLE in current_gfp_context() has the side effect
    of also excluding ZONE_MOVABLE memory as an allocation target.

    To fix these problems, this patch changes how the CMA area is excluded from
    page allocation. The main point of the change is to use alloc_flags:
    alloc_flags is already the mechanism for controlling allocation behaviour,
    so it is a natural fit for excluding the CMA area.
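
    A sketch of the helper this implies (reconstructed here, so treat the
    details as approximate): the PF_MEMALLOC_NOCMA decision is folded into
    alloc_flags once, and both the fastpath and the slowpath then honour
    ALLOC_CMA:

    static inline unsigned int current_alloc_flags(gfp_t gfp_mask,
                                                   unsigned int alloc_flags)
    {
    #ifdef CONFIG_CMA
            unsigned int pflags = current->flags;

            /* Only movable allocations may dip into CMA pageblocks, and only
             * when the task is not inside memalloc_nocma_save()/restore(). */
            if (!(pflags & PF_MEMALLOC_NOCMA) &&
                gfp_migratetype(gfp_mask) == MIGRATE_MOVABLE)
                    alloc_flags |= ALLOC_CMA;
    #endif
            return alloc_flags;
    }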

    Fixes: d7fefcc8de91 ("mm/cma: add PF flag to force non cma alloc")
    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Roman Gushchin
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Link: http://lkml.kernel.org/r/1595468942-29687-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim