21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside, which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

14 commits

  • When freeing a page with an order >= shuffle_page_order, randomly select
    the front or back of the list for insertion.

    While the mm tries to defragment physical pages into huge pages, this can
    tend to make the page allocator more predictable over time. Inject
    front-or-back randomness to preserve the initial randomness established by
    shuffle_free_memory() when the kernel was booted.

    The overhead of this manipulation is constrained by only being applied
    for MAX_ORDER sized pages by default.
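
    As a rough illustration of the front-or-back coin flip, here is a
    self-contained userspace sketch with made-up helper names (coin_flip,
    free_block); it is not the kernel's shuffle code:

    /*
     * Illustrative sketch: on free, insert a block at a randomly chosen end
     * of the free list.
     */
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    struct node { int id; struct node *next, *prev; };
    struct list { struct node *head, *tail; };

    static bool coin_flip(void) { return rand() & 1; }

    static void list_add_head(struct list *l, struct node *n)
    {
        n->prev = NULL;
        n->next = l->head;
        if (l->head)
            l->head->prev = n;
        else
            l->tail = n;
        l->head = n;
    }

    static void list_add_tail(struct list *l, struct node *n)
    {
        n->next = NULL;
        n->prev = l->tail;
        if (l->tail)
            l->tail->next = n;
        else
            l->head = n;
        l->tail = n;
    }

    /* On "free", pick a random end to preserve the boot-time shuffle entropy. */
    static void free_block(struct list *free_list, struct node *n)
    {
        if (coin_flip())
            list_add_head(free_list, n);
        else
            list_add_tail(free_list, n);
    }

    int main(void)
    {
        struct list fl = { 0 };
        struct node blocks[8];

        srand((unsigned)time(NULL));
        for (int i = 0; i < 8; i++) {
            blocks[i].id = i;
            free_block(&fl, &blocks[i]);
        }
        for (struct node *p = fl.head; p; p = p->next)
            printf("%d ", p->id);
        printf("\n");
        return 0;
    }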

    [akpm@linux-foundation.org: coding-style fixes]
    Link: http://lkml.kernel.org/r/154899812788.3165233.9066631950746578517.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Kees Cook
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • In preparation for runtime randomization of the zone lists, take all
    (well, most of) the list_*() functions in the buddy allocator and put
    them in helper functions. Provide a common control point for injecting
    additional behavior when freeing pages.

    [dan.j.williams@intel.com: fix buddy list helpers]
    Link: http://lkml.kernel.org/r/155033679702.1773410.13041474192173212653.stgit@dwillia2-desk3.amr.corp.intel.com
    [vbabka@suse.cz: remove del_page_from_free_area() migratetype parameter]
    Link: http://lkml.kernel.org/r/4672701b-6775-6efd-0797-b6242591419e@suse.cz
    Link: http://lkml.kernel.org/r/154899812264.3165233.5219320056406926223.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Vlastimil Babka
    Tested-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Kees Cook
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm: Randomize free memory", v10.

    This patch (of 3):

    Randomization of the page allocator improves the average utilization of
    a direct-mapped memory-side-cache. Memory side caching is a platform
    capability that Linux has been previously exposed to in HPC
    (high-performance computing) environments on specialty platforms. In
    that instance it was a smaller pool of high-bandwidth-memory relative to
    higher-capacity / lower-bandwidth DRAM. Now, this capability is going
    to be found on general purpose server platforms where DRAM is a cache in
    front of higher latency persistent memory [1].

    Robert offered an explanation of the state of the art of Linux
    interactions with memory-side-caches [2], and I copy it here:

    It's been a problem in the HPC space:
    http://www.nersc.gov/research-and-development/knl-cache-mode-performance-coe/

    A kernel module called zonesort is available to try to help:
    https://software.intel.com/en-us/articles/xeon-phi-software

    and this abandoned patch series proposed that for the kernel:
    https://lkml.kernel.org/r/20170823100205.17311-1-lukasz.daniluk@intel.com

    Dan's patch series doesn't attempt to ensure buffers won't conflict, but
    it does reduce the chance that they will. This will make performance more
    consistent, albeit slower than "optimal" (which is near impossible to
    attain in a general-purpose kernel). That's better than forcing users to
    deploy remedies like:
    "To eliminate this gradual degradation, we have added a Stream
    measurement to the Node Health Check that follows each job;
    nodes are rebooted whenever their measured memory bandwidth
    falls below 300 GB/s."

    A replacement for zonesort was merged upstream in commit cc9aec03e58f
    ("x86/numa_emulation: Introduce uniform split capability"). With this
    numa_emulation capability, memory can be split into cache sized
    ("near-memory" sized) numa nodes. A bind operation to such a node, and
    disabling workloads on other nodes, enables full cache performance.
    However, once the workload exceeds the cache size then cache conflicts
    are unavoidable. While HPC environments might be able to tolerate
    time-scheduling of cache sized workloads, for general purpose server
    platforms, the oversubscribed cache case will be the common case.

    The worst case scenario is that a server system owner benchmarks a
    workload at boot with an un-contended cache only to see that performance
    degrade over time, even below the average cache performance due to
    excessive conflicts. Randomization clips the peaks and fills in the
    valleys of cache utilization to yield steady average performance.

    Here are some performance impact details of the patches:

    1/ An Intel internal synthetic memory bandwidth measurement tool saw a
    3X speedup in a contrived case that tries to force cache conflicts. The
    contrived case used the numa_emulation capability to force an instance of
    the benchmark to be run in two of the near-memory sized numa nodes. If
    both instances were placed on the same emulated node they would fit and
    cause zero conflicts; on separate emulated nodes without randomization
    they underutilized the cache and conflicted unnecessarily due to the
    in-order allocation per node.

    2/ A well known Java server application benchmark was run with a heap
    size that exceeded cache size by 3X. The cache conflict rate was 8%
    for the first run and degraded to 21% after page allocator aging. With
    randomization enabled the rate levelled out at 11%.

    3/ A MongoDB workload did not observe measurable difference in
    cache-conflict rates, but the overall throughput dropped by 7% with
    randomization in one case.

    4/ Mel Gorman ran his suite of performance workloads with randomization
    enabled on platforms without a memory-side-cache and saw a mix of some
    improvements and some losses [3].

    While there is potentially significant improvement for applications that
    depend on low-latency access across a wide working set, the performance
    impact may be negligible or even negative for other workloads. For this reason the
    shuffle capability defaults to off unless a direct-mapped
    memory-side-cache is detected. Even then, the page_alloc.shuffle=0
    parameter can be specified to disable the randomization on those systems.

    Outside of memory-side-cache utilization concerns there is potentially
    security benefit from randomization. Some data exfiltration and
    return-oriented-programming attacks rely on the ability to infer the
    location of sensitive data objects. The kernel page allocator, especially
    early in system boot, has predictable first-in-first-out behavior for
    physical pages. Pages are freed in physical address order when first
    onlined.

    Quoting Kees:
    "While we already have a base-address randomization
    (CONFIG_RANDOMIZE_MEMORY), attacks against the same hardware and
    memory layouts would certainly be using the predictability of
    allocation ordering (i.e. for attacks where the base address isn't
    important: only the relative positions between allocated memory).
    This is common in lots of heap-style attacks. They try to gain
    control over ordering by spraying allocations, etc.

    I'd really like to see this because it gives us something similar
    to CONFIG_SLAB_FREELIST_RANDOM but for the page allocator."

    While SLAB_FREELIST_RANDOM reduces the predictability of some local slab
    caches, it leaves the vast bulk of memory to be allocated predictably, in
    order. However, it should be noted that the concrete security benefits are
    hard to quantify, and no known CVE is mitigated by this randomization.

    Introduce shuffle_free_memory(), and its helper shuffle_zone(), to perform
    a Fisher-Yates shuffle of the page allocator 'free_area' lists when they
    are initially populated with free memory at boot and at hotplug time. Do
    this based on either the presence of a page_alloc.shuffle=Y command line
    parameter, or autodetection of a memory-side-cache (to be added in a
    follow-on patch).
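
    For reference, a minimal self-contained sketch of a Fisher-Yates shuffle
    over an array of block indices; the kernel shuffles the free_area lists in
    place with its own random source, so this is only an illustration of the
    algorithm itself:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    static void fisher_yates(int *a, int n)
    {
        for (int i = n - 1; i > 0; i--) {
            int j = rand() % (i + 1);   /* the kernel uses its own RNG */
            int tmp = a[i];

            a[i] = a[j];
            a[j] = tmp;
        }
    }

    int main(void)
    {
        int blocks[10];

        srand((unsigned)time(NULL));
        for (int i = 0; i < 10; i++)
            blocks[i] = i;              /* stand-ins for order-10 free blocks */
        fisher_yates(blocks, 10);
        for (int i = 0; i < 10; i++)
            printf("%d ", blocks[i]);
        printf("\n");
        return 0;
    }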

    The shuffling is done in terms of CONFIG_SHUFFLE_PAGE_ORDER sized free
    pages, where the default CONFIG_SHUFFLE_PAGE_ORDER is MAX_ORDER-1, i.e.
    10 (4MB); this trades off randomization granularity for time spent
    shuffling. MAX_ORDER-1 was chosen to be minimally invasive to the page
    allocator while still showing memory-side-cache behavior improvements,
    and with the expectation that the security implications of
    finer-granularity randomization are mitigated by
    CONFIG_SLAB_FREELIST_RANDOM. The performance impact of the shuffling
    appears to be in the noise compared to other memory initialization work.

    This initial randomization can be undone over time, so a follow-on patch
    is introduced to inject entropy on page free decisions. It is reasonable
    to ask whether the page-free entropy alone would be sufficient, but it is
    not, due to the in-order initial freeing of pages. At the start of that
    process, putting page1 in front of or behind page0 still keeps them close
    together; page2 is still near page1 and has a high chance of being
    adjacent. As more pages are added, ordering diversity improves, but there
    is still high page locality for the low-address pages, and this leads to
    no significant impact on the cache conflict rate.

    [1]: https://itpeernetwork.intel.com/intel-optane-dc-persistent-memory-operating-modes/
    [2]: https://lkml.kernel.org/r/AT5PR8401MB1169D656C8B5E121752FC0F8AB120@AT5PR8401MB1169.NAMPRD84.PROD.OUTLOOK.COM
    [3]: https://lkml.org/lkml/2018/10/12/309

    [dan.j.williams@intel.com: fix shuffle enable]
    Link: http://lkml.kernel.org/r/154943713038.3858443.4125180191382062871.stgit@dwillia2-desk3.amr.corp.intel.com
    [cai@lca.pw: fix SHUFFLE_PAGE_ALLOCATOR help texts]
    Link: http://lkml.kernel.org/r/20190425201300.75650-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/154899811738.3165233.12325692939590944259.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Signed-off-by: Qian Cai
    Reviewed-by: Kees Cook
    Acked-by: Michal Hocko
    Cc: Dave Hansen
    Cc: Keith Busch
    Cc: Robert Elliott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Commit 0139aa7b7fa ("mm: rename _count, field of the struct page, to
    _refcount") left out a couple of references to the old field name. Fix
    that.

    Link: http://lkml.kernel.org/r/cedf87b02eb8a6b3eac57e8e91da53fb15c3c44c.1556537475.git.baruch@tkos.co.il
    Fixes: 0139aa7b7fa ("mm: rename _count, field of the struct page, to _refcount")
    Signed-off-by: Baruch Siach
    Reviewed-by: Andrew Morton
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Baruch Siach
     
  • Most architectures do not need the memblock memory after the page
    allocator is initialized, but only a few enable ARCH_DISCARD_MEMBLOCK in
    the arch Kconfig.

    Replacing ARCH_DISCARD_MEMBLOCK with ARCH_KEEP_MEMBLOCK and inverting the
    logic makes it clear which architectures actually use memblock after
    system initialization and avoids the need to add ARCH_DISCARD_MEMBLOCK
    to the architectures that are still missing that option.

    Link: http://lkml.kernel.org/r/1556102150-32517-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michael Ellerman (powerpc)
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Tony Luck
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc: Eric Biederman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Because rmqueue_pcplist() is only called when order is 0, we don't need to
    use order as a parameter.

    Link: http://lkml.kernel.org/r/1555591709-11744-1-git-send-email-laoar.shao@gmail.com
    Signed-off-by: Yafang Shao
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yafang Shao
     
  • check_pages_isolated_cb currently accounts the whole pfn range as being
    offlined if test_pages_isolated succeeds on the range. This is based on
    the assumption that all pages in the range are freed, which is currently
    true in most cases, but it won't be with later changes, as pages marked
    as vmemmap won't be isolated.

    Move the offlined pages counting to offline_isolated_pages_cb and rely on
    __offline_isolated_pages to return the correct value.
    check_pages_isolated_cb will still do its primary job and check the pfn
    range.

    While we are at it, remove check_pages_isolated and offline_isolated_pages
    and use walk_system_ram_range directly, as done in online_pages.

    Link: http://lkml.kernel.org/r/20190408082633.2864-2-osalvador@suse.de
    Reviewed-by: David Hildenbrand
    Signed-off-by: Michal Hocko
    Signed-off-by: Oscar Salvador
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Add yet another iterator, for_each_free_mem_pfn_range_in_zone_from, and then
    use it to support initializing and freeing pages in groups no larger than
    MAX_ORDER_NR_PAGES. By doing this we can greatly improve the cache
    locality of the pages while we do several loops over them in the init and
    freeing process.

    We are able to tighten the loops further as a result of the "from"
    iterator as we can perform the initial checks for first_init_pfn in our
    first call to the iterator, and continue without the need for those checks
    via the "from" iterator. I have added this functionality in the function
    called deferred_init_mem_pfn_range_in_zone that primes the iterator and
    causes us to exit if we encounter any failure.

    On my x86_64 test system with 384GB of memory per node I saw a reduction
    in initialization time from 1.85s to 1.38s as a result of this patch.
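
    A rough, self-contained illustration of the resumable "from" iterator
    idea (the names, the range table, and the numbers below are hypothetical,
    not the kernel's memblock iterator):

    #include <stdbool.h>
    #include <stdio.h>

    struct range { unsigned long start, end; };

    struct range_iter {
        const struct range *tbl;
        int n, idx;            /* idx remembers where the last call stopped */
    };

    /* Prime with the first pfn of interest, then keep calling to resume. */
    static bool next_range_from(struct range_iter *it, unsigned long from,
                                unsigned long *s, unsigned long *e)
    {
        for (; it->idx < it->n; it->idx++) {
            const struct range *r = &it->tbl[it->idx];

            if (r->end <= from)
                continue;       /* entirely before the point of interest */
            *s = r->start > from ? r->start : from;
            *e = r->end;
            it->idx++;          /* resume after this range next time */
            return true;
        }
        return false;
    }

    int main(void)
    {
        const struct range ram[] = { {0x100, 0x200}, {0x300, 0x500}, {0x600, 0x700} };
        struct range_iter it = { ram, 3, 0 };
        unsigned long s, e;

        /* The first call performs the "first pfn" style checks once... */
        if (!next_range_from(&it, 0x380, &s, &e))
            return 0;
        printf("init [%lx, %lx)\n", s, e);
        /* ...subsequent calls continue without repeating those checks. */
        while (next_range_from(&it, 0x380, &s, &e))
            printf("init [%lx, %lx)\n", s, e);
        return 0;
    }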

    Link: http://lkml.kernel.org/r/20190405221231.12227.85836.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Reviewed-by: Pavel Tatashin
    Cc: Mike Rapoport
    Cc: Michal Hocko
    Cc: Dave Jiang
    Cc: Matthew Wilcox
    Cc: Ingo Molnar
    Cc: Khalid Aziz
    Cc: Mike Rapoport
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Cc: Laurent Dufour
    Cc: Mel Gorman
    Cc: David S. Miller
    Cc: "Kirill A. Shutemov"
    Cc: Pavel Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Introduce a new iterator for_each_free_mem_pfn_range_in_zone.

    This iterator will take care of making sure a given memory range provided
    is in fact contained within a zone. It takes care of all the bounds
    checking we were doing in deferred_grow_zone and deferred_init_memmap.
    In addition it should help to speed up the search a bit by iterating until
    the end of a range is greater than the start of the zone pfn range, and
    will exit completely if the start is beyond the end of the zone.

    Link: http://lkml.kernel.org/r/20190405221225.12227.22573.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Mike Rapoport
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: David S. Miller
    Cc: Ingo Molnar
    Cc: Khalid Aziz
    Cc: "Kirill A. Shutemov"
    Cc: Laurent Dufour
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • As best as I can tell, the meminit_pfn_in_nid call is completely redundant.
    The deferred memory initialization is already making use of
    for_each_free_mem_range, which in turn calls into __next_mem_range, which
    will only return a memory range if it matches the node ID provided,
    assuming it is not NUMA_NO_NODE.

    I am operating on the assumption that there are no zones or pg_data_t
    structures that have a NUMA node of NUMA_NO_NODE associated with them. If
    that is the case, then __next_mem_range will never return a memory range
    that doesn't match the zone's node ID, and as such the check is redundant.

    So one piece I would like to verify on this is if this works for ia64.
    Technically it was using a different approach to get the node ID, but it
    seems to have the node ID also encoded into the memblock. So I am
    assuming this is okay, but would like to get confirmation on that.

    On my x86_64 test system with 384GB of memory per node I saw a reduction
    in initialization time from 2.80s to 1.85s as a result of this patch.

    Link: http://lkml.kernel.org/r/20190405221219.12227.93957.stgit@localhost.localdomain
    Signed-off-by: Alexander Duyck
    Reviewed-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Cc: Mike Rapoport
    Cc: Dan Williams
    Cc: Dave Jiang
    Cc: David S. Miller
    Cc: Ingo Molnar
    Cc: Khalid Aziz
    Cc: "Kirill A. Shutemov"
    Cc: Laurent Dufour
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Mike Rapoport
    Cc: Pavel Tatashin
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Commit 342332e6a925 ("mm/page_alloc.c: introduce kernelcore=mirror option")
    and later patches rewrote the calculation of node spanned pages, and
    commit e506b99696a2 ("mem-hotplug: fix node spanned pages when we have a
    movable node") tried to fix it, but the current code still has problems:
    when we have a node with only zone_movable and the node id is not zero,
    the size of node spanned pages is double counted.

    That's because we have an empty normal zone, and zone_start_pfn or
    zone_end_pfn is not between arch_zone_lowest_possible_pfn and
    arch_zone_highest_possible_pfn, so we need to use clamp to constrain the
    range, just as the commit "bootmem: Reimplement __absent_pages_in_range()
    using for_each_mem_pfn_range()" did.

    e.g.
    Zone ranges:
    DMA [mem 0x0000000000001000-0x0000000000ffffff]
    DMA32 [mem 0x0000000001000000-0x00000000ffffffff]
    Normal [mem 0x0000000100000000-0x000000023fffffff]
    Movable zone start for each node
    Node 0: 0x0000000100000000
    Node 1: 0x0000000140000000
    Early memory node ranges
    node 0: [mem 0x0000000000001000-0x000000000009efff]
    node 0: [mem 0x0000000000100000-0x00000000bffdffff]
    node 0: [mem 0x0000000100000000-0x000000013fffffff]
    node 1: [mem 0x0000000140000000-0x000000023fffffff]

    node 0 DMA spanned:0xfff present:0xf9e absent:0x61
    node 0 DMA32 spanned:0xff000 present:0xbefe0 absent:0x40020
    node 0 Normal spanned:0 present:0 absent:0
    node 0 Movable spanned:0x40000 present:0x40000 absent:0
    On node 0 totalpages(node_present_pages): 1048446
    node_spanned_pages:1310719
    node 1 DMA spanned:0 present:0 absent:0
    node 1 DMA32 spanned:0 present:0 absent:0
    node 1 Normal spanned:0x100000 present:0x100000 absent:0
    node 1 Movable spanned:0x100000 present:0x100000 absent:0
    On node 1 totalpages(node_present_pages): 2097152
    node_spanned_pages:2097152
    Memory: 6967796K/12582392K available (16388K kernel code, 3686K rwdata,
    4468K rodata, 2160K init, 10444K bss, 5614596K reserved, 0K
    cma-reserved)

    It shows that the memory of node 1 is double counted.
    After this patch, the problem is fixed.

    node 0 DMA spanned:0xfff present:0xf9e absent:0x61
    node 0 DMA32 spanned:0xff000 present:0xbefe0 absent:0x40020
    node 0 Normal spanned:0 present:0 absent:0
    node 0 Movable spanned:0x40000 present:0x40000 absent:0
    On node 0 totalpages(node_present_pages): 1048446
    node_spanned_pages:1310719
    node 1 DMA spanned:0 present:0 absent:0
    node 1 DMA32 spanned:0 present:0 absent:0
    node 1 Normal spanned:0 present:0 absent:0
    node 1 Movable spanned:0x100000 present:0x100000 absent:0
    On node 1 totalpages(node_present_pages): 1048576
    node_spanned_pages:1048576
    memory: 6967796K/8388088K available (16388K kernel code, 3686K rwdata,
    4468K rodata, 2160K init, 10444K bss, 1420292K reserved, 0K
    cma-reserved)
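
    To make the clamping idea above concrete, here is a tiny standalone sketch
    with made-up pfn numbers and a simplified zone carve-up; it is not the
    kernel's spanned-pages calculation:

    #include <stdio.h>

    static unsigned long clamp_ul(unsigned long v, unsigned long lo, unsigned long hi)
    {
        return v < lo ? lo : (v > hi ? hi : v);
    }

    /* Clamp a zone's pfn limits to the node's range before counting. */
    static unsigned long spanned(unsigned long zone_lo, unsigned long zone_hi,
                                 unsigned long node_lo, unsigned long node_hi)
    {
        unsigned long start = clamp_ul(zone_lo, node_lo, node_hi);
        unsigned long end   = clamp_ul(zone_hi, node_lo, node_hi);

        return end > start ? end - start : 0;
    }

    int main(void)
    {
        /* Made-up numbers: node 1 owns pfns [0x140000, 0x240000) and all of
         * them belong to ZONE_MOVABLE, so a "normal" zone whose architectural
         * range is [0x100000, 0x140000) contributes nothing to this node.
         * Without the clamp, a naive zone_hi - zone_lo would credit the node
         * with 0x40000 pages it does not span. */
        printf("normal spanned on node 1: 0x%lx\n",
               spanned(0x100000, 0x140000, 0x140000, 0x240000));
        return 0;
    }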

    Link: http://lkml.kernel.org/r/1554178276-10372-1-git-send-email-fanglinxu@huawei.com
    Signed-off-by: Linxu Fang
    Cc: Taku Izumi
    Cc: Xishi Qiu
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Pavel Tatashin
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linxu Fang
     
  • On systems without CONTIG_ALLOC activated but that support gigantic pages,
    boot-time reserved gigantic pages can not be freed at all. This patch
    simply enables the possibility of handing those pages back to the memory
    allocator.

    Link: http://lkml.kernel.org/r/20190327063626.18421-5-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Acked-by: David S. Miller [sparc]
    Reviewed-by: Mike Kravetz
    Cc: Andy Lutomirsky
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: Heiko Carstens
    Cc: "H . Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • This condition is what allows alloc_contig_range() to be defined, so
    simplify it into a more accurately named option.

    Link: http://lkml.kernel.org/r/20190327063626.18421-4-alex@ghiti.fr
    Signed-off-by: Alexandre Ghiti
    Suggested-by: Vlastimil Babka
    Acked-by: Vlastimil Babka
    Cc: Andy Lutomirsky
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Dave Hansen
    Cc: David S. Miller
    Cc: Heiko Carstens
    Cc: "H . Peter Anvin"
    Cc: Ingo Molnar
    Cc: Martin Schwidefsky
    Cc: Michael Ellerman
    Cc: Mike Kravetz
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    Cc: Rich Felker
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexandre Ghiti
     
  • alloc_pages_exact*() allocates a page of sufficient order and then splits
    it to return only the number of pages requested. That makes it
    incompatible with __GFP_COMP, because compound pages cannot be split.

    As shown by [1], things may silently work until the requested size
    (possibly depending on the user) stops being a power of two. Then, with
    CONFIG_DEBUG_VM, a BUG_ON() triggers in split_page(). Without
    CONFIG_DEBUG_VM, the consequences are unclear.

    There are several options here, none of them great:

    1) Don't do the splitting when __GFP_COMP is passed, and return the
    whole compound page. However if caller then returns it via
    free_pages_exact(), that will be unexpected and the freeing actions
    there will be wrong.

    2) Warn and remove __GFP_COMP from the flags. But the caller may have
    really wanted it, so things may break later somewhere.

    3) Warn and return NULL. However NULL may be unexpected, especially
    for small sizes.

    This patch picks option 2, because as Michal Hocko put it: "callers wanted
    it" is much less probable than "caller is simply confused and more gfp
    flags is surely better than fewer".

    [1] https://lore.kernel.org/lkml/20181126002805.GI18977@shao2-debian/T/#u
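
    A minimal standalone sketch of option 2, warning and stripping the
    incompatible flag; the flag value and names below are stand-ins, not the
    kernel's gfp definitions:

    #include <stdio.h>

    #define SKETCH_GFP_COMP 0x1u   /* not the kernel's __GFP_COMP value */

    static unsigned int exact_alloc_flags(unsigned int gfp_mask)
    {
        if (gfp_mask & SKETCH_GFP_COMP) {
            fprintf(stderr, "alloc_exact: compound flag makes no sense, dropping it\n");
            gfp_mask &= ~SKETCH_GFP_COMP;
        }
        return gfp_mask;
    }

    int main(void)
    {
        unsigned int gfp = SKETCH_GFP_COMP | 0x10u;

        printf("flags before: %#x, after: %#x\n", gfp, exact_alloc_flags(gfp));
        return 0;
    }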

    Link: http://lkml.kernel.org/r/0c6393eb-b28d-4607-c386-862a71f09de6@suse.cz
    Signed-off-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Acked-by: Mel Gorman
    Cc: Takashi Iwai
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
     

30 Apr, 2019

1 commit

  • Make hibernate handle unmapped pages on the direct map when
    CONFIG_ARCH_HAS_SET_ALIAS=y is set. These functions allow for setting
    pages to invalid configurations, so hibernate should now check whether
    pages have valid mappings and handle them if they are unmapped when doing
    a hibernate save operation.

    Previously this checking was already done when CONFIG_DEBUG_PAGEALLOC=y
    was configured. It does not appear to have a big hibernate performance
    impact. The speed of the saving operation was measured at 819.02 MB/s
    before this change and 813.32 MB/s after.

    Before:
    [ 4.670938] PM: Wrote 171996 kbytes in 0.21 seconds (819.02 MB/s)

    After:
    [ 4.504714] PM: Wrote 178932 kbytes in 0.22 seconds (813.32 MB/s)
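
    A conceptual sketch of the check described above; every helper here is a
    hypothetical standalone stand-in rather than the kernel's set_alias or
    hibernate API:

    #include <stdbool.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SZ 4096

    static bool page_present_in_direct_map(int pfn)
    {
        return pfn % 2 == 0;    /* pretend odd pfns were unmapped */
    }

    static void map_page(int pfn)   { printf("temporarily mapping pfn %d\n", pfn); }
    static void unmap_page(int pfn) { printf("unmapping pfn %d again\n", pfn); }

    static void save_page(int pfn, const char *src, char *dst)
    {
        if (page_present_in_direct_map(pfn)) {
            memcpy(dst, src, PAGE_SZ);
            return;
        }
        map_page(pfn);          /* make the page readable for the copy */
        memcpy(dst, src, PAGE_SZ);
        unmap_page(pfn);        /* restore the invalid mapping afterwards */
    }

    int main(void)
    {
        static char src[PAGE_SZ], dst[PAGE_SZ];

        save_page(0, src, dst);
        save_page(1, src, dst);
        return 0;
    }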

    Signed-off-by: Rick Edgecombe
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Pavel Machek
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nadav Amit
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190426001143.4983-16-namit@vmware.com
    Signed-off-by: Ingo Molnar

    Rick Edgecombe
     

27 Apr, 2019

4 commits

  • Commit 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
    removed setting of the ALLOC_NOFRAGMENT flag. Bring it back.

    The runtime effect is that ALLOC_NOFRAGMENT behaviour is restored so
    that allocations are spread across local zones to avoid fragmentation
    due to mixing pageblocks as long as possible.

    Link: http://lkml.kernel.org/r/20190423120806.3503-2-aryabinin@virtuozzo.com
    Fixes: 0a79cdad5eb2 ("mm: use alloc_flags to record if kswapd can wake")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • ac.preferred_zoneref->zone passed to alloc_flags_nofragment() can be NULL,
    and the 'zone' pointer is unconditionally dereferenced in
    alloc_flags_nofragment(). Bail out on a NULL zone to avoid a potential
    crash. Currently we don't see any crashes only because
    alloc_flags_nofragment() has another bug which allows the compiler to
    optimize away all accesses to 'zone'.

    Link: http://lkml.kernel.org/r/20190423120806.3503-1-aryabinin@virtuozzo.com
    Fixes: 6bb154504f8b ("mm, page_alloc: spread allocations across zones before introducing fragmentation")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • During the development of commit 5e1f0f098b46 ("mm, compaction: capture
    a page under direct compaction"), a paranoid check was added to ensure
    that if a captured page was available after compaction that it was
    consistent with the final state of compaction. The intent was to catch
    serious programming bugs such as using a stale page pointer and causing
    corruption problems.

    However, it is possible to get a captured page even if compaction was
    unsuccessful if an interrupt triggered and happened to free pages in
    interrupt context that got merged into a suitable high-order page. It's
    highly unlikely, but Li Wang did report the following warning on s390
    occurring when testing OOM handling. Note that the warning is slightly
    edited for clarity.

    WARNING: CPU: 0 PID: 9783 at mm/page_alloc.c:3777 __alloc_pages_direct_compact+0x182/0x190
    Modules linked in: rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs
    lockd grace fscache sunrpc pkey ghash_s390 prng xts aes_s390
    des_s390 des_generic sha512_s390 zcrypt_cex4 zcrypt vmur binfmt_misc
    ip_tables xfs libcrc32c dasd_fba_mod qeth_l2 dasd_eckd_mod dasd_mod
    qeth qdio lcs ctcm ccwgroup fsm dm_mirror dm_region_hash dm_log
    dm_mod
    CPU: 0 PID: 9783 Comm: copy.sh Kdump: loaded Not tainted 5.1.0-rc 5 #1

    This patch simply removes the check entirely instead of trying to be
    clever about pages freed from interrupt context. If a serious
    programming error was introduced, it is highly likely to be caught by
    prep_new_page() instead.

    Link: http://lkml.kernel.org/r/20190419085133.GH18914@techsingularity.net
    Fixes: 5e1f0f098b46 ("mm, compaction: capture a page under direct compaction")
    Signed-off-by: Mel Gorman
    Reported-by: Li Wang
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Mikulas Patocka reported that commit 1c30844d2dfe ("mm: reclaim small
    amounts of memory when an external fragmentation event occurs") "broke"
    memory management on parisc.

    The machine is not NUMA, but the DISCONTIG model creates three pgdats for
    the following ranges even though it is a UMA machine:

    0) Start 0x0000000000000000 End 0x000000003fffffff Size 1024 MB
    1) Start 0x0000000100000000 End 0x00000001bfdfffff Size 3070 MB
    2) Start 0x0000004040000000 End 0x00000040ffffffff Size 3072 MB

    Mikulas reported:

    With the patch 1c30844d2, the kernel will incorrectly reclaim the
    first zone when it fills up, ignoring the fact that there are two
    completely free zones. Basically, it limits cache size to 1GiB.

    For example, if I run:
    # dd if=/dev/sda of=/dev/null bs=1M count=2048

    - with the proper kernel, there should be "Buffers - 2GiB"
    when this command finishes. With the patch 1c30844d2, buffers
    will consume just 1GiB or slightly more, because the kernel was
    incorrectly reclaiming them.

    The page allocator and reclaim make assumptions that pgdats really
    represent NUMA nodes and that zones represent ranges, and make decisions
    on that basis. Watermark boosting for small pgdats leads to unexpected
    results even though this would have behaved reasonably on SPARSEMEM.

    DISCONTIG is essentially deprecated and even parisc plans to move to
    SPARSEMEM, so there is no need to be fancy; this patch simply disables
    watermark boosting by default on DISCONTIGMEM.

    Link: http://lkml.kernel.org/r/20190419094335.GJ18914@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reported-by: Mikulas Patocka
    Tested-by: Mikulas Patocka
    Acked-by: Vlastimil Babka
    Cc: James Bottomley
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

20 Apr, 2019

1 commit

  • has_unmovable_pages() is used when allocating CMA and gigantic pages as
    well as by memory hotplug. The latter doesn't know how to offline the CMA
    pool properly now, but if an unused (free) CMA page is encountered, then
    has_unmovable_pages() happily considers it free memory and propagates this
    up the call chain. The memory offlining code then frees the page without a
    proper CMA tear down, which leads to accounting issues. Moreover, if the
    same memory range is onlined again, the memory never gets back to the CMA
    pool.

    State after memory offline:

    # grep cma /proc/vmstat
    nr_free_cma 205824

    # cat /sys/kernel/debug/cma/cma-kvm_cma/count
    209920

    Also, kmemleak still thinks those memory addresses are reserved (see
    below) but they have already been used by the buddy allocator after
    onlining. This patch fixes the situation by treating CMA pageblocks as
    unmovable except when has_unmovable_pages() is called as part of a CMA
    allocation.

    Offlined Pages 4096
    kmemleak: Cannot insert 0xc000201f7d040008 into the object search tree (overlaps existing)
    Call Trace:
    dump_stack+0xb0/0xf4 (unreliable)
    create_object+0x344/0x380
    __kmalloc_node+0x3ec/0x860
    kvmalloc_node+0x58/0x110
    seq_read+0x41c/0x620
    __vfs_read+0x3c/0x70
    vfs_read+0xbc/0x1a0
    ksys_read+0x7c/0x140
    system_call+0x5c/0x70
    kmemleak: Kernel memory leak detector disabled
    kmemleak: Object 0xc000201cc8000000 (size 13757317120):
    kmemleak: comm "swapper/0", pid 0, jiffies 4294937297
    kmemleak: min_count = -1
    kmemleak: count = 0
    kmemleak: flags = 0x5
    kmemleak: checksum = 0
    kmemleak: backtrace:
    cma_declare_contiguous+0x2a4/0x3b0
    kvm_cma_reserve+0x11c/0x134
    setup_arch+0x300/0x3f8
    start_kernel+0x9c/0x6e8
    start_here_common+0x1c/0x4b0
    kmemleak: Automatic memory scanning thread ended
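
    A tiny standalone sketch of the rule described above; the names and
    structure are illustrative, not the kernel's has_unmovable_pages()
    interface:

    #include <stdbool.h>
    #include <stdio.h>

    struct page_sketch { bool is_free_cma; };

    /* A free CMA page only counts as movable for the CMA allocator itself. */
    static bool page_blocks_offline(const struct page_sketch *p, bool for_cma_alloc)
    {
        if (p->is_free_cma)
            return !for_cma_alloc;   /* unmovable for hotplug, fine for CMA */
        return false;                /* other checks omitted in this sketch */
    }

    int main(void)
    {
        struct page_sketch cma_page = { .is_free_cma = true };

        printf("memory offline: %s\n",
               page_blocks_offline(&cma_page, false) ? "treat as unmovable" : "ok");
        printf("CMA allocation: %s\n",
               page_blocks_offline(&cma_page, true) ? "treat as unmovable" : "ok");
        return 0;
    }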

    [cai@lca.pw: use is_migrate_cma_page() and update commit log]
    Link: http://lkml.kernel.org/r/20190416170510.20048-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190413002623.8967-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

30 Mar, 2019

1 commit

  • Commit f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded
    memory to zones until online") introduced move_pfn_range_to_zone(), which
    calls memmap_init_zone() when onlining a memory block.
    memmap_init_zone() will reset the pagetype flags and set the migrate type
    to MOVABLE.

    However, __offline_pages() also calls undo_isolate_page_range() after
    offline_isolated_pages() to do the same thing. Because commit
    2ce13640b3f4 ("mm: __first_valid_page skip over offline pages") changed
    __first_valid_page() to skip offline pages, undo_isolate_page_range()
    here just wastes CPU cycles looping around the offlining PFN range while
    doing nothing, because __first_valid_page() will return NULL as
    offline_isolated_pages() has already marked all memory sections within
    the pfn range as offline via offline_mem_sections().

    Also, after calling the "useless" undo_isolate_page_range() here, the code
    reaches the point of no return by notifying MEM_OFFLINE. Those pages will
    be marked as MIGRATE_MOVABLE again once onlined. The only thing left to do
    is to decrease the zone's isolated pageblock counter, which would otherwise
    stay elevated and keep some page allocation paths slower, a side effect
    that the above commit introduced.

    Even if alloc_contig_range() can be used to isolate 16GB-hugetlb pages
    on ppc64, an "int" should still be enough to represent the number of
    pageblocks there. Fix an incorrect comment along the way.

    [cai@lca.pw: v4]
    Link: http://lkml.kernel.org/r/20190314150641.59358-1-cai@lca.pw
    Link: http://lkml.kernel.org/r/20190313143133.46200-1-cai@lca.pw
    Fixes: 2ce13640b3f4 ("mm: __first_valid_page skip over offline pages")
    Signed-off-by: Qian Cai
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Cc: Vlastimil Babka
    Cc: [4.13+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

13 Mar, 2019

1 commit

  • As all the memblock allocation functions return NULL in case of error
    rather than panic(), the duplicates with _nopanic suffix can be removed.

    Link: http://lkml.kernel.org/r/1548057848-15136-22-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Greg Kroah-Hartman
    Reviewed-by: Petr Mladek [printk]
    Cc: Catalin Marinas
    Cc: Christophe Leroy
    Cc: Christoph Hellwig
    Cc: "David S. Miller"
    Cc: Dennis Zhou
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Guo Ren [c-sky]
    Cc: Heiko Carstens
    Cc: Juergen Gross [Xen]
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Simek
    Cc: Paul Burton
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Rob Herring
    Cc: Rob Herring
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

06 Mar, 2019

14 commits

  • This function is only used by built-in code, which makes perfect sense
    given its purpose.

    Link: http://lkml.kernel.org/r/20190213174621.29297-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Many kernel-doc comments in mm/ have the return value descriptions
    either misformatted or omitted altogether, which makes the kernel-doc
    script unhappy:

    $ make V=1 htmldocs
    ...
    ./mm/util.c:36: info: Scanning doc for kstrdup
    ./mm/util.c:41: warning: No description found for return value of 'kstrdup'
    ./mm/util.c:57: info: Scanning doc for kstrdup_const
    ./mm/util.c:66: warning: No description found for return value of 'kstrdup_const'
    ./mm/util.c:75: info: Scanning doc for kstrndup
    ./mm/util.c:83: warning: No description found for return value of 'kstrndup'
    ...

    Fixing the formatting and adding the missing return value descriptions
    eliminates ~100 such warnings.

    Link: http://lkml.kernel.org/r/1549549644-4903-4-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The number of online NUMA nodes can't be negative either. This doesn't
    save space as the variable is used only in 32-bit context, but do it
    anyway for consistency.

    Link: http://lkml.kernel.org/r/20190201223151.GB15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Number of NUMA nodes can't be negative.

    This saves a few bytes on x86_64:

    add/remove: 0/0 grow/shrink: 4/21 up/down: 27/-265 (-238)
    Function old new delta
    hv_synic_alloc.cold 88 110 +22
    prealloc_shrinker 260 262 +2
    bootstrap 249 251 +2
    sched_init_numa 1566 1567 +1
    show_slab_objects 778 777 -1
    s_show 1201 1200 -1
    kmem_cache_init 346 345 -1
    __alloc_workqueue_key 1146 1145 -1
    mem_cgroup_css_alloc 1614 1612 -2
    __do_sys_swapon 4702 4699 -3
    __list_lru_init 655 651 -4
    nic_probe 2379 2374 -5
    store_user_store 118 111 -7
    red_zone_store 106 99 -7
    poison_store 106 99 -7
    wq_numa_init 348 338 -10
    __kmem_cache_empty 75 65 -10
    task_numa_free 186 173 -13
    merge_across_nodes_store 351 336 -15
    irq_create_affinity_masks 1261 1246 -15
    do_numa_crng_init 343 321 -22
    task_numa_fault 4760 4737 -23
    swapfile_init 179 156 -23
    hv_synic_alloc 536 492 -44
    apply_wqattrs_prepare 746 695 -51

    Link: http://lkml.kernel.org/r/20190201223029.GA15820@avx2
    Signed-off-by: Alexey Dobriyan
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • There are two early memory allocations that use
    memblock_alloc_node_nopanic() and do not check its return value.

    While this happens very early during boot and the chances that the
    allocation will fail are vanishingly small, it is still worth having
    proper checks for allocation errors.

    Link: http://lkml.kernel.org/r/1547734941-944-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: William Kucharski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • No functional change.

    Link: http://lkml.kernel.org/r/20190118235123.27843-1-richard.weiyang@gmail.com
    Signed-off-by: Wei Yang
    Reviewed-by: Pekka Enberg
    Acked-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • When calling debugfs functions, there is no need to ever check the
    return value. The function can work or not, but the code logic should
    never do something different based on this.

    Link: http://lkml.kernel.org/r/20190122152151.16139-14-gregkh@linuxfoundation.org
    Signed-off-by: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Laura Abbott
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Kroah-Hartman
     
  • Compaction is inherently race-prone as a suitable page freed during
    compaction can be allocated by any parallel task. This patch uses a
    capture_control structure to isolate a page immediately when it is freed
    by a direct compactor in the slow path of the page allocator. The
    intent is to avoid redundant scanning.

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Amean fault-both-1 0.00 ( 0.00%) 0.00 * 0.00%*
    Amean fault-both-3 2582.11 ( 0.00%) 2563.68 ( 0.71%)
    Amean fault-both-5 4500.26 ( 0.00%) 4233.52 ( 5.93%)
    Amean fault-both-7 5819.53 ( 0.00%) 6333.65 ( -8.83%)
    Amean fault-both-12 9321.18 ( 0.00%) 9759.38 ( -4.70%)
    Amean fault-both-18 9782.76 ( 0.00%) 10338.76 ( -5.68%)
    Amean fault-both-24 15272.81 ( 0.00%) 13379.55 * 12.40%*
    Amean fault-both-30 15121.34 ( 0.00%) 16158.25 ( -6.86%)
    Amean fault-both-32 18466.67 ( 0.00%) 18971.21 ( -2.73%)

    Latency is only moderately affected, but the devil is in the details. A
    closer examination indicates that base page fault latency is reduced but
    the latency of huge pages is increased as it takes greater care to
    succeed. Part of the "problem" is that allocation success rates are close
    to 100% even when under pressure and compaction gets harder.

    5.0.0-rc1 5.0.0-rc1
    selective-v3r17 capture-v3r19
    Percentage huge-3 96.70 ( 0.00%) 98.23 ( 1.58%)
    Percentage huge-5 96.99 ( 0.00%) 95.30 ( -1.75%)
    Percentage huge-7 94.19 ( 0.00%) 97.24 ( 3.24%)
    Percentage huge-12 94.95 ( 0.00%) 97.35 ( 2.53%)
    Percentage huge-18 96.74 ( 0.00%) 97.30 ( 0.58%)
    Percentage huge-24 97.07 ( 0.00%) 97.55 ( 0.50%)
    Percentage huge-30 95.69 ( 0.00%) 98.50 ( 2.95%)
    Percentage huge-32 96.70 ( 0.00%) 99.27 ( 2.65%)

    And scan rates are reduced as expected by 6% for the migration scanner
    and 29% for the free scanner indicating that there is less redundant
    work.

    Compaction migrate scanned 20815362 19573286
    Compaction free scanned 16352612 11510663
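
    As a conceptual, standalone sketch of the capture idea (the names and
    structure here are illustrative, not the kernel's capture_control API):
    a direct compactor publishes a capture slot, and the free path hands a
    suitable page straight to it instead of returning it to the free lists,
    so a parallel allocation cannot steal it.

    #include <stddef.h>
    #include <stdio.h>

    struct capture_slot {
        int wanted_order;
        int captured_page;     /* stand-in for a struct page pointer */
        int valid;
    };

    static struct capture_slot *current_capture;   /* per-task in the kernel */

    static void free_page_sketch(int page, int order)
    {
        if (current_capture && !current_capture->valid &&
            order >= current_capture->wanted_order) {
            current_capture->captured_page = page;
            current_capture->valid = 1;
            return;            /* handed straight to the compactor */
        }
        printf("page %d returned to the order-%d free list\n", page, order);
    }

    int main(void)
    {
        struct capture_slot slot = { .wanted_order = 2 };

        current_capture = &slot;   /* direct compactor registers interest */
        free_page_sketch(42, 3);   /* a suitable page is freed meanwhile  */
        current_capture = NULL;

        if (slot.valid)
            printf("compactor got page %d without rescanning\n",
                   slot.captured_page);
        return 0;
    }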

    [mgorman@techsingularity.net: remove redundant check]
    Link: http://lkml.kernel.org/r/20190201143853.GH9565@techsingularity.net
    Link: http://lkml.kernel.org/r/20190118175136.31341-23-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • When pageblocks get fragmented, watermarks are artificially boosted to
    reclaim pages to avoid further fragmentation events. However,
    compaction is often either fragmentation-neutral or moving movable pages
    away from unmovable/reclaimable pages. As the true watermarks are
    preserved, allow compaction to ignore the boost factor.
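
    A minimal standalone sketch of letting one caller ignore the boost
    (illustrative only, not the kernel's actual watermark check):

    #include <stdbool.h>
    #include <stdio.h>

    static bool watermark_ok(long nr_free, long watermark, long boost,
                             bool ignore_boost)
    {
        long target = watermark + (ignore_boost ? 0 : boost);

        return nr_free > target;
    }

    int main(void)
    {
        long free_pages = 1200, wmark = 1000, boost = 500;

        /* Allocations honour the boost, compaction checks the true watermark. */
        printf("allocation path ok: %d\n",
               watermark_ok(free_pages, wmark, boost, false));
        printf("compaction path ok: %d\n",
               watermark_ok(free_pages, wmark, boost, true));
        return 0;
    }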

    The expected impact is very slight as the main benefit is that
    compaction is slightly more likely to succeed when the system has been
    fragmented very recently. On both 1-socket and 2-socket machines for
    THP-intensive allocation during fragmentation the success rate was
    increased by less than 1% which is marginal. However, detailed tracing
    indicated that failure of migration due to a premature ENOMEM triggered
    by watermark checks were eliminated.

    Link: http://lkml.kernel.org/r/20190118175136.31341-9-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Dan Carpenter
    Cc: David Rientjes
    Cc: YueHaibing
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • In the current implementation, there are two places that isolate a range
    of pages: __offline_pages() and alloc_contig_range(). During this
    procedure, pages on the pcp lists are drained.

    Below is a brief call flow:

    __offline_pages()/alloc_contig_range()
        start_isolate_page_range()
            set_migratetype_isolate()
                drain_all_pages()
        drain_all_pages()
    Acked-by: Michal Hocko
    Acked-by: David Hildenbrand
    Reviewed-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Move the memcg_kmem_enabled() checks into the memcg kmem charge/uncharge
    functions, so the users don't have to explicitly check that condition.

    This is purely a code cleanup patch without any functional change. Only
    the order of checks in memcg_charge_slab() can potentially be changed,
    but functionally it will be the same. This should not matter as
    memcg_charge_slab() is not in the hot path.

    Link: http://lkml.kernel.org/r/20190103161203.162375-1-shakeelb@google.com
    Signed-off-by: Shakeel Butt
    Acked-by: Michal Hocko
    Cc: Johannes Weiner
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shakeel Butt
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I would appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though implicitly understood, it is always better to
    have macros for it. Replace these open encodings for an invalid node
    number with the global macro NUMA_NO_NODE. This helps remove
    NUMA-related assumptions like 'invalid node' from various places,
    redirecting them to a common definition.
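
    A trivial standalone illustration of the change (NUMA_NO_NODE is indeed
    defined as -1 in the kernel; the rest of this snippet is made up):

    #include <stdio.h>

    #define NUMA_NO_NODE (-1)        /* same value the kernel macro expands to */

    static const char *describe(int nid)
    {
        /* Instead of the open-coded: if (nid == -1) ... */
        return nid == NUMA_NO_NODE ? "no node preference" : "bound to a node";
    }

    int main(void)
    {
        printf("%s\n", describe(-1));
        printf("%s\n", describe(0));
        return 0;
    }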

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • When pages are freed at a higher order, the time the buddy allocator
    spends coalescing pages can be reduced. With a section size of 256MB, the
    hot-add latency of a single section improves from 50-60 ms to less than
    1 ms, a roughly 60x improvement. Modify external providers of the online
    callback to align with the change.

    [arunks@codeaurora.org: v11]
    Link: http://lkml.kernel.org/r/1547792588-18032-1-git-send-email-arunks@codeaurora.org
    [akpm@linux-foundation.org: remove unused local, per Arun]
    [akpm@linux-foundation.org: avoid return of void-returning __free_pages_core(), per Oscar]
    [akpm@linux-foundation.org: fix it for mm-convert-totalram_pages-and-totalhigh_pages-variables-to-atomic.patch]
    [arunks@codeaurora.org: v8]
    Link: http://lkml.kernel.org/r/1547032395-24582-1-git-send-email-arunks@codeaurora.org
    [arunks@codeaurora.org: v9]
    Link: http://lkml.kernel.org/r/1547098543-26452-1-git-send-email-arunks@codeaurora.org
    Link: http://lkml.kernel.org/r/1538727006-5727-1-git-send-email-arunks@codeaurora.org
    Signed-off-by: Arun KS
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Reviewed-by: Oscar Salvador
    Reviewed-by: Alexander Duyck
    Cc: K. Y. Srinivasan
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Dan Williams
    Cc: Vlastimil Babka
    Cc: Joonsoo Kim
    Cc: Greg Kroah-Hartman
    Cc: Mathieu Malaterre
    Cc: "Kirill A. Shutemov"
    Cc: Souptick Joarder
    Cc: Mel Gorman
    Cc: Aaron Lu
    Cc: Srivatsa Vaddagiri
    Cc: Vinayak Menon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arun KS
     
  • KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
    It triggers false positives in the allocation path:

    BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
    Read of size 8 at addr ffff88881f800000 by task swapper/0
    CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    __asan_report_load8_noabort+0x19/0x20
    memchr_inv+0x2ea/0x330
    kernel_poison_pages+0x103/0x3d5
    get_page_from_freelist+0x15e7/0x4d90

    because KASAN has not yet unpoisoned the shadow page for the allocation
    when memchr_inv() checks it, so only a stale poison pattern is found.

    Also, there are false positives in the free path:

    BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
    Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
    CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
    Call Trace:
    dump_stack+0xe0/0x19a
    print_address_description.cold.2+0x9/0x28b
    kasan_report.cold.3+0x7a/0xb5
    check_memory_region+0x22d/0x250
    memset+0x28/0x40
    kernel_poison_pages+0x29e/0x3d5
    __free_pages_ok+0x75f/0x13e0

    because KASAN adds poisoned redzones around slab objects, while page
    poisoning needs to poison the whole page.

    Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
    Signed-off-by: Qian Cai
    Acked-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qian Cai
     

22 Feb, 2019

1 commit

  • Yury Norov reported that an arm64 KVM instance could not boot after
    v5.0-rc1 and that the problem could be addressed by reverting the patches

    1c30844d2dfe272d58c ("mm: reclaim small amounts of memory when an external
    fragmentation event occurs")
    73444bc4d8f92e46a20 ("mm, page_alloc: do not wake kswapd with zone lock held")

    The problem is that a division by zero error is possible if boosting
    occurs very early in boot if the system has very little memory. This
    patch avoids the division by zero error.

    Link: http://lkml.kernel.org/r/20190213143012.GT9565@techsingularity.net
    Fixes: 1c30844d2dfe ("mm: reclaim small amounts of memory when an external fragmentation event occurs")
    Signed-off-by: Mel Gorman
    Reported-by: Yury Norov
    Tested-by: Yury Norov
    Tested-by: Will Deacon
    Acked-by: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

18 Feb, 2019

1 commit

  • This patch replaces the size + 1 value introduced with the recent fix for 1
    byte allocs with a constant value.

    The idea here is to reduce code overhead as the previous logic would have
    to read size into a register, then increment it, and write it back to
    whatever field was being used. By using a constant we can avoid those
    memory reads and arithmetic operations in favor of just encoding the
    maximum value into the operation itself.

    Fixes: 2c2ade81741c ("mm: page_alloc: fix ref bias in page_frag_alloc() for 1-byte allocs")
    Signed-off-by: Alexander Duyck
    Signed-off-by: David S. Miller

    Alexander Duyck
     

15 Feb, 2019

1 commit

  • The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
    number of references that we might need to create in the fastpath later,
    the bump-allocation fastpath only has to modify the non-atomic bias value
    that tracks the number of extra references we hold instead of the atomic
    refcount. The maximum number of allocations we can serve (under the
    assumption that no allocation is made with size 0) is nc->size, so that's
    the bias used.

    However, even when all memory in the allocation has been given away, a
    reference to the page is still held; and in the `offset < 0` slowpath, the
    page may be reused if everyone else has dropped their references.
    This means that the necessary number of references is actually
    `nc->size+1`.

    Luckily, from a quick grep, it looks like the only path that can call
    page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
    requires CAP_NET_ADMIN in the init namespace and is only intended to be
    used for kernel testing and fuzzing.

    To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
    `offset < 0` path, below the virt_to_page() call, and then repeatedly call
    writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
    with a vector consisting of 15 elements containing 1 byte each.
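
    A toy standalone model of the accounting argument above (simplified
    counters only; the real code uses struct page refcounts and a per-cpu
    cache):

    #include <stdio.h>

    #define CACHE_SIZE 4   /* stand-in for nc->size */

    int main(void)
    {
        int refcount = CACHE_SIZE;  /* buggy pre-charge: no extra reference */
        int bias = CACHE_SIZE;      /* fragments we may still hand out      */

        /* Hand out CACHE_SIZE one-byte fragments: only the bias moves. */
        for (int i = 0; i < CACHE_SIZE; i++)
            bias--;

        /* Every consumer eventually frees its fragment, dropping a reference. */
        refcount -= CACHE_SIZE;

        printf("refcount=%d bias=%d -> the page is freed while the cache still\n"
               "points at it; pre-charging CACHE_SIZE + 1 keeps one reference\n"
               "for the cache until it recycles the page\n", refcount, bias);
        return 0;
    }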

    Signed-off-by: Jann Horn
    Signed-off-by: David S. Miller

    Jann Horn