07 Apr, 2018

1 commit

  • Merge updates from Andrew Morton:

    - a few misc things

    - ocfs2 updates

    - the v9fs maintainers have been missing for a long time. I've taken
    over v9fs patch slinging.

    - most of MM

    * emailed patches from Andrew Morton: (116 commits)
    mm,oom_reaper: check for MMF_OOM_SKIP before complaining
    mm/ksm: fix interaction with THP
    mm/memblock.c: cast constant ULLONG_MAX to phys_addr_t
    headers: untangle kmemleak.h from mm.h
    include/linux/mmdebug.h: make VM_WARN* non-rvals
    mm/page_isolation.c: make start_isolate_page_range() fail if already isolated
    mm: change return type to vm_fault_t
    mm, oom: remove 3% bonus for CAP_SYS_ADMIN processes
    mm, page_alloc: wakeup kcompactd even if kswapd cannot free more memory
    kernel/fork.c: detect early free of a live mm
    mm: make counting of list_lru_one::nr_items lockless
    mm/swap_state.c: make bool enable_vma_readahead and swap_vma_readahead() static
    block_invalidatepage(): only release page if the full page was invalidated
    mm: kernel-doc: add missing parameter descriptions
    mm/swap.c: remove @cold parameter description for release_pages()
    mm/nommu: remove description of alloc_vm_area
    zram: drop max_zpage_size and use zs_huge_class_size()
    zsmalloc: introduce zs_huge_class_size()
    mm: fix races between swapoff and flush dcache
    fs/direct-io.c: minor cleanups in do_blockdev_direct_IO
    ...

    Linus Torvalds
     

06 Apr, 2018

39 commits

  • I got "oom_reaper: unable to reap pid:" messages when the victim thread
    was blocked inside free_pgtables() (which occurred after returning from
    unmap_vmas() and setting MMF_OOM_SKIP). We don't need to complain when
    exit_mmap() already set MMF_OOM_SKIP.

    Killed process 7558 (a.out) total-vm:4176kB, anon-rss:84kB, file-rss:0kB, shmem-rss:0kB
    oom_reaper: unable to reap pid:7558 (a.out)
    a.out D13272 7558 6931 0x00100084
    Call Trace:
    schedule+0x2d/0x80
    rwsem_down_write_failed+0x2bb/0x440
    call_rwsem_down_write_failed+0x13/0x20
    down_write+0x49/0x60
    unlink_file_vma+0x28/0x50
    free_pgtables+0x36/0x100
    exit_mmap+0xbb/0x180
    mmput+0x50/0x110
    copy_process.part.41+0xb61/0x1fe0
    _do_fork+0xe6/0x560
    do_syscall_64+0x74/0x230
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

    Link: http://lkml.kernel.org/r/201803221946.DHG65638.VFJHFtOSQLOMOF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • This patch fixes a corner case for KSM. When two pages belong or
    belonged to the same transparent hugepage, and they should be merged,
    KSM fails to split the page, and therefore no merging happens.

    This bug can be reproduced by:
    * making sure ksm is running (disabling ksmtuned if necessary)
    * enabling transparent hugepages
    * allocating a THP-aligned, 1-THP-sized buffer
      e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
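
    Putting these steps together, a minimal userspace reproducer along the
    lines described above could look like the sketch below (assumptions:
    x86_64 with 2MB THP and KSM enabled; the sleep needed for ksmd to scan
    is system dependent).

    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            size_t thp = 1UL << 21;            /* 2MB THP on amd64 */
            void *p;

            if (posix_memalign(&p, thp, thp))  /* THP-aligned, THP-sized buffer */
                    return 1;
            memset(p, 42, thp);                /* identical contents -> candidates for merging */
            madvise(p, thp, MADV_MERGEABLE);   /* hand the range to KSM */
            sleep(60);                         /* give ksmd time to scan and merge */
            return 0;
    }
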
    Co-authored-by: Gerald Schaefer
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     
    This fixes a warning shown when phys_addr_t is a 32-bit int and the
    kernel is compiled with clang:

    mm/memblock.c:927:15: warning: implicit conversion from 'unsigned long long'
    to 'phys_addr_t' (aka 'unsigned int') changes value from
    18446744073709551615 to 4294967295 [-Wconstant-conversion]
    r->base : ULLONG_MAX;
    ^~~~~~~~~~
    ./include/linux/kernel.h:30:21: note: expanded from macro 'ULLONG_MAX'
    #define ULLONG_MAX (~0ULL)
    ^~~~~
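
    The narrowing and the cast that silences it can be reproduced in a few
    lines of userspace C (a sketch: phys_addr_t here is just a local
    typedef standing in for the 32-bit configuration, not the memblock
    code):

    #include <stdio.h>

    typedef unsigned int phys_addr_t;   /* 32-bit phys_addr_t configuration */
    #define ULLONG_MAX (~0ULL)

    int main(void)
    {
            /* clang's -Wconstant-conversion warns here: the 64-bit constant
             * is silently narrowed when assigned to a 32-bit type... */
            phys_addr_t implicit = ULLONG_MAX;

            /* ...while an explicit cast documents the intent and silences it. */
            phys_addr_t with_cast = (phys_addr_t)ULLONG_MAX;

            printf("%u %u\n", implicit, with_cast);   /* both print 4294967295 */
            return 0;
    }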

    Link: http://lkml.kernel.org/r/20180319005645.29051-1-stefan@agner.ch
    Signed-off-by: Stefan Agner
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Catalin Marinas
    Cc: Pavel Tatashin
    Cc: Ard Biesheuvel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stefan Agner
     
    Currently <linux/slab.h> #includes <linux/kmemleak.h> for no obvious
    reason. It looks like it's only a convenience, so remove kmemleak.h
    from slab.h and add <linux/kmemleak.h> to any users of kmemleak_* that
    don't already #include it. Also remove <linux/kmemleak.h> from source
    files that do not use it.

    This is tested on i386 allmodconfig and x86_64 allmodconfig. It would
    be good to run it through the 0day bot for other $ARCHes. I have
    neither the horsepower nor the storage space for the other $ARCHes.

    Update: This patch has been extensively build-tested by both the 0day
    bot & kisskb/ozlabs build farms. Both of them reported 2 build failures
    for which patches are included here (in v2).

    [ slab.h is the second most used header file after module.h; kernel.h is
    right there with slab.h. There could be some minor error in the
    counting due to some #includes having comments after them and I didn't
    combine all of those. ]

    [akpm@linux-foundation.org: security/keys/big_key.c needs vmalloc.h, per sfr]
    Link: http://lkml.kernel.org/r/e4309f98-3749-93e1-4bb7-d9501a39d015@infradead.org
    Link: http://kisskb.ellerman.id.au/kisskb/head/13396/
    Signed-off-by: Randy Dunlap
    Reviewed-by: Ingo Molnar
    Reported-by: Michael Ellerman [2 build failures]
    Reported-by: Fengguang Wu [2 build failures]
    Reviewed-by: Andrew Morton
    Cc: Wei Yongjun
    Cc: Luis R. Rodriguez
    Cc: Greg Kroah-Hartman
    Cc: Mimi Zohar
    Cc: John Johansen
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • start_isolate_page_range() is used to set the migrate type of a set of
    pageblocks to MIGRATE_ISOLATE while attempting to start a migration
    operation. It assumes that only one thread is calling it for the
    specified range. This routine is used by CMA, memory hotplug and
    gigantic huge pages. Each of these users synchronize access to the
    range within their subsystem. However, two subsystems (CMA and gigantic
    huge pages for example) could attempt operations on the same range. If
    this happens, one thread may 'undo' the work another thread is doing.
    This can result in pageblocks being incorrectly left marked as
    MIGRATE_ISOLATE and therefore not available for page allocation.

    What is ideally needed is a way to synchronize access to a set of
    pageblocks that are undergoing isolation and migration. The only thing
    we know about these pageblocks is that they are all in the same zone. A
    per-node mutex is too coarse as we want to allow multiple operations on
    different ranges within the same zone concurrently. Instead, we will
    use the migration type of the pageblocks themselves as a form of
    synchronization.

    start_isolate_page_range sets the migration type on a set of
    pageblocks, going in order from the one associated with the smallest
    pfn to the largest pfn. The zone lock is acquired to check and set the
    migration type. When going through the list of pageblocks, check if
    MIGRATE_ISOLATE is already set. If so, this indicates another thread
    is working on this pageblock. We know exactly which pageblocks we set,
    so clean up by undoing those and return -EBUSY.

    This allows start_isolate_page_range to serve as a synchronization
    mechanism and will allow more general use by callers of these
    interfaces. Update comments in alloc_contig_range to reflect this new
    functionality.

    Each CPU holds the associated zone lock to modify or examine the
    migration type of a pageblock. And, it will only examine/update a
    single pageblock per lock acquire/release cycle.
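
    A small userspace model of the undo-on-conflict logic described above
    (illustrative only: migratetypes are modelled as a plain array and the
    zone lock as a pthread mutex, not the kernel's data structures):

    #include <errno.h>
    #include <pthread.h>
    #include <stdio.h>

    #define NBLOCKS 16
    enum { MIGRATE_MOVABLE, MIGRATE_ISOLATE };

    static int migratetype[NBLOCKS];      /* one entry per pageblock */
    static pthread_mutex_t zone_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Mark pageblocks [start, end) MIGRATE_ISOLATE in pfn order; if one is
     * already isolated, another caller owns it: undo our own marks, -EBUSY. */
    static int start_isolate_range(int start, int end)
    {
            int i, undo;

            for (i = start; i < end; i++) {
                    pthread_mutex_lock(&zone_lock);    /* one block per lock hold */
                    if (migratetype[i] == MIGRATE_ISOLATE) {
                            pthread_mutex_unlock(&zone_lock);
                            goto undo;
                    }
                    migratetype[i] = MIGRATE_ISOLATE;
                    pthread_mutex_unlock(&zone_lock);
            }
            return 0;
    undo:
            for (undo = start; undo < i; undo++) {     /* only what we set ourselves */
                    pthread_mutex_lock(&zone_lock);
                    migratetype[undo] = MIGRATE_MOVABLE;
                    pthread_mutex_unlock(&zone_lock);
            }
            return -EBUSY;
    }

    int main(void)
    {
            printf("first caller:  %d\n", start_isolate_range(2, 8));   /* 0 */
            printf("second caller: %d\n", start_isolate_range(4, 10));  /* -EBUSY */
            return 0;
    }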

    Link: http://lkml.kernel.org/r/20180309224731.16978-1-mike.kravetz@oracle.com
    Signed-off-by: Mike Kravetz
    Reviewed-by: Andrew Morton
    Cc: KAMEZAWA Hiroyuki
    Cc: Luiz Capitulino
    Cc: Michal Nazarewicz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
    Since the 2.6 kernel, the oom killer has been slightly biased away from
    CAP_SYS_ADMIN processes by discounting some of their memory usage in
    comparison to other processes.

    This has always been implicit and nothing exactly relies on the
    behavior.

    Gaurav notices that __task_cred() can dereference a potentially freed
    pointer if the task under consideration is exiting because a reference
    to the task_struct is not held.

    Remove the CAP_SYS_ADMIN bias so that all processes are treated equally.

    If any CAP_SYS_ADMIN process would like to be biased against, it is
    always allowed to adjust /proc/pid/oom_score_adj.
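
    For reference, a tiny userspace sketch of adjusting that knob (values
    range from -1000 to 1000; positive values make the process more likely
    to be chosen, and lowering the value generally requires privilege such
    as CAP_SYS_RESOURCE):

    #include <stdio.h>

    int main(void)
    {
            FILE *f = fopen("/proc/self/oom_score_adj", "w");

            if (!f) {
                    perror("fopen");
                    return 1;
            }
            /* A privileged daemon that relied on the old implicit bonus can
             * request an explicit bias instead. */
            fprintf(f, "%d\n", -100);
            fclose(f);
            return 0;
    }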

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803071548510.6996@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reported-by: Gaurav Kohli
    Acked-by: Michal Hocko
    Cc: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    Kswapd will not wake up if per-zone watermarks are not failing or if
    too many previous attempts at background reclaim have failed.

    This can be true if there is a lot of free memory available. For high-
    order allocations, kswapd is responsible for waking up kcompactd for
    background compaction. If the zone is not below its watermarks or
    reclaim has recently failed (lots of free memory, nothing left to
    reclaim), kcompactd does not get woken up.

    When __GFP_DIRECT_RECLAIM is not allowed, allow kcompactd to still be
    woken up even if kswapd will not reclaim. This allows high-order
    allocations, such as thp, to still trigger background compaction even
    when the zone has an abundance of free memory.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803111659420.209721@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
    During slab reclaim for a memcg, shrink_slab iterates over all
    registered shrinkers in the system and tries to count and consume
    objects related to the cgroup. Under memory pressure this behaves
    badly: I observe high system time and much time spent in
    list_lru_count_one() for many processes on a RHEL7 kernel.

    This patch makes list_lru_node::memcg_lrus RCU protected, which allows
    us to skip taking the spinlock in list_lru_count_one().

    With the patch, Shakeel Butt observes a significant change in the perf
    graph. He says:

    ========================================================================
    Setup: running a fork-bomb in a memcg of 200MiB on a 8GiB and 4 vcpu
    VM and recording the trace with 'perf record -g -a'.

    The trace without the patch:

    + 34.19% fb.sh [kernel.kallsyms] [k] queued_spin_lock_slowpath
    + 30.77% fb.sh [kernel.kallsyms] [k] _raw_spin_lock
    + 3.53% fb.sh [kernel.kallsyms] [k] list_lru_count_one
    + 2.26% fb.sh [kernel.kallsyms] [k] super_cache_count
    + 1.68% fb.sh [kernel.kallsyms] [k] shrink_slab
    + 0.59% fb.sh [kernel.kallsyms] [k] down_read_trylock
    + 0.48% fb.sh [kernel.kallsyms] [k] _raw_spin_unlock_irqrestore
    + 0.38% fb.sh [kernel.kallsyms] [k] shrink_node_memcg
    + 0.32% fb.sh [kernel.kallsyms] [k] queue_work_on
    + 0.26% fb.sh [kernel.kallsyms] [k] count_shadow_nodes

    With the patch:

    + 0.16% swapper [kernel.kallsyms] [k] default_idle
    + 0.13% oom_reaper [kernel.kallsyms] [k] mutex_spin_on_owner
    + 0.05% perf [kernel.kallsyms] [k] copy_user_generic_string
    + 0.05% init.real [kernel.kallsyms] [k] wait_consider_task
    + 0.05% kworker/0:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/2:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/3:1 [kernel.kallsyms] [k] finish_task_switch
    + 0.04% kworker/1:0 [kernel.kallsyms] [k] finish_task_switch
    + 0.03% binary [kernel.kallsyms] [k] copy_page
    ========================================================================

    Thanks Shakeel for the testing.

    [ktkhai@virtuozzo.com: v2]
    Link: http://lkml.kernel.org/r/151203869520.3915.2587549826865799173.stgit@localhost.localdomain
    Link: http://lkml.kernel.org/r/150583358557.26700.8490036563698102569.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Tested-by: Shakeel Butt
    Acked-by: Vladimir Davydov
    Cc: Andrey Ryabinin
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • The bool enable_vma_readahead and swap_vma_readahead() are local to the
    source and do not need to be in global scope, so make them static.

    Cleans up sparse warnings:

    mm/swap_state.c:41:6: warning: symbol 'enable_vma_readahead' was not declared. Should it be static?
    mm/swap_state.c:742:13: warning: symbol 'swap_vma_readahead' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180223164852.5159-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • Link: http://lkml.kernel.org/r/1519585191-10180-4-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • The 'cold' parameter was removed from release_pages function by commit
    c6f92f9fbe7d ("mm: remove cold parameter for release_pages").

    Update the description to match the code.

    Link: http://lkml.kernel.org/r/1519585191-10180-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
    The alloc_vm_area() in nommu is a stub, but its description states it
    allocates kernel address space. Remove the description to make the
    code and the documentation agree.

    Link: http://lkml.kernel.org/r/1519585191-10180-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "zsmalloc/zram: drop zram's max_zpage_size", v3.

    ZRAM's max_zpage_size is a bad thing. It forces zsmalloc to store
    normal objects as huge ones, which results in bigger zsmalloc memory
    usage. Drop it and use the actual zsmalloc huge-class value when
    deciding whether an object is huge or not.

    This patch (of 2):

    Not every object can share its zspage with other objects, e.g. when
    the object is as big as a zspage or nearly as big as a zspage. For
    such objects zsmalloc has a so-called huge class: every object which
    belongs to the huge class consumes the entire zspage (which consists
    of a physical page). On an x86_64, PAGE_SHIFT 12 box, the first
    non-huge class size is 3264, so starting down from size 3264, objects
    can share page(-s) and thus minimize memory wastage.

    ZRAM, however, has its own statically defined watermark for huge
    objects, namely "3 * PAGE_SIZE / 4 = 3072", and forcibly stores every
    object larger than this watermark (3072) as a PAGE_SIZE object, in other
    words, to a huge class, while zsmalloc can keep some of those objects in
    non-huge classes. This results in increased memory consumption.

    zsmalloc knows better whether the object is huge or not. Introduce a
    zs_huge_class_size() function which tells whether the given object can
    be stored in one of the non-huge classes. This will let us drop ZRAM's
    huge object watermark and fully rely on zsmalloc when we decide
    whether the object is huge.
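
    For illustration, a tiny userspace calculation of the gap described
    above (assumes the quoted numbers: PAGE_SIZE of 4096 and a first
    non-huge class size of 3264):

    #include <stdio.h>

    int main(void)
    {
            const int page_size = 4096;                      /* x86_64, PAGE_SHIFT 12 */
            const int zram_watermark = 3 * page_size / 4;    /* 3072, zram's static cutoff */
            const int first_non_huge = 3264;                 /* value quoted in the changelog */

            /* Objects in (3072, 3264] were forced into the huge class by zram
             * even though zsmalloc could still pack them into shared zspages. */
            printf("zram watermark: %d\n", zram_watermark);
            printf("sizes forced huge unnecessarily: %d..%d\n",
                   zram_watermark + 1, first_non_huge);
            return 0;
    }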

    [sergey.senozhatsky.work@gmail.com: add pool param to zs_huge_class_size()]
    Link: http://lkml.kernel.org/r/20180314081833.1096-2-sergey.senozhatsky@gmail.com
    Link: http://lkml.kernel.org/r/20180306070639.7389-2-sergey.senozhatsky@gmail.com
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Thanks to commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB
    trunks"), after swapoff the address_space associated with the swap
    device will be freed. So page_mapping() users which may touch the
    address_space need some kind of mechanism to prevent the address_space
    from being freed during accessing.

    The dcache flushing functions (flush_dcache_page(), etc) in architecture
    specific code may access the address_space of swap device for anonymous
    pages in swap cache via page_mapping() function. But in some cases
    there are no mechanisms to prevent the swap device from being swapoff,
    for example,

    CPU1                               CPU2
    __get_user_pages()                 swapoff()
      flush_dcache_page()
        mapping = page_mapping()
          ...                            exit_swap_address_space()
          ...                              kvfree(spaces)
          mapping_mapped(mapping)

    The address space may be accessed after being freed.

    But per cachetlb.txt and Russell King, flush_dcache_page() only cares
    about file cache pages; for anonymous pages, flush_anon_page() should
    be used. The implementation of flush_dcache_page() in all
    architectures follows this too. They will check whether page_mapping()
    is NULL and whether mapping_mapped() is true to determine whether to
    flush the dcache immediately, and they will use the interval tree
    (mapping->i_mmap) to find all user space mappings. But
    mapping_mapped() and mapping->i_mmap aren't used by anonymous pages in
    swap cache at all.

    So, to fix the race between swapoff and flush dcache,
    page_mapping_file() is added to return the address_space for file
    cache pages and NULL otherwise. All page_mapping() calls in the flush
    dcache functions are replaced with page_mapping_file().

    [akpm@linux-foundation.org: simplify page_mapping_file(), per Mike]
    Link: http://lkml.kernel.org/r/20180305083634.15174-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Andrew Morton
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Chen Liqin
    Cc: Russell King
    Cc: Yoshinori Sato
    Cc: "James E.J. Bottomley"
    Cc: Guan Xuetao
    Cc: "David S. Miller"
    Cc: Chris Zankel
    Cc: Vineet Gupta
    Cc: Ley Foon Tan
    Cc: Ralf Baechle
    Cc: Andi Kleen
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • When device-dax is operating in huge-page mode we want it to behave like
    hugetlbfs and report the MMU page mapping size that is being enforced by
    the vma.

    Similar to commit 31383c6865a5 "mm, hugetlbfs: introduce ->split() to
    vm_operations_struct" it would be messy to teach vma_mmu_pagesize()
    about device-dax page mapping sizes in the same (hstate) way that
    hugetlbfs communicates this attribute. Instead, these patches introduce
    a new ->pagesize() vm operation.

    Link: http://lkml.kernel.org/r/151996254734.27922.15813097401404359642.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Jane Chu
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • Patch series "mm, smaps: MMUPageSize for device-dax", v3.

    Similar to commit 31383c6865a5 ("mm, hugetlbfs: introduce ->split() to
    vm_operations_struct") here is another occasion where we want
    special-case hugetlbfs/hstate enabling to also apply to device-dax.

    This prompts the question what other hstate conversions we might do
    beyond ->split() and ->pagesize(), but this appears to be the last of
    the usages of hstate_vma() in generic/non-hugetlbfs specific code paths.

    This patch (of 3):

    The current powerpc definition of vma_mmu_pagesize() open codes looking
    up the page size via hstate. It is identical to the generic
    vma_kernel_pagesize() implementation.

    Now, vma_kernel_pagesize() is growing support for determining the page
    size of Device-DAX vmas in addition to the existing Hugetlbfs page size
    determination.

    Ideally, if the powerpc vma_mmu_pagesize() used vma_kernel_pagesize() it
    would automatically benefit from any new vma-type support that is added
    to vma_kernel_pagesize(). However, the powerpc vma_mmu_pagesize() is
    prevented from calling vma_kernel_pagesize() due to a circular header
    dependency that requires vma_mmu_pagesize() to be defined before
    including <linux/hugetlb.h>.

    Break this circular dependency by defining the default vma_mmu_pagesize()
    as a __weak symbol to be overridden by the powerpc version.

    Link: http://lkml.kernel.org/r/151996254179.27922.2213728278535578744.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reviewed-by: Andrew Morton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Jane Chu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     
  • - Fixed style error: 8 spaces -> 1 tab.
    - Fixed style warning: Corrected misleading indentation.

    Link: http://lkml.kernel.org/r/20180302210254.31888-1-marioleinweber@web.de
    Signed-off-by: Mario Leinweber
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mario Leinweber
     
    When a page is freed back to the global pool, its buddy will be
    checked to see if it's possible to do a merge. This requires accessing
    the buddy's page structure, and that access could take a long time if
    it's cache cold.

    This patch adds a prefetch of the to-be-freed page's buddy outside of
    zone->lock, in the hope that accessing the buddy's page structure
    later under zone->lock will be faster. Since we *always* do buddy
    merging and check an order-0 page's buddy to try to merge it when it
    goes into the main allocator, the cacheline will always come in, i.e.
    the prefetched data will never be unused.

    Normally the number of prefetches will be pcp->batch (default 31, with
    an upper limit of (PAGE_SHIFT * 8) = 96 on x86_64), but in the case
    where the pcp's pages all get drained it will be pcp->count, which has
    an upper limit of pcp->high. pcp->high, although it has a default
    value of 186 (pcp->batch = 31 * 6), can be changed by the user through
    /proc/sys/vm/percpu_pagelist_fraction, and there is no software upper
    limit, so it could be large, like several thousand. For this reason,
    only the buddies of the first pcp->batch pages are prefetched, to
    avoid excessive prefetching.

    In the meantime, there are two concerns:

    1. the prefetch could potentially evict existing cachelines, especially
    for L1D cache since it is not huge

    2. there is some additional instruction overhead, namely calculating
    buddy pfn twice

    For 1, it's hard to say; this microbenchmark shows a good result, but
    the actual benefit of this patch will be workload/CPU dependent.

    For 2, since the calculation is an XOR on two local variables, it's
    expected that in many cases the cycles spent will be offset by the
    reduced memory latency later. This is especially true for NUMA
    machines where multiple CPUs are contending on zone->lock, and the
    most time-consuming part under zone->lock is the wait for the 'struct
    page' cachelines of the to-be-freed pages and their buddies.
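
    As a sketch of the two pieces mentioned above, the buddy pfn is just
    an XOR on a local variable and the prefetch is a single hint
    (__builtin_prefetch stands in for the kernel's prefetch(), and the
    struct page here is a stand-in, not the kernel's):

    #include <stdio.h>

    struct page { unsigned long flags; };    /* stand-in for struct page */
    static struct page pages[1 << 10];

    /* The buddy of a pfn at a given order differs only in bit 'order',
     * so the "calculate buddy pfn twice" overhead is one XOR per page. */
    static unsigned long buddy_pfn(unsigned long pfn, unsigned int order)
    {
            return pfn ^ (1UL << order);
    }

    int main(void)
    {
            unsigned long pfn = 42;
            unsigned long buddy = buddy_pfn(pfn, 0);

            /* Warm the cacheline of the buddy's page structure before it is
             * actually touched later (in the kernel, under zone->lock). */
            __builtin_prefetch(&pages[buddy]);

            printf("pfn %lu, order-0 buddy %lu\n", pfn, buddy);
            return 0;
    }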

    Test with will-it-scale/page_fault1 full load:

    kernel       Broadwell(2S)   Skylake(2S)     Broadwell(4S)   Skylake(4S)
    v4.16-rc2+   9034215         7971818         13667135        15677465
    patch2/3     9536374 +5.6%   8314710 +4.3%   14070408 +3.0%  16675866 +6.4%
    this patch   10180856 +6.8%  8506369 +2.3%   14756865 +4.9%  17325324 +3.9%

    Note: this patch's performance improvement percent is against patch2/3.

    (Changelog stolen from Dave Hansen and Mel Gorman's comments at
    http://lkml.kernel.org/r/148a42d8-8306-2f2f-7f7c-86bc118f8ccd@intel.com)

    [aaron.lu@intel.com: use helper function, avoid disordering pages]
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Link: http://lkml.kernel.org/r/20180320113146.GB24737@intel.com
    [aaron.lu@intel.com: v4]
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Link: http://lkml.kernel.org/r/20180309082431.GB30868@intel.com
    Link: http://lkml.kernel.org/r/20180301062845.26038-4-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Suggested-by: Ying Huang
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Kemi Wang
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Tim Chen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
    When freeing a batch of pages from Per-CPU-Pages (PCP) back to buddy,
    zone->lock is held and then pages are chosen from the PCP's
    migratetype list. There is actually no need to do this 'choose' part
    under the lock: these are PCP pages, so the only CPU that can touch
    them is us, and irqs are also disabled.

    Moving this part outside could reduce lock held time and improve
    performance. Test with will-it-scale/page_fault1 full load:

    kernel       Broadwell(2S)   Skylake(2S)     Broadwell(4S)   Skylake(4S)
    v4.16-rc2+   9034215         7971818         13667135        15677465
    this patch   9536374 +5.6%   8314710 +4.3%   14070408 +3.0%  16675866 +6.4%

    What the test does is: start $nr_cpu processes, each of which
    repeatedly does the following for 5 minutes:

    - mmap 128M of anonymous space

    - write-access that space

    - munmap it.

    The score is the aggregated iteration count.

    https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
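
    A stripped-down sketch of one such iteration (the real benchmark at
    the URL above runs this loop in $nr_cpu processes and counts
    iterations):

    #include <string.h>
    #include <sys/mman.h>

    #define SIZE (128UL << 20)               /* 128M of anonymous memory */

    int main(void)
    {
            char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            if (p == MAP_FAILED)
                    return 1;
            memset(p, 1, SIZE);              /* write-fault every page in */
            munmap(p, SIZE);                 /* pages go back through the pcp lists */
            return 0;
    }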

    Link: http://lkml.kernel.org/r/20180301062845.26038-3-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Acked-by: Mel Gorman
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Andi Kleen
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Huang Ying
    Cc: Kemi Wang
    Cc: Matthew Wilcox
    Cc: Tim Chen
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
    Matthew Wilcox found that all callers of free_pcppages_bulk()
    currently update pcp->count immediately afterwards, so it's natural to
    do it inside free_pcppages_bulk().

    No functionality or performance change is expected from this patch.

    Link: http://lkml.kernel.org/r/20180301062845.26038-2-aaron.lu@intel.com
    Signed-off-by: Aaron Lu
    Suggested-by: Matthew Wilcox
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Huang Ying
    Cc: Dave Hansen
    Cc: Kemi Wang
    Cc: Tim Chen
    Cc: Andi Kleen
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Lu
     
  • It's possible for free pages to become stranded on per-cpu pagesets
    (pcps) that, if drained, could be merged with buddy pages on the zone's
    free area to form large order pages, including up to MAX_ORDER.

    Consider a verbose example using the tools/vm/page-types tool at the
    beginning of a ZONE_NORMAL ('B' indicates a buddy page and 'S' indicates
    a slab page). Pages on pcps do not have any page flags set.

    109954 1 _______S________________________________________________________
    109955 2 __________B_____________________________________________________
    109957 1 ________________________________________________________________
    109958 1 __________B_____________________________________________________
    109959 7 ________________________________________________________________
    109960 1 __________B_____________________________________________________
    109961 9 ________________________________________________________________
    10996a 1 __________B_____________________________________________________
    10996b 3 ________________________________________________________________
    10996e 1 __________B_____________________________________________________
    10996f 1 ________________________________________________________________
    ...
    109f8c 1 __________B_____________________________________________________
    109f8d 2 ________________________________________________________________
    109f8f 2 __________B_____________________________________________________
    109f91 f ________________________________________________________________
    109fa0 1 __________B_____________________________________________________
    109fa1 7 ________________________________________________________________
    109fa8 1 __________B_____________________________________________________
    109fa9 1 ________________________________________________________________
    109faa 1 __________B_____________________________________________________
    109fab 1 _______S________________________________________________________

    The compaction migration scanner is attempting to defragment this memory
    since it is at the beginning of the zone. It has done so quite well,
    all movable pages have been migrated. From pfn [0x109955, 0x109fab),
    there are only buddy pages and pages without flags set.

    These pages may be stranded on pcps that could otherwise allow this
    memory to be coalesced if freed back to the zone free area. It is
    possible that some of these pages may not be on pcps and that something
    has called alloc_pages() and used the memory directly, but we rely on
    the absence of __GFP_MOVABLE in these cases to allocate from
    MIGRATE_UNMOVABLE pageblocks to try to keep these MIGRATE_MOVABLE
    pageblocks as free as possible.

    These buddy and pcp pages, spanning 1,621 pages, could be coalesced and
    allow for three transparent hugepages to be dynamically allocated.
    Running the numbers for all such spans on the system, it was found that
    there were over 400 such spans of only buddy pages and pages without
    flags set at the time this /proc/kpageflags sample was collected.
    Without this support, there were _no_ order-9 or order-10 pages free.

    When kcompactd fails to defragment memory such that a cc.order page can
    be allocated, drain all pcps for the zone back to the buddy allocator so
    this stranding cannot occur. Compaction for that order will
    subsequently be deferred, which acts as a ratelimit on this drain.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1803010340100.88270@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • should_failslab() is a convenient function to hook into for directed
    error injection into kmalloc(). However, it is only available if a
    config flag is set.

    The following BCC script, for example, fails kmalloc() calls after a
    btrfs umount:

    from bcc import BPF

    prog = r"""
    BPF_HASH(flag);

    #include

    int kprobe__btrfs_close_devices(void *ctx) {
            u64 key = 1;
            flag.update(&key, &key);
            return 0;
    }

    int kprobe__should_failslab(struct pt_regs *ctx) {
            u64 key = 1;
            u64 *res;
            res = flag.lookup(&key);
            if (res != 0) {
                    bpf_override_return(ctx, -ENOMEM);
            }
            return 0;
    }
    """
    b = BPF(text=prog)

    while 1:
            b.kprobe_poll()

    This patch refactors the should_failslab implementation so that the
    function is always available for error injection, independent of flags.

    This change would be similar in nature to commit f5490d3ec921 ("block:
    Add should_fail_bio() for bpf error injection").

    Link: http://lkml.kernel.org/r/20180222020320.6944-1-hmclauchlan@fb.com
    Signed-off-by: Howard McLauchlan
    Reviewed-by: Andrew Morton
    Cc: Akinobu Mita
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Josef Bacik
    Cc: Johannes Weiner
    Cc: Alexei Starovoitov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Howard McLauchlan
     
    early_param() functions are only called during kernel initialization,
    so Linux marks them with the __init macro to save memory.

    But early_page_poison_param() was missed, so make it __init as well.

    Link: http://lkml.kernel.org/r/20180117034757.27024-1-douly.fnst@cn.fujitsu.com
    Signed-off-by: Dou Liyang
    Reviewed-by: Andrew Morton
    Acked-by: Michal Hocko
    Cc: Philippe Ombredanne
    Cc: Kate Stewart
    Cc: Michal Hocko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dou Liyang
     
    early_param() functions are only called during kernel initialization,
    so Linux marks them with the __init macro to save memory.

    But early_page_owner_param() was missed, so make it __init as well.

    Link: http://lkml.kernel.org/r/20180117034736.26963-1-douly.fnst@cn.fujitsu.com
    Signed-off-by: Dou Liyang
    Reviewed-by: Andrew Morton
    Cc: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dou Liyang
     
    early_param() functions are only called during kernel initialization,
    so Linux marks them with the __init macro to save memory.

    But kmemleak_boot_config() was missed, so make it __init as well.

    Link: http://lkml.kernel.org/r/20180117034720.26897-1-douly.fnst@cn.fujitsu.com
    Signed-off-by: Dou Liyang
    Reviewed-by: Andrew Morton
    Cc: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dou Liyang
     
    This patch makes do_swap_page() not need to be aware of the two
    different swap readahead algorithms, by unifying the cluster-based and
    vma-based readahead function calls.

    Link: http://lkml.kernel.org/r/1509520520-32367-3-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-3-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Looking at the recent changes to swap readahead, I am very unhappy
    about the current code structure, which diverges into two swap
    readahead algorithms in do_swap_page. This patch is to clean it up.

    The main motivation is that the fault handler doesn't need to be aware
    of the readahead algorithms; it should just call swapin_readahead.

    As a first step, this patch cleans up a little bit but is not perfect
    (I just separated it to make review easier), so the next patch will
    complete the goal.

    [minchan@kernel.org: do not check readahead flag with THP anon]
    Link: http://lkml.kernel.org/r/874lm83zho.fsf@yhuang-dev.intel.com
    Link: http://lkml.kernel.org/r/20180227232611.169883-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/1509520520-32367-2-git-send-email-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20180220085249.151400-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
    Since we no longer use the return value of shrink_slab() for normal
    reclaim, the comment is no longer true. If some do_shrink_slab() call
    takes unexpectedly long (the root cause of the stall is currently
    unknown) while register_shrinker()/unregister_shrinker() is pending,
    trying to drop caches via /proc/sys/vm/drop_caches could become an
    infinite cond_resched() loop if many mem_cgroups are defined. For
    safety, let's not pretend forward progress.

    Link: http://lkml.kernel.org/r/201802202229.GGF26507.LVFtMSOOHFJOQF@I-love.SAKURA.ne.jp
    Signed-off-by: Tetsuo Handa
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Dave Chinner
    Cc: Glauber Costa
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     
  • Currently if z3fold couldn't find an unbuddied page it would first try
    to pull a page off the stale list. The problem with this approach is
    that we can't 100% guarantee that the page is not processed by the
    workqueue thread at the same time unless we run cancel_work_sync() on
    it, which we can't do if we're in an atomic context. So let's just
    limit stale list usage to non-atomic contexts only.

    Link: http://lkml.kernel.org/r/47ab51e7-e9c1-d30e-ab17-f734dbc3abce@gmail.com
    Signed-off-by: Vitaly Vul
    Reviewed-by: Andrew Morton
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Wool
     
    THP split makes a non-atomic change to tail page flags. This is almost
    ok because tail pages are locked and isolated, but it breaks recent
    changes in page locking: the non-atomic operation could clear bit
    PG_waiters.

    As a result the concurrent sequence get_page_unless_zero() ->
    lock_page() might block forever, especially if this page was truncated
    later.

    Fix is trivial: clone flags before unfreezing page reference counter.

    This race has existed since commit 62906027091f ("mm: add PageWaiters
    indicating tasks are waiting for a page bit"), while the unsafe
    unfreeze itself was added in commit 8df651c7059e ("thp: cleanup
    split_huge_page()").

    clear_compound_head() also must be called before unfreezing page
    reference because after successful get_page_unless_zero() might follow
    put_page() which needs correct compound_head().

    And replace page_ref_inc()/page_ref_add() with page_ref_unfreeze(),
    which is made especially for that and has the semantics of
    smp_store_release().
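
    As a rough userspace analogy of that release semantic (not the kernel
    code; names here are illustrative), a C11 release store pairs with an
    acquire load so that writes made before publishing the reference count
    are visible to whoever observes the count, which is the role
    page_ref_unfreeze() plays for the cloned flags:

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int flags;                 /* stands in for the tail page flags */
    static atomic_int refcount;       /* stands in for the frozen refcount */

    static void *writer(void *arg)
    {
            flags = 0x42;                                   /* clone/set flags first... */
            atomic_store_explicit(&refcount, 1,
                                  memory_order_release);    /* ...then publish the count */
            return NULL;
    }

    static void *reader(void *arg)
    {
            /* like get_page_unless_zero(): only proceed once the count is live */
            while (atomic_load_explicit(&refcount, memory_order_acquire) == 0)
                    ;
            printf("flags seen by reader: 0x%x\n", flags);  /* guaranteed 0x42 */
            return NULL;
    }

    int main(void)
    {
            pthread_t w, r;

            pthread_create(&r, NULL, reader, NULL);
            pthread_create(&w, NULL, writer, NULL);
            pthread_join(w, NULL);
            pthread_join(r, NULL);
            return 0;
    }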

    Link: http://lkml.kernel.org/r/151844393341.210639.13162088407980624477.stgit@buzz
    Signed-off-by: Konstantin Khlebnikov
    Acked-by: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Nicholas Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     
    When page_mapping() is called and the mapping is dereferenced in
    page_evictable() through shrink_active_list(), it is possible for the
    inode to be truncated and the embedded address space to be freed at
    the same time. This may lead to the following race.

    CPU1                                                  CPU2

    truncate(inode)                                       shrink_active_list()
      ...                                                   page_evictable(page)
      truncate_inode_page(mapping, page);
        delete_from_page_cache(page)
          spin_lock_irqsave(&mapping->tree_lock, flags);
            __delete_from_page_cache(page, NULL)
              page_cache_tree_delete(..)
                ...                                           mapping = page_mapping(page);
                page->mapping = NULL;
                ...
          spin_unlock_irqrestore(&mapping->tree_lock, flags);
          page_cache_free_page(mapping, page)
            put_page(page)
              if (put_page_testzero(page)) -> false
    - inode now has no pages and can be freed including embedded address_space

                                                          mapping_unevictable(mapping)
                                                            test_bit(AS_UNEVICTABLE, &mapping->flags);
    - we've dereferenced mapping which is potentially already free.

    A similar race exists between swap cache freeing and page_evictable()
    too.

    The address_space in inode and swap cache will be freed after a RCU
    grace period. So the races are fixed via enclosing the page_mapping()
    and address_space usage in rcu_read_lock/unlock(). Some comments are
    added in code to make it clear what is protected by the RCU read lock.

    Link: http://lkml.kernel.org/r/20180212081227.1940-1-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: "Huang, Ying"
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • ...instead of open coding file operations followed by custom ->open()
    callbacks per each attribute.

    [andriy.shevchenko@linux.intel.com: add tags, fix compilation issue]
    Link: http://lkml.kernel.org/r/20180217144253.58604-1-andriy.shevchenko@linux.intel.com
    Link: http://lkml.kernel.org/r/20180214154644.54505-1-andriy.shevchenko@linux.intel.com
    Signed-off-by: Andy Shevchenko
    Reviewed-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Acked-by: Christoph Lameter
    Cc: Tejun Heo
    Cc: Dennis Zhou
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andy Shevchenko
     
  • mirrored_kernelcore can be in __meminitdata, so move it there.

    At the same time, fixup section specifiers to be after the name of the
    variable per checkpatch.

    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121623280.179479@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Andrew Morton
    Cc: Mike Kravetz
    Cc: Jonathan Corbet
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Both kernelcore= and movablecore= can be used to define the amount of
    ZONE_NORMAL and ZONE_MOVABLE on a system, respectively. This requires
    the system memory capacity to be known when specifying the command line,
    however.

    This introduces the ability to define both kernelcore= and movablecore=
    as a percentage of total system memory. This is convenient for systems
    software that wants to define the amount of ZONE_MOVABLE, for example,
    as a proportion of a system's memory rather than a hardcoded byte value.

    To define the percentage, the final character of the parameter should be
    a '%'.
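
    A minimal userspace sketch of that percentage handling (the helper and
    its name here are illustrative, not the kernel's command line parser,
    and suffixes such as K/M/G are ignored):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A trailing '%' means a percentage of total memory,
     * otherwise the value is taken as a raw byte count. */
    static unsigned long long parse_core(const char *arg,
                                         unsigned long long total_bytes)
    {
            size_t len = strlen(arg);

            if (len && arg[len - 1] == '%')
                    return total_bytes * strtoull(arg, NULL, 10) / 100;
            return strtoull(arg, NULL, 10);
    }

    int main(void)
    {
            unsigned long long total = 128ULL << 30;   /* pretend a 128GB machine */

            printf("movablecore=10%%        -> %llu bytes\n",
                   parse_core("10%", total));
            printf("movablecore=4294967296 -> %llu bytes\n",
                   parse_core("4294967296", total));
            return 0;
    }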

    mhocko: "why is anyone using these options nowadays?"

    rientjes:
    :
    : Fragmentation of non-__GFP_MOVABLE pages due to low on memory
    : situations can pollute most pageblocks on the system, as much as 1GB of
    : slab being fragmented over 128GB of memory, for example. When the
    : amount of kernel memory is well bounded for certain systems, it is
    : better to aggressively reclaim from existing MIGRATE_UNMOVABLE
    : pageblocks rather than eagerly fallback to others.
    :
    : We have additional patches that help with this fragmentation if you're
    : interested, specifically kcompactd compaction of MIGRATE_UNMOVABLE
    : pageblocks triggered by fallback of non-__GFP_MOVABLE allocations and
    : draining of pcp lists back to the zone free area to prevent stranding.

    [rientjes@google.com: updates]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802131700160.71590@chino.kir.corp.google.com
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1802121622470.179479@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Cc: Jonathan Corbet
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Recently the following BUG was reported:

    Injecting memory failure for pfn 0x3c0000 at process virtual address 0x7fe300000000
    Memory failure: 0x3c0000: recovery action for huge page: Recovered
    BUG: unable to handle kernel paging request at ffff8dfcc0003000
    IP: gup_pgd_range+0x1f0/0xc20
    PGD 17ae72067 P4D 17ae72067 PUD 0
    Oops: 0000 [#1] SMP PTI
    ...
    CPU: 3 PID: 5467 Comm: hugetlb_1gb Not tainted 4.15.0-rc8-mm1-abc+ #3
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.9.3-1.fc25 04/01/2014

    You can easily reproduce this by calling madvise(MADV_HWPOISON) twice on
    a 1GB hugepage. This happens because get_user_pages_fast() is not aware
    of a migration entry on pud that was created in the 1st madvise() event.

    I think that the conversion to a pud-aligned migration entry is
    working, but other MM code walking over page tables isn't prepared for
    it. We need some time and effort to make all this work properly, so
    this patch avoids the reported bug by just disabling error handling
    for 1GB hugepages.

    [n-horiguchi@ah.jp.nec.com: v2]
    Link: http://lkml.kernel.org/r/1517284444-18149-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Link: http://lkml.kernel.org/r/1517207283-15769-1-git-send-email-n-horiguchi@ah.jp.nec.com
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Kravetz
    Acked-by: Punit Agrawal
    Tested-by: Michael Ellerman
    Cc: Anshuman Khandual
    Cc: "Aneesh Kumar K.V"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
    During memory hotplugging we traverse struct pages three times:

    1. memset(0) in sparse_add_one_section()
    2. loop in __add_section() to set set_page_node(page, nid); and
       SetPageReserved(page);
    3. loop in memmap_init_zone() to call __init_single_pfn()

    This patch removes the first two loops, and leaves only loop 3. All
    struct pages are initialized in one place, the same as it is done during
    boot.

    The benefits:

    - We improve memory hotplug performance because we are not evicting the
    cache several times and also reduce loop branching overhead.

    - Remove condition from hotpath in __init_single_pfn(), that was added
    in order to fix the problem that was reported by Bharata in the above
    email thread, thus also improve performance during normal boot.

    - Make memory hotplug more similar to the boot memory initialization
    path because we zero and initialize struct pages only in one
    function.

    - Simplifies memory hotplug struct page initialization code, and thus
    enables future improvements, such as multi-threading the
    initialization of struct pages in order to improve hotplug
    performance even further on larger machines.

    [pasha.tatashin@oracle.com: v5]
    Link: http://lkml.kernel.org/r/20180228030308.1116-7-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180215165920.8570-7-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • During memory hotplugging the probe routine will leave struct pages
    uninitialized, the same as it is currently done during boot. Therefore,
    we do not want to access the inside of struct pages before
    __init_single_page() is called during onlining.

    Because during hotplug we know that pages in one memory block belong to
    the same numa node, we can skip the checking. We should keep checking
    for the boot case.

    [pasha.tatashin@oracle.com: s/register_new_memory()/hotplug_memory_register()]
    Link: http://lkml.kernel.org/r/20180228030308.1116-6-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180215165920.8570-6-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Ingo Molnar
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • During boot we poison struct page memory in order to ensure that no one
    is accessing this memory until the struct pages are initialized in
    __init_single_page().

    This patch adds more scrutiny to this checking by making sure that flags
    do not equal the poison pattern when they are accessed. The pattern is
    all ones.

    Since node id is also stored in struct page, and may be accessed quite
    early, we add this enforcement into page_to_nid() function as well.
    Note, this is applicable only when NODE_NOT_IN_PAGE_FLAGS=n

    [pasha.tatashin@oracle.com: v4]
    Link: http://lkml.kernel.org/r/20180215165920.8570-4-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180213193159.14606-4-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Acked-by: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "optimize memory hotplug", v3.

    This patchset:

    - Improves hotplug performance by eliminating a number of struct page
    traverses during memory hotplug.

    - Fixes some issues with hotplugging, where boundaries were not
    properly checked, and on x86 the block size was not properly aligned
    with the end of memory.

    - Also, potentially improves boot performance by eliminating a
    condition from __init_single_page().

    - Adds robustness by verifying that struct pages are correctly
    poisoned when flags are accessed.

    The following experiments were performed on Xeon(R) CPU E7-8895 v3 @
    2.60GHz with 1T RAM:

    booting in qemu with 960G of memory, time to initialize struct pages:

    no-kvm:
                 TRY1         TRY2
    BEFORE:      39.433668    39.39705
    AFTER:       36.903781    36.989329

    with-kvm:
    BEFORE:      10.977447    11.103164
    AFTER:       10.929072    10.751885

    Hotplug 896G memory:

    no-kvm:
                 TRY1         TRY2
    BEFORE:      848.740000   846.910000
    AFTER:       783.070000   786.560000

    with-kvm:
                 TRY1         TRY2
    BEFORE:      34.410000    33.57
    AFTER:       29.810000    29.580000

    This patch (of 6):

    Start qemu with the following arguments:

    -m 64G,slots=2,maxmem=66G -object memory-backend-ram,id=mem1,size=2G

    Which: boots machine with 64G, and adds a device mem1 with 2G which can
    be hotplugged later.

    Also make sure that config has the following turned on:
    CONFIG_MEMORY_HOTPLUG
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE
    CONFIG_ACPI_HOTPLUG_MEMORY

    Using the qemu monitor, hotplug the memory (make sure the config
    options above are enabled):

    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1

    The operation will fail with the following trace:

    WARNING: CPU: 0 PID: 91 at drivers/base/memory.c:205
    pages_correctly_reserved+0xe6/0x110
    Modules linked in:
    CPU: 0 PID: 91 Comm: systemd-udevd Not tainted 4.16.0-rc1_pt_master #29
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
    BIOS rel-1.11.0-0-g63451fca13-prebuilt.qemu-project.org 04/01/2014
    RIP: 0010:pages_correctly_reserved+0xe6/0x110
    Call Trace:
    memory_subsys_online+0x44/0xa0
    device_online+0x51/0x80
    store_mem_state+0x5e/0xe0
    kernfs_fop_write+0xfa/0x170
    __vfs_write+0x2e/0x150
    vfs_write+0xa8/0x1a0
    SyS_write+0x4d/0xb0
    do_syscall_64+0x5d/0x110
    entry_SYSCALL_64_after_hwframe+0x21/0x86
    ---[ end trace 6203bc4f1a5d30e8 ]---

    The problem is detected in: drivers/base/memory.c

    static bool pages_correctly_reserved(unsigned long start_pfn)
    205 if (WARN_ON_ONCE(!pfn_valid(pfn)))

    This function loops through every section in the newly added memory
    block and verifies that the first pfn is valid, meaning section exists,
    has mapping (struct page array), and is online.

    The block size on x86 is usually 128M, but when the machine is booted
    with more than 64G of memory, the block size is changed to 2G:

    $ cat /sys/devices/system/memory/block_size_bytes
    80000000

    or

    $ dmesg | grep "block size"
    [ 0.086469] x86/mm: Memory block size: 2048MB

    During memory hotplug, and hotremove we verify that the range is section
    size aligned, but we actually must verify that it is block size aligned,
    because that is the proper unit for hotplug operations. See:
    Documentation/memory-hotplug.txt

    So, when the start_pfn of newly added memory is not block size aligned,
    we can get a memory block that has only part of it with properly
    populated sections.

    In our case the start_pfn starts from the last_pfn (end of physical
    memory).

    $ dmesg | grep last_pfn
    [ 0.000000] e820: last_pfn = 0x1040000 max_arch_pfn = 0x400000000

    0x1040000 == 65G, and so is not 2G aligned!
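
    A quick userspace check of that arithmetic (0x1040000 is a pfn, so the
    byte address is pfn << 12 with 4K pages; the block size is 0x80000000,
    i.e. 2G):

    #include <stdio.h>
    #include <stdint.h>

    int main(void)
    {
            uint64_t pfn = 0x1040000;          /* last_pfn from the dmesg output above */
            uint64_t start = pfn << 12;        /* byte address: 0x1040000000 */
            uint64_t block_size = 0x80000000;  /* 2G memory block size */

            printf("start = %#llx (%llu GB)\n",
                   (unsigned long long)start,
                   (unsigned long long)(start >> 30));
            printf("2G aligned? %s\n",
                   (start & (block_size - 1)) == 0 ? "yes" : "no");  /* prints "no" */
            return 0;
    }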

    The fix is to enforce that memory that is hotplugged and hotremoved is
    block size aligned.

    With this fix, running the above sequence yield to the following result:

    (qemu) device_add pc-dimm,id=dimm1,memdev=mem1
    Block size [0x80000000] unaligned hotplug range: start 0x1040000000,
    size 0x80000000
    acpi PNP0C80:00: add_memory failed
    acpi PNP0C80:00: acpi_memory_enable_device() error
    acpi PNP0C80:00: Enumeration failure

    Link: http://lkml.kernel.org/r/20180213193159.14606-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Ingo Molnar
    Acked-by: Michal Hocko
    Cc: Baoquan He
    Cc: Bharata B Rao
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: "H. Peter Anvin"
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Steven Sistare
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin