21 Sep, 2018

1 commit

  • Deferred struct page init is needed only on systems with a large amount of
    physical memory, where it improves boot performance. 32-bit systems do not
    benefit from this feature.

    Jiri reported a problem where deferred struct pages do not work well with
    x86-32:

    [ 0.035162] Dentry cache hash table entries: 131072 (order: 7, 524288 bytes)
    [ 0.035725] Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
    [ 0.036269] Initializing CPU#0
    [ 0.036513] Initializing HighMem for node 0 (00036ffe:0007ffe0)
    [ 0.038459] page:f6780000 is uninitialized and poisoned
    [ 0.038460] raw: ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff ffffffff
    [ 0.039509] page dumped because: VM_BUG_ON_PAGE(1 && PageCompound(page))
    [ 0.040038] ------------[ cut here ]------------
    [ 0.040399] kernel BUG at include/linux/page-flags.h:293!
    [ 0.040823] invalid opcode: 0000 [#1] SMP PTI
    [ 0.041166] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc1_pt_jiri #9
    [ 0.041694] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    [ 0.042496] EIP: free_highmem_page+0x64/0x80
    [ 0.042839] Code: 13 46 d8 c1 e8 18 5d 83 e0 03 8d 04 c0 c1 e0 06 ff 80 ec 5f 44 d8 c3 8d b4 26 00 00 00 00 ba 08 65 28 d8 89 d8 e8 fc 71 02 00 0b 8d 76 00 8d bc 27 00 00 00 00 ba d0 b1 26 d8 89 d8 e8 e4 71
    [ 0.044338] EAX: 0000003c EBX: f6780000 ECX: 00000000 EDX: d856cbe8
    [ 0.044868] ESI: 0007ffe0 EDI: d838df20 EBP: d838df00 ESP: d838defc
    [ 0.045372] DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068 EFLAGS: 00210086
    [ 0.045913] CR0: 80050033 CR2: 00000000 CR3: 18556000 CR4: 00040690
    [ 0.046413] DR0: 00000000 DR1: 00000000 DR2: 00000000 DR3: 00000000
    [ 0.046913] DR6: fffe0ff0 DR7: 00000400
    [ 0.047220] Call Trace:
    [ 0.047419] add_highpages_with_active_regions+0xbd/0x10d
    [ 0.047854] set_highmem_pages_init+0x5b/0x71
    [ 0.048202] mem_init+0x2b/0x1e8
    [ 0.048460] start_kernel+0x1d2/0x425
    [ 0.048757] i386_start_kernel+0x93/0x97
    [ 0.049073] startup_32_smp+0x164/0x168
    [ 0.049379] Modules linked in:
    [ 0.049626] ---[ end trace 337949378db0abbb ]---

    We free highmem pages before their struct pages are initialized:

    mem_init()
    set_highmem_pages_init()
    add_highpages_with_active_regions()
    free_highmem_page()
    .. Access uninitialized struct page here..

    Because there is no reason to have this feature on 32-bit systems, just
    disable it.
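
    Only the relevant mm/Kconfig lines are sketched below; the surrounding
    entry is approximate and the exact upstream hunk may differ:

    config DEFERRED_STRUCT_PAGE_INIT
            bool "Defer initialisation of struct pages to kthreads"
            depends on SPARSEMEM
            depends on !NEED_PER_CPU_KM
            depends on 64BIT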

    Link: http://lkml.kernel.org/r/20180831150506.31246-1-pavel.tatashin@microsoft.com
    Fixes: 2e3ca40f03bb ("mm: relax deferred struct page requirements")
    Signed-off-by: Pavel Tatashin
    Reported-by: Jiri Slaby
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Pasha Tatashin
     

18 Aug, 2018

3 commits

  • CONFIG_THP_SWAP should depend on CONFIG_SWAP, because it's unreasonable
    to optimize swapping for THP (Transparent Huge Page) without basic
    swapping support.

    In the original code, when CONFIG_SWAP=n and CONFIG_THP_SWAP=y,
    split_swap_cluster() will not be built because it is in swapfile.c, but
    it will be called in huge_memory.c. This doesn't trigger a build error
    in practice because the call site is enclosed by PageSwapCache(), which
    is defined to be constant 0 when CONFIG_SWAP=n. But this is fragile and
    should be fixed.

    The comments are also updated to reflect the latest progress.
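
    A sketch of the tightened dependency (the pre-existing
    TRANSPARENT_HUGEPAGE / ARCH_WANTS_THP_SWAP gate is assumed; the new part
    is the trailing SWAP):

    config THP_SWAP
            def_bool y
            depends on TRANSPARENT_HUGEPAGE && ARCH_WANTS_THP_SWAP && SWAP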

    Link: http://lkml.kernel.org/r/20180713021228.439-1-ying.huang@intel.com
    Fixes: 38d8b4e6bdc8 ("mm, THP, swap: delay splitting THP during swap out")
    Signed-off-by: "Huang, Ying"
    Reviewed-by: Dan Williams
    Reviewed-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shaohua Li
    Cc: Hugh Dickins
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Dave Hansen
    Cc: Zi Yan
    Cc: Daniel Jordan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Rename new_sparse_init() to sparse_init(), which enables it. Delete the
    old sparse_init() and all the code that became obsolete with it.

    [pasha.tatashin@oracle.com: remove unused sparse_mem_maps_populate_node()]
    Link: http://lkml.kernel.org/r/20180716174447.14529-6-pasha.tatashin@oracle.com
    Link: http://lkml.kernel.org/r/20180712203730.8703-6-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Tested-by: Michael Ellerman [powerpc]
    Tested-by: Oscar Salvador
    Reviewed-by: Oscar Salvador
    Cc: Pasha Tatashin
    Cc: Abdul Haleem
    Cc: Baoquan He
    Cc: Daniel Jordan
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Rientjes
    Cc: Greg Kroah-Hartman
    Cc: Ingo Molnar
    Cc: Jan Kara
    Cc: Jérôme Glisse
    Cc: "Kirill A. Shutemov"
    Cc: Michal Hocko
    Cc: Souptick Joarder
    Cc: Steven Sistare
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • The deferred memory initialization relies on section definitions, e.g.
    PAGES_PER_SECTION, that are only available when CONFIG_SPARSEMEM=y on
    most architectures.

    Initially DEFERRED_STRUCT_PAGE_INIT depended on the explicit
    ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT configuration option, but since
    commit 2e3ca40f03bb13709df4 ("mm: relax deferred struct page
    requirements") this requirement was relaxed, and now it is possible to
    enable DEFERRED_STRUCT_PAGE_INIT on architectures that support
    DISCONTIGMEM and NO_BOOTMEM, which causes build failures.

    For instance, setting SMP=y and DEFERRED_STRUCT_PAGE_INIT=y on arc
    causes the following build failure:

    CC mm/page_alloc.o
    mm/page_alloc.c: In function 'update_defer_init':
    mm/page_alloc.c:321:14: error: 'PAGES_PER_SECTION'
    undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    (pfn & (PAGES_PER_SECTION - 1)) == 0) {
    ^~~~~~~~~~~~~~~~~
    USEC_PER_SEC
    mm/page_alloc.c:321:14: note: each undeclared identifier is reported only once for each function it appears in
    In file included from include/linux/cache.h:5:0,
    from include/linux/printk.h:9,
    from include/linux/kernel.h:14,
    from include/asm-generic/bug.h:18,
    from arch/arc/include/asm/bug.h:32,
    from include/linux/bug.h:5,
    from include/linux/mmdebug.h:5,
    from include/linux/mm.h:9,
    from mm/page_alloc.c:18:
    mm/page_alloc.c: In function 'deferred_grow_zone':
    mm/page_alloc.c:1624:52: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
    ^
    include/uapi/linux/kernel.h:11:47: note: in definition of macro '__ALIGN_KERNEL_MASK'
    #define __ALIGN_KERNEL_MASK(x, mask) (((x) + (mask)) & ~(mask))
    ^~~~
    include/linux/kernel.h:58:22: note: in expansion of macro '__ALIGN_KERNEL'
    #define ALIGN(x, a) __ALIGN_KERNEL((x), (a))
    ^~~~~~~~~~~~~~
    mm/page_alloc.c:1624:34: note: in expansion of macro 'ALIGN'
    unsigned long nr_pages_needed = ALIGN(1 << order, PAGES_PER_SECTION);
    ^~~~~
    In file included from include/asm-generic/bug.h:18:0,
    from arch/arc/include/asm/bug.h:32,
    from include/linux/bug.h:5,
    from include/linux/mmdebug.h:5,
    from include/linux/mm.h:9,
    from mm/page_alloc.c:18:
    mm/page_alloc.c: In function 'free_area_init_node':
    mm/page_alloc.c:6379:50: error: 'PAGES_PER_SECTION' undeclared (first use in this function); did you mean 'USEC_PER_SEC'?
    pgdat->static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
    ^
    include/linux/kernel.h:812:22: note: in definition of macro '__typecheck'
    (!!(sizeof((typeof(x) *)1 == (typeof(y) *)1)))
    ^
    include/linux/kernel.h:836:24: note: in expansion of macro '__safe_cmp'
    __builtin_choose_expr(__safe_cmp(x, y), \
    ^~~~~~~~~~
    include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
    #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
    ^~~~~
    include/linux/kernel.h:836:2: error: first argument to '__builtin_choose_expr' not a constant
    __builtin_choose_expr(__safe_cmp(x, y), \
    ^
    include/linux/kernel.h:904:27: note: in expansion of macro '__careful_cmp'
    #define min_t(type, x, y) __careful_cmp((type)(x), (type)(y), static_init_pgcnt = min_t(unsigned long, PAGES_PER_SECTION,
    ^~~~~
    scripts/Makefile.build:317: recipe for target 'mm/page_alloc.o' failed

    Let's make DEFERRED_STRUCT_PAGE_INIT explicitly depend on SPARSEMEM, as
    the systems that support DISCONTIGMEM do not seem to have such huge
    amounts of memory that DEFERRED_STRUCT_PAGE_INIT would be relevant.
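
    A sketch of the resulting dependency (only the relevant lines; the rest
    of the entry is approximate):

    config DEFERRED_STRUCT_PAGE_INIT
            bool "Defer initialisation of struct pages to kthreads"
            depends on SPARSEMEM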

    Link: http://lkml.kernel.org/r/1530279308-24988-1-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Michal Hocko
    Reviewed-by: Pavel Tatashin
    Tested-by: Randy Dunlap
    Cc: Pasha Tatashin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

09 Jun, 2018

2 commits

  • Pull libnvdimm updates from Dan Williams:
    "This adds a user for the new 'bytes-remaining' updates to
    memcpy_mcsafe() that you already received through Ingo via the
    x86-dax-for-linus pull.

    Not included here, but still targeting this cycle, is support for
    handling memory media errors (poison) consumed via userspace dax
    mappings.

    Summary:

    - DAX broke a fundamental assumption of truncate of file mapped
    pages. The truncate path assumed that it is safe to disconnect a
    pinned page from a file and let the filesystem reclaim the physical
    block. With DAX the page is equivalent to the filesystem block.
    Introduce dax_layout_busy_page() to enable filesystems to wait for
    pinned DAX pages to be released. Without this wait a filesystem
    could allocate blocks under active device-DMA to a new file.

    - DAX arranges for the block layer to be bypassed and uses
    dax_direct_access() + copy_to_iter() to satisfy read(2) calls.
    However, the memcpy_mcsafe() facility is available through the pmem
    block driver. In order to safely handle media errors, via the DAX
    block-layer bypass, introduce copy_to_iter_mcsafe().

    - Fix cache management policy relative to the ACPI NFIT Platform
    Capabilities Structure to properly elide cache flushes when they
    are not necessary. The table indicates whether CPU caches are
    power-fail protected. Clarify that a deep flush is always performed
    on REQ_{FUA,PREFLUSH} requests"

    * tag 'libnvdimm-for-4.18' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (21 commits)
    dax: Use dax_write_cache* helpers
    libnvdimm, pmem: Do not flush power-fail protected CPU caches
    libnvdimm, pmem: Unconditionally deep flush on *sync
    libnvdimm, pmem: Complete REQ_FLUSH => REQ_PREFLUSH
    acpi, nfit: Remove ecc_unit_size
    dax: dax_insert_mapping_entry always succeeds
    libnvdimm, e820: Register all pmem resources
    libnvdimm: Debug probe times
    linvdimm, pmem: Preserve read-only setting for pmem devices
    x86, nfit_test: Add unit test for memcpy_mcsafe()
    pmem: Switch to copy_to_iter_mcsafe()
    dax: Report bytes remaining in dax_iomap_actor()
    dax: Introduce a ->copy_to_iter dax operation
    uio, lib: Fix CONFIG_ARCH_HAS_UACCESS_MCSAFE compilation
    xfs, dax: introduce xfs_break_dax_layouts()
    xfs: prepare xfs_break_layouts() for another layout type
    xfs: prepare xfs_break_layouts() to be called with XFS_MMAPLOCK_EXCL
    mm, fs, dax: handle layout changes to pinned dax mappings
    mm: fix __gup_device_huge vs unmap
    mm: introduce MEMORY_DEVICE_FS_DAX and CONFIG_DEV_PAGEMAP_OPS
    ...

    Linus Torvalds
     
  • Dan Williams
     

08 Jun, 2018

1 commit

  • Currently, PTE special support is turned on in per-architecture header
    files. Most of the time, it is defined in
    arch/*/include/asm/pgtable.h, depending (or not) on some other
    per-architecture static definition.

    This patch introduces a new configuration variable to manage this
    directly in the Kconfig files. It will later replace
    __HAVE_ARCH_PTE_SPECIAL.

    Here are notes for some architectures where the definition of
    __HAVE_ARCH_PTE_SPECIAL is not obvious:

    arm
    __HAVE_ARCH_PTE_SPECIAL which is currently defined in
    arch/arm/include/asm/pgtable-3level.h which is included by
    arch/arm/include/asm/pgtable.h when CONFIG_ARM_LPAE is set.
    So select ARCH_HAS_PTE_SPECIAL if ARM_LPAE.

    powerpc
    __HAVE_ARCH_PTE_SPECIAL is defined in 2 files:
    - arch/powerpc/include/asm/book3s/64/pgtable.h
    - arch/powerpc/include/asm/pte-common.h
    The first one is included if (PPC_BOOK3S & PPC64) while the second is
    included in all the other cases.
    So select ARCH_HAS_PTE_SPECIAL all the time.

    sparc:
    __HAVE_ARCH_PTE_SPECIAL is defined if defined(__sparc__) &&
    defined(__arch64__) which are defined through the compiler in
    sparc/Makefile if !SPARC32 which I assume to be if SPARC64.
    So select ARCH_HAS_PTE_SPECIAL if SPARC64.

    There is no functional change introduced by this patch.
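
    A sketch of the new option and one of the arch selects described above
    (the bare bool form of the mm/Kconfig entry is an assumption):

    # mm/Kconfig
    config ARCH_HAS_PTE_SPECIAL
            bool

    # arch/arm/Kconfig (inside the ARM config entry)
    select ARCH_HAS_PTE_SPECIAL if ARM_LPAE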

    Link: http://lkml.kernel.org/r/1523433816-14460-2-git-send-email-ldufour@linux.vnet.ibm.com
    Signed-off-by: Laurent Dufour
    Suggested-by: Jerome Glisse
    Reviewed-by: Jerome Glisse
    Acked-by: David Rientjes
    Cc: Michal Hocko
    Cc: "Aneesh Kumar K . V"
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: David S. Miller
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Vineet Gupta
    Cc: Palmer Dabbelt
    Cc: Albert Ou
    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: David Rientjes
    Cc: Robin Murphy
    Cc: Christophe LEROY
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Laurent Dufour
     

05 Jun, 2018

2 commits

  • Pull documentation updates from Jonathan Corbet:
    "There's been a fair amount of work in the docs tree this time around,
    including:

    - Extensive RST conversions and organizational work in the
    memory-management docs thanks to Mike Rapoport.

    - An update of Documentation/features from Andrea Parri and a script
    to keep it updated.

    - Various LICENSES updates from Thomas, along with a script to check
    SPDX tags.

    - Work to fix dangling references to documentation files; this
    involved a fair number of one-liner comment changes outside of
    Documentation/

    ... and the usual list of documentation improvements, typo fixes, etc"

    * tag 'docs-4.18' of git://git.lwn.net/linux: (103 commits)
    Documentation: document hung_task_panic kernel parameter
    docs/admin-guide/mm: add high level concepts overview
    docs/vm: move ksm and transhuge from "user" to "internals" section.
    docs: Use the kerneldoc comments for memalloc_no*()
    doc: document scope NOFS, NOIO APIs
    docs: update kernel versions and dates in tables
    docs/vm: transhuge: split userspace bits to admin-guide/mm/transhuge
    docs/vm: transhuge: minor updates
    docs/vm: transhuge: change sections order
    Documentation: arm: clean up Marvell Berlin family info
    Documentation: gpio: driver: Fix a typo and some odd grammar
    docs: ranoops.rst: fix location of ramoops.txt
    scripts/documentation-file-ref-check: rewrite it in perl with auto-fix mode
    docs: uio-howto.rst: use a code block to solve a warning
    mm, THP, doc: Add document for thp_swpout/thp_swpout_fallback
    w1: w1_io.c: fix a kernel-doc warning
    Documentation/process/posting: wrap text at 80 cols
    docs: admin-guide: add cgroup-v2 documentation
    Revert "Documentation/features/vm: Remove arch support status file for 'pte_special'"
    Documentation: refcount-vs-atomic: Update reference to LKMM doc.
    ...

    Linus Torvalds
     
  • Pull dma-mapping updates from Christoph Hellwig:

    - replace the force_dma flag with a dma_configure bus method. (Nipun
    Gupta, although one patch is incorrectly attributed to me due to a
    git rebase bug)

    - use GFP_DMA32 more aggressively in dma-direct. (Takashi Iwai)

    - remove PCI_DMA_BUS_IS_PHYS and rely on the dma-mapping API to do the
    right thing for bounce buffering.

    - move dma-debug initialization to common code, and apply a few
    cleanups to the dma-debug code.

    - cleanup the Kconfig mess around swiotlb selection

    - swiotlb comment fixup (Yisheng Xie)

    - a trivial swiotlb fix. (Dan Carpenter)

    - support swiotlb on RISC-V. (based on a patch from Palmer Dabbelt)

    - add a new generic dma-noncoherent dma_map_ops implementation and use
    it for arc, c6x and nds32.

    - improve scatterlist validity checking in dma-debug. (Robin Murphy)

    - add a struct device quirk to limit the dma-mask to 32-bit due to
    bridge/system issues, and switch x86 to use it instead of a local
    hack for VIA bridges.

    - handle devices without a dma_mask more gracefully in the dma-direct
    code.

    * tag 'dma-mapping-4.18' of git://git.infradead.org/users/hch/dma-mapping: (48 commits)
    dma-direct: don't crash on device without dma_mask
    nds32: use generic dma_noncoherent_ops
    nds32: implement the unmap_sg DMA operation
    nds32: consolidate DMA cache maintainance routines
    x86/pci-dma: switch the VIA 32-bit DMA quirk to use the struct device flag
    x86/pci-dma: remove the explicit nodac and allowdac option
    x86/pci-dma: remove the experimental forcesac boot option
    Documentation/x86: remove a stray reference to pci-nommu.c
    core, dma-direct: add a flag 32-bit dma limits
    dma-mapping: remove unused gfp_t parameter to arch_dma_alloc_attrs
    dma-debug: check scatterlist segments
    c6x: use generic dma_noncoherent_ops
    arc: use generic dma_noncoherent_ops
    arc: fix arc_dma_{map,unmap}_page
    arc: fix arc_dma_sync_sg_for_{cpu,device}
    arc: simplify arc_dma_sync_single_for_{cpu,device}
    dma-mapping: provide a generic dma-noncoherent implementation
    dma-mapping: simplify Kconfig dependencies
    riscv: add swiotlb support
    riscv: only enable ZONE_DMA32 for 64-bit
    ...

    Linus Torvalds
     

22 May, 2018

1 commit

  • In preparation for fixing dax-dma-vs-unmap issues, filesystems need to
    be able to rely on the fact that they will get wakeups on dev_pagemap
    page-idle events. Introduce MEMORY_DEVICE_FS_DAX and
    generic_dax_page_free() as a common indicator / infrastructure for dax
    filesystems to require. With this change there are no users of the
    MEMORY_DEVICE_HOST designation, so remove it.

    The HMM sub-system extended dev_pagemap to arrange a callback when a
    dev_pagemap-managed page is freed. Since a dev_pagemap page is free /
    idle when its reference count is 1, it requires an additional branch to
    check the page type at put_page() time. Given that put_page() is a
    hot path, we do not want to incur that check if HMM is not in use, so a
    static branch is used to avoid that overhead when not necessary.

    Now, the FS_DAX implementation wants to reuse this mechanism for
    receiving dev_pagemap ->page_free() callbacks. Rework the HMM-specific
    static-key into a generic mechanism that either HMM or FS_DAX code paths
    can enable.

    For ARCH=um builds, and any other arch that lacks ZONE_DEVICE support,
    care must be taken to compile out the DEV_PAGEMAP_OPS infrastructure.
    However, we still need to support FS_DAX in the FS_DAX_LIMITED case
    implemented by the s390/dcssblk driver.
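
    A sketch of the Kconfig side of this (the select condition shown in the
    comment is an assumption based on the text above):

    config DEV_PAGEMAP_OPS
            bool

    # selected by users that need dev_pagemap ->page_free() callbacks,
    # e.g. (assumed): select DEV_PAGEMAP_OPS if (ZONE_DEVICE && !FS_DAX_LIMITED)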

    Cc: Martin Schwidefsky
    Cc: Heiko Carstens
    Cc: Michal Hocko
    Reported-by: kbuild test robot
    Reported-by: Thomas Meyer
    Reported-by: Dave Jiang
    Cc: "Jérôme Glisse"
    Reviewed-by: Jan Kara
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Dan Williams

    Dan Williams
     

19 May, 2018

1 commit

  • It is unsafe to do virtual to physical translations before mm_init() is
    called if struct page is needed in order to determine the memory section
    number (see SECTION_IN_PAGE_FLAGS). This is because, when deferred
    struct pages are used, struct pages for all the allocated memory are
    initialized only in mm_init().

    My recent fix in commit c9e97a1997 ("mm: initialize pages on demand
    during boot") exposed this problem, because it greatly reduced the number
    of pages that are initialized before mm_init(), but the problem existed
    even before my fix, as Fengguang Wu found.

    Below is a more detailed explanation of the problem.

    We initialize struct pages in four places:

    1. Early in boot, a small set of struct pages is initialized to fill the
    first section and the lower zones.

    2. During mm_init() we initialize "struct pages" for all the memory that
    is allocated, i.e. reserved in memblock.

    3. Using on-demand logic when pages are allocated after the mm_init()
    call (when memblock is finished).

    4. After smp_init(), when the rest of the deferred free pages are
    initialized.

    The problem occurs if we try to do a va-to-phys translation of memory
    between steps 1 and 2. Because we have not yet initialized struct pages
    for all the reserved pages, it is inherently unsafe to do va-to-phys if
    the translation itself requires access to "struct page", as is the case
    for this combination: CONFIG_SPARSE && !CONFIG_SPARSE_VMEMMAP

    The following path exposes the problem:

    start_kernel()
    trap_init()
    setup_cpu_entry_areas()
    setup_cpu_entry_area(cpu)
    get_cpu_gdt_paddr(cpu)
    per_cpu_ptr_to_phys(addr)
    pcpu_addr_to_page(addr)
    virt_to_page(addr)
    pfn_to_page(__pa(addr) >> PAGE_SHIFT)

    We disable this path by not allowing NEED_PER_CPU_KM together with the
    deferred struct pages feature.
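
    A sketch of how that exclusion reads in mm/Kconfig (only the relevant
    line; the rest of the entry is approximate):

    config DEFERRED_STRUCT_PAGE_INIT
            bool "Defer initialisation of struct pages to kthreads"
            depends on !NEED_PER_CPU_KM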

    The problems are discussed in these threads:
    http://lkml.kernel.org/r/20180418135300.inazvpxjxowogyge@wfg-t540p.sh.intel.com
    http://lkml.kernel.org/r/20180419013128.iurzouiqxvcnpbvz@wfg-t540p.sh.intel.com
    http://lkml.kernel.org/r/20180426202619.2768-1-pasha.tatashin@oracle.com

    Link: http://lkml.kernel.org/r/20180515175124.1770-1-pasha.tatashin@oracle.com
    Fixes: 3a80a7fa7989 ("mm: meminit: initialise a subset of struct pages if CONFIG_DEFERRED_STRUCT_PAGE_INIT is set")
    Signed-off-by: Pavel Tatashin
    Acked-by: Michal Hocko
    Reviewed-by: Andrew Morton
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Mel Gorman
    Cc: Fengguang Wu
    Cc: Dennis Zhou
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except for a few
    spelling fixes. The relatively large diffstat stems from the indentation and
    paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I could
    miss some places that needed markup and add some markup where it was not
    necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

23 Feb, 2018

1 commit

  • Now that arch/metag/ has been removed, drop a bunch of metag references
    in various places across the whole tree:
    - VM_GROWSUP and __VM_ARCH_SPECIFIC_1.
    - MT_METAG_* ELF note types.
    - METAG Kconfig dependencies (FRAME_POINTER) and ranges
    (MAX_STACK_SIZE_MB).
    - metag cases in tools (checkstack.pl, recordmcount.c, perf).

    Signed-off-by: James Hogan
    Acked-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Reviewed-by: Guenter Roeck
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Alexander Shishkin
    Cc: Jiri Olsa
    Cc: Namhyung Kim
    Cc: linux-mm@kvack.org
    Cc: linux-metag@vger.kernel.org

    James Hogan
     

01 Feb, 2018

1 commit

  • There is no need to have ARCH_SUPPORTS_DEFERRED_STRUCT_PAGE_INIT, as all
    the page initialization code is in common code.

    Also, there is no need to depend on MEMORY_HOTPLUG, as the initialization
    code does not really use memory hotplug functionality. So, we can
    remove this requirement as well.

    This patch allows deferred struct page initialization to be used on all
    platforms with the memblock allocator.

    Tested on x86, arm64, and sparc. Also, verified that code compiles on
    PPC with CONFIG_MEMORY_HOTPLUG disabled.

    Link: http://lkml.kernel.org/r/20171117014601.31606-1-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Acked-by: Heiko Carstens [s390]
    Reviewed-by: Khalid Aziz
    Acked-by: Michael Ellerman
    Acked-by: Michal Hocko
    Cc: Steven Sistare
    Cc: Daniel Jordan
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Kirill A. Shutemov
    Cc: Reza Arbab
    Cc: Martin Schwidefsky
    Cc: Thomas Gleixner
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

18 Nov, 2017

1 commit

  • Performance of get_user_pages_fast() is critical for some workloads, but
    it's tricky to test directly.

    This patch provides /sys/kernel/debug/gup_benchmark, which helps with
    testing its performance.

    See tools/testing/selftests/vm/gup_benchmark.c for userspace
    counterpart.
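
    The Kconfig entry that gates this infrastructure, sketched from the
    description above (prompt wording and default are approximations):

    config GUP_BENCHMARK
            bool "Enable infrastructure for get_user_pages_fast() benchmarking"
            default n
            help
              Provides /sys/kernel/debug/gup_benchmark that helps with testing
              performance of get_user_pages_fast().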

    Link: http://lkml.kernel.org/r/20170908215603.9189-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Kirill A. Shutemov
    Cc: Shuah Khan
    Cc: Ingo Molnar
    Cc: Thorsten Leemhuis
    Cc: Jonathan Corbet
    Cc: Huang Ying
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

09 Sep, 2017

6 commits

  • This moves all new code, including the new page migration helper, behind
    a kernel Kconfig option so that there is no code bloat for arches or
    users that do not want to use HMM or any of its associated features.

    arm allyesconfig (first without the patchset, then with the patchset and
    this patch):
    text data bss dec hex filename
    83721896 46511131 27582964 157815991 96814b7 ../without/vmlinux
    83722364 46511131 27582964 157816459 968168b vmlinux
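
    A sketch of the kind of gating this adds to mm/Kconfig (the HMM_MIRROR
    prompt text and the ARCH_HAS_HMM / MMU_NOTIFIER lines are assumptions):

    config HMM
            bool

    config HMM_MIRROR
            bool "HMM mirror CPU page table into a device page table"
            depends on ARCH_HAS_HMM
            select HMM
            select MMU_NOTIFIER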

    [jglisse@redhat.com: struct hmm is only use by HMM mirror functionality]
    Link: http://lkml.kernel.org/r/20170825213133.27286-1-jglisse@redhat.com
    [sfr@canb.auug.org.au: fix build (arm multi_v7_defconfig)]
    Link: http://lkml.kernel.org/r/20170828181849.323ab81b@canb.auug.org.au
    Link: http://lkml.kernel.org/r/20170818032858.7447-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Stephen Rothwell
    Cc: Dan Williams
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Platforms with an advanced system bus (like CAPI or CCIX) allow device
    memory to be accessible from the CPU in a cache-coherent fashion. Add a
    new type of ZONE_DEVICE to represent such memory. The use cases are the
    same as for un-addressable device memory, but without all the corner
    cases.
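
    A sketch of the corresponding option (ARCH_HAS_HMM and the prompt
    wording are assumptions):

    config DEVICE_PUBLIC
            bool "Addressable device memory (like GPU host memory)"
            depends on ARCH_HAS_HMM
            select HMM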

    Link: http://lkml.kernel.org/r/20170817000548.32038-19-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Aneesh Kumar
    Cc: Paul E. McKenney
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: Ross Zwisler
    Cc: Balbir Singh
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • HMM (heterogeneous memory management) needs struct page to support
    migration from system main memory to device memory. The reasons for HMM
    and for migration to device memory are explained in the HMM core patch.

    This patch deals with device memory that is un-addressable (i.e. the CPU
    cannot access it). Hence we do not want those struct pages to be managed
    like regular memory. That is why we extend ZONE_DEVICE to support
    different types of memory.

    A persistent memory type is defined for the existing users of
    ZONE_DEVICE, and a new device un-addressable type is added for the
    un-addressable memory. There is a clear separation between what is
    expected from each memory type, and existing users of ZONE_DEVICE are
    unaffected by the new requirement and the new use of the un-addressable
    type. All type-specific code paths are protected with tests against the
    memory type.

    Because the memory is un-addressable, we use a new special swap type for
    when a page is migrated to device memory (this reduces the maximum
    number of swap files).

    The two main additions to ZONE_DEVICE, besides the memory type, are two
    callbacks. The first one, page_free(), is called whenever the page
    refcount reaches 1 (which means the page is free, as a ZONE_DEVICE page
    never reaches a refcount of 0). This allows the device driver to manage
    its memory and the associated struct pages.

    The second callback, page_fault(), happens when there is a CPU access to
    an address that is backed by a device page (which is un-addressable by
    the CPU). This callback is responsible for migrating the page back to
    system main memory. The device driver cannot block migration back to
    system memory; HMM makes sure that such a page cannot be pinned into
    device memory.

    If the device is in some error condition and cannot migrate the memory
    back, then a CPU page fault to device memory should end with SIGBUS.
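
    A sketch of the option that enables this memory type (ARCH_HAS_HMM and
    the prompt wording are assumptions):

    config DEVICE_PRIVATE
            bool "Unaddressable device memory (GPU memory, ...)"
            depends on ARCH_HAS_HMM
            select HMM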

    [arnd@arndb.de: fix warning]
    Link: http://lkml.kernel.org/r/20170823133213.712917-1-arnd@arndb.de
    Link: http://lkml.kernel.org/r/20170817000548.32038-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Arnd Bergmann
    Acked-by: Dan Williams
    Cc: Ross Zwisler
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: David Nellans
    Cc: Evgeny Baskakov
    Cc: Johannes Weiner
    Cc: John Hubbard
    Cc: Kirill A. Shutemov
    Cc: Mark Hairgrove
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Sherry Cheung
    Cc: Subhash Gutti
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This adds heterogeneous memory management (HMM) process address space
    mirroring. In a nutshell, this provides an API to mirror a process
    address space on a device. This boils down to keeping the CPU and device
    page tables synchronized (we assume that both the device and the CPU are
    cache coherent, as PCIe devices can be).

    This patch provides a simple API for device drivers to achieve address
    space mirroring, thus avoiding the need for each device driver to grow
    its own CPU page table walker and its own CPU page table synchronization
    mechanism.

    This is useful for NVidia GPUs >= Pascal, Mellanox IB >= mlx5 and more
    hardware in the future.

    [jglisse@redhat.com: fix hmm for "mmu_notifier kill invalidate_page callback"]
    Link: http://lkml.kernel.org/r/20170830231955.GD9445@redhat.com
    Link: http://lkml.kernel.org/r/20170817000548.32038-4-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • HMM provides 3 separate types of functionality:
    - Mirroring: synchronize CPU page table and device page table
    - Device memory: allocating struct page for device memory
    - Migration: migrating regular memory to device memory

    This patch introduces some common helpers and definitions shared by all
    three of those functionalities.

    Link: http://lkml.kernel.org/r/20170817000548.32038-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Evgeny Baskakov
    Signed-off-by: John Hubbard
    Signed-off-by: Mark Hairgrove
    Signed-off-by: Sherry Cheung
    Signed-off-by: Subhash Gutti
    Cc: Aneesh Kumar
    Cc: Balbir Singh
    Cc: Benjamin Herrenschmidt
    Cc: Dan Williams
    Cc: David Nellans
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Paul E. McKenney
    Cc: Ross Zwisler
    Cc: Vladimir Davydov
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Introduce CONFIG_ARCH_ENABLE_THP_MIGRATION to limit THP migration
    functionality to x86_64, which should be safer as a first step.
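
    A sketch of the option plus the x86 select it implies (the select line
    is assumed to live inside the x86 Kconfig entry):

    # mm/Kconfig
    config ARCH_ENABLE_THP_MIGRATION
            bool

    # arch/x86/Kconfig (inside the X86 config entry)
    select ARCH_ENABLE_THP_MIGRATION if X86_64 && TRANSPARENT_HUGEPAGE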

    Link: http://lkml.kernel.org/r/20170717193955.20207-5-zi.yan@sent.com
    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Zi Yan
    Reviewed-by: Anshuman Khandual
    Cc: "H. Peter Anvin"
    Cc: Dave Hansen
    Cc: David Nellans
    Cc: Ingo Molnar
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Thomas Gleixner
    Cc: Vlastimil Babka
    Cc: Andrea Arcangeli
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

07 Sep, 2017

1 commit

  • devm_memremap_pages() records mapped ranges in pgmap_radix with an entry
    per section's worth of memory (128MB). The key for each of those
    entries is a section number.

    This leads to false positives when devm_memremap_pages() is passed a
    section-unaligned range, as lookups in the misaligned portion of the
    range fail to return NULL. We can close this hole by using the pfn as
    the key for entries in the tree. The number of entries required to
    describe a remapped range is reduced by leveraging multi-order entries.

    In practice this approach usually yields just one entry in the tree if
    the size and starting address are of the same power-of-2 alignment.
    Previously we always needed nr_entries = mapping_size / 128MB.

    Link: https://lists.01.org/pipermail/linux-nvdimm/2016-August/006666.html
    Link: http://lkml.kernel.org/r/150215410565.39310.13767886055248249438.stgit@dwillia2-desk3.amr.corp.intel.com
    Signed-off-by: Dan Williams
    Reported-by: Toshi Kani
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Williams
     

11 Jul, 2017

1 commit

  • KASAN doesn't work with memory hotplug because hotplugged memory
    doesn't have any shadow memory. So any access to hotplugged memory
    would cause a crash on the shadow check.

    Use a memory hotplug notifier to allocate and map shadow memory when the
    hotplugged memory is going online, and free the shadow after the memory
    has been offlined.

    Link: http://lkml.kernel.org/r/20170601162338.23540-4-aryabinin@virtuozzo.com
    Signed-off-by: Andrey Ryabinin
    Cc: "H. Peter Anvin"
    Cc: Alexander Potapenko
    Cc: Catalin Marinas
    Cc: Dmitry Vyukov
    Cc: Ingo Molnar
    Cc: Ingo Molnar
    Cc: Mark Rutland
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

08 Jul, 2017

1 commit

  • Pull powerpc updates from Michael Ellerman:
    "Highlights include:

    - Support for STRICT_KERNEL_RWX on 64-bit server CPUs.

    - Platform support for FSP2 (476fpe) board

    - Enable ZONE_DEVICE on 64-bit server CPUs.

    - Generic & powerpc spin loop primitives to optimise busy waiting

    - Convert VDSO update function to use new update_vsyscall() interface

    - Optimisations to hypercall/syscall/context-switch paths

    - Improvements to the CPU idle code on Power8 and Power9.

    As well as many other fixes and improvements.

    Thanks to: Akshay Adiga, Andrew Donnellan, Andrew Jeffery, Anshuman
    Khandual, Anton Blanchard, Balbir Singh, Benjamin Herrenschmidt,
    Christophe Leroy, Christophe Lombard, Colin Ian King, Dan Carpenter,
    Gautham R. Shenoy, Hari Bathini, Ian Munsie, Ivan Mikhaylov, Javier
    Martinez Canillas, Madhavan Srinivasan, Masahiro Yamada, Matt Brown,
    Michael Neuling, Michal Suchanek, Murilo Opsfelder Araujo, Naveen N.
    Rao, Nicholas Piggin, Oliver O'Halloran, Paul Mackerras, Pavel Machek,
    Russell Currey, Santosh Sivaraj, Stephen Rothwell, Thiago Jung
    Bauermann, Yang Li"

    * tag 'powerpc-4.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (158 commits)
    powerpc/Kconfig: Enable STRICT_KERNEL_RWX for some configs
    powerpc/mm/radix: Implement STRICT_RWX/mark_rodata_ro() for Radix
    powerpc/mm/hash: Implement mark_rodata_ro() for hash
    powerpc/vmlinux.lds: Align __init_begin to 16M
    powerpc/lib/code-patching: Use alternate map for patch_instruction()
    powerpc/xmon: Add patch_instruction() support for xmon
    powerpc/kprobes/optprobes: Use patch_instruction()
    powerpc/kprobes: Move kprobes over to patch_instruction()
    powerpc/mm/radix: Fix execute permissions for interrupt_vectors
    powerpc/pseries: Fix passing of pp0 in updatepp() and updateboltedpp()
    powerpc/64s: Blacklist rtas entry/exit from kprobes
    powerpc/64s: Blacklist functions invoked on a trap
    powerpc/64s: Un-blacklist system_call() from kprobes
    powerpc/64s: Move system_call() symbol to just after setting MSR_EE
    powerpc/64s: Blacklist system_call() and system_call_common() from kprobes
    powerpc/64s: Convert .L__replay_interrupt_return to a local label
    powerpc64/elfv1: Only dereference function descriptor for non-text symbols
    cxl: Export library to support IBM XSL
    powerpc/dts: Use #include "..." to include local DT
    powerpc/perf/hv-24x7: Aggregate result elements on POWER9 SMT8
    ...

    Linus Torvalds
     

07 Jul, 2017

2 commits

  • Commit 20b2f52b73fe ("numa: add CONFIG_MOVABLE_NODE for
    movable-dedicated node") introduced CONFIG_MOVABLE_NODE without a
    good explanation of why it is actually useful.

    It makes a lot of sense to make the movable node semantic opt-in, but we
    already have that because the feature has to be explicitly enabled on
    the kernel command line. A config option on top only makes the
    configuration space larger without a good reason. It also adds
    ifdefery that pollutes the code.

    Just drop the config option and make it de-facto always enabled. This
    shouldn't introduce any change to the semantic.

    Link: http://lkml.kernel.org/r/20170529114141.536-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Reza Arbab
    Acked-by: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Jerome Glisse
    Cc: Yasuaki Ishimatsu
    Cc: Xishi Qiu
    Cc: Kani Toshimitsu
    Cc: Chen Yucong
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: David Rientjes
    Cc: Daniel Kiper
    Cc: Igor Mammedov
    Cc: Vitaly Kuznetsov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "THP swap: Delay splitting THP during swapping out", v11.

    This patchset is to optimize the performance of Transparent Huge Page
    (THP) swap.

    Recently, the performance of storage devices has improved so fast that
    we cannot saturate the disk bandwidth with a single logical CPU when
    doing page swap-out, even on a high-end server machine, because the
    performance of the storage device has improved faster than that of a
    single logical CPU. And it seems that the trend will not change in the
    near future. On the other hand, THP is becoming more and more popular
    because of increased memory sizes. So it becomes necessary to optimize
    THP swap performance.

    The advantages of the THP swap support include:

    - Batch the swap operations for the THP to reduce lock
    acquiring/releasing, including allocating/freeing the swap space,
    adding/deleting to/from the swap cache, and writing/reading the swap
    space, etc. This will help improve the performance of the THP swap.

    - The THP swap space read/write will be 2M sequential IO. It is
    particularly helpful for the swap read, which are usually 4k random
    IO. This will improve the performance of the THP swap too.

    - It will help with memory fragmentation, especially when THP is
    heavily used by the applications. The 2M contiguous pages will be
    freed up after the THP is swapped out.

    - It will improve THP utilization on systems with swap turned on,
    because the speed at which khugepaged collapses normal pages into a
    THP is quite slow. After a THP is split during swap-out, it takes
    quite a long time for the normal pages to collapse back into a THP
    after being swapped in. High THP utilization also helps the efficiency
    of page-based memory management.

    There are some concerns regarding THP swap in, mainly because possible
    enlarged read/write IO size (for swap in/out) may put more overhead on
    the storage device. To deal with that, the THP swap in should be turned
    on only when necessary. For example, it can be selected via
    "always/never/madvise" logic, to be turned on globally, turned off
    globally, or turned on only for VMA with MADV_HUGEPAGE, etc.

    This patchset is the first step of THP swap support. The plan is to
    delay splitting the THP step by step, and finally avoid splitting the
    THP during swap-out entirely, swapping the THP out/in as a whole.

    As the first step, in this patchset, splitting the huge page is delayed
    from almost the first step of swapping out to after allocating the swap
    space for the THP and adding the THP into the swap cache. This will
    reduce lock acquiring/releasing for the locks used for swap cache
    management.

    With the patchset, the swap out throughput improves 15.5% (from about
    3.73GB/s to about 4.31GB/s) in the vm-scalability swap-w-seq test case
    with 8 processes. The test is done on a Xeon E5 v3 system. The swap
    device used is a RAM simulated PMEM (persistent memory) device. To test
    the sequential swapping out, the test case creates 8 processes, which
    sequentially allocate and write to the anonymous pages until the RAM and
    part of the swap device is used up.

    This patch (of 5):

    In this patch, splitting the huge page is delayed from almost the first
    step of swapping out to after allocating the swap space for the THP
    (Transparent Huge Page) and adding the THP into the swap cache. This
    will batch the corresponding operations, thus improving THP swap-out
    throughput.

    This is the first step of the THP swap optimization. The plan is to
    delay splitting the THP step by step and finally avoid splitting the
    THP altogether.

    In this patch, one swap cluster is used to hold the contents of each THP
    swapped out. So, the size of the swap cluster is changed to that of the
    THP (Transparent Huge Page) on x86_64 architecture (512). For other
    architectures which want such THP swap optimization,
    ARCH_USES_THP_SWAP_CLUSTER needs to be selected in the Kconfig file for
    the architecture. In effect, this will enlarge the swap cluster size by
    2 times on x86_64, which may make it harder to find a free cluster when
    the swap space becomes fragmented. So this may reduce continuous swap
    space allocation and sequential writes in theory. The performance test
    in 0day shows no regressions caused by this.

    In the future of the THP swap optimization, some information about the
    swapped-out THP (such as the compound map count) will be recorded in the
    swap_cluster_info data structure.

    The mem cgroup swap accounting functions are enhanced to support
    charging or uncharging a swap cluster backing a THP as a whole.

    The swap cluster allocate/free functions are added to allocate/free a
    swap cluster for a THP. A fairly simple algorithm is used for swap
    cluster allocation: only the first swap device in the priority list is
    tried for allocating the swap cluster. The function fails if the attempt
    is not successful, and the caller falls back to allocating a single swap
    slot instead. This works well enough for normal cases. If the difference
    in the number of free swap clusters among multiple swap devices is
    significant, it is possible that some THPs are split earlier than
    necessary. For example, this could be caused by a big size difference
    among multiple swap devices.

    The swap cache functions are enhanced to support adding/deleting a THP
    to/from the swap cache as a set of (HPAGE_PMD_NR) sub-pages. This may be
    enhanced in the future with a multi-order radix tree. But because we
    will split the THP soon during swap-out, that optimization doesn't make
    much sense for this first step.

    The THP splitting functions are enhanced to support splitting a THP in
    the swap cache during swap-out. The page lock will be held during
    allocating the swap cluster, adding the THP into the swap cache and
    splitting the THP. So in code paths other than swap-out, if the THP
    needs to be split, PageSwapCache(THP) will always be false.

    The swap cluster is only available for SSD, so the THP swap optimization
    in this patchset has no effect for HDD.

    [ying.huang@intel.com: fix two issues in THP optimize patch]
    Link: http://lkml.kernel.org/r/87k25ed8zo.fsf@yhuang-dev.intel.com
    [hannes@cmpxchg.org: extensive cleanups and simplifications, reduce code size]
    Link: http://lkml.kernel.org/r/20170515112522.32457-2-ying.huang@intel.com
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Johannes Weiner
    Suggested-by: Andrew Morton [for config option]
    Acked-by: Kirill A. Shutemov [for changes in huge_memory.c and huge_mm.h]
    Cc: Andrea Arcangeli
    Cc: Ebru Akagunduz
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Tejun Heo
    Cc: Hugh Dickins
    Cc: Shaohua Li
    Cc: Minchan Kim
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Ying
     

06 Jul, 2017

1 commit

  • Pull percpu updates from Tejun Heo:
    "These are the percpu changes for the v4.13-rc1 merge window. There are
    a couple visibility related changes - tracepoints and allocator stats
    through debugfs, along with __ro_after_init markings and a cosmetic
    rename in percpu_counter.

    Please note that the simple O(#elements_in_the_chunk) area allocator
    used by the percpu allocator is again showing scalability issues,
    primarily with bpf allocating and freeing a large number of counters.
    Dennis is working on the replacement allocator and the percpu
    allocator will be seeing increased churns in the coming cycles"

    * 'for-4.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu: fix static checker warnings in pcpu_destroy_chunk
    percpu: fix early calls for spinlock in pcpu_stats
    percpu: resolve err may not be initialized in pcpu_alloc
    percpu_counter: Rename __percpu_counter_add to percpu_counter_add_batch
    percpu: add tracepoint support for percpu memory
    percpu: expose statistics about percpu memory via debugfs
    percpu: migrate percpu data structures to internal header
    percpu: add missing lockdep_assert_held to func pcpu_free_area
    mark most percpu globals as __ro_after_init

    Linus Torvalds
     

02 Jul, 2017

1 commit

  • Currently ZONE_DEVICE depends on X86_64, and this will get unwieldy as
    new architectures (and platforms) get ZONE_DEVICE support. Move to an
    arch-selected Kconfig option to save us the trouble.
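
    A sketch of the arch-selected shape described (ARCH_HAS_ZONE_DEVICE is
    the assumed name of the new arch-selected option; the listed
    dependencies are approximations):

    config ARCH_HAS_ZONE_DEVICE
            bool

    config ZONE_DEVICE
            bool "Device memory (pmem, etc...) hotplug support"
            depends on MEMORY_HOTPLUG
            depends on MEMORY_HOTREMOVE
            depends on SPARSEMEM_VMEMMAP
            depends on ARCH_HAS_ZONE_DEVICE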

    Cc: linux-mm@kvack.org
    Acked-by: Ingo Molnar
    Acked-by: Balbir Singh
    Signed-off-by: Oliver O'Halloran
    Signed-off-by: Michael Ellerman

    Oliver O'Halloran
     

21 Jun, 2017

1 commit

  • There is limited visibility into the use of percpu memory, leaving us
    unable to reason about the correctness of parameters and the overall use
    of percpu memory. These counters and statistics aim to help understand
    basic statistics about percpu memory, such as the number of allocations
    over the lifetime, allocation sizes, and fragmentation.

    New Config: PERCPU_STATS
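
    A sketch of that entry as it would appear in mm/Kconfig (the DEBUG_FS
    dependency and prompt wording are assumptions):

    config PERCPU_STATS
            bool "Collect percpu allocator statistics"
            default n
            depends on DEBUG_FS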

    Signed-off-by: Dennis Zhou
    Signed-off-by: Tejun Heo

    Dennis Zhou
     

13 Jun, 2017

1 commit

  • This patch provides all the callbacks required by the generic
    get_user_pages_fast() code, switches x86 over to it, and removes
    the platform-specific implementation.

    Signed-off-by: Kirill A. Shutemov
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Brian Gerst
    Cc: Dave Hansen
    Cc: Denys Vlasenko
    Cc: H. Peter Anvin
    Cc: Josh Poimboeuf
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-mm@kvack.org
    Link: http://lkml.kernel.org/r/20170606113133.22974-2-kirill.shutemov@linux.intel.com
    Signed-off-by: Ingo Molnar

    Kirill A. Shutemov
     

13 Dec, 2016

3 commits

  • Add an arch-specific callback in the generic THP page cache code that
    will deposit and withdraw the preallocated page table. Archs like ppc64
    use this preallocated table to store the hash PTE slot information.

    Testing:
    kernel build of the patch series on tmpfs mounted with option huge=always

    The related thp stat:
    thp_fault_alloc 72939
    thp_fault_fallback 60547
    thp_collapse_alloc 603
    thp_collapse_alloc_failed 0
    thp_file_alloc 253763
    thp_file_mapped 4251
    thp_split_page 51518
    thp_split_page_failed 1
    thp_deferred_split_page 73566
    thp_split_pmd 665
    thp_zero_page_alloc 3
    thp_zero_page_alloc_failed 0

    [akpm@linux-foundation.org: remove unneeded parentheses, per Kirill]
    Link: http://lkml.kernel.org/r/20161113150025.17942-2-aneesh.kumar@linux.vnet.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Michael Neuling
    Cc: Paul Mackerras
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • When movable nodes are enabled, any node containing only hotpluggable
    memory is made movable at boot time.

    On x86, hotpluggable memory is discovered by parsing the ACPI SRAT,
    making corresponding calls to memblock_mark_hotplug().

    If we introduce a dt property to describe memory as hotpluggable,
    configs supporting early fdt may then also do this marking and use
    movable nodes.

    Link: http://lkml.kernel.org/r/1479160961-25840-5-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Tested-by: Balbir Singh
    Acked-by: Balbir Singh
    Cc: "Aneesh Kumar K.V"
    Cc: "H. Peter Anvin"
    Cc: Alistair Popple
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Bharata B Rao
    Cc: Frank Rowand
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Nathan Fontenot
    Cc: Paul Mackerras
    Cc: Rob Herring
    Cc: Stewart Smith
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     
  • To support movable memory nodes (CONFIG_MOVABLE_NODE), at least one of
    the following must be true:

    1. This config has the capability to identify movable nodes at boot.
    Right now, only x86 can do this.

    2. Our config supports memory hotplug, which means that a movable node
    can be created by hotplugging all of its memory into ZONE_MOVABLE.

    Fix the Kconfig definition of CONFIG_MOVABLE_NODE, which currently
    recognizes (1), but not (2).
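
    A sketch of the kind of dependency change described, adding case (2)
    next to the existing x86-only case (symbols other than X86_64 and
    MEMORY_HOTPLUG are assumptions, and the real entry has more lines):

    config MOVABLE_NODE
            bool "Enable to assign a node which has only movable memory"
            depends on NUMA
            depends on X86_64 || MEMORY_HOTPLUG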

    Link: http://lkml.kernel.org/r/1479160961-25840-4-git-send-email-arbab@linux.vnet.ibm.com
    Signed-off-by: Reza Arbab
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Balbir Singh
    Cc: "Aneesh Kumar K.V"
    Cc: "H. Peter Anvin"
    Cc: Alistair Popple
    Cc: Aneesh Kumar K.V
    Cc: Benjamin Herrenschmidt
    Cc: Bharata B Rao
    Cc: Frank Rowand
    Cc: Ingo Molnar
    Cc: Michael Ellerman
    Cc: Nathan Fontenot
    Cc: Paul Mackerras
    Cc: Rob Herring
    Cc: Stewart Smith
    Cc: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Reza Arbab
     

28 Oct, 2016

1 commit

  • No, KASAN may not be able to co-exist with HOTPLUG_MEMORY at runtime,
    but for build testing there is no reason not to allow them together.

    This hopefully means better build coverage and fewer embarrassing silly
    problems like the one fixed by commit 9db4f36e82c2 ("mm: remove unused
    variable in memory hotplug") in the future.

    Cc: Stephen Rothwell
    Cc: Andrey Ryabinin
    Cc: Alexander Potapenko
    Signed-off-by: Linus Torvalds

    Linus Torvalds