07 Dec, 2020

1 commit

  • While I was doing zram testing, I found that decompression sometimes
    failed because the compression buffer was corrupted. On investigation, I
    found that the commit below calls cond_resched() unconditionally, which
    can cause a problem in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending call) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some arm32
    platforms, it has been a headache to maintain since it abuses APIs [1]
    (e.g., unmap_kernel_range in atomic context).

    Since we are moving toward deprecating 32-bit machines, the option has been
    restricted to builtin builds since v5.8, and it was never the default in
    zsmalloc anyway, it's time to drop it for easier maintenance.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
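
    A minimal, self-contained sketch (not the zsmalloc/vunmap code itself) of
    the kind of pattern the splat above complains about: calling a function
    that may sleep, such as cond_resched(), while a spinlock is held:

    #include <linux/spinlock.h>
    #include <linux/sched.h>

    static DEFINE_SPINLOCK(demo_lock);

    static void demo_sleep_in_atomic(void)
    {
            spin_lock(&demo_lock);   /* enter atomic context */
            cond_resched();          /* may sleep: with CONFIG_DEBUG_ATOMIC_SLEEP
                                      * this triggers the "sleeping function
                                      * called from invalid context" warning */
            spin_unlock(&demo_lock);
    }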

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

19 Oct, 2020

1 commit

  • Add a proper helper to remap PFNs into kernel virtual space so that
    drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
    for it.
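
    For reference, the helper introduced by this series is vmap_pfn(); a hedged
    usage sketch follows (the wrapper names and the source of the PFN array are
    made up for illustration):

    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Map `count` PFNs, gathered elsewhere by a driver, into vmalloc space. */
    static void *demo_map_pfns(unsigned long *pfns, unsigned int count)
    {
            return vmap_pfn(pfns, count, pgprot_noncached(PAGE_KERNEL));
    }

    static void demo_unmap_pfns(void *vaddr)
    {
            vunmap(vaddr);
    }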

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

17 Oct, 2020

1 commit

  • Currently, it can happen that pages are allocated (and freed) via the
    buddy before basic memory onlining has finished.

    For example, pages are exposed to the buddy and can be allocated before we
    actually mark the sections online. Allocated pages could suddenly fail
    pfn_to_online_page() checks. We had similar issues with pcp handling,
    when pages are allocated+freed before we reach zone_pcp_update() in
    online_pages() [1].

    Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
    impossible. Once done with the heavy lifting, use
    undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
    freelist, marking them ready for allocation. Similar to offline_pages(),
    we have to manually adjust zone->nr_isolate_pageblock.

    [1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

16 Oct, 2020

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - rework the non-coherent DMA allocator

    - move private definitions out of <linux/dma-mapping.h>

    - lower CMA_ALIGNMENT (Paul Cercueil)

    - remove the omap1 dma address translation in favor of the common code

    - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

    - support per-node DMA CMA areas (Barry Song)

    - increase the default seg boundary limit (Nicolin Chen)

    - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

    - various cleanups

    * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
    ARM/ixp4xx: add a missing include of dma-map-ops.h
    dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
    dma-direct: factor out a dma_direct_alloc_from_pool helper
    dma-direct: check for highmem pages in dma_direct_alloc_pages
    dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>
    dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dma
    dma-mapping: move dma-debug.h to kernel/dma/
    dma-mapping: remove <asm/dma-contiguous.h>
    dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>
    dma-contiguous: remove dma_contiguous_set_default
    dma-contiguous: remove dev_set_cma_area
    dma-contiguous: remove dma_declare_contiguous
    dma-mapping: split <linux/dma-mapping.h>
    cma: decrease CMA_ALIGNMENT lower limit to 2
    firewire-ohci: use dma_alloc_pages
    dma-iommu: implement ->alloc_noncoherent
    dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
    dma-mapping: add a new dma_alloc_pages API
    dma-mapping: remove dma_cache_sync
    53c700: convert to dma_alloc_noncoherent
    ...

    Linus Torvalds
     

14 Oct, 2020

1 commit

  • In the beginning, mm/gup_benchmark.c supported get_user_pages_fast() only,
    but right now, it supports the benchmarking of a couple of
    get_user_pages() related calls like:

    * get_user_pages_fast()
    * get_user_pages()
    * pin_user_pages_fast()
    * pin_user_pages()

    The documentation is confusing and needs an update.
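
    For context, a hedged kernel-side sketch of the pin/unpin pairing that the
    pin_user_pages*() variants above are benchmarked for (the wrapper name is
    made up; error handling is trimmed):

    #include <linux/mm.h>

    /* Pin nr_pages of a user buffer for device access, then drop the pins. */
    static int demo_pin_user_buffer(unsigned long uaddr, int nr_pages,
                                    struct page **pages)
    {
            int pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

            if (pinned <= 0)
                    return pinned ? pinned : -EFAULT;

            /* ... hand `pages` to the device / do the benchmarked work ... */

            unpin_user_pages(pages, pinned);
            return 0;
    }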

    Signed-off-by: Barry Song
    Signed-off-by: Andrew Morton
    Cc: John Hubbard
    Cc: Keith Busch
    Cc: Ira Weiny
    Cc: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20200821032546.19992-1-song.bao.hua@hisilicon.com
    Signed-off-by: Linus Torvalds

    Barry Song
     

25 Sep, 2020

1 commit

  • nommu-mmap.rst was moved to Documentation/admin-guide/mm; this patch
    updates the remaining stale references to Documentation/mm.

    Fixes: 800c02f5d030 ("docs: move nommu-mmap.txt to admin-guide and rename to ReST")
    Signed-off-by: Stephen Kitt
    Link: https://lore.kernel.org/r/20200812092230.27541-1-steve@sk2.org
    Signed-off-by: Jonathan Corbet

    Stephen Kitt
     

01 Sep, 2020

1 commit

  • Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
    coherent DMA buffers to save their command queues and page tables. As
    there is only one default CMA in the whole system, SMMUs on nodes other
    than node0 will get remote memory. This leads to significant latency.

    This patch provides per-numa CMA so that drivers like SMMU can get local
    memory. Tests show that localizing CMA can decrease dma_unmap latency
    considerably. For instance, before this patch, the SMMU on node2 has to
    wait more than 560ns for the completion of CMD_SYNC in an empty command
    queue; with this patch, it needs only 240ns.

    A positive side effect of this patch would be improving performance even
    further for those users who are worried about performance more than DMA
    security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
    drivers can get local coherent DMA buffers.
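
    A hedged sketch of the allocation path that benefits (the wrapper names and
    sizes are placeholders): a driver on a non-zero NUMA node asks for a
    coherent buffer, and with per-numa CMA that request can now be satisfied
    from node-local memory:

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>

    /* Allocate a coherent command-queue buffer for `dev`. */
    static void *demo_alloc_cmdq(struct device *dev, size_t size, dma_addr_t *dma)
    {
            /* With this patch, this can be served from the CMA area on dev's node. */
            return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
    }

    static void demo_free_cmdq(struct device *dev, size_t size, void *cpu,
                               dma_addr_t dma)
    {
            dma_free_coherent(dev, size, cpu, dma);
    }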

    Also, this patch changes the default CONFIG_CMA_AREAS to 19 when NUMA is
    enabled, as 1 + CONFIG_CMA_AREAS should be quite enough for most servers
    on the market even if they enable both hugetlb_cma and pernuma_cma:
    2 numa nodes: 2(hugetlb) + 2(pernuma) + 1(default global cma) = 5
    4 numa nodes: 4(hugetlb) + 4(pernuma) + 1(default global cma) = 9
    8 numa nodes: 8(hugetlb) + 8(pernuma) + 1(default global cma) = 17

    Signed-off-by: Barry Song
    Signed-off-by: Christoph Hellwig

    Barry Song
     

08 Aug, 2020

1 commit

  • After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
    functions that call memory_present() for each region in memblock.memory:
    sparse_memory_present_with_active_regions() and memblocks_present().

    Moreover, all architectures have a call to either of these functions
    preceding the call to sparse_init() and in most cases they are called
    one after the other.

    Mark the regions from memblock.memory as present during sparse_init() by
    making sparse_init() call memblocks_present(), make the memblocks_present()
    and memory_present() functions static, and remove the redundant
    sparse_memory_present_with_active_regions() function.

    Also remove no longer required HAVE_MEMORY_PRESENT configuration option.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

27 Jun, 2020

1 commit

  • The nommu-mmap.txt file provides a description of user-visible
    behaviour. So, move it to the admin-guide.

    As it is already in ReST format, also rename it.

    Suggested-by: Mike Rapoport
    Suggested-by: Jonathan Corbet
    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/3a63d1833b513700755c85bf3bda0a6c4ab56986.1592918949.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

08 Jun, 2020

1 commit

  • Pull sparc updates from David Miller:

    - Rework the sparc32 page tables so that READ_ONCE(*pmd), as done by
    generic code, operates on a word sized element. From Will Deacon.

    - Some scnprintf() conversions, from Chen Zhou.

    - A pin_user_pages() conversion from John Hubbard.

    - Several 32-bit ptrace register handling fixes and such from Al Viro.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
    fix a braino in "sparc32: fix register window handling in genregs32_[gs]et()"
    sparc32: mm: Only call ctor()/dtor() functions for first and last user
    sparc32: mm: Disable SPLIT_PTLOCK_CPUS
    sparc32: mm: Don't try to free page-table pages if ctor() fails
    sparc32: register memory occupied by kernel as memblock.memory
    sparc: remove unused header file nfs_fs.h
    sparc32: fix register window handling in genregs32_[gs]et()
    sparc64: fix misuses of access_process_vm() in genregs32_[sg]et()
    oradax: convert get_user_pages() --> pin_user_pages()
    sparc: use scnprintf() in show_pciobppath_attr() in vio.c
    sparc: use scnprintf() in show_pciobppath_attr() in pci.c
    tty: vcc: Fix error return code in vcc_probe()
    sparc32: mm: Reduce allocation size for PMD and PTE tables
    sparc32: mm: Change pgtable_t type to pte_t * instead of struct page *
    sparc32: mm: Restructure sparc32 MMU page-table layout
    sparc32: mm: Fix argument checking in __srmmu_get_nocache()
    sparc64: Replace zero-length array with flexible-array
    sparc: mm: return true,false in kern_addr_valid()

    Linus Torvalds
     

05 Jun, 2020

2 commits

  • Memory hotplug is broken for 32b systems at least since c6f03e2903c9 ("mm,
    memory_hotplug: remove zone restrictions") which considerably reworked how
    memory can be associated with movable/kernel zones. The same is not
    really trivial to achieve in 32b where only lowmem is the kernel zone.
    While we could work around this immediate problem, there are likely other
    land mines hidden at other places.

    It is also quite dubious that there is a real usecase for memory
    hotplug on 32b in the first place. Low memory is just too small to be
    hotpluggable (for hot add) and generally unusable for hotremove. Adding
    more memory to highmem is also dubious because it would increase the low
    mem or vmalloc space pressure for memmaps.

    Restrict the functionality to 64b systems. This will help future
    development to focus on usecases that have real life application. We can
    remove this restriction in the future if a real life usecase shows up, but
    until then make it explicit that hotplug on 32b is broken and requires a
    non-trivial amount of work to fix.

    Robin said:
    "32-bit Arm doesn't support memory hotplug, and as far as I'm aware
    there's little likelihood of it ever wanting to. FWIW it looks like
    SuperH is the only pure-32-bit architecture to have hotplug support at
    all"

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: Baoquan He
    Cc: Wei Yang
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Robin Murphy
    Cc: Vamshi K Sthambamkadi
    Link: http://lkml.kernel.org/r/20200218100532.GA4151@dhcp22.suse.cz
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206401
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The comment in add_memory_resource() is stale: hotadd_new_pgdat() will no
    longer call get_pfn_range_for_nid(), as a hotadded pgdat will simply span
    no pages at all, until memory is moved to the zone/node via
    move_pfn_range_to_zone() - e.g., when onlining memory blocks.

    The only archs that care about memblocks for hotplugged memory (either for
    iterating over all system RAM or testing for memory validity) are arm64,
    s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK. Without
    CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblocks.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Mike Rapoport
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

04 Jun, 2020

2 commits

  • Deferred struct page init is a significant bottleneck in kernel boot.
    Optimizing it maximizes availability for large-memory systems and allows
    spinning up short-lived VMs as needed without having to leave them
    running. It also benefits bare metal machines hosting VMs that are
    sensitive to downtime. In projects such as VMM Fast Restart[1], where
    guest state is preserved across kexec reboot, it helps prevent application
    and network timeouts in the guests.

    Multithread to take full advantage of system memory bandwidth.

    The maximum number of threads is capped at the number of CPUs on the node
    because speedups always improve with additional threads on every system
    tested, and at this phase of boot, the system is otherwise idle and
    waiting on page init to finish.

    Helper threads operate on section-aligned ranges to both avoid false
    sharing when setting the pageblock's migrate type and to avoid accessing
    uninitialized buddy pages, though max order alignment is enough for the
    latter.

    The minimum chunk size is also a section. There was benefit to using
    multiple threads even on relatively small memory (1G) systems, and this is
    the smallest size that the alignment allows.
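
    Purely as an illustration of the chunking described above (this is not the
    series' code; init_chunk() is a hypothetical stand-in for the per-chunk
    initialisation worker, and nr_threads is assumed to be at least 1):

    #include <linux/kernel.h>
    #include <linux/mmzone.h>

    /* Hypothetical per-chunk worker. */
    static void init_chunk(unsigned long start_pfn, unsigned long end_pfn);

    static void demo_deferred_init(unsigned long start_pfn, unsigned long end_pfn,
                                   unsigned int nr_threads)
    {
            unsigned long span  = end_pfn - start_pfn;
            unsigned long chunk = ALIGN(DIV_ROUND_UP(span, nr_threads),
                                        PAGES_PER_SECTION);
            unsigned long pfn;

            /* Each helper thread gets a section-aligned, section-multiple chunk. */
            for (pfn = start_pfn; pfn < end_pfn; pfn += chunk)
                    init_chunk(pfn, min(pfn + chunk, end_pfn));
    }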

    The time (milliseconds) is the slowest node to initialize since boot
    blocks until all nodes finish. intel_pstate is loaded in active mode
    without hwp and with turbo enabled, and intel_idle is active as well.

    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
    2 nodes * 26 cores * 2 threads = 104 CPUs
    384G/node = 768G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 4089.7 ( 8.1) -- 1785.7 ( 7.6)
    2% ( 1) 1.7% 4019.3 ( 1.5) 3.8% 1717.7 ( 11.8)
    12% ( 6) 34.9% 2662.7 ( 2.9) 79.9% 359.3 ( 0.6)
    25% ( 13) 39.9% 2459.0 ( 3.6) 91.2% 157.0 ( 0.0)
    37% ( 19) 39.2% 2485.0 ( 29.7) 90.4% 172.0 ( 28.6)
    50% ( 26) 39.3% 2482.7 ( 25.7) 90.3% 173.7 ( 30.0)
    75% ( 39) 39.0% 2495.7 ( 5.5) 89.4% 190.0 ( 1.0)
    100% ( 52) 40.2% 2443.7 ( 3.8) 92.3% 138.0 ( 1.0)

    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
    1 node * 16 cores * 2 threads = 32 CPUs
    192G/node = 192G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 1988.7 ( 9.6) -- 1096.0 ( 11.5)
    3% ( 1) 1.1% 1967.0 ( 17.6) 0.3% 1092.7 ( 11.0)
    12% ( 4) 41.1% 1170.3 ( 14.2) 73.8% 287.0 ( 3.6)
    25% ( 8) 47.1% 1052.7 ( 21.9) 83.9% 177.0 ( 13.5)
    38% ( 12) 48.9% 1016.3 ( 12.1) 86.8% 144.7 ( 1.5)
    50% ( 16) 48.9% 1015.7 ( 8.1) 87.8% 134.0 ( 4.4)
    75% ( 24) 49.1% 1012.3 ( 3.1) 88.1% 130.3 ( 2.3)
    100% ( 32) 49.5% 1004.0 ( 5.3) 88.5% 125.7 ( 2.1)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
    2 nodes * 18 cores * 2 threads = 72 CPUs
    128G/node = 256G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 1680.0 ( 4.6) -- 627.0 ( 4.0)
    3% ( 1) 0.3% 1675.7 ( 4.5) -0.2% 628.0 ( 3.6)
    11% ( 4) 25.6% 1250.7 ( 2.1) 67.9% 201.0 ( 0.0)
    25% ( 9) 30.7% 1164.0 ( 17.3) 81.8% 114.3 ( 17.7)
    36% ( 13) 31.4% 1152.7 ( 10.8) 84.0% 100.3 ( 17.9)
    50% ( 18) 31.5% 1150.7 ( 9.3) 83.9% 101.0 ( 14.1)
    75% ( 27) 31.7% 1148.0 ( 5.6) 84.5% 97.3 ( 6.4)
    100% ( 36) 32.0% 1142.3 ( 4.0) 85.6% 90.0 ( 1.0)

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
    1 node * 8 cores * 2 threads = 16 CPUs
    64G/node = 64G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 1029.3 ( 25.1) -- 240.7 ( 1.5)
    6% ( 1) -0.6% 1036.0 ( 7.8) -2.2% 246.0 ( 0.0)
    12% ( 2) 11.8% 907.7 ( 8.6) 44.7% 133.0 ( 1.0)
    25% ( 4) 13.9% 886.0 ( 10.6) 62.6% 90.0 ( 6.0)
    38% ( 6) 17.8% 845.7 ( 14.2) 69.1% 74.3 ( 3.8)
    50% ( 8) 16.8% 856.0 ( 22.1) 72.9% 65.3 ( 5.7)
    75% ( 12) 15.4% 871.0 ( 29.2) 79.8% 48.7 ( 7.4)
    100% ( 16) 21.0% 813.7 ( 21.0) 80.5% 47.0 ( 5.2)

    Server-oriented distros that enable deferred page init sometimes run in
    small VMs, and they still benefit even though the fraction of boot time
    saved is smaller:

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
    1 node * 2 cores * 2 threads = 4 CPUs
    16G/node = 16G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 716.0 ( 14.0) -- 49.7 ( 0.6)
    25% ( 1) 1.8% 703.0 ( 5.3) -4.0% 51.7 ( 0.6)
    50% ( 2) 1.6% 704.7 ( 1.2) 43.0% 28.3 ( 0.6)
    75% ( 3) 2.7% 696.7 ( 13.1) 49.7% 25.0 ( 0.0)
    100% ( 4) 4.1% 687.0 ( 10.4) 55.7% 22.0 ( 0.0)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
    1 node * 2 cores * 2 threads = 4 CPUs
    14G/node = 14G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 787.7 ( 6.4) -- 122.3 ( 0.6)
    25% ( 1) 0.2% 786.3 ( 10.8) -2.5% 125.3 ( 2.1)
    50% ( 2) 5.9% 741.0 ( 13.9) 37.6% 76.3 ( 19.7)
    75% ( 3) 8.3% 722.0 ( 19.0) 49.9% 61.3 ( 3.2)
    100% ( 4) 9.3% 714.7 ( 9.5) 56.4% 53.3 ( 1.5)

    On Josh's 96-CPU and 192G memory system:

    Without this patch series:
    [ 0.487132] node 0 initialised, 23398907 pages in 292ms
    [ 0.499132] node 1 initialised, 24189223 pages in 304ms
    ...
    [ 0.629376] Run /sbin/init as init process

    With this patch series:
    [ 0.231435] node 1 initialised, 24189223 pages in 32ms
    [ 0.236718] node 0 initialised, 23398907 pages in 36ms

    [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

    Signed-off-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Tested-by: Josh Triplett
    Reviewed-by: Alexander Duyck
    Cc: Alex Williamson
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Herbert Xu
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Pavel Machek
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Robert Elliott
    Cc: Shile Zhang
    Cc: Steffen Klassert
    Cc: Steven Sistare
    Cc: Tejun Heo
    Cc: Zi Yan
    Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     
  • CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
    nodes and zones structures between the systems that have region to node
    mapping in memblock and those that don't.

    Currently all the NUMA architectures enable this option and for the
    non-NUMA systems we can presume that all the memory belongs to node 0 and
    therefore the compile time configuration option is not required.

    The remaining few architectures that use DISCONTIGMEM without NUMA are
    easily updated to use memblock_add_node() instead of memblock_add() and
    thus have proper correspondence of memblock regions to NUMA nodes.
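
    As a hedged illustration of that conversion, an architecture that today
    registers a bank of memory with:

            memblock_add(base, size);

    would instead register it with an explicit (single) node id:

            memblock_add_node(base, size, 0);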

    Still, free_area_init_node() must have a backward compatible version
    because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP are
    different. Once all the architectures use the new semantics, the entire
    compatibility layer can be dropped.

    To avoid adding extra runtime memory to store the node id for
    architectures that keep memblock but have only a single node, the node id
    field of memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
    the corresponding accessors presume that in those cases it is always 0.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Tested-by: Hoan Tran [arm64]
    Acked-by: Catalin Marinas [arm64]
    Cc: Baoquan He
    Cc: Brian Cain
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

03 Jun, 2020

3 commits

  • The SRMMU page-table allocator is not compatible with SPLIT_PTLOCK_CPUS
    for two major reasons:

    1. Pages are allocated via memblock, and therefore the ptl is not
    cleared by prep_new_page(), which is expected by ptlock_init()

    2. Multiple PTE tables can exist in a single page, causing them to
    share the same ptl and deadlock when attempting to take the same
    lock twice (e.g. as part of copy_page_range()).

    Ensure that SPLIT_PTLOCK_CPUS is not selected for SPARC32.

    Cc: David S. Miller
    Signed-off-by: Will Deacon
    Signed-off-by: David S. Miller

    Will Deacon
     
  • This allows unexporting map_vm_area and unmap_kernel_range, which are
    rather deep internals and should not be available to modules, as they for
    example allow fine-grained control of mapping permissions, and also
    allow splitting the setup of a vmalloc area and the actual mapping and
    thus expose vmalloc internals.

    zsmalloc is typically built-in and continues to work (just like the
    percpu-vm code using a similar pattern), while modular zsmalloc also
    continues to work, but must use copies.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Rename the Kconfig variable to clarify the scope.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-11-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

09 Apr, 2020

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit test
    compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     

08 Apr, 2020

3 commits

  • The compressed cache for swap pages (zswap) currently needs from 1 to 3
    extra kernel command line parameters in order to make it work: it has to
    be enabled by adding a "zswap.enabled=1" command line parameter and if one
    wants a different compressor or pool allocator than the default lzo / zbud
    combination then these choices also need to be specified on the kernel
    command line in additional parameters.

    Using a different compressor and allocator for zswap is actually pretty
    common as guides often recommend using the lz4 / z3fold pair instead of
    the default one. In such a case it is also necessary to remember to enable
    the appropriate compression algorithm and pool allocator in the kernel
    config manually.
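
    For example, the boot command line that the new Kconfig defaults make
    unnecessary for the lz4 / z3fold case mentioned above would be (parameter
    names as exposed by the zswap module):

            zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=z3fold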

    Let's avoid the need for adding these kernel command line parameters and
    automatically pull in the dependencies for the selected compressor
    algorithm and pool allocator by adding appropriate default switches to
    Kconfig.

    The default values for these options match what the code was using
    previously as its defaults.

    Signed-off-by: Maciej S. Szmigiero
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Wool
    Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
    Signed-off-by: Linus Torvalds

    Maciej S. Szmigiero
     
  • In order to pave the way for free page reporting in virtualized
    environments we will need a way to get pages out of the free lists and
    identify those pages after they have been returned. To accomplish this,
    this patch adds the concept of a Reported Buddy, which is essentially
    meant to just be the Uptodate flag used in conjunction with the Buddy page
    type.

    To prevent the reported pages from leaking outside of the buddy lists I
    added a check to clear the PageReported bit in the del_page_from_free_list
    function. As a result any reported page that is split, merged, or
    allocated will have the flag cleared prior to the PageBuddy value being
    cleared.

    The process for reporting pages is fairly simple. Once we free a page
    that meets the minimum order for page reporting we will schedule a worker
    thread to start 2s or more in the future. That worker thread will begin
    working from the lowest supported page reporting order up to MAX_ORDER - 1
    pulling unreported pages from the free list and storing them in the
    scatterlist.

    When processing each individual free list it is necessary for the worker
    thread to release the zone lock when it needs to stop and report the full
    scatterlist of pages. To reduce the work of the next iteration the worker
    thread will rotate the free list so that the first unreported page in the
    free list becomes the first entry in the list.

    It will then call a reporting function providing information on how many
    entries are in the scatterlist. Once the function completes it will
    return the pages to the free area from which they were allocated and start
    over pulling more pages from the free areas until there are no longer
    enough pages to report on to keep the worker busy, or we have processed as
    many pages as were contained in the free area when we started processing
    the list.

    The worker thread will work in a round-robin fashion making its way
    through each zone requesting reporting, and through each reportable free
    list within that zone. Once all free areas within the zone have been
    processed it will check to see if there have been any requests for
    reporting while it was processing. If so it will reschedule the worker
    thread to start up again in roughly 2s and exit.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

18 Feb, 2020

1 commit

  • Currently x86 numa_meminfo is marked __initdata in the
    CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
    drivers to map reserved memory to a 'target_node'
    (phys_to_target_node()), add support for removing the __initdata
    designation for those users. Both memory hotplug and
    phys_to_target_node() users select CONFIG_NUMA_KEEP_MEMINFO to tell the
    arch to maintain its physical address to NUMA mapping infrastructure
    post init.

    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc:
    Cc: Andrew Morton
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Reviewed-by: Ingo Molnar
    Signed-off-by: Dan Williams
    Reviewed-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/158188326422.894464.15742054998046628934.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     

02 Dec, 2019

2 commits

  • End a Kconfig help text sentence with a period (aka full stop).

    Link: http://lkml.kernel.org/r/c17f2c75-dc2a-42a4-2229-bb6b489addf2@infradead.org
    Signed-off-by: Randy Dunlap
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Adjust indentation from spaces to tab (+optional two spaces) as in
    coding style, with a command like:

    $ sed -e 's/^        /\t/' -i */Kconfig

    Link: http://lkml.kernel.org/r/1574306437-28837-1-git-send-email-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski
    Reviewed-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: Jiri Kosina
    Cc: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     

01 Dec, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is another round of bug fixing and cleanup. This time the focus
    is on the driver pattern to use mmu notifiers to monitor a VA range.
    This code is lifted out of many drivers and hmm_mirror directly into
    the mmu_notifier core and written using the best ideas from all the
    driver implementations.

    This removes many bugs from the drivers and has a very pleasing
    diffstat. More drivers can still be converted, but that is for another
    cycle.

    - A shared branch with RDMA reworking the RDMA ODP implementation

    - New mmu_interval_notifier API. This is focused on the use case of
    monitoring a VA and simplifies the process for drivers

    - A common seq-count locking scheme built into the
    mmu_interval_notifier API usable by drivers that call
    get_user_pages() or hmm_range_fault() with the VA range

    - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
    GntDev drivers to the new API. This deletes a lot of wonky driver
    code.

    - Two improvements for hmm_range_fault(), from testing done by Ralph"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
    mm/hmm: make full use of walk_page_range()
    xen/gntdev: use mmu_interval_notifier_insert
    mm/hmm: remove hmm_mirror and related
    drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
    drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
    drm/amdgpu: Call find_vma under mmap_sem
    nouveau: use mmu_interval_notifier instead of hmm_mirror
    nouveau: use mmu_notifier directly for invalidate_range_start
    drm/radeon: use mmu_interval_notifier_insert
    RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
    RDMA/odp: Use mmu_interval_notifier_insert()
    mm/hmm: define the pre-processor related parts of hmm.h even if disabled
    mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
    mm/mmu_notifier: add an interval tree notifier
    mm/mmu_notifier: define the header pre-processor parts even if disabled
    mm/hmm: allow snapshot of the special zero page

    Linus Torvalds
     

24 Nov, 2019

2 commits

  • The only two users of this are now converted to use mmu_interval_notifier,
    delete all the code and update hmm.rst.

    Link: https://lore.kernel.org/r/20191112202231.3856-14-jgg@ziepe.ca
    Reviewed-by: Jérôme Glisse
    Tested-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Of the 13 users of mmu_notifiers, 8 of them use only
    invalidate_range_start/end() and immediately intersect the
    mmu_notifier_range with some kind of internal list of VAs. 4 use an
    interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
    of some kind (scif_dma, vhost, gntdev, hmm)

    And the remaining 5 either don't use invalidate_range_start() or do some
    special thing with it.

    It turns out that building a correct scheme with an interval tree is
    pretty complicated, particularly if the use case is synchronizing against
    another thread doing get_user_pages(). Many of these implementations have
    various subtle and difficult to fix races.

    This approach puts the interval tree as common code at the top of the mmu
    notifier call tree and implements a shareable locking scheme.

    It includes:
    - An interval tree tracking VA ranges, with per-range callbacks
    - A read/write locking scheme for the interval tree that avoids
    sleeping in the notifier path (for OOM killer)
    - A sequence counter based collision-retry locking scheme to tell
    device page fault that a VA range is being concurrently invalidated.

    This is based on various ideas:
    - hmm accumulates invalidated VA ranges and releases them when all
    invalidates are done, via active_invalidate_ranges count.
    This approach avoids having to intersect the interval tree twice (as
    umem_odp does) at the potential cost of a longer device page fault.

    - kvm/umem_odp use a sequence counter to drive the collision retry,
    via invalidate_seq

    - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
    This makes adding/removing interval tree members more deterministic

    - seqlock, except this version makes the seqlock idea multi-holder on the
    write side by protecting it with active_invalidate_ranges and a spinlock

    To minimize MM overhead when only the interval tree is being used, the
    entire SRCU and hlist overheads are dropped using some simple
    branches. Similarly the interval tree overhead is dropped when in hlist
    mode.

    The overhead from the mandatory spinlock is broadly the same as that of
    most existing users, which already had a lock (or two) of some sort on the
    invalidation path.
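
    A hedged sketch of the resulting collision-retry pattern on the driver side
    (the mutex and the device-mapping step are placeholders; the mmu_interval_*
    calls are the new API described above):

    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(demo_update_lock);  /* placeholder driver lock */

    static int demo_map_range(struct mmu_interval_notifier *mni)
    {
            unsigned long seq;

    again:
            seq = mmu_interval_read_begin(mni);

            /* ... fault/snapshot the pages, e.g. with hmm_range_fault() ... */

            mutex_lock(&demo_update_lock);
            if (mmu_interval_read_retry(mni, seq)) {
                    /* The range was invalidated concurrently; start over. */
                    mutex_unlock(&demo_update_lock);
                    goto again;
            }
            /* ... program the device page tables under the lock ... */
            mutex_unlock(&demo_update_lock);
            return 0;
    }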

    Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
    Acked-by: Christian König
    Tested-by: Philip Yang
    Tested-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

06 Nov, 2019

1 commit

  • Add two utilities to 1) write-protect and 2) clean all ptes pointing into
    a range of an address space.
    The utilities are intended to aid in tracking dirty pages (either
    driver-allocated system memory or pci device memory).
    The write-protect utility should be used in conjunction with
    page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
    accesses. Typically one would want to use this on sparse accesses into
    large memory regions. The clean utility should be used to utilize
    hardware dirtying functionality and avoid the overhead of page-faults,
    typically on large accesses into small memory regions.
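
    A hedged sketch of the page-fault side of that scheme (only the
    pfn_mkwrite() hook is shown; the new write-protect/clean helpers themselves
    and the dirty-tracking store are placeholders):

    #include <linux/mm.h>

    static vm_fault_t demo_pfn_mkwrite(struct vm_fault *vmf)
    {
            /* ... record vmf->pgoff as dirty in a driver-private bitmap ... */
            return 0;   /* let the fault code make the PTE writable again */
    }

    static const struct vm_operations_struct demo_vm_ops = {
            .pfn_mkwrite = demo_pfn_mkwrite,
    };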

    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Signed-off-by: Thomas Hellstrom
    Acked-by: Andrew Morton

    Thomas Hellstrom
     

25 Sep, 2019

2 commits

  • This patch is (hopefully) the first step to enable THP for non-shmem
    filesystems.

    This patch enables an application to place parts of its text sections into
    THP via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

    We tried to reuse the logic for THP on tmpfs.

    Currently, write is not supported for non-shmem THP. khugepaged will only
    process VMAs with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
    (see ksys_mmap_pgoff()). The only way to create a VMA with VM_DENYWRITE is
    execve(). This requirement limits non-shmem THP to text sections.

    The next patch will handle writes, which would only happen when all the
    VMAs with VM_DENYWRITE are unmapped.

    An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
    feature.
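
    Expanding the one-liner above into a hedged, self-contained userspace
    sketch (the address is derived from the program's own execve-mapped text,
    so the VMA carries VM_DENYWRITE; a 2MB huge page size is assumed and error
    handling is omitted):

    #include <stdint.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)

    int main(void)
    {
            /* Round a text address (main itself) down to a 2MB boundary ... */
            uintptr_t text = (uintptr_t)main & ~(HPAGE_SIZE - 1);

            /* ... and ask for that text range to be backed by THPs. */
            madvise((void *)text, HPAGE_SIZE, MADV_HUGEPAGE);
            return 0;
    }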

    [songliubraving@fb.com: fix build without CONFIG_SHMEM]
    Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
    [songliubraving@fb.com: fix double unlock in collapse_file()]
    Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
    Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Patch series "mm: remove quicklist page table caches".

    A while ago Nicholas proposed to remove quicklist page table caches [1].

    I've rebased his patch on the current upstream and switched ia64 and sh to
    use generic versions of PTE allocation.

    [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

    This patch (of 3):

    Remove page table allocator "quicklists". These have been around for a
    long time, but have not got much traction in the last decade and are only
    used on ia64 and sh architectures.

    The numbers in the initial commit look interesting but probably don't
    apply anymore. If anybody wants to resurrect this it's in the git
    history, but it's unhelpful to have this code and divergent allocator
    behaviour for minor archs.

    Also it might be better to instead make more general improvements to the
    page allocator if this is still so slow.

    Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Mike Rapoport
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

20 Aug, 2019

1 commit

  • CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
    device private memory support. Remove the option and just check for
    CONFIG_DEVICE_PRIVATE instead.

    Link: https://lore.kernel.org/r/20190814075928.23766-11-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

08 Aug, 2019

2 commits


17 Jul, 2019

1 commit

  • ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
    with the long-out-of-date comment can lead to the impression that an
    architecture may just enable it (since __add_pages() now "comprehends
    device memory" for itself) and expect things to work.

    In practice, however, ZONE_DEVICE users have little chance of
    functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
    that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
    dependency so the real situation is clearer.

    Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
    Signed-off-by: Robin Murphy
    Acked-by: Dan Williams
    Reviewed-by: Ira Weiny
    Acked-by: Oliver O'Halloran
    Reviewed-by: Anshuman Khandual
    Cc: Michael Ellerman
    Cc: Catalin Marinas
    Cc: David Hildenbrand
    Cc: Jerome Glisse
    Cc: Michal Hocko
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Murphy
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are hold overs from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

4 commits

  • While only powerpc supports the hugepd case, the code is pretty generic
    and I'd like to keep all GUP internals in one place.

    Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Always build mm/gup.c so that we don't have to provide separate nommu
    stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
    used when HAVE_FAST_GUP is not set into the main implementations, which
    will never call the fast path in that case.

    This also ensures the new put_user_pages* helpers are available for nommu,
    as those are currently missing, which would create a problem as soon as we
    actually grow users for them.

    Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • We only support the generic GUP now, so rename the config option to
    be more clear, and always use the mm/Kconfig definition of the
    symbol and select it from the arch Kconfigs.

    Link: http://lkml.kernel.org/r/20190625143715.1689-11-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Khalid Aziz
    Reviewed-by: Jason Gunthorpe
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The split low/high access is the only non-READ_ONCE version of gup_get_pte
    that did show up in the various arch implementations. Lift it to common
    code and drop the ifdef based arch override.

    Link: http://lkml.kernel.org/r/20190625143715.1689-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

03 Jul, 2019

1 commit