07 Dec, 2020

1 commit

  • While I was doing zram testing, I found that decompression sometimes
    failed because the compression buffer was corrupted. On investigation, I
    found that the commit below calls cond_resched() unconditionally, which
    can cause a problem in atomic context if the task is rescheduled.

    BUG: sleeping function called from invalid context at mm/vmalloc.c:108
    in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 946, name: memhog
    3 locks held by memhog/946:
    #0: ffff9d01d4b193e8 (&mm->mmap_lock#2){++++}-{4:4}, at: __mm_populate+0x103/0x160
    #1: ffffffffa3d53de0 (fs_reclaim){+.+.}-{0:0}, at: __alloc_pages_slowpath.constprop.0+0xa98/0x1160
    #2: ffff9d01d56b8110 (&zspage->lock){.+.+}-{3:3}, at: zs_map_object+0x8e/0x1f0
    CPU: 0 PID: 946 Comm: memhog Not tainted 5.9.3-00011-gc5bfc0287345-dirty #316
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1 04/01/2014
    Call Trace:
    unmap_kernel_range_noflush+0x2eb/0x350
    unmap_kernel_range+0x14/0x30
    zs_unmap_object+0xd5/0xe0
    zram_bvec_rw.isra.0+0x38c/0x8e0
    zram_rw_page+0x90/0x101
    bdev_write_page+0x92/0xe0
    __swap_writepage+0x94/0x4a0
    pageout+0xe3/0x3a0
    shrink_page_list+0xb94/0xd60
    shrink_inactive_list+0x158/0x460

    We can fix this by removing the ZSMALLOC_PGTABLE_MAPPING feature (which
    contains the offending call) from zsmalloc.

    Even though this option showed some improvement (e.g., 30%) on some arm32
    platforms, it has been a headache to maintain since it abuses APIs [1]
    (e.g., unmap_kernel_range in atomic context).

    Since we are moving toward deprecating 32-bit machines, the option has been
    restricted to builtin builds since v5.8, and it was never the default in
    zsmalloc anyway, it's time to drop it for easier maintenance.

    [1] http://lore.kernel.org/linux-mm/20201105170249.387069-1-minchan@kernel.org
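
    A minimal, self-contained sketch (not the zsmalloc/vunmap code itself) of
    the kind of pattern the splat above complains about: calling a function
    that may sleep, such as cond_resched(), while a spinlock is held:

    #include <linux/spinlock.h>
    #include <linux/sched.h>

    static DEFINE_SPINLOCK(demo_lock);

    static void demo_sleep_in_atomic(void)
    {
            spin_lock(&demo_lock);   /* enter atomic context */
            cond_resched();          /* may sleep: with CONFIG_DEBUG_ATOMIC_SLEEP
                                      * this triggers the "sleeping function
                                      * called from invalid context" warning */
            spin_unlock(&demo_lock);
    }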

    Fixes: e47110e90584 ("mm/vunmap: add cond_resched() in vunmap_pmd_range")
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Sergey Senozhatsky
    Cc: Tony Lindgren
    Cc: Christoph Hellwig
    Cc: Harish Sriram
    Cc: Uladzislau Rezki
    Cc:
    Link: https://lkml.kernel.org/r/20201117202916.GA3856507@google.com
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

19 Oct, 2020

1 commit

  • Add a proper helper to remap PFNs into kernel virtual space so that
    drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
    for it.
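
    For reference, the helper introduced by this series is vmap_pfn(); a hedged
    usage sketch follows (the wrapper names and the source of the PFN array are
    made up for illustration):

    #include <linux/mm.h>
    #include <linux/vmalloc.h>

    /* Map `count` PFNs, gathered elsewhere by a driver, into vmalloc space. */
    static void *demo_map_pfns(unsigned long *pfns, unsigned int count)
    {
            return vmap_pfn(pfns, count, pgprot_noncached(PAGE_KERNEL));
    }

    static void demo_unmap_pfns(void *vaddr)
    {
            vunmap(vaddr);
    }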

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

17 Oct, 2020

1 commit

  • Currently, it can happen that pages are allocated (and freed) via the
    buddy before basic memory onlining has finished.

    For example, pages are exposed to the buddy and can be allocated before we
    actually mark the sections online. Allocated pages could suddenly fail
    pfn_to_online_page() checks. We had similar issues with pcp handling,
    when pages are allocated+freed before we reach zone_pcp_update() in
    online_pages() [1].

    Instead, mark all pageblocks MIGRATE_ISOLATE, such that allocations are
    impossible. Once done with the heavy lifting, use
    undo_isolate_page_range() to move the pages to the MIGRATE_MOVABLE
    freelist, marking them ready for allocation. Similar to offline_pages(),
    we have to manually adjust zone->nr_isolate_pageblock.

    [1] https://lkml.kernel.org/r/1597150703-19003-1-git-send-email-charante@codeaurora.org

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Wei Yang
    Cc: Baoquan He
    Cc: Pankaj Gupta
    Cc: Charan Teja Reddy
    Cc: Dan Williams
    Cc: Fenghua Yu
    Cc: Logan Gunthorpe
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Mel Gorman
    Cc: Mel Gorman
    Cc: Michel Lespinasse
    Cc: Mike Rapoport
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200819175957.28465-11-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

16 Oct, 2020

1 commit

  • Pull dma-mapping updates from Christoph Hellwig:

    - rework the non-coherent DMA allocator

    - move private definitions out of <linux/dma-mapping.h>

    - lower CMA_ALIGNMENT (Paul Cercueil)

    - remove the omap1 dma address translation in favor of the common code

    - make dma-direct aware of multiple dma offset ranges (Jim Quinlan)

    - support per-node DMA CMA areas (Barry Song)

    - increase the default seg boundary limit (Nicolin Chen)

    - misc fixes (Robin Murphy, Thomas Tai, Xu Wang)

    - various cleanups

    * tag 'dma-mapping-5.10' of git://git.infradead.org/users/hch/dma-mapping: (63 commits)
    ARM/ixp4xx: add a missing include of dma-map-ops.h
    dma-direct: simplify the DMA_ATTR_NO_KERNEL_MAPPING handling
    dma-direct: factor out a dma_direct_alloc_from_pool helper
    dma-direct: check for highmem pages in dma_direct_alloc_pages
    dma-mapping: merge <linux/dma-noncoherent.h> into <linux/dma-map-ops.h>
    dma-mapping: move large parts of <linux/dma-direct.h> to kernel/dma
    dma-mapping: move dma-debug.h to kernel/dma/
    dma-mapping: remove <asm/dma-contiguous.h>
    dma-mapping: merge <linux/dma-contiguous.h> into <linux/dma-map-ops.h>
    dma-contiguous: remove dma_contiguous_set_default
    dma-contiguous: remove dev_set_cma_area
    dma-contiguous: remove dma_declare_contiguous
    dma-mapping: split <linux/dma-mapping.h>
    cma: decrease CMA_ALIGNMENT lower limit to 2
    firewire-ohci: use dma_alloc_pages
    dma-iommu: implement ->alloc_noncoherent
    dma-mapping: add new {alloc,free}_noncoherent dma_map_ops methods
    dma-mapping: add a new dma_alloc_pages API
    dma-mapping: remove dma_cache_sync
    53c700: convert to dma_alloc_noncoherent
    ...

    Linus Torvalds
     

14 Oct, 2020

1 commit

  • In the beginning, mm/gup_benchmark.c supported get_user_pages_fast() only,
    but right now, it supports the benchmarking of a couple of
    get_user_pages() related calls like:

    * get_user_pages_fast()
    * get_user_pages()
    * pin_user_pages_fast()
    * pin_user_pages()

    The documentation is confusing and needs an update.
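
    For context, a hedged kernel-side sketch of the pin/unpin pairing that the
    pin_user_pages*() variants above are benchmarked for (the wrapper name is
    made up; error handling is trimmed):

    #include <linux/mm.h>

    /* Pin nr_pages of a user buffer for device access, then drop the pins. */
    static int demo_pin_user_buffer(unsigned long uaddr, int nr_pages,
                                    struct page **pages)
    {
            int pinned = pin_user_pages_fast(uaddr, nr_pages, FOLL_WRITE, pages);

            if (pinned <= 0)
                    return pinned ? pinned : -EFAULT;

            /* ... hand `pages` to the device / do the benchmarked work ... */

            unpin_user_pages(pages, pinned);
            return 0;
    }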

    Signed-off-by: Barry Song
    Signed-off-by: Andrew Morton
    Cc: John Hubbard
    Cc: Keith Busch
    Cc: Ira Weiny
    Cc: Kirill A. Shutemov
    Link: https://lkml.kernel.org/r/20200821032546.19992-1-song.bao.hua@hisilicon.com
    Signed-off-by: Linus Torvalds

    Barry Song
     

25 Sep, 2020

1 commit

  • nommu-mmap.rst was moved to Documentation/admin-guide/mm; this patch
    updates the remaining stale references to Documentation/mm.

    Fixes: 800c02f5d030 ("docs: move nommu-mmap.txt to admin-guide and rename to ReST")
    Signed-off-by: Stephen Kitt
    Link: https://lore.kernel.org/r/20200812092230.27541-1-steve@sk2.org
    Signed-off-by: Jonathan Corbet

    Stephen Kitt
     

01 Sep, 2020

1 commit

  • Right now, drivers like ARM SMMU are using dma_alloc_coherent() to get
    coherent DMA buffers to save their command queues and page tables. As
    there is only one default CMA in the whole system, SMMUs on nodes other
    than node0 will get remote memory. This leads to significant latency.

    This patch provides per-numa CMA so that drivers like SMMU can get local
    memory. Tests show that localizing CMA can decrease dma_unmap latency
    considerably. For instance, before this patch, the SMMU on node2 has to
    wait more than 560ns for the completion of CMD_SYNC in an empty command
    queue; with this patch, it needs only 240ns.

    A positive side effect of this patch would be improving performance even
    further for those users who are worried about performance more than DMA
    security and use iommu.passthrough=1 to skip IOMMU. With local CMA, all
    drivers can get local coherent DMA buffers.
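
    A hedged sketch of the allocation path that benefits (the wrapper names and
    sizes are placeholders): a driver on a non-zero NUMA node asks for a
    coherent buffer, and with per-numa CMA that request can now be satisfied
    from node-local memory:

    #include <linux/dma-mapping.h>
    #include <linux/gfp.h>

    /* Allocate a coherent command-queue buffer for `dev`. */
    static void *demo_alloc_cmdq(struct device *dev, size_t size, dma_addr_t *dma)
    {
            /* With this patch, this can be served from the CMA area on dev's node. */
            return dma_alloc_coherent(dev, size, dma, GFP_KERNEL);
    }

    static void demo_free_cmdq(struct device *dev, size_t size, void *cpu,
                               dma_addr_t dma)
    {
            dma_free_coherent(dev, size, cpu, dma);
    }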

    Also, this patch changes the default CONFIG_CMA_AREAS to 19 when NUMA is
    enabled, as 1 + CONFIG_CMA_AREAS should be quite enough for most servers
    on the market even if they enable both hugetlb_cma and pernuma_cma:
    2 numa nodes: 2(hugetlb) + 2(pernuma) + 1(default global cma) = 5
    4 numa nodes: 4(hugetlb) + 4(pernuma) + 1(default global cma) = 9
    8 numa nodes: 8(hugetlb) + 8(pernuma) + 1(default global cma) = 17

    Signed-off-by: Barry Song
    Signed-off-by: Christoph Hellwig

    Barry Song
     

08 Aug, 2020

1 commit

  • After removal of CONFIG_HAVE_MEMBLOCK_NODE_MAP we have two equivalent
    functions that call memory_present() for each region in memblock.memory:
    sparse_memory_present_with_active_regions() and memblocks_present().

    Moreover, all architectures have a call to either of these functions
    preceding the call to sparse_init() and in most cases they are called
    one after the other.

    Mark the regions from memblock.memory as present during sparse_init() by
    making sparse_init() call memblocks_present(), make the memblocks_present()
    and memory_present() functions static, and remove the redundant
    sparse_memory_present_with_active_regions() function.

    Also remove no longer required HAVE_MEMORY_PRESENT configuration option.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200712083130.22919-1-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

27 Jun, 2020

1 commit

  • The nommu-mmap.txt file provides a description of user-visible
    behaviour. So, move it to the admin-guide.

    As it is already in ReST format, also rename it.

    Suggested-by: Mike Rapoport
    Suggested-by: Jonathan Corbet
    Signed-off-by: Mauro Carvalho Chehab
    Link: https://lore.kernel.org/r/3a63d1833b513700755c85bf3bda0a6c4ab56986.1592918949.git.mchehab+huawei@kernel.org
    Signed-off-by: Jonathan Corbet

    Mauro Carvalho Chehab
     

08 Jun, 2020

1 commit

  • Pull sparc updates from David Miller:

    - Rework the sparc32 page tables so that READ_ONCE(*pmd), as done by
    generic code, operates on a word sized element. From Will Deacon.

    - Some scnprintf() conversions, from Chen Zhou.

    - A pin_user_pages() conversion from John Hubbard.

    - Several 32-bit ptrace register handling fixes and such from Al Viro.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/sparc-next:
    fix a braino in "sparc32: fix register window handling in genregs32_[gs]et()"
    sparc32: mm: Only call ctor()/dtor() functions for first and last user
    sparc32: mm: Disable SPLIT_PTLOCK_CPUS
    sparc32: mm: Don't try to free page-table pages if ctor() fails
    sparc32: register memory occupied by kernel as memblock.memory
    sparc: remove unused header file nfs_fs.h
    sparc32: fix register window handling in genregs32_[gs]et()
    sparc64: fix misuses of access_process_vm() in genregs32_[sg]et()
    oradax: convert get_user_pages() --> pin_user_pages()
    sparc: use scnprintf() in show_pciobppath_attr() in vio.c
    sparc: use scnprintf() in show_pciobppath_attr() in pci.c
    tty: vcc: Fix error return code in vcc_probe()
    sparc32: mm: Reduce allocation size for PMD and PTE tables
    sparc32: mm: Change pgtable_t type to pte_t * instead of struct page *
    sparc32: mm: Restructure sparc32 MMU page-table layout
    sparc32: mm: Fix argument checking in __srmmu_get_nocache()
    sparc64: Replace zero-length array with flexible-array
    sparc: mm: return true,false in kern_addr_valid()

    Linus Torvalds
     

05 Jun, 2020

2 commits

  • Memory hotplug is broken for 32b systems at least since c6f03e2903c9 ("mm,
    memory_hotplug: remove zone restrictions") which considerably reworked how
    memory can be associated with movable/kernel zones. The same is not
    really trivial to achieve in 32b where only lowmem is the kernel zone.
    While we could work around this immediate problem, there are likely other
    land mines hidden at other places.

    It is also quite dubious that there is a real usecase for memory
    hotplug on 32b in the first place. Low memory is just too small to be
    hotpluggable (for hot add) and generally unusable for hotremove. Adding
    more memory to highmem is also dubious because it would increase the low
    mem or vmalloc space pressure for memmaps.

    Restrict the functionality to 64b systems. This will help future
    development to focus on usecases that have real life application. We can
    remove this restriction in the future if a real life usecase shows up, but
    until then make it explicit that hotplug on 32b is broken and requires a
    non-trivial amount of work to fix.

    Robin said:
    "32-bit Arm doesn't support memory hotplug, and as far as I'm aware
    there's little likelihood of it ever wanting to. FWIW it looks like
    SuperH is the only pure-32-bit architecture to have hotplug support at
    all"

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Acked-by: Baoquan He
    Cc: Wei Yang
    Cc: Naoya Horiguchi
    Cc: Oscar Salvador
    Cc: Robin Murphy
    Cc: Vamshi K Sthambamkadi
    Link: http://lkml.kernel.org/r/20200218100532.GA4151@dhcp22.suse.cz
    Link: https://bugzilla.kernel.org/show_bug.cgi?id=206401
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The comment in add_memory_resource() is stale: hotadd_new_pgdat() will no
    longer call get_pfn_range_for_nid(), as a hotadded pgdat will simply span
    no pages at all, until memory is moved to the zone/node via
    move_pfn_range_to_zone() - e.g., when onlining memory blocks.

    The only archs that care about memblocks for hotplugged memory (either for
    iterating over all system RAM or testing for memory validity) are arm64,
    s390x, and powerpc - due to CONFIG_ARCH_KEEP_MEMBLOCK. Without
    CONFIG_ARCH_KEEP_MEMBLOCK, we can simply stop messing with memblocks.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Acked-by: Mike Rapoport
    Acked-by: Michal Hocko
    Cc: Michal Hocko
    Cc: Baoquan He
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Mike Rapoport
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200422155353.25381-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

04 Jun, 2020

2 commits

  • Deferred struct page init is a significant bottleneck in kernel boot.
    Optimizing it maximizes availability for large-memory systems and allows
    spinning up short-lived VMs as needed without having to leave them
    running. It also benefits bare metal machines hosting VMs that are
    sensitive to downtime. In projects such as VMM Fast Restart[1], where
    guest state is preserved across kexec reboot, it helps prevent application
    and network timeouts in the guests.

    Multithread to take full advantage of system memory bandwidth.

    The maximum number of threads is capped at the number of CPUs on the node
    because speedups always improve with additional threads on every system
    tested, and at this phase of boot, the system is otherwise idle and
    waiting on page init to finish.

    Helper threads operate on section-aligned ranges to both avoid false
    sharing when setting the pageblock's migrate type and to avoid accessing
    uninitialized buddy pages, though max order alignment is enough for the
    latter.

    The minimum chunk size is also a section. There was benefit to using
    multiple threads even on relatively small memory (1G) systems, and this is
    the smallest size that the alignment allows.
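
    Purely as an illustration of the chunking described above (this is not the
    series' code; init_chunk() is a hypothetical stand-in for the per-chunk
    initialisation worker, and nr_threads is assumed to be at least 1):

    #include <linux/kernel.h>
    #include <linux/mmzone.h>

    /* Hypothetical per-chunk worker. */
    static void init_chunk(unsigned long start_pfn, unsigned long end_pfn);

    static void demo_deferred_init(unsigned long start_pfn, unsigned long end_pfn,
                                   unsigned int nr_threads)
    {
            unsigned long span  = end_pfn - start_pfn;
            unsigned long chunk = ALIGN(DIV_ROUND_UP(span, nr_threads),
                                        PAGES_PER_SECTION);
            unsigned long pfn;

            /* Each helper thread gets a section-aligned, section-multiple chunk. */
            for (pfn = start_pfn; pfn < end_pfn; pfn += chunk)
                    init_chunk(pfn, min(pfn + chunk, end_pfn));
    }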

    The time (milliseconds) is the slowest node to initialize since boot
    blocks until all nodes finish. intel_pstate is loaded in active mode
    without hwp and with turbo enabled, and intel_idle is active as well.

    Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz (Skylake, bare metal)
    2 nodes * 26 cores * 2 threads = 104 CPUs
    384G/node = 768G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 4089.7 ( 8.1) -- 1785.7 ( 7.6)
    2% ( 1) 1.7% 4019.3 ( 1.5) 3.8% 1717.7 ( 11.8)
    12% ( 6) 34.9% 2662.7 ( 2.9) 79.9% 359.3 ( 0.6)
    25% ( 13) 39.9% 2459.0 ( 3.6) 91.2% 157.0 ( 0.0)
    37% ( 19) 39.2% 2485.0 ( 29.7) 90.4% 172.0 ( 28.6)
    50% ( 26) 39.3% 2482.7 ( 25.7) 90.3% 173.7 ( 30.0)
    75% ( 39) 39.0% 2495.7 ( 5.5) 89.4% 190.0 ( 1.0)
    100% ( 52) 40.2% 2443.7 ( 3.8) 92.3% 138.0 ( 1.0)

    Intel(R) Xeon(R) CPU E5-2699C v4 @ 2.20GHz (Broadwell, kvm guest)
    1 node * 16 cores * 2 threads = 32 CPUs
    192G/node = 192G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 1988.7 ( 9.6) -- 1096.0 ( 11.5)
    3% ( 1) 1.1% 1967.0 ( 17.6) 0.3% 1092.7 ( 11.0)
    12% ( 4) 41.1% 1170.3 ( 14.2) 73.8% 287.0 ( 3.6)
    25% ( 8) 47.1% 1052.7 ( 21.9) 83.9% 177.0 ( 13.5)
    38% ( 12) 48.9% 1016.3 ( 12.1) 86.8% 144.7 ( 1.5)
    50% ( 16) 48.9% 1015.7 ( 8.1) 87.8% 134.0 ( 4.4)
    75% ( 24) 49.1% 1012.3 ( 3.1) 88.1% 130.3 ( 2.3)
    100% ( 32) 49.5% 1004.0 ( 5.3) 88.5% 125.7 ( 2.1)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, bare metal)
    2 nodes * 18 cores * 2 threads = 72 CPUs
    128G/node = 256G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 1680.0 ( 4.6) -- 627.0 ( 4.0)
    3% ( 1) 0.3% 1675.7 ( 4.5) -0.2% 628.0 ( 3.6)
    11% ( 4) 25.6% 1250.7 ( 2.1) 67.9% 201.0 ( 0.0)
    25% ( 9) 30.7% 1164.0 ( 17.3) 81.8% 114.3 ( 17.7)
    36% ( 13) 31.4% 1152.7 ( 10.8) 84.0% 100.3 ( 17.9)
    50% ( 18) 31.5% 1150.7 ( 9.3) 83.9% 101.0 ( 14.1)
    75% ( 27) 31.7% 1148.0 ( 5.6) 84.5% 97.3 ( 6.4)
    100% ( 36) 32.0% 1142.3 ( 4.0) 85.6% 90.0 ( 1.0)

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
    1 node * 8 cores * 2 threads = 16 CPUs
    64G/node = 64G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 1029.3 ( 25.1) -- 240.7 ( 1.5)
    6% ( 1) -0.6% 1036.0 ( 7.8) -2.2% 246.0 ( 0.0)
    12% ( 2) 11.8% 907.7 ( 8.6) 44.7% 133.0 ( 1.0)
    25% ( 4) 13.9% 886.0 ( 10.6) 62.6% 90.0 ( 6.0)
    38% ( 6) 17.8% 845.7 ( 14.2) 69.1% 74.3 ( 3.8)
    50% ( 8) 16.8% 856.0 ( 22.1) 72.9% 65.3 ( 5.7)
    75% ( 12) 15.4% 871.0 ( 29.2) 79.8% 48.7 ( 7.4)
    100% ( 16) 21.0% 813.7 ( 21.0) 80.5% 47.0 ( 5.2)

    Server-oriented distros that enable deferred page init sometimes run in
    small VMs, and they still benefit even though the fraction of boot time
    saved is smaller:

    AMD EPYC 7551 32-Core Processor (Zen, kvm guest)
    1 node * 2 cores * 2 threads = 4 CPUs
    16G/node = 16G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 716.0 ( 14.0) -- 49.7 ( 0.6)
    25% ( 1) 1.8% 703.0 ( 5.3) -4.0% 51.7 ( 0.6)
    50% ( 2) 1.6% 704.7 ( 1.2) 43.0% 28.3 ( 0.6)
    75% ( 3) 2.7% 696.7 ( 13.1) 49.7% 25.0 ( 0.0)
    100% ( 4) 4.1% 687.0 ( 10.4) 55.7% 22.0 ( 0.0)

    Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz (Haswell, kvm guest)
    1 node * 2 cores * 2 threads = 4 CPUs
    14G/node = 14G memory

    kernel boot deferred init
    ------------------------ ------------------------
    node% (thr) speedup time_ms (stdev) speedup time_ms (stdev)
    ( 0) -- 787.7 ( 6.4) -- 122.3 ( 0.6)
    25% ( 1) 0.2% 786.3 ( 10.8) -2.5% 125.3 ( 2.1)
    50% ( 2) 5.9% 741.0 ( 13.9) 37.6% 76.3 ( 19.7)
    75% ( 3) 8.3% 722.0 ( 19.0) 49.9% 61.3 ( 3.2)
    100% ( 4) 9.3% 714.7 ( 9.5) 56.4% 53.3 ( 1.5)

    On Josh's 96-CPU and 192G memory system:

    Without this patch series:
    [ 0.487132] node 0 initialised, 23398907 pages in 292ms
    [ 0.499132] node 1 initialised, 24189223 pages in 304ms
    ...
    [ 0.629376] Run /sbin/init as init process

    With this patch series:
    [ 0.231435] node 1 initialised, 24189223 pages in 32ms
    [ 0.236718] node 0 initialised, 23398907 pages in 36ms

    [1] https://static.sched.com/hosted_files/kvmforum2019/66/VMM-fast-restart_kvmforum2019.pdf

    Signed-off-by: Daniel Jordan
    Signed-off-by: Andrew Morton
    Tested-by: Josh Triplett
    Reviewed-by: Alexander Duyck
    Cc: Alex Williamson
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Herbert Xu
    Cc: Jason Gunthorpe
    Cc: Jonathan Corbet
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Pavel Machek
    Cc: Pavel Tatashin
    Cc: Peter Zijlstra
    Cc: Randy Dunlap
    Cc: Robert Elliott
    Cc: Shile Zhang
    Cc: Steffen Klassert
    Cc: Steven Sistare
    Cc: Tejun Heo
    Cc: Zi Yan
    Link: http://lkml.kernel.org/r/20200527173608.2885243-7-daniel.m.jordan@oracle.com
    Signed-off-by: Linus Torvalds

    Daniel Jordan
     
  • CONFIG_HAVE_MEMBLOCK_NODE_MAP is used to differentiate initialization of
    nodes and zones structures between the systems that have region to node
    mapping in memblock and those that don't.

    Currently all the NUMA architectures enable this option and for the
    non-NUMA systems we can presume that all the memory belongs to node 0 and
    therefore the compile time configuration option is not required.

    The remaining few architectures that use DISCONTIGMEM without NUMA are
    easily updated to use memblock_add_node() instead of memblock_add() and
    thus have proper correspondence of memblock regions to NUMA nodes.
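
    As a hedged illustration of that conversion, an architecture that today
    registers a bank of memory with:

            memblock_add(base, size);

    would instead register it with an explicit (single) node id:

            memblock_add_node(base, size, 0);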

    Still, free_area_init_node() must have a backward compatible version
    because its semantics with and without CONFIG_HAVE_MEMBLOCK_NODE_MAP are
    different. Once all the architectures use the new semantics, the entire
    compatibility layer can be dropped.

    To avoid adding extra runtime memory to store the node id for
    architectures that keep memblock but have only a single node, the node id
    field of memblock_region is guarded by CONFIG_NEED_MULTIPLE_NODES and
    the corresponding accessors presume that in those cases it is always 0.

    Signed-off-by: Mike Rapoport
    Signed-off-by: Andrew Morton
    Tested-by: Hoan Tran [arm64]
    Acked-by: Catalin Marinas [arm64]
    Cc: Baoquan He
    Cc: Brian Cain
    Cc: "David S. Miller"
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guan Xuetao
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: "James E.J. Bottomley"
    Cc: Jonathan Corbet
    Cc: Ley Foon Tan
    Cc: Mark Salter
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Michal Simek
    Cc: Nick Hu
    Cc: Paul Walmsley
    Cc: Richard Weinberger
    Cc: Rich Felker
    Cc: Russell King
    Cc: Stafford Horne
    Cc: Thomas Bogendoerfer
    Cc: Tony Luck
    Cc: Vineet Gupta
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/20200412194859.12663-4-rppt@kernel.org
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

03 Jun, 2020

3 commits

  • The SRMMU page-table allocator is not compatible with SPLIT_PTLOCK_CPUS
    for two major reasons:

    1. Pages are allocated via memblock, and therefore the ptl is not
    cleared by prep_new_page(), which is expected by ptlock_init()

    2. Multiple PTE tables can exist in a single page, causing them to
    share the same ptl and deadlock when attempting to take the same
    lock twice (e.g. as part of copy_page_range()).

    Ensure that SPLIT_PTLOCK_CPUS is not selected for SPARC32.

    Cc: David S. Miller
    Signed-off-by: Will Deacon
    Signed-off-by: David S. Miller

    Will Deacon
     
  • This allows unexporting map_vm_area and unmap_kernel_range, which are
    rather deep internals and should not be available to modules, as they for
    example allow fine-grained control of mapping permissions, and also
    allow splitting the setup of a vmalloc area and the actual mapping and
    thus expose vmalloc internals.

    zsmalloc is typically built-in and continues to work (just like the
    percpu-vm code using a similar pattern), while modular zsmalloc also
    continues to work, but must use copies.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Rename the Kconfig variable to clarify the scope.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Acked-by: Minchan Kim
    Acked-by: Peter Zijlstra (Intel)
    Cc: Christian Borntraeger
    Cc: Christophe Leroy
    Cc: Daniel Vetter
    Cc: David Airlie
    Cc: Gao Xiang
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Johannes Weiner
    Cc: "K. Y. Srinivasan"
    Cc: Laura Abbott
    Cc: Mark Rutland
    Cc: Michael Kelley
    Cc: Nitin Gupta
    Cc: Robin Murphy
    Cc: Sakari Ailus
    Cc: Stephen Hemminger
    Cc: Sumit Semwal
    Cc: Wei Liu
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Paul Mackerras
    Cc: Vasily Gorbik
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20200414131348.444715-11-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

09 Apr, 2020

1 commit

  • Pull libnvdimm and dax updates from Dan Williams:
    "There were multiple touches outside of drivers/nvdimm/ this round to
    add cross arch compatibility to the devm_memremap_pages() interface,
    enhance numa information for persistent memory ranges, and add a
    zero_page_range() dax operation.

    This cycle I switched from the patchwork api to Konstantin's b4 script
    for collecting tags (from x86, PowerPC, filesystem, and device-mapper
    folks), and everything looks to have gone ok there. This has all
    appeared in -next with no reported issues.

    Summary:

    - Add support for region alignment configuration and enforcement to
    fix compatibility across architectures and PowerPC page size
    configurations.

    - Introduce 'zero_page_range' as a dax operation. This facilitates
    filesystem-dax operation without a block-device.

    - Introduce phys_to_target_node() to facilitate drivers that want to
    know resulting numa node if a given reserved address range was
    onlined.

    - Advertise a persistence-domain for of_pmem and papr_scm. The
    persistence domain indicates where cpu-store cycles need to reach
    in the platform-memory subsystem before the platform will consider
    them power-fail protected.

    - Promote numa_map_to_online_node() to a cross-kernel generic
    facility.

    - Save x86 numa information to allow for node-id lookups for reserved
    memory ranges, deploy that capability for the e820-pmem driver.

    - Pick up some miscellaneous minor fixes that missed v5.6-final,
    including some smatch reports in the ioctl path and some unit test
    compilation fixups.

    - Fixup some flexible-array declarations"

    * tag 'libnvdimm-for-5.7' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (29 commits)
    dax: Move mandatory ->zero_page_range() check in alloc_dax()
    dax,iomap: Add helper dax_iomap_zero() to zero a range
    dax: Use new dax zero page method for zeroing a page
    dm,dax: Add dax zero_page_range operation
    s390,dcssblk,dax: Add dax zero_page_range operation to dcssblk driver
    dax, pmem: Add a dax operation zero_page_range
    pmem: Add functions for reading/writing page to/from pmem
    libnvdimm: Update persistence domain value for of_pmem and papr_scm device
    tools/test/nvdimm: Fix out of tree build
    libnvdimm/region: Fix build error
    libnvdimm/region: Replace zero-length array with flexible-array member
    libnvdimm/label: Replace zero-length array with flexible-array member
    ACPI: NFIT: Replace zero-length array with flexible-array member
    libnvdimm/region: Introduce an 'align' attribute
    libnvdimm/region: Introduce NDD_LABELING
    libnvdimm/namespace: Enforce memremap_compat_align()
    libnvdimm/pfn: Prevent raw mode fallback if pfn-infoblock valid
    libnvdimm: Out of bounds read in __nd_ioctl()
    acpi/nfit: improve bounds checking for 'func'
    mm/memremap_pages: Introduce memremap_compat_align()
    ...

    Linus Torvalds
     

08 Apr, 2020

3 commits

  • The compressed cache for swap pages (zswap) currently needs from 1 to 3
    extra kernel command line parameters in order to make it work: it has to
    be enabled by adding a "zswap.enabled=1" command line parameter and if one
    wants a different compressor or pool allocator than the default lzo / zbud
    combination then these choices also need to be specified on the kernel
    command line in additional parameters.

    Using a different compressor and allocator for zswap is actually pretty
    common as guides often recommend using the lz4 / z3fold pair instead of
    the default one. In such a case it is also necessary to remember to enable
    the appropriate compression algorithm and pool allocator in the kernel
    config manually.
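
    For example, the boot command line that the new Kconfig defaults make
    unnecessary for the lz4 / z3fold case mentioned above would be (parameter
    names as exposed by the zswap module):

            zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=z3fold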

    Let's avoid the need for adding these kernel command line parameters and
    automatically pull in the dependencies for the selected compressor
    algorithm and pool allocator by adding appropriate default switches to
    Kconfig.

    The default values for these options match what the code was using
    previously as its defaults.

    Signed-off-by: Maciej S. Szmigiero
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Wool
    Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
    Signed-off-by: Linus Torvalds

    Maciej S. Szmigiero
     
  • In order to pave the way for free page reporting in virtualized
    environments we will need a way to get pages out of the free lists and
    identify those pages after they have been returned. To accomplish this,
    this patch adds the concept of a Reported Buddy, which is essentially
    meant to just be the Uptodate flag used in conjunction with the Buddy page
    type.

    To prevent the reported pages from leaking outside of the buddy lists I
    added a check to clear the PageReported bit in the del_page_from_free_list
    function. As a result any reported page that is split, merged, or
    allocated will have the flag cleared prior to the PageBuddy value being
    cleared.

    The process for reporting pages is fairly simple. Once we free a page
    that meets the minimum order for page reporting we will schedule a worker
    thread to start 2s or more in the future. That worker thread will begin
    working from the lowest supported page reporting order up to MAX_ORDER - 1
    pulling unreported pages from the free list and storing them in the
    scatterlist.

    When processing each individual free list it is necessary for the worker
    thread to release the zone lock when it needs to stop and report the full
    scatterlist of pages. To reduce the work of the next iteration the worker
    thread will rotate the free list so that the first unreported page in the
    free list becomes the first entry in the list.

    It will then call a reporting function providing information on how many
    entries are in the scatterlist. Once the function completes it will
    return the pages to the free area from which they were allocated and start
    over pulling more pages from the free areas until there are no longer
    enough pages to report on to keep the worker busy, or we have processed as
    many pages as were contained in the free area when we started processing
    the list.

    The worker thread will work in a round-robin fashion making its way
    through each zone requesting reporting, and through each reportable free
    list within that zone. Once all free areas within the zone have been
    processed it will check to see if there have been any requests for
    reporting while it was processing. If so it will reschedule the worker
    thread to start up again in roughly 2s and exit.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead setting CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same as
    CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh, this was an
    oversight, so remove the Kconfig symbol and undo the work of commit
    e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

18 Feb, 2020

1 commit

  • Currently x86 numa_meminfo is marked __initdata in the
    CONFIG_MEMORY_HOTPLUG=n case. In support of a new facility to allow
    drivers to map reserved memory to a 'target_node'
    (phys_to_target_node()), add support for removing the __initdata
    designation for those users. Both memory hotplug and
    phys_to_target_node() users select CONFIG_NUMA_KEEP_MEMINFO to tell the
    arch to maintain its physical address to NUMA mapping infrastructure
    post init.

    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Borislav Petkov
    Cc: "H. Peter Anvin"
    Cc:
    Cc: Andrew Morton
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Reviewed-by: Ingo Molnar
    Signed-off-by: Dan Williams
    Reviewed-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/158188326422.894464.15742054998046628934.stgit@dwillia2-desk3.amr.corp.intel.com

    Dan Williams
     

02 Dec, 2019

2 commits

  • End a Kconfig help text sentence with a period (aka full stop).

    Link: http://lkml.kernel.org/r/c17f2c75-dc2a-42a4-2229-bb6b489addf2@infradead.org
    Signed-off-by: Randy Dunlap
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • Adjust indentation from spaces to tab (+optional two spaces) as in
    coding style, with a command like:

    $ sed -e 's/^        /\t/' -i */Kconfig

    Link: http://lkml.kernel.org/r/1574306437-28837-1-git-send-email-krzk@kernel.org
    Signed-off-by: Krzysztof Kozlowski
    Reviewed-by: David Hildenbrand
    Cc: Greg Kroah-Hartman
    Cc: Jiri Kosina
    Cc: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     

01 Dec, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is another round of bug fixing and cleanup. This time the focus
    is on the driver pattern to use mmu notifiers to monitor a VA range.
    This code is lifted out of many drivers and hmm_mirror directly into
    the mmu_notifier core and written using the best ideas from all the
    driver implementations.

    This removes many bugs from the drivers and has a very pleasing
    diffstat. More drivers can still be converted, but that is for another
    cycle.

    - A shared branch with RDMA reworking the RDMA ODP implementation

    - New mmu_interval_notifier API. This is focused on the use case of
    monitoring a VA and simplifies the process for drivers

    - A common seq-count locking scheme built into the
    mmu_interval_notifier API usable by drivers that call
    get_user_pages() or hmm_range_fault() with the VA range

    - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
    GntDev drivers to the new API. This deletes a lot of wonky driver
    code.

    - Two improvements for hmm_range_fault(), from testing done by Ralph"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
    mm/hmm: make full use of walk_page_range()
    xen/gntdev: use mmu_interval_notifier_insert
    mm/hmm: remove hmm_mirror and related
    drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
    drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
    drm/amdgpu: Call find_vma under mmap_sem
    nouveau: use mmu_interval_notifier instead of hmm_mirror
    nouveau: use mmu_notifier directly for invalidate_range_start
    drm/radeon: use mmu_interval_notifier_insert
    RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
    RDMA/odp: Use mmu_interval_notifier_insert()
    mm/hmm: define the pre-processor related parts of hmm.h even if disabled
    mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
    mm/mmu_notifier: add an interval tree notifier
    mm/mmu_notifier: define the header pre-processor parts even if disabled
    mm/hmm: allow snapshot of the special zero page

    Linus Torvalds
     

24 Nov, 2019

2 commits

  • The only two users of this are now converted to use mmu_interval_notifier,
    delete all the code and update hmm.rst.

    Link: https://lore.kernel.org/r/20191112202231.3856-14-jgg@ziepe.ca
    Reviewed-by: Jérôme Glisse
    Tested-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • Of the 13 users of mmu_notifiers, 8 of them use only
    invalidate_range_start/end() and immediately intersect the
    mmu_notifier_range with some kind of internal list of VAs. 4 use an
    interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
    of some kind (scif_dma, vhost, gntdev, hmm)

    And the remaining 5 either don't use invalidate_range_start() or do some
    special thing with it.

    It turns out that building a correct scheme with an interval tree is
    pretty complicated, particularly if the use case is synchronizing against
    another thread doing get_user_pages(). Many of these implementations have
    various subtle and difficult to fix races.

    This approach puts the interval tree as common code at the top of the mmu
    notifier call tree and implements a shareable locking scheme.

    It includes:
    - An interval tree tracking VA ranges, with per-range callbacks
    - A read/write locking scheme for the interval tree that avoids
    sleeping in the notifier path (for OOM killer)
    - A sequence counter based collision-retry locking scheme to tell
    device page fault that a VA range is being concurrently invalidated.

    This is based on various ideas:
    - hmm accumulates invalidated VA ranges and releases them when all
    invalidates are done, via active_invalidate_ranges count.
    This approach avoids having to intersect the interval tree twice (as
    umem_odp does) at the potential cost of a longer device page fault.

    - kvm/umem_odp use a sequence counter to drive the collision retry,
    via invalidate_seq

    - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
    This makes adding/removing interval tree members more deterministic

    - seqlock, except this version makes the seqlock idea multi-holder on the
    write side by protecting it with active_invalidate_ranges and a spinlock

    To minimize MM overhead when only the interval tree is being used, the
    entire SRCU and hlist overheads are dropped using some simple
    branches. Similarly the interval tree overhead is dropped when in hlist
    mode.

    The overhead from the mandatory spinlock is broadly the same as that of
    most existing users, which already had a lock (or two) of some sort on the
    invalidation path.
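
    A hedged sketch of the resulting collision-retry pattern on the driver side
    (the mutex and the device-mapping step are placeholders; the mmu_interval_*
    calls are the new API described above):

    #include <linux/mmu_notifier.h>
    #include <linux/mutex.h>

    static DEFINE_MUTEX(demo_update_lock);  /* placeholder driver lock */

    static int demo_map_range(struct mmu_interval_notifier *mni)
    {
            unsigned long seq;

    again:
            seq = mmu_interval_read_begin(mni);

            /* ... fault/snapshot the pages, e.g. with hmm_range_fault() ... */

            mutex_lock(&demo_update_lock);
            if (mmu_interval_read_retry(mni, seq)) {
                    /* The range was invalidated concurrently; start over. */
                    mutex_unlock(&demo_update_lock);
                    goto again;
            }
            /* ... program the device page tables under the lock ... */
            mutex_unlock(&demo_update_lock);
            return 0;
    }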

    Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
    Acked-by: Christian König
    Tested-by: Philip Yang
    Tested-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

06 Nov, 2019

1 commit

  • Add two utilities to 1) write-protect and 2) clean all ptes pointing into
    a range of an address space.
    The utilities are intended to aid in tracking dirty pages (either
    driver-allocated system memory or pci device memory).
    The write-protect utility should be used in conjunction with
    page_mkwrite() and pfn_mkwrite() to trigger write page-faults on page
    accesses. Typically one would want to use this on sparse accesses into
    large memory regions. The clean utility should be used to utilize
    hardware dirtying functionality and avoid the overhead of page-faults,
    typically on large accesses into small memory regions.
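
    A hedged sketch of the page-fault side of that scheme (only the
    pfn_mkwrite() hook is shown; the new write-protect/clean helpers themselves
    and the dirty-tracking store are placeholders):

    #include <linux/mm.h>

    static vm_fault_t demo_pfn_mkwrite(struct vm_fault *vmf)
    {
            /* ... record vmf->pgoff as dirty in a driver-private bitmap ... */
            return 0;   /* let the fault code make the PTE writable again */
    }

    static const struct vm_operations_struct demo_vm_ops = {
            .pfn_mkwrite = demo_pfn_mkwrite,
    };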

    Cc: Andrew Morton
    Cc: Matthew Wilcox
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: Huang Ying
    Cc: Jérôme Glisse
    Cc: Kirill A. Shutemov
    Signed-off-by: Thomas Hellstrom
    Acked-by: Andrew Morton

    Thomas Hellstrom
     

25 Sep, 2019

2 commits

  • This patch is (hopefully) the first step to enable THP for non-shmem
    filesystems.

    This patch enables an application to place parts of its text sections into
    THP via madvise, for example:

    madvise((void *)0x600000, 0x200000, MADV_HUGEPAGE);

    We tried to reuse the logic for THP on tmpfs.

    Currently, write is not supported for non-shmem THP. khugepaged will only
    process VMAs with VM_DENYWRITE. sys_mmap() ignores VM_DENYWRITE requests
    (see ksys_mmap_pgoff()). The only way to create a VMA with VM_DENYWRITE is
    execve(). This requirement limits non-shmem THP to text sections.

    The next patch will handle writes, which would only happen when all the
    VMAs with VM_DENYWRITE are unmapped.

    An EXPERIMENTAL config, READ_ONLY_THP_FOR_FS, is added to gate this
    feature.
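
    Expanding the one-liner above into a hedged, self-contained userspace
    sketch (the address is derived from the program's own execve-mapped text,
    so the VMA carries VM_DENYWRITE; a 2MB huge page size is assumed and error
    handling is omitted):

    #include <stdint.h>
    #include <sys/mman.h>

    #define HPAGE_SIZE (2UL * 1024 * 1024)

    int main(void)
    {
            /* Round a text address (main itself) down to a 2MB boundary ... */
            uintptr_t text = (uintptr_t)main & ~(HPAGE_SIZE - 1);

            /* ... and ask for that text range to be backed by THPs. */
            madvise((void *)text, HPAGE_SIZE, MADV_HUGEPAGE);
            return 0;
    }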

    [songliubraving@fb.com: fix build without CONFIG_SHMEM]
    Link: http://lkml.kernel.org/r/F53407FB-96CC-42E8-9862-105C92CC2B98@fb.com
    [songliubraving@fb.com: fix double unlock in collapse_file()]
    Link: http://lkml.kernel.org/r/B960CBFA-8EFC-4DA4-ABC5-1977FFF2CA57@fb.com
    Link: http://lkml.kernel.org/r/20190801184244.3169074-7-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Rik van Riel
    Acked-by: Kirill A. Shutemov
    Acked-by: Johannes Weiner
    Cc: Stephen Rothwell
    Cc: Dan Carpenter
    Cc: Hillf Danton
    Cc: Hugh Dickins
    Cc: William Kucharski
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     
  • Patch series "mm: remove quicklist page table caches".

    A while ago Nicholas proposed to remove quicklist page table caches [1].

    I've rebased his patch on the current upstream and switched ia64 and sh to
    use generic versions of PTE allocation.

    [1] https://lore.kernel.org/linux-mm/20190711030339.20892-1-npiggin@gmail.com

    This patch (of 3):

    Remove page table allocator "quicklists". These have been around for a
    long time, but have not got much traction in the last decade and are only
    used on ia64 and sh architectures.

    The numbers in the initial commit look interesting but probably don't
    apply anymore. If anybody wants to resurrect this it's in the git
    history, but it's unhelpful to have this code and divergent allocator
    behaviour for minor archs.

    Also it might be better to instead make more general improvements to the
    page allocator if this is still so slow.

    Link: http://lkml.kernel.org/r/1565250728-21721-2-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Nicholas Piggin
    Signed-off-by: Mike Rapoport
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nicholas Piggin
     

20 Aug, 2019

1 commit

  • CONFIG_MIGRATE_VMA_HELPER guards helpers that are required for proper
    device private memory support. Remove the option and just check for
    CONFIG_DEVICE_PRIVATE instead.

    Link: https://lore.kernel.org/r/20190814075928.23766-11-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

08 Aug, 2019

2 commits


17 Jul, 2019

1 commit

  • ARCH_HAS_ZONE_DEVICE is somewhat meaningless in itself, and combined
    with the long-out-of-date comment can lead to the impression that an
    architecture may just enable it (since __add_pages() now "comprehends
    device memory" for itself) and expect things to work.

    In practice, however, ZONE_DEVICE users have little chance of
    functioning correctly without __HAVE_ARCH_PTE_DEVMAP, so let's clean
    that up the same way as ARCH_HAS_PTE_SPECIAL and make it the proper
    dependency so the real situation is clearer.

    Link: http://lkml.kernel.org/r/87554aa78478a02a63f2c4cf60a847279ae3eb3b.1558547956.git.robin.murphy@arm.com
    Signed-off-by: Robin Murphy
    Acked-by: Dan Williams
    Reviewed-by: Ira Weiny
    Acked-by: Oliver O'Halloran
    Reviewed-by: Anshuman Khandual
    Cc: Michael Ellerman
    Cc: Catalin Marinas
    Cc: David Hildenbrand
    Cc: Jerome Glisse
    Cc: Michal Hocko
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Murphy
     

15 Jul, 2019

1 commit

  • Pull HMM updates from Jason Gunthorpe:
    "Improvements and bug fixes for the hmm interface in the kernel:

    - Improve clarity, locking and APIs related to the 'hmm mirror'
    feature merged last cycle. In linux-next we now see AMDGPU and
    nouveau to be using this API.

    - Remove old or transitional hmm APIs. These are hold overs from the
    past with no users, or APIs that existed only to manage cross tree
    conflicts. There are still a few more of these cleanups that didn't
    make the merge window cut off.

    - Improve some core mm APIs:
    - export alloc_pages_vma() for driver use
    - refactor into devm_request_free_mem_region() to manage
    DEVICE_PRIVATE resource reservations
    - refactor duplicative driver code into the core dev_pagemap
    struct

    - Remove hmm wrappers of improved core mm APIs, instead have drivers
    use the simplified API directly

    - Remove DEVICE_PUBLIC

    - Simplify the kconfig flow for the hmm users and core code"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (42 commits)
    mm: don't select MIGRATE_VMA_HELPER from HMM_MIRROR
    mm: remove the HMM config option
    mm: sort out the DEVICE_PRIVATE Kconfig mess
    mm: simplify ZONE_DEVICE page private data
    mm: remove hmm_devmem_add
    mm: remove hmm_vma_alloc_locked_page
    nouveau: use devm_memremap_pages directly
    nouveau: use alloc_page_vma directly
    PCI/P2PDMA: use the dev_pagemap internal refcount
    device-dax: use the dev_pagemap internal refcount
    memremap: provide an optional internal refcount in struct dev_pagemap
    memremap: replace the altmap_valid field with a PGMAP_ALTMAP_VALID flag
    memremap: remove the data field in struct dev_pagemap
    memremap: add a migrate_to_ram method to struct dev_pagemap_ops
    memremap: lift the devmap_enable manipulation into devm_memremap_pages
    memremap: pass a struct dev_pagemap to ->kill and ->cleanup
    memremap: move dev_pagemap callbacks into a separate structure
    memremap: validate the pagemap type passed to devm_memremap_pages
    mm: factor out a devm_request_free_mem_region helper
    mm: export alloc_pages_vma
    ...

    Linus Torvalds
     

13 Jul, 2019

4 commits

  • While only powerpc supports the hugepd case, the code is pretty generic
    and I'd like to keep all GUP internals in one place.

    Link: http://lkml.kernel.org/r/20190625143715.1689-15-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Always build mm/gup.c so that we don't have to provide separate nommu
    stubs. Also merge the get_user_pages_fast and __get_user_pages_fast stubs
    used when HAVE_FAST_GUP is not set into the main implementations, which
    will never call the fast path in that case.

    This also ensures the new put_user_pages* helpers are available for nommu,
    as those are currently missing, which would create a problem as soon as we
    actually grow users for them.

    Link: http://lkml.kernel.org/r/20190625143715.1689-13-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Jason Gunthorpe
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • We only support the generic GUP now, so rename the config option to
    be more clear, and always use the mm/Kconfig definition of the
    symbol and select it from the arch Kconfigs.

    Link: http://lkml.kernel.org/r/20190625143715.1689-11-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Khalid Aziz
    Reviewed-by: Jason Gunthorpe
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • The split low/high access is the only non-READ_ONCE version of gup_get_pte
    that did show up in the various arch implementations. Lift it to common
    code and drop the ifdef based arch override.

    Link: http://lkml.kernel.org/r/20190625143715.1689-4-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Cc: Andrey Konovalov
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: James Hogan
    Cc: Khalid Aziz
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

03 Jul, 2019

1 commit