12 Jul, 2021

4 commits

  • I know nothing about zone_device pages and !device_private pages; but if
    try_to_migrate_one() will do nothing for them, then it's better that
    try_to_migrate() filter them first, than trawl through all their vmas.

    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Reviewed-by: Alistair Popple
    Link: https://lore.kernel.org/lkml/1241d356-8ec9-f47b-a5ec-9b2bf66d242@google.com/
    Cc: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: Christoph Hellwig
    Cc: Yang Shi
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • In the unlikely race case that page_mlock_one() finds VM_LOCKED has been
    cleared by the time it gets the page table lock, page_vma_mapped_walk_done()
    must be called before returning, either explicitly or by a final call
    to page_vma_mapped_walk() - otherwise the page table remains locked.
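
    A minimal sketch of the pattern (helper and field names per 5.14-era
    mm/rmap.c; details abbreviated, not the exact upstream diff):

    static bool page_mlock_one(struct page *page, struct vm_area_struct *vma,
                               unsigned long address, void *unused)
    {
        struct page_vma_mapped_walk pvmw = {
            .page = page,
            .vma = vma,
            .address = address,
        };

        while (page_vma_mapped_walk(&pvmw)) {
            /*
             * VM_LOCKED may have been cleared while we waited for the
             * page table lock: release it before returning.
             */
            if (!(vma->vm_flags & VM_LOCKED)) {
                page_vma_mapped_walk_done(&pvmw);
                return true;
            }
            mlock_vma_page(page);
            page_vma_mapped_walk_done(&pvmw);
            return false;   /* stop the rmap walk */
        }
        return true;
    }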

    Fixes: cd62734ca60d ("mm/rmap: split try_to_munlock from try_to_unmap")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Alistair Popple
    Reviewed-by: Shakeel Butt
    Reported-by: kernel test robot
    Link: https://lore.kernel.org/lkml/20210711151446.GB4070@xsang-OptiPlex-9020/
    Link: https://lore.kernel.org/lkml/f71f8523-cba7-3342-40a7-114abc5d1f51@google.com/
    Cc: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: Christoph Hellwig
    Cc: Yang Shi
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • The kernel recovers in due course from missing Mlocked pages: but there
    was no point in calling page_mlock() (formerly known as
    try_to_munlock()) on a THP, because nothing got done even when it was
    found to be mapped in another VM_LOCKED vma.

    It's true that we need to be careful: Mlocked accounting of pte-mapped
    THPs is too difficult (so consistently avoided); but Mlocked accounting
    of only-pmd-mapped THPs is supposed to work, even when multiple mappings
    are mlocked and munlocked or munmapped. Refine the tests.

    There is already a VM_BUG_ON_PAGE(PageDoubleMap) in page_mlock(), so
    page_mlock_one() does not even have to worry about that complication.

    (I said the kernel recovers: but would page reclaim be likely to split the
    THP before rediscovering that it's VM_LOCKED? I've not followed that up.)

    Fixes: 9a73f61bdb8a ("thp, mlock: do not mlock PTE-mapped file huge pages")
    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Acked-by: Kirill A. Shutemov
    Link: https://lore.kernel.org/lkml/cfa154c-d595-406-eb7d-eb9df730f944@google.com/
    Cc: Andrew Morton
    Cc: Alistair Popple
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: Christoph Hellwig
    Cc: Yang Shi
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Parallel developments in mm/rmap.c have left behind some out-of-date
    comments: try_to_migrate_one() also accepts TTU_SYNC (already commented
    in try_to_migrate() itself), and try_to_migrate() returns nothing at
    all.

    TTU_SPLIT_FREEZE has just been deleted, so reword the comment about it
    in mm/huge_memory.c; and TTU_IGNORE_ACCESS was removed in 5.11, so
    delete the "recently referenced" comment from try_to_unmap_one() (once
    upon a time the comment was near the removed codeblock, but they drifted
    apart).

    Signed-off-by: Hugh Dickins
    Reviewed-by: Shakeel Butt
    Reviewed-by: Alistair Popple
    Link: https://lore.kernel.org/lkml/563ce5b2-7a44-5b4d-1dfd-59a0e65932a9@google.com/
    Cc: Andrew Morton
    Cc: Jason Gunthorpe
    Cc: Ralph Campbell
    Cc: Christoph Hellwig
    Cc: Yang Shi
    Cc: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

11 Jul, 2021

2 commits

  • Commit dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to
    local_lock") folded in a workaround patch for pahole that was unable to
    deal with zero-sized percpu structures.

    A superior workaround is achieved with commit a0b8200d06ad ("kbuild:
    skip per-CPU BTF generation for pahole v1.18-v1.21").

    This patch reverts the dummy field and the pahole version check.

    Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pull percpu fix from Dennis Zhou:
    "This is just a single change to fix percpu depopulation. The code
    relied on depopulation code written specifically for the free path and
    relied on vmalloc to do the tlb flush lazily. As we're modifying the
    backing pages during the lifetime of a chunk, we need to also flush
    the tlb accordingly.

    Guenter Roeck reported this issue in [1] on mips. I believe we just
    happen to be lucky given the much larger chunk sizes on x86 and
    consequently less churning of this memory"

    Link: https://lore.kernel.org/lkml/20210702191140.GA3166599@roeck-us.net/ [1]

    * 'for-5.14-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: flush tlb in pcpu_reclaim_populated()

    Linus Torvalds
     

09 Jul, 2021

10 commits

  • Patch series "Speedup mremap on ppc64", v8.

    This patchset enables MOVE_PMD/MOVE_PUD support on power. This requires
    the platform to support updating higher-level page tables without updating
    page table entries, and it also needs to invalidate the Page Walk Cache on
    architectures that support it.

    This patch (of 3):

    Architectures like ppc64 support faster mremap only with radix
    translation. Hence allow a runtime check for fast-mremap support.

    Link: https://lkml.kernel.org/r/20210616045735.374532-1-aneesh.kumar@linux.ibm.com
    Link: https://lkml.kernel.org/r/20210616045735.374532-2-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Michael Ellerman
    Cc: Kalesh Singh
    Cc: Nicholas Piggin
    Cc: Joel Fernandes
    Cc: Christophe Leroy
    Cc: Kirill A. Shutemov
    Cc: "Aneesh Kumar K . V"
    Cc: Hugh Dickins
    Cc: Kirill A. Shutemov
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • To avoid a race between the rmap walk and mremap, mremap does
    take_rmap_locks(). The lock is taken to ensure that the rmap walk doesn't
    miss a page table entry due to PTE moves via move_pagetables(). The kernel
    further optimizes this lock such that, if we are going to find the newly
    added vma after the old vma, the rmap lock is not taken. This is because
    the rmap walk would find the vmas in the same order and, if we don't find
    the page table attached to the older vma, we would find it with the new
    vma, which we iterate later.

    As explained in commit eb66ae030829 ("mremap: properly flush TLB before
    releasing the page") mremap is special in that it doesn't take ownership
    of the page. The optimized version for PUD/PMD aligned mremap also
    doesn't hold the ptl lock. This can result in stale TLB entries as shown
    below.

    This patch updates the rmap locking requirement in mremap to handle the
    race condition explained below with optimized mremap:

    Optimized PMD move

    CPU 1                       CPU 2                           CPU 3

    mremap(old_addr, new_addr)  page_shrinker/try_to_unmap_one

    mmap_write_lock_killable()

                                addr = old_addr
                                lock(pte_ptl)
    lock(pmd_ptl)
    pmd = *old_pmd
    pmd_clear(old_pmd)
    flush_tlb_range(old_addr)

    *new_pmd = pmd
                                                                *new_addr = 10; and fills
                                                                TLB with new addr
                                                                and old pfn

    unlock(pmd_ptl)
                                ptep_clear_flush()
                                old pfn is free.
                                                                Stale TLB entry

    The optimized PUD move suffers from a similar race. Both of the above race
    conditions can be fixed if we force the mremap path to take the rmap lock.
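
    For context, the rmap locks in question are the ones taken by mremap's
    existing take_rmap_locks() helper; a sketch of that helper as it appears
    in mm/mremap.c (shown for reference, not as part of this patch):

    static void take_rmap_locks(struct vm_area_struct *vma)
    {
        if (vma->vm_file)
            i_mmap_lock_write(vma->vm_file->f_mapping);
        if (vma->anon_vma)
            anon_vma_lock_write(vma->anon_vma);
    }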

    Link: https://lkml.kernel.org/r/20210616045239.370802-7-aneesh.kumar@linux.ibm.com
    Fixes: 2c91bd4a4e2e ("mm: speed up mremap by 20x on large regions")
    Fixes: c49dd3401802 ("mm: speedup mremap on 1GB or larger regions")
    Link: https://lore.kernel.org/linux-mm/CAHk-=wgXVR04eBNtxQfevontWnP6FDm+oj5vauQXP3S-huwbPw@mail.gmail.com
    Signed-off-by: Aneesh Kumar K.V
    Acked-by: Hugh Dickins
    Acked-by: Kirill A. Shutemov
    Cc: Christophe Leroy
    Cc: Joel Fernandes
    Cc: Kalesh Singh
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • pmd/pud_populate is the right interface to be used to set the respective
    page table entries. Some architectures like ppc64 do assume that
    set_pmd/pud_at can only be used to set a hugepage PTE. Since we are not
    setting up a hugepage PTE here, use the pmd/pud_populate interface.
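
    A hedged sketch of the kind of change this implies in move_normal_pmd()
    (context abbreviated; pmd_pgtable() extracts the page-table page that the
    old pmd pointed to):

        /* Previously: set_pmd_at(mm, new_addr, new_pmd, pmd); */
        pmd_populate(mm, new_pmd, pmd_pgtable(pmd));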

    Link: https://lkml.kernel.org/r/20210616045239.370802-6-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Christophe Leroy
    Cc: Hugh Dickins
    Cc: Joel Fernandes
    Cc: Kalesh Singh
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • With a two-level page table, don't enable move_normal_pud.
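
    A hedged sketch of such a guard in mm/mremap.c: with only two page-table
    levels there is no real PUD level, so a stub that reports no PUD-level
    move can stand in for the implementation:

    #if CONFIG_PGTABLE_LEVELS <= 2
    /* No real PUD level here: never take the PUD fast path. */
    static inline bool move_normal_pud(struct vm_area_struct *vma,
            unsigned long old_addr, unsigned long new_addr,
            pud_t *old_pud, pud_t *new_pud)
    {
        return false;
    }
    #endif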

    Link: https://lkml.kernel.org/r/20210616045239.370802-5-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Christophe Leroy
    Cc: Hugh Dickins
    Cc: Joel Fernandes
    Cc: Kalesh Singh
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • With TRANSPARENT_HUGEPAGE_PUD enabled the kernel can find huge PUD
    entries. Add a helper to move huge PUD entries on mremap().

    This will be used by a later patch to optimize mremap of PUD_SIZE-aligned,
    level-4 PTE-mapped addresses.

    This also makes sure we support mremap on huge PUD entries even with
    CONFIG_HAVE_MOVE_PUD disabled.

    [aneesh.kumar@linux.ibm.com: fix build failure with clang-10]
    Link: https://lore.kernel.org/lkml/YMuOSnJsL9qkxweY@archlinux-ax161
    Link: https://lkml.kernel.org/r/20210619134310.89098-1-aneesh.kumar@linux.ibm.com

    Link: https://lkml.kernel.org/r/20210616045239.370802-4-aneesh.kumar@linux.ibm.com
    Signed-off-by: Aneesh Kumar K.V
    Cc: Christophe Leroy
    Cc: Hugh Dickins
    Cc: Joel Fernandes
    Cc: Kalesh Singh
    Cc: Kirill A. Shutemov
    Cc: Michael Ellerman
    Cc: Nicholas Piggin
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aneesh Kumar K.V
     
  • Patch series "init_mm: cleanup ARCH's text/data/brk setup code", v3.

    Add a setup_initial_init_mm() helper, then use it to clean up the text,
    data and brk setup code.

    This patch (of 15):

    Add a setup_initial_init_mm() helper to set up kernel text, data and brk.
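
    A sketch of the helper and of a typical per-arch call site replacing the
    four open-coded init_mm assignments (_text/_etext/_edata/_end are the
    usual linker-provided section markers):

    void __init setup_initial_init_mm(void *start_code, void *end_code,
                                      void *end_data, void *brk)
    {
        init_mm.start_code = (unsigned long)start_code;
        init_mm.end_code   = (unsigned long)end_code;
        init_mm.end_data   = (unsigned long)end_data;
        init_mm.brk        = (unsigned long)brk;
    }

        /* arch setup code, instead of assigning the four fields by hand: */
        setup_initial_init_mm(_text, _etext, _edata, _end);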

    Link: https://lkml.kernel.org/r/20210608083418.137226-1-wangkefeng.wang@huawei.com
    Link: https://lkml.kernel.org/r/20210608083418.137226-2-wangkefeng.wang@huawei.com
    Signed-off-by: Kefeng Wang
    Cc: Souptick Joarder
    Cc: Christophe Leroy
    Cc: Benjamin Herrenschmidt
    Cc: Catalin Marinas
    Cc: Christian Borntraeger
    Cc: Geert Uytterhoeven
    Cc: Greentime Hu
    Cc: Greg Ungerer
    Cc: Guo Ren
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jonas Bonn
    Cc: Ley Foon Tan
    Cc: Michael Ellerman
    Cc: Nick Hu
    Cc: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: Rich Felker
    Cc: Russell King (Oracle)
    Cc: Stafford Horne
    Cc: Stefan Kristiansson
    Cc: Thomas Gleixner
    Cc: Vasily Gorbik
    Cc: Vineet Gupta
    Cc: Will Deacon
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kefeng Wang
     
  • It is unsafe to allow saving of secretmem areas to the hibernation
    snapshot as they would be visible after resume, and this would essentially
    defeat the purpose of secret memory mappings.

    Prevent hibernation whenever there are active secret memory users.
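
    A hedged sketch of the mechanism: secretmem keeps a count of active users,
    and the hibernation core checks it (the other conditions tested in
    hibernation_available() are omitted here):

    /* mm/secretmem.c */
    static atomic_t secretmem_users;

    bool secretmem_active(void)
    {
        return !!atomic_read(&secretmem_users);
    }

    /* kernel/power/hibernate.c */
    bool hibernation_available(void)
    {
        return nohibernate == 0 && !secretmem_active();
    }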

    Link: https://lkml.kernel.org/r/20210518072034.31572-6-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: David Hildenbrand
    Acked-by: James Bottomley
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christopher Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Elena Reshetova
    Cc: Hagen Paul Pfeifer
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Bottomley
    Cc: "Kirill A. Shutemov"
    Cc: Mark Rutland
    Cc: Matthew Wilcox
    Cc: Michael Kerrisk
    Cc: Palmer Dabbelt
    Cc: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Rick Edgecombe
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Cc: Thomas Gleixner
    Cc: Tycho Andersen
    Cc: Will Deacon
    Cc: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Introduce the "memfd_secret" system call with the ability to create memory
    areas visible only in the context of the owning process and not mapped
    either to other processes or into the kernel page tables.

    The secretmem feature is off by default and the user must explicitly
    enable it at boot time.

    Once secretmem is enabled, the user will be able to create a file
    descriptor using the memfd_secret() system call. The memory areas created
    by mmap() calls from this file descriptor will be unmapped from the kernel
    direct map and they will be only mapped in the page table of the processes
    that have access to the file descriptor.

    Secretmem is designed to provide the following protections:

    * Enhanced protection (in conjunction with all the other in-kernel
    attack prevention systems) against ROP attacks. Secretmem makes
    "simple" ROP insufficient to perform exfiltration, which increases the
    required complexity of the attack. Along with other protections like
    the kernel stack size limit and address space layout randomization, which
    make finding gadgets really hard, the absence of any in-kernel primitive
    for accessing secret memory means the one-gadget ROP attack can't work.
    Since the only way to access secret memory is to reconstruct the missing
    mapping entry, the attacker has to recover the physical page and insert
    a PTE pointing to it in the kernel and then retrieve the contents. That
    takes at least three gadgets, which is a level of difficulty beyond most
    standard attacks.

    * Prevent cross-process secret userspace memory exposures. Once the
    secret memory is allocated, the user can't accidentally pass it into the
    kernel to be transmitted somewhere. The secretmem pages cannot be
    accessed via the direct map and they are disallowed in GUP.

    * Harden against exploited kernel flaws. In order to access secretmem,
    a kernel-side attack would need to either walk the page tables and
    create new ones, or spawn a new privileged userspace process to perform
    secrets exfiltration using ptrace.

    The file descriptor based memory has several advantages over the
    "traditional" mm interfaces, such as mlock(), mprotect(), madvise(). The
    file descriptor approach allows explicit and controlled sharing of the
    memory areas, and it allows the operations to be sealed. Besides, file
    descriptor based memory paves the way for VMMs to remove the secret memory
    range from the userspace hypervisor process, for instance QEMU. Andy
    Lutomirski says:

    "Getting fd-backed memory into a guest will take some possibly major
    work in the kernel, but getting vma-backed memory into a guest without
    mapping it in the host user address space seems much, much worse."

    memfd_secret() is made a dedicated system call rather than an extension to
    memfd_create() because its purpose is to allow the user to create more
    secure memory mappings rather than to simply allow file based access to
    the memory. Nowadays the cost of a new system call is negligible, and it
    is far simpler for userspace to deal with a clear-cut system call than
    with a multiplexer or an overloaded syscall. Moreover, the initial
    implementation of memfd_secret() is completely distinct from
    memfd_create(), so there is not much sense in overloading memfd_create()
    to begin with. If there is ever a need for code sharing between these
    implementations, it can easily be achieved without adjusting user-visible
    APIs.

    The secret memory remains accessible in the process context using uaccess
    primitives, but it is not exposed to the kernel otherwise; secret memory
    areas are removed from the direct map and functions in the
    follow_page()/get_user_page() family will refuse to return a page that
    belongs to the secret memory area.

    If a use case ever arises that requires exposing secretmem to the kernel,
    it will be an opt-in request in the system call flags, so that the user
    has to decide what data can be exposed to the kernel.

    Removing pages from the direct map may fragment it on architectures that
    use large pages to map the physical memory, which affects the system
    performance. However, the original Kconfig text for
    CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
    improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
    ("x86: add gbpages switches")) and the recent report [1] showed that "...
    although 1G mappings are a good default choice, there is no compelling
    evidence that it must be the only choice". Hence, it is sufficient to
    have secretmem disabled by default with the ability of a system
    administrator to enable it at boot time.

    Pages in the secretmem regions are unevictable and unmovable to avoid
    accidental exposure of the sensitive data via swap or during page
    migration.

    Since the secretmem mappings are locked in memory they cannot exceed
    RLIMIT_MEMLOCK. Since these mappings are already locked independently
    from mlock(), an attempt to mlock()/munlock() secretmem range would fail
    and mlockall()/munlockall() will ignore secretmem mappings.

    However, unlike mlock()ed memory, secretmem currently behaves more like
    long-term GUP: secretmem mappings are unmovable mappings directly consumed
    by user space. With default limits, there is no excessive use of
    secretmem and it poses no real problem in combination with
    ZONE_MOVABLE/CMA, but in the future this should be addressed to allow
    balanced use of large amounts of secretmem along with ZONE_MOVABLE/CMA.

    A page that was a part of the secret memory area is cleared when it is
    freed to ensure the data is not exposed to the next user of that page.

    The following example demonstrates creation of a secret mapping (error
    handling is omitted):

    fd = memfd_secret(0);
    ftruncate(fd, MAP_SIZE);
    ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
               MAP_SHARED, fd, 0);
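
    A slightly more complete, hedged userspace sketch: glibc provides no
    wrapper at the time of writing, so the call goes through syscall(2)
    (assuming kernel headers recent enough to define SYS_memfd_secret), and
    the kernel must have been booted with secretmem enabled; MAP_SIZE is
    illustrative:

    #define _GNU_SOURCE
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    #define MAP_SIZE (1UL << 20)

    int main(void)
    {
        int fd = syscall(SYS_memfd_secret, 0);          /* no flags */
        if (fd < 0)
            return 1;
        if (ftruncate(fd, MAP_SIZE) < 0)
            return 1;
        char *ptr = mmap(NULL, MAP_SIZE, PROT_READ | PROT_WRITE,
                         MAP_SHARED, fd, 0);
        if (ptr == MAP_FAILED)
            return 1;
        /* data written here is not reachable via the kernel direct map */
        strcpy(ptr, "secret");
        munmap(ptr, MAP_SIZE);
        close(fd);
        return 0;
    }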

    [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

    [akpm@linux-foundation.org: suppress Kconfig whine]

    Link: https://lkml.kernel.org/r/20210518072034.31572-5-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Acked-by: Hagen Paul Pfeifer
    Acked-by: James Bottomley
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christopher Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Elena Reshetova
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Bottomley
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Mark Rutland
    Cc: Michael Kerrisk
    Cc: Palmer Dabbelt
    Cc: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Rick Edgecombe
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Cc: Thomas Gleixner
    Cc: Tycho Andersen
    Cc: Will Deacon
    Cc: David Hildenbrand
    Cc: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Patch series "mm: introduce memfd_secret system call to create "secret" memory areas", v20.

    This is an implementation of "secret" mappings backed by a file
    descriptor.

    The file descriptor backing secret memory mappings is created using a
    dedicated memfd_secret system call. The desired protection mode for the
    memory is configured using the flags parameter of the system call. The
    mmap() of the file descriptor created with memfd_secret() will create a
    "secret" memory mapping. The pages in that mapping will be marked as not
    present in the direct map and will be present only in the page table of
    the owning mm.

    Although normally Linux userspace mappings are protected from other users,
    such secret mappings are useful for environments where a hostile tenant is
    trying to trick the kernel into giving them access to other tenants'
    mappings.

    It's designed to provide the following protections:

    * Enhanced protection (in conjunction with all the other in-kernel
    attack prevention systems) against ROP attacks. Secretmem makes
    "simple" ROP insufficient to perform exfiltration, which increases the
    required complexity of the attack. Along with other protections like
    the kernel stack size limit and address space layout randomization, which
    make finding gadgets really hard, the absence of any in-kernel primitive
    for accessing secret memory means the one-gadget ROP attack can't work.
    Since the only way to access secret memory is to reconstruct the missing
    mapping entry, the attacker has to recover the physical page and insert
    a PTE pointing to it in the kernel and then retrieve the contents. That
    takes at least three gadgets, which is a level of difficulty beyond most
    standard attacks.

    * Prevent cross-process secret userspace memory exposures. Once the
    secret memory is allocated, the user can't accidentally pass it into the
    kernel to be transmitted somewhere. The secretmem pages cannot be
    accessed via the direct map and they are disallowed in GUP.

    * Harden against exploited kernel flaws. In order to access secretmem,
    a kernel-side attack would need to either walk the page tables and
    create new ones, or spawn a new privileged userspace process to perform
    secrets exfiltration using ptrace.

    In the future the secret mappings may be used as a means to protect guest
    memory in a virtual machine host.

    For demonstration of secret memory usage we've created a userspace library

    https://git.kernel.org/pub/scm/linux/kernel/git/jejb/secret-memory-preloader.git

    that does two things: the first is to act as a preloader for openssl,
    redirecting all the OPENSSL_malloc calls to secret memory so that any
    secret keys get automatically protected this way; the other is to expose
    the API to the user who needs it. We anticipate that a lot of the use
    cases would be like the openssl one: many toolkits that deal with secret
    keys already have special handling for the memory to try to give them
    greater protection, so this would simply be pluggable into the toolkits
    without any need for user application modification.

    Hiding secret memory mappings behind an anonymous file allows usage of the
    page cache for tracking pages allocated for the "secret" mappings as well
    as using address_space_operations for e.g. page migration callbacks.

    The anonymous file may be also used implicitly, like hugetlb files, to
    implement mmap(MAP_SECRET) and use the secret memory areas with "native"
    mm ABIs in the future.

    Removing pages from the direct map may fragment it on architectures that
    use large pages to map the physical memory, which affects the system
    performance. However, the original Kconfig text for
    CONFIG_DIRECT_GBPAGES said that gigabyte pages in the direct map "... can
    improve the kernel's performance a tiny bit ..." (commit 00d1c5e05736
    ("x86: add gbpages switches")) and the recent report [1] showed that "...
    although 1G mappings are a good default choice, there is no compelling
    evidence that it must be the only choice". Hence, it is sufficient to
    have secretmem disabled by default with the ability of a system
    administrator to enable it at boot time.

    In addition, there is also a long term goal to improve management of the
    direct map.

    [1] https://lore.kernel.org/linux-mm/213b4567-46ce-f116-9cdf-bbd0c884eb3c@linux.intel.com/

    This patch (of 7):

    It will be used by the upcoming secret memory implementation.

    Link: https://lkml.kernel.org/r/20210518072034.31572-1-rppt@kernel.org
    Link: https://lkml.kernel.org/r/20210518072034.31572-2-rppt@kernel.org
    Signed-off-by: Mike Rapoport
    Reviewed-by: David Hildenbrand
    Acked-by: James Bottomley
    Cc: Alexander Viro
    Cc: Andy Lutomirski
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Catalin Marinas
    Cc: Christopher Lameter
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Elena Reshetova
    Cc: Hagen Paul Pfeifer
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: James Bottomley
    Cc: "Kirill A. Shutemov"
    Cc: Mark Rutland
    Cc: Matthew Wilcox
    Cc: Michael Kerrisk
    Cc: Palmer Dabbelt
    Cc: Palmer Dabbelt
    Cc: Paul Walmsley
    Cc: Peter Zijlstra
    Cc: Rick Edgecombe
    Cc: Roman Gushchin
    Cc: Shakeel Butt
    Cc: Shuah Khan
    Cc: Thomas Gleixner
    Cc: Tycho Andersen
    Cc: Will Deacon
    Cc: kernel test robot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     
  • Many stack traces are similar so there are many similar arrays.
    Stackdepot saves each unique stack only once.

    Replace field addrs in struct track with depot_stack_handle_t handle. Use
    stackdepot to save stack trace.

    The benefits are smaller memory overhead and the possibility to aggregate
    per-cache statistics in the future using the stackdepot handle instead of
    matching stacks manually.
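
    A hedged sketch of the save path after the conversion; the helper name is
    illustrative, but stack_trace_save() and stack_depot_save() are the
    generic <linux/stacktrace.h> and <linux/stackdepot.h> APIs, and
    TRACK_ADDRS_COUNT is SLUB's existing bound on recorded frames:

    static depot_stack_handle_t save_track_stack(gfp_t flags)
    {
        unsigned long entries[TRACK_ADDRS_COUNT];
        unsigned int nr_entries;

        /* skip the innermost allocator frames */
        nr_entries = stack_trace_save(entries, ARRAY_SIZE(entries), 3);
        return stack_depot_save(entries, nr_entries, flags);
    }

        /* in set_track(): one handle replaces the addrs[] array */
        p->handle = save_track_stack(GFP_NOWAIT);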

    [rdunlap@infradead.org: rename save_stack_trace()]
    Link: https://lkml.kernel.org/r/20210513051920.29320-1-rdunlap@infradead.org
    [vbabka@suse.cz: fix lockdep splat]
    Link: https://lkml.kernel.org/r/20210516195150.26740-1-vbabka@suse.cz
    Link: https://lkml.kernel.org/r/20210414163434.4376-1-glittao@gmail.com

    Signed-off-by: Oliver Glitta
    Signed-off-by: Randy Dunlap
    Signed-off-by: Vlastimil Babka
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oliver Glitta
     

05 Jul, 2021

3 commits

  • …git/paulmck/linux-rcu

    Pull RCU updates from Paul McKenney:

    - Bitmap parsing support for "all" as an alias for all bits

    - Documentation updates

    - Miscellaneous fixes, including some that overlap into mm and lockdep

    - kvfree_rcu() updates

    - mem_dump_obj() updates, with acks from one of the slab-allocator
    maintainers

    - RCU NOCB CPU updates, including limited deoffloading

    - SRCU updates

    - Tasks-RCU updates

    - Torture-test updates

    * 'core-rcu-2021.07.04' of git://git.kernel.org/pub/scm/linux/kernel/git/paulmck/linux-rcu: (78 commits)
    tasks-rcu: Make show_rcu_tasks_gp_kthreads() be static inline
    rcu-tasks: Make ksoftirqd provide RCU Tasks quiescent states
    rcu: Add missing __releases() annotation
    rcu: Remove obsolete rcu_read_unlock() deadlock commentary
    rcu: Improve comments describing RCU read-side critical sections
    rcu: Create an unrcu_pointer() to remove __rcu from a pointer
    srcu: Early test SRCU polling start
    rcu: Fix various typos in comments
    rcu/nocb: Unify timers
    rcu/nocb: Prepare for fine-grained deferred wakeup
    rcu/nocb: Only cancel nocb timer if not polling
    rcu/nocb: Delete bypass_timer upon nocb_gp wakeup
    rcu/nocb: Cancel nocb_timer upon nocb_gp wakeup
    rcu/nocb: Allow de-offloading rdp leader
    rcu/nocb: Directly call __wake_nocb_gp() from bypass timer
    rcu: Don't penalize priority boosting when there is nothing to boost
    rcu: Point to documentation of ordering guarantees
    rcu: Make rcu_gp_cleanup() be noinline for tracing
    rcu: Restrict RCU_STRICT_GRACE_PERIOD to at most four CPUs
    rcu: Make show_rcu_gp_kthreads() dump rcu_node structures blocking GP
    ...

    Linus Torvalds
     
  • Pull memblock updates from Mike Rapoport:
    "Fix arm crashes caused by holes in the memory map.

    The coordination between freeing of unused memory map, pfn_valid() and
    core mm assumptions about validity of the memory map in various ranges
    was not designed for complex layouts of the physical memory with a lot
    of holes all over the place.

    Kefeng Wang reported crashes in move_freepages() on a system with the
    following memory layout [1]:

    node 0: [mem 0x0000000080a00000-0x00000000855fffff]
    node 0: [mem 0x0000000086a00000-0x0000000087dfffff]
    node 0: [mem 0x000000008bd00000-0x000000008c4fffff]
    node 0: [mem 0x000000008e300000-0x000000008ecfffff]
    node 0: [mem 0x0000000090d00000-0x00000000bfffffff]
    node 0: [mem 0x00000000cc000000-0x00000000dc9fffff]
    node 0: [mem 0x00000000de700000-0x00000000de9fffff]
    node 0: [mem 0x00000000e0800000-0x00000000e0bfffff]
    node 0: [mem 0x00000000f4b00000-0x00000000f6ffffff]
    node 0: [mem 0x00000000fda00000-0x00000000ffffefff]

    These crashes can be mitigated by enabling CONFIG_HOLES_IN_ZONE on ARM
    and essentially turning pfn_valid_within() to pfn_valid() instead of
    having it hardwired to 1 on that architecture, but this would require
    to keep CONFIG_HOLES_IN_ZONE solely for this purpose.

    A cleaner approach is to update ARM's implementation of pfn_valid() to
    take into account the rounding of the freed memory map to pageblock
    boundaries, and to make sure it returns true for PFNs that have memory map
    entries even if there is no physical memory backing those PFNs"

    Link: https://lore.kernel.org/lkml/2a1592ad-bc9d-4664-fd19-f7448a37edc0@huawei.com [1]

    * tag 'memblock-v5.14-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock:
    arm: extend pfn_valid to take into account freed memory map alignment
    memblock: ensure there is no overflow in memblock_overlaps_region()
    memblock: align freed memory map on pageblock boundaries with SPARSEMEM
    memblock: free_unused_memmap: use pageblock units instead of MAX_ORDER

    Linus Torvalds
     
  • Prior to "percpu: implement partial chunk depopulation",
    pcpu_depopulate_chunk() was called only on the destruction path. This
    meant the virtual address range was on its way back to vmalloc, which
    would handle flushing the TLBs for us.

    However, with pcpu_reclaim_populated(), we are now calling
    pcpu_depopulate_chunk() during the active lifecycle of a chunk.
    Therefore, we need to flush the TLB as well; otherwise we can end up
    accessing the wrong page through an invalid TLB mapping, as reported in
    [1].

    [1] https://lore.kernel.org/lkml/20210702191140.GA3166599@roeck-us.net/
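
    A hedged sketch of the fix's shape (pcpu_post_unmap_tlb_flush() is the
    flush helper already used on percpu's free path; exact placement within
    pcpu_reclaim_populated() abbreviated):

        /* pages [rs, re) of this chunk have just been unmapped ... */
        pcpu_depopulate_chunk(chunk, rs, re);
        /*
         * ... but the chunk stays alive, so flush now instead of relying
         * on vmalloc's lazy flush, which only happens on the free path.
         */
        pcpu_post_unmap_tlb_flush(chunk, rs, re);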

    Fixes: f183324133ea ("percpu: implement partial chunk depopulation")
    Reported-and-tested-by: Guenter Roeck
    Signed-off-by: Dennis Zhou

    Dennis Zhou
     

04 Jul, 2021

1 commit

  • Pull iov_iter updates from Al Viro:
    "iov_iter cleanups and fixes.

    There are followups, but this is what had sat in -next this cycle. IMO
    the macro forest in there became much thinner and easier to follow..."

    * 'work.iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (37 commits)
    csum_and_copy_to_pipe_iter(): leave handling of csum_state to caller
    clean up copy_mc_pipe_to_iter()
    pipe_zero(): we don't need no stinkin' kmap_atomic()...
    iov_iter: clean csum_and_copy_...() primitives up a bit
    copy_page_from_iter(): don't need kmap_atomic() for kvec/bvec cases
    copy_page_to_iter(): don't bother with kmap_atomic() for bvec/kvec cases
    iterate_xarray(): only of the first iteration we might get offset != 0
    pull handling of ->iov_offset into iterate_{iovec,bvec,xarray}
    iov_iter: make iterator callbacks use base and len instead of iovec
    iov_iter: make the amount already copied available to iterator callbacks
    iov_iter: get rid of separate bvec and xarray callbacks
    iov_iter: teach iterate_{bvec,xarray}() about possible short copies
    iterate_bvec(): expand bvec.h macro forest, massage a bit
    iov_iter: unify iterate_iovec and iterate_kvec
    iov_iter: massage iterate_iovec and iterate_kvec to logics similar to iterate_bvec
    iterate_and_advance(): get rid of magic in case when n is 0
    csum_and_copy_to_iter(): massage into form closer to csum_and_copy_from_iter()
    iov_iter: replace iov_iter_copy_from_user_atomic() with iterator-advancing variant
    [xarray] iov_iter_npages(): just use DIV_ROUND_UP()
    iov_iter_npages(): don't bother with iterate_all_kinds()
    ...

    Linus Torvalds
     

03 Jul, 2021

1 commit

  • Merge more updates from Andrew Morton:
    "190 patches.

    Subsystems affected by this patch series: mm (hugetlb, userfaultfd,
    vmscan, kconfig, proc, z3fold, zbud, ras, mempolicy, memblock,
    migration, thp, nommu, kconfig, madvise, memory-hotplug, zswap,
    zsmalloc, zram, cleanups, kfence, and hmm), procfs, sysctl, misc,
    core-kernel, lib, lz4, checkpatch, init, kprobes, nilfs2, hfs,
    signals, exec, kcov, selftests, compress/decompress, and ipc"

    * emailed patches from Andrew Morton : (190 commits)
    ipc/util.c: use binary search for max_idx
    ipc/sem.c: use READ_ONCE()/WRITE_ONCE() for use_global_lock
    ipc: use kmalloc for msg_queue and shmid_kernel
    ipc sem: use kvmalloc for sem_undo allocation
    lib/decompressors: remove set but not used variabled 'level'
    selftests/vm/pkeys: exercise x86 XSAVE init state
    selftests/vm/pkeys: refill shadow register after implicit kernel write
    selftests/vm/pkeys: handle negative sys_pkey_alloc() return code
    selftests/vm/pkeys: fix alloc_random_pkey() to make it really, really random
    kcov: add __no_sanitize_coverage to fix noinstr for all architectures
    exec: remove checks in __register_bimfmt()
    x86: signal: don't do sas_ss_reset() until we are certain that sigframe won't be abandoned
    hfsplus: report create_date to kstat.btime
    hfsplus: remove unnecessary oom message
    nilfs2: remove redundant continue statement in a while-loop
    kprobes: remove duplicated strong free_insn_page in x86 and s390
    init: print out unknown kernel parameters
    checkpatch: do not complain about positive return values starting with EPOLL
    checkpatch: improve the indented label test
    checkpatch: scripts/spdxcheck.py now requires python3
    ...

    Linus Torvalds
     

02 Jul, 2021

19 commits

  • Pull percpu updates from Dennis Zhou:

    - percpu chunk depopulation - depopulate backing pages for chunks with
    empty pages when we exceed a global threshold without those pages.
    This lets us reclaim a portion of memory that would previously be
    lost until the full chunk would be freed (possibly never).

    - memcg accounting cleanup - previously separate chunks were managed
    for normal allocations and __GFP_ACCOUNT allocations. These are now
    consolidated which cleans up the code quite a bit.

    - a few misc clean ups for clang warnings

    * 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/dennis/percpu:
    percpu: optimize locking in pcpu_balance_workfn()
    percpu: initialize best_upa variable
    percpu: rework memcg accounting
    mm, memcg: introduce mem_cgroup_kmem_disabled()
    mm, memcg: mark cgroup_memory_nosocket, nokmem and noswap as __ro_after_init
    percpu: make symbol 'pcpu_free_slot' static
    percpu: implement partial chunk depopulation
    percpu: use pcpu_free_slot instead of pcpu_nr_slots - 1
    percpu: factor out pcpu_check_block_hint()
    percpu: split __pcpu_balance_workfn()
    percpu: fix a comment about the chunks ordering

    Linus Torvalds
     
  • Some devices require exclusive write access to shared virtual memory (SVM)
    ranges to perform atomic operations on that memory. This requires CPU
    page tables to be updated to deny access whilst atomic operations are
    occurring.

    In order to do this introduce a new swap entry type
    (SWP_DEVICE_EXCLUSIVE). When a SVM range needs to be marked for exclusive
    access by a device all page table mappings for the particular range are
    replaced with device exclusive swap entries. This causes any CPU access
    to the page to result in a fault.

    Faults are resolved by replacing the faulting entry with the original
    mapping. This results in MMU notifiers being called, which a driver uses
    to update access permissions such as revoking atomic access. After
    notifiers have been called the device will no longer have exclusive access
    to the region.

    Walking of the page tables to find the target pages is handled by
    get_user_pages() rather than a direct page table walk. A direct page
    table walk similar to what migrate_vma_collect()/unmap() does could also
    have been used. However, that would have resulted in more code that
    duplicates functionality get_user_pages() already provides, as page
    faulting is required to make the PTEs present and to break COW.
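
    A hedged driver-side sketch of the new interface; make_device_exclusive_range()
    is what this patch adds, while mm, addr and my_owner_token stand in for
    the driver's own context:

        struct page *page;
        int npages;

        /*
         * Replace the CPU mappings of one page with device-exclusive
         * entries; the owner token lets our own MMU notifier callback
         * recognise and skip the invalidation this triggers.
         */
        npages = make_device_exclusive_range(mm, addr, addr + PAGE_SIZE,
                                             &page, &my_owner_token);
        if (npages != 1)
            return -EBUSY;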

    [dan.carpenter@oracle.com: fix signedness bug in make_device_exclusive_range()]
    Link: https://lkml.kernel.org/r/YNIz5NVnZ5GiZ3u1@mwanda

    Link: https://lkml.kernel.org/r/20210616105937.23201-8-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Signed-off-by: Dan Carpenter
    Reviewed-by: Christoph Hellwig
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Currently if copy_nonpresent_pte() returns a non-zero value it is assumed
    to be a swap entry which requires further processing outside the loop in
    copy_pte_range() after dropping locks. This prevents other values being
    returned to signal conditions such as failure which a subsequent change
    requires.

    Instead make copy_nonpresent_pte() return an error code if further
    processing is required and read the value for the swap entry in the main
    loop under the ptl.

    Link: https://lkml.kernel.org/r/20210616105937.23201-7-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Peter Xu
    Cc: Ben Skeggs
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • MMU notifier ranges have a migrate_pgmap_owner field which is used by
    drivers to store a pointer. This is subsequently used by the driver
    callback to filter MMU_NOTIFY_MIGRATE events. Other notifier event types
    can also benefit from this filtering, so rename the 'migrate_pgmap_owner'
    field to 'owner' and create a new notifier initialisation function to
    initialise this field.
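
    A hedged sketch of the filtering this enables in a driver's
    interval-notifier callback (the callback shape follows
    mmu_interval_notifier_ops; my_owner_token is illustrative):

    static bool my_invalidate(struct mmu_interval_notifier *mni,
                              const struct mmu_notifier_range *range,
                              unsigned long cur_seq)
    {
        /* Events we triggered ourselves carry our owner token: skip them. */
        if (range->owner == &my_owner_token)
            return true;

        mmu_interval_set_seq(mni, cur_seq);
        /* tear down the device mappings covering range->start..range->end */
        return true;
    }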

    Link: https://lkml.kernel.org/r/20210616105937.23201-6-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Suggested-by: Peter Xu
    Reviewed-by: Peter Xu
    Cc: Ben Skeggs
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Ralph Campbell
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Migration is currently implemented as a mode of operation for
    try_to_unmap_one() generally specified by passing the TTU_MIGRATION flag
    or in the case of splitting a huge anonymous page TTU_SPLIT_FREEZE.

    However it does not have much in common with the rest of the unmap
    functionality of try_to_unmap_one() and thus splitting it into a separate
    function reduces the complexity of try_to_unmap_one() making it more
    readable.

    Several simplifications can also be made in try_to_migrate_one() based on
    the following observations:

    - All users of TTU_MIGRATION also set TTU_IGNORE_MLOCK.
    - No users of TTU_MIGRATION ever set TTU_IGNORE_HWPOISON.
    - No users of TTU_MIGRATION ever set TTU_BATCH_FLUSH.

    TTU_SPLIT_FREEZE is a special case of migration used when splitting an
    anonymous page. This is most easily dealt with by calling the correct
    function from unmap_page() in mm/huge_memory.c - either try_to_migrate()
    for PageAnon or try_to_unmap().

    Link: https://lkml.kernel.org/r/20210616105937.23201-5-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • The behaviour of try_to_unmap_one() is difficult to follow because it
    performs different operations based on a fairly large set of flags used in
    different combinations.

    TTU_MUNLOCK is one such flag. However, it is exclusively used by
    try_to_munlock(), which specifies no other flags. Therefore, rather than
    overload try_to_unmap_one() with unrelated behaviour, split this out into
    its own function and remove the flag.

    Link: https://lkml.kernel.org/r/20210616105937.23201-4-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Both migration and device private pages use special swap entries that are
    manipulated by a range of inline functions. The arguments to these are
    somewhat inconsistent, so rework them to remove flag type arguments and to
    make the arguments similar for both read and write entry creation.

    Link: https://lkml.kernel.org/r/20210616105937.23201-3-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: Ralph Campbell
    Cc: Ben Skeggs
    Cc: Hugh Dickins
    Cc: John Hubbard
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Peter Xu
    Cc: Shakeel Butt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Patch series "Add support for SVM atomics in Nouveau", v11.

    Introduction
    ============

    Some devices have features such as atomic PTE bits that can be used to
    implement atomic access to system memory. To support atomic operations to
    a shared virtual memory page such a device needs access to that page which
    is exclusive of the CPU. This series introduces a mechanism to
    temporarily unmap pages granting exclusive access to a device.

    These changes are required to support OpenCL atomic operations in Nouveau
    to shared virtual memory (SVM) regions allocated with the
    CL_MEM_SVM_ATOMICS clSVMAlloc flag. A more complete description of the
    OpenCL SVM feature is available at
    https://www.khronos.org/registry/OpenCL/specs/3.0-unified/html/
    OpenCL_API.html#_shared_virtual_memory .

    Implementation
    ==============

    Exclusive device access is implemented by adding a new swap entry type
    (SWAP_DEVICE_EXCLUSIVE) which is similar to a migration entry. The main
    difference is that on fault the original entry is immediately restored by
    the fault handler instead of waiting.

    Restoring the entry triggers calls to MMU notifiers, which allows a device
    driver to revoke the atomic access permission from the GPU prior to the
    CPU finalising the entry.

    Patches
    =======

    Patches 1 & 2 refactor existing migration and device private entry
    functions.

    Patches 3 & 4 rework try_to_unmap_one() by splitting out unrelated
    functionality into separate functions - try_to_migrate_one() and
    try_to_munlock_one().

    Patch 5 renames some existing code but does not introduce functionality.

    Patch 6 is a small clean-up to swap entry handling in copy_pte_range().

    Patch 7 contains the bulk of the implementation for device exclusive
    memory.

    Patch 8 contains some additions to the HMM selftests to ensure everything
    works as expected.

    Patch 9 is a cleanup for the Nouveau SVM implementation.

    Patch 10 contains the implementation of atomic access for the Nouveau
    driver.

    Testing
    =======

    This has been tested with upstream Mesa 21.1.0 and a simple OpenCL program
    which checks that GPU atomic accesses to system memory are atomic.
    Without this series the test fails as there is no way of write-protecting
    the page mapping which results in the device clobbering CPU writes. For
    reference the test is available at
    https://ozlabs.org/~apopple/opencl_svm_atomics/

    Further testing has been performed by adding support for testing exclusive
    access to the hmm-tests kselftests.

    This patch (of 10):

    Remove multiple similar inline functions for dealing with different types
    of special swap entries.

    Both migration and device private swap entries use the swap offset to
    store a pfn. Instead of multiple inline functions to obtain a struct page
    for each swap entry type use a common function pfn_swap_entry_to_page().
    Also open-code the various entry_to_pfn() functions, as this results in
    shorter code that is easier to understand.
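
    A hedged sketch of the consolidated helper in use; pfn_swap_entry_to_page()
    is the name this patch introduces, and the is_*_entry() predicates are
    pre-existing:

        swp_entry_t entry = pte_to_swp_entry(pte);
        struct page *page = NULL;

        /* one helper replaces migration_entry_to_page() and
         * device_private_entry_to_page() */
        if (is_migration_entry(entry) || is_device_private_entry(entry))
            page = pfn_swap_entry_to_page(entry);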

    Link: https://lkml.kernel.org/r/20210616105937.23201-1-apopple@nvidia.com
    Link: https://lkml.kernel.org/r/20210616105937.23201-2-apopple@nvidia.com
    Signed-off-by: Alistair Popple
    Reviewed-by: Ralph Campbell
    Reviewed-by: Christoph Hellwig
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Hugh Dickins
    Cc: Peter Xu
    Cc: Shakeel Butt
    Cc: Ben Skeggs
    Cc: Jason Gunthorpe
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alistair Popple
     
  • Unconditionally use unbound work queue, and not just if wq_power_efficient
    is true. Because if the system is idle, KFENCE may wait, and by being run
    on the unbound work queue, we permit the scheduler to make better
    scheduling decisions and not require pinning KFENCE to the same CPU upon
    waking up.
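
    A hedged sketch of the resulting queueing call (kfence_timer and
    kfence_sample_interval are the pre-existing delayed work and sampling
    interval in mm/kfence/core.c):

        /* Always use the unbound workqueue, not only when wq_power_efficient
         * makes system_power_efficient_wq unbound. */
        queue_delayed_work(system_unbound_wq, &kfence_timer,
                           msecs_to_jiffies(kfence_sample_interval));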

    Link: https://lkml.kernel.org/r/20210521111630.472579-1-elver@google.com
    Fixes: 36f0b35d0894 ("kfence: use power-efficient work queue to run delayed work")
    Signed-off-by: Marco Elver
    Reported-by: Hillf Danton
    Reviewed-by: Alexander Potapenko
    Cc: Dmitry Vyukov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Marco Elver
     
  • make W=1 generates the following warning in page_alloc.c for allnoconfig

    mm/page_alloc.c:2670:5: warning: no previous prototype for `find_suitable_fallback' [-Wmissing-prototypes]
    int find_suitable_fallback(struct free_area *area, unsigned int order,
    ^~~~~~~~~~~~~~~~~~~~~~

    find_suitable_fallback is only shared outside of page_alloc.c for
    CONFIG_COMPACTION but, to suppress the warning, move the prototype outside
    of CONFIG_COMPACTION. It is not worth the effort at this time to find a
    clever way of allowing compaction.c to share the code or avoid the use
    entirely as the function is called on relatively slow paths.

    Link: https://lkml.kernel.org/r/20210520084809.8576-14-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning in mmap_lock.c for allnoconfig

    mm/mmap_lock.c:213:6: warning: no previous prototype for `__mmap_lock_do_trace_start_locking' [-Wmissing-prototypes]
    void __mmap_lock_do_trace_start_locking(struct mm_struct *mm, bool write)
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/mmap_lock.c:219:6: warning: no previous prototype for `__mmap_lock_do_trace_acquire_returned' [-Wmissing-prototypes]
    void __mmap_lock_do_trace_acquire_returned(struct mm_struct *mm, bool write,
    ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    mm/mmap_lock.c:226:6: warning: no previous prototype for `__mmap_lock_do_trace_released' [-Wmissing-prototypes]
    void __mmap_lock_do_trace_released(struct mm_struct *mm, bool write)

    On !CONFIG_TRACING configurations, the code is dead so put it behind an
    #ifdef.

    [cuibixuan@huawei.com: fix warning when CONFIG_TRACING is not defined]
    Link: https://lkml.kernel.org/r/20210531033426.74031-1-cuibixuan@huawei.com

    Link: https://lkml.kernel.org/r/20210520084809.8576-13-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Signed-off-by: Bixuan Cui
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for z3fold_pool

    mm/z3fold.c:171: warning: Function parameter or member 'zpool' not described in 'z3fold_pool'
    mm/z3fold.c:171: warning: Function parameter or member 'zpool_ops' not described in 'z3fold_pool'

    Commit 9a001fc19ccc ("z3fold: the 3-fold allocator for compressed pages")
    simply did not document the fields at the time. Add rudimentary
    documentation.

    Link: https://lkml.kernel.org/r/20210520084809.8576-11-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for zbud_pool

    mm/zbud.c:105: warning: Function parameter or member 'zpool' not described in 'zbud_pool'
    mm/zbud.c:105: warning: Function parameter or member 'zpool_ops' not described in 'zbud_pool'

    Commit 479305fd7172 ("zpool: remove zpool_evict()") removed the
    zpool_evict helper and added the associated zpool and operations structure
    in struct zbud_pool but did not add documentation for the fields. Add
    rudimentary documentation.

    Link: https://lkml.kernel.org/r/20210520084809.8576-10-mgorman@techsingularity.net
    Fixes: 479305fd7172 ("zpool: remove zpool_evict()")
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for __remove_memory

    mm/memory_hotplug.c:2044: warning: expecting prototype for remove_memory(). Prototype was for __remove_memory() instead

    Commit eca499ab3749 ("mm/hotplug: make remove_memory() interface usable")
    introduced the kerneldoc comment and function but the kerneldoc name and
    function name did not match.

    Link: https://lkml.kernel.org/r/20210520084809.8576-9-mgorman@techsingularity.net
    Fixes: eca499ab3749 ("mm/hotplug: make remove_memory() interface usable")
    Signed-off-by: Mel Gorman
    Reviewed-by: David Hildenbrand
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for try_online_node

    mm/memory_hotplug.c:1087: warning: expecting prototype for try_online_node(). Prototype was for __try_online_node() instead

    Commit b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use
    __try_online_node") renamed the function but did not update the associated
    kerneldoc. The function is static and somewhat specialised in nature so
    it's not clear it warrants being a kerneldoc by moving the comment to
    try_online_node. Hence, leave the comment of the internal helper in place
    but leave it out of kerneldoc and correct the function name in the
    comment.

    Link: https://lkml.kernel.org/r/20210520084809.8576-8-mgorman@techsingularity.net
    Fixes: b9ff036082cd ("mm/memory_hotplug.c: make add_memory_resource use __try_online_node")
    Signed-off-by: Mel Gorman
    Reviewed-by: David Hildenbrand
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for mem_cgroup_calculate_protection

    mm/memcontrol.c:6468: warning: expecting prototype for mem_cgroup_protected(). Prototype was for mem_cgroup_calculate_protection() instead

    Commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from
    protection checks") changed the function definition but not the associated
    kerneldoc comment.

    Link: https://lkml.kernel.org/r/20210520084809.8576-7-mgorman@techsingularity.net
    Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Chris Down
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for mm/mapping_dirty_helpers.c

    mm/mapping_dirty_helpers.c:325: warning: duplicate section name 'Note'

    The helper function is very specific to one driver -- vmwgfx. While the
    two notes are separate, all of it needs to be taken into account when
    using the helper so make it one note.

    Link: https://lkml.kernel.org/r/20210520084809.8576-5-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for mm/page_alloc.c

    mm/page_alloc.c:3651:15: warning: no previous prototype for `should_fail_alloc_page' [-Wmissing-prototypes]
    noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    ^~~~~~~~~~~~~~~~~~~~~~

    This function is deliberately split out for BPF to allow errors to be
    injected. The function is not used anywhere else so it is local to the
    file. Make it static which should still allow error injection to be used
    similar to how block/blk-core.c:should_fail_bio() works.
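
    A hedged sketch of the resulting arrangement; ALLOW_ERROR_INJECTION()
    (from <asm-generic/error-injection.h>) is what keeps the now-static
    function usable as a BPF error-injection point:

    static noinline bool should_fail_alloc_page(gfp_t gfp_mask, unsigned int order)
    {
        return __should_fail_alloc_page(gfp_mask, order);
    }
    ALLOW_ERROR_INJECTION(should_fail_alloc_page, TRUE);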

    Link: https://lkml.kernel.org/r/20210520084809.8576-4-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • make W=1 generates the following warning for mm/vmalloc.c

    mm/vmalloc.c:1599:6: warning: no previous prototype for `set_iounmap_nonlazy' [-Wmissing-prototypes]
    void set_iounmap_nonlazy(void)
    ^~~~~~~~~~~~~~~~~~~

    This is an arch-generic function only used by x86. On other arches, it's
    dead code. Include the header with the definition and make it x86-64
    specific.

    Link: https://lkml.kernel.org/r/20210520084809.8576-3-mgorman@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Yang Shi
    Acked-by: Vlastimil Babka
    Cc: Dan Streetman
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman