08 Apr, 2020

40 commits

  • The compressed cache for swap pages (zswap) currently needs one to three
    extra kernel command line parameters to work: it has to be enabled with
    the "zswap.enabled=1" parameter, and if one wants a compressor or pool
    allocator other than the default lzo / zbud combination, those choices
    also need to be specified as additional kernel command line parameters.

    Using a different compressor and allocator for zswap is actually pretty
    common, as guides often recommend using the lz4 / z3fold pair instead of
    the default one. In such a case it is also necessary to remember to
    enable the appropriate compression algorithm and pool allocator in the
    kernel config manually.
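
    For illustration, an lz4 / z3fold setup previously needed something like
    the following on the kernel command line (zswap.compressor and
    zswap.zpool are the zswap module parameter names; shown as a typical
    example, not text from the patch):

        zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=z3fold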

    Let's avoid the need for adding these kernel command line parameters and
    automatically pull in the dependencies for the selected compressor
    algorithm and pool allocator by adding appropriate default switches to
    Kconfig.

    The default values for these options match what the code was using
    previously as its defaults.

    Signed-off-by: Maciej S. Szmigiero
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Wool
    Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
    Signed-off-by: Linus Torvalds

    Maciej S. Szmigiero
     
  • I recently built the RISC-V port with LLVM trunk, which has introduced a
    new warning when casting from a pointer to an enum of a smaller size.
    This patch simply casts to a long in the middle to stop the warning. I'd
    be surprised if this is the only such cast in the kernel, but it's the
    only one I saw.
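
    A minimal illustration of the pattern (hypothetical code, not taken from
    the patch):

        enum fixmap_idx { FIX_EARLY, FIX_LATE };        /* hypothetical enum */

        static enum fixmap_idx idx_from_ptr(void *p)
        {
                /* (enum fixmap_idx)p would warn with LLVM trunk, since the
                 * enum is smaller than a pointer; bounce through long. */
                return (enum fixmap_idx)(long)p;
        }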

    Signed-off-by: Palmer Dabbelt
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
    Signed-off-by: Linus Torvalds

    Palmer Dabbelt
     
  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some use cases which prefer preserving THP, but
    balloon inflation is handled at base page granularity. So when using
    shmem THP as the memory backend, QEMU inflation actually doesn't work as
    expected since it doesn't free memory. But the inflation use case really
    needs to get the memory freed. (Anonymous THP will also not get freed
    right away, but will be freed eventually when all subpages are unmapped;
    whereas shmem THP still stays in the page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.
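
    A rough sketch of the decision shmem_punch_compound() encapsulates (the
    shape and details below are assumptions, not the actual patch):

        static bool shmem_punch_compound(struct page *page, unsigned int start,
                                         unsigned int end)
        {
                if (!PageTransCompound(page))
                        return true;    /* small page: truncate as usual */
                if (split_huge_page(page) == 0)
                        return true;    /* split worked: retry per subpage */
                /* Split failed (extra references held): clear the punched
                 * bytes instead, so reads of the hole return zeroes. */
                zero_user_segment(page, start, end);
                set_page_dirty(page);
                return false;
        }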

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Previously, 0 was assigned to the variable 'error', but the variable was
    never read before being reassigned later, so the assignment can be
    removed.
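
    The pattern in miniature (illustrative only; do_setup() is a stand-in):

        int error = 0;          /* dead store: never read ...             */
        error = do_setup();     /* ...before this reassignment, so the
                                   "= 0" can simply be dropped            */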

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Pankaj Gupta
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Variables declared in a switch statement before any case statements cannot
    be automatically initialized with compiler instrumentation (as they are
    not part of any execution flow). With GCC's proposed automatic stack
    variable initialization feature, this triggers a warning (and they don't
    get initialized). Clang's automatic stack variable initialization (via
    CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
    initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase, so
    even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where they're
    used or lift them up into the main function body.

    mm/shmem.c: In function `shmem_getpage_gfp':
    mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
     1816 |          loff_t i_size;
          |          ^~~~~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916
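
    A minimal illustration (hypothetical code, not the shmem hunk):

        int pick(int sel)
        {
                switch (sel) {
                        int tmp;        /* never "executed": auto-init can't
                                           reach it; with the feature enabled
                                           GCC warns -Wswitch-unreachable */
                case 0:
                        tmp = 42;
                        return tmp;
                default:
                        return -1;
                }
        }

        /* The fix: lift the declaration into the function body. */
        int pick_fixed(int sel)
        {
                int tmp;

                switch (sel) {
                case 0:
                        tmp = 42;
                        return tmp;
                default:
                        return -1;
                }
        }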

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Potapenko
    Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Use __pfn_to_section() API instead of open-coding for better code
    readability.
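
    Schematically (behavior-preserving, since the helper wraps exactly the
    open-coded form):

        /* before: open-coded */
        struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));

        /* after */
        struct mem_section *ms = __pfn_to_section(pfn);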

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/1584345134-16671-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     
  • For now, distributions implement advanced udev rules to essentially
    - Don't online any hotplugged memory (s390x)
    - Online all memory to ZONE_NORMAL (e.g., most virt environments like
      Hyper-V)
    - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
      care of (e.g., bare metal, special virt environments)

    In summary: All memory is usually onlined the same way; however, the
    kernel always has to ask user space to come up with the same answer.
    E.g., Hyper-V always waits for a memory block to get onlined before
    continuing, otherwise it might end up adding memory faster than
    onlining it, which can result in strange OOM situations. This waiting
    slows down the addition of larger amounts of memory.

    Let's allow specifying a default online_type, not just "online" and
    "offline". This allows distributions to configure the default
    online_type when booting up and be done with it.

    We can now specify "offline", "online", "online_movable" and
    "online_kernel" via
    - "memhp_default_state=" on the kernel cmdline
    - /sys/devices/system/memory/auto_online_blocks
    just like we are able to specify for a single memory block via
    /sys/devices/system/memory/memoryX/state

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • ... and rename it to memhp_default_online_type. This is a preparation
    for more detailed default online behavior.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • All in-tree users except the mm-core are gone. Let's drop the export.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-7-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • We get the MEM_ONLINE notifier call if memory is added right from the
    kernel via add_memory() or later from user space.

    Let's get rid of the "ha_waiting" flag - the completion has a built-in
    mechanism (->done) for that. Initialize the completion only once and
    reinitialize it before adding memory. Unconditionally call complete()
    and wait_for_completion_timeout().

    If there are no waiters, complete() will only increment ->done - which
    will be reset by reinit_completion(). If complete() has already been
    called, wait_for_completion_timeout() will not wait.

    There is still the chance for a small race between concurrent
    reinit_completion() and complete(). If complete() wins, we would not wait
    - which is tolerable (and the race exists in current code as well).

    Note: We only wait for "some" memory to get onlined, which seems to be
    good enough for now.
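
    The resulting pattern looks roughly like this (a sketch of the scheme
    described above, not the hv_balloon code itself; names and the timeout
    are assumptions):

        static DECLARE_COMPLETION(ol_waitevent);        /* initialized once */

        /* before adding a memory block */
        reinit_completion(&ol_waitevent);
        add_memory(nid, start, size);
        wait_for_completion_timeout(&ol_waitevent, 5 * HZ);

        /* in the memory notifier, on MEM_ONLINE */
        complete(&ol_waitevent);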

    [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David]
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Kuznetsov
    Reviewed-by: Baoquan He
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Greg Kroah-Hartman
    Cc: Igor Mammedov
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's always try to online the re-added memory blocks. In case
    add_memory() already onlined the added memory blocks, the first
    device_online() call will fail and stop processing the remaining memory
    blocks.

    This avoids manually having to check memhp_auto_online.

    Note: PPC always onlines all hotplugged memory directly from the kernel as
    well - something that is handled by user space on other architectures.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's use a simple array which we can reuse soon. While at it, move the
    string->mmop conversion out of the device hotplug lock.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Historically, we used the value -1. Just treat 0 as the special case
    now. Clarify a comment (which was wrong: when we came via
    device_online() the first time, the online_type would have been 0 /
    MEM_ONLINE). The default is now always MMOP_OFFLINE. This removes the
    last user of the manual "-1", which didn't use the enum value.

    This is a preparation to use the online_type as an array index.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.

    Distributions nowadays use udev rules ([1] [2]) to specify if and how to
    online hotplugged memory. The rules seem to get more complex with many
    special cases. Due to the various special cases,
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
    is handled via udev rules.

    Every time we hotplug memory, the udev rule will come to the same
    conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
    memory in separate memory blocks and wait for memory to get onlined by
    user space before continuing to add more memory blocks (to not add memory
    faster than it is getting onlined). This of course slows down the whole
    memory hotplug process.

    To make the job of distributions easier and to avoid udev rules that get
    more and more complicated, let's extend the mechanism provided by
    - /sys/devices/system/memory/auto_online_blocks
    - "memhp_default_state=" on the kernel cmdline
    so that "online_movable" and "online_kernel" can also be specified.

    === Example /usr/libexec/config-memhotplug ===

    #!/bin/bash

    VIRT=`systemd-detect-virt --vm`
    ARCH=`uname -p`

    sense_virtio_mem() {
            if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
                    DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
                    if [ $DEVICES != "0" ]; then
                            return 0
                    fi
            fi
            return 1
    }

    if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
            echo "Memory hotplug configuration support missing in the kernel"
            exit 1
    fi

    if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
            echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
            exit 1
    fi

    if [ $VIRT == "microsoft" ]; then
            echo "Detected Hyper-V on $ARCH"
            # Hyper-V wants all memory in ZONE_NORMAL
            ONLINE_TYPE="online_kernel"
    elif sense_virtio_mem; then
            echo "Detected virtio-mem on $ARCH"
            # virtio-mem wants all memory in ZONE_NORMAL
            ONLINE_TYPE="online_kernel"
    elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
            echo "Detected $ARCH"
            # standby memory should not be onlined automatically
            ONLINE_TYPE="offline"
    elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
            echo "Detected $ARCH"
            # PPC64 onlines all hotplugged memory right from the kernel
            ONLINE_TYPE="offline"
    elif [ $VIRT == "none" ]; then
            echo "Detected bare-metal on $ARCH"
            # Bare metal users expect hotplugged memory to be unpluggable. We assume
            # that ZONE imbalances on such enterprise servers cannot happen and are
            # properly documented
            ONLINE_TYPE="online_movable"
    else
            # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
            # imbalances won't happen
            echo "Detected $VIRT on $ARCH"
            # Usually, ballooning is used in virtual environments, so memory should go to
            # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
            ONLINE_TYPE="online"
    fi

    echo "Selected online_type: $ONLINE_TYPE"

    # Configure what to do with memory that will be hotplugged in the future
    echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
    if [ $? != "0" ]; then
            echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
            # A backup udev rule should handle old kernels if necessary
            exit 1
    fi

    # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
    if [ $ONLINE_TYPE != "offline" ]; then
            for MEMORY in /sys/devices/system/memory/memory*; do
                    STATE=`cat $MEMORY/state`
                    if [ $STATE == "offline" ]; then
                            echo $ONLINE_TYPE > $MEMORY/state
                    fi
            done
    fi

    === Example /usr/lib/systemd/system/config-memhotplug.service ===

    [Unit]
    Description=Configure memory hotplug behavior
    DefaultDependencies=no
    Conflicts=shutdown.target
    Before=sysinit.target shutdown.target
    After=systemd-modules-load.service
    ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks

    [Service]
    ExecStart=/usr/libexec/config-memhotplug
    Type=oneshot
    TimeoutSec=0
    RemainAfterExit=yes

    [Install]
    WantedBy=sysinit.target

    === Example modification to the 40-redhat.rules [2] ===

    : diff --git a/40-redhat.rules b/40-redhat.rules-new
    : index 2c690e5..168fd03 100644
    : --- a/40-redhat.rules
    : +++ b/40-redhat.rules-new
    : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
    : # Memory hotadd request
    : SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
    : ACTION!="add", GOTO="memory_hotplug_end"
    : +# memory hotplug behavior configured
    : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
    : +
    : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
    :
    : ENV{.state}="online"

    ===

    [1] https://github.com/lnykryn/systemd-rhel/pull/281
    [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules

    This patch (of 8):

    The name is misleading and it's not really clear what is "kept". Let's
    just name it like the online_type name we expose to user space ("online").

    Add some documentation to the types.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Vitaly Kuznetsov
    Cc: Yumei Huang
    Cc: Igor Mammedov
    Cc: Eduardo Habkost
    Cc: Benjamin Herrenschmidt
    Cc: Haiyang Zhang
    Cc: K. Y. Srinivasan
    Cc: Michael Ellerman (powerpc)
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • No functional change.

    [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
    Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Wei Yang
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Note that check_pfn_span() gates the proper alignment and size of a
    hot-added memory region.

    Also move the code comments from inside section_deactivate() to above
    it. The comments apply to the whole function, and moving them makes the
    code cleaner.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Currently, to support subsection-aligned memory region adding for pmem,
    a subsection map is added to track which subsections are present.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
    subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
    classic sparse, it's meaningless. Even worse, it may confuse people
    checking code related to classic sparse.

    About classic sparse, which doesn't support subsection hotplug, Dan said
    it's more that the effort and maintenance burden outweighs the benefit.
    Besides, the current 64-bit arches all enable SPARSEMEM_VMEMMAP_ENABLE
    by default.

    For these reasons, there is no need to provide the subsection map and
    the relevant handling for classic sparse. Let's remove them.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Factor out the code which clears the subsection map of one memory region
    from section_deactivate() into clear_subsection_map().

    Also add a helper function, is_subsection_map_empty(), to check whether
    the current subsection map is empty.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.

    Memory sub-section hotplug was added to fix the issue that nvdimm could
    be mapped at a non-section-aligned starting address. A subsection map
    was added to struct mem_section_usage to implement it.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
    subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
    classic sparse, the subsection map is meaningless and confusing.

    About classic sparse, which doesn't support subsection hotplug, Dan said
    it's more that the effort and maintenance burden outweighs the benefit.
    Besides, the current 64-bit arches all enable SPARSEMEM_VMEMMAP_ENABLE
    by default.

    This patch (of 5):

    Factor out the code that fills the subsection map from section_activate()
    into fill_subsection_map(); this makes section_activate() cleaner and
    easier to follow.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Let's drop the basically unused section stuff and simplify. The logic now
    matches the logic in __remove_pages().

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Cc: Segher Boessenkool
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • In commit 52fb87c81f11 ("mm/memory_hotplug: cleanup __remove_pages()"), we
    cleaned up __remove_pages(), and introduced a shorter variant to calculate
    the number of pages to the next section boundary.

    Turns out we can make this calculation easier to read. We always want to
    have the number of pages (> 0) to the next section boundary, starting from
    the current pfn.
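
    The expression reads roughly like this (a sketch based on the
    description above):

        /* pages (> 0) from pfn to the next section boundary */
        cur_nr_pages = min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);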

    We'll clean up __remove_pages() in a follow-up patch and directly make use
    of this computation.

    Suggested-by: Segher Boessenkool
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • In commit 357b4da50a62 ("x86: respect memory size limiting via mem=
    parameter") a global variable, max_mem_size, was added to store the
    value parsed from 'mem=', which is then checked when a memory region is
    added. This truly stops those DIMMs from being added into system memory
    during boot time.

    However, it also limits later memory hotplug functionality. No DIMM can
    be hotplugged any more if its region lies beyond max_mem_size. We get
    errors like:

    [ 216.387164] acpi PNP0C80:02: add_memory failed
    [ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error
    [ 216.392187] acpi PNP0C80:02: Enumeration failure

    This causes problems in a known use case where 'mem=' is added to the
    hypervisor: the memory that lies beyond the 'mem=' boundary is assigned
    to KVM guests. After commit 357b4da50a62 was merged, memory can't be
    extended dynamically if system memory on the hypervisor is insufficient.

    So fix it by checking whether we are restricting memory addition during
    boot time; otherwise, skip the restriction.

    Also document this use case under the 'mem=' kernel parameter.
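
    A hypothetical sketch of the described logic (not the actual hunk):

        /* enforce "mem=" only while booting; allow hotplug beyond it later */
        if (system_state == SYSTEM_BOOTING && start + size > max_mem_size)
                return -E2BIG;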

    Fixes: 357b4da50a62 ("x86: respect memory size limiting via mem= parameter")
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Juergen Gross
    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: William Kucharski
    Cc: David Hildenbrand
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
    online/offline memory blocks with holes") we disallow offlining any
    memory with holes. As all boot memory is online and hotplugged memory
    cannot contain holes, we never online memory with holes.

    This check can therefore be dropped.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Anshuman Khandual
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: Pavel Tatashin
    Cc: "Rafael J. Wysocki"
    Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • pages_correctly_probed() is a leftover from ancient times. It dates back
    to commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove
    functions"), where PG_reserved checks were added as a safety net:

    /*
    * The probe routines leave the pages reserved, just
    * as the bootmem code does. Make sure they're still
    * that way.
    */

    The checks were refactored quite a bit over the years, especially in
    commit b77eab7079d9 ("mm/memory_hotplug: optimize probe routine"), where
    checks for present, valid, and online sections were added.

    Hotplugged memory is added via add_memory(), which will create the full
    memmap for the hotplugged memory, and mark all sections valid and present.

    Only full memory blocks are onlined/offlined, so we also cannot have an
    inconsistency in that regard (especially, memory blocks with some sections
    being online and some being offline).

    1. Boot memory always starts online. Since commit c5e79ef561b0
    ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
    holes") we disallow offlining any memory with holes. Therefore, we
    never online memory with holes. Present and validity checks are
    superfluous.

    2. Only complete memory blocks are onlined/offlined (and especially,
    the state - online or offline - is stored for whole memory blocks).
    Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
    manually calls offline_pages() and fiddles with memory block states.
    But it also only offlines complete memory blocks.

    3. To make any of these conditions trigger, something would have to be
    terribly messed up in the core. (e.g., online/offline only some
    sections of a memory block).

    4. Memory unplug properly makes sure that all sysfs attributes were
    removed (and therefore, that all threads left the sysfs handlers). We
    don't have to worry about zombie devices at this point.

    5. The valid_section_nr(section_nr) check is actually dead code, as it
    would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).

    No wonder we haven't seen any of these errors in a long time (or even
    ever, according to my search). Let's just get rid of them. Now, all
    checks that could hinder onlining and offlining are completely
    contained in online_pages()/offline_pages().

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm: drop superfluous section checks when onlining/offlining".

    Let's drop some superfluous section checks on the onlining/offlining path.

    This patch (of 3):

    Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
    online/offline memory blocks with holes") we have a generic check in
    offline_pages() that disallows offlining memory blocks with holes.

    Memory blocks with missing sections are just another variant of this
    type of block. We can stop checking (and especially storing) present
    sections. A proper error message is now printed explaining why
    offlining failed.

    section_count was initially introduced in commit 07681215975e ("Driver
    core: Add section count to memory_block struct") in order to detect when
    it is okay to remove a memory block. It was used in commit 26bbe7ef6d5c
    ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
    sections") to disallow offlining memory blocks with missing sections. As
    we refactored creation/removal of memory devices and have a proper check
    for holes in place, we can drop the section_count.

    This also removes a leftover comment regarding the mem_sysfs_mutex, which
    was removed in commit 848e19ad3c33 ("drivers/base/memory.c: drop the
    mem_sysfs_mutex").

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Add uffd tests for write protection.

    Instead of introducing new tests for it, let's simply squash uffd-wp
    tests into the existing uffd-missing test cases. The changes are:

    (1) Bouncing tests

    We do the write-protection in two ways during the bouncing test:

    - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
    we'll make sure that in each bounce process every single page will
    fault at least twice: once for MISSING, once for WP.

    - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory:
    to further torture the explicit page protection procedures of
    uffd-wp, we split each bounce procedure into two halves (in the
    background thread): the first half will be MISSING+WP for each
    page as explained above. After the first half, we write protect
    the faulted region in the background thread to make sure at least
    half of the pages will be write protected again, which is the first
    part of testing the new UFFDIO_WRITEPROTECT call. Then we continue
    with the 2nd half, which will contain both MISSING and WP faulting
    tests for the 2nd half and WP-only faults from the 1st half.

    (2) Event/Signal test

    Mostly the previous tests, but doing MISSING+WP for each page. For
    the sigbus-mode test we'll need to provide a standalone path to
    handle the write protection faults.

    For all tests, statistics are collected for uffd-wp pages as well.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Introduce the uffd_stats structure for statistics of the self test; at
    the same time, refactor the code to always pass in uffd_stats for both
    the read()- and poll()-based fault handling threads, instead of using
    two different ways to return the statistics. No functional change.

    With the new structure, it's very easy to introduce new statistics.
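
    A plausible shape for the structure (field names are assumptions):

        struct uffd_stats {
                int cpu;
                unsigned long missing_faults;
                /* further counters (e.g., for uffd-wp) slot in here */
        };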

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-19-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Only declare _UFFDIO_WRITEPROTECT if the user specified
    UFFDIO_REGISTER_MODE_WP and if all the checks passed. Then when the
    user registers regions with shmem/hugetlbfs we won't expose the new
    ioctl to them. Even with a completely anonymous memory range, we'll
    only expose the new WP ioctl bit if the register mode has MODE_WP.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Add documentation about the write protection support.

    [peterx@redhat.com: rewrite in rst format; fixups here and there]
    Signed-off-by: Martin Cracauer
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-17-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Martin Cracauer
     
  • It does not make sense to try to wake up any waiting thread when we're
    write-protecting a memory region. Only wake up when resolving a write
    protected page fault.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Now it's safe to enable write protection in the userfaultfd API.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Introduce the new uffd-wp APIs for userspace.

    Firstly, we'll allow UFFDIO_REGISTER with write protection tracking
    using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can
    co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case
    the userspace program can not only resolve missing page faults but also
    track page data changes along the way.

    Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page-level
    write protection tracking. Note that the memory region needs to be
    registered with UFFDIO_REGISTER_MODE_WP before that.
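
    From userspace, the flow sketches out as below (structures and flags
    per the new uffd ABI; error handling omitted):

        /* needs <linux/userfaultfd.h> and <sys/ioctl.h> */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,   /* 0 here resolves */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);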

    [peterx@redhat.com: write up the commit message]
    [peterx@redhat.com: remove useless block, write commit message, check against
    VM_MAYWRITE rather than VM_WRITE when register]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add an API to enable/disable write protection on a VMA range. Unlike
    mprotect, this doesn't split/merge VMAs.

    [peterx@redhat.com:
    - use the helper to find VMA;
    - return -ENOENT if not found to match mcopy case;
    - use the new MM_CP_UFFD_WP* flags for change_protection
    - check against mmap_changing for failures
    - replace find_dst_vma with vma_find_uffd]
    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Don't collapse the huge PMD if there are any userfault write-protected
    small PTEs. The problem is that the write protection is at small page
    granularity and there's no way to keep all this write protection
    information if the small pages are merged into a huge PMD.

    The same thing needs to be considered for swap entries and migration
    entries, so do the check for those as well, disregarding
    khugepaged_max_ptes_swap.
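
    In the collapse scan this boils down to a per-PTE bailout along these
    lines (a sketch; the result code name is an assumption):

        if (pte_uffd_wp(pteval)) {
                /* per-page wp state cannot survive in a huge PMD */
                result = SCAN_PTE_UFFD_WP;
                goto out_unmap;
        }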

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • For both swap and page migration, we use bit 2 of the entry to identify
    whether the entry is uffd write-protected. It plays a similar role to
    the existing soft dirty bit in swap entries, but only for keeping the
    uffd-wp tracking for a specific PTE/PMD.

    Something special here: when we want to recover the uffd-wp bit from a
    swap/migration entry to the PTE bit, we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared; otherwise, even with the
    _PAGE_UFFD_WP bit, we can't trap the write at all.

    In change_pte_range() we previously did nothing for uffd if the PTE was
    a swap entry. That can lead to a data mismatch if the page that we are
    going to write protect is swapped out when sending UFFDIO_WRITEPROTECT.
    This patch also applies/removes the uffd-wp bit even for swap entries.
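
    Schematically, carrying the bit across swap-out and back (helper names
    from this series; the surrounding code is illustrative):

        /* swap-out: preserve uffd-wp in the swap entry */
        if (pte_uffd_wp(oldpte))
                swp_pte = pte_swp_mkuffd_wp(swp_pte);

        /* swap-in: recover _PAGE_UFFD_WP and keep _PAGE_RW cleared */
        if (pte_swp_uffd_wp(orig_pte)) {
                pte = pte_mkuffd_wp(pte);
                pte = pte_wrprotect(pte);
        }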

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Adding these missing helpers for uffd-wp operations with pmd
    swap/migration entries.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • UFFD_EVENT_FORK support for uffd-wp should already be there, except that
    we should clear the uffd-wp bit if the uffd fork event is not enabled.
    Detect that, to avoid _PAGE_UFFD_WP being set even if the VMA is not
    being tracked by VM_UFFD_WP. Do this for both small PTEs and huge PMDs.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp and make sure the two new flags
    are exclusively used. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.

    Do this change for both PTEs and huge PMDs. Then we can start to
    identify which PTE/PMD is write protected for general reasons (e.g.,
    COW or soft dirty tracking), and which is for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    With _PAGE_UFFD_WP in place, a special case is when a page is both
    protected by the general COW logic and also userfault-wp. Here the
    userfault-wp will have higher priority and will be handled first. Only
    after the uffd-wp bit is cleared on the PTE/PMD will we continue to
    handle the general COW. These are the steps of what will happen with
    such a page:

    1. CPU accesses write protected shared page (so both protected by
    general COW and uffd-wp), blocked by uffd-wp first because in
    do_wp_page we'll handle uffd-wp first, so it has higher priority
    than general COW.

    2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
    to remove the uffd-wp bit upon the PTE/PMD. However here we
    still keep the write bit cleared. Notify the blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry; during the retry it'll notice the uffd-wp bit is no longer
    set, but the page is still write protected by general COW, so
    it'll go through the COW path in the fault handler, copy the page,
    apply the write bit where necessary, and retry again.

    4. The CPU will be able to access this page with write bit set.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • change_protection() was used by either the NUMA or the mprotect() code;
    there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protection() and all these helpers to
    replace those parameters. Then we can avoid passing multiple parameters
    multiple times along the way.
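
    Schematically (MM_CP_DIRTY_ACCT and MM_CP_PROT_NUMA are the flag names
    from this series; the call shown is illustrative):

        /* before: one bool per caller */
        change_protection(vma, start, end, newprot, dirty_accountable, prot_numa);

        /* after: a single flags word, e.g. for the NUMA caller */
        change_protection(vma, start, end, newprot, MM_CP_PROT_NUMA);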

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • This allows UFFDIO_COPY to map pages write-protected.
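
    Usage from the fault-resolving thread looks roughly like this (fields
    per the uffd ABI; error handling omitted):

        struct uffdio_copy copy = {
                .dst  = (unsigned long)dst,
                .src  = (unsigned long)src,
                .len  = len,
                .mode = UFFDIO_COPY_MODE_WP,    /* map the page write-protected */
        };
        ioctl(uffd, UFFDIO_COPY, &copy);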

    [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
    around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
    commit messages]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli