08 Apr, 2020

40 commits

  • The compressed cache for swap pages (zswap) currently needs one to three
    extra kernel command line parameters to work: it has to be enabled with
    the "zswap.enabled=1" parameter, and if one wants a compressor or pool
    allocator other than the default lzo / zbud combination, those choices
    also need to be specified as additional kernel command line parameters.

    Using a different compressor and allocator for zswap is actually pretty
    common, as guides often recommend using the lz4 / z3fold pair instead of
    the default one. In such a case it is also necessary to remember to
    enable the appropriate compression algorithm and pool allocator in the
    kernel config manually.
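
    For illustration, an lz4 / z3fold setup previously needed something like
    the following on the kernel command line (zswap.compressor and
    zswap.zpool are the zswap module parameter names; shown as a typical
    example, not text from the patch):

        zswap.enabled=1 zswap.compressor=lz4 zswap.zpool=z3fold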

    Let's avoid the need for adding these kernel command line parameters and
    automatically pull in the dependencies for the selected compressor
    algorithm and pool allocator by adding appropriate default switches to
    Kconfig.

    The default values for these options match what the code was using
    previously as its defaults.

    Signed-off-by: Maciej S. Szmigiero
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Wool
    Link: http://lkml.kernel.org/r/20200202000112.456103-1-mail@maciej.szmigiero.name
    Signed-off-by: Linus Torvalds

    Maciej S. Szmigiero
     
  • I recently built the RISC-V port with LLVM trunk, which has introduced a
    new warning when casting from a pointer to an enum of a smaller size.
    This patch simply casts to a long in the middle to stop the warning. I'd
    be surprised if this is the only such cast in the kernel, but it's the
    only one I saw.
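
    A minimal illustration of the pattern (hypothetical code, not taken from
    the patch):

        enum fixmap_idx { FIX_EARLY, FIX_LATE };        /* hypothetical enum */

        static enum fixmap_idx idx_from_ptr(void *p)
        {
                /* (enum fixmap_idx)p would warn with LLVM trunk, since the
                 * enum is smaller than a pointer; bounce through long. */
                return (enum fixmap_idx)(long)p;
        }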

    Signed-off-by: Palmer Dabbelt
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200227211741.83165-1-palmer@dabbelt.com
    Signed-off-by: Linus Torvalds

    Palmer Dabbelt
     
  • Yang Shi writes:

    Currently, when truncating a shmem file, if the range is partly in a THP
    (start or end is in the middle of THP), the pages actually will just get
    cleared rather than being freed, unless the range covers the whole THP.
    Even though all the subpages are truncated (randomly or sequentially), the
    THP may still be kept in page cache.

    This might be fine for some use cases which prefer preserving THP, but
    balloon inflation is handled at base page granularity. So when using
    shmem THP as the memory backend, QEMU inflation actually doesn't work as
    expected since it doesn't free memory. But the inflation use case really
    needs to get the memory freed. (Anonymous THP will also not get freed
    right away, but will be freed eventually when all subpages are unmapped;
    whereas shmem THP still stays in the page cache.)

    Split THP right away when doing partial hole punch, and if split fails
    just clear the page so that read of the punched area will return zeroes.

    Hugh Dickins adds:

    Our earlier "team of pages" huge tmpfs implementation worked in the way
    that Yang Shi proposes; and we have been using this patch to continue to
    split the huge page when hole-punched or truncated, since converting over
    to the compound page implementation. Although huge tmpfs gives out huge
    pages when available, if the user specifically asks to truncate or punch a
    hole (perhaps to free memory, perhaps to reduce the memcg charge), then
    the filesystem should do so as best it can, splitting the huge page.

    That is not always possible: any additional reference to the huge page
    prevents split_huge_page() from succeeding, so the result can be flaky.
    But in practice it works successfully enough that we've not seen any
    problem from that.

    Add shmem_punch_compound() to encapsulate the decision of when a split is
    needed, and doing the split if so. Using this simplifies the flow in
    shmem_undo_range(); and the first (trylock) pass does not need to do any
    page clearing on failure, because the second pass will either succeed or
    do that clearing. Following the example of zero_user_segment() when
    clearing a partial page, add flush_dcache_page() and set_page_dirty() when
    clearing a hole - though I'm not certain that either is needed.
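
    A rough sketch of the decision shmem_punch_compound() encapsulates (the
    shape and details below are assumptions, not the actual patch):

        static bool shmem_punch_compound(struct page *page, unsigned int start,
                                         unsigned int end)
        {
                if (!PageTransCompound(page))
                        return true;    /* small page: truncate as usual */
                if (split_huge_page(page) == 0)
                        return true;    /* split worked: retry per subpage */
                /* Split failed (extra references held): clear the punched
                 * bytes instead, so reads of the hole return zeroes. */
                zero_user_segment(page, start, end);
                set_page_dirty(page);
                return false;
        }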

    But: split_huge_page() would be sure to fail if shmem_undo_range()'s
    pagevec holds further references to the huge page. The easiest way to fix
    that is for find_get_entries() to return early, as soon as it has put one
    compound head or tail into the pagevec. At first this felt like a hack;
    but on examination, this convention better suits all its callers - or will
    do, if the slight one-page-per-pagevec slowdown in shmem_unlock_mapping()
    and shmem_seek_hole_data() is transformed into a 512-page-per-pagevec
    speedup by checking for compound pages there.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Cc: Yang Shi
    Cc: Alexander Duyck
    Cc: "Michael S. Tsirkin"
    Cc: David Hildenbrand
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Cc: Andrea Arcangeli
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.2002261959020.10801@eggly.anvils
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Previously, 0 was assigned to the variable 'error', but the variable was
    never read before being reassigned later, so the assignment can be
    removed.
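
    The pattern in miniature (illustrative only; do_setup() is a stand-in):

        int error = 0;          /* dead store: never read ...             */
        error = do_setup();     /* ...before this reassignment, so the
                                   "= 0" can simply be dropped            */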

    Signed-off-by: Mateusz Nosek
    Signed-off-by: Andrew Morton
    Reviewed-by: Matthew Wilcox (Oracle)
    Acked-by: Pankaj Gupta
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200301152832.24595-1-mateusznosek0@gmail.com
    Signed-off-by: Linus Torvalds

    Mateusz Nosek
     
  • Variables declared in a switch statement before any case statements cannot
    be automatically initialized with compiler instrumentation (as they are
    not part of any execution flow). With GCC's proposed automatic stack
    variable initialization feature, this triggers a warning (and they don't
    get initialized). Clang's automatic stack variable initialization (via
    CONFIG_INIT_STACK_ALL=y) doesn't throw a warning, but it also doesn't
    initialize such variables[1]. Note that these warnings (or silent
    skipping) happen before the dead-store elimination optimization phase, so
    even when the automatic initializations are later elided in favor of
    direct initializations, the warnings remain.

    To avoid these problems, move such variables into the "case" where they're
    used or lift them up into the main function body.

    mm/shmem.c: In function `shmem_getpage_gfp':
    mm/shmem.c:1816:10: warning: statement will never be executed [-Wswitch-unreachable]
     1816 |          loff_t i_size;
          |          ^~~~~~

    [1] https://bugs.llvm.org/show_bug.cgi?id=44916
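
    A minimal illustration (hypothetical code, not the shmem hunk):

        int pick(int sel)
        {
                switch (sel) {
                        int tmp;        /* never "executed": auto-init can't
                                           reach it; with the feature enabled
                                           GCC warns -Wswitch-unreachable */
                case 0:
                        tmp = 42;
                        return tmp;
                default:
                        return -1;
                }
        }

        /* The fix: lift the declaration into the function body. */
        int pick_fixed(int sel)
        {
                int tmp;

                switch (sel) {
                case 0:
                        tmp = 42;
                        return tmp;
                default:
                        return -1;
                }
        }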

    Signed-off-by: Kees Cook
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Cc: Hugh Dickins
    Cc: Alexander Potapenko
    Link: http://lkml.kernel.org/r/20200220062312.69165-1-keescook@chromium.org
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • Use __pfn_to_section() API instead of open-coding for better code
    readability.
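
    Schematically (behavior-preserving, since the helper wraps exactly the
    open-coded form):

        /* before: open-coded */
        struct mem_section *ms = __nr_to_section(pfn_to_section_nr(pfn));

        /* after */
        struct mem_section *ms = __pfn_to_section(pfn);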

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Acked-by: David Hildenbrand
    Link: http://lkml.kernel.org/r/1584345134-16671-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     
  • For now, distributions implement advanced udev rules to essentially
    - Don't online any hotplugged memory (s390x)
    - Online all memory to ZONE_NORMAL (e.g., most virt environments like
      Hyper-V)
    - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
      care of (e.g., bare metal, special virt environments)

    In summary: All memory is usually onlined the same way; however, the
    kernel always has to ask user space to come up with the same answer.
    E.g., Hyper-V always waits for a memory block to get onlined before
    continuing, otherwise it might end up adding memory faster than
    onlining it, which can result in strange OOM situations. This waiting
    slows down the addition of larger amounts of memory.

    Let's allow specifying a default online_type, not just "online" and
    "offline". This allows distributions to configure the default
    online_type when booting up and be done with it.

    We can now specify "offline", "online", "online_movable" and
    "online_kernel" via
    - "memhp_default_state=" on the kernel cmdline
    - /sys/devices/system/memory/auto_online_blocks
    just like we are able to specify for a single memory block via
    /sys/devices/system/memory/memoryX/state

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • ... and rename it to memhp_default_online_type. This is a preparation
    for more detailed default online behavior.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • All in-tree users except the mm-core are gone. Let's drop the export.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Greg Kroah-Hartman
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-7-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • We get the MEM_ONLINE notifier call if memory is added right from the
    kernel via add_memory() or later from user space.

    Let's get rid of the "ha_waiting" flag - the completion has a built-in
    mechanism (->done) for that. Initialize the completion only once and
    reinitialize it before adding memory. Unconditionally call complete()
    and wait_for_completion_timeout().

    If there are no waiters, complete() will only increment ->done - which
    will be reset by reinit_completion(). If complete() has already been
    called, wait_for_completion_timeout() will not wait.

    There is still the chance for a small race between concurrent
    reinit_completion() and complete(). If complete() wins, we would not wait
    - which is tolerable (and the race exists in current code as well).

    Note: We only wait for "some" memory to get onlined, which seems to be
    good enough for now.
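
    The resulting pattern looks roughly like this (a sketch of the scheme
    described above, not the hv_balloon code itself; names and the timeout
    are assumptions):

        static DECLARE_COMPLETION(ol_waitevent);        /* initialized once */

        /* before adding a memory block */
        reinit_completion(&ol_waitevent);
        add_memory(nid, start, size);
        wait_for_completion_timeout(&ol_waitevent, 5 * HZ);

        /* in the memory notifier, on MEM_ONLINE */
        complete(&ol_waitevent);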

    [akpm@linux-foundation.org: register_memory_notifier() after init_completion(), per David]
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Vitaly Kuznetsov
    Reviewed-by: Baoquan He
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Greg Kroah-Hartman
    Cc: Igor Mammedov
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's always try to online the re-added memory blocks. In case
    add_memory() already onlined the added memory blocks, the first
    device_online() call will fail and stop processing the remaining memory
    blocks.

    This avoids manually having to check memhp_auto_online.

    Note: PPC always onlines all hotplugged memory directly from the kernel as
    well - something that is handled by user space on other architectures.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Let's use a simple array which we can reuse soon. While at it, move the
    string->mmop conversion out of the device hotplug lock.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Historically, we used the value -1. Just treat 0 as the special case
    now. Clarify a comment (which was wrong: when we came via
    device_online() the first time, the online_type would have been 0 /
    MEM_ONLINE). The default is now always MMOP_OFFLINE. This removes the
    last user of the manual "-1", which didn't use the enum value.

    This is a preparation to use the online_type as an array index.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.

    Distributions nowadays use udev rules ([1] [2]) to specify if and how to
    online hotplugged memory. The rules seem to get more complex with many
    special cases. Due to the various special cases,
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
    is handled via udev rules.

    Every time we hotplug memory, the udev rule will come to the same
    conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
    memory in separate memory blocks and wait for memory to get onlined by
    user space before continuing to add more memory blocks (to not add memory
    faster than it is getting onlined). This of course slows down the whole
    memory hotplug process.

    To make the job of distributions easier and to avoid udev rules that get
    more and more complicated, let's extend the mechanism provided by
    - /sys/devices/system/memory/auto_online_blocks
    - "memhp_default_state=" on the kernel cmdline
    so that "online_movable" and "online_kernel" can also be specified.

    === Example /usr/libexec/config-memhotplug ===

    #!/bin/bash

    VIRT=`systemd-detect-virt --vm`
    ARCH=`uname -p`

    sense_virtio_mem() {
            if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
                    DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
                    if [ $DEVICES != "0" ]; then
                            return 0
                    fi
            fi
            return 1
    }

    if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
            echo "Memory hotplug configuration support missing in the kernel"
            exit 1
    fi

    if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
            echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
            exit 1
    fi

    if [ $VIRT == "microsoft" ]; then
            echo "Detected Hyper-V on $ARCH"
            # Hyper-V wants all memory in ZONE_NORMAL
            ONLINE_TYPE="online_kernel"
    elif sense_virtio_mem; then
            echo "Detected virtio-mem on $ARCH"
            # virtio-mem wants all memory in ZONE_NORMAL
            ONLINE_TYPE="online_kernel"
    elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
            echo "Detected $ARCH"
            # standby memory should not be onlined automatically
            ONLINE_TYPE="offline"
    elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
            echo "Detected $ARCH"
            # PPC64 onlines all hotplugged memory right from the kernel
            ONLINE_TYPE="offline"
    elif [ $VIRT == "none" ]; then
            echo "Detected bare-metal on $ARCH"
            # Bare metal users expect hotplugged memory to be unpluggable. We assume
            # that ZONE imbalances on such enterprise servers cannot happen and are
            # properly documented
            ONLINE_TYPE="online_movable"
    else
            # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
            # imbalances won't happen
            echo "Detected $VIRT on $ARCH"
            # Usually, ballooning is used in virtual environments, so memory should go to
            # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
            ONLINE_TYPE="online"
    fi

    echo "Selected online_type: $ONLINE_TYPE"

    # Configure what to do with memory that will be hotplugged in the future
    echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
    if [ $? != "0" ]; then
            echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
            # A backup udev rule should handle old kernels if necessary
            exit 1
    fi

    # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
    if [ $ONLINE_TYPE != "offline" ]; then
            for MEMORY in /sys/devices/system/memory/memory*; do
                    STATE=`cat $MEMORY/state`
                    if [ $STATE == "offline" ]; then
                            echo $ONLINE_TYPE > $MEMORY/state
                    fi
            done
    fi

    === Example /usr/lib/systemd/system/config-memhotplug.service ===

    [Unit]
    Description=Configure memory hotplug behavior
    DefaultDependencies=no
    Conflicts=shutdown.target
    Before=sysinit.target shutdown.target
    After=systemd-modules-load.service
    ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks

    [Service]
    ExecStart=/usr/libexec/config-memhotplug
    Type=oneshot
    TimeoutSec=0
    RemainAfterExit=yes

    [Install]
    WantedBy=sysinit.target

    === Example modification to the 40-redhat.rules [2] ===

    : diff --git a/40-redhat.rules b/40-redhat.rules-new
    : index 2c690e5..168fd03 100644
    : --- a/40-redhat.rules
    : +++ b/40-redhat.rules-new
    : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
    : # Memory hotadd request
    : SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
    : ACTION!="add", GOTO="memory_hotplug_end"
    : +# memory hotplug behavior configured
    : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
    : +
    : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
    :
    : ENV{.state}="online"

    ===

    [1] https://github.com/lnykryn/systemd-rhel/pull/281
    [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules

    This patch (of 8):

    The name is misleading and it's not really clear what is "kept". Let's
    just name it like the online_type name we expose to user space ("online").

    Add some documentation to the types.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Vitaly Kuznetsov
    Cc: Yumei Huang
    Cc: Igor Mammedov
    Cc: Eduardo Habkost
    Cc: Benjamin Herrenschmidt
    Cc: Haiyang Zhang
    Cc: K. Y. Srinivasan
    Cc: Michael Ellerman (powerpc)
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • No functional change.

    [bhe@redhat.com: move functions into CONFIG_MEMORY_HOTPLUG ifdeffery scope]
    Link: http://lkml.kernel.org/r/20200316045804.GC3486@MiWiFi-R3L-srv
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: David Hildenbrand
    Cc: Wei Yang
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Stephen Rothwell
    Link: http://lkml.kernel.org/r/20200312124414.439-6-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Note that check_pfn_span() gates the proper alignment and size of a
    hot-added memory region.

    Also move the code comments from inside section_deactivate() to above
    it. The comments apply to the whole function, and moving them makes the
    code cleaner.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michal Hocko
    Cc: Dan Williams
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-5-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Currently, to support subsection-aligned memory region adding for pmem,
    a subsection map is added to track which subsections are present.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
    subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
    classic sparse, it's meaningless. Even worse, it may confuse people
    checking code related to classic sparse.

    About classic sparse, which doesn't support subsection hotplug, Dan said
    it's more that the effort and maintenance burden outweighs the benefit.
    Besides, the current 64-bit arches all enable SPARSEMEM_VMEMMAP_ENABLE
    by default.

    For these reasons, there is no need to provide the subsection map and
    the relevant handling for classic sparse. Let's remove them.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Factor out the code which clears the subsection map of one memory region
    from section_deactivate() into clear_subsection_map().

    Also add a helper function, is_subsection_map_empty(), to check whether
    the current subsection map is empty.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-3-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Patch series "mm/hotplug: Only use subsection map for VMEMMAP", v4.

    Memory sub-section hotplug was added to fix the issue that nvdimm could
    be mapped at a non-section-aligned starting address. A subsection map
    was added to struct mem_section_usage to implement it.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
    subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
    classic sparse, the subsection map is meaningless and confusing.

    About classic sparse, which doesn't support subsection hotplug, Dan said
    it's more that the effort and maintenance burden outweighs the benefit.
    Besides, the current 64-bit arches all enable SPARSEMEM_VMEMMAP_ENABLE
    by default.

    This patch (of 5):

    Factor out the code that fills the subsection map from section_activate()
    into fill_subsection_map(); this makes section_activate() cleaner and
    easier to follow.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: David Hildenbrand
    Acked-by: Pankaj Gupta
    Cc: Dan Williams
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200312124414.439-2-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Let's drop the basically unused section stuff and simplify. The logic now
    matches the logic in __remove_pages().

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Cc: Segher Boessenkool
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200228095819.10750-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • In commit 52fb87c81f11 ("mm/memory_hotplug: cleanup __remove_pages()"), we
    cleaned up __remove_pages(), and introduced a shorter variant to calculate
    the number of pages to the next section boundary.

    Turns out we can make this calculation easier to read. We always want to
    have the number of pages (> 0) to the next section boundary, starting from
    the current pfn.
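
    The expression reads roughly like this (a sketch based on the
    description above):

        /* pages (> 0) from pfn to the next section boundary */
        cur_nr_pages = min(end_pfn - pfn, SECTION_ALIGN_UP(pfn + 1) - pfn);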

    We'll clean up __remove_pages() in a follow-up patch and directly make use
    of this computation.

    Suggested-by: Segher Boessenkool
    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Reviewed-by: Wei Yang
    Cc: Oscar Salvador
    Cc: Michal Hocko
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200228095819.10750-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • In commit 357b4da50a62 ("x86: respect memory size limiting via mem=
    parameter") a global variable, max_mem_size, was added to store the
    value parsed from 'mem=', which is then checked when a memory region is
    added. This truly stops those DIMMs from being added into system memory
    during boot time.

    However, it also limits later memory hotplug functionality. No DIMM can
    be hotplugged any more if its region lies beyond max_mem_size. We get
    errors like:

    [ 216.387164] acpi PNP0C80:02: add_memory failed
    [ 216.389301] acpi PNP0C80:02: acpi_memory_enable_device() error
    [ 216.392187] acpi PNP0C80:02: Enumeration failure

    This causes problems in a known use case where 'mem=' is added to the
    hypervisor: the memory that lies beyond the 'mem=' boundary is assigned
    to KVM guests. After commit 357b4da50a62 was merged, memory can't be
    extended dynamically if system memory on the hypervisor is insufficient.

    So fix it by checking whether we are restricting memory addition during
    boot time; otherwise, skip the restriction.

    Also document this use case under the 'mem=' kernel parameter.
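
    A hypothetical sketch of the described logic (not the actual hunk):

        /* enforce "mem=" only while booting; allow hotplug beyond it later */
        if (system_state == SYSTEM_BOOTING && start + size > max_mem_size)
                return -E2BIG;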

    Fixes: 357b4da50a62 ("x86: respect memory size limiting via mem= parameter")
    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: Juergen Gross
    Acked-by: Michal Hocko
    Cc: Ingo Molnar
    Cc: William Kucharski
    Cc: David Hildenbrand
    Cc: Balbir Singh
    Link: http://lkml.kernel.org/r/20200204050643.20925-1-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
    online/offline memory blocks with holes") we disallow offlining any
    memory with holes. As all boot memory is online and hotplugged memory
    cannot contain holes, we never online memory with holes.

    This check can therefore be dropped.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Anshuman Khandual
    Cc: Dan Williams
    Cc: Greg Kroah-Hartman
    Cc: Pavel Tatashin
    Cc: "Rafael J. Wysocki"
    Link: http://lkml.kernel.org/r/20200127110424.5757-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • pages_correctly_probed() is a leftover from ancient times. It dates back
    to commit 3947be1969a9 ("[PATCH] memory hotplug: sysfs and add/remove
    functions"), where PG_reserved checks were added as a safety net:

    /*
    * The probe routines leave the pages reserved, just
    * as the bootmem code does. Make sure they're still
    * that way.
    */

    The checks were refactored quite a bit over the years, especially in
    commit b77eab7079d9 ("mm/memory_hotplug: optimize probe routine"), where
    checks for present, valid, and online sections were added.

    Hotplugged memory is added via add_memory(), which will create the full
    memmap for the hotplugged memory, and mark all sections valid and present.

    Only full memory blocks are onlined/offlined, so we also cannot have an
    inconsistency in that regard (especially, memory blocks with some sections
    being online and some being offline).

    1. Boot memory always starts online. Since commit c5e79ef561b0
    ("mm/memory_hotplug.c: don't allow to online/offline memory blocks with
    holes") we disallow offlining any memory with holes. Therefore, we
    never online memory with holes. Present and validity checks are
    superfluous.

    2. Only complete memory blocks are onlined/offlined (and especially,
    the state - online or offline - is stored for whole memory blocks).
    Besides the core, only arch/powerpc/platforms/powernv/memtrace.c
    manually calls offline_pages() and fiddles with memory block states.
    But it also only offlines complete memory blocks.

    3. To make any of these conditions trigger, something would have to be
    terribly messed up in the core. (e.g., online/offline only some
    sections of a memory block).

    4. Memory unplug properly makes sure that all sysfs attributes were
    removed (and therefore, that all threads left the sysfs handlers). We
    don't have to worry about zombie devices at this point.

    5. The valid_section_nr(section_nr) check is actually dead code, as it
    would never have been reached due to the WARN_ON_ONCE(!pfn_valid(pfn)).

    No wonder we haven't seen any of these errors in a long time (or even
    ever, according to my search). Let's just get rid of them. Now, all
    checks that could hinder onlining and offlining are completely
    contained in online_pages()/offline_pages().

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Andrew Morton
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200127110424.5757-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm: drop superfluous section checks when onlining/offlining".

    Let's drop some superfluous section checks on the onlining/offlining path.

    This patch (of 3):

    Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
    online/offline memory blocks with holes") we have a generic check in
    offline_pages() that disallows offlining memory blocks with holes.

    Memory blocks with missing sections are just another variant of this
    type of block. We can stop checking (and especially storing) present
    sections. A proper error message is now printed explaining why
    offlining failed.

    section_count was initially introduced in commit 07681215975e ("Driver
    core: Add section count to memory_block struct") in order to detect when
    it is okay to remove a memory block. It was used in commit 26bbe7ef6d5c
    ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
    sections") to disallow offlining memory blocks with missing sections. As
    we refactored creation/removal of memory devices and have a proper check
    for holes in place, we can drop the section_count.

    This also removes a leftover comment regarding the mem_sysfs_mutex, which
    was removed in commit 848e19ad3c33 ("drivers/base/memory.c: drop the
    mem_sysfs_mutex").

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Add uffd tests for write protection.

    Instead of introducing new tests for it, let's simply squash uffd-wp
    tests into the existing uffd-missing test cases. The changes are:

    (1) Bouncing tests

    We do the write-protection in two ways during the bouncing test:

    - By using UFFDIO_COPY_MODE_WP when resolving MISSING pages: then
    we'll make sure that in each bounce process every single page will
    fault at least twice: once for MISSING, once for WP.

    - By directly calling UFFDIO_WRITEPROTECT on already-faulted memory:
    to further torture the explicit page protection procedures of
    uffd-wp, we split each bounce procedure into two halves (in the
    background thread): the first half will be MISSING+WP for each
    page as explained above. After the first half, we write protect
    the faulted region in the background thread to make sure at least
    half of the pages will be write protected again, which is the first
    part of testing the new UFFDIO_WRITEPROTECT call. Then we continue
    with the 2nd half, which will contain both MISSING and WP faulting
    tests for the 2nd half and WP-only faults from the 1st half.

    (2) Event/Signal test

    Mostly the previous tests, but doing MISSING+WP for each page. For
    the sigbus-mode test we'll need to provide a standalone path to
    handle the write protection faults.

    For all tests, statistics are collected for uffd-wp pages as well.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-20-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Introduce the uffd_stats structure for statistics of the self test; at
    the same time, refactor the code to always pass in uffd_stats for both
    the read()- and poll()-based fault handling threads, instead of using
    two different ways to return the statistics. No functional change.

    With the new structure, it's very easy to introduce new statistics.
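
    A plausible shape for the structure (field names are assumptions):

        struct uffd_stats {
                int cpu;
                unsigned long missing_faults;
                /* further counters (e.g., for uffd-wp) slot in here */
        };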

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-19-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Only declare _UFFDIO_WRITEPROTECT if the user specified
    UFFDIO_REGISTER_MODE_WP and if all the checks passed. Then when the
    user registers regions with shmem/hugetlbfs we won't expose the new
    ioctl to them. Even with a completely anonymous memory range, we'll
    only expose the new WP ioctl bit if the register mode has MODE_WP.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-18-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Add documentation about the write protection support.

    [peterx@redhat.com: rewrite in rst format; fixups here and there]
    Signed-off-by: Martin Cracauer
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-17-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Martin Cracauer
     
  • It does not make sense to try to wake up any waiting thread when we're
    write-protecting a memory region. Only wake up when resolving a write
    protected page fault.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-16-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Now it's safe to enable write protection in the userfaultfd API.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Introduce the new uffd-wp APIs for userspace.

    Firstly, we'll allow UFFDIO_REGISTER with write protection tracking
    using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag can
    co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which case
    the userspace program can not only resolve missing page faults but also
    track page data changes along the way.

    Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page-level
    write protection tracking. Note that the memory region needs to be
    registered with UFFDIO_REGISTER_MODE_WP before that.
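
    From userspace, the flow sketches out as below (structures and flags
    per the new uffd ABI; error handling omitted):

        /* needs <linux/userfaultfd.h> and <sys/ioctl.h> */
        struct uffdio_register reg = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_REGISTER_MODE_MISSING | UFFDIO_REGISTER_MODE_WP,
        };
        ioctl(uffd, UFFDIO_REGISTER, &reg);

        struct uffdio_writeprotect wp = {
                .range = { .start = (unsigned long)addr, .len = len },
                .mode  = UFFDIO_WRITEPROTECT_MODE_WP,   /* 0 here resolves */
        };
        ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);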

    [peterx@redhat.com: write up the commit message]
    [peterx@redhat.com: remove useless block, write commit message, check against
    VM_MAYWRITE rather than VM_WRITE when register]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add an API to enable/disable write protection on a VMA range. Unlike
    mprotect, this doesn't split/merge VMAs.

    [peterx@redhat.com:
    - use the helper to find VMA;
    - return -ENOENT if not found to match mcopy case;
    - use the new MM_CP_UFFD_WP* flags for change_protection
    - check against mmap_changing for failures
    - replace find_dst_vma with vma_find_uffd]
    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Don't collapse the huge PMD if there are any userfault write-protected
    small PTEs. The problem is that the write protection is at small page
    granularity and there's no way to keep all this write protection
    information if the small pages are merged into a huge PMD.

    The same thing needs to be considered for swap entries and migration
    entries, so do the check for those as well, disregarding
    khugepaged_max_ptes_swap.
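
    In the collapse scan this boils down to a per-PTE bailout along these
    lines (a sketch; the result code name is an assumption):

        if (pte_uffd_wp(pteval)) {
                /* per-page wp state cannot survive in a huge PMD */
                result = SCAN_PTE_UFFD_WP;
                goto out_unmap;
        }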

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • For both swap and page migration, we use bit 2 of the entry to identify
    whether the entry is uffd write-protected. It plays a similar role to
    the existing soft dirty bit in swap entries, but only for keeping the
    uffd-wp tracking for a specific PTE/PMD.

    Something special here: when we want to recover the uffd-wp bit from a
    swap/migration entry to the PTE bit, we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared; otherwise, even with the
    _PAGE_UFFD_WP bit, we can't trap the write at all.

    In change_pte_range() we previously did nothing for uffd if the PTE was
    a swap entry. That can lead to a data mismatch if the page that we are
    going to write protect is swapped out when sending UFFDIO_WRITEPROTECT.
    This patch also applies/removes the uffd-wp bit even for swap entries.
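
    Schematically, carrying the bit across swap-out and back (helper names
    from this series; the surrounding code is illustrative):

        /* swap-out: preserve uffd-wp in the swap entry */
        if (pte_uffd_wp(oldpte))
                swp_pte = pte_swp_mkuffd_wp(swp_pte);

        /* swap-in: recover _PAGE_UFFD_WP and keep _PAGE_RW cleared */
        if (pte_swp_uffd_wp(orig_pte)) {
                pte = pte_mkuffd_wp(pte);
                pte = pte_wrprotect(pte);
        }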

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Adding these missing helpers for uffd-wp operations with pmd
    swap/migration entries.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • UFFD_EVENT_FORK support for uffd-wp should already be there, except that
    we should clear the uffd-wp bit if the uffd fork event is not enabled.
    Detect that, to avoid _PAGE_UFFD_WP being set even if the VMA is not
    being tracked by VM_UFFD_WP. Do this for both small PTEs and huge PMDs.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-9-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp and make sure the two new flags
    are exclusively used. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.

    Do this change for both PTEs and huge PMDs. Then we can start to
    identify which PTE/PMD is write protected for general reasons (e.g.,
    COW or soft dirty tracking), and which is for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    With _PAGE_UFFD_WP in place, a special case is when a page is both
    protected by the general COW logic and also userfault-wp. Here the
    userfault-wp will have higher priority and will be handled first. Only
    after the uffd-wp bit is cleared on the PTE/PMD will we continue to
    handle the general COW. These are the steps of what will happen with
    such a page:

    1. CPU accesses write protected shared page (so both protected by
    general COW and uffd-wp), blocked by uffd-wp first because in
    do_wp_page we'll handle uffd-wp first, so it has higher priority
    than general COW.

    2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
    to remove the uffd-wp bit upon the PTE/PMD. However here we
    still keep the write bit cleared. Notify the blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry; during the retry it'll notice the uffd-wp bit is no longer
    set, but the page is still write protected by general COW, so
    it'll go through the COW path in the fault handler, copy the page,
    apply the write bit where necessary, and retry again.

    4. The CPU will be able to access this page with write bit set.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • change_protection() was used by either the NUMA or the mprotect() code;
    there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protection() and all these helpers to
    replace those parameters. Then we can avoid passing multiple parameters
    multiple times along the way.
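
    Schematically (MM_CP_DIRTY_ACCT and MM_CP_PROT_NUMA are the flag names
    from this series; the call shown is illustrative):

        /* before: one bool per caller */
        change_protection(vma, start, end, newprot, dirty_accountable, prot_numa);

        /* after: a single flags word, e.g. for the NUMA caller */
        change_protection(vma, start, end, newprot, MM_CP_PROT_NUMA);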

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • This allows UFFDIO_COPY to map pages write-protected.
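
    Usage from the fault-resolving thread looks roughly like this (fields
    per the uffd ABI; error handling omitted):

        struct uffdio_copy copy = {
                .dst  = (unsigned long)dst,
                .src  = (unsigned long)src,
                .len  = len,
                .mode = UFFDIO_COPY_MODE_WP,    /* map the page write-protected */
        };
        ioctl(uffd, UFFDIO_COPY, &copy);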

    [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
    around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
    commit messages]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli