08 Apr, 2020

40 commits

  • With CONFIG_CC_OPTIMIZE_FOR_SIZE, objtool reports:

    drivers/gpu/drm/i915/gem/i915_gem_execbuffer.o: warning: objtool: i915_gem_execbuffer2_ioctl()+0x5b7: call to gen8_canonical_addr() with UACCESS enabled

    This means i915_gem_execbuffer2_ioctl() is calling gen8_canonical_addr()
    from the user_access_begin/end critical region (i.e., with SMAP disabled).

    While it's probably harmless in this case, in general we like to avoid
    extra function calls in SMAP-disabled regions because it can open up
    inadvertent security holes.

    Fix the warning by changing the sign extension helpers to __always_inline.
    This convinces GCC to inline gen8_canonical_addr().

    The sign extension functions are trivial anyway, so it makes sense to
    always inline them. With my optimize-for-size test config, this actually
    shrinks the text size of i915_gem_execbuffer.o by 45 bytes, with no change
    for vmlinux.
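
    For reference, the shape of the fix in include/linux/bitops.h (a sketch of
    the 32-bit helper; the 64-bit variant is analogous):

    static __always_inline __s32 sign_extend32(__u32 value, int index)
    {
            __u8 shift = 31 - index;
            return (__s32)(value << shift) >> shift;
    }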

    Reported-by: Randy Dunlap
    Signed-off-by: Josh Poimboeuf
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Al Viro
    Cc: Chris Wilson
    Link: http://lkml.kernel.org/r/740179324b2b18b750b16295c48357f00b5fa9ed.1582982020.git.jpoimboe@redhat.com
    Signed-off-by: Linus Torvalds

    Josh Poimboeuf
     
  • compiletime_assert() uses __LINE__ to create a unique function name. This
    means that if you have more than one BUILD_BUG_ON() in the same source
    line (which can happen if they appear e.g. in a macro), then the error
    message from the compiler might output the wrong condition.

    For this source file:

    #include <linux/build_bug.h>

    #define macro() \
            BUILD_BUG_ON(1); \
            BUILD_BUG_ON(0);

    void foo()
    {
            macro();
    }

    gcc would output:

    ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_9' declared with attribute error: BUILD_BUG_ON failed: 0
    _compiletime_assert(condition, msg, __compiletime_assert_, __LINE__)

    However, it was not the BUILD_BUG_ON(0) that failed, so it should say 1
    instead of 0. With this patch, we use __COUNTER__ instead of __LINE__, so
    each BUILD_BUG_ON() gets a different function name and the correct
    condition is printed:

    ./include/linux/compiler.h:350:38: error: call to `__compiletime_assert_0' declared with attribute error: BUILD_BUG_ON failed: 1
    _compiletime_assert(condition, msg, __compiletime_assert_, __COUNTER__)
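
    The mechanics, as a simplified sketch of include/linux/compiler.h: the
    prefix/suffix pair is pasted together to name the error function, so a
    unique suffix yields a unique, correctly attributed function per assertion:

    #define __compiletime_assert(condition, msg, prefix, suffix)          \
            do {                                                           \
                    extern void prefix ## suffix(void)                     \
                            __compiletime_error(msg);                      \
                    if (!(condition))                                      \
                            prefix ## suffix();                            \
            } while (0)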

    Signed-off-by: Vegard Nossum
    Signed-off-by: Andrew Morton
    Reviewed-by: Masahiro Yamada
    Reviewed-by: Daniel Santos
    Cc: Rasmus Villemoes
    Cc: Ian Abbott
    Cc: Joe Perches
    Link: http://lkml.kernel.org/r/20200331112637.25047-1-vegard.nossum@oracle.com
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     
  • Commit ac7c3e4ff401 ("compiler: enable CONFIG_OPTIMIZE_INLINING
    forcibly") made this an always-on option. We released v5.4 and v5.5
    including that commit.

    Remove the CONFIG option and clean up the code now.

    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Reviewed-by: Miguel Ojeda
    Reviewed-by: Nathan Chancellor
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: David Miller
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20200220110807.32534-2-masahiroy@kernel.org
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • The process maps file was the only user of the seq_file version field
    (introduced back in 2005). Now that it uses ppos instead, we can remove it.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200317193201.9924-4-adobriyan@gmail.com
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • Now that "struct proc_ops" exist we can start putting there stuff which
    could not fly with VFS "struct file_operations"...

    Most of fs/proc/inode.c file is dedicated to make open/read/.../close
    reliable in the event of disappearing /proc entries which usually happens
    if module is getting removed. Files like /proc/cpuinfo which never
    disappear simply do not need such protection.

    Save 2 atomic ops, 1 allocation, 1 free per open/read/close sequence for such
    "permanent" files.

    Enable "permanent" flag for

    /proc/cpuinfo
    /proc/kmsg
    /proc/modules
    /proc/slabinfo
    /proc/stat
    /proc/sysvipc/*
    /proc/swaps

    More will come once I figure out a foolproof way to prevent module
    authors from marking their stuff "permanent" for performance reasons
    when it is not.
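
    As an illustration, opting in is a single flag in the entry's proc_ops (a
    sketch based on the /proc/cpuinfo conversion in this patch):

    static const struct proc_ops cpuinfo_proc_ops = {
            .proc_flags     = PROC_ENTRY_PERMANENT,
            .proc_open      = cpuinfo_open,
            .proc_read      = seq_read,
            .proc_lseek     = seq_lseek,
            .proc_release   = seq_release,
    };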

    This should help with scalability: benchmark is "read /proc/cpuinfo R times
    by N threads scattered over the system".

       N     R   t, s (before)   t, s (after)
    ---------------------------------------------------
      64   4096      1.582458       1.530502    -3.2%
     256   4096      6.371926       6.125168    -3.9%
    1024   4096     25.64888       24.47528     -4.6%

    Benchmark source:

    #include <chrono>
    #include <iostream>
    #include <thread>
    #include <vector>

    #include <fcntl.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <unistd.h>

    const int NR_CPUS = sysconf(_SC_NPROCESSORS_ONLN);
    int N;
    const char *filename;
    int R;

    int xxx = 0;

    int glue(int n)
    {
            cpu_set_t m;
            CPU_ZERO(&m);
            CPU_SET(n, &m);
            return sched_setaffinity(0, sizeof(cpu_set_t), &m);
    }

    void f(int n)
    {
            glue(n % NR_CPUS);

            while (*(volatile int *)&xxx == 0) {
            }

            for (int i = 0; i < R; i++) {
                    int fd = open(filename, O_RDONLY);
                    char buf[4096];
                    ssize_t rv = read(fd, buf, sizeof(buf));
                    asm volatile ("" :: "g" (rv));
                    close(fd);
            }
    }

    int main(int argc, char *argv[])
    {
            if (argc < 4) {
                    std::cerr << "usage: " << argv[0] << ' ' << "N /proc/filename R\n";
                    return 1;
            }

            N = atoi(argv[1]);
            filename = argv[2];
            R = atoi(argv[3]);

            for (int i = 0; i < NR_CPUS; i++) {
                    if (glue(i) == 0)
                            break;
            }

            std::vector<std::thread> T;
            T.reserve(N);
            for (int i = 0; i < N; i++) {
                    T.emplace_back(f, i);
            }

            auto t0 = std::chrono::system_clock::now();
            {
                    *(volatile int *)&xxx = 1;
                    for (auto& t: T) {
                            t.join();
                    }
            }
            auto t1 = std::chrono::system_clock::now();
            std::chrono::duration<double> dt = t1 - t0;
            std::cout << dt.count() << '\n';

            return 0;
    }

    P.S.:
    An explicit randomization marker is added because adding a non-function
    pointer will silently disable structure layout randomization.

    [akpm@linux-foundation.org: coding style fixes]
    Reported-by: kbuild test robot
    Reported-by: Dan Carpenter
    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Joe Perches
    Link: http://lkml.kernel.org/r/20200222201539.GA22576@avx2
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • Both bootmem_data and bootmem_data_t structures are no longer defined.
    Remove the dummy forward declarations.

    Signed-off-by: Waiman Long
    Signed-off-by: Andrew Morton
    Reviewed-by: Baoquan He
    Acked-by: Mike Rapoport
    Link: http://lkml.kernel.org/r/20200326022617.26208-1-longman@redhat.com
    Signed-off-by: Linus Torvalds

    Waiman Long
     
  • Fixes: 80a72d0af05a ("memremap: remove the data field in struct dev_pagemap")
    Fixes: fdc029b19dfd ("memremap: remove the dev field in struct dev_pagemap")
    Signed-off-by: Ira Weiny
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Cc: Jason Gunthorpe
    Cc: Dan Williams
    Link: http://lkml.kernel.org/r/20200316213205.145333-1-ira.weiny@intel.com
    Signed-off-by: Linus Torvalds

    Ira Weiny
     
  • If CONFIG_DEVICE_PRIVATE is defined, but neither CONFIG_MEMORY_FAILURE nor
    CONFIG_MIGRATION, then non_swap_entry() will return 0, meaning that the
    condition (non_swap_entry(entry) && is_device_private_entry(entry)) in
    zap_pte_range() will never be true even if the entry is a device private
    one.

    Equally any other code depending on non_swap_entry() will not function as
    expected.

    I originally spotted this just by looking at the code; I haven't actually
    observed any problems.

    Looking a bit more closely it appears that actually this situation
    (currently at least) cannot occur:

    DEVICE_PRIVATE depends on ZONE_DEVICE
    ZONE_DEVICE depends on MEMORY_HOTREMOVE
    MEMORY_HOTREMOVE depends on MIGRATION

    Fixes: 5042db43cc26 ("mm/ZONE_DEVICE: new type of ZONE_DEVICE for unaddressable memory")
    Signed-off-by: Steven Price
    Signed-off-by: Andrew Morton
    Cc: Jérôme Glisse
    Cc: Arnd Bergmann
    Cc: Dan Williams
    Cc: John Hubbard
    Link: http://lkml.kernel.org/r/20200305130550.22693-1-steven.price@arm.com
    Signed-off-by: Linus Torvalds

    Steven Price
     
  • The @pfn parameter of remap_pfn_range() passed from the caller is actually
    a page-frame number converted from the corresponding physical address of
    kernel memory; the original comment is ambiguous and may mislead users.

    Meanwhile, there is an ambiguous typo, "VMM", in the comment of
    vm_area_struct. Fixing both makes the code more readable.

    Signed-off-by: chenqiwu
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/1583026921-15279-1-git-send-email-qiwuchen55@gmail.com
    Signed-off-by: Linus Torvalds

    chenqiwu
     
  • For now, distributions implement advanced udev rules to essentially
    - Don't online any hotplugged memory (s390x)
    - Online all memory to ZONE_NORMAL (e.g., most virt environments like
    hyperv)
    - Online all memory to ZONE_MOVABLE in case the zone imbalance is taken
    care of (e.g., bare metal, special virt environments)

    In summary: All memory is usually onlined the same way, however, the
    kernel always has to ask user space to come up with the same answer.
    E.g., Hyper-V always waits for a memory block to get onlined before
    continuing, otherwise it might end up adding memory faster than
    onlining it, which can result in strange OOM situations. This waiting
    slows down adding of a bigger amount of memory.

    Let's allow specifying a default online_type, not just "online" and
    "offline". This allows distributions to configure the default online_type
    when booting up and be done with it.

    We can now specify "offline", "online", "online_movable" and
    "online_kernel" via
    - "memhp_default_state=" on the kernel cmdline
    - /sys/devices/system/memory/auto_online_blocks
    just like we are able to specify for a single memory block via
    /sys/devices/system/memory/memoryX/state

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-9-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • ... and rename it to memhp_default_online_type. This is a preparation
    for more detailed default online behavior.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-8-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Historically, we used the value -1. Just treat 0 as the special case now.
    Clarify a comment (which was wrong: when we come via device_online() the
    first time, the online_type would have been 0 / MEM_ONLINE). The default
    is now always MMOP_OFFLINE. This removes the last user of the manual
    "-1", which didn't use the enum value.

    This is a preparation to use the online_type as an array index.
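
    After the change, the online types look roughly like this (a sketch of
    include/linux/memory_hotplug.h; the comments are mine):

    enum {
            /* Offline the memory. */
            MMOP_OFFLINE = 0,
            /* Online the memory; the zone depends on the default policy. */
            MMOP_ONLINE,
            /* Online the memory to ZONE_NORMAL. */
            MMOP_ONLINE_KERNEL,
            /* Online the memory to ZONE_MOVABLE. */
            MMOP_ONLINE_MOVABLE,
    };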

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Benjamin Herrenschmidt
    Cc: Eduardo Habkost
    Cc: Haiyang Zhang
    Cc: Igor Mammedov
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Vitaly Kuznetsov
    Cc: Wei Liu
    Cc: Yumei Huang
    Link: http://lkml.kernel.org/r/20200317104942.11178-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm/memory_hotplug: allow to specify a default online_type", v3.

    Distributions nowadays use udev rules ([1] [2]) to specify if and how to
    online hotplugged memory. The rules seem to get more complex with many
    special cases. Due to the various special cases,
    CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE cannot be used. All memory hotplug
    is handled via udev rules.

    Every time we hotplug memory, the udev rule will come to the same
    conclusion. Especially Hyper-V (but also soon virtio-mem) add a lot of
    memory in separate memory blocks and wait for memory to get onlined by
    user space before continuing to add more memory blocks (to not add memory
    faster than it is getting onlined). This of course slows down the whole
    memory hotplug process.

    To make the job of distributions easier and to avoid udev rules that get
    more and more complicated, let's extend the mechanism provided by
    - /sys/devices/system/memory/auto_online_blocks
    - "memhp_default_state=" on the kernel cmdline
    so that "online_movable" and "online_kernel" can also be specified.

    === Example /usr/libexec/config-memhotplug ===

    #!/bin/bash

    VIRT=`systemd-detect-virt --vm`
    ARCH=`uname -p`

    sense_virtio_mem() {
            if [ -d "/sys/bus/virtio/drivers/virtio_mem/" ]; then
                    DEVICES=`find /sys/bus/virtio/drivers/virtio_mem/ -maxdepth 1 -type l | wc -l`
                    if [ $DEVICES != "0" ]; then
                            return 0
                    fi
            fi
            return 1
    }

    if [ ! -e "/sys/devices/system/memory/auto_online_blocks" ]; then
            echo "Memory hotplug configuration support missing in the kernel"
            exit 1
    fi

    if grep "memhp_default_state=" /proc/cmdline > /dev/null; then
            echo "Memory hotplug configuration overridden in kernel cmdline (memhp_default_state=)"
            exit 1
    fi

    if [ $VIRT == "microsoft" ]; then
            echo "Detected Hyper-V on $ARCH"
            # Hyper-V wants all memory in ZONE_NORMAL
            ONLINE_TYPE="online_kernel"
    elif sense_virtio_mem; then
            echo "Detected virtio-mem on $ARCH"
            # virtio-mem wants all memory in ZONE_NORMAL
            ONLINE_TYPE="online_kernel"
    elif [ $ARCH == "s390x" ] || [ $ARCH == "s390" ]; then
            echo "Detected $ARCH"
            # standby memory should not be onlined automatically
            ONLINE_TYPE="offline"
    elif [ $ARCH == "ppc64" ] || [ $ARCH == "ppc64le" ]; then
            echo "Detected $ARCH"
            # PPC64 onlines all hotplugged memory right from the kernel
            ONLINE_TYPE="offline"
    elif [ $VIRT == "none" ]; then
            echo "Detected bare-metal on $ARCH"
            # Bare metal users expect hotplugged memory to be unpluggable. We assume
            # that ZONE imbalances on such enterprise servers cannot happen and is
            # properly documented
            ONLINE_TYPE="online_movable"
    else
            # TODO: Hypervisors that want to unplug DIMMs and can guarantee that ZONE
            # imbalances won't happen
            echo "Detected $VIRT on $ARCH"
            # Usually, ballooning is used in virtual environments, so memory should go to
            # ZONE_NORMAL. However, sometimes "movable_node" is relevant.
            ONLINE_TYPE="online"
    fi

    echo "Selected online_type:" $ONLINE_TYPE

    # Configure what to do with memory that will be hotplugged in the future
    echo $ONLINE_TYPE 2>/dev/null > /sys/devices/system/memory/auto_online_blocks
    if [ $? != "0" ]; then
            echo "Memory hotplug cannot be configured (e.g., old kernel or missing permissions)"
            # A backup udev rule should handle old kernels if necessary
            exit 1
    fi

    # Process all already plugged blocks (e.g., DIMMs, but also Hyper-V or virtio-mem)
    if [ $ONLINE_TYPE != "offline" ]; then
            for MEMORY in /sys/devices/system/memory/memory*; do
                    STATE=`cat $MEMORY/state`
                    if [ $STATE == "offline" ]; then
                            echo $ONLINE_TYPE > $MEMORY/state
                    fi
            done
    fi

    === Example /usr/lib/systemd/system/config-memhotplug.service ===

    [Unit]
    Description=Configure memory hotplug behavior
    DefaultDependencies=no
    Conflicts=shutdown.target
    Before=sysinit.target shutdown.target
    After=systemd-modules-load.service
    ConditionPathExists=|/sys/devices/system/memory/auto_online_blocks

    [Service]
    ExecStart=/usr/libexec/config-memhotplug
    Type=oneshot
    TimeoutSec=0
    RemainAfterExit=yes

    [Install]
    WantedBy=sysinit.target

    === Example modification to the 40-redhat.rules [2] ===

    : diff --git a/40-redhat.rules b/40-redhat.rules-new
    : index 2c690e5..168fd03 100644
    : --- a/40-redhat.rules
    : +++ b/40-redhat.rules-new
    : @@ -6,6 +6,9 @@ SUBSYSTEM=="cpu", ACTION=="add", TEST=="online", ATTR{online}=="0", ATTR{online}
    : # Memory hotadd request
    : SUBSYSTEM!="memory", GOTO="memory_hotplug_end"
    : ACTION!="add", GOTO="memory_hotplug_end"
    : +# memory hotplug behavior configured
    : +PROGRAM=="grep online /sys/devices/system/memory/auto_online_blocks", GOTO="memory_hotplug_end"
    : +
    : PROGRAM="/bin/uname -p", RESULT=="s390*", GOTO="memory_hotplug_end"
    :
    : ENV{.state}="online"

    ===

    [1] https://github.com/lnykryn/systemd-rhel/pull/281
    [2] https://github.com/lnykryn/systemd-rhel/blob/staging/rules/40-redhat.rules

    This patch (of 8):

    The name is misleading and it's not really clear what is "kept". Let's
    just name it like the online_type name we expose to user space ("online").

    Add some documentation to the types.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Reviewed-by: Baoquan He
    Acked-by: Pankaj Gupta
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: "Rafael J. Wysocki"
    Cc: Wei Yang
    Cc: Vitaly Kuznetsov
    Cc: Yumei Huang
    Cc: Igor Mammedov
    Cc: Eduardo Habkost
    Cc: Benjamin Herrenschmidt
    Cc: Haiyang Zhang
    Cc: K. Y. Srinivasan
    Cc: Michael Ellerman (powerpc)
    Cc: Paul Mackerras
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: http://lkml.kernel.org/r/20200319131221.14044-1-david@redhat.com
    Link: http://lkml.kernel.org/r/20200317104942.11178-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Currently, to support subsection aligned memory region adding for pmem,
    subsection map is added to track which subsection is present.

    However, config ZONE_DEVICE depends on SPARSEMEM_VMEMMAP. This means the
    subsection map only makes sense when SPARSEMEM_VMEMMAP is enabled. For
    the classic sparse, it's meaningless. Even worse, it may confuse people
    when checking code related to the classic sparse.

    As for why the classic sparse doesn't support subsection hotplug, Dan
    said it's more because the effort and maintenance burden outweighs the
    benefit. Besides, the current 64 bit ARCHes all enable
    SPARSEMEM_VMEMMAP_ENABLE by default.

    For the above reasons, there is no need to provide the subsection map and
    the relevant handling for the classic sparse. Let's remove them.

    Signed-off-by: Baoquan He
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Cc: Dan Williams
    Cc: Michal Hocko
    Cc: Pankaj Gupta
    Cc: Wei Yang
    Link: http://lkml.kernel.org/r/20200312124414.439-4-bhe@redhat.com
    Signed-off-by: Linus Torvalds

    Baoquan He
     
  • Patch series "mm: drop superfluous section checks when onlining/offlining".

    Let's drop some superfluous section checks on the onlining/offlining path.

    This patch (of 3):

    Since commit c5e79ef561b0 ("mm/memory_hotplug.c: don't allow to
    online/offline memory blocks with holes") we have a generic check in
    offline_pages() that disallows offlining memory blocks with holes.

    Memory blocks with missing sections are just another variant of this type
    of block. We can stop checking (and especially storing) present sections.
    A proper error message explaining why offlining failed is now printed.

    section_count was initially introduced in commit 07681215975e ("Driver
    core: Add section count to memory_block struct") in order to detect when
    it is okay to remove a memory block. It was used in commit 26bbe7ef6d5c
    ("drivers/base/memory.c: prohibit offlining of memory blocks with missing
    sections") to disallow offlining memory blocks with missing sections. As
    we refactored creation/removal of memory devices and have a proper check
    for holes in place, we can drop the section_count.

    This also removes a leftover comment regarding the mem_sysfs_mutex, which
    was removed in commit 848e19ad3c33 ("drivers/base/memory.c: drop the
    mem_sysfs_mutex").

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Cc: Greg Kroah-Hartman
    Cc: "Rafael J. Wysocki"
    Cc: Michal Hocko
    Cc: Dan Williams
    Cc: Pavel Tatashin
    Cc: Anshuman Khandual
    Link: http://lkml.kernel.org/r/20200127110424.5757-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Now it's safe to enable write protection in the userfaultfd API.

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-15-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Introduce the new uffd-wp APIs for userspace.

    Firstly, we'll allow UFFDIO_REGISTER to be done with write protection
    tracking using the new UFFDIO_REGISTER_MODE_WP flag. Note that this flag
    can co-exist with the existing UFFDIO_REGISTER_MODE_MISSING, in which
    case the userspace program can not only resolve missing page faults but
    also track page data changes along the way.

    Secondly, we introduce the new UFFDIO_WRITEPROTECT API to do page-level
    write protection tracking. Note that we will need to register the memory
    region with UFFDIO_REGISTER_MODE_WP before that.
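
    A hedged usage sketch (error handling trimmed; assumes uffd is a
    userfaultfd file descriptor already initialized via UFFDIO_API, and that
    addr/len describe a page-aligned region):

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>

    static int uffd_wp_region(int uffd, void *addr, unsigned long len)
    {
            /* Register for both missing and wp faults... */
            struct uffdio_register reg = {
                    .range = { .start = (unsigned long)addr, .len = len },
                    .mode  = UFFDIO_REGISTER_MODE_MISSING |
                             UFFDIO_REGISTER_MODE_WP,
            };
            /* ...then write protect the whole registered range. */
            struct uffdio_writeprotect wp = {
                    .range = { .start = (unsigned long)addr, .len = len },
                    .mode  = UFFDIO_WRITEPROTECT_MODE_WP,
            };

            if (ioctl(uffd, UFFDIO_REGISTER, &reg) == -1)
                    return -1;
            return ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
    }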

    [peterx@redhat.com: write up the commit message]
    [peterx@redhat.com: remove useless block, write commit message, check against
    VM_MAYWRITE rather than VM_WRITE when register]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-14-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Add an API to enable/disable write protection for a vma range. Unlike
    mprotect, this doesn't split/merge vmas.

    [peterx@redhat.com:
    - use the helper to find VMA;
    - return -ENOENT if not found to match mcopy case;
    - use the new MM_CP_UFFD_WP* flags for change_protection
    - check against mmap_changing for failures
    - replace find_dst_vma with vma_find_uffd]
    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-13-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Don't collapse the huge PMD if there are any userfault write-protected
    small PTEs. The problem is that the write protection is in small page
    granularity and there's no way to keep all this write protection
    information if the small pages are going to be merged into a huge PMD.

    The same thing needs to be considered for swap entries and migration
    entries. So do the check as well, disregarding khugepaged_max_ptes_swap.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-12-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • For both swap and page migration, we use bit 2 of the entry to
    identify whether this entry is uffd write-protected. It plays a similar
    role to the existing soft dirty bit in swap entries but only for keeping
    the uffd-wp tracking for a specific PTE/PMD.

    Something special here is that when we want to recover the uffd-wp bit
    from a swap/migration entry to the PTE bit we'll also need to take care of
    the _PAGE_RW bit and make sure it's cleared, otherwise even with the
    _PAGE_UFFD_WP bit we can't trap it at all.

    In change_pte_range() we do nothing for uffd if the PTE is a swap entry.
    That can lead to data mismatch if the page that we are going to write
    protect is swapped out when sending the UFFDIO_WRITEPROTECT. This patch
    also applies/removes the uffd-wp bit even for the swap entries.
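
    A sketch of the x86 helpers this adds (names from the series; the
    assumption that _PAGE_SWP_UFFD_WP aliases an otherwise-unused bit in swap
    entries is mine):

    static inline pte_t pte_swp_mkuffd_wp(pte_t pte)
    {
            return pte_set_flags(pte, _PAGE_SWP_UFFD_WP);
    }

    static inline int pte_swp_uffd_wp(pte_t pte)
    {
            return pte_flags(pte) & _PAGE_SWP_UFFD_WP;
    }

    static inline pte_t pte_swp_clear_uffd_wp(pte_t pte)
    {
            return pte_clear_flags(pte, _PAGE_SWP_UFFD_WP);
    }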

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Jerome Glisse
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-11-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Add the missing helpers for uffd-wp operations with pmd swap/migration
    entries.

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-10-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • Firstly, introduce two new flags MM_CP_UFFD_WP[_RESOLVE] for
    change_protection() when used with uffd-wp, and make sure the two new
    flags are mutually exclusive. Then,

    - For MM_CP_UFFD_WP: apply the _PAGE_UFFD_WP bit and remove _PAGE_RW
    when a range of memory is write protected by uffd

    - For MM_CP_UFFD_WP_RESOLVE: remove the _PAGE_UFFD_WP bit and recover
    _PAGE_RW when write protection is resolved from userspace

    And use this new interface in mwriteprotect_range() to replace the old
    MM_CP_DIRTY_ACCT.
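
    A sketch of the two new flags (include/linux/mm.h; the exact bit
    positions are an assumption, following the existing MM_CP_* bits):

    #define MM_CP_UFFD_WP           (1UL << 2) /* do wp */
    #define MM_CP_UFFD_WP_RESOLVE   (1UL << 3) /* resolve wp */
    #define MM_CP_UFFD_WP_ALL       (MM_CP_UFFD_WP | MM_CP_UFFD_WP_RESOLVE)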

    Do this change for both PTEs and huge PMDs. Then we can start to identify
    which PTE/PMD is write protected for general reasons (e.g., COW or soft
    dirty tracking), and which is for userfaultfd-wp.

    Since we should keep the _PAGE_UFFD_WP when doing pte_modify(), add it
    into _PAGE_CHG_MASK as well. Meanwhile, since we have this new bit, we
    can be even more strict when detecting uffd-wp page faults in either
    do_wp_page() or wp_huge_pmd().

    Now that we have _PAGE_UFFD_WP, a special case is when a page is both
    protected by the general COW logic and also userfault-wp. Here the
    userfault-wp will have higher priority and will be handled first. Only
    after the uffd-wp bit is cleared on the PTE/PMD will we continue to handle
    the general COW. These are the steps of what will happen with such a
    page:

    1. CPU accesses write protected shared page (so both protected by
    general COW and uffd-wp), blocked by uffd-wp first because in
    do_wp_page we'll handle uffd-wp first, so it has higher priority
    than general COW.

    2. Uffd service thread receives the request, do UFFDIO_WRITEPROTECT
    to remove the uffd-wp bit upon the PTE/PMD. However here we
    still keep the write bit cleared. Notify the blocked CPU.

    3. The blocked CPU resumes the page fault process with a fault
    retry, during retry it'll notice it was not with the uffd-wp bit
    this time but it is still write protected by general COW, then
    it'll go though the COW path in the fault handler, copy the page,
    apply write bit where necessary, and retry again.

    4. The CPU will be able to access this page with write bit set.

    Suggested-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Cc: Brian Geffon
    Cc: Pavel Emelyanov
    Cc: Mike Kravetz
    Cc: David Hildenbrand
    Cc: Martin Cracauer
    Cc: Mel Gorman
    Cc: Bobby Powers
    Cc: Mike Rapoport
    Cc: "Kirill A . Shutemov"
    Cc: Maya Gokhale
    Cc: Johannes Weiner
    Cc: Marty McFadden
    Cc: Denis Plotnikov
    Cc: Hugh Dickins
    Cc: "Dr . David Alan Gilbert"
    Cc: Jerome Glisse
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-8-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • change_protection() was used by either the NUMA or the mprotect() code;
    there's one parameter for each of the callers (dirty_accountable and
    prot_numa). Further, these parameters are passed along the calls:

    - change_protection_range()
    - change_p4d_range()
    - change_pud_range()
    - change_pmd_range()
    - ...

    Now we introduce a flag for change_protection() and all these helpers to
    replace these parameters. Then we can avoid passing multiple parameters
    multiple times along the way.

    More importantly, it'll greatly simplify the work if we want to introduce
    any new parameters to change_protection(). In the follow up patches, a
    new parameter for userfaultfd write protection will be introduced.

    No functional change at all.
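
    A minimal sketch of the consolidated interface, assuming the flag names
    from this series; the two former parameters become bits in a single
    cp_flags mask:

    #define MM_CP_DIRTY_ACCT        (1UL << 0)
    #define MM_CP_PROT_NUMA         (1UL << 1)

    unsigned long change_protection(struct vm_area_struct *vma,
                                    unsigned long start, unsigned long end,
                                    pgprot_t newprot, unsigned long cp_flags);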

    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Cc: Andrea Arcangeli
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Mike Rapoport
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-7-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Peter Xu
     
  • This allows UFFDIO_COPY to map pages write-protected.

    [peterx@redhat.com: switch to VM_WARN_ON_ONCE in mfill_atomic_pte; add brackets
    around "dst_vma->vm_flags & VM_WRITE"; fix wordings in comments and
    commit messages]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-6-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Implement helper methods to invoke userfaultfd wp faults more
    selectively: not only when a wp fault triggers on a vma with
    vma->vm_flags VM_UFFD_WP set, but only if the _PAGE_UFFD_WP bit is also
    set in the pagetable.
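
    A sketch of the helpers (include/linux/userfaultfd_k.h), combining the
    vma flag with the per-entry bit:

    static inline bool userfaultfd_pte_wp(struct vm_area_struct *vma,
                                          pte_t pte)
    {
            return userfaultfd_wp(vma) && pte_uffd_wp(pte);
    }

    static inline bool userfaultfd_huge_pmd_wp(struct vm_area_struct *vma,
                                               pmd_t pmd)
    {
            return userfaultfd_wp(vma) && pmd_uffd_wp(pmd);
    }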

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-5-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Accurate userfaultfd WP tracking is possible by tracking exactly which
    virtual memory ranges were writeprotected by userland. We can't rely
    only on the RW bit of the mapped pagetable because that information is
    destroyed by fork() or KSM or swap. If we were to rely on that, we'd
    need to stay on the safe side and generate false positive wp faults for
    every swapped out page.

    [peterx@redhat.com: append _PAGE_UFFD_WP to _PAGE_CHG_MASK]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: "Kirill A . Shutemov"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mel Gorman
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Rik van Riel
    Cc: Shaohua Li
    Link: http://lkml.kernel.org/r/20200220163112.11409-4-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "userfaultfd: write protection support", v6.

    Overview
    ========

    The uffd-wp work was initiated by Shaohua Li [1] and later continued by
    Andrea [2]. This series is based upon Andrea's latest userfaultfd tree,
    and it is a continuation of work from both Shaohua and Andrea. Many of
    the follow up ideas come from Andrea too.

    Besides the old MISSING register mode of userfaultfd, the new uffd-wp
    support provides another alternative register mode called
    UFFDIO_REGISTER_MODE_WP that can be used to listen to not only missing
    page faults but also write protection page faults; the two modes can
    even be registered together. At the same time, the new feature also
    provides a new userfaultfd ioctl called UFFDIO_WRITEPROTECT which allows
    userspace to write protect a range of memory or fix up the write
    permission of faulted pages.

    Please refer to the document patch "userfaultfd: wp:
    UFFDIO_REGISTER_MODE_WP documentation update" for more information on the
    new interface and what it can do.

    The major workflow of an uffd-wp program should be (a minimal C sketch of
    steps 4-5 follows the list):

    1. Register a memory region with WP mode using UFFDIO_REGISTER_MODE_WP

    2. Write protect part of the whole registered region using
    UFFDIO_WRITEPROTECT, passing in UFFDIO_WRITEPROTECT_MODE_WP to
    show that we want to write protect the range.

    3. Start a working thread that modifies the protected pages,
    meanwhile listening to UFFD messages.

    4. When a write is detected upon the protected range, page fault
    happens, a UFFD message will be generated and reported to the
    page fault handling thread

    5. The page fault handler thread resolves the page fault using the
    new UFFDIO_WRITEPROTECT ioctl, but this time passing in
    !UFFDIO_WRITEPROTECT_MODE_WP instead showing that we want to
    recover the write permission. Before this operation, the fault
    handler thread can do anything it wants, e.g., dumps the page to
    a persistent storage.

    6. The worker thread will continue running with the correctly
    applied write permission from step 5.
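
    A minimal sketch of steps 4-5, assuming uffd and page_size are set up by
    the caller and with error handling trimmed:

    #include <linux/userfaultfd.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    static void handle_one_uffd_msg(int uffd, unsigned long page_size)
    {
            struct uffd_msg msg;

            /* Step 4: a write to a protected page generates a message. */
            if (read(uffd, &msg, sizeof(msg)) != sizeof(msg))
                    return;

            if (msg.event == UFFD_EVENT_PAGEFAULT &&
                (msg.arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WP)) {
                    struct uffdio_writeprotect wp = {
                            .range.start = msg.arg.pagefault.address &
                                           ~(page_size - 1),
                            .range.len   = page_size,
                            /* mode without ..._MODE_WP resolves the fault */
                            .mode        = 0,
                    };

                    /* Step 5: e.g., dump the page somewhere, then unprotect. */
                    ioctl(uffd, UFFDIO_WRITEPROTECT, &wp);
            }
    }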

    Currently there are already two projects that are based on this new
    userfaultfd feature.

    QEMU Live Snapshot: The project provides a way to allow the QEMU
    hypervisor to take snapshot of VMs without
    stopping the VM [3].

    LLNL umap library: The project provides a mmap-like interface and
    "allow to have an application specific buffer of
    pages cached from a large file, i.e. out-of-core
    execution using memory map" [4][5].

    Before posting the patchset, this series was smoke tested against QEMU
    live snapshot and the LLNL umap library (by doing parallel quicksort using
    128 sorting threads + 80 uffd servicing threads). My sincere thanks to
    Marty Mcfadden and Denis Plotnikov for the help along the way.

    TODO
    ====

    - hugetlbfs/shmem support
    - performance
    - more architectures
    - cooperate with mprotect()-allowed processes (???)
    - ...

    References
    ==========

    [1] https://lwn.net/Articles/666187/
    [2] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/log/?h=userfault
    [3] https://github.com/denis-plotnikov/qemu/commits/background-snapshot-kvm
    [4] https://github.com/LLNL/umap
    [5] https://llnl-umap.readthedocs.io/en/develop/
    [6] https://git.kernel.org/pub/scm/linux/kernel/git/andrea/aa.git/commit/?h=userfault&id=b245ecf6cf59156966f3da6e6b674f6695a5ffa5
    [7] https://lkml.org/lkml/2018/11/21/370
    [8] https://lkml.org/lkml/2018/12/30/64

    This patch (of 19):

    Add helper for writeprotect check. Will use it later.
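
    The helper boils down to a vma flag test (a sketch of
    include/linux/userfaultfd_k.h):

    static inline bool userfaultfd_wp(struct vm_area_struct *vma)
    {
            return vma->vm_flags & VM_UFFD_WP;
    }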

    Signed-off-by: Shaohua Li
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Peter Xu
    Signed-off-by: Andrew Morton
    Reviewed-by: Jerome Glisse
    Reviewed-by: Mike Rapoport
    Cc: Rik van Riel
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: Johannes Weiner
    Cc: Bobby Powers
    Cc: Brian Geffon
    Cc: David Hildenbrand
    Cc: Denis Plotnikov
    Cc: "Dr . David Alan Gilbert"
    Cc: Martin Cracauer
    Cc: Marty McFadden
    Cc: Maya Gokhale
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Link: http://lkml.kernel.org/r/20200220163112.11409-2-peterx@redhat.com
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • In order to keep ourselves from reporting pages that are just going to be
    reused again in the case of heavy churn we can put a limit on how many
    total pages we will process per pass. Doing this will allow the worker
    thread to go into idle much more quickly so that we avoid competing with
    other threads that might be allocating or freeing pages.

    The logic added here will limit the worker thread to no more than one
    sixteenth of the total free pages in a given area per list. Once that
    limit is reached it will update the state so that at the end of the pass
    we will reschedule the worker to try again in 2 seconds when the memory
    churn has hopefully settled down.
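
    A hedged sketch of the cap (mm/page_reporting.c style; the name
    PAGE_REPORTING_CAPACITY and the exact expression are assumptions): with a
    scatterlist of PAGE_REPORTING_CAPACITY pages per report, budget the pass
    to roughly one sixteenth of the area's free pages:

    unsigned int budget = DIV_ROUND_UP(area->nr_free,
                                       PAGE_REPORTING_CAPACITY * 16);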

    Again this optimization doesn't show much of a benefit in the standard
    case as the memory churn is minimal. However with page allocator shuffling
    enabled the gain is quite noticeable. Below are the results with a THP
    enabled version of the will-it-scale page_fault1 test showing the
    improvement in iterations for 16 processes or threads.

    Without:
    tasks   processes    processes_idle   threads      threads_idle
    16      8283274.75   0.17             5594261.00   38.15

    With:
    tasks   processes    processes_idle   threads      threads_idle
    16      8767010.50   0.21             5791312.75   36.98

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224719.29318.72113.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Add support for the page reporting feature provided by virtio-balloon.
    Reporting differs from the regular balloon functionality in that it is
    much less durable than a standard memory balloon. Instead of creating a
    list of pages that cannot be accessed, the pages are only inaccessible
    while they are being indicated to the virtio interface. Once the
    interface has acknowledged them they are placed back into their respective
    free lists and are once again accessible by the guest system.

    Unlike a standard balloon we don't inflate and deflate the pages. Instead
    we perform the reporting, and once the reporting is completed it is
    assumed that the page has been dropped from the guest and will be faulted
    back in the next time the page is accessed.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Acked-by: Michael S. Tsirkin
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224657.29318.68624.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • In order to pave the way for free page reporting in virtualized
    environments we will need a way to get pages out of the free lists and
    identify those pages after they have been returned. To accomplish this,
    this patch adds the concept of a Reported Buddy, which is essentially
    meant to just be the Uptodate flag used in conjunction with the Buddy page
    type.

    To prevent the reported pages from leaking outside of the buddy lists I
    added a check to clear the PageReported bit in the del_page_from_free_list
    function. As a result any reported page that is split, merged, or
    allocated will have the flag cleared prior to the PageBuddy value being
    cleared.
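
    A sketch of the check described above (mm/page_alloc.c after this patch;
    the exact body is an approximation): any page leaving a free list drops
    its Reported state before PageBuddy is cleared:

    static inline void del_page_from_free_list(struct page *page,
                                               struct zone *zone,
                                               unsigned int order)
    {
            /* Clear reported state and update reported page count. */
            if (page_reported(page))
                    __ClearPageReported(page);

            list_del(&page->lru);
            __ClearPageBuddy(page);
            set_page_private(page, 0);
            zone->free_area[order].nr_free--;
    }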

    The process for reporting pages is fairly simple. Once we free a page
    that meets the minimum order for page reporting we will schedule a worker
    thread to start 2s or more in the future. That worker thread will begin
    working from the lowest supported page reporting order up to MAX_ORDER - 1
    pulling unreported pages from the free list and storing them in the
    scatterlist.

    When processing each individual free list it is necessary for the worker
    thread to release the zone lock when it needs to stop and report the full
    scatterlist of pages. To reduce the work of the next iteration the worker
    thread will rotate the free list so that the first unreported page in the
    free list becomes the first entry in the list.

    It will then call a reporting function providing information on how many
    entries are in the scatterlist. Once the function completes it will
    return the pages to the free area from which they were allocated and start
    over pulling more pages from the free areas until there are no longer
    enough pages to report on to keep the worker busy, or we have processed as
    many pages as were contained in the free area when we started processing
    the list.

    The worker thread will work in a round-robin fashion making its way through
    each zone requesting reporting, and through each reportable free list
    within that zone. Once all free areas within the zone have been processed
    it will check to see if there have been any requests for reporting while
    it was processing. If so it will reschedule the worker thread to start up
    again in roughly 2s and exit.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dan Williams
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Pankaj Gupta
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224635.29318.19750.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • In order to enable the use of the zone from the list manipulator functions
    I will need access to the zone pointer. As it turns out most of the
    accessors were always just being directly passed &zone->free_area[order]
    anyway so it would make sense to just fold that into the function itself
    and pass the zone and order as arguments instead of the free area.

    In order to be able to reference the zone we need to move the declaration
    of the functions down so that we have the zone defined before we define
    the list manipulation functions. Since the functions are only used in the
    file mm/page_alloc.c we can just move them there to reduce noise in the
    header.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Reviewed-by: Dan Williams
    Reviewed-by: David Hildenbrand
    Reviewed-by: Pankaj Gupta
    Acked-by: Mel Gorman
    Cc: Andrea Arcangeli
    Cc: Dave Hansen
    Cc: Konrad Rzeszutek Wilk
    Cc: Luiz Capitulino
    Cc: Matthew Wilcox
    Cc: Michael S. Tsirkin
    Cc: Michal Hocko
    Cc: Nitesh Narayan Lal
    Cc: Oscar Salvador
    Cc: Paolo Bonzini
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: Wei Wang
    Cc: Yang Zhang
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224613.29318.43080.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Patch series "mm / virtio: Provide support for free page reporting", v17.

    This series provides an asynchronous means of reporting free guest pages
    to a hypervisor so that the memory associated with those pages can be
    dropped and reused by other processes and/or guests on the host. Using
    this it is possible to avoid unnecessary I/O to disk and greatly improve
    performance in the case of memory overcommit on the host.

    When enabled we will be performing a scan of free memory every 2 seconds
    while pages of sufficiently high order are being freed. In each pass at
    least one sixteenth of each free list will be reported. By doing this we
    avoid racing against other threads that may be causing a high amount of
    memory churn.

    The lowest page order currently scanned when reporting pages is
    pageblock_order so that this feature will not interfere with the use of
    Transparent Huge Pages in the case of virtualization.

    Currently this is only in use by virtio-balloon however there is the hope
    that at some point in the future other hypervisors might be able to make
    use of it. In the virtio-balloon/QEMU implementation the hypervisor is
    currently using MADV_DONTNEED to indicate to the host kernel that the page
    is currently free. It will be zeroed and faulted back into the guest the
    next time the page is accessed.

    To track whether a page is reported or not, the Uptodate flag was
    repurposed and used as a Reported flag for Buddy pages. We walk through
    the free list isolating pages and adding them to the scatterlist until we
    either encounter the end of the list or have processed at least one
    sixteenth of the pages that were listed in nr_free prior to us starting.
    If we fill
    the scatterlist before we reach the end of the list we rotate the list so
    that the first unreported page we encounter is moved to the head of the
    list as that is where we will resume after we have freed the reported
    pages back into the tail of the list.

    Below are the results from various benchmarks. I primarily focused on two
    tests. The first is the will-it-scale/page_fault2 test, and the other is
    a modified version of will-it-scale/page_fault1 that was enabled to use
    THP. I did this as it allows for better visibility into different parts
    of the memory subsystem. The guest is running with 32G of RAM on one
    node of an E5-2630 v3. The host has had some features such as CPU turbo
    disabled in the BIOS.

    Test                      page_fault1 (THP)       page_fault2
    Name             tasks    Process Iter   STDEV    Process Iter   STDEV
    Baseline             1    1012402.50     0.14%    361855.25      0.81%
                        16    8827457.25     0.09%    3282347.00     0.34%

    Patches Applied      1    1007897.00     0.23%    361887.00      0.26%
                        16    8784741.75     0.39%    3240669.25     0.48%

    Patches Enabled      1    1010227.50     0.39%    359749.25      0.56%
                        16    8756219.00     0.24%    3226608.75     0.97%

    Patches Enabled      1    1050982.00     4.26%    357966.25      0.14%
     page shuffle       16    8672601.25     0.49%    3223177.75     0.40%

    Patches enabled      1    1003238.00     0.22%    360211.00      0.22%
     shuffle w/ RFC     16    8767010.50     0.32%    3199874.00     0.71%

    The results above are for a baseline with a linux-next-20191219 kernel,
    that kernel with this patch set applied but page reporting disabled in
    virtio-balloon, the patches applied and page reporting fully enabled, the
    patches enabled with page shuffling enabled, and the patches applied with
    page shuffling enabled and an RFC patch that makes use of MADV_FREE in
    QEMU. These results include the deviation seen between the average value
    reported here versus the high and/or low value. I observed that during
    the test memory usage for the first three tests never dropped whereas with
    the patches fully enabled the VM would drop to using only a few GB of the
    host's memory when switching from memhog to page fault tests.

    Any of the overhead visible with this patch set enabled seems due to page
    faults caused by accessing the reported pages and the host zeroing the
    page before giving it back to the guest. This overhead is much more
    visible when using THP than with standard 4K pages. In addition page
    shuffling seemed to increase the amount of faults generated due to an
    increase in memory churn. The overhead is reduced when using MADV_FREE as
    we can avoid the extra zeroing of the pages when they are reintroduced to
    the host, as can be seen when the RFC is applied with shuffling enabled.

    The overall guest size is kept fairly small to only a few GB while the
    test is running. If the host memory were oversubscribed this patch set
    should result in a performance improvement as swapping memory in the host
    can be avoided.

    A brief history on the background of free page reporting can be found at:
    https://lore.kernel.org/lkml/29f43d5796feed0dec8e8bb98b187d9dac03b900.camel@linux.intel.com/

    This patch (of 9):

    Move the head/tail adding logic out of the shuffle code and into the
    __free_one_page function since ultimately that is where it is really
    needed anyway. By doing this we should be able to reduce the overhead and
    can consolidate all of the list addition bits in one spot.

    Signed-off-by: Alexander Duyck
    Signed-off-by: Andrew Morton
    Reviewed-by: Dan Williams
    Acked-by: Mel Gorman
    Acked-by: David Hildenbrand
    Cc: Yang Zhang
    Cc: Pankaj Gupta
    Cc: Konrad Rzeszutek Wilk
    Cc: Nitesh Narayan Lal
    Cc: Rik van Riel
    Cc: Matthew Wilcox
    Cc: Luiz Capitulino
    Cc: Dave Hansen
    Cc: Wei Wang
    Cc: Andrea Arcangeli
    Cc: Paolo Bonzini
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Oscar Salvador
    Cc: Michael S. Tsirkin
    Cc: wei qi
    Link: http://lkml.kernel.org/r/20200211224602.29318.84523.stgit@localhost.localdomain
    Signed-off-by: Linus Torvalds

    Alexander Duyck
     
  • Some comments for MADV_FREE are revised and added to help people
    understand the MADV_FREE code, especially the page flag, PG_swapbacked.
    This makes page_is_file_cache() inconsistent with its comments, so the
    function is renamed to page_is_file_lru() to make them consistent again.
    All these are put in one patch as one logical change.

    Suggested-by: David Hildenbrand
    Suggested-by: Johannes Weiner
    Suggested-by: David Rientjes
    Signed-off-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Pankaj Gupta
    Acked-by: Vlastimil Babka
    Cc: Dave Hansen
    Cc: Mel Gorman
    Cc: Minchan Kim
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Link: http://lkml.kernel.org/r/20200317100342.2730705-1-ying.huang@intel.com
    Signed-off-by: Linus Torvalds

    Huang Ying
     
  • Commit e496cf3d7821 ("thp: introduce CONFIG_TRANSPARENT_HUGE_PAGECACHE")
    notes that it should be reverted when the PowerPC problem was fixed. The
    commit fixing the PowerPC problem (953c66c2b22a) did not revert the
    commit; instead it set CONFIG_TRANSPARENT_HUGE_PAGECACHE to the same
    value as CONFIG_TRANSPARENT_HUGEPAGE. Checking with Kirill and Aneesh,
    this was an oversight, so remove the Kconfig symbol and undo the work of
    commit e496cf3d7821.

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Christoph Hellwig
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-6-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • If THP is disabled, find_subpage() can become a no-op by using
    hpage_nr_pages() instead of compound_nr(). hpage_nr_pages() embeds a
    check for PageTail, so we can drop the check here.
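
    A sketch of the resulting helper (include/linux/pagemap.h; the exact body
    is an approximation):

    static inline struct page *find_subpage(struct page *head, pgoff_t index)
    {
            /* HugeTLBfs wants the head page regardless of index. */
            if (PageHuge(head))
                    return head;

            return head + (index & (hpage_nr_pages(head) - 1));
    }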

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Reviewed-by: Christoph Hellwig
    Acked-by: Kirill A. Shutemov
    Cc: Aneesh Kumar K.V
    Cc: Pankaj Gupta
    Link: http://lkml.kernel.org/r/20200318140253.6141-5-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The thp_fault_fallback and thp_file_fallback vmstats are incremented if
    either the hugepage allocation fails through the page allocator or the
    hugepage charge fails through mem cgroup.

    This patch leaves these fields untouched but adds two new fields,
    thp_{fault,file}_fallback_charge, which are incremented only when the mem
    cgroup charge fails.

    This distinguishes between attempted hugepage allocations that fail due to
    fragmentation (or low memory conditions) and those that fail due to mem
    cgroup limits. That can be used to determine the impact of fragmentation
    on the system by excluding faults that failed due to memcg usage.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: Jeremy Cline
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061422070.7412@chino.kir.corp.google.com
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • The existing thp_fault_fallback indicates when thp attempts to allocate a
    hugepage but fails, or if the hugepage cannot be charged to the mem cgroup
    hierarchy.

    Extend this to shmem as well. Adds a new thp_file_fallback to complement
    thp_file_alloc that gets incremented when a hugepage is attempted to be
    allocated but fails, or if it cannot be charged to the mem cgroup
    hierarchy.

    Additionally, remove the check for CONFIG_TRANSPARENT_HUGE_PAGECACHE from
    shmem_alloc_hugepage() since it is only called with this configuration
    option.

    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Reviewed-by: Yang Shi
    Acked-by: Kirill A. Shutemov
    Cc: Mike Rapoport
    Cc: Jeremy Cline
    Cc: Andrea Arcangeli
    Cc: Mike Kravetz
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Link: http://lkml.kernel.org/r/alpine.DEB.2.21.2003061421240.7412@chino.kir.corp.google.com
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • While it might be really clear to MM developers that gfp reclaim modifiers
    are applicable only to sleepable allocations (those with
    __GFP_DIRECT_RECLAIM) it seems that actual users of the API are not always
    sure. Make it explicit that they are not applicable for GFP_NOWAIT or
    GFP_ATOMIC allocations which are the most commonly used non-sleepable
    allocation masks.
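
    An illustration (mine, not from the patch): reclaim modifiers such as
    __GFP_RETRY_MAYFAIL only take effect for sleepable masks, i.e. those
    including __GFP_DIRECT_RECLAIM:

    void *a = kmalloc(sz, GFP_KERNEL | __GFP_RETRY_MAYFAIL); /* honoured */
    void *b = kmalloc(sz, GFP_ATOMIC | __GFP_RETRY_MAYFAIL); /* no effect */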

    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Reviewed-by: Joel Fernandes (Google)
    Acked-by: Paul E. McKenney
    Acked-by: David Rientjes
    Cc: Neil Brown
    Link: http://lkml.kernel.org/r/20200403083543.11552-3-mhocko@kernel.org
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • This replaces all remaining open encodings with is_vm_hugetlb_page().
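
    For reference, the helper being open coded (a sketch of
    include/linux/hugetlb_inline.h, under CONFIG_HUGETLB_PAGE):

    static inline bool is_vm_hugetlb_page(struct vm_area_struct *vma)
    {
            return !!(vma->vm_flags & VM_HUGETLB);
    }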

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Cc: Michael Ellerman
    Cc: Alexander Viro
    Cc: Will Deacon
    Cc: "Aneesh Kumar K.V"
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Arnd Bergmann
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Andy Lutomirski
    Cc: Dave Hansen
    Cc: Geert Uytterhoeven
    Cc: Guo Ren
    Cc: Mel Gorman
    Cc: Paul Burton
    Cc: Paul Mackerras
    Cc: Ralf Baechle
    Cc: Rich Felker
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Yoshinori Sato
    Link: http://lkml.kernel.org/r/1582520593-30704-4-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     
  • Let's move the vma_is_accessible() helper to include/linux/mm.h, which
    makes it available for general use. While here, this replaces all
    remaining open encodings for VMA access checks with vma_is_accessible().
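
    The moved helper is a one-liner (a sketch; the flag set follows this
    series):

    static inline bool vma_is_accessible(struct vm_area_struct *vma)
    {
            return vma->vm_flags & (VM_READ | VM_WRITE | VM_EXEC);
    }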

    Signed-off-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Acked-by: Geert Uytterhoeven
    Acked-by: Guo Ren
    Acked-by: Vlastimil Babka
    Cc: Guo Ren
    Cc: Geert Uytterhoeven
    Cc: Ralf Baechle
    Cc: Paul Burton
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Michael Ellerman
    Cc: Yoshinori Sato
    Cc: Rich Felker
    Cc: Dave Hansen
    Cc: Andy Lutomirski
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Steven Rostedt
    Cc: Mel Gorman
    Cc: Alexander Viro
    Cc: "Aneesh Kumar K.V"
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnd Bergmann
    Cc: Nick Piggin
    Cc: Paul Mackerras
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1582520593-30704-3-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Linus Torvalds

    Anshuman Khandual