19 Oct, 2020

21 commits

  • No point in having the filename inside the file.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "two small vmalloc cleanups".

    This patch (of 2):

    __vmalloc_area_node currently has four different gfp_t variables to
    just express this simple logic:

    - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if
    suitable) for the underlying page allocation
    - use just the reclaim flags from the passed in mask plus __GFP_ZERO
    for allocating the page array

    Simplify this down to just use the pre-existing nested_gfp as-is for
    the page array allocation, and just the passed in gfp_mask for the
    page allocation, after conditionally ORing __GFP_HIGHMEM into it. This
    also makes the allocation warning a little more correct.

    Also initialize two variables at the time of declaration while touching
    this area.
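    For reference, a rough sketch of the two resulting masks (illustrative
    only; the allocation calls and identifiers here do not exactly match the
    final mm/vmalloc.c code):

        /* Reclaim flags only, plus zeroing, for the struct page array. */
        const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;

        /* The caller's mask, with highmem when suitable, for the pages. */
        if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
                gfp_mask |= __GFP_HIGHMEM;

        pages = kvmalloc_node(array_size, nested_gfp, node);
        for (i = 0; i < nr_pages; i++)
                pages[i] = alloc_pages_node(node, gfp_mask, 0);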

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • All users are gone now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Just manually pre-fault the PTEs using apply_to_page_range.

    Co-developed-by: Minchan Kim
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Besides calling the callback on each page, apply_to_page_range also has
    the effect of pre-faulting all PTEs for the range. To support callers
    that only need the pre-faulting, make the callback optional.

    Based on a patch from Minchan Kim.
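    As an illustration of the new calling convention (a sketch in the spirit
    of the zsmalloc conversion elsewhere in this series, not the exact hunk
    from this patch), a caller that only needs the page tables populated for
    a kernel VA range can now pass a NULL callback:

        struct vm_struct *area = get_vm_area(PAGE_SIZE * 2, 0);

        if (!area)
                return -ENOMEM;

        /*
         * A NULL callback just walks and allocates the page tables for the
         * range, so the actual PTE setup can later happen in atomic context.
         */
        return apply_to_page_range(&init_mm, (unsigned long)area->addr,
                                   PAGE_SIZE * 2, NULL, NULL);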

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
    Add a proper helper to remap PFNs into kernel virtual space so that
    drivers don't have to abuse alloc_vm_area and open-code PTE manipulation
    for it.
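    A minimal usage sketch (the pfns array, count and the chosen pgprot are
    hypothetical and depend on the driver):

        /*
         * Map 'count' physical pages, given by PFN, into one contiguous
         * kernel virtual range instead of poking PTEs by hand.
         */
        void *vaddr = vmap_pfn(pfns, count, pgprot_writecombine(PAGE_KERNEL));

        if (!vaddr)
                return -ENOMEM;
        /* ... use the mapping ... */
        vunmap(vaddr);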

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a flag so that vmap takes ownership of the passed in page array. When
    vfree is called on such an allocation it will put one reference on each
    page, and free the page array itself.
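    A rough usage sketch (assuming the page array was allocated with
    kvmalloc_array(), since vfree() will also free the array):

        struct page **pages = kvmalloc_array(nr_pages, sizeof(*pages),
                                             GFP_KERNEL);
        /* ... take a reference on each page and store it in pages[] ... */

        void *vaddr = vmap(pages, nr_pages, VM_MAP | VM_MAP_PUT_PAGES,
                           PAGE_KERNEL);

        /*
         * Later, a single vfree(vaddr) unmaps the range, drops one
         * reference on every page and frees the pages array itself.
         */
        vfree(vaddr);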

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "remove alloc_vm_area", v4.

    This series removes alloc_vm_area, which was left over from the big
    vmalloc interface rework. It is a rather arcane interface, basically the
    equivalent of get_vm_area + actually faulting in all PTEs in the
    allocated area. It was originally added for Xen (which isn't modular to
    start with), and then grew users in zsmalloc and i915, which mostly
    qualify as abuses of the interface, especially for i915, as a random
    driver should not set up PTE bits directly.

    This patch (of 11):

    * Document that you can call vfree() on an address returned from vmap()
    * Remove the note about the minimum size -- the minimum size of a vmalloc
    allocation is one page
    * Add a Context: section
    * Fix capitalisation
    * Reword the prohibition on calling from NMI context to avoid a double
    negative

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Cc: Chris Wilson
    Cc: Matthew Auld
    Cc: Rodrigo Vivi
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Nitin Gupta
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
    There is a use case where System Management Software (SMS) wants to give
    a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case of
    Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace daemon
    (ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall,
    process_madvise(2). It uses the pidfd of an external process to give the
    hint. It also supports vectored address ranges, because an Android app
    has thousands of vmas due to zygote, so it is a waste of CPU and power to
    call the syscall once per vma. (Testing a 2000-vma syscall vs a 1-vector
    syscall showed a 15% performance improvement; it would likely be bigger
    in practice because the test ran in a very cache-friendly environment.)

    Another potential use case for the vector range is to amortize the cost
    of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this
    could benefit users like TCP receive zerocopy and malloc implementations.
    In the future we may find more use cases for other advice values, so make
    this part of the API now that a new syscall is being introduced. With
    that, existing madvise(2) users could replace it with process_madvise(2),
    passing their own pid, if they want batched address range support.

    Since it could affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS), or something else (e.g., being the
    same UID) that grants the right to ptrace the target process, can use it
    successfully. The flags argument is reserved for future use if we need
    to extend the API.

    I think supporting every hint madvise has (or will have) in
    process_madvise is rather risky, because we are not sure all hints make
    sense from an external process, and the implementation of a hint may
    rely on the caller being in the current context, so it could be
    error-prone. Thus, this patch limits the hints to MADV_[COLD|PAGEOUT].

    If someone wants to add other hints, we can hear the use case and review
    each hint individually. That is safer for maintenance than introducing a
    buggy syscall that is hard to fix later.

    So finally, the API is as follows:

        ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                                unsigned long vlen, int advice,
                                unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges of an external process as well as
    of the local process. It provides the advice for the address ranges of
    the process described by iovec and vlen. The goal of such advice is to
    improve system or application performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (See pidfd_open(2) for further information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

        struct iovec {
            void   *iov_base;    /* starting address */
            size_t  iov_len;     /* number of bytes to be advised */
        };

    Each iovec element describes an address range beginning at iov_base and
    extending for iov_len bytes.

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which, if the target
    process specified by pidfd is external, must currently be one of the
    following:

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to an external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    process_madvise supports every advice that madvise(2) has if the target
    process is in the same thread group as the calling process, so a user
    could use process_madvise(2) as an extension of madvise(2) that supports
    vectored address ranges.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised. This
    return value may be less than the total number of requested bytes if an
    error occurred. The caller should check the return value to determine
    whether a partial advice occurred.
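    To make the interface above concrete, here is a small userspace sketch
    that marks one range of another process as cold. The syscall numbers are
    the x86-64 ones and may need to be provided by hand on older headers;
    error handling is abbreviated:

        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/syscall.h>
        #include <sys/uio.h>
        #include <unistd.h>

        #ifndef __NR_pidfd_open
        #define __NR_pidfd_open      434        /* x86-64 */
        #endif
        #ifndef __NR_process_madvise
        #define __NR_process_madvise 440        /* x86-64 */
        #endif
        #ifndef MADV_COLD
        #define MADV_COLD 20
        #endif

        int main(int argc, char **argv)
        {
                if (argc != 4) {
                        fprintf(stderr, "usage: %s <pid> <addr> <len>\n",
                                argv[0]);
                        return 1;
                }

                pid_t pid = (pid_t)atoi(argv[1]);
                struct iovec iov = {
                        .iov_base = (void *)strtoul(argv[2], NULL, 0),
                        .iov_len  = strtoul(argv[3], NULL, 0),
                };

                /* Get a pidfd referring to the target process. */
                int pidfd = (int)syscall(__NR_pidfd_open, pid, 0);
                if (pidfd < 0) {
                        perror("pidfd_open");
                        return 1;
                }

                /* One iovec entry; needs ptrace-level access to the target. */
                ssize_t ret = syscall(__NR_process_madvise, pidfd, &iov, 1UL,
                                      MADV_COLD, 0U);
                if (ret < 0) {
                        perror("process_madvise");
                        return 1;
                }
                printf("advised %zd bytes\n", ret);
                return 0;
        }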

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How is the race (i.e., object validation) handled between the time
    an external process gives a hint and the time the hint takes effect in
    the target process?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the target
    process can run between the time the caller of process_madvise inspects
    the target process's address space and the time that process_madvise is
    actually called, process_madvise may operate on memory regions that the
    calling process does not expect. It's the responsibility of the process
    calling process_madvise to close this race condition. For example, the
    calling process can suspend the target process with ptrace, SIGSTOP, or
    the freezer cgroup so that it doesn't have an opportunity to change its
    own address space before process_madvise is called. Another option is to
    operate on memory regions that the caller knows a priori will be
    unchanged in the target process. Yet another option is to accept the
    race for certain process_madvise calls after reasoning that mistargeting
    will do no harm. The suggested API itself does not provide
    synchronization. The same applies to other APIs such as move_pages and
    process_vm_write.

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody objects
    to write(2) merely because it's possible for two processes to open the
    same file and clobber each other's writes --- instead, we tell people to
    use flock or something. Think about mmap: it never guarantees that newly
    allocated address space is still valid when the user tries to access it,
    because other threads could unmap the memory right before. That's where
    we need synchronization via another API or a design on the userspace
    side; it shouldn't be part of the API itself. If someone needs more
    fine-grained synchronization than process level, two ideas were
    suggested - a cookie[2] and an anon fd[3]. Both are applicable via the
    reserved last argument of the API, but I don't think they are necessary
    right now since we already have ways to prevent the race, so I don't
    want to add additional complexity with a more fine-grained optimization
    model.

    To keep the API extensible, the last argument is reserved, so we could
    support this in the future if someone really needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such an injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the above-mentioned race and of hinting a wrong
    VMA. Furthermore, we want to apply the hint in the caller's context, not
    the callee's, because the callee is usually limited in cpuset/cgroups or
    even in a frozen state, so it can't act by itself quickly enough, which
    causes more thrashing/kills. It also doesn't work if the target process
    is already being ptraced (e.g., by strace, a debugger, or minidump)
    because a process can have at most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "introduce memory hinting API for external process", v9.

    Now we have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs. With
    them, an application can give the kernel hints about which memory ranges
    it prefers to have reclaimed. However, on some platforms (e.g., Android),
    the information required to make the hinting decision is not known to
    the app. Instead, it is known to a centralized userspace daemon (e.g.,
    ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the concern, this patch introduces a new syscall,
    process_madvise(2). Basically, it's the same as the madvise(2) syscall,
    with some differences:

    1. It needs a pidfd of the target process to provide the hint.

    2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
    moment. Other madvise hints will be opened up when there are explicit
    requests from the community, to prevent unexpected bugs we couldn't
    support.

    3. Only privileged processes can act on another process's address
    space.

    For more detail of the new API, please see "mm: introduce external memory
    hinting API" description in this patchset.

    This patch (of 3):

    In upcoming patches, do_madvise will be called from an external process
    context, so we shouldn't assume "current" is always the hinted process's
    task_struct.

    Furthermore, we must not access the mm_struct via task->mm, but obtain
    it via access_mm() once (in the following patch) and only use that
    pointer [1], so pass it to do_madvise() as well. Note that the
    vma->vm_mm pointers are safe, so we can use them further down the call
    stack.

    For now, pass current->mm as the argument of do_madvise, so existing
    behavior doesn't change; this just prepares for the next patch and makes
    review easy.
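    A sketch of the resulting internal change (signature as described above;
    the final code may differ in detail):

        /* do_madvise() now takes the mm to operate on explicitly ... */
        int do_madvise(struct mm_struct *mm, unsigned long start,
                       size_t len_in, int behavior);

        /* ... and madvise(2) keeps its behavior by passing current->mm. */
        SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in,
                        int, behavior)
        {
                return do_madvise(current->mm, start, len_in, behavior);
        }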

    [vbabka@suse.cz: changelog tweak]
    [minchan@kernel.org: use current->mm for io_uring]
    Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
    [akpm@linux-foundation.org: fix it for upstream changes]
    [akpm@linux-foundation.org: whoops]
    [rdunlap@infradead.org: add missing includes]

    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Daniel Colascione
    Cc: Sandeep Patil
    Cc: Sonny Rao
    Cc: Brian Geffon
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: John Dias
    Cc: Joel Fernandes
    Cc: Alexander Duyck
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Kirill Tkhai
    Cc: Oleksandr Natalenko
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • To be safe against concurrent changes to the VMA tree, we must take the
    mmap lock around GUP operations (excluding the GUP-fast family of
    operations, which will take the mmap lock by themselves if necessary).

    This code is only for testing, and it's only reachable by root through
    debugfs, so this doesn't really have any impact; however, if we want to
    add lockdep asserts into the GUP path, we need to have clean locking here.
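    The locking pattern being enforced, roughly (for the non-fast GUP
    variants; a sketch, not the exact hunk from this patch):

        long pinned;

        mmap_read_lock(current->mm);
        pinned = get_user_pages(addr, nr_pages, gup_flags, pages, NULL);
        mmap_read_unlock(current->mm);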

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Acked-by: Michel Lespinasse
    Cc: "Eric W . Biederman"
    Cc: Mauro Carvalho Chehab
    Cc: Sakari Ailus
    Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • There are two locations that have a block of code for munmapping a vma
    range. Change those two locations to use a function and add meaningful
    comments about what happens to the arguments, which was unclear in the
    previous code.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There are three places that the next vma is required which uses the same
    block of code. Replace the block with a function and add comments on what
    happens in the case where NULL is encountered.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
    There is no need to check whether this process has the right to modify
    the specified process when they are the same. We can also skip the
    security hook call if a process is modifying its own pages. Add a helper
    function to handle these cases.

    Suggested-by: Matthew Wilcox
    Signed-off-by: Hongxiang Lou
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christopher Lameter
    Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
    To calculate the correct node to migrate a page to for hotplug, we need
    to check the node id of the page; a wrapper for alloc_migration_target()
    exists for this purpose.

    However, as Vlastimil points out, all migration source pages come from a
    single node. In this case, we don't need to check the node id for each
    page, and we don't need to re-set the target nodemask for each page via
    the wrapper. Set up the migration_target_control once and use it for all
    pages.
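    A hedged sketch of the idea (struct and helper names as used internally
    in mm around this series; the gfp flags and variable names such as
    first_page and source are illustrative):

        nodemask_t nmask = node_states[N_MEMORY];
        struct migration_target_control mtc = {
                .nid      = page_to_nid(first_page),  /* same for all pages */
                .nmask    = &nmask,
                .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
        };

        /* Prefer nodes other than the one being hot-removed. */
        node_clear(mtc.nid, nmask);
        if (nodes_empty(nmask))
                node_set(mtc.nid, nmask);

        migrate_pages(&source, alloc_migration_target, NULL,
                      (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);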

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a well-defined standard migration target callback. Use it
    directly.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If a memcg to charge can be determined (using remote charging API), there
    are no reasons to exclude allocations made from an interrupt context from
    the accounting.

    Such allocations will pass even if the resulting memcg size exceeds the
    hard limit, but they will affect the application of memory pressure, and
    an inability to put the workload under the limit will eventually trigger
    the OOM killer.

    To use active_memcg() helper, memcg_kmem_bypass() is moved back to
    memcontrol.c.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Remote memcg charging API uses current->active_memcg to store the
    currently active memory cgroup, which overwrites the memory cgroup of the
    current process. It works well for normal contexts, but doesn't work for
    interrupt contexts: indeed, if an interrupt occurs during the execution of
    a section with an active memcg set, all allocations inside the interrupt
    will be charged to the active memcg set (given that we'll enable
    accounting for allocations from an interrupt context). But because the
    interrupt might have no relation to the active memcg set outside, it's
    obviously wrong from the accounting perspective.

    To resolve this problem, let's add a global percpu int_active_memcg
    variable, which will be used to store an active memory cgroup which will
    be used from interrupt contexts. set_active_memcg() will transparently
    use current->active_memcg or int_active_memcg depending on the context.

    To make the read part simple and transparent for the caller, let's
    introduce two new functions:
    - struct mem_cgroup *active_memcg(void),
    - struct mem_cgroup *get_active_memcg(void).

    They return the active memcg if it's set, hiding the implementation
    detail of where to get it from depending on the current context.
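    For illustration, the read side might look roughly like this (a sketch;
    the in-kernel helper may differ in detail):

        static __always_inline struct mem_cgroup *active_memcg(void)
        {
                /* Interrupts use the per-cpu slot, tasks use their own. */
                if (in_interrupt())
                        return this_cpu_read(int_active_memcg);
                else
                        return current->active_memcg;
        }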

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
    There are checks for current->mm and current->active_memcg in
    get_obj_cgroup_from_current(), but these checks are redundant:
    memcg_kmem_bypass(), called just above, performs the same checks.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted, mostly because charging
    the memory cgroup of the current process wasn't an option. Performance
    was likely a reason too.

    The remote charging API allows temporarily overriding the currently
    active memory cgroup, so that all memory allocations are accounted
    towards some specified memory cgroup instead of the memory cgroup of the
    current process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory
    accounting will likely be the first user of it: a typical example is a
    bpf program parsing an incoming network packet, which allocates an entry
    in a hashmap to store some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value, which overwrites the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context;
    however, an attempt to call it from an interrupt context, or simply
    nesting two remote charging blocks, will lead to incorrect accounting.
    On exit from the inner block the active memcg will be cleared instead of
    being restored.

    memalloc_use_memcg(target_memcg);

        memalloc_use_memcg(target_memcg_2);

        memalloc_unuse_memcg();

        Error: allocations here are charged to the memcg of the current
        process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);

    This patch is heavily based on the patch by Johannes Weiner, which can
    be found here: https://lkml.org/lkml/2020/5/28/806.
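    A minimal sketch of such a helper (ignoring the interrupt-context
    handling added later in this series):

        static inline struct mem_cgroup *
        set_active_memcg(struct mem_cgroup *memcg)
        {
                struct mem_cgroup *old = current->active_memcg;

                current->active_memcg = memcg;
                return old;     /* caller restores this when done */
        }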

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

17 Oct, 2020

19 commits

  • Pull documentation updates from Mauro Carvalho Chehab:
    "A series of patches addressing warnings produced by make htmldocs.
    This includes:

    - kernel-doc markup fixes

    - ReST fixes

    - Updates at the build system in order to support newer versions of
    the docs build toolchain (Sphinx)

    After this series, the number of html build warnings should reduce
    significantly, and building with Sphinx 3.1 or later should now be
    supported (although it is still recommended to use Sphinx 2.4.4).

    As agreed with Jon, I should be sending you a late pull request by the
    end of the merge window addressing remaining issues with docs build,
    as there are a number of warning fixes that depends on pull requests
    that should be happening along the merge window.

    The end goal is to have a clean htmldocs build on Kernel 5.10.

    PS. It should be noted that Sphinx 3.0 is not currently supported, as it
    lacks support for C domain namespaces. Such a feature, needed in order
    to document uAPI system calls with Sphinx 3.x, was added only in Sphinx
    3.1"

    * tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (75 commits)
    PM / devfreq: remove a duplicated kernel-doc markup
    mm/doc: fix a literal block markup
    workqueue: fix a kernel-doc warning
    docs: virt: user_mode_linux_howto_v2.rst: fix a literal block markup
    Input: sparse-keymap: add a description for @sw
    rcu/tree: docs: document bkvcache new members at struct kfree_rcu_cpu
    nl80211: docs: add a description for s1g_cap parameter
    usb: docs: document altmode register/unregister functions
    kunit: test.h: fix a bad kernel-doc markup
    drivers: core: fix kernel-doc markup for dev_err_probe()
    docs: bio: fix a kerneldoc markup
    kunit: test.h: solve kernel-doc warnings
    block: bio: fix a warning at the kernel-doc markups
    docs: powerpc: syscall64-abi.rst: fix a malformed table
    drivers: net: hamradio: fix document location
    net: appletalk: Kconfig: Fix docs location
    dt-bindings: fix references to files converted to yaml
    memblock: get rid of a :c:type leftover
    math64.h: kernel-docs: Convert some markups into normal comments
    media: uAPI: buffer.rst: remove a left-over documentation
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:
    "155 patches.

    Subsystems affected by this patch series: mm (dax, debug, thp,
    readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
    core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
    binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
    romfs, and fault-injection"

    * emailed patches from Andrew Morton : (155 commits)
    lib, uaccess: add failure injection to usercopy functions
    lib, include/linux: add usercopy failure capability
    ROMFS: support inode blocks calculation
    ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
    sched.h: drop in_ubsan field when UBSAN is in trap mode
    scripts/gdb/tasks: add headers and improve spacing format
    scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
    kernel/relay.c: drop unneeded initialization
    panic: dump registers on panic_on_warn
    rapidio: fix the missed put_device() for rio_mport_add_riodev
    rapidio: fix error handling path
    nilfs2: fix some kernel-doc warnings for nilfs2
    autofs: harden ioctl table
    ramfs: fix nommu mmap with gaps in the page cache
    mm: remove the now-unnecessary mmget_still_valid() hack
    mm/gup: take mmap_lock in get_dump_page()
    binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
    coredump: rework elf/elf_fdpic vma_dump_size() into common helper
    coredump: refactor page range dumping into common helper
    coredump: let dump_emit() bail out on short writes
    ...

    Linus Torvalds
     
  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Properly take the mmap_lock before calling into the GUP code from
    get_dump_page(); and play nice, allowing the GUP code to drop the
    mmap_lock if it has to sleep.

    As Linus pointed out, we don't actually need the VMA because
    __get_user_pages() will flush the dcache for us if necessary.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

    At the moment, we have that rather ugly mmget_still_valid() helper to
    work around the following problem: ELF core dumping doesn't take the
    mmap_sem while traversing the task's VMAs, and if anything (like
    userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
    at the moment we use mmget_still_valid() to bail out in any writers that
    might be operating on a remote mm's VMAs.

    With this series, I'm trying to get rid of the need for that as cleanly as
    possible. ("cleanly" meaning "avoid holding the mmap_lock across
    unbounded sleeps".)

    Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
    dumping code.

    Patches 5 and 6 implement the main change: Instead of repeatedly accessing
    the VMA list with sleeps in between, we snapshot it at the start with
    proper locking, and then later we just use our copy of the VMA list. This
    ensures that the kernel won't crash, that VMA metadata in the coredump is
    consistent even in the presence of concurrent modifications, and that any
    virtual addresses that aren't being concurrently modified have their
    contents show up in the core dump properly.

    The disadvantage of this approach is that we need a bit more memory during
    core dumping for storing metadata about all VMAs.

    At the end of the series, patch 7 removes the old workaround for this
    issue (mmget_still_valid()).

    I have tested:

    - Creating a simple core dump on X86-64 still works.
    - The created coredump on X86-64 opens in GDB and looks plausible.
    - X86-64 core dumps contain the first page for executable mappings at
    offset 0, and don't contain the first page for non-executable file
    mappings or executable mappings at offset !=0.
    - NOMMU 32-bit ARM can still generate plausible-looking core dumps
    through the FDPIC implementation. (I can't test this with GDB because
    GDB is missing some structure definition for nommu ARM, but I've
    poked around in the hexdump and it looked decent.)

    This patch (of 7):

    dump_emit() is for kernel pointers, and VMAs describe userspace memory.
    Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
    even if it probably doesn't matter much on !MMU systems - especially given
    that it looks like we can just use the same get_dump_page() as on MMU if
    we move it out of the CONFIG_MMU block.

    One small change we have to make in get_dump_page() is to use
    __get_user_pages_locked() instead of __get_user_pages(), since the latter
    doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
    just call __get_user_pages() for us.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
    Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • The current page_order() can only be called on pages in the buddy
    allocator. For compound pages, you have to use compound_order(). This is
    confusing and led to a bug, so rename page_order() to buddy_order().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
    In commit 1da177e4c3f4 ("Linux-2.6.12-rc2"), the helper
    put_write_access() came with the atomic_dec operation on the
    i_writecount field, but __vma_link_file() and dup_mmap() were never
    converted to use this helper.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200924115235.5111-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
    Fix the following warnings caused by a mismatch between function
    parameters and comments.

    mm/workingset.c:228: warning: Function parameter or member 'lruvec' not described in 'workingset_age_nonresident'
    mm/workingset.c:228: warning: Excess function parameter 'memcg' description in 'workingset_age_nonresident'

    Signed-off-by: Xiaofei Tan
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/1600485913-11192-1-git-send-email-tanxiaofei@huawei.com
    Signed-off-by: Linus Torvalds

    Xiaofei Tan
     
    Correct the function name "get_partials" to "get_partial". Update the
    old struct name list3 to kmem_cache_node.

    Signed-off-by: Chen Tao
    Signed-off-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Link: https://lkml.kernel.org/r/Message-ID:
    Signed-off-by: Linus Torvalds

    Chen Tao
     
    Fix some broken comments, including typos, grammar errors and wrong
    function names.

    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20200913095456.54873-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • Signed-off-by: Yu Zhao
    Signed-off-by: Andrew Morton
    Cc: Alex Shi
    Link: http://lkml.kernel.org/r/20200831175042.3527153-2-yuzhao@google.com
    Signed-off-by: Linus Torvalds

    Yu Zhao
     
  • The #endif at the end of the file matches up with the '#if
    defined(HASHED_PAGE_VIRTUAL)' on line 374. Not the CONFIG_HIGHMEM #if
    earlier.

    Fix comments on both of the #endif's to indicate the correct end of
    blocks for each.

    Signed-off-by: Ira Weiny
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Mike Rapoport
    Link: https://lkml.kernel.org/r/20200819184635.112579-1-ira.weiny@intel.com
    Signed-off-by: Linus Torvalds

    Ira Weiny
     
  • list_for_each_entry_safe() guarantees that we will never stumble over the
    list head; "&page->lru != list" will always evaluate to true. Let's
    simplify.

    [david@redhat.com: Changelog refinements]

    Signed-off-by: Wei Yang
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Alexander Duyck
    Link: http://lkml.kernel.org/r/20200818084448.33969-1-richard.weiyang@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • Remove duplicate header which is included twice.

    Signed-off-by: YueHaibing
    Signed-off-by: Andrew Morton
    Reviewed-by: Pekka Enberg
    Link: http://lkml.kernel.org/r/20200818114323.58156-1-yuehaibing@huawei.com
    Signed-off-by: Linus Torvalds

    YueHaibing
     
  • As we no longer shuffle via generic_online_page() and when undoing
    isolation, we can simplify the comment.

    We now effectively shuffle only once (properly) when onlining new memory.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Wei Yang
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Wei Yang
    Cc: Oscar Salvador
    Cc: Mike Rapoport
    Cc: Pankaj Gupta
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Scott Cheloha
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20201005121534.15649-6-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • __free_pages_core() is used when exposing fresh memory to the buddy during
    system boot and when onlining memory in generic_online_page().

    generic_online_page() is used in two cases:

    1. Direct memory onlining in online_pages().
    2. Deferred memory onlining in memory-ballooning-like mechanisms (HyperV
    balloon and virtio-mem), when parts of a section are kept
    fake-offline to be fake-onlined later on.

    In 1, we already place pages to the tail of the freelist. Pages will be
    freed to MIGRATE_ISOLATE lists first and moved to the tail of the
    freelists via undo_isolate_page_range().

    In 2, we currently don't implement a proper rule. In the case of
    virtio-mem, where we currently always online MAX_ORDER - 1 pages, the
    pages will be placed at the HEAD of the freelist - undesirable. While
    the Hyper-V balloon calls generic_online_page() with single pages,
    usually it will call it on successive single pages in a larger block.

    The pages are fresh, so place them at the tail of the freelist and avoid
    the PCP. In __free_pages_core(), remove the now superfluous call to
    set_page_refcounted() and add a comment regarding page initialization
    and the refcount.

    Note: In 2. we currently don't shuffle. If ever relevant (page shuffling
    is usually of limited use in virtualized environments), we might want to
    shuffle after a sequence of generic_online_page() calls in the relevant
    callers.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Acked-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Mike Rapoport
    Cc: "K. Y. Srinivasan"
    Cc: Haiyang Zhang
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Matthew Wilcox
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Scott Cheloha
    Link: https://lkml.kernel.org/r/20201005121534.15649-5-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Whenever we move pages between freelists via move_to_free_list()/
    move_freepages_block(), we don't actually touch the pages:
    1. Page isolation doesn't actually touch the pages, it simply isolates
    pageblocks and moves all free pages to the MIGRATE_ISOLATE freelist.
    When undoing isolation, we move the pages back to the target list.
    2. Page stealing (steal_suitable_fallback()) moves free pages directly
    between lists without touching them.
    3. reserve_highatomic_pageblock()/unreserve_highatomic_pageblock() moves
    free pages directly between freelists without touching them.

    We already place pages to the tail of the freelists when undoing
    isolation via __putback_isolated_page(); let's do it in any case (e.g.,
    if order <= pageblock_order) and document the behavior.

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Acked-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Alexander Duyck
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mike Rapoport
    Cc: Scott Cheloha
    Cc: Michael Ellerman
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20201005121534.15649-4-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • __putback_isolated_page() already documents that pages will be placed to
    the tail of the freelist - this is, however, not the case for "order >=
    MAX_ORDER - 2" (see buddy_merge_likely()) - which should be the case for
    all existing users.

    This change affects two users:
    - free page reporting
    - page isolation, when undoing the isolation (including memory onlining).

    This behavior is desirable for pages that haven't really been touched
    lately, so exactly the two users that don't actually read/write page
    content, but rather move untouched pages.

    The new behavior is especially desirable for memory onlining, where we
    allow allocation of newly onlined pages via undo_isolate_page_range() in
    online_pages(). Right now, we always place them to the head of the
    freelist, resulting in undesirable behavior: Assume we add individual
    memory chunks via add_memory() and online them right away to the NORMAL
    zone. We create a dependency chain of unmovable allocations e.g., via the
    memmap. The memmap of the next chunk will be placed onto previous chunks
    - if the last block cannot get offlined+removed, all dependent ones cannot
    get offlined+removed. While this can already be observed with individual
    DIMMs, it's more of an issue for virtio-mem (and I suspect also ppc
    DLPAR).

    Document that this should only be used for optimizations, and no code
    should rely on this behavior for correctness (if the order of the freelists
    ever changes).

    We won't care about page shuffling: memory onlining already properly
    shuffles after onlining. free page reporting doesn't care about
    physically contiguous ranges, and there are already cases where page
    isolation will simply move (physically close) free pages to (currently)
    the head of the freelists via move_freepages_block() instead of shuffling.
    If this becomes ever relevant, we should shuffle the whole zone when
    undoing isolation of larger ranges, and after free_contig_range().

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Alexander Duyck
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Vlastimil Babka
    Cc: Mike Rapoport
    Cc: Scott Cheloha
    Cc: Michael Ellerman
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Matthew Wilcox
    Cc: Michal Hocko
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Link: https://lkml.kernel.org/r/20201005121534.15649-3-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     
  • Patch series "mm: place pages to the freelist tail when onlining and undoing isolation", v2.

    When adding separate memory blocks via add_memory*() and onlining them
    immediately, the metadata (especially the memmap) of the next block will
    be placed onto one of the just added+onlined blocks. This creates a
    chain of unmovable allocations: if the last memory block cannot get
    offlined+removed, neither can any of the dependent ones. We directly end
    up with unmovable allocations all over the place.

    This can be observed quite easily using virtio-mem, however, it can also
    be observed when using DIMMs. The freshly onlined pages will usually be
    placed at the head of the freelists, meaning they will be allocated
    next, usually turning the just-added memory immediately un-removable.
    The fresh pages are cold, so preferring to allocate other pages (that
    might be hot) also feels like the natural thing to do.

    It also applies to the Hyper-V balloon, the Xen balloon, and ppc64
    dlpar: when adding separate, successive memory blocks, each memory block
    will have unmovable allocations on it - for example, gigantic pages will
    fail to allocate.

    While the ZONE_NORMAL doesn't provide any guarantees that memory can get
    offlined+removed again (any kind of fragmentation with unmovable
    allocations is possible), there are many scenarios (hotplugging a lot of
    memory, running workload, hotunplug some memory/as much as possible) where
    we can offline+remove quite a lot with this patchset.

    a) To visualize the problem, a very simple example:

    Start a VM with 4GB and 8GB of virtio-mem memory:

    [root@localhost ~]# lsmem
    RANGE SIZE STATE REMOVABLE BLOCK
    0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
    0x0000000100000000-0x000000033fffffff 9G online yes 32-103

    Memory block size: 128M
    Total online memory: 12G
    Total offline memory: 0B

    Then try to unplug as much as possible using virtio-mem. Observe which
    memory blocks are still around. Without this patch set:

    [root@localhost ~]# lsmem
    RANGE SIZE STATE REMOVABLE BLOCK
    0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
    0x0000000100000000-0x000000013fffffff 1G online yes 32-39
    0x0000000148000000-0x000000014fffffff 128M online yes 41
    0x0000000158000000-0x000000015fffffff 128M online yes 43
    0x0000000168000000-0x000000016fffffff 128M online yes 45
    0x0000000178000000-0x000000017fffffff 128M online yes 47
    0x0000000188000000-0x0000000197ffffff 256M online yes 49-50
    0x00000001a0000000-0x00000001a7ffffff 128M online yes 52
    0x00000001b0000000-0x00000001b7ffffff 128M online yes 54
    0x00000001c0000000-0x00000001c7ffffff 128M online yes 56
    0x00000001d0000000-0x00000001d7ffffff 128M online yes 58
    0x00000001e0000000-0x00000001e7ffffff 128M online yes 60
    0x00000001f0000000-0x00000001f7ffffff 128M online yes 62
    0x0000000200000000-0x0000000207ffffff 128M online yes 64
    0x0000000210000000-0x0000000217ffffff 128M online yes 66
    0x0000000220000000-0x0000000227ffffff 128M online yes 68
    0x0000000230000000-0x0000000237ffffff 128M online yes 70
    0x0000000240000000-0x0000000247ffffff 128M online yes 72
    0x0000000250000000-0x0000000257ffffff 128M online yes 74
    0x0000000260000000-0x0000000267ffffff 128M online yes 76
    0x0000000270000000-0x0000000277ffffff 128M online yes 78
    0x0000000280000000-0x0000000287ffffff 128M online yes 80
    0x0000000290000000-0x0000000297ffffff 128M online yes 82
    0x00000002a0000000-0x00000002a7ffffff 128M online yes 84
    0x00000002b0000000-0x00000002b7ffffff 128M online yes 86
    0x00000002c0000000-0x00000002c7ffffff 128M online yes 88
    0x00000002d0000000-0x00000002d7ffffff 128M online yes 90
    0x00000002e0000000-0x00000002e7ffffff 128M online yes 92
    0x00000002f0000000-0x00000002f7ffffff 128M online yes 94
    0x0000000300000000-0x0000000307ffffff 128M online yes 96
    0x0000000310000000-0x0000000317ffffff 128M online yes 98
    0x0000000320000000-0x0000000327ffffff 128M online yes 100
    0x0000000330000000-0x000000033fffffff 256M online yes 102-103

    Memory block size: 128M
    Total online memory: 8.1G
    Total offline memory: 0B

    With this patch set:

    [root@localhost ~]# lsmem
    RANGE SIZE STATE REMOVABLE BLOCK
    0x0000000000000000-0x00000000bfffffff 3G online yes 0-23
    0x0000000100000000-0x000000013fffffff 1G online yes 32-39

    Memory block size: 128M
    Total online memory: 4G
    Total offline memory: 0B

    All memory can get unplugged, all memory block can get removed. Of
    course, no workload ran and the system was basically idle, but it
    highlights the issue - the fairly deterministic chain of unmovable
    allocations. When a huge page for the 2MB memmap is needed, a
    just-onlined 4MB page will be split. The remaining 2MB page will be used
    for the memmap of the next memory block. So one memory block will hold
    the memmap of the two following memory blocks. Finally the pages of the
    last-onlined memory block will get used for the next bigger allocations -
    if any allocation is unmovable, all dependent memory blocks cannot get
    unplugged and removed until that allocation is gone.

    Note that with bigger memory blocks (e.g., 256MB), *all* memory
    blocks are dependent and none can get unplugged again!

    b) Experiment with memory intensive workload

    I performed an experiment with an older version of this patch set
    (before we used undo_isolate_page_range() in online_pages()): hotplug
    56GB to a VM with an initial 4GB, onlining all memory to ZONE_NORMAL
    right from the kernel when adding it. I then ran various
    memory-intensive workloads that consume most system memory for a total
    of 45 minutes. Once finished, I tried to unplug as much memory as
    possible.

    With this change, I am able to remove via virtio-mem (adding individual
    128MB memory blocks) 413 out of 448 added memory blocks. Via individual
    (256MB) DIMMs 380 out of 448 added memory blocks. (I don't have any
    numbers without this patchset, but looking at the above example, it's at
    most half of the 448 memory blocks for virtio-mem, and most probably none
    for DIMMs).

    Again, there are workloads that might behave very differently due to the
    nature of ZONE_NORMAL.

    This change also affects (besides memory onlining):
    - Other users of undo_isolate_page_range(): Pages are always placed to the
    tail.
    -- When memory offlining fails
    -- When memory isolation fails after having isolated some pageblocks
    -- When alloc_contig_range() either succeeds or fails
    - Other users of __putback_isolated_page(): Pages are always placed to the
    tail.
    -- Free page reporting
    - Other users of __free_pages_core()
    -- AFAIK, any memory that is getting exposed to the buddy during boot.
    IIUC we will now usually allocate memory from lower addresses within
    a zone first (especially during boot).
    - Other users of generic_online_page()
    -- Hyper-V balloon

    This patch (of 5):

    Let's prepare for additional flags and avoid long parameter lists of
    bools. Follow-up patches will also make use of the flags in
    __free_pages_ok().
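    An illustrative sketch of the kind of flag type this refers to (names
    are indicative of the series; see mm/page_alloc.c for the real
    definitions):

        /* Free Page Internal flags, passed down the page freeing path. */
        typedef int __bitwise fpi_t;

        #define FPI_NONE                ((__force fpi_t)0)
        /* e.g. skip notifying the free page reporting infrastructure */
        #define FPI_SKIP_REPORT_NOTIFY  ((__force fpi_t)BIT(0))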

    Signed-off-by: David Hildenbrand
    Signed-off-by: Andrew Morton
    Reviewed-by: Alexander Duyck
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Oscar Salvador
    Reviewed-by: Wei Yang
    Reviewed-by: Pankaj Gupta
    Acked-by: Michal Hocko
    Cc: Mel Gorman
    Cc: Dave Hansen
    Cc: Mike Rapoport
    Cc: Matthew Wilcox
    Cc: Haiyang Zhang
    Cc: "K. Y. Srinivasan"
    Cc: Michael Ellerman
    Cc: Michal Hocko
    Cc: Scott Cheloha
    Cc: Stephen Hemminger
    Cc: Wei Liu
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201005121534.15649-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20201005121534.15649-2-david@redhat.com
    Signed-off-by: Linus Torvalds

    David Hildenbrand