14 Apr, 2018

5 commits

  • cache_reap() is initially scheduled in start_cpu_timer() via
    schedule_delayed_work_on(). But then the next iterations are scheduled
    via schedule_delayed_work(), i.e. using WORK_CPU_UNBOUND.

    Thus since commit ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND
    work on wq_unbound_cpumask CPUs") there is no guarantee the future
    iterations will run on the originally intended cpu, although it's still
    preferred. I was able to demonstrate this with
    /sys/module/workqueue/parameters/debug_force_rr_cpu. IIUC, it may also
    happen due to migrating timers in nohz context. As a result, some CPUs
    would end up calling cache_reap() more frequently and others never.

    This patch uses schedule_delayed_work_on() with the current cpu when
    scheduling the next iteration.

    Link: http://lkml.kernel.org/r/20180411070007.32225-1-vbabka@suse.cz
    Fixes: ef557180447f ("workqueue: schedule WORK_CPU_UNBOUND work on wq_unbound_cpumask CPUs")
    Signed-off-by: Vlastimil Babka
    Acked-by: Pekka Enberg
    Acked-by: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: David Rientjes
    Cc: Tejun Heo
    Cc: Lai Jiangshan
    Cc: John Stultz
    Cc: Thomas Gleixner
    Cc: Stephen Boyd
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vlastimil Babka
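
    A minimal sketch of the rescheduling change described in this entry
    (cache_reap() re-arming itself in mm/slab.c); the names follow the
    existing code, but treat the snippet as illustrative rather than the
    exact diff:

        out:
                /* before: schedule_delayed_work(work, ...) used WORK_CPU_UNBOUND */
                /* after: keep the next iteration on the CPU we are running on   */
                schedule_delayed_work_on(smp_processor_id(), work,
                                         round_jiffies_relative(REAPTIMEOUT_AC));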
     
  • Building orangefs on MMU-less machines now results in a link error
    because of the newly introduced use of the filemap_page_mkwrite()
    function:

    ERROR: "filemap_page_mkwrite" [fs/orangefs/orangefs.ko] undefined!

    This adds a dummy version for it, similar to the existing
    generic_file_mmap and generic_file_readonly_mmap stubs in the same file,
    to avoid the link error without adding #ifdefs in each file system that
    uses these.

    Link: http://lkml.kernel.org/r/20180409105555.2439976-1-arnd@arndb.de
    Fixes: a5135eeab2e5 ("orangefs: implement vm_ops->fault")
    Signed-off-by: Arnd Bergmann
    Reviewed-by: Jan Kara
    Reviewed-by: Andrew Morton
    Cc: Martin Brandenburg
    Cc: Mike Marshall
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arnd Bergmann
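
    A sketch of what such a no-MMU stub can look like (in mm/nommu.c,
    alongside the existing stubs); the int return type matches the 4.16-era
    ->page_mkwrite prototype, and the exact body upstream may differ:

        /* Stub so filesystems link on !MMU. A shared writable mapping cannot
         * exist without an MMU, so this handler should never be reached.
         */
        int filemap_page_mkwrite(struct vm_fault *vmf)
        {
                BUG();
                return 0;
        }
        EXPORT_SYMBOL(filemap_page_mkwrite);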
     
  • __get_user_pages_fast handles errors differently from
    get_user_pages_fast: the former always returns the number of pages
    pinned, while the latter may return a negative error code.

    Link: http://lkml.kernel.org/r/1522962072-182137-6-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
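
    An illustrative (hypothetical) pair of caller fragments for the two
    documented contracts; fall_back_to_slow_path() is a placeholder, not a
    real helper:

        /* __get_user_pages_fast(): returns how many pages were pinned
         * (0..nr_pages), never a negative errno.
         */
        nr = __get_user_pages_fast(start, nr_pages, write, pages);
        if (nr < nr_pages)
                fall_back_to_slow_path(start + nr * PAGE_SIZE, nr_pages - nr);

        /* get_user_pages_fast(): may instead return a negative errno. */
        ret = get_user_pages_fast(start, nr_pages, write, pages);
        if (ret < 0)
                return ret;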
     
  • get_user_pages_fast is supposed to be a faster drop-in equivalent of
    get_user_pages. As such, callers expect it to return a negative return
    code when passed an invalid address, and never expect it to return 0
    when passed a positive number of pages, since its documentation says:

    * Returns number of pages pinned. This may be fewer than the number
    * requested. If nr_pages is 0 or negative, returns 0. If no pages
    * were pinned, returns -errno.

    When get_user_pages_fast falls back on get_user_pages, this is exactly
    what happens. Unfortunately the implementation is inconsistent: it
    returns 0 if passed a kernel address, confusing callers. For example,
    the following is pretty common but does not do the right thing when
    given a kernel address:

    ret = get_user_pages_fast(addr, 1, writeable, &page);
    if (ret < 0)
    return ret;

    Change get_user_pages_fast to return -EFAULT when supplied a kernel
    address to make it match expectations.

    All callers have been audited for consistency with the documented
    semantics.

    Link: http://lkml.kernel.org/r/1522962072-182137-4-git-send-email-mst@redhat.com
    Fixes: 5b65c4677a57 ("mm, x86/mm: Fix performance regression in get_user_pages_fast()")
    Signed-off-by: Michael S. Tsirkin
    Reported-by: syzbot+6304bf97ef436580fede@syzkaller.appspotmail.com
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
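
    A sketch of the intended check near the top of get_user_pages_fast()
    (mm/gup.c); VERIFY_READ/VERIFY_WRITE reflect the 4.16-era access_ok()
    signature and the exact placement is illustrative:

        unsigned long len = (unsigned long)nr_pages << PAGE_SHIFT;

        if (nr_pages <= 0)
                return 0;

        /* A kernel address is an error for callers, not "0 pages pinned". */
        if (unlikely(!access_ok(write ? VERIFY_WRITE : VERIFY_READ,
                                (void __user *)start, len)))
                return -EFAULT;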
     
  • Patch series "mm/get_user_pages_fast fixes, cleanups", v2.

    It turns out that get_user_pages_fast and __get_user_pages_fast return
    different values on error when given a single page: __get_user_pages_fast
    returns 0, while get_user_pages_fast returns either 0 or an error.

    Callers of get_user_pages_fast expect an error so fix it up to return an
    error consistently.

    Stress the difference between get_user_pages_fast and
    __get_user_pages_fast to make sure callers aren't confused.

    This patch (of 3):

    __gup_benchmark_ioctl does not handle the case where get_user_pages_fast
    fails:

    - a negative return code will cause a buffer overrun

    - returning with partial success will cause use of uninitialized
    memory.

    [akpm@linux-foundation.org: simplification]
    Link: http://lkml.kernel.org/r/1522962072-182137-3-git-send-email-mst@redhat.com
    Signed-off-by: Michael S. Tsirkin
    Reviewed-by: Andrew Morton
    Cc: Kirill A. Shutemov
    Cc: Huang Ying
    Cc: Jonathan Corbet
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Thorsten Leemhuis
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael S. Tsirkin
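
    A simplified sketch of the fixed loop in __gup_benchmark_ioctl()
    (mm/gup_benchmark.c); field and variable names follow the file, but the
    details are illustrative:

        for (addr = gup->addr; addr < gup->addr + gup->size; addr = next) {
                nr = gup->nr_pages_per_call;
                next = addr + nr * PAGE_SIZE;
                if (next > gup->addr + gup->size) {
                        next = gup->addr + gup->size;
                        nr = (next - addr) / PAGE_SIZE;
                }

                nr = get_user_pages_fast(addr, nr, write, pages + i);
                if (nr <= 0)            /* error or no progress: stop cleanly */
                        break;
                i += nr;                /* only count pages actually pinned */
        }

        /* report back only the range that really got pinned */
        gup->size = addr - gup->addr;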
     

12 Apr, 2018

35 commits

  • Remove the address_space ->tree_lock and use the xa_lock newly added to
    the radix_tree_root. Rename the address_space ->page_tree to ->i_pages,
    since we don't really care that it's a tree.

    [willy@infradead.org: fix nds32, fs/dax.c]
    Link: http://lkml.kernel.org/r/20180406145415.GB20605@bombadil.infradead.org
    Link: http://lkml.kernel.org/r/20180313132639.17387-9-willy@infradead.org
    Signed-off-by: Matthew Wilcox
    Acked-by: Jeff Layton
    Cc: Darrick J. Wong
    Cc: Dave Chinner
    Cc: Ryusuke Konishi
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
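
    An illustrative before/after of the locking pattern this conversion
    produces at call sites (not a specific hunk from the series; error
    handling omitted):

        /* before */
        spin_lock_irq(&mapping->tree_lock);
        radix_tree_insert(&mapping->page_tree, index, page);
        spin_unlock_irq(&mapping->tree_lock);

        /* after: ->page_tree is now ->i_pages, locked by the embedded xa_lock */
        xa_lock_irq(&mapping->i_pages);
        radix_tree_insert(&mapping->i_pages, index, page);
        xa_unlock_irq(&mapping->i_pages);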
     
  • Juergen Gross noticed that commit f7f99100d8d ("mm: stop zeroing memory
    during allocation in vmemmap") broke XEN PV domains when deferred struct
    page initialization is enabled.

    This is because Xen's PagePinned() flag gets erased from struct pages
    when they are initialized later in boot.

    Juergen fixed this problem by disabling deferred pages on Xen PV
    domains. It is desirable, however, to have this feature available, as
    it reduces boot time. This fix re-enables the feature for PV domains
    and fixes the problem the following way:

    The fix is to delay setting PagePinned flag until struct pages for all
    allocated memory are initialized, i.e. until after free_all_bootmem().

    A new x86_init.hyper op, init_after_bootmem(), is called to let Xen know
    that the boot allocator is done and hence struct pages for all the
    allocated memory are now initialized. If deferred page initialization
    is enabled, the rest of the struct pages are going to be initialized
    later in boot once page_alloc_init_late() is called.

    xen_after_bootmem() walks page table's pages and marks them pinned.

    Link: http://lkml.kernel.org/r/20180226160112.24724-2-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Acked-by: Ingo Molnar
    Reviewed-by: Juergen Gross
    Tested-by: Juergen Gross
    Cc: Daniel Jordan
    Cc: Pavel Tatashin
    Cc: Alok Kataria
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Boris Ostrovsky
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Andy Lutomirski
    Cc: Laura Abbott
    Cc: Kirill A. Shutemov
    Cc: Borislav Petkov
    Cc: Mathias Krause
    Cc: Jinbum Park
    Cc: Dan Williams
    Cc: Baoquan He
    Cc: Jia Zhang
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Stefano Stabellini
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     
  • Patch series "mm: introduce MAP_FIXED_NOREPLACE", v2.

    This started as a follow-up discussion [3][4] about the runtime failure
    caused by the hardening patch [5], which removes MAP_FIXED from the ELF
    loader because MAP_FIXED is inherently dangerous: it might silently
    clobber an existing underlying mapping (e.g. the stack). The reason for
    the failure is that some architectures enforce an alignment for the
    given address hint even without MAP_FIXED (e.g. for shared or file
    backed mappings).

    One way around this would be excluding those archs which do alignment
    tricks from the hardening [6]. The patch is really trivial, but it was
    objected, rightfully so, that this screams for a more generic solution.
    We basically want a non-destructive MAP_FIXED.

    The first patch introduces MAP_FIXED_NOREPLACE, which enforces the given
    address but, unlike MAP_FIXED, fails with EEXIST if the given range
    conflicts with an existing one. The flag is introduced as a completely
    new one rather than as a MAP_FIXED extension because of backward
    compatibility. We really want a never-clobber semantic even on older
    kernels which do not recognize the flag. Unfortunately mmap sucks
    wrt flags evaluation because we do not EINVAL on unknown flags. On
    those kernels we would simply get the traditional hint based semantic,
    so the caller can still get a different address (which sucks) but at
    least not silently corrupt an existing mapping. I do not see a good way
    around that, short of not exposing the new semantic to userspace at all.

    It seems there are users who would like to have something like that.
    Jemalloc has been mentioned by Michael Ellerman [7].

    Florian Weimer has mentioned the following:
    : glibc ld.so currently maps DSOs without hints. This means that the kernel
    : will map them right next to each other, and the offsets between them are
    : completely predictable. We would like to change that and supply a random address in a
    : window of the address space. If there is a conflict, we do not want the
    : kernel to pick a non-random address. Instead, we would try again with a
    : random address.

    John Hubbard has mentioned a CUDA example:
    : a) Searches /proc//maps for a "suitable" region of available
    : VA space. "Suitable" generally means it has to have a base address
    : within a certain limited range (a particular device model might
    : have odd limitations, for example), it has to be large enough, and
    : alignment has to be large enough (again, various devices may have
    : constraints that lead us to do this).
    :
    : This is of course subject to races with other threads in the process.
    :
    : Let's say it finds a region starting at va.
    :
    : b) Next it does:
    : p = mmap(va, ...)
    :
    : *without* setting MAP_FIXED, of course (so va is just a hint), to
    : attempt to safely reserve that region. If p != va, then in most cases,
    : this is a failure (almost certainly due to another thread getting a
    : mapping from that region before we did), and so this layer now has to
    : call munmap(), before returning a "failure: retry" to upper layers.
    :
    : IMPROVEMENT: --> if instead, we could call this:
    :
    : p = mmap(va, ... MAP_FIXED_NOREPLACE ...)
    :
    : , then we could skip the munmap() call upon failure. This
    : is a small thing, but it is useful here. (Thanks to Piotr
    : Jaroszynski and Mark Hairgrove for helping me get that detail
    : exactly right, btw.)
    :
    : c) After that, CUDA suballocates from p, via:
    :
    : q = mmap(sub_region_start, ... MAP_FIXED ...)
    :
    : Interestingly enough, "freeing" is also done via MAP_FIXED, and
    : setting PROT_NONE to the subregion. Anyway, I just included (c) for
    : general interest.

    Atomic address range probing in multithreaded programs in general
    sounds like an interesting thing to me.

    The second patch simply replaces the MAP_FIXED use in the ELF loader
    with MAP_FIXED_NOREPLACE. I believe other places which rely on MAP_FIXED
    should follow. Actually, real MAP_FIXED usages should be documented
    properly and should be more of an exception.

    [1] http://lkml.kernel.org/r/20171116101900.13621-1-mhocko@kernel.org
    [2] http://lkml.kernel.org/r/20171129144219.22867-1-mhocko@kernel.org
    [3] http://lkml.kernel.org/r/20171107162217.382cd754@canb.auug.org.au
    [4] http://lkml.kernel.org/r/1510048229.12079.7.camel@abdul.in.ibm.com
    [5] http://lkml.kernel.org/r/20171023082608.6167-1-mhocko@kernel.org
    [6] http://lkml.kernel.org/r/20171113094203.aofz2e7kueitk55y@dhcp22.suse.cz
    [7] http://lkml.kernel.org/r/87efp1w7vy.fsf@concordia.ellerman.id.au

    This patch (of 2):

    MAP_FIXED is used quite often to enforce a mapping at a particular range.
    The main problem of this flag is, however, that it is inherently dangerous
    because it unmaps existing mappings covered by the requested range, which
    can cause silent memory corruptions, some of them even with serious
    security implications. While the current semantic might be really
    desirable in many cases, there are others which would want to enforce the
    given range but rather see a failure than a silent memory corruption on a
    clashing range. Please note that there is no guarantee that a given range
    is obeyed by mmap even when it is free - e.g. arch specific code is
    allowed to apply an alignment.

    Introduce a new MAP_FIXED_NOREPLACE flag for mmap to achieve this
    behavior. It has the same semantic as MAP_FIXED wrt. the given address
    request, with the single exception that it fails with EEXIST if the
    requested address is already covered by an existing mapping. We still
    rely on get_unmapped_area to handle all the arch specific MAP_FIXED
    treatment and check for a conflicting vma after it returns.

    The flag is introduced as a completely new one rather than as a MAP_FIXED
    extension because of backward compatibility. We really want a
    never-clobber semantic even on older kernels which do not recognize the
    flag. Unfortunately mmap sucks wrt. flags evaluation because we do not
    EINVAL on unknown flags. On those kernels we would simply get the
    traditional hint based semantic, so the caller can still get a different
    address (which sucks) but at least not silently corrupt an existing
    mapping. I do not see a good way around that.

    [mpe@ellerman.id.au: fix whitespace]
    [fail on clashing range with EEXIST as per Florian Weimer]
    [set MAP_FIXED before round_hint_to_min as per Khalid Aziz]
    Link: http://lkml.kernel.org/r/20171213092550.2774-2-mhocko@kernel.org
    Reviewed-by: Khalid Aziz
    Signed-off-by: Michal Hocko
    Acked-by: Michael Ellerman
    Cc: Khalid Aziz
    Cc: Russell King - ARM Linux
    Cc: Andrea Arcangeli
    Cc: Florian Weimer
    Cc: John Hubbard
    Cc: Matthew Wilcox
    Cc: Abdul Haleem
    Cc: Joel Stanley
    Cc: Kees Cook
    Cc: Michal Hocko
    Cc: Jason Evans
    Cc: David Goldblatt
    Cc: Edward Tomasz Napierała
    Cc: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
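
    A sketch of how userspace can use the new flag, including the fallback
    detection needed on older kernels; the 0x100000 value is an assumption
    that holds for most architectures but should be checked for the target:

        #include <errno.h>
        #include <stddef.h>
        #include <sys/mman.h>

        #ifndef MAP_FIXED_NOREPLACE
        #define MAP_FIXED_NOREPLACE 0x100000    /* most architectures */
        #endif

        /* Try to reserve [hint, hint + len) without clobbering anything. */
        static void *reserve_at(void *hint, size_t len)
        {
                void *p = mmap(hint, len, PROT_NONE,
                               MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED_NOREPLACE,
                               -1, 0);

                if (p == MAP_FAILED)
                        return NULL;    /* EEXIST: the range is already in use */

                /*
                 * An older kernel ignores the unknown flag and falls back to
                 * the plain hint semantic, so it may return a different
                 * address: detect that and back out instead of using the
                 * wrong range.
                 */
                if (p != hint) {
                        munmap(p, len);
                        errno = EEXIST;
                        return NULL;
                }
                return p;
        }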
     
  • Patch series "exec: Pin stack limit during exec".

    Attempts to solve problems with the stack limit changing during exec
    continue to be frustrated[1][2]. In addition to the specific issues
    around the Stack Clash family of flaws, Andy Lutomirski pointed out[3]
    other places during exec where the stack limit is used and is assumed to
    be unchanging. Given the many places it gets used and the fact that it
    can be manipulated/raced via setrlimit() and prlimit(), I think the only
    way to handle this is to move away from the "current" view of the stack
    limit and instead attach it to the bprm, and plumb this down into the
    functions that need to know the stack limits. This series implements
    the approach.

    [1] 04e35f4495dd ("exec: avoid RLIMIT_STACK races with prlimit()")
    [2] 779f4e1c6c7c ("Revert "exec: avoid RLIMIT_STACK races with prlimit()"")
    [3] to security@kernel.org, "Subject: existing rlimit races?"

    This patch (of 3):

    Since it is possible that the stack rlimit can change externally during
    exec (either via another thread calling setrlimit() or another process
    calling prlimit()), provide a way to pass the rlimit down into the
    per-architecture mm layout functions so that the rlimit can stay in the
    bprm structure instead of sitting in the signal structure until exec is
    finalized.

    Link: http://lkml.kernel.org/r/1518638796-20819-2-git-send-email-keescook@chromium.org
    Signed-off-by: Kees Cook
    Cc: Michal Hocko
    Cc: Ben Hutchings
    Cc: Willy Tarreau
    Cc: Hugh Dickins
    Cc: Oleg Nesterov
    Cc: "Jason A. Donenfeld"
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Greg KH
    Cc: Andy Lutomirski
    Cc: Ben Hutchings
    Cc: Brad Spengler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kees Cook
     
  • The kasan_slab_free hook's return value denotes whether the reuse of a
    slab object must be delayed (e.g. when the object is put into memory
    quarantine).

    The current way SLUB handles this hook is by ignoring its return value
    and hardcoding checks similar (but not exactly the same) to the ones
    performed in kasan_slab_free, which is error prone.

    The main difference between the hardcoded checks and the ones in
    kasan_slab_free is whether we want to perform a free when an
    invalid-free or a double-free was detected (we don't).

    This patch changes the way SLUB handles this by:
    1. taking into account the return value of kasan_slab_free for each of
    the objects that are being freed;
    2. reconstructing the freelist of objects to exclude the ones whose
    reuse must be delayed.

    [andreyknvl@google.com: eliminate unnecessary branch in slab_free]
    Link: http://lkml.kernel.org/r/a62759a2545fddf69b0c034547212ca1eb1b3ce2.1520359686.git.andreyknvl@google.com
    Link: http://lkml.kernel.org/r/083f58501e54731203801d899632d76175868e97.1519400992.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Acked-by: Andrey Ryabinin
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Potapenko
    Cc: Dmitry Vyukov
    Cc: Kostya Serebryany
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
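
    A rough sketch of the freelist reconstruction (modelled on
    slab_free_freelist_hook() in mm/slub.c); treat it as illustrative of the
    idea rather than the exact upstream code:

        static inline bool slab_free_freelist_hook(struct kmem_cache *s,
                                                   void **head, void **tail)
        {
                void *object;
                void *next = *head;
                void *old_tail = *tail ? *tail : *head;

                /* Head and tail of the reconstructed freelist */
                *head = NULL;
                *tail = NULL;

                do {
                        object = next;
                        next = get_freepointer(s, object);
                        /* slab_free_hook() returns the kasan_slab_free() verdict */
                        if (!slab_free_hook(s, object)) {
                                /* reuse not delayed: keep it on the new freelist */
                                set_freepointer(s, object, *head);
                                *head = object;
                                if (!*tail)
                                        *tail = object;
                        }
                } while (object != old_tail);

                if (*head == *tail)
                        *tail = NULL;

                /* free only if at least one object survived the hook */
                return *head != NULL;
        }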
     
  • There was a regression report for "mm/cma: manage the memory of the CMA
    area by using the ZONE_MOVABLE" [1] and I think that it is related to
    this problem. The CMA patchset makes the system use one more zone
    (ZONE_MOVABLE) and then increases min_free_kbytes. That reduces usable
    memory and could cause a regression.

    ZONE_MOVABLE only has movable pages, so we don't need to keep enough
    freepages there to avoid or deal with fragmentation. So, don't count it.

    This greatly changes min_free_kbytes and thus min_watermark when
    ZONE_MOVABLE is used. It lets the user use more memory.

    System:
    22GB ram, fakenuma, 2 nodes. 5 zones are used.

    Before:
    min_free_kbytes: 112640

    zone_info (min_watermark):
    Node 0, zone DMA
    min 19
    Node 0, zone DMA32
    min 3778
    Node 0, zone Normal
    min 10191
    Node 0, zone Movable
    min 0
    Node 0, zone Device
    min 0
    Node 1, zone DMA
    min 0
    Node 1, zone DMA32
    min 0
    Node 1, zone Normal
    min 14043
    Node 1, zone Movable
    min 127
    Node 1, zone Device
    min 0

    After:
    min_free_kbytes: 90112

    zone_info (min_watermark):
    Node 0, zone DMA
    min 15
    Node 0, zone DMA32
    min 3022
    Node 0, zone Normal
    min 8152
    Node 0, zone Movable
    min 0
    Node 0, zone Device
    min 0
    Node 1, zone DMA
    min 0
    Node 1, zone DMA32
    min 0
    Node 1, zone Normal
    min 11234
    Node 1, zone Movable
    min 102
    Node 1, zone Device
    min 0

    [1] (lkml.kernel.org/r/20180102063528.GG30397%20()%20yexl-desktop)

    Link: http://lkml.kernel.org/r/1522913236-15776-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Cc: Michal Hocko
    Cc: "Kirill A . Shutemov"
    Cc: Mel Gorman
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Now, all reserved pages for the CMA region belong to ZONE_MOVABLE, which
    only serves requests with GFP_HIGHMEM && GFP_MOVABLE.

    Therefore, we don't need to maintain ALLOC_CMA at all.

    Link: http://lkml.kernel.org/r/1512114786-5085-3-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Patch series "mm/cma: manage the memory of the CMA area by using the
    ZONE_MOVABLE", v2.

    0. History

    This patchset is the follow-up of the discussion about the "Introduce
    ZONE_CMA (v7)" [1]. Please reference it if more information is needed.

    1. What does this patch do?

    This patch changes the way the memory of the CMA area is managed in the
    MM subsystem. Currently the memory of the CMA area is managed by the
    zone its pfn belongs to. However, this approach has some problems, since
    the MM subsystem doesn't have enough logic to handle the situation where
    memories with different characteristics are in a single zone. To solve
    this issue, this patch tries to manage all the memory of the CMA area by
    using the MOVABLE zone. From the MM subsystem's point of view, the
    characteristics of the memory in the MOVABLE zone and of the memory of
    the CMA area are the same. So, managing the memory of the CMA area by
    using the MOVABLE zone will not cause any problem.

    2. Motivation

    There are some problems with the current approach; see the following.
    Although these problems are not inherent and could be fixed without this
    conceptual change, doing so would require adding many hooks in various
    code paths, which would be intrusive to core MM and really error-prone.
    Therefore, I try to solve them with this new approach. Anyway, the
    following are the problems of the current implementation.

    o CMA memory utilization

    First, the following is the freepage calculation logic in MM.

    - For movable allocation: freepage = total freepage
    - For unmovable allocation: freepage = total freepage - CMA freepage

    Freepages in the CMA area are used only after the normal freepages of
    the zone that the CMA area belongs to are exhausted. At that moment the
    number of normal freepages is zero, so

    - For movable allocation: freepage = total freepage = CMA freepage
    - For unmovable allocation: freepage = 0

    If an unmovable allocation comes at this moment, the allocation request
    fails to pass the watermark check and reclaim is started. After reclaim,
    normal freepages exist again, so freepages in the CMA area are once more
    left unused.

    FYI, there is another attempt [2] trying to solve this problem in lkml.
    And, as far as I know, Qualcomm also has out-of-tree solution for this
    problem.

    Useless reclaim:

    There is no logic to distinguish CMA pages in the reclaim path. Hence,
    CMA pages are reclaimed even if the system just needs pages that are
    usable for kernel allocations.

    Atomic allocation failure:

    This is also related to the fallback allocation policy for the memory of
    the CMA area. Consider the situation where the number of normal
    freepages is *zero* because a bunch of movable allocation requests have
    come in. Kswapd would not be woken up due to the following freepage
    calculation logic.

    - For movable allocation: freepage = total freepage = CMA freepage

    If an atomic unmovable allocation request comes at this moment, it would
    fail due to the following logic.

    - For unmovable allocation: freepage = total freepage - CMA freepage = 0

    It was reported by Aneesh [3].

    Useless compaction:

    The usual high-order allocation request is an unmovable allocation
    request, and it cannot be served from the memory of the CMA area. In
    compaction, the migration scanner tries to migrate pages in the CMA area
    and create high-order pages there. As mentioned above, these cannot be
    used for unmovable allocation requests, so the work is just wasted.

    3. Current approach and new approach

    The current approach is that the memory of the CMA area is managed by
    the zone its pfn belongs to. However, this memory should be
    distinguishable since it has a strong limitation. So, it is marked as
    MIGRATE_CMA in the pageblock flags and handled specially. However, as
    mentioned in section 2, the MM subsystem doesn't have enough logic to
    deal with this special pageblock, so many problems arise.

    The new approach is that the memory of the CMA area is managed by the
    MOVABLE zone. MM already has enough logic to deal with special zones
    such as the HIGHMEM and MOVABLE zones. So, managing the memory of the
    CMA area in the MOVABLE zone just naturally works well, because the
    constraint on the memory of the CMA area, that it should always be
    migratable, is the same as the constraint for the MOVABLE zone.

    There is one side-effect on the usability of the memory of the CMA
    area. Use of the MOVABLE zone is only allowed for requests with
    GFP_HIGHMEM && GFP_MOVABLE, so the memory of the CMA area is now also
    only allowed for this gfp combination. Before this patchset, a request
    with just GFP_MOVABLE could use it. IMO, it is not a big issue, since
    most GFP_MOVABLE requests also have the GFP_HIGHMEM flag, for example
    file cache pages and anonymous pages. However, file cache pages for
    blockdev files are an exception: requests for them have no GFP_HIGHMEM
    flag. There are pros and cons to this exception. In my experience,
    blockdev file cache pages are one of the top reasons that cause
    cma_alloc() to fail temporarily, so we can get a stronger guarantee of
    cma_alloc() success by discarding this case.

    Note that there is no change from the admin's POV, since this patchset
    is just an internal implementation change in the MM subsystem. The one
    minor difference for admins is that the memory stats for the CMA area
    will be printed under the MOVABLE zone. That's all.

    4. Result

    The following is an experimental result related to the utilization
    problem.

    8 CPUs, 1024 MB, VIRTUAL MACHINE
    make -j16

    Before:
    CMA area:       0 MB    512 MB
    Elapsed-time:   92.4    186.5
    pswpin:         82      18647
    pswpout:        160     69839

    After:
    CMA area:       0 MB    512 MB
    Elapsed-time:   93.1    93.4
    pswpin:         84      46
    pswpout:        183     92

    akpm: "kernel test robot" reported a 26% improvement in
    vm-scalability.throughput:
    http://lkml.kernel.org/r/20180330012721.GA3845@yexl-desktop

    [1]: lkml.kernel.org/r/1491880640-9944-1-git-send-email-iamjoonsoo.kim@lge.com
    [2]: https://lkml.org/lkml/2014/10/15/623
    [3]: http://www.spinics.net/lists/linux-mm/msg100562.html

    Link: http://lkml.kernel.org/r/1512114786-5085-2-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Tested-by: Tony Lindgren
    Acked-by: Vlastimil Babka
    Cc: Johannes Weiner
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • Freepages in ZONE_HIGHMEM cannot be used for kernel memory, so reserving
    them is not that important. When ZONE_MOVABLE is used, this problem
    would theoretically decrease the usable memory for GFP_HIGHUSER_MOVABLE
    allocation requests, which are mainly used for page cache and anon page
    allocation. So, fix it by setting
    sysctl_lowmem_reserve_ratio[ZONE_HIGHMEM] to 0.

    Also, defining the sysctl_lowmem_reserve_ratio array with size
    MAX_NR_ZONES - 1 makes the code complex. For example, on a highmem
    system the following reserve ratio is activated for the *NORMAL ZONE*,
    which could easily mislead people.

    #ifdef CONFIG_HIGHMEM
    32
    #endif

    This patch also fixes this situation by defining the
    sysctl_lowmem_reserve_ratio array with MAX_NR_ZONES entries and placing
    the "#ifdef" in the right place.

    Link: http://lkml.kernel.org/r/1504672525-17915-1-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Joonsoo Kim
    Reviewed-by: Aneesh Kumar K.V
    Acked-by: Vlastimil Babka
    Tested-by: Tony Lindgren
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: "Aneesh Kumar K . V"
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Laura Abbott
    Cc: Marek Szyprowski
    Cc: Michal Nazarewicz
    Cc: Russell King
    Cc: Will Deacon
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
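
    A sketch of the resulting array definition (mm/page_alloc.c), with the
    non-zero ratios shown as the conventional defaults rather than an exact
    copy:

        int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES] = {
        #ifdef CONFIG_ZONE_DMA
                [ZONE_DMA] = 256,
        #endif
        #ifdef CONFIG_ZONE_DMA32
                [ZONE_DMA32] = 256,
        #endif
                [ZONE_NORMAL] = 32,
        #ifdef CONFIG_HIGHMEM
                [ZONE_HIGHMEM] = 0,     /* no highmem reserve against MOVABLE requests */
        #endif
                [ZONE_MOVABLE] = 0,
        };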
     
  • THP migration is hacked into the generic migration code with rather
    surprising semantics. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and, if that is not the
    case, it allocates a simple page to migrate. unmap_and_move then fixes
    that up by splitting the THP into small pages while moving the head page
    to the newly allocated order-0 page. Remaining pages are moved to the
    LRU list by split_huge_page. The same happens if the THP allocation
    fails. This is really ugly and error prone [1].

    I also believe that split_huge_page to the LRU lists is inherently wrong
    because all tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken though, e.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported and then fall back to single page migration, but they don't
    handle tail pages if the THP migration path is not able to allocate a
    fresh THP, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages, so it should be immune.

    This patch tries to unclutter the situation by moving the special THP
    handling up to the migrate_pages layer where it actually belongs. We
    simply split the THP page into the existing list if unmap_and_move fails
    with ENOMEM and retry. So we will _always_ migrate all THP subpages and
    specific migrate_pages users do not have to deal with this case in a
    special way.

    [1] http://lkml.kernel.org/r/20171121021855.50525-1-zi.yan@sent.com

    Link: http://lkml.kernel.org/r/20180103082555.14592-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
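
    A sketch of the retry logic this adds to the migrate_pages() loop
    (mm/migrate.c), simplified from the real code:

        rc = unmap_and_move(get_new_page, put_new_page, private,
                            page, pass > 2, mode, reason);

        switch (rc) {
        case -ENOMEM:
                /*
                 * THP migration might be unsupported or allocating a fresh
                 * THP might have failed, so retry the same page with the
                 * THP split back onto the "from" list as base pages.
                 */
                if (PageTransHuge(page)) {
                        lock_page(page);
                        rc = split_huge_page_to_list(page, from);
                        unlock_page(page);
                        if (!rc) {
                                list_safe_reset_next(page, page2, lru);
                                goto retry;
                        }
                }
                nr_failed++;
                goto out;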
     
  • No allocation callback is using this argument anymore. new_page_node
    used to use this parameter to convey the node_id, resp. a migration
    error, up to the move_pages code (do_move_page_to_node_array). The
    error status never made it into the final status field, and we have a
    better way to communicate the node id to the status field now. All
    other allocation callbacks simply ignored the argument, so we can
    finally drop it.

    [mhocko@suse.com: fix migration callback]
    Link: http://lkml.kernel.org/r/20180105085259.GH2801@dhcp22.suse.cz
    [akpm@linux-foundation.org: fix alloc_misplaced_dst_page()]
    [mhocko@kernel.org: fix build]
    Link: http://lkml.kernel.org/r/20180103091134.GB11319@dhcp22.suse.cz
    Link: http://lkml.kernel.org/r/20180103082555.14592-3-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Reviewed-by: Zi Yan
    Cc: Andrea Reale
    Cc: Anshuman Khandual
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Patch series "unclutter thp migration"

    Motivation:

    THP migration is hacked into the generic migration with rather
    surprising semantic. The migration allocation callback is supposed to
    check whether the THP can be migrated at once and if that is not the
    case then it allocates a simple page to migrate. unmap_and_move then
    fixes that up by splitting the THP into small pages while moving the
    head page to the newly allocated order-0 page. Remaining pages are
    moved to the LRU list by split_huge_page. The same happens if the THP
    allocation fails. This is really ugly and error prone [2].

    I also believe that split_huge_page to the LRU lists is inherently wrong
    because all tail pages are not migrated. Some callers will just work
    around that by retrying (e.g. memory hotplug). There are other pfn
    walkers which are simply broken though, e.g. madvise_inject_error will
    migrate the head and then advance the next pfn by the huge page size.
    do_move_page_to_node_array and queue_pages_range (migrate_pages, mbind)
    will simply split the THP before migration if THP migration is not
    supported and then fall back to single page migration, but they don't
    handle tail pages if the THP migration path is not able to allocate a
    fresh THP, so we end up with ENOMEM and fail the whole migration, which
    is questionable behavior. Page compaction doesn't try to migrate large
    pages, so it should be immune.

    The first patch reworks do_pages_move, which relies on a very ugly
    calling semantic where the return status is pushed to the migration path
    via a private pointer. It uses preallocated fixed-size batching to
    achieve that. We simply cannot do the same if a THP is to be split
    during the migration path, which is done in patch 3. Patch 2 is a
    follow-up cleanup which removes the mentioned return-status calling
    convention ugliness.

    On a side note:

    There are some semantic issues I have encountered on the way when
    working on patch 1, but I am not addressing them here. E.g. trying to
    move THP tail pages will result in either success or EBUSY (the latter
    more likely once we isolate the head from the LRU list). Hugetlb
    reports EACCESS on tail pages. Some errors are reported via the status
    parameter but migration failures are not, even though the original
    `reason' argument suggests there was an intention to do so. From a
    quick look into git history this never worked. I have tried to keep
    the semantics unchanged.

    Then there is a relatively minor thing that the page isolation might
    fail because of pages not being on the LRU - e.g. because they are
    sitting on the per-cpu LRU caches. Easily fixable.

    This patch (of 3):

    do_pages_move is supposed to move user defined memory (an array of
    addresses) to the user defined numa nodes (an array of nodes one for
    each address). The user provided status array then contains resulting
    numa node for each address or an error. The semantics of this function
    are a little bit confusing because only some errors are reported back.
    Notably, a migrate_pages error is only reported via the return value.
    This patch doesn't try to address these semantic nuances but rather
    changes the underlying implementation.

    Currently we are processing user input (which can be really large) in
    batches which are stored to a temporarily allocated page. Each address
    is resolved to its struct page and stored to page_to_node structure
    along with the requested target numa node. The array of these
    structures is then conveyed down the page migration path via private
    argument. new_page_node then finds the corresponding structure and
    allocates the proper target page.

    What is the problem with the current implementation and why change it?
    Apart from being quite ugly, it also doesn't cope with unexpected pages
    showing up on the migration list inside the migrate_pages path. That
    doesn't happen currently, but a follow-up patch would like to make the
    THP migration code clearer, and that would need to split a THP into the
    list in some cases.

    How does the new implementation work? Well, instead of batching into a
    fixed size array we simply batch all pages that should be migrated to
    the same node and isolate all of them into a linked list which doesn't
    require any additional storage. This should work reasonably well
    because page migration usually migrates larger ranges of memory to a
    specific node. So the common case should work equally well as the
    current implementation. Even if somebody constructs an input where the
    target numa nodes would be interleaved we shouldn't see a large
    performance impact because page migration alone doesn't really benefit
    from batching. mmap_sem batching for the lookup is quite questionable
    and isolate_lru_page which would benefit from batching is not using it
    even in the current implementation.

    Link: http://lkml.kernel.org/r/20180103082555.14592-2-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Andrew Morton
    Cc: Anshuman Khandual
    Cc: Zi Yan
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Andrea Reale
    Cc: Kirill A. Shutemov
    Cc: Mike Kravetz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • The pointer swap_avail_heads is local to the source and does not need to
    be in global scope, so make it static.

    Cleans up sparse warning:

    mm/swapfile.c:88:19: warning: symbol 'swap_avail_heads' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180206215836.12366-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Acked-by: "Huang, Ying"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     
  • syzbot has triggered a NULL ptr dereference when allocation fault
    injection enforces a failure and alloc_mem_cgroup_per_node_info
    initializes memcg->nodeinfo only halfway through.

    But __mem_cgroup_free still tries to free all per-node data and
    dereferences pn->lruvec_stat_cpu unconditionally, even if the specific
    per-node data hasn't been initialized.

    The bug is quite unlikely to hit because small allocations do not fail
    and we would need quite some numa nodes to make struct
    mem_cgroup_per_node large enough to cross the costly order.

    Link: http://lkml.kernel.org/r/20180406100906.17790-1-mhocko@kernel.org
    Reported-by: syzbot+8a5de3cce7cdc70e9ebe@syzkaller.appspotmail.com
    Fixes: 00f3ca2c2d66 ("mm: memcontrol: per-lruvec stats infrastructure")
    Signed-off-by: Michal Hocko
    Reviewed-by: Andrey Ryabinin
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
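
    A sketch of the fix in free_mem_cgroup_per_node_info() (mm/memcontrol.c):

        static void free_mem_cgroup_per_node_info(struct mem_cgroup *memcg, int node)
        {
                struct mem_cgroup_per_node *pn = memcg->nodeinfo[node];

                /*
                 * The allocation side may have failed halfway through,
                 * leaving some nodeinfo slots NULL; don't dereference those.
                 */
                if (!pn)
                        return;

                free_percpu(pn->lruvec_stat_cpu);
                kfree(pn);
        }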
     
  • Calling swapon() on a zero length swap file on SSD can lead to a
    divide-by-zero.

    Although creating such files isn't possible with mkswap and they would
    be considered invalid, it would be better for the swapon code to be more
    robust and handle this condition gracefully (return -EINVAL), especially
    since the fix is small and straightforward.

    To help with wear leveling on SSD, the swapon syscall calculates a
    random position in the swap file using modulo p->highest_bit, which is
    set to maxpages - 1 in read_swap_header.

    If the swap file is zero length, read_swap_header sets maxpages=1 and
    last_page=0, resulting in p->highest_bit=0, and we divide by zero when
    we take the value modulo p->highest_bit in the swapon syscall.

    This can be prevented by having read_swap_header return zero if
    last_page is zero.

    Link: http://lkml.kernel.org/r/5AC747C1020000A7001FA82C@prv-mh.provo.novell.com
    Signed-off-by: Thomas Abraham
    Reported-by:
    Reviewed-by: Andrew Morton
    Cc: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tom Abraham
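
    A sketch of the guard in read_swap_header() (mm/swapfile.c); the caller
    treats a zero return as an invalid header and fails swapon with -EINVAL:

        unsigned long last_page = swap_header->info.last_page;

        if (!last_page) {
                pr_warn("Empty swap-file\n");
                return 0;
        }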
     
  • Commit a983b5ebee57 ("mm: memcontrol: fix excessive complexity in
    memory.stat reporting") added per-cpu drift to all memory cgroup stats
    and events shown in memory.stat and memory.events.

    For memory.stat this is acceptable. But memory.events issues file
    notifications, and somebody polling the file for changes will be
    confused when the counters in it are unchanged after a wakeup.

    Luckily, the events in memory.events - MEMCG_LOW, MEMCG_HIGH, MEMCG_MAX,
    MEMCG_OOM - are sufficiently rare and high-level that we don't need
    per-cpu buffering for them: MEMCG_HIGH and MEMCG_MAX would be the most
    frequent, but they're counting invocations of reclaim, which is a
    complex operation that touches many shared cachelines.

    This splits memory.events from the generic VM events and tracks them in
    their own, unbuffered atomic counters. That's also cleaner, as it
    eliminates the ugly enum nesting of VM and cgroup events.

    [hannes@cmpxchg.org: "array subscript is above array bounds"]
    Link: http://lkml.kernel.org/r/20180406155441.GA20806@cmpxchg.org
    Link: http://lkml.kernel.org/r/20180405175507.GA24817@cmpxchg.org
    Fixes: a983b5ebee57 ("mm: memcontrol: fix excessive complexity in memory.stat reporting")
    Signed-off-by: Johannes Weiner
    Reported-by: Tejun Heo
    Acked-by: Tejun Heo
    Acked-by: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Roman Gushchin
    Cc: Rik van Riel
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
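
    A sketch of the kind of helper this introduces, assumed close to (but
    not necessarily identical with) the upstream memcg_memory_event():

        static inline void memcg_memory_event(struct mem_cgroup *memcg,
                                              enum memcg_memory_event event)
        {
                /* unbuffered: a reader woken by the notification sees the new count */
                atomic_long_inc(&memcg->memory_events[event]);
                cgroup_file_notify(&memcg->events_file);
        }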
     
  • When using KSM with use_zero_pages, we replace anonymous pages
    containing only zeroes with actual zero pages, which are not anonymous.
    We need to do proper accounting of the mm counters, otherwise we will
    get wrong values in /proc and a BUG message in dmesg when tearing down
    the mm.

    Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
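
    A sketch of the accounting fix in the zero-page branch of replace_page()
    (mm/ksm.c):

        } else {
                newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
                                               vma->vm_page_prot));
                /*
                 * We're replacing an anonymous page with a zero page, which
                 * is not anonymous, so account for it; otherwise the anon
                 * counter goes wrong when the mm is torn down.
                 */
                dec_mm_counter(mm, MM_ANONPAGES);
        }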
     
  • We have a perfectly good macro to determine whether the gfp flags allow
    you to sleep or not; use it instead of trying to infer it.

    Link: http://lkml.kernel.org/r/20180408062206.GC16007@bombadil.infradead.org
    Signed-off-by: Matthew Wilcox
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
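
    An illustrative before/after of the pattern (the actual z3fold call site
    may differ slightly):

        /* before: infer sleepability from a raw flag test
         *     bool can_sleep = gfp & __GFP_DIRECT_RECLAIM;
         * after: say what is actually meant
         */
        bool can_sleep = gfpflags_allow_blocking(gfp);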
     
  • In z3fold_create_pool(), the memory allocated by __alloc_percpu() is not
    released on the error path where pool->compact_wq, which holds the
    return value of create_singlethread_workqueue(), is NULL. This results
    in a memory leak.

    [akpm@linux-foundation.org: fix oops on kzalloc() failure, check __alloc_percpu() retval]
    Link: http://lkml.kernel.org/r/1522803111-29209-1-git-send-email-wangxidong_97@163.com
    Signed-off-by: Xidong Wang
    Reviewed-by: Andrew Morton
    Cc: Vitaly Wool
    Cc: Mike Rapoport
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xidong Wang
     
  • A THP memcg charge can trigger the OOM killer since 2516035499b9 ("mm,
    thp: remove __GFP_NORETRY from khugepaged and madvised allocations").
    We previously used an explicit __GFP_NORETRY, which ruled out the OOM
    killer automagically.

    The memcg charge path should be semantically compliant with the
    allocation path, and that means that if we do not trigger the OOM
    killer for costly orders there, we should not do so in the memcg charge
    path either. Otherwise we are forcing callers to distinguish the two
    and use different gfp masks, which is both non-intuitive and bug prone.
    As soon as we get a costly high-order kmalloc user, we do not even have
    any means to pass a memcg-specific gfp mask to prevent OOM, because the
    charging is deep within the guts of the slab allocator.

    The unexpected memcg OOM on THP has already been fixed upstream by
    9d3c3354bb85 ("mm, thp: do not cause memcg oom for thp") but this is a
    one-off fix rather than a generic solution. Teach mem_cgroup_oom to
    bail out on costly order requests to fix the THP issue as well as any
    other costly OOM eligible allocations to be added in future.

    Also revert 9d3c3354bb85 because special gfp for THP is no longer
    needed.

    Link: http://lkml.kernel.org/r/20180403193129.22146-1-mhocko@kernel.org
    Fixes: 2516035499b9 ("mm, thp: remove __GFP_NORETRY from khugepaged and madvised allocations")
    Signed-off-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: "Kirill A. Shutemov"
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
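
    A sketch of the bail-out this teaches to mem_cgroup_oom() (exact
    placement in mm/memcontrol.c assumed):

        /*
         * Match the page allocator: never invoke the OOM killer for costly
         * orders; the charge attempt simply fails instead.
         */
        if (order > PAGE_ALLOC_COSTLY_ORDER)
                return;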
     
  • Use of pte_write(pte) is only valid for present ptes; the common code
    which sets the migration entry can be reached for both valid present
    ptes and special swap entries (for device memory). Fix the code to use
    the mpfn value, which properly handles both cases.

    On x86 this did not have any bad side effect, because the pte write bit
    is below PAGE_BIT_GLOBAL and thus special swap entries have it set to
    0, which in turn means we were always creating read-only special
    migration entries.

    So once migration finished we always write-protected the CPU page
    table entry (moreover this is only an issue when migrating from device
    memory to system memory). The end effect is that a CPU write access
    would fault again and restore write permission.

    This behaviour isn't too bad; it just burns CPU cycles by forcing the
    CPU to take a second fault on write access, i.e. double faulting the
    same address. There is no corruption or incorrect state (it behaves
    like a COWed page from a fork with a mapcount of 1).

    Link: http://lkml.kernel.org/r/20180402023506.12180-1-jglisse@redhat.com
    Signed-off-by: Ralph Campbell
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ralph Campbell
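
    A sketch of the change in the migration-entry setup path (assumed to be
    in migrate_vma_collect_pmd() in mm/migrate.c):

        /* before: only valid when the pte is present */
        entry = make_migration_entry(page, pte_write(pte));

        /* after: rely on the already-computed mpfn, valid for both a present
         * pte and a device-private swap entry
         */
        entry = make_migration_entry(page, mpfn & MIGRATE_PFN_WRITE);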
     
  • change_pte_range is called from task work context to mark PTEs for
    receiving NUMA faulting hints. If the marked pages are dirty then
    migration may fail. Some filesystems cannot migrate dirty pages without
    blocking, so they are skipped in MIGRATE_ASYNC mode, which just wastes
    CPU. Even when they can, it can be a waste of cycles when the pages are
    shared, forcing higher scan rates. This patch avoids marking shared
    dirty pages for hinting faults and will also skip a migration if the
    page was dirtied after the scanner updated a clean page.

    This is most noticeable running the NASA Parallel Benchmark when backed
    by btrfs, the default root filesystem for some distributions, but also
    noticeable when using XFS.

    The following are results from a 4-socket machine running a 4.16-rc4
    kernel with some scheduler patches that are pending for the next merge
    window.

                             4.16.0-rc4             4.16.0-rc4
                      schedtip-20180309             nodirty-v1
    Time cg.D        459.07 (   0.00%)      444.21 (   3.24%)
    Time ep.D         76.96 (   0.00%)       77.69 (  -0.95%)
    Time is.D         25.55 (   0.00%)       27.85 (  -9.00%)
    Time lu.D        601.58 (   0.00%)      596.87 (   0.78%)
    Time mg.D        107.73 (   0.00%)      108.22 (  -0.45%)

    is.D regresses slightly in terms of absolute time but note that that
    particular load varies quite a bit from run to run. The more relevant
    observation is the total system CPU usage.

                             4.16.0-rc4             4.16.0-rc4
                      schedtip-20180309             nodirty-v1
    User                       71471.91               70627.04
    System                     11078.96                8256.13
    Elapsed                      661.66                 632.74

    That is a substantial drop in system CPU usage and overall the workload
    completes faster. The NUMA balancing statistics are also interesting

    NUMA base PTE updates 111407972 139848884
    NUMA huge PMD updates 206506 264869
    NUMA page range updates 217139044 275461812
    NUMA hint faults 4300924 3719784
    NUMA hint local faults 3012539 3416618
    NUMA hint local percent 70 91
    NUMA pages migrated 1517487 1358420

    While more PTEs are scanned due to changes in what faults are gathered,
    it's clear that a far higher percentage of faults are local as the bulk
    of the remote hits were dirty pages that, in this case with btrfs, had
    no chance of migrating.

    The following is a comparison when using XFS as that is a more realistic
    filesystem choice for a data partition

                             4.16.0-rc4             4.16.0-rc4
                      schedtip-20180309          nodirty-v1r47
    Time cg.D        485.28 (   0.00%)      442.62 (   8.79%)
    Time ep.D         77.68 (   0.00%)       77.54 (   0.18%)
    Time is.D         26.44 (   0.00%)       24.79 (   6.24%)
    Time lu.D        597.46 (   0.00%)      597.11 (   0.06%)
    Time mg.D        142.65 (   0.00%)      105.83 (  25.81%)

    That is a reasonable gain on two relatively long-lived workloads. While
    not presented, there is also a substantial drop in system CPU usage, and
    the NUMA balancing stats show improvements in locality similar to the
    btrfs case.

    Link: http://lkml.kernel.org/r/20180326094334.zserdec62gwmmfqf@techsingularity.net
    Signed-off-by: Mel Gorman
    Reviewed-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
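
    A sketch of the prot_numa filtering in change_pte_range()
    (mm/mprotect.c) with the new dirty-page check added; simplified and
    assumed close to the final form:

        if (prot_numa) {
                struct page *page = vm_normal_page(vma, addr, oldpte);

                /* Avoid trapping faults against the zero or KSM pages */
                if (!page || PageKsm(page))
                        continue;

                /* Also skip shared copy-on-write pages */
                if (is_cow_mapping(vma->vm_flags) && page_mapcount(page) != 1)
                        continue;

                /*
                 * Dirty file pages may not be migratable from MIGRATE_ASYNC
                 * context, so don't bother marking them for hinting faults.
                 */
                if (page_is_file_cache(page) && PageDirty(page))
                        continue;
        }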
     
  • hmm_devmem_find() requires rcu_read_lock_held() but there's nothing which
    actually uses the RCU protection. The only caller is
    hmm_devmem_pages_create() which already grabs the mutex and does
    superfluous rcu_read_lock/unlock() around the function.

    This doesn't add anything and just adds to confusion. Remove the RCU
    protection and open-code the radix tree lookup. If this needs to become
    more sophisticated in the future, let's add them back when necessary.

    Link: http://lkml.kernel.org/r/20180314194515.1661824-4-tj@kernel.org
    Signed-off-by: Tejun Heo
    Reviewed-by: Jérôme Glisse
    Cc: Paul E. McKenney
    Cc: Benjamin LaHaise
    Cc: Al Viro
    Cc: Kent Overstreet
    Cc: Matthew Wilcox
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tejun Heo
     
  • Users of hmm_vma_fault() and hmm_vma_get_pfns() provide a flags array
    and a pfn shift value, allowing them to define their own encoding for
    the HMM pfns that are filled in the pfns array of the hmm_range struct.
    With this, device drivers can get pfns that match their own private
    encoding out of HMM without having to do any conversion.

    [rcampbell@nvidia.com: don't ignore specific pte fault flag in hmm_vma_fault()]
    Link: http://lkml.kernel.org/r/20180326213009.2460-2-jglisse@redhat.com
    [rcampbell@nvidia.com: clarify fault logic for device private memory]
    Link: http://lkml.kernel.org/r/20180326213009.2460-3-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20180323005527.758-16-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Ralph Campbell
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This changes hmm_vma_fault() to not take a global write fault flag for a
    range, but instead rely on the caller to populate the HMM pfns array
    with the proper fault flags, i.e. HMM_PFN_VALID if the driver wants a
    read fault for that address, or HMM_PFN_VALID and HMM_PFN_WRITE for a
    write fault.

    Moreover, by setting HMM_PFN_DEVICE_PRIVATE the device driver can ask
    for device private memory to be migrated back to system memory through
    a page fault.

    This is a more flexible API and it better reflects how devices handle
    and report faults.

    Link: http://lkml.kernel.org/r/20180323005527.758-15-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • No functional change, just create one function to handle pmd and one to
    handle pte (hmm_vma_handle_pmd() and hmm_vma_handle_pte()).

    Link: http://lkml.kernel.org/r/20180323005527.758-14-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Move hmm_pfns_clear() closer to where it is used, to make it clear that
    it is not used by the page table walkers.

    Link: http://lkml.kernel.org/r/20180323005527.758-13-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Make the naming consistent across the code: DEVICE_PRIVATE is the name
    used outside the HMM code, so use that one.

    Link: http://lkml.kernel.org/r/20180323005527.758-12-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • There is no point in differentiating between a range for which there is
    not even a directory (and thus no entries) and an empty entry
    (pte_none() or pmd_none() returns true).

    Simply drop the distinction, i.e. remove the HMM_PFN_EMPTY flag, and
    merge the now-duplicate hmm_vma_walk_hole() and hmm_vma_walk_clear()
    functions.

    Link: http://lkml.kernel.org/r/20180323005527.758-11-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Special vmas (ones with any of the VM_SPECIAL flags) cannot be accessed
    by devices, because there is no consistent model across device drivers
    for those vmas and their backing memory.

    This patch directly uses the hmm_range struct as the hmm_pfns_special()
    argument, as it always affects the whole vma and thus the whole range.

    It also makes the behavior consistent: after this patch both
    hmm_vma_fault() and hmm_vma_get_pfns() return -EINVAL when facing such
    a vma. Previously hmm_vma_fault() returned 0 and hmm_vma_get_pfns()
    returned -EINVAL, but both were filling the HMM pfn array with the
    special entry.

    Link: http://lkml.kernel.org/r/20180323005527.758-10-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • All the device drivers we care about use 64-bit page table entries. To
    match this and to avoid a useless define, convert all HMM pfns to use
    uint64_t directly. It is a first step on the road to allowing drivers
    to directly use the pfn values returned by HMM (saving the memory and
    CPU cycles used for conversion between the two).

    Link: http://lkml.kernel.org/r/20180323005527.758-9-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Only peculiar architectures allow write without read, thus assume that
    any valid pfn also allows read. Note that we do not care about
    write-only, since it does not make sense with things like atomic
    compare and exchange or any other operations that allow you to get the
    memory value through them.

    Link: http://lkml.kernel.org/r/20180323005527.758-8-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Both hmm_vma_fault() and hmm_vma_get_pfns() were taking an hmm_range
    struct as a parameter and were initializing that struct with others of
    their parameters. Have the callers of those functions do this
    initialization, as they likely already do, and only pass the struct to
    both functions. This shortens the function signatures and makes it
    easier to add new parameters in the future by simply adding them to the
    structure.

    Link: http://lkml.kernel.org/r/20180323005527.758-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: John Hubbard
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • The private field of the mm_walk struct points to an hmm_vma_walk
    struct and not to the hmm_range struct desired. Fix the code to get the
    proper struct pointer.

    Link: http://lkml.kernel.org/r/20180323005527.758-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • This code was lost in translation at one point. This properly calls
    mmu_notifier_unregister_no_release() once the last user is gone. This
    fixes the zombie mm_struct, as without this patch we do not drop the
    refcount we hold on it.

    Link: http://lkml.kernel.org/r/20180323005527.758-5-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Evgeny Baskakov
    Cc: Ralph Campbell
    Cc: Mark Hairgrove
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse