30 Dec, 2020

1 commit

  • [ Upstream commit 1e8aaedb182d6ddffc894b832e4962629907b3e0 ]

    madvise_inject_error() uses get_user_pages_fast to translate the address
    we specified to a page. After [1], we drop the extra reference count for
    memory_failure() path. That commit says that memory_failure wanted to
    keep the pin in order to take the page out of circulation.

    The truth is that we need to keep the page pinned, otherwise the page
    might be re-used after the put_page() and we can end up messing with
    someone else's memory.

    E.g.:

    CPU0 (process X)                     CPU1
    madvise_inject_error
     get_user_pages
      put_page
                                         page gets reclaimed
                                         process Y allocates the page
     memory_failure
      // We mess with process Y memory

    madvise() is meant to operate on the caller's own address space, so
    messing with pages that do not belong to us is the wrong thing to do.
    To avoid that, let us keep the page pinned for memory_failure() as well.

    Pages for DAX mappings will release this extra refcount in
    memory_failure_dev_pagemap.

    [1] ("23e7b5c2e271: mm, madvise_inject_error:
    Let memory_failure() optionally take a page reference")
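
    A rough sketch of the fixed flow (hard-offline case only; simplified,
    not the literal diff):

        ret = get_user_pages_fast(start, 1, 0, &page);  /* take a reference */
        if (ret != 1)
            return ret;
        pfn = page_to_pfn(page);

        /*
         * Keep the reference and tell memory_failure() about it, so the
         * page cannot be freed and reused by another process between the
         * lookup and the poisoning.  For DAX mappings,
         * memory_failure_dev_pagemap() drops the extra reference.
         */
        ret = memory_failure(pfn, MF_COUNT_INCREASED);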

    Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
    Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
    Signed-off-by: Oscar Salvador
    Suggested-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    Oscar Salvador
     

09 Dec, 2020

1 commit

  • Jann spotted a security hole due to a race in the mm ownership check.

    If the task shares the mm_struct but goes through execve() before
    mm_access(), it can skip the process_madvise_behavior_valid() check.
    That allows *any advice hint* to reach into the remote process.

    This patch removes the mm ownership check. With it, we lose the ability
    for a local process to give *any* advice hint via the vector interface
    for some reason (e.g., performance). Since there is no concrete example
    upstream yet, it is better to remove that ability at this moment and
    revisit it when such a new advice use case comes up.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Reported-by: Jann Horn
    Suggested-by: Jann Horn
    Signed-off-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

23 Nov, 2020

2 commits

  • The calculation of the end page index was incorrect, leading to a
    regression of 70% when running stress-ng.

    With this fix, we instead see a performance improvement of 3%.

    Fixes: e6e88712e43b ("mm: optimise madvise WILLNEED")
    Reported-by: kernel test robot
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Xing Zhengjun
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Feng Tang
    Cc: "Chen, Rong A"
    Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The early return in process_madvise() will produce a memory leak.

    Fix it.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

19 Oct, 2020

2 commits

  • There is a use case where System Management Software (SMS) wants to give
    a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case
    of Android, that software is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall,
    process_madvise(2). It uses the pidfd of an external process to give the
    hint. It also supports vectored address ranges because an Android app
    has thousands of vmas due to zygote, so it is a waste of CPU and power
    to call the syscall once per vma. (Testing a 2000-vma syscall vs a
    single vectored syscall showed a 15% performance improvement; I think it
    would be bigger in real practice because the test ran in a very
    cache-friendly environment.)

    Another potential use case for the vectored range is to amortize the
    cost of TLB shootdowns for multiple ranges when using MADV_DONTNEED;
    this could benefit users like TCP receive zerocopy and malloc
    implementations. In the future we may find more use cases for other
    advice, so let's make this part of the API now that we are introducing a
    new syscall. With that, existing madvise(2) users could replace it with
    process_madvise(2) on their own pid if they want batched address range
    support.

    Since it can affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS) or one that otherwise has the right
    to ptrace the target (e.g., being the same UID) can use it successfully.
    The flags argument is reserved for future use if we need to extend the
    API.

    I think supporting every hint madvise has (or will have) in
    process_madvise is rather risky: we are not sure all hints make sense
    from an external process, and the implementation of a hint may rely on
    the caller being in the current context, so it could be error-prone.
    Thus, this patch limits the hints to MADV_[COLD|PAGEOUT].

    If someone wants to add other hints, we can hear the use case and review
    it for each hint. That is safer for maintenance than introducing a buggy
    syscall that is hard to fix later.

    So finally, the API is as follows:

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                            unsigned long vlen, int advice,
                            unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges of an external process as well as
    the local process. It provides the advice for the address ranges of the
    process described by iovec and vlen. The goal of such advice is to
    improve system or application performance.

    The pidfd argument selects the process referred to by the PID file
    descriptor specified in pidfd. (See pidfd_open(2) for further
    information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
        void  *iov_base;    /* starting address */
        size_t iov_len;     /* number of bytes to be advised */
    };

    Each iovec element describes an address range beginning at iov_base and
    spanning iov_len bytes.

    The vlen argument is the number of elements in iovec.

    The advice is indicated in the advice argument, which is currently one
    of the following if the target process specified by pidfd is external:

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to an external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    process_madvise supports every advice that madvise(2) has if the target
    process is in the same thread group as the calling process, so a user
    can use process_madvise(2) as an extension of existing madvise(2) with
    support for vectored address ranges.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised. This
    return value may be less than the total number of requested bytes if an
    error occurred. The caller should check the return value to determine
    whether a partial advice occurred.
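
    A minimal userspace sketch of the API (hedged: raw syscall numbers from
    the asm-generic table are used because libc wrappers may not exist yet,
    and the two target addresses below are placeholders that would normally
    be parsed from the target's /proc/<pid>/maps):

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <stdlib.h>
        #include <sys/mman.h>
        #include <sys/syscall.h>
        #include <sys/uio.h>
        #include <unistd.h>

        #ifndef __NR_pidfd_open
        #define __NR_pidfd_open 434
        #endif
        #ifndef __NR_process_madvise
        #define __NR_process_madvise 440
        #endif
        #ifndef MADV_COLD
        #define MADV_COLD 20
        #endif

        int main(int argc, char **argv)
        {
            if (argc < 2)
                return 1;

            /* Open a pidfd for the target process. */
            int pidfd = syscall(__NR_pidfd_open, atoi(argv[1]), 0);
            if (pidfd < 0) {
                perror("pidfd_open");
                return 1;
            }

            /* Two address ranges in the *target* process (placeholders). */
            struct iovec vec[2] = {
                { .iov_base = (void *)0x7f0000000000UL, .iov_len = 1 << 20 },
                { .iov_base = (void *)0x7f0000200000UL, .iov_len = 1 << 20 },
            };

            ssize_t ret = syscall(__NR_process_madvise, pidfd, vec, 2UL,
                                  MADV_COLD, 0U);
            if (ret < 0)
                perror("process_madvise");
            else
                printf("advised %zd bytes\n", ret);
            close(pidfd);
            return 0;
        }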

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How is the race (i.e., object validation) handled between the
    moment an external process gives a hint and the moment the target
    process's address space changes?

    process_madvise operates on the target process's address space as it
    exists at the instant process_madvise is called. If the target process
    can run between the time the calling process inspects the target's
    address space and the time process_madvise is actually called,
    process_madvise may operate on memory regions that the calling process
    does not expect. It is the responsibility of the process calling
    process_madvise to close this race condition. For example, the calling
    process can suspend the target with ptrace, SIGSTOP, or the freezer
    cgroup so that it does not have an opportunity to change its own address
    space before process_madvise is called (see the sketch at the end of
    this answer). Another option is to operate on memory regions that the
    caller knows a priori will be unchanged in the target process. Yet
    another option is to accept the race for certain process_madvise calls
    after reasoning that mistargeting will do no harm. The suggested API
    itself does not provide synchronization; the same applies to other APIs
    such as move_pages and process_vm_writev.

    The race isn't really a problem, though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody objects
    to write(2) merely because it's possible for two processes to open the
    same file and clobber each other's writes; instead, we tell people to
    use flock or something. Think about mmap: it never guarantees that newly
    allocated address space is still valid when the user tries to access it,
    because other threads could unmap the memory right before. That is where
    we need synchronization via another API or by design on the user side;
    it shouldn't be part of this API itself. If someone needs finer-grained
    synchronization than process level, two ideas were suggested: a
    cookie [2] and an anon fd [3]. Both could be supported via the reserved
    last argument of the API, but I don't think that is necessary right now
    since we already have ways to prevent the race, so I don't want to add
    additional complexity with a more fine-grained optimization model.

    To keep the API extensible, an unsigned long is reserved as the last
    argument so we can support such a scheme in the future if someone really
    needs it.
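
    A hedged illustration of the "suspend the target first" option using
    plain signals (a production system might use the cgroup freezer instead;
    SIGSTOP delivery is asynchronous, so a robust caller would confirm the
    stop, e.g. via /proc/<pid>/stat, before reading the target's maps):

        #include <signal.h>
        #include <sys/syscall.h>
        #include <sys/types.h>
        #include <sys/uio.h>
        #include <unistd.h>

        #ifndef __NR_process_madvise
        #define __NR_process_madvise 440
        #endif

        /*
         * Stop the target so its address space cannot change while we pick
         * ranges and issue the hint, then let it run again.
         */
        static long advise_stopped(pid_t pid, int pidfd,
                                   const struct iovec *vec,
                                   unsigned long vlen, int advice)
        {
            long ret;

            kill(pid, SIGSTOP);
            ret = syscall(__NR_process_madvise, pidfd, vec, vlen, advice, 0U);
            kill(pid, SIGCONT);
            return ret;
        }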

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such an injected madvise would have to be executed by the
    target process, which means that process would have to be runnable; that
    creates the risk of the above-mentioned race and hinting the wrong VMA.
    Furthermore, we want to apply the hint in the caller's context, not the
    callee's, because the callee is usually constrained by cpuset/cgroups or
    is even in a frozen state, so it cannot act by itself quickly enough,
    which causes more thrashing/kills. It also doesn't work if the target
    process is already ptraced (e.g., by strace, a debugger, or minidump)
    because a process can have at most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "introduce memory hinting API for external process", v9.

    Now we have MADV_PAGEOUT and MADV_COLD as madvise hinting APIs. With
    them, an application can give the kernel hints about which memory ranges
    it prefers to be reclaimed. However, on some platforms (e.g., Android),
    the information required to make the hinting decision is not known to
    the app. Instead, it is known to a centralized userspace daemon (e.g.,
    ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the concern, this patch series introduces a new syscall,
    process_madvise(2). Basically, it is the same as the madvise(2) syscall
    but has some differences:

    1. It needs pidfd of target process to provide the hint

    2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
    moment. Other madvise hints will be opened up when there are explicit
    requests from the community, to prevent unexpected bugs we couldn't
    support.

    3. Only privileged processes can do something to another process's
    address space.

    For more detail of the new API, please see "mm: introduce external memory
    hinting API" description in this patchset.

    This patch (of 3):

    In upcoming patches, do_madvise will be called from an external process
    context, so we shouldn't assume "current" is always the hinted process's
    task_struct.

    Furthermore, we must not access mm_struct via task->mm, but obtain it via
    access_mm() once (in the following patch) and only use that pointer [1],
    so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
    safe, so we can use them further down the call stack.

    For now, pass current->mm as the argument to do_madvise, so this patch
    doesn't change existing behavior but prepares for the next patch and
    keeps the review easy.
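
    The interface change itself is small; a sketch of the new signature and
    how existing callers keep today's behavior:

        int do_madvise(struct mm_struct *mm, unsigned long start,
                       size_t len_in, int behavior);

        /* e.g. the madvise(2) entry point and io_uring both pass: */
        return do_madvise(current->mm, start, len_in, behavior);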

    [vbabka@suse.cz: changelog tweak]
    [minchan@kernel.org: use current->mm for io_uring]
    Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
    [akpm@linux-foundation.org: fix it for upstream changes]
    [akpm@linux-foundation.org: whoops]
    [rdunlap@infradead.org: add missing includes]

    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Daniel Colascione
    Cc: Sandeep Patil
    Cc: Sonny Rao
    Cc: Brian Geffon
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: John Dias
    Cc: Joel Fernandes
    Cc: Alexander Duyck
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Kirill Tkhai
    Cc: Oleksandr Natalenko
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

17 Oct, 2020

3 commits

  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Currently, there is an inconsistency when calling soft-offline from
    different paths on a page that is already poisoned.

    1) madvise:

    madvise_inject_error skips any poisoned page and continues
    the loop.
    If that was the only page to madvise, it returns 0.

    2) /sys/devices/system/memory/:

    When calling soft_offline_page_store()->soft_offline_page(),
    we return -EBUSY in case the page is already poisoned.
    This is inconsistent with a) the above example and b)
    memory_failure, where we return 0 if the page was poisoned.

    Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
    let soft_offline_page return 0 if it finds the page already poisoned.

    Please, note that this represents a user-api change, since now the return
    error when calling soft_offline_page_store()->soft_offline_page() will be
    different.

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • Make a proper if-else condition for {hard,soft}-offline.

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Tony Luck
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Mike Kravetz
    Link: https://lkml.kernel.org/r/20200908075626.11976-3-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

14 Oct, 2020

1 commit

  • Instead of calling find_get_entry() for every page index, use an XArray
    iterator to skip over NULL entries, and avoid calling get_page(),
    because we only want the swap entries.

    [willy@infradead.org: fix LTP soft lockups]
    Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org
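
    A rough sketch of the resulting iteration (simplified from
    force_shm_swapin_readahead(); only swap entries, i.e. value entries, are
    acted on):

        XA_STATE(xas, &mapping->i_pages, linear_page_index(vma, start));
        pgoff_t end_index = linear_page_index(vma, end + PAGE_SIZE - 1);
        struct page *page;

        rcu_read_lock();
        xas_for_each(&xas, page, end_index) {
            swp_entry_t swap;

            if (!xa_is_value(page))     /* only swap entries matter here */
                continue;

            xas_pause(&xas);
            rcu_read_unlock();

            swap = radix_to_swp_entry(page);
            page = read_swap_cache_async(swap, GFP_HIGHUSER_MOVABLE,
                                         NULL, 0, false);
            if (page)
                put_page(page);

            rcu_read_lock();
        }
        rcu_read_unlock();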

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Matthew Auld
    Cc: William Kucharski
    Cc: Qian Cai
    Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

27 Sep, 2020

1 commit

  • syzbot reported the following KASAN splat:

    general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
    CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
    Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
    RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
    Call Trace:
    lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:354 [inline]
    madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
    walk_pmd_range mm/pagewalk.c:89 [inline]
    walk_pud_range mm/pagewalk.c:160 [inline]
    walk_p4d_range mm/pagewalk.c:193 [inline]
    walk_pgd_range mm/pagewalk.c:229 [inline]
    __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
    walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
    madvise_pageout_page_range mm/madvise.c:521 [inline]
    madvise_pageout mm/madvise.c:557 [inline]
    madvise_vma mm/madvise.c:946 [inline]
    do_madvise+0x12d0/0x2090 mm/madvise.c:1145
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The backing vma was shmem.

    In the case of a split file-backed THP, madvise zaps the pmd instead of
    remapping the sub-pages, so we need to re-check the pmd for validity
    after the split.
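
    In code, the re-check amounts to something along these lines (sketch,
    not the literal hunk):

        regular_page:
            if (pmd_trans_unstable(pmd))
                return 0;   /* the pmd was zapped under us; nothing to do */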

    Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

06 Sep, 2020

1 commit

  • The syzbot reported the below use-after-free:

    BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
    BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
    BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
    Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996

    CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x18f/0x20d lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
    madvise_willneed mm/madvise.c:293 [inline]
    madvise_vma mm/madvise.c:942 [inline]
    do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
    do_madvise mm/madvise.c:1169 [inline]
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Allocated by task 9992:
    kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
    vm_area_alloc+0x1c/0x110 kernel/fork.c:347
    mmap_region+0x8e5/0x1780 mm/mmap.c:1743
    do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
    vm_mmap_pgoff+0x195/0x200 mm/util.c:506
    ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 9992:
    kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
    remove_vma+0x132/0x170 mm/mmap.c:184
    remove_vma_list mm/mmap.c:2613 [inline]
    __do_munmap+0x743/0x1170 mm/mmap.c:2869
    do_munmap mm/mmap.c:2877 [inline]
    mmap_region+0x257/0x1780 mm/mmap.c:1716
    do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
    vm_mmap_pgoff+0x195/0x200 mm/util.c:506
    ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    This happens because the vma is accessed after releasing mmap_lock:
    someone else acquired mmap_lock in the meantime and the vma is gone.

    Releasing mmap_lock only after we are done accessing the vma fixes the
    problem.
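
    The fix boils down to reading everything we need from the vma before
    dropping mmap_lock (sketch of the reordered madvise_willneed() path):

        get_file(file);
        offset = (loff_t)(start - vma->vm_start)
                 + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        /*
         * Only drop the lock once we are done with the vma: another thread
         * can grab mmap_lock and munmap() the range the moment we let go.
         */
        mmap_read_unlock(current->mm);
        vfs_fadvise(file, offset, end - start, POSIX_FADV_WILLNEED);
        fput(file);
        mmap_read_lock(current->mm);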

    Fixes: 692fe62433d4c ("mm: Handle MADV_WILLNEED through vfs_fadvise()")
    Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: [5.4+]
    Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
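
    In mm/madvise.c, for example, the rule turns

        down_read(&mm->mmap_sem);
        ...
        up_read(&mm->mmap_sem);

    into

        mmap_read_lock(mm);
        ...
        mmap_read_unlock(mm);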

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

25 Apr, 2020

1 commit

  • IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
    another process, which means that it can race with our crazy core dump
    handling which accesses the VM state without holding the mmap_sem
    (because it incorrectly thinks that it is the final user).

    This is clearly a core dumping problem, but we've never fixed it the
    right way, and instead have the notion of "check that the mm is still
    ok" using mmget_still_valid() after getting the mmap_sem for writing in
    any situation where we're not the original VM thread.

    See commit 04f5866e41fb ("coredump: fix race condition between
    mmget_not_zero()/get_task_mm() and core dumping") for more background on
    this whole mmget_still_valid() thing. You might want to have a barf bag
    handy when you do.

    We're discussing just fixing this properly in the only remaining core
    dumping routines. But even if we do that, let's make do_madvise() do
    the right thing, and then when we fix core dumping, we can remove all
    these mmget_still_valid() checks.

    Reported-and-tested-by: Jann Horn
    Fixes: c1ca757bd6f4 ("io_uring: add IORING_OP_MADVISE")
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Mar, 2020

1 commit

  • Jann has brought up a very interesting point [1]. While shared pages
    are normally excluded from MADV_PAGEOUT, CoW pages can easily be
    reclaimed that way. This can lead to all sorts of hard-to-debug
    problems, e.g. the performance problems outlined by Daniel [2].

    There are runtime environments where a substantial amount of memory is
    shared among security domains via CoW, and an easy way to reclaim that
    memory, which MADV_{COLD,PAGEOUT} offers, can lead to either performance
    degradation for the parent process (which might be more privileged) or
    even open side-channel attacks.

    The feasibility of the latter is not really clear to me TBH, but there
    is no real reason for the exposure at this stage. There seems to be no
    real use case that depends on reclaiming CoW memory via madvise, so it
    is much easier to simply disallow it, and that is what this patch does.
    Put simply, MADV_{PAGEOUT,COLD} can operate only on exclusively owned
    memory, which is a straightforward semantic.

    [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
    [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com
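
    The enforcement amounts to a mapcount check in the cold/pageout page
    walk (sketch, not the literal hunk):

        /* madvise_cold_or_pageout_pte_range(), per-page loop */
        if (page_mapcount(page) != 1)
            continue;   /* shared (e.g. CoW) page: do not interfere */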

    Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
    Reported-by: Jann Horn
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: "Joel Fernandes (Google)"
    Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 Jan, 2020

1 commit

  • This is in preparation for enabling this functionality through io_uring.
    Add a helper that is just exporting what sys_madvise() does, and have the
    system call use it.

    No functional changes in this patch.
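
    The split looks roughly like:

        int do_madvise(unsigned long start, size_t len_in, int behavior)
        {
            /* former sys_madvise() body */
        }

        SYSCALL_DEFINE3(madvise, unsigned long, start, size_t, len_in,
                        int, behavior)
        {
            return do_madvise(start, len_in, behavior);
        }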

    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Dec, 2019

3 commits

  • Improve readability, no functional change.

    Link: http://lkml.kernel.org/r/20191118032857.22683-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • page_size() is supported after the commit a50b854e073c ("mm: introduce
    page_size()").

    Use page_size() in madvise_inject_error() for readability.

    [akpm@linux-foundation.org: use ulong for `size', per David]
    Link: http://lkml.kernel.org/r/29dce60c-38d6-0220-f292-e298f0c78c4d@huawei.com
    Signed-off-by: Yunfeng Ye
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Jason Gunthorpe
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Hu Shiyuan
    Cc: Feilong Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yunfeng Ye
     
  • Currently soft_offline_page() receives struct page, and its sibling
    memory_failure() receives pfn. This discrepancy looks weird and makes
    precheck on pfn validity tricky. So let's align them.

    Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Nov, 2019

1 commit

  • Recently, I hit the following issue when running upstream.

    kernel BUG at mm/vmscan.c:1521!
    invalid opcode: 0000 [#1] SMP KASAN PTI
    CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
    RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
    Call Trace:
    reclaim_pages+0x499/0x800 mm/vmscan.c:2188
    madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
    walk_pmd_range mm/pagewalk.c:53 [inline]
    walk_pud_range mm/pagewalk.c:112 [inline]
    walk_p4d_range mm/pagewalk.c:139 [inline]
    walk_pgd_range mm/pagewalk.c:166 [inline]
    __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
    walk_page_range+0x179/0x310 mm/pagewalk.c:349
    madvise_pageout_page_range mm/madvise.c:506 [inline]
    madvise_pageout+0x1f0/0x330 mm/madvise.c:542
    madvise_vma mm/madvise.c:931 [inline]
    __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
    do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    madvise_pageout() accesses the specified range of the vma, isolates the
    pages, and then runs shrink_page_list() to reclaim their memory. But it
    also isolates unevictable pages for reclaim, hence we can hit the BUG
    cases in shrink_page_list().

    The root cause is that we scan the page tables instead of a specific LRU
    list, so we need to filter out the unevictable LRU pages on our end.
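
    The filtering ends up being applied at isolation time, roughly (sketch,
    not the literal hunk):

        if (!isolate_lru_page(page)) {
            if (PageUnevictable(page))
                putback_lru_page(page);   /* e.g. mlocked: skip reclaim */
            else
                list_add(&page->lru, &page_list);
        }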

    Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: zhong jiang
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

26 Sep, 2019

4 commits

  • There are many common parts between MADV_COLD and MADV_PAGEOUT. This
    patch factors them out to avoid code duplication.

    Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: James E.J. Bottomley
    Cc: Joel Fernandes (Google)
    Cc: kbuild test robot
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • When a process expects no accesses to a certain memory range for a long
    time, it can hint the kernel that the pages can be reclaimed instantly
    but the data should be preserved for future use. This can reduce
    workingset eviction and so ends up increasing performance.

    This patch introduces the new MADV_PAGEOUT hint to the madvise(2)
    syscall. MADV_PAGEOUT can be used by a process to mark a memory range as
    not expected to be used for a long time, so that the kernel reclaims
    *any LRU* pages in it instantly. The hint can help the kernel decide
    which pages to evict proactively.

    A note: it intentionally doesn't apply the SWAP_CLUSTER_MAX LRU page
    isolation limit because the work is automatically bounded by the PMD
    size. If the PMD size (e.g., 256 pages) causes trouble, we can fix it
    later by limiting it to SWAP_CLUSTER_MAX [1].

    - man-page material

    MADV_PAGEOUT (since Linux x.x)

    Do not expect access in the near future, so pages in the specified
    regions can be reclaimed instantly regardless of memory pressure. Thus,
    access in the range after a successful operation may cause a major page
    fault but never loses the up-to-date contents, unlike MADV_DONTNEED.
    Pages belonging to a shared mapping are only processed if a write access
    is allowed for the calling process.

    MADV_PAGEOUT cannot be applied to locked pages, Huge TLB pages, or
    VM_PFNMAP pages.

    [1] https://lore.kernel.org/lkml/20190710194719.GS29695@dhcp22.suse.cz/
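
    A minimal usage sketch (assumes a kernel with this patch; MADV_PAGEOUT
    is defined as 21 here in case the libc headers do not have it yet):

        #include <stdio.h>
        #include <string.h>
        #include <sys/mman.h>

        #ifndef MADV_PAGEOUT
        #define MADV_PAGEOUT 21
        #endif

        int main(void)
        {
            size_t len = 64UL << 20;   /* 64MB we won't touch for a while */
            char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (buf == MAP_FAILED) {
                perror("mmap");
                return 1;
            }

            memset(buf, 0xaa, len);               /* populate the pages */
            if (madvise(buf, len, MADV_PAGEOUT))  /* reclaim them right away */
                perror("madvise(MADV_PAGEOUT)");

            /*
             * Later accesses fault the data back in; unlike MADV_DONTNEED,
             * the contents are preserved.
             */
            printf("first byte after pageout: 0x%02x\n",
                   (unsigned char)buf[0]);
            return 0;
        }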

    [minchan@kernel.org: clear PG_active on MADV_PAGEOUT]
    Link: http://lkml.kernel.org/r/20190802200643.GA181880@google.com
    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-5-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "Introduce MADV_COLD and MADV_PAGEOUT", v7.

    - Background

    The Android terminology used for forking a new process and starting an
    app from scratch is a cold start, while resuming an existing app is a
    hot start. While we continually try to improve the performance of cold
    starts, hot starts will always be significantly less power hungry as
    well as faster, so we are trying to make hot starts more likely than
    cold starts.

    To increase hot starts, Android userspace manages the order in which
    apps should be killed in a process called ActivityManagerService.
    ActivityManagerService tracks every Android app or service that the user
    could be interacting with at any time and translates that into a ranked
    list for lmkd (the low memory killer daemon). They are likely to be
    killed by lmkd if the system has to reclaim memory. In that sense they
    are similar to entries in any other cache. Those apps are kept alive for
    opportunistic performance improvements, but those improvements will vary
    based on the memory requirements of individual workloads.

    - Problem

    Naturally, cached apps were dominant consumers of memory on the system.
    However, they were not significant consumers of swap even though they
    are good candidates for swap. Under investigation, swapping out only
    begins once the low zone watermark is hit and kswapd wakes up, but the
    overall allocation rate in the system might trip lmkd thresholds and
    cause a cached process to be killed. (We measured the performance of
    swapping out vs. zapping the memory by killing a process;
    unsurprisingly, zapping is 10x faster even though we use zram, which is
    much faster than real storage.) So kills from lmkd will often satisfy
    the high zone watermark, resulting in very few pages actually being
    moved to swap.

    - Approach

    The approach we chose was to use a new interface to allow userspace to
    proactively reclaim entire processes by leveraging platform information.
    This allowed us to bypass the inaccuracy of the kernel's LRUs for pages
    that are known to be cold from userspace and to avoid races with lmkd by
    reclaiming apps as soon as they entered the cached state. Additionally,
    it gives the platform many chances to use its information to optimize
    memory efficiency.

    To achieve the goal, the patchset introduces two new options for
    madvise. One is MADV_COLD, which will deactivate active pages, and the
    other is MADV_PAGEOUT, which will reclaim private pages instantly. These
    new options complement MADV_DONTNEED and MADV_FREE by adding
    non-destructive ways to gain some free memory space. MADV_PAGEOUT is
    similar to MADV_DONTNEED in that it hints the kernel that the memory
    region is not currently needed and should be reclaimed immediately;
    MADV_COLD is similar to MADV_FREE in that it hints the kernel that the
    memory region is not currently needed and should be reclaimed when
    memory pressure rises.

    This patch (of 5):

    When a process expects no accesses to a certain memory range, it can
    give a hint to the kernel that the pages can be reclaimed when memory
    pressure happens, but the data should be preserved for future use. This
    can reduce workingset eviction and so ends up increasing performance.

    This patch introduces the new MADV_COLD hint to the madvise(2) syscall.
    MADV_COLD can be used by a process to mark a memory range as not
    expected to be used in the near future. The hint can help the kernel
    decide which pages to evict early during memory pressure.

    It works for every LRU page, like MADV_[DONTNEED|FREE]. IOW, it moves:

    active file page -> inactive file LRU
    active anon page -> inactive anon LRU

    Unlike MADV_FREE, it doesn't move active anonymous pages to the
    inactive file LRU's head, because MADV_COLD has a slightly different
    semantic. MADV_FREE means it is okay to discard the page under memory
    pressure because its content is *garbage*, so freeing such pages has
    almost zero overhead: we don't need to swap them out, and a later access
    causes only a minor fault. Thus it makes sense to put those freeable
    pages on the inactive file LRU to compete with other used-once pages.
    It also makes sense from an implementation point of view, because they
    are no longer swap-backed memory until they are re-dirtied. It even
    gives a bonus of making them reclaimable on swapless systems. However,
    MADV_COLD doesn't mean garbage, so reclaiming such pages requires
    swap-out/in in the end, which is a bigger cost. Since we have designed
    VM LRU aging based on a cost model, anonymous cold pages are better
    positioned on the inactive anon LRU list, not the file LRU. Furthermore,
    it helps to avoid unnecessary scanning if the system doesn't have a swap
    device. Let's start with the simpler way without adding complexity at
    this moment. Keep in mind the caveat, though, that workloads with a lot
    of page cache are likely to effectively ignore MADV_COLD on anonymous
    memory, because we rarely age anonymous LRU lists.

    * man-page material

    MADV_COLD (since Linux x.x)

    Pages in the specified regions will be treated as less-recently-accessed
    compared to pages in the system with similar access frequencies. In
    contrast to MADV_FREE, the contents of the region are preserved regardless
    of subsequent writes to pages.

    MADV_COLD cannot be applied to locked pages, Huge TLB pages, or VM_PFNMAP
    pages.

    [akpm@linux-foundation.org: resolve conflicts with hmm.git]
    Link: http://lkml.kernel.org/r/20190726023435.214162-2-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Reported-by: kbuild test robot
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: James E.J. Bottomley
    Cc: Richard Henderson
    Cc: Ralf Baechle
    Cc: Chris Zankel
    Cc: Johannes Weiner
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: Joel Fernandes (Google)
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • This patch is part of a series that extends the kernel ABI to allow
    passing tagged user pointers (with the top byte set to something other
    than 0x00) as syscall arguments.

    This patch allows tagged pointers to be passed to the following memory
    syscalls: get_mempolicy, madvise, mbind, mincore, mlock, mlock2, mprotect,
    mremap, msync, munlock, move_pages.

    The mmap and mremap syscalls do not currently accept tagged addresses.
    Architectures may interpret the tag as a background colour for the
    corresponding vma.
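
    A hedged userspace illustration for arm64 with top-byte-ignore (the tag
    value 0x2a is arbitrary; on arm64 the process must first opt in to the
    tagged address ABI, and the prctl constants are defined below in case
    the headers predate them):

        #include <stdint.h>
        #include <sys/mman.h>
        #include <sys/prctl.h>

        #ifndef PR_SET_TAGGED_ADDR_CTRL
        #define PR_SET_TAGGED_ADDR_CTRL 55
        #define PR_TAGGED_ADDR_ENABLE   (1UL << 0)
        #endif

        int main(void)
        {
            size_t len = 1 << 20;

            /* Opt in to passing tagged pointers to syscalls. */
            if (prctl(PR_SET_TAGGED_ADDR_CTRL, PR_TAGGED_ADDR_ENABLE,
                      0, 0, 0))
                return 1;

            void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            if (p == MAP_FAILED)
                return 1;

            /* Same mapping, but with a tag in the top byte of the pointer. */
            void *tagged = (void *)((uintptr_t)p | (0x2aULL << 56));

            /* Accepted by madvise() once this series is applied. */
            return madvise(tagged, len, MADV_DONTNEED);
        }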

    Link: http://lkml.kernel.org/r/aaf0c0969d46b2feb9017f3e1b3ef3970b633d91.1563904656.git.andreyknvl@google.com
    Signed-off-by: Andrey Konovalov
    Reviewed-by: Khalid Aziz
    Reviewed-by: Vincenzo Frascino
    Reviewed-by: Catalin Marinas
    Reviewed-by: Kees Cook
    Cc: Al Viro
    Cc: Dave Hansen
    Cc: Eric Auger
    Cc: Felix Kuehling
    Cc: Jens Wiklander
    Cc: Mauro Carvalho Chehab
    Cc: Mike Rapoport
    Cc: Will Deacon
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Konovalov
     

25 Sep, 2019

1 commit

  • madvise_behavior() converts -ENOMEM to -EAGAIN in several places using
    identical code.

    Move that code to a common error handling path.

    No functional changes.

    Link: http://lkml.kernel.org/r/1564640896-1210-1-git-send-email-rppt@linux.ibm.com
    Signed-off-by: Mike Rapoport
    Acked-by: Pankaj Gupta
    Reviewed-by: Anshuman Khandual
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

22 Sep, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is more cleanup and consolidation of the hmm APIs and the very
    strongly related mmu_notifier interfaces. Many places across the tree
    using these interfaces are touched in the process. Beyond that a
    cleanup to the page walker API and a few memremap related changes
    round out the series:

    - General improvement of hmm_range_fault() and related APIs, more
    documentation, bug fixes from testing, API simplification &
    consolidation, and unused API removal

    - Simplify the hmm related kconfigs to HMM_MIRROR and DEVICE_PRIVATE,
    and make them internal kconfig selects

    - Hoist a lot of code related to mmu notifier attachment out of
    drivers by using a refcount get/put attachment idiom and remove the
    convoluted mmu_notifier_unregister_no_release() and related APIs.

    - General API improvement for the migrate_vma API and revision of its
    only user in nouveau

    - Annotate mmu_notifiers with lockdep and sleeping region debugging

    Two series unrelated to HMM or mmu_notifiers came along due to
    dependencies:

    - Allow pagemap's memremap_pages family of APIs to work without
    providing a struct device

    - Make walk_page_range() and related use a constant structure for
    function pointers"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: (75 commits)
    libnvdimm: Enable unit test infrastructure compile checks
    mm, notifier: Catch sleeping/blocking for !blockable
    kernel.h: Add non_block_start/end()
    drm/radeon: guard against calling an unpaired radeon_mn_unregister()
    csky: add missing brackets in a macro for tlb.h
    pagewalk: use lockdep_assert_held for locking validation
    pagewalk: separate function pointers from iterator data
    mm: split out a new pagewalk.h header from mm.h
    mm/mmu_notifiers: annotate with might_sleep()
    mm/mmu_notifiers: prime lockdep
    mm/mmu_notifiers: add a lockdep map for invalidate_range_start/end
    mm/mmu_notifiers: remove the __mmu_notifier_invalidate_range_start/end exports
    mm/hmm: hmm_range_fault() infinite loop
    mm/hmm: hmm_range_fault() NULL pointer bug
    mm/hmm: fix hmm_range_fault()'s handling of swapped out pages
    mm/mmu_notifiers: remove unregister_no_release
    RDMA/odp: remove ib_ucontext from ib_umem
    RDMA/odp: use mmu_notifier_get/put for 'struct ib_ucontext_per_mm'
    RDMA/mlx5: Use odp instead of mr->umem in pagefault_mr
    RDMA/mlx5: Use ib_umem_start instead of umem.address
    ...

    Linus Torvalds
     

07 Sep, 2019

2 commits

  • The mm_walk structure currently mixed data and code. Split out the
    operations vectors into a new mm_walk_ops structure, and while we are
    changing the API also declare the mm_walk structure inside the
    walk_page_range and walk_page_vma functions.

    Based on patch from Linus Torvalds.
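
    For mm/madvise.c this means the walk callbacks move into a const ops
    table, roughly (names as in the converted file, give or take):

        static const struct mm_walk_ops madvise_free_walk_ops = {
            .pmd_entry = madvise_free_pte_range,
        };
        ...
        walk_page_range(vma->vm_mm, range.start, range.end,
                        &madvise_free_walk_ops, &tlb);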

    Link: https://lore.kernel.org/r/20190828141955.22210-3-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     
  • Add a new header for the two handful of users of the walk_page_range /
    walk_page_vma interface instead of polluting all users of mm.h with it.

    Link: https://lore.kernel.org/r/20190828141955.22210-2-hch@lst.de
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Thomas Hellstrom
    Reviewed-by: Steven Price
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

31 Aug, 2019

1 commit

  • Currently, handling of the MADV_WILLNEED hint calls directly into the
    readahead code. Handle it by calling vfs_fadvise() instead, so that the
    filesystem can use its ->fadvise() callback to acquire necessary locks
    or otherwise prepare for the request.
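
    The file-backed case then funnels into the filesystem via vfs_fadvise(),
    roughly:

        /* madvise_willneed(), file-backed case, simplified */
        offset = (loff_t)(start - vma->vm_start)
                 + ((loff_t)vma->vm_pgoff << PAGE_SHIFT);
        vfs_fadvise(vma->vm_file, offset, end - start, POSIX_FADV_WILLNEED);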

    Suggested-by: Amir Goldstein
    Reviewed-by: Boaz Harrosh
    CC: stable@vger.kernel.org
    Signed-off-by: Jan Kara
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Darrick J. Wong

    Jan Kara
     

03 Jul, 2019

1 commit

  • The code hasn't been used since it was added to the tree, and doesn't
    appear to actually be usable.

    Signed-off-by: Christoph Hellwig
    Reviewed-by: Jason Gunthorpe
    Acked-by: Michal Hocko
    Reviewed-by: Dan Williams
    Tested-by: Dan Williams
    Signed-off-by: Jason Gunthorpe

    Christoph Hellwig
     

15 May, 2019

2 commits

  • This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • CPU page table updates can happen for many reasons, not only as a
    result of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but
    also as a result of kernel activities (memory compression, reclaim,
    migration, ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. The current API only provides the range
    of virtual addresses affected by the change, not why the change is
    happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of the mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu
    notifier should assume that everything mapped in the range is going away
    when that event happens. A later patch converts the mm call paths to a
    more appropriate event for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    [coccinelle semantic patch not reproduced intact here: it matches
    functions that take a struct vm_area_struct argument and rewrites their
    mmu_notifier_range_init() calls to also pass the vma and the default
    MMU_NOTIFY_UNMAP event]

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

03 Apr, 2019

1 commit

  • Move the mmu_gather::page_size things into the generic code instead of
    PowerPC specific bits.

    No change in behavior intended.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Andy Lutomirski
    Cc: Aneesh Kumar K.V
    Cc: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Dec, 2018

1 commit

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

29 Oct, 2018

1 commit

  • Pull XArray conversion from Matthew Wilcox:
    "The XArray provides an improved interface to the radix tree data
    structure, providing locking as part of the API, specifying GFP flags
    at allocation time, eliminating preloading, less re-walking the tree,
    more efficient iterations and not exposing RCU-protected pointers to
    its users.

    This patch set

    1. Introduces the XArray implementation

    2. Converts the pagecache to use it

    3. Converts memremap to use it

    The page cache is the most complex and important user of the radix
    tree, so converting it was most important. Converting the memremap
    code removes the only other user of the multiorder code, which allows
    us to remove the radix tree code that supported it.

    I have 40+ followup patches to convert many other users of the radix
    tree over to the XArray, but I'd like to get this part in first. The
    other conversions haven't been in linux-next and aren't suitable for
    applying yet, but you can see them in the xarray-conv branch if you're
    interested"

    * 'xarray' of git://git.infradead.org/users/willy/linux-dax: (90 commits)
    radix tree: Remove multiorder support
    radix tree test: Convert multiorder tests to XArray
    radix tree tests: Convert item_delete_rcu to XArray
    radix tree tests: Convert item_kill_tree to XArray
    radix tree tests: Move item_insert_order
    radix tree test suite: Remove multiorder benchmarking
    radix tree test suite: Remove __item_insert
    memremap: Convert to XArray
    xarray: Add range store functionality
    xarray: Move multiorder_check to in-kernel tests
    xarray: Move multiorder_shrink to kernel tests
    xarray: Move multiorder account test in-kernel
    radix tree test suite: Convert iteration test to XArray
    radix tree test suite: Convert tag_tagged_items to XArray
    radix tree: Remove radix_tree_clear_tags
    radix tree: Remove radix_tree_maybe_preload_order
    radix tree: Remove split/join code
    radix tree: Remove radix_tree_update_node_t
    page cache: Finish XArray conversion
    dax: Convert page fault handlers to XArray
    ...

    Linus Torvalds
     

06 Oct, 2018

1 commit

  • Reproducer, assuming 2M of hugetlbfs available:

    Hugetlbfs mounted, size=2M and option user=testuser

    # mount | grep ^hugetlbfs
    hugetlbfs on /dev/hugepages type hugetlbfs (rw,pagesize=2M,user=dan)
    # sysctl vm.nr_hugepages=1
    vm.nr_hugepages = 1
    # grep Huge /proc/meminfo
    AnonHugePages: 0 kB
    ShmemHugePages: 0 kB
    HugePages_Total: 1
    HugePages_Free: 1
    HugePages_Rsvd: 0
    HugePages_Surp: 0
    Hugepagesize: 2048 kB
    Hugetlb: 2048 kB

    Code:

    #include <sys/mman.h>

    #define SIZE (2*1024*1024)
    int main()
    {
        void *ptr;
        ptr = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_HUGETLB | MAP_ANONYMOUS, -1, 0);
        madvise(ptr, SIZE, MADV_DONTDUMP);
        madvise(ptr, SIZE, MADV_DODUMP);
    }

    Compile and strace:

    mmap(NULL, 2097152, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7ff7c9200000
    madvise(0x7ff7c9200000, 2097152, MADV_DONTDUMP) = 0
    madvise(0x7ff7c9200000, 2097152, MADV_DODUMP) = -1 EINVAL (Invalid argument)

    hugetlbfs pages have VM_DONTEXPAND set in their VmFlags, like driver
    pages, based on the author's testing with analysis from Florian
    Weimer [1].

    The inclusion of VM_DONTEXPAND in the VM_SPECIAL definition was a
    consequence of the heavy usage of VM_DONTEXPAND in device drivers.

    A consequence of [2] is that VM_DONTEXPAND-marked pages cannot be
    marked DODUMP again.

    A user could quite legitimately madvise(MADV_DONTDUMP) their hugetlbfs
    memory for a while and later request madvise(MADV_DODUMP) on the same
    memory. We correct this omission by allowing madvise(MADV_DODUMP) on
    hugetlbfs pages.

    [1] https://stackoverflow.com/questions/52548260/madvisedodump-on-the-same-ptr-size-as-a-successful-madvisedontdump-fails-wit
    [2] commit 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")

    Link: http://lkml.kernel.org/r/20180930054629.29150-1-daniel@linux.ibm.com
    Link: https://lists.launchpad.net/maria-discuss/msg05245.html
    Fixes: 0103bd16fb90 ("mm: prepare VM_DONTDUMP for using in drivers")
    Reported-by: Kenneth Penza
    Signed-off-by: Daniel Black
    Reviewed-by: Mike Kravetz
    Cc: Konstantin Khlebnikov
    Signed-off-by: Andrew Morton
    Signed-off-by: Greg Kroah-Hartman

    Daniel Black
     

30 Sep, 2018

1 commit

  • Introduce xarray value entries and tagged pointers to replace radix
    tree exceptional entries. This is a slight change in encoding to allow
    the use of an extra bit (we can now store BITS_PER_LONG - 1 bits in a
    value entry). It is also a change in emphasis; exceptional entries are
    intimidating and different. As the comment explains, you can choose
    to store values or pointers in the xarray and they are both first-class
    citizens.

    Signed-off-by: Matthew Wilcox
    Reviewed-by: Josef Bacik

    Matthew Wilcox
     

24 Jul, 2018

1 commit

  • The madvise_inject_error() routine uses get_user_pages() to lookup the
    pfn and other information for injected error, but it does not release
    that pin. The assumption is that failed pages should be taken out of
    circulation.

    However, for dax mappings it is not possible to take pages out of
    circulation since they are 1:1 physically mapped as filesystem blocks,
    or device-dax capacity. They also typically represent persistent memory
    which has an error clearing capability.

    In preparation for adding a special handler for dax mappings, shift the
    responsibility of taking the page reference to memory_failure(). I.e.
    drop the page reference and do not specify MF_COUNT_INCREASED to
    memory_failure().

    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Signed-off-by: Dan Williams
    Acked-by: Naoya Horiguchi
    Signed-off-by: Dave Jiang

    Dan Williams
     

24 Jan, 2018

1 commit