12 Jul, 2022

1 commit

  • [ Upstream commit d1fe111fb62a1cf0446a2919f5effbb33ad0702c ]

    When the hwpoison page meets the filter conditions, it should not be
    regarded as successful memory_failure() processing by the mce handler;
    memory_failure() should instead return a distinct value. Otherwise the
    mce handler assumes the error page has been identified and isolated,
    which may lead to calling set_mce_nospec() to change page attributes,
    etc.

    Here memory_failure() returns -EOPNOTSUPP to indicate that the error
    event is filtered. The mce handler should not take any action in this
    situation, and the hwpoison injector should treat it as correct.
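
As a rough user-space illustration of the three outcomes described above, the sketch below maps a memory_failure()-style return code to the caller's next step. The enum and both function names are hypothetical and are not kernel identifiers; only the meaning of 0 and -EOPNOTSUPP is taken from the commit message.

```c
#include <errno.h>

/* Hypothetical actions a caller might take; not kernel identifiers. */
enum mf_action {
    MF_ACTION_ISOLATED, /* page handled: follow-ups such as set_mce_nospec() apply */
    MF_ACTION_NONE,     /* event was filtered: take no action */
    MF_ACTION_FAILED    /* any other error: recovery did not succeed */
};

/* Map a memory_failure()-style return code to the mce-style caller's next step. */
static enum mf_action classify_memory_failure(int ret)
{
    if (ret == 0)
        return MF_ACTION_ISOLATED;
    if (ret == -EOPNOTSUPP)
        return MF_ACTION_NONE;
    return MF_ACTION_FAILED;
}

/* An injector-style caller treats a filtered event as success. */
static int injector_result(int ret)
{
    return ret == -EOPNOTSUPP ? 0 : ret;
}
```

The point of the distinct value is that "filtered" no longer collapses into "isolated", so the two callers can react differently to the same return code.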

    Link: https://lkml.kernel.org/r/20220223082135.2769649-1-luofei@unicloud.com
    Signed-off-by: luofei
    Acked-by: Borislav Petkov
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Ingo Molnar
    Cc: Miaohe Lin
    Cc: Naoya Horiguchi
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sasha Levin

    luofei
     

08 Apr, 2022

3 commits

  • commit e6b0a7b357659c332231621e4315658d062c23ee upstream.

    This reverts commit 08095d6310a7 ("mm: madvise: skip unmapped vma holes
    passed to process_madvise") as process_madvise() fails to return the
    exact processed bytes in other cases too.

    As an example: if process_madvise() hits mlocked pages after processing
    some initial bytes passed in [start, end), it just returns EINVAL
    although some bytes are processed. Thus making an exception only for
    ENOMEM is partially fixing the problem of returning the proper advised
    bytes.

    Thus revert this patch and return proper bytes advised.

    Link: https://lkml.kernel.org/r/e73da1304a88b6a8a11907045117cccf4c2b8374.1648046642.git.quic_charante@quicinc.com
    Fixes: 08095d6310a7ce ("mm: madvise: skip unmapped vma holes passed to process_madvise")
    Signed-off-by: Charan Teja Kalla
    Acked-by: Michal Hocko
    Cc: Suren Baghdasaryan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Nadav Amit
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Charan Teja Kalla
     
  • commit 5bd009c7c9a9e888077c07535dc0c70aeab242c3 upstream.

    Patch series "mm: madvise: return correct bytes processed with
    process_madvise", v2. With process_madvise(), always choose to return
    the non-zero number of processed bytes over an error. This helps the
    user determine which VMA, passed in the 'struct iovec' vector list,
    failed to be advised, and thus decide whether to retry or skip that VMA.

    This patch (of 2):

    The process_madvise() system call returns an error even after processing
    some of the VMAs passed in the 'struct iovec' vector list, which leaves
    the user unsure where to restart the advice. It also contradicts the
    syscall's man page[1], which states that the "return value may be less
    than the total number of requested bytes, if an error occurred after
    some iovec elements were already processed.".

    Consider a user who passes 10 VMAs in the 'struct iovec' vector list, of
    which the first 9 are processed and one fails. The call then returns
    only the error caused by the failed VMA, despite the first 9 VMAs having
    been processed, leaving the user unsure which VMA failed. Returning the
    number of bytes processed lets the user work out which VMA failed and
    retry or skip the advice on that VMA.
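
A minimal sketch of how a caller could use the returned byte count, assuming the count always covers whole iovec elements in order. The helper name first_unadvised_iovec() is hypothetical; it only does arithmetic on the caller's own vector and does not call the kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <sys/types.h>
#include <sys/uio.h>

/*
 * Given the iovec array passed to process_madvise() and the number of bytes
 * it reported as advised, return the index of the first element that was not
 * fully advised, or iovcnt if every element was covered.
 */
static size_t first_unadvised_iovec(const struct iovec *iov, size_t iovcnt,
                                    size_t bytes_advised)
{
    size_t i;

    assert(iov != NULL || iovcnt == 0);

    for (i = 0; i < iovcnt; i++) {
        if (bytes_advised < iov[i].iov_len)
            break;
        bytes_advised -= iov[i].iov_len;
    }
    return i;
}
```

With the 10-element example above, a return value covering the first 9 elements yields index 9, which is the element to retry or skip.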

    [1]https://man7.org/linux/man-pages/man2/process_madvise.2.html.

    Link: https://lkml.kernel.org/r/cover.1647008754.git.quic_charante@quicinc.com
    Link: https://lkml.kernel.org/r/125b61a0edcee5c2db8658aed9d06a43a19ccafc.1647008754.git.quic_charante@quicinc.com
    Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Signed-off-by: Charan Teja Kalla
    Cc: Suren Baghdasaryan
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Stephen Rothwell
    Cc: Minchan Kim
    Cc: Nadav Amit
    Cc: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Charan Teja Kalla
     
  • commit 08095d6310a7ce43256b4251577bc66a25c6e1a6 upstream.

    The process_madvise() system call is expected to skip holes in the VMAs
    passed through the 'struct iovec' vector list. But do_madvise(), which
    process_madvise() calls for each VMA, returns ENOMEM in case of unmapped
    holes, even though the VMA is processed.

    Thus process_madvise() should treat ENOMEM as expected, consider the
    VMA passed to it as processed, and continue processing the other VMAs in
    the vector list. If -ENOMEM is returned to the user even though the VMA
    is processed, the user cannot figure out where to start the next madvise.

    Link: https://lkml.kernel.org/r/4f091776142f2ebf7b94018146de72318474e686.1647008754.git.quic_charante@quicinc.com
    Fixes: ecb8ac8b1f14("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Signed-off-by: Charan Teja Kalla
    Cc: David Rientjes
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Nadav Amit
    Cc: Stephen Rothwell
    Cc: Suren Baghdasaryan
    Cc: Vlastimil Babka
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Charan Teja Kalla
     

04 Sep, 2021

2 commits

  • Merge misc updates from Andrew Morton:
    "173 patches.

    Subsystems affected by this series: ia64, ocfs2, block, and mm (debug,
    pagecache, gup, swap, shmem, memcg, selftests, pagemap, mremap,
    bootmem, sparsemem, vmalloc, kasan, pagealloc, memory-failure,
    hugetlb, userfaultfd, vmscan, compaction, mempolicy, memblock,
    oom-kill, migration, ksm, percpu, vmstat, and madvise)"

    * emailed patches from Andrew Morton: (173 commits)
    mm/madvise: add MADV_WILLNEED to process_madvise()
    mm/vmstat: remove unneeded return value
    mm/vmstat: simplify the array size calculation
    mm/vmstat: correct some wrong comments
    mm/percpu,c: remove obsolete comments of pcpu_chunk_populated()
    selftests: vm: add COW time test for KSM pages
    selftests: vm: add KSM merging time test
    mm: KSM: fix data type
    selftests: vm: add KSM merging across nodes test
    selftests: vm: add KSM zero page merging test
    selftests: vm: add KSM unmerge test
    selftests: vm: add KSM merge test
    mm/migrate: correct kernel-doc notation
    mm: wire up syscall process_mrelease
    mm: introduce process_mrelease system call
    memblock: make memblock_find_in_range method private
    mm/mempolicy.c: use in_task() in mempolicy_slab_node()
    mm/mempolicy: unify the create() func for bind/interleave/prefer-many policies
    mm/mempolicy: advertise new MPOL_PREFERRED_MANY
    mm/hugetlb: add support for mempolicy MPOL_PREFERRED_MANY
    ...

    Linus Torvalds
     
  • There is a use case in Android where an app process's memory is swapped
    out by process_madvise() with MADV_PAGEOUT, for example to zram or a
    backing device. When the process is scheduled to run again, such as on a
    switch to the foreground, multiple page faults may cause the app to drop
    frames.

    To reduce the problem, System Management Software can read ahead the
    memory of the process immediately when the app switches to the
    foreground. Calling process_madvise() with MADV_WILLNEED can meet this
    need.
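
A hedged user-space sketch of such a call, using raw syscall(2) so it does not depend on a libc wrapper. The helper name willneed_remote() is hypothetical, and the fallback syscall numbers (434 and 440) are an assumption for toolchains whose headers predate these syscalls; verify them for your architecture.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed fallback numbers for older headers; check your architecture. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif

/*
 * Ask the kernel to read ahead len bytes at addr in process pid.
 * addr is an address in the target's address space.
 * Returns the number of bytes advised, or -errno on failure.
 */
static long willneed_remote(pid_t pid, void *addr, size_t len)
{
    struct iovec iov = { .iov_base = addr, .iov_len = len };
    long pidfd, ret;
    int saved;

    assert(addr != NULL);

    pidfd = syscall(SYS_pidfd_open, pid, 0U);
    if (pidfd < 0)
        return -errno;

    ret = syscall(SYS_process_madvise, (int)pidfd, &iov, (size_t)1,
                  MADV_WILLNEED, 0U);
    saved = errno;          /* close() may overwrite errno */
    close((int)pidfd);
    return ret < 0 ? -saved : ret;
}
```

Expect a negative errno on kernels without the syscall, on kernels that do not accept MADV_WILLNEED here, or when the caller lacks the required privileges.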

    Link: https://lkml.kernel.org/r/20210804082010.12482-1-zhangkui@oppo.com
    Signed-off-by: zhangkui
    Cc: David Hildenbrand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhangkui
     

31 Aug, 2021

1 commit

  • Pull fs hole punching vs cache filling race fixes from Jan Kara:
    "Fix races leading to possible data corruption or stale data exposure
    in multiple filesystems when hole punching races with operations such
    as readahead.

    This is the series I was sending for the last merge window but with
    your objection fixed - now filemap_fault() has been modified to take
    invalidate_lock only when we need to create new page in the page cache
    and / or bring it uptodate"

    * tag 'hole_punch_for_v5.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/jack/linux-fs:
    filesystems/locking: fix Malformed table warning
    cifs: Fix race between hole punch and page fault
    ceph: Fix race between hole punch and page fault
    fuse: Convert to using invalidate_lock
    f2fs: Convert to using invalidate_lock
    zonefs: Convert to using invalidate_lock
    xfs: Convert double locking of MMAPLOCK to use VFS helpers
    xfs: Convert to use invalidate_lock
    xfs: Refactor xfs_isilocked()
    ext2: Convert to using invalidate_lock
    ext4: Convert to use mapping->invalidate_lock
    mm: Add functions to lock invalidate_lock for two mappings
    mm: Protect operations adding pages to page cache with invalidate_lock
    documentation: Sync file_operations members with reality
    mm: Fix comments mentioning i_mutex

    Linus Torvalds
     

14 Aug, 2021

1 commit

  • Doing some extended tests and polishing the man page update for
    MADV_POPULATE_(READ|WRITE), I realized that we end up converting also
    SIGBUS (via -EFAULT) to -EINVAL, making it look like yet another
    madvise() user error.

    We want to report only problematic mappings and permission problems that
    the user could have known about as -EINVAL.

    Let's not convert -EFAULT arising due to SIGBUS (or SIGSEGV) to -EINVAL,
    but instead indicate -EFAULT to user space. While we could also convert
    it to -ENOMEM, using -EFAULT looks more helpful when user space might
    want to troubleshoot what's going wrong: MADV_POPULATE_(READ|WRITE) is
    not part of a final Linux release and we can still adjust the behavior.
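
To illustrate why the distinction matters to user space, here is a small sketch that separates the two errno values discussed above. The enum and the function name are hypothetical; the mapping reflects only what this commit message states.

```c
#include <errno.h>

/* Hypothetical classification of a failed MADV_POPULATE_(READ|WRITE) call. */
enum populate_failure {
    POPULATE_USER_ERROR, /* -EINVAL: problematic mapping or missing permission */
    POPULATE_FAULT,      /* -EFAULT: a fault that would have raised SIGBUS/SIGSEGV */
    POPULATE_OTHER       /* anything else */
};

/* err is the positive errno value observed after madvise() returned -1. */
static enum populate_failure classify_populate_errno(int err)
{
    switch (err) {
    case EINVAL:
        return POPULATE_USER_ERROR;
    case EFAULT:
        return POPULATE_FAULT;
    default:
        return POPULATE_OTHER;
    }
}
```

A caller can then log POPULATE_FAULT as a runtime condition worth troubleshooting instead of treating it as a programming mistake.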

    Link: https://lkml.kernel.org/r/20210726154932.102880-1-david@redhat.com
    Fixes: 4ca9b3859dac ("mm/madvise: introduce MADV_POPULATE_(READ|WRITE) to prefault page tables")
    Signed-off-by: David Hildenbrand
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Matthew Wilcox (Oracle)
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Jann Horn
    Cc: Jason Gunthorpe
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Michael S. Tsirkin
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Thomas Bogendoerfer
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: Rolf Eike Beer
    Cc: Ram Pai
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

13 Jul, 2021

1 commit

  • inode->i_mutex has been replaced with inode->i_rwsem long ago. Fix
    comments still mentioning i_mutex.

    Reviewed-by: Christoph Hellwig
    Reviewed-by: Darrick J. Wong
    Acked-by: Hugh Dickins
    Signed-off-by: Jan Kara

    Jan Kara
     

01 Jul, 2021

1 commit

  • I. Background: Sparse Memory Mappings

    When we manage sparse memory mappings dynamically in user space - also
    sometimes involving MAP_NORESERVE - we want to dynamically populate/
    discard memory inside such a sparse memory region. Example users are
    hypervisors (especially implementing memory ballooning or similar
    technologies like virtio-mem) and memory allocators. In addition, we want
    to fail in a nice way (instead of generating SIGBUS) if populating does
    not succeed because we are out of backend memory (which can happen easily
    with file-based mappings, especially tmpfs and hugetlbfs).

    While MADV_DONTNEED, MADV_REMOVE and FALLOC_FL_PUNCH_HOLE allow for
    reliably discarding memory for most mapping types, there is no generic
    approach to populate page tables and preallocate memory.

    Although mmap() supports MAP_POPULATE, it is not applicable to the concept
    of sparse memory mappings, where we want to populate/discard dynamically
    and avoid expensive/problematic remappings. In addition, we never
    actually report errors during the final populate phase - it is best-effort
    only.

    fallocate() can be used to preallocate file-based memory and fail in a
    safe way. However, it cannot really be used for any private mappings on
    anonymous files via memfd due to COW semantics. In addition, fallocate()
    does not actually populate page tables, so we still always get pagefaults
    on first access - which is sometimes undesired (i.e., real-time workloads)
    and requires real prefaulting of page tables, not just a preallocation of
    backend storage. There might be interesting use cases for sparse memory
    regions along with mlockall(MCL_ONFAULT) which fallocate() cannot satisfy
    as it does not prefault page tables.

    II. On preallocation/prefaulting from user space

    Because we don't have a proper interface, what applications (like QEMU and
    databases) end up doing is touching (i.e., reading+writing one byte to not
    overwrite existing data) all individual pages.
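
The manual approach can be sketched as follows, assuming the range is mapped readable and writable. The function name prefault_by_touch() is hypothetical, and the read-then-write is not atomic, so it is only safe when nothing else writes to the range concurrently.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Touch every page in [addr, addr + len) by reading one byte and writing the
 * same value back, so existing data is preserved.
 */
static void prefault_by_touch(void *addr, size_t len)
{
    volatile unsigned char *p = addr;
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    size_t off;

    assert(addr != NULL);

    for (off = 0; off < len; off += page)
        p[off] = p[off];
}
```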

    However, that approach
    1) Can result in wear on storage backing, because we end up reading/writing
    each page; this is especially a problem for dax/pmem.
    2) Can result in mmap_sem contention when prefaulting via multiple
    threads.
    3) Requires expensive signal handling, especially to catch SIGBUS in case
    of hugetlbfs/shmem/file-backed memory. For example, this is
    problematic in hypervisors like QEMU where SIGBUS handlers might already
    be used by other subsystems concurrently to, e.g., handle hardware errors.
    "Simply" doing preallocation concurrently from another thread is not that
    easy.

    III. On MADV_WILLNEED

    Extending MADV_WILLNEED is not an option because
    1. It would change the semantics: "Expect access in the near future." and
    "might be a good idea to read some pages" vs. "Definitely populate/
    preallocate all memory and definitely fail on errors.".
    2. Existing users (like virtio-balloon in QEMU when deflating the balloon)
    don't want populate/prealloc semantics. They treat this rather as a hint
    to give a little performance boost without too much overhead - and don't
    expect that a lot of memory might get consumed or a lot of time
    might be spent.

    IV. MADV_POPULATE_READ and MADV_POPULATE_WRITE

    Let's introduce MADV_POPULATE_READ and MADV_POPULATE_WRITE, inspired by
    MAP_POPULATE, with the following semantics:
    1. MADV_POPULATE_READ can be used to prefault page tables just like
    manually reading each individual page. This will not break any COW
    mappings. The shared zero page might get mapped and no backend storage
    might get preallocated -- allocation might be deferred to
    write-fault time. Especially shared file mappings require an explicit
    fallocate() upfront to actually preallocate backend memory (blocks in
    the file system) in case the file might have holes.
    2. If MADV_POPULATE_READ succeeds, all page tables have been populated
    (prefaulted) readable once.
    3. MADV_POPULATE_WRITE can be used to preallocate backend memory and
    prefault page tables just like manually writing (or
    reading+writing) each individual page. This will break any COW
    mappings -- e.g., the shared zeropage is never populated.
    4. If MADV_POPULATE_WRITE succeeds, all page tables have been populated
    (prefaulted) writable once.
    5. MADV_POPULATE_READ and MADV_POPULATE_WRITE cannot be applied to special
    mappings marked with VM_PFNMAP and VM_IO. Also, proper access
    permissions (e.g., PROT_READ, PROT_WRITE) are required. If any such
    mapping is encountered, madvise() fails with -EINVAL.
    6. If MADV_POPULATE_READ or MADV_POPULATE_WRITE fails, some page tables
    might have been populated.
    7. MADV_POPULATE_READ and MADV_POPULATE_WRITE will return -EHWPOISON
    when encountering a HW poisoned page in the range.
    8. Similar to MAP_POPULATE, MADV_POPULATE_READ and MADV_POPULATE_WRITE
    cannot protect from the OOM (Out Of Memory) handler killing the
    process.
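
A minimal user-space sketch of the semantics listed above. The wrapper name populate_range() is hypothetical, and the fallback values 22 and 23 are an assumption for toolchains whose headers do not yet define the new advice values.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>

/* Assumed fallback values for headers that predate the new advice. */
#ifndef MADV_POPULATE_READ
#define MADV_POPULATE_READ 22
#endif
#ifndef MADV_POPULATE_WRITE
#define MADV_POPULATE_WRITE 23
#endif

/*
 * Prefault [addr, addr + len) readable (writable == 0) or writable.
 * Returns 0 on success or -errno; on failure some page tables may already
 * have been populated.
 */
static int populate_range(void *addr, size_t len, int writable)
{
    int advice = writable ? MADV_POPULATE_WRITE : MADV_POPULATE_READ;

    assert(addr != NULL);

    if (madvise(addr, len, advice) != 0)
        return -errno;
    return 0;
}
```

On kernels that do not know these advice values, madvise() is expected to fail with EINVAL, so a caller may want to fall back to touching pages manually in that case.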

    While the use case for MADV_POPULATE_WRITE is fairly obvious (i.e.,
    preallocate memory and prefault page tables for VMs), one issue is that
    whenever we prefault pages writable, the pages have to be marked dirty,
    because the CPU could dirty them any time. While not a real problem for
    hugetlbfs or dax/pmem, it can be a problem for shared file mappings: each
    page will be marked dirty and has to be written back later when evicting.

    MADV_POPULATE_READ allows for optimizing this scenario: Pre-read a whole
    mapping from backend storage without marking it dirty, such that eviction
    won't have to write it back. As discussed above, shared file mappings
    might require an explicit fallocate() upfront to achieve
    preallocation+prepopulation.

    Although sparse memory mappings are the primary use case, this will also
    be useful for other preallocate/prefault use cases where MAP_POPULATE is
    not desired or the semantics of MAP_POPULATE are not sufficient: as one
    example, QEMU users can trigger preallocation/prefaulting of guest RAM
    after the mapping was created -- and don't want errors to be silently
    suppressed.

    Looking at the history, MADV_POPULATE was already proposed in 2013 [1],
    however, the main motivation back then was performance improvements --
    which should also still be the case.

    V. Single-threaded performance comparison

    I did a short experiment, prefaulting page tables on completely *empty
    mappings/files* and repeated the experiment 10 times. The results
    correspond to the shortest execution time. In general, the performance
    benefit for huge pages is negligible with small mappings.

    V.1: Private mappings

    POPULATE_READ and POPULATE_WRITE are fastest. Note that
    Reading/POPULATE_READ will populate the shared zeropage where applicable
    -- which results in short population times.

    The fastest way to allocate backend storage (here: swap or huge pages) and
    prefault page tables is POPULATE_WRITE.

    V.2: Shared mappings

    fallocate() is fastest, however, doesn't prefault page tables.
    POPULATE_WRITE is faster than simple writes and read/writes.
    POPULATE_READ is faster than simple reads.

    Without a fd, the fastest way to allocate backend storage and prefault
    page tables is POPULATE_WRITE. With an fd, the fastest way is usually
    FALLOCATE+POPULATE_READ or FALLOCATE+POPULATE_WRITE respectively; one
    exception are actual files: FALLOCATE+Read is slightly faster than
    FALLOCATE+POPULATE_READ.

    The fastest way to allocate backend storage and prefault page tables is
    FALLOCATE+POPULATE_WRITE -- except when dealing with actual files; then,
    FALLOCATE+POPULATE_READ is fastest and won't directly mark all pages as
    dirty.

    V.3: Detailed results

    ==================================================
    2 MiB MAP_PRIVATE:
    **************************************************
    Anon 4 KiB : Read : 0.119 ms
    Anon 4 KiB : Write : 0.222 ms
    Anon 4 KiB : Read/Write : 0.380 ms
    Anon 4 KiB : POPULATE_READ : 0.060 ms
    Anon 4 KiB : POPULATE_WRITE : 0.158 ms
    Memfd 4 KiB : Read : 0.034 ms
    Memfd 4 KiB : Write : 0.310 ms
    Memfd 4 KiB : Read/Write : 0.362 ms
    Memfd 4 KiB : POPULATE_READ : 0.039 ms
    Memfd 4 KiB : POPULATE_WRITE : 0.229 ms
    Memfd 2 MiB : Read : 0.030 ms
    Memfd 2 MiB : Write : 0.030 ms
    Memfd 2 MiB : Read/Write : 0.030 ms
    Memfd 2 MiB : POPULATE_READ : 0.030 ms
    Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
    tmpfs : Read : 0.033 ms
    tmpfs : Write : 0.313 ms
    tmpfs : Read/Write : 0.406 ms
    tmpfs : POPULATE_READ : 0.039 ms
    tmpfs : POPULATE_WRITE : 0.285 ms
    file : Read : 0.033 ms
    file : Write : 0.351 ms
    file : Read/Write : 0.408 ms
    file : POPULATE_READ : 0.039 ms
    file : POPULATE_WRITE : 0.290 ms
    hugetlbfs : Read : 0.030 ms
    hugetlbfs : Write : 0.030 ms
    hugetlbfs : Read/Write : 0.030 ms
    hugetlbfs : POPULATE_READ : 0.030 ms
    hugetlbfs : POPULATE_WRITE : 0.030 ms
    **************************************************
    4096 MiB MAP_PRIVATE:
    **************************************************
    Anon 4 KiB : Read : 237.940 ms
    Anon 4 KiB : Write : 708.409 ms
    Anon 4 KiB : Read/Write : 1054.041 ms
    Anon 4 KiB : POPULATE_READ : 124.310 ms
    Anon 4 KiB : POPULATE_WRITE : 572.582 ms
    Memfd 4 KiB : Read : 136.928 ms
    Memfd 4 KiB : Write : 963.898 ms
    Memfd 4 KiB : Read/Write : 1106.561 ms
    Memfd 4 KiB : POPULATE_READ : 78.450 ms
    Memfd 4 KiB : POPULATE_WRITE : 805.881 ms
    Memfd 2 MiB : Read : 357.116 ms
    Memfd 2 MiB : Write : 357.210 ms
    Memfd 2 MiB : Read/Write : 357.606 ms
    Memfd 2 MiB : POPULATE_READ : 356.094 ms
    Memfd 2 MiB : POPULATE_WRITE : 356.937 ms
    tmpfs : Read : 137.536 ms
    tmpfs : Write : 954.362 ms
    tmpfs : Read/Write : 1105.954 ms
    tmpfs : POPULATE_READ : 80.289 ms
    tmpfs : POPULATE_WRITE : 822.826 ms
    file : Read : 137.874 ms
    file : Write : 987.025 ms
    file : Read/Write : 1107.439 ms
    file : POPULATE_READ : 80.413 ms
    file : POPULATE_WRITE : 857.622 ms
    hugetlbfs : Read : 355.607 ms
    hugetlbfs : Write : 355.729 ms
    hugetlbfs : Read/Write : 356.127 ms
    hugetlbfs : POPULATE_READ : 354.585 ms
    hugetlbfs : POPULATE_WRITE : 355.138 ms
    **************************************************
    2 MiB MAP_SHARED:
    **************************************************
    Anon 4 KiB : Read : 0.394 ms
    Anon 4 KiB : Write : 0.348 ms
    Anon 4 KiB : Read/Write : 0.400 ms
    Anon 4 KiB : POPULATE_READ : 0.326 ms
    Anon 4 KiB : POPULATE_WRITE : 0.273 ms
    Anon 2 MiB : Read : 0.030 ms
    Anon 2 MiB : Write : 0.030 ms
    Anon 2 MiB : Read/Write : 0.030 ms
    Anon 2 MiB : POPULATE_READ : 0.030 ms
    Anon 2 MiB : POPULATE_WRITE : 0.030 ms
    Memfd 4 KiB : Read : 0.412 ms
    Memfd 4 KiB : Write : 0.372 ms
    Memfd 4 KiB : Read/Write : 0.419 ms
    Memfd 4 KiB : POPULATE_READ : 0.343 ms
    Memfd 4 KiB : POPULATE_WRITE : 0.288 ms
    Memfd 4 KiB : FALLOCATE : 0.137 ms
    Memfd 4 KiB : FALLOCATE+Read : 0.446 ms
    Memfd 4 KiB : FALLOCATE+Write : 0.330 ms
    Memfd 4 KiB : FALLOCATE+Read/Write : 0.454 ms
    Memfd 4 KiB : FALLOCATE+POPULATE_READ : 0.379 ms
    Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 0.268 ms
    Memfd 2 MiB : Read : 0.030 ms
    Memfd 2 MiB : Write : 0.030 ms
    Memfd 2 MiB : Read/Write : 0.030 ms
    Memfd 2 MiB : POPULATE_READ : 0.030 ms
    Memfd 2 MiB : POPULATE_WRITE : 0.030 ms
    Memfd 2 MiB : FALLOCATE : 0.030 ms
    Memfd 2 MiB : FALLOCATE+Read : 0.031 ms
    Memfd 2 MiB : FALLOCATE+Write : 0.031 ms
    Memfd 2 MiB : FALLOCATE+Read/Write : 0.031 ms
    Memfd 2 MiB : FALLOCATE+POPULATE_READ : 0.030 ms
    Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 0.030 ms
    tmpfs : Read : 0.416 ms
    tmpfs : Write : 0.369 ms
    tmpfs : Read/Write : 0.425 ms
    tmpfs : POPULATE_READ : 0.346 ms
    tmpfs : POPULATE_WRITE : 0.295 ms
    tmpfs : FALLOCATE : 0.139 ms
    tmpfs : FALLOCATE+Read : 0.447 ms
    tmpfs : FALLOCATE+Write : 0.333 ms
    tmpfs : FALLOCATE+Read/Write : 0.454 ms
    tmpfs : FALLOCATE+POPULATE_READ : 0.380 ms
    tmpfs : FALLOCATE+POPULATE_WRITE : 0.272 ms
    file : Read : 0.191 ms
    file : Write : 0.511 ms
    file : Read/Write : 0.524 ms
    file : POPULATE_READ : 0.196 ms
    file : POPULATE_WRITE : 0.434 ms
    file : FALLOCATE : 0.004 ms
    file : FALLOCATE+Read : 0.197 ms
    file : FALLOCATE+Write : 0.554 ms
    file : FALLOCATE+Read/Write : 0.480 ms
    file : FALLOCATE+POPULATE_READ : 0.201 ms
    file : FALLOCATE+POPULATE_WRITE : 0.381 ms
    hugetlbfs : Read : 0.030 ms
    hugetlbfs : Write : 0.030 ms
    hugetlbfs : Read/Write : 0.030 ms
    hugetlbfs : POPULATE_READ : 0.030 ms
    hugetlbfs : POPULATE_WRITE : 0.030 ms
    hugetlbfs : FALLOCATE : 0.030 ms
    hugetlbfs : FALLOCATE+Read : 0.031 ms
    hugetlbfs : FALLOCATE+Write : 0.031 ms
    hugetlbfs : FALLOCATE+Read/Write : 0.030 ms
    hugetlbfs : FALLOCATE+POPULATE_READ : 0.030 ms
    hugetlbfs : FALLOCATE+POPULATE_WRITE : 0.030 ms
    **************************************************
    4096 MiB MAP_SHARED:
    **************************************************
    Anon 4 KiB : Read : 1053.090 ms
    Anon 4 KiB : Write : 913.642 ms
    Anon 4 KiB : Read/Write : 1060.350 ms
    Anon 4 KiB : POPULATE_READ : 893.691 ms
    Anon 4 KiB : POPULATE_WRITE : 782.885 ms
    Anon 2 MiB : Read : 358.553 ms
    Anon 2 MiB : Write : 358.419 ms
    Anon 2 MiB : Read/Write : 357.992 ms
    Anon 2 MiB : POPULATE_READ : 357.533 ms
    Anon 2 MiB : POPULATE_WRITE : 357.808 ms
    Memfd 4 KiB : Read : 1078.144 ms
    Memfd 4 KiB : Write : 942.036 ms
    Memfd 4 KiB : Read/Write : 1100.391 ms
    Memfd 4 KiB : POPULATE_READ : 925.829 ms
    Memfd 4 KiB : POPULATE_WRITE : 804.394 ms
    Memfd 4 KiB : FALLOCATE : 304.632 ms
    Memfd 4 KiB : FALLOCATE+Read : 1163.359 ms
    Memfd 4 KiB : FALLOCATE+Write : 933.186 ms
    Memfd 4 KiB : FALLOCATE+Read/Write : 1187.304 ms
    Memfd 4 KiB : FALLOCATE+POPULATE_READ : 1013.660 ms
    Memfd 4 KiB : FALLOCATE+POPULATE_WRITE : 794.560 ms
    Memfd 2 MiB : Read : 358.131 ms
    Memfd 2 MiB : Write : 358.099 ms
    Memfd 2 MiB : Read/Write : 358.250 ms
    Memfd 2 MiB : POPULATE_READ : 357.563 ms
    Memfd 2 MiB : POPULATE_WRITE : 357.334 ms
    Memfd 2 MiB : FALLOCATE : 356.735 ms
    Memfd 2 MiB : FALLOCATE+Read : 358.152 ms
    Memfd 2 MiB : FALLOCATE+Write : 358.331 ms
    Memfd 2 MiB : FALLOCATE+Read/Write : 358.018 ms
    Memfd 2 MiB : FALLOCATE+POPULATE_READ : 357.286 ms
    Memfd 2 MiB : FALLOCATE+POPULATE_WRITE : 357.523 ms
    tmpfs : Read : 1087.265 ms
    tmpfs : Write : 950.840 ms
    tmpfs : Read/Write : 1107.567 ms
    tmpfs : POPULATE_READ : 922.605 ms
    tmpfs : POPULATE_WRITE : 810.094 ms
    tmpfs : FALLOCATE : 306.320 ms
    tmpfs : FALLOCATE+Read : 1169.796 ms
    tmpfs : FALLOCATE+Write : 933.730 ms
    tmpfs : FALLOCATE+Read/Write : 1191.610 ms
    tmpfs : FALLOCATE+POPULATE_READ : 1020.474 ms
    tmpfs : FALLOCATE+POPULATE_WRITE : 798.945 ms
    file : Read : 654.101 ms
    file : Write : 1259.142 ms
    file : Read/Write : 1289.509 ms
    file : POPULATE_READ : 661.642 ms
    file : POPULATE_WRITE : 1106.816 ms
    file : FALLOCATE : 1.864 ms
    file : FALLOCATE+Read : 656.328 ms
    file : FALLOCATE+Write : 1153.300 ms
    file : FALLOCATE+Read/Write : 1180.613 ms
    file : FALLOCATE+POPULATE_READ : 668.347 ms
    file : FALLOCATE+POPULATE_WRITE : 996.143 ms
    hugetlbfs : Read : 357.245 ms
    hugetlbfs : Write : 357.413 ms
    hugetlbfs : Read/Write : 357.120 ms
    hugetlbfs : POPULATE_READ : 356.321 ms
    hugetlbfs : POPULATE_WRITE : 356.693 ms
    hugetlbfs : FALLOCATE : 355.927 ms
    hugetlbfs : FALLOCATE+Read : 357.074 ms
    hugetlbfs : FALLOCATE+Write : 357.120 ms
    hugetlbfs : FALLOCATE+Read/Write : 356.983 ms
    hugetlbfs : FALLOCATE+POPULATE_READ : 356.413 ms
    hugetlbfs : FALLOCATE+POPULATE_WRITE : 356.266 ms
    **************************************************

    [1] https://lkml.org/lkml/2013/6/27/698

    [akpm@linux-foundation.org: coding style fixes]

    Link: https://lkml.kernel.org/r/20210419135443.12822-3-david@redhat.com
    Signed-off-by: David Hildenbrand
    Cc: Arnd Bergmann
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Cc: Matthew Wilcox (Oracle)
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Jann Horn
    Cc: Jason Gunthorpe
    Cc: Dave Hansen
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Michael S. Tsirkin
    Cc: Kirill A. Shutemov
    Cc: Vlastimil Babka
    Cc: Richard Henderson
    Cc: Ivan Kokshaysky
    Cc: Matt Turner
    Cc: Thomas Bogendoerfer
    Cc: "James E.J. Bottomley"
    Cc: Helge Deller
    Cc: Chris Zankel
    Cc: Max Filippov
    Cc: Mike Kravetz
    Cc: Peter Xu
    Cc: Rolf Eike Beer
    Cc: Ram Pai
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Hildenbrand
     

07 May, 2021

1 commit

  • Fix ~94 single-word typos in locking code comments, plus a few
    very obvious grammar mistakes.

    Link: https://lkml.kernel.org/r/20210322212624.GA1963421@gmail.com
    Link: https://lore.kernel.org/r/20210322205203.GB1959563@gmail.com
    Signed-off-by: Ingo Molnar
    Reviewed-by: Matthew Wilcox (Oracle)
    Reviewed-by: Randy Dunlap
    Cc: Bhaskar Chowdhury
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

14 Mar, 2021

1 commit

  • process_madvise currently requires ptrace attach capability.
    PTRACE_MODE_ATTACH gives one process complete control over another
    process. It effectively removes the security boundary between the two
    processes (in one direction). Granting ptrace attach capability even to a
    system process is considered dangerous since it creates an attack surface.
    This severely limits the usage of this API.

    The operations process_madvise can perform do not affect the correctness
    of the operation of the target process; they only affect where the data is
    physically located (and therefore, how fast it can be accessed). What we
    want is the ability for one process to influence another process in order
    to optimize performance across the entire system while leaving the
    security boundary intact.

    Replace PTRACE_MODE_ATTACH with a combination of PTRACE_MODE_READ and
    CAP_SYS_NICE. PTRACE_MODE_READ to prevent leaking ASLR metadata and
    CAP_SYS_NICE for influencing process performance.

    Link: https://lkml.kernel.org/r/20210303185807.2160264-1-surenb@google.com
    Signed-off-by: Suren Baghdasaryan
    Reviewed-by: Kees Cook
    Acked-by: Minchan Kim
    Acked-by: David Rientjes
    Cc: Jann Horn
    Cc: Jeff Vander Stoep
    Cc: Michal Hocko
    Cc: Shakeel Butt
    Cc: Tim Murray
    Cc: Florian Weimer
    Cc: Oleg Nesterov
    Cc: James Morris
    Cc: [5.10+]
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Suren Baghdasaryan
     

24 Feb, 2021

1 commit

  • Pull idmapped mounts from Christian Brauner:
    "This introduces idmapped mounts, which have been in the making for some
    time. Simply put, different mounts can expose the same file or
    directory with different ownership. This initial implementation comes
    with ports for fat, ext4 and with Christoph's port for xfs with more
    filesystems being actively worked on by independent people and
    maintainers.

    Idmapped mounts handle a wide range of long standing use-cases. Here
    are just a few:

    - Idmapped mounts make it possible to easily share files between
    multiple users or multiple machines especially in complex
    scenarios. For example, idmapped mounts will be used in the
    implementation of portable home directories in
    systemd-homed.service(8) where they allow users to move their home
    directory to an external storage device and use it on multiple
    computers where they are assigned different uids and gids. This
    effectively makes it possible to assign random uids and gids at
    login time.

    - It is possible to share files from the host with unprivileged
    containers without having to change ownership permanently through
    chown(2).

    - It is possible to idmap a container's rootfs without having to
    mangle every file. For example, Chromebooks use it to share the
    user's Download folder with their unprivileged containers in their
    Linux subsystem.

    - It is possible to share files between containers with
    non-overlapping idmappings.

    - Filesystems that lack a proper concept of ownership such as fat can
    use idmapped mounts to implement discretionary access (DAC)
    permission checking.

    - They allow users to efficiently change ownership on a per-mount
    basis without having to (recursively) chown(2) all files. In
    contrast to chown(2), changing ownership of large sets of files is
    instantaneous with idmapped mounts. This is especially useful when
    ownership of a whole root filesystem of a virtual machine or
    container is changed. With idmapped mounts a single
    mount_setattr syscall will be sufficient to change the ownership of
    all files.

    - Idmapped mounts always take the current ownership into account as
    idmappings specify what a given uid or gid is supposed to be mapped
    to. This contrasts with the chown(2) syscall which cannot by itself
    take the current ownership of the files it changes into account. It
    simply changes the ownership to the specified uid and gid. This is
    especially problematic when recursively chown(2)ing a large set of
    files, which is common with the aforementioned portable home
    directory, container, and VM scenarios.

    - Idmapped mounts allow changing ownership locally, restricting it
    to specific mounts, and temporarily as the ownership changes only
    apply as long as the mount exists.

    Several userspace projects have either already put up patches and
    pull-requests for this feature or will do so should you decide to pull
    this:

    - systemd: In a wide variety of scenarios but especially right away
    in their implementation of portable home directories.

    https://systemd.io/HOME_DIRECTORY/

    - container runtimes: containerd, runC, LXD: To share data between
    host and unprivileged containers, unprivileged and privileged
    containers, etc. The pull request for idmapped mounts support in
    containerd, the default Kubernetes runtime, has been up for quite
    a while now: https://github.com/containerd/containerd/pull/4734

    - The virtio-fs developers and several users have expressed interest
    in using this feature with virtual machines once virtio-fs is
    ported.

    - ChromeOS: Sharing host-directories with unprivileged containers.

    I've tightly synced with all those projects and all of those listed
    here have also expressed their need/desire for this feature on the
    mailing list. For more info on how people use this there's a bunch of
    talks about this too. Here's just two recent ones:

    https://www.cncf.io/wp-content/uploads/2020/12/Rootless-Containers-in-Gitpod.pdf
    https://fosdem.org/2021/schedule/event/containers_idmap/

    This comes with an extensive xfstests suite covering both ext4 and
    xfs:

    https://git.kernel.org/brauner/xfstests-dev/h/idmapped_mounts

    It covers truncation, creation, opening, xattrs, vfscaps, setid
    execution, setgid inheritance and more both with idmapped and
    non-idmapped mounts. It already helped to discover an unrelated xfs
    setgid inheritance bug which has since been fixed in mainline. It will
    be sent for inclusion with the xfstests project should you decide to
    merge this.

    In order to support per-mount idmappings vfsmounts are marked with
    user namespaces. The idmapping of the user namespace will be used to
    map the ids of vfs objects when they are accessed through that mount.
    By default all vfsmounts are marked with the initial user namespace.
    The initial user namespace is used to indicate that a mount is not
    idmapped. All operations behave as before and this is verified in the
    testsuite.

    Based on prior discussions we want to attach the whole user namespace
    and not just a dedicated idmapping struct. This allows us to reuse all
    the helpers that already exist for dealing with idmappings instead of
    introducing a whole new range of helpers. In addition, if we decide in
    the future that we are confident enough to enable unprivileged users
    to setup idmapped mounts the permission checking can take into account
    whether the caller is privileged in the user namespace the mount is
    currently marked with.

    The user namespace the mount will be marked with can be specified by
    passing a file descriptor referring to the user namespace as an
    argument to the new mount_setattr() syscall together with the new
    MOUNT_ATTR_IDMAP flag. The system call follows the openat2() pattern
    of extensibility.

    The following conditions must be met in order to create an idmapped
    mount:

    - The caller must currently have the CAP_SYS_ADMIN capability in the
    user namespace the underlying filesystem has been mounted in.

    - The underlying filesystem must support idmapped mounts.

    - The mount must not already be idmapped. This also implies that the
    idmapping of a mount cannot be altered once it has been idmapped.

    - The mount must be a detached/anonymous mount, i.e. it must have
    been created by calling open_tree() with the OPEN_TREE_CLONE flag
    and it must not already have been visible in the filesystem.

    The last two points guarantee easier semantics for userspace and the
    kernel and make the implementation significantly simpler.
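
    The sequence described above (clone a detached mount with open_tree(),
    mark it with mount_setattr() and MOUNT_ATTR_IDMAP, then attach it) might
    be sketched roughly as follows. This is an illustrative sketch, not code
    from the series: every sketch_/SKETCH_ name is hypothetical, the constants
    are local copies of what are assumed to be the v5.12 UAPI values, and the
    raw syscalls are only used when the libc headers define their numbers.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

/* Local copies of UAPI values (assumed to match the v5.12 headers); the
 * SKETCH_ prefix avoids clashing with <linux/mount.h> or <sys/mount.h>. */
#define SKETCH_OPEN_TREE_CLONE         1
#define SKETCH_AT_EMPTY_PATH           0x1000
#define SKETCH_MOVE_MOUNT_F_EMPTY_PATH 0x00000004
#define SKETCH_MOUNT_ATTR_IDMAP        0x00100000

/* Same layout as the kernel's struct mount_attr (version 0, 32 bytes). */
struct sketch_mount_attr {
    uint64_t attr_set;
    uint64_t attr_clr;
    uint64_t propagation;
    uint64_t userns_fd;
};

/* Build the attribute block that asks for an idmapping taken from userns_fd. */
struct sketch_mount_attr sketch_idmap_attr(int userns_fd)
{
    struct sketch_mount_attr attr;

    assert(userns_fd >= 0);
    memset(&attr, 0, sizeof(attr));
    attr.attr_set = SKETCH_MOUNT_ATTR_IDMAP;
    attr.userns_fd = (uint64_t)userns_fd;
    return attr;
}

/* Create a detached clone of `source`, idmap it, and attach it at `target`.
 * Returns 0 on success, -1 with errno set on failure. */
int sketch_idmapped_bind(const char *source, const char *target, int userns_fd)
{
#if defined(SYS_open_tree) && defined(SYS_mount_setattr) && defined(SYS_move_mount)
    struct sketch_mount_attr attr = sketch_idmap_attr(userns_fd);
    int ret = -1;

    /* Detached/anonymous mount: not yet visible in the filesystem. */
    int tree_fd = (int)syscall(SYS_open_tree, AT_FDCWD, source,
                               SKETCH_OPEN_TREE_CLONE | O_CLOEXEC);
    if (tree_fd < 0)
        return -1;

    /* Mark the detached mount with the user namespace, then attach it. */
    if (syscall(SYS_mount_setattr, tree_fd, "", SKETCH_AT_EMPTY_PATH,
                &attr, sizeof(attr)) == 0 &&
        syscall(SYS_move_mount, tree_fd, "", AT_FDCWD, target,
                SKETCH_MOVE_MOUNT_F_EMPTY_PATH) == 0)
        ret = 0;

    int saved = errno;
    close(tree_fd);
    errno = saved;
    return ret;
#else
    (void)source;
    (void)target;
    (void)userns_fd;
    errno = ENOSYS;
    return -1;
#endif
}
```

    A caller would typically obtain userns_fd by opening a process's
    /proc/<pid>/ns/user file and would need CAP_SYS_ADMIN as described in the
    conditions above.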

    By default vfsmounts are marked with the initial user namespace and no
    behavioral or performance changes are observed.

    The manpage with a detailed description can be found here:

    https://git.kernel.org/brauner/man-pages/c/1d7b902e2875a1ff342e036a9f866a995640aea8

    In order to support idmapped mounts, filesystems need to be changed
    and mark themselves with the FS_ALLOW_IDMAP flag in fs_flags. The
    patches to convert individual filesystems are not very large or
    complicated overall as can be seen from the included fat, ext4, and
    xfs ports. Patches for other filesystems are actively worked on and
    will be sent out separately. The xfstests suite can be used to verify
    that a port has been done correctly.

    The mount_setattr() syscall is motivated independently of the idmapped
    mounts patches and it's been around since July 2019. One of the most
    valuable features of the new mount api is the ability to perform
    mounts based on file descriptors only.

    Together with the lookup restrictions available in the openat2()
    RESOLVE_* flag namespace which we added in v5.6 this is the first time
    we are close to hardened and race-free (e.g. symlinks) mounting and
    path resolution.

    While userspace has started porting to the new mount api to mount
    proper filesystems and create new bind-mounts it is currently not
    possible to change mount options of an already existing bind mount in
    the new mount api since the mount_setattr() syscall is missing.

    With the addition of the mount_setattr() syscall we remove this last
    restriction and userspace can now fully port to the new mount api,
    covering every use-case the old mount api could. We also add the
    crucial ability to recursively change mount options for a whole mount
    tree, both removing and adding mount options at the same time. This
    syscall has been requested multiple times by various people and
    projects.

    There is a simple tool available at

    https://github.com/brauner/mount-idmapped

    that allows creating idmapped mounts so people can play with this
    patch series. I'll add support for the regular mount binary in the
    following weeks should you decide to pull this.

    Here's an example of a simple idmapped mount of another user's home
    directory:

    u1001@f2-vm:/$ sudo ./mount --idmap both:1000:1001:1 /home/ubuntu/ /mnt

    u1001@f2-vm:/$ ls -al /home/ubuntu/
    total 28
    drwxr-xr-x 2 ubuntu ubuntu 4096 Oct 28 22:07 .
    drwxr-xr-x 4 root root 4096 Oct 28 04:00 ..
    -rw------- 1 ubuntu ubuntu 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 ubuntu ubuntu 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 ubuntu ubuntu 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 ubuntu ubuntu 807 Feb 25 2020 .profile
    -rw-r--r-- 1 ubuntu ubuntu 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 ubuntu ubuntu 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ ls -al /mnt/
    total 28
    drwxr-xr-x 2 u1001 u1001 4096 Oct 28 22:07 .
    drwxr-xr-x 29 root root 4096 Oct 28 22:01 ..
    -rw------- 1 u1001 u1001 3154 Oct 28 22:12 .bash_history
    -rw-r--r-- 1 u1001 u1001 220 Feb 25 2020 .bash_logout
    -rw-r--r-- 1 u1001 u1001 3771 Feb 25 2020 .bashrc
    -rw-r--r-- 1 u1001 u1001 807 Feb 25 2020 .profile
    -rw-r--r-- 1 u1001 u1001 0 Oct 16 16:11 .sudo_as_admin_successful
    -rw------- 1 u1001 u1001 1144 Oct 28 00:43 .viminfo

    u1001@f2-vm:/$ touch /mnt/my-file

    u1001@f2-vm:/$ setfacl -m u:1001:rwx /mnt/my-file

    u1001@f2-vm:/$ sudo setcap -n 1001 cap_net_raw+ep /mnt/my-file

    u1001@f2-vm:/$ ls -al /mnt/my-file
    -rw-rwxr--+ 1 u1001 u1001 0 Oct 28 22:14 /mnt/my-file

    u1001@f2-vm:/$ ls -al /home/ubuntu/my-file
    -rw-rwxr--+ 1 ubuntu ubuntu 0 Oct 28 22:14 /home/ubuntu/my-file

    u1001@f2-vm:/$ getfacl /mnt/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: mnt/my-file
    # owner: u1001
    # group: u1001
    user::rw-
    user:u1001:rwx
    group::rw-
    mask::rwx
    other::r--

    u1001@f2-vm:/$ getfacl /home/ubuntu/my-file
    getfacl: Removing leading '/' from absolute path names
    # file: home/ubuntu/my-file
    # owner: ubuntu
    # group: ubuntu
    user::rw-
    user:ubuntu:rwx
    group::rw-
    mask::rwx
    other::r--"
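
    The ownership translation visible in the listing above (files owned by
    uid 1000 on disk show up as uid 1001 under /mnt) can be illustrated with a
    small sketch. It assumes the tool's "1000:1001:1" argument means
    from:to:count, which is inferred from the output shown; the names and the
    "unmapped" sentinel are hypothetical and do not mirror kernel code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sentinel for ids that fall outside the mapping. */
#define SKETCH_ID_UNMAPPED UINT32_MAX

/* One mapping extent: ids [from, from + count) become [to, to + count). */
struct sketch_id_extent {
    uint32_t from;
    uint32_t to;
    uint32_t count;
};

/* Translate an on-disk id into the id seen through the idmapped mount. */
uint32_t sketch_map_id(const struct sketch_id_extent *ext, uint32_t id)
{
    assert(ext != NULL);
    if (id >= ext->from && id - ext->from < ext->count)
        return ext->to + (id - ext->from);
    return SKETCH_ID_UNMAPPED;
}
```

    With the extent {1000, 1001, 1}, only id 1000 is translated (to 1001);
    every other id is treated as unmapped in this sketch.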

    * tag 'idmapped-mounts-v5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux: (41 commits)
    xfs: remove the possibly unused mp variable in xfs_file_compat_ioctl
    xfs: support idmapped mounts
    ext4: support idmapped mounts
    fat: handle idmapped mounts
    tests: add mount_setattr() selftests
    fs: introduce MOUNT_ATTR_IDMAP
    fs: add mount_setattr()
    fs: add attr_flags_to_mnt_flags helper
    fs: split out functions to hold writers
    namespace: only take read lock in do_reconfigure_mnt()
    mount: make {lock,unlock}_mount_hash() static
    namespace: take lock_mount_hash() directly when changing flags
    nfs: do not export idmapped mounts
    overlayfs: do not mount on top of idmapped mounts
    ecryptfs: do not mount on top of idmapped mounts
    ima: handle idmapped mounts
    apparmor: handle idmapped mounts
    fs: make helpers idmap mount aware
    exec: handle idmapped mounts
    would_dump: handle idmapped mounts
    ...

    Linus Torvalds
     

30 Jan, 2021

2 commits

  • The 'start' and 'end' arguments to tlb_gather_mmu() are no longer
    needed now that there is a separate function for 'fullmm' flushing.

    Remove the unused arguments and update all callers.

    Suggested-by: Linus Torvalds
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Yu Zhao
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Linus Torvalds
    Link: https://lore.kernel.org/r/CAHk-=wjQWa14_4UpfDf=fiineNP+RH74kZeDMo_f1D35xNzq9w@mail.gmail.com

    Will Deacon
     
  • Since commit 7a30df49f63a ("mm: mmu_gather: remove __tlb_reset_range()
    for force flush"), the 'start' and 'end' arguments to tlb_finish_mmu()
    are no longer used, since we flush the whole mm in case of a nested
    invalidation.

    Remove the unused arguments and update all callers.

    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Yu Zhao
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Linus Torvalds
    Link: https://lkml.kernel.org/r/20210127235347.1402-3-will@kernel.org

    Will Deacon
     

24 Jan, 2021

2 commits

  • The inode_owner_or_capable() helper determines whether the caller is the
    owner of the inode or is capable with respect to that inode. Allow it to
    handle idmapped mounts. If the inode is accessed through an idmapped
    mount, map it according to the mount's user namespace. Afterwards the checks
    are identical to non-idmapped mounts. If the initial user namespace is
    passed nothing changes so non-idmapped mounts will see identical
    behavior as before.

    Similarly, allow the inode_init_owner() helper to handle idmapped
    mounts. It initializes a new inode on idmapped mounts by mapping the
    fsuid and fsgid of the caller from the mount's user namespace. If the
    initial user namespace is passed nothing changes so non-idmapped mounts
    will see identical behavior as before.

    Link: https://lore.kernel.org/r/20210121131959.646623-7-christian.brauner@ubuntu.com
    Cc: Christoph Hellwig
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     
  • Add two simple helpers to check permissions on a file and path
    respectively and convert over some callers. It simplifies quite a few
    codepaths and also reduces the churn in later patches quite a bit.
    Christoph also correctly points out that this makes codepaths (e.g.
    ioctls) way easier to follow that would otherwise have to do more
    complex argument passing than necessary.

    Link: https://lore.kernel.org/r/20210121131959.646623-4-christian.brauner@ubuntu.com
    Cc: David Howells
    Cc: Al Viro
    Cc: linux-fsdevel@vger.kernel.org
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: James Morris
    Signed-off-by: Christian Brauner

    Christian Brauner
     

16 Dec, 2020

2 commits

  • madvise_inject_error() uses get_user_pages_fast to translate the address
    we specified to a page. After [1], we drop the extra reference count for
    memory_failure() path. That commit says that memory_failure wanted to
    keep the pin in order to take the page out of circulation.

    The truth is that we need to keep the page pinned, otherwise the page
    might be re-used after the put_page() and we can end up messing with
    someone else's memory.

    E.g:

    CPU0                                    CPU1
    process X
     madvise_inject_error
      get_user_pages
       put_page
                                            page gets reclaimed
                                            process Y allocates the page
      memory_failure
       // We mess with process Y memory

    madvise() is meant to operate on a self address space, so messing with
    pages that do not belong to us seems the wrong thing to do.
    To avoid that, let us keep the page pinned for memory_failure as well.

    Pages for DAX mappings will release this extra refcount in
    memory_failure_dev_pagemap.

    [1] ("23e7b5c2e271: mm, madvise_inject_error:
    Let memory_failure() optionally take a page reference")

    Link: https://lkml.kernel.org/r/20201207094818.8518-1-osalvador@suse.de
    Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
    Signed-off-by: Oscar Salvador
    Suggested-by: Vlastimil Babka
    Acked-by: Naoya Horiguchi
    Cc: Vlastimil Babka
    Cc: Dan Williams
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • memory_failure and soft_offline_page paths now drain pcplists by calling
    get_hwpoison_page.

    memory_failure flags the page as HWPoison beforehand, so that the page can
    no longer go into a pcplist, and soft_offline_page only flags a page as
    HWPoison if 1) we took the page off a buddy freelist, 2) the page was
    in-use and we migrated it, or 3) it was a clean pagecache page.

    Because of that, a page can no longer be both poisoned and in a pcplist.

    Link: https://lkml.kernel.org/r/20201013144447.6706-5-osalvador@suse.de
    Signed-off-by: Oscar Salvador
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

09 Dec, 2020

1 commit

  • Jann spotted a security hole caused by a race in the mm ownership check.

    If the task is sharing the mm_struct but goes through execve() before
    mm_access(), it could skip the process_madvise_behavior_valid check. That
    allows *any advice hint* to reach into the remote process.

    This patch removes the mm ownership check. With it, a local process
    loses the ability to give *any* advice hint through the vector
    interface for some reason (e.g., performance). Since there is no
    concrete example upstream yet, it is better to remove the
    ability for now and revisit it when such new advice comes
    up.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Reported-by: Jann Horn
    Suggested-by: Jann Horn
    Signed-off-by: Minchan Kim
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

23 Nov, 2020

2 commits

  • The calculation of the end page index was incorrect, leading to a
    regression of 70% when running stress-ng.

    With this fix, we instead see a performance improvement of 3%.

    Fixes: e6e88712e43b ("mm: optimise madvise WILLNEED")
    Reported-by: kernel test robot
    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Tested-by: Xing Zhengjun
    Acked-by: Johannes Weiner
    Cc: William Kucharski
    Cc: Feng Tang
    Cc: "Chen, Rong A"
    Link: https://lkml.kernel.org/r/20201109134851.29692-1-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • The early return in process_madvise() will produce a memory leak.

    Fix it.

    Fixes: ecb8ac8b1f14 ("mm/madvise: introduce process_madvise() syscall: an external memory hinting API")
    Signed-off-by: Eric Dumazet
    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201116155132.GA3805951@google.com
    Signed-off-by: Linus Torvalds

    Eric Dumazet
     

19 Oct, 2020

2 commits

  • There is a use case where System Management Software (SMS) wants to give a
    memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
    case of Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon(ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall
    process_madvise(2). It uses pidfd of an external process to give the
    hint. It also supports a vector of address ranges because an Android app has
    thousands of vmas due to zygote, so it is a total waste of CPU and power if
    we have to call the syscall one by one for each vma. (In testing a 2000-vma
    syscall vs a 1-vector syscall, it showed a 15% performance improvement. I
    think it would be bigger in real practice because the testing ran in a very
    cache-friendly environment.)

    Another potential use case for the vector range is to amortize the cost
    of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
    benefit users like TCP receive zerocopy and malloc implementations. In
    future, we could find more use cases for other advice, so let's make it
    part of the API since we are introducing a new syscall at this moment. With
    that, existing madvise(2) users could replace it with process_madvise(2)
    with their own pid if they want batched address range support.

    Since it could affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS), or one that otherwise has the right
    to ptrace the target process (e.g., by having the same UID), can use it
    successfully. The flag argument is reserved for future use if we need to
    extend the API.

    I think supporting every hint that madvise supports, or will support, in
    process_madvise is rather risky, because we are not sure all hints make
    sense from an external process, and the implementation of a hint may rely on
    the caller being in the current context, so it could be error-prone. Thus,
    I just limited hints to MADV_[COLD|PAGEOUT] in this patch.

    If someone wants to add other hints, we can hear the use case and review
    it for each hint. That is safer for maintenance than introducing a
    buggy syscall that is hard to fix later.

    So finally, the API is as follows,

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
    unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges from external process as well as
    local process. It provides the advice to address ranges of process
    described by iovec and vlen. The goal of such advice is to improve
    system or application performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (See pidfd_open(2) for further information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
        void *iov_base; /* starting address */
        size_t iov_len; /* number of bytes to be advised */
    };

    Each iovec describes an address range beginning at address iov_base
    and spanning iov_len bytes.

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which is one of the
    following at this moment if the target process specified by pidfd is
    external.

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    The process_madvise supports every advice madvise(2) has if target
    process is in same thread group with calling process so user could
    use process_madvise(2) to extend existing madvise(2) to support
    vector address ranges.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes, if an error occurred. The caller should check return value
    to determine whether a partial advice occurred.
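
    As a rough illustration of the return-value convention above, a caller
    might wrap the syscall and compare the result with the total length it
    requested. This is a sketch under assumptions: the sketch_ names are
    hypothetical, the wrapper goes through syscall(2) only when the libc
    headers define SYS_process_madvise, and the three-way classification is
    one possible convention rather than part of the API.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Thin wrapper; prefixed so it cannot clash with a libc-provided wrapper. */
ssize_t sketch_process_madvise(int pidfd, const struct iovec *iov,
                               size_t vlen, int advice, unsigned int flags)
{
#ifdef SYS_process_madvise
    return syscall(SYS_process_madvise, pidfd, iov, vlen, advice, flags);
#else
    (void)pidfd;
    (void)iov;
    (void)vlen;
    (void)advice;
    (void)flags;
    errno = ENOSYS;
    return -1;
#endif
}

/* Total number of bytes described by an iovec array. */
size_t sketch_iovec_total(const struct iovec *iov, size_t vlen)
{
    size_t total = 0;

    assert(iov != NULL || vlen == 0);
    for (size_t i = 0; i < vlen; i++)
        total += iov[i].iov_len;
    return total;
}

/* Classify a return value: -1 = error, 0 = fully advised, 1 = partial. */
int sketch_advice_outcome(ssize_t ret, const struct iovec *iov, size_t vlen)
{
    if (ret < 0)
        return -1;
    return (size_t)ret < sketch_iovec_total(iov, vlen) ? 1 : 0;
}
```

    A caller would pass the result of sketch_process_madvise() to
    sketch_advice_outcome() together with the same iovec array, and retry or
    report the remaining ranges when the outcome is partial.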

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How do we handle the race (i.e., object validation) between
    giving a hint from an external process and the target process
    receiving the hint?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the
    target process can run between the time the process_madvise process
    inspects the target process address space and the time that
    process_madvise is actually called, process_madvise may operate on
    memory regions that the calling process does not expect. It's the
    responsibility of the process calling process_madvise to close this
    race condition. For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called. Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process. Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm. The suggested API itself does not provide synchronization. The
    same applies to other APIs like move_pages and process_vm_writev.

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something. Think about mmap. It never
    guarantees newly allocated address space is still valid when the user
    tries to access it because other threads could unmap the memory right
    before. That's where we need synchronization through another API or
    design on the user side. It shouldn't be part of the API itself. If someone
    needs more fine-grained synchronization than process level,
    there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
    applicable via the last reserved argument of the API, but I don't
    think it's necessary right now since we already have ways to prevent
    the race, so I don't want to add additional complexity with a more
    fine-grained optimization model.

    To keep the API extensible, it reserves an unsigned long as the last argument
    so we could support it in future if someone really needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the abovementioned race and hinting a wrong
    VMA. Furthermore, we want to act on the hint in the caller's context, not the
    callee's, because the callee is usually limited by cpuset/cgroups or
    even in a frozen state, so it can't act by itself quickly enough, which
    causes more thrashing/kills. It also doesn't work if the target process is
    already ptraced (e.g., strace, debugger, minidump) because a process can
    have at most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Cc:
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "introduce memory hinting API for external process", v9.

    Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With
    that, an application could give hints to the kernel about which memory
    ranges are preferred to be reclaimed. However, on some platforms (e.g.,
    Android), the
    information required to make the hinting decision is not known to the app.
    Instead, it is known to a centralized userspace daemon(e.g.,
    ActivityManagerService), and that daemon must be able to initiate reclaim
    on its own without any app involvement.

    To solve the concern, this patch introduces a new syscall -
    process_madvise(2). Basically, it's the same as the madvise(2) syscall but
    it has some differences.

    1. It needs pidfd of target process to provide the hint

    2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
    moment. Other hints in madvise will be opened when there are explicit
    requests from community to prevent unexpected bugs we couldn't support.

    3. Only privileged processes can do something for other process's
    address space.

    For more detail of the new API, please see "mm: introduce external memory
    hinting API" description in this patchset.

    This patch (of 3):

    In upcoming patches, do_madvise will be called from external process
    context so we shouldn't assume "current" is always the hinted process's
    task_struct.

    Furthermore, we must not access mm_struct via task->mm, but obtain it via
    mm_access() once (in the following patch) and only use that pointer [1],
    so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
    safe, so we can use them further down the call stack.

    For now, pass current->mm as an argument of do_madvise; this doesn't
    change existing behavior but prepares for the next patch and eases review.

    [vbabka@suse.cz: changelog tweak]
    [minchan@kernel.org: use current->mm for io_uring]
    Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
    [akpm@linux-foundation.org: fix it for upstream changes]
    [akpm@linux-foundation.org: whoops]
    [rdunlap@infradead.org: add missing includes]

    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Daniel Colascione
    Cc: Sandeep Patil
    Cc: Sonny Rao
    Cc: Brian Geffon
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: John Dias
    Cc: Joel Fernandes
    Cc: Alexander Duyck
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Kirill Tkhai
    Cc: Oleksandr Natalenko
    Cc: Florian Weimer
    Cc:
    Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

17 Oct, 2020

3 commits

  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Currently, there is an inconsistency when calling soft-offline from
    different paths on a page that is already poisoned.

    1) madvise:

    madvise_inject_error skips any poisoned page and continues
    the loop.
    If that was the only page to madvise, it returns 0.

    2) /sys/devices/system/memory/:

    When calling soft_offline_page_store()->soft_offline_page(),
    we return -EBUSY in case the page is already poisoned.
    This is inconsistent with a) the above example and b)
    memory_failure, where we return 0 if the page was poisoned.

    Fix this by dropping the PageHWPoison() check in madvise_inject_error, and
    let soft_offline_page return 0 if it finds the page already poisoned.

    Please note that this represents a user-API change, since the error
    returned when calling soft_offline_page_store()->soft_offline_page() will
    now be different.

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Oscar Salvador
    Cc: Qian Cai
    Cc: Tony Luck
    Link: https://lkml.kernel.org/r/20200922135650.1634-12-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     
  • Make a proper if-else condition for {hard,soft}-offline.

    Signed-off-by: Oscar Salvador
    Signed-off-by: Andrew Morton
    Acked-by: Naoya Horiguchi
    Cc: Michal Hocko
    Cc: Qian Cai
    Cc: Tony Luck
    Cc: "Aneesh Kumar K.V"
    Cc: Aneesh Kumar K.V
    Cc: Aristeu Rozanski
    Cc: Dave Hansen
    Cc: David Hildenbrand
    Cc: Dmitry Yakunin
    Cc: Mike Kravetz
    Link: https://lkml.kernel.org/r/20200908075626.11976-3-osalvador@suse.de
    Signed-off-by: Linus Torvalds

    Oscar Salvador
     

14 Oct, 2020

1 commit

  • Instead of calling find_get_entry() for every page index, use an XArray
    iterator to skip over NULL entries, and avoid calling get_page(),
    because we only want the swap entries.
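
    A toy model of the counting loop, assuming the XArray convention that
    value entries (used here for swap entries) have bit 0 set while page
    pointers do not. The flat slot array and all function names are
    hypothetical; a real XArray iterator skips empty ranges at the tree
    level instead of testing each slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value entries are tagged in bit 0; ordinary pointers are aligned. */
static int is_value_entry(const void *entry)
{
    return ((uintptr_t)entry & 1) != 0;
}

/* Encode an integer as a tagged value entry. */
static void *make_value_entry(unsigned long v)
{
    return (void *)(((uintptr_t)v << 1) | 1);
}

/* Count tagged entries in slots[start, end); never dereference pages. */
static size_t count_swap_entries(void *const *slots, size_t start, size_t end)
{
    size_t swapped = 0;
    size_t i;

    assert(slots != NULL && start <= end);

    for (i = start; i < end; i++) {
        const void *entry = slots[i];

        if (entry == NULL)
            continue;           /* hole: nothing stored at this index */
        if (is_value_entry(entry))
            swapped++;          /* swap entry; no page reference taken */
    }
    return swapped;
}
```

    Page pointers are stepped over without taking a reference, which is the
    saving the change describes.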

    [willy@infradead.org: fix LTP soft lockups]
    Link: https://lkml.kernel.org/r/20200914165032.GS6583@casper.infradead.org

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Cc: Alexey Dobriyan
    Cc: Chris Wilson
    Cc: Huang Ying
    Cc: Hugh Dickins
    Cc: Jani Nikula
    Cc: Matthew Auld
    Cc: William Kucharski
    Cc: Qian Cai
    Link: https://lkml.kernel.org/r/20200910183318.20139-4-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     

27 Sep, 2020

1 commit

  • syzbot reported the following KASAN splat:

    general protection fault, probably for non-canonical address 0xdffffc0000000003: 0000 [#1] PREEMPT SMP KASAN
    KASAN: null-ptr-deref in range [0x0000000000000018-0x000000000000001f]
    CPU: 1 PID: 6826 Comm: syz-executor142 Not tainted 5.9.0-rc4-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    RIP: 0010:__lock_acquire+0x84/0x2ae0 kernel/locking/lockdep.c:4296
    Code: ff df 8a 04 30 84 c0 0f 85 e3 16 00 00 83 3d 56 58 35 08 00 0f 84 0e 17 00 00 83 3d 25 c7 f5 07 00 74 2c 4c 89 e8 48 c1 e8 03 3c 30 00 74 12 4c 89 ef e8 3e d1 5a 00 48 be 00 00 00 00 00 fc
    RSP: 0018:ffffc90004b9f850 EFLAGS: 00010006
    Call Trace:
    lock_acquire+0x140/0x6f0 kernel/locking/lockdep.c:5006
    __raw_spin_lock include/linux/spinlock_api_smp.h:142 [inline]
    _raw_spin_lock+0x2a/0x40 kernel/locking/spinlock.c:151
    spin_lock include/linux/spinlock.h:354 [inline]
    madvise_cold_or_pageout_pte_range+0x52f/0x25c0 mm/madvise.c:389
    walk_pmd_range mm/pagewalk.c:89 [inline]
    walk_pud_range mm/pagewalk.c:160 [inline]
    walk_p4d_range mm/pagewalk.c:193 [inline]
    walk_pgd_range mm/pagewalk.c:229 [inline]
    __walk_page_range+0xe7b/0x1da0 mm/pagewalk.c:331
    walk_page_range+0x2c3/0x5c0 mm/pagewalk.c:427
    madvise_pageout_page_range mm/madvise.c:521 [inline]
    madvise_pageout mm/madvise.c:557 [inline]
    madvise_vma mm/madvise.c:946 [inline]
    do_madvise+0x12d0/0x2090 mm/madvise.c:1145
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0x76/0x80 mm/madvise.c:1169
    do_syscall_64+0x31/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    The backing vma was shmem.

    When a page of a file-backed THP is split, madvise zaps the pmd instead
    of remapping the sub-pages. So we need to check pmd validity after the
    split.

    Reported-by: syzbot+ecf80462cb7d5d552bc7@syzkaller.appspotmail.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: Minchan Kim
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

06 Sep, 2020

1 commit

  • syzbot reported the use-after-free below:

    BUG: KASAN: use-after-free in madvise_willneed mm/madvise.c:293 [inline]
    BUG: KASAN: use-after-free in madvise_vma mm/madvise.c:942 [inline]
    BUG: KASAN: use-after-free in do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
    Read of size 8 at addr ffff8880a6163eb0 by task syz-executor.0/9996

    CPU: 0 PID: 9996 Comm: syz-executor.0 Not tainted 5.9.0-rc1-syzkaller #0
    Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
    Call Trace:
    __dump_stack lib/dump_stack.c:77 [inline]
    dump_stack+0x18f/0x20d lib/dump_stack.c:118
    print_address_description.constprop.0.cold+0xae/0x497 mm/kasan/report.c:383
    __kasan_report mm/kasan/report.c:513 [inline]
    kasan_report.cold+0x1f/0x37 mm/kasan/report.c:530
    madvise_willneed mm/madvise.c:293 [inline]
    madvise_vma mm/madvise.c:942 [inline]
    do_madvise.part.0+0x1c8b/0x1cf0 mm/madvise.c:1145
    do_madvise mm/madvise.c:1169 [inline]
    __do_sys_madvise mm/madvise.c:1171 [inline]
    __se_sys_madvise mm/madvise.c:1169 [inline]
    __x64_sys_madvise+0xd9/0x110 mm/madvise.c:1169
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Allocated by task 9992:
    kmem_cache_alloc+0x138/0x3a0 mm/slab.c:3482
    vm_area_alloc+0x1c/0x110 kernel/fork.c:347
    mmap_region+0x8e5/0x1780 mm/mmap.c:1743
    do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
    vm_mmap_pgoff+0x195/0x200 mm/util.c:506
    ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Freed by task 9992:
    kmem_cache_free.part.0+0x67/0x1f0 mm/slab.c:3693
    remove_vma+0x132/0x170 mm/mmap.c:184
    remove_vma_list mm/mmap.c:2613 [inline]
    __do_munmap+0x743/0x1170 mm/mmap.c:2869
    do_munmap mm/mmap.c:2877 [inline]
    mmap_region+0x257/0x1780 mm/mmap.c:1716
    do_mmap+0xcf9/0x11d0 mm/mmap.c:1545
    vm_mmap_pgoff+0x195/0x200 mm/util.c:506
    ksys_mmap_pgoff+0x43a/0x560 mm/mmap.c:1596
    do_syscall_64+0x2d/0x70 arch/x86/entry/common.c:46
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    It is because vma is accessed after releasing mmap_lock, but someone
    else acquired the mmap_lock and the vma is gone.

    Releasing mmap_lock after accessing vma should fix the problem.
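
    The ordering can be illustrated with a toy model: everything needed from
    the vma is copied into a local while the lock is held, and only the copy
    is used afterwards. The types, the lock helpers and file_offset_for()
    are hypothetical; the offset arithmetic is an assumption about what the
    willneed path needs from the vma.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MODEL_PAGE_SHIFT 12

struct vma_model { unsigned long vm_start; unsigned long vm_pgoff; };
struct mm_model { bool read_locked; struct vma_model *vma; };

static void model_read_lock(struct mm_model *mm)
{
    assert(!mm->read_locked);
    mm->read_locked = true;
}

static void model_read_unlock(struct mm_model *mm)
{
    assert(mm->read_locked);
    mm->read_locked = false;
}

/* Compute the file offset for `start`, touching the vma only under the lock. */
static long long file_offset_for(struct mm_model *mm, unsigned long start)
{
    long long offset;

    model_read_lock(mm);
    assert(mm->vma != NULL && start >= mm->vma->vm_start);
    offset = (long long)(start - mm->vma->vm_start) +
             ((long long)mm->vma->vm_pgoff << MODEL_PAGE_SHIFT);
    model_read_unlock(mm);

    /* From here on mm->vma may be gone; only the copied value is used. */
    return offset;
}
```

    Swapping the unlock and the offset computation would reintroduce the
    window in which another task can free the vma.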

    Fixes: 692fe62433d4c ("mm: Handle MADV_WILLNEED through vfs_fadvise()")
    Reported-by: syzbot+b90df26038d1d5d85c97@syzkaller.appspotmail.com
    Signed-off-by: Yang Shi
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Reviewed-by: Jan Kara
    Cc: [5.4+]
    Link: https://lkml.kernel.org/r/20200816141204.162624-1-shy828301@gmail.com
    Signed-off-by: Linus Torvalds

    Yang Shi
     

10 Jun, 2020

2 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)
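
    What the rule achieves at a call site can be sketched with a
    single-threaded toy: callers hand the mm to a wrapper, and only the
    wrapper knows which field holds the lock. The rwsem_model type is a
    hypothetical stand-in with trylock semantics only, not the kernel's
    rwsem, and only four of the wrappers are modelled.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal stand-in for an rwsem: many readers or one writer. */
struct rwsem_model { int readers; bool writer; };
struct mm_model { struct rwsem_model mmap_sem; };

static bool rwsem_read_trylock(struct rwsem_model *s)
{
    if (s->writer)
        return false;
    s->readers++;
    return true;
}

static void rwsem_up_read(struct rwsem_model *s)
{
    assert(s->readers > 0);
    s->readers--;
}

static bool rwsem_write_trylock(struct rwsem_model *s)
{
    if (s->writer || s->readers > 0)
        return false;
    s->writer = true;
    return true;
}

static void rwsem_up_write(struct rwsem_model *s)
{
    assert(s->writer);
    s->writer = false;
}

/* Wrappers in the style of the new API: the argument is the mm itself. */
static bool mmap_read_trylock(struct mm_model *mm)
{
    return rwsem_read_trylock(&mm->mmap_sem);
}

static void mmap_read_unlock(struct mm_model *mm)
{
    rwsem_up_read(&mm->mmap_sem);
}

static bool mmap_write_trylock(struct mm_model *mm)
{
    return rwsem_write_trylock(&mm->mmap_sem);
}

static void mmap_write_unlock(struct mm_model *mm)
{
    rwsem_up_write(&mm->mmap_sem);
}
```

    Because no caller names the lock field any more, the field can later be
    renamed or instrumented by touching only the wrappers.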

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

25 Apr, 2020

1 commit

  • IORING_OP_MADVISE can end up basically doing mprotect() on the VM of
    another process, which means that it can race with our crazy core dump
    handling which accesses the VM state without holding the mmap_sem
    (because it incorrectly thinks that it is the final user).

    This is clearly a core dumping problem, but we've never fixed it the
    right way, and instead have the notion of "check that the mm is still
    ok" using mmget_still_valid() after getting the mmap_sem for writing in
    any situation where we're not the original VM thread.

    See commit 04f5866e41fb ("coredump: fix race condition between
    mmget_not_zero()/get_task_mm() and core dumping") for more background on
    this whole mmget_still_valid() thing. You might want to have a barf bag
    handy when you do.

    We're discussing just fixing this properly in the only remaining core
    dumping routines. But even if we do that, let's make do_madvise() do
    the right thing, and then when we fix core dumping, we can remove all
    these mmget_still_valid() checks.

    Reported-and-tested-by: Jann Horn
    Fixes: c1ca757bd6f4 ("io_uring: add IORING_OP_MADVISE")
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

22 Mar, 2020

1 commit

  • Jann has brought up a very interesting point [1]. While shared pages
    are excluded from MADV_PAGEOUT normally, CoW pages can be easily
    reclaimed that way. This can lead to all sorts of hard to debug
    problems. E.g. performance problems outlined by Daniel [2].

    There are runtime environments where a substantial amount of memory is
    shared among security domains via CoW memory, and an easy way to reclaim
    that memory, which MADV_{COLD,PAGEOUT} offers, can lead to either
    performance degradation for the parent process, which might be more
    privileged, or even open side channel attacks.

    The feasibility of the latter is not really clear to me TBH but there is
    no real reason for exposure at this stage. It seems there is no real
    use case to depend on reclaiming CoW memory via madvise at this stage so
    it is much easier to simply disallow it and this is what this patch
    does. Put simply, MADV_{PAGEOUT,COLD} can operate only on exclusively
    owned memory, which is a straightforward semantic.
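
    A toy model of the resulting rule, assuming "exclusively owned" is
    decided by a map count of exactly one. The page_model type and
    pageout_exclusive() are hypothetical and stand in for the page table
    walk in the real code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model {
    int mapcount;     /* number of mappings that reference this page */
    bool reclaimed;   /* set when the model "pages out" the page */
};

/* Page out only pages mapped exactly once; return how many were taken. */
static size_t pageout_exclusive(struct page_model *pages, size_t n)
{
    size_t taken = 0;
    size_t i;

    assert(pages != NULL || n == 0);

    for (i = 0; i < n; i++) {
        if (pages[i].mapcount != 1)
            continue;   /* shared (for example via CoW): leave it alone */
        pages[i].reclaimed = true;
        taken++;
    }
    return taken;
}
```

    A page still shared with a parent after fork() has a map count above
    one in this model and is therefore skipped.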

    [1] http://lkml.kernel.org/r/CAG48ez0G3JkMq61gUmyQAaCq=_TwHbi1XKzWRooxZkv08PQKuw@mail.gmail.com
    [2] http://lkml.kernel.org/r/CAKOZueua_v8jHCpmEtTB6f3i9e2YnmX4mqdYVWhV4E=Z-n+zRQ@mail.gmail.com

    Fixes: 9c276cc65a58 ("mm: introduce MADV_COLD")
    Reported-by: Jann Horn
    Signed-off-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Minchan Kim
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: "Joel Fernandes (Google)"
    Cc:
    Link: http://lkml.kernel.org/r/20200312082248.GS23944@dhcp22.suse.cz
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

21 Jan, 2020

1 commit

  • This is in preparation for enabling this functionality through io_uring.
    Add a helper that is just exporting what sys_madvise() does, and have the
    system call use it.

    No functional changes in this patch.

    Reviewed-by: Pavel Begunkov
    Signed-off-by: Jens Axboe

    Jens Axboe
     

02 Dec, 2019

3 commits

  • Improve readability, no functional change.

    Link: http://lkml.kernel.org/r/20191118032857.22683-1-richardw.yang@linux.intel.com
    Signed-off-by: Wei Yang
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wei Yang
     
  • page_size() is supported after the commit a50b854e073c ("mm: introduce
    page_size()").

    Use page_size() in madvise_inject_error() for readability.
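
    For reference, the quantity page_size() yields is the base page size
    shifted left by the compound order of the page. A user-space sketch,
    assuming 4 KiB base pages; page_size_model() and the MODEL_* constants
    are hypothetical names. The result is an unsigned long, in line with
    the `size' tweak noted below.

```c
#include <assert.h>
#include <limits.h>

#define MODEL_PAGE_SHIFT 12
#define MODEL_PAGE_SIZE  (1UL << MODEL_PAGE_SHIFT)

/* Size in bytes of a (possibly compound) page of the given order. */
static unsigned long page_size_model(unsigned int order)
{
    /* Guard against shifting past the width of unsigned long. */
    assert(order < sizeof(unsigned long) * CHAR_BIT - MODEL_PAGE_SHIFT);
    return MODEL_PAGE_SIZE << order;
}
```

    With these assumptions an order-0 page is 4096 bytes and an order-9
    page is 2 MiB.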

    [akpm@linux-foundation.org: use ulong for `size', per David]
    Link: http://lkml.kernel.org/r/29dce60c-38d6-0220-f292-e298f0c78c4d@huawei.com
    Signed-off-by: Yunfeng Ye
    Reviewed-by: Andrew Morton
    Acked-by: David Rientjes
    Cc: Jason Gunthorpe
    Cc: Michal Hocko
    Cc: Minchan Kim
    Cc: Peter Zijlstra
    Cc: Jan Kara
    Cc: Mike Rapoport
    Cc: Hu Shiyuan
    Cc: Feilong Lin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yunfeng Ye
     
  • Currently soft_offline_page() receives struct page, and its sibling
    memory_failure() receives pfn. This discrepancy looks weird and makes
    precheck on pfn validity tricky. So let's align them.

    Link: http://lkml.kernel.org/r/20191016234706.GA5493@www9186uo.sakura.ne.jp
    Signed-off-by: Naoya Horiguchi
    Acked-by: Andrew Morton
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Oscar Salvador
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     

16 Nov, 2019

1 commit

  • Recently, I hit the following issue when running upstream.

    kernel BUG at mm/vmscan.c:1521!
    invalid opcode: 0000 [#1] SMP KASAN PTI
    CPU: 0 PID: 23385 Comm: syz-executor.6 Not tainted 5.4.0-rc4+ #1
    RIP: 0010:shrink_page_list+0x12b6/0x3530 mm/vmscan.c:1521
    Call Trace:
    reclaim_pages+0x499/0x800 mm/vmscan.c:2188
    madvise_cold_or_pageout_pte_range+0x58a/0x710 mm/madvise.c:453
    walk_pmd_range mm/pagewalk.c:53 [inline]
    walk_pud_range mm/pagewalk.c:112 [inline]
    walk_p4d_range mm/pagewalk.c:139 [inline]
    walk_pgd_range mm/pagewalk.c:166 [inline]
    __walk_page_range+0x45a/0xc20 mm/pagewalk.c:261
    walk_page_range+0x179/0x310 mm/pagewalk.c:349
    madvise_pageout_page_range mm/madvise.c:506 [inline]
    madvise_pageout+0x1f0/0x330 mm/madvise.c:542
    madvise_vma mm/madvise.c:931 [inline]
    __do_sys_madvise+0x7d2/0x1600 mm/madvise.c:1113
    do_syscall_64+0x9f/0x4c0 arch/x86/entry/common.c:290
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    madvise_pageout() walks the specified range of the vma, isolates the
    pages it finds, and then runs shrink_page_list() to reclaim their
    memory. But it also isolates unevictable pages for reclaim, which is
    how we hit the BUG in shrink_page_list().

    The root cause is that we scan the page tables instead of a specific LRU
    list, and so we need to filter out the unevictable LRU pages on our
    end.
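
    A simplified model of that filter: while walking the range, only pages
    that are on an LRU and not unevictable are queued for reclaim. The
    page_model type and collect_reclaim_candidates() are hypothetical, and
    the model skips unevictable pages up front, which may differ from how
    the real fix orders the isolation and the check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct page_model {
    bool on_lru;        /* page can be isolated from an LRU list */
    bool unevictable;   /* for example, an mlocked page */
};

/* Store indices of reclaimable pages in out_idx; return how many. */
static size_t collect_reclaim_candidates(const struct page_model *pages,
                                         size_t n, size_t *out_idx)
{
    size_t queued = 0;
    size_t i;

    assert(pages != NULL && out_idx != NULL);

    for (i = 0; i < n; i++) {
        if (!pages[i].on_lru)
            continue;   /* nothing to isolate */
        if (pages[i].unevictable)
            continue;   /* must never reach the reclaim list */
        out_idx[queued++] = i;
    }
    return queued;
}
```

    The caller supplies an out_idx array with room for n entries.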

    Link: http://lkml.kernel.org/r/1572616245-18946-1-git-send-email-zhongjiang@huawei.com
    Fixes: 1a4e58cce84e ("mm: introduce MADV_PAGEOUT")
    Signed-off-by: zhong jiang
    Suggested-by: Johannes Weiner
    Acked-by: Johannes Weiner
    Acked-by: Minchan Kim
    Acked-by: Michal Hocko
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    zhong jiang
     

26 Sep, 2019

1 commit

  • There are many common parts between MADV_COLD and MADV_PAGEOUT.
    This patch factors them out to reduce code duplication.

    Link: http://lkml.kernel.org/r/20190726023435.214162-6-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Suggested-by: Johannes Weiner
    Acked-by: Michal Hocko
    Cc: Chris Zankel
    Cc: Daniel Colascione
    Cc: Dave Hansen
    Cc: Hillf Danton
    Cc: James E.J. Bottomley
    Cc: Joel Fernandes (Google)
    Cc: kbuild test robot
    Cc: Kirill A. Shutemov
    Cc: Oleksandr Natalenko
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Suren Baghdasaryan
    Cc: Tim Murray
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim