03 Nov, 2020

6 commits

  • Fix the following sparse warning:

    mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was not declared. Should it be static?
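
    Since the symbol has no users outside mm/truncate.c, the fix is simply to
    give it internal linkage. In diff form, a minimal sketch of the pattern
    (the parameter list is illustrative, not necessarily the exact upstream
    signature):

    -unsigned long __invalidate_mapping_pages(struct address_space *mapping,
    +static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
                    pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)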

    Fixes: eb1d7a65f08a ("mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED")
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yafang Shao
    Link: https://lkml.kernel.org/r/20201015054808.2445904-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • When the flags passed to queue_pages_pte_range don't have the MPOL_MF_MOVE
    or MPOL_MF_MOVE_ALL bits set, the code breaks out of the loop, and passing
    the original pte - 1 to pte_unmap_unlock is not a good idea.

    queue_pages_pte_range can run in MPOL_MF_STRICT mode, which doesn't
    migrate misplaced pages but returns with EIO when encountering such a
    page. Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return
    -EIO when MPOL_MF_STRICT is specified"), an early break on the first pte
    in the range results in pte_unmap_unlock on an underflow pte. This can
    lead to lockups later on when somebody tries to lock the pte or the
    page_table_lock again.
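
    The fix, sketched as a diff (variable naming illustrative of the upstream
    change): remember the pte originally returned by pte_offset_map_lock() and
    unlock with that, instead of the loop cursor:

    -       pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
    +       mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
            for (; addr != end; pte++, addr += PAGE_SIZE) {
                    /* may break out on the very first pte */
            }
    -       pte_unmap_unlock(pte - 1, ptl);
    +       pte_unmap_unlock(mapped_pte, ptl);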

    Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Miaohe Lin
    Cc: Feilong Lin
    Cc: Shijie Luo
    Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Luo
     
  • Richard reported a warning which can be reproduced by running the LTP
    madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

    WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Modules linked in:
    CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
    Workqueue: events drain_local_stock
    RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Call Trace:
    __memcg_kmem_uncharge (mm/memcontrol.c:3022)
    drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
    drain_local_stock (mm/memcontrol.c:2255)
    process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
    worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:292)
    ret_from_fork (arch/x86/entry/entry_64.S:300)

    The problem occurs because in the non-hierarchical mode non-root page
    counters are not linked to root page counters, so the charge is not
    propagated to the root memory cgroup.

    After the removal of the original memory cgroup and reparenting of the
    object cgroup, the root cgroup might be uncharged by draining an objcg
    stock, for example. This leads to an eventual underflow of the charge and
    triggers a warning.

    Fix it by linking all page counters to corresponding root page counters
    in the non-hierarchical mode.
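
    A sketch of the idea, assuming the upstream page_counter_init() API (the
    set of counters is abbreviated here):

    if (parent->use_hierarchy) {
            page_counter_init(&memcg->memory, &parent->memory);
            page_counter_init(&memcg->swap, &parent->swap);
    } else {
            /* Link to root so uncharges after reparenting balance out. */
            page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
            page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
    }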

    Please note, that in the non-hierarchical mode all objcgs are always
    reparented to the root memory cgroup, even if the hierarchy has more
    than 1 level. This patch doesn't change it.

    The patch also doesn't affect how the hierarchical mode is working,
    which is the only sane and truly supported mode now.

    Thanks to Richard for reporting, debugging and providing an alternative
    version of the fix!

    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
    Debugged-by: Richard Palethorpe
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • memcg_page_state will get the specified number for a hierarchical memcg.
    If the item is NR_ANON_THPS, that number counts huge pages, so it should
    be multiplied by HPAGE_PMD_NR rather than treated as a count of base
    pages.

    [akpm@linux-foundation.org: fix printk warning]
    [akpm@linux-foundation.org: use u64 cast, per Michal]
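
    A sketch of the corrected stat formatting (helper and array names are
    illustrative of the memcg stats code, not an exact excerpt):

    unsigned long nr = memcg_page_state(memcg, memcg1_stats[i]);

    /* NR_ANON_THPS counts huge pages, not base pages. */
    if (memcg1_stats[i] == NR_ANON_THPS)
            nr *= HPAGE_PMD_NR;
    seq_printf(m, "%s %llu\n", memcg1_stat_names[i], (u64)nr * PAGE_SIZE);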

    Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
    Signed-off-by: zhongjiang-ali
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    zhongjiang-ali
     
  • Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
    with hugetlbfs and hit the warning below. QEMU with free page hinting
    uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
    as free by a VM. Reporting is done at pageblock granularity, so when the
    guest reports a free 2M chunk, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge
    page in QEMU.

    WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
    Modules linked in: ...
    CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
    Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
    RIP: 0010:page_counter_uncharge+0x4b/0x50
    ...
    Call Trace:
    hugetlb_cgroup_uncharge_file_region+0x4b/0x80
    region_del+0x1d3/0x300
    hugetlb_unreserve_pages+0x39/0xb0
    remove_inode_hugepages+0x1a8/0x3d0
    hugetlbfs_fallocate+0x3c4/0x5c0
    vfs_fallocate+0x146/0x290
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Investigation of the issue uncovered bugs in hugetlb cgroup reservation
    accounting. This patch addresses the found issues.
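
    For reference, the triggering operation from userspace looks roughly like
    this (size illustrative; QEMU punches one huge page per reported chunk):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>

    /* Punch out one 2M huge page of a hugetlbfs-backed file. */
    static int punch_huge_page(int fd, off_t offset)
    {
            return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             offset, 2 * 1024 * 1024);
    }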

    Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
    Reported-by: Michal Privoznik
    Co-developed-by: David Hildenbrand
    Signed-off-by: David Hildenbrand
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Michal Privoznik
    Reviewed-by: Mina Almasry
    Acked-by: Michael S. Tsirkin
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: "Aneesh Kumar K . V"
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • commit 6f42193fd86e ("memremap: don't use a separate devm action for
    devmap_managed_enable_get") changed the static key updates such that we
    now call devmap_managed_enable_put() without doing the equivalent
    devmap_managed_enable_get().

    devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
    MEMORY_DEVICE_FS_DAX, but memunmap_pages() gets called for other pgmap
    types too. This results in the below warning when switching between
    system-ram and devdax mode for devdax namespace.

    jump label: negative count!
    WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 static_key_slow_try_dec+0x88/0xa0
    Modules linked in:
    ....

    NIP static_key_slow_try_dec+0x88/0xa0
    LR static_key_slow_try_dec+0x84/0xa0
    Call Trace:
    static_key_slow_try_dec+0x84/0xa0
    __static_key_slow_dec_cpuslocked+0x34/0xd0
    static_key_slow_dec+0x54/0xf0
    memunmap_pages+0x36c/0x500
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    bus_remove_device+0x124/0x210
    device_del+0x1d4/0x530
    unregister_dev_dax+0x48/0xe0
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    unbind_store+0x130/0x170
    drv_attr_store+0x40/0x60
    sysfs_kf_write+0x6c/0xb0
    kernfs_fop_write+0x118/0x280
    vfs_write+0xe8/0x2a0
    ksys_write+0x84/0x140
    system_call_exception+0x120/0x270
    system_call_common+0xf0/0x27c
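
    The fix, sketched: make the put side check the same pgmap types as the get
    side so the static key stays balanced (body abbreviated):

    static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
    {
            /* Only drop the key for types that actually took a reference. */
            if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
                pgmap->type == MEMORY_DEVICE_FS_DAX)
                    static_branch_dec(&devmap_managed_key);
    }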

    Reported-by: Aneesh Kumar K.V
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Link: https://lkml.kernel.org/r/20201023183222.13186-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

28 Oct, 2020

2 commits

  • With e.g. m68k/defconfig:

    mm/process_vm_access.c: In function ‘process_vm_rw’:
    mm/process_vm_access.c:277:5: error: implicit declaration of function ‘in_compat_syscall’ [-Werror=implicit-function-declaration]
    277 | in_compat_syscall());
    | ^~~~~~~~~~~~~~~~~

    Fix this by adding #include <linux/compat.h>.

    Reported-by: noreply@ellerman.id.au
    Reported-by: damian
    Reported-by: Naresh Kamboju
    Fixes: 38dc5079da7081e8 ("Fix compat regression in process_vm_rw()")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The removal of compat_process_vm_{readv,writev} didn't change
    process_vm_rw(), which always assumes it's not doing a compat syscall.

    Instead of passing in 'false' unconditionally for 'compat', make it
    conditional on in_compat_syscall().
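
    The essence of the change, sketched (the parameter list here is
    abbreviated and illustrative, not the exact call site):

    /* Was hardcoded to false after the compat entry points were removed. */
    return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, vm_write,
                         in_compat_syscall());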

    [ Both Al and Christoph point out that trying to access a 64-bit process
    from a 32-bit one cannot work anyway, and is likely better prohibited,
    but that's a separate issue - Linus ]

    Fixes: c3973b401ef2 ("mm: remove compat_process_vm_{readv,writev}")
    Reported-and-tested-by: Kyle Huey
    Signed-off-by: Jens Axboe
    Acked-by: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

21 Oct, 2020

2 commits

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     
  • Pull io_uring updates from Jens Axboe:
    "A mix of fixes and a few stragglers. In detail:

    - Revert the bogus __read_mostly that we discussed for the initial
    pull request.

    - Fix a merge window regression with fixed file registration error
    path handling.

    - Fix io-wq numa node affinities.

    - Series abstracting out an io_identity struct, making it both easier
    to see what the personality items are, and also easier to adopt
    more. Use this to cover audit logging.

    - Fix for read-ahead disabled block condition in async buffered
    reads, and using single-page read-ahead to unify which
    generic_file_buffered_read() path is used.

    - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)

    - Poll fix (Pavel)"

    * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
    io_uring: use blk_queue_nowait() to check if NOWAIT supported
    mm: use limited read-ahead to satisfy read
    mm: mark async iocb read as NOWAIT once some data has been copied
    io_uring: fix double poll mask init
    io-wq: inherit audit loginuid and sessionid
    io_uring: use percpu counters to track inflight requests
    io_uring: assign new io_identity for task if members have changed
    io_uring: store io_identity in io_uring_task
    io_uring: COW io_identity on mismatch
    io_uring: move io identity items into separate struct
    io_uring: rely solely on work flags to determine personality.
    io_uring: pass required context in as flags
    io-wq: assign NUMA node locality if appropriate
    io_uring: fix error path cleanup in io_sqe_files_register()
    Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
    io_uring: fix REQ_F_COMP_LOCKED by killing it
    io_uring: dig out COMP_LOCK from deep call chain
    io_uring: don't put a poll req under spinlock
    io_uring: don't unnecessarily clear F_LINK_TIMEOUT
    io_uring: don't set COMP_LOCKED if won't put
    ...

    Linus Torvalds
     

19 Oct, 2020

21 commits

  • No point in having the filename inside the file.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "two small vmalloc cleanups".

    This patch (of 2):

    __vmalloc_area_node currently has four different gfp_t variables to
    just express this simple logic:

    - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if
    suitable) for the underlying page allocation
    - use just the reclaim flags from the passed in mask plus __GFP_ZERO
    for allocating the page array

    Simplify this down to just use the pre-existing nested_gfp as-is for
    the page array allocation, and just the passed in gfp_mask for the
    page allocation, after conditionally ORing __GFP_HIGHMEM into it. This
    also makes the allocation warning a little more correct.
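
    The resulting logic, sketched (abbreviated from the function; the page
    array allocation details are elided):

    /* One mask for the page-array allocation, one for the pages themselves. */
    const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;

    if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
            gfp_mask |= __GFP_HIGHMEM;      /* the pages may live in highmem */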

    Also initialize two variables at the time of declaration while touching
    this area.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • All users are gone now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Just manually pre-fault the PTEs using apply_to_page_range.

    Co-developed-by: Minchan Kim
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Besides calling the callback on each page, apply_to_page_range also has
    the effect of pre-faulting all PTEs for the range. To support callers
    that only need the pre-faulting, make the callback optional.
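
    A sketch of the resulting walk, assuming the existing pte_fn_t callback
    type (abbreviated):

    /* With a NULL fn, walking the range still allocates the page tables,
     * which is all a pre-faulting caller needs. */
    if (fn) {
            do {
                    err = fn(pte++, addr, data);
                    if (err)
                            break;
            } while (addr += PAGE_SIZE, addr != end);
    }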

    Based on a patch from Minchan Kim.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a proper helper to remap PFNs into kernel virtual space so that
    drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
    for it.
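
    Usage looks roughly like this (the prot value is illustrative):

    /* Map `count` PFNs (e.g. covering device memory) into a contiguous
     * kernel virtual range; tear it down with vunmap(). */
    void *va = vmap_pfn(pfns, count, PAGE_KERNEL);

    if (!va)
            return -ENOMEM;
    /* ... access the mapping ... */
    vunmap(va);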

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a flag so that vmap takes ownership of the passed in page array. When
    vfree is called on such an allocation it will put one reference on each
    page, and free the page array itself.
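
    Usage, sketched (assuming the page array was allocated with kvmalloc(),
    since vfree() will free the array too):

    void *va = vmap(pages, count, VM_MAP | VM_MAP_PUT_PAGES, PAGE_KERNEL);

    if (!va)
            return NULL;
    /* ... on teardown, one call drops the page refs and frees the array: */
    vfree(va);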

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "remove alloc_vm_area", v4.

    This series removes alloc_vm_area, which was left over from the big
    vmalloc interface rework. It is a rather arcane interface, basically the
    equivalent of get_vm_area + actually faulting in all PTEs in the allocated
    area. It was originally added for Xen (which isn't modular to start
    with), and then grew users in zsmalloc and i915, which seem to mostly
    qualify as abuses of the interface, especially for i915, as a random
    driver should not set up PTE bits directly.

    This patch (of 11):

    * Document that you can call vfree() on an address returned from vmap()
    * Remove the note about the minimum size -- the minimum size of a vmalloc
    allocation is one page
    * Add a Context: section
    * Fix capitalisation
    * Reword the prohibition on calling from NMI context to avoid a double
    negative

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Cc: Chris Wilson
    Cc: Matthew Auld
    Cc: Rodrigo Vivi
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Nitin Gupta
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • There is a use case where system management software (SMS) wants to give
    a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case of
    Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon (ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall,
    process_madvise(2). It uses the pidfd of an external process to give the
    hint. It also supports vectored address ranges because an Android app has
    thousands of vmas due to zygote, so it's a total waste of CPU and power to
    call the syscall one by one for each vma. (Testing a 2000-vma syscall vs a
    1-vector syscall showed a 15% performance improvement. I think it would be
    bigger in real practice because the test ran in a very cache-friendly
    environment.)

    Another potential use case for the vector range is to amortize the cost
    of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
    benefit users like TCP receive zerocopy and malloc implementations. In
    the future, we may find more use cases for other advice values, so let's
    make this happen as an API now that we are introducing a new syscall.
    With that, existing madvise(2) users could replace it with
    process_madvise(2) using their own pid if they want the batched address
    range support.

    Since it could affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS), or something else (e.g., being the
    same UID) that grants the right to ptrace the target process, can use it
    successfully. The flag argument is reserved for future use if we need to
    extend the API.

    I think supporting all the hints madvise has or will support in
    process_madvise is rather risky, because we are not sure all hints make
    sense coming from an external process, and the implementation of a hint
    may rely on the caller being in the current context, so it could be
    error-prone. Thus, I just limited the hints to MADV_[COLD|PAGEOUT] in
    this patch.

    If someone wants to add other hints, we can hear the use case and review
    it for each hint. That's safer for maintenance than introducing a syscall
    that turns out buggy but is hard to fix later.

    So finally, the API is as follows,

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                            unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges from external process as well as
    local process. It provides the advice to address ranges of process
    described by iovec and vlen. The goal of such advice is to improve
    system or application performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (See pidfd_open(2) for further information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
            void  *iov_base;        /* starting address */
            size_t iov_len;         /* number of bytes to be advised */
    };

    Each iovec element describes an address range beginning at the address
    (iov_base) and spanning a length in bytes (iov_len).

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which, if the target
    process specified by pidfd is external, is currently one of the
    following:

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to an external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    process_madvise supports every advice madvise(2) has if the target
    process is in the same thread group as the calling process, so users
    could use process_madvise(2) as an extension of madvise(2) with
    vectored address range support.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes if an error occurred. The caller should check the return value
    to determine whether a partial advice occurred.
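
    A userspace usage sketch, calling the syscall directly since libc
    wrappers may not exist yet (__NR_process_madvise and MADV_PAGEOUT need
    recent kernel/libc headers; error handling elided):

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Ask the kernel to page out two ranges of the target process,
     * identified by a pidfd obtained via pidfd_open(2). */
    static long advise_pageout(int pidfd, void *a, size_t alen,
                               void *b, size_t blen)
    {
            struct iovec vec[2] = {
                    { .iov_base = a, .iov_len = alen },
                    { .iov_base = b, .iov_len = blen },
            };

            return syscall(__NR_process_madvise, pidfd, vec, 2,
                           MADV_PAGEOUT, 0);
    }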

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How do we handle the race (i.e., object validation) between an
    external process giving a hint and the target process changing its
    address space?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the target
    process can run between the time the process_madvise caller
    inspects the target process address space and the time that
    process_madvise is actually called, process_madvise may operate on
    memory regions that the calling process does not expect. It's the
    responsibility of the process calling process_madvise to close this
    race condition. For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called. Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process. Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm. The suggested API itself does not provide synchronization. The
    same applies to other APIs like move_pages and process_vm_writev.

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something. Think about mmap. It never
    guarantees newly allocated address space is still valid when the user
    tries to access it, because other threads could unmap the memory right
    before. That's where we need synchronization via another API or a
    design on the user side. It shouldn't be part of the API itself. If
    someone needs more fine-grained synchronization than process level,
    two ideas have been suggested - a cookie[2] and an anon fd[3]. Both are
    applicable via the last reserved argument of the API, but I don't
    think that's necessary right now since we already have ways to prevent
    the race, so I don't want to add additional complexity for a more
    fine-grained optimization model.

    To keep the API extensible, the last argument is reserved, so we could
    support such a scheme in the future if someone really needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such an injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the abovementioned race and hinting a wrong
    VMA. Furthermore, we want to apply the hint in the caller's context, not
    the callee's, because the callee is usually limited in cpuset/cgroups or
    even in a frozen state, so it can't act by itself quickly enough, which
    causes more thrashing/kills. It also doesn't work if the target process
    is being ptraced (e.g., by strace, a debugger, or minidump) because a
    process can have at most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "introduce memory hinting API for external process", v9.

    Now we have MADV_PAGEOUT and MADV_COLD as the madvise hinting API. With
    that, an application can give the kernel hints about which memory ranges
    are preferred to be reclaimed. However, on some platforms (e.g., Android),
    the information required to make the hinting decision is not known to the
    app.
    Instead, it is known to a centralized userspace daemon(e.g.,
    ActivityManagerService), and that daemon must be able to initiate reclaim
    on its own without any app involvement.

    To solve the concern, this patch introduces a new syscall -
    process_madvise(2). Basically, it's the same as the madvise(2) syscall
    but has some differences.

    1. It needs pidfd of target process to provide the hint

    2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
    moment. Other madvise hints will be opened up when there are explicit
    requests from the community, to prevent unexpected bugs we couldn't
    support.

    3. Only privileged processes can do something to another process's
    address space.

    For more detail on the new API, please see the "mm: introduce external
    memory hinting API" description in this patchset.

    This patch (of 3):

    In upcoming patches, do_madvise will be called from external process
    context, so we shouldn't assume "current" is always the hinted process's
    task_struct.

    Furthermore, we must not access mm_struct via task->mm, but obtain it via
    access_mm() once (in the following patch) and only use that pointer [1],
    so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
    safe, so we can use them further down the call stack.

    And let's pass current->mm as the argument of do_madvise, so it shouldn't
    change existing behavior but prepares the next patch and makes review
    easy; the new shape is sketched below.
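
    A sketch of the new shape (signature as introduced by this patch):

    int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
                   int behavior);

    /* The madvise(2) entry point keeps its behavior: */
    return do_madvise(current->mm, start, len_in, behavior);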

    [vbabka@suse.cz: changelog tweak]
    [minchan@kernel.org: use current->mm for io_uring]
    Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
    [akpm@linux-foundation.org: fix it for upstream changes]
    [akpm@linux-foundation.org: whoops]
    [rdunlap@infradead.org: add missing includes]

    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Daniel Colascione
    Cc: Sandeep Patil
    Cc: Sonny Rao
    Cc: Brian Geffon
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: John Dias
    Cc: Joel Fernandes
    Cc: Alexander Duyck
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Kirill Tkhai
    Cc: Oleksandr Natalenko
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • To be safe against concurrent changes to the VMA tree, we must take the
    mmap lock around GUP operations (excluding the GUP-fast family of
    operations, which will take the mmap lock by themselves if necessary).

    This code is only for testing, and it's only reachable by root through
    debugfs, so this doesn't really have any impact; however, if we want to
    add lockdep asserts into the GUP path, we need to have clean locking here.
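
    The locking pattern, sketched (illustrative of the debugfs test fix):

    mmap_read_lock(current->mm);
    nr = get_user_pages(addr, nr_pages, gup_flags, pages, NULL);
    mmap_read_unlock(current->mm);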

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Acked-by: Michel Lespinasse
    Cc: "Eric W . Biederman"
    Cc: Mauro Carvalho Chehab
    Cc: Sakari Ailus
    Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • There are two locations that have a block of code for munmapping a vma
    range. Change those two locations to use a function and add meaningful
    comments about what happens to the arguments, which was unclear in the
    previous code.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There are three places that require the next vma and use the same block
    of code. Replace the block with a function and add comments on what
    happens in the case where NULL is encountered.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There is no need to check whether this process has the right to modify
    the specified process when they are one and the same. We can also skip
    the security hook call when a process is modifying its own pages. Add a
    helper function to handle these cases.

    Suggested-by: Matthew Wilcox
    Signed-off-by: Hongxiang Lou
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christopher Lameter
    Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • To calculate the correct node to migrate the page to for hotplug, we need
    to check the node id of the page. A wrapper for alloc_migration_target()
    exists for this purpose.

    However, Vlastimil informs that all migration source pages come from a
    single node. In this case, we don't need to check the node id for each
    page and we don't need to re-set the target nodemask for each page by
    using the wrapper. Set up the migration_target_control once and use it
    for all pages.
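
    A sketch of the resulting setup, assuming the existing
    migration_target_control / alloc_migration_target machinery (the gfp
    flags follow the hotplug path but are illustrative):

    struct migration_target_control mtc = {
            .nmask = &nmask,
            .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
    };

    /* All source pages are on the same node; pick it once. */
    mtc.nid = page_to_nid(list_first_entry(&source, struct page, lru));
    ret = migrate_pages(&source, alloc_migration_target, NULL,
                        (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);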

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a well-defined standard migration target callback. Use it
    directly.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If a memcg to charge can be determined (using the remote charging API),
    there is no reason to exclude allocations made from an interrupt context
    from the accounting.

    Such allocations will pass even if the resulting memcg size exceeds the
    hard limit, but they will contribute to the memcg's memory pressure, and
    an inability to put the workload under the limit will eventually trigger
    the OOM.

    To use active_memcg() helper, memcg_kmem_bypass() is moved back to
    memcontrol.c.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Remote memcg charging API uses current->active_memcg to store the
    currently active memory cgroup, which overwrites the memory cgroup of the
    current process. It works well for normal contexts, but doesn't work for
    interrupt contexts: indeed, if an interrupt occurs during the execution of
    a section with an active memcg set, all allocations inside the interrupt
    will be charged to the active memcg set (given that we'll enable
    accounting for allocations from an interrupt context). But because the
    interrupt might have no relation to the active memcg set outside, it's
    obviously wrong from the accounting perspective.

    To resolve this problem, let's add a global percpu int_active_memcg
    variable, which will be used to store an active memory cgroup which will
    be used from interrupt contexts. set_active_memcg() will transparently
    use current->active_memcg or int_active_memcg depending on the context.

    To make the read part simple and transparent for the caller, let's
    introduce two new functions:
    - struct mem_cgroup *active_memcg(void),
    - struct mem_cgroup *get_active_memcg(void).

    They return the active memcg if one is set, hiding the implementation
    detail of where to get it in the current context.
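
    A sketch of the two helpers (close to the shape described above; the
    reference-counting details are abbreviated):

    static inline struct mem_cgroup *active_memcg(void)
    {
            if (in_interrupt())
                    return this_cpu_read(int_active_memcg);
            return current->active_memcg;
    }

    static inline struct mem_cgroup *get_active_memcg(void)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
            memcg = active_memcg();
            if (memcg && !css_tryget(&memcg->css))  /* take a reference */
                    memcg = NULL;
            rcu_read_unlock();

            return memcg;
    }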

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • There are checks for current->mm and current->active_memcg in
    get_obj_cgroup_from_current(), but these checks are redundant:
    memcg_kmem_bypass(), called just above, performs the same checks.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted, mostly because charging
    the memory cgroup of the current process wasn't an option. Performance
    was likely a reason too.

    The remote charging API allows temporarily overwriting the currently
    active memory cgroup, so that all memory allocations are accounted
    towards some specified memory cgroup instead of the memory cgroup of the
    current process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user of it: a typical example is a bpf program
    parsing an incoming network packet, which allocates an entry in a hashmap
    to store some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value, which overwrites the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context;
    however, an attempt to call it from an interrupt context, or simply
    nesting two remote charging blocks, will lead to incorrect accounting. On
    exit from the inner block, the active memcg is cleared instead of being
    restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);
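
    A sketch of the helper in its initial form, which only swaps
    current->active_memcg (interrupt-context support is handled by the
    series above):

    static inline struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *old = current->active_memcg;

            current->active_memcg = memcg;
            return old;
    }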

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

18 Oct, 2020

2 commits

  • For the case where read-ahead is disabled on the file, or if the cgroup
    is congested, ensure that we can at least do 1 page of read-ahead to
    make progress on the read in an async fashion. This could potentially be
    larger, but it's not needed in terms of functionality, so let's err on
    the side of caution, as larger counts of pages may run into reclaim
    issues (particularly if we're congested).

    This makes sure we're not hitting the potentially sync ->readpage() path
    for IO that is marked IOCB_WAITQ, which could cause us to block. It also
    means we'll use the same path for IO, regardless of whether or not
    read-ahead happens to be disabled on the lower level device.

    Acked-by: Johannes Weiner
    Reported-by: Matthew Wilcox (Oracle)
    Reported-by: Hao_Xu
    [axboe: updated for new ractl API]
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
    we should no longer attempt to async lock a new page. Instead, make sure
    we return the copied amount and let the caller retry, rather than
    returning -EIOCBQUEUED for a new page.

    This should only be possible with read-ahead disabled on the underlying
    device, and multiple threads racing on the same file. I haven't been able
    to reproduce it on anything else.

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Reported-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

6 commits

  • Pull documentation updates from Mauro Carvalho Chehab:
    "A series of patches addressing warnings produced by make htmldocs.
    This includes:

    - kernel-doc markup fixes

    - ReST fixes

    - Updates at the build system in order to support newer versions of
    the docs build toolchain (Sphinx)

    After this series, the number of html build warnings should reduce
    significantly, and building with Sphinx 3.1 or later should now be
    supported (although it is still recommended to use Sphinx 2.4.4).

    As agreed with Jon, I should be sending you a late pull request by the
    end of the merge window addressing remaining issues with docs build,
    as there are a number of warning fixes that depends on pull requests
    that should be happening along the merge window.

    The end goal is to have a clean htmldocs build on Kernel 5.10.

    PS. It should be noticed that Sphinx 3.0 is not currently supported,
    as it lacks support for C domain namespaces. Such feature, needed in
    order to document uAPI system calls with Sphinx 3.x, was added only on
    Sphinx 3.1"

    * tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (75 commits)
    PM / devfreq: remove a duplicated kernel-doc markup
    mm/doc: fix a literal block markup
    workqueue: fix a kernel-doc warning
    docs: virt: user_mode_linux_howto_v2.rst: fix a literal block markup
    Input: sparse-keymap: add a description for @sw
    rcu/tree: docs: document bkvcache new members at struct kfree_rcu_cpu
    nl80211: docs: add a description for s1g_cap parameter
    usb: docs: document altmode register/unregister functions
    kunit: test.h: fix a bad kernel-doc markup
    drivers: core: fix kernel-doc markup for dev_err_probe()
    docs: bio: fix a kerneldoc markup
    kunit: test.h: solve kernel-doc warnings
    block: bio: fix a warning at the kernel-doc markups
    docs: powerpc: syscall64-abi.rst: fix a malformed table
    drivers: net: hamradio: fix document location
    net: appletalk: Kconfig: Fix docs location
    dt-bindings: fix references to files converted to yaml
    memblock: get rid of a :c:type leftover
    math64.h: kernel-docs: Convert some markups into normal comments
    media: uAPI: buffer.rst: remove a left-over documentation
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:
    "155 patches.

    Subsystems affected by this patch series: mm (dax, debug, thp,
    readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
    core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
    binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
    romfs, and fault-injection"

    * emailed patches from Andrew Morton : (155 commits)
    lib, uaccess: add failure injection to usercopy functions
    lib, include/linux: add usercopy failure capability
    ROMFS: support inode blocks calculation
    ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
    sched.h: drop in_ubsan field when UBSAN is in trap mode
    scripts/gdb/tasks: add headers and improve spacing format
    scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
    kernel/relay.c: drop unneeded initialization
    panic: dump registers on panic_on_warn
    rapidio: fix the missed put_device() for rio_mport_add_riodev
    rapidio: fix error handling path
    nilfs2: fix some kernel-doc warnings for nilfs2
    autofs: harden ioctl table
    ramfs: fix nommu mmap with gaps in the page cache
    mm: remove the now-unnecessary mmget_still_valid() hack
    mm/gup: take mmap_lock in get_dump_page()
    binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
    coredump: rework elf/elf_fdpic vma_dump_size() into common helper
    coredump: refactor page range dumping into common helper
    coredump: let dump_emit() bail out on short writes
    ...

    Linus Torvalds
     
  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Properly take the mmap_lock before calling into the GUP code from
    get_dump_page(); and play nice, allowing the GUP code to drop the
    mmap_lock if it has to sleep.

    As Linus pointed out, we don't actually need the VMA because
    __get_user_pages() will flush the dcache for us if necessary.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

    At the moment, we have that rather ugly mmget_still_valid() helper to
    work around the following issue: ELF core dumping doesn't take the
    mmap_sem while traversing the task's VMAs, and if anything (like
    userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
    at the moment we use mmget_still_valid() to bail out in any writers that
    might be operating on a remote mm's VMAs.

    With this series, I'm trying to get rid of the need for that as cleanly as
    possible. ("cleanly" meaning "avoid holding the mmap_lock across
    unbounded sleeps".)

    Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
    dumping code.

    Patches 5 and 6 implement the main change: Instead of repeatedly accessing
    the VMA list with sleeps in between, we snapshot it at the start with
    proper locking, and then later we just use our copy of the VMA list. This
    ensures that the kernel won't crash, that VMA metadata in the coredump is
    consistent even in the presence of concurrent modifications, and that any
    virtual addresses that aren't being concurrently modified have their
    contents show up in the core dump properly.

    The disadvantage of this approach is that we need a bit more memory during
    core dumping for storing metadata about all VMAs.

    At the end of the series, patch 7 removes the old workaround for this
    issue (mmget_still_valid()).

    I have tested:

    - Creating a simple core dump on X86-64 still works.
    - The created coredump on X86-64 opens in GDB and looks plausible.
    - X86-64 core dumps contain the first page for executable mappings at
    offset 0, and don't contain the first page for non-executable file
    mappings or executable mappings at offset !=0.
    - NOMMU 32-bit ARM can still generate plausible-looking core dumps
    through the FDPIC implementation. (I can't test this with GDB because
    GDB is missing some structure definition for nommu ARM, but I've
    poked around in the hexdump and it looked decent.)

    This patch (of 7):

    dump_emit() is for kernel pointers, and VMAs describe userspace memory.
    Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
    even if it probably doesn't matter much on !MMU systems - especially given
    that it looks like we can just use the same get_dump_page() as on MMU if
    we move it out of the CONFIG_MMU block.

    One small change we have to make in get_dump_page() is to use
    __get_user_pages_locked() instead of __get_user_pages(), since the latter
    doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
    just call __get_user_pages() for us.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
    Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • The current page_order() can only be called on pages in the buddy
    allocator. For compound pages, you have to use compound_order(). This is
    confusing and led to a bug, so rename page_order() to buddy_order().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)