03 Nov, 2020

6 commits

  • Fix the following sparse warning:

    mm/truncate.c:531:15: warning: symbol '__invalidate_mapping_pages' was not declared. Should it be static?
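
    Since the symbol has no users outside mm/truncate.c, the fix is simply to
    give it internal linkage. In diff form, a minimal sketch of the pattern
    (the parameter list is illustrative, not necessarily the exact upstream
    signature):

    -unsigned long __invalidate_mapping_pages(struct address_space *mapping,
    +static unsigned long __invalidate_mapping_pages(struct address_space *mapping,
                    pgoff_t start, pgoff_t end, unsigned long *nr_pagevec)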

    Fixes: eb1d7a65f08a ("mm, fadvise: improve the expensive remote LRU cache draining after FADV_DONTNEED")
    Signed-off-by: Jason Yan
    Signed-off-by: Andrew Morton
    Reviewed-by: Yafang Shao
    Link: https://lkml.kernel.org/r/20201015054808.2445904-1-yanaijie@huawei.com
    Signed-off-by: Linus Torvalds

    Jason Yan
     
  • When the flags passed to queue_pages_pte_range don't have the MPOL_MF_MOVE
    or MPOL_MF_MOVE_ALL bits set, the code breaks out of the loop, and passing
    the original pte - 1 to pte_unmap_unlock is not a good idea.

    queue_pages_pte_range can run in MPOL_MF_STRICT mode, which doesn't
    migrate misplaced pages but returns with EIO when encountering such a
    page. Since commit a7f40cfe3b7a ("mm: mempolicy: make mbind() return
    -EIO when MPOL_MF_STRICT is specified"), an early break on the first pte
    in the range results in pte_unmap_unlock on an underflow pte. This can
    lead to lockups later on when somebody tries to lock the pte or the
    page_table_lock again.
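
    The fix, sketched as a diff (variable naming illustrative of the upstream
    change): remember the pte originally returned by pte_offset_map_lock() and
    unlock with that, instead of the loop cursor:

    -       pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
    +       mapped_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
            for (; addr != end; pte++, addr += PAGE_SIZE) {
                    /* may break out on the very first pte */
            }
    -       pte_unmap_unlock(pte - 1, ptl);
    +       pte_unmap_unlock(mapped_pte, ptl);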

    Fixes: a7f40cfe3b7a ("mm: mempolicy: make mbind() return -EIO when MPOL_MF_STRICT is specified")
    Signed-off-by: Shijie Luo
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Reviewed-by: Oscar Salvador
    Acked-by: Michal Hocko
    Cc: Miaohe Lin
    Cc: Feilong Lin
    Cc: Shijie Luo
    Link: https://lkml.kernel.org/r/20201019074853.50856-1-luoshijie1@huawei.com
    Signed-off-by: Linus Torvalds

    Shijie Luo
     
  • Richard reported a warning which can be reproduced by running the LTP
    madvise6 test (cgroup v1 in the non-hierarchical mode should be used):

    WARNING: CPU: 0 PID: 12 at mm/page_counter.c:57 page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Modules linked in:
    CPU: 0 PID: 12 Comm: kworker/0:1 Not tainted 5.9.0-rc7-22-default #77
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.13.0-48-gd9c812d-rebuilt.opensuse.org 04/01/2014
    Workqueue: events drain_local_stock
    RIP: 0010:page_counter_uncharge (mm/page_counter.c:57 mm/page_counter.c:50 mm/page_counter.c:156)
    Call Trace:
    __memcg_kmem_uncharge (mm/memcontrol.c:3022)
    drain_obj_stock (./include/linux/rcupdate.h:689 mm/memcontrol.c:3114)
    drain_local_stock (mm/memcontrol.c:2255)
    process_one_work (./arch/x86/include/asm/jump_label.h:25 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:108 kernel/workqueue.c:2274)
    worker_thread (./include/linux/list.h:282 kernel/workqueue.c:2416)
    kthread (kernel/kthread.c:292)
    ret_from_fork (arch/x86/entry/entry_64.S:300)

    The problem occurs because in the non-hierarchical mode non-root page
    counters are not linked to root page counters, so the charge is not
    propagated to the root memory cgroup.

    After the removal of the original memory cgroup and reparenting of the
    object cgroup, the root cgroup might be uncharged by draining an objcg
    stock, for example. This leads to an eventual underflow of the charge and
    triggers a warning.

    Fix it by linking all page counters to corresponding root page counters
    in the non-hierarchical mode.
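
    A sketch of the idea, assuming the upstream page_counter_init() API (the
    set of counters is abbreviated here):

    if (parent->use_hierarchy) {
            page_counter_init(&memcg->memory, &parent->memory);
            page_counter_init(&memcg->swap, &parent->swap);
    } else {
            /* Link to root so uncharges after reparenting balance out. */
            page_counter_init(&memcg->memory, &root_mem_cgroup->memory);
            page_counter_init(&memcg->swap, &root_mem_cgroup->swap);
    }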

    Please note, that in the non-hierarchical mode all objcgs are always
    reparented to the root memory cgroup, even if the hierarchy has more
    than 1 level. This patch doesn't change it.

    The patch also doesn't affect how the hierarchical mode is working,
    which is the only sane and truly supported mode now.

    Thanks to Richard for reporting, debugging and providing an alternative
    version of the fix!

    Fixes: bf4f059954dc ("mm: memcg/slab: obj_cgroup API")
    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Reviewed-by: Michal Koutný
    Acked-by: Johannes Weiner
    Cc: Michal Hocko
    Link: https://lkml.kernel.org/r/20201026231326.3212225-1-guro@fb.com
    Debugged-by: Richard Palethorpe
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • memcg_page_state will get the specified number for a hierarchical memcg.
    If the item is NR_ANON_THPS, that number counts huge pages, so it should
    be multiplied by HPAGE_PMD_NR rather than treated as a count of base
    pages.

    [akpm@linux-foundation.org: fix printk warning]
    [akpm@linux-foundation.org: use u64 cast, per Michal]
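
    A sketch of the corrected stat formatting (helper and array names are
    illustrative of the memcg stats code, not an exact excerpt):

    unsigned long nr = memcg_page_state(memcg, memcg1_stats[i]);

    /* NR_ANON_THPS counts huge pages, not base pages. */
    if (memcg1_stats[i] == NR_ANON_THPS)
            nr *= HPAGE_PMD_NR;
    seq_printf(m, "%s %llu\n", memcg1_stat_names[i], (u64)nr * PAGE_SIZE);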

    Fixes: 468c398233da ("mm: memcontrol: switch to native NR_ANON_THPS counter")
    Signed-off-by: zhongjiang-ali
    Signed-off-by: Andrew Morton
    Acked-by: Johannes Weiner
    Acked-by: Michal Hocko
    Link: https://lkml.kernel.org/r/1603722395-72443-1-git-send-email-zhongjiang-ali@linux.alibaba.com
    Signed-off-by: Linus Torvalds

    zhongjiang-ali
     
  • Michal Privoznik was using "free page reporting" in QEMU/virtio-balloon
    with hugetlbfs and hit the warning below. QEMU with free page hinting
    uses fallocate(FALLOC_FL_PUNCH_HOLE) to discard pages that are reported
    as free by a VM. Reporting is done at pageblock granularity, so when the
    guest reports a free 2M chunk, we fallocate(FALLOC_FL_PUNCH_HOLE) one huge
    page in QEMU.

    WARNING: CPU: 7 PID: 6636 at mm/page_counter.c:57 page_counter_uncharge+0x4b/0x50
    Modules linked in: ...
    CPU: 7 PID: 6636 Comm: qemu-system-x86 Not tainted 5.9.0 #137
    Hardware name: Gigabyte Technology Co., Ltd. X570 AORUS PRO/X570 AORUS PRO, BIOS F21 07/31/2020
    RIP: 0010:page_counter_uncharge+0x4b/0x50
    ...
    Call Trace:
    hugetlb_cgroup_uncharge_file_region+0x4b/0x80
    region_del+0x1d3/0x300
    hugetlb_unreserve_pages+0x39/0xb0
    remove_inode_hugepages+0x1a8/0x3d0
    hugetlbfs_fallocate+0x3c4/0x5c0
    vfs_fallocate+0x146/0x290
    __x64_sys_fallocate+0x3e/0x70
    do_syscall_64+0x33/0x40
    entry_SYSCALL_64_after_hwframe+0x44/0xa9

    Investigation of the issue uncovered bugs in hugetlb cgroup reservation
    accounting. This patch addresses the found issues.
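
    For reference, the triggering operation from userspace looks roughly like
    this (size illustrative; QEMU punches one huge page per reported chunk):

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <linux/falloc.h>

    /* Punch out one 2M huge page of a hugetlbfs-backed file. */
    static int punch_huge_page(int fd, off_t offset)
    {
            return fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                             offset, 2 * 1024 * 1024);
    }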

    Fixes: 075a61d07a8e ("hugetlb_cgroup: add accounting for shared mappings")
    Reported-by: Michal Privoznik
    Co-developed-by: David Hildenbrand
    Signed-off-by: David Hildenbrand
    Signed-off-by: Mike Kravetz
    Signed-off-by: Andrew Morton
    Tested-by: Michal Privoznik
    Reviewed-by: Mina Almasry
    Acked-by: Michael S. Tsirkin
    Cc: David Hildenbrand
    Cc: Michal Hocko
    Cc: Muchun Song
    Cc: "Aneesh Kumar K . V"
    Cc: Tejun Heo
    Link: https://lkml.kernel.org/r/20201021204426.36069-1-mike.kravetz@oracle.com
    Signed-off-by: Linus Torvalds

    Mike Kravetz
     
  • commit 6f42193fd86e ("memremap: don't use a separate devm action for
    devmap_managed_enable_get") changed the static key updates such that we
    now call devmap_managed_enable_put() without doing the equivalent
    devmap_managed_enable_get().

    devmap_managed_enable_get() is only called for MEMORY_DEVICE_PRIVATE and
    MEMORY_DEVICE_FS_DAX, but memunmap_pages() gets called for other pgmap
    types too. This results in the below warning when switching between
    system-ram and devdax mode for devdax namespace.

    jump label: negative count!
    WARNING: CPU: 52 PID: 1335 at kernel/jump_label.c:235 static_key_slow_try_dec+0x88/0xa0
    Modules linked in:
    ....

    NIP static_key_slow_try_dec+0x88/0xa0
    LR static_key_slow_try_dec+0x84/0xa0
    Call Trace:
    static_key_slow_try_dec+0x84/0xa0
    __static_key_slow_dec_cpuslocked+0x34/0xd0
    static_key_slow_dec+0x54/0xf0
    memunmap_pages+0x36c/0x500
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    bus_remove_device+0x124/0x210
    device_del+0x1d4/0x530
    unregister_dev_dax+0x48/0xe0
    devm_action_release+0x30/0x50
    release_nodes+0x2f4/0x3e0
    device_release_driver_internal+0x17c/0x280
    unbind_store+0x130/0x170
    drv_attr_store+0x40/0x60
    sysfs_kf_write+0x6c/0xb0
    kernfs_fop_write+0x118/0x280
    vfs_write+0xe8/0x2a0
    ksys_write+0x84/0x140
    system_call_exception+0x120/0x270
    system_call_common+0xf0/0x27c
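
    The fix, sketched: make the put side check the same pgmap types as the get
    side so the static key stays balanced (body abbreviated):

    static void devmap_managed_enable_put(struct dev_pagemap *pgmap)
    {
            /* Only drop the key for types that actually took a reference. */
            if (pgmap->type == MEMORY_DEVICE_PRIVATE ||
                pgmap->type == MEMORY_DEVICE_FS_DAX)
                    static_branch_dec(&devmap_managed_key);
    }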

    Reported-by: Aneesh Kumar K.V
    Signed-off-by: Ralph Campbell
    Signed-off-by: Andrew Morton
    Tested-by: Sachin Sant
    Reviewed-by: Aneesh Kumar K.V
    Reviewed-by: Ira Weiny
    Reviewed-by: Christoph Hellwig
    Cc: Dan Williams
    Cc: Jason Gunthorpe
    Link: https://lkml.kernel.org/r/20201023183222.13186-1-rcampbell@nvidia.com
    Signed-off-by: Linus Torvalds

    Ralph Campbell
     

28 Oct, 2020

2 commits

  • With e.g. m68k/defconfig:

    mm/process_vm_access.c: In function ‘process_vm_rw’:
    mm/process_vm_access.c:277:5: error: implicit declaration of function ‘in_compat_syscall’ [-Werror=implicit-function-declaration]
    277 | in_compat_syscall());
    | ^~~~~~~~~~~~~~~~~

    Fix this by adding #include <linux/compat.h>.

    Reported-by: noreply@ellerman.id.au
    Reported-by: damian
    Reported-by: Naresh Kamboju
    Fixes: 38dc5079da7081e8 ("Fix compat regression in process_vm_rw()")
    Signed-off-by: Geert Uytterhoeven
    Acked-by: Jens Axboe
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The removal of compat_process_vm_{readv,writev} didn't change
    process_vm_rw(), which always assumes it's not doing a compat syscall.

    Instead of passing in 'false' unconditionally for 'compat', make it
    conditional on in_compat_syscall().
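
    The essence of the change, sketched (the parameter list here is
    abbreviated and illustrative, not the exact call site):

    /* Was hardcoded to false after the compat entry points were removed. */
    return process_vm_rw(pid, lvec, liovcnt, rvec, riovcnt, flags, vm_write,
                         in_compat_syscall());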

    [ Both Al and Christoph point out that trying to access a 64-bit process
    from a 32-bit one cannot work anyway, and is likely better prohibited,
    but that's a separate issue - Linus ]

    Fixes: c3973b401ef2 ("mm: remove compat_process_vm_{readv,writev}")
    Reported-and-tested-by: Kyle Huey
    Signed-off-by: Jens Axboe
    Acked-by: Al Viro
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Linus Torvalds

    Jens Axboe
     

24 Oct, 2020

1 commit

  • Pull clone/dedupe/remap code refactoring from Darrick Wong:
    "Move the generic file range remap (aka reflink and dedupe) functions
    out of mm/filemap.c and fs/read_write.c and into fs/remap_range.c to
    reduce clutter in the first two files"

    * tag 'vfs-5.10-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
    vfs: move the generic write and copy checks out of mm
    vfs: move the remap range helpers to remap_range.c
    vfs: move generic_remap_checks out of mm

    Linus Torvalds
     

21 Oct, 2020

2 commits

  • Pull XArray updates from Matthew Wilcox:

    - Fix the test suite after introduction of the local_lock

    - Fix a bug in the IDA spotted by Coverity

    - Change the API that allows the workingset code to delete a node

    - Fix xas_reload() when dealing with entries that occupy multiple
    indices

    - Add a few more tests to the test suite

    - Fix an unsigned int being shifted into an unsigned long

    * tag 'xarray-5.9' of git://git.infradead.org/users/willy/xarray:
    XArray: Fix xas_create_range for ranges above 4 billion
    radix-tree: fix the comment of radix_tree_next_slot()
    XArray: Fix xas_reload for multi-index entries
    XArray: Add private interface for workingset node deletion
    XArray: Fix xas_for_each_conflict documentation
    XArray: Test marked multiorder iterations
    XArray: Test two more things about xa_cmpxchg
    ida: Free allocated bitmap in error path
    radix tree test suite: Fix compilation

    Linus Torvalds
     
  • Pull io_uring updates from Jens Axboe:
    "A mix of fixes and a few stragglers. In detail:

    - Revert the bogus __read_mostly that we discussed for the initial
    pull request.

    - Fix a merge window regression with fixed file registration error
    path handling.

    - Fix io-wq numa node affinities.

    - Series abstracting out an io_identity struct, making it both easier
    to see what the personality items are, and also easier to adopt
    more. Use this to cover audit logging.

    - Fix for read-ahead disabled block condition in async buffered
    reads, and using single-page read-ahead to unify which
    generic_file_buffered_read() path is used.

    - Series for REQ_F_COMP_LOCKED fix and removal of it (Pavel)

    - Poll fix (Pavel)"

    * tag 'io_uring-5.10-2020-10-20' of git://git.kernel.dk/linux-block: (21 commits)
    io_uring: use blk_queue_nowait() to check if NOWAIT supported
    mm: use limited read-ahead to satisfy read
    mm: mark async iocb read as NOWAIT once some data has been copied
    io_uring: fix double poll mask init
    io-wq: inherit audit loginuid and sessionid
    io_uring: use percpu counters to track inflight requests
    io_uring: assign new io_identity for task if members have changed
    io_uring: store io_identity in io_uring_task
    io_uring: COW io_identity on mismatch
    io_uring: move io identity items into separate struct
    io_uring: rely solely on work flags to determine personality.
    io_uring: pass required context in as flags
    io-wq: assign NUMA node locality if appropriate
    io_uring: fix error path cleanup in io_sqe_files_register()
    Revert "io_uring: mark io_uring_fops/io_op_defs as __read_mostly"
    io_uring: fix REQ_F_COMP_LOCKED by killing it
    io_uring: dig out COMP_LOCK from deep call chain
    io_uring: don't put a poll req under spinlock
    io_uring: don't unnecessarily clear F_LINK_TIMEOUT
    io_uring: don't set COMP_LOCKED if won't put
    ...

    Linus Torvalds
     

19 Oct, 2020

21 commits

  • No point in having the filename inside the file.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "two small vmalloc cleanups".

    This patch (of 2):

    __vmalloc_area_node currently has four different gfp_t variables to
    just express this simple logic:

    - use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if
    suitable) for the underlying page allocation
    - use just the reclaim flags from the passed in mask plus __GFP_ZERO
    for allocating the page array

    Simplify this down to just use the pre-existing nested_gfp as-is for
    the page array allocation, and just the passed in gfp_mask for the
    page allocation, after conditionally ORing __GFP_HIGHMEM into it. This
    also makes the allocation warning a little more correct.
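
    The resulting logic, sketched (abbreviated from the function; the page
    array allocation details are elided):

    /* One mask for the page-array allocation, one for the pages themselves. */
    const gfp_t nested_gfp = (gfp_mask & GFP_RECLAIM_MASK) | __GFP_ZERO;

    if (!(gfp_mask & (GFP_DMA | GFP_DMA32)))
            gfp_mask |= __GFP_HIGHMEM;      /* the pages may live in highmem */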

    Also initialize two variables at the time of declaration while touching
    this area.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • All users are gone now.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Just manually pre-fault the PTEs using apply_to_page_range.

    Co-developed-by: Minchan Kim
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Besides calling the callback on each page, apply_to_page_range also has
    the effect of pre-faulting all PTEs for the range. To support callers
    that only need the pre-faulting, make the callback optional.
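
    A sketch of the resulting walk, assuming the existing pte_fn_t callback
    type (abbreviated):

    /* With a NULL fn, walking the range still allocates the page tables,
     * which is all a pre-faulting caller needs. */
    if (fn) {
            do {
                    err = fn(pte++, addr, data);
                    if (err)
                            break;
            } while (addr += PAGE_SIZE, addr != end);
    }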

    Based on a patch from Minchan Kim.

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a proper helper to remap PFNs into kernel virtual space so that
    drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
    for it.
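
    Usage looks roughly like this (the prot value is illustrative):

    /* Map `count` PFNs (e.g. covering device memory) into a contiguous
     * kernel virtual range; tear it down with vunmap(). */
    void *va = vmap_pfn(pfns, count, PAGE_KERNEL);

    if (!va)
            return -ENOMEM;
    /* ... access the mapping ... */
    vunmap(va);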

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Add a flag so that vmap takes ownership of the passed in page array. When
    vfree is called on such an allocation it will put one reference on each
    page, and free the page array itself.
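
    Usage, sketched (assuming the page array was allocated with kvmalloc(),
    since vfree() will free the array too):

    void *va = vmap(pages, count, VM_MAP | VM_MAP_PUT_PAGES, PAGE_KERNEL);

    if (!va)
            return NULL;
    /* ... on teardown, one call drops the page refs and frees the array: */
    vfree(va);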

    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Chris Wilson
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Juergen Gross
    Cc: Matthew Auld
    Cc: "Matthew Wilcox (Oracle)"
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Peter Zijlstra
    Cc: Rodrigo Vivi
    Cc: Stefano Stabellini
    Cc: Tvrtko Ursulin
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     
  • Patch series "remove alloc_vm_area", v4.

    This series removes alloc_vm_area, which was left over from the big
    vmalloc interface rework. It is a rather arcane interface, basically the
    equivalent of get_vm_area + actually faulting in all PTEs in the allocated
    area. It was originally added for Xen (which isn't modular to start
    with), and then grew users in zsmalloc and i915, which seem to mostly
    qualify as abuses of the interface, especially for i915, as a random
    driver should not set up PTE bits directly.

    This patch (of 11):

    * Document that you can call vfree() on an address returned from vmap()
    * Remove the note about the minimum size -- the minimum size of a vmalloc
    allocation is one page
    * Add a Context: section
    * Fix capitalisation
    * Reword the prohibition on calling from NMI context to avoid a double
    negative

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: Stefano Stabellini
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Tvrtko Ursulin
    Cc: Chris Wilson
    Cc: Matthew Auld
    Cc: Rodrigo Vivi
    Cc: Minchan Kim
    Cc: Matthew Wilcox
    Cc: Nitin Gupta
    Cc: Uladzislau Rezki (Sony)
    Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)
     
  • There is a use case where system management software (SMS) wants to give
    a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the case of
    Android, it is the ActivityManagerService.

    The information required to make the reclaim decision is not known to the
    app. Instead, it is known to the centralized userspace
    daemon (ActivityManagerService), and that daemon must be able to initiate
    reclaim on its own without any app involvement.

    To solve the issue, this patch introduces a new syscall,
    process_madvise(2). It uses the pidfd of an external process to give the
    hint. It also supports vectored address ranges because an Android app has
    thousands of vmas due to zygote, so it's a total waste of CPU and power to
    call the syscall one by one for each vma. (Testing a 2000-vma syscall vs a
    1-vector syscall showed a 15% performance improvement. I think it would be
    bigger in real practice because the test ran in a very cache-friendly
    environment.)

    Another potential use case for the vector range is to amortize the cost
    of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
    benefit users like TCP receive zerocopy and malloc implementations. In
    the future, we may find more use cases for other advice values, so let's
    make this happen as an API now that we are introducing a new syscall.
    With that, existing madvise(2) users could replace it with
    process_madvise(2) using their own pid if they want the batched address
    range support.

    Since it could affect another process's address range, only a privileged
    process (PTRACE_MODE_ATTACH_FSCREDS), or something else (e.g., being the
    same UID) that grants the right to ptrace the target process, can use it
    successfully. The flag argument is reserved for future use if we need to
    extend the API.

    I think supporting all the hints madvise has or will support in
    process_madvise is rather risky, because we are not sure all hints make
    sense coming from an external process, and the implementation of a hint
    may rely on the caller being in the current context, so it could be
    error-prone. Thus, I just limited the hints to MADV_[COLD|PAGEOUT] in
    this patch.

    If someone wants to add other hints, we can hear the use case and review
    it for each hint. That's safer for maintenance than introducing a syscall
    that turns out buggy but is hard to fix later.

    So finally, the API is as follows,

    ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                            unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
    The process_madvise() system call is used to give advice or directions
    to the kernel about the address ranges from external process as well as
    local process. It provides the advice to address ranges of process
    described by iovec and vlen. The goal of such advice is to improve
    system or application performance.

    The pidfd selects the process referred to by the PID file descriptor
    specified in pidfd. (See pidfd_open(2) for further information.)

    The pointer iovec points to an array of iovec structures, defined in
    <sys/uio.h> as:

    struct iovec {
            void  *iov_base;        /* starting address */
            size_t iov_len;         /* number of bytes to be advised */
    };

    Each iovec element describes an address range beginning at the address
    (iov_base) and spanning a length in bytes (iov_len).

    The vlen represents the number of elements in iovec.

    The advice is indicated in the advice argument, which, if the target
    process specified by pidfd is external, is currently one of the
    following:

    MADV_COLD
    MADV_PAGEOUT

    Permission to provide a hint to an external process is governed by a
    ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

    process_madvise supports every advice madvise(2) has if the target
    process is in the same thread group as the calling process, so users
    could use process_madvise(2) as an extension of madvise(2) with
    vectored address range support.

    RETURN VALUE
    On success, process_madvise() returns the number of bytes advised.
    This return value may be less than the total number of requested
    bytes if an error occurred. The caller should check the return value
    to determine whether a partial advice occurred.
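
    A userspace usage sketch, calling the syscall directly since libc
    wrappers may not exist yet (__NR_process_madvise and MADV_PAGEOUT need
    recent kernel/libc headers; error handling elided):

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <sys/uio.h>
    #include <unistd.h>

    /* Ask the kernel to page out two ranges of the target process,
     * identified by a pidfd obtained via pidfd_open(2). */
    static long advise_pageout(int pidfd, void *a, size_t alen,
                               void *b, size_t blen)
    {
            struct iovec vec[2] = {
                    { .iov_base = a, .iov_len = alen },
                    { .iov_base = b, .iov_len = blen },
            };

            return syscall(__NR_process_madvise, pidfd, vec, 2,
                           MADV_PAGEOUT, 0);
    }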

    FAQ:

    Q.1 - Why does any external entity have better knowledge?

    Quote from Sandeep

    "For Android, every application (including the special SystemServer)
    are forked from Zygote. The reason of course is to share as many
    libraries and classes between the two as possible to benefit from the
    preloading during boot.

    After applications start, (almost) all of the APIs end up calling into
    this SystemServer process over IPC (binder) and back to the
    application.

    In a fully running system, the SystemServer monitors every single
    process periodically to calculate their PSS / RSS and also decides
    which process is "important" to the user for interactivity.

    So, because of how these processes start _and_ the fact that the
    SystemServer is looping to monitor each process, it does tend to *know*
    which address range of the application is not used / useful.

    Besides, we can never rely on applications to clean things up
    themselves. We've had the "hey app1, the system is low on memory,
    please trim your memory usage down" notifications for a long time[1].
    They rely on applications honoring the broadcasts and very few do.

    So, if we want to avoid the inevitable killing of the application and
    restarting it, some way to be able to tell the OS about unimportant
    memory in these applications will be useful.

    - ssp

    Q.2 - How do we handle the race (i.e., object validation) between an
    external process giving a hint and the target process changing its
    address space?

    process_madvise operates on the target process's address space as it
    exists at the instant that process_madvise is called. If the target
    process can run between the time the process_madvise caller
    inspects the target process address space and the time that
    process_madvise is actually called, process_madvise may operate on
    memory regions that the calling process does not expect. It's the
    responsibility of the process calling process_madvise to close this
    race condition. For example, the calling process can suspend the
    target process with ptrace, SIGSTOP, or the freezer cgroup so that it
    doesn't have an opportunity to change its own address space before
    process_madvise is called. Another option is to operate on memory
    regions that the caller knows a priori will be unchanged in the target
    process. Yet another option is to accept the race for certain
    process_madvise calls after reasoning that mistargeting will do no
    harm. The suggested API itself does not provide synchronization. The
    same applies to other APIs like move_pages and process_vm_writev.

    The race isn't really a problem though. Why is it so wrong to require
    that callers do their own synchronization in some manner? Nobody
    objects to write(2) merely because it's possible for two processes to
    open the same file and clobber each other's writes --- instead, we tell
    people to use flock or something. Think about mmap. It never
    guarantees newly allocated address space is still valid when the user
    tries to access it, because other threads could unmap the memory right
    before. That's where we need synchronization via another API or a
    design on the user side. It shouldn't be part of the API itself. If
    someone needs more fine-grained synchronization than process level,
    two ideas have been suggested - a cookie[2] and an anon fd[3]. Both are
    applicable via the last reserved argument of the API, but I don't
    think that's necessary right now since we already have ways to prevent
    the race, so I don't want to add additional complexity for a more
    fine-grained optimization model.

    To keep the API extensible, the last argument is reserved, so we could
    support such a scheme in the future if someone really needs it.

    Q.3 - Why doesn't ptrace work?

    Injecting an madvise in the target process using ptrace would not work
    for us because such an injected madvise would have to be executed by the
    target process, which means that process would have to be runnable and
    that creates the risk of the abovementioned race and hinting a wrong
    VMA. Furthermore, we want to apply the hint in the caller's context, not
    the callee's, because the callee is usually limited in cpuset/cgroups or
    even in a frozen state, so it can't act by itself quickly enough, which
    causes more thrashing/kills. It also doesn't work if the target process
    is being ptraced (e.g., by strace, a debugger, or minidump) because a
    process can have at most one ptracer.

    [1] https://developer.android.com/topic/performance/memory"

    [2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

    [3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

    [minchan@kernel.org: fix process_madvise build break for arm64]
    Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
    [minchan@kernel.org: fix build error for mips of process_madvise]
    Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
    [akpm@linux-foundation.org: fix patch ordering issue]
    [akpm@linux-foundation.org: fix arm64 whoops]
    [minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
    [akpm@linux-foundation.org: fix i386 build]
    [sfr@canb.auug.org.au: fix syscall numbering]
    Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
    [sfr@canb.auug.org.au: madvise.c needs compat.h]
    Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
    [minchan@kernel.org: fix mips build]
    Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
    [yuehaibing@huawei.com: remove duplicate header which is included twice]
    Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
    [minchan@kernel.org: do not use helper functions for process_madvise]
    Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
    [akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
    [sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
    Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

    Signed-off-by: Minchan Kim
    Signed-off-by: YueHaibing
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Alexander Duyck
    Cc: Brian Geffon
    Cc: Christian Brauner
    Cc: Daniel Colascione
    Cc: Jann Horn
    Cc: Jens Axboe
    Cc: Joel Fernandes
    Cc: Johannes Weiner
    Cc: John Dias
    Cc: Kirill Tkhai
    Cc: Michal Hocko
    Cc: Oleksandr Natalenko
    Cc: Sandeep Patil
    Cc: SeongJae Park
    Cc: SeongJae Park
    Cc: Shakeel Butt
    Cc: Sonny Rao
    Cc: Tim Murray
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
    Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Patch series "introduce memory hinting API for external process", v9.

    Now we have MADV_PAGEOUT and MADV_COLD as the madvise hinting API. With
    that, an application can give the kernel hints about which memory ranges
    are preferred to be reclaimed. However, on some platforms (e.g., Android),
    the information required to make the hinting decision is not known to the
    app.
    Instead, it is known to a centralized userspace daemon(e.g.,
    ActivityManagerService), and that daemon must be able to initiate reclaim
    on its own without any app involvement.

    To solve the concern, this patch introduces a new syscall -
    process_madvise(2). Basically, it's the same as the madvise(2) syscall
    but has some differences.

    1. It needs pidfd of target process to provide the hint

    2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
    moment. Other madvise hints will be opened up when there are explicit
    requests from the community, to prevent unexpected bugs we couldn't
    support.

    3. Only privileged processes can do something to another process's
    address space.

    For more detail on the new API, please see the "mm: introduce external
    memory hinting API" description in this patchset.

    This patch (of 3):

    In upcoming patches, do_madvise will be called from external process
    context, so we shouldn't assume "current" is always the hinted process's
    task_struct.

    Furthermore, we must not access mm_struct via task->mm, but obtain it via
    access_mm() once (in the following patch) and only use that pointer [1],
    so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
    safe, so we can use them further down the call stack.

    And let's pass current->mm as the argument of do_madvise, so it shouldn't
    change existing behavior but prepares the next patch and makes review
    easy; the new shape is sketched below.
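
    A sketch of the new shape (signature as introduced by this patch):

    int do_madvise(struct mm_struct *mm, unsigned long start, size_t len_in,
                   int behavior);

    /* The madvise(2) entry point keeps its behavior: */
    return do_madvise(current->mm, start, len_in, behavior);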

    [vbabka@suse.cz: changelog tweak]
    [minchan@kernel.org: use current->mm for io_uring]
    Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
    [akpm@linux-foundation.org: fix it for upstream changes]
    [akpm@linux-foundation.org: whoops]
    [rdunlap@infradead.org: add missing includes]

    Signed-off-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Reviewed-by: Suren Baghdasaryan
    Reviewed-by: Vlastimil Babka
    Acked-by: David Rientjes
    Cc: Jens Axboe
    Cc: Jann Horn
    Cc: Tim Murray
    Cc: Daniel Colascione
    Cc: Sandeep Patil
    Cc: Sonny Rao
    Cc: Brian Geffon
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Shakeel Butt
    Cc: John Dias
    Cc: Joel Fernandes
    Cc: Alexander Duyck
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Kirill Tkhai
    Cc: Oleksandr Natalenko
    Cc: SeongJae Park
    Cc: Christian Brauner
    Cc: Florian Weimer
    Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
    Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
    Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • To be safe against concurrent changes to the VMA tree, we must take the
    mmap lock around GUP operations (excluding the GUP-fast family of
    operations, which will take the mmap lock by themselves if necessary).

    This code is only for testing, and it's only reachable by root through
    debugfs, so this doesn't really have any impact; however, if we want to
    add lockdep asserts into the GUP path, we need to have clean locking here.
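
    The locking pattern, sketched (illustrative of the debugfs test fix):

    mmap_read_lock(current->mm);
    nr = get_user_pages(addr, nr_pages, gup_flags, pages, NULL);
    mmap_read_unlock(current->mm);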

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Reviewed-by: John Hubbard
    Acked-by: Michel Lespinasse
    Cc: "Eric W . Biederman"
    Cc: Mauro Carvalho Chehab
    Cc: Sakari Ailus
    Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • There are two locations that have a block of code for munmapping a vma
    range. Change those two locations to use a function and add meaningful
    comments about what happens to the arguments, which was unclear in the
    previous code.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There are three places that require the next vma and use the same block
    of code. Replace the block with a function and add comments on what
    happens in the case where NULL is encountered.

    Signed-off-by: Liam R. Howlett
    Signed-off-by: Andrew Morton
    Reviewed-by: Andrew Morton
    Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
    Signed-off-by: Linus Torvalds

    Liam R. Howlett
     
  • There is no need to check whether this process has the right to modify
    the specified process when they are one and the same. We can also skip
    the security hook call when a process is modifying its own pages. Add a
    helper function to handle these cases.

    Suggested-by: Matthew Wilcox
    Signed-off-by: Hongxiang Lou
    Signed-off-by: Miaohe Lin
    Signed-off-by: Andrew Morton
    Cc: Christopher Lameter
    Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
    Signed-off-by: Linus Torvalds

    Miaohe Lin
     
  • To calculate the correct node to migrate the page to for hotplug, we need
    to check the node id of the page. A wrapper for alloc_migration_target()
    exists for this purpose.

    However, Vlastimil informs that all migration source pages come from a
    single node. In this case, we don't need to check the node id for each
    page and we don't need to re-set the target nodemask for each page by
    using the wrapper. Set up the migration_target_control once and use it
    for all pages.
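
    A sketch of the resulting setup, assuming the existing
    migration_target_control / alloc_migration_target machinery (the gfp
    flags follow the hotplug path but are illustrative):

    struct migration_target_control mtc = {
            .nmask = &nmask,
            .gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
    };

    /* All source pages are on the same node; pick it once. */
    mtc.nid = page_to_nid(list_first_entry(&source, struct page, lru));
    ret = migrate_pages(&source, alloc_migration_target, NULL,
                        (unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_HOTPLUG);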

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Cc: Christoph Hellwig
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • There is a well-defined standard migration target callback. Use it
    directly.

    Signed-off-by: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Acked-by: Vlastimil Babka
    Cc: Christoph Hellwig
    Cc: Michal Hocko
    Cc: Mike Kravetz
    Cc: Naoya Horiguchi
    Cc: Roman Gushchin
    Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • If a memcg to charge can be determined (using the remote charging API),
    there is no reason to exclude allocations made from an interrupt context
    from the accounting.

    Such allocations will pass even if the resulting memcg size exceeds the
    hard limit, but they will contribute to the memcg's memory pressure, and
    an inability to put the workload under the limit will eventually trigger
    the OOM.

    To use active_memcg() helper, memcg_kmem_bypass() is moved back to
    memcontrol.c.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Remote memcg charging API uses current->active_memcg to store the
    currently active memory cgroup, which overwrites the memory cgroup of the
    current process. It works well for normal contexts, but doesn't work for
    interrupt contexts: indeed, if an interrupt occurs during the execution of
    a section with an active memcg set, all allocations inside the interrupt
    will be charged to the active memcg set (given that we'll enable
    accounting for allocations from an interrupt context). But because the
    interrupt might have no relation to the active memcg set outside, it's
    obviously wrong from the accounting perspective.

    To resolve this problem, let's add a global percpu int_active_memcg
    variable, which will be used to store an active memory cgroup which will
    be used from interrupt contexts. set_active_memcg() will transparently
    use current->active_memcg or int_active_memcg depending on the context.

    To make the read part simple and transparent for the caller, let's
    introduce two new functions:
    - struct mem_cgroup *active_memcg(void),
    - struct mem_cgroup *get_active_memcg(void).

    They return the active memcg if one is set, hiding the implementation
    detail of where to get it in the current context.
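
    A sketch of the two helpers (close to the shape described above; the
    reference-counting details are abbreviated):

    static inline struct mem_cgroup *active_memcg(void)
    {
            if (in_interrupt())
                    return this_cpu_read(int_active_memcg);
            return current->active_memcg;
    }

    static inline struct mem_cgroup *get_active_memcg(void)
    {
            struct mem_cgroup *memcg;

            rcu_read_lock();
            memcg = active_memcg();
            if (memcg && !css_tryget(&memcg->css))  /* take a reference */
                    memcg = NULL;
            rcu_read_unlock();

            return memcg;
    }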

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • There are checks for current->mm and current->active_memcg in
    get_obj_cgroup_from_current(), but these checks are redundant:
    memcg_kmem_bypass(), called just above, performs the same checks.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Patch series "mm: kmem: kernel memory accounting in an interrupt context".

    This patchset implements memcg-based memory accounting of allocations made
    from an interrupt context.

    Historically, such allocations went unaccounted, mostly because charging
    the memory cgroup of the current process wasn't an option. Performance
    was likely a reason too.

    The remote charging API allows temporarily overwriting the currently
    active memory cgroup, so that all memory allocations are accounted
    towards some specified memory cgroup instead of the memory cgroup of the
    current process.

    This patchset extends the remote charging API so that it can be used from
    an interrupt context. Then it removes the fence that prevented the
    accounting of allocations made from an interrupt context. It also
    contains a couple of optimizations/code refactorings.

    This patchset doesn't directly enable accounting for any specific
    allocations, but prepares the code base for it. The bpf memory accounting
    will likely be the first user of it: a typical example is a bpf program
    parsing an incoming network packet, which allocates an entry in a hashmap
    to store some information.

    This patch (of 4):

    Currently memcg_kmem_bypass() is called before obtaining the current
    memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
    memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
    number of call sites and allows further code simplifications.

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
    Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     
  • Currently the remote memcg charging API consists of two functions:
    memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
    memcg value, which overwrites the memcg of the current task.

    memalloc_use_memcg(target_memcg);

    memalloc_unuse_memcg();

    It works perfectly for allocations performed from a normal context;
    however, an attempt to call it from an interrupt context, or simply
    nesting two remote charging blocks, will lead to incorrect accounting. On
    exit from the inner block, the active memcg is cleared instead of being
    restored.

    memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);

    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

    memalloc_unuse_memcg();

    This patch extends the remote charging API by switching to a single
    function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
    which sets the new value and returns the old one. So a remote charging
    block will look like:

    old_memcg = set_active_memcg(target_memcg);

    set_active_memcg(old_memcg);
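
    A sketch of the helper in its initial form, which only swaps
    current->active_memcg (interrupt-context support is handled by the
    series above):

    static inline struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
    {
            struct mem_cgroup *old = current->active_memcg;

            current->active_memcg = memcg;
            return old;
    }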

    This patch is heavily based on the patch by Johannes Weiner, which can be
    found here: https://lkml.org/lkml/2020/5/28/806 .

    Signed-off-by: Roman Gushchin
    Signed-off-by: Andrew Morton
    Reviewed-by: Shakeel Butt
    Cc: Johannes Weiner
    Cc: Dan Schatzberg
    Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
    Signed-off-by: Linus Torvalds

    Roman Gushchin
     

18 Oct, 2020

2 commits

  • For the case where read-ahead is disabled on the file, or if the cgroup
    is congested, ensure that we can at least do 1 page of read-ahead to
    make progress on the read in an async fashion. This could potentially be
    larger, but it's not needed in terms of functionality, so let's err on
    the side of caution, as larger counts of pages may run into reclaim
    issues (particularly if we're congested).

    This makes sure we're not hitting the potentially sync ->readpage() path
    for IO that is marked IOCB_WAITQ, which could cause us to block. It also
    means we'll use the same path for IO, regardless of whether or not
    read-ahead happens to be disabled on the lower level device.

    Acked-by: Johannes Weiner
    Reported-by: Matthew Wilcox (Oracle)
    Reported-by: Hao_Xu
    [axboe: updated for new ractl API]
    Signed-off-by: Jens Axboe

    Jens Axboe
     
  • Once we've copied some data for an iocb that is marked with IOCB_WAITQ,
    we should no longer attempt to async lock a new page. Instead, make sure
    we return the copied amount and let the caller retry, rather than
    returning -EIOCBQUEUED for a new page.

    This should only be possible with read-ahead disabled on the underlying
    device, and multiple threads racing on the same file. I haven't been able
    to reproduce it on anything else.

    Cc: stable@vger.kernel.org # v5.9
    Fixes: 1a0a7853b901 ("mm: support async buffered reads in generic_file_buffered_read()")
    Reported-by: Kent Overstreet
    Signed-off-by: Jens Axboe

    Jens Axboe
     

17 Oct, 2020

6 commits

  • Pull documentation updates from Mauro Carvalho Chehab:
    "A series of patches addressing warnings produced by make htmldocs.
    This includes:

    - kernel-doc markup fixes

    - ReST fixes

    - Updates at the build system in order to support newer versions of
    the docs build toolchain (Sphinx)

    After this series, the number of html build warnings should reduce
    significantly, and building with Sphinx 3.1 or later should now be
    supported (although it is still recommended to use Sphinx 2.4.4).

    As agreed with Jon, I should be sending you a late pull request by the
    end of the merge window addressing remaining issues with docs build,
    as there are a number of warning fixes that depends on pull requests
    that should be happening along the merge window.

    The end goal is to have a clean htmldocs build on Kernel 5.10.

    PS. It should be noticed that Sphinx 3.0 is not currently supported,
    as it lacks support for C domain namespaces. Such feature, needed in
    order to document uAPI system calls with Sphinx 3.x, was added only on
    Sphinx 3.1"

    * tag 'docs/v5.10-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (75 commits)
    PM / devfreq: remove a duplicated kernel-doc markup
    mm/doc: fix a literal block markup
    workqueue: fix a kernel-doc warning
    docs: virt: user_mode_linux_howto_v2.rst: fix a literal block markup
    Input: sparse-keymap: add a description for @sw
    rcu/tree: docs: document bkvcache new members at struct kfree_rcu_cpu
    nl80211: docs: add a description for s1g_cap parameter
    usb: docs: document altmode register/unregister functions
    kunit: test.h: fix a bad kernel-doc markup
    drivers: core: fix kernel-doc markup for dev_err_probe()
    docs: bio: fix a kerneldoc markup
    kunit: test.h: solve kernel-doc warnings
    block: bio: fix a warning at the kernel-doc markups
    docs: powerpc: syscall64-abi.rst: fix a malformed table
    drivers: net: hamradio: fix document location
    net: appletalk: Kconfig: Fix docs location
    dt-bindings: fix references to files converted to yaml
    memblock: get rid of a :c:type leftover
    math64.h: kernel-docs: Convert some markups into normal comments
    media: uAPI: buffer.rst: remove a left-over documentation
    ...

    Linus Torvalds
     
  • Merge more updates from Andrew Morton:
    "155 patches.

    Subsystems affected by this patch series: mm (dax, debug, thp,
    readahead, page-poison, util, memory-hotplug, zram, cleanups), misc,
    core-kernel, get_maintainer, MAINTAINERS, lib, bitops, checkpatch,
    binfmt, ramfs, autofs, nilfs, rapidio, panic, relay, kgdb, ubsan,
    romfs, and fault-injection"

    * emailed patches from Andrew Morton : (155 commits)
    lib, uaccess: add failure injection to usercopy functions
    lib, include/linux: add usercopy failure capability
    ROMFS: support inode blocks calculation
    ubsan: introduce CONFIG_UBSAN_LOCAL_BOUNDS for Clang
    sched.h: drop in_ubsan field when UBSAN is in trap mode
    scripts/gdb/tasks: add headers and improve spacing format
    scripts/gdb/proc: add struct mount & struct super_block addr in lx-mounts command
    kernel/relay.c: drop unneeded initialization
    panic: dump registers on panic_on_warn
    rapidio: fix the missed put_device() for rio_mport_add_riodev
    rapidio: fix error handling path
    nilfs2: fix some kernel-doc warnings for nilfs2
    autofs: harden ioctl table
    ramfs: fix nommu mmap with gaps in the page cache
    mm: remove the now-unnecessary mmget_still_valid() hack
    mm/gup: take mmap_lock in get_dump_page()
    binfmt_elf, binfmt_elf_fdpic: use a VMA list snapshot
    coredump: rework elf/elf_fdpic vma_dump_size() into common helper
    coredump: refactor page range dumping into common helper
    coredump: let dump_emit() bail out on short writes
    ...

    Linus Torvalds
     
  • The preceding patches have ensured that core dumping properly takes the
    mmap_lock. Thanks to that, we can now remove mmget_still_valid() and all
    its users.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-8-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Properly take the mmap_lock before calling into the GUP code from
    get_dump_page(); and play nice, allowing the GUP code to drop the
    mmap_lock if it has to sleep.

    As Linus pointed out, we don't actually need the VMA because
    __get_user_pages() will flush the dcache for us if necessary.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-7-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • Patch series "Fix ELF / FDPIC ELF core dumping, and use mmap_lock properly in there", v5.

    At the moment, we have that rather ugly mmget_still_valid() helper to
    work around the following issue: ELF core dumping doesn't take the
    mmap_sem while traversing the task's VMAs, and if anything (like
    userfaultfd) then remotely messes with the VMA tree, fireworks ensue. So
    at the moment we use mmget_still_valid() to bail out in any writers that
    might be operating on a remote mm's VMAs.

    With this series, I'm trying to get rid of the need for that as cleanly as
    possible. ("cleanly" meaning "avoid holding the mmap_lock across
    unbounded sleeps".)

    Patches 1, 2, 3 and 4 are relatively unrelated cleanups in the core
    dumping code.

    Patches 5 and 6 implement the main change: Instead of repeatedly accessing
    the VMA list with sleeps in between, we snapshot it at the start with
    proper locking, and then later we just use our copy of the VMA list. This
    ensures that the kernel won't crash, that VMA metadata in the coredump is
    consistent even in the presence of concurrent modifications, and that any
    virtual addresses that aren't being concurrently modified have their
    contents show up in the core dump properly.

    The disadvantage of this approach is that we need a bit more memory during
    core dumping for storing metadata about all VMAs.

    At the end of the series, patch 7 removes the old workaround for this
    issue (mmget_still_valid()).

    I have tested:

    - Creating a simple core dump on X86-64 still works.
    - The created coredump on X86-64 opens in GDB and looks plausible.
    - X86-64 core dumps contain the first page for executable mappings at
    offset 0, and don't contain the first page for non-executable file
    mappings or executable mappings at offset !=0.
    - NOMMU 32-bit ARM can still generate plausible-looking core dumps
    through the FDPIC implementation. (I can't test this with GDB because
    GDB is missing some structure definition for nommu ARM, but I've
    poked around in the hexdump and it looked decent.)

    This patch (of 7):

    dump_emit() is for kernel pointers, and VMAs describe userspace memory.
    Let's be tidy here and avoid accessing userspace pointers under KERNEL_DS,
    even if it probably doesn't matter much on !MMU systems - especially given
    that it looks like we can just use the same get_dump_page() as on MMU if
    we move it out of the CONFIG_MMU block.

    One small change we have to make in get_dump_page() is to use
    __get_user_pages_locked() instead of __get_user_pages(), since the latter
    doesn't exist on nommu. On mmu builds, __get_user_pages_locked() will
    just call __get_user_pages() for us.

    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Acked-by: Linus Torvalds
    Cc: Christoph Hellwig
    Cc: Alexander Viro
    Cc: "Eric W . Biederman"
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Link: http://lkml.kernel.org/r/20200827114932.3572699-1-jannh@google.com
    Link: http://lkml.kernel.org/r/20200827114932.3572699-2-jannh@google.com
    Signed-off-by: Linus Torvalds

    Jann Horn
     
  • The current page_order() can only be called on pages in the buddy
    allocator. For compound pages, you have to use compound_order(). This is
    confusing and led to a bug, so rename page_order() to buddy_order().

    Signed-off-by: Matthew Wilcox (Oracle)
    Signed-off-by: Andrew Morton
    Link: https://lkml.kernel.org/r/20201001152259.14932-2-willy@infradead.org
    Signed-off-by: Linus Torvalds

    Matthew Wilcox (Oracle)