20 Oct, 2020
2 commits
-
Pull fuse updates from Miklos Szeredi:
- Support directly accessing host page cache from virtiofs. This can
improve I/O performance for various workloads, as well as reducing
the memory requirement by eliminating double caching. Thanks to Vivek
Goyal for doing most of the work on this.
- Allow automatic submounting inside virtiofs. This allows unique
st_dev/st_ino values to be assigned inside the guest to files
residing on different filesystems on the host. Thanks to Max Reitz
for the patches.
- Fix an old use after free bug found by Pradeep P V K.
* tag 'fuse-update-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: (25 commits)
virtiofs: calculate number of scatter-gather elements accurately
fuse: connection remove fix
fuse: implement crossmounts
fuse: Allow fuse_fill_super_common() for submounts
fuse: split fuse_mount off of fuse_conn
fuse: drop fuse_conn parameter where possible
fuse: store fuse_conn in fuse_req
fuse: add submount support to
fuse: fix page dereference after free
virtiofs: add logic to free up a memory range
virtiofs: maintain a list of busy elements
virtiofs: serialize truncate/punch_hole and dax fault path
virtiofs: define dax address space operations
virtiofs: add DAX mmap support
virtiofs: implement dax read/write operations
virtiofs: introduce setupmapping/removemapping commands
virtiofs: implement FUSE_INIT map_alignment field
virtiofs: keep a list of free dax memory ranges
virtiofs: add a mount option to enable dax
virtiofs: set up virtio_fs dax_device
... -
Pull zonefs updates from Damien Le Moal:
"Add an 'explicit-open' mount option to automatically issue a
REQ_OP_ZONE_OPEN command to the device whenever a sequential zone file
is opened for writing for the first time.
This avoids 'insufficient zone resources' errors for write operations
on some drives with limited zone resources or on ZNS drives with a
limited number of active zones. From Johannes"
* tag 'zonefs-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dlemoal/zonefs:
zonefs: document the explicit-open mount option
zonefs: open/close zone on file open/close
zonefs: provide no-lock zonefs_io_error variant
zonefs: introduce helper for zone management
19 Oct, 2020
38 commits
-
…/kernel/git/shuah/linux-kselftest
Pull more KUnit updates from Shuah Khan:
- add KUnit to kernel_init() and remove KUnit from init calls entirely.
This addresses the concern that KUnit would not work correctly during
late init phase.
- add a linker section where KUnit can put references to its test
suites. This is the first step in transitioning to dispatching all KUnit
tests from a centralized executor rather than having each as its own
separate late_initcall.
- add a centralized executor to dispatch tests rather than relying on
late_initcall to schedule each test suite separately. Centralized
execution is for built-in tests only; modules will execute tests when
loaded.
- convert bitfield test to use KUnit framework
- Documentation updates for naming guidelines and how
kunit_test_suite() works.
- add test plan to KUnit TAP format
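The linker-section and centralized-executor items in this list can be illustrated with a small userspace analogy. This is a sketch, not KUnit code: struct toy_suite, TOY_TEST_SUITE, toy_suite_count and toy_run_all_suites are hypothetical names, and it assumes an ELF target with a GNU-compatible toolchain, where the linker provides __start_<section> and __stop_<section> symbols for sections whose names are valid C identifiers. The plan line uses the generic TAP "1..N" form.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for a test suite; not KUnit's real type. */
struct toy_suite {
	const char *name;
	int (*run)(void);	/* returns 0 on success */
};

/*
 * Place a pointer to the suite into a dedicated section instead of
 * registering a separate init call per suite (assumption: ELF target,
 * GNU-compatible linker).
 */
#define TOY_TEST_SUITE(suite)						\
	static const struct toy_suite *const toy_ref_##suite		\
	__attribute__((used, section("toy_suites"))) = &(suite)

/* Section bounds synthesized by the linker. */
extern const struct toy_suite *const __start_toy_suites[];
extern const struct toy_suite *const __stop_toy_suites[];

static int always_passes(void) { return 0; }
static int always_fails(void) { return 1; }

static const struct toy_suite suite_a = { "suite_a", always_passes };
static const struct toy_suite suite_b = { "suite_b", always_fails };
TOY_TEST_SUITE(suite_a);
TOY_TEST_SUITE(suite_b);

/* Number of suite references collected in the section. */
size_t toy_suite_count(void)
{
	return (size_t)(__stop_toy_suites - __start_toy_suites);
}

/* Central executor: print a TAP-style plan, then run every suite once. */
int toy_run_all_suites(FILE *out)
{
	size_t n = toy_suite_count();
	int failures = 0;

	fprintf(out, "1..%zu\n", n);
	for (size_t i = 0; i < n; i++) {
		const struct toy_suite *s = __start_toy_suites[i];
		int failed = s->run() != 0;

		failures += failed;
		fprintf(out, "%s %zu - %s\n",
			failed ? "not ok" : "ok", i + 1, s->name);
	}
	return failures;
}
```

Because the executor walks the section, adding a suite only takes another TOY_TEST_SUITE() line; the order in which suites appear in the section is up to the toolchain.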
* tag 'linux-kselftest-kunit-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/shuah/linux-kselftest:
lib: kunit: Fix compilation test when using TEST_BIT_FIELD_COMPILE
lib: kunit: add bitfield test conversion to KUnit
Documentation: kunit: add a brief blurb about kunit_test_suite
kunit: test: add test plan to KUnit TAP format
init: main: add KUnit to kernel init
kunit: test: create a single centralized executor for all tests
vmlinux.lds.h: add linker section for KUnit test suites
Documentation: kunit: Add naming guidelines -
Pull RCU changes from Ingo Molnar:
- Debugging for smp_call_function()
- RT raw/non-raw lock ordering fixes
- Strict grace periods for KASAN
- New smp_call_function() torture test
- Torture-test updates
- Documentation updates
- Miscellaneous fixes
[ This doesn't actually pull the tag - I've dropped the last merge from
the RCU branch due to questions about the series. - Linus ]
* tag 'core-rcu-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
smp: Make symbol 'csd_bug_count' static
kernel/smp: Provide CSD lock timeout diagnostics
smp: Add source and destination CPUs to __call_single_data
rcu: Shrink each possible cpu krcp
rcu/segcblist: Prevent useless GP start if no CBs to accelerate
torture: Add gdb support
rcutorture: Allow pointer leaks to test diagnostic code
rcutorture: Hoist OOM registry up one level
refperf: Avoid null pointer dereference when buf fails to allocate
rcutorture: Properly synchronize with OOM notifier
rcutorture: Properly set rcu_fwds for OOM handling
torture: Add kvm.sh --help and update help message
rcutorture: Add CONFIG_PROVE_RCU_LIST to TREE05
torture: Update initrd documentation
rcutorture: Replace HTTP links with HTTPS ones
locktorture: Make function torture_percpu_rwsem_init() static
torture: document --allcpus argument added to the kvm.sh script
rcutorture: Output number of elapsed grace periods
rcutorture: Remove KCSAN stubs
rcu: Remove unused "cpu" parameter from rcu_report_qs_rdp()
... -
Pull mailbox updates from Jassi Brar:
- arm: implementation of mhu as a doorbell driver and conversion of
dt-bindings to json-schema
- mediatek: fix platform_get_irq error handling
- bcm: convert tasklets to use new tasklet_setup api
- core: fix race caused by hrtimer starting inappropriately
* tag 'mailbox-v5.10' of git://git.linaro.org/landing-teams/working/fujitsu/integration:
mailbox: avoid timer start from callback
maiblox: mediatek: Fix handling of platform_get_irq() error
mailbox: arm_mhu: Add ARM MHU doorbell driver
mailbox: arm_mhu: Match only if compatible is "arm,mhu"
dt-bindings: mailbox: add doorbell support to ARM MHU
dt-bindings: mailbox : arm,mhu: Convert to Json-schema
mailbox: bcm: convert tasklets to use new tasklet_setup() API -
Pull coccinelle updates from Julia Lawall.
* 'for-5.10' of git://git.kernel.org/pub/scm/linux/kernel/git/jlawall/linux:
coccinelle: api: add kfree_mismatch script
coccinelle: iterators: Add for_each_child.cocci script
scripts: coccicheck: Change default condition for parallelism
scripts: coccicheck: Add quotes to improve portability
coccinelle: api: kfree_sensitive: print memset position
coccinelle: misc: add flexible_array.cocci script
coccinelle: api: add kvmalloc script
scripts: coccicheck: Change default value for parallelism
coccinelle: misc: add excluded_middle.cocci script
scripts: coccicheck: Improve error feedback when coccicheck fails
coccinelle: api: update kzfree script to kfree_sensitive
coccinelle: misc: add uninitialized_var.cocci script
coccinelle: ifnullfree: add vfree(), kvfree*() functions
coccinelle: api: add kobj_to_dev.cocci script
coccinelle: add patch rule for dma_alloc_coherent
scripts: coccicheck: Add chain mode to list of modes -
Merge yet more updates from Andrew Morton:
"Subsystems affected by this patch series: mm (memcg, migration,
pagemap, gup, madvise, vmalloc), ia64, and misc"
* emailed patches from Andrew Morton: (31 commits)
mm: remove duplicate include statement in mmu.c
mm: remove the filename in the top of file comment in vmalloc.c
mm: cleanup the gfp_mask handling in __vmalloc_area_node
mm: remove alloc_vm_area
x86/xen: open code alloc_vm_area in arch_gnttab_valloc
xen/xenbus: use apply_to_page_range directly in xenbus_map_ring_pv
drm/i915: use vmap in i915_gem_object_map
drm/i915: stop using kmap in i915_gem_object_map
drm/i915: use vmap in shmem_pin_map
zsmalloc: switch from alloc_vm_area to get_vm_area
mm: allow a NULL fn callback in apply_to_page_range
mm: add a vmap_pfn function
mm: add a VM_MAP_PUT_PAGES flag for vmap
mm: update the documentation for vfree
mm/madvise: introduce process_madvise() syscall: an external memory hinting API
pid: move pidfd_get_pid() to pid.c
mm/madvise: pass mm to do_madvise
selftests/vm: 10x speedup for hmm-tests
binfmt_elf: take the mmap lock around find_extend_vma()
mm/gup_benchmark: take the mmap lock around GUP
... -
Pull UML updates from Richard Weinberger:
- Improve support for non-glibc systems
- Vector: Add support for scripting and dynamic tap devices
- Various fixes for the vector networking driver
- Various fixes for time travel mode
* tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/uml:
um: vector: Add dynamic tap interfaces and scripting
um: Clean up stacktrace dump
um: Fix incorrect assumptions about max pid length
um: Remove dead usage of TIF_IA32
um: Remove redundant NULL check
um: change sigio_spinlock to a mutex
um: time-travel: Return the sequence number in ACK messages
um: time-travel: Fix IRQ handling in time_travel_handle_message()
um: Allow static linking for non-glibc implementations
um: Some fixes to build UML with musl
um: vector: Use GFP_ATOMIC under spin lock
um: Fix null pointer dereference in vector_user_bpf -
Pull more ubi and ubifs updates from Richard Weinberger:
"UBI:
- Correctly use kthread_should_stop in ubi worker
UBIFS:
- Fixes for memory leaks while iterating directory entries
- Fix for a user triggerable error message
- Fix for a space accounting bug in authenticated mode"
* tag 'for-linus-5.10-rc1-part2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: journal: Make sure to not dirty twice for auth nodes
ubifs: setflags: Don't show error message when vfs_ioc_setflags_prepare() fails
ubifs: ubifs_jnl_change_xattr: Remove assertion 'nlink > 0' for host inode
ubi: check kthread_should_stop() after the setting of task state
ubifs: dent: Fix some potential memory leaks while iterating entries
ubifs: xattr: Fix some potential memory leaks while iterating entries -
Pull ubifs updates from Richard Weinberger:
- Kernel-doc fixes
- Fixes for memory leaks in authentication option parsing
* tag 'for-linus-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: mount_ubifs: Release authentication resource in error handling path
ubifs: Don't parse authentication mount options in remount process
ubifs: Fix a memleak after dumping authentication mount options
ubifs: Fix some kernel-doc warnings in tnc.c
ubifs: Fix some kernel-doc warnings in replay.c
ubifs: Fix some kernel-doc warnings in gc.c
ubifs: Fix 'hash' kernel-doc warning in auth.c -
asm/sections.h is included more than once. Remove the one that isn't
necessary.
Signed-off-by: Tian Tao
Signed-off-by: Andrew Morton
Reviewed-by: Mike Rapoport
Link: https://lkml.kernel.org/r/1600088607-17327-1-git-send-email-tiantao6@hisilicon.com
Signed-off-by: Linus Torvalds -
No point in having the filename inside the file.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002124035.1539300-3-hch@lst.de
Signed-off-by: Linus Torvalds -
Patch series "two small vmalloc cleanups".
This patch (of 2):
__vmalloc_area_node currently has four different gfp_t variables to
just express this simple logic:
- use the passed in mask, plus __GFP_NOWARN and __GFP_HIGHMEM (if
suitable) for the underlying page allocation
- use just the reclaim flags from the passed in mask plus __GFP_ZERO
for allocating the page array
Simplify this down to just use the pre-existing nested_gfp as-is for
the page array allocation, and just the passed in gfp_mask for the
page allocation, after conditionally ORing __GFP_HIGHMEM into it. This
also makes the allocation warning a little more correct.
Also initialize two variables at the time of declaration while touching
this area.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002124035.1539300-1-hch@lst.de
Link: https://lkml.kernel.org/r/20201002124035.1539300-2-hch@lst.de
Signed-off-by: Linus Torvalds -
All users are gone now.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-12-hch@lst.de
Signed-off-by: Linus Torvalds -
Replace the last call to alloc_vm_area with an open coded version using an
iterator in struct gnttab_vm_area instead of the triple indirection magic
in alloc_vm_area.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Reviewed-by: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-11-hch@lst.de
Signed-off-by: Linus Torvalds -
Replacing alloc_vm_area with get_vm_area_caller + apply_to_page_range allows
us to fill in the phys_addr values directly instead of doing another loop
over all addresses.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Reviewed-by: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-10-hch@lst.de
Signed-off-by: Linus Torvalds -
i915_gem_object_map implements fairly low-level vmap functionality in a
driver. Split it into two helpers, one for remapping kernel memory which
can use vmap, and one for I/O memory that uses vmap_pfn.
The only practical difference is that alloc_vm_area prefaults the vmalloc
area PTEs, which doesn't seem to be required here for the kernel memory
case (and could be added to vmap using a flag if actually required).
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Reviewed-by: Tvrtko Ursulin
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-9-hch@lst.de
Signed-off-by: Linus Torvalds -
kmap for !PageHighmem is just a convoluted way to say page_address, and
kunmap is a no-op in that case.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Reviewed-by: Tvrtko Ursulin
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-8-hch@lst.de
Signed-off-by: Linus Torvalds -
shmem_pin_map somewhat awkwardly reimplements vmap using alloc_vm_area and
manual pte setup. The only practical difference is that alloc_vm_area
prefaults the vmalloc area PTEs, which doesn't seem to be required here
(and could be added to vmap using a flag if actually required). Switch to
use vmap, and use vfree to free both the vmalloc mapping and the page
array, as well as dropping the references to each page.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Reviewed-by: Tvrtko Ursulin
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-7-hch@lst.de
Signed-off-by: Linus Torvalds -
Just manually pre-fault the PTEs using apply_to_page_range.
Co-developed-by: Minchan Kim
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-6-hch@lst.de
Signed-off-by: Linus Torvalds -
Besides calling the callback on each page, apply_to_page_range also has
the effect of pre-faulting all PTEs for the range. To support callers
that only need the pre-faulting, make the callback optional.
Based on a patch from Minchan Kim.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-5-hch@lst.de
Signed-off-by: Linus Torvalds -
Add a proper helper to remap PFNs into kernel virtual space so that
drivers don't have to abuse alloc_vm_area and open coded PTE manipulation
for it.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-4-hch@lst.de
Signed-off-by: Linus Torvalds -
Add a flag so that vmap takes ownership of the passed in page array. When
vfree is called on such an allocation it will put one reference on each
page, and free the page array itself.
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Boris Ostrovsky
Cc: Chris Wilson
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Juergen Gross
Cc: Matthew Auld
Cc: "Matthew Wilcox (Oracle)"
Cc: Minchan Kim
Cc: Nitin Gupta
Cc: Peter Zijlstra
Cc: Rodrigo Vivi
Cc: Stefano Stabellini
Cc: Tvrtko Ursulin
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-3-hch@lst.de
Signed-off-by: Linus Torvalds -
Patch series "remove alloc_vm_area", v4.
This series removes alloc_vm_area, which was left over from the big
vmalloc interface rework. It is a rather arcane interface, basically the
equivalent of get_vm_area + actually faulting in all PTEs in the allocated
area. It was originally added for Xen (which isn't modular to start
with), and then grew users in zsmalloc and i915 which seem to mostly
qualify as abuses of the interface, especially for i915 as a random driver
should not set up PTE bits directly.
This patch (of 11):
* Document that you can call vfree() on an address returned from vmap()
* Remove the note about the minimum size -- the minimum size of a vmalloc
allocation is one page
* Add a Context: section
* Fix capitalisation
* Reword the prohibition on calling from NMI context to avoid a double
negative
Signed-off-by: Matthew Wilcox (Oracle)
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Cc: Peter Zijlstra
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: Stefano Stabellini
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Tvrtko Ursulin
Cc: Chris Wilson
Cc: Matthew Auld
Cc: Rodrigo Vivi
Cc: Minchan Kim
Cc: Matthew Wilcox
Cc: Nitin Gupta
Cc: Uladzislau Rezki (Sony)
Link: https://lkml.kernel.org/r/20201002122204.1534411-1-hch@lst.de
Link: https://lkml.kernel.org/r/20201002122204.1534411-2-hch@lst.de
Signed-off-by: Linus Torvalds -
There is a use case in which System Management Software (SMS) wants to
give a memory hint like MADV_[COLD|PAGEOUT] to other processes; in the
case of Android, it is the ActivityManagerService.
The information required to make the reclaim decision is not known to the
app. Instead, it is known to the centralized userspace
daemon (ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.
To solve the issue, this patch introduces a new syscall,
process_madvise(2). It uses the pidfd of an external process to give the
hint. It also supports vector address ranges, because an Android app has
thousands of vmas due to zygote, so it is a waste of CPU and power to
call the syscall one by one for each vma. (In testing a 2000-vma
syscall vs a 1-vector syscall, it showed a 15% performance improvement. I
think it would be bigger in real practice because the testing ran in a very
cache-friendly environment.)
Another potential use case for the vector range is to amortize the cost
of TLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations. In
the future, we could find more use cases for other advice values, so let's
make it part of the API since we are introducing a new syscall at this
moment. With that, existing madvise(2) users could replace it with
process_madvise(2) with their own pid if they want batch address range
support.
Since it could affect another process's address range, only a privileged
process (PTRACE_MODE_ATTACH_FSCREDS), or one that something else (e.g.,
being the same UID) gives the right to ptrace the process, can use it
successfully.
The flag argument is reserved for future use if we need to extend the API.
I think supporting all hints that madvise supports or will support in
process_madvise is rather risky, because we are not sure all hints make
sense from an external process, and the implementation of a hint may rely on
the caller being in the current context, so it could be error-prone. Thus,
I just limited the hints to MADV_[COLD|PAGEOUT] in this patch.
If someone wants to add other hints, we can hear the use case and review
it for each hint. That is safer for maintenance than introducing a
buggy syscall that is hard to fix later.
So finally, the API is as follows,
ssize_t process_madvise(int pidfd, const struct iovec *iovec,
unsigned long vlen, int advice, unsigned int flags);
DESCRIPTION
The process_madvise() system call is used to give advice or directions
to the kernel about the address ranges of an external process as well as
the local process. It provides the advice for the address ranges of the
process described by iovec and vlen. The goal of such advice is to improve
system or application performance.
The pidfd selects the process referred to by the PID file descriptor
specified in pidfd. (See pidfd_open(2) for further information.)
The pointer iovec points to an array of iovec structures, defined in
as:
struct iovec {
	void *iov_base; /* starting address */
	size_t iov_len; /* number of bytes to be advised */
};
The iovec describes address ranges beginning at address (iov_base)
and with size length of bytes (iov_len).
The vlen represents the number of elements in iovec.
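A minimal userspace sketch of how a caller might pass an iovec array and detect partial advice. It is not part of the patch: the helper names (iov_total_bytes, advise_cold, advice_was_partial) are hypothetical, the fallback values for SYS_pidfd_open (434), SYS_process_madvise (440) and MADV_COLD (20) are assumptions for toolchains whose headers predate the syscall, and vlen is passed as size_t, following the fixup note further down in this changelog.

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Assumed fallbacks for older headers (values used on most architectures). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif

/* Sum of iov_len over all ranges, i.e. the number of bytes requested. */
size_t iov_total_bytes(const struct iovec *iov, size_t vlen)
{
	size_t total = 0;

	for (size_t i = 0; i < vlen; i++)
		total += iov[i].iov_len;
	return total;
}

/*
 * Hint that the given ranges of process `pid` are cold.
 * Returns the number of bytes advised, or -1 with errno set
 * (ENOSYS on kernels that lack the syscall).
 */
ssize_t advise_cold(pid_t pid, const struct iovec *iov, size_t vlen)
{
	int pidfd = (int)syscall(SYS_pidfd_open, pid, 0);
	ssize_t ret;
	int saved_errno;

	if (pidfd < 0)
		return -1;

	/* The flags argument is reserved, so pass 0. */
	ret = (ssize_t)syscall(SYS_process_madvise, pidfd, iov, vlen,
			       MADV_COLD, 0);
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}

/* True if the call succeeded but covered fewer bytes than requested. */
int advice_was_partial(ssize_t ret, const struct iovec *iov, size_t vlen)
{
	return ret >= 0 && (size_t)ret < iov_total_bytes(iov, vlen);
}
```

A caller would typically fill one struct iovec per address range of the target, call advise_cold() once for the whole vector, and treat a short count as partial advice rather than as success.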
The advice is indicated in the advice argument, which is one of the
following at this moment if the target process specified by pidfd is
external.
MADV_COLD
MADV_PAGEOUT
Permission to provide a hint to an external process is governed by a
ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).
The process_madvise supports every advice madvise(2) has if the target
process is in the same thread group as the calling process, so the user could
use process_madvise(2) to extend existing madvise(2) to support
vector address ranges.
RETURN VALUE
On success, process_madvise() returns the number of bytes advised.
This return value may be less than the total number of requested
bytes, if an error occurred. The caller should check the return value
to determine whether a partial advice occurred.
FAQ:
Q.1 - Why does any external entity have better knowledge?
Quote from Sandeep
"For Android, every application (including the special SystemServer)
are forked from Zygote. The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.
After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.
In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.
So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.
Besides, we can never rely on applications to clean things up
themselves. We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.
So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.
- ssp
Q.2 - How to guard against the race (i.e., object validation) between
when an external process gives a hint and when the target process gets
the hint?
process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called. If the
target process can run between the time the process_madvise caller
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect. It's the
responsibility of the process calling process_madvise to close this
race condition. For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called. Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process. Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm. The suggested API itself does not provide synchronization. The
same applies to other APIs like move_pages and process_vm_writev.
The race isn't really a problem though. Why is it so wrong to require
that callers do their own synchronization in some manner? Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something. Think about mmap. It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before. That's where we need synchronization by using another API or
design on the user side. It shouldn't be part of the API itself. If someone
needs more fine-grained synchronization than process level,
there were two ideas suggested - cookie[2] and anon-fd[3]. Both are
applicable via the last reserved argument of the API, but I don't
think it's necessary right now since we already have ways to prevent
the race, so I don't want to add additional complexity with a more
fine-grained optimization model.
To make the API extensible, it reserved an unsigned long as last argument
so we could support it in future if someone really needs it.
Q.3 - Why doesn't ptrace work?
Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA. Furthermore, we want to act on the hint in the caller's context, not the
callee's, because the callee is usually limited by cpuset/cgroups or
even in a frozen state, so it can't act by itself quickly enough, which
causes more thrashing/kills. It also doesn't work if the target process is
ptraced (e.g., strace, debugger, minidump) because a process can have at
most one ptracer.
[1] https://developer.android.com/topic/performance/memory"
[2] process_getinfo for getting the cookie which is updated whenever
the vma layout of the process address space changes - Daniel Colascione -
https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224
[3] anonymous fd which is used for the object (i.e., address range)
validation - Michal Hocko -
https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/
[minchan@kernel.org: fix process_madvise build break for arm64]
Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au
Signed-off-by: Minchan Kim
Signed-off-by: YueHaibing
Signed-off-by: Stephen Rothwell
Signed-off-by: Andrew Morton
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Vlastimil Babka
Acked-by: David Rientjes
Cc: Alexander Duyck
Cc: Brian Geffon
Cc: Christian Brauner
Cc: Daniel Colascione
Cc: Jann Horn
Cc: Jens Axboe
Cc: Joel Fernandes
Cc: Johannes Weiner
Cc: John Dias
Cc: Kirill Tkhai
Cc: Michal Hocko
Cc: Oleksandr Natalenko
Cc: Sandeep Patil
Cc: SeongJae Park
Cc: SeongJae Park
Cc: Shakeel Butt
Cc: Sonny Rao
Cc: Tim Murray
Cc: Christian Brauner
Cc: Florian Weimer
Cc:
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds -
The process_madvise syscall needs the pidfd_get_pid function to translate
a pidfd to a pid, so this patch moves the function to kernel/pid.c.
Suggested-by: Alexander Duyck
Signed-off-by: Minchan Kim
Signed-off-by: Andrew Morton
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Alexander Duyck
Reviewed-by: Vlastimil Babka
Acked-by: Christian Brauner
Acked-by: David Rientjes
Cc: Jens Axboe
Cc: Jann Horn
Cc: Brian Geffon
Cc: Daniel Colascione
Cc: Joel Fernandes
Cc: Johannes Weiner
Cc: John Dias
Cc: Kirill Tkhai
Cc: Michal Hocko
Cc: Oleksandr Natalenko
Cc: Sandeep Patil
Cc: SeongJae Park
Cc: SeongJae Park
Cc: Shakeel Butt
Cc: Sonny Rao
Cc: Tim Murray
Cc: Christian Brauner
Cc: Florian Weimer
Cc:
Link: http://lkml.kernel.org/r/20200302193630.68771-5-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-3-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-3-minchan@kernel.org
Signed-off-by: Linus Torvalds -
Patch series "introduce memory hinting API for external process", v9.
Now, we have MADV_PAGEOUT and MADV_COLD as madvise hinting API. With
that, an application can give hints to the kernel about which memory ranges
are preferred to be reclaimed. However, on some platforms (e.g., Android), the
information required to make the hinting decision is not known to the app.
Instead, it is known to a centralized userspace daemon (e.g.,
ActivityManagerService), and that daemon must be able to initiate reclaim
on its own without any app involvement.
To solve the concern, this patch introduces a new syscall -
process_madvise(2). Basically, it's the same as the madvise(2) syscall but it
has some differences.
1. It needs the pidfd of the target process to provide the hint
2. It supports only MADV_{COLD|PAGEOUT|MERGEABLE|UNMERGEABLE} at this
moment. Other hints in madvise will be opened up when there are explicit
requests from the community, to prevent unexpected bugs we couldn't support.
3. Only privileged processes can do something for another process's
address space.
For more detail of the new API, please see the "mm: introduce external memory
hinting API" description in this patchset.
This patch (of 3):
In upcoming patches, do_madvise will be called from external process
context so we shouldn't assume "current" is always the hinted process's
task_struct.

Furthermore, we must not access mm_struct via task->mm, but obtain it via
access_mm() once (in the following patch) and only use that pointer [1],
so pass it to do_madvise() as well. Note the vma->vm_mm pointers are
safe, so we can use them further down the call stack.

And let's pass current->mm as an argument of do_madvise so that it doesn't
change existing behavior but prepares the next patch, making review easy.

[vbabka@suse.cz: changelog tweak]
[minchan@kernel.org: use current->mm for io_uring]
Link: http://lkml.kernel.org/r/20200423145215.72666-1-minchan@kernel.org
[akpm@linux-foundation.org: fix it for upstream changes]
[akpm@linux-foundation.org: whoops]
[rdunlap@infradead.org: add missing includes]

Signed-off-by: Minchan Kim
Signed-off-by: Andrew Morton
Reviewed-by: Suren Baghdasaryan
Reviewed-by: Vlastimil Babka
Acked-by: David Rientjes
Cc: Jens Axboe
Cc: Jann Horn
Cc: Tim Murray
Cc: Daniel Colascione
Cc: Sandeep Patil
Cc: Sonny Rao
Cc: Brian Geffon
Cc: Michal Hocko
Cc: Johannes Weiner
Cc: Shakeel Butt
Cc: John Dias
Cc: Joel Fernandes
Cc: Alexander Duyck
Cc: SeongJae Park
Cc: Christian Brauner
Cc: Kirill Tkhai
Cc: Oleksandr Natalenko
Cc: SeongJae Park
Cc: Christian Brauner
Cc: Florian Weimer
Cc:
Link: https://lkml.kernel.org/r/20200901000633.1920247-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-1-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200302193630.68771-2-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200622192900.22757-2-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-2-minchan@kernel.org
Signed-off-by: Linus Torvalds -
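The cover letter above describes process_madvise(2) only in outline. The following is a minimal userspace sketch of how a daemon might hint a range in another process, assuming the interface as it was eventually merged (a pidfd plus an iovec array, with a flags argument of 0). The helper name hint_remote_range() is hypothetical, and the fallback syscall numbers and MADV_* values are the generic Linux ones, used here only when the libc headers do not provide them.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/mman.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

/* Fallbacks for older headers; assumed generic syscall numbers (not alpha). */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_process_madvise
#define SYS_process_madvise 440
#endif
#ifndef MADV_COLD
#define MADV_COLD 20
#endif
#ifndef MADV_PAGEOUT
#define MADV_PAGEOUT 21
#endif

/*
 * Hypothetical helper: apply `advice` to [addr, addr + len) in the address
 * space of process `pid`. `addr` is an address in the *target* process.
 * Returns the number of bytes advised, or -1 with errno set (for example
 * ENOSYS on kernels without the syscall, EPERM without enough privilege).
 */
static ssize_t hint_remote_range(pid_t pid, void *addr, size_t len, int advice)
{
	struct iovec range = { .iov_base = addr, .iov_len = len };
	ssize_t ret;
	int saved_errno;
	int pidfd;

	/* Translate the pid into a pidfd, which is what the syscall expects. */
	pidfd = (int)syscall(SYS_pidfd_open, (long)pid, 0L);
	if (pidfd < 0)
		return -1;

	/* One iovec entry, flags must be 0. */
	ret = syscall(SYS_process_madvise, (long)pidfd, &range, 1UL,
		      (long)advice, 0L);

	/* Preserve the syscall's errno across close(). */
	saved_errno = errno;
	close(pidfd);
	errno = saved_errno;
	return ret;
}
```

A caller would typically pass MADV_COLD or MADV_PAGEOUT as the advice and treat a return of -1 as "hint not applied", since the hint is advisory and the required privilege may be missing.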
This patch reduces the running time for hmm-tests from about 10+ seconds,
to just under 1.0 second, for an approximately 10x speedup. That brings
it in line with most of the other tests in selftests/vm, which mostly run
in < 1 sec.

This is done with a one-line change that simply reduces the number of
iterations of several tests, from 256, to 10. Thanks to Ralph Campbell
for suggesting changing NTIMES as a way to get the speedup.

Suggested-by: Ralph Campbell
Signed-off-by: John Hubbard
Signed-off-by: Andrew Morton
Cc: SeongJae Park
Cc: Shuah Khan
Link: https://lkml.kernel.org/r/20201003011721.44238-1-jhubbard@nvidia.com
Signed-off-by: Linus Torvalds -
create_elf_tables() runs after setup_new_exec(), so other tasks can
already access our new mm and do things like process_madvise() on it. (At
the time I'm writing this commit, process_madvise() is not in mainline
yet, but has been in akpm's tree for some time.)

While I believe that there are currently no APIs that would actually allow
another process to mess up our VMA tree (process_madvise() is limited to
MADV_COLD and MADV_PAGEOUT, and uring and userfaultfd cannot reach an mm
under which no syscalls have been executed yet), this seems like an
accident waiting to happen.

Let's make sure that we always take the mmap lock around GUP paths as long
as another process might be able to see the mm.

(Yes, this diff looks suspicious because we drop the lock before doing
anything with `vma`, but that's because we actually don't do anything with
it apart from the NULL check.)

Signed-off-by: Jann Horn
Signed-off-by: Andrew Morton
Acked-by: Michel Lespinasse
Cc: "Eric W . Biederman"
Cc: Jason Gunthorpe
Cc: John Hubbard
Cc: Mauro Carvalho Chehab
Cc: Sakari Ailus
Link: https://lkml.kernel.org/r/CAG48ez1-PBCdv3y8pn-Ty-b+FmBSLwDuVKFSt8h7wARLy0dF-Q@mail.gmail.com
Signed-off-by: Linus Torvalds -
To be safe against concurrent changes to the VMA tree, we must take the
mmap lock around GUP operations (excluding the GUP-fast family of
operations, which will take the mmap lock by themselves if necessary).

This code is only for testing, and it's only reachable by root through
debugfs, so this doesn't really have any impact; however, if we want to
add lockdep asserts into the GUP path, we need to have clean locking here.

Signed-off-by: Jann Horn
Signed-off-by: Andrew Morton
Reviewed-by: Jason Gunthorpe
Reviewed-by: John Hubbard
Acked-by: Michel Lespinasse
Cc: "Eric W . Biederman"
Cc: Mauro Carvalho Chehab
Cc: Sakari Ailus
Link: https://lkml.kernel.org/r/CAG48ez3SG6ngZLtasxJ6LABpOnqCz5-QHqb0B4k44TQ8F9n6+w@mail.gmail.com
Signed-off-by: Linus Torvalds -
There are two locations that have a block of code for munmapping a vma
range. Change those two locations to use a function and add meaningful
comments about what happens to the arguments, which was unclear in the
previous code.

Signed-off-by: Liam R. Howlett
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200818154707.2515169-2-Liam.Howlett@Oracle.com
Signed-off-by: Linus Torvalds -
There are three places where the next vma is required, and each uses the
same block of code. Replace the block with a function and add comments on
what happens in the case where NULL is encountered.

Signed-off-by: Liam R. Howlett
Signed-off-by: Andrew Morton
Reviewed-by: Andrew Morton
Link: http://lkml.kernel.org/r/20200818154707.2515169-1-Liam.Howlett@Oracle.com
Signed-off-by: Linus Torvalds -
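The "next vma" helper described above can be modeled in a few lines of userspace C. This is a sketch under assumptions, not the kernel code: the two structs are simplified stand-ins that keep only the list linkage, and the NULL behavior shown (a NULL vma yields the first VMA of the mm) is how the helper is assumed to treat that case.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified stand-in: only the fields needed to walk the VMA list. */
struct vm_area_struct {
	unsigned long vm_start;
	unsigned long vm_end;
	struct vm_area_struct *vm_next;	/* next VMA by address, or NULL */
};

/* Simplified stand-in: the mm only holds the head of the sorted list. */
struct mm_struct {
	struct vm_area_struct *mmap;
};

/*
 * Return the VMA that follows @vma in @mm.
 * If @vma is NULL, there is no "current" VMA, so the first VMA of the mm
 * is returned. The result is NULL when @vma was the last one.
 */
static struct vm_area_struct *vma_next(struct mm_struct *mm,
				       struct vm_area_struct *vma)
{
	if (!vma)
		return mm->mmap;

	return vma->vm_next;
}
```

Folding the NULL check into one function means each call site no longer has to repeat the "no previous VMA, start from the head" branch.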
There is no need to check if this process has the right to modify the
specified process when they are the same. And we could also skip the
security hook call if a process is modifying its own pages. Add a helper
function to handle these.

Suggested-by: Matthew Wilcox
Signed-off-by: Hongxiang Lou
Signed-off-by: Miaohe Lin
Signed-off-by: Andrew Morton
Cc: Christopher Lameter
Link: https://lkml.kernel.org/r/20200819083331.19012-1-linmiaohe@huawei.com
Signed-off-by: Linus Torvalds -
To calculate the correct node to migrate the page to for hotplug, we need
to check the node id of the page. A wrapper for alloc_migration_target()
exists for this purpose.

However, Vlastimil informs that all migration source pages come from a
single node. In this case, we don't need to check the node id for each
page and we don't need to re-set the target nodemask for each page by
using the wrapper. Set up the migration_target_control once and use it
for all pages.

Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Vlastimil Babka
Acked-by: Michal Hocko
Cc: Christoph Hellwig
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Roman Gushchin
Link: http://lkml.kernel.org/r/1594622517-20681-10-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds -
There is a well-defined standard migration target callback. Use it
directly.

Signed-off-by: Joonsoo Kim
Signed-off-by: Andrew Morton
Acked-by: Vlastimil Babka
Cc: Christoph Hellwig
Cc: Michal Hocko
Cc: Mike Kravetz
Cc: Naoya Horiguchi
Cc: Roman Gushchin
Link: http://lkml.kernel.org/r/1594622517-20681-9-git-send-email-iamjoonsoo.kim@lge.com
Signed-off-by: Linus Torvalds -
If a memcg to charge can be determined (using remote charging API), there
is no reason to exclude allocations made from an interrupt context from
the accounting.

Such allocations will pass even if the resulting memcg size will exceed
the hard limit, but they will add to the memory pressure, and an inability
to put the workload under the limit will eventually trigger the OOM.

To use active_memcg() helper, memcg_kmem_bypass() is moved back to
memcontrol.c.

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200827225843.1270629-5-guro@fb.com
Signed-off-by: Linus Torvalds -
Remote memcg charging API uses current->active_memcg to store the
currently active memory cgroup, which overwrites the memory cgroup of the
current process. It works well for normal contexts, but doesn't work for
interrupt contexts: indeed, if an interrupt occurs during the execution of
a section with an active memcg set, all allocations inside the interrupt
will be charged to the active memcg set (given that we'll enable
accounting for allocations from an interrupt context). But because the
interrupt might have no relation to the active memcg set outside, it's
obviously wrong from the accounting perspective.

To resolve this problem, let's add a global percpu int_active_memcg
variable, which will be used to store an active memory cgroup which will
be used from interrupt contexts. set_active_memcg() will transparently
use current->active_memcg or int_active_memcg depending on the context.

To make the read part simple and transparent for the caller, let's
introduce two new functions:
  - struct mem_cgroup *active_memcg(void),
  - struct mem_cgroup *get_active_memcg(void).

They return the active memcg if it's set, hiding all implementation
details: where to get it depending on the current context.

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200827225843.1270629-4-guro@fb.com
Signed-off-by: Linus Torvalds -
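The context-dependent slot selection described above can be illustrated with a small userspace model. This is a sketch under stated assumptions rather than the kernel implementation: in_interrupt_ctx is a hypothetical flag standing in for in_interrupt(), task_active_memcg stands in for current->active_memcg, and the percpu int_active_memcg is reduced to a single global. get_active_memcg() is left out because its reference counting has no userspace equivalent here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct mem_cgroup. */
struct mem_cgroup {
	const char *name;
};

/* Hypothetical stand-ins for kernel state. */
static bool in_interrupt_ctx;			/* models in_interrupt() */
static struct mem_cgroup *task_active_memcg;	/* models current->active_memcg */
static struct mem_cgroup *int_active_memcg;	/* models the percpu variable */

/* Pick the slot that belongs to the current execution context. */
static struct mem_cgroup **active_memcg_slot(void)
{
	return in_interrupt_ctx ? &int_active_memcg : &task_active_memcg;
}

/* Install a new active memcg for this context and return the old one. */
static struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
{
	struct mem_cgroup **slot = active_memcg_slot();
	struct mem_cgroup *old = *slot;

	*slot = memcg;
	return old;
}

/* Read the active memcg for this context, or NULL if none is set. */
static struct mem_cgroup *active_memcg(void)
{
	return *active_memcg_slot();
}
```

Because the two contexts use separate slots, an interrupt that arrives while a task has an active memcg set sees NULL (or its own value) and does not inherit the interrupted task's target.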
There are checks for current->mm and current->active_memcg in
get_obj_cgroup_from_current(), but these checks are redundant:
memcg_kmem_bypass() called just above performs the same checks.

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200827225843.1270629-3-guro@fb.com
Signed-off-by: Linus Torvalds -
Patch series "mm: kmem: kernel memory accounting in an interrupt context".

This patchset implements memcg-based memory accounting of allocations made
from an interrupt context.

Historically, such allocations were passed unaccounted mostly because
charging the memory cgroup of the current process wasn't an option.
Performance concerns were likely a reason too.

The remote charging API allows temporarily overwriting the currently
active memory cgroup, so that all memory allocations are accounted towards
some specified memory cgroup instead of the memory cgroup of the current
process.

This patchset extends the remote charging API so that it can be used from
an interrupt context. Then it removes the fence that prevented the
accounting of allocations made from an interrupt context. It also
contains a couple of optimizations/code refactorings.

This patchset doesn't directly enable accounting for any specific
allocations, but prepares the code base for it. The bpf memory accounting
will likely be the first user of it: a typical example is a bpf program
parsing an incoming network packet, which allocates an entry in a hashmap
map to store some information.

This patch (of 4):

Currently memcg_kmem_bypass() is called before obtaining the current
memory/obj cgroup using get_mem/obj_cgroup_from_current(). Moving
memcg_kmem_bypass() into get_mem/obj_cgroup_from_current() reduces the
number of call sites and allows further code simplifications.

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Michal Hocko
Link: http://lkml.kernel.org/r/20200827225843.1270629-1-guro@fb.com
Link: http://lkml.kernel.org/r/20200827225843.1270629-2-guro@fb.com
Signed-off-by: Linus Torvalds -
Currently the remote memcg charging API consists of two functions:
memalloc_use_memcg() and memalloc_unuse_memcg(), which set and clear the
memcg value, which overwrites the memcg of the current task.

  memalloc_use_memcg(target_memcg);
  memalloc_unuse_memcg();

It works perfectly for allocations performed from a normal context,
however an attempt to call it from an interrupt context or just nest two
remote charging blocks will lead to incorrect accounting. On exit from
the inner block the active memcg will be cleared instead of being
restored.

  memalloc_use_memcg(target_memcg);

    memalloc_use_memcg(target_memcg_2);
    memalloc_unuse_memcg();

    Error: allocations here are charged to the memcg of the current
    process instead of target_memcg.

  memalloc_unuse_memcg();

This patch extends the remote charging API by switching to a single
function: struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg),
which sets the new value and returns the old one. So a remote charging
block will look like:

  old_memcg = set_active_memcg(target_memcg);
  set_active_memcg(old_memcg);

This patch is heavily based on the patch by Johannes Weiner, which can be
found here: https://lkml.org/lkml/2020/5/28/806 .

Signed-off-by: Roman Gushchin
Signed-off-by: Andrew Morton
Reviewed-by: Shakeel Butt
Cc: Johannes Weiner
Cc: Dan Schatzberg
Link: https://lkml.kernel.org/r/20200821212056.3769116-1-guro@fb.com
Signed-off-by: Linus Torvalds
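The nesting behavior described above can be checked with a small userspace model. This is only a sketch: active_slot is a hypothetical global standing in for the per-task active memcg, struct mem_cgroup is reduced to a name, and nested_charging_demo() is an illustrative helper that is not part of the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-in for the kernel's struct mem_cgroup. */
struct mem_cgroup {
	const char *name;
};

/* Models the per-task active memcg; NULL means "charge the current task". */
static struct mem_cgroup *active_slot;

/* Install a new active memcg and hand back the previous one. */
static struct mem_cgroup *set_active_memcg(struct mem_cgroup *memcg)
{
	struct mem_cgroup *old = active_slot;

	active_slot = memcg;
	return old;
}

/*
 * Run two nested remote charging blocks and report which memcg is active
 * after the inner block has exited. With save/restore semantics this is
 * the outer target again, not NULL.
 */
static struct mem_cgroup *nested_charging_demo(struct mem_cgroup *outer,
					       struct mem_cgroup *inner)
{
	struct mem_cgroup *old_outer, *old_inner, *seen;

	old_outer = set_active_memcg(outer);

	old_inner = set_active_memcg(inner);
	/* allocations here would be charged to inner */
	set_active_memcg(old_inner);

	/* allocations here are charged to outer again */
	seen = active_slot;

	set_active_memcg(old_outer);
	return seen;
}
```

Returning the old value makes each block responsible for restoring exactly what it replaced, which is why nesting works without a separate "unuse" call.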