17 Oct, 2020
1 commit
-
The comment talks about having to hold mmget() (which means mm_users), but
the actual check is on mm_count (which would be mmgrab()).

Given that MMU notifiers are torn down in mmput() -> __mmput() ->
exit_mmap() -> mmu_notifier_release(), I believe that the comment is
correct and the check should be on mm->mm_users. Fix it up accordingly.

Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier")
Signed-off-by: Jann Horn
Signed-off-by: Andrew Morton
Reviewed-by: Jason Gunthorpe
Cc: John Hubbard
Cc: Christoph Hellwig
Cc: Christian König
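
The difference between the two counters in the commit above can be sketched in userspace C. This is a hypothetical model, not kernel code: struct mm_model, model_mmput() and model_can_register() are invented names, and the only behaviour taken from the commit message is that notifiers are torn down when the last mm_users reference is dropped.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical userspace model of the two mm_struct reference counts. */
struct mm_model {
	int mm_users;		/* address-space users; taken with mmget() */
	int mm_count;		/* references to the struct; taken with mmgrab() */
	bool notifiers_alive;	/* cleared when the last user goes away */
};

/* Models mmput() -> __mmput() -> exit_mmap() -> mmu_notifier_release(). */
static void model_mmput(struct mm_model *mm)
{
	if (--mm->mm_users == 0)
		mm->notifiers_alive = false;
}

/* The corrected check: a subscription is only safe while mm_users > 0. */
static bool model_can_register(const struct mm_model *mm)
{
	return mm->mm_users > 0;
}
```

In this model a caller holding only an mm_count reference keeps the structure allocated, yet model_can_register() reports false once the last user is gone, which is the situation the old mm_count check failed to catch.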
13 Aug, 2020
1 commit
-
Fix W=1 compile warnings (invalid kerneldoc):
mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_begin'
mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_register'
mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'

Signed-off-by: Krzysztof Kozlowski
Signed-off-by: Andrew Morton
Reviewed-by: Jason Gunthorpe
Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org
Signed-off-by: Linus Torvalds
10 Jun, 2020
4 commits
-
Convert comments that reference mmap_sem to reference mmap_lock instead.
[akpm@linux-foundation.org: fix up linux-next leftovers]
[akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
[akpm@linux-foundation.org: more linux-next fixups, per Michel]
Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Daniel Jordan
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Laurent Dufour
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
Signed-off-by: Linus Torvalds
-
Rename the mmap_sem field to mmap_lock. Any new uses of this lock should
now go through the new mmap locking api. The mmap_lock is still
implemented as a rwsem, though this could change in the future.

[akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch]
Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Davidlohr Bueso
Reviewed-by: Daniel Jordan
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Laurent Dufour
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com
Signed-off-by: Linus Torvalds
-
Add new APIs to assert that mmap_sem is held.
Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
makes the assertions more tolerant of future changes to the lock type.

Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Vlastimil Babka
Reviewed-by: Daniel Jordan
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Laurent Dufour
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
Signed-off-by: Linus Torvalds
-
This change converts the existing mmap_sem rwsem calls to use the new mmap
locking API instead.

The change is generated using coccinelle with the following rule:
// spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .
@@
expression mm;
@@
(
-init_rwsem
+mmap_init_lock
|
-down_write
+mmap_write_lock
|
-down_write_killable
+mmap_write_lock_killable
|
-down_write_trylock
+mmap_write_trylock
|
-up_write
+mmap_write_unlock
|
-downgrade_write
+mmap_write_downgrade
|
-down_read
+mmap_read_lock
|
-down_read_killable
+mmap_read_lock_killable
|
-down_read_trylock
+mmap_read_trylock
|
-up_read
+mmap_read_unlock
)
-(&mm->mmap_sem)
+(mm)

Signed-off-by: Michel Lespinasse
Signed-off-by: Andrew Morton
Reviewed-by: Daniel Jordan
Reviewed-by: Laurent Dufour
Reviewed-by: Vlastimil Babka
Cc: Davidlohr Bueso
Cc: David Rientjes
Cc: Hugh Dickins
Cc: Jason Gunthorpe
Cc: Jerome Glisse
Cc: John Hubbard
Cc: Liam Howlett
Cc: Matthew Wilcox
Cc: Peter Zijlstra
Cc: Ying Han
Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
Signed-off-by: Linus Torvalds
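
The effect of the coccinelle rule above can be illustrated with a small userspace sketch. The wrapper names (mmap_read_trylock() and friends) are the ones listed in the rule; everything else is an assumption made for illustration: struct toy_rwsem is a single-threaded stand-in for the kernel rwsem, and struct mm_model stands in for mm_struct.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy single-threaded stand-in for a kernel rwsem; not a real lock. */
struct toy_rwsem {
	int readers;	/* number of read holders */
	bool writer;	/* true while write-held */
};

struct mm_model {
	struct toy_rwsem mmap_lock;
};

static bool toy_down_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_up_read(struct toy_rwsem *sem)
{
	sem->readers--;
}

static bool toy_down_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers)
		return false;
	sem->writer = true;
	return true;
}

static void toy_up_write(struct toy_rwsem *sem)
{
	sem->writer = false;
}

/* After the conversion, callers pass the mm and never name the lock field. */
static inline bool mmap_read_trylock(struct mm_model *mm)
{
	return toy_down_read_trylock(&mm->mmap_lock);
}

static inline void mmap_read_unlock(struct mm_model *mm)
{
	toy_up_read(&mm->mmap_lock);
}

static inline bool mmap_write_trylock(struct mm_model *mm)
{
	return toy_down_write_trylock(&mm->mmap_lock);
}

static inline void mmap_write_unlock(struct mm_model *mm)
{
	toy_up_write(&mm->mmap_lock);
}
```

Because call sites only see the mmap_*() wrappers, the lock type behind them could change without touching the callers.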
22 Mar, 2020
1 commit
-
It is safe to traverse mm->notifier_subscriptions->list either under
SRCU read lock or mm->notifier_subscriptions->lock using
hlist_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positives,
for example,

WARNING: suspicious RCU usage
-----------------------------
mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!

other info that might help us debug this:
rcu_scheduler_active = 2, debug_locks = 1
3 locks held by libvirtd/802:
#0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
#1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
#2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460

stack backtrace:
CPU: 7 PID: 802 Comm: libvirtd Tainted: G I 5.6.0-rc6-next-20200317+ #2
Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
Call Trace:
dump_stack+0xa4/0xfe
lockdep_rcu_suspicious+0xeb/0xf5
__mmu_notifier_invalidate_range_start+0x3ff/0x460
change_p4d_range+0x746/0x800
change_protection+0x1df/0x300
mprotect_fixup+0x245/0x3e0
do_mprotect_pkey+0x23b/0x3e0
__x64_sys_mprotect+0x51/0x70
do_syscall_64+0x91/0xae8
entry_SYSCALL_64_after_hwframe+0x49/0xb3

Signed-off-by: Qian Cai
Signed-off-by: Andrew Morton
Reviewed-by: Paul E. McKenney
Reviewed-by: Jason Gunthorpe
Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
Signed-off-by: Linus Torvalds
14 Jan, 2020
3 commits
-
The 'interval_sub' is placed on the 'notifier_subscriptions' interval
tree.

This eliminates the poor name 'mni' for this variable.
Signed-off-by: Jason Gunthorpe
-
The 'subscription' is placed on the 'notifier_subscriptions' list.
This eliminates the poor name 'mn' for this variable.
Signed-off-by: Jason Gunthorpe
-
The name mmu_notifier_mm implies that the thing is a mm_struct pointer,
and is difficult to abbreviate. The struct is actually holding the
interval tree and hlist containing the notifiers subscribed to a mm.

Use 'subscriptions' as the variable name for this struct instead of the
really terrible and misleading 'mmn_mm'.

Signed-off-by: Jason Gunthorpe
01 Dec, 2019
1 commit
-
Pull hmm updates from Jason Gunthorpe:
"This is another round of bug fixing and cleanup. This time the focus
is on the driver pattern to use mmu notifiers to monitor a VA range.
This code is lifted out of many drivers and hmm_mirror directly into
the mmu_notifier core and written using the best ideas from all the
driver implementations.

This removes many bugs from the drivers and has a very pleasing
diffstat. More drivers can still be converted, but that is for another
cycle.

- A shared branch with RDMA reworking the RDMA ODP implementation
- New mmu_interval_notifier API. This is focused on the use case of
monitoring a VA and simplifies the process for drivers
- A common seq-count locking scheme built into the
mmu_interval_notifier API usable by drivers that call
get_user_pages() or hmm_range_fault() with the VA range
- Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
GntDev drivers to the new API. This deletes a lot of wonky driver
code.
- Two improvements for hmm_range_fault(), from testing done by Ralph"
* tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
mm/hmm: make full use of walk_page_range()
xen/gntdev: use mmu_interval_notifier_insert
mm/hmm: remove hmm_mirror and related
drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
drm/amdgpu: Call find_vma under mmap_sem
nouveau: use mmu_interval_notifier instead of hmm_mirror
nouveau: use mmu_notifier directly for invalidate_range_start
drm/radeon: use mmu_interval_notifier_insert
RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
RDMA/odp: Use mmu_interval_notifier_insert()
mm/hmm: define the pre-processor related parts of hmm.h even if disabled
mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
mm/mmu_notifier: add an interval tree notifier
mm/mmu_notifier: define the header pre-processor parts even if disabled
mm/hmm: allow snapshot of the special zero page
24 Nov, 2019
1 commit
-
Of the 13 users of mmu_notifiers, 8 of them use only
invalidate_range_start/end() and immediately intersect the
mmu_notifier_range with some kind of internal list of VAs. 4 use an
interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
of some kind (scif_dma, vhost, gntdev, hmm).

And the remaining 5 either don't use invalidate_range_start() or do some
special thing with it.

It turns out that building a correct scheme with an interval tree is
pretty complicated, particularly if the use case is synchronizing against
another thread doing get_user_pages(). Many of these implementations have
various subtle and difficult to fix races.

This approach puts the interval tree as common code at the top of the mmu
notifier call tree and implements a shareable locking scheme.

It includes:
- An interval tree tracking VA ranges, with per-range callbacks
- A read/write locking scheme for the interval tree that avoids
sleeping in the notifier path (for OOM killer)
- A sequence counter based collision-retry locking scheme to tell
device page fault that a VA range is being concurrently invalidated.

This is based on various ideas:
- hmm accumulates invalidated VA ranges and releases them when all
invalidates are done, via active_invalidate_ranges count.
This approach avoids having to intersect the interval tree twice (as
umem_odp does) at the potential cost of a longer device page fault.
- kvm/umem_odp use a sequence counter to drive the collision retry,
via invalidate_seq
- a deferred work todo list on unlock scheme like RTNL, via deferred_list.
This makes adding/removing interval tree members more deterministic
- seqlock, except this version makes the seqlock idea multi-holder on the
write side by protecting it with active_invalidate_ranges and a spinlock

To minimize MM overhead when only the interval tree is being used, the
entire SRCU and hlist overheads are dropped using some simple
branches. Similarly the interval tree overhead is dropped when in hlist
mode.

The overhead from the mandatory spinlock is broadly the same as most of
existing users which already had a lock (or two) of some sort on the
invalidation path.

Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
Acked-by: Christian König
Tested-by: Philip Yang
Tested-by: Ralph Campbell
Reviewed-by: John Hubbard
Reviewed-by: Christoph Hellwig
Signed-off-by: Jason Gunthorpe
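
The sequence counter based collision-retry idea described above can be sketched as a single-threaded userspace model. It is a deliberate simplification and an assumption on my part: all names (range_model, model_read_begin(), model_read_retry(), model_fault()) are hypothetical, there is no locking, and the real interval notifier code also has to wait for invalidations that are still in progress.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical per-range state: bumped by every invalidation of the range. */
struct range_model {
	unsigned long invalidate_seq;
};

static unsigned long model_read_begin(const struct range_model *r)
{
	return r->invalidate_seq;	/* snapshot taken before touching pages */
}

static bool model_read_retry(const struct range_model *r, unsigned long seq)
{
	return r->invalidate_seq != seq;	/* true if an invalidation collided */
}

static void model_invalidate(struct range_model *r)
{
	r->invalidate_seq++;
}

/* Test helper: simulates invalidations racing with the page lookup. */
static int pending_invalidations;

static void racing_collect(struct range_model *r)
{
	if (pending_invalidations > 0) {
		pending_invalidations--;
		model_invalidate(r);
	}
}

/*
 * Device fault path: snapshot the sequence, collect the pages (stands in
 * for get_user_pages() or hmm_range_fault()), and start over if the range
 * was invalidated in between. Returns the number of attempts needed.
 */
static int model_fault(struct range_model *r,
		       void (*collect_pages)(struct range_model *))
{
	unsigned long seq;
	int attempts = 0;

	do {
		seq = model_read_begin(r);
		attempts++;
		collect_pages(r);
	} while (model_read_retry(r, seq));

	return attempts;
}
```

With one simulated collision the fault path runs twice; with none it completes on the first attempt.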
13 Nov, 2019
1 commit
-
Now that we have KERNEL_HEADER_TEST all headers are generally compile
tested, so relying on makefile tricks to avoid compiling code that depends
on CONFIG_MMU_NOTIFIER is more annoying.

Instead follow the usual pattern and provide most of the header with only
the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
ensures code compiles no matter what the config setting is.

While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.
Link: https://lore.kernel.org/r/20191112202231.3856-2-jgg@ziepe.ca
Reviewed-by: Jérôme Glisse
Tested-by: Ralph Campbell
Reviewed-by: John Hubbard
Reviewed-by: Christoph Hellwig
Signed-off-by: Jason Gunthorpe
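
The "usual pattern" mentioned above, where a header provides inline stubs when a config option is off, might look roughly like the following. CONFIG_MODEL_NOTIFIER, struct mm_model, model_has_notifiers() and model_unmap() are hypothetical names chosen for this sketch, not the identifiers used in the kernel header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct mm_model;	/* opaque to callers */

#ifdef CONFIG_MODEL_NOTIFIER
/* Real implementation lives in a .c file when the feature is enabled. */
bool model_has_notifiers(const struct mm_model *mm);
#else
/* Stub: same signature, so callers compile unchanged when disabled. */
static inline bool model_has_notifiers(const struct mm_model *mm)
{
	(void)mm;
	return false;
}
#endif

/* Caller code needs no #ifdef of its own. Returns callbacks that would run. */
static int model_unmap(const struct mm_model *mm)
{
	int callbacks_run = 0;

	if (model_has_notifiers(mm))
		callbacks_run++;
	return callbacks_run;
}
```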
07 Nov, 2019
1 commit
-
The return code from the op callback is actually in _ret, while the
WARN_ON was checking ret which causes it to misfire.

Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
Fixes: 8402ce61bec2 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
Signed-off-by: Jason Gunthorpe
Reviewed-by: Andrew Morton
Cc: Daniel Vetter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
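
The _ret versus ret mix-up fixed above is easy to reproduce in a small model. The sketch below is hypothetical (model_invalidate_start() and warn_count are invented, and callback results are passed in as an array), but it keeps the two variable names from the commit message: _ret is one callback's return code, ret is the accumulated result.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

static int warn_count;	/* stands in for WARN_ON() firing */

/* cb_results[i] is the value the i-th notifier callback would return. */
static int model_invalidate_start(const int *cb_results, int n, bool blockable)
{
	int ret = 0;

	for (int i = 0; i < n; i++) {
		int _ret = cb_results[i];

		if (_ret) {
			/*
			 * Fixed check: look at _ret, the code this callback
			 * returned. Testing ret here would compare the stale
			 * accumulated value (still 0 on the first failure)
			 * and warn even for a legitimate -EAGAIN.
			 */
			if (blockable || _ret != -EAGAIN)
				warn_count++;
			ret = _ret;
		}
	}
	return ret;
}
```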
07 Sep, 2019
3 commits
-
We need to make sure implementations don't cheat and don't have a possible
schedule/blocking point deeply buried where review can't catch it.

I'm not sure whether this is the best way to make sure all the
might_sleep() callsites trigger, and it's a bit ugly in the code flow.
But it gets the job done.

Inspired by an i915 patch series which did exactly that, because the rules
haven't been entirely clear to us.

Link: https://lore.kernel.org/r/20190826201425.17547-5-daniel.vetter@ffwll.ch
Reviewed-by: Christian König (v1)
Reviewed-by: Jérôme Glisse (v4)
Signed-off-by: Daniel Vetter
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
-
We want to teach lockdep that mmu notifiers can be called from direct
reclaim paths, since on many CI systems load might never reach that
level (e.g. when just running fuzzer or small functional tests).

I've put the annotation into mmu_notifier_register since only when we have
mmu notifiers registered is there any point in teaching lockdep about
them. Also, we already have a kmalloc(, GFP_KERNEL), so this is safe.

Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.ch
Suggested-by: Jason Gunthorpe
Reviewed-by: Jason Gunthorpe
Signed-off-by: Daniel Vetter
Signed-off-by: Jason Gunthorpe
-
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly
easy to provoke a specific notifier to be run on a specific range: Just
prep it, and then munmap() it.

A bit harder, but still doable, is to provoke the mmu notifiers for all
the various callchains that might lead to them. But both at the same time
is really hard to reliably hit, especially when you want to exercise paths
like direct reclaim or compaction, where it's not easy to control what
exactly will be unmapped.

By introducing a lockdep map to tie them all together we allow lockdep to
see a lot more dependencies, without having to actually hit them in a
single callchain while testing.

On Jason's suggestion this is rolled out for both
invalidate_range_start and invalidate_range_end. They both have the same
calling context, hence we can share the same lockdep map. Note that the
annotation for invalidate_range_start is outside of the
mm_has_notifiers(), to make sure lockdep is informed about all paths
leading to this context irrespective of whether mmu notifiers are present
for a given context. We don't do that on the invalidate_range_end side to
avoid paying the overhead twice, there the lockdep annotation is pushed
down behind the mm_has_notifiers() check.

Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.ch
Reviewed-by: Jason Gunthorpe
Signed-off-by: Daniel Vetter
Signed-off-by: Jason Gunthorpe
28 Aug, 2019
1 commit
-
No modular code uses these, which makes a lot of sense given the wrappers
around them are only called by core mm code.

Link: https://lore.kernel.org/r/20190828142109.29012-1-hch@lst.de
Signed-off-by: Christoph Hellwig
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
22 Aug, 2019
1 commit
-
mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
longer have any users, they have all been converted to use
mmu_notifier_put().

So delete this difficult to use interface.
Link: https://lore.kernel.org/r/20190806231548.25242-12-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
20 Aug, 2019
1 commit
-
Just a bit of paranoia, since if we start pushing this deep into
callchains it's hard to spot all places where an mmu notifier
implementation might fail when it's not allowed to.

Inspired by some confusion we had discussing i915 mmu notifiers and
whether we could use the newly-introduced return value to handle some
corner cases. Until we realized that these are only for when a task has
been killed by the oom reaper.

An alternative approach would be to split the callback into two versions,
one with the int return value, and the other with void return value like
in older kernels. But that's a lot more churn for fairly little gain I
think.

Summary from the m-l discussion on why we want something at warning level:
This allows automated tooling in CI to catch bugs without humans having to
look at everything. If we just upgrade the existing pr_info to a pr_warn,
then we'll have false positives. And as-is, no one will ever spot the
problem since it's lost in the massive amounts of overall dmesg noise.

Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
16 Aug, 2019
3 commits
-
Many places in the kernel have a flow where userspace will create some
object and that object will need to connect to the subsystem's
mmu_notifier subscription for the duration of its lifetime.

In this case the subsystem is usually tracking multiple mm_structs and it
is difficult to keep track of what struct mmu_notifier's have been
allocated for what mm's.

Since this has been open coded in a variety of exciting ways, provide core
functionality to do this safely.

This approach uses the struct mmu_notifier_ops * as a key to determine if
the subsystem has a notifier registered on the mm or not. If there is a
registration then the existing notifier struct is returned, otherwise the
ops->alloc_notifiers() is used to create a new per-subsystem notifier for
the mm.

The destroy side incorporates an async call_srcu based destruction which
will avoid bugs in the callers such as commit 6d7c3cde93c1 ("mm/hmm: fix
use after free with struct hmm in the mmu notifiers").

Since we are inside the mmu notifier core locking is fairly simple, the
allocation uses the same approach as for mmu_notifier_mm, the write side
of the mmap_sem makes everything deterministic and we only need to do
hlist_add_head_rcu() under the mm_take_all_locks(). The new users count
and the discoverability in the hlist is fully serialized by the
mmu_notifier_mm->lock.

Link: https://lore.kernel.org/r/20190806231548.25242-4-jgg@ziepe.ca
Co-developed-by: Christoph Hellwig
Signed-off-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
-
A prior commit e0f3c3f78da2 ("mm/mmu_notifier: init notifier if necessary")
made an attempt at doing this, but had to be reverted as calling
the GFP_KERNEL allocator under the i_mmap_mutex causes deadlock, see
commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance").

However, we can avoid that problem by doing the allocation only under
the mmap_sem, which is already happening.

Since all writers to mm->mmu_notifier_mm hold the write side of the
mmap_sem reading it under that sem is deterministic and we can use that to
decide if the allocation path is required, without speculation.

The actual update to mmu_notifier_mm must still be done under the
mm_take_all_locks() to ensure read-side coherency.

Link: https://lore.kernel.org/r/20190806231548.25242-3-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
-
This simplifies the code to not have so many one line functions and extra
logic. __mmu_notifier_register() simply becomes the entry point to
register the notifier, and the other one calls it under lock.

Also add a lockdep_assert to check that the callers are holding the lock
as expected.

Link: https://lore.kernel.org/r/20190806231548.25242-2-jgg@ziepe.ca
Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
13 Jul, 2019
1 commit
-
Make mmu_notifier_register() safer by issuing a memory barrier before
registering a new notifier. This fixes a theoretical bug on weakly
ordered CPUs. For example, take this simplified use of notifiers by a
driver:

	my_struct->mn.ops = &my_ops; /* (1) */
	mmu_notifier_register(&my_struct->mn, mm)
		...
		hlist_add_head(&mn->hlist, &mm->mmu_notifiers); /* (2) */
		...

Once mmu_notifier_register() releases the mm locks, another thread can
invalidate a range:

	mmu_notifier_invalidate_range()
		...
		hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
			if (mn->ops->invalidate_range)

The read side relies on the data dependency between mn and ops to ensure
that the pointer is properly initialized. But the write side doesn't have
any dependency between (1) and (2), so they could be reordered and the
readers could dereference an invalid mn->ops. mmu_notifier_register()
does take all the mm locks before adding to the hlist, but those have
acquire semantics which isn't sufficient.

By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
hlist using a store-release, ensuring that readers see prior
initialization of my_struct. This situation is better illustrated by
litmus test MP+onceassign+derefonce.

Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
Fixes: cddb8a5c14aa ("mmu-notifiers: core")
Signed-off-by: Jean-Philippe Brucker
Cc: Jérôme Glisse
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
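
The ordering requirement described in the commit above (initialize the notifier, then publish it with a store-release) can be approximated in portable C11. This is a sketch under stated assumptions, not the kernel implementation: notifier_model, ops_model, publish() and walk() are hypothetical names, a single writer is assumed, and the reader uses an acquire load as the C11 stand-in for the dependency ordering that the kernel's RCU list primitives rely on.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

struct ops_model {
	int (*invalidate_range)(void);
};

struct notifier_model {
	const struct ops_model *ops;
	struct notifier_model *next;
};

/* Head of a singly linked list of published notifiers. */
static _Atomic(struct notifier_model *) list_head;

/* Writer side (single writer assumed, as if all mm locks were held). */
static void publish(struct notifier_model *mn, const struct ops_model *ops)
{
	mn->ops = ops;						/* (1) init */
	mn->next = atomic_load_explicit(&list_head, memory_order_relaxed);
	/* (2) release store: (1) becomes visible before mn is reachable. */
	atomic_store_explicit(&list_head, mn, memory_order_release);
}

/* Reader side: every notifier it can see has a valid ops pointer. */
static int walk(void)
{
	int calls = 0;
	struct notifier_model *mn =
		atomic_load_explicit(&list_head, memory_order_acquire);

	for (; mn; mn = mn->next)
		if (mn->ops->invalidate_range)
			calls += mn->ops->invalidate_range();
	return calls;
}

/* Trivial callback used to count invocations. */
static int count_one(void)
{
	return 1;
}
```

Replacing the release store with a plain store would reintroduce the reordering hazard between (1) and (2) on weakly ordered CPUs.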
19 Jun, 2019
1 commit
-
Based on 1 normalized pattern(s):
this work is licensed under the terms of the gnu gpl version 2 see
the copying file in the top level directory

extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-only
has been chosen to replace the boilerplate/reference in 35 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Kate Stewart
Reviewed-by: Enrico Weigelt
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman
15 May, 2019
2 commits
-
Helper to test if a range is updated to read only (it is still valid to
read from the range). This is useful for device drivers or anyone who wishes
to optimize out an update when they know that they already have the range
mapped read only.

Link: http://lkml.kernel.org/r/20190326164747.24405-9-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use the mmu_notifier_range_blockable() helper function instead of directly
dereferencing the range->blockable field. This is done to make it easier
to change the mmu_notifier range field.

This patch is the outcome of the following coccinelle patch:
%blockable
+mmu_notifier_range_blockable(I1)
...>
}
------------------------------------------------------------------->%

spatch --in-place --sp-file blockable.spatch --dir .
Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
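
Why an accessor makes the range field easier to change can be shown with a small sketch. The names below (range_model, MODEL_RANGE_BLOCKABLE, model_range_blockable(), model_callback()) are hypothetical; the point is only that callers keep working when the storage behind the helper changes, here from a boolean to a flag bit.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

#define MODEL_RANGE_BLOCKABLE (1u << 0)

/* The boolean field has been replaced by a flags word in this model. */
struct range_model {
	unsigned long start;
	unsigned long end;
	unsigned int flags;
};

/* Only this helper knows how "blockable" is stored. */
static inline bool model_range_blockable(const struct range_model *range)
{
	return (range->flags & MODEL_RANGE_BLOCKABLE) != 0;
}

/* A callback written against the helper is unaffected by the change. */
static int model_callback(const struct range_model *range)
{
	if (!model_range_blockable(range))
		return -EAGAIN;	/* pretend this driver always needs to sleep */
	return 0;
}
```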
29 Dec, 2018
3 commits
-
To avoid having to change many call sites every time we want to add a
parameter use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end calls. No functional changes with this patch.

[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Acked-by: Christian König
Acked-by: Jan Kara
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Felix Kuehling
Cc: Ralph Campbell
Cc: John Hubbard
From: Jérôme Glisse
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Patch series "mmu notifier contextual informations", v2.

This patchset adds contextual information, why an invalidation is
happening, to mmu notifier callbacks. This is necessary for users of mmu
notifiers that wish to maintain their own data structures without having to
add new fields to struct vm_area_struct (vma).

For instance, a device can have its own page table that mirrors the process
address space. When a vma is unmapped (munmap() syscall) the device driver
can free the device page table for the range.

Today we do not have any information on why a mmu notifier callback is
happening and thus device drivers have to assume that it is always an
munmap(). This is inefficient as it means that the driver needs to re-allocate
the device page table on the next page fault and rebuild the whole device
driver data structure for the range.

Other use cases besides munmap() also exist. For instance, it is pointless
for a device driver to invalidate the device page table when the
invalidation is for soft dirtiness tracking. Or a device driver can
optimize away an mprotect() that changes the page table access permissions
for the range.

This patchset enables all these optimizations for device drivers. I do not
include any of those in this series but another patchset I am posting will
leverage this.

The patchset is pretty simple from a code point of view. The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments. The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).

This patch (of 3):

To avoid having to change many callback definitions every time we want to
add a parameter, use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callbacks. No functional changes
with this patch.

[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Acked-by: Jan Kara
Acked-by: Jason Gunthorpe [infiniband]
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: Felix Kuehling
Cc: Ralph Campbell
Cc: John Hubbard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Contrary to its name, mmu_notifier_synchronize() does not synchronize the
notifier's SRCU instance, but rather waits for RCU callbacks to finish.
i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on
this matter, explicitly calling out that rcu_barrier() does not imply
synchronize_rcu().

As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.

Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com
Signed-off-by: Sean Christopherson
Reviewed-by: Andrew Morton
Cc: Peter Zijlstra
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Oct, 2018
1 commit
-
Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with
blockable invalidate callbacks").

The MMU_INVALIDATE_DOES_NOT_BLOCK flag was the only one used and it is no
longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for
mmu notifiers"). We now have full support for per range !blocking
behavior so we can drop the stop gap workaround which the per notifier
flag was used for.

Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: David Rientjes
Cc: Boris Ostrovsky
Cc: Jerome Glisse
Cc: Juergen Gross
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Aug, 2018
1 commit
-
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start and that is a problem for the
oom_reaper because it needs to guarantee a forward progress so it cannot
depend on any sleepable locks.

Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.

We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover,
the majority of notifiers only care about a portion of the address space and
there is absolutely zero reason to fail when we are unmapping an unrelated
range. Many notifiers do really block and wait for HW which is harder to
handle and we have to bail out though.

This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continue as long as we do not block down the call chain.

I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.

The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.

The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and set the hard limit to something really
small. Then we are looking for a proper process tear down.

[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Christian König # AMD notifiers
Acked-by: Leon Romanovsky # mlx and umem_odp
Reported-by: David Rientjes
Cc: "David (ChunMing) Zhou"
Cc: Paolo Bonzini
Cc: Alex Deucher
Cc: David Airlie
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Doug Ledford
Cc: Jason Gunthorpe
Cc: Mike Marciniszyn
Cc: Dennis Dalessandro
Cc: Sudeep Dutt
Cc: Ashutosh Dixit
Cc: Dimitri Sivanich
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: "Jérôme Glisse"
Cc: Andrea Arcangeli
Cc: Felix Kuehling
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
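
The "trylock instead of the sleepable lock" approach from the commit above can be sketched with a toy lock. Everything here is an illustrative assumption: driver_model and the model_*() functions are hypothetical, and a C11 atomic_flag spin lock stands in for the sleepable lock a real driver would use.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <stdbool.h>

struct driver_model {
	atomic_flag lock;	/* toy stand-in for a sleepable driver lock */
	int invalidations;	/* ranges successfully invalidated */
};

static bool model_trylock(struct driver_model *d)
{
	return !atomic_flag_test_and_set(&d->lock);
}

static void model_lock(struct driver_model *d)
{
	while (atomic_flag_test_and_set(&d->lock))
		;	/* a real driver would sleep here instead of spinning */
}

static void model_unlock(struct driver_model *d)
{
	atomic_flag_clear(&d->lock);
}

/*
 * Callback sketch: in !blockable mode the callback must not wait, so it
 * tries the lock and reports -EAGAIN on contention, letting the caller
 * (the oom_reaper in the commit message) retry later.
 */
static int model_invalidate_range_start(struct driver_model *d, bool blockable)
{
	if (blockable)
		model_lock(d);
	else if (!model_trylock(d))
		return -EAGAIN;

	d->invalidations++;
	model_unlock(d);
	return 0;
}
```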
01 Feb, 2018
1 commit
-
Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu
notifiers") prevented the oom reaper from unmapping private anonymous
memory when the oom victim mm had mmu notifiers registered.

The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
around the unmap_page_range(), which is needed, can block and the oom
killer will stall forever waiting for the victim to exit, which may not
be possible without reaping.

That concern is real, but only true for mmu notifiers that have
blockable invalidate_range_{start,end}() callbacks. This patch adds a
"flags" field to mmu notifier ops that can set a bit to indicate that
these callbacks do not block.

The implementation is steered toward an expensive slowpath, such as
after the oom reaper has grabbed mm->mmap_sem of a still alive oom
victim.

[rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
[akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Acked-by: Michal Hocko
Acked-by: Paolo Bonzini
Acked-by: Christian König
Acked-by: Dimitri Sivanich
Cc: Andrea Arcangeli
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Oded Gabbay
Cc: Alex Deucher
Cc: David Airlie
Cc: Joerg Roedel
Cc: Doug Ledford
Cc: Jani Nikula
Cc: Mike Marciniszyn
Cc: Sean Hefty
Cc: Boris Ostrovsky
Cc: Jérôme Glisse
Cc: Radim Krčmář
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
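
The "flags" idea from the commit above can be sketched in plain userspace C. This is only a model under stated assumptions: every name prefixed with model_ or MODEL_ is hypothetical and is not the kernel's API, and the list walk stands in for whatever locking and iteration the real code uses. It shows one way a bool-returning check could decide whether any registered notifier might block.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag bit: this notifier's invalidate callbacks never block. */
#define MODEL_INVALIDATE_DOES_NOT_BLOCK 0x01u

/* Hypothetical stand-in for the notifier ops structure with a flags field. */
struct model_notifier_ops {
    unsigned int flags;
};

/* Hypothetical registered notifier, kept on a singly linked list. */
struct model_notifier {
    const struct model_notifier_ops *ops;
    struct model_notifier *next;
};

/*
 * Return true if at least one notifier on the list has not set the
 * "does not block" bit. An empty list has no blockable notifiers.
 */
static bool model_has_blockable_notifiers(const struct model_notifier *head)
{
    for (const struct model_notifier *n = head; n != NULL; n = n->next) {
        if (!(n->ops->flags & MODEL_INVALIDATE_DOES_NOT_BLOCK))
            return true;
    }
    return false;
}
```

In this model a caller such as a reaper would skip the address space only when the check returns true, instead of skipping every address space that has any notifier registered.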
16 Nov, 2017
1 commit
-
This is an optimization patch that only affects mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end().

Existing pattern (before this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
        mmu_notifier_invalidate_range()

New pattern (after this patch):
    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()

We call the invalidate_range callback after clearing the page table
under the page table lock and we skip the call to invalidate_range
inside the __mmu_notifier_invalidate_range_end() function.

Idea from Andrea Arcangeli
Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
Cc: Joerg Roedel
Cc: Suravee Suthikulpanit
Cc: David Woodhouse
Cc: Alistair Popple
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Stephen Rothwell
Cc: Andrew Donnellan
Cc: Nadav Amit
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
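
The two call patterns in the commit above can be compared with a small userspace model. It is a sketch only: the model_ names are hypothetical, the functions merely count how often the modelled invalidate_range callback runs, and nothing here touches real page tables.

```c
#include <stdbool.h>

/* Counts how often the modelled invalidate_range callback ran. */
struct model_stats {
    int invalidate_range_calls;
};

static void model_invalidate_range(struct model_stats *s)
{
    s->invalidate_range_calls++;
}

/* Models pte/pmd/pud_clear_flush_notify(): clear, flush, then notify. */
static void model_clear_flush_notify(struct model_stats *s)
{
    model_invalidate_range(s);
}

/* Models the end of an invalidation; only_end skips the extra callback. */
static void model_range_end(struct model_stats *s, bool only_end)
{
    if (!only_end)
        model_invalidate_range(s);
}

/* Pattern before the patch: the callback runs twice. */
static int model_old_pattern(void)
{
    struct model_stats s = { 0 };
    model_clear_flush_notify(&s);
    model_range_end(&s, false);
    return s.invalidate_range_calls;
}

/* Pattern after the patch: the callback runs once. */
static int model_new_pattern(void)
{
    struct model_stats s = { 0 };
    model_clear_flush_notify(&s);
    model_range_end(&s, true);
    return s.invalidate_range_calls;
}
```

The model makes the saving explicit: the only_end variant relies on the callback having already run under the page table lock.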
01 Sep, 2017
1 commit
-
The invalidate_page callback suffered from two pitfalls. First, it used
to happen after the page table lock was released and thus a new page
might have been set up before the call to invalidate_page() happened.

This was in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
callback under the page table lock, but this also broke several existing
users of the mmu_notifier API that assumed they could sleep inside this
callback.

The second pitfall was invalidate_page() being the only callback that
did not take a range of addresses for invalidation but instead took an
address and a page. Many callback implementers assumed this could never
be THP and thus failed to invalidate the appropriate range for THP.

By killing this callback we unify the mmu_notifier callback API to
always take a virtual address range as input.

Finally, this also simplifies the end user's life, as there are now two
clear choices:
- invalidate_range_start()/end() callbacks (which allow you to sleep)
- invalidate_range(), where you cannot sleep but which happens right
  after the page table update under the page table lock

Signed-off-by: Jérôme Glisse
Cc: Bernhard Held
Cc: Adam Borowski
Cc: Andrea Arcangeli
Cc: Radim Krčmář
Cc: Wanpeng Li
Cc: Paolo Bonzini
Cc: Takashi Iwai
Cc: Nadav Amit
Cc: Mike Galbraith
Cc: Kirill A. Shutemov
Cc: axie
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
19 Apr, 2017
1 commit
-
The MM-notifier code currently dynamically initializes the srcu_struct
named "srcu" at subsys_initcall() time, and includes a BUG_ON() to check
this initialization in do_mmu_notifier_register(). Unfortunately, there
is no foolproof way to verify that an srcu_struct has been initialized,
given the possibility of an srcu_struct being allocated on the stack or
on the heap. This means that creating an srcu_struct_is_initialized()
function is not a reasonable course of action. Nor is peppering
do_mmu_notifier_register() with SRCU-specific #ifdefs an attractive
alternative.

This commit therefore uses DEFINE_STATIC_SRCU() to initialize
this srcu_struct at compile time, thus eliminating both the
subsys_initcall()-time initialization and the runtime BUG_ON().

Signed-off-by: Paul E. McKenney
Cc:
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Michal Hocko
Cc: "Peter Zijlstra (Intel)"
Cc: Vegard Nossum
02 Mar, 2017
1 commit
-
We are going to split out of , which
will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just
maps to to make this patch obviously correct and
bisectable.

The APIs that are going to be moved first are:
mm_alloc()
__mmdrop()
mmdrop()
mmdrop_async_fn()
mmdrop_async()
mmget_not_zero()
mmput()
mmput_async()
get_task_mm()
mm_access()
mm_release()

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
28 Feb, 2017
1 commit
-
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.

(Michal Hocko provided most of the kerneldoc comment.)
Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum
Acked-by: Michal Hocko
Acked-by: Peter Zijlstra (Intel)
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
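
The conversion in the commit above replaces an open-coded atomic_inc() on mm_count with a named helper. A minimal userspace sketch of that shape, using C11 atomics, might look as follows. The struct and the model_mmgrab() name are hypothetical stand-ins, not the kernel definitions; the two-counter split (mm_users for users of the address space, mm_count for references to the structure itself) follows the description elsewhere in this log.

```c
#include <stdatomic.h>

/* Hypothetical stand-in for the two reference counts of an mm. */
struct model_mm {
    atomic_int mm_users;  /* users of the address space */
    atomic_int mm_count;  /* references to the structure itself */
};

/*
 * Model of the helper: take one reference on the structure.
 * Before the conversion, callers would have open-coded the
 * increment of mm_count at every call site.
 */
static inline void model_mmgrab(struct model_mm *mm)
{
    atomic_fetch_add(&mm->mm_count, 1);
}
```

Funnelling every increment through one helper gives a single place to hook in later, which is the motivation the commit message states.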
18 Mar, 2016
1 commit
-
There are various email addresses for me throughout the kernel. Use the
one that will always be valid.

Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Sep, 2015
1 commit
-
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes. The latter is required in order
to estimate wss of KVM VMs. At the same time we want to avoid flushing
tlb, because it is quite expensive and it won't really affect the final
result.

Currently, there is no function for clearing pte young bit that would meet
our requirements, so this patch introduces one. To achieve that we have
to add a new mmu-notifier callback, clear_young, since there is no method
for testing-and-clearing a secondary pte w/o flushing tlb. The new method
is not mandatory and currently only implemented by KVM.

Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Acked-by: Paolo Bonzini
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
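
The difference between a flushing and a non-flushing "young" check, as described in the commit above, can be illustrated with a toy model. Everything here is an assumption made for illustration: the bit position, the model_ names, and the rule that the flushing variant flushes whenever the bit was set are simplifications, not the kernel's behaviour.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical position of the accessed/young bit in a modelled pte. */
#define MODEL_PTE_YOUNG (UINT64_C(1) << 5)

/* Counts modelled TLB flushes so the two variants can be compared. */
struct model_counters {
    int tlb_flushes;
};

/* Test and clear the young bit; report whether it was set. */
static bool model_test_and_clear_young(uint64_t *pte)
{
    bool young = (*pte & MODEL_PTE_YOUNG) != 0;
    *pte &= ~MODEL_PTE_YOUNG;
    return young;
}

/* Flushing variant: clears the bit and pays for a flush if it was set. */
static bool model_clear_flush_young(uint64_t *pte, struct model_counters *c)
{
    bool young = model_test_and_clear_young(pte);
    if (young)
        c->tlb_flushes++;
    return young;
}

/* Non-flushing variant: clears the bit and never flushes. */
static bool model_clear_young(uint64_t *pte, struct model_counters *c)
{
    (void)c; /* kept only so both variants share a signature */
    return model_test_and_clear_young(pte);
}
```

Both variants report the same answer for the same pte; the non-flushing one simply leaves the flush counter untouched, which matches the stated goal of avoiding the expensive flush when an estimate is good enough.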