07 Nov, 2019
1 commit
-
The return code from the op callback is actually in _ret, while the
WARN_ON was checking ret, which causes it to misfire.
Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
Fixes: 8402ce61bec2 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
Signed-off-by: Jason Gunthorpe
Reviewed-by: Andrew Morton
Cc: Daniel Vetter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
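A minimal sketch of the one-liner, abbreviated from the loop in __mmu_notifier_invalidate_range_start() as it looked at the time (surrounding context elided):

    int _ret = mn->ops->invalidate_range_start(mn, range);
    if (_ret) {
        /* must check the callback's _ret, not the outer accumulator ret */
        WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);
        ret = _ret;
    }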
07 Sep, 2019
3 commits
-
We need to make sure implementations don't cheat and don't have a possible
schedule/blocking point deeply buried where review can't catch it.
I'm not sure whether this is the best way to make sure all the
might_sleep() callsites trigger, and it's a bit ugly in the code flow.
But it gets the job done.
Inspired by an i915 patch series which did exactly that, because the rules
haven't been entirely clear to us.
Link: https://lore.kernel.org/r/20190826201425.17547-5-daniel.vetter@ffwll.ch
Reviewed-by: Christian König (v1)
Reviewed-by: Jérôme Glisse (v4)
Signed-off-by: Daniel Vetter
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
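A hedged sketch of the idea: run the blocking annotation on every call, before the mm_has_notifiers() check, so a buried sleep is flagged even when no notifier is registered (wrapper shape assumed from the mmu_notifier API of this period):

    static inline void
    mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
    {
        might_sleep();  /* trigger on every call, not only under real load */
        if (mm_has_notifiers(range->mm))
            __mmu_notifier_invalidate_range_start(range);
    }
-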
We want to teach lockdep that mmu notifiers can be called from direct
reclaim paths, since on many CI systems load might never reach that
level (e.g. when just running a fuzzer or small functional tests).
I've put the annotation into mmu_notifier_register since only when we have
mmu notifiers registered is there any point in teaching lockdep about
them. Also, we already have a kmalloc(, GFP_KERNEL), so this is safe.
Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.ch
Suggested-by: Jason Gunthorpe
Reviewed-by: Jason Gunthorpe
Signed-off-by: Daniel Vetter
Signed-off-by: Jason Gunthorpe
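A minimal sketch of such priming at register time, assuming the fs_reclaim helpers and the shared lockdep map from the companion patch below; this mirrors the fs_reclaim fake-lock trick:

    /* Tell lockdep that notifier callbacks can nest inside direct reclaim,
     * without having to actually get there under memory pressure. */
    if (IS_ENABLED(CONFIG_LOCKDEP)) {
        fs_reclaim_acquire(GFP_KERNEL);
        lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
        lock_map_release(&__mmu_notifier_invalidate_range_start_map);
        fs_reclaim_release(GFP_KERNEL);
    }
-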
This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly
easy to provoke a specific notifier to be run on a specific range: Just
prep it, and then munmap() it.
A bit harder, but still doable, is to provoke the mmu notifiers for all
the various callchains that might lead to them. But both at the same time
is really hard to reliably hit, especially when you want to exercise paths
like direct reclaim or compaction, where it's not easy to control what
exactly will be unmapped.
By introducing a lockdep map to tie them all together we allow lockdep to
see a lot more dependencies, without having to actually hit them in a
single callchain while testing.
On Jason's suggestion this is rolled out for both
invalidate_range_start and invalidate_range_end. They both have the same
calling context, hence we can share the same lockdep map. Note that the
annotation for invalidate_range_start is outside of the
mm_has_notifiers(), to make sure lockdep is informed about all paths
leading to this context irrespective of whether mmu notifiers are present
for a given context. We don't do that on the invalidate_range_end side to
avoid paying the overhead twice; there the lockdep annotation is pushed
down behind the mm_has_notifiers() check.
Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.ch
Reviewed-by: Jason Gunthorpe
Signed-off-by: Daniel Vetter
Signed-off-by: Jason Gunthorpe
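A hedged sketch of the shared map and its placement relative to the mm_has_notifiers() check (the map name matches the one referenced by the companion patches; details abbreviated):

    struct lockdep_map __mmu_notifier_invalidate_range_start_map = {
        .name = "mmu_notifier_invalidate_range_start"
    };

    /* start side: annotate unconditionally so lockdep sees every callchain */
    lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
    if (mm_has_notifiers(range->mm))
        __mmu_notifier_invalidate_range_start(range);
    lock_map_release(&__mmu_notifier_invalidate_range_start_map);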
28 Aug, 2019
1 commit
-
No modular code uses these, which makes a lot of sense given the wrappers
around them are only called by core mm code.
Link: https://lore.kernel.org/r/20190828142109.29012-1-hch@lst.de
Signed-off-by: Christoph Hellwig
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
22 Aug, 2019
1 commit
-
mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
longer have any users; they have all been converted to use
mmu_notifier_put().
So delete this difficult-to-use interface.
Link: https://lore.kernel.org/r/20190806231548.25242-12-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
20 Aug, 2019
1 commit
-
Just a bit of paranoia, since if we start pushing this deep into
callchains it's hard to spot all places where an mmu notifier
implementation might fail when it's not allowed to.
Inspired by some confusion we had discussing i915 mmu notifiers and
whether we could use the newly-introduced return value to handle some
corner cases. Until we realized that these are only for when a task has
been killed by the oom reaper.
An alternative approach would be to split the callback into two versions,
one with the int return value, and the other with void return value like
in older kernels. But that's a lot more churn for fairly little gain I
think.
Summary from the m-l discussion on why we want something at warning level:
This allows automated tooling in CI to catch bugs without humans having to
look at everything. If we just upgrade the existing pr_info to a pr_warn,
then we'll have false positives. And as-is, no one will ever spot the
problem since it's lost in the massive amounts of overall dmesg noise.
Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
Signed-off-by: Daniel Vetter
Reviewed-by: Jason Gunthorpe
Signed-off-by: Jason Gunthorpe
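A hedged sketch of the check being described, assuming the blockable-range API of this period: a callback may only fail with -EAGAIN, and only when the range is non-blockable. (This is the WARN_ON whose variable mix-up the November 2019 fix above corrects.)

    ret = mn->ops->invalidate_range_start(mn, range);
    if (ret) {
        /* failing is only legal for non-blockable (oom reaper) invalidations */
        WARN_ON(mmu_notifier_range_blockable(range) || ret != -EAGAIN);
    }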
16 Aug, 2019
3 commits
-
Many places in the kernel have a flow where userspace will create some
object and that object will need to connect to the subsystem's
mmu_notifier subscription for the duration of its lifetime.
In this case the subsystem is usually tracking multiple mm_structs and it
is difficult to keep track of which struct mmu_notifiers have been
allocated for which mm's.
Since this has been open coded in a variety of exciting ways, provide core
functionality to do this safely.
This approach uses the struct mmu_notifier_ops * as a key to determine if
the subsystem has a notifier registered on the mm or not. If there is a
registration then the existing notifier struct is returned, otherwise
ops->alloc_notifier() is used to create a new per-subsystem notifier for
the mm.
The destroy side incorporates an async call_srcu based destruction which
will avoid bugs in the callers such as commit 6d7c3cde93c1 ("mm/hmm: fix
use after free with struct hmm in the mmu notifiers").
Since we are inside the mmu notifier core, locking is fairly simple; the
allocation uses the same approach as for mmu_notifier_mm, the write side
of the mmap_sem makes everything deterministic and we only need to do
hlist_add_head_rcu() under the mm_take_all_locks(). The new users count
and the discoverability in the hlist is fully serialized by the
mmu_notifier_mm->lock.
Link: https://lore.kernel.org/r/20190806231548.25242-4-jgg@ziepe.ca
Co-developed-by: Christoph Hellwig
Signed-off-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
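A hedged usage sketch of the get/put pattern this commit introduces; my_alloc_notifier/my_free_notifier are hypothetical, and error handling is abbreviated:

    static const struct mmu_notifier_ops my_ops = {
        .alloc_notifier = my_alloc_notifier, /* allocate per-mm state */
        .free_notifier = my_free_notifier,   /* called via SRCU after put */
    };

    /* Returns the existing registration for this (ops, mm) pair, or
     * allocates a new one via ops->alloc_notifier(). */
    struct mmu_notifier *mn = mmu_notifier_get(&my_ops, current->mm);
    if (IS_ERR(mn))
        return PTR_ERR(mn);
    /* ... use the subscription for the object's lifetime ... */
    mmu_notifier_put(mn);
-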
A prior commit e0f3c3f78da2 ("mm/mmu_notifier: init notifier if necessary")
made an attempt at doing this, but had to be reverted as calling
the GFP_KERNEL allocator under the i_mmap_mutex causes deadlock, see
commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance").
However, we can avoid that problem by doing the allocation only under
the mmap_sem, which is already happening.
Since all writers to mm->mmu_notifier_mm hold the write side of the
mmap_sem, reading it under that sem is deterministic and we can use that to
decide if the allocation path is required, without speculation.
The actual update to mmu_notifier_mm must still be done under the
mm_take_all_locks() to ensure read-side coherency.
Link: https://lore.kernel.org/r/20190806231548.25242-3-jgg@ziepe.ca
Reviewed-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
-
This simplifies the code to not have so many one-line functions and extra
logic. __mmu_notifier_register() simply becomes the entry point to
register the notifier, and the other one calls it under lock.
Also add a lockdep_assert to check that the callers are holding the lock
as expected.
Link: https://lore.kernel.org/r/20190806231548.25242-2-jgg@ziepe.ca
Suggested-by: Christoph Hellwig
Reviewed-by: Christoph Hellwig
Reviewed-by: Ralph Campbell
Tested-by: Ralph Campbell
Signed-off-by: Jason Gunthorpe
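A minimal sketch of the asserted locking contract, assuming the mmap_sem naming of this era and the lockdep_assert_held_write() helper:

    int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
    {
        lockdep_assert_held_write(&mm->mmap_sem);
        /* ... registration proper ... */
    }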
13 Jul, 2019
1 commit
-
Make mmu_notifier_register() safer by issuing a memory barrier before
registering a new notifier. This fixes a theoretical bug on weakly
ordered CPUs. For example, take this simplified use of notifiers by a
driver:

    my_struct->mn.ops = &my_ops;                            /* (1) */
    mmu_notifier_register(&my_struct->mn, mm)
        ...
        hlist_add_head(&mn->hlist, &mm->mmu_notifiers);     /* (2) */
        ...

Once mmu_notifier_register() releases the mm locks, another thread can
invalidate a range:

    mmu_notifier_invalidate_range()
        ...
        hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
            if (mn->ops->invalidate_range)

The read side relies on the data dependency between mn and ops to ensure
that the pointer is properly initialized. But the write side doesn't have
any dependency between (1) and (2), so they could be reordered and the
readers could dereference an invalid mn->ops. mmu_notifier_register()
does take all the mm locks before adding to the hlist, but those have
acquire semantics which isn't sufficient.
By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
hlist using a store-release, ensuring that readers see prior
initialization of my_struct. This situation is better illustrated by
litmus test MP+onceassign+derefonce.
Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
Fixes: cddb8a5c14aa ("mmu-notifiers: core")
Signed-off-by: Jean-Philippe Brucker
Cc: Jérôme Glisse
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
19 Jun, 2019
1 commit
-
Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

has been chosen to replace the boilerplate/reference in 35 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Kate Stewart
Reviewed-by: Enrico Weigelt
Reviewed-by: Allison Randal
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
Signed-off-by: Greg Kroah-Hartman
15 May, 2019
2 commits
-
Helper to test if a range is updated to read only (it is still valid to
read from the range). This is useful for device drivers or anyone who
wishes to optimize out an update when they know that they already have
the range mapped read only.
Link: http://lkml.kernel.org/r/20190326164747.24405-9-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
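A hedged sketch of a driver-side use of the helper; the callback shape follows the range-based API of this series, and my_invalidate_range_start is hypothetical:

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                                         const struct mmu_notifier_range *range)
    {
        /* Pages remain readable: a driver that mirrors mappings
         * read-only can skip the invalidation entirely. */
        if (mmu_notifier_range_update_to_read_only(range))
            return 0;
        /* ... otherwise tear down the mirrored range ... */
        return 0;
    }
-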
Use the mmu_notifier_range_blockable() helper function instead of directly
dereferencing the range->blockable field. This is done to make it easier
to change the mmu_notifier range field.
This patch is the outcome of the following coccinelle patch:

    %<-------------------------------------------------------------------
    @@
    identifier I1, FN;
    @@
    FN(..., struct mmu_notifier_range *I1, ...) {
    <...
    -I1->blockable
    +mmu_notifier_range_blockable(I1)
    ...>
    }
    ------------------------------------------------------------------->%

    spatch --in-place --sp-file blockable.spatch --dir .
Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Reviewed-by: Ralph Campbell
Reviewed-by: Ira Weiny
Cc: Christian König
Cc: Joonas Lahtinen
Cc: Jani Nikula
Cc: Rodrigo Vivi
Cc: Jan Kara
Cc: Andrea Arcangeli
Cc: Peter Xu
Cc: Felix Kuehling
Cc: Jason Gunthorpe
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: John Hubbard
Cc: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Dec, 2018
3 commits
-
To avoid having to change many call sites every time we want to add a
parameter, use a structure to group all parameters for the mmu_notifier
invalidate_range_start/end calls. No functional changes with this patch.
[akpm@linux-foundation.org: coding style fixes]
Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Acked-by: Christian König
Acked-by: Jan Kara
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Felix Kuehling
Cc: Ralph Campbell
Cc: John Hubbard
From: Jérôme Glisse
Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3
fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n
Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
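A hedged sketch of the grouping structure introduced here (fields as in the struct at this point in the series; the contextual event field arrives in a later patch):

    struct mmu_notifier_range {
        struct mm_struct *mm;
        unsigned long start;
        unsigned long end;
        bool blockable;
    };

    /* Callbacks and helpers now take one pointer instead of a growing
     * list of arguments. */
    void __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range);
-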
Patch series "mmu notifier contextual informations", v2.
This patchset adds contextual information, why an invalidation is
happening, to the mmu notifier callbacks. This is necessary for users of
mmu notifiers that wish to maintain their own data structure without
having to add new fields to struct vm_area_struct (vma).
For instance, a device can have its own page table that mirrors the
process address space. When a vma is unmapped (munmap() syscall) the
device driver can free the device page table for the range.
Today we do not have any information on why an mmu notifier callback is
happening, and thus the device driver has to assume that it is always an
munmap(). This is inefficient as it means that it needs to re-allocate
the device page table on the next page fault and rebuild the whole device
driver data structure for the range.
Other use cases besides munmap() also exist; for instance it is pointless
for a device driver to invalidate the device page table when the
invalidation is for soft dirtiness tracking. Or the device driver can
optimize away mprotect() calls that change the page table permission
access for the range.
This patchset enables all these optimizations for device drivers. I do not
include any of those in this series but another patchset I am posting will
leverage this.
The patchset is pretty simple from a code point of view. The first two
patches consolidate all mmu notifier arguments into a struct so that it is
easier to add/change arguments. The last patch adds the contextual
information (munmap, protection, soft dirty, clear, ...).
This patch (of 3):
To avoid having to change many callback definitions every time we want to
add a parameter, use a structure to group all parameters for the
mmu_notifier invalidate_range_start/end callback. No functional changes
with this patch.
[akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Acked-by: Jan Kara
Acked-by: Jason Gunthorpe [infiniband]
Cc: Matthew Wilcox
Cc: Ross Zwisler
Cc: Dan Williams
Cc: Paolo Bonzini
Cc: Radim Krcmar
Cc: Michal Hocko
Cc: Christian Koenig
Cc: Felix Kuehling
Cc: Ralph Campbell
Cc: John Hubbard
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
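A hedged sketch of the contextual information the series builds toward; the enum names follow the MMU_NOTIFY_* values that eventually landed upstream:

    enum mmu_notifier_event {
        MMU_NOTIFY_UNMAP = 0,           /* munmap() or process exit */
        MMU_NOTIFY_CLEAR,               /* page table entries cleared */
        MMU_NOTIFY_PROTECTION_VMA,      /* mprotect() on the vma */
        MMU_NOTIFY_PROTECTION_PAGE,     /* per-page permission change */
        MMU_NOTIFY_SOFT_DIRTY,          /* soft-dirty tracking write */
    };
-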
Contrary to its name, mmu_notifier_synchronize() does not synchronize the
notifier's SRCU instance, but rather waits for RCU callbacks to finish.
i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on
this matter, explicitly calling out that rcu_barrier() does not imply
synchronize_rcu().
As there are no callers of mmu_notifier_synchronize() and it's unclear
whether any user of mmu_notifier_call_srcu() will ever want to barrier on
their callbacks, simply remove the function.
Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com
Signed-off-by: Sean Christopherson
Reviewed-by: Andrew Morton
Cc: Peter Zijlstra
Cc: Jérôme Glisse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Oct, 2018
1 commit
-
Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with
blockable invalidate callbacks").
The MMU_INVALIDATE_DOES_NOT_BLOCK flag was the only one used and it is no
longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for
mmu notifiers"). We now have full support for per-range !blockable
behavior, so we can drop the stop-gap workaround which the per-notifier
flag was used for.
Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org
Signed-off-by: Michal Hocko
Cc: David Rientjes
Cc: Boris Ostrovsky
Cc: Jerome Glisse
Cc: Juergen Gross
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Aug, 2018
1 commit
-
There are several blockable mmu notifiers which might sleep in
mmu_notifier_invalidate_range_start, and that is a problem for the
oom_reaper because it needs to guarantee forward progress so it cannot
depend on any sleepable locks.
Currently we simply back off and mark an oom victim with blockable mmu
notifiers as done after a short sleep. That can result in selecting a new
oom victim prematurely because the previous one still hasn't torn its
memory down yet.
We can do much better though. Even if mmu notifiers use sleepable locks
there is no reason to automatically assume those locks are held. Moreover,
the majority of notifiers only care about a portion of the address space,
and there is absolutely zero reason to fail when we are unmapping an
unrelated range. Many notifiers do really block and wait for HW, which is
harder to handle, and there we have to bail out.
This patch handles the low hanging fruit.
__mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
are not allowed to sleep if the flag is set to false. This is achieved by
using trylock instead of the sleepable lock for most callbacks and
continuing as long as we do not block down the call chain.
I think we can improve that even further because there is a common pattern
to do a range lookup first and then do something about that. The first
part can be done without a sleeping lock in most cases AFAICS.
The oom_reaper end then simply retries if there is at least one notifier
which couldn't make any progress in !blockable mode. A retry loop is
already implemented to wait for the mmap_sem and this is basically the
same thing.
The simplest way for driver developers to test this code path is to wrap
userspace code which uses these notifiers into a memcg and set the hard
limit to hit the oom. This can be done e.g. after the test faults in all
the mmu notifier managed memory and sets the hard limit to something really
small. Then we are looking for a proper process tear down.
[akpm@linux-foundation.org: coding style fixes]
[akpm@linux-foundation.org: minor code simplification]
Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Christian König # AMD notifiers
Acked-by: Leon Romanovsky # mlx and umem_odp
Reported-by: David Rientjes
Cc: "David (ChunMing) Zhou"
Cc: Paolo Bonzini
Cc: Alex Deucher
Cc: David Airlie
Cc: Jani Nikula
Cc: Joonas Lahtinen
Cc: Rodrigo Vivi
Cc: Doug Ledford
Cc: Jason Gunthorpe
Cc: Mike Marciniszyn
Cc: Dennis Dalessandro
Cc: Sudeep Dutt
Cc: Ashutosh Dixit
Cc: Dimitri Sivanich
Cc: Boris Ostrovsky
Cc: Juergen Gross
Cc: "Jérôme Glisse"
Cc: Andrea Arcangeli
Cc: Felix Kuehling
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
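A hedged sketch of the trylock pattern described above, using the int-returning, blockable-flag callback signature this commit introduces; the struct and lock names are hypothetical:

    static int my_invalidate_range_start(struct mmu_notifier *mn,
                                         struct mm_struct *mm,
                                         unsigned long start,
                                         unsigned long end,
                                         bool blockable)
    {
        struct my_mirror *m = container_of(mn, struct my_mirror, mn);

        if (blockable)
            mutex_lock(&m->lock);
        else if (!mutex_trylock(&m->lock))
            return -EAGAIN; /* let the oom_reaper retry later */

        /* ... invalidate the mirrored range ... */
        mutex_unlock(&m->lock);
        return 0;
    }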
01 Feb, 2018
1 commit
-
Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu
notifiers") prevented the oom reaper from unmapping private anonymous
memory when the oom victim mm had mmu notifiers registered.
The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
around the unmap_page_range(), which is needed, can block, and the oom
killer will stall forever waiting for the victim to exit, which may not
be possible without reaping.
That concern is real, but only true for mmu notifiers that have
blockable invalidate_range_{start,end}() callbacks. This patch adds a
"flags" field to mmu notifier ops that can set a bit to indicate that
these callbacks do not block.
The implementation is steered toward an expensive slowpath, such as
after the oom reaper has grabbed mm->mmap_sem of a still alive oom
victim.
[rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
[akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
Signed-off-by: David Rientjes
Acked-by: Michal Hocko
Acked-by: Paolo Bonzini
Acked-by: Christian König
Acked-by: Dimitri Sivanich
Cc: Andrea Arcangeli
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Oded Gabbay
Cc: Alex Deucher
Cc: David Airlie
Cc: Joerg Roedel
Cc: Doug Ledford
Cc: Jani Nikula
Cc: Mike Marciniszyn
Cc: Sean Hefty
Cc: Boris Ostrovsky
Cc: Jérôme Glisse
Cc: Radim Krčmář
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
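A hedged sketch of the opt-out this commit adds (both the flag and the helper were removed again by the October 2018 revert above):

    static const struct mmu_notifier_ops my_ops = {
        .flags = MMU_INVALIDATE_DOES_NOT_BLOCK, /* callbacks never sleep */
        .invalidate_range_start = my_invalidate_range_start,
        .invalidate_range_end = my_invalidate_range_end,
    };

    /* The oom path can then check, e.g.: */
    if (mm_has_blockable_invalidate_notifiers(mm))
        /* skip reaping this mm */;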
16 Nov, 2017
1 commit
-
This is an optimization patch that only affects mmu_notifier users which
rely on the invalidate_range() callback. This patch avoids calling that
callback twice in a row from inside __mmu_notifier_invalidate_range_end.
Existing pattern (before this patch):

    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
        mmu_notifier_invalidate_range()

New pattern (after this patch):

    mmu_notifier_invalidate_range_start()
        pte/pmd/pud_clear_flush_notify()
            mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()

We call the invalidate_range callback after clearing the page table
under the page table lock, and we skip the call to invalidate_range
inside the __mmu_notifier_invalidate_range_end() function.
Idea from Andrea Arcangeli
Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
Signed-off-by: Jérôme Glisse
Cc: Andrea Arcangeli
Cc: Joerg Roedel
Cc: Suravee Suthikulpanit
Cc: David Woodhouse
Cc: Alistair Popple
Cc: Michael Ellerman
Cc: Benjamin Herrenschmidt
Cc: Stephen Rothwell
Cc: Andrew Donnellan
Cc: Nadav Amit
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
01 Sep, 2017
1 commit
-
The invalidate_page callback suffered from two pitfalls. First, it used
to happen after the page table lock was released, and thus a new page
might have been set up before the call to invalidate_page() happened.
This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
callback under the page table lock, but this also broke several existing
users of the mmu_notifier API that assumed they could sleep inside this
callback.
The second pitfall was invalidate_page() being the only callback not
taking an address range for the invalidation, instead being given a
single address and a page. Lots of the callback implementers assumed this
could never be THP and thus failed to invalidate the appropriate range
for THP.
By killing this callback we unify the mmu_notifier callback API to
always take a virtual address range as input.
Finally, this also simplifies the end user's life, as there are now two
clear choices:
- invalidate_range_start()/end() callback (which allows you to sleep)
- invalidate_range() where you can not sleep, but which happens right
  after the page table update under the page table lock
Signed-off-by: Jérôme Glisse
Cc: Bernhard Held
Cc: Adam Borowski
Cc: Andrea Arcangeli
Cc: Radim Krčmář
Cc: Wanpeng Li
Cc: Paolo Bonzini
Cc: Takashi Iwai
Cc: Nadav Amit
Cc: Mike Galbraith
Cc: Kirill A. Shutemov
Cc: axie
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
19 Apr, 2017
1 commit
-
The MM-notifier code currently dynamically initializes the srcu_struct
named "srcu" at subsys_initcall() time, and includes a BUG_ON() to check
this initialization in do_mmu_notifier_register(). Unfortunately, there
is no foolproof way to verify that an srcu_struct has been initialized,
given the possibility of an srcu_struct being allocated on the stack or
on the heap. This means that creating an srcu_struct_is_initialized()
function is not a reasonable course of action. Nor is peppering
do_mmu_notifier_register() with SRCU-specific #ifdefs an attractive
alternative.
This commit therefore uses DEFINE_STATIC_SRCU() to initialize
this srcu_struct at compile time, thus eliminating both the
subsys_initcall()-time initialization and the runtime BUG_ON().
Signed-off-by: Paul E. McKenney
Cc:
Cc: Andrew Morton
Cc: Ingo Molnar
Cc: Michal Hocko
Cc: "Peter Zijlstra (Intel)"
Cc: Vegard Nossum
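A minimal sketch of the change described above:

    #include <linux/srcu.h>

    /* Compile-time initialization replaces the subsys_initcall()-time
     * init_srcu_struct() call and the runtime BUG_ON(). */
    DEFINE_STATIC_SRCU(srcu);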
02 Mar, 2017
1 commit
-
We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
will have to be picked up from other headers and a couple of .c files.
Create a trivial placeholder <linux/sched/mm.h> file that just
maps to <linux/sched.h> to make this patch obviously correct and
bisectable.
The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
28 Feb, 2017
1 commit
-
Apart from adding the helper function itself, the rest of the kernel is
converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

This is needed for a later patch that hooks into the helper, but might
be a worthwhile cleanup on its own.
(Michal Hocko provided most of the kerneldoc comment.)
Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum
Acked-by: Michal Hocko
Acked-by: Peter Zijlstra (Intel)
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
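For reference, a sketch of what the helper boils down to (kerneldoc comment elided):

    static inline void mmgrab(struct mm_struct *mm)
    {
        atomic_inc(&mm->mm_count);
    }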
18 Mar, 2016
1 commit
-
There are various email addresses for me throughout the kernel. Use the
one that will always be valid.
Signed-off-by: Christoph Lameter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Sep, 2015
1 commit
-
In the scope of the idle memory tracking feature, which is introduced by
the following patch, we need to clear the referenced/accessed bit not only
in primary, but also in secondary ptes. The latter is required in order
to estimate the wss (working set size) of KVM VMs. At the same time we
want to avoid flushing the tlb, because it is quite expensive and it won't
really affect the final result.
Currently, there is no function for clearing the pte young bit that would
meet our requirements, so this patch introduces one. To achieve that we
have to add a new mmu-notifier callback, clear_young, since there is no
method for testing-and-clearing a secondary pte w/o flushing the tlb. The
new method is not mandatory and currently only implemented by KVM.
Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Acked-by: Paolo Bonzini
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
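A hedged sketch of the new optional op, with the range-based signature used by the accompanying KVM implementation:

    struct mmu_notifier_ops {
        /* ... */
        /* Test and clear the accessed bit in secondary PTEs for the
         * range, without flushing the TLB. Optional. */
        int (*clear_young)(struct mmu_notifier *mn,
                           struct mm_struct *mm,
                           unsigned long start,
                           unsigned long end);
        /* ... */
    };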
13 Nov, 2014
1 commit
-
Now that the mmu_notifier_invalidate_range() calls are in place, add the
callback to allow subsystems to register against it.
Signed-off-by: Joerg Roedel
Reviewed-by: Andrea Arcangeli
Reviewed-by: Jérôme Glisse
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Jay Cornwall
Cc: Oded Gabbay
Cc: Suravee Suthikulpanit
Cc: Jesse Barnes
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Oded Gabbay
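A hedged sketch of the callback being added, as it appears in struct mmu_notifier_ops:

    /* Called when pages in [start, end) have been unmapped from the
     * primary MMU; runs inside the invalidate_range_start/end bracket. */
    void (*invalidate_range)(struct mmu_notifier *mn,
                             struct mm_struct *mm,
                             unsigned long start,
                             unsigned long end);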
24 Sep, 2014
1 commit
-
1. We were calling clear_flush_young_notify in unmap_one, but we are
within an mmu notifier invalidate range scope. The spte exists no more
(due to range_start) and the accessed bit info has already been
propagated (due to kvm_pfn_set_accessed). Simply call
clear_flush_young.

2. We clear_flush_young on a primary MMU PMD, but this may be mapped
as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
This required expanding the interface of the clear_flush_young mmu
notifier, so a lot of code has been trivially touched.

3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
the access bit by blowing the spte. This requires proper synchronizing
with MMU notifier consumers, like every other removal of spte's does.
Signed-off-by: Andres Lagar-Cavilla
Acked-by: Rik van Riel
Signed-off-by: Paolo Bonzini
07 Aug, 2014
1 commit
-
When kernel device drivers or subsystems want to bind their lifespan to
the lifespan of the mm_struct, they usually use one of the following
methods:

1. Manually calling a function in the interested kernel module. The
function call needs to be placed in mmput. This method was rejected
by several kernel maintainers.

2. Registering to the mmu notifier release mechanism.

The problem with the latter approach is that the mmu_notifier_release
callback is called from __mmu_notifier_release (called from exit_mmap).
That function iterates over the list of mmu notifiers and doesn't expect
the release callback function to remove itself from the list.
Therefore, the callback function in the kernel module can't release the
mmu_notifier_object, which is actually the kernel module's object
itself. As a result, the destruction of the kernel module's object must
be done in a delayed fashion.
This patch adds support for this delayed callback, by adding a new
mmu_notifier_call_srcu function that receives a function ptr and calls
that function with call_srcu. In that function, the kernel module
releases its object. To use mmu_notifier_call_srcu, the calling module
needs to call before that a new function called
mmu_notifier_unregister_no_release that, as its name implies,
unregisters a notifier without calling its notifier release callback.
This patch also adds a function that will call barrier_srcu so those
kernel modules can sync with mmu_notifier.
Signed-off-by: Peter Zijlstra
Signed-off-by: Jérôme Glisse
Signed-off-by: Oded Gabbay
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
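A hedged usage sketch of the pair of functions this commit describes (both were deleted again by the August 2019 cleanup above); struct my_object is hypothetical:

    static void my_free_object(struct rcu_head *rcu)
    {
        kfree(container_of(rcu, struct my_object, rcu));
    }

    /* Unregister without invoking ->release(), then free the embedding
     * object only after all SRCU readers are done with it. */
    mmu_notifier_unregister_no_release(&obj->mn, mm);
    mmu_notifier_call_srcu(&obj->rcu, my_free_object);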
24 Jan, 2014
1 commit
-
Code that is obj-y (always built-in) or dependent on a bool Kconfig
(built-in or absent) can never be modular. So using module_init as an
alias for __initcall can be somewhat misleading.
Fix these up now, so that we can relocate module_init from init.h into
module.h in the future. If we don't do this, we'd have to add module.h
to obviously non-modular code, and that would be a worse thing.
The audit targets the following module_init users for change:

    mm/ksm.c            bool KSM
    mm/mmap.c           bool MMU
    mm/huge_memory.c    bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c   bool MMU_NOTIFIER

Note that direct use of __initcall is discouraged, vs. one of the
priority categorized subgroups. As __initcall gets mapped onto
device_initcall, our use of subsys_initcall (which makes sense for these
files) will thus change this registration from level 6-device to level
4-subsys (i.e. slightly earlier).
However no observable impact of that difference has been observed during
testing.
One might think that core_initcall (l2) or postcore_initcall (l3) would
be more appropriate for anything in mm/ but if we look at some actual
init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init     --> hugepage_init_sysfs
    mm/mmap.c        --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c         --> ksm_init          --> sysfs_create_group

and hence the choice of subsys_initcall (l4) seems reasonable, and at
the same time minimizes the risk of changing the priority too
drastically all at once. We can adjust further in the future.
Also, several instances of missing ";" at EOL are fixed.
Signed-off-by: Paul Gortmaker
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Jun, 2013
1 commit
-
Signed-off-by: Geert Uytterhoeven
Signed-off-by: Jiri Kosina
25 May, 2013
1 commit
-
Commit 751efd8610d3 ("mmu_notifier_unregister NULL Pointer deref and
multiple ->release()") breaks the fix 3ad3d901bbcf ("mm: mmu_notifier:
fix freed page still mapped in secondary MMU").
Since hlist_for_each_entry_rcu() is changed now, we can not revert that
patch directly, so this patch reverts the commit and simply fixes the bug
spotted by that patch.
The bug spotted by commit 751efd8610d3 is:

    There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result
    of a filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

            A                           B
    t1                          srcu_read_lock()
    t2      if (!hlist_unhashed())
    t3                          srcu_read_unlock()
    t4      srcu_read_lock()
    t5                          hlist_del_init_rcu()
    t6                          synchronize_srcu()
    t7      srcu_read_unlock()
    t8      hlist_del_rcu()     <--- NULL pointer deref.

The other issue spotted in that commit, multiple ->release() callouts, is
really rare, and a later call of multiple ->release should be fast
since all the pages have already been released by the first call.
Anyway, this issue should be fixed in a separate patch.
-stable suggestions: Any version that has commit 751efd8610d3 needs to be
backported. I find the oldest version that has this commit is 3.0-stable.
[akpm@linux-foundation.org: tweak comments]
Signed-off-by: Xiao Guangrong
Tested-by: Robin Holt
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
28 Feb, 2013
1 commit
-
I'm not sure why, but the hlist for each entry iterators were conceived

    list_for_each_entry(pos, head, member)

The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

Why did they need an extra pos parameter? I'm not quite sure. Not only
do they not really need it, it also prevents the iterator from looking
exactly like the list iterator, which is unfortunate.
Besides the semantic patch, there was some manual work required:
- Fix up the actual hlist iterators in linux/list.h
- Fix up the declaration of other iterators based on the hlist ones.
- A very small amount of places were using the 'node' parameter, this
  was modified to use 'obj->member' instead.
- Coccinelle didn't handle the hlist_for_each_entry_safe iterator
  properly, so those had to be fixed up manually.
The semantic patch, which is mostly the work of Peter Senna Tschudin, is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;
    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;
[akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
[akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: fix warnings]
[akpm@linux-foudnation.org: redo intrusive kvm changes]
Tested-by: Peter Senna Tschudin
Acked-by: Paul E. McKenney
Signed-off-by: Sasha Levin
Cc: Wu Fengguang
Cc: Marcelo Tosatti
Cc: Gleb Natapov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Feb, 2013
2 commits
-
We at SGI have a need to address some very high physical address ranges
with our GRU (global reference unit), sometimes across partitioned
machine boundaries and sometimes with larger addresses than the cpu
supports. We do this with the aid of our own 'extended vma' module
which mimics the vma. When something (either unmap or exit) frees an
'extended vma' we use the mmu notifiers to clean them up.
We had been able to mimic the functions
__mmu_notifier_invalidate_range_start() and
__mmu_notifier_invalidate_range_end() by locking the per-mm lock and
walking the per-mm notifier list. But with the change to a global srcu
lock (static in mmu_notifier.c) we can no longer do that. Our module has
no access to that lock.
So we request that these two functions be exported.
Signed-off-by: Cliff Wickman
Acked-by: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
There is a race condition between mmu_notifier_unregister() and
__mmu_notifier_release().
Assume two tasks, one calling mmu_notifier_unregister() as a result of a
filp_close() ->flush() callout (task A), and the other calling
mmu_notifier_release() from an mmput() (task B).

        A                           B
t1                          srcu_read_lock()
t2      if (!hlist_unhashed())
t3                          srcu_read_unlock()
t4      srcu_read_lock()
t5                          hlist_del_init_rcu()
t6                          synchronize_srcu()
t7      srcu_read_unlock()
t8      hlist_del_rcu()     <--- NULL pointer deref.

Additionally, the list traversal in __mmu_notifier_release() is not
protected by the hlist_lock, which can result in
callouts to the ->release() notifier from both mmu_notifier_unregister()
and __mmu_notifier_release().
-stable suggestions:
The stable trees prior to 3.7.y need commits 21a92735f660 and
70400303ce0c cherry-picked in that order prior to cherry-picking this
commit. The 3.7.y tree already has those two commits.
Signed-off-by: Robin Holt
Cc: Andrea Arcangeli
Cc: Wanpeng Li
Cc: Xiao Guangrong
Cc: Avi Kivity
Cc: Hugh Dickins
Cc: Marcelo Tosatti
Cc: Sagi Grimberg
Cc: Haggai Eran
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Oct, 2012
1 commit
-
While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
to work in case of tight available memory. Eventually, that would lead to
a deadlock while the swap daemon swaps anonymous pages. It was caused by
commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

=================================
[ INFO: inconsistent lock state ]
3.7.0-rc1+ #518 Not tainted
---------------------------------
inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
(&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
{RECLAIM_FS-ON-W} state was registered at:
mark_held_locks+0x86/0x150
lockdep_trace_alloc+0x67/0xc0
kmem_cache_alloc_trace+0x33/0x230
do_mmu_notifier_register+0x87/0x180
mmu_notifier_register+0x13/0x20
kvm_dev_ioctl+0x428/0x510
do_vfs_ioctl+0x98/0x570
sys_ioctl+0x91/0xb0
system_call_fastpath+0x16/0x1b
irq event stamp: 825
hardirqs last enabled at (825): _raw_spin_unlock_irq+0x30/0x60
hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
softirqs last enabled at (0): copy_process+0x630/0x17c0
softirqs last disabled at (0): (null)
...

Simply back out the above commit, which was a small performance
optimization.
Signed-off-by: Gavin Shan
Reported-by: Andrea Righi
Tested-by: Andrea Righi
Cc: Wanpeng Li
Cc: Andrea Arcangeli
Cc: Avi Kivity
Cc: Hugh Dickins
Cc: Marcelo Tosatti
Cc: Xiao Guangrong
Cc: Sagi Grimberg
Cc: Haggai Eran
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Oct, 2012
3 commits
-
In order to allow sleeping during invalidate_page mmu notifier calls, we
need to avoid calling when holding the PT lock. In addition to its direct
calls, invalidate_page can also be called as a substitute for a change_pte
call, in case the notifier client hasn't implemented change_pte.
This patch drops the invalidate_page call from change_pte, and instead
wraps all calls to change_pte with invalidate_range_start and
invalidate_range_end calls.
Note that change_pte still cannot sleep after this patch, and that clients
implementing change_pte should not take action on it in case the number of
outstanding invalidate_range_start calls is larger than one, otherwise
they might miss a later invalidation.
Signed-off-by: Haggai Eran
Cc: Andrea Arcangeli
Cc: Sagi Grimberg
Cc: Peter Zijlstra
Cc: Xiao Guangrong
Cc: Or Gerlitz
Cc: Haggai Eran
Cc: Shachar Raindel
Cc: Liran Liss
Cc: Christoph Lameter
Cc: Avi Kivity
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The variable must be static, especially given the variable name.
s/RCU/SRCU/ over a few comments.
Signed-off-by: Andrea Arcangeli
Cc: Xiao Guangrong
Cc: Sagi Grimberg
Cc: Peter Zijlstra
Cc: Haggai Eran
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
While registering an MMU notifier, a new instance of MMU notifier_mm will
be allocated and later freed if the current mm_struct's MMU notifier_mm
has already been initialized. That causes some overhead. The patch tries
to eliminate that.
Signed-off-by: Gavin Shan
Signed-off-by: Wanpeng Li
Cc: Andrea Arcangeli
Cc: Avi Kivity
Cc: Hugh Dickins
Cc: Marcelo Tosatti
Cc: Xiao Guangrong
Cc: Sagi Grimberg
Cc: Haggai Eran
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds