17 Oct, 2020

1 commit

  • The comment talks about having to hold mmget() (which means mm_users), but
    the actual check is on mm_count (which would be mmgrab()).

    Given that MMU notifiers are torn down in mmput() -> __mmput() ->
    exit_mmap() -> mmu_notifier_release(), I believe that the comment is
    correct and the check should be on mm->mm_users. Fix it up accordingly.

    Fixes: 99cb252f5e68 ("mm/mmu_notifier: add an interval tree notifier")
    Signed-off-by: Jann Horn
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Cc: John Hubbard
    Cc: Christoph Hellwig
    Cc: Christian König

    Jann Horn
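    A minimal sketch of the corrected assertion, assuming the check lives in
    __mmu_interval_notifier_insert() (the function body and error handling here
    are illustrative, not the actual hunk):

        /*
         * Sketch only: the interval notifier API requires the caller to hold a
         * reference on mm_users (mmget()), not merely on mm_count (mmgrab()),
         * because the subscriptions are torn down from exit_mmap() via
         * mmu_notifier_release() once mm_users drops to zero.
         */
        static int __mmu_interval_notifier_insert(struct mm_struct *mm)
        {
                /* before the fix this read mm->mm_count */
                if (WARN_ON(atomic_read(&mm->mm_users) <= 0))
                        return -EINVAL;
                /* ... insert into the interval tree ... */
                return 0;
        }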
     

13 Aug, 2020

1 commit

  • Fix W=1 compile warnings (invalid kerneldoc):

    mm/mmu_notifier.c:187: warning: Function parameter or member 'interval_sub' not described in 'mmu_interval_read_begin'
    mm/mmu_notifier.c:708: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_register'
    mm/mmu_notifier.c:708: warning: Excess function parameter 'mn' description in 'mmu_notifier_register'
    mm/mmu_notifier.c:880: warning: Function parameter or member 'subscription' not described in 'mmu_notifier_put'
    mm/mmu_notifier.c:880: warning: Excess function parameter 'mn' description in 'mmu_notifier_put'
    mm/mmu_notifier.c:982: warning: Function parameter or member 'ops' not described in 'mmu_interval_notifier_insert'

    Signed-off-by: Krzysztof Kozlowski
    Signed-off-by: Andrew Morton
    Reviewed-by: Jason Gunthorpe
    Link: http://lkml.kernel.org/r/20200728171109.28687-4-krzk@kernel.org
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
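    The fix is simply making each kerneldoc block name the actual C parameter.
    An illustrative (not verbatim) example of the corrected shape:

        /**
         * mmu_interval_read_begin() - Begin a read side critical section against a VA range
         * @interval_sub: The interval subscription being read
         *
         * The name after '@' must match the C parameter, otherwise W=1 builds
         * emit "Function parameter or member ... not described" warnings.
         */
        unsigned long mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub);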
     

10 Jun, 2020

4 commits

  • Convert comments that reference mmap_sem to reference mmap_lock instead.

    [akpm@linux-foundation.org: fix up linux-next leftovers]
    [akpm@linux-foundation.org: s/lockaphore/lock/, per Vlastimil]
    [akpm@linux-foundation.org: more linux-next fixups, per Michel]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-13-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Rename the mmap_sem field to mmap_lock. Any new uses of this lock should
    now go through the new mmap locking api. The mmap_lock is still
    implemented as a rwsem, though this could change in the future.

    [akpm@linux-foundation.org: fix it for mm-gup-might_lock_readmmap_sem-in-get_user_pages_fast.patch]

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Daniel Jordan
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-11-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
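    For reference, the new API wraps the renamed field; a short sketch of the
    relationship (the example function is hypothetical, the wrappers are the
    merged <linux/mmap_lock.h> interface):

        #include <linux/mmap_lock.h>

        static void example(struct mm_struct *mm)
        {
                /* old: down_read(&mm->mmap_sem); */
                mmap_read_lock(mm);     /* still an rwsem underneath (mm->mmap_lock) */
                /* ... walk VMAs ... */
                mmap_read_unlock(mm);   /* old: up_read(&mm->mmap_sem); */
        }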
     
  • Add new APIs to assert that mmap_sem is held.

    Using this instead of rwsem_is_locked and lockdep_assert_held[_write]
    makes the assertions more tolerant of future changes to the lock type.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Vlastimil Babka
    Reviewed-by: Daniel Jordan
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Laurent Dufour
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-10-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
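    A sketch of how callers use the new assertions instead of open-coding rwsem
    checks (the surrounding functions are illustrative):

        static void example_needs_mmap_lock(struct mm_struct *mm)
        {
                /* old: VM_BUG_ON(!rwsem_is_locked(&mm->mmap_sem)); */
                mmap_assert_locked(mm);         /* held for read or write */
        }

        static void example_needs_mmap_write_lock(struct mm_struct *mm)
        {
                /* old: lockdep_assert_held_write(&mm->mmap_sem); */
                mmap_assert_write_locked(mm);   /* held for write */
        }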
     
  • This change converts the existing mmap_sem rwsem calls to use the new mmap
    locking API instead.

    The change is generated using coccinelle with the following rule:

    // spatch --sp-file mmap_lock_api.cocci --in-place --include-headers --dir .

    @@
    expression mm;
    @@
    (
    -init_rwsem
    +mmap_init_lock
    |
    -down_write
    +mmap_write_lock
    |
    -down_write_killable
    +mmap_write_lock_killable
    |
    -down_write_trylock
    +mmap_write_trylock
    |
    -up_write
    +mmap_write_unlock
    |
    -downgrade_write
    +mmap_write_downgrade
    |
    -down_read
    +mmap_read_lock
    |
    -down_read_killable
    +mmap_read_lock_killable
    |
    -down_read_trylock
    +mmap_read_trylock
    |
    -up_read
    +mmap_read_unlock
    )
    -(&mm->mmap_sem)
    +(mm)

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Reviewed-by: Daniel Jordan
    Reviewed-by: Laurent Dufour
    Reviewed-by: Vlastimil Babka
    Cc: Davidlohr Bueso
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Jason Gunthorpe
    Cc: Jerome Glisse
    Cc: John Hubbard
    Cc: Liam Howlett
    Cc: Matthew Wilcox
    Cc: Peter Zijlstra
    Cc: Ying Han
    Link: http://lkml.kernel.org/r/20200520052908.204642-5-walken@google.com
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
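    The net effect on a call site looks like this (an illustrative example, not
    a specific hunk from the patch):

        /* before */
        down_read(&mm->mmap_sem);
        vma = find_vma(mm, addr);
        up_read(&mm->mmap_sem);

        /* after */
        mmap_read_lock(mm);
        vma = find_vma(mm, addr);
        mmap_read_unlock(mm);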
     

22 Mar, 2020

1 commit

  • It is safe to traverse mm->notifier_subscriptions->list either under
    SRCU read lock or mm->notifier_subscriptions->lock using
    hlist_for_each_entry_rcu(). Silence the PROVE_RCU_LIST false positives,
    for example,

    WARNING: suspicious RCU usage
    -----------------------------
    mm/mmu_notifier.c:484 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    3 locks held by libvirtd/802:
    #0: ffff9321e3f58148 (&mm->mmap_sem#2){++++}, at: do_mprotect_pkey+0xe1/0x3e0
    #1: ffffffff91ae6160 (mmu_notifier_invalidate_range_start){+.+.}, at: change_p4d_range+0x5fa/0x800
    #2: ffffffff91ae6e08 (srcu){....}, at: __mmu_notifier_invalidate_range_start+0x178/0x460

    stack backtrace:
    CPU: 7 PID: 802 Comm: libvirtd Tainted: G I 5.6.0-rc6-next-20200317+ #2
    Hardware name: HP ProLiant BL460c Gen8, BIOS I31 11/02/2014
    Call Trace:
    dump_stack+0xa4/0xfe
    lockdep_rcu_suspicious+0xeb/0xf5
    __mmu_notifier_invalidate_range_start+0x3ff/0x460
    change_p4d_range+0x746/0x800
    change_protection+0x1df/0x300
    mprotect_fixup+0x245/0x3e0
    do_mprotect_pkey+0x23b/0x3e0
    __x64_sys_mprotect+0x51/0x70
    do_syscall_64+0x91/0xae8
    entry_SYSCALL_64_after_hwframe+0x49/0xb3

    Signed-off-by: Qian Cai
    Signed-off-by: Andrew Morton
    Reviewed-by: Paul E. McKenney
    Reviewed-by: Jason Gunthorpe
    Link: http://lkml.kernel.org/r/20200317175640.2047-1-cai@lca.pw
    Signed-off-by: Linus Torvalds

    Qian Cai
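    The silencing works by passing the optional lockdep condition argument to
    hlist_for_each_entry_rcu(); a sketch of the pattern (the exact condition
    used in the patch may differ):

        hlist_for_each_entry_rcu(subscription,
                                 &mm->notifier_subscriptions->list, hlist,
                                 srcu_read_lock_held(&srcu) ||
                                 lockdep_is_held(&mm->notifier_subscriptions->lock)) {
                /* ... invoke the callback ... */
        }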
     

01 Dec, 2019

1 commit

  • Pull hmm updates from Jason Gunthorpe:
    "This is another round of bug fixing and cleanup. This time the focus
    is on the driver pattern to use mmu notifiers to monitor a VA range.
    This code is lifted out of many drivers and hmm_mirror directly into
    the mmu_notifier core and written using the best ideas from all the
    driver implementations.

    This removes many bugs from the drivers and has a very pleasing
    diffstat. More drivers can still be converted, but that is for another
    cycle.

    - A shared branch with RDMA reworking the RDMA ODP implementation

    - New mmu_interval_notifier API. This is focused on the use case of
    monitoring a VA and simplifies the process for drivers

    - A common seq-count locking scheme built into the
    mmu_interval_notifier API usable by drivers that call
    get_user_pages() or hmm_range_fault() with the VA range

    - Conversion of mlx5 ODP, hfi1, radeon, nouveau, AMD GPU, and Xen
    GntDev drivers to the new API. This deletes a lot of wonky driver
    code.

    - Two improvements for hmm_range_fault(), from testing done by Ralph"

    * tag 'for-linus-hmm' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma:
    mm/hmm: remove hmm_range_dma_map and hmm_range_dma_unmap
    mm/hmm: make full use of walk_page_range()
    xen/gntdev: use mmu_interval_notifier_insert
    mm/hmm: remove hmm_mirror and related
    drm/amdgpu: Use mmu_interval_notifier instead of hmm_mirror
    drm/amdgpu: Use mmu_interval_insert instead of hmm_mirror
    drm/amdgpu: Call find_vma under mmap_sem
    nouveau: use mmu_interval_notifier instead of hmm_mirror
    nouveau: use mmu_notifier directly for invalidate_range_start
    drm/radeon: use mmu_interval_notifier_insert
    RDMA/hfi1: Use mmu_interval_notifier_insert for user_exp_rcv
    RDMA/odp: Use mmu_interval_notifier_insert()
    mm/hmm: define the pre-processor related parts of hmm.h even if disabled
    mm/hmm: allow hmm_range to be used with a mmu_interval_notifier or hmm_mirror
    mm/mmu_notifier: add an interval tree notifier
    mm/mmu_notifier: define the header pre-processor parts even if disabled
    mm/hmm: allow snapshot of the special zero page

    Linus Torvalds
     

24 Nov, 2019

1 commit

  • Of the 13 users of mmu_notifiers, 8 of them use only
    invalidate_range_start/end() and immediately intersect the
    mmu_notifier_range with some kind of internal list of VAs. 4 use an
    interval tree (i915_gem, radeon_mn, umem_odp, hfi1). 4 use a linked list
    of some kind (scif_dma, vhost, gntdev, hmm)

    And the remaining 5 either don't use invalidate_range_start() or do some
    special thing with it.

    It turns out that building a correct scheme with an interval tree is
    pretty complicated, particularly if the use case is synchronizing against
    another thread doing get_user_pages(). Many of these implementations have
    various subtle and difficult to fix races.

    This approach puts the interval tree as common code at the top of the mmu
    notifier call tree and implements a shareable locking scheme.

    It includes:
    - An interval tree tracking VA ranges, with per-range callbacks
    - A read/write locking scheme for the interval tree that avoids
    sleeping in the notifier path (for OOM killer)
    - A sequence counter based collision-retry locking scheme to tell
    device page fault that a VA range is being concurrently invalidated.

    This is based on various ideas:
    - hmm accumulates invalidated VA ranges and releases them when all
    invalidates are done, via active_invalidate_ranges count.
    This approach avoids having to intersect the interval tree twice (as
    umem_odp does) at the potential cost of a longer device page fault.

    - kvm/umem_odp use a sequence counter to drive the collision retry,
    via invalidate_seq

    - a deferred work todo list on unlock scheme like RTNL, via deferred_list.
    This makes adding/removing interval tree members more deterministic

    - seqlock, except this version makes the seqlock idea multi-holder on the
    write side by protecting it with active_invalidate_ranges and a spinlock

    To minimize MM overhead when only the interval tree is being used, the
    entire SRCU and hlist overheads are dropped using some simple
    branches. Similarly the interval tree overhead is dropped when in hlist
    mode.

    The overhead from the mandatory spinlock is broadly the same as most of
    existing users which already had a lock (or two) of some sort on the
    invalidation path.

    Link: https://lore.kernel.org/r/20191112202231.3856-3-jgg@ziepe.ca
    Acked-by: Christian König
    Tested-by: Philip Yang
    Tested-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
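    A condensed sketch of the driver-side collision-retry pattern this enables
    (the drv_* names and locking are hypothetical; the mmu_interval_* calls are
    the merged API):

        #include <linux/mmu_notifier.h>

        struct drv_range {
                struct mmu_interval_notifier notifier;
                struct mutex lock;                      /* hypothetical driver lock */
        };

        static bool drv_invalidate(struct mmu_interval_notifier *sub,
                                   const struct mmu_notifier_range *range,
                                   unsigned long cur_seq)
        {
                struct drv_range *dr = container_of(sub, struct drv_range, notifier);

                if (mmu_notifier_range_blockable(range))
                        mutex_lock(&dr->lock);
                else if (!mutex_trylock(&dr->lock))
                        return false;                   /* core retries in blockable mode */
                mmu_interval_set_seq(sub, cur_seq);     /* publish the new sequence */
                /* ... invalidate device mappings for range->start .. range->end ... */
                mutex_unlock(&dr->lock);
                return true;
        }

        static const struct mmu_interval_notifier_ops drv_ops = {
                .invalidate = drv_invalidate,
        };

        static int drv_fault(struct drv_range *dr)
        {
                unsigned long seq;

        again:
                seq = mmu_interval_read_begin(&dr->notifier);
                /* ... get_user_pages()/hmm_range_fault() without holding dr->lock ... */
                mutex_lock(&dr->lock);
                if (mmu_interval_read_retry(&dr->notifier, seq)) {
                        mutex_unlock(&dr->lock);
                        goto again;                     /* raced with an invalidation */
                }
                /* ... program device page tables under dr->lock ... */
                mutex_unlock(&dr->lock);
                return 0;
        }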
     

13 Nov, 2019

1 commit

  • Now that we have KERNEL_HEADER_TEST all headers are generally compile
    tested, so relying on makefile tricks to avoid compiling code that depends
    on CONFIG_MMU_NOTIFIER is more annoying.

    Instead follow the usual pattern and provide most of the header with only
    the functions stubbed out when CONFIG_MMU_NOTIFIER is disabled. This
    ensures code compiles no matter what the config setting is.

    While here, struct mmu_notifier_mm is private to mmu_notifier.c, move it.

    Link: https://lore.kernel.org/r/20191112202231.3856-2-jgg@ziepe.ca
    Reviewed-by: Jérôme Glisse
    Tested-by: Ralph Campbell
    Reviewed-by: John Hubbard
    Reviewed-by: Christoph Hellwig
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
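    The usual pattern this refers to, sketched generically (not the literal
    header contents):

        #ifdef CONFIG_MMU_NOTIFIER
        void mmu_notifier_release(struct mm_struct *mm);
        #else  /* !CONFIG_MMU_NOTIFIER */
        /* A static-inline stub keeps callers compiling when the feature is off. */
        static inline void mmu_notifier_release(struct mm_struct *mm)
        {
        }
        #endif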
     

07 Nov, 2019

1 commit

  • The return code from the op callback is actually in _ret, while the
    WARN_ON was checking ret which causes it to misfire.

    Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
    Fixes: 8402ce61bec2 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Andrew Morton
    Cc: Daniel Vetter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
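    The fixed pattern, roughly (a sketch of the relevant logic, not a verbatim
    quote of the hunk):

        static int example_start(struct mmu_notifier *subscription,
                                 struct mmu_notifier_range *range)
        {
                int ret = 0;
                int _ret = subscription->ops->invalidate_range_start(subscription, range);

                if (_ret) {
                        /* Failing is only legal in non-blockable (OOM reaper) context. */
                        WARN_ON(mmu_notifier_range_blockable(range) || _ret != -EAGAIN);
                        ret = _ret;     /* the WARN_ON previously tested 'ret' here */
                }
                return ret;
        }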
     

07 Sep, 2019

3 commits

  • We need to make sure implementations don't cheat and don't have a possible
    schedule/blocking point deeply buried where review can't catch it.

    I'm not sure whether this is the best way to make sure all the
    might_sleep() callsites trigger, and it's a bit ugly in the code flow.
    But it gets the job done.

    Inspired by an i915 patch series which did exactly that, because the rules
    haven't been entirely clear to us.

    Link: https://lore.kernel.org/r/20190826201425.17547-5-daniel.vetter@ffwll.ch
    Reviewed-by: Christian König (v1)
    Reviewed-by: Jérôme Glisse (v4)
    Signed-off-by: Daniel Vetter
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     
  • We want to teach lockdep that mmu notifiers can be called from direct
    reclaim paths, since on many CI systems load might never reach that
    level (e.g. when just running fuzzer or small functional tests).

    I've put the annotation into mmu_notifier_register since only when we have
    mmu notifiers registered is there any point in teaching lockdep about
    them. Also, we already have a kmalloc(..., GFP_KERNEL) there, so this is safe.

    Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.ch
    Suggested-by: Jason Gunthorpe
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Daniel Vetter
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     
  • This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly
    easy to provoke a specific notifier to be run on a specific range: Just
    prep it, and then munmap() it.

    A bit harder, but still doable, is to provoke the mmu notifiers for all
    the various callchains that might lead to them. But both at the same time
    is really hard to reliably hit, especially when you want to exercise paths
    like direct reclaim or compaction, where it's not easy to control what
    exactly will be unmapped.

    By introducing a lockdep map to tie them all together we allow lockdep to
    see a lot more dependencies, without having to actually hit them in a
    single callchain while testing.

    On Jason's suggestion this is rolled out for both
    invalidate_range_start and invalidate_range_end. They both have the same
    calling context, hence we can share the same lockdep map. Note that the
    annotation for invalidate_range_start is outside of the
    mm_has_notifiers(), to make sure lockdep is informed about all paths
    leading to this context irrespective of whether mmu notifiers are present
    for a given context. We don't do that on the invalidate_range_end side to
    avoid paying the overhead twice, there the lockdep annotation is pushed
    down behind the mm_has_notifiers() check.

    Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.ch
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Daniel Vetter
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
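    The mechanism is a single static lockdep map "acquired" around every
    start/end call; a sketch of the idea (details of the real annotation may
    differ):

        #ifdef CONFIG_LOCKDEP
        static struct lockdep_map mmu_notifier_invalidate_range_start_map = {
                .name = "mmu_notifier_invalidate_range_start",
        };
        #endif

        static void example_invalidate_range_start(struct mmu_notifier_range *range)
        {
                /* Tell lockdep we are in invalidate_range_start context ... */
                lock_map_acquire(&mmu_notifier_invalidate_range_start_map);
                /* ... call the registered notifiers ... */
                lock_map_release(&mmu_notifier_invalidate_range_start_map);
        }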
     

22 Aug, 2019

1 commit

  • mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
    longer have any users, they have all been converted to use
    mmu_notifier_put().

    So delete this difficult to use interface.

    Link: https://lore.kernel.org/r/20190806231548.25242-12-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

20 Aug, 2019

1 commit

  • Just a bit of paranoia, since if we start pushing this deep into
    callchains it's hard to spot all places where an mmu notifier
    implementation might fail when it's not allowed to.

    Inspired by some confusion we had discussing i915 mmu notifiers and
    whether we could use the newly-introduced return value to handle some
    corner cases, until we realized that the failure return is only for when a
    task has been killed by the oom reaper.

    An alternative approach would be to split the callback into two versions,
    one with the int return value, and the other with void return value like
    in older kernels. But that's a lot more churn for fairly little gain I
    think.

    Summary from the m-l discussion on why we want something at warning level:
    This allows automated tooling in CI to catch bugs without humans having to
    look at everything. If we just upgrade the existing pr_info to a pr_warn,
    then we'll have false positives. And as-is, no one will ever spot the
    problem since it's lost in the massive amounts of overall dmesg noise.

    Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     

16 Aug, 2019

3 commits

  • Many places in the kernel have a flow where userspace will create some
    object and that object will need to connect to the subsystem's
    mmu_notifier subscription for the duration of its lifetime.

    In this case the subsystem is usually tracking multiple mm_structs and it
    is difficult to keep track of what struct mmu_notifier's have been
    allocated for what mm's.

    Since this has been open coded in a variety of exciting ways, provide core
    functionality to do this safely.

    This approach uses the struct mmu_notifier_ops * as a key to determine if
    the subsystem has a notifier registered on the mm or not. If there is a
    registration then the existing notifier struct is returned, otherwise the
    ops->alloc_notifier() is used to create a new per-subsystem notifier for
    the mm.

    The destroy side incorporates an async call_srcu based destruction which
    will avoid bugs in the callers such as commit 6d7c3cde93c1 ("mm/hmm: fix
    use after free with struct hmm in the mmu notifiers").

    Since we are inside the mmu notifier core locking is fairly simple, the
    allocation uses the same approach as for mmu_notifier_mm, the write side
    of the mmap_sem makes everything deterministic and we only need to do
    hlist_add_head_rcu() under the mm_take_all_locks(). The new users count
    and the discoverability in the hlist is fully serialized by the
    mmu_notifier_mm->lock.

    Link: https://lore.kernel.org/r/20190806231548.25242-4-jgg@ziepe.ca
    Co-developed-by: Christoph Hellwig
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
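    A sketch of how a subsystem uses this (the drv_* names are hypothetical;
    mmu_notifier_get()/mmu_notifier_put() and the alloc_notifier/free_notifier
    ops are the merged interface):

        struct drv_ctx {                        /* hypothetical per-mm driver state */
                struct mmu_notifier mn;
                /* ... device page-table state ... */
        };

        static struct mmu_notifier *drv_alloc_notifier(struct mm_struct *mm)
        {
                struct drv_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

                return ctx ? &ctx->mn : ERR_PTR(-ENOMEM);
        }

        static void drv_free_notifier(struct mmu_notifier *mn)
        {
                kfree(container_of(mn, struct drv_ctx, mn));
        }

        static const struct mmu_notifier_ops drv_mn_ops = {
                .alloc_notifier = drv_alloc_notifier,
                .free_notifier  = drv_free_notifier,
                /* ... invalidate callbacks ... */
        };

        static struct drv_ctx *drv_ctx_get(struct mm_struct *mm)
        {
                /* Returns the existing per-mm notifier or allocates a new one. */
                struct mmu_notifier *mn = mmu_notifier_get(&drv_mn_ops, mm);

                if (IS_ERR(mn))
                        return ERR_CAST(mn);
                return container_of(mn, struct drv_ctx, mn);
        }

        static void drv_ctx_put(struct drv_ctx *ctx)
        {
                /* Drops the user count; freed via SRCU once it reaches zero. */
                mmu_notifier_put(&ctx->mn);
        }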
     
  • A prior commit e0f3c3f78da2 ("mm/mmu_notifier: init notifier if necessary")
    made an attempt at doing this, but had to be reverted as calling
    the GFP_KERNEL allocator under the i_mmap_mutex causes deadlock, see
    commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance").

    However, we can avoid that problem by doing the allocation only under
    the mmap_sem, which is already happening.

    Since all writers to mm->mmu_notifier_mm hold the write side of the
    mmap_sem reading it under that sem is deterministic and we can use that to
    decide if the allocation path is required, without speculation.

    The actual update to mmu_notifier_mm must still be done under the
    mm_take_all_locks() to ensure read-side coherency.

    Link: https://lore.kernel.org/r/20190806231548.25242-3-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This simplifies the code to not have so many one line functions and extra
    logic. __mmu_notifier_register() simply becomes the entry point to
    register the notifier, and the other one calls it under lock.

    Also add a lockdep_assert to check that the callers are holding the lock
    as expected.

    Link: https://lore.kernel.org/r/20190806231548.25242-2-jgg@ziepe.ca
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

13 Jul, 2019

1 commit

  • Make mmu_notifier_register() safer by issuing a memory barrier before
    registering a new notifier. This fixes a theoretical bug on weakly
    ordered CPUs. For example, take this simplified use of notifiers by a
    driver:

    my_struct->mn.ops = &my_ops; /* (1) */
    mmu_notifier_register(&my_struct->mn, mm)
    ...
    hlist_add_head(&mn->hlist, &mm->mmu_notifiers); /* (2) */
    ...

    Once mmu_notifier_register() releases the mm locks, another thread can
    invalidate a range:

    mmu_notifier_invalidate_range()
    ...
    hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
    if (mn->ops->invalidate_range)

    The read side relies on the data dependency between mn and ops to ensure
    that the pointer is properly initialized. But the write side doesn't have
    any dependency between (1) and (2), so they could be reordered and the
    readers could dereference an invalid mn->ops. mmu_notifier_register()
    does take all the mm locks before adding to the hlist, but those have
    acquire semantics which isn't sufficient.

    By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
    hlist using a store-release, ensuring that readers see prior
    initialization of my_struct. This situation is better illustrated by
    litmus test MP+onceassign+derefonce.

    Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
    Fixes: cddb8a5c14aa ("mmu-notifiers: core")
    Signed-off-by: Jean-Philippe Brucker
    Cc: Jérôme Glisse
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean-Philippe Brucker
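    The one-line fix, in sketch form: publish the notifier with a store-release
    so that readers that see the new list entry also see its ops (field names
    reflect the mmu_notifier_mm naming in use at the time):

        /* before: plain store, can be reordered ahead of my_struct->mn.ops = &my_ops */
        hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);

        /* after: rcu_assign_pointer() semantics order the prior initialization first */
        hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);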
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 35 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • Add a helper to test if a range is being updated to read only (it is still
    valid to read from the range). This is useful for device drivers or anyone
    who wishes to optimize out updates when they know that they already have
    the range mapped read only.

    Link: http://lkml.kernel.org/r/20190326164747.24405-9-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
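    A sketch of the intended use from a notifier callback (the helper name is
    as merged; the driver policy in the comment is hypothetical):

        static int drv_invalidate_range_start(struct mmu_notifier *mn,
                                              const struct mmu_notifier_range *range)
        {
                /*
                 * If the CPU mapping is only being downgraded to read-only and
                 * the device mapping is already read-only, the expensive device
                 * invalidation can be skipped entirely.
                 */
                if (mmu_notifier_range_update_to_read_only(range))
                        return 0;       /* hypothetical policy: mappings already RO */

                /* ... otherwise tear down the device mapping for the range ... */
                return 0;
        }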
     
  • Use the mmu_notifier_range_blockable() helper function instead of directly
    dereferencing the range->blockable field. This is done to make it easier
    to change the mmu_notifier range field.

    This patch is the outcome of the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    identifier I1, FN;
    @@
    FN(..., struct mmu_notifier_range *I1, ...) {
    <...
    -I1->blockable
    +mmu_notifier_range_blockable(I1)
    ...>
    }
    ---------------------------------------------------------------------->%

    spatch --in-place --sp-file blockable.spatch --dir .

    Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
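    Typical usage after the conversion (a sketch; the mutex is a stand-in for
    whatever lock the driver actually holds):

        static DEFINE_MUTEX(drv_lock);

        static int drv_invalidate_range_start(struct mmu_notifier *mn,
                                              const struct mmu_notifier_range *range)
        {
                /* old: if (!range->blockable) ... */
                if (mmu_notifier_range_blockable(range))
                        mutex_lock(&drv_lock);
                else if (!mutex_trylock(&drv_lock))
                        return -EAGAIN;         /* only legal when not allowed to block */

                /* ... invalidate device state for range->start .. range->end ... */
                mutex_unlock(&drv_lock);
                return 0;
        }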
     

29 Dec, 2018

3 commits

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Patch series "mmu notifier contextual informations", v2.

    This patchset adds contextual information, i.e. why an invalidation is
    happening, to the mmu notifier callbacks. This is necessary for users of
    mmu notifiers that wish to maintain their own data structures without
    having to add new fields to struct vm_area_struct (vma).

    For instance, a device can have its own page table that mirrors the
    process address space. When a vma is unmapped (munmap() syscall) the
    device driver can free the device page table for the range.

    Today we do not have any information on why an mmu notifier callback is
    happening, and thus device drivers have to assume that it is always a
    munmap(). This is inefficient as it means the driver needs to re-allocate
    the device page table on the next page fault and rebuild the whole device
    driver data structure for the range.

    Other use cases besides munmap() also exist; for instance it is pointless
    for a device driver to invalidate the device page table when the
    invalidation is for soft dirtiness tracking. Or the device driver can
    optimize away an mprotect() that changes the page table permissions for
    the range.

    This patchset enables all of these optimizations for device drivers. I do
    not include any of them in this series, but another patchset I am posting
    will leverage this.

    The patchset is pretty simple from a code point of view. The first two
    patches consolidate all mmu notifier arguments into a struct so that it is
    easier to add/change arguments. The last patch adds the contextual
    information (munmap, protection, soft dirty, clear, ...).

    This patch (of 3):

    To avoid having to change many callback definitions every time we want to
    add a parameter, use a structure to group all parameters for the
    mmu_notifier invalidate_range_start/end callbacks. No functional changes
    with this patch.

    [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
    Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Jan Kara
    Acked-by: Jason Gunthorpe [infiniband]
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
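    With the event field in place, a driver callback can branch on why the
    invalidation is happening; a sketch (the event names are the merged enum
    values, the skipping policy is a hypothetical driver choice):

        static int drv_invalidate_range_start(struct mmu_notifier *mn,
                                              const struct mmu_notifier_range *range)
        {
                switch (range->event) {
                case MMU_NOTIFY_SOFT_DIRTY:
                        return 0;       /* write-protect for dirty tracking: nothing to free */
                case MMU_NOTIFY_UNMAP:
                        /* munmap(): release the device page tables for the range */
                        break;
                default:
                        /* MMU_NOTIFY_CLEAR, MMU_NOTIFY_PROTECTION_VMA, ... */
                        break;
                }
                /* ... */
                return 0;
        }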
     
  • Contrary to its name, mmu_notifier_synchronize() does not synchronize the
    notifier's SRCU instance, but rather waits for RCU callbacks to finish.
    i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on
    this matter, explicitly calling out that rcu_barrier() does not imply
    synchronize_rcu().

    As there are no callers of mmu_notifier_synchronize() and it's unclear
    whether any user of mmu_notifier_call_srcu() will ever want to barrier on
    their callbacks, simply remove the function.

    Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com
    Signed-off-by: Sean Christopherson
    Reviewed-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean Christopherson
     

27 Oct, 2018

1 commit

  • Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with
    blockable invalidate callbacks").

    The MMU_INVALIDATE_DOES_NOT_BLOCK flag was the only one used and it is no
    longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for
    mmu notifiers"). We now have a full support for per range !blocking
    behavior so we can drop the stop gap workaround which the per notifier
    flag was used for.

    Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Boris Ostrovsky
    Cc: Jerome Glisse
    Cc: Juergen Gross
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Aug, 2018

1 commit

  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks
    there is no reason to automatically assume those locks are held. Moreover
    majority of notifiers only care about a portion of the address space and
    there is absolutely zero reason to fail when we are unmapping an unrelated
    range. Many notifiers do really block and wait for HW which is harder to
    handle and we have to bail out though.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continue as long as we do not block down the call chain.

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu notifier managed memory and set the hard limit to something really
    small. Then we are looking for a proper process tear down.

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Feb, 2018

1 commit

  • Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu
    notifiers") prevented the oom reaper from unmapping private anonymous
    memory when the oom victim mm had mmu notifiers registered.

    The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
    around the unmap_page_range(), which is needed, can block and the oom
    killer will stall forever waiting for the victim to exit, which may not
    be possible without reaping.

    That concern is real, but only true for mmu notifiers that have
    blockable invalidate_range_{start,end}() callbacks. This patch adds a
    "flags" field to mmu notifier ops that can set a bit to indicate that
    these callbacks do not block.

    The implementation is steered toward an expensive slowpath, such as
    after the oom reaper has grabbed mm->mmap_sem of a still alive oom
    victim.

    [rientjes@google.com: mmu_notifier_invalidate_range_end() can also call invalidate_range(), which must not block; fix comment]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
    [akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Paolo Bonzini
    Acked-by: Christian König
    Acked-by: Dimitri Sivanich
    Cc: Andrea Arcangeli
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Oded Gabbay
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Joerg Roedel
    Cc: Doug Ledford
    Cc: Jani Nikula
    Cc: Mike Marciniszyn
    Cc: Sean Hefty
    Cc: Boris Ostrovsky
    Cc: Jérôme Glisse
    Cc: Radim Krčmář
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Nov, 2017

1 commit

  • This is an optimization patch that only affects mmu_notifier users which
    rely on the invalidate_range() callback. This patch avoids calling that
    callback twice in a row from inside __mmu_notifier_invalidate_range_end().

    Existing pattern (before this patch):
    mmu_notifier_invalidate_range_start()
    pte/pmd/pud_clear_flush_notify()
    mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_end()
    mmu_notifier_invalidate_range()

    New pattern (after this patch):
    mmu_notifier_invalidate_range_start()
    pte/pmd/pud_clear_flush_notify()
    mmu_notifier_invalidate_range()
    mmu_notifier_invalidate_range_only_end()

    We call the invalidate_range callback after clearing the page table
    under the page table lock and we skip the call to invalidate_range
    inside the __mmu_notifier_invalidate_range_end() function.

    Idea from Andrea Arcangeli

    Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Joerg Roedel
    Cc: Suravee Suthikulpanit
    Cc: David Woodhouse
    Cc: Alistair Popple
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Stephen Rothwell
    Cc: Andrew Donnellan
    Cc: Nadav Amit
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
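    The new pattern in sketch form (shown with the later range-struct calling
    convention for brevity; at the time of this commit the calls still took
    mm/start/end directly):

        mmu_notifier_invalidate_range_start(&range);
        /* Clears the PTE, flushes the TLB and calls ->invalidate_range() once,
         * with the caller holding the page table lock. */
        ptep_clear_flush_notify(vma, addr, ptep);
        /* ... */
        /* Skip the second ->invalidate_range() a plain _end() would issue. */
        mmu_notifier_invalidate_range_only_end(&range);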
     

01 Sep, 2017

1 commit

  • The invalidate_page callback suffered from two pitfalls. First, it used
    to happen after the page table lock was released and thus a new page
    might have been set up before the call to invalidate_page() happened.

    This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
    try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
    callback under the page table lock but this also broke several existing
    users of the mmu_notifier API that assumed they could sleep inside this
    callback.

    The second pitfall was that invalidate_page() was the only callback not
    taking a range of addresses for the invalidation, being given an address
    and a page instead. Many of the callback implementers assumed this
    could never be THP and thus failed to invalidate the appropriate range
    for THP.

    By killing this callback we unify the mmu_notifier callback API to
    always take a virtual address range as input.

    Finally this also simplifies life for end users as there are now two clear
    choices:
    - invalidate_range_start()/end() callback (which allow you to sleep)
    - invalidate_range() where you cannot sleep, but which happens right after
    the page table update under the page table lock

    Signed-off-by: Jérôme Glisse
    Cc: Bernhard Held
    Cc: Adam Borowski
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

19 Apr, 2017

1 commit

  • The MM-notifier code currently dynamically initializes the srcu_struct
    named "srcu" at subsys_initcall() time, and includes a BUG_ON() to check
    this initialization in do_mmu_notifier_register(). Unfortunately, there
    is no foolproof way to verify that an srcu_struct has been initialized,
    given the possibility of an srcu_struct being allocated on the stack or
    on the heap. This means that creating an srcu_struct_is_initialized()
    function is not a reasonable course of action. Nor is peppering
    do_mmu_notifier_register() with SRCU-specific #ifdefs an attractive
    alternative.

    This commit therefore uses DEFINE_STATIC_SRCU() to initialize
    this srcu_struct at compile time, thus eliminating both the
    subsys_initcall()-time initialization and the runtime BUG_ON().

    Signed-off-by: Paul E. McKenney
    Cc:
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: "Peter Zijlstra (Intel)"
    Cc: Vegard Nossum

    Paul E. McKenney
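    The change boils down to replacing run-time initialization with a
    compile-time definition; sketched:

        /* before: initialized at subsys_initcall() time */
        static struct srcu_struct srcu;

        static int __init mmu_notifier_init(void)
        {
                return init_srcu_struct(&srcu);
        }
        subsys_initcall(mmu_notifier_init);

        /* after: fully initialized at compile time, no BUG_ON() needed */
        DEFINE_STATIC_SRCU(srcu);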
     

02 Mar, 2017

1 commit

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
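    Assuming the header being split out is <linux/sched/mm.h> (the API list
    above is what later moves there), the placeholder is roughly of this shape:

        /* include/linux/sched/mm.h -- temporary placeholder */
        #ifndef _LINUX_SCHED_MM_H
        #define _LINUX_SCHED_MM_H

        #include <linux/sched.h>

        #endif /* _LINUX_SCHED_MM_H */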
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
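    The helper itself is a one-liner around the existing reference count:

        static inline void mmgrab(struct mm_struct *mm)
        {
                /* Pin the mm_struct itself (mm_count), not the address space (mm_users). */
                atomic_inc(&mm->mm_count);
        }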
     

11 Sep, 2015

1 commit

  • In the scope of the idle memory tracking feature, which is introduced by
    the following patch, we need to clear the referenced/accessed bit not only
    in primary, but also in secondary ptes. The latter is required in order
    to estimate the working set size (wss) of KVM VMs. At the same time we
    want to avoid flushing the TLB, because it is quite expensive and it won't
    really affect the final result.

    Currently, there is no function for clearing the pte young bit that would
    meet our requirements, so this patch introduces one. To achieve that we
    have to add a new mmu-notifier callback, clear_young, since there is no
    method for testing-and-clearing a secondary pte without flushing the TLB.
    The new method is not mandatory and currently only implemented by KVM.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Acked-by: Paolo Bonzini
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
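    A sketch of the new hook from the implementer's side (the callback
    prototype matches the merged op; the drv_* implementation is hypothetical):

        /* New optional op: clear the accessed bit in secondary PTEs without a TLB flush. */
        static int drv_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
                                   unsigned long start, unsigned long end)
        {
                /* ... test-and-clear young in the secondary (e.g. KVM) page tables ... */
                return 0;       /* return non-zero if any secondary pte was young */
        }

        static const struct mmu_notifier_ops drv_mn_ops = {
                .clear_young = drv_clear_young,
        };

        /* Callers use the new ptep_clear_young_notify() helper to clear the
         * primary pte's young bit and, via this op, the secondary ptes,
         * without forcing a TLB flush. */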