07 Nov, 2019

1 commit

  • The return code from the op callback is actually in _ret, while the
    WARN_ON was checking ret, causing it to misfire.
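
    A minimal sketch of the corrected check, reusing the variable names from
    the description (the -EAGAIN convention for non-blockable failures is an
    assumption drawn from the older entries further down):

        int _ret = mn->ops->invalidate_range_start(mn, range);

        if (_ret) {
                /* _ret, not ret: warn based on what this callback
                 * actually returned */
                WARN_ON(mmu_notifier_range_blockable(range) ||
                        _ret != -EAGAIN);
                ret = _ret;
        }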

    Link: http://lkml.kernel.org/r/20191025175502.GA31127@ziepe.ca
    Fixes: 8402ce61bec2 ("mm/mmu_notifiers: check if mmu notifier callbacks are allowed to fail")
    Signed-off-by: Jason Gunthorpe
    Reviewed-by: Andrew Morton
    Cc: Daniel Vetter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason Gunthorpe
     

07 Sep, 2019

3 commits

  • We need to make sure implementations don't cheat and don't have a possible
    schedule/blocking point deeply buried where review can't catch it.

    I'm not sure whether this is the best way to make sure all the
    might_sleep() callsites trigger, and it's a bit ugly in the code flow.
    But it gets the job done.

    Inspired by an i915 patch series which did exactly that, because the rules
    haven't been entirely clear to us.
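
    A rough sketch of the shape this takes in the entry point (helper names
    follow the ones mentioned elsewhere in this log; treat the exact layout
    as illustrative):

        static inline void
        mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
        {
                /* Fire the debug check on every call, even when no
                 * notifier is registered or nothing ends up sleeping. */
                might_sleep();
                if (mm_has_notifiers(range->mm))
                        __mmu_notifier_invalidate_range_start(range);
        }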

    Link: https://lore.kernel.org/r/20190826201425.17547-5-daniel.vetter@ffwll.ch
    Reviewed-by: Christian König (v1)
    Reviewed-by: Jérôme Glisse (v4)
    Signed-off-by: Daniel Vetter
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     
  • We want to teach lockdep that mmu notifiers can be called from direct
    reclaim paths, since on many CI systems load might never reach that
    level (e.g. when just running a fuzzer or small functional tests).

    I've put the annotation into mmu_notifier_register since only when we have
    mmu notifiers registered is there any point in teaching lockdep about
    them. Also, we already have a kmalloc(..., GFP_KERNEL) there, so this is safe.
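
    A sketch of the one-time priming this describes; the lockdep map name is
    borrowed from the companion patch below, and the exact placement inside
    mmu_notifier_register() is illustrative:

        if (IS_ENABLED(CONFIG_LOCKDEP)) {
                fs_reclaim_acquire(GFP_KERNEL);
                lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
                lock_map_release(&__mmu_notifier_invalidate_range_start_map);
                fs_reclaim_release(GFP_KERNEL);
        }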

    Link: https://lore.kernel.org/r/20190826201425.17547-3-daniel.vetter@ffwll.ch
    Suggested-by: Jason Gunthorpe
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Daniel Vetter
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     
  • This is a similar idea to the fs_reclaim fake lockdep lock. It's fairly
    easy to provoke a specific notifier to be run on a specific range: Just
    prep it, and then munmap() it.

    A bit harder, but still doable, is to provoke the mmu notifiers for all
    the various callchains that might lead to them. But both at the same time
    is really hard to reliably hit, especially when you want to exercise paths
    like direct reclaim or compaction, where it's not easy to control what
    exactly will be unmapped.

    By introducing a lockdep map to tie them all together we allow lockdep to
    see a lot more dependencies, without having to actually hit them in a
    single callchain while testing.

    On Jason's suggestion this is rolled out for both
    invalidate_range_start and invalidate_range_end. They both have the same
    calling context, hence we can share the same lockdep map. Note that the
    annotation for invalidate_range_start is outside of the
    mm_has_notifiers(), to make sure lockdep is informed about all paths
    leading to this context irrespective of whether mmu notifiers are present
    for a given context. We don't do that on the invalidate_range_end side to
    avoid paying the overhead twice, there the lockdep annotation is pushed
    down behind the mm_has_notifiers() check.
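
    A sketch of the annotation described above (the map name is an assumed
    convention and the wrappers are simplified):

        /* One fake lock tying all invalidate_range_start/end callers
         * together so lockdep can combine dependencies seen in
         * different callchains. */
        extern struct lockdep_map __mmu_notifier_invalidate_range_start_map;

        static inline void
        mmu_notifier_invalidate_range_start(struct mmu_notifier_range *range)
        {
                /* Outside mm_has_notifiers(): lockdep learns about this
                 * context even if no notifier is registered. */
                lock_map_acquire(&__mmu_notifier_invalidate_range_start_map);
                if (mm_has_notifiers(range->mm))
                        __mmu_notifier_invalidate_range_start(range);
                lock_map_release(&__mmu_notifier_invalidate_range_start_map);
        }

        static inline void
        mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
        {
                /* The end-side annotation sits behind mm_has_notifiers(),
                 * inside __mmu_notifier_invalidate_range_end(), so the
                 * lockdep overhead is not paid twice. */
                if (mm_has_notifiers(range->mm))
                        __mmu_notifier_invalidate_range_end(range);
        }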

    Link: https://lore.kernel.org/r/20190826201425.17547-2-daniel.vetter@ffwll.ch
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Daniel Vetter
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     

22 Aug, 2019

1 commit

  • mmu_notifier_unregister_no_release() and mmu_notifier_call_srcu() no
    longer have any users, they have all been converted to use
    mmu_notifier_put().

    So delete this difficult-to-use interface.

    Link: https://lore.kernel.org/r/20190806231548.25242-12-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

20 Aug, 2019

1 commit

  • Just a bit of paranoia, since if we start pushing this deep into
    callchains it's hard to spot all places where an mmu notifier
    implementation might fail when it's not allowed to.

    Inspired by some confusion we had discussing i915 mmu notifiers and
    whether we could use the newly-introduced return value to handle some
    corner cases, until we realized that these are only for when a task has
    been killed by the oom reaper.

    An alternative approach would be to split the callback into two versions,
    one with the int return value, and the other with void return value like
    in older kernels. But that's a lot more churn for fairly little gain I
    think.

    Summary from the m-l discussion on why we want something at warning level:
    This allows automated tooling in CI to catch bugs without humans having to
    look at everything. If we just upgrade the existing pr_info to a pr_warn,
    then we'll have false positives. And as-is, no one will ever spot the
    problem since it's lost in the massive amounts of overall dmesg noise.
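
    A sketch of the check this adds (variable names assumed); the idea is to
    warn only for failures that are actually forbidden, i.e. anything other
    than -EAGAIN from a non-blockable, oom-reaper style invalidation:

        ret = mn->ops->invalidate_range_start(mn, range);
        if (ret) {
                /* Only the oom reaper's non-blockable invalidations may
                 * legitimately fail; anything else should be loud enough
                 * for CI to notice. */
                WARN_ON(mmu_notifier_range_blockable(range) ||
                        ret != -EAGAIN);
        }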

    Link: https://lore.kernel.org/r/20190814202027.18735-2-daniel.vetter@ffwll.ch
    Signed-off-by: Daniel Vetter
    Reviewed-by: Jason Gunthorpe
    Signed-off-by: Jason Gunthorpe

    Daniel Vetter
     

16 Aug, 2019

3 commits

  • Many places in the kernel have a flow where userspace will create some
    object and that object will need to connect to the subsystem's
    mmu_notifier subscription for the duration of its lifetime.

    In this case the subsystem is usually tracking multiple mm_structs and it
    is difficult to keep track of which struct mmu_notifiers have been
    allocated for which mm's.

    Since this has been open coded in a variety of exciting ways, provide core
    functionality to do this safely.

    This approach uses the struct mmu_notifier_ops * as a key to determine if
    the subsystem has a notifier registered on the mm or not. If there is a
    registration then the existing notifier struct is returned, otherwise the
    ops->alloc_notifier() is used to create a new per-subsystem notifier for
    the mm.

    The destroy side incorporates an async call_srcu based destruction which
    will avoid bugs in the callers such as commit 6d7c3cde93c1 ("mm/hmm: fix
    use after free with struct hmm in the mmu notifiers").

    Since we are inside the mmu notifier core, locking is fairly simple: the
    allocation uses the same approach as for mmu_notifier_mm, the write side
    of the mmap_sem makes everything deterministic, and we only need to do
    hlist_add_head_rcu() under the mm_take_all_locks(). The new users count
    and the discoverability in the hlist are fully serialized by the
    mmu_notifier_mm->lock.
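
    A hypothetical driver-side sketch of the resulting get/put lifetime; the
    my_* names are made up, while mmu_notifier_put() is the counterpart
    mentioned in the 22 Aug, 2019 entry above:

        struct my_ctx {
                struct mmu_notifier mn;   /* one per (ops, mm) pair */
                /* subsystem-private state ... */
        };

        static struct mmu_notifier *my_alloc_notifier(struct mm_struct *mm)
        {
                struct my_ctx *ctx = kzalloc(sizeof(*ctx), GFP_KERNEL);

                return ctx ? &ctx->mn : ERR_PTR(-ENOMEM);
        }

        static void my_free_notifier(struct mmu_notifier *mn)
        {
                kfree(container_of(mn, struct my_ctx, mn));
        }

        static const struct mmu_notifier_ops my_ops = {
                .alloc_notifier = my_alloc_notifier,
                .free_notifier  = my_free_notifier,
        };

        /* Object creation: returns the existing per-(ops, mm) notifier or
         * allocates a new one through ->alloc_notifier(). */
        static struct my_ctx *my_ctx_get(struct mm_struct *mm)
        {
                struct mmu_notifier *mn = mmu_notifier_get(&my_ops, mm);

                if (IS_ERR(mn))
                        return ERR_CAST(mn);
                return container_of(mn, struct my_ctx, mn);
        }

        /* Object destruction: the final put frees the notifier later, via
         * SRCU and ->free_notifier(), avoiding use-after-free. */
        static void my_ctx_put(struct my_ctx *ctx)
        {
                mmu_notifier_put(&ctx->mn);
        }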

    Link: https://lore.kernel.org/r/20190806231548.25242-4-jgg@ziepe.ca
    Co-developed-by: Christoph Hellwig
    Signed-off-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • A prior commit e0f3c3f78da2 ("mm/mmu_notifier: init notifier if necessary")
    made an attempt at doing this, but had to be reverted as calling
    the GFP_KERNEL allocator under the i_mmap_mutex causes deadlock, see
    commit 35cfa2b0b491 ("mm/mmu_notifier: allocate mmu_notifier in advance").

    However, we can avoid that problem by doing the allocation only under
    the mmap_sem, which is already happening.

    Since all writers to mm->mmu_notifier_mm hold the write side of the
    mmap_sem, reading it under that sem is deterministic and we can use that
    to decide if the allocation path is required, without speculation.

    The actual update to mmu_notifier_mm must still be done under the
    mm_take_all_locks() to ensure read-side coherency.
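
    A condensed sketch of the resulting registration flow (reference counting
    and cleanup trimmed; field and function names are assumptions consistent
    with the surrounding entries):

        int __mmu_notifier_register(struct mmu_notifier *mn, struct mm_struct *mm)
        {
                struct mmu_notifier_mm *mmu_notifier_mm = NULL;
                int ret;

                lockdep_assert_held_write(&mm->mmap_sem);

                if (!mm->mmu_notifier_mm) {
                        /* GFP_KERNEL is fine here: only mmap_sem is held,
                         * not i_mmap_mutex as in the reverted attempt. */
                        mmu_notifier_mm =
                                kzalloc(sizeof(*mmu_notifier_mm), GFP_KERNEL);
                        if (!mmu_notifier_mm)
                                return -ENOMEM;
                        INIT_HLIST_HEAD(&mmu_notifier_mm->list);
                        spin_lock_init(&mmu_notifier_mm->lock);
                }

                ret = mm_take_all_locks(mm);
                if (unlikely(ret))
                        goto out_free;

                if (mmu_notifier_mm)
                        mm->mmu_notifier_mm = mmu_notifier_mm; /* publish */

                hlist_add_head_rcu(&mn->hlist, &mm->mmu_notifier_mm->list);
                mm_drop_all_locks(mm);
                return 0;

        out_free:
                kfree(mmu_notifier_mm);
                return ret;
        }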

    Link: https://lore.kernel.org/r/20190806231548.25242-3-jgg@ziepe.ca
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     
  • This simplifies the code to not have so many one-line functions and extra
    logic. __mmu_notifier_register() simply becomes the entry point to
    register the notifier, and the other one calls it under lock.

    Also add a lockdep_assert to check that the callers are holding the lock
    as expected.

    Link: https://lore.kernel.org/r/20190806231548.25242-2-jgg@ziepe.ca
    Suggested-by: Christoph Hellwig
    Reviewed-by: Christoph Hellwig
    Reviewed-by: Ralph Campbell
    Tested-by: Ralph Campbell
    Signed-off-by: Jason Gunthorpe

    Jason Gunthorpe
     

13 Jul, 2019

1 commit

  • Make mmu_notifier_register() safer by issuing a memory barrier before
    registering a new notifier. This fixes a theoretical bug on weakly
    ordered CPUs. For example, take this simplified use of notifiers by a
    driver:

        my_struct->mn.ops = &my_ops;                            /* (1) */
        mmu_notifier_register(&my_struct->mn, mm)
            ...
            hlist_add_head(&mn->hlist, &mm->mmu_notifiers);     /* (2) */
            ...

    Once mmu_notifier_register() releases the mm locks, another thread can
    invalidate a range:

        mmu_notifier_invalidate_range()
            ...
            hlist_for_each_entry_rcu(mn, &mm->mmu_notifiers, hlist) {
                if (mn->ops->invalidate_range)
                    mn->ops->invalidate_range(mn, mm, start, end);

    The read side relies on the data dependency between mn and ops to ensure
    that the pointer is properly initialized. But the write side doesn't have
    any dependency between (1) and (2), so they could be reordered and the
    readers could dereference an invalid mn->ops. mmu_notifier_register()
    does take all the mm locks before adding to the hlist, but those have
    acquire semantics which isn't sufficient.

    By calling hlist_add_head_rcu() instead of hlist_add_head() we update the
    hlist using a store-release, ensuring that readers see prior
    initialization of my_struct. This situation is better illustrated by
    litmus test MP+onceassign+derefonce.

    Link: http://lkml.kernel.org/r/20190502133532.24981-1-jean-philippe.brucker@arm.com
    Fixes: cddb8a5c14aa ("mmu-notifiers: core")
    Signed-off-by: Jean-Philippe Brucker
    Cc: Jérôme Glisse
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean-Philippe Brucker
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2 see
    the copying file in the top level directory

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 35 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Kate Stewart
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.797835076@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

  • Helper to test if a range is updated to read only (it is still valid to
    read from the range). This is useful for device drivers or anyone who
    wishes to optimize out updates when they know that they already have the
    range mapped read-only.
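
    A hypothetical example of the optimization this enables in a driver
    callback, assuming the helper is named
    mmu_notifier_range_update_to_read_only(); everything my_* is made up:

        static int my_invalidate_range_start(struct mmu_notifier *mn,
                                const struct mmu_notifier_range *range)
        {
                struct my_mirror *m = container_of(mn, struct my_mirror, mn);

                /* The CPU mapping is only being downgraded to read-only
                 * and the device mapping already is read-only: nothing
                 * to invalidate. */
                if (mmu_notifier_range_update_to_read_only(range) &&
                    my_mirror_is_read_only(m, range->start, range->end))
                        return 0;

                return my_mirror_invalidate(m, range->start, range->end);
        }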

    Link: http://lkml.kernel.org/r/20190326164747.24405-9-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Use the mmu_notifier_range_blockable() helper function instead of directly
    dereferencing the range->blockable field. This is done to make it easier
    to change the mmu_notifier range field.

    This patch is the outcome of the following coccinelle patch:

    %<-------------------------------------------------------------------
    @@
    identifier I1, FN;
    @@
    FN(..., struct mmu_notifier_range *I1, ...) {
    <...
    -I1->blockable
    +mmu_notifier_range_blockable(I1)
    ...>
    }
    ------------------------------------------------------------------->%

    spatch --in-place --sp-file blockable.spatch --dir .

    Link: http://lkml.kernel.org/r/20190326164747.24405-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

29 Dec, 2018

3 commits

  • To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
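
    A sketch of the resulting call-site shape; the init helper's argument
    list grew in later kernels, so treat it as illustrative:

        struct mmu_notifier_range range;

        /* All parameters travel in one struct instead of an ever-growing
         * argument list. */
        mmu_notifier_range_init(&range, mm, start, end);

        mmu_notifier_invalidate_range_start(&range);
        /* ... clear or modify the page tables for [start, end) ... */
        mmu_notifier_invalidate_range_end(&range);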

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Patch series "mmu notifier contextual informations", v2.

    This patchset adds contextual information, i.e. why an invalidation is
    happening, to the mmu notifier callbacks. This is necessary for users of
    mmu notifiers that wish to maintain their own data structures without
    having to add new fields to struct vm_area_struct (vma).

    For instance, a device can have its own page table that mirrors the
    process address space. When a vma is unmapped (munmap() syscall) the
    device driver can free the device page table for the range.

    Today we do not have any information on why an mmu notifier callback is
    happening, and thus the device driver has to assume that it is always a
    munmap(). This is inefficient as it means it needs to re-allocate the
    device page table on the next page fault and rebuild the whole device
    driver data structure for the range.

    Other use cases besides munmap() also exist; for instance, it is
    pointless for the device driver to invalidate the device page table when
    the invalidation is only for soft-dirty tracking, and the device driver
    can optimize away an mprotect() that only changes the page table access
    permissions for the range.

    This patchset enables all these optimizations for device drivers. I do not
    include any of those in this series but another patchset I am posting will
    leverage this.

    The patchset is pretty simple from a code point of view. The first two
    patches consolidate all mmu notifier arguments into a struct so that it is
    easier to add/change arguments. The last patch adds the contextual
    information (munmap, protection, soft dirty, clear, ...).

    This patch (of 3):

    To avoid having to change many callback definitions every time we want to
    add a parameter, use a structure to group all parameters for the
    mmu_notifier invalidate_range_start/end callback. No functional changes
    with this patch.

    [akpm@linux-foundation.org: fix drivers/gpu/drm/amd/amdgpu/amdgpu_mn.c kerneldoc]
    Link: http://lkml.kernel.org/r/20181205053628.3210-2-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Jan Kara
    Acked-by: Jason Gunthorpe [infiniband]
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Contrary to its name, mmu_notifier_synchronize() does not synchronize the
    notifier's SRCU instance, but rather waits for RCU callbacks to finish,
    i.e. it invokes rcu_barrier(). The RCU documentation is quite clear on
    this matter, explicitly calling out that rcu_barrier() does not imply
    synchronize_rcu().

    As there are no callers of mmu_notifier_synchronize() and it's unclear
    whether any user of mmu_notifier_call_srcu() will ever want to barrier on
    their callbacks, simply remove the function.

    Link: http://lkml.kernel.org/r/20181106134705.14197-1-sean.j.christopherson@intel.com
    Signed-off-by: Sean Christopherson
    Reviewed-by: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sean Christopherson
     

27 Oct, 2018

1 commit

  • Revert 5ff7091f5a2ca ("mm, mmu_notifier: annotate mmu notifiers with
    blockable invalidate callbacks").

    The MMU_INVALIDATE_DOES_NOT_BLOCK flag was the only one used and it is no
    longer needed since 93065ac753e4 ("mm, oom: distinguish blockable mode for
    mmu notifiers"). We now have full support for per-range !blocking
    behavior so we can drop the stop-gap workaround which the per-notifier
    flag was used for.

    Link: http://lkml.kernel.org/r/20180827112623.8992-4-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Cc: David Rientjes
    Cc: Boris Ostrovsky
    Cc: Jerome Glisse
    Cc: Juergen Gross
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

23 Aug, 2018

1 commit

  • There are several blockable mmu notifiers which might sleep in
    mmu_notifier_invalidate_range_start and that is a problem for the
    oom_reaper because it needs to guarantee a forward progress so it cannot
    depend on any sleepable locks.

    Currently we simply back off and mark an oom victim with blockable mmu
    notifiers as done after a short sleep. That can result in selecting a new
    oom victim prematurely because the previous one still hasn't torn its
    memory down yet.

    We can do much better though. Even if mmu notifiers use sleepable locks
    there is no reason to automatically assume those locks are held. Moreover,
    the majority of notifiers only care about a portion of the address space
    and there is absolutely zero reason to fail when we are unmapping an
    unrelated range. Many notifiers do really block and wait for HW, which is
    harder to handle, and for those we have to bail out.

    This patch handles the low hanging fruit.
    __mmu_notifier_invalidate_range_start gets a blockable flag and callbacks
    are not allowed to sleep if the flag is set to false. This is achieved by
    using trylock instead of the sleepable lock for most callbacks and
    continuing as long as we do not block down the call chain.

    I think we can improve that even further because there is a common pattern
    to do a range lookup first and then do something about that. The first
    part can be done without a sleeping lock in most cases AFAICS.

    The oom_reaper end then simply retries if there is at least one notifier
    which couldn't make any progress in !blockable mode. A retry loop is
    already implemented to wait for the mmap_sem and this is basically the
    same thing.

    The simplest way for driver developers to test this code path is to wrap
    userspace code which uses these notifiers into a memcg and set the hard
    limit to hit the oom. This can be done e.g. after the test faults in all
    the mmu notifier managed memory and set the hard limit to something really
    small. Then we are looking for a proper process tear down.
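
    A hypothetical driver-side sketch of the resulting contract. The original
    patch passed an explicit blockable flag to invalidate_range_start();
    later kernels query it through mmu_notifier_range_blockable(), which is
    what this sketch uses, and -EAGAIN as the back-off error is an assumption
    here. my_dev and its lock are made up:

        static int my_invalidate_range_start(struct mmu_notifier *mn,
                                const struct mmu_notifier_range *range)
        {
                struct my_dev *dev = container_of(mn, struct my_dev, mn);

                if (mmu_notifier_range_blockable(range))
                        mutex_lock(&dev->lock);
                else if (!mutex_trylock(&dev->lock))
                        return -EAGAIN;  /* tell the oom_reaper to retry */

                my_dev_unmap(dev, range->start, range->end);
                mutex_unlock(&dev->lock);
                return 0;
        }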

    [akpm@linux-foundation.org: coding style fixes]
    [akpm@linux-foundation.org: minor code simplification]
    Link: http://lkml.kernel.org/r/20180716115058.5559-1-mhocko@kernel.org
    Signed-off-by: Michal Hocko
    Acked-by: Christian König # AMD notifiers
    Acked-by: Leon Romanovsky # mlx and umem_odp
    Reported-by: David Rientjes
    Cc: "David (ChunMing) Zhou"
    Cc: Paolo Bonzini
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Jani Nikula
    Cc: Joonas Lahtinen
    Cc: Rodrigo Vivi
    Cc: Doug Ledford
    Cc: Jason Gunthorpe
    Cc: Mike Marciniszyn
    Cc: Dennis Dalessandro
    Cc: Sudeep Dutt
    Cc: Ashutosh Dixit
    Cc: Dimitri Sivanich
    Cc: Boris Ostrovsky
    Cc: Juergen Gross
    Cc: "Jérôme Glisse"
    Cc: Andrea Arcangeli
    Cc: Felix Kuehling
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

01 Feb, 2018

1 commit

  • Commit 4d4bbd8526a8 ("mm, oom_reaper: skip mm structs with mmu
    notifiers") prevented the oom reaper from unmapping private anonymous
    memory when the oom victim mm had mmu notifiers registered.

    The rationale is that doing mmu_notifier_invalidate_range_{start,end}()
    around the unmap_page_range(), which is needed, can block and the oom
    killer will stall forever waiting for the victim to exit, which may not
    be possible without reaping.

    That concern is real, but only true for mmu notifiers that have
    blockable invalidate_range_{start,end}() callbacks. This patch adds a
    "flags" field to mmu notifier ops that can set a bit to indicate that
    these callbacks do not block.

    The implementation is steered toward an expensive slowpath, such as
    after the oom reaper has grabbed mm->mmap_sem of a still alive oom
    victim.
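
    A sketch of how a notifier of that era would advertise the new bit (the
    flag name matches the one in the 27 Oct, 2018 revert entry above; the
    callbacks are placeholders):

        static const struct mmu_notifier_ops my_ops = {
                /* Our invalidate callbacks never sleep, so the oom reaper
                 * may still reap an mm that has this notifier attached. */
                .flags                  = MMU_INVALIDATE_DOES_NOT_BLOCK,
                .invalidate_range_start = my_invalidate_range_start,
                .invalidate_range_end   = my_invalidate_range_end,
        };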

    [rientjes@google.com: mmu_notifier_invalidate_range_end() can also call the invalidate_range() must not block, fix comment]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1801091339570.240101@chino.kir.corp.google.com
    [akpm@linux-foundation.org: make mm_has_blockable_invalidate_notifiers() return bool, use rwsem_is_locked()]
    Link: http://lkml.kernel.org/r/alpine.DEB.2.10.1712141329500.74052@chino.kir.corp.google.com
    Signed-off-by: David Rientjes
    Acked-by: Michal Hocko
    Acked-by: Paolo Bonzini
    Acked-by: Christian König
    Acked-by: Dimitri Sivanich
    Cc: Andrea Arcangeli
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Oded Gabbay
    Cc: Alex Deucher
    Cc: David Airlie
    Cc: Joerg Roedel
    Cc: Doug Ledford
    Cc: Jani Nikula
    Cc: Mike Marciniszyn
    Cc: Sean Hefty
    Cc: Boris Ostrovsky
    Cc: Jérôme Glisse
    Cc: Radim Krčmář
    Signed-off-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     

16 Nov, 2017

1 commit

  • This is an optimization patch that only affects mmu_notifier users which
    rely on the invalidate_range() callback. This patch avoids calling that
    callback twice in a row from inside __mmu_notifier_invalidate_range_end().

    Existing pattern (before this patch):

        mmu_notifier_invalidate_range_start()
            pte/pmd/pud_clear_flush_notify()
                mmu_notifier_invalidate_range()
        mmu_notifier_invalidate_range_end()
            mmu_notifier_invalidate_range()

    New pattern (after this patch):

        mmu_notifier_invalidate_range_start()
            pte/pmd/pud_clear_flush_notify()
                mmu_notifier_invalidate_range()
        mmu_notifier_invalidate_range_only_end()

    We call the invalidate_range callback after clearing the page table
    under the page table lock and we skip the call to invalidate_range
    inside the __mmu_notifier_invalidate_range_end() function.

    Idea from Andrea Arcangeli

    Link: http://lkml.kernel.org/r/20171017031003.7481-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Joerg Roedel
    Cc: Suravee Suthikulpanit
    Cc: David Woodhouse
    Cc: Alistair Popple
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Stephen Rothwell
    Cc: Andrew Donnellan
    Cc: Nadav Amit
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

01 Sep, 2017

1 commit

  • The invalidate_page callback suffered from two pitfalls. First, it used
    to happen after the page table lock was released and thus a new page
    might have been set up before the call to invalidate_page() happened.

    This is in a weird way fixed by commit c7ab0d2fdc84 ("mm: convert
    try_to_unmap_one() to use page_vma_mapped_walk()") that moved the
    callback under the page table lock but this also broke several existing
    users of the mmu_notifier API that assumed they could sleep inside this
    callback.

    The second pitfall was invalidate_page() being the only callback that did
    not take a range of addresses for the invalidation, but instead was given
    a single address and a page. Lots of the callback implementers assumed
    this could never be THP and thus failed to invalidate the appropriate
    range for THP.

    By killing this callback we unify the mmu_notifier callback API to
    always take a virtual address range as input.

    Finally this also simplifies the end user's life as there are now two
    clear choices:
    - invalidate_range_start()/end() callbacks (which allow you to sleep)
    - invalidate_range() where you cannot sleep but which happens right after
    the page table update under the page table lock

    Signed-off-by: Jérôme Glisse
    Cc: Bernhard Held
    Cc: Adam Borowski
    Cc: Andrea Arcangeli
    Cc: Radim Krčmář
    Cc: Wanpeng Li
    Cc: Paolo Bonzini
    Cc: Takashi Iwai
    Cc: Nadav Amit
    Cc: Mike Galbraith
    Cc: Kirill A. Shutemov
    Cc: axie
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

19 Apr, 2017

1 commit

  • The MM-notifier code currently dynamically initializes the srcu_struct
    named "srcu" at subsys_initcall() time, and includes a BUG_ON() to check
    this initialization in do_mmu_notifier_register(). Unfortunately, there
    is no foolproof way to verify that an srcu_struct has been initialized,
    given the possibility of an srcu_struct being allocated on the stack or
    on the heap. This means that creating an srcu_struct_is_initialized()
    function is not a reasonable course of action. Nor is peppering
    do_mmu_notifier_register() with SRCU-specific #ifdefs an attractive
    alternative.

    This commit therefore uses DEFINE_STATIC_SRCU() to initialize
    this srcu_struct at compile time, thus eliminating both the
    subsys_initcall()-time initialization and the runtime BUG_ON().
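
    In mm/mmu_notifier.c this boils down to roughly a one-liner:

        /* Compile-time initialization of the global SRCU instance,
         * replacing the old subsys_initcall()-time init_srcu_struct()
         * and the BUG_ON() in do_mmu_notifier_register(). */
        DEFINE_STATIC_SRCU(srcu);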

    Signed-off-by: Paul E. McKenney
    Cc:
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Michal Hocko
    Cc: "Peter Zijlstra (Intel)"
    Cc: Vegard Nossum

    Paul E. McKenney
     

02 Mar, 2017

1 commit

  • We are going to split <linux/sched/mm.h> out of <linux/sched.h>, which
    will have to be picked up from other headers and a couple of .c files.

    Create a trivial placeholder <linux/sched/mm.h> file that just
    maps to <linux/sched.h> to make this patch obviously correct and
    bisectable.

    The APIs that are going to be moved first are:

    mm_alloc()
    __mmdrop()
    mmdrop()
    mmdrop_async_fn()
    mmdrop_async()
    mmget_not_zero()
    mmput()
    mmput_async()
    get_task_mm()
    mm_access()
    mm_release()

    Include the new header in the files that are going to need it.

    Acked-by: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

28 Feb, 2017

1 commit

  • Apart from adding the helper function itself, the rest of the kernel is
    converted mechanically using:

    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)->mm_count);/mmgrab\(\1\);/'
    git grep -l 'atomic_inc.*mm_count' | xargs sed -i 's/atomic_inc(&\(.*\)\.mm_count);/mmgrab\(\&\1\);/'

    This is needed for a later patch that hooks into the helper, but might
    be a worthwhile cleanup on its own.

    (Michal Hocko provided most of the kerneldoc comment.)
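
    Based on the conversion above, the helper is essentially a thin wrapper
    (sketch):

        /* Pin the mm_struct itself (mm_count), so it cannot be freed,
         * without keeping the address space alive. */
        static inline void mmgrab(struct mm_struct *mm)
        {
                atomic_inc(&mm->mm_count);
        }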

    Link: http://lkml.kernel.org/r/20161218123229.22952-1-vegard.nossum@oracle.com
    Signed-off-by: Vegard Nossum
    Acked-by: Michal Hocko
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vegard Nossum
     

11 Sep, 2015

1 commit

  • In the scope of the idle memory tracking feature, which is introduced by
    the following patch, we need to clear the referenced/accessed bit not only
    in primary, but also in secondary ptes. The latter is required in order
    to estimate the working set size (wss) of KVM VMs. At the same time we
    want to avoid flushing the TLB, because it is quite expensive and it
    won't really affect the final result.

    Currently, there is no function for clearing pte young bit that would meet
    our requirements, so this patch introduces one. To achieve that we have
    to add a new mmu-notifier callback, clear_young, since there is no method
    for testing-and-clearing a secondary pte without flushing the TLB. The new method
    is not mandatory and currently only implemented by KVM.
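
    A hypothetical secondary-MMU sketch wiring up the new optional callback;
    the prototype shown assumes the range form of clear_flush_young described
    in the 24 Sep, 2014 entry further down, and the my_* names are made up:

        static int my_clear_young(struct mmu_notifier *mn,
                                  struct mm_struct *mm,
                                  unsigned long start, unsigned long end)
        {
                /* Test and clear the accessed bits in the secondary page
                 * tables for [start, end) without flushing the TLB;
                 * returns non-zero if any pte was young. */
                return my_secondary_test_and_clear_young(mn, start, end);
        }

        static const struct mmu_notifier_ops my_ops = {
                .clear_young = my_clear_young,
        };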

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Acked-by: Paolo Bonzini
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     

13 Nov, 2014

1 commit

  • Now that the mmu_notifier_invalidate_range() calls are in place, add the
    callback to allow subsystems to register against it.

    Signed-off-by: Joerg Roedel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Jérôme Glisse
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Jay Cornwall
    Cc: Oded Gabbay
    Cc: Suravee Suthikulpanit
    Cc: Jesse Barnes
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Oded Gabbay

    Joerg Roedel
     

24 Sep, 2014

1 commit

  • 1. We were calling clear_flush_young_notify in unmap_one, but we are
    within an mmu notifier invalidate range scope. The spte exists no more
    (due to range_start) and the accessed bit info has already been
    propagated (due to kvm_pfn_set_accessed). Simply call
    clear_flush_young.

    2. We clear_flush_young on a primary MMU PMD, but this may be mapped
    as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
    This required expanding the interface of the clear_flush_young mmu
    notifier, so a lot of code has been trivially touched.

    3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
    the access bit by blowing the spte. This requires proper synchronizing
    with MMU notifier consumers, like every other removal of spte's does.

    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Rik van Riel
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
     

07 Aug, 2014

1 commit

  • When kernel device drivers or subsystems want to bind their lifespan to
    the lifespan of the mm_struct, they usually use one of the following
    methods:

    1. Manually calling a function in the interested kernel module. The
    function call needs to be placed in mmput. This method was rejected
    by several kernel maintainers.

    2. Registering to the mmu notifier release mechanism.

    The problem with the latter approach is that the mmu_notifier_release
    callback is called from __mmu_notifier_release (called from exit_mmap).
    That function iterates over the list of mmu notifiers and doesn't expect
    the release callback function to remove itself from the list.
    Therefore, the callback function in the kernel module can't release the
    mmu_notifier object, which is actually the kernel module's object
    itself. As a result, the destruction of the kernel module's object must
    be done in a delayed fashion.

    This patch adds support for this delayed callback, by adding a new
    mmu_notifier_call_srcu function that receives a function ptr and calls
    that function with call_srcu. In that function, the kernel module
    releases its object. To use mmu_notifier_call_srcu, the calling module
    needs to call before that a new function called
    mmu_notifier_unregister_no_release that, as its name implies,
    unregisters a notifier without calling its notifier release callback.

    This patch also adds a function that will call barrier_srcu so those
    kernel modules can sync with mmu_notifier.
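
    A hypothetical module-side sketch of the delayed teardown this enables;
    struct my_obj and its fields are made up:

        struct my_obj {
                struct mmu_notifier     mn;
                struct mm_struct        *mm;
                struct rcu_head         rcu;
                /* module-private state ... */
        };

        static void my_obj_free(struct rcu_head *rcu)
        {
                struct my_obj *obj = container_of(rcu, struct my_obj, rcu);

                kfree(obj);     /* all SRCU readers are done with obj->mn */
        }

        static void my_obj_destroy(struct my_obj *obj)
        {
                /* Unhook from the mm without invoking our ->release()... */
                mmu_notifier_unregister_no_release(&obj->mn, obj->mm);
                /* ...and free the object only after the notifier SRCU
                 * grace period, since the core may still be walking the
                 * notifier list that obj->mn sits on. */
                mmu_notifier_call_srcu(&obj->rcu, my_obj_free);
        }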

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Oded Gabbay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 Jan, 2014

1 commit

  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However no observable impact of that difference has been observed during
    testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.

    Also, several instances of missing ";" at EOL are fixed.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
     

25 May, 2013

1 commit

  • Commit 751efd8610d3 ("mmu_notifier_unregister NULL Pointer deref and
    multiple ->release()") breaks the fix 3ad3d901bbcf ("mm: mmu_notifier:
    fix freed page still mapped in secondary MMU").

    Since hlist_for_each_entry_rcu() is changed now, we can not revert that
    patch directly, so this patch reverts the commit and simply fixes the bug
    spotted by that patch.

    This bug spotted by commit 751efd8610d3 is:

    There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result
    of a filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

    A B
    t1 srcu_read_lock()
    t2 if (!hlist_unhashed())
    t3 srcu_read_unlock()
    t4 srcu_read_lock()
    t5 hlist_del_init_rcu()
    t6 synchronize_srcu()
    t7 srcu_read_unlock()
    t8 hlist_del_rcu()

    The later, duplicate ->release() callout should be fast since all the
    pages have already been released by the first call. Anyway, this issue
    should be fixed in a separate patch.

    -stable suggestions: Any version that has commit 751efd8610d3 needs to be
    backported. I find the oldest version that has this commit is 3.0-stable.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Xiao Guangrong
    Tested-by: Robin Holt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
     

28 Feb, 2013

1 commit

  • I'm not sure why, but the hlist for each entry iterators were conceived
    differently from the list ones. While the list ones are formed as follows:

    list_for_each_entry(pos, head, member)

    The hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     

24 Feb, 2013

2 commits

  • We at SGI have a need to address some very high physical address ranges
    with our GRU (global reference unit), sometimes across partitioned
    machine boundaries and sometimes with larger addresses than the cpu
    supports. We do this with the aid of our own 'extended vma' module
    which mimics the vma. When something (either unmap or exit) frees an
    'extended vma' we use the mmu notifiers to clean them up.

    We had been able to mimic the functions
    __mmu_notifier_invalidate_range_start() and
    __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
    walking the per-mm notifier list. But with the change to a global srcu
    lock (static in mmu_notifier.c) we can no longer do that. Our module has
    no access to that lock.

    So we request that these two functions be exported.

    Signed-off-by: Cliff Wickman
    Acked-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
     
  • There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result of a
    filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

    A B
    t1 srcu_read_lock()
    t2 if (!hlist_unhashed())
    t3 srcu_read_unlock()
    t4 srcu_read_lock()
    t5 hlist_del_init_rcu()
    t6 synchronize_srcu()
    t7 srcu_read_unlock()
    t8 hlist_del_rcu()

    The list manipulation is not fully protected by the mmu_notifier_mm
    hlist lock, which can result in callouts to the ->release() notifier
    from both mmu_notifier_unregister() and __mmu_notifier_release().

    -stable suggestions:

    The stable trees prior to 3.7.y need commits 21a92735f660 and
    70400303ce0c cherry-picked in that order prior to cherry-picking this
    commit. The 3.7.y tree already has those two commits.

    Signed-off-by: Robin Holt
    Cc: Andrea Arcangeli
    Cc: Wanpeng Li
    Cc: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

26 Oct, 2012

1 commit

  • While allocating mmu_notifier with parameter GFP_KERNEL, swap would start
    to work in case of tight available memory. Eventually, that would lead to
    a deadlock while the swap daemon swaps anonymous pages. It was caused by
    commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

    =================================
    [ INFO: inconsistent lock state ]
    3.7.0-rc1+ #518 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x86/0x150
    lockdep_trace_alloc+0x67/0xc0
    kmem_cache_alloc_trace+0x33/0x230
    do_mmu_notifier_register+0x87/0x180
    mmu_notifier_register+0x13/0x20
    kvm_dev_ioctl+0x428/0x510
    do_vfs_ioctl+0x98/0x570
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b
    irq event stamp: 825
    hardirqs last enabled at (825): _raw_spin_unlock_irq+0x30/0x60
    hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
    softirqs last enabled at (0): copy_process+0x630/0x17c0
    softirqs last disabled at (0): (null)
    ...

    Simply back out the above commit, which was a small performance
    optimization.

    Signed-off-by: Gavin Shan
    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     

09 Oct, 2012

3 commits

  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling it when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.
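
    A sketch of the wrapping, using the mm/start/end style arguments of that
    era (the surrounding variables are whatever the call site has at hand):

        mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);

        /* May invoke ->change_pte() for clients that implement it;
         * everyone else now learns about the update from the surrounding
         * range notifications instead of an invalidate_page() call. */
        set_pte_at_notify(mm, address, ptep, newpte);

        mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);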

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
     
  • The variable must be static especially given the variable name.

    s/RCU/SRCU/ over a few comments.

    Signed-off-by: Andrea Arcangeli
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • While registering an MMU notifier, a new instance of MMU notifier_mm will
    be allocated and later freed if the current mm_struct's MMU notifier_mm
    has already been initialized. That causes some overhead. The patch tries
    to eliminate that.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan