18 Mar, 2016

1 commit


11 Sep, 2015

1 commit

  • In the scope of the idle memory tracking feature, which is introduced by
    the following patch, we need to clear the referenced/accessed bit not only
    in the primary ptes but also in the secondary ptes. The latter is required
    in order to estimate the working set size (wss) of KVM VMs. At the same
    time we want to avoid flushing the TLB, because it is quite expensive and
    it won't really affect the final result.

    Currently, there is no function for clearing the pte young bit that would
    meet our requirements, so this patch introduces one. To achieve that we
    have to add a new mmu-notifier callback, clear_young, since there is no
    method for testing-and-clearing a secondary pte without flushing the TLB.
    The new method is not mandatory and is currently only implemented by KVM.

    Signed-off-by: Vladimir Davydov
    Reviewed-by: Andres Lagar-Cavilla
    Acked-by: Paolo Bonzini
    Cc: Minchan Kim
    Cc: Raghavendra K T
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Michel Lespinasse
    Cc: David Rientjes
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
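    A minimal sketch of the new callback from a secondary-MMU driver's point
    of view. The clear_young member and its range-based signature follow the
    description above; the my_age_range() helper is hypothetical and stands
    in for whatever ages the driver's own page tables without a TLB flush.

        static int my_clear_young(struct mmu_notifier *mn, struct mm_struct *mm,
                                  unsigned long start, unsigned long end)
        {
                /*
                 * Test and clear the accessed bits in the secondary page
                 * tables covering [start, end), without flushing the
                 * secondary TLB.  Return non-zero if anything was young.
                 */
                return my_age_range(mn, start, end);    /* hypothetical */
        }

        static const struct mmu_notifier_ops my_mmu_ops = {
                .clear_young = my_clear_young,          /* optional callback */
        };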
     

13 Nov, 2014

1 commit

  • Now that the mmu_notifier_invalidate_range() calls are in place, add the
    callback to allow subsystems to register against it.

    Signed-off-by: Joerg Roedel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Jérôme Glisse
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Jay Cornwall
    Cc: Oded Gabbay
    Cc: Suravee Suthikulpanit
    Cc: Jesse Barnes
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Oded Gabbay

    Joerg Roedel
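    The new callback is aimed at subsystems (an IOMMU, for example) that keep
    their own TLB in sync with the primary page tables. A rough,
    non-authoritative sketch of a registration against it; my_flush_dev_tlb()
    is a hypothetical helper.

        static void my_invalidate_range(struct mmu_notifier *mn,
                                        struct mm_struct *mm,
                                        unsigned long start, unsigned long end)
        {
                /* Called alongside the primary TLB flush; must not sleep.
                 * Shoot down the device TLB for [start, end). */
                my_flush_dev_tlb(mn, start, end);       /* hypothetical */
        }

        static const struct mmu_notifier_ops my_mmu_ops = {
                .invalidate_range = my_invalidate_range,
        };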
     

24 Sep, 2014

1 commit

  • 1. We were calling clear_flush_young_notify in unmap_one, but we are
    within an mmu notifier invalidate range scope. The spte exists no more
    (due to range_start) and the accessed bit info has already been
    propagated (due to kvm_pfn_set_accessed). Simply call
    clear_flush_young.

    2. We clear_flush_young on a primary MMU PMD, but this may be mapped
    as a collection of PTEs by the secondary MMU (e.g. during log-dirty).
    This required expanding the interface of the clear_flush_young mmu
    notifier, so a lot of code has been trivially touched.

    3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate
    the access bit by blowing away the spte. This requires proper
    synchronization with MMU notifier consumers, like every other removal of
    sptes does.

    Signed-off-by: Andres Lagar-Cavilla
    Acked-by: Rik van Riel
    Signed-off-by: Paolo Bonzini

    Andres Lagar-Cavilla
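    Point 2 above widens clear_flush_young from a single primary-MMU address
    to a range, so that one primary PMD can cover many secondary PTEs.
    Roughly, the prototype change looks like this (a sketch based on the
    description; exact parameter names may differ):

        /* before: one address */
        int (*clear_flush_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                                 unsigned long address);

        /* after: a [start, end) range */
        int (*clear_flush_young)(struct mmu_notifier *mn, struct mm_struct *mm,
                                 unsigned long start, unsigned long end);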
     

07 Aug, 2014

1 commit

  • When kernel device drivers or subsystems want to bind their lifespan to
    the lifespan of the mm_struct, they usually use one of the following
    methods:

    1. Manually calling a function in the interested kernel module. The
    function call needs to be placed in mmput. This method was rejected by
    several kernel maintainers.

    2. Registering to the mmu notifier release mechanism.

    The problem with the latter approach is that the mmu_notifier_release
    callback is called from __mmu_notifier_release (called from exit_mmap).
    That function iterates over the list of mmu notifiers and doesn't expect
    the release callback function to remove itself from the list. Therefore,
    the callback function in the kernel module can't release the
    mmu_notifier object, which is actually the kernel module's object
    itself. As a result, the destruction of the kernel module's object must
    be done in a delayed fashion.

    This patch adds support for this delayed callback by adding a new
    mmu_notifier_call_srcu function that receives a function ptr and calls
    that function with call_srcu. In that function, the kernel module
    releases its object. To use mmu_notifier_call_srcu, the calling module
    needs to first call a new function, mmu_notifier_unregister_no_release,
    that, as its name implies, unregisters a notifier without calling its
    notifier release callback.

    This patch also adds a function that will call barrier_srcu so those
    kernel modules can sync with mmu_notifier.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Oded Gabbay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
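    The teardown pattern described above, sketched for a hypothetical driver
    context that embeds its mmu_notifier. Only the two function names come
    from the commit text; everything else is illustrative.

        struct my_ctx {
                struct mmu_notifier mn;
                struct rcu_head rcu;
                /* ... driver state ... */
        };

        static void my_ctx_free(struct rcu_head *rcu)
        {
                kfree(container_of(rcu, struct my_ctx, rcu));
        }

        static void my_ctx_destroy(struct my_ctx *ctx, struct mm_struct *mm)
        {
                /* Drop off the notifier list without invoking ->release()... */
                mmu_notifier_unregister_no_release(&ctx->mn, mm);
                /* ...and free the object only after all SRCU readers are done. */
                mmu_notifier_call_srcu(&ctx->rcu, my_ctx_free);
        }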
     

24 Jan, 2014

1 commit

  • Code that is obj-y (always built-in) or dependent on a bool Kconfig
    (built-in or absent) can never be modular. So using module_init as an
    alias for __initcall can be somewhat misleading.

    Fix these up now, so that we can relocate module_init from init.h into
    module.h in the future. If we don't do this, we'd have to add module.h
    to obviously non-modular code, and that would be a worse thing.

    The audit targets the following module_init users for change:
    mm/ksm.c bool KSM
    mm/mmap.c bool MMU
    mm/huge_memory.c bool TRANSPARENT_HUGEPAGE
    mm/mmu_notifier.c bool MMU_NOTIFIER

    Note that direct use of __initcall is discouraged, vs. one of the
    priority categorized subgroups. As __initcall gets mapped onto
    device_initcall, our use of subsys_initcall (which makes sense for these
    files) will thus change this registration from level 6-device to level
    4-subsys (i.e. slightly earlier).

    However no observable impact of that difference has been observed during
    testing.

    One might think that core_initcall (l2) or postcore_initcall (l3) would
    be more appropriate for anything in mm/ but if we look at some actual
    init functions themselves, we see things like:

    mm/huge_memory.c --> hugepage_init --> hugepage_init_sysfs
    mm/mmap.c --> init_user_reserve --> sysctl_user_reserve_kbytes
    mm/ksm.c --> ksm_init --> sysfs_create_group

    and hence the choice of subsys_initcall (l4) seems reasonable, and at
    the same time minimizes the risk of changing the priority too
    drastically all at once. We can adjust further in the future.

    Also, several instances of missing ";" at EOL are fixed.

    Signed-off-by: Paul Gortmaker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul Gortmaker
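    For a bool-Kconfig file such as mm/mmu_notifier.c the change is just the
    registration macro; schematically (the init function body is a sketch):

        static int __init mmu_notifier_init(void)
        {
                return init_srcu_struct(&srcu);
        }

        /* was: module_init(mmu_notifier_init); -- misleading for obj-y code */
        subsys_initcall(mmu_notifier_init);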
     

28 Jun, 2013

1 commit


25 May, 2013

1 commit

  • Commit 751efd8610d3 ("mmu_notifier_unregister NULL Pointer deref and
    multiple ->release()") breaks the fix 3ad3d901bbcf ("mm: mmu_notifier:
    fix freed page still mapped in secondary MMU").

    Since hlist_for_each_entry_rcu() is changed now, we can not revert that
    patch directly, so this patch reverts the commit and simply fixes the
    bug spotted by that patch.

    The bug spotted by commit 751efd8610d3 is:

    There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result
    of a filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

             A                              B
    t1                                      srcu_read_lock()
    t2       if (!hlist_unhashed())
    t3                                      srcu_read_unlock()
    t4       srcu_read_lock()
    t5                                      hlist_del_init_rcu()
    t6                                      synchronize_srcu()
    t7       srcu_read_unlock()
    t8       hlist_del_rcu()

    The later call of multiple ->release() should be fast, since all the
    pages have already been released by the first call. Anyway, this issue
    should be fixed in a separate patch.

    -stable suggestions: Any version that has commit 751efd8610d3 need to be
    backported. I find the oldest version has this commit is 3.0-stable.

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Xiao Guangrong
    Tested-by: Robin Holt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
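    The detail that makes the t8 step above blow up is how the two deletion
    helpers leave the node behind; the comments below paraphrase the
    linux/rculist.h semantics and are only a summary.

        /* hlist_del_rcu(n): unlinks n via n->pprev and then poisons it;
         * calling it on a node that was already removed (and re-inited) by
         * another path dereferences a NULL pprev -- the t8 crash above. */
        hlist_del_rcu(&mn->hlist);

        /* hlist_del_init_rcu(n): checks hlist_unhashed(n) first and resets
         * n->pprev to NULL after unlinking, so it is safe against a second
         * deletion and lets racing code see that the entry is gone. */
        hlist_del_init_rcu(&mn->hlist);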
     

28 Feb, 2013

1 commit

  • I'm not sure why, but while the list for-each-entry iterator was
    conceived as

    list_for_each_entry(pos, head, member)

    the hlist ones were greedy and wanted an extra parameter:

    hlist_for_each_entry(tpos, pos, head, member)

    Why did they need an extra pos parameter? I'm not quite sure. Not only
    do they not really need it, it also prevents the iterator from looking
    exactly like the list iterator, which is unfortunate.

    Besides the semantic patch, there was some manual work required:

    - Fix up the actual hlist iterators in linux/list.h
    - Fix up the declaration of other iterators based on the hlist ones.
    - A very small amount of places were using the 'node' parameter, this
    was modified to use 'obj->member' instead.
    - Coccinelle didn't handle the hlist_for_each_entry_safe iterator
    properly, so those had to be fixed up manually.

    The semantic patch which is mostly the work of Peter Senna Tschudin is here:

    @@
    iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host;

    type T;
    expression a,c,d,e;
    identifier b;
    statement S;
    @@

    -T b;

    [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c]
    [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c]
    [akpm@linux-foundation.org: checkpatch fixes]
    [akpm@linux-foundation.org: fix warnings]
    [akpm@linux-foudnation.org: redo intrusive kvm changes]
    Tested-by: Peter Senna Tschudin
    Acked-by: Paul E. McKenney
    Signed-off-by: Sasha Levin
    Cc: Wu Fengguang
    Cc: Marcelo Tosatti
    Cc: Gleb Natapov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
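    For the mmu_notifier list itself, the conversion drops the extra node
    cursor from the RCU iterator; a before/after sketch of a walk over
    mm->mmu_notifier_mm->list (the loop body is schematic):

        /* before: a separate struct hlist_node cursor was required */
        struct hlist_node *node;
        hlist_for_each_entry_rcu(mn, node, &mm->mmu_notifier_mm->list, hlist)
                if (mn->ops->release)
                        mn->ops->release(mn, mm);

        /* after: the entry pointer is the only cursor */
        hlist_for_each_entry_rcu(mn, &mm->mmu_notifier_mm->list, hlist)
                if (mn->ops->release)
                        mn->ops->release(mn, mm);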
     

24 Feb, 2013

2 commits

  • We at SGI have a need to address some very high physical address ranges
    with our GRU (global reference unit), sometimes across partitioned
    machine boundaries and sometimes with larger addresses than the cpu
    supports. We do this with the aid of our own 'extended vma' module
    which mimics the vma. When something (either unmap or exit) frees an
    'extended vma' we use the mmu notifiers to clean them up.

    We had been able to mimic the functions
    __mmu_notifier_invalidate_range_start() and
    __mmu_notifier_invalidate_range_end() by locking the per-mm lock and
    walking the per-mm notifier list. But with the change to a global srcu
    lock (static in mmu_notifier.c) we can no longer do that. Our module has
    no access to that lock.

    So we request that these two functions be exported.

    Signed-off-by: Cliff Wickman
    Acked-by: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cliff Wickman
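    The patch itself is essentially two export annotations on the existing
    functions, roughly:

        void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
                                                   unsigned long start,
                                                   unsigned long end)
        {
                /* ... walk the notifier list under the SRCU read lock ... */
        }
        EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_start);

        /* and likewise EXPORT_SYMBOL_GPL(__mmu_notifier_invalidate_range_end) */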
     
  • There is a race condition between mmu_notifier_unregister() and
    __mmu_notifier_release().

    Assume two tasks, one calling mmu_notifier_unregister() as a result of a
    filp_close() ->flush() callout (task A), and the other calling
    mmu_notifier_release() from an mmput() (task B).

             A                              B
    t1                                      srcu_read_lock()
    t2       if (!hlist_unhashed())
    t3                                      srcu_read_unlock()
    t4       srcu_read_lock()
    t5                                      hlist_del_init_rcu()
    t6                                      synchronize_srcu()
    t7       srcu_read_unlock()
    t8       hlist_del_rcu()

    In addition, the list traversal in __mmu_notifier_release() is not
    protected by the hlist_lock, which can result in callouts to the
    ->release() notifier from both mmu_notifier_unregister() and
    __mmu_notifier_release().

    -stable suggestions:

    The stable trees prior to 3.7.y need commits 21a92735f660 and
    70400303ce0c cherry-picked in that order prior to cherry-picking this
    commit. The 3.7.y tree already has those two commits.

    Signed-off-by: Robin Holt
    Cc: Andrea Arcangeli
    Cc: Wanpeng Li
    Cc: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

26 Oct, 2012

1 commit

  • While allocating the mmu_notifier with GFP_KERNEL, swap can start to
    work when available memory is tight. Eventually, that can lead to a
    deadlock while the swap daemon swaps anonymous pages. It was caused by
    commit e0f3c3f78da29b ("mm/mmu_notifier: init notifier if necessary").

    =================================
    [ INFO: inconsistent lock state ]
    3.7.0-rc1+ #518 Not tainted
    ---------------------------------
    inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
    kswapd0/35 [HC0[0]:SC0[0]:HE1:SE1] takes:
    (&mapping->i_mmap_mutex){+.+.?.}, at: page_referenced+0x9c/0x2e0
    {RECLAIM_FS-ON-W} state was registered at:
    mark_held_locks+0x86/0x150
    lockdep_trace_alloc+0x67/0xc0
    kmem_cache_alloc_trace+0x33/0x230
    do_mmu_notifier_register+0x87/0x180
    mmu_notifier_register+0x13/0x20
    kvm_dev_ioctl+0x428/0x510
    do_vfs_ioctl+0x98/0x570
    sys_ioctl+0x91/0xb0
    system_call_fastpath+0x16/0x1b
    irq event stamp: 825
    hardirqs last enabled at (825): _raw_spin_unlock_irq+0x30/0x60
    hardirqs last disabled at (824): _raw_spin_lock_irq+0x19/0x80
    softirqs last enabled at (0): copy_process+0x630/0x17c0
    softirqs last disabled at (0): (null)
    ...

    Simply back out the above commit, which was a small performance
    optimization.

    Signed-off-by: Gavin Shan
    Reported-by: Andrea Righi
    Tested-by: Andrea Righi
    Cc: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     

09 Oct, 2012

4 commits

  • In order to allow sleeping during invalidate_page mmu notifier calls, we
    need to avoid calling when holding the PT lock. In addition to its direct
    calls, invalidate_page can also be called as a substitute for a change_pte
    call, in case the notifier client hasn't implemented change_pte.

    This patch drops the invalidate_page call from change_pte, and instead
    wraps all calls to change_pte with invalidate_range_start and
    invalidate_range_end calls.

    Note that change_pte still cannot sleep after this patch, and that clients
    implementing change_pte should not take action on it in case the number of
    outstanding invalidate_range_start calls is larger than one, otherwise
    they might miss a later invalidation.

    Signed-off-by: Haggai Eran
    Cc: Andrea Arcangeli
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Xiao Guangrong
    Cc: Or Gerlitz
    Cc: Haggai Eran
    Cc: Shachar Raindel
    Cc: Liran Liss
    Cc: Christoph Lameter
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Haggai Eran
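    The resulting call pattern at a pte-replacement site (KSM/COW) looks
    roughly like the sketch below; locking is omitted and the exact call
    sites differ, but the ordering is the point.

        mmu_notifier_invalidate_range_start(mm, address, address + PAGE_SIZE);

        /* Update the primary pte and notify in one step.  Clients without a
         * ->change_pte() callback no longer get invalidate_page() here; they
         * rely on the surrounding range_start/range_end pair instead. */
        set_pte_at_notify(mm, address, ptep, newpte);

        mmu_notifier_invalidate_range_end(mm, address, address + PAGE_SIZE);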
     
  • The variable must be static especially given the variable name.

    s/RCU/SRCU/ over a few comments.

    Signed-off-by: Andrea Arcangeli
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Peter Zijlstra
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • While registering an MMU notifier, a new instance of mmu_notifier_mm is
    allocated and later freed if the current mm_struct's mmu_notifier_mm has
    already been initialized. That causes some overhead. The patch tries to
    eliminate that.

    Signed-off-by: Gavin Shan
    Signed-off-by: Wanpeng Li
    Cc: Andrea Arcangeli
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc: Xiao Guangrong
    Cc: Sagi Grimberg
    Cc: Haggai Eran
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gavin Shan
     
  • With an RCU based mmu_notifier implementation, any callout to
    mmu_notifier_invalidate_range_{start,end}() or
    mmu_notifier_invalidate_page() would not be allowed to call schedule()
    as that could potentially allow a modification to the mmu_notifier
    structure while it is currently being used.

    Since srcu allocs 4 machine words per instance per cpu, we may end up
    with memory exhaustion if we use srcu per mm. So all mms share a global
    srcu. Note that during large mmu_notifier activity exit & unregister
    paths might hang for longer periods, but it is tolerable for current
    mmu_notifier clients.

    Signed-off-by: Sagi Grimberg
    Signed-off-by: Andrea Arcangeli
    Cc: Peter Zijlstra
    Cc: Haggai Eran
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sagi Grimberg
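    With a single file-scope srcu_struct shared by all mms, the callout
    paths follow the usual SRCU read-side pattern, which is what allows the
    notifier callbacks to sleep; a simplified sketch:

        static struct srcu_struct srcu;         /* one instance for all mms */

        int id;

        id = srcu_read_lock(&srcu);
        /* walk mm->mmu_notifier_mm->list and invoke the registered callbacks;
         * because this is SRCU, the callbacks are allowed to sleep */
        srcu_read_unlock(&srcu, id);

        /* unregister/exit side: wait for every such reader before freeing */
        synchronize_srcu(&srcu);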
     

01 Aug, 2012

1 commit

  • mmu_notifier_release() is called when the process is exiting. It will
    delete all the mmu notifiers. But at this time the page belonging to the
    process is still present in page tables and is present on the LRU list, so
    this race will happen:

    CPU 0                            CPU 1
    mmu_notifier_release:            try_to_unmap:
      hlist_del_init_rcu(&mn->hlist);
                                       ptep_clear_flush_notify:
                                             mmu notifier not found
                                       free page  !!!!!!
                                       /*
                                        * At this point, the page has been
                                        * freed, but it is still mapped in
                                        * the secondary MMU.
                                        */

      mn->ops->release(mn, mm);

    Then the box is not stable and sometimes we can get this bug:

    [ 738.075923] BUG: Bad page state in process migrate-perf pfn:03bec
    [ 738.075931] page:ffffea00000efb00 count:0 mapcount:0 mapping: (null) index:0x8076
    [ 738.075936] page flags: 0x20000000000014(referenced|dirty)

    The same issue is present in mmu_notifier_unregister().

    We can call ->release before deleting the notifier to ensure the page has
    been unmapped from the secondary MMU before it is freed.

    Signed-off-by: Xiao Guangrong
    Cc: Avi Kivity
    Cc: Marcelo Tosatti
    Cc: Paul Gortmaker
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiao Guangrong
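    The ordering fix described above, as a schematic of the exit path (the
    real code splits this across a loop and a spinlocked section; this
    sketch only shows the principle):

        id = srcu_read_lock(&srcu);
        /* 1. Let the secondary MMU tear down its mappings while the notifier
         *    is still reachable on the list... */
        if (mn->ops->release)
                mn->ops->release(mn, mm);
        srcu_read_unlock(&srcu, id);

        /* 2. ...and only then unlink it, so a concurrent
         *    ptep_clear_flush_notify() cannot miss the notifier while the
         *    page is still mapped in the secondary MMU. */
        spin_lock(&mm->mmu_notifier_mm->lock);
        hlist_del_init_rcu(&mn->hlist);
        spin_unlock(&mm->mmu_notifier_mm->lock);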
     

31 Oct, 2011

1 commit


14 Jan, 2011

1 commit

  • For GRU and EPT, we need gup-fast to set the referenced bit too (this is
    why it's correct to return 0 when shadow_accessed_mask is zero, it
    requires gup-fast to set the referenced bit). qemu-kvm access already
    sets the young bit in the pte if it isn't zero-copy; if it's zero copy or
    a shadow paging EPT minor fault we rely on gup-fast to signal that the
    page is in use...

    We also need to check the young bits on the secondary pagetables for NPT
    and not nested shadow mmu as the data may never get accessed again by the
    primary pte.

    Without this closer accuracy, we'd have to remove the heuristic that
    avoids collapsing hugepages in hugepage virtual regions that have not even
    a single subpage in use.

    ->test_young is fully backwards compatible with GRU and other usages
    that don't have young bits in pagetables set by the hardware and that
    should nuke the secondary mmu mappings when ->clear_flush_young runs,
    just like EPT does.

    Removing the heuristic that checks the young bit in
    khugepaged/collapse_huge_page completely isn't so bad either probably but
    I thought it was worth it and this makes it reliable.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
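    On the consumer side this is khugepaged asking whether a subpage is
    really idle before deciding about a collapse; in sketch form (the
    surrounding loop and variables are schematic, mmu_notifier_test_young()
    is the wrapper the commit describes):

        /* Count the subpage as referenced if either the primary pte or the
         * secondary page tables (NPT/EPT with a hardware accessed bit) say
         * it has been touched. */
        if (pte_young(pteval) ||
            mmu_notifier_test_young(vma->vm_mm, address))
                referenced = true;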
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include
    those headers directly instead of assuming availability. As this
    conversion needs to touch a large number of source files, the following
    script is used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following.

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there. ie. if only gfp is used,
    gfp.h, if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition while adding it to implementation .h or
    embedding .c file was more appropriate for others. This step added
    inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build test were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of the
    specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
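    In practice the change is the kind of one-line hunk below: a file that
    calls kmalloc()/kfree() now names its dependency instead of inheriting
    it through percpu.h -> slab.h (a sketch of the pattern, not an exact
    hunk from the patch):

         #include <linux/mm.h>
         #include <linux/mmu_notifier.h>
        +#include <linux/slab.h>   /* kmalloc()/kfree() no longer implicit */
         #include <linux/srcu.h>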
     

22 Sep, 2009

1 commit

  • KSM is a Linux driver that allows dynamically sharing identical memory
    pages between one or more processes.

    Unlike traditional page sharing that is made at the allocation of the
    memory, KSM does it dynamically after the memory was created. Memory is
    periodically scanned; identical pages are identified and merged.

    The sharing is made in a way that is transparent to the processes that
    use it.

    KSM is highly important for hypervisors (KVM), where in production
    environments there might be many copies of the same data among the host
    memory. This kind of data can be: similar kernels, libraries, caches,
    and so on.

    Even though KSM was written for KVM, any userspace application that
    wants to use it to share its data can try it.

    KSM may be useful for any application that might have similar
    (page-aligned) data structures in memory; KSM will find this data and
    merge it into one copy, and even if the copy is later changed and
    therefore copied on write, KSM will merge it again as soon as it
    becomes identical again.

    Another reason to consider using KSM is that it might simplify a lot of
    the userspace code of an application that wants to use shared private
    data: instead of the application managing a shared area, KSM will do it
    for the application, and even writes to this data are allowed without
    any synchronization acts from the application.

    KSM was designed to be a loadable module that doesn't change the VM
    code of Linux.

    This patch:

    The set_pte_at_notify() macro allows setting a pte in the shadow page
    table directly, instead of flushing the shadow page table entry and
    then getting a vmexit to set it. It uses a new change_pte() callback
    to do so.

    set_pte_at_notify() is an optimization for kvm, and other users of
    mmu_notifiers, for COW pages. It is useful for kvm when ksm is used,
    because it allows kvm not to have to receive a vmexit and only then map
    the ksm page into the shadow page table, but instead map it directly
    at the same time as Linux maps the page into the host page table.

    Users of mmu_notifiers who don't implement the new
    mmu_notifier_change_pte() callback will just receive the
    mmu_notifier_invalidate_page() callback.

    Signed-off-by: Izik Eidus
    Signed-off-by: Chris Wright
    Signed-off-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Izik Eidus
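    A sketch of both halves of the mechanism this patch introduces: the
    primary-MMU side uses the new set_pte_at_notify() helper, and a
    secondary MMU that implements ->change_pte() has the new pte pushed to
    it directly instead of being invalidated; my_remap_spte() is a
    hypothetical driver helper.

        /* primary MMU side (e.g. KSM merging a page): set the pte and
         * notify registered secondary MMUs in one step */
        set_pte_at_notify(mm, address, ptep, newpte);

        /* secondary MMU side */
        static void my_change_pte(struct mmu_notifier *mn, struct mm_struct *mm,
                                  unsigned long address, pte_t pte)
        {
                /* map the new page in the shadow page table right away,
                 * instead of dropping the mapping and waiting for a fault */
                my_remap_spte(mn, address, pte);        /* hypothetical */
        }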
     

29 Jul, 2008

1 commit

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to
    pages. There are secondary MMUs (with secondary sptes and secondary
    tlbs) too. sptes in the kvm case are shadow pagetables, but when I say
    spte in mmu-notifier context, I mean "secondary pte". In the GRU case
    there's no actual secondary pte and there's only a secondary tlb,
    because the GRU secondary MMU has no knowledge about sptes and every
    secondary tlb miss event in the MMU always generates a page fault that
    has to be resolved by the CPU (this is not the case of KVM, where a
    secondary tlb miss will walk sptes in hardware and will refill the
    secondary tlb transparently to software if the corresponding spte is
    present). The same way zap_page_range has to invalidate the pte before
    freeing the page, the spte (and secondary tlb) must also be invalidated
    before any page is freed and reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page being freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU
    so that the secondary MMU code can drop sptes before the pages are
    freed, avoiding all page pinning and allowing 100% reliable swapping of
    guest physical address space. Furthermore it avoids requiring the code
    that tears down the mappings of the secondary MMU to implement
    tlb_gather-like logic as in zap_page_range, which would require many
    IPIs to flush other cpu tlbs for each fixed number of sptes unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will
    be invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible
    in the primary CPU MMU) into a secondary MMU (be it a pure tlb like
    GRU, or a full MMU with both sptes and a secondary tlb like the
    shadow-pagetable layer with kvm), or a remote DMA in software like
    XPMEM (hence the need to schedule in XPMEM code to send the invalidate
    to the remote node, while there is no need to schedule in kvm/gru as
    it's an immediate event like invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track of whether the VM is in the middle of the
    invalidate_range_begin/end critical section with an atomic counter
    increased in range_begin and decreased in range_end. No secondary MMU
    page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
    if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage
    of mmu notifiers, but this already allows compiling a KVM external
    module against a kernel with mmu notifiers enabled and from the next
    pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIER in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may
    be interrupted by a signal and return -EINTR. Because
    mmu_notifier_register is used when a driver starts up, a failure can be
    gracefully handled. Here is an example of the change applied to kvm to
    register the mmu notifiers. Usually when a driver starts up, other
    allocations are required anyway and -ENOMEM failure paths exist
    already.

    struct kvm *kvm_arch_create_vm(void)
    {
           struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +      int err;

           if (!kvm)
                  return ERR_PTR(-ENOMEM);

           INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +      kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +      err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +      if (err) {
    +             kfree(kvm);
    +             return ERR_PTR(err);
    +      }
    +
           return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes without which the
    kernel would not compile after these changes on non-x86 archs (x86
    didn't need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
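    Dependency 1 above is what lets a secondary-MMU fault handler detect
    that it raced with an invalidation. The usual shape (KVM implements a
    variant of it) is a counter bumped in range_start and dropped in
    range_end, checked before installing a spte; the ctx fields and
    install_spte() below are hypothetical.

        /* in ->invalidate_range_start() */
        spin_lock(&ctx->lock);
        ctx->invalidate_count++;        /* faults must not install sptes now */
        ctx->invalidate_seq++;
        spin_unlock(&ctx->lock);

        /* in ->invalidate_range_end() */
        spin_lock(&ctx->lock);
        ctx->invalidate_count--;
        spin_unlock(&ctx->lock);

        /* in the secondary-MMU fault path, after get_user_pages() */
        spin_lock(&ctx->lock);
        if (ctx->invalidate_count || ctx->invalidate_seq != seq_seen_before) {
                spin_unlock(&ctx->lock);
                return -EAGAIN;         /* raced with an invalidate: retry */
        }
        install_spte(ctx, address, pfn);                /* hypothetical */
        spin_unlock(&ctx->lock);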