16 Dec, 2014
1 commit
-
Pull drm updates from Dave Airlie:
"Highlights:- AMD KFD driver merge
This is the AMD HSA interface for exposing a lowlevel interface for
GPGPU use. They have an open source userspace built on top of this
interface, and the code looks as good as it was going to get out of
tree.- Initial atomic modesetting work
The need for an atomic modesetting interface to allow userspace to
try and send a complete set of modesetting state to the driver has
arisen, and been suffering from neglect this past year. No more,
the start of the common code and changes for msm driver to use it
are in this tree. Ongoing work to get the userspace ioctl finished
and the code clean will probably wait until next kernel.- DisplayID 1.3 and tiled monitor exposed to userspace.
Tiled monitor property is now exposed for userspace to make use of.
- Rockchip drm driver merged.
- imx gpu driver moved out of staging
Other stuff:
- core:
panel - MIPI DSI + new panels.
expose suggested x/y properties for virtual GPUs- i915:
Initial Skylake (SKL) support
gen3/4 reset work
start of dri1/ums removal
infoframe tracking
fixes for lots of things.- nouveau:
tegra k1 voltage support
GM204 modesetting support
GT21x memory reclocking work- radeon:
CI dpm fixes
GPUVM improvements
Initial DPM fan control- rcar-du:
HDMI support added
removed some support for old boards
slave encoder driver for Analog Devices adv7511- exynos:
Exynos4415 SoC support- msm:
a4xx gpu support
atomic helper conversion- tegra:
iommu support
universal plane support
ganged-mode DSI support- sti:
HDMI i2c improvements- vmwgfx:
some late fixes.- qxl:
use suggested x/y properties"* 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
drm: sti: fix module compilation issue
drm/i915: save/restore GMBUS freq across suspend/resume on gen4
drm: sti: correctly cleanup CRTC and planes
drm: sti: add HQVDP plane
drm: sti: add cursor plane
drm: sti: enable auxiliary CRTC
drm: sti: fix delay in VTG programming
drm: sti: prepare sti_tvout to support auxiliary crtc
drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
drm: sti: fix hdmi avi infoframe
drm: sti: remove event lock while disabling vblank
drm: sti: simplify gdp code
drm: sti: clear all mixer control
drm: sti: remove gpio for HDMI hot plug detection
drm: sti: allow to change hdmi ddc i2c adapter
drm/doc: Document drm_add_modes_noedid() usage
drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
drm: Zero out DRM object memory upon cleanup
drm/i915/bdw: Fix the write setting up the WIZ hashing mode
...
14 Dec, 2014
1 commit
-
Convert all open coded mutex_lock/unlock calls to the
i_mmap_[lock/unlock]_write() helpers.Signed-off-by: Davidlohr Bueso
Acked-by: Rik van Riel
Acked-by: "Kirill A. Shutemov"
Acked-by: Hugh Dickins
Cc: Oleg Nesterov
Acked-by: Peter Zijlstra (Intel)
Cc: Srikar Dronamraju
Acked-by: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2014
1 commit
-
Add calls to the new mmu_notifier_invalidate_range() function to all
places in the VMM that need it.Signed-off-by: Joerg Roedel
Reviewed-by: Andrea Arcangeli
Reviewed-by: Jérôme Glisse
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Hugh Dickins
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Jay Cornwall
Cc: Oded Gabbay
Cc: Suravee Suthikulpanit
Cc: Jesse Barnes
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Oded Gabbay
07 Jun, 2014
1 commit
-
The remap_file_pages() system call is used to create a nonlinear
mapping, that is, a mapping in which the pages of the file are mapped
into a nonsequential order in memory. The advantage of using
remap_file_pages() over using repeated calls to mmap(2) is that the
former approach does not require the kernel to create additional VMA
(Virtual Memory Area) data structures.Supporting of nonlinear mapping requires significant amount of
non-trivial code in kernel virtual memory subsystem including hot paths.
Also to get nonlinear mapping work kernel need a way to distinguish
normal page table entries from entries with file offset (pte_file).
Kernel reserves flag in PTE for this purpose. PTE flags are scarce
resource especially on some CPU architectures. It would be nice to free
up the flag for other usage.Fortunately, there are not many users of remap_file_pages() in the wild.
It's only known that one enterprise RDBMS implementation uses the
syscall on 32-bit systems to map files bigger than can linearly fit into
32-bit virtual address space. This use-case is not critical anymore
since 64-bit systems are widely available.The plan is to deprecate the syscall and replace it with an emulation.
The emulation will create new VMAs instead of nonlinear mappings. It's
going to work slower for rare users of remap_file_pages() but ABI is
preserved.One side effect of emulation (apart from performance) is that user can
hit vm.max_map_count limit more easily due to additional VMAs. See
comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.[akpm@linux-foundation.org: fix spello]
Signed-off-by: Kirill A. Shutemov
Cc: Peter Zijlstra
Cc: Ingo Molnar
Cc: Dave Jones
Cc: Armin Rigo
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2014
1 commit
-
Hugh reported:
| I noticed your soft_dirty work in install_file_pte(): which looked
| good at first, until I realized that it's propagating the soft_dirty
| of a pte it's about to zap completely, to the unrelated entry it's
| about to insert in its place. Which seems very odd to me.Indeed this code ends up being nop in result -- pte_file_mksoft_dirty()
operates with pte_t argument and returns new pte_t which were never used
after. After looking more I think what we need is to soft-dirtify all
newely remapped file pages because it should look like a new mapping for
memory tracker.Signed-off-by: Cyrill Gorcunov
Reported-by: Hugh Dickins
Cc: Pavel Emelyanov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Mar, 2014
1 commit
-
Fix some "Bad rss-counter state" reports on exit, arising from the
interaction between page migration and remap_file_pages(): zap_pte()
must count a migration entry when zapping it.And yes, it is possible (though very unusual) to find an anon page or
swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
get_user_pages(write, force) case which COWs even in a shared mapping.Signed-off-by: Hugh Dickins
Tested-by: Sasha Levin sasha.levin@oracle.com>
Tested-by: Dave Jones davej@redhat.com>
Cc: Cyrill Gorcunov
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
03 Jan, 2014
1 commit
-
remap_file_pages calls mmap_region, which may merge the VMA with other
existing VMAs, and free "vma". This can lead to a use-after-free bug.
Avoid the bug by remembering vm_flags before calling mmap_region, and
not trying to dereference vma later.Signed-off-by: Rik van Riel
Reported-by: Dmitry Vyukov
Cc: PaX Team
Cc: Kees Cook
Cc: Michel Lespinasse
Cc: Cyrill Gorcunov
Cc: Hugh Dickins
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Aug, 2013
1 commit
-
Andy reported that if file page get reclaimed we lose the soft-dirty bit
if it was there, so save _PAGE_BIT_SOFT_DIRTY bit when page address get
encoded into pte entry. Thus when #pf happens on such non-present pte
we can restore it back.Reported-by: Andy Lutomirski
Signed-off-by: Cyrill Gorcunov
Acked-by: Pavel Emelyanov
Cc: Matt Mackall
Cc: Xiao Guangrong
Cc: Marcelo Tosatti
Cc: KOSAKI Motohiro
Cc: Stephen Rothwell
Cc: Peter Zijlstra
Cc: "Aneesh Kumar K.V"
Cc: Minchan Kim
Cc: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Mar, 2013
1 commit
-
This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
better deal with racy userspace programs").VM_POPULATE only has any effect when userspace plays racy games with
vmas by trying to unmap and remap memory regions that mmap or mlock are
operating on.Also, the only effect of VM_POPULATE when userspace plays such games is
that it avoids populating new memory regions that get remapped into the
address range that was being operated on by the original mmap or mlock
calls.Let's remove VM_POPULATE as there isn't any strong argument to mandate a
new vm_flag.Signed-off-by: Michel Lespinasse
Signed-off-by: Hugh Dickins
Signed-off-by: Linus Torvalds
15 Mar, 2013
1 commit
-
The vm_flags introduced in 6d7825b10dbe ("mm/fremap.c: fix oops on error
path") is supposed to avoid a compiler warning about unitialized
vm_flags without changing the generated code.However I am concerned that this is going to be very brittle, and fail
with some compiler versions. The failure could be either of:- compiler could actually load vma->vm_flags before checking for the
!vma condition, thus reintroducing the oops- compiler could optimize out the !vma check, since the pointer just got
dereferenced shortly before (so the compiler knows it can't be NULL!)I propose reversing this part of the change and initializing vm_flags to 0
just to avoid the bogus uninitialized use warning.Signed-off-by: Michel Lespinasse
Cc: Tommi Rantala
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
14 Mar, 2013
1 commit
-
If find_vma() fails, sys_remap_file_pages() will dereference `vma', which
contains NULL. Fix it by checking the pointer.(We could alternatively check for err==0, but this seems more direct)
(The vm_flags change is to squish a bogus used-uninitialised warning
without adding extra code).Reported-by: Tommi Rantala
Cc: Michel Lespinasse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Feb, 2013
4 commits
-
The vm_populate() code populates user mappings without constantly
holding the mmap_sem. This makes it susceptible to racy userspace
programs: the user mappings may change while vm_populate() is running,
and in this case vm_populate() may end up populating the new mapping
instead of the old one.In order to reduce the possibility of userspace getting surprised by
this behavior, this change introduces the VM_POPULATE vma flag which
gets set on vmas we want vm_populate() to work on. This way
vm_populate() may still end up populating the new mapping after such a
race, but only if the new mapping is also one that the user has
requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
After the MAP_POPULATE handling has been moved to mmap_region() call
sites, the only remaining use of the flags argument is to pass the
MAP_NORESERVE flag. This can be just as easily handled by
do_mmap_pgoff(), so do that and remove the mmap_region() flags
parameter.[akpm@linux-foundation.org: remove double parens]
Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Michel Lespinasse
Reviewed-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We have many vma manipulation functions that are fast in the typical
case, but can optionally be instructed to populate an unbounded number
of ptes within the region they work on:- mmap with MAP_POPULATE or MAP_LOCKED flags;
- remap_file_pages() with MAP_NONBLOCK not set or when working on a
VM_LOCKED vma;
- mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
effect;
- brk() when mlock(MCL_FUTURE) is in effect.Current code handles these pte operations locally, while the
sourrounding code has to hold the mmap_sem write side since it's
manipulating vmas. This means we're doing an unbounded amount of pte
population work with mmap_sem held, and this causes problems as Andy
Lutomirski reported (we've hit this at Google as well, though it's not
entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
first place).I propose introducing a new mm_populate() function to do this pte
population work after the mmap_sem has been released. mm_populate()
does need to acquire the mmap_sem read side, but critically, it doesn't
need to hold it continuously for the entire duration of the operation -
it can drop it whenever things take too long (such as when hitting disk
for a file read) and re-acquire it later on.The following patches are included
- Patches 1 fixes some issues I noticed while working on the existing code.
If needed, they could potentially go in before the rest of the patches.- Patch 2 introduces the new mm_populate() function and changes
mmap_region() call sites to use it after they drop mmap_sem. This is
inspired from Andy Lutomirski's proposal and is built as an extension
of the work I had previously done for mlock() and mlockall() around
v2.6.38-rc1. I had tried doing something similar at the time but had
given up as there were so many do_mmap() call sites; the recent cleanups
by Linus and Viro are a tremendous help here.- Patches 3-5 convert some of the less-obvious places doing unbounded
pte populates to the new mm_populate() mechanism.- Patches 6-7 are code cleanups that are made possible by the
mm_populate() work. In particular, they remove more code than the
entire patch series added, which should be a good thing :)- Patch 8 is optional to this entire series. It only helps to deal more
nicely with racy userspace programs that might modify their mappings
while we're trying to populate them. It adds a new VM_POPULATE flag
on the mappings we do want to populate, so that if userspace replaces
them with mappings it doesn't want populated, mm_populate() won't
populate those replacement mappings.This patch:
Assorted small fixes. The first two are quite small:
- Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
Purely cosmetic.- In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
range, make sure we own the mmap_sem write lock around the
munlock_vma_pages_range call as this manipulates the vma's vm_flags.Last fix requires a longer explanation. remap_file_pages() can do its work
either through VM_NONLINEAR manipulation or by creating extra vmas.
These two cases were inconsistent with each other (and ultimately, both wrong)
as to exactly when did they fault in the newly mapped file pages:- In the VM_NONLINEAR case, new file pages would be populated if
the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
new file pages wouldn't be populated even if the vma is already
marked as VM_LOCKED.- In the linear (emulated) case, the work is passed to the mmap_region()
function which would populate the pages if the vma is marked as
VM_LOCKED, and would not otherwise - regardless of the value of the
MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
mmap_region().The desired behavior is that we want the pages to be populated and locked
if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
flag is not passed to remap_file_pages().Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Oct, 2012
1 commit
-
In commit 0b173bc4daa8 ("mm: kill vma flag VM_CAN_NONLINEAR") we
replaced the VM_CAN_NONLINEAR test with checking whether the mapping has
a '->remap_pages()' vm operation, but there is no guarantee that there
it even has a vm_ops pointer at all.Add the appropriate test for NULL vm_ops.
Reported-by: Sasha Levin
Cc: Konstantin Khlebnikov
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
09 Oct, 2012
2 commits
-
Implement an interval tree as a replacement for the VMA prio_tree. The
algorithms are similar to lib/interval_tree.c; however that code can't be
directly reused as the interval endpoints are not explicitly stored in the
VMA. So instead, the common algorithm is moved into a template and the
details (node type, how to get interval endpoints from the node, etc) are
filled in using the C preprocessor.Once the interval tree functions are available, using them as a
replacement to the VMA prio tree is a relatively simple, mechanical job.Signed-off-by: Michel Lespinasse
Cc: Rik van Riel
Cc: Hillf Danton
Cc: Peter Zijlstra
Cc: Catalin Marinas
Cc: Andrea Arcangeli
Cc: David Woodhouse
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Move actual pte filling for non-linear file mappings into the new special
vma operation: ->remap_pages().Filesystems must implement this method to get non-linear mapping support,
if it uses filemap_fault() then generic_file_remap_pages() can be used.Now device drivers can implement this method and obtain nonlinear vma support.
Signed-off-by: Konstantin Khlebnikov
Cc: Alexander Viro
Cc: Carsten Otte
Cc: Chris Metcalf #arch/tile
Cc: Cyrill Gorcunov
Cc: Eric Paris
Cc: H. Peter Anvin
Cc: Hugh Dickins
Cc: Ingo Molnar
Cc: James Morris
Cc: Jason Baron
Cc: Kentaro Takeda
Cc: Matt Helsley
Cc: Nick Piggin
Cc: Oleg Nesterov
Cc: Peter Zijlstra
Cc: Robert Richter
Cc: Suresh Siddha
Cc: Tetsuo Handa
Cc: Venkatesh Pallipadi
Acked-by: Linus Torvalds
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Sep, 2012
1 commit
-
simplifies a bunch of callers...
Signed-off-by: Al Viro
31 Oct, 2011
1 commit
-
There is nothing modular in these files, and no reason to drag
in all the 357 headers that module.h brings with it, since
it just slows down compiles.Signed-off-by: Paul Gortmaker
27 May, 2011
1 commit
-
The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
'unsigned int'. This patch fixes such misuse.Signed-off-by: KOSAKI Motohiro
[ Changed to use a typedef - we'll extend it to cover more cases
later, since there has been discussion about making it a 64-bit
type.. - Linus ]
Signed-off-by: Linus Torvalds
25 May, 2011
1 commit
-
Straightforward conversion of i_mmap_lock to a mutex.
Signed-off-by: Peter Zijlstra
Acked-by: Hugh Dickins
Cc: Benjamin Herrenschmidt
Cc: David Miller
Cc: Martin Schwidefsky
Cc: Russell King
Cc: Paul Mundt
Cc: Jeff Dike
Cc: Richard Weinberger
Cc: Tony Luck
Cc: KAMEZAWA Hiroyuki
Cc: Mel Gorman
Cc: KOSAKI Motohiro
Cc: Nick Piggin
Cc: Namhyung Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Sep, 2010
1 commit
-
Thomas Pollet noticed that the remap_file_pages() system call in
fremap.c has a potential overflow in the first part of the if statement
below, which could cause it to process bogus input parameters.
Specifically the pgoff + size parameters could be wrap thereby
preventing the system call from failing when it should.Reported-by: Thomas Pollet
Signed-off-by: Larry Woodman
Signed-off-by: Linus Torvalds
25 Sep, 2010
1 commit
-
Thomas Pollet points out that the 'end' variable is broken. It was
computed based on start/size before they were page-aligned, and as such
doesn't actually match any of the other actions we take. The overflow
test on end was also redundant, since we had already tested it with the
properly aligned version.So just get rid of it entirely. The one remaining use for that broken
variable can just use 'start+size' like all the other cases already did.Reported-by: Thomas Pollet
Signed-off-by: Linus Torvalds
07 Mar, 2010
1 commit
-
Presently, per-mm statistics counter is defined by macro in sched.h
This patch modifies it to
- defined in mm.h as inlinf functions
- use array instead of macro's name creation.This patch is for reducing patch size in future patch to modify
implementation of per-mm counter.Signed-off-by: KAMEZAWA Hiroyuki
Reviewed-by: Minchan Kim
Cc: Christoph Lameter
Cc: Lee Schermerhorn
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
11 Feb, 2009
1 commit
-
When overcommit is disabled, the core VM accounts for pages used by anonymous
shared, private mappings and special mappings. It keeps track of VMAs that
should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
with VM_NORESERVE.Overcommit for hugetlbfs is much riskier than overcommit for base pages
due to contiguity requirements. It avoids overcommiting on both shared and
private mappings using reservation counters that are checked and updated
during mmap(). This ensures (within limits) that hugepages exist in the
future when faults occurs or it is too easy to applications to be SIGKILLed.As hugetlbfs makes its own reservations of a different unit to the base page
size, VM_ACCOUNT should never be set. Even if the units were correct, we would
double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
be set because an application can request no reserves be made for hugetlbfs
at the risk of getting killed later.With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
breaks the accounting for both the core VM and hugetlbfs, can trigger an
OOM storm when hugepage pools are too small lockups and corrupted counters
otherwise are used. This patch brings hugetlbfs more in line with how the
core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.Signed-off-by: Mel Gorman
Signed-off-by: Linus Torvalds
14 Jan, 2009
1 commit
-
Signed-off-by: Heiko Carstens
07 Jan, 2009
1 commit
-
Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
page_dup_rmap(): we're trying to be more resilient about that than BUGs.Signed-off-by: Hugh Dickins
Cc: Nick Piggin
Cc: Christoph Lameter
Cc: Mel Gorman
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Oct, 2008
1 commit
-
Originally by Nick Piggin
Remove mlocked pages from the LRU using "unevictable infrastructure"
during mmap(), munmap(), mremap() and truncate(). Try to move back to
normal LRU lists on munmap() when last mlocked mapping removed. Remove
PageMlocked() status when page truncated from file.[akpm@linux-foundation.org: cleanup]
[kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
[kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
[lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
[akpm@linux-foundation.org: remove bogus kerneldoc token]
Signed-off-by: Nick Piggin
Signed-off-by: Lee Schermerhorn
Signed-off-by: Rik van Riel
Signed-off-by: KOSAKI Motohiro
Signed-off-by: KAMEZAWA Hiroyuki
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
29 Jul, 2008
1 commit
-
With KVM/GFP/XPMEM there isn't just the primary CPU MMU pointing to pages.
There are secondary MMUs (with secondary sptes and secondary tlbs) too.
sptes in the kvm case are shadow pagetables, but when I say spte in
mmu-notifier context, I mean "secondary pte". In GRU case there's no
actual secondary pte and there's only a secondary tlb because the GRU
secondary MMU has no knowledge about sptes and every secondary tlb miss
event in the MMU always generates a page fault that has to be resolved by
the CPU (this is not the case of KVM where the a secondary tlb miss will
walk sptes in hardware and it will refill the secondary tlb transparently
to software if the corresponding spte is present). The same way
zap_page_range has to invalidate the pte before freeing the page, the spte
(and secondary tlb) must also be invalidated before any page is freed and
reused.Currently we take a page_count pin on every page mapped by sptes, but that
means the pages can't be swapped whenever they're mapped by any spte
because they're part of the guest working set. Furthermore a spte unmap
event can immediately lead to a page to be freed when the pin is released
(so requiring the same complex and relatively slow tlb_gather smp safe
logic we have in zap_page_range and that can be avoided completely if the
spte unmap event doesn't require an unpin of the page previously mapped in
the secondary MMU).The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
when the VM is swapping or freeing or doing anything on the primary MMU so
that the secondary MMU code can drop sptes before the pages are freed,
avoiding all page pinning and allowing 100% reliable swapping of guest
physical address space. Furthermore it avoids the code that teardown the
mappings of the secondary MMU, to implement a logic like tlb_gather in
zap_page_range that would require many IPI to flush other cpu tlbs, for
each fixed number of spte unmapped.To make an example: if what happens on the primary MMU is a protection
downgrade (from writeable to wrprotect) the secondary MMU mappings will be
invalidated, and the next secondary-mmu-page-fault will call
get_user_pages and trigger a do_wp_page through get_user_pages if it
called get_user_pages with write=1, and it'll re-establishing an updated
spte or secondary-tlb-mapping on the copied page. Or it will setup a
readonly spte or readonly tlb mapping if it's a guest-read, if it calls
get_user_pages with write=0. This is just an example.This allows to map any page pointed by any pte (and in turn visible in the
primary CPU MMU), into a secondary MMU (be it a pure tlb like GRU, or an
full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
with kvm), or a remote DMA in software like XPMEM (hence needing of
schedule in XPMEM code to send the invalidate to the remote node, while no
need to schedule in kvm/gru as it's an immediate event like invalidating
primary-mmu pte).At least for KVM without this patch it's impossible to swap guests
reliably. And having this feature and removing the page pin allows
several other optimizations that simplify life considerably.Dependencies:
1) mm_take_all_locks() to register the mmu notifier when the whole VM
isn't doing anything with "mm". This allows mmu notifier users to keep
track if the VM is in the middle of the invalidate_range_begin/end
critical section with an atomic counter incraese in range_begin and
decreased in range_end. No secondary MMU page fault is allowed to map
any spte or secondary tlb reference, while the VM is in the middle of
range_begin/end as any page returned by get_user_pages in that critical
section could later immediately be freed without any further
->invalidate_page notification (invalidate_range_begin/end works on
ranges and ->invalidate_page isn't called immediately before freeing
the page). To stop all page freeing and pagetable overwrites the
mmap_sem must be taken in write mode and all other anon_vma/i_mmap
locks must be taken too.2) It'd be a waste to add branches in the VM if nobody could possibly
run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only enabled if
CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
mmu notifiers, but this already allows to compile a KVM external module
against a kernel with mmu notifiers enabled and from the next pull from
kvm.git we'll start using them. And GRU/XPMEM will also be able to
continue the development by enabling KVM=m in their config, until they
submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
are all =n.The mmu_notifier_register call can fail because mm_take_all_locks may be
interrupted by a signal and return -EINTR. Because mmu_notifier_reigster
is used when a driver startup, a failure can be gracefully handled. Here
an example of the change applied to kvm to register the mmu notifiers.
Usually when a driver startups other allocations are required anyway and
-ENOMEM failure paths exists already.struct kvm *kvm_arch_create_vm(void)
{
struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
+ int err;if (!kvm)
return ERR_PTR(-ENOMEM);INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);
+ kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
+ err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
+ if (err) {
+ kfree(kvm);
+ return ERR_PTR(err);
+ }
+
return kvm;
}mmu_notifier_unregister returns void and it's reliable.
The patch also adds a few needed but missing includes that would prevent
kernel to compile after these changes on non-x86 archs (x86 didn't need
them by luck).[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix mm/filemap_xip.c build]
[akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
Signed-off-by: Andrea Arcangeli
Signed-off-by: Nick Piggin
Signed-off-by: Christoph Lameter
Cc: Jack Steiner
Cc: Robin Holt
Cc: Nick Piggin
Cc: Peter Zijlstra
Cc: Kanoj Sarcar
Cc: Roland Dreier
Cc: Steve Wise
Cc: Avi Kivity
Cc: Hugh Dickins
Cc: Rusty Russell
Cc: Anthony Liguori
Cc: Chris Wright
Cc: Marcelo Tosatti
Cc: Eric Dumazet
Cc: "Paul E. McKenney"
Cc: Izik Eidus
Cc: Anthony Liguori
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Mar, 2008
1 commit
-
Fix various kernel-doc notation in mm/:
filemap.c: add function short description; convert 2 to kernel-doc
fremap.c: change parameter 'prot' to @prot
pagewalk.c: change "-" in function parameters to ":"
slab.c: fix short description of kmem_ptr_validate()
swap.c: fix description & parameters of put_pages_list()
swap_state.c: fix function parameters
vmalloc.c: change "@returns" to "Returns:" since that is not a parameterSigned-off-by: Randy Dunlap
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
06 Feb, 2008
1 commit
-
Fix ->vm_file accounting, mmap_region() may do do_munmap().
Signed-off-by: Oleg Nesterov
Signed-off-by: Miklos Szeredi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
17 Oct, 2007
2 commits
-
Fix kernel-doc for sys_remap_file_pages() and add info to the 'prot' NOTE.
Rename __prot parameter to prot.Signed-off-by: Randy Dunlap
Acked-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
remove them and add them back to files which need them.Cross-compile tested on many configs and archs.
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
09 Oct, 2007
1 commit
-
The test for VM_CAN_NONLINEAR always fails
Signed-off-by: Yan Zheng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
20 Jul, 2007
3 commits
-
page_mkclean() doesn't re-protect ptes for non-linear mappings, so a later
re-dirty through such a mapping will not generate a fault, PG_dirty will
not reflect the dirty state and the dirty count will be skewed. This
implies that msync() is also currently broken for nonlinear mappings.The easiest solution is to emulate remap_file_pages on non-linear mappings
with simple mmap() for non ram-backed filesystems. Applications continue
to work (albeit slower), as long as the number of remappings remain below
the maximum vma count.However all currently known real uses of non-linear mappings are for ram
backed filesystems, which this patch doesn't affect.Signed-off-by: Miklos Szeredi
Acked-by: Peter Zijlstra
Cc: William Lee Irwin III
Cc: Nick Piggin
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Change ->fault prototype. We now return an int, which contains
VM_FAULT_xxx code in the low byte, and FAULT_RET_xxx code in the next byte.
FAULT_RET_ code tells the VM whether a page was found, whether it has been
locked, and potentially other things. This is not quite the way he wanted
it yet, but that's changed in the next patch (which requires changes to
arch code).This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
that a page is locked which requires filemap_nopage to go away (because we
can no longer remain backward compatible without that flag), but we were
going to do that anyway.struct fault_data is renamed to struct vm_fault as Linus asked. address
is now a void __user * that we should firmly encourage drivers not to use
without really good reason.The page is now returned via a page pointer in the vm_fault struct.
Signed-off-by: Nick Piggin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
the virtual address -> file offset differently from linear mappings.->populate is a layering violation because the filesystem/pagecache code
should need to know anything about the virtual memory mapping. The hitch here
is that the ->nopage handler didn't pass down enough information (ie. pgoff).
But it is more logical to pass pgoff rather than have the ->nopage function
calculate it itself anyway (because that's a similar layering violation).Having the populate handler install the pte itself is likewise a nasty thing
to be doing.This patch introduces a new fault handler that replaces ->nopage and
->populate and (later) ->nopfn. Most of the old mechanism is still in place
so there is a lot of duplication and nice cleanups that can be removed if
everyone switches over.The rationale for doing this in the first place is that nonlinear mappings are
subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
to duplicate the synchronisation logic rather than just consolidate the two.After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
pagecache. Seems like a fringe functionality anyway.NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
users have hit mainline yet.[akpm@linux-foundation.org: cleanup]
[randy.dunlap@oracle.com: doc. fixes for readahead]
[akpm@linux-foundation.org: build fix]
Signed-off-by: Nick Piggin
Signed-off-by: Randy Dunlap
Cc: Mark Fasheh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Dec, 2006
1 commit
-
Add more debugging in the rmap code in an attempt to locate to source of
the occasional "mapcount went negative" assertions.Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Dec, 2006
1 commit
-
David Binderman and his Intel C compiler rightly observe that
install_file_pte no longer has any use for its pte_val.Signed-off-by: Hugh Dickins
Cc: d binderman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds