16 Dec, 2014

1 commit

  • Pull drm updates from Dave Airlie:
    "Highlights:

    - AMD KFD driver merge

    This is the AMD HSA interface for exposing a lowlevel interface for
    GPGPU use. They have an open source userspace built on top of this
    interface, and the code looks as good as it was going to get out of
    tree.

    - Initial atomic modesetting work

    The need for an atomic modesetting interface to allow userspace to
    try and send a complete set of modesetting state to the driver has
    arisen, and been suffering from neglect this past year. No more,
    the start of the common code and changes for msm driver to use it
    are in this tree. Ongoing work to get the userspace ioctl finished
    and the code clean will probably wait until next kernel.

    - DisplayID 1.3 and tiled monitor exposed to userspace.

    Tiled monitor property is now exposed for userspace to make use of.

    - Rockchip drm driver merged.

    - imx gpu driver moved out of staging

    Other stuff:

    - core:
    panel - MIPI DSI + new panels.
    expose suggested x/y properties for virtual GPUs

    - i915:
    Initial Skylake (SKL) support
    gen3/4 reset work
    start of dri1/ums removal
    infoframe tracking
    fixes for lots of things.

    - nouveau:
    tegra k1 voltage support
    GM204 modesetting support
    GT21x memory reclocking work

    - radeon:
    CI dpm fixes
    GPUVM improvements
    Initial DPM fan control

    - rcar-du:
    HDMI support added
    removed some support for old boards
    slave encoder driver for Analog Devices adv7511

    - exynos:
    Exynos4415 SoC support

    - msm:
    a4xx gpu support
    atomic helper conversion

    - tegra:
    iommu support
    universal plane support
    ganged-mode DSI support

    - sti:
    HDMI i2c improvements

    - vmwgfx:
    some late fixes.

    - qxl:
    use suggested x/y properties"

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (969 commits)
    drm: sti: fix module compilation issue
    drm/i915: save/restore GMBUS freq across suspend/resume on gen4
    drm: sti: correctly cleanup CRTC and planes
    drm: sti: add HQVDP plane
    drm: sti: add cursor plane
    drm: sti: enable auxiliary CRTC
    drm: sti: fix delay in VTG programming
    drm: sti: prepare sti_tvout to support auxiliary crtc
    drm: sti: use drm_crtc_vblank_{on/off} instead of drm_vblank_{on/off}
    drm: sti: fix hdmi avi infoframe
    drm: sti: remove event lock while disabling vblank
    drm: sti: simplify gdp code
    drm: sti: clear all mixer control
    drm: sti: remove gpio for HDMI hot plug detection
    drm: sti: allow to change hdmi ddc i2c adapter
    drm/doc: Document drm_add_modes_noedid() usage
    drm/i915: Remove '& 0xffff' from the mask given to WA_REG()
    drm/i915: Invert the mask and val arguments in wa_add() and WA_REG()
    drm: Zero out DRM object memory upon cleanup
    drm/i915/bdw: Fix the write setting up the WIZ hashing mode
    ...

    Linus Torvalds
     

14 Dec, 2014

1 commit

  • Convert all open coded mutex_lock/unlock calls to the
    i_mmap_[lock/unlock]_write() helpers.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: "Kirill A. Shutemov"
    Acked-by: Hugh Dickins
    Cc: Oleg Nesterov
    Acked-by: Peter Zijlstra (Intel)
    Cc: Srikar Dronamraju
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

13 Nov, 2014

1 commit

  • Add calls to the new mmu_notifier_invalidate_range() function to all
    places in the VMM that need it.

    Signed-off-by: Joerg Roedel
    Reviewed-by: Andrea Arcangeli
    Reviewed-by: Jérôme Glisse
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: Johannes Weiner
    Cc: Jay Cornwall
    Cc: Oded Gabbay
    Cc: Suravee Suthikulpanit
    Cc: Jesse Barnes
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Oded Gabbay

    Joerg Roedel
     

07 Jun, 2014

1 commit

  • The remap_file_pages() system call is used to create a nonlinear
    mapping, that is, a mapping in which the pages of the file are mapped
    into a nonsequential order in memory. The advantage of using
    remap_file_pages() over using repeated calls to mmap(2) is that the
    former approach does not require the kernel to create additional VMA
    (Virtual Memory Area) data structures.

    Supporting nonlinear mappings requires a significant amount of
    non-trivial code in the kernel virtual memory subsystem, including hot
    paths. Also, to make nonlinear mappings work, the kernel needs a way to
    distinguish normal page table entries from entries with a file offset
    (pte_file). The kernel reserves a flag in the PTE for this purpose. PTE
    flags are a scarce resource, especially on some CPU architectures. It
    would be nice to free up the flag for other usage.

    Fortunately, there are not many users of remap_file_pages() in the wild.
    It's only known that one enterprise RDBMS implementation uses the
    syscall on 32-bit systems to map files bigger than can linearly fit into
    32-bit virtual address space. This use-case is not critical anymore
    since 64-bit systems are widely available.

    The plan is to deprecate the syscall and replace it with an emulation.
    The emulation will create new VMAs instead of nonlinear mappings. It's
    going to work slower for rare users of remap_file_pages() but ABI is
    preserved.

    One side effect of emulation (apart from performance) is that user can
    hit vm.max_map_count limit more easily due to additional VMAs. See
    comment for DEFAULT_MAX_MAP_COUNT for more details on the limit.

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Kirill A. Shutemov
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Dave Jones
    Cc: Armin Rigo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     

05 Jun, 2014

1 commit

  • Hugh reported:

    | I noticed your soft_dirty work in install_file_pte(): which looked
    | good at first, until I realized that it's propagating the soft_dirty
    | of a pte it's about to zap completely, to the unrelated entry it's
    | about to insert in its place. Which seems very odd to me.

    Indeed, this code ends up being a nop in the result --
    pte_file_mksoft_dirty() operates on a pte_t argument and returns a new
    pte_t which is never used afterwards. After looking more, I think what
    we need is to soft-dirtify all newly remapped file pages, because it
    should look like a new mapping to the memory tracker.

    Signed-off-by: Cyrill Gorcunov
    Reported-by: Hugh Dickins
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

20 Mar, 2014

1 commit

  • Fix some "Bad rss-counter state" reports on exit, arising from the
    interaction between page migration and remap_file_pages(): zap_pte()
    must count a migration entry when zapping it.

    And yes, it is possible (though very unusual) to find an anon page or
    swap entry in a VM_SHARED nonlinear mapping: coming from that horrid
    get_user_pages(write, force) case which COWs even in a shared mapping.

    Signed-off-by: Hugh Dickins
    Tested-by: Sasha Levin
    Tested-by: Dave Jones
    Cc: Cyrill Gorcunov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

03 Jan, 2014

1 commit

  • remap_file_pages calls mmap_region, which may merge the VMA with other
    existing VMAs, and free "vma". This can lead to a use-after-free bug.
    Avoid the bug by remembering vm_flags before calling mmap_region, and
    not trying to dereference vma later.

    Signed-off-by: Rik van Riel
    Reported-by: Dmitry Vyukov
    Cc: PaX Team
    Cc: Kees Cook
    Cc: Michel Lespinasse
    Cc: Cyrill Gorcunov
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

14 Aug, 2013

1 commit

    Andy reported that if a file page gets reclaimed we lose the soft-dirty
    bit if it was there, so save the _PAGE_BIT_SOFT_DIRTY bit when the page
    address gets encoded into the pte entry. Thus when a #pf happens on such
    a non-present pte, we can restore it.

    Reported-by: Andy Lutomirski
    Signed-off-by: Cyrill Gorcunov
    Acked-by: Pavel Emelyanov
    Cc: Matt Mackall
    Cc: Xiao Guangrong
    Cc: Marcelo Tosatti
    Cc: KOSAKI Motohiro
    Cc: Stephen Rothwell
    Cc: Peter Zijlstra
    Cc: "Aneesh Kumar K.V"
    Cc: Minchan Kim
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cyrill Gorcunov
     

29 Mar, 2013

1 commit

  • This reverts commit 186930500985 ("mm: introduce VM_POPULATE flag to
    better deal with racy userspace programs").

    VM_POPULATE only has any effect when userspace plays racy games with
    vmas by trying to unmap and remap memory regions that mmap or mlock are
    operating on.

    Also, the only effect of VM_POPULATE when userspace plays such games is
    that it avoids populating new memory regions that get remapped into the
    address range that was being operated on by the original mmap or mlock
    calls.

    Let's remove VM_POPULATE as there isn't any strong argument to mandate a
    new vm_flag.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

15 Mar, 2013

1 commit

    The vm_flags introduced in 6d7825b10dbe ("mm/fremap.c: fix oops on error
    path") is supposed to avoid a compiler warning about uninitialized
    vm_flags without changing the generated code.

    However I am concerned that this is going to be very brittle, and fail
    with some compiler versions. The failure could be either of:

    - compiler could actually load vma->vm_flags before checking for the
    !vma condition, thus reintroducing the oops

    - compiler could optimize out the !vma check, since the pointer just got
    dereferenced shortly before (so the compiler knows it can't be NULL!)

    I propose reversing this part of the change and initializing vm_flags to 0
    just to avoid the bogus uninitialized use warning.

    Signed-off-by: Michel Lespinasse
    Cc: Tommi Rantala
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

14 Mar, 2013

1 commit

  • If find_vma() fails, sys_remap_file_pages() will dereference `vma', which
    contains NULL. Fix it by checking the pointer.

    (We could alternatively check for err==0, but this seems more direct)

    (The vm_flags change is to squish a bogus used-uninitialised warning
    without adding extra code).

    Reported-by: Tommi Rantala
    Cc: Michel Lespinasse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     

24 Feb, 2013

4 commits

  • The vm_populate() code populates user mappings without constantly
    holding the mmap_sem. This makes it susceptible to racy userspace
    programs: the user mappings may change while vm_populate() is running,
    and in this case vm_populate() may end up populating the new mapping
    instead of the old one.

    In order to reduce the possibility of userspace getting surprised by
    this behavior, this change introduces the VM_POPULATE vma flag which
    gets set on vmas we want vm_populate() to work on. This way
    vm_populate() may still end up populating the new mapping after such a
    race, but only if the new mapping is also one that the user has
    requested (using MAP_SHARED, MAP_LOCKED or mlock) to be populated.

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • After the MAP_POPULATE handling has been moved to mmap_region() call
    sites, the only remaining use of the flags argument is to pass the
    MAP_NORESERVE flag. This can be just as easily handled by
    do_mmap_pgoff(), so do that and remove the mmap_region() flags
    parameter.

    [akpm@linux-foundation.org: remove double parens]
    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We have many vma manipulation functions that are fast in the typical
    case, but can optionally be instructed to populate an unbounded number
    of ptes within the region they work on:

    - mmap with MAP_POPULATE or MAP_LOCKED flags;
    - remap_file_pages() with MAP_NONBLOCK not set or when working on a
    VM_LOCKED vma;
    - mmap_region() and all its wrappers when mlock(MCL_FUTURE) is in
    effect;
    - brk() when mlock(MCL_FUTURE) is in effect.

    Current code handles these pte operations locally, while the
    surrounding code has to hold the mmap_sem write side since it's
    manipulating vmas. This means we're doing an unbounded amount of pte
    population work with mmap_sem held, and this causes problems as Andy
    Lutomirski reported (we've hit this at Google as well, though it's not
    entirely clear why people keep trying to use mlock(MCL_FUTURE) in the
    first place).

    I propose introducing a new mm_populate() function to do this pte
    population work after the mmap_sem has been released. mm_populate()
    does need to acquire the mmap_sem read side, but critically, it doesn't
    need to hold it continuously for the entire duration of the operation -
    it can drop it whenever things take too long (such as when hitting disk
    for a file read) and re-acquire it later on.

    The following patches are included

    - Patch 1 fixes some issues I noticed while working on the existing code.
    If needed, it could potentially go in before the rest of the patches.

    - Patch 2 introduces the new mm_populate() function and changes
    mmap_region() call sites to use it after they drop mmap_sem. This is
    inspired from Andy Lutomirski's proposal and is built as an extension
    of the work I had previously done for mlock() and mlockall() around
    v2.6.38-rc1. I had tried doing something similar at the time but had
    given up as there were so many do_mmap() call sites; the recent cleanups
    by Linus and Viro are a tremendous help here.

    - Patches 3-5 convert some of the less-obvious places doing unbounded
    pte populates to the new mm_populate() mechanism.

    - Patches 6-7 are code cleanups that are made possible by the
    mm_populate() work. In particular, they remove more code than the
    entire patch series added, which should be a good thing :)

    - Patch 8 is optional to this entire series. It only helps to deal more
    nicely with racy userspace programs that might modify their mappings
    while we're trying to populate them. It adds a new VM_POPULATE flag
    on the mappings we do want to populate, so that if userspace replaces
    them with mappings it doesn't want populated, mm_populate() won't
    populate those replacement mappings.

    This patch:

    Assorted small fixes. The first two are quite small:

    - Move check for vma->vm_private_data && !(vma->vm_flags & VM_NONLINEAR)
    within existing if (!(vma->vm_flags & VM_NONLINEAR)) block.
    Purely cosmetic.

    - In the VM_LOCKED case, when dropping PG_Mlocked for the over-mapped
    range, make sure we own the mmap_sem write lock around the
    munlock_vma_pages_range call as this manipulates the vma's vm_flags.

    Last fix requires a longer explanation. remap_file_pages() can do its work
    either through VM_NONLINEAR manipulation or by creating extra vmas.
    These two cases were inconsistent with each other (and ultimately, both wrong)
    as to exactly when they faulted in the newly mapped file pages:

    - In the VM_NONLINEAR case, new file pages would be populated if
    the MAP_NONBLOCK flag wasn't passed. If MAP_NONBLOCK was passed,
    new file pages wouldn't be populated even if the vma is already
    marked as VM_LOCKED.

    - In the linear (emulated) case, the work is passed to the mmap_region()
    function which would populate the pages if the vma is marked as
    VM_LOCKED, and would not otherwise - regardless of the value of the
    MAP_NONBLOCK flag, because MAP_POPULATE wasn't being passed to
    mmap_region().

    The desired behavior is that we want the pages to be populated and locked
    if the vma is marked as VM_LOCKED, or to be populated if the MAP_NONBLOCK
    flag is not passed to remap_file_pages().

    Signed-off-by: Michel Lespinasse
    Acked-by: Rik van Riel
    Tested-by: Andy Lutomirski
    Cc: Greg Ungerer
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

20 Oct, 2012

1 commit

  • In commit 0b173bc4daa8 ("mm: kill vma flag VM_CAN_NONLINEAR") we
    replaced the VM_CAN_NONLINEAR test with checking whether the mapping has
    a '->remap_pages()' vm operation, but there is no guarantee that it
    even has a vm_ops pointer at all.

    Add the appropriate test for NULL vm_ops.

    Reported-by: Sasha Levin
    Cc: Konstantin Khlebnikov
    Cc: Andrew Morton
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

09 Oct, 2012

2 commits

  • Implement an interval tree as a replacement for the VMA prio_tree. The
    algorithms are similar to lib/interval_tree.c; however that code can't be
    directly reused as the interval endpoints are not explicitly stored in the
    VMA. So instead, the common algorithm is moved into a template and the
    details (node type, how to get interval endpoints from the node, etc) are
    filled in using the C preprocessor.

    Once the interval tree functions are available, using them as a
    replacement to the VMA prio tree is a relatively simple, mechanical job.

    Signed-off-by: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Hillf Danton
    Cc: Peter Zijlstra
    Cc: Catalin Marinas
    Cc: Andrea Arcangeli
    Cc: David Woodhouse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Move actual pte filling for non-linear file mappings into the new special
    vma operation: ->remap_pages().

    Filesystems must implement this method to get non-linear mapping support;
    if a filesystem uses filemap_fault(), then generic_file_remap_pages() can be used.

    Now device drivers can implement this method and obtain nonlinear vma support.

    Signed-off-by: Konstantin Khlebnikov
    Cc: Alexander Viro
    Cc: Carsten Otte
    Cc: Chris Metcalf #arch/tile
    Cc: Cyrill Gorcunov
    Cc: Eric Paris
    Cc: H. Peter Anvin
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Cc: James Morris
    Cc: Jason Baron
    Cc: Kentaro Takeda
    Cc: Matt Helsley
    Cc: Nick Piggin
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Robert Richter
    Cc: Suresh Siddha
    Cc: Tetsuo Handa
    Cc: Venkatesh Pallipadi
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Konstantin Khlebnikov
     

27 Sep, 2012

1 commit


31 Oct, 2011

1 commit


27 May, 2011

1 commit

  • The type of vma->vm_flags is 'unsigned long'. Neither 'int' nor
    'unsigned int'. This patch fixes such misuse.

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

1 commit

  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

26 Sep, 2010

1 commit

  • Thomas Pollet noticed that the remap_file_pages() system call in
    fremap.c has a potential overflow in the first part of the if statement
    below, which could cause it to process bogus input parameters.
    Specifically, the pgoff + size parameters could wrap, thereby
    preventing the system call from failing when it should.

    Reported-by: Thomas Pollet
    Signed-off-by: Larry Woodman
    Signed-off-by: Linus Torvalds

    Larry Woodman
     

25 Sep, 2010

1 commit

  • Thomas Pollet points out that the 'end' variable is broken. It was
    computed based on start/size before they were page-aligned, and as such
    doesn't actually match any of the other actions we take. The overflow
    test on end was also redundant, since we had already tested it with the
    properly aligned version.

    So just get rid of it entirely. The one remaining use for that broken
    variable can just use 'start+size' like all the other cases already did.

    Reported-by: Thomas Pollet
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

07 Mar, 2010

1 commit

    Presently, the per-mm statistics counters are defined by macros in
    sched.h.

    This patch modifies them to be:
    - defined in mm.h as inline functions
    - backed by an array instead of macro-based name creation.

    This is preparation for a future patch that changes the implementation
    of the per-mm counters; doing it this way keeps that patch small.

    Signed-off-by: KAMEZAWA Hiroyuki
    Reviewed-by: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

11 Feb, 2009

1 commit

  • When overcommit is disabled, the core VM accounts for pages used by anonymous
    shared, private mappings and special mappings. It keeps track of VMAs that
    should be accounted for with VM_ACCOUNT and VMAs that never had a reserve
    with VM_NORESERVE.

    Overcommit for hugetlbfs is much riskier than overcommit for base pages
    due to contiguity requirements. It avoids overcommitting on both shared and
    private mappings using reservation counters that are checked and updated
    during mmap(). This ensures (within limits) that hugepages exist in the
    future when faults occur; otherwise it is too easy for applications to be SIGKILLed.

    As hugetlbfs makes its own reservations of a different unit to the base page
    size, VM_ACCOUNT should never be set. Even if the units were correct, we would
    double account for the usage in the core VM and hugetlbfs. VM_NORESERVE may
    be set because an application can request no reserves be made for hugetlbfs
    at the risk of getting killed later.

    With commit fc8744adc870a8d4366908221508bb113d8b72ee, VM_NORESERVE and
    VM_ACCOUNT are getting unconditionally set for hugetlbfs-backed mappings. This
    breaks the accounting for both the core VM and hugetlbfs: it can trigger
    an OOM storm when hugepage pools are too small, and lockups and corrupted
    counters otherwise. This patch brings hugetlbfs more in line with how the
    core VM treats VM_NORESERVE but prevents VM_ACCOUNT being set.

    Signed-off-by: Mel Gorman
    Signed-off-by: Linus Torvalds

    Mel Gorman
     

14 Jan, 2009

1 commit


07 Jan, 2009

1 commit

  • Remove page_remove_rmap()'s vma arg, which was only for the Eeek message.
    And remove the BUG_ON(page_mapcount(page) == 0) from CONFIG_DEBUG_VM's
    page_dup_rmap(): we're trying to be more resilient about that than BUGs.

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Christoph Lameter
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

20 Oct, 2008

1 commit

  • Originally by Nick Piggin

    Remove mlocked pages from the LRU using "unevictable infrastructure"
    during mmap(), munmap(), mremap() and truncate(). Try to move back to
    normal LRU lists on munmap() when last mlocked mapping removed. Remove
    PageMlocked() status when page truncated from file.

    [akpm@linux-foundation.org: cleanup]
    [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
    [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
    [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
    [akpm@linux-foundation.org: remove bogus kerneldoc token]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

29 Jul, 2008

1 commit

    With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case with KVM, where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids requiring the code that
    tears down the secondary MMU's mappings to implement tlb_gather-like
    logic as in zap_page_range, which would need many IPIs to flush other
    cpus' tlbs for each fixed number of sptes unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will setup a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible
    in the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU,
    or a full MMU with both sptes and secondary tlbs like the shadow-pagetable
    layer with kvm), or a remote DMA in software like XPMEM (hence the need
    to schedule in XPMEM code to send the invalidate to the remote node,
    while there is no need to schedule in kvm/gru as it's an immediate event
    like invalidating a primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increased in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled
    if CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled, and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be gracefully handled. Here
    is an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up, other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86 didn't
    need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

20 Mar, 2008

1 commit

  • Fix various kernel-doc notation in mm/:

    filemap.c: add function short description; convert 2 to kernel-doc
    fremap.c: change parameter 'prot' to @prot
    pagewalk.c: change "-" in function parameters to ":"
    slab.c: fix short description of kmem_ptr_validate()
    swap.c: fix description & parameters of put_pages_list()
    swap_state.c: fix function parameters
    vmalloc.c: change "@returns" to "Returns:" since that is not a parameter

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

06 Feb, 2008

1 commit


17 Oct, 2007

2 commits

  • Fix kernel-doc for sys_remap_file_pages() and add info to the 'prot' NOTE.
    Rename __prot parameter to prot.

    Signed-off-by: Randy Dunlap
    Acked-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     
  • mm.h doesn't use directly anything from mutex.h and backing-dev.h, so
    remove them and add them back to files which need them.

    Cross-compile tested on many configs and archs.

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

09 Oct, 2007

1 commit


20 Jul, 2007

3 commits

  • page_mkclean() doesn't re-protect ptes for non-linear mappings, so a later
    re-dirty through such a mapping will not generate a fault, PG_dirty will
    not reflect the dirty state and the dirty count will be skewed. This
    implies that msync() is also currently broken for nonlinear mappings.

    The easiest solution is to emulate remap_file_pages on non-linear
    mappings with a simple mmap() for filesystems that are not RAM-backed.
    Applications continue to work (albeit more slowly), as long as the number
    of remappings remains below the maximum vma count.

    However, all currently known real uses of non-linear mappings are for
    RAM-backed filesystems, which this patch doesn't affect.

    Signed-off-by: Miklos Szeredi
    Acked-by: Peter Zijlstra
    Cc: William Lee Irwin III
    Cc: Nick Piggin
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miklos Szeredi
     
    Change the ->fault prototype. We now return an int, which contains a
    VM_FAULT_xxx code in the low byte and a FAULT_RET_xxx code in the next
    byte. The FAULT_RET_xxx code tells the VM whether a page was found,
    whether it has been locked, and potentially other things. This is not
    quite the way Linus wanted it yet, but that's changed in the next patch
    (which requires changes to arch code).
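
    The byte-packing scheme described above can be illustrated with plain
    bit masks. The constant values below are invented for the example (the
    real VM_FAULT_xxx / FAULT_RET_xxx values differ); only the low-byte /
    next-byte split is what the commit describes:

    ```c
    #include <stdio.h>

    /* Illustrative values only; the split, not the numbers, is the point. */
    #define VM_FAULT_MINOR    0x01   /* low byte: what happened */
    #define FAULT_RET_LOCKED  0x0100 /* next byte: page state for the VM */

    int main(void) {
        int ret = VM_FAULT_MINOR | FAULT_RET_LOCKED;
        int fault  = ret & 0xff;     /* extract the VM_FAULT_xxx code */
        int status = ret & 0xff00;   /* extract the FAULT_RET_xxx code */
        printf("fault=%#x status=%#x\n", fault, status);
        return 0;
    }
    ```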

    This means we no longer set VM_CAN_INVALIDATE in the vma in order to say
    that a page is locked. That requires filemap_nopage to go away (because
    we can no longer remain backward compatible without that flag), but we
    were going to do that anyway.

    struct fault_data is renamed to struct vm_fault, as Linus asked. address
    is now a void __user * that we should firmly encourage drivers not to
    use without a really good reason.

    The page is now returned via a page pointer in the vm_fault struct.

    Signed-off-by: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     
  • Nonlinear mappings are (AFAIKS) simply a virtual memory concept that encodes
    the virtual address -> file offset differently from linear mappings.

    ->populate is a layering violation because the filesystem/pagecache code
    should not need to know anything about the virtual memory mapping. The
    hitch here is that the ->nopage handler didn't pass down enough
    information (ie. pgoff). But it is more logical to pass pgoff rather
    than have the ->nopage function calculate it itself anyway (because
    that's a similar layering violation).
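
    For a linear mapping, the pgoff that the old ->nopage handler had to
    compute itself is just arithmetic on the vma. A sketch of that rule,
    using a hypothetical mini-vma struct with only the fields the formula
    needs and a 4 KiB page size assumed for the example:

    ```c
    #include <stdio.h>

    #define PAGE_SHIFT 12  /* 4 KiB pages assumed for the example */

    /* hypothetical mini-vma; only the fields the formula needs */
    struct vma { unsigned long vm_start; unsigned long vm_pgoff; };

    /* file page backing a faulting address in a linear mapping:
     * pages into the mapping, plus the mapping's starting file page */
    unsigned long fault_pgoff(const struct vma *v, unsigned long address) {
        return ((address - v->vm_start) >> PAGE_SHIFT) + v->vm_pgoff;
    }

    int main(void) {
        /* mapping starts at 0x400000 and covers the file from page 8 */
        struct vma v = { 0x400000, 8 };
        unsigned long pg = fault_pgoff(&v, 0x402000); /* 2 pages in */
        printf("%lu\n", pg);  /* 8 + 2 = file page 10 */
        return 0;
    }
    ```

    Nonlinear mappings break exactly this rule, which is why a fault handler
    that receives pgoff from the VM is the cleaner layering.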

    Having the populate handler install the pte itself is likewise a nasty thing
    to be doing.

    This patch introduces a new fault handler that replaces ->nopage and
    ->populate and (later) ->nopfn. Most of the old mechanism is still in place
    so there is a lot of duplication and nice cleanups that can be removed if
    everyone switches over.

    The rationale for doing this in the first place is that nonlinear mappings are
    subject to the pagefault vs invalidate/truncate race too, and it seemed stupid
    to duplicate the synchronisation logic rather than just consolidate the two.

    After this patch, MAP_NONBLOCK no longer sets up ptes for pages present in
    pagecache. Seems like a fringe functionality anyway.

    NOPAGE_REFAULT is removed. This should be implemented with ->fault, and no
    users have hit mainline yet.

    [akpm@linux-foundation.org: cleanup]
    [randy.dunlap@oracle.com: doc. fixes for readahead]
    [akpm@linux-foundation.org: build fix]
    Signed-off-by: Nick Piggin
    Signed-off-by: Randy Dunlap
    Cc: Mark Fasheh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Nick Piggin
     

23 Dec, 2006

1 commit


08 Dec, 2006

1 commit