11 Jan, 2012

1 commit

  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to fail to serialize properly against the
    mremap PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead migrate to miss
    some ptes.

    This patch adds an anon_vma_moveto_tail() function that forces the dst vma
    to the end of the list before mremap starts, to solve the problem.
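
    A hedged sketch of what such a helper looks like (modelled on the shape of
    the function this patch adds; not necessarily the exact mainline code):

    void anon_vma_moveto_tail(struct vm_area_struct *dst)
    {
        struct anon_vma_chain *pavc;
        struct anon_vma *root = NULL;

        /*
         * Move every link of dst to the tail of its same_anon_vma list, so
         * rmap_walk visits the src vma before the dst vma and cannot miss a
         * pte that mremap moves between the two checks.
         */
        list_for_each_entry_reverse(pavc, &dst->anon_vma_chain, same_vma) {
            struct anon_vma *anon_vma = pavc->anon_vma;
            root = lock_anon_vma_root(root, anon_vma);
            list_move_tail(&pavc->same_anon_vma, &anon_vma->head);
        }
        unlock_anon_vma_root(root);
    }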

    If the mremap is very large and there are lots of parents or children
    sharing the anon_vma root lock, this should still scale better than taking
    the anon_vma root lock around every pte copy, practically for the whole
    duration of mremap.

    Update: Hugh noticed that special care is needed in the error path, where
    move_page_tables goes in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises anon_vma_moveto_tail():

    ===

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* SIZE value assumed here; any large 2MB multiple works for this test. */
    #define SIZE (1024UL*1024*1024)

    int main()
    {
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        printf("%p\n", p);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);
        if (memcmp(p, p2, SIZE))
            printf("error\n");

        /* Bounce the second half of p out to p3 and back again. */
        p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        if (p4 != p3)
            perror("mremap"), exit(1);
        p4 = mremap(p4, SIZE/2, SIZE/2, MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
        if (p4 != p+SIZE/2)
            perror("mremap"), exit(1);

        if (memcmp(p, p2, SIZE))
            printf("error\n");
        printf("ok\n");

        return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

01 Nov, 2011

3 commits

  • This adds THP support to mremap (decreases the number of split_huge_page()
    calls).

    Here are also some benchmarks with a proggy like this:

    ===
    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <sys/time.h>

    #define SIZE (5UL*1024*1024*1024)

    int main()
    {
        static struct timeval oldstamp, newstamp;
        long diffsec;
        char *p, *p2, *p3, *p4;

        if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
            perror("memalign"), exit(1);
        if (posix_memalign((void **)&p3, 2*1024*1024, 4096))
            perror("memalign"), exit(1);

        memset(p, 0xff, SIZE);
        memset(p2, 0xff, SIZE);
        memset(p3, 0x77, 4096);

        /* Time only the mremap itself. */
        gettimeofday(&oldstamp, NULL);
        p4 = mremap(p, SIZE, SIZE, MREMAP_FIXED|MREMAP_MAYMOVE, p3);
        gettimeofday(&newstamp, NULL);
        diffsec = newstamp.tv_sec - oldstamp.tv_sec;
        diffsec = newstamp.tv_usec - oldstamp.tv_usec + 1000000 * diffsec;
        printf("usec %ld\n", diffsec);

        if (p4 == MAP_FAILED || p4 != p3)
            perror("mremap"), exit(1);
        if (memcmp(p4, p2, SIZE))
            printf("mremap bug\n"), exit(1);
        printf("ok\n");

        return 0;
    }
    ===

    THP on

    Performance counter stats for './largepage13' (3 runs):

    69195836 dTLB-loads ( +- 3.546% ) (scaled from 50.30%)
    60708 dTLB-load-misses ( +- 11.776% ) (scaled from 52.62%)
    676266476 dTLB-stores ( +- 5.654% ) (scaled from 69.54%)
    29856 dTLB-store-misses ( +- 4.081% ) (scaled from 89.22%)
    1055848782 iTLB-loads ( +- 4.526% ) (scaled from 80.18%)
    8689 iTLB-load-misses ( +- 2.987% ) (scaled from 58.20%)

    7.314454164 seconds time elapsed ( +- 0.023% )

    THP off

    Performance counter stats for './largepage13' (3 runs):

    1967379311 dTLB-loads ( +- 0.506% ) (scaled from 60.59%)
    9238687 dTLB-load-misses ( +- 22.547% ) (scaled from 61.87%)
    2014239444 dTLB-stores ( +- 0.692% ) (scaled from 60.40%)
    3312335 dTLB-store-misses ( +- 7.304% ) (scaled from 67.60%)
    6764372065 iTLB-loads ( +- 0.925% ) (scaled from 79.00%)
    8202 iTLB-load-misses ( +- 0.475% ) (scaled from 70.55%)

    9.693655243 seconds time elapsed ( +- 0.069% )

    grep thp /proc/vmstat
    thp_fault_alloc 35849
    thp_fault_fallback 0
    thp_collapse_alloc 3
    thp_collapse_alloc_failed 0
    thp_split 0

    thp_split 0 confirms no thp split despite plenty of hugepages allocated.

    Measuring only the mremap time (so excluding the three long memsets and the
    final long 10GB memory-accessing memcmp):

    THP on

    usec 14824
    usec 14862
    usec 14859

    THP off

    usec 256416
    usec 255981
    usec 255847

    With an older kernel, without the mremap optimizations (the patch below
    optimizes the non-THP version too):

    THP on

    usec 392107
    usec 390237
    usec 404124

    THP off

    usec 444294
    usec 445237
    usec 445820

    I guess a threaded program that sends more IPIs on a large SMP system would
    show an even larger difference.

    All debug options are off except DEBUG_VM to avoid skewing the
    results.

    The only problem for native 2M mremap, as it happens above, is that both
    the source and destination addresses must be 2M aligned or the hugepmd
    can't be moved without a split; but that is a hardware limitation.

    [akpm@linux-foundation.org: coding-style nitpicking]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • This replaces ptep_clear_flush() with ptep_get_and_clear() and a single
    flush_tlb_range() at the end of the loop, to avoid sending one IPI for
    each page.
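
    A hedged sketch of the resulting pattern (illustrative only; the helper
    name move_range_sketch is made up, this is not the actual
    move_ptes()/move_page_tables() code):

    static void move_range_sketch(struct vm_area_struct *vma, pte_t *old_pte,
                                  pte_t *new_pte, unsigned long old_addr,
                                  unsigned long old_end, unsigned long new_addr)
    {
        struct mm_struct *mm = vma->vm_mm;
        unsigned long start = old_addr;
        pte_t pte;

        for (; old_addr < old_end; old_pte++, old_addr += PAGE_SIZE,
                                   new_pte++, new_addr += PAGE_SIZE) {
            if (pte_none(*old_pte))
                continue;
            /* Clear without flushing: no per-page IPI any more. */
            pte = ptep_get_and_clear(mm, old_addr, old_pte);
            set_pte_at(mm, new_addr, new_pte, pte);
        }
        /* One range flush (one IPI burst) covers the whole region. */
        flush_tlb_range(vma, start, old_end);
    }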

    The mmu_notifier_invalidate_range_start/end section is enlarged
    accordingly but this is not going to fundamentally change things. It was
    more by accident that the region under mremap was for the most part still
    available for secondary MMUs: the primary MMU was never allowed to
    reliably access that region for the duration of the mremap (modulo
    trapping SIGSEGV on the old address range, which sounds impractical and
    flaky). If users want secondary MMUs not to lose access to a large
    region under mremap, they should reduce the mremap size accordingly in
    userland and run multiple calls. Overall this will run faster, so it's
    actually going to reduce the time the region is under mremap for the
    primary MMU which should provide a net benefit to apps.

    For KVM this is a noop because the guest physical memory is never
    mremapped; there's just no point in ever moving it while the guest runs. One
    target of this optimization is JVM GC (so unrelated to the mmu notifier
    logic).

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Using "- 1" relies on the old_end to be page aligned and PAGE_SIZE > 1,
    those are reasonable requirements but the check remains obscure and it
    looks more like an off by one error than an overflow check. This I feel
    will improve readability.
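
    Roughly, the two styles of check look like this (a hedged reconstruction
    for illustration, not the exact diff):

    /* Old style: only safe because old_end is page aligned and PAGE_SIZE > 1,
     * and it reads like an off-by-one. */
    next = (old_addr + PMD_SIZE) & PMD_MASK;
    if (next - 1 > old_end)
        next = old_end;
    extent = next - old_addr;

    /* New style: clamp the delta, which stays correct even if next wrapped. */
    next = (old_addr + PMD_SIZE) & PMD_MASK;
    extent = next - old_addr;
    if (extent > old_end - old_addr)
        extent = old_end - old_addr;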

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 May, 2011

2 commits

  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

07 Apr, 2011

1 commit

  • The normal mmap paths all avoid creating a mapping where the pgoff
    inside the mapping could wrap around due to overflow. However, an
    expanding mremap() can take such a non-wrapping mapping and make it
    bigger and cause a wrapping condition.
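
    A hedged sketch of the kind of check this calls for when growing a mapping
    (illustrative; the return convention is assumed, this is not necessarily
    the exact fix):

    if (new_len > old_len) {
        unsigned long pgoff;

        /* Refuse to grow if the pgoff of the last page would wrap. */
        pgoff = (addr - vma->vm_start) >> PAGE_SHIFT;
        pgoff += vma->vm_pgoff;
        if (pgoff + (new_len >> PAGE_SHIFT) < pgoff)
            return ERR_PTR(-EINVAL);
    }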

    Noticed by Robert Swiecki when running a system call fuzzer, where it
    caused a BUG_ON() due to terminally confusing the vma_prio_tree code. A
    vma dumping patch by Hugh then pinpointed the crazy wrapped case.

    Reported-and-tested-by: Robert Swiecki
    Acked-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

24 Feb, 2011

1 commit

  • Robert Swiecki reported a BUG_ON(page_mapped) from a fuzzer, punching
    a hole with madvise(,, MADV_REMOVE). That path is under mutex, and
    cannot be explained by lack of serialization in unmap_mapping_range().

    Reviewing the code, I found one place where vm_truncate_count handling
    should have been updated, when I switched at the last minute from one
    way of managing the restart_addr to another: mremap move changes the
    virtual addresses, so it ought to adjust the restart_addr.

    But rather than exporting the notion of restart_addr from memory.c, or
    converting to restart_pgoff throughout, simply reset vm_truncate_count
    to 0 to force a rescan if mremap move races with preempted truncation.

    We have no confirmation that this fixes Robert's BUG,
    but it is a fix that's worth making anyway.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Jan, 2011

2 commits

  • split_huge_page_pmd compat code. Each one of those would need to be
    expanded to hundreds of lines of complex code without a fully reliable
    split_huge_page_pmd design.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • pte alloc routines must wait for split_huge_page if the pmd is not present
    and not null (i.e. pmd_trans_splitting). The additional branches are
    optimized away at compile time by pmd_trans_splitting if the config option
    is off. However we must pass the vma down in order to know the anon_vma
    lock to wait for.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Andrea Arcangeli
    Acked-by: Rik van Riel
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

27 Oct, 2010

1 commit

  • Since we no longer need to provide KM_type, the whole pte_*map_nested()
    API is now redundant, remove it.

    Signed-off-by: Peter Zijlstra
    Acked-by: Chris Metcalf
    Cc: David Howells
    Cc: Hugh Dickins
    Cc: Rik van Riel
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Russell King
    Cc: Ralf Baechle
    Cc: David Miller
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

30 Mar, 2010

1 commit

  • …it slab.h inclusion from percpu.h

    percpu.h is included by sched.h and module.h and thus ends up being
    included when building most .c files. percpu.h includes slab.h which
    in turn includes gfp.h making everything defined by the two files
    universally available and complicating inclusion dependencies.

    percpu.h -> slab.h dependency is about to be removed. Prepare for
    this change by updating users of gfp and slab facilities to include those
    headers directly instead of assuming availability. As this conversion
    needs to touch a large number of source files, the following script is
    used as the basis of conversion.

    http://userweb.kernel.org/~tj/misc/slabh-sweep.py

    The script does the following:

    * Scan files for gfp and slab usages and update includes such that
    only the necessary includes are there, i.e. if only gfp is used,
    gfp.h; if slab is used, slab.h.

    * When the script inserts a new include, it looks at the include
    blocks and tries to put the new include such that its order conforms
    to its surroundings. It's put in the include block which contains
    core kernel includes, in the same order that the rest are ordered -
    alphabetical, Christmas tree, rev-Xmas-tree, or at the end if there
    doesn't seem to be any matching order.

    * If the script can't find a place to put a new include (mostly
    because the file doesn't have a fitting include block), it prints out
    an error message indicating which .h file needs to be added to the
    file.

    The conversion was done in the following steps.

    1. The initial automatic conversion of all .c files updated slightly
    over 4000 files, deleting around 700 includes and adding ~480 gfp.h
    and ~3000 slab.h inclusions. The script emitted errors for ~400
    files.

    2. Each error was manually checked. Some didn't need the inclusion,
    some needed manual addition, and for others adding it to an
    implementation .h or embedding .c file was more appropriate. This step
    added inclusions to around 150 files.

    3. The script was run again and the output was compared to the edits
    from #2 to make sure no file was left behind.

    4. Several build tests were done and a couple of problems were fixed.
    e.g. lib/decompress_*.c used malloc/free() wrappers around slab
    APIs requiring slab.h to be added manually.

    5. The script was run on all .h files but without automatically
    editing them as sprinkling gfp.h and slab.h inclusions around .h
    files could easily lead to inclusion dependency hell. Most gfp.h
    inclusion directives were ignored as stuff from gfp.h was usually
    widely available and often used in preprocessor macros. Each
    slab.h inclusion directive was examined and added manually as
    necessary.

    6. percpu.h was updated not to include slab.h.

    7. Build tests were done on the following configurations and failures
    were fixed. CONFIG_GCOV_KERNEL was turned off for all tests (as my
    distributed build env didn't work with gcov compiles) and a few
    more options had to be turned off depending on archs to make things
    build (like ipr on powerpc/64 which failed due to missing writeq).

    * x86 and x86_64 UP and SMP allmodconfig and a custom test config.
    * powerpc and powerpc64 SMP allmodconfig
    * sparc and sparc64 SMP allmodconfig
    * ia64 SMP allmodconfig
    * s390 SMP allmodconfig
    * alpha SMP allmodconfig
    * um on x86_64 SMP allmodconfig

    8. percpu.h modifications were reverted so that it could be applied as
    a separate patch and serve as a bisection point.

    Given the fact that I had only a couple of failures from tests on step
    6, I'm fairly confident about the coverage of this conversion patch.
    If there is a breakage, it's likely to be something in one of the arch
    headers, which should be easily discoverable on most builds of
    the specific arch.

    Signed-off-by: Tejun Heo <tj@kernel.org>
    Guess-its-ok-by: Christoph Lameter <cl@linux-foundation.org>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Lee Schermerhorn <Lee.Schermerhorn@hp.com>

    Tejun Heo
     

07 Mar, 2010

2 commits

  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.
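
    The linking goes through a small chain object, roughly like this (a
    sketch; field names follow the mainline anon_vma_chain, simplified):

    struct anon_vma_chain {
        struct vm_area_struct *vma;     /* the VMA this link belongs to */
        struct anon_vma *anon_vma;      /* the anon_vma the VMA is linked into */
        struct list_head same_vma;      /* all anon_vmas of this VMA */
        struct list_head same_anon_vma; /* all VMAs sharing this anon_vma */
    };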

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Make sure the compiler won't do weird things with limits, e.g. fetching
    them twice may return two different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

11 Dec, 2009

7 commits


24 Sep, 2009

1 commit

  • Introduce new truncate helpers truncate_pagecache and inode_newsize_ok.
    vmtruncate is also consolidated from mm/memory.c and mm/nommu.c into
    mm/truncate.c.

    Reviewed-by: Christoph Hellwig
    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    npiggin@suse.de
     

22 Sep, 2009

2 commits

  • mremap move's use of ksm_madvise() was assuming -ENOMEM on failure,
    because ksm_madvise used to say -EAGAIN for that; but ksm_madvise now says
    -ENOMEM (letting madvise convert that to -EAGAIN), and can also say
    -ERESTARTSYS when signalled: so pass the error from ksm_madvise.
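
    At the move_vma() call site that amounts to roughly (a hedged sketch, not
    the exact hunk):

    err = ksm_madvise(vma, old_addr, old_addr + old_len,
                      MADV_UNMERGEABLE, &vm_flags);
    if (err)
        return err;     /* previously: any failure became -ENOMEM */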

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM's scan allows for user pages to be COWed or unmapped at any time,
    without requiring any notification. But its stable tree does assume that
    when it finds a KSM page where it placed a KSM page, then it is the same
    KSM page that it placed there.

    mremap move could break that assumption: if an area containing a KSM page
    was unmapped, then an area containing a different KSM page was moved with
    mremap into the place of the original, before KSM's scan came around to
    notice. That could then poison a node of the stable tree, so that memcmps
    would "lie" and upset the ordering of the tree.

    Probably no one will ever need mremap move on a VM_MERGEABLE area; except
    that prohibiting it would make trouble for schemes in which we try making
    everything VM_MERGEABLE e.g. for testing: an mremap which normally works
    would then fail mysteriously.

    There's no need to go to any trouble, such as re-sorting KSM's list of
    rmap_items to match the new layout: simply unmerge the area to COW all its
    KSM pages before moving, but leave VM_MERGEABLE on so that they're
    remerged later.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Chris Wright
    Signed-off-by: Izik Eidus
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Wu Fengguang
    Cc: Balbir Singh
    Cc: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Lee Schermerhorn
    Cc: Avi Kivity
    Cc: Nick Piggin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

14 Jan, 2009

2 commits


06 Jan, 2009

1 commit


20 Oct, 2008

1 commit

  • Originally by Nick Piggin

    Remove mlocked pages from the LRU using "unevictable infrastructure"
    during mmap(), munmap(), mremap() and truncate(). Try to move back to
    normal LRU lists on munmap() when last mlocked mapping removed. Remove
    PageMlocked() status when page truncated from file.

    [akpm@linux-foundation.org: cleanup]
    [kamezawa.hiroyu@jp.fujitsu.com: fix double unlock_page()]
    [kosaki.motohiro@jp.fujitsu.com: split LRU: munlock rework]
    [lee.schermerhorn@hp.com: mlock: fix __mlock_vma_pages_range comment block]
    [akpm@linux-foundation.org: remove bogus kerneldoc token]
    Signed-off-by: Nick Piggin
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Rik van Riel
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

29 Jul, 2008

1 commit

  • With KVM/GRU/XPMEM there isn't just the primary CPU MMU pointing to pages.
    There are secondary MMUs (with secondary sptes and secondary tlbs) too.
    sptes in the kvm case are shadow pagetables, but when I say spte in
    mmu-notifier context, I mean "secondary pte". In GRU case there's no
    actual secondary pte and there's only a secondary tlb because the GRU
    secondary MMU has no knowledge about sptes and every secondary tlb miss
    event in the MMU always generates a page fault that has to be resolved by
    the CPU (this is not the case for KVM, where a secondary tlb miss will
    walk sptes in hardware and it will refill the secondary tlb transparently
    to software if the corresponding spte is present). The same way
    zap_page_range has to invalidate the pte before freeing the page, the spte
    (and secondary tlb) must also be invalidated before any page is freed and
    reused.

    Currently we take a page_count pin on every page mapped by sptes, but that
    means the pages can't be swapped whenever they're mapped by any spte
    because they're part of the guest working set. Furthermore a spte unmap
    event can immediately lead to a page to be freed when the pin is released
    (so requiring the same complex and relatively slow tlb_gather smp safe
    logic we have in zap_page_range and that can be avoided completely if the
    spte unmap event doesn't require an unpin of the page previously mapped in
    the secondary MMU).

    The mmu notifiers allow kvm/GRU/XPMEM to attach to the tsk->mm and know
    when the VM is swapping or freeing or doing anything on the primary MMU so
    that the secondary MMU code can drop sptes before the pages are freed,
    avoiding all page pinning and allowing 100% reliable swapping of guest
    physical address space. Furthermore it avoids the code that teardown the
    mappings of the secondary MMU, to implement a logic like tlb_gather in
    zap_page_range that would require many IPI to flush other cpu tlbs, for
    each fixed number of spte unmapped.

    To make an example: if what happens on the primary MMU is a protection
    downgrade (from writeable to wrprotect) the secondary MMU mappings will be
    invalidated, and the next secondary-mmu-page-fault will call
    get_user_pages and trigger a do_wp_page through get_user_pages if it
    called get_user_pages with write=1, and it'll re-establish an updated
    spte or secondary-tlb-mapping on the copied page. Or it will set up a
    readonly spte or readonly tlb mapping if it's a guest-read, if it calls
    get_user_pages with write=0. This is just an example.

    This allows mapping any page pointed to by any pte (and in turn visible in
    the primary CPU MMU) into a secondary MMU (be it a pure tlb like GRU, or a
    full MMU with both sptes and secondary-tlb like the shadow-pagetable layer
    with kvm), or a remote DMA in software like XPMEM (hence needing of
    schedule in XPMEM code to send the invalidate to the remote node, while no
    need to schedule in kvm/gru as it's an immediate event like invalidating
    primary-mmu pte).

    At least for KVM without this patch it's impossible to swap guests
    reliably. And having this feature and removing the page pin allows
    several other optimizations that simplify life considerably.

    Dependencies:

    1) mm_take_all_locks() to register the mmu notifier when the whole VM
    isn't doing anything with "mm". This allows mmu notifier users to keep
    track if the VM is in the middle of the invalidate_range_begin/end
    critical section with an atomic counter increase in range_begin and
    decreased in range_end. No secondary MMU page fault is allowed to map
    any spte or secondary tlb reference, while the VM is in the middle of
    range_begin/end as any page returned by get_user_pages in that critical
    section could later immediately be freed without any further
    ->invalidate_page notification (invalidate_range_begin/end works on
    ranges and ->invalidate_page isn't called immediately before freeing
    the page). To stop all page freeing and pagetable overwrites the
    mmap_sem must be taken in write mode and all other anon_vma/i_mmap
    locks must be taken too.

    2) It'd be a waste to add branches in the VM if nobody could possibly
    run KVM/GRU/XPMEM on the kernel, so mmu notifiers will only be enabled if
    CONFIG_KVM=m/y. In the current kernel kvm won't yet take advantage of
    mmu notifiers, but this already allows compiling a KVM external module
    against a kernel with mmu notifiers enabled and from the next pull from
    kvm.git we'll start using them. And GRU/XPMEM will also be able to
    continue the development by enabling KVM=m in their config, until they
    submit all GRU/XPMEM GPLv2 code to the mainline kernel. Then they can
    also enable MMU_NOTIFIERS in the same way KVM does it (even if KVM=n).
    This guarantees nobody selects MMU_NOTIFIER=y if KVM and GRU and XPMEM
    are all =n.

    The mmu_notifier_register call can fail because mm_take_all_locks may be
    interrupted by a signal and return -EINTR. Because mmu_notifier_register
    is used when a driver starts up, a failure can be gracefully handled. Here
    is an example of the change applied to kvm to register the mmu notifiers.
    Usually when a driver starts up other allocations are required anyway and
    -ENOMEM failure paths exist already.

    struct kvm *kvm_arch_create_vm(void)
    {
            struct kvm *kvm = kzalloc(sizeof(struct kvm), GFP_KERNEL);
    +       int err;

            if (!kvm)
                    return ERR_PTR(-ENOMEM);

            INIT_LIST_HEAD(&kvm->arch.active_mmu_pages);

    +       kvm->arch.mmu_notifier.ops = &kvm_mmu_notifier_ops;
    +       err = mmu_notifier_register(&kvm->arch.mmu_notifier, current->mm);
    +       if (err) {
    +               kfree(kvm);
    +               return ERR_PTR(err);
    +       }
    +
            return kvm;
    }

    mmu_notifier_unregister returns void and it's reliable.

    The patch also adds a few needed but missing includes that would prevent
    the kernel from compiling after these changes on non-x86 archs (x86 didn't
    need them by luck).

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix mm/filemap_xip.c build]
    [akpm@linux-foundation.org: fix mm/mmu_notifier.c build]
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Nick Piggin
    Signed-off-by: Christoph Lameter
    Cc: Jack Steiner
    Cc: Robin Holt
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Cc: Kanoj Sarcar
    Cc: Roland Dreier
    Cc: Steve Wise
    Cc: Avi Kivity
    Cc: Hugh Dickins
    Cc: Rusty Russell
    Cc: Anthony Liguori
    Cc: Chris Wright
    Cc: Marcelo Tosatti
    Cc: Eric Dumazet
    Cc: "Paul E. McKenney"
    Cc: Izik Eidus
    Cc: Anthony Liguori
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

19 Oct, 2007

1 commit

  • Get rid of sparse related warnings from places that use integer as NULL
    pointer.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Stephen Hemminger
    Cc: Andi Kleen
    Cc: Jeff Garzik
    Cc: Matt Mackall
    Cc: Ian Kent
    Cc: Arnd Bergmann
    Cc: Davide Libenzi
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stephen Hemminger
     

20 Jul, 2007

1 commit

  • Remove the arg+env limit of MAX_ARG_PAGES by copying the strings directly from
    the old mm into the new mm.

    We create the new mm before the binfmt code runs, and place the new stack at
    the very top of the address space. Once the binfmt code runs and figures out
    where the stack should be, we move it downwards.

    It is a bit peculiar in that we have one task with two mm's, one of which is
    inactive.

    [a.p.zijlstra@chello.nl: limit stack size]
    Signed-off-by: Ollie Wild
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Hugh Dickins
    [bunk@stusta.de: unexport bprm_mm_init]
    Signed-off-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ollie Wild
     

12 Jul, 2007

1 commit

  • Add a new security check on mmap operations to see if the user is attempting
    to mmap to low area of the address space. The amount of space protected is
    indicated by the new proc tunable /proc/sys/vm/mmap_min_addr and defaults to
    0, preserving existing behavior.

    This patch uses a new SELinux security class "memprotect." Policy already
    contains a number of allow rules like a_t self:process * (unconfined_t being
    one of them) which mean that putting this check in the process class (its
    best current fit) would make it useless as all user processes, which we also
    want to protect against, would be allowed. By taking the memprotect name of
    the new class it will also make it possible for us to move some of the other
    memory protect permissions out of 'process' and into the new class next time
    we bump the policy version number (which I also think is a good future
    idea).

    Acked-by: Stephen Smalley
    Acked-by: Chris Wright
    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

31 Jan, 2007

1 commit

  • Nick Piggin points out that page accounting on MIPS multiple ZERO_PAGEs
    is not maintained by its move_pte, and could lead to freeing a ZERO_PAGE.

    Instead of complicating that move_pte, just forget the minor optimization
    when mremapping, and change the one thing which needed it for correctness:
    have filemap_xip use ZERO_PAGE(0) throughout instead of according to
    address.

    [ "There is no block device driver one could use for XIP on mips
    platforms" - Carsten Otte ]

    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Andrew Morton
    Cc: Ralf Baechle
    Cc: Carsten Otte
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

01 Oct, 2006

1 commit

  • Implement lazy MMU update hooks which are SMP safe for both direct and shadow
    page tables. The idea is that PTE updates and page invalidations while in
    lazy mode can be batched into a single hypercall. We use this in VMI for
    shadow page table synchronization, and it is a win. It also can be used by
    PPC and for direct page tables on Xen.

    For SMP, the enter / leave must happen under protection of the page table
    locks for page tables which are being modified. This is because otherwise,
    you end up with stale state in the batched hypercall, which other CPUs can
    race ahead of. Doing this under the protection of the locks guarantees the
    synchronization is correct, and also means that spurious faults which are
    generated during this window by remote CPUs are properly handled, as the page
    fault handler must re-check the PTE under protection of the same lock.
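
    A hedged sketch of the usage pattern (illustrative; the point is that the
    enter/leave pair nests inside the page table lock, so remote CPUs cannot
    race ahead of still-queued updates):

    pte_t *pte, ptent;
    spinlock_t *ptl;

    pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
    arch_enter_lazy_mmu_mode();             /* PTE updates may now be batched */
    for (; addr < end; pte++, addr += PAGE_SIZE) {
        ptent = ptep_get_and_clear(mm, addr, pte);
        /* ... */
    }
    arch_leave_lazy_mmu_mode();             /* flush the batch in one hypercall */
    pte_unmap_unlock(pte - 1, ptl);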

    Signed-off-by: Zachary Amsden
    Signed-off-by: Jeremy Fitzhardinge
    Cc: Rusty Russell
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zachary Amsden
     

04 Jul, 2006

1 commit

  • Teach special (recursive) locking code to the lock validator. Has no effect
    on non-lockdep kernels.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ingo Molnar
     

12 Jan, 2006

1 commit

  • - Move capable() from sched.h to capability.h;

    - Use <linux/capability.h> where capable() is used
    (in include/, block/, ipc/, kernel/, a few drivers/,
    mm/, security/, & sound/;
    many more drivers/ to go)

    Signed-off-by: Randy Dunlap
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy.Dunlap
     

17 Dec, 2005

1 commit


30 Oct, 2005

3 commits

  • Christoph Lameter demonstrated very poor scalability on the SGI 512-way, with
    a many-threaded application which concurrently initializes different parts of
    a large anonymous area.

    This patch corrects that, by using a separate spinlock per page table page, to
    guard the page table entries in that page, instead of using the mm's single
    page_table_lock. (But even then, page_table_lock is still used to guard page
    table allocation, and anon_vma allocation.)

    In this implementation, the spinlock is tucked inside the struct page of the
    page table page: with a BUILD_BUG_ON in case it overflows - which it would in
    the case of 32-bit PA-RISC with spinlock debugging enabled.

    Splitting the lock is not quite for free: another cacheline access. Ideally,
    I suppose we would use split ptlock only for multi-threaded processes on
    multi-cpu machines; but deciding that dynamically would have its own costs.
    So for now enable it by config, at some number of cpus - since the Kconfig
    language doesn't support inequalities, let preprocessor compare that with
    NR_CPUS. But I don't think it's worth being user-configurable: for good
    testing of both split and unsplit configs, split now at 4 cpus, and perhaps
    change that to 8 later.
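
    A hedged sketch of that config-time choice (not the exact mainline macros;
    whether the lock is reachable as page->ptl like this is an assumption of
    the sketch):

    #if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
    /* Split: the guard lock lives in the struct page of the page table page. */
    #define pte_lockptr(mm, pmd)    (&pmd_page(*(pmd))->ptl)
    #else
    /* Unsplit: fall back to the single mm-wide page_table_lock. */
    #define pte_lockptr(mm, pmd)    (&(mm)->page_table_lock)
    #endif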

    There is a benefit even for singly threaded processes: kswapd can be attacking
    one part of the mm while another part is busy faulting.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • Second step in pushing down the page_table_lock. Remove the temporary
    bridging hack from __pud_alloc, __pmd_alloc, __pte_alloc: expect callers not
    to hold page_table_lock, whether it's on init_mm or a user mm; take
    page_table_lock internally to check if a racing task already allocated.

    Convert their callers from common code. But avoid coming back to change them
    again later: instead of moving the spin_lock(&mm->page_table_lock) down,
    switch over to new macros pte_alloc_map_lock and pte_unmap_unlock, which
    encapsulate the mapping+locking and unlocking+unmapping together, and in the
    end may use alternatives to the mm page_table_lock itself.
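
    Typical usage of the new pair looks roughly like this (a sketch, not a
    specific call site):

    pte_t *pte;
    spinlock_t *ptl;

    /* Allocate the page table if needed, map it and take its lock. */
    pte = pte_alloc_map_lock(mm, pmd, addr, &ptl);
    if (!pte)
        return -ENOMEM;
    /* ... examine or modify *pte ... */
    pte_unmap_unlock(pte, ptl);     /* drop the lock and the kmap together */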

    These callers all hold mmap_sem (some exclusively, some not), so at no level
    can a page table be whipped away from beneath them; and pte_alloc uses the
    "atomic" pmd_present to test whether it needs to allocate. It appears that on
    all arches we can safely descend without page_table_lock.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • It seems odd to me that, whereas pud_alloc and pmd_alloc test inline, only
    calling out-of-line __pud_alloc __pmd_alloc if allocation needed,
    pte_alloc_map and pte_alloc_kernel are entirely out-of-line. Though it does
    add a little to kernel size, change them to macros testing inline, calling
    __pte_alloc or __pte_alloc_kernel to allocate out-of-line. Mark none of them
    as fastcalls, leave that to CONFIG_REGPARM or not.

    It also seems more natural for the out-of-line functions to leave the offset
    calculation and map to the inline, which has to do it anyway for the common
    case. At least mremap move wants __pte_alloc without _map.

    Macros rather than inline functions, certainly to avoid the header file issues
    which arise from CONFIG_HIGHPTE needing kmap_types.h, but also in case any
    architectures I haven't built would have other such problems.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins