11 Jan, 2012

2 commits

  • commit 297c5eee37 ("mm: make the vma list be doubly linked") added the
    vm_prev member to vm_area_struct. We can simplify find_vma_prev() by
    using it. This also helps page fault performance, because following
    vm_prev has stronger locality of reference than a second tree lookup.
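
    For illustration, the simplified lookup could look like this sketch
    (assuming the find_vma()/vm_prev API of this era; not necessarily the
    exact patch):

    ===

    struct vm_area_struct *
    find_vma_prev(struct mm_struct *mm, unsigned long addr,
                  struct vm_area_struct **pprev)
    {
            struct vm_area_struct *vma;

            vma = find_vma(mm, addr);
            /* with the doubly linked list, prev is one pointer away */
            *pprev = vma ? vma->vm_prev : NULL;
            return vma;
    }

    ===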

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Peter Zijlstra
    Cc: Shaohua Li
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • migrate was doing an rmap_walk with speculative lock-less access on
    pagetables. That could lead it to not serialize properly against mremap
    PT locks. But a second problem remains in the order of vmas in the
    same_anon_vma list used by the rmap_walk.

    If vma_merge succeeds in copy_vma, the src vma could be placed after the
    dst vma in the same_anon_vma list. That could still lead to migrate
    missing some ptes.

    This patch adds an anon_vma_moveto_tail() function that forces the dst
    vma to the end of the list before mremap starts, to solve the problem.

    If the mremap is very large and there are lots of parents or children
    sharing the anon_vma root lock, this should still scale better than
    taking the anon_vma root lock around every pte copy for practically the
    whole duration of mremap.

    Update: Hugh noticed that special care is needed in the error path,
    where move_page_tables runs in the reverse direction; a second
    anon_vma_moveto_tail() call is needed there.

    This program exercises the anon_vma_moveto_tail path:

    ===

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>

    /* SIZE was left undefined in the original snippet; any multiple of
       4M keeps both halves 2M-aligned (an assumption, not from the patch) */
    #define SIZE (8UL*1024*1024)

    int main(void)
    {
            char *p, *p2, *p3, *p4;

            if (posix_memalign((void **)&p, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p2, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);
            if (posix_memalign((void **)&p3, 2*1024*1024, SIZE))
                    perror("memalign"), exit(1);

            memset(p, 0xff, SIZE);
            printf("%p\n", p);
            memset(p2, 0xff, SIZE);
            memset(p3, 0x77, 4096);
            if (memcmp(p, p2, SIZE))
                    printf("error\n");

            /* move the second half of p on top of p3, then back again */
            p4 = mremap(p+SIZE/2, SIZE/2, SIZE/2,
                        MREMAP_FIXED|MREMAP_MAYMOVE, p3);
            if (p4 != p3)
                    perror("mremap"), exit(1);
            p4 = mremap(p4, SIZE/2, SIZE/2,
                        MREMAP_FIXED|MREMAP_MAYMOVE, p+SIZE/2);
            if (p4 != p+SIZE/2)
                    perror("mremap"), exit(1);

            if (memcmp(p, p2, SIZE))
                    printf("error\n");
            printf("ok\n");

            return 0;
    }
    ===

    $ perf probe -a anon_vma_moveto_tail
    Add new event:
    probe:anon_vma_moveto_tail (on anon_vma_moveto_tail)

    You can now use it on all perf tools, such as:

    perf record -e probe:anon_vma_moveto_tail -aR sleep 1

    $ perf record -e probe:anon_vma_moveto_tail -aR ./anon_vma_moveto_tail
    0x7f2ca2800000
    ok
    [ perf record: Woken up 1 times to write data ]
    [ perf record: Captured and wrote 0.043 MB perf.data (~1860 samples) ]
    $ perf report --stdio
    100.00% anon_vma_moveto [kernel.kallsyms] [k] anon_vma_moveto_tail

    Signed-off-by: Andrea Arcangeli
    Reported-by: Nai Xia
    Acked-by: Mel Gorman
    Cc: Hugh Dickins
    Cc: Pawel Sikora
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

07 Nov, 2011

1 commit

  • * 'modsplit-Oct31_2011' of git://git.kernel.org/pub/scm/linux/kernel/git/paulg/linux: (230 commits)
    Revert "tracing: Include module.h in define_trace.h"
    irq: don't put module.h into irq.h for tracking irqgen modules.
    bluetooth: macroize two small inlines to avoid module.h
    ip_vs.h: fix implicit use of module_get/module_put from module.h
    nf_conntrack.h: fix up fallout from implicit moduleparam.h presence
    include: replace linux/module.h with "struct module" wherever possible
    include: convert various register fcns to macros to avoid include chaining
    crypto.h: remove unused crypto_tfm_alg_modname() inline
    uwb.h: fix implicit use of asm/page.h for PAGE_SIZE
    pm_runtime.h: explicitly requires notifier.h
    linux/dmaengine.h: fix implicit use of bitmap.h and asm/page.h
    miscdevice.h: fix up implicit use of lists and types
    stop_machine.h: fix implicit use of smp.h for smp_processor_id
    of: fix implicit use of errno.h in include/linux/of.h
    of_platform.h: delete needless include
    acpi: remove module.h include from platform/aclinux.h
    miscdevice.h: delete unnecessary inclusion of module.h
    device_cgroup.h: delete needless include
    net: sch_generic remove redundant use of
    net: inet_timewait_sock doesnt need
    ...

    Fix up trivial conflicts (other header files, and removal of the ab3550 mfd driver) in
    - drivers/media/dvb/frontends/dibx000_common.c
    - drivers/media/video/{mt9m111.c,ov6650.c}
    - drivers/mfd/ab3550-core.c
    - include/linux/dmaengine.h

    Linus Torvalds
     


26 Jul, 2011

1 commit

  • - shmem pages are not immediately available, but they are not
    potentially available either: even if we swap them out, they just
    relocate from memory into swap, and the total amount of immediately and
    potentially available memory is not affected. So we shouldn't count
    them as potentially free in the first place.

    - nr_free_pages() is not an expensive operation anymore; there is no
    need to split the decision making in two halves and repeat code.
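
    A hedged sketch of the resulting accounting in the OVERCOMMIT_GUESS
    path of __vm_enough_memory() (counter names as remembered from this
    era, may differ):

    ===

    free  = global_page_state(NR_FREE_PAGES);   /* cheap to read now */
    free += global_page_state(NR_FILE_PAGES);
    /* swapping shmem out frees nothing, it only relocates the pages */
    free -= global_page_state(NR_SHMEM);
    free += nr_swap_pages;                      /* free swap slots count */

    ===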

    Signed-off-by: Dmitry Fink
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Fink
     

16 Jun, 2011

1 commit

  • We have some users of this function that date back to before the vma
    list was doubly linked, and just are silly. These days, you can find
    the previous vma by just following the vma->vm_prev pointer.

    In some cases you don't need any find_vma() lookup at all, and in other
    cases you're better off with the regular "find_vma()" that uses the vma
    cache front-end lookup.

    Some "find_vma_prev()" users are still valid, though. For example, in
    the case of a stack that grows up, it can be the case that we don't find
    any 'vma' at all (because we're looking up an address that is past the
    last vma), and that the stack that we want to grow is the 'prev' vma.

    But that kind of special case aside, we generally should prefer to use
    'find_vma()'.

    Noticed due to a totally unrelated POWER memory corruption bug that just
    happened to hit in 'find_vma_prev()' and made me go "Hmm - why are we
    using that function here?".
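
    The typical conversion in such callers is trivial; a sketch:

    ===

    /* before: double lookup for no reason */
    vma = find_vma_prev(mm, addr, &prev);

    /* after: regular cached lookup, then follow the back pointer */
    vma = find_vma(mm, addr);
    if (vma)
            prev = vma->vm_prev;

    ===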

    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

27 May, 2011

1 commit

  • The type of vma->vm_flags is 'unsigned long', neither 'int' nor
    'unsigned int'. This patch fixes such misuse.
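
    The typedef mentioned in the bracketed note below is, in sketch form
    (placement in mm_types.h is my assumption):

    ===

    /* may be extended to a 64-bit type later */
    typedef unsigned long vm_flags_t;

    ===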

    Signed-off-by: KOSAKI Motohiro
    [ Changed to use a typedef - we'll extend it to cover more cases
    later, since there has been discussion about making it a 64-bit
    type.. - Linus ]
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

25 May, 2011

9 commits

  • Straightforward conversion of anon_vma->lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Reviewed-by: KOSAKI Motohiro
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Straightforward conversion of i_mmap_lock to a mutex.

    Signed-off-by: Peter Zijlstra
    Acked-by: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Hugh says:
    "The only significant loser, I think, would be page reclaim (when
    concurrent with truncation): could spin for a long time waiting for
    the i_mmap_mutex it expects would soon be dropped?"

    Counterpoints:
    - cpu contention makes the spin stop (need_resched())
    - zap pages should be freeing pages at a higher rate than reclaim
    ever can

    I think the simplification of the truncate code is definitely worth it.

    Effectively reverts: 2aa15890f3c ("mm: prevent concurrent
    unmap_mapping_range() on the same inode") and takes out the code that
    caused its problem.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purpose I've split them into generic and per-arch patches with the last of
    those a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the followup
    is a patch converting s390 to use this. I've also got 4 patches from
    DaveM lined up (not included in this series) that uses this to implement
    gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches; these make parts of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock
    to mutexes, which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try and allocate a page for batching and in case of failure, use a small
    on-stack array to make some progress.

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the tlb batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion as it significantly
    reduces pte lock hold times.
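
    A sketch of the allocate-or-fall-back batching described above (close
    to, but not necessarily identical with, the final code):

    ===

    static int tlb_next_batch(struct mmu_gather *tlb)
    {
            struct mmu_gather_batch *batch;

            batch = tlb->active;
            if (batch->next) {
                    tlb->active = batch->next;
                    return 1;
            }

            /* GFP_NOWAIT: we may be under the pte lock, never sleep */
            batch = (void *)__get_free_pages(GFP_NOWAIT | __GFP_NOWARN, 0);
            if (!batch)
                    return 0;  /* caller falls back to the on-stack array */

            batch->next = NULL;
            batch->nr   = 0;
            batch->max  = MAX_GATHER_BATCH;

            tlb->active->next = batch;
            tlb->active = batch;
            return 1;
    }

    ===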

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and making those names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault which uses expand_upwards for registers backing store
    expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process initialization phase for growsup
    configuration.
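
    After the cleanup both directions share a symmetric interface; in
    sketch form:

    ===

    int expand_upwards(struct vm_area_struct *vma, unsigned long address);
    int expand_downwards(struct vm_area_struct *vma, unsigned long address);

    ===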

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • When I was reading the nommu code, I found that it handles the vma
    list/tree in an unusual way. IIUC, because there can be more than one
    identical/overlapping vma in the list/tree, it sorts the tree more
    strictly and does a linear search on the tree. But this is not applied
    to the list (i.e. the list can be constructed in a different order than
    the tree, so we can't use the list when finding the first vma in that
    order).

    Since inserting/sorting a vma in the tree and the list is done at the
    same time, we can easily construct both of them in the same order. And
    since a linear search on the tree can be more costly than doing it on
    the list, the search can be converted to use the list.

    Also, after commit 297c5eee3724 ("mm: make the vma list be doubly
    linked") made the list doubly linked, a couple of places needed to be
    fixed to construct the list properly.

    Patch 1/6 is a preparation. It keeps the list sorted in the same order
    as the tree and constructs the doubly-linked list properly. Patch 2/6
    is a simple optimization for vma deletion. Patches 3/6 and 4/6 convert
    tree traversal to list traversal, and the rest are simple fixes and
    cleanups.

    This patch:

    @vma added into @mm should be sorted by start addr, end addr and VMA
    struct addr in that order because we may get identical VMAs in the @mm.
    However this was true only for the rbtree, not for the list.

    This patch fixes this by remembering 'rb_prev' during the tree traversal
    like find_vma_prepare() does and linking the @vma via __vma_link_list().
    After this patch, we can iterate the whole VMAs in correct order simply by
    using @mm->mmap list.
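
    A sketch of the idea, simplified to a vm_start comparison (the real
    tie-breaking also compares vm_end and the VMA address), using the
    __vma_link_list() helper mentioned in the note below:

    ===

    /* walk the rbtree, remembering the last node left of the new vma */
    struct rb_node **p = &mm->mm_rb.rb_node, *parent = NULL, *rb_prev = NULL;
    struct vm_area_struct *pvma, *prev = NULL;

    while (*p) {
            parent = *p;
            pvma = rb_entry(parent, struct vm_area_struct, vm_rb);
            if (vma->vm_start < pvma->vm_start)
                    p = &(*p)->rb_left;
            else {
                    rb_prev = parent;       /* candidate list predecessor */
                    p = &(*p)->rb_right;
            }
    }
    rb_link_node(&vma->vm_rb, parent, p);
    rb_insert_color(&vma->vm_rb, &mm->mm_rb);

    if (rb_prev)
            prev = rb_entry(rb_prev, struct vm_area_struct, vm_rb);
    __vma_link_list(mm, vma, prev, parent);

    ===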

    [akpm@linux-foundation.org: avoid duplicating __vma_link_list()]
    Signed-off-by: Namhyung Kim
    Acked-by: Greg Ungerer
    Cc: David Howells
    Cc: Paul Mundt
    Cc: Geert Uytterhoeven
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     
  • Avoid merging a VMA with another VMA which is cloned from the parent process.

    The cloned VMA shares the anon_vma lock with the parent process's VMA.
    If we do the merge, more VMAs (even when the new range belongs only to
    the current process) use the parent process's anon_vma lock. This
    introduces scalability issues. find_mergeable_anon_vma() already
    considers this case.
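
    A sketch of the check, modelled from memory on this era's
    is_mergeable_anon_vma() (details may differ):

    ===

    /* NULL anon_vmas may merge, but only if the existing one isn't
     * shared with a parent, i.e. the anon_vma_chain is a single entry */
    if ((!anon_vma1 || !anon_vma2) &&
        (!vma || list_is_singular(&vma->anon_vma_chain)))
            return 1;
    return anon_vma1 == anon_vma2;

    ===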

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • If we only change vma->vm_end, we can avoid taking the anon_vma lock
    even if 'insert' isn't NULL, which is the case for split_vma.

    As I understand it, we needed the lock before because rmap must be able
    to find the 'insert' VMA while we adjust the old VMA's vm_end (the
    'insert' VMA used to be linked onto the anon_vma list in
    __insert_vm_struct).

    But now this isn't true any more. The 'insert' VMA is already linked
    onto the anon_vma list in __split_vma (with anon_vma_clone()) instead
    of __insert_vm_struct. There is no race in which rmap can't find the
    required VMAs. So the anon_vma lock is unnecessary; this removes one
    lock acquisition in the brk case and improves scalability.
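
    In sketch form, the lock is now only taken when rmap could actually
    observe an inconsistency (condition as remembered, may differ):

    ===

    /* changing only vma->vm_end needs no anon_vma lock */
    if (vma->anon_vma && (importer || start != vma->vm_start)) {
            anon_vma = vma->anon_vma;
            anon_vma_lock(anon_vma);
    }

    ===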

    Signed-off-by: Shaohua Li
    Cc: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     
  • Make some variables have correct alignment/section to avoid cache
    issues. In a workload which heavily does mmap/munmap, these variables
    are used frequently.
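
    The kind of annotation involved, as a hedged sketch (the exact set of
    variables is from memory):

    ===

    /* hot in mmap/munmap-heavy workloads; keep off false-shared lines */
    struct percpu_counter vm_committed_as ____cacheline_aligned_in_smp;
    int sysctl_overcommit_memory __read_mostly = OVERCOMMIT_GUESS;

    ===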

    Signed-off-by: Shaohua Li
    Cc: Andi Kleen
    Cc: Rik van Riel
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Shaohua Li
     

10 May, 2011

1 commit

  • Commit a626ca6a6564 ("vm: fix vm_pgoff wrap in stack expansion") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you had
    downward stack expansion. But there was another case where IA64 and
    PA-RISC expand mappings: upward expansion.

    This fixes that case too.
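
    A sketch of the guard in the upward-expansion path, mirroring the
    earlier downward fix (exact form from memory):

    ===

    /* in expand_upwards(): refuse growth that would wrap vm_pgoff */
    unsigned long size = address - vma->vm_start;
    unsigned long grow = (address - vma->vm_end) >> PAGE_SHIFT;

    error = -ENOMEM;
    if (vma->vm_pgoff + (size >> PAGE_SHIFT) >= vma->vm_pgoff)
            error = acct_stack_growth(vma, size, grow);

    ===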

    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

15 Apr, 2011

1 commit

  • 5520e89 ("brk: fix min_brk lower bound computation for COMPAT_BRK")
    tried to get the whole logic of brk randomization for legacy
    (libc5-based) applications finally right.

    It turns out that the way that patch introduced to detect whether brk
    has actually been randomized still doesn't work for those binaries, as
    reported by Geert:

    : /sbin/init from my old m68k ramdisk exits prematurely.
    :
    : Before the patch:
    :
    : | brk(0x80005c8e) = 0x80006000
    :
    : After the patch:
    :
    : | brk(0x80005c8e) = 0x80005c8e
    :
    : Old libc5 considers brk() to have failed if the return value is not
    : identical to the requested value.

    I don't like it, but currently see no better option than a bit flag in
    task_struct to catch the CONFIG_COMPAT_BRK && randomize_va_space == 2
    case.
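
    A sketch of the resulting check in sys_brk(), assuming the bit flag is
    the brk_randomized bit this patch adds to task_struct:

    ===

    #ifdef CONFIG_COMPAT_BRK
            /* only honor the legacy lower bound if brk wasn't randomized */
            if (current->brk_randomized)
                    min_brk = mm->start_brk;
            else
                    min_brk = mm->end_data;
    #else
            min_brk = mm->start_brk;
    #endif

    ===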

    Signed-off-by: Jiri Kosina
    Tested-by: Geert Uytterhoeven
    Reported-by: Geert Uytterhoeven
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     

13 Apr, 2011

1 commit

  • Commit 982134ba6261 ("mm: avoid wrapping vm_pgoff in mremap()") fixed
    the case of an expanding mapping causing vm_pgoff wrapping when you used
    mremap. But there was another case where we expand mappings hiding in
    plain sight: the automatic stack expansion.

    This fixes that case too.

    This one also found by Robert Święcki, using his nasty system call
    fuzzer tool. Good job.
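
    The guard is analogous to the mremap one; a sketch of the
    expand_downwards() check (from memory, not necessarily verbatim):

    ===

    /* growing down by 'grow' pages must not underflow vm_pgoff */
    size = vma->vm_end - address;
    grow = (vma->vm_start - address) >> PAGE_SHIFT;

    error = -ENOMEM;
    if (grow <= vma->vm_pgoff)
            error = acct_stack_growth(vma, size, grow);

    ===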

    Reported-and-tested-by: Robert Święcki
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

14 Jan, 2011

3 commits

  • Even if CONFIG_COMPAT_BRK is set in the kernel configuration, it can
    still be overridden by the randomize_va_space sysctl.

    If this is the case, the min_brk computation in sys_brk() implementation
    is wrong, as it solely takes into account COMPAT_BRK setting, assuming
    that brk start is not randomized. But that might not be the case if
    randomize_va_space sysctl has been set to '2' at the time the binary has
    been loaded from disk.

    In such a case, the check has to be done in the same way as in the
    !CONFIG_COMPAT_BRK case.

    In addition to that, the check for the COMPAT_BRK case introduced back in
    a5b4592c ("brk: make sys_brk() honor COMPAT_BRK when computing lower
    bound") is slightly wrong -- the lower bound shouldn't be mm->end_code,
    but mm->end_data instead, as that's where the legacy applications expect
    brk section to start (i.e. immediately after last global variable).
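
    In sketch form, the fixed computation (heuristic as remembered from
    this patch; may not be verbatim):

    ===

    #ifdef CONFIG_COMPAT_BRK
            /*
             * CONFIG_COMPAT_BRK can still be overridden by setting
             * randomize_va_space to 2, which shifts mm->start_brk
             */
            if (mm->start_brk > PAGE_ALIGN(mm->end_data))
                    min_brk = mm->start_brk;    /* brk was randomized */
            else
                    min_brk = mm->end_data;     /* legacy lower bound */
    #else
            min_brk = mm->start_brk;
    #endif

    ===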

    [akpm@linux-foundation.org: fix comment]
    Signed-off-by: Jiri Kosina
    Cc: Geert Uytterhoeven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • A huge pmd can only be mapped if the corresponding 2M virtual range is
    fully contained in the vma. At times the VM calls split_vma twice; if
    the first split_vma succeeds and the second fails, the first split_vma
    remains in effect and is not rolled back. For split_vma or vma_adjust
    to fail, an allocation failure is needed, so it's a very unlikely event
    (the out of memory killer would normally fire before any allocation
    failure is visible to kernel and userland, and if an out of memory
    condition happens it's unlikely to happen exactly here). Nevertheless
    it's safer to ensure that
    no huge pmd can be left around if the vma is adjusted in a way that can't
    fit hugepages anymore at the new vm_start/vm_end address.
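
    A sketch of the idea: split any huge pmd that would straddle the
    adjusted boundaries (names and checks from memory, likely not exact):

    ===

    static void vma_adjust_trans_huge(struct vm_area_struct *vma,
                                      unsigned long start, unsigned long end,
                                      long adjust_next)
    {
            /* a huge pmd can't survive a boundary that isn't 2M aligned */
            if (start & ~HPAGE_PMD_MASK &&
                (start & HPAGE_PMD_MASK) >= vma->vm_start &&
                (start & HPAGE_PMD_MASK) + HPAGE_PMD_SIZE <= vma->vm_end)
                    split_huge_page_address(vma->vm_mm, start);
            /* the same check is repeated for 'end' and the next vma */
    }

    ===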

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Register the vma in khugepaged if it grows.
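
    In sketch form, the growth paths gain a call like the following
    (assuming the khugepaged_enter_vma_merge() helper of this era):

    ===

    /* stack/brk grew: let khugepaged collapse the new range later */
    khugepaged_enter_vma_merge(vma);

    ===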

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to set up the
    vdso) skips the security check before insert_vm_struct, allowing a local
    attacker to bypass the mmap_min_addr security restriction by limiting
    the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this can
    be used to bypass any restrictions, I don't see any reason not to have
    the security check.
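
    The fix is to perform the same check done on the normal mmap path; a
    sketch, assuming the security_file_mmap() signature of this era:

    ===

    /* in install_special_mapping(), before insert_vm_struct() */
    ret = security_file_mmap(NULL, 0, 0, 0, vma->vm_start, 1);
    if (ret)
            goto out;

    ===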

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
     

30 Oct, 2010

1 commit

  • Normal syscall audit doesn't catch the 5th argument of a syscall. It
    also doesn't catch the contents of userland structures pointed to by
    syscall arguments, so for both the old and new mmap(2) ABIs it doesn't
    record the descriptor we are mapping. For the old one it also misses
    the flags.
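
    A sketch of the remedy, assuming an audit helper along the lines this
    series introduces:

    ===

    /* in sys_mmap_pgoff() and the old-ABI wrapper alike */
    audit_mmap_fd(fd, flags);   /* record fd and flags for the audit log */

    ===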

    Signed-off-by: Al Viro

    Al Viro
     

23 Sep, 2010

1 commit

  • If __split_vma fails because of an out of memory condition, the
    anon_vma_chain isn't torn down and freed, potentially leading to rmap
    walks accessing freed vma information; in addition there's a memleak.
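
    In sketch form, the error path gains the missing teardown (helper
    names as remembered):

    ===

    /* on failure, undo the anon_vma_chain clone before freeing 'new' */
    if (err) {
            unlink_anon_vmas(new);
            kmem_cache_free(vm_area_cachep, new);
    }

    ===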

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

25 Aug, 2010

1 commit

  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page().
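
    The trick, in sketch form (constant value from memory):

    ===

    #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
    #define VM_GROWSUP      0x00000200
    #else
    #define VM_GROWSUP      0x00000000      /* dead code folds away */
    #endif

    ===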

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead.
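
    The change itself is a one-field addition, in sketch form:

    ===

    struct vm_area_struct {
            /* ... other fields omitted ... */
            struct vm_area_struct *vm_next, *vm_prev; /* vm_prev is new */
    };

    ===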

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Aug, 2010

4 commits

  • There's no anon-vma related mangling happening inside __vma_link anymore,
    so there is no need for anon_vma locking there.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust.
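
    In sketch form, locking always goes to the root:

    ===

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->root->lock);  /* always the oldest one */
    }

    ===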

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.
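
    The inline introduced here is a direct wrapper; a sketch:

    ===

    static inline void anon_vma_lock(struct anon_vma *anon_vma)
    {
            spin_lock(&anon_vma->lock);
    }

    static inline void anon_vma_unlock(struct anon_vma *anon_vma)
    {
            spin_unlock(&anon_vma->lock);
    }

    ===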

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style
    used in page_lock_anon_vma and will come in really handy further down in
    this patch series.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

09 Jun, 2010

1 commit

  • Add the capability to track data mmap()s. This can be used together
    with PERF_SAMPLE_ADDR for data profiling.
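
    From the user side this is a new attribute bit; a sketch (field name
    as remembered from the stable ABI update):

    ===

    struct perf_event_attr attr = { 0 };

    attr.size        = sizeof(attr);
    attr.type        = PERF_TYPE_HARDWARE;
    attr.config      = PERF_COUNT_HW_CACHE_MISSES;
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR;
    attr.mmap_data   = 1;   /* also report non-exec (data) mmaps */

    ===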

    Signed-off-by: Anton Blanchard
    [Updated code for stable perf ABI]
    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric B Munson
     


13 Apr, 2010

2 commits

  • When we move the boundaries between two vma's due to things like
    mprotect, we need to make sure that the anon_vma of the pages that got
    moved from one vma to another gets properly copied around. And that was
    not always the case, in this rather hard-to-follow code sequence.

    Clarify the code, and fix it so that it copies the anon_vma from the
    right source.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "Yeah, not so much this one either" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This changes the anon_vma reuse case to require that we only reuse
    simple anon_vma's - ie the case when the vma only has a single anon_vma
    associated with it.

    This means that a reuse of an anon_vma from an adjacent vma will always
    guarantee that both vma's are associated not only with the same
    anon_vma, they will also have the same anon_vma chain (of just a single
    entry in this case).

    And since anon_vma re-use was the only case where the same anon_vma
    might be associated with different chains of anon_vma's, we now have the
    case that every vma that shares the same anon_vma will always also have
    the same chain. That makes it much easier to think about merging vma's
    that share the same anon_vma's: you can always just drop the other
    anon_vma chain in anon_vma_merge() since you know that they are always
    identical.

    This also splits up the function to validate the anon_vma re-use, and
    adds a lot of commentary about the possible races.
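
    A sketch of the validation after the split, modelled loosely on the
    code of this era (helper names are my reconstruction):

    ===

    static struct anon_vma *reusable_anon_vma(struct vm_area_struct *old,
                    struct vm_area_struct *a, struct vm_area_struct *b)
    {
            if (anon_vma_compatible(a, b)) {
                    struct anon_vma *anon_vma = old->anon_vma;

                    /* only reuse a "simple" anon_vma: one chain entry */
                    if (anon_vma && list_is_singular(&old->anon_vma_chain))
                            return anon_vma;
            }
            return NULL;
    }

    ===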

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "That didn't fix it" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Mar, 2010

1 commit

  • Add a generic implementation of the old mmap() syscall, which expects its
    argument in a memory block, and switch all architectures over to use it.
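
    A sketch of the generic implementation (close to the actual helper,
    from memory):

    ===

    struct mmap_arg_struct {
            unsigned long addr;
            unsigned long len;
            unsigned long prot;
            unsigned long flags;
            unsigned long fd;
            unsigned long offset;
    };

    SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
    {
            struct mmap_arg_struct a;

            if (copy_from_user(&a, arg, sizeof(a)))
                    return -EFAULT;
            if (a.offset & ~PAGE_MASK)
                    return -EINVAL;

            return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
                                  a.offset >> PAGE_SHIFT);
    }

    ===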

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: Greg Ungerer
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
     

07 Mar, 2010

3 commits

  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.
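
    The new linking object, in sketch form (matching the description
    above; list names from memory):

    ===

    struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;      /* all anon_vmas of one vma */
            struct list_head same_anon_vma; /* all vmas of one anon_vma */
    };

    ===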

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Make sure the compiler won't do weird things with limits. E.g. fetching
    them twice may return 2 different values after writable limits are
    implemented.

    I.e. either use rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby