16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to set up the
    vdso) skips the security check before insert_vm_struct, allowing a
    local attacker to bypass the mmap_min_addr security restriction by
    limiting the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this
    can be used to bypass any restrictions, I don't see any reason not to
    have the security check. (The missing check is sketched after this
    entry.)

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
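
    A minimal sketch of the shape of the fix, assuming the 2.6.36-era
    security_file_mmap() hook (the merged patch's exact call site and
    error handling may differ; the function name below is illustrative):

        /* mm/mmap.c: sketch of install_special_mapping() after the fix */
        static int install_special_mapping_sketch(struct mm_struct *mm,
                                                  struct vm_area_struct *vma)
        {
            int ret;

            /* The missing piece: consult the mmap_min_addr/LSM policy
             * before inserting the vma, just as do_mmap_pgoff() does,
             * and don't drop the error code. */
            ret = security_file_mmap(NULL, 0, 0, 0, vma->vm_start, 1);
            if (ret)
                return ret;

            return insert_vm_struct(mm, vma);
        }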
     

30 Oct, 2010

1 commit

  • The normal syscall audit doesn't catch the 5th argument of a syscall.
    It also doesn't catch the contents of userland structures pointed to
    by a syscall argument, so for both the old and new mmap(2) ABIs it
    doesn't record the descriptor we are mapping. For the old ABI it also
    misses the flags. (See the sketch after this entry.)

    Signed-off-by: Al Viro

    Al Viro
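
    A sketch of what this implies at the top of the common mmap path,
    assuming an audit_mmap_fd() helper added for this purpose (treat the
    exact placement as illustrative):

        SYSCALL_DEFINE6(mmap_pgoff, unsigned long, addr, unsigned long, len,
                        unsigned long, prot, unsigned long, flags,
                        unsigned long, fd, unsigned long, pgoff)
        {
            /* Record the descriptor and flags in the audit context:
             * the generic syscall-entry audit only captures the first
             * four register arguments and never follows user pointers. */
            audit_mmap_fd(fd, flags);

            /* ... the normal sys_mmap_pgoff() body follows unchanged ... */
        }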
     

23 Sep, 2010

1 commit

  • If __split_vma fails because of an out-of-memory condition, the
    anon_vma_chain isn't torn down and freed, potentially leading to rmap
    walks accessing freed vma information; there is also a memory leak.
    (The error-path fix is sketched after this entry.)

    Signed-off-by: Andrea Arcangeli
    Acked-by: Johannes Weiner
    Acked-by: Rik van Riel
    Acked-by: Hugh Dickins
    Cc: Marcelo Tosatti
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
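
    The shape of the fix, as a sketch of __split_vma()'s error path
    (helper and label names as of the 2.6.36 era; treat them as
    illustrative):

        /* vma_adjust() failed with -ENOMEM: unwind 'new' completely */
     out_free_mpol:
        mpol_put(pol);
     out_free_vma:
        /* 'new' may already be linked into anon_vma chains: unlink it
         * so rmap walks cannot reach the vma we are about to free, and
         * so the anon_vma_chain entries themselves are not leaked. */
        unlink_anon_vmas(new);
        kmem_cache_free(vm_area_cachep, new);
        return err;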
     

25 Aug, 2010

1 commit

  • pa-risc and ia64 have stacks that grow upwards. Check that
    they do not run into other mappings. By making VM_GROWSUP
    0x0 on architectures that do not ever use it, we can avoid
    some unpleasant #ifdefs in check_stack_guard_page(). (The
    trick is sketched after this entry.)

    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Luck, Tony
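
    The trick, sketched (the non-zero flag value is the conventional one;
    only the 0x0 definition on non-growsup architectures matters):

        #if defined(CONFIG_STACK_GROWSUP) || defined(CONFIG_IA64)
        #define VM_GROWSUP  0x00000200
        #else
        #define VM_GROWSUP  0x00000000  /* never set on these arches */
        #endif

        /* check_stack_guard_page() can now test the flag unconditionally;
         * where VM_GROWSUP is 0 the branch is dead code the compiler
         * removes, with no #ifdef at the use site. */
        if (vma->vm_flags & VM_GROWSUP)
            /* ... check we don't grow into the next mapping ... */;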
     

21 Aug, 2010

1 commit

  • It's a really simple list, and several of the users want to go backwards
    in it to find the previous vma. So rather than have to look up the
    previous entry with 'find_vma_prev()' or something similar, just make it
    doubly linked instead. (Sketched after this entry.)

    Tested-by: Ian Campbell
    Signed-off-by: Linus Torvalds

    Linus Torvalds
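
    What this amounts to in mm.h terms, sketched (field placement
    illustrative):

        struct vm_area_struct {
            /* ... */
            struct vm_area_struct *vm_next, *vm_prev;  /* vm_prev is new */
            /* ... */
        };

        /* callers no longer need a second lookup: */
        vma = find_vma_prev(mm, addr, &prev);  /* before: extra walk  */
        prev = vma->vm_prev;                   /* after: direct link  */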
     

10 Aug, 2010

4 commits

  • There's no anon_vma-related mangling happening inside __vma_link anymore,
    so there is no need for anon_vma locking there.

    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Always (and only) lock the root (oldest) anon_vma whenever we do something
    in an anon_vma. The recently introduced anon_vma scalability is due to
    the rmap code scanning only the VMAs that need to be scanned. Many common
    operations still took the anon_vma lock on the root anon_vma, so always
    taking that lock is not expected to introduce any scalability issues.

    However, always taking the same lock does mean we only need to take one
    lock, which means rmap_walk on pages from any anon_vma in the vma is
    excluded from occurring during an munmap, expand_stack or other operation
    that needs to exclude rmap_walk and similar functions.

    Also add the proper locking to vma_adjust. (The resulting helpers are
    sketched after this entry.)

    Signed-off-by: Rik van Riel
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Mel Gorman
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
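
    The resulting lock helpers, sketched against the root pointer this
    series introduces:

        static inline void anon_vma_lock(struct anon_vma *anon_vma)
        {
            /* Always take the root (oldest) anon_vma's lock, so every
             * anon_vma hanging off the same root serializes on one lock. */
            spin_lock(&anon_vma->root->lock);
        }

        static inline void anon_vma_unlock(struct anon_vma *anon_vma)
        {
            spin_unlock(&anon_vma->root->lock);
        }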
     
  • Substitute a direct call of spin_lock(anon_vma->lock) with an inline
    function doing exactly the same.

    This makes it easier to do the substitution to the root anon_vma lock in a
    following patch.

    We will deal with the handful of special locks (nested, dec_and_lock, etc)
    separately.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Rename anon_vma_lock to vma_lock_anon_vma. This matches the naming style
    used in page_lock_anon_vma and will come in really handy further down in
    this patch series.

    Signed-off-by: Rik van Riel
    Acked-by: Mel Gorman
    Acked-by: KAMEZAWA Hiroyuki
    Tested-by: Larry Woodman
    Acked-by: Larry Woodman
    Reviewed-by: Minchan Kim
    Acked-by: Linus Torvalds
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     

09 Jun, 2010

1 commit

  • Add the capability to track data mmap()s. This can be used together
    with PERF_SAMPLE_ADDR for data profiling. (See the sketch after this
    entry.)

    Signed-off-by: Anton Blanchard
    [Updated code for stable perf ABI]
    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric B Munson
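
    From userspace this surfaces as a new perf_event_attr bit next to the
    existing 'mmap' bit; a sketch (error handling elided, counter choice
    arbitrary):

        #include <linux/perf_event.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        struct perf_event_attr attr = {
            .type        = PERF_TYPE_HARDWARE,
            .config      = PERF_COUNT_HW_CPU_CYCLES,
            .size        = sizeof(struct perf_event_attr),
            .sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_ADDR,
            .mmap        = 1,  /* record executable mmaps, as before */
            .mmap_data   = 1,  /* new: also record data (non-exec) mmaps */
        };

        /* then, e.g., inside main(): */
        int fd = syscall(__NR_perf_event_open, &attr, 0, -1, -1, 0);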
     

27 Apr, 2010

1 commit


13 Apr, 2010

2 commits

  • When we move the boundaries between two vma's due to things like
    mprotect, we need to make sure that the anon_vma of the pages that got
    moved from one vma to another gets properly copied around. And that was
    not always the case, in this rather hard-to-follow code sequence.

    Clarify the code, and fix it so that it copies the anon_vma from the
    right source.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "Yeah, not so much this one either" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • This changes the anon_vma reuse case to require that we only reuse
    simple anon_vma's - ie the case when the vma only has a single anon_vma
    associated with it.

    This means that a reuse of an anon_vma from an adjacent vma will always
    guarantee that both vma's are associated not only with the same
    anon_vma, they will also have the same anon_vma chain (of just a single
    entry in this case).

    And since anon_vma re-use was the only case where the same anon_vma
    might be associated with different chains of anon_vma's, we now have the
    case that every vma that shares the same anon_vma will always also have
    the same chain. That makes it much easier to think about merging vma's
    that share the same anon_vma's: you can always just drop the other
    anon_vma chain in anon_vma_merge() since you know that they are always
    identical.

    This also splits up the function to validate the anon_vma re-use, and
    adds a lot of commentary about the possible races.

    Reviewed-by: Rik van Riel
    Acked-by: Johannes Weiner
    Tested-by: Borislav Petkov [ "That didn't fix it" ]
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

13 Mar, 2010

1 commit

  • Add a generic implementation of the old mmap() syscall, which expects
    its arguments in a memory block, and switch all architectures over to
    use it. (The implementation is sketched after this entry.)

    Signed-off-by: Christoph Hellwig
    Cc: Ralf Baechle
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Hirokazu Takata
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Reviewed-by: H. Peter Anvin
    Cc: Al Viro
    Cc: Arnd Bergmann
    Cc: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: "Luck, Tony"
    Cc: James Morris
    Cc: Andreas Schwab
    Acked-by: Jesper Nilsson
    Acked-by: Russell King
    Acked-by: Greg Ungerer
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Hellwig
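
    The generic implementation is essentially this shape (a sketch of the
    struct-in-memory ABI the entry describes):

        struct mmap_arg_struct {
            unsigned long addr;
            unsigned long len;
            unsigned long prot;
            unsigned long flags;
            unsigned long fd;
            unsigned long offset;
        };

        SYSCALL_DEFINE1(old_mmap, struct mmap_arg_struct __user *, arg)
        {
            struct mmap_arg_struct a;

            /* The old ABI passes a pointer to the six arguments in
             * user memory rather than passing them in registers. */
            if (copy_from_user(&a, arg, sizeof(a)))
                return -EFAULT;
            if (a.offset & ~PAGE_MASK)
                return -EINVAL;

            return sys_mmap_pgoff(a.addr, a.len, a.prot, a.flags, a.fd,
                                  a.offset >> PAGE_SHIFT);
        }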
     

07 Mar, 2010

5 commits

  • When a VMA is in an inconsistent state during setup or teardown, the worst
    that can happen is that the rmap code will not be able to find the page.

    The mapping is in the process of being torn down (PTEs just got
    invalidated by munmap), or set up (no PTEs have been instantiated yet).

    It is also impossible for the rmap code to follow a pointer to an already
    freed VMA, because the rmap code holds the anon_vma->lock, which the VMA
    teardown code needs to take before the VMA is removed from the anon_vma
    chain.

    Hence, we should not need the VM_LOCK_RMAP locking at all.

    Signed-off-by: Rik van Riel
    Cc: Nick Piggin
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach into the tens
    of thousands. Real workloads are still a factor of 10 less
    process-intensive than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows
    us to associate multiple anon_vmas with a VMA (see the anon_vma_chain
    sketch after this entry). At fork time, each child process gets its own
    anon_vmas, in which its COWed pages will be instantiated. The parents'
    anon_vma is also linked to the VMA, because non-COWed pages could be
    present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking
    a VMA to anon_vmas now involves memory allocations. This means
    vma_adjust can fail if it needs to attach a VMA to anon_vma structures.
    This in turn means error handling needs to be added to the calling
    functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
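
    The linking object at the heart of this change, sketched from the
    description above (the comments paraphrase the locking rules):

        struct anon_vma_chain {
            struct vm_area_struct *vma;
            struct anon_vma *anon_vma;
            struct list_head same_vma;       /* all anon_vmas of one vma;
                                              * serialized by mmap_sem */
            struct list_head same_anon_vma;  /* all vmas sharing one
                                              * anon_vma; serialized by
                                              * anon_vma->lock */
        };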
     
  • Make sure the compiler won't do weird things with limits, e.g. fetching
    them twice, which may return 2 different values after writable limits
    are implemented.

    I.e. either use the rlimit helpers added in
    3e10e716abf3c71bdb5d86b8f507f9e72236c9cd ("resource: add helpers for
    fetching rlimits") or ACCESS_ONCE if not applicable. (See the sketch
    after this entry.)

    Signed-off-by: Jiri Slaby
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
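
    Concretely, a sketch using RLIMIT_MEMLOCK as the example:

        /* Racy once limits become writable: the compiler may load the
         * limit twice and observe two different values. */
        if (len > current->signal->rlim[RLIMIT_MEMLOCK].rlim_cur)
            return -EAGAIN;

        /* Safe: the helper fetches the limit exactly once. */
        unsigned long lock_limit = rlimit(RLIMIT_MEMLOCK);

        if (len > lock_limit)
            return -EAGAIN;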
     
  • Currently, mlock_vma_pages_range() only returns len or 0, so the
    current error handling in mmap_region() is needlessly complex.

    This patch simplifies it and makes it consistent with the brk() code.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, mlock_vma_pages_range() never returns a negative value, so we
    can remove some worthless error checks.

    Signed-off-by: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Lee Schermerhorn
    Cc: Rik van Riel
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

31 Dec, 2009

1 commit

  • Move sys_mmap_pgoff() from mm/util.c to mm/mmap.c and mm/nommu.c,
    where we'd expect to find such code: especially now that it contains
    the MAP_HUGETLB handling. Revert mm/util.c to how it was in 2.6.32.

    This patch just ignores MAP_HUGETLB in the nommu case, as in 2.6.32,
    whereas 2.6.33-rc2 reported -ENOSYS. Perhaps validate_mmap_request()
    should reject it with -EINVAL? Add that later if necessary.

    Signed-off-by: Hugh Dickins
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

16 Dec, 2009

2 commits

  • Modify the generic mmap() code to keep the cache attribute in
    vma->vm_page_prot regardless of whether writenotify is enabled. Without
    this patch the cache configuration selected by f_op->mmap() is overwritten
    if writenotify is enabled, making it impossible to keep the vma uncached.

    Needed by drivers such as drivers/video/sh_mobile_lcdcfb.c which uses
    deferred io together with uncached memory.

    Signed-off-by: Magnus Damm
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Paul Mundt
    Cc: Jaya Kumar
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Magnus Damm
     
  • On ia64, the following test program exits abnormally, because the glibc
    thread library called abort().

    ========================================================
    (gdb) bt
    #0 0xa000000000010620 in __kernel_syscall_via_break ()
    #1 0x20000000003208e0 in raise () from /lib/libc.so.6.1
    #2 0x2000000000324090 in abort () from /lib/libc.so.6.1
    #3 0x200000000027c3e0 in __deallocate_stack () from /lib/libpthread.so.0
    #4 0x200000000027f7c0 in start_thread () from /lib/libpthread.so.0
    #5 0x200000000047ef60 in __clone2 () from /lib/libc.so.6.1
    ========================================================

    The fact is, glibc calls munmap() at thread-exit time to free the
    stack, and it assumes munmap() never fails. However, munmap() often
    has to split a vma, and when the process is near its map count limit
    that split fails with -ENOMEM.

    That is unreasonable, because unmapping a stack never increases the
    map count in the end: the limit is only exceeded temporarily, and an
    internal, temporary excess shouldn't produce ENOMEM.

    This patch fixes that.

    test_max_mapcount.c
    ==================================================================
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pthread.h>

    #define THREAD_NUM 30000
    #define MAL_SIZE (8*1024*1024)

    void *wait_thread(void *args)
    {
        void *addr;

        addr = malloc(MAL_SIZE);
        sleep(10);

        return NULL;
    }

    void *wait_thread2(void *args)
    {
        sleep(60);

        return NULL;
    }

    int main(int argc, char *argv[])
    {
        int i;
        pthread_t thread[THREAD_NUM], th;
        int ret, count = 0;
        pthread_attr_t attr;

        ret = pthread_attr_init(&attr);
        if (ret)
            perror("pthread_attr_init");

        ret = pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
        if (ret)
            perror("pthread_attr_setdetachstate");

        for (i = 0; i < THREAD_NUM; i++) {
            ret = pthread_create(&th, &attr, wait_thread, NULL);
            if (ret) {
                fprintf(stderr, "[%d] ", count);
                perror("pthread_create");
            } else {
                printf("[%d] create OK.\n", count);
            }
            count++;

            ret = pthread_create(&thread[i], &attr, wait_thread2, NULL);
            if (ret) {
                fprintf(stderr, "[%d] ", count);
                perror("pthread_create");
            } else {
                printf("[%d] create OK.\n", count);
            }
            count++;
        }

        sleep(3600);
        return 0;
    }
    ==================================================================

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

11 Dec, 2009

3 commits


25 Oct, 2009

1 commit


28 Sep, 2009

1 commit


22 Sep, 2009

8 commits

  • Add a flag for mmap that will be used to request a huge page region that
    will look like anonymous memory to userspace. This is accomplished by
    using a file on the internal vfsmount. MAP_HUGETLB is a modifier of
    MAP_ANONYMOUS and so must be specified with it. The region will behave
    the same as a MAP_ANONYMOUS region using small pages. (See the example
    after this entry.)

    [akpm@linux-foundation.org: fix arch definitions of MAP_HUGETLB]
    Signed-off-by: Eric B Munson
    Acked-by: David Rientjes
    Cc: Mel Gorman
    Cc: Adam Litke
    Cc: David Gibson
    Cc: Lee Schermerhorn
    Cc: Nick Piggin
    Cc: Hugh Dickins
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
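
    A minimal userspace example of the new flag (assumes a 2 MB huge page
    size and that huge pages have been reserved, e.g. via
    /proc/sys/vm/nr_hugepages; MAP_HUGETLB comes from the arch headers):

        #include <stdio.h>
        #include <sys/mman.h>

        int main(void)
        {
            size_t len = 2 * 1024 * 1024;   /* one huge page */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
                           -1, 0);

            if (p == MAP_FAILED) {          /* e.g. no huge pages free */
                perror("mmap");
                return 1;
            }
            p[0] = 1;   /* touch it: behaves like plain MAP_ANONYMOUS */
            munmap(p, len);
            return 0;
        }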
     
  • shmem_zero_setup() does not change vm_start, pgoff or vm_flags; only some
    drivers change them (such as drivers/video/bfin-t350mcqb-fb.c).

    Move this code to a more appropriate place to save cycles for shared
    anonymous mappings.

    Signed-off-by: Huang Shijie
    Reviewed-by: Minchan Kim
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
     
  • We noticed very erratic behavior [throughput] with the AIM7 shared
    workload running on recent distro [SLES11] and mainline kernels on an
    8-socket, 32-core, 256GB x86_64 platform. On the SLES11 kernel
    [2.6.27.19+] with Barcelona processors, as we increased the load [10s of
    thousands of tasks], the throughput would vary between two "plateaus"--one
    at ~65K jobs per minute and one at ~130K jpm. The simple patch below
    causes the results to smooth out at the ~130k plateau.

    But wait, there's more:

    We do not see this behavior on smaller platforms--e.g., 4 socket/8 core.
    This could be the result of the larger number of cpus on the larger
    platform--a scalability issue--or it could be the result of the larger
    number of interconnect "hops" between some nodes in this platform and how
    the tasks for a given load end up distributed over the nodes' cpus and
    memories--a stochastic NUMA effect.

    The variability in the results is less pronounced [on the same platform]
    with Shanghai processors and with mainline kernels. With 31-rc6 on
    Shanghai processors and 288 file systems on 288 fibre attached storage
    volumes, the curves [jpm vs load] are both quite flat with the patched
    kernel consistently producing ~3.9% better throughput [~80K jpm vs ~77K
    jpm] than the unpatched kernel.

    Profiling indicated that the "slow" runs were incurring high[er]
    contention on an anon_vma lock in vma_adjust(), apparently called from the
    sbrk() system call.

    The patch:

    A comment in mm/mmap.c:vma_adjust() suggests that we don't really need the
    anon_vma lock when we're only adjusting the end of a vma, as is the case
    for brk(). The comment questions whether it's worth while to optimize for
    this case. Apparently, on the newer, larger x86_64 platforms, with
    interesting NUMA topologies, it is worth while--especially considering
    that the patch [if correct!] is quite simple.

    We can detect this condition--no overlap with next vma--by noting a NULL
    "importer". The anon_vma pointer will also be NULL in this case, so
    simply avoid loading vma->anon_vma to avoid the lock.

    However, we DO need to take the anon_vma lock when we're inserting a vma
    ['insert' non-NULL] even when we have no overlap [NULL "importer"], so we
    need to check for 'insert', as well. And Hugh points out that we should
    also take it when adjusting vm_start (so that rmap.c can rely upon
    vma_address() while it holds the anon_vma lock). The resulting condition
    is sketched after this entry.

    akpm: Zhang Yanmin reports a 150% throughput improvement with aim7, so it
    might be -stable material even though this isn't a regression: "this
    issue is not clear on dual socket Nehalem machine (2*4*2 cpu), but is
    severe on large machine (4*8*2 cpu)"

    [hugh.dickins@tiscali.co.uk: test vma start too]
    Signed-off-by: Lee Schermerhorn
    Signed-off-by: Hugh Dickins
    Cc: Nick Piggin
    Cc: Eric Whitney
    Tested-by: "Zhang, Yanmin"
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
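
    The resulting condition, sketched with the 2.6.31-era vma_adjust()
    variable names:

        struct anon_vma *anon_vma = NULL;

        /* Take the anon_vma lock only when rmap could observe a
         * transient state: a vma being inserted, pages moving to a
         * neighbour (importer), or vm_start changing. A brk() that
         * merely extends vm_end has NULL 'insert' and 'importer' and
         * an unchanged vm_start, so it skips the lock entirely. */
        if (vma->anon_vma && (insert || importer || start != vma->vm_start))
            anon_vma = vma->anon_vma;
        if (anon_vma)
            spin_lock(&anon_vma->lock);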
     
  • If (flags & MAP_LOCKED) is true, it means vm_flags already contains the
    VM_LOCKED bit, which is set by calc_vm_flag_bits() (sketched after this
    entry).

    So there is no need to set it again; just remove the redundant code.

    Signed-off-by: Huang Shijie
    Acked-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Huang Shijie
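
    Why the reset was redundant, sketched from the mman.h translation
    helpers of that era:

        #define _calc_vm_trans(x, bit1, bit2) \
            ((bit1) <= (bit2) ? ((x) & (bit1)) * ((bit2) / (bit1)) \
                              : ((x) & (bit1)) / ((bit1) / (bit2)))

        static inline unsigned long calc_vm_flag_bits(unsigned long flags)
        {
            return _calc_vm_trans(flags, MAP_GROWSDOWN,  VM_GROWSDOWN ) |
                   _calc_vm_trans(flags, MAP_DENYWRITE,  VM_DENYWRITE ) |
                   _calc_vm_trans(flags, MAP_EXECUTABLE, VM_EXECUTABLE) |
                   _calc_vm_trans(flags, MAP_LOCKED,     VM_LOCKED    );
        }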
     
  • A few cleanups, given the munlock fix: the comment on ksm_test_exit() no
    longer applies, and it can be made private to ksm.c; there's no more
    reference to mmu_gather or tlb.h, and mmap.c doesn't need ksm.h.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     
  • KSM originally stood for Kernel Shared Memory: but the kernel has long
    supported shared memory, and VM_SHARED and VM_MAYSHARE vmas, and KSM is
    something else. So we switched to saying "merge" instead of "share".

    But Chris Wright points out that this is confusing where mmap.c merges
    adjacent vmas: most especially in the name VM_MERGEABLE_FLAGS, used by
    is_mergeable_vma() to let vmas be merged despite flags being different.

    Call it VMA_MERGE_DESPITE_FLAGS? Perhaps, but at present it consists
    only of VM_CAN_NONLINEAR: so for now it's clearer on all sides to use
    that directly, with a comment on it in is_mergeable_vma() (sketched
    after this entry).

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
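
    The use site, sketched:

        static inline int is_mergeable_vma(struct vm_area_struct *vma,
                                           struct file *file,
                                           unsigned long vm_flags)
        {
            /* VM_CAN_NONLINEAR may only be set by f_op->mmap() after
             * the merge decision, so ignore it when comparing flags
             * rather than keeping a misleading VM_MERGEABLE_FLAGS. */
            if ((vma->vm_flags ^ vm_flags) & ~VM_CAN_NONLINEAR)
                return 0;
            if (vma->vm_file != file)
                return 0;
            if (vma->vm_ops && vma->vm_ops->close)
                return 0;
            return 1;
        }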
     
  • Rawhide users have reported a hang at startup when cryptsetup is run:
    the same problem can be reproduced simply by running a program such as
    int main() { mlockall(MCL_CURRENT | MCL_FUTURE); return 0; }

    The problem is that exit_mmap() applies munlock_vma_pages_all() to
    clean up VM_LOCKED areas, and its current implementation (stupidly)
    tries to fault in absent pages, for example where PROT_NONE prevented
    them being faulted in when mlocking. Whereas the "ksm: fix oom
    deadlock" patch, knowing there's a race by which KSM might try to fault
    in pages after exit_mmap() had finally zapped the range, backs out of
    such faults doing nothing when its ksm_test_exit() notices mm_users 0.

    So revert that part of "ksm: fix oom deadlock" which moved the
    ksm_exit() call from before exit_mmap() to the middle of exit_mmap();
    and remove those ksm_test_exit() checks from the page fault paths, so
    allowing the munlocking to proceed without interference.

    ksm_exit, if there are rmap_items still chained on this mm slot, takes
    mmap_sem write side: so preventing KSM from working on an mm while
    exit_mmap runs. And KSM will bail out as soon as it notices that
    mm_users is already zero, thanks to its internal ksm_test_exit checks.
    So that when a task is killed by OOM killer or the user, KSM will not
    indefinitely prevent it from running exit_mmap to release its memory.

    This does break a part of what "ksm: fix oom deadlock" was trying to
    achieve. When unmerging KSM (echo 2 >/sys/kernel/mm/ksm), and even
    when ksmd itself has to cancel a KSM page, it is possible that the
    first OOM-kill victim would be the KSM process being faulted: then its
    memory won't be freed until a second victim has been selected (freeing
    memory for the unmerging fault to complete).

    But the OOM killer is already liable to kill a second victim once the
    intended victim's p->mm goes to NULL: so there's not much point in
    rejecting this KSM patch before fixing that OOM behaviour. It is very
    much more important to allow KSM users to boot up, than to haggle over
    an unlikely and poorly supported OOM case.

    We also intend to fix munlocking to not fault pages: at which point
    this patch _could_ be reverted; though that would be controversial, so
    we hope to find a better solution.

    Signed-off-by: Andrea Arcangeli
    Acked-by: Justin M. Forbes
    Acked-for-now-by: Hugh Dickins
    Cc: Izik Eidus
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • There's a now-obvious deadlock in KSM's out-of-memory handling:
    imagine ksmd or KSM_RUN_UNMERGE handling, holding ksm_thread_mutex,
    trying to allocate a page to break KSM in an mm which becomes the
    OOM victim (quite likely in the unmerge case): it's killed and goes
    to exit, and hangs there waiting to acquire ksm_thread_mutex.

    Clearly we must not require ksm_thread_mutex in __ksm_exit, simple
    though that made everything else: perhaps use mmap_sem somehow?
    And part of the answer lies in the comments on unmerge_ksm_pages:
    __ksm_exit should also leave all the rmap_item removal to ksmd.

    But there's a fundamental problem, that KSM relies upon mmap_sem to
    guarantee the consistency of the mm it's dealing with, yet exit_mmap
    tears down an mm without taking mmap_sem. And bumping mm_users won't
    help at all, that just ensures that the pages the OOM killer assumes
    are on their way to being freed will not be freed.

    The best answer seems to be, to move the ksm_exit callout from just
    before exit_mmap, to the middle of exit_mmap: after the mm's pages
    have been freed (if the mmu_gather is flushed), but before its page
    tables and vma structures have been freed; and down_write,up_write
    mmap_sem there to serialize with KSM's own reliance on mmap_sem.

    But KSM then needs to be careful, whenever it downs mmap_sem, to
    check that the mm is not already exiting: there's a danger of using
    find_vma on a layout that's being torn apart, or writing into page
    tables which have been freed for reuse; and even do_anonymous_page
    and __do_fault need to check they're not being called by break_ksm
    to reinstate a pte after zap_pte_range has zapped that page table.

    Though it might be clearer to add an exiting flag, set while holding
    mmap_sem in __ksm_exit, that wouldn't cover the issue of reinstating
    a zapped pte. All we need is to check whether mm_users is 0 - but we
    must remember that ksmd may detect that before __ksm_exit is reached.
    So ksm_test_exit(mm) is added to comment such checks on mm->mm_users
    (sketched after this entry).

    __ksm_exit now has to leave the clearing up of the rmap_items to ksmd,
    which needs ksm_thread_mutex; but it shifts the exiting mm just after
    the ksm_scan cursor so that it will soon be dealt with. __ksm_enter
    raises mm_count to hold the mm_struct, and ksmd's exit processing
    (exactly like its processing when it finds all VM_MERGEABLEs unmapped)
    mmdrops it, with a similar procedure for KSM_RUN_UNMERGE (which has
    stopped ksmd).

    But also give __ksm_exit a fast path: when there's no complication
    (no rmap_items attached to mm and it's not at the ksm_scan cursor),
    it can safely do all the exiting work itself. This is not just an
    optimization: when ksmd is not running, the raised mm_count would
    otherwise leak mm_structs.

    Signed-off-by: Hugh Dickins
    Acked-by: Izik Eidus
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
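
    The check itself is tiny; a sketch, with the comment paraphrasing the
    rules above:

        static inline int ksm_test_exit(struct mm_struct *mm)
        {
            /* True once exit_mmap() is tearing this mm down: KSM code
             * that has just taken mmap_sem, or is about to reinstate a
             * pte, must notice this and back out without doing anything. */
            return atomic_read(&mm->mm_users) == 0;
        }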
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
        M=$(echo $N | sed 's/perf_counter/perf_event/g')
        mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

19 Sep, 2009

1 commit


17 Aug, 2009

1 commit

  • Currently SELinux enforcement of controls on the ability to map low
    memory is determined by the mmap_min_addr tunable. This patch causes
    SELinux to ignore the tunable and instead use a separate Kconfig option
    specific to how much space the LSM should protect.

    The tunable will now only control the need for CAP_SYS_RAWIO and SELinux
    permissions will always protect the amount of low memory designated by
    CONFIG_LSM_MMAP_MIN_ADDR.

    This allows users who need to disable the mmap_min_addr controls (usual reason
    being they run WINE as a non-root user) to do so and still have SELinux
    controls preventing confined domains (like a web server) from being able to
    map some area of low memory.

    Signed-off-by: Eric Paris
    Signed-off-by: James Morris

    Eric Paris
     

12 Jun, 2009

1 commit

  • Merge branch 'perfcounters-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     

05 Jun, 2009

1 commit