16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to setup the
    vdso) skips the security check before insert_vm_struct, allowing a local
    attacker to bypass the mmap_min_addr security restriction by limiting
    the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this can
    be used to bypass any restrictions, I don't see any reason not to have
    the security check.

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
     

01 Dec, 2010

2 commits

  • Note: this patch targets 2.6.37 and tries to be as simple as possible.
    That is why it adds more copy-and-paste horror into fs/compat.c and
    uglifies fs/exec.c, this will be cleanuped later.

    compat_copy_strings() plays with bprm->vma/mm directly and thus has
    two problems: it lacks the RLIMIT_STACK check and argv/envp memory
    is not visible to oom killer.

    Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
    to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
    as do_execve() does.

    Add the fatal_signal_pending/cond_resched checks into compat_count() and
    compat_copy_strings(), this matches the code in fs/exec.c and certainly
    makes sense.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Brad Spengler published a local memory-allocation DoS that
    evades the OOM-killer (though not the virtual memory RLIMIT):
    http://www.grsecurity.net/~spender/64bit_dos.c

    execve()->copy_strings() can allocate a lot of memory, but
    this is not visible to oom-killer, nobody can see the nascent
    bprm->mm and take it into account.

    With this patch get_arg_page() increments current's MM_ANONPAGES
    counter every time we allocate the new page for argv/envp. When
    do_execve() succeds or fails, we change this counter back.

    Technically this is not 100% correct, we can't know if the new
    page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
    I don't think this really matters and everything becomes correct
    once exec changes ->mm or fails.

    Reported-by: Brad Spengler
    Reviewed-and-discussed-by: KOSAKI Motohiro
    Signed-off-by: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 Oct, 2010

3 commits

  • Presently do_execve() turns PF_KTHREAD off before search_binary_handler().
    THis has a theorical risk of PF_KTHREAD getting lost. We don't have to
    turn PF_KTHREAD off in the ENOEXEC case.

    This patch moves this flag modification to after the finding of the
    executable file.

    This is only a theorical issue because kthreads do not call do_execve()
    directly. But fixing would be better.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Roland McGrath
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • We met a parameter truncated issue, consider following:
    > echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
    %s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
    /proc/sys/kernel/core_pattern

    This is okay because the strings is less than CORENAME_MAX_SIZE. "cat
    /proc/sys/kernel/core_pattern" shows the whole string. but after we run
    core_pattern_pipe_test in man page, we found last parameter was truncated
    like below:

    argc[10]=

    The root cause is core_pattern allows % specifiers, which need to be
    replaced during parse time, but the replace may expand the strings to
    larger than CORENAME_MAX_SIZE. So if the last parameter is % specifiers,
    the replace code is using snprintf(out_ptr, out_end - out_ptr, ...), this
    will write out of corename array.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Xiaotian Feng
    Cc: Alexander Viro
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Neil Horman
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
     
  • Oleg Nesterov pointed out we have to prevent multiple-threads-inside-exec
    itself and we can reuse ->cred_guard_mutex for it. Yes, concurrent
    execve() has no worth.

    Let's move ->cred_guard_mutex from task_struct to signal_struct. It
    naturally prevent multiple-threads-inside-exec.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 Oct, 2010

1 commit

  • It's pointless to kill a task if another thread sharing its mm cannot be
    killed to allow future memory freeing. A subsequent patch will prevent
    kills in such cases, but first it's necessary to have a way to flag a task
    that shares memory with an OOM_DISABLE task that doesn't incur an
    additional tasklist scan, which would make select_bad_process() an O(n^2)
    function.

    This patch adds an atomic counter to struct mm_struct that follows how
    many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
    They cannot be killed by the kernel, so their memory cannot be freed in
    oom conditions.

    This only requires task_lock() on the task that we're operating on, it
    does not require mm->mmap_sem since task_lock() pins the mm and the
    operation is atomic.

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

15 Oct, 2010

2 commits

  • If you build aout support as a module, you'll want these exported.

    Reported-by: Tetsuo Handa
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Tony Luck reports that the addition of the access_ok() check in commit
    0eead9ab41da ("Don't dump task struct in a.out core-dumps") broke the
    ia64 compile due to missing the necessary header file includes.

    Rather than add yet another include () to make everything
    happy, just uninline the silly core dump helper functions and move the
    bodies to fs/exec.c where they make a lot more sense.

    dump_seek() in particular was too big to be an inline function anyway,
    and none of them are in any way performance-critical. And we really
    don't need to mess up our include file headers more than they already
    are.

    Reported-and-tested-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Sep, 2010

3 commits

  • An execve with a very large total of argument/environment strings
    can take a really long time in the execve system call. It runs
    uninterruptibly to count and copy all the strings. This change
    makes it abort the exec quickly if sent a SIGKILL.

    Note that this is the conservative change, to interrupt only for
    SIGKILL, by using fatal_signal_pending(). It would be perfectly
    correct semantics to let any signal interrupt the string-copying in
    execve, i.e. use signal_pending() instead of fatal_signal_pending().
    We'll save that change for later, since it could have user-visible
    consequences, such as having a timer set too quickly make it so that
    an execve can never complete, though it always happened to work before.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This adds a preemption point during the copying of the argument and
    environment strings for execve, in copy_strings(). There is already
    a preemption point in the count() loop, so this doesn't add any new
    points in the abstract sense.

    When the total argument+environment strings are very large, the time
    spent copying them can be much more than a normal user time slice.
    So this change improves the interactivity of the rest of the system
    when one process is doing an execve with very large arguments.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
    check the size of the argument/environment area on the stack.
    When it is unworkably large, shift_arg_pages() hits its BUG_ON.
    This is exploitable with a very large RLIMIT_STACK limit, to
    create a crash pretty easily.

    Check that the initial stack is not too large to make it possible
    to map in any executable. We're not checking that the actual
    executable (or intepreter, for binfmt_elf) will fit. So those
    mappings might clobber part of the initial stack mapping. But
    that is just userland lossage that userland made happen, not a
    kernel problem.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

19 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: brlock vfsmount_lock
    fs: scale files_lock
    lglock: introduce special lglock and brlock spin locks
    tty: fix fu_list abuse
    fs: cleanup files_lock locking
    fs: remove extra lookup in __lookup_hash
    fs: fs_struct rwlock to spinlock
    apparmor: use task path helpers
    fs: dentry allocation consolidation
    fs: fix do_lookup false negative
    mbcache: Limit the maximum number of cache entries
    hostfs ->follow_link() braino
    hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy
    remove SWRITE* I/O types
    kill BH_Ordered flag
    vfs: update ctime when changing the file's permission by setfacl
    cramfs: only unlock new inodes
    fix reiserfs_evict_inode end_writeback second call

    Linus Torvalds
     

18 Aug, 2010

2 commits

  • fs: fs_struct rwlock to spinlock

    struct fs_struct.lock is an rwlock with the read-side used to protect root and
    pwd members while taking references to them. Taking a reference to a path
    typically requires just 2 atomic ops, so the critical section is very small.
    Parallel read-side operations would have cacheline contention on the lock, the
    dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
    real parallelism increase.

    Replace it with a spinlock to avoid one or two atomic operations in typical
    path lookup fastpath.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • Make do_execve() take a const filename pointer so that kernel_execve() compiles
    correctly on ARM:

    arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type

    This also requires the argv and envp arguments to be consted twice, once for
    the pointer array and once for the strings the array points to. This is
    because do_execve() passes a pointer to the filename (now const) to
    copy_strings_kernel(). A simpler alternative would be to cast the filename
    pointer in do_execve() when it's passed to copy_strings_kernel().

    do_execve() may not change any of the strings it is passed as part of the argv
    or envp lists as they are some of them in .rodata, so marking these strings as
    const should be fine.

    Further kernel_execve() and sys_execve() need to be changed to match.

    This has been test built on x86_64, frv, arm and mips.

    Signed-off-by: David Howells
    Tested-by: Ralf Baechle
    Acked-by: Russell King
    Signed-off-by: Linus Torvalds

    David Howells
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
    fanotify: use both marks when possible
    fsnotify: pass both the vfsmount mark and inode mark
    fsnotify: walk the inode and vfsmount lists simultaneously
    fsnotify: rework ignored mark flushing
    fsnotify: remove global fsnotify groups lists
    fsnotify: remove group->mask
    fsnotify: remove the global masks
    fsnotify: cleanup should_send_event
    fanotify: use the mark in handler functions
    audit: use the mark in handler functions
    dnotify: use the mark in handler functions
    inotify: use the mark in handler functions
    fsnotify: send fsnotify_mark to groups in event handling functions
    fsnotify: Exchange list heads instead of moving elements
    fsnotify: srcu to protect read side of inode and vfsmount locks
    fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
    fsnotify: use _rcu functions for mark list traversal
    fsnotify: place marks on object in order of group memory address
    vfs/fsnotify: fsnotify_close can delay the final work in fput
    fsnotify: store struct file not struct path
    ...

    Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

    Linus Torvalds
     

08 Aug, 2010

1 commit


28 Jul, 2010

1 commit

  • fanotify, the upcoming notification system actually needs a struct path so it can
    do opens in the context of listeners, and it needs a file so it can get f_flags
    from the original process. Close was the only operation that already was passing
    a struct file to the notification hook. This patch passes a file for access,
    modify, and open as well as they are easily available to these hooks.

    Signed-off-by: Eric Paris

    Eric Paris
     

10 Jul, 2010

1 commit

  • core_pattern is not actually protected and hasn't been
    ever since we introduced procfs support for sysctl -- a
    _long_ time. Don't take it here either.

    Also nothing inside do_coredump appears to require bkl
    protection.

    Signed-off-by: Arnd Bergmann
    [ remove smp_lock.h headers ]
    Signed-off-by: Frederic Weisbecker

    Arnd Bergmann
     

09 Jun, 2010

1 commit

  • Add the capacility to track data mmap()s. This can be used together
    with PERF_SAMPLE_ADDR for data profiling.

    Signed-off-by: Anton Blanchard
    [Updated code for stable perf ABI]
    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric B Munson
     

28 May, 2010

6 commits

  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry with
    exit_notify(), we could just check it is != 0.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - move the cprm.mm_flags checks up, before we take mmap_sem

    - move down_write(mmap_sem) and ->core_state check from do_coredump()
    to coredump_wait()

    This simplifies the code and makes the locking symmetrical.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Given that do_coredump() calls put_cred() on exit path, it is a bit ugly
    to do put_cred() + "goto fail" twice, just add the new "fail_creds" label.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • - kill "int dump_count", argv_split(argcp) accepts argcp == NULL.

    - move "int dump_count" under " if (ispipe)" branch, fail_dropcount
    can check ispipe.

    - move "char **helper_argv" as well, change the code to do argv_free()
    right after call_usermodehelper_fns().

    - If call_usermodehelper_fns() fails goto close_fail label instead
    of closing the file by hand.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_coredump() does a lot of file checks after it opens the file or calls
    usermode helper. But all of these checks are only needed in !ispipe case.

    Move this code into the "else" branch and kill the ugly repetitive ispipe
    checks.

    Signed-off-by: Oleg Nesterov
    Cc: David Howells
    Cc: Neil Horman
    Cc: Roland McGrath
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The first patch in this series introduced an init function to the
    call_usermodehelper api so that processes could be customized by caller.
    This patch takes advantage of that fact, by customizing the helper in
    do_coredump to create the pipe and set its core limit to one (for our
    recusrsion check). This lets us clean up the previous uglyness in the
    usermodehelper internals and factor call_usermodehelper out entirely.
    While I'm at it, we can also modify the helper setup to look for a core
    limit value of 1 rather than zero for our recursion check

    Signed-off-by: Neil Horman
    Reviewed-by: Oleg Nesterov
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     

25 May, 2010

1 commit

  • …tion by not migrating temporary stacks

    Page migration requires rmap to be able to find all ptes mapping a page
    at all times, otherwise the migration entry can be instantiated, but it
    is possible to leave one behind if the second rmap_walk fails to find
    the page. If this page is later faulted, migration_entry_to_page() will
    call BUG because the page is locked indicating the page was migrated by
    the migration PTE not cleaned up. For example

    kernel BUG at include/linux/swapops.h:105!
    invalid opcode: 0000 [#1] PREEMPT SMP
    ...
    Call Trace:
    [<ffffffff810e951a>] handle_mm_fault+0x3f8/0x76a
    [<ffffffff8130c7a2>] do_page_fault+0x44a/0x46e
    [<ffffffff813099b5>] page_fault+0x25/0x30
    [<ffffffff8114de33>] load_elf_binary+0x152a/0x192b
    [<ffffffff8111329b>] search_binary_handler+0x173/0x313
    [<ffffffff81114896>] do_execve+0x219/0x30a
    [<ffffffff8100a5c6>] sys_execve+0x43/0x5e
    [<ffffffff8100320a>] stub_execve+0x6a/0xc0
    RIP [<ffffffff811094ff>] migration_entry_wait+0xc1/0x129

    There is a race between shift_arg_pages and migration that triggers this
    bug. A temporary stack is setup during exec and later moved. If
    migration moves a page in the temporary stack and the VMA is then removed
    before migration completes, the migration PTE may not be found leading to
    a BUG when the stack is faulted.

    This patch causes pages within the temporary stack during exec to be
    skipped by migration. It does this by marking the VMA covering the
    temporary stack with an otherwise impossible combination of VMA flags.
    These flags are cleared when the temporary stack is moved to its final
    location.

    [kamezawa.hiroyu@jp.fujitsu.com: idea for having migration skip temporary stacks]
    Signed-off-by: Mel Gorman <mel@csn.ul.ie>
    Reviewed-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
    Reviewed-by: Rik van Riel <riel@redhat.com>
    Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Minchan Kim <minchan.kim@gmail.com>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: Andrea Arcangeli <aarcange@redhat.com>
    Cc: Rik van Riel <riel@redhat.com>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Reviewed-by: KOSAKI Motohiro <kosaki.motohiro@jp.fujitsu.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

12 May, 2010

1 commit

  • Originally, commit d899bf7b ("procfs: provide stack information for
    threads") attempted to introduce a new feature for showing where the
    threadstack was located and how many pages are being utilized by the
    stack.

    Commit c44972f1 ("procfs: disable per-task stack usage on NOMMU") was
    applied to fix the NO_MMU case.

    Commit 89240ba0 ("x86, fs: Fix x86 procfs stack information for threads on
    64-bit") was applied to fix a bug in ia32 executables being loaded.

    Commit 9ebd4eba7 ("procfs: fix /proc//stat stack pointer for kernel
    threads") was applied to fix a bug which had kernel threads printing a
    userland stack address.

    Commit 1306d603f ('proc: partially revert "procfs: provide stack
    information for threads"') was then applied to revert the stack pages
    being used to solve a significant performance regression.

    This patch nearly undoes the effect of all these patches.

    The reason for reverting these is it provides an unusable value in
    field 28. For x86_64, a fork will result in the task->stack_start
    value being updated to the current user top of stack and not the stack
    start address. This unpredictability of the stack_start value makes
    it worthless. That includes the intended use of showing how much stack
    space a thread has.

    Other architectures will get different values. As an example, ia64
    gets 0. The do_fork() and copy_process() functions appear to treat the
    stack_start and stack_size parameters as architecture specific.

    I only partially reverted c44972f1 ("procfs: disable per-task stack usage
    on NOMMU") . If I had completely reverted it, I would have had to change
    mm/Makefile only build pagewalk.o when CONFIG_PROC_PAGE_MONITOR is
    configured. Since I could not test the builds without significant effort,
    I decided to not change mm/Makefile.

    I only partially reverted 89240ba0 ("x86, fs: Fix x86 procfs stack
    information for threads on 64-bit") . I left the KSTK_ESP() change in
    place as that seemed worthwhile.

    Signed-off-by: Robin Holt
    Cc: Stefani Seibold
    Cc: KOSAKI Motohiro
    Cc: Michal Simek
    Cc: Ingo Molnar
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     

07 Mar, 2010

7 commits

  • Modify uid check in do_coredump so as to not apply it in the case of
    pipes.

    This just got noticed in testing. The end of do_coredump validates the
    uid of the inode for the created file against the uid of the crashing
    process to ensure that no one can pre-create a core file with different
    ownership and grab the information contained in the core when they
    shouldn' tbe able to. This causes failures when using pipes for a core
    dumps if the crashing process is not root, which is the uid of the pipe
    when it is created.

    The fix is simple. Since the check for matching uid's isn't relevant for
    pipes (a process can't create a pipe that the uermodehelper code will open
    anyway), we can just just skip it in the event ispipe is non-zero

    Reverts a pipe-affecting change which was accidentally made in

    : commit c46f739dd39db3b07ab5deb4e3ec81e1c04a91af
    : Author: Ingo Molnar
    : AuthorDate: Wed Nov 28 13:59:18 2007 +0100
    : Commit: Linus Torvalds
    : CommitDate: Wed Nov 28 10:58:01 2007 -0800
    :
    : vfs: coredumping fix

    Signed-off-by: Neil Horman
    Cc: Andi Kleen
    Cc: Oleg Nesterov
    Cc: Alan Cox
    Cc: Al Viro
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Horman
     
  • User visible change.

    do_coredump() kills all threads which share the same ->mm but only the
    coredumping process gets the proper exit_code. Other tasks which share
    the same ->mm die "silently" and return status == 0 to parent.

    This is historical behaviour, not actually a bug. But I think Frank
    Heckenbach rightly dislikes the current behaviour. Simple test-case:

    #include
    #include
    #include
    #include

    int main(void)
    {
    int stat;

    if (!fork()) {
    if (!vfork())
    kill(getpid(), SIGQUIT);
    }

    wait(&stat);
    printf("stat=%x\n", stat);
    return 0;
    }

    Before this patch it prints "stat=0" despite the fact the child was killed
    by SIGQUIT. After this patch the output is "stat=3" which obviously makes
    more sense.

    Even with this patch, only the task which originates the coredumping gets
    "|= 0x80" if the core was actually dumped, but at least the coredumping
    signal is visible to do_wait/etc.

    Reported-by: Frank Heckenbach
    Signed-off-by: Oleg Nesterov
    Acked-by: WANG Cong
    Cc: Roland McGrath
    Cc: Neil Horman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Pass mm->flags as a coredump parameter for consistency.

    ---
    1787 if (mm->core_state || !get_dumpable(mm)) { mmap_sem);
    1789 put_cred(cred);
    1790 goto fail;
    1791 }
    1792
    [...]
    1798 if (get_dumpable(mm) == 2) { /* Setuid core dump mode */ fsuid = 0; /* Dump root private */
    1801 }
    ---

    Since dumpable bits are not protected by lock, there is a chance to change
    these bits between (1) and (2).

    To solve this issue, this patch copies mm->flags to
    coredump_params.mm_flags at the beginning of do_coredump() and uses it
    instead of get_dumpable() while dumping core.

    This copy is also passed to binfmt->core_dump, since elf*_core_dump() uses
    dump_filter bits in mm->flags.

    [akpm@linux-foundation.org: fix merge]
    Signed-off-by: Masami Hiramatsu
    Acked-by: Roland McGrath
    Cc: Hidehiro Kawai
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masami Hiramatsu
     
  • Currently we create the initial stack based on the PAGE_SIZE. This is
    unnecessary.

    This creates this initial stack independent of the PAGE_SIZE.

    It also bumps up the number of 4k pages allocated from 20 to 32, to
    align with 64K page systems.

    Signed-off-by: Michael Neuling
    Cc: Helge Deller
    Reviewed-by: KOSAKI Motohiro
    Cc: Americo Wang
    Cc: Anton Blanchard
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     
  • Make sure compiler won't do weird things with limits. E.g. fetching them
    twice may return 2 different values after writable limits are implemented.

    I.e. either use rlimit helpers added in commit 3e10e716abf3 ("resource:
    add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
  • The old anon_vma code can lead to scalability issues with heavily forking
    workloads. Specifically, each anon_vma will be shared between the parent
    process and all its child processes.

    In a workload with 1000 child processes and a VMA with 1000 anonymous
    pages per process that get COWed, this leads to a system with a million
    anonymous pages in the same anon_vma, each of which is mapped in just one
    of the 1000 processes. However, the current rmap code needs to walk them
    all, leading to O(N) scanning complexity for each page.

    This can result in systems where one CPU is walking the page tables of
    1000 processes in page_referenced_one, while all other CPUs are stuck on
    the anon_vma lock. This leads to catastrophic failure for a benchmark
    like AIM7, where the total number of processes can reach in the tens of
    thousands. Real workloads are still a factor 10 less process intensive
    than AIM7, but they are catching up.

    This patch changes the way anon_vmas and VMAs are linked, which allows us
    to associate multiple anon_vmas with a VMA. At fork time, each child
    process gets its own anon_vmas, in which its COWed pages will be
    instantiated. The parents' anon_vma is also linked to the VMA, because
    non-COWed pages could be present in any of the children.

    This reduces rmap scanning complexity to O(1) for the pages of the 1000
    child processes, with O(N) complexity for at most 1/N pages in the system.
    This reduces the average scanning cost in heavily forking workloads from
    O(N) to 2.

    The only real complexity in this patch stems from the fact that linking a
    VMA to anon_vmas now involves memory allocations. This means vma_adjust
    can fail, if it needs to attach a VMA to anon_vma structures. This in
    turn means error handling needs to be added to the calling functions.

    A second source of complexity is that, because there can be multiple
    anon_vmas, the anon_vma linking in vma_adjust can no longer be done under
    "the" anon_vma lock. To prevent the rmap code from walking up an
    incomplete VMA, this patch introduces the VM_LOCK_RMAP VMA flag. This bit
    flag uses the same slot as the NOMMU VM_MAPPED_COPY, with an ifdef in mm.h
    to make sure it is impossible to compile a kernel that needs both symbolic
    values for the same bitflag.

    Some test results:

    Without the anon_vma changes, when AIM7 hits around 9.7k users (on a test
    box with 16GB RAM and not quite enough IO), the system ends up running
    >99% in system time, with every CPU on the same anon_vma lock in the
    pageout code.

    With these changes, AIM7 hits the cross-over point around 29.7k users.
    This happens with ~99% IO wait time, there never seems to be any spike in
    system time. The anon_vma lock contention appears to be resolved.

    [akpm@linux-foundation.org: cleanups]
    Signed-off-by: Rik van Riel
    Cc: KOSAKI Motohiro
    Cc: Larry Woodman
    Cc: Lee Schermerhorn
    Cc: Minchan Kim
    Cc: Andrea Arcangeli
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.

    This patch adds per-thread cache for mm_counter. RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.

    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.

    rough estimation with small benchmark on parallel thread (2threads) shows
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

23 Feb, 2010

1 commit

  • 803bf5ec259941936262d10ecc84511b76a20921 ("fs/exec.c: restrict initial
    stack space expansion to rlimit") attempts to limit the initial stack to
    20*PAGE_SIZE. Unfortunately, in attempting ensure the stack is not
    reduced in size, we ended up not changing the stack at all.

    This size reduction check is not necessary as the expand_stack call does
    this already.

    This caused a regression in UML resulting in most guest processes being
    killed.

    Signed-off-by: Michael Neuling
    Reviewed-by: KOSAKI Motohiro
    Acked-by: WANG Cong
    Cc: Anton Blanchard
    Cc: Oleg Nesterov
    Cc: James Morris
    Cc: Serge Hallyn
    Cc: Benjamin Herrenschmidt
    Cc: Jouni Malinen
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     

12 Feb, 2010

1 commit

  • When reserving stack space for a new process, make sure we're not
    attempting to expand the stack by more than rlimit allows.

    This fixes a bug caused by b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
    variable length argument support") and unmasked by
    fc63cf237078c86214abcb2ee9926d8ad289da9b ("exec: setup_arg_pages() fails
    to return errors").

    This bug means that when limiting the stack to less the 20*PAGE_SIZE (eg.
    80K on 4K pages or 'ulimit -s 79') all processes will be killed before
    they start. This is particularly bad with 64K pages, where a ulimit below
    1280K will kill every process.

    To test, do:

    'ulimit -s 15; ls'

    before and after the patch is applied. Before it's applied, 'ls' should
    be killed. After the patch is applied, 'ls' should no longer be killed.

    A stack limit of 15KB since it's small enough to trigger 20*PAGE_SIZE.
    Also 15KB not a multiple of PAGE_SIZE, which is a trickier case to handle
    correctly with this code.

    4K pages should be fine to test with.

    [kosaki.motohiro@jp.fujitsu.com: cleanup]
    [akpm@linux-foundation.org: cleanup cleanup]
    Signed-off-by: Michael Neuling
    Signed-off-by: KOSAKI Motohiro
    Cc: Americo Wang
    Cc: Anton Blanchard
    Cc: Oleg Nesterov
    Cc: James Morris
    Cc: Ingo Molnar
    Cc: Serge Hallyn
    Cc: Benjamin Herrenschmidt
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Neuling
     

03 Feb, 2010

1 commit

  • Commit 221af7f87b9 ("Split 'flush_old_exec' into two functions") split
    the function at the point of no return - ie right where there were no
    more error cases to check. That made sense from a technical standpoint,
    but when we then also combined it with the actual personality setting
    going in between flush_old_exec() and setup_new_exec(), it needs to be a
    bit more careful.

    In particular, we need to make sure that we really flush the old
    personality bits in the 'flush' stage, rather than later in the 'setup'
    stage, since otherwise we might be flushing the _new_ personality state
    that we're just setting up.

    So this moves the flags and personality flushing (and 'flush_thread()',
    which is the arch-specific function that generally resets lazy FP state
    etc) of the old process into flush_old_exec(), so that it doesn't affect
    any state that execve() is setting up for the new process environment.

    This was reported by Michal Simek as breaking his Microblaze qemu
    environment.

    Reported-and-tested-by: Michal Simek
    Cc: Peter Anvin
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

30 Jan, 2010

1 commit

  • 'flush_old_exec()' is the point of no return when doing an execve(), and
    it is pretty badly misnamed. It doesn't just flush the old executable
    environment, it also starts up the new one.

    Which is very inconvenient for things like setting up the new
    personality, because we want the new personality to affect the starting
    of the new environment, but at the same time we do _not_ want the new
    personality to take effect if flushing the old one fails.

    As a result, the x86-64 '32-bit' personality is actually done using this
    insane "I'm going to change the ABI, but I haven't done it yet" bit
    (TIF_ABI_PENDING), with SET_PERSONALITY() not actually setting the
    personality, but just the "pending" bit, so that "flush_thread()" can do
    the actual personality magic.

    This patch in no way changes any of that insanity, but it does split the
    'flush_old_exec()' function up into a preparatory part that can fail
    (still called flush_old_exec()), and a new part that will actually set
    up the new exec environment (setup_new_exec()). All callers are changed
    to trivially comply with the new world order.

    Signed-off-by: H. Peter Anvin
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

18 Dec, 2009

1 commit

  • Introduce coredump parameter data structure (struct coredump_params) to
    simplify binfmt->core_dump() arguments.

    Signed-off-by: Masami Hiramatsu
    Suggested-by: Ingo Molnar
    Cc: Hidehiro Kawai
    Cc: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masami Hiramatsu