18 Jun, 2011

1 commit

  • ____call_usermodehelper() now erases any credentials set by the
    subprocess_info::init() function. The problem is that commit
    17f60a7da150 ("capabilites: allow the application of capability limits
    to usermode helpers") creates and commits new credentials with
    prepare_kernel_cred() after the call to the init() function. This wipes
    all keyrings after umh_keys_init() is called.

    The best way to deal with this is to put the init() call just prior to
    the commit_creds() call, and pass the cred pointer to init(). That
    means that umh_keys_init() and suchlike can modify the credentials
    _before_ they are published and potentially in use by the rest of the
    system.

    Wiping the keyrings prevents request_key() from working: it can no
    longer pass the session keyring it set up with the authorisation token
    to /sbin/request-key, so the latter can't assume the authority to
    instantiate the key. This causes the in-kernel DNS resolver to fail
    with ENOKEY unconditionally. (A schematic sketch of the corrected
    ordering follows this entry.)

    Signed-off-by: David Howells
    Acked-by: Eric Paris
    Tested-by: Jeff Layton
    Signed-off-by: Linus Torvalds

    David Howells
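
    A schematic user-space sketch of the corrected ordering described
    above. Everything below is an illustrative stand-in, not the kernel's
    actual types or functions; the point is only that init() gets to set
    up the new credentials before commit_creds() publishes them, so
    nothing it installs -- such as a session keyring -- is wiped:

    #include <stdio.h>

    struct cred { int session_keyring; };

    /* stand-in for the kernel's umh_keys_init() callback */
    static int umh_keys_init(void *info, struct cred *new)
    {
        (void)info;
        new->session_keyring = 42;       /* e.g. install a keyring */
        return 0;
    }

    /* stand-in for commit_creds(): after this the creds are published */
    static void commit_creds(struct cred *new)
    {
        printf("published creds, session keyring = %d\n",
               new->session_keyring);
    }

    int main(void)
    {
        struct cred new_cred = { 0 };

        umh_keys_init(NULL, &new_cred);  /* init() runs first...      */
        commit_creds(&new_cred);         /* ...then the creds go live */
        return 0;
    }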
     

16 Jun, 2011

2 commits

  • This reverts commit 7f81c8890c15a10f5220bebae3b6dfae4961962a.

    It turns out that the check cannot actually be done at build time on
    x86-64 UML, which does some seriously crazy stuff with VM_STACK_FLAGS.

    The VM_STACK_FLAGS define depends on the arch-supplied
    VM_STACK_DEFAULT_FLAGS value, and on x86-64 UML we have

    arch/um/sys-x86_64/shared/sysdep/vm-flags.h:

    #define VM_STACK_DEFAULT_FLAGS \
    (test_thread_flag(TIF_IA32) ? vm_stack_flags32 : vm_stack_flags)

    #define VM_STACK_DEFAULT_FLAGS vm_stack_flags

    (yes, seriously: two different #define's for that thing, with the first
    one being inside an "#ifdef TIF_IA32")

    It's possible that it is UML that should just be fixed in this area, but
    for now let's just undo the (very small) optimization.

    Reported-by: Randy Dunlap
    Acked-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Richard Weinberger
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Commit a8bef8ff6ea1 ("mm: migration: avoid race between shift_arg_pages()
    and rmap_walk() during migration by not migrating temporary stacks")
    introduced a BUG_ON() to ensure that VM_STACK_FLAGS and
    VM_STACK_INCOMPLETE_SETUP do not overlap. The check is a compile-time
    one, so BUILD_BUG_ON is more appropriate (a minimal sketch of the
    difference follows this entry).

    Signed-off-by: Michal Hocko
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
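
    A minimal stand-alone sketch of the difference. The macro below is a
    simplified stand-in for the kernel's BUILD_BUG_ON(), and the two flag
    values are made up for the example -- they are not the real VM_* bits:

    #include <stdio.h>

    /* yields a negative array size, i.e. a compile error, if cond holds */
    #define BUILD_BUG_ON(cond)  ((void)sizeof(char[1 - 2 * !!(cond)]))

    #define VM_STACK_FLAGS            0x00000100UL  /* hypothetical value */
    #define VM_STACK_INCOMPLETE_SETUP 0x00000200UL  /* hypothetical value */

    int main(void)
    {
        /*
         * If the two masks ever overlapped, this line would refuse to
         * compile instead of only tripping a BUG_ON() at run time.
         */
        BUILD_BUG_ON(VM_STACK_FLAGS & VM_STACK_INCOMPLETE_SETUP);
        printf("the flags do not overlap\n");
        return 0;
    }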
     

10 Jun, 2011

1 commit

    Unconditionally changing the address limit to USER_DS and not restoring
    it to its old value in the error path is wrong because it prevents us
    from using kernel memory on repeated calls to this function. In fact,
    this breaks the fallback through the hard-coded paths to the init
    program: if the first candidate fails to load, none of the later ones
    can ever succeed.

    With this patch applied switching to USER_DS is delayed until the point
    of no return is reached which makes it possible to have a multi-arch
    rootfs with one arch specific init binary for each of the (hard coded)
    probed paths.

    Since the address limit is already set to USER_DS by the time
    start_thread() is invoked, this redundancy can be safely removed.

    Signed-off-by: Mathias Krause
    Cc: Al Viro
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Mathias Krause
     

27 May, 2011

2 commits

    Now that exe_file is not proc-FS dependent, we can use it to name the
    core file. So we add a %E pattern for core file name creation which
    extracts the path from mm_struct->exe_file, converts slashes to
    exclamation marks, and pastes the result into the core file name
    itself.

    This is useful for environments where binary names are longer than 16
    characters (the current->comm limitation), where there are binaries
    with the same name but in different paths, and where the binary itself
    changes its current->comm after exec.

    So by doing the following (the shell prompt is shown as "$" because a
    leading "#" would be treated as a git comment):

    $ sysctl kernel.core_pattern='core.%p.%e.%E'
    $ ln /bin/cat cat45678901234567890
    $ ./cat45678901234567890
    ^Z
    $ rm cat45678901234567890
    $ fg
    ^\Quit (core dumped)
    $ ls core*

    we now get:

    core.2434.cat456789012345.!root!cat45678901234567890 (deleted)

    Signed-off-by: Jiri Slaby
    Cc: Al Viro
    Cc: Alan Cox
    Reviewed-by: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     
    Setup and cleanup of mm_struct->exe_file is currently done in fs/proc/.
    This was because exe_file was needed only for /proc/<pid>/exe. Since we
    will need the exe_file functionality also for core dumps (so the core
    name can contain the full binary path), build this functionality into
    the kernel unconditionally.

    To achieve that, move it out of proc FS to kernel/, where it in fact
    belongs. By doing that we can make dup_mm_exe_file static. Also we can
    drop the linux/proc_fs.h inclusion in fs/exec.c and kernel/fork.c.

    Signed-off-by: Jiri Slaby
    Cc: Alexander Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Slaby
     

25 May, 2011

2 commits

  • Rework the existing mmu_gather infrastructure.

    The direct purpose of these patches was to allow preemptible mmu_gather,
    but even without that I think these patches provide an improvement to the
    status quo.

    The first 9 patches rework the mmu_gather infrastructure. For review
    purposes I've split them into generic and per-arch patches, with the
    last of those being a generic cleanup.

    The next patch provides generic RCU page-table freeing, and the
    follow-up is a patch converting s390 to use this. I've also got 4
    patches from DaveM lined up (not included in this series) that use this
    to implement gup_fast() for sparc64.

    Then there is one patch that extends the generic mmu_gather batching.

    After that follow the mm preemptibility patches; these make part of the
    mm a lot more preemptible. They convert i_mmap_lock and anon_vma->lock
    to mutexes, which together with the mmu_gather rework makes mmu_gather
    preemptible as well.

    Making i_mmap_lock a mutex also enables a clean-up of the truncate code.

    This also allows for preemptible mmu_notifiers, something that XPMEM I
    think wants.

    Furthermore, it removes the new and universally detested unmap_mutex.

    This patch:

    Remove the first obstacle towards a fully preemptible mmu_gather.

    The current scheme assumes mmu_gather is always done with preemption
    disabled and uses per-cpu storage for the page batches. Change this to
    try to allocate a page for batching and, in case of failure, fall back
    to a small on-stack array so we can still make some progress (see the
    sketch after this entry).

    Preemptible mmu_gather is desired in general and usable once i_mmap_lock
    becomes a mutex. Doing it before the mutex conversion saves us from
    having to rework the code by moving the mmu_gather bits inside the
    pte_lock.

    Also avoid flushing the TLB batches from under the pte lock; this is
    useful even without the i_mmap_lock conversion, as it significantly
    reduces pte lock hold times.

    [akpm@linux-foundation.org: fix comment tpyo]
    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Reviewed-by: KAMEZAWA Hiroyuki
    Acked-by: Hugh Dickins
    Acked-by: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
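
    A user-space sketch of the batching fallback described in "This patch"
    above; the names here are illustrative, not the kernel's mmu_gather
    structures:

    #include <stdio.h>
    #include <stdlib.h>

    #define PAGE_SIZE   4096
    #define ON_STACK_NR 8

    struct batch {
        unsigned int nr, max;
        void **pages;
    };

    static void batch_init(struct batch *b, void **stack_slots)
    {
        void *page = malloc(PAGE_SIZE);   /* stand-in for a page allocation */

        if (page) {
            b->pages = page;
            b->max = PAGE_SIZE / sizeof(void *);
        } else {
            b->pages = stack_slots;       /* degrade to the on-stack array */
            b->max = ON_STACK_NR;
        }
        b->nr = 0;
    }

    int main(void)
    {
        void *stack_slots[ON_STACK_NR];
        struct batch b;

        batch_init(&b, stack_slots);
        printf("batching up to %u pages per flush\n", b.max);
        if (b.pages != stack_slots)
            free(b.pages);
        return 0;
    }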
     
  • Currently we have expand_upwards exported while expand_downwards is
    accessible only via expand_stack or expand_stack_downwards.

    check_stack_guard_page is a nice example of the asymmetry. It uses
    expand_stack for VM_GROWSDOWN while expand_upwards is called for
    VM_GROWSUP case.

    Let's clean this up by exporting both functions and making those names
    consistent. Let's use expand_{upwards,downwards} because expanding
    doesn't always involve stack manipulation (an example is
    ia64_do_page_fault, which uses expand_upwards for register backing
    store expansion). expand_downwards has to be defined for both
    CONFIG_STACK_GROWS{UP,DOWN} because get_arg_page calls the downwards
    version in the early process-initialization phase for the growsup
    configuration.

    Signed-off-by: Michal Hocko
    Acked-by: Hugh Dickins
    Cc: James Bottomley
    Cc: "Luck, Tony"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     

24 May, 2011

1 commit

  • * 'tty-next' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty-2.6: (48 commits)
    serial: 8250_pci: add support for Cronyx Omega PCI multiserial board.
    tty/serial: Fix break handling for PORT_TEGRA
    tty/serial: Add explicit PORT_TEGRA type
    n_tracerouter and n_tracesink ldisc additions.
    Intel PTI implementaiton of MIPI 1149.7.
    Kernel documentation for the PTI feature.
    export kernel call get_task_comm().
    tty: Remove to support serial for S5P6442
    pch_phub: Support new device ML7223
    8250_pci: Add support for the Digi/IBM PCIe 2-port Adapter
    ASoC: Update cx20442 for TTY API change
    pch_uart: Support new device ML7223 IOH
    parport: Use request_muxed_region for IT87 probe and lock
    tty/serial: add support for Xilinx PS UART
    n_gsm: Use print_hex_dump_bytes
    drivers/tty/moxa.c: Put correct tty value
    TTY: tty_io, annotate locking functions
    TTY: serial_core, remove superfluous set_task_state
    TTY: serial_core, remove invalid test
    Char: moxa, fix locking in moxa_write
    ...

    Fix up trivial conflicts in drivers/bluetooth/hci_ldisc.c and
    drivers/tty/serial/Makefile.

    I did the hci_ldisc thing as an evil merge, cleaning things up.

    Linus Torvalds
     

23 May, 2011

1 commit


14 May, 2011

1 commit

    This allows drivers that call this function to be built as modules.
    Otherwise, a driver interested in this type of functionality has to
    implement its own get_task_comm() call, causing code duplication in the
    Linux source tree.

    Signed-off-by: J Freyensee
    Acked-by: David Rientjes
    Signed-off-by: Greg Kroah-Hartman

    J Freyensee
     

09 Apr, 2011

4 commits

  • Add the comment to explain acct_arg_size().

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro

    Oleg Nesterov
     
    Add the appropriate members to struct user_arg_ptr and teach
    get_user_arg_ptr() to handle the is_compat == true case correctly (a
    user-space sketch of the idea follows this entry).

    This allows us to remove the compat_do_execve() code from fs/compat.c
    and reimplement compat_do_execve() as a trivial wrapper on top of
    do_execve_common(is_compat => true).

    In fact, this fixes another (minor) bug. "compat_uptr_t str" can
    overflow after "str += len" in compat_copy_strings() if a 64-bit
    application execs via sys32_execve().

    Unexport acct_arg_size() and get_arg_page(), fs/compat.c doesn't
    need them any longer.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro

    Oleg Nesterov
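
    A user-space sketch of the user_arg_ptr idea from this series; the
    definitions below are illustrative stand-ins rather than the kernel's
    actual struct or helper, and the compat branch just casts for the sake
    of the demo instead of doing a real compat user-space read:

    #include <stdint.h>
    #include <stdio.h>

    typedef uint32_t compat_uptr_t;     /* stand-in for the kernel type */

    struct user_arg_ptr {
        int is_compat;
        union {
            const char *const *native;
            const compat_uptr_t *compat;
        } ptr;
    };

    static const char *get_user_arg_ptr(struct user_arg_ptr argv, int nr)
    {
        if (argv.is_compat)
            return (const char *)(uintptr_t)argv.ptr.compat[nr];
        return argv.ptr.native[nr];
    }

    int main(void)
    {
        const char *const native_argv[] = { "ls", "-l", NULL };
        struct user_arg_ptr argv = {
            .is_compat = 0,
            .ptr.native = native_argv,
        };

        for (int i = 0; get_user_arg_ptr(argv, i); i++)
            printf("argv[%d] = %s\n", i, get_user_arg_ptr(argv, i));
        return 0;
    }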
     
  • No functional changes, preparation.

    Introduce struct user_arg_ptr, change do_execve() paths to use it
    instead of "char __user * const __user *argv".

    This makes the argv/envp arguments opaque, so we are ready to handle
    the compat case, which needs argv pointing to compat_uptr_t.

    Suggested-by: Linus Torvalds
    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro

    Oleg Nesterov
     
  • Introduce get_user_arg_ptr() helper, convert count() and copy_strings()
    to use it.

    No functional changes, preparation. This helper is trivial, it just
    reads the pointer from argv/envp user-space array.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: KOSAKI Motohiro
    Tested-by: KOSAKI Motohiro

    Oleg Nesterov
     

23 Mar, 2011

1 commit

  • Currently task->signal->group_stop_count is used to decide whether to
    stop for group stop. However, if there is a task in the group which
    is taking a long time to stop, other tasks which are continued by
    ptrace would repeatedly stop for the same group stop until the group
    stop is complete.

    Conversely, if a ptraced task is in TASK_TRACED state, the debugger
    won't get notified of group stops which is inconsistent compared to
    the ptraced task in any other state.

    This patch introduces GROUP_STOP_PENDING, which tracks whether a task
    has yet to stop for the group stop in progress. The flag is set when a
    group stop starts, cleared when the task stops for the first time for
    the group stop, and consulted whenever we need to determine whether the
    task should participate in a group stop. Note that tasks in TASK_TRACED
    now also participate in group stop.

    This results in the following behavior changes.

    * For a single group stop, a ptracer would see at most one stop
    reported.

    * A ptracee in TASK_TRACED now also participates in group stop and the
    tracer would get the notification. However, as a ptraced task could
    be in TASK_STOPPED state or any ptrace trap could consume group
    stop, the notification may still be missing. These will be
    addressed with further patches.

    * A ptracee may start a group stop while one is still in progress if
    the tracer lets it continue with stop signal delivery. The group stop
    code handles this correctly.

    Oleg:

    * Spotted that a task might skip signal check even when its
    GROUP_STOP_PENDING is set. Fixed by updating
    recalc_sigpending_tsk() to check GROUP_STOP_PENDING instead of
    group_stop_count.

    * Pointed out that task->group_stop should be cleared whenever
    task->signal->group_stop_count is cleared. Fixed accordingly.

    * Pointed out the behavior inconsistency between TASK_TRACED and
    RUNNING and the last behavior change.

    Signed-off-by: Tejun Heo
    Acked-by: Oleg Nesterov
    Cc: Roland McGrath

    Tejun Heo
     

21 Mar, 2011

1 commit

  • Hi,

    I was backporting the coredump-over-pipe feature and noticed this small
    typo; I wish I had something bigger to contribute...

    The function is called umh_pipe_setup not uhm_pipe_setup.

    Signed-off-by: Holger Hans Peter Freyther
    Signed-off-by: Al Viro

    Holger Hans Peter Freyther
     

14 Mar, 2011

1 commit

    Take the calculation of open_flags from the open(2) arguments into a
    new helper in fs/open.c, move filp_open() over there, have it and
    do_sys_open() use that helper, and switch the exec.c callers of
    do_filp_open() to an explicit (and constant) struct open_flags.

    Signed-off-by: Al Viro

    Al Viro
     

03 Feb, 2011

1 commit

    FMODE_EXEC is a constant of type fmode_t but was used with normal
    integer constants. This results in the following warnings from sparse.
    Fix it using the new __FMODE_EXEC macro.

    fs/exec.c:116:58: warning: restricted fmode_t degrades to integer
    fs/exec.c:689:58: warning: restricted fmode_t degrades to integer
    fs/fcntl.c:777:9: warning: restricted fmode_t degrades to integer

    Signed-off-by: Namhyung Kim
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Namhyung Kim
     

16 Dec, 2010

1 commit

  • The install_special_mapping routine (used, for example, to setup the
    vdso) skips the security check before insert_vm_struct, allowing a local
    attacker to bypass the mmap_min_addr security restriction by limiting
    the available pages for special mappings.

    bprm_mm_init() also skips the check, and although I don't think this can
    be used to bypass any restrictions, I don't see any reason not to have
    the security check.

    $ uname -m
    x86_64
    $ cat /proc/sys/vm/mmap_min_addr
    65536
    $ cat install_special_mapping.s
    section .bss
    resb BSS_SIZE
    section .text
    global _start
    _start:
    mov eax, __NR_pause
    int 0x80
    $ nasm -D__NR_pause=29 -DBSS_SIZE=0xfffed000 -f elf -o install_special_mapping.o install_special_mapping.s
    $ ld -m elf_i386 -Ttext=0x10000 -Tbss=0x11000 -o install_special_mapping install_special_mapping.o
    $ ./install_special_mapping &
    [1] 14303
    $ cat /proc/14303/maps
    0000f000-00010000 r-xp 00000000 00:00 0 [vdso]
    00010000-00011000 r-xp 00001000 00:19 2453665 /home/taviso/install_special_mapping
    00011000-ffffe000 rwxp 00000000 00:00 0 [stack]

    It's worth noting that Red Hat are shipping with mmap_min_addr set to
    4096.

    Signed-off-by: Tavis Ormandy
    Acked-by: Kees Cook
    Acked-by: Robert Swiecki
    [ Changed to not drop the error code - akpm ]
    Reviewed-by: James Morris
    Signed-off-by: Linus Torvalds

    Tavis Ormandy
     

01 Dec, 2010

2 commits

    Note: this patch targets 2.6.37 and tries to be as simple as possible.
    That is why it adds more copy-and-paste horror into fs/compat.c and
    uglifies fs/exec.c; this will be cleaned up later.

    compat_copy_strings() plays with bprm->vma/mm directly and thus has
    two problems: it lacks the RLIMIT_STACK check, and the argv/envp memory
    is not visible to the OOM killer.

    Export acct_arg_size() and get_arg_page(), change compat_copy_strings()
    to use get_arg_page(), change compat_do_execve() to do acct_arg_size(0)
    as do_execve() does.

    Add the fatal_signal_pending/cond_resched checks into compat_count() and
    compat_copy_strings(), this matches the code in fs/exec.c and certainly
    makes sense.

    Signed-off-by: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Brad Spengler published a local memory-allocation DoS that
    evades the OOM-killer (though not the virtual memory RLIMIT):
    http://www.grsecurity.net/~spender/64bit_dos.c

    execve()->copy_strings() can allocate a lot of memory, but this is not
    visible to the OOM killer; nobody can see the nascent bprm->mm and take
    it into account.

    With this patch get_arg_page() increments current's MM_ANONPAGES
    counter every time we allocate a new page for argv/envp. When
    do_execve() succeeds or fails, we change this counter back (see the
    sketch after this entry).

    Technically this is not 100% correct, we can't know if the new
    page is swapped out and turn MM_ANONPAGES into MM_SWAPENTS, but
    I don't think this really matters and everything becomes correct
    once exec changes ->mm or fails.

    Reported-by: Brad Spengler
    Reviewed-and-discussed-by: KOSAKI Motohiro
    Signed-off-by: Oleg Nesterov
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
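
    A user-space sketch of the accounting idea described above; the
    structures and field names below are illustrative stand-ins, not the
    kernel's:

    #include <stdio.h>

    struct mm     { long anon_pages; };
    struct binprm { struct mm *mm; unsigned long vma_pages; };

    /* charge/uncharge the argv/envp pages against the mm counter */
    static void acct_arg_size(struct binprm *bprm, unsigned long pages)
    {
        long diff = (long)(pages - bprm->vma_pages);

        if (!diff)
            return;
        bprm->vma_pages = pages;
        bprm->mm->anon_pages += diff;   /* now visible to an OOM scan */
    }

    int main(void)
    {
        struct mm mm = { 0 };
        struct binprm bprm = { .mm = &mm, .vma_pages = 0 };

        acct_arg_size(&bprm, 128);      /* 128 pages of strings so far */
        printf("charged %ld pages\n", mm.anon_pages);

        acct_arg_size(&bprm, 0);        /* exec succeeded or failed */
        printf("charged %ld pages\n", mm.anon_pages);
        return 0;
    }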
     

28 Oct, 2010

3 commits

    Presently do_execve() turns PF_KTHREAD off before
    search_binary_handler(). This has a theoretical risk of PF_KTHREAD
    getting lost. We don't have to turn PF_KTHREAD off in the ENOEXEC case.

    This patch moves this flag modification to after the executable file
    has been found.

    This is only a theoretical issue because kthreads do not call
    do_execve() directly, but fixing it is better anyway.

    Signed-off-by: KOSAKI Motohiro
    Acked-by: Roland McGrath
    Acked-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
    We hit a parameter-truncation issue; consider the following:
    > echo "|/root/core_pattern_pipe_test %p /usr/libexec/blah-blah-blah \
    %s %c %p %u %g 11 12345678901234567890123456789012345678 %t" > \
    /proc/sys/kernel/core_pattern

    This is okay because the string is shorter than CORENAME_MAX_SIZE, and
    "cat /proc/sys/kernel/core_pattern" shows the whole string. But after
    running the core_pattern_pipe_test program from the man page, we found
    the last parameter was truncated as below:

    argc[10]=

    The root cause is that core_pattern allows % specifiers, which need to
    be replaced at parse time, but the replacement may expand the string
    beyond CORENAME_MAX_SIZE. So if the last parameter is a % specifier,
    the replacement code's snprintf(out_ptr, out_end - out_ptr, ...) can
    end up writing outside the corename array (see the sketch after this
    entry).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Xiaotian Feng
    Cc: Alexander Viro
    Cc: Oleg Nesterov
    Cc: KOSAKI Motohiro
    Reviewed-by: Neil Horman
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xiaotian Feng
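
    A user-space illustration of the snprintf() pitfall behind this, not
    the kernel code itself; CORENAME_MAX_SIZE is shrunk so the problem is
    easy to see. snprintf() returns the length the expansion *wanted*, so
    naively advancing the output pointer by that return value pushes it
    past the end of the buffer, after which "out_end - out_ptr" goes
    negative and, converted to size_t, becomes a huge bound:

    #include <stdio.h>

    #define CORENAME_MAX_SIZE 16        /* tiny on purpose */

    int main(void)
    {
        char corename[CORENAME_MAX_SIZE];
        char *out_ptr = corename;
        char *out_end = corename + CORENAME_MAX_SIZE;
        int rc;

        /* pretend a %-specifier expanded to a long path */
        rc = snprintf(out_ptr, out_end - out_ptr, "%s",
                      "!usr!libexec!blah-blah-blah");
        printf("expansion wanted %d bytes, only %td were left\n",
               rc, out_end - out_ptr);

        /*
         * A naive "out_ptr += rc" would now point past out_end, so the
         * next snprintf(out_ptr, out_end - out_ptr, ...) would be handed
         * a bogus (huge) size and write outside corename[].
         */
        return 0;
    }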
     
    Oleg Nesterov pointed out that we have to prevent multiple threads from
    being inside exec itself, and that we can reuse ->cred_guard_mutex for
    it. Concurrent execve() has no value anyway.

    Let's move ->cred_guard_mutex from task_struct to signal_struct. It
    then naturally prevents multiple threads from being inside exec.

    Signed-off-by: KOSAKI Motohiro
    Reviewed-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

27 Oct, 2010

1 commit

    It's pointless to kill a task if another thread sharing its mm cannot
    be killed to allow future memory freeing. A subsequent patch will
    prevent kills in such cases, but first it's necessary to have a way of
    flagging a task that shares memory with an OOM_DISABLE task without
    incurring an additional tasklist scan, which would make
    select_bad_process() an O(n^2) function.

    This patch adds an atomic counter to struct mm_struct that follows how
    many threads attached to it have an oom_score_adj of OOM_SCORE_ADJ_MIN.
    They cannot be killed by the kernel, so their memory cannot be freed in
    oom conditions.

    This only requires task_lock() on the task that we're operating on; it
    does not require mm->mmap_sem, since task_lock() pins the mm and the
    operation is atomic.

    [rientjes@google.com: changelog and sys_unshare() code]
    [rientjes@google.com: protect oom_disable_count with task_lock in fork]
    [rientjes@google.com: use old_mm for oom_disable_count in exec]
    Signed-off-by: Ying Han
    Signed-off-by: David Rientjes
    Cc: KAMEZAWA Hiroyuki
    Cc: KOSAKI Motohiro
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     

15 Oct, 2010

2 commits

    If you build a.out support as a module, you'll want these exported.

    Reported-by: Tetsuo Handa
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     
  • Tony Luck reports that the addition of the access_ok() check in commit
    0eead9ab41da ("Don't dump task struct in a.out core-dumps") broke the
    ia64 compile due to missing the necessary header file includes.

    Rather than add yet another include to make everything happy, just
    uninline the silly core dump helper functions and move the bodies to
    fs/exec.c where they make a lot more sense.
    bodies to fs/exec.c where they make a lot more sense.

    dump_seek() in particular was too big to be an inline function anyway,
    and none of them are in any way performance-critical. And we really
    don't need to mess up our include file headers more than they already
    are.

    Reported-and-tested-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

10 Sep, 2010

3 commits

  • An execve with a very large total of argument/environment strings
    can take a really long time in the execve system call. It runs
    uninterruptibly to count and copy all the strings. This change
    makes it abort the exec quickly if sent a SIGKILL.

    Note that this is the conservative change, to interrupt only for
    SIGKILL, by using fatal_signal_pending(). It would be perfectly
    correct semantics to let any signal interrupt the string-copying in
    execve, i.e. use signal_pending() instead of fatal_signal_pending().
    We'll save that change for later, since it could have user-visible
    consequences, such as a timer set too aggressively making it so that an
    execve can never complete, even though it always happened to work
    before. (A user-space sketch of the early-abort check follows this
    entry.)

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
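
    A rough user-space analogue of the early-abort check. SIGKILL cannot
    be caught in user space, so SIGTERM stands in for "fatal signal
    pending" here; the kernel side simply tests fatal_signal_pending()
    inside its counting/copying loops:

    #include <signal.h>
    #include <stdio.h>
    #include <string.h>

    static volatile sig_atomic_t fatal_pending;

    static void on_term(int sig)
    {
        (void)sig;
        fatal_pending = 1;
    }

    int main(void)
    {
        static char src[1 << 20], dst[1 << 20];

        signal(SIGTERM, on_term);
        memset(src, 'x', sizeof(src));

        for (size_t off = 0; off < sizeof(src); off += 4096) {
            if (fatal_pending) {        /* bail out instead of finishing */
                fprintf(stderr, "aborting copy early\n");
                return 1;
            }
            memcpy(dst + off, src + off, 4096);
        }
        return 0;
    }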
     
  • This adds a preemption point during the copying of the argument and
    environment strings for execve, in copy_strings(). There is already
    a preemption point in the count() loop, so this doesn't add any new
    points in the abstract sense.

    When the total argument+environment strings are very large, the time
    spent copying them can be much more than a normal user time slice.
    So this change improves the interactivity of the rest of the system
    when one process is doing an execve with very large arguments.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • The CONFIG_STACK_GROWSDOWN variant of setup_arg_pages() does not
    check the size of the argument/environment area on the stack.
    When it is unworkably large, shift_arg_pages() hits its BUG_ON.
    This is exploitable with a very large RLIMIT_STACK limit, to
    create a crash pretty easily.

    Check that the initial stack is not too large to make it possible
    to map in any executable. We're not checking that the actual
    executable (or interpreter, for binfmt_elf) will fit. So those
    mappings might clobber part of the initial stack mapping. But
    that is just userland lossage that userland made happen, not a
    kernel problem.

    Signed-off-by: Roland McGrath
    Reviewed-by: KOSAKI Motohiro
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

19 Aug, 2010

1 commit

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    fs: brlock vfsmount_lock
    fs: scale files_lock
    lglock: introduce special lglock and brlock spin locks
    tty: fix fu_list abuse
    fs: cleanup files_lock locking
    fs: remove extra lookup in __lookup_hash
    fs: fs_struct rwlock to spinlock
    apparmor: use task path helpers
    fs: dentry allocation consolidation
    fs: fix do_lookup false negative
    mbcache: Limit the maximum number of cache entries
    hostfs ->follow_link() braino
    hostfs: dumb (and usually harmless) tpyo - strncpy instead of strlcpy
    remove SWRITE* I/O types
    kill BH_Ordered flag
    vfs: update ctime when changing the file's permission by setfacl
    cramfs: only unlock new inodes
    fix reiserfs_evict_inode end_writeback second call

    Linus Torvalds
     

18 Aug, 2010

2 commits

  • fs: fs_struct rwlock to spinlock

    struct fs_struct.lock is an rwlock with the read-side used to protect root and
    pwd members while taking references to them. Taking a reference to a path
    typically requires just 2 atomic ops, so the critical section is very small.
    Parallel read-side operations would have cacheline contention on the lock, the
    dentry, and the vfsmount cachelines, so the rwlock is unlikely to ever give a
    real parallelism increase.

    Replace it with a spinlock to avoid one or two atomic operations in the
    typical path-lookup fastpath.

    Signed-off-by: Nick Piggin
    Signed-off-by: Al Viro

    Nick Piggin
     
  • Make do_execve() take a const filename pointer so that kernel_execve() compiles
    correctly on ARM:

    arch/arm/kernel/sys_arm.c:88: warning: passing argument 1 of 'do_execve' discards qualifiers from pointer target type

    This also requires the argv and envp arguments to be consted twice, once for
    the pointer array and once for the strings the array points to. This is
    because do_execve() passes a pointer to the filename (now const) to
    copy_strings_kernel(). A simpler alternative would be to cast the filename
    pointer in do_execve() when it's passed to copy_strings_kernel().

    do_execve() may not change any of the strings it is passed as part of
    the argv or envp lists, as some of them are in .rodata, so marking
    these strings as const should be fine (see the sketch after this
    entry).

    Furthermore, kernel_execve() and sys_execve() need to be changed to
    match.

    This has been test built on x86_64, frv, arm and mips.

    Signed-off-by: David Howells
    Tested-by: Ralf Baechle
    Acked-by: Russell King
    Signed-off-by: Linus Torvalds

    David Howells
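
    A small sketch of the "consted twice" point; fake_execve() is a
    hypothetical stand-in for the do_execve() prototype, not the kernel
    function:

    #include <stdio.h>

    /* both the pointer array and the strings it points to are const */
    static int fake_execve(const char *filename,
                           const char *const argv[],
                           const char *const envp[])
    {
        (void)envp;
        printf("would exec %s with argv[0]=%s\n", filename, argv[0]);
        return 0;
    }

    int main(void)
    {
        const char *argv[] = { "/bin/true", NULL };
        const char *envp[] = { "HOME=/", NULL };

        /* no qualifiers are discarded here, so no warning is emitted */
        return fake_execve("/bin/true", argv, envp);
    }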
     

11 Aug, 2010

1 commit

  • * 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits)
    fanotify: use both marks when possible
    fsnotify: pass both the vfsmount mark and inode mark
    fsnotify: walk the inode and vfsmount lists simultaneously
    fsnotify: rework ignored mark flushing
    fsnotify: remove global fsnotify groups lists
    fsnotify: remove group->mask
    fsnotify: remove the global masks
    fsnotify: cleanup should_send_event
    fanotify: use the mark in handler functions
    audit: use the mark in handler functions
    dnotify: use the mark in handler functions
    inotify: use the mark in handler functions
    fsnotify: send fsnotify_mark to groups in event handling functions
    fsnotify: Exchange list heads instead of moving elements
    fsnotify: srcu to protect read side of inode and vfsmount locks
    fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called
    fsnotify: use _rcu functions for mark list traversal
    fsnotify: place marks on object in order of group memory address
    vfs/fsnotify: fsnotify_close can delay the final work in fput
    fsnotify: store struct file not struct path
    ...

    Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.

    Linus Torvalds
     

08 Aug, 2010

1 commit


28 Jul, 2010

1 commit

    fanotify, the upcoming notification system, actually needs a struct
    path so it can do opens in the context of listeners, and it needs a
    file so it can get f_flags from the original process. Close was the
    only operation that was already passing a struct file to the
    notification hook. This patch passes a file for access, modify, and
    open as well, as they are easily available at these hooks.

    Signed-off-by: Eric Paris

    Eric Paris
     

10 Jul, 2010

1 commit

    core_pattern is not actually protected by the BKL and hasn't been ever
    since we introduced procfs support for sysctl -- a _long_ time ago.
    Don't take the BKL here either.

    Also nothing inside do_coredump appears to require bkl
    protection.

    Signed-off-by: Arnd Bergmann
    [ remove smp_lock.h headers ]
    Signed-off-by: Frederic Weisbecker

    Arnd Bergmann
     

09 Jun, 2010

1 commit

    Add the capability to track data mmap()s. This can be used together
    with PERF_SAMPLE_ADDR for data profiling.

    Signed-off-by: Anton Blanchard
    [Updated code for stable perf ABI]
    Signed-off-by: Eric B Munson
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Steven Rostedt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Eric B Munson
     

28 May, 2010

1 commit

    de_thread() and __exit_signal() use signal_struct->count/notify_count
    for synchronization. We can simplify the code and use ->notify_count
    only. Instead of comparing these two counters, we can change
    de_thread() to set ->notify_count = nr_of_sub_threads, then change
    __exit_signal() to dec-and-test this counter and notify group_exit_task
    (a user-space sketch of the dec-and-test idea follows this entry).

    Note that __exit_signal() checks "notify_count > 0" just for symmetry
    with exit_notify(); we could just check that it is != 0.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
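
    A user-space sketch of the dec-and-test idea (illustrative only; the
    kernel of course uses its own primitives rather than C11 atomics): the
    exec'ing thread sets notify_count to the number of sub-threads it must
    wait for, each exiting thread decrements it, and the one that brings
    it to zero does the notification:

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_int notify_count;

    static void exiting_sub_thread(void)
    {
        /* fetch_sub returns the old value: "== 1" means we were last */
        if (atomic_fetch_sub(&notify_count, 1) == 1)
            printf("last sub-thread: notify group_exit_task\n");
    }

    int main(void)
    {
        atomic_store(&notify_count, 3);     /* nr_of_sub_threads */
        for (int i = 0; i < 3; i++)
            exiting_sub_thread();
        return 0;
    }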