29 Sep, 2008

1 commit

  • There's a race between mm->owner assignment and swapoff, more easily
    seen when task slab poisoning is turned on. The condition occurs when
    try_to_unuse() runs in parallel with an exiting task. A similar race
    can occur with callers of get_task_mm(), such as /proc//
    or ptrace or page migration.

    CPU0                                     CPU1
                                             try_to_unuse
                                             looks at mm = task0->mm
                                             increments mm->mm_users
    task 0 exits
    mm->owner needs to be updated, but no
    new owner is found (mm_users > 1, but
    no other task has task->mm = task0->mm)
    mm_update_next_owner() leaves
                                             mmput(mm) decrements mm->mm_users
    task0 freed
                                             dereferencing mm->owner fails

    The fix is to notify the subsystem via the mm_owner_changed() callback,
    if no new owner is found, by specifying the new task as NULL.

    Jiri Slaby:
    mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
    must be set after that, so as not to pass NULL as old owner causing oops.

    Daisuke Nishimura:
    mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
    and its callers need to take account of this situation to avoid oops.
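The ordering and the NULL handling described above can be modelled in userspace. This is a minimal sketch, not the kernel code: every `own_*` name is hypothetical, the task list is a plain array, and the cgroup is reduced to an integer id. It shows the callback being invoked while the old owner is still set, the owner then becoming NULL, and a lookup that tolerates the ownerless mm.

```c
#include <assert.h>
#include <stddef.h>

struct own_mm;

/* Hypothetical stand-in for task_struct: only the fields the sketch needs. */
struct own_task {
    struct own_mm *mm;   /* address space the task runs on, or NULL */
    int cgroup_id;       /* stand-in for the task's cgroup */
};

/* Hypothetical stand-in for mm_struct. */
struct own_mm {
    struct own_task *owner;   /* may legitimately become NULL */
};

/* Records the most recent owner-change notification. */
struct own_change_log {
    struct own_task *old_owner;
    struct own_task *new_owner;
    int calls;
};

/* Stand-in for the subsystem callback; old_owner must never be NULL. */
void own_notify_changed(struct own_change_log *log,
                        struct own_task *old_owner,
                        struct own_task *new_owner)
{
    assert(old_owner != NULL);
    log->old_owner = old_owner;
    log->new_owner = new_owner;   /* NULL means "no new owner was found" */
    log->calls++;
}

/* Pick another task that still uses mm; hand ownership to it, or to NULL. */
void own_update_next_owner(struct own_mm *mm, struct own_task *tasks,
                           size_t ntasks, struct own_change_log *log)
{
    struct own_task *exiting = mm->owner;
    struct own_task *next = NULL;
    size_t i;

    for (i = 0; i < ntasks; i++) {
        if (&tasks[i] != exiting && tasks[i].mm == mm) {
            next = &tasks[i];
            break;
        }
    }

    /* Notify while mm->owner still names the old owner ... */
    own_notify_changed(log, mm->owner, next);
    /* ... and only afterwards publish the new value, possibly NULL. */
    mm->owner = next;
}

/* Lookup that tolerates an ownerless mm by returning a fallback id. */
int own_cgroup_of_mm(const struct own_mm *mm, int fallback_id)
{
    return mm->owner ? mm->owner->cgroup_id : fallback_id;
}
```

Swapping the last two statements of own_update_next_owner() would pass NULL as the old owner and trip the assertion, which is the oops the first note describes.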

    Hugh Dickins:
    Lockdep warning and hang below exec_mmap() when testing these patches.
    exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
    so exec_mmap() now needs to do the same. And with that repositioning,
    there's now no point in mm_need_new_owner() allowing for NULL mm.

    Reported-by: Hugh Dickins
    Signed-off-by: Balbir Singh
    Signed-off-by: Jiri Slaby
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

29 Jul, 2008

1 commit

  • Fix compilation errors on avr32 and without CONFIG_SWAP, introduced by
    ba92a43dbaee339cf5915ef766d3d3ffbaaf103c ("exec: remove some includes")

    In file included from include/asm/tlb.h:24,
    from fs/exec.c:55:
    include/asm-generic/tlb.h: In function 'tlb_flush_mmu':
    include/asm-generic/tlb.h:76: error: implicit declaration of function 'release_pages'
    include/asm-generic/tlb.h: In function 'tlb_remove_page':
    include/asm-generic/tlb.h:105: error: implicit declaration of function 'page_cache_release'
    make[1]: *** [fs/exec.o] Error 1

    This straightforward part-revert is nobody's favourite patch to address
    the underlying tlb.h needs swap.h needs pagemap.h (but sparc won't like
    that) mess; but appropriate to fix the build now before any overhaul.

    Reported-by: Yoichi Yuasa
    Reported-by: Haavard Skinnemoen
    Signed-off-by: Hugh Dickins
    Tested-by: Adrian Bunk
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

27 Jul, 2008

5 commits


26 Jul, 2008

12 commits

  • I don't understand why the multi-thread coredump implies the core_uses_pid
    behaviour, but we shouldn't use mm->mm_users for that. This counter can
    be incremented by get_task_mm(). Use the value returned by
    coredump_wait() instead.

    Also, remove the "const char *pattern" argument, format_corename() can use
    core_pattern directly.
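A rough userspace sketch of the naming rule, under loud assumptions: the pattern contains no '%' specifiers, the pid suffix has the form ".<tgid>", and cn_format_corename() with its parameters is a hypothetical helper, not the kernel function. The point is only that the thread count handed in by the caller, rather than mm->mm_users, decides whether the suffix is added.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Build a core file name from a pattern that is assumed to contain no '%'
 * specifiers. nr_threads plays the role of coredump_wait()'s return value:
 * the number of other threads that took part in the dump.
 * Returns what snprintf() returns.
 */
int cn_format_corename(char *buf, size_t len, const char *pattern,
                       int core_uses_pid, int nr_threads, int tgid)
{
    assert(buf != NULL && pattern != NULL);
    assert(strchr(pattern, '%') == NULL);   /* simplification of this sketch */

    /* A multi-threaded dump implies the core_uses_pid behaviour. */
    if (core_uses_pid || nr_threads > 0)
        return snprintf(buf, len, "%s.%d", pattern, tgid);
    return snprintf(buf, len, "%s", pattern);
}
```

Because the count comes from the caller, a transient get_task_mm() reference held elsewhere cannot change the resulting name.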

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Alan Cox
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that we have core_state->dumper list we can use it to wake up the
    sub-threads waiting for the coredump completion.

    This uglifies the code and .text grows by 47 bytes, but otoh mm_struct
    lessens by sizeof(struct completion). Also, with this change we can
    decouple exit_mm() from the coredumping code.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • binfmt->core_dump() has to iterate over all threads in the system in order
    to find the coredumping threads and construct the list using
    GFP_ATOMIC allocations.

    With this patch each thread allocates the list node on exit_mm()'s stack and
    adds itself to the list.

    This allows us to do further changes:

    - simplify ->core_dump()

    - change exit_mm() to clear ->mm first, then wait for ->core_done.
    this makes the coredumping process visible to oom_kill

    - kill mm->core_done
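The self-registration idea can be sketched with C11 atomics. This is a userspace model under assumptions, not the kernel implementation: the `cd_*` names are hypothetical, a flag stands in for the completion, and the atomic counter mirrors the separate change that turns core_state->nr_threads into atomic_t. Each thread supplies a node from its own stack, so nothing has to be allocated while the list is built.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; in the description it lives on exit_mm()'s stack. */
struct cd_thread {
    int tid;
    _Atomic(struct cd_thread *) next;
};

/* Hypothetical stand-in for struct core_state. */
struct cd_state {
    atomic_int nr_threads;     /* threads that still have to check in */
    struct cd_thread dumper;   /* list head: the dumping thread itself */
    atomic_int startup_done;   /* stand-in for completing a completion */
};

void cd_state_init(struct cd_state *cs, int dumper_tid, int nr_threads)
{
    assert(nr_threads >= 0);
    atomic_init(&cs->nr_threads, nr_threads);
    cs->dumper.tid = dumper_tid;
    atomic_init(&cs->dumper.next, NULL);
    atomic_init(&cs->startup_done, nr_threads == 0);
}

/* Called by an exiting sub-thread with a node from its own stack. */
void cd_thread_check_in(struct cd_state *cs, struct cd_thread *self, int tid)
{
    self->tid = tid;
    /* Push self right behind the head; no allocation can fail here. */
    atomic_init(&self->next, atomic_exchange(&cs->dumper.next, self));
    /* The last thread to arrive lets the dumper proceed. */
    if (atomic_fetch_sub(&cs->nr_threads, 1) == 1)
        atomic_store(&cs->startup_done, 1);
}

/* Walk the finished list, head included, and count its threads. */
int cd_count_threads(struct cd_state *cs)
{
    int n = 0;
    struct cd_thread *t;

    for (t = &cs->dumper; t != NULL; t = atomic_load(&t->next))
        n++;
    return n;
}
```

The dumper is meant to walk the list only after startup_done is set, which is why a node may be linked in before its own next pointer is visible to anyone else.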

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move the "struct core_state core_state" from coredump_wait() to
    do_coredump(), this makes mm->core_state visible to binfmt->core_dump().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Turn core_state->nr_threads into atomic_t and kill now unneeded
    down_write(&mm->mmap_sem) in exit_mm().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change zap_process() to return int instead of incrementing
    mm->core_state->nr_threads directly. Change zap_threads() to set
    mm->core_state only on success.

    This patch restores the original size of .text, and more importantly now
    ->nr_threads is used in two places only.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move mm->core_waiters into "struct core_state" allocated on stack. This
    shrinks mm_struct a little bit and allows further changes.

    This patch mostly does s/core_waiters/core_state. The only essential
    change is that coredump_wait() must clear mm->core_state before return.

    The coredump_wait()'s path is uglified and .text grows by 30 bytes, this
    is fixed by the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mm->core_startup_done points to "struct completion startup_done" allocated
    on the coredump_wait()'s stack. Introduce the new structure, core_state,
    which holds this "struct completion". This way we can add more info
    visible to the threads participating in coredump without enlarging
    mm_struct.

    No changes in affected .o files.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The main loop in zap_threads() must skip kthreads which may use the same
    mm. Otherwise we "kill" this thread erroneously (for example, it cannot
    fork or exec after that), and the coredumping task gets stuck in the
    TASK_UNINTERRUPTIBLE state forever because of the wrong ->core_waiters
    count.
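A small model of the skip, assuming a flat task array and a made-up flag value; zt_zap_threads() and the `zt_*` types are hypothetical and only illustrate why the waiter count goes wrong when a kernel thread borrowing the mm is included.

```c
#include <assert.h>
#include <stddef.h>

#define ZT_PF_KTHREAD 0x1u   /* hypothetical flag value for this sketch */

struct zt_task {
    unsigned int flags;
    const void *mm;      /* address space in use, NULL if none */
    int killed;          /* set when the sketch "zaps" the task */
};

/*
 * Zap every user task sharing the dumper's mm, except the dumper itself,
 * and return how many were zapped, i.e. how many threads the dumper must
 * wait for.
 */
int zt_zap_threads(struct zt_task *tasks, size_t ntasks,
                   const struct zt_task *dumper)
{
    int waiters = 0;
    size_t i;

    assert(dumper != NULL && dumper->mm != NULL);
    for (i = 0; i < ntasks; i++) {
        struct zt_task *t = &tasks[i];

        if (t == dumper || t->mm != dumper->mm)
            continue;
        if (t->flags & ZT_PF_KTHREAD)
            continue;   /* borrowed mm: not ours to kill or wait for */
        t->killed = 1;
        waiters++;
    }
    return waiters;
}
```

Without the flag test the kernel thread would be counted as a waiter that never checks in, so the dumper would wait forever.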

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set
    by INIT_TASK() and copied to the forked children (we could set it in
    kthreadd() along with PF_NOFREEZE instead).

    daemonize() was changed as well. In that case testing of PF_KTHREAD is
    racy, but daemonize() is hopeless anyway.

    This flag is cleared in do_execve(), before search_binary_handler().
    Probably not the best place, we can do this in exec_mmap() or in
    start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
    doesn't matter in practice, and if do_execve() fails kthread should die
    soon.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No changes in fs/exec.o

    The for_each_process() loop in zap_threads() is very subtle, it is not
    clear why we don't race with fork/exit/exec. Add the fat comment.

    Also, change the code to use while_each_thread().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • fs/exec.c used to need mman.h pagemap.h swap.h and rmap.h when it did
    mm-ish stuff in install_arg_page(); but no need for them after 2.6.22.

    [akpm@linux-foundation.org: unbreak arm]
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

25 Jul, 2008

1 commit


11 Jul, 2008

1 commit

  • Kernel Bugzilla #11063 points out that on some architectures (e.g. x86_32)
    exec'ing an ELF without a PT_GNU_STACK program header should default to an
    executable stack; but this got broken by the unlimited argv feature because
    stack vma is now created before the right personality has been established:
    so breaking old binaries using nested function trampolines.

    Therefore re-evaluate VM_STACK_FLAGS in setup_arg_pages, where stack
    vm_flags used to be set, before the mprotect_fixup. Checking through
    our existing VM_flags, none would have changed since insert_vm_struct:
    so this seems safer than finding a way through the personality labyrinth.

    Reported-by: pageexec@freemail.hu
    Signed-off-by: Hugh Dickins
    Cc: stable@kernel.org
    Signed-off-by: Linus Torvalds

    Hugh Dickins
     

17 Jun, 2008

1 commit

  • We only need it for the /sbin/loader hack for OSF/1 executables, and we
    don't want to include it otherwise.

    While we're at it, remove the redundant '&& CONFIG_ARCH_SUPPORTS_AOUT'
    in the ifdef around that code. It's already dependent on __alpha__, and
    CONFIG_ARCH_SUPPORTS_AOUT is hard-coded to 'y' there.

    Signed-off-by: David Woodhouse
    Acked-by: Peter Korsgaard
    Signed-off-by: Linus Torvalds

    David Woodhouse
     

27 May, 2008

1 commit

  • Based on Roland's patch. This approach was suggested by Austin Clements
    from the very beginning, and then by Linus.

    As Austin pointed out, the execing task can be killed by SI_TIMER signal
    because exec flushes the signal handlers, but doesn't discard the pending
    signals generated by posix timers. Perhaps not a bug, but people find this
    surprising. See http://bugzilla.kernel.org/show_bug.cgi?id=10460

    Signed-off-by: Oleg Nesterov
    Cc: Austin Clements
    Cc: Roland McGrath
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 May, 2008

1 commit

  • Even though copy_compat_strings() doesn't cache the pages,
    copy_strings_kernel() and stuff indirectly called by e.g.
    ->load_binary() is doing that, so we need to drop the
    cache contents in the end.

    [found by WANG Cong]

    Signed-off-by: Al Viro

    Al Viro
     

13 May, 2008

1 commit

  • When mm destruction happens, we should pass mm_update_next_owner() the old mm.
    But unfortunately new mm is passed in exec_mmap().

    Thus, kernel panic is possible when a multi-threaded process uses exec().

    Also, the owner member comment description is wrong. mm->owner does not
    necessarily point to the thread group leader.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Balbir Singh
    Cc: "Paul Menage"
    Cc: "KAMEZAWA Hiroyuki"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     

02 May, 2008

1 commit


30 Apr, 2008

2 commits


29 Apr, 2008

3 commits

  • The kernel implements readlink of /proc/pid/exe by getting the file from
    the first executable VMA. Then the path to the file is reconstructed and
    reported as the result.

    Because of the VMA walk the code is slightly different on nommu systems.
    This patch avoids separate /proc/pid/exe code on nommu systems. Instead of
    walking the VMAs to find the first executable file-backed VMA we store a
    reference to the exec'd file in the mm_struct.

    That reference would prevent the filesystem holding the executable file
    from being unmounted even after unmapping the VMAs. So we track the number
    of VM_EXECUTABLE VMAs and drop the new reference when the last one is
    unmapped. This avoids pinning the mounted filesystem.
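The counting scheme can be sketched as follows. The `ex_*` names and the integer refcount are hypothetical stand-ins chosen for this example; the sketch only shows the reference being taken once at exec time and dropped when the count of executable mappings reaches zero.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical refcounted file. */
struct ex_file { int refs; };

/* Hypothetical slice of mm_struct. */
struct ex_mm {
    struct ex_file *exe_file;   /* reference to the exec'd file */
    int num_exe_vmas;           /* live VM_EXECUTABLE mappings */
};

/* Record the exec'd file in the mm and pin it. */
void ex_set_exe_file(struct ex_mm *mm, struct ex_file *file)
{
    file->refs++;
    mm->exe_file = file;
    mm->num_exe_vmas = 0;
}

/* A VM_EXECUTABLE mapping was created. */
void ex_vma_added(struct ex_mm *mm)
{
    mm->num_exe_vmas++;
}

/* A VM_EXECUTABLE mapping went away; the last one releases the file. */
void ex_vma_removed(struct ex_mm *mm)
{
    assert(mm->num_exe_vmas > 0);
    if (--mm->num_exe_vmas == 0 && mm->exe_file != NULL) {
        mm->exe_file->refs--;   /* drop the pin so the fs can be unmounted */
        mm->exe_file = NULL;
    }
}
```

Once exe_file is NULL, a lookup of the executable simply has nothing to report, which is the trade-off for not pinning the mounted filesystem.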

    [akpm@linux-foundation.org: improve comments]
    [yamamoto@valinux.co.jp: fix dup_mmap]
    Signed-off-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: David Howells
    Cc: "Eric W. Biederman"
    Cc: Christoph Hellwig
    Cc: Al Viro
    Cc: Hugh Dickins
    Signed-off-by: YAMAMOTO Takashi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matt Helsley
     
  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. The advantage of this approach
    is that, once the mm->owner is known, using the subsystem id, the cgroup
    can be determined. It also allows several control groups that are
    virtually grouped by mm_struct, to exist independent of the memory
    controller i.e., without adding mem_cgroup's for each controller, to
    mm_struct.

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box, it was compiled with both the
    MM_OWNER config turned on and off.

    After the thread group leader exits, it's moved to init_css_set by
    cgroup_exit(), thus all future charges from running threads would be
    redirected to the init_css_set's subsystem.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     
  • I noticed that 2.6.24.2 calculates bprm->argv_len at do_execve(). But it
    doesn't update bprm->argv_len after "remove_arg_zero() +
    copy_strings_kernel()" at load_script() etc.

    audit_bprm() is called from search_binary_handler() and
    search_binary_handler() is called from load_script() etc. Thus, I think the
    condition check

        if (bprm->argv_len > (audit_argv_kb << 10))
                return -E2BIG;

    in audit_bprm() might return wrong result when strlen(removed_arg) !=
    strlen(spliced_args). Why not update bprm->argv_len at load_script() etc. ?

    By the way, 2.6.25-rc3 seems not to do the condition check. Is the field
    bprm->argv_len no longer needed?

    Signed-off-by: Tetsuo Handa
    Cc: Ollie Wild
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tetsuo Handa
     

25 Apr, 2008

2 commits

  • * let unshare_files() give caller the displaced files_struct
    * don't bother with grabbing reference only to drop it in the
    caller if it hadn't been shared in the first place
    * in that form unshare_files() is trivially implemented via
    unshare_fd(), so we eliminate the duplicate logic in fork.c
    * reset_files_struct() is not merely called only for current;
    it will break the system if somebody ever calls it for anything
    else (we can't modify ->files of somebody else). Lose the
    task_struct * argument.

    Signed-off-by: Al Viro

    Al Viro
     
  • * unshare_files() can fail; doing it after irreversible actions is wrong
    and de_thread() is certainly irreversible.
    * since we do it unconditionally anyway, we might as well do it in do_execve()
    and save ourselves the PITA in binfmt handlers, etc.
    * while we are at it, binfmt_som actually leaked files_struct on failure.

    As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
    become unexported.

    Signed-off-by: Al Viro

    Al Viro
     

04 Mar, 2008

1 commit

  • The new code that removed the limitation on the execve string size
    (which was historically 32 pages) replaced it with a much softer limit
    based on RLIMIT_STACK which is usually much larger than the traditional
    limit. See commit b6a2fea39318e43fee84fa7b0b90d68bed92d2ba ("mm:
    variable length argument support") for details.

    However, if you have a small stack limit (perhaps because you need lots
    of stacks in a threaded environment), the new heuristic of allowing up
    to 1/4th of RLIMIT_STACK to be used for argument and environment strings
    could actually be smaller than the old limit.

    So just say that it's ok to have up to ARG_MAX strings regardless of the
    value of RLIMIT_STACK, and check the rlimit only when going over that
    traditional limit.

    (Of course, if you actually have a *really* small stack limit, the whole
    stack itself will be limited before you hit ARG_MAX, but that has always
    been true and is clearly the right behaviour anyway).
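The rule reads naturally as a two-step check. A minimal sketch, assuming the traditional limit is 32 pages of 4 KiB; AL_ARG_MAX, al_arg_limit() and al_arg_size_ok() are hypothetical names for this example, and the stack limit is passed in instead of being read from the process.

```c
#include <assert.h>

/* Assumption: 32 pages of 4 KiB, matching the historical limit in the text. */
#define AL_ARG_MAX (32UL * 4096UL)

/* Largest total size of argument and environment strings that is accepted. */
unsigned long al_arg_limit(unsigned long stack_limit)
{
    unsigned long quarter = stack_limit / 4;
    unsigned long limit = quarter > AL_ARG_MAX ? quarter : AL_ARG_MAX;

    assert(limit >= AL_ARG_MAX);   /* never below the traditional limit */
    return limit;
}

/*
 * Return 1 if `size` bytes of strings are acceptable under a stack limit
 * of `stack_limit` bytes, else 0. The stack limit only matters once the
 * size exceeds AL_ARG_MAX.
 */
int al_arg_size_ok(unsigned long size, unsigned long stack_limit)
{
    return size <= al_arg_limit(stack_limit);
}
```

With a 64 KiB stack limit the quarter rule alone would allow only 16 KiB of strings; the sketch still accepts anything up to AL_ARG_MAX, as the message intends.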

    Acked-by: Carlos O'Donell
    Cc: Michael Kerrisk
    Cc: Peter Zijlstra
    Cc: Ollie Wild
    Signed-off-by: Linus Torvalds

    Linus Torvalds
     

15 Feb, 2008

2 commits

  • * Add path_put() functions for releasing a reference to the dentry and
    vfsmount of a struct path in the right order

    * Switch from path_release(nd) to path_put(&nd->path)

    * Rename dput_path() to path_put_conditional()
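Why the order matters can be shown with a toy model. Everything prefixed `pp_` is hypothetical, and the back-pointer from dentry to mount exists only so the sketch can assert that a dentry is never touched without its mount being pinned; the release order shown (dentry first, then mount) is how I understand path_put() to behave.

```c
#include <assert.h>
#include <stddef.h>

struct pp_mount { int refs; };

struct pp_dentry {
    int refs;
    struct pp_mount *mnt;   /* mount the dentry lives on (sketch only) */
};

/* Hypothetical stand-in for struct path: a (mount, dentry) pair. */
struct pp_path {
    struct pp_mount *mnt;
    struct pp_dentry *dentry;
};

void pp_mntget(struct pp_mount *m) { m->refs++; }

void pp_mntput(struct pp_mount *m)
{
    assert(m->refs > 0);
    m->refs--;
}

void pp_dget(struct pp_dentry *d)
{
    assert(d->mnt->refs > 0);   /* mount must already be pinned */
    d->refs++;
}

void pp_dput(struct pp_dentry *d)
{
    /* A dentry may only be dropped while its mount is still pinned. */
    assert(d->refs > 0 && d->mnt->refs > 0);
    d->refs--;
}

/* Take both references: mount first, then dentry. */
void pp_path_get(struct pp_path *p)
{
    pp_mntget(p->mnt);
    pp_dget(p->dentry);
}

/* Release both references in the reverse order: dentry first, mount last. */
void pp_path_put(struct pp_path *p)
{
    pp_dput(p->dentry);
    pp_mntput(p->mnt);
}
```

Reversing the two calls in pp_path_put() trips the assertion in pp_dput(), which is the class of mistake a single helper is meant to rule out.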

    [akpm@linux-foundation.org: fix cifs]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Steven French
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     
  • This is the central patch of a cleanup series. In most cases there is no good
    reason why someone would want to use a dentry for itself. This series reflects
    that fact and embeds a struct path into nameidata.

    Together with the other patches of this series
    - it enforces the correct order of getting/releasing the reference count on
    <dentry,vfsmount> pairs
    - it prepares the VFS for stacking support since it is essential to have a
    struct path in every place where the stack can be traversed
    - it reduces the overall code size:

    without patch series:
       text    data     bss     dec     hex filename
    5321639  858418  715768 6895825  6938d1 vmlinux

    with patch series:
       text    data     bss     dec     hex filename
    5320026  858418  715768 6894212  693284 vmlinux

    This patch:

    Switch from nd->{dentry,mnt} to nd->path.{dentry,mnt} everywhere.

    [akpm@linux-foundation.org: coding-style fixes]
    [akpm@linux-foundation.org: fix cifs]
    [akpm@linux-foundation.org: fix smack]
    Signed-off-by: Jan Blunck
    Signed-off-by: Andreas Gruenbacher
    Acked-by: Christoph Hellwig
    Cc: Al Viro
    Cc: Casey Schaufler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Blunck
     

09 Feb, 2008

3 commits

  • This allows us to use executables >2GB.

    Based on a patch by Dave Anderson

    Signed-off-by: Andi Kleen
    Cc: Dave Anderson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.

    Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
    be permitted to go looking for A.OUT libraries to load in such a case. Not
    only that, but under such conditions A.OUT core dumps are not produced either.

    To make this work, this patch also does the following:

    (1) Makes the existence of the contents of linux/a.out.h contingent on
    CONFIG_ARCH_SUPPORTS_AOUT.

    (2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
    core dumping code.

    (3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline. This
    is then included only where needed. This means that this bit of arch
    code will be stored in the appropriate A.OUT binfmt module rather than
    the core kernel.

    (4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
    needed) and FRV.

    This patch depends on the previous patch to move STACK_TOP[_MAX] out of
    asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
    format is available.

    [jdike@addtoit.com: uml: re-remove accidentally restored code]
    Signed-off-by: David Howells
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • signal_struct->tsk points to the ->group_leader and thus we have the nasty
    code in de_thread() which has to change it and restart ->real_timer if the
    leader is changed.

    Use "struct pid *leader_pid" instead. This also allows us to kill now
    unneeded send_group_sig_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Acked-by: Roland McGrath
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 Feb, 2008

1 commit

  • As Roland pointed out, we have the very old problem with exec. de_thread()
    sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
    clears signal->flags. All signals (even fatal ones) sent in this window
    (which is not too small) will be lost.

    With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
    the new helper, should be used to detect exit_group() or exec() in progress.
    It can have more users, but this patch does only strictly necessary changes.
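A sketch of what such a helper might look like in a userspace model. The flag value and all `sg_*` names are hypothetical, and the use of a separate group_exit_task pointer to mark an exec in progress is an assumption of this sketch that the message above does not spell out.

```c
#include <assert.h>
#include <stddef.h>

#define SG_SIGNAL_GROUP_EXIT 0x8u   /* hypothetical flag value */

struct sg_task { int tid; };

/* Hypothetical slice of signal_struct. */
struct sg_signal {
    unsigned int flags;
    struct sg_task *group_exit_task;   /* set while exec kills siblings */
};

/* True while either exit_group() or exec() is tearing the group down. */
int sg_signal_group_exit(const struct sg_signal *sig)
{
    return (sig->flags & SG_SIGNAL_GROUP_EXIT) != 0 ||
           sig->group_exit_task != NULL;
}

/* exec marks itself in progress without touching the flags word. */
void sg_exec_begin(struct sg_signal *sig, struct sg_task *self)
{
    assert(sig->group_exit_task == NULL);
    sig->group_exit_task = self;
}

/* exec is done; flags were never modified, so nothing needs clearing. */
void sg_exec_end(struct sg_signal *sig)
{
    sig->group_exit_task = NULL;
}
```

Since exec never sets or clears the flags word in this model, a fatal signal that sets SG_SIGNAL_GROUP_EXIT during the window is not wiped out afterwards.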

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Robin Holt
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov