29 Sep, 2008

1 commit

  • There's a race between mm->owner assignment and swapoff, more easily
    seen when task slab poisoning is turned on. The condition occurs when
    try_to_unuse() runs in parallel with an exiting task. A similar race
    can occur with callers of get_task_mm(), such as /proc//
    or ptrace or page migration.

    CPU0                                    CPU1
                                            try_to_unuse
                                            looks at mm = task0->mm
                                            increments mm->mm_users
    task 0 exits
    mm->owner needs to be updated, but no
    new owner is found (mm_users > 1, but
    no other task has task->mm = task0->mm)
    mm_update_next_owner() leaves
                                            mmput(mm) decrements mm->mm_users
    task0 freed
                                            dereferencing mm->owner fails

    The fix is to notify the subsystem via the mm_owner_changed() callback,
    specifying the new owner as NULL, if no new owner is found.

    Jiri Slaby:
    mm->owner was set to NULL prior to calling cgroup_mm_owner_callbacks(), but
    must be set after that, so as not to pass NULL as old owner causing oops.

    Daisuke Nishimura:
    mm_update_next_owner() may set mm->owner to NULL, but mem_cgroup_from_task()
    and its callers need to take account of this situation to avoid oops.

    Hugh Dickins:
    Lockdep warning and hang below exec_mmap() when testing these patches.
    exit_mm() up_reads mmap_sem before calling mm_update_next_owner(),
    so exec_mmap() now needs to do the same. And with that repositioning,
    there's now no point in mm_need_new_owner() allowing for NULL mm.

    Reported-by: Hugh Dickins
    Signed-off-by: Balbir Singh
    Signed-off-by: Jiri Slaby
    Signed-off-by: Daisuke Nishimura
    Signed-off-by: Hugh Dickins
    Cc: KAMEZAWA Hiroyuki
    Cc: Paul Menage
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
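
    The ordering issue Jiri Slaby describes (notify first, clear the owner
    afterwards) can be sketched in user-space C. All names here (task_sketch,
    mm_sketch, mm_drop_owner, record_owner_change, owner_pid_or) are
    hypothetical stand-ins, not the kernel's structures or functions; the
    sketch only illustrates the order of operations and the NULL check that
    readers of the owner field then need.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, heavily simplified stand-ins for task_struct and mm_struct. */
struct task_sketch { int pid; };
struct mm_sketch { struct task_sketch *owner; };

typedef void (*owner_changed_fn)(struct task_sketch *old_owner,
                                 struct task_sketch *new_owner);

/* Example subscriber: records what it was told, so the order can be checked. */
static struct task_sketch *last_old_owner;
static struct task_sketch *last_new_owner;

static void record_owner_change(struct task_sketch *old_owner,
                                struct task_sketch *new_owner)
{
    last_old_owner = old_owner;
    last_new_owner = new_owner;
}

/*
 * No new owner could be found: tell the subscriber while mm->owner still
 * points at the old owner, and only then clear the field. Clearing first
 * would hand NULL to the callback as the old owner.
 */
static void mm_drop_owner(struct mm_sketch *mm, owner_changed_fn notify)
{
    assert(mm != NULL && notify != NULL);
    notify(mm->owner, NULL);
    mm->owner = NULL;
}

/* Readers must now tolerate an mm without an owner. */
static int owner_pid_or(const struct mm_sketch *mm, int fallback)
{
    assert(mm != NULL);
    return mm->owner ? mm->owner->pid : fallback;
}
```

    A caller would invoke mm_drop_owner(&mm, record_owner_change) and
    afterwards treat a NULL owner as "no owner" rather than dereferencing it.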
     

06 Sep, 2008

1 commit

  • Spencer reported a problem where utime and stime were going negative despite
    the fixes in commit b27f03d4bdc145a09fb7b0c0e004b29f1ee555fa. The suspected
    reason for the problem is that signal_struct maintains its own utime and
    stime (of exited tasks). These are not updated using the new task_utime()
    routine, so sig->utime can go backwards and cause the same problem
    to occur (sig->utime adds tsk->utime and not task_utime()). This patch
    fixes the problem.

    TODO: using max(task->prev_utime, derived utime) works for now, but a more
    generic solution is to implement cputime_max() and use the cputime_gt()
    function for comparison.

    Reported-by: spencer@bluehost.com
    Signed-off-by: Balbir Singh
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Balbir Singh
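
    The max(prev_utime, derived utime) idea from the TODO above can be
    sketched as follows. This is a user-space illustration only; the type
    cputime_sketch_t and the function monotonic_utime() are hypothetical
    names, and the kernel's cputime_t is not necessarily a plain integer.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the kernel's cputime_t. */
typedef unsigned long long cputime_sketch_t;

/*
 * Report a utime that never goes backwards: *prev remembers the largest
 * value reported so far, and a smaller derived value is clamped to it.
 */
static cputime_sketch_t monotonic_utime(cputime_sketch_t *prev,
                                        cputime_sketch_t derived)
{
    assert(prev != NULL);
    if (derived > *prev)
        *prev = derived;
    return *prev;
}
```

    Summing the clamped value into a per-process total, instead of the raw
    per-task counter, keeps the total from decreasing between two reads.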
     

03 Sep, 2008

1 commit

  • We don't change pid_ns->child_reaper when the main thread of the
    subnamespace init exits. As Robert Rex pointed out, this is wrong.

    Yes, the re-parenting itself works correctly, but if the reparented task
    exits it needs ->parent->nsproxy->pid_ns in do_notify_parent(), and if the
    main thread is zombie its ->nsproxy was already cleared by
    exit_task_namespaces().

    Introduce the new function, find_new_reaper(), which finds the new
    ->parent for the re-parenting and changes ->child_reaper if needed. Kill
    the now unneeded exit_child_reaper().

    Also move the changing of ->child_reaper from zap_pid_ns_processes() to
    find_new_reaper(), this consolidates the games with ->child_reaper and
    makes it stable under tasklist_lock.

    Addresses http://bugzilla.kernel.org/show_bug.cgi?id=11391

    Reported-by: Robert Rex
    Signed-off-by: Oleg Nesterov
    Acked-by: Serge Hallyn
    Acked-by: Pavel Emelyanov
    Acked-by: Sukadev Bhattiprolu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
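
    A rough model of what find_new_reaper() is described as doing, written
    as a user-space sketch. The structures thread_sketch and pid_ns_sketch
    and the function find_new_reaper_sketch() are hypothetical; locking
    (tasklist_lock) and the teardown of a namespace whose init has no live
    threads left are deliberately not modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for a thread and a pid namespace. */
struct thread_sketch { int tid; int exiting; };
struct pid_ns_sketch { struct thread_sketch *child_reaper; };

/*
 * Pick the new parent for the children of an exiting thread ("father").
 * "group" holds every thread of father's thread group, father included.
 * Prefer a live sibling thread; if father was the namespace's reaper,
 * hand that role to the sibling too. Otherwise fall back to the reaper.
 */
static struct thread_sketch *
find_new_reaper_sketch(struct pid_ns_sketch *ns, struct thread_sketch *father,
                       struct thread_sketch *group, size_t n)
{
    assert(ns != NULL && father != NULL);
    for (size_t i = 0; i < n; i++) {
        struct thread_sketch *t = &group[i];

        if (t == father || t->exiting)
            continue;
        if (ns->child_reaper == father)
            ns->child_reaper = t;
        return t;
    }
    /* No live sibling: children go to the namespace's reaper. */
    return ns->child_reaper;
}
```

    The point of keeping both decisions in one place is that the choice of
    the new parent and the update of child_reaper cannot disagree.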
     

27 Aug, 2008

1 commit


02 Aug, 2008

1 commit

  • My commit 2b2a1ff64afbadac842bbc58c5166962cf4f7664 introduced a regression
    (sorry about that) for the odd case of exit_signal=0 (e.g. clone_flags=0).
    This is not a normal use, but it's used by a case in the glibc test suite.

    Dying with exit_signal=0 sends no signal, but it's supposed to wake up a
    parent's blocked wait*() calls (unlike the delayed_group_leader case).
    This fixes tracehook_notify_death() and its caller to distinguish a
    "signal 0" wakeup from the delayed_group_leader case (with no wakeup).

    Signed-off-by: Roland McGrath
    Tested-by: Serge Hallyn
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

28 Jul, 2008

1 commit

  • Put all i/o statistics in struct proc_io_accounting and use inline functions to
    initialize and increment statistics, removing a lot of single variable
    assignments.

    This also reduces the kernel size as follows (with CONFIG_TASK_XACCT=y and
    CONFIG_TASK_IO_ACCOUNTING=y).

       text    data     bss     dec     hex filename
      11651       0       0   11651    2d83 kernel/exit.o.before
      11619       0       0   11619    2d63 kernel/exit.o.after
      10886     132     136   11154    2b92 kernel/fork.o.before
      10758     132     136   11026    2b12 kernel/fork.o.after

    3082029  807968 4818600 8708597  84e1f5 vmlinux.o.before
    3081869  807968 4818600 8708437  84e155 vmlinux.o.after

    Signed-off-by: Andrea Righi
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Andrea Righi
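
    The pattern described above, one structure for all I/O counters plus
    inline helpers to initialise and accumulate it, might look roughly like
    this. The structure io_acct_sketch and the helpers io_acct_init() and
    io_acct_add() are hypothetical; the field names are borrowed from the
    counters shown in /proc/pid/io, and the real structure layout is not
    reproduced here.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical grouping of I/O counters into a single structure. */
struct io_acct_sketch {
    unsigned long long rchar;   /* bytes read via read-like syscalls */
    unsigned long long wchar;   /* bytes written via write-like syscalls */
    unsigned long long syscr;   /* number of read syscalls */
    unsigned long long syscw;   /* number of write syscalls */
};

/* One call replaces a run of per-field "= 0" assignments. */
static inline void io_acct_init(struct io_acct_sketch *a)
{
    assert(a != NULL);
    memset(a, 0, sizeof(*a));
}

/* One call replaces a run of per-field "+=" statements. */
static inline void io_acct_add(struct io_acct_sketch *dst,
                               const struct io_acct_sketch *src)
{
    assert(dst != NULL && src != NULL);
    dst->rchar += src->rchar;
    dst->wchar += src->wchar;
    dst->syscr += src->syscr;
    dst->syscw += src->syscw;
}
```

    With helpers like these, adding a new counter touches the structure and
    the two helpers rather than every place that resets or sums statistics.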
     

27 Jul, 2008

4 commits

  • long overdue...

    Signed-off-by: Al Viro

    Al Viro
     
  • This moves the ptrace logic in task death (exit_notify) into tracehook.h
    inlines. Some code is rearranged slightly to make things nicer. There is
    no change, only cleanup.

    There is one hook called with the tasklist_lock write-locked, as ptrace
    needs. There is also a new hook called after exit_state changes and
    without locks. This is a better place for tracing work to be in the
    future, since it doesn't delay the whole system with locking.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This moves the ptrace-related logic from release_task into tracehook.h and
    ptrace.h inlines. It provides clean hooks both before and after locking
    tasklist_lock, for future tracing logic to do more cleanup without the
    lock.

    This also changes release_task() itself in the rare "zap_leader" case to
    set the leader to EXIT_DEAD before iterating. This maintains the
    invariant that release_task() only ever handles a task in EXIT_DEAD. This
    is a common-sense invariant that is already always true except in this one
    arcane case of zombie leader whose parent ignores SIGCHLD.

    This change is harmless and only costs one store in this one rare case.
    It keeps the expected state more consistently sane, which is nicer when
    debugging weirdness in release_task(). It also lets some future code in
    the tracehook entry points rely on this invariant for bookkeeping.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     
  • This moves the PTRACE_EVENT_EXIT tracing into a tracehook.h inline,
    tracehook_report_exit(). The change has no effect, just clean-up.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

26 Jul, 2008

8 commits

  • Report per-thread I/O statistics in /proc/pid/task/tid/io and aggregate
    parent I/O statistics in /proc/pid/io. This approach follows the same
    model used to account per-process and per-thread CPU times.

    As a practical application, this makes it possible to quickly find the top
    I/O consumer when a process spawns many child threads that perform the
    actual I/O work, because the aggregated I/O statistics can always be found
    in /proc/pid/io.

    [ Oleg Nesterov points out that we should check that the task is still
    alive before we iterate over the threads, but also says that we can do
    that fixup on top of this later. - Linus ]

    Acked-by: Balbir Singh
    Signed-off-by: Andrea Righi
    Cc: Matt Heaton
    Cc: Shailabh Nagar
    Acked-by-with-comments: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Righi
     
  • Now that we have core_state->dumper list we can use it to wake up the
    sub-threads waiting for the coredump completion.

    This uglifies the code and .text grows by 47 bytes, but otoh mm_struct
    lessens by sizeof(struct completion). Also, with this change we can
    decouple exit_mm() from the coredumping code.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • binfmt->core_dump() has to iterate over all the threads in the system in order
    to find the coredumping threads and construct the list using the
    GFP_ATOMIC allocations.

    With this patch each thread allocates the list node on exit_mm()'s stack and
    adds itself to the list.

    This allows us to do further changes:

    - simplify ->core_dump()

    - change exit_mm() to clear ->mm first, then wait for ->core_done.
    this makes the coredumping process visible to oom_kill

    - kill mm->core_done

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
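
    The idea of each exiting thread linking a stack-allocated node into a
    list headed by the dumper, combined with an atomic count of threads
    still to check in, can be sketched with C11 atomics. Everything below
    (core_thread_sketch, core_state_sketch and the three functions) is a
    hypothetical user-space model, not the kernel's code; in particular it
    assumes the dumper walks the list only after the count has reached zero.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical list node; in the scheme above it lives on a thread's stack. */
struct core_thread_sketch {
    int tid;
    struct core_thread_sketch *_Atomic next;
};

struct core_state_sketch {
    atomic_int nr_threads;             /* threads that still have to check in */
    struct core_thread_sketch dumper;  /* list head, owned by the dumper */
};

static void core_state_init(struct core_state_sketch *cs, int dumper_tid,
                            int nr_other_threads)
{
    assert(cs != NULL);
    cs->dumper.tid = dumper_tid;
    atomic_init(&cs->dumper.next, NULL);
    atomic_init(&cs->nr_threads, nr_other_threads);
}

/*
 * Push "self" right behind the list head without taking a lock.
 * Returns 1 if the caller was the last thread to check in, which is the
 * point at which the dumper could be woken.
 */
static int core_thread_enlist(struct core_state_sketch *cs,
                              struct core_thread_sketch *self, int tid)
{
    struct core_thread_sketch *old;

    assert(cs != NULL && self != NULL);
    self->tid = tid;
    old = atomic_exchange(&cs->dumper.next, self);
    atomic_store(&self->next, old);
    return atomic_fetch_sub(&cs->nr_threads, 1) == 1;
}

/* Walk the list as the dumper would once everybody has checked in. */
static int core_state_count_waiters(struct core_state_sketch *cs)
{
    int n = 0;

    for (struct core_thread_sketch *t = atomic_load(&cs->dumper.next);
         t != NULL; t = atomic_load(&t->next))
        n++;
    return n;
}
```

    Because the nodes are supplied by the threads themselves, the dumper
    needs no allocation and no scan over unrelated tasks to build the list.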
     
  • Turn core_state->nr_threads into atomic_t and kill now unneeded
    down_write(&mm->mmap_sem) in exit_mm().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move mm->core_waiters into "struct core_state" allocated on stack. This
    shrinks mm_struct a little bit and allows further changes.

    This patch mostly does s/core_waiters/core_state. The only essential
    change is that coredump_wait() must clear mm->core_state before return.

    The coredump_wait()'s path is uglified and .text grows by 30 bytes, this
    is fixed by the next patch.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • mm->core_startup_done points to "struct completion startup_done" allocated
    on the coredump_wait()'s stack. Introduce the new structure, core_state,
    which holds this "struct completion". This way we can add more info
    visible to the threads participating in coredump without enlarging
    mm_struct.

    No changes in affected .o files.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce the new PF_KTHREAD flag to mark the kernel threads. It is set
    by INIT_TASK() and copied to the forked children (we could set it in
    kthreadd() along with PF_NOFREEZE instead).

    daemonize() was changed as well. In that case testing of PF_KTHREAD is
    racy, but daemonize() is hopeless anyway.

    This flag is cleared in do_execve(), before search_binary_handler().
    Probably not the best place, we can do this in exec_mmap() or in
    start_thread(), or clear it along with PF_FORKNOEXEC. But I think this
    doesn't matter in practice, and if do_execve() fails kthread should die
    soon.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is no reason for rcu_read_lock() in __exit_signal(). tsk->sighand
    can only be changed if tsk does exec, which is obviously not possible here.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 Jul, 2008

4 commits

  • This fixes an arcane bug that we think was a regression introduced
    by commit b2b2cbc4b2a2f389442549399a993a8306420baf. When a parent
    ignores SIGCHLD (or uses SA_NOCLDWAIT), its children would self-reap
    but they don't if it's using ptrace on them. When the parent thread
    later exits and ceases to ptrace a child but leaves other live
    threads in the parent's thread group, any zombie children are left
    dangling. The fix makes them self-reap then, as they would have
    done earlier if ptrace had not been in use.

    Signed-off-by: Roland McGrath

    Roland McGrath
     
  • This reverts the effect of commit f2cc3eb133baa2e9dc8efd40f417106b2ee520f3
    "do_wait: fix security checks". That change reverted the effect of commit
    73243284463a761e04d69d22c7516b2be7de096c. The rationale for the original
    commit still stands. The inconsistent treatment of children hidden by
    ptrace was an unintended omission in the original change and in no way
    invalidates its purpose.

    This makes do_wait return the error returned by security_task_wait()
    (usually -EACCES) in place of -ECHILD when there are some children the
    caller would be able to wait for if not for the permission failure. A
    permission error will give the user a clue to look for security policy
    problems, rather than for mysterious wait bugs.

    Signed-off-by: Roland McGrath

    Roland McGrath
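
    The return-value rule described above can be modelled in a few lines.
    This is a hedged sketch, not the kernel's do_wait(): child_sketch and
    do_wait_sketch() are hypothetical, security_rc stands for whatever a
    policy check such as security_task_wait() returned for that child, and
    a return of 0 stands for "eligible children exist, the caller would
    block".

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical view of one child as seen by the waiting parent. */
struct child_sketch {
    int pid;
    int security_rc;  /* 0, or a negative errno from the policy check */
    int reapable;     /* nonzero if it has a status to report right now */
};

/*
 * Returns the pid of a reapable, permitted child; 0 if permitted children
 * exist but none is ready; the policy error if children exist but all were
 * denied; -ECHILD if there are no children at all.
 */
static int do_wait_sketch(const struct child_sketch *kids, size_t n)
{
    int notask_error = -ECHILD;

    assert(kids != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (kids[i].security_rc < 0) {
            /* Keep the error only while nothing eligible has been seen. */
            if (notask_error)
                notask_error = kids[i].security_rc;
            continue;
        }
        if (kids[i].reapable)
            return kids[i].pid;
        notask_error = 0;
    }
    return notask_error;
}
```

    The caller thus sees -EACCES (or whatever the policy returned) instead
    of a misleading -ECHILD when a permission failure is the only obstacle.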
     
  • ptrace no longer fiddles with the children/sibling links, and the
    old ptrace_children list is gone. Now ptrace, whether of one's own
    children or another's via PTRACE_ATTACH, just uses the new ptraced
    list instead.

    There should be no user-visible difference that matters. The only
    change is the order in which do_wait() sees multiple stopped
    children and stopped ptrace attachees. Since wait_task_stopped()
    was changed earlier so it no longer reorders the children list, we
    already know this won't cause any new problems.

    Signed-off-by: Roland McGrath

    Roland McGrath
     
  • This breaks out the guts of do_wait into three subfunctions.
    The control flow is easier to follow without so many gotos.
    do_wait_thread and ptrace_do_wait contain the main work of the outer loop.
    wait_consider_task contains the main work of the inner loop.

    Signed-off-by: Roland McGrath

    Roland McGrath
     

03 Jul, 2008

1 commit


25 May, 2008

1 commit

  • __exit_signal() does flush_sigqueue(tsk->pending) outside of ->siglock.
    This can race with another thread doing sigqueue_free(), we can free the
    same SIGQUEUE_PREALLOC sigqueue twice or corrupt the pending->list.

    Note that even sys_exit_group() can trigger this race, not only
    sys_timer_delete().

    Move the callsite of flush_sigqueue(tsk->pending) under ->siglock.

    This patch doesn't touch flush_sigqueue(->shared_pending) below, it is
    called when there are no other threads which can play with signals, and
    sigqueue_free() can't be used outside of our thread group.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

02 May, 2008

1 commit


30 Apr, 2008

6 commits


29 Apr, 2008

1 commit

  • Remove the mem_cgroup member from mm_struct and instead add an owner.

    This approach was suggested by Paul Menage. The advantage of this approach
    is that, once the mm->owner is known, using the subsystem id, the cgroup
    can be determined. It also allows several control groups that are
    virtually grouped by mm_struct, to exist independent of the memory
    controller i.e., without adding mem_cgroup's for each controller, to
    mm_struct.

    A new config option CONFIG_MM_OWNER is added and the memory resource
    controller selects this config option.

    This patch also adds cgroup callbacks to notify subsystems when mm->owner
    changes. The mm_cgroup_changed callback is called with the task_lock() of
    the new task held and is called just prior to changing the mm->owner.

    I am indebted to Paul Menage for the several reviews of this patchset and
    helping me make it lighter and simpler.

    This patch was tested on a powerpc box, it was compiled with both the
    MM_OWNER config turned on and off.

    After the thread group leader exits, it's moved to init_css_set by
    cgroup_exit(), thus all future charges from running threads would be
    redirected to the init_css_set's subsystem.

    Signed-off-by: Balbir Singh
    Cc: Pavel Emelianov
    Cc: Hugh Dickins
    Cc: Sudhir Kumar
    Cc: YAMAMOTO Takashi
    Cc: Hirokazu Takahashi
    Cc: David Rientjes
    Cc: Balbir Singh
    Acked-by: KAMEZAWA Hiroyuki
    Acked-by: Pekka Enberg
    Reviewed-by: Paul Menage
    Cc: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Balbir Singh
     

28 Apr, 2008

1 commit

  • This is a change that was requested some time ago by Mel Gorman. Makes sense
    to me, so here it is.

    Note: I retain the name "mpol_free_shared_policy()" because it actually does
    free the shared_policy, which is NOT a reference counted object. However, ...

    The mempolicy object[s] referenced by the shared_policy are reference counted,
    so mpol_put() is used to release the reference held by the shared_policy. The
    mempolicy might not be freed at this time, because some task attached to the
    shared object associated with the shared policy may be in the process of
    allocating a page based on the mempolicy. In that case, the task performing
    the allocation will hold a reference on the mempolicy, obtained via
    mpol_shared_policy_lookup(). The mempolicy will be freed when all tasks
    holding such a reference have called mpol_put() for the mempolicy.

    Signed-off-by: Lee Schermerhorn
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Lee Schermerhorn
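
    The lifetime rule explained above (the policy is freed only when the
    last holder drops its reference) is ordinary reference counting. A
    minimal user-space sketch, assuming a hypothetical mempolicy_sketch
    structure and hypothetical helpers; it does not reproduce the kernel's
    mpol_put() or mpol_shared_policy_lookup():

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

/* Hypothetical reference-counted policy object. */
struct mempolicy_sketch {
    atomic_int refcnt;
    int mode;
};

/* Create a policy holding one reference (the shared_policy's own). */
static struct mempolicy_sketch *mpol_new_sketch(int mode)
{
    struct mempolicy_sketch *p = malloc(sizeof(*p));

    if (p == NULL)
        return NULL;
    atomic_init(&p->refcnt, 1);
    p->mode = mode;
    return p;
}

/* A task that looks the policy up for an allocation takes a reference. */
static void mpol_get_sketch(struct mempolicy_sketch *p)
{
    assert(p != NULL);
    atomic_fetch_add(&p->refcnt, 1);
}

/* Drop one reference; returns 1 if this call freed the object. */
static int mpol_put_sketch(struct mempolicy_sketch *p)
{
    assert(p != NULL);
    if (atomic_fetch_sub(&p->refcnt, 1) == 1) {
        free(p);
        return 1;
    }
    return 0;
}
```

    Tearing down the shared_policy corresponds to one mpol_put_sketch()
    call; the object survives until an allocating task drops its reference.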
     

25 Apr, 2008

2 commits

  • * let unshare_files() give caller the displaced files_struct
    * don't bother with grabbing reference only to drop it in the
    caller if it hadn't been shared in the first place
    * in that form unshare_files() is trivially implemented via
    unshare_fd(), so we eliminate the duplicate logics in fork.c
    * reset_files_struct() is only ever called for current, and not by
    accident: it would break the system if somebody ever called it for
    anything else (we can't modify ->files of somebody else). Lose the
    task_struct * argument.

    Signed-off-by: Al Viro

    Al Viro
     
  • * unshare_files() can fail; doing it after irreversible actions is wrong
    and de_thread() is certainly irreversible.
    * since we do it unconditionally anyway, we might as well do it in do_execve()
    and save ourselves the PITA in binfmt handlers, etc.
    * while we are at it, binfmt_som actually leaked files_struct on failure.

    As a side benefit, unshare_files(), put_files_struct() and reset_files_struct()
    become unexported.

    Signed-off-by: Al Viro

    Al Viro
     

23 Apr, 2008

1 commit


11 Apr, 2008

1 commit

  • The prevent_tail_call() macro works around the problem of the compiler
    clobbering argument words on the stack, which for asmlinkage functions
    is the caller's (user's) struct pt_regs. The tail/sibling-call
    optimization is not the only way that the compiler can decide to use
    stack argument words as scratch space, which we have to prevent.
    Other optimizations can do it too.

    Until we have new compiler support to make "asmlinkage" binding on the
    compiler's own use of the stack argument frame, we have to work around all
    the manifestations of this issue that crop up.

    More cases seem to be prevented by also keeping the incoming argument
    variables live at the end of the function. This makes their original
    stack slots attractive places to leave those variables, so the compiler
    tends not to clobber them for something else. It's still no guarantee, but
    it handles some observed cases that prevent_tail_call() did not.

    Signed-off-by: Roland McGrath
    Signed-off-by: Linus Torvalds

    Roland McGrath
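
    One way to keep incoming arguments live until the end of a function is
    an empty inline-asm statement that names them as inputs. The sketch
    below relies on the GNU extended asm syntax accepted by GCC and Clang;
    the function add_keep_args_live() is hypothetical and this is not
    claimed to be the exact construct the patch uses.

```c
#include <assert.h>

/*
 * The asm body is empty, so no instructions are emitted. Listing "ret" as
 * both output and input ("0" ties it to operand 0) and the arguments as
 * extra inputs tells the compiler that a and b are still needed here, at
 * the very end of the function.
 */
static long add_keep_args_live(long a, long b)
{
    long ret = a + b;

    assert(ret - b == a);
    __asm__ __volatile__("" : "=r"(ret) : "0"(ret), "g"(a), "g"(b));
    return ret;
}
```

    As the commit message says, this only makes the original stack slots a
    more attractive home for the variables; it is a hint, not a guarantee.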
     

09 Mar, 2008

1 commit

  • In commit ee7c82da830ea860b1f9274f1f0cdf99f206e7c2 ("wait_task_stopped:
    simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short)
    cast when storing si_code was lost in wait_task_stopped. This leaks the
    in-kernel CLD_* values that do not match what userland expects.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Roland McGrath
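
    Why a (short) cast matters can be shown with a small example. It
    assumes that the in-kernel si_code carries a class tag in its upper 16
    bits and the userland value in the lower 16 bits; the tag value
    SI_CHLD_TAG_SKETCH and the helper user_si_code() are illustrative
    assumptions, not copied from kernel headers.

```c
#include <assert.h>

/* Assumed in-kernel class tag stored above the low 16 bits. */
#define SI_CHLD_TAG_SKETCH (4 << 16)

/* Userland value of CLD_STOPPED. */
#define CLD_STOPPED_USER 5

/*
 * Narrowing to short keeps the low 16 bits, which is the value userland
 * expects. Converting an out-of-range int to short is implementation-defined
 * in ISO C; GCC and Clang reduce the value modulo 2^16.
 */
static int user_si_code(int kernel_code)
{
    return (short)kernel_code;
}
```

    Without the cast, a caller of waitid() would see the tagged value
    (here 0x40005) instead of CLD_STOPPED.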
     

04 Mar, 2008

2 commits

  • 1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we
    should do this only when the whole process exits.

    2. exit_notify() uses "current" as "ignored_task", obviously wrong.
    Use ->group_leader instead.

    Test case:

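    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <sys/syscall.h>

    /* SIGHUP handler in the child: shows whether the hangup was delivered. */
    void hup(int sig)
    {
        printf("HUP received\n");
    }

    /* Sub-thread that outlives the main thread by one second. */
    void *tfunc(void *arg)
    {
        sleep(2);
        printf("sub-thread exited\n");
        return NULL;
    }

    int main(int argc, char *argv[])
    {
        /* Child: install the handler, then stop itself. */
        if (!fork()) {
            signal(SIGHUP, hup);
            kill(getpid(), SIGSTOP);
            exit(0);
        }

        pthread_t thr;
        pthread_create(&thr, NULL, tfunc, NULL);

        sleep(1);
        printf("main thread exited\n");
        /* Exit only the main thread, not the whole thread group. */
        syscall(__NR_exit, 0);

        return 0;
    }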

    output:

    main thread exited
    HUP received
    Hangup

    With this patch the output is:

    main thread exited
    sub-thread exited
    HUP received

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • p->exit_state != 0 doesn't mean this process is dead, it may have
    sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)"
    instead.

    Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process
    if the main thread has exited.

    However, the new check is not perfect either. There is a window after
    exit_notify() drops tasklist_lock and before release_task(). Suppose that
    the last (non-leader) thread exits. This means that the entire group exits,
    but thread_group_empty() is not true yet.

    As Eric pointed out, is_global_init() is wrong as well, but I did not
    dare to do other changes.

    Just for the record, has_stopped_jobs() is absolutely wrong too. But we
    can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues.

    Even with this patch ^Z doesn't play well with the dead main thread.
    The task is stopped correctly but do_wait(WSTOPPED) won't see it. This
    is another unrelated issue, will be (hopefully) fixed separately.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov