09 Mar, 2008

1 commit

  • In commit ee7c82da830ea860b1f9274f1f0cdf99f206e7c2 ("wait_task_stopped:
    simplify and fix races with SIGCONT/SIGKILL/untrace"), the magic (short)
    cast when storing si_code was lost in wait_task_stopped. This leaks the
    in-kernel CLD_* values that do not match what userland expects.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Roland McGrath
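
    A minimal userland sketch of why the (short) cast matters. It assumes the
    in-kernel CLD_* values carry a class tag in the upper 16 bits; the macro
    names below are hypothetical stand-ins, not the kernel's definitions, and
    the userland value 5 for CLD_STOPPED is the Linux one.

```c
/* Hypothetical stand-in for the in-kernel class tag stored above bit 15. */
#define KERNEL_SI_CHLD_MODEL     (4 << 16)
/* In-kernel style value for "child stopped"; userland expects plain 5. */
#define KERNEL_CLD_STOPPED_MODEL (KERNEL_SI_CHLD_MODEL | 5)

/*
 * Keep only the low 16 bits, as the (short) cast does when si_code is
 * stored. The narrowing conversion is implementation-defined in ISO C;
 * gcc and clang truncate, which is what this sketch relies on.
 */
static int user_si_code(int kernel_code)
{
    return (short)kernel_code;
}
```

    Without the cast, userland would see 262149 (0x40005) in si_code
    instead of 5.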
     

04 Mar, 2008

3 commits

  • 1. exit_notify() always calls kill_orphaned_pgrp(). This is wrong, we
    should do this only when the whole process exits.

    2. exit_notify() uses "current" as "ignored_task", obviously wrong.
    Use ->group_leader instead.

    Test case:

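    #include <pthread.h>
    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    void hup(int sig)
    {
            printf("HUP received\n");
    }

    void *tfunc(void *arg)
    {
            sleep(2);
            printf("sub-thread exited\n");
            return NULL;
    }

    int main(int argc, char *argv[])
    {
            pthread_t thr;

            if (!fork()) {
                    /* child: install a SIGHUP handler, then stop itself */
                    signal(SIGHUP, hup);
                    kill(getpid(), SIGSTOP);
                    exit(0);
            }

            pthread_create(&thr, NULL, tfunc, NULL);

            sleep(1);
            printf("main thread exited\n");
            /* exit only the main thread; the sub-thread keeps running */
            syscall(__NR_exit, 0);

            return 0;
    }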

    output:

    main thread exited
    HUP received
    Hangup

    With this patch the output is:

    main thread exited
    sub-thread exited
    HUP received

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • p->exit_state != 0 doesn't mean this process is dead, it may have
    sub-threads. Change the code to use "p->exit_state && thread_group_empty(p)"
    instead.

    Without this patch, ^Z doesn't deliver SIGTSTP to the foreground process
    if the main thread has exited.

    However, the new check is not perfect either. There is a window when
    exit_notify() drops tasklist and before release_task(). Suppose that
    the last (non-leader) thread exits. This means that entire group exits,
    but thread_group_empty() is not true yet.

    As Eric pointed out, is_global_init() is wrong as well, but I did not
    dare to do other changes.

    Just for the record, has_stopped_jobs() is absolutely wrong too. But we
    can't fix it now, we should first fix SIGNAL_STOP_STOPPED issues.

    Even with this patch ^Z doesn't play well with the dead main thread.
    The task is stopped correctly but do_wait(WSTOPPED) won't see it. This
    is another unrelated issue, will be (hopefully) fixed separately.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
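
    A small model of the new check, assuming a simplified task with only the
    two properties the message talks about. The struct and helper names are
    hypothetical; the real code works on struct task_struct.

```c
#include <stdbool.h>

/* Hypothetical, simplified view of a thread group leader. */
struct task_model {
    int exit_state;     /* nonzero once this thread has exited */
    int other_threads;  /* threads still in the group besides this one */
};

/* Models thread_group_empty(): no other thread is left in the group. */
static bool thread_group_empty_model(const struct task_model *p)
{
    return p->other_threads == 0;
}

/* Old test: treats the process as dead as soon as the leader exits. */
static bool is_dead_old(const struct task_model *p)
{
    return p->exit_state != 0;
}

/* New test: the leader has exited and no sub-thread remains. */
static bool is_dead_new(const struct task_model *p)
{
    return p->exit_state && thread_group_empty_model(p);
}
```

    For a leader that has exited while one sub-thread is still running, the
    old test reports "dead" and the new one does not.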
     
  • Factor out the common code in reparent_thread() and exit_notify().

    No functional changes.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Feb, 2008

1 commit


09 Feb, 2008

16 commits

  • [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Harvey Harrison
    Acked-by: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Harvey Harrison
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This modifies do_wait and eligible_child to take a pair of enum pid_type
    and struct pid *pid to precisely specify what set of processes are eligible
    to be waited for, instead of the raw pid_t value from sys_wait4.

    This fixes a bug in sys_waitid where you could not wait for children in
    just process group 1.

    This fixes a pid namespace crossing case in eligible_child, allowing us to
    wait for processes in our current process group even if our current
    process group == 0.

    This allows the no-child-with-this-pid case to be optimized. It also
    allows the pid membership test in eligible_child to be optimized.

    This even closes a theoretical pid wraparound race where in a threaded
    parent if two threads are waiting for the same child and one thread picks
    up the child and the pid numbers wrap around and generate another child
    with that same pid before the other thread is scheduled (terribly insanely
    unlikely) we could end up waiting on the second child with the same pid#
    and not discover that the specific child we were waiting for has exited.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
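
    As an illustration of turning the raw pid_t argument into a (type, id)
    pair, here is a userland sketch that follows the documented wait4()
    argument conventions. The enum and struct names are hypothetical, and a
    plain number stands in for the struct pid pointer the patch uses.

```c
#include <sys/types.h>

/* Hypothetical selector; the patch itself uses enum pid_type. */
enum wait_type_model { WAIT_ANY_CHILD, WAIT_BY_PGID, WAIT_BY_PID };

struct wait_target_model {
    enum wait_type_model type;
    pid_t nr;   /* stands in for the struct pid lookup result */
};

/* Classify the pid argument of wait4(): -1, < -1, 0, or > 0. */
static struct wait_target_model classify_wait_pid(pid_t upid, pid_t caller_pgrp)
{
    struct wait_target_model t;

    if (upid == -1) {            /* any child */
        t.type = WAIT_ANY_CHILD;
        t.nr = 0;
    } else if (upid < 0) {       /* children in process group -upid */
        t.type = WAIT_BY_PGID;
        t.nr = -upid;
    } else if (upid == 0) {      /* children in the caller's process group */
        t.type = WAIT_BY_PGID;
        t.nr = caller_pgrp;
    } else {                     /* one specific child */
        t.type = WAIT_BY_PID;
        t.nr = upid;
    }
    return t;
}
```

    Once the selector is explicit, "process group 1" and "any child" are no
    longer both squeezed into special values of one signed number.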
     
  • The previous bugfix was not optimal, we shouldn't care about group stop
    when we are the only thread or the group stop is in progress. In that case
    nothing special is needed, just set PF_EXITING and return.

    Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().

    So, from the performance POV the only difference is that we don't trust
    !signal_pending() until we take ->siglock. But this in fact fixes another
    ___pure___ theoretical minor race. __group_complete_signal() finds the
    task without PF_EXITING and chooses it as the target for signal_wake_up().
    But nothing prevents this task from exiting in between without noticing the
    pending signal and thus unpredictably delaying the actual delivery.

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_signal_stop() counts all sub-threads and sets ->group_stop_count
    accordingly. Every thread should decrement ->group_stop_count and stop;
    the last one should notify the parent.

    However a sub-thread can exit before it notices the signal_pending(), or it
    may be somewhere in do_exit() already. In that case the group stop never
    finishes properly.

    Note: this is a minimal fix, we can add some optimizations later. Say we
    can return quickly if thread_group_empty(). Also, we can move some signal
    related code from exit_notify() to exit_signals().

    Signed-off-by: Oleg Nesterov
    Acked-by: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
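
    The counting protocol can be modelled in a few lines. This is only a
    sketch with a hypothetical struct and no locking; it shows why a thread
    that exits without taking part leaves the count stuck above zero.

```c
#include <stdbool.h>

/* Hypothetical model of the shared per-group signal state. */
struct signal_model {
    int group_stop_count;   /* threads that still have to take part */
};

/*
 * Called by a thread that stops, or (the point of the fix) by a thread
 * that exits while a group stop is pending. Returns true when the caller
 * is the last participant and therefore has to notify the parent.
 */
static bool participate_in_group_stop(struct signal_model *sig)
{
    if (sig->group_stop_count == 0)
        return false;               /* no group stop in progress */
    return --sig->group_stop_count == 0;
}
```

    With a count of 2, the first caller gets false and the second gets true.
    If one of the two threads exits without calling the helper, nobody ever
    sees true and the stop never completes.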
     
  • Daemonized kernel threads run in the init's session. This doesn't match the
    behaviour of kthread_create()'ed threads, and this is one of the 2 reasons
    why we need a special hack in sys_setsid().

    Now that set_special_pids() was changed to use struct pid, not pid_t, we can
    use init_struct_pid and set 0,0 special pids.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change set_special_pids() to work with struct pid, not pid_t from the global
    namespace. This again speeds up and, imho, cleans up the code. It is also a
    preparation for the next patch.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The first "p->exit_state != EXIT_ZOMBIE" check doesn't make too much sense.
    The exit_state was EXIT_ZOMBIE when the function was called, and another
    thread can change it to EXIT_DEAD right after the check.

    The second condition is not possible, detached non-traced threads were already
    filtered out by eligible_child(), we didn't drop tasklist since then.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Surprise, the other two wait_task_*() functions also abuse the
    task_pid_nr_ns() function, and may cause read-after-free or report nr == 0
    in wait_task_continued(). wait_task_zombie() doesn't have this problem,
    but it is still better to cache pid_t rather than call task_pid_nr_ns()
    three times on the saved pid_namespace.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Imho, the current usage of security_task_wait() is not logical.

    Suppose we have the single child p, and security_task_wait(p) returns
    -EANY. In that case waitpid(-1) returns this error. Why? Isn't it
    better to return ECHILD? We don't really have reapable children.

    Now suppose that child was stolen by gdb. In that case we find this
    child on ->ptrace_children and set flag = 1, but we don't check that the
    child was denied. So, do_wait(..., WNOHANG) returns 0, this doesn't
    match the behaviour above. Without WNOHANG do_wait() blocks only to
    return the error later, when the child is untraced. Imho, really
    strange.

    I think eligible_child() should return the error only if the child's pid
    was requested explicitly, otherwise we should silently ignore the tasks
    which were nacked by security_task_wait().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Chris Wright
    Cc: Eric Paris
    Cc: James Morris
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • eligible_child() == 2 means delay_group_leader(). With the previous patch
    this only matters for EXIT_ZOMBIE task, we can move that special check to
    the only place it is really needed.

    Also, with this patch we don't skip security_task_wait() for the group
    leaders in a non-empty thread group. I don't really understand the exact
    semantics of security_task_wait(), but imho this change is a bugfix.

    Also rearrange the code a bit to kill an ugly "check_continued" backdoor.

    Signed-off-by: Oleg Nesterov
    Cc: Eric Paris
    Cc: James Morris
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_stopped() doesn't need the "delay_group_leader" parameter. If
    the child is not traced it must be a group leader. With or without
    subthreads ->group_stop_count == 0 when the whole task is stopped.

    Signed-off-by: Oleg Nesterov
    Cc: Mika Penttila
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Every branch of the main "if" statement runs the same code at the end. Move
    it down. Also, fix the indentation.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_stopped() has multiple races with SIGCONT/SIGKILL. tasklist_lock
    does not pin the child in TASK_TRACED/TASK_STOPPED state, almost all info
    reported (including exit_code) may be wrong.

    In fact, the code under write_lock_irq(tasklist_lock) is not safe. The child
    may be PTRACE_DETACH'ed at this time by another subthread, in that case it is
    possible we are no longer its ->parent.

    Change wait_task_stopped() to take ->siglock before inspecting the task. This
    guarantees that the child can't resume and (for example) clear its
    ->exit_code, so we don't need to use xchg(&p->exit_code) and re-check. The
    only exception is ptrace_stop() which changes ->state and ->exit_code without
    ->siglock held during abort. But this can only happen if both the tracer and
    the tracee are dying (coredump is in progress), we don't care.

    With this patch wait_task_stopped() doesn't move the child to the end of
    the ->parent list on success. This optimization could be restored, but
    in that case we have to take write_lock(tasklist) and do some nasty
    checks.

    Also change the do_wait() since we don't return EAGAIN any longer.

    [akpm@linux-foundation.org: fix up after Willy renamed everything]
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that my_ptrace_child() is trivial we can use the "p->ptrace & PT_PTRACED"
    inline and simplify the corresponding logic in do_wait: we can't find the
    child in TASK_TRACED state without PT_PTRACED flag set, ptrace_untrace()
    either sets TASK_STOPPED or wakes up the tracee.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Since the patch

    "Fix ptrace_attach()/ptrace_traceme()/de_thread() race"
    commit f5b40e363ad6041a96e3da32281d8faa191597b9

    we set PT_ATTACHED and change child->parent "atomically" wrt task_list lock.

    This means we can remove the checks like "PT_ATTACHED && ->parent != ptracer"
    which were needed to catch the "ptrace attach is in progress" case. We can
    also remove the flag itself since nobody else uses it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

07 Feb, 2008

1 commit


06 Feb, 2008

1 commit

  • As Roland pointed out, we have the very old problem with exec. de_thread()
    sets SIGNAL_GROUP_EXIT, kills other threads, changes ->group_leader and then
    clears signal->flags. All signals (even fatal ones) sent in this window
    (which is not too small) will be lost.

    With this patch exec doesn't abuse SIGNAL_GROUP_EXIT. signal_group_exit(),
    the new helper, should be used to detect exit_group() or exec() in progress.
    It can have more users, but this patch does only strictly necessary changes.

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Robin Holt
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
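
    A rough model of what a helper like signal_group_exit() has to answer:
    "is either exit_group() or exec() tearing this group down?". The flag
    value, struct, and field names below are assumptions made for the
    sketch and are not taken from the patch.

```c
#include <stdbool.h>

#define GROUP_EXIT_FLAG_MODEL 0x8u  /* hypothetical flag bit */

/* Hypothetical, simplified per-group signal state. */
struct signal_model {
    unsigned int flags;         /* carries the group-exit flag */
    bool exec_in_progress;      /* tracked separately from the flag */
};

/* True while exit_group() or exec() is killing the other threads. */
static bool signal_group_exit_model(const struct signal_model *sig)
{
    return (sig->flags & GROUP_EXIT_FLAG_MODEL) || sig->exec_in_progress;
}
```

    Because exec no longer sets and then clears the shared flag, clearing
    signal->flags at the end of exec cannot wipe out a fatal signal that
    arrived in between.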
     

03 Feb, 2008

1 commit


07 Dec, 2007

1 commit


30 Nov, 2007

2 commits

  • In wait_task_stopped() exit_code already contains the right value for the
    si_status member of siginfo, and this is simply set in the non WNOWAIT
    case.

    If you call waitid() with a stopped or traced process, you'll get the signal
    in siginfo.si_status as expected -- however if you call waitid(WNOWAIT) at the
    same time, you'll get the signal << 8 | 0x7f

    Pass it unchanged to wait_noreap_copyout(); we would only need to shift it
    and add 0x7f if we were returning it in the user status field and that
    isn't used for any function that permits WNOWAIT.

    Signed-off-by: Scott James Remnant
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scott James Remnant
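
    The two encodings mentioned above can be compared in userland. This
    sketch assumes the conventional Linux layout of the wait status word;
    the helper names are made up for the example.

```c
#include <signal.h>
#include <sys/wait.h>

/* wait4()-style status word for a child stopped by signal sig. */
static int stop_status_word(int sig)
{
    return (sig << 8) | 0x7f;
}

/* waitid() reports the bare signal number in si_status. */
static int stop_si_status(int sig)
{
    return sig;
}
```

    The status word is what WIFSTOPPED() and WSTOPSIG() decode. A waitid()
    caller reads si_status directly, so handing it the shifted value makes
    the signal number look wrong.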
     
  • wait_task_stopped(WNOWAIT) does task_pid_nr_ns() without tasklist/rcu lock,
    so we can read already freed memory. Use the cached pid_t value.

    Signed-off-by: Oleg Nesterov
    Looks-good-to: Roland McGrath
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

16 Nov, 2007

1 commit

  • The original meaning of the old test (p->state > TASK_STOPPED) was
    "not dead", since it was before TASK_TRACED existed and before the
    state/exit_state split. It was a wrong correction in commit
    14bf01bb0599c89fc7f426d20353b76e12555308 to make this test for
    TASK_TRACED instead. It should have been changed when TASK_TRACED
    was introduced and again when exit_state was introduced.

    Signed-off-by: Roland McGrath
    Cc: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: Kees Cook
    Acked-by: Scott James Remnant
    Signed-off-by: Linus Torvalds

    Roland McGrath
     

20 Oct, 2007

12 commits

  • Save ~650 bytes here.

    add/remove: 4/0 grow/shrink: 0/7 up/down: 430/-1088 (-658)
    function                  old     new   delta
    __copy_fs_struct            -     202    +202
    __put_fs_struct             -     112    +112
    __exit_fs                   -      58     +58
    __exit_files                -      58     +58
    exit_files                 58       2     -56
    put_fs_struct             112       5    -107
    exit_fs                   161       2    -159
    sys_unshare               774     590    -184
    copy_process             4031    3840    -191
    do_exit                  1791    1597    -194
    copy_fs_struct            202       5    -197

    No difference in lmbench lat_proc tests on 2-way Opteron 246.
    Smaaaal degradation on UP P4 (within errors).

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Alexey Dobriyan
    Cc: Arjan van de Ven
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     
  • The task_struct->pid member is going to be deprecated, so start
    using the helpers (task_pid_nr/task_pid_vnr/task_pid_nr_ns) in
    the kernel.

    The first thing to start with is the pid, printed to dmesg - in
    this case we may safely use task_pid_nr(). Besides, printks produce
    more (much more) than a half of all the explicit pid usage.

    [akpm@linux-foundation.org: git-drm went and changed lots of stuff]
    Signed-off-by: Pavel Emelyanov
    Cc: Dave Airlie
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • The pgrp field is not used widely around the kernel so it is now marked as
    deprecated with appropriate comment.

    The initialization of INIT_SIGNALS is trimmed because
    a) they are set to 0 automatically;
    b) gcc cannot properly initialize two anonymous (the second one
    is the one with the session) unions. In this particular case
    to make it compile we'd have to add some field initialized
    right before the .pgrp.

    This is the same patch as the 1ec320afdc9552c92191d5f89fcd1ebe588334ca one
    (from Cedric), but for the pgrp field.

    Some progress report:

    We have to deprecate the pid, tgid, session and pgrp fields on struct
    task_struct and struct signal_struct. The session and pgrp are already
    deprecated. The tgid value is close to being such - the worst known usage
    is in fs/locks.c and audit code. The pid field deprecation is mainly
    blocked by numerous printk-s around the kernel that print the tsk->pid to
    the log.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Cedric Le Goater
    Cc: Serge Hallyn
    Cc: "Eric W. Biederman"
    Cc: Herbert Poetzl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This is the largest patch in the set. Make all (I hope) the places where
    the pid is shown to or read from the user operate on the virtual pids.

    The idea is:
    - all in-kernel data structures must store either struct pid itself
      or the pid's global nr, obtained with the pid_nr() call;
    - when seeking the task from kernel code with the stored id one
      should use the find_task_by_pid() call, which works with global pids;
    - when showing a pid's numerical value to the user the virtual one
      should be used; however, when one shows a task's pid outside this
      task's namespace the global one is to be used;
    - when getting the pid from userspace one needs to consider it as
      the virtual one and use appropriate task/pid-searching functions.

    [akpm@linux-foundation.org: build fix]
    [akpm@linux-foundation.org: nuther build fix]
    [akpm@linux-foundation.org: yet nuther build fix]
    [akpm@linux-foundation.org: remove unneeded casts]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Alexey Dobriyan
    Cc: Sukadev Bhattiprolu
    Cc: Oleg Nesterov
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Terminate all processes in a namespace when the reaper of the namespace is
    exiting. We do this by walking the pidmap of the namespace and sending
    SIGKILL to all processes.

    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sukadev Bhattiprolu
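
    The pidmap walk can be pictured as a scan over a bitmap of allocated
    pid numbers. The sketch below only collects the numbers that would be
    signalled, and assumes nr 1 is the reaper itself and is skipped; the
    helper names and the flat bitmap are simplifications for illustration.

```c
#include <limits.h>
#include <stddef.h>

#define BITS_PER_WORD_MODEL (sizeof(unsigned long) * CHAR_BIT)

/* Return the first set bit at or after 'from', or -1 if there is none. */
static int next_set_bit(const unsigned long *map, size_t nbits, size_t from)
{
    for (size_t i = from; i < nbits; i++)
        if (map[i / BITS_PER_WORD_MODEL] & (1UL << (i % BITS_PER_WORD_MODEL)))
            return (int)i;
    return -1;
}

/* Collect every allocated nr above 1; each would be sent SIGKILL. */
static size_t collect_kill_targets(const unsigned long *map, size_t nbits,
                                   int *out, size_t cap)
{
    size_t n = 0;

    for (int nr = next_set_bit(map, nbits, 2);
         nr >= 0 && n < cap;
         nr = next_set_bit(map, nbits, (size_t)nr + 1))
        out[n++] = nr;
    return n;
}
```

    For a map with bits 1, 3 and 9 set, the targets are 3 and 9.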
     
  • The first part is trivial - we just make the proc_flush_task() to operate on
    arbitrary vfsmount with arbitrary ids and pass the pid and global proc_mnt to
    it.

    The other change is more tricky: I moved the proc_flush_task() call in
    release_task() higher to address the following problem.

    When flushing task from many proc trees we need to know the set of ids (not
    just one pid) to find the dentries' names to flush. Thus we need to pass the
    task's pid to proc_flush_task() as struct pid is the only object that can
    provide all the pid numbers. But after __exit_signal() task has detached all
    his pids and this information is lost.

    This creates a tiny gap for proc_pid_lookup() to bring some dentries back to
    tree and keep them in hash (since pids are still alive before __exit_signal())
    till the next shrink, but since proc_flush_task() does not provide a 100%
    guarantee that the dentries will be flushed, this is OK to do so.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Make task release its namespaces after it has reparented all his children to
    child_reaper, but before it notifies its parent about its death.

    The reason to release namespaces after reparenting is that when a task exits it
    may send a signal to its parent (SIGCHLD), but if the parent has already
    exited its namespaces there will be no way to decide what pid to deliver to it:
    the parent can be from a different namespace.

    The reason to release namespaces before notifying the parent is that once the
    task sends SIGCHLD, the parent can call wait() on this task and release it. But
    releasing the mnt namespace implies dropping all the mounts in the mnt
    namespace, and NFS expects the task to have a valid sighand pointer.

    Thanks to Oleg for pointing out some races that can appear and for helping with
    patches and fixes.

    Signed-off-by: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • A pid namespace is a "view" of a particular set of tasks on the system. They
    work in a similar way to filesystem namespaces. A file (or a process) can be
    accessed in multiple namespaces, but it may have a different name in each. In
    a filesystem, this name might be /etc/passwd in one namespace, but
    /chroot/etc/passwd in another.

    For processes, a process may have pid 1234 in one namespace, but be pid 1 in
    another. This allows new pid namespaces to have basically arbitrary pids, and
    not have to worry about what pids exist in other namespaces. This is
    essential for checkpoint/restart where a restarted process's pid might collide
    with an existing process on the system's pid.

    In this particular implementation, pid namespaces have a parent-child
    relationship, just like processes. A process in a pid namespace may see all
    of the processes in the same namespace, as well as all of the processes in all
    of the namespaces which are children of its namespace. Processes may not,
    however, see others which are in their parent's namespace, but not in their
    own. The same goes for sibling namespaces.

    The known issue to be solved in the near future is signal handling at the
    namespace boundary. That is, currently the namespace's init is treated like
    an ordinary task that can be killed from within a namespace. Ideally, the
    signal handling by the namespace's init should have two sides: when signaling
    the init from its namespace, the init should look like a real init task, i.e.
    receive only those signals that it explicitly wants to; when signaling the
    init from one of the parent namespaces, init should look like an ordinary
    task, i.e. receive any signal, only taking the general permissions into
    account.

    The pid namespace was developed by Pavel Emelyanov and Sukadev Bhattiprolu and
    we eventually came to almost the same implementation, which differed in some
    details. This set is based on Pavel's patches, but it includes comments and
    patches from Sukadev.

    Many thanks to Oleg, who reviewed the patches, pointed out many BUGs and gave
    valuable advice on how to make this set cleaner.

    This patch:

    We have to call exit_task_namespaces() only after the exiting task has
    reparented all his children and is sure that no other threads will reparent
    theirs for it. Why this is needed is explained in appropriate patch. This
    one only reworks the forget_original_parent() so that after calling this a
    task cannot be/become parent of any other task.

    We check PF_EXITING instead of ->exit_state while choosing the new parent.
    Note that tasklist_lock acts as a barrier: everyone who takes tasklist after
    us (when forget_original_parent() drops it) must see PF_EXITING.

    The other changes are just cleanups. They just move some code from
    exit_notify to forget_original_parent(). It is a bit silly to declare
    ptrace_dead in exit_notify(), take tasklist, pass ptrace_dead to
    forget_original_parent(), unlock-lock-unlock tasklist, and then use
    ptrace_dead.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Paul Menage
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
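
    The visibility rules described above (a process has one number per
    namespace level, and is invisible from sibling or child namespaces) can
    be sketched as follows. All struct and function names are hypothetical,
    and the fixed depth limit exists only to keep the example short.

```c
#define MAX_LEVEL_MODEL 4   /* arbitrary depth limit for this sketch */

struct ns_model {
    int level;              /* 0 for the root namespace */
};

struct upid_model {
    int nr;                         /* number seen at this level */
    const struct ns_model *ns;      /* namespace the number belongs to */
};

struct pid_model {
    int level;                                  /* deepest level it lives in */
    struct upid_model numbers[MAX_LEVEL_MODEL]; /* one entry per level */
};

/* Number of 'pid' as seen from 'ns', or 0 if it is not visible there. */
static int pid_nr_in(const struct pid_model *pid, const struct ns_model *ns)
{
    if (ns->level <= pid->level && pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}
```

    A process created in a child namespace is then 1234 when seen from the
    root, 1 when seen from its own namespace, and 0 (not visible) from a
    sibling namespace.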
     
  • Signed-off-by: Daniel Walker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daniel Walker
     
  • kernel/exit.c: Convert list_for_each(_safe) to
    list_for_each_entry(_safe) in forget_original_parent(), exit_notify()
    and do_wait()

    Signed-off-by: Matthias Kaehlcke
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthias Kaehlcke
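
    For readers unfamiliar with the two iterator styles, here is a userland
    approximation. The macros imitate the kernel ones but are rewritten for
    this example (they rely on the GNU __typeof__ extension); struct
    child_model and the helper functions are hypothetical.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

/* Old style: iterate over raw list nodes. */
#define list_for_each(pos, head) \
    for (pos = (head)->next; pos != (head); pos = pos->next)

/* New style: iterate directly over the containing structures. */
#define list_for_each_entry(pos, head, member) \
    for (pos = container_of((head)->next, __typeof__(*pos), member); \
         &pos->member != (head); \
         pos = container_of(pos->member.next, __typeof__(*pos), member))

struct child_model { int pid; struct list_head sibling; };

static void list_add_tail(struct list_head *item, struct list_head *head)
{
    item->prev = head->prev;
    item->next = head;
    head->prev->next = item;
    head->prev = item;
}

static int sum_pids_old(struct list_head *head)
{
    struct list_head *pos;
    int sum = 0;

    list_for_each(pos, head) {
        /* each user has to recover the entry by hand */
        struct child_model *c = container_of(pos, struct child_model, sibling);
        sum += c->pid;
    }
    return sum;
}

static int sum_pids_new(struct list_head *head)
{
    struct child_model *c;
    int sum = 0;

    list_for_each_entry(c, head, sibling)
        sum += c->pid;
    return sum;
}
```

    Both functions visit the same entries; the second form drops the
    temporary node pointer and the explicit container_of() call.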
     
  • When someone wants to deal with some other task's namespaces, it has to lock
    the task and then get the desired namespace if it exists. This is
    slow on read-only paths and may be impossible in some cases.

    E.g. Oleg recently noticed a race between unshare() and the (sent for
    review in cgroups) pid namespaces - when the task notifies the parent it
    has to know the parent's namespace, but taking the task_lock() is
    impossible there - the code is under write locked tasklist lock.

    On the other hand switching the namespace on task (daemonize) and releasing
    the namespace (after the last task exit) is rather rare operation and we
    can sacrifice its speed to solve the issues above.

    The access to other task namespaces is proposed to be performed
    like this:

    rcu_read_lock();
    nsproxy = task_nsproxy(tsk);
    if (nsproxy != NULL) {
            /*
             * work with the namespaces here
             * e.g. get the reference on one of them
             */
    } /*
       * NULL task_nsproxy() means that this task is
       * almost dead (zombie)
       */
    rcu_read_unlock();

    This patch has passed the review by Eric and Oleg :) and,
    of course, tested.

    [clg@fr.ibm.com: fix unshare()]
    [ebiederm@xmission.com: Update get_net_ns_by_pid]
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Serge Hallyn
    Signed-off-by: Cedric Le Goater
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • is_init() is an ambiguous name for the pid==1 check. Split it into
    is_global_init() and is_container_init().

    A cgroup init has its tsk->pid == 1.

    A global init also has its tsk->pid == 1 and its active pid namespace
    is the init_pid_ns. But rather than check the active pid namespace,
    compare the task structure with 'init_pid_ns.child_reaper', which is
    initialized during boot to the /sbin/init process and never changes.

    Changelog:

    2.6.22-rc4-mm2-pidns1:
    - Use 'init_pid_ns.child_reaper' to determine if a given task is the
    global init (/sbin/init) process. This would improve performance
    and remove dependence on the task_pid().

    2.6.21-mm2-pidns2:

    - [Sukadev Bhattiprolu] Changed is_container_init() calls in {powerpc,
    ppc,avr32}/traps.c for the _exception() call to is_global_init().
    This way, we kill only the cgroup if the cgroup's init has a
    bug rather than force a kernel panic.

    [akpm@linux-foundation.org: fix comment]
    [sukadev@us.ibm.com: Use is_global_init() in arch/m32r/mm/fault.c]
    [bunk@stusta.de: kernel/pid.c: remove unused exports]
    [sukadev@us.ibm.com: Fix capability.c to work with threaded init]
    Signed-off-by: Serge E. Hallyn
    Signed-off-by: Sukadev Bhattiprolu
    Acked-by: Pavel Emelianov
    Cc: Eric W. Biederman
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Herbert Poetzel
    Cc: Kirill Korotaev
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Serge E. Hallyn
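
    A sketch of the distinction between the two checks, assuming a
    simplified task that only records the pid it sees in its own namespace.
    The structs, the init_ns_model variable, and the function names are
    hypothetical models, not the kernel's definitions.

```c
#include <stdbool.h>

struct task_model {
    int vpid;   /* pid as seen inside the task's own namespace */
};

struct pid_ns_model {
    struct task_model *child_reaper;    /* set once, during boot */
};

/* Stands in for the initial pid namespace. */
static struct pid_ns_model init_ns_model;

/* Global init: identified by pointer comparison, not by pid number. */
static bool is_global_init_model(const struct task_model *t)
{
    return t == init_ns_model.child_reaper;
}

/* Container init: any task whose pid is 1 in its own namespace. */
static bool is_container_init_model(const struct task_model *t)
{
    return t->vpid == 1;
}
```

    Two tasks can both have pid 1 in their own namespaces, but only the one
    recorded as the boot-time child reaper counts as the global init.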