09 Feb, 2008

40 commits

  • Remove now unnecessary inclusions of {asm,linux}/a.out.h.

    [akpm@linux-foundation.org: fix alpha build]
    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Suppress A.OUT library support if CONFIG_ARCH_SUPPORTS_AOUT is not set.

    Not all architectures support the A.OUT binfmt, so the ELF binfmt should not
    be permitted to go looking for A.OUT libraries to load in such a case. Not
    only that, but under such conditions A.OUT core dumps are not produced either.

    To make this work, this patch also does the following:

    (1) Makes the existence of the contents of linux/a.out.h contingent on
    CONFIG_ARCH_SUPPORTS_AOUT.

    (2) Renames dump_thread() to aout_dump_thread() as it's only called by A.OUT
    core dumping code.

    (3) Moves aout_dump_thread() into asm/a.out-core.h and makes it inline. This
    is then included only where needed. This means that this bit of arch
    code will be stored in the appropriate A.OUT binfmt module rather than
    the core kernel.

    (4) Drops A.OUT support for Blackfin (according to Mike Frysinger it's not
    needed) and FRV.

    This patch depends on the previous patch to move STACK_TOP[_MAX] out of
    asm/a.out.h and into asm/processor.h as they're required whether or not A.OUT
    format is available.
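
    Steps (1) to (3) can be pictured with a minimal userspace sketch. The
    toy_* structures are hypothetical, trimmed stand-ins for pt_regs and
    struct user, and the #define only imitates what Kconfig would provide;
    this is an illustration, not the code from the patch.

```c
#include <string.h>

/* Stand-in for the Kconfig symbol. An arch without A.OUT support would
 * leave it undefined, and everything inside the #ifdef would vanish. */
#define CONFIG_ARCH_SUPPORTS_AOUT 1

/* Hypothetical, heavily trimmed versions of the real structures. */
struct toy_pt_regs { unsigned long sp; };
struct toy_user    { unsigned long start_stack; unsigned long u_dsize; };

#ifdef CONFIG_ARCH_SUPPORTS_AOUT
/* Would live in asm/a.out-core.h: being inline, the code ends up in
 * whichever binfmt module includes the header, not in the core kernel. */
static inline void aout_dump_thread(const struct toy_pt_regs *regs,
                                    struct toy_user *dump)
{
    memset(dump, 0, sizeof(*dump));
    dump->start_stack = regs->sp;
}
#endif
```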

    [jdike@addtoit.com: uml: re-remove accidentally restored code]
    Signed-off-by: David Howells
    Signed-off-by: Jeff Dike
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Mark arches that support A.OUT format by including the following in their
    master Kconfig files:

    config ARCH_SUPPORTS_AOUT
            def_bool y

    This should also be set if the arch provides compatibility A.OUT support for
    an older arch, for instance x86_64 for i386 or sparc64 for sparc.

    I've guessed at which arches don't, based on comments in the code; however,
    I'm sure that some of the ones I've marked as 'yes' should actually be 'no'.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Move STACK_TOP[_MAX] out of asm/a.out.h and into asm/processor.h as they're
    required whether or not A.OUT format is available.

    Signed-off-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Howells
     
  • Fix typo in comments.

    BTW: I have to fix coding style in arch/ia64/kernel/time.c also, otherwise
    checkpatch.pl will be complaining.

    Signed-off-by: Li Zefan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Function timekeeping_is_continuous() no longer checks the
    CLOCK_IS_CONTINUOUS flag; it now checks CLOCK_SOURCE_VALID_FOR_HRES. So
    rename the function accordingly.
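
    A minimal sketch of why the old name had become misleading. The toy_*
    names and the flag values are hypothetical placeholders; the new
    function name is not stated above, so the one used here is only a guess
    at something suitably descriptive.

```c
#include <stdbool.h>

/* Placeholder flag bits; the numeric values are illustrative only. */
#define TOY_FLAG_CONTINUOUS      0x01UL
#define TOY_FLAG_VALID_FOR_HRES  0x20UL

struct toy_clocksource { unsigned long flags; };

/* The predicate tests the "valid for high resolution" bit, so a name
 * mentioning continuity would no longer describe what it does. */
static bool toy_timekeeping_valid_for_hres(const struct toy_clocksource *cs)
{
    return (cs->flags & TOY_FLAG_VALID_FOR_HRES) != 0;
}
```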

    Signed-off-by: Li Zefan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • list_for_each_safe() suffices here.

    Signed-off-by: Li Zefan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • Flag CLOCK_SOURCE_WATCHDOG is cleared twice. Note clocksource_change_rating()
    won't do anything with the cs flag.

    Signed-off-by: Li Zefan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Li Zefan
     
  • There's only one caller left - the kill_pgrp one - so merge these two
    functions and forget the kill_pgrp_info one.

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • This is the first step (of two) in removing the kill_pgrp_info.

    All the users of this function are in kernel/signal.c, but all they need is to
    call __kill_pgrp_info() with the tasklist_lock read-locked.

    Fortunately, one of its users is the kill_something_info(), which already
    needs this lock in one of its branches, so clean these branches up and call
    the __kill_pgrp_info() directly.

    Based on Oleg's view of how this function should look.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • When sending the pid namespaces patches I wrongly converted the tsk->tgid into
    task_pid_vnr(tsk) in mqueue-s (the git id of this patch is
    b488893a390edfe027bae7a46e9af8083e740668).

    The proper behavior is to get the task_tgid_vnr(tsk).

    This seems to be the only mistake of that kind.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • Some time ago the xxx_vnr() calls (e.g. pid_vnr or find_task_by_vpid) were
    _all_ converted to operate on the current pid namespace. After this each call
    like xxx_nr_ns(foo, current->nsproxy->pid_ns) is nothing but a xxx_vnr(foo)
    one.

    Switch all the xxx_nr_ns() callers to use the xxx_vnr() calls where
    appropriate.
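
    The equivalence can be shown with a toy userspace model. All toy_*
    names are hypothetical; toy_current_ns stands in for
    current->nsproxy->pid_ns, and the lookup only loosely follows the shape
    of the kernel helper.

```c
#include <stddef.h>

#define TOY_MAX_LEVEL 4

struct toy_pid_namespace {
    int level;                          /* 0 for the initial namespace */
    struct toy_pid_namespace *parent;
};

struct toy_pid {
    int level;                          /* deepest level this pid lives in */
    struct {
        int nr;                         /* the number seen at that level */
        struct toy_pid_namespace *ns;
    } numbers[TOY_MAX_LEVEL];
};

/* Models current->nsproxy->pid_ns. */
static struct toy_pid_namespace *toy_current_ns;

/* Number of this pid as seen from ns, or 0 if it is not visible there. */
static int toy_pid_nr_ns(const struct toy_pid *pid,
                         const struct toy_pid_namespace *ns)
{
    if (pid != NULL && ns->level <= pid->level &&
        pid->numbers[ns->level].ns == ns)
        return pid->numbers[ns->level].nr;
    return 0;
}

/* The "virtual" variant is just the _ns variant on the caller's namespace. */
static int toy_pid_vnr(const struct toy_pid *pid)
{
    return toy_pid_nr_ns(pid, toy_current_ns);
}
```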

    Signed-off-by: Pavel Emelyanov
    Reviewed-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Balbir Singh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     
  • signal_struct->tsk points to the ->group_leader and thus we have the nasty
    code in de_thread() which has to change it and restart ->real_timer if the
    leader is changed.

    Use "struct pid *leader_pid" instead. This also allows us to kill now
    unneeded send_group_sig_info().

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Acked-by: Roland McGrath
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is a window when de_thread() switches the leader and drops
    tasklist_lock. In that window do_each_pid_task(PIDTYPE_PID) finds both new
    and old leaders.

    The problem is pretty much theoretical and probably can be ignored. Currently
    the only users of do_each_pid_task(PIDTYPE_PID) are send_sigio/send_sigurg, so
    they can send the signal to the same process twice.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • kill_pid_info()->pid_task() could be the old leader of the execing process.
    In that case it is possible that the leader will be released before we take
    siglock. This means that kill_pid_info() (and thus sys_kill()) can return a
    false -ESRCH.

    Change the code to retry when lock_task_sighand() fails. An endless loop is
    not possible: __exit_signal() both clears ->sighand and does detach_pid().
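
    The retry and its termination argument, as a single-threaded toy model.
    Every toy_* name is hypothetical, and the hook exists only to simulate
    the exec race between the lookup and the lock attempt.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_task {
    bool has_sighand;       /* models ->sighand != NULL */
    int  signals_seen;      /* counts delivered signals */
};

struct toy_pid {
    struct toy_task *task;  /* models what pid_task() would return */
};

/* Models __exit_signal(): clear ->sighand and detach the task from the
 * pid. On exec the new leader is attached instead of leaving it empty. */
static void toy_release(struct toy_pid *pid, struct toy_task *old,
                        struct toy_task *new_leader)
{
    old->has_sighand = false;
    pid->task = new_leader;
}

/* Test-only helper: runs once between lookup and lock to force the race. */
struct toy_exec_race {
    struct toy_pid  *pid;
    struct toy_task *old_leader;
    struct toy_task *new_leader;    /* NULL if the process is really gone */
};

static void toy_exec_race_hook(void *arg)
{
    struct toy_exec_race *r = arg;
    toy_release(r->pid, r->old_leader, r->new_leader);
}

static int toy_kill_pid(struct toy_pid *pid, void (*hook)(void *), void *arg)
{
    for (;;) {
        struct toy_task *p = pid->task;     /* lookup */
        if (p == NULL)
            return -ESRCH;                  /* nothing attached: real ESRCH */

        if (hook != NULL) {                 /* simulated race window */
            hook(arg);
            hook = NULL;                    /* fire only once */
        }

        if (p->has_sighand) {               /* "lock" succeeded */
            p->signals_seen++;
            return 0;
        }
        /* Lock failed, so p was released after the lookup. Because the
         * release also detached it from the pid, the next lookup cannot
         * return the same task again. */
    }
}
```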

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Use task_pgrp_vnr not task_pgrp_nr so we return the process id in the
    process's pid namespace and not in the initial pid namespace.

    Signed-off-by: Eric W. Biederman
    Cc: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • [m.kozlowski@tuxland.pl: fix unbalanced parenthesis in irix_BSDsetpgrp()]
    Signed-off-by: Eric W. Biederman
    Cc: Pavel Emelyanov
    Cc: Oleg Nesterov
    Cc: Ralf Baechle
    Signed-off-by: Mariusz Kozlowski
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • With the new semantics of find_vpid() we don't need to play with ->nsproxy
    explicitly, _vxx() do the right thing.

    Also s/tasklist/rcu/.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • pid_vnr returns the user space pid with respect to the pid namespace the
    struct pid was allocated in. What we want before we return a pid to user
    space is the user space pid with respect to the pid namespace of current.

    pid_vnr is a very nice optimization but because it isn't quite what we want
    it is easy to use pid_vnr at times when we aren't certain the struct pid
    was allocated in our pid namespace.

    Currently this describes at least tiocgpgrp and tiocgsid in ttyio.c, the
    parent process reported in the core dumps, and the parent process in
    get_signal_to_deliver.

    So unless the performance impact is huge, having an interface that always
    does what we want, instead of almost always, should be much more reliable
    and much less error-prone.

    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • This modifies do_wait and eligible_child to take a pair of enum pid_type
    and struct pid *pid to precisely specify what set of processes are eligible
    to be waited for, instead of the raw pid_t value from sys_wait4.

    This fixes a bug in sys_waitid where you could not wait for children in
    just process group 1.

    This fixes a pid namespace crossing case in eligible_child, allowing us to
    wait for processes in our current process group even if our current
    process group == 0.

    This allows the no child with this pid case to be optimized. It also
    allows the pid membership test in eligible_child to be optimized.

    This even closes a theoretical pid wraparound race where in a threaded
    parent if two threads are waiting for the same child and one thread picks
    up the child and the pid numbers wrap around and generate another child
    with that same pid before the other thread is scheduled (terribly insanely
    unlikely) we could end up waiting on the second child with the same pid#
    and not discover that the specific child we were waiting for has exited.
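
    A rough sketch of how the raw value from sys_wait4 could be turned into
    such a pair. The toy_* names are hypothetical; in particular the kernel
    carries a struct pid pointer where this model carries a number plus a
    "use the caller's own group" flag.

```c
#include <stdbool.h>

enum toy_pid_type { TOY_PIDTYPE_PID, TOY_PIDTYPE_PGID, TOY_PIDTYPE_MAX };

struct toy_wait_target {
    enum toy_pid_type type;   /* TOY_PIDTYPE_MAX means "any child" */
    int  lookup_nr;           /* number to look up, if any */
    bool use_caller_pgrp;     /* take the caller's group object directly */
};

static struct toy_wait_target toy_classify_wait4(int upid)
{
    struct toy_wait_target t = { TOY_PIDTYPE_MAX, 0, false };

    if (upid == -1) {
        /* any child: keep the defaults */
    } else if (upid < 0) {
        t.type = TOY_PIDTYPE_PGID;      /* children in group -upid */
        t.lookup_nr = -upid;
    } else if (upid == 0) {
        /* Children in the caller's own group. No number is looked up, so
         * this works even if that group shows up as 0 in our namespace. */
        t.type = TOY_PIDTYPE_PGID;
        t.use_caller_pgrp = true;
    } else {
        t.type = TOY_PIDTYPE_PID;       /* one specific child */
        t.lookup_nr = upid;
    }
    return t;
}
```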

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Eric W. Biederman
    Cc: Oleg Nesterov
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric W. Biederman
     
  • The previous bugfix was not optimal, we shouldn't care about group stop
    when we are the only thread or the group stop is in progress. In that case
    nothing special is needed, just set PF_EXITING and return.

    Also, take the related "TIF_SIGPENDING re-targeting" code from exit_notify().

    So, from the performance POV the only difference is that we don't trust
    !signal_pending() until we take ->siglock. But this in fact fixes another
    ___pure___ theoretical minor race. __group_complete_signal() finds the
    task without PF_EXITING and chooses it as the target for signal_wake_up().
    But nothing prevents this task from exiting in between without noticing the
    pending signal and thus unpredictably delaying the actual delivery.

    Signed-off-by: Oleg Nesterov
    Cc: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Eric's "fix clone(CLONE_NEWPID)" eliminated the last reason for this hack.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_signal_stop() counts all sub-threads and sets ->group_stop_count
    accordingly. Every thread should decrement ->group_stop_count and stop,
    the last one should notify the parent.

    However a sub-thread can exit before it notices the signal_pending(), or it
    may be somewhere in do_exit() already. In that case the group stop never
    finishes properly.

    Note: this is a minimal fix, we can add some optimizations later. Say we
    can return quickly if thread_group_empty(). Also, we can move some signal
    related code from exit_notify() to exit_signals().

    Signed-off-by: Oleg Nesterov
    Acked-by: Davide Libenzi
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • As Eric pointed out, there is no problem with init starting with sid == pgid
    == 0, and this was historical linux behavior changed in 2.6.18.

    Remove kernel_init()->__set_special_pids(), this is unneeded and complicates
    the rules for sys_setsid().

    This change and the previous change in daemonize() mean that /sbin/init does
    not need the special "session != 1" hack in sys_setsid() any longer. We can't
    remove this check yet, we should clean up copy_process(CLONE_NEWPID) first, so
    update the comment only.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Daemonized kernel threads run in the init's session. This doesn't match the
    behaviour of kthread_create()'ed threads, and this is one of the 2 reasons
    why we need a special hack in sys_setsid().

    Now that set_special_pids() was changed to use struct pid, not pid_t, we can
    use init_struct_pid and set 0,0 special pids.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change set_special_pids() to work with struct pid, not pid_t from global name
    space. This again speeds up and imho cleans up the code, and is also a
    preparation for the next patch.

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_setsid() still deals with pid_t's from the global namespace. This means
    that the "session > 1" check can't help for sub-namespace init, setsid() can't
    succeed because copy_process(CLONE_NEWPID) populates PIDTYPE_PGID/SID links.

    Remove the usage of task_struct->pid and convert the code to use "struct pid".
    This also simplifies and speeds up the code, saves one find_pid().

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Acked-by: Pavel Emelyanov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_setpgid() does unneeded conversions from pid_t to "struct pid" and vice
    versa. Use "struct pid" more consistently. Saves one find_vpid() and
    eliminates the explicit usage of ->nsproxy->pid_ns. Imho, this cleans up
    the code.

    Also use the same_thread_group() helper.

    Signed-off-by: Oleg Nesterov
    Acked-by: Pavel Emelyanov
    Acked-by: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • The first "p->exit_state != EXIT_ZOMBIE" check doesn't make too much sense.
    The exit_state was EXIT_ZOMBIE when the function was called, and another
    thread can change it to EXIT_DEAD right after the check.

    The second condition is not possible, detached non-traced threads were already
    filtered out by eligible_child(), we didn't drop tasklist since then.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Surprise, the other two wait_task_*() functions also abuse the
    task_pid_nr_ns() function, and may cause read-after-free or report nr == 0
    in wait_task_continued(). wait_task_zombie() doesn't have this problem,
    but it is still better to cache pid_t rather than call task_pid_nr_ns()
    three times on the saved pid_namespace.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Imho, the current usage of security_task_wait() is not logical.

    Suppose we have the single child p, and security_task_wait(p) returns
    -EANY. In that case waitpid(-1) returns this error. Why? Isn't it
    better to return ECHILD? We don't really have reapable children.

    Now suppose that child was stolen by gdb. In that case we find this
    child on ->ptrace_children and set flag = 1, but we don't check that the
    child was denied. So, do_wait(..., WNOHANG) returns 0, this doesn't
    match the behaviour above. Without WNOHANG do_wait() blocks only to
    return the error later, when the child is untraced. Imho, really
    strange.

    I think eligible_child() should return the error only if the child's pid
    was requested explicitly, otherwise we should silently ignore the tasks
    which were nacked by security_task_wait().
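
    The proposed rule, as a minimal sketch. toy_security_verdict() is a
    hypothetical helper, not the kernel function; it only captures the
    three outcomes described above, with security_err standing in for
    whatever security_task_wait() returned.

```c
#include <errno.h>

/* Returns 1 if the child may be waited for, 0 if it should be skipped
 * silently, or a negative errno that the caller should propagate. */
static int toy_security_verdict(int requested_pid, int security_err)
{
    if (security_err == 0)
        return 1;               /* not denied: child is eligible */

    if (requested_pid <= 0)
        return 0;               /* wildcard or group wait: just ignore it */

    return security_err;        /* this exact child was asked for: abort */
}
```

    With this rule a waitpid(-1) over nothing but denied children would end
    up with ECHILD, because no child was counted as eligible.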

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Chris Wright
    Cc: Eric Paris
    Cc: James Morris
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • eligible_child() == 2 means delay_group_leader(). With the previous patch
    this only matters for EXIT_ZOMBIE task, we can move that special check to
    the only place it is really needed.

    Also, with this patch we don't skip security_task_wait() for the group
    leaders in a non-empty thread group. I don't really understand the exact
    semantics of security_task_wait(), but imho this change is a bugfix.

    Also rearrange the code a bit to kill an ugly "check_continued" backdoor.

    Signed-off-by: Oleg Nesterov
    Cc: Eric Paris
    Cc: James Morris
    Cc: Roland McGrath
    Cc: Stephen Smalley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_stopped() doesn't need the "delay_group_leader" parameter. If
    the child is not traced it must be a group leader. With or without
    subthreads ->group_stop_count == 0 when the whole task is stopped.

    Signed-off-by: Oleg Nesterov
    Cc: Mika Penttila
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If the tracer is gone and we are not going to stop, ptrace_stop() sets
    ->exit_code = nostop_code. However, the tracer could actually clear the
    exit code before detaching. In that case get_signal_to_deliver() "resends"
    the signal which was cancelled by the debugger. For example, it is
    possible that a quick PTRACE_ATTACH + PTRACE_DETACH can leave the tracee in
    STOPPED state.

    Change the behaviour of ptrace_stop(). If the caller is ptrace_notify(),
    we should always clear ->exit_code. If the caller is
    get_signal_to_deliver(), we should not touch it at all. To do so, change
    the nostop_code parameter to "bool clear_code" and change the callers
    accordingly.
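
    A toy model of the "tracer is gone" branch with the new parameter. The
    toy_* names are hypothetical and the real function does far more
    (locking, notification, scheduling); only the ->exit_code handling is
    modelled here.

```c
#include <stdbool.h>

struct toy_tracee {
    int  exit_code;
    bool stopped;
};

static void toy_ptrace_stop(struct toy_tracee *t, int exit_code,
                            bool clear_code, bool tracer_present)
{
    t->exit_code = exit_code;

    if (tracer_present) {
        /* Normal case: stop and let the tracer inspect or rewrite
         * ->exit_code before it resumes us. */
        t->stopped = true;
        return;
    }

    /* The tracer went away, so do not stop. */
    t->stopped = false;
    if (clear_code)
        t->exit_code = 0;   /* notification path: nothing left to report */
    /* Signal delivery path: leave ->exit_code exactly as it is. */
}
```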

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Every branch of the main "if" statement runs the same code at the end. Move
    it down. Also, fix the indentation.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • wait_task_stopped() has multiple races with SIGCONT/SIGKILL. tasklist_lock
    does not pin the child in TASK_TRACED/TASK_STOPPED state, almost all info
    reported (including exit_code) may be wrong.

    In fact, the code under write_lock_irq(tasklist_lock) is not safe. The child
    may be PTRACE_DETACH'ed at this time by another subthread, in that case it is
    possible we are no longer its ->parent.

    Change wait_task_stopped() to take ->siglock before inspecting the task. This
    guarantees that the child can't resume and (for example) clear its
    ->exit_code, so we don't need to use xchg(&p->exit_code) and re-check. The
    only exception is ptrace_stop() which changes ->state and ->exit_code without
    ->siglock held during abort. But this can only happen if both the tracer and
    the tracee are dying (coredump is in progress), we don't care.

    With this patch wait_task_stopped() doesn't move the child to the end of
    the ->parent list on success. This optimization could be restored, but
    in that case we have to take write_lock(tasklist) and do some nasty
    checks.

    Also change the do_wait() since we don't return EAGAIN any longer.

    [akpm@linux-foundation.org: fix up after Willy renamed everything]
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If the tracer went away (may_ptrace_stop() failed), ptrace_stop() drops
    tasklist and then changes the ->state from TASK_TRACED to TASK_RUNNING.

    This can fool another tracer which attaches to us in between. Change the
    ->state under tasklist_lock to ensure that ptrace_check_attach() can't wrongly
    succeed. Also, remove the unnecessary mb().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • It is not possible to see the PT_PTRACED task without ->signal/sighand under
    tasklist_lock, release_task() does ptrace_unlink() first. If the task was
    already released before, ptrace_attach() can't succeed and set PT_PTRACED.
    Remove this check.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that my_ptrace_child() is trivial we can use the "p->ptrace & PT_PTRACED"
    inline and simplify the corresponding logic in do_wait: we can't find the
    child in TASK_TRACED state without PT_PTRACED flag set, ptrace_untrace()
    either sets TASK_STOPPED or wakes up the tracee.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Since the patch

    "Fix ptrace_attach()/ptrace_traceme()/de_thread() race"
    commit f5b40e363ad6041a96e3da32281d8faa191597b9

    we set PT_ATTACHED and change child->parent "atomically" wrt tasklist_lock.

    This means we can remove the checks like "PT_ATTACHED && ->parent != ptracer"
    which were needed to catch the "ptrace attach is in progress" case. We can
    also remove the flag itself since nobody else uses it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov