20 Jun, 2009

1 commit

  • The bug is ancient.

    If we trace the sub-thread of our natural child and this sub-thread exits,
    we update parent->signal->cxxx fields. But we should not do this until
    the whole thread-group exits, otherwise we account this thread (and all
    other live threads) twice.

    Add the task_detached() check. No need to check thread_group_empty(),
    wait_consider_task()->delay_group_leader() already did this.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

19 Jun, 2009

11 commits

  • Reorder struct wait_opts to remove 8 bytes of alignment padding on 64 bit
    builds.

    Signed-off-by: Richard Kennedy
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • do_wait:

    current->state = TASK_INTERRUPTIBLE;

    read_lock(&tasklist_lock);
    ... search for the task to reap ...

    In theory, the ->state changing can leak into the critical section. Since
    the child can change its status under read_lock(tasklist) in parallel
    (finish_stop/ptrace_stop), we can miss the wakeup if __wake_up_parent()
    sees us in TASK_RUNNING state. Add the barrier.

    Also, use __set_current_state() to set TASK_RUNNING.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_wait() does BUG_ON(tsk->signal != current->signal), this looks like a
    raher obsolete check. At least, I don't think do_wait() is the best place
    to verify that all threads have the same ->signal. Remove it.

    Also, change the code to use while_each_thread().

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that we don't pass &retval down to other helpers we can simplify
    the code more.

    - kill tsk_result, just use retval

    - add the "notask" label right after the main loop, and
    s/got end/goto notask/ after the fastpath pid check.

    This way we don't need to initialize retval before this
    check and the code becomes a bit more clean, if this pid
    has no attached tasks we should just skip the list search.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce "struct wait_opts" which holds the parameters for misc helpers
    in do_wait() pathes.

    This adds 13 lines to kernel/exit.c, but saves 256 bytes from .o and imho
    makes the code much more readable.

    This patch temporary uglifies rusage/siginfo code a little bit, will be
    addressed by further cleanups.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes, preparation for the next patch.

    ptrace_do_wait() adds WUNTRACED to options for wait_task_stopped() which
    should always accept the stopped tracee, even if do_wait() was called
    without WUNTRACED.

    Change wait_task_stopped() to check "ptrace || WUNTRACED" instead. This
    makes the code more explicit, and "int options" argument becomes const in
    do_wait() pathes.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is no reason for thread_group_cputime() in wait_task_zombie(), there
    must be no other threads.

    This call was previously needed to collect the per-cpu data which we do
    not have any longer.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change wait_task_zombie() to use ->real_parent instead of ->parent. We
    could even use current afaics, but ->real_parent is more clean.

    We know that the child is not ptrace_reparented() and thus they are equal.
    But we should avoid using task_struct->parent, we are going to remove it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    - Nobody except ptrace.c & co should use ptrace flags directly, we have
    task_ptrace() for that.

    - No need to specially check PT_PTRACED, we must not have other PT_ bits
    set without PT_PTRACED. And no need to know this flag exists.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "Search in the siblings" should use ->real_parent, not ->parent. If the
    task is traced then ->parent == tracer, while the task's parent is always
    ->real_parent.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • allow_signal() checks ->mm == NULL. Not sure why. Perhaps to make sure
    current is the kernel thread. But this helper must not be used unless we
    are the kernel thread, kill this check.

    Also, document the fact that the CLONE_SIGHAND kthread must not use
    allow_signal(), unless the caller really wants to change the parent's
    ->sighand->action as well.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 Jun, 2009

2 commits

  • …el/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
    nommu: Provide mmap_min_addr definition.
    TOMOYO: Add description of lists and structures.
    TOMOYO: Remove unused field.
    integrity: ima audit dentry_open failure
    TOMOYO: Remove unused parameter.
    security: use mmap_min_addr indepedently of security models
    TOMOYO: Simplify policy reader.
    TOMOYO: Remove redundant markers.
    SELinux: define audit permissions for audit tree netlink messages
    TOMOYO: Remove unused mutex.
    tomoyo: avoid get+put of task_struct
    smack: Remove redundant initialization.
    integrity: nfsd imbalance bug fix
    rootplug: Remove redundant initialization.
    smack: do not beyond ARRAY_SIZE of data
    integrity: move ima_counts_get
    integrity: path_check update
    IMA: Add __init notation to ima functions
    IMA: Minimal IMA policy and boot param for TCB IMA policy
    selinux: remove obsolete read buffer limit from sel_read_bool
    ...

    Linus Torvalds
     

11 Jun, 2009

1 commit


22 May, 2009

1 commit

  • This replaces the struct perf_counter_context in the task_struct with
    a pointer to a dynamically allocated perf_counter_context struct. The
    main reason for doing is this is to allow us to transfer a
    perf_counter_context from one task to another when we do lazy PMU
    switching in a later patch.

    This has a few side-benefits: the task_struct becomes a little smaller,
    we save some memory because only tasks that have perf_counters attached
    get a perf_counter_context allocated for them, and we can remove the
    inclusion of in sched.h, meaning that we don't
    end up recompiling nearly everything whenever perf_counter.h changes.

    The perf_counter_context structures are reference-counted and freed
    when the last reference is dropped. A context can have references
    from its task and the counters on its task. Counters can outlive the
    task so it is possible that a context will be freed well after its
    task has exited.

    Contexts are allocated on fork if the parent had a context, or
    otherwise the first time that a per-task counter is created on a task.
    In the latter case, we set the context pointer in the task struct
    locklessly using an atomic compare-and-exchange operation in case we
    raced with some other task in creating a context for the subject task.

    This also removes the task pointer from the perf_counter struct. The
    task pointer was not used anywhere and would make it harder to move a
    context from one task to another. Anything that needed to know which
    task a counter was attached to was already using counter->ctx->task.

    The __perf_counter_init_context function moves up in perf_counter.c
    so that it can be called from find_get_context, and now initializes
    the refcount, but is otherwise unchanged.

    We were potentially calling list_del_counter twice: once from
    __perf_counter_exit_task when the task exits and once from
    __perf_counter_remove_from_context when the counter's fd gets closed.
    This adds a check in list_del_counter so it doesn't do anything if
    the counter has already been removed from the lists.

    Since perf_counter_task_sched_in doesn't do anything if the task doesn't
    have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
    __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
    the case where the current task adds the first counter to itself and
    thus creates a context for itself.

    This also adds similar code to __perf_counter_enable to handle a
    similar situation which can arise when the counters have been disabled
    using prctl; that also leaves cpuctx->task_ctx = NULL.

    [ Impact: refactor counter context management to prepare for new feature ]

    Signed-off-by: Paul Mackerras
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Marcelo Tosatti
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
     

20 May, 2009

1 commit

  • Fix counter lifetime bugs which explain the crashes reported by
    Marcelo Tosatti and Arnaldo Carvalho de Melo.

    The new rule is: flushing + freeing is only done for a task's
    own counters, never for other tasks.

    [ Impact: fix crashes/lockups with inherited counters ]

    Reported-by: Arnaldo Carvalho de Melo
    Reported-by: Marcelo Tosatti
    Acked-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

17 May, 2009

2 commits

  • Flushing counters in __exit_signal() with irqs disabled is not
    a good idea as perf_counter_exit_task() acquires mutexes. So
    flush it before acquiring the tasklist lock.

    (Note, we still need a fix for when the PID has been unhashed.)

    [ Impact: fix crash with inherited counters ]

    Cc: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Arnaldo Carvalho de Melo
    Cc: Marcelo Tosatti
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Srivatsa Vaddagiri reported that a Java workload triggers this
    warning in kernel/exit.c:

    WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));

    Add the inherited counter propagation on self-detach, this could
    cause counter leaks and incomplete stats in threaded code like
    the below:

    #include
    #include

    void *thread(void *arg)
    {
    sleep(5);
    return NULL;
    }

    void main(void)
    {
    pthread_t thr;
    pthread_create(&thr, NULL, thread, NULL);
    }

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Arnaldo Carvalho de Melo
    Cc: Marcelo Tosatti
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 May, 2009

1 commit


01 May, 2009

1 commit

  • I was never able to understand what should we actually do when
    security_task_wait() fails, but the current code doesn't look right.

    If ->task_wait() returns the error, we update *notask_error correctly.
    But then we either reap the child (despite the fact this was forbidden)
    or clear *notask_error (and hide the securiy policy problems).

    This patch assumes that "stolen by ptrace" doesn't matter. If selinux
    denies the child we should ignore it but make sure we report -EACCESS
    instead of -ECHLD if there are no other eligible children.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: James Morris

    Oleg Nesterov
     

15 Apr, 2009

2 commits

  • Impact: clean up

    Create a sub directory in include/trace called events to keep the
    trace point headers in their own separate directory. Only headers that
    declare trace points should be defined in this directory.

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Neil Horman
    Cc: Zhao Lei
    Cc: Eduard - Gabriel Munteanu
    Cc: Pekka Enberg
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
  • This patch lowers the number of places a developer must modify to add
    new tracepoints. The current method to add a new tracepoint
    into an existing system is to write the trace point macro in the
    trace header with one of the macros TRACE_EVENT, TRACE_FORMAT or
    DECLARE_TRACE, then they must add the same named item into the C file
    with the macro DEFINE_TRACE(name) and then add the trace point.

    This change cuts out the needing to add the DEFINE_TRACE(name).
    Every file that uses the tracepoint must still include the trace/.h
    file, but the one C file must also add a define before the including
    of that file.

    #define CREATE_TRACE_POINTS
    #include

    This will cause the trace/mytrace.h file to also produce the C code
    necessary to implement the trace point.

    Note, if more than one trace/.h is used to create the C code
    it is best to list them all together.

    #define CREATE_TRACE_POINTS
    #include
    #include
    #include

    Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
    the cleaner solution of the define above the includes over my first
    design to have the C code include a "special" header.

    This patch converts sched, irq and lockdep and skb to use this new
    method.

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Neil Horman
    Cc: Zhao Lei
    Cc: Eduard - Gabriel Munteanu
    Cc: Pekka Enberg
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     

08 Apr, 2009

2 commits


07 Apr, 2009

2 commits


06 Apr, 2009

2 commits


03 Apr, 2009

11 commits

  • * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6:
    Remove two unneeded exports and make two symbols static in fs/mpage.c
    Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225
    Trim includes of fdtable.h
    Don't crap into descriptor table in binfmt_som
    Trim includes in binfmt_elf
    Don't mess with descriptor table in load_elf_binary()
    Get rid of indirect include of fs_struct.h
    New helper - current_umask()
    check_unsafe_exec() doesn't care about signal handlers sharing
    New locking/refcounting for fs_struct
    Take fs_struct handling to new file (fs/fs_struct.c)
    Get rid of bumping fs_struct refcount in pivot_root(2)
    Kill unsharing fs_struct in __set_personality()

    Linus Torvalds
     
  • We are wasting 2 words in signal_struct without any reason to implement
    task_pgrp_nr() and task_session_nr().

    task_session_nr() has no callers since
    2e2ba22ea4fd4bb85f0fa37c521066db6775cbef, we can remove it.

    task_pgrp_nr() is still (I believe wrongly) used in fs/autofsX and
    fs/coda.

    This patch reimplements task_pgrp_nr() via task_pgrp_nr_ns(), and kills
    __pgrp/__session and the related helpers.

    The change in drivers/char/tty_io.c is cosmetic, but hopefully makes sense
    anyway.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox [tty parts]
    Cc: Cedric Le Goater
    Cc: Dave Hansen
    Cc: Eric Biederman
    Cc: Pavel Emelyanov
    Cc: Serge Hallyn
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • sys_wait4() does get_pid(task_pgrp(current)), this is not safe. We can
    add rcu lock/unlock around, but we already have get_task_pid() which can
    be improved to handle the special pids in more reliable manner.

    Signed-off-by: Oleg Nesterov
    Cc: Louis Rilling
    Cc: "Eric W. Biederman"
    Cc: Pavel Emelyanov
    Cc: Sukadev Bhattiprolu
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • By discussion with Roland.

    - Use ->sibling instead of ->ptrace_entry to chain the need to be
    release_task'd childs. Nobody else can use ->sibling, this task
    is EXIT_DEAD and nobody can find it on its own list.

    - rename ptrace_dead to dead_childs.

    - Now that we don't have the "parallel" untrace code, change back
    reparent_thread() to return void, pass dead_childs as an argument.

    Actually, I don't understand why do we notify /sbin/init when we
    reparent a zombie, probably it is better to reap it unconditionally.

    [akpm@linux-foundation.org: s/childs/children/]
    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: "Metzger, Markus T"
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • By discussion with Roland.

    - Rename ptrace_exit() to exit_ptrace(), and change it to do all the
    necessary work with ->ptraced list by its own.

    - Move this code from exit.c to ptrace.c

    - Update the comment in ptrace_detach() to explain the rechecking of
    the child->ptrace.

    Signed-off-by: Oleg Nesterov
    Cc: "Eric W. Biederman"
    Cc: "Metzger, Markus T"
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If /sbin/init ignores SIGCHLD and we re-parent a zombie, it is leaked.
    reparent_thread() does do_notify_parent() which sets ->exit_signal = -1 in
    this case. This means that nobody except us can reap it, the detached
    task is not visible to do_wait().

    Change reparent_thread() to return a boolean (like __pthread_detach) to
    indicate that the thread is dead and must be released. Also change
    forget_original_parent() to add the child to ptrace_dead list in this
    case.

    The naming becomes insane, the next patch does the cleanup.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • reparent_thread() uses ptrace_reparented() to check whether this thread is
    ptraced, in that case we should not notify the new parent.

    But ptrace_reparented() is not exactly correct when the reparented thread
    is traced by /sbin/init, because forget_original_parent() has already
    changed ->real_parent.

    Currently, the only problem is the false notification. But with the next
    patch the kernel crash in this (yes, pathological) case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If task_detached(p) == T, then either

    a) p is not the main thread, we will find the group leader on the
    ->children list.

    or

    b) p is the group leader but its ->exit_state = EXIT_DEAD. This
    can only happen when the last sub-thread has died, but in that case
    that thread has already called kill_orphaned_pgrp() from
    exit_notify().

    In both cases kill_orphaned_pgrp() looks bogus.

    Move the task_detached() check up and simplify the code, this is also
    right from the "common sense" pov: we should do nothing with the detached
    childs, except move them to the new parent's ->children list.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes, preparation for the next patch.

    Move the "should we release this child" logic into the separate handler,
    __ptrace_detach().

    Signed-off-by: Oleg Nesterov
    Cc: Jerome Marchand
    Cc: Roland McGrath
    Cc: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • ignoring_children() takes parent->sighand->siglock and checks
    k_sigaction[SIGCHLD] atomically. But this buys nothing, we can't get the
    "really" wrong result even if we race with sigaction(SIGCHLD). If we read
    the "stale" sa_handler/sa_flags we can pretend it was changed right after
    the check.

    Remove spin_lock(->siglock), and kill "int ign" which caches the result of
    ignoring_children() which becomes rather trivial.

    Perhaps it makes sense to export this helper, do_notify_parent() can use
    it too.

    Signed-off-by: Oleg Nesterov
    Cc: Jerome Marchand
    Cc: Roland McGrath
    Cc: Denys Vlasenko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_wait(WSTOPPED) assumes that p->state must be == TASK_STOPPED, this is
    not true if the leader is already dead. Check SIGNAL_STOP_STOPPED instead
    and use signal->group_exit_code.

    Trivial test-case:

    void *tfunc(void *arg)
    {
    pause();
    return NULL;
    }

    int main(void)
    {
    pthread_t thr;
    pthread_create(&thr, NULL, tfunc, NULL);
    pthread_exit(NULL);
    return 0;
    }

    It doesn't react to ^Z (and then to ^C or ^\). The task is stopped, but
    bash can't see this.

    The bug is very old, and it was reported multiple times. This patch was sent
    more than a year ago (http://marc.info/?t=119713920000003) but it was ignored.

    This change also fixes other oddities (but not all) in this area. For
    example, before this patch:

    $ sleep 100
    ^Z
    [1]+ Stopped sleep 100
    $ strace -p `pidof sleep`
    Process 11442 attached - interrupt to quit

    strace hangs in do_wait(), because ->exit_code was already consumed by
    bash. After this patch, strace happily proceeds:

    --- SIGTSTP (Stopped) @ 0 (0) ---
    restart_syscall(

    To me, this looks much more "natural" and correct.

    Another example. Let's suppose we have the main thread M and sub-thread
    T, the process is stopped, and its parent did wait(WSTOPPED). Now we can
    ptrace T but not M. This looks at least strange to me.

    Imho, do_wait() should not confuse the per-thread ptrace stops with the
    per-process job control stops.

    Signed-off-by: Oleg Nesterov
    Cc: Denys Vlasenko
    Cc: "Eric W. Biederman"
    Cc: Jan Kratochvil
    Cc: Kaz Kylheku
    Cc: Michael Kerrisk
    Cc: Roland McGrath
    Cc: Ulrich Drepper
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov