24 Sep, 2009

10 commits

  • Because the binfmt does not differ between threads in the same process,
    it can be moved from task_struct to mm_struct, and the binfmt module is
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • Current behaviour of sys_waitid() looks odd. If the user passes infop ==
    NULL, sys_waitid() returns success. When the user additionally specifies
    the WNOWAIT flag, sys_waitid() returns -EFAULT under the same conditions.
    When the user combines WNOWAIT with WCONTINUED, sys_waitid() again
    returns success.

    This patch adds a check for ->wo_info in wait_noreap_copyout().

    User-visible change: starting from this commit, sys_waitid() always
    checks whether infop is NULL, and no longer fails when it is.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • do_wait() checks ->wo_info to figure out who the caller is. If it is not
    NULL the caller should be sys_waitid(); in that case do_wait() fixes up
    the retval or zeros ->wo_info, depending on the retval from the
    underlying function.

    This is a bug: the user can pass ->wo_info == NULL and sys_waitid() will
    return an incorrect value.

    man 2 waitid says:

    waitid(): returns 0 on success

    Test-case:

    int main(void)
    {
            if (fork())
                    assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

            return 0;
    }

    Result:

    Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

    Move that code to sys_waitid().

    User-visible change: sys_waitid() will return 0 on success, whether or
    not infop is set.

    Note: there's another bug in wait_noreap_copyout() which affects the
    return value of sys_waitid(). It will be fixed in the next patch.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • Kill the "parent" argument in wait_consider_task(); it was never used.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_pid_type() is only used by eligible_pid() which has to check wo_type
    != PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor
    out ->pids[type] access, this shrinks .text a bit and simplifies the code.

    This matches the behaviour of other similar helpers, say get_task_pid().
    The caller must ensure that pid_type is valid, not the callee.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • child_wait_callback()->eligible_child() is not right, we can miss the
    wakeup if the task was detached before __wake_up_parent() and the caller
    of do_wait() didn't use __WALL.

    Move ->wo_pid checks from eligible_child() to the new helper,
    eligible_pid(), and change child_wait_callback() to use it instead of
    eligible_child().

    Note: actually I think it would be better to fix the __WCLONE check in
    eligible_child(), it doesn't look exactly right. But it is not clear what
    is the supposed behaviour, and any change is user-visible.

    Reported-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Suggested by Roland.

    do_wait(__WNOTHREAD) can only succeed if the caller is either the
    ptracer, or it is ->real_parent and the child is not traced. IOW, unless
    caller == p->parent we should not wake up.

    Change child_wait_callback() to check this. Ratan reports a workload
    with >99% CPU load caused by unnecessary wakeups; it should be fixed by
    this patch.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Ratan Nalumasu reported that a process with many threads does
    unnecessary wakeups: every waiting thread in the process wakes up to
    loop through the children and see that the only ones it cares about are
    still not ready.

    Now that we have struct wait_opts we can change do_wait/__wake_up_parent
    to use filtered wakeups.

    We can make child_wait_callback() more clever later, right now it only
    checks eligible_child().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Acked-by: James Morris
    Tested-by: Valdis Kletnieks
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • …to wait_consider_task()

    Preparation, no functional changes.

    eligible_child() has a single caller, wait_consider_task(). We can move
    security_task_wait() out from eligible_child(), this allows us to use it
    for filtered wake_up().

    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Roland McGrath <roland@redhat.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Ratan Nalumasu <rnalumasu@gmail.com>
    Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Oleg Nesterov
     
  • The bug is old; it wasn't caused by recent changes.

    Test case:

    static void *tfunc(void *arg)
    {
            int pid = (long)arg;

            assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
            kill(pid, SIGKILL);

            sleep(1);
            return NULL;
    }

    int main(void)
    {
            pthread_t th;
            long pid = fork();

            if (!pid)
                    pause();

            signal(SIGCHLD, SIG_IGN);
            assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

            int r = waitpid(-1, NULL, __WNOTHREAD);
            printf("waitpid: %d %m\n", r);

            return 0;
    }

    Before the patch this program hangs; after the patch, waitpid()
    correctly fails with errno == ECHILD.

    The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
    ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
    we should wake up other threads which can sleep in do_wait().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

23 Sep, 2009

2 commits

  • Fill the ->ru_maxrss value in struct rusage according to the rss hiwater
    mark. This struct is filled as a parameter to the getrusage syscall.
    The ->ru_maxrss value is set in KB, which is the way it is done in BSD
    systems. The /usr/bin/time (GNU time) application additionally converts
    ->ru_maxrss to KB, which seems to be incorrect behavior. The maintainer
    of this utility was notified by me with a patch which corrects it, and
    is cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss, which we use to store the rss hiwater of the
    task. The second one is ->cmaxrss, which we use to store the highest rss
    hiwater of all the task's children. These values are used in
    k_getrusage() to actually fill ->ru_maxrss. k_getrusage() uses the
    current rss hiwater value directly if an mm struct exists.

    Note:
    exec() clears mm->hiwater_rss, but doesn't clear sig->maxrss.
    This is intentional behavior; *BSD getrusage has exec() inheritance.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <string.h>
    #include <errno.h>
    #include <fcntl.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
            int status;

            printf("allocate 100MB\n");
            consume(100);

            printf("testcase1: fork inherit? \n");
            printf(" expect: initial.self ~= child.self\n");
            show_rusage("initial");
            if (__fork()) {
                    wait(&status);
            } else {
                    show_rusage("fork child");
                    _exit(0);
            }
            printf("\n");

            printf("testcase2: fork inherit? (cont.) \n");
            printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
            show_rusage("initial");
            if (__fork()) {
                    wait(&status);
            } else {
                    show_rusage("child");
                    _exit(0);
            }
            printf("\n");

            printf("testcase3: fork + malloc \n");
            printf(" expect: child.self ~= initial.self + 50MB\n");
            show_rusage("initial");
            if (__fork()) {
                    wait(&status);
            } else {
                    printf("allocate +50MB\n");
                    consume(50);
                    show_rusage("fork child");
                    _exit(0);
            }
            printf("\n");

            printf("testcase4: grandchild maxrss\n");
            printf(" expect: post_wait.children ~= 300MB\n");
            show_rusage("initial");
            if (__fork()) {
                    wait(&status);
                    show_rusage("post_wait");
            } else {
                    system("./child -n 0 -g 300");
                    _exit(0);
            }
            printf("\n");

            printf("testcase5: zombie\n");
            printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
            printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
            show_rusage("initial");
            if (__fork()) {
                    sleep(1); /* children become zombie */
                    show_rusage("pre_wait");
                    wait(&status);
                    show_rusage("post_wait");
            } else {
                    system("./child -n 400");
                    _exit(0);
            }
            printf("\n");

            printf("testcase6: SIG_IGN\n");
            printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
            show_rusage("initial");
            signal(SIGCHLD, SIG_IGN);
            if (__fork()) {
                    sleep(1); /* children become zombie */
                    show_rusage("after_zombie");
            } else {
                    system("./child -n 500");
                    _exit(0);
            }
            printf("\n");
            signal(SIGCHLD, SIG_DFL);

            printf("testcase7: exec (without fork) \n");
            printf(" expect: initial ~= exec \n");
            show_rusage("initial");
            execl("./child", "child", "-v", NULL);

            return 0;
    }

    child.c
    =======
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>

    #include "common.h"

    int main(int argc, char** argv)
    {
            int status;
            int c;
            long consume_size = 0;
            long grandchild_consume_size = 0;
            int show = 0;

            while ((c = getopt(argc, argv, "n:g:v")) != -1) {
                    switch (c) {
                    case 'n':
                            consume_size = atol(optarg);
                            break;
                    case 'v':
                            show = 1;
                            break;
                    case 'g':
                            grandchild_consume_size = atol(optarg);
                            break;
                    default:
                            break;
                    }
            }

            if (show)
                    show_rusage("exec");

            if (consume_size) {
                    printf("child alloc %ldMB\n", consume_size);
                    consume(consume_size);
            }

            if (grandchild_consume_size) {
                    if (fork()) {
                            wait(&status);
                    } else {
                            printf("grandchild alloc %ldMB\n", grandchild_consume_size);
                            consume(grandchild_consume_size);
                            exit(0);
                    }
            }

            return 0;
    }

    common.c
    ========
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <string.h>
    #include <errno.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <sys/time.h>
    #include <sys/resource.h>
    #include <sys/mman.h>

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
            int err, err2;
            struct rusage rusage_self;
            struct rusage rusage_children;

            printf("%s: ", prefix);
            err = getrusage(RUSAGE_SELF, &rusage_self);
            if (!err)
                    printf("self %ld ", rusage_self.ru_maxrss);
            err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
            if (!err2)
                    printf("children %ld ", rusage_children.ru_maxrss);

            printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
            void *addr;
            int size = getpagesize();
            int i;

            for (i = 0; i < 1000; i++) {
                    addr = mmap(NULL, size, PROT_READ|PROT_WRITE,
                                MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                    if (addr == MAP_FAILED)
                            err("make_pagefault");
                    memset(addr, 0, size);
                    munmap(addr, size);
            }
    }

    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     
  • The act of a process becoming a session leader is a useful signal to a
    supervising init daemon such as Upstart.

    While a daemon will normally do this as part of the process of becoming a
    daemon, it is rare for its children to do so. When the children do, it is
    nearly always a sign that the child should be considered detached from the
    parent and not supervised along with it.

    The poster-child example is OpenSSH; the per-login children call setsid()
    so that they may control the pty connected to them. If the primary daemon
    dies or is restarted, we do not want to consider the per-login children
    and want to respawn the primary daemon without killing the children.

    This patch adds a new PROC_SID_EVENT and an associated structure to the
    proc_event event_data union, and arranges for this to be emitted when
    the special PIDTYPE_SID pid is set.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Scott James Remnant
    Acked-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Evgeniy Polyakov
    Acked-by: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scott James Remnant
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in all, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
            M=$(echo $N | sed 's/perf_counter/perf_event/g')
            mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\<event\>/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Sep, 2009

1 commit

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
    rcu: Move end of special early-boot RCU operation earlier
    rcu: Changes from reviews: avoid casts, fix/add warnings, improve comments
    rcu: Create rcutree plugins to handle hotplug CPU for multi-level trees
    rcu: Remove lockdep annotations from RCU's _notrace() API members
    rcu: Add #ifdef to suppress __rcu_offline_cpu() warning in !HOTPLUG_CPU builds
    rcu: Add CPU-offline processing for single-node configurations
    rcu: Add "notrace" to RCU function headers used by ftrace
    rcu: Remove CONFIG_PREEMPT_RCU
    rcu: Merge preemptable-RCU functionality into hierarchical RCU
    rcu: Simplify rcu_pending()/rcu_check_callbacks() API
    rcu: Use debugfs_remove_recursive() simplify code.
    rcu: Merge per-RCU-flavor initialization into pre-existing macro
    rcu: Fix online/offline indication for rcudata.csv trace file
    rcu: Consolidate sparse and lockdep declarations in include/linux/rcupdate.h
    rcu: Renamings to increase RCU clarity
    rcu: Move private definitions from include/linux/rcutree.h to kernel/rcutree.h
    rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP
    rcu: Delay rcu_barrier() wait until beginning of next CPU-hotunplug operation.
    rcu: Fix typo in rcu_irq_exit() comment header
    rcu: Make rcupreempt_trace.c look at offline CPUs
    ...

    Linus Torvalds
     

02 Sep, 2009

1 commit

  • Add a config option (CONFIG_DEBUG_CREDENTIALS) to turn on some debug checking
    for credential management. The additional code keeps track of the number of
    pointers from task_structs to any given cred struct, and checks to see that
    this number never exceeds the usage count of the cred struct (which includes
    all references, not just those from task_structs).

    Furthermore, if SELinux is enabled, the code also checks that the security
    pointer in the cred struct is never seen to be invalid.

    This attempts to catch the bug whereby inode_has_perm() faults in an nfsd
    kernel thread on seeing cred->security be a NULL pointer (it appears that the
    credential struct has been previously released):

    http://www.kerneloops.org/oops.php?number=252883

    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
     

23 Aug, 2009

1 commit

  • Create a kernel/rcutree_plugin.h file that contains definitions
    for preemptable RCU (or, under the #else branch of the #ifdef,
    empty definitions for the classic non-preemptable semantics).
    These definitions fit into plugins defined in kernel/rcutree.c
    for this purpose.

    This variant of preemptable RCU uses a new algorithm whose
    read-side expense is roughly that of classic hierarchical RCU
    under CONFIG_PREEMPT. This new algorithm's update-side expense
    is similar to that of classic hierarchical RCU, and, in absence
    of read-side preemption or blocking, is exactly that of classic
    hierarchical RCU. Perhaps more important, this new algorithm
    has a much simpler implementation, saving well over 1,000 lines
    of code compared to mainline's implementation of preemptable
    RCU, which will hopefully be retired in favor of this new
    algorithm.

    The simplifications are obtained by maintaining per-task
    nesting state for running tasks, and using a simple
    lock-protected algorithm to handle accounting when tasks block
    within RCU read-side critical sections, making use of lessons
    learned while creating numerous user-level RCU implementations
    over the past 18 months.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: akpm@linux-foundation.org
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josht@linux.vnet.ibm.com
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

09 Jul, 2009

1 commit

  • Fix various silly problems wrt mnt_namespace.h:

    - exit_mnt_ns() isn't used, remove it
    - done that, sched.h and nsproxy.h inclusions aren't needed
    - mount.h inclusion was needed for vfsmount_lock, but no longer
    - remove mnt_namespace.h inclusion from files which don't use anything
    from mnt_namespace.h

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

20 Jun, 2009

1 commit

  • The bug is ancient.

    If we trace the sub-thread of our natural child and this sub-thread exits,
    we update parent->signal->cxxx fields. But we should not do this until
    the whole thread-group exits, otherwise we account this thread (and all
    other live threads) twice.

    Add the task_detached() check. No need to check thread_group_empty(),
    wait_consider_task()->delay_group_leader() already did this.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

19 Jun, 2009

11 commits

  • Reorder struct wait_opts to remove 8 bytes of alignment padding on 64 bit
    builds.

    Signed-off-by: Richard Kennedy
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • do_wait:

    current->state = TASK_INTERRUPTIBLE;

    read_lock(&tasklist_lock);
    ... search for the task to reap ...

    In theory, the ->state changing can leak into the critical section. Since
    the child can change its status under read_lock(tasklist) in parallel
    (finish_stop/ptrace_stop), we can miss the wakeup if __wake_up_parent()
    sees us in TASK_RUNNING state. Add the barrier.

    Also, use __set_current_state() to set TASK_RUNNING.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_wait() does BUG_ON(tsk->signal != current->signal); this looks like a
    rather obsolete check. At least, I don't think do_wait() is the best
    place to verify that all threads have the same ->signal. Remove it.

    Also, change the code to use while_each_thread().

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that we don't pass &retval down to other helpers we can simplify
    the code more.

    - kill tsk_result, just use retval

    - add the "notask" label right after the main loop, and
      s/goto end/goto notask/ after the fastpath pid check.

    This way we don't need to initialize retval before this check, and
    the code becomes a bit cleaner: if this pid has no attached tasks we
    should just skip the list search.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce "struct wait_opts" which holds the parameters for misc helpers
    in do_wait() paths.

    This adds 13 lines to kernel/exit.c, but saves 256 bytes from .o and
    imho makes the code much more readable.

    This patch temporarily uglifies the rusage/siginfo code a little bit;
    this will be addressed by further cleanups.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes, preparation for the next patch.

    ptrace_do_wait() adds WUNTRACED to options for wait_task_stopped() which
    should always accept the stopped tracee, even if do_wait() was called
    without WUNTRACED.

    Change wait_task_stopped() to check "ptrace || WUNTRACED" instead. This
    makes the code more explicit, and the "int options" argument becomes
    const in do_wait() paths.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • There is no reason for thread_group_cputime() in wait_task_zombie(), there
    must be no other threads.

    This call was previously needed to collect the per-cpu data which we do
    not have any longer.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change wait_task_zombie() to use ->real_parent instead of ->parent. We
    could even use current afaics, but ->real_parent is cleaner.

    We know that the child is not ptrace_reparented() and thus they are equal.
    But we should avoid using task_struct->parent, we are going to remove it.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • No functional changes.

    - Nobody except ptrace.c & co should use ptrace flags directly, we have
    task_ptrace() for that.

    - No need to specially check PT_PTRACED, we must not have other PT_ bits
    set without PT_PTRACED. And no need to know this flag exists.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • "Search in the siblings" should use ->real_parent, not ->parent. If the
    task is traced then ->parent == tracer, while the task's parent is always
    ->real_parent.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • allow_signal() checks ->mm == NULL. Not sure why. Perhaps to make sure
    current is a kernel thread. But this helper must not be used unless we
    are a kernel thread, so kill this check.

    Also, document the fact that the CLONE_SIGHAND kthread must not use
    allow_signal(), unless the caller really wants to change the parent's
    ->sighand->action as well.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

12 Jun, 2009

2 commits

  • …el/git/tip/linux-2.6-tip

    * 'perfcounters-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (574 commits)
    perf_counter: Turn off by default
    perf_counter: Add counter->id to the throttle event
    perf_counter: Better align code
    perf_counter: Rename L2 to LL cache
    perf_counter: Standardize event names
    perf_counter: Rename enums
    perf_counter tools: Clean up u64 usage
    perf_counter: Rename perf_counter_limit sysctl
    perf_counter: More paranoia settings
    perf_counter: powerpc: Implement generalized cache events for POWER processors
    perf_counters: powerpc: Add support for POWER7 processors
    perf_counter: Accurate period data
    perf_counter: Introduce struct for sample data
    perf_counter tools: Normalize data using per sample period data
    perf_counter: Annotate exit ctx recursion
    perf_counter tools: Propagate signals properly
    perf_counter tools: Small frequency related fixes
    perf_counter: More aggressive frequency adjustment
    perf_counter/x86: Fix the model number of Intel Core2 processors
    perf_counter, x86: Correct some event and umask values for Intel processors
    ...

    Linus Torvalds
     
  • …s/security-testing-2.6

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jmorris/security-testing-2.6: (44 commits)
    nommu: Provide mmap_min_addr definition.
    TOMOYO: Add description of lists and structures.
    TOMOYO: Remove unused field.
    integrity: ima audit dentry_open failure
    TOMOYO: Remove unused parameter.
    security: use mmap_min_addr indepedently of security models
    TOMOYO: Simplify policy reader.
    TOMOYO: Remove redundant markers.
    SELinux: define audit permissions for audit tree netlink messages
    TOMOYO: Remove unused mutex.
    tomoyo: avoid get+put of task_struct
    smack: Remove redundant initialization.
    integrity: nfsd imbalance bug fix
    rootplug: Remove redundant initialization.
    smack: do not beyond ARRAY_SIZE of data
    integrity: move ima_counts_get
    integrity: path_check update
    IMA: Add __init notation to ima functions
    IMA: Minimal IMA policy and boot param for TCB IMA policy
    selinux: remove obsolete read buffer limit from sel_read_bool
    ...

    Linus Torvalds
     

11 Jun, 2009

1 commit


22 May, 2009

1 commit

  • This replaces the struct perf_counter_context in the task_struct with
    a pointer to a dynamically allocated perf_counter_context struct. The
    main reason for doing this is to allow us to transfer a
    perf_counter_context from one task to another when we do lazy PMU
    switching in a later patch.

    This has a few side-benefits: the task_struct becomes a little smaller,
    we save some memory because only tasks that have perf_counters attached
    get a perf_counter_context allocated for them, and we can remove the
    inclusion of <linux/perf_counter.h> in sched.h, meaning that we don't
    end up recompiling nearly everything whenever perf_counter.h changes.

    The perf_counter_context structures are reference-counted and freed
    when the last reference is dropped. A context can have references
    from its task and the counters on its task. Counters can outlive the
    task so it is possible that a context will be freed well after its
    task has exited.

    Contexts are allocated on fork if the parent had a context, or
    otherwise the first time that a per-task counter is created on a task.
    In the latter case, we set the context pointer in the task struct
    locklessly using an atomic compare-and-exchange operation in case we
    raced with some other task in creating a context for the subject task.

    This also removes the task pointer from the perf_counter struct. The
    task pointer was not used anywhere and would make it harder to move a
    context from one task to another. Anything that needed to know which
    task a counter was attached to was already using counter->ctx->task.

    The __perf_counter_init_context function moves up in perf_counter.c
    so that it can be called from find_get_context, and now initializes
    the refcount, but is otherwise unchanged.

    We were potentially calling list_del_counter twice: once from
    __perf_counter_exit_task when the task exits and once from
    __perf_counter_remove_from_context when the counter's fd gets closed.
    This adds a check in list_del_counter so it doesn't do anything if
    the counter has already been removed from the lists.

    Since perf_counter_task_sched_in doesn't do anything if the task doesn't
    have a context, and leaves cpuctx->task_ctx = NULL, this adds code to
    __perf_install_in_context to set cpuctx->task_ctx if necessary, i.e. in
    the case where the current task adds the first counter to itself and
    thus creates a context for itself.

    This also adds similar code to __perf_counter_enable to handle a
    similar situation which can arise when the counters have been disabled
    using prctl; that also leaves cpuctx->task_ctx = NULL.

    [ Impact: refactor counter context management to prepare for new feature ]

    Signed-off-by: Paul Mackerras
    Acked-by: Peter Zijlstra
    Cc: Corey Ashford
    Cc: Marcelo Tosatti
    Cc: Arnaldo Carvalho de Melo
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul Mackerras
     

20 May, 2009

1 commit

  • Fix counter lifetime bugs which explain the crashes reported by
    Marcelo Tosatti and Arnaldo Carvalho de Melo.

    The new rule is: flushing + freeing is only done for a task's
    own counters, never for other tasks.
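    A toy illustration of that rule, with invented names (the real check
    involves `current` and the counter context's task pointer, not a
    global):

    ```c
    #include <stdio.h>

    struct task { int flushed; };

    /* Stand-in for the kernel's per-CPU `current`; illustrative only. */
    static struct task *current_task;

    /* The rule above: a task may flush and free only its own counters. */
    static int flush_own_counters(struct task *tsk)
    {
        if (tsk != current_task)
            return -1;          /* refuse a cross-task flush */
        tsk->flushed = 1;
        return 0;
    }

    int main(void)
    {
        struct task self = { 0 }, other = { 0 };
        current_task = &self;
        printf("%d %d %d\n",
               flush_own_counters(&self),   /* allowed */
               flush_own_counters(&other),  /* refused */
               other.flushed);              /* untouched */
        return 0;
    }
    ```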

    [ Impact: fix crashes/lockups with inherited counters ]

    Reported-by: Arnaldo Carvalho de Melo
    Reported-by: Marcelo Tosatti
    Acked-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

17 May, 2009

2 commits

  • Flushing counters in __exit_signal() with irqs disabled is not
    a good idea as perf_counter_exit_task() acquires mutexes. So
    flush them before acquiring the tasklist_lock.

    (Note, we still need a fix for when the PID has been unhashed.)

    [ Impact: fix crash with inherited counters ]

    Cc: Peter Zijlstra
    Cc: Srivatsa Vaddagiri
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Arnaldo Carvalho de Melo
    Cc: Marcelo Tosatti
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Srivatsa Vaddagiri reported that a Java workload triggers this
    warning in kernel/exit.c:

    WARN_ON_ONCE(!list_empty(&tsk->perf_counter_ctx.counter_list));

    Add inherited counter propagation on self-detach; without it, counters
    could leak and stats could be incomplete in threaded code like the
    below:

    #include <pthread.h>
    #include <unistd.h>

    void *thread(void *arg)
    {
        sleep(5);
        return NULL;
    }

    void main(void)
    {
        pthread_t thr;
        pthread_create(&thr, NULL, thread, NULL);
    }

    Reported-by: Srivatsa Vaddagiri
    Signed-off-by: Peter Zijlstra
    Cc: Paul Mackerras
    Cc: Corey Ashford
    Cc: Arnaldo Carvalho de Melo
    Cc: Marcelo Tosatti
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 May, 2009

1 commit


01 May, 2009

1 commit

    I was never able to understand what we should actually do when
    security_task_wait() fails, but the current code doesn't look right.

    If ->task_wait() returns an error, we update *notask_error correctly.
    But then we either reap the child (despite the fact this was forbidden)
    or clear *notask_error (and hide the security policy problems).

    This patch assumes that "stolen by ptrace" doesn't matter. If selinux
    denies the child we should ignore it, but make sure we report -EACCES
    instead of -ECHILD if there are no other eligible children.
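    The error-selection logic can be sketched as follows. This is an
    illustrative userspace distillation, not the do_wait() code: the
    function name and counters are hypothetical.

    ```c
    #include <errno.h>
    #include <stdio.h>

    /* Skip security-denied children, but if every child was denied,
     * report -EACCES rather than pretending there are no children. */
    static int pick_wait_error(int nr_children, int nr_denied)
    {
        if (nr_children == 0)
            return -ECHILD;     /* no children to wait for at all */
        if (nr_denied == nr_children)
            return -EACCES;     /* all children denied by policy */
        return 0;               /* at least one eligible child */
    }

    int main(void)
    {
        printf("%d %d %d\n",
               pick_wait_error(0, 0) == -ECHILD,
               pick_wait_error(2, 2) == -EACCES,
               pick_wait_error(2, 1) == 0);
        return 0;
    }
    ```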

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: James Morris

    Oleg Nesterov
     

15 Apr, 2009

2 commits

  • Impact: clean up

    Create a sub directory in include/trace called events to keep the
    trace point headers in their own separate directory. Only headers that
    declare trace points should be defined in this directory.

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Neil Horman
    Cc: Zhao Lei
    Cc: Eduard - Gabriel Munteanu
    Cc: Pekka Enberg
    Signed-off-by: Steven Rostedt

    Steven Rostedt
     
    This patch lowers the number of places a developer must modify to add
    new tracepoints. Currently, adding a tracepoint to an existing system
    means writing the tracepoint macro in the trace header with one of the
    macros TRACE_EVENT, TRACE_FORMAT or DECLARE_TRACE, adding an item of
    the same name to the C file with the macro DEFINE_TRACE(name), and
    then adding the tracepoint call itself.

    This change cuts out the need to add the DEFINE_TRACE(name).
    Every file that uses the tracepoint must still include the
    trace/<trace_header>.h file, but the one C file must also add a
    define before including that file.

    #define CREATE_TRACE_POINTS
    #include <trace/mytrace.h>

    This will cause the trace/mytrace.h file to also produce the C code
    necessary to implement the trace point.

    Note, if more than one trace/<trace_header>.h is used to create the
    C code, it is best to list them all together.

    #define CREATE_TRACE_POINTS
    #include <trace/foo.h>
    #include <trace/bar.h>
    #include <trace/fido.h>

    Thanks to Mathieu Desnoyers and Christoph Hellwig for coming up with
    the cleaner solution of the define above the includes over my first
    design to have the C code include a "special" header.

    This patch converts sched, irq and lockdep and skb to use this new
    method.
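    The declare-vs-define trick behind CREATE_TRACE_POINTS can be shown
    with a userspace analogue. This is a sketch only: the macro and event
    names below are invented, and the kernel implements the real mechanism
    with header machinery under include/trace.

    ```c
    #include <stdio.h>

    /* First expansion: the "header" form emits only a declaration,
     * as ordinary users of the trace header would see it. */
    #define TRACE_POINT(name) void trace_##name(void);
    TRACE_POINT(sched_switch)
    #undef TRACE_POINT

    /* Second expansion: the one C file that defines CREATE_TRACE_POINTS
     * re-includes the same header, and the macro now emits the body. */
    #define TRACE_POINT(name) \
        void trace_##name(void) { printf("hit %s\n", #name); }
    TRACE_POINT(sched_switch)
    #undef TRACE_POINT

    int main(void)
    {
        trace_sched_switch();
        return 0;
    }
    ```

    The key design point is that one macro expands two different ways
    depending on a define, so the event description lives in exactly one
    place.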

    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Neil Horman
    Cc: Zhao Lei
    Cc: Eduard - Gabriel Munteanu
    Cc: Pekka Enberg
    Signed-off-by: Steven Rostedt

    Steven Rostedt