07 Mar, 2010

2 commits

  • kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one
    kernel/exit.c:1173:21: originally declared here

    Signed-off-by: Thiago Farina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thiago Farina
     
  • Considering the nature of per mm stats, it's the shared object among
    threads and can be a cache-miss point in the page fault path.

    This patch adds per-thread cache for mm_counter. RSS value will be
    counted into a struct in task_struct and synchronized with mm's one at
    events.

    Now, in this patch, the event is the number of calls to handle_mm_fault.
    Per-thread value is added to mm at each 64 calls.

    rough estimation with small benchmark on parallel thread (2threads) shows
    [before]
    4.5 cache-miss/faults
    [after]
    4.0 cache-miss/faults
    Anyway, the most contended object is mmap_sem if the number of threads grows.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

18 Dec, 2009

1 commit

  • Thanks to Roland who pointed out de_thread() issues.

    Currently we add sub-threads to ->real_parent->children list. This buys
    nothing but slows down do_wait().

    With this patch ->children contains only main threads (group leaders).
    The only complication is that forget_original_parent() should iterate over
    sub-threads by hand, and de_thread() needs another list_replace() when it
    changes ->group_leader.

    Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
    tasks, we can remove this check (and we can unify do_wait_thread() and
    ptrace_do_wait()).

    This change can confuse the optimistic search in mm_update_next_owner(),
    but this is fixable and minor.

    Perhaps badness() and oom_kill_process() should be updated, but they
    should be fixed in any case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

15 Dec, 2009

1 commit


12 Dec, 2009

1 commit


09 Dec, 2009

1 commit

  • * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
    cfq-iosched: Do not access cfqq after freeing it
    block: include linux/err.h to use ERR_PTR
    cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
    blkio: Allow CFQ group IO scheduling even when CFQ is a module
    blkio: Implement dynamic io controlling policy registration
    blkio: Export some symbols from blkio as its user CFQ can be a module
    block: Fix io_context leak after failure of clone with CLONE_IO
    block: Fix io_context leak after clone with CLONE_IO
    cfq-iosched: make nonrot check logic consistent
    io controller: quick fix for blk-cgroup and modular CFQ
    cfq-iosched: move IO controller declerations to a header file
    cfq-iosched: fix compile problem with !CONFIG_CGROUP
    blkio: Documentation
    blkio: Wait on sync-noidle queue even if rq_noidle = 1
    blkio: Implement group_isolation tunable
    blkio: Determine async workload length based on total number of queues
    blkio: Wait for cfq queue to get backlogged if group is empty
    blkio: Propagate cgroup weight updation to cfq groups
    blkio: Drop the reference to queue once the task changes cgroup
    blkio: Provide some isolation between groups
    ...

    Linus Torvalds
     

06 Dec, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    sched, cputime: Introduce thread_group_times()
    sched, cputime: Cleanups related to task_times()
    Revert "sched, x86: Optimize branch hint in __switch_to()"
    sched: Fix isolcpus boot option
    sched: Revert 498657a478c60be092208422fefa9c7b248729c2
    sched, time: Define nsecs_to_jiffies()
    sched: Remove task_{u,s,g}time()
    sched: Introduce task_times() to replace task_{u,s}time() pair
    sched: Limit the number of scheduler debug messages
    sched.c: Call debug_show_all_locks() when dumping all tasks
    sched, x86: Optimize branch hint in __switch_to()
    sched: Optimize branch hint in context_switch()
    sched: Optimize branch hint in pick_next_task_fair()
    sched_feat_write(): Update ppos instead of file->f_pos
    sched: Sched_rt_periodic_timer vs cpu hotplug
    sched, kvm: Fix race condition involving sched_in_preempt_notifers
    sched: More generic WAKE_AFFINE vs select_idle_sibling()
    sched: Cleanup select_task_rq_fair()
    sched: Fix granularity of task_u/stime()
    sched: Fix/add missing update_rq_clock() calls
    ...

    Linus Torvalds
     

04 Dec, 2009

1 commit

  • With CLONE_IO, parent's io_context->nr_tasks is incremented, but never
    decremented whenever copy_process() fails afterwards, which prevents
    exit_io_context() from calling IO schedulers exit functions.

    Give a task_struct to exit_io_context(), and call exit_io_context() instead of
    put_io_context() in copy_process() cleanup path.

    Signed-off-by: Louis Rilling
    Signed-off-by: Jens Axboe

    Louis Rilling
     

03 Dec, 2009

1 commit

  • This is a real fix for problem of utime/stime values decreasing
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time when the thread
    is interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after adjusted by task_times().

    - When all threads in a thread_group exits, accumulated {u,s}time
    (and also c{u,s}time) in signal struct are added to c{u,s}time
    in signal struct of the group's parent.

    So {u,s}time in task struct are "raw" tick count, while
    {u,s}time and c{u,s}time in signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get cputime of a thread:
    This function returns adjusted values that originates from raw
    {u,s}time and scaled by sum_exec_runtime that accounted by CFS.

    - thread_group_cputime(), to get cputime of a thread group:
    This function returns sum of all {u,s}time of living threads in
    the group, plus {u,s}time in the signal struct that is sum of
    adjusted cputimes of all exited threads belonged to the group.

    The problem is the return value of thread_group_cputime(),
    because it is mixed sum of "raw" value and "adjusted" value:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity.
    Assume that if there is a thread that have raw values greater
    than adjusted values (e.g. interrupted by 1000Hz ticks 50 times
    but only runs 45ms) and if it exits, cputime will decrease (e.g.
    -5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify thread's exit (= __exit_signal()) not to use task_times().
    It means {u,s}time in signal struct accumulates raw values instead
    of adjusted values. As the result it makes thread_group_cputime()
    to return pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts "raw" values of thread_group_cputime() to "adjusted"
    values, in same calculation procedure as task_times().

    - Modify group's exit (= wait_task_zombie()) to use this introduced
    thread_group_times(). It make c{u,s}time in signal struct to
    have adjusted values like before this patch.

    - Replace some thread_group_cputime() by thread_group_times().
    This replacements are only applied where conveys the "adjusted"
    cputime to users, and where already uses task_times() near by it.
    (i.e. sys_times(), getrusage(), and /proc//stat.)

    This patch have a positive side effect:

    - Before this patch, if a group contains many short-life threads
    (e.g. runs 0.9ms and not interrupted by ticks), the group's
    cputime could be invisible since thread's cputime was accumulated
    after adjusted: imagine adjustment function as adj(ticks, runtime),
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch it will not happen because the adjustment is
    applied after accumulated.

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

26 Nov, 2009

2 commits

  • Now all task_{u,s}time() pairs are replaced by task_times().
    And task_gtime() is too simple to be an inline function.

    Cleanup them all.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • Functions task_{u,s}time() are called in pair in almost all
    cases. However task_stime() is implemented to call task_utime()
    from its inside, so such paired calls run task_utime() twice.

    It means we do heavy divisions (div_u64 + do_div) twice to get
    utime and stime which can be obtained at same time by one set
    of divisions.

    This patch introduces a function task_times(*tsk, *utime,
    *stime) to retrieve utime and stime at once in better, optimized
    way.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

21 Nov, 2009

1 commit


08 Nov, 2009

1 commit

  • This patch rebase the implementation of the breakpoints API on top of
    perf events instances.

    Each breakpoints are now perf events that handle the
    register scheduling, thread/cpu attachment, etc..

    The new layering is now made as follows:

    ptrace kgdb ftrace perf syscall
    \ | / /
    \ | / /
    /
    Core breakpoint API /
    /
    | /
    | /

    Breakpoints perf events

    |
    |

    Breakpoints PMU ---- Debug Register constraints handling
    (Part of core breakpoint API)
    |
    |

    Hardware debug registers

    Reasons of this rewrite:

    - Use the centralized/optimized pmu registers scheduling,
    implying an easier arch integration
    - More powerful register handling: perf attributes (pinned/flexible
    events, exclusive/non-exclusive, tunable period, etc...)

    Impact:

    - New perf ABI: the hardware breakpoints counters
    - Ptrace breakpoints setting remains tricky and still needs some per
    thread breakpoints references.

    Todo (in the order):

    - Support breakpoints perf counter events for perf tools (ie: implement
    perf_bpcounter_event())
    - Support from perf tools

    Changes in v2:

    - Follow the perf "event " rename
    - The ptrace regression have been fixed (ptrace breakpoint perf events
    weren't released when a task ended)
    - Drop the struct hw_breakpoint and store generic fields in
    perf_event_attr.
    - Separate core and arch specific headers, drop
    asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
    - Use new generic len/type for breakpoint
    - Handle off case: when breakpoints api is not supported by an arch

    Changes in v3:

    - Fix broken CONFIG_KVM, we need to propagate the breakpoint api
    changes to kvm when we exit the guest and restore the bp registers
    to the host.

    Changes in v4:

    - Drop the hw_breakpoint_restore() stub as it is only used by KVM
    - EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
    module
    - Restore the breakpoints unconditionally on kvm guest exit:
    TIF_DEBUG_THREAD doesn't anymore cover every cases of running
    breakpoints and vcpu->arch.switch_db_regs might not always be
    set when the guest used debug registers.
    (Waiting for a reliable optimization)

    Changes in v5:

    - Split-up the asm-generic/hw-breakpoint.h moving to
    linux/hw_breakpoint.h into a separate patch
    - Optimize the breakpoints restoring while switching from kvm guest
    to host. We only want to restore the state if we have active
    breakpoints to the host, otherwise we don't care about messed-up
    address registers.
    - Add asm/hw_breakpoint.h to Kbuild
    - Fix bad breakpoint type in trace_selftest.c

    Changes in v6:

    - Fix wrong header inclusion in trace.h (triggered a build
    error with CONFIG_FTRACE_SELFTEST

    Signed-off-by: Frederic Weisbecker
    Cc: Prasad
    Cc: Alan Stern
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Jan Kiszka
    Cc: Jiri Slaby
    Cc: Li Zefan
    Cc: Avi Kivity
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Masami Hiramatsu
    Cc: Paul Mundt

    Frederic Weisbecker
     

29 Oct, 2009

1 commit

  • Since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add
    event for process becoming session leader) we have the following warning:

    Badness at kernel/softirq.c:143
    [...]
    Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
    [...]
    Call Trace:
    ([] 0x13fe04100)
    [] sk_filter+0x9a/0xd0
    [] netlink_broadcast+0x2c0/0x53c
    [] cn_netlink_send+0x272/0x2b0
    [] proc_sid_connector+0xc4/0xd4
    [] __set_special_pids+0x58/0x90
    [] sys_setsid+0xb4/0xd8
    [] sysc_noemu+0x10/0x16
    [] 0x41616cb266

    The warning is
    ---> WARN_ON_ONCE(in_irq() || irqs_disabled());

    The network code must not be called with disabled interrupts but
    sys_setsid holds the tasklist_lock with spinlock_irq while calling the
    connector.

    After a discussion we agreed that we can move proc_sid_connector from
    __set_special_pids to sys_setsid.

    We also agreed that it is sufficient to change the check from
    task_session(curr) != pid into err > 0, since if we don't change the
    session, this means we were already the leader and return -EPERM.

    One last thing:
    There is also daemonize(), and some people might want to get a
    notification in that case. Since daemonize() is only needed if a user
    space does kernel_thread this does not look important (and there seems
    to be no consensus if this connector should be called in daemonize). If
    we really want this, we can add proc_sid_connector to daemonize() in an
    additional patch (Scott?)

    Signed-off-by: Christian Borntraeger
    Cc: Scott James Remnant
    Cc: Matt Helsley
    Cc: David S. Miller
    Acked-by: Oleg Nesterov
    Acked-by: Evgeniy Polyakov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     

09 Oct, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    futex: fix requeue_pi key imbalance
    futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions
    rcu: Place root rcu_node structure in separate lockdep class
    rcu: Make hot-unplugged CPU relinquish its own RCU callbacks
    rcu: Move rcu_barrier() to rcutree
    futex: Move exit_pi_state() call to release_mm()
    futex: Nullify robust lists after cleanup
    futex: Fix locking imbalance
    panic: Fix panic message visibility by calling bust_spinlocks(0) before dying
    rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function
    rcu: Clean up code based on review feedback from Josh Triplett, part 4
    rcu: Clean up code based on review feedback from Josh Triplett, part 3
    rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y
    rcu: Clean up code to address Ingo's checkpatch feedback
    rcu: Clean up code based on review feedback from Josh Triplett, part 2
    rcu: Clean up code based on review feedback from Josh Triplett

    Linus Torvalds
     

06 Oct, 2009

1 commit


24 Sep, 2009

10 commits

  • Because the binfmt is not different between threads in the same process,
    it can be moved from task_struct to mm_struct. And binfmt moudle is
    handled per mm_struct instead of task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • Current behaviour of sys_waitid() looks odd. If user passes infop ==
    NULL, sys_waitid() returns success. When user additionally specifies flag
    WNOWAIT, sys_waitid() returns -EFAULT on the same conditions. When user
    combines WNOWAIT with WCONTINUED, sys_waitid() again returns success.

    This patch adds check for ->wo_info in wait_noreap_copyout().

    User-visible change: starting from this commit, sys_waitid() always checks
    infop != NULL and does not fail if it is NULL.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • do_wait() checks ->wo_info to figure out who is the caller. If it's not
    NULL the caller should be sys_waitid(), in that case do_wait() fixes up
    the retval or zeros ->wo_info, depending on retval from underlying
    function.

    This is bug: user can pass ->wo_info == NULL and sys_waitid() will return
    incorrect value.

    man 2 waitid says:

    waitid(): returns 0 on success

    Test-case:

    int main(void)
    {
    if (fork())
    assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

    return 0;
    }

    Result:

    Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

    Move that code to sys_waitid().

    User-visible change: sys_waitid() will return 0 on success, either
    infop is set or not.

    Note, there's another bug in wait_noreap_copyout() which affects
    return value of sys_waitid(). It will be fixed in next patch.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • Kill the unused "parent" argument in wait_consider_task(), it was never used.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_pid_type() is only used by eligible_pid() which has to check wo_type
    != PIDTYPE_MAX anyway. Remove this check from task_pid_type() and factor
    out ->pids[type] access, this shrinks .text a bit and simplifies the code.

    The matches the behaviour of other similar helpers, say get_task_pid().
    The caller must ensure that pid_type is valid, not the callee.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • child_wait_callback()->eligible_child() is not right, we can miss the
    wakeup if the task was detached before __wake_up_parent() and the caller
    of do_wait() didn't use __WALL.

    Move ->wo_pid checks from eligible_child() to the new helper,
    eligible_pid(), and change child_wait_callback() to use it instead of
    eligible_child().

    Note: actually I think it would be better to fix the __WCLONE check in
    eligible_child(), it doesn't look exactly right. But it is not clear what
    is the supposed behaviour, and any change is user-visible.

    Reported-by: KAMEZAWA Hiroyuki
    Tested-by: KAMEZAWA Hiroyuki
    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Suggested by Roland.

    do_wait(__WNOTHREAD) can only succeed if the caller is either ptracer, or
    it is ->real_parent and the child is not traced. IOW, caller == p->parent
    otherwise we should not wake up.

    Change child_wait_callback() to check this. Ratan reports the workload with
    CPU load >99% caused by unnecessary wakeups, should be fixed by this patch.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Ratan Nalumasu reported that in a process with many threads doing
    unnecessary wakeups. Every waiting thread in the process wakes up to loop
    through the children and see that the only ones it cares about are still
    not ready.

    Now that we have struct wait_opts we can change do_wait/__wake_up_parent
    to use filtered wakeups.

    We can make child_wait_callback() more clever later, right now it only
    checks eligible_child().

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Acked-by: James Morris
    Tested-by: Valdis Kletnieks
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • …to wait_consider_task()

    Preparation, no functional changes.

    eligible_child() has a single caller, wait_consider_task(). We can move
    security_task_wait() out from eligible_child(), this allows us to use it
    for filtered wake_up().

    Signed-off-by: Oleg Nesterov <oleg@redhat.com>
    Acked-by: Roland McGrath <roland@redhat.com>
    Cc: Ingo Molnar <mingo@elte.hu>
    Cc: Ratan Nalumasu <rnalumasu@gmail.com>
    Cc: Vitaly Mayatskikh <vmayatsk@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Oleg Nesterov
     
  • The bug is old, it wasn't cause by recent changes.

    Test case:

    static void *tfunc(void *arg)
    {
    int pid = (long)arg;

    assert(ptrace(PTRACE_ATTACH, pid, NULL, NULL) == 0);
    kill(pid, SIGKILL);

    sleep(1);
    return NULL;
    }

    int main(void)
    {
    pthread_t th;
    long pid = fork();

    if (!pid)
    pause();

    signal(SIGCHLD, SIG_IGN);
    assert(pthread_create(&th, NULL, tfunc, (void*)pid) == 0);

    int r = waitpid(-1, NULL, __WNOTHREAD);
    printf("waitpid: %d %m\n", r);

    return 0;
    }

    Before the patch this program hangs, after this patch waitpid() correctly
    fails with errno == -ECHILD.

    The problem is, __ptrace_detach() reaps the EXIT_ZOMBIE tracee if its
    ->real_parent is our sub-thread and we ignore SIGCHLD. But in this case
    we should wake up other threads which can sleep in do_wait().

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

23 Sep, 2009

2 commits

  • Make ->ru_maxrss value in struct rusage filled accordingly to rss hiwater
    mark. This struct is filled as a parameter to getrusage syscall.
    ->ru_maxrss value is set to KBs which is the way it is done in BSD
    systems. /usr/bin/time (gnu time) application converts ->ru_maxrss to KBs
    which seems to be incorrect behavior. Maintainer of this util was
    notified by me with the patch which corrects it and cc'ed.

    To make this happen we extend struct signal_struct by two fields. The
    first one is ->maxrss which we use to store rss hiwater of the task. The
    second one is ->cmaxrss which we use to store highest rss hiwater of all
    task childs. These values are used in k_getrusage() to actually fill
    ->ru_maxrss. k_getrusage() uses current rss hiwater value directly if mm
    struct exists.

    Note:
    exec() clear mm->hiwater_rss, but doesn't clear sig->maxrss.
    it is intetionally behavior. *BSD getrusage have exec() inheriting.

    test programs
    ========================================================

    getrusage.c
    ===========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    #define err(str) perror(str), exit(1)

    int main(int argc, char** argv)
    {
    int status;

    printf("allocate 100MB\n");
    consume(100);

    printf("testcase1: fork inherit? \n");
    printf(" expect: initial.self ~= child.self\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase2: fork inherit? (cont.) \n");
    printf(" expect: initial.children ~= 100MB, but child.children = 0\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    show_rusage("child");
    _exit(0);
    }
    printf("\n");

    printf("testcase3: fork + malloc \n");
    printf(" expect: child.self ~= initial.self + 50MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    } else {
    printf("allocate +50MB\n");
    consume(50);
    show_rusage("fork child");
    _exit(0);
    }
    printf("\n");

    printf("testcase4: grandchild maxrss\n");
    printf(" expect: post_wait.children ~= 300MB\n");
    show_rusage("initial");
    if (__fork()) {
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 0 -g 300");
    _exit(0);
    }
    printf("\n");

    printf("testcase5: zombie\n");
    printf(" expect: pre_wait ~= initial, IOW the zombie process is not accounted.\n");
    printf(" post_wait ~= 400MB, IOW wait() collect child's max_rss. \n");
    show_rusage("initial");
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("pre_wait");
    wait(&status);
    show_rusage("post_wait");
    } else {
    system("./child -n 400");
    _exit(0);
    }
    printf("\n");

    printf("testcase6: SIG_IGN\n");
    printf(" expect: initial ~= after_zombie (child's 500MB alloc should be ignored).\n");
    show_rusage("initial");
    signal(SIGCHLD, SIG_IGN);
    if (__fork()) {
    sleep(1); /* children become zombie */
    show_rusage("after_zombie");
    } else {
    system("./child -n 500");
    _exit(0);
    }
    printf("\n");
    signal(SIGCHLD, SIG_DFL);

    printf("testcase7: exec (without fork) \n");
    printf(" expect: initial ~= exec \n");
    show_rusage("initial");
    execl("./child", "child", "-v", NULL);

    return 0;
    }

    child.c
    =======
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"

    int main(int argc, char** argv)
    {
    int status;
    int c;
    long consume_size = 0;
    long grandchild_consume_size = 0;
    int show = 0;

    while ((c = getopt(argc, argv, "n:g:v")) != -1) {
    switch (c) {
    case 'n':
    consume_size = atol(optarg);
    break;
    case 'v':
    show = 1;
    break;
    case 'g':

    grandchild_consume_size = atol(optarg);
    break;
    default:
    break;
    }
    }

    if (show)
    show_rusage("exec");

    if (consume_size) {
    printf("child alloc %ldMB\n", consume_size);
    consume(consume_size);
    }

    if (grandchild_consume_size) {
    if (fork()) {
    wait(&status);
    } else {
    printf("grandchild alloc %ldMB\n", grandchild_consume_size);
    consume(grandchild_consume_size);

    exit(0);
    }
    }

    return 0;
    }

    common.c
    ========
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include
    #include

    #include "common.h"
    #define err(str) perror(str), exit(1)

    void show_rusage(char *prefix)
    {
    int err, err2;
    struct rusage rusage_self;
    struct rusage rusage_children;

    printf("%s: ", prefix);
    err = getrusage(RUSAGE_SELF, &rusage_self);
    if (!err)
    printf("self %ld ", rusage_self.ru_maxrss);
    err2 = getrusage(RUSAGE_CHILDREN, &rusage_children);
    if (!err2)
    printf("children %ld ", rusage_children.ru_maxrss);

    printf("\n");
    }

    /* Some buggy OS need this worthless CPU waste. */
    void make_pagefault(void)
    {
    void *addr;
    int size = getpagesize();
    int i;

    for (i=0; i
    Signed-off-by: KOSAKI Motohiro
    Cc: Oleg Nesterov
    Cc: Hugh Dickins
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Pirko
     
  • The act of a process becoming a session leader is a useful signal to a
    supervising init daemon such as Upstart.

    While a daemon will normally do this as part of the process of becoming a
    daemon, it is rare for its children to do so. When the children do, it is
    nearly always a sign that the child should be considered detached from the
    parent and not supervised along with it.

    The poster-child example is OpenSSH; the per-login children call setsid()
    so that they may control the pty connected to them. If the primary daemon
    dies or is restarted, we do not want to consider the per-login children
    and want to respawn the primary daemon without killing the children.

    This patch adds a new PROC_SID_EVENT and associated structure to the
    proc_event event_data union, it arranges for this to be emitted when the
    special PIDTYPE_SID pid is set.

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Scott James Remnant
    Acked-by: Matt Helsley
    Cc: Oleg Nesterov
    Cc: Evgeniy Polyakov
    Acked-by: "David S. Miller"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Scott James Remnant
     

21 Sep, 2009

1 commit

  • Bye-bye Performance Counters, welcome Performance Events!

    In the past few months the perfcounters subsystem has grown out its
    initial role of counting hardware events, and has become (and is
    becoming) a much broader generic event enumeration, reporting, logging,
    monitoring, analysis facility.

    Naming its core object 'perf_counter' and naming the subsystem
    'perfcounters' has become more and more of a misnomer. With pending
    code like hw-breakpoints support the 'counter' name is less and
    less appropriate.

    All in one, we've decided to rename the subsystem to 'performance
    events' and to propagate this rename through all fields, variables
    and API names. (in an ABI compatible fashion)

    The word 'event' is also a bit shorter than 'counter' - which makes
    it slightly more convenient to write/handle as well.

    Thanks goes to Stephane Eranian who first observed this misnomer and
    suggested a rename.

    User-space tooling and ABI compatibility is not affected - this patch
    should be function-invariant. (Also, defconfigs were not touched to
    keep the size down.)

    This patch has been generated via the following script:

    FILES=$(find * -type f | grep -vE 'oprofile|[^K]config')

    sed -i \
    -e 's/PERF_EVENT_/PERF_RECORD_/g' \
    -e 's/PERF_COUNTER/PERF_EVENT/g' \
    -e 's/perf_counter/perf_event/g' \
    -e 's/nb_counters/nb_events/g' \
    -e 's/swcounter/swevent/g' \
    -e 's/tpcounter_event/tp_event/g' \
    $FILES

    for N in $(find . -name perf_counter.[ch]); do
    M=$(echo $N | sed 's/perf_counter/perf_event/g')
    mv $N $M
    done

    FILES=$(find . -name perf_event.*)

    sed -i \
    -e 's/COUNTER_MASK/REG_MASK/g' \
    -e 's/COUNTER/EVENT/g' \
    -e 's/\/event_id/g' \
    -e 's/counter/event/g' \
    -e 's/Counter/Event/g' \
    $FILES

    ... to keep it as correct as possible. This script can also be
    used by anyone who has pending perfcounters patches - it converts
    a Linux kernel tree over to the new naming. We tried to time this
    change to the point in time where the amount of pending patches
    is the smallest: the end of the merge window.

    Namespace clashes were fixed up in a preparatory patch - and some
    stylistic fallout will be fixed up in a subsequent patch.

    ( NOTE: 'counters' are still the proper terminology when we deal
    with hardware registers - and these sed scripts are a bit
    over-eager in renaming them. I've undone some of that, but
    in case there's something left where 'counter' would be
    better than 'event' we can undo that on an individual basis
    instead of touching an otherwise nicely automated patch. )

    Suggested-by: Stephane Eranian
    Acked-by: Peter Zijlstra
    Acked-by: Paul Mackerras
    Reviewed-by: Arjan van de Ven
    Cc: Mike Galbraith
    Cc: Arnaldo Carvalho de Melo
    Cc: Frederic Weisbecker
    Cc: Steven Rostedt
    Cc: Benjamin Herrenschmidt
    Cc: David Howells
    Cc: Kyle McMartin
    Cc: Martin Schwidefsky
    Cc: "David S. Miller"
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc:
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Sep, 2009

1 commit

  • * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (28 commits)
    rcu: Move end of special early-boot RCU operation earlier
    rcu: Changes from reviews: avoid casts, fix/add warnings, improve comments
    rcu: Create rcutree plugins to handle hotplug CPU for multi-level trees
    rcu: Remove lockdep annotations from RCU's _notrace() API members
    rcu: Add #ifdef to suppress __rcu_offline_cpu() warning in !HOTPLUG_CPU builds
    rcu: Add CPU-offline processing for single-node configurations
    rcu: Add "notrace" to RCU function headers used by ftrace
    rcu: Remove CONFIG_PREEMPT_RCU
    rcu: Merge preemptable-RCU functionality into hierarchical RCU
    rcu: Simplify rcu_pending()/rcu_check_callbacks() API
    rcu: Use debugfs_remove_recursive() simplify code.
    rcu: Merge per-RCU-flavor initialization into pre-existing macro
    rcu: Fix online/offline indication for rcudata.csv trace file
    rcu: Consolidate sparse and lockdep declarations in include/linux/rcupdate.h
    rcu: Renamings to increase RCU clarity
    rcu: Move private definitions from include/linux/rcutree.h to kernel/rcutree.h
    rcu: Expunge lingering references to CONFIG_CLASSIC_RCU, optimize on !SMP
    rcu: Delay rcu_barrier() wait until beginning of next CPU-hotunplug operation.
    rcu: Fix typo in rcu_irq_exit() comment header
    rcu: Make rcupreempt_trace.c look at offline CPUs
    ...

    Linus Torvalds
     

02 Sep, 2009

1 commit

  • Add a config option (CONFIG_DEBUG_CREDENTIALS) to turn on some debug checking
    for credential management. The additional code keeps track of the number of
    pointers from task_structs to any given cred struct, and checks to see that
    this number never exceeds the usage count of the cred struct (which includes
    all references, not just those from task_structs).

    Furthermore, if SELinux is enabled, the code also checks that the security
    pointer in the cred struct is never seen to be invalid.

    This attempts to catch the bug whereby inode_has_perm() faults in an nfsd
    kernel thread on seeing cred->security be a NULL pointer (it appears that the
    credential struct has been previously released):

    http://www.kerneloops.org/oops.php?number=252883

    Signed-off-by: David Howells
    Signed-off-by: James Morris

    David Howells
     

23 Aug, 2009

1 commit

  • Create a kernel/rcutree_plugin.h file that contains definitions
    for preemptable RCU (or, under the #else branch of the #ifdef,
    empty definitions for the classic non-preemptable semantics).
    These definitions fit into plugins defined in kernel/rcutree.c
    for this purpose.

    This variant of preemptable RCU uses a new algorithm whose
    read-side expense is roughly that of classic hierarchical RCU
    under CONFIG_PREEMPT. This new algorithm's update-side expense
    is similar to that of classic hierarchical RCU, and, in absence
    of read-side preemption or blocking, is exactly that of classic
    hierarchical RCU. Perhaps more important, this new algorithm
    has a much simpler implementation, saving well over 1,000 lines
    of code compared to mainline's implementation of preemptable
    RCU, which will hopefully be retired in favor of this new
    algorithm.

    The simplifications are obtained by maintaining per-task
    nesting state for running tasks, and using a simple
    lock-protected algorithm to handle accounting when tasks block
    within RCU read-side critical sections, making use of lessons
    learned while creating numerous user-level RCU implementations
    over the past 18 months.

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: akpm@linux-foundation.org
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josht@linux.vnet.ibm.com
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

09 Jul, 2009

1 commit

  • Fix various silly problems wrt mnt_namespace.h:

    - exit_mnt_ns() isn't used, remove it
    - done that, sched.h and nsproxy.h inclusions aren't needed
    - mount.h inclusion was need for vfsmount_lock, but no longer
    - remove mnt_namespace.h inclusion from files which don't use anything
    from mnt_namespace.h

    Signed-off-by: Alexey Dobriyan
    Signed-off-by: Linus Torvalds

    Alexey Dobriyan
     

20 Jun, 2009

1 commit

  • The bug is ancient.

    If we trace the sub-thread of our natural child and this sub-thread exits,
    we update parent->signal->cxxx fields. But we should not do this until
    the whole thread-group exits, otherwise we account this thread (and all
    other live threads) twice.

    Add the task_detached() check. No need to check thread_group_empty(),
    wait_consider_task()->delay_group_leader() already did this.

    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Cc: Stanislaw Gruszka
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

19 Jun, 2009

5 commits

  • Reorder struct wait_opts to remove 8 bytes of alignment padding on 64 bit
    builds.

    Signed-off-by: Richard Kennedy
    Cc: Oleg Nesterov
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Richard Kennedy
     
  • do_wait:

    current->state = TASK_INTERRUPTIBLE;

    read_lock(&tasklist_lock);
    ... search for the task to reap ...

    In theory, the ->state changing can leak into the critical section. Since
    the child can change its status under read_lock(tasklist) in parallel
    (finish_stop/ptrace_stop), we can miss the wakeup if __wake_up_parent()
    sees us in TASK_RUNNING state. Add the barrier.

    Also, use __set_current_state() to set TASK_RUNNING.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_wait() does BUG_ON(tsk->signal != current->signal), this looks like a
    raher obsolete check. At least, I don't think do_wait() is the best place
    to verify that all threads have the same ->signal. Remove it.

    Also, change the code to use while_each_thread().

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that we don't pass &retval down to other helpers we can simplify
    the code more.

    - kill tsk_result, just use retval

    - add the "notask" label right after the main loop, and
    s/got end/goto notask/ after the fastpath pid check.

    This way we don't need to initialize retval before this
    check and the code becomes a bit more clean, if this pid
    has no attached tasks we should just skip the list search.

    Signed-off-by: Oleg Nesterov
    Cc: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Introduce "struct wait_opts" which holds the parameters for misc helpers
    in do_wait() pathes.

    This adds 13 lines to kernel/exit.c, but saves 256 bytes from .o and imho
    makes the code much more readable.

    This patch temporary uglifies rusage/siginfo code a little bit, will be
    addressed by further cleanups.

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Ingo Molnar
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov