18 Aug, 2010

1 commit

  • Using a program like the following:

    #include <sys/types.h>
    #include <sys/wait.h>
    #include <signal.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main() {
        id_t id;
        siginfo_t infop;
        pid_t res;

        id = fork();
        if (id == 0) { sleep(1); exit(0); }
        kill(id, SIGSTOP);
        alarm(1);
        res = waitid(P_PID, id, &infop, WCONTINUED);
        return 0;
    }

    to call waitid() on a stopped process results in access to the child
    task's credentials without the RCU read lock being held - and the
    credentials may be replaced in the meantime - eliciting the following
    warning:

    ===================================================
    [ INFO: suspicious rcu_dereference_check() usage. ]
    ---------------------------------------------------
    kernel/exit.c:1460 invoked rcu_dereference_check() without protection!

    other info that might help us debug this:

    rcu_scheduler_active = 1, debug_locks = 1
    2 locks held by waitid02/22252:
    #0: (tasklist_lock){.?.?..}, at: [] do_wait+0xc5/0x310
    #1: (&(&sighand->siglock)->rlock){-.-...}, at: []
    wait_consider_task+0x19a/0xbe0

    stack backtrace:
    Pid: 22252, comm: waitid02 Not tainted 2.6.35-323cd+ #3
    Call Trace:
    [] lockdep_rcu_dereference+0xa4/0xc0
    [] wait_consider_task+0xaf1/0xbe0
    [] do_wait+0xf5/0x310
    [] sys_waitid+0x86/0x1f0
    [] ? child_wait_callback+0x0/0x70
    [] system_call_fastpath+0x16/0x1b

    This is fixed by holding the RCU read lock in wait_task_continued() to ensure
    that the task's current credentials aren't destroyed between us reading the
    cred pointer and us reading the UID from those credentials.

    Furthermore, protect wait_task_stopped() in the same way.

    We don't need to keep holding the RCU read lock once we've read the UID from
    the credentials as holding the RCU read lock doesn't stop the target task from
    changing its creds under us - so the credentials may be outdated immediately
    after we've read the pointer, lock or no lock.
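
    A minimal sketch of the pattern the fix applies in wait_task_continued()
    and wait_task_stopped() (illustrative, not the exact diff):

        uid_t uid;

        rcu_read_lock();            /* pin the child's cred struct */
        uid = __task_cred(p)->uid;  /* read the UID from the current creds */
        rcu_read_unlock();          /* creds may be replaced from here on */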

    Signed-off-by: Daniel J Blueman
    Signed-off-by: David Howells
    Acked-by: Paul E. McKenney
    Acked-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Daniel J Blueman
     

11 Aug, 2010

1 commit

  • exit_ptrace() takes tasklist_lock unconditionally. We need this lock to
    avoid the race with ptrace_traceme(); it acts as a barrier.

    Change its caller, forget_original_parent(), to call exit_ptrace() under
    tasklist_lock. Change exit_ptrace() to drop and reacquire this lock if
    needed.

    This allows us to add the fastpath list_empty(ptraced) check. In the
    likely no-tracees case exit_ptrace() just returns and we avoid the
    lock() + unlock() sequence.

    "Zhang, Yanmin" suggested adding this check, and he reports that this
    change adds about 11% improvement in some tests.
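
    Roughly, the fastpath looks like this (a sketch under the description
    above, assuming the tracee list head is tracer->ptraced; the caller now
    already holds tasklist_lock):

        void exit_ptrace(struct task_struct *tracer)
        {
            if (likely(list_empty(&tracer->ptraced)))
                return;    /* common case: no tracees, nothing to detach */

            /* slowpath: detach each tracee, dropping and reacquiring
             * tasklist_lock as needed */
        }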

    Suggested-and-tested-by: "Zhang, Yanmin"
    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

28 May, 2010

10 commits

  • No functional changes, just s/atomic_t count/int nr_threads/.

    With the recent changes this counter has a single user, get_nr_threads(),
    and none of its callers needs a really accurate number of threads, not
    to mention that each caller obviously races with fork/exit. It is only
    used to report this value to user space, except that first_tid() uses it
    to avoid an unnecessary while_each_thread() loop in the unlikely case.

    It is a bit sad that we need a word in struct signal_struct for this;
    perhaps we can change get_nr_threads() to approximate the number of
    threads using signal->live and kill ->nr_threads later.
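
    After the rename the accessor reduces to a plain, deliberately racy read
    (a sketch of the resulting helper):

        static inline int get_nr_threads(struct task_struct *tsk)
        {
            return tsk->signal->nr_threads;  /* racy by design, callers don't care */
        }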

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Oleg Nesterov
    Cc: Alexey Dobriyan
    Cc: "Eric W. Biederman"
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Move taskstats_tgid_free() from __exit_signal() to free_signal_struct().

    This way signal->stats never points to nowhere and we can read ->stats
    lockless.
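
    A sketch of where the call ends up (assuming the usual signal_cachep
    slab; illustrative only):

        static void free_signal_struct(struct signal_struct *sig)
        {
            taskstats_tgid_free(sig);    /* moved here from __exit_signal() */
            kmem_cache_free(signal_cachep, sig);
        }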

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Cleanup:

    - Add the boolean, group_dead = thread_group_leader(), for clarity.

    - Do not test/set sig == NULL to detect the all-dead case, use this
    boolean.

    - Pass this boolean to __unhash_process() and use it instead of another
    thread_group_leader() call which needs ->group_leader.

    This can be considered a micro-optimization, but hopefully it also
    allows us to do other cleanups later.
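
    In outline, __exit_signal() after this change (a sketch, not the exact
    diff):

        static void __exit_signal(struct task_struct *tsk)
        {
            bool group_dead = thread_group_leader(tsk);
            /* ... */
            __unhash_process(tsk, group_dead);
            /* ... */
        }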

    Signed-off-by: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Roland McGrath
    Cc: Veaceslav Falico
    Cc: Stanislaw Gruszka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Now that task->signal can't go away we can revert the horrible hack added
    by ad474caca3e2a0550b7ce0706527ad5ab389a4d4 ("fix for
    account_group_exec_runtime(), make sure ->signal can't be freed under
    rq->lock").

    And we can do more cleanups sched_stats.h/posix-cpu-timers.c later.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • When the last thread exits signal->tty is freed, but the pointer is not
    cleared and points to nowhere.

    This is OK. Nobody should use signal->tty locklessly, and it is no longer
    possible to take ->siglock. However this looks wrong even if correct, and
    a nice OOPS is better than subtle, hard-to-find bugs.

    Change __exit_signal() to clear signal->tty under ->siglock.
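
    A sketch of the change inside __exit_signal() (variable names follow the
    surrounding code of that time):

        spin_lock(&sighand->siglock);
        /* ... */
        if (group_dead) {
            tty = sig->tty;
            sig->tty = NULL;    /* don't leave a dangling pointer behind */
        }
        spin_unlock(&sighand->siglock);
        /* ... */
        tty_kref_put(tty);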

    Note: __exit_signal() needs more cleanups. It should not check "sig !=
    NULL" to detect the all-dead case and we have the same issues with
    signal->stats.

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Acked-by: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • We have a lot of problems with accessing task_struct->signal; it can
    "disappear" at any moment. Even current can't use its ->signal safely
    after exit_notify(). ->siglock helps, but it is not convenient, not
    always possible, and sometimes it makes sense to use task->signal even
    after this task is already dead.

    This patch adds the reference counter, sigcnt, into signal_struct. This
    reference is owned by task_struct and it is dropped in
    __put_task_struct(). Perhaps it makes sense to export
    get/put_signal_struct() later, but currently I don't see the immediate
    reason.

    Rename __cleanup_signal() to free_signal_struct() and unexport it. With
    the previous changes it does nothing except kmem_cache_free().

    Change __exit_signal() to not clear/free ->signal, it will be freed when
    the last reference to any thread in the thread group goes away.

    Note:
    - when the last thread exits signal->tty can point to nowhere, see
    the next patch.

    - with or without this patch signal_struct->count should go away,
    or at least it should be "int nr_threads" for fs/proc. This will
    be addressed later.
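
    A sketch of the drop side (the helper name and the ->sigcnt field follow
    the description above; illustrative only):

        /* called from __put_task_struct() */
        static inline void put_signal_struct(struct signal_struct *sig)
        {
            if (atomic_dec_and_test(&sig->sigcnt))
                free_signal_struct(sig);
        }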

    Signed-off-by: Oleg Nesterov
    Cc: Alan Cox
    Cc: Ingo Molnar
    Cc: Peter Zijlstra
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • tty_kref_put() has two callsites in copy_process() paths,

    1. if copy_process() succeeds it is called before we copy
    signal->tty from the parent

    2. otherwise it is called from __cleanup_signal() under
    bad_fork_cleanup_signal: label

    In both cases tty_kref_put() is not right and unneeded because we don't
    have the balancing tty_kref_get(). Fortunately, this is harmless because
    this can only happen without CLONE_THREAD, and in this case signal->tty
    must be NULL.

    Remove tty_kref_put() from copy_process() and __cleanup_signal(), and
    change another caller of __cleanup_signal(), __exit_signal(), to call
    tty_kref_put() by hand.

    I hope this change makes sense by itself, but it is also needed to make
    ->signal refcountable.

    Signed-off-by: Oleg Nesterov
    Acked-by: Alan Cox
    Acked-by: Roland McGrath
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Change __exit_signal() to check thread_group_leader() instead of
    atomic_dec_and_test(&sig->count). This must be equivalent: the group
    leader must be released only after all other threads have exited and
    passed __exit_signal().

    Henceforth sig->count is not actually used, except in fs/proc for
    get_nr_threads/etc.

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • de_thread() and __exit_signal() use signal_struct->count/notify_count for
    synchronization. We can simplify the code and use ->notify_count only.
    Instead of comparing these two counters, we can change de_thread() to set
    ->notify_count = nr_of_sub_threads, then change __exit_signal() to
    dec-and-test this counter and notify group_exit_task.

    Note that __exit_signal() checks "notify_count > 0" just for symmetry with
    exit_notify(), we could just check it is != 0.
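
    The __exit_signal() side then reduces to a dec-and-test (a sketch,
    matching the symmetry note above):

        if (sig->notify_count > 0 && !--sig->notify_count)
            wake_up_process(sig->group_exit_task);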

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • signal_struct->count in its current form must die.

    - it has no reasons to be atomic_t

    - it looks like a reference counter, but it is not

    - otoh, we really need to make task->signal refcountable, just look at
    the extremely ugly task_rq_unlock_wait() called from __exit_signal().

    - we should change the lifetime rules for task->signal, it should be
    pinned to task_struct. We have a lot of code which can be simplified
    after that.

    - it is not needed! while the code is correct, any usage of this
    counter is artificial, except that fs/proc uses it correctly to show
    the number of threads.

    This series removes the usage of sig->count from the exit paths.

    This patch:

    Now that Veaceslav changed copy_signal() to use zalloc(), exit_notify()
    can just check notify_count < 0 to ensure the execing sub-thread needs
    the notification from us. No need to do other checks; notify_count != 0
    must always mean ->group_exit_task != NULL is waiting for us.
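
    That is, the check reduces to (a sketch of the exit_notify() side):

        if (unlikely(tsk->signal->notify_count < 0))
            wake_up_process(tsk->signal->group_exit_task);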

    Signed-off-by: Oleg Nesterov
    Acked-by: Roland McGrath
    Cc: Veaceslav Falico
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

25 May, 2010

1 commit

  • Before applying this patch, cpuset updates task->mems_allowed and
    mempolicy by setting all new bits in the nodemask first, and clearing
    all old disallowed bits later. But in this way, the allocator may find
    that there is no node from which to allocate memory.

    The reason is that, when cpuset rebinds the task's mempolicy, it clears
    the nodes which the allocator can allocate pages on, for example:

    (mpol: mempolicy)
    task1                    task1's mpol    task2
    alloc page               1
      alloc on node0? NO     1
                             1               change mems from 1 to 0
                             1               rebind task1's mpol
                             0-1               set new bits
                             0                 clear disallowed bits
      alloc on node1? NO     0
    ...
    can't alloc page
      goto oom

    This patch fixes this problem by expanding the node range first (setting
    the newly allowed bits) and shrinking it lazily (clearing the newly
    disallowed bits). So we use a variable to tell the write-side task that a
    read-side task is reading the nodemask, and the write-side task clears
    the newly disallowed nodes after the read-side task ends the current
    memory allocation.
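
    On the read side this amounts to bracketing the allocation (a sketch of
    the scheme; the get/put helper names are illustrative):

        get_mems_allowed();     /* tell writers a nodemask read is in flight */
        page = alloc_pages_current(gfp, order);
        put_mems_allowed();     /* writers may now clear disallowed bits */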

    [akpm@linux-foundation.org: fix spello]
    Signed-off-by: Miao Xie
    Cc: David Rientjes
    Cc: Nick Piggin
    Cc: Paul Menage
    Cc: Lee Schermerhorn
    Cc: Hugh Dickins
    Cc: Ravikiran Thirumalai
    Cc: KOSAKI Motohiro
    Cc: Christoph Lameter
    Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Miao Xie
     

07 Apr, 2010

1 commit

  • - We weren't zeroing p->rss_stat[] at fork()

    - Consequently sync_mm_rss() was dereferencing tsk->mm for kernel
    threads and was oopsing.

    - Make __sync_task_rss_stat() static, too.
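
    The fork-side fix is essentially (a sketch, assuming the counters live
    in task_struct as tsk->rss_stat):

        memset(&tsk->rss_stat, 0, sizeof(tsk->rss_stat));   /* in copy_process() */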

    Addresses https://bugzilla.kernel.org/show_bug.cgi?id=15648

    [akpm@linux-foundation.org: remove the BUG_ON(!mm->rss)]
    Reported-by: Troels Liebe Bentsen
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: "Michael S. Tsirkin"
    Cc: Andrea Arcangeli
    Cc: Rik van Riel
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

14 Mar, 2010

1 commit

  • Merge branch 'core-fixes-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    locking: Make sparse work with inline spinlocks and rwlocks
    x86/mce: Fix RCU lockdep splats
    rcu: Increase RCU CPU stall timeouts if PROVE_RCU
    ftrace: Replace read_barrier_depends() with rcu_dereference_raw()
    rcu: Suppress RCU lockdep warnings during early boot
    rcu, ftrace: Fix RCU lockdep splat in ftrace_perf_buf_prepare()
    rcu: Suppress __mpol_dup() false positive from RCU lockdep
    rcu: Make rcu_read_lock_sched_held() handle !PREEMPT
    rcu: Add control variables to lockdep_rcu_dereference() diagnostics
    rcu, cgroup: Relax the check in task_subsys_state() as early boot is now handled by lockdep-RCU
    rcu: Use wrapper function instead of exporting tasklist_lock
    sched, rcu: Fix rcu_dereference() for RCU-lockdep
    rcu: Make task_subsys_state() RCU-lockdep checks handle boot-time use
    rcu: Fix holdoff for accelerated GPs for last non-dynticked CPU
    x86/gart: Unexport gart_iommu_aperture

    Fix trivial conflicts in kernel/trace/ftrace.c

    Linus Torvalds
     

07 Mar, 2010

2 commits

  • kernel/exit.c:1183:26: warning: symbol 'status' shadows an earlier one
    kernel/exit.c:1173:21: originally declared here

    Signed-off-by: Thiago Farina
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thiago Farina
     
  • Considering the nature of per-mm stats, they are shared among threads
    and can be a cache-miss point in the page fault path.

    This patch adds a per-thread cache for the mm counters. The RSS value is
    counted into a struct in task_struct and synchronized with the mm's
    counters at certain events.

    Now, in this patch, the event is the number of calls to handle_mm_fault.
    The per-thread value is added to the mm every 64 calls.

    A rough estimate with a small benchmark on parallel threads (2 threads)
    shows:

    [before] 4.5 cache-misses/fault
    [after]  4.0 cache-misses/fault

    Anyway, the most contended object is mmap_sem if the number of threads
    grows.
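
    A sketch of the synchronization point (threshold and helper names follow
    the description above; illustrative only):

        #define TASK_RSS_EVENTS_THRESH 64

        static void check_sync_rss_stat(struct task_struct *task)
        {
            if (unlikely(task != current))
                return;
            if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
                sync_mm_rss(task, task->mm);    /* fold deltas into the mm */
        }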

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: KAMEZAWA Hiroyuki
    Cc: Minchan Kim
    Cc: Christoph Lameter
    Cc: Lee Schermerhorn
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KAMEZAWA Hiroyuki
     

04 Mar, 2010

1 commit

  • Lockdep-RCU commit d11c563d exported tasklist_lock, which is not
    a good thing. This patch instead exports a function that uses
    lockdep to check whether tasklist_lock is held.
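
    The exported wrapper is small (a sketch consistent with the description):

        int lockdep_tasklist_lock_is_held(void)
        {
            return lockdep_is_held(&tasklist_lock);
        }
        EXPORT_SYMBOL_GPL(lockdep_tasklist_lock_is_held);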

    Suggested-by: Christoph Hellwig
    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    Cc: Christoph Hellwig
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

25 Feb, 2010

1 commit

  • Update the rcu_dereference() usages to take advantage of the new
    lockdep-based checking.
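
    Typical conversions look like this (a sketch; the exact guard conditions
    vary per call site):

        sig = rcu_dereference_check(tsk->sighand,
                                    rcu_read_lock_held() ||
                                    lockdep_tasklist_lock_is_held());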

    Signed-off-by: Paul E. McKenney
    Cc: laijs@cn.fujitsu.com
    Cc: dipankar@in.ibm.com
    Cc: mathieu.desnoyers@polymtl.ca
    Cc: josh@joshtriplett.org
    Cc: dvhltc@us.ibm.com
    Cc: niv@us.ibm.com
    Cc: peterz@infradead.org
    Cc: rostedt@goodmis.org
    Cc: Valdis.Kletnieks@vt.edu
    Cc: dhowells@redhat.com
    LKML-Reference:
    [ -v2: fix allmodconfig missing symbol export build failure on x86 ]
    Signed-off-by: Ingo Molnar

    Paul E. McKenney
     

18 Dec, 2009

1 commit

  • Thanks to Roland who pointed out de_thread() issues.

    Currently we add sub-threads to ->real_parent->children list. This buys
    nothing but slows down do_wait().

    With this patch ->children contains only main threads (group leaders).
    The only complication is that forget_original_parent() should iterate over
    sub-threads by hand, and de_thread() needs another list_replace() when it
    changes ->group_leader.

    Henceforth do_wait_thread() can never see task_detached() && !EXIT_DEAD
    tasks, so we can remove this check (and we can unify do_wait_thread()
    and ptrace_do_wait()).

    This change can confuse the optimistic search in mm_update_next_owner(),
    but this is fixable and minor.

    Perhaps badness() and oom_kill_process() should be updated, but they
    should be fixed in any case.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

09 Dec, 2009

1 commit

  • * 'for-2.6.33' of git://git.kernel.dk/linux-2.6-block: (113 commits)
    cfq-iosched: Do not access cfqq after freeing it
    block: include linux/err.h to use ERR_PTR
    cfq-iosched: use call_rcu() instead of doing grace period stall on queue exit
    blkio: Allow CFQ group IO scheduling even when CFQ is a module
    blkio: Implement dynamic io controlling policy registration
    blkio: Export some symbols from blkio as its user CFQ can be a module
    block: Fix io_context leak after failure of clone with CLONE_IO
    block: Fix io_context leak after clone with CLONE_IO
    cfq-iosched: make nonrot check logic consistent
    io controller: quick fix for blk-cgroup and modular CFQ
    cfq-iosched: move IO controller declerations to a header file
    cfq-iosched: fix compile problem with !CONFIG_CGROUP
    blkio: Documentation
    blkio: Wait on sync-noidle queue even if rq_noidle = 1
    blkio: Implement group_isolation tunable
    blkio: Determine async workload length based on total number of queues
    blkio: Wait for cfq queue to get backlogged if group is empty
    blkio: Propagate cgroup weight updation to cfq groups
    blkio: Drop the reference to queue once the task changes cgroup
    blkio: Provide some isolation between groups
    ...

    Linus Torvalds
     

06 Dec, 2009

1 commit

  • Merge branch 'sched-core-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (35 commits)
    sched, cputime: Introduce thread_group_times()
    sched, cputime: Cleanups related to task_times()
    Revert "sched, x86: Optimize branch hint in __switch_to()"
    sched: Fix isolcpus boot option
    sched: Revert 498657a478c60be092208422fefa9c7b248729c2
    sched, time: Define nsecs_to_jiffies()
    sched: Remove task_{u,s,g}time()
    sched: Introduce task_times() to replace task_{u,s}time() pair
    sched: Limit the number of scheduler debug messages
    sched.c: Call debug_show_all_locks() when dumping all tasks
    sched, x86: Optimize branch hint in __switch_to()
    sched: Optimize branch hint in context_switch()
    sched: Optimize branch hint in pick_next_task_fair()
    sched_feat_write(): Update ppos instead of file->f_pos
    sched: Sched_rt_periodic_timer vs cpu hotplug
    sched, kvm: Fix race condition involving sched_in_preempt_notifers
    sched: More generic WAKE_AFFINE vs select_idle_sibling()
    sched: Cleanup select_task_rq_fair()
    sched: Fix granularity of task_u/stime()
    sched: Fix/add missing update_rq_clock() calls
    ...

    Linus Torvalds
     

04 Dec, 2009

1 commit

  • With CLONE_IO, the parent's io_context->nr_tasks is incremented, but
    never decremented whenever copy_process() fails afterwards, which
    prevents exit_io_context() from calling the IO schedulers' exit
    functions.

    Give a task_struct to exit_io_context(), and call exit_io_context()
    instead of put_io_context() in the copy_process() cleanup path.
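
    In the cleanup path this amounts to roughly (sketch):

        bad_fork_cleanup_io:
            if (p->io_context)
                exit_io_context(p);    /* was: put_io_context(p->io_context) */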

    Signed-off-by: Louis Rilling
    Signed-off-by: Jens Axboe

    Louis Rilling
     

03 Dec, 2009

1 commit

  • This is a real fix for the problem of utime/stime values decreasing,
    described in the thread:

    http://lkml.org/lkml/2009/11/3/522

    Now cputime is accounted in the following way:

    - {u,s}time in task_struct are increased every time the thread is
    interrupted by a tick (timer interrupt).

    - When a thread exits, its {u,s}time are added to signal->{u,s}time,
    after being adjusted by task_times().

    - When all threads in a thread_group exit, the accumulated {u,s}time
    (and also c{u,s}time) in the signal struct are added to c{u,s}time
    in the signal struct of the group's parent.

    So {u,s}time in the task struct are "raw" tick counts, while
    {u,s}time and c{u,s}time in the signal struct are "adjusted" values.

    And accounted values are used by:

    - task_times(), to get the cputime of a thread:
    This function returns adjusted values that originate from the raw
    {u,s}time, scaled by the sum_exec_runtime accounted by CFS.

    - thread_group_cputime(), to get the cputime of a thread group:
    This function returns the sum of all {u,s}time of living threads in
    the group, plus the {u,s}time in the signal struct, which is the sum
    of the adjusted cputimes of all exited threads that belonged to the
    group.

    The problem is the return value of thread_group_cputime(),
    because it is a mixed sum of "raw" and "adjusted" values:

    group's {u,s}time = foreach(thread){{u,s}time} + exited({u,s}time)

    This misbehavior can break {u,s}time monotonicity. If there is a
    thread that has raw values greater than its adjusted values (e.g.
    interrupted by 1000Hz ticks 50 times but having run only 45ms) and it
    exits, the group's cputime will decrease (e.g. -5ms).

    To fix this, we could do:

    group's {u,s}time = foreach(t){task_times(t)} + exited({u,s}time)

    But task_times() contains hard divisions, so applying it for
    every thread should be avoided.

    This patch fixes the above problem in the following way:

    - Modify the thread's exit (= __exit_signal()) not to use task_times().
    This means {u,s}time in the signal struct accumulate raw values
    instead of adjusted values. As a result, thread_group_cputime()
    returns a pure sum of "raw" values.

    - Introduce a new function thread_group_times(*task, *utime, *stime)
    that converts the "raw" values of thread_group_cputime() to "adjusted"
    values, with the same calculation procedure as task_times().

    - Modify the group's exit (= wait_task_zombie()) to use this introduced
    thread_group_times(). It makes c{u,s}time in the signal struct have
    adjusted values, as before this patch.

    - Replace some thread_group_cputime() calls by thread_group_times().
    These replacements are only applied where the code conveys the
    "adjusted" cputime to users, and where task_times() is already used
    nearby (i.e. sys_times(), getrusage(), and /proc/<pid>/stat).

    This patch has a positive side effect:

    - Before this patch, if a group contained many short-lived threads
    (e.g. each runs 0.9ms and is never interrupted by ticks), the group's
    cputime could be invisible, since each thread's cputime was accumulated
    after being adjusted: imagine the adjustment function as
    adj(ticks, runtime), then
    {adj(0, 0.9) + adj(0, 0.9) + ....} = {0 + 0 + ....} = 0.
    After this patch this does not happen, because the adjustment is
    applied after accumulation.
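
    In outline, the new conversion helper (a sketch; it mirrors the
    task_times() calculation but operates on the group totals):

        void thread_group_times(struct task_struct *p,
                                cputime_t *ut, cputime_t *st)
        {
            struct task_cputime cputime;

            thread_group_cputime(p, &cputime);  /* pure "raw" sums */
            /* then scale by the group's sum_exec_runtime, as task_times()
             * does for a single thread */
        }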

    v2:
    - remove if()s, put new variables into signal_struct.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Spencer Candland
    Cc: Americo Wang
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Stanislaw Gruszka
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

26 Nov, 2009

2 commits

  • Now all task_{u,s}time() pairs are replaced by task_times().
    And task_gtime() is too simple to be an inline function.

    Clean them all up.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     
  • The functions task_{u,s}time() are called in pairs in almost all
    cases. However, task_stime() is implemented to call task_utime()
    internally, so such paired calls run task_utime() twice.

    This means we do the heavy divisions (div_u64 + do_div) twice to get
    utime and stime, which could be obtained at the same time with one set
    of divisions.

    This patch introduces a function task_times(*tsk, *utime, *stime) to
    retrieve utime and stime at once, in a better, optimized way.
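
    A condensed sketch of the combined helper (the monotonicity bookkeeping
    is omitted; the point is that one division pass yields both values):

        void task_times(struct task_struct *p, cputime_t *ut, cputime_t *st)
        {
            cputime_t utime = p->utime, total = utime + p->stime;
            cputime_t rtime = nsecs_to_cputime(p->se.sum_exec_runtime);

            if (total) {
                u64 tmp = (u64)rtime * utime;
                do_div(tmp, total);         /* the one heavy division pass */
                utime = (cputime_t)tmp;
            } else {
                utime = rtime;
            }

            *ut = utime;
            *st = rtime - utime;            /* stime falls out of the same pass */
        }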

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: Stanislaw Gruszka
    Cc: Spencer Candland
    Cc: Oleg Nesterov
    Cc: Balbir Singh
    Cc: Americo Wang
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

08 Nov, 2009

1 commit

  • This patch rebases the implementation of the breakpoints API on top of
    perf event instances.

    Each breakpoint is now a perf event that handles the
    register scheduling, thread/cpu attachment, etc..

    The new layering is now made as follows:

           ptrace     kgdb     ftrace    perf syscall
              \         |         /         /
               \        |        /         /
                                          /
              Core breakpoint API        /
                                        /
                        |              /
                        |             /

                Breakpoints perf events

                        |
                        |

                 Breakpoints PMU ---- Debug Register constraints handling
                                      (Part of core breakpoint API)
                        |
                        |

               Hardware debug registers

    Reasons for this rewrite:

    - Use the centralized/optimized pmu register scheduling,
    implying an easier arch integration
    - More powerful register handling: perf attributes (pinned/flexible
    events, exclusive/non-exclusive, tunable period, etc...)

    Impact:

    - New perf ABI: the hardware breakpoints counters
    - Ptrace breakpoints setting remains tricky and still needs some per
    thread breakpoints references.

    Todo (in the order):

    - Support breakpoints perf counter events for perf tools (ie: implement
    perf_bpcounter_event())
    - Support from perf tools

    Changes in v2:

    - Follow the perf "event" rename
    - The ptrace regression has been fixed (ptrace breakpoint perf events
    weren't released when a task ended)
    - Drop the struct hw_breakpoint and store generic fields in
    perf_event_attr.
    - Separate core and arch specific headers, drop
    asm-generic/hw_breakpoint.h and create linux/hw_breakpoint.h
    - Use new generic len/type for breakpoint
    - Handle off case: when breakpoints api is not supported by an arch

    Changes in v3:

    - Fix broken CONFIG_KVM, we need to propagate the breakpoint api
    changes to kvm when we exit the guest and restore the bp registers
    to the host.

    Changes in v4:

    - Drop the hw_breakpoint_restore() stub as it is only used by KVM
    - EXPORT_SYMBOL_GPL hw_breakpoint_restore() as KVM can be built as a
    module
    - Restore the breakpoints unconditionally on kvm guest exit:
    TIF_DEBUG_THREAD doesn't anymore cover every cases of running
    breakpoints and vcpu->arch.switch_db_regs might not always be
    set when the guest used debug registers.
    (Waiting for a reliable optimization)

    Changes in v5:

    - Split-up the asm-generic/hw-breakpoint.h moving to
    linux/hw_breakpoint.h into a separate patch
    - Optimize the breakpoints restoring while switching from kvm guest
    to host. We only want to restore the state if we have active
    breakpoints to the host, otherwise we don't care about messed-up
    address registers.
    - Add asm/hw_breakpoint.h to Kbuild
    - Fix bad breakpoint type in trace_selftest.c

    Changes in v6:

    - Fix wrong header inclusion in trace.h (triggered a build
    error with CONFIG_FTRACE_SELFTEST)

    Signed-off-by: Frederic Weisbecker
    Cc: Prasad
    Cc: Alan Stern
    Cc: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Cc: Steven Rostedt
    Cc: Ingo Molnar
    Cc: Jan Kiszka
    Cc: Jiri Slaby
    Cc: Li Zefan
    Cc: Avi Kivity
    Cc: Paul Mackerras
    Cc: Mike Galbraith
    Cc: Masami Hiramatsu
    Cc: Paul Mundt

    Frederic Weisbecker
     

29 Oct, 2009

1 commit

  • Since commit 02b51df1b07b4e9ca823c89284e704cadb323cd1 (proc connector: add
    event for process becoming session leader) we have the following warning:

    Badness at kernel/softirq.c:143
    [...]
    Krnl PSW : 0404c00180000000 00000000001481d4 (local_bh_enable+0xb0/0xe0)
    [...]
    Call Trace:
    ([] 0x13fe04100)
    [] sk_filter+0x9a/0xd0
    [] netlink_broadcast+0x2c0/0x53c
    [] cn_netlink_send+0x272/0x2b0
    [] proc_sid_connector+0xc4/0xd4
    [] __set_special_pids+0x58/0x90
    [] sys_setsid+0xb4/0xd8
    [] sysc_noemu+0x10/0x16
    [] 0x41616cb266

    The warning is
    ---> WARN_ON_ONCE(in_irq() || irqs_disabled());

    The network code must not be called with disabled interrupts, but
    sys_setsid holds the tasklist_lock with spin_lock_irq while calling the
    connector.

    After a discussion we agreed that we can move proc_sid_connector from
    __set_special_pids to sys_setsid.

    We also agreed that it is sufficient to change the check from
    task_session(curr) != pid to err > 0, since if we don't change the
    session, this means we were already the leader and return -EPERM.
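
    The tail of sys_setsid() then looks roughly like this (sketch):

        out:
            write_unlock_irq(&tasklist_lock);
            if (err > 0)                        /* session actually changed */
                proc_sid_connector(group_leader);
            return err;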

    One last thing:
    There is also daemonize(), and some people might want to get a
    notification in that case. Since daemonize() is only needed if user
    space does kernel_thread, this does not look important (and there seems
    to be no consensus on whether this connector should be called in
    daemonize). If we really want this, we can add proc_sid_connector to
    daemonize() in an additional patch (Scott?)

    Signed-off-by: Christian Borntraeger
    Cc: Scott James Remnant
    Cc: Matt Helsley
    Cc: David S. Miller
    Acked-by: Oleg Nesterov
    Acked-by: Evgeniy Polyakov
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     

09 Oct, 2009

1 commit

  • Merge branch 'core-fixes-for-linus' of
    git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip

    * 'core-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    futex: fix requeue_pi key imbalance
    futex: Fix typo in FUTEX_WAIT/WAKE_BITSET_PRIVATE definitions
    rcu: Place root rcu_node structure in separate lockdep class
    rcu: Make hot-unplugged CPU relinquish its own RCU callbacks
    rcu: Move rcu_barrier() to rcutree
    futex: Move exit_pi_state() call to release_mm()
    futex: Nullify robust lists after cleanup
    futex: Fix locking imbalance
    panic: Fix panic message visibility by calling bust_spinlocks(0) before dying
    rcu: Replace the rcu_barrier enum with pointer to call_rcu*() function
    rcu: Clean up code based on review feedback from Josh Triplett, part 4
    rcu: Clean up code based on review feedback from Josh Triplett, part 3
    rcu: Fix rcu_lock_map build failure on CONFIG_PROVE_LOCKING=y
    rcu: Clean up code to address Ingo's checkpatch feedback
    rcu: Clean up code based on review feedback from Josh Triplett, part 2
    rcu: Clean up code based on review feedback from Josh Triplett

    Linus Torvalds
     

24 Sep, 2009

5 commits

  • Because the binfmt is not different between threads in the same process,
    it can be moved from task_struct to mm_struct. And the binfmt module is
    handled per mm_struct instead of per task_struct.

    Signed-off-by: Hiroshi Shimamoto
    Acked-by: Oleg Nesterov
    Cc: Rusty Russell
    Acked-by: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hiroshi Shimamoto
     
  • The current behaviour of sys_waitid() looks odd. If the user passes
    infop == NULL, sys_waitid() returns success. When the user additionally
    specifies the flag WNOWAIT, sys_waitid() returns -EFAULT under the same
    conditions. When the user combines WNOWAIT with WCONTINUED, sys_waitid()
    again returns success.

    This patch adds a check for ->wo_info in wait_noreap_copyout().

    User-visible change: starting from this commit, sys_waitid() always checks
    infop != NULL and does not fail if it is NULL.
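
    The added early-out is essentially (a sketch of the
    wait_noreap_copyout() side):

        if (!infop)     /* infop == NULL: skip the copy-out, succeed anyway */
            return err;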

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • do_wait() checks ->wo_info to figure out who the caller is. If it's not
    NULL, the caller should be sys_waitid(); in that case do_wait() fixes up
    the retval or zeroes ->wo_info, depending on the retval from the
    underlying function.

    This is a bug: the user can pass infop == NULL (so ->wo_info == NULL)
    and sys_waitid() will return an incorrect value.

    man 2 waitid says:

    waitid(): returns 0 on success

    Test-case:

    #include <assert.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
        if (fork())
            assert(waitid(P_ALL, 0, NULL, WEXITED) == 0);

        return 0;
    }

    Result:

    Assertion `waitid(P_ALL, 0, ((void *)0), 4) == 0' failed.

    Move that code to sys_waitid().

    User-visible change: sys_waitid() will return 0 on success, whether
    infop is set or not.

    Note, there's another bug in wait_noreap_copyout() which affects
    return value of sys_waitid(). It will be fixed in next patch.

    Signed-off-by: Vitaly Mayatskikh
    Reviewed-by: Oleg Nesterov
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Mayatskikh
     
  • Kill the unused "parent" argument in wait_consider_task(); it was never
    used.

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: Ingo Molnar
    Cc: Ratan Nalumasu
    Cc: Vitaly Mayatskikh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • task_pid_type() is only used by eligible_pid(), which has to check
    wo_type != PIDTYPE_MAX anyway. Remove this check from task_pid_type()
    and factor out the ->pids[type] access; this shrinks .text a bit and
    simplifies the code.

    This matches the behaviour of other similar helpers, say get_task_pid():
    the caller must ensure that pid_type is valid, not the callee.
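
    After the change the helper is a plain accessor (sketch):

        static struct pid *task_pid_type(struct task_struct *task,
                                         enum pid_type type)
        {
            if (type != PIDTYPE_PID)
                task = task->group_leader;
            return task->pids[type].pid;
        }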

    Signed-off-by: Oleg Nesterov
    Cc: Roland McGrath
    Cc: KAMEZAWA Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov