30 Apr, 2009

1 commit


10 Apr, 2009

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'sched-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sched: do not count frozen tasks toward load
    sched: refresh MAINTAINERS entry
    sched: Print sched_group::__cpu_power in sched_domain_debug
    cpuacct: add per-cgroup utime/stime statistics
    posixtimers, sched: Fix posix clock monotonicity
    sched_rt: don't allocate cpumask in fastpath
    cpuacct: make cpuacct hierarchy walk in cpuacct_charge() safe when rcupreempt is used -v2

    Linus Torvalds
     

08 Apr, 2009

2 commits

  • update_rlimit_cpu() tries to optimize out set_process_cpu_timer() in the
    case where we already have a CPUCLOCK_PROF timer which should expire first.
    But it uses cputime_lt() instead of cputime_gt().

    Test case:

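        #include <assert.h>
        #include <stddef.h>
        #include <sys/resource.h>
        #include <sys/time.h>

        int main(void)
        {
            /* Profiling timer that would expire long after the CPU limit. */
            struct itimerval it = {
                .it_value = { .tv_sec = 1000 },
            };

            assert(!setitimer(ITIMER_PROF, &it, NULL));

            /* One-second soft and hard CPU limit; the task must be killed. */
            struct rlimit rl = {
                .rlim_cur = 1,
                .rlim_max = 1,
            };

            assert(!setrlimit(RLIMIT_CPU, &rl));

            for (;;)
                ;

            return 0;
        }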

    Without this patch, the task is not killed as RLIMIT_CPU demands.
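
    The comparison at fault can be illustrated with a small user-space model.
    This is only a sketch: cputime_model_t and both needs_rearm_*() helpers
    are hypothetical names, plain integers stand in for cputime_t, and a
    value of 0 is assumed to mean "no profiling timer armed".

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for cputime_t; 0 is assumed to mean "not armed". */
typedef unsigned long long cputime_model_t;

/* Models the cputime_lt() check: re-arms only if the timer expires EARLIER. */
static bool needs_rearm_buggy(cputime_model_t prof_exp, cputime_model_t new_limit)
{
	return prof_exp == 0 || prof_exp < new_limit;
}

/* Models the cputime_gt() check: re-arms if the timer expires LATER. */
static bool needs_rearm_fixed(cputime_model_t prof_exp, cputime_model_t new_limit)
{
	return prof_exp == 0 || prof_exp > new_limit;
}
```

    With the values from the test case (a 1000 second profiling timer and a
    1 second limit), only the second variant decides that the timer has to
    be re-armed for the limit.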

    Signed-off-by: Oleg Nesterov
    Acked-by: Peter Zijlstra
    Cc: Peter Lojkin
    Cc: Roland McGrath
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Merge reason: update to latest upstream to queue up fix

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

01 Apr, 2009

1 commit

  • Impact: Regression fix (against clock_gettime() backwarding bug)

    This patch re-introduces a couple of functions, task_sched_runtime
    and thread_group_sched_runtime, which were removed at the
    time of 2.6.28-rc1.

    These functions protect the sampling of thread/process clock with
    rq lock. The rq lock is required so that rq->clock is not updated
    during the sampling.

    Otherwise, clock_gettime() may return
    ((accounted runtime before update) + (delta after update)),
    which is less than what it should be.
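
    From user space, the symptom is a CPU-time clock that steps backwards
    between two reads. A minimal sketch of such a check follows;
    count_backward_steps() is a hypothetical helper, and a short
    single-threaded run like this is not guaranteed to hit the race on an
    affected kernel.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <time.h>

/*
 * Reads the given clock `samples` times and returns how many reads were
 * earlier than the read before them, or -1 if clock_gettime() fails.
 */
static long count_backward_steps(clockid_t clk, long samples)
{
	struct timespec prev, cur;
	long backward = 0;

	if (clock_gettime(clk, &prev) != 0)
		return -1;

	for (long i = 0; i < samples; i++) {
		if (clock_gettime(clk, &cur) != 0)
			return -1;
		if (cur.tv_sec < prev.tv_sec ||
		    (cur.tv_sec == prev.tv_sec && cur.tv_nsec < prev.tv_nsec))
			backward++;
		prev = cur;
	}
	return backward;
}
```

    Calling count_backward_steps(CLOCK_PROCESS_CPUTIME_ID, 10000) should
    return 0 on a kernel where the sampling is consistent.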

    v2 -> v3:
    - Rename static helper function __task_delta_exec()
    to do_task_delta_exec() since -tip tree already has
    a __task_delta_exec() of different version.

    v1 -> v2:
    - Revises comments of function and patch description.
    - Add note about accuracy of thread group's runtime.

    Signed-off-by: Hidetoshi Seto
    Acked-by: Peter Zijlstra
    Cc: stable@kernel.org [2.6.28.x][2.6.29.x]
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

24 Mar, 2009

1 commit

  • See http://bugzilla.kernel.org/show_bug.cgi?id=12911

    copy_signal() copies signal->rlim, but RLIMIT_CPU is "lost", because
    posix_cpu_timers_init_group() sets cputime_expires.prof_exp = 0 and thus
    fastpath_timer_check() returns false unless we have other cpu timers.

    This is the minimal fix for 2.6.29 (tested) and 2.6.28. The patch is not
    optimal, we need further cleanups here. With this patch update_rlimit_cpu()
    is not really needed, but I don't think it should be removed.

    The proper fix (I think) is:

    - set_process_cpu_timer() should just start the cputimer->running
    logic (it does), no need to change cputime_expires.xxx_exp

    - posix_cpu_timers_init_group() should set ->running when needed

    - fastpath_timer_check() can check ->running instead of
    task_cputime_zero(signal->cputime_expires)

    Reported-by: Peter Lojkin
    Signed-off-by: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Roland McGrath
    Cc: [for 2.6.29.x]
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

13 Feb, 2009

1 commit

  • While reviewing the manpages, I noticed I'd missed some clock vs timer sites.

    Make sure that all timer functions call cpu_timer_sample_group() and not
    cpu_clock_sample_group(). This ensures that we enable the process wide timer
    in time, and therefore pay the O(n) thread group cost from the syscall.

    Not doing it here will result in the first jiffy tick after setting the timer
    doing this, resulting in a very expensive tick (but only once) and a delay in
    actually starting the timer.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Feb, 2009

2 commits


05 Feb, 2009

1 commit

  • Change the process wide cpu timers/clocks so that we:

    1) don't mess up the kernel with too many threads,
    2) don't have a per-cpu allocation for each process,
    3) have no impact when not used.

    In order to accomplish this we're going to split it into two parts:

    - clocks; which can take all the time they want since they run
    from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)

    - timers; which need constant time sampling but since they're
    explicitly used, the user can pay the overhead.

    The clock readout will go back to a full sum of the thread group, while the
    timers will run off a global 'clock' that only runs when needed, so only
    programs that make use of the facility pay the price.

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Ingo Molnar
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Jan, 2009

1 commit

  • Either we bounce one cacheline per cpu per tick, yielding n^2 bounces,
    or we just bounce a single one.

    Also, using per-cpu allocations for the thread-groups complicates the
    per-cpu allocator in that it's currently aimed to be a fixed sized
    allocator and the only possible extension to that would be vmap based,
    which is seriously constrained on 32 bit archs.

    So making the per-cpu memory requirement depend on the number of
    processes is an issue.

    Lastly, it didn't deal with cpu-hotplug, although admittedly that might
    be fixable.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

25 Dec, 2008

1 commit


24 Nov, 2008

1 commit


17 Nov, 2008

2 commits

  • Impact: simplify the code

    thread_group_cputime() is called by current when it must have the valid
    ->signal, or under ->siglock, or under tasklist_lock after the ->signal
    check, or the caller is wait_task_zombie() which reaps the child. In any
    case ->signal can't be NULL.

    But the point of this patch is not optimization. If it is possible to call
    thread_group_cputime() when ->signal == NULL we are doing something wrong,
    and we should not mask the problem. thread_group_cputime() fills *times
    and the caller will use it, if we silently use task_struct->*times* we
    report the wrong values.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
  • Impact: fix potential NULL dereference

    Contrary to ad474caca3e2a0550b7ce0706527ad5ab389a4d4 changelog, other
    acct_group_xxx() helpers can be called after exit_notify() by timer tick.
    Thanks to Roland for pointing this out. Somehow I missed this simple fact
    when I read the original patch, and I am afraid I confused Frank during
    the discussion. Sorry.

    Fortunately, these helpers work with current, we can check ->exit_state
    to ensure that ->signal can't go away under us.

    Also, add the comment and compiler barrier to account_group_exec_runtime(),
    to make sure we load ->signal only once.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     

23 Sep, 2008

1 commit

  • This is the second resubmission of the posix timer rework patch, posted
    a few days ago.

    This includes the changes from the previous resubmission, which addressed
    Oleg Nesterov's comments, removing the RCU stuff from the patch and
    un-inlining the thread_group_cputime() function for SMP.

    In addition, per Ingo Molnar it simplifies the UP code, consolidating much
    of it with the SMP version and depending on lower-level SMP/UP handling to
    take care of the differences.

    It also cleans up some UP compile errors, moves the scheduler stats-related
    macros into kernel/sched_stats.h, cleans up a merge error in
    kernel/fork.c and has a few other minor fixes and cleanups as suggested
    by Oleg and Ingo. Thanks for the review, guys.

    Signed-off-by: Frank Mayhar
    Cc: Roland McGrath
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Frank Mayhar
     

14 Sep, 2008

2 commits

  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Overview

    This patch reworks the handling of POSIX CPU timers, including the
    ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
    with the help of Roland McGrath, the owner and original writer of this code.

    The problem we ran into, and the reason for this rework, has to do with using
    a profiling timer in a process with a large number of threads. It appears
    that the performance of the old implementation of run_posix_cpu_timers() was
    at least O(n*3) (where "n" is the number of threads in a process) or worse.
    Everything is fine with an increasing number of threads until the time taken
    for that routine to run becomes the same as or greater than the tick time, at
    which point things degrade rather quickly.

    This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

    Code Changes

    This rework corrects the implementation of run_posix_cpu_timers() to make it
    run in constant time for a particular machine. (Performance may vary between
    one machine and another depending upon whether the kernel is built as single-
    or multiprocessor and, in the latter case, depending upon the number of
    running processors.) To do this, at each tick we now update fields in
    signal_struct as well as task_struct. The run_posix_cpu_timers() function
    uses those fields to make its decisions.

    We define a new structure, "task_cputime," to contain user, system and
    scheduler times and use these in appropriate places:

        struct task_cputime {
            cputime_t utime;
            cputime_t stime;
            unsigned long long sum_exec_runtime;
        };

    This is included in the structure "thread_group_cputime," which is a new
    substructure of signal_struct and which varies for uniprocessor versus
    multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
    a simple substructure, while for multiprocessor kernels it is a pointer:

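        /* Uniprocessor kernels: a simple substructure. */
        struct thread_group_cputime {
            struct task_cputime totals;
        };

        /* Multiprocessor kernels: a pointer to per-cpu data. */
        struct thread_group_cputime {
            struct task_cputime *totals;
        };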

    We also add a new task_cputime substructure directly to signal_struct, to
    cache the earliest expiration of process-wide timers, and task_cputime also
    replaces the it_*_expires fields of task_struct (used for earliest expiration
    of thread timers). The "thread_group_cputime" structure contains process-wide
    timers that are updated via account_user_time() and friends. In the non-SMP
    case the structure is a simple aggregator; unfortunately in the SMP case that
    simplicity was not achievable due to cache-line contention between CPUs (in
    one measured case performance was actually _worse_ on a 16-cpu system than
    the same test on a 4-cpu system, due to this contention). For SMP, the
    thread_group_cputime counters are maintained as a per-cpu structure allocated
    using alloc_percpu(). The timer functions update only the timer field in
    the structure corresponding to the running CPU, obtained using per_cpu_ptr().

    We define a set of inline functions in sched.h that we use to maintain the
    thread_group_cputime structure and hide the differences between UP and SMP
    implementations from the rest of the kernel. The thread_group_cputime_init()
    function initializes the thread_group_cputime structure for the given task.
    The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
    out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
    in the per-cpu structures and fields. The thread_group_cputime_free()
    function, also a no-op for UP, in SMP frees the per-cpu structures. The
    thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
    thread_group_cputime_alloc() if the per-cpu structures haven't yet been
    allocated. The thread_group_cputime() function fills the task_cputime
    structure it is passed with the contents of the thread_group_cputime fields;
    in UP it's that simple but in SMP it must also safely check that tsk->signal
    is non-NULL (if it is it just uses the appropriate fields of task_struct) and,
    if so, sums the per-cpu values for each online CPU. Finally, the three
    functions account_group_user_time(), account_group_system_time() and
    account_group_exec_runtime() are used by timer functions to update the
    respective fields of the thread_group_cputime structure.
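
    The SMP scheme described above (each CPU updates only its own slot, and
    readers sum all slots) can be modelled in user space. This is a sketch
    under stated assumptions: the *_model types, MODEL_NR_CPUS and both
    helpers are hypothetical, and a fixed array stands in for the storage
    that the patch obtains with alloc_percpu() and per_cpu_ptr().

```c
#include <assert.h>

#define MODEL_NR_CPUS 4	/* hypothetical fixed CPU count for the model */

struct task_cputime_model {
	unsigned long long utime;
	unsigned long long stime;
	unsigned long long sum_exec_runtime;
};

/* One slot per CPU, so a tick on one CPU never writes another CPU's slot. */
struct thread_group_cputime_model {
	struct task_cputime_model totals[MODEL_NR_CPUS];
};

/* Tick path: update only the slot belonging to the running CPU. */
static void model_account_group_user_time(struct thread_group_cputime_model *tg,
					  int cpu, unsigned long long delta)
{
	tg->totals[cpu].utime += delta;
}

/* Read path: sum every slot into a single snapshot. */
static struct task_cputime_model
model_thread_group_cputime(const struct thread_group_cputime_model *tg)
{
	struct task_cputime_model sum = { 0, 0, 0 };

	for (int cpu = 0; cpu < MODEL_NR_CPUS; cpu++) {
		sum.utime += tg->totals[cpu].utime;
		sum.stime += tg->totals[cpu].stime;
		sum.sum_exec_runtime += tg->totals[cpu].sum_exec_runtime;
	}
	return sum;
}
```

    The design trades a cheap, contention-free update on every tick for a
    read that costs one pass over the CPUs.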

    Non-SMP operation is trivial and will not be mentioned further.

    The per-cpu structure is always allocated when a task creates its first new
    thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
    It is freed at process exit via a call to thread_group_cputime_free() from
    cleanup_signal().

    All functions that formerly summed utime/stime/sum_sched_runtime values
    from all threads in the thread group now use thread_group_cputime() to
    snapshot the values in the thread_group_cputime structure or the values in
    the task structure itself if the per-cpu structure hasn't been allocated.

    Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
    The run_posix_cpu_timers() function has been split into a fast path and a
    slow path; the former safely checks whether there are any expired thread
    timers and, if not, just returns, while the slow path does the heavy lifting.
    With the dedicated thread group fields, timers are no longer "rebalanced" and
    the process_timer_rebalance() function and related code has gone away. All
    summing loops are gone and all code that used them now uses the
    thread_group_cputime() inline. When process-wide timers are set, the new
    task_cputime structure in signal_struct is used to cache the earliest
    expiration; this is checked in the fast path.
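
    The fast-path idea can be sketched as a comparison of current samples
    against cached earliest expirations. Everything here is a simplified
    model: the names are hypothetical, a zero field is assumed to mean "no
    timer armed", and each field is compared only against its own expiry,
    which may differ from how the kernel combines the fields.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the task_cputime structure described above. */
struct cputime_model {
	unsigned long long utime;
	unsigned long long stime;
	unsigned long long sum_exec_runtime;
};

/* True if any armed (non-zero) expiry in `expires` has been reached. */
static bool model_expired(const struct cputime_model *sample,
			  const struct cputime_model *expires)
{
	if (expires->utime && sample->utime >= expires->utime)
		return true;
	if (expires->stime && sample->stime >= expires->stime)
		return true;
	if (expires->sum_exec_runtime &&
	    sample->sum_exec_runtime >= expires->sum_exec_runtime)
		return true;
	return false;
}

/* Fast path: returns true only when the slow path has work to do. */
static bool model_fastpath_check(const struct cputime_model *thread_sample,
				 const struct cputime_model *thread_expires,
				 const struct cputime_model *group_sample,
				 const struct cputime_model *group_expires)
{
	return model_expired(thread_sample, thread_expires) ||
	       model_expired(group_sample, group_expires);
}
```

    When nothing is armed or nothing is due, the check costs a handful of
    comparisons per tick regardless of the number of threads.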

    Performance

    The fix appears not to add significant overhead to existing operations. It
    generally performs the same as the current code except in two cases, one in
    which it performs slightly worse (Case 5 below) and one in which it performs
    very significantly better (Case 2 below). Overall it's a wash except in those
    two cases.

    I've since done somewhat more involved testing on a dual-core Opteron system.

    Case 1: With no itimer running, for a test with 100,000 threads, the fixed
    kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
    all of which was spent in the system. There were twice as many
    voluntary context switches with the fix as without it.

    Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
    an unmodified kernel can handle), the fixed kernel ran the test in
    eight percent of the time (5.8 seconds as opposed to 70 seconds) and
    had better tick accuracy (.012 seconds per tick as opposed to .023
    seconds per tick).

    Case 3: A 4000-thread test with an initial timer tick of .01 second and an
    interval of 10,000 seconds (i.e. a timer that ticks only once) had
    very nearly the same performance in both cases: 6.3 seconds elapsed
    for the fixed kernel versus 5.5 seconds for the unfixed kernel.

    With fewer threads (eight in these tests), the Case 1 test ran in essentially
    the same time on both the modified and unmodified kernels (5.2 seconds versus
    5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
    versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
    tick versus .025 seconds per tick for the unmodified kernel.

    Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

    Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
    running), the modified kernel was very slightly favored in that while
    it killed the process in 19.997 seconds of CPU time (5.002 seconds of
    wall time), only .003 seconds of that was system time, the rest was
    user time. The unmodified kernel killed the process in 20.001 seconds
    of CPU (5.014 seconds of wall time) of which .016 seconds was system
    time. Really, though, the results were too close to call. The results
    were essentially the same with no itimer running.

    Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
    (where the hard limit would never be reached) and an itimer running,
    the modified kernel exhibited worse tick accuracy than the unmodified
    kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
    performance was almost indistinguishable. With no itimer running this
    test exhibited virtually identical behavior and times in both cases.

    In times past I did some limited performance testing. Those results are below.

    On a four-cpu Opteron system without this fix, a sixteen-thread test executed
    in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
    the same system with the fix, user and elapsed time were about the same, but
    system time dropped to 0.007 seconds. Performance with eight, four and one
    thread were comparable. Interestingly, the timer ticks with the fix seemed
    more accurate: The sixteen-thread test with the fix received 149543 ticks
    for 0.024 seconds per tick, while the same test without the fix received 58720
    for 0.061 seconds per tick. Both cases were configured for an interval of
    0.01 seconds. Again, the other tests were comparable. Each thread in this
    test computed the primes up to 25,000,000.

    I also did a test with a large number of threads, 100,000 threads, which is
    impossible without the fix. In this case each thread computed the primes only
    up to 10,000 (to make the runtime manageable). System time dominated, at
    1546.968 seconds out of a total 2176.906 seconds (giving a user time of
    629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
    accurate. There is obviously no comparable test without the fix.

    Signed-off-by: Frank Mayhar
    Cc: Roland McGrath
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Frank Mayhar
     

25 May, 2008

1 commit


01 May, 2008

1 commit

  • x86 is the only arch right now which provides an optimized version of
    div_long_long_rem, and it has the downside that one has to be very careful that
    the divide doesn't overflow.

    The API is a little awkward, as the arguments for the unsigned divide are
    signed. The signed version also doesn't handle a negative divisor and
    produces worse code on 64bit archs.

    There is little incentive to keep this API alive, so this converts the few
    users to the new API.

    Signed-off-by: Roman Zippel
    Cc: Ralf Baechle
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: john stultz
    Cc: Christoph Lameter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

17 Apr, 2008

1 commit

  • Fix sparse warnings like this:
    kernel/posix-cpu-timers.c:1090:25: warning: symbol 't' shadows an earlier one
    kernel/posix-cpu-timers.c:1058:21: originally declared here

    Signed-off-by: WANG Cong
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    WANG Cong
     

09 Feb, 2008

1 commit

  • All the functions that need to look up a task by pid in posix timers obtain
    this pid from user space, and thus this value refers to a task in the same
    namespace as the current task.

    So the proper behavior is to call find_task_by_vpid() here.

    Signed-off-by: Pavel Emelyanov
    Cc: "Eric W. Biederman"
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

26 Jan, 2008

2 commits

  • Remove the curious logic to set it_sched_expires in the future. It is useless
    because rt.timeout wouldn't be incremented anyway.

    Explicitly check for RLIM_INFINITY, as a test program that had a 1s soft limit
    and an inf hard limit would SIGKILL at 1s. This is because RLIM_INFINITY+d-1
    is d-2.
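
    The wrap-around can be checked with unsigned arithmetic. In this sketch,
    MODEL_RLIM_INFINITY and model_round_up_numerator() are hypothetical
    names; it assumes RLIM_INFINITY is the all-ones unsigned long and that
    the expression x + d - 1 is the numerator of a round-up division.

```c
#include <assert.h>
#include <limits.h>

/* Stand-in for RLIM_INFINITY, assumed to be the all-ones unsigned long. */
#define MODEL_RLIM_INFINITY ULONG_MAX

/* Numerator of a round-up division (x + d - 1) / d, in unsigned long. */
static unsigned long model_round_up_numerator(unsigned long x, unsigned long d)
{
	return x + d - 1;	/* wraps modulo ULONG_MAX + 1 */
}
```

    Since ULONG_MAX + 1 wraps to 0, the sum ULONG_MAX + d - 1 wraps to
    d - 2, so an "infinite" limit turns into a tiny one unless it is
    tested for explicitly.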

    Signed-off-by: Peter Zijlstra
    CC: Michal Schmidt
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Introduce a new rlimit that allows the user to set a runtime timeout on
    the time slice of real-time tasks. Once this limit is exceeded the task
    will receive SIGXCPU.

    So it measures runtime since the last sleep.

    Input and ideas by Thomas Gleixner and Lennart Poettering.

    Signed-off-by: Peter Zijlstra
    CC: Lennart Poettering
    CC: Michael Kerrisk
    CC: Ulrich Drepper
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

20 Oct, 2007

1 commit

  • With pid namespaces this field is now dangerous to use explicitly, so hide
    it behind the helpers.

    Also the pid and pgrp fields of task_struct and signal_struct are to be
    deprecated. Unfortunately this patch cannot be sent right now as this
    leads to tons of warnings, so start isolating them, and deprecate later.

    Actually the p->tgid == pid has to be changed to has_group_leader_pid(),
    but Oleg pointed out that in case of posix cpu timers this is the same, and
    thread_group_leader() is preferable.

    Signed-off-by: Pavel Emelyanov
    Acked-by: Oleg Nesterov
    Cc: Sukadev Bhattiprolu
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelyanov
     

10 Jul, 2007

1 commit


09 May, 2007

1 commit

  • There are many places in the kernel where constructions like

        foo = list_entry(head->next, struct foo_struct, list);

    are used.
    The code might look more descriptive and neat if it used the macro

        list_first_entry(head, type, member) \
            list_entry((head)->next, type, member)

    Here is the macro itself and examples of its usage in the generic code.
    If it turns out to be useful, I can prepare a set of patches to
    inject it into arch-specific code, drivers, networking, etc.
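
    A self-contained user-space sketch of the macro in use follows. The
    list_head structure, the offsetof-based list_entry() and the
    model_list_add_tail() helper are simplified stand-ins written for this
    example, not the kernel's own definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

/* Simplified list_entry(): recover the containing structure from a member. */
#define list_entry(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* The macro proposed in the commit message. */
#define list_first_entry(head, type, member) \
	list_entry((head)->next, type, member)

struct foo_struct {
	int value;
	struct list_head list;
};

/* Hypothetical helper: link `node` in just before `head` (at the tail). */
static void model_list_add_tail(struct list_head *node, struct list_head *head)
{
	node->prev = head->prev;
	node->next = head;
	head->prev->next = node;
	head->prev = node;
}
```

    The caller must make sure the list is not empty; on an empty list
    head->next is the head itself, which is not embedded in a foo_struct.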

    Signed-off-by: Pavel Emelianov
    Signed-off-by: Kirill Korotaev
    Cc: Randy Dunlap
    Cc: Andi Kleen
    Cc: Zach Brown
    Cc: Davide Libenzi
    Cc: John McCutchan
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: john stultz
    Cc: Ram Pai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Emelianov
     

17 Feb, 2007

1 commit

  • Use RCU to avoid the need to acquire tasklist_lock in the single-threaded
    case of clock_gettime(). It still acquires tasklist_lock when sampling a
    (potentially multithreaded) process. This change allows realtime
    applications to frequently monitor CPU consumption of individual tasks, as
    requested (and now deployed) by some off-list users.

    This has been in Ingo Molnar's -rt patchset since late 2005 with no
    problems reported, and tests successfully on 2.6.20-rc6, so I believe that
    it is long-since ready for mainline adoption.

    [paulmck@linux.vnet.ibm.com: fix exit()/posix_cpu_clock_get() race spotted by Oleg]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: john stultz
    Cc: Roman Zippel
    Cc: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paul E. McKenney
     

17 Oct, 2006

1 commit

  • The integer divisions in the timer accounting code can round the result
    down to 0. Adding 0 is without effect and the signal delivery stops.

    Clamp the division result to minimum 1 to avoid this.

    Problem was reported by Seongbae Park, who also provided
    an initial patch.

    Roland sayeth:

    I have had some more time to think about the problem, and to reproduce it
    using Toyo's test case. For the record, if my understanding of the problem
    is correct, this happens only in one very particular case. First, the
    expiry time has to be so soon that in cputime_t units (usually 1s/HZ ticks)
    it's < nthreads so the division yields zero. Second, it only affects each
    thread that is so new that its CPU time accumulation is zero so now+0 is
    still zero and ->it_*_expires winds up staying zero. For the VIRT and PROF
    clocks when cputime_t is tick granularity (or the SCHED clock on
    configurations where sched_clock's value only advances on clock ticks), this
    is not hard to arrange with new threads starting up and blocking before they
    accumulate a whole tick of CPU time. That's what happens in Toyo's test
    case.

    Note that in general it is fine for that division to round down to zero,
    and set each thread's expiry time to its "now" time. The problem only
    arises with threads whose "now" value is still zero, so that now+0 winds up
    0 and is interpreted as "not set" instead of ">= now". So it would be a
    sufficient and more precise fix to just use max(ticks, 1) inside the loop
    when setting each it_*_expires value.

    But, it does no harm to round the division up to one and always advance
    every thread's expiry time. If the thread didn't already fire timers for
    the expiry time of "now", there is no expectation that it will do so before
    the next tick anyway. So I followed Thomas's patch in lifting the max out
    of the loops.

    This patch also covers the reload cases, which are harder to write a test
    for (and I didn't try). I've tested it with Toyo's case and it fixes that.
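
    The rounding problem and the clamp can be shown with plain integers.
    This is only a model: the model_* helpers are hypothetical, `left` is
    the time remaining until expiry in ticks, and an expiry of 0 is assumed
    to be read as "not set", as described above.

```c
#include <assert.h>

/* Per-thread share of the remaining time without the clamp. */
static unsigned long model_share_unclamped(unsigned long left,
					   unsigned long nthreads)
{
	return left / nthreads;	/* rounds down to 0 when left < nthreads */
}

/* Per-thread share with the division result clamped to a minimum of 1. */
static unsigned long model_share_clamped(unsigned long left,
					 unsigned long nthreads)
{
	unsigned long ticks = left / nthreads;

	return ticks > 1 ? ticks : 1;
}

/* Expiry of a thread that has accumulated `now` ticks; 0 means "not set". */
static unsigned long model_thread_expiry(unsigned long now, unsigned long share)
{
	return now + share;
}
```

    For a brand-new thread (now == 0) with 3 ticks left among 8 threads, the
    unclamped share gives an expiry of 0, while the clamped share gives 1.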

    [toyoa@mvista.com: fix: min_t -> max_t]
    Signed-off-by: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Roland McGrath
    Cc: Daniel Walker
    Cc: Toyo Abe
    Cc: john stultz
    Cc: Roman Zippel
    Cc: Seongbae Park
    Cc: Peter Mattis
    Cc: Rohit Seth
    Cc: Martin Bligh
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

30 Sep, 2006

2 commits

  • When a posix_cpu_nsleep() sleep is interrupted by a signal more than twice, it
    incorrectly reports the sleep time remaining to the user. This is because
    posix_cpu_nsleep() doesn't report back to the user when it is called from
    the restart function, due to wrong flags handling.

    This patch, which applies after previous one, moves the nanosleep() function
    from posix_cpu_nsleep() to do_cpu_nanosleep() and cleans up the flags handling
    appropriately.

    Signed-off-by: Toyo Abe
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toyo Abe
     
  • The clock_nanosleep() function does not return the time remaining when the
    sleep is interrupted by a signal.

    This patch creates a new call out, compat_clock_nanosleep_restart(), which
    handles returning the remaining time after a sleep is interrupted. This
    patch revives clock_nanosleep_restart(). It is now accessed via the new
    call out. The compat_clock_nanosleep_restart() is used for compatibility
    access.

    Since this is implemented in compatibility mode the normal path is
    virtually unaffected - no real performance impact.

    Signed-off-by: Toyo Abe
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Roland McGrath
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toyo Abe
     

18 Jun, 2006

3 commits

  • arm_timer() checks PF_EXITING to prevent BUG_ON(->exit_state)
    in run_posix_cpu_timers().

    However, for some reason it does so only for CPUCLOCK_PERTHREAD
    case (which is imho wrong).

    Also, this check is not reliable: PF_EXITING could be set on
    another cpu without any locks/barriers just after the check,
    so it can't prevent the timer from being attached to the
    exiting task.

    The previous patch makes this check unneeded.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • do_exit() clears ->it_##clock##_expires, but nothing prevents
    another cpu from attaching a timer to the exiting process after that.
    arm_timer() tries to protect against this race, but the check
    is racy.

    After exit_notify() does 'write_unlock_irq(&tasklist_lock)' and
    before do_exit() calls 'schedule()', the local timer interrupt can find
    tsk->exit_state != 0. If that state was EXIT_DEAD (or another cpu
    does sys_wait4), the interrupted task has ->signal == NULL.

    At this moment the exiting task has no pending cpu timers; they were
    cleaned up in __exit_signal()->posix_cpu_timers_exit{,_group}(),
    so we can just return from irq.

    John Stultz recently confirmed this bug, see

    http://marc.theaimsgroup.com/?l=linux-kernel&m=115015841413687

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • If the local timer interrupt happens just after do_exit() sets PF_EXITING
    (and before it clears ->it_xxx_expires) run_posix_cpu_timers() will call
    check_process_timers() with tasklist_lock + ->siglock held and

    check_process_timers:

        t = tsk;
        do {
            ....

            do {
                t = next_thread(t);
            } while (unlikely(t->flags & PF_EXITING));
        } while (t != tsk);

    the outer loop will never stop.
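
    The non-terminating walk can be reproduced with a bounded user-space
    model. The thread_model structure and model_walk_terminates() are
    hypothetical; the step budget exists only so that the model returns
    instead of spinning forever.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of a thread on a circular thread-group list. */
struct thread_model {
	bool exiting;			/* models PF_EXITING */
	struct thread_model *next;	/* models next_thread() */
};

/*
 * Mirrors the nested loops quoted above. Returns true if the walk gets
 * back to tsk, or false once max_steps list steps have been taken.
 */
static bool model_walk_terminates(struct thread_model *tsk, int max_steps)
{
	struct thread_model *t = tsk;
	int steps = 0;

	do {
		/* per-thread timer work would happen here */
		do {
			t = t->next;
			if (++steps > max_steps)
				return false;
		} while (t->exiting);
	} while (t != tsk);

	return true;
}
```

    Because the inner loop skips every exiting thread, it also skips tsk
    once tsk itself is exiting, so the outer condition t != tsk never
    becomes false.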

    Actually, the window is bigger. Another process can attach the timer
    after ->it_xxx_expires was cleared (see the next commit) and the 'if
    (PF_EXITING)' check in arm_timer() is racy (see the one after that).

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

11 Jan, 2006

2 commits


07 Jan, 2006

1 commit

  • I've spent the past 3 days digging into a glibc testsuite failure in
    current CVS, specifically libc/rt/tst-cputimer1.c. The thr1 and thr2
    timers fire too early in the second pass of this test. The second
    pass is noteworthy because it makes use of intervals, whereas the
    first pass does not.

    All throughout the posix-cpu-timers.c code, the calculation of the
    process sched_time sum is implemented roughly as:

        unsigned long long sum;

        sum = tsk->signal->sched_time;
        t = tsk;
        do {
            sum += t->sched_time;
            t = next_thread(t);
        } while (t != tsk);

    In fact this is the exact scheme used by check_process_timers().

    In the case of check_process_timers(), current->sched_time has just
    been updated (via scheduler_tick(), which is invoked by
    update_process_times(), which subsequently invokes
    run_posix_cpu_timers()). So there is no special processing necessary
    wrt. that.

    In other contexts, we have to allow for the fact that tsk->sched_time
    might be a bit out of date if we are current. And the
    posix-cpu-timers.c code uses current_sched_time() to deal with that.

    Unfortunately it does so in an erroneous and inconsistent manner in
    one spot which is what results in the early timer firing.

    In cpu_clock_sample_group_locked(), it does this:

        cpu->sched = p->signal->sched_time;
        /* Add in each other live thread. */
        while ((t = next_thread(t)) != p) {
            cpu->sched += t->sched_time;
        }
        if (p->tgid == current->tgid) {
            /*
             * We're sampling ourselves, so include the
             * cycles not yet banked. We still omit
             * other threads running on other CPUs,
             * so the total can always be behind as
             * much as max(nthreads-1,ncpus) * (NSEC_PER_SEC/HZ).
             */
            cpu->sched += current_sched_time(current);
        } else {
            cpu->sched += p->sched_time;
        }

    The problem is the "p->tgid == current->tgid" test. If "p" is
    not current, and the tgids are the same, we will add the process
    t->sched_time twice into cpu->sched and omit "p"'s sched_time
    which is very very very wrong.

    posix-cpu-timers.c has a helper function, sched_ns(p) which takes care
    of this, so my fix is to use that here instead of this special tgid
    test.
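
    The double counting can be seen in a small user-space model of the two
    variants. All names are hypothetical: sched_thread models a thread with
    banked and not-yet-banked time, `banked` models signal->sched_time, and
    the buggy variant assumes the caller is in the same thread group as p.

```c
#include <assert.h>

struct sched_thread {
	unsigned long long sched_time;	/* banked time */
	unsigned long long pending;	/* not yet banked; running thread only */
	struct sched_thread *next;	/* circular thread-group list */
};

/* Models current_sched_time(): banked time plus the unbanked cycles. */
static unsigned long long model_current_sched_time(const struct sched_thread *cur)
{
	return cur->sched_time + cur->pending;
}

/* Models sched_ns(): use the fresher value only if p is the caller. */
static unsigned long long model_sched_ns(const struct sched_thread *p,
					 const struct sched_thread *current)
{
	return p == current ? model_current_sched_time(p) : p->sched_time;
}

/* Old logic, same-tgid case: always adds the caller's time at the end. */
static unsigned long long model_sample_buggy(const struct sched_thread *p,
					     const struct sched_thread *current,
					     unsigned long long banked)
{
	unsigned long long sum = banked;

	for (const struct sched_thread *t = p->next; t != p; t = t->next)
		sum += t->sched_time;
	return sum + model_current_sched_time(current);
}

/* Fixed logic: the last term is always p's own time. */
static unsigned long long model_sample_fixed(const struct sched_thread *p,
					     const struct sched_thread *current,
					     unsigned long long banked)
{
	unsigned long long sum = banked;

	for (const struct sched_thread *t = p->next; t != p; t = t->next)
		sum += t->sched_time;
	return sum + model_sched_ns(p, current);
}
```

    When the caller is a sub-thread of p, the buggy variant counts the
    caller twice and leaves out p; when the caller is p itself, both
    variants agree.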

    The fact that current can be one of the sub-threads of "p" points out
    that we could make things a little bit more accurate, perhaps by using
    sched_ns() on every thread we process in these loops. It also points
    out that we don't use the most accurate value for threads in the group
    actively running on other cpus (and this is mentioned in the comment).

    But that is a future enhancement, and this fix here definitely makes
    sense.

    Signed-off-by: David S. Miller
    Signed-off-by: Linus Torvalds

    David S. Miller
     

29 Nov, 2005

1 commit


07 Nov, 2005

1 commit


31 Oct, 2005

1 commit