18 Nov, 2009

1 commit

  • The incr_error and error fields of struct cpu_itimer are used when
    calculating the next timer tick in check_cpu_itimers() and must not be
    modified without tsk->sighand->siglock held.
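
    A minimal sketch of the locking rule this enforces, assuming a
    struct cpu_itimer *it obtained from the task's signal_struct
    (illustrative only, not the actual patch):

    spin_lock_irq(&tsk->sighand->siglock);
    it->error = 0;          /* consumed by check_cpu_itimers() */
    it->incr_error = 0;
    spin_unlock_irq(&tsk->sighand->siglock);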

    Signed-off-by: Stanislaw Gruszka
    LKML-Reference:
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Stanislaw Gruszka
     

29 Aug, 2009

1 commit

  • Add tracepoints for all itimer variants: ITIMER_REAL, ITIMER_VIRTUAL
    and ITIMER_PROF.

    [ tglx: Fixed comments and made the output more readable, parseable
    and consistent. Replaced pid_vnr by pid_nr because the hrtimer
    callback can happen in any namespace ]
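
    As a rough illustration, a tracepoint of the kind this patch adds
    could be defined as sketched below; the event and field names here
    are assumptions, not the patch verbatim. Note the pid_nr() use
    called out above:

    TRACE_EVENT(itimer_expire,
            TP_PROTO(int which, struct pid *pid),
            TP_ARGS(which, pid),
            TP_STRUCT__entry(
                    __field(int,   which)
                    __field(pid_t, pid)
            ),
            TP_fast_assign(
                    __entry->which = which;
                    /* pid_nr(), not pid_vnr(): the hrtimer callback
                     * can fire in any namespace */
                    __entry->pid   = pid_nr(pid);
            ),
            TP_printk("which=%d pid=%d", __entry->which, __entry->pid)
    );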

    Signed-off-by: Xiao Guangrong
    Cc: Steven Rostedt
    Cc: Frederic Weisbecker
    Cc: Mathieu Desnoyers
    Cc: Anton Blanchard
    Cc: Peter Zijlstra
    Cc: KOSAKI Motohiro
    Cc: Zhaolei
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Xiao Guangrong
     

03 Aug, 2009

3 commits

  • For powerpc with CONFIG_VIRT_CPU_ACCOUNTING,
    jiffies_to_cputime(1) is not a compile-time constant and the
    run-time calculation is quite expensive. To optimize, we use a
    precomputed value. For all other architectures it is a
    preprocessor definition.
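
    A hedged sketch of the approach; the cputime_one_jiffy and
    setup_cputime_one_jiffy names are illustrative and may differ from
    the patch:

    #ifdef CONFIG_VIRT_CPU_ACCOUNTING
    /* expensive conversion: do it once at boot */
    cputime_t cputime_one_jiffy;

    static void __init setup_cputime_one_jiffy(void)
    {
            cputime_one_jiffy = jiffies_to_cputime(1);
    }
    #else
    /* elsewhere the conversion is a compile-time constant */
    #define cputime_one_jiffy jiffies_to_cputime(1)
    #endif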

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • Measure the interval error of the ITIMER_PROF and ITIMER_VIRT
    timers between real ticks and the interval requested by the user.
    Take it into account when scheduling the next tick.

    This patch introduces the possibility that the time between two
    consecutive ticks is smaller than the requested interval; it
    preserves, however, the invariant that the nth tick is generated
    no earlier than n*interval time, counting from the beginning of
    periodic signal generation.
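
    A simplified sketch of the rearm-time error accounting described
    above (identifiers modeled on the itimer code of that era; treat
    the details as assumptions):

    /* rearm: accumulate the per-interval rounding error and, once it
     * exceeds one tick, expire one jiffy earlier to pay the debt back */
    it->expires = cputime_add(it->expires, it->incr);
    it->error += it->incr_error;
    if (it->error >= onecputick) {
            it->expires = cputime_sub(it->expires, jiffies_to_cputime(1));
            it->error -= onecputick;
    }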

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     
  • Both CPU itimers have the same data flow in a few places; this
    patch unifies the code related to the VIRT and PROF itimers.

    Signed-off-by: Stanislaw Gruszka
    Acked-by: Peter Zijlstra
    Acked-by: Thomas Gleixner
    Cc: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Paul Mackerras
    Cc: Benjamin Herrenschmidt
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Stanislaw Gruszka
     

05 Feb, 2009

1 commit

  • Change the process wide cpu timers/clocks so that we:

    1) don't mess up the kernel with too many threads,
    2) don't have a per-cpu allocation for each process,
    3) have no impact when not used.

    In order to accomplish this we're going to split it into two parts:

    - clocks; which can take all the time they want since they run
    from user context -- ie. sys_clock_gettime(CLOCK_PROCESS_CPUTIME_ID)

    - timers; which need constant time sampling but since they're
    explicitly used, the user can pay the overhead.

    The clock readout will go back to a full sum of the thread group, while the
    timers will run off a global 'clock' that only runs when needed, so only
    programs that make use of the facility pay the price.
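
    A simplified sketch of the "clocks" side, the full thread-group sum
    done from user context (locking and RCU omitted; illustrative
    only):

    struct task_struct *t = tsk;

    times->utime = cputime_zero;
    times->stime = cputime_zero;
    times->sum_exec_runtime = 0;
    do {
            times->utime = cputime_add(times->utime, t->utime);
            times->stime = cputime_add(times->stime, t->stime);
            times->sum_exec_runtime += t->se.sum_exec_runtime;
    } while_each_thread(tsk, t);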

    Signed-off-by: Peter Zijlstra
    Reviewed-by: Ingo Molnar
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Sep, 2008

1 commit

  • Overview

    This patch reworks the handling of POSIX CPU timers, including the
    ITIMER_PROF, ITIMER_VIRT timers and rlimit handling. It was put together
    with the help of Roland McGrath, the owner and original writer of this code.

    The problem we ran into, and the reason for this rework, has to do with using
    a profiling timer in a process with a large number of threads. It appears
    that the performance of the old implementation of run_posix_cpu_timers() was
    at least O(n*3) (where "n" is the number of threads in a process) or worse.
    Everything is fine with an increasing number of threads until the time taken
    for that routine to run becomes the same as or greater than the tick time, at
    which point things degrade rather quickly.

    This patch fixes bug 9906, "Weird hang with NPTL and SIGPROF."

    Code Changes

    This rework corrects the implementation of run_posix_cpu_timers() to make it
    run in constant time for a particular machine. (Performance may vary between
    one machine and another depending upon whether the kernel is built as single-
    or multiprocessor and, in the latter case, depending upon the number of
    running processors.) To do this, at each tick we now update fields in
    signal_struct as well as task_struct. The run_posix_cpu_timers() function
    uses those fields to make its decisions.

    We define a new structure, "task_cputime," to contain user, system and
    scheduler times and use these in appropriate places:

    struct task_cputime {
            cputime_t utime;
            cputime_t stime;
            unsigned long long sum_exec_runtime;
    };

    This is included in the structure "thread_group_cputime," which is a new
    substructure of signal_struct and which varies for uniprocessor versus
    multiprocessor kernels. For uniprocessor kernels, it uses "task_cputime" as
    a simple substructure, while for multiprocessor kernels it is a pointer:

    /* UP */
    struct thread_group_cputime {
            struct task_cputime totals;
    };

    /* SMP */
    struct thread_group_cputime {
            struct task_cputime *totals;
    };

    We also add a new task_cputime substructure directly to signal_struct, to
    cache the earliest expiration of process-wide timers, and task_cputime also
    replaces the it_*_expires fields of task_struct (used for earliest expiration
    of thread timers). The "thread_group_cputime" structure contains process-wide
    timers that are updated via account_user_time() and friends. In the non-SMP
    case the structure is a simple aggregator; unfortunately in the SMP case that
    simplicity was not achievable due to cache-line contention between CPUs (in
    one measured case performance was actually _worse_ on a 16-cpu system than
    the same test on a 4-cpu system, due to this contention). For SMP, the
    thread_group_cputime counters are maintained as a per-cpu structure allocated
    using alloc_percpu(). The timer functions update only the timer field in
    the structure corresponding to the running CPU, obtained using per_cpu_ptr().

    We define a set of inline functions in sched.h that we use to maintain the
    thread_group_cputime structure and hide the differences between UP and SMP
    implementations from the rest of the kernel. The thread_group_cputime_init()
    function initializes the thread_group_cputime structure for the given task.
    The thread_group_cputime_alloc() is a no-op for UP; for SMP it calls the
    out-of-line function thread_group_cputime_alloc_smp() to allocate and fill
    in the per-cpu structures and fields. The thread_group_cputime_free()
    function, also a no-op for UP, in SMP frees the per-cpu structures. The
    thread_group_cputime_clone_thread() function (also a UP no-op) for SMP calls
    thread_group_cputime_alloc() if the per-cpu structures haven't yet been
    allocated. The thread_group_cputime() function fills the task_cputime
    structure it is passed with the contents of the thread_group_cputime fields;
    in UP it's that simple, but in SMP it must also safely check that tsk->signal
    is non-NULL (if it is not, it just uses the appropriate fields of task_struct)
    and, if so, sums the per-cpu values for each online CPU. Finally, the three
    functions account_group_user_time(), account_group_system_time() and
    account_group_exec_runtime() are used by timer functions to update the
    respective fields of the thread_group_cputime structure.
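
    For example, the SMP accounting path might look roughly like the
    sketch below (simplified: the NULL checks and exact field names in
    the patch may differ):

    static inline void account_group_user_time(struct task_struct *tsk,
                                               cputime_t cputime)
    {
            struct task_cputime *times;

            /* update only this CPU's slot; readers sum all CPUs */
            times = per_cpu_ptr(tsk->signal->cputime.totals, get_cpu());
            times->utime = cputime_add(times->utime, cputime);
            put_cpu();
    }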

    Non-SMP operation is trivial and will not be mentioned further.

    The per-cpu structure is always allocated when a task creates its first new
    thread, via a call to thread_group_cputime_clone_thread() from copy_signal().
    It is freed at process exit via a call to thread_group_cputime_free() from
    cleanup_signal().

    All functions that formerly summed utime/stime/sum_sched_runtime values
    from all threads in the thread group now use thread_group_cputime() to
    snapshot the values in the thread_group_cputime structure or the values in
    the task structure itself if the per-cpu structure hasn't been allocated.

    Finally, the code in kernel/posix-cpu-timers.c has changed quite a bit.
    The run_posix_cpu_timers() function has been split into a fast path and a
    slow path; the former safely checks whether there are any expired thread
    timers and, if not, just returns, while the slow path does the heavy lifting.
    With the dedicated thread group fields, timers are no longer "rebalanced" and
    the process_timer_rebalance() function and related code have gone away. All
    summing loops are gone and all code that used them now uses the
    thread_group_cputime() inline. When process-wide timers are set, the new
    task_cputime structure in signal_struct is used to cache the earliest
    expiration; this is checked in the fast path.
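
    The fast-path check itself reduces to comparing a sampled
    task_cputime against the cached expirations, along these lines
    (sketch; cputime_t arithmetic simplified to plain comparisons):

    static int task_cputime_expired(const struct task_cputime *sample,
                                    const struct task_cputime *expires)
    {
            if (expires->utime && sample->utime >= expires->utime)
                    return 1;
            if (expires->stime &&
                sample->utime + sample->stime >= expires->stime)
                    return 1;
            if (expires->sum_exec_runtime != 0 &&
                sample->sum_exec_runtime >= expires->sum_exec_runtime)
                    return 1;
            return 0;
    }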

    Performance

    The fix appears not to add significant overhead to existing operations. It
    generally performs the same as the current code except in two cases, one in
    which it performs slightly worse (Case 5 below) and one in which it performs
    very significantly better (Case 2 below). Overall it's a wash except in those
    two cases.

    I've since done somewhat more involved testing on a dual-core Opteron system.

    Case 1: With no itimer running, for a test with 100,000 threads, the fixed
    kernel took 1428.5 seconds, 513 seconds more than the unfixed system,
    all of which was spent in the system. There were twice as many
    voluntary context switches with the fix as without it.

    Case 2: With an itimer running at .01 second ticks and 4000 threads (the most
    an unmodified kernel can handle), the fixed kernel ran the test in
    eight percent of the time (5.8 seconds as opposed to 70 seconds) and
    had better tick accuracy (.012 seconds per tick as opposed to .023
    seconds per tick).

    Case 3: A 4000-thread test with an initial timer tick of .01 second and an
    interval of 10,000 seconds (i.e. a timer that ticks only once) had
    very nearly the same performance in both cases: 6.3 seconds elapsed
    for the fixed kernel versus 5.5 seconds for the unfixed kernel.

    With fewer threads (eight in these tests), the Case 1 test ran in essentially
    the same time on both the modified and unmodified kernels (5.2 seconds versus
    5.8 seconds). The Case 2 test ran in about the same time as well, 5.9 seconds
    versus 5.4 seconds but again with much better tick accuracy, .013 seconds per
    tick versus .025 seconds per tick for the unmodified kernel.

    Since the fix affected the rlimit code, I also tested soft and hard CPU limits.

    Case 4: With a hard CPU limit of 20 seconds and eight threads (and an itimer
    running), the modified kernel was very slightly favored in that while
    it killed the process in 19.997 seconds of CPU time (5.002 seconds of
    wall time), only .003 seconds of that was system time, the rest was
    user time. The unmodified kernel killed the process in 20.001 seconds
    of CPU (5.014 seconds of wall time) of which .016 seconds was system
    time. Really, though, the results were too close to call. The results
    were essentially the same with no itimer running.

    Case 5: With a soft limit of 20 seconds and a hard limit of 2000 seconds
    (where the hard limit would never be reached) and an itimer running,
    the modified kernel exhibited worse tick accuracy than the unmodified
    kernel: .050 seconds/tick versus .028 seconds/tick. Otherwise,
    performance was almost indistinguishable. With no itimer running this
    test exhibited virtually identical behavior and times in both cases.

    In times past I did some limited performance testing; those results are below.

    On a four-cpu Opteron system without this fix, a sixteen-thread test executed
    in 3569.991 seconds, of which user was 3568.435s and system was 1.556s. On
    the same system with the fix, user and elapsed time were about the same, but
    system time dropped to 0.007 seconds. Performance with eight, four and one
    thread were comparable. Interestingly, the timer ticks with the fix seemed
    more accurate: The sixteen-thread test with the fix received 149543 ticks
    for 0.024 seconds per tick, while the same test without the fix received 58720
    ticks for 0.061 seconds per tick. Both cases were configured for an interval of
    0.01 seconds. Again, the other tests were comparable. Each thread in this
    test computed the primes up to 25,000,000.

    I also did a test with a large number of threads, 100,000 threads, which is
    impossible without the fix. In this case each thread computed the primes only
    up to 10,000 (to make the runtime manageable). System time dominated, at
    1546.968 seconds out of a total 2176.906 seconds (giving a user time of
    629.938s). It received 147651 ticks for 0.015 seconds per tick, still quite
    accurate. There is obviously no comparable test without the fix.

    Signed-off-by: Frank Mayhar
    Cc: Roland McGrath
    Cc: Alexey Dobriyan
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Frank Mayhar
     

09 Feb, 2008

1 commit

  • signal_struct->tsk points to the ->group_leader and thus we have the nasty
    code in de_thread() which has to change it and restart ->real_timer if the
    leader is changed.

    Use "struct pid *leader_pid" instead. This also allows us to kill now
    unneeded send_group_sig_info().
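
    With the stable pid reference, the ITIMER_REAL expiry path can
    signal the group without touching a task pointer at all, roughly
    (illustrative sketch):

    kill_pid_info(SIGALRM, SEND_SIG_PRIV, sig->leader_pid);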

    Signed-off-by: Oleg Nesterov
    Acked-by: "Eric W. Biederman"
    Cc: Davide Libenzi
    Cc: Pavel Emelyanov
    Acked-by: Roland McGrath
    Acked-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

17 Feb, 2007

3 commits

  • Fix potential setitimer DoS with high-res timers by pushing itimer rearm
    processing to process context.
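
    In rough terms the rearm moves from the hrtimer callback into the
    signal delivery path: the callback no longer restarts itself, and
    the timer is forwarded and restarted only when SIGALRM is actually
    dequeued, e.g. (sketch, not the patch verbatim):

    /* in the signal dequeue path, process context */
    struct hrtimer *tmr = &tsk->signal->real_timer;

    if (!hrtimer_active(tmr) && tsk->signal->it_real_incr.tv64 != 0) {
            hrtimer_forward(tmr, tmr->base->get_time(),
                            tsk->signal->it_real_incr);
            hrtimer_restart(tmr);
    }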

    [Fixes from: Ingo Molnar ]
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: john stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • Implement high resolution timers on top of the hrtimers infrastructure and the
    clockevents / tick-management framework. This provides accurate timers for
    all hrtimer subsystem users.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: john stultz
    Cc: Roman Zippel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • - hrtimers did not use the hrtimer_restart enum and relied on the implicit
    int representation. Fix the prototypes and the functions using the enums.
    - Use separate namespaces for the enumerations
    - Convert the hrtimer_restart macro to an inline function
    - Add comments

    No functional changes.
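
    For reference, the enum in question (shown for context; comments
    paraphrased):

    enum hrtimer_restart {
            HRTIMER_NORESTART,      /* timer is not restarted */
            HRTIMER_RESTART,        /* timer must be restarted */
    };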

    [akpm@osdl.org: fix input driver]
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar
    Cc: john stultz
    Cc: Roman Zippel
    Cc: Dmitry Torokhov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

27 Mar, 2006

2 commits

  • The nanosleep cleanup allows us to remove the data field of hrtimer. The
    callback function can use container_of() to get its own data. Since the
    hrtimer structure is embedded in other structures anyway, this adds no
    overhead.
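
    The pattern looks like this sketch, using the ITIMER_REAL timer for
    illustration (callbacks returned int in this era; details are
    assumptions):

    int it_real_fn(struct hrtimer *timer)
    {
            struct signal_struct *sig =
                    container_of(timer, struct signal_struct, real_timer);

            /* sig is the enclosing structure; no ->data field needed */
            send_group_sig_info(SIGALRM, SEND_SIG_PRIV, sig->tsk);

            return HRTIMER_NORESTART;
    }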

    Signed-off-by: Roman Zippel
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     
  • Pass the current time to hrtimer_forward(). This allows the softirq time
    in the timer base to be used when the forward function is called from the
    timer callback. Other places pass the current time with a call to
    timer->base->get_time().
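
    A callback that rearms itself can then avoid a fresh clock read,
    along these lines (sketch; the softirq_time field name is an
    era-specific assumption):

    static ktime_t my_interval;     /* hypothetical period */

    static int my_timer_fn(struct hrtimer *timer)
    {
            /* cached softirq time: no clock-source access */
            ktime_t now = timer->base->softirq_time;

            hrtimer_forward(timer, now, my_interval);
            return HRTIMER_RESTART;
    }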

    Signed-off-by: Roman Zippel
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roman Zippel
     

26 Mar, 2006

2 commits

  • According to the specification the timevals must be validated and an
    error code of -EINVAL returned in case the timevals are not in canonical
    form. This check was never done in Linux.

    The pre 2.6.16 code converted invalid timevals silently. Negative timeouts
    were converted by the timeval_to_jiffies conversion to the maximum timeout.

    hrtimers and the ktime_t operations expect timevals in canonical form.
    Otherwise random results might happen on 32 bit machines due to the
    optimized ktime_add/sub operations. Negative timeouts are treated as
    already expired. This might break applications which worked on pre-2.6.16
    kernels.

    To prevent random behaviour and API breakage, the timevals are checked and
    invalid timevals sanitized in a similar way as the pre 2.6.16 code did.

    Invalid timevals are reported with a per boot limited number of kernel
    messages so applications which use this misfeature can be corrected.

    After a grace period of one year the sanitizing should be replaced by a
    correct validation check. This is also documented in
    Documentation/feature-removal-schedule.txt

    The validation and sanitizing is done inside do_setitimer so all callers
    (sys_setitimer, compat_sys_setitimer, osf_setitimer) are caught.
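
    The canonical-form check reduces to something like this sketch (the
    actual helper in the patch may differ in detail):

    static int timeval_valid(const struct timeval *tv)
    {
            /* seconds must be non-negative */
            if (tv->tv_sec < 0)
                    return 0;
            /* microseconds must be in [0, USEC_PER_SEC) */
            if (tv->tv_usec < 0 || tv->tv_usec >= USEC_PER_SEC)
                    return 0;
            return 1;
    }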

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     
  • alarm() calls the kernel with an unsigned int timeout in seconds. The
    value is stored in the tv_sec field of a struct timeval to set up the
    itimer. The tv_sec field of struct timeval is of type long, which causes
    the tv_sec value to be negative on 32 bit machines if seconds > INT_MAX.

    Before the hrtimer merge (pre 2.6.16) such a negative value was converted
    to the maximum jiffies timeout by the timeval_to_jiffies conversion. It's
    not clear whether this was intended or just happened to be done by the
    timeval_to_jiffies code.

    hrtimers expect a timeval in canonical form and treat a negative timeout as
    already expired. This breaks the legitimate usage of alarm() with a
    timeout value > INT_MAX seconds.

    For 32 bit machines it is therefore necessary to limit the internal seconds
    value to avoid API breakage. Instead of doing this in all implementations
    of sys_alarm, the duplicated sys_alarm code is moved into a common function
    in itimer.c
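
    The common helper plausibly looks like the sketch below (modeled on
    the description above; treat the details as assumptions):

    unsigned int alarm_setitimer(unsigned int seconds)
    {
            struct itimerval it_new, it_old;

    #if BITS_PER_LONG == 32
            /* keep tv_sec (a long) from going negative */
            if (seconds > INT_MAX)
                    seconds = INT_MAX;
    #endif
            it_new.it_value.tv_sec = seconds;
            it_new.it_value.tv_usec = 0;
            it_new.it_interval.tv_sec = 0;
            it_new.it_interval.tv_usec = 0;

            do_setitimer(ITIMER_REAL, &it_new, &it_old);

            /* round up any leftover microseconds */
            return it_old.it_value.tv_sec +
                   (it_old.it_value.tv_usec ? 1 : 0);
    }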

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Thomas Gleixner
     

28 Jul, 2005

1 commit

  • Fix the recent off-by-one fix in the itimer code:

    1. The repeating timer is figured using the requested time
    (not +1, as we know where we are in the jiffy).

    2. The tests for "interval too large" are left to the timeval-to-jiffies code.

    Signed-off-by: George Anzinger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Anzinger
     

29 Jun, 2005

1 commit

  • As Steven Rostedt pointed out, there are 2 problems with ITIMER_REAL
    timers.

    1. do_setitimer() does not call del_timer_sync() in the case
    when the timer is not pending (it_real_value() returns 0).
    This is wrong, the timer may still be running, and it can
    rearm itself.

    2. It calls del_timer_sync() with tsk->sighand->siglock held.
    This is deadlockable, because timer's handler needs this
    lock too.
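
    The shape of the fix is to use try_to_del_timer_sync(), which fails
    instead of blocking when the handler is running, roughly (sketch,
    not the patch verbatim):

    again:
            spin_lock_irq(&tsk->sighand->siglock);
            /* the handler takes siglock too, so we must not wait for
             * it with the lock held */
            if (try_to_del_timer_sync(&tsk->signal->real_timer) < 0) {
                    spin_unlock_irq(&tsk->sighand->siglock);
                    goto again;
            }
            /* timer is now stopped: safe to update itimer state */
            spin_unlock_irq(&tsk->sighand->siglock);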

    Signed-off-by: Oleg Nesterov
    Acked-by: Steven Rostedt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     

06 May, 2005

1 commit

  • It seems that the code responsible for this is in kernel/itimer.c:126:

    p->signal->real_timer.expires = jiffies + interval;
    add_timer(&p->signal->real_timer);

    If you request an interval of, let's say, 900 usecs, the interval given by
    timeval_to_jiffies will be 1.

    If you request this when we are half-way between two timer ticks, the
    actual interval will be only 400 usecs.

    If we want to guarantee that we never ever give intervals less than
    requested, the simple solution would be to change that to:

    p->signal->real_timer.expires = jiffies + interval + 1;

    This, however, will produce pathological cases: on an idle system,
    requesting 1 ms timeouts will systematically give 2 ms timeouts,
    whereas currently it simply gives a few usecs less than 1 ms.

    The complex (and more computationally expensive) solution would be to
    check the gettimeofday time, and compute the correct number of jiffies.
    This way, if we request a 300 usecs timer 200 usecs inside the timer
    tick, we can wait just one tick, but not if we are 800 usecs inside the
    tick. This would also mean that we would have to lock preemption during
    these computations to avoid races, etc.

    I've searched the archives but couldn't find this particular issue being
    discussed before.

    Attached is a patch to do the simple solution, in case anybody thinks
    that it should be used.

    Signed-Off-By: Paulo Marques
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Paulo Marques
     

17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds