10 Oct, 2008

1 commit


04 Oct, 2008

1 commit


29 Sep, 2008

1 commit

  • Impact: per CPU hrtimers can be migrated from a dead CPU

    The hrtimer code has no knowledge about per CPU timers, but we need to
    prevent the migration of such timers and warn when such a timer is
    active at migration time.

    Explicitly mark the timers as per CPU and use a more understandable
    mode descriptor for the interrupt-safe unlocked callback mode, which
    is used by hrtimer_sleeper and the scheduler code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
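A minimal sketch of the migration rule described above (illustrative names and structure, not the kernel's actual hrtimer code): a timer explicitly marked per-CPU must not be migrated off a dead CPU, and one that is still active at migration time is the WARN case.

```c
#include <assert.h>

enum hrtimer_state { HRTIMER_STATE_INACTIVE = 0, HRTIMER_STATE_ENQUEUED = 1 };

struct hrtimer {
    enum hrtimer_state state;
    int per_cpu;                /* explicitly marked per-CPU, as the fix does */
};

/* Returns 1 if the timer may be migrated, 0 if it must stay behind
 * (per-CPU), and -1 for the warn case: a per-CPU timer still active. */
static int check_migrate_hrtimer(const struct hrtimer *t)
{
    if (!t->per_cpu)
        return 1;
    return t->state == HRTIMER_STATE_ENQUEUED ? -1 : 0;
}
```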
     

23 Sep, 2008

5 commits

  • kernel/time/tick-common.c: In function ‘tick_setup_periodic’:
    kernel/time/tick-common.c:113: error: implicit declaration of function ‘tick_broadcast_oneshot_active’

    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • Impact: timer hang on CPU online observed on AMD C1E systems

    When a CPU is brought online, the broadcast machinery can already be
    in the oneshot state. Check this and set up the timer device of the
    new CPU in oneshot mode so the broadcast code can pick up the
    next_event value correctly.

    Another AMD C1E oddity, as we switch to broadcast immediately and
    not after the full bring up via the ACPI cpu idle code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: Possible hang on CPU online observed on AMD C1E machines.

    The broadcast setup code looks at the mode of the tick device to
    determine whether it needs to be shut down or setup. This is wrong
    when the broadcast mode is set to one shot already. This can happen
    when a CPU is brought online as it goes through the periodic setup
    first.

    The problem went unnoticed as sane systems do not call into that code
    before the switch to one shot for the clock event device happens.
    The AMD C1E idle routine switches over immediately and thereby shuts
    down the just setup device before the first interrupt happens.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: possible hang on CPU onlining in timer one shot mode.

    The tick_next_period variable is only used during boot on nohz/highres
    enabled systems, but for CPU onlining it needs to be maintained when
    the per cpu clock events device operates in one shot mode.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: rare hang which can be triggered on CPU online.

    tick_do_timer_cpu keeps track of the CPU which updates jiffies
    via do_timer. The value -1 is used to signal that currently no
    CPU is doing this. There are two cases where the variable can
    have this state:

    boot:
    necessary for systems where the boot cpu id can be != 0

    nohz long idle sleep:
    When the CPU which did the jiffies update last goes into
    a long idle sleep it drops the update jiffies duty so
    another CPU which is not idle can pick it up and keep
    jiffies going.

    Using the same value for both situations is wrong, as the CPU online
    code can see the -1 state when the timer of the newly onlined CPU is
    setup. The setup for a newly onlined CPU goes through periodic mode
    and can pick up the do_timer duty without being aware of the nohz /
    highres mode of the already running system.

    Use two separate states and make them constants to avoid
    magic-number confusion.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
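The two states can be sketched as follows. The constant names mirror the TICK_DO_TIMER_NONE/TICK_DO_TIMER_BOOT split this fix introduces, but the helper function is purely illustrative, not the kernel's code:

```c
#include <assert.h>

#define TICK_DO_TIMER_NONE  (-1)  /* nohz: duty dropped, any busy CPU may take it */
#define TICK_DO_TIMER_BOOT  (-2)  /* boot: only boot-time setup may claim it */

/* A CPU going through periodic setup while onlining may claim the
 * do_timer duty only in the boot state, never in the nohz state. */
static int may_claim_do_timer(int tick_do_timer_cpu, int onlining)
{
    if (tick_do_timer_cpu == TICK_DO_TIMER_BOOT)
        return 1;
    if (tick_do_timer_cpu == TICK_DO_TIMER_NONE && !onlining)
        return 1;
    return 0;
}
```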
     

17 Sep, 2008

1 commit

  • The device shutdown does not clean up the next_event variable of the
    clock event device. So when the device is reactivated, the possibly
    stale next_event value can prevent the device from being reprogrammed,
    as it claims to be waiting on an event already.

    This is the root cause of the resurfacing suspend/resume problem,
    where systems need a key press to come back to life.

    Fix this by setting next_event to KTIME_MAX when the device is shut
    down. Use a separate function for shutdown which takes care of that
    and only keep the direct set mode call in the broadcast code, where we
    can not touch the next_event value.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
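A simplified sketch of the separate shutdown helper described above (field names modeled on the kernel's clock_event_device, structure reduced for illustration): shutting the device down must also reset next_event so a later reactivation cannot see a stale expiry.

```c
#include <assert.h>
#include <stdint.h>

#define KTIME_MAX INT64_MAX   /* "no event pending" sentinel */

enum clock_event_mode { CLOCK_EVT_MODE_SHUTDOWN, CLOCK_EVT_MODE_ONESHOT };

struct clock_event_device {
    enum clock_event_mode mode;
    int64_t next_event;
};

static void clockevents_shutdown(struct clock_event_device *dev)
{
    dev->mode = CLOCK_EVT_MODE_SHUTDOWN;
    dev->next_event = KTIME_MAX;  /* the actual fix: leave no stale expiry */
}
```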
     

10 Sep, 2008

1 commit

  • The issue of the endless reprogramming loop due to a too small
    min_delta_ns was fixed with the previous updates of the clock events
    code, but we had no information about the spread of this problem. I
    added a WARN_ON to get automated information via kerneloops.org and to
    get some direct reports, which allowed me to analyse the affected
    machines.

    The WARN_ON has served its purpose and would be annoying for a release
    kernel. Remove it and just keep the information about the increase of
    the min_delta_ns value.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Sep, 2008

1 commit

  • …el/git/tip/linux-2.6-tip

    * 'timers-fixes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    clocksource, acpi_pm.c: check for monotonicity
    clocksource, acpi_pm.c: use proper read function also in errata mode
    ntp: fix calculation of the next jiffie to trigger RTC sync
    x86: HPET: read back compare register before reading counter
    x86: HPET fix moronic 32/64bit thinko
    clockevents: broadcast fixup possible waiters
    HPET: make minimum reprogramming delta useful
    clockevents: prevent endless loop lockup
    clockevents: prevent multiple init/shutdown
    clockevents: enforce reprogram in oneshot setup
    clockevents: prevent endless loop in periodic broadcast handler
    clockevents: prevent clockevent event_handler ending up handler_noop

    Linus Torvalds
     

06 Sep, 2008

3 commits

  • We have a bug in the calculation of the next jiffie to trigger the RTC
    synchronisation. The aim here is to run sync_cmos_clock() as close as
    possible to the middle of a second, which means we want this function
    to be called less than or equal to half a jiffie away from when
    now.tv_nsec equals 5e8 (500000000).

    If this is not the case for a given call to the function, for this purpose
    instead of updating the RTC we calculate the offset in nanoseconds to the
    next point in time where now.tv_nsec will be equal 5e8. The calculated
    offset is then converted to jiffies as these are the unit used by the
    timer.

    However, timespec_to_jiffies() used here uses a ceil()-type rounding
    mode, where the resulting value is rounded up. As a result the range of
    now.tv_nsec when the timer will trigger is from 5e8 to 5e8 + TICK_NSEC
    rather than the desired 5e8 - TICK_NSEC / 2 to 5e8 + TICK_NSEC / 2.

    As a result, if for example sync_cmos_clock() happens to be called at
    the time when now.tv_nsec is between 5e8 + TICK_NSEC / 2 and
    5e8 + TICK_NSEC, it will simply be rescheduled HZ jiffies later, falling
    in the same range of now.tv_nsec again. Similarly for cases offset by
    an integer multiple of TICK_NSEC.

    This change addresses the problem by subtracting TICK_NSEC / 2 from the
    nanosecond offset to the next point in time where now.tv_nsec will be
    equal 5e8, effectively shifting the following rounding in
    timespec_to_jiffies() so that it produces a rounded-to-nearest result.

    Signed-off-by: Maciej W. Rozycki
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Maciej W. Rozycki
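The arithmetic can be illustrated with a small model, assuming HZ=100 (TICK_NSEC = 10^7); the function names are local to this sketch, not the kernel's: converting the offset with ceil()-type rounding overshoots, while subtracting TICK_NSEC / 2 first makes the same conversion round to nearest.

```c
#include <assert.h>
#include <stdint.h>

#define NSEC_PER_SEC 1000000000LL
#define TICK_NSEC    10000000LL      /* assuming HZ = 100 */

/* ceil()-type conversion, like timespec_to_jiffies() for this purpose */
static int64_t ns_to_jiffies_ceil(int64_t ns)
{
    return (ns + TICK_NSEC - 1) / TICK_NSEC;
}

/* Delay in jiffies until the next now.tv_nsec == 5e8 point; subtracting
 * TICK_NSEC / 2 first shifts the ceil so it rounds to nearest. */
static int64_t sync_delay_jiffies(int64_t tv_nsec)
{
    int64_t offset = (NSEC_PER_SEC + NSEC_PER_SEC / 2 - tv_nsec) % NSEC_PER_SEC;
    return ns_to_jiffies_ceil(offset - TICK_NSEC / 2);
}
```

Called at tv_nsec = 0 the middle of the second is 5e8 ns away, i.e. 50 ticks at HZ=100; called exactly at tv_nsec = 5e8 the shifted conversion yields 0 instead of being bumped a full tick.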
     
  • Until the C1E patches arrived there were no users of periodic broadcast
    before switching to oneshot mode. Now we need to trigger a possible
    waiter for a periodic broadcast when switching to oneshot mode.
    Otherwise we can starve them forever.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • If HLT stops the TSC, we'll fail to account idle time, thereby inflating the
    actual process times. Fix this by re-calibrating the clock against GTOD when
    leaving nohz mode.

    Signed-off-by: Peter Zijlstra
    Tested-by: Avi Kivity
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

05 Sep, 2008

5 commits

  • The C1E/HPET bug reports on AMD X2/RS690 systems were tracked down to a
    too small value of the HPET minimum delta for programming an event.

    The clockevents code needs to enforce an interrupt event on the clock event
    device in some cases. The enforcement code was stupid and naive, as it just
    added the minimum delta to the current time and tried to reprogram the device.
    When the minimum delta is too small, then this loops forever.

    Add a sanity check. Allow reprogramming to fail 3 times, then print a
    warning and double the minimum delta value to make sure that this does
    not happen again.
    Use the same function for both tick-oneshot and tick-broadcast code.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
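A hypothetical model of that retry policy (names and the hw_min_ns stand-in for the hardware's real limit are this sketch's, not the kernel's): let programming fail three times, then double min_delta_ns and try again, so the loop is guaranteed to terminate.

```c
#include <assert.h>
#include <stdint.h>

struct model_clockevent {
    int64_t min_delta_ns;
    int64_t hw_min_ns;    /* programming fails below this threshold */
};

static int program_event(struct model_clockevent *ce, int64_t delta_ns)
{
    return delta_ns >= ce->hw_min_ns ? 0 : -1;   /* -1 stands in for -ETIME */
}

static void force_reprogram(struct model_clockevent *ce)
{
    int failures = 0;

    while (program_event(ce, ce->min_delta_ns) != 0) {
        if (++failures < 3)
            continue;
        failures = 0;
        ce->min_delta_ns *= 2;   /* the real code prints a warning here */
    }
}
```

Starting with min_delta_ns = 1000 against a device that needs 5000, the delta is doubled to 2000, 4000, then 8000, where programming finally succeeds.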
     
  • While chasing the C1E/HPET bugreports I went through the clock events
    code inch by inch and found that the broadcast device can be initialized
    and shut down multiple times. Multiple shutdowns are not critical, but a
    useless waste of time. Multiple initializations are simply broken. Another
    CPU might have the device in use already after the first initialization and
    the second init could just render it unusable again.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • In tick_oneshot_setup we program the device to the given next_event,
    but we do not check the return value. We need to make sure that the
    device is programmed with enforcement so the interrupt handler engine
    starts working. Split out the reprogramming function from
    tick_program_event() and call it with the device which was handed in
    to tick_setup_oneshot(). Set the force argument so the device fires
    an interrupt.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • The reprogramming of the periodic broadcast handler was broken when
    the first programming returned -ETIME. The clockevents code stores the
    new expiry value in the clock event device's next_event field only when
    the programming time has not yet elapsed. The loop in question
    calculates the new expiry value from the next_event value, so the
    expiry never increases.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
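An illustrative model of the fixed loop (stand-in names, simplified time handling): when programming fails because the expiry has already elapsed, the new expiry is computed from the current time rather than from the stale next_event, so each iteration makes progress.

```c
#include <assert.h>
#include <stdint.h>

#define PERIOD_NS 1000000LL

static int program_next(int64_t expiry, int64_t now)
{
    return expiry > now ? 0 : -1;   /* -1 stands in for -ETIME */
}

static int64_t broadcast_reprogram(int64_t next_event, int64_t now)
{
    while (program_next(next_event, now) != 0)
        next_event = now + PERIOD_NS;   /* based on "now", never on a stale value */
    return next_event;
}
```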
     
  • There is an ordering-related problem in the clockevents code, due to
    which clockevents_register_device() called after the tickless/highres
    switch will not work. The new clockevent device ends up with
    clockevents_handle_noop as its event handler, resulting in no timer
    activity.

    The problematic path seems to be

    * old device already has hrtimer_interrupt as the event_handler
    * new clockevent device registers with a higher rating
    * tick_check_new_device() is called
    * clockevents_exchange_device() gets called
    * old->event_handler is set to clockevents_handle_noop
    * tick_setup_device() is called for the new device
    * which sets new->event_handler using the old->event_handler which is noop.

    Change the ordering so that new device inherits the proper handler.

    This does not have any issue in the normal case, as most likely all
    clockevent devices are set up before the highres switch. But it can
    potentially affect some corner case where HPET force detection happens
    after the highres switch.
    This was a problem with HPET in MSI mode code that we have been experimenting
    with.

    Signed-off-by: Venkatesh Pallipadi
    Signed-off-by: Shaohua Li
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Venkatesh Pallipadi
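The ordering bug can be modeled in a few lines (illustrative types and stub handlers, not the kernel's exact code): the new device must inherit the live handler before the old device is parked on the noop handler, otherwise the noop is what gets inherited.

```c
#include <assert.h>

typedef void (*event_handler_t)(void);

static void clockevents_handle_noop(void) { }
static void hrtimer_interrupt_stub(void) { }  /* stands in for hrtimer_interrupt */

struct ce_dev { event_handler_t event_handler; };

static void exchange_and_setup(struct ce_dev *old_dev, struct ce_dev *new_dev)
{
    new_dev->event_handler = old_dev->event_handler;   /* hand over first... */
    old_dev->event_handler = clockevents_handle_noop;  /* ...then park the old one */
}
```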
     

21 Aug, 2008

1 commit

  • On a tickless system (CONFIG_NO_HZ=y and CONFIG_HIGH_RES_TIMERS=n), after
    I brought an offlined cpu online, I found this cpu's event handler was
    tick_handle_periodic, not tick_nohz_handler.

    After debugging, I found this bug was caused by the wrong tick mode: the
    tick mode is not changed to NOHZ_MODE_INACTIVE when the cpu goes offline.

    This patch fixes this bug.

    Signed-off-by: Miao Xie
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Miao Xie
     

31 Jul, 2008

1 commit

  • Found an interactivity problem on a quad core test system - simple
    CPU loops would occasionally delay the system in an unacceptable way.

    After much debugging with Peter Zijlstra it turned out that the problem
    is caused by the string of sched_clock() changes - they caused the CPU
    clock (which is unsigned for performance reasons) to jump backwards a
    bit, which confuses the scheduler arithmetic.

    So revert:

    # c300ba2: sched_clock: and multiplier for TSC to gtod drift
    # c0c8773: sched_clock: only update deltas with local reads.
    # af52a90: sched_clock: stop maximum check on NO HZ
    # f7cce27: sched_clock: widen the max and min time

    This solves the interactivity problems.

    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Acked-by: Mike Galbraith

    Ingo Molnar
     

26 Jul, 2008

1 commit


25 Jul, 2008

1 commit


24 Jul, 2008

2 commits

  • * 'cpus4096-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (31 commits)
    NR_CPUS: Replace NR_CPUS in speedstep-centrino.c
    cpumask: Provide a generic set of CPUMASK_ALLOC macros, FIXUP
    NR_CPUS: Replace NR_CPUS in cpufreq userspace routines
    NR_CPUS: Replace per_cpu(..., smp_processor_id()) with __get_cpu_var
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genapic_flat_64.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/genx2apic_uv_x.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/proc.c
    NR_CPUS: Replace NR_CPUS in arch/x86/kernel/cpu/mcheck/mce_64.c
    cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c, fix
    cpumask: Use optimized CPUMASK_ALLOC macros in the centrino_target
    cpumask: Provide a generic set of CPUMASK_ALLOC macros
    cpumask: Optimize cpumask_of_cpu in lib/smp_processor_id.c
    cpumask: Optimize cpumask_of_cpu in kernel/time/tick-common.c
    cpumask: Optimize cpumask_of_cpu in drivers/misc/sgi-xp/xpc_main.c
    cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/ldt.c
    cpumask: Optimize cpumask_of_cpu in arch/x86/kernel/io_apic_64.c
    cpumask: Replace cpumask_of_cpu with cpumask_of_cpu_ptr
    Revert "cpumask: introduce new APIs"
    cpumask: make for_each_cpu_mask a bit smaller
    net: Pass reference to cpumask variable in net/sunrpc/svc.c
    ...

    Fix up trivial conflicts in drivers/cpufreq/cpufreq.c manually

    Linus Torvalds
     
  • …ernel/git/tip/linux-2.6-tip

    * 'core/softlockup-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    softlockup: fix invalid proc_handler for softlockup_panic
    softlockup: fix watchdog task wakeup frequency
    softlockup: fix watchdog task wakeup frequency
    softlockup: show irqtrace
    softlockup: print a module list on being stuck
    softlockup: fix NMI hangs due to lock race - 2.6.26-rc regression
    softlockup: fix false positives on nohz if CPU is 100% idle for more than 60 seconds
    softlockup: fix softlockup_thresh fix
    softlockup: fix softlockup_thresh unaligned access and disable detection at runtime
    softlockup: allow panic on lockup

    Linus Torvalds
     

22 Jul, 2008

1 commit

  • This allows dynamically generating attributes and sharing show/store
    functions between attributes. Right now most attributes are generated
    by special macros and lots of duplicated code. With the attribute
    passed in, it's instead possible to attach some data to the attribute
    and then use that in shared low-level functions to do different things.

    I need this for the dynamically generated bank attributes in the x86
    machine check code, but it'll allow some further cleanups.

    I converted all users in tree to the new show/store prototype. It's a single
    huge patch to avoid unbisectable sections.

    Runtime tested: x86-32, x86-64
    Compiled only: ia64, powerpc
    Not compile tested/only grep converted: sh, arm, avr32

    Signed-off-by: Andi Kleen
    Signed-off-by: Greg Kroah-Hartman

    Andi Kleen
     

19 Jul, 2008

3 commits

  • * Optimize various places where using a pointer to the cpumask_of_cpu
    value reduces stack pressure.

    Signed-off-by: Mike Travis
    Signed-off-by: Ingo Molnar

    Mike Travis
     
  • Ingo Molnar
     
  • Jack Ren and Eric Miao tracked down the following long standing
    problem in the NOHZ code:

    scheduler switch to idle task
    enable interrupts

    Window starts here

    ----> interrupt happens (does not set NEED_RESCHED)
    irq_exit() stops the tick

    ----> interrupt happens (does set NEED_RESCHED)

    return from schedule()

    cpu_idle(): preempt_disable();

    Window ends here

    The interrupts can happen at any point inside the race window. The
    first interrupt stops the tick, the second one causes the scheduler to
    rerun and switch away from idle again and we end up with the tick
    disabled.

    The fact that it needs two interrupts, where the first one does not set
    NEED_RESCHED and the second one does, made the bug obscure and extremely
    hard to reproduce and analyse. Kudos to Jack and Eric.

    Solution: Limit the NOHZ functionality to the idle loop to make sure
    that we can not run into such a situation ever again.

    cpu_idle()
    {
        preempt_disable();

        while (1) {
            tick_nohz_stop_sched_tick(1);
            ...
        }
        ...
    }

    Debugged-by: Eric Miao
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
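The guard can be sketched as follows, modeled loosely on an "inidle" flag that only the idle loop sets (globals stand in for per-CPU state; this is a sketch of the idea, not the kernel's implementation): a call from irq_exit() outside the idle loop is ignored, so the tick can never be stopped outside it.

```c
#include <assert.h>

static int inidle;        /* set only by the idle loop */
static int tick_stopped;

static void tick_nohz_stop_sched_tick(int from_idle_loop)
{
    if (from_idle_loop)
        inidle = 1;
    else if (!inidle)
        return;           /* outside the idle loop: refuse to stop the tick */
    tick_stopped = 1;
}
```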
     

16 Jul, 2008

4 commits


15 Jul, 2008

1 commit

  • * 'sched/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (76 commits)
    sched_clock: and multiplier for TSC to gtod drift
    sched_clock: record TSC after gtod
    sched_clock: only update deltas with local reads.
    sched_clock: fix calculation of other CPU
    sched_clock: stop maximum check on NO HZ
    sched_clock: widen the max and min time
    sched_clock: record from last tick
    sched: fix accounting in task delay accounting & migration
    sched: add avg-overlap support to RT tasks
    sched: terminate newidle balancing once at least one task has moved over
    sched: fix warning
    sched: build fix
    sched: sched_clock_cpu() based cpu_clock(), lockdep fix
    sched: export cpu_clock
    sched: make sched_{rt,fair}.c ifdefs more readable
    sched: bias effective_load() error towards failing wake_affine().
    sched: incremental effective_load()
    sched: correct wakeup weight calculations
    sched: fix mult overflow
    sched: update shares on wakeup
    ...

    Linus Torvalds
     

11 Jul, 2008

2 commits

  • Working with ftrace I would get large jumps of 11 milliseconds or more
    with the clock tracer. This killed the latency timings of ftrace and
    also caused the irqsoff self tests to fail.

    What was happening is that with NO_HZ the idle code would stop the jiffy
    counter, and before the jiffy counter was updated sched_clock would have
    a bad jiffies delta to use in the gtod-based maximum check.

    The jiffies would stop and the last sched_tick would record the last
    gtod. On wakeup, the sched clock update would compare the gtod + delta
    jiffies (which would be zero) against the TSC. The TSC would have
    correctly (with a stable TSC) moved forward several jiffies. But because
    the jiffies had not been updated yet, the clock was prevented from
    moving forward because it appeared that the TSC had jumped too far ahead.

    The clock would then virtually stop, until the jiffies are updated. Then
    the next sched clock update would see that the clock was very much behind
    since the delta jiffies is now correct. This would then jump the clock
    forward by several jiffies.

    This caused ftrace to report several milliseconds of interrupts off
    latency at every resume from NO_HZ idle.

    This patch adds hooks into the nohz code to disable the checking of the
    maximum clock update when nohz is in effect. It resumes the max check
    when nohz has updated the jiffies again.

    Signed-off-by: Steven Rostedt
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Steven Rostedt
     
  • In case a cpu goes idle while softirqs are pending, only an error message
    is printed to the console. It may take a very long time until the pending
    softirqs are finally executed. Worst case would be a hanging system.

    With this patch the timer tick just continues and the softirqs are
    executed after the next interrupt. Still a delay, but better than a
    hanging system.

    Currently we have at least two device drivers on s390 which under certain
    circumstances schedule a tasklet from process context. This is a reason
    why we can end up with pending softirqs when going idle. Fixing these
    drivers seems to be non-trivial.
    However there is no question that the drivers should be fixed.
    This patch shouldn't be considered as a bug fix. It just is intended to
    keep a system running even if device drivers are buggy.

    Signed-off-by: Heiko Carstens
    Cc: Jan Glauber
    Cc: Stefan Weinhuber
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar

    Heiko Carstens
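The decision reduces to a small predicate on idle entry (a sketch with an assumed name, not the kernel's function): if any softirq is pending, keep the periodic tick running instead of stopping it, and the pending work runs after the next interrupt.

```c
#include <assert.h>

/* Returns 1 if the tick may be stopped on idle entry, 0 if it must keep
 * running because softirqs are still pending (the real code also prints
 * a ratelimited warning in that case). */
static int can_stop_idle_tick(unsigned int pending_softirqs)
{
    if (pending_softirqs)
        return 0;
    return 1;
}
```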
     

08 Jul, 2008

1 commit

  • C1E on AMD machines is like C3 but without control from the OS. Up to
    now we disabled the local apic timer for those machines as it stops
    when the CPU goes into C1E. This excludes those machines from high
    resolution timers / dynamic ticks, which hurts especially X2 based
    laptops.

    The current boot time C1E detection has another, more serious flaw
    as well: some BIOSes do not enable C1E until the ACPI processor module
    is loaded. This causes systems to stop working after that point.

    To work nicely with C1E enabled machines we use a separate idle
    function, which checks on idle entry whether C1E was enabled in the
    Interrupt Pending Message MSR. This allows us to do timer broadcasting
    for C1E and covers the late enablement of C1E as well.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

26 Jun, 2008

1 commit


30 May, 2008

2 commits