10 Jul, 2009

2 commits

  • The timer migration expiry check should prevent the migration of a
    timer to another CPU when the timer expires before the next event is
    scheduled on the other CPU. Migrating the timer might delay it because
    we cannot reprogram the clock event device on the other CPU. But the
    code implementing that check has two flaws:

    - for !HIGHRES the check compares the expiry value with the clock
    event device's expiry value, which is wrong for CLOCK_REALTIME based
    timers.

    - the check is racy. It holds the hrtimer base lock of the target CPU,
    but the clock event device expiry value can be modified
    nevertheless, e.g. by a timer interrupt firing.

    The !HIGHRES case is easy to fix as we can enqueue the timer on the
    cpu which was selected by the load balancer. It runs the idle
    balancing code once per jiffy anyway. So the maximum delay for the
    timer is the same as when we keep the tick on the current cpu going.

    In the HIGHRES case we can get the next expiry value from the hrtimer
    cpu_base of the target CPU and serialize the update with the cpu_base
    lock. This moves the lock section in hrtimer_interrupt() so we can set
    next_event to KTIME_MAX while we are handling the expired timers and
    set it to the next expiry value after we handled the timers under the
    base lock. While the expired timers are processed, timer migration is
    blocked because the expiry time of the timer is always before
    KTIME_MAX.
    Thomas Gleixner
     
  • The timer migration code needs to check whether the expiry time of the
    timer is before the programmed clock event expiry time when the timer
    is enqueued on another CPU, because we cannot reprogram the timer
    device on the other CPU. The current logic checks the expiry time even
    if we enqueue on the current CPU when nohz_get_load_balancer() returns
    the current CPU. This might lead to an endless loop in the expiry
    check code when the expiry time of the timer is before the currently
    programmed next event.

    Check whether nohz_get_load_balancer() returns the current CPU and
    skip the expiry check if this is the case.

    The bug was triggered from the networking code. The patch fixes the
    regression http://bugzilla.kernel.org/show_bug.cgi?id=13738
    (Soft-Lockup/Race in networking in 2.6.31-rc1+195)

    Cc: Arun Bharadwaj
    Tested-by: Andres Freund
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

18 Jun, 2009

1 commit

  • * 'linux-next' of git://git.infradead.org/ubifs-2.6:
    UBIFS: start using hrtimers
    hrtimer: export ktime_add_safe
    UBIFS: do not forget to register BDI device
    UBIFS: allow sync option in rootflags
    UBIFS: remove dead code
    UBIFS: use anonymous device
    UBIFS: return proper error code if the compr is not present
    UBIFS: return error if link and unlink race
    UBIFS: reset no_space flag after inode deletion

    Linus Torvalds
     

08 Jun, 2009

1 commit

  • We want to use hrtimers in UBIFS (for write-buffer write-back timer).
    We need 'hrtimer_set_expires_range_ns()', which is an inline
    function that uses 'ktime_add_safe()'.

    Signed-off-by: Artem Bityutskiy
    Acked-by: Ingo Molnar

    Artem Bityutskiy
     

13 May, 2009

2 commits

  • * Arun R Bharadwaj [2009-04-16 12:11:36]:

    This patch migrates all non-pinned timers and hrtimers from all idle
    CPUs to the current idle load balancer. Timers firing on busy CPUs
    are not migrated.

    While migrating hrtimers, care must be taken to check whether
    migrating an hrtimer would add latency. So we compare the expiry of
    the hrtimer with the next timer interrupt on the target CPU and
    migrate the hrtimer only if it expires *after* the next interrupt on
    the target CPU. A clockevents_get_next_event() helper function was
    added to return the next_event of the target CPU's
    clock_event_device.

    [ tglx: cleanups and simplifications ]

    Signed-off-by: Arun R Bharadwaj
    Signed-off-by: Thomas Gleixner

    Arun R Bharadwaj
     
  • * Arun R Bharadwaj [2009-04-16 12:11:36]:

    This patch creates a new framework for identifying CPU-pinned timers
    and hrtimers.

    This framework is needed because pinned timers are expected to fire
    on the same CPU on which they are queued. So it is essential to
    identify them and not migrate them, in case there are any.

    For regular timers, the existing add_timer_on() can be used to queue
    pinned timers, and subsequently mod_timer_pinned() can be used to
    modify the 'expires' field.

    For hrtimers, the new modes HRTIMER_MODE_ABS_PINNED and
    HRTIMER_MODE_REL_PINNED are added to queue CPU-pinned hrtimers.

    [ tglx: use .._PINNED mode argument instead of creating tons of new
    functions ]

    Signed-off-by: Arun R Bharadwaj
    Signed-off-by: Thomas Gleixner

    Arun R Bharadwaj
     

31 Mar, 2009

1 commit

  • It appears I inadvertently introduced rq->lock recursion to the
    hrtimer_start() path when I delegated running already expired
    timers to softirq context.

    This patch fixes it by introducing a __hrtimer_start_range_ns()
    method that will not use raise_softirq_irqoff() but
    __raise_softirq_irqoff() which avoids the wakeup.

    It then also changes schedule() to check for pending softirqs and
    do the wakeup then; I'm not quite sure I like this last bit, nor
    am I convinced it's really needed.

    Signed-off-by: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: paulus@samba.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

31 Jan, 2009

3 commits

  • Impact: prevent false positive WARN_ON() in clockevents_program_event()

    clock_was_set() changes the base->offset of CLOCK_REALTIME and
    enforces the reprogramming of the clockevent device to expire timers
    which are based on CLOCK_REALTIME. If the clock change is large enough
    then the subtraction of the timer expiry value and base->offset can
    become negative which triggers the warning in
    clockevents_program_event().

    Check the subtraction result and set a negative value to 0.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: fix CPU hotplug hang on Power6 testbox

    On architectures that support offlining all CPUs (at least powerpc/pseries),
    hot-unplugging the tick_do_timer_cpu can result in a system hang.

    This comes from the fact that if the cpu going down happens to be the
    cpu doing the tick, then as the tick_do_timer_cpu handover happens after the
    cpu is dead (via the CPU_DEAD notification), we're left without ticks,
    jiffies are frozen and any task relying on timers (msleep, ...) is stuck.
    That's particularly the case for the cpu looping in __cpu_die() waiting
    for the dying cpu to be dead.

    This patch addresses this by having the tick_do_timer_cpu handover happen
    earlier during the CPU_DYING notification. For this, a new clockevent
    notification type is introduced (CLOCK_EVT_NOTIFY_CPU_DYING) which is triggered
    in hrtimer_cpu_notify().

    Signed-off-by: Sebastien Dugue
    Cc:
    Signed-off-by: Ingo Molnar

    Sebastien Dugue
     
  • Impact: avoid timer IRQ hanging slow systems

    While using the function graph tracer on a virtualized system, the
    hrtimer_interrupt can hang the system on an infinite loop.

    This can be caused in several situations:

    - the hardware is very slow and HZ is set too high

    - something intrusive is slowing the system down (tracing under emulation)

    ... and the next clock events to program are always before the current time.

    This patch implements a reasonable compromise: if such a situation is
    detected, we let hrtimer interrupt processing consume at most about
    1/4 of the CPU's time. This is enough to keep the system running
    without serious starvation.

    It has been successfully tested under VirtualBox with 1000 HZ and 100 HZ
    with the function graph tracer launched. In both cases, the clock events
    were stretched to about 25 ms periodic ticks, which means 40 HZ.

    So we turn a hard-to-debug hang into a warning message and a system
    that still manages to limp along.

    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Frederic Weisbecker
     

27 Jan, 2009

1 commit


19 Jan, 2009

1 commit

  • Andrey Borzenkov reported this lockdep assert:

    > [17854.688347] =================================
    > [17854.688347] [ INFO: inconsistent lock state ]
    > [17854.688347] 2.6.29-rc2-1avb #1
    > [17854.688347] ---------------------------------
    > [17854.688347] inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
    > [17854.688347] pm-suspend/18240 [HC0[0]:SC0[0]:HE1:SE1] takes:
    > [17854.688347] (&cpu_base->lock){++..}, at: [] retrigger_next_event+0x5c/0xa0
    > [17854.688347] {in-hardirq-W} state was registered at:
    > [17854.688347] [] __lock_acquire+0x79d/0x1930
    > [17854.688347] [] lock_acquire+0x5c/0x80
    > [17854.688347] [] _spin_lock+0x35/0x70
    > [17854.688347] [] hrtimer_run_queues+0x31/0x140
    > [17854.688347] [] run_local_timers+0x8/0x20
    > [17854.688347] [] update_process_times+0x23/0x60
    > [17854.688347] [] tick_periodic+0x24/0x80
    > [17854.688347] [] tick_handle_periodic+0x12/0x70
    > [17854.688347] [] timer_interrupt+0x14/0x20
    > [17854.688347] [] handle_IRQ_event+0x29/0x60
    > [17854.688347] [] handle_level_irq+0x69/0xe0
    > [17854.688347] [] 0xffffffff
    > [17854.688347] irq event stamp: 55771
    > [17854.688347] hardirqs last enabled at (55771): [] _spin_unlock_irqrestore+0x35/0x60
    > [17854.688347] hardirqs last disabled at (55770): [] _spin_lock_irqsave+0x19/0x80
    > [17854.688347] softirqs last enabled at (54836): [] __do_softirq+0xc4/0x110
    > [17854.688347] softirqs last disabled at (54831): [] do_softirq+0x8e/0xe0
    > [17854.688347]
    > [17854.688347] other info that might help us debug this:
    > [17854.688347] 3 locks held by pm-suspend/18240:
    > [17854.688347] #0: (&buffer->mutex){--..}, at: [] sysfs_write_file+0x25/0x100
    > [17854.688347] #1: (pm_mutex){--..}, at: [] enter_state+0x4f/0x140
    > [17854.688347] #2: (dpm_list_mtx){--..}, at: [] device_pm_lock+0xf/0x20
    > [17854.688347]
    > [17854.688347] stack backtrace:
    > [17854.688347] Pid: 18240, comm: pm-suspend Not tainted 2.6.29-rc2-1avb #1
    > [17854.688347] Call Trace:
    > [17854.688347] [] ? printk+0x18/0x20
    > [17854.688347] [] print_usage_bug+0x16c/0x1d0
    > [17854.688347] [] mark_lock+0x8bf/0xc90
    > [17854.688347] [] ? pit_next_event+0x2f/0x40
    > [17854.688347] [] __lock_acquire+0x580/0x1930
    > [17854.688347] [] ? _spin_unlock+0x1d/0x20
    > [17854.688347] [] ? pit_next_event+0x2f/0x40
    > [17854.688347] [] ? clockevents_program_event+0x98/0x160
    > [17854.688347] [] ? mark_held_locks+0x48/0x90
    > [17854.688347] [] ? _spin_unlock_irqrestore+0x35/0x60
    > [17854.688347] [] ? trace_hardirqs_on_caller+0x139/0x190
    > [17854.688347] [] ? trace_hardirqs_on+0xb/0x10
    > [17854.688347] [] lock_acquire+0x5c/0x80
    > [17854.688347] [] ? retrigger_next_event+0x5c/0xa0
    > [17854.688347] [] _spin_lock+0x35/0x70
    > [17854.688347] [] ? retrigger_next_event+0x5c/0xa0
    > [17854.688347] [] retrigger_next_event+0x5c/0xa0
    > [17854.688347] [] hres_timers_resume+0xa/0x10
    > [17854.688347] [] timekeeping_resume+0xee/0x150
    > [17854.688347] [] __sysdev_resume+0x14/0x50
    > [17854.688347] [] sysdev_resume+0x47/0x80
    > [17854.688347] [] device_power_up+0xb/0x20
    > [17854.688347] [] suspend_devices_and_enter+0xcf/0x150
    > [17854.688347] [] ? freeze_processes+0x3f/0x90
    > [17854.688347] [] enter_state+0xf4/0x140
    > [17854.688347] [] state_store+0x7d/0xc0
    > [17854.688347] [] ? state_store+0x0/0xc0
    > [17854.688347] [] kobj_attr_store+0x24/0x30
    > [17854.688347] [] sysfs_write_file+0x9c/0x100
    > [17854.688347] [] vfs_write+0x9c/0x160
    > [17854.688347] [] ? restore_nocheck_notrace+0x0/0xe
    > [17854.688347] [] ? sysfs_write_file+0x0/0x100
    > [17854.688347] [] sys_write+0x3d/0x70
    > [17854.688347] [] sysenter_do_call+0x12/0x31

    Andrey's analysis:

    > timekeeping_resume() is called via class ->resume
    > method; and according to comments in sysdev_resume() and
    > device_power_up(), they are called with interrupts disabled.
    >
    > Looking at suspend_enter, irqs *are* disabled at this point.
    >
    > So it actually looks like something (may be some driver)
    > unconditionally enabled irqs in resume path.

    Add a debug check to test this theory. If it triggers then it
    triggers because the resume code calls it with irqs enabled,
    which is a no-no not just for timekeeping_resume(), but also
    bad for a number of other resume handlers.

    Reported-by: Andrey Borzenkov
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jan, 2009

1 commit


05 Jan, 2009

6 commits


01 Jan, 2009

1 commit

  • …l/git/tip/linux-2.6-tip

    * 'irq-fixes-for-linus-4' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    sparseirq: move __weak symbols into separate compilation unit
    sparseirq: work around __weak alias bug
    sparseirq: fix hang with !SPARSE_IRQ
    sparseirq: set lock_class for legacy irq when sparse_irq is selected
    sparseirq: work around compiler optimizing away __weak functions
    sparseirq: fix desc->lock init
    sparseirq: do not printk when migrating IRQ descriptors
    sparseirq: remove duplicated arch_early_irq_init()
    irq: simplify for_each_irq_desc() usage
    proc: remove ifdef CONFIG_SPARSE_IRQ from stat.c
    irq: for_each_irq_desc() move to irqnr.h
    hrtimer: remove #include <linux/irq.h>

    Linus Torvalds
     

26 Dec, 2008

1 commit


19 Dec, 2008

1 commit


09 Dec, 2008

1 commit


04 Dec, 2008

1 commit


25 Nov, 2008

1 commit

  • Impact: cleanup, move all hrtimer processing into hardirq context

    This is an attempt at removing some of the hrtimer complexity by
    reducing the number of callback modes to 1.

    This means that all hrtimer callback functions will be run from
    hard-IRQ context.

    I went through all the 30 odd hrtimer callback functions in the kernel
    and saw only one that I'm not quite sure of, which is the one in
    net/can/bcm.c - hence I'm CC-ing the folks responsible for that code.

    Furthermore, the hrtimer core now calls callbacks directly with IRQs
    disabled in case you try to enqueue an expired timer. If this timer
    is a periodic timer (which should use hrtimer_forward() to advance
    its time) then it might be possible to end up in an infinite
    recursive loop, due to the fact that hrtimer_forward() doesn't round
    up to the next timer granularity and therefore keeps on calling the
    callback - obviously this needs a fix.

    Aside from that, this seems to compile and actually boot on my dual core
    test box - although I'm sure there are some bugs in, me not hitting any
    makes me certain :-)

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Nov, 2008

1 commit


11 Nov, 2008

1 commit

  • Impact: fix incorrect locking triggered during hotplug-intense stress-tests

    While migrating the CB_IRQSAFE_UNLOCKED timers during a cpu-offline,
    we queue them on the cb_pending list, so that they won't go
    stale.

    Thus, when the callbacks of the timers run from the softirq context,
    they could run into potential deadlocks, since these callbacks
    assume that they're running with irq's disabled, thereby annoying
    lockdep!

    Fix this by emulating hardirq context while running these callbacks from
    the hrtimer softirq.

    =================================
    [ INFO: inconsistent lock state ]
    2.6.27 #2
    --------------------------------
    inconsistent {in-hardirq-W} -> {hardirq-on-W} usage.
    ksoftirqd/0/4 [HC0[0]:SC1[1]:HE1:SE0] takes:
    (&rq->lock){++..}, at: [] sched_rt_period_timer+0x9e/0x1fc
    {in-hardirq-W} state was registered at:
    [] __lock_acquire+0x549/0x121e
    [] native_sched_clock+0x88/0x99
    [] clocksource_get_next+0x39/0x3f
    [] update_wall_time+0x616/0x7df
    [] lock_acquire+0x5a/0x74
    [] scheduler_tick+0x3a/0x18d
    [] _spin_lock+0x1c/0x45
    [] scheduler_tick+0x3a/0x18d
    [] scheduler_tick+0x3a/0x18d
    [] update_process_times+0x3a/0x44
    [] tick_periodic+0x63/0x6d
    [] tick_handle_periodic+0x14/0x5e
    [] timer_interrupt+0x44/0x4a
    [] handle_IRQ_event+0x13/0x3d
    [] handle_level_irq+0x79/0xbd
    [] do_IRQ+0x69/0x7d
    [] common_interrupt+0x28/0x30
    [] aac_probe_one+0x1a3/0x3f3
    [] _spin_unlock_irqrestore+0x36/0x39
    [] setup_irq+0x1be/0x1f9
    [] start_kernel+0x259/0x2c5
    [] 0xffffffff
    irq event stamp: 50102
    hardirqs last enabled at (50102): [] _spin_unlock_irq+0x20/0x23
    hardirqs last disabled at (50101): [] _spin_lock_irq+0xa/0x4b
    softirqs last enabled at (50088): [] do_softirq+0x37/0x4d
    softirqs last disabled at (50099): [] do_softirq+0x37/0x4d

    other info that might help us debug this:
    no locks held by ksoftirqd/0/4.

    stack backtrace:
    Pid: 4, comm: ksoftirqd/0 Not tainted 2.6.27 #2
    [] print_usage_bug+0x13e/0x147
    [] mark_lock+0x493/0x797
    [] __lock_acquire+0x5be/0x121e
    [] lock_acquire+0x5a/0x74
    [] sched_rt_period_timer+0x9e/0x1fc
    [] _spin_lock+0x1c/0x45
    [] sched_rt_period_timer+0x9e/0x1fc
    [] sched_rt_period_timer+0x9e/0x1fc
    [] finish_task_switch+0x41/0xbd
    [] native_sched_clock+0x88/0x99
    [] sched_rt_period_timer+0x0/0x1fc
    [] run_hrtimer_pending+0x54/0xe5
    [] sched_rt_period_timer+0x0/0x1fc
    [] __do_softirq+0x7b/0xef
    [] do_softirq+0x37/0x4d
    [] ksoftirqd+0x56/0xc5
    [] ksoftirqd+0x0/0xc5
    [] kthread+0x38/0x5d
    [] kthread+0x0/0x5d
    [] kernel_thread_helper+0x7/0x10
    =======================

    Signed-off-by: Gautham R Shenoy
    Acked-by: Peter Zijlstra
    Acked-by: "Paul E. McKenney"
    Signed-off-by: Ingo Molnar

    Gautham R Shenoy
     

22 Oct, 2008

1 commit


20 Oct, 2008

3 commits


18 Oct, 2008

1 commit


13 Oct, 2008

1 commit


12 Oct, 2008

1 commit


29 Sep, 2008

4 commits

  • Impact: per CPU hrtimers can be migrated from a dead CPU

    The hrtimer code has no knowledge about per CPU timers, but we need to
    prevent the migration of such timers and warn when such a timer is
    active at migration time.

    Explicitly mark the timers as per CPU and use a more understandable
    mode descriptor for the interrupt-safe unlocked callback mode, which
    is used by hrtimer_sleeper and the scheduler code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: during migration active hrtimers can be seen as inactive

    The migration code removes the hrtimers from the queues of the dead
    CPU and temporarily sets the state to INACTIVE. The enqueue code
    sets it to ACTIVE/PENDING again.

    Prevent the wrong state from being seen by using a separate
    migration state bit.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: Stale timers after a CPU went offline.

    commit 37bb6cb4097e29ffee970065b74499cbf10603a3
    hrtimer: unlock hrtimer_wakeup

    changed the hrtimer sleeper callback mode to CB_IRQSAFE_NO_SOFTIRQ
    due to locking problems. As a result of this change, when enqueue is
    called for an already expired hrtimer the callback function is no
    longer called directly from the enqueue code. The normal callers have
    been fixed in the code, but the migration code which moves hrtimers
    from a dead CPU to a live CPU was not made aware of this.

    This can be fixed by checking the timer state after the call to
    enqueue in the migration code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Impact: hrtimers which are on the pending list are not migrated at
    CPU offline and can be stale forever.

    Add the pending list migration when CONFIG_HIGH_RES_TIMERS is
    enabled.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

22 Sep, 2008

1 commit


11 Sep, 2008

1 commit

  • As part of going idle, we already look at the time of the next timer
    event to determine which C-state to select, etc.

    This patch adds functionality that causes timers which are past
    their soft expiry time to fire at this point, before we calculate
    the next wakeup time. It thus avoids wakeups by running such timers
    before going idle, rather than specially waking up for them.

    Signed-off-by: Arjan van de Ven

    Arjan van de Ven