03 Jul, 2010

1 commit


01 Jul, 2010

1 commit

  • Commit 0224cf4c5e (sched: Intoduce get_cpu_iowait_time_us())
    broke things by not making sure preemption was indeed disabled
    by the callers of nr_iowait_cpu() which took the iowait value of
    the current cpu.

    This resulted in a heap of preempt warnings. Cure this by making
    nr_iowait_cpu() take a cpu number and fix up the callers to pass
    in the right number.

    Signed-off-by: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sergey Senozhatsky
    Cc: Rafael J. Wysocki
    Cc: Maxim Levitsky
    Cc: Len Brown
    Cc: Pavel Machek
    Cc: Jiri Slaby
    Cc: linux-pm@lists.linux-foundation.org
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
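The shape of the fix can be sketched in plain C (the per-cpu array and all names are illustrative, not the kernel's): an accessor that implicitly reads "the current cpu" is only safe with preemption disabled, while one that takes an explicit cpu number carries no such assumption.

```c
#include <assert.h>

#define NR_CPUS 4

/* Per-cpu iowait counters; array and names are illustrative only. */
static unsigned long per_cpu_iowait[NR_CPUS];

static void set_iowait(int cpu, unsigned long v)
{
    per_cpu_iowait[cpu] = v;
}

/* After the fix: the caller names the cpu explicitly, so the function
 * no longer assumes it runs with preemption disabled on that cpu. */
static unsigned long nr_iowait_on(int cpu)
{
    return per_cpu_iowait[cpu];
}
```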
     

18 Jun, 2010

1 commit

  • Chris Wedgwood reports that 39c0cbe (sched: Rate-limit nohz) causes a
    serial console regression (unresponsiveness), and indeed it does. The
    reason is that the nohz code is skipped even when the tick was already
    stopped before the nohz_ratelimit(cpu) condition changed.

    Move the nohz_ratelimit() check to the other conditions which prevent
    long idle sleeps.

    Reported-by: Chris Wedgwood
    Tested-by: Brian Bloniarz
    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: Jiri Kosina
    Cc: Linus Torvalds
    Cc: Greg KH
    Cc: Alan Cox
    Cc: OGAWA Hirofumi
    Cc: Jef Driesen
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

10 May, 2010

6 commits

  • For the ondemand cpufreq governor, it is desired that the iowait
    time is microaccounted in a similar way to idle time.

    This patch introduces the infrastructure to account and expose
    this information via the get_cpu_iowait_time_us() function.

    [akpm@linux-foundation.org: fix CONFIG_NO_HZ=n build]
    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: davej@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • Now that the only user of ts->idle_lastupdate is
    update_ts_time_stats(), the entire field can be eliminated.

    In update_ts_time_stats(), idle_lastupdate is first set to
    "now", and a few lines later, the only user is an if() statement
    that assigns a variable either to "now" or to
    ts->idle_lastupdate, which has the value of "now" at that point.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: davej@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • This patch folds the updating of the last_update_time into the
    update_ts_time_stats() function, and updates the callers.

    This allows for further cleanups that are done in the next
    patch.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: davej@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • Right now, get_cpu_idle_time_us() only reports the idle
    statistics up to the point the CPU last entered idle, not what is
    valid right now.

    This patch adds an update of the idle statistics to
    get_cpu_idle_time_us(), so that calling this function always
    returns statistics that are accurate at the point of the call.

    This includes resetting the start of the idle time for
    accounting purposes to avoid double accounting.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: davej@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
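The update-on-read idea can be sketched as follows (struct and function names are invented for illustration): reading the statistic first folds the elapsed part of the current idle period into the counter, then resets the period start so the same interval is never added twice.

```c
#include <assert.h>

/* An idle counter plus the start time of the current idle period.
 * All names and units are illustrative. */
struct idle_stats {
    unsigned long long idle_us;   /* accumulated idle time */
    unsigned long long start_us;  /* start of current idle period, 0 = not idle */
};

static unsigned long long read_idle_us(struct idle_stats *ts,
                                       unsigned long long now_us)
{
    if (ts->start_us) {                      /* cpu is currently idle */
        ts->idle_us += now_us - ts->start_us;
        ts->start_us = now_us;               /* reset: avoid double accounting */
    }
    return ts->idle_us;
}
```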
     
  • Currently, two places update the idle statistics (and more to
    come later in this series).

    This patch creates a helper function for updating these
    statistics.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: davej@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     
  • The exported function get_cpu_idle_time_us() has no comment
    describing it; add a kerneldoc comment.

    Signed-off-by: Arjan van de Ven
    Signed-off-by: Andrew Morton
    Reviewed-by: Rik van Riel
    Acked-by: Peter Zijlstra
    Cc: davej@redhat.com
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Arjan van de Ven
     

12 Mar, 2010

1 commit

  • Entering nohz code on every micro-idle is costing ~10% throughput for netperf
    TCP_RR when scheduling cross-cpu. Rate limiting entry fixes this, but raises
    ticks a bit. On my Q6600, an idle box goes from ~85 interrupts/sec to 128.

    The higher the context switch rate, the more nohz entry costs. With this patch
    and some cycle recovery patches in my tree, max cross cpu context switch rate is
    improved by ~16%, a large portion of which is this ratelimiting.

    Signed-off-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Mike Galbraith
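As a rough user-space sketch of rate-limiting entry into an expensive path (the real nohz_ratelimit() consulted scheduler state; the interval-based variant here is purely illustrative):

```c
#include <assert.h>

/* Skip the slow path when the last entry was less than min_interval ago.
 * Names invented; this is not the kernel's nohz_ratelimit(). */
struct ratelimit {
    unsigned long long last_us;
    unsigned long long min_interval_us;
};

/* Returns 1 when the expensive path should run, 0 when rate-limited. */
static int ratelimit_allow(struct ratelimit *rl, unsigned long long now_us)
{
    if (now_us - rl->last_us < rl->min_interval_us)
        return 0;
    rl->last_us = now_us;
    return 1;
}
```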
     

09 Dec, 2009

1 commit

  • * 'timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip:
    timers, init: Limit the number of per cpu calibration bootup messages
    posix-cpu-timers: optimize and document timer_create callback
    clockevents: Add missing include to pacify sparse
    x86: vmiclock: Fix printk format
    x86: Fix printk format due to variable type change
    sparc: fix printk for change of variable type
    clocksource/events: Fix fallout of generic code changes
    nohz: Allow 32-bit machines to sleep for more than 2.15 seconds
    nohz: Track last do_timer() cpu
    nohz: Prevent clocksource wrapping during idle
    nohz: Type cast printk argument
    mips: Use generic mult/shift factor calculation for clocks
    clocksource: Provide a generic mult/shift factor calculation
    clockevents: Use u32 for mult and shift factors
    nohz: Introduce arch_needs_cpu
    nohz: Reuse ktime in sub-functions of tick_check_idle.
    time: Remove xtime_cache
    time: Implement logarithmic time accumulation

    Linus Torvalds
     

14 Nov, 2009

3 commits

  • The previous patch which limits the sleep time to the maximum
    deferment time of the time keeping clocksource has some limitations on
    SMP machines: if all CPUs are idle then for all CPUs the maximum sleep
    time is limited.

    Solve this by keeping track of which cpu had the do_timer() duty
    assigned last and limit the sleep time only for this cpu.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:
    Cc: Jon Hunter
    Cc: John Stultz

    Thomas Gleixner
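A minimal sketch of the idea, with invented names: remember which cpu last carried the do_timer() duty and apply the clocksource sleep cap only there; every other cpu may sleep as long as it likes.

```c
#include <assert.h>

/* Which cpu last ran do_timer(); -1 means nobody. Illustrative only. */
static int tick_do_timer_cpu = -1;

static unsigned long long allowed_sleep(int cpu, unsigned long long wanted,
                                        unsigned long long max_deferment)
{
    /* Only the timekeeping cpu has its sleep time capped. */
    if (cpu == tick_do_timer_cpu && wanted > max_deferment)
        return max_deferment;
    return wanted;
}
```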
     
  • The dynamic tick allows the kernel to sleep for periods longer than a
    single tick, but it does not limit the sleep time currently. In the
    worst case the kernel could sleep longer than the wrap around time of
    the time keeping clock source which would result in losing track of
    time.

    Prevent this by limiting it to the safe maximum sleep time of the
    current time keeping clock source. The value is calculated when the
    clock source is registered.

    [ tglx: simplified the code a bit and massaged the commit msg ]

    Signed-off-by: Jon Hunter
    Cc: John Stultz
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Jon Hunter
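The safe maximum can be sketched from the clocksource parameters (ns = (cycles * mult) >> shift, counter wraps at mask). The safety margin and names below are illustrative, not the kernel's exact computation:

```c
#include <assert.h>
#include <stdint.h>

/* Maximum sleep is bounded by the smaller of the counter wrap point and
 * the point where the 64-bit cycles * mult product would overflow,
 * reduced by a safety margin. Margin and names are illustrative. */
static uint64_t max_sleep_ns(uint64_t mask, uint32_t mult, uint32_t shift)
{
    uint64_t max_cycles = UINT64_MAX / mult;  /* avoid 64-bit overflow */
    if (max_cycles > mask)                    /* avoid counter wrap */
        max_cycles = mask;
    uint64_t max_ns = (max_cycles * mult) >> shift;
    return max_ns - (max_ns >> 3);            /* ~12.5% safety margin */
}
```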
     
  • On some archs local_softirq_pending() has a data type of unsigned long,
    on others it's unsigned int. Type cast it to (unsigned int) in the
    printk to avoid the compiler warning.

    Signed-off-by: Thomas Gleixner
    LKML-Reference:

    Thomas Gleixner
     

05 Nov, 2009

2 commits

  • Allow the architecture to request a normal jiffy tick when the system
    goes idle and tick_nohz_stop_sched_tick() is called. On s390 the hook is
    used to prevent the system going fully idle if there has been an
    interrupt other than a clock comparator interrupt since the last wakeup.

    On s390 the HiperSockets response time for 1 connection ping-pong goes
    down from 42 to 34 microseconds. The CPU cost decreases by 27%.

    Signed-off-by: Martin Schwidefsky
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Martin Schwidefsky
     
  • On a system with NOHZ=y tick_check_idle calls tick_nohz_stop_idle and
    tick_nohz_update_jiffies. Given the right conditions (ts->idle_active
    and/or ts->tick_stopped) both functions get a time stamp with ktime_get.
    The same time stamp can be reused if both functions require one.

    On s390 this change has the additional benefit that gcc inlines the
    tick_nohz_stop_idle function into tick_check_idle. The number of
    instructions to execute tick_check_idle drops from 225 to 144
    (without the ktime_get optimization it is 367 vs 215 instructions).

    before:

    0) | tick_check_idle() {
    0) | tick_nohz_stop_idle() {
    0) | ktime_get() {
    0) | read_tod_clock() {
    0) 0.601 us | }
    0) 1.765 us | }
    0) 3.047 us | }
    0) | ktime_get() {
    0) | read_tod_clock() {
    0) 0.570 us | }
    0) 1.727 us | }
    0) | tick_do_update_jiffies64() {
    0) 0.609 us | }
    0) 8.055 us | }

    after:

    0) | tick_check_idle() {
    0) | ktime_get() {
    0) | read_tod_clock() {
    0) 0.617 us | }
    0) 1.773 us | }
    0) | tick_do_update_jiffies64() {
    0) 0.593 us | }
    0) 4.477 us | }

    Signed-off-by: Martin Schwidefsky
    Cc: john stultz
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Martin Schwidefsky
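The refactor pattern can be sketched like this (a counter stands in for the cost of ktime_get(); all names are invented): the caller reads the clock once and passes the value to both helpers, instead of each helper reading it again.

```c
#include <assert.h>

/* Counts clock reads, standing in for the cost of ktime_get(). */
static int clock_reads;

static unsigned long long fake_ktime_get(void)
{
    clock_reads++;
    return 12345;
}

/* Helpers take the timestamp as a parameter instead of reading it. */
static unsigned long long stop_idle(unsigned long long now)      { return now; }
static unsigned long long update_jiffies(unsigned long long now) { return now; }

static void check_idle(void)
{
    unsigned long long now = fake_ktime_get();  /* one read ... */
    stop_idle(now);                             /* ... shared by */
    update_jiffies(now);                        /* both helpers  */
}
```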
     

07 Oct, 2009

1 commit

  • Commit f2e21c9610991e95621a81407cdbab881226419b had unfortunate side
    effects with cpufreq governors on some systems.

    If the system did not switch into NOHZ mode ts->inidle is not set when
    tick_nohz_stop_sched_tick() is called from the idle routine. Therefore
    all subsequent calls from irq_exit() to tick_nohz_stop_sched_tick()
    fail to call tick_nohz_start_idle(). This results in bogus idle
    accounting information which is passed to cpufreq governors.

    Set the inidle flag unconditionally, regardless of the NOHZ active
    state, to keep the idle time accounting correct in any case.

    [ tglx: Added comment and tweaked the changelog ]

    Reported-by: Steven Noonan
    Signed-off-by: Eero Nurkkala
    Cc: Rik van Riel
    Cc: Venkatesh Pallipadi
    Cc: Greg KH
    Cc: Steven Noonan
    Cc: stable@kernel.org
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Eero Nurkkala
     

21 Jun, 2009

1 commit


27 May, 2009

1 commit

  • A call from irq_exit() may occasionally pause the timing
    info for the cpufreq ondemand governor. This causes the
    cpufreq ondemand governor to fail to calculate the
    system load properly. Thus, relocate the checks for this
    particular case to keep the governor always functional.

    Signed-off-by: Eero Nurkkala
    Reported-by: Tero Kristo
    Acked-by: Rik van Riel
    Acked-by: Venkatesh Pallipadi
    Signed-off-by: Thomas Gleixner

    Eero Nurkkala
     

13 May, 2009

1 commit


15 Jan, 2009

1 commit


04 Jan, 2009

1 commit


03 Jan, 2009

1 commit

  • …/git/tip/linux-2.6-tip

    * 'cpus4096-for-linus-2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip: (66 commits)
    x86: export vector_used_by_percpu_irq
    x86: use logical apicid in x2apic_cluster's x2apic_cpu_mask_to_apicid_and()
    sched: nominate preferred wakeup cpu, fix
    x86: fix lguest used_vectors breakage, -v2
    x86: fix warning in arch/x86/kernel/io_apic.c
    sched: fix warning in kernel/sched.c
    sched: move test_sd_parent() to an SMP section of sched.h
    sched: add SD_BALANCE_NEWIDLE at MC and CPU level for sched_mc>0
    sched: activate active load balancing in new idle cpus
    sched: bias task wakeups to preferred semi-idle packages
    sched: nominate preferred wakeup cpu
    sched: favour lower logical cpu number for sched_mc balance
    sched: framework for sched_mc/smt_power_savings=N
    sched: convert BALANCE_FOR_xx_POWER to inline functions
    x86: use possible_cpus=NUM to extend the possible cpus allowed
    x86: fix cpu_mask_to_apicid_and to include cpu_online_mask
    x86: update io_apic.c to the new cpumask code
    x86: Introduce topology_core_cpumask()/topology_thread_cpumask()
    x86: xen: use smp_call_function_many()
    x86: use work_on_cpu in x86/kernel/cpu/mcheck/mce_amd_64.c
    ...

    Fixed up trivial conflict in kernel/time/tick-sched.c manually

    Linus Torvalds
     

31 Dec, 2008

2 commits

  • The cpu time spent by the idle process actually doing something is
    currently accounted as idle time. This is plain wrong; the architectures
    that support VIRT_CPU_ACCOUNTING=y can do better: distinguish between the
    time spent doing nothing and the time spent by idle doing work. The first
    is accounted with account_idle_time and the second with account_system_time.
    The architectures that use the account_xxx_time interface directly and not
    the account_xxx_ticks interface now need to do the check for the idle
    process in their arch code. In particular to improve the system vs true
    idle time accounting the arch code needs to measure the true idle time
    instead of just testing for the idle process.
    To improve the tick based accounting as well we would need an architecture
    primitive that can tell us if the pt_regs of the interrupted context
    points to the magic instruction that halts the cpu.

    In addition, idle time is no longer added to the stime of the idle process.
    This field now contains the system time of the idle process as it should
    be. On systems without VIRT_CPU_ACCOUNTING this will always be zero as
    every tick that occurs while idle is running will be accounted as idle
    time.

    This patch contains the necessary common code changes to be able to
    distinguish idle system time and true idle time. The architectures with
    support for VIRT_CPU_ACCOUNTING need some changes to exploit this.

    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     
  • The utimescaled / stimescaled fields in the task structure and the
    global cpustat should be set on all architectures. On s390 the calls
    to account_user_time_scaled and account_system_time_scaled never have
    been added. In addition system time that is accounted as guest time
    to the user time of a process is accounted to the scaled system time
    instead of the scaled user time.
    To fix the bugs and to prevent future forgetfulness this patch merges
    account_system_time_scaled into account_system_time and
    account_user_time_scaled into account_user_time.

    Cc: Benjamin Herrenschmidt
    Cc: Hidetoshi Seto
    Cc: Tony Luck
    Cc: Jeremy Fitzhardinge
    Cc: Chris Wright
    Cc: Michael Neuling
    Acked-by: Paul Mackerras
    Signed-off-by: Martin Schwidefsky

    Martin Schwidefsky
     

26 Dec, 2008

1 commit


12 Dec, 2008

2 commits

  • In my device I get many interrupts from a high speed USB device in a very
    short period of time. The system spends a lot of time reprogramming the
    hardware timer which is in a slower timing domain as compared to the CPU.
    This results in the CPU spending a huge amount of time waiting for the
    timer posting to be done. All of this reprogramming is useless as the
    wake up time has not changed.

    As measured using ETM trace this drops my reprogramming penalty from
    almost 60% CPU load down to 15% during high interrupt rate. I can send
    traces to show this.

    Suppress setting of duplicate timer event when timer already stopped.
    Timer programming can be very costly and can result in long cpu stall/wait
    times.

    [akpm@linux-foundation.org: coding-style fixes]
    [tglx@linutronix.de: move the check to the right place and avoid raising
    the softirq for nothing]

    Signed-off-by: Richard Woodruff
    Cc: johnstul@us.ibm.com
    Signed-off-by: Andrew Morton
    Signed-off-by: Thomas Gleixner

    Woodruff, Richard
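The suppression logic amounts to remembering the last programmed expiry and returning early when it is unchanged; a minimal sketch with invented names, where a counter stands in for the costly hardware write:

```c
#include <assert.h>

static unsigned long long programmed_expiry;  /* last value written to hw */
static int hw_writes;                         /* stands in for hw cost   */

static void program_timer(unsigned long long expiry)
{
    if (expiry == programmed_expiry)
        return;              /* same wake-up time: skip the costly write */
    programmed_expiry = expiry;
    hw_writes++;
}
```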
     
  • Impact: remove false positive warning

    After a cpu was taken down during cpu hotplug (read: disabled for interrupts)
    it still might have pending softirqs. However take_cpu_down makes sure
    that the idle task will run next instead of ksoftirqd on the taken down cpu.
    The idle task will call tick_nohz_stop_sched_tick which might warn about
    pending softirqs just before the cpu kills itself completely.

    However the pending softirqs on the dead cpu aren't a problem because they
    will be moved to an online cpu during CPU_DEAD handling.

    So make sure we warn only for online cpus.

    Signed-off-by: Heiko Carstens
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

25 Nov, 2008

2 commits

  • Impact: cleanup, move all hrtimer processing into hardirq context

    This is an attempt at removing some of the hrtimer complexity by
    reducing the number of callback modes to 1.

    This means that all hrtimer callback functions will be run from HARD-irq
    context.

    I went through all the 30 odd hrtimer callback functions in the kernel
    and saw only one that I'm not quite sure of, which is the one in
    net/can/bcm.c - hence I'm CC-ing the folks responsible for that code.

    Furthermore, the hrtimer core now calls callbacks directly with IRQs
    disabled in case you try to enqueue an expired timer. If this timer is a
    periodic timer (which should use hrtimer_forward() to advance its time)
    then it might be possible to end up in an infinite recursive loop due to the
    fact that hrtimer_forward() doesn't round up to the next timer
    granularity, and therefore keeps on calling the callback - obviously
    this needs a fix.

    Aside from that, this seems to compile and actually boot on my dual core
    test box - although I'm sure there are some bugs in it; me not hitting
    any makes me certain :-)

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Impact: (future) size reduction for large NR_CPUS.

    Dynamically allocating cpumasks (when CONFIG_CPUMASK_OFFSTACK) saves
    space for small nr_cpu_ids but big CONFIG_NR_CPUS. cpumask_var_t
    is just a struct cpumask for !CONFIG_CPUMASK_OFFSTACK.

    Signed-off-by: Rusty Russell
    Signed-off-by: Ingo Molnar

    Rusty Russell
     

11 Nov, 2008

1 commit

  • Impact: nohz powersavings and wakeup regression

    commit fb02fbc14d17837b4b7b02dbb36142c16a7bf208 (NOHZ: restart tick
    device from irq_enter()) causes a serious wakeup regression.

    While the patch is correct it does not take into account that spurious
    wakeups happen on x86. A fix for this issue is available, but we just
    revert to the .27 behaviour and let long running softirqs screw
    themselves.

    Disable it for now.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

22 Oct, 2008

2 commits

  • Conflicts:

    kernel/time/tick-sched.c

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • commit fb02fbc14d17837b4b7b02dbb36142c16a7bf208 (NOHZ: restart tick
    device from irq_enter())

    solves the problem of stale jiffies when long running softirqs happen
    in a long idle sleep period, but it has a major thinko in it:

    When the interrupt which came in _is_ the timer interrupt which should
    expire ts->sched_timer then we cancel and rearm the timer _before_ it
    gets expired in hrtimer_interrupt() to the next period. That means the
    callback function is not called. This game can go on forever :(

    Prevent this by making sure to only rearm the timer when the expiry
    time is more than one tick_period away. Otherwise keep it running as
    it is either already expired or will expire at the right point to
    update jiffies.

    Signed-off-by: Thomas Gleixner
    Tested-by: Venkatesh Pallipadi

    Thomas Gleixner
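The guard can be sketched as a simple predicate (tick period and names are illustrative): rearm only when the expiry is more than one tick period in the future, otherwise leave the timer alone so it can fire and update jiffies.

```c
#include <assert.h>

#define TICK_PERIOD_NS 1000000ULL  /* 1 ms tick, for illustration */

/* Returns 1 when the timer should be rearmed, 0 when left running
 * (either already expired or due within the next tick period). */
static int should_rearm(unsigned long long expiry_ns,
                        unsigned long long now_ns)
{
    return expiry_ns > now_ns + TICK_PERIOD_NS;
}
```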
     

18 Oct, 2008

4 commits

  • Conflicts:

    arch/x86/kvm/i8254.c

    Arjan van de Ven
     
  • We did not restart the tick device from irq_enter() to avoid double
    reprogramming and extra events in the return-immediately-to-idle case.

    But long lasting softirqs can lead to a situation where jiffies become
    stale:

    idle()
    tick stopped (reprogrammed to next pending timer)
    halt()
    interrupt
    jiffies updated from irq_enter()
    interrupt handler
    softirq function 1 runs 20ms
    softirq function 2 arms a 10ms timer with a stale jiffies value
    jiffies updated from irq_exit()
    timer wheel has now an already expired timer
    (the one added in function 2)
    timer fires and timer softirq runs

    This was discovered when debugging a timer problem which happened only
    when the ath5k driver is active. The debugging proved that there is a
    softirq function running for more than 20ms, which is a bug by itself.

    To solve this we restart the tick timer right from irq_enter(), but do
    not go through the other functions which are necessary to return from
    idle when need_resched() is set.

    Reported-by: Elias Oltmanns
    Signed-off-by: Thomas Gleixner
    Tested-by: Elias Oltmanns

    Thomas Gleixner
     
  • Split out the clock event device reprogramming. Preparatory
    patch.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • We have two separate nohz function calls in irq_enter() for no good
    reason. Just call a single NOHZ function from irq_enter() and call
    the bits in the tick code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

15 Oct, 2008

1 commit


10 Oct, 2008

1 commit


29 Sep, 2008

1 commit

  • Impact: per CPU hrtimers can be migrated from a dead CPU

    The hrtimer code has no knowledge about per CPU timers, but we need to
    prevent the migration of such timers and warn when such a timer is
    active at migration time.

    Explicitly mark the timers as per CPU and use a more understandable
    mode descriptor for the interrupt-safe unlocked callback mode, which
    is used by hrtimer_sleeper and the scheduler code.

    Signed-off-by: Thomas Gleixner

    Thomas Gleixner