25 Sep, 2013

2 commits

  • In order to combine the preemption and need_resched test we need to
    fold the need_resched information into the preempt_count value.

    Since the NEED_RESCHED flag is set across CPUs, updating it must be
    an atomic operation; however, we very much want to avoid making
    preempt_count itself atomic. Therefore we keep the existing
    TIF_NEED_RESCHED infrastructure in place, but test it at three sites
    and fold its value into preempt_count; namely:

    - resched_task() when setting TIF_NEED_RESCHED on the current task
    - scheduler_ipi() when resched_task() sets TIF_NEED_RESCHED on a
      remote task; it follows up with a reschedule IPI, and we can
      modify the CPU-local preempt_count from there.
    - cpu_idle_loop() for when resched_task() found tsk_is_polling().

    We use an inverted bitmask to indicate need_resched so that a 0 means
    both need_resched and !atomic.
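
    As a rough sketch of the inverted-bit scheme (the constant and
    helper names below are illustrative, not necessarily the exact
    symbols this patch introduces):

        #define PREEMPT_NEED_RESCHED 0x80000000 /* stored inverted */

        /*
         * Folding TIF_NEED_RESCHED into preempt_count means clearing
         * the (inverted) bit, so a raw value of 0 simultaneously means
         * "resched needed" and "preemption allowed", and the combined
         * test is a single compare against zero.
         */
        static inline void fold_need_resched(unsigned int *preempt_count)
        {
                *preempt_count &= ~PREEMPT_NEED_RESCHED;
        }

        static inline bool should_resched(unsigned int preempt_count)
        {
                return preempt_count == 0;
        }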

    Also remove the barrier() in preempt_enable() between
    preempt_enable_no_resched() and preempt_check_resched() to avoid
    having to reload the preemption value and allow the compiler to use
    the flags of the previous decrement. I couldn't come up with any sane
    reason for this barrier() to be there as preempt_enable_no_resched()
    already has a barrier() before doing the decrement.
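
    With that barrier() gone, preempt_enable() reduces to the following
    (a sketch; the real macro bodies vary by configuration):

        #define preempt_enable() \
        do { \
                preempt_enable_no_resched(); /* its own barrier() precedes the decrement */ \
                preempt_check_resched();     /* can now reuse the decrement's flags */ \
        } while (0)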

    Suggested-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-7a7m5qqbn5pmwnd4wko9u6da@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Mike reported that commit 7d1a9417 ("x86: Use generic idle loop")
    regressed several workloads and caused excessive reschedule
    interrupts.

    The patch in question failed to notice that the x86 code had an
    inverted sense of the polling state versus the new generic code (x86:
    default polling, generic: default !polling).

    Fix the two prominent x86 mwait-based idle drivers and introduce a
    few new generic polling helpers (fixing the wrong
    smp_mb__after_clear_bit usage).

    Also switch the idle routines to using tif_need_resched(), which is
    an immediate TIF_NEED_RESCHED test, as opposed to need_resched(),
    which will end up being slightly different.
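
    A sketch of the new generic polling helpers in the spirit of this
    patch (treat the exact names and barrier choice as illustrative):

        static inline void __current_set_polling(void)
        {
                set_thread_flag(TIF_POLLING_NRFLAG);
        }

        static inline bool __must_check current_set_polling_and_test(void)
        {
                __current_set_polling();
                /*
                 * Full barrier, not smp_mb__after_clear_bit(): the
                 * polling state must be visible before we test
                 * NEED_RESCHED; pairs with resched_task().
                 */
                smp_mb();
                return unlikely(tif_need_resched());
        }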

    Reported-by: Mike Galbraith
    Signed-off-by: Peter Zijlstra
    Cc: lenb@kernel.org
    Cc: tglx@linutronix.de
    Link: http://lkml.kernel.org/n/tip-nc03imb0etuefmzybzj7sprf@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

15 Jun, 2013

1 commit

  • PARISC bootup triggers the warning at kernel/cpu/idle.c:96. That's
    caused by the weak arch_cpu_idle() implementation, which is provided
    so that architectures don't have to implement idle_poll over and
    over.

    The switchover to polling mode happens in the first call of the weak
    arch_cpu_idle() implementation, but that code fails to reenable
    interrupts and therefore triggers the warning.

    Fix this by enabling interrupts in the weak arch_cpu_idle() code.
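
    The fixed weak fallback then boils down to something like:

        void __weak arch_cpu_idle(void)
        {
                cpu_idle_force_poll = 1;
                local_irq_enable(); /* reenable interrupts before returning */
        }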

    [ tglx: Made the changelog match the patch ]

    Signed-off-by: James Bottomley
    Reviewed-by: Srivatsa S. Bhat
    Link: http://lkml.kernel.org/r/1371236142.2726.43.camel@dabdike
    Signed-off-by: Thomas Gleixner

    James Bottomley
     

12 Jun, 2013

1 commit

  • Moving x86 to the generic idle implementation (commit 7d1a9417 "x86:
    Use generic idle loop") wrecked the stack protector.

    I stupidly missed that boot_init_stack_canary() must be inlined from a
    function which never returns, but I put that call into
    arch_cpu_idle_prepare() which of course returns.

    I pondered playing tricks with arch_cpu_idle_prepare() first, but
    then I noticed that the other archs which have implemented the
    stackprotector (ARM and SH) do not initialize the canary for the
    non-boot cpus.

    So I decided to move the boot_init_stack_canary() call into
    cpu_startup_entry(), #ifdeffed with CONFIG_X86 for now. This #ifdef
    is just a temporary measure, as I don't want to inflict the
    boot_init_stack_canary() call on ARM and SH that late in the cycle.

    I'll queue a patch for 3.11 which removes the #ifdef if the ARM/SH
    maintainers have no objection.
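
    The result looks roughly like this (a sketch of the #ifdef
    placement, not the verbatim function):

        void cpu_startup_entry(enum cpuhp_state state)
        {
        #ifdef CONFIG_X86
                /*
                 * Nothing set up the stack canary for the non-boot
                 * CPUs, and this function never returns, so it is a
                 * safe place to initialize it.
                 */
                boot_init_stack_canary();
        #endif
                arch_cpu_idle_prepare();
                cpu_idle_loop();
        }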

    Reported-by: Wouter van Kesteren
    Cc: x86@kernel.org
    Cc: Russell King
    Cc: Paul Mundt
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

14 May, 2013

1 commit

  • Bjørn Mork reported the following warning when running powertop.

    [ 49.289034] ------------[ cut here ]------------
    [ 49.289055] WARNING: at kernel/rcutree.c:502 rcu_eqs_exit_common.isra.48+0x3d/0x125()
    [ 49.289244] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 3.10.0-bisect-rcu-warn+ #107
    [ 49.289251] ffffffff8157d8c8 ffffffff81801e28 ffffffff8137e4e3 ffffffff81801e68
    [ 49.289260] ffffffff8103094f ffffffff81801e68 0000000000000000 ffff88023afcd9b0
    [ 49.289268] 0000000000000000 0140000000000000 ffff88023bee7700 ffffffff81801e78
    [ 49.289276] Call Trace:
    [ 49.289285] [] dump_stack+0x19/0x1b
    [ 49.289293] [] warn_slowpath_common+0x62/0x7b
    [ 49.289300] [] warn_slowpath_null+0x15/0x17
    [ 49.289306] [] rcu_eqs_exit_common.isra.48+0x3d/0x125
    [ 49.289314] [] ? trace_hardirqs_off_caller+0x37/0xa6
    [ 49.289320] [] rcu_idle_exit+0x85/0xa8
    [ 49.289327] [] trace_cpu_idle_rcuidle+0xae/0xff
    [ 49.289334] [] cpu_startup_entry+0x72/0x115
    [ 49.289341] [] rest_init+0x149/0x150
    [ 49.289347] [] ? csum_partial_copy_generic+0x16c/0x16c
    [ 49.289355] [] start_kernel+0x3f0/0x3fd
    [ 49.289362] [] ? repair_env_string+0x5a/0x5a
    [ 49.289368] [] x86_64_start_reservations+0x2a/0x2c
    [ 49.289375] [] x86_64_start_kernel+0xcd/0xd1
    [ 49.289379] ---[ end trace 07a1cc95e29e9036 ]---

    The warning is that 'rdtp->dynticks' has an unexpected value, which
    roughly translates to: the calls to rcu_idle_enter() and
    rcu_idle_exit() were not made in the correct order, or were otherwise
    messed up.

    And Bjørn's painstaking debugging indicated that this happens when
    the idle loop enters poll mode. Looking at the poll function
    cpu_idle_poll() and the implementation of trace_cpu_idle_rcuidle(),
    the problem becomes very clear: cpu_idle_poll() lacks calls to
    rcu_idle_enter/exit(), and trace_cpu_idle_rcuidle() calls them in the
    reverse order - first rcu_idle_exit(), and then rcu_idle_enter().
    Hence the even/odd alternating sequence of rdtp->dynticks goes for a
    toss.

    And powertop readily triggers this because powertop uses the idle-tracing
    infrastructure extensively.

    So, to fix this, wrap the code in cpu_idle_poll() within
    rcu_idle_enter/exit(), so that it blends properly with the calls
    inside trace_cpu_idle_rcuidle() and thus gets the function ordering
    right.
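
    With the fix applied, cpu_idle_poll() looks roughly like:

        static inline int cpu_idle_poll(void)
        {
                rcu_idle_enter();
                trace_cpu_idle_rcuidle(0, smp_processor_id());
                local_irq_enable();
                while (!need_resched())
                        cpu_relax();
                trace_cpu_idle_rcuidle(PWR_EVENT_EXIT, smp_processor_id());
                rcu_idle_exit();
                return 1;
        }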

    Reported-and-tested-by: Bjørn Mork
    Cc: Paul McKenney
    Cc: Steven Rostedt
    Cc: Dipankar Sarma
    Signed-off-by: Srivatsa S. Bhat
    Link: http://lkml.kernel.org/r/519169BF.4080208@linux.vnet.ibm.com
    Signed-off-by: Thomas Gleixner

    Srivatsa S. Bhat
     

30 Apr, 2013

1 commit

  • Pull core timer updates from Ingo Molnar:
    "The main changes in this cycle's merge are:

    - Implement a shadow timekeeper to shorten in-kernel reader-side
      blocking, by Thomas Gleixner.

    - Posix timers enhancements by Pavel Emelyanov:

      - allocate timer IDs per process, so that exact timer ID
        allocations can be re-created by checkpoint/restore code.

      - debuggability and tooling (/proc/PID/timers, etc.) improvements.

    - suspend/resume enhancements by Feng Tang: on certain new Intel
      Atom processors (Penwell and Cloverview), the TSC does not stop
      in S3 state, so the TSC value is not reset to 0 after resume. The
      generic code can take advantage of this via the
      CLOCK_SOURCE_SUSPEND_NONSTOP flag: instead of using the RTC to
      recover/approximate sleep time, the main (and precise)
      clocksource can be used.

    - Fix /proc/timer_list for 4096 CPUs, by Nathan Zimmer: with that
      many CPUs the file grows beyond 4MB in size and the current
      simplistic seqfile approach fails. Convert /proc/timer_list to a
      proper seq_file with its own iterator.

    - Cleanups and refactorings of the core timekeeping code, by John
      Stultz.

    - International Atomic Time (TAI) is currently managed internally
      by the NTP code but not exposed externally. Separate out the TAI
      code and add CLOCK_TAI support, plus TAI support to the hrtimer
      and posix-timer code, by John Stultz.

    - Add a deep idle support enhancement to the broadcast clockevents
      core timer code, by Daniel Lezcano: add an opt-in
      CLOCK_EVT_FEAT_DYNIRQ clockevents feature (which will be utilized
      by future clockevents driver updates), which allows the use of
      IRQ affinities to avoid spurious wakeups of idle CPUs - the right
      CPU with an expiring timer will be woken.

    - Add a new ARM bcm281xx clocksource driver, by Christian Daudt.

    - ... various other fixes and cleanups"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
    clockevents: Set dummy handler on CPU_DEAD shutdown
    timekeeping: Update tk->cycle_last in resume
    posix-timers: Remove unused variable
    clockevents: Switch into oneshot mode even if broadcast registered late
    timer_list: Convert timer list to be a proper seq_file
    timer_list: Split timer_list_show_tickdevices
    posix-timers: Show sigevent info in proc file
    posix-timers: Introduce /proc/PID/timers file
    posix timers: Allocate timer id per process (v2)
    timekeeping: Make sure to notify hrtimers when TAI offset changes
    hrtimer: Fix ktime_add_ns() overflow on 32bit architectures
    hrtimer: Add expiry time overflow check in hrtimer_interrupt
    timekeeping: Shorten seq_count region
    timekeeping: Implement a shadow timekeeper
    timekeeping: Delay update of clock->cycle_last
    timekeeping: Store cycle_last value in timekeeper struct as well
    ntp: Remove ntp_lock, using the timekeeping locks to protect ntp state
    timekeeping: Simplify tai updating from do_adjtimex
    timekeeping: Hold timekeepering locks in do_adjtimex and hardpps
    timekeeping: Move ADJ_SETOFFSET to top level do_adjtimex()
    ...

    Linus Torvalds
     

08 Apr, 2013

2 commits

  • All idle functions in arch/* are more or less the same, plus or
    minus a few bugs, extra instrumentation, tickless support and other
    optional items.

    Implement a generic idle function which resembles the functionality
    found in arch/. Provide weak arch_cpu_idle_* functions which can be
    overridden by the architecture code if needed.
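
    The shape of the interface is roughly (hooks abbreviated):

        void __weak arch_cpu_idle_prepare(void) { }
        void __weak arch_cpu_idle_enter(void) { }
        void __weak arch_cpu_idle_exit(void) { }
        void __weak arch_cpu_idle_dead(void) { }

        /* Fallback: degrade to polling if the arch overrides nothing. */
        void __weak arch_cpu_idle(void)
        {
                cpu_idle_force_poll = 1;
        }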

    Signed-off-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Rusty Russell
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Reviewed-by: Srivatsa S. Bhat
    Cc: Magnus Damm
    Link: http://lkml.kernel.org/r/20130321215233.646635455@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • For now this calls cpu_idle(), but in the long run we want to move
    the cpu bringup code to the core, and therefore we add a state
    argument.
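
    In this initial form the entry point is trivial:

        void cpu_startup_entry(enum cpuhp_state state)
        {
                cpu_idle();
        }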

    Signed-off-by: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Rusty Russell
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Reviewed-by: Srivatsa S. Bhat
    Cc: Magnus Damm
    Link: http://lkml.kernel.org/r/20130321215233.583190032@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner