12 Aug, 2013

1 commit

  • commit 148519120c6d1f19ad53349683aeae9f228b0b8d upstream.

    Revert commit 69a37bea (cpuidle: Quickly notice prediction failure for
    repeat mode), because it has been identified as the source of a
    significant performance regression in v3.8 and later as explained by
    Jeremy Eder:

    We believe we've identified a particular commit to the cpuidle code
    that seems to be impacting performance of a variety of workloads.
    The simplest way to reproduce is using the netperf TCP_RR test, so
    we're using that, on a pair of Sandy Bridge based servers. We also
    have data from a large database setup where performance is also
    measurably/positively impacted, though that test data isn't easily
    shareable.

    Included below are test results from 3 test kernels:

    kernel        reverts
    -----------------------------------------------------------
    1) vanilla    upstream (no reverts)

    2) perfteam2  reverts e11538d1f03914eb92af5a1a378375c05ae8520c

    3) test       reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4 and
                  e11538d1f03914eb92af5a1a378375c05ae8520c

    In summary, netperf TCP_RR numbers improve by approximately 4%
    after reverting 69a37beabf1f0a6705c08e879bdd5d82ff6486c4. When
    69a37beabf1f0a6705c08e879bdd5d82ff6486c4 is included, C0 residency
    never seems to get above 40%. Taking that patch out gets C0 near
    100% quite often, and performance increases.

    The data below are histograms of %c0 residency, sampled at 1-second
    intervals (using turbostat) while under the netperf test.

    - If you look at the first 4 histograms, you can see %c0 residency
    almost entirely in the 30-40% bin.
    - The last pair, which reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4,
    shows %c0 in the 80-100% bins.

    Below each kernel name are netperf TCP_RR trans/s numbers for the
    particular kernel that can be disclosed publicly, comparing the 3
    test kernels. We ran a 4th test with the vanilla kernel where
    we also set /dev/cpu_dma_latency=0 to show the overall impact:
    boosting single-threaded TCP_RR performance more than 11% above
    baseline.

    3.10-rc2 vanilla RX + c0 lock (/dev/cpu_dma_latency=0):
    TCP_RR trans/s 54323.78

    -----------------------------------------------------------
    3.10-rc2 vanilla RX (no reverts)
    TCP_RR trans/s 48192.47

    Receiver %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 59]: ***********************************************************
    40.0000 - 50.0000 [ 1]: *
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    Sender %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 11]: ***********
    40.0000 - 50.0000 [ 49]: *************************************************
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    -----------------------------------------------------------
    3.10-rc2 perfteam2 RX (reverts commit
    e11538d1f03914eb92af5a1a378375c05ae8520c)
    TCP_RR trans/s 49698.69

    Receiver %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 1]: *
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 59]: ***********************************************************
    40.0000 - 50.0000 [ 0]:
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    Sender %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 2]: **
    40.0000 - 50.0000 [ 58]: **********************************************************
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 0]:

    -----------------------------------------------------------
    3.10-rc2 test RX (reverts 69a37beabf1f0a6705c08e879bdd5d82ff6486c4
    and e11538d1f03914eb92af5a1a378375c05ae8520c)
    TCP_RR trans/s 47766.95

    Receiver %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 1]: *
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 27]: ***************************
    40.0000 - 50.0000 [ 2]: **
    50.0000 - 60.0000 [ 0]:
    60.0000 - 70.0000 [ 2]: **
    70.0000 - 80.0000 [ 0]:
    80.0000 - 90.0000 [ 0]:
    90.0000 - 100.0000 [ 28]: ****************************

    Sender %c0
    0.0000 - 10.0000 [ 1]: *
    10.0000 - 20.0000 [ 0]:
    20.0000 - 30.0000 [ 0]:
    30.0000 - 40.0000 [ 11]: ***********
    40.0000 - 50.0000 [ 0]:
    50.0000 - 60.0000 [ 1]: *
    60.0000 - 70.0000 [ 0]:
    70.0000 - 80.0000 [ 3]: ***
    80.0000 - 90.0000 [ 7]: *******
    90.0000 - 100.0000 [ 38]: **************************************

    These results demonstrate that reverting commit
    69a37beabf1f0a6705c08e879bdd5d82ff6486c4 restores the CPU's tendency
    to stay in the more responsive, performant C-states, and thus yields
    measurably better performance.

    Requested-by: Jeremy Eder
    Tested-by: Len Brown
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

31 May, 2013

1 commit

    In tick_nohz_cpu_down_callback(), if the cpu is the one handling
    timekeeping, we must return something that stops the CPU_DOWN_PREPARE
    notifiers and then triggers CPU_DOWN_FAILED on the notifier callbacks
    that have already been called.

    However, traditional errno values are not handled by the notifier
    subsystem unless they are encapsulated using notifier_from_errno().

    Hence the current -EINVAL is misinterpreted and converted to junk after
    notifier_to_errno(), leaving the notifier subsystem to random behaviour,
    such as eventually allowing the cpu to go down.

    Fix this by using the standard NOTIFY_BAD instead.
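    A minimal sketch of the resulting callback (illustrative, not the
    exact upstream diff; variable names such as have_nohz_full_mask
    follow the tick-sched.c of that era):

    static int tick_nohz_cpu_down_callback(struct notifier_block *nfb,
                                           unsigned long action, void *hcpu)
    {
            unsigned int cpu = (unsigned long)hcpu;

            switch (action & ~CPU_TASKS_FROZEN) {
            case CPU_DOWN_PREPARE:
                    /*
                     * The timekeeping CPU can't go down: NOTIFY_BAD
                     * aborts CPU_DOWN_PREPARE and triggers
                     * CPU_DOWN_FAILED on the callbacks already run,
                     * whereas a raw -EINVAL would be mangled by
                     * notifier_to_errno().
                     */
                    if (have_nohz_full_mask && cpu == tick_do_timer_cpu)
                            return NOTIFY_BAD;
                    break;
            }
            return NOTIFY_OK;
    }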

    Signed-off-by: Li Zhong
    Reviewed-by: Srivatsa S. Bhat
    Acked-by: Steven Rostedt
    Cc: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Borislav Petkov
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Ingo Molnar

    Li Zhong
     

16 May, 2013

1 commit

  • Pull timer fixes from Thomas Gleixner:

    - Cure for not using zalloc in the first place, which leads to random
    crashes with CPUMASK_OFF_STACK.

    - Revert a user space visible change which broke udev

    - Add a cpu_online early return that went missing in the new full
    dyntick conversions

    - Plug a long standing race in the timer wheel cpu hotplug code.
    Sigh...

    - Cleanup NOHZ per cpu data on cpu down to prevent stale data on cpu
    up.

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Revert ALWAYS_USE_PERSISTENT_CLOCK compile time optimizations
    timer: Don't reinitialize the cpu base lock during CPU_UP_PREPARE
    tick: Don't invoke tick_nohz_stop_sched_tick() if the cpu is offline
    tick: Cleanup NOHZ per cpu data on cpu down
    tick: Use zalloc_cpumask_var for allocating offstack cpumasks

    Linus Torvalds
     

14 May, 2013

1 commit

    commit 5b39939a4 (nohz: Move ts->idle_calls incrementation into strict
    idle logic) moved code out of tick_nohz_stop_sched_tick() and failed
    to bail out when the cpu is offline. That causes subsequent failures,
    as an offline CPU is supposed to die and not to fiddle with nohz
    magic.

    Return false in can_stop_idle_tick() if the cpu is offline.
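    The shape of the fix, as a sketch (close to the actual change in
    kernel/time/tick-sched.c, but treat the details as illustrative):

    static bool can_stop_idle_tick(int cpu, struct tick_sched *ts)
    {
            /* An offline CPU must never fiddle with nohz: bail out early. */
            if (unlikely(!cpu_online(cpu))) {
                    if (cpu == tick_do_timer_cpu)
                            tick_do_timer_cpu = TICK_DO_TIMER_NONE;
                    return false;
            }

            /* ... the remaining checks are unchanged ... */
            return true;
    }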

    Reported-and-tested-by: Jiri Kosina
    Reported-and-tested-by: Prarit Bhargava
    Cc: Frederic Weisbecker
    Cc: Borislav Petkov
    Cc: Tony Luck
    Cc: x86@kernel.org
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305132138160.2863@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

12 May, 2013

1 commit

    Prarit reported a crash on CPU offline/online. The reason is that on
    CPU down the NOHZ related per-cpu data of the dead cpu is not cleaned
    up. If an interrupt happens at cpu online before the per-cpu tick
    device is registered, the irq_enter() check potentially sees stale
    data and dereferences a NULL pointer.

    Clean up the data after the cpu is dead.
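    A hedged sketch of the cleanup (the function name and call site are
    assumptions for illustration; the point is wiping the per-cpu
    tick_sched state once the CPU is dead):

    static void tick_nohz_cleanup_dead_cpu(int cpu)
    {
            struct tick_sched *ts = &per_cpu(tick_cpu_sched, cpu);

            /* Erase everything irq_enter() might inspect on re-online. */
            memset(ts, 0, sizeof(*ts));
    }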

    Reported-by: Prarit Bhargava
    Cc: stable@vger.kernel.org
    Cc: Mike Galbraith
    Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1305031451561.2886@ionos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

06 May, 2013

1 commit

  • Pull 'full dynticks' support from Ingo Molnar:
    "This tree from Frederic Weisbecker adds a new, (exciting! :-) core
    kernel feature to the timer and scheduler subsystems: 'full dynticks',
    or CONFIG_NO_HZ_FULL=y.

    This feature extends the nohz variable-size timer tick feature from
    idle to busy CPUs (running at most one task) as well, potentially
    reducing the number of timer interrupts significantly.

    This feature got motivated by real-time folks and the -rt tree, but
    the general utility and motivation of full-dynticks runs wider than
    that:

    - HPC workloads get faster: CPUs running a single task should be able
    to utilize a maximum amount of CPU power. A periodic timer tick at
    HZ=1000 can cause a constant overhead of up to 1.0%. This feature
    removes that overhead - and speeds up the system by 0.5%-1.0% on
    typical distro configs even on modern systems.

    - Real-time workload latency reduction: CPUs running critical tasks
    should experience as little jitter as possible. The last remaining
    source of kernel-related jitter was the periodic timer tick.

    - A single task executing on a CPU is a pretty common situation,
    especially with an increasing number of cores/CPUs, so this feature
    helps desktop and mobile workloads as well.

    The cost of the feature is mainly related to increased timer
    reprogramming overhead when a CPU switches its tick period, and thus
    slightly longer to-idle and from-idle latency.

    Configuration-wise a third mode of operation is added to the existing
    two NOHZ kconfig modes:

    - CONFIG_HZ_PERIODIC: [formerly !CONFIG_NO_HZ], now explicitly named
    as a config option. This is the traditional Linux periodic tick
    design: there's a HZ tick going on all the time, regardless of
    whether a CPU is idle or not.

    - CONFIG_NO_HZ_IDLE: [formerly CONFIG_NO_HZ=y], this turns off the
    periodic tick when a CPU enters idle mode.

    - CONFIG_NO_HZ_FULL: this new mode, in addition to turning off the
    tick when a CPU is idle, also slows the tick down to 1 Hz (one
    timer interrupt per second) when only a single task is running on a
    CPU.

    The .config behavior is compatible: existing !CONFIG_NO_HZ and
    CONFIG_NO_HZ=y settings get translated to the new values, without the
    user having to configure anything. CONFIG_NO_HZ_FULL is turned off by
    default.

    This feature is based on a lot of infrastructure work that has been
    steadily going upstream in the last 2-3 cycles: related RCU support
    and non-periodic cputime support in particular is upstream already.

    This tree adds the final pieces and activates the feature. The pull
    request is marked RFC because:

    - it's marked 64-bit only at the moment - the 32-bit support patch is
    small but did not get ready in time.

    - it has a number of fresh commits that came in after the merge
    window. The overwhelming majority of commits are from before the
    merge window, but still some aspects of the tree are fresh and so I
    marked it RFC.

    - it's a pretty wide-reaching feature with lots of effects - and
    while the components have been in testing for some time, the full
    combination is still not very widely used. That it's default-off
    should reduce its regression abilities and obviously there are no
    known regressions with CONFIG_NO_HZ_FULL=y enabled either.

    - the feature is not completely idempotent: there is no 100%
    equivalent replacement for a periodic scheduler/timer tick. In
    particular there's ongoing work to map out and reduce its effects
    on scheduler load-balancing and statistics. This should not impact
    correctness though, there are no known regressions related to this
    feature at this point.

    - it's a pretty ambitious feature that with time will likely be
    enabled by most Linux distros, and we'd like you to give input on
    its design/implementation, if you dislike some aspect we missed.
    Without flaming us to a crisp! :-)

    Future plans:

    - there's ongoing work to reduce 1Hz to 0Hz, to essentially shut off
    the periodic tick altogether when there's a single busy task on a
    CPU. We'd first like 1 Hz to be exposed more widely before we go
    for the 0 Hz target though.

    - once we reach 0 Hz we can remove the periodic tick assumption from
    nr_running>=2 as well, by essentially interrupting busy tasks only
    as frequently as the sched_latency constraints require us to do -
    once every 4-40 msecs, depending on nr_running.

    I am personally leaning towards biting the bullet and doing this in
    v3.10, like the -rt tree this effort has been going on for too long -
    but the final word is up to you as usual.

    More technical details can be found in Documentation/timers/NO_HZ.txt"
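    For reference, the three modes map onto a .config roughly as follows
    (a sketch; the exact surrounding symbols vary by kernel version):

    # CONFIG_HZ_PERIODIC is not set
    # CONFIG_NO_HZ_IDLE is not set
    CONFIG_NO_HZ_FULL=y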

    * 'timers-nohz-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (39 commits)
    sched: Keep at least 1 tick per second for active dynticks tasks
    rcu: Fix full dynticks' dependency on wide RCU nocb mode
    nohz: Protect smp_processor_id() in tick_nohz_task_switch()
    nohz_full: Add documentation.
    cputime_nsecs: use math64.h for nsec resolution conversion helpers
    nohz: Select VIRT_CPU_ACCOUNTING_GEN from full dynticks config
    nohz: Reduce overhead under high-freq idling patterns
    nohz: Remove full dynticks' superfluous dependency on RCU tree
    nohz: Fix unavailable tick_stop tracepoint in dynticks idle
    nohz: Add basic tracing
    nohz: Select wide RCU nocb for full dynticks
    nohz: Disable the tick when irq resume in full dynticks CPU
    nohz: Re-evaluate the tick for the new task after a context switch
    nohz: Prepare to stop the tick on irq exit
    nohz: Implement full dynticks kick
    nohz: Re-evaluate the tick from the scheduler IPI
    sched: New helper to prevent from stopping the tick in full dynticks
    sched: Kick full dynticks CPU that have more than one task enqueued.
    perf: New helper to prevent full dynticks CPUs from stopping tick
    perf: Kick full dynticks CPU if events rotation is needed
    ...

    Linus Torvalds
     

04 May, 2013

1 commit

  • The scheduler doesn't yet fully support environments
    with a single task running without a periodic tick.

    In order to ensure we still maintain the duties of scheduler_tick(),
    keep at least 1 tick per second.

    This makes sure that we keep the progression of various scheduler
    accounting and background maintenance even with a very low
    granularity. Examples include cpu load, sched average, CFS entity
    vruntime, avenrun and events such as load balancing, amongst other
    details handled in sched_class::task_tick().

    This limitation will be removed in the future once we get
    these individual items to work in full dynticks CPUs.
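    The mechanism boils down to capping tick deferment at one second; a
    sketch close to the upstream helper (treat the details as
    illustrative):

    u64 scheduler_tick_max_deferment(void)
    {
            struct rq *rq = this_rq();
            unsigned long next, now = ACCESS_ONCE(jiffies);

            /* Never defer the tick past one second after the last one. */
            next = rq->last_sched_tick + HZ;

            if (time_before_eq(next, now))
                    return 0;

            return jiffies_to_usecs(next - now) * NSEC_PER_USEC;
    }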

    Suggested-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Christoph Lameter
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

29 Apr, 2013

1 commit

    I saw the following error when testing the latest nohz code on
    Power:

    [ 85.295384] BUG: using smp_processor_id() in preemptible [00000000] code: rsyslogd/3493
    [ 85.295396] caller is .tick_nohz_task_switch+0x1c/0xb8
    [ 85.295402] Call Trace:
    [ 85.295408] [c0000001fababab0] [c000000000012dc4] .show_stack+0x110/0x25c (unreliable)
    [ 85.295420] [c0000001fababba0] [c0000000007c4b54] .dump_stack+0x20/0x30
    [ 85.295430] [c0000001fababc10] [c00000000044eb74] .debug_smp_processor_id+0xf4/0x124
    [ 85.295438] [c0000001fababca0] [c0000000000d7594] .tick_nohz_task_switch+0x1c/0xb8
    [ 85.295447] [c0000001fababd20] [c0000000000b9748] .finish_task_switch+0x13c/0x160
    [ 85.295455] [c0000001fababdb0] [c0000000000bbe50] .schedule_tail+0x50/0x124
    [ 85.295463] [c0000001fababe30] [c000000000009dc8] .ret_from_fork+0x4/0x54

    The code below moves the test into the local_irq_save/restore
    section to avoid the above complaint.
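    A sketch of the resulting function: the tick_nohz_full_cpu() test,
    which calls smp_processor_id(), now sits inside the irq-disabled
    region (paraphrased from the patch):

    void tick_nohz_task_switch(struct task_struct *tsk)
    {
            unsigned long flags;

            local_irq_save(flags);

            if (!tick_nohz_full_cpu(smp_processor_id()))
                    goto out;

            if (tick_nohz_tick_stopped() && !can_stop_full_tick())
                    tick_nohz_full_kick();

    out:
            local_irq_restore(flags);
    }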

    Signed-off-by: Li Zhong
    Acked-by: Frederic Weisbecker
    Cc: Peter Zijlstra
    Cc: Paul McKenney
    Link: http://lkml.kernel.org/r/1367119558.6391.34.camel@ThinkPad-T5421.cn.ibm.com
    Signed-off-by: Ingo Molnar

    Li Zhong
     

26 Apr, 2013

1 commit

    One testbox of mine (Intel Nehalem, 16-way) uses MWAIT for its idle
    routine, which apparently can break out of its idle loop rather
    frequently.

    In that case NO_HZ_FULL=y kernels show high ksoftirqd overhead and constant
    context switching, because tick_nohz_stop_sched_tick() will, if
    delta_jiffies == 0, mis-identify this as a timer event - activating the
    TIMER_SOFTIRQ, which wakes up ksoftirqd.

    Fix this by treating delta_jiffies == 0 the same way we treat other short
    wakeups, delta_jiffies == 1.
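    In tick_nohz_stop_sched_tick() terms, the change amounts to a
    comparison tweak; paraphrased (not the literal diff, and the
    keep-the-tick path is elided):

    /*
     * A wakeup landing in the current jiffy is a short wakeup, not an
     * already-expired timer: keep the tick instead of raising
     * TIMER_SOFTIRQ and bouncing through ksoftirqd.
     */
    if (!ts->tick_stopped && delta_jiffies <= 1)
            goto keep_tick;   /* label assumed for illustration */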

    Cc: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

23 Apr, 2013

5 commits

    It's not always obvious why the full dynticks subsystem
    doesn't stop the tick: whether this is due to kthreads,
    posix timers, perf events, etc...

    These new tracepoints are here to help the user diagnose
    the failures and test this feature.
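    With these in place, the reason a tick couldn't be stopped can be
    observed live, e.g. by enabling the tick_stop event (assuming debugfs
    tracing is mounted, under
    /sys/kernel/debug/tracing/events/timer/tick_stop) and reading the
    trace buffer.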

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • When a task is scheduled in, it may have some properties
    of its own that could make the CPU reconsider the need for
    the tick: posix cpu timers, perf events, ...

    So notify the full dynticks subsystem when a task gets
    scheduled in and re-check the tick dependency at this
    stage. This is done through a self IPI to avoid messing
    with any current lock scenario.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
    Interrupt exit is a natural place to stop the tick: by then, all
    events liable to update the tick dependency, happening before or
    during the irq, have occurred. It also makes sure that any check
    on the tick dependency is well ordered against dynticks kick IPIs.

    Bring in the infrastructure that performs the tick dependency
    checks on irq exit, and shut the tick down if these checks show
    that we can do it safely.
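    The resulting hook, sketched (names per the tick-sched.c of that
    era; treat the details as illustrative):

    void tick_nohz_irq_exit(void)
    {
            struct tick_sched *ts = &__get_cpu_var(tick_cpu_sched);

            if (ts->inidle)
                    __tick_nohz_idle_enter(ts);     /* idle dynticks path */
            else
                    tick_nohz_full_stop_tick(ts);   /* full dynticks path */
    }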

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • Implement the full dynticks kick that is performed from
    IPIs sent by various subsystems (scheduler, posix timers, ...)
    when they want to notify about a new event that may
    reconsider the dependency on the tick.

    Most of the time, such an event ends up restarting the tick.

    (Part of the design with subsystems providing *_can_stop_tick()
    helpers suggested by Peter Zijlstra a while ago).
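    The kick machinery itself is a small per-cpu irq_work; a sketch
    close to the upstream shape (illustrative):

    static void nohz_full_kick_work_func(struct irq_work *work)
    {
            /* Back on the target CPU: re-evaluate the tick dependency. */
            tick_nohz_full_check();
    }

    static DEFINE_PER_CPU(struct irq_work, nohz_full_kick_work) = {
            .func = nohz_full_kick_work_func,
    };

    void tick_nohz_full_kick(void)
    {
            if (tick_nohz_full_cpu(smp_processor_id()))
                    irq_work_queue(&__get_cpu_var(nohz_full_kick_work));
    }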

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
    The scheduler IPI is used by the scheduler to kick
    full dynticks CPUs asynchronously when more than one
    task is running or when a new timer list timer is
    enqueued. This way the destination CPU can decide
    to restart the tick to handle this new situation.

    Now let's call that kick in the scheduler IPI.

    (Reusing the scheduler IPI rather than implementing
    a new IPI was suggested by Peter Zijlstra a while ago)
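    Sketched against kernel/sched/core.c (the surrounding IPI handling
    is elided and paraphrased; treat as illustrative):

    void scheduler_ipi(void)
    {
            /* ... fast-path return when there is nothing to do ... */

            irq_enter();
            tick_nohz_full_check();    /* may restart the stopped tick */
            sched_ttwu_pending();      /* ... existing IPI work ... */
            irq_exit();
    }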

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

19 Apr, 2013

4 commits

  • Provide a new kernel config that defaults all CPUs to be part
    of the full dynticks range, except the boot one for timekeeping.

    This default setting is overridden by the nohz_full= boot option
    if passed by the user.

    This is helpful for those who don't need a fine-grained range
    of full dynticks CPUs, and also for automated testing.
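    In practice this is the CONFIG_NO_HZ_FULL_ALL option: with it
    enabled, booting without any nohz_full= parameter behaves as if the
    parameter had named every CPU except the boot CPU.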

    Suggested-by: Ingo Molnar
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
    We need full dynticks CPUs to also be RCU nocb CPUs, so
    that we don't have to keep the tick around to handle RCU
    callbacks.

    Make sure the range passed to the nohz_full= boot
    parameter is a subset of rcu_nocbs=.

    The CPUs that fail to meet this requirement will be
    excluded from the nohz_full range. This is checked
    early at boot time, before any CPU has the opportunity
    to stop its tick.
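    For example, booting with "nohz_full=1-7 rcu_nocbs=1-7" satisfies
    the constraint, whereas with "nohz_full=1-7 rcu_nocbs=2-7" CPU 1
    would be dropped from the effective nohz_full range.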

    Suggested-by: Steven Rostedt
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
    The timekeeping job must be able to run early on boot
    because there may be some pre-SMP (and thus pre-initcall)
    components that rely on it. The IO-APIC is one such user,
    as it tests the timer health by watching jiffies progression.

    Given that this happens before we know the initial online
    set, we can't rely on it to select a timekeeper. We need
    one before SMP time, otherwise we simply crash on boot.

    To fix this and keep things simple for now, force the boot CPU
    outside of the full dynticks range in any case, and do this early
    at kernel parameter parsing time.

    We might want a trickier solution later, especially for aSMP
    architectures that need to assign housekeeping tasks to arbitrary
    low power CPUs.

    But it's still first pass KISS time for now.

    Reviewed-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
    Provide two new helpers in order to notify the full dynticks CPUs
    about internal system changes after which they may want to reconsider
    the state of their tick. Some practical examples include posix cpu
    timers, the perf tick and the sched clock tick.

    For now the notifying handler, implemented through IPIs, is a stub
    that will be filled in when we get the tick stop/restart
    infrastructure in.

    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

16 Apr, 2013

1 commit

  • "Extended nohz" was used as a naming base for the full dynticks
    API and Kconfig symbols. It reflects the fact the system tries
    to stop the tick in more places than just idle.

    But that "extended" name is a bit opaque and vague. Rename it to
    "full" makes it clearer what the system tries to do under this
    config: try to shutdown the tick anytime it can. The various
    constraints that prevent that to happen shouldn't be considered
    as fundamental properties of this feature but rather technical
    issues that may be solved in the future.

    Reported-by: Ingo Molnar
    Signed-off-by: Frederic Weisbecker
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

03 Apr, 2013

2 commits

    Given that we apply a few restrictions on the full dynticks
    CPUs range (keep an online timekeeper outside the range,
    then in the future have the range be a subset of the RCU nocb
    CPUs), let's print the final resulting range of full dynticks
    CPUs to the user so that they know what's really going to run.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     
  • We are planning to convert the dynticks Kconfig options layout
    into a choice menu. The user must be able to easily pick
    any of the following implementations: constant periodic tick,
    idle dynticks, full dynticks.

    As this implies a mutual exclusion, the two dynticks implementations
    need to converge on the selection of a common Kconfig option in order
    to ease the sharing of a common infrastructure.

    It would thus seem pretty natural to reuse CONFIG_NO_HZ to
    that end. It already implements all the idle dynticks code
    and the full dynticks depends on all that code for now.
    So ideally the choice menu would propose CONFIG_NO_HZ_IDLE and
    CONFIG_NO_HZ_EXTENDED then both would select CONFIG_NO_HZ.

    On the other hand we want to stay backward compatible: if
    CONFIG_NO_HZ is set in an older config file, we want to
    enable CONFIG_NO_HZ_IDLE by default.

    But we can't afford both at the same time or we run into
    a circular dependency:

    1) CONFIG_NO_HZ_IDLE and CONFIG_NO_HZ_EXTENDED both select
    CONFIG_NO_HZ
    2) If CONFIG_NO_HZ is set, we default to CONFIG_NO_HZ_IDLE

    We might be able to support that from Kconfig/Kbuild but it
    may not be wise to introduce such a confusing behaviour.

    So to solve this, create a new CONFIG_NO_HZ_COMMON option
    which gathers the common code between idle and full dynticks
    (that common code for now is simply the idle dynticks code)
    and select it from both referring Kconfig options.

    Then we'll later create CONFIG_NO_HZ_IDLE and map CONFIG_NO_HZ
    to it for backward compatibility.
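    A simplified sketch of the target layout (the menu entries' exact
    wording is illustrative):

    config NO_HZ_COMMON
           bool

    config NO_HZ_IDLE
           bool "Idle dynticks system (tickless idle)"
           select NO_HZ_COMMON

    config NO_HZ_EXTENDED
           bool "Full dynticks system"
           select NO_HZ_COMMON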

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

21 Mar, 2013

2 commits

  • This way the full nohz CPUs can safely run with the tick
    stopped with a guarantee that somebody else is taking
    care of the jiffies and GTOD progression.

    Once the duty is attributed to a CPU, it won't change. Also, that
    CPU can't enter dynticks idle mode or be hot unplugged.

    This may later be improved from a power consumption POV. At
    least we should be able to share the duty amongst all CPUs
    outside the full dynticks range. Then the duty could even be
    shared with full dynticks CPUs when those can't stop their
    tick for any reason.

    But let's start with that very simple approach first.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    [fix have_nohz_full_mask offcase]
    Signed-off-by: Steven Rostedt

    Frederic Weisbecker
     
    For extreme usecases such as Real Time or HPC, having
    the ability to shut down the tick when a single task runs
    on a CPU is a desired feature:

    * Reducing the number of interrupts improves throughput
    for CPU-bound tasks. The CPU is less distracted from its
    real job, from both an execution time and a cache point
    of view.

    * This also improves latency response, as we have fewer
    critical sections.

    Start by introducing a very simple interface to define
    full dynticks CPUs: a boot-time-defined cpumask passed
    through the "nohz_extended=" kernel parameter. CPUs that
    are part of this range will have their tick shut down
    whenever possible, provided they run a single task and
    they don't do kernel activity that requires the periodic
    tick. These details will be documented later in
    Documentation/*

    An online CPU must be kept outside this range to handle the
    timekeeping.
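    For example, booting with "nohz_extended=1-7" on an 8-CPU machine
    marks CPUs 1-7 as full dynticks candidates and leaves CPU 0 to
    handle timekeeping. (The parameter was later renamed to nohz_full=,
    as the 16 Apr entry above describes.)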

    Suggested-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Geoff Levand
    Cc: Gilad Ben Yossef
    Cc: Hakan Akkan
    Cc: Ingo Molnar
    Cc: Kevin Hilman
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

01 Mar, 2013

1 commit

  • Pull thermal management updates from Zhang Rui:
    "Highlights:

    - introduction of Dove thermal sensor driver.

    - introduction of Kirkwood thermal sensor driver.

    - introduction of intel_powerclamp thermal cooling device driver.

    - add interrupt and DT support for rcar thermal driver.

    - add thermal emulation support, which allows platform thermal
    drivers to do software/hardware emulation for thermal issues."

    * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/rzhang/linux: (36 commits)
    thermal: rcar: remove __devinitconst
    thermal: return an error on failure to register thermal class
    Thermal: rename thermal governor Kconfig option to avoid generic naming
    thermal: exynos: Use the new thermal trend type for quick cooling action.
    Thermal: exynos: Add support for temperature falling interrupt.
    Thermal: Dove: Add Thermal sensor support for Dove.
    thermal: Add support for the thermal sensor on Kirkwood SoCs
    thermal: rcar: add Device Tree support
    thermal: rcar: remove machine_power_off() from rcar_thermal_notify()
    thermal: rcar: add interrupt support
    thermal: rcar: add read/write functions for common/priv data
    thermal: rcar: multi channel support
    thermal: rcar: use mutex lock instead of spin lock
    thermal: rcar: enable CPCTL to use hardware TSC deciding
    thermal: rcar: use parenthesis on macro
    Thermal: fix a build warning when CONFIG_THERMAL_EMULATION cleared
    Thermal: fix a wrong comment
    thermal: sysfs: Add a new sysfs node emul_temp for thermal emulation
    PM: intel_powerclamp: off by one in start_power_clamp()
    thermal: exynos: Miscellaneous fixes to support falling threshold interrupt
    ...

    Linus Torvalds
     

20 Feb, 2013

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "Main changes:

    - scheduler side full-dynticks (user-space execution is undisturbed
    and receives no timer IRQs) preparation changes that convert the
    cputime accounting code to be full-dynticks ready, from Frederic
    Weisbecker.

    - Initial sched.h split-up changes, by Clark Williams

    - select_idle_sibling() performance improvement by Mike Galbraith:

    " 1 tbench pair (worst case) in a 10 core + SMT package:

    pre 15.22 MB/sec 1 procs
    post 252.01 MB/sec 1 procs "

    - sched_rr_get_interval() ABI fix/change. We think this detail is not
    used by apps (so it's not an ABI in practice), but let's keep it
    under observation.

    - misc RT scheduling cleanups, optimizations"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    sched/rt: Add header to
    cputime: Remove irqsave from seqlock readers
    sched, powerpc: Fix sched.h split-up build failure
    cputime: Restore CPU_ACCOUNTING config defaults for PPC64
    sched/rt: Move rt specific bits into new header file
    sched/rt: Add a tuning knob to allow changing SCHED_RR timeslice
    sched: Move sched.h sysctl bits into separate header
    sched: Fix signedness bug in yield_to()
    sched: Fix select_idle_sibling() bouncing cow syndrome
    sched/rt: Further simplify pick_rt_task()
    sched/rt: Do not account zero delta_exec in update_curr_rt()
    cputime: Safely read cputime of full dynticks CPUs
    kvm: Prepare to add generic guest entry/exit callbacks
    cputime: Use accessors to read task cputime stats
    cputime: Allow dynamic switch between tick/virtual based cputime accounting
    cputime: Generic on-demand virtual cputime accounting
    cputime: Move default nsecs_to_cputime() to jiffies based cputime file
    cputime: Librarize per nsecs resolution cputime definitions
    cputime: Avoid multiplication overflow on utime scaling
    context_tracking: Export context state for generic vtime
    ...

    Fix up conflict in kernel/context_tracking.c due to comment additions.

    Linus Torvalds
     

05 Feb, 2013

1 commit

  • Conflicts:
    kernel/irq_work.c

    Add support for printk on full dynticks CPUs.

    * Don't stop tick with irq works pending. This
    fix is generally useful and concerns archs that
    can't raise self IPIs.

    * Flush irq works before CPU offlining.

    * Introduce "lazy" irq works that can wait for the
    next tick to be executed, unless it's stopped.

    * Implement klogd wake up using irq work. This
    removes the ad-hoc printk_tick()/printk_needs_cpu()
    hooks and makes it work even in dynticks mode.

    Signed-off-by: Frederic Weisbecker

    Frederic Weisbecker
     

28 Jan, 2013

1 commit

    Allow dynamic switching between tick based and virtual based
    cputime accounting. This way we can provide a kind of "on-demand"
    virtual based cputime accounting. In this mode, the kernel relies
    on the context tracking subsystem to dynamically probe kernel/user
    boundaries.

    This is in preparation for being able to stop the timer tick in
    more places than just the idle state. Doing so will depend on
    CONFIG_VIRT_CPU_ACCOUNTING_GEN which makes it possible to account
    the cputime without the tick by hooking on kernel/user boundaries.

    Depending on whether the tick is stopped or not, we can switch
    between tick and vtime based accounting anytime in order to
    minimize the overhead associated with user hooks.

    Signed-off-by: Frederic Weisbecker
    Cc: Andrew Morton
    Cc: Ingo Molnar
    Cc: Li Zhong
    Cc: Namhyung Kim
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner

    Frederic Weisbecker
     

12 Dec, 2012

2 commits

  • Pull core timer changes from Ingo Molnar:
    "It contains continued generic-NOHZ work by Frederic and smaller
    cleanups."

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    time: Kill xtime_lock, replacing it with jiffies_lock
    clocksource: arm_generic: use this_cpu_ptr per-cpu helper
    clocksource: arm_generic: use integer math helpers
    time/jiffies: Make clocksource_jiffies static
    clocksource: clean up parse_pmtmr()
    tick: Correct the comments for tick_sched_timer()
    tick: Conditionally build nohz specific code in tick handler
    tick: Consolidate tick handling for high and low res handlers
    tick: Consolidate timekeeping handling code

    Linus Torvalds
     
  • …it.kernel.org/pub/scm/linux/kernel/git/tip/tip

    Pull trivial fix branches from Ingo Molnar.

    Cleanup in __get_key_name, and a timer comment fixlet.

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    lockdep: Use KSYM_NAME_LEN'ed buffer for __get_key_name()

    * 'timers-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    timers, sched: Correct the comments for tick_sched_timer()

    Linus Torvalds
     

18 Nov, 2012

3 commits

  • klogd is woken up asynchronously from the tick in order
    to do it safely.

    However, if printk is called when the tick is stopped, the reader
    won't be woken up until the next interrupt, which might not fire
    for a while. As a result, the user may miss some messages.

    To fix this, let's implement the printk tick using a lazy irq work.
    That subsystem takes care of the timer tick state and can
    fix things up accordingly.
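    A sketch of the approach, simplified from the actual patch
    (kernel/printk.c; treat the details as illustrative):

    static void wake_up_klogd_work_func(struct irq_work *irq_work)
    {
            wake_up_interruptible(&log_wait);
    }

    static DEFINE_PER_CPU(struct irq_work, wake_up_klogd_work) = {
            .func  = wake_up_klogd_work_func,
            /* Lazy: ride the next tick if it runs, self-IPI if stopped. */
            .flags = IRQ_WORK_LAZY,
    };

    void wake_up_klogd(void)
    {
            preempt_disable();
            if (waitqueue_active(&log_wait))
                    irq_work_queue(&__get_cpu_var(wake_up_klogd_work));
            preempt_enable();
    }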

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Paul Gortmaker

    Frederic Weisbecker
     
  • Don't stop the tick if we have pending irq works on the
    queue, otherwise if the arch can't raise self-IPIs, we may not
    find an opportunity to execute the pending works for a while.
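    The check amounts to one more condition in the "must we keep the
    tick?" test of tick_nohz_stop_sched_tick(); paraphrased:

    if (rcu_needs_cpu(cpu, &rcu_delta_jiffies) || printk_needs_cpu(cpu) ||
        arch_needs_cpu(cpu) || irq_work_needs_cpu()) {
            /* Pending work needs this CPU: tick again in one jiffy. */
            next_jiffies  = last_jiffies + 1;
            delta_jiffies = 1;
    }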

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Paul Gortmaker

    Frederic Weisbecker
     
  • We need some quick way to check if the CPU has stopped
    its tick. This will be useful to implement the printk tick
    using the irq work subsystem.
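    The helper is essentially a per-cpu flag read; a sketch matching the
    upstream shape:

    int tick_nohz_tick_stopped(void)
    {
            return __this_cpu_read(tick_cpu_sched.tick_stopped);
    }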

    Signed-off-by: Frederic Weisbecker
    Acked-by: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: Andrew Morton
    Cc: Paul Gortmaker

    Frederic Weisbecker
     

15 Nov, 2012

1 commit

    Prediction of the future is difficult, and when the cpuidle governor's
    prediction fails, it may choose a shallower C-state than it should.
    Quickly noticing and detecting such failures becomes important for
    power saving.

    The cpuidle menu governor has a method to detect a repeat pattern: if
    the last 8 C-state residencies are consecutive and identical (or very
    close), it predicts that the next residency will be the same.

    A real case is the turbostat utility (tools/power/x86/turbostat) at
    kernel 3.3 or earlier. turbostat reads 10 registers one by one on
    Sandy Bridge, so it generates 10 IPIs that wake up idle CPUs. The
    menu governor therefore predicts repeat mode and expects another IPI
    to wake the idle CPU soon, so it keeps the CPU in the shallow C1
    state even though the CPU is totally idle. However, turbostat's next
    batch of 10 register reads only comes after a 5-second sleep by
    default, so the idle CPU stays in C1 for a long time, despite being
    idle, until a break event occurs. On an idle Sandy Bridge system
    running "./turbostat -v", we notice that deep C-state residency
    dangles between 70% and 99%. With the patched kernel, deep C-state
    residency stays at >99.98%.

    With this patch, a timer is armed when the menu governor detects
    repeat mode and chooses a shallow C-state. The timeout is set to a
    value greater than the predicted residency, and we conclude that the
    repeat-mode prediction failed if the timer fires. When repeat mode
    happens as expected, the CPU is woken from the C-state before the
    timer fires and cancels the timer itself. When repeat mode does not
    happen, the timer expires and the menu governor quickly notices the
    failed prediction, then re-evaluates deeper C-states.
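    A hedged sketch of the guard timer (identifier names such as
    menu_hrtimer, repeat_mode, shallow_state_chosen and predicted_us are
    assumptions for illustration):

    if (repeat_mode && shallow_state_chosen) {
            /* Arm the guard a bit beyond the predicted residency. */
            hrtimer_start(&per_cpu(menu_hrtimer, cpu),
                          ns_to_ktime(predicted_us * NSEC_PER_USEC * 2),
                          HRTIMER_MODE_REL_PINNED);
    }
    /*
     * A real wakeup cancels the timer. If the timer fires instead, the
     * repeat-mode prediction failed: re-evaluate deeper C-states.
     */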

    Below is another case which will clearly show the patch much benefit:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <signal.h>
    #include <pthread.h>

    volatile int *shutdown;
    volatile long *count;
    int delay = 20;
    int loop = 8;

    void usage(void)
    {
        fprintf(stderr,
            "Usage: idle_predict [options]\n"
            " --help   -h  Print this help\n"
            " --thread -n  Thread number\n"
            " --loop   -l  Loop times in shallow Cstate\n"
            " --delay  -t  Sleep time (uS) in shallow Cstate\n");
    }

    void *simple_loop(void *arg)
    {
        int idle_num = 1;

        while (!(*shutdown)) {
            *count = *count + 1;
            if (idle_num % loop)
                usleep(delay);
            else {
                /* sleep 1 second */
                usleep(1000000);
                idle_num = 0;
            }
            idle_num++;
        }
        return NULL;
    }

    static void sighand(int sig)
    {
        *shutdown = 1;
    }

    int main(int argc, char *argv[])
    {
        sigset_t sigset;
        int signum = SIGALRM;
        int i, c, thread_num = 8;
        pthread_t pt[1024];

        static char optstr[] = "n:l:t:h";

        while ((c = getopt(argc, argv, optstr)) != EOF)
            switch (c) {
            case 'n':
                thread_num = atoi(optarg);
                break;
            case 'l':
                loop = atoi(optarg);
                break;
            case 't':
                delay = atoi(optarg);
                break;
            case 'h':
            default:
                usage();
                exit(1);
            }

        printf("thread=%d,loop=%d,delay=%d\n", thread_num, loop, delay);
        count = malloc(sizeof(long));
        shutdown = malloc(sizeof(int));
        *count = 0;
        *shutdown = 0;

        sigemptyset(&sigset);
        sigaddset(&sigset, signum);
        sigprocmask(SIG_BLOCK, &sigset, NULL);
        signal(SIGINT, sighand);
        signal(SIGTERM, sighand);

        for (i = 0; i < thread_num; i++)
            pthread_create(&pt[i], NULL, simple_loop, NULL);

        for (i = 0; i < thread_num; i++)
            pthread_join(pt[i], NULL);

        exit(0);
    }

    Get powertop v2 from git://github.com/fenrus75/powertop and build it.
    After building the above test application, run them:
    The test platform can be Intel Sandy Bridge or another recent platform.
    #./idle_predict -l 10 &
    #./powertop

    We will find that deep C-state residency dangles between 40%~100% and
    that much time is spent in the C1 state. This is because the menu
    governor wrongly predicts that repeat mode will continue, so it
    chooses the shallow C1 state even though it has the chance to sleep
    1 second in a deep C-state.

    After patching the kernel, we find that deep C-state residency keeps
    at >99.6%.

    Signed-off-by: Rik van Riel
    Signed-off-by: Youquan Song
    Signed-off-by: Rafael J. Wysocki

    Youquan Song
     

01 Nov, 2012

1 commit

    In the comments of function tick_sched_timer(), the claim that
    "timer->base->cpu_base->lock" is held is not right.

    In function __run_hrtimer(), the cpu_base->lock is unlocked
    before timer->function() is called.
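    The relevant lines in __run_hrtimer() (kernel/hrtimer.c), sketched:
    the lock is dropped around the callback, so tick_sched_timer() runs
    without cpu_base->lock held.

    raw_spin_unlock(&cpu_base->lock);
    restart = fn(timer);
    raw_spin_lock(&cpu_base->lock);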

    Signed-off-by: liu chuansheng
    Cc: fei.li@intel.com
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build
    Signed-off-by: Thomas Gleixner

    Chuansheng Liu
     

24 Oct, 2012

1 commit

    In the comments of function tick_sched_timer(), the claim that
    "timer->base->cpu_base->lock" is held is not right.

    In function __run_hrtimer(), the cpu_base->lock is unlocked
    before timer->function() is called.

    Signed-off-by: liu chuansheng
    Cc: fei.li@intel.com
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1351098455.15558.1421.camel@cliu38-desktop-build
    Signed-off-by: Ingo Molnar

    Chuansheng Liu