04 Jul, 2016

1 commit

  • Snooze is a poll idle state on the powernv and pseries platforms. Snooze
    has a timeout so that if a CPU stays in snooze for longer than the target
    residency of the next available idle state, it exits, giving the
    cpuidle governor a chance to re-evaluate and promote the CPU to a
    deeper idle state. Therefore, whenever snooze exits due to this
    timeout, its last_residency will be the target_residency of the next
    deeper state.

    Commit e93e59ce5b85 "cpuidle: Replace ktime_get() with local_clock()"
    changed the math around the last_residency calculation. Specifically,
    while converting the last_residency value from nanoseconds to
    microseconds, it performs a right shift by 10 (a division by 1024
    rather than 1000). Because of that, in snooze timeout exit scenarios
    the calculated last_residency is roughly 2.3% less than the
    target_residency of the next available state. This pattern is picked
    up by get_typical_interval() in the menu governor and therefore
    expected_interval in menu_select() is frequently less than the
    target_residency of any state other than snooze.

    Due to this we are entering snooze at a higher rate, thereby
    affecting the single thread performance.

    Fix this by using more precise division via ktime_us_delta().

    Fixes: e93e59ce5b85 "cpuidle: Replace ktime_get() with local_clock()"
    Reported-by: Anton Blanchard
    Bisected-by: Shilpasri G Bhat
    Signed-off-by: Shreyas B. Prabhu
    Acked-by: Daniel Lezcano
    Acked-by: Balbir Singh
    Signed-off-by: Rafael J. Wysocki

    Shreyas B. Prabhu
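
The arithmetic behind this regression can be checked in isolation. The following is a minimal user-space sketch, not kernel code: `residency_us_shift()`, `residency_us_div()` and `reaches_target()` are hypothetical helpers that only mimic the two conversions, and the 100 us target residency used in the note below is an assumed value.

```c
#include <stdint.h>

/* Nanoseconds to microseconds via a 10-bit right shift (divides by 1024). */
uint64_t residency_us_shift(uint64_t ns)
{
	return ns >> 10;
}

/* Nanoseconds to microseconds via exact integer division by 1000. */
uint64_t residency_us_div(uint64_t ns)
{
	return ns / 1000;
}

/*
 * Returns 1 if the measured residency (in us) is at least the target
 * residency (in us) of the next deeper state, 0 otherwise.
 */
int reaches_target(uint64_t measured_us, uint64_t target_us)
{
	return measured_us >= target_us;
}
```

With an assumed target residency of 100 us and a snooze period of exactly 100000 ns, the shifted conversion yields 97 us, just under the target, while the exact division yields 100 us.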
     

18 May, 2016

1 commit

  • Commit 0b89e9aa2856 (cpuidle: delay enabling interrupts until all
    coupled CPUs leave idle) rightfully fixed a regression by letting
    the coupled idle state framework handle local interrupt enabling
    when the CPU is exiting an idle state.

    The current code checks if the idle state is coupled and, if so, it
    will let the coupled code enable interrupts. This way, it can
    decrement the ready-count before handling the interrupt. This
    mechanism prevents the other CPUs from waiting for a CPU which is
    handling interrupts.

    But the check is done against the state index returned by the back
    end driver's ->enter functions which could be different from the
    initial index passed as parameter to the cpuidle_enter_state()
    function.

    entered_state = target_state->enter(dev, drv, index);

    [ ... ]

    if (!cpuidle_state_is_coupled(drv, entered_state))
            local_irq_enable();

    [ ... ]

    If the 'index' is referring to a coupled idle state but the
    'entered_state' is *not* coupled, then the interrupts are enabled
    again. All CPUs blocked on the sync barrier may then busy-loop longer
    if that CPU has interrupts to handle before decrementing the
    ready-count, which consumes more energy than the idle state saves.

    Fixes: 0b89e9aa2856 (cpuidle: delay enabling interrupts until all coupled CPUs leave idle)
    Signed-off-by: Daniel Lezcano
    Cc: 3.15+ # 3.15+
    [ rjw: Subject & changelog ]
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
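
The difference between testing the requested index and the returned index can be modelled in a few lines. This is a simplified sketch under the assumption that the fix is to test the index originally passed in; `struct idle_state` and the `core_enables_irqs*()` helpers are hypothetical names, not kernel APIs.

```c
#include <stdbool.h>

/* Hypothetical, stripped-down stand-in for a cpuidle state. */
struct idle_state {
	bool coupled;	/* true if CPUs enter and leave this state together */
};

/* The core re-enables interrupts itself only for non-coupled states. */
bool core_enables_irqs(const struct idle_state *states, int checked_index)
{
	return !states[checked_index].coupled;
}

/* Behaviour described as buggy: decide using the index the driver returned. */
bool core_enables_irqs_old(const struct idle_state *states,
			   int index, int entered_state)
{
	(void)index;
	return core_enables_irqs(states, entered_state);
}

/* Assumed fix: decide using the index that was originally requested. */
bool core_enables_irqs_new(const struct idle_state *states,
			   int index, int entered_state)
{
	(void)entered_state;
	return core_enables_irqs(states, index);
}
```

If state 2 is coupled, state 1 is not, and the driver is asked for state 2 but reports state 1, the old check lets the core enable interrupts early while the new one leaves that to the coupled code.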
     

26 Apr, 2016

1 commit

  • ktime_get() can have a non-negligible overhead, so use local_clock()
    instead.

    To compare ktime_get() and local_clock(), a quick hack was added that
    triggers, via debugfs, 10000 calls to each of ktime_get() and
    local_clock() and measures the elapsed time.

    The average, minimum and maximum are then computed for each call.

    From userspace, the test above was called 100 times every 2 seconds.

    So, ktime_get() and local_clock() have been called 1000000 times in
    total.

    The results are:

    ktime_get():
    ============
    * average: 101 ns (stddev: 27.4)
    * maximum: 38313 ns
    * minimum: 65 ns

    local_clock():
    ==============
    * average: 60 ns (stddev: 9.8)
    * maximum: 13487 ns
    * minimum: 46 ns

    local_clock() is faster and more stable.

    Even if it is a drop in the ocean, replacing ktime_get() with
    local_clock() saves 80ns at idle time (entry + exit). And in some
    circumstances, especially when several CPUs are racing for clock
    access, we save tens of microseconds.

    The idle duration resulting from the diff is converted from nanoseconds
    to microseconds. This could be done with an integer division (div 1000),
    which is an expensive operation, or with a 10-bit shift (div 1024), which
    is fast but imprecise.

    The following table gives some results at the limits.

    -------------------------------------
    | nsec | div(1000)    | div(1024)   |
    -------------------------------------
    | 1e3  | 1 usec       | 976 nsec    |
    -------------------------------------
    | 1e6  | 1000 usec    | 976 usec    |
    -------------------------------------
    | 1e9  | 1000000 usec | 976562 usec |
    -------------------------------------

    There is a linear deviation of 2.34%. This loss of precision is acceptable
    in the context of the resulting diff, which is used for statistics. These
    statistics are processed to estimate the duration of the next idle period,
    which in turn drives the idle state selection. The selection criteria take
    into account the next duration based on large intervals, represented by the
    idle state's target residency.

    The 2^10 division is sufficient because its error relative to the 1e3
    division is lost among the other approximations made when computing the
    next idle duration.

    Signed-off-by: Daniel Lezcano
    Acked-by: Peter Zijlstra (Intel)
    [ rjw: Subject ]
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
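
The 2.34% figure quoted above follows directly from the ratio of the two divisors. As a worked check, for any duration t in nanoseconds (ignoring integer truncation):

```latex
\[
\frac{t/1000 - t/1024}{t/1000}
  = 1 - \frac{1000}{1024}
  = \frac{24}{1024}
  \approx 2.34\%
\]
```

For t = 10^6 ns this gives 1000 usec by exact division against 976 usec by shifting (976.5625 truncated), which matches the middle row of the table.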
     

09 Apr, 2016

1 commit

  • Currently the 'registered' member of the cpuidle_device struct is set
    to 1 during cpuidle_register_device. In this same function there are
    checks to see if the device is already registered to prevent duplicate
    calls to register the device, but this value is never set to 0 even on
    unregister of the device. Because of this, any attempt to call
    cpuidle_register_device after a call to cpuidle_unregister_device will
    fail which shouldn't be the case.

    To prevent this, set registered to 0 when the device is unregistered.

    Fixes: c878a52d3c7c (cpuidle: Check if device is already registered)
    Signed-off-by: Dave Gerlach
    Acked-by: Daniel Lezcano
    Cc: All applicable
    Signed-off-by: Rafael J. Wysocki

    Dave Gerlach
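
The register/unregister cycle described above can be illustrated with a toy model. This is a sketch only: `struct toy_device` and both functions are hypothetical stand-ins, and the `-EBUSY` return value for a duplicate registration is an assumption made for illustration.

```c
#include <errno.h>

/* Hypothetical stand-in holding only the field discussed in the commit. */
struct toy_device {
	int registered;
};

/* Refuses a second registration while the device is still registered. */
int toy_register_device(struct toy_device *dev)
{
	if (dev->registered)
		return -EBUSY;	/* assumed error code for this sketch */

	dev->registered = 1;
	return 0;
}

/* Clears the flag so that a later registration can succeed again. */
void toy_unregister_device(struct toy_device *dev)
{
	if (!dev->registered)
		return;

	dev->registered = 0;	/* the step the commit says was missing */
}
```

Without the reset in `toy_unregister_device()`, a register, unregister, register sequence would fail on the final call.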
     

22 Jan, 2016

1 commit

  • Commit 51164251f5c3 "sched / idle: Drop default_idle_call() fallback
    from call_cpuidle()" made find_deepest_state() return a non-negative
    value and check all the states with index > 0. As a result,
    find_deepest_state() returns 0 even when enter_freeze callbacks are not
    implemented, and enter_freeze_proper() is then called, which ends up
    crashing the kernel.

    This patch changes the check to index > 0 in cpuidle_enter_freeze() and
    cpuidle_idle_call() (when idle_should_freeze is true) to restore the
    suspend-to-idle functionality in the absence of an enter_freeze callback.

    Fixes: 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()"
    Signed-off-by: Sudeep Holla
    Signed-off-by: Rafael J. Wysocki

    Sudeep Holla
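
The guard described above can be sketched in user space. Everything here is a simplified, hypothetical model (`struct toy_state`, `toy_find_deepest_freeze_state()`, `toy_enter_freeze()`, `toy_freeze_cb()`); it only shows why a return value of 0 must not be used to invoke a missing callback.

```c
#include <stddef.h>

/* Hypothetical idle state: enter_freeze is NULL when not implemented. */
struct toy_state {
	int (*enter_freeze)(int index);
};

/* Example callback used only to exercise the sketch. */
int toy_freeze_cb(int index)
{
	return index;
}

/*
 * Scans the states with index > 0 and returns the deepest one that
 * provides enter_freeze, or 0 if none does.
 */
int toy_find_deepest_freeze_state(const struct toy_state *states, int count)
{
	int i, ret = 0;

	for (i = 1; i < count; i++)
		if (states[i].enter_freeze)
			ret = i;

	return ret;
}

/* Calls enter_freeze only for index > 0; returns the index found. */
int toy_enter_freeze(const struct toy_state *states, int count)
{
	int index = toy_find_deepest_freeze_state(states, count);

	if (index > 0)	/* 0 means "no usable state", so skip the call */
		states[index].enter_freeze(index);

	return index;
}
```

With a check of `index >= 0` instead, a driver with no enter_freeze callbacks would have its NULL pointer at index 0 dereferenced.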
     

19 Jan, 2016

1 commit

  • After commit 9c4b2867ed7c (cpuidle: menu: Fix menu_select() for
    CPUIDLE_DRIVER_STATE_START == 0) it is clear that menu_select()
    cannot return negative values. Moreover, ladder_select_state()
    will never return a negative value either, so make find_deepest_state()
    return non-negative values too and drop the default_idle_call()
    fallback from call_cpuidle().

    This eliminates one branch from the idle loop and makes the governors
    and find_deepest_state() handle the case when all states have been
    disabled from sysfs consistently.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Ingo Molnar
    Tested-by: Sudeep Holla

    Rafael J. Wysocki
     

02 Sep, 2015

1 commit

  • Pull power management and ACPI updates from Rafael Wysocki:
    "From the number of commits perspective, the biggest items are ACPICA
    and cpufreq changes with the latter taking the lead (over 50 commits).

    On the cpufreq front, there are many cleanups and minor fixes in the
    core and governors, driver updates etc. We also have a new cpufreq
    driver for Mediatek MT8173 chips.

    ACPICA mostly updates its debug infrastructure and adds a number of
    fixes and cleanups for a good measure.

    The Operating Performance Points (OPP) framework is updated with new
    DT bindings and support for them among other things.

    We have a few updates of the generic power domains framework and a
    reorganization of the ACPI device enumeration code and bus type
    operations.

    And a lot of fixes and cleanups all over.

    Included is one branch from the MFD tree as it contains some
    PM-related driver core and ACPI PM changes a few other commits are
    based on.

    Specifics:

    - ACPICA update to upstream revision 20150818 including method
    tracing extensions to allow more in-depth AML debugging in the
    kernel and a number of assorted fixes and cleanups (Bob Moore, Lv
    Zheng, Markus Elfring).

    - ACPI sysfs code updates and a documentation update related to AML
    method tracing (Lv Zheng).

    - ACPI EC driver fix related to serialized evaluations of _Qxx
    methods and ACPI tools updates allowing the EC userspace tool to be
    built from the kernel source (Lv Zheng).

    - ACPI processor driver updates preparing it for future introduction
    of CPPC support and ACPI PCC mailbox driver updates (Ashwin
    Chaugule).

    - ACPI interrupts enumeration fix for a regression related to the
    handling of IRQ attribute conflicts between MADT and the ACPI
    namespace (Jiang Liu).

    - Fixes related to ACPI device PM (Mika Westerberg, Srinidhi
    Kasagar).

    - ACPI device registration code reorganization to separate the
    sysfs-related code and bus type operations from the rest (Rafael J
    Wysocki).

    - Assorted cleanups in the ACPI core (Jarkko Nikula, Mathias Krause,
    Andy Shevchenko, Rafael J Wysocki, Nicolas Iooss).

    - ACPI cpufreq driver and ia64 cpufreq driver fixes and cleanups (Pan
    Xinhui, Rafael J Wysocki).

    - cpufreq core cleanups on top of the previous changes allowing it to
    preserve its sysfs directories over system suspend/resume (Viresh
    Kumar, Rafael J Wysocki, Sebastian Andrzej Siewior).

    - cpufreq fixes and cleanups related to governors (Viresh Kumar).

    - cpufreq updates (core and the cpufreq-dt driver) related to the
    turbo/boost mode support (Viresh Kumar, Bartlomiej Zolnierkiewicz).

    - New DT bindings for Operating Performance Points (OPP), support for
    them in the OPP framework and in the cpufreq-dt driver plus related
    OPP framework fixes and cleanups (Viresh Kumar).

    - cpufreq powernv driver updates (Shilpasri G Bhat).

    - New cpufreq driver for Mediatek MT8173 (Pi-Cheng Chen).

    - Assorted cpufreq driver (speedstep-lib, sfi, integrator) cleanups
    and fixes (Abhilash Jindal, Andrzej Hajda, Cristian Ardelean).

    - intel_pstate driver updates including Skylake-S support, support
    for enabling HW P-states per CPU and an additional vendor bypass
    list entry (Kristen Carlson Accardi, Chen Yu, Ethan Zhao).

    - cpuidle core fixes related to the handling of coupled idle states
    (Xunlei Pang).

    - intel_idle driver updates including Skylake Client support and
    support for freeze-mode-specific idle states (Len Brown).

    - Driver core updates related to power management (Andy Shevchenko,
    Rafael J Wysocki).

    - Generic power domains framework fixes and cleanups (Jon Hunter,
    Geert Uytterhoeven, Rajendra Nayak, Ulf Hansson).

    - Device PM QoS framework update to allow the latency tolerance
    setting to be exposed to user space via sysfs (Mika Westerberg).

    - devfreq support for PPMUv2 in Exynos5433 and a fix for an incorrect
    exynos-ppmu DT binding (Chanwoo Choi, Javier Martinez Canillas).

    - System sleep support updates (Alan Stern, Len Brown, SungEun Kim).

    - rockchip-io AVS support updates (Heiko Stuebner).

    - PM core clocks support fixup (Colin Ian King).

    - Power capping RAPL driver update including support for Skylake H/S
    and Broadwell-H (Radivoje Jovanovic, Seiichi Ikarashi).

    - Generic device properties framework fixes related to the handling
    of static (driver-provided) property sets (Andy Shevchenko).

    - turbostat and cpupower updates (Len Brown, Shilpasri G Bhat,
    Shreyas B Prabhu)"

    * tag 'pm+acpi-4.3-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (180 commits)
    cpufreq: speedstep-lib: Use monotonic clock
    cpufreq: powernv: Increase the verbosity of OCC console messages
    cpufreq: sfi: use kmemdup rather than duplicating its implementation
    cpufreq: drop !cpufreq_driver check from cpufreq_parse_governor()
    cpufreq: rename cpufreq_real_policy as cpufreq_user_policy
    cpufreq: remove redundant 'policy' field from user_policy
    cpufreq: remove redundant 'governor' field from user_policy
    cpufreq: update user_policy.* on success
    cpufreq: use memcpy() to copy policy
    cpufreq: remove redundant CPUFREQ_INCOMPATIBLE notifier event
    cpufreq: mediatek: Add MT8173 cpufreq driver
    dt-bindings: mediatek: Add MT8173 CPU DVFS clock bindings
    PM / Domains: Fix typo in description of genpd_dev_pm_detach()
    PM / Domains: Remove unusable governor dummies
    PM / Domains: Make pm_genpd_init() available to modules
    PM / domains: Align column headers and data in pm_genpd_summary output
    powercap / RAPL: disable the 2nd power limit properly
    tools: cpupower: Fix error when running cpupower monitor
    PM / OPP: Drop unlikely before IS_ERR(_OR_NULL)
    PM / OPP: Fix static checker warning (broken 64bit big endian systems)
    ...

    Linus Torvalds
     

01 Sep, 2015

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The biggest change in this cycle is the rewrite of the main SMP load
    balancing metric: the CPU load/utilization. The main goal was to make
    the metric more precise and more representative - see the changelog of
    this commit for the gory details:

    9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    It is done in a way that significantly reduces complexity of the code:

    5 files changed, 249 insertions(+), 494 deletions(-)

    and the performance testing results are encouraging. Nevertheless we
    need to keep an eye on potential regressions, since this potentially
    affects every SMP workload in existence.

    This work comes from Yuyang Du.

    Other changes:

    - SCHED_DL updates. (Andrea Parri)

    - Simplify architecture callbacks by removing finish_arch_switch().
    (Peter Zijlstra et al)

    - cputime accounting: guarantee stime + utime == rtime. (Peter
    Zijlstra)

    - optimize idle CPU wakeups some more - inspired by Facebook server
    loads. (Mike Galbraith)

    - stop_machine fixes and updates. (Oleg Nesterov)

    - Introduce the 'trace_sched_waking' tracepoint. (Peter Zijlstra)

    - sched/numa tweaks. (Srikar Dronamraju)

    - misc fixes and small cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (44 commits)
    sched/deadline: Fix comment in enqueue_task_dl()
    sched/deadline: Fix comment in push_dl_tasks()
    sched: Change the sched_class::set_cpus_allowed() calling context
    sched: Make sched_class::set_cpus_allowed() unconditional
    sched: Fix a race between __kthread_bind() and sched_setaffinity()
    sched: Ensure a task has a non-normalized vruntime when returning back to CFS
    sched/numa: Fix NUMA_DIRECT topology identification
    tile: Reorganize _switch_to()
    sched, sparc32: Update scheduler comments in copy_thread()
    sched: Remove finish_arch_switch()
    sched, tile: Remove finish_arch_switch
    sched, sh: Fold finish_arch_switch() into switch_to()
    sched, score: Remove finish_arch_switch()
    sched, avr32: Remove finish_arch_switch()
    sched, MIPS: Get rid of finish_arch_switch()
    sched, arm: Remove finish_arch_switch()
    sched/fair: Clean up load average references
    sched/fair: Provide runnable_load_avg back to cfs_rq
    sched/fair: Remove task and group entity load when they are dead
    sched/fair: Init cfs_rq's sched_entity load average
    ...

    Linus Torvalds
     

28 Aug, 2015

1 commit


21 Jul, 2015

1 commit

  • Make sure to stop tracing only once we are past a point where
    all latency tracing events have been processed (irqs are not
    enabled again). This has the slight advantage of capturing more
    latency related events in the idle path, but most importantly it
    makes sure that latency tracing doesn't get re-enabled
    inadvertently when new events are coming in.

    This makes the irqsoff latency tracer useful again, as we stop
    capturing CPU sleep time as IRQ latency.

    Signed-off-by: Lucas Stach
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: kernel@pengutronix.de
    Cc: patchwork-lst@pengutronix.de
    Link: http://lkml.kernel.org/r/1437410090-3747-1-git-send-email-l.stach@pengutronix.de
    Signed-off-by: Ingo Molnar

    Lucas Stach
     

10 Jul, 2015

1 commit


19 Jun, 2015

1 commit

  • * pm-sleep:
    PM / sleep: trace_device_pm_callback coverage in dpm_prepare/complete
    PM / wakeup: add a dummy wakeup_source to record statistics
    PM / sleep: Make suspend-to-idle-specific code depend on CONFIG_SUSPEND
    PM / sleep: Return -EBUSY from suspend_enter() on wakeup detection
    PM / tick: Add tracepoints for suspend-to-idle diagnostics
    PM / sleep: Fix symbol name in a comment in kernel/power/main.c
    leds / PM: fix hibernation on arm when gpio-led used with CPU led trigger
    ARM: omap-device: use SET_NOIRQ_SYSTEM_SLEEP_PM_OPS
    bus: omap_l3_noc: add missed callbacks for suspend-to-disk
    PM / sleep: Add macro to define common noirq system PM callbacks
    PM / sleep: Refine diagnostic messages in enter_state()
    PM / wakeup: validate wakeup source before activating it.

    * pm-runtime:
    PM / Runtime: Update last_busy in rpm_resume
    PM / runtime: add note about re-calling in during device probe()

    Rafael J. Wysocki
     

30 May, 2015

1 commit

  • The CPUIDLE_DRIVER_STATE_START symbol is defined as 1 only if
    CONFIG_ARCH_HAS_CPU_RELAX is set, otherwise it is defined as 0.
    However, if CONFIG_ARCH_HAS_CPU_RELAX is set, the first (index 0)
    entry in the cpuidle driver's table of states is overwritten with
    the default "poll" entry by the core. The "state" defined by the
    "poll" entry doesn't provide ->enter_dead and ->enter_freeze
    callbacks and its exit_latency is 0.

    For this reason, it is not necessary to use CPUIDLE_DRIVER_STATE_START
    in cpuidle_play_dead() (->enter_dead is NULL, so the "poll state"
    will be skipped by the loop).

    It is also arguably not useful to return states with exit_latency
    equal to 0 from find_deepest_state(), so the function can be modified
    to start the loop from index 0 and the "poll state" will be skipped by
    it as a result of the check against latency_req.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Preeti U Murthy

    Rafael J. Wysocki
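
The way a latency check skips the "poll" entry can be seen in a small model of the loop. This is a sketch under assumptions: `struct lat_state` and `deepest_state_by_latency()` are hypothetical, and returning -1 when nothing qualifies is a choice made for this example.

```c
/* Hypothetical idle state with only the fields the loop looks at. */
struct lat_state {
	unsigned int exit_latency;	/* in us; 0 for the "poll" entry */
	int disabled;
};

/*
 * Starts at index 0 and keeps the state with the largest exit_latency
 * seen so far. A state whose exit_latency is not greater than
 * latency_req is skipped, which excludes a poll entry with latency 0.
 * Returns -1 if no state qualifies.
 */
int deepest_state_by_latency(const struct lat_state *states, int count)
{
	unsigned int latency_req = 0;
	int i, ret = -1;

	for (i = 0; i < count; i++) {
		if (states[i].disabled ||
		    states[i].exit_latency <= latency_req)
			continue;

		latency_req = states[i].exit_latency;
		ret = i;
	}

	return ret;
}
```

Because `latency_req` starts at 0, an entry with `exit_latency == 0` can never be selected, so no special starting index is needed.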
     

19 May, 2015

1 commit


15 May, 2015

3 commits

  • If tick_broadcast_enter() fails in cpuidle_enter_state(),
    try to find another idle state to enter instead of invoking
    default_idle_call() immediately and returning -EBUSY which
    should increase the chances of saving some energy in those
    cases.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Preeti U Murthy
    Tested-by: Preeti U Murthy
    Tested-by: Sudeep Holla
    Acked-by: Kevin Hilman

    Rafael J. Wysocki
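
The fallback idea can be sketched as follows. This is an illustration only: `TOY_FLAG_TIMER_STOP`, `struct bc_state` and `bc_pick_state()` are hypothetical, the `broadcast_ok` argument stands in for the result of switching to the broadcast timer, and the simple downward scan is a simplification rather than the kernel's selection logic.

```c
/* Hypothetical flag: the state stops the CPU's local timer. */
#define TOY_FLAG_TIMER_STOP 0x1u

struct bc_state {
	unsigned int flags;
};

/*
 * Returns the index of the state to enter. If the requested state stops
 * the local timer and the broadcast switch failed, walk down to the
 * nearest shallower state that keeps the timer running. Returns -1 if
 * there is none, in which case the caller would use the default idle
 * routine.
 */
int bc_pick_state(const struct bc_state *states, int index, int broadcast_ok)
{
	int i;

	if (!(states[index].flags & TOY_FLAG_TIMER_STOP) || broadcast_ok)
		return index;

	for (i = index - 1; i >= 0; i--)
		if (!(states[i].flags & TOY_FLAG_TIMER_STOP))
			return i;

	return -1;
}
```

Entering a shallower state that needs no broadcast timer still saves some energy compared with giving up on cpuidle for that idle period.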
     
  • The check of the cpuidle_enter() return value against -EBUSY
    made in call_cpuidle() will not be necessary any more if
    cpuidle_enter_state() calls default_idle_call() directly when it
    is about to return -EBUSY, so make that happen and eliminate the
    check.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Preeti U Murthy
    Tested-by: Preeti U Murthy
    Tested-by: Sudeep Holla
    Acked-by: Kevin Hilman

    Rafael J. Wysocki
     
  • Introduce a wrapper function around idle_set_state() called
    sched_idle_set_state() that will pass this_rq() to it as the
    first argument and make cpuidle_enter_state() call the new
    function before and after entering the target state.

    At the same time, remove direct invocations of idle_set_state()
    from call_cpuidle().

    This will allow the invocation of default_idle_call() to be
    moved from call_cpuidle() to cpuidle_enter_state() safely
    and call_cpuidle() to be simplified a bit as a result.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Preeti U Murthy
    Tested-by: Preeti U Murthy
    Tested-by: Sudeep Holla
    Acked-by: Kevin Hilman

    Rafael J. Wysocki
     

10 May, 2015

1 commit


05 May, 2015

1 commit

  • Avoid calling the governor's ->reflect method if the state index
    passed to cpuidle_reflect() is negative.

    This allows the analogous check to be dropped from menu_reflect(),
    so do that too, and ensures that arbitrary error codes can be
    passed to cpuidle_reflect() as the index with no adverse
    consequences.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Daniel Lezcano
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
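
A minimal model of moving the negative-index check into the caller might look like this. The names (`struct toy_governor`, `toy_cpuidle_reflect()`, `toy_menu_reflect()` and the two counters) are hypothetical and exist only to make the effect observable.

```c
/* Bookkeeping so the effect of the guard can be observed. */
int toy_reflect_count;
int toy_last_index = -1;

/* Hypothetical governor callback; it no longer checks the index itself. */
void toy_menu_reflect(int index)
{
	toy_reflect_count++;
	toy_last_index = index;
}

struct toy_governor {
	void (*reflect)(int index);
};

/* Forwards to the governor only for valid (non-negative) indices. */
void toy_cpuidle_reflect(const struct toy_governor *gov, int index)
{
	if (gov->reflect && index >= 0)
		gov->reflect(index);
}
```

Passing an error code such as -16 as the index is then simply ignored, while a valid index still reaches the governor.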
     

29 Apr, 2015

1 commit

  • Commit 335f49196fd6 (sched/idle: Use explicit broadcast oneshot
    control function) replaced clockevents_notify() invocations in
    cpuidle_idle_call() with direct calls to tick_broadcast_enter()
    and tick_broadcast_exit(), but it overlooked the fact that
    interrupts were already enabled before calling the latter which
    led to functional breakage on systems using idle states with the
    CPUIDLE_FLAG_TIMER_STOP flag set.

    Fix that by moving the invocations of tick_broadcast_enter()
    and tick_broadcast_exit() down into cpuidle_enter_state() where
    interrupts are still disabled when tick_broadcast_exit() is
    called. Also ensure that interrupts will be disabled before
    running tick_broadcast_exit() even if they have been enabled by
    the idle state's ->enter callback. Trigger a WARN_ON_ONCE() in
    that case, as we generally don't want that to happen for states
    with CPUIDLE_FLAG_TIMER_STOP set.

    Fixes: 335f49196fd6 (sched/idle: Use explicit broadcast oneshot control function)
    Reported-and-tested-by: Linus Walleij
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Daniel Lezcano
    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

03 Apr, 2015

1 commit

  • Thomas Schlichter reports the following issue on his Samsung NC20:

    "The C-states C1 and C2 to the OS when connected to AC, and additionally
    provides the C3 C-state when disconnected from AC. However, the number
    of C-states shown in sysfs is fixed to the number of C-states present
    at boot.
    If I boot with AC connected, I always only see the C-states up to C2
    even if I disconnect AC.

    The reason is commit 130a5f692425 (ACPI / cpuidle: remove dev->state_count
    setting). It removes the update of dev->state_count, but sysfs uses
    exactly this variable to show the C-states.

    The fix is to use drv->state_count in sysfs. As this is currently the
    last user of dev->state_count, this variable can be completely removed."

    Remove dev->state_count as per the above.

    Reported-by: Thomas Schlichter
    Signed-off-by: Bartlomiej Zolnierkiewicz
    Signed-off-by: Kyungmin Park
    Acked-by: Daniel Lezcano
    Cc: 3.14+ # 3.14+
    [ rjw: Changelog ]
    Signed-off-by: Rafael J. Wysocki

    Bartlomiej Zolnierkiewicz
     

06 Mar, 2015

1 commit

  • Commit 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
    overlooked the fact that entering some sufficiently deep idle states
    by CPUs may cause their local timers to stop and in those cases it
    is necessary to switch over to a broadcast timer prior to entering
    the idle state. If the cpuidle driver in use does not provide
    the new ->enter_freeze callback for any of the idle states, that
    problem affects suspend-to-idle too, but it is not taken into account
    after the changes made by commit 381063133246.

    Fix that by changing the definition of cpuidle_enter_freeze() and
    rearranging the code in cpuidle_idle_call(), so the former does
    not call cpuidle_enter() any more and the fallback case is handled
    by cpuidle_idle_call() directly.

    Fixes: 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
    Reported-and-tested-by: Lorenzo Pieralisi
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

01 Mar, 2015

2 commits

  • Modify cpuidle_enter_freeze() to do the sanity checks done by
    cpuidle_select() to avoid crashing the suspend-to-idle code
    path in case something is missing.

    Fixes: 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
    Original-by: Lorenzo Pieralisi
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     
  • Disabling interrupts at the end of cpuidle_enter_freeze() is not
    useful, because its caller, cpuidle_idle_call(), re-enables them
    right away after invoking it.

    To avoid that unnecessary back and forth dance with interrupts,
    make cpuidle_enter_freeze() enable interrupts after calling
    enter_freeze_proper() and drop the local_irq_disable() at its
    end, so that all of the code paths in it end up with interrupts
    enabled. Then, cpuidle_idle_call() will not need to re-enable
    interrupts after calling cpuidle_enter_freeze() any more, because
    the latter will return with interrupts enabled, in analogy with
    cpuidle_enter().

    Reported-by: Lorenzo Pieralisi
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

16 Feb, 2015

1 commit

  • The efficiency of suspend-to-idle depends on being able to keep CPUs
    in the deepest available idle states for as much time as possible.
    Ideally, they should only be brought out of idle by system wakeup
    interrupts.

    However, timer interrupts occurring periodically prevent that from
    happening and it is not practical to chase all of the "misbehaving"
    timers in a whack-a-mole fashion. A much more effective approach is
    to suspend the local ticks for all CPUs and the entire timekeeping
    along the lines of what is done during full suspend, which also
    helps to keep suspend-to-idle and full suspend reasonably similar.

    The idea is to suspend the local tick on each CPU executing
    cpuidle_enter_freeze() and to make the last of them suspend the
    entire timekeeping. That should prevent timer interrupts from
    triggering until an IO interrupt wakes up one of the CPUs. It
    needs to be done with interrupts disabled on all of the CPUs,
    though, because otherwise the suspended clocksource might be
    accessed by an interrupt handler which might lead to fatal
    consequences.

    Unfortunately, the existing ->enter callbacks provided by cpuidle
    drivers generally cannot be used for implementing that, because some
    of them re-enable interrupts temporarily and some idle entry methods
    cause interrupts to be re-enabled automatically on exit. Also some
    of these callbacks manipulate local clock event devices of the CPUs
    which really shouldn't be done after suspending their ticks.

    To overcome that difficulty, introduce a new cpuidle state callback,
    ->enter_freeze, that will be guaranteed (1) to keep interrupts
    disabled all the time (and return with interrupts disabled) and (2)
    not to touch the CPU timer devices. Modify cpuidle_enter_freeze() to
    look for the deepest available idle state with ->enter_freeze present
    and to make the CPU execute that callback with suspended tick (and the
    last of the online CPUs to execute it with suspended timekeeping).

    Suggested-by: Thomas Gleixner
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

14 Feb, 2015

1 commit

  • In preparation for adding support for quiescing timers in the final
    stage of suspend-to-idle transitions, rework the freeze_enter()
    function making the system wait on a wakeup event, the freeze_wake()
    function terminating the suspend-to-idle loop and the mechanism by
    which deep idle states are entered during suspend-to-idle.

    First of all, introduce a simple state machine for suspend-to-idle
    and make the code in question use it.

    Second, prevent freeze_enter() from losing wakeup events due to race
    conditions and ensure that the number of online CPUs won't change
    while it is being executed. In addition to that, make it force
    all of the CPUs to re-enter the idle loop in case they are in idle
    states already (so they can enter deeper idle states if possible).

    Next, drop cpuidle_use_deepest_state() and replace use_deepest_state
    checks in cpuidle_select() and cpuidle_reflect() with a single
    suspend-to-idle state check in cpuidle_idle_call().

    Finally, introduce cpuidle_enter_freeze() that will simply find the
    deepest idle state available to the given CPU and enter it using
    cpuidle_enter().

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

24 Sep, 2014

1 commit

  • When the cpu enters idle, it stores the cpuidle state pointer in its
    struct rq instance which in turn could be used to make a better decision
    when balancing tasks.

    As soon as the cpu exits its idle state, the struct rq reference is
    cleared.

    There are a couple of situations where the idle state pointer could be changed
    while it is being consulted:

    1. For x86/acpi with dynamic c-states, when a laptop switches from battery
    to AC that could result in removing the deeper idle state. The acpi driver
    triggers:
    'acpi_processor_cst_has_changed'
    'cpuidle_pause_and_lock'
    'cpuidle_uninstall_idle_handler'
    'kick_all_cpus_sync'.

    All cpus will exit their idle state and the pointed object will be set to
    NULL.

    2. The cpuidle driver is unloaded. Logically that could happen but not
    in practice because the drivers are always compiled in and 95% of them are
    not coded to unregister themselves. In any case, the unloading code must
    call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
    leading to 'kick_all_cpus_sync' as mentioned above.

    A race can happen if we use the pointer and then one of these two scenarios
    occurs at the same moment.

    In order to be safe, the idle state pointer stored in the rq must be
    used inside a rcu_read_lock section where we are protected with the
    'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
    idle_get_state() and idle_put_state() accessors should be used to that
    effect.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     

19 Sep, 2014

1 commit

  • Currently kick_all_cpus_sync() or smp_call_function() cannot break a
    polling idle CPU out of its loop immediately.

    Using wake_up_all_idle_cpus() instead, which wakes up polling idle CPUs
    quickly, is much more helpful for power.

    Signed-off-by: Chuansheng Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: linux-pm@vger.kernel.org
    Cc: changcheng.liu@intel.com
    Cc: xiaoming.wang@intel.com
    Cc: souvik.k.chakravarty@intel.com
    Cc: luto@amacapital.net
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Cc: Rafael J. Wysocki
    Cc: linux-pm@vger.kernel.org
    Link: http://lkml.kernel.org/r/1409815075-4180-3-git-send-email-chuansheng.liu@intel.com
    Signed-off-by: Ingo Molnar

    Chuansheng Liu
     

09 Jul, 2014

1 commit

  • idle_exit event is the first event after a core exits
    idle state. So this should be traced before local irq
    is enabled. Likewise idle_entry is the last event before
    a core enters idle state. This will ease visualising the
    cpu idle state from kernel traces.
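
    The intended ordering can be illustrated with a small user-space model.
    The event names and the idle_once() helper are hypothetical; they only
    record the order in which the steps described above would happen and do
    not correspond to real tracepoint calls.

```c
#include <stddef.h>

/* Hypothetical event kinds, in the order the commit message asks for. */
enum ev {
    EV_IDLE_ENTRY,     /* last event before the core enters idle */
    EV_STATE_ENTERED,  /* stand-in for the low-level idle entry */
    EV_IDLE_EXIT,      /* first event after the core leaves idle */
    EV_IRQ_ENABLED     /* local interrupts are enabled only afterwards */
};

struct ev_log {
    enum ev events[8];
    size_t count;
};

static void log_event(struct ev_log *log, enum ev e)
{
    if (log->count < sizeof(log->events) / sizeof(log->events[0]))
        log->events[log->count++] = e;
}

/* One pass through idle, recording each step as it happens. */
static void idle_once(struct ev_log *log)
{
    log_event(log, EV_IDLE_ENTRY);
    log_event(log, EV_STATE_ENTERED);
    log_event(log, EV_IDLE_EXIT);
    log_event(log, EV_IRQ_ENABLED);
}
```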

    Signed-off-by: Sandeep Tripathy
    Acked-by: Daniel Lezcano
    [rjw: Subject, rebase]
    Signed-off-by: Rafael J. Wysocki

    Sandeep Tripathy
     

07 May, 2014

1 commit


01 May, 2014

1 commit

  • Since both cpuidle_enabled() and cpuidle_select() are only called by
    cpuidle_idle_call(), it is not really useful to keep them separate
    and combining them will help to avoid complicating cpuidle_idle_call()
    even further if governors are changed to return error codes sometimes.

    This code modification shouldn't lead to any functional changes.
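
    A minimal sketch of the combined call, assuming a reduced model of the
    framework state: struct cpuidle_model, its fields and model_select() are
    hypothetical, and the use of -ENODEV for every failure case is an
    assumption made for brevity.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical, reduced model of the framework state. */
struct cpuidle_model {
    bool off;             /* cpuidle disabled at boot */
    bool have_driver;     /* a cpuidle driver is registered */
    bool dev_enabled;     /* the per-CPU device is enabled */
    int  governor_choice; /* index the governor would pick */
};

/*
 * Single entry point: the "is cpuidle usable" check and the governor
 * selection are folded together. Returns a state index (>= 0) on success
 * or a negative error code, so the caller needs only one test.
 */
static int model_select(const struct cpuidle_model *m)
{
    if (m->off || !m->have_driver || !m->dev_enabled)
        return -ENODEV;

    return m->governor_choice;
}
```

    With this shape, a governor that wants to report an error can do so
    through the same return value without another branch in the caller.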

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

03 Apr, 2014

1 commit

  • Pull sched/idle changes from Ingo Molnar:
    "More idle code reorganization, to prepare for more integration.

    (Sent separately because it depended on pending timer work, which is
    now upstream)"

    * 'sched-idle-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/idle: Add more comments to the code
    sched/idle: Move idle conditions in cpuidle_idle main function
    sched/idle: Reorganize the idle loop
    cpuidle/idle: Move the cpuidle_idle_call function to idle.c
    idle/cpuidle: Split cpuidle_idle_call main function into smaller functions

    Linus Torvalds
     

02 Apr, 2014

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:
    "The majority of this material spent some time in linux-next, some of
    it even several weeks. There are a few relatively fresh commits in
    it, but they are mostly fixes and simple cleanups.

    ACPI took the lead this time, both in terms of the number of commits
    and the number of modified lines of code, cpufreq follows and there
    are a few changes in the PM core and in cpuidle too.

    A new feature that already got some LWN.net's attention is the device
    PM QoS extension allowing latency tolerance requirements to be
    propagated from leaf devices to their ancestors with hardware
    interfaces for specifying latency tolerance. That should help systems
    with hardware-driven power management to avoid going too far with it
    in cases when there are latency tolerance constraints.

    There also are some significant changes in the ACPI core related to
    the way in which hotplug notifications are handled. They affect PCI
    hotplug (ACPIPHP) and the ACPI dock station code too. The bottom line
    is that all those notifications now go through the root notify handler
    and are propagated to the interested subsystems by means of callbacks
    instead of having to install a notify handler for each device object
    that we can potentially get hotplug notifications for.

    In addition to that ACPICA will now advertise "Windows 2013"
    compatibility for _OSI, because some systems out there don't work
    correctly if that is not done (some of them don't even boot).

    On the system suspend side of things, all of the device suspend and
    resume callbacks, except for ->prepare() and ->complete(), are now
    going to be executed asynchronously as that turns out to speed up
    system suspend and resume on some platforms quite significantly and we
    have a few more optimizations in that area.

    Apart from that, there are some new device IDs and fixes and cleanups
    all over. In particular, the system suspend and resume handling by
    cpufreq should be improved and the cpuidle menu governor should be a
    bit more robust now.

    Specifics:

    - Device PM QoS support for latency tolerance constraints on systems
    with hardware interfaces allowing such constraints to be specified.
    That is necessary to prevent hardware-driven power management from
    becoming overly aggressive on some systems and to prevent power
    management features leading to excessive latencies from being used
    in some cases.

    - Consolidation of the handling of ACPI hotplug notifications for
    device objects. This causes all device hotplug notifications to go
    through the root notify handler (that was executed for all of them
    anyway before) that propagates them to individual subsystems, if
    necessary, by executing callbacks provided by those subsystems
    (those callbacks are associated with struct acpi_device objects
    during device enumeration). As a result, the code in question
    becomes both smaller in size and more straightforward and all of
    those changes should not affect users.

    - ACPICA update, including fixes related to the handling of _PRT in
    cases when it is broken and the addition of "Windows 2013" to the
    list of supported "features" for _OSI (which is necessary to
    support systems that work incorrectly or don't even boot without
    it). Changes from Bob Moore and Lv Zheng.

    - Consolidation of ACPI _OST handling from Jiang Liu.

    - ACPI battery and AC fixes allowing unusual system configurations to
    be handled by that code from Alexander Mezin.

    - New device IDs for the ACPI LPSS driver from Chiau Ee Chew.

    - ACPI fan and thermal optimizations related to system suspend and
    resume from Aaron Lu.

    - Cleanups related to ACPI video from Jean Delvare.

    - Assorted ACPI fixes and cleanups from Al Stone, Hanjun Guo, Lan
    Tianyu, Paul Bolle, Tomasz Nowicki.

    - Intel RAPL (Running Average Power Limits) driver cleanups from
    Jacob Pan.

    - intel_pstate fixes and cleanups from Dirk Brandewie.

    - cpufreq fixes related to system suspend/resume handling from Viresh
    Kumar.

    - cpufreq core fixes and cleanups from Viresh Kumar, Stratos
    Karafotis, Saravana Kannan, Rashika Kheria, Joe Perches.

    - cpufreq drivers updates from Viresh Kumar, Zhuoyu Zhang, Rob
    Herring.

    - cpuidle fixes related to the menu governor from Tuukka Tikkanen.

    - cpuidle fix related to coupled CPUs handling from Paul Burton.

    - Asynchronous execution of all device suspend and resume callbacks,
    except for ->prepare and ->complete, during system suspend and
    resume from Chuansheng Liu.

    - Delayed resuming of runtime-suspended devices during system suspend
    for the PCI bus type and ACPI PM domain.

    - New set of PM helper routines to allow device runtime PM callbacks
    to be used during system suspend and resume more easily from Ulf
    Hansson.

    - Assorted fixes and cleanups in the PM core from Geert Uytterhoeven,
    Prabhakar Lad, Philipp Zabel, Rashika Kheria, Sebastian Capella.

    - devfreq fix from Saravana Kannan"

    * tag 'pm+acpi-3.15-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (162 commits)
    PM / devfreq: Rewrite devfreq_update_status() to fix multiple bugs
    PM / sleep: Correct whitespace errors in
    intel_pstate: Set core to min P state during core offline
    cpufreq: Add stop CPU callback to cpufreq_driver interface
    cpufreq: Remove unnecessary braces
    cpufreq: Fix checkpatch errors and warnings
    cpufreq: powerpc: add cpufreq transition latency for FSL e500mc SoCs
    MAINTAINERS: Reorder maintainer addresses for PM and ACPI
    PM / Runtime: Update runtime_idle() documentation for return value meaning
    video / output: Drop display output class support
    fujitsu-laptop: Drop unneeded include
    acer-wmi: Stop selecting VIDEO_OUTPUT_CONTROL
    ACPI / gpu / drm: Stop selecting VIDEO_OUTPUT_CONTROL
    ACPI / video: fix ACPI_VIDEO dependencies
    cpufreq: remove unused notifier: CPUFREQ_{SUSPENDCHANGE|RESUMECHANGE}
    cpufreq: Do not allow ->setpolicy drivers to provide ->target
    cpufreq: arm_big_little: set 'physical_cluster' for each CPU
    cpufreq: arm_big_little: make vexpress driver depend on bL core driver
    ACPI / button: Add ACPI Button event via netlink routine
    ACPI: Remove duplicate definitions of PREFIX
    ...

    Linus Torvalds
     

12 Mar, 2014

1 commit

  • As described by a comment at the end of cpuidle_enter_state_coupled it
    can be inefficient for coupled idle states to return with IRQs enabled
    since they may proceed to service an interrupt instead of clearing the
    coupled idle state. Until they have finished & cleared the idle state
    all CPUs coupled with them will spin rather than being able to enter a
    safe idle state.

    Commits e1689795a784 "cpuidle: Add common time keeping and irq
    enabling" and 554c06ba3ee2 "cpuidle: remove en_core_tk_irqen flag" led
    to the cpuidle_enter_state enabling interrupts for all idle states,
    including coupled ones, making this inefficiency unavoidable by drivers
    & the local_irq_enable near the end of cpuidle_enter_state_coupled
    redundant. This patch avoids enabling interrupts in cpuidle_enter_state
    after a coupled state has been entered, allowing them to remain disabled
    until all coupled CPUs have exited the idle state and
    cpuidle_enter_state_coupled re-enables them.
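
    The resulting control flow can be modelled in user space as below. The
    irqs_enabled flag and the local_irq_enable()/local_irq_disable()
    functions are stand-ins for the kernel's interrupt control, and
    enter_state()/enter_state_coupled() are simplified, hypothetical
    versions of the functions named in the message.

```c
#include <stdbool.h>

static bool irqs_enabled;  /* models the local interrupt flag */

static void local_irq_enable(void)  { irqs_enabled = true; }
static void local_irq_disable(void) { irqs_enabled = false; }

struct idle_state {
    bool coupled;
};

/* Generic entry path: leave interrupts alone for coupled states. */
static void enter_state(const struct idle_state *s)
{
    /* ... low-level idle entry and wake-up elided ... */
    if (!s->coupled)
        local_irq_enable();
}

/* Coupled entry path: enable interrupts only once every CPU is out. */
static void enter_state_coupled(const struct idle_state *s)
{
    enter_state(s);
    /* ... wait for all coupled CPUs to leave the idle state (elided) ... */
    local_irq_enable();
}
```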

    Cc: Daniel Lezcano
    Signed-off-by: Paul Burton
    Signed-off-by: Rafael J. Wysocki

    Paul Burton
     

11 Mar, 2014

2 commits

  • The cpuidle_idle_call does nothing more than call three individual
    functions and is no longer used by any arch-specific code, only by the
    cpuidle framework code.

    We can move this function into the idle task code to ensure better
    proximity to the scheduler code.

    Signed-off-by: Daniel Lezcano
    Acked-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: rjw@rjwysocki.net
    Cc: preeti@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1393832934-11625-2-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     
  • To allow better integration between the cpuidle framework and the
    scheduler, reduce the distance between these two sub-components by
    moving part of the cpuidle code into the idle task file. Because idle.c
    is in the sched directory, that code then has access to the scheduler's
    private structures.

    This patch splits the cpuidle_idle_call main entry function into 3 calls
    to a newly added API:

    1. select the idle state
    2. enter the idle state
    3. reflect the idle state

    The cpuidle_idle_call calls these three functions to implement the main
    idle entry function.
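
    The three-step structure can be sketched as follows. This is a
    user-space model, not the kernel code: struct idle_dev and the idle_*()
    functions are hypothetical names that only mirror the select, enter and
    reflect steps listed above.

```c
#include <errno.h>

/* Hypothetical per-CPU bookkeeping for the sketch. */
struct idle_dev {
    int governor_choice; /* what the governor would select, or -errno */
    int entered_index;   /* last state actually entered */
    int last_reflected;  /* last index reported back to the governor */
};

/* Step 1: ask the governor which state to enter. */
static int idle_select(const struct idle_dev *dev)
{
    return dev->governor_choice;
}

/* Step 2: enter the state; returns the index really entered. */
static int idle_enter(struct idle_dev *dev, int index)
{
    dev->entered_index = index;
    return index;
}

/* Step 3: tell the governor what happened so it can update its stats. */
static void idle_reflect(struct idle_dev *dev, int index)
{
    dev->last_reflected = index;
}

/* Main entry point: nothing more than the three calls in sequence. */
static int idle_call(struct idle_dev *dev)
{
    int next_state, entered_state;

    next_state = idle_select(dev);
    if (next_state < 0)
        return next_state;

    entered_state = idle_enter(dev, next_state);
    if (entered_state < 0)
        return entered_state;

    idle_reflect(dev, entered_state);
    return 0;
}
```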

    Signed-off-by: Daniel Lezcano
    Acked-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra
    Cc: rjw@rjwysocki.net
    Cc: preeti@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1393832934-11625-1-git-send-email-daniel.lezcano@linaro.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     

07 Feb, 2014

1 commit

  • Some archs set the CPUIDLE_FLAG_TIMER_STOP flag for idle states in which the
    local timers stop. The cpuidle_idle_call() currently handles such idle states
    by calling into the broadcast framework so as to wakeup CPUs at their next
    wakeup event. With the hrtimer mode of broadcast, the BROADCAST_ENTER call
    into the broadcast framework can fail for archs that do not have an external
    clock device to handle wakeups and the CPU in question has thus to be made
    the stand by CPU. This patch handles such cases by failing the call into
    cpuidle so that the arch can take some default action. The arch will certainly
    not enter a similar idle state because a failed cpuidle call will also implicitly
    indicate that the broadcast framework has not registered this CPU to be woken up.
    Hence we are safe if we fail the cpuidle call.

    In the process move the functions that trace idle statistics just before and
    after the entry and exit into idle states respectively. In other
    scenarios where the call to cpuidle fails, we end up not tracing idle
    entry and exit since a decision on an idle state could not be taken. Similarly
    when the call to broadcast framework fails, we skip tracing idle statistics
    because we are in no further position to take a decision on an alternative
    idle state to enter into.
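
    A simplified model of this behaviour is shown below. The types, the
    broadcast_enter() helper and the -EBUSY return value are assumptions
    made for illustration; the counters stand in for the idle trace events.

```c
#include <errno.h>
#include <stdbool.h>

struct idle_state {
    bool timer_stop;  /* models CPUIDLE_FLAG_TIMER_STOP */
};

struct cpu_model {
    bool broadcast_ok;  /* would the broadcast framework accept this CPU? */
    int traced_entries; /* idle entry events recorded */
    int traced_exits;   /* idle exit events recorded */
};

/* Stand-in for the BROADCAST_ENTER notification; may fail. */
static int broadcast_enter(const struct cpu_model *cpu)
{
    return cpu->broadcast_ok ? 0 : -EBUSY;
}

static int idle_call(struct cpu_model *cpu, const struct idle_state *state)
{
    /*
     * If the local timer stops in this state and nobody will wake us up,
     * fail the call before tracing anything so that the arch code can
     * fall back to its default idle action.
     */
    if (state->timer_stop && broadcast_enter(cpu) != 0)
        return -EBUSY;

    cpu->traced_entries++;  /* traced just before entering the state */
    /* ... low-level idle entry elided ... */
    cpu->traced_exits++;    /* traced just after leaving the state */

    return 0;
}
```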

    Signed-off-by: Preeti U Murthy
    Cc: deepthi@linux.vnet.ibm.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: fweisbec@gmail.com
    Cc: paulus@samba.org
    Cc: srivatsa.bhat@linux.vnet.ibm.com
    Cc: svaidy@linux.vnet.ibm.com
    Cc: peterz@infradead.org
    Cc: benh@kernel.crashing.org
    Acked-by: Rafael J. Wysocki
    Cc: linuxppc-dev@lists.ozlabs.org
    Link: http://lkml.kernel.org/r/20140207080652.17187.66344.stgit@preeti.in.ibm.com
    Signed-off-by: Thomas Gleixner

    Preeti U Murthy
     

04 Dec, 2013

1 commit

  • If not, we could end up in the unfortunate situation where
    we dereference a NULL pointer b/c we have cpuidle disabled.

    This is the case when booting under Xen (which uses the
    ACPI P/C states but disables the CPU idle driver) - and can
    be easily reproduced when booting with cpuidle.off=1.

    BUG: unable to handle kernel NULL pointer dereference at (null)
    IP: [] cpuidle_unregister_device+0x2a/0x90
    .. snip..
    Call Trace:
    [] acpi_processor_power_exit+0x3c/0x5c
    [] acpi_processor_stop+0x61/0xb6
    [] __device_release_driver+0fffff81421653>] device_release_driver+0x23/0x30
    [] bus_remove_device+0x108/0x180
    [] device_del+0x129/0x1c0
    [] ? unregister_xenbus_watch+0x1f0/0x1f0
    [] device_unregister+0x1e/0x60
    [] unregister_cpu+0x39/0x60
    [] arch_unregister_cpu+0x23/0x30
    [] handle_vcpu_hotplug_event+0xc1/0xe0
    [] xenwatch_thread+0x45/0x120
    [] ? abort_exclusive_wait+0xb0/0xb0
    [] kthread+0xd2/0xf0
    [] ? kthread_create_on_node+0x180/0x180
    [] ret_from_fork+0x7c/0xb0
    [] ? kthread_create_on_node+0x180/0x180

    This problem also appears in 3.12 and could be a candidate for backport.
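
    The kind of guard this fix calls for can be sketched as below. The
    structure is reduced to one field and model_unregister_device() is a
    hypothetical user-space stand-in for cpuidle_unregister_device(); the
    return value exists only so the sketch can be checked.

```c
#include <stddef.h>

/* Reduced device: only the field needed for the check. */
struct cpuidle_device {
    int registered;
};

/*
 * Returns 1 if the device was unregistered, 0 if there was nothing to do.
 * The NULL test comes first, so the registered field is never read
 * through a NULL pointer when cpuidle is disabled.
 */
static int model_unregister_device(struct cpuidle_device *dev)
{
    if (!dev || !dev->registered)
        return 0;

    dev->registered = 0;
    return 1;
}
```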

    Signed-off-by: Konrad Rzeszutek Wilk
    Cc: All applicable
    Signed-off-by: Rafael J. Wysocki

    Konrad Rzeszutek Wilk
     

30 Oct, 2013

2 commits