22 Feb, 2013

1 commit

  • Pull ARM SoC-specific updates from Arnd Bergmann:
    "This is a larger set of new functionality for the existing SoC
    families, including:

    - vt8500 gains support for new CPU cores, notably the Cortex-A9 based
    wm8850

    - prima2 gains support for the "marco" SoC family, its SMP based
    cousin

    - tegra gains support for the new Tegra4 (Tegra114) family

    - socfpga now supports a newer version of the hardware including SMP

    - i.mx31 and bcm2835 are now using DT probing for their clocks

    - lots of updates for sh-mobile

    - OMAP updates for clocks, power management and USB

    - i.mx6q and tegra now support cpuidle

    - kirkwood now supports PCIe hot plugging

    - tegra clock support is updated

    - tegra USB PHY probing gets implemented differently"

    * tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (148 commits)
    ARM: prima2: remove duplicate v7_invalidate_l1
    ARM: shmobile: r8a7779: Correct TMU clock support again
    ARM: prima2: fix __init section for cpu hotplug
    ARM: OMAP: Consolidate OMAP USB-HS platform data (part 3/3)
    ARM: OMAP: Consolidate OMAP USB-HS platform data (part 1/3)
    arm: socfpga: Add SMP support for actual socfpga harware
    arm: Add v7_invalidate_l1 to cache-v7.S
    arm: socfpga: Add entries to enable make dtbs socfpga
    arm: socfpga: Add new device tree source for actual socfpga HW
    ARM: tegra: sort Kconfig selects for Tegra114
    ARM: tegra: enable ARCH_REQUIRE_GPIOLIB for Tegra114
    ARM: tegra: Fix build error w/ ARCH_TEGRA_114_SOC w/o ARCH_TEGRA_3x_SOC
    ARM: tegra: Fix build error for gic update
    ARM: tegra: remove empty tegra_smp_init_cpus()
    ARM: shmobile: Register ARM architected timer
    ARM: MARCO: fix the build issue due to gic-vic-to-irqchip move
    ARM: shmobile: r8a7779: Correct TMU clock support
    ARM: mxs_defconfig: Select CONFIG_DEVTMPFS_MOUNT
    ARM: mxs: decrease mxs_clockevent_device.min_delta_ns to 2 clock cycles
    ARM: mxs: use apbx bus clock to drive the timers on timrotv2
    ...

    Linus Torvalds
     

26 Jan, 2013

1 commit

  • The text in Documentation said it would be removed in 2.6.41;
    the text in the Kconfig said removal in the 3.1 release. Either
    way you look at it, we are well past both, so push it off a cliff.

    Note that the POWER_CSTATE and the POWER_PSTATE are part of the
    legacy tracing API. Remove all tracepoints which use these flags.
    As can be seen from context, most already have a trace entry via
    trace_cpu_idle anyways.

    Also, the cpufreq/cpufreq.c PSTATE one is actually unpaired, as
    compared to the CSTATE ones which all have a clear start/stop.
    As part of this, the trace_power_frequency also becomes orphaned,
    so it too is deleted.

    Signed-off-by: Paul Gortmaker
    Acked-by: Steven Rostedt
    Signed-off-by: Rafael J. Wysocki

    Paul Gortmaker
     

15 Jan, 2013

1 commit

  • We realized that the power usage field is never filled, and when it
    is filled (as for tegra), the power_specified flag is not set,
    causing all of these values to be reset by set_power_states() when
    the driver is initialized.

    However, the power_specified flag can be simply removed under the
    assumption that the states are always backward sorted, which is the
    case with the current code.

    This change allows the menu governor select function and
    cpuidle_play_dead() to be simplified. Moreover, the
    set_power_states() function can be removed as it no longer makes
    sense.

    Drop the power_specified flag from struct cpuidle_driver and make
    the related changes as described above.

    As a consequence, this also fixes the bug where, on systems with
    dynamic C-states, the power fields are not initialized.

    [rjw: Changelog]
    References: https://bugzilla.kernel.org/show_bug.cgi?id=42870
    References: https://bugzilla.kernel.org/show_bug.cgi?id=43349
    References: https://lkml.org/lkml/2012/10/16/518
    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     

12 Jan, 2013

1 commit

  • Commit bf4d1b5ddb78f86078ac6ae0415802d5f0c68f92 (cpuidle: support
    multiple drivers) changed the number of initialized state kobjects
    in cpuidle_add_state_sysfs() from device->state_count to
    drv->state_count, but left device->state_count in
    cpuidle_remove_state_sysfs(). The values of these two fields may be
    different, in which case a NULL pointer dereference may happen in
    cpuidle_remove_state_sysfs(), for example. Fix this problem by making
    cpuidle_add_state_sysfs() use device->state_count too (which restores
    its original behavior).

    [rjw: Changelog]
    Signed-off-by: Krzysztof Mazur
    Acked-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Krzysztof Mazur
     

03 Jan, 2013

3 commits

  • Commit bf4d1b5 (cpuidle: support multiple drivers) introduced
    locking in cpuidle_get_cpu_driver(), which is used in
    cpuidle_idle_call(), i.e. in the idle path.

    This leads to a contention problem with a large number of CPUs,
    because they all try to run the idle routine at the same time.

    The lock can be safely removed because of the way the cpuidle API
    is used. Namely, cpuidle_register_driver() is called first, but the
    cpuidle idle function is not entered before cpuidle_register_device()
    is called, because the cpuidle device is not enabled then. Moreover,
    cpuidle_unregister_driver(), which would reset the driver value to
    NULL, is not called before cpuidle_unregister_device().

    All of the cpuidle drivers use the API in the same way.

    In general, a cleanup around the lock is necessary and a proper
    refcounting mechanism should be used to ensure the consistency in the
    API (for example, cpuidle_unregister_driver() should fail if the
    driver's refcount is not 0). However, these modifications will require
    some code reorganization and rewrite which will be too intrusive for
    a fix.

    For this reason, fix the contention problem introduced by commit
    bf4d1b5 by simply removing the locking from cpuidle_get_cpu_driver(),
    which restores the original behavior of that routine.

    [rjw: Changelog.]
    Reported-and-tested-by: Russ Anderson
    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • The ready_waiting_counts atomic variable is compared against the wrong
    online cpu count. The latter is computed incorrectly using logical-OR
    instead of bit-OR. This patch fixes that.

    Signed-off-by: Sivaram Nair
    Acked-by: Santosh Shilimkar
    Acked-by: Colin Cross
    Cc:
    Signed-off-by: Rafael J. Wysocki

    Sivaram Nair
     
  • Since cpuidle_state.power_usage is a signed value, use INT_MAX
    (instead of -1) to init the local copies, so that functions that
    try to find the cpuidle state with minimum power usage work
    correctly with non-negative values.
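
    For illustration only, a hypothetical helper (not the kernel code)
    showing why the initial value matters: with signed power values, a
    running minimum initialized to -1 would never be replaced, whereas
    INT_MAX makes the scan work for any non-negative input.

    #include <limits.h>

    /* Return the index of the state with the lowest power usage. */
    static int lowest_power_state(const int *power_usage, int state_count)
    {
        int i, min_power = INT_MAX, min_idx = 0;

        for (i = 0; i < state_count; i++) {
            if (power_usage[i] < min_power) {
                min_power = power_usage[i];
                min_idx = i;
            }
        }
        return min_idx;
    }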

    Signed-off-by: Sivaram Nair
    Reviewed-by: Rik van Riel
    Signed-off-by: Rafael J. Wysocki

    Sivaram Nair
     

13 Dec, 2012

1 commit

  • Pull ARM SoC updates from Olof Johansson:
    "This contains the bulk of new SoC development for this merge window.

    Two new platforms have been added, the sunxi platforms (Allwinner A1x
    SoCs) by Maxime Ripard, and a generic Broadcom platform for a new
    series of ARMv7 platforms from them, where the hope is that we can
    keep the platform code generic enough to have them all share one mach
    directory. The new Broadcom platform is contributed by Christian
    Daudt.

    Highbank has grown support for Calxeda's next generation of hardware,
    ECX-2000.

    clps711x has seen a lot of cleanup from Alexander Shiyan, and he's
    also taken on maintainership of the platform.

    Beyond this there has been a bunch of work from a number of people on
    converting more platforms to IRQ domains, pinctrl conversion, cleanup
    and general feature enablement across most of the active platforms."

    Fix up trivial conflicts as per Olof.

    * tag 'soc' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (174 commits)
    mfd: vexpress-sysreg: Remove LEDs code
    irqchip: irq-sunxi: Add terminating entry for sunxi_irq_dt_ids
    clocksource: sunxi_timer: Add terminating entry for sunxi_timer_dt_ids
    irq: versatile: delete dangling variable
    ARM: sunxi: add missing include for mdelay()
    ARM: EXYNOS: Avoid early use of of_machine_is_compatible()
    ARM: dts: add node for PL330 MDMA1 controller for exynos4
    ARM: EXYNOS: Add support for secondary CPU bring-up on Exynos4412
    ARM: EXYNOS: add UART3 to DEBUG_LL ports
    ARM: S3C24XX: Add clkdev entry for camif-upll clock
    ARM: SAMSUNG: Add s3c24xx/s3c64xx CAMIF GPIO setup helpers
    ARM: sunxi: Add missing sun4i.dtsi file
    pinctrl: samsung: Do not initialise statics to 0
    ARM i.MX6: remove gate_mask from pllv3
    ARM i.MX6: Fix ethernet PLL clocks
    ARM i.MX6: rename PLLs according to datasheet
    ARM i.MX6: Add pwm support
    ARM i.MX51: Add pwm support
    ARM i.MX53: Add pwm support
    ARM: mx5: Replace clk_register_clkdev with clock DT lookup
    ...

    Linus Torvalds
     

27 Nov, 2012

1 commit

  • Many cpuidle drivers measure their time spent in an idle state by
    reading the wallclock time before and after idling and calculating the
    difference. This leads to erroneous results when the wallclock time gets
    updated by another processor in the meantime, adding that clock
    adjustment to the idle state's time counter.

    If the clock adjustment was negative, the result is even worse due to an
    erroneous cast from int to unsigned long long of the last_residency
    variable. The negative 32 bit integer will zero-extend and result in a
    forward time jump of roughly four billion milliseconds or 1.3 hours on
    the idle state residency counter.

    This patch changes all affected cpuidle drivers to either use the
    monotonic clock for their measurements or make use of the generic time
    measurement wrapper in cpuidle.c, which was already working correctly.
    Some superfluous CLIs/STIs in the ACPI code are removed (interrupts
    should always already be disabled before entering the idle function, and
    not get reenabled until the generic wrapper has performed its second
    measurement). It also removes the erroneous cast, making sure that
    negative residency values are applied correctly even though they should
    not appear anymore.
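
    As a rough sketch of the monotonic-time measurement pattern the
    generic wrapper relies on (simplified; error handling and statistics
    updates omitted):

    ktime_t time_start, time_end;
    s64 diff;
    int entered_state;

    time_start = ktime_get();
    entered_state = target_state->enter(dev, drv, index);
    time_end = ktime_get();

    diff = ktime_to_us(ktime_sub(time_end, time_start));
    if (diff > INT_MAX)
        diff = INT_MAX;
    dev->last_residency = (int)diff;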

    Signed-off-by: Julius Werner
    Reviewed-by: Preeti U Murthy
    Tested-by: Daniel Lezcano
    Acked-by: Daniel Lezcano
    Acked-by: Len Brown
    Signed-off-by: Rafael J. Wysocki

    Julius Werner
     

23 Nov, 2012

1 commit

  • I saw this suspicious RCU usage on the next tree of 11/15

    [ 67.123404] ===============================
    [ 67.123413] [ INFO: suspicious RCU usage. ]
    [ 67.123423] 3.7.0-rc5-next-20121115-dirty #1 Not tainted
    [ 67.123434] -------------------------------
    [ 67.123444] include/trace/events/timer.h:186 suspicious rcu_dereference_check() usage!
    [ 67.123458]
    [ 67.123458] other info that might help us debug this:
    [ 67.123458]
    [ 67.123474]
    [ 67.123474] RCU used illegally from idle CPU!
    [ 67.123474] rcu_scheduler_active = 1, debug_locks = 0
    [ 67.123493] RCU used illegally from extended quiescent state!
    [ 67.123507] 1 lock held by swapper/1/0:
    [ 67.123516] #0: (&cpu_base->lock){-.-...}, at: [] .__hrtimer_start_range_ns+0x28c/0x524
    [ 67.123555]
    [ 67.123555] stack backtrace:
    [ 67.123566] Call Trace:
    [ 67.123576] [c0000001e2ccb920] [c00000000001275c] .show_stack+0x78/0x184 (unreliable)
    [ 67.123599] [c0000001e2ccb9d0] [c0000000000c15a0] .lockdep_rcu_suspicious+0x120/0x148
    [ 67.123619] [c0000001e2ccba70] [c00000000009601c] .enqueue_hrtimer+0x1c0/0x1c8
    [ 67.123639] [c0000001e2ccbb00] [c000000000097aa0] .__hrtimer_start_range_ns+0x37c/0x524
    [ 67.123660] [c0000001e2ccbc20] [c0000000005c9698] .menu_select+0x508/0x5bc
    [ 67.123678] [c0000001e2ccbd20] [c0000000005c740c] .cpuidle_idle_call+0xa8/0x6e4
    [ 67.123699] [c0000001e2ccbdd0] [c0000000000459a0] .pSeries_idle+0x10/0x34
    [ 67.123717] [c0000001e2ccbe40] [c000000000014dc8] .cpu_idle+0x130/0x280
    [ 67.123738] [c0000001e2ccbee0] [c0000000006ffa8c] .start_secondary+0x378/0x384
    [ 67.123758] [c0000001e2ccbf90] [c00000000000936c] .start_secondary_prolog+0x10/0x14

    hrtimer_start was added in 198fd638 and ae515197. The patch below tries
    to use RCU_NONIDLE around it to avoid the above report.
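
    The shape of the fix, roughly (the timer and timeout names here are
    placeholders, not the governor's actual variables):

    /* Run the hrtimer start, and hence its tracepoint, outside RCU idle. */
    RCU_NONIDLE(hrtimer_start(&hrtimer, ns_to_ktime(timeout_ns),
                              HRTIMER_MODE_REL_PINNED));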

    Signed-off-by: Li Zhong
    Acked-by: Paul E. McKenney
    Reviewed-by: Rik van Riel
    Signed-off-by: Rafael J. Wysocki

    Li Zhong
     

15 Nov, 2012

12 commits

  • With the new tegra3 and big.LITTLE [1] architectures, several cpus
    with different characteristics (latencies and states) can co-exist
    in the same system.

    The cpuidle framework has the limitation of handling only identical cpus.

    This patch removes this limitation by introducing the multiple driver support
    for cpuidle.

    This option is configurable at compile time and should be enabled
    for the architectures mentioned above, so there is no impact on the
    other platforms if the option is disabled. The option defaults to
    'n'. Note that the multiple driver support is also compatible with
    the existing drivers: even if just one driver is needed, all the
    cpus will be tied to this driver, at the cost of a small extra chunk
    of per-cpu memory.

    The multiple driver support uses a per-cpu driver pointer instead of
    a global variable, and the accessors to this variable are called
    from a cpu context.

    In order to keep compatibility with the existing drivers, the
    functions 'cpuidle_register_driver' and 'cpuidle_unregister_driver'
    register or unregister the specified driver for all the cpus.

    The semantics of the output of
    /sys/devices/system/cpu/cpuidle/current_driver remain the same,
    except that the driver name reported is the one of the current cpu.

    The /sys/devices/system/cpu/cpu[0-9]/cpuidle/driver/name files are
    added, allowing the per-cpu driver name to be read.

    [1] http://lwn.net/Articles/481055/
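
    In outline, the per-cpu bookkeeping looks roughly like this (a
    simplified sketch, not the exact kernel code):

    static DEFINE_PER_CPU(struct cpuidle_driver *, cpuidle_drivers);

    static struct cpuidle_driver *__cpuidle_get_cpu_driver(int cpu)
    {
        return per_cpu(cpuidle_drivers, cpu);
    }

    static void __cpuidle_set_cpu_driver(struct cpuidle_driver *drv, int cpu)
    {
        per_cpu(cpuidle_drivers, cpu) = drv;
    }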

    Signed-off-by: Daniel Lezcano
    Acked-by: Peter De Schrijver
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • This patch is a preparation for the multiple cpuidle drivers support.

    As the next patch will introduce the multiple drivers with the Kconfig
    option and we want to keep the code clean and understandable, this patch
    defines a set of functions for encapsulating some common parts and splits
    what should be done under a lock from the rest.

    [rjw: Modified the subject and changelog slightly.]
    Signed-off-by: Daniel Lezcano
    Acked-by: Peter De Schrijver
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • The code is racy and the check with cpuidle_curr_driver should be
    done under the lock.

    I could not find a path in the different drivers where that could
    actually happen, because the arch-specific drivers are written in
    such a way that a driver cannot be registered while another one is
    being unregistered, except maybe in the very improbable case where
    "intel_idle" and "processor_idle" are competing: one could
    unregister a driver while the other one is registering.

    Signed-off-by: Daniel Lezcano
    Acked-by: Peter De Schrijver
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • We want to support different cpuidle drivers co-existing. In this
    case we should move the refcount to the cpuidle_driver structure so
    that several drivers can be handled at a time.
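
    A sketch of the idea (close to, but not exactly, the kernel code;
    the lock and global names are illustrative):

    struct cpuidle_driver {
        /* ... other fields ... */
        int refcnt;     /* number of current users of this driver */
    };

    struct cpuidle_driver *cpuidle_driver_ref(void)
    {
        struct cpuidle_driver *drv;

        spin_lock(&cpuidle_driver_lock);
        drv = cpuidle_curr_driver;
        if (drv)
            drv->refcnt++;
        spin_unlock(&cpuidle_driver_lock);
        return drv;
    }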

    Signed-off-by: Daniel Lezcano
    Acked-by: Peter De Schrijver
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • The "struct device" is only used in sysfs.c.

    The other .c files including the private header "cpuidle.h"
    do not need to pull the entire headers tree from there as they
    don't manipulate the "struct device".

    This patch fixes this by moving the header inclusion to sysfs.c
    and adding a forward declaration for the struct device.

    The number of lines generated by the preprocessor:
    Without this patch: 17269 loc
    With this patch: 16446 loc
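
    The pattern, in outline (the prototypes shown are illustrative, not
    the full private header):

    /* drivers/cpuidle/cpuidle.h */
    struct device;      /* a forward declaration is enough for pointers */

    int cpuidle_add_sysfs(struct device *dev);
    void cpuidle_remove_sysfs(struct device *dev);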

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • The structure cpuidle_state_kobj is not used anywhere except
    in the sysfs.c file. The definition of this structure is not
    needed in the cpuidle header file. This patch moves it to the
    sysfs.c file in order to encapsulate the code a bit more.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • The function detect_repeating_patterns was not very useful for
    workloads with alternating long and short pauses, for example
    virtual machines handling network requests for each other (say
    a web and database server).

    Instead, try to find a recent sleep interval that is somewhere
    between the median and the mode sleep time, by discarding outliers
    on the high side and recalculating the average and standard
    deviation until that is no longer required.

    This should do something sane with a sleep interval series like:

    200 180 210 10000 30 1000 170 200

    The current code would simply discard such a series, while the
    new code will guess a typical sleep interval just shy of 200.

    The original patch came from Rik van Riel.
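
    A self-contained sketch of that idea, in userspace C for clarity
    (the governor itself uses integer arithmetic and different
    thresholds, so this is illustrative only):

    #include <math.h>

    /* Estimate a "typical" interval by repeatedly discarding samples
     * above the current average until the spread becomes small. */
    static unsigned int typical_interval(const unsigned int *intervals, int n)
    {
        unsigned int max = ~0U;
        double avg, stddev, sum, sq;
        int i, count;

        for (;;) {
            sum = sq = 0.0;
            count = 0;
            for (i = 0; i < n; i++) {
                if (intervals[i] > max)
                    continue;
                sum += intervals[i];
                sq += (double)intervals[i] * intervals[i];
                count++;
            }
            if (count == 0)
                return 0;
            avg = sum / count;
            stddev = sqrt(sq / count - avg * avg);

            if (stddev <= avg / 2)      /* spread small enough: accept */
                return (unsigned int)avg;

            max = (unsigned int)avg;    /* drop high outliers and retry */
        }
    }

    The structure (average, standard deviation, discard, repeat) is the
    same as described above; only the acceptance threshold differs.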

    Signed-off-by: Rik van Riel
    Signed-off-by: Youquan Song
    Signed-off-by: Rafael J. Wysocki

    Youquan Song
     
  • Sometimes the cpuidle governor chooses a C-state for an idle CPU to
    enter, but then notices that there are tasks waiting to be executed,
    so the CPU does not really enter the target C-state and goes off to
    run the tasks instead.

    In this situation, the residency of the previously entered C-state
    would be reused for a state that was never entered, which is
    obviously not reasonable.

    So, this patch fixes it by setting the target C-state residency to 0.

    Signed-off-by: Rik van Riel
    Signed-off-by: Youquan Song
    Signed-off-by: Rafael J. Wysocki

    Youquan Song
     
  • Predicting the future is difficult, and when the cpuidle governor's
    prediction fails it may choose a shallower C-state than it should.
    Noticing such failures quickly is important for power saving.

    The patch extends the prediction-failure handling to the general
    case in which the prediction logic computes a small predicted
    residency and therefore chooses a shallow C-state, even though the
    expected residency is large. If the prediction fails, the CPU would
    stay in the shallow C-state for a long time although it could have
    entered a deep C-state. So, when the expected residency is long
    enough but the governor chooses a shallow C-state, a timer is added
    in order to detect the prediction failure.

    If the CPU is woken up before the added timer expires, the timer is
    cancelled. If the timer fires, the menu governor quickly notices the
    prediction failure and re-evaluates the possibility of entering
    deeper C-states.

    Signed-off-by: Rik van Riel
    Signed-off-by: Youquan Song
    Signed-off-by: Rafael J. Wysocki

    Youquan Song
     
  • Predicting the future is difficult, and when the cpuidle governor's
    prediction fails it may choose a shallower C-state than it should.
    Noticing such failures quickly is important for power saving.

    The cpuidle menu governor has a method to predict a repeating
    pattern: if the last 8 C-state residencies are the same or very
    close, it predicts that the next residency will be the same again.

    A real case is the turbostat utility (tools/power/x86/turbostat) at
    kernel 3.3 or earlier. turbostat reads 10 registers one by one on
    Sandybridge, so it generates 10 IPIs that wake up the idle CPUs. The
    cpuidle menu governor therefore predicts a repeat mode, assumes
    another IPI will wake the idle CPU soon, and keeps the idle CPU in
    the C1 state even though the CPU is totally idle. However, after
    those 10 register reads turbostat sleeps for 5 seconds by default,
    so the idle CPU stays in C1 for a long time until some other wakeup
    event occurs. On an idle Sandybridge system running "./turbostat -v",
    the deep C-state residency dangles between 70% ~ 99%. After patching
    the kernel, the deep C-state residency stays at >99.98%.

    In the patch, a timer is added when the menu governor detects a
    repeat mode and chooses a shallow C-state. The timer is set to a
    timeout value greater than the predicted time, and repeat-mode
    prediction failure is concluded if the timer fires. When the repeat
    mode happens as expected, the timer is not triggered, because the
    CPU wakes up from the C-state earlier and cancels it. When the
    repeat mode does not happen, the timer fires and the menu governor
    quickly notices that the repeat-mode prediction failed and then
    re-evaluates the possibility of entering deeper C-states.

    Below is another test case which clearly shows the benefit of the patch:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <unistd.h>
    #include <getopt.h>
    #include <signal.h>
    #include <pthread.h>

    volatile int *shutdown;
    volatile long *count;
    int delay = 20;
    int loop = 8;

    void usage(void)
    {
        fprintf(stderr,
            "Usage: idle_predict [options]\n"
            " --help    -h  Print this help\n"
            " --thread  -n  Thread number\n"
            " --loop    -l  Loop times in shallow Cstate\n"
            " --delay   -t  Sleep time (uS) in shallow Cstate\n");
    }

    void *simple_loop(void *arg)
    {
        int idle_num = 1;

        while (!(*shutdown)) {
            *count = *count + 1;

            if (idle_num % loop)
                usleep(delay);
            else {
                /* sleep 1 second */
                usleep(1000000);
                idle_num = 0;
            }
            idle_num++;
        }
        return NULL;
    }

    static void sighand(int sig)
    {
        *shutdown = 1;
    }

    int main(int argc, char *argv[])
    {
        sigset_t sigset;
        int signum = SIGALRM;
        int i, c, er = 0, thread_num = 8;
        pthread_t pt[1024];

        static char optstr[] = "n:l:t:h:";

        while ((c = getopt(argc, argv, optstr)) != EOF)
            switch (c) {
            case 'n':
                thread_num = atoi(optarg);
                break;
            case 'l':
                loop = atoi(optarg);
                break;
            case 't':
                delay = atoi(optarg);
                break;
            case 'h':
            default:
                usage();
                exit(1);
            }

        printf("thread=%d,loop=%d,delay=%d\n", thread_num, loop, delay);
        count = malloc(sizeof(long));
        shutdown = malloc(sizeof(int));
        *count = 0;
        *shutdown = 0;

        sigemptyset(&sigset);
        sigaddset(&sigset, signum);
        sigprocmask(SIG_BLOCK, &sigset, NULL);
        signal(SIGINT, sighand);
        signal(SIGTERM, sighand);

        for (i = 0; i < thread_num; i++)
            pthread_create(&pt[i], NULL, simple_loop, NULL);

        for (i = 0; i < thread_num; i++)
            pthread_join(pt[i], NULL);

        exit(0);
    }

    Get powertop V2 from git://github.com/fenrus75/powertop and build it.
    Then build the above test application and run it. The test platform
    can be Intel Sandybridge or another recent platform.
    #./idle_predict -l 10 &
    #./powertop

    We will find that the deep C-state residency dangles between 40%~100%
    and much time is spent in the C1 state. This is because the menu
    governor wrongly predicts that the repeat mode continues, so it
    chooses the shallow C1 state even though the CPU has a chance to
    sleep for 1 second in a deep C-state.

    After patching the kernel, the deep C-state residency stays at >99.6%.

    Signed-off-by: Rik van Riel
    Signed-off-by: Youquan Song
    Signed-off-by: Rafael J. Wysocki

    Youquan Song
     
  • Move the kobj initialization and completion into sysfs.c and
    encapsulate the code more.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     
  • The function needs the cpuidle_device, which is initially available
    to the caller.

    The current code gets the struct device from the struct
    cpuidle_device and passes it to the cpuidle_add_sysfs() function,
    which then calls per_cpu(cpuidle_devices, cpu) to get the
    cpuidle_device back.

    This patch passes the cpuidle_device instead and simplifies the code.

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     

09 Oct, 2012

1 commit

  • On a KVM guest, when a CPU is taken offline and brought back online, we hit
    the following NULL pointer dereference:

    [ 45.400843] Unregister pv shared memory for cpu 1
    [ 45.412331] smpboot: CPU 1 is now offline
    [ 45.529894] SMP alternatives: lockdep: fixing up alternatives
    [ 45.533472] smpboot: Booting Node 0 Processor 1 APIC 0x1
    [ 45.411526] kvm-clock: cpu 1, msr 0:7d14601, secondary cpu clock
    [ 45.571370] KVM setup async PF for cpu 1
    [ 45.572331] kvm-stealtime: cpu 1, msr 7d0e040
    [ 45.575031] BUG: unable to handle kernel NULL pointer dereference at (null)
    [ 45.576017] IP: [] cpuidle_disable_device+0x18/0x80
    [ 45.576017] PGD 5dfb067 PUD 5da8067 PMD 0
    [ 45.576017] Oops: 0000 [#1] SMP
    [ 45.576017] Modules linked in:
    [ 45.576017] CPU 0
    [ 45.576017] Pid: 607, comm: stress_cpu_hotp Not tainted 3.6.0-padata-tp-debug #3 Bochs Bochs
    [ 45.576017] RIP: 0010:[] [] cpuidle_disable_device+0x18/0x80
    [ 45.576017] RSP: 0018:ffff880005d93ce8 EFLAGS: 00010286
    [ 45.576017] RAX: ffff880005d93fd8 RBX: 0000000000000000 RCX: 0000000000000006
    [ 45.576017] RDX: 0000000000000006 RSI: 2222222222222222 RDI: 0000000000000000
    [ 45.576017] RBP: ffff880005d93cf8 R08: 2222222222222222 R09: 2222222222222222
    [ 45.576017] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
    [ 45.576017] R13: 0000000000000000 R14: ffffffff81c8cca0 R15: 0000000000000001
    [ 45.576017] FS: 00007f91936ae700(0000) GS:ffff880007c00000(0000) knlGS:0000000000000000
    [ 45.576017] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    [ 45.576017] CR2: 0000000000000000 CR3: 0000000005db3000 CR4: 00000000000006f0
    [ 45.576017] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    [ 45.576017] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    [ 45.576017] Process stress_cpu_hotp (pid: 607, threadinfo ffff880005d92000, task ffff8800066bbf40)
    [ 45.576017] Stack:
    [ 45.576017] ffff880007a96400 0000000000000000 ffff880005d93d28 ffffffff813ac689
    [ 45.576017] ffff880007a96400 ffff880007a96400 0000000000000002 ffffffff81cd8d01
    [ 45.576017] ffff880005d93d58 ffffffff813aa498 0000000000000001 00000000ffffffdd
    [ 45.576017] Call Trace:
    [ 45.576017] [] acpi_processor_hotplug+0x55/0x97
    [ 45.576017] [] acpi_cpu_soft_notify+0x93/0xce
    [ 45.576017] [] notifier_call_chain+0x5d/0x110
    [ 45.576017] [] __raw_notifier_call_chain+0xe/0x10
    [ 45.576017] [] __cpu_notify+0x20/0x40
    [ 45.576017] [] cpu_notify+0x15/0x20
    [ 45.576017] [] _cpu_up+0xee/0x137
    [ 45.576017] [] cpu_up+0x49/0x59
    [ 45.576017] [] store_online+0x9d/0xe0
    [ 45.576017] [] dev_attr_store+0x18/0x30
    [ 45.576017] [] sysfs_write_file+0xe0/0x150
    [ 45.576017] [] vfs_write+0xac/0x180
    [ 45.576017] [] sys_write+0x52/0xa0
    [ 45.576017] [] system_call_fastpath+0x16/0x1b
    [ 45.576017] Code: 48 c7 c7 40 e5 ca 81 e8 07 d0 18 00 5d c3 0f 1f 44 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 10 48 89 5d f0 4c 89 65 f8 48 89 fb 07 02 75 13 48 8b 5d f0 4c 8b 65 f8 c9 c3 66 0f 1f 84 00 00
    [ 45.576017] RIP [] cpuidle_disable_device+0x18/0x80
    [ 45.576017] RSP
    [ 45.576017] CR2: 0000000000000000
    [ 45.656079] ---[ end trace 433d6c9ac0b02cef ]---

    Analysis:
    Commit 3d339dc (cpuidle / ACPI: move cpuidle_device field out of the
    acpi_processor_power structure) made the allocation of a CPU's dev
    structure (struct cpuidle_device) dynamic, whereas previously it was
    statically allocated. This dynamic allocation occurs in
    acpi_processor_power_init() only if pr->flags.power evaluates to
    non-zero.

    On KVM guests, pr->flags.power evaluates to zero, hence dev is never
    allocated. This causes the NULL pointer (dev) dereference in
    cpuidle_disable_device() during a subsequent CPU online operation. Fix this
    by ensuring that dev is non-NULL before dereferencing.

    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Len Brown

    Srivatsa S. Bhat
     

22 Sep, 2012

1 commit

  • The name of the function __cpuidle_register_driver is confusing
    because, following the usual kernel coding style, it suggests that
    it registers the driver without taking a lock. Actually, it just
    fills the power fields of the different states with decreasing
    values if the power has not been specified.

    Clarify the purpose of the function by changing its name and by
    moving the condition out of it.

    This patch fixes nothing and does not change the behavior of the
    function. It is just for the sake of clarity.

    IMHO, reading in the code:

    + if (!drv->power_specified)
    + set_power_states(drv);

    is much more explicit than:

    - __cpuidle_register_driver(drv);

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     

04 Sep, 2012

2 commits

  • For the mechanism introduced by commit cbc9ef0 (PM / Domains: Add
    preliminary support for cpuidle, v2) to work with the ladder
    governor, that governor should respect the "disabled" state flag
    added by that commit. Change the ladder governor accordingly.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • There are two cpuidle governors, ladder and menu. While the ladder
    governor is always available, provided CONFIG_CPU_IDLE is selected,
    the menu governor additionally requires CONFIG_NO_HZ.

    A particular C state can be disabled by writing to the sysfs file
    /sys/devices/system/cpu/cpuN/cpuidle/stateN/disable, but this mechanism
    is only implemented in the menu governor. Thus, in a system where
    CONFIG_NO_HZ is not selected, the ladder governor becomes default and
    always will walk through all sleep states - irrespective of whether the
    C state was disabled via sysfs or not. The only way to select a specific
    C state was to write the related latency to /dev/cpu_dma_latency and
    keep the file open as long as this setting was required - not very
    practical and not suitable for setting a single core in an SMP system.

    With this patch, the ladder governor only will promote to the next
    C state, if it has not been disabled, and it will demote, if the
    current C state was disabled.

    Note that the patch does not make the setting of the sysfs variable
    "disable" coherent, i.e. if one is disabling a light state, then all
    deeper states are disabled as well, but the "disable" variable does not
    reflect it. Likewise, if one enables a deep state but a lighter state
    still is disabled, then this has no effect. A related section has been
    added to the documentation.
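
    Roughly, the promotion path in the ladder governor then gains a
    check like this (a sketch; the index and threshold variables are
    placeholders for the governor's internal state):

    /* Promote only if the next deeper state exists and has not been
     * disabled, either by the driver or via sysfs; symmetrically,
     * demote out of a state that has been disabled. */
    if (last_idx < drv->state_count - 1 &&
        !drv->states[last_idx + 1].disabled &&
        !dev->states_usage[last_idx + 1].disable &&
        last_residency > promotion_time)
            next_idx = last_idx + 1;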

    Signed-off-by: Carsten Emde
    Signed-off-by: Rafael J. Wysocki

    Carsten Emde
     

18 Aug, 2012

2 commits

  • When a kernel is built to support multiple hardware types it's possible
    that CONFIG_ARCH_NEEDS_CPU_IDLE_COUPLED is set but the hardware the
    kernel is run on doesn't support cpuidle and therefore doesn't load a
    driver for it. In this case, when the system is shut down,
    cpuidle_coupled_cpu_notify() gets called with cpuidle_devices set to
    NULL. There are quite possibly other circumstances where this
    situation can also occur and we should check for it.

    Signed-off-by: Jon Medhurst
    Signed-off-by: Rafael J. Wysocki

    Jon Medhurst (Tixy)
     
  • The cpu hotplug notifier gets called in both atomic and non-atomic
    contexts, so it is not always safe to lock a mutex. Filter out all
    events except the six necessary ones, which are all sleepable,
    before taking the mutex.
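
    The shape of the filter, roughly (a sketch of the notifier; the
    actual mask updates are elided):

    static int cpuidle_coupled_cpu_notify(struct notifier_block *nb,
                                          unsigned long action, void *hcpu)
    {
        /* Only these six (sleepable) notifications are of interest; all
         * other events, including the atomic ones, return early so the
         * mutex below is never taken in atomic context. */
        switch (action & ~CPU_TASKS_FROZEN) {
        case CPU_UP_PREPARE:
        case CPU_DOWN_PREPARE:
        case CPU_ONLINE:
        case CPU_DEAD:
        case CPU_UP_CANCELED:
        case CPU_DOWN_FAILED:
            break;
        default:
            return NOTIFY_OK;
        }

        mutex_lock(&cpuidle_lock);
        /* ... update the coupled cpu masks ... */
        mutex_unlock(&cpuidle_lock);

        return NOTIFY_OK;
    }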

    Signed-off-by: Colin Cross
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Rafael J. Wysocki

    Colin Cross
     

27 Jul, 2012

1 commit

  • Pull ACPI & power management update from Len Brown:
    "Re-write of the turbostat tool.
    Lower overhead was necessary for measuring very large systems when
    they are very idle.

    IVB support in intel_idle
    It's what I run on my IVB, others should be able to also:-)

    ACPICA core update
    We have found some bugs due to divergence between Linux and the
    upstream ACPICA base. Most of these patches are to reduce that
    divergence to reduce the risk of future bugs.

    Some cpuidle updates, mostly for non-Intel
    More will be coming, as they depend on this part.

    Some thermal management changes needed by non-ACPI systems.

    Some _OST (OS Status Indication) updates for ACPI hot-plug."

    * 'release' of git://git.kernel.org/pub/scm/linux/kernel/git/lenb/linux: (51 commits)
    Thermal: Documentation update
    Thermal: Add Hysteresis attributes
    Thermal: Make Thermal trip points writeable
    ACPI/AC: prevent OOPS on some boxes due to missing check power_supply_register() return value check
    tools/power: turbostat: fix large c1% issue
    tools/power: turbostat v2 - re-write for efficiency
    ACPICA: Update to version 20120711
    ACPICA: AcpiSrc: Fix some translation issues for Linux conversion
    ACPICA: Update header files copyrights to 2012
    ACPICA: Add new ACPI table load/unload external interfaces
    ACPICA: Split file: tbxface.c -> tbxfload.c
    ACPICA: Add PCC address space to space ID decode function
    ACPICA: Fix some comment fields
    ACPICA: Table manager: deploy new firmware error/warning interfaces
    ACPICA: Add new interfaces for BIOS(firmware) errors and warnings
    ACPICA: Split exception code utilities to a new file, utexcep.c
    ACPI: acpi_pad: tune round_robin_time
    ACPICA: Update to version 20120620
    ACPICA: Add support for implicit notify on multiple devices
    ACPICA: Update comments; no functional change
    ...

    Linus Torvalds
     

19 Jul, 2012

1 commit

  • * pm-domains:
    PM / Domains: Fix build warning for CONFIG_PM_RUNTIME unset
    PM / Domains: Replace plain integer with NULL pointer in domain.c file
    PM / Domains: Add missing static storage class specifier in domain.c file
    PM / Domains: Allow device callbacks to be added at any time
    PM / Domains: Add device domain data reference counter
    PM / Domains: Add preliminary support for cpuidle, v2
    PM / Domains: Do not stop devices after restoring their states
    PM / Domains: Use subsystem runtime suspend/resume callbacks by default

    Rafael J. Wysocki
     

11 Jul, 2012

1 commit

  • On certain bios, resume hangs if cpus are allowed to enter idle states
    during suspend [1].

    This was fixed in the acpi idle driver [2], but the intel_idle driver
    does not have this fix. Thus, instead of replicating the fix in both
    idle drivers, or in more platform-specific idle drivers if needed,
    the more general cpuidle infrastructure could handle this.

    A suspend callback in cpuidle_driver could handle this fix. But
    a cpuidle_driver provides only basic functionalities like platform idle
    state detection capability and mechanisms to support entry and exit
    into CPU idle states. All other cpuidle functions are found in the
    generic cpuidle infrastructure, for the good reason that all cpuidle
    drivers, irrespective of their platforms, will support these functions.

    One option therefore would be to register a suspend callback in cpuidle
    which handles this fix. This could be called through a PM_SUSPEND_PREPARE
    notifier. But this is too generic a notifier for a driver to handle.

    Also, ideally the job of cpuidle is not to handle side effects of suspend.
    It should expose the interfaces which "handle cpuidle 'during' suspend"
    or any other operation, which the subsystems call during that respective
    operation.

    The fix demands that during suspend, no cpus should be allowed to enter
    deep C-states. The interface cpuidle_uninstall_idle_handler() in cpuidle
    ensures that. Not just that it also kicks all the cpus which are already
    in idle out of their idle states which was being done during cpu hotplug
    through a CPU_DYING_FROZEN callbacks.

    Now the question arises about when during suspend should
    cpuidle_uninstall_idle_handler() be called. Since we are dealing with
    drivers it seems best to call this function during dpm_suspend().
    Delaying the call till dpm_suspend_noirq() does no harm, as long as it is
    before cpu_hotplug_begin() to avoid race conditions with cpu hotplug
    operations. In dpm_suspend_noirq(), it would be wise to place this call
    before suspend_device_irqs() to avoid ugly interactions with the same.

    Analogously, the same is done during resume.

    References:
    [1] https://bugs.launchpad.net/ubuntu/+source/linux/+bug/674075.
    [2] http://marc.info/?l=linux-pm&m=133958534231884&w=2
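
    In outline, the ordering this argues for looks like the following (a
    sketch, using the function names mentioned in this text and their
    natural counterparts; the surrounding dpm_ code is elided):

    /* During suspend, before device interrupts are disabled: */
    cpuidle_uninstall_idle_handler();   /* keep cpus out of deep C-states
                                         * and kick already-idle cpus */
    suspend_device_irqs();

    /* ... and symmetrically on resume, after interrupts are back: */
    resume_device_irqs();
    cpuidle_install_idle_handler();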

    Reported-and-tested-by: Dave Hansen
    Signed-off-by: Preeti U Murthy
    Reviewed-by: Srivatsa S. Bhat
    Signed-off-by: Rafael J. Wysocki

    Preeti U Murthy
     

04 Jul, 2012

3 commits

  • On some systems there are CPU cores located in the same power
    domains as I/O devices. Then, power can only be removed from the
    domain if all I/O devices in it are not in use and the CPU core
    is idle. Add preliminary support for that to the generic PM domains
    framework.

    First, the platform is expected to provide a cpuidle driver with one
    extra state designated for use with the generic PM domains code.
    This state should be initially disabled and its exit_latency value
    should be set to whatever time is needed to bring up the CPU core
    itself after restoring power to it, not including the domain's
    power on latency. Its .enter() callback should point to a procedure
    that will remove power from the domain containing the CPU core at
    the end of the CPU power transition.

    The remaining characteristics of the extra cpuidle state, referred to
    as the "domain" cpuidle state below, (e.g. power usage, target
    residency) should be populated in accordance with the properties of
    the hardware.

    Next, the platform should execute genpd_attach_cpuidle() on the PM
    domain containing the CPU core. That will cause the generic PM
    domains framework to treat that domain in a special way such that:

    * When all devices in the domain have been suspended and it is about
    to be turned off, the states of the devices will be saved, but
    power will not be removed from the domain. Instead, the "domain"
    cpuidle state will be enabled so that power can be removed from
    the domain when the CPU core is idle and the state has been chosen
    as the target by the cpuidle governor.

    * When the first I/O device in the domain is resumed and
    __pm_genpd_poweron() is called for the first time after
    power has been removed from the domain, the "domain" cpuidle
    state will be disabled to avoid subsequent surprise power removals
    via cpuidle.

    The effective exit_latency value of the "domain" cpuidle state
    depends on the time needed to bring up the CPU core itself after
    restoring power to it as well as on the power on latency of the
    domain containing the CPU core. Thus the "domain" cpuidle state's
    exit_latency has to be recomputed every time the domain's power on
    latency is updated, which may happen every time power is restored
    to the domain, if the measured power on latency is greater than
    the latency stored in the corresponding generic_pm_domain structure.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Kevin Hilman

    Rafael J. Wysocki
     
  • Add a reference counter for the cpuidle driver, so that it can't
    be unregistered when it is in use.

    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
  • Andrew J. Schorr raises a question: when he changes the disable
    setting on a single CPU, it affects all the other CPUs. Basically,
    the disable field is currently per-driver instead of per-cpu, and
    all the C-states of the same driver are shared by all CPUs in the
    same machine.

    The patch changes the `disable' field to per-cpu, so it can be set
    separately for each cpu.
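
    For illustration, a state can then be disabled on one CPU without
    affecting the others; a minimal userspace sketch (the cpu2/state3
    path is just an example):

    #include <stdio.h>

    int main(void)
    {
        /* Disable C-state index 3 on CPU 2 only; other CPUs are unaffected. */
        FILE *f = fopen("/sys/devices/system/cpu/cpu2/cpuidle/state3/disable", "w");

        if (!f) {
            perror("fopen");
            return 1;
        }
        fputs("1\n", f);
        fclose(f);
        return 0;
    }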

    Signed-off-by: ShuoX Liu
    Reported-by: Andrew J.Schorr
    Reviewed-by: Yanmin Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Rafael J. Wysocki

    ShuoX Liu
     

02 Jun, 2012

2 commits

  • Adds cpuidle_coupled_parallel_barrier, which can be used by coupled
    cpuidle state enter functions to handle resynchronization after
    determining if any cpu needs to abort. The normal use case will
    be:

    static bool abort_flag;
    static atomic_t abort_barrier;

    int arch_cpuidle_enter(struct cpuidle_device *dev, ...)
    {
        if (arch_turn_off_irq_controller()) {
            /* returns an error if an irq is pending and would be lost
             * if idle continued and turned off power */
            abort_flag = true;
        }

        cpuidle_coupled_parallel_barrier(dev, &abort_barrier);

        if (abort_flag) {
            /* One of the cpus didn't turn off its irq controller */
            arch_turn_on_irq_controller();
            return -EINTR;
        }

        /* continue with idle */
        ...
    }

    This will cause all cpus to abort idle together if one of them needs
    to abort.

    Reviewed-by: Santosh Shilimkar
    Tested-by: Santosh Shilimkar
    Reviewed-by: Kevin Hilman
    Tested-by: Kevin Hilman
    Signed-off-by: Colin Cross
    Signed-off-by: Len Brown

    Colin Cross
     
  • On some ARM SMP SoCs (OMAP4460, Tegra 2, and probably more), the
    cpus cannot be independently powered down, either due to
    sequencing restrictions (on Tegra 2, cpu 0 must be the last to
    power down), or due to HW bugs (on OMAP4460, a cpu powering up
    will corrupt the gic state unless the other cpu runs a work
    around). Each cpu has a power state that it can enter without
    coordinating with the other cpu (usually Wait For Interrupt, or
    WFI), and one or more "coupled" power states that affect blocks
    shared between the cpus (L2 cache, interrupt controller, and
    sometimes the whole SoC). Entering a coupled power state must
    be tightly controlled on both cpus.

    The easiest solution to implementing coupled cpu power states is
    to hotplug all but one cpu whenever possible, usually using a
    cpufreq governor that looks at cpu load to determine when to
    enable the secondary cpus. This causes problems, as hotplug is an
    expensive operation, so the number of hotplug transitions must be
    minimized, leading to very slow response to loads, often on the
    order of seconds.

    This file implements an alternative solution, where each cpu will
    wait in the WFI state until all cpus are ready to enter a coupled
    state, at which point the coupled state function will be called
    on all cpus at approximately the same time.

    Once all cpus are ready to enter idle, they are woken by an smp
    cross call. At this point, there is a chance that one of the
    cpus will find work to do, and choose not to enter idle. A
    final pass is needed to guarantee that all cpus will call the
    power state enter function at the same time. During this pass,
    each cpu will increment the ready counter, and continue once the
    ready counter matches the number of online coupled cpus. If any
    cpu exits idle, the other cpus will decrement their counter and
    retry.

    To use coupled cpuidle states, a cpuidle driver must:

    Set struct cpuidle_device.coupled_cpus to the mask of all
    coupled cpus, usually the same as cpu_possible_mask if all cpus
    are part of the same cluster. The coupled_cpus mask must be
    set in the struct cpuidle_device for each cpu.

    Set struct cpuidle_device.safe_state to a state that is not a
    coupled state. This is usually WFI.

    Set CPUIDLE_FLAG_COUPLED in struct cpuidle_state.flags for each
    state that affects multiple cpus.

    Provide a struct cpuidle_state.enter function for each state
    that affects multiple cpus. This function is guaranteed to be
    called on all cpus at approximately the same time. The driver
    should ensure that the cpus all abort together if any cpu tries
    to abort once the function is called.

    update1:

    cpuidle: coupled: fix count of online cpus

    online_count was never incremented on boot, and was also counting
    cpus that were not part of the coupled set. Fix both issues by
    introducing a new function that counts online coupled cpus, and
    call it from register as well as from the hotplug notifier.

    update2:

    cpuidle: coupled: fix decrementing ready count

    cpuidle_coupled_set_not_ready sometimes refuses to decrement the
    ready count in order to prevent a race condition. This makes it
    unsuitable for use when finished with idle. Add a new function
    cpuidle_coupled_set_done that decrements both the ready count and
    waiting count, and call it after idle is complete.

    Cc: Amit Kucheria
    Cc: Arjan van de Ven
    Cc: Trinabh Gupta
    Cc: Deepthi Dharwar
    Reviewed-by: Santosh Shilimkar
    Tested-by: Santosh Shilimkar
    Reviewed-by: Kevin Hilman
    Tested-by: Kevin Hilman
    Signed-off-by: Colin Cross
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Len Brown

    Colin Cross