08 Jun, 2017

1 commit

  • Deferrable vmstat_updater was missing in commit:

    c1de45ca831a ("sched/idle: Add support for tasks that inject idle")

    Add it back.

    Signed-off-by: Aubrey Li
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aubrey Li
    Cc: Christoph Lameter
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1496803742-38274-1-git-send-email-aubrey.li@intel.com
    Signed-off-by: Ingo Molnar

    Aubrey Li
     

15 May, 2017

1 commit

  • I finally got around to creating trampolines for dynamically allocated
    ftrace_ops by using synchronize_rcu_tasks(). For users of the ftrace
    function hook callbacks, like perf, that allocate the ftrace_ops
    descriptor via kmalloc() and friends, ftrace was not able to optimize
    the functions being traced to use a trampoline, because the trampolines
    would also need to be allocated dynamically. The problem is that they
    cannot be freed when CONFIG_PREEMPT is set, as there's no way to tell
    if a task was preempted on the trampoline. That was before Paul McKenney
    implemented synchronize_rcu_tasks(), which makes sure all tasks
    (except idle) have scheduled out or have entered user space.

    While testing this, I triggered this bug:

    BUG: unable to handle kernel paging request at ffffffffa0230077
    ...
    RIP: 0010:0xffffffffa0230077
    ...
    Call Trace:
    schedule+0x5/0xe0
    schedule_preempt_disabled+0x18/0x30
    do_idle+0x172/0x220

    What happened was that the idle task was preempted on the trampoline.
    As synchronize_rcu_tasks() ignores the idle thread, there's nothing
    that lets ftrace know that the idle task was preempted on a trampoline.

    The idle task shouldn't ever need to enable preemption. The idle task
    is simply a loop that calls schedule() or places the cpu into idle mode.
    In fact, having preemption enabled is inefficient, because preemption
    can happen when idle is just about to call schedule() anyway, which
    causes schedule() to be called twice: once when the interrupt returns
    back to normal context, and then again in the normal path that the
    idle loop is running in, which is pointless, as it has already
    scheduled.

    The only reason schedule_preempt_disabled() enables preemption is to be
    able to call sched_submit_work(), which requires preemption enabled. As
    this is a nop when the task is in the RUNNING state, and idle is always
    in the RUNNING state, there's no reason for idle to enable preemption.
    But that means idle cannot use schedule_preempt_disabled(), as other
    callers of that function require calling sched_submit_work().

    Adding a new function, local to kernel/sched/, that allows idle to call
    the scheduler without enabling preemption fixes the
    synchronize_rcu_tasks() issue and also removes the pointless spurious
    schedule() calls caused by interrupts arriving in the brief window where
    preemption is enabled just before schedule() is called.
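
    A minimal sketch of such a helper, assuming it is named schedule_idle()
    and reuses the internal __schedule() entry point (names and placement
    are illustrative, not the verbatim patch):

    /* Idle-only schedule: called with preemption already disabled. */
    void __sched schedule_idle(void)
    {
            do {
                    __schedule(false);      /* false: not a preemption */
            } while (need_resched());
    }

    /* do_idle() would then call schedule_idle() instead of
     * schedule_preempt_disabled(), never enabling preemption. */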

    Reviewed-by: Thomas Gleixner
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170414084809.3dacde2a@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     

08 Mar, 2017

1 commit

  • Change livepatch to use a basic per-task consistency model. This is the
    foundation which will eventually enable us to patch those ~10% of
    security patches which change function or data semantics. This is the
    biggest remaining piece needed to make livepatch more generally useful.

    This code stems from the design proposal made by Vojtech [1] in November
    2014. It's a hybrid of kGraft and kpatch: it uses kGraft's per-task
    consistency and syscall barrier switching combined with kpatch's stack
    trace switching. There are also a number of fallback options which make
    it quite flexible.

    Patches are applied on a per-task basis, when the task is deemed safe to
    switch over. When a patch is enabled, livepatch enters into a
    transition state where tasks are converging to the patched state.
    Usually this transition state can complete in a few seconds. The same
    sequence occurs when a patch is disabled, except the tasks converge from
    the patched state to the unpatched state.

    An interrupt handler inherits the patched state of the task it
    interrupts. The same is true for forked tasks: the child inherits the
    patched state of the parent.

    Livepatch uses several complementary approaches to determine when it's
    safe to patch tasks:

    1. The first and most effective approach is stack checking of sleeping
    tasks. If no affected functions are on the stack of a given task,
    the task is patched. In most cases this will patch most or all of
    the tasks on the first try. Otherwise it'll keep trying
    periodically. This option is only available if the architecture has
    reliable stacks (HAVE_RELIABLE_STACKTRACE).

    2. The second approach, if needed, is kernel exit switching. A
    task is switched when it returns to user space from a system call, a
    user space IRQ, or a signal. It's useful in the following cases:

    a) Patching I/O-bound user tasks which are sleeping on an affected
    function. In this case you have to send SIGSTOP and SIGCONT to
    force it to exit the kernel and be patched.
    b) Patching CPU-bound user tasks. If the task is highly CPU-bound
    then it will get patched the next time it gets interrupted by an
    IRQ.
    c) In the future it could be useful for applying patches for
    architectures which don't yet have HAVE_RELIABLE_STACKTRACE. In
    this case you would have to signal most of the tasks on the
    system. However this isn't supported yet because there's
    currently no way to patch kthreads without
    HAVE_RELIABLE_STACKTRACE.

    3. For idle "swapper" tasks, since they don't ever exit the kernel, they
    instead have a klp_update_patch_state() call in the idle loop which
    allows them to be patched before the CPU enters the idle state.

    (Note there's not yet such an approach for kthreads.)

    All the above approaches may be skipped by setting the 'immediate' flag
    in the 'klp_patch' struct, which will disable per-task consistency and
    patch all tasks immediately. This can be useful if the patch doesn't
    change any function or data semantics. Note that, even with this flag
    set, it's possible that some tasks may still be running with an old
    version of the function, until that function returns.

    There's also an 'immediate' flag in the 'klp_func' struct which allows
    you to specify that certain functions in the patch can be applied
    without per-task consistency. This might be useful if you want to patch
    a common function like schedule(), and the function change doesn't need
    consistency but the rest of the patch does.

    For architectures which don't have HAVE_RELIABLE_STACKTRACE, the user
    must set patch->immediate which causes all tasks to be patched
    immediately. This option should be used with care, only when the patch
    doesn't change any function or data semantics.

    In the future, architectures which don't have HAVE_RELIABLE_STACKTRACE
    may be allowed to use per-task consistency if we can come up with
    another way to patch kthreads.

    The /sys/kernel/livepatch/<patch>/transition file shows whether a patch
    is in transition. Only a single patch (the topmost patch on the stack)
    can be in transition at a given time. A patch can remain in transition
    indefinitely, if any of the tasks are stuck in the initial patch state.

    A transition can be reversed and effectively canceled by writing the
    opposite value to the /sys/kernel/livepatch/<patch>/enabled file while
    the transition is in progress. Then all the tasks will attempt to
    converge back to the original patch state.
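
    The scheduler-side piece of this (approach 3 above) boils down to a
    single hook in the idle loop; a rough sketch, with placement and
    surrounding code illustrative rather than verbatim:

    static void do_idle(void)
    {
            /* ... existing idle-state selection loop ... */

            /*
             * Idle tasks never exit the kernel, so give livepatch a
             * chance to transition them here, before the next schedule.
             */
            klp_update_patch_state(current);

            schedule_preempt_disabled();
    }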

    [1] https://lkml.kernel.org/r/20141107140458.GA21774@suse.cz

    Signed-off-by: Josh Poimboeuf
    Acked-by: Miroslav Benes
    Acked-by: Ingo Molnar # for the scheduler changes
    Signed-off-by: Jiri Kosina

    Josh Poimboeuf
     

02 Mar, 2017

1 commit


29 Nov, 2016

2 commits

  • Idle injection drivers such as Intel powerclamp and ACPI PAD drivers use
    realtime tasks to take control of CPU then inject idle. There are two
    issues with this approach:

    1. Low efficiency: the injected idle task is treated as busy, so sched
    ticks do not stop during the injected idle period; these unwanted
    wakeups can cause a ~20% loss in power savings.

    2. Idle accounting: injected idle time is presented to the user as busy.

    This patch addresses both issues by introducing a new PF_IDLE flag
    which allows any given task to be treated as an idle task while the
    flag is set. Idle injection tasks can therefore run through the normal
    flow of NOHZ idle enter/exit to get the correct accounting, as well as
    tick stop when possible.

    The implication is that the idle task is then no longer limited to
    PID == 0.
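
    A sketch of what "treated as an idle task while the flag is set" looks
    like in practice; the is_idle_task() form below is illustrative:

    /* Idle becomes a per-task property rather than "PID == 0". */
    static inline bool is_idle_task(const struct task_struct *p)
    {
            return !!(p->flags & PF_IDLE);
    }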

    Acked-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jacob Pan
    Signed-off-by: Rafael J. Wysocki

    Peter Zijlstra
     
  • When idle injection is used to cap power, we need to override the
    governor's choice of idle states.

    For this reason, make it possible to enforce selection of the deepest
    idle state by setting a flag on a given CPU, to achieve the maximum
    potential power draw reduction.
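
    A sketch of how the idle entry path might honour such a flag (the flag
    and helper names below are assumptions for illustration):

    if (dev->use_deepest_state) {
            /* Idle injection: ignore the governor, pick the deepest state. */
            next_state = cpuidle_find_deepest_state(drv, dev);
    } else {
            /* Normal path: ask the governor. */
            next_state = cpuidle_select(drv, dev);
    }
    call_cpuidle(drv, dev, next_state);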

    Signed-off-by: Jacob Pan
    [ rjw: Subject & changelog ]
    Signed-off-by: Rafael J. Wysocki

    Jacob Pan
     

08 Oct, 2016

1 commit

  • When doing an nmi backtrace of many cores, most of which are idle, the
    output is a little overwhelming and very uninformative. Suppress
    messages for cpus that are idling when they are interrupted and just
    emit one line, "NMI backtrace for N skipped: idling at pc 0xNNN".

    We do this by grouping all the cpuidle code together into a new
    .cpuidle.text section, and then checking the address of the interrupted
    PC to see if it lies within that section.

    This commit suitably tags x86 and tile idle routines, and only adds in
    the minimal framework for other architectures.
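
    A sketch of the mechanism, with illustrative names: a section attribute
    to tag idle routines, linker-provided section bounds, and a PC check
    used by the backtrace code:

    #define __cpuidle __attribute__((__section__(".cpuidle.text")))

    extern char __cpuidle_text_start[], __cpuidle_text_end[];

    static bool pc_is_in_cpuidle_text(unsigned long pc)
    {
            return pc >= (unsigned long)__cpuidle_text_start &&
                   pc <  (unsigned long)__cpuidle_text_end;
    }

    /* The NMI backtrace code can then print the one-line
     * "skipped: idling at pc ..." message when this returns true. */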

    Link: http://lkml.kernel.org/r/1472487169-14923-5-git-send-email-cmetcalf@mellanox.com
    Signed-off-by: Chris Metcalf
    Acked-by: Peter Zijlstra (Intel)
    Tested-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Thompson [arm]
    Tested-by: Petr Mladek
    Cc: Aaron Tomlin
    Cc: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chris Metcalf
     

14 Jun, 2016

1 commit


03 Jun, 2016

2 commits

  • Currently, smp_processor_id() is used to fetch the current CPU in
    cpu_idle_loop(). Every time the idle thread runs, it fetches the
    current CPU using smp_processor_id().

    Since the idle thread is per CPU, the current CPU is constant, so we
    can lift the load out of the loop, saving execution cycles/time in the
    loop.

    x86-64:

    Before patch (execution in loop):
    148: 0f ae e8 lfence
    14b: 65 8b 04 25 00 00 00 00 mov %gs:0x0,%eax
    152: 00
    153: 89 c0 mov %eax,%eax
    155: 49 0f a3 04 24 bt %rax,(%r12)

    After patch (execution in loop):
    150: 0f ae e8 lfence
    153: 4d 0f a3 34 24 bt %r14,(%r12)

    ARM64:

    Before patch (execution in loop):
    168: d5033d9f dsb ld
    16c: b9405661 ldr w1,[x19,#84]
    170: 1100fc20 add w0,w1,#0x3f
    174: 6b1f003f cmp w1,wzr
    178: 1a81b000 csel w0,w0,w1,lt
    17c: 130c7000 asr w0,w0,#6
    180: 937d7c00 sbfiz x0,x0,#3,#32
    184: f8606aa0 ldr x0,[x21,x0]
    188: 9ac12401 lsr x1,x0,x1
    18c: 36000e61 tbz w1,#0,358

    After patch (execution in loop):
    1a8: d50339df dsb ld
    1ac: f8776ac0 ldr x0,[x22,x23]
    1b0: ea18001f tst x0,x24
    1b4: 54000ea0 b.eq 388

    Further observation on ARM64 over 4 seconds shows that cpu_idle_loop()
    is called 8672 times. Hoisting the load saves the instructions executed
    in the loop and, over many iterations, time as well.
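
    In C terms the change amounts to hoisting the read out of the loop; a
    simplified sketch:

    static void cpu_idle_loop(void)
    {
            int cpu = smp_processor_id();   /* per-CPU thread: constant */

            while (1) {
                    while (!need_resched()) {
                            /* was: if (cpu_is_offline(smp_processor_id())) */
                            if (cpu_is_offline(cpu))
                                    arch_cpu_idle_dead();
                            /* ... rest of the idle path ... */
                    }
                    /* ... */
            }
    }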

    Signed-off-by: Gaurav Jindal
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Sanjeev Yadav
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20160512101330.GA488@gauravjindalubtnb.del.spreadtrum.com
    Signed-off-by: Ingo Molnar

    Gaurav Jindal
     
  • The cpuidle_devices per-CPU variable is only defined when CPU_IDLE is
    enabled. Commit c8cc7d4de7a4 ("sched/idle: Reorganize the idle loop")
    removed the #ifdef CONFIG_CPU_IDLE around cpuidle_idle_call(), relying
    on the compiler to optimise away the __this_cpu_read(cpuidle_devices)
    access. However, with CONFIG_UBSAN && !CONFIG_CPU_IDLE, this
    optimisation no longer happens and the kernel fails to link, since
    cpuidle_devices is not defined.

    This patch introduces an accessor function for the current CPU cpuidle
    device (returning NULL when !CONFIG_CPU_IDLE) and uses it in
    cpuidle_idle_call().
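
    A sketch of such an accessor (the name is illustrative): it is always
    defined, returning NULL when cpuidle is not built in, so the caller
    needs no #ifdef:

    #ifdef CONFIG_CPU_IDLE
    static inline struct cpuidle_device *cpuidle_get_device(void)
    {
            return __this_cpu_read(cpuidle_devices);
    }
    #else
    static inline struct cpuidle_device *cpuidle_get_device(void)
    {
            return NULL;
    }
    #endif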

    Signed-off-by: Catalin Marinas
    Cc: 4.5+ # 4.5+
    Signed-off-by: Rafael J. Wysocki

    Catalin Marinas
     

02 Mar, 2016

3 commits

  • Make the RCU CPU_DYING_IDLE callback an explicit function call, so it gets
    invoked at the proper place.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182341.870167933@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Kill the busy spinning on the control side and just wait for the hotplugged
    cpu to tell that it reached the dead state.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182341.776157858@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Let the upcoming cpu kick the hotplug thread and let itself complete the
    bringup. That way the control side can just wait for the completion, or,
    once we make the hotplug machinery async, not care at all.

    Signed-off-by: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: Rik van Riel
    Cc: Rafael Wysocki
    Cc: "Srivatsa S. Bhat"
    Cc: Peter Zijlstra
    Cc: Arjan van de Ven
    Cc: Sebastian Siewior
    Cc: Rusty Russell
    Cc: Steven Rostedt
    Cc: Oleg Nesterov
    Cc: Tejun Heo
    Cc: Andrew Morton
    Cc: Paul McKenney
    Cc: Linus Torvalds
    Cc: Paul Turner
    Link: http://lkml.kernel.org/r/20160226182341.697655464@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

30 Jan, 2016

1 commit

  • * pm-cpuidle:
    cpuidle: coupled: remove unused define cpuidle_coupled_lock
    cpuidle: fix fallback mechanism for suspend to idle in absence of enter_freeze

    * pm-cpufreq:
    cpufreq: cpufreq-dt: avoid uninitialized variable warnings:
    cpufreq: pxa2xx: fix pxa_cpufreq_change_voltage prototype
    cpufreq: Use list_is_last() to check last entry of the policy list
    cpufreq: Fix NULL reference crash while accessing policy->governor_data

    * pm-domains:
    PM / Domains: Fix typo in comment
    PM / Domains: Fix potential deadlock while adding/removing subdomains
    PM / domains: fix lockdep issue for all subdomains

    * pm-sleep:
    PM: APM_EMULATION does not depend on PM

    Rafael J. Wysocki
     

22 Jan, 2016

1 commit

  • Commit 51164251f5c3 "sched / idle: Drop default_idle_call() fallback
    from call_cpuidle()" made find_deepest_state() return a non-negative
    value and check all states with index > 0. As a result,
    find_deepest_state() now returns 0 even when no enter_freeze callbacks
    are implemented, and enter_freeze_proper() is called, which ends up
    crashing the kernel.

    This patch updates the check for index > 0 in cpuidle_enter_freeze()
    and cpuidle_idle_call() (when idle_should_freeze() is true) to restore
    the suspend-to-idle functionality in the absence of enter_freeze
    callbacks.
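
    A sketch of the restored check (the call signature is simplified and
    illustrative): only index > 0 takes the suspend-to-idle path, anything
    else makes the caller fall back to normal idle-state selection:

    int index = find_deepest_state(drv, dev);  /* simplified signature */

    if (index > 0)
            enter_freeze_proper(drv, dev, index);

    return index;   /* <= 0: caller uses the regular idle path instead */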

    Fixes: 51164251f5c3 "sched / idle: Drop default_idle_call() fallback from call_cpuidle()"
    Signed-off-by: Sudeep Holla
    Signed-off-by: Rafael J. Wysocki

    Sudeep Holla
     

21 Jan, 2016

1 commit

  • Pull more power management and ACPI updates from Rafael Wysocki:
    "This includes fixes on top of the previous batch of PM+ACPI updates
    and some new material as well.

    From the new material perspective the most significant are the driver
    core changes that should allow USB devices to stay suspended over
    system suspend/resume cycles if they have been runtime-suspended
    already beforehand. Apart from that, ACPICA is updated to upstream
    revision 20160108 (cosmetic mostly, but including one fixup on top of
    the previous ACPICA update) and there are some devfreq updates that
    didn't make it before (due to timing).

    A few recent regressions are fixed, most importantly in the cpuidle
    menu governor and in the ACPI backlight driver and some x86 platform
    drivers depending on it.

    Some more bugs are fixed and cleanups are made on top of that.

    Specifics:

    - Modify the driver core and the USB subsystem to allow USB devices
    to stay suspended over system suspend/resume cycles if they have
    been runtime-suspended already beforehand and fix some bugs on top
    of these changes (Tomeu Vizoso, Rafael Wysocki).

    - Update ACPICA to upstream revision 20160108, including updates of
    the ACPICA's copyright notices, a code fixup resulting from a
    regression fix that was necessary in the upstream code only (the
    regression fixed by it has never been present in Linux) and a
    compiler warning fix (Bob Moore, Lv Zheng).

    - Fix a recent regression in the cpuidle menu governor that broke it
    on practically all architectures other than x86 and make a couple
    of optimizations on top of that fix (Rafael Wysocki).

    - Clean up the selection of cpuidle governors depending on whether or
    not the kernel is configured for tickless systems (Jean Delvare).

    - Revert a recent commit that introduced a regression in the ACPI
    backlight driver, address the problem it attempted to fix in a
    different way and revert one more cosmetic change depending on the
    problematic commit (Hans de Goede).

    - Add two more ACPI backlight quirks (Hans de Goede).

    - Fix a few minor problems in the core devfreq code, clean it up a
    bit and update the MAINTAINERS information related to it (Chanwoo
    Choi, MyungJoo Ham).

    - Improve an error message in the ACPI fan driver (Andy Lutomirski).

    - Fix a recent build regression in the cpupower tool (Shreyas
    Prabhu)"

    * tag 'pm+acpi-4.5-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (32 commits)
    cpuidle: menu: Avoid pointless checks in menu_select()
    sched / idle: Drop default_idle_call() fallback from call_cpuidle()
    cpupower: Fix build error in cpufreq-info
    cpuidle: Don't enable all governors by default
    cpuidle: Default to ladder governor on ticking systems
    time: nohz: Expose tick_nohz_enabled
    ACPICA: Update version to 20160108
    ACPICA: Silence a -Wbad-function-cast warning when acpi_uintptr_t is 'uintptr_t'
    ACPICA: Additional 2016 copyright changes
    ACPICA: Reduce regression fix divergence from upstream ACPICA
    ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Satellite R830
    ACPI / video: Revert "thinkpad_acpi: Use acpi_video_handles_brightness_key_presses()"
    ACPI / video: Document acpi_video_handles_brightness_key_presses() a bit
    ACPI / video: Fix using an uninitialized mutex / list_head in acpi_video_handles_brightness_key_presses()
    ACPI / video: Revert "ACPI / video: driver must be registered before checking for keypresses"
    ACPI / fan: Improve acpi_device_update_power error message
    ACPI / video: Add disable_backlight_sysfs_if quirk for the Toshiba Portege R700
    cpuidle: menu: Fix menu_select() for CPUIDLE_DRIVER_STATE_START == 0
    MAINTAINERS: Add devfreq-event entry
    MAINTAINERS: Add missing git repository and directory for devfreq
    ...

    Linus Torvalds
     

19 Jan, 2016

1 commit

  • After commit 9c4b2867ed7c (cpuidle: menu: Fix menu_select() for
    CPUIDLE_DRIVER_STATE_START == 0) it is clear that menu_select()
    cannot return negative values. Moreover, ladder_select_state()
    will never return a negative value either, so make
    find_deepest_state() return non-negative values too and drop the
    default_idle_call() fallback from call_cpuidle().

    This eliminates one branch from the idle loop and makes the governors
    and find_deepest_state() handle the case when all states have been
    disabled from sysfs consistently.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Ingo Molnar
    Tested-by: Sudeep Holla

    Rafael J. Wysocki
     

15 Jan, 2016

1 commit

  • Currently the vmstat updater is not deferrable as a result of commit
    ba4877b9ca51 ("vmstat: do not use deferrable delayed work for
    vmstat_update"). This in turn can cause multiple interruptions of the
    applications because the vmstat updater may run at any time.

    Make vmstat_update deferrable again and provide a function that folds
    the differentials when the processor is going to idle mode, thus
    addressing the issue of the above commit in a clean way.

    Note that the shepherd thread will continue scanning the differentials
    from another processor and will reenable the vmstat workers if it
    detects any changes.
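
    On the idle-loop side this amounts to one call before going idle; a
    sketch, assuming the fold helper is named quiet_vmstat():

    static void cpu_idle_loop(void)
    {
            while (1) {
                    /* Fold the per-CPU vmstat differentials and let the
                     * deferrable worker go quiet before this CPU idles. */
                    quiet_vmstat();

                    /* ... existing nohz enter / idle-state selection ... */
            }
    }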

    Fixes: ba4877b9ca51 ("vmstat: do not use deferrable delayed work for vmstat_update")
    Signed-off-by: Christoph Lameter
    Cc: Michal Hocko
    Cc: Johannes Weiner
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     

12 Oct, 2015

1 commit

  • When using idle=poll, the preemptoff tracer always shows the idle task
    as the culprit for long latencies. That happens because critical
    timings are not stopped before the idle loop. This patch stops critical
    timings before entering the idle loop, starting them again after the
    idle loop.

    This problem does not affect the irqsoff tracer because interrupts are
    enabled before entering the idle loop.
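
    A sketch of the placement in cpu_idle_poll() (surrounding code elided):

    local_irq_enable();
    stop_critical_timings();

    /* ... poll loop: cpu_relax() until a reschedule is needed ... */

    start_critical_timings();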

    Signed-off-by: Daniel Bristot de Oliveira
    Reviewed-by: Luis Claudio R. Goncalves
    Acked-by: Steven Rostedt
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/10fc3705874aef11dbe152a068b591a7be1899b4.1444314899.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     

21 Jul, 2015

1 commit

  • Make sure to stop tracing only once we are past a point where
    all latency tracing events have been processed (irqs are not
    enabled again). This has the slight advantage of capturing more
    latency related events in the idle path, but most importantly it
    makes sure that latency tracing doesn't get re-enabled
    inadvertently when new events are coming in.

    This makes the irqsoff latency tracer useful again, as we stop
    capturing CPU sleep time as IRQ latency.

    Signed-off-by: Lucas Stach
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: kernel@pengutronix.de
    Cc: patchwork-lst@pengutronix.de
    Link: http://lkml.kernel.org/r/1437410090-3747-1-git-send-email-l.stach@pengutronix.de
    Signed-off-by: Ingo Molnar

    Lucas Stach
     

15 May, 2015

2 commits

  • The check of the cpuidle_enter() return value against -EBUSY
    made in call_cpuidle() will not be necessary any more if
    cpuidle_enter_state() calls default_idle_call() directly when it
    is about to return -EBUSY, so make that happen and eliminate the
    check.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Preeti U Murthy
    Tested-by: Preeti U Murthy
    Tested-by: Sudeep Holla
    Acked-by: Kevin Hilman

    Rafael J. Wysocki
     
  • Introduce a wrapper function around idle_set_state() called
    sched_idle_set_state() that will pass this_rq() to it as the
    first argument and make cpuidle_enter_state() call the new
    function before and after entering the target state.

    At the same time, remove direct invocations of idle_set_state()
    from call_cpuidle().

    This will allow the invocation of default_idle_call() to be
    moved from call_cpuidle() to cpuidle_enter_state() safely
    and call_cpuidle() to be simplified a bit as a result.
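
    The wrapper itself is tiny; a sketch following the description above:

    /* Exported to cpuidle so it can record the target state in this_rq(). */
    void sched_idle_set_state(struct cpuidle_state *idle_state)
    {
            idle_set_state(this_rq(), idle_state);
    }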

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Preeti U Murthy
    Tested-by: Preeti U Murthy
    Tested-by: Sudeep Holla
    Acked-by: Kevin Hilman

    Rafael J. Wysocki
     

05 May, 2015

2 commits

  • Since cpuidle_reflect() should only be called if the idle state
    to enter was selected by cpuidle_select(), there is the "reflect"
    variable in cpuidle_idle_call() whose value is used to determine
    whether or not that is the case.

    However, if the entire code run between the conditional setting
    "reflect" and the call to cpuidle_reflect() is moved to a separate
    function, it will be possible to call that new function in both
    branches of the conditional, in which case cpuidle_reflect() will
    only need to be called from one of them too and the "reflect"
    variable won't be necessary any more.

    This eliminates one check made by cpuidle_idle_call() on the majority
    of its invocations, so change the code as described.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Daniel Lezcano
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     
  • Move the code under the "use_default" label in cpuidle_idle_call()
    into a separate (new) function.

    This just allows the subsequent changes to be more straightforward.

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Daniel Lezcano
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

29 Apr, 2015

1 commit

  • Commit 335f49196fd6 (sched/idle: Use explicit broadcast oneshot
    control function) replaced clockevents_notify() invocations in
    cpuidle_idle_call() with direct calls to tick_broadcast_enter()
    and tick_broadcast_exit(), but it overlooked the fact that
    interrupts were already enabled before calling the latter which
    led to functional breakage on systems using idle states with the
    CPUIDLE_FLAG_TIMER_STOP flag set.

    Fix that by moving the invocations of tick_broadcast_enter()
    and tick_broadcast_exit() down into cpuidle_enter_state() where
    interrupts are still disabled when tick_broadcast_exit() is
    called. Also ensure that interrupts will be disabled before
    running tick_broadcast_exit() even if they have been enabled by
    the idle state's ->enter callback. Trigger a WARN_ON_ONCE() in
    that case, as we generally don't want that to happen for states
    with CPUIDLE_FLAG_TIMER_STOP set.
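
    A sketch of the resulting flow inside cpuidle_enter_state() (error
    handling trimmed, names per the text above):

    if (broadcast && tick_broadcast_enter())
            return -EBUSY;          /* broadcast not possible: bail out */

    entered_state = target_state->enter(dev, drv, index);

    if (broadcast) {
            if (WARN_ON_ONCE(!irqs_disabled()))
                    local_irq_disable();    /* ->enter() returned with irqs on */
            tick_broadcast_exit();
    }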

    Fixes: 335f49196fd6 (sched/idle: Use explicit broadcast oneshot control function)
    Reported-and-tested-by: Linus Walleij
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Daniel Lezcano
    Reported-and-tested-by: Sudeep Holla
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

15 Apr, 2015

1 commit

  • Pull RCU changes from Ingo Molnar:
    "The main changes in this cycle were:

    - changes permitting use of call_rcu() and friends very early in
    boot, for example, before rcu_init() is invoked.

    - add in-kernel API to enable and disable expediting of normal RCU
    grace periods.

    - improve RCU's handling of (hotplug-) outgoing CPUs.

    - NO_HZ_FULL_SYSIDLE fixes.

    - tiny-RCU updates to make it more tiny.

    - documentation updates.

    - miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (58 commits)
    cpu: Provide smpboot_thread_init() on !CONFIG_SMP kernels as well
    cpu: Defer smpboot kthread unparking until CPU known to scheduler
    rcu: Associate quiescent-state reports with grace period
    rcu: Yet another fix for preemption and CPU hotplug
    rcu: Add diagnostics to grace-period cleanup
    rcutorture: Default to grace-period-initialization delays
    rcu: Handle outgoing CPUs on exit from idle loop
    cpu: Make CPU-offline idle-loop transition point more precise
    rcu: Eliminate ->onoff_mutex from rcu_node structure
    rcu: Process offlining and onlining only at grace-period start
    rcu: Move rcu_report_unblock_qs_rnp() to common code
    rcu: Rework preemptible expedited bitmask handling
    rcu: Remove event tracing from rcu_cpu_notify(), used by offline CPUs
    rcutorture: Enable slow grace-period initializations
    rcu: Provide diagnostic option to slow down grace-period initialization
    rcu: Detect stalls caused by failure to propagate up rcu_node tree
    rcu: Eliminate empty HOTPLUG_CPU ifdef
    rcu: Simplify sync_rcu_preempt_exp_init()
    rcu: Put all orphan-callback-related code under same comment
    rcu: Consolidate offline-CPU callback initialization
    ...

    Linus Torvalds
     

03 Apr, 2015

1 commit


27 Mar, 2015

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Documentation updates.

    - Changes permitting use of call_rcu() and friends very early in
    boot, for example, before rcu_init() is invoked.

    - Miscellaneous fixes.

    - Add in-kernel API to enable and disable expediting of normal RCU
    grace periods.

    - Improve RCU's handling of (hotplug-) outgoing CPUs.

    Note: ARM support is lagging a bit here, and these improved
    diagnostics might generate (harmless) splats.

    - NO_HZ_FULL_SYSIDLE fixes.

    - Tiny RCU updates to make it more tiny.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

13 Mar, 2015

2 commits

  • This commit informs RCU of an outgoing CPU just before that CPU invokes
    arch_cpu_idle_dead() during its last pass through the idle loop (via a
    new CPU_DYING_IDLE notifier value). This change means that RCU need not
    deal with outgoing CPUs passing through the scheduler after informing
    RCU that they are no longer online. Note that removing the CPU from
    the rcu_node ->qsmaskinit bit masks is done at CPU_DYING_IDLE time,
    and orphaning callbacks is still done at CPU_DEAD time, the reason being
    that at CPU_DEAD time we have another CPU that can adopt them.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     
  • This commit uses a per-CPU variable to make the CPU-offline code path
    through the idle loop more precise, so that the outgoing CPU is
    guaranteed to make it into the idle loop before it is powered off.
    This commit is in preparation for putting the RCU offline-handling
    code on this code path, which will eliminate the magic one-jiffy
    wait that RCU uses as the maximum time for an outgoing CPU to get
    all the way through the scheduler.

    The magic one-jiffy wait for incoming CPUs remains a separate issue.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

06 Mar, 2015

1 commit

  • Commit 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
    overlooked the fact that entering some sufficiently deep idle states
    by CPUs may cause their local timers to stop and in those cases it
    is necessary to switch over to a broadcast timer prior to entering
    the idle state. If the cpuidle driver in use does not provide
    the new ->enter_freeze callback for any of the idle states, that
    problem affects suspend-to-idle too, but it is not taken into account
    after the changes made by commit 381063133246.

    Fix that by changing the definition of cpuidle_enter_freeze() and
    re-arranging of the code in cpuidle_idle_call(), so the former does
    not call cpuidle_enter() any more and the fallback case is handled
    by cpuidle_idle_call() directly.

    Fixes: 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
    Reported-and-tested-by: Lorenzo Pieralisi
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

03 Mar, 2015

1 commit


01 Mar, 2015

1 commit

  • Disabling interrupts at the end of cpuidle_enter_freeze() is not
    useful, because its caller, cpuidle_idle_call(), re-enables them
    right away after invoking it.

    To avoid that unnecessary back and forth dance with interrupts,
    make cpuidle_enter_freeze() enable interrupts after calling
    enter_freeze_proper() and drop the local_irq_disable() at its
    end, so that all of the code paths in it end up with interrupts
    enabled. Then, cpuidle_idle_call() will not need to re-enable
    interrupts after calling cpuidle_enter_freeze() any more, because
    the latter will return with interrupts enabled, in analogy with
    cpuidle_enter().

    Reported-by: Lorenzo Pieralisi
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

14 Feb, 2015

1 commit

  • In preparation for adding support for quiescing timers in the final
    stage of suspend-to-idle transitions, rework the freeze_enter()
    function making the system wait on a wakeup event, the freeze_wake()
    function terminating the suspend-to-idle loop and the mechanism by
    which deep idle states are entered during suspend-to-idle.

    First of all, introduce a simple state machine for suspend-to-idle
    and make the code in question use it.

    Second, prevent freeze_enter() from losing wakeup events due to race
    conditions and ensure that the number of online CPUs won't change
    while it is being executed. In addition to that, make it force
    all of the CPUs to re-enter the idle loop in case they are already in
    idle states (so they can enter deeper idle states if possible).

    Next, drop cpuidle_use_deepest_state() and replace use_deepest_state
    checks in cpuidle_select() and cpuidle_reflect() with a single
    suspend-to-idle state check in cpuidle_idle_call().

    Finally, introduce cpuidle_enter_freeze() that will simply find the
    deepest idle state available to the given CPU and enter it using
    cpuidle_enter().

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)

    Rafael J. Wysocki
     

31 Jan, 2015

1 commit

  • cpu_idle_poll() is entered when either cpu_idle_force_poll is set or
    tick_check_broadcast_expired() returns true. The exit condition from
    cpu_idle_poll() is tif_need_resched().

    However this does not take into account scenarios where cpu_idle_force_poll
    changes or tick_check_broadcast_expired() returns false, without setting
    the resched flag. So a cpu will be caught in cpu_idle_poll() needlessly,
    thereby wasting power. Add an explicit check on cpu_idle_force_poll and
    tick_check_broadcast_expired() to the exit condition of cpu_idle_poll()
    to avoid this.
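
    A sketch of the reworked poll loop with the explicit re-check of the
    entry conditions:

    while (!tif_need_resched() &&
           (cpu_idle_force_poll || tick_check_broadcast_expired()))
            cpu_relax();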

    Signed-off-by: Preeti U Murthy
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20150121105655.15279.59626.stgit@preeti.in.ibm.com
    Signed-off-by: Ingo Molnar

    Preeti U Murthy
     

24 Sep, 2014

1 commit

  • When the cpu enters idle, it stores the cpuidle state pointer in its
    struct rq instance which in turn could be used to make a better decision
    when balancing tasks.

    As soon as the cpu exits its idle state, the struct rq reference is
    cleared.

    There are a couple of situations where the idle state pointer could be changed
    while it is being consulted:

    1. For x86/acpi with dynamic c-states, when a laptop switches from battery
    to AC, that could result in removing the deeper idle state. The acpi
    driver triggers:
    'acpi_processor_cst_has_changed'
    'cpuidle_pause_and_lock'
    'cpuidle_uninstall_idle_handler'
    'kick_all_cpus_sync'.

    All cpus will exit their idle state and the pointed object will be set to
    NULL.

    2. The cpuidle driver is unloaded. Logically that could happen but not
    in practice because the drivers are always compiled in and 95% of them are
    not coded to unregister themselves. In any case, the unloading code must
    call 'cpuidle_unregister_device', that calls 'cpuidle_pause_and_lock'
    leading to 'kick_all_cpus_sync' as mentioned above.

    A race can happen if we use the pointer and then one of these two scenarios
    occurs at the same moment.

    In order to be safe, the idle state pointer stored in the rq must be
    used inside a rcu_read_lock section where we are protected with the
    'rcu_barrier' in the 'cpuidle_uninstall_idle_handler' function. The
    idle_get_state() and idle_put_state() accessors should be used to that
    effect.
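
    A sketch of a consumer on the load-balancing side (the field read is
    illustrative):

    struct cpuidle_state *state;
    unsigned int latency = UINT_MAX;

    rcu_read_lock();
    state = idle_get_state(rq);
    if (state)
            latency = state->exit_latency;  /* e.g. prefer shallow-idle CPUs */
    rcu_read_unlock();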

    Signed-off-by: Daniel Lezcano
    Signed-off-by: Nicolas Pitre
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Rafael J. Wysocki"
    Cc: linux-pm@vger.kernel.org
    Cc: linaro-kernel@lists.linaro.org
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-@git.kernel.org
    Signed-off-by: Ingo Molnar

    Daniel Lezcano
     

07 Aug, 2014

1 commit

  • Pull ACPI and power management updates from Rafael Wysocki:
    "Again, ACPICA leads the pack (47 commits), followed by cpufreq (18
    commits) and system suspend/hibernation (9 commits).

    From the new code perspective, the ACPICA update brings ACPI 5.1 to
    the table, including a new device configuration object called _DSD
    (Device Specific Data) that will hopefully help us to operate device
    properties like Device Trees do (at least to some extent) and changes
    related to supporting ACPI on ARM.

    Apart from that we have hibernation changes making it use radix trees
    to store memory bitmaps which should speed up some operations carried
    out by it quite significantly. We also have some power management
    changes related to suspend-to-idle (the "freeze" sleep state) support
    and more preliminary changes needed to support ACPI on ARM (outside of
    ACPICA).

    The rest is fixes and cleanups pretty much everywhere.

    Specifics:

    - ACPICA update to upstream version 20140724. That includes ACPI 5.1
    material (support for the _CCA and _DSD predefined names, changes
    related to the DMAR and PCCT tables and ARM support among other
    things) and cleanups related to using ACPICA's header files. A
    major part of it is related to acpidump and the core code used by
    that utility. Changes from Bob Moore, David E Box, Lv Zheng,
    Sascha Wildner, Tomasz Nowicki, Hanjun Guo.

    - Radix trees for memory bitmaps used by the hibernation core from
    Joerg Roedel.

    - Support for waking up the system from suspend-to-idle (also known
    as the "freeze" sleep state) using ACPI-based PCI wakeup signaling
    (Rafael J Wysocki).

    - Fixes for issues related to ACPI button events (Rafael J Wysocki).

    - New device ID for an ACPI-enumerated device included into the
    Wildcat Point PCH from Jie Yang.

    - ACPI video updates related to backlight handling from Hans de Goede
    and Linus Torvalds.

    - Preliminary changes needed to support ACPI on ARM from Hanjun Guo
    and Graeme Gregory.

    - ACPI PNP core cleanups from Arjun Sreedharan and Zhang Rui.

    - Cleanups related to ACPI_COMPANION() and ACPI_HANDLE() macros
    (Rafael J Wysocki).

    - ACPI-based device hotplug cleanups from Wei Yongjun and Rafael J
    Wysocki.

    - Cleanups and improvements related to system suspend from Lan
    Tianyu, Randy Dunlap and Rafael J Wysocki.

    - ACPI battery cleanup from Wei Yongjun.

    - cpufreq core fixes from Viresh Kumar.

    - Elimination of a deadband effect from the cpufreq ondemand governor
    and intel_pstate driver cleanups from Stratos Karafotis.

    - 350MHz CPU support for the powernow-k6 cpufreq driver from Mikulas
    Patocka.

    - Fix for the imx6 cpufreq driver from Anson Huang.

    - cpuidle core and governor cleanups from Daniel Lezcano, Sandeep
    Tripathy and Mohammad Merajul Islam Molla.

    - Build fix for the big_little cpuidle driver from Sachin Kamat.

    - Configuration fix for the Operation Performance Points (OPP)
    framework from Mark Brown.

    - APM cleanup from Jean Delvare.

    - cpupower utility fixes and cleanups from Peter Senna Tschudin,
    Andrey Utkin, Himangi Saraogi, Rickard Strandqvist, Thomas
    Renninger"

    * tag 'pm+acpi-3.17-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (118 commits)
    ACPI / LPSS: add LPSS device for Wildcat Point PCH
    ACPI / PNP: Replace faulty is_hex_digit() by isxdigit()
    ACPICA: Update version to 20140724.
    ACPICA: ACPI 5.1: Update for PCCT table changes.
    ACPICA/ARM: ACPI 5.1: Update for GTDT table changes.
    ACPICA/ARM: ACPI 5.1: Update for MADT changes.
    ACPICA/ARM: ACPI 5.1: Update for FADT changes.
    ACPICA: ACPI 5.1: Support for the _CCA predefined name.
    ACPICA: ACPI 5.1: New notify value for System Affinity Update.
    ACPICA: ACPI 5.1: Support for the _DSD predefined name.
    ACPICA: Debug object: Add current value of Timer() to debug line prefix.
    ACPICA: acpihelp: Add UUID support, restructure some existing files.
    ACPICA: Utilities: Fix local printf issue.
    ACPICA: Tables: Update for DMAR table changes.
    ACPICA: Remove some extraneous printf arguments.
    ACPICA: Update for comments/formatting. No functional changes.
    ACPICA: Disassembler: Add support for the ToUUID operator (macro).
    ACPICA: Remove a redundant cast to acpi_size for ACPI_OFFSET() macro.
    ACPICA: Work around an ancient GCC bug.
    ACPI / processor: Make it possible to get local x2apic id via _MAT
    ...

    Linus Torvalds
     

09 Jul, 2014

1 commit

  • The idle_exit event is the first event after a core exits an idle
    state, so it should be traced before the local irq is enabled.
    Likewise, idle_entry is the last event before a core enters an idle
    state. This will ease visualising the cpu idle state from kernel
    traces.

    Signed-off-by: Sandeep Tripathy
    Acked-by: Daniel Lezcano
    [rjw: Subject, rebase]
    Signed-off-by: Rafael J. Wysocki

    Sandeep Tripathy
     

05 Jul, 2014

1 commit

  • We don't need 'broadcast' to be set to 'zero or one', but to 'zero or non-zero'
    and so the extra operation to convert it to 'zero or one' can be skipped.

    Also change type of 'broadcast' to unsigned int, i.e. type of
    drv->states[*].flags.
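
    The change is essentially a one-liner; a sketch (variable names are
    illustrative):

    /* Non-zero iff the state needs the broadcast timer; no !! required. */
    unsigned int broadcast =
            drv->states[next_state].flags & CPUIDLE_FLAG_TIMER_STOP;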

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/0dfbe2976aa108c53e08d3477ea90f6360c1f54c.1403584026.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

05 Jun, 2014

1 commit

  • [ This series reduces the number of IPIs on Andy's workload by something like
    99%. It's down from many hundreds per second to very few.

    The basic idea behind this series is to make TIF_POLLING_NRFLAG be a
    reliable indication that the idle task is polling. Once that's done,
    the rest is reasonably straightforward. ]

    When enqueueing tasks on remote LLC domains, we send an IPI to do the
    work 'locally' and avoid bouncing all the cachelines over.

    However, when the remote CPU is idle (and polling, say x86 mwait), we
    don't need to send an IPI, we can simply kick the TIF word to wake it
    up and have the 'idle' loop do the work.

    So when _TIF_POLLING_NRFLAG is set, but _TIF_NEED_RESCHED is not (yet)
    set, set _TIF_NEED_RESCHED and avoid sending the IPI.
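
    A sketch of the wakeup side (the helper name is an assumption for
    illustration): if the remote idle task advertises that it is polling,
    setting TIF_NEED_RESCHED is enough and no IPI is sent:

    /* Returns true if it set TIF_NEED_RESCHED on a polling idle task. */
    if (!set_nr_if_polling(rq->idle))
            smp_send_reschedule(cpu);       /* not polling: IPI as before */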

    Much-requested-by: Andy Lutomirski
    Signed-off-by: Peter Zijlstra
    [Edited by Andy Lutomirski, but this is mostly Peter Zijlstra's code.]
    Signed-off-by: Andy Lutomirski
    Cc: nicolas.pitre@linaro.org
    Cc: daniel.lezcano@linaro.org
    Cc: Mike Galbraith
    Cc: umgwanakikbuti@gmail.com
    Cc: Rafael J. Wysocki
    Cc: Linus Torvalds
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/ce06f8b02e7e337be63e97597fc4b248d3aa6f9b.1401902905.git.luto@amacapital.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra