10 Aug, 2019

1 commit

  • To avoid reducing the frequency of a CPU prematurely, we skip reducing
    the frequency if the CPU had been busy recently.

    This should not be done when the limits of the policy are changed, for
    example due to thermal throttling. We should always get the frequency
    within the new limits as soon as possible.

    Trying to fix this by using only one flag, i.e. need_freq_update, can
    lead to a race condition where the flag gets cleared without forcing us
    to change the frequency at least once. This patch therefore introduces
    a second flag to avoid that race condition.
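    A minimal sketch of the two-flag scheme (field and function names
    follow the schedutil sources, but treat this as an illustration rather
    than the exact patch): limits_changed is set from the cpufreq limits
    callback and folded into need_freq_update on the next update, which
    bypasses both the rate limit and the "CPU was recently busy" shortcut.

    static bool sugov_should_update_freq(struct sugov_policy *sg_policy,
                                         u64 time)
    {
            if (unlikely(sg_policy->limits_changed)) {
                    sg_policy->limits_changed = false;
                    sg_policy->need_freq_update = true;
                    return true;    /* bypass the rate limit */
            }

            return time - sg_policy->last_freq_update_time >=
                   sg_policy->freq_update_delay_ns;
    }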

    Fixes: ecd288429126 ("cpufreq: schedutil: Don't set next_freq to UINT_MAX")
    Cc: v4.18+ # v4.18+
    Reported-by: Doug Smythies
    Tested-by: Doug Smythies
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     

25 Jun, 2019

3 commits

  • The Energy Aware Scheduler (EAS) estimates the energy impact of waking
    up a task on a given CPU. This estimation is based on:

    a) an (active) power consumption defined for each CPU frequency
    b) an estimation of which frequency will be used on each CPU
    c) an estimation of the busy time (utilization) of each CPU

    Utilization clamping can affect both b) and c).

    A CPU is expected to run:

    - at a higher-than-required frequency, but for a shorter time, in case
    its estimated utilization is smaller than the minimum utilization
    enforced by uclamp
    - at a lower-than-required frequency, but for a longer time, in case
    its estimated utilization is bigger than the maximum utilization
    enforced by uclamp

    While compute_energy() already accounts for clamping effects on busy
    time, the clamping effects on frequency selection are currently ignored.

    Fix it by considering how CPU clamp values will be affected by a
    task waking up and being RUNNABLE on that CPU.

    Do that by refactoring schedutil_freq_util() to take an additional
    task_struct* which allows EAS to evaluate the impact on clamp values of
    a task being eventually queued in a CPU. Clamp values are applied to the
    RT+CFS utilization only when a FREQUENCY_UTIL is required by
    compute_energy().

    Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
    computation of the cpu_util signal implies that we are more likely to
    estimate the highest OPP when a RT task is running in another CPU of
    the same performance domain. This can have an impact on energy
    estimation but:

    - it's not easy to say which approach is better, since it depends on
    the use case
    - the original approach could still be obtained by setting a smaller
    task-specific util_min whenever required

    While we are at it:

    - rename schedutil_freq_util() to schedutil_cpu_util(),
    since it's not only used for frequency selection.
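    The resulting interface looks roughly as follows (abbreviated sketch;
    the exact call sites live in kernel/sched/cpufreq_schedutil.c and
    kernel/sched/fair.c):

    unsigned long schedutil_cpu_util(int cpu, unsigned long util_cfs,
                                     unsigned long max,
                                     enum schedutil_type type,
                                     struct task_struct *p);

    /* schedutil: no task argument, only the CPU's own clamp values */
    util = schedutil_cpu_util(cpu, util_cfs, max, FREQUENCY_UTIL, NULL);

    /* EAS, from compute_energy(): include the waking task's clamps */
    util = schedutil_cpu_util(cpu, util_cfs, max, FREQUENCY_UTIL, tsk);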

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J. Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-12-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • Each time a frequency update is required via schedutil, a frequency is
    selected to (possibly) satisfy the utilization reported by each
    scheduling class and irqs. However, when utilization clamping is in use,
    the frequency selection should consider userspace utilization clamping
    hints. This will allow, for example, to:

    - boost tasks which are directly affecting the user experience
    by running them at least at a minimum "requested" frequency

    - cap low priority tasks not directly affecting the user experience
    by running them only up to a maximum "allowed" frequency

    These constraints are meant to support a per-task based tuning of the
    frequency selection thus supporting a fine grained definition of
    performance boosting vs energy saving strategies in kernel space.

    Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
    within the boundaries defined by their aggregated utilization clamp
    constraints.

    Do that by considering the max(min_util, max_util) to give boosted tasks
    the performance they need even when they happen to be co-scheduled with
    other capped tasks.
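    A sketch of that aggregation rule (clamp_util() is a hypothetical
    helper name used here for illustration):

    static unsigned long clamp_util(unsigned long util,
                                    unsigned long min_util,
                                    unsigned long max_util)
    {
            /*
             * max(min_util, max_util): when a boosted task is
             * co-scheduled with capped tasks, the boost wins.
             */
            if (min_util > max_util)
                    max_util = min_util;

            return clamp(util, min_util, max_util);
    }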

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J. Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
    unused since commit:

    765d0af19f5f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

    Remove it.
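    The prototype change, for reference:

    /* before */
    unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
    /* after */
    unsigned long arch_scale_cpu_capacity(int cpu);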

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Viresh Kumar
    Reviewed-by: Valentin Schneider
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: gregkh@linuxfoundation.org
    Cc: linux@armlinux.org.uk
    Cc: quentin.perret@arm.com
    Cc: rafael@kernel.org
    Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

08 May, 2019

1 commit

  • Pull driver core/kobject updates from Greg KH:
    "Here is the "big" set of driver core patches for 5.2-rc1

    There are a number of ACPI patches in here as well, as Rafael said
    they should go through this tree due to the driver core changes they
    required. They have all been acked by the ACPI developers.

    There are also a number of small subsystem-specific changes in here,
    due to some changes to the kobject core code. Those too have all been
    acked by the various subsystem maintainers.

    As for content, it's pretty boring outside of the ACPI changes:
    - spdx cleanups
    - kobject documentation updates
    - default attribute groups for kobjects
    - other minor kobject/driver core fixes

    All have been in linux-next for a while with no reported issues"

    * tag 'driver-core-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (47 commits)
    kobject: clean up the kobject add documentation a bit more
    kobject: Fix kernel-doc comment first line
    kobject: Remove docstring reference to kset
    firmware_loader: Fix a typo ("syfs" -> "sysfs")
    kobject: fix dereference before null check on kobj
    Revert "driver core: platform: Fix the usage of platform device name(pdev->name)"
    init/config: Do not select BUILD_BIN2C for IKCONFIG
    Provide in-kernel headers to make extending kernel easier
    kobject: Improve doc clarity kobject_init_and_add()
    kobject: Improve docs for kobject_add/del
    driver core: platform: Fix the usage of platform device name(pdev->name)
    livepatch: Replace klp_ktype_patch's default_attrs with groups
    cpufreq: schedutil: Replace default_attrs field with groups
    padata: Replace padata_attr_type default_attrs field with groups
    irqdesc: Replace irq_kobj_type's default_attrs field with groups
    net-sysfs: Replace ktype default_attrs field with groups
    block: Replace all ktype default_attrs with groups
    samples/kobject: Replace foo_ktype's default_attrs field with groups
    kobject: Add support for default attribute groups to kobj_type
    driver core: Postpone DMA tear-down until after devres release for probe failure
    ...

    Linus Torvalds
     

07 May, 2019

1 commit

  • Pull power management updates from Rafael Wysocki:
    "These fix the (Intel-specific) Performance and Energy Bias Hint (EPB)
    handling and expose it to user space via sysfs, fix and clean up
    several cpufreq drivers, add support for two new chips to the qoriq
    cpufreq driver, fix, simplify and clean up the cpufreq core and the
    schedutil governor, add support for "CPU" domains to the generic power
    domains (genpd) framework and provide low-level PSCI firmware support
    for that feature, fix the exynos cpuidle driver and fix a couple of
    issues in the devfreq subsystem and clean it up.

    Specifics:

    - Fix the handling of Performance and Energy Bias Hint (EPB) on Intel
    processors and expose it to user space via sysfs to avoid having to
    access it through the generic MSR I/F (Rafael Wysocki).

    - Improve the handling of global turbo changes made by the platform
    firmware in the intel_pstate driver (Rafael Wysocki).

    - Convert some slow-path static_cpu_has() callers to boot_cpu_has()
    in cpufreq (Borislav Petkov).

    - Fix the frequency calculation loop in the armada-37xx cpufreq
    driver (Gregory CLEMENT).

    - Fix possible object reference leaks in multiple cpufreq drivers
    (Wen Yang).

    - Fix kerneldoc comment in the centrino cpufreq driver (dongjian).

    - Clean up the ACPI and maple cpufreq drivers (Viresh Kumar, Mohan
    Kumar).

    - Add support for lx2160a and ls1028a to the qoriq cpufreq driver
    (Vabhav Sharma, Yuantian Tang).

    - Fix kobject memory leak in the cpufreq core (Viresh Kumar).

    - Simplify the IOwait boosting in the schedutil cpufreq governor and
    rework the TSC cpufreq notifier on x86 (Rafael Wysocki).

    - Clean up the cpufreq core and statistics code (Yue Hu, Kyle Lin).

    - Improve the cpufreq documentation, add SPDX license tags to some PM
    documentation files and unify copyright notices in them (Rafael
    Wysocki).

    - Add support for "CPU" domains to the generic power domains (genpd)
    framework and provide low-level PSCI firmware support for that
    feature (Ulf Hansson).

    - Rearrange the PSCI firmware support code and add support for
    SYSTEM_RESET2 to it (Ulf Hansson, Sudeep Holla).

    - Improve genpd support for devices in multiple power domains (Ulf
    Hansson).

    - Unify target residency for the AFTR and coupled AFTR states in the
    exynos cpuidle driver (Marek Szyprowski).

    - Introduce new helper routine in the operating performance points
    (OPP) framework (Andrew-sh.Cheng).

    - Add support for passing on-die termination (ODT) and auto power
    down parameters from the kernel to Trusted Firmware-A (TF-A) to the
    rk3399_dmc devfreq driver (Enric Balletbo i Serra).

    - Add tracing to devfreq (Lukasz Luba).

    - Make the exynos-bus devfreq driver suspend all devices on system
    shutdown (Marek Szyprowski).

    - Fix a few minor issues in the devfreq subsystem and clean it up
    somewhat (Enric Balletbo i Serra, MyungJoo Ham, Rob Herring,
    Saravana Kannan, Yangtao Li).

    - Improve system wakeup diagnostics (Stephen Boyd).

    - Rework filesystem sync messages emitted during system suspend and
    hibernation (Harry Pan)"

    * tag 'pm-5.2-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (72 commits)
    cpufreq: Fix kobject memleak
    cpufreq: armada-37xx: fix frequency calculation for opp
    cpufreq: centrino: Fix centrino_setpolicy() kerneldoc comment
    cpufreq: qoriq: add support for lx2160a
    x86: tsc: Rework time_cpufreq_notifier()
    PM / Domains: Allow to attach a CPU via genpd_dev_pm_attach_by_id|name()
    PM / Domains: Search for the CPU device outside the genpd lock
    PM / Domains: Drop unused in-parameter to some genpd functions
    PM / Domains: Use the base device for driver_deferred_probe_check_state()
    cpufreq: qoriq: Add ls1028a chip support
    PM / Domains: Enable genpd_dev_pm_attach_by_id|name() for single PM domain
    PM / Domains: Allow OF lookup for multi PM domain case from ->attach_dev()
    PM / Domains: Don't kfree() the virtual device in the error path
    cpufreq: Move ->get callback check outside of __cpufreq_get()
    PM / Domains: remove unnecessary unlikely()
    cpufreq: Remove needless bios_limit check in show_bios_limit()
    drivers/cpufreq/acpi-cpufreq.c: This fixes the following checkpatch warning
    firmware/psci: add support for SYSTEM_RESET2
    PM / devfreq: add tracing for scheduling work
    trace: events: add devfreq trace event file
    ...

    Linus Torvalds
     

30 Apr, 2019

1 commit

  • Currently the error return path from kobject_init_and_add() is not
    followed by a call to kobject_put() - which means we are leaking
    the kobject.

    Fix it by adding a call to kobject_put() in the error path of
    kobject_init_and_add().
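    The general pattern (not the exact schedutil hunk) looks like this:

    ret = kobject_init_and_add(kobj, ktype, parent, "%s", name);
    if (ret) {
            /*
             * kobject_init_and_add() leaves a reference held even on
             * failure; kobject_put() drops it so the ktype's release()
             * can free the object.
             */
            kobject_put(kobj);
            return ret;
    }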

    Signed-off-by: Tobin C. Harding
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: Tobin C. Harding
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/20190430001144.24890-1-tobin@kernel.org
    Signed-off-by: Ingo Molnar

    Tobin C. Harding
     

26 Apr, 2019

1 commit

  • The kobj_type default_attrs field is being replaced by the
    default_groups field. Replace sugov_tunables_ktype's default_attrs field
    with default groups. Change "sugov_attributes" to "sugov_attrs" and use
    the ATTRIBUTE_GROUPS macro to create sugov_groups.

    This patch was tested by setting the scaling governor to schedutil and
    verifying that the sysfs files for the attributes in the default groups
    were created.
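    The shape of the conversion (abbreviated sketch of the schedutil
    tunables code):

    static struct attribute *sugov_attrs[] = {
            &rate_limit_us.attr,
            NULL
    };
    ATTRIBUTE_GROUPS(sugov);        /* generates sugov_groups */

    static struct kobj_type sugov_tunables_ktype = {
            .default_groups = sugov_groups, /* was: .default_attrs */
            .sysfs_ops      = &governor_sysfs_ops,
    };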

    Signed-off-by: Kimberly Brown
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Kimberly Brown
     

08 Apr, 2019

1 commit

    There is no reason for the minimum iowait boost value in the
    schedutil cpufreq governor to depend on the available range of CPU
    frequencies. In fact, that dependency is generally confusing,
    because it causes the iowait boost to behave somewhat differently
    on CPUs with the same maximum frequency and different minimum
    frequencies, for example.

    For this reason, replace the min field in struct sugov_cpu
    with a constant and choose its value to be 1/8 of
    SCHED_CAPACITY_SCALE (for consistency with the intel_pstate
    driver's internal governor).

    [Note that policy->cpuinfo.max_freq will not be a constant any more
    after a subsequent change, so that change will depend on this one.]
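    The constant as merged:

    /* a fixed floor, independent of the CPU's frequency range */
    #define IOWAIT_BOOST_MIN        (SCHED_CAPACITY_SCALE / 8)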

    Link: https://lore.kernel.org/lkml/20190305083202.GU32494@hirez.programming.kicks-ass.net/T/#ee20bdc98b7d89f6110c0d00e5c3ee8c2ced93c3d
    Suggested-by: Peter Zijlstra
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar

    Rafael J. Wysocki
     

25 Mar, 2019

1 commit

  • Pull scheduler updates from Thomas Gleixner:
    "Third more careful attempt for this set of fixes:

    - Prevent a 32bit math overflow in the cpufreq code

    - Fix a buffer overflow when scanning the cgroup2 cpu.max property

    - A set of fixes for the NOHZ scheduler logic to prevent waking up
    CPUs even if the capacity of the busy CPUs is sufficient along with
    other tweaks optimizing the behaviour for asymmetric systems
    (big/little)"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/fair: Skip LLC NOHZ logic for asymmetric systems
    sched/fair: Tune down misfit NOHZ kicks
    sched/fair: Comment some nohz_balancer_kick() kick conditions
    sched/core: Fix buffer overflow in cgroup2 property cpu.max
    sched/cpufreq: Fix 32-bit math overflow

    Linus Torvalds
     

19 Mar, 2019

1 commit

  • Vincent Wang reported that get_next_freq() has a mult overflow bug on
    32-bit platforms in the IOWAIT boost case, since in that case {util,max}
    are in freq units instead of capacity units.

    Solve this by moving the IOWAIT boost to capacity units. And since this
    means @max is constant, simplify the code.
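    A standalone illustration of the failure mode (sample numbers; the
    expression mirrors get_next_freq()'s freq * 1.25 * util / max):

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
            uint32_t max_freq = 2000000;    /* kHz, i.e. 2 GHz */

            /* before: util/max in freq units; the product overflows u32 */
            uint32_t util_f = 1800000, max_f = 2000000;
            uint32_t bad = (max_freq + (max_freq >> 2)) * util_f / max_f;

            /* after: util/max in capacity units; the product fits */
            uint32_t util_c = 922, max_c = 1024;    /* ~90% utilization */
            uint32_t good = (max_freq + (max_freq >> 2)) * util_c / max_c;

            printf("freq units (overflowed): %u kHz\n", bad);
            printf("capacity units: %u kHz\n", good);
            return 0;
    }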

    Reported-by: Vincent Wang
    Tested-by: Vincent Wang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rafael J. Wysocki
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Chunyan Zhang
    Cc: Dave Hansen
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J. Wysocki
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190305083202.GU32494@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

26 Jan, 2019

1 commit

  • Now that synchronize_rcu() waits for preempt-disable regions of
    code as well as RCU read-side critical sections, synchronize_sched()
    can be replaced by synchronize_rcu(), in fact, synchronize_sched()
    is now completely equivalent to synchronize_rcu(). This commit
    therefore replaces synchronize_sched() with synchronize_rcu() so that
    synchronize_sched() can eventually be removed entirely.

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Paul E. McKenney
     

27 Dec, 2018

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - Introduce "Energy Aware Scheduling" - by Quentin Perret.

    This is a coherent topology description of CPUs in cooperation with
    the PM subsystem, with the goal of scheduling more energy-efficiently
    on asymmetric SMP platforms - such as waking up tasks to the more
    energy-efficient CPUs first, as long as the system isn't
    oversubscribed.

    For details of the design, see:

    https://lore.kernel.org/lkml/20180724122521.22109-1-quentin.perret@arm.com/

    - Misc cleanups and smaller enhancements"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (23 commits)
    sched/fair: Select an energy-efficient CPU on task wake-up
    sched/fair: Introduce an energy estimation helper function
    sched/fair: Add over-utilization/tipping point indicator
    sched/fair: Clean-up update_sg_lb_stats parameters
    sched/toplogy: Introduce the 'sched_energy_present' static key
    sched/topology: Make Energy Aware Scheduling depend on schedutil
    sched/topology: Disable EAS on inappropriate platforms
    sched/topology: Add lowest CPU asymmetry sched_domain level pointer
    sched/topology: Reference the Energy Model of CPUs when available
    PM: Introduce an Energy Model management framework
    sched/cpufreq: Prepare schedutil for Energy Aware Scheduling
    sched/topology: Relocate arch_scale_cpu_capacity() to the internal header
    sched/core: Remove unnecessary unlikely() in push_*_task()
    sched/topology: Remove the ::smt_gain field from 'struct sched_domain'
    sched: Fix various typos in comments
    sched/core: Clean up the #ifdef block in add_nr_running()
    sched/fair: Make some variables static
    sched/core: Create task_has_idle_policy() helper
    sched/fair: Add lsub_positive() and use it consistently
    sched/fair: Mask UTIL_AVG_UNCHANGED usages
    ...

    Linus Torvalds
     

11 Dec, 2018

3 commits

  • Energy Aware Scheduling (EAS) is designed with the assumption that
    frequencies of CPUs follow their utilization value. When using a CPUFreq
    governor other than schedutil, the chances of this assumption being true
    are small, if any. When schedutil is being used, EAS' predictions are at
    least consistent with the frequency requests. Although those requests
    have no guarantees to be honored by the hardware, they should at least
    guide DVFS in the right direction and provide some hope in regards to the
    EAS model being accurate.

    To make sure EAS is only used in a sane configuration, create a strong
    dependency on schedutil being used. Since having sugov compiled-in does
    not provide that guarantee, make CPUFreq call a scheduler function on
    governor changes, hence letting it rebuild the scheduling domains, check
    the governors of the online CPUs, and enable/disable EAS accordingly.
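    The hook is a small scheduler-side entry point, roughly:

    /* called by the cpufreq core when a policy's governor changes */
    void sched_cpufreq_governor_change(struct cpufreq_policy *policy,
                                       struct cpufreq_governor *old_gov);

    It schedules a rebuild of the scheduling domains, during which the
    governors of all online CPUs are re-checked before EAS is enabled.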

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: adharmap@codeaurora.org
    Cc: chris.redpath@arm.com
    Cc: currojerez@riseup.net
    Cc: dietmar.eggemann@arm.com
    Cc: edubezval@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: javi.merino@kernel.org
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pkondeti@codeaurora.org
    Cc: skannan@codeaurora.org
    Cc: smuckle@google.com
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Cc: tkjos@google.com
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Cc: viresh.kumar@linaro.org
    Link: https://lkml.kernel.org/r/20181203095628.11858-9-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar

    Quentin Perret
     
  • Schedutil requests frequency by aggregating utilization signals from
    the scheduler (CFS, RT, DL, IRQ) and applying a 25% margin on top of
    them. Since Energy Aware Scheduling (EAS) needs to be able to predict
    the frequency requests, it needs to forecast the decisions made by the
    governor.

    In order to prepare the introduction of EAS, introduce
    schedutil_freq_util() to centralize the aforementioned signal
    aggregation and make it available to both schedutil and EAS. Since
    frequency selection and energy estimation still need to deal with RT and
    DL signals slightly differently, schedutil_freq_util() is called with a
    different 'type' parameter in those two contexts, and returns an
    aggregated utilization signal accordingly. While at it, introduce the
    map_util_freq() function which is designed to make schedutil's 25%
    margin usable easily for both sugov and EAS.
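    The helper, as it appears in include/linux/sched/cpufreq.h, encodes the
    25% margin as next_freq = 1.25 * max_freq * util / max:

    static inline unsigned long map_util_freq(unsigned long util,
                                              unsigned long freq,
                                              unsigned long cap)
    {
            return (freq + (freq >> 2)) * util / cap;
    }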

    As EAS will be able to predict schedutil's frequency requests more
    accurately than any other governor by design, it'd be sensible to make
    sure EAS cannot be used without schedutil. This will be done later, once
    EAS has actually been introduced.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Thomas Gleixner
    Cc: adharmap@codeaurora.org
    Cc: chris.redpath@arm.com
    Cc: currojerez@riseup.net
    Cc: dietmar.eggemann@arm.com
    Cc: edubezval@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: javi.merino@kernel.org
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pkondeti@codeaurora.org
    Cc: rjw@rjwysocki.net
    Cc: skannan@codeaurora.org
    Cc: smuckle@google.com
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Cc: tkjos@google.com
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Cc: viresh.kumar@linaro.org
    Link: https://lkml.kernel.org/r/20181203095628.11858-3-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar

    Quentin Perret
     
  • The SPDX tags are not present in cpufreq.c and cpufreq_schedutil.c.

    Add them and remove the license descriptions.

    Signed-off-by: Daniel Lezcano
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Daniel Lezcano
     

25 Jul, 2018

1 commit

  • Reuse cpu_util_irq() that has been defined for schedutil and set irq util
    to 0 when !CONFIG_IRQ_TIME_ACCOUNTING.

    But the compiler is not able to optimize the sequence (at least with
    aarch64 GCC 7.2.1):

    free *= (max - irq);
    free /= max;

    when irq is fixed to 0

    Add a new inline function scale_irq_capacity() that will scale the
    utilization when irq time is accounted. Reuse this function in
    schedutil, which applies a similar formula.
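    The helper, roughly as merged:

    #ifdef CONFIG_HAVE_SCHED_AVG_IRQ
    static inline unsigned long scale_irq_capacity(unsigned long util,
                                                   unsigned long irq,
                                                   unsigned long max)
    {
            util *= (max - irq);
            util /= max;

            return util;
    }
    #else
    static inline unsigned long scale_irq_capacity(unsigned long util,
                                                   unsigned long irq,
                                                   unsigned long max)
    {
            return util;    /* irq is always 0 here */
    }
    #endif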

    Suggested-by: Ingo Molnar
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: rjw@rjwysocki.net
    Link: http://lkml.kernel.org/r/1532001606-6689-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

16 Jul, 2018

5 commits

    Add a few comments to (hopefully) clarify some of the magic in
    sugov_get_util().

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Link: http://lkml.kernel.org/r/20180705123617.GM2458@hirez.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • There is no reason why sugov_get_util() and sugov_aggregate_util()
    were in fact separate functions.

    Signed-off-by: Vincent Guittot
    [ Rebased after adding irq tracking and fixed some compilation errors. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Link: http://lkml.kernel.org/r/1530200714-4504-9-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
    The time spent executing IRQ handlers can be significant, but it is not
    reflected in the utilization of a CPU when deciding to choose an OPP.
    Now that we have access to this metric, schedutil can take it into
    account when selecting the OPP for a CPU.

    Rq utilization signals don't see the time spent in interrupt context
    and report their values over the normal context time window. We need to
    compensate for this when adding the interrupt utilization.

    The CPU utilization is:

    IRQ util_avg + (1 - IRQ util_avg / max capacity) * \Sum rq util_avg
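    Worked with illustrative numbers on the 1024 capacity scale (standalone
    arithmetic, not kernel code):

    #include <stdio.h>

    int main(void)
    {
            const double max = 1024.0;      /* CPU capacity */
            double irq = 128.0;             /* IRQ util_avg */
            double rqs = 512.0;             /* \Sum rq util_avg */

            /* rq signals only see (max - irq) of wall time: scale them */
            double util = irq + (1.0 - irq / max) * rqs;

            printf("cpu util = %.0f of %.0f\n", util, max); /* 576 of 1024 */
            return 0;
    }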

    A test with iperf on hikey (octo arm64) gives the following speedup:

    iperf -c server_address -r -t 5

           w/o patch        w/ patch
    Tx     276 Mbits/sec    304 Mbits/sec   +10%
    Rx     299 Mbits/sec    328 Mbits/sec   +9%

    8 iterations
    stdev is lower than 1%

    Only WFI idle state is enabled (shallowest idle state).

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Link: http://lkml.kernel.org/r/1530200714-4504-8-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
    Now that we have both the DL class bandwidth requirement and the DL class
    utilization, we can detect when a CPU is fully used, in which case we
    should run at max. Otherwise, we keep using the DL bandwidth requirement
    to define the utilization of the CPU.
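    A sketch of the intent (hypothetical helper name; not the exact hunk):

    static unsigned long aggregate_dl_util(unsigned long util_cfs,
                                           unsigned long util_rt,
                                           unsigned long util_dl,
                                           unsigned long bw_dl,
                                           unsigned long max)
    {
            unsigned long util = util_cfs + util_rt;

            /* DL util_avg says the CPU is fully used: run at max */
            if (util + util_dl >= max)
                    return max;

            /* otherwise the DL bandwidth requirement sizes the request */
            return min(max, util + bw_dl);
    }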

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Link: http://lkml.kernel.org/r/1530200714-4504-6-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
    Add both CFS and RT utilization when selecting an OPP for CFS tasks, as
    RT can preempt and steal CFS's running time.

    RT util_avg is used to take into account the utilization of RT tasks
    on the CPU when selecting the OPP. If an RT task migrates, its
    utilization will not migrate with it but will decay over time. On an
    overloaded CPU, the CFS utilization reflects the remaining utilization
    available on the CPU. When an RT task migrates away, the CFS utilization
    will increase as tasks start to use the newly available capacity. At the
    same pace, the RT utilization will decay, and both variations will
    compensate each other to keep the overall utilization unchanged and
    prevent any OPP drop.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Viresh Kumar
    Cc: Linus Torvalds
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: claudio@evidence.eu.com
    Cc: daniel.lezcano@linaro.org
    Cc: dietmar.eggemann@arm.com
    Cc: joel@joelfernandes.org
    Cc: juri.lelli@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: patrick.bellasi@arm.com
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: valentin.schneider@arm.com
    Link: http://lkml.kernel.org/r/1530200714-4504-4-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

03 Jul, 2018

1 commit

  • With commit:

    8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")

    the schedutil governor uses rq->rt.rt_nr_running to detect whether an
    RT task is currently running on the CPU and to set frequency to max
    if necessary.

    cpufreq_update_util() is called in enqueue/dequeue_top_rt_rq() but
    rq->rt.rt_nr_running has not been updated yet when dequeue_top_rt_rq() is
    called so schedutil still considers that an RT task is running when the
    last task is dequeued. The update of rq->rt.rt_nr_running happens later
    in dequeue_rt_stack().

    In fact, we can take advantage of the fact that rt entities are dequeued
    and then re-enqueued whenever an rt task is enqueued or dequeued.
    As a result, enqueue_top_rt_rq() is always called when a task is
    enqueued or dequeued, and also when groups are throttled or unthrottled.
    The only case that does not use enqueue_top_rt_rq() is when the root
    rt_rq is throttled.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: efault@gmx.de
    Cc: juri.lelli@redhat.com
    Cc: patrick.bellasi@arm.com
    Cc: viresh.kumar@linaro.org
    Fixes: 8f111bc357aa ('cpufreq/schedutil: Rewrite CPUFREQ_RT support')
    Link: http://lkml.kernel.org/r/1530021202-21695-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

06 Jun, 2018

1 commit

  • Pull power management updates from Rafael Wysocki:
    "These include a significant update of the generic power domains
    (genpd) and Operating Performance Points (OPP) frameworks, mostly
    related to the introduction of power domain performance levels,
    cpufreq updates (new driver for Qualcomm Kryo processors, updates of
    the existing drivers, some core fixes, schedutil governor
    improvements), PCI power management fixes, ACPI workaround for
    EC-based wakeup events handling on resume from suspend-to-idle, and
    major updates of the turbostat and pm-graph utilities.

    Specifics:

    - Introduce power domain performance levels into the generic
    power domains (genpd) and Operating Performance Points (OPP)
    frameworks (Viresh Kumar, Rajendra Nayak, Dan Carpenter).

    - Fix two issues in the runtime PM framework related to the
    initialization and removal of devices using device links (Ulf
    Hansson).

    - Clean up the initialization of drivers for devices in PM domains
    (Ulf Hansson, Geert Uytterhoeven).

    - Fix a cpufreq core issue related to the policy sysfs interface
    causing CPU online to fail for CPUs sharing one cpufreq policy in
    some situations (Tao Wang).

    - Make it possible to use platform-specific suspend/resume hooks in
    the cpufreq-dt driver and make the Armada 37xx DVFS use that
    feature (Viresh Kumar, Miquel Raynal).

    - Optimize policy transition notifications in cpufreq (Viresh Kumar).

    - Improve the iowait boost mechanism in the schedutil cpufreq
    governor (Patrick Bellasi).

    - Improve the handling of deferred frequency updates in the schedutil
    cpufreq governor (Joel Fernandes, Dietmar Eggemann, Rafael Wysocki,
    Viresh Kumar).

    - Add a new cpufreq driver for Qualcomm Kryo (Ilia Lin).

    - Fix and clean up some cpufreq drivers (Colin Ian King, Dmitry
    Osipenko, Doug Smythies, Luc Van Oostenryck, Simon Horman, Viresh
    Kumar).

    - Fix the handling of PCI devices with the DPM_SMART_SUSPEND flag set
    and update stale comments in the PCI core PM code (Rafael Wysocki).

    - Work around an issue related to the handling of EC-based wakeup
    events in the ACPI PM core during resume from suspend-to-idle if
    the EC has been put into the low-power mode (Rafael Wysocki).

    - Improve the handling of wakeup source objects in the PM core (Doug
    Berger, Mahendran Ganesh, Rafael Wysocki).

    - Update the driver core to prevent deferred probe from breaking
    suspend/resume ordering (Feng Kan).

    - Clean up the PM core somewhat (Bjorn Helgaas, Ulf Hansson, Rafael
    Wysocki).

    - Make the core suspend/resume code and cpufreq support the RT patch
    (Sebastian Andrzej Siewior, Thomas Gleixner).

    - Consolidate the PM QoS handling in cpuidle governors (Rafael
    Wysocki).

    - Fix a possible crash in the hibernation core (Tetsuo Handa).

    - Update the rockchip-io Adaptive Voltage Scaling (AVS) driver (David
    Wu).

    - Update the turbostat utility (fixes, cleanups, new CPU IDs, new
    command line options, built-in "Low Power Idle" counters support,
    new POLL and POLL% columns) and add an entry for it to MAINTAINERS
    (Len Brown, Artem Bityutskiy, Chen Yu, Laura Abbott, Matt Turner,
    Prarit Bhargava, Srinivas Pandruvada).

    - Update the pm-graph to version 5.1 (Todd Brandt).

    - Update the intel_pstate_tracer utility (Doug Smythies)"

    * tag 'pm-4.18-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (128 commits)
    tools/power turbostat: update version number
    tools/power turbostat: Add Node in output
    tools/power turbostat: add node information into turbostat calculations
    tools/power turbostat: remove num_ from cpu_topology struct
    tools/power turbostat: rename num_cores_per_pkg to num_cores_per_node
    tools/power turbostat: track thread ID in cpu_topology
    tools/power turbostat: Calculate additional node information for a package
    tools/power turbostat: Fix node and siblings lookup data
    tools/power turbostat: set max_num_cpus equal to the cpumask length
    tools/power turbostat: if --num_iterations, print for specific number of iterations
    tools/power turbostat: Add Cannon Lake support
    tools/power turbostat: delete duplicate #defines
    x86: msr-index.h: Correct SNB_C1/C3_AUTO_UNDEMOTE defines
    tools/power turbostat: Correct SNB_C1/C3_AUTO_UNDEMOTE defines
    tools/power turbostat: add POLL and POLL% column
    tools/power turbostat: Fix --hide Pk%pc10
    tools/power turbostat: Build-in "Low Power Idle" counters support
    tools/power turbostat: Don't make man pages executable
    tools/power turbostat: remove blank lines
    tools/power turbostat: a small C-states dump readability immprovement
    ...

    Linus Torvalds
     

25 May, 2018

1 commit

  • Since the refactoring introduced by:

    commit 8f111bc357aa ("cpufreq/schedutil: Rewrite CPUFREQ_RT support")

    we aggregate FAIR utilization only if this class has runnable tasks.

    This was mainly to avoid the risk of staying at a high frequency just
    because the blocked utilization of a CPU was not being properly decayed
    while the CPU was idle.

    However, since:

    commit 31e77c93e432 ("sched/fair: Update blocked load when newly idle")

    the FAIR blocked utilization is properly decayed also for IDLE CPUs.

    This allows us to use the FAIR blocked utilization as a safe mechanism
    to gracefully reduce the frequency only if no FAIR tasks show up on a
    CPU for a reasonable period of time.

    Moreover, this also reduces the frequency drops of CPUs running periodic
    tasks which, depending on the task periodicity and the time required
    for a frequency switch, increased the chances of introducing some
    undesirable performance variations.
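    The gist of the change in sugov's aggregation (sketch, field names
    approximate):

    /* before: FAIR contribution only while it has runnable tasks */
    if (rq->cfs.h_nr_running)
            util += sg_cpu->util_cfs;

    /* after: always include it; blocked utilization decays while the
     * CPU is idle, so the frequency ramps down gracefully instead of
     * dropping as soon as the last FAIR task is dequeued. */
    util += sg_cpu->util_cfs;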

    Reported-by: Vincent Guittot
    Tested-by: Vincent Guittot
    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Acked-by: Viresh Kumar
    Acked-by: Vincent Guittot
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Steve Muckle
    Link: http://lkml.kernel.org/r/20180524141023.13765-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

24 May, 2018

1 commit

  • Commit 152db033d775 (schedutil: Allow cpufreq requests to be made
    even when kthread kicked) made changes to prevent utilization updates
    from being discarded during processing a previous request, but it
    left a small window in which that still can happen in the one-CPU
    policy case. Namely, updates coming in after setting work_in_progress
    in sugov_update_commit() and clearing it in sugov_work() will still
    be dropped due to the work_in_progress check in sugov_update_single().

    To close that window, rearrange the code so as to acquire the update
    lock around the deferred update branch in sugov_update_single()
    and drop the work_in_progress check from it.
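    The rearranged single-CPU path, roughly (helper names as in the
    resulting code):

    if (sg_policy->policy->fast_switch_enabled) {
            sugov_fast_switch(sg_policy, time, next_f);
    } else {
            /* deferred (kthread) path: serialize against sugov_work() */
            raw_spin_lock(&sg_policy->update_lock);
            sugov_deferred_update(sg_policy, time, next_f);
            raw_spin_unlock(&sg_policy->update_lock);
    }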

    Signed-off-by: Rafael J. Wysocki
    Reviewed-by: Juri Lelli
    Acked-by: Viresh Kumar
    Reviewed-by: Joel Fernandes (Google)

    Rafael J. Wysocki
     

23 May, 2018

2 commits

    Currently there is a chance of a schedutil cpufreq update request being
    dropped if there is a pending update request. This pending request can
    be delayed by scheduling delays of the irq_work and of the wakeup of
    the schedutil governor kthread.

    A very bad scenario is when a schedutil request has just been made,
    such as to reduce the CPU frequency, and then a newer request to
    increase the CPU frequency (even an urgent sched deadline frequency
    increase request) can be dropped, even though the rate limits suggest
    that it's OK to process a request. This is because of the way the
    work_in_progress flag is used.

    This patch improves the situation by allowing new requests to happen
    even though the old one is still being processed. Note that in this
    approach, if an irq_work was already issued, we just update next_freq
    and don't bother to queue another request so there's no extra work being
    done to make this happen.
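    A sketch of the resulting kthread side: the latest next_freq is read
    under the lock, so updates that arrived after the kick are not lost.

    static void sugov_work(struct kthread_work *work)
    {
            struct sugov_policy *sg_policy =
                    container_of(work, struct sugov_policy, work);
            unsigned int freq;
            unsigned long flags;

            raw_spin_lock_irqsave(&sg_policy->update_lock, flags);
            freq = sg_policy->next_freq;            /* freshest request */
            sg_policy->work_in_progress = false;    /* allow a new kick */
            raw_spin_unlock_irqrestore(&sg_policy->update_lock, flags);

            mutex_lock(&sg_policy->work_lock);
            __cpufreq_driver_target(sg_policy->policy, freq,
                                    CPUFREQ_RELATION_L);
            mutex_unlock(&sg_policy->work_lock);
    }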

    Acked-by: Viresh Kumar
    Acked-by: Juri Lelli
    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Rafael J. Wysocki

    Joel Fernandes (Google)
     
    This routine checks whether the CPU running this code belongs to the
    policy of the target CPU or, if not, whether it can perform DVFS for it
    remotely. But its current name implies that it is only about doing
    remote updates.

    Rename it to make it more relevant.

    Suggested-by: Rafael J. Wysocki
    Signed-off-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     

22 May, 2018

2 commits

    The iowait boosting code has been recently updated to add a progressive
    boosting behavior which allows the governor to be less aggressive in
    boosting tasks doing only sporadic IO operations, thus being more
    energy efficient, for example on mobile platforms.

    The current code is now however a bit convoluted. Some functionalities
    (e.g. iowait boost reset) are replicated in different paths and their
    documentation is slightly misaligned.

    Let's clean up the code by consolidating all the IO wait boosting
    related functionality within a few dedicated functions and better
    define their role:

    - sugov_iowait_boost: set/increase the IO wait boost of a CPU
    - sugov_iowait_apply: apply/reduce the IO wait boost of a CPU

    Both of these functions are used at every sugov update and they make
    use of a unified IO wait boost reset policy provided by:

    - sugov_iowait_reset: reset/disable the IO wait boost of a CPU
    if a CPU is not updated for more than one tick

    This makes possible a cleaner and more self-contained design for the IO
    wait boosting code since the rest of the sugov update routines, both for
    single and shared frequency domains, follow the same template:

    /* Configure IO boost, if required */
    sugov_iowait_boost()

    /* Return here if freq change is in progress or throttled */

    /* Collect and aggregate utilization information */
    sugov_get_util()
    sugov_aggregate_util()

    /*
    * Add IO boost, if currently enabled, on top of the aggregated
    * utilization value
    */
    sugov_iowait_apply()

    As an extra bonus, let's also add documentation for the new
    functions and better align the in-code documentation.

    Signed-off-by: Patrick Bellasi
    Reviewed-by: Joel Fernandes (Google)
    Acked-by: Viresh Kumar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Rafael J. Wysocki

    Patrick Bellasi
     
  • A more energy efficient update of the IO wait boosting mechanism has
    been introduced in:

    commit a5a0809bc58e ("cpufreq: schedutil: Make iowait boost more energy efficient")

    where the boost value is expected to be:

    - doubled at each successive wakeup from IO,
    starting from the minimum frequency supported by a CPU

    - reset when a CPU is not updated for more than one tick,
    by either disabling the IO wait boost or resetting its value to the
    minimum frequency if this new update requires an IO boost.

    This approach is supposed to "ignore" boosting for sporadic wakeups from
    IO, while still getting the frequency boosted to the maximum to benefit
    long sequences of wakeups from IO operations.

    However, these assumptions are not always satisfied.
    For example, when an IO boosted CPU enters idle for more than one tick
    and then wakes up after an IO wait, since in sugov_set_iowait_boost() we
    first check the IOWAIT flag, we keep doubling the iowait boost instead
    of restarting from the minimum frequency value.

    This misbehavior could happen mainly on non-shared frequency domains,
    thus defeating the energy efficiency optimization, but it can also
    happen on shared frequency domain systems.

    Let's fix this issue in sugov_set_iowait_boost() by:
    - first checking the IO wait boost reset conditions,
    possibly resetting the boost value
    - then applying the correct IO boost value,
    if required by the caller
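    A sketch of the reordered logic (clamping of the boost to the maximum
    frequency is omitted for brevity):

    static void sugov_set_iowait_boost(struct sugov_cpu *sg_cpu, u64 time,
                                       unsigned int flags)
    {
            /* 1) reset first: idle for more than a tick clears the boost */
            if (sg_cpu->iowait_boost &&
                time - sg_cpu->last_update > TICK_NSEC)
                    sg_cpu->iowait_boost = 0;

            /* 2) only then apply/double the boost for an IO wakeup */
            if (flags & SCHED_CPUFREQ_IOWAIT) {
                    if (sg_cpu->iowait_boost)
                            sg_cpu->iowait_boost <<= 1;
                    else    /* start from the minimum frequency */
                            sg_cpu->iowait_boost =
                                    sg_cpu->sg_policy->policy->min;
            }
    }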

    Fixes: a5a0809bc58e (cpufreq: schedutil: Make iowait boost more energy efficient)
    Reported-by: Viresh Kumar
    Signed-off-by: Patrick Bellasi
    Reviewed-by: Joel Fernandes (Google)
    Acked-by: Viresh Kumar
    Acked-by: Peter Zijlstra (Intel)
    Signed-off-by: Rafael J. Wysocki

    Patrick Bellasi
     

15 May, 2018

2 commits

  • The schedutil driver sets sg_policy->next_freq to UINT_MAX on certain
    occasions to discard the cached value of next freq:
    - In sugov_start(), when the schedutil governor is started for a group
    of CPUs.
    - And whenever we need to force a freq update before rate-limit
    duration, which happens when:
    - there is an update in cpufreq policy limits.
    - Or when the utilization of DL scheduling class increases.

    In return, get_next_freq() doesn't return a cached next_freq value but
    recalculates the next frequency instead.

    But having special meaning for a particular value of frequency makes the
    code less readable and error prone. We recently fixed a bug where the
    UINT_MAX value was considered as valid frequency in
    sugov_update_single().

    All we need is a flag which can be used to discard the value of
    sg_policy->next_freq, and we already have need_freq_update for that.
    Let's reuse it instead of setting next_freq to UINT_MAX.
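    After the change, get_next_freq() keys off the flag instead of a magic
    next_freq value (abbreviated sketch; the frequency-invariance selection
    of the base frequency is omitted):

    freq = (freq + (freq >> 2)) * util / max;

    if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
            return sg_policy->next_freq;

    sg_policy->need_freq_update = false;
    sg_policy->cached_raw_freq = freq;
    return cpufreq_driver_resolve_freq(policy, freq);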

    Signed-off-by: Viresh Kumar
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
     
  • This reverts commit e2cabe48c20efb174ce0c01190f8b9c5f3ea1d13.

    Lifting the restriction that the sugov kthread is bound to the
    policy->related_cpus for a system with a slow-switching cpufreq driver,
    which is able to perform DVFS from any cpu (e.g. cpufreq-dt), is not
    only not beneficial, it also harms Energy-Aware Scheduling (EAS) on
    systems with asymmetric cpu capacities (e.g. Arm big.LITTLE).

    The sugov kthread which does the update for the little cpus could
    potentially run on a big cpu. It could prevent the big cluster from
    going into deeper idle states although all the tasks are running on the
    little cluster.

    Example: hikey960 w/ 4.16.0-rc6-+
    Arm big.LITTLE with per-cluster DVFS

    root@h960:~# cat /proc/cpuinfo | grep "^CPU part"
    CPU part : 0xd03 (Cortex-A53, little cpu)
    CPU part : 0xd03
    CPU part : 0xd03
    CPU part : 0xd03
    CPU part : 0xd09 (Cortex-A73, big cpu)
    CPU part : 0xd09
    CPU part : 0xd09
    CPU part : 0xd09

    root@h960:/sys/devices/system/cpu/cpufreq# ls
    policy0 policy4 schedutil

    root@h960:/sys/devices/system/cpu/cpufreq# cat policy*/related_cpus
    0 1 2 3
    4 5 6 7

    (1) w/o the revert:

    root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 || /sugov/'
      PID CLS RTPRIO PRI PSR COMMAND
     1489 #6       0 140   1 sugov:0
     1490 #6       0 140   0 sugov:4

    The sugov kthread sugov:4 responsible for policy4 runs on cpu0. (In this
    case both sugov kthreads run on little cpus).

    cross policy (cluster) remote callback example:
    ...
    migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=5
    migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=5
    sg_cpu->sg_policy->policy->related_cpus=4-7
    sugov:4-1490 [000] sugov_work: this_cpu=0
    sg_cpu->sg_policy->policy->related_cpus=4-7
    ...

    The remote callback (this_cpu=1, target_cpu=5) is executed on cpu=0.

    (2) w/ the revert:

    root@h960:~# ps -eo pid,class,rtprio,pri,psr,comm | awk 'NR == 1 || /sugov/'
      PID CLS RTPRIO PRI PSR COMMAND
     1491 #6       0 140   2 sugov:0
     1492 #6       0 140   4 sugov:4

    The sugov kthread sugov:4 responsible for policy4 runs on cpu4.

    cross policy (cluster) remote callback example:
    ...
    migration/1-14 [001] enqueue_task_fair: this_cpu=1 cpu_of(rq)=7
    migration/1-14 [001] sugov_update_shared: this_cpu=1 sg_cpu->cpu=7
    sg_cpu->sg_policy->policy->related_cpus=4-7
    sugov:4-1492 [004] sugov_work: this_cpu=4
    sg_cpu->sg_policy->policy->related_cpus=4-7
    ...

    The remote callback (this_cpu=1, target_cpu=7) is executed on cpu=4.

    Now the sugov kthread executes again on the policy (cluster) for which
    the Operating Performance Point (OPP) should be changed.
    It avoids the problem that an otherwise idle policy (cluster) is running
    schedutil (the sugov kthread) for another one.
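    The revert restores the binding in sugov_kthread_create():

    /* bind the kthread to the CPUs of the policy it serves */
    kthread_bind_mask(thread, policy->related_cpus);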

    Signed-off-by: Dietmar Eggemann
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Dietmar Eggemann
     

09 May, 2018

2 commits

  • If the next_freq field of struct sugov_policy is set to UINT_MAX,
    it shouldn't be used for updating the CPU frequency (this is a
    special "invalid" value), but after commit b7eaf1aab9f8 (cpufreq:
    schedutil: Avoid reducing frequency of busy CPUs prematurely) it
    may be passed as the new frequency to sugov_update_commit() in
    sugov_update_single().

    Fix that by adding an extra check for the special UINT_MAX value
    of next_freq to sugov_update_single().
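    The fix boils down to one extra condition in sugov_update_single()
    (sketch):

    if (busy && next_f < sg_policy->next_freq &&
        sg_policy->next_freq != UINT_MAX)
            next_f = sg_policy->next_freq;  /* keep the last valid freq */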

    Fixes: b7eaf1aab9f8 (cpufreq: schedutil: Avoid reducing frequency of busy CPUs prematurely)
    Reported-by: Viresh Kumar
    Cc: 4.12+ # 4.12+
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     
    After commit 794a56ebd9a57 (sched/cpufreq: Change the worker kthread to
    SCHED_DEADLINE) schedutil kthreads are "ignored" from a clock frequency
    selection point of view, so the potential corner case for RT tasks is no
    longer possible at all.

    Remove the stale comment mentioning it.

    Signed-off-by: Juri Lelli
    Signed-off-by: Rafael J. Wysocki

    Juri Lelli
     

01 Apr, 2018

1 commit

  • This patch prevents the 'global_tunables_lock' mutex from being
    unlocked before being locked. This mutex is not locked if the
    sugov_kthread_create() function fails.
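    An illustrative layout of the corrected error paths (labels are
    hypothetical; the point is that only paths which took the mutex
    unlock it):

    ret = sugov_kthread_create(sg_policy);
    if (ret)
            goto free_sg_policy;            /* lock not held yet */

    mutex_lock(&global_tunables_lock);
    /* ... tunables setup; failures jump to fail ... */
    mutex_unlock(&global_tunables_lock);
    return 0;

    fail:
            mutex_unlock(&global_tunables_lock);
    free_sg_policy:
            sugov_policy_free(sg_policy);
            return ret;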

    Signed-off-by: Jules Maselbas
    Acked-by: Peter Zijlstra
    Cc: Chris Redpath
    Cc: Dietmar Eggermann
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Patrick Bellasi
    Cc: Stephen Kyle
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: nd@arm.com
    Link: http://lkml.kernel.org/r/20180329144301.38419-1-jules.maselbas@arm.com
    Signed-off-by: Ingo Molnar

    Jules Maselbas
     

24 Mar, 2018

1 commit

    When the SCHED_DEADLINE scheduling class increases the CPU utilization,
    it should not wait for the rate limit, otherwise it may miss some
    deadlines.

    Tests using rt-app on Exynos5422 with up to 10 SCHED_DEADLINE tasks have
    shown reductions of even 10% of deadline misses with a negligible
    increase of energy consumption (measured through Baylibre Cape).
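    The check amounts to bypassing the rate limit when the DL requirement
    grows (sketch; helper and field names as in later kernels):

    static inline void ignore_dl_rate_limit(struct sugov_cpu *sg_cpu,
                                            struct sugov_policy *sg_policy)
    {
            if (cpu_bw_dl(cpu_rq(sg_cpu->cpu)) > sg_cpu->bw_dl)
                    sg_policy->need_freq_update = true;
    }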

    Signed-off-by: Claudio Scordino
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Cc: Juri Lelli
    Cc: Joel Fernandes
    Cc: Vincent Guittot
    Cc: linux-pm@vger.kernel.org
    Cc: Peter Zijlstra
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Todd Kjos
    Cc: Dietmar Eggemann
    Link: https://lkml.kernel.org/r/1520937340-2755-1-git-send-email-claudio@evidence.eu.com

    Claudio Scordino
     

09 Mar, 2018

1 commit

  • Instead of trying to duplicate scheduler state to track if an RT task
    is running, directly use the scheduler runqueue state for it.

    This vastly simplifies things and fixes a number of bugs related to
    sugov and the scheduler getting out of sync wrt this state.

    As a consequence, we now also update the remote cfs/dl state when
    iterating the shared mask.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Rafael J. Wysocki
    Cc: Thomas Gleixner
    Cc: Viresh Kumar
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

04 Mar, 2018

1 commit

  • Do the following cleanups and simplifications:

    - sched/sched.h already includes that header, so there is no need to
    include it in sched/core.c again.

    - order the headers alphabetically

    - add all headers to kernel/sched/sched.h

    - remove all unnecessary includes from the .c files that
    are already included in kernel/sched/sched.h.

    Finally, make all scheduler .c files use a single common header:

    #include "sched.h"

    ... which now contains a union of the relied upon headers.

    This makes the various .c files easier to read and easier to handle.

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Mar, 2018

1 commit

  • A good number of small style inconsistencies have accumulated
    in the scheduler core, so do a pass over them to harmonize
    all these details:

    - fix spelling in comments,

    - use curly braces for multi-line statements,

    - remove unnecessary parentheses from integer literals,

    - capitalize consistently,

    - remove stray newlines,

    - add comments where necessary,

    - remove invalid/unnecessary comments,

    - align structure definitions and other data types vertically,

    - add missing newlines for increased readability,

    - fix vertical tabulation where it's misaligned,

    - harmonize preprocessor conditional block labeling
    and vertical alignment,

    - remove line-breaks where they uglify the code,

    - add newline after local variable definitions,

    No change in functionality:

    md5:
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar