Eric Lee / smarc-fsl-linux-kernel

Blame view

Documentation/scheduler/sched-tune.txt 17 KB
               Central, scheduler-driven, power-performance control
                                 (EXPERIMENTAL)
  
  Abstract
  ========
  
  The topic of a single simple power-performance tunable, that is wholly
  scheduler centric, and has well defined and predictable properties has come up
  on several occasions in the past [1,2]. With techniques such as a scheduler
  driven DVFS [3], we now have a good framework for implementing such a tunable.
  This document describes the overall ideas behind its design and implementation.
  
  
  Table of Contents
  =================
  
  1. Motivation
  2. Introduction
  3. Signal Boosting Strategy
  4. OPP selection using boosted CPU utilization
  5. Per task group boosting
  6. Per-task wakeup-placement-strategy Selection
  7. Question and Answers
     - What about "auto" mode?
     - What about boosting on a congested system?
     - How CPUs are boosted when we have tasks with multiple boost values?
  8. References
  
  
  1. Motivation
  =============
  
  Sched-DVFS [3] was a new event-driven cpufreq governor which allows the
  scheduler to select the optimal DVFS operating point (OPP) for running a task
  allocated to a CPU. Later, the cpufreq maintainers introduced a similar
  governor, schedutil. The introduction of schedutil also enables running
  workloads at the most energy efficient OPPs.
  
  However, sometimes it may be desired to intentionally boost the performance of
  a workload even if that could imply a reasonable increase in energy
  consumption. For example, in order to reduce the response time of a task, we
  may want to run the task at a higher OPP than the one that is actually required
  by it's CPU bandwidth demand.
  
  This last requirement is especially important if we consider that one of the
  main goals of the utilization-driven governor component is to replace all
  currently available CPUFreq policies. Since sched-DVFS and schedutil are event
  based, as opposed to the sampling driven governors we currently have, they are
  already more responsive at selecting the optimal OPP to run tasks allocated to
  a CPU. However, just tracking the actual task load demand may not be enough
  from a performance standpoint.  For example, it is not possible to get
  behaviors similar to those provided by the "performance" and "interactive"
  CPUFreq governors.
  
  This document describes an implementation of a tunable, stacked on top of the
  utilization-driven governors which extends their functionality to support task
  performance boosting.
  
  By "performance boosting" we mean the reduction of the time required to
  complete a task activation, i.e. the time elapsed from a task wakeup to its
  next deactivation (e.g. because it goes back to sleep or it terminates).  For
  example, if we consider a simple periodic task which executes the same workload
  for 5[s] every 20[s] while running at a certain OPP, a boosted execution of
  that task must complete each of its activations in less than 5[s].
  
  A previous attempt [5] to introduce such a boosting feature has not been
  successful mainly because of the complexity of the proposed solution. Previous
  versions of the approach described in this document exposed a single simple
  interface to user-space.  This single tunable knob allowed the tuning of
  system wide scheduler behaviours ranging from energy efficiency at one end
  through to incremental performance boosting at the other end.  This first
  tunable affects all tasks. However, that is not useful for Android products
  so in this version only a more advanced extension of the concept is provided
  which uses CGroups to boost the performance of only selected tasks while using
  the energy efficient default for all others.
  
  The rest of this document introduces in more details the proposed solution
  which has been named SchedTune.
  
  
  2. Introduction
  ===============
  
  SchedTune exposes a simple user-space interface provided through a new
  CGroup controller 'stune' which provides two power-performance tunables
  per group:
  
    /<stune cgroup mount point>/schedtune.prefer_idle
    /<stune cgroup mount point>/schedtune.boost
  
  The CGroup implementation permits arbitrary user-space defined task
  classification to tune the scheduler for different goals depending on the
  specific nature of the task, e.g. background vs interactive vs low-priority.
  
  More details are given in section 5.
  
  2.1 Boosting
  ============
  
  The boost value is expressed as an integer in the range [-100..0..100].
  
  A value of 0 (default) configures the CFS scheduler for maximum energy
  efficiency. This means that sched-DVFS runs the tasks at the minimum OPP
  required to satisfy their workload demand.
  
  A value of 100 configures scheduler for maximum performance, which translates
  to the selection of the maximum OPP on that CPU.
  
  A value of -100 configures scheduler for minimum performance, which translates
  to the selection of the minimum OPP on that CPU.
  
  The range between -100, 0 and 100 can be set to satisfy other scenarios suitably.
  For example to satisfy interactive response or depending on other system events
  (battery level etc).
  
  The overall design of the SchedTune module is built on top of "Per-Entity Load
  Tracking" (PELT) signals and sched-DVFS by introducing a bias on the Operating
  Performance Point (OPP) selection.
  
  Each time a task is allocated on a CPU, cpufreq is given the opportunity to tune
  the operating frequency of that CPU to better match the workload demand. The
  selection of the actual OPP being activated is influenced by the boost value
  for the task CGroup.
  
  This simple biasing approach leverages existing frameworks, which means minimal
  modifications to the scheduler, and yet it allows to achieve a range of
  different behaviours all from a single simple tunable knob.
  
  In EAS schedulers, we use boosted task and CPU utilization for energy
  calculation and energy-aware task placement.
  
  2.2 prefer_idle
  ===============
  
  This is a flag which indicates to the scheduler that userspace would like
  the scheduler to focus on energy or to focus on performance.
  
  A value of 0 (default) signals to the CFS scheduler that tasks in this group
  can be placed according to the energy-aware wakeup strategy.
  
  A value of 1 signals to the CFS scheduler that tasks in this group should be
  placed to minimise wakeup latency.
  
  The value is combined with the boost value - task placement will not be
  boost aware however CPU OPP selection is still boost aware.
  
  Android platforms typically use this flag for application tasks which the
  user is currently interacting with.
  
  
  3. Signal Boosting Strategy
  ===========================
  
  The whole PELT machinery works based on the value of a few load tracking signals
  which basically track the CPU bandwidth requirements for tasks and the capacity
  of CPUs. The basic idea behind the SchedTune knob is to artificially inflate
  some of these load tracking signals to make a task or RQ appears more demanding
  that it actually is.
  
  Which signals have to be inflated depends on the specific "consumer".  However,
  independently from the specific (signal, consumer) pair, it is important to
  define a simple and possibly consistent strategy for the concept of boosting a
  signal.
  
  A boosting strategy defines how the "abstract" user-space defined
  sched_cfs_boost value is translated into an internal "margin" value to be added
  to a signal to get its inflated value:
  
    margin         := boosting_strategy(sched_cfs_boost, signal)
    boosted_signal := signal + margin
  
  Different boosting strategies were identified and analyzed before selecting the
  one found to be most effective.
  
  Signal Proportional Compensation (SPC)
  --------------------------------------
  
  In this boosting strategy the sched_cfs_boost value is used to compute a
  margin which is proportional to the complement of the original signal.
  When a signal has a maximum possible value, its complement is defined as
  the delta from the actual value and its possible maximum.
  
  Since the tunable implementation uses signals which have SCHED_LOAD_SCALE as
  the maximum possible value, the margin becomes:
  
  	margin := sched_cfs_boost * (SCHED_LOAD_SCALE - signal)
  
  Using this boosting strategy:
  - a 100% sched_cfs_boost means that the signal is scaled to the maximum value
  - each value in the range of sched_cfs_boost effectively inflates the signal in
    question by a quantity which is proportional to the maximum value.
  
  For example, by applying the SPC boosting strategy to the selection of the OPP
  to run a task it is possible to achieve these behaviors:
  
  -   0% boosting: run the task at the minimum OPP required by its workload
  - 100% boosting: run the task at the maximum OPP available for the CPU
  -  50% boosting: run at the half-way OPP between minimum and maximum
  
  Which means that, at 50% boosting, a task will be scheduled to run at half of
  the maximum theoretically achievable performance on the specific target
  platform.
  
  A graphical representation of an SPC boosted signal is represented in the
  following figure where:
   a) "-" represents the original signal
   b) "b" represents a  50% boosted signal
   c) "p" represents a 100% boosted signal
  
  
     ^
     |  SCHED_LOAD_SCALE
     +-----------------------------------------------------------------+
     |pppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppppp
     |
     |                                             boosted_signal
     |                                          bbbbbbbbbbbbbbbbbbbbbbbb
     |
     |                                            original signal
     |                  bbbbbbbbbbbbbbbbbbbbbbbb+----------------------+
     |                                          |
     |bbbbbbbbbbbbbbbbbb                        |
     |                                          |
     |                                          |
     |                                          |
     |                  +-----------------------+
     |                  |
     |                  |
     |                  |
     |------------------+
     |
     |
     +----------------------------------------------------------------------->
  
  The plot above shows a ramped load signal (titled 'original_signal') and it's
  boosted equivalent. For each step of the original signal the boosted signal
  corresponding to a 50% boost is midway from the original signal and the upper
  bound. Boosting by 100% generates a boosted signal which is always saturated to
  the upper bound.
  
  
  4. OPP selection using boosted CPU utilization
  ==============================================
  
  It is worth calling out that the implementation does not introduce any new load
  signals. Instead, it provides an API to tune existing signals. This tuning is
  done on demand and only in scheduler code paths where it is sensible to do so.
  The new API calls are defined to return either the default signal or a boosted
  one, depending on the value of sched_cfs_boost. This is a clean an non invasive
  modification of the existing existing code paths.
  
  The signal representing a CPU's utilization is boosted according to the
  previously described SPC boosting strategy. To sched-DVFS, this allows a CPU
  (ie CFS run-queue) to appear more used then it actually is.
  
  Thus, with the sched_cfs_boost enabled we have the following main functions to
  get the current utilization of a CPU:
  
    cpu_util()
    boosted_cpu_util()
  
  The new boosted_cpu_util() is similar to the first but returns a boosted
  utilization signal which is a function of the sched_cfs_boost value.
  
  This function is used in the CFS scheduler code paths where sched-DVFS needs to
  decide the OPP to run a CPU at.
  For example, this allows selecting the highest OPP for a CPU which has
  the boost value set to 100%.
  
  
  5. Per task group boosting
  ==========================
  
  On battery powered devices there usually are many background services which are
  long running and need energy efficient scheduling. On the other hand, some
  applications are more performance sensitive and require an interactive
  response and/or maximum performance, regardless of the energy cost.
  
  To better service such scenarios, the SchedTune implementation has an extension
  that provides a more fine grained boosting interface.
  
  A new CGroup controller, namely "schedtune", can be enabled which allows to
  defined and configure task groups with different boosting values.
  Tasks that require special performance can be put into separate CGroups.
  The value of the boost associated with the tasks in this group can be specified
  using a single knob exposed by the CGroup controller:
  
     schedtune.boost
  
  This knob allows the definition of a boost value that is to be used for
  SPC boosting of all tasks attached to this group.
  
  The current schedtune controller implementation is really simple and has these
  main characteristics:
  
    1) It is only possible to create 1 level depth hierarchies
  
       The root control groups define the system-wide boost value to be applied
       by default to all tasks. Its direct subgroups are named "boost groups" and
       they define the boost value for specific set of tasks.
       Further nested subgroups are not allowed since they do not have a sensible
       meaning from a user-space standpoint.
  
    2) It is possible to define only a limited number of "boost groups"
  
       This number is defined at compile time and by default configured to 16.
       This is a design decision motivated by two main reasons:
       a) In a real system we do not expect utilization scenarios with more then few
  	boost groups. For example, a reasonable collection of groups could be
          just "background", "interactive" and "performance".
       b) It simplifies the implementation considerably, especially for the code
  	which has to compute the per CPU boosting once there are multiple
          RUNNABLE tasks with different boost values.
  
  Such a simple design should allow servicing the main utilization scenarios identified
  so far. It provides a simple interface which can be used to manage the
  power-performance of all tasks or only selected tasks.
  Moreover, this interface can be easily integrated by user-space run-times (e.g.
  Android, ChromeOS) to implement a QoS solution for task boosting based on tasks
  classification, which has been a long standing requirement.
  
  Setup and usage
  ---------------
  
  0. Use a kernel with CONFIG_SCHED_TUNE support enabled
  
  1. Check that the "schedtune" CGroup controller is available:
  
     root@linaro-nano:~# cat /proc/cgroups
     #subsys_name	hierarchy	num_cgroups	enabled
     cpuset  	0		1		1
     cpu     	0		1		1
     schedtune	0		1		1
  
  2. Mount a tmpfs to create the CGroups mount point (Optional)
  
     root@linaro-nano:~# sudo mount -t tmpfs cgroups /sys/fs/cgroup
  
  3. Mount the "schedtune" controller
  
     root@linaro-nano:~# mkdir /sys/fs/cgroup/stune
     root@linaro-nano:~# sudo mount -t cgroup -o schedtune stune /sys/fs/cgroup/stune
  
  4. Create task groups and configure their specific boost value (Optional)
  
     For example here we create a "performance" boost group configure to boost
     all its tasks to 100%
  
     root@linaro-nano:~# mkdir /sys/fs/cgroup/stune/performance
     root@linaro-nano:~# echo 100 > /sys/fs/cgroup/stune/performance/schedtune.boost
  
  5. Move tasks into the boost group
  
     For example, the following moves the tasks with PID $TASKPID (and all its
     threads) into the "performance" boost group.
  
     root@linaro-nano:~# echo "TASKPID > /sys/fs/cgroup/stune/performance/cgroup.procs
  
  This simple configuration allows only the threads of the $TASKPID task to run,
  when needed, at the highest OPP in the most capable CPU of the system.
  
  
  6. Per-task wakeup-placement-strategy Selection
  ===============================================
  
  Many devices have a number of CFS tasks in use which require an absolute
  minimum wakeup latency, and many tasks for which wakeup latency is not
  important.
  
  For touch-driven environments, removing additional wakeup latency can be
  critical.
  
  When you use the Schedtume CGroup controller, you have access to a second
  parameter which allows a group to be marked such that energy_aware task
  placement is bypassed for tasks belonging to that group.
  
  prefer_idle=0 (default - use energy-aware task placement if available)
  prefer_idle=1 (never use energy-aware task placement for these tasks)
  
  Since the regular wakeup task placement algorithm in CFS is biased for
  performance, this has the effect of restoring minimum wakeup latency
  for the desired tasks whilst still allowing energy-aware wakeup placement
  to save energy for other tasks.
  
  
  7. Question and Answers
  =======================
  
  What about "auto" mode?
  -----------------------
  
  The 'auto' mode as described in [5] can be implemented by interfacing SchedTune
  with some suitable user-space element. This element could use the exposed
  system-wide or cgroup based interface.
  
  How are multiple groups of tasks with different boost values managed?
  ---------------------------------------------------------------------
  
  The current SchedTune implementation keeps track of the boosted RUNNABLE tasks
  on a CPU. The CPU utilization seen by the scheduler-driven cpufreq governors
  (and used to select an appropriate OPP) is boosted with a value which is the
  maximum of the boost values of the currently RUNNABLE tasks in its RQ.
  
  This allows cpufreq to boost a CPU only while there are boosted tasks ready
  to run and switch back to the energy efficient mode as soon as the last boosted
  task is dequeued.
  
  
  8. References
  =============
  [1] http://lwn.net/Articles/552889
  [2] http://lkml.org/lkml/2012/5/18/91
  [3] http://lkml.org/lkml/2015/6/26/620