26 Jan, 2020

1 commit

  • commit a0e813f26ebcb25c0b5e504498fbd796cca1a4ba upstream.

    It turns out there really is something special about the first
    set_next_task() invocation. Specifically, the 'change' pattern really
    should not cause balance callbacks.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: juri.lelli@redhat.com
    Cc: ktkhai@virtuozzo.com
    Cc: mgorman@suse.de
    Cc: qais.yousef@arm.com
    Cc: qperret@google.com
    Cc: rostedt@goodmis.org
    Cc: valentin.schneider@arm.com
    Cc: vincent.guittot@linaro.org
    Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
    Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

31 Dec, 2019

1 commit

  • [ Upstream commit 7763baace1b738d65efa46d68326c9406311c6bf ]

    Some uclamp helpers had their return type changed from 'unsigned int' to
    'enum uclamp_id' by commit

    0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")

    but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
    range, which should really be unsigned int. The affected helpers are
    uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.

    Note that this doesn't lead to any obj diff using a relatively recent
    aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
    properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
    anything funny. I'm still marking this as fixing the above commit to be on
    the safe side.
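
    For illustration, a minimal sketch of the fixed-up typing (a model, not the
    kernel source; the body of uclamp_none() is simplified here): the clamp
    index keeps the 'enum uclamp_id' type, while the returned clamp value is a
    plain 'unsigned int' in [0, SCHED_CAPACITY_SCALE].

    #define SCHED_CAPACITY_SCALE 1024

    enum uclamp_id { UCLAMP_MIN = 0, UCLAMP_MAX, UCLAMP_CNT };

    /* The index is an enum, the returned value is an unsigned int. */
    static unsigned int uclamp_none(enum uclamp_id clamp_id)
    {
            if (clamp_id == UCLAMP_MIN)
                    return 0;                    /* no minimum boost */
            return SCHED_CAPACITY_SCALE;         /* no maximum cap   */
    }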

    Signed-off-by: Valentin Schneider
    Reviewed-by: Qais Yousef
    Acked-by: Vincent Guittot
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: patrick.bellasi@matbug.net
    Cc: qperret@google.com
    Cc: surenb@google.com
    Cc: tj@kernel.org
    Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
    Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

09 Nov, 2019

1 commit

    Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
    inadvertently introduced a race because it changed a previously
    unexplored dependency between dropping the rq->lock and
    sched_class::put_prev_task().

    The comments about dropping rq->lock, in for example
    newidle_balance(), only mention the task being current and ->on_cpu
    being set. But when we look at the 'change' pattern (in for example
    sched_setnuma()):

    queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
    running = task_current(rq, p); /* rq->curr == p */

    if (queued)
            dequeue_task(...);
    if (running)
            put_prev_task(...);

    /* change task properties */

    if (queued)
            enqueue_task(...);
    if (running)
            set_next_task(...);

    It becomes obvious that if we do this after put_prev_task() has
    already been called on @p, things go sideways. This is exactly what
    the commit in question allows to happen when it does:

    prev->sched_class->put_prev_task(rq, prev, rf);
    if (!rq->nr_running)
            newidle_balance(rq, rf);

    The newidle_balance() call will drop rq->lock after we've called
    put_prev_task() and that allows the above 'change' pattern to
    interleave and mess up the state.

    Furthermore, it turns out we lost the RT-pull when we put the last DL
    task.

    Fix both problems by extracting the balancing from put_prev_task() and
    doing a multi-class balance() pass before put_prev_task().
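
    As a rough illustration of the resulting ordering, here is a userspace
    model (not the kernel code; the sched_class hooks and helper names below
    are assumptions made for the sketch): every class gets a chance to
    balance, and only then is the previous task put and a new one picked.

    #include <stddef.h>

    struct rq;
    struct task_struct;

    struct sched_class {
            int  (*balance)(struct rq *rq, struct task_struct *prev);
            void (*put_prev_task)(struct rq *rq, struct task_struct *prev);
            struct task_struct *(*pick_next_task)(struct rq *rq);
    };

    struct task_struct {
            const struct sched_class *sched_class;
    };

    static struct task_struct *
    pick_next_task_slowpath(struct rq *rq, struct task_struct *prev,
                            const struct sched_class *classes[], int nr)
    {
            struct task_struct *p;
            int i;

            /* 1) Balancing may drop rq->lock, but prev has not been "put"
             *    yet, so the 'change' pattern above cannot interleave with
             *    a half-retired task.                                      */
            for (i = 0; i < nr; i++)
                    if (classes[i]->balance(rq, prev))
                            break;

            /* 2) Only now retire the previous task.                        */
            prev->sched_class->put_prev_task(rq, prev);

            /* 3) Pick from the highest class that has a runnable task.     */
            for (i = 0; i < nr; i++) {
                    p = classes[i]->pick_next_task(rq);
                    if (p)
                            return p;
            }
            return NULL;
    }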

    Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
    Reported-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Quentin Perret
    Tested-by: Valentin Schneider

    Peter Zijlstra
     

25 Sep, 2019

1 commit

  • The membarrier_state field is located within the mm_struct, which
    is not guaranteed to exist when used from runqueue-lock-free iteration
    on runqueues by the membarrier system call.

    Copy the membarrier_state from the mm_struct into the scheduler runqueue
    when the scheduler switches between mm.

    When registering membarrier for an mm, after setting the registration bit
    in the mm membarrier state, issue a synchronize_rcu() to ensure the
    scheduler observes the change. In order to take care of the case
    where a runqueue keeps executing the target mm without switching to
    another mm, iterate over each runqueue and issue an IPI to copy the
    membarrier_state from the mm_struct into each runqueue whose current
    mm is the one whose state has just been modified.

    Move the mm membarrier_state field closer to pgd in mm_struct to use
    a cache line already touched by the scheduler switch_mm.

    The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
    clear the runqueue's membarrier state in addition to clearing the mm
    membarrier state, so move its implementation into the scheduler
    membarrier code so it can access the runqueue structure.

    Add memory barrier in membarrier_exec_mmap() prior to clearing
    the membarrier state, ensuring memory accesses executed prior to exec
    are not reordered with the stores clearing the membarrier state.

    As suggested by Linus, move all membarrier.c RCU read-side locks outside
    of the for each cpu loops.
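
    A rough userspace model of the copy-on-switch mechanism described above
    (illustrative only; type, field and flag names here are assumptions, and
    the synchronize_rcu() plus IPI step is reduced to a direct update):

    #include <stdatomic.h>

    #define MEMBARRIER_FLAG_REGISTERED (1U << 0)     /* illustrative bit */

    struct mm_model {
            atomic_uint membarrier_state;
    };

    struct rq_model {
            struct mm_model *curr_mm;
            unsigned int membarrier_state;   /* snapshot of curr_mm's state */
    };

    /* On a context switch that changes mm, snapshot the new mm's state into
     * the runqueue so membarrier can read it without touching the mm.      */
    static void switch_mm_membarrier(struct rq_model *rq, struct mm_model *next)
    {
            rq->curr_mm = next;
            rq->membarrier_state = atomic_load(&next->membarrier_state);
    }

    /* Registration: set the bit in the mm, then make sure every runqueue
     * currently running this mm sees it (the kernel waits for a grace
     * period and sends IPIs; here that is modelled as a direct copy).      */
    static void membarrier_register(struct mm_model *mm,
                                    struct rq_model *rqs, int nr_cpus)
    {
            int cpu;

            atomic_fetch_or(&mm->membarrier_state, MEMBARRIER_FLAG_REGISTERED);
            for (cpu = 0; cpu < nr_cpus; cpu++) {
                    if (rqs[cpu].curr_mm == mm)
                            rqs[cpu].membarrier_state =
                                    atomic_load(&mm->membarrier_state);
            }
    }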

    Suggested-by: Linus Torvalds
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Metcalf
    Cc: Christoph Lameter
    Cc: Eric W. Biederman
    Cc: Kirill Tkhai
    Cc: Mike Galbraith
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Russell King - ARM Linux admin
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
    Signed-off-by: Ingo Molnar

    Mathieu Desnoyers
     

16 Sep, 2019

1 commit


03 Sep, 2019

3 commits

    The supported clamp indexes are defined in 'enum clamp_id'; however, because
    of the code logic in some of the first versions of the utilization clamping
    series, we sometimes needed to use 'unsigned int' to represent indices.

    This is no longer required since the final version of the uclamp_* APIs can
    always use the proper enum uclamp_id type.

    Fix it with a bulk rename now that we have all the bits merged.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Michal Koutny
    Acked-by: Tejun Heo
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190822132811.31294-7-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
    In order to properly support hierarchical resource control, the cgroup
    delegation model requires that attribute writes from a child group never
    fail but are still locally consistent and constrained based on the parent's
    assigned resources. This requires properly propagating and aggregating
    parent attributes down to its descendants.

    Implement this mechanism by adding a new "effective" clamp value for each
    task group. The effective clamp value is defined as the smaller value
    between the clamp value of a group and the effective clamp value of its
    parent. This is the actual clamp value enforced on tasks in a task group.

    Since it's possible for a cpu.uclamp.min value to be bigger than the
    cpu.uclamp.max value, ensure local consistency by restricting each
    "protection" (i.e. min utilization) with the corresponding "limit"
    (i.e. max utilization).

    Do that when propagating effective clamps, to ensure user-space writes
    never fail while still always tracking the most restrictive values.
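
    A minimal sketch of the propagation rule just described (a model, not the
    kernel code; struct and field names are illustrative):

    struct uclamp_tg {
            unsigned int req_min, req_max;   /* values written by the group */
            unsigned int eff_min, eff_max;   /* values actually enforced    */
    };

    static void uclamp_update_effective(struct uclamp_tg *tg,
                                        const struct uclamp_tg *parent)
    {
            unsigned int emin = tg->req_min < parent->eff_min ?
                                tg->req_min : parent->eff_min;
            unsigned int emax = tg->req_max < parent->eff_max ?
                                tg->req_max : parent->eff_max;

            /* Local consistency: a protection never exceeds the limit. */
            if (emin > emax)
                    emin = emax;

            tg->eff_min = emin;
            tg->eff_max = emax;
    }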

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Michal Koutny
    Acked-by: Tejun Heo
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190822132811.31294-3-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
    The cgroup CPU bandwidth controller allows assigning a specified
    (maximum) bandwidth to the tasks of a group. However this bandwidth is
    defined and enforced only on a temporal basis, without considering the
    actual frequency a CPU is running at. Thus, the amount of computation
    completed by a task within an allocated bandwidth can be very different
    depending on the actual frequency the CPU is running that task at.
    The amount of computation can also be affected by the specific CPU the
    task is running on, especially when running on asymmetric-capacity
    systems like Arm's big.LITTLE.

    With the availability of schedutil, the scheduler is now able
    to drive frequency selections based on actual task utilization.
    Moreover, the utilization clamping support provides a mechanism to
    bias the frequency selection operated by schedutil depending on
    constraints assigned to the tasks currently RUNNABLE on a CPU.

    Given the mechanisms described above, it is now possible to extend the
    cpu controller to specify the minimum (or maximum) utilization which
    should be considered for tasks RUNNABLE on a cpu.
    This makes it possible to better define the actual computational
    power assigned to task groups, thus improving the cgroup CPU bandwidth
    controller which is currently based just on time constraints.

    Extend the CPU controller with a couple of new attributes, uclamp.{min,max},
    which allow enforcing utilization boosting and capping for all the
    tasks in a group.

    Specifically:

    - uclamp.min: defines the minimum utilization which should be considered
    i.e. the RUNNABLE tasks of this group will run at least at a
    minimum frequency which corresponds to the uclamp.min
    utilization

    - uclamp.max: defines the maximum utilization which should be considered
    i.e. the RUNNABLE tasks of this group will run up to a
    maximum frequency which corresponds to the uclamp.max
    utilization

    These attributes:

    a) are available only for non-root nodes, both on default and legacy
    hierarchies, while system wide clamps are defined by a generic
    interface which does not depend on cgroups. This system wide
    interface enforces constraints on tasks in the root node.

    b) enforce effective constraints at each level of the hierarchy which
    are a restriction of the group requests considering its parent's
    effective constraints. Root group effective constraints are defined
    by the system wide interface.
    This mechanism allows each (non-root) level of the hierarchy to:
    - request whatever clamp values it would like to get
    - effectively get only up to the maximum amount allowed by its parent

    c) have higher priority than task-specific clamps, defined via
    sched_setattr(), thus making it possible to control and restrict task
    requests.

    Add two new attributes to the cpu controller to collect "requested"
    clamp values. Allow that at each non-root level of the hierarchy.
    Keep it simple by not caring now about "effective" values computation
    and propagation along the hierarchy.

    Update sysctl_sched_uclamp_handler() to use the newly introduced
    uclamp_mutex so that we serialize system default updates with
    cgroup-related updates.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Michal Koutny
    Acked-by: Tejun Heo
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190822132811.31294-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

08 Aug, 2019

6 commits

  • Avoid the RETRY_TASK case in the pick_next_task() slow path.

    By doing the put_prev_task() early, we get the rt/deadline pull done,
    and by testing rq->nr_running we know if we need newidle_balance().

    This then gives a stable state to pick a task from.

    Since the fast-path is fair only, the other classes will
    always have pick_next_task(.prev=NULL, .rf=NULL) and we can simplify.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Lu
    Cc: Valentin Schneider
    Cc: mingo@kernel.org
    Cc: Phil Auld
    Cc: Julien Desfossez
    Cc: Nishanth Aravamudan
    Link: https://lkml.kernel.org/r/aa34d24b36547139248f32a30138791ac6c02bd6.1559129225.git.vpillai@digitalocean.com

    Peter Zijlstra
     
  • Currently the pick_next_task() loop is convoluted and ugly because of
    how it can drop the rq->lock and needs to restart the picking.

    For the RT/Deadline classes, it is put_prev_task() where we do
    balancing, and we could do this before the picking loop. Make this
    possible.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Valentin Schneider
    Cc: Aaron Lu
    Cc: mingo@kernel.org
    Cc: Phil Auld
    Cc: Julien Desfossez
    Cc: Nishanth Aravamudan
    Link: https://lkml.kernel.org/r/e4519f6850477ab7f3d257062796e6425ee4ba7c.1559129225.git.vpillai@digitalocean.com

    Peter Zijlstra
     
  • For pick_next_task_fair() it is the newidle balance that requires
    dropping the rq->lock; provided we do put_prev_task() early, we can
    also detect the condition for doing newidle early.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Lu
    Cc: Valentin Schneider
    Cc: mingo@kernel.org
    Cc: Phil Auld
    Cc: Julien Desfossez
    Cc: Nishanth Aravamudan
    Link: https://lkml.kernel.org/r/9e3eb1859b946f03d7e500453a885725b68957ba.1559129225.git.vpillai@digitalocean.com

    Peter Zijlstra
     
    In preparation for further separating pick_next_task() and
    set_curr_task() we have to pass the actual task into it; while there,
    rename the thing to better pair with put_prev_task().

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Lu
    Cc: Valentin Schneider
    Cc: mingo@kernel.org
    Cc: Phil Auld
    Cc: Julien Desfossez
    Cc: Nishanth Aravamudan
    Link: https://lkml.kernel.org/r/a96d1bcdd716db4a4c5da2fece647a1456c0ed78.1559129225.git.vpillai@digitalocean.com

    Peter Zijlstra
     
    The CPU hotplug task selection is the only place where we used
    put_prev_task() on a task that is not current. While looking at that,
    it occurred to me that we can simplify all that by using a custom
    pick loop.

    Since we don't need to put current, we can do away with the fake task
    too.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aaron Lu
    Cc: Valentin Schneider
    Cc: mingo@kernel.org
    Cc: Phil Auld
    Cc: Julien Desfossez
    Cc: Nishanth Aravamudan

    Peter Zijlstra
     
    It has been observed that highly-threaded, non-cpu-bound applications
    running under cpu.cfs_quota_us constraints can hit a high percentage of
    periods throttled while simultaneously not consuming the allocated
    amount of quota. This use case is typical of user-interactive,
    non-cpu-bound applications, such as those running in kubernetes or mesos
    when run on multiple cpu cores.

    This has been root-caused to the cpu-local run queues being allocated
    per-cpu bandwidth slices and then not fully using those slices within the
    period, at which point the slice and quota expire. This expiration of
    unused slices results in applications not being able to utilize the quota
    for which they are allocated.

    The non-expiration of per-cpu slices was recently fixed by
    'commit 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift
    condition")'. Prior to that it appears that this had been broken since
    at least 'commit 51f2176d74ac ("sched/fair: Fix unlocked reads of some
    cfs_b->quota/period")' which was introduced in v3.16-rc1 in 2014. That
    added the following conditional which resulted in slices never being
    expired.

    if (cfs_rq->runtime_expires != cfs_b->runtime_expires) {
            /* extend local deadline, drift is bounded above by 2 ticks */
            cfs_rq->runtime_expires += TICK_NSEC;

    Because this was broken for nearly 5 years, and has recently been fixed
    and is now being noticed by many users running kubernetes
    (https://github.com/kubernetes/kubernetes/issues/67577), it is my opinion
    that the mechanisms around expiring runtime should be removed
    altogether.

    This allows quota already allocated to per-cpu run-queues to live longer
    than the period boundary. This allows threads on runqueues that do not
    use much CPU to continue to use their remaining slice over a longer
    period of time than cpu.cfs_period_us. In turn, this helps prevent the
    above condition of hitting throttling while also not fully utilizing
    your cpu quota.

    This theoretically allows a machine to use slightly more than its
    allotted quota in some periods. This overflow would be bounded by the
    remaining quota left on each per-cpu runqueue. This is typically no
    more than min_cfs_rq_runtime=1ms per cpu. For CPU-bound tasks this will
    change nothing, as they should theoretically fully utilize all of their
    quota in each period. For user-interactive tasks as described above this
    provides a much better user/application experience as their cpu
    utilization will more closely match the amount they requested when they
    hit throttling. This means that cpu limits no longer strictly apply per
    period for non-cpu bound applications, but that they are still accurate
    over longer timeframes.

    This greatly improves performance of high-thread-count, non-cpu bound
    applications with low cfs_quota_us allocation on high-core-count
    machines. In the case of an artificial testcase (10ms/100ms of quota on
    80 CPU machine), this commit resulted in almost 30x performance
    improvement, while still maintaining correct cpu quota restrictions.
    That testcase is available at https://github.com/indeedeng/fibtest.

    Fixes: 512ac999d275 ("sched/fair: Fix bandwidth timer clock drift condition")
    Signed-off-by: Dave Chiluk
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Reviewed-by: Ben Segall
    Cc: Ingo Molnar
    Cc: John Hammond
    Cc: Jonathan Corbet
    Cc: Kyle Anderson
    Cc: Gabriel Munos
    Cc: Peter Oskolkov
    Cc: Cong Wang
    Cc: Brendan Gregg
    Link: https://lkml.kernel.org/r/1563900266-19734-2-git-send-email-chiluk+linux@indeed.com

    Dave Chiluk
     

01 Aug, 2019

1 commit

  • CONFIG_PREEMPTION is selected by CONFIG_PREEMPT and by
    CONFIG_PREEMPT_RT. Both PREEMPT and PREEMPT_RT require the same
    functionality which today depends on CONFIG_PREEMPT.

    Switch the preemption code, scheduler and init task over to use
    CONFIG_PREEMPTION.

    That's the first step towards RT in that area. The more complex changes are
    coming separately.
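
    The mechanical shape of the switch, sketched on a generic example (the
    CONFIG_* handling below is a C approximation of the Kconfig 'select'; it
    is not taken from the patch itself):

    /* CONFIG_PREEMPTION is selected by both CONFIG_PREEMPT and
     * CONFIG_PREEMPT_RT, so code that used to test CONFIG_PREEMPT now
     * tests CONFIG_PREEMPTION instead.                                  */
    #if defined(CONFIG_PREEMPT) || defined(CONFIG_PREEMPT_RT)
    #define CONFIG_PREEMPTION 1
    #endif

    static inline int built_with_preemption(void)
    {
    #ifdef CONFIG_PREEMPTION        /* was: #ifdef CONFIG_PREEMPT */
            return 1;
    #else
            return 0;
    #endif
    }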

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Masami Hiramatsu
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20190726212124.117528401@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

25 Jul, 2019

3 commits

    When the topology of root domains is modified by CPUset or CPUhotplug
    operations, information about the current deadline bandwidth held in the
    root domain is lost.

    This patch addresses the issue by recalculating the lost deadline
    bandwidth information, iterating over the deadline tasks held in
    CPUsets and adding their current load to the root domain they are
    associated with.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Mathieu Poirier
    Signed-off-by: Juri Lelli
    [ Various additional modifications. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Cc: claudio@evidence.eu.com
    Cc: lizefan@huawei.com
    Cc: longman@redhat.com
    Cc: luca.abeni@santannapisa.it
    Cc: rostedt@goodmis.org
    Cc: tj@kernel.org
    Cc: tommaso.cucinotta@santannapisa.it
    Link: https://lkml.kernel.org/r/20190719140000.31694-4-juri.lelli@redhat.com
    Signed-off-by: Ingo Molnar

    Mathieu Poirier
     
    In a real product setup, there will be housekeeping CPUs in each node.
    It is preferable to do housekeeping from the local node, and to fall back
    to the global online cpumask if no housekeeping CPU can be found in the
    local node.
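
    A userspace model of that selection policy (illustrative; the kernel uses
    the housekeeping cpumask helpers, while the array-based helper below is an
    assumption made for the sketch):

    /* Prefer a housekeeping CPU on the local NUMA node; fall back to any
     * housekeeping CPU otherwise.                                         */
    static int pick_housekeeping_cpu(const int *hk_cpus, int nr_hk,
                                     const int *cpu_to_node, int local_node)
    {
            int i;

            for (i = 0; i < nr_hk; i++) {
                    if (cpu_to_node[hk_cpus[i]] == local_node)
                            return hk_cpus[i];   /* local-node housekeeper */
            }
            return nr_hk ? hk_cpus[0] : -1;      /* global fallback */
    }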

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Frederic Weisbecker
    Reviewed-by: Srikar Dronamraju
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1561711901-4755-2-git-send-email-wanpengli@tencent.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     
    Track how many tasks with SCHED_IDLE policy are present in each cfs_rq.
    This will be used by later commits.

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Daniel Lezcano
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Cc: chris.redpath@arm.com
    Cc: quentin.perret@linaro.org
    Cc: songliubraving@fb.com
    Cc: steven.sistare@oracle.com
    Cc: subhra.mazumdar@oracle.com
    Cc: tkjos@google.com
    Link: https://lkml.kernel.org/r/0d3cdc427fc68808ad5bccc40e86ed0bf9da8bb4.1561523542.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

25 Jun, 2019

6 commits

  • The Energy Aware Scheduler (EAS) estimates the energy impact of waking
    up a task on a given CPU. This estimation is based on:

    a) an (active) power consumption defined for each CPU frequency
    b) an estimation of which frequency will be used on each CPU
    c) an estimation of the busy time (utilization) of each CPU

    Utilization clamping can affect both b) and c).

    A CPU is expected to run:

    - on a higher-than-required frequency, but for a shorter time, in case
    its estimated utilization is smaller than the minimum utilization
    enforced by uclamp
    - on a lower-than-required frequency, but for a longer time, in case
    its estimated utilization is bigger than the maximum utilization
    enforced by uclamp

    While compute_energy() already accounts for clamping effects on busy time,
    the clamping effects on frequency selection are currently ignored.

    Fix it by considering how CPU clamp values will be affected by a
    task waking up and being RUNNABLE on that CPU.

    Do that by refactoring schedutil_freq_util() to take an additional
    task_struct* which allows EAS to evaluate the impact on clamp values of
    a task being eventually queued on that CPU. Clamp values are applied to the
    RT+CFS utilization only when FREQUENCY_UTIL is required by
    compute_energy().

    Do note that switching from ENERGY_UTIL to FREQUENCY_UTIL in the
    computation of the cpu_util signal implies that we are more likely to
    estimate the highest OPP when an RT task is running on another CPU of
    the same performance domain. This can have an impact on energy
    estimation but:

    - it's not easy to say which approach is better, since it depends on
    the use case
    - the original approach could still be obtained by setting a smaller
    task-specific util_min whenever required

    While we are at it:

    - rename schedutil_freq_util() to schedutil_cpu_util(),
      since it's not only used for frequency selection.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-12-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
    So far uclamp_util() allows clamping a specified utilization considering
    the clamp values requested by RUNNABLE tasks on a CPU. For the Energy
    Aware Scheduler (EAS) it is interesting to test how clamp values will
    change when a task becomes RUNNABLE on a given CPU.
    For example, EAS is interested in comparing the energy impact of
    different scheduling decisions and the clamp values can play a role on
    that.

    Add uclamp_util_with(), which allows clamping a given utilization while
    considering the possible impact on CPU clamp values of a specified task.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-11-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • Each time a frequency update is required via schedutil, a frequency is
    selected to (possibly) satisfy the utilization reported by each
    scheduling class and irqs. However, when utilization clamping is in use,
    the frequency selection should consider userspace utilization clamping
    hints. This will allow, for example, to:

    - boost tasks which are directly affecting the user experience
    by running them at least at a minimum "requested" frequency

    - cap low priority tasks not directly affecting the user experience
    by running them only up to a maximum "allowed" frequency

    These constraints are meant to support per-task tuning of the
    frequency selection, thus supporting a fine-grained definition of
    performance boosting vs. energy saving strategies in kernel space.

    Add support to clamp the utilization of RUNNABLE FAIR and RT tasks
    within the boundaries defined by their aggregated utilization clamp
    constraints.

    Do that by considering the max(min_util, max_util) to give boosted tasks
    the performance they need even when they happen to be co-scheduled with
    other capped tasks.
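
    A sketch of that clamping rule (a model, not the kernel's uclamp_util();
    the function name is illustrative): the CPU-aggregated min/max clamps
    bound the utilization used for frequency selection, and when aggregation
    produces min > max the boosted value wins.

    static unsigned long uclamp_freq_util(unsigned long util,
                                          unsigned long min_util,
                                          unsigned long max_util)
    {
            if (min_util >= max_util)       /* inverted clamps: boost wins */
                    return min_util;
            if (util < min_util)
                    return min_util;
            if (util > max_util)
                    return max_util;
            return util;
    }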

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-10-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • When a task sleeps it removes its max utilization clamp from its CPU.
    However, the blocked utilization on that CPU can be higher than the max
    clamp value enforced while the task was running. This allows undesired
    CPU frequency increases while a CPU is idle, for example, when another
    CPU on the same frequency domain triggers a frequency update, since
    schedutil can now see the full, unclamped blocked utilization of the
    idle CPU.

    Fix this by using:

    uclamp_rq_dec_id(p, rq, UCLAMP_MAX)
      uclamp_rq_max_value(rq, UCLAMP_MAX, clamp_value)

    to detect when a CPU has no more RUNNABLE clamped tasks and to flag this
    condition.

    Don't track any minimum utilization clamps since an idle CPU never
    requires a minimum frequency. The decay of the blocked utilization is
    good enough to reduce the CPU frequency.

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-4-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
    Utilization clamping allows clamping the CPU's utilization within a
    [util_min, util_max] range, depending on the set of RUNNABLE tasks on
    that CPU. Each task references two "clamp buckets" defining its minimum
    and maximum (util_{min,max}) utilization "clamp values". A CPU's clamp
    bucket is active if there is at least one RUNNABLE task enqueued on
    that CPU refcounting that bucket.

    When a task is {en,de}queued {on,from} a rq, the set of active clamp
    buckets on that CPU can change. If the set of active clamp buckets
    changes for a CPU a new "aggregated" clamp value is computed for that
    CPU. This is because each clamp bucket enforces a different utilization
    clamp value.

    Clamp values are always MAX aggregated for both util_min and util_max.
    This ensures that no task can affect the performance of other
    co-scheduled tasks which are more boosted (i.e. with higher util_min
    clamp) or less capped (i.e. with higher util_max clamp).

    A task has:
    task_struct::uclamp[clamp_id]::bucket_id
    to track the "bucket index" of the CPU's clamp bucket it refcounts while
    enqueued, for each clamp index (clamp_id).

    A runqueue has:
    rq::uclamp[clamp_id]::bucket[bucket_id].tasks
    to track how many RUNNABLE tasks on that CPU refcount each
    clamp bucket (bucket_id) of a clamp index (clamp_id).
    It also has a:
    rq::uclamp[clamp_id]::bucket[bucket_id].value
    to track the clamp value of each clamp bucket (bucket_id) of a clamp
    index (clamp_id).

    The rq::uclamp::bucket[clamp_id][] array is scanned every time we
    need to find a new MAX aggregated clamp value for a clamp_id. This
    operation is required only when the last task of a clamp bucket
    tracking the current MAX aggregated clamp value is dequeued. In this
    case, the CPU is either entering IDLE or going to schedule a less
    boosted or more clamped task.
    The expected number of different clamp values configured at build time
    is small enough to fit the full unordered array into a single cache
    line, for configurations of up to 7 buckets.

    Add to struct rq the basic data structures required to refcount the
    number of RUNNABLE tasks for each clamp bucket. Add also the max
    aggregation required to update the rq's clamp value at each
    enqueue/dequeue event.

    Use a simple linear mapping of clamp values into clamp buckets.
    Pre-compute and cache bucket_id to avoid integer divisions at
    enqueue/dequeue time.
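
    A userspace model of this bookkeeping (illustrative only: the bucket count
    and names are assumptions, and the kernel caches the precomputed bucket_id
    in the task rather than recomputing it as done here):

    #define SCHED_CAPACITY_SCALE    1024
    #define UCLAMP_BUCKETS          5

    struct uclamp_bucket {
            unsigned int value;  /* clamp value this bucket currently holds */
            unsigned int tasks;  /* RUNNABLE tasks refcounting this bucket  */
    };

    struct uclamp_rq {
            unsigned int value;  /* current MAX-aggregated clamp value */
            struct uclamp_bucket bucket[UCLAMP_BUCKETS];
    };

    /* Simple linear mapping of a clamp value to a bucket index. */
    static unsigned int uclamp_bucket_id(unsigned int clamp_value)
    {
            return clamp_value * UCLAMP_BUCKETS / (SCHED_CAPACITY_SCALE + 1);
    }

    static void uclamp_rq_inc(struct uclamp_rq *uc, unsigned int clamp_value)
    {
            struct uclamp_bucket *b = &uc->bucket[uclamp_bucket_id(clamp_value)];

            b->tasks++;
            b->value = clamp_value;
            if (clamp_value > uc->value)    /* MAX aggregation */
                    uc->value = clamp_value;
    }

    static void uclamp_rq_dec(struct uclamp_rq *uc, unsigned int clamp_value)
    {
            struct uclamp_bucket *b = &uc->bucket[uclamp_bucket_id(clamp_value)];
            unsigned int i, max = 0;

            if (--b->tasks)
                    return;
            /* Last task of this bucket left: rescan for the new MAX value. */
            for (i = 0; i < UCLAMP_BUCKETS; i++) {
                    if (uc->bucket[i].tasks && uc->bucket[i].value > max)
                            max = uc->bucket[i].value;
            }
            uc->value = max;
    }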

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alessio Balsini
    Cc: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Suren Baghdasaryan
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: https://lkml.kernel.org/r/20190621084217.8167-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     
  • The 'struct sched_domain *sd' parameter to arch_scale_cpu_capacity() is
    unused since commit:

    765d0af19f5f ("sched/topology: Remove the ::smt_gain field from 'struct sched_domain'")

    Remove it.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Viresh Kumar
    Reviewed-by: Valentin Schneider
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: gregkh@linuxfoundation.org
    Cc: linux@armlinux.org.uk
    Cc: quentin.perret@arm.com
    Cc: rafael@kernel.org
    Link: https://lkml.kernel.org/r/1560783617-5827-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

17 Jun, 2019

1 commit

  • When a cfs_rq sleeps and returns its quota, we delay for 5ms before
    waking any throttled cfs_rqs to coalesce with other cfs_rqs going to
    sleep, as this has to be done outside of the rq lock we hold.

    The current code waits for 5ms without any sleeps, instead of waiting
    for 5ms from the first sleep, which can delay the unthrottle more than
    we want. Switch this around so that we can't push this forward forever.

    This requires an extra flag rather than using hrtimer_active, since we
    need to start a new timer if the current one is in the process of
    finishing.

    Signed-off-by: Ben Segall
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Xunlei Pang
    Acked-by: Phil Auld
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/xm26a7euy6iq.fsf_-_@bsegall-linux.svl.corp.google.com
    Signed-off-by: Ingo Molnar

    bsegall@google.com
     

03 Jun, 2019

3 commits

    The per-rq load array values also disappear from the cpu#X sections in
    /proc/sched_debug.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • With LB_BIAS disabled, there is no need to update the rq->cpu_load[idx]
    any more.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-2-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • The CFS class is the only one maintaining and using the CPU wide load
    (rq->load(.weight)). The last use case of the CPU wide load in CFS's
    set_next_entity() can be replaced by using the load of the CFS class
    (rq->cfs.load(.weight)) instead.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     

03 Apr, 2019

3 commits

  • This fixes the following sparse errors in sched/fair.c:

    fair.c:6506:14: error: incompatible types in comparison expression
    fair.c:8642:21: error: incompatible types in comparison expression

    Using __rcu will also help sparse catch any future bugs.

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Peter Zijlstra (Intel)
    [ From an RCU perspective. ]
    Reviewed-by: Paul E. McKenney
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Luc Van Oostenryck
    Cc: Mathieu Desnoyers
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: keescook@chromium.org
    Cc: kernel-hardening@lists.openwall.com
    Cc: kernel-team@android.com
    Link: https://lkml.kernel.org/r/20190321003426.160260-5-joel@joelfernandes.org
    Signed-off-by: Ingo Molnar

    Joel Fernandes (Google)
     
  • The scheduler uses RCU API in various places to access sched_domain
    pointers. These cause sparse errors as below.

    Many new errors show up because of an annotation check I added to
    rcu_assign_pointer(). Let us annotate the pointers correctly, which will
    also help sparse catch any potential future bugs.

    This fixes the following sparse errors:

    rt.c:1681:9: error: incompatible types in comparison expression
    deadline.c:1904:9: error: incompatible types in comparison expression
    core.c:519:9: error: incompatible types in comparison expression
    core.c:1634:17: error: incompatible types in comparison expression
    fair.c:6193:14: error: incompatible types in comparison expression
    fair.c:9883:22: error: incompatible types in comparison expression
    fair.c:9897:9: error: incompatible types in comparison expression
    sched.h:1287:9: error: incompatible types in comparison expression
    topology.c:612:9: error: incompatible types in comparison expression
    topology.c:615:9: error: incompatible types in comparison expression
    sched.h:1300:9: error: incompatible types in comparison expression
    topology.c:618:9: error: incompatible types in comparison expression
    sched.h:1287:9: error: incompatible types in comparison expression
    topology.c:621:9: error: incompatible types in comparison expression
    sched.h:1300:9: error: incompatible types in comparison expression
    topology.c:624:9: error: incompatible types in comparison expression
    topology.c:671:9: error: incompatible types in comparison expression
    stats.c:45:17: error: incompatible types in comparison expression
    fair.c:5998:15: error: incompatible types in comparison expression
    fair.c:5989:15: error: incompatible types in comparison expression
    fair.c:5998:15: error: incompatible types in comparison expression
    fair.c:5989:15: error: incompatible types in comparison expression
    fair.c:6120:19: error: incompatible types in comparison expression
    fair.c:6506:14: error: incompatible types in comparison expression
    fair.c:6515:14: error: incompatible types in comparison expression
    fair.c:6623:9: error: incompatible types in comparison expression
    fair.c:5970:17: error: incompatible types in comparison expression
    fair.c:8642:21: error: incompatible types in comparison expression
    fair.c:9253:9: error: incompatible types in comparison expression
    fair.c:9331:9: error: incompatible types in comparison expression
    fair.c:9519:15: error: incompatible types in comparison expression
    fair.c:9533:14: error: incompatible types in comparison expression
    fair.c:9542:14: error: incompatible types in comparison expression
    fair.c:9567:14: error: incompatible types in comparison expression
    fair.c:9597:14: error: incompatible types in comparison expression
    fair.c:9421:16: error: incompatible types in comparison expression
    fair.c:9421:16: error: incompatible types in comparison expression

    Signed-off-by: Joel Fernandes (Google)
    Signed-off-by: Peter Zijlstra (Intel)
    [ From an RCU perspective. ]
    Reviewed-by: Paul E. McKenney
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Luc Van Oostenryck
    Cc: Mathieu Desnoyers
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: keescook@chromium.org
    Cc: kernel-hardening@lists.openwall.com
    Cc: kernel-team@android.com
    Link: https://lkml.kernel.org/r/20190321003426.160260-3-joel@joelfernandes.org
    Signed-off-by: Ingo Molnar

    Joel Fernandes (Google)
     
    Recently I added an RCU annotation check to rcu_assign_pointer(). All
    pointers assigned to RCU protected data are to be annotated with __rcu
    in order to be able to use rcu_assign_pointer(), similar to checks in
    other RCU APIs.

    This resulted in a sparse error:

    kernel//sched/cpufreq.c:41:9: sparse: error: incompatible types in comparison expression (different address spaces)

    Fix this by annotating the cpufreq_update_util_data pointer with __rcu.
    This will also help sparse catch any future RCU misuse bugs.
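
    For reference, a sketch of the annotation pattern (illustrative; the sparse
    expansion shown is the traditional one, and the pointer declaration below
    is simplified, the real one is per-CPU in kernel/sched/cpufreq.c):

    /* Under sparse, __rcu marks an address space so that only
     * rcu_assign_pointer()/rcu_dereference() may touch the pointer;
     * for a normal compile it expands to nothing.                   */
    #ifdef __CHECKER__
    # define __rcu  __attribute__((noderef, address_space(4)))
    #else
    # define __rcu
    #endif

    struct update_util_data;

    static struct update_util_data __rcu *update_util_ptr;   /* annotated */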

    Signed-off-by: Joel Fernandes (Google)
    [ From an RCU perspective. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Paul E. McKenney
    Cc: Josh Triplett
    Cc: Lai Jiangshan
    Cc: Linus Torvalds
    Cc: Luc Van Oostenryck
    Cc: Mathieu Desnoyers
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: keescook@chromium.org
    Cc: kernel-hardening@lists.openwall.com
    Cc: kernel-team@android.com
    Link: https://lkml.kernel.org/r/20190321003426.160260-2-joel@joelfernandes.org
    Signed-off-by: Ingo Molnar

    Joel Fernandes (Google)
     

07 Mar, 2019

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main changes in this cycle were:

    - refcount conversions

    - Solve the rq->leaf_cfs_rq_list can of worms for real.

    - improve power-aware scheduling

    - add sysctl knob for Energy Aware Scheduling

    - documentation updates

    - misc other changes"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (34 commits)
    kthread: Do not use TIMER_IRQSAFE
    kthread: Convert worker lock to raw spinlock
    sched/fair: Use non-atomic cpumask_{set,clear}_cpu()
    sched/fair: Remove unused 'sd' parameter from select_idle_smt()
    sched/wait: Use freezable_schedule() when possible
    sched/fair: Prune, fix and simplify the nohz_balancer_kick() comment block
    sched/fair: Explain LLC nohz kick condition
    sched/fair: Simplify nohz_balancer_kick()
    sched/topology: Fix percpu data types in struct sd_data & struct s_data
    sched/fair: Simplify post_init_entity_util_avg() by calling it with a task_struct pointer argument
    sched/fair: Fix O(nr_cgroups) in the load balancing path
    sched/fair: Optimize update_blocked_averages()
    sched/fair: Fix insertion in rq->leaf_cfs_rq_list
    sched/fair: Add tmp_alone_branch assertion
    sched/core: Use READ_ONCE()/WRITE_ONCE() in move_queued_task()/task_rq_lock()
    sched/debug: Initialize sd_sysctl_cpus if !CONFIG_CPUMASK_OFFSTACK
    sched/pelt: Skip updating util_est when utilization is higher than CPU's capacity
    sched/fair: Update scale invariance of PELT
    sched/fair: Move the rq_of() helper function
    sched/core: Convert task_struct.stack_refcount to refcount_t
    ...

    Linus Torvalds
     

11 Feb, 2019

1 commit

  • Since commit:

    d03266910a53 ("sched/fair: Fix task group initialization")

    the utilization of a sched entity representing a task group is no longer
    initialized to any other value than 0. So post_init_entity_util_avg() is
    only used for tasks, not for sched_entities.

    Make this clear by calling it with a task_struct pointer argument which
    also eliminates the entity_is_task(se) if condition in the fork path and
    get rid of the stale comment in remove_entity_load_avg() accordingly.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190122162501.12000-1-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     

04 Feb, 2019

4 commits

  • move_queued_task() synchronizes with task_rq_lock() as follows:

    move_queued_task()              task_rq_lock()

    [S] ->on_rq = MIGRATING         [L] rq = task_rq()
    WMB (__set_task_cpu())          ACQUIRE (rq->lock);
    [S] ->cpu = new_cpu             [L] ->on_rq

    where "[L] rq = task_rq()" is ordered before "ACQUIRE (rq->lock)" by an
    address dependency and, in turn, "ACQUIRE (rq->lock)" is ordered before
    "[L] ->on_rq" by the ACQUIRE itself.

    Use READ_ONCE() to load ->cpu in task_rq() (c.f., task_cpu()) to honor
    this address dependency. Also, mark the accesses to ->cpu and ->on_rq
    with READ_ONCE()/WRITE_ONCE() to comply with the LKMM.
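
    A minimal model of the marked accesses (the kernel's READ_ONCE()/
    WRITE_ONCE() are more elaborate; this only captures the "access exactly
    once, no tearing or refetching by the compiler" property relied on here,
    and the struct below is a stand-in for task_struct):

    #define READ_ONCE(x)        (*(volatile typeof(x) *)&(x))
    #define WRITE_ONCE(x, val)  do { *(volatile typeof(x) *)&(x) = (val); } while (0)

    struct task_model {
            int on_rq;
            int cpu;
    };

    /* Writer side, as in move_queued_task() above. */
    static void writer(struct task_model *p, int new_cpu)
    {
            WRITE_ONCE(p->on_rq, 2);     /* stands in for TASK_ON_RQ_MIGRATING */
            /* WMB / __set_task_cpu() ordering elided in this model */
            WRITE_ONCE(p->cpu, new_cpu);
    }

    /* Reader side, as in task_rq_lock() above. */
    static void reader(struct task_model *p, int *cpu, int *on_rq)
    {
            *cpu = READ_ONCE(p->cpu);     /* [L] rq = task_rq(): dependency source */
            /* ACQUIRE(rq->lock) would sit here */
            *on_rq = READ_ONCE(p->on_rq); /* [L] ->on_rq, ordered by the ACQUIRE */
    }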

    Signed-off-by: Andrea Parri
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alan Stern
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: https://lkml.kernel.org/r/20190121155240.27173-1-andrea.parri@amarulasolutions.com
    Signed-off-by: Ingo Molnar

    Andrea Parri
     
    util_est is mainly meant to be a lower bound for a task's utilization.
    That's why task_util_est() returns the actual util_avg when it's higher
    than the estimated utilization.

    With the new invariance signal and without any special check on sample
    collection, if a task is limited because of thermal capping for
    example, we could end up overestimating its utilization and thus
    perhaps generating an unwanted frequency spike when the capping is
    relaxed... and (even worse) it will take some more activations for the
    estimated utilization to converge back to the actual utilization.

    Since we cannot easily know if there is idle time on a CPU when a task
    completes an activation with a utilization higher than the CPU capacity,
    we skip the sampling when utilization is higher than the CPU's capacity.

    Suggested-by: Patrick Bellasi
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: pjt@google.com
    Cc: pkondeti@codeaurora.org
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Link: https://lkml.kernel.org/r/1548257214-13745-4-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
    The current implementation of load tracking invariance scales the
    contribution with the current frequency and uarch performance (only for
    utilization) of the CPU. One main result of this formula is that the
    figures are capped by the current capacity of the CPU. Another one is that
    the load_avg is not invariant because it is not scaled with uarch.

    The util_avg of a periodic task that runs r time slots every p time slots
    varies in the range:

        U * (1-y^r)/(1-y^p) * y^i  <  Utilization  <  U * (1-y^r)/(1-y^p)

    where U is the max util_avg value = SCHED_CAPACITY_SCALE.

    At a lower capacity, the range becomes:

        U * C * (1-y^r')/(1-y^p) * y^i'  <  Utilization  <  U * C * (1-y^r')/(1-y^p)

    where C reflects the compute capacity ratio between the current capacity
    and the max capacity.

    So C tries to compensate for changes in (1-y^r'), but it can't be accurate.

    Instead of scaling the contribution value of the PELT algorithm, we should
    scale the running time. The PELT signal aims to track the amount of
    computation of tasks and/or rqs, so it seems more correct to scale the
    running time to reflect the effective amount of computation done since
    the last update.

    In order to be fully invariant, we need to apply the same amount of
    running time and idle time whatever the current capacity. Because running
    at lower capacity implies that the task will run longer, we have to ensure
    that the same amount of idle time will be applied when the system becomes
    idle and no idle time has been "stolen". But reaching the maximum
    utilization value (SCHED_CAPACITY_SCALE) means that the task is seen as an
    always-running task whatever the capacity of the CPU (even at max compute
    capacity). In this case, we can discard this "stolen" idle time, which
    becomes meaningless.

    In order to achieve this time scaling, a new clock_pelt is created per rq.
    The increase of this clock scales with current capacity when something
    is running on rq and synchronizes with clock_task when rq is idle. With
    this mechanism, we ensure the same running and idle time whatever the
    current capacity. This also makes it possible to simplify the PELT
    algorithm by removing all references to uarch and frequency and applying
    the same contribution to utilization and loads. Furthermore, the scaling is
    done only once per clock update (update_rq_clock_task()) instead of during
    each update of the sched_entities and cfs/rt/dl_rq of the rq, as in the
    current implementation. This is interesting when cgroups are involved, as
    shown in the results below:

    On a Hikey (octo Arm64 platform), with the performance cpufreq governor and
    only the shallowest c-state, to remove variance generated by those power
    features so we only track the impact of the PELT algorithm.

    each test runs 16 times:

    ./perf bench sched pipe
    (higher is better)
    kernel      tip/sched/core      + patch
                ops/seconds         ops/seconds         diff
    cgroup
    root        59652(+/- 0.18%)    59876(+/- 0.24%)    +0.38%
    level1      55608(+/- 0.27%)    55923(+/- 0.24%)    +0.57%
    level2      52115(+/- 0.29%)    52564(+/- 0.22%)    +0.86%

    hackbench -l 1000
    (lower is better)
    kernel      tip/sched/core      + patch
                duration(sec)       duration(sec)       diff
    cgroup
    root        4.453(+/- 2.37%)    4.383(+/- 2.88%)    -1.57%
    level1      4.859(+/- 8.50%)    4.830(+/- 7.07%)    -0.60%
    level2      5.063(+/- 9.83%)    4.928(+/- 9.66%)    -2.66%

    Then, the responsiveness of PELT is improved with this new algorithm when
    the CPU is not running at max capacity. I have put below some examples of
    the time needed to reach some typical load values according to the capacity
    of the CPU, with the current implementation and with this patch. These
    values have been computed based on the geometric series and the half-period
    value:

    Util (%)      max capacity    half capacity(mainline)    half capacity(w/ patch)
    972 (95%)     138ms           not reachable              276ms
    486 (47.5%)   30ms            138ms                      60ms
    256 (25%)     13ms            32ms                       26ms

    On my Hikey (octo Arm64 platform) with the schedutil governor, the time to
    reach the max OPP when starting from a null utilization decreases from
    223ms with the current scale invariance down to 121ms with the new
    algorithm.
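
    A toy model of the clock_pelt idea (not the kernel code; names are
    illustrative): while the rq is busy the PELT clock advances scaled by the
    current capacity, and when the rq goes idle it resynchronizes with the
    task clock so the remaining "stolen" idle time is not accumulated.

    #define SCHED_CAPACITY_SCALE 1024

    struct rq_clock_model {
            unsigned long long clock_task;  /* task clock (ns)               */
            unsigned long long clock_pelt;  /* capacity-invariant PELT clock */
    };

    static void update_clock_pelt(struct rq_clock_model *rq,
                                  unsigned long long delta_ns,
                                  unsigned long cur_capacity, int rq_is_idle)
    {
            rq->clock_task += delta_ns;

            if (rq_is_idle) {
                    /* Idle: catch up, discarding the time "lost" below. */
                    rq->clock_pelt = rq->clock_task;
                    return;
            }
            /* Busy: at half capacity the PELT clock advances at half speed,
             * so the same computation yields the same signal.              */
            rq->clock_pelt += delta_ns * cur_capacity / SCHED_CAPACITY_SCALE;
    }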

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pjt@google.com
    Cc: pkondeti@codeaurora.org
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Link: https://lkml.kernel.org/r/1548257214-13745-3-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     
  • Move rq_of() helper function so it can be used in pelt.c

    [ mingo: Improve readability while at it. ]

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Morten.Rasmussen@arm.com
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: bsegall@google.com
    Cc: dietmar.eggemann@arm.com
    Cc: patrick.bellasi@arm.com
    Cc: pjt@google.com
    Cc: pkondeti@codeaurora.org
    Cc: quentin.perret@arm.com
    Cc: rjw@rjwysocki.net
    Cc: srinivas.pandruvada@linux.intel.com
    Cc: thara.gopinath@linaro.org
    Link: https://lkml.kernel.org/r/1548257214-13745-2-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Ingo Molnar

    Vincent Guittot
     

27 Jan, 2019

1 commit

  • All that fancy new Energy-Aware scheduling foo is hidden behind a
    static_key, which is awesome if you have the stuff enabled in your
    config.

    However, when you lack all the prerequisites it doesn't make any sense
    to pretend we'll ever actually run this, so provide a little more clue
    to the compiler so it can more aggressively delete the code.

    text     data    bss     dec      hex    filename
    50297     976     96     51369    c8a9   defconfig-build/kernel/sched/fair.o
    49227     944     96     50267    c45b   defconfig-build/kernel/sched/fair.o
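
    A sketch of the idea behind those numbers (illustrative; the helper and
    config names below are assumptions, not the patch itself): when the
    prerequisites are compiled out, the enablement check becomes a constant
    'false', letting the compiler drop the EAS paths entirely instead of
    keeping them behind a never-enabled static key.

    #ifdef CONFIG_ENERGY_MODEL_SKETCH          /* stands in for the real deps */
    extern int sched_energy_key;
    static inline int sched_energy_enabled_sketch(void)
    {
            return sched_energy_key != 0;      /* static_branch in the kernel */
    }
    #else
    static inline int sched_energy_enabled_sketch(void)
    {
            return 0;                          /* compile-time false */
    }
    #endif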

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

26 Jan, 2019

1 commit

  • Now that call_rcu()'s callback is not invoked until after all
    preempt-disable regions of code have completed (in addition to explicitly
    marked RCU read-side critical sections), call_rcu() can be used in place
    of call_rcu_sched(). This commit therefore makes that change.

    While in the area, this commit also updates an outdated header comment
    for for_each_domain().

    Signed-off-by: Paul E. McKenney
    Cc: Ingo Molnar
    Cc: Peter Zijlstra

    Paul E. McKenney
     

06 Jan, 2019

1 commit

  • Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

    The jump label is controlled by HAVE_JUMP_LABEL, which is defined
    like this:

    #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
    # define HAVE_JUMP_LABEL
    #endif

    We can improve this by testing 'asm goto' support in Kconfig, then
    make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

    The ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
    match the real kernel capability.

    Signed-off-by: Masahiro Yamada
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Sedat Dilek

    Masahiro Yamada