22 Jul, 2020
2 commits
-
commit 01cfcde9c26d8555f0e6e9aea9d6049f87683998 upstream.
task_h_load() can return 0 in some situations like running stress-ng
mmapfork, which forks thousands of threads, in a sched group on a 224 cores
system. The load balance doesn't handle this correctly because
env->imbalance never decreases and it will stop pulling tasks only after
reaching loop_max, which can be equal to the number of running tasks of
the cfs rq. Make sure that imbalance will be decreased by at least 1.
Misfit task handling is the other path that doesn't correctly handle such a
situation, although it's probably more difficult to hit the problem there
because of the smaller number of CPUs and running tasks on heterogeneous
systems.
We can't simply ensure that task_h_load() returns at least one, because that
would imply handling underflow in other places.
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Reviewed-by: Dietmar Eggemann
Tested-by: Dietmar Eggemann
Cc: # v4.4+
Link: https://lkml.kernel.org/r/20200710152426.16981-1-vincent.guittot@linaro.org
Signed-off-by: Greg Kroah-Hartman
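A minimal sketch of the idea in detach_tasks() in kernel/sched/fair.c (the
exact upstream diff may differ slightly):
    /*
     * task_h_load() can return 0 for a task deep in a forking cgroup
     * hierarchy; clamp it so that env->imbalance is guaranteed to
     * decrease for every task we detach.
     */
    load = max_t(unsigned long, task_h_load(p), 1);
    env->imbalance -= load;
-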
commit ce3614daabea8a2d01c1dd17ae41d1ec5e5ae7db upstream.
While integrating rseq into glibc and replacing glibc's sched_getcpu
implementation with rseq, glibc's tests discovered an issue with
incorrect __rseq_abi.cpu_id field value right after the first time
a newly created process issues sched_setaffinity.
For the records, it triggers after building glibc and running tests, and
then issuing:
  for x in {1..2000} ; do posix/tst-affinity-static & done
and shows up as:
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 2, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
error: Unexpected CPU 138, expected 0
This is caused by the scheduler invoking __set_task_cpu() directly from
sched_fork() and wake_up_new_task(), thus bypassing rseq_migrate() which
is done by set_task_cpu().
Add the missing rseq_migrate() to both functions. The only other direct
use of __set_task_cpu() is done by init_idle(), which does not involve a
user-space task.
Based on my testing with the glibc test-case, just adding rseq_migrate()
to wake_up_new_task() is sufficient to fix the observed issue. Also add
it to sched_fork() to keep things consistent.
The reason why this never triggered so far with the rseq/basic_test
selftest is unclear.
The current use of sched_getcpu(3) does not typically require it to be
always accurate. However, use of the __rseq_abi.cpu_id field within rseq
critical sections requires it to be accurate. If it is not accurate, it
can cause corruption in the per-cpu data targeted by rseq critical
sections in user-space.
Reported-By: Florian Weimer
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Tested-By: Florian Weimer
Cc: stable@vger.kernel.org # v4.18+
Link: https://lkml.kernel.org/r/20200707201505.2632-1-mathieu.desnoyers@efficios.com
Signed-off-by: Greg Kroah-Hartman
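The shape of the fix, sketched (simplified; see the upstream commit for the
exact context):
    /* in both sched_fork() and wake_up_new_task(), next to the
     * pre-existing direct call to __set_task_cpu(): */
    rseq_migrate(p);
    __set_task_cpu(p, cpu);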
16 Jul, 2020
1 commit
-
[ Upstream commit fd844ba9ae59b51e34e77105d79f8eca780b3bd6 ]
This function is concerned with the long-term CPU mask, not the
transitory mask the task might have while migrate disabled. Before
this patch, if a task was migrate-disabled at the time
__set_cpus_allowed_ptr() was called, and the new mask happened to be
equal to the CPU that the task was running on, then the mask update
would be lost.
Signed-off-by: Scott Wood
Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/20200617121742.cpxppyi7twxmpin7@linutronix.de
Signed-off-by: Sasha Levin
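A sketch of the one-line change in __set_cpus_allowed_ptr() (assuming the
surrounding code of that era; the upstream diff may differ in detail):
    /* compare against the stable long-term mask, not p->cpus_ptr,
     * which may temporarily point elsewhere while migrate-disabled */
    if (cpumask_equal(&p->cpus_mask, new_mask))
        goto out;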
09 Jul, 2020
1 commit
-
[ Upstream commit 9818427c6270a9ce8c52c8621026fe9cebae0f92 ]
Writing to the sysctl of a sched_domain->flags directly updates the value of
the field, and goes nowhere near update_top_cache_domain(). This means that
the cached domain pointers can end up containing stale data (e.g. the
domain pointed to doesn't have the relevant flag set anymore).
Explicit domain walks that check for flags will be affected by
the write, but this won't be in sync with the cached pointers which will
still point to the domains that were cached at the last sched_domain
build.
In other words, writing to this interface is playing a dangerous game. It
could be made to trigger an update of the cached sched_domain pointers when
written to, but this does not seem to be worth the trouble. Make it
read-only.
Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com
Signed-off-by: Sasha Levin
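A minimal sketch of the change, assuming the sd_alloc_ctl_domain_table()
helper of that era ('i' below stands for whichever slot holds the flags
entry):
    /* expose sched_domain flags read-only: mode 0444 instead of 0644 */
    set_table_entry(&table[i], "flags", &sd->flags, sizeof(int),
                    0444, proc_dointvec_minmax);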
01 Jul, 2020
2 commits
-
[ Upstream commit 740797ce3a124b7dd22b7fb832d87bc8fba1cf6f ]
syzbot reported the following warning:
WARNING: CPU: 1 PID: 6351 at kernel/sched/deadline.c:628
enqueue_task_dl+0x22da/0x38a0 kernel/sched/deadline.c:1504
At deadline.c:628 we have:
623 static inline void setup_new_dl_entity(struct sched_dl_entity *dl_se)
624 {
625 struct dl_rq *dl_rq = dl_rq_of_se(dl_se);
626 struct rq *rq = rq_of_dl_rq(dl_rq);
627
628 WARN_ON(dl_se->dl_boosted);
629 WARN_ON(dl_time_before(rq_clock(rq), dl_se->deadline));
[...]
}
Which means that setup_new_dl_entity() has been called on a task
currently boosted. This shouldn't happen though, as setup_new_dl_entity()
is only called when the 'dynamic' deadline of the new entity
is in the past w.r.t. rq_clock and boosted tasks shouldn't verify this
condition.
Digging through the PI code I noticed that what is described above might in
fact happen if an RT task blocks on an rt_mutex held by a DEADLINE task. In
the first branch of the boosting conditions we check only whether a pi_task's
'dynamic' deadline is earlier than the mutex holder's, and in this case we
set the mutex holder to be dl_boosted. However, since RT 'dynamic' deadlines
are only initialized if such tasks get boosted at some point (or if they
become DEADLINE of course), in general RT 'dynamic' deadlines are usually
equal to 0 and this satisfies the aforementioned condition.
Fix it by checking that the potential donor task is actually (even if only
temporarily, because it is in turn boosted) running at DEADLINE priority
before using its 'dynamic' deadline value.
Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+119ba87189432ead09b4@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Reviewed-by: Daniel Bristot de Oliveira
Tested-by: Daniel Wagner
Link: https://lkml.kernel.org/r/20181119153201.GB2119@localhost.localdomain
Signed-off-by: Sasha Levin
-
[ Upstream commit ce9bc3b27f2a21a7969b41ffb04df8cf61bd1592 ]
syzbot reported the following warning triggered via SYSC_sched_setattr():
WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 setup_new_dl_entity /kernel/sched/deadline.c:594 [inline]
WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_dl_entity /kernel/sched/deadline.c:1370 [inline]
WARNING: CPU: 0 PID: 6973 at kernel/sched/deadline.c:593 enqueue_task_dl+0x1c17/0x2ba0 /kernel/sched/deadline.c:1441
This happens because the ->dl_boosted flag is currently not initialized by
__dl_clear_params() (unlike the other flags) and setup_new_dl_entity()
rightfully complains about it.
Initialize dl_boosted to 0.
Fixes: 2d3d891d3344 ("sched/deadline: Add SCHED_DEADLINE inheritance logic")
Reported-by: syzbot+5ac8bac25f95e8b221e7@syzkaller.appspotmail.com
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Tested-by: Daniel Wagner
Link: https://lkml.kernel.org/r/20200617072919.818409-1-juri.lelli@redhat.com
Signed-off-by: Sasha Levin
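The essence of the fix in __dl_clear_params(), sketched:
    dl_se->dl_throttled = 0;
    dl_se->dl_yielded = 0;
    dl_se->dl_non_contending = 0;
    dl_se->dl_overrun = 0;
    dl_se->dl_boosted = 0;  /* added: was left uninitialized before */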
22 Jun, 2020
3 commits
-
[ Upstream commit d505b8af58912ae1e1a211fabc9995b19bd40828 ]
When users write some huge number into cpu.cfs_quota_us or
cpu.rt_runtime_us, overflow might happen during to_ratio() shifts of
schedulable checks.
to_ratio() could be altered to avoid unnecessary internal overflow, but
min_cfs_quota_period is less than 1 << BW_SHIFT, so a cutoff would still
be needed. Set a cap MAX_BW for cfs_quota_us and rt_runtime_us to
prevent overflow.
Signed-off-by: Huaixin Chang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Link: https://lkml.kernel.org/r/20200425105248.60093-1-changhuaixin@linux.alibaba.com
Signed-off-by: Sasha Levin
-
[ Upstream commit bf2c59fce4074e55d622089b34be3a6bc95484fb ]
In the CPU-offline process, it calls mmdrop() after idle entry and the
subsequent call to cpuhp_report_idle_dead(). Once execution passes the
call to rcu_report_dead(), RCU is ignoring the CPU, which results in
lockdep complaining when mmdrop() uses RCU from either memcg or
debugobjects below.
Fix it by cleaning up the active_mm state from BP instead. Every arch
which has CONFIG_HOTPLUG_CPU should have already called idle_task_exit()
from AP. The only exception is parisc because it switches them to
&init_mm unconditionally (see smp_boot_one_cpu() and smp_cpu_init()),
but the patch will still work there because it calls mmgrab(&init_mm) in
smp_cpu_init() and then should call mmdrop(&init_mm) in finish_cpu().
WARNING: suspicious RCU usage
-----------------------------
kernel/workqueue.c:710 RCU or wq_pool_mutex should be held!
other info that might help us debug this:
RCU used illegally from offline CPU!
Call Trace:
dump_stack+0xf4/0x164 (unreliable)
lockdep_rcu_suspicious+0x140/0x164
get_work_pool+0x110/0x150
__queue_work+0x1bc/0xca0
queue_work_on+0x114/0x120
css_release+0x9c/0xc0
percpu_ref_put_many+0x204/0x230
free_pcp_prepare+0x264/0x570
free_unref_page+0x38/0xf0
__mmdrop+0x21c/0x2c0
idle_task_exit+0x170/0x1b0
pnv_smp_cpu_kill_self+0x38/0x2e0
cpu_die+0x48/0x64
arch_cpu_idle_dead+0x30/0x50
do_idle+0x2f4/0x470
cpu_startup_entry+0x38/0x40
start_secondary+0x7a8/0xa80
start_secondary_resume+0x10/0x14
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Qian Cai
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Michael Ellerman (powerpc)
Link: https://lkml.kernel.org/r/20200401214033.8448-1-cai@lca.pw
Signed-off-by: Sasha Levin
-
[ Upstream commit 5a6d6a6ccb5f48ca8cf7c6d64ff83fd9c7999390 ]
In order to prevent possible hardlockup of sched_cfs_period_timer()
loop, loop count is introduced to denote whether to scale quota and
period or not. However, scale is done between forwarding period timer
and refilling cfs bandwidth runtime, which means that period timer is
forwarded with old "period" while runtime is refilled with scaled
"quota".Move do_sched_cfs_period_timer() before scaling to solve this.
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Signed-off-by: Huaixin Chang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Reviewed-by: Phil Auld
Link: https://lkml.kernel.org/r/20200420024421.22442-3-changhuaixin@linux.alibaba.com
Signed-off-by: Sasha Levin
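A rough sketch of the reordering in sched_cfs_period_timer() (simplified
loop body; the upstream diff may differ in detail):
    overrun = hrtimer_forward_now(timer, cfs_b->period);
    if (!overrun)
        break;
    /* refill runtime with the current quota/period *before*
     * possibly scaling them, so refill and period stay in sync */
    idle = do_sched_cfs_period_timer(cfs_b, overrun, flags);
    if (++count > 3) {
        /* scale up quota and period here ... */
    }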
17 Jun, 2020
1 commit
-
[ Upstream commit 18f855e574d9799a0e7489f8ae6fd8447d0dd74a ]
Stefano reported a crash with using SQPOLL with io_uring:
BUG: kernel NULL pointer dereference, address: 00000000000003b0
CPU: 2 PID: 1307 Comm: io_uring-sq Not tainted 5.7.0-rc7 #11
RIP: 0010:task_numa_work+0x4f/0x2c0
Call Trace:
task_work_run+0x68/0xa0
io_sq_thread+0x252/0x3d0
kthread+0xf9/0x130
ret_from_fork+0x35/0x40
which is task_numa_work() oopsing on current->mm being NULL.
The task work is queued by task_tick_numa(), which checks if current->mm is
NULL at the time of the call. But this state isn't necessarily persistent,
if the kthread is using use_mm() to temporarily adopt the mm of a task.
Change the task_tick_numa() check to exclude kernel threads in general,
as it doesn't make sense to attempt to balance for kthreads anyway.
Reported-by: Stefano Garzarella
Signed-off-by: Jens Axboe
Signed-off-by: Ingo Molnar
Acked-by: Peter Zijlstra
Link: https://lore.kernel.org/r/865de121-8190-5d30-ece5-3b097dc74431@kernel.dk
Signed-off-by: Sasha Levin
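The gist of the fix in task_tick_numa(), sketched (the upstream diff may
differ slightly):
    /* bail out for kernel threads too, not just a NULL mm, since a
     * kthread can temporarily adopt an mm via use_mm() */
    if (!curr->mm || (curr->flags & (PF_EXITING | PF_KTHREAD)) ||
        work->next != work)
        return;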
27 May, 2020
3 commits
-
[ Upstream commit b34cb07dde7c2346dec73d053ce926aeaa087303 ]
sched/fair: Fix enqueue_task_fair warning some more
The recent patch, fe61468b2cb (sched/fair: Fix enqueue_task_fair warning)
did not fully resolve the issues with the rq->tmp_alone_branch !=
&rq->leaf_cfs_rq_list warning in enqueue_task_fair. There is a case where
the first for_each_sched_entity loop exits due to on_rq, having incompletely
updated the list. In this case the second for_each_sched_entity loop can
further modify se. The later code to fix up the list management fails to do
what is needed because se does not point to the sched_entity which broke out
of the first loop. The list is not fixed up because the throttled parent was
already added back to the list by a task enqueue in a parallel child hierarchy.
Address this by calling list_add_leaf_cfs_rq if there are throttled parents
while doing the second for_each_sched_entity loop.
Fixes: fe61468b2cb ("sched/fair: Fix enqueue_task_fair warning")
Suggested-by: Vincent Guittot
Signed-off-by: Phil Auld
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Reviewed-by: Vincent Guittot
Link: https://lkml.kernel.org/r/20200512135222.GC2201@lorien.usersys.redhat.com
Signed-off-by: Sasha Levin
-
[ Upstream commit 5ab297bab984310267734dfbcc8104566658ebef ]
Even when a cgroup is throttled, the group se of a child cgroup can still
be enqueued and its gse->on_rq stays true. When a task is enqueued on such
child, we still have to update the load_avg and increase
h_nr_running of the throttled cfs. Nevertheless, the 1st
for_each_sched_entity() loop is skipped because of gse->on_rq == true and the
2nd loop because the cfs is throttled whereas we have to update both
load_avg with the old h_nr_running and increase h_nr_running in such case.
The same sequence can happen during dequeue when se moves to parent before
breaking in the 1st loop.
Note that the update of load_avg will effectively happen only once in order
to sync up to the throttled time. Next call for updating load_avg will stop
early because the clock stays unchanged.
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Fixes: 6d4d22468dae ("sched/fair: Reorder enqueue/dequeue_task_fair path")
Link: https://lkml.kernel.org/r/20200306084208.12583-1-vincent.guittot@linaro.org
Signed-off-by: Sasha Levin
-
[ Upstream commit 6d4d22468dae3d8757af9f8b81b848a76ef4409d ]
The walk through the cgroup hierarchy during the enqueue/dequeue of a task
is split in 2 distinct parts for throttled cfs_rq without any added value
but making the code less readable.
Change the code ordering such that everything related to a cfs_rq
(throttled or not) will be done in the same loop.
In addition, the same steps ordering is used when updating a cfs_rq:
- update_load_avg
- update_cfs_group
- update *h_nr_running
This reordering enables the use of h_nr_running in the PELT algorithm.
No functional or performance changes are expected, and none have been
noticed during tests.
Signed-off-by: Vincent Guittot
Signed-off-by: Mel Gorman
Signed-off-by: Ingo Molnar
Reviewed-by: Dietmar Eggemann
Acked-by: Peter Zijlstra
Cc: Juri Lelli
Cc: Valentin Schneider
Cc: Phil Auld
Cc: Hillf Danton
Link: https://lore.kernel.org/r/20200224095223.13361-5-mgorman@techsingularity.net
Signed-off-by: Sasha Levin
02 May, 2020
1 commit
-
commit eaf5a92ebde5bca3bb2565616115bd6d579486cd upstream.
uclamp_fork() resets the uclamp values to their default when the
reset-on-fork flag is set. It also checks whether the task has a RT
policy, and sets its uclamp.min to 1024 accordingly. However, during
reset-on-fork, the task's policy is lowered to SCHED_NORMAL right after,
hence leading to an erroneous uclamp.min setting for the new task if it
was forked from RT.
Fix this by removing the unnecessary check on rt_task() in
uclamp_fork() as this doesn't make sense if the reset-on-fork flag is
set.
Fixes: 1a00d999971c ("sched/uclamp: Set default clamps for RT tasks")
Reported-by: Chitti Babu Theegala
Signed-off-by: Quentin Perret
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Patrick Bellasi
Reviewed-by: Dietmar Eggemann
Link: https://lkml.kernel.org/r/20200416085956.217587-1-qperret@google.com
Signed-off-by: Greg Kroah-Hartman
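A sketch of uclamp_fork() after the fix (simplified; the point is that the
rt_task() special case is simply gone):
    static void uclamp_fork(struct task_struct *p)
    {
        enum uclamp_id clamp_id;

        if (likely(!p->sched_reset_on_fork))
            return;

        /* reset to defaults unconditionally; the task's policy is
         * lowered to SCHED_NORMAL right after anyway */
        for_each_clamp_id(clamp_id) {
            uclamp_se_set(&p->uclamp_req[clamp_id],
                          uclamp_none(clamp_id), false);
        }
    }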
17 Apr, 2020
3 commits
-
commit 82e0516ce3a147365a5dd2a9bedd5ba43a18663d upstream.
A redundant "curr = rq->curr" was added; remove it.
Fixes: ebc0f83c78a2 ("timers/nohz: Update NOHZ load in remote tick")
Signed-off-by: Scott Wood
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/1580776558-12882-1-git-send-email-swood@redhat.com
Cc: Guenter Roeck
Signed-off-by: Greg Kroah-Hartman
-
commit fe61468b2cbc2b7ce5f8d3bf32ae5001d4c434e9 upstream.
When a cfs rq is throttled, the latter and its child are removed from the
leaf list but their nr_running is not changed which includes staying higher
than 1. When a task is enqueued in this throttled branch, the cfs rqs must
be added back in order to ensure correct ordering in the list but this can
only happen if nr_running == 1.
When cfs bandwidth is used, we unconditionally call list_add_leaf_cfs_rq()
when enqueuing an entity to make sure that the complete branch will be
added.
Similarly unthrottle_cfs_rq() can stop adding cfs rqs to the list when a
parent is throttled. Iterate over the remaining entities to ensure that the
complete branch will be added to the list.
Reported-by: Christian Borntraeger
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Tested-by: Christian Borntraeger
Tested-by: Dietmar Eggemann
Cc: stable@vger.kernel.org
Cc: stable@vger.kernel.org #v5.1+
Link: https://lkml.kernel.org/r/20200306135257.25044-1-vincent.guittot@linaro.org
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 26cf52229efc87e2effa9d788f9b33c40fb3358a ]
During our testing, we found a case where shares were no longer working
correctly; the cgroup topology is like:
/sys/fs/cgroup/cpu/A (shares=102400)
/sys/fs/cgroup/cpu/A/B (shares=2)
/sys/fs/cgroup/cpu/A/B/C (shares=1024)
/sys/fs/cgroup/cpu/D (shares=1024)
/sys/fs/cgroup/cpu/D/E (shares=1024)
/sys/fs/cgroup/cpu/D/E/F (shares=1024)
The same benchmark is running in groups C & F, no other tasks are
running, and the benchmark is capable of consuming all the CPUs.
We would expect group C to win more CPU resources, since it could
enjoy all the shares of group A, but it's F who wins much more.
The reason is that we have group B with shares of 2; since
A->cfs_rq.load.weight == B->se.load.weight == B->shares/nr_cpus,
A->cfs_rq.load.weight becomes very small.
And in calc_group_shares() we calculate shares as:
load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
shares = (tg_shares * load) / tg_weight;
Since 'cfs_rq->load.weight' is too small, the load becomes 0 after
scale-down; although 'tg_shares' is 102400, the shares of the se
which stands for group A on the root cfs_rq become 2.
Meanwhile the se of D on the root cfs_rq is far bigger than 2, so it
wins the battle.
Thus when scale_load_down() scales the real weight down to 0, it's no
longer telling the real story; the caller will have the wrong
information and the calculation will be buggy.
This patch adds a check in scale_load_down(), so the real weight will
be >= MIN_SHARES after scaling; after applying it, group C wins as
expected.
Suggested-by: Peter Zijlstra
Signed-off-by: Michael Wang
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Vincent Guittot
Link: https://lkml.kernel.org/r/38e8e212-59a1-64b2-b247-b6d0b52d8dc1@linux.alibaba.com
Signed-off-by: Sasha Levin
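The fix, roughly (kernel/sched/sched.h; the upstream macro may differ
cosmetically):
    # define scale_load_down(w)                                     \
    ({                                                              \
        unsigned long __w = (w);                                    \
        if (__w)                                                    \
            __w = max(2UL, __w >> SCHED_FIXEDPOINT_SHIFT);          \
        __w;                                                        \
    })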
05 Mar, 2020
4 commits
-
commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.
select_idle_cpu() will scan the LLC domain for idle CPUs; it's always
expensive. So commit:
1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
introduced a way to limit how many CPUs we scan.
But it consumes some of the 'nr' attempts on CPUs that are not allowed
for the task, thus wasting them. The function may then always return
nr_cpumask_bits without finding a CPU on which our task is allowed to run.
Fix this by scanning only allowed CPUs. The cpumask may be too big to put
on the stack; similar to select_idle_core(), use the per-CPU
'select_idle_mask' to prevent stack overflow.
Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Srikar Dronamraju
Reviewed-by: Vincent Guittot
Reviewed-by: Valentin Schneider
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]
When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.
We have the following sequence:
sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.
Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let's trigger it
immediately instead.
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Acked-by: Ben Segall
Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Sasha Levin
-
[ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz. This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.
Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.
[ swood: Added changelog and removed recheck of stopped tick. ]
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Scott Wood
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin
-
[ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]
This will be used in the next patch to get a loadavg update from
nohz cpus. The delta check is skipped because idle_sched_class
doesn't update se.exec_start.
Signed-off-by: Scott Wood
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin
29 Feb, 2020
1 commit
-
commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.
Signed-off-by: Suren Baghdasaryan
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Acked-by: Johannes Weiner
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
Signed-off-by: Greg Kroah-Hartman
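The shape of the fix in psi_write(), sketched (variable names assumed from
the surrounding code):
    if (!nbytes)
        return -EINVAL;

    buf_size = min(nbytes, sizeof(buf));
    if (copy_from_user(buf, user_buf, buf_size))
        return -EFAULT;

    buf[buf_size - 1] = '\0';  /* safe now that buf_size >= 1 */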
24 Feb, 2020
2 commits
-
[ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]
topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.
In his example (8 CPUs, 2-node system), we end up with:
MC span for CPU3 == 3-7
MC span for CPU4 == 4-7
The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:
3 -> 4 -> 5 -> 6 -> 7
^ /
`----------------'
And a later pass through get_group(4, sdd@MC) will "corrupt" that to:
3 -> 4 -> 5 -> 6 -> 7
^ /
`-----------'
which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.
There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.
Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.
Testing
=======
Dietmar managed to reproduce this using the following qemu incantation:
$ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
-append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1
alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):
  	if (cpuid_topo->package_id != cpu_topo->package_id)
  		continue;
  +	if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
  +		continue;
  +
  	cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
  	cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);
Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
Signed-off-by: Sasha Levin
-
[ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.
Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
Signed-off-by: Li Guanglei
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Qais Yousef
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
Signed-off-by: Sasha Levin
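A sketch of the fix in init_uclamp() (the upstream diff may differ in
detail):
    /* rq->uclamp is an array indexed by clamp_id; clear all of it,
     * not just the first struct uclamp_rq */
    for_each_possible_cpu(cpu)
        memset(&cpu_rq(cpu)->uclamp, 0,
               sizeof(struct uclamp_rq) * UCLAMP_CNT);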
20 Feb, 2020
1 commit
-
commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison
7301 if (req.percent > UCLAMP_PERCENT_SCALE) {
7302 req.ret = -ERANGE;
7303 return req;
7304 }
# echo -1 > cpu.uclamp.min
# cat cpu.uclamp.min
42949671.96
Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().
# echo -1 > cpu.uclamp.min
sh: write error: Numerical result out of range
Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
Signed-off-by: Greg Kroah-Hartman
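The one-line fix in capacity_from_percent(), sketched:
    /* force an unsigned comparison; req.percent is signed here */
    if ((u64)req.percent > UCLAMP_PERCENT_SCALE) {
        req.ret = -ERANGE;
        return req;
    }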
15 Feb, 2020
1 commit
-
commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
updates them to the most restrictive values.
Fix it by ensuring cpu_util_update_eff() is called when a new cgroup
becomes online.
Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to be clamped to max, hence causing the
system to run at max frequency.
The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.
By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.
Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Doug Smythies
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
Signed-off-by: Greg Kroah-Hartman
26 Jan, 2020
2 commits
-
[ Upstream commit bef69dd87828ef5d8ecdab8d857cd3a33cf98675 ]
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limitation.
When a task is attached on a CPU, we have this call path:
update_load_avg()
update_cfs_rq_load_avg()
cfs_rq_util_change --> trig frequency update
attach_entity_load_avg()
cfs_rq_util_change --> trig frequency update
The 1st frequency update will not take into account the utilization of the
newly attached task and the 2nd one might be discarded because of rate
limitation of the cpufreq driver.
update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg() so we can move the call to
cfs_rq_util_change/cpufreq_update_util() into these two functions.
It's also interesting to note that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.
This change will also ensure that cpufreq_update_util() is called even
when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
IRQ, RT or DL PELT signals.
[ mingo: Minor updates. ]
Reported-by: Doug Smythies
Tested-by: Doug Smythies
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Acked-by: Rafael J. Wysocki
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: juri.lelli@redhat.com
Cc: linux-pm@vger.kernel.org
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Cc: sargun@sargun.me
Cc: srinivas.pandruvada@linux.intel.com
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
-
commit a0e813f26ebcb25c0b5e504498fbd796cca1a4ba upstream.
It turns out there really is something special to the first
set_next_task() invocation. Specifically, the 'change' pattern really
should not cause balance callbacks.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
12 Jan, 2020
2 commits
-
[ Upstream commit c3466952ca1514158d7c16c9cfc48c27d5c5dc0f ]
The psi window size is a u64 and can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.
Use div64_u64.
Signed-off-by: Johannes Weiner
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Suren Baghdasaryan
Cc: Jingfeng Xie
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
Signed-off-by: Sasha Levin
-
[ Upstream commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 ]
Jingfeng reports rare div0 crashes in psi on systems with some uptime:
[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330
The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.
We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.
The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.
Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.
Reported-by: Jingfeng Xie
Signed-off-by: Johannes Weiner
Signed-off-by: Peter Zijlstra (Intel)
Cc: Suren Baghdasaryan
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
Signed-off-by: Sasha Levin
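The essence of the fix in group_init(), sketched (the upstream diff may
differ slightly):
    /* anchor the first sampling period at cgroup creation time,
     * instead of leaving avg_last_update at 0 */
    group->avg_last_update = sched_clock();
    group->avg_next_update = group->avg_last_update + psi_period;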
31 Dec, 2019
2 commits
-
commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream.
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.
Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.
If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point. Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever. In particular, a system-wide
deadlock may occur during CPU online as a result of that.
The failing scenario is as follows. CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu). It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline. The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3. When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.
To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.
Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.
Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.
Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang
Tested-by: Anson Huang
Cc: 4.14+ # 4.14+
Signed-off-by: Rafael J. Wysocki
Acked-by: Viresh Kumar
Tested-by: Peng Fan (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman
-
[ Upstream commit 7763baace1b738d65efa46d68326c9406311c6bf ]
Some uclamp helpers had their return type changed from 'unsigned int' to
'enum uclamp_id' by commit:
0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
range, which should really be unsigned int. The affected helpers are
uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.
Note that this doesn't lead to any obj diff using a relatively recent
aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
anything funny. I'm still marking this as fixing the above commit to be on
the safe side.
Signed-off-by: Valentin Schneider
Reviewed-by: Qais Yousef
Acked-by: Vincent Guittot
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: patrick.bellasi@matbug.net
Cc: qperret@google.com
Cc: surenb@google.com
Cc: tj@kernel.org
Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
15 Nov, 2019
1 commit
-
uclamp_update_active() should perform the update when
p->uclamp[clamp_id].active is true. But when the logic was inverted in
[1], the if condition wasn't inverted correctly too.
[1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/
Reported-by: Suren Baghdasaryan
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Vincent Guittot
Cc: Ben Segall
Cc: Dietmar Eggemann
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Patrick Bellasi
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar
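The corrected condition in uclamp_update_active(), sketched (the upstream
diff may differ in detail):
    /* update only if the task is actually refcounted on the rq */
    if (p->uclamp[clamp_id].active) {
        uclamp_rq_dec_id(rq, p, clamp_id);
        uclamp_rq_inc_id(rq, p, clamp_id);
    }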
13 Nov, 2019
2 commits
-
update_cfs_rq_load_avg() can call cpufreq_update_util() to trigger an
update of the frequency. Make sure that RT, DL and IRQ PELT signals have
been updated before calling cpufreq.
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggemann@arm.com
Cc: dsmythies@telus.net
Cc: juri.lelli@redhat.com
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
Fixes: 91c27493e78d ("sched/irq: Add IRQ utilization tracking")
Link: https://lkml.kernel.org/r/1572434309-32512-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
-
While seemingly harmless, __sched_fork() does hrtimer_init(), which,
when DEBUG_OBJECTS is enabled, can end up doing allocations.
This then results in the following lock order:
rq->lock
zone->lock.rlock
batched_entropy_u64.lock
Which in turn causes deadlocks when we do wakeups while holding that
batched_entropy lock -- as the random code does.
Solve this by moving __sched_fork() out from under rq->lock. This is
safe because nothing there relies on rq->lock, as also evident from the
other __sched_fork() callsite.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Qian Cai
Cc: Thomas Gleixner
Cc: akpm@linux-foundation.org
Cc: bigeasy@linutronix.de
Cc: cl@linux.com
Cc: keescook@chromium.org
Cc: penberg@kernel.org
Cc: rientjes@google.com
Cc: thgarnie@google.com
Cc: tytso@mit.edu
Cc: will@kernel.org
Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy")
Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
09 Nov, 2019
2 commits
-
Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertly introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().
The comments about dropping rq->lock, in for example
newidle_balance(), only mention the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):
queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
running = task_current(rq, p); /* rq->curr == p */
if (queued)
dequeue_task(...);
if (running)
put_prev_task(...);
/* change task properties */
if (queued)
enqueue_task(...);
if (running)
set_next_task(...);
It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:
prev->sched_class->put_prev_task(rq, prev, rf);
if (!rq->nr_running)
newidle_balance(rq, rf);
The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.
Furthermore, it turns out we lost the RT-pull when we put the last DL
task.
Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().
Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Quentin Perret
Tested-by: Valentin Schneider
-
When cgroup is disabled the following compilation error was hit
kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
struct css_task_iter it;
^~
kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
css_task_iter_start(css, 0, &it);
^~~~~~~~~~~~~~~~~~~
__sg_page_iter_start
kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
while ((p = css_task_iter_next(&it))) {
^~~~~~~~~~~~~~~~~~
__sg_page_iter_next
kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
css_task_iter_end(&it);
^~~~~~~~~~~~~~~~~
get_task_cred
kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
struct css_task_iter it;
^~
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/core.o] Error 1
Fix by protecting uclamp_update_active_tasks() with
CONFIG_UCLAMP_TASK_GROUP.
Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Reported-by: Randy Dunlap
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Randy Dunlap
Cc: Steven Rostedt
Cc: Ingo Molnar
Cc: Vincent Guittot
Cc: Patrick Bellasi
Cc: Mel Gorman
Cc: Dietmar Eggemann
Cc: Juri Lelli
Cc: Ben Segall
Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com
29 Oct, 2019
2 commits
-
While the static key is correctly initialized as being disabled, it will
remain forever enabled once turned on. This means that if we start with an
asymmetric system and hotplug out enough CPUs to end up with an SMP system,
the static key will remain set - which is obviously wrong. We should detect
this and turn off things like misfit migration and capacity aware wakeups.As Quentin pointed out, having separate root domains makes this slightly
trickier. We could have exclusive cpusets that create an SMP island - IOW,
the domains within this root domain will not see any asymmetry. This means
we can't just disable the key on domain destruction, we need to count how
many asymmetric root domains we have.
Consider the following example using Juno r0 which is 2+4 big.LITTLE, where
two identical cpusets are created: they both span both big and LITTLE CPUs:
asym0 asym1
[ ][ ]
L L B L L B
$ cgcreate -g cpuset:asym0
$ cgset -r cpuset.cpus=0,1,3 asym0
$ cgset -r cpuset.mems=0 asym0
$ cgset -r cpuset.cpu_exclusive=1 asym0
$ cgcreate -g cpuset:asym1
$ cgset -r cpuset.cpus=2,4,5 asym1
$ cgset -r cpuset.mems=0 asym1
$ cgset -r cpuset.cpu_exclusive=1 asym1
$ cgset -r cpuset.sched_load_balance=0 .
(the CPU numbering may look odd because on the Juno LITTLEs are CPUs 0,3-5
and bigs are CPUs 1-2)
If we make one of those SMP (IOW remove asymmetry) by e.g. hotplugging its
big core, we would end up with an SMP cpuset and an asymmetric cpuset - the
static key must remain set, because we still have one asymmetric root domain.
With the above example, this could be done with:
$ echo 0 > /sys/devices/system/cpu/cpu2/online
Which would result in:
asym0 asym1
[ ][ ]
L L B L L
When both SMP and asymmetric cpusets are present, all CPUs will observe
sched_asym_cpucapacity being set (it is system-wide), but not all CPUs
observe asymmetry in their sched domain hierarchy:
per_cpu(sd_asym_cpucapacity, ) ==
per_cpu(sd_asym_cpucapacity, ) == NULL
Change the simple key enablement to an increment, and decrement the key
counter when destroying domains that cover asymmetric CPUs.
Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: hannes@cmpxchg.org
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: qperret@google.com
Cc: tj@kernel.org
Cc: vincent.guittot@linaro.org
Fixes: df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
Link: https://lkml.kernel.org/r/20191023153745.19515-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar
-
Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
cpuset code feeding empty cpumasks to the sched domain rebuild machinery.
This leads to the following splat:
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
Hardware name: ARM Juno development board (r0) (DT)
Workqueue: events cpuset_hotplug_workfn
pstate: 60000005 (nZCv daif -PAN -UAO)
pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
lr : build_sched_domains (kernel/sched/topology.c:1966)
Call trace:
build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
partition_sched_domains_locked (kernel/sched/topology.c:2250)
rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
kthread (kernel/kthread.c:255)
ret_from_fork (arch/arm64/kernel/entry.S:1167)
Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)
The faulty line in question is:
cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));
and we're not checking the return value against nr_cpu_ids (we shouldn't
have to!), which leads to the above.
Prevent generate_sched_domains() from returning empty cpumasks, and add
some assertion in build_sched_domains() to scream bloody murder if it
happens again.
The above splat was obtained on my Juno r0 with the following reproducer:
$ cgcreate -g cpuset:asym
$ cgset -r cpuset.cpus=0-3 asym
$ cgset -r cpuset.mems=0 asym
$ cgset -r cpuset.cpu_exclusive=1 asym
$ cgcreate -g cpuset:smp
$ cgset -r cpuset.cpus=4-5 smp
$ cgset -r cpuset.mems=0 smp
$ cgset -r cpuset.cpu_exclusive=1 smp
$ cgset -r cpuset.sched_load_balance=0 .
$ echo 0 > /sys/devices/system/cpu/cpu4/online
$ echo 0 > /sys/devices/system/cpu/cpu5/online
Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: hannes@cmpxchg.org
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: qperret@google.com
Cc: tj@kernel.org
Cc: vincent.guittot@linaro.org
Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar
13 Oct, 2019
1 commit
-
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth
quota precision fix"* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/vtime: Fix guest/system mis-accounting on task switch
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision