05 Mar, 2020
4 commits
-
commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.
select_idle_cpu() will scan the LLC domain for idle CPUs,
it's always expensive. so the next commit :1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
introduces a way to limit how many CPUs we scan.
But it consume some CPUs out of 'nr' that are not allowed
for the task and thus waste our attempts. The function
always return nr_cpumask_bits, and we can't find a CPU
which our task is allowed to run.Cpumask may be too big, similar to select_idle_core(), use
per_cpu_ptr 'select_idle_mask' to prevent stack overflow.Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
Signed-off-by: Cheng Jian
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Srikar Dronamraju
Reviewed-by: Vincent Guittot
Reviewed-by: Valentin Schneider
Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]
When a running task is moved on a throttled task group and there is no
other task enqueued on the CPU, the task can keep running using 100% CPU
whatever the allocated bandwidth for the group and although its cfs rq is
throttled. Furthermore, the group entity of the cfs_rq and its parents are
not enqueued but only set as curr on their respective cfs_rqs.We have the following sequence:
sched_move_task
-dequeue_task: dequeue task and group_entities.
-put_prev_task: put task and group entities.
-sched_change_group: move task to new group.
-enqueue_task: enqueue only task but not group entities because cfs_rq is
throttled.
-set_next_task : set task and group_entities as current sched_entity of
their cfs_rq.Another impact is that the root cfs_rq runnable_load_avg at root rq stays
null because the group_entities are not enqueued. This situation will stay
the same until an "external" event triggers a reschedule. Let trigger it
immediately instead.Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Acked-by: Ben Segall
Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Sasha Levin -
[ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]
The way loadavg is tracked during nohz only pays attention to the load
upon entering nohz. This can be particularly noticeable if full nohz is
entered while non-idle, and then the cpu goes idle and stays that way for
a long time.Use the remote tick to ensure that full nohz cpus report their deltas
within a reasonable time.[ swood: Added changelog and removed recheck of stopped tick. ]
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Scott Wood
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin -
[ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]
This will be used in the next patch to get a loadavg update from
nohz cpus. The delta check is skipped because idle_sched_class
doesn't update se.exec_start.Signed-off-by: Scott Wood
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
Signed-off-by: Sasha Levin
29 Feb, 2020
1 commit
-
commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.
Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.Signed-off-by: Suren Baghdasaryan
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Acked-by: Johannes Weiner
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
Signed-off-by: Greg Kroah-Hartman
24 Feb, 2020
2 commits
-
[ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]
topology.c::get_group() relies on the assumption that non-NUMA domains do
not partially overlap. Zeng Tao pointed out in [1] that such topology
descriptions, while completely bogus, can end up being exposed to the
scheduler.In his example (8 CPUs, 2-node system), we end up with:
MC span for CPU3 == 3-7
MC span for CPU4 == 4-7The first pass through get_group(3, sdd@MC) will result in the following
sched_group list:3 -> 4 -> 5 -> 6 -> 7
^ /
`----------------'And a later pass through get_group(4, sdd@MC) will "corrupt" that to:
3 -> 4 -> 5 -> 6 -> 7
^ /
`-----------'which will completely break things like 'while (sg != sd->groups)' when
using CPU3's base sched_domain.There already are some architecture-specific checks in place such as
x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
in the core scheduler, so it seems worthwhile to do so.Warn and abort the construction of the sched domains if such a broken
topology description is detected. Note that this is somewhat
expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
gated under SCHED_DEBUG if deemed necessary.Testing
=======Dietmar managed to reproduce this using the following qemu incantation:
$ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
-append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
needed if '-smp cores=X, sockets=Y' would work with qemu):8package_id != cpu_topo->package_id)
continue;+ if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
+ continue;
+
cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);8
Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
Signed-off-by: Sasha Levin -
[ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]
rq::uclamp is an array of struct uclamp_rq, make sure we clear the
whole thing.Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcountinga")
Signed-off-by: Li Guanglei
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Qais Yousef
Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
Signed-off-by: Sasha Levin
20 Feb, 2020
1 commit
-
commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.
The check to ensure that the new written value into cpu.uclamp.{min,max}
is within range, [0:100], wasn't working because of the signed
comparison7301 if (req.percent > UCLAMP_PERCENT_SCALE) {
7302 req.ret = -ERANGE;
7303 return req;
7304 }# echo -1 > cpu.uclamp.min
# cat cpu.uclamp.min
42949671.96Cast req.percent into u64 to force the comparison to be unsigned and
work as intended in capacity_from_percent().# echo -1 > cpu.uclamp.min
sh: write error: Numerical result out of rangeFixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Ingo Molnar
Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
Signed-off-by: Greg Kroah-Hartman
15 Feb, 2020
1 commit
-
commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.
When a new cgroup is created, the effective uclamp value wasn't updated
with a call to cpu_util_update_eff() that looks at the hierarchy and
update to the most restrictive values.Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
becomes online.Without this change, the newly created cgroup uses the default
root_task_group uclamp values, which is 1024 for both uclamp_{min, max},
which will cause the rq to to be clamped to max, hence cause the
system to run at max frequency.The problem was observed on Ubuntu server and was reproduced on Debian
and Buildroot rootfs.By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
and add all tasks to it - which creates enough noise to keep the rq
uclamp value at max most of the time. Imitating this behavior makes the
problem visible in Buildroot too which otherwise looks fine since it's a
minimal userspace.Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
Reported-by: Doug Smythies
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Doug Smythies
Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
Signed-off-by: Greg Kroah-Hartman
26 Jan, 2020
2 commits
-
[ Upstream commit bef69dd87828ef5d8ecdab8d857cd3a33cf98675 ]
update_cfs_rq_load_avg() calls cfs_rq_util_change() every time PELT decays,
which might be inefficient when the cpufreq driver has rate limitation.When a task is attached on a CPU, we have this call path:
update_load_avg()
update_cfs_rq_load_avg()
cfs_rq_util_change -- > trig frequency update
attach_entity_load_avg()
cfs_rq_util_change -- > trig frequency updateThe 1st frequency update will not take into account the utilization of the
newly attached task and the 2nd one might be discarded because of rate
limitation of the cpufreq driver.update_cfs_rq_load_avg() is only called by update_blocked_averages()
and update_load_avg() so we can move the call to
cfs_rq_util_change/cpufreq_update_util() into these two functions.It's also interesting to note that update_load_avg() already calls
cfs_rq_util_change() directly for the !SMP case.This change will also ensure that cpufreq_update_util() is called even
when there is no more CFS rq in the leaf_cfs_rq_list to update, but only
IRQ, RT or DL PELT signals.[ mingo: Minor updates. ]
Reported-by: Doug Smythies
Tested-by: Doug Smythies
Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Acked-by: Rafael J. Wysocki
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: juri.lelli@redhat.com
Cc: linux-pm@vger.kernel.org
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Cc: sargun@sargun.me
Cc: srinivas.pandruvada@linux.intel.com
Cc: tj@kernel.org
Cc: xiexiuqi@huawei.com
Cc: xiezhipeng1@huawei.com
Fixes: 039ae8bcf7a5 ("sched/fair: Fix O(nr_cgroups) in the load balancing path")
Link: https://lkml.kernel.org/r/1574083279-799-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin -
commit a0e813f26ebcb25c0b5e504498fbd796cca1a4ba upstream.
It turns out there really is something special to the first
set_next_task() invocation. In specific the 'change' pattern really
should not cause balance callbacks.Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: bsegall@google.com
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: ktkhai@virtuozzo.com
Cc: mgorman@suse.de
Cc: qais.yousef@arm.com
Cc: qperret@google.com
Cc: rostedt@goodmis.org
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: f95d4eaee6d0 ("sched/{rt,deadline}: Fix set_next_task vs pick_next_task")
Link: https://lkml.kernel.org/r/20191108131909.775434698@infradead.org
Signed-off-by: Ingo Molnar
Signed-off-by: Greg Kroah-Hartman
12 Jan, 2020
2 commits
-
[ Upstream commit c3466952ca1514158d7c16c9cfc48c27d5c5dc0f ]
The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.Use div64_u64.
Signed-off-by: Johannes Weiner
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Suren Baghdasaryan
Cc: Jingfeng Xie
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org
Signed-off-by: Sasha Levin -
[ Upstream commit 3dfbe25c27eab7c90c8a7e97b4c354a9d24dd985 ]
Jingfeng reports rare div0 crashes in psi on systems with some uptime:
[58914.066423] divide error: 0000 [#1] SMP
[58914.070416] Modules linked in: ipmi_poweroff ipmi_watchdog toa overlay fuse tcp_diag inet_diag binfmt_misc aisqos(O) aisqos_hotfixes(O)
[58914.083158] CPU: 94 PID: 140364 Comm: kworker/94:2 Tainted: G W OE K 4.9.151-015.ali3000.alios7.x86_64 #1
[58914.093722] Hardware name: Alibaba Alibaba Cloud ECS/Alibaba Cloud ECS, BIOS 3.23.34 02/14/2019
[58914.102728] Workqueue: events psi_update_work
[58914.107258] task: ffff8879da83c280 task.stack: ffffc90059dcc000
[58914.113336] RIP: 0010:[] [] psi_update_stats+0x1c1/0x330
[58914.122183] RSP: 0018:ffffc90059dcfd60 EFLAGS: 00010246
[58914.127650] RAX: 0000000000000000 RBX: ffff8858fe98be50 RCX: 000000007744d640
[58914.134947] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00003594f700648e
[58914.142243] RBP: ffffc90059dcfdf8 R08: 0000359500000000 R09: 0000000000000000
[58914.149538] R10: 0000000000000000 R11: 0000000000000000 R12: 0000359500000000
[58914.156837] R13: 0000000000000000 R14: 0000000000000000 R15: ffff8858fe98bd78
[58914.164136] FS: 0000000000000000(0000) GS:ffff887f7f380000(0000) knlGS:0000000000000000
[58914.172529] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[58914.178467] CR2: 00007f2240452090 CR3: 0000005d5d258000 CR4: 00000000007606f0
[58914.185765] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[58914.193061] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[58914.200360] PKRU: 55555554
[58914.203221] Stack:
[58914.205383] ffff8858fe98bd48 00000000000002f0 0000002e81036d09 ffffc90059dcfde8
[58914.213168] ffff8858fe98bec8 0000000000000000 0000000000000000 0000000000000000
[58914.220951] 0000000000000000 0000000000000000 0000000000000000 0000000000000000
[58914.228734] Call Trace:
[58914.231337] [] psi_update_work+0x22/0x60
[58914.237067] [] process_one_work+0x189/0x420
[58914.243063] [] worker_thread+0x4e/0x4b0
[58914.248701] [] ? process_one_work+0x420/0x420
[58914.254869] [] kthread+0xe6/0x100
[58914.259994] [] ? kthread_park+0x60/0x60
[58914.265640] [] ret_from_fork+0x39/0x50
[58914.271193] Code: 41 29 c3 4d 39 dc 4d 0f 42 dc f7 f1 48 8b 13 48 89 c7 48 c1
[58914.279691] RIP [] psi_update_stats+0x1c1/0x330The crashing instruction is trying to divide the observed stall time
by the sampling period. The period, stored in R8, is not 0, but we are
dividing by the lower 32 bits only, which are all 0 in this instance.We could switch to a 64-bit division, but the period shouldn't be that
big in the first place. It's the time between the last update and the
next scheduled one, and so should always be around 2s and comfortably
fit into 32 bits.The bug is in the initialization of new cgroups: we schedule the first
sampling event in a cgroup as an offset of sched_clock(), but fail to
initialize the last_update timestamp, and it defaults to 0. That
results in a bogusly large sampling period the first time we run the
sampling code, and consequently we underreport pressure for the first
2s of a cgroup's life. But worse, if sched_clock() is sufficiently
advanced on the system, and the user gets unlucky, the period's lower
32 bits can all be 0 and the sampling division will crash.Fix this by initializing the last update timestamp to the creation
time of the cgroup, thus correctly marking the start of the first
pressure sampling period in a new cgroup.Reported-by: Jingfeng Xie
Signed-off-by: Johannes Weiner
Signed-off-by: Peter Zijlstra (Intel)
Cc: Suren Baghdasaryan
Link: https://lkml.kernel.org/r/20191203183524.41378-2-hannes@cmpxchg.org
Signed-off-by: Sasha Levin
31 Dec, 2019
2 commits
-
commit 85572c2c4a45a541e880e087b5b17a48198b2416 upstream.
The scheduler code calling cpufreq_update_util() may run during CPU
offline on the target CPU after the IRQ work lists have been flushed
for it, so the target CPU should be prevented from running code that
may queue up an IRQ work item on it at that point.Unfortunately, that may not be the case if dvfs_possible_from_any_cpu
is set for at least one cpufreq policy in the system, because that
allows the CPU going offline to run the utilization update callback
of the cpufreq governor on behalf of another (online) CPU in some
cases.If that happens, the cpufreq governor callback may queue up an IRQ
work on the CPU running it, which is going offline, and the IRQ work
may not be flushed after that point. Moreover, that IRQ work cannot
be flushed until the "offlining" CPU goes back online, so if any
other CPU calls irq_work_sync() to wait for the completion of that
IRQ work, it will have to wait until the "offlining" CPU is back
online and that may not happen forever. In particular, a system-wide
deadlock may occur during CPU online as a result of that.The failing scenario is as follows. CPU0 is the boot CPU, so it
creates a cpufreq policy and becomes the "leader" of it
(policy->cpu). It cannot go offline, because it is the boot CPU.
Next, other CPUs join the cpufreq policy as they go online and they
leave it when they go offline. The last CPU to go offline, say CPU3,
may queue up an IRQ work while running the governor callback on
behalf of CPU0 after leaving the cpufreq policy because of the
dvfs_possible_from_any_cpu effect described above. Then, CPU0 is
the only online CPU in the system and the stale IRQ work is still
queued on CPU3. When, say, CPU1 goes back online, it will run
irq_work_sync() to wait for that IRQ work to complete and so it
will wait for CPU3 to go back online (which may never happen even
in principle), but (worse yet) CPU0 is waiting for CPU1 at that
point too and a system-wide deadlock occurs.To address this problem notice that CPUs which cannot run cpufreq
utilization update code for themselves (for example, because they
have left the cpufreq policies that they belonged to), should also
be prevented from running that code on behalf of the other CPUs that
belong to a cpufreq policy with dvfs_possible_from_any_cpu set and so
in that case the cpufreq_update_util_data pointer of the CPU running
the code must not be NULL as well as for the CPU which is the target
of the cpufreq utilization update in progress.Accordingly, change cpufreq_this_cpu_can_update() into a regular
function in kernel/sched/cpufreq.c (instead of a static inline in a
header file) and make it check the cpufreq_update_util_data pointer
of the local CPU if dvfs_possible_from_any_cpu is set for the target
cpufreq policy.Also update the schedutil governor to do the
cpufreq_this_cpu_can_update() check in the non-fast-switch
case too to avoid the stale IRQ work issues.Fixes: 99d14d0e16fa ("cpufreq: Process remote callbacks from any CPU if the platform permits")
Link: https://lore.kernel.org/linux-pm/20191121093557.bycvdo4xyinbc5cb@vireshk-i7/
Reported-by: Anson Huang
Tested-by: Anson Huang
Cc: 4.14+ # 4.14+
Signed-off-by: Rafael J. Wysocki
Acked-by: Viresh Kumar
Tested-by: Peng Fan (i.MX8QXP-MEK)
Signed-off-by: Rafael J. Wysocki
Signed-off-by: Greg Kroah-Hartman -
[ Upstream commit 7763baace1b738d65efa46d68326c9406311c6bf ]
Some uclamp helpers had their return type changed from 'unsigned int' to
'enum uclamp_id' by commit0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
but it happens that some do return a value in the [0, SCHED_CAPACITY_SCALE]
range, which should really be unsigned int. The affected helpers are
uclamp_none(), uclamp_rq_max_value() and uclamp_eff_value(). Fix those up.Note that this doesn't lead to any obj diff using a relatively recent
aarch64 compiler (8.3-2019.03). The current code of e.g. uclamp_eff_value()
properly returns an 11 bit value (bits_per(1024)) and doesn't seem to do
anything funny. I'm still marking this as fixing the above commit to be on
the safe side.Signed-off-by: Valentin Schneider
Reviewed-by: Qais Yousef
Acked-by: Vincent Guittot
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: patrick.bellasi@matbug.net
Cc: qperret@google.com
Cc: surenb@google.com
Cc: tj@kernel.org
Fixes: 0413d7f33e60 ("sched/uclamp: Always use 'enum uclamp_id' for clamp_id values")
Link: https://lkml.kernel.org/r/20191115103908.27610-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
15 Nov, 2019
1 commit
-
uclamp_update_active() should perform the update when
p->uclamp[clamp_id].active is true. But when the logic was inverted in
[1], the if condition wasn't inverted correctly too.[1] https://lore.kernel.org/lkml/20190902073836.GO2369@hirez.programming.kicks-ass.net/
Reported-by: Suren Baghdasaryan
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Vincent Guittot
Cc: Ben Segall
Cc: Dietmar Eggemann
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Patrick Bellasi
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Fixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Link: https://lkml.kernel.org/r/20191114211052.15116-1-qais.yousef@arm.com
Signed-off-by: Ingo Molnar
13 Nov, 2019
2 commits
-
update_cfs_rq_load_avg() can call cpufreq_update_util() to trigger an
update of the frequency. Make sure that RT, DL and IRQ PELT signals have
been updated before calling cpufreq.Signed-off-by: Vincent Guittot
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggemann@arm.com
Cc: dsmythies@telus.net
Cc: juri.lelli@redhat.com
Cc: mgorman@suse.de
Cc: rostedt@goodmis.org
Fixes: 371bf4273269 ("sched/rt: Add rt_rq utilization tracking")
Fixes: 3727e0e16340 ("sched/dl: Add dl_rq utilization tracking")
Fixes: 91c27493e78d ("sched/irq: Add IRQ utilization tracking")
Link: https://lkml.kernel.org/r/1572434309-32512-1-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar -
While seemingly harmless, __sched_fork() does hrtimer_init(), which,
when DEBUG_OBJETS, can end up doing allocations.This then results in the following lock order:
rq->lock
zone->lock.rlock
batched_entropy_u64.lockWhich in turn causes deadlocks when we do wakeups while holding that
batched_entropy lock -- as the random code does.Solve this by moving __sched_fork() out from under rq->lock. This is
safe because nothing there relies on rq->lock, as also evident from the
other __sched_fork() callsite.Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Qian Cai
Cc: Thomas Gleixner
Cc: akpm@linux-foundation.org
Cc: bigeasy@linutronix.de
Cc: cl@linux.com
Cc: keescook@chromium.org
Cc: penberg@kernel.org
Cc: rientjes@google.com
Cc: thgarnie@google.com
Cc: tytso@mit.edu
Cc: will@kernel.org
Fixes: b7d5dc21072c ("random: add a spinlock_t to struct batched_entropy")
Link: https://lkml.kernel.org/r/20191001091837.GK4536@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
09 Nov, 2019
2 commits
-
Commit 67692435c411 ("sched: Rework pick_next_task() slow-path")
inadvertly introduced a race because it changed a previously
unexplored dependency between dropping the rq->lock and
sched_class::put_prev_task().The comments about dropping rq->lock, in for example
newidle_balance(), only mentions the task being current and ->on_cpu
being set. But when we look at the 'change' pattern (in for example
sched_setnuma()):queued = task_on_rq_queued(p); /* p->on_rq == TASK_ON_RQ_QUEUED */
running = task_current(rq, p); /* rq->curr == p */if (queued)
dequeue_task(...);
if (running)
put_prev_task(...);/* change task properties */
if (queued)
enqueue_task(...);
if (running)
set_next_task(...);It becomes obvious that if we do this after put_prev_task() has
already been called on @p, things go sideways. This is exactly what
the commit in question allows to happen when it does:prev->sched_class->put_prev_task(rq, prev, rf);
if (!rq->nr_running)
newidle_balance(rq, rf);The newidle_balance() call will drop rq->lock after we've called
put_prev_task() and that allows the above 'change' pattern to
interleave and mess up the state.Furthermore, it turns out we lost the RT-pull when we put the last DL
task.Fix both problems by extracting the balancing from put_prev_task() and
doing a multi-class balance() pass before put_prev_task().Fixes: 67692435c411 ("sched: Rework pick_next_task() slow-path")
Reported-by: Quentin Perret
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Quentin Perret
Tested-by: Valentin Schneider -
When cgroup is disabled the following compilation error was hit
kernel/sched/core.c: In function ‘uclamp_update_active_tasks’:
kernel/sched/core.c:1081:23: error: storage size of ‘it’ isn’t known
struct css_task_iter it;
^~
kernel/sched/core.c:1084:2: error: implicit declaration of function ‘css_task_iter_start’; did you mean ‘__sg_page_iter_start’? [-Werror=implicit-function-declaration]
css_task_iter_start(css, 0, &it);
^~~~~~~~~~~~~~~~~~~
__sg_page_iter_start
kernel/sched/core.c:1085:14: error: implicit declaration of function ‘css_task_iter_next’; did you mean ‘__sg_page_iter_next’? [-Werror=implicit-function-declaration]
while ((p = css_task_iter_next(&it))) {
^~~~~~~~~~~~~~~~~~
__sg_page_iter_next
kernel/sched/core.c:1091:2: error: implicit declaration of function ‘css_task_iter_end’; did you mean ‘get_task_cred’? [-Werror=implicit-function-declaration]
css_task_iter_end(&it);
^~~~~~~~~~~~~~~~~
get_task_cred
kernel/sched/core.c:1081:23: warning: unused variable ‘it’ [-Wunused-variable]
struct css_task_iter it;
^~
cc1: some warnings being treated as errors
make[2]: *** [kernel/sched/core.o] Error 1Fix by protetion uclamp_update_active_tasks() with
CONFIG_UCLAMP_TASK_GROUPFixes: babbe170e053 ("sched/uclamp: Update CPU's refcount on TG's clamp changes")
Reported-by: Randy Dunlap
Signed-off-by: Qais Yousef
Signed-off-by: Peter Zijlstra (Intel)
Tested-by: Randy Dunlap
Cc: Steven Rostedt
Cc: Ingo Molnar
Cc: Vincent Guittot
Cc: Patrick Bellasi
Cc: Mel Gorman
Cc: Dietmar Eggemann
Cc: Juri Lelli
Cc: Ben Segall
Link: https://lkml.kernel.org/r/20191105112212.596-1-qais.yousef@arm.com
29 Oct, 2019
2 commits
-
While the static key is correctly initialized as being disabled, it will
remain forever enabled once turned on. This means that if we start with an
asymmetric system and hotplug out enough CPUs to end up with an SMP system,
the static key will remain set - which is obviously wrong. We should detect
this and turn off things like misfit migration and capacity aware wakeups.As Quentin pointed out, having separate root domains makes this slightly
trickier. We could have exclusive cpusets that create an SMP island - IOW,
the domains within this root domain will not see any asymmetry. This means
we can't just disable the key on domain destruction, we need to count how
many asymmetric root domains we have.Consider the following example using Juno r0 which is 2+4 big.LITTLE, where
two identical cpusets are created: they both span both big and LITTLE CPUs:asym0 asym1
[ ][ ]
L L B L L B$ cgcreate -g cpuset:asym0
$ cgset -r cpuset.cpus=0,1,3 asym0
$ cgset -r cpuset.mems=0 asym0
$ cgset -r cpuset.cpu_exclusive=1 asym0$ cgcreate -g cpuset:asym1
$ cgset -r cpuset.cpus=2,4,5 asym1
$ cgset -r cpuset.mems=0 asym1
$ cgset -r cpuset.cpu_exclusive=1 asym1$ cgset -r cpuset.sched_load_balance=0 .
(the CPU numbering may look odd because on the Juno LITTLEs are CPUs 0,3-5
and bigs are CPUs 1-2)If we make one of those SMP (IOW remove asymmetry) by e.g. hotplugging its
big core, we would end up with an SMP cpuset and an asymmetric cpuset - the
static key must remain set, because we still have one asymmetric root domain.With the above example, this could be done with:
$ echo 0 > /sys/devices/system/cpu/cpu2/online
Which would result in:
asym0 asym1
[ ][ ]
L L B L LWhen both SMP and asymmetric cpusets are present, all CPUs will observe
sched_asym_cpucapacity being set (it is system-wide), but not all CPUs
observe asymmetry in their sched domain hierarchy:per_cpu(sd_asym_cpucapacity, ) ==
per_cpu(sd_asym_cpucapacity, ) == NULLChange the simple key enablement to an increment, and decrement the key
counter when destroying domains that cover asymmetric CPUs.Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Dietmar Eggemann
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: hannes@cmpxchg.org
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: qperret@google.com
Cc: tj@kernel.org
Cc: vincent.guittot@linaro.org
Fixes: df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")
Link: https://lkml.kernel.org/r/20191023153745.19515-3-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar -
Turns out hotplugging CPUs that are in exclusive cpusets can lead to the
cpuset code feeding empty cpumasks to the sched domain rebuild machinery.This leads to the following splat:
Internal error: Oops: 96000004 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 235 Comm: kworker/5:2 Not tainted 5.4.0-rc1-00005-g8d495477d62e #23
Hardware name: ARM Juno development board (r0) (DT)
Workqueue: events cpuset_hotplug_workfn
pstate: 60000005 (nZCv daif -PAN -UAO)
pc : build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
lr : build_sched_domains (kernel/sched/topology.c:1966)
Call trace:
build_sched_domains (./include/linux/arch_topology.h:23 kernel/sched/topology.c:1898 kernel/sched/topology.c:1969)
partition_sched_domains_locked (kernel/sched/topology.c:2250)
rebuild_sched_domains_locked (./include/linux/bitmap.h:370 ./include/linux/cpumask.h:538 kernel/cgroup/cpuset.c:955 kernel/cgroup/cpuset.c:978 kernel/cgroup/cpuset.c:1019)
rebuild_sched_domains (kernel/cgroup/cpuset.c:1032)
cpuset_hotplug_workfn (kernel/cgroup/cpuset.c:3205 (discriminator 2))
process_one_work (./arch/arm64/include/asm/jump_label.h:21 ./include/linux/jump_label.h:200 ./include/trace/events/workqueue.h:114 kernel/workqueue.c:2274)
worker_thread (./include/linux/compiler.h:199 ./include/linux/list.h:268 kernel/workqueue.c:2416)
kthread (kernel/kthread.c:255)
ret_from_fork (arch/arm64/kernel/entry.S:1167)
Code: f860dae2 912802d6 aa1603e1 12800000 (f8616853)The faulty line in question is:
cap = arch_scale_cpu_capacity(cpumask_first(cpu_map));
and we're not checking the return value against nr_cpu_ids (we shouldn't
have to!), which leads to the above.Prevent generate_sched_domains() from returning empty cpumasks, and add
some assertion in build_sched_domains() to scream bloody murder if it
happens again.The above splat was obtained on my Juno r0 with the following reproducer:
$ cgcreate -g cpuset:asym
$ cgset -r cpuset.cpus=0-3 asym
$ cgset -r cpuset.mems=0 asym
$ cgset -r cpuset.cpu_exclusive=1 asym$ cgcreate -g cpuset:smp
$ cgset -r cpuset.cpus=4-5 smp
$ cgset -r cpuset.mems=0 smp
$ cgset -r cpuset.cpu_exclusive=1 smp$ cgset -r cpuset.sched_load_balance=0 .
$ echo 0 > /sys/devices/system/cpu/cpu4/online
$ echo 0 > /sys/devices/system/cpu/cpu5/onlineSigned-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Cc: Dietmar.Eggemann@arm.com
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: hannes@cmpxchg.org
Cc: lizefan@huawei.com
Cc: morten.rasmussen@arm.com
Cc: qperret@google.com
Cc: tj@kernel.org
Cc: vincent.guittot@linaro.org
Fixes: 05484e098448 ("sched/topology: Add SD_ASYM_CPUCAPACITY flag detection")
Link: https://lkml.kernel.org/r/20191023153745.19515-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar
13 Oct, 2019
1 commit
-
Pull scheduler fixes from Ingo Molnar:
"Two fixes: a guest-cputime accounting fix, and a cgroup bandwidth
quota precision fix"* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/vtime: Fix guest/system mis-accounting on task switch
sched/fair: Scale bandwidth quota and period without losing quota/period ratio precision
09 Oct, 2019
2 commits
-
vtime_account_system() assumes that the target task to account cputime
to is always the current task. This is most often true indeed except on
task switch where we call:vtime_common_task_switch(prev)
vtime_account_system(prev)Here prev is the scheduling-out task where we account the cputime to. It
doesn't match current that is already the scheduling-in task at this
stage of the context switch.So we end up checking the wrong task flags to determine if we are
accounting guest or system time to the previous task.As a result the wrong task is used to check if the target is running in
guest mode. We may then spuriously account or leak either system or
guest time on task switch.Fix this assumption and also turn vtime_guest_enter/exit() to use the
task passed in parameter as well to avoid future similar issues.Signed-off-by: Frederic Weisbecker
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Cc: Wanpeng Li
Fixes: 2a42eb9594a1 ("sched/cputime: Accumulate vtime on top of nsec clocksource")
Link: https://lkml.kernel.org/r/20190925214242.21873-1-frederic@kernel.org
Signed-off-by: Ingo Molnar -
The quota/period ratio is used to ensure a child task group won't get
more bandwidth than the parent task group, and is calculated as:normalized_cfs_quota() = [(quota_us << 20) / period_us]
If the quota/period ratio was changed during this scaling due to
precision loss, it will cause inconsistency between parent and child
task groups.See below example:
A userspace container manager (kubelet) does three operations:
1) Create a parent cgroup, set quota to 1,000us and period to 10,000us.
2) Create a few children cgroups.
3) Set quota to 1,000us and period to 10,000us on a child cgroup.These operations are expected to succeed. However, if the scaling of
147/128 happens before step 3, quota and period of the parent cgroup
will be changed:new_quota: 1148437ns, 1148us
new_period: 11484375ns, 11484usAnd when step 3 comes in, the ratio of the child cgroup will be
104857, which will be larger than the parent cgroup ratio (104821),
and will fail.Scaling them by a factor of 2 will fix the problem.
Tested-by: Phil Auld
Signed-off-by: Xuewei Zhang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Phil Auld
Cc: Anton Blanchard
Cc: Ben Segall
Cc: Dietmar Eggemann
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mel Gorman
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Thomas Gleixner
Cc: Vincent Guittot
Fixes: 2e8e19226398 ("sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup")
Link: https://lkml.kernel.org/r/20191004001243.140897-1-xueweiz@google.com
Signed-off-by: Ingo Molnar
05 Oct, 2019
1 commit
-
…/kernel/git/brauner/linux
Pull copy_struct_from_user() helper from Christian Brauner:
"This contains the copy_struct_from_user() helper which got split out
from the openat2() patchset. It is a generic interface designed to
copy a struct from userspace.The helper will be especially useful for structs versioned by size of
which we have quite a few. This allows for backwards compatibility,
i.e. an extended struct can be passed to an older kernel, or a legacy
struct can be passed to a newer kernel. For the first case (extended
struct, older kernel) the new fields in an extended struct can be set
to zero and the struct safely passed to an older kernel.The most obvious benefit is that this helper lets us get rid of
duplicate code present in at least sched_setattr(), perf_event_open(),
and clone3(). More importantly it will also help to ensure that users
implementing versioning-by-size end up with the same core semantics.This point is especially crucial since we have at least one case where
versioning-by-size is used but with slighly different semantics:
sched_setattr(), perf_event_open(), and clone3() all do do similar
checks to copy_struct_from_user() while rt_sigprocmask(2) always
rejects differently-sized struct arguments.With this pull request we also switch over sched_setattr(),
perf_event_open(), and clone3() to use the new helper"* tag 'copy-struct-from-user-v5.4-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/brauner/linux:
usercopy: Add parentheses around assignment in test_copy_struct_from_user
perf_event_open: switch to copy_struct_from_user()
sched_setattr: switch to copy_struct_from_user()
clone3: switch to copy_struct_from_user()
lib: introduce copy_struct_from_user() helper
02 Oct, 2019
1 commit
-
The following commit:
227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
got fat fingered by me when merging it with other patches. It meant to move
the RCU section out of the for loop but ended up doing it partially, leaving
a superfluous rcu_read_lock() inside, causing havok.Reported-by: Ingo Molnar
Signed-off-by: Peter Zijlstra (Intel)
Cc: Borislav Petkov
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mathieu Desnoyers
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Cc: linux-tip-commits@vger.kernel.org
Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
Link: https://lkml.kernel.org/r/20191001085033.GP4519@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
01 Oct, 2019
1 commit
-
Switch sched_setattr() syscall from it's own copying struct sched_attr
from userspace to the new dedicated copy_struct_from_user() helper.The change is very straightforward, and helps unify the syscall
interface for struct-from-userspace syscalls. Ideally we could also
unify sched_getattr(2)-style syscalls as well, but unfortunately the
correct semantics for such syscalls are much less clear (see [1] for
more detail). In future we could come up with a more sane idea for how
the syscall interface should look.[1]: commit 1251201c0d34 ("sched/core: Fix uclamp ABI bug, clean up and
robustify sched_read_attr() ABI logic and code")Signed-off-by: Aleksa Sarai
Reviewed-by: Kees Cook
Reviewed-by: Christian Brauner
[christian.brauner@ubuntu.com: improve commit message]
Link: https://lore.kernel.org/r/20191001011055.19283-4-cyphar@cyphar.com
Signed-off-by: Christian Brauner
29 Sep, 2019
1 commit
-
Pull scheduler fixes from Ingo Molnar:
- Apply a number of membarrier related fixes and cleanups, which fixes
a use-after-free race in the membarrier code- Introduce proper RCU protection for tasks on the runqueue - to get
rid of the subtle task_rcu_dereference() interface that was easy to
get wrong- Misc fixes, but also an EAS speedup
* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
sched/fair: Avoid redundant EAS calculation
sched/core: Remove double update_max_interval() call on CPU startup
sched/core: Fix preempt_schedule() interrupt return comment
sched/fair: Fix -Wunused-but-set-variable warnings
sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
sched/membarrier: Skip IPIs when mm->mm_users == 1
selftests, sched/membarrier: Add multi-threaded test
sched/membarrier: Fix p->mm->membarrier_state racy load
sched/membarrier: Call sync_core only before usermode for same mm
sched/membarrier: Remove redundant check
sched/membarrier: Fix private expedited registration check
tasks, sched/core: RCUify the assignment of rq->curr
tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
tasks: Add a count of task RCU users
sched/core: Convert vcpu_is_preempted() from macro to an inline function
sched/fair: Remove unused cfs_rq_clock_task() function
25 Sep, 2019
11 commits
-
The EAS wake-up path computes the system energy for several CPU
candidates: the CPU with maximum spare capacity in each performance
domain, and the prev_cpu. However, if prev_cpu also happens to be the
CPU with maximum spare capacity in its performance domain, the energy
calculation is still done twice, unnecessarily.Add a condition to filter out this corner case before doing the energy
calculation.Reported-by: Pavan Kondeti
Signed-off-by: Quentin Perret
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: morten.rasmussen@arm.com
Cc: qais.yousef@arm.com
Cc: rjw@rjwysocki.net
Cc: tkjos@google.com
Cc: valentin.schneider@arm.com
Cc: vincent.guittot@linaro.org
Fixes: eb92692b2544 ("sched/fair: Speed-up energy-aware wake-ups")
Link: https://lkml.kernel.org/r/20190920094115.GA11503@qperret.net
Signed-off-by: Ingo Molnar -
update_max_interval() is called in both CPUHP_AP_SCHED_STARTING's startup
and teardown callbacks, but it turns out it's also called at the end of
the startup callback of CPUHP_AP_ACTIVE (which is further down the
startup sequence).There's no point in repeating this interval update in the startup sequence
since the CPU will remain online until it goes down the teardown path.Remove the redundant call in sched_cpu_activate() (CPUHP_AP_ACTIVE).
Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: dietmar.eggemann@arm.com
Cc: juri.lelli@redhat.com
Cc: vincent.guittot@linaro.org
Link: https://lkml.kernel.org/r/20190923093017.11755-1-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar -
preempt_schedule_irq() is the one that should be called on return from
interrupt, clean up the comment to avoid any ambiguity.Signed-off-by: Valentin Schneider
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Thomas Gleixner
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-riscv@lists.infradead.org
Cc: uclinux-h8-devel@lists.sourceforge.jp
Link: https://lkml.kernel.org/r/20190923143620.29334-2-valentin.schneider@arm.com
Signed-off-by: Ingo Molnar -
Commit:
de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
introduced a few compilation warnings:
kernel/sched/fair.c: In function '__refill_cfs_bandwidth_runtime':
kernel/sched/fair.c:4365:6: warning: variable 'now' set but not used [-Wunused-but-set-variable]
kernel/sched/fair.c: In function 'start_cfs_bandwidth':
kernel/sched/fair.c:4992:6: warning: variable 'overrun' set but not used [-Wunused-but-set-variable]Also, __refill_cfs_bandwidth_runtime() does no longer update the
expiration time, so fix the comments accordingly.Signed-off-by: Qian Cai
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Ben Segall
Reviewed-by: Dave Chiluk
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: pauld@redhat.com
Fixes: de53fd7aedb1 ("sched/fair: Fix low cpu usage with high throttling by removing expiration of cpu-local slices")
Link: https://lkml.kernel.org/r/1566326455-8038-1-git-send-email-cai@lca.pw
Signed-off-by: Ingo Molnar -
An oops can be triggered in the scheduler when running qemu on arm64:
Unable to handle kernel paging request at virtual address ffff000008effe40
Internal error: Oops: 96000007 [#1] SMP
Process migration/0 (pid: 12, stack limit = 0x00000000084e3736)
pstate: 20000085 (nzCv daIf -PAN -UAO)
pc : __ll_sc___cmpxchg_case_acq_4+0x4/0x20
lr : move_queued_task.isra.21+0x124/0x298
...
Call trace:
__ll_sc___cmpxchg_case_acq_4+0x4/0x20
__migrate_task+0xc8/0xe0
migration_cpu_stop+0x170/0x180
cpu_stopper_thread+0xec/0x178
smpboot_thread_fn+0x1ac/0x1e8
kthread+0x134/0x138
ret_from_fork+0x10/0x18__set_cpus_allowed_ptr() will choose an active dest_cpu in affinity mask to
migrage the process if process is not currently running on any one of the
CPUs specified in affinity mask. __set_cpus_allowed_ptr() will choose an
invalid dest_cpu (dest_cpu >= nr_cpu_ids, 1024 in my virtual machine) if
CPUS in an affinity mask are deactived by cpu_down after cpumask_intersects
check. cpumask_test_cpu() of dest_cpu afterwards is overflown and may pass if
corresponding bit is coincidentally set. As a consequence, kernel will
access an invalid rq address associate with the invalid CPU in
migration_cpu_stop->__migrate_task->move_queued_task and the Oops occurs.The reproduce the crash:
1) A process repeatedly binds itself to cpu0 and cpu1 in turn by calling
sched_setaffinity.2) A shell script repeatedly does "echo 0 > /sys/devices/system/cpu/cpu1/online"
and "echo 1 > /sys/devices/system/cpu/cpu1/online" in turn.3) Oops appears if the invalid CPU is set in memory after tested cpumask.
Signed-off-by: KeMeng Shi
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Valentin Schneider
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/1568616808-16808-1-git-send-email-shikemeng@huawei.com
Signed-off-by: Ingo Molnar -
Remove the IPI fallback code from membarrier to deal with very
infrequent cpumask memory allocation failure. Use GFP_KERNEL rather
than GFP_NOWAIT, and relax the blocking guarantees for the expedited
membarrier system call commands, allowing it to block if waiting for
memory to be made available.In addition, now -ENOMEM can be returned to user-space if the cpumask
memory allocation fails.Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-8-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar -
If there is only a single mm_user for the mm, the private expedited
membarrier command can skip the IPIs, because only a single thread
is using the mm.Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-7-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar -
The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without swapping to
other mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue which have the
same mm which state has just been modified.Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.The membarrier_execve() (now membarrier_exec_mmap) hook now needs to
clear the runqueue's membarrier state in addition to clear the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.Add memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.Suggested-by: Linus Torvalds
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar -
Checking that the number of threads is 1 is redundant with checking
mm_users == 1.No change in functionality intended.
Suggested-by: Oleg Nesterov
Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-3-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar -
Fix a logic flaw in the way membarrier_register_private_expedited()
handles ready state checks for private expedited sync core and private
expedited registrations.If a private expedited membarrier registration is first performed, and
then a private expedited sync_core registration is performed, the ready
state check will skip the second registration when it really should not.Signed-off-by: Mathieu Desnoyers
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Eric W. Biederman
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/20190919173705.2181-2-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar -
The current task on the runqueue is currently read with rcu_dereference().
To obtain ordinary RCU semantics for an rcu_dereference() of rq->curr it needs
to be paired with rcu_assign_pointer() of rq->curr. Which provides the
memory barrier necessary to order assignments to the task_struct
and the assignment to rq->curr.Unfortunately the assignment of rq->curr in __schedule is a hot path,
and it has already been show that additional barriers in that code
will reduce the performance of the scheduler. So I will attempt to
describe below why you can effectively have ordinary RCU semantics
without any additional barriers.The assignment of rq->curr in init_idle is a slow path called once
per cpu and that can use rcu_assign_pointer() without any concerns.As I write this there are effectively two users of rcu_dereference() on
rq->curr. There is the membarrier code in kernel/sched/membarrier.c
that only looks at "->mm" after the rcu_dereference(). Then there is
task_numa_compare() in kernel/sched/fair.c. My best reading of the
code shows that task_numa_compare only access: "->flags",
"->cpus_ptr", "->numa_group", "->numa_faults[]",
"->total_numa_faults", and "->se.cfs_rq".The code in __schedule() essentially does:
rq_lock(...);
smp_mb__after_spinlock();next = pick_next_task(...);
rq->curr = next;context_switch(prev, next);
At the start of the function the rq_lock/smp_mb__after_spinlock
pair provides a full memory barrier. Further there is a full memory barrier
in context_switch().This means that any task that has already run and modified itself (the
common case) has already seen two memory barriers before __schedule()
runs and begins executing. A task that modifies itself then sees a
third full memory barrier pair with the rq_lock();For a brand new task that is enqueued with wake_up_new_task() there
are the memory barriers present from the taking and release the
pi_lock and the rq_lock as the processes is enqueued as well as the
full memory barrier at the start of __schedule() assuming __schedule()
happens on the same cpu.This means that by the time we reach the assignment of rq->curr
except for values on the task struct modified in pick_next_task
the code has the same guarantees as if it used rcu_assign_pointer().Reading through all of the implementations of pick_next_task it
appears pick_next_task is limited to modifying the task_struct fields
"->se", "->rt", "->dl". These fields are the sched_entity structures
of the varies schedulers.Further "->se.cfs_rq" is only changed in cgroup attach/move operations
initialized by userspace.Unless I have missed something this means that in practice that the
users of "rcu_dereference(rq->curr)" get normal RCU semantics of
rcu_dereference() for the fields the care about, despite the
assignment of rq->curr in __schedule() ot using rcu_assign_pointer.Signed-off-by: Eric W. Biederman
Signed-off-by: Peter Zijlstra (Intel)
Cc: Chris Metcalf
Cc: Christoph Lameter
Cc: Davidlohr Bueso
Cc: Kirill Tkhai
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Russell King - ARM Linux admin
Cc: Thomas Gleixner
Link: https://lore.kernel.org/r/20190903200603.GW2349@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar