06 Mar, 2021

7 commits

  • The function sync_runqueues_membarrier_state() should copy the
    membarrier state from the @mm received as a parameter to each runqueue
    whose current task is using that mm.

    However, the use of smp_call_function_many() skips the current runqueue,
    which is unintended. Replace by a call to on_each_cpu_mask().
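
    As a rough sketch (not the exact upstream diff; the callback and mask
    names are only assumed), the key difference is that
    smp_call_function_many() never runs the callback on the calling CPU,
    while on_each_cpu_mask() also invokes it locally when the current CPU
    is part of the mask:

      /* before: the current runqueue is silently skipped */
      smp_call_function_many(tmpmask, ipi_sync_rq_state, mm, true);

      /* after: the current CPU is covered as well */
      on_each_cpu_mask(tmpmask, ipi_sync_rq_state, mm, true);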

    Fixes: 227a4aadc75b ("sched/membarrier: Fix p->mm->membarrier_state racy load")
    Reported-by: Nadav Amit
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc: stable@vger.kernel.org # 5.4.x+
    Link: https://lore.kernel.org/r/74F1E842-4A84-47BF-B6C2-5407DFDD4A4A@gmail.com

    Mathieu Desnoyers
     
  • Now that we have set_affinity_pending::stop_pending to indicate if a
    stopper is in progress, and we have the guarantee that if that stopper
    exists, it will (eventually) complete our @pending, we can simplify the
    refcount scheme by no longer counting the stopper thread.

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.724130207@infradead.org

    Peter Zijlstra
     
  • Consider two concurrent calls:

      sched_setaffinity(p, X);        sched_setaffinity(p, Y);

    Then the first will install p->migration_pending = &my_pending; and
    issue stop_one_cpu_nowait(pending); and the second one will read
    p->migration_pending and _also_ issue: stop_one_cpu_nowait(pending),
    the _SAME_ @pending.

    This causes stopper list corruption.

    Add set_affinity_pending::stop_pending, to indicate if a stopper is in
    progress.
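
    A minimal sketch of the idea (struct layout and code shape assumed from
    this changelog, not copied verbatim from the tree): only the caller
    that flips stop_pending queues the stopper; later callers just wait on
    pending->done.

      struct set_affinity_pending {
              refcount_t              refs;
              unsigned int            stop_pending;
              struct completion       done;
              struct cpu_stop_work    stop_work;
              struct migration_arg    arg;
      };

      /* in affine_move_task(), with the rq lock held */
      if (!pending->stop_pending) {
              pending->stop_pending = true;
              stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                                  &pending->arg, &pending->stop_work);
      }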

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.649146419@infradead.org

    Peter Zijlstra
     
  • When the purpose of migration_cpu_stop() is to migrate the task to
    'any' valid CPU, don't migrate the task when it's already running on a
    valid CPU.
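
    A hedged sketch of the added early exit (the encoding is assumed from
    this series: a negative dest_cpu means "any CPU inside the task's
    cpumask will do"):

      /* in migration_cpu_stop(), with the rq lock held */
      if (dest_cpu < 0 && cpumask_test_cpu(cpu_of(rq), p->cpus_ptr))
              goto out;  /* already running somewhere valid, nothing to do */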

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.569238629@infradead.org

    Peter Zijlstra
     
  • The SCA_MIGRATE_ENABLE and task_running() cases are almost identical,
    collapse them to avoid further duplication.

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.500108964@infradead.org

    Peter Zijlstra
     
  • When affine_move_task() issues a migration_cpu_stop(), the purpose of
    that function is to complete that @pending, not any random other
    p->migration_pending that might have gotten installed since.

    This realization greatly simplifies migration_cpu_stop() and allows the
    further steps needed to fix all this, as it provides the guarantee
    that @pending's stopper will complete @pending (and not some random
    other @pending).

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.430014682@infradead.org

    Peter Zijlstra
     
  • When affine_move_task(p) is called on a running task @p, which is not
    otherwise already changing affinity, we'll first set
    p->migration_pending and then do:

      stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);

    This then gets us to migration_cpu_stop() running on the CPU that was
    previously running our victim task @p.

    If we find that our task is no longer on that runqueue (this can
    happen because of a concurrent migration due to load-balance etc.),
    then we'll end up at the:

      } else if (dest_cpu < 0 || pending) {

    branch. Which we'll take because we set pending earlier. Here we first
    check if the task @p has already satisfied the affinity constraints,
    if so we bail early [A]. Otherwise we'll reissue migration_cpu_stop()
    onto the CPU that is now hosting our task @p:

      stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                          &pending->arg, &pending->stop_work);

    Except, we've never initialized pending->arg, which will be all 0s.

    This then results in running migration_cpu_stop() on the next CPU with
    arg->p == NULL, which gives the by now obvious result of fireworks.

    The cure is to change affine_move_task() to always use pending->arg.
    Furthermore, we can use the exact same pattern as the
    SCA_MIGRATE_ENABLE case, since we'll block on the pending->done
    completion anyway; there is no point in adding yet another completion
    in stop_one_cpu().

    This then gives a clear distinction between the two
    migration_cpu_stop() use cases:

    - sched_exec() / migrate_task_to() : arg->pending == NULL
    - affine_move_task() : arg->pending != NULL;

    And we can have it ignore p->migration_pending when !arg->pending. Any
    stop work from sched_exec() / migrate_task_to() is in addition to stop
    works from affine_move_task(), which will be sufficient to issue the
    completion.
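
    Sketched out (code shape assumed; the struct names come from this
    series and target_cpu is a stand-in), the two call patterns now look
    like:

      /* sched_exec() / migrate_task_to(): one-shot, no pending */
      struct migration_arg arg = { .task = p, .dest_cpu = target_cpu };
      stop_one_cpu(cpu_of(rq), migration_cpu_stop, &arg);

      /* affine_move_task(): always hand out the embedded, initialized arg */
      pending->arg = (struct migration_arg) {
              .task     = p,
              .dest_cpu = dest_cpu,
              .pending  = pending,
      };
      stop_one_cpu_nowait(cpu_of(rq), migration_cpu_stop,
                          &pending->arg, &pending->stop_work);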

    Fixes: 6d337eab041d ("sched: Fix migrate_disable() vs set_cpus_allowed_ptr()")
    Cc: stable@kernel.org
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210224131355.357743989@infradead.org

    Peter Zijlstra
     

27 Feb, 2021

1 commit

  • Drop repeated words in kernel/events/.
    {if, the, that, with, time}

    Drop repeated words in kernel/locking/.
    {it, no, the}

    Drop repeated words in kernel/sched/.
    {in, not}

    Link: https://lkml.kernel.org/r/20210127023412.26292-1-rdunlap@infradead.org
    Signed-off-by: Randy Dunlap
    Acked-by: Will Deacon [kernel/locking/]
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Will Deacon
    Cc: Mathieu Desnoyers
    Cc: "Paul E. McKenney"
    Cc: Juri Lelli
    Cc: Vincent Guittot
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

24 Feb, 2021

1 commit

  • Pull more power management updates from Rafael Wysocki:
    "These are fixes and cleanups on top of the power management material
    for 5.12-rc1 merged previously.

    Specifics:

    - Address cpufreq regression introduced in 5.11 that causes CPU
    frequency reporting to be distorted on systems with CPPC that use
    acpi-cpufreq as the scaling driver (Rafael Wysocki).

    - Fix regression introduced during the 5.10 development cycle related
    to CPU hotplug and policy recreation in the qcom-cpufreq-hw driver
    (Shawn Guo).

    - Fix recent regression in the operating performance points (OPP)
    framework that may cause frequency updates to be skipped by mistake
    in some cases (Jonathan Marek).

    - Simplify schedutil governor code and remove a misleading comment
    from it (Yue Hu).

    - Fix kerneldoc comment typo in the cpufreq core (Yue Hu)"

    * tag 'pm-5.12-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: Fix typo in kerneldoc comment
    cpufreq: schedutil: Remove update_lock comment from struct sugov_policy definition
    cpufreq: schedutil: Remove needless sg_policy parameter from ignore_dl_rate_limit()
    cpufreq: ACPI: Set cpuinfo.max_freq directly if max boost is known
    cpufreq: qcom-hw: drop devm_xxx() calls from init/exit hooks
    opp: Don't skip freq update for different frequency

    Linus Torvalds
     

22 Feb, 2021

2 commits

  • Pull KVM updates from Paolo Bonzini:
    "x86:

    - Support for userspace to emulate Xen hypercalls

    - Raise the maximum number of user memslots

    - Scalability improvements for the new MMU.

    Instead of the complex "fast page fault" logic that is used in
    mmu.c, tdp_mmu.c uses an rwlock so that page faults are concurrent,
    but the code that can run against page faults is limited. Right now
    only page faults take the lock for reading; in the future this will
    be extended to some cases of page table destruction. I hope to
    switch the default MMU around 5.12-rc3 (some testing was delayed
    due to Chinese New Year).

    - Cleanups for MAXPHYADDR checks

    - Use static calls for vendor-specific callbacks

    - On AMD, use VMLOAD/VMSAVE to save and restore host state

    - Stop using deprecated jump label APIs

    - Workaround for AMD erratum that made nested virtualization
    unreliable

    - Support for LBR emulation in the guest

    - Support for communicating bus lock vmexits to userspace

    - Add support for SEV attestation command

    - Miscellaneous cleanups

    PPC:

    - Support for second data watchpoint on POWER10

    - Remove some complex workarounds for buggy early versions of POWER9

    - Guest entry/exit fixes

    ARM64:

    - Make the nVHE EL2 object relocatable

    - Cleanups for concurrent translation faults hitting the same page

    - Support for the standard TRNG hypervisor call

    - A bunch of small PMU/Debug fixes

    - Simplification of the early init hypercall handling

    Non-KVM changes (with acks):

    - Detection of contended rwlocks (implemented only for qrwlocks,
    because KVM only needs it for x86)

    - Allow __DISABLE_EXPORTS from assembly code

    - Provide a saner follow_pfn replacements for modules"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (192 commits)
    KVM: x86/xen: Explicitly pad struct compat_vcpu_info to 64 bytes
    KVM: selftests: Don't bother mapping GVA for Xen shinfo test
    KVM: selftests: Fix hex vs. decimal snafu in Xen test
    KVM: selftests: Fix size of memslots created by Xen tests
    KVM: selftests: Ignore recently added Xen tests' build output
    KVM: selftests: Add missing header file needed by xAPIC IPI tests
    KVM: selftests: Add operand to vmsave/vmload/vmrun in svm.c
    KVM: SVM: Make symbol 'svm_gp_erratum_intercept' static
    locking/arch: Move qrwlock.h include after qspinlock.h
    KVM: PPC: Book3S HV: Fix host radix SLB optimisation with hash guests
    KVM: PPC: Book3S HV: Ensure radix guest has no SLB entries
    KVM: PPC: Don't always report hash MMU capability for P9 < DD2.2
    KVM: PPC: Book3S HV: Save and restore FSCR in the P9 path
    KVM: PPC: remove unneeded semicolon
    KVM: PPC: Book3S HV: Use POWER9 SLBIA IH=6 variant to clear SLB
    KVM: PPC: Book3S HV: No need to clear radix host SLB before loading HPT guest
    KVM: PPC: Book3S HV: Fix radix guest SLB side channel
    KVM: PPC: Book3S HV: Remove support for running HPT guest on RPT host without mixed mode support
    KVM: PPC: Book3S HV: Introduce new capability for 2nd DAWR
    KVM: PPC: Book3S HV: Add infrastructure to support 2nd DAWR
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "Core scheduler updates:

    - Add CONFIG_PREEMPT_DYNAMIC: this in its current form adds the
    preempt=none/voluntary/full boot options (default: full), to allow
    distros to build a PREEMPT kernel but fall back to close to
    PREEMPT_VOLUNTARY (or PREEMPT_NONE) runtime scheduling behavior via
    a boot time selection.

    There's also the /debug/sched_debug switch to do this runtime.

    This feature is implemented via runtime patching (a new variant of
    static calls).

    The scope of the runtime patching can be best reviewed by looking
    at the sched_dynamic_update() function in kernel/sched/core.c.

    ( Note that the dynamic none/voluntary mode isn't 100% identical,
    for example preempt-RCU is available in all cases, plus the
    preempt count is maintained in all models, which has runtime
    overhead even with the code patching. )

    The PREEMPT_VOLUNTARY/PREEMPT_NONE models, used by the vast
    majority of distributions, are supposed to be unaffected.

    - Fix ignored rescheduling after rcu_eqs_enter(). This is a bug that
    was found via rcutorture triggering a hang. The bug is that
    rcu_idle_enter() may wake up a NOCB kthread, but this happens after
    the last generic need_resched() check. Some cpuidle drivers fix it
    by chance but many others don't.

    In true 2020 fashion the original bug fix has grown into a 5-patch
    scheduler/RCU fix series plus another 16 RCU patches to address the
    underlying issue of missed preemption events. These are the initial
    fixes that should fix current incarnations of the bug.

    - Clean up rbtree usage in the scheduler, by providing & using the
    following consistent set of rbtree APIs:

    partial-order; less() based:
    - rb_add(): add a new entry to the rbtree
    - rb_add_cached(): like rb_add(), but for a rb_root_cached

    total-order; cmp() based:
    - rb_find(): find an entry in an rbtree
    - rb_find_add(): find an entry, and add if not found

    - rb_find_first(): find the first (leftmost) matching entry
    - rb_next_match(): continue from rb_find_first()
    - rb_for_each(): iterate a sub-tree using the previous two

    - Improve the SMP/NUMA load-balancer: scan for an idle sibling in a
    single pass. This is a 4-commit series where each commit improves
    one aspect of the idle sibling scan logic.

    - Improve the cpufreq cooling driver by getting the effective CPU
    utilization metrics from the scheduler

    - Improve the fair scheduler's active load-balancing logic by
    reducing the number of active LB attempts & lengthen the
    load-balancing interval. This improves stress-ng mmapfork
    performance.

    - Fix CFS's estimated utilization (util_est) calculation bug that can
    result in too high utilization values

    Misc updates & fixes:

    - Fix the HRTICK reprogramming & optimization feature

    - Fix SCHED_SOFTIRQ raising race & warning in the CPU offlining code

    - Reduce dl_add_task_root_domain() overhead

    - Fix uprobes refcount bug

    - Process pending softirqs in flush_smp_call_function_from_idle()

    - Clean up task priority related defines, remove *USER_*PRIO and
    USER_PRIO()

    - Simplify the sched_init_numa() deduplication sort

    - Documentation updates

    - Fix EAS bug in update_misfit_status(), which degraded the quality
    of energy-balancing

    - Smaller cleanups"

    * tag 'sched-core-2021-02-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (51 commits)
    sched,x86: Allow !PREEMPT_DYNAMIC
    entry/kvm: Explicitly flush pending rcuog wakeup before last rescheduling point
    entry: Explicitly flush pending rcuog wakeup before last rescheduling point
    rcu/nocb: Trigger self-IPI on late deferred wake up before user resume
    rcu/nocb: Perform deferred wake up before last idle's need_resched() check
    rcu: Pull deferred rcuog wake up to rcu_eqs_enter() callers
    sched/features: Distinguish between NORMAL and DEADLINE hrtick
    sched/features: Fix hrtick reprogramming
    sched/deadline: Reduce rq lock contention in dl_add_task_root_domain()
    uprobes: (Re)add missing get_uprobe() in __find_uprobe()
    smp: Process pending softirqs in flush_smp_call_function_from_idle()
    sched: Harden PREEMPT_DYNAMIC
    static_call: Allow module use without exposing static_call_key
    sched: Add /debug/sched_preempt
    preempt/dynamic: Support dynamic preempt with preempt= boot option
    preempt/dynamic: Provide irqentry_exit_cond_resched() static call
    preempt/dynamic: Provide preempt_schedule[_notrace]() static calls
    preempt/dynamic: Provide cond_resched() and might_resched() static calls
    preempt: Introduce CONFIG_PREEMPT_DYNAMIC
    static_call: Provide DEFINE_STATIC_CALL_RET0()
    ...

    Linus Torvalds
     


17 Feb, 2021

18 commits

  • Entering RCU idle mode may cause a deferred wake up of an RCU NOCB_GP
    kthread (rcuog) to be serviced.

    Usually a local wake up happening while running the idle task is handled
    in one of the need_resched() checks carefully placed within the idle
    loop that can break to the scheduler.

    Unfortunately the call to rcu_idle_enter() is already beyond the last
    generic need_resched() check and we may halt the CPU with a resched
    request unhandled, leaving the task hanging.

    Fix this by splitting the rcuog wakeup handling out of rcu_idle_enter()
    and placing it before the last generic need_resched() check in the idle
    loop. It is then assumed that no call to call_rcu() will be performed
    after that in the idle loop until the CPU is put in low power mode.

    Fixes: 96d3fd0d315a ("rcu: Break call_rcu() deadlock involving scheduler and perf")
    Reported-by: Paul E. McKenney
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20210131230548.32970-3-frederic@kernel.org

    Frederic Weisbecker
     
  • The HRTICK feature has traditionally been servicing configurations that
    need precise preemption points for NORMAL tasks. More recently, the
    feature has been extended to also service DEADLINE tasks with stringent
    runtime enforcement needs (e.g., runtime < 1ms with HZ=1000).

    Enabling the HRTICK sched feature currently enables the additional timer
    and task tick for both classes, which might introduce undesired overhead
    for no additional benefit if it is only needed for one of them.

    Separate HRTICK sched feature in two (and leave the traditional case
    name unmodified) so that it can be selectively enabled when needed.

    With:

    $ echo HRTICK > /sys/kernel/debug/sched_features

    the NORMAL/fair hrtick gets enabled.

    With:

    $ echo HRTICK_DL > /sys/kernel/debug/sched_features

    the DEADLINE hrtick gets enabled.

    Signed-off-by: Juri Lelli
    Signed-off-by: Luis Claudio R. Goncalves
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210208073554.14629-3-juri.lelli@redhat.com

    Juri Lelli
     
  • Hung tasks and RCU stall cases were reported on systems which were not
    100% busy. Investigation of such unexpected cases (no sign of potential
    starvation caused by tasks hogging the system) pointed out that the
    periodic sched tick timer wasn't serviced anymore after a certain point
    and that caused all machinery that depends on it (timers, RCU, etc.) to
    stop working as well. This issue was, however, only reproducible if
    HRTICK was enabled.

    Looking at core dumps it was found that the rbtree of the hrtimer base
    also used for the hrtick was corrupted (i.e. next as seen from the base
    root and the actual leftmost obtained by traversing the tree differed).
    The same base is also used for the periodic tick hrtimer, which might get
    "lost" if the rbtree gets corrupted.

    Much like what is described in commit 1f71addd34f4c ("tick/sched: Do not
    mess with an enqueued hrtimer"), there is a race window between
    hrtimer_set_expires() in hrtick_start() and hrtimer_start_expires() in
    __hrtick_restart() in which the former might operate on an already
    queued hrtick hrtimer, which can lead to corruption of the base.

    Use hrtimer_start() (which removes the timer before enqueueing it back) to
    ensure hrtick hrtimer reprogramming is entirely guarded by the base
    lock, so that no race conditions can occur.
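
    A sketch of the fixed pattern (rq->hrtick_time as the field carrying
    the expiry is assumed here, not taken verbatim from the tree):

      static void __hrtick_restart(struct rq *rq)
      {
              struct hrtimer *timer = &rq->hrtick_timer;
              ktime_t time = rq->hrtick_time;

              /*
               * hrtimer_start() removes an already queued timer before
               * re-enqueueing it, all under the hrtimer base lock, so a
               * concurrent expiry update can no longer corrupt the rbtree.
               */
              hrtimer_start(timer, time, HRTIMER_MODE_ABS_PINNED_HARD);
      }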

    Signed-off-by: Juri Lelli
    Signed-off-by: Luis Claudio R. Goncalves
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210208073554.14629-2-juri.lelli@redhat.com

    Juri Lelli
     
  • dl_add_task_root_domain() is called during sched domain rebuild:

      rebuild_sched_domains_locked()
        partition_and_rebuild_sched_domains()
          rebuild_root_domains()
            for all top_cpuset descendants:
              update_tasks_root_domain()
                for all tasks of cpuset:
                  dl_add_task_root_domain()

    Change it so that only the task pi lock is taken to check if the task
    has a SCHED_DEADLINE (DL) policy. If p is a DL task, take the
    rq lock as well to be able to safely de-reference the root domain's DL
    bandwidth structure.

    Most of the tasks will have another policy (namely SCHED_NORMAL) and
    can now bail without taking the rq lock.

    One thing to note here: Even in case that there aren't any DL user
    tasks, a slow frequency switching system with cpufreq gov schedutil has
    a DL task (sugov) per frequency domain running which participates in DL
    bandwidth management.
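
    Roughly, the resulting locking shape looks like this (a sketch, not
    the verbatim function):

      void dl_add_task_root_domain(struct task_struct *p)
      {
              struct rq_flags rf;
              struct rq *rq;
              struct dl_bw *dl_b;

              raw_spin_lock_irqsave(&p->pi_lock, rf.flags);
              if (!dl_task(p)) {
                      /* common case: not DL, bail without the rq lock */
                      raw_spin_unlock_irqrestore(&p->pi_lock, rf.flags);
                      return;
              }

              rq = __task_rq_lock(p, &rf);

              dl_b = &rq->rd->dl_bw;
              raw_spin_lock(&dl_b->lock);
              __dl_add(dl_b, p->dl.dl_bw, cpumask_weight(rq->rd->span));
              raw_spin_unlock(&dl_b->lock);

              task_rq_unlock(rq, p, &rf);
      }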

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Quentin Perret
    Reviewed-by: Valentin Schneider
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Juri Lelli
    Link: https://lkml.kernel.org/r/20210119083542.19856-1-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • Use the new EXPORT_STATIC_CALL_TRAMP() / static_call_mod() to unexport
    the static_call_key for the PREEMPT_DYNAMIC calls such that modules
    can no longer update these calls.

    Having modules change/hi-jack the preemption calls would be horrible.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Add a debugfs file to muck about with the preempt mode at runtime.
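
    Assuming debugfs is mounted at /sys/kernel/debug, usage looks roughly
    like this (the read-back format, with the current mode in parentheses,
    is sketched rather than quoted):

      $ cat /sys/kernel/debug/sched_preempt
      none voluntary (full)
      $ echo voluntary > /sys/kernel/debug/sched_preempt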

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/YAsGiUYf6NyaTplX@hirez.programming.kicks-ass.net

    Peter Zijlstra
     
  • Support the preempt= boot option and patch the static call sites
    accordingly.
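
    As spelled out in the pull-request summary above, the accepted values
    are none/voluntary/full (default: full), e.g. on the command line of a
    CONFIG_PREEMPT_DYNAMIC=y kernel:

      preempt=none        # behave close to PREEMPT_NONE
      preempt=voluntary   # behave close to PREEMPT_VOLUNTARY
      preempt=full        # full preemption (the default)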

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210118141223.123667-9-frederic@kernel.org

    Peter Zijlstra (Intel)
     
  • Provide static calls to control preempt_schedule[_notrace]()
    (called in CONFIG_PREEMPT) so that we can override their behaviour when
    preempt= is overridden.

    Since the default behaviour is full preemption, both their calls are
    initialized to the arch provided wrapper, if any.

    [fweisbec: only define static calls when PREEMPT_DYNAMIC, make it less
    dependent on x86 with __preempt_schedule_func]
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210118141223.123667-7-frederic@kernel.org

    Peter Zijlstra (Intel)
     
  • Provide static calls to control cond_resched() (called in !CONFIG_PREEMPT)
    and might_resched() (called in CONFIG_PREEMPT_VOLUNTARY) so that we
    can override their behaviour when preempt= is overridden.

    Since the default behaviour is full preemption, both their calls are
    ignored when preempt= isn't passed.
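
    A purely illustrative sketch of the static-call shape (not the exact
    upstream definitions):

      DEFINE_STATIC_CALL(cond_resched, __cond_resched);

      int _cond_resched(void)
      {
              return static_call(cond_resched)();
      }

      /*
       * Selecting preempt=full later re-points the call at a stub that
       * just returns 0, turning cond_resched() into (almost) a NOP.
       */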

    [fweisbec: branch might_resched() directly to __cond_resched(), only
    define static calls when PREEMPT_DYNAMIC]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Frederic Weisbecker
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210118141223.123667-6-frederic@kernel.org

    Peter Zijlstra (Intel)
     
  • The description of the RT offset and the values for 'normal' tasks needs
    an update. Moreover, there are DL tasks now.

    task_prio() has to stay as it is to guarantee compatibility with the
    /proc/<pid>/stat priority field:

      # cat /proc/<pid>/stat | awk '{ print $18; }'

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210128131040.296856-4-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • The only remaining use of MAX_USER_PRIO (and USER_PRIO) is the
    SCALE_PRIO() definition in the PowerPC Cell architecture's Synergistic
    Processor Unit (SPU) scheduler. TASK_USER_PRIO isn't used anymore.

    Commit fe443ef2ac42 ("[POWERPC] spusched: Dynamic timeslicing for
    SCHED_OTHER") copied SCALE_PRIO() from the task scheduler in v2.6.23.

    Commit a4ec24b48dde ("sched: tidy up SCHED_RR") removed it from the task
    scheduler in v2.6.24.

    Commit 3ee237dddcd8 ("sched/prio: Add 3 macros of MAX_NICE, MIN_NICE and
    NICE_WIDTH in prio.h") introduced NICE_WIDTH much later.

    With:

      MAX_USER_PRIO = USER_PRIO(MAX_PRIO)
                    = MAX_PRIO - MAX_RT_PRIO

           MAX_PRIO = MAX_RT_PRIO + NICE_WIDTH

      MAX_USER_PRIO = MAX_RT_PRIO + NICE_WIDTH - MAX_RT_PRIO

      MAX_USER_PRIO = NICE_WIDTH

    MAX_USER_PRIO can be replaced by NICE_WIDTH to be able to remove all the
    {*_}USER_PRIO defines.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210128131040.296856-3-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • Commit d46523ea32a7 ("[PATCH] fix MAX_USER_RT_PRIO and MAX_RT_PRIO")
    was introduced due to a small time period in which the realtime patch
    set was using different values for MAX_USER_RT_PRIO and MAX_RT_PRIO.

    This is no longer true, i.e. now MAX_RT_PRIO == MAX_USER_RT_PRIO.

    Get rid of MAX_USER_RT_PRIO and make everything use MAX_RT_PRIO
    instead.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20210128131040.296856-2-dietmar.eggemann@arm.com

    Dietmar Eggemann
     
  • Commit "sched/topology: Make sched_init_numa() use a set for the
    deduplicating sort" allocates 'i + nr_levels (level)' instead of
    'i + nr_levels + 1' sched_domain_topology_level.

    This led to an Oops (on Arm64 juno with CONFIG_SCHED_DEBUG):

      sched_init_domains
        build_sched_domains()
          __free_domain_allocs()
            __sdt_free() {
              ...
              for_each_sd_topology(tl)
                ...
                sd = *per_cpu_ptr(sdd->sd, j);

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Tested-by: Vincent Guittot
    Tested-by: Barry Song
    Link: https://lkml.kernel.org/r/6000e39e-7d28-c360-9cd6-8798fd22a9bf@arm.com

    Dietmar Eggemann
     
  • Reduce rbtree boilerplate by using the new helpers.

    Make rb_add_cached() / rb_erase_cached() return a pointer to the
    leftmost node to aid in updating additional state.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Davidlohr Bueso

    Peter Zijlstra
     
  • Reduce rbtree boilerplate by using the new helper function.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Davidlohr Bueso

    Peter Zijlstra
     
  • Both select_idle_core() and select_idle_cpu() do a loop over the same
    cpumask. Observe that by clearing the already visited CPUs, we can
    fold the iteration and iterate a core at a time.

    All we need to do is remember any non-idle CPU we encountered while
    scanning for an idle core. This way we'll only iterate every CPU once.
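
    A toy, userspace illustration of the folded single-pass scan (plain C,
    not kernel code; NR_CPUS, SMT_WIDTH and cpu_idle[] are made up):

      #include <stdbool.h>
      #include <stdio.h>

      #define NR_CPUS   8
      #define SMT_WIDTH 2   /* two hyperthreads per core */

      static bool cpu_idle[NR_CPUS] = { false, false, true, false,
                                        false, false, true,  true };

      /* Return an idle core's first CPU if any, else any idle CPU seen. */
      static int select_idle_target(void)
      {
              int fallback = -1;

              for (int core = 0; core < NR_CPUS; core += SMT_WIDTH) {
                      bool whole_core_idle = true;

                      for (int cpu = core; cpu < core + SMT_WIDTH; cpu++) {
                              if (cpu_idle[cpu]) {
                                      if (fallback < 0)
                                              fallback = cpu; /* remember */
                              } else {
                                      whole_core_idle = false;
                              }
                      }
                      if (whole_core_idle)
                              return core;  /* idle core: done, single pass */
              }
              return fallback;              /* -1 if nothing was idle */
      }

      int main(void)
      {
              printf("picked CPU %d\n", select_idle_target());
              return 0;
      }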

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210127135203.19633-5-mgorman@techsingularity.net

    Mel Gorman
     
  • In order to make the next patch more readable, and to quantify the
    actual effectiveness of this pass, start by removing it.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210125085909.4600-4-mgorman@techsingularity.net

    Mel Gorman
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

12 Feb, 2021

1 commit

  • …ulmck/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Documentation updates.

    - Miscellaneous fixes.

    - kfree_rcu() updates: Addition of mem_dump_obj() to provide allocator return
    addresses to more easily locate bugs. This has a couple of RCU-related commits,
    but is mostly MM. Was pulled in with akpm's agreement.

    - Per-callback-batch tracking of numbers of callbacks,
    which enables better debugging information and smarter
    reactions to large numbers of callbacks.

    - The first round of changes to allow CPUs to be runtime switched from and to
    callback-offloaded state.

    - CONFIG_PREEMPT_RT-related changes.

    - RCU CPU stall warning updates.
    - Addition of polling grace-period APIs for SRCU.

    - Torture-test and torture-test scripting updates, including a "torture everything"
    script that runs rcutorture, locktorture, scftorture, rcuscale, and refscale.
    Plus does an allmodconfig build.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

04 Feb, 2021

1 commit

  • Safely rescheduling while holding a spin lock is essential for keeping
    long running kernel operations running smoothly. Add the facility to
    cond_resched rwlocks.
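
    A hedged usage sketch (my_rwlock and the helpers around it are
    hypothetical; only the cond_resched_rwlock_read()/_write() calls come
    from this change):

      read_lock(&my_rwlock);
      while (more_work_to_do()) {
              process_one_chunk();
              /* may drop the rwlock, reschedule, and re-acquire it */
              cond_resched_rwlock_read(&my_rwlock);
      }
      read_unlock(&my_rwlock);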

    CC: Ingo Molnar
    CC: Will Deacon
    Acked-by: Peter Zijlstra
    Acked-by: Davidlohr Bueso
    Acked-by: Waiman Long
    Acked-by: Paolo Bonzini
    Signed-off-by: Ben Gardon
    Message-Id:
    Signed-off-by: Paolo Bonzini

    Ben Gardon
     

28 Jan, 2021

4 commits

  • As noted by Vincent Guittot, avg_scan_costs are calculated for SIS_PROP
    even if SIS_PROP is disabled. Move the time calculations under a SIS_PROP
    check and while we are at it, exclude the cost of initialising the CPU
    mask from the average scan cost.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210125085909.4600-3-mgorman@techsingularity.net

    Mel Gorman
     
  • SIS_AVG_CPU was introduced as a means of avoiding a search when the
    average search cost indicated that the search would likely fail. It was
    a blunt instrument and disabled by commit 4c77b18cf8b7 ("sched/fair: Make
    select_idle_cpu() more aggressive") and later replaced with a proportional
    search depth by commit 1ad3aaf3fcd2 ("sched/core: Implement new approach
    to scale select_idle_cpu()").

    While there are corner cases where SIS_AVG_CPU is better, it has now been
    disabled for almost three years. As the intent of SIS_PROP is to reduce
    the time complexity of select_idle_cpu(), let's drop SIS_AVG_CPU and focus
    on SIS_PROP as a throttling mechanism.

    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20210125085909.4600-2-mgorman@techsingularity.net

    Mel Gorman
     
  • The deduplicating sort in sched_init_numa() assumes that the first line in
    the distance table contains all unique values in the entire table. I've
    been trying to pen what this exactly means for the topology, but it's not
    straightforward. For instance, topology.c uses this example:

      node   0   1   2   3
        0:  10  20  20  30
        1:  20  10  20  20
        2:  20  20  10  20
        3:  30  20  20  10

      0 ----- 1
      |     / |
      |    /  |
      |   /   |
      2 ----- 3

    Which works out just fine. However, if we swap nodes 0 and 1:

      1 ----- 0
      |     / |
      |    /  |
      |   /   |
      2 ----- 3

    we get this distance table:

      node   0   1   2   3
        0:  10  20  20  20
        1:  20  10  20  30
        2:  20  20  10  20
        3:  20  30  20  10

    Which breaks the deduplicating sort (non-representative first line). In
    this case this would just be a renumbering exercise, but it so happens that
    we can have a deduplicating sort that goes through the whole table in O(n²)
    at the extra cost of a temporary memory allocation (i.e. any form of set).

    The ACPI spec (SLIT) mentions distances are encoded on 8 bits. Following
    this, implement the set as a 256-bit bitmap. Should this not be
    satisfactory (i.e. we want to support 32-bit values), then we'll have to go
    for some other sparse set implementation.
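
    A toy, self-contained illustration of the approach (plain C, not the
    kernel implementation): every 8-bit distance maps onto one bit of a
    256-bit set, so deduplicating over the whole table stays cheap.

      #include <stdint.h>
      #include <stdio.h>

      #define NR_NODES 4

      static uint64_t seen[4];  /* 4 x 64 = 256 bits, one per distance value */

      static void set_bit256(unsigned int v)  { seen[v / 64] |= 1ULL << (v % 64); }
      static int  test_bit256(unsigned int v) { return (seen[v / 64] >> (v % 64)) & 1; }

      int main(void)
      {
              /* the "swapped nodes 0 and 1" table from above */
              static const unsigned int dist[NR_NODES][NR_NODES] = {
                      { 10, 20, 20, 20 },
                      { 20, 10, 20, 30 },
                      { 20, 20, 10, 20 },
                      { 20, 30, 20, 10 },
              };
              unsigned int nr_levels = 0;

              /* walk the whole table, not just the first row */
              for (int i = 0; i < NR_NODES; i++)
                      for (int j = 0; j < NR_NODES; j++)
                              if (!test_bit256(dist[i][j])) {
                                      set_bit256(dist[i][j]);
                                      nr_levels++;
                              }

              printf("unique distances: %u\n", nr_levels);  /* 3: 10, 20, 30 */
              return 0;
      }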

    This has the added benefit of letting us allocate just the right amount of
    memory for sched_domains_numa_distance[], rather than an arbitrary
    (nr_node_ids + 1).

    Note: DT binding equivalent (distance-map) decodes distances as 32-bit
    values.

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20210122123943.1217-2-valentin.schneider@arm.com

    Valentin Schneider
     
  • If the task is pinned to a cpu, setting the misfit status means that
    we'll unnecessarily continuously attempt to migrate the task but fail.

    This continuous failure will cause the balance_interval to increase to
    a high value, and eventually cause unnecessarily significant delays in
    balancing the system when a real imbalance happens.

    Caught while testing uclamp, where an rt-app calibration loop was pinned
    to cpu 0 and, shortly after, another task with a high util_clamp value
    was spawned. The task was failing to migrate after over 40ms of runtime
    because balance_interval had been unnecessarily expanded to a very high
    value by the calibration loop.

    Not done here, but it could be useful to extend the check for pinning to
    verify that the affinity of the task has a cpu that fits. We could end
    up in a similar situation otherwise.

    Fixes: 3b1baa6496e6 ("sched/fair: Add 'group_misfit_task' load-balance type")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Quentin Perret
    Acked-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210119120755.2425264-1-qais.yousef@arm.com

    Qais Yousef
     

23 Jan, 2021

1 commit

  • …'mmdumpobj.2021.01.22a', 'nocb.2021.01.06a', 'rt.2021.01.04a', 'stall.2021.01.06a', 'torture.2021.01.12a' and 'tortureall.2021.01.06a' into HEAD

    doc.2021.01.06a: Documentation updates.
    fixes.2021.01.04b: Miscellaneous fixes.
    kfree_rcu.2021.01.04a: kfree_rcu() updates.
    mmdumpobj.2021.01.22a: Dump allocation point for memory blocks.
    nocb.2021.01.06a: RCU callback offload updates and cblist segment lengths.
    rt.2021.01.04a: Real-time updates.
    stall.2021.01.06a: RCU CPU stall warning updates.
    torture.2021.01.12a: Torture-test updates and polling SRCU grace-period API.
    tortureall.2021.01.06a: Torture-test script updates.

    Paul E. McKenney
     

22 Jan, 2021

2 commits

  • Now that we have KTHREAD_IS_PER_CPU to denote the critical per-cpu
    tasks to retain during CPU offline, we can relax the warning in
    set_cpus_allowed_ptr(). Any spurious kthread that wants to get on at
    the last minute will get pushed off before it can run.

    During CPU online there is no harm, and actual benefit, in allowing
    kthreads back on early: it simplifies the hotplug code and fixes a
    number of outstanding races.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Lai Jiangshan
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210121103507.240724591@infradead.org

    Peter Zijlstra
     
  • Prior to commit 1cf12e08bc4d ("sched/hotplug: Consolidate task
    migration on CPU unplug") we'd leave any tasks on the dying CPU, break
    their affinity and force them off at the very end.

    This scheme had to change in order to enable migrate_disable(). One
    cannot wait for migrate_disable() to complete while stuck in
    stop_machine(). Furthermore, since we need at the very least: idle,
    hotplug and stop threads at any point before stop_machine, we can't
    break affinity and/or push those away.

    Under the assumption that all per-cpu kthreads are sanely handled by
    CPU hotplug, the new code no longer breaks affinity or migrates any of
    them (which then includes the critical ones above).

    However, there's an important difference between per-cpu kthreads and
    kthreads that happen to have a single CPU affinity which is lost. The
    latter class very much relies on the forced affinity breaking and
    migration semantics previously provided.

    Use the new kthread_is_per_cpu() infrastructure to tighten
    is_per_cpu_kthread() and fix the hot-unplug problems stemming from the
    change.
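
    One plausible shape of the tightened helper (a sketch; the exact
    upstream checks may differ):

      static inline bool is_per_cpu_kthread(struct task_struct *p)
      {
              if (!(p->flags & PF_KTHREAD))
                      return false;
              if (p->nr_cpus_allowed != 1)
                      return false;
              /* only kthreads explicitly marked per-CPU count */
              return kthread_is_per_cpu(p);
      }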

    Fixes: 1cf12e08bc4d ("sched/hotplug: Consolidate task migration on CPU unplug")
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20210121103507.102416009@infradead.org

    Peter Zijlstra