30 Dec, 2020

2 commits

  • [ Upstream commit 345a957fcc95630bf5535d7668a59ed983eb49a7 ]

    do_sched_yield() invokes schedule() with interrupts disabled which is
    not allowed. This goes back to the pre git era to commit a6efb709806c
    ("[PATCH] irqlock patch 2.5.27-H6") in the history tree.

    Reenable interrupts and remove the misleading comment which "explains" it.

    Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/87r1pt7y5c.fsf@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
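
    A hedged sketch of the resulting flow (simplified; the real function also
    bumps schedstats), showing interrupts being re-enabled together with the
    runqueue lock release before schedule() runs:

        static void do_sched_yield(void)
        {
                struct rq_flags rf;
                struct rq *rq;

                rq = this_rq_lock_irq(&rf);     /* takes rq->lock, IRQs off */

                current->sched_class->yield_task(rq);

                preempt_disable();
                rq_unlock_irq(rq, &rf);         /* drop the lock AND re-enable IRQs */
                sched_preempt_enable_no_resched();

                schedule();                     /* now called with IRQs enabled */
        }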
     
  • [ Upstream commit a57415f5d1e43c3a5c5d412cd85e2792d7ed9b11 ]

    When changing sched_rt_{runtime, period}_us, we validate that the new
    settings should at least accommodate the currently allocated -dl
    bandwidth:

    sched_rt_handler()
      --> sched_dl_bandwidth_validate()
          {
              new_bw = global_rt_runtime()/global_rt_period();

              for_each_possible_cpu(cpu) {
                  dl_b = dl_bw_of(cpu);
                  if (new_bw < dl_b->total_bw)
                      ret = -EBUSY;
              }
          }

    But new_bw is a single CPU's bandwidth, while dl_b->total_bw is the
    allocated bandwidth of the whole root domain. Instead, we should compare
    dl_b->total_bw against "cpus * new_bw", where 'cpus' is the number of
    CPUs of the root domain.

    Also, the annotation below (in kernel/sched/sched.h) describes an
    implementation that only existed in SCHED_DEADLINE v2 [1]; the deadline
    scheduler kept evolving until it was merged (v9), but the annotation was
    never updated, so it is now meaningless and misleading. Update it:

    * With respect to SMP, the bandwidth is given on a per-CPU basis,
    * meaning that:
    * - dl_bw (< 100%) is the bandwidth of the system (group) on each CPU;
    * - dl_total_bw array contains, in the i-eth element, the currently
    * allocated bandwidth on the i-eth CPU.

    [1]: https://lore.kernel.org/lkml/1267385230.13676.101.camel@Palantir/

    Fixes: 332ac17ef5bf ("sched/deadline: Add bandwidth management for SCHED_DEADLINE tasks")
    Signed-off-by: Peng Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Acked-by: Juri Lelli
    Link: https://lkml.kernel.org/r/db6bbda316048cda7a1bbc9571defde193a8d67e.1602171061.git.iwtbavbm@gmail.com
    Signed-off-by: Sasha Levin

    Peng Liu
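
    A hedged sketch of the corrected check (helper names taken from
    kernel/sched; locking and the caching of already-visited root domains are
    omitted): scale the new per-CPU bandwidth by the number of CPUs in the
    root domain before comparing it with the root domain's total_bw.

        /* sketch only, not the literal patch */
        static int sched_dl_validate_sketch(u64 new_bw)
        {
                int cpu, ret = 0;

                for_each_possible_cpu(cpu) {
                        struct dl_bw *dl_b = dl_bw_of(cpu);
                        int cpus = dl_bw_cpus(cpu);  /* CPUs in this root domain */

                        /* new_bw is per CPU, total_bw is per root domain */
                        if (new_bw * cpus < dl_b->total_bw)
                                ret = -EBUSY;
                }

                return ret;
        }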
     

09 Dec, 2020

3 commits

  • membarrier()'s MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE is documented as
    syncing the core on all sibling threads but not necessarily the calling
    thread. This behavior is fundamentally buggy and cannot be used safely.

    Suppose a user program has two threads. Thread A is on CPU 0 and thread B
    is on CPU 1. Thread A modifies some text and calls
    membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE).

    Then thread B executes the modified code. At any point after membarrier()
    decides which CPUs to target, thread A could be preempted and replaced by
    thread B on CPU 0. This could even happen on exit from the membarrier()
    syscall. If this happens, thread B will end up running on CPU 0 without
    having synced.

    In principle, this could be fixed by arranging for the scheduler to issue
    sync_core_before_usermode() whenever switching between two threads in the
    same mm if there is any possibility of a concurrent membarrier() call, but
    this would have considerable overhead. Instead, make membarrier() sync the
    calling CPU as well.

    As an optimization, this avoids an extra smp_mb() in the default
    barrier-only mode and an extra rseq preempt on the caller.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Link: https://lore.kernel.org/r/250ded637696d490c69bef1877148db86066881c.1607058304.git.luto@kernel.org

    Andy Lutomirski
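
    For reference, a minimal user-space sketch of how the command above is
    meant to be used for cross-modifying code (patch_code() is a hypothetical
    stand-in for whatever rewrites the text):

        #include <linux/membarrier.h>
        #include <sys/syscall.h>
        #include <unistd.h>

        extern void patch_code(void *text);  /* hypothetical helper */

        static int membarrier(int cmd, unsigned int flags)
        {
                return syscall(__NR_membarrier, cmd, flags);
        }

        /* Once per process, before the expedited command may be used. */
        void register_sync_core(void)
        {
                membarrier(MEMBARRIER_CMD_REGISTER_PRIVATE_EXPEDITED_SYNC_CORE, 0);
        }

        /* Thread A: after modifying code, before any thread may execute it. */
        void publish_patched_code(void *text)
        {
                patch_code(text);
                membarrier(MEMBARRIER_CMD_PRIVATE_EXPEDITED_SYNC_CORE, 0);
                /* with the fix above, the calling CPU is core-serialized too */
        }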
     
  • membarrier() does not explicitly sync_core() remote CPUs; instead, it
    relies on the assumption that an IPI will result in a core sync. On x86,
    this may be true in practice, but it's not architecturally reliable. In
    particular, the SDM and APM do not appear to guarantee that interrupt
    delivery is serializing. While IRET does serialize, IPI return can
    schedule, thereby switching to another task in the same mm that was
    sleeping in a syscall. The new task could then SYSRET back to usermode
    without ever executing IRET.

    Make this more robust by explicitly calling sync_core_before_usermode()
    on remote cores. (This also helps people who search the kernel tree for
    instances of sync_core() and sync_core_before_usermode() -- one might be
    surprised that the core membarrier code doesn't currently show up in a
    such a search.)

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/776b448d5f7bd6b12690707f5ed67bcda7f1d427.1607058304.git.luto@kernel.org

    Andy Lutomirski
     
  • It seems that most RSEQ membarrier users will expect any stores done before
    the membarrier() syscall to be visible to the target task(s). While this
    is extremely likely to be true in practice, nothing actually guarantees it
    by a strict reading of the x86 manuals. Rather than providing this
    guarantee by accident and potentially causing a problem down the road, just
    add an explicit barrier.

    Fixes: 70216e18e519 ("membarrier: Provide core serializing command, *_SYNC_CORE")
    Signed-off-by: Andy Lutomirski
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Mathieu Desnoyers
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/d3e7197e034fa4852afcf370ca49c30496e58e40.1607058304.git.luto@kernel.org

    Andy Lutomirski
     

30 Nov, 2020

1 commit


24 Nov, 2020

1 commit

  • We call arch_cpu_idle() with RCU disabled, but then use
    local_irq_{en,dis}able(), which invokes tracing, which relies on RCU.

    Switch all arch_cpu_idle() implementations to use
    raw_local_irq_{en,dis}able() and carefully manage the
    lockdep,rcu,tracing state like we do in entry.

    (XXX: we really should change arch_cpu_idle() to not return with
    interrupts enabled)

    Reported-by: Sven Schnelle
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Mark Rutland
    Tested-by: Mark Rutland
    Link: https://lkml.kernel.org/r/20201120114925.594122626@infradead.org

    Peter Zijlstra
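
    A hedged sketch of the pattern the series converts the architectures to
    (generic, not any particular arch's actual code; cpu_do_idle() stands in
    for the low-level wait-for-interrupt hook):

        void arch_cpu_idle(void)
        {
                /*
                 * Entered with IRQs disabled and RCU not watching, so no
                 * tracing may run here: use the raw helper, which skips
                 * the irq-off/irq-on tracepoints.
                 */
                cpu_do_idle();
                raw_local_irq_enable();
        }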
     

23 Nov, 2020

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "A couple of scheduler fixes:

    - Make the conditional update of the overutilized state work
    correctly by caching the relevant flags state before overwriting
    them and checking them afterwards.

    - Fix a data race in the wakeup path which caused loadavg on ARM64
    platforms to become a random number generator.

    - Fix the ordering of the iowaiter accounting operations so it can't
    be decremented before it is incremented.

    - Fix a bug in the deadline scheduler vs. priority inheritance when a
    non-deadline task A has inherited the parameters of a deadline task
    B and then blocks on a non-deadline task C.

    The second inheritance step used the static deadline parameters of
    task A, which are usually 0, instead of further propagating task
    B's parameters. The zero initialized parameters trigger a bug in
    the deadline scheduler"

    * tag 'sched-urgent-2020-11-22' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/deadline: Fix priority inheritance with multiple scheduling classes
    sched: Fix rq->nr_iowait ordering
    sched: Fix data-race in wakeup
    sched/fair: Fix overutilized update in enqueue_task_fair()

    Linus Torvalds
     

17 Nov, 2020

3 commits

  • Glenn reported that "an application [he developed produces] a BUG in
    deadline.c when a SCHED_DEADLINE task contends with CFS tasks on nested
    PTHREAD_PRIO_INHERIT mutexes. I believe the bug is triggered when a CFS
    task that was boosted by a SCHED_DEADLINE task boosts another CFS task
    (nested priority inheritance)."

    ------------[ cut here ]------------
    kernel BUG at kernel/sched/deadline.c:1462!
    invalid opcode: 0000 [#1] PREEMPT SMP
    CPU: 12 PID: 19171 Comm: dl_boost_bug Tainted: ...
    Hardware name: ...
    RIP: 0010:enqueue_task_dl+0x335/0x910
    Code: ...
    RSP: 0018:ffffc9000c2bbc68 EFLAGS: 00010002
    RAX: 0000000000000009 RBX: ffff888c0af94c00 RCX: ffffffff81e12500
    RDX: 000000000000002e RSI: ffff888c0af94c00 RDI: ffff888c10b22600
    RBP: ffffc9000c2bbd08 R08: 0000000000000009 R09: 0000000000000078
    R10: ffffffff81e12440 R11: ffffffff81e1236c R12: ffff888bc8932600
    R13: ffff888c0af94eb8 R14: ffff888c10b22600 R15: ffff888bc8932600
    FS: 00007fa58ac55700(0000) GS:ffff888c10b00000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 00007fa58b523230 CR3: 0000000bf44ab003 CR4: 00000000007606e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    ? intel_pstate_update_util_hwp+0x13/0x170
    rt_mutex_setprio+0x1cc/0x4b0
    task_blocks_on_rt_mutex+0x225/0x260
    rt_spin_lock_slowlock_locked+0xab/0x2d0
    rt_spin_lock_slowlock+0x50/0x80
    hrtimer_grab_expiry_lock+0x20/0x30
    hrtimer_cancel+0x13/0x30
    do_nanosleep+0xa0/0x150
    hrtimer_nanosleep+0xe1/0x230
    ? __hrtimer_init_sleeper+0x60/0x60
    __x64_sys_nanosleep+0x8d/0xa0
    do_syscall_64+0x4a/0x100
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x7fa58b52330d
    ...
    ---[ end trace 0000000000000002 ]---

    He also provided a simple reproducer creating the situation below:

    So the execution order of locking steps is the following
    (N1 and N2 are non-deadline tasks. D1 is a deadline task. M1 and M2
    are mutexes that are enabled with priority inheritance.)

    Time moves forward as this timeline goes down:

    N1                     N2                      D1
    |                      |                       |
    |                      |                       |
    Lock(M1)               |                       |
    |                      |                       |
    |                    Lock(M2)                  |
    |                      |                       |
    |                      |                     Lock(M2)
    |                      |                       |
    |                    Lock(M1)                  |
    |                    (!!bug triggered!)        |

    Daniel reported a similar situation as well, by just letting ksoftirqd
    run with DEADLINE (and eventually block on a mutex).

    The problem is that boosted entities (Priority Inheritance) use the static
    DEADLINE parameters of the top priority waiter. However, there might be
    cases where the top waiter is a non-DEADLINE entity that is currently
    boosted by a DEADLINE entity from a different lock chain (i.e., nested
    priority chains involving entities of non-DEADLINE classes). In this
    case, the top waiter's static DEADLINE parameters could be null
    (initialized to 0 at fork()) and replenish_dl_entity() would hit a BUG().

    Fix this by keeping track of the original donor and using its parameters
    when a task is boosted.

    Reported-by: Glenn Elliott
    Reported-by: Daniel Bristot de Oliveira
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Daniel Bristot de Oliveira
    Link: https://lkml.kernel.org/r/20201117061432.517340-1-juri.lelli@redhat.com

    Juri Lelli
     
    schedule()                              ttwu()
      deactivate_task();                      if (p->on_rq && ...) // false
                                                atomic_dec(&task_rq(p)->nr_iowait);
      if (prev->in_iowait)
        atomic_inc(&rq->nr_iowait);

    Allows nr_iowait to be decremented before it gets incremented,
    resulting in more dodgy IO-wait numbers than usual.

    Note that because we can now do ttwu_queue_wakelist() before
    p->on_cpu==0, we lose the natural ordering and have to further delay
    the decrement.

    Fixes: c6e7bd7afaeb ("sched/core: Optimize ttwu() spinning on p->on_cpu")
    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mel Gorman
    Link: https://lkml.kernel.org/r/20201117093829.GD3121429@hirez.programming.kicks-ass.net

    Peter Zijlstra
     
  • enqueue_task_fair() attempts to skip the overutilized update for new
    tasks as their util_avg is not accurate yet. However, the flag we check
    to do so is overwritten earlier on in the function, which makes the
    condition pretty much a nop.

    Fix this by saving the flag early on.

    Fixes: 2802bf3cd936 ("sched/fair: Add over-utilization/tipping point indicator")
    Reported-by: Rick Yiu
    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201112111201.2081902-1-qperret@google.com

    Quentin Perret
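
    A hedged sketch of the idea (field and helper names simplified relative
    to the actual enqueue_task_fair()):

        static void enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
        {
                /* cache the wakeup bit before later code rewrites 'flags' */
                int task_new = !(flags & ENQUEUE_WAKEUP);

                /* ... enqueue path, which may modify 'flags' ... */

                /* only skip the overutilized update for genuinely new tasks */
                if (!task_new)
                        update_overutilized_status(rq);
        }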
     

16 Nov, 2020

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "A set of scheduler fixes:

    - Address a load balancer regression by making the load balancer use
    the same logic as the wakeup path to spread tasks in the LLC domain

    - Prefer the CPU on which a task run last over the local CPU in the
    fast wakeup path for asymmetric CPU capacity systems to align with
    the symmetric case. This ensures more locality and prevents massive
    migration overhead on those asymmetric systems

    - Fix a memory corruption bug in the scheduler debug code caused by
    handing a modified buffer pointer to kfree()"

    * tag 'sched-urgent-2020-11-15' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/debug: Fix memory corruption caused by multiple small reads of flags
    sched/fair: Prefer prev cpu in asymmetric wakeup path
    sched/fair: Ensure tasks spreading in LLC during LB

    Linus Torvalds
     

11 Nov, 2020

4 commits

    Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags multiple times
    with small reads causes oopses with slub corruption issues because the
    kfree is freeing an offset from a previous allocation. Fix this by adding
    a new pointer 'buf' for the allocation and kfree, and using the temporary
    pointer tmp to handle memory copies of the buf offsets.

    Fixes: 5b9f8ff7b320 ("sched/debug: Output SD flag names rather than their values")
    Reported-by: Jeff Bastian
    Signed-off-by: Colin Ian King
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com

    Colin Ian King
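
    The general shape of the bug and its fix, as a hedged sketch (not the
    literal sched/debug code): kfree() must be given the pointer returned by
    the allocator, so keep it separate from the cursor that walks the buffer.

        /* buggy: the allocation pointer itself is advanced, then freed */
        char *tmp = kmalloc(len, GFP_KERNEL);
        /* ... tmp += copied;  on each partial read ... */
        kfree(tmp);             /* frees an offset into the allocation */

        /* fixed: keep the original pointer for kfree() */
        char *buf = kmalloc(len, GFP_KERNEL);
        char *tmp = buf;
        /* ... tmp += copied;  on each partial read ... */
        kfree(buf);             /* always frees what kmalloc() returned */
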
     
    During the fast wakeup path, the scheduler always checks whether the local
    or prev CPUs are good candidates for the task before looking for other
    CPUs in the domain. With commit b7a331615d25 ("sched/fair: Add asymmetric
    CPU capacity wakeup scan") heterogeneous systems gained a dedicated path,
    but it doesn't try to reuse the prev CPU whenever possible. If the
    previous CPU is idle and belongs to the LLC domain, we should check it
    first before looking for another CPU, because it remains one of the best
    candidates and this also stabilizes task placement on the system.

    This change aligns the asymmetric path behavior with the symmetric one and
    reduces cases where the task migrates across all CPUs of the
    sd_asym_cpucapacity domains at wakeup.

    This change does not impact the normal EAS mode, but only the overloaded
    case or when EAS is not used.

    - On hikey960 with performance governor (EAS disabled)

      ./perf bench sched pipe -T -l 50000
                     mainline            w/ patch
      # migrations     999364                   0
      ops/sec          149313 (+/-0.28%)   182587 (+/-0.40%)   +22%

    - On hikey with performance governor

      ./perf bench sched pipe -T -l 50000
                     mainline            w/ patch
      # migrations          0                   0
      ops/sec           47721 (+/-0.76%)    47899 (+/-0.56%)   +0.4%

    According to tests on hikey, the patch doesn't impact symmetric systems
    compared to the current implementation (only tested on arm64).

    Also read the uclamped value of the task's utilization at most twice,
    instead of each time we compare the task's utilization with the CPU's
    capacity.

    Fixes: b7a331615d25 ("sched/fair: Add asymmetric CPU capacity wakeup scan")
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Dietmar Eggemann
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201029161824.26389-1-vincent.guittot@linaro.org

    Vincent Guittot
     
    schbench shows a latency increase for the 95th percentile and above since
    commit 0b0695f2b34a ("sched/fair: Rework load_balance()").

    Align the behavior of the load balancer with the wakeup path, which tries
    to select an idle CPU which belongs to the LLC for a waking task.

    calculate_imbalance() will use nr_running instead of the spare capacity
    when CPUs share resources (i.e. cache) at the domain level. This will
    ensure a better spread of tasks on idle CPUs.

    Running schbench on a hikey (8cores arm64) shows the problem:

    tip/sched/core :
    schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
    Latency percentiles (usec)
    50.0th: 33
    75.0th: 45
    90.0th: 51
    95.0th: 4152
    *99.0th: 14288
    99.5th: 14288
    99.9th: 14288
    min=0, max=14276

    tip/sched/core + patch :
    schbench -m 2 -t 4 -s 10000 -c 1000000 -r 10
    Latency percentiles (usec)
    50.0th: 34
    75.0th: 47
    90.0th: 52
    95.0th: 78
    *99.0th: 94
    99.5th: 94
    99.9th: 94
    min=0, max=94

    Fixes: 0b0695f2b34a ("sched/fair: Rework load_balance()")
    Reported-by: Chris Mason
    Suggested-by: Rik van Riel
    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Tested-by: Rik van Riel
    Link: https://lkml.kernel.org/r/20201102102457.28808-1-vincent.guittot@linaro.org

    Vincent Guittot
     
    A new cpufreq governor flag will be added subsequently, so replace the
    bool dynamic_switching field in struct cpufreq_governor with a flags
    field and introduce CPUFREQ_GOV_DYNAMIC_SWITCHING for the "dynamic
    switching" governors to set instead of it.

    No intentional functional impact.

    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar

    Rafael J. Wysocki
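
    Roughly, the change replaces the boolean with a bit in a flags word,
    along the lines of this hedged sketch of struct cpufreq_governor:

        /* before */
        struct cpufreq_governor {
                /* ... */
                bool dynamic_switching;
        };

        /* after */
        #define CPUFREQ_GOV_DYNAMIC_SWITCHING   BIT(0)

        struct cpufreq_governor {
                /* ... */
                u8 flags;
        };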
     

03 Nov, 2020

1 commit

  • The cpufreq policy's frequency limits (min/max) can get changed at any
    point of time, while schedutil is trying to update the next frequency.
    Though the schedutil governor has necessary locking and support in place
    to make sure we don't miss any of those updates, there is a corner case
    where the governor will find that the CPU is already running at the
    desired frequency and so may skip an update.

    For example, consider that the CPU can run at 1 GHz, 1.2 GHz and 1.4 GHz
    and is running at 1 GHz currently. Schedutil tries to update the
    frequency to 1.2 GHz, during this time the policy limits get changed as
    policy->min = 1.4 GHz. As schedutil (and cpufreq core) does clamp the
    frequency at various instances, we will eventually set the frequency to
    1.4 GHz, while we will save 1.2 GHz in sg_policy->next_freq.

    Now let's say the policy limits get changed back at this time with
    policy->min as 1 GHz. The next time schedutil is invoked by the
    scheduler, we will reevaluate the next frequency (because
    need_freq_update will get set due to the limits change event) and let's
    say we want to set the frequency to 1.2 GHz again. At this point
    sugov_update_next_freq() will find next_freq == current_freq and will
    abort the update, while the CPU actually runs at 1.4 GHz.

    Until now need_freq_update was used as a flag to indicate that the
    policy's frequency limits have changed, and that we should consider the
    new limits while reevaluating the next frequency.

    This patch fixes the above mentioned issue by extending the purpose of
    the need_freq_update flag. If this flag is set now, the schedutil
    governor will not try to abort a frequency change even if next_freq ==
    current_freq.

    As similar behavior is required in the case of
    CPUFREQ_NEED_UPDATE_LIMITS flag as well, need_freq_update will never be
    set to false if that flag is set for the driver.

    We also don't need to consider the need_freq_update flag in
    sugov_update_single() anymore to handle the special case of busy CPU, as
    we won't abort a frequency update anymore.

    Reported-by: zhuguangqing
    Suggested-by: Rafael J. Wysocki
    Signed-off-by: Viresh Kumar
    [ rjw: Rearrange code to avoid a branch ]
    Signed-off-by: Rafael J. Wysocki

    Viresh Kumar
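
    A hedged sketch of the resulting logic in sugov_update_next_freq()
    (simplified): a pending limits change always forces the update through,
    and need_freq_update stays set while the driver has
    CPUFREQ_NEED_UPDATE_LIMITS.

        static bool sugov_update_next_freq(struct sugov_policy *sg_policy, u64 time,
                                           unsigned int next_freq)
        {
                if (sg_policy->need_freq_update)
                        sg_policy->need_freq_update =
                                cpufreq_driver_test_flags(CPUFREQ_NEED_UPDATE_LIMITS);
                else if (sg_policy->next_freq == next_freq)
                        return false;   /* only skip when no limits update is pending */

                sg_policy->next_freq = next_freq;
                sg_policy->last_freq_update_time = time;

                return true;
        }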
     

29 Oct, 2020

1 commit

  • Because sugov_update_next_freq() may skip a frequency update even if
    the need_freq_update flag has been set for the policy at hand, policy
    limits updates may not take effect as expected.

    For example, if the intel_pstate driver operates in the passive mode
    with HWP enabled, it needs to update the HWP min and max limits when
    the policy min and max limits change, respectively, but that may not
    happen if the target frequency does not change along with the limit
    at hand. In particular, if the policy min is changed first, causing
    the target frequency to be adjusted to it, and the policy max limit
    is changed later to the same value, the HWP max limit will not be
    updated to follow it as expected, because the target frequency is
    still equal to the policy min limit and it will not change until
    that limit is updated.

    To address this issue, modify get_next_freq() to let the driver
    callback run if the CPUFREQ_NEED_UPDATE_LIMITS cpufreq driver flag
    is set regardless of whether or not the new frequency to set is
    equal to the previous one.

    Fixes: f6ebbcf08f37 ("cpufreq: intel_pstate: Implement passive mode with HWP enabled")
    Reported-by: Zhang Rui
    Tested-by: Zhang Rui
    Cc: 5.9+ # 5.9+: 1c534352f47f cpufreq: Introduce CPUFREQ_NEED_UPDATE_LIMITS ...
    Cc: 5.9+ # 5.9+: a62f68f5ca53 cpufreq: Introduce cpufreq_driver_test_flags()
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Signed-off-by: Rafael J. Wysocki

    Rafael J. Wysocki
     

26 Oct, 2020

2 commits

  • Use a more generic form for __section that requires quotes to avoid
    complications with clang and gcc differences.

    Remove the quote operator # from compiler_attributes.h __section macro.

    Convert all unquoted __section(foo) uses to quoted __section("foo").
    Also convert __attribute__((section("foo"))) uses to __section("foo")
    even if the __attribute__ has multiple list entry forms.

    Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

    Signed-off-by: Joe Perches
    Reviewed-by: Nick Desaulniers
    Reviewed-by: Miguel Ojeda
    Signed-off-by: Linus Torvalds

    Joe Perches
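
    The conversion amounts to the following (".data.example" is an
    illustrative section name):

        /* before: unquoted __section and open-coded attribute forms */
        static int foo __section(.data.example);
        static int bar __attribute__((section(".data.example")));

        /* after: one quoted __section form everywhere */
        static int foo __section(".data.example");
        static int bar __section(".data.example");
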
     
  • Pull scheduler fixes from Thomas Gleixner:
    "Two scheduler fixes:

    - A trivial build fix for sched_feat() to compile correctly with
    CONFIG_JUMP_LABEL=n

    - Replace a zero length array with a flexible array"

    * tag 'sched-urgent-2020-10-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/features: Fix !CONFIG_JUMP_LABEL case
    sched: Replace zero-length array with flexible-array

    Linus Torvalds
     

24 Oct, 2020

1 commit

  • Pull more power management updates from Rafael Wysocki:
    "First of all, the adaptive voltage scaling (AVS) drivers go to new
    platform-specific locations as planned (this part was reported to have
    merge conflicts against the new arm-soc updates in linux-next).

    In addition to that, there are some fixes (intel_idle, intel_pstate,
    RAPL, acpi_cpufreq), the addition of on/off notifiers and idle state
    accounting support to the generic power domains (genpd) code and some
    janitorial changes all over.

    Specifics:

    - Move the AVS drivers to new platform-specific locations and get rid
    of the drivers/power/avs directory (Ulf Hansson).

    - Add on/off notifiers and idle state accounting support to the
    generic power domains (genpd) framework (Ulf Hansson, Lina Iyer).

    - Ulf will maintain the PM domain part of cpuidle-psci (Ulf Hansson).

    - Make intel_idle disregard ACPI _CST if it cannot use the data
    returned by that method (Mel Gorman).

    - Modify intel_pstate to avoid leaving useless sysfs directory
    structure behind if it cannot be registered (Chen Yu).

    - Fix domain detection in the RAPL power capping driver and prevent
    it from failing to enumerate the Psys RAPL domain (Zhang Rui).

    - Allow acpi-cpufreq to use ACPI _PSD information with Family 19 and
    later AMD chips (Wei Huang).

    - Update the driver assumptions comment in intel_idle and fix a
    kerneldoc comment in the runtime PM framework (Alexander Monakov,
    Bean Huo).

    - Avoid unnecessary resets of the cached frequency in the schedutil
    cpufreq governor to reduce overhead (Wei Wang).

    - Clean up the cpufreq core a bit (Viresh Kumar).

    - Make assorted minor janitorial changes (Daniel Lezcano, Geert
    Uytterhoeven, Hubert Jasudowicz, Tom Rix).

    - Clean up and optimize the cpupower utility somewhat (Colin Ian
    King, Martin Kaistra)"

    * tag 'pm-5.10-rc1-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (23 commits)
    PM: sleep: remove unreachable break
    PM: AVS: Drop the avs directory and the corresponding Kconfig
    PM: AVS: qcom-cpr: Move the driver to the qcom specific drivers
    PM: runtime: Fix typo in pm_runtime_set_active() helper comment
    PM: domains: Fix build error for genpd notifiers
    powercap: Fix typo in Kconfig "Plance" -> "Plane"
    cpufreq: schedutil: restore cached freq when next_f is not changed
    acpi-cpufreq: Honor _PSD table setting on new AMD CPUs
    PM: AVS: smartreflex Move driver to soc specific drivers
    PM: AVS: rockchip-io: Move the driver to the rockchip specific drivers
    PM: domains: enable domain idle state accounting
    PM: domains: Add curly braces to delimit comment + statement block
    PM: domains: Add support for PM domain on/off notifiers for genpd
    powercap/intel_rapl: enumerate Psys RAPL domain together with package RAPL domain
    powercap/intel_rapl: Fix domain detection
    intel_idle: Ignore _CST if control cannot be taken from the platform
    cpuidle: Remove pointless stub
    intel_idle: mention assumption that WBINVD is not needed
    MAINTAINERS: Add section for cpuidle-psci PM domain
    cpufreq: intel_pstate: Delete intel_pstate sysfs if failed to register the driver
    ...

    Linus Torvalds
     

19 Oct, 2020

1 commit

    We have the raw cached freq to reduce the chance of calling the cpufreq
    driver, which can be costly on some archs/SoCs.

    Currently, the raw cached freq is reset in sugov_update_single() when it
    avoids a frequency reduction (which is sometimes not desirable), but it
    is better to restore its previous value in that case, because it may not
    change in the next cycle and it is then not necessary to change the CPU
    frequency.

    Adapted from https://android-review.googlesource.com/1352810/

    Signed-off-by: Wei Wang
    Acked-by: Viresh Kumar
    [ rjw: Subject edit and changelog rewrite ]
    Signed-off-by: Rafael J. Wysocki

    Wei Wang
     

18 Oct, 2020

1 commit

  • A previous commit changed the notification mode from true/false to an
    int, allowing notify-no, notify-yes, or signal-notify. This was
    backwards compatible in the sense that any existing true/false user
    would translate to either 0 (no notification sent) or 1, the latter of
    which mapped to TWA_RESUME. TWA_SIGNAL was assigned a value of 2.

    Clean this up properly, and define a proper enum for the notification
    mode. Now we have:

    - TWA_NONE. This is 0, same as before the original change, meaning no
    notification requested.
    - TWA_RESUME. This is 1, same as before the original change, meaning
    that we use TIF_NOTIFY_RESUME.
    - TWA_SIGNAL. This uses TIF_SIGPENDING/JOBCTL_TASK_WORK for the
    notification.

    Clean up all the callers, switching their 0/1/false/true to using the
    appropriate TWA_* mode for notifications.

    Fixes: e91b48162332 ("task_work: teach task_work_add() to do signal_wake_up()")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: Jens Axboe

    Jens Axboe
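
    The resulting modes, as a sketch of the enum (matching the values
    described above):

        enum task_work_notify_mode {
                TWA_NONE,       /* 0: no notification */
                TWA_RESUME,     /* 1: notify via TIF_NOTIFY_RESUME */
                TWA_SIGNAL,     /* 2: notify via TIF_SIGPENDING/JOBCTL_TASK_WORK */
        };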
     

15 Oct, 2020

3 commits

  • Commit:

    765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")

    made sched features static for !CONFIG_SCHED_DEBUG configurations, but
    overlooked the CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL cases.

    For the latter, echoing changes to /sys/kernel/debug/sched_features has
    the nasty effect of effectively changing what sched_features reports,
    but without actually changing the scheduler behaviour (since different
    translation units get different sysctl_sched_features).

    Fix CONFIG_SCHED_DEBUG=y and !CONFIG_JUMP_LABEL configurations by properly
    restructuring ifdefs.

    Fixes: 765cc3a4b224e ("sched/core: Optimize sched_feat() for !CONFIG_SCHED_DEBUG builds")
    Co-developed-by: Daniel Bristot de Oliveira
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Juri Lelli
    Signed-off-by: Ingo Molnar
    Acked-by: Patrick Bellasi
    Reviewed-by: Valentin Schneider
    Link: https://lore.kernel.org/r/20201013053114.160628-1-juri.lelli@redhat.com

    Juri Lelli
     
  • In the following commit:

    04f5c362ec6d: ("sched/fair: Replace zero-length array with flexible-array")

    a zero-length array cpumask[0] has been replaced with cpumask[].
    But there is still a cpumask[0] in 'struct sched_group_capacity'
    which was missed.

    The point of using [] instead of [0] is that with [] the compiler will
    generate a build warning if it isn't the last member of a struct.

    [ mingo: Rewrote the changelog. ]

    Signed-off-by: zhuguangqing
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20201014140220.11384-1-zhuguangqing83@gmail.com

    zhuguangqing
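
    The change itself is small; the pattern is:

        /* before: zero-length array, no diagnostics if it is not last */
        struct sched_group_capacity {
                /* ... */
                unsigned long cpumask[0];
        };

        /* after: flexible array member, must be the last member */
        struct sched_group_capacity {
                /* ... */
                unsigned long cpumask[];
        };
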
     
  • Pull power management updates from Rafael Wysocki:
    "These rework the collection of cpufreq statistics to allow it to take
    place if fast frequency switching is enabled in the governor, rework
    the frequency invariance handling in the cpufreq core and drivers, add
    new hardware support to a couple of cpufreq drivers, fix a number of
    assorted issues and clean up the code all over.

    Specifics:

    - Rework cpufreq statistics collection to allow it to take place when
    fast frequency switching is enabled in the governor (Viresh Kumar).

    - Make the cpufreq core set the frequency scale on behalf of the
    driver and update several cpufreq drivers accordingly (Ionela
    Voinescu, Valentin Schneider).

    - Add new hardware support to the STI and qcom cpufreq drivers and
    improve them (Alain Volmat, Manivannan Sadhasivam).

    - Fix multiple assorted issues in cpufreq drivers (Jon Hunter,
    Krzysztof Kozlowski, Matthias Kaehlcke, Pali Rohár, Stephan
    Gerhold, Viresh Kumar).

    - Fix several assorted issues in the operating performance points
    (OPP) framework (Stephan Gerhold, Viresh Kumar).

    - Allow devfreq drivers to fetch devfreq instances by DT enumeration
    instead of using explicit phandles and modify the devfreq core code
    to support driver-specific devfreq DT bindings (Leonard Crestez,
    Chanwoo Choi).

    - Improve initial hardware resetting in the tegra30 devfreq driver
    and clean up the tegra cpuidle driver (Dmitry Osipenko).

    - Update the cpuidle core to collect state entry rejection statistics
    and expose them via sysfs (Lina Iyer).

    - Improve the ACPI _CST code handling diagnostics (Chen Yu).

    - Update the PSCI cpuidle driver to allow the PM domain
    initialization to occur in the OSI mode as well as in the PC mode
    (Ulf Hansson).

    - Rework the generic power domains (genpd) core code to allow domain
    power off transition to be aborted in the absence of the "power
    off" domain callback (Ulf Hansson).

    - Fix two suspend-to-idle issues in the ACPI EC driver (Rafael
    Wysocki).

    - Fix the handling of timer_expires in the PM-runtime framework on
    32-bit systems and the handling of device links in it (Grygorii
    Strashko, Xiang Chen).

    - Add IO requests batching support to the hibernate image saving and
    reading code and drop a bogus get_gendisk() from there (Xiaoyi
    Chen, Christoph Hellwig).

    - Allow PCIe ports to be put into the D3cold power state if they are
    power-manageable via ACPI (Lukas Wunner).

    - Add missing header file include to a power capping driver (Pujin
    Shi).

    - Clean up the qcom-cpr AVS driver a bit (Liu Shixin).

    - Kevin Hilman steps down as designated reviewer of adaptive voltage
    scaling (AVS) drivers (Kevin Hilman)"

    * tag 'pm-5.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (65 commits)
    cpufreq: stats: Fix string format specifier mismatch
    arm: disable frequency invariance for CONFIG_BL_SWITCHER
    cpufreq,arm,arm64: restructure definitions of arch_set_freq_scale()
    cpufreq: stats: Add memory barrier to store_reset()
    cpufreq: schedutil: Simplify sugov_fast_switch()
    ACPI: EC: PM: Drop ec_no_wakeup check from acpi_ec_dispatch_gpe()
    ACPI: EC: PM: Flush EC work unconditionally after wakeup
    PCI/ACPI: Whitelist hotplug ports for D3 if power managed by ACPI
    PM: hibernate: remove the bogus call to get_gendisk() in software_resume()
    cpufreq: Move traces and update to policy->cur to cpufreq core
    cpufreq: stats: Enable stats for fast-switch as well
    cpufreq: stats: Mark few conditionals with unlikely()
    cpufreq: stats: Remove locking
    cpufreq: stats: Defer stats update to cpufreq_stats_record_transition()
    PM: domains: Allow to abort power off when no ->power_off() callback
    PM: domains: Rename power state enums for genpd
    PM / devfreq: tegra30: Improve initial hardware resetting
    PM / devfreq: event: Change prototype of devfreq_event_get_edev_by_phandle function
    PM / devfreq: Change prototype of devfreq_get_devfreq_by_phandle function
    PM / devfreq: Add devfreq_get_devfreq_by_node function
    ...

    Linus Torvalds
     

13 Oct, 2020

1 commit

  • Pull scheduler updates from Ingo Molnar:

    - reorganize & clean up the SD* flags definitions and add a bunch of
    sanity checks. These new checks caught quite a few bugs or at least
    inconsistencies, resulting in another set of patches.

    - rseq updates, add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ

    - add a new tracepoint to improve CPU capacity tracking

    - improve overloaded SMP system load-balancing behavior

    - tweak SMT balancing

    - energy-aware scheduling updates

    - NUMA balancing improvements

    - deadline scheduler fixes and improvements

    - CPU isolation fixes

    - misc cleanups, simplifications and smaller optimizations

    * tag 'sched-core-2020-10-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (42 commits)
    sched/deadline: Unthrottle PI boosted threads while enqueuing
    sched/debug: Add new tracepoint to track cpu_capacity
    sched/fair: Tweak pick_next_entity()
    rseq/selftests: Test MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
    rseq/selftests,x86_64: Add rseq_offset_deref_addv()
    rseq/membarrier: Add MEMBARRIER_CMD_PRIVATE_EXPEDITED_RSEQ
    sched/fair: Use dst group while checking imbalance for NUMA balancer
    sched/fair: Reduce busy load balance interval
    sched/fair: Minimize concurrent LBs between domain level
    sched/fair: Reduce minimal imbalance threshold
    sched/fair: Relax constraint on task's load during load balance
    sched/fair: Remove the force parameter of update_tg_load_avg()
    sched/fair: Fix wrong cpu selecting from isolated domain
    sched: Remove unused inline function uclamp_bucket_base_value()
    sched/rt: Disable RT_RUNTIME_SHARE by default
    sched/deadline: Fix stale throttling on de-/boosted tasks
    sched/numa: Use runnable_avg to classify node
    sched/topology: Move sd_flag_debug out of #ifdef CONFIG_SYSCTL
    MAINTAINERS: Add myself as SCHED_DEADLINE reviewer
    sched/topology: Move SD_DEGENERATE_GROUPS_MASK out of linux/sched/topology.h
    ...

    Linus Torvalds
     

07 Oct, 2020

1 commit


05 Oct, 2020

1 commit


03 Oct, 2020

3 commits

  • stress-ng has a test (stress-ng --cyclic) that creates a set of threads
    under SCHED_DEADLINE with the following parameters:

    dl_runtime = 10000 (10 us)
    dl_deadline = 100000 (100 us)
    dl_period = 100000 (100 us)

    These parameters are very aggressive. When using a system without HRTICK
    set, these threads can easily execute longer than the dl_runtime because
    the throttling happens with 1/HZ resolution.

    During the main part of the test, the system works just fine because
    the workload does not try to run over the 10 us. The problem happens at
    the end of the test, on the exit() path. During exit(), the threads need
    to do some cleanups that require real-time mutex locks, mainly those
    related to memory management, resulting in this scenario:

    Note: locks are rt_mutexes...
    ------------------------------------------------------------------------
       TASK A:              TASK B:                 TASK C:
       activation
                            activation
                                                    activation

       lock(a): OK!         lock(b): OK!
                            lock(a)
                            -> block (task A owns it)
                              -> self notice/set throttled
     +--<                     -> arm replenished timer
     |                      switch-out
     |                                              lock(b)
     |                                              -> (C prio > B prio)
     |                                              -> boost TASK B
     |  unlock(a)                                   switch-out
     |  -> handle lock a to B
     |    -> wakeup(B)
     |      -> B is throttled:
     |        -> do not enqueue
     |  switch-out
     |
     |
     +---------------------> replenishment timer
                            -> TASK B is boosted:
                              -> do not enqueue
    ------------------------------------------------------------------------

    BOOM: TASK B is runnable but !enqueued, holding TASK C: the system
    crashes with hung task C.

    This problem is avoided by removing the throttle state from the boosted
    thread while boosting it (by TASK A in the example above), allowing it to
    be queued and run boosted.

    The next replenishment will take care of the runtime overrun, pushing
    the deadline further away. See the "while (dl_se->runtime <= 0)" loop in
    replenish_dl_entity().
    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Juri Lelli
    Tested-by: Mark Simmons
    Link: https://lkml.kernel.org/r/5076e003450835ec74e6fa5917d02c4fa41687e6.1600170294.git.bristot@redhat.com

    Daniel Bristot de Oliveira
     
  • rq->cpu_capacity is a key element in several scheduler parts, such as EAS
    task placement and load balancing. Tracking this value enables testing
    and/or debugging by a toolkit.

    Signed-off-by: Vincent Donnefort
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/1598605249-72651-1-git-send-email-vincent.donnefort@arm.com

    Vincent Donnefort
     
  • Currently, pick_next_entity(...) has the following structure
    (simplified):

    [...]
    if (last_buddy_ok())
            result = last_buddy;
    if (next_buddy_ok())
            result = next_buddy;
    [...]

    The intended behavior is to prefer next buddy over last buddy;
    the current code somewhat obfuscates this, and also wastes
    cycles checking the last buddy when eventually the next buddy is
    picked up.

    So this patch refactors two 'ifs' above into

    [...]
    if (next_buddy_ok())
            result = next_buddy;
    else if (last_buddy_ok())
            result = last_buddy;
    [...]

    Signed-off-by: Peter Oskolkov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/20200930173532.1069092-1-posk@google.com

    Peter Oskolkov
     

25 Sep, 2020

8 commits

  • This patchset is based on Google-internal RSEQ work done by Paul
    Turner and Andrew Hunter.

    When working with per-CPU RSEQ-based memory allocations, it is
    sometimes important to make sure that a global memory location is no
    longer accessed from RSEQ critical sections. For example, there can be
    two per-CPU lists, one is "active" and accessed per-CPU, while another
    one is inactive and worked on asynchronously "off CPU" (e.g. garbage
    collection is performed). Then at some point the two lists are
    swapped, and a fast RCU-like mechanism is required to make sure that
    the previously active list is no longer accessed.

    This patch introduces such a mechanism: in short, membarrier() syscall
    issues an IPI to a CPU, restarting a potentially active RSEQ critical
    section on the CPU.

    Signed-off-by: Peter Oskolkov
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Mathieu Desnoyers
    Link: https://lkml.kernel.org/r/20200923233618.2572849-1-posk@google.com

    Peter Oskolkov
     
  • Barry Song noted the following

    Something is wrong. In find_busiest_group(), we are checking if
    src has higher load, however, in task_numa_find_cpu(), we are
    checking if dst will have higher load after balancing. It seems
    it is not sensible to check src.

    It maybe cause wrong imbalance value, for example,

    if dst_running = env->dst_stats.nr_running + 1 results in 3 or
    above, and src_running = env->src_stats.nr_running - 1 results
    in 1;

    The current code is thinking imbalance as 0 since src_running is
    smaller than 2. This is inconsistent with load balancer.

    Basically, in find_busiest_group(), the NUMA imbalance is ignored if moving
    a task "from an almost idle domain" to a "domain with spare capacity". This
    patch forbids movement "from a misplaced domain" to "an almost idle domain"
    as that is closer to what the CPU load balancer expects.

    This patch is not a universal win. The old behaviour was intended to allow
    a task from an almost idle NUMA node to migrate to its preferred node if
    the destination had capacity but there are corner cases. For example,
    a NAS compute load could be parallelised to use 1/3rd of available CPUs
    but not all those potential tasks are active at all times allowing this
    logic to trigger. An obvious example is specjbb 2005 running various
    numbers of warehouses on a 2 socket box with 80 cpus.

    specjbb
    5.9.0-rc4 5.9.0-rc4
    vanilla dstbalance-v1r1
    Hmean tput-1 46425.00 ( 0.00%) 43394.00 * -6.53%*
    Hmean tput-2 98416.00 ( 0.00%) 96031.00 * -2.42%*
    Hmean tput-3 150184.00 ( 0.00%) 148783.00 * -0.93%*
    Hmean tput-4 200683.00 ( 0.00%) 197906.00 * -1.38%*
    Hmean tput-5 236305.00 ( 0.00%) 245549.00 * 3.91%*
    Hmean tput-6 281559.00 ( 0.00%) 285692.00 * 1.47%*
    Hmean tput-7 338558.00 ( 0.00%) 334467.00 * -1.21%*
    Hmean tput-8 340745.00 ( 0.00%) 372501.00 * 9.32%*
    Hmean tput-9 424343.00 ( 0.00%) 413006.00 * -2.67%*
    Hmean tput-10 421854.00 ( 0.00%) 434261.00 * 2.94%*
    Hmean tput-11 493256.00 ( 0.00%) 485330.00 * -1.61%*
    Hmean tput-12 549573.00 ( 0.00%) 529959.00 * -3.57%*
    Hmean tput-13 593183.00 ( 0.00%) 555010.00 * -6.44%*
    Hmean tput-14 588252.00 ( 0.00%) 599166.00 * 1.86%*
    Hmean tput-15 623065.00 ( 0.00%) 642713.00 * 3.15%*
    Hmean tput-16 703924.00 ( 0.00%) 660758.00 * -6.13%*
    Hmean tput-17 666023.00 ( 0.00%) 697675.00 * 4.75%*
    Hmean tput-18 761502.00 ( 0.00%) 758360.00 * -0.41%*
    Hmean tput-19 796088.00 ( 0.00%) 798368.00 * 0.29%*
    Hmean tput-20 733564.00 ( 0.00%) 823086.00 * 12.20%*
    Hmean tput-21 840980.00 ( 0.00%) 856711.00 * 1.87%*
    Hmean tput-22 804285.00 ( 0.00%) 872238.00 * 8.45%*
    Hmean tput-23 795208.00 ( 0.00%) 889374.00 * 11.84%*
    Hmean tput-24 848619.00 ( 0.00%) 966783.00 * 13.92%*
    Hmean tput-25 750848.00 ( 0.00%) 903790.00 * 20.37%*
    Hmean tput-26 780523.00 ( 0.00%) 962254.00 * 23.28%*
    Hmean tput-27 1042245.00 ( 0.00%) 991544.00 * -4.86%*
    Hmean tput-28 1090580.00 ( 0.00%) 1035926.00 * -5.01%*
    Hmean tput-29 999483.00 ( 0.00%) 1082948.00 * 8.35%*
    Hmean tput-30 1098663.00 ( 0.00%) 1113427.00 * 1.34%*
    Hmean tput-31 1125671.00 ( 0.00%) 1134175.00 * 0.76%*
    Hmean tput-32 968167.00 ( 0.00%) 1250286.00 * 29.14%*
    Hmean tput-33 1077676.00 ( 0.00%) 1060893.00 * -1.56%*
    Hmean tput-34 1090538.00 ( 0.00%) 1090933.00 * 0.04%*
    Hmean tput-35 967058.00 ( 0.00%) 1107421.00 * 14.51%*
    Hmean tput-36 1051745.00 ( 0.00%) 1210663.00 * 15.11%*
    Hmean tput-37 1019465.00 ( 0.00%) 1351446.00 * 32.56%*
    Hmean tput-38 1083102.00 ( 0.00%) 1064541.00 * -1.71%*
    Hmean tput-39 1232990.00 ( 0.00%) 1303623.00 * 5.73%*
    Hmean tput-40 1175542.00 ( 0.00%) 1340943.00 * 14.07%*
    Hmean tput-41 1127826.00 ( 0.00%) 1339492.00 * 18.77%*
    Hmean tput-42 1198313.00 ( 0.00%) 1411023.00 * 17.75%*
    Hmean tput-43 1163733.00 ( 0.00%) 1228253.00 * 5.54%*
    Hmean tput-44 1305562.00 ( 0.00%) 1357886.00 * 4.01%*
    Hmean tput-45 1326752.00 ( 0.00%) 1406061.00 * 5.98%*
    Hmean tput-46 1339424.00 ( 0.00%) 1418451.00 * 5.90%*
    Hmean tput-47 1415057.00 ( 0.00%) 1381570.00 * -2.37%*
    Hmean tput-48 1392003.00 ( 0.00%) 1421167.00 * 2.10%*
    Hmean tput-49 1408374.00 ( 0.00%) 1418659.00 * 0.73%*
    Hmean tput-50 1359822.00 ( 0.00%) 1391070.00 * 2.30%*
    Hmean tput-51 1414246.00 ( 0.00%) 1392679.00 * -1.52%*
    Hmean tput-52 1432352.00 ( 0.00%) 1354020.00 * -5.47%*
    Hmean tput-53 1387563.00 ( 0.00%) 1409563.00 * 1.59%*
    Hmean tput-54 1406420.00 ( 0.00%) 1388711.00 * -1.26%*
    Hmean tput-55 1438804.00 ( 0.00%) 1387472.00 * -3.57%*
    Hmean tput-56 1399465.00 ( 0.00%) 1400296.00 * 0.06%*
    Hmean tput-57 1428132.00 ( 0.00%) 1396399.00 * -2.22%*
    Hmean tput-58 1432385.00 ( 0.00%) 1386253.00 * -3.22%*
    Hmean tput-59 1421612.00 ( 0.00%) 1371416.00 * -3.53%*
    Hmean tput-60 1429423.00 ( 0.00%) 1389412.00 * -2.80%*
    Hmean tput-61 1396230.00 ( 0.00%) 1351122.00 * -3.23%*
    Hmean tput-62 1418396.00 ( 0.00%) 1383098.00 * -2.49%*
    Hmean tput-63 1409918.00 ( 0.00%) 1374662.00 * -2.50%*
    Hmean tput-64 1410236.00 ( 0.00%) 1376216.00 * -2.41%*
    Hmean tput-65 1396405.00 ( 0.00%) 1364418.00 * -2.29%*
    Hmean tput-66 1395975.00 ( 0.00%) 1357326.00 * -2.77%*
    Hmean tput-67 1392986.00 ( 0.00%) 1349642.00 * -3.11%*
    Hmean tput-68 1386541.00 ( 0.00%) 1343261.00 * -3.12%*
    Hmean tput-69 1374407.00 ( 0.00%) 1342588.00 * -2.32%*
    Hmean tput-70 1377513.00 ( 0.00%) 1334654.00 * -3.11%*
    Hmean tput-71 1369319.00 ( 0.00%) 1334952.00 * -2.51%*
    Hmean tput-72 1354635.00 ( 0.00%) 1329005.00 * -1.89%*
    Hmean tput-73 1350933.00 ( 0.00%) 1318942.00 * -2.37%*
    Hmean tput-74 1351714.00 ( 0.00%) 1316347.00 * -2.62%*
    Hmean tput-75 1352198.00 ( 0.00%) 1309974.00 * -3.12%*
    Hmean tput-76 1349490.00 ( 0.00%) 1286064.00 * -4.70%*
    Hmean tput-77 1336131.00 ( 0.00%) 1303684.00 * -2.43%*
    Hmean tput-78 1308896.00 ( 0.00%) 1271024.00 * -2.89%*
    Hmean tput-79 1326703.00 ( 0.00%) 1290862.00 * -2.70%*
    Hmean tput-80 1336199.00 ( 0.00%) 1291629.00 * -3.34%*

    The performance at the mid-point is better but not universally better. The
    patch is a mixed bag depending on the workload, machine and overall
    levels of utilisation. Sometimes it's better (sometimes much better),
    other times it is worse (sometimes much worse). Given that there isn't a
    universally good decision in this section and more people seem to prefer
    the patch, it may be best to keep the LB decisions consistent and
    revisit imbalance handling when the load balancer code changes settle down.

    Jirka Hladky added the following observation.

    Our results are mostly in line with what you see. We observe
    big gains (20-50%) when the system is loaded to 1/3 of the
    maximum capacity and mixed results at the full load - some
    workloads benefit from the patch at the full load, others not,
    but performance changes at the full load are mostly within the
    noise of results (+/-5%). Overall, we think this patch is helpful.

    [mgorman@techsingularity.net: Rewrote changelog]
    Fixes: fb86f5b211 ("sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity")
    Signed-off-by: Barry Song
    Signed-off-by: Mel Gorman
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200921221849.GI3179@techsingularity.net

    Barry Song
     
    The busy_factor, which increases the load balance interval when a CPU is
    busy, is set to 32 by default. This value generates some huge LB intervals
    on a large system like the THX2, made of 2 nodes x 28 cores x 4 threads.
    For such a system, the interval increases from 112ms to 3584ms at the MC
    level, and from 228ms to 7168ms at the NUMA level.

    Even on smaller systems, a lower busy factor has shown improvement in the
    fair distribution of the running time, so reduce it for all.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-5-vincent.guittot@linaro.org

    Vincent Guittot
     
    Sched domains tend to trigger the load balance loop simultaneously, but
    the larger domains often need more time to collect statistics. This
    slowness makes the larger domain try to detach tasks from a rq whereas
    tasks have already migrated somewhere else at a sub-domain level. This is
    not a real problem for idle LB because the period of smaller domains will
    increase while their CPUs are busy, which leaves time for the higher ones
    to pull tasks. But this becomes a problem when all CPUs are already busy,
    because all domains stay synced when they trigger their LB.

    A simple way to minimize simultaneous LB of all domains is to decrement
    the busy interval by 1 jiffy. Because of the busy_factor, the interval of
    a larger domain will no longer be a multiple of the smaller ones.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-4-vincent.guittot@linaro.org

    Vincent Guittot
     
    The 25% default imbalance threshold for the DIE and NUMA domains is large
    enough to generate significant unfairness between threads. A typical
    example is the case of 11 threads running on 2x4 CPUs. The imbalance of
    20% between the 2 groups of 4 cores is just low enough to not trigger
    the load balance between the 2 groups. We will always have the same 6
    threads on one group of 4 CPUs and the other 5 threads on the other
    group of CPUs. With fair time sharing in each group, we end up with
    +20% running time for the group of 5 threads.

    Consider decreasing the imbalance threshold for overloaded case where we
    use the load to balance task and to ensure fair time sharing.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Acked-by: Hillf Danton
    Link: https://lkml.kernel.org/r/20200921072424.14813-3-vincent.guittot@linaro.org

    Vincent Guittot
     
    Some use cases, like 9 always-running tasks on 8 CPUs, can't be balanced,
    and the load balancer currently migrates the waiting task between the
    CPUs in an almost random manner. The success of a rq pulling a task
    depends on the value of nr_balance_failed of its domains and on its
    ability to be faster than others to detach it. This behavior results in
    an unfair distribution of the running time between tasks because some
    CPUs will run the same task most of the time, if not always, whereas
    others will share their time between several tasks.

    Instead of using nr_balance_failed as a boolean to relax the condition
    for detaching a task, the LB will use nr_balance_failed to relax the
    threshold between the tasks' load and the imbalance. This mechanism
    prevents the same rq or domain from always winning the load balance
    fight.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Phil Auld
    Link: https://lkml.kernel.org/r/20200921072424.14813-2-vincent.guittot@linaro.org

    Vincent Guittot
     
    In fair.c, sometimes update_tg_load_avg(cfs_rq, 0) is used and sometimes
    update_tg_load_avg(cfs_rq, false). update_tg_load_avg() has a 'force'
    parameter, but the current code never passes 1 or true for it, so remove
    the force parameter.

    Signed-off-by: Xianting Tian
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200924014755.36253-1-tian.xianting@h3c.com

    Xianting Tian
     
    We've met problems where tasks with a full cpumask (e.g. by putting them
    into a cpuset or setting full affinity) were occasionally migrated to our
    isolated CPUs in a production environment.

    After some analysis, we found that it is due to the current
    select_idle_smt() not considering the sched_domain mask.

    Steps to reproduce on my 31-CPU hyperthreads machine:
    1. with boot parameter: "isolcpus=domain,2-31"
    (thread lists: 0,16 and 1,17)
    2. cgcreate -g cpu:test; cgexec -g cpu:test "test_threads"
    3. some threads will be migrated to the isolated cpu16~17.

    Fix it by checking the valid domain mask in select_idle_smt().

    Fixes: 10e2f1acd010 ("sched/core: Rewrite and improve select_idle_siblings()")
    Reported-by: Wetp Zhang
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Jiang Biao
    Reviewed-by: Vincent Guittot
    Link: https://lkml.kernel.org/r/1600930127-76857-1-git-send-email-xlpang@linux.alibaba.com

    Xunlei Pang
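
    A hedged sketch of the fix's idea (simplified relative to the actual
    select_idle_smt()): candidate SMT siblings must also lie inside the
    sched_domain being searched, which excludes the isolated CPUs.

        static int select_idle_smt(struct task_struct *p, struct sched_domain *sd, int target)
        {
                int cpu;

                for_each_cpu(cpu, cpu_smt_mask(target)) {
                        if (!cpumask_test_cpu(cpu, p->cpus_ptr) ||
                            !cpumask_test_cpu(cpu, sched_domain_span(sd)))  /* the added check */
                                continue;
                        if (available_idle_cpu(cpu))
                                return cpu;
                }

                return -1;
        }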