11 Nov, 2020

1 commit

  • Reading /proc/sys/kernel/sched_domain/cpu*/domain0/flags multiple times
    with small reads causes oopses with SLUB corruption issues because the kfree
    is freeing an offset into a previous allocation. Fix this by adding a new
    pointer 'buf' for the allocation and kfree, and using the temporary pointer
    'tmp' to handle memory copies at offsets into buf.
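
    A minimal sketch of the resulting pattern (identifiers such as 'size',
    'chunk' and 'chunk_len' are illustrative, not the handler's actual names):

      char *buf, *tmp;

      tmp = buf = kcalloc(size + 1, sizeof(*buf), GFP_KERNEL);
      if (!buf)
              return -ENOMEM;

      /* build the output by advancing only the cursor... */
      memcpy(tmp, chunk, chunk_len);
      tmp += chunk_len;

      /* ...and always free the original allocation, never the advanced cursor */
      kfree(buf);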

    Fixes: 5b9f8ff7b320 ("sched/debug: Output SD flag names rather than their values")
    Reported-by: Jeff Bastian
    Signed-off-by: Colin Ian King
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20201029151103.373410-1-colin.king@canonical.com

    Colin Ian King
     

09 Sep, 2020

1 commit

  • The last sd_flag_debug shuffle inadvertently moved its definition within
    an #ifdef CONFIG_SYSCTL region. While CONFIG_SYSCTL is indeed required to
    produce the sched domain ctl interface (which uses sd_flag_debug to output
    flag names), it isn't required to run any assertion on the sched_domain
    hierarchy itself.

    Move the definition of sd_flag_debug to a CONFIG_SCHED_DEBUG region of
    topology.c.

    Now at long last we have:

    - sd_flag_debug declared in include/linux/sched/topology.h iff
      CONFIG_SCHED_DEBUG=y
    - sd_flag_debug defined in kernel/sched/topology.c, conditioned by:
      - CONFIG_SCHED_DEBUG, with an explicit #ifdef block
      - CONFIG_SMP, as a requirement to compile topology.c

    With this change, all symbols pertaining to SD flag metadata (with the
    exception of __SD_FLAG_CNT) are now defined exclusively within topology.c.
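
    A rough sketch of the resulting layout in kernel/sched/topology.c (the
    SD_FLAG()/sd_flag_debug shapes are abridged and should be treated as
    illustrative):

      /* kernel/sched/topology.c -- only compiled with CONFIG_SMP */
      #ifdef CONFIG_SCHED_DEBUG

      #define SD_FLAG(_name, mflags) [__##_name] = { .meta_flags = mflags, .name = #_name },
      const struct sd_flag_debug sd_flag_debug[] = {
      #include <linux/sched/sd_flags.h>
      };
      #undef SD_FLAG

      #endif /* CONFIG_SCHED_DEBUG */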

    Fixes: 8fca9494d4b4 ("sched/topology: Move sd_flag_debug out of linux/sched/topology.h")
    Reported-by: Randy Dunlap
    Signed-off-by: Valentin Schneider
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200908184956.23369-1-valentin.schneider@arm.com

    Valentin Schneider
     

26 Aug, 2020

1 commit

  • Defining an array in a header imported all over the place clearly is a daft
    idea, but that still didn't stop me from doing it.

    Leave a declaration of sd_flag_debug in topology.h and move its definition
    to sched/debug.c.

    Fixes: b6e862f38672 ("sched/topology: Define and assign sched_domain flag metadata")
    Reported-by: Andy Shevchenko
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200825133216.9163-1-valentin.schneider@arm.com

    Valentin Schneider
     

19 Aug, 2020

1 commit

  • Decoding the output of /proc/sys/kernel/sched_domain/cpu*/domain*/flags has
    always been somewhat annoying, as one needs to go fetch the bit -> name
    mapping from the source code itself. This encoding can be saved in a script
    somewhere, but that isn't safe from flags being added, removed or even
    shuffled around.

    What matters for debugging purposes is to get *which* flags are set in a
    given domain, their associated value is pretty much meaningless.

    Make the sd flags debug file output flag names.
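
    Conceptually, the read handler walks the set bits in sd->flags and emits
    names from the flag metadata instead of the raw bitmask -- a hedged sketch
    (seq_file style is used here purely for brevity; the real file is a sysctl
    entry):

      unsigned long flags = sd->flags;
      int idx;

      for_each_set_bit(idx, &flags, __SD_FLAG_CNT)
              seq_printf(m, "%s ", sd_flag_debug[idx].name);
      seq_puts(m, "\n");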

    Signed-off-by: Valentin Schneider
    Signed-off-by: Ingo Molnar
    Acked-by: Peter Zijlstra
    Link: https://lore.kernel.org/r/20200817113003.20802-7-valentin.schneider@arm.com

    Valentin Schneider
     

28 May, 2020

1 commit

  • In preparation for removing rq->wake_list, replace the
    !list_empty(rq->wake_list) check with rq->ttwu_pending. This is not fully
    equivalent, as this new variable is racy.
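
    In code terms, the substitution is essentially (context illustrative):

      -       if (!list_empty(&rq->wake_list))
      +       if (rq->ttwu_pending)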

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lore.kernel.org/r/20200526161908.070399698@infradead.org

    Peter Zijlstra
     

20 May, 2020

2 commits

  • Peter Zijlstra
     
  • The intention of commit 96e74ebf8d59 ("sched/debug: Add task uclamp
    values to SCHED_DEBUG procfs") was to print requested and effective
    task uclamp values. The requested values printed are read from p->uclamp,
    which holds the last effective values. Fix this by printing the values
    from p->uclamp_req.
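
    A hedged sketch of the corrected output block (the __PS() helper and
    uclamp_eff_value() are as they appear upstream around this time, but treat
    the exact lines as illustrative):

      /* requested clamps come from p->uclamp_req, effective ones are computed */
      __PS("uclamp.min", p->uclamp_req[UCLAMP_MIN].value);
      __PS("uclamp.max", p->uclamp_req[UCLAMP_MAX].value);
      __PS("effective uclamp.min", uclamp_eff_value(p, UCLAMP_MIN));
      __PS("effective uclamp.max", uclamp_eff_value(p, UCLAMP_MAX));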

    Fixes: 96e74ebf8d59 ("sched/debug: Add task uclamp values to SCHED_DEBUG procfs")
    Signed-off-by: Pavankumar Kondeti
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Valentin Schneider
    Tested-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/1589115401-26391-1-git-send-email-pkondeti@codeaurora.org

    Pavankumar Kondeti
     

01 May, 2020

2 commits

  • Writing to the sysctl of a sched_domain->flags directly updates the value of
    the field, and goes nowhere near update_top_cache_domain(). This means that
    the cached domain pointers can end up containing stale data (e.g. the
    domain pointed to doesn't have the relevant flag set anymore).

    Explicit domain walks that check for flags will be affected by
    the write, but this won't be in sync with the cached pointers which will
    still point to the domains that were cached at the last sched_domain
    build.

    In other words, writing to this interface is playing a dangerous game. It
    could be made to trigger an update of the cached sched_domain pointers when
    written to, but this does not seem to be worth the trouble. Make it
    read-only.
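
    Mechanically, this amounts to registering the flags entry with a read-only
    mode -- a sketch with illustrative surroundings ('entry' stands in for the
    actual table slot):

      /* 0444 instead of 0644: the flags file can no longer be written */
      set_table_entry(entry, "flags", &sd->flags, sizeof(int), 0444, proc_dointvec_minmax);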

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200415210512.805-3-valentin.schneider@arm.com

    Valentin Schneider
     
  • Ensure at least one space is left between the task state and the task name.

    w/o patch:
    runnable tasks:
    S task PID tree-key switches prio wait
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200414125721.195801-1-xiexiuqi@huawei.com

    Xie XiuQi
     

08 Apr, 2020

3 commits

  • Requested and effective uclamp values can be a bit tricky to decipher when
    playing with cgroup hierarchies. Add them to a task's procfs when
    SCHED_DEBUG is enabled.

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-4-valentin.schneider@arm.com

    Valentin Schneider
     
  • The printing macros in debug.c keep redefining the same output
    format. Collect each output format in a single definition, and reuse that
    definition in the other macros. While at it, add a layer of parentheses and
    replace printf's with the newly introduced macros.
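
    The flavour of the consolidation, sketched (the format strings follow the
    upstream definitions approximately; treat the details as illustrative):

      /* one canonical output format per kind... */
      #define __PS(S, F)  SEQ_printf(m, "%-45s:%21Ld\n", S, (long long)(F))
      #define __PSN(S, F) SEQ_printf(m, "%-45s:%14Ld.%06ld\n", S, SPLIT_NS((long long)(F)))

      /* ...reused by the shorthand macros */
      #define __P(F)  __PS(#F, F)
      #define P(F)    __PS(#F, p->F)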

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-3-valentin.schneider@arm.com

    Valentin Schneider
     
  • Most printing macros for procfs are defined globally in debug.c, and they
    are re-defined (to the exact same thing) within proc_sched_show_task().

    Get rid of the duplicate defines.

    Reviewed-by: Qais Yousef
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200226124543.31986-2-valentin.schneider@arm.com

    Valentin Schneider
     

24 Feb, 2020

2 commits

  • Now that runnable_load_avg has been removed, we can replace it with a new
    signal that will highlight the runnable pressure on a cfs_rq. This signal
    tracks the waiting time of tasks on the rq and can help to better define
    the state of rqs.

    Currently, only util_avg is used to define the state of a rq:
    a rq with more than around 80% utilization and more than one task is
    considered overloaded.

    But the util_avg signal of a rq can become temporarily low after a task
    has migrated onto another rq, which can bias the classification of the rq.

    When tasks compete for the same rq, their runnable average signal will be
    higher than util_avg, as it includes the waiting time, and we can use
    this signal to better classify cfs_rqs.

    The new runnable_avg will track the runnable time of a task, which simply
    adds the waiting time to the running time. The runnable_avg of a cfs_rq
    will be the \Sum of the se's runnable_avg, and the runnable_avg of a group
    entity will follow that of the rq, similarly to util_avg.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Reviewed-by: "Dietmar Eggemann "
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Valentin Schneider
    Cc: Phil Auld
    Cc: Hillf Danton
    Link: https://lore.kernel.org/r/20200224095223.13361-9-mgorman@techsingularity.net

    Vincent Guittot
     
  • Now that runnable_load_avg is no more used, we can remove it to make
    space for a new signal.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Mel Gorman
    Signed-off-by: Ingo Molnar
    Reviewed-by: "Dietmar Eggemann "
    Acked-by: Peter Zijlstra
    Cc: Juri Lelli
    Cc: Valentin Schneider
    Cc: Phil Auld
    Cc: Hillf Danton
    Link: https://lore.kernel.org/r/20200224095223.13361-8-mgorman@techsingularity.net

    Vincent Guittot
     

17 Jan, 2020

1 commit

  • The lengthy output of sysrq-t may take a lot of time on a slow serial
    console with lots of processes and CPUs.

    So we need to reset the NMI watchdog to avoid spurious lockup messages, and
    we also reset the softlockup watchdogs on all other CPUs, since another CPU
    might be blocked waiting for us to process an IPI or stop_machine.

    Add this to sysrq_sched_debug_show(), as was already done in
    show_state_filter().
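
    A hedged sketch of the addition (touch_nmi_watchdog() and
    touch_all_softlockup_watchdogs() are the standard watchdog helpers;
    placement inside the per-CPU loop is indicative):

      void sysrq_sched_debug_show(void)
      {
              int cpu;

              sched_debug_header(NULL);
              for_each_online_cpu(cpu) {
                      /* slow serial consoles can take long enough to trip the watchdogs */
                      touch_nmi_watchdog();
                      touch_all_softlockup_watchdogs();
                      print_cpu(NULL, cpu);
              }
      }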

    Signed-off-by: Wei Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt (VMware)
    Link: https://lkml.kernel.org/r/20191226085224.48942-1-liwei391@huawei.com

    Wei Li
     

25 Jun, 2019

1 commit


19 Jun, 2019

1 commit

  • Based on 2 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license version 2 as
    published by the free software foundation #

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 4122 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Enrico Weigelt
    Reviewed-by: Kate Stewart
    Reviewed-by: Allison Randal
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081206.933168790@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

03 Jun, 2019

4 commits

  • The sched domain per rq load index files also disappear from the
    /proc/sys/kernel/sched_domain/cpuX/domainY directories.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-6-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • The per rq load array values also disappear from the cpu#X sections in
    /proc/sched_debug.

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-5-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • This reverts:

    commit 201c373e8e48 ("sched/debug: Limit sd->*_idx range on sysctl")

    Load indexes (sd->*_idx) are no longer needed without rq->cpu_load[].
    The range check for load indexes can be removed as well. Get rid of it
    before removing rq->cpu_load[], since the check uses CPU_LOAD_IDX_MAX.

    At the same time, fix the following coding style issues detected by
    scripts/checkpatch.pl:

    ERROR: space prohibited before that ','
    ERROR: space prohibited before that close parenthesis ')'

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Rik van Riel
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Peter Zijlstra
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Valentin Schneider
    Cc: Vincent Guittot
    Link: https://lkml.kernel.org/r/20190527062116.11512-4-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     
  • The CFS class is the only one maintaining and using the CPU wide load
    (rq->load(.weight)). The last use case of the CPU wide load in CFS's
    set_next_entity() can be replaced by using the load of the CFS class
    (rq->cfs.load(.weight)) instead.
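
    The remaining user, sketched as an illustrative diff of the check in
    set_next_entity():

      -       if (schedstat_enabled() &&
      -           rq_of(cfs_rq)->load.weight >= 2*se->load.weight) {
      +       if (schedstat_enabled() &&
      +           rq_of(cfs_rq)->cfs.load.weight >= 2*se->load.weight) {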

    Signed-off-by: Dietmar Eggemann
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190424084556.604-1-dietmar.eggemann@arm.com
    Signed-off-by: Ingo Molnar

    Dietmar Eggemann
     

20 Apr, 2019

1 commit


04 Feb, 2019

1 commit

  • register_sched_domain_sysctl() copies the cpu_possible_mask into
    sd_sysctl_cpus, but only if sd_sysctl_cpus hasn't already been
    allocated (ie, CONFIG_CPUMASK_OFFSTACK is set). However, when
    CONFIG_CPUMASK_OFFSTACK is not set, sd_sysctl_cpus is left
    uninitialized (all zeroes) and the kernel may fail to initialize
    sched_domain sysctl entries for all possible CPUs.

    This is visible to the user if the kernel is booted with maxcpus=n, or
    if ACPI tables have been modified to leave CPUs offline, and then
    checking for missing /proc/sys/kernel/sched_domain/cpu* entries.

    Fix this by separating the allocation and the initialization, and adding a
    flag so that the possible CPU entries are initialized only while the system
    is booting.
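
    A hedged sketch of the separated allocation and initialization (the flag
    name and surrounding details are indicative):

      static bool init_done;

      if (!cpumask_available(sd_sysctl_cpus)) {
              if (!alloc_cpumask_var(&sd_sysctl_cpus, GFP_KERNEL))
                      return;
      }

      if (!init_done) {
              init_done = true;
              /* init to the possible mask so the sysctl directory has no holes */
              cpumask_copy(sd_sysctl_cpus, cpu_possible_mask);
      }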

    Tested-by: Syuuichirou Ishii
    Tested-by: Tarumizu, Kohei
    Signed-off-by: Hidetoshi Seto
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Masayoshi Mizuma
    Acked-by: Joe Lawrence
    Cc: Linus Torvalds
    Cc: Masayoshi Mizuma
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/20190129151245.5073-1-msys.mizuma@gmail.com
    Signed-off-by: Ingo Molnar

    Hidetoshi Seto
     

06 Jan, 2019

1 commit

  • Currently, CONFIG_JUMP_LABEL just means "I _want_ to use jump label".

    The jump label is controlled by HAVE_JUMP_LABEL, which is defined
    like this:

    #if defined(CC_HAVE_ASM_GOTO) && defined(CONFIG_JUMP_LABEL)
    # define HAVE_JUMP_LABEL
    #endif

    We can improve this by testing 'asm goto' support in Kconfig, then
    make JUMP_LABEL depend on CC_HAS_ASM_GOTO.

    Ugly #ifdef HAVE_JUMP_LABEL will go away, and CONFIG_JUMP_LABEL will
    match to the real kernel capability.

    Signed-off-by: Masahiro Yamada
    Acked-by: Michael Ellerman (powerpc)
    Tested-by: Sedat Dilek

    Masahiro Yamada
     

12 Nov, 2018

1 commit

  • We already have task_has_rt_policy() and task_has_dl_policy() helpers,
    create task_has_idle_policy() as well and update sched core to start
    using it.

    While at it, use task_has_dl_policy() at one more place.
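
    The new helper mirrors the existing ones -- a minimal sketch (the real
    definition lives next to task_has_rt_policy()/task_has_dl_policy() in
    kernel/sched/sched.h):

      static inline int task_has_idle_policy(struct task_struct *p)
      {
              return idle_policy(p->policy);
      }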

    Signed-off-by: Viresh Kumar
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Daniel Lezcano
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/ce3915d5b490fc81af926a3b6bfb775e7188e005.1541416894.git.viresh.kumar@linaro.org
    Signed-off-by: Ingo Molnar

    Viresh Kumar
     

10 Sep, 2018

1 commit

  • The following lockdep report can be triggered by writing to /sys/kernel/debug/sched_features:

    ======================================================
    WARNING: possible circular locking dependency detected
    4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18 Not tainted
    ------------------------------------------------------
    sh/3358 is trying to acquire lock:
    000000004ad3989d (cpu_hotplug_lock.rw_sem){++++}, at: static_key_enable+0x14/0x30
    but task is already holding lock:
    00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3 (&sb->s_type->i_mutex_key#3){+.+.}:
    lock_acquire+0xb8/0x148
    down_write+0xac/0x140
    start_creating+0x5c/0x168
    debugfs_create_dir+0x18/0x220
    opp_debug_register+0x8c/0x120
    _add_opp_dev+0x104/0x1f8
    dev_pm_opp_get_opp_table+0x174/0x340
    _of_add_opp_table_v2+0x110/0x760
    dev_pm_opp_of_add_table+0x5c/0x240
    dev_pm_opp_of_cpumask_add_table+0x5c/0x100
    cpufreq_init+0x160/0x430
    cpufreq_online+0x1cc/0xe30
    cpufreq_add_dev+0x78/0x198
    subsys_interface_register+0x168/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #2 (opp_table_lock){+.+.}:
    lock_acquire+0xb8/0x148
    __mutex_lock+0x104/0xf50
    mutex_lock_nested+0x1c/0x28
    _of_add_opp_table_v2+0xb4/0x760
    dev_pm_opp_of_add_table+0x5c/0x240
    dev_pm_opp_of_cpumask_add_table+0x5c/0x100
    cpufreq_init+0x160/0x430
    cpufreq_online+0x1cc/0xe30
    cpufreq_add_dev+0x78/0x198
    subsys_interface_register+0x168/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #1 (subsys mutex#6){+.+.}:
    lock_acquire+0xb8/0x148
    __mutex_lock+0x104/0xf50
    mutex_lock_nested+0x1c/0x28
    subsys_interface_register+0xd8/0x270
    cpufreq_register_driver+0x1c8/0x278
    dt_cpufreq_probe+0xdc/0x1b8
    platform_drv_probe+0xb4/0x168
    driver_probe_device+0x318/0x4b0
    __device_attach_driver+0xfc/0x1f0
    bus_for_each_drv+0xf8/0x180
    __device_attach+0x164/0x200
    device_initial_probe+0x10/0x18
    bus_probe_device+0x110/0x178
    device_add+0x6d8/0x908
    platform_device_add+0x138/0x3d8
    platform_device_register_full+0x1cc/0x1f8
    cpufreq_dt_platdev_init+0x174/0x1bc
    do_one_initcall+0xb8/0x310
    kernel_init_freeable+0x4b8/0x56c
    kernel_init+0x10/0x138
    ret_from_fork+0x10/0x18
    -> #0 (cpu_hotplug_lock.rw_sem){++++}:
    __lock_acquire+0x203c/0x21d0
    lock_acquire+0xb8/0x148
    cpus_read_lock+0x58/0x1c8
    static_key_enable+0x14/0x30
    sched_feat_write+0x314/0x428
    full_proxy_write+0xa0/0x138
    __vfs_write+0xd8/0x388
    vfs_write+0xdc/0x318
    ksys_write+0xb4/0x138
    sys_write+0xc/0x18
    __sys_trace_return+0x0/0x4
    other info that might help us debug this:
    Chain exists of:
    cpu_hotplug_lock.rw_sem --> opp_table_lock --> &sb->s_type->i_mutex_key#3
     Possible unsafe locking scenario:

           CPU0                    CPU1
           ----                    ----
      lock(&sb->s_type->i_mutex_key#3);
                                   lock(opp_table_lock);
                                   lock(&sb->s_type->i_mutex_key#3);
      lock(cpu_hotplug_lock.rw_sem);

      *** DEADLOCK ***
    2 locks held by sh/3358:
    #0: 00000000a8c4b363 (sb_writers#10){.+.+}, at: vfs_write+0x238/0x318
    #1: 00000000c1b31a88 (&sb->s_type->i_mutex_key#3){+.+.}, at: sched_feat_write+0x160/0x428
    stack backtrace:
    CPU: 5 PID: 3358 Comm: sh Not tainted 4.18.0-rc6-00152-gcd3f77d74ac3-dirty #18
    Hardware name: Renesas H3ULCB Kingfisher board based on r8a7795 ES2.0+ (DT)
    Call trace:
    dump_backtrace+0x0/0x288
    show_stack+0x14/0x20
    dump_stack+0x13c/0x1ac
    print_circular_bug.isra.10+0x270/0x438
    check_prev_add.constprop.16+0x4dc/0xb98
    __lock_acquire+0x203c/0x21d0
    lock_acquire+0xb8/0x148
    cpus_read_lock+0x58/0x1c8
    static_key_enable+0x14/0x30
    sched_feat_write+0x314/0x428
    full_proxy_write+0xa0/0x138
    __vfs_write+0xd8/0x388
    vfs_write+0xdc/0x318
    ksys_write+0xb4/0x138
    sys_write+0xc/0x18
    __sys_trace_return+0x0/0x4

    This is because when loading the cpufreq_dt module we first acquire
    cpu_hotplug_lock.rw_sem lock, then in cpufreq_init(), we are taking
    the &sb->s_type->i_mutex_key lock.

    But when writing to /sys/kernel/debug/sched_features, the
    cpu_hotplug_lock.rw_sem lock depends on the &sb->s_type->i_mutex_key lock.

     To fix this bug, reverse the lock acquisition order when writing to
     sched_features; this way cpu_hotplug_lock.rw_sem no longer depends on
     &sb->s_type->i_mutex_key.

    Tested-by: Dietmar Eggemann
    Signed-off-by: Jiada Wang
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Eugeniu Rosca
    Cc: George G. Davis
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180731121222.26195-1-jiada_wang@mentor.com
    Signed-off-by: Ingo Molnar

    Jiada Wang
     

14 Aug, 2018

1 commit

  • Pull x86 timer updates from Thomas Gleixner:
    "Early TSC based time stamping to allow better boot time analysis.

    This comes with a general cleanup of the TSC calibration code which
    grew warts and duct taping over the years and removes 250 lines of
    code. Initiated and mostly implemented by Pavel with help from various
    folks"

    * 'x86-timers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (37 commits)
    x86/kvmclock: Mark kvm_get_preset_lpj() as __init
    x86/tsc: Consolidate init code
    sched/clock: Disable interrupts when calling generic_sched_clock_init()
    timekeeping: Prevent false warning when persistent clock is not available
    sched/clock: Close a hole in sched_clock_init()
    x86/tsc: Make use of tsc_calibrate_cpu_early()
    x86/tsc: Split native_calibrate_cpu() into early and late parts
    sched/clock: Use static key for sched_clock_running
    sched/clock: Enable sched clock early
    sched/clock: Move sched clock initialization and merge with generic clock
    x86/tsc: Use TSC as sched clock early
    x86/tsc: Initialize cyc2ns when tsc frequency is determined
    x86/tsc: Calibrate tsc only once
    ARM/time: Remove read_boot_clock64()
    s390/time: Remove read_boot_clock64()
    timekeeping: Default boot time offset to local_clock()
    timekeeping: Replace read_boot_clock64() with read_persistent_wall_and_boot_offset()
    s390/time: Add read_persistent_wall_and_boot_offset()
    x86/xen/time: Output xen sched_clock time from 0
    x86/xen/time: Initialize pv xen time in init_hypervisor_platform()
    ...

    Linus Torvalds
     

25 Jul, 2018

1 commit

  • Fix the order in which the private and shared numa faults are getting
    printed.

    No functional changes.

    Running SPECjbb2005 on a 4 node machine and comparing bops/JVM:

    JVMS  LAST_PATCH  WITH_PATCH  %CHANGE
    16    25215.7     25375.3     0.63
    1     72107       72617       0.70

    Signed-off-by: Srikar Dronamraju
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Rik van Riel
    Acked-by: Mel Gorman
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1529514181-9842-7-git-send-email-srikar@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Srikar Dronamraju
     

20 Jul, 2018

1 commit

  • sched_clock_running may be read every time sched_clock_cpu() is called.
    Yet, this variable is updated only twice during boot and never changes
    again, so it is better to make it a static key.
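
    A hedged sketch of the shape of the change, generic (stable-clock) variant
    shown; DEFINE_STATIC_KEY_FALSE() and static_branch_likely() are the
    standard static-key APIs, the surrounding logic is abridged:

      static DEFINE_STATIC_KEY_FALSE(sched_clock_running);

      u64 sched_clock_cpu(int cpu)
      {
              if (!static_branch_likely(&sched_clock_running))
                      return 0;

              return sched_clock();
      }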

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: steven.sistare@oracle.com
    Cc: daniel.m.jordan@oracle.com
    Cc: linux@armlinux.org.uk
    Cc: schwidefsky@de.ibm.com
    Cc: heiko.carstens@de.ibm.com
    Cc: john.stultz@linaro.org
    Cc: sboyd@codeaurora.org
    Cc: hpa@zytor.com
    Cc: douly.fnst@cn.fujitsu.com
    Cc: prarit@redhat.com
    Cc: feng.tang@intel.com
    Cc: pmladek@suse.com
    Cc: gnomes@lxorguk.ukuu.org.uk
    Cc: linux-s390@vger.kernel.org
    Cc: boris.ostrovsky@oracle.com
    Cc: jgross@suse.com
    Cc: pbonzini@redhat.com
    Link: https://lkml.kernel.org/r/20180719205545.16512-25-pasha.tatashin@oracle.com

    Pavel Tatashin
     

21 Jun, 2018

1 commit

  • match_string() returns the index of an array for a matching string,
    which can be used instead of the open coded variant.
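
    The call itself, sketched against the sched_feat_names[] table (surrounding
    parsing omitted):

      i = match_string(sched_feat_names, __SCHED_FEAT_NR, cmp);
      if (i < 0)
              return i;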

    Signed-off-by: Yisheng Xie
    Reviewed-by: Andy Shevchenko
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lore.kernel.org/lkml/1527765086-19873-15-git-send-email-xieyisheng1@huawei.com
    Signed-off-by: Ingo Molnar

    Yisheng Xie
     

16 May, 2018

1 commit


03 Apr, 2018

1 commit

  • Pull scheduler updates from Ingo Molnar:
    "The main scheduler changes in this cycle were:

    - NUMA balancing improvements (Mel Gorman)

    - Further load tracking improvements (Patrick Bellasi)

    - Various NOHZ balancing cleanups and optimizations (Peter Zijlstra)

    - Improve blocked load handling, in particular we can now reduce and
    eventually stop periodic load updates on 'very idle' CPUs. (Vincent
    Guittot)

    - On isolated CPUs offload the final 1Hz scheduler tick as well, plus
    related cleanups and reorganization. (Frederic Weisbecker)

    - Core scheduler code cleanups (Ingo Molnar)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (45 commits)
    sched/core: Update preempt_notifier_key to modern API
    sched/cpufreq: Rate limits for SCHED_DEADLINE
    sched/fair: Update util_est only on util_avg updates
    sched/cpufreq/schedutil: Use util_est for OPP selection
    sched/fair: Use util_est in LB and WU paths
    sched/fair: Add util_est on top of PELT
    sched/core: Remove TASK_ALL
    sched/completions: Use bool in try_wait_for_completion()
    sched/fair: Update blocked load when newly idle
    sched/fair: Move idle_balance()
    sched/nohz: Merge CONFIG_NO_HZ_COMMON blocks
    sched/fair: Move rebalance_domains()
    sched/nohz: Optimize nohz_idle_balance()
    sched/fair: Reduce the periodic update duration
    sched/nohz: Stop NOHZ stats when decayed
    sched/cpufreq: Provide migration hint
    sched/nohz: Clean up nohz enter/exit
    sched/fair: Update blocked load from NEWIDLE
    sched/fair: Add NOHZ stats balancing
    sched/fair: Restructure nohz_balance_kick()
    ...

    Linus Torvalds
     

20 Mar, 2018

3 commits

  • Scheduler debug stats include newlines that display out of alignment
    when prefixed by timestamps. For example, the dmesg utility:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    [ 83.124251]
    runnable tasks:
     S task PID tree-key switches prio wait-time sum-exec sum-sleep
    -----------------------------------------------------------------------------------------------------------

    At the same time, some syslog utilities (like rsyslog by default) don't
    like the additional newline control characters, saving lines like this
    to /var/log/messages:

    Mar 16 16:02:29 localhost kernel: #012runnable tasks:#012 S task PID tree-key ...
    ^^^^ ^^^^
    Clean these up by moving newline characters to their own SEQ_printf
    invocation. This leaves the /proc/sched_debug unchanged, but brings the
    entire output into alignment when prefixed:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    [ 62.410368] runnable tasks:
    [ 62.410368] S task PID tree-key switches prio wait-time sum-exec sum-sleep
    [ 62.410369] -----------------------------------------------------------------------------------------------------------
    [ 62.410369] I kworker/u12:0 5 1932.215593 332 120 0.000000 3.621252 0.000000 0 0 /

    and no escaped control characters from rsyslog in /var/log/messages:

    Mar 16 16:15:06 localhost kernel: runnable tasks:
    Mar 16 16:15:06 localhost kernel: S task PID tree-key ...
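
    The mechanical fix, sketched: give each newline its own SEQ_printf()
    invocation so the console path starts a fresh, timestamp-prefixed line:

      SEQ_printf(m, "\n");
      SEQ_printf(m, "runnable tasks:\n");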

    Signed-off-by: Joe Lawrence
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1521484555-8620-3-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Ingo Molnar

    Joe Lawrence
     
  • When the SEQ_printf() macro prints to the console, it runs a simple
    printk() without KERN_CONT "continued" line printing. The result of
    this is oddly wrapped task info, for example:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    runnable tasks:
    ...
    [ 29.608611] I
    [ 29.608613] rcu_sched 8 3252.013846 4087 120
    [ 29.608614] 0.000000 29.090111 0.000000
    [ 29.608615] 0 0
    [ 29.608616] /

    Modify SEQ_printf to use pr_cont() for expected one-line results:

    % echo t > /proc/sysrq-trigger
    % dmesg
    ...
    runnable tasks:
    ...
    [ 106.716329] S cpuhp/5 37 2006.315026 14 120 0.000000 0.496893 0.000000 0 0 /
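
    The macro after the change, approximately (the seq_file branch is
    untouched; only the console branch switches to pr_cont()):

      #define SEQ_printf(m, x...)                  \
       do {                                        \
              if (m)                               \
                      seq_printf(m, x);            \
              else                                 \
                      pr_cont(x);                  \
       } while (0)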

    Signed-off-by: Joe Lawrence
    Acked-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1521484555-8620-2-git-send-email-joe.lawrence@redhat.com
    Signed-off-by: Ingo Molnar

    Joe Lawrence
     
  • The util_avg signal computed by PELT is too variable for some use-cases.
    For example, a big task waking up after a long sleep period will have its
    utilization almost completely decayed. This introduces some latency before
    schedutil will be able to pick the best frequency to run a task.

    The same issue can affect task placement. Indeed, since the task
    utilization is already decayed at wakeup, when the task is enqueued in a
    CPU, this can result in a CPU running a big task as being temporarily
    represented as being almost empty. This leads to a race condition where
    other tasks can be potentially allocated on a CPU which just started to run
    a big task which slept for a relatively long period.

    Moreover, the PELT utilization of a task can be updated every [ms], thus
    making it a continuously changing value for certain longer running
    tasks. This means that the instantaneous PELT utilization of a RUNNING
    task is not really meaningful to properly support scheduler decisions.

    For all these reasons, a more stable signal can do a better job of
    representing the expected/estimated utilization of a task/cfs_rq.
    Such a signal can be easily created on top of PELT by still using it as
    an estimator which produces values to be aggregated on meaningful
    events.

    This patch adds a simple implementation of util_est, a new signal built on
    top of PELT's util_avg where:

    util_est(task) = max(task::util_avg, f(task::util_avg@dequeue))

    This allows us to remember how big a task has been reported to be by PELT
    in its previous activations via f(task::util_avg@dequeue), which is the new
    _task_util_est(struct task_struct*) function added by this patch.

    If a task should change its behavior and it runs longer in a new
    activation, after a certain time its util_est will just track the
    original PELT signal (i.e. task::util_avg).

    The estimated utilization of a cfs_rq is defined only for root ones.
    That's because the only sensible consumers of this signal are the
    scheduler and schedutil when looking for the overall CPU utilization
    due to FAIR tasks.

    For this reason, the estimated utilization of a root cfs_rq is simply
    defined as:

    util_est(cfs_rq) = max(cfs_rq::util_avg, cfs_rq::util_est::enqueued)

    where:

    cfs_rq::util_est::enqueued = sum(_task_util_est(task))
    for each RUNNABLE task on that root cfs_rq

    It's worth noting that the estimated utilization is tracked only for
    objects of interests, specifically:

    - Tasks: to better support tasks placement decisions
    - root cfs_rqs: to better support both tasks placement decisions as
    well as frequencies selection
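
    A small sketch matching the task-side formula above (the struct/field
    layout is illustrative of the patch, not authoritative):

      /* f(util_avg@dequeue): the snapshot recorded when the task dequeues */
      static inline unsigned long _task_util_est(struct task_struct *p)
      {
              return READ_ONCE(p->se.avg.util_est.enqueued);
      }

      /* util_est(task) = max(util_avg, snapshot), as above */
      static inline unsigned long task_util_est(struct task_struct *p)
      {
              return max(READ_ONCE(p->se.avg.util_avg), _task_util_est(p));
      }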

    Signed-off-by: Patrick Bellasi
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Dietmar Eggemann
    Cc: Joel Fernandes
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Morten Rasmussen
    Cc: Paul Turner
    Cc: Rafael J . Wysocki
    Cc: Steve Muckle
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: Vincent Guittot
    Cc: Viresh Kumar
    Link: http://lkml.kernel.org/r/20180309095245.11071-2-patrick.bellasi@arm.com
    Signed-off-by: Ingo Molnar

    Patrick Bellasi
     

04 Mar, 2018

1 commit

  • Do the following cleanups and simplifications:

    - sched/sched.h already includes , so no need to
    include it in sched/core.c again.

    - order the headers alphabetically

    - add all headers to kernel/sched/sched.h

    - remove all unnecessary includes from the .c files that
    are already included in kernel/sched/sched.h.

    Finally, make all scheduler .c files use a single common header:

    #include "sched.h"

    ... which now contains a union of the relied upon headers.

    This makes the various .c files easier to read and easier to handle.

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

03 Mar, 2018

1 commit

  • A good number of small style inconsistencies have accumulated
    in the scheduler core, so do a pass over them to harmonize
    all these details:

    - fix spelling in comments,

    - use curly braces for multi-line statements,

    - remove unnecessary parentheses from integer literals,

    - capitalize consistently,

    - remove stray newlines,

    - add comments where necessary,

    - remove invalid/unnecessary comments,

    - align structure definitions and other data types vertically,

    - add missing newlines for increased readability,

    - fix vertical tabulation where it's misaligned,

    - harmonize preprocessor conditional block labeling
    and vertical alignment,

    - remove line-breaks where they uglify the code,

    - add newline after local variable definitions,

    No change in functionality:

    md5:
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.before.asm
    1191fa0a890cfa8132156d2959d7e9e2 built-in.o.after.asm

    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     

30 Sep, 2017

3 commits

  • The load balancer uses runnable_load_avg as load indicator. For
    !cgroup this is:

    runnable_load_avg = \Sum se->avg.load_avg ; where se->on_rq

    That is, a direct sum of all runnable tasks on that runqueue. As
    opposed to load_avg, which is a sum of all tasks on the runqueue,
    which includes a blocked component.

    However, in the cgroup case, this comes apart since the group entities
    are always runnable, even if most of their constituent entities are
    blocked.

    Therefore introduce a runnable_weight which for task entities is the
    same as the regular weight, but for group entities is a fraction of
    the entity weight and represents the runnable part of the group
    runqueue.

    Then propagate this load through the PELT hierarchy to arrive at an
    effective runnable load average -- which we should not confuse with
    the canonical runnable load average.

    Suggested-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When an entity migrates in (or out) of a runqueue, we need to add (or
    remove) its contribution from the entire PELT hierarchy, because even
    non-runnable entities are included in the load average sums.

    In order to do this we have some propagation logic that updates the
    PELT tree, however the way it 'propagates' the runnable (or load)
    change is (more or less):

                           tg->weight * grq->avg.load_avg
      ge->avg.load_avg = ----------------------------------
                                  tg->load_avg

    But that is the expression for ge->weight, and per the definition of
    load_avg:

    ge->avg.load_avg := ge->weight * ge->avg.runnable_avg

    That destroys the runnable_avg (by setting it to 1) we wanted to
    propagate.

    Instead directly propagate runnable_sum.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since on wakeup migration we don't hold the rq->lock for the old CPU
    we cannot update its state. Instead we add the removed 'load' to an
    atomic variable and have the next update on that CPU collect and
    process it.

    Currently we have 2 atomic variables, which already have the issue that
    they can be read out of sync. Also, two atomic ops on a single cacheline
    are already more expensive than an uncontended lock.

    Since we want to add more, convert the thing over to an explicit
    cacheline with a lock in.
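
    A sketch of the resulting cfs_rq member (the exact field set is
    indicative; the point is one lock-protected cacheline instead of several
    atomics):

      struct {
              raw_spinlock_t  lock ____cacheline_aligned;
              int             nr;
              unsigned long   load_avg;
              unsigned long   util_avg;
      } removed;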

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra