05 Mar, 2020

13 commits

  • [ Upstream commit c780e86dd48ef6467a1146cf7d0fe1e05a635039 ]

    KASAN is reporting that __blk_add_trace() has a use-after-free issue
    when accessing q->blk_trace. Indeed the switching of block tracing (and
    thus eventual freeing of q->blk_trace) is completely unsynchronized with
    the currently running tracing and thus it can happen that the blk_trace
    structure is being freed just while __blk_add_trace() works on it.
    Protect accesses to q->blk_trace by RCU during tracing and make sure we
    wait for the end of the RCU grace period when shutting down tracing.
    Luckily that is a rare enough event that we can afford it. Note that
    postponing the freeing of blk_trace to an RCU callback is better avoided,
    as it could have unexpected user-visible side effects: the debugfs files
    would still exist for a short while after block tracing has been shut
    down.
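
    A minimal sketch of the resulting pattern, assuming simplified helper
    names (not the literal upstream diff): the tracing hot path dereferences
    q->blk_trace under rcu_read_lock(), and teardown waits for readers before
    freeing.

    /* tracing hot path */
    rcu_read_lock();
    bt = rcu_dereference(q->blk_trace);
    if (bt)
            record_blktrace_event(bt, ...);   /* hypothetical stand-in */
    rcu_read_unlock();

    /* shutdown path */
    bt = xchg(&q->blk_trace, NULL);
    synchronize_rcu();        /* wait for in-flight __blk_add_trace() users */
    free_blk_trace(bt);       /* hypothetical stand-in for the real free */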

    Link: https://bugzilla.kernel.org/show_bug.cgi?id=205711
    CC: stable@vger.kernel.org
    Reviewed-by: Chaitanya Kulkarni
    Reviewed-by: Ming Lei
    Tested-by: Ming Lei
    Reviewed-by: Bart Van Assche
    Reported-by: Tristan Madani
    Signed-off-by: Jan Kara
    Signed-off-by: Jens Axboe
    Signed-off-by: Sasha Levin

    Jan Kara
     
  • commit a030f9767da1a6bbcec840fc54770eb11c2414b6 upstream.

    It was found that two lines in the output of /proc/lockdep_stats have
    indentation problem:

    # cat /proc/lockdep_stats
    :
    in-process chains: 25057
    stack-trace entries: 137827 [max: 524288]
    number of stack traces: 7973
    number of stack hash chains: 6355
    combined max dependencies: 1356414598
    hardirq-safe locks: 57
    hardirq-unsafe locks: 1286
    :

    All the numbers displayed in /proc/lockdep_stats except the two stack
    trace numbers are formatted with a field width of 11. To properly align
    all the numbers, a field width of 11 is now applied to the two stack
    trace numbers as well.
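
    A hedged sketch of what such a change amounts to (the exact counter
    helpers in kernel/locking/lockdep_proc.c may differ):

    seq_printf(m, " number of stack traces:       %11llu\n",
               lockdep_stack_trace_count());
    seq_printf(m, " number of stack hash chains:  %11llu\n",
               lockdep_stack_hash_count());

    The "%11llu" field width matches the other counters in the file, so the
    columns line up.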

    Fixes: 8c779229d0f4 ("locking/lockdep: Report more stack trace statistics")
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Bart Van Assche
    Link: https://lkml.kernel.org/r/20191211213139.29934-1-longman@redhat.com
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • commit 4bc6b745e5cbefed92c48071e28a5f41246d0470 upstream.

    The current expedited RCU grace-period code expects that a task
    requesting an expedited grace period cannot awaken until that grace
    period has reached the wakeup phase. However, it is possible for a long
    preemption to result in the waiting task never sleeping. For example,
    consider the following sequence of events:

    1. Task A starts an expedited grace period by invoking
    synchronize_rcu_expedited(). It proceeds normally up to the
    wait_event() near the end of that function, and is then preempted
    (or interrupted or whatever).

    2. The expedited grace period completes, and a kworker task starts
    the awaken phase, having incremented the counter and acquired
    the rcu_state structure's .exp_wake_mutex. This kworker task
    is then preempted or interrupted or whatever.

    3. Task A resumes and enters wait_event(), which notes that the
    expedited grace period has completed, and thus doesn't sleep.

    4. Task B starts an expedited grace period exactly as did Task A,
    complete with the preemption (or whatever delay) just before
    the call to wait_event().

    5. The expedited grace period completes, and another kworker
    task starts the awaken phase, having incremented the counter.
    However, it blocks when attempting to acquire the rcu_state
    structure's .exp_wake_mutex because step 2's kworker task has
    not yet released it.

    6. Steps 4 and 5 repeat, resulting in overflow of the rcu_node
    structure's ->exp_wq[] array.

    In theory, this is harmless. Tasks waiting on the various ->exp_wq[]
    arrays will just be spuriously awakened, but they will simply sleep again
    on noting that the rcu_state structure's ->expedited_sequence value has
    not advanced far enough.

    In practice, this wastes CPU time and is an accident waiting to happen.
    This commit therefore moves the rcu_exp_gp_seq_end() call that officially
    ends the expedited grace period (along with the associated tracing) to
    after the ->exp_wake_mutex has been acquired. This prevents Task A from
    awakening prematurely, thus preventing more than one expedited grace
    period from being in flight during a previous expedited grace period's
    wakeup phase.
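
    Sketched, the reordering described above looks roughly like this inside
    the wait/wake path (abbreviated, not the exact upstream code):

    static void rcu_exp_wait_wake(unsigned long s)
    {
            /* ... wait for the expedited grace period to complete ... */

            mutex_lock(&rcu_state.exp_wake_mutex);
            rcu_exp_gp_seq_end();  /* only now officially end the GP, so the
                                      next expedited GP cannot start while
                                      this wakeup phase is still pending */
            /* ... wake tasks on the rcu_node ->exp_wq[] arrays ... */
            mutex_unlock(&rcu_state.exp_wake_mutex);
    }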

    Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
    Signed-off-by: Neeraj Upadhyay
    [ paulmck: Added updated comment. ]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Greg Kroah-Hartman

    Neeraj Upadhyay
     
  • commit 9f24c540f7f8eb3a981528da9a9a636a5bdf5987 upstream.

    The low resolution parts of the VDSO, i.e.:

    clock_gettime(CLOCK_*_COARSE), clock_getres(), time()

    can be used even if there is no VDSO capable clocksource.

    But if an architecture opts out of the VDSO data update then this
    information becomes stale. This affects ARM when there is no architected
    timer available. The lack of update causes userspace to use stale data
    forever.

    Make the update of the low resolution parts unconditional and only skip
    the update of the high resolution parts if the architecture requests it.
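
    A rough sketch of the intended structure of the generic update, with
    hypothetical helper names for the two halves of the data:

    void update_vsyscall(struct timekeeper *tk)
    {
            struct vdso_data *vdata = __arch_get_k_vdso_data();

            /* Coarse clocks and time(): always refresh */
            update_vdso_coarse_data(vdata, tk);        /* hypothetical helper */

            /* High resolution clocks: only if the arch did not opt out */
            if (__arch_update_vdso_data())
                    update_vdso_hires_data(vdata, tk); /* hypothetical helper */
    }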

    Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation")
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200114185946.765577901@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 9a6b55ac4a44060bcb782baf002859b2a2c63267 upstream.

    The function name suggests that this is a boolean checking whether the
    architecture asks for an update of the VDSO data, but it works the other
    way round. To spare further confusion invert the logic.

    Fixes: 44f57d788e7d ("timekeeping: Provide a generic update_vsyscall() implementation")
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/20200114185946.656652824@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit f66c0447cca1281116224d474cdb37d6a18e4b5b upstream.

    Set the unoptimized flag after confirming the code is completely
    unoptimized. Without this fix, when a kprobe hits the intermediate
    modified instruction (the first byte is replaced by an INT3, but
    later bytes can still be a jump address operand) while unoptimizing,
    it can return to the middle byte of the modified code, which causes
    an invalid instruction exception in the kernel.

    Usually, this is a rare case, but if we put a probe on the function
    call while text patching, it always causes a kernel panic as below:

    # echo p text_poke+5 > kprobe_events
    # echo 1 > events/kprobes/enable
    # echo 0 > events/kprobes/enable

    invalid opcode: 0000 [#1] PREEMPT SMP PTI
    RIP: 0010:text_poke+0x9/0x50
    Call Trace:
    arch_unoptimize_kprobe+0x22/0x28
    arch_unoptimize_kprobes+0x39/0x87
    kprobe_optimizer+0x6e/0x290
    process_one_work+0x2a0/0x610
    worker_thread+0x28/0x3d0
    ? process_one_work+0x610/0x610
    kthread+0x10d/0x130
    ? kthread_park+0x80/0x80
    ret_from_fork+0x3a/0x50

    text_poke() is used for patching the code in optprobes.

    This can happen even if we blacklist text_poke() and other functions,
    because there is a small time window during which we show the intermediate
    code to other CPUs.

    [ mingo: Edited the changelog. ]

    Tested-by: Alexei Starovoitov
    Signed-off-by: Masami Hiramatsu
    Cc: Andy Lutomirski
    Cc: Borislav Petkov
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: bristot@redhat.com
    Fixes: 6274de4984a6 ("kprobes: Support delayed unoptimizing")
    Link: https://lkml.kernel.org/r/157483422375.25881.13508326028469515760.stgit@devnote2
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     
  • commit 60588bfa223ff675b95f866249f90616613fbe31 upstream.

    select_idle_cpu() will scan the LLC domain for idle CPUs, which is always
    expensive, so the commit:

    1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")

    introduced a way to limit how many CPUs we scan.

    But the scan can spend some of those 'nr' attempts on CPUs that are not
    allowed for the task, wasting them. The function may then always return
    nr_cpumask_bits without ever finding a CPU on which our task is allowed
    to run.

    The cpumask may be too big to put on the stack, so, as in
    select_idle_core(), use the per-CPU 'select_idle_mask' to prevent stack
    overflow.
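
    The idea, sketched with the helpers mentioned above (not the exact
    upstream hunk): restrict the scan mask to the task's allowed CPUs up
    front, so every one of the 'nr' attempts is spent on a usable CPU.

    struct cpumask *cpus = this_cpu_cpumask_var_ptr(select_idle_mask);

    cpumask_and(cpus, sched_domain_span(sd), p->cpus_ptr);

    for_each_cpu_wrap(cpu, cpus, target) {
            if (!--nr)
                    return -1;              /* scan budget exhausted */
            if (available_idle_cpu(cpu))
                    break;                  /* found an allowed idle CPU */
    }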

    Fixes: 1ad3aaf3fcd2 ("sched/core: Implement new approach to scale select_idle_cpu()")
    Signed-off-by: Cheng Jian
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Srikar Dronamraju
    Reviewed-by: Vincent Guittot
    Reviewed-by: Valentin Schneider
    Link: https://lkml.kernel.org/r/20191213024530.28052-1-cj.chengjian@huawei.com
    Signed-off-by: Greg Kroah-Hartman

    Cheng Jian
     
  • commit 78041c0c9e935d9ce4086feeff6c569ed88ddfd4 upstream.

    The tracing selftests check various aspects of the tracing infrastructure,
    and one is filtering. If trace_printk() is active during a self test, it can
    cause the filtering to fail, which will disable that part of the trace.

    To keep the selftests from failing because of trace_printk() calls,
    trace_printk() checks the variable tracing_selftest_running, and if set, it
    does not write to the tracing buffer.

    As some tracers were registered earlier in boot, the selftest they triggered
    would fail because not all the infrastructure was set up for the full
    selftest. Thus, some of the tests were postponed to when their
    infrastructure was ready (namely the file system code). The postponement
    code did not set the tracing_selftest_running variable, and could fail if
    a trace_printk() was added and executed during their run.
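
    The shape of the fix, sketched (the deferred-selftest runner brackets its
    loop the same way the early boot path does; names follow the changelog
    rather than the exact upstream code):

    tracing_selftest_running = true;
    /* ... run each postponed tracer selftest ... */
    tracing_selftest_running = false;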

    Cc: stable@vger.kernel.org
    Fixes: 9afecfbb95198 ("tracing: Postpone tracer start-up tests till the system is more robust")
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Steven Rostedt (VMware)
     
  • commit 756125289285f6e55a03861bf4b6257aa3d19a93 upstream.

    This patch ensures that we always check the netlink payload length
    in audit_receive_msg() before we take any action on the payload
    itself.
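
    The general pattern, sketched with a hypothetical message type (not the
    literal upstream diff): validate nlmsg_len(nlh) before interpreting the
    payload.

    len = nlmsg_len(nlh);
    if (len < sizeof(struct some_audit_payload))   /* hypothetical struct */
            return -EINVAL;
    payload = nlmsg_data(nlh);   /* safe to use only after the length check */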

    Cc: stable@vger.kernel.org
    Reported-by: syzbot+399c44bf1f43b8747403@syzkaller.appspotmail.com
    Reported-by: syzbot+e4b12d8d202701f08b6d@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • commit 2ad3e17ebf94b7b7f3f64c050ff168f9915345eb upstream.

    Commit 219ca39427bf ("audit: use union for audit_field values since
    they are mutually exclusive") combined a number of separate fields in
    the audit_field struct into a single union. Generally this worked
    just fine because they are generally mutually exclusive.
    Unfortunately in audit_data_to_entry() the overlap can be a problem
    when a specific error case is triggered that causes the error path
    code to attempt to cleanup an audit_field struct and the cleanup
    involves attempting to free a stored LSM string (the lsm_str field).
    Currently the code always has a non-NULL value in the
    audit_field.lsm_str field as the top of the for-loop transfers a
    value into audit_field.val (both .lsm_str and .val are part of the
    same union); if audit_data_to_entry() fails and the audit_field
    struct is specified to contain a LSM string, but the
    audit_field.lsm_str has not yet been properly set, the error handling
    code will attempt to free the bogus audit_field.lsm_str value that
    was set with audit_field.val at the top of the for-loop.

    This patch corrects this by ensuring that the audit_field.val is only
    set when needed (it is cleared when the audit_field struct is
    allocated with kcalloc()). It also corrects a few other issues to
    ensure that in case of error the proper error code is returned.

    Cc: stable@vger.kernel.org
    Fixes: 219ca39427bf ("audit: use union for audit_field values since they are mutually exclusive")
    Reported-by: syzbot+1f4d90ead370d72e450b@syzkaller.appspotmail.com
    Signed-off-by: Paul Moore
    Signed-off-by: Greg Kroah-Hartman

    Paul Moore
     
  • [ Upstream commit 2a4b03ffc69f2dedc6388e9a6438b5f4c133a40d ]

    When a running task is moved onto a throttled task group and there is no
    other task enqueued on the CPU, the task can keep running using 100% CPU
    regardless of the bandwidth allocated to the group, even though its cfs_rq
    is throttled. Furthermore, the group entity of the cfs_rq and its parents
    are not enqueued but only set as curr on their respective cfs_rqs.

    We have the following sequence:

    sched_move_task
    -dequeue_task: dequeue task and group_entities.
    -put_prev_task: put task and group entities.
    -sched_change_group: move task to new group.
    -enqueue_task: enqueue only task but not group entities because cfs_rq is
    throttled.
    -set_next_task : set task and group_entities as current sched_entity of
    their cfs_rq.

    Another impact is that the runnable_load_avg at the root rq stays null
    because the group entities are not enqueued. This situation persists
    until an "external" event triggers a reschedule. Let's trigger it
    immediately instead.

    Signed-off-by: Vincent Guittot
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Ben Segall
    Link: https://lkml.kernel.org/r/1579011236-31256-1-git-send-email-vincent.guittot@linaro.org
    Signed-off-by: Sasha Levin

    Vincent Guittot
     
  • [ Upstream commit ebc0f83c78a2d26384401ecf2d2fa48063c0ee27 ]

    The way loadavg is tracked during nohz only pays attention to the load
    upon entering nohz. This can be particularly noticeable if full nohz is
    entered while non-idle, and then the cpu goes idle and stays that way for
    a long time.

    Use the remote tick to ensure that full nohz cpus report their deltas
    within a reasonable time.

    [ swood: Added changelog and removed recheck of stopped tick. ]

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-3-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Peter Zijlstra (Intel)
     
  • [ Upstream commit 488603b815a7514c7009e6fc339d74ed4a30f343 ]

    This will be used in the next patch to get a loadavg update from
    nohz cpus. The delta check is skipped because idle_sched_class
    doesn't update se.exec_start.

    Signed-off-by: Scott Wood
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/1578736419-14628-2-git-send-email-swood@redhat.com
    Signed-off-by: Sasha Levin

    Scott Wood
     

29 Feb, 2020

3 commits

  • commit e20d3a055a457a10a4c748ce5b7c2ed3173a1324 upstream.

    This 'if' guards whether user space wants a copy of the offload-jited
    bytecode and whether this bytecode exists. Erroneously doing a bitwise
    AND instead of a logical AND on the user- and kernel-space buffer sizes
    can lead to no data being copied to user space, especially when the
    user-space size is a power of two and bigger than the kernel-space buffer.
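
    In essence (variable names here are illustrative, not the exact fields):

    if (kernel_len & user_len)    /* buggy: bitwise AND of two sizes can be 0
                                     even when both are non-zero, e.g. 2048 & 4096 */
            /* copy to user space */;

    if (kernel_len && user_len)   /* intended: copy whenever both are non-zero */
            /* copy to user space */;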

    Fixes: fcfb126defda ("bpf: add new jited info fields in bpf_dev_offload and bpf_prog_info")
    Signed-off-by: Johannes Krude
    Signed-off-by: Daniel Borkmann
    Acked-by: Jakub Kicinski
    Link: https://lore.kernel.org/bpf/20200212193227.GA3769@phlox.h.transitiv.net
    Signed-off-by: Greg Kroah-Hartman

    Johannes Krude
     
  • commit cba6437a1854fde5934098ec3bd0ee83af3129f5 upstream.

    Qian Cai reported that the WARN_ON() in the x86/msi affinity setting code,
    which catches cases where the affinity setting is not done on the CPU which
    is the current target of the interrupt, triggers during CPU hotplug stress
    testing.

    It turns out that the warning which was added with the commit addressing
    the MSI affinity race unearthed yet another long standing bug.

    If user space writes a bogus affinity mask, i.e. one that contains no online
    CPUs, then the kernel calls irq_select_affinity_usr(). This was introduced
    for ALPHA in

    eee45269b0f5 ("[PATCH] Alpha: convert to generic irq framework (generic part)")

    and subsequently made available for all architectures in

    18404756765c ("genirq: Expose default irq affinity mask (take 3)")

    which introduced the circumvention of the affinity setting restrictions
    for interrupts which cannot be moved in process context.

    The whole exercise is bogus in various aspects:

    1) If the interrupt is already started up then there is absolutely
    no point to honour a bogus interrupt affinity setting from user
    space. The interrupt is already assigned to an online CPU and it
    does not make any sense to reassign it to some other randomly
    chosen online CPU.

    2) If the interrupt is not yet started up then there is no point
    either. A subsequent startup of the interrupt will invoke
    irq_setup_affinity() anyway which will choose a valid target CPU.

    So the only correct solution is to just return -EINVAL in case user space
    wrote an affinity mask which does not contain any online CPUs, except for
    ALPHA, which has its own magic sauce for this.
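
    Sketched, the user-space write path then does something like this
    (abbreviated, not the exact upstream hunk):

    if (!cpumask_intersects(new_value, cpu_online_mask)) {
            /* was: irq_select_affinity_usr(irq); */
            err = -EINVAL;
            goto free_cpumask;
    }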

    Fixes: 18404756765c ("genirq: Expose default irq affinity mask (take 3)")
    Reported-by: Qian Cai
    Signed-off-by: Thomas Gleixner
    Tested-by: Qian Cai
    Link: https://lkml.kernel.org/r/878sl8xdbm.fsf@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • commit 6fcca0fa48118e6d63733eb4644c6cd880c15b8f upstream.

    Issuing write() with count parameter set to 0 on any file under
    /proc/pressure/ will cause an OOB write because of the access to
    buf[buf_size-1] when NUL-termination is performed. Fix this by checking
    for buf_size to be non-zero.
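
    A minimal sketch of the fixed write handler prologue (simplified from the
    changelog description):

    if (!nbytes)
            return -EINVAL;

    buf_size = min(nbytes, sizeof(buf));
    if (copy_from_user(buf, user_buf, buf_size))
            return -EFAULT;

    buf[buf_size - 1] = '\0';   /* safe now that buf_size is known non-zero */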

    Signed-off-by: Suren Baghdasaryan
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Acked-by: Johannes Weiner
    Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com
    Signed-off-by: Greg Kroah-Hartman

    Suren Baghdasaryan
     

24 Feb, 2020

19 commits

  • [ Upstream commit 6722b23e7a2ace078344064a9735fb73e554e9ef ]

    If the seq_file .next function does not change the position index, a
    read after some lseek can generate unexpected output.

    Without patch:
    # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
    dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
    n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
    # Available triggers:
    # traceon traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
    6+1 records in
    6+1 records out
    206 bytes copied, 0.00027916 s, 738 kB/s

    Notice the printing of "# Available triggers:..." after the line.

    With the patch:
    # dd bs=30 skip=1 if=/sys/kernel/tracing/events/sched/sched_switch/trigger
    dd: /sys/kernel/tracing/events/sched/sched_switch/trigger: cannot skip to specified offset
    n traceoff snapshot stacktrace enable_event disable_event enable_hist disable_hist hist
    2+1 records in
    2+1 records out
    88 bytes copied, 0.000526867 s, 167 kB/s

    It only prints the end of the file, and does not restart.
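
    The underlying seq_file rule, sketched as a generic .next implementation
    (illustrative only): the position index must advance even on the last
    call that returns NULL, otherwise a subsequent read replays output.

    static void *t_next(struct seq_file *m, void *v, loff_t *pos)
    {
            (*pos)++;                 /* always advance, even when done */
            return get_item(m, *pos); /* hypothetical lookup; NULL at EOF */
    }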

    Link: http://lkml.kernel.org/r/3c35ee24-dd3a-8119-9c19-552ed253388a@virtuozzo.com

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit e4075e8bdffd93a9b6d6e1d52fabedceeca5a91b ]

    If the seq_file .next function does not change the position index, a
    read after some lseek can generate unexpected output.

    Without patch:
    # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
    dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
    id
    no pid
    2+1 records in
    2+1 records out
    10 bytes copied, 0.000213285 s, 46.9 kB/s

    Notice the "id" followed by "no pid".

    With the patch:
    # dd bs=4 skip=1 if=/sys/kernel/tracing/set_ftrace_pid
    dd: /sys/kernel/tracing/set_ftrace_pid: cannot skip to specified offset
    id
    0+1 records in
    0+1 records out
    3 bytes copied, 0.000202112 s, 14.8 kB/s

    Notice that it only prints "id" and not the "no pid" afterward.

    Link: http://lkml.kernel.org/r/4f87c6ad-f114-30bb-8506-c32274ce2992@virtuozzo.com

    https://bugzilla.kernel.org/show_bug.cgi?id=206283
    Signed-off-by: Vasily Averin
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit 90435a7891a2259b0f74c5a1bc5600d0d64cba8f ]

    If the seq_file .next function does not change the position index, a
    read after some lseek can generate unexpected output.

    See also: https://bugzilla.kernel.org/show_bug.cgi?id=206283

    v1 -> v2: removed missed increment at the end of the function

    Signed-off-by: Vasily Averin
    Signed-off-by: Daniel Borkmann
    Link: https://lore.kernel.org/bpf/eca84fdd-c374-a154-d874-6c7b55fc3bc4@virtuozzo.com
    Signed-off-by: Sasha Levin

    Vasily Averin
     
  • [ Upstream commit c79108bd19a8490315847e0c95ac6526fcd8e770 ]

    The alarmtimer_suspend() function will fail if an RTC device is on a bus
    such as SPI or i2c and that RTC device registers and probes after
    alarmtimer_init() registers and probes the 'alarmtimer' platform device.

    This is because system wide suspend suspends devices in the reverse order
    of their probe. When alarmtimer_suspend() attempts to program the RTC for a
    wakeup it will try to program an RTC device on a bus that has already been
    suspended.

    Move the alarmtimer device registration to happen when the RTC which is
    used for wakeup is registered. Register the 'alarmtimer' platform device as
    a child of the RTC device too, so that it can be guaranteed that the RTC
    device won't be suspended when alarmtimer_suspend() is called.

    Reported-by: Douglas Anderson
    Signed-off-by: Stephen Boyd
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Douglas Anderson
    Link: https://lore.kernel.org/r/20200124055849.154411-2-swboyd@chromium.org
    Signed-off-by: Sasha Levin

    Stephen Boyd
     
  • [ Upstream commit 708e0ada1916be765b7faa58854062f2bc620bbf ]

    In setup_load_info(), info->name (which contains the name of the module,
    mostly used for early logging purposes before the module gets set up)
    gets unconditionally assigned if .modinfo is missing despite the fact
    that there is an if (!info->name) check near the end of the function.
    Avoid assigning a placeholder string to info->name if .modinfo doesn't
    exist, so that we can fall back to info->mod->name later on.

    Fixes: 5fdc7db6448a ("module: setup load info before module_sig_check()")
    Reviewed-by: Miroslav Benes
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    Jessica Yu
     
  • [ Upstream commit 11e31f608b499f044f24b20be73f1dcab3e43f8a ]

    Robert reported that during boot the watchdog timestamp is set to 0 for one
    second which is the indicator for a watchdog reset.

    The reason for this is that the timestamp is in seconds and the time is
    taken from sched clock and divided by ~1e9. sched clock starts at 0 which
    means that for the first second during boot the watchdog timestamp is 0,
    i.e. reset.

    Use ULONG_MAX as the reset indicator value so the watchdog works correctly
    right from the start. ULONG_MAX would only conflict with a real timestamp
    if the system reaches an uptime of 136 years on 32bit and almost eternity
    on 64bit.

    Reported-by: Robert Richter
    Signed-off-by: Thomas Gleixner
    Link: https://lore.kernel.org/r/87o8v3uuzl.fsf@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin

    Thomas Gleixner
     
  • [ Upstream commit ccf74128d66ce937876184ad55db2e0276af08d3 ]

    topology.c::get_group() relies on the assumption that non-NUMA domains do
    not partially overlap. Zeng Tao pointed out in [1] that such topology
    descriptions, while completely bogus, can end up being exposed to the
    scheduler.

    In his example (8 CPUs, 2-node system), we end up with:
    MC span for CPU3 == 3-7
    MC span for CPU4 == 4-7

    The first pass through get_group(3, sdd@MC) will result in the following
    sched_group list:

    3 -> 4 -> 5 -> 6 -> 7 -> (back to 3)

    And a later pass through get_group(4, sdd@MC) will "corrupt" that to:

    3 -> 4 -> 5 -> 6 -> 7 -> (back to 4, so the loop never returns to 3)

    which will completely break things like 'while (sg != sd->groups)' when
    using CPU3's base sched_domain.

    There already are some architecture-specific checks in place such as
    x86/kernel/smpboot.c::topology.sane(), but this is something we can detect
    in the core scheduler, so it seems worthwhile to do so.

    Warn and abort the construction of the sched domains if such a broken
    topology description is detected. Note that this is somewhat
    expensive (O(t.c²), 't' non-NUMA topology levels and 'c' CPUs) and could be
    gated under SCHED_DEBUG if deemed necessary.
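
    A rough sketch of such a core-scheduler check (simplified; the real
    helper iterates the topology levels): for every pair of CPUs at a
    non-NUMA level, their spans must be either identical or disjoint.

    static bool topology_span_sane(struct sched_domain_topology_level *tl,
                                   const struct cpumask *cpu_map, int cpu)
    {
            int i;

            for_each_cpu(i, cpu_map) {
                    if (i == cpu)
                            continue;
                    if (!cpumask_equal(tl->mask(cpu), tl->mask(i)) &&
                        cpumask_intersects(tl->mask(cpu), tl->mask(i)))
                            return false;   /* partial overlap: bogus topology */
            }
            return true;
    }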

    Testing
    =======

    Dietmar managed to reproduce this using the following qemu incantation:

    $ qemu-system-aarch64 -kernel ./Image -hda ./qemu-image-aarch64.img \
    -append 'root=/dev/vda console=ttyAMA0 loglevel=8 sched_debug' -smp \
    cores=8 --nographic -m 512 -cpu cortex-a53 -machine virt -numa \
    node,cpus=0-2,nodeid=0 -numa node,cpus=3-7,nodeid=1

    alongside the following drivers/base/arch_topology.c hack (AIUI wouldn't be
    needed if '-smp cores=X, sockets=Y' would work with qemu):

    if (cpuid_topo->package_id != cpu_topo->package_id)
            continue;

    + if ((cpu < 4 && cpuid > 3) || (cpu > 3 && cpuid < 4))
    +         continue;
    +
    cpumask_set_cpu(cpuid, &cpu_topo->core_sibling);
    cpumask_set_cpu(cpu, &cpuid_topo->core_sibling);

    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Link: https://lkml.kernel.org/r/20200115160915.22575-1-valentin.schneider@arm.com
    Signed-off-by: Sasha Levin

    Valentin Schneider
     
  • [ Upstream commit dcd6dffb0a75741471297724640733fa4e958d72 ]

    rq::uclamp is an array of struct uclamp_rq, make sure we clear the
    whole thing.
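
    In other words (sketched; the upstream patch spells the size out with
    UCLAMP_CNT), the per-rq reset must cover the whole array:

    memset(&cpu_rq(cpu)->uclamp, 0, sizeof(cpu_rq(cpu)->uclamp));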

    Fixes: 69842cba9ace ("sched/uclamp: Add CPU's clamp buckets refcounting")
    Signed-off-by: Li Guanglei
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Qais Yousef
    Link: https://lkml.kernel.org/r/1577259844-12677-1-git-send-email-guangleix.li@gmail.com
    Signed-off-by: Sasha Levin

    Li Guanglei
     
  • [ Upstream commit 894c9ef9780c5cf2f143415e867ee39a33ecb75d ]

    Configuring an instance's parallel mask without any online CPUs...

    echo 2 > /sys/kernel/pcrypt/pencrypt/parallel_cpumask
    echo 0 > /sys/devices/system/cpu/cpu1/online

    ...makes tcrypt mode=215 crash like this:

    divide error: 0000 [#1] SMP PTI
    CPU: 4 PID: 283 Comm: modprobe Not tainted 5.4.0-rc8-padata-doc-v2+ #2
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20191013_105130-anatol 04/01/2014
    RIP: 0010:padata_do_parallel+0x114/0x300
    Call Trace:
    pcrypt_aead_encrypt+0xc0/0xd0 [pcrypt]
    crypto_aead_encrypt+0x1f/0x30
    do_mult_aead_op+0x4e/0xdf [tcrypt]
    test_mb_aead_speed.constprop.0.cold+0x226/0x564 [tcrypt]
    do_test+0x28c2/0x4d49 [tcrypt]
    tcrypt_mod_init+0x55/0x1000 [tcrypt]
    ...

    cpumask_weight() in padata_cpu_hash() returns 0 because the mask has no
    CPUs. The problem is __padata_remove_cpu() checks for valid masks too
    early and so doesn't mark the instance PADATA_INVALID as expected, which
    would have made padata_do_parallel() return error before doing the
    division.

    Fix by introducing a second padata CPU hotplug state before
    CPUHP_BRINGUP_CPU so that __padata_remove_cpu() sees the online mask
    without @cpu. No need for the second argument to padata_replace() since
    @cpu is now already missing from the online mask.

    Fixes: 33e54450683c ("padata: Handle empty padata cpumasks")
    Signed-off-by: Daniel Jordan
    Cc: Eric Biggers
    Cc: Herbert Xu
    Cc: Sebastian Andrzej Siewior
    Cc: Steffen Klassert
    Cc: Thomas Gleixner
    Cc: linux-crypto@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Herbert Xu
    Signed-off-by: Sasha Levin

    Daniel Jordan
     
  • [ Upstream commit bf08949cc8b98b7d1e20cfbba169a5938d42dae8 ]

    While running kprobe module test, find_module_all() caused
    a suspicious RCU usage warning.

    -----
    =============================
    WARNING: suspicious RCU usage
    5.4.0-next-20191202+ #63 Not tainted
    -----------------------------
    kernel/module.c:619 RCU-list traversed in non-reader section!!

    other info that might help us debug this:

    rcu_scheduler_active = 2, debug_locks = 1
    1 lock held by rmmod/642:
    #0: ffffffff8227da80 (module_mutex){+.+.}, at: __x64_sys_delete_module+0x9a/0x230

    stack backtrace:
    CPU: 0 PID: 642 Comm: rmmod Not tainted 5.4.0-next-20191202+ #63
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014
    Call Trace:
    dump_stack+0x71/0xa0
    find_module_all+0xc1/0xd0
    __x64_sys_delete_module+0xac/0x230
    ? do_syscall_64+0x12/0x1f0
    do_syscall_64+0x50/0x1f0
    entry_SYSCALL_64_after_hwframe+0x49/0xbe
    RIP: 0033:0x4b6d49
    -----

    This is because list_for_each_entry_rcu(modules) is called
    without rcu_read_lock(). This is safe because the module_mutex
    is locked.

    Pass lockdep_is_held(&module_mutex) to list_for_each_entry_rcu()
    to suppress this warning. This also fixes a similar issue in
    mod_find() and each_symbol_section().
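
    The resulting traversal looks roughly like this (abbreviated):

    list_for_each_entry_rcu(mod, &modules, list,
                            lockdep_is_held(&module_mutex)) {
            /* ... */
    }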

    Signed-off-by: Masami Hiramatsu
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    Masami Hiramatsu
     
  • [ Upstream commit b527b638fd63ba791dc90a0a6e9a3035b10df52b ]

    In the process of adding better error messages for sorting, I realized
    that strsep was being used incorrectly and some of the error paths I
    was expecting to be hit weren't and just fell through to the common
    invalid key error case.

    It also became obvious that for keyword assignments, it wasn't
    necessary to save the full assignment and reparse it later, and having
    a common empty-assignment check would also make more sense in terms of
    error processing.

    Change the code to fix these problems and simplify it for new error
    message changes in a subsequent patch.

    Link: http://lkml.kernel.org/r/1c3ef0b6655deaf345f6faee2584a0298ac2d743.1561743018.git.zanussi@kernel.org

    Fixes: e62347d24534 ("tracing: Add hist trigger support for user-defined sorting ('sort=' param)")
    Fixes: 7ef224d1d0e3 ("tracing: Add 'hist' event trigger command")
    Fixes: a4072fe85ba3 ("tracing: Add a clock attribute for hist triggers")
    Reported-by: Masami Hiramatsu
    Reviewed-by: Masami Hiramatsu
    Signed-off-by: Tom Zanussi
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Tom Zanussi
     
  • [ Upstream commit dfb6cd1e654315168e36d947471bd2a0ccd834ae ]

    Looking through old emails in my INBOX, I came across a patch from Luis
    Henriques that attempted to fix a race of two stat tracers registering the
    same stat trace (extremely unlikely, as this is done in the kernel, and
    probably doesn't even exist). The submitted patch wasn't quite right as it
    needed to deal with clean up a bit better (if two stat tracers were the
    same, it would have the same files).

    But to make the code cleaner, all we needed to do is to keep the
    all_stat_sessions_mutex held for most of the registering function.

    Link: http://lkml.kernel.org/r/1410299375-20068-1-git-send-email-luis.henriques@canonical.com

    Fixes: 002bb86d8d42f ("tracing/ftrace: separate events tracing and stats tracing engine")
    Reported-by: Luis Henriques
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Steven Rostedt (VMware)
     
  • [ Upstream commit afccc00f75bbbee4e4ae833a96c2d29a7259c693 ]

    tracing_stat_init() was always returning '0', even on the error paths. It
    now returns -ENODEV if tracing_init_dentry() fails or -ENOMEM if it fails
    to create the 'trace_stat' debugfs directory.

    Link: http://lkml.kernel.org/r/1410299381-20108-1-git-send-email-luis.henriques@canonical.com

    Fixes: ed6f1c996bfe4 ("tracing: Check return value of tracing_init_dentry()")
    Signed-off-by: Luis Henriques
    [ Pulled from the archeological digging of my INBOX ]
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Sasha Levin

    Luis Henriques
     
  • [ Upstream commit f6d061d617124abbd55396a3bc37b9bf7d33233c ]

    In module_add_modinfo_attrs() if sysfs_create_file() fails
    on the first iteration of the loop (so i = 0), we forget to
    free the modinfo_attrs.

    Fixes: bc6f2a757d52 ("kernel/module: Fix mem leak in module_add_modinfo_attrs")
    Reviewed-by: Miroslav Benes
    Signed-off-by: YueHaibing
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin

    YueHaibing
     
  • [ Upstream commit def97da136515cb289a14729292c193e0a93bc64 ]

    Commit f92b070f2dc8 ("printk: Do not miss new messages when replaying
    the log") introduced a new variable @exclusive_console_stop_seq to
    store when an exclusive console should stop printing. It should be
    set to the @console_seq value at registration. However, @console_seq
    is previously set to @syslog_seq so that the exclusive console knows
    where to begin. This results in the exclusive console immediately
    reactivating all the other consoles and thus repeating the messages
    for those consoles.

    Set @console_seq after @exclusive_console_stop_seq has stored the
    current @console_seq value.
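
    Sketched, the ordering change in the console registration path amounts to
    (abbreviated, not the exact upstream hunk):

    /* buggy order: the stop point records the already-rewound value */
    console_seq = syslog_seq;
    exclusive_console_stop_seq = console_seq;

    /* fixed order: record the stop point first, then rewind for replay */
    exclusive_console_stop_seq = console_seq;
    console_seq = syslog_seq;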

    Fixes: f92b070f2dc8 ("printk: Do not miss new messages when replaying the log")
    Link: http://lkml.kernel.org/r/20191219115322.31160-1-john.ogness@linutronix.de
    Cc: Steven Rostedt
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: John Ogness
    Acked-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Sasha Levin

    John Ogness
     
  • [ Upstream commit 45178ac0cea853fe0e405bf11e101bdebea57b15 ]

    Paul reported a very sporadic, rcutorture induced, workqueue failure.
    When the planets align, the workqueue rescuer's self-migrate fails and
    then triggers a WARN for running a work on the wrong CPU.

    Tejun then figured that set_cpus_allowed_ptr()'s stop_one_cpu() call
    could be ignored! When stopper->enabled is false, stop_machine will
    instantly complete the work without actually doing it. Worse, it
    will not WARN about this (we really should fix this).

    It turns out there is a small window where a freshly online'ed CPU is
    marked 'online' but doesn't yet have the stopper task running:

    BP                                  AP

    bringup_cpu()
      __cpu_up(cpu, idle)      -->      start_secondary()
                                        ...
                                        cpu_startup_entry()
      bringup_wait_for_ap()
        wait_for_ap_thread()            enable = true;

    Close this by moving the stop_machine_unpark() into
    cpuhp_online_idle(), such that the stopper thread is ready before we
    start the idle loop and schedule.

    Reported-by: "Paul E. McKenney"
    Debugged-by: Tejun Heo
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: "Paul E. McKenney"
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     
  • [ Upstream commit 6cf539a87a61a4fbc43f625267dbcbcf283872ed ]

    This fixes a data-race where `atomic_t dynticks` is copied by value. The
    copy is performed non-atomically, resulting in a data-race if `dynticks`
    is updated concurrently.

    This data-race was found with KCSAN:
    ==================================================================
    BUG: KCSAN: data-race in dyntick_save_progress_counter / rcu_irq_enter

    write to 0xffff989dbdbe98e0 of 4 bytes by task 10 on cpu 3:
    atomic_add_return include/asm-generic/atomic-instrumented.h:78 [inline]
    rcu_dynticks_snap kernel/rcu/tree.c:310 [inline]
    dyntick_save_progress_counter+0x43/0x1b0 kernel/rcu/tree.c:984
    force_qs_rnp+0x183/0x200 kernel/rcu/tree.c:2286
    rcu_gp_fqs kernel/rcu/tree.c:1601 [inline]
    rcu_gp_fqs_loop+0x71/0x880 kernel/rcu/tree.c:1653
    rcu_gp_kthread+0x22c/0x3b0 kernel/rcu/tree.c:1799
    kthread+0x1b5/0x200 kernel/kthread.c:255

    read to 0xffff989dbdbe98e0 of 4 bytes by task 154 on cpu 7:
    rcu_nmi_enter_common kernel/rcu/tree.c:828 [inline]
    rcu_irq_enter+0xda/0x240 kernel/rcu/tree.c:870
    irq_enter+0x5/0x50 kernel/softirq.c:347

    Reported by Kernel Concurrency Sanitizer on:
    CPU: 7 PID: 154 Comm: kworker/7:1H Not tainted 5.3.0+ #5
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-1 04/01/2014
    Workqueue: kblockd blk_mq_run_work_fn
    ==================================================================

    Signed-off-by: Marco Elver
    Cc: Paul E. McKenney
    Cc: Josh Triplett
    Cc: Steven Rostedt
    Cc: Mathieu Desnoyers
    Cc: Joel Fernandes
    Cc: Ingo Molnar
    Cc: Dmitry Vyukov
    Cc: rcu@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Marco Elver
     
  • [ Upstream commit fd6bc19d7676a060a171d1cf3dcbf6fd797eb05f ]

    Tasks waiting within exp_funnel_lock() for an expedited grace period to
    elapse can be starved due to the following sequence of events:

    1. Tasks A and B both attempt to start an expedited grace
    period at about the same time. This grace period will have
    completed when the lower four bits of the rcu_state structure's
    ->expedited_sequence field are 0b'0100', for example, when the
    initial value of this counter is zero. Task A wins, and thus
    does the actual work of starting the grace period, including
    acquiring the rcu_state structure's .exp_mutex and setting the
    counter to 0b'0001'.

    2. Because task B lost the race to start the grace period, it
    waits on ->expedited_sequence to reach 0b'0100' inside of
    exp_funnel_lock(). This task therefore blocks on the rcu_node
    structure's ->exp_wq[1] field, keeping in mind that the
    end-of-grace-period value of ->expedited_sequence (0b'0100')
    is shifted down two bits before indexing the ->exp_wq[] field.

    3. Task C attempts to start another expedited grace period,
    but blocks on ->exp_mutex, which is still held by Task A.

    4. The aforementioned expedited grace period completes, so that
    ->expedited_sequence now has the value 0b'0100'. A kworker task
    therefore acquires the rcu_state structure's ->exp_wake_mutex
    and starts awakening any tasks waiting for this grace period.

    5. One of the first tasks awakened happens to be Task A. Task A
    therefore releases the rcu_state structure's ->exp_mutex,
    which allows Task C to start the next expedited grace period,
    which causes the lower four bits of the rcu_state structure's
    ->expedited_sequence field to become 0b'0101'.

    6. Task C's expedited grace period completes, so that the lower four
    bits of the rcu_state structure's ->expedited_sequence field now
    become 0b'1000'.

    7. The kworker task from step 4 above continues its wakeups.
    Unfortunately, the wake_up_all() refetches the rcu_state
    structure's .expedited_sequence field:

    wake_up_all(&rnp->exp_wq[rcu_seq_ctr(rcu_state.expedited_sequence) & 0x3]);

    This results in the wakeup being applied to the rcu_node
    structure's ->exp_wq[2] field, which is unfortunate given that
    Task B is instead waiting on ->exp_wq[1].

    On a busy system, no harm is done (or at least no permanent harm is done).
    Some later expedited grace period will redo the wakeup. But on a quiet
    system, such as many embedded systems, it might be a good long time before
    there was another expedited grace period. On such embedded systems,
    this situation could therefore result in a system hang.

    This issue manifested as DPM device timeout during suspend (which
    usually qualifies as a quiet time) due to a SCSI device being stuck in
    synchronize_rcu_expedited(), with the following stack trace:

    schedule()
    synchronize_rcu_expedited()
    synchronize_rcu()
    scsi_device_quiesce()
    scsi_bus_suspend()
    dpm_run_callback()
    __device_suspend()

    This commit therefore prevents such delays, timeouts, and hangs by
    making rcu_exp_wait_wake() use its "s" argument consistently instead of
    refetching from rcu_state.expedited_sequence.
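
    Concretely, the wakeup indexes the waitqueue with the snapshot "s" that
    was passed in (sketched from the line quoted in step 7 above):

    wake_up_all(&rnp->exp_wq[rcu_seq_ctr(s) & 0x3]);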

    Fixes: 3b5f668e715b ("rcu: Overlap wakeups with next expedited grace period")
    Signed-off-by: Neeraj Upadhyay
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Neeraj Upadhyay
     
  • [ Upstream commit 610dea36d3083a977e4f156206cbe1eaa2a532f0 ]

    Commit 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if
    dump_tree") added print statements to rcu_organize_nocb_kthreads for
    debugging, but incorrectly guarded them, causing the function to always
    spew out its message.

    This patch fixes it by guarding both pr_alert statements with dump_tree,
    while also changing the second pr_alert to a pr_cont, to print the
    hierarchy in a single line (assuming that's how it was supposed to
    work).

    Fixes: 18cd8c93e69e ("rcu/nocb: Print gp/cb kthread hierarchy if dump_tree")
    Signed-off-by: Stefan Reiter
    [ paulmck: Make single-nocbs-CPU GP kthreads look less erroneous. ]
    Signed-off-by: Paul E. McKenney
    Signed-off-by: Sasha Levin

    Stefan Reiter
     

20 Feb, 2020

2 commits

  • commit b562d140649966d4daedd0483a8fe59ad3bb465a upstream.

    The check to ensure that the newly written value into cpu.uclamp.{min,max}
    is within range, [0:100], wasn't working because of the signed
    comparison:

    if (req.percent > UCLAMP_PERCENT_SCALE) {
            req.ret = -ERANGE;
            return req;
    }

    # echo -1 > cpu.uclamp.min
    # cat cpu.uclamp.min
    42949671.96

    Cast req.percent into u64 to force the comparison to be unsigned and
    work as intended in capacity_from_percent().

    # echo -1 > cpu.uclamp.min
    sh: write error: Numerical result out of range
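
    The cast itself is small (sketched from the snippet quoted above):

    if ((u64)req.percent > UCLAMP_PERCENT_SCALE) {
            req.ret = -ERANGE;
            return req;
    }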

    Fixes: 2480c093130f ("sched/uclamp: Extend CPU's cgroup controller")
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Link: https://lkml.kernel.org/r/20200114210947.14083-1-qais.yousef@arm.com
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     
  • commit e3728b50cd9be7d4b1469447cdf1feb93e3b7adb upstream.

    It is theoretically possible for the ACPI EC GPE to be set after the
    s2idle_ops->wake() called from s2idle_loop() has returned and before
    the subsequent pm_wakeup_pending() check is carried out. If that
    happens, the resulting wakeup event will cause the system to resume
    even though it may be a spurious one.

    To avoid that race, first make the ->wake() callback in struct
    platform_s2idle_ops return a bool value indicating whether or not
    to let the system resume and rearrange s2idle_loop() to use that
    value instead of the direct pm_wakeup_pending() call if ->wake() is
    present.

    Next, rework acpi_s2idle_wake() to process EC events and check
    pm_wakeup_pending() before re-arming the SCI for system wakeup
    to prevent it from triggering prematurely and add comments to
    that function to explain the rationale for the new code flow.

    Fixes: 56b991849009 ("PM: sleep: Simplify suspend-to-idle control flow")
    Cc: 5.4+ # 5.4+
    Signed-off-by: Rafael J. Wysocki
    Signed-off-by: Greg Kroah-Hartman

    Rafael J. Wysocki
     

15 Feb, 2020

1 commit

  • commit 7226017ad37a888915628e59a84a2d1e57b40707 upstream.

    When a new cgroup is created, the effective uclamp value wasn't updated
    with a call to cpu_util_update_eff(), which looks at the hierarchy and
    updates to the most restrictive values.

    Fix it by ensuring to call cpu_util_update_eff() when a new cgroup
    becomes online.

    Without this change, the newly created cgroup uses the default
    root_task_group uclamp values, which are 1024 for both uclamp_{min,max}.
    This causes the rq to be clamped to max, and hence the system to run at
    max frequency.

    The problem was observed on Ubuntu server and was reproduced on Debian
    and Buildroot rootfs.

    By default, Ubuntu and Debian create a cpu controller cgroup hierarchy
    and add all tasks to it - which creates enough noise to keep the rq
    uclamp value at max most of the time. Imitating this behavior makes the
    problem visible in Buildroot too which otherwise looks fine since it's a
    minimal userspace.

    Fixes: 0b60ba2dd342 ("sched/uclamp: Propagate parent clamps")
    Reported-by: Doug Smythies
    Signed-off-by: Qais Yousef
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Doug Smythies
    Link: https://lore.kernel.org/lkml/000701d5b965$361b6c60$a2524520$@net/
    Signed-off-by: Greg Kroah-Hartman

    Qais Yousef
     

11 Feb, 2020

2 commits

  • commit 003461559ef7a9bd0239bae35a22ad8924d6e9ad upstream.

    Decreasing sysctl_perf_event_mlock between two consecutive perf_mmap()s of
    a perf ring buffer may lead to an integer underflow in locked memory
    accounting. This may lead to undesired behaviors, such as failures in
    BPF map creation.

    Address this by adjusting the accounting logic to take into account the
    possibility that the amount of already locked memory may exceed the
    current limit.

    Fixes: c4b75479741c ("perf/core: Make the mlock accounting simple again")
    Suggested-by: Alexander Shishkin
    Signed-off-by: Song Liu
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Ingo Molnar
    Cc:
    Acked-by: Alexander Shishkin
    Link: https://lkml.kernel.org/r/20200123181146.2238074-1-songliubraving@fb.com
    Signed-off-by: Greg Kroah-Hartman

    Song Liu
     
  • commit febac332a819f0e764aa4da62757ba21d18c182b upstream.

    Kernel crashes inside QEMU/KVM are observed:

    kernel BUG at kernel/time/timer.c:1154!
    BUG_ON(timer_pending(timer) || !timer->function) in add_timer_on().

    At the same time another cpu got:

    general protection fault: 0000 [#1] SMP PTI on poison pointer 0xdead000000000200 in:

    __hlist_del at include/linux/list.h:681
    (inlined by) detach_timer at kernel/time/timer.c:818
    (inlined by) expire_timers at kernel/time/timer.c:1355
    (inlined by) __run_timers at kernel/time/timer.c:1686
    (inlined by) run_timer_softirq at kernel/time/timer.c:1699

    Unfortunately kernel logs are badly scrambled, stacktraces are lost.

    Printing the timer->function before the BUG_ON() pointed to
    clocksource_watchdog().

    The execution of clocksource_watchdog() can race with a sequence of
    clocksource_stop_watchdog() .. clocksource_start_watchdog():

    expire_timers()
      detach_timer(timer, true);
        timer->entry.pprev = NULL;
      raw_spin_unlock_irq(&base->lock);
      call_timer_fn
        clocksource_watchdog()
                                        clocksource_watchdog_kthread() or
                                        clocksource_unbind()

                                        spin_lock_irqsave(&watchdog_lock, flags);
                                        clocksource_stop_watchdog();
                                          del_timer(&watchdog_timer);
                                          watchdog_running = 0;
                                        spin_unlock_irqrestore(&watchdog_lock, flags);

                                        spin_lock_irqsave(&watchdog_lock, flags);
                                        clocksource_start_watchdog();
                                          add_timer_on(&watchdog_timer, ...);
                                          watchdog_running = 1;
                                        spin_unlock_irqrestore(&watchdog_lock, flags);

        spin_lock(&watchdog_lock);
        add_timer_on(&watchdog_timer, ...);
          BUG_ON(timer_pending(timer) || !timer->function);
            timer_pending() -> true
            BUG()

    I.e. inside clocksource_watchdog() watchdog_timer could be already armed.

    Check timer_pending() before calling add_timer_on(). This is sufficient as
    all operations are synchronized by watchdog_lock.
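
    A sketch of that guard inside clocksource_watchdog() (abbreviated):

    spin_lock(&watchdog_lock);
    if (!timer_pending(&watchdog_timer)) {
            watchdog_timer.expires += WATCHDOG_INTERVAL;
            add_timer_on(&watchdog_timer, next_cpu);
    }
    spin_unlock(&watchdog_lock);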

    Fixes: 75c5158f70c0 ("timekeeping: Update clocksource with stop_machine")
    Signed-off-by: Konstantin Khlebnikov
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/158048693917.4378.13823603769948933793.stgit@buzz
    Signed-off-by: Greg Kroah-Hartman

    Konstantin Khlebnikov