27 Nov, 2018

1 commit

  • [ Upstream commit 40fa3780bac2b654edf23f6b13f4e2dd550aea10 ]

    When running on linux-next (8c60c36d0b8c ("Add linux-next specific files
    for 20181019")) + CONFIG_PROVE_LOCKING=y on a big.LITTLE system (e.g.
    Juno or HiKey960), we get the following report:

    [ 0.748225] Call trace:
    [ 0.750685] lockdep_assert_cpus_held+0x30/0x40
    [ 0.755236] static_key_enable_cpuslocked+0x20/0xc8
    [ 0.760137] build_sched_domains+0x1034/0x1108
    [ 0.764601] sched_init_domains+0x68/0x90
    [ 0.768628] sched_init_smp+0x30/0x80
    [ 0.772309] kernel_init_freeable+0x278/0x51c
    [ 0.776685] kernel_init+0x10/0x108
    [ 0.780190] ret_from_fork+0x10/0x18

    The static_key in question is 'sched_asym_cpucapacity' introduced by
    commit:

    df054e8445a4 ("sched/topology: Add static_key for asymmetric CPU capacity optimizations")

    In this particular case, we enable it because smp_prepare_cpus() will
    end up fetching the capacity-dmips-mhz entry from the devicetree,
    so we already have some asymmetry detected when entering sched_init_smp().

    This didn't get detected in tip/sched/core because we were missing:

    commit cb538267ea1e ("jump_label/lockdep: Assert we hold the hotplug lock for _cpuslocked() operations")

    Calls to build_sched_domains() post sched_init_smp() will hold the
    hotplug lock, it just so happens that this very first call is a
    special case. As stated by a comment in sched_init_smp(), "There's no
    userspace yet to cause hotplug operations" so this is a harmless
    warning.

    However, to both respect the semantics of underlying
    callees and make lockdep happy, take the hotplug lock in
    sched_init_smp(). This also satisfies the comment atop
    sched_init_domains() that says "Callers must hold the hotplug lock".
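
    The shape of such a change, sketched here under the assumption that the
    hotplug read lock simply brackets the existing domain setup (the actual
    hunk may differ in detail):

        void __init sched_init_smp(void)
        {
                ...
                /*
                 * There's no userspace yet to cause hotplug operations, but
                 * take the hotplug lock anyway to satisfy lockdep and the
                 * locking requirements of the callees.
                 */
                cpus_read_lock();
                mutex_lock(&sched_domains_mutex);
                sched_init_domains(cpu_active_mask);
                mutex_unlock(&sched_domains_mutex);
                cpus_read_unlock();
                ...
        }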

    Reported-by: Sudeep Holla
    Tested-by: Sudeep Holla
    Signed-off-by: Valentin Schneider
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Dietmar.Eggemann@arm.com
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: morten.rasmussen@arm.com
    Cc: quentin.perret@arm.com
    Link: http://lkml.kernel.org/r/1540301851-3048-1-git-send-email-valentin.schneider@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Valentin Schneider
     

23 Nov, 2018

1 commit

  • This reverts commit 8a13906ae519b3ed95cd0fb73f1098b46362f6c4 which is
    commit 53c613fe6349994f023245519265999eed75957f upstream.

    It's not ready for the stable trees as there are major slowdowns
    involved with this patch.

    Reported-by: Jiri Kosina
    Cc: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: "SchauflerCasey"
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     

21 Nov, 2018

3 commits

  • commit fd5f7cde1b85d4c8e09ca46ce948e008a2377f64 upstream.

    This patch, basically, reverts commit 6b97a20d3a79 ("printk:
    set may_schedule for some of console_trylock() callers").
    That commit was a mistake, it introduced a big dependency
    on the scheduler, by enabling preemption under console_sem
    in printk()->console_unlock() path, which is rather too
    critical. The patch did not significantly reduce the
    possibilities of printk() lockups, but made it possible to
    stall printk(), as has been reported by Tetsuo Handa [1].

    Another issue is that preemption under console_sem also
    interferes with Steven Rostedt's hand-off scheme, by making
    it possible to sleep with console_sem both in console_unlock()
    and in vprintk_emit(), after acquiring the console_sem
    ownership (anywhere between printk_safe_exit_irqrestore() in
    console_trylock_spinning() and printk_safe_enter_irqsave()
    in console_unlock()). This makes hand-off less likely and,
    at the same time, may result in a significant amount of
    pending logbuf messages. A preempted console_sem owner makes
    it impossible for other CPUs to emit logbuf messages, but
    does not make it impossible for other CPUs to append new
    messages to the logbuf.

    Reinstate the old behavior and make printk() non-preemptible.
    Should any printk() lockup reports arrive they must be handled
    in a different way.
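
    A sketch of what "non-preemptible" means here, assuming the change is
    confined to console_trylock() (details may differ from the actual patch):

        int console_trylock(void)
        {
                if (down_trylock_console_sem())
                        return 0;
                if (console_suspended) {
                        up_console_sem();
                        return 0;
                }
                console_locked = 1;
                /*
                 * Never allow rescheduling while console_sem was taken via
                 * the trylock path; the owner must flush without sleeping.
                 */
                console_may_schedule = 0;
                return 1;
        }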

    [1] http://lkml.kernel.org/r/201603022101.CAH73907.OVOOMFHFFtQJSL@I-love.SAKURA.ne.jp
    Fixes: 6b97a20d3a79 ("printk: set may_schedule for some of console_trylock() callers")
    Link: http://lkml.kernel.org/r/20180116044716.GE6607@jagdpanzerIV
    To: Tetsuo Handa
    Cc: Sergey Senozhatsky
    Cc: Tejun Heo
    Cc: akpm@linux-foundation.org
    Cc: linux-mm@kvack.org
    Cc: Cong Wang
    Cc: Dave Hansen
    Cc: Johannes Weiner
    Cc: Mel Gorman
    Cc: Michal Hocko
    Cc: Vlastimil Babka
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Jan Kara
    Cc: Mathieu Desnoyers
    Cc: Byungchul Park
    Cc: Pavel Machek
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Sergey Senozhatsky
    Reported-by: Tetsuo Handa
    Reviewed-by: Steven Rostedt (VMware)
    Signed-off-by: Petr Mladek
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Sergey Senozhatsky
     
  • commit 568fb6f42ac6851320adaea25f8f1b94de14e40a upstream.

    Since commit ad67b74d2469 ("printk: hash addresses printed with %p"),
    all pointers printed with %p are printed with hashed addresses
    instead of real addresses in order to avoid leaking addresses in
    dmesg and syslog. But this applies to kdb too, which is unfortunate:

    Entering kdb (current=0x(ptrval), pid 329) due to Keyboard Entry
    kdb> ps
    15 sleeping system daemon (state M) processes suppressed,
    use 'ps A' to see all.
    Task Addr Pid Parent [*] cpu State Thread Command
    0x(ptrval) 329 328 1 0 R 0x(ptrval) *sh

    0x(ptrval) 1 0 0 0 S 0x(ptrval) init
    0x(ptrval) 3 2 0 0 D 0x(ptrval) rcu_gp
    0x(ptrval) 4 2 0 0 D 0x(ptrval) rcu_par_gp
    0x(ptrval) 5 2 0 0 D 0x(ptrval) kworker/0:0
    0x(ptrval) 6 2 0 0 D 0x(ptrval) kworker/0:0H
    0x(ptrval) 7 2 0 0 D 0x(ptrval) kworker/u2:0
    0x(ptrval) 8 2 0 0 D 0x(ptrval) mm_percpu_wq
    0x(ptrval) 10 2 0 0 D 0x(ptrval) rcu_preempt

    The whole purpose of kdb is to debug, and for debugging real addresses
    need to be known. In addition, data displayed by kdb doesn't go into
    dmesg.

    This patch replaces all %p by %px in kdb in order to display real
    addresses.
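
    For illustration only (hypothetical calls, not hunks from the patch), the
    difference is simply the printk format specifier used by kdb_printf():

        kdb_printf("Task Addr 0x%p\n",  (void *)p);   /* hashed, or (ptrval) */
        kdb_printf("Task Addr 0x%px\n", (void *)p);   /* real address */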

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc:
    Signed-off-by: Christophe Leroy
    Signed-off-by: Daniel Thompson
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     
  • commit dded2e159208a9edc21dd5c5f583afa28d378d39 upstream.

    On a powerpc 8xx, 'btc' fails as follows:

    Entering kdb (current=0x(ptrval), pid 282) due to Keyboard Entry
    kdb> btc
    btc: cpu status: Currently on cpu 0
    Available cpus: 0
    kdb_getarea: Bad address 0x0

    When booting the kernel with 'debug_boot_weak_hash', it fails as well:

    Entering kdb (current=0xba99ad80, pid 284) due to Keyboard Entry
    kdb> btc
    btc: cpu status: Currently on cpu 0
    Available cpus: 0
    kdb_getarea: Bad address 0xba99ad80

    On other platforms, Oopses have been observed too, see
    https://github.com/linuxppc/linux/issues/139

    This is due to btc calling 'btt' with %p pointer as an argument.

    This patch replaces %p with %px to get the real pointer value, as
    expected by 'btt'.

    Fixes: ad67b74d2469 ("printk: hash addresses printed with %p")
    Cc:
    Signed-off-by: Christophe Leroy
    Reviewed-by: Daniel Thompson
    Signed-off-by: Daniel Thompson
    Signed-off-by: Greg Kroah-Hartman

    Christophe Leroy
     

14 Nov, 2018

11 commits

  • commit 1ae80cf31938c8f77c37a29bbe29e7f1cd492be8 upstream.

    The map-in-map frequently serves as a mechanism for atomic
    snapshotting of state that a BPF program might record. The current
    implementation is dangerous to use in this way, however, since
    userspace has no way of knowing when all programs that might have
    retrieved the "old" value of the map may have completed.

    This change ensures that map update operations on map-in-map map types
    always wait for all references to the old map to drop before returning
    to userspace.
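
    A sketch of the kind of wait described, assuming a helper called from the
    map update path (the helper name is illustrative):

        static void maybe_wait_bpf_programs(struct bpf_map *map)
        {
                /*
                 * Wait for running BPF programs to complete so that, once the
                 * update returns, userspace knows no program can still see
                 * the old inner map.
                 */
                if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
                    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
                        synchronize_rcu();
        }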

    Signed-off-by: Daniel Colascione
    Reviewed-by: Joel Fernandes (Google)
    Signed-off-by: Alexei Starovoitov
    [fengc@google.com: 4.14 backport: adjust context]
    Signed-off-by: Chenbo Feng
    Signed-off-by: Greg Kroah-Hartman

    Daniel Colascione
     
  • commit 746a923b863a1065ef77324e1e43f19b1a3eab5c upstream.

    Commit 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of
    threaded irqs") made detection of spurious interrupts work for threaded
    handlers by:

    a) incrementing a counter every time the thread returns IRQ_HANDLED, and
    b) checking whether that counter has increased every time the thread is
    woken.

    However for oneshot interrupts, the commit unmasks the interrupt before
    incrementing the counter. If another interrupt occurs right after
    unmasking but before the counter is incremented, that interrupt is
    incorrectly considered spurious:

    time
    | irq_thread()
    | irq_thread_fn()
    | action->thread_fn()
    | irq_finalize_oneshot()
    | unmask_threaded_irq() /* interrupt is unmasked */
    |
    | /* interrupt fires, incorrectly deemed spurious */
    |
    | atomic_inc(&desc->threads_handled); /* counter is incremented */
    v

    This is observed with a hi3110 CAN controller receiving data at high volume
    (from a separate machine sending with "cangen -g 0 -i -x"): The controller
    signals a huge number of interrupts (hundreds of millions per day) and
    every second there are about a dozen which are deemed spurious.

    In theory with high CPU load and the presence of higher priority tasks, the
    number of incorrectly detected spurious interrupts might increase beyond
    the 99,900 threshold and cause disablement of the interrupt.

    In practice it just increments the spurious interrupt count. But that can
    cause people to waste time investigating it over and over.

    Fix it by moving the accounting before the invocation of
    irq_finalize_oneshot().
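
    A sketch of the reordering in one of the threaded-handler wrappers
    (assuming the irq_forced_thread_fn() style; details may differ):

        static irqreturn_t irq_forced_thread_fn(struct irq_desc *desc,
                                                struct irqaction *action)
        {
                irqreturn_t ret;

                local_bh_disable();
                ret = action->thread_fn(action->irq, action->dev_id);
                /* account *before* the interrupt is unmasked again */
                if (ret == IRQ_HANDLED)
                        atomic_inc(&desc->threads_handled);
                irq_finalize_oneshot(desc, action);
                local_bh_enable();
                return ret;
        }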

    [ tglx: Folded change log update ]

    Fixes: 1e77d0a1ed74 ("genirq: Sanitize spurious interrupt detection of threaded irqs")
    Signed-off-by: Lukas Wunner
    Signed-off-by: Thomas Gleixner
    Cc: Mathias Duckeck
    Cc: Akshay Bhat
    Cc: Casey Fitzpatrick
    Cc: stable@vger.kernel.org # v3.16+
    Link: https://lkml.kernel.org/r/1dfd8bbd16163940648045495e3e9698e63b50ad.1539867047.git.lukas@wunner.de
    Signed-off-by: Greg Kroah-Hartman

    Lukas Wunner
     
  • commit 277fcdb2cfee38ccdbe07e705dbd4896ba0c9930 upstream.

    log_buf_len_setup does not check its input argument before passing it
    to simple_strtoull. The argument will be a NULL pointer if
    "log_buf_len" is given on the command line without a value, which
    causes the following panic.

    PANIC: early exception 0xe3 IP 10:ffffffffaaeacd0d error 0 cr2 0x0
    [ 0.000000] CPU: 0 PID: 0 Comm: swapper Not tainted 4.19.0-rc4-yocto-standard+ #1
    [ 0.000000] RIP: 0010:_parse_integer_fixup_radix+0xd/0x70
    ...
    [ 0.000000] Call Trace:
    [ 0.000000] simple_strtoull+0x29/0x70
    [ 0.000000] memparse+0x26/0x90
    [ 0.000000] log_buf_len_setup+0x17/0x22
    [ 0.000000] do_early_param+0x57/0x8e
    [ 0.000000] parse_args+0x208/0x320
    [ 0.000000] ? rdinit_setup+0x30/0x30
    [ 0.000000] parse_early_options+0x29/0x2d
    [ 0.000000] ? rdinit_setup+0x30/0x30
    [ 0.000000] parse_early_param+0x36/0x4d
    [ 0.000000] setup_arch+0x336/0x99e
    [ 0.000000] start_kernel+0x6f/0x4ee
    [ 0.000000] x86_64_start_reservations+0x24/0x26
    [ 0.000000] x86_64_start_kernel+0x6f/0x72
    [ 0.000000] secondary_startup_64+0xa4/0xb0

    This patch adds a check to prevent the panic.
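
    A sketch of the added check, assuming it sits at the top of
    log_buf_len_setup():

        static int __init log_buf_len_setup(char *str)
        {
                u64 size;

                /* "log_buf_len" was given without a value: nothing to parse */
                if (!str)
                        return -EINVAL;

                size = memparse(str, &str);
                /* ... existing rounding and assignment of the new length ... */
                return 0;
        }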

    Link: http://lkml.kernel.org/r/1538239553-81805-1-git-send-email-zhe.he@windriver.com
    Cc: stable@vger.kernel.org
    Cc: rostedt@goodmis.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: He Zhe
    Reviewed-by: Sergey Senozhatsky
    Signed-off-by: Petr Mladek
    Signed-off-by: Greg Kroah-Hartman

    He Zhe
     
  • commit 6a32c2469c3fbfee8f25bcd20af647326650a6cf upstream.

    Building any configuration with 'make W=1' produces a warning:

    kernel/bounds.c:16:6: warning: no previous prototype for 'foo' [-Wmissing-prototypes]

    When also passing -Werror, this prevents us from building any other files.
    Nobody ever calls the function, but we can't make it 'static' either
    since we want the compiler output.

    Calling it 'main' instead however avoids the warning, because gcc
    does not insist on having a declaration for main.
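
    A sketch of the rename (kernel/bounds.c is only compiled to extract
    constants from its generated assembly; it is never linked or run):

        /* was: void foo(void); gcc does not demand a prototype for main */
        int main(void)
        {
                /* The enum constants to put into include/generated/bounds.h */
                DEFINE(NR_PAGEFLAGS, __NR_PAGEFLAGS);
                DEFINE(MAX_NR_ZONES, __MAX_NR_ZONES);
                ...
                return 0;
        }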

    Link: http://lkml.kernel.org/r/20181005083313.2088252-1-arnd@arndb.de
    Signed-off-by: Arnd Bergmann
    Reported-by: Kieran Bingham
    Reviewed-by: Kieran Bingham
    Cc: David Laight
    Cc: Masahiro Yamada
    Cc: Greg Kroah-Hartman
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Arnd Bergmann
     
  • commit a36700589b85443e28170be59fa11c8a104130a5 upstream.

    While fixing an out of bounds array access in known_siginfo_layout
    reported by the kernel test robot it became apparent that the same bug
    exists in siginfo_layout and affects copy_siginfo_from_user32.

    The straightforward fix, which guards against making this mistake
    in the future and should keep the code size small, is to just take an
    unsigned signal number instead of a signed signal number, as I did to
    fix known_siginfo_layout.
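
    A sketch of the signature change (the body indexes lookups by signal
    number, so an unsigned parameter removes the negative-index case):

        /* before */
        enum siginfo_layout siginfo_layout(int sig, int si_code);

        /* after: a negative value can no longer index off the table */
        enum siginfo_layout siginfo_layout(unsigned sig, int si_code);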

    Cc: stable@vger.kernel.org
    Fixes: cc731525f26a ("signal: Remove kernel interal si_code magic")
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • [ Upstream commit 3597dfe01d12f570bc739da67f857fd222a3ea66 ]

    Instead of playing whack-a-mole and changing SEND_SIG_PRIV to
    SEND_SIG_FORCED throughout the kernel to ensure a pid namespace init
    gets signals sent by the kernel, stop allowing a pid namespace init to
    ignore SIGKILL or SIGSTOP sent by the kernel. A pid namespace init is
    only supposed to be able to ignore signals sent from itself and
    children with SIG_DFL.

    Fixes: 921cf9f63089 ("signals: protect cinit from unblocked SIG_DFL signals")
    Reviewed-by: Thomas Gleixner
    Signed-off-by: "Eric W. Biederman"
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Eric W. Biederman
     
  • [ Upstream commit 819319fc93461c07b9cdb3064f154bd8cfd48172 ]

    Make reuse_unused_kprobe() return an error code if it fails to
    reuse an unused kprobe for an optprobe, instead of calling
    BUG_ON().

    Signed-off-by: Masami Hiramatsu
    Cc: Anil S Keshavamurthy
    Cc: David S . Miller
    Cc: Linus Torvalds
    Cc: Naveen N . Rao
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/153666124040.21306.14150398706331307654.stgit@devbox
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Masami Hiramatsu
     
  • [ Upstream commit 22839869f21ab3850fbbac9b425ccc4c0023926f ]

    The sigaltstack(2) system call fails with -ENOMEM if the new alternative
    signal stack is found to be smaller than SIGMINSTKSZ. On architectures
    such as arm64, where the native value for SIGMINSTKSZ is larger than
    the compat value, this can result in an unexpected error being reported
    to a compat task. See, for example:

    https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=904385

    This patch fixes the problem by extending do_sigaltstack to take the
    minimum signal stack size as an additional parameter, allowing the
    native and compat system call entry code to pass in their respective
    values. COMPAT_SIGMINSTKSZ is just defined as SIGMINSTKSZ if it has not
    been defined by the architecture.
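
    A sketch of the approach (parameter name and call sites are assumptions):

        /* fall back to the native limit if the architecture has no override */
        #ifndef COMPAT_SIGMINSTKSZ
        #define COMPAT_SIGMINSTKSZ SIGMINSTKSZ
        #endif

        static int do_sigaltstack(const stack_t *ss, stack_t *oss,
                                  unsigned long sp, size_t min_ss_size)
        {
                ...
                if (ss_size < min_ss_size)
                        return -ENOMEM;
                ...
        }

        /* native entry passes SIGMINSTKSZ, compat entry COMPAT_SIGMINSTKSZ */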

    Cc: Arnd Bergmann
    Cc: Dominik Brodowski
    Cc: "Eric W. Biederman"
    Cc: Andrew Morton
    Cc: Al Viro
    Cc: Oleg Nesterov
    Reported-by: Steve McIntyre
    Tested-by: Steve McIntyre
    Signed-off-by: Will Deacon
    Signed-off-by: Catalin Marinas
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Will Deacon
     
  • [ Upstream commit 9506a7425b094d2f1d9c877ed5a78f416669269b ]

    It was found that when debug_locks was turned off because of a problem
    found by the lockdep code, the system performance could drop quite
    significantly when the lock_stat code was also configured into the
    kernel. For instance, parallel kernel build time on a 4-socket x86-64
    server nearly doubled.

    Further analysis into the cause of the slowdown traced back to the
    frequent call to debug_locks_off() from the __lock_acquired() function
    probably due to some inconsistent lockdep states with debug_locks
    off. The debug_locks_off() function did an unconditional atomic xchg
    to write a 0 value into debug_locks which had already been set to 0.
    This led to severe cacheline contention in the cacheline that held
    debug_locks. As debug_locks is referenced in quite a few different
    places in the kernel, this greatly slowed down the system performance.

    To prevent that thrashing of the debug_locks cacheline, lock_acquired()
    and lock_contended() now check the state of debug_locks before
    proceeding. The debug_locks_off() function is also modified to check
    debug_locks before calling __debug_locks_off().
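
    A sketch of the two checks described above:

        /* lib/debug_locks.c: avoid the atomic xchg when already off */
        int debug_locks_off(void)
        {
                if (debug_locks && __debug_locks_off()) {
                        if (!debug_locks_silent) {
                                console_verbose();
                                return 1;
                        }
                }
                return 0;
        }

        /* kernel/locking/lockdep.c: bail out early in the lock_stat hooks */
        if (unlikely(!lock_stat || !debug_locks))
                return;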

    Signed-off-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1539913518-15598-1-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Waiman Long
     
  • [ Upstream commit 9845c49cc9bbb317a0bc9e9cf78d8e09d54c9af0 ]

    The comment and the code around the update_min_vruntime() call in
    dequeue_entity() are not in agreement.

    From commit:

    b60205c7c558 ("sched/fair: Fix min_vruntime tracking")

    I think that we want to update min_vruntime when a task is sleeping/migrating.
    So, the check is inverted there - fix it.

    Signed-off-by: Song Muchun
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: b60205c7c558 ("sched/fair: Fix min_vruntime tracking")
    Link: http://lkml.kernel.org/r/20181014112612.2614-1-smuchun@gmail.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Song Muchun
     
  • commit 53c613fe6349994f023245519265999eed75957f upstream.

    STIBP is a feature provided by certain Intel ucodes / CPUs. This feature
    (once enabled) prevents cross-hyperthread control of decisions made by
    indirect branch predictors.

    Enable this feature if

    - the CPU is vulnerable to spectre v2
    - the CPU supports SMT and has SMT siblings online
    - spectre_v2 mitigation autoselection is enabled (default)

    After some previous discussion, this leaves STIBP on all the time, as wrmsr
    on crossing kernel boundary is a no-no. This could perhaps later be a bit
    more optimized (like disabling it in NOHZ, experiment with disabling it in
    idle, etc) if needed.

    Note that the synchronization of the mask manipulation via newly added
    spec_ctrl_mutex is currently not strictly needed, as the only updater is
    already being serialized by cpu_add_remove_lock, but let's make this a
    little bit more future-proof.

    Signed-off-by: Jiri Kosina
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Josh Poimboeuf
    Cc: Andrea Arcangeli
    Cc: "WoodhouseDavid"
    Cc: Andi Kleen
    Cc: Tim Chen
    Cc: "SchauflerCasey"
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/nycvar.YFH.7.76.1809251438240.15880@cbobk.fhfr.pm
    Signed-off-by: Greg Kroah-Hartman

    Jiri Kosina
     

10 Nov, 2018

2 commits

  • commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.

    With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
    distribute_cfs_runtime may not empty the throttled_list before it runs
    out of runtime to distribute. In that case, due to the change from
    c06f04c7048 to put throttled entries at the head of the list, later entries
    on the list will starve. Essentially, the same X processes will get pulled
    off the list, given CPU time and then, when expired, get put back on the
    head of the list where distribute_cfs_runtime will give runtime to the same
    set of processes leaving the rest.

    Fix the issue by setting a bit in struct cfs_bandwidth when
    distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
    decide to put the throttled entry on the tail or the head of the list. The
    bit is set/cleared by the callers of distribute_cfs_runtime while they hold
    cfs_bandwidth->lock.

    This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
    the live system. In some cases you can simply look at the throttled list and
    see the later entries are not changing:

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -976050
    2 ffff90b56cb2cc00 -484925
    3 ffff90b56cb2bc00 -658814
    4 ffff90b56cb2ba00 -275365
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1" "$4}' | pr -t -n3
    1 ffff90b56cb2d200 -994147
    2 ffff90b56cb2cc00 -306051
    3 ffff90b56cb2bc00 -961321
    4 ffff90b56cb2ba00 -24490
    5 ffff90b166a45600 -135138
    6 ffff90b56cb2da00 -282505
    7 ffff90b56cb2e000 -148065
    8 ffff90b56cb2fa00 -872591
    9 ffff90b56cb2c000 -84687
    10 ffff90b56cb2f000 -87237
    11 ffff90b166a40a00 -164582

    Sometimes it is easier to see by finding a process getting starved and looking
    at the sched_info:

    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },
    crash> task ffff8eb765994500 sched_info
    PID: 7800 TASK: ffff8eb765994500 CPU: 16 COMMAND: "cputest"
    sched_info = {
    pcount = 8,
    run_delay = 697094208,
    last_arrival = 240260125039,
    last_queued = 240260327513
    },

    Signed-off-by: Phil Auld
    Reviewed-by: Ben Segall
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
    Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Phil Auld
     
  • commit 0962590e553331db2cc0aef2dc35c57f6300dbbe upstream.

    ALU operations on pointers such as scalar_reg += map_value_ptr are
    handled in adjust_ptr_min_max_vals(). Problem is however that map_ptr
    and range in the register state share a union, so transferring state
    through dst_reg->range = ptr_reg->range is just buggy as any new
    map_ptr in the dst_reg is then truncated (or null) for subsequent
    checks. Fix this by adding a raw member and use it for copying state
    over to dst_reg.
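
    A sketch of the 'raw' member and the copy it enables:

        struct bpf_reg_state {
                ...
                union {
                        /* valid when type == PTR_TO_PACKET */
                        u16 range;
                        /* valid for map-related pointer types */
                        struct bpf_map *map_ptr;
                        /* max size from any of the above */
                        unsigned long raw;
                };
                ...
        };

        /* in adjust_ptr_min_max_vals(): copy whichever member is live */
        dst_reg->raw = ptr_reg->raw;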

    Fixes: f1174f77b50c ("bpf/verifier: rework value tracking")
    Signed-off-by: Daniel Borkmann
    Cc: Edward Cree
    Acked-by: Alexei Starovoitov
    Signed-off-by: Alexei Starovoitov
    Acked-by: Edward Cree
    Signed-off-by: Sasha Levin

    Daniel Borkmann
     

04 Nov, 2018

4 commits

  • [ Upstream commit ba6b8de423f8d0dee48d6030288ed81c03ddf9f0 ]

    Relying on map_release hook to decrement the reference counts when a
    map is removed only works if the map is not being pinned. In the
    pinned case the ref is decremented immediately and the BPF programs
    released. After this BPF programs may not be in-use which is not
    what the user would expect.

    This patch moves the release logic into bpf_map_put_uref() and brings
    sockmap in-line with how a similar case is handled in prog array maps.
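
    A sketch of the resulting release path, assuming a map_release_uref
    callback of the kind used by prog array maps:

        static void bpf_map_put_uref(struct bpf_map *map)
        {
                if (atomic_dec_and_test(&map->usercnt)) {
                        /* sockmap and prog arrays drop their program refs
                         * here, whether or not the map is pinned
                         */
                        if (map->ops->map_release_uref)
                                map->ops->map_release_uref(map);
                }
        }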

    Fixes: 3d9e952697de ("bpf: sockmap, fix leaking maps with attached but not detached progs")
    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin

    John Fastabend
     
  • [ Upstream commit e4a02ed2aaf447fa849e3254bfdb3b9b01e1e520 ]

    If CONFIG_WW_MUTEX_SELFTEST=y is enabled, booting an image
    in an arm64 virtual machine results in the following
    traceback if 8 CPUs are enabled:

    DEBUG_LOCKS_WARN_ON(__owner_task(owner) != current)
    WARNING: CPU: 2 PID: 537 at kernel/locking/mutex.c:1033 __mutex_unlock_slowpath+0x1a8/0x2e0
    ...
    Call trace:
    __mutex_unlock_slowpath()
    ww_mutex_unlock()
    test_cycle_work()
    process_one_work()
    worker_thread()
    kthread()
    ret_from_fork()

    If requesting b_mutex fails with -EDEADLK, the error variable
    is reassigned to the return value from calling ww_mutex_lock
    on a_mutex again. If this call fails, a_mutex is not locked.
    It is, however, unconditionally unlocked subsequently, causing
    the reported warning. Fix the problem by using two error variables.

    With this change, the selftest still fails as follows:

    cyclic deadlock not resolved, ret[7/8] = -35

    However, the traceback is gone.

    Signed-off-by: Guenter Roeck
    Cc: Chris Wilson
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: d1b42b800e5d0 ("locking/ww_mutex: Add kselftests for resolving ww_mutex cyclic deadlocks")
    Link: http://lkml.kernel.org/r/1538516929-9734-1-git-send-email-linux@roeck-us.net
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Guenter Roeck
     
  • [ Upstream commit cd6fb677ce7e460c25bdd66f689734102ec7d642 ]

    Some of the scheduling tracepoints allow the perf_tp_event
    code to write to a ring buffer under a different CPU than the
    one the code is running on.

    This results in corrupted ring buffer data demonstrated in
    following perf commands:

    # perf record -e 'sched:sched_switch,sched:sched_wakeup' perf bench sched messaging
    # Running 'sched/messaging' benchmark:
    # 20 sender and receiver processes per group
    # 10 groups == 400 processes run

    Total time: 0.383 [sec]
    [ perf record: Woken up 8 times to write data ]
    0x42b890 [0]: failed to process type: -1765585640
    [ perf record: Captured and wrote 4.825 MB perf.data (29669 samples) ]

    # perf report --stdio
    0x42b890 [0]: failed to process type: -1765585640

    The reason for the corruption is that some of the scheduling
    tracepoints have __perf_task defined and thus allow storing data to
    another CPU's ring buffer:

    sched_waking
    sched_wakeup
    sched_wakeup_new
    sched_stat_wait
    sched_stat_sleep
    sched_stat_iowait
    sched_stat_blocked

    The perf_tp_event function first stores samples for the current-CPU
    events defined for the tracepoint:

    hlist_for_each_entry_rcu(event, head, hlist_entry)
    perf_swevent_event(event, count, &data, regs);

    It then iterates over the 'task' events and stores the sample
    for any of the task's events that pass the tracepoint checks:

    ctx = rcu_dereference(task->perf_event_ctxp[perf_sw_context]);

    list_for_each_entry_rcu(event, &ctx->event_list, event_entry) {
    if (event->attr.type != PERF_TYPE_TRACEPOINT)
    continue;
    if (event->attr.config != entry->type)
    continue;

    perf_swevent_event(event, count, &data, regs);
    }

    The above code can race with the same code running on another CPU,
    ending up with two CPUs trying to store into the same ring
    buffer, which is specifically not allowed.

    This patch prevents the problem by only letting events on the
    current CPU receive the sample.

    NOTE: this requires the use of (per-task-)per-cpu buffers for this
    feature to work; perf-record does this.

    Signed-off-by: Jiri Olsa
    [peterz: small edits to Changelog]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Vagin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: e6dab5ffab59 ("perf/trace: Add ability to set a target task for events")
    Link: http://lkml.kernel.org/r/20180923161343.GB15054@krava
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Jiri Olsa
     
  • [ Upstream commit a9f9772114c8b07ae75bcb3654bd017461248095 ]

    When we unregister a PMU, we fail to serialize the @pmu_idr properly.
    Fix that by doing the entire thing under pmu_lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Fixes: 2e80a82a49c4 ("perf: Dynamic pmu types")
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin

    Peter Zijlstra
     

20 Oct, 2018

1 commit

  • commit 15d36fecd0bdc7510b70a0e5ec6671140b3fce0c upstream.

    When pmem namespaces created are smaller than section size, this can
    cause an issue during removal and gpf was observed:

    general protection fault: 0000 [#1] SMP PTI
    CPU: 36 PID: 3941 Comm: ndctl Tainted: G W 4.14.28-1.el7uek.x86_64 #2
    task: ffff88acda150000 task.stack: ffffc900233a4000
    RIP: 0010:__put_page+0x56/0x79
    Call Trace:
    devm_memremap_pages_release+0x155/0x23a
    release_nodes+0x21e/0x260
    devres_release_all+0x3c/0x48
    device_release_driver_internal+0x15c/0x207
    device_release_driver+0x12/0x14
    unbind_store+0xba/0xd8
    drv_attr_store+0x27/0x31
    sysfs_kf_write+0x3f/0x46
    kernfs_fop_write+0x10f/0x18b
    __vfs_write+0x3a/0x16d
    vfs_write+0xb2/0x1a1
    SyS_write+0x55/0xb9
    do_syscall_64+0x79/0x1ae
    entry_SYSCALL_64_after_hwframe+0x3d/0x0

    Add code to check whether we have a mapping already in the same section
    and prevent additional mappings from being created if that is the case.

    Link: http://lkml.kernel.org/r/152909478401.50143.312364396244072931.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Cc: Dan Williams
    Cc: Robert Elliott
    Cc: Jeff Moyer
    Cc: Matthew Wilcox
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Sudip Mukherjee
    Signed-off-by: Greg Kroah-Hartman

    Dave Jiang
     

18 Oct, 2018

1 commit

  • commit 479adb89a97b0a33e5a9d702119872cc82ca21aa upstream.

    A cgroup which is already a threaded domain may be converted into a
    threaded cgroup if the prerequisite conditions are met. When this
    happens, all threaded descendant should also have their ->dom_cgrp
    updated to the new threaded domain cgroup. Unfortunately, this
    propagation was missing leading to the following failure.

    # cd /sys/fs/cgroup/unified
    # cat cgroup.subtree_control # show that no controllers are enabled

    # mkdir -p mycgrp/a/b/c
    # echo threaded > mycgrp/a/b/cgroup.type

    At this point, the hierarchy looks as follows:

    mycgrp [d]
    a [dt]
    b [t]
    c [inv]

    Now let's make node "a" threaded (and thus "mycgrp" is made "domain threaded"):

    # echo threaded > mycgrp/a/cgroup.type

    By this point, we now have a hierarchy that looks as follows:

    mycgrp [dt]
    a [t]
    b [t]
    c [inv]

    But, when we try to convert the node "c" from "domain invalid" to
    "threaded", we get ENOTSUP on the write():

    # echo threaded > mycgrp/a/b/c/cgroup.type
    sh: echo: write error: Operation not supported

    This patch fixes the problem by

    * Moving the open-coded ->dom_cgrp save and restoration in
    cgroup_enable_threaded() into cgroup_{save|restore}_control() so
    that multiple cgroups can be handled.

    * Updating all threaded descendants' ->dom_cgrp to point to the new
    dom_cgrp when enabling threaded mode.

    Signed-off-by: Tejun Heo
    Reported-and-tested-by: "Michael Kerrisk (man-pages)"
    Reported-by: Amin Jamali
    Reported-by: Joao De Almeida Pereira
    Link: https://lore.kernel.org/r/CAKgNAkhHYCMn74TCNiMJ=ccLd7DcmXSbvw3CbZ1YREeG7iJM5g@mail.gmail.com
    Fixes: 454000adaa2a ("cgroup: introduce cgroup->dom_cgrp and threaded css_set handling")
    Cc: stable@vger.kernel.org # v4.14+
    Signed-off-by: Greg Kroah-Hartman

    Tejun Heo
     

13 Oct, 2018

1 commit

  • commit befb1b3c2703897c5b8ffb0044dc5d0e5f27c5d7 upstream.

    It is possible that a failure can occur during the scheduling of a
    pinned event. The initial portion of perf_event_read_local() contains
    the various error checks an event should pass before it can be
    considered valid. Ensure that the potential scheduling failure
    of a pinned event is checked for and that a meaningful error is returned.
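
    A sketch of the added check (the exact error code returned is an
    assumption here):

        /* in perf_event_read_local(), after the existing validity checks */
        if (event->attr.pinned && event->oncpu != smp_processor_id()) {
                /* the pinned event could not be scheduled on this CPU */
                ret = -EBUSY;
                goto out;
        }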

    Suggested-by: Peter Zijlstra
    Signed-off-by: Reinette Chatre
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: fenghua.yu@intel.com
    Cc: tony.luck@intel.com
    Cc: acme@kernel.org
    Cc: gavin.hindman@intel.com
    Cc: jithu.joseph@intel.com
    Cc: dave.hansen@intel.com
    Cc: hpa@zytor.com
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/6486385d1f30336e9973b24c8c65f5079543d3d3.1537377064.git.reinette.chatre@intel.com
    Signed-off-by: Greg Kroah-Hartman

    Reinette Chatre
     

10 Oct, 2018

1 commit

  • commit b799207e1e1816b09e7a5920fbb2d5fcf6edd681 upstream.

    When I wrote commit 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification"), I
    assumed that, in order to emulate 64-bit arithmetic with 32-bit logic, it
    is sufficient to just truncate the output to 32 bits; and so I just moved
    the register size coercion that used to be at the start of the function to
    the end of the function.

    That assumption is true for almost every op, but not for 32-bit right
    shifts, because those can propagate information towards the least
    significant bit. Fix it by always truncating inputs for 32-bit ops to 32
    bits.

    Also get rid of the coerce_reg_to_size() after the ALU op, since that has
    no effect.
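
    A sketch of truncating the inputs instead of the output, as described
    above:

        /* in adjust_scalar_min_max_vals() */
        if (BPF_CLASS(insn->code) != BPF_ALU64) {
                /* 32-bit ALU ops are (32,32)->32: truncate both inputs so a
                 * right shift cannot propagate bits from the upper half
                 */
                coerce_reg_to_size(dst_reg, 4);
                coerce_reg_to_size(&src_reg, 4);
        }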

    Fixes: 468f6eafa6c4 ("bpf: fix 32-bit ALU op verification")
    Acked-by: Daniel Borkmann
    Signed-off-by: Jann Horn
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Greg Kroah-Hartman

    Jann Horn
     

04 Oct, 2018

5 commits

  • [ Upstream commit 9b2e0388bec8ec5427403e23faff3b58dd1c3200 ]

    When sockmap code is using the stream parser it also handles the write
    space events in order to handle the case where (a) verdict redirects
    skb to another socket and (b) the sockmap then sends the skb but due
    to memory constraints (or other EAGAIN errors) needs to do a retry.

    But the initial code missed a third case where the
    skb_send_sock_locked() triggers an sk_wait_event(). A typical case
    would be when the sndbuf size is exceeded. If this happens, because we
    do not pass the write_space event to the lower layers, we never wake
    up the event and it will wait for sndtimeo. Which, as noted in the ktls
    fix, may be rather large and look like a hang to the user.

    To reproduce the best test is to reduce the sndbuf size and send
    1B data chunks to stress the memory handling. To fix this pass the
    event from the upper layer to the lower layer.

    Signed-off-by: John Fastabend
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    John Fastabend
     
  • [ Upstream commit 9f2d1e68cf4d641def734adaccfc3823d3575e6c ]

    Livepatch modules are special in that we preserve their entire symbol
    tables in order to be able to apply relocations after module load. The
    unwanted side effect of this is that undefined (SHN_UNDEF) symbols of
    livepatch modules are accessible via the kallsyms api and this can
    confuse symbol resolution in livepatch (klp_find_object_symbol()) and
    cause subtle bugs in livepatch.

    Have the module kallsyms api skip over SHN_UNDEF symbols. These symbols
    are usually not available for normal modules anyway as we cut down their
    symbol tables to just the core (non-undefined) symbols, so this should
    really just affect livepatch modules. Note that this patch doesn't
    affect the display of undefined symbols in /proc/kallsyms.
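
    A sketch of the kind of filter described, assuming the module kallsyms
    iterators:

        for (i = 0; i < kallsyms->num_symtab; i++) {
                const Elf_Sym *sym = &kallsyms->symtab[i];

                /* livepatch modules keep SHN_UNDEF entries; don't expose them */
                if (sym->st_shndx == SHN_UNDEF)
                        continue;
                /* ... existing lookup/reporting ... */
        }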

    Reported-by: Josh Poimboeuf
    Tested-by: Josh Poimboeuf
    Reviewed-by: Josh Poimboeuf
    Signed-off-by: Jessica Yu
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Jessica Yu
     
  • [ Upstream commit 78c9c4dfbf8c04883941445a195276bb4bb92c76 ]

    The posix timer overrun handling is broken because the forwarding functions
    can return a huge number of overruns which does not fit in an int. As a
    consequence timer_getoverrun(2) and siginfo::si_overrun can turn into
    random number generators.

    The k_clock::timer_forward() callbacks return a 64 bit value now. Make
    k_itimer::ti_overrun[_last] 64bit as well, so the kernel internal
    accounting is correct. Remove the temporary (int) casts.

    Add a helper function which clamps the overrun value returned to user space
    via timer_getoverrun(2) or siginfo::si_overrun limited to a positive value
    between 0 and INT_MAX. INT_MAX is an indicator for user space that the
    overrun value has been clamped.
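
    A sketch of such a clamping helper (name and exact form are assumptions):

        static int timer_overrun_to_int(struct k_itimer *timr, int baseval)
        {
                s64 sum = timr->it_overrun_last + (s64)baseval;

                /* INT_MAX tells user space the value was clamped */
                return sum > (s64)INT_MAX ? INT_MAX : (int)sum;
        }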

    Reported-by: Team OWL337
    Signed-off-by: Thomas Gleixner
    Acked-by: John Stultz
    Cc: Peter Zijlstra
    Cc: Michael Kerrisk
    Link: https://lkml.kernel.org/r/20180626132705.018623573@linutronix.de
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit 6fec64e1c92d5c715c6d0f50786daa7708266bde ]

    The posix timer ti_overrun handling is broken because the forwarding
    functions can return a huge number of overruns which does not fit in an
    int. As a consequence timer_getoverrun(2) and siginfo::si_overrun can turn
    into random number generators.

    As a first step to address that let the timer_forward() callbacks return
    the full 64 bit value.

    Cast it to (int) temporarily until k_itimer::ti_overrun is converted to
    64bit and the conversion to user space visible values is sanitized.

    Reported-by: Team OWL337
    Signed-off-by: Thomas Gleixner
    Acked-by: John Stultz
    Cc: Peter Zijlstra
    Cc: Michael Kerrisk
    Link: https://lkml.kernel.org/r/20180626132704.922098090@linutronix.de
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     
  • [ Upstream commit 5f936e19cc0ef97dbe3a56e9498922ad5ba1edef ]

    Air Icy reported:

    UBSAN: Undefined behaviour in kernel/time/alarmtimer.c:811:7
    signed integer overflow:
    1529859276030040771 + 9223372036854775807 cannot be represented in type 'long long int'
    Call Trace:
    alarm_timer_nsleep+0x44c/0x510 kernel/time/alarmtimer.c:811
    __do_sys_clock_nanosleep kernel/time/posix-timers.c:1235 [inline]
    __se_sys_clock_nanosleep kernel/time/posix-timers.c:1213 [inline]
    __x64_sys_clock_nanosleep+0x326/0x4e0 kernel/time/posix-timers.c:1213
    do_syscall_64+0xb8/0x3a0 arch/x86/entry/common.c:290

    alarm_timer_nsleep() uses ktime_add() to add the current time and the
    relative expiry value. ktime_add() has no sanity checks so the addition
    can overflow when the relative timeout is large enough.

    Use ktime_add_safe() which has the necessary sanity checks in place and
    limits the result to the valid range.
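
    A sketch of the one-line change in alarm_timer_nsleep():

        /* before: can overflow for huge relative timeouts */
        exp = ktime_add(now, exp);

        /* after: saturates at KTIME_MAX instead of wrapping */
        exp = ktime_add_safe(now, exp);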

    Fixes: 9a7adcf5c6de ("timers: Posix interface for alarm-timers")
    Reported-by: Team OWL337
    Signed-off-by: Thomas Gleixner
    Cc: John Stultz
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1807020926360.1595@nanos.tec.linutronix.de
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

29 Sep, 2018

3 commits

  • Commit 0a0e0829f990 ("nohz: Fix missing tick reprogram when interrupting an
    inline softirq") got backported to stable trees and now causes the NOHZ
    softirq pending warning to trigger. It's not an upstream issue as the NOHZ
    update logic has been changed there.

    The problem occurs when a softirq-disabled section gets interrupted and,
    on return from interrupt, the tick/nohz state is evaluated, which can
    then observe pending soft interrupts. These soft interrupts are
    legitimately pending because they cannot be processed as long as soft
    interrupts are disabled, and the interrupted code will correctly process
    them when soft interrupts are reenabled.

    Add a check for softirqs disabled to the pending check to prevent the
    warning.

    Reported-by: Grygorii Strashko
    Reported-by: John Crispin
    Signed-off-by: Thomas Gleixner
    Tested-by: Grygorii Strashko
    Tested-by: John Crispin
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Anna-Maria Gleixner
    Cc: stable@vger.kernel.org
    Fixes: 2d898915ccf4838c ("nohz: Fix missing tick reprogram when interrupting an inline softirq")
    Acked-by: Frederic Weisbecker
    Tested-by: Geert Uytterhoeven

    Thomas Gleixner
     
  • commit d0cdb3ce8834332d918fc9c8ff74f8a169ec9abe upstream.

    When a task which previously ran on a given CPU is remotely queued to
    wake up on that same CPU, there is a period where the task's state is
    TASK_WAKING and its vruntime is not normalized. This is not accounted
    for in vruntime_normalized() which will cause an error in the task's
    vruntime if it is switched from the fair class during this time.

    For example if it is boosted to RT priority via rt_mutex_setprio(),
    rq->min_vruntime will not be subtracted from the task's vruntime but
    it will be added again when the task returns to the fair class. The
    task's vruntime will have been erroneously doubled and the effective
    priority of the task will be reduced.

    Note this will also lead to inflation of all vruntimes since the doubled
    vruntime value will become the rq's min_vruntime when other tasks leave
    the rq. This leads to repeated doubling of the vruntime and priority
    penalty.

    Fix this by recognizing a WAKING task's vruntime as normalized only if
    sched_remote_wakeup is true. This indicates a migration, in which case
    the vruntime would have been normalized in migrate_task_rq_fair().
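
    A sketch of the narrowed condition in vruntime_normalized():

        /*
         * A remotely queued wakeup (sched_remote_wakeup) had its vruntime
         * normalized in migrate_task_rq_fair(); a task queued to wake on
         * its previous CPU did not.
         */
        if (!se->sum_exec_runtime ||
            (p->state == TASK_WAKING && p->sched_remote_wakeup))
                return true;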

    Based on a similar patch from John Dias.

    Suggested-by: Peter Zijlstra
    Tested-by: Dietmar Eggemann
    Signed-off-by: Steve Muckle
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Chris Redpath
    Cc: John Dias
    Cc: Linus Torvalds
    Cc: Miguel de Dios
    Cc: Morten Rasmussen
    Cc: Patrick Bellasi
    Cc: Paul Turner
    Cc: Quentin Perret
    Cc: Thomas Gleixner
    Cc: Todd Kjos
    Cc: kernel-team@android.com
    Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
    Link: http://lkml.kernel.org/r/20180831224217.169476-1-smuckle@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Steve Muckle
     
  • commit 83f365554e47997ec68dc4eca3f5dce525cd15c3 upstream.

    When reducing ring buffer size, pages are removed by scheduling a work
    item on each CPU for the corresponding CPU ring buffer. After the pages
    are removed from ring buffer linked list, the pages are free()d in a
    tight loop. The loop does not give up the CPU until all pages are removed.
    In the worst case, when a lot of pages are to be freed, this can
    cause a system stall.

    After the pages are removed from the list, the free() can happen while
    the work is rescheduled. Call cond_resched() in the loop to prevent the
    system hangup.

    Link: http://lkml.kernel.org/r/20180907223129.71994-1-vnagarnaik@google.com

    Cc: stable@vger.kernel.org
    Fixes: 83f40318dab00 ("ring-buffer: Make removal of ring buffer pages atomic")
    Reported-by: Jason Behmer
    Signed-off-by: Vaibhav Nagarnaik
    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Greg Kroah-Hartman

    Vaibhav Nagarnaik
     

26 Sep, 2018

4 commits

  • [ Upstream commit 8fe5c5a937d0f4e84221631833a2718afde52285 ]

    When a new task wakes-up for the first time, its initial utilization
    is set to half of the spare capacity of its CPU. The current
    implementation of post_init_entity_util_avg() uses SCHED_CAPACITY_SCALE
    directly as a capacity reference. As a result, on a big.LITTLE system, a
    new task waking up on an idle little CPU will be given ~512 of util_avg,
    even if the CPU's capacity is significantly less than that.

    Fix this by computing the spare capacity with arch_scale_cpu_capacity().
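
    A sketch of the change, assuming the arch_scale_cpu_capacity() signature
    of that era:

        /* in post_init_entity_util_avg() */
        struct cfs_rq *cfs_rq = cfs_rq_of(se);
        long cpu_scale = arch_scale_cpu_capacity(NULL, cpu_of(rq_of(cfs_rq)));
        long cap = (long)(cpu_scale - cfs_rq->avg.util_avg) / 2;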

    Signed-off-by: Quentin Perret
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Vincent Guittot
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dietmar.eggemann@arm.com
    Cc: morten.rasmussen@arm.com
    Cc: patrick.bellasi@arm.com
    Link: http://lkml.kernel.org/r/20180612112215.25448-1-quentin.perret@arm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Quentin Perret
     
  • [ Upstream commit 76e079fefc8f62bd9b2cd2950814d1ee806e31a5 ]

    wake_woken_function() synchronizes with wait_woken() as follows:

    [wait_woken]                          [wake_woken_function]

    entry->flags &= ~wq_flag_woken;       condition = true;
    smp_mb();                             smp_wmb();
    if (condition)                        wq_entry->flags |= wq_flag_woken;
        break;

    This commit replaces the above smp_wmb() with an smp_mb() in order to
    guarantee that either wait_woken() sees the wait condition being true
    or the store to wq_entry->flags in woken_wake_function() follows the
    store in wait_woken() in the coherence order (so that the former can
    eventually be observed by wait_woken()).

    The commit also fixes a comment associated to set_current_state() in
    wait_woken(): the comment pairs the barrier in set_current_state() to
    the above smp_wmb(), while the actual pairing involves the barrier in
    set_current_state() and the barrier executed by the try_to_wake_up()
    in wake_woken_function().

    Signed-off-by: Andrea Parri
    Signed-off-by: Paul E. McKenney
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: akiyks@gmail.com
    Cc: boqun.feng@gmail.com
    Cc: dhowells@redhat.com
    Cc: j.alglave@ucl.ac.uk
    Cc: linux-arch@vger.kernel.org
    Cc: luc.maranget@inria.fr
    Cc: npiggin@gmail.com
    Cc: parri.andrea@gmail.com
    Cc: stern@rowland.harvard.edu
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/20180716180605.16115-10-paulmck@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Andrea Parri
     
  • [ Upstream commit baa2a4fdd525c8c4b0f704d20457195b29437839 ]

    audit_add_watch stores locally krule->watch without taking a reference
    on watch. Then, it calls audit_add_to_parent, and uses the watch stored
    locally.

    Unfortunately, it is possible that audit_add_to_parent updates
    krule->watch. When that happens, it also drops a reference on the
    watch, which could free the watch.

    How to reproduce (with KASAN enabled):

    auditctl -w /etc/passwd -F success=0 -k test_passwd
    auditctl -w /etc/passwd -F success=1 -k test_passwd2

    The second call to auditctl triggers the use-after-free, because
    audit_add_to_parent updates krule->watch to use a previously existing
    watch and drops the reference to the newly created watch.

    To fix the issue, we grab a reference of watch and we release it at the
    end of the function.
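
    A sketch of the reference handling in audit_add_watch():

        int audit_add_watch(struct audit_krule *krule, struct list_head **list)
        {
                struct audit_watch *watch = krule->watch;
                ...
                /* audit_add_to_parent() may swap krule->watch and drop the
                 * reference on the watch still used below, so pin it here
                 */
                audit_get_watch(watch);
                ...
                audit_put_watch(watch);   /* drop our extra reference */
                return ret;
        }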

    Signed-off-by: Ronny Chevalier
    Reviewed-by: Richard Guy Briggs
    Signed-off-by: Paul Moore
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Ronny Chevalier
     
  • commit 02e184476eff848273826c1d6617bb37e5bcc7ad upstream.

    Perf can record user stack data in response to a synchronous request, such
    as a tracepoint firing. If this happens under set_fs(KERNEL_DS), then we
    end up reading user stack data using __copy_from_user_inatomic() under
    set_fs(KERNEL_DS). I think this conflicts with the intention of using
    set_fs(KERNEL_DS). And it is explicitly forbidden by hardware on ARM64
    when both CONFIG_ARM64_UAO and CONFIG_ARM64_PAN are used.

    So fix this by forcing USER_DS when recording user stack data.
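
    A sketch of the fix in the user-stack sampling path (helper names are
    taken from the perf output code and may differ):

        mm_segment_t fs;
        ...
        /* dump the user stack with user access semantics, not KERNEL_DS */
        fs = get_fs();
        set_fs(USER_DS);
        rem = __output_copy_user(handle, (void *)sp, dump_size);
        set_fs(fs);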

    Signed-off-by: Yabin Cui
    Acked-by: Peter Zijlstra (Intel)
    Cc:
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Namhyung Kim
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 88b0193d9418 ("perf/callchain: Force USER_DS when invoking perf_callchain_user()")
    Link: http://lkml.kernel.org/r/20180823225935.27035-1-yabinc@google.com
    Signed-off-by: Ingo Molnar
    Signed-off-by: Greg Kroah-Hartman

    Yabin Cui
     

20 Sep, 2018

2 commits

  • [ Upstream commit 363e934d8811d799c88faffc5bfca782fd728334 ]

    timer_base::must_forward_clock is indicating that the base clock might be
    stale due to a long idle sleep.

    The forwarding of the base clock takes place in the timer softirq or when a
    timer is enqueued to a base which is idle. If the enqueue of timer to an
    idle base happens from a remote CPU, then the following race can happen:

    CPU0                                   CPU1
    run_timer_softirq                      mod_timer

                                           base = lock_timer_base(timer);
    base->must_forward_clk = false
                                           if (base->must_forward_clk)
                                               forward(base); -> skipped

                                           enqueue_timer(base, timer, idx);
                                           -> idx is calculated high due to
                                              stale base
                                           unlock_timer_base(timer);
    base = lock_timer_base(timer);
    forward(base);

    The root cause is that timer_base::must_forward_clk is cleared outside the
    timer_base::lock held region, so the remote queuing CPU observes it as
    cleared, but the base clock is still stale. This can cause large
    granularity values for timers, i.e. the accuracy of the expiry time
    suffers.

    Prevent this by clearing the flag with timer_base::lock held, so that the
    forwarding takes place before the cleared flag is observable by a remote
    CPU.

    Signed-off-by: Gaurav Kohli
    Signed-off-by: Thomas Gleixner
    Cc: john.stultz@linaro.org
    Cc: sboyd@kernel.org
    Cc: linux-arm-msm@vger.kernel.org
    Link: https://lkml.kernel.org/r/1533199863-22748-1-git-send-email-gkohli@codeaurora.org
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Gaurav Kohli
     
  • commit 69fa6eb7d6a64801ea261025cce9723d9442d773 upstream.

    When a teardown callback fails, the CPU hotplug code brings the CPU back to
    the previous state. The previous state becomes the new target state. The
    rollback happens in undo_cpu_down() which increments the state
    unconditionally even if the state is already the same as the target.

    As a consequence the next CPU hotplug operation will start at the wrong
    state. This is easy to observe when __cpu_disable() fails.

    Prevent the unconditional undo by checking the state vs. target before
    incrementing state and fix up the consequently wrong conditional in the
    unplug code which handles the failure of the final CPU take down on the
    control CPU side.

    Fixes: 4dddfb5faa61 ("smp/hotplug: Rewrite AP state machine core")
    Reported-by: Neeraj Upadhyay
    Signed-off-by: Thomas Gleixner
    Tested-by: Geert Uytterhoeven
    Tested-by: Sudeep Holla
    Tested-by: Neeraj Upadhyay
    Cc: josh@joshtriplett.org
    Cc: peterz@infradead.org
    Cc: jiangshanlai@gmail.com
    Cc: dzickus@redhat.com
    Cc: brendan.jackman@arm.com
    Cc: malat@debian.org
    Cc: sramana@codeaurora.org
    Cc: linux-arm-msm@vger.kernel.org
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1809051419580.1416@nanos.tec.linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner