03 Apr, 2017

1 commit

  • Pull scheduler fixes from Thomas Gleixner:
    "This update provides:

    - make the scheduler clock switch to unstable mode smooth so the
    timestamps stay at microsecond granularity instead of switching to
    tick granularity.

    - unbreak perf test tsc by taking the new offset into account which
    was added in order to provide better sched clock continuity

    - switching sched clock to unstable mode runs all clock-related
    computations which affect the sched clock output itself from a
    workqueue. In case of preemption, sched clock uses half-updated data
    and provides wrong timestamps. Keep the math in the protected context
    and delegate only the static key switch to workqueue context.

    - remove a duplicate header include"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/headers: Remove duplicate #include line
    sched/clock: Fix broken stable to unstable transfer
    sched/clock, x86/perf: Fix "perf test tsc"
    sched/clock: Fix clear_sched_clock_stable() preempt wobbly

    Linus Torvalds
     

01 Apr, 2017

1 commit

  • Pull crypto fixes from Herbert Xu:
    "This fixes the following issues:

    - memory corruption when kmalloc fails in xts/lrw

    - mark some CCP DMA channels as private

    - fix reordering race in padata

    - regression in omap-rng DT description"

    * 'linus' of git://git.kernel.org/pub/scm/linux/kernel/git/herbert/crypto-2.6:
    crypto: xts,lrw - fix out-of-bounds write after kmalloc failure
    crypto: ccp - Make some CCP DMA channels private
    padata: avoid race in reordering
    dt-bindings: rng: clocks property on omap_rng not always mandatory

    Linus Torvalds
     

27 Mar, 2017

1 commit

  • When it is determined that the clock is actually unstable, and
    we switch from stable to unstable, the __clear_sched_clock_stable()
    function is eventually called.

    In this function we set gtod_offset so the following holds true:

    sched_clock() + raw_offset == ktime_get_ns() + gtod_offset

    But instead of getting the latest timestamps, we use the last values
    from scd, so instead of sched_clock() we use scd->tick_raw, and
    instead of ktime_get_ns() we use scd->tick_gtod.

    However, later, when we use gtod_offset in sched_clock_local(), we do
    not add it to scd->tick_gtod when calculating the clock value used to
    determine the min/max clock boundaries.

    This can result in tick-granularity sched_clock() values, so fix it.

    Signed-off-by: Pavel Tatashin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: hpa@zytor.com
    Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
    Link: http://lkml.kernel.org/r/1490214265-899964-2-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Ingo Molnar

    Pavel Tatashin
     

26 Mar, 2017

1 commit

  • Pull audit fix from Paul Moore:
    "We've got an audit fix, and unfortunately it is big.

    While I'm not excited that we need to be sending you something this
    large during the -rcX phase, it does fix some very real, and very
    tangled, problems relating to locking, backlog queues, and the audit
    daemon connection.

    This code has passed our testsuite without problem and it has held up
    to my ad-hoc stress tests (arguably better than the existing code),
    please consider pulling this as fix for the next v4.11-rcX tag"

    * 'stable-4.11' of git://git.infradead.org/users/pcmoore/audit:
    audit: fix auditd/kernel connection state tracking

    Linus Torvalds
     

24 Mar, 2017

3 commits

  • Under extremely heavy use of padata, crashes occur, and with list
    debugging turned on, this happens instead:

    [87487.298728] WARNING: CPU: 1 PID: 882 at lib/list_debug.c:33
    __list_add+0xae/0x130
    [87487.301868] list_add corruption. prev->next should be next
    (ffffb17abfc043d0), but was ffff8dba70872c80. (prev=ffff8dba70872b00).
    [87487.339011] [] dump_stack+0x68/0xa3
    [87487.342198] [] ? console_unlock+0x281/0x6d0
    [87487.345364] [] __warn+0xff/0x140
    [87487.348513] [] warn_slowpath_fmt+0x4a/0x50
    [87487.351659] [] __list_add+0xae/0x130
    [87487.354772] [] ? _raw_spin_lock+0x64/0x70
    [87487.357915] [] padata_reorder+0x1e6/0x420
    [87487.361084] [] padata_do_serial+0xa5/0x120

    padata_reorder calls list_add_tail with the list to which it's adding
    locked, which seems correct:

    spin_lock(&squeue->serial.lock);
    list_add_tail(&padata->list, &squeue->serial.list);
    spin_unlock(&squeue->serial.lock);

    This therefore leaves only one place where such an inconsistency could
    occur: if padata->list is added at the same time on two different
    threads. This padata pointer comes from the function call to
    padata_get_next(pd), which has in it the following block:

    next_queue = per_cpu_ptr(pd->pqueue, cpu);
    padata = NULL;
    reorder = &next_queue->reorder;
    if (!list_empty(&reorder->list)) {
            padata = list_entry(reorder->list.next,
                                struct padata_priv, list);
            spin_lock(&reorder->lock);
            list_del_init(&padata->list);
            atomic_dec(&pd->reorder_objects);
            spin_unlock(&reorder->lock);

            pd->processed++;

            goto out;
    }
    out:
    return padata;

    I strongly suspect that the problem here is that two threads can race
    on the reorder list. Even though the deletion is locked, the call to
    list_entry is not, which means it is feasible that two threads pick up
    the same padata object and subsequently call list_add_tail on it at
    the same time. The fix is thus to hoist that lock outside of that
    block.

    Signed-off-by: Jason A. Donenfeld
    Acked-by: Steffen Klassert
    Signed-off-by: Herbert Xu

    Jason A. Donenfeld
     
  • Pull power management fixes from Rafael Wysocki:
    "One of these is an intel_pstate regression fix and it is not a small
    change, but it mostly removes code that shouldn't be there. That code
    was acquired by mistake and has been a source of constant pain since
    then, so the time has come to get rid of it finally. We have not seen
    problems with this change in the lab, so fingers crossed.

    The rest is more usual: one more intel_pstate commit removing useless
    code, a cpufreq core fix to make it restore policy limits on CPU
    online (which prevents the limits from being reset over system
    suspend/resume), a schedutil cpufreq governor initialization fix to
    make it actually work as advertised on all systems and an extra sanity
    check in the cpuidle core to prevent crashes from happening if the
    arch code messes things up.

    Specifics:

    - Make intel_pstate use one set of global P-state limits in the
    active mode regardless of the scaling_governor settings for
    individual CPUs instead of switching back and forth between two of
    them in a way that is hard to control (Rafael Wysocki).

    - Drop a useless function from intel_pstate to prevent it from
    modifying the maximum supported frequency value unexpectedly which
    may confuse the cpufreq core (Rafael Wysocki).

    - Fix the cpufreq core to restore policy limits on CPU online so that
    the limits are not reset over system suspend/resume, among other
    things (Viresh Kumar).

    - Fix the initialization of the schedutil cpufreq governor to make
    the IO-wait boosting mechanism in it actually work on systems with
    one CPU per cpufreq policy (Rafael Wysocki).

    - Add a sanity check to the cpuidle core to prevent crashes from
    happening if the architecture code initialization fails to set up
    things as expected (Vaidyanathan Srinivasan)"

    * tag 'pm-4.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm:
    cpufreq: Restore policy min/max limits on CPU online
    cpuidle: Validate cpu_dev in cpuidle_add_sysfs()
    cpufreq: intel_pstate: Fix policy data management in passive mode
    cpufreq: schedutil: Fix per-CPU structure initialization in sugov_start()
    cpufreq: intel_pstate: One set of global limits in active mode

    Linus Torvalds
     
  • Pull networking fixes from David Miller:

    1) Several netfilter fixes from Pablo and the crew:
    - Handle fragmented packets properly in netfilter conntrack, from
    Florian Westphal.
    - Fix SCTP ICMP packet handling, from Ying Xue.
    - Fix big-endian bug in nftables, from Liping Zhang.
    - Fix alignment of fake conntrack entry, from Steven Rostedt.

    2) Fix feature flags setting in fjes driver, from Taku Izumi.

    3) Openvswitch ipv6 tunnel source address not set properly, from Or
    Gerlitz.

    4) Fix jumbo MTU handling in amd-xgbe driver, from Thomas Lendacky.

    5) sk->sk_frag.page not released properly in some cases, from Eric
    Dumazet.

    6) Fix RTNL deadlocks in nl80211, from Johannes Berg.

    7) Fix erroneous RTNL lockdep splat in crypto, from Herbert Xu.

    8) Cure improper inflight handling during AF_UNIX GC, from Andrey
    Ulanov.

    9) sch_dsmark doesn't write to packet headers properly, from Eric
    Dumazet.

    10) Fix SCM_TIMESTAMPING_OPT_STATS handling in TCP, from Soheil Hassas
    Yeganeh.

    11) Add some IDs for Motorola qmi_wwan chips, from Tony Lindgren.

    12) Fix nametbl deadlock in tipc, from Ying Xue.

    13) GRO and LRO packets not counted correctly in mlx5 driver, from Gal
    Pressman.

    14) Fix reset of internal PHYs in bcmgenet, from Doug Berger.

    15) Fix hashmap allocation handling, from Alexei Starovoitov.

    16) nl_fib_input() needs stronger netlink message length checking, from
    Eric Dumazet.

    17) Fix double-free of sk->sk_filter during sock clone, from Daniel
    Borkmann.

    18) Fix RX checksum offloading in aquantia driver, from Pavel Belous.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (85 commits)
    net:ethernet:aquantia: Fix for RX checksum offload.
    amd-xgbe: Fix the ECC-related bit position definitions
    sfc: cleanup a condition in efx_udp_tunnel_del()
    Bluetooth: btqcomsmd: fix compile-test dependency
    inet: frag: release spinlock before calling icmp_send()
    tcp: initialize icsk_ack.lrcvtime at session start time
    genetlink: fix counting regression on ctrl_dumpfamily()
    socket, bpf: fix sk_filter use after free in sk_clone_lock
    ipv4: provide stronger user input validation in nl_fib_input()
    bpf: fix hashmap extra_elems logic
    enic: update enic maintainers
    net: bcmgenet: remove bcmgenet_internal_phy_setup()
    ipv6: make sure to initialize sockc.tsflags before first use
    fjes: Do not load fjes driver if extended socket device is not power on.
    fjes: Do not load fjes driver if system does not have extended socket device.
    net/mlx5e: Count LRO packets correctly
    net/mlx5e: Count GSO packets correctly
    net/mlx5: Increase number of max QPs in default profile
    net/mlx5e: Avoid supporting udp tunnel port ndo for VF reps
    net/mlx5e: Use the proper UAPI values when offloading TC vlan actions
    ...

    Linus Torvalds
     

23 Mar, 2017

3 commits

  • People reported that commit:

    5680d8094ffa ("sched/clock: Provide better clock continuity")

    broke "perf test tsc".

    That commit added another offset to the reported clock value; so
    take that into account when computing the provided offset values.

    Reported-by: Adrian Hunter
    Reported-by: Arnaldo Carvalho de Melo
    Tested-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: 5680d8094ffa ("sched/clock: Provide better clock continuity")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Paul reported a problem with clear_sched_clock_stable(). Since we run
    all of __clear_sched_clock_stable() from workqueue context, there's a
    preempt problem.

    Solve it by only running the static_key_disable() from workqueue.

    Reported-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170313124621.GA3328@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In both kmalloc and prealloc modes, bpf_map_update_elem() uses
    per-cpu extra_elems to do an atomic update when the map is full.
    There are two issues with it. The logic can be misused, since it allows
    max_entries+num_cpus elements to be present in the map. And alloc_extra_elems()
    at map creation time can fail percpu alloc for large map values with a warn:
    WARNING: CPU: 3 PID: 2752 at ../mm/percpu.c:892 pcpu_alloc+0x119/0xa60
    illegal size (32824) or align (8) for percpu allocation

    The fixes for these two issues differ between the kmalloc and prealloc
    modes. For prealloc mode, allocate num_possible_cpus extra elements
    and store pointers to them in the extra_elems array instead of the
    actual elements. Hence we can use these hidden (spare) elements not
    only when the map is full, but also during bpf_map_update_elem() calls
    that replace an existing element. That also improves performance,
    since pcpu_freelist_pop/push is avoided. Unfortunately this approach
    cannot be used for kmalloc mode, which needs to kfree elements after
    an RCU grace period. Therefore switch kmalloc mode back to plain
    kmalloc even when the map is full and an old element exists, as it
    was prior to commit 6c9059817432 ("bpf: pre-allocate hash map elements").

    Add tests to check for over max_entries and large map values.

    Reported-by: Dave Jones
    Fixes: 6c9059817432 ("bpf: pre-allocate hash map elements")
    Signed-off-by: Alexei Starovoitov
    Acked-by: Daniel Borkmann
    Acked-by: Martin KaFai Lau
    Signed-off-by: David S. Miller

    Alexei Starovoitov
     

21 Mar, 2017

2 commits

  • What started as a rather straightforward race condition reported by
    Dmitry using the syzkaller fuzzer ended up revealing some major
    problems with how the audit subsystem managed its netlink sockets and
    its connection with the userspace audit daemon. Fixing this properly
    had quite the cascading effect and what we are left with is this rather
    large and complicated patch. My initial goal was to try and decompose
    this patch into multiple smaller patches, but the way these changes
    are intertwined makes it difficult to split these changes into
    meaningful pieces that don't break or somehow make things worse for
    the intermediate states.

    The patch makes a number of changes, but the most significant are
    highlighted below:

    * The auditd tracking variables, e.g. audit_sock, are now gone and
    replaced by an RCU/spinlock-protected variable, auditd_conn, a
    structure containing all of the auditd tracking information.

    * We no longer track the auditd sock directly, instead we track it
    via the network namespace in which it resides and we use the audit
    socket associated with that namespace. In spirit, this is what the
    code was trying to do prior to this patch (at least I think that is
    what the original authors intended), but it was done rather poorly
    and added a layer of obfuscation that only masked the underlying
    problems.

    * Big backlog queue cleanup, again. In v4.10 we made some pretty big
    changes to how the audit backlog queues work, here we haven't changed
    the queue design so much as cleaned up the implementation. Brought
    about by the locking changes, we've simplified kauditd_thread() quite
    a bit by consolidating the queue handling into a new helper function,
    kauditd_send_queue(), which allows us to eliminate a lot of very
    similar code and makes the looping logic in kauditd_thread() clearer.

    * All netlink messages sent to auditd are now sent via
    auditd_send_unicast_skb(). Other than just making sense, this makes
    the lock handling easier.

    * Change the audit_log_start() sleep behavior so that we never sleep
    on auditd events (unchanged) or if the caller is holding the
    audit_cmd_mutex (changed). Previously we didn't sleep if the caller
    was auditd or if the message type fell within a certain range; the
    type check was a poor effort of doing what the cmd_mutex check now
    does. Richard Guy Briggs originally proposed not sleeping the
    cmd_mutex owner several years ago but his patch wasn't acceptable
    at the time. At least the idea lives on here.

    * A problem with the lost record counter has been resolved. Steve
    Grubb and I both happened to notice this problem and according to
    some quick testing by Steve, this problem goes back quite some time.
    It's largely a harmless problem, although it may have left some
    careful sysadmins quite puzzled.

    Cc: # 4.10.x-
    Reported-by: Dmitry Vyukov
    Signed-off-by: Paul Moore

    Paul Moore
     
  • sugov_start() only initializes struct sugov_cpu per-CPU structures
    for shared policies, but it should do that for single-CPU policies too.

    That in particular makes the IO-wait boost mechanism work in the
    cases when cpufreq policies correspond to individual CPUs.

    Fixes: 21ca6d2c52f8 (cpufreq: schedutil: Add iowait boosting)
    Signed-off-by: Rafael J. Wysocki
    Acked-by: Viresh Kumar
    Cc: 4.9+ # 4.9+

    Rafael J. Wysocki
     

18 Mar, 2017

4 commits

  • Pull CPU hotplug fix from Thomas Gleixner:
    "A single fix preventing the concurrent execution of the CPU hotplug
    callback install/invocation machinery. Long standing bug caused by a
    massive brain slip of that Gleixner dude, which went unnoticed for
    almost a year"

    * 'smp-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    cpu/hotplug: Serialize callback invocations proper

    Linus Torvalds
     
  • Pull perf fixes from Thomas Gleixner:
    "A set of perf related fixes:

    - fix a CR4.PCE propagation issue caused by usage of mm instead of
    active_mm, which propagated the wrong value.

    - perf core fixes, which plug a use-after-free issue and make the
    event inheritance on fork more robust.

    - a tooling fix for symbol handling"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf symbols: Fix symbols__fixup_end heuristic for corner cases
    x86/perf: Clarify why x86_pmu_event_mapped() isn't racy
    x86/perf: Fix CR4.PCE propagation to use active_mm instead of mm
    perf/core: Better explain the inherit magic
    perf/core: Simplify perf_event_free_task()
    perf/core: Fix event inheritance on fork()
    perf/core: Fix use-after-free in perf_release()

    Linus Torvalds
     
  • Pull scheduler fixes from Thomas Gleixner:
    "From the scheduler department:

    - a bunch of sched deadline related fixes which deal with various
    buglets and corner cases.

    - two fixes for the loadavg spikes which are caused by the delayed
    NOHZ accounting"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/deadline: Use deadline instead of period when calculating overflow
    sched/deadline: Throttle a constrained deadline task activated after the deadline
    sched/deadline: Make sure the replenishment timer fires in the next period
    sched/loadavg: Use {READ,WRITE}_ONCE() for sample window
    sched/loadavg: Avoid loadavg spikes caused by delayed NO_HZ accounting
    sched/deadline: Add missing update_rq_clock() in dl_task_timer()

    Linus Torvalds
     
  • Pull locking fixes from Thomas Gleixner:
    "Three fixes related to locking:

    - fix a SIGKILL issue for RWSEM_GENERIC_SPINLOCK which has been fixed
    for the XCHGADD variant already

    - plug a potential use after free in the futex code

    - prevent leaking a held spinlock in a futex error handling code
    path"

    * 'locking-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rwsem: Fix down_write_killable() for CONFIG_RWSEM_GENERIC_SPINLOCK=y
    futex: Add missing error handling to FUTEX_REQUEUE_PI
    futex: Fix potential use-after-free in FUTEX_REQUEUE_PI

    Linus Torvalds
     

17 Mar, 2017

1 commit

  • Commit bfc8c90139eb ("mem-hotplug: implement get/put_online_mems")
    introduced new functions get/put_online_mems() and mem_hotplug_begin/end()
    in order to allow similar semantics for memory hotplug like for cpu
    hotplug.

    The corresponding functions for cpu hotplug are get/put_online_cpus()
    and cpu_hotplug_begin/done().

    The commit however failed to introduce functions that would serialize
    memory hotplug operations like it is done for cpu hotplug with
    cpu_maps_update_begin/done().

    This basically leaves mem_hotplug.active_writer unprotected and allows
    concurrent writers to modify it, which may lead to problems as outlined
    by commit f931ab479dd2 ("mm: fix devm_memremap_pages crash, use
    mem_hotplug_{begin, done}").

    That commit was extended again with commit b5d24fda9c3d ("mm,
    devm_memremap_pages: hold device_hotplug lock over mem_hotplug_{begin,
    done}") which serializes memory hotplug operations for some call sites
    by using the device_hotplug lock.

    In addition with commit 3fc21924100b ("mm: validate device_hotplug is held
    for memory hotplug") a sanity check was added to mem_hotplug_begin() to
    verify that the device_hotplug lock is held.

    This in turn triggers the following warning on s390:

    WARNING: CPU: 6 PID: 1 at drivers/base/core.c:643 assert_held_device_hotplug+0x4a/0x58
    Call Trace:
    assert_held_device_hotplug+0x40/0x58)
    mem_hotplug_begin+0x34/0xc8
    add_memory_resource+0x7e/0x1f8
    add_memory+0xda/0x130
    add_memory_merged+0x15c/0x178
    sclp_detect_standby_memory+0x2ae/0x2f8
    do_one_initcall+0xa2/0x150
    kernel_init_freeable+0x228/0x2d8
    kernel_init+0x2a/0x140
    kernel_thread_starter+0x6/0xc

    One possible fix would be to add more lock_device_hotplug() and
    unlock_device_hotplug() calls around each call site of
    mem_hotplug_begin/end(). But that would give the device_hotplug lock
    additional semantics it better should not have (serialize memory hotplug
    operations).

    Instead add a new memory_add_remove_lock which has semantics similar
    to cpu_add_remove_lock for cpu hotplug.

    To keep things hopefully a bit easier the lock will be locked and unlocked
    within the mem_hotplug_begin/end() functions.

    Link: http://lkml.kernel.org/r/20170314125226.16779-2-heiko.carstens@de.ibm.com
    Signed-off-by: Heiko Carstens
    Reported-by: Sebastian Ott
    Acked-by: Dan Williams
    Acked-by: Rafael J. Wysocki
    Cc: Michal Hocko
    Cc: Vladimir Davydov
    Cc: Ben Hutchings
    Cc: Gerald Schaefer
    Cc: Martin Schwidefsky
    Cc: Greg Kroah-Hartman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     

16 Mar, 2017

11 commits

  • While going through the event inheritance code Oleg got confused.

    Add some comments to better explain the silent disappearance of
    orphaned events.

    So what happens is that at perf_event_release_kernel() time, when an
    event loses its connection to userspace (and ceases to exist from the
    user's perspective), we can still have an arbitrary number of inherited
    copies of the event. We want to synchronously find and remove all
    these child events.

    Since that requires a bit of lock juggling, there is the possibility
    that concurrent clone()s will create new child events. Therefore we
    first mark the parent event as DEAD, which marks all the extant child
    events as orphaned.

    We then avoid copying orphaned events; in order to avoid getting more
    of them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170316125823.289567442@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • We have ctx->event_list that contains all events; no need to
    repeatedly iterate the group lists to find them all.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: fweisbec@gmail.com
    Link: http://lkml.kernel.org/r/20170316125823.239678244@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While hunting for clues to a use-after-free, Oleg spotted that
    perf_event_init_context() can lose an error value, with the result
    that fork() can succeed even though we did not fully inherit the perf
    event context.

    Spotted-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Dmitry Vyukov
    Cc: Frederic Weisbecker
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: oleg@redhat.com
    Cc: stable@vger.kernel.org
    Fixes: 889ff0150661 ("perf/core: Split context's event group list into pinned and non-pinned lists")
    Link: http://lkml.kernel.org/r/20170316125823.190342547@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Dmitry reported that syzkaller tripped a use-after-free in perf_release().

    After much puzzlement Oleg spotted the below scenario:

    Task1                                Task2

    fork()
      perf_event_init_task()
        /* ... */
        goto bad_fork_$foo;
        /* ... */
        perf_event_free_task()
          mutex_lock(ctx->lock)
          perf_free_event(B)

                                         perf_event_release_kernel(A)
                                           mutex_lock(A->child_mutex)
                                           list_for_each_entry(child, ...) {
                                             /* child == B */
                                             ctx = B->ctx;
                                             get_ctx(ctx);
                                             mutex_unlock(A->child_mutex);

            mutex_lock(A->child_mutex)
            list_del_init(B->child_list)
            mutex_unlock(A->child_mutex)

            /* ... */

          mutex_unlock(ctx->lock);
            put_ctx() /* >0 */
        free_task();
                                         mutex_lock(ctx->lock);
                                         mutex_lock(A->child_mutex);
                                         /* ... */
                                         mutex_unlock(A->child_mutex);
                                         mutex_unlock(ctx->lock)
                                         put_ctx() /* 0 */
                                         ctx->task && !TOMBSTONE
                                         put_task_struct() /* UAF */

    This patch closes the hole by making perf_event_free_task() destroy the
    task ctx relation such that perf_event_release_kernel() will no longer
    observe the now dead task.

    Spotted-by: Oleg Nesterov
    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Arnaldo Carvalho de Melo
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mathieu Desnoyers
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Cc: fweisbec@gmail.com
    Cc: oleg@redhat.com
    Cc: stable@vger.kernel.org
    Fixes: c6e5b73242d2 ("perf: Synchronously clean up child events")
    Link: http://lkml.kernel.org/r/20170314155949.GE32474@worktop
    Link: http://lkml.kernel.org/r/20170316125823.140295131@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • I was testing Daniel's changes with his test case, and tweaked it a
    little. Instead of having the runtime equal to the deadline, I
    increased the deadline ten fold.

    Daniel's test case had:

    attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_deadline = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */

    To make it more interesting, I changed it to:

    attr.sched_runtime = 2 * 1000 * 1000; /* 2 ms */
    attr.sched_deadline = 20 * 1000 * 1000; /* 20 ms */
    attr.sched_period = 2 * 1000 * 1000 * 1000; /* 2 s */

    The results were rather surprising. The behavior that Daniel's patch
    was fixing came back. The task started using much more than .1% of the
    CPU. More like 20%.

    Looking into this I found that it was due to the dl_entity_overflow()
    constantly returning true. That's because it uses the relative period
    against relative runtime vs the absolute deadline against absolute
    runtime.

    runtime / (deadline - t) > dl_runtime / dl_period

    There's even a comment mentioning this, and saying that when relative
    deadline equals relative period, that the equation is the same as using
    deadline instead of period. That comment is backwards! What we really
    want is:

    runtime / (deadline - t) > dl_runtime / dl_deadline

    We care about if the runtime can make its deadline, not its period. And
    then we can say "when the deadline equals the period, the equation is
    the same as using dl_period instead of dl_deadline".

    After correcting this, now when the task gets enqueued, it can throttle
    correctly, and Daniel's fix to the throttling of sleeping deadline
    tasks works even when the runtime and deadline are not the same.

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Daniel Bristot de Oliveira
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/02135a27f1ae3fe5fd032568a5a2f370e190e8d7.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     
  • During the activation, CBS checks if it can reuse the current task's
    runtime and period. If the deadline of the task is in the past, CBS
    cannot use the runtime, and so it replenishes the task. This rule
    works fine for implicit deadline tasks (deadline == period), and the
    CBS was designed for implicit deadline tasks. However, a task with
    constrained deadline (deadline < period) might be awakened after the
    deadline, but before the next period. In this case, replenishing the
    task would allow it to run for runtime / deadline. As in this case
    deadline < period, CBS enables a task to run for more than the
    runtime / period. In a very loaded system, this can cause a domino
    effect, making other tasks miss their deadlines.

    To avoid this problem, in the activation of a constrained deadline
    task after the deadline but before the next period, throttle the
    task and set the replenishing timer to the beginning of the next
    period, unless it is boosted.

    Reproducer:

    --------------- %< ---------------
    int main (int argc, char **argv)
    {
            int ret;
            int flags = 0;
            unsigned long l = 0;
            struct timespec ts;
            struct sched_attr attr;

            memset(&attr, 0, sizeof(attr));
            attr.size = sizeof(attr);

            attr.sched_policy   = SCHED_DEADLINE;
            attr.sched_runtime  = 2 * 1000 * 1000;          /* 2 ms */
            attr.sched_deadline = 2 * 1000 * 1000;          /* 2 ms */
            attr.sched_period   = 2 * 1000 * 1000 * 1000;   /* 2 s */

            ts.tv_sec = 0;
            ts.tv_nsec = 2000 * 1000;                       /* 2 ms */

            ret = sched_setattr(0, &attr, flags);

            if (ret < 0) {
                    perror("sched_setattr");
                    exit(-1);
            }

            for (;;) {
                    /* XXX: you may need to adjust the loop */
                    for (l = 0; l < 150000; l++);
                    /*
                     * The idea is to go to sleep right before the deadline
                     * and then wake up before the next period to receive
                     * a new replenishment.
                     */
                    nanosleep(&ts, NULL);
            }

            exit(0);
    }
    --------------- >% ---------------

    On my box, this reproducer uses almost 50% of the CPU time, which is
    obviously wrong for a task with 2/2000 reservation.

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Luca Abeni
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/edf58354e01db46bf42df8d2dd32418833f68c89.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     
  • Currently, the replenishment timer is set to fire at the deadline
    of a task. Although that works for implicit deadline tasks, because the
    deadline equals the beginning of the next period, it is not correct
    for constrained deadline tasks (deadline < period).

    For instance:

    f.c:
    --------------- %< ---------------
    int main (void)
    {
    for(;;);
    }
    --------------- >% ---------------

    # gcc -o f f.c

    # trace-cmd record -e sched:sched_switch \
    -e syscalls:sys_exit_sched_setattr \
    chrt -d --sched-runtime 490000000 \
    --sched-deadline 500000000 \
    --sched-period 1000000000 0 ./f

    # trace-cmd report | grep "{pid of ./f}"

    After setting the parameters, the task is replenished and continues
    running until it is throttled:

    f-11295 [003] 13322.113776: sys_exit_sched_setattr: 0x0

    The task is throttled after running 492318 us, as expected:

    f-11295 [003] 13322.606094: sched_switch: f:11295 [-1] R ==> watchdog/3:32 [0]

    But then, the task is replenished 500719 us after the first
    replenishment:

    -0 [003] 13322.614495: sched_switch: swapper/3:0 [120] R ==> f:11295 [-1]

    Running for 490277 us:

    f-11295 [003] 13323.104772: sched_switch: f:11295 [-1] R ==> swapper/3:0 [120]

    Hence, in the first period, the task runs 2 * runtime, and that is a bug.

    During the first replenishment, the next deadline is set one period away.
    So the runtime / period starts to be respected. However, as the second
    replenishment took place at the wrong instant, every subsequent
    replenishment is also held at the wrong instant. Rather than occurring
    n periods after the first activation, it takes place at
    (n periods - relative deadline).
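
    The difference can be sketched with hypothetical helpers (values from the
    chrt example above, in ns; not the kernel's actual timer code):

```c
#include <assert.h>
#include <stdint.h>

/* From the chrt example: runtime 490 ms, deadline 500 ms, period 1000 ms. */
#define RUNTIME   490000000ULL
#define DEADLINE  500000000ULL
#define PERIOD   1000000000ULL

/* Buggy behavior: the replenishment timer is armed at the absolute
 * deadline of the task. */
static uint64_t replenish_at_deadline(uint64_t activation)
{
    return activation + DEADLINE;
}

/* Fixed behavior: the timer is armed at the start of the next period. */
static uint64_t replenish_at_next_period(uint64_t activation)
{
    return activation + PERIOD;
}

/* Runtime consumable inside the first period: one budget at activation,
 * plus a second one if a replenishment fires before the period ends. */
static uint64_t runtime_in_first_period(uint64_t activation, uint64_t replenish)
{
    uint64_t budget = RUNTIME;

    if (replenish < activation + PERIOD)
        budget += RUNTIME;
    return budget;
}
```

    Arming the timer at the deadline lets a second budget land inside the
    first period, which is exactly the 2 * runtime observed in the trace.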

    Signed-off-by: Daniel Bristot de Oliveira
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Luca Abeni
    Reviewed-by: Steven Rostedt (VMware)
    Reviewed-by: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Romulo Silva de Oliveira
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Cc: Tommaso Cucinotta
    Link: http://lkml.kernel.org/r/ac50d89887c25285b47465638354b63362f8adff.1488392936.git.bristot@redhat.com
    Signed-off-by: Ingo Molnar

    Daniel Bristot de Oliveira
     
  • We hang if SIGKILL has been sent, but the task is stuck in down_read()
    (after do_exit()), even though no task is doing down_write() on the
    rwsem in question:

    INFO: task libupnp:21868 blocked for more than 120 seconds.
    libupnp D 0 21868 1 0x08100008
    ...
    Call Trace:
    __schedule()
    schedule()
    __down_read()
    do_exit()
    do_group_exit()
    __wake_up_parent()

    This bug has already been fixed for CONFIG_RWSEM_XCHGADD_ALGORITHM=y in
    the following commit:

    04cafed7fc19 ("locking/rwsem: Fix down_write_killable()")

    ... however, this bug also exists for CONFIG_RWSEM_GENERIC_SPINLOCK=y.

    Signed-off-by: Niklas Cassel
    Signed-off-by: Peter Zijlstra (Intel)
    Cc:
    Cc:
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Niklas Cassel
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: d47996082f52 ("locking/rwsem: Introduce basis for down_write_killable()")
    Link: http://lkml.kernel.org/r/1487981873-12649-1-git-send-email-niklass@axis.com
    Signed-off-by: Ingo Molnar

    Niklas Cassel
     
  • 'calc_load_update' is accessed without any kind of locking and there's
    a clear assumption in the code that only a single value is read or
    written.

    Make this explicit by using READ_ONCE() and WRITE_ONCE(), and avoid
    unintentionally seeing multiple values, or having the load/stores
    split.

    Technically the loads in calc_global_*() don't require this since
    those are the only functions that update 'calc_load_update', but I've
    added the READ_ONCE() for consistency.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Link: http://lkml.kernel.org/r/20170217120731.11868-3-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • If we crossed a sample window while in NO_HZ we will add LOAD_FREQ to
    the pending sample window time on exit, setting the next update not
    one window into the future, but two.

    This situation on exiting NO_HZ is described by:

    this_rq->calc_load_update < jiffies < calc_load_update

    In this scenario, what we should be doing is:

    this_rq->calc_load_update = calc_load_update [ next window ]

    But what we actually do is:

    this_rq->calc_load_update = calc_load_update + LOAD_FREQ [ next+1 window ]

    This has the effect of delaying load average updates for potentially
    up to ~9 seconds.

    This can result in huge spikes in the load average values due to
    per-cpu uninterruptible task counts being out of sync when accumulated
    across all CPUs.

    It's safe to update the per-cpu active count if we wake between sample
    windows because any load that we left in 'calc_load_idle' will have
    been zeroed when the idle load was folded in calc_global_load().
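
    A minimal model of the exit-path arithmetic above (hypothetical helpers
    with an illustrative LOAD_FREQ value; not the kernel's load-average code):

```c
#include <assert.h>

/* Illustrative; the kernel's LOAD_FREQ is 5 seconds worth of jiffies. */
#define LOAD_FREQ 1250

/* Buggy exit path: always step one full window past the global boundary. */
static unsigned long next_update_buggy(unsigned long calc_load_update)
{
    return calc_load_update + LOAD_FREQ;
}

/*
 * Fixed exit path: adopt the global boundary as-is; only add LOAD_FREQ
 * when jiffies has already passed it, i.e. the window was truly crossed.
 */
static unsigned long next_update_fixed(unsigned long jiffies,
                                       unsigned long calc_load_update)
{
    unsigned long next = calc_load_update;

    if (jiffies >= calc_load_update)
        next += LOAD_FREQ;
    return next;
}
```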

    This issue is easy to reproduce before,

    commit 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")

    just by forking short-lived process pipelines built from ps(1) and
    grep(1) in a loop. I'm unable to reproduce the spikes after that
    commit, but the bug still seems to be present from code review.

    Signed-off-by: Matt Fleming
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Mike Galbraith
    Cc: Morten Rasmussen
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Vincent Guittot
    Fixes: commit 5167e8d ("sched/nohz: Rewrite and fix load-avg computation -- again")
    Link: http://lkml.kernel.org/r/20170217120731.11868-2-matt@codeblueprint.co.uk
    Signed-off-by: Ingo Molnar

    Matt Fleming
     
  • The following warning can be triggered by hot-unplugging the CPU
    on which an active SCHED_DEADLINE task is running on:

    ------------[ cut here ]------------
    WARNING: CPU: 7 PID: 0 at kernel/sched/sched.h:833 replenish_dl_entity+0x71e/0xc40
    rq->clock_update_flags < RQCF_ACT_SKIP
    CPU: 7 PID: 0 Comm: swapper/7 Tainted: G B 4.11.0-rc1+ #24
    Hardware name: LENOVO ThinkCentre M8500t-N000/SHARKBAY, BIOS FBKTC1AUS 02/16/2016
    Call Trace:

    dump_stack+0x85/0xc4
    __warn+0x172/0x1b0
    warn_slowpath_fmt+0xb4/0xf0
    ? __warn+0x1b0/0x1b0
    ? debug_check_no_locks_freed+0x2c0/0x2c0
    ? cpudl_set+0x3d/0x2b0
    replenish_dl_entity+0x71e/0xc40
    enqueue_task_dl+0x2ea/0x12e0
    ? dl_task_timer+0x777/0x990
    ? __hrtimer_run_queues+0x270/0xa50
    dl_task_timer+0x316/0x990
    ? enqueue_task_dl+0x12e0/0x12e0
    ? enqueue_task_dl+0x12e0/0x12e0
    __hrtimer_run_queues+0x270/0xa50
    ? hrtimer_cancel+0x20/0x20
    ? hrtimer_interrupt+0x119/0x600
    hrtimer_interrupt+0x19c/0x600
    ? trace_hardirqs_off+0xd/0x10
    local_apic_timer_interrupt+0x74/0xe0
    smp_apic_timer_interrupt+0x76/0xa0
    apic_timer_interrupt+0x93/0xa0

    The DL task will be migrated to a suitable later deadline rq once the DL
    timer fires and the current rq is offline. The rq clock of the new rq should
    be updated. This patch fixes it by updating the rq clock after holding
    the new rq's rq lock.
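
    The lock-then-update ordering can be sketched with a mutex-guarded
    stand-in for the runqueue (hypothetical types and a placeholder time
    source; not kernel code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Hypothetical rq-like struct; a pthread mutex stands in for the rq lock. */
struct rq {
    pthread_mutex_t lock;
    uint64_t clock;    /* only meaningful right after an update */
    int clock_fresh;   /* stand-in for the RQCF_* validity flags */
};

static uint64_t read_clock_source(void)
{
    return 42;  /* placeholder time source */
}

/*
 * After taking the new rq's lock, update its clock *before* any code
 * (like replenish_dl_entity() in the warning above) reads it; reading
 * a stale clock is what trips the rq->clock_update_flags check.
 */
static uint64_t enqueue_on(struct rq *later_rq)
{
    uint64_t now;

    pthread_mutex_lock(&later_rq->lock);
    later_rq->clock = read_clock_source();  /* update_rq_clock() step */
    later_rq->clock_fresh = 1;
    now = later_rq->clock;                  /* safe: freshly updated */
    pthread_mutex_unlock(&later_rq->lock);
    return now;
}
```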

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Matt Fleming
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1488865888-15894-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

15 Mar, 2017

6 commits

  • Pull networking fixes from David Miller:

    1) Ensure that mtu is at least IPV6_MIN_MTU in ipv6 VTI tunnel driver,
    from Steffen Klassert.

    2) Fix crashes when user tries to get_next_key on an LPM bpf map, from
    Alexei Starovoitov.

    3) Fix detection of VLAN filtering feature for bnx2x VF devices, from
    Michal Schmidt.

    4) We can get a divide by zero when TCP sockets are morphed into
    listening state, fix from Eric Dumazet.

    5) Fix socket refcounting bugs in skb_complete_wifi_ack() and
    skb_complete_tx_timestamp(). From Eric Dumazet.

    6) Use after free in dccp_feat_activate_values(), also from Eric
    Dumazet.

    7) Like bonding team needs to use ETH_MAX_MTU as netdev->max_mtu, from
    Jarod Wilson.

    8) Fix use after free in vrf_xmit(), from David Ahern.

    9) Don't do UDP Fragmentation Offload on IPComp ipsec packets, from
    Alexey Kodanev.

    10) Properly check napi_complete_done() return value in order to decide
    whether to re-enable IRQs or not in amd-xgbe driver, from Thomas
    Lendacky.

    11) Fix double free of hwmon device in marvell phy driver, from Andrew
    Lunn.

    12) Don't crash on malformed netlink attributes in act_connmark, from
    Etienne Noss.

    13) Don't remove routes with a higher metric in ipv6 ECMP route replace,
    from Sabrina Dubroca.

    14) Don't write into a cloned SKB in ipv6 fragmentation handling, from
    Florian Westphal.

    15) Fix routing redirect races in dccp and tcp: the ICMP handler
    can't modify the socket's cached route if it's locked by the user at
    that moment. From Jon Maxwell.

    * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (108 commits)
    qed: Enable iSCSI Out-of-Order
    qed: Correct out-of-bound access in OOO history
    qed: Fix interrupt flags on Rx LL2
    qed: Free previous connections when releasing iSCSI
    qed: Fix mapping leak on LL2 rx flow
    qed: Prevent creation of too-big u32-chains
    qed: Align CIDs according to DORQ requirement
    mlxsw: reg: Fix SPVMLR max record count
    mlxsw: reg: Fix SPVM max record count
    net: Resend IGMP memberships upon peer notification.
    dccp: fix memory leak during tear-down of unsuccessful connection request
    tun: fix premature POLLOUT notification on tun devices
    dccp/tcp: fix routing redirect race
    ucc/hdlc: fix two little issue
    vxlan: fix ovs support
    net: use net->count to check whether a netns is alive or not
    bridge: drop netfilter fake rtable unconditionally
    ipv6: avoid write to a possibly cloned skb
    net: wimax/i2400m: fix NULL-deref at probe
    isdn/gigaset: fix NULL-deref at probe
    ...

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "Three cgroup fixes. Nothing critical:

    - the pids controller could trigger suspicious RCU warning
    spuriously. Fixed.

    - in the debug controller, %p -> %pK to protect kernel pointer
    from getting exposed.

    - documentation formatting fix"

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroups: censor kernel pointer in debug files
    cgroup/pids: remove spurious suspicious RCU usage warning
    cgroup: Fix indenting in PID controller documentation

    Linus Torvalds
     
  • Pull workqueue fix from Tejun Heo:
    "If a delayed work is queued with NULL @wq, workqueue code explodes
    after the timer expires, at which point it's difficult to tell who the
    culprit was.

    This actually happened and the offender was net/smc this time.

    Add an explicit sanity check for it in the queueing path"
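
    The check amounts to failing loudly at queueing time, while the culprit
    is still on the stack, instead of crashing later when the timer fires.
    A sketch with hypothetical user-space stand-ins (not the kernel's
    queue_delayed_work() itself):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins for the workqueue types. */
struct workqueue { const char *name; };
struct delayed_work { int pending; };

static bool queue_delayed_work_checked(struct workqueue *wq,
                                       struct delayed_work *dwork,
                                       unsigned long delay)
{
    /* Equivalent of the added sanity check on a NULL @wq: report it
     * here, where the caller is identifiable, and refuse to queue. */
    if (wq == NULL) {
        fprintf(stderr, "queue_delayed_work: NULL workqueue\n");
        return false;
    }
    dwork->pending = 1;
    (void)delay;
    return true;
}
```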

    * 'for-4.11-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: trigger WARN if queue_delayed_work() is called with NULL @wq

    Linus Torvalds
     
  • Thomas spotted that fixup_pi_state_owner() can return errors and we
    fail to unlock the rt_mutex in that case.
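
    The error-path rule the fix restores can be sketched like this, with a
    pthread mutex standing in for the rt_mutex and a stub for the fallible
    fixup step (hypothetical names, not the kernel's futex code):

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>

static pthread_mutex_t pi_lock = PTHREAD_MUTEX_INITIALIZER;

/* Stand-in for fixup_pi_state_owner(), which can return an error. */
static int fixup_owner_stub(bool fail)
{
    return fail ? -1 : 0;
}

/* Whether the fixup succeeds or fails, the lock taken on entry must be
 * released before returning; the bug was skipping the unlock on error. */
static int do_fixup(bool fail)
{
    int ret;

    pthread_mutex_lock(&pi_lock);
    ret = fixup_owner_stub(fail);
    pthread_mutex_unlock(&pi_lock);  /* runs on the error path too */
    return ret;
}
```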

    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.867401760@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • While working on the futex code, I stumbled over this potential
    use-after-free scenario. Dmitry triggered it later with syzkaller.

    pi_mutex is a pointer into pi_state, which we drop the reference on in
    unqueue_me_pi(). So any access to that pointer after that is bad.

    Since other sites already do rt_mutex_unlock() with hb->lock held, see
    for example futex_lock_pi(), simply move the unlock before
    unqueue_me_pi().
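
    The bug class can be sketched with hypothetical stand-ins for the
    refcounted pi_state (not the kernel code): the pointer into the object
    must be used before the reference is dropped, never after.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for the refcounted pi_state. */
struct pi_state {
    int refcount;
    int mutex_locked;  /* models the embedded rt_mutex */
};

static bool state_freed;

static void put_pi_state(struct pi_state *s)
{
    if (--s->refcount == 0) {
        state_freed = true;  /* object may be reused from here on */
        free(s);
    }
}

/*
 * The fixed ordering: unlock the embedded mutex while the reference is
 * still held (with hb->lock held in the real code, as futex_lock_pi()
 * already does), and only then drop the reference. Swapping these two
 * calls is exactly the use-after-free described above.
 */
static void unlock_and_unqueue(struct pi_state *s)
{
    s->mutex_locked = 0;  /* rt_mutex_unlock() equivalent */
    put_pi_state(s);      /* unqueue_me_pi() drops the reference */
}
```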

    Reported-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Darren Hart
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20170304093558.801744246@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The setup/remove_state/instance() functions in the hotplug core code are
    serialized against concurrent CPU hotplug, but unfortunately not serialized
    against themselves.

    As a consequence, a concurrent invocation of these functions results in
    corruption of the callback machinery, because two instances try to invoke
    callbacks on remote CPUs at the same time. This results in missing callback
    invocations and initiator threads waiting forever on the completion.

    The obvious solution of replacing get_online_cpus() with cpu_hotplug_begin()
    is not possible, because at least one call site calls into these functions
    from a get_online_cpus() locked region.

    Extend the protection scope of the cpuhp_state_mutex from solely protecting
    the state arrays to cover the callback invocation machinery as well.
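
    The extended lock scope can be modeled like this (hypothetical user-space
    stand-ins for cpuhp_state_mutex and the callback machinery, not the
    kernel's hotplug code):

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t state_mutex = PTHREAD_MUTEX_INITIALIZER;
static int state_table[8];
static int callbacks_in_flight;  /* must never exceed 1 under the lock */

/* Returns how many invocations were running concurrently. */
static int invoke_callbacks(void)
{
    int concurrent = ++callbacks_in_flight;

    --callbacks_in_flight;
    return concurrent;
}

static int setup_state(int state, int value)
{
    int concurrent;

    pthread_mutex_lock(&state_mutex);
    state_table[state] = value;       /* previously the only guarded part */
    concurrent = invoke_callbacks();  /* now also inside the lock */
    pthread_mutex_unlock(&state_mutex);
    return concurrent;
}
```

    With the callback invocation inside the same critical section as the
    state-table update, two setup/remove calls can no longer interleave
    their callbacks.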

    Fixes: 5b7aa87e0482 ("cpu/hotplug: Implement setup/removal interface")
    Reported-and-tested-by: Bart Van Assche
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: hpa@zytor.com
    Cc: mingo@kernel.org
    Cc: akpm@linux-foundation.org
    Cc: torvalds@linux-foundation.org
    Link: http://lkml.kernel.org/r/20170314150645.g4tdyoszlcbajmna@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

13 Mar, 2017

1 commit

  • Pull x86 fixes from Thomas Gleixner:

    - a fix for the kexec/purgatory regression which was introduced in the
    merge window via an innocent sparse fix. We could have reverted that
    commit, but on deeper inspection it turned out that the whole
    machinery is neither documented nor robust. So a proper cleanup was
    done instead

    - the fix for the TLB flush issue which was discovered recently

    - a simple typo fix for a reboot quirk

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86/tlb: Fix tlb flushing when lguest clears PGE
    kexec, x86/purgatory: Unbreak it and clean it up
    x86/reboot/quirks: Fix typo in ASUS EeeBook X205TA reboot quirk

    Linus Torvalds
     

11 Mar, 2017

2 commits

  • The purgatory code defines global variables which are referenced via a
    symbol lookup in the kexec code (core and arch).

    A recent commit addressing sparse warnings made these static and thereby
    broke kexec_file.

    Why did this happen? Simply because the whole machinery is undocumented and
    lacks any form of forward declarations. The variable names are unspecific
    and lack a prefix, so adding forward declarations creates shadow variables
    in the core code. Aside of that the code relies on magic constants and
    duplicate struct definitions with no way to ensure that these things stay
    in sync. The section placement of the purgatory variables happened by
    chance and not by design.

    Unbreak kexec and cleanup the mess:

    - Add proper forward declarations and document the usage
    - Use common struct definition
    - Use the proper common defines instead of magic constants
    - Add a purgatory_ prefix to have a proper name space
    - Use ARRAY_SIZE() instead of a homebrewn reimplementation
    - Add proper sections to the purgatory variables [ From Mike ]
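
    The declaration pattern the cleanup establishes can be sketched as
    follows (hypothetical symbol placement; in the real tree the
    declarations live in shared kexec/purgatory headers):

```c
#include <assert.h>
#include <stddef.h>

/* A shared header would carry the forward declaration, so the core code
 * and the purgatory agree on the name and type: */
extern unsigned char purgatory_sha256_digest[32];

/* Definition site: non-static with a purgatory_ prefix. Making this
 * static is exactly what broke the symbol lookup in kexec_file. */
unsigned char purgatory_sha256_digest[32];

/* The common ARRAY_SIZE() instead of a homebrewn reimplementation. */
#define ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

static size_t digest_len(void)
{
    return ARRAY_SIZE(purgatory_sha256_digest);
}
```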

    Fixes: 72042a8c7b01 ("x86/purgatory: Make functions and variables static")
    Reported-by: Mike Galbraith
    Signed-off-by: Thomas Gleixner
    Cc: Nicholas Mc Guire
    Cc: Borislav Petkov
    Cc: Vivek Goyal
    Cc: "Tobin C. Harding"
    Link: http://lkml.kernel.org/r/alpine.DEB.2.20.1703101315140.3681@nanos
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Merge fixes from Andrew Morton:
    "26 fixes"

    * emailed patches from Andrew Morton : (26 commits)
    userfaultfd: remove wrong comment from userfaultfd_ctx_get()
    fat: fix using uninitialized fields of fat_inode/fsinfo_inode
    sh: cayman: IDE support fix
    kasan: fix races in quarantine_remove_cache()
    kasan: resched in quarantine_remove_cache()
    mm: do not call mem_cgroup_free() from within mem_cgroup_alloc()
    thp: fix another corner case of munlock() vs. THPs
    rmap: fix NULL-pointer dereference on THP munlocking
    mm/memblock.c: fix memblock_next_valid_pfn()
    userfaultfd: selftest: vm: allow to build in vm/ directory
    userfaultfd: non-cooperative: userfaultfd_remove revalidate vma in MADV_DONTNEED
    userfaultfd: non-cooperative: fix fork fctx->new memleak
    mm/cgroup: avoid panic when init with low memory
    drivers/md/bcache/util.h: remove duplicate inclusion of blkdev.h
    mm/vmstats: add thp_split_pud event for clarity
    include/linux/fs.h: fix unsigned enum warning with gcc-4.2
    userfaultfd: non-cooperative: release all ctx in dup_userfaultfd_complete
    userfaultfd: non-cooperative: robustness check
    userfaultfd: non-cooperative: rollback userfaultfd_exit
    x86, mm: unify exit paths in gup_pte_range()
    ...

    Linus Torvalds
     

10 Mar, 2017

3 commits

  • Patch series "userfaultfd non-cooperative further update for 4.11 merge
    window".

    Unfortunately I noticed one relevant bug in userfaultfd_exit while doing
    more testing. I've been doing testing before and this was also tested
    by kbuild bot and exercised by the selftest, but this bug never
    reproduced before.

    I dropped userfaultfd_exit as a result. I dropped it because of
    implementation difficulty in receiving signals in __mmput and because I
    think -ENOSPC as result from the background UFFDIO_COPY should be enough
    already.

    Before I decided to remove userfaultfd_exit, I noticed userfaultfd_exit
    wasn't exercised by the selftest and when I tried to exercise it, after
    moving it to a more correct place in __mmput where it would make more
    sense and where the vma list is stable, it resulted in the
    event_wait_completion in D state. So then I added the second patch to
    be sure that even if we call userfaultfd_event_wait_completion too late
    during task exit(), we won't risk generating tasks in D state. The
    same check exists in handle_userfault() for the same reason, except it
    makes a difference there, while here it is just a robustness check and
    it's run under WARN_ON_ONCE.

    While looking at the userfaultfd_event_wait_completion() function I
    looked back at its callers too while at it and I think it's not ok to
    stop executing dup_fctx on the fcs list because we rely on
    userfaultfd_event_wait_completion to execute
    userfaultfd_ctx_put(fctx->orig) which is paired against
    userfaultfd_ctx_get(fctx->orig) in dup_userfault just before
    list_add(fcs). This change only takes care of fctx->orig but this area
    also needs further review looking for similar problems in fctx->new.

    The only patch that is urgent is the first, because it's a use after
    free during an SMP race condition that affects all processes if
    CONFIG_USERFAULTFD=y. Very hard to reproduce though and probably
    impossible without SLUB poisoning enabled.

    This patch (of 3):

    I once reproduced this oops with the userfaultfd selftest, it's not
    easily reproducible and it requires SLUB poisoning to reproduce.

    general protection fault: 0000 [#1] SMP
    Modules linked in:
    CPU: 2 PID: 18421 Comm: userfaultfd Tainted: G ------------ T 3.10.0+ #15
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.1-0-g8891697-prebuilt.qemu-project.org 04/01/2014
    task: ffff8801f83b9440 ti: ffff8801f833c000 task.ti: ffff8801f833c000
    RIP: 0010:[] [] userfaultfd_exit+0x29/0xa0
    RSP: 0018:ffff8801f833fe80 EFLAGS: 00010202
    RAX: ffff8801f833ffd8 RBX: 6b6b6b6b6b6b6b6b RCX: ffff8801f83b9440
    RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8800baf18600
    RBP: ffff8801f833fee8 R08: 0000000000000000 R09: 0000000000000001
    R10: 0000000000000000 R11: ffffffff8127ceb3 R12: 0000000000000000
    R13: ffff8800baf186b0 R14: ffff8801f83b99f8 R15: 00007faed746c700
    FS: 0000000000000000(0000) GS:ffff88023fc80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
    CR2: 00007faf0966f028 CR3: 0000000001bc6000 CR4: 00000000000006e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
    Call Trace:
    do_exit+0x297/0xd10
    SyS_exit+0x17/0x20
    tracesys+0xdd/0xe2
    Code: 00 00 66 66 66 66 90 55 48 89 e5 41 54 53 48 83 ec 58 48 8b 1f 48 85 db 75 11 eb 73 66 0f 1f 44 00 00 48 8b 5b 10 48 85 db 74 64 8b a3 b8 00 00 00 4d 85 e4 74 eb 41 f6 84 24 2c 01 00 00 80
    RIP [] userfaultfd_exit+0x29/0xa0
    RSP
    ---[ end trace 9fecd6dcb442846a ]---

    In the debugger I located the "mm" pointer in the stack and walking
    mm->mmap->vm_next through the end shows the vma->vm_next list is fully
    consistent and is a null-terminated list, as expected. So this has to
    be an SMP race condition where userfaultfd_exit was running while the
    vma list was being modified by another CPU.

    When userfaultfd_exit() ran, one of the ->vm_next pointers pointed to
    SLAB_POISON (RBX is the vma pointer and is 0x6b6b..).

    The reason is that it's not running in __mmput but while there are still
    other threads running, and it's not holding the mmap_sem (it can't, as it
    has to wait for the event to be received by the manager). So this is a
    use after free that was happening for all processes.

    One more implementation problem aside from the race condition:
    userfaultfd_exit really has to check a flag in mm->flags before walking
    the vmas, or it's going to slow down the exit() path for regular tasks.

    One more implementation problem: at that point signals can't be
    delivered so it would also create a task in D state if the manager
    doesn't read the event.

    The major design issue: it overall looks superfluous as the manager can
    check for -ENOSPC in the background transfer:

    if (mmget_not_zero(ctx->mm)) {
            [..]
    } else {
            return -ENOSPC;
    }

    It's safer to roll it back and re-introduce it later if at all.

    [rppt@linux.vnet.ibm.com: documentation fixup after removal of UFFD_EVENT_EXIT]
    Link: http://lkml.kernel.org/r/1488345437-4364-1-git-send-email-rppt@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/20170224181957.19736-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Signed-off-by: Mike Rapoport
    Acked-by: Mike Rapoport
    Cc: "Dr. David Alan Gilbert"
    Cc: Mike Kravetz
    Cc: Pavel Emelyanov
    Cc: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Fix typos and add the following to the scripts/spelling.txt:

    overide||override

    While we are here, fix the doubled "address" in the touched line
    Documentation/devicetree/bindings/regulator/ti-abb-regulator.txt.

    Also, fix the comment block style in the touched hunks in
    drivers/media/dvb-frontends/drx39xyj/drx_driver.h.

    Link: http://lkml.kernel.org/r/1481573103-11329-21-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada
     
  • Fix typos and add the following to the scripts/spelling.txt:

    disble||disable
    disbled||disabled

    I kept the TSL2563_INT_DISBLED in /drivers/iio/light/tsl2563.c
    untouched. The macro is not referenced at all, but this commit is
    touching only comment blocks just in case.

    Link: http://lkml.kernel.org/r/1481573103-11329-20-git-send-email-yamada.masahiro@socionext.com
    Signed-off-by: Masahiro Yamada
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Masahiro Yamada