02 Jun, 2014

1 commit

  • Pull scheduler fixes from Ingo Molnar:
    "Various fixlets, mostly related to the (root-only) SCHED_DEADLINE
    policy, but also a hotplug bug fix and a fix for a NR_CPUS related
    overallocation bug causing a suspend/resume regression"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched: Fix hotplug vs. set_cpus_allowed_ptr()
    sched/cpupri: Replace NR_CPUS arrays
    sched/deadline: Replace NR_CPUS arrays
    sched/deadline: Restrict user params max value to 2^63 ns
    sched/deadline: Change sched_getparam() behaviour vs SCHED_DEADLINE
    sched: Disallow sched_attr::sched_policy < 0
    sched: Make sched_setattr() correctly return -EFBIG

    Linus Torvalds
     

01 Jun, 2014

1 commit

  • Pull core futex/rtmutex fixes from Thomas Gleixner:
    "Three fixlets for long-standing issues in the futex/rtmutex code
    unearthed by Dave Jones' syscall fuzzer:

    - Add missing early deadlock detection checks in the futex code
    - Prevent user space from attaching a futex to kernel threads
    - Make the deadlock detector of rtmutex work again

    Looks large, but is more comments than code change"

    * 'core-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    rtmutex: Fix deadlock detector for real
    futex: Prevent attaching to kernel threads
    futex: Add another early deadlock detection check

    Linus Torvalds
     

28 May, 2014

3 commits

  • The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
                       top_waiter != task_top_pi_waiter(task)))
        goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top-priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). This is an optimization, but it also prevents a wrong result:
    without it, another waiter arriving after the detection but before the
    task has undone the damage could observe the situation, detect the
    deadlock and return -EDEADLOCK, even though that other task is not in
    a deadlock situation.
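    The exit decision above can be modeled in a few lines. This is a
    hedged toy sketch, not the kernel code: the names walk_continues(),
    has_waiters, is_top_waiter and detect are illustrative stand-ins for
    the conditions in the chain walk; the point is only that the walk
    must keep going past a non-top waiter when deadlock detection is
    requested.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /*
     * Toy model of the chain-walk exit decision (not kernel code).
     * The old logic dropped out whenever the current waiter was not
     * the task's top pi waiter; the fixed logic keeps walking in that
     * case when full deadlock detection is requested.
     */
    static bool walk_continues(bool has_waiters, bool is_top_waiter, bool detect)
    {
        if (!has_waiters)
            return false;        /* nothing to propagate: always stop */
        if (!is_top_waiter && !detect)
            return false;        /* old early exit, kept for !detect */
        return true;             /* keep walking the lock chain */
    }

    int main(void)
    {
        /* not the top waiter: the old code always stopped here ... */
        assert(!walk_continues(true, false, false));
        /* ... the fix continues when deadlock detection is enabled */
        assert(walk_continues(true, false, true));
        /* no waiters at all still terminates the walk */
        assert(!walk_continues(false, false, true));
        return 0;
    }
    ```
    
    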

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • Pull two powerpc fixes from Ben Herrenschmidt:
    "Here's a pair of powerpc fixes for 3.15 which are also going to
    stable.

    One's a fix for building with newer binutils (the problem currently
    only affects the BookE kernels, but the affected macro might come back
    into use on BookS platforms at any time). Unfortunately, the binutils
    maintainer made a backward-incompatible change to a construct that we
    use, so we have to add a Makefile check.

    The other one is a fix for CPUs getting stuck in kexec when running
    single threaded. Since we routinely use kexec on power (including in
    our newer bootloaders), I deemed that important enough"

    * 'merge' of git://git.kernel.org/pub/scm/linux/kernel/git/benh/powerpc:
    powerpc, kexec: Fix "Processor X is stuck" issue during kexec from ST mode
    powerpc: Fix 64 bit builds with binutils 2.24

    Linus Torvalds
     
  • If we try to perform a kexec when the machine is in ST (Single-Threaded) mode
    (ppc64_cpu --smt=off), the kexec operation doesn't succeed properly, and we
    get the following messages during boot:

    [ 0.089866] POWER8 performance monitor hardware support registered
    [ 0.089985] power8-pmu: PMAO restore workaround active.
    [ 5.095419] Processor 1 is stuck.
    [ 10.097933] Processor 2 is stuck.
    [ 15.100480] Processor 3 is stuck.
    [ 20.102982] Processor 4 is stuck.
    [ 25.105489] Processor 5 is stuck.
    [ 30.108005] Processor 6 is stuck.
    [ 35.110518] Processor 7 is stuck.
    [ 40.113369] Processor 9 is stuck.
    [ 45.115879] Processor 10 is stuck.
    [ 50.118389] Processor 11 is stuck.
    [ 55.120904] Processor 12 is stuck.
    [ 60.123425] Processor 13 is stuck.
    [ 65.125970] Processor 14 is stuck.
    [ 70.128495] Processor 15 is stuck.
    [ 75.131316] Processor 17 is stuck.

    Note that only the sibling threads are stuck, while the primary threads (0, 8,
    16 etc) boot just fine. Looking closer at the previous step of kexec, we observe
    that kexec tries to wake up (bring online) the sibling threads of all the cores,
    before performing kexec:

    [ 9464.131231] Starting new kernel
    [ 9464.148507] kexec: Waking offline cpu 1.
    [ 9464.148552] kexec: Waking offline cpu 2.
    [ 9464.148600] kexec: Waking offline cpu 3.
    [ 9464.148636] kexec: Waking offline cpu 4.
    [ 9464.148671] kexec: Waking offline cpu 5.
    [ 9464.148708] kexec: Waking offline cpu 6.
    [ 9464.148743] kexec: Waking offline cpu 7.
    [ 9464.148779] kexec: Waking offline cpu 9.
    [ 9464.148815] kexec: Waking offline cpu 10.
    [ 9464.148851] kexec: Waking offline cpu 11.
    [ 9464.148887] kexec: Waking offline cpu 12.
    [ 9464.148922] kexec: Waking offline cpu 13.
    [ 9464.148958] kexec: Waking offline cpu 14.
    [ 9464.148994] kexec: Waking offline cpu 15.
    [ 9464.149030] kexec: Waking offline cpu 17.

    Instrumenting this piece of code revealed that the cpu_up() operation actually
    fails with -EBUSY. Thus, only the primary threads of all the cores are online
    during kexec, and hence this is a sure-shot recipe for disaster, as explained
    in commit e8e5c2155b (powerpc/kexec: Fix orphaned offline CPUs across kexec),
    as well as in the comment above wake_offline_cpus().

    It turns out that cpu_up() was returning -EBUSY because the variable
    'cpu_hotplug_disabled' was set to 1; and this disabling of CPU hotplug was done
    by migrate_to_reboot_cpu() inside kernel_kexec().

    Now, migrate_to_reboot_cpu() was originally written with the assumption that
    any further code will not need to perform CPU hotplug, since we are anyway in
    the reboot path. However, kexec is clearly not such a case, since we depend on
    onlining CPUs, at least on powerpc.

    So re-enable cpu-hotplug after returning from migrate_to_reboot_cpu() in the
    kexec path, to fix this regression in kexec on powerpc.

    Also, wrap the cpu_up() in powerpc kexec code within a WARN_ON(), so that we
    can catch such issues more easily in the future.
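    The failure and the fix boil down to a simple ordering problem, which
    the following hedged toy model illustrates. None of this is kernel
    code: toy_cpu_up() and toy_migrate_to_reboot_cpu() are stand-ins for
    cpu_up() and migrate_to_reboot_cpu(), and -16 stands in for -EBUSY.

    ```c
    #include <assert.h>

    /* Toy model (not kernel code): a flag gates CPU onlining, the way
     * cpu_hotplug_disabled gates cpu_up() in the kernel. */
    static int cpu_hotplug_disabled;

    static int toy_cpu_up(void)
    {
        return cpu_hotplug_disabled ? -16 /* -EBUSY */ : 0;
    }

    static void toy_migrate_to_reboot_cpu(void)
    {
        cpu_hotplug_disabled = 1;    /* the side effect that broke kexec */
    }

    int main(void)
    {
        /* Before the fix: the wake-offline-cpus path in kexec hits -EBUSY. */
        toy_migrate_to_reboot_cpu();
        assert(toy_cpu_up() == -16);

        /* The fix: re-enable hotplug right after migrating, then online CPUs. */
        cpu_hotplug_disabled = 0;
        assert(toy_cpu_up() == 0);
        return 0;
    }
    ```
    
    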

    Fixes: c97102ba963 (kexec: migrate to reboot cpu)
    Cc: stable@vger.kernel.org
    Signed-off-by: Srivatsa S. Bhat
    Signed-off-by: Benjamin Herrenschmidt

    Srivatsa S. Bhat
     

24 May, 2014

2 commits

  • Pull scheduler fixes from Ingo Molnar:
    "The biggest commit is an irqtime accounting loop latency fix, the rest
    are misc fixes all over the place: deadline scheduling, docs, numa,
    balancer and a bad to-idle latency fix"

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched/numa: Initialize newidle balance stats in sd_numa_init()
    sched: Fix updating rq->max_idle_balance_cost and rq->next_balance in idle_balance()
    sched: Skip double execution of pick_next_task_fair()
    sched: Use CPUPRI_NR_PRIORITIES instead of MAX_RT_PRIO in cpupri check
    sched/deadline: Fix memory leak
    sched/deadline: Fix sched_yield() behavior
    sched: Sanitize irq accounting madness
    sched/docbook: Fix 'make htmldocs' warnings caused by missing description

    Linus Torvalds
     
  • Pull perf fixes from Ingo Molnar:
    "The biggest changes are fixes for races that kept triggering Trinity
    crashes, plus liblockdep build fixes and smaller misc fixes.

    The liblockdep bits in perf/urgent are a pull mistake - they should
    have been in locking/urgent - but by the time I noticed other commits
    were added and testing was done :-/ Sorry about that"

    * 'perf-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    perf: Fix a race between ring_buffer_detach() and ring_buffer_attach()
    perf: Prevent false warning in perf_swevent_add
    perf: Limit perf_event_attr::sample_period to 63 bits
    tools/liblockdep: Remove all build files when doing make clean
    tools/liblockdep: Build liblockdep from tools/Makefile
    perf/x86/intel: Fix Silvermont's event constraints
    perf: Fix perf_event_init_context()
    perf: Fix race in removing an event

    Linus Torvalds
     

22 May, 2014

7 commits

  • Lai found that:

    WARNING: CPU: 1 PID: 13 at arch/x86/kernel/smp.c:124 native_smp_send_reschedule+0x2d/0x4b()
    ...
    migration_cpu_stop+0x1d/0x22

    was caused by set_cpus_allowed_ptr() assuming that cpu_active_mask is
    always a sub-set of cpu_online_mask.

    This isn't true since 5fbd036b552f ("sched: Cleanup cpu_active madness").

    So set active and online at the same time to avoid this particular
    problem.

    Fixes: 5fbd036b552f ("sched: Cleanup cpu_active madness")
    Signed-off-by: Lai Jiangshan
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Gautham R. Shenoy
    Cc: Linus Torvalds
    Cc: Michael wang
    Cc: Paul Gortmaker
    Cc: Rafael J. Wysocki
    Cc: Srivatsa S. Bhat
    Cc: Toshi Kani
    Link: http://lkml.kernel.org/r/53758B12.8060609@cn.fujitsu.com
    Signed-off-by: Ingo Molnar

    Lai Jiangshan
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.
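    The shape of that change can be sketched as follows. This is a
    hedged user-space stand-in (calloc/free in place of kcalloc/kfree,
    and toy_cpupri is an invented name): embedding an NR_CPUS-sized array
    makes the containing structure's size scale with the compile-time CPU
    limit and forces high-order allocations, while a pointer sized for
    nr_cpu_ids keeps the structure small.

    ```c
    #include <assert.h>
    #include <stdlib.h>

    /* Illustrative sketch only: instead of "int cpu_to_pri[NR_CPUS];"
     * embedded in the structure, keep a pointer and allocate for the
     * CPUs that can actually exist. */
    struct toy_cpupri {
        int *cpu_to_pri;   /* was: int cpu_to_pri[NR_CPUS]; */
    };

    static int toy_cpupri_init(struct toy_cpupri *cp, int nr_cpu_ids)
    {
        cp->cpu_to_pri = calloc(nr_cpu_ids, sizeof(*cp->cpu_to_pri));
        return cp->cpu_to_pri ? 0 : -12; /* -ENOMEM */
    }

    int main(void)
    {
        struct toy_cpupri cp;

        /* The structure itself stays pointer-sized no matter how large
         * the compile-time CPU limit is. */
        assert(sizeof(cp) == sizeof(int *));
        assert(toy_cpupri_init(&cp, 8) == 0);
        cp.cpu_to_pri[7] = 42;
        assert(cp.cpu_to_pri[7] == 42);
        free(cp.cpu_to_pri);
        return 0;
    }
    ```
    
    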

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Cc: Johannes Weiner
    Cc: Steven Rostedt
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-7cysnkw1gik45r864t1nkudh@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Tejun reported that his resume was failing due to order-3 allocations
    from sched_domain building.

    Replace the NR_CPUS arrays in there with a dynamically allocated
    array.

    Reported-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Acked-by: Juri Lelli
    Cc: Johannes Weiner
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-kat4gl1m5a6dwy6nzuqox45e@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Michael Kerrisk noticed that creating SCHED_DEADLINE reservations
    with certain parameters (e.g, a runtime of something near 2^64 ns)
    can cause a system freeze for some amount of time.

    The problem is that in the interface we have

    u64 sched_runtime;

    while internally we need to have a signed runtime (to cope with
    budget overruns)

    s64 runtime;

    When we set up a new dl_entity we copy the first value into the
    second. The cast yields a negative value when sched_runtime is too
    big, and this causes the scheduler to go crazy right from the start.

    Moreover, considering how we deal with deadline wraparound

    (s64)(a - b) < 0

    we also have to restrict acceptable values for sched_{deadline,period}.

    This patch fixes the problem by checking that user parameters are
    always below 2^63 ns (still large enough for everyone).

    It also rewrites the other conditions that we check, since in
    __checkparam_dl we don't have to deal with deadline wraparound,
    and the current checks erroneously fail when the difference between
    values is too big.
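    The cast problem is easy to demonstrate in isolation. This is a
    hedged stand-alone illustration, not the kernel's actual
    __checkparam_dl(): dl_param_valid() is an invented name for the
    "below 2^63 ns" bound the commit describes.

    ```c
    #include <assert.h>
    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative check only: a u64 runtime with bit 63 set turns
     * negative when copied into the scheduler's signed s64 runtime,
     * so reject such values up front. */
    static bool dl_param_valid(uint64_t val)
    {
        return val < (1ULL << 63);   /* must survive the cast to s64 */
    }

    int main(void)
    {
        uint64_t huge = UINT64_MAX - 42;     /* a runtime "near 2^64 ns" */
        int64_t  as_signed = (int64_t)huge;

        assert(as_signed < 0);               /* the u64 -> s64 cast goes negative */
        assert(!dl_param_valid(huge));       /* so the parameter is rejected */
        assert(dl_param_valid(1000000ULL));  /* a sane 1ms runtime is fine */
        return 0;
    }
    ```
    
    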

    Reported-by: Michael Kerrisk
    Suggested-by: Peter Zijlstra
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Dario Faggioli
    Cc: Dave Jones
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140513141131.20d944f81633ee937f256385@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
    The way we read POSIX, one should only call sched_getparam() when
    sched_getscheduler() returns either SCHED_FIFO or SCHED_RR.

    Given that we currently return sched_param::sched_priority=0 for all
    others, extend the same behaviour to SCHED_DEADLINE.

    Requested-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc: Dario Faggioli
    Cc: linux-man
    Cc: "Michael Kerrisk (man-pages)"
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc:
    Link: http://lkml.kernel.org/r/20140512205034.GH13467@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The scheduler uses policy=-1 to preserve the current policy state in
    order to implement sys_sched_setparam(). This got exposed to user
    space by accident through sys_sched_setattr(); cure this.

    Reported-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Acked-by: Michael Kerrisk
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/20140509085311.GJ30445@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The documented[1] behavior of sched_attr() in the proposed man page text is:

    sched_attr::size must be set to the size of the structure, as in
    sizeof(struct sched_attr). If the provided structure is smaller
    than the kernel structure, any additional fields are assumed to be
    '0'. If the provided structure is larger than the kernel structure,
    the kernel verifies that all additional fields are '0'; if not, the
    syscall fails with -E2BIG.

    As currently implemented, sched_copy_attr() returns -EFBIG for
    this case, but the logic in sys_sched_setattr() converts that
    error to -EFAULT. This patch fixes the behavior.

    [1] http://thread.gmane.org/gmane.linux.kernel/1615615/focus=1697760
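    The "larger structure must have an all-zero tail" rule can be
    sketched as follows. This is a hedged toy version, not the kernel's
    sched_copy_attr(): toy_copy_attr() is an invented name, and the key
    point is that the -E2BIG result must reach the caller unconverted.

    ```c
    #include <assert.h>
    #include <stddef.h>
    #include <string.h>

    #define E2BIG 7   /* Linux errno value for "argument list too long" */

    /* Illustrative sketch only: if user space passes a structure larger
     * than the kernel's, every byte beyond the kernel size must be zero,
     * otherwise fail with -E2BIG (returned as-is, not as -EFAULT). */
    static int toy_copy_attr(const unsigned char *uattr, size_t usize, size_t ksize)
    {
        size_t i;

        for (i = ksize; i < usize; i++)
            if (uattr[i] != 0)
                return -E2BIG;
        return 0;
    }

    int main(void)
    {
        unsigned char buf[16];

        memset(buf, 0, sizeof(buf));
        assert(toy_copy_attr(buf, 16, 8) == 0);      /* all-zero tail: ok */
        buf[12] = 1;
        assert(toy_copy_attr(buf, 16, 8) == -E2BIG); /* non-zero tail: reject */
        assert(toy_copy_attr(buf, 8, 8) == 0);       /* exact size: always fine */
        return 0;
    }
    ```
    
    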

    Signed-off-by: Michael Kerrisk
    Signed-off-by: Peter Zijlstra
    Cc:
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/536CEC17.9070903@gmail.com
    Signed-off-by: Ingo Molnar

    Michael Kerrisk
     

21 May, 2014

1 commit

  • Pull more cgroup fixes from Tejun Heo:
    "Three more patches to fix cgroup_freezer breakage due to the recent
    cgroup internal locking changes - an operation cgroup_freezer was
    using now requires a sleepable context, and cgroup_freezer was
    invoking it while holding a spin lock. cgroup_freezer was using an
    overly elaborate hierarchical locking scheme.

    While it's possible to convert the hierarchical spinlocks directly to
    mutexes, this patch simplifies the overall locking so that it uses a
    global mutex. This has the added benefit of avoiding iterating
    potentially huge number of tasks under a spinlock. While the patch is
    on the larger side in the devel cycle, the changes made are mostly
    straight-forward and the locking logic is a lot simpler afterwards"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    cgroup: fix rcu_read_lock() leak in update_if_frozen()
    cgroup_freezer: replace freezer->lock with freezer_mutex
    cgroup: introduce task_css_is_root()

    Linus Torvalds
     

20 May, 2014

1 commit


19 May, 2014

5 commits

    Alexander noticed that we use RCU iteration on rb->event_list but do
    not use list_{add,del}_rcu() to add/remove entries to that list, nor
    do we observe proper grace periods when re-using the entries.

    Merge ring_buffer_detach() into ring_buffer_attach() such that
    attaching to the NULL buffer is detaching.

    Furthermore, ensure that between any 'detach' and 'attach' of the same
    event we observe the required grace period, but only when strictly
    required. In effect this means that only ioctl(.request =
    PERF_EVENT_IOC_SET_OUTPUT) will wait for a grace period, while the
    normal initial attach and final detach will not be delayed.

    This patch should, I think, do the right thing under all
    circumstances: the 'normal' cases should never see the extra grace
    period, but the following two cases will:

    1) PERF_EVENT_IOC_SET_OUTPUT on an event which already has a
    ring_buffer set, will now observe the required grace period between
    removing itself from the old and attaching itself to the new buffer.

    This case is 'simple' in that both buffers are present in
    perf_event_set_output(); one could think an unconditional
    synchronize_rcu() would be sufficient, however...

    2) an event that has a buffer attached, the buffer is destroyed
    (munmap) and then the event is attached to a new/different buffer
    using PERF_EVENT_IOC_SET_OUTPUT.

    This case is more complex because the buffer destruction does:

        ring_buffer_attach(.rb = NULL);

    followed by the ioctl() doing:

        ring_buffer_attach(.rb = foo);

    and we still need to observe the grace period between these two
    calls due to us reusing the event->rb_entry list_head.

    In order to make case 2 work we use Paul's latest
    cond_synchronize_rcu() call.
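    The semantics of the conditional wait can be captured with a small
    hedged model. This is not the RCU implementation: a plain counter
    stands in for the grace-period state cookie that
    get_state_synchronize_rcu() returns, and toy_cond_synchronize()
    only blocks when no grace period has completed since the cookie was
    taken, which is exactly why only the SET_OUTPUT path pays the wait.

    ```c
    #include <assert.h>

    /* Toy model (not the RCU implementation). */
    static unsigned long gp_counter;  /* completed grace periods */
    static int waits;                 /* how often we actually blocked */

    static unsigned long toy_get_state(void)  { return gp_counter; }
    static void toy_synchronize(void)         { gp_counter++; waits++; }

    static void toy_cond_synchronize(unsigned long cookie)
    {
        if (cookie == gp_counter)   /* nothing elapsed since cookie: wait */
            toy_synchronize();
    }

    int main(void)
    {
        /* detach takes a cookie; re-attach right away must pay the wait */
        unsigned long cookie = toy_get_state();
        toy_cond_synchronize(cookie);
        assert(waits == 1);

        /* detach, then a grace period elapses for unrelated reasons
         * before the attach: the conditional wait costs nothing extra */
        cookie = toy_get_state();
        toy_synchronize();            /* waits becomes 2 */
        toy_cond_synchronize(cookie); /* no additional wait */
        assert(waits == 2);
        return 0;
    }
    ```
    
    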

    Cc: Paul Mackerras
    Cc: Stephane Eranian
    Cc: Andi Kleen
    Cc: "Paul E. McKenney"
    Cc: Ingo Molnar
    Cc: Frederic Weisbecker
    Cc: Mike Galbraith
    Reported-by: Alexander Shishkin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140507123526.GD13658@twins.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • The perf cpu offline callback takes down all cpu context
    events and releases swhash->swevent_hlist.

    This could race with a task context software event being
    scheduled on this cpu via perf_swevent_add, while the cpu hotplug
    code has already cleaned up the event's data.

    The race happens in the gap between the cpu notifier code
    and the cpu being actually taken down. Note that only cpu
    ctx events are terminated in the perf cpu hotplug code.

    It's easily reproduced with:
    $ perf record -e faults perf bench sched pipe

    while putting one of the cpus offline:
    # echo 0 > /sys/devices/system/cpu/cpu1/online

    Console emits following warning:
    WARNING: CPU: 1 PID: 2845 at kernel/events/core.c:5672 perf_swevent_add+0x18d/0x1a0()
    Modules linked in:
    CPU: 1 PID: 2845 Comm: sched-pipe Tainted: G W 3.14.0+ #256
    Hardware name: Intel Corporation Montevina platform/To be filled by O.E.M., BIOS AMVACRB1.86C.0066.B00.0805070703 05/07/2008
    0000000000000009 ffff880077233ab8 ffffffff81665a23 0000000000200005
    0000000000000000 ffff880077233af8 ffffffff8104732c 0000000000000046
    ffff88007467c800 0000000000000002 ffff88007a9cf2a0 0000000000000001
    Call Trace:
    [] dump_stack+0x4f/0x7c
    [] warn_slowpath_common+0x8c/0xc0
    [] warn_slowpath_null+0x1a/0x20
    [] perf_swevent_add+0x18d/0x1a0
    [] event_sched_in.isra.75+0x9e/0x1f0
    [] group_sched_in+0x6a/0x1f0
    [] ? sched_clock_local+0x25/0xa0
    [] ctx_sched_in+0x1f6/0x450
    [] perf_event_sched_in+0x6b/0xa0
    [] perf_event_context_sched_in+0x7b/0xc0
    [] __perf_event_task_sched_in+0x43e/0x460
    [] ? put_lock_stats.isra.18+0xe/0x30
    [] finish_task_switch+0xb8/0x100
    [] __schedule+0x30e/0xad0
    [] ? pipe_read+0x3e2/0x560
    [] ? preempt_schedule_irq+0x3e/0x70
    [] ? preempt_schedule_irq+0x3e/0x70
    [] preempt_schedule_irq+0x44/0x70
    [] retint_kernel+0x20/0x30
    [] ? lockdep_sys_exit+0x1a/0x90
    [] lockdep_sys_exit_thunk+0x35/0x67
    [] ? sysret_check+0x5/0x56

    Fix this by tracking the cpu hotplug state and emitting the WARN
    only if the current cpu is initialized properly.

    Cc: Corey Ashford
    Cc: Frederic Weisbecker
    Cc: Ingo Molnar
    Cc: Paul Mackerras
    Cc: Arnaldo Carvalho de Melo
    Cc: stable@vger.kernel.org
    Reported-by: Fengguang Wu
    Signed-off-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1396861448-10097-1-git-send-email-jolsa@redhat.com
    Signed-off-by: Thomas Gleixner

    Jiri Olsa
     
    Vince reported that using a large sample_period (one with bit 63 set)
    results in wreckage since, while the sample_period is fundamentally
    unsigned (negative periods don't make sense), the way we implement
    things very much relies on signed logic.

    So limit sample_period to 63 bits to avoid tripping over this.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-p25fhunibl4y3qi0zuqmyf4b@git.kernel.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • We happily allow userspace to declare a random kernel thread to be the
    owner of a user space PI futex.

    Found while analysing the fallout of Dave Jones' syscall fuzzer.

    We also should validate the thread group for private futexes and find
    some fast way to validate whether the "alleged" owner has RW access on
    the file which backs the SHM, but that's a separate issue.

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.194824402@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
    Dave Jones' trinity syscall fuzzer exposed an issue in the deadlock
    detection code of rtmutex:
    http://lkml.kernel.org/r/20140429151655.GA14277@redhat.com

    That underlying issue has been fixed with a patch to the rtmutex code,
    but the futex code must not call into rtmutex in that case because
    - it can detect that issue early
    - it avoids a different and more complex fixup for backing out

    If the user space variable got manipulated to 0x80000000, which means
    no lock holder but the waiters bit set, and an active pi_state is
    found in the kernel, we can figure out the recursive locking issue by
    looking at the pi_state owner. If that is the current task, then we
    can safely return -EDEADLK.

    The check should have been added in commit 59fa62451 (futex: Handle
    futex_pi OWNER_DIED take over correctly) already, but I did not see
    the above issue caused by user space manipulation back then.
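    That condition can be expressed compactly. The sketch below is a
    hedged user-space model, not the kernel's futex code:
    toy_early_deadlock_check(), toy_task and toy_pi_state are invented
    names, while the FUTEX_WAITERS/FUTEX_TID_MASK values and the
    0x80000000 word match the commit text.

    ```c
    #include <assert.h>
    #include <stdint.h>

    #define FUTEX_WAITERS  0x80000000u
    #define FUTEX_TID_MASK 0x3fffffffu
    #define EDEADLK 35   /* Linux errno value */

    struct toy_task     { int pid; };
    struct toy_pi_state { struct toy_task *owner; };

    /* Illustrative sketch only: no owner TID in the futex word, waiters
     * bit set, and the attached pi_state is owned by the current task:
     * we would be locking against ourselves, so fail early. */
    static int toy_early_deadlock_check(uint32_t uval,
                                        struct toy_pi_state *pi_state,
                                        struct toy_task *curr)
    {
        if (!(uval & FUTEX_TID_MASK) && (uval & FUTEX_WAITERS) &&
            pi_state && pi_state->owner == curr)
            return -EDEADLK;
        return 0;
    }

    int main(void)
    {
        struct toy_task me = { .pid = 1000 };
        struct toy_task other = { .pid = 2000 };
        struct toy_pi_state ps = { .owner = &me };

        /* word says "no owner, waiters set" yet we own the pi_state */
        assert(toy_early_deadlock_check(0x80000000u, &ps, &me) == -EDEADLK);
        /* someone else owns it: no recursive lock for us */
        assert(toy_early_deadlock_check(0x80000000u, &ps, &other) == 0);
        /* a real owner TID in the word: the normal paths handle it */
        assert(toy_early_deadlock_check(1000u | FUTEX_WAITERS, &ps, &me) == 0);
        return 0;
    }
    ```
    
    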

    Signed-off-by: Thomas Gleixner
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Darren Hart
    Cc: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Clark Williams
    Cc: Paul McKenney
    Cc: Lai Jiangshan
    Cc: Roland McGrath
    Cc: Carlos ODonell
    Cc: Jakub Jelinek
    Cc: Michael Kerrisk
    Cc: Sebastian Andrzej Siewior
    Link: http://lkml.kernel.org/r/20140512201701.097349971@linutronix.de
    Signed-off-by: Thomas Gleixner
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     

13 May, 2014

5 commits

  • While updating cgroup_freezer locking, 68fafb77d827 ("cgroup_freezer:
    replace freezer->lock with freezer_mutex") introduced a bug in
    update_if_frozen() where it returns with rcu_read_lock() held. Fix it
    by adding rcu_read_unlock() before returning.

    Signed-off-by: Tejun Heo
    Reported-by: kbuild test robot

    Tejun Heo
     
    After 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it
    to css_set_rwsem"), css task iterators require a sleepable context as
    they may block on css_set_rwsem. I missed that cgroup_freezer was
    iterating tasks under the IRQ-safe spinlock freezer->lock. This leads
    to errors like the following on freezer state reads and transitions.

    BUG: sleeping function called from invalid context at /work
    /os/work/kernel/locking/rwsem.c:20
    in_atomic(): 0, irqs_disabled(): 0, pid: 462, name: bash
    5 locks held by bash/462:
    #0: (sb_writers#7){.+.+.+}, at: [] vfs_write+0x1a3/0x1c0
    #1: (&of->mutex){+.+.+.}, at: [] kernfs_fop_write+0xbb/0x170
    #2: (s_active#70){.+.+.+}, at: [] kernfs_fop_write+0xc3/0x170
    #3: (freezer_mutex){+.+...}, at: [] freezer_write+0x61/0x1e0
    #4: (rcu_read_lock){......}, at: [] freezer_write+0x53/0x1e0
    Preemption disabled at:[] console_unlock+0x1e4/0x460

    CPU: 3 PID: 462 Comm: bash Not tainted 3.15.0-rc1-work+ #10
    Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
    ffff88000916a6d0 ffff88000e0a3da0 ffffffff81cf8c96 0000000000000000
    ffff88000e0a3dc8 ffffffff810cf4f2 ffffffff82388040 ffff880013aaf740
    0000000000000002 ffff88000e0a3de8 ffffffff81d05974 0000000000000246
    Call Trace:
    [] dump_stack+0x4e/0x7a
    [] __might_sleep+0x162/0x260
    [] down_read+0x24/0x60
    [] css_task_iter_start+0x27/0x70
    [] freezer_apply_state+0x5d/0x130
    [] freezer_write+0xf6/0x1e0
    [] cgroup_file_write+0xd8/0x230
    [] kernfs_fop_write+0xe7/0x170
    [] vfs_write+0xb6/0x1c0
    [] SyS_write+0x4d/0xc0
    [] system_call_fastpath+0x16/0x1b

    freezer->lock used to be used in hot paths but that time is long gone
    and there's no reason for the lock to be an IRQ-safe spinlock or even
    per-cgroup. In fact, given that a cgroup may contain a large number
    of tasks, it's not a good idea to iterate over them while holding an
    IRQ-safe spinlock.

    Let's simplify locking by replacing the per-cgroup freezer->lock with
    a global freezer_mutex. This also simplifies the comments explaining
    the intricacies of policy inheritance and the locking around it, as
    the states are now protected by a common mutex.

    The conversion is mostly straight-forward. The following points are
    worth mentioning.

    * freezer_css_online() no longer needs double locking.

    * freezer_attach() now performs propagation simply while holding
    freezer_mutex. update_if_frozen() race no longer exists and the
    comment is removed.

    * freezer_fork() now tests whether the task is in root cgroup using
    the new task_css_is_root() without doing rcu_read_lock/unlock(). If
    not, it grabs freezer_mutex and performs the operation.

    * freezer_read() and freezer_change_state() grab freezer_mutex across
    the whole operation and pin the css while iterating so that each
    descendant processing happens in sleepable context.

    Fixes: 96d365e0b86e ("cgroup: make css_set_lock a rwsem and rename it to css_set_rwsem")
    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
    Determining the css of a task usually requires the RCU read lock, as
    that's the only thing which keeps the returned css accessible until
    its reference is acquired; however, testing whether a task belongs to
    the root can be performed without dereferencing the returned css, by
    comparing the returned pointer against the root one in init_css_set[],
    which never changes.

    Implement task_css_is_root() which can be invoked in any context.
    This will be used by the scheduled cgroup_freezer change.

    v2: cgroup no longer supports modular controllers. No need to export
    init_css_set. Pointed out by Li.
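    The pointer-comparison idea is simple enough to show directly. This
    is a hedged toy sketch, not the cgroup implementation: toy_css_set
    and toy_task are invented stand-ins, and the point is that root
    membership is decided by comparing against the never-changing root
    set without dereferencing anything.

    ```c
    #include <assert.h>
    #include <stdbool.h>

    /* Illustrative sketch only: the root css_set never changes, so the
     * root test is a bare pointer comparison and never dereferences the
     * (possibly about-to-be-freed) css_set, hence no RCU read lock. */
    struct toy_css_set { int id; };

    static struct toy_css_set init_css_set = { .id = 0 };

    struct toy_task { struct toy_css_set *cset; };

    static bool toy_task_css_is_root(struct toy_task *t)
    {
        return t->cset == &init_css_set;   /* compare, don't dereference */
    }

    int main(void)
    {
        struct toy_css_set child = { .id = 1 };
        struct toy_task a = { .cset = &init_css_set };
        struct toy_task b = { .cset = &child };

        assert(toy_task_css_is_root(&a));
        assert(!toy_task_css_is_root(&b));
        return 0;
    }
    ```
    
    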

    Signed-off-by: Tejun Heo
    Acked-by: Li Zefan

    Tejun Heo
     
  • Pull workqueue fixes from Tejun Heo:
    "Fixes for two bugs in workqueue.

    One is exiting with internal mutex held in a failure path of
    wq_update_unbound_numa(). The other is a subtle and unlikely
    use-after-possible-last-put in the rescuer logic. Both have been
    around for quite some time now and are unlikely to have triggered
    noticeably often. All patches are marked for -stable backport"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq:
    workqueue: fix a possible race condition between rescuer and pwq-release
    workqueue: make rescuer_thread() empty wq->maydays list before exiting
    workqueue: fix bugs in wq_update_unbound_numa() failure path

    Linus Torvalds
     
  • Pull cgroup fixes from Tejun Heo:
    "During recent restructuring, device_cgroup unified config input check
    and enforcement logic; unfortunately, it turned out to share too much.
    Aristeu's patches fix the breakage and are marked for -stable backport.

    The other two patches are fallouts from kernfs conversion. The blkcg
    change is temporary and will go away once kernfs internal locking gets
    simplified (patches pending)"

    * 'for-3.15-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
    blkcg: use trylock on blkcg_pol_mutex in blkcg_reset_stats()
    device_cgroup: check if exception removal is allowed
    device_cgroup: fix the comment format for recently added functions
    device_cgroup: rework device access check and exception checking
    cgroup: fix the retry path of cgroup_mount()

    Linus Torvalds
     

12 May, 2014

1 commit

  • switch_hrtimer_base() calls hrtimer_check_target() which ensures that
    we do not migrate a timer to a remote cpu if the timer expires before
    the current programmed expiry time on that remote cpu.

    But __hrtimer_start_range_ns() calls switch_hrtimer_base() before the
    new expiry time is set. So the sanity check in hrtimer_check_target()
    is operating on stale or even uninitialized data.

    Update expiry time before calling switch_hrtimer_base().

    [ tglx: Rewrote changelog once again ]
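    The ordering bug can be modeled in miniature. This is a hedged toy
    sketch, not the hrtimer code: toy_timer and toy_check_target() are
    invented stand-ins for the timer and for hrtimer_check_target(),
    which compares the timer's expiry against the remote cpu's next
    programmed event, so the expiry field must be written before the
    migration check reads it.

    ```c
    #include <assert.h>
    #include <stdint.h>

    /* Toy model only (not kernel code). */
    struct toy_timer { int64_t expires; };

    /* Refuse migration if the timer would fire before the remote cpu's
     * next programmed event, mirroring the sanity check's intent. */
    static int toy_check_target(struct toy_timer *t, int64_t remote_next_event)
    {
        return t->expires < remote_next_event ? -1 : 0;
    }

    int main(void)
    {
        struct toy_timer t = { .expires = 0 };  /* stale/uninitialized */
        int64_t remote_next_event = 100;

        /* old order: the check runs on stale data and gives the wrong answer */
        assert(toy_check_target(&t, remote_next_event) == -1);

        /* fixed order: set the new expiry first, then run the check */
        t.expires = 500;
        assert(toy_check_target(&t, remote_next_event) == 0);
        return 0;
    }
    ```
    
    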

    Signed-off-by: Viresh Kumar
    Cc: linaro-kernel@lists.linaro.org
    Cc: linaro-networking@linaro.org
    Cc: fweisbec@gmail.com
    Cc: arvind.chauhan@arm.com
    Link: http://lkml.kernel.org/r/81999e148745fc51bbcd0615823fbab9b2e87e23.1399882253.git.viresh.kumar@linaro.org
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Viresh Kumar
     

10 May, 2014

1 commit

  • Pull x86 fixes from Peter Anvin:
    "A somewhat unpleasantly large collection of small fixes. The big ones
    are the __visible tree sweep and a fix for 'earlyprintk=efi,keep'. It
    was using __init functions with predictably suboptimal results.

    Another key fix is a build fix which would produce output that simply
    would not decompress correctly in some configuration, due to the
    existing Makefiles picking up an unfortunate local label and mistaking
    it for the global symbol _end.

    Additional fixes include the handling of 64-bit numbers when setting
    the vdso data page (a latent bug which became manifest when i386
    started exporting a vdso with time functions), a fix to the new MSR
    manipulation accessors which would cause features to not get properly
    unblocked, a build fix for 32-bit userland, and a few new platform
    quirks"

    * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, vdso, time: Cast tv_nsec to u64 for proper shifting in update_vsyscall()
    x86: Fix typo in MSR_IA32_MISC_ENABLE_LIMIT_CPUID macro
    x86: Fix typo preventing msr_set/clear_bit from having an effect
    x86/intel: Add quirk to disable HPET for the Baytrail platform
    x86/hpet: Make boot_hpet_disable extern
    x86-64, build: Fix stack protector Makefile breakage with 32-bit userland
    x86/reboot: Add reboot quirk for Certec BPC600
    asmlinkage: Add explicit __visible to drivers/*, lib/*, kernel/*
    asmlinkage, x86: Add explicit __visible to arch/x86/*
    asmlinkage: Revert "lto: Make asmlinkage __visible"
    x86, build: Don't get confused by local symbols
    x86/efi: earlyprintk=efi,keep fix

    Linus Torvalds
     

09 May, 2014

1 commit

  • …l/git/rostedt/linux-trace

    Pull tracing fixes from Steven Rostedt:
    "This contains two fixes.

    The first is a long standing bug that causes bogus data to show up in
    the refcnt field of the module_refcnt tracepoint. It was introduced
    by a merge conflict resolution back in 2.6.35-rc days.

    The result should be 'refcnt = incs - decs', but instead it did
    'refcnt = incs + decs'.

    The second fix is to a bug that was introduced in this merge window
    that allowed for a tracepoint funcs pointer to be used after it was
    freed. Moving the location of where the probes are released solved
    the problem"

    * tag 'trace-fixes-v3.15-rc4-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
    tracepoint: Fix use of tracepoint funcs after rcu free
    trace: module: Maintain a valid user count

    Linus Torvalds
     

08 May, 2014

1 commit

  • Commit de7b2973903c "tracepoint: Use struct pointer instead of name hash
    for reg/unreg tracepoints" introduces a use after free by calling
    release_probes on the old struct tracepoint array before the newly
    allocated array is published with rcu_assign_pointer. There is a race
    window where tracepoints (RCU readers) can perform a
    "use-after-grace-period-after-free", which shows up as a GPF in
    stress-tests.

    Link: http://lkml.kernel.org/r/53698021.5020108@oracle.com
    Link: http://lkml.kernel.org/p/1399549669-25465-1-git-send-email-mathieu.desnoyers@efficios.com

    Reported-by: Sasha Levin
    CC: Oleg Nesterov
    CC: Dave Jones
    Fixes: de7b2973903c "tracepoint: Use struct pointer instead of name hash for reg/unreg tracepoints"
    Signed-off-by: Mathieu Desnoyers
    Signed-off-by: Steven Rostedt

    Mathieu Desnoyers
     

07 May, 2014

9 commits

  • Also initialize the per-sd variables for newidle load balancing
    in sd_numa_init().

    Signed-off-by: Jason Low
    Acked-by: morten.rasmussen@arm.com
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: preeti@linux.vnet.ibm.com
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Cc: aswin@hp.com
    Cc: chegu_vinod@hp.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1398303035-18255-3-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • The following commit:

    e5fc66119ec9 ("sched: Fix race in idle_balance()")

    can potentially cause rq->max_idle_balance_cost to not be updated,
    even when load_balance(NEWLY_IDLE) is attempted and the per-sd
    max cost value is updated.

    Preeti noticed a similar issue with updating rq->next_balance.

    In this patch, we fix this by making sure we still check/update those values
    even if a task gets enqueued while browsing the domains.

    Signed-off-by: Jason Low
    Reviewed-by: Preeti U Murthy
    Signed-off-by: Peter Zijlstra
    Cc: morten.rasmussen@arm.com
    Cc: aswin@hp.com
    Cc: daniel.lezcano@linaro.org
    Cc: alex.shi@linaro.org
    Cc: efault@gmx.de
    Cc: vincent.guittot@linaro.org
    Link: http://lkml.kernel.org/r/1398725155-7591-2-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Tim wrote:

    "The current code will call pick_next_task_fair a second time in the
    slow path if we did not pull any task in our first try. This is
    really unnecessary as we already know no task can be pulled and it
    doubles the delay for the cpu to enter idle.

    We instrumented some network workloads and saw that
    pick_next_task_fair is frequently called twice before a cpu enters
    idle. The call to pick_next_task_fair can add non-trivial latency as
    it calls load_balance, which runs find_busiest_group on a hierarchy of
    sched domains spanning the cpus for a large system. For some 4-socket
    systems, we saw almost 0.25 msec spent per call of pick_next_task_fair
    before a cpu can be idled."

    Optimize the second call away for the common case and document the
    dependency.

    Reported-by: Tim Chen
    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Len Brown
    Link: http://lkml.kernel.org/r/20140424100047.GP11096@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The check at the beginning of cpupri_find() makes sure that the task_pri
    variable does not exceed the cp->pri_to_cpu array length. But that length
    is CPUPRI_NR_PRIORITIES, not MAX_RT_PRIO, so the check misses the last
    two priorities in that array.

    As task_pri is computed from convert_prio(), which should never return a
    value bigger than CPUPRI_NR_PRIORITIES, the check should cause a panic
    if it is hit.

    Reported-by: Mike Galbraith
    Signed-off-by: Steven Rostedt
    Signed-off-by: Peter Zijlstra
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1397015410.5212.13.camel@marge.simpson.net
    Signed-off-by: Ingo Molnar

    Steven Rostedt (Red Hat)
     
  • Free cpudl->free_cpus allocated in cpudl_init().

    Signed-off-by: Li Zefan
    Acked-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Cc: # 3.14+
    Link: http://lkml.kernel.org/r/534F36CE.2000409@huawei.com
    Signed-off-by: Ingo Molnar

    Li Zefan
     
  • yield_task_dl() is broken:

    o it forces current to be throttled setting its runtime to zero;
    o it sets current's dl_se->dl_new to one, expecting that dl_task_timer()
    will queue it back with proper parameters at replenish time.

    Unfortunately, dl_task_timer() has this check at the very beginning:

    if (!dl_task(p) || dl_se->dl_new)
            goto unlock;

    So, it just bails out and the task is never replenished. It actually
    yielded forever.

    To fix this, introduce a new flag indicating that the task properly yielded
    the CPU before its current runtime expired. While this is a little overkill
    at the moment, the flag would be useful in the future to discriminate between
    "good" jobs (of which remaining runtime could be reclaimed, i.e. recycled)
    and "bad" jobs (for which dl_throttled task has been set) that needed to be
    stopped.

    Reported-by: yjay.kim
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140429103953.e68eba1b2ac3309214e3dc5a@gmail.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     
  • Russell reported, that irqtime_account_idle_ticks() takes ages due to:

    for (i = 0; i < ticks; i++)
            irqtime_account_process_tick(current, 0, rq);

    It's sad that this code was written way _AFTER_ the NOHZ idle
    functionality was available. I charge myself guilty for not paying
    attention when that crap got merged with commit abb74cefa ("sched:
    Export ns irqtimes through /proc/stat").

    So instead of looping nr_ticks times just apply the whole thing at
    once.

    As a side note: The whole cputime_t vs. u64 business in that context
    wants to be cleaned up as well. There is no point in having all these
    back and forth conversions. Lets standardise on u64 nsec for all
    kernel internal accounting and be done with it. Everything else does
    not make sense at all for fine grained accounting. Frederic, can you
    please take care of that?

    Reported-by: Russell King
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Paul E. McKenney
    Signed-off-by: Peter Zijlstra
    Cc: Venkatesh Pallipadi
    Cc: Shaun Ruffell
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/alpine.DEB.2.02.1405022307000.6261@ionos.tec.linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • perf_pin_task_context() can return NULL but perf_event_init_context()
    assumes it will not, correct this.

    Reported-by: Vince Weaver
    Signed-off-by: Peter Zijlstra
    Cc: Arnaldo Carvalho de Melo
    Link: http://lkml.kernel.org/r/20140505171428.GU26782@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When removing a (sibling) event we do:

    raw_spin_lock_irq(&ctx->lock);
    perf_group_detach(event);
    raw_spin_unlock_irq(&ctx->lock);

    perf_remove_from_context(event);
    raw_spin_lock_irq(&ctx->lock);
    ...
    raw_spin_unlock_irq(&ctx->lock);

    Now, assuming the event is a sibling, it will be 'unreachable' for
    things like ctx_sched_out() because that iterates the
    groups->siblings, and we just unhooked the sibling.

    So, if ctx_sched_out() runs during that window, it will miss the
    event and not call event_sched_out() on it, leaving it programmed
    on the PMU.

    The subsequent perf_remove_from_context() call will find the ctx is
    inactive and only call list_del_event() to remove the event from all
    other lists.

    Hereafter we can proceed to free the event; while still programmed!

    Close this hole by moving perf_group_detach() inside the same
    ctx->lock region(s) perf_remove_from_context() has.

    The condition on inherited events only in __perf_event_exit_task() is
    likely complete crap because non-inherited events are part of groups
    too and we're tearing down just the same. But leave that for another
    patch.

    Most-likely-Fixes: e03a9a55b4e ("perf: Change close() semantics for group events")
    Reported-by: Vince Weaver
    Tested-by: Vince Weaver
    Much-staring-at-traces-by: Vince Weaver
    Much-staring-at-traces-by: Thomas Gleixner
    Cc: Arnaldo Carvalho de Melo
    Cc: Linus Torvalds
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140505093124.GN17778@laptop.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

06 May, 2014

1 commit