02 Dec, 2016

1 commit

  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0:  l->owner = T1

    CPU1:  rt_mutex_lock(l)
    CPU1:    lock(l->wait_lock)
    CPU1:    l->owner = T1 | HAS_WAITERS;
    CPU1:    enqueue(T2)
    CPU1:    boost()
    CPU1:    unlock(l->wait_lock)
    CPU1:    schedule()

    CPU2:  rt_mutex_lock(l)
    CPU2:    lock(l->wait_lock)
    CPU2:    l->owner = T1 | HAS_WAITERS;
    CPU2:    enqueue(T3)
    CPU2:    boost()
    CPU2:    unlock(l->wait_lock)
    CPU2:    schedule()

    CPU0:  signal(->T2)
    CPU1:  signal(->T3)

    CPU1:  lock(l->wait_lock)
    CPU1:  dequeue(T2)
    CPU1:  deboost()
    CPU1:  unlock(l->wait_lock)

    CPU2:  lock(l->wait_lock)
    CPU2:  dequeue(T3)
    CPU2:    ===> wait list is now empty
    CPU2:  deboost()
    CPU2:  unlock(l->wait_lock)

    CPU1:  lock(l->wait_lock)
    CPU1:  fixup_rt_mutex_waiters()
    CPU1:    if (wait_list_empty(l)) {
    CPU1:      owner = l->owner & ~HAS_WAITERS;
    CPU1:      l->owner = owner
    CPU1:       ==> l->owner = T1
    CPU1:    }

    CPU2:  lock(l->wait_lock)
    CPU0:  rt_mutex_unlock(l)
    CPU2:  fixup_rt_mutex_waiters()
    CPU2:    if (wait_list_empty(l)) {
    CPU2:      owner = l->owner & ~HAS_WAITERS;
    CPU0:        cmpxchg(l->owner, T1, NULL)
    CPU0:         ===> Success (l->owner = NULL)
    CPU2:      l->owner = owner
    CPU2:       ==> l->owner = T1
    CPU2:    }

    That means the problem is caused by fixup_rt_mutex_waiters() which does the
    RMW to clear the waiters bit unconditionally when there are no waiters in
    the rtmutexes rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation, then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after
    the successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.
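
    For illustration, here is a minimal userspace model of that rule, written
    with plain C11 atomics (HAS_WAITERS, struct model_rtmutex and the helper
    names are invented for the sketch; it is not the kernel's
    fixup_rt_mutex_waiters()). The point is that the owner field is never
    written at all unless the waiters bit is actually set:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define HAS_WAITERS 0x1UL

    struct model_rtmutex {
            _Atomic uintptr_t owner;        /* task pointer | HAS_WAITERS */
            int wait_list_empty;            /* stand-in for the rbtree check */
    };

    /* Called with wait_lock held in the real code. */
    static void fixup_waiters_bit(struct model_rtmutex *m)
    {
            uintptr_t owner = atomic_load(&m->owner);

            if (!m->wait_list_empty)
                    return;
            /* The crucial check: leave the field alone unless the bit is set. */
            if (owner & HAS_WAITERS)
                    atomic_store(&m->owner, owner & ~HAS_WAITERS);
    }

    int main(void)
    {
            struct model_rtmutex m = { .owner = 0xdead0, .wait_list_empty = 1 };

            fixup_waiters_bit(&m);          /* bit clear: owner is untouched */
            printf("owner=%#lx\n", (unsigned long)atomic_load(&m.owner));
            return 0;
    }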

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but refuses to reproduce on x86-64, even
    though the problem exists there as well. That refusal might explain why
    this was not discovered earlier, despite the bug existing from day one of
    the rtmutex implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information which allowed this subtle problem to be decoded.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

08 Jun, 2016

1 commit

    One warning should be enough to get someone motivated to fix this. It is
    possible that this happens more than once and starts flooding the output.
    Later the prints will be suppressed, so we only get half of the output,
    and depending on the console system used that might not be helpful.
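
    The kernel-side change is a switch to a warn-once style check; below is a
    small userspace model of that pattern (the names and the message are made
    up, only the "first caller wins" logic matters):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag warned = ATOMIC_FLAG_INIT;

    static void warn_once(const char *msg)
    {
            /* Only the first caller prints; later hits are dropped instead
             * of flooding the console and getting ratelimited away. */
            if (!atomic_flag_test_and_set(&warned))
                    fprintf(stderr, "WARNING: %s\n", msg);
    }

    int main(void)
    {
            for (int i = 0; i < 5; i++)
                    warn_once("rt_mutex_trylock() called from IRQ context");
            return 0;
    }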

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0:  rtmutex_lock(&rcu->rt_mutex)
    CPU0:    spin_lock(&rcu->rt_mutex.wait_lock)
    CPU1:  local_irq_disable()
    CPU1:  spin_lock(&timer->it_lock)
    CPU1:  spin_lock(&rcu->mutex.wait_lock)
    CPU0:  --> Interrupt
    CPU0:        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock()
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping the RCU read lock
    held across the spin-locked and irq-disabled section, but the above
    sequence is common and there is no reason not to support it.

    Making rt_mutex.wait_lock irq safe prevents the deadlock.
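
    The shape of the conversion looks roughly like the fragment below. This
    is illustrative only (slowpath_example() is an invented name and the
    fragment is not compilable outside a kernel tree); the point is that
    every wait_lock critical section becomes irq safe:

    static void slowpath_example(struct rt_mutex *lock)
    {
            unsigned long flags;

            /* was: raw_spin_lock(&lock->wait_lock); */
            raw_spin_lock_irqsave(&lock->wait_lock, flags);
            /* ... enqueue/dequeue waiters, boost/deboost ... */
            raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
    }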

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

04 Nov, 2015

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle were:

    - sched/fair load tracking fixes and cleanups (Byungchul Park)

    - Make load tracking frequency scale invariant (Dietmar Eggemann)

    - sched/deadline updates (Juri Lelli)

    - stop machine fixes, cleanups and enhancements for bugs triggered by
    CPU hotplug stress testing (Oleg Nesterov)

    - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
    cleanups (Peter Zijlstra)

    - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

    - Optimize per entity utilization tracking (Peter Zijlstra)

    - ... misc other fixes, cleanups and smaller updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
    sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
    sched: Start stopper early
    stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
    stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
    stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
    stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
    stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
    sched/x86: Fix typo in __switch_to() comments
    sched/core: Remove a parameter in the migrate_task_rq() function
    sched/core: Drop unlikely behind BUG_ON()
    sched/core: Fix task and run queue sched_info::run_delay inconsistencies
    sched/numa: Fix task_tick_fair() from disabling numa_balancing
    sched/core: Add preempt_count invariant check
    sched/core: More notrace annotations
    sched/core: Kill PREEMPT_ACTIVE
    sched/core, sched/x86: Kill thread_info::saved_preempt_count
    sched/core: Simplify preempt_count tests
    sched/core: Robustify preemption leak checks
    sched/core: Stop setting PREEMPT_ACTIVE
    ...

    Linus Torvalds
     

06 Oct, 2015

1 commit

    As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
    variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
    Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
    ordered archs can benefit from more relaxed use of barriers when locking
    and unlocking, instead of regular full barrier semantics. While currently
    only arm64 supports such optimizations, updating the corresponding locking
    primitives allows other archs to benefit immediately as well, once the
    necessary machinery is implemented, of course.
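
    As a rough userspace analogue of the idea (C11 atomics, invented names;
    the kernel work instead adds acquire/release/relaxed variants of its own
    atomic ops and uses them in the locking primitives): acquire ordering is
    enough on the lock side and release ordering on the unlock side, instead
    of full barriers in both places.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct model_lock {
            _Atomic(void *) owner;
    };

    static bool model_trylock(struct model_lock *l, void *me)
    {
            void *expected = NULL;

            /* Success only needs acquire semantics. */
            return atomic_compare_exchange_strong_explicit(&l->owner, &expected,
                            me, memory_order_acquire, memory_order_relaxed);
    }

    static void model_unlock(struct model_lock *l)
    {
            /* Publishing the critical section only needs release semantics. */
            atomic_store_explicit(&l->owner, NULL, memory_order_release);
    }

    int main(void)
    {
            struct model_lock l = { .owner = NULL };
            int me;

            if (model_trylock(&l, &me))
                    model_unlock(&l);
            return 0;
    }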

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E.McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1443643395-17016-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

23 Sep, 2015

1 commit

    The rt_mutex_waiter_less() check of task deadlines is open-coded. Since
    this is subject to wraparound bugs, make it use the correct helper.
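
    The helper boils down to the signed-difference idiom shown below (a small
    standalone sketch mirroring dl_time_before(); the naive a < b comparison
    gives the wrong answer once a u64 deadline wraps):

    #include <stdint.h>
    #include <stdio.h>

    static inline int dl_before(uint64_t a, uint64_t b)
    {
            return (int64_t)(a - b) < 0;
    }

    int main(void)
    {
            uint64_t near_wrap = UINT64_MAX - 5;    /* just before the wrap */
            uint64_t wrapped = 10;                  /* just after the wrap  */

            printf("naive:  %d\n", near_wrap < wrapped);            /* 0 */
            printf("helper: %d\n", dl_before(near_wrap, wrapped));  /* 1 */
            return 0;
    }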

    Reported-by: Luca Abeni
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1441188096-23021-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

20 Jul, 2015

1 commit

    No one uses this anymore, and this is not the first time the idea of
    replacing it with a (now possible) userspace equivalent has come up. The
    lock stealing logic was removed long ago, back when the lock started
    being granted to the highest priority waiter.

    Signed-off-by: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1435782588-4177-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Jun, 2015

1 commit

  • Pull locking updates from Thomas Gleixner:
    "These locking updates depend on the alreay merged sched/core branch:

    - Lockless top waiter wakeup for rtmutex (Davidlohr)

    - Reduce hash bucket lock contention for PI futexes (Sebastian)

    - Documentation update (Davidlohr)"

    * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Update stale plist comments
    futex: Lower the lock contention on the HB lock during wake up
    locking/rtmutex: Implement lockless top-waiter wakeup

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non-2038-safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes are:

    - 'qspinlock' support, enabled on x86: queued spinlocks - these are
    now the spinlock variant used by x86 as they outperform ticket
    spinlocks in every category. (Waiman Long)

    - 'pvqspinlock' support on x86: paravirtualized variant of queued
    spinlocks. (Waiman Long, Peter Zijlstra)

    - 'qrwlock' support, enabled on x86: queued rwlocks. Similar to
    queued spinlocks, they are now the variant used by x86:

    CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
    CONFIG_QUEUED_SPINLOCKS=y
    CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
    CONFIG_QUEUED_RWLOCKS=y

    - various lockdep fixlets

    - various locking primitives cleanups, further WRITE_ONCE()
    propagation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    locking/lockdep: Remove hard coded array size dependency
    locking/qrwlock: Don't contend with readers when setting _QW_WAITING
    lockdep: Do not break user-visible string
    locking/arch: Rename set_mb() to smp_store_mb()
    locking/arch: Add WRITE_ONCE() to set_mb()
    rtmutex: Warn if trylock is called from hard/softirq context
    arch: Remove __ARCH_HAVE_CMPXCHG
    locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
    locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
    locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
    locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
    locking/pvqspinlock, x86: Enable PV qspinlock for Xen
    locking/pvqspinlock, x86: Enable PV qspinlock for KVM
    locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
    locking/pvqspinlock: Implement simple paravirt support for the qspinlock
    locking/qspinlock: Revert to test-and-set on hypervisors
    locking/qspinlock: Use a simple write to grab the lock
    locking/qspinlock: Optimize for smaller NR_CPUS
    locking/qspinlock: Extract out code snippets for the next patch
    locking/qspinlock: Add pending bit
    ...

    Linus Torvalds
     

20 Jun, 2015

2 commits

  • ... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
    no longer use plists for queuing any waiters. Update stale comments.

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     
    wake_futex_pi() wakes the task before releasing the hash bucket lock
    (HB). The first thing the woken-up task usually does is acquire the lock,
    which requires the HB lock. On SMP systems this leads to blocking on the
    HB lock, which is released by the owner shortly after.
    This patch rearranges the unlock path by first releasing the HB lock and
    then waking up the task.

    [ tglx: Fixed up the rtmutex unlock path ]

    Originally-from: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

19 Jun, 2015

1 commit

    Mark the task for later wakeup after the wait_lock has been released.
    This way, once the next task is awoken, it will have a better chance of
    finding the wait_lock free when it continues executing in
    __rt_mutex_slowlock() when trying to acquire the rtmutex, calling
    try_to_take_rt_mutex(). In contended scenarios, other tasks attempting to
    take the lock may acquire it first, right after the wait_lock is
    released, but (a) this can also occur with the current code, as it relies
    on the spinlock fairness, and (b) we are dealing with the top waiter
    anyway, so it will always take the lock next.
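
    The mechanism behind this is the deferred-wakeup pattern (wake_q_add()
    while holding wait_lock, wake_up_q() after dropping it). A tiny userspace
    model of that pattern, with invented names, just to show the ordering:

    #include <stdio.h>

    struct wake_list {
            int nr;
            const char *task[8];
    };

    static void wake_mark(struct wake_list *wl, const char *task)
    {
            if (wl->nr < 8)
                    wl->task[wl->nr++] = task;      /* cheap, under the lock */
    }

    static void wake_all(struct wake_list *wl)
    {
            for (int i = 0; i < wl->nr; i++)        /* after unlocking */
                    printf("waking %s\n", wl->task[i]);
            wl->nr = 0;
    }

    int main(void)
    {
            struct wake_list wl = { 0 };

            /* lock(wait_lock); */
            wake_mark(&wl, "top waiter");
            /* unlock(wait_lock); */
            wake_all(&wl);
            return 0;
    }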

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

19 May, 2015

1 commit


14 May, 2015

1 commit

  • rt_mutex_trylock() must be called from thread context. It can be
    called from atomic regions (preemption or interrupts disabled), but
    not from hard/softirq/nmi context. Add a warning to alert abusers.

    The reasons for this are:

    1) There is a potential deadlock in the slowpath

    2) Another cpu which blocks on the rtmutex will boost the task
    which allegedly locked the rtmutex, but that cannot work
    because the hard/softirq context borrows the task context.
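
    The added check looks roughly like the fragment below (illustrative, not
    the verbatim patch; the fast-path helper and context checks are existing
    kernel interfaces):

    int __sched rt_mutex_trylock(struct rt_mutex *lock)
    {
            if (WARN_ON(in_irq() || in_nmi() || in_serving_softirq()))
                    return 0;

            return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock);
    }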

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior

    Thomas Gleixner
     

13 May, 2015

1 commit

    The rtmutex code is the only user of __HAVE_ARCH_CMPXCHG and we have a
    few other users of cmpxchg() which do not care about __HAVE_ARCH_CMPXCHG.
    This define was first introduced in 23f78d4a0 ("[PATCH] pi-futex: rt
    mutex core") which is v2.6.18. The generic cmpxchg was introduced later
    in 068fbad288 ("Add cmpxchg_local to asm-generic for per cpu atomic
    operations") which is v2.6.25.
    Back then something was required to get rtmutex working with the fast
    path on architectures without cmpxchg, and this seems to be the result.

    It popped up recently on rt-users because ARM (v6+) does not define
    __HAVE_ARCH_CMPXCHG (even though it implements it), which results in
    slower locking performance in the fast path.
    To put some numbers on it: preempt-RT, am335x, 10 loops of 100000
    invocations of rt_spin_lock() + rt_spin_unlock() ("total" is the average
    of the 10 loops for the 100000 invocations, "loop" is
    "total / 100000 * 1000"):

    cmpxchg |     slowpath used     ||      cmpxchg used
            |   total    | loop     ||   total     | loop
    --------|------------|----------||-------------|--------
    ARMv6   | 9129.4 us  | 91 ns    || 3311.9 us   | 33 ns
    generic | 9360.2 us  | 94 ns    || 10834.6 us  | 108 ns
    --------------------------------||----------------------

    Forcing it to the generic cmpxchg() made things worse for the slowpath
    and even worse for the cmpxchg() path. It boils down to 14 ns more per
    lock+unlock in a cache-hot loop, so it might not be that much in the real
    world. The last test was a substitute for a pre-ARMv6 machine, but then I
    was able to perform the comparison on imx28, which is ARMv5 and therefore
    always uses the generic cmpxchg implementation. And the numbers:

             |    total     | loop
    ---------|--------------|---------
    slowpath | 263937.2 us  | 2639 ns
    cmpxchg  | 16934.2 us   | 169 ns
    ---------------------------------

    The numbers are larger since the machine is slower in general. However,
    letting rtmutex use cmpxchg() instead of the slowpath seems to improve
    things.

    Since from the ARM (tested on am335x + imx28) point of view always using
    cmpxchg() in rt_mutex_lock() + rt_mutex_unlock() makes sense, I would
    drop the define.
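
    For context, this is roughly what the define gated in rtmutex.c (an
    illustrative reconstruction, not the verbatim file); the patch reduces
    the condition to the debug check alone, so every architecture gets the
    cmpxchg()-based fast path:

    #if defined(__HAVE_ARCH_CMPXCHG) && !defined(CONFIG_DEBUG_RT_MUTEXES)
    # define rt_mutex_cmpxchg(l, c, n)  (cmpxchg(&(l)->owner, c, n) == c)
    #else
    # define rt_mutex_cmpxchg(l, c, n)  (0)     /* always take the slowpath */
    #endif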

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: will.deacon@arm.com
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/20150225175613.GE6823@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

08 May, 2015

1 commit

  • Ronny reported that the following scenario is not handled correctly:

    T1 (prio = 10)
       lock(rtmutex);

    T2 (prio = 20)
       lock(rtmutex)
          boost T1

    T1 (prio = 20)
       sys_set_scheduler(prio = 30)
          T1 prio = 30
       ....
       sys_set_scheduler(prio = 10)
          T1 prio = 30

    The last step is wrong as T1 should now be back at prio 20.

    Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()")
    only handles the case where a boosted tasks tries to lower its
    priority.

    Fix it by taking the new effective priority into account when deciding
    whether a change of the priority is required.
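
    A small standalone model of that decision, using the same convention as
    the scenario above (bigger number = higher priority; the helper name is
    invented): the effective priority is the maximum of the newly requested
    priority and the top pi-waiter's priority, and only a change of that
    effective value requires touching the runqueue.

    #include <stdio.h>

    static int effective_prio(int requested, int top_waiter_prio)
    {
            return requested > top_waiter_prio ? requested : top_waiter_prio;
    }

    int main(void)
    {
            int top_waiter = 20;    /* T2 boosting T1 */

            /* setscheduler(30) -> 30, setscheduler(10) -> back to 20 */
            printf("after setscheduler(30): %d\n", effective_prio(30, top_waiter));
            printf("after setscheduler(10): %d\n", effective_prio(10, top_waiter));
            return 0;
    }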

    Reported-by: Ronny Meeus
    Tested-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt
    Cc:
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Mike Galbraith
    Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

22 Apr, 2015

1 commit

    The check for hrtimer_active() after starting the timer is pointless. If
    the timer is inactive it has expired already and therefore the task
    pointer is already NULL.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.081830481@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

14 Apr, 2015

1 commit

  • Pull core locking changes from Ingo Molnar:
    "Main changes:

    - jump label asm preparatory work for PowerPC (Anton Blanchard)

    - rwsem optimizations and cleanups (Davidlohr Bueso)

    - mutex optimizations and cleanups (Jason Low)

    - futex fix (Oleg Nesterov)

    - remove broken atomicity checks from {READ,WRITE}_ONCE() (Peter
    Zijlstra)"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    powerpc, jump_label: Include linux/jump_label.h to get HAVE_JUMP_LABEL define
    jump_label: Allow jump labels to be used in assembly
    jump_label: Allow asm/jump_label.h to be included in assembly
    locking/mutex: Further simplify mutex_spin_on_owner()
    locking: Remove atomicy checks from {READ,WRITE}_ONCE
    locking/rtmutex: Rename argument in the rt_mutex_adjust_prio_chain() documentation as well
    locking/rwsem: Fix lock optimistic spinning when owner is not running
    locking: Remove ACCESS_ONCE() usage
    locking/rwsem: Check for active lock before bailing on spinning
    locking/rwsem: Avoid deceiving lock spinners
    locking/rwsem: Set lock ownership ASAP
    locking/rwsem: Document barrier need when waking tasks
    locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads
    locking/mutex: Refactor mutex_spin_on_owner()
    locking/mutex: In mutex_spin_on_owner(), return true when owner changes

    Linus Torvalds
     

25 Mar, 2015

1 commit


01 Mar, 2015

1 commit

  • The "usual" path is:

    - rt_mutex_slowlock()
    - set_current_state()
    - task_blocks_on_rt_mutex() (ret 0)
    - __rt_mutex_slowlock()
    - sleep or not but do return with __set_current_state(TASK_RUNNING)
    - back to caller.

    In the early error case, where task_blocks_on_rt_mutex() returns
    -EDEADLK, we never change the task's state back to RUNNING. I assume
    this is intended. Without this change, after making ww_mutex use
    rt_mutex, the selftest passes but later I get plenty of:

    | bad: scheduling from the idle thread!

    backtraces.
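
    The fix boils down to restoring the task state on that error path; the
    error handling in rt_mutex_slowlock() ends up looking roughly like this
    (illustrative fragment, not the verbatim patch):

            ret = task_blocks_on_rt_mutex(lock, &waiter, current, chwalk);

            if (likely(!ret))
                    /* sleep on the mutex */
                    ret = __rt_mutex_slowlock(lock, state, timeout, &waiter);

            if (unlikely(ret)) {
                    __set_current_state(TASK_RUNNING);      /* the missing piece */
                    if (rt_mutex_has_waiters(lock))
                            remove_waiter(lock, &waiter);
            }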

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Mike Galbraith
    Cc: Linus Torvalds
    Cc: Maarten Lankhorst
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: afffc6c1805d ("locking/rtmutex: Optimize setting task running after being blocked")
    Link: http://lkml.kernel.org/r/1425056229-22326-4-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

18 Feb, 2015

1 commit

    With task_blocks_on_rt_mutex() returning -EDEADLK early, we never add
    the waiter to the waitqueue. Later, we try to remove it via
    remove_waiter() and go boom in rt_mutex_top_waiter() because rb_entry()
    gives a NULL pointer.

    ( Tested on v3.18-RT, where rtmutex is used for regular mutexes, and I
    tried to take one twice in a row. )

    Not sure when this started, but I guess 397335f004f4 ("rtmutex: Fix
    deadlock detector for real") or commit 3d5c9340d194 ("rtmutex:
    Handle deadlock detection smarter").

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: # for v3.16 and later kernels
    Link: http://lkml.kernel.org/r/1424187823-19600-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

04 Feb, 2015

1 commit

    We explicitly mark the task running after returning from a
    __rt_mutex_slowlock() call, which does the actual sleeping via
    wait-wake-trylocking. As such, this patch does two things:

    (1) refactors the code so that setting current to TASK_RUNNING is done
    by __rt_mutex_slowlock(), and not by the callers. The downside to this
    is that it becomes a bit unclear at what point we block. As such I've
    added a comment that the task blocks when calling __rt_mutex_slowlock(),
    so readers can figure out when it is running again.

    (2) relaxes setting current's state through __set_current_state(),
    instead of its more expensive barrier alternative. There was no need for
    the implied barrier as we're obviously not planning on blocking.
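
    The difference between the two state setters is just the memory barrier;
    a userspace model of the distinction (names invented, semantics
    simplified to C11 atomics):

    #include <stdatomic.h>

    #define TASK_RUNNING            0
    #define TASK_INTERRUPTIBLE      1

    static _Atomic int current_state;

    /* like set_current_state(): the store is ordered against later loads,
     * which matters when we are about to check a wake condition and sleep */
    static void set_state_mb(int state)
    {
            atomic_store_explicit(&current_state, state, memory_order_seq_cst);
    }

    /* like __set_current_state(): a plain store, fine when we already know
     * we are not going to sleep */
    static void set_state_plain(int state)
    {
            atomic_store_explicit(&current_state, state, memory_order_relaxed);
    }

    int main(void)
    {
            set_state_mb(TASK_INTERRUPTIBLE);   /* before checking the condition */
            set_state_plain(TASK_RUNNING);      /* no barrier needed here */
            return 0;
    }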

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422857784.18096.1.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

13 Aug, 2014

1 commit

  • Specifically:
    Documentation/locking/lockdep-design.txt
    Documentation/locking/lockstat.txt
    Documentation/locking/mutex-design.txt
    Documentation/locking/rt-mutex-design.txt
    Documentation/locking/rt-mutex.txt
    Documentation/locking/spinlocks.txt
    Documentation/locking/ww-mutex-design.txt

    Signed-off-by: Davidlohr Bueso
    Acked-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra
    Cc: jason.low2@hp.com
    Cc: aswin@hp.com
    Cc: Alexei Starovoitov
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Chris Mason
    Cc: Dan Streetman
    Cc: David Airlie
    Cc: Davidlohr Bueso
    Cc: David S. Miller
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Jason Low
    Cc: Josef Bacik
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Lubomir Rintel
    Cc: Masanari Iida
    Cc: Paul E. McKenney
    Cc: Randy Dunlap
    Cc: Tim Chen
    Cc: Vineet Gupta
    Cc: fengguang.wu@intel.com
    Link: http://lkml.kernel.org/r/1406752916-3341-6-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

22 Jun, 2014

9 commits

    In case the deadlock detector is enabled we follow the lock chain to the
    end in rt_mutex_adjust_prio_chain(), even if we could stop earlier due to
    the priority/waiter constellation.

    But once we are no longer the top priority waiter in a certain step, or
    the task holding the lock already has the same priority, then there is no
    point in dequeueing and enqueueing along the lock chain as there is no
    change at all.

    So stop the queueing at this point.
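
    The added check looks roughly like the fragment below (illustrative, not
    the verbatim patch): once the requeued waiter's priority equals the
    priority of the task we walked to, requeueing further up the chain cannot
    change anything, so only a deadlock-detection walk keeps going.

            if (waiter->prio == task->prio) {
                    if (!detect_deadlock)
                            goto out_unlock_pi;
                    else
                            requeue = false;
            }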

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031950.280830190@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The conditions under which deadlock detection is conducted are unclear
    and undocumented.

    Add constants instead of using 0/1 and provide a selection function
    which hides the additional debug dependency from the calling code.

    Add comments where needed.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031949.947264874@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The deadlock logic is only required for futexes.

    Remove the extra arguments for the public functions and also for the
    futex specific ones, which always get called with deadlock detection
    enabled.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
    Exit right away when the removed waiter was not the top priority waiter
    on the lock. Get rid of the extra indent level.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Reviewed-by: Lai Jiangshan

    Thomas Gleixner
     
    Add commentary to document the chain walk and the protection mechanisms
    and their scope.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
  • Add a separate local variable for the boost/deboost logic to make the
    code more readable. Add comments where appropriate.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
    There is no point in keeping the task ref across the check for the lock
    owner. Drop the ref before that, so the protection context is clear.

    Found while documenting the chain walk.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Reviewed-by: Lai Jiangshan

    Thomas Gleixner
     
    The current implementation of try_to_take_rtmutex() is correct, but
    requires more than a single brain twist to understand the cleverly
    encoded conditionals.

    Untangle it and document the cases properly.

    Looks less efficient at first glance, but actually reduces the binary
    code size on x86_64 by 80 bytes.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
    Oleg noticed that rtmutex_slowtrylock() has a pointless check for
    rt_mutex_owner(lock) != current.

    To avoid calling try_to_take_rtmutex() we really want to check whether
    the lock has an owner at all or whether the trylock failed because the
    owner is NULL, but the RT_MUTEX_HAS_WAITERS bit is set. This covers the
    "lock is owned by caller" situation as well.

    We can actually do this check locklessly; a trylock is taking a chance
    anyway, whether or not we take lock->wait_lock to do the check.

    Add comments to the function while at it.
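
    The resulting trylock slowpath looks roughly like this (illustrative
    reconstruction, not the verbatim patch):

    static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
    {
            int ret;

            /*
             * If the lock already has an owner we fail to get the lock.
             * This can be done without taking the wait_lock, as it is
             * only being read and this is a trylock anyway.
             */
            if (rt_mutex_owner(lock))
                    return 0;

            /* The mutex has no owner: lock the wait_lock and try to take it. */
            raw_spin_lock(&lock->wait_lock);

            ret = try_to_take_rt_mutex(lock, current, NULL);

            /* try_to_take_rt_mutex() sets the waiters bit unconditionally;
             * clean that up. */
            fixup_rt_mutex_waiters(lock);

            raw_spin_unlock(&lock->wait_lock);

            return ret;
    }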

    Reported-by: Oleg Nesterov
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Reviewed-by: Lai Jiangshan

    Thomas Gleixner
     

16 Jun, 2014

1 commit

  • When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

    spin_lock(foo->m->wait_lock);
    foo->m->owner = NULL;
                                        rt_mutex_lock(foo->m); <-- fast path
                                        free = atomic_dec_and_test(foo->refcnt);
                                        rt_mutex_unlock(foo->m); <-- fast path
                                        if (free)
                                           kfree(foo);
    spin_unlock(foo->m->wait_lock); <--- Use after free.

    Plug the race by changing the slow unlock to the following scheme:

    while (!rt_mutex_has_waiters(m)) {
            /* Clear the waiters bit in m->owner */
            clear_rt_mutex_waiters(m);
            owner = rt_mutex_owner(m);
            spin_unlock(m->wait_lock);
            if (cmpxchg(m->owner, owner, 0) == owner)
                    return;
            spin_lock(m->wait_lock);
    }

    So in case of a new waiter incoming while the owner tries the slow path
    unlock we have two situations:

    unlock(wait_lock);
                                        lock(wait_lock);
    cmpxchg(p, owner, 0) == owner
                                        mark_rt_mutex_waiters(lock);
                                        acquire(lock);

    Or:

    unlock(wait_lock);
                                        lock(wait_lock);
                                        mark_rt_mutex_waiters(lock);
    cmpxchg(p, owner, 0) != owner
                                        enqueue_waiter();
                                        unlock(wait_lock);
    lock(wait_lock);
    wakeup_next_waiter();
    unlock(wait_lock);
                                        lock(wait_lock);
                                        acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient, as all access to m->owner is serialized via
    m->wait_lock.

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Jun, 2014

2 commits

  • When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the dead lock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We actually can detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.
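
    A small standalone model of the check (struct and field names are
    simplified inventions; in the kernel the recorded value is
    task->pi_blocked_on->lock):

    #include <stdio.h>

    struct lock { int id; };

    struct task {
            struct lock *blocked_on;
    };

    /* Called after the locks have been dropped and reacquired. */
    static int chain_still_valid(struct task *t, struct lock *next_lock)
    {
            /* If the task now blocks on a different lock, the chain changed
             * under us: stop the walk, the new chain will be handled there. */
            return t->blocked_on == next_lock;
    }

    int main(void)
    {
            struct lock L3 = { 3 }, L0 = { 0 };
            struct lock *next_lock = &L3;           /* recorded before dropping */
            struct task T2 = { .blocked_on = &L0 }; /* T2 timed out, now on L0 */

            printf("continue walk: %d\n", chain_still_valid(&T2, next_lock));
            return 0;
    }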

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
  • Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

28 May, 2014

1 commit

  • The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
        top_waiter != task_top_pi_waiter(task)))
            goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk and
    therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). It's an optimization, but it also prevents another waiter, who
    comes in after the detection and before the task has undone the damage,
    from observing the situation, detecting the deadlock and returning
    -EDEADLOCK, which is wrong as the other task is not in a deadlock
    situation.
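
    The reworked early exit looks roughly like the fragment below
    (illustrative, not the verbatim patch): the walk only drops out early if
    full deadlock detection was not requested.

            if (top_waiter) {
                    if (!task_has_pi_waiters(task))
                            goto out_unlock_pi;
                    /*
                     * If deadlock detection is off, we stop here if we are
                     * not the top pi waiter of the task.
                     */
                    if (!detect_deadlock &&
                        top_waiter != task_top_pi_waiter(task))
                            goto out_unlock_pi;
            }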

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

23 Feb, 2014

1 commit

    If a PI-boosted task's policy/priority is modified by a setscheduler()
    call, we unconditionally dequeue and requeue the task if it is on the
    runqueue, even if the new priority is lower than the current effective
    boosted priority. This can result in undesired reordering of the
    priority bucket list.

    If the new priority is less than or equal to the current effective
    priority, we just store the new parameters in the task struct and leave
    the scheduler class and the runqueue untouched. This is handled when the
    task deboosts itself. Only if the new priority is higher than the
    effective boosted priority do we apply the change immediately.

    Signed-off-by: Thomas Gleixner
    [ Rebase ontop of v3.14-rc1. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 Jan, 2014

2 commits

    Some method to deal with rt-mutexes and make sched_dl interact with the
    current PI code is needed, raising far from trivial issues that need
    (according to us) to be solved with some restructuring of the pi-code
    (i.e., going toward a proxy execution-ish implementation).

    This is under development. In the meanwhile, as a temporary solution,
    what this commit does is:

    - ensure a pi-lock owner with waiters is never throttled down. Instead,
    when it runs out of runtime, it immediately gets replenished and its
    deadline is postponed;

    - the scheduling parameters (relative deadline and default runtime) used
    for those replenishments --during the whole period it holds the
    pi-lock-- are the ones of the waiting task with the earliest deadline.

    Acting this way, we provide some kind of boosting to the lock owner,
    still by using the existing (actually, slightly modified by the previous
    commit) pi-architecture.

    We would stress the fact that this is only a surely needed, far from
    clean solution to the problem. In the end it's only a way to re-start
    the discussion within the community. So, as always, comments, ideas,
    rants, etc. are welcome! :-)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Added !RT_MUTEXES build fix. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - the classical prio field of the plist is just an int, which might not
    be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks), which
    might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.
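
    A standalone sketch of the ordering described above (field and helper
    names are invented; kernel prio is "lower number = higher priority"):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct model_waiter {
            bool     dl;            /* is this a -deadline task?    */
            uint64_t deadline;      /* absolute deadline if dl      */
            int      prio;          /* kernel prio, lower == higher */
    };

    static bool waiter_less(const struct model_waiter *a,
                            const struct model_waiter *b)
    {
            if (a->dl && b->dl)     /* both -deadline: earliest deadline wins */
                    return (int64_t)(a->deadline - b->deadline) < 0;
            if (a->dl != b->dl)     /* -deadline always beats -priority */
                    return a->dl;
            return a->prio < b->prio;
    }

    int main(void)
    {
            struct model_waiter fifo = { .dl = false, .prio = 10 };
            struct model_waiter dl = { .dl = true, .deadline = 1000 };

            printf("dl queued before fifo: %d\n", waiter_less(&dl, &fifo));
            return 0;
    }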

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra