05 Sep, 2018
1 commit
-
[ Upstream commit 62cedf3e60af03e47849fe2bd6a03ec179422a8a ]
Needed for annotating rt_mutex locks.
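A minimal sketch of the kind of annotation this enables, assuming the
rt_mutex_lock_nested() form this commit introduces (the locks and the
nesting level used here are illustrative):

	struct rt_mutex parent, child;

	rt_mutex_lock(&parent);
	/*
	 * Tell lockdep the nested acquisition is a distinct subclass,
	 * so it is not reported as a recursive deadlock.
	 */
	rt_mutex_lock_nested(&child, SINGLE_DEPTH_NESTING);
	/* critical section */
	rt_mutex_unlock(&child);
	rt_mutex_unlock(&parent);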
Tested-by: John Sperbeck
Signed-off-by: Peter Rosin
Signed-off-by: Peter Zijlstra (Intel)
Cc: Davidlohr Bueso
Cc: Deepa Dinamani
Cc: Greg Kroah-Hartman
Cc: Linus Torvalds
Cc: Peter Chang
Cc: Peter Zijlstra
Cc: Philippe Ombredanne
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: Wolfram Sang
Link: http://lkml.kernel.org/r/20180720083914.1950-2-peda@axentia.se
Signed-off-by: Ingo Molnar
Signed-off-by: Sasha Levin
Signed-off-by: Greg Kroah-Hartman
24 Jan, 2018
1 commit
-
commit c1e2f0eaf015fb7076d51a339011f2383e6dd389 upstream.
Julia reported futex state corruption in the following scenario:
waiter                              waker                               stealer (prio > waiter)

futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
      timeout=[N ms])
  futex_wait_requeue_pi()
    futex_wait_queue_me()
      freezable_schedule()
      <scheduled out>
                                    futex(LOCK_PI, uaddr2)
                                    futex(CMP_REQUEUE_PI, uaddr,
                                          uaddr2, 1, 0)
                                      /* requeues waiter to uaddr2 */
                                    futex(UNLOCK_PI, uaddr2)
                                      wake_futex_pi()
                                        cmp_futex_value_locked(uaddr2, waiter)
                                        wake_up_q()
      <woken by waker>
      <hrtimer_wakeup() fires,
       clears sleeper->task>
                                                                        futex(LOCK_PI, uaddr2)
                                                                          __rt_mutex_start_proxy_lock()
                                                                            try_to_take_rt_mutex() /* steals lock */
                                                                              rt_mutex_set_owner(lock, stealer)
                                                                          <preempted>
    <scheduled in>
    rt_mutex_wait_proxy_lock()
      __rt_mutex_slowlock()
        try_to_take_rt_mutex() /* fails, lock held by stealer */
        if (timeout && !timeout->task)
          return -ETIMEDOUT;
      fixup_owner()
        /* lock wasn't acquired, so,
           fixup_pi_state_owner skipped */

return -ETIMEDOUT;

/* At this point, we've returned -ETIMEDOUT to userspace, but the
 * futex word shows waiter to be the owner, and the pi_mutex has
 * stealer as the owner */

futex_lock(LOCK_PI, uaddr2)
  -> bails with EDEADLK, futex word says we're owner.

And suggested that what commit:
73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")
removes from fixup_owner() looks to be just what is needed. And indeed
it is -- I completely missed that requeue_pi could also result in this
case. So we need to restore that, except that subsequent patches, like
commit:

  16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")
changed all the locking rules. Even without that, the sequence:
-	if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
-		locked = 1;
-		goto out;
-	}
-
-	raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
-	owner = rt_mutex_owner(&q->pi_state->pi_mutex);
-	if (!owner)
-		owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
-	raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
-	ret = fixup_pi_state_owner(uaddr, q, owner);

already suggests there were races; otherwise we'd never have to look
at next_owner.

So instead of doing 3 consecutive wait_lock sections with who knows
what races, we do it all in a single section. Additionally, the usage
of pi_state->owner in fixup_owner() was only safe because only the
rt_mutex owner would modify it, which this additional case wrecks.

Luckily the values can only change away and not to the value we're
testing, this means we can do a speculative test and double check once
we have the wait_lock (see the sketch at the end of this entry).

Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")
Reported-by: Julia Cartwright
Reported-by: Gratian Crisan
Signed-off-by: Peter Zijlstra (Intel)
Signed-off-by: Thomas Gleixner
Tested-by: Julia Cartwright
Tested-by: Gratian Crisan
Cc: Darren Hart
Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
Signed-off-by: Greg Kroah-Hartman
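A minimal sketch of the speculative test / double check pattern the
changelog describes, under the assumption that the tested value can only
change away from us, never to us; do_fixup() is a hypothetical stand-in
and this is not the commit's actual hunk (the real fixup_pi_state_owner()
locking contract differs across kernel versions):

	static int fixup_owner_sketch(struct futex_pi_state *pi_state)
	{
		int ret = 0;

		/*
		 * pi_state->owner can only change away from us, so a
		 * speculative, unlocked mismatch is final...
		 */
		if (READ_ONCE(pi_state->owner) != current)
			return 0;	/* nothing to fix up */

		/* ...but a match must be re-checked under wait_lock. */
		raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
		if (pi_state->owner == current)
			ret = do_fixup(pi_state);	/* hypothetical */
		raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);

		return ret;
	}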
09 Sep, 2017
1 commit
-
... with the generic rbtree flavor instead. No changes
in semantics whatsoever (see the sketch at the end of this entry).

Link: http://lkml.kernel.org/r/20170719014603.19029-10-dave@stgolabs.net
Signed-off-by: Davidlohr Bueso
Acked-by: Peter Zijlstra (Intel)
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
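A short sketch of the cached-rbtree API this entry switches to
(rb_root_cached and friends from <linux/rbtree.h>; the waiter field
names are illustrative):

	struct rb_root_cached waiters = RB_ROOT_CACHED;

	/* insert, telling the tree whether this is the new leftmost */
	rb_insert_color_cached(&waiter->tree_entry, &waiters, leftmost);

	/* O(1) access to the leftmost (top-priority) waiter */
	struct rb_node *first = rb_first_cached(&waiters);

	/* removal keeps the cached leftmost up to date */
	rb_erase_cached(&waiter->tree_entry, &waiters);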
13 Jul, 2017
1 commit
-
We don't need to adjust priority before adding a new pi_waiter; the
priority only needs to be updated after pi_waiter change or task
priority change.

Steven Rostedt pointed out:

  "Interesting, I did some git mining and this was added with the original
   entry of the rtmutex.c (23f78d4a03c5). Looking at even that version, I
   don't see the purpose of adjusting the task prio here. It is done
   before anything changes in the task."

Signed-off-by: Alex Shi
Reviewed-by: Steven Rostedt (VMware)
Acked-by: Peter Zijlstra (Intel)
Cc: Juri Lelli
Cc: Linus Torvalds
Cc: Mathieu Poirier
Cc: Sebastian Siewior
Cc: Steven Rostedt
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1499926704-28841-1-git-send-email-alex.shi@linaro.org
[ Enhance the changelog. ]
Signed-off-by: Ingo Molnar
20 Jun, 2017
1 commit
-
pi_mutex isn't supposed to be tracked by lockdep, but just
passing NULLs for name and key will cause lockdep to spew a
warning and die, which is not what we want it to do.

Skip lockdep initialization if the caller passed NULLs for
name and key, suggesting such initialization isn't desired
(see the sketch at the end of this entry).

Signed-off-by: Sasha Levin
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Fixes: f5694788ad8d ("rt_mutex: Add lockdep annotations")
Link: http://lkml.kernel.org/r/20170618140548.4763-1-alexander.levin@verizon.com
Signed-off-by: Ingo Molnar
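A minimal sketch of the guard described above, assuming the
__rt_mutex_init() signature after the lockdep-annotation commit below
(illustrative, not the exact hunk):

	void __rt_mutex_init(struct rt_mutex *lock, const char *name,
			     struct lock_class_key *key)
	{
		lock->owner = NULL;
		raw_spin_lock_init(&lock->wait_lock);

		/* NULL name and key mean: no lockdep tracking, please */
		if (name && key)
			debug_rt_mutex_init(lock, name, key);
	}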
08 Jun, 2017
1 commit
-
Now that (PI) futexes have their own private RT-mutex interface and
implementation we can easily add lockdep annotations to the existing
RT-mutex interface (see the sketch at the end of this entry).

Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
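A sketch of the usual lockdep-annotation shape for a lock-init API (an
assumed form; the exact macro in <linux/rtmutex.h> may differ):

	/* Give each static init site its own lock class. */
	#define rt_mutex_init(mutex)					\
	do {								\
		static struct lock_class_key __key;			\
									\
		__rt_mutex_init(mutex, #mutex, &__key);			\
	} while (0)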
23 May, 2017
1 commit
-
Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
commit:

  cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")
The following trace shows the problem:
ld-linux-x86-64-2161 [019] .... 410.760971: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_LOCK_PI
ld-linux-x86-64-2161 [019] ...1 410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
ld-linux-x86-64-2165 [011] .... 410.760978: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_UNLOCK_PI
ld-linux-x86-64-2165 [011] d..1 410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
ld-linux-x86-64-2165 [011] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
ld-linux-x86-64-2161 [019] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT

Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
which then returns with -ETIMEDOUT. That wrecks the lock state, because now
the owner isn't aware it acquired the lock and removes the pending robust
list entry.

If 2161 is killed, the robust list will not clear out this futex and the
subsequent acquire on this futex will then (correctly) result in -ESRCH
which is unexpected by glibc, triggers an internal assertion and dies.

Task 2161                               Task 2165

rt_mutex_wait_proxy_lock()
   timeout();
   /* T2161 is still queued in the waiter list */
   return -ETIMEDOUT;

                                        futex_unlock_pi()
                                        spin_lock(hb->lock);
                                        rtmutex_unlock()
                                          remove_rtmutex_waiter(T2161);
                                          mark_lock_available();
                                        /* Make the next waiter owner of
                                           the user space side */
                                        futex_uval = 2161;
                                        spin_unlock(hb->lock);

spin_lock(hb->lock);
rt_mutex_cleanup_proxy_lock()
   if (rtmutex_owner() != current)
      ...
      return FAIL;
....
return -ETIMEDOUT;

This means that rt_mutex_cleanup_proxy_lock() needs to call
try_to_take_rt_mutex() so it can take over the rtmutex correctly which was
assigned by the waker. If the rtmutex is owned by some other task then this
call is harmless and just confirms that the waiter is not able to acquire
it.

While there, fix what looks like a merge error which resulted in
rt_mutex_cleanup_proxy_lock() having two calls to
fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
Both should have one, since both potentially touch the waiter list
(see the sketch at the end of this entry).

Fixes: 38d589f2fd08 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
Reported-by: Markus Trippelsdorf
Bug-Spotted-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Cc: Florian Weimer
Cc: Darren Hart
Cc: Sebastian Andrzej Siewior
Cc: Markus Trippelsdorf
Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net
Signed-off-by: Thomas Gleixner
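A sketch of the shape of the fix described above, with names from
kernel/locking/rtmutex.c (an illustration under those assumptions, not
the verbatim patch):

	bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
					 struct rt_mutex_waiter *waiter)
	{
		bool cleanup = false;

		raw_spin_lock_irq(&lock->wait_lock);
		/*
		 * If the waker already assigned the lock to us, take
		 * ownership instead of bailing out.
		 */
		try_to_take_rt_mutex(lock, current, waiter);
		/*
		 * Only unqueue if we are still a waiter, i.e. we did
		 * not acquire the lock after all.
		 */
		if (rt_mutex_owner(lock) != current) {
			remove_waiter(lock, waiter);
			cleanup = true;
		}
		/* Both paths touch the waiter list: fix up the bits. */
		fixup_rt_mutex_waiters(lock);
		raw_spin_unlock_irq(&lock->wait_lock);

		return cleanup;
	}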
05 Apr, 2017
1 commit
-
mark_wakeup_next_waiter() already disables preemption; doing so again
leaves us with an unpaired preempt_disable().

Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
Signed-off-by: Mike Galbraith
Cc: Peter Zijlstra
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
Signed-off-by: Thomas Gleixner
04 Apr, 2017
7 commits
-
There was a pure ->prio comparison left in try_to_take_rt_mutex(),
convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
is not-less (both in kernel priority view).

This necessitated the introduction of cmp_task() which creates a
pointer to an unnamed stack variable of struct rt_mutex_waiter type to
compare against tasks.

With this, we can now also create and employ rt_mutex_waiter_equal()
(see the sketch below).
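A sketch of the helpers described above, assuming the rt_mutex_waiter
fields from kernel/locking/rtmutex.c (illustrative, not the verbatim
patch):

	/* Compare a task against waiters via an unnamed on-stack waiter. */
	#define cmp_task(p)						\
		&(struct rt_mutex_waiter){				\
			.prio = (p)->prio,				\
			.deadline = (p)->dl.deadline,			\
		}

	static inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
						struct rt_mutex_waiter *right)
	{
		if (left->prio != right->prio)
			return 0;
		/* prio ties between deadline tasks are broken by deadline */
		if (dl_prio(left->prio))
			return left->deadline == right->deadline;
		return 1;
	}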
Reviewed-and-tested-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
Signed-off-by: Thomas Gleixner -
rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
during the PI chain walk, such that the PI chain order isn't messed up
by (asynchronous) task state updates.

Currently rt_mutex_waiter_less() uses task state for deadline tasks;
this is broken, since the task state can, as said above, change
asynchronously, causing the RB tree order to change without actual
tree update -> FAIL.

Fix this by also copying the deadline into the rt_mutex_waiter state
and updating it along with its prio field.

Ideally we would also force PI chain updates whenever DL tasks update
their deadline parameter, but for first approximation this is less
broken than it was.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
Signed-off-by: Thomas Gleixner -
With the introduction of SCHED_DEADLINE the whole notion that priority
is a single number is gone, therefore the @prio argument to
rt_mutex_setprio() doesn't make sense anymore.

So rework the code to pass a pi_task instead.

Note this also fixes a problem with pi_top_task caching; previously we
would not set the pointer (call rt_mutex_update_top_task) if the
priority didn't change, which could lead to a stale pointer.

As for the XXX, I think it's fine to use pi_task->prio, because if it
differs from waiter->prio, a PI chain update is imminent.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
Signed-off-by: Thomas Gleixner -
Previous patches changed the meaning of the return value of
rt_mutex_slowunlock(); update comments and code to reflect this.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
Signed-off-by: Thomas Gleixner -
Currently dl tasks will actually return at the very beginning
of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

	if (waiter->prio == task->prio) {
		if (!detect_deadlock)
			goto out_unlock_pi; // out here
		else
			requeue = false;
	}

As the deadline value of blocked deadline tasks (waiters) without
changing their sched_class (thus prio doesn't change) never changes,
this seems reasonable, but it actually misses the chance of updating
rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
deadline parameters (dl_runtime, dl_period) or a boosted waiter changes
to !deadline class.

Thus, force deadline tasks not to take the early-out by adding the
!dl_prio() condition (see the sketch below).
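A sketch of the adjusted check, as an assumed form based on the
description above (the exact upstream hunk may differ):

	if (waiter->prio == task->prio && !dl_task(task)) {
		if (!detect_deadlock)
			goto out_unlock_pi;
		else
			requeue = false;
	}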
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Reviewed-by: Thomas Gleixner
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
Signed-off-by: Thomas Gleixner -
A crash happened while I was playing with deadline PI rtmutex.
BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
IP: [] rt_mutex_get_top_task+0x1f/0x30
PGD 232a75067 PUD 230947067 PMD 0
Oops: 0000 [#1] SMP
CPU: 1 PID: 10994 Comm: a.out Not tainted

Call Trace:
[] enqueue_task+0x2c/0x80
[] activate_task+0x23/0x30
[] pull_dl_task+0x1d5/0x260
[] pre_schedule_dl+0x16/0x20
[] __schedule+0xd3/0x900
[] schedule+0x29/0x70
[] __rt_mutex_slowlock+0x4b/0xc0
[] rt_mutex_slowlock+0xd1/0x190
[] rt_mutex_timed_lock+0x53/0x60
[] futex_lock_pi.isra.18+0x28c/0x390
[] do_futex+0x190/0x5b0
[] SyS_futex+0x80/0x180

This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
are only protected by pi_lock when operating pi waiters, while
rt_mutex_get_top_task(), will access them with rq lock held but
not holding pi_lock.

In order to tackle it, we introduce a new "pi_top_task" pointer
cached in task_struct, and add new rt_mutex_update_top_task()
to update its value, it can be called by rt_mutex_setprio()
which held both owner's pi_lock and rq lock. Thus "pi_top_task"
can be safely accessed by enqueue_task_dl() under rq lock.

Originally-From: Peter Zijlstra
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Reviewed-by: Thomas Gleixner
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
Signed-off-by: Thomas Gleixner -
We should deboost before waking the high-priority task, such that we
don't run two tasks with the same "state" (priority, deadline,
sched_class, etc).

In order to make sure the boosting task doesn't start running between
unlock and deboost (due to 'spurious' wakeup), we move the deboost
under the wait_lock; that way it's serialized against the wait loop in
__rt_mutex_slowlock().

Doing the deboost early can however lead to priority-inversion if
current would get preempted after the deboost but before waking our
high-prio task, hence we disable preemption before doing deboost, and
enable it after the wake up is over (see the sketch below).

This gets us the right semantic order, but most importantly this
change ensures pointer stability for the next patch, where we
have rt_mutex_setprio() cache a pointer to the top-most waiter task.
If we, as before this change, do the wakeup first and then deboost,
this pointer might point into thin air.

[peterz: Changelog + patch munging]
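A sketch of the resulting unlock ordering (names from
kernel/locking/rtmutex.c; illustrative, not the verbatim patch):

	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/*
	 * Dequeue the top waiter and deboost while still serialized
	 * against the wait loop via wait_lock.
	 */
	mark_wakeup_next_waiter(&wake_q, lock);
	/* Keep current from running until the wakeup is done. */
	preempt_disable();
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

	wake_up_q(&wake_q);
	preempt_enable();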
Suggested-by: Peter Zijlstra
Signed-off-by: Xunlei Pang
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Steven Rostedt
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
Signed-off-by: Thomas Gleixner
24 Mar, 2017
6 commits
-
When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
chain code will (falsely) report a deadlock and BUG.

The problem is that it holds hb->lock (now an rt_mutex) while doing
task_blocks_on_rt_mutex on the futex's pi_state::rtmutex. This, when
interleaved just right with futex_unlock_pi() leads it to believe to see an
AB-BA deadlock.

  Task1 (holds rt_mutex,             Task2 (does FUTEX_LOCK_PI)
         does FUTEX_UNLOCK_PI)

                                     lock hb->lock
                                     lock rt_mutex (as per start_proxy)
  lock hb->lock

Which is a trivial AB-BA.

It is not an actual deadlock, because it won't be holding hb->lock by the
time it actually blocks on the rt_mutex, but the chainwalk code doesn't
know that and it would be a nightmare to handle this gracefully.

To avoid this problem, do the same as in futex_unlock_pi() and drop
hb->lock after acquiring wait_lock. This still fully serializes against
futex_unlock_pi(), since adding to the wait_list does the very same lock
dance, and removing it holds both locks.

Aside from solving the RT problem this makes the lock and unlock mechanism
symmetric and reduces the hb->lock held time.

Reported-and-tested-by: Sebastian Andrzej Siewior
Suggested-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
Signed-off-by: Thomas Gleixner -
By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
modifications are done under both hb->lock and wait_lock.

This closes the obvious interleave pattern between futex_lock_pi() and
futex_unlock_pi(), but not entirely so. See below:

Before:

futex_lock_pi()                    futex_unlock_pi()
  unlock hb->lock

                                     lock hb->lock
                                     unlock hb->lock

                                     lock rt_mutex->wait_lock
                                     unlock rt_mutex->wait_lock
                                       -EAGAIN

  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock

  schedule()

  lock rt_mutex->wait_lock
  list_del
  unlock rt_mutex->wait_lock
                                       -EAGAIN

  lock hb->lock

After:

futex_lock_pi()                    futex_unlock_pi()

  lock hb->lock
  lock rt_mutex->wait_lock
  list_add
  unlock rt_mutex->wait_lock
  unlock hb->lock

  schedule()
  lock hb->lock
  unlock hb->lock
                                     lock hb->lock
                                     lock rt_mutex->wait_lock
                                     list_del
                                     unlock rt_mutex->wait_lock

                                     lock rt_mutex->wait_lock
                                     unlock rt_mutex->wait_lock
                                       -EAGAIN

                                     unlock hb->lock

It does however solve the earlier starvation/live-lock scenario which got
introduced with the -EAGAIN, since unlike the before scenario, where the
-EAGAIN happens while futex_unlock_pi() doesn't hold any locks, in the
after scenario it happens while futex_unlock_pi() actually holds a lock,
and then it is serialized on that lock.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
Signed-off-by: Thomas Gleixner -
With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
parts, such that only the actual blocking can be done without hb->lock
held.

Split rt_mutex_finish_proxy_lock() into two parts, one that does the
blocking and one that does remove_waiter() when the lock acquire failed.

When the rtmutex was acquired successfully the waiter can be removed in the
acquisition path safely, since there is no concurrency on the lock owner.

This means that, except for futex_lock_pi(), all wait_list modifications
are done with both hb->lock and wait_lock held.

[bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
Signed-off-by: Thomas Gleixner -
Since there are already two copies of this code, introduce a helper now
before adding a third one.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
Signed-off-by: Thomas Gleixner -
Part of what makes futex_unlock_pi() intricate is that
rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
rt_mutex::wait_lock.

This means it cannot rely on the atomicity of wait_lock, which would be
preferred in order to not rely on hb->lock so much.

The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
race with the rt_mutex fastpath, however futexes have their own fast path.

Since futexes already have a bunch of separate rt_mutex accessors, complete
that set and implement an rt_mutex variant without fastpath for them.

Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
Signed-off-by: Thomas Gleixner -
These are unused and clutter up the code.
Signed-off-by: Peter Zijlstra (Intel)
Cc: juri.lelli@arm.com
Cc: bigeasy@linutronix.de
Cc: xlpang@redhat.com
Cc: rostedt@goodmis.org
Cc: mathieu.desnoyers@efficios.com
Cc: jdesfossez@efficios.com
Cc: dvhart@infradead.org
Cc: bristot@redhat.com
Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
Signed-off-by: Thomas Gleixner
02 Mar, 2017
3 commits
-
We are going to split a new <linux/sched/*.h> header out of <linux/sched.h>,
which will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just maps back to <linux/sched.h> to
make this patch obviously correct and bisectable.

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
…hed.h> into <linux/sched/signal.h>
Fix up affected files that include this signal functionality via sched.h.
Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar <mingo@kernel.org> -
We are going to split a new <linux/sched/*.h> header out of <linux/sched.h>,
which will have to be picked up from other headers and a couple of .c files.

Create a trivial placeholder file that just maps back to <linux/sched.h> to
make this patch obviously correct and bisectable.

Include the new header in the files that are going to need it.
Acked-by: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar
30 Jan, 2017
1 commit
-
Running my likely/unlikely profiler for 3 weeks on two production
machines, I discovered that the unlikely() test in
__rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
100% of the time, making it a very likely case.

The reason is, on a vanilla kernel, the majority case of calling
rt_mutex() is from the futex code. This code is always called as
TASK_INTERRUPTIBLE. In the -rt patch, this code is commonly called when
PREEMPT_RT is enabled with TASK_UNINTERRUPTIBLE. But that's not the
likely scenario.

The rt_mutex() code should be optimized for the common vanilla case,
and that is from a futex, with TASK_INTERRUPTIBLE as the state.

Signed-off-by: Steven Rostedt (VMware)
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
Signed-off-by: Ingo Molnar
02 Dec, 2016
3 commits
-
While debugging the unlock vs. dequeue race which resulted in state
corruption of futexes the lockless nature of rt_mutex_proxy_unlock()
caused some confusion.

Add commentary to explain why it is safe to do this lockless. Add matching
comments to rt_mutex_init_proxy_locked() for completeness' sake.

Signed-off-by: Thomas Gleixner
Acked-by: Peter Zijlstra (Intel)
Cc: David Daney
Cc: Linus Torvalds
Cc: Mark Rutland
Cc: Peter Zijlstra
Cc: Sebastian Siewior
Cc: Steven Rostedt
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20161130210030.591941927@linutronix.de
Signed-off-by: Ingo Molnar -
Signed-off-by: Ingo Molnar
-
David reported a futex/rtmutex state corruption. It's caused by the
following problem:

CPU0               CPU1               CPU2

l->owner=T1
                   rt_mutex_lock(l)
                   lock(l->wait_lock)
                   l->owner = T1 | HAS_WAITERS;
                   enqueue(T2)
                   boost()
                   unlock(l->wait_lock)
                   schedule()

                                      rt_mutex_lock(l)
                                      lock(l->wait_lock)
                                      l->owner = T1 | HAS_WAITERS;
                                      enqueue(T3)
                                      boost()
                                      unlock(l->wait_lock)
                                      schedule()

                   signal(->T2)       signal(->T3)

                   lock(l->wait_lock)
                   dequeue(T2)
                   deboost()
                   unlock(l->wait_lock)
                                      lock(l->wait_lock)
                                      dequeue(T3)
                                      ===> wait list is now empty
                                      deboost()
                                      unlock(l->wait_lock)
                   lock(l->wait_lock)
                   fixup_rt_mutex_waiters()
                     if (wait_list_empty(l)) {
                       owner = l->owner & ~HAS_WAITERS;
                       l->owner = owner
                        ==> l->owner = T1
                     }
                                      lock(l->wait_lock)
rt_mutex_unlock(l)                    fixup_rt_mutex_waiters()
                                        if (wait_list_empty(l)) {
                                          owner = l->owner & ~HAS_WAITERS;
cmpxchg(l->owner, T1, NULL)
  ===> Success (l->owner = NULL)
                                          l->owner = owner
                                           ==> l->owner = T1
                                        }

That means the problem is caused by fixup_rt_mutex_waiters() which does the
RMW to clear the waiters bit unconditionally when there are no waiters in
the rtmutex's rbtree.

This can be fatal: A concurrent unlock can release the rtmutex in the
fastpath because the waiters bit is not set. If the cmpxchg() gets in the
middle of the RMW operation then the previous owner, which just unlocked
the rtmutex, is set as the owner again when the write takes place after the
successful cmpxchg().

The solution is rather trivial: verify that the owner member of the rtmutex
has the waiters bit set before clearing it. This does not require a
cmpxchg() or other atomic operations because the waiters bit can only be
set and cleared with the rtmutex wait_lock held. It's also safe against the
fast path unlock attempt. The unlock attempt via cmpxchg() will either see
the bit set and take the slowpath or see the bit cleared and release it
atomically in the fastpath (see the sketch at the end of this entry).

It's remarkable that the test program provided by David triggers on ARM64
and MIPS64 really quickly, but it refuses to reproduce on x86-64, while the
problem exists there as well. That refusal might explain why this was not
discovered earlier despite the bug existing from day one of the rtmutex
implementation more than 10 years ago.

Thanks to David for meticulously instrumenting the code and providing the
information which allowed us to decode this subtle problem.

Reported-by: David Daney
Tested-by: David Daney
Signed-off-by: Thomas Gleixner
Reviewed-by: Steven Rostedt
Acked-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mark Rutland
Cc: Peter Zijlstra
Cc: Sebastian Siewior
Cc: Will Deacon
Cc: stable@vger.kernel.org
Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
Signed-off-by: Ingo Molnar
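A sketch of the verification described above (an illustration based on
the changelog, not the verbatim patch):

	static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
	{
		unsigned long owner, *p = (unsigned long *) &lock->owner;

		if (rt_mutex_has_waiters(lock))
			return;

		/*
		 * Only clear HAS_WAITERS if it is actually still set.
		 * The bit can only change while wait_lock is held, which
		 * we hold here, so no cmpxchg() is required.
		 */
		owner = READ_ONCE(*p);
		if (owner & RT_MUTEX_HAS_WAITERS)
			WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
	}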
21 Nov, 2016
1 commit
-
Currently the wake_q data structure is defined by the WAKE_Q() macro.
This macro, however, looks like a function doing something as "wake" is
a verb. Even checkpatch.pl was confused as it reported warnings like:

  WARNING: Missing a blank line after declarations
  #548: FILE: kernel/futex.c:3665:
  +	int ret;
  +	WAKE_Q(wake_q);

This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
what the macro is doing and eliminates the checkpatch.pl warnings (see
the usage sketch at the end of this entry).

Signed-off-by: Waiman Long
Acked-by: Davidlohr Bueso
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
[ Resolved conflict and added missing rename. ]
Signed-off-by: Ingo Molnar
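For illustration, the renamed definition-style usage (wake_q API from
kernel/sched; sketch only):

	DEFINE_WAKE_Q(wake_q);		/* was: WAKE_Q(wake_q); */

	/* queue a task for wakeup while still holding a lock ... */
	wake_q_add(&wake_q, task);

	/* ... and issue the wakeups after dropping it */
	wake_up_q(&wake_q);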
08 Jun, 2016
1 commit
-
One warning should be enough to get one motivated to fix this. It is
possible that this happens more than once and that starts flooding the
output. Later the prints will be suppressed so we only get half of it.
Depending on the console system used it might not be helpful.

Signed-off-by: Sebastian Andrzej Siewior
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
Signed-off-by: Ingo Molnar
26 Jan, 2016
1 commit
-
Sasha reported a lockdep splat about a potential deadlock between RCU boosting
rtmutex and the posix timer it_lock.

CPU0                                    CPU1

rtmutex_lock(&rcu->rt_mutex)
  spin_lock(&rcu->rt_mutex.wait_lock)
                                        local_irq_disable()
                                        spin_lock(&timer->it_lock)
                                        spin_lock(&rcu->mutex.wait_lock)
--> Interrupt
    spin_lock(&timer->it_lock)

This is caused by the following code sequence on CPU1:

	rcu_read_lock();
	x = lookup();
	if (x)
		spin_lock_irqsave(&x->it_lock);
	rcu_read_unlock();
	return x;

We could fix that in the posix timer code by keeping rcu read locked across
the spinlocked and irq disabled section, but the above sequence is common and
there is no reason not to support it.

Taking rt_mutex.wait_lock irq safe prevents the deadlock (see the sketch at
the end of this entry).
Reported-by: Sasha Levin
Signed-off-by: Thomas Gleixner
Cc: Peter Zijlstra
Cc: Paul McKenney
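A sketch of the conversion this implies for the wait_lock sections
(illustrative):

	unsigned long flags;

	/* before: assumes interrupts are enabled */
	raw_spin_lock(&lock->wait_lock);
	/* ... */
	raw_spin_unlock(&lock->wait_lock);

	/* after: safe to take with interrupts disabled */
	raw_spin_lock_irqsave(&lock->wait_lock, flags);
	/* ... */
	raw_spin_unlock_irqrestore(&lock->wait_lock, flags);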
04 Nov, 2015
1 commit
-
Pull scheduler changes from Ingo Molnar:
"The main changes in this cycle were:- sched/fair load tracking fixes and cleanups (Byungchul Park)
- Make load tracking frequency scale invariant (Dietmar Eggemann)
- sched/deadline updates (Juri Lelli)
- stop machine fixes, cleanups and enhancements for bugs triggered by
CPU hotplug stress testing (Oleg Nesterov)- scheduler preemption code rework: remove PREEMPT_ACTIVE and related
cleanups (Peter Zijlstra)- Rework the sched_info::run_delay code to fix races (Peter Zijlstra)
- Optimize per entity utilization tracking (Peter Zijlstra)
- ... misc other fixes, cleanups and smaller updates"
* 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
sched: Start stopper early
stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
sched/x86: Fix typo in __switch_to() comments
sched/core: Remove a parameter in the migrate_task_rq() function
sched/core: Drop unlikely behind BUG_ON()
sched/core: Fix task and run queue sched_info::run_delay inconsistencies
sched/numa: Fix task_tick_fair() from disabling numa_balancing
sched/core: Add preempt_count invariant check
sched/core: More notrace annotations
sched/core: Kill PREEMPT_ACTIVE
sched/core, sched/x86: Kill thread_info::saved_preempt_count
sched/core: Simplify preempt_count tests
sched/core: Robustify preemption leak checks
sched/core: Stop setting PREEMPT_ACTIVE
...
06 Oct, 2015
1 commit
-
As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
ordered archs can benefit from more relaxed use of barriers when locking
and unlocking, instead of regular full barrier semantics. While currently
only arm64 supports such optimizations, updating the corresponding locking
primitives allows other archs to benefit immediately as well, once the
necessary machinery is implemented, of course (see the sketch at the end
of this entry).

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Paul E.McKenney
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443643395-17016-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
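A sketch of the relaxed-ordering substitution in the rtmutex fast path
(names follow the kernel's rtmutex code; illustrative):

	/* was: cmpxchg(&(l)->owner, c, n) with full barrier semantics */
	#define rt_mutex_cmpxchg_acquire(l, c, n) \
		(cmpxchg_acquire(&(l)->owner, c, n) == c)
	#define rt_mutex_cmpxchg_release(l, c, n) \
		(cmpxchg_release(&(l)->owner, c, n) == c)

	/* lock fast path wants ACQUIRE, unlock fast path wants RELEASE */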
23 Sep, 2015
1 commit
-
rt_mutex_waiter_less() check of task deadlines is open coded. Since this
is subject to wraparound bugs, make it use the correct helper (see the
sketch at the end of this entry).

Reported-by: Luca Abeni
Signed-off-by: Juri Lelli
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1441188096-23021-4-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar
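A sketch of the helper-based comparison (dl_time_before() is the
wraparound-safe helper from the deadline scheduler; sketch only):

	/* wraparound-safe: compares via a signed 64-bit difference */
	static inline bool dl_time_before(u64 a, u64 b)
	{
		return (s64)(a - b) < 0;
	}

	/* in rt_mutex_waiter_less(), for two deadline tasks: */
	if (dl_prio(left->prio) && dl_prio(right->prio))
		return dl_time_before(left->task->dl.deadline,
				      right->task->dl.deadline);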
20 Jul, 2015
1 commit
-
No one uses this anymore, and this is not the first time the
idea of replacing it with a (now possible) userspace equivalent has
come up. Lock stealing logic was removed long ago, when the lock
started being granted to the highest-priority waiter.

Signed-off-by: Davidlohr Bueso
Cc: Darren Hart
Cc: Steven Rostedt
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Sebastian Andrzej Siewior
Cc: Davidlohr Bueso
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1435782588-4177-2-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner
25 Jun, 2015
1 commit
-
Pull locking updates from Thomas Gleixner:
"These locking updates depend on the alreay merged sched/core branch:- Lockless top waiter wakeup for rtmutex (Davidlohr)
- Reduce hash bucket lock contention for PI futexes (Sebastian)
- Documentation update (Davidlohr)"
* 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
locking/rtmutex: Update stale plist comments
futex: Lower the lock contention on the HB lock during wake up
locking/rtmutex: Implement lockless top-waiter wakeup
23 Jun, 2015
2 commits
-
Pull timer updates from Thomas Gleixner:
"A rather largish update for everything time and timer related:- Cache footprint optimizations for both hrtimers and timer wheel
- Lower the NOHZ impact on systems which have NOHZ or timer migration
disabled at runtime.- Optimize run time overhead of hrtimer interrupt by making the clock
offset updates smarter- hrtimer cleanups and removal of restrictions to tackle some
problems in sched/perf- Some more leap second tweaks
- Another round of changes addressing the 2038 problem
- First step to change the internals of clock event devices by
introducing the necessary infrastructure- Allow constant folding for usecs/msecs_to_jiffies()
- The usual pile of clockevent/clocksource driver updates
The hrtimer changes contain updates to sched, perf and x86 as they
depend on them plus changes all over the tree to cleanup API changes
and redundant code, which got copied all over the place. The y2038
changes touch s390 to remove the last non 2038 safe code related to
boot/persistant clock"* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
clocksource: Increase dependencies of timer-stm32 to limit build wreckage
timer: Minimize nohz off overhead
timer: Reduce timer migration overhead if disabled
timer: Stats: Simplify the flags handling
timer: Replace timer base by a cpu index
timer: Use hlist for the timer wheel hash buckets
timer: Remove FIFO "guarantee"
timers: Sanitize catchup_timer_jiffies() usage
hrtimer: Allow hrtimer::function() to free the timer
seqcount: Introduce raw_write_seqcount_barrier()
seqcount: Rename write_seqcount_barrier()
hrtimer: Fix hrtimer_is_queued() hole
hrtimer: Remove HRTIMER_STATE_MIGRATE
selftest: Timers: Avoid signal deadlock in leap-a-day
timekeeping: Copy the shadow-timekeeper over the real timekeeper last
clockevents: Check state instead of mode in suspend/resume path
selftests: timers: Add leap-second timer edge testing to leap-a-day.c
ntp: Do leapsecond adjustment in adjtimex read path
time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
ntp: Introduce and use SECS_PER_DAY macro instead of 86400
... -
Pull locking updates from Ingo Molnar:
"The main changes are:- 'qspinlock' support, enabled on x86: queued spinlocks - these are
now the spinlock variant used by x86 as they outperform ticket
spinlocks in every category. (Waiman Long)- 'pvqspinlock' support on x86: paravirtualized variant of queued
spinlocks. (Waiman Long, Peter Zijlstra)- 'qrwlock' support, enabled on x86: queued rwlocks. Similar to
queued spinlocks, they are now the variant used by x86:CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y- various lockdep fixlets
- various locking primitives cleanups, further WRITE_ONCE()
propagation"* 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
locking/lockdep: Remove hard coded array size dependency
locking/qrwlock: Don't contend with readers when setting _QW_WAITING
lockdep: Do not break user-visible string
locking/arch: Rename set_mb() to smp_store_mb()
locking/arch: Add WRITE_ONCE() to set_mb()
rtmutex: Warn if trylock is called from hard/softirq context
arch: Remove __ARCH_HAVE_CMPXCHG
locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
locking/pvqspinlock, x86: Enable PV qspinlock for Xen
locking/pvqspinlock, x86: Enable PV qspinlock for KVM
locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
locking/pvqspinlock: Implement simple paravirt support for the qspinlock
locking/qspinlock: Revert to test-and-set on hypervisors
locking/qspinlock: Use a simple write to grab the lock
locking/qspinlock: Optimize for smaller NR_CPUS
locking/qspinlock: Extract out code snippets for the next patch
locking/qspinlock: Add pending bit
...
20 Jun, 2015
2 commits
-
... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
no longer use plists for queuing any waiters. Update stale comments.

Signed-off-by: Davidlohr Bueso
Cc: Steven Rostedt
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Sebastian Andrzej Siewior
Cc: Davidlohr Bueso
Cc: Peter Zijlstra
Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
Signed-off-by: Thomas Gleixner -
wake_futex_pi() wakes the task before releasing the hash bucket lock
(HB). The first thing the woken up task usually does is to acquire the
lock which requires the HB lock. On SMP systems this leads to blocking
on the HB lock which is released by the owner shortly after.
This patch rearranges the unlock path by first releasing the HB lock and
then waking up the task (see the sketch below).

[ tglx: Fixed up the rtmutex unlock path ]
Originally-from: Thomas Gleixner
Signed-off-by: Sebastian Andrzej Siewior
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Davidlohr Bueso
Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
Signed-off-by: Thomas Gleixner
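A sketch of the rearranged unlock path using a wake queue (WAKE_Q was
the then-current name of today's DEFINE_WAKE_Q; illustrative, not the
verbatim patch):

	WAKE_Q(wake_q);

	/* Collect the top waiter while still holding the locks... */
	mark_wakeup_next_waiter(&wake_q, &pi_state->pi_mutex);

	/* ...release the hash bucket lock first... */
	spin_unlock(&hb->lock);

	/*
	 * ...then issue the wakeup, so the woken task does not
	 * immediately block on hb->lock.
	 */
	wake_up_q(&wake_q);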