05 Sep, 2018

1 commit

  • [ Upstream commit 62cedf3e60af03e47849fe2bd6a03ec179422a8a ]

    Needed for annotating rt_mutex locks.
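
    A minimal illustration of the kind of annotation this enables (the
    rt_mutex_lock_nested() name and the parent/child pairing are assumptions
    used for illustration, not taken from this changelog entry):

        /* lockdep needs distinct subclasses when two rt_mutexes of the same
         * lock class are legitimately held at the same time. */
        rt_mutex_lock(&parent->lock);                             /* subclass 0 */
        rt_mutex_lock_nested(&child->lock, SINGLE_DEPTH_NESTING); /* subclass 1 */
        /* ... critical section using both locks ... */
        rt_mutex_unlock(&child->lock);
        rt_mutex_unlock(&parent->lock);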

    Tested-by: John Sperbeck
    Signed-off-by: Peter Rosin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Deepa Dinamani
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Peter Chang
    Cc: Peter Zijlstra
    Cc: Philippe Ombredanne
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Wolfram Sang
    Link: http://lkml.kernel.org/r/20180720083914.1950-2-peda@axentia.se
    Signed-off-by: Ingo Molnar
    Signed-off-by: Sasha Levin
    Signed-off-by: Greg Kroah-Hartman

    Peter Rosin
     

24 Jan, 2018

1 commit

  • commit c1e2f0eaf015fb7076d51a339011f2383e6dd389 upstream.

    Julia reported futex state corruption in the following scenario:

    waiter                          waker                           stealer (prio > waiter)

    futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
          timeout=[N ms])
       futex_wait_requeue_pi()
          futex_wait_queue_me()
             freezable_schedule()

                                    futex(LOCK_PI, uaddr2)
                                    futex(CMP_REQUEUE_PI, uaddr,
                                          uaddr2, 1, 0)
                                       /* requeues waiter to uaddr2 */
                                    futex(UNLOCK_PI, uaddr2)
                                       wake_futex_pi()
                                          cmp_futex_value_locked(uaddr2, waiter)
                                          wake_up_q()

                                                                    futex(LOCK_PI, uaddr2)
                                                                       __rt_mutex_start_proxy_lock()
                                                                          try_to_take_rt_mutex() /* steals lock */
                                                                             rt_mutex_set_owner(lock, stealer)

       rt_mutex_wait_proxy_lock()
          __rt_mutex_slowlock()
             try_to_take_rt_mutex() /* fails, lock held by stealer */
             if (timeout && !timeout->task)
                return -ETIMEDOUT;
          fixup_owner()
             /* lock wasn't acquired, so,
                fixup_pi_state_owner skipped */

    return -ETIMEDOUT;

    /* At this point, we've returned -ETIMEDOUT to userspace, but the
     * futex word shows waiter to be the owner, and the pi_mutex has
     * stealer as the owner */

    futex_lock(LOCK_PI, uaddr2)
      -> bails with EDEADLK, futex word says we're owner.

    And suggested that what commit:

    73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")

    removes from fixup_owner() looks to be just what is needed. And indeed
    it is -- I completely missed that requeue_pi could also result in this
    case. So we need to restore that, except that subsequent patches, like
    commit:

    16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")

    changed all the locking rules. Even without that, the sequence:

    -       if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
    -               locked = 1;
    -               goto out;
    -       }
    -
    -       raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
    -       owner = rt_mutex_owner(&q->pi_state->pi_mutex);
    -       if (!owner)
    -               owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
    -       raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
    -       ret = fixup_pi_state_owner(uaddr, q, owner);

    already suggests there were races; otherwise we'd never have to look
    at next_owner.

    So instead of doing 3 consecutive wait_lock sections with who knows
    what races, we do it all in a single section. Additionally, the usage
    of pi_state->owner in fixup_owner() was only safe because only the
    rt_mutex owner would modify it, which this additional case wrecks.

    Luckily the values can only change away from, and not to, the value we're
    testing; this means we can do a speculative test and double check once
    we have the wait_lock.
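
    A sketch of the speculative-test-then-recheck pattern described above
    (the field and lock names are illustrative, not the actual futex hunk):

        /* Lockless first look: if the value already moved away from what we
         * test for, it can never come back, so we may skip the slow path. */
        if (READ_ONCE(pi_state->owner) != current)
                return 0;

        raw_spin_lock_irq(&pi_state->pi_mutex.wait_lock);
        if (pi_state->owner != current) {       /* double check under wait_lock */
                raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);
                return 0;
        }
        /* ... fix up ownership while still holding wait_lock ... */
        raw_spin_unlock_irq(&pi_state->pi_mutex.wait_lock);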

    Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")
    Reported-by: Julia Cartwright
    Reported-by: Gratian Crisan
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Tested-by: Julia Cartwright
    Tested-by: Gratian Crisan
    Cc: Darren Hart
    Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net
    Signed-off-by: Greg Kroah-Hartman

    Peter Zijlstra
     

09 Sep, 2017

1 commit


13 Jul, 2017

1 commit

    We don't need to adjust the priority before adding a new pi_waiter; the
    priority only needs to be updated after a pi_waiter change or a task
    priority change.

    Steven Rostedt pointed out:

    "Interesting, I did some git mining and this was added with the original
    entry of the rtmutex.c (23f78d4a03c5). Looking at even that version, I
    don't see the purpose of adjusting the task prio here. It is done
    before anything changes in the task."

    Signed-off-by: Alex Shi
    Reviewed-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1499926704-28841-1-git-send-email-alex.shi@linaro.org
    [ Enhance the changelog. ]
    Signed-off-by: Ingo Molnar

    Alex Shi
     

20 Jun, 2017

1 commit

  • pi_mutex isn't supposed to be tracked by lockdep, but just
    passing NULLs for name and key will cause lockdep to spew a
    warning and die, which is not what we want it to do.

    Skip lockdep initialization if the caller passed NULLs for
    name and key, suggesting such initialization isn't desired.
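
    Sketch of the guard described above (the helper name and its signature are
    assumptions; only the "skip when name and key are NULL" shape comes from
    the text):

        /* in __rt_mutex_init(): only wire up lockdep if the caller asked for it */
        if (name && key)
                debug_rt_mutex_init(lock, name, key);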

    Signed-off-by: Sasha Levin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: f5694788ad8d ("rt_mutex: Add lockdep annotations")
    Link: http://lkml.kernel.org/r/20170618140548.4763-1-alexander.levin@verizon.com
    Signed-off-by: Ingo Molnar

    Levin, Alexander (Sasha Levin)
     

08 Jun, 2017

1 commit

  • Now that (PI) futexes have their own private RT-mutex interface and
    implementation we can easily add lockdep annotations to the existing
    RT-mutex interface.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 May, 2017

1 commit

  • Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
    commit:

    cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

    The following trace shows the problem:

    ld-linux-x86-64-2161 [019] .... 410.760971: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_LOCK_PI
    ld-linux-x86-64-2161 [019] ...1 410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
    ld-linux-x86-64-2165 [011] .... 410.760978: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_UNLOCK_PI
    ld-linux-x86-64-2165 [011] d..1 410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
    ld-linux-x86-64-2165 [011] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
    ld-linux-x86-64-2161 [019] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT

    Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
    which then returns with -ETIMEDOUT. That wrecks the lock state, because now
    the owner isn't aware it acquired the lock and removes the pending robust
    list entry.

    If 2161 is killed, the robust list will not clear out this futex and the
    subsequent acquire on this futex will then (correctly) result in -ESRCH,
    which glibc does not expect; it triggers an internal assertion and dies.

    Task 2161                           Task 2165

    rt_mutex_wait_proxy_lock()
       timeout();
       /* T2161 is still queued in the waiter list */
       return -ETIMEDOUT;

                                        futex_unlock_pi()
                                        spin_lock(hb->lock);
                                        rtmutex_unlock()
                                          remove_rtmutex_waiter(T2161);
                                          mark_lock_available();
                                        /* Make the next waiter owner of the user space side */
                                        futex_uval = 2161;
                                        spin_unlock(hb->lock);
    spin_lock(hb->lock);
    rt_mutex_cleanup_proxy_lock()
      if (rtmutex_owner() !== current)
         ...
         return FAIL;
    ....
    return -ETIMEOUT;

    This means that rt_mutex_cleanup_proxy_lock() needs to call
    try_to_take_rt_mutex() so it can correctly take over the rtmutex which was
    assigned by the waker. If the rtmutex is owned by some other task then this
    call is harmless and just confirms that the waiter is not able to acquire
    it.

    While there, fix what looks like a merge error which resulted in
    rt_mutex_cleanup_proxy_lock() having two calls to
    fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
    Both should have one, since both potentially touch the waiter list.
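
    Roughly what the fixed rt_mutex_cleanup_proxy_lock() looks like (a sketch
    based on the description above, not a verbatim copy of the patch):

        bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
                                         struct rt_mutex_waiter *waiter)
        {
                bool cleanup = false;

                raw_spin_lock_irq(&lock->wait_lock);
                /*
                 * Unconditionally try to take the lock; this picks up the case
                 * where the waker already assigned it to us. If another task
                 * owns it, the call is harmless and merely confirms that.
                 */
                try_to_take_rt_mutex(lock, current, waiter);
                /*
                 * If we did not become the owner we are still enqueued and
                 * must take ourselves off the waiter list.
                 */
                if (rt_mutex_owner(lock) != current) {
                        remove_waiter(lock, waiter);
                        cleanup = true;
                }
                /* both paths potentially touched the waiter list */
                fixup_rt_mutex_waiters(lock);
                raw_spin_unlock_irq(&lock->wait_lock);

                return cleanup;
        }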

    Fixes: 38d589f2fd08 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
    Reported-by: Markus Trippelsdorf
    Bug-Spotted-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Florian Weimer
    Cc: Darren Hart
    Cc: Sebastian Andrzej Siewior
    Cc: Markus Trippelsdorf
    Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

05 Apr, 2017

1 commit

    mark_wakeup_next_waiter() already disables preemption; doing so again
    leaves us with an unpaired preempt_disable().

    Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
    Signed-off-by: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     

04 Apr, 2017

7 commits

  • There was a pure ->prio comparison left in try_to_wake_rt_mutex(),
    convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
    is not-less (both in kernel priority view).

    This necessitated the introduction of cmp_task() which creates a
    pointer to an unnamed stack variable of struct rt_mutex_waiter type to
    compare against tasks.

    With this, we can now also create and employ rt_mutex_waiter_equal().
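
    A sketch of the helpers the text refers to (the changelog names cmp_task();
    a compound-literal helper along these lines captures the idea, and the
    exact names and fields may differ):

        /* build an on-stack waiter from a task so tasks and waiters compare alike */
        #define task_to_waiter(p) \
                (&(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline })

        static inline int rt_mutex_waiter_equal(struct rt_mutex_waiter *left,
                                                struct rt_mutex_waiter *right)
        {
                if (left->prio != right->prio)
                        return 0;
                /* deadline tasks of equal prio compare by deadline */
                if (dl_prio(left->prio))
                        return left->deadline == right->deadline;
                return 1;
        }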

    Reviewed-and-tested-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
    during the PI chain walk, such that the PI chain order isn't messed up
    by (asynchronous) task state updates.

    Currently rt_mutex_waiter_less() uses task state for deadline tasks;
    this is broken, since the task state can, as said above, change
    asynchronously, causing the RB tree order to change without actual
    tree update -> FAIL.

    Fix this by also copying the deadline into the rt_mutex_waiter state
    and updating it along with its prio field.

    Ideally we would also force PI chain updates whenever DL tasks update
    their deadline parameter, but for first approximation this is less
    broken than it was.
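
    Sketched effect of the change (illustrative field names):

        /* snapshot prio and deadline together during the PI chain walk */
        waiter->prio = task->prio;
        waiter->deadline = task->dl.deadline;

        /* ... and rt_mutex_waiter_less() compares the snapshot, not the task: */
        if (dl_prio(left->prio))
                return dl_time_before(left->deadline, right->deadline);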

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • With the introduction of SCHED_DEADLINE the whole notion that priority
    is a single number is gone, therefore the @prio argument to
    rt_mutex_setprio() doesn't make sense anymore.

    So rework the code to pass a pi_task instead.

    Note this also fixes a problem with pi_top_task caching; previously we
    would not set the pointer (call rt_mutex_update_top_task) if the
    priority didn't change, which could lead to a stale pointer.

    As for the XXX, I think it's fine to use pi_task->prio, because if it
    differs from waiter->prio, a PI chain update is imminent.
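
    The effective priority then derives from the cached pi_task, roughly as
    follows (a sketch, not the literal helper in the patch):

        /* smaller value == higher priority; boost to the top waiter's prio */
        static int __rt_effective_prio(struct task_struct *pi_task, int prio)
        {
                if (pi_task)
                        prio = min(prio, pi_task->prio);
                return prio;
        }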

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Previous patches changed the meaning of the return value of
    rt_mutex_slowunlock(); update comments and code to reflect this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Currently dl tasks will actually return at the very beginning
    of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

        if (waiter->prio == task->prio) {
                if (!detect_deadlock)
                        goto out_unlock_pi; // out here
                else
                        requeue = false;
        }

    As the deadline value of a blocked deadline task (waiter) never changes
    unless it changes its sched_class (and thus its prio), this seems
    reasonable. However, it misses the chance of updating the rt_mutex_waiter's
    "dl_runtime(period)_copy" if a waiter updates its deadline parameters
    (dl_runtime, dl_period) or a boosted waiter changes to a !deadline class.

    Thus, keep deadline tasks from bailing out early by adding the !dl_prio()
    condition.
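
    The resulting condition, sketched per the description (the exact predicate
    in the merged patch may be spelled slightly differently):

        if (waiter->prio == task->prio && !dl_prio(task->prio)) {
                if (!detect_deadlock)
                        goto out_unlock_pi;
                else
                        requeue = false;
        }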

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     
  • A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [] enqueue_task+0x2c/0x80
    [] activate_task+0x23/0x30
    [] pull_dl_task+0x1d5/0x260
    [] pre_schedule_dl+0x16/0x20
    [] __schedule+0xd3/0x900
    [] schedule+0x29/0x70
    [] __rt_mutex_slowlock+0x4b/0xc0
    [] rt_mutex_slowlock+0xd1/0x190
    [] rt_mutex_timed_lock+0x53/0x60
    [] futex_lock_pi.isra.18+0x28c/0x390
    [] do_futex+0x190/0x5b0
    [] SyS_futex+0x80/0x180

    This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
    are only protected by pi_lock when operating on pi waiters, while
    rt_mutex_get_top_task() will access them with the rq lock held but
    without holding pi_lock.

    In order to tackle it, we introduce a new "pi_top_task" pointer
    cached in task_struct, and add a new rt_mutex_update_top_task()
    to update its value; it can be called by rt_mutex_setprio(),
    which holds both the owner's pi_lock and the rq lock. Thus "pi_top_task"
    can be safely accessed by enqueue_task_dl() under the rq lock.
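
    A sketch of the update helper described above (illustrative; the real
    implementation may differ in detail):

        /* called with the owner's pi_lock and rq lock held */
        void rt_mutex_update_top_task(struct task_struct *p)
        {
                if (!task_has_pi_waiters(p)) {
                        p->pi_top_task = NULL;
                        return;
                }
                p->pi_top_task = task_top_pi_waiter(p)->task;
        }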

    Originally-From: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     
  • We should deboost before waking the high-priority task, such that we
    don't run two tasks with the same "state" (priority, deadline,
    sched_class, etc).

    In order to make sure the boosting task doesn't start running between
    unlock and deboost (due to a 'spurious' wakeup), we move the deboost
    under the wait_lock; that way it's serialized against the wait loop in
    __rt_mutex_slowlock().

    Doing the deboost early can however lead to priority inversion if
    current gets preempted after the deboost but before waking our
    high-prio task, hence we disable preemption before doing the deboost
    and enable it again after the wakeup is over.

    This gets us the right semantic order, but, most importantly,
    this change ensures pointer stability for the next patch, where we
    have rt_mutex_setprio() cache a pointer to the top-most waiter task.
    If we, as before this change, do the wakeup first and then deboost,
    this pointer might point into thin air.
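
    The resulting ordering at the tail of the unlock slow path, sketched from
    the description above (not the literal patch):

        preempt_disable();                       /* keep the woken task from running early */
        mark_wakeup_next_waiter(&wake_q, lock);  /* deboost + queue top waiter, both still
                                                  * under lock->wait_lock                   */
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

        wake_up_q(&wake_q);                      /* actual wakeup after dropping wait_lock  */
        preempt_enable();                        /* only now may the woken task preempt us  */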

    [peterz: Changelog + patch munging]
    Suggested-by: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     

24 Mar, 2017

6 commits

  • When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
    chain code will (falsely) report a deadlock and BUG.

    The problem is that it holds hb->lock (now an rt_mutex) while doing
    task_blocks_on_rt_mutex() on the futex's pi_state::rtmutex. This, when
    interleaved just right with futex_unlock_pi(), leads it to believe it sees
    an AB-BA deadlock.

    Task1 (holds rt_mutex,              Task2 (does FUTEX_LOCK_PI)
           does FUTEX_UNLOCK_PI)

                                        lock hb->lock
                                        lock rt_mutex (as per start_proxy)
    lock hb->lock

    Which is a trivial AB-BA.

    It is not an actual deadlock, because it won't be holding hb->lock by the
    time it actually blocks on the rt_mutex, but the chainwalk code doesn't
    know that and it would be a nightmare to handle this gracefully.

    To avoid this problem, do the same as in futex_unlock_pi() and drop
    hb->lock after acquiring wait_lock. This still fully serializes against
    futex_unlock_pi(), since adding to the wait_list does the very same lock
    dance, and removing it holds both locks.

    Aside from solving the RT problem, this makes the lock and unlock mechanism
    symmetric and reduces the hb->lock hold time.
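
    Sketch of the reordered locking in futex_lock_pi() implied by the text
    (illustrative, not the verbatim hunk):

        /*
         * Take wait_lock first, then drop hb->lock; enqueueing the waiter is
         * still fully serialized against futex_unlock_pi(), but we never block
         * on the rt_mutex while holding hb->lock.
         */
        raw_spin_lock_irq(&q.pi_state->pi_mutex.wait_lock);
        spin_unlock(q.lock_ptr);                                /* hb->lock */
        ret = __rt_mutex_start_proxy_lock(&q.pi_state->pi_mutex, &rt_waiter, current);
        raw_spin_unlock_irq(&q.pi_state->pi_mutex.wait_lock);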

    Reported-and-tested-by: Sebastian Andrzej Siewior
    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
    modifications are done under both hb->lock and wait_lock.

    This closes the obvious interleave pattern between futex_lock_pi() and
    futex_unlock_pi(), but not entirely so. See below:

    Before:

    futex_lock_pi()                     futex_unlock_pi()
      unlock hb->lock

                                          lock hb->lock
                                          unlock hb->lock

                                          lock rt_mutex->wait_lock
                                          unlock rt_mutex_wait_lock
                                            -EAGAIN

      lock rt_mutex->wait_lock
      list_add
      unlock rt_mutex->wait_lock

      schedule()

      lock rt_mutex->wait_lock
      list_del
      unlock rt_mutex->wait_lock

                                            -EAGAIN

      lock hb->lock


    After:

    futex_lock_pi()                     futex_unlock_pi()

      lock hb->lock
      lock rt_mutex->wait_lock
      list_add
      unlock rt_mutex->wait_lock
      unlock hb->lock

      schedule()
      lock hb->lock
      unlock hb->lock
                                          lock hb->lock
                                          lock rt_mutex->wait_lock
                                          list_del
                                          unlock rt_mutex->wait_lock

                                          lock rt_mutex->wait_lock
                                          unlock rt_mutex_wait_lock
                                            -EAGAIN

                                          unlock hb->lock

    It does however solve the earlier starvation/live-lock scenario which got
    introduced with the -EAGAIN: unlike the before scenario, where the
    -EAGAIN happens while futex_unlock_pi() doesn't hold any locks, in the
    after scenario it happens while futex_unlock_pi() actually holds a lock,
    and it is then serialized on that lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
    consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
    parts, such that only the actual blocking can be done without hb->lock
    held.

    Split rt_mutex_finish_proxy_lock() into two parts, one that does the
    blocking and one that does remove_waiter() when the lock acquisition failed.

    When the rtmutex was acquired successfully the waiter can be removed safely
    in the acquisition path, since there is no concurrency on the lock owner.

    This means that, except for futex_lock_pi(), all wait_list modifications
    are done with both hb->lock and wait_lock held.

    [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]
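
    Sketch of how a caller uses the split API (the function names appear in
    this series; the surrounding details are illustrative):

        /* block on the rtmutex without hb->lock held */
        ret = rt_mutex_wait_proxy_lock(&pi_state->pi_mutex, timeout, &rt_waiter);

        spin_lock(q.lock_ptr);                  /* re-take hb->lock */
        if (ret && !rt_mutex_cleanup_proxy_lock(&pi_state->pi_mutex, &rt_waiter))
                ret = 0;                        /* we acquired it after all */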

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Since there's already two copies of this code, introduce a helper now
    before adding a third one.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Part of what makes futex_unlock_pi() intricate is that
    rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
    rt_mutex::wait_lock.

    This means it cannot rely on the atomicity of wait_lock, which would be
    preferred in order to not rely on hb->lock so much.

    The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
    race with the rt_mutex fastpath, however futexes have their own fast path.

    Since futexes already have a bunch of separate rt_mutex accessors, complete
    that set and implement a rt_mutex variant without fastpath for them.
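
    A sketch of what such a no-fastpath unlock variant looks like (the
    rt_mutex_futex_unlock() name is from the text above; the internal helpers
    shown are assumptions):

        void rt_mutex_futex_unlock(struct rt_mutex *lock)
        {
                DEFINE_WAKE_Q(wake_q);
                unsigned long flags;
                bool postunlock;

                /* no owner-cmpxchg fastpath: go straight for wait_lock */
                raw_spin_lock_irqsave(&lock->wait_lock, flags);
                postunlock = __rt_mutex_futex_unlock(lock, &wake_q);
                raw_spin_unlock_irqrestore(&lock->wait_lock, flags);

                if (postunlock)
                        rt_mutex_postunlock(&wake_q);
        }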

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • These are unused and clutter up the code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

02 Mar, 2017

3 commits


30 Jan, 2017

1 commit

  • Running my likely/unlikely profiler for 3 weeks on two production
    machines, I discovered that the unlikely() test in
    __rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
    100% of the time, making it a very likely case.

    The reason is, on a vanilla kernel, the majority case of calling
    rt_mutex() is from the futex code. This code is always called as
    TASK_INTERRUPTIBLE. In the -rt patch, this code is commonly called when
    PREEMPT_RT is enabled with TASK_UNINTERRUPTIBLE. But that's not the
    likely scenario.

    The rt_mutex() code should be optimized for the common vanilla case,
    and that is from a futex, with TASK_INTERRUPTIBLE as the state.
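
    The change itself is a one-line hint flip of roughly this form (sketch,
    shown diff-style like the other snippets in this log):

        -       if (unlikely(state == TASK_INTERRUPTIBLE)) {
        +       if (likely(state == TASK_INTERRUPTIBLE)) {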

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     

02 Dec, 2016

3 commits

  • While debugging the unlock vs. dequeue race which resulted in state
    corruption of futexes the lockless nature of rt_mutex_proxy_unlock()
    caused some confusion.

    Add commentary to explain why it is safe to do this locklessly. Add matching
    comments to rt_mutex_init_proxy_locked() for completeness' sake.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20161130210030.591941927@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0                        CPU1                        CPU2

    l->owner=T1
                                rt_mutex_lock(l)
                                lock(l->wait_lock)
                                l->owner = T1 | HAS_WAITERS;
                                enqueue(T2)
                                boost()
                                  unlock(l->wait_lock)
                                schedule()

                                                            rt_mutex_lock(l)
                                                            lock(l->wait_lock)
                                                            l->owner = T1 | HAS_WAITERS;
                                                            enqueue(T3)
                                                            boost()
                                                              unlock(l->wait_lock)
                                                            schedule()
                                signal(->T2)                signal(->T3)
                                lock(l->wait_lock)
                                dequeue(T2)
                                deboost()
                                  unlock(l->wait_lock)
                                                            lock(l->wait_lock)
                                                            dequeue(T3)
                                                              ===> wait list is now empty
                                                            deboost()
                                                              unlock(l->wait_lock)
                                lock(l->wait_lock)
                                fixup_rt_mutex_waiters()
                                  if (wait_list_empty(l)) {
                                    owner = l->owner & ~HAS_WAITERS;
                                    l->owner = owner
                                      ==> l->owner = T1
                                  }
                                                            lock(l->wait_lock)
    rt_mutex_unlock(l)                                      fixup_rt_mutex_waiters()
                                                              if (wait_list_empty(l)) {
                                                                owner = l->owner & ~HAS_WAITERS;
    cmpxchg(l->owner, T1, NULL)
      ===> Success (l->owner = NULL)
                                                                l->owner = owner
                                                                  ==> l->owner = T1
                                                              }

    That means the problem is caused by fixup_rt_mutex_waiters(), which does the
    RMW to clear the waiters bit unconditionally when there are no waiters in
    the rtmutex's rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after the
    successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.
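
    The fix, sketched from the description above (the real helper may differ
    in detail):

        static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
        {
                unsigned long owner, *p = (unsigned long *) &lock->owner;

                if (rt_mutex_has_waiters(lock))
                        return;

                /*
                 * Only rewrite lock->owner if the waiters bit is actually set;
                 * otherwise a concurrent fastpath unlock (cmpxchg to NULL)
                 * could be overwritten with the stale owner value.
                 */
                owner = READ_ONCE(*p);
                if (owner & RT_MUTEX_HAS_WAITERS)
                        WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
        }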

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but refuses to reproduce on x86-64, while the
    problem exists there as well. That refusal might explain why this was not
    discovered earlier, despite the bug having existed from day one of the
    rtmutex implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information which allowed us to decode this subtle problem.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

21 Nov, 2016

1 commit

  • Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function doing something as "wake" is
    a verb. Even checkpatch.pl was confused as it reported warnings like

    WARNING: Missing a blank line after declarations
    #548: FILE: kernel/futex.c:3665:
    + int ret;
    + WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
    what the macro is doing and eliminates the checkpatch.pl warnings.
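
    The rename at a use site, as in the warning quoted above (before/after):

        -       WAKE_Q(wake_q);
        +       DEFINE_WAKE_Q(wake_q);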

    Signed-off-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar

    Waiman Long
     

08 Jun, 2016

1 commit

    One warning should be enough to get someone motivated to fix this. It is
    possible that this happens more than once and starts flooding the output.
    Later prints will then be suppressed, so we only get half of it, and
    depending on the console system used that might not be helpful.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0                                CPU1

    rtmutex_lock(&rcu->rt_mutex)
      spin_lock(&rcu->rt_mutex.wait_lock)
                                        local_irq_disable()
                                        spin_lock(&timer->it_lock)
                                        spin_lock(&rcu->mutex.wait_lock)
    --> Interrupt
        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock()
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping rcu read locked across
    the spinlocked and irq disabled section, but the above sequence is common and
    there is no reason not to support it.

    Making rt_mutex.wait_lock irq safe prevents the deadlock.
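
    The conversion is of the usual irqsave form (sketch of the shape, not the
    full set of call sites):

        -       raw_spin_lock(&lock->wait_lock);
        +       raw_spin_lock_irqsave(&lock->wait_lock, flags);
                ...
        -       raw_spin_unlock(&lock->wait_lock);
        +       raw_spin_unlock_irqrestore(&lock->wait_lock, flags);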

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

04 Nov, 2015

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle were:

    - sched/fair load tracking fixes and cleanups (Byungchul Park)

    - Make load tracking frequency scale invariant (Dietmar Eggemann)

    - sched/deadline updates (Juri Lelli)

    - stop machine fixes, cleanups and enhancements for bugs triggered by
    CPU hotplug stress testing (Oleg Nesterov)

    - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
    cleanups (Peter Zijlstra)

    - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

    - Optimize per entity utilization tracking (Peter Zijlstra)

    - ... misc other fixes, cleanups and smaller updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
    sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
    sched: Start stopper early
    stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
    stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
    stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
    stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
    stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
    sched/x86: Fix typo in __switch_to() comments
    sched/core: Remove a parameter in the migrate_task_rq() function
    sched/core: Drop unlikely behind BUG_ON()
    sched/core: Fix task and run queue sched_info::run_delay inconsistencies
    sched/numa: Fix task_tick_fair() from disabling numa_balancing
    sched/core: Add preempt_count invariant check
    sched/core: More notrace annotations
    sched/core: Kill PREEMPT_ACTIVE
    sched/core, sched/x86: Kill thread_info::saved_preempt_count
    sched/core: Simplify preempt_count tests
    sched/core: Robustify preemption leak checks
    sched/core: Stop setting PREEMPT_ACTIVE
    ...

    Linus Torvalds
     

06 Oct, 2015

1 commit

  • As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
    variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
    Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
    ordered archs can benefit from more relaxed use of barriers when locking
    and unlocking, instead of regular full barrier semantics. While currently
    only arm64 supports such optimizations, updating corresponding locking
    primitives serves for other archs to immediately benefit as well, once the
    necessary machinery is implemented of course.
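
    For rtmutex this boils down to ordering-annotated owner cmpxchg helpers of
    roughly this shape (sketch, not the exact hunk):

        /* acquire on lock, release on unlock, relaxed where no ordering is needed */
        #define rt_mutex_cmpxchg_relaxed(l, c, n) (cmpxchg_relaxed(&(l)->owner, c, n) == c)
        #define rt_mutex_cmpxchg_acquire(l, c, n) (cmpxchg_acquire(&(l)->owner, c, n) == c)
        #define rt_mutex_cmpxchg_release(l, c, n) (cmpxchg_release(&(l)->owner, c, n) == c)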

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E.McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1443643395-17016-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

23 Sep, 2015

1 commit

  • rt_mutex_waiter_less() check of task deadlines is open coded. Since this
    is subject to wraparound bugs, make it use the correct helper.
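
    The helper compares signed differences, so it stays correct across u64
    wraparound where an open-coded '<' does not (sketch):

        static inline bool dl_time_before(u64 a, u64 b)
        {
                return (s64)(a - b) < 0;        /* correct across wraparound */
        }

        /* rt_mutex_waiter_less() then uses the helper instead of a raw compare: */
        return dl_time_before(left->task->dl.deadline, right->task->dl.deadline);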

    Reported-by: Luca Abeni
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1441188096-23021-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

20 Jul, 2015

1 commit

    No one uses this anymore, and this is not the first time the
    idea of replacing it with a (now possible) userspace-side
    equivalent has come up. Lock stealing logic was removed long ago,
    when the lock was granted to the highest prio.

    Signed-off-by: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1435782588-4177-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Jun, 2015

1 commit

  • Pull locking updates from Thomas Gleixner:
    "These locking updates depend on the alreay merged sched/core branch:

    - Lockless top waiter wakeup for rtmutex (Davidlohr)

    - Reduce hash bucket lock contention for PI futexes (Sebastian)

    - Documentation update (Davidlohr)"

    * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Update stale plist comments
    futex: Lower the lock contention on the HB lock during wake up
    locking/rtmutex: Implement lockless top-waiter wakeup

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non-2038-safe code related to the
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes are:

    - 'qspinlock' support, enabled on x86: queued spinlocks - these are
    now the spinlock variant used by x86 as they outperform ticket
    spinlocks in every category. (Waiman Long)

    - 'pvqspinlock' support on x86: paravirtualized variant of queued
    spinlocks. (Waiman Long, Peter Zijlstra)

    - 'qrwlock' support, enabled on x86: queued rwlocks. Similar to
    queued spinlocks, they are now the variant used by x86:

    CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
    CONFIG_QUEUED_SPINLOCKS=y
    CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
    CONFIG_QUEUED_RWLOCKS=y

    - various lockdep fixlets

    - various locking primitives cleanups, further WRITE_ONCE()
    propagation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    locking/lockdep: Remove hard coded array size dependency
    locking/qrwlock: Don't contend with readers when setting _QW_WAITING
    lockdep: Do not break user-visible string
    locking/arch: Rename set_mb() to smp_store_mb()
    locking/arch: Add WRITE_ONCE() to set_mb()
    rtmutex: Warn if trylock is called from hard/softirq context
    arch: Remove __ARCH_HAVE_CMPXCHG
    locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
    locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
    locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
    locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
    locking/pvqspinlock, x86: Enable PV qspinlock for Xen
    locking/pvqspinlock, x86: Enable PV qspinlock for KVM
    locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
    locking/pvqspinlock: Implement simple paravirt support for the qspinlock
    locking/qspinlock: Revert to test-and-set on hypervisors
    locking/qspinlock: Use a simple write to grab the lock
    locking/qspinlock: Optimize for smaller NR_CPUS
    locking/qspinlock: Extract out code snippets for the next patch
    locking/qspinlock: Add pending bit
    ...

    Linus Torvalds
     

20 Jun, 2015

2 commits

  • ... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
    no longer use plists for queuing any waiters. Update stale comments.

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     
  • wake_futex_pi() wakes the task before releasing the hash bucket lock
    (HB). The first thing the woken-up task usually does is to acquire the
    lock, which requires the HB lock. On SMP systems this leads to blocking
    on the HB lock, which is released by the owner shortly after.
    This patch rearranges the unlock path by first releasing the HB lock and
    then waking up the task.

    [ tglx: Fixed up the rtmutex unlock path ]
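
    Sketch of the reordered unlock path described above (illustrative; helper
    names other than wake_up_q() are assumptions):

        /* queue the wakeup while still holding hb->lock ... */
        deboost = rt_mutex_futex_unlock(&pi_state->pi_mutex, &wake_q);

        /* ... but only wake (and deboost) after hb->lock is gone */
        spin_unlock(&hb->lock);
        wake_up_q(&wake_q);
        if (deboost)
                rt_mutex_adjust_prio(current);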

    Originally-from: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior