21 Feb, 2020

1 commit

  • The rt_mutex structure's ->owner field is read locklessly, so this
    commit adds the WRITE_ONCE() to an update in order to provide proper
    documentation and READ_ONCE()/WRITE_ONCE() pairing.

    This data race was reported by KCSAN. The change is not appropriate for
    backporting because the failure is unlikely.
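
    As a minimal sketch of the pattern (not the actual patch; the real field is
    lock->owner in kernel/locking/rtmutex.c, and the "_sketch" names below are
    illustrative only), the update side uses WRITE_ONCE() so it pairs with the
    lockless READ_ONCE() readers:

        /* Writer side, done with the lock's wait_lock held. */
        static void rt_mutex_set_owner_sketch(struct rt_mutex *lock,
                                              struct task_struct *owner)
        {
                WRITE_ONCE(lock->owner, owner); /* pairs with READ_ONCE() below */
        }

        /* Reader side, may run without wait_lock held. */
        static struct task_struct *rt_mutex_owner_sketch(struct rt_mutex *lock)
        {
                return READ_ONCE(lock->owner);
        }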

    Signed-off-by: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Will Deacon

    Paul E. McKenney
     

09 Oct, 2019

1 commit

  • Since the following commit:

    b4adfe8e05f1 ("locking/lockdep: Remove unused argument in __lock_release")

    @nested is no longer used in lock_release(), so remove it from all
    lock_release() calls and friends.
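
    For illustration, a typical call-site change looks roughly like this (the
    exact sites are in the patch itself):

        /* Before: @nested was passed but no longer evaluated. */
        lock_release(&lock->dep_map, 1, _RET_IP_);

        /* After: the argument is dropped from the prototype and all callers. */
        lock_release(&lock->dep_map, _RET_IP_);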

    Signed-off-by: Qian Cai
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Acked-by: Daniel Vetter
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: airlied@linux.ie
    Cc: akpm@linux-foundation.org
    Cc: alexander.levin@microsoft.com
    Cc: daniel@iogearbox.net
    Cc: davem@davemloft.net
    Cc: dri-devel@lists.freedesktop.org
    Cc: duyuyang@gmail.com
    Cc: gregkh@linuxfoundation.org
    Cc: hannes@cmpxchg.org
    Cc: intel-gfx@lists.freedesktop.org
    Cc: jack@suse.com
    Cc: jlbec@evilplan.org
    Cc: joonas.lahtinen@linux.intel.com
    Cc: joseph.qi@linux.alibaba.com
    Cc: jslaby@suse.com
    Cc: juri.lelli@redhat.com
    Cc: maarten.lankhorst@linux.intel.com
    Cc: mark@fasheh.com
    Cc: mhocko@kernel.org
    Cc: mripard@kernel.org
    Cc: ocfs2-devel@oss.oracle.com
    Cc: rodrigo.vivi@intel.com
    Cc: sean@poorly.run
    Cc: st@kernel.org
    Cc: tj@kernel.org
    Cc: tytso@mit.edu
    Cc: vdavydov.dev@gmail.com
    Cc: vincent.guittot@linaro.org
    Cc: viro@zeniv.linux.org.uk
    Link: https://lkml.kernel.org/r/1568909380-32199-1-git-send-email-cai@lca.pw
    Signed-off-by: Ingo Molnar

    Qian Cai
     

15 Jul, 2019

1 commit

  • Convert the locking documents to ReST and add them to the
    kernel development book where it belongs.

    Most of the changes here are just to make Sphinx parse the text files
    properly, as they're already in good shape and don't require massive
    changes in order to be parsed.

    The conversion is actually:
    - add blank lines and indentation in order to identify paragraphs;
    - fix table markups;
    - add some list markups;
    - mark literal blocks;
    - adjust title markups.

    In the new index.rst, add an :orphan: marker while it is not yet linked
    from the main index.rst file, in order to avoid build warnings.

    Signed-off-by: Mauro Carvalho Chehab
    Acked-by: Federico Vaga

    Mauro Carvalho Chehab
     

21 May, 2019

1 commit

  • Add SPDX license identifiers to all files which:

    - Have no license information of any form

    - Have EXPORT_.*_SYMBOL_GPL inside which was used in the
    initial scan/conversion to ignore the file

    These files fall under the project license, GPL v2 only. The resulting SPDX
    license identifier is:

    GPL-2.0-only
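
    The added identifier is a single comment on the first line of each file,
    e.g. for a C source file:

        // SPDX-License-Identifier: GPL-2.0-only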

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

08 Feb, 2019

1 commit

  • commit 56222b212e8e ("futex: Drop hb->lock before enqueueing on the
    rtmutex") changed the locking rules in the futex code so that the hash
    bucket lock is no longer held while the waiter is enqueued into the
    rtmutex wait list. This made the lock and the unlock path symmetric, but
    unfortunately the possible early exit from __rt_mutex_proxy_start() due to
    a detected deadlock was not updated accordingly. That allows a concurrent
    unlocker to observe inconsistent state which triggers the warning in the
    unlock path.

    futex_lock_pi()                     futex_unlock_pi()
      lock(hb->lock)
      queue(hb_waiter)                  lock(hb->lock)
      lock(rtmutex->wait_lock)
      unlock(hb->lock)
                                        // acquired hb->lock
                                        hb_waiter = futex_top_waiter()
                                        lock(rtmutex->wait_lock)
      __rt_mutex_proxy_start()
         ---> fail
              remove(rtmutex_waiter);
         ---> returns -EDEADLOCK
      unlock(rtmutex->wait_lock)
                                        // acquired wait_lock
                                        wake_futex_pi()
                                        rt_mutex_next_owner()
                                          --> returns NULL
                                          --> WARN

      lock(hb->lock)
      unqueue(hb_waiter)

    The problem is caused by the remove(rtmutex_waiter) in the failure case of
    __rt_mutex_proxy_start() as this lets the unlocker observe a waiter in the
    hash bucket but no waiter on the rtmutex, i.e. inconsistent state.

    The original commit handles this correctly for the other early return cases
    (timeout, signal) by delaying the removal of the rtmutex waiter until the
    returning task reacquired the hash bucket lock.

    Treat the failure case of __rt_mutex_proxy_start() in the same way and let
    the existing cleanup code handle the eventual handover of the rtmutex
    gracefully. The regular rt_mutex_proxy_start() gains the rtmutex waiter
    removal for the failure case, so that the other callsites are still
    operating correctly.
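
    A rough sketch of the resulting split (simplified; see the patch for the
    real code and comments):

        /* Raw version: on failure the waiter stays enqueued so the futex code
         * can clean it up under hb->lock, as it already does for timeouts and
         * signals. */
        int __rt_mutex_start_proxy_lock(struct rt_mutex *lock,
                                        struct rt_mutex_waiter *waiter,
                                        struct task_struct *task);

        /* Wrapper for the other callsites: removes the waiter itself when the
         * lock could not be acquired. */
        int rt_mutex_start_proxy_lock(struct rt_mutex *lock,
                                      struct rt_mutex_waiter *waiter,
                                      struct task_struct *task)
        {
                int ret;

                raw_spin_lock_irq(&lock->wait_lock);
                ret = __rt_mutex_start_proxy_lock(lock, waiter, task);
                if (unlikely(ret))
                        remove_waiter(lock, waiter);
                raw_spin_unlock_irq(&lock->wait_lock);

                return ret;
        }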

    Add proper comments to the code so all these details are fully documented.

    Thanks to Peter for helping with the analysis and writing the really
    valuable code comments.

    Fixes: 56222b212e8e ("futex: Drop hb->lock before enqueueing on the rtmutex")
    Reported-by: Heiko Carstens
    Co-developed-by: Peter Zijlstra
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Thomas Gleixner
    Tested-by: Heiko Carstens
    Cc: Martin Schwidefsky
    Cc: linux-s390@vger.kernel.org
    Cc: Stefan Liebler
    Cc: Sebastian Sewior
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/alpine.DEB.2.21.1901292311410.1950@nanos.tec.linutronix.de

    Thomas Gleixner
     

11 Sep, 2018

1 commit

  • Merging v4.14.68 into v4.14-rt I tripped over a conflict in the
    rtmutex.c code. There I found that we had:

    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    [..]
    #endif

    #ifndef CONFIG_DEBUG_LOCK_ALLOC
    [..]
    #endif

    Really this should be:

    #ifdef CONFIG_DEBUG_LOCK_ALLOC
    [..]
    #else
    [..]
    #endif

    This cleans up that logic.

    Signed-off-by: Steven Rostedt (VMware)
    Cc: Linus Torvalds
    Cc: Peter Rosin
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20180910214638.55926030@vmware.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     

25 Jul, 2018

1 commit

  • Needed for annotating rt_mutex locks.

    Tested-by: John Sperbeck
    Signed-off-by: Peter Rosin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Davidlohr Bueso
    Cc: Deepa Dinamani
    Cc: Greg Kroah-Hartman
    Cc: Linus Torvalds
    Cc: Peter Chang
    Cc: Peter Zijlstra
    Cc: Philippe Ombredanne
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Wolfram Sang
    Link: http://lkml.kernel.org/r/20180720083914.1950-2-peda@axentia.se
    Signed-off-by: Ingo Molnar

    Peter Rosin
     

29 Mar, 2018

1 commit

  • In -RT task_blocks_on_rt_mutex() may return with -EAGAIN due to
    (->pi_blocked_on == PI_WAKEUP_INPROGRESS) before it added itself as a
    waiter. In such a case remove_waiter() must not be called because without a
    waiter it will trigger the BUG_ON() statement.

    This was initially reported by Yimin Deng. Thomas Gleixner fixed it then
    with an explicit check for waiters before calling remove_waiter().

    Instead of an explicit NULL check before calling rt_mutex_top_waiter() make
    the function return NULL if there are no waiters. With that fixed the now
    pointless NULL check is removed from rt_mutex_slowlock().
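
    A sketch of the reworked helper (simplified; the real version lives in
    kernel/locking/rtmutex_common.h):

        static inline struct rt_mutex_waiter *rt_mutex_top_waiter(struct rt_mutex *lock)
        {
                struct rt_mutex_waiter *w = NULL;

                if (rt_mutex_has_waiters(lock)) {
                        w = rb_entry(lock->waiters.rb_leftmost,
                                     struct rt_mutex_waiter, tree_entry);
                        BUG_ON(w->lock != lock);
                }
                return w;
        }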

    Reported-and-debugged-by: Yimin Deng
    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Thomas Gleixner
    Link: https://lkml.kernel.org/r/CAAh1qt=DCL9aUXNxanP5BKtiPp3m+qj4yB+gDohhXPVFCxWwzg@mail.gmail.com
    Link: https://lkml.kernel.org/r/20180327121438.sss7hxg3crqy4ecd@linutronix.de

    Peter Zijlstra
     

09 Mar, 2018

1 commit

  • When running rcutorture with TREE03 config, CONFIG_PROVE_LOCKING=y, and
    kernel cmdline argument "rcutorture.gp_exp=1", lockdep reports a
    HARDIRQ-safe->HARDIRQ-unsafe deadlock:

    ================================
    WARNING: inconsistent lock state
    4.16.0-rc4+ #1 Not tainted
    --------------------------------
    inconsistent {IN-HARDIRQ-W} -> {HARDIRQ-ON-W} usage.
    takes:
    __schedule+0xbe/0xaf0
    {IN-HARDIRQ-W} state was registered at:
    _raw_spin_lock+0x2a/0x40
    scheduler_tick+0x47/0xf0
    ...
    other info that might help us debug this:
    Possible unsafe locking scenario:
    CPU0
    ----
    lock(&rq->lock);

    lock(&rq->lock);
    *** DEADLOCK ***
    1 lock held by rcu_torture_rea/724:
    rcu_torture_read_lock+0x0/0x70
    stack backtrace:
    CPU: 2 PID: 724 Comm: rcu_torture_rea Not tainted 4.16.0-rc4+ #1
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-20171110_100015-anatol 04/01/2014
    Call Trace:
    lock_acquire+0x90/0x200
    ? __schedule+0xbe/0xaf0
    _raw_spin_lock+0x2a/0x40
    ? __schedule+0xbe/0xaf0
    __schedule+0xbe/0xaf0
    preempt_schedule_irq+0x2f/0x60
    retint_kernel+0x1b/0x2d
    RIP: 0010:rcu_read_unlock_special+0x0/0x680
    ? rcu_torture_read_unlock+0x60/0x60
    __rcu_read_unlock+0x64/0x70
    rcu_torture_read_unlock+0x17/0x60
    rcu_torture_reader+0x275/0x450
    ? rcutorture_booster_init+0x110/0x110
    ? rcu_torture_stall+0x230/0x230
    ? kthread+0x10e/0x130
    kthread+0x10e/0x130
    ? kthread_create_worker_on_cpu+0x70/0x70
    ? call_usermodehelper_exec_async+0x11a/0x150
    ret_from_fork+0x3a/0x50

    This happens with the following event sequence:

    preempt_schedule_irq();
      local_irq_enable();
      __schedule():
        local_irq_disable(); // irq off
        ...
        rcu_note_context_switch():
          rcu_note_preempt_context_switch():
            rcu_read_unlock_special():
              local_irq_save(flags);
              ...
              raw_spin_unlock_irqrestore(...,flags); // irq remains off
              rt_mutex_futex_unlock():
                raw_spin_lock_irq();
                ...
                raw_spin_unlock_irq(); // accidentally set irq on

        <return to __schedule()>
        rq_lock():
          raw_spin_lock(); // acquiring rq->lock with irq on

    which means rq->lock becomes a HARDIRQ-unsafe lock, which can cause
    deadlocks in scheduler code.

    This problem was introduced by commit 02a7c234e540 ("rcu: Suppress
    lockdep false-positive ->boost_mtx complaints"), which brought in a use
    of rt_mutex_futex_unlock() with irqs off.

    To fix this, replace the *lock_irq() calls in rt_mutex_futex_unlock()
    with *lock_irq{save,restore}() so that it is safe to call
    rt_mutex_futex_unlock() with irqs off.
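
    Schematically, the unlock path becomes (sketch only, details elided):

        void rt_mutex_futex_unlock(struct rt_mutex *lock)
        {
                unsigned long flags;

                /* was: raw_spin_lock_irq(&lock->wait_lock); */
                raw_spin_lock_irqsave(&lock->wait_lock, flags);
                /* ... wake the top waiter ... */
                /* was: raw_spin_unlock_irq(&lock->wait_lock); */
                raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
        }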

    Fixes: 02a7c234e540 ("rcu: Suppress lockdep false-positive ->boost_mtx complaints")
    Signed-off-by: Boqun Feng
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Lai Jiangshan
    Cc: Steven Rostedt
    Cc: Josh Triplett
    Cc: Mathieu Desnoyers
    Cc: "Paul E . McKenney"
    Link: https://lkml.kernel.org/r/20180309065630.8283-1-boqun.feng@gmail.com

    Boqun Feng
     

15 Jan, 2018

1 commit

  • Julia reported futex state corruption in the following scenario:

    waiter                      waker                           stealer (prio > waiter)

    futex(WAIT_REQUEUE_PI, uaddr, uaddr2,
          timeout=[N ms])
       futex_wait_requeue_pi()
          futex_wait_queue_me()
             freezable_schedule()
             <scheduled out>
                                futex(LOCK_PI, uaddr2)
                                futex(CMP_REQUEUE_PI, uaddr,
                                      uaddr2, 1, 0)
                                   /* requeues waiter to uaddr2 */
                                futex(UNLOCK_PI, uaddr2)
                                      wake_futex_pi()
                                         cmp_futex_value_locked(uaddr2, waiter)
                                         wake_up_q()
             <woken by waker>
             <hrtimer_wakeup() fires, clears sleeper->task>
                                                                futex(LOCK_PI, uaddr2)
                                                                   __rt_mutex_start_proxy_lock()
                                                                      try_to_take_rt_mutex() /* steals lock */
                                                                         rt_mutex_set_owner(lock, stealer)
                                                                <preempted>
    <scheduled in>
       rt_mutex_wait_proxy_lock()
          __rt_mutex_slowlock()
             try_to_take_rt_mutex() /* fails, lock held by stealer */
             if (timeout && !timeout->task)
                return -ETIMEDOUT;
       fixup_owner()
          /* lock wasn't acquired, so,
             fixup_pi_state_owner skipped */

    return -ETIMEDOUT;

    /* At this point, we've returned -ETIMEDOUT to userspace, but the
     * futex word shows waiter to be the owner, and the pi_mutex has
     * stealer as the owner */

    futex_lock(LOCK_PI, uaddr2)
      -> bails with EDEADLK, futex word says we're owner.

    And suggested that what commit:

    73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")

    removes from fixup_owner() looks to be just what is needed. And indeed
    it is -- I completely missed that requeue_pi could also result in this
    case. So we need to restore that, except that subsequent patches, like
    commit:

    16ffa12d7425 ("futex: Pull rt_mutex_futex_unlock() out from under hb->lock")

    changed all the locking rules. Even without that, the sequence:

    - if (rt_mutex_futex_trylock(&q->pi_state->pi_mutex)) {
    -         locked = 1;
    -         goto out;
    - }

    - raw_spin_lock_irq(&q->pi_state->pi_mutex.wait_lock);
    - owner = rt_mutex_owner(&q->pi_state->pi_mutex);
    - if (!owner)
    -         owner = rt_mutex_next_owner(&q->pi_state->pi_mutex);
    - raw_spin_unlock_irq(&q->pi_state->pi_mutex.wait_lock);
    - ret = fixup_pi_state_owner(uaddr, q, owner);

    already suggests there were races; otherwise we'd never have to look
    at next_owner.

    So instead of doing 3 consecutive wait_lock sections with who knows
    what races, we do it all in a single section. Additionally, the usage
    of pi_state->owner in fixup_owner() was only safe because only the
    rt_mutex owner would modify it, which this additional case wrecks.

    Luckily the values can only change away from, and not to, the value we're
    testing; this means we can do a speculative test and double check once
    we have the wait_lock.
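
    The "speculative test, then re-check under wait_lock" pattern looks
    roughly like this (names simplified and purely illustrative; the real
    logic is in fixup_owner()/fixup_pi_state_owner()):

        /* Lockless, speculative check first ... */
        if (q->pi_state->owner != current)
                ret = fixup_pi_state_owner(uaddr, q, current);

        /* ... and fixup_pi_state_owner() re-checks once it holds wait_lock,
         * relying on the fact that the owner can only change away from, and
         * never to, the value tested above. */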

    Fixes: 73d786bd043e ("futex: Rework inconsistent rt_mutex/futex_q state")
    Reported-by: Julia Cartwright
    Reported-by: Gratian Crisan
    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Thomas Gleixner
    Tested-by: Julia Cartwright
    Tested-by: Gratian Crisan
    Cc: Darren Hart
    Cc: stable@vger.kernel.org
    Link: https://lkml.kernel.org/r/20171208124939.7livp7no2ov65rrc@hirez.programming.kicks-ass.net

    Peter Zijlstra
     

13 Jul, 2017

1 commit

  • We don't need to adjust the priority before adding a new pi_waiter; the
    priority only needs to be updated after a pi_waiter change or a task
    priority change.

    Steven Rostedt pointed out:

    "Interesting, I did some git mining and this was added with the original
    entry of the rtmutex.c (23f78d4a03c5). Looking at even that version, I
    don't see the purpose of adjusting the task prio here. It is done
    before anything changes in the task."

    Signed-off-by: Alex Shi
    Reviewed-by: Steven Rostedt (VMware)
    Acked-by: Peter Zijlstra (Intel)
    Cc: Juri Lelli
    Cc: Linus Torvalds
    Cc: Mathieu Poirier
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1499926704-28841-1-git-send-email-alex.shi@linaro.org
    [ Enhance the changelog. ]
    Signed-off-by: Ingo Molnar

    Alex Shi
     

20 Jun, 2017

1 commit

  • pi_mutex isn't supposed to be tracked by lockdep, but just
    passing NULLs for name and key will cause lockdep to spew a
    warning and die, which is not what we want it to do.

    Skip lockdep initialization if the caller passed NULLs for
    name and key, suggesting such initialization isn't desired.
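
    The fix boils down to making the debug initialization conditional,
    roughly (sketch of __rt_mutex_init(); the real code may differ in detail):

        void __rt_mutex_init(struct rt_mutex *lock, const char *name,
                             struct lock_class_key *key)
        {
                lock->owner = NULL;
                raw_spin_lock_init(&lock->wait_lock);
                /* ... waiter tree init ... */

                if (name && key)
                        debug_rt_mutex_init(lock, name, key);
        }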

    Signed-off-by: Sasha Levin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: f5694788ad8d ("rt_mutex: Add lockdep annotations")
    Link: http://lkml.kernel.org/r/20170618140548.4763-1-alexander.levin@verizon.com
    Signed-off-by: Ingo Molnar

    Levin, Alexander (Sasha Levin)
     

08 Jun, 2017

1 commit

  • Now that (PI) futexes have their own private RT-mutex interface and
    implementation we can easily add lockdep annotations to the existing
    RT-mutex interface.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 May, 2017

1 commit

  • Markus reported that the glibc/nptl/tst-robustpi8 test was failing after
    commit:

    cfafcd117da0 ("futex: Rework futex_lock_pi() to use rt_mutex_*_proxy_lock()")

    The following trace shows the problem:

    ld-linux-x86-64-2161 [019] .... 410.760971: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_LOCK_PI
    ld-linux-x86-64-2161 [019] ...1 410.760972: lock_pi_update_atomic: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000875 ret=0
    ld-linux-x86-64-2165 [011] .... 410.760978: SyS_futex: 00007ffbeb76b028: 80000875 op=FUTEX_UNLOCK_PI
    ld-linux-x86-64-2165 [011] d..1 410.760979: do_futex: 00007ffbeb76b028: curval=80000875 uval=80000875 newval=80000871 ret=0
    ld-linux-x86-64-2165 [011] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=0000
    ld-linux-x86-64-2161 [019] .... 410.760980: SyS_futex: 00007ffbeb76b028: 80000871 ret=ETIMEDOUT

    Task 2165 does an UNLOCK_PI, assigning the lock to the waiter task 2161
    which then returns with -ETIMEDOUT. That wrecks the lock state, because now
    the owner isn't aware it acquired the lock and removes the pending robust
    list entry.

    If 2161 is killed, the robust list will not clear out this futex and the
    subsequent acquire on this futex will then (correctly) result in -ESRCH
    which is unexpected by glibc, triggers an internal assertion and dies.

    Task 2161                           Task 2165

    rt_mutex_wait_proxy_lock()
       timeout();
       /* T2161 is still queued in the waiter list */
       return -ETIMEDOUT;

                                        futex_unlock_pi()
                                        spin_lock(hb->lock);
                                        rtmutex_unlock()
                                          remove_rtmutex_waiter(T2161);
                                          mark_lock_available();
                                        /* Make the next waiter owner of the user space side */
                                        futex_uval = 2161;
                                        spin_unlock(hb->lock);
    spin_lock(hb->lock);
    rt_mutex_cleanup_proxy_lock()
      if (rtmutex_owner() !== current)
         ...
         return FAIL;
    ....
    return -ETIMEDOUT;

    This means that rt_mutex_cleanup_proxy_lock() needs to call
    try_to_take_rt_mutex() so it can correctly take over the rtmutex which
    was assigned to it by the waker. If the rtmutex is owned by some other
    task then this call is harmless and just confirms that the waiter is not
    able to acquire it.

    While there, fix what looks like a merge error which resulted in
    rt_mutex_cleanup_proxy_lock() having two calls to
    fixup_rt_mutex_waiters() and rt_mutex_wait_proxy_lock() not having any.
    Both should have one, since both potentially touch the waiter list.
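
    A sketch of the resulting rt_mutex_cleanup_proxy_lock() (simplified; see
    the patch for the full comments):

        bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
                                         struct rt_mutex_waiter *waiter)
        {
                bool cleanup = false;

                raw_spin_lock_irq(&lock->wait_lock);
                /*
                 * Unconditional try-lock: this handles the case where the
                 * waker already assigned us the lock (hand-off via a NULL
                 * owner).
                 */
                try_to_take_rt_mutex(lock, current, waiter);
                /*
                 * If we did not become the owner, we are still enqueued; take
                 * the waiter off the wait list and tell the caller to clean up.
                 */
                if (rt_mutex_owner(lock) != current) {
                        remove_waiter(lock, waiter);
                        cleanup = true;
                }
                fixup_rt_mutex_waiters(lock);
                raw_spin_unlock_irq(&lock->wait_lock);

                return cleanup;
        }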

    Fixes: 38d589f2fd08 ("futex,rt_mutex: Restructure rt_mutex_finish_proxy_lock()")
    Reported-by: Markus Trippelsdorf
    Bug-Spotted-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Florian Weimer
    Cc: Darren Hart
    Cc: Sebastian Andrzej Siewior
    Cc: Markus Trippelsdorf
    Link: http://lkml.kernel.org/r/20170519154850.mlomgdsd26drq5j6@hirez.programming.kicks-ass.net
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

05 Apr, 2017

1 commit

  • mark_wakeup_next_waiter() already disables preemption; doing so again
    leaves us with an unpaired preempt_disable().

    Fixes: 2a1c60299406 ("rtmutex: Deboost before waking up the top waiter")
    Signed-off-by: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Link: http://lkml.kernel.org/r/1491379707.6538.2.camel@gmx.de
    Signed-off-by: Thomas Gleixner

    Mike Galbraith
     

04 Apr, 2017

7 commits

  • There was a pure ->prio comparison left in try_to_take_rt_mutex();
    convert it to use rt_mutex_waiter_less(), noting that greater-or-equal
    is not-less (both in kernel priority view).

    This necessitated the introduction of cmp_task() which creates a
    pointer to an unnamed stack variable of struct rt_mutex_waiter type to
    compare against tasks.

    With this, we can now also create and employ rt_mutex_waiter_equal().
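
    The trick the changelog describes looks roughly like this (illustrative
    sketch, names as used in the changelog):

        /* Build a throw-away waiter on the stack so a task can be compared
         * against real waiters with rt_mutex_waiter_less()/_equal(). */
        #define cmp_task(p)     \
                &(struct rt_mutex_waiter){ .prio = (p)->prio, .deadline = (p)->dl.deadline }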

    Reviewed-and-tested-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.455584638@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • rt_mutex_waiter::prio is a copy of task_struct::prio which is updated
    during the PI chain walk, such that the PI chain order isn't messed up
    by (asynchronous) task state updates.

    Currently rt_mutex_waiter_less() uses task state for deadline tasks;
    this is broken, since the task state can, as said above, change
    asynchronously, causing the RB tree order to change without actual
    tree update -> FAIL.

    Fix this by also copying the deadline into the rt_mutex_waiter state
    and updating it along with its prio field.

    Ideally we would also force PI chain updates whenever DL tasks update
    their deadline parameter, but for first approximation this is less
    broken than it was.
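
    Illustratively, wherever the waiter's prio is (re)copied from the task
    during the PI chain walk, the deadline now travels along with it:

        waiter->prio     = task->prio;
        waiter->deadline = task->dl.deadline;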

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.403992539@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • With the introduction of SCHED_DEADLINE the whole notion that priority
    is a single number is gone, therefore the @prio argument to
    rt_mutex_setprio() doesn't make sense anymore.

    So rework the code to pass a pi_task instead.

    Note this also fixes a problem with pi_top_task caching; previously we
    would not set the pointer (call rt_mutex_update_top_task) if the
    priority didn't change, which could lead to a stale pointer.

    As for the XXX, I think it's fine to use pi_task->prio, because if it
    differs from waiter->prio, a PI chain update is imminent.
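
    For reference, the interface change is roughly:

        /* Before */
        void rt_mutex_setprio(struct task_struct *p, int prio);

        /* After */
        void rt_mutex_setprio(struct task_struct *p, struct task_struct *pi_task);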

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.303827095@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Previous patches changed the meaning of the return value of
    rt_mutex_slowunlock(); update comments and code to reflect this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.255058238@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Currently dl tasks will actually return at the very beginning
    of rt_mutex_adjust_prio_chain() in !detect_deadlock cases:

        if (waiter->prio == task->prio) {
                if (!detect_deadlock)
                        goto out_unlock_pi; // out here
                else
                        requeue = false;
        }

    As the deadline value of blocked deadline tasks (waiters) never changes
    without changing their sched_class (and thus their prio), this seems
    reasonable, but it actually misses the chance of updating the
    rt_mutex_waiter's "dl_runtime(period)_copy" if a waiter updates its
    deadline parameters (dl_runtime, dl_period) or a boosted waiter changes
    to a !deadline class.

    Thus, keep deadline tasks from bailing out early by adding the
    !dl_prio() condition.
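
    Following the changelog's wording, the check becomes roughly (sketch; the
    exact helper used in the tree may differ):

        if (waiter->prio == task->prio && !dl_prio(task->prio)) {
                if (!detect_deadlock)
                        goto out_unlock_pi;
                else
                        requeue = false;
        }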

    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/1460633827-345-7-git-send-email-xlpang@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.206577901@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     
  • A crash happened while I was playing with deadline PI rtmutex.

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
    IP: [] rt_mutex_get_top_task+0x1f/0x30
    PGD 232a75067 PUD 230947067 PMD 0
    Oops: 0000 [#1] SMP
    CPU: 1 PID: 10994 Comm: a.out Not tainted

    Call Trace:
    [] enqueue_task+0x2c/0x80
    [] activate_task+0x23/0x30
    [] pull_dl_task+0x1d5/0x260
    [] pre_schedule_dl+0x16/0x20
    [] __schedule+0xd3/0x900
    [] schedule+0x29/0x70
    [] __rt_mutex_slowlock+0x4b/0xc0
    [] rt_mutex_slowlock+0xd1/0x190
    [] rt_mutex_timed_lock+0x53/0x60
    [] futex_lock_pi.isra.18+0x28c/0x390
    [] do_futex+0x190/0x5b0
    [] SyS_futex+0x80/0x180

    This is because rt_mutex_enqueue_pi() and rt_mutex_dequeue_pi()
    are only protected by pi_lock when operating on pi waiters, while
    rt_mutex_get_top_task() will access them with the rq lock held but
    without holding pi_lock.

    In order to tackle it, we introduce a new "pi_top_task" pointer
    cached in task_struct, and add a new rt_mutex_update_top_task()
    to update its value; it can be called by rt_mutex_setprio(),
    which holds both the owner's pi_lock and the rq lock. Thus
    "pi_top_task" can be safely accessed by enqueue_task_dl() under
    the rq lock.
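
    A sketch of the new helper (simplified):

        void rt_mutex_update_top_task(struct task_struct *p)
        {
                if (!task_has_pi_waiters(p)) {
                        p->pi_top_task = NULL;
                        return;
                }

                p->pi_top_task = task_top_pi_waiter(p)->task;
        }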

    Originally-From: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Reviewed-by: Thomas Gleixner
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.157682758@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     
  • We should deboost before waking the high-priority task, such that we
    don't run two tasks with the same "state" (priority, deadline,
    sched_class, etc).

    In order to make sure the boosting task doesn't start running between
    unlock and deboost (due to a 'spurious' wakeup), we move the deboost
    under the wait_lock; that way it's serialized against the wait loop in
    __rt_mutex_slowlock().

    Doing the deboost early can however lead to priority inversion if
    current gets preempted after the deboost but before waking our
    high-prio task, hence we disable preemption before doing the deboost
    and enable it after the wakeup is over.

    This gets us the right semantic order, but most importantly however;
    this change ensures pointer stability for the next patch, where we
    have rt_mutex_setprio() cache a pointer to the top-most waiter task.
    If we, as before this change, do the wakeup first and then deboost,
    this pointer might point into thin air.

    [peterz: Changelog + patch munging]
    Suggested-by: Peter Zijlstra
    Signed-off-by: Xunlei Pang
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Steven Rostedt
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170323150216.110065320@infradead.org
    Signed-off-by: Thomas Gleixner

    Xunlei Pang
     

24 Mar, 2017

6 commits

  • When PREEMPT_RT_FULL does the spinlock -> rt_mutex substitution the PI
    chain code will (falsely) report a deadlock and BUG.

    The problem is that it holds hb->lock (now an rt_mutex) while doing
    task_blocks_on_rt_mutex() on the futex's pi_state::rtmutex. This, when
    interleaved just right with futex_unlock_pi(), leads it to believe it
    sees an AB-BA deadlock.

    Task1 (holds rt_mutex,              Task2 (does FUTEX_LOCK_PI)
           does FUTEX_UNLOCK_PI)

                                        lock hb->lock
                                        lock rt_mutex (as per start_proxy)
    lock hb->lock

    Which is a trivial AB-BA.

    It is not an actual deadlock, because it won't be holding hb->lock by the
    time it actually blocks on the rt_mutex, but the chainwalk code doesn't
    know that and it would be a nightmare to handle this gracefully.

    To avoid this problem, do the same as in futex_unlock_pi() and drop
    hb->lock after acquiring wait_lock. This still fully serializes against
    futex_unlock_pi(), since adding to the wait_list does the very same lock
    dance, and removing it holds both locks.

    Aside from solving the RT problem, this makes the lock and unlock
    mechanism symmetric and reduces the hb->lock hold time.

    Reported-and-tested-by: Sebastian Andrzej Siewior
    Suggested-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.161341537@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • By changing futex_lock_pi() to use rt_mutex_*_proxy_lock() all wait_list
    modifications are done under both hb->lock and wait_lock.

    This closes the obvious interleave pattern between futex_lock_pi() and
    futex_unlock_pi(), but not entirely so. See below:

    Before:

    futex_lock_pi()                     futex_unlock_pi()
      unlock hb->lock

                                          lock hb->lock
                                          unlock hb->lock

                                          lock rt_mutex->wait_lock
                                          unlock rt_mutex_wait_lock
                                            -EAGAIN

      lock rt_mutex->wait_lock
      list_add
      unlock rt_mutex->wait_lock

      schedule()

      lock rt_mutex->wait_lock
      list_del
      unlock rt_mutex->wait_lock

                                          <idem>
                                            -EAGAIN

      lock hb->lock


    After:

    futex_lock_pi()                     futex_unlock_pi()

      lock hb->lock
      lock rt_mutex->wait_lock
      list_add
      unlock rt_mutex->wait_lock
      unlock hb->lock

      schedule()
      lock hb->lock
      unlock hb->lock
                                          lock hb->lock
                                          lock rt_mutex->wait_lock
                                          list_del
                                          unlock rt_mutex->wait_lock

                                          lock rt_mutex->wait_lock
                                          unlock rt_mutex_wait_lock
                                            -EAGAIN

                                          unlock hb->lock

    It does however solve the earlier starvation/live-lock scenario which got
    introduced with the -EAGAIN: unlike the before scenario, where the
    -EAGAIN happens while futex_unlock_pi() doesn't hold any locks, in the
    after scenario it happens while futex_unlock_pi() actually holds a lock,
    and then it is serialized on that lock.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.062785528@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • With the ultimate goal of keeping rt_mutex wait_list and futex_q waiters
    consistent it's necessary to split 'rt_mutex_futex_lock()' into finer
    parts, such that only the actual blocking can be done without hb->lock
    held.

    Split rt_mutex_finish_proxy_lock() into two parts, one that does the
    blocking and one that does remove_waiter() when the lock acquire failed.

    When the rtmutex was acquired successfully the waiter can be removed in the
    acquisition path safely, since there is no concurrency on the lock owner.

    This means that, except for futex_lock_pi(), all wait_list modifications
    are done with both hb->lock and wait_lock held.
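
    As a rough sketch, the prototypes of the resulting split are:

        /* Wait for the lock; the waiter stays enqueued on failure. */
        int rt_mutex_wait_proxy_lock(struct rt_mutex *lock,
                                     struct hrtimer_sleeper *to,
                                     struct rt_mutex_waiter *waiter);

        /* Called under hb->lock when the acquire failed: removes the waiter
         * unless we turned out to own the lock after all. */
        bool rt_mutex_cleanup_proxy_lock(struct rt_mutex *lock,
                                         struct rt_mutex_waiter *waiter);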

    [bigeasy@linutronix.de: fix for futex_requeue_pi_signal_restart]

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104152.001659630@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Since there's already two copies of this code, introduce a helper now
    before adding a third one.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.950039479@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • Part of what makes futex_unlock_pi() intricate is that
    rt_mutex_futex_unlock() -> rt_mutex_slowunlock() can drop
    rt_mutex::wait_lock.

    This means it cannot rely on the atomicity of wait_lock, which would be
    preferred in order to not rely on hb->lock so much.

    The reason rt_mutex_slowunlock() needs to drop wait_lock is because it can
    race with the rt_mutex fastpath, however futexes have their own fast path.

    Since futexes already have a bunch of separate rt_mutex accessors, complete
    that set and implement a rt_mutex variant without fastpath for them.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.702962446@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
  • These are unused and clutter up the code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: juri.lelli@arm.com
    Cc: bigeasy@linutronix.de
    Cc: xlpang@redhat.com
    Cc: rostedt@goodmis.org
    Cc: mathieu.desnoyers@efficios.com
    Cc: jdesfossez@efficios.com
    Cc: dvhart@infradead.org
    Cc: bristot@redhat.com
    Link: http://lkml.kernel.org/r/20170322104151.652692478@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     

30 Jan, 2017

1 commit

  • Running my likely/unlikely profiler for 3 weeks on two production
    machines, I discovered that the unlikely() test in
    __rt_mutex_slowlock() checking if state is TASK_INTERRUPTIBLE is hit
    100% of the time, making it a very likely case.

    The reason is that, on a vanilla kernel, the majority of calls into the
    rt_mutex code come from the futex code, which always calls with
    TASK_INTERRUPTIBLE. In the -rt patch, with PREEMPT_RT enabled, this code
    is commonly called with TASK_UNINTERRUPTIBLE, but that's not the likely
    scenario.

    The rt_mutex code should be optimized for the common vanilla case, and
    that is a call from a futex, with TASK_INTERRUPTIBLE as the state.
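
    The change itself is a one-line hint flip in __rt_mutex_slowlock(),
    roughly (sketch):

        /* was: if (unlikely(state == TASK_INTERRUPTIBLE)) { */
        if (likely(state == TASK_INTERRUPTIBLE)) {
                /* signal and timeout checks */
        }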

    Signed-off-by: Steven Rostedt (VMware)
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170119113234.1efeedd1@gandalf.local.home
    Signed-off-by: Ingo Molnar

    Steven Rostedt (VMware)
     

02 Dec, 2016

3 commits

  • While debugging the unlock vs. dequeue race which resulted in state
    corruption of futexes, the lockless nature of rt_mutex_proxy_unlock()
    caused some confusion.

    Add commentary to explain why it is safe to do this locklessly. Add
    matching comments to rt_mutex_init_proxy_locked() for completeness' sake.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20161130210030.591941927@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0                CPU1                            CPU2

    l->owner=T1
                        rt_mutex_lock(l)
                        lock(l->wait_lock)
                        l->owner = T1 | HAS_WAITERS;
                        enqueue(T2)
                        boost()
                        unlock(l->wait_lock)
                        schedule()

                                                        rt_mutex_lock(l)
                                                        lock(l->wait_lock)
                                                        l->owner = T1 | HAS_WAITERS;
                                                        enqueue(T3)
                                                        boost()
                                                        unlock(l->wait_lock)
                                                        schedule()
                        signal(->T2)                    signal(->T3)
                        lock(l->wait_lock)
                        dequeue(T2)
                        deboost()
                        unlock(l->wait_lock)
                                                        lock(l->wait_lock)
                                                        dequeue(T3)
                                                         ===> wait list is now empty
                                                        deboost()
                                                        unlock(l->wait_lock)
                        lock(l->wait_lock)
                        fixup_rt_mutex_waiters()
                          if (wait_list_empty(l)) {
                            owner = l->owner & ~HAS_WAITERS;
                            l->owner = owner
                             ==> l->owner = T1
                          }

                                                        lock(l->wait_lock)
    rt_mutex_unlock(l)                                  fixup_rt_mutex_waiters()
                                                          if (wait_list_empty(l)) {
                                                            owner = l->owner & ~HAS_WAITERS;
      cmpxchg(l->owner, T1, NULL)
       ===> Success (l->owner = NULL)
                                                            l->owner = owner
                                                             ==> l->owner = T1
                                                          }

    That means the problem is caused by fixup_rt_mutex_waiters() which does the
    RMW to clear the waiters bit unconditionally when there are no waiters in
    the rtmutex's rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after the
    successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.
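
    The resulting helper looks roughly like this (sketch; the real version
    carries a long explanatory comment):

        static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
        {
                unsigned long owner, *p = (unsigned long *) &lock->owner;

                if (rt_mutex_has_waiters(lock))
                        return;

                /* Only clear the bit if it is actually set; a plain store
                 * could otherwise overwrite a concurrent fastpath unlock. */
                owner = READ_ONCE(*p);
                if (owner & RT_MUTEX_HAS_WAITERS)
                        WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
        }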

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but it refuses to reproduce on x86-64, while
    the problem exists there as well. That refusal might explain why this was
    not discovered earlier, despite the bug existing since day one of the
    rtmutex implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information which allowed us to decode this subtle problem.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

21 Nov, 2016

1 commit

  • Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function doing something as "wake" is
    a verb. Even checkpatch.pl was confused as it reported warnings like

    WARNING: Missing a blank line after declarations
    #548: FILE: kernel/futex.c:3665:
    + int ret;
    + WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q() which clarifies
    what the macro is doing and eliminates the checkpatch.pl warnings.
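
    The rename at the call sites is mechanical:

        /* Before */
        WAKE_Q(wake_q);

        /* After */
        DEFINE_WAKE_Q(wake_q);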

    Signed-off-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar

    Waiman Long
     

08 Jun, 2016

1 commit

  • One warning should be enough to motivate someone to fix this. It is
    possible that this happens more than once, which would start flooding
    the output; later prints will be suppressed so we only get half of it,
    and depending on the console system used that might not be helpful.

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior