02 Dec, 2016

1 commit

  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0:  l->owner = T1

    CPU1:  rt_mutex_lock(l)
    CPU1:    lock(l->wait_lock)
    CPU1:    l->owner = T1 | HAS_WAITERS;
    CPU1:    enqueue(T2)
    CPU1:    boost()
    CPU1:    unlock(l->wait_lock)
    CPU1:    schedule()

    CPU2:  rt_mutex_lock(l)
    CPU2:    lock(l->wait_lock)
    CPU2:    l->owner = T1 | HAS_WAITERS;
    CPU2:    enqueue(T3)
    CPU2:    boost()
    CPU2:    unlock(l->wait_lock)
    CPU2:    schedule()

    CPU0:  signal(->T2)
    CPU1:  signal(->T3)

    CPU1:  lock(l->wait_lock)
    CPU1:  dequeue(T2)
    CPU1:  deboost()
    CPU1:  unlock(l->wait_lock)

    CPU2:  lock(l->wait_lock)
    CPU2:  dequeue(T3)
    CPU2:    ===> wait list is now empty
    CPU2:  deboost()
    CPU2:  unlock(l->wait_lock)

    CPU1:  lock(l->wait_lock)
    CPU1:  fixup_rt_mutex_waiters()
    CPU1:    if (wait_list_empty(l)) {
    CPU1:      owner = l->owner & ~HAS_WAITERS;
    CPU1:      l->owner = owner
    CPU1:       ==> l->owner = T1
    CPU1:    }

    CPU2:  lock(l->wait_lock)
    CPU0:  rt_mutex_unlock(l)
    CPU2:  fixup_rt_mutex_waiters()
    CPU2:    if (wait_list_empty(l)) {
    CPU2:      owner = l->owner & ~HAS_WAITERS;
    CPU0:        cmpxchg(l->owner, T1, NULL)
    CPU0:         ===> Success (l->owner = NULL)
    CPU2:      l->owner = owner
    CPU2:       ==> l->owner = T1
    CPU2:    }

    That means the problem is caused by fixup_rt_mutex_waiters() which does the
    RMW to clear the waiters bit unconditionally when there are no waiters in
    the rtmutexes rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation, then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after
    the successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.
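
    For illustration, here is a minimal userspace model of that rule, written
    with plain C11 atomics (HAS_WAITERS, struct model_rtmutex and the helper
    names are invented for the sketch; it is not the kernel's
    fixup_rt_mutex_waiters()). The point is that the owner field is never
    written at all unless the waiters bit is actually set:

    #include <stdatomic.h>
    #include <stdint.h>
    #include <stdio.h>

    #define HAS_WAITERS 0x1UL

    struct model_rtmutex {
            _Atomic uintptr_t owner;        /* task pointer | HAS_WAITERS */
            int wait_list_empty;            /* stand-in for the rbtree check */
    };

    /* Called with wait_lock held in the real code. */
    static void fixup_waiters_bit(struct model_rtmutex *m)
    {
            uintptr_t owner = atomic_load(&m->owner);

            if (!m->wait_list_empty)
                    return;
            /* The crucial check: leave the field alone unless the bit is set. */
            if (owner & HAS_WAITERS)
                    atomic_store(&m->owner, owner & ~HAS_WAITERS);
    }

    int main(void)
    {
            struct model_rtmutex m = { .owner = 0xdead0, .wait_list_empty = 1 };

            fixup_waiters_bit(&m);          /* bit clear: owner is untouched */
            printf("owner=%#lx\n", (unsigned long)atomic_load(&m.owner));
            return 0;
    }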

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but refuses to reproduce on x86-64, even
    though the problem exists there as well. That refusal might explain why
    this was not discovered earlier, despite the bug existing from day one of
    the rtmutex implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information which allowed this subtle problem to be decoded.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

08 Jun, 2016

1 commit

    One warning should be enough to get someone motivated to fix this. It is
    possible that this happens more than once and starts flooding the output.
    Later the prints will be suppressed, so we only get half of the output,
    and depending on the console system used that might not be helpful.
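
    The kernel-side change is a switch to a warn-once style check; below is a
    small userspace model of that pattern (the names and the message are made
    up, only the "first caller wins" logic matters):

    #include <stdatomic.h>
    #include <stdio.h>

    static atomic_flag warned = ATOMIC_FLAG_INIT;

    static void warn_once(const char *msg)
    {
            /* Only the first caller prints; later hits are dropped instead
             * of flooding the console and getting ratelimited away. */
            if (!atomic_flag_test_and_set(&warned))
                    fprintf(stderr, "WARNING: %s\n", msg);
    }

    int main(void)
    {
            for (int i = 0; i < 5; i++)
                    warn_once("rt_mutex_trylock() called from IRQ context");
            return 0;
    }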

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0:  rtmutex_lock(&rcu->rt_mutex)
    CPU0:    spin_lock(&rcu->rt_mutex.wait_lock)
    CPU1:  local_irq_disable()
    CPU1:  spin_lock(&timer->it_lock)
    CPU1:  spin_lock(&rcu->mutex.wait_lock)
    CPU0:  --> Interrupt
    CPU0:        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock()
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping the RCU read lock
    held across the spin-locked and irq-disabled section, but the above
    sequence is common and there is no reason not to support it.

    Making rt_mutex.wait_lock irq safe prevents the deadlock.
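
    The shape of the conversion looks roughly like the fragment below. This
    is illustrative only (slowpath_example() is an invented name and the
    fragment is not compilable outside a kernel tree); the point is that
    every wait_lock critical section becomes irq safe:

    static void slowpath_example(struct rt_mutex *lock)
    {
            unsigned long flags;

            /* was: raw_spin_lock(&lock->wait_lock); */
            raw_spin_lock_irqsave(&lock->wait_lock, flags);
            /* ... enqueue/dequeue waiters, boost/deboost ... */
            raw_spin_unlock_irqrestore(&lock->wait_lock, flags);
    }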

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

04 Nov, 2015

1 commit

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle were:

    - sched/fair load tracking fixes and cleanups (Byungchul Park)

    - Make load tracking frequency scale invariant (Dietmar Eggemann)

    - sched/deadline updates (Juri Lelli)

    - stop machine fixes, cleanups and enhancements for bugs triggered by
    CPU hotplug stress testing (Oleg Nesterov)

    - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
    cleanups (Peter Zijlstra)

    - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

    - Optimize per entity utilization tracking (Peter Zijlstra)

    - ... misc other fixes, cleanups and smaller updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
    sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
    sched: Start stopper early
    stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
    stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
    stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
    stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
    stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
    sched/x86: Fix typo in __switch_to() comments
    sched/core: Remove a parameter in the migrate_task_rq() function
    sched/core: Drop unlikely behind BUG_ON()
    sched/core: Fix task and run queue sched_info::run_delay inconsistencies
    sched/numa: Fix task_tick_fair() from disabling numa_balancing
    sched/core: Add preempt_count invariant check
    sched/core: More notrace annotations
    sched/core: Kill PREEMPT_ACTIVE
    sched/core, sched/x86: Kill thread_info::saved_preempt_count
    sched/core: Simplify preempt_count tests
    sched/core: Robustify preemption leak checks
    sched/core: Stop setting PREEMPT_ACTIVE
    ...

    Linus Torvalds
     

06 Oct, 2015

1 commit

    As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
    variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
    Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
    ordered archs can benefit from more relaxed use of barriers when locking
    and unlocking, instead of regular full barrier semantics. While currently
    only arm64 supports such optimizations, updating the corresponding locking
    primitives allows other archs to benefit immediately as well, once the
    necessary machinery is implemented, of course.
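
    As a rough userspace analogue of the idea (C11 atomics, invented names;
    the kernel work instead adds acquire/release/relaxed variants of its own
    atomic ops and uses them in the locking primitives): acquire ordering is
    enough on the lock side and release ordering on the unlock side, instead
    of full barriers in both places.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct model_lock {
            _Atomic(void *) owner;
    };

    static bool model_trylock(struct model_lock *l, void *me)
    {
            void *expected = NULL;

            /* Success only needs acquire semantics. */
            return atomic_compare_exchange_strong_explicit(&l->owner, &expected,
                            me, memory_order_acquire, memory_order_relaxed);
    }

    static void model_unlock(struct model_lock *l)
    {
            /* Publishing the critical section only needs release semantics. */
            atomic_store_explicit(&l->owner, NULL, memory_order_release);
    }

    int main(void)
    {
            struct model_lock l = { .owner = NULL };
            int me;

            if (model_trylock(&l, &me))
                    model_unlock(&l);
            return 0;
    }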

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E.McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1443643395-17016-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

23 Sep, 2015

1 commit

    The rt_mutex_waiter_less() check of task deadlines is open-coded. Since
    this is subject to wraparound bugs, make it use the correct helper.
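
    The helper boils down to the signed-difference idiom shown below (a small
    standalone sketch mirroring dl_time_before(); the naive a < b comparison
    gives the wrong answer once a u64 deadline wraps):

    #include <stdint.h>
    #include <stdio.h>

    static inline int dl_before(uint64_t a, uint64_t b)
    {
            return (int64_t)(a - b) < 0;
    }

    int main(void)
    {
            uint64_t near_wrap = UINT64_MAX - 5;    /* just before the wrap */
            uint64_t wrapped = 10;                  /* just after the wrap  */

            printf("naive:  %d\n", near_wrap < wrapped);            /* 0 */
            printf("helper: %d\n", dl_before(near_wrap, wrapped));  /* 1 */
            return 0;
    }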

    Reported-by: Luca Abeni
    Signed-off-by: Juri Lelli
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1441188096-23021-4-git-send-email-juri.lelli@arm.com
    Signed-off-by: Ingo Molnar

    Juri Lelli
     

20 Jul, 2015

1 commit

    No one uses this anymore, and this is not the first time the idea of
    replacing it with a (now possible) userspace equivalent has come up. The
    lock stealing logic was removed long ago, back when the lock started
    being granted to the highest priority waiter.

    Signed-off-by: Davidlohr Bueso
    Cc: Darren Hart
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1435782588-4177-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

25 Jun, 2015

1 commit

  • Pull locking updates from Thomas Gleixner:
    "These locking updates depend on the alreay merged sched/core branch:

    - Lockless top waiter wakeup for rtmutex (Davidlohr)

    - Reduce hash bucket lock contention for PI futexes (Sebastian)

    - Documentation update (Davidlohr)"

    * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Update stale plist comments
    futex: Lower the lock contention on the HB lock during wake up
    locking/rtmutex: Implement lockless top-waiter wakeup

    Linus Torvalds
     

23 Jun, 2015

2 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non-2038-safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes are:

    - 'qspinlock' support, enabled on x86: queued spinlocks - these are
    now the spinlock variant used by x86 as they outperform ticket
    spinlocks in every category. (Waiman Long)

    - 'pvqspinlock' support on x86: paravirtualized variant of queued
    spinlocks. (Waiman Long, Peter Zijlstra)

    - 'qrwlock' support, enabled on x86: queued rwlocks. Similar to
    queued spinlocks, they are now the variant used by x86:

    CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
    CONFIG_QUEUED_SPINLOCKS=y
    CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
    CONFIG_QUEUED_RWLOCKS=y

    - various lockdep fixlets

    - various locking primitives cleanups, further WRITE_ONCE()
    propagation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    locking/lockdep: Remove hard coded array size dependency
    locking/qrwlock: Don't contend with readers when setting _QW_WAITING
    lockdep: Do not break user-visible string
    locking/arch: Rename set_mb() to smp_store_mb()
    locking/arch: Add WRITE_ONCE() to set_mb()
    rtmutex: Warn if trylock is called from hard/softirq context
    arch: Remove __ARCH_HAVE_CMPXCHG
    locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
    locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
    locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
    locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
    locking/pvqspinlock, x86: Enable PV qspinlock for Xen
    locking/pvqspinlock, x86: Enable PV qspinlock for KVM
    locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
    locking/pvqspinlock: Implement simple paravirt support for the qspinlock
    locking/qspinlock: Revert to test-and-set on hypervisors
    locking/qspinlock: Use a simple write to grab the lock
    locking/qspinlock: Optimize for smaller NR_CPUS
    locking/qspinlock: Extract out code snippets for the next patch
    locking/qspinlock: Add pending bit
    ...

    Linus Torvalds
     

20 Jun, 2015

2 commits

  • ... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
    no longer use plists for queuing any waiters. Update stale comments.

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     
    wake_futex_pi() wakes the task before releasing the hash bucket lock
    (HB). The first thing the woken-up task usually does is acquire the lock,
    which requires the HB lock. On SMP systems this leads to blocking on the
    HB lock, which is released by the owner shortly after.
    This patch rearranges the unlock path by first releasing the HB lock and
    then waking up the task.

    [ tglx: Fixed up the rtmutex unlock path ]

    Originally-from: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

19 Jun, 2015

1 commit

    Mark the task for later wakeup after the wait_lock has been released.
    This way, once the next task is awoken, it will have a better chance of
    finding the wait_lock free when it continues executing in
    __rt_mutex_slowlock() when trying to acquire the rtmutex, calling
    try_to_take_rt_mutex(). In contended scenarios, other tasks attempting to
    take the lock may acquire it first, right after the wait_lock is
    released, but (a) this can also occur with the current code, as it relies
    on the spinlock fairness, and (b) we are dealing with the top waiter
    anyway, so it will always take the lock next.
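
    The mechanism behind this is the deferred-wakeup pattern (wake_q_add()
    while holding wait_lock, wake_up_q() after dropping it). A tiny userspace
    model of that pattern, with invented names, just to show the ordering:

    #include <stdio.h>

    struct wake_list {
            int nr;
            const char *task[8];
    };

    static void wake_mark(struct wake_list *wl, const char *task)
    {
            if (wl->nr < 8)
                    wl->task[wl->nr++] = task;      /* cheap, under the lock */
    }

    static void wake_all(struct wake_list *wl)
    {
            for (int i = 0; i < wl->nr; i++)        /* after unlocking */
                    printf("waking %s\n", wl->task[i]);
            wl->nr = 0;
    }

    int main(void)
    {
            struct wake_list wl = { 0 };

            /* lock(wait_lock); */
            wake_mark(&wl, "top waiter");
            /* unlock(wait_lock); */
            wake_all(&wl);
            return 0;
    }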

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     

19 May, 2015

1 commit


14 May, 2015

1 commit

  • rt_mutex_trylock() must be called from thread context. It can be
    called from atomic regions (preemption or interrupts disabled), but
    not from hard/softirq/nmi context. Add a warning to alert abusers.

    The reasons for this are:

    1) There is a potential deadlock in the slowpath

    2) Another cpu which blocks on the rtmutex will boost the task
    which allegedly locked the rtmutex, but that cannot work
    because the hard/softirq context borrows the task context.
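
    The added check looks roughly like the fragment below (illustrative, not
    the verbatim patch; the fast-path helper and context checks are existing
    kernel interfaces):

    int __sched rt_mutex_trylock(struct rt_mutex *lock)
    {
            if (WARN_ON(in_irq() || in_nmi() || in_serving_softirq()))
                    return 0;

            return rt_mutex_fasttrylock(lock, rt_mutex_slowtrylock);
    }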

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior

    Thomas Gleixner
     

13 May, 2015

1 commit

    The rtmutex code is the only user of __HAVE_ARCH_CMPXCHG and we have a
    few other users of cmpxchg() which do not care about __HAVE_ARCH_CMPXCHG.
    This define was first introduced in 23f78d4a0 ("[PATCH] pi-futex: rt
    mutex core") which is v2.6.18. The generic cmpxchg was introduced later
    in 068fbad288 ("Add cmpxchg_local to asm-generic for per cpu atomic
    operations") which is v2.6.25.
    Back then something was required to get rtmutex working with the fast
    path on architectures without cmpxchg, and this seems to be the result.

    It popped up recently on rt-users because ARM (v6+) does not define
    __HAVE_ARCH_CMPXCHG (even though it implements it), which results in
    slower locking performance in the fast path.
    To put some numbers on it: preempt-RT, am335x, 10 loops of 100000
    invocations of rt_spin_lock() + rt_spin_unlock() ("total" is the average
    of the 10 loops for the 100000 invocations, "loop" is
    "total / 100000 * 1000"):

    cmpxchg |     slowpath used     ||      cmpxchg used
            |   total    | loop     ||   total     | loop
    --------|------------|----------||-------------|--------
    ARMv6   | 9129.4 us  | 91 ns    || 3311.9 us   | 33 ns
    generic | 9360.2 us  | 94 ns    || 10834.6 us  | 108 ns
    --------------------------------||----------------------

    Forcing it to the generic cmpxchg() made things worse for the slowpath
    and even worse for the cmpxchg() path. It boils down to 14 ns more per
    lock+unlock in a cache-hot loop, so it might not be that much in the real
    world. The last test was a substitute for a pre-ARMv6 machine, but then I
    was able to perform the comparison on imx28, which is ARMv5 and therefore
    always uses the generic cmpxchg implementation. And the numbers:

             |    total     | loop
    ---------|--------------|---------
    slowpath | 263937.2 us  | 2639 ns
    cmpxchg  | 16934.2 us   | 169 ns
    ---------------------------------

    The numbers are larger since the machine is slower in general. However,
    letting rtmutex use cmpxchg() instead of the slowpath seems to improve
    things.

    Since from the ARM (tested on am335x + imx28) point of view always using
    cmpxchg() in rt_mutex_lock() + rt_mutex_unlock() makes sense, I would
    drop the define.
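
    For context, this is roughly what the define gated in rtmutex.c (an
    illustrative reconstruction, not the verbatim file); the patch reduces
    the condition to the debug check alone, so every architecture gets the
    cmpxchg()-based fast path:

    #if defined(__HAVE_ARCH_CMPXCHG) && !defined(CONFIG_DEBUG_RT_MUTEXES)
    # define rt_mutex_cmpxchg(l, c, n)  (cmpxchg(&(l)->owner, c, n) == c)
    #else
    # define rt_mutex_cmpxchg(l, c, n)  (0)     /* always take the slowpath */
    #endif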

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: will.deacon@arm.com
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/20150225175613.GE6823@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

08 May, 2015

1 commit

  • Ronny reported that the following scenario is not handled correctly:

    T1 (prio = 10)
       lock(rtmutex);

    T2 (prio = 20)
       lock(rtmutex)
          boost T1

    T1 (prio = 20)
       sys_set_scheduler(prio = 30)
          T1 prio = 30
       ....
       sys_set_scheduler(prio = 10)
          T1 prio = 30

    The last step is wrong as T1 should now be back at prio 20.

    Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()")
    only handles the case where a boosted tasks tries to lower its
    priority.

    Fix it by taking the new effective priority into account when deciding
    whether a change of the priority is required.
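
    A small standalone model of that decision, using the same convention as
    the scenario above (bigger number = higher priority; the helper name is
    invented): the effective priority is the maximum of the newly requested
    priority and the top pi-waiter's priority, and only a change of that
    effective value requires touching the runqueue.

    #include <stdio.h>

    static int effective_prio(int requested, int top_waiter_prio)
    {
            return requested > top_waiter_prio ? requested : top_waiter_prio;
    }

    int main(void)
    {
            int top_waiter = 20;    /* T2 boosting T1 */

            /* setscheduler(30) -> 30, setscheduler(10) -> back to 20 */
            printf("after setscheduler(30): %d\n", effective_prio(30, top_waiter));
            printf("after setscheduler(10): %d\n", effective_prio(10, top_waiter));
            return 0;
    }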

    Reported-by: Ronny Meeus
    Tested-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt
    Cc:
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Mike Galbraith
    Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

22 Apr, 2015

1 commit

    The check for hrtimer_active() after starting the timer is pointless. If
    the timer is inactive it has expired already and therefore the task
    pointer is already NULL.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.081830481@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

14 Apr, 2015

1 commit

  • Pull core locking changes from Ingo Molnar:
    "Main changes:

    - jump label asm preparatory work for PowerPC (Anton Blanchard)

    - rwsem optimizations and cleanups (Davidlohr Bueso)

    - mutex optimizations and cleanups (Jason Low)

    - futex fix (Oleg Nesterov)

    - remove broken atomicity checks from {READ,WRITE}_ONCE() (Peter
    Zijlstra)"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    powerpc, jump_label: Include linux/jump_label.h to get HAVE_JUMP_LABEL define
    jump_label: Allow jump labels to be used in assembly
    jump_label: Allow asm/jump_label.h to be included in assembly
    locking/mutex: Further simplify mutex_spin_on_owner()
    locking: Remove atomicy checks from {READ,WRITE}_ONCE
    locking/rtmutex: Rename argument in the rt_mutex_adjust_prio_chain() documentation as well
    locking/rwsem: Fix lock optimistic spinning when owner is not running
    locking: Remove ACCESS_ONCE() usage
    locking/rwsem: Check for active lock before bailing on spinning
    locking/rwsem: Avoid deceiving lock spinners
    locking/rwsem: Set lock ownership ASAP
    locking/rwsem: Document barrier need when waking tasks
    locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads
    locking/mutex: Refactor mutex_spin_on_owner()
    locking/mutex: In mutex_spin_on_owner(), return true when owner changes

    Linus Torvalds
     

25 Mar, 2015

1 commit


01 Mar, 2015

1 commit

  • The "usual" path is:

    - rt_mutex_slowlock()
    - set_current_state()
    - task_blocks_on_rt_mutex() (ret 0)
    - __rt_mutex_slowlock()
    - sleep or not but do return with __set_current_state(TASK_RUNNING)
    - back to caller.

    In the early error case, where task_blocks_on_rt_mutex() returns
    -EDEADLK, we never change the task's state back to RUNNING. I assume
    this is intended. Without this change, after making ww_mutex use
    rt_mutex, the selftest passes but later I get plenty of:

    | bad: scheduling from the idle thread!

    backtraces.
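
    The fix boils down to restoring the task state on that error path; the
    error handling in rt_mutex_slowlock() ends up looking roughly like this
    (illustrative fragment, not the verbatim patch):

            ret = task_blocks_on_rt_mutex(lock, &waiter, current, chwalk);

            if (likely(!ret))
                    /* sleep on the mutex */
                    ret = __rt_mutex_slowlock(lock, state, timeout, &waiter);

            if (unlikely(ret)) {
                    __set_current_state(TASK_RUNNING);      /* the missing piece */
                    if (rt_mutex_has_waiters(lock))
                            remove_waiter(lock, &waiter);
            }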

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Mike Galbraith
    Cc: Linus Torvalds
    Cc: Maarten Lankhorst
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Fixes: afffc6c1805d ("locking/rtmutex: Optimize setting task running after being blocked")
    Link: http://lkml.kernel.org/r/1425056229-22326-4-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

18 Feb, 2015

1 commit

    With task_blocks_on_rt_mutex() returning -EDEADLK early, we never add
    the waiter to the waitqueue. Later, we try to remove it via
    remove_waiter() and go boom in rt_mutex_top_waiter() because rb_entry()
    gives a NULL pointer.

    ( Tested on v3.18-RT, where rtmutex is used for regular mutexes, and I
    tried to take one twice in a row. )

    Not sure when this started, but I guess 397335f004f4 ("rtmutex: Fix
    deadlock detector for real") or commit 3d5c9340d194 ("rtmutex:
    Handle deadlock detection smarter").

    Signed-off-by: Sebastian Andrzej Siewior
    Acked-by: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: # for v3.16 and later kernels
    Link: http://lkml.kernel.org/r/1424187823-19600-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     

04 Feb, 2015

1 commit

    We explicitly mark the task running after returning from a
    __rt_mutex_slowlock() call, which does the actual sleeping via
    wait-wake-trylocking. As such, this patch does two things:

    (1) refactors the code so that setting current to TASK_RUNNING is done
    by __rt_mutex_slowlock(), and not by the callers. The downside to this
    is that it becomes a bit unclear at what point we block. As such I've
    added a comment that the task blocks when calling __rt_mutex_slowlock(),
    so readers can figure out when it is running again.

    (2) relaxes setting current's state through __set_current_state(),
    instead of its more expensive barrier alternative. There was no need for
    the implied barrier as we're obviously not planning on blocking.
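
    The difference between the two state setters is just the memory barrier;
    a userspace model of the distinction (names invented, semantics
    simplified to C11 atomics):

    #include <stdatomic.h>

    #define TASK_RUNNING            0
    #define TASK_INTERRUPTIBLE      1

    static _Atomic int current_state;

    /* like set_current_state(): the store is ordered against later loads,
     * which matters when we are about to check a wake condition and sleep */
    static void set_state_mb(int state)
    {
            atomic_store_explicit(&current_state, state, memory_order_seq_cst);
    }

    /* like __set_current_state(): a plain store, fine when we already know
     * we are not going to sleep */
    static void set_state_plain(int state)
    {
            atomic_store_explicit(&current_state, state, memory_order_relaxed);
    }

    int main(void)
    {
            set_state_mb(TASK_INTERRUPTIBLE);   /* before checking the condition */
            set_state_plain(TASK_RUNNING);      /* no barrier needed here */
            return 0;
    }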

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422857784.18096.1.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

13 Aug, 2014

1 commit

  • Specifically:
    Documentation/locking/lockdep-design.txt
    Documentation/locking/lockstat.txt
    Documentation/locking/mutex-design.txt
    Documentation/locking/rt-mutex-design.txt
    Documentation/locking/rt-mutex.txt
    Documentation/locking/spinlocks.txt
    Documentation/locking/ww-mutex-design.txt

    Signed-off-by: Davidlohr Bueso
    Acked-by: Randy Dunlap
    Signed-off-by: Peter Zijlstra
    Cc: jason.low2@hp.com
    Cc: aswin@hp.com
    Cc: Alexei Starovoitov
    Cc: Al Viro
    Cc: Andrew Morton
    Cc: Chris Mason
    Cc: Dan Streetman
    Cc: David Airlie
    Cc: Davidlohr Bueso
    Cc: David S. Miller
    Cc: Greg Kroah-Hartman
    Cc: Heiko Carstens
    Cc: Jason Low
    Cc: Josef Bacik
    Cc: Kees Cook
    Cc: Linus Torvalds
    Cc: Lubomir Rintel
    Cc: Masanari Iida
    Cc: Paul E. McKenney
    Cc: Randy Dunlap
    Cc: Tim Chen
    Cc: Vineet Gupta
    Cc: fengguang.wu@intel.com
    Link: http://lkml.kernel.org/r/1406752916-3341-6-git-send-email-davidlohr@hp.com
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

22 Jun, 2014

9 commits

    In case the deadlock detector is enabled we follow the lock chain to the
    end in rt_mutex_adjust_prio_chain(), even if we could stop earlier due to
    the priority/waiter constellation.

    But once we are no longer the top priority waiter in a certain step, or
    the task holding the lock already has the same priority, then there is no
    point in dequeueing and enqueueing along the lock chain as there is no
    change at all.

    So stop the queueing at this point.
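
    The added check looks roughly like the fragment below (illustrative, not
    the verbatim patch): once the requeued waiter's priority equals the
    priority of the task we walked to, requeueing further up the chain cannot
    change anything, so only a deadlock-detection walk keeps going.

            if (waiter->prio == task->prio) {
                    if (!detect_deadlock)
                            goto out_unlock_pi;
                    else
                            requeue = false;
            }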

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031950.280830190@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The conditions under which deadlock detection is conducted are unclear
    and undocumented.

    Add constants instead of using 0/1 and provide a selection function
    which hides the additional debug dependency from the calling code.

    Add comments where needed.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Link: http://lkml.kernel.org/r/20140522031949.947264874@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     
  • The deadlock logic is only required for futexes.

    Remove the extra arguments for the public functions and also for the
    futex specific ones, which always get called with deadlock detection
    enabled.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
    Exit right away when the removed waiter was not the top priority waiter
    on the lock. Get rid of the extra indent level.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Reviewed-by: Lai Jiangshan

    Thomas Gleixner
     
    Add commentary to document the chain walk and the protection mechanisms
    and their scope.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
  • Add a separate local variable for the boost/deboost logic to make the
    code more readable. Add comments where appropriate.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
    There is no point in keeping the task ref across the check for the lock
    owner. Drop the ref before that, so the protection context is clear.

    Found while documenting the chain walk.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Reviewed-by: Lai Jiangshan

    Thomas Gleixner
     
    The current implementation of try_to_take_rtmutex() is correct, but
    requires more than a single brain twist to understand the cleverly
    encoded conditionals.

    Untangle it and document the cases properly.

    Looks less efficient at first glance, but actually reduces the binary
    code size on x86_64 by 80 bytes.

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt

    Thomas Gleixner
     
    Oleg noticed that rtmutex_slowtrylock() has a pointless check for
    rt_mutex_owner(lock) != current.

    To avoid calling try_to_take_rtmutex() we really want to check whether
    the lock has an owner at all or whether the trylock failed because the
    owner is NULL, but the RT_MUTEX_HAS_WAITERS bit is set. This covers the
    "lock is owned by caller" situation as well.

    We can actually do this check locklessly; a trylock is taking a chance
    anyway, whether or not we take lock->wait_lock to do the check.

    Add comments to the function while at it.
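
    The resulting trylock slowpath looks roughly like this (illustrative
    reconstruction, not the verbatim patch):

    static inline int rt_mutex_slowtrylock(struct rt_mutex *lock)
    {
            int ret;

            /*
             * If the lock already has an owner we fail to get the lock.
             * This can be done without taking the wait_lock, as it is
             * only being read and this is a trylock anyway.
             */
            if (rt_mutex_owner(lock))
                    return 0;

            /* The mutex has no owner: lock the wait_lock and try to take it. */
            raw_spin_lock(&lock->wait_lock);

            ret = try_to_take_rt_mutex(lock, current, NULL);

            /* try_to_take_rt_mutex() sets the waiters bit unconditionally;
             * clean that up. */
            fixup_rt_mutex_waiters(lock);

            raw_spin_unlock(&lock->wait_lock);

            return ret;
    }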

    Reported-by: Oleg Nesterov
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Reviewed-by: Lai Jiangshan

    Thomas Gleixner
     

16 Jun, 2014

1 commit

  • When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

    spin_lock(foo->m->wait_lock);
    foo->m->owner = NULL;
                                        rt_mutex_lock(foo->m); <-- fast path
                                        free = atomic_dec_and_test(foo->refcnt);
                                        rt_mutex_unlock(foo->m); <-- fast path
                                        if (free)
                                           kfree(foo);
    spin_unlock(foo->m->wait_lock); <--- Use after free.

    Plug the race by changing the slow unlock to the following scheme:

    while (!rt_mutex_has_waiters(m)) {
            /* Clear the waiters bit in m->owner */
            clear_rt_mutex_waiters(m);
            owner = rt_mutex_owner(m);
            spin_unlock(m->wait_lock);
            if (cmpxchg(m->owner, owner, 0) == owner)
                    return;
            spin_lock(m->wait_lock);
    }

    So in case of a new waiter incoming while the owner tries the slow path
    unlock we have two situations:

    unlock(wait_lock);
                                        lock(wait_lock);
    cmpxchg(p, owner, 0) == owner
                                        mark_rt_mutex_waiters(lock);
                                        acquire(lock);

    Or:

    unlock(wait_lock);
                                        lock(wait_lock);
                                        mark_rt_mutex_waiters(lock);
    cmpxchg(p, owner, 0) != owner
                                        enqueue_waiter();
                                        unlock(wait_lock);
    lock(wait_lock);
    wakeup_next_waiter();
    unlock(wait_lock);
                                        lock(wait_lock);
                                        acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient, as all access to m->owner is serialized via
    m->wait_lock.

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

07 Jun, 2014

2 commits

  • When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the dead lock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We actually can detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.
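
    A small standalone model of the check (struct and field names are
    simplified inventions; in the kernel the recorded value is
    task->pi_blocked_on->lock):

    #include <stdio.h>

    struct lock { int id; };

    struct task {
            struct lock *blocked_on;
    };

    /* Called after the locks have been dropped and reacquired. */
    static int chain_still_valid(struct task *t, struct lock *next_lock)
    {
            /* If the task now blocks on a different lock, the chain changed
             * under us: stop the walk, the new chain will be handled there. */
            return t->blocked_on == next_lock;
    }

    int main(void)
    {
            struct lock L3 = { 3 }, L0 = { 0 };
            struct lock *next_lock = &L3;           /* recorded before dropping */
            struct task T2 = { .blocked_on = &L0 }; /* T2 timed out, now on L0 */

            printf("continue walk: %d\n", chain_still_valid(&T2, next_lock));
            return 0;
    }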

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
  • Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

28 May, 2014

1 commit

  • The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
        top_waiter != task_top_pi_waiter(task)))
            goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk and
    therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue if we detect the deadlock right away
    (A-A). It's an optimization, but it also prevents another waiter, who
    comes in after the detection and before the task has undone the damage,
    from observing the situation, detecting the deadlock and returning
    -EDEADLOCK, which is wrong as the other task is not in a deadlock
    situation.
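
    The reworked early exit looks roughly like the fragment below
    (illustrative, not the verbatim patch): the walk only drops out early if
    full deadlock detection was not requested.

            if (top_waiter) {
                    if (!task_has_pi_waiters(task))
                            goto out_unlock_pi;
                    /*
                     * If deadlock detection is off, we stop here if we are
                     * not the top pi waiter of the task.
                     */
                    if (!detect_deadlock &&
                        top_waiter != task_top_pi_waiter(task))
                            goto out_unlock_pi;
            }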

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

23 Feb, 2014

1 commit

    If a PI-boosted task's policy/priority is modified by a setscheduler()
    call, we unconditionally dequeue and requeue the task if it is on the
    runqueue, even if the new priority is lower than the current effective
    boosted priority. This can result in undesired reordering of the
    priority bucket list.

    If the new priority is less than or equal to the current effective
    priority, we just store the new parameters in the task struct and leave
    the scheduler class and the runqueue untouched. This is handled when the
    task deboosts itself. Only if the new priority is higher than the
    effective boosted priority do we apply the change immediately.

    Signed-off-by: Thomas Gleixner
    [ Rebase ontop of v3.14-rc1. ]
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Dario Faggioli
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1391803122-4425-7-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

13 Jan, 2014

2 commits

    Some method to deal with rt-mutexes and make sched_dl interact with the
    current PI code is needed, raising far from trivial issues that need
    (according to us) to be solved with some restructuring of the pi-code
    (i.e., going toward a proxy execution-ish implementation).

    This is under development. In the meanwhile, as a temporary solution,
    what this commit does is:

    - ensure a pi-lock owner with waiters is never throttled down. Instead,
    when it runs out of runtime, it immediately gets replenished and its
    deadline is postponed;

    - the scheduling parameters (relative deadline and default runtime) used
    for those replenishments --during the whole period it holds the
    pi-lock-- are the ones of the waiting task with the earliest deadline.

    Acting this way, we provide some kind of boosting to the lock owner,
    still by using the existing (actually, slightly modified by the previous
    commit) pi-architecture.

    We would stress the fact that this is only a surely needed, far from
    clean solution to the problem. In the end it's only a way to re-start
    the discussion within the community. So, as always, comments, ideas,
    rants, etc. are welcome! :-)

    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    [ Added !RT_MUTEXES build fix. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-11-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Dario Faggioli
     
  • Turn the pi-chains from plist to rb-tree, in the rt_mutex code,
    and provide a proper comparison function for -deadline and
    -priority tasks.

    This is done mainly because:
    - the classical prio field of the plist is just an int, which might not
    be enough for representing a deadline;
    - manipulating such a list would become O(nr_deadline_tasks), which
    might be too much, as the number of -deadline tasks increases.

    Therefore, an rb-tree is used, and tasks are queued in it according
    to the following logic:
    - among two -priority (i.e., SCHED_BATCH/OTHER/RR/FIFO) tasks, the
    one with the higher (lower, actually!) prio wins;
    - among a -priority and a -deadline task, the latter always wins;
    - among two -deadline tasks, the one with the earliest deadline
    wins.

    Queueing and dequeueing functions are changed accordingly, for both
    the list of a task's pi-waiters and the list of tasks blocked on
    a pi-lock.
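
    A standalone sketch of the ordering described above (field and helper
    names are invented; kernel prio is "lower number = higher priority"):

    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct model_waiter {
            bool     dl;            /* is this a -deadline task?    */
            uint64_t deadline;      /* absolute deadline if dl      */
            int      prio;          /* kernel prio, lower == higher */
    };

    static bool waiter_less(const struct model_waiter *a,
                            const struct model_waiter *b)
    {
            if (a->dl && b->dl)     /* both -deadline: earliest deadline wins */
                    return (int64_t)(a->deadline - b->deadline) < 0;
            if (a->dl != b->dl)     /* -deadline always beats -priority */
                    return a->dl;
            return a->prio < b->prio;
    }

    int main(void)
    {
            struct model_waiter fifo = { .dl = false, .prio = 10 };
            struct model_waiter dl = { .dl = true, .deadline = 1000 };

            printf("dl queued before fifo: %d\n", waiter_less(&dl, &fifo));
            return 0;
    }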

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Dario Faggioli
    Signed-off-by: Juri Lelli
    Signed-off-again-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1383831828-15501-10-git-send-email-juri.lelli@gmail.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra