06 Dec, 2016

1 commit

  • Since commit:

    4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")

    printk() requires KERN_CONT to continue log messages. Lots of printk()
    calls in lockdep.c and print_ip_sym() don't have it. As a result, lockdep
    reports are completely messed up.

    Add missing KERN_CONT and inline print_ip_sym() where necessary.

    Example of a messed up report:

    0-rc5+ #41 Not tainted
    -------------------------------------------------------
    syz-executor0/5036 is trying to acquire lock:
    (
    rtnl_mutex
    ){+.+.+.}
    , at:
    [] rtnl_lock+0x1c/0x20
    but task is already holding lock:
    (
    &net->packet.sklist_lock
    ){+.+...}
    , at:
    [] packet_diag_dump+0x1a6/0x1920
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3
    (
    &net->packet.sklist_lock
    +.+...}
    ...

    Without this patch all scripts that parse kernel bug reports are broken.
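
    As a rough illustration (hypothetical variables, not the actual lockdep
    diff), a continued line needs KERN_CONT on every printk() that extends
    the current record:

        /*
         * Sketch only: printing one logical report line in two printk()
         * calls.  Without KERN_CONT on the second call, the two calls
         * become separate log records and the line is split.
         */
        printk(KERN_INFO "trying to acquire lock: ");
        printk(KERN_CONT "(%s){%s}, at: %pS\n", lock_name, usage_str, (void *)ip);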

    Signed-off-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andreyknvl@google.com
    Cc: aryabinin@virtuozzo.com
    Cc: joe@perches.com
    Cc: syzkaller@googlegroups.com
    Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     

02 Dec, 2016

2 commits

    While debugging the rtmutex unlock vs. dequeue race, Will suggested using
    READ_ONCE() in rt_mutex_owner() as it might race against the
    cmpxchg_release() in unlock_rt_mutex_safe().

    Will: "It's a minor thing which will most likely not matter in practice"

    A careful search did not unearth an actual problem in today's code, but
    it's better to be safe than surprised.
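
    For reference, a sketch of the READ_ONCE() variant of the helper (modeled
    on rt_mutex_owner() in kernel/locking/rtmutex_common.h; treat it as
    illustrative rather than the exact diff):

        static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
        {
                unsigned long owner = (unsigned long) READ_ONCE(lock->owner);

                return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
        }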

    Suggested-by: Will Deacon
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0                        CPU1                        CPU2

    l->owner=T1
                                rt_mutex_lock(l)
                                lock(l->wait_lock)
                                l->owner = T1 | HAS_WAITERS;
                                enqueue(T2)
                                boost()
                                  unlock(l->wait_lock)
                                schedule()

                                                            rt_mutex_lock(l)
                                                            lock(l->wait_lock)
                                                            l->owner = T1 | HAS_WAITERS;
                                                            enqueue(T3)
                                                            boost()
                                                              unlock(l->wait_lock)
                                                            schedule()
                                signal(->T2)                signal(->T3)
                                lock(l->wait_lock)
                                dequeue(T2)
                                deboost()
                                  unlock(l->wait_lock)
                                                            lock(l->wait_lock)
                                                            dequeue(T3)
                                                              ===> wait list is now empty
                                                            deboost()
                                                              unlock(l->wait_lock)
                                lock(l->wait_lock)
                                fixup_rt_mutex_waiters()
                                  if (wait_list_empty(l)) {
                                    owner = l->owner & ~HAS_WAITERS;
                                    l->owner = owner
                                      ==> l->owner = T1
                                  }

                                                            lock(l->wait_lock)
    rt_mutex_unlock(l)                                      fixup_rt_mutex_waiters()
                                                              if (wait_list_empty(l)) {
                                                                owner = l->owner & ~HAS_WAITERS;
      cmpxchg(l->owner, T1, NULL)
        ===> Success (l->owner = NULL)
                                                                l->owner = owner
                                                                  ==> l->owner = T1
                                                              }

    That means the problem is caused by fixup_rt_mutex_waiters(), which does
    the RMW to clear the waiters bit unconditionally when there are no waiters
    in the rtmutex's rbtree.

    This can be fatal: A concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after the
    successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the rtmutex
    has the waiters bit set before clearing it. This does not require a
    cmpxchg() or other atomic operations because the waiters bit can only be
    set and cleared with the rtmutex wait_lock held. It's also safe against the
    fast path unlock attempt. The unlock attempt via cmpxchg() will either see
    the bit set and take the slowpath or see the bit cleared and release it
    atomically in the fastpath.
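
    A minimal sketch of that idea (simplified; the upstream helper in
    kernel/locking/rtmutex.c carries a much more detailed comment):

        static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
        {
                unsigned long owner, *p = (unsigned long *) &lock->owner;

                if (rt_mutex_has_waiters(lock))
                        return;

                /*
                 * Only clear the bit if it is actually set.  Both the check
                 * and the store happen under lock->wait_lock, so no atomic
                 * RMW is required.
                 */
                owner = READ_ONCE(*p);
                if (owner & RT_MUTEX_HAS_WAITERS)
                        WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
        }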

    It's remarkable that the test program provided by David triggers on ARM64
    and MIPS64 really quickly, but refuses to reproduce on x86-64, even though
    the problem exists there as well. That refusal might explain why this was
    not discovered earlier, despite the bug existing since day one of the
    rtmutex implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information that made it possible to decode this subtle problem.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

19 Nov, 2016

1 commit


22 Sep, 2016

3 commits

    It is now unused; remove it before someone else thinks it's a good idea
    to use this.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    stop_two_cpus() and stop_cpus() use stop_cpus_lock to avoid deadlock:
    we need to ensure that the stopper functions can't be queued "backwards"
    from one another. This doesn't look nice; if we use lglock, then we do not
    really need stopper->lock, and cpu_stop_queue_work() could use
    lg_local_lock() under local_irq_save().

    OTOH it would be even better to avoid lglock in stop_machine.c and remove
    lg_double_lock(). This patch adds "bool stop_cpus_in_progress" set/cleared
    by queue_stop_cpus_work(), and changes cpu_stop_queue_two_works() to busy
    wait until it is cleared.

    queue_stop_cpus_work() sets stop_cpus_in_progress = T locklessly, but after
    it queues a work on CPU1 it must be visible to stop_two_cpus(CPU1, CPU2),
    which checks it under the same lock. And since stop_two_cpus() holds the
    2nd lock too, queue_stop_cpus_work() can not clear stop_cpus_in_progress
    if it is also going to queue a work on CPU2; it needs to take that 2nd
    lock to do this.
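
    A heavily simplified sketch of the scheme (stop_work_for() is a
    hypothetical per-CPU work lookup, and all error handling is omitted):

        static bool stop_cpus_in_progress;

        /* stop_cpus() side: flag the bulk-queueing window (lockless). */
        static void queue_stop_cpus_work_sketch(const struct cpumask *cpumask)
        {
                unsigned int cpu;

                stop_cpus_in_progress = true;
                for_each_cpu(cpu, cpumask)
                        cpu_stop_queue_work(cpu, stop_work_for(cpu));
                stop_cpus_in_progress = false;
        }

        /* stop_two_cpus() side: busy wait until that window has closed. */
        static void wait_for_stop_cpus_sketch(void)
        {
                while (unlikely(READ_ONCE(stop_cpus_in_progress)))
                        cpu_relax();
        }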

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20151121181148.GA433@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
     
    cmpxchg_release() is more lightweight than cmpxchg() on some archs (e.g.
    PPC). Moreover, in __pv_queued_spin_unlock() we only need a RELEASE in
    the fast path (pairing with *_try_lock() or *_lock()), and the slow path
    already has smp_store_release(). So it's safe to use cmpxchg_release() here.
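
    A hedged sketch of what the fast path looks like with the release variant
    (modeled on kernel/locking/qspinlock_paravirt.h of that era):

        __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
        {
                struct __qspinlock *l = (void *)lock;
                u8 locked;

                /*
                 * RELEASE ordering is sufficient: it pairs with the ACQUIRE
                 * on the lock word in the *_try_lock()/*_lock() fast paths.
                 */
                locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
                if (likely(locked == _Q_LOCKED_VAL))
                        return;

                __pv_queued_spin_unlock_slowpath(lock, locked);
        }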

    Suggested-by: Boqun Feng
    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: benh@kernel.crashing.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: paulmck@linux.vnet.ibm.com
    Cc: paulus@samba.org
    Cc: virtualization@lists.linux-foundation.org
    Cc: waiman.long@hpe.com
    Link: http://lkml.kernel.org/r/1474277037-15200-2-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     

18 Aug, 2016

3 commits

    When wanting to wake up readers, __rwsem_mark_wake() currently
    iterates the wait_list twice while looking to wake up the first N
    queued reader-tasks. While this can be quite inefficient, it was
    done such that an awoken reader would be first and foremost
    acknowledged by the lock counter.

    Keeping the same logic, we can further benefit from the use of
    wake_qs and avoid entirely the first wait_list iteration that sets
    the counter, as wake_up_process() isn't going to occur right away;
    we therefore maintain the counter->list order of going about
    things.

    Other than saving cycles with O(n) "scanning", this change also
    nicely cleans up a good chunk of __rwsem_mark_wake(), making it
    both visually cleaner and less tedious to read.

    For example, the following improvements were seen on some
    will-it-scale microbenchmarks, on a 48-core Haswell:

    v4.7 v4.7-rwsem-v1
    Hmean signal1-processes-8 5792691.42 ( 0.00%) 5771971.04 ( -0.36%)
    Hmean signal1-processes-12 6081199.96 ( 0.00%) 6072174.38 ( -0.15%)
    Hmean signal1-processes-21 3071137.71 ( 0.00%) 3041336.72 ( -0.97%)
    Hmean signal1-processes-48 3712039.98 ( 0.00%) 3708113.59 ( -0.11%)
    Hmean signal1-processes-79 4464573.45 ( 0.00%) 4682798.66 ( 4.89%)
    Hmean signal1-processes-110 4486842.01 ( 0.00%) 4633781.71 ( 3.27%)
    Hmean signal1-processes-141 4611816.83 ( 0.00%) 4692725.38 ( 1.75%)
    Hmean signal1-processes-172 4638157.05 ( 0.00%) 4714387.86 ( 1.64%)
    Hmean signal1-processes-203 4465077.80 ( 0.00%) 4690348.07 ( 5.05%)
    Hmean signal1-processes-224 4410433.74 ( 0.00%) 4687534.43 ( 6.28%)

    Stddev signal1-processes-8 6360.47 ( 0.00%) 8455.31 ( 32.94%)
    Stddev signal1-processes-12 4004.98 ( 0.00%) 9156.13 (128.62%)
    Stddev signal1-processes-21 3273.14 ( 0.00%) 5016.80 ( 53.27%)
    Stddev signal1-processes-48 28420.25 ( 0.00%) 26576.22 ( -6.49%)
    Stddev signal1-processes-79 22038.34 ( 0.00%) 18992.70 (-13.82%)
    Stddev signal1-processes-110 23226.93 ( 0.00%) 17245.79 (-25.75%)
    Stddev signal1-processes-141 6358.98 ( 0.00%) 7636.14 ( 20.08%)
    Stddev signal1-processes-172 9523.70 ( 0.00%) 4824.75 (-49.34%)
    Stddev signal1-processes-203 13915.33 ( 0.00%) 9326.33 (-32.98%)
    Stddev signal1-processes-224 15573.94 ( 0.00%) 10613.82 (-31.85%)

    Other runs that saw improvements include context_switch and pipe; and
    as expected, this is particularly highlighted on larger thread counts
    as it becomes more expensive to walk the list twice.

    No change in wakeup ordering or semantics.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Our rwsem code (xadd, at least) is rather well documented, but
    there are a few really annoying comments in there that serve
    no purpose and we shouldn't bother with them.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
    We currently return a rw_semaphore structure, which is the
    same lock that was passed in as the function's argument in the
    first place. While there are several functions that use this
    return-value convention, the callers use it, for example, for
    things like ERR_PTR. This is not the case for __rwsem_mark_wake();
    in addition, this function is really about the lock waiters
    (which we know exist at this point), so it's somewhat odd to be
    returning the sem structure.
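
    Roughly, the result is a signature along these lines (hedged sketch, not
    the verbatim diff):

        static void __rwsem_mark_wake(struct rw_semaphore *sem,
                                      enum rwsem_wake_type wake_type,
                                      struct wake_q_head *wake_q);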

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

10 Aug, 2016

5 commits

    Currently the percpu-rwsem switches to (global) atomic ops while a
    writer is waiting; this can last quite a while and slows down
    releasing the readers.

    This patch cures this problem by ordering the reader-state vs
    reader-count (see the comments in __percpu_down_read() and
    percpu_down_write()). This changes a global atomic op into a full
    memory barrier, which doesn't have the global cacheline contention.

    This also enables using the percpu-rwsem with rcu_sync disabled in order
    to bias the implementation differently, reducing the writer latency by
    adding some cost to readers.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    [ Fixed modular build. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    Currently there is overlap between the pvqspinlock wait_again and
    spurious_wakeup stat counters. Because of lock stealing, it is
    no longer possible to accurately determine if a spurious wakeup has
    happened in the queue head. As they track both the queue node and
    queue head status, it is also hard to tell how many of those come
    from the queue head and how many from the queue node.

    This patch changes the accounting rules so that spurious wakeup is
    only tracked in the queue node. The wait_again count, however, is
    only tracked in the queue head when the vCPU failed to acquire the
    lock after a vCPU kick. This should give a much better indication of
    the wait-kick dynamics in the queue node and the queue head.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464713631-1066-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Restructure pv_queued_spin_steal_lock() as I found it hard to read.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long

    Peter Zijlstra
     
    It's obviously wrong to set stat to NULL, so let's remove that assignment.
    Otherwise it is always zero when we check the latency of kick/wake.

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     
  • When the lock holder vCPU is racing with the queue head:

    CPU 0 (lock holder)                  CPU1 (queue head)
    ===================                  =================
    spin_lock();                         spin_lock();
      pv_kick_node():                      pv_wait_head_or_lock():
                                             if (!lp) {
                                               lp = pv_hash(lock, pn);
                                               xchg(&l->locked, _Q_SLOW_VAL);
                                             }
                                             WRITE_ONCE(pn->state, vcpu_halted);
        cmpxchg(&pn->state,
                vcpu_halted, vcpu_hashed);
        WRITE_ONCE(l->locked, _Q_SLOW_VAL);
        (void)pv_hash(lock, pn);

    In this case, the lock holder inserts the pv_node of the queue head into
    the hash table and sets _Q_SLOW_VAL unnecessarily. This patch avoids that
    by restoring/setting the vcpu_hashed state after adaptive lock spinning
    fails.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Pan Xinhui
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

26 Jul, 2016

1 commit

  • Pull locking updates from Ingo Molnar:
    "The locking tree was busier in this cycle than the usual pattern - a
    couple of major projects happened to coincide.

    The main changes are:

    - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
    across all SMP architectures (Peter Zijlstra)

    - add atomic_fetch_{inc/dec}() as well, using the generic primitives
    (Davidlohr Bueso)

    - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
    Waiman Long)

    - optimize smp_cond_load_acquire() on arm64 and implement LSE based
    atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    on arm64 (Will Deacon)

    - introduce smp_acquire__after_ctrl_dep() and fix various barrier
    mis-uses and bugs (Peter Zijlstra)

    - after discovering ancient spin_unlock_wait() barrier bugs in its
    implementation and usage, strengthen its semantics and update/fix
    usage sites (Peter Zijlstra)

    - optimize mutex_trylock() fastpath (Peter Zijlstra)

    - ... misc fixes and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
    locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
    locking/static_keys: Fix non static symbol Sparse warning
    locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
    locking/atomic, arch/tile: Fix tilepro build
    locking/atomic, arch/m68k: Remove comment
    locking/atomic, arch/arc: Fix build
    locking/Documentation: Clarify limited control-dependency scope
    locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
    locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
    locking/atomic, arch/mips: Convert to _relaxed atomics
    locking/atomic, arch/alpha: Convert to _relaxed atomics
    locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
    locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
    locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    locking/atomic: Fix atomic64_relaxed() bits
    locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    ...

    Linus Torvalds
     

07 Jul, 2016

1 commit


27 Jun, 2016

1 commit

  • queued_spin_lock_slowpath() should not worry about another
    queued_spin_lock_slowpath() running in interrupt context and
    changing node->count by accident, because node->count keeps
    the same value every time we enter/leave queued_spin_lock_slowpath().

    On some architectures this_cpu_dec() will save/restore irq flags,
    which has high overhead. Use the much cheaper __this_cpu_dec() instead.
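
    Concretely, the change is along these lines (sketch of the release path
    in kernel/locking/qspinlock.c at the time):

        -       this_cpu_dec(mcs_nodes[0].count);
        +       __this_cpu_dec(mcs_nodes[0].count);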

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Rewrote changelog. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     

24 Jun, 2016

1 commit


16 Jun, 2016

3 commits

  • Now that we have fetch_add() we can stop using add_return() - val.
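
    The general rewrite this enables looks like the following (illustrative
    pattern only, not the verbatim diff):

        -       old = atomic_long_add_return(delta, &sem->count) - delta;
        +       old = atomic_long_fetch_add(delta, &sem->count);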

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    The only reason for the current code is to make GCC emit only the
    "LOCK XADD" instruction on x86 (and not do a pointless extra ADD on
    the result); do so more cleanly.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • These functions have been deprecated for a while and there is only the
    one user left, convert and kill.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Jun, 2016

2 commits

    Introduce smp_acquire__after_ctrl_dep(); this construct is not
    uncommon, but the lack of this barrier is.

    Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
    semaphore code and the qspinlock code.
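
    The pattern the new helper names is roughly this (hedged sketch; ptr and
    val are placeholders):

        /*
         * Spin until *ptr becomes non-zero.  The branch on the loaded value
         * is a control dependency; the helper upgrades it to ACQUIRE so that
         * later memory accesses cannot be hoisted above the successful test.
         */
        while (!(val = READ_ONCE(*ptr)))
                cpu_relax();
        smp_acquire__after_ctrl_dep();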

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This new form allows using hardware assisted waiting.

    Some hardware (ARM64 and x86) allows monitoring an address for changes,
    so by providing a pointer we can use this to replace the cpu_relax()
    with hardware-optimized methods in the future.
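
    A hedged usage sketch of the pointer-taking form (flag is a hypothetical
    shared variable; VAL names the freshly loaded value inside the condition):

        /*
         * Wait until flag becomes non-zero, with ACQUIRE semantics.  Because
         * the address is passed in, an architecture can implement the wait
         * with WFE/monitor-style instructions instead of a cpu_relax() loop.
         */
        val = smp_cond_load_acquire(&flag, VAL != 0);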

    Requested-by: Will Deacon
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Jun, 2016

11 commits

    This patch moves the owner loading and checking code entirely inside of
    rwsem_spin_on_owner() to simplify the logic of the
    rwsem_optimistic_spin() loop.

    Suggested-by: Peter Hurley
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Peter Hurley
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-6-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • In __rwsem_do_wake(), the reader wakeup code will assume a writer
    has stolen the lock if the active reader/writer count is not 0.
    However, this is not as reliable an indicator as the original
    "< RWSEM_WAITING_BIAS" check. If another reader is present, the code
    will still break out and exit even if the writer is gone. This patch
    changes it to check the same "< RWSEM_WAITING_BIAS" condition to
    reduce the chance of false positives.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Peter Hurley
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-5-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Without using WRITE_ONCE(), the compiler can potentially break a
    write into multiple smaller ones (store tearing). So a read from the
    same data by another task concurrently may return a partial result.
    This can result in a kernel crash if the data is a memory address
    that is being dereferenced.

    This patch changes all writes to rwsem->owner to use WRITE_ONCE()
    to make sure that store tearing will not happen. READ_ONCE() may
    not be needed for rwsem->owner as long as the value is only used for
    comparison and not dereferencing.
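
    A sketch of the resulting helpers (modeled on kernel/locking/rwsem.h):

        static inline void rwsem_set_owner(struct rw_semaphore *sem)
        {
                WRITE_ONCE(sem->owner, current);
        }

        static inline void rwsem_clear_owner(struct rw_semaphore *sem)
        {
                WRITE_ONCE(sem->owner, NULL);
        }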

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Currently, it is not possible to determine for sure if a reader
    owns a rwsem by looking at the content of the rwsem data structure.
    This patch adds a new state RWSEM_READER_OWNED to the owner field
    to indicate that readers currently own the lock. This enables us to
    address the following 2 issues in the rwsem optimistic spinning code:

    1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
    the owner field is NULL which can mean either the readers own
    the lock or the owning writer hasn't set the owner field yet.
    In the latter case, we miss the chance to do optimistic spinning.

    2) While a writer is waiting in the OSQ and a reader takes the lock,
    the writer will continue to spin when out of the OSQ in the main
    rwsem_optimistic_spin() loop as the owner field is NULL, wasting
    CPU cycles if some of the readers are sleeping.

    Adding the new state will allow optimistic spinning to go forward as
    long as the owner field is not RWSEM_READER_OWNED and the owner, if
    set, is running, but to stop immediately when that state has been reached.
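
    A hedged sketch of the new state and its setter (names as used above;
    details may differ from the actual patch):

        /* Special owner value: the rwsem is currently owned by readers. */
        #define RWSEM_READER_OWNED      ((struct task_struct *)1UL)

        static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
        {
                /*
                 * Avoid dirtying the owner cacheline if it already carries
                 * the reader-owned marker.
                 */
                if (READ_ONCE(sem->owner) != RWSEM_READER_OWNED)
                        WRITE_ONCE(sem->owner, RWSEM_READER_OWNED);
        }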

    On a 4-socket Haswell machine running a 4.6-rc1 based kernel, the
    fio test with multithreaded randrw and randwrite tests on the same
    file on an XFS partition on top of an NVDIMM was run; the aggregated
    bandwidths before and after the patch were as follows:

    Test        BW before patch   BW after patch   % change
    ----        ---------------   --------------   --------
    randrw      988 MB/s          1192 MB/s        +21%
    randwrite   1513 MB/s         1623 MB/s        +7.3%

    The perf profiles of the rwsem_down_write_failed() function in randrw
    before and after the patch were:

    19.95% 5.88% fio [kernel.vmlinux] [k] rwsem_down_write_failed
    14.20% 1.52% fio [kernel.vmlinux] [k] rwsem_down_write_failed

    The actual CPU cycles spent in rwsem_down_write_failed() dropped from
    5.88% to 1.52% after the patch.

    The xfstests suite was also run and no regressions were observed.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Jason Low
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Convert the rwsem count variable to an atomic_long_t since we use it
    as an atomic variable. This also allows us to remove the
    rwsem_atomic_{add,update}() "abstraction", which would now be an unnecessary
    level of indirection. In follow-up patches, we also remove the
    rwsem_atomic_{add,update}() definitions across the various architectures.
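
    Illustrative sketch of the kind of conversion involved (not the verbatim
    diff):

        struct rw_semaphore {
        -       long count;
        +       atomic_long_t count;
                struct list_head wait_list;
                raw_spinlock_t wait_lock;
                /* ... */
        };

        /* and a former rwsem_atomic_add(delta, sem) call site becomes: */
                atomic_long_add(delta, &sem->count);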

    Suggested-by: Peter Zijlstra
    Signed-off-by: Jason Low
    [ Build warning fixes on various architectures. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Terry Rudd
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Tony Luck
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1465017963-4839-2-git-send-email-jason.low2@hpe.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
    I figured we need to document the spin_is_locked() and
    spin_unlock_wait() constraints somewhere.

    Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
    could find a place in there. But currently that document is stale to
    the point of hardly being useful.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While going over the code I noticed that xchg_tail() is a RELEASE but
    had no obvious pairing commented.

    It pairs with a somewhat unique address dependency through
    decode_tail().

    So the store-release of xchg_tail() is paired by the address
    dependency of the load of xchg_tail() followed by the dereference from
    the pointer computed from that load.

    The @old -> @prev transformation itself is pure, and therefore does
    not depend on external state, so that is immaterial wrt. ordering.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • While this prior commit:

    54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

    ... fixes spin_is_locked() and spin_unlock_wait() for the usage
    in ipc/sem and netfilter, it does not in fact work right for the
    usage in task_work and futex.

    So while the 2 locks crossed problem:

        spin_lock(A)                    spin_lock(B)
        if (!spin_is_locked(B))         spin_unlock_wait(A)
          foo()                           foo();

    ... works with the smp_mb() injected by both spin_is_locked() and
    spin_unlock_wait(), this is not sufficient for:

        flag = 1;
        smp_mb();                       spin_lock()
        spin_unlock_wait()              if (!flag)
                                          // add to lockless list
        // iterate lockless list

    ... because in this scenario, the store from spin_lock() can be delayed
    past the load of flag, uncrossing the variables and losing the
    guarantee.

    This patch reworks spin_is_locked() and spin_unlock_wait() to work in
    both cases by exploiting the observation that while the lock byte
    store can be delayed, the contender must have registered itself
    visibly in other state contained in the word.

    It also allows for architectures to override both functions, as PPC
    and ARM64 have an additional issue for which we currently have no
    generic solution.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Giovanni Gherdovich
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: stable@vger.kernel.org # v4.2 and later
    Fixes: 54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
    One warning should be enough to motivate someone to fix this. It is
    possible that this happens more than once, which starts flooding the
    output. Later the prints will be suppressed, so we only get half of it,
    and depending on the console system used that might not be helpful.
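
    The change described boils down to switching the warning to its _ONCE
    variant, along these lines (illustrative; the exact condition is whatever
    the affected check tests):

        -       WARN_ON(in_irq() || in_interrupt());
        +       WARN_ON_ONCE(in_irq() || in_interrupt());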

    Signed-off-by: Sebastian Andrzej Siewior
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464356838-1755-1-git-send-email-bigeasy@linutronix.de
    Signed-off-by: Ingo Molnar

    Sebastian Andrzej Siewior
     
    Use __jhash_mix() to mix the class_idx into the class_key. This
    function provides better mixing than the previously used home-grown
    mix function.

    Leave hashing to the professionals :-)
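
    A sketch of the mixing step (modeled on the lockdep chain-key iteration;
    treat the details as illustrative):

        static inline u64 iterate_chain_key(u64 key, u32 idx)
        {
                u32 k0 = key, k1 = key >> 32;

                __jhash_mix(idx, k0, k1); /* macro: modifies its arguments */

                return k0 | (u64)k1 << 32;
        }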

    Suggested-by: George Spelvin
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jun, 2016

5 commits

    The mutex owner can be read and written locklessly.
    Use WRITE_ONCE when setting and clearing the owner field
    in order to avoid optimizations such as store tearing. This
    avoids situations where the owner field gets written to with
    multiple stores and another thread could concurrently read
    and use a partially written owner value.

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Davidlohr Bueso
    Acked-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Terry Rudd
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463782776.2479.9.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • When acquiring the rwsem write lock in the slowpath, we first try
    to set count to RWSEM_WAITING_BIAS. When that is successful,
    we then atomically add the RWSEM_WAITING_BIAS in cases where
    there are other tasks on the wait list. This causes write lock
    operations to often issue multiple atomic operations.

    We can instead make the list_is_singular() check first, and then
    set the count accordingly, so that we issue at most 1 atomic
    operation when acquiring the write lock and reduce unnecessary
    cacheline contention.

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Ivan Kokshaysky
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Terry Rudd
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Tony Luck
    Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Readers that are awoken will expect a nil ->task indicating
    that a wakeup has occurred. Because of the way readers are
    implemented, there's a small chance that the waiter will never
    block in the slowpath (rwsem_down_read_failed), and we therefore
    require some form of reference counting to avoid the following
    scenario:

    rwsem_down_read_failed()               rwsem_wake()
      get_task_struct();
      spin_lock_irq(&wait_lock);
      list_add_tail(&waiter.list)
      spin_unlock_irq(&wait_lock);
                                             raw_spin_lock_irqsave(&wait_lock)
                                             __rwsem_do_wake()
      while (1) {
        set_task_state(TASK_UNINTERRUPTIBLE);
                                               waiter->task = NULL
        if (!waiter.task) // true
          break;
        schedule() // never reached

      __set_task_state(TASK_RUNNING);
      do_exit();
                                               wake_up_process(tsk); // boom

    ... and therefore race with do_exit() when the caller returns.

    There is also a mismatch between the smp_mb() and its documentation,
    in that the serialization is done between reading the task and the
    nil store. Furthermore, in addition to having the overlapping of
    loads and stores to waiter->task guaranteed to be ordered within
    that CPU, both wake_up_process() originally and now wake_q_add()
    already imply barriers upon successful calls, which serves the
    comment.

    Now, as an alternative to perhaps inverting the checks in the blocker
    side (which has its own penalty in that schedule is unavoidable),
    with lockless wakeups this situation is naturally addressed and we
    can just use the refcount held by wake_q_add(), instead of doing so
    explicitly. Of course, we must guarantee that the nil store is done
    as the _last_ operation in that the task must already be marked for
    deletion to not fall into the race above. Spurious wakeups are also
    handled transparently in that the task's reference is only removed
    when wake_up_q() is actually called _after_ the nil store.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hp.com
    Cc: peter@hurleysoftware.com
    Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • As wake_qs gain users, we can teach rwsems about them such that
    waiters can be awoken without the wait_lock. This is for both
    readers and writers, the former being the ideal candidate
    as we can batch the wakeups, shortening the critical region that
    much more -- i.e. a writer task blocking a bunch of tasks waiting to
    service page-faults (mmap_sem readers).

    In general applying wake_qs to rwsem (xadd) is not difficult as
    the wait_lock is intended to be released soon _anyways_, with
    the exception of when a writer slowpath will proactively wakeup
    any queued readers if it sees that the lock is owned by a reader,
    in which case we simply do the wakeups with the lock held (see comment
    in __rwsem_down_write_failed_common()).

    Similar to other locking primitives, delaying the waiter being
    awoken does, at least in theory, allow the lock to be stolen in
    the case of writers; however, no harm was seen in this (in fact
    lock stealing tends to be a _good_ thing in most workloads), and
    this is a tiny window anyway.

    Some page-fault (pft) and mmap_sem intensive benchmarks show some
    pretty constant reduction in systime (by up to ~8 and ~10%) on a
    2-socket, 12 core AMD box. In addition, on an 8-core Westmere doing
    page allocations (page_test):

    aim9:
                              4.6-rc6             4.6-rc6-rwsemv2
    Min page_test 378167.89 ( 0.00%) 382613.33 ( 1.18%)
    Min exec_test 499.00 ( 0.00%) 502.67 ( 0.74%)
    Min fork_test 3395.47 ( 0.00%) 3537.64 ( 4.19%)
    Hmean page_test 395433.06 ( 0.00%) 414693.68 ( 4.87%)
    Hmean exec_test 499.67 ( 0.00%) 505.30 ( 1.13%)
    Hmean fork_test 3504.22 ( 0.00%) 3594.95 ( 2.59%)
    Stddev page_test 17426.57 ( 0.00%) 26649.92 (-52.93%)
    Stddev exec_test 0.47 ( 0.00%) 1.41 (-199.05%)
    Stddev fork_test 63.74 ( 0.00%) 32.59 ( 48.86%)
    Max page_test 429873.33 ( 0.00%) 456960.00 ( 6.30%)
    Max exec_test 500.33 ( 0.00%) 507.66 ( 1.47%)
    Max fork_test 3653.33 ( 0.00%) 3650.90 ( -0.07%)

                              4.6-rc6             4.6-rc6-rwsemv2
    User 1.12 0.04
    System 0.23 0.04
    Elapsed 727.27 721.98

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hp.com
    Cc: peter@hurleysoftware.com
    Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Recursive locking for ww_mutexes was originally conceived as an
    exception. However, it is heavily used by the DRM atomic modesetting
    code. Currently, the recursive deadlock is checked after we have queued
    up for a busy-spin and as we never release the lock, we spin until
    kicked, whereupon the deadlock is discovered and reported.

    A simple solution for the now common problem is to move the recursive
    deadlock discovery to the first action when taking the ww_mutex.
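
    A hedged sketch of the idea, checking for self-recursion up front in the
    ww_mutex lock path (treat names and exact placement as approximate):

        if (use_ww_ctx) {
                struct ww_mutex *ww = container_of(lock, struct ww_mutex, base);

                /*
                 * This context already holds the ww_mutex: report the
                 * recursive deadlock immediately instead of spinning on it.
                 */
                if (unlikely(ww_ctx == READ_ONCE(ww->ctx)))
                        return -EALREADY;
        }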

    Suggested-by: Maarten Lankhorst
    Signed-off-by: Chris Wilson
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Maarten Lankhorst
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/1464293297-19777-1-git-send-email-chris@chris-wilson.co.uk
    Signed-off-by: Ingo Molnar

    Chris Wilson