27 Jun, 2016

1 commit

  • queued_spin_lock_slowpath() should not worry about another
    queued_spin_lock_slowpath() running in interrupt context and
    changing node->count by accident, because node->count keeps
    the same value every time we enter/leave queued_spin_lock_slowpath().

    On some architectures this_cpu_dec() will save/restore irq flags,
    which has high overhead. Use the much cheaper __this_cpu_dec() instead.
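
    In code terms the change in the node-release path is tiny; a
    simplified sketch of the relevant line in the slowpath (the
    surrounding queue handling is elided):

        /*
         * Release the node: preemption is disabled and a nesting
         * slowpath (from IRQ/NMI context) restores node->count on
         * exit, so the non-IRQ-safe decrement is sufficient.
         */
        __this_cpu_dec(mcs_nodes[0].count);     /* was: this_cpu_dec() */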

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Rewrote changelog. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     

14 Jun, 2016

2 commits

  • Introduce smp_acquire__after_ctrl_dep(); this construct is not
    uncommon, but the lack of this barrier is.

    Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
    semaphore code and the qspinlock code.
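
    The pattern it captures looks roughly like this (a sketch, not a
    specific call site):

        /* wait for the variable to become non-zero ... */
        while (!(val = READ_ONCE(*ptr)))
                cpu_relax();

        /*
         * ... then upgrade the control dependency to ACQUIRE ordering,
         * instead of an open-coded smp_rmb() with a hand-written
         * comment at every site.
         */
        smp_acquire__after_ctrl_dep();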

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This new form allows using hardware-assisted waiting.

    Some hardware (ARM64 and x86) allows monitoring an address for
    changes, so by providing a pointer we can use this to replace the
    cpu_relax() with hardware-optimized methods in the future.
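
    Roughly, the old and new forms compare as follows (a sketch using
    the qspinlock wait as the example):

        /* old: only the condition, nothing for the hardware to watch */
        smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));

        /* new: the address is passed in; VAL names the loaded value */
        smp_cond_load_acquire(&lock->val.counter,
                              !(VAL & _Q_LOCKED_MASK));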

    Requested-by: Will Deacon
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Jun, 2016

3 commits

  • I figured we need to document the spin_is_locked() and
    spin_unlock_wait() constraints somewhere.

    Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
    could find a place in there. But currently that document is stale to
    the point of hardly being useful.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While going over the code I noticed that xchg_tail() is a RELEASE but
    had no obvious pairing commented.

    It pairs with a somewhat unique address dependency through
    decode_tail().

    So the store-release of xchg_tail() is paired with the address
    dependency of the load in xchg_tail() followed by the dereference of
    the pointer computed from that load.

    The @old -> @prev transformation itself is pure, and therefore does
    not depend on external state, so that is immaterial wrt. ordering.
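
    In slowpath terms, the pairing looks roughly like this (simplified
    excerpt):

        old = xchg_tail(lock, tail);    /* RELEASE: publishes our node init */

        if (old & _Q_TAIL_MASK) {
                prev = decode_tail(old);        /* address dependency on the load */
                WRITE_ONCE(prev->next, node);   /* dereference ordered after prev's init */
        }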

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While this prior commit:

    54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

    ... fixes spin_is_locked() and spin_unlock_wait() for the usage
    in ipc/sem and netfilter, it does not in fact work right for the
    usage in task_work and futex.

    So while the 2 locks crossed problem:

        spin_lock(A)                spin_lock(B)
        if (!spin_is_locked(B))     spin_unlock_wait(A)
          foo()                       foo();

    ... works with the smp_mb() injected by both spin_is_locked() and
    spin_unlock_wait(), this is not sufficient for:

        flag = 1;
        smp_mb();                   spin_lock()
        spin_unlock_wait()          if (!flag)
                                      // add to lockless list
        // iterate lockless list

    ... because in this scenario, the store from spin_lock() can be delayed
    past the load of flag, uncrossing the variables and losing the
    guarantee.

    This patch reworks spin_is_locked() and spin_unlock_wait() to work in
    both cases by exploiting the observation that while the lock byte
    store can be delayed, the contender must have registered itself
    visibly in other state contained in the word.

    It also allows for architectures to override both functions, as PPC
    and ARM64 have an additional issue for which we currently have no
    generic solution.
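
    The reworked generic queued_spin_unlock_wait() is roughly of this
    shape (a simplified sketch):

        for (;;) {
                val = atomic_read(&lock->val);

                if (!val)                       /* not locked, we're done */
                        goto done;

                if (val & _Q_LOCKED_MASK)       /* locked, wait for the unlock */
                        break;

                /* not locked but pending: the contender will set locked soon */
                cpu_relax();
        }

        while (atomic_read(&lock->val) & _Q_LOCKED_MASK)
                cpu_relax();                    /* any unlock is good */

        done:
                smp_rmb();                      /* CTRL + RMB -> ACQUIRE */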

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Giovanni Gherdovich
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: stable@vger.kernel.org # v4.2 and later
    Fixes: 54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Feb, 2016

1 commit

  • The newly introduced smp_cond_acquire() was used to replace the
    slowpath lock acquisition loop. Similarly, the new function can also
    be applied to the pending bit locking loop. This patch uses the new
    function in that loop.
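
    The change is essentially (sketch):

        /* before: open-coded spin, no ACQUIRE on the final load */
        while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
                cpu_relax();

        /* after: wait for the locked byte to clear, with acquire semantics */
        smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));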

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

04 Dec, 2015

3 commits

  • In an overcommitted guest where some vCPUs have to be halted to make
    forward progress in other areas, it is highly likely that a vCPU later
    in the spinlock queue will be spinning while the ones earlier in the
    queue would have been halted. The spinning in the later vCPUs is then
    just a waste of precious CPU cycles because they are not going to
    get the lock soon, as the earlier ones have to be woken up and take
    their turn to get the lock.

    This patch implements an adaptive spinning mechanism where the vCPU
    will call pv_wait() if the previous vCPU is not running.
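
    The idea, as a simplified sketch of pv_wait_node() (pn/pp are this
    vCPU's and the previous vCPU's pv_node; the exact threshold handling
    is elided):

        for (loop = SPIN_THRESHOLD; loop; loop--) {
                if (READ_ONCE(node->locked))
                        return;
                /* stop spinning early if the previous vCPU is not running */
                if (READ_ONCE(pp->state) != vcpu_running)
                        break;
                cpu_relax();
        }

        pv_wait(&pn->state, vcpu_halted);       /* halt until kicked */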

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                         Westmere                 Haswell
      Patch          32 vCPUs   48 vCPUs     32 vCPUs   48 vCPUs
      -----          --------   --------     --------   --------
      Before patch    3m02.3s    5m00.2s      1m43.7s    3m03.5s
      After patch     3m03.0s    4m37.5s      1m43.0s    2m47.2s

    For 32 vCPUs, this patch doesn't cause any noticeable change in
    performance. For 48 vCPUs (over-committed), there is about 8%
    performance improvement.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch allows one attempt for the lock waiter to steal the lock
    when entering the PV slowpath. To prevent lock starvation, the pending
    bit will be set by the queue head vCPU when it is in the active lock
    spinning loop to disable any lock stealing attempt. This helps to
    reduce the performance penalty caused by lock waiter preemption while
    not having much of the downsides of a real unfair lock.

    The pv_wait_head() function was renamed as pv_wait_head_or_lock()
    as it was modified to acquire the lock before returning. This is
    necessary because of possible lock stealing attempts from other tasks.
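
    The stealing attempt itself is a trylock that is only permitted
    while neither the locked nor the pending bit is set; roughly
    (sketch):

        static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
        {
                struct __qspinlock *l = (void *)lock;

                /* only steal when the lock is free and stealing isn't fenced off */
                return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
                        (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
        }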

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                         Westmere                 Haswell
      Patch          32 vCPUs   48 vCPUs     32 vCPUs   48 vCPUs
      -----          --------   --------     --------   --------
      Before patch    3m15.6s   10m56.1s      1m44.1s    5m29.1s
      After patch     3m02.3s    5m00.2s      1m43.7s    3m03.5s

    For the overcommitted case (48 vCPUs), this patch is able to reduce
    kernel build time by more than 54% for Westmere and 44% for Haswell.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Introduce smp_cond_acquire() which combines a control dependency and a
    read barrier to form acquire semantics.

    This primitive has two benefits:

    - it documents control dependencies,
    - it's typically cheaper than using smp_load_acquire() in a loop.
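
    Its definition is essentially:

        #define smp_cond_acquire(cond)  do {            \
                while (!(cond))                         \
                        cpu_relax();                    \
                smp_rmb(); /* ctrl + rmb := acquire */  \
        } while (0)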

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Nov, 2015

3 commits

  • With optimistic prefetch of the next node cacheline, the next pointer
    may have been properly initialized. As a result, the reading
    of node->next in the contended path may be redundant. This patch
    eliminates the redundant read if the next pointer value is not NULL.
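
    The contended-path read becomes conditional, roughly:

        /*
         * If the prefetch path already gave us a non-NULL next pointer,
         * skip the re-read and only spin while it is still NULL.
         */
        if (!next)
                while (!(next = READ_ONCE(node->next)))
                        cpu_relax();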

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • A queue head CPU, after acquiring the lock, will have to notify
    the next CPU in the wait queue that it has become the new queue
    head. This involves loading a new cacheline from the MCS node of the
    next CPU. That operation can be expensive and adds to the latency of
    the locking operation.

    This patch adds code to optimistically prefetch the next MCS node
    cacheline if the next pointer is defined and it has been spinning
    for the MCS lock for a while. This reduces the locking latency and
    improves the system throughput.
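
    A simplified excerpt of the waiting path with the added prefetch
    (sketch):

        arch_mcs_spin_lock_contended(&node->locked);

        /*
         * We are now queue head and are about to spin on the lock word;
         * if a successor has already queued behind us, prefetch its node
         * for write so the upcoming hand-off does not stall on a cold
         * cacheline.
         */
        next = READ_ONCE(node->next);
        if (next)
                prefetchw(next);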

    The performance change will depend on whether the prefetch overhead
    can be hidden within the latency of the lock spin loop. With a really
    short critical section, there may be no performance gain at all. With
    a longer critical section, however, it was found to give a performance
    boost of 5-10% over a range of different queue depths with a spinlock
    loop microbenchmark.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch replaces the cmpxchg() and xchg() calls in the native
    qspinlock code with the more relaxed _acquire or _release versions of
    those calls to enable other architectures to adopt queued spinlocks
    with less memory barrier performance overhead.
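
    For example, the fast-path trylock only needs ACQUIRE ordering on
    success, and publishing the tail only needs RELEASE; roughly
    (sketch):

        /* fast path: ACQUIRE on success is all the lock requires */
        val = atomic_cmpxchg_acquire(&lock->val, 0, _Q_LOCKED_VAL);
        if (likely(val == 0))
                return;

        /* tail publication: RELEASE so the node initialization is visible first */
        old = xchg_tail(lock, tail);    /* xchg_release()/cmpxchg_release() inside */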

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

11 Sep, 2015

1 commit

  • Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
    set and Linus noted that the test-and-set implementation was retarded.

    One should spin on the variable with a load, not a RMW.

    While there, remove 'queued' from the name, as the lock isn't queued
    at all, but a simple test-and-set.
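
    The fixed fallback spins with plain loads and only retries the
    atomic when the lock looks free (sketch):

        do {
                /* spin with plain loads until the lock looks free ... */
                while (atomic_read(&lock->val))
                        cpu_relax();
                /* ... and only then attempt the RMW */
        } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);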

    Suggested-by: Linus Torvalds
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: stable@vger.kernel.org # v4.2+
    Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Aug, 2015

1 commit

  • For an over-committed guest with more vCPUs than physical CPUs
    available, it is possible that a vCPU may be kicked twice before
    getting the lock - once before it becomes queue head and once again
    before it gets the lock. All these CPU kicking and halting (VMEXIT)
    can be expensive and slow down system performance.

    This patch adds a new vCPU state (vcpu_hashed) which enables the code
    to delay CPU kicking until unlock time. Once this state is set,
    the new lock holder will set _Q_SLOW_VAL and fill in the hash table
    on behalf of the halted queue head vCPU. The original vcpu_halted
    state will be used by pv_wait_node() only, to differentiate other
    queue nodes from the queue head.
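
    The vCPU state then becomes a three-state value, roughly:

        enum vcpu_state {
                vcpu_running = 0,
                vcpu_halted,    /* non-head queue nodes, used by pv_wait_node() */
                vcpu_hashed,    /* queue head: hashed, kick deferred to unlock */
        };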

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

08 May, 2015

7 commits

  • Provide a separate (second) version of the spin_lock_slowpath for
    paravirt along with a special unlock path.

    The second slowpath is generated by adding a few pv hooks to the
    normal slowpath, but where those will compile away for the native
    case, they expand into special wait/wake code for the pv version.

    The actual MCS queue can use extra storage in the mcs_nodes[] array to
    keep track of state and therefore uses directed wakeups.

    The head contender has no such storage directly visible to the
    unlocker. So the unlocker searches a hash table with open addressing
    using a simple binary Galois linear feedback shift register.
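
    On unlock, the slow path looks the halted head up in that hash
    table and kicks it; very roughly (a sketch of the idea, not the
    exact kernel code):

        if (likely(cmpxchg(&l->locked, _Q_LOCKED_VAL, 0) == _Q_LOCKED_VAL))
                return;                         /* fast path: nobody was halted */

        /* locked held _Q_SLOW_VAL: the halted head is registered in the hash */
        node = pv_unhash(lock);

        smp_store_release(&l->locked, 0);       /* release the lock ... */
        pv_kick(node->cpu);                     /* ... and wake the head vCPU */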

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When we detect a hypervisor (!paravirt, see qspinlock paravirt support
    patches), revert to a simple test-and-set lock to avoid the horrors
    of queue preemption.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • Currently, atomic_cmpxchg() is used to get the lock. However, this
    is not really necessary if there is more than one task in the queue
    and the queue head doesn't need to reset the tail code. For that case,
    a simple write to set the lock bit is enough as the queue head will
    be the only one eligible to get the lock as long as it checks that
    both the lock and pending bits are not set. The current pending bit
    waiting code will ensure that the bit will not be set as soon as the
    tail code in the lock is set.
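
    The head's acquisition then becomes, roughly (sketch):

        for (;;) {
                /* more waiters queued behind us: a plain byte store suffices */
                if (val != tail) {
                        set_locked(lock);       /* WRITE_ONCE(l->locked, _Q_LOCKED_VAL) */
                        break;
                }
                /* we are the only waiter: must also clear the tail atomically */
                old = atomic_cmpxchg(&lock->val, val, _Q_LOCKED_VAL);
                if (old == val)
                        goto release;           /* no contention */
                val = old;
        }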

    With that change, there is a slight improvement in the performance
    of the queued spinlock in the 5M loop micro-benchmark run on a 4-socket
    Westmere-EX machine, as shown in the tables below.

    [Standalone/Embedded - same node]
      # of tasks    Before patch    After patch    %Change
      ----------    ------------    -----------    -------
           3          2324/2321      2248/2265     -3%/-2%
           4          2890/2896      2819/2831     -2%/-2%
           5          3611/3595      3522/3512     -2%/-2%
           6          4281/4276      4173/4160     -3%/-3%
           7          5018/5001      4875/4861     -3%/-3%
           8          5759/5750      5563/5568     -3%/-3%

    [Standalone/Embedded - different nodes]
      # of tasks    Before patch    After patch    %Change
      ----------    ------------    -----------    -------
           3         12242/12237    12087/12093    -1%/-1%
           4         10688/10696    10507/10521    -2%/-2%

    It was also found that this change produced a much bigger performance
    improvement in the newer IvyBridge-EX chip and essentially closed
    the performance gap between the ticket spinlock and the queued spinlock.

    The disk workload of the AIM7 benchmark was run on a 4-socket
    Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
    on a 3.14 based kernel. The results of the test runs were:

    AIM7 XFS Disk Test
      kernel          JPM     Real Time    Sys Time    Usr Time
      ------          ---     ---------    --------    --------
      ticketlock    5678233      3.17        96.61       5.81
      qspinlock     5750799      3.13        94.83       5.97

    AIM7 EXT4 Disk Test
      kernel          JPM     Real Time    Sys Time    Usr Time
      ------          ---     ---------    --------    --------
      ticketlock    1114551     16.15       509.72       7.11
      qspinlock     2184466      8.24       232.99       6.01

    The ext4 filesystem run had a much higher spinlock contention than
    the xfs filesystem run.

    The "ebizzy -m" test was also run with the following results:

      kernel       records/s    Real Time    Sys Time    Usr Time
      ------       ---------    ---------    --------    --------
      ticketlock      2075        10.00       216.35       3.49
      qspinlock       3023        10.00       198.20       4.80

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When we allow for a max NR_CPUS < 2^14 we can optimize the pending
    wait-acquire and the xchg_tail() operations.

    By growing the pending bit to a byte, we reduce the tail to 16 bits.
    This means we can use xchg16 for the tail part and do away with all
    the repeated cmpxchg() operations.

    This in turn allows us to unconditionally acquire; the locked state
    as observed by the wait loops cannot change. And because both locked
    and pending are now a full byte we can use simple stores for the
    state transition, obviating one atomic operation entirely.

    This optimization is needed to make the qspinlock achieve performance
    parity with ticket spinlock at light load.

    All this is horribly broken on Alpha pre EV56 (and any other arch that
    cannot do single-copy atomic byte stores).
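
    With that layout the two helpers reduce to, roughly (sketch of the
    NR_CPUS < 16K variants):

        static __always_inline void clear_pending_set_locked(struct qspinlock *lock)
        {
                struct __qspinlock *l = (void *)lock;

                WRITE_ONCE(l->locked_pending, _Q_LOCKED_VAL);   /* one 16-bit store */
        }

        static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
        {
                struct __qspinlock *l = (void *)lock;

                /* one xchg16 instead of a cmpxchg() retry loop */
                return (u32)xchg(&l->tail, tail >> _Q_TAIL_OFFSET) << _Q_TAIL_OFFSET;
        }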

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • This is a preparatory patch that extracts out the following 2 code
    snippets to prepare for the next performance optimization patch.

    1) the logic for the exchange of new and previous tail code words
    into a new xchg_tail() function.
    2) the logic for clearing the pending bit and setting the locked bit
    into a new clear_pending_set_locked() function.

    This patch also simplifies the trylock operation before queuing by
    calling queued_spin_trylock() directly.
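
    For reference, the extracted generic (cmpxchg-loop) form of
    xchg_tail() looks roughly like:

        static __always_inline u32 xchg_tail(struct qspinlock *lock, u32 tail)
        {
                u32 old, new, val = atomic_read(&lock->val);

                for (;;) {
                        new = (val & _Q_LOCKED_PENDING_MASK) | tail;
                        old = atomic_cmpxchg(&lock->val, val, new);
                        if (old == val)
                                break;
                        val = old;
                }
                return old;     /* previous tail: lets the caller link behind it */
        }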

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Because the qspinlock needs to touch a second cacheline (the per-cpu
    mcs_nodes[]); add a pending bit and allow a single in-word spinner
    before we punt to the second cacheline.

    It is possible to observe the pending bit without the locked bit when
    the last owner has just released but the pending owner has not yet
    taken ownership.

    In this case we would normally queue -- because the pending bit is
    already taken. However, in this case the pending bit is guaranteed
    to be released 'soon', therefore wait for it and avoid queueing.
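
    The wait for that transient state is a short spin at the top of the
    slowpath, roughly:

        /*
         * Pending set but locked clear: the previous owner just released
         * and the pending waiter is about to take ownership; wait for the
         * hand-over instead of queueing.
         */
        if (val == _Q_PENDING_VAL) {
                while ((val = atomic_read(&lock->val)) == _Q_PENDING_VAL)
                        cpu_relax();
        }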

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
  • This patch introduces a new generic queued spinlock implementation that
    can serve as an alternative to the default ticket spinlock. Compared
    with the ticket spinlock, this queued spinlock should be almost as fair
    as the ticket spinlock. It has about the same speed in the
    single-threaded case, and it can be much faster in high-contention
    situations, especially when the spinlock is embedded within the data
    structure to be protected.

    Only in light to moderate contention where the average queue depth
    is around 1-3 will this queued spinlock be potentially a bit slower
    due to the higher slowpath overhead.

    This queued spinlock is especially suited to NUMA machines with a large
    number of cores as the chance of spinlock contention is much higher
    in those machines. The cost of contention is also higher because of
    slower inter-node memory traffic.

    Due to the fact that spinlocks are acquired with preemption disabled,
    the process will not be migrated to another CPU while it is trying
    to get a spinlock. Ignoring interrupt handling, a CPU can only be
    contending on one spinlock at any one time. Counting soft IRQ, hard
    IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
    activities. By allocating a set of per-cpu queue nodes and using them
    to form a waiting queue, we can encode the queue node address into a
    much smaller 24-bit value (including CPU number and queue node index),
    leaving one byte for the lock.

    Please note that the queue node is only needed when waiting for the
    lock. Once the lock is acquired, the queue node can be released to
    be used later.
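
    A sketch of the tail encoding and the per-cpu node lookup (the exact
    bit offsets are defined by the _Q_TAIL_* constants):

        static inline u32 encode_tail(int cpu, int idx)
        {
                u32 tail;

                tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;        /* 0 means "no tail" */
                tail |= idx << _Q_TAIL_IDX_OFFSET;              /* per-context node index */

                return tail;
        }

        static inline struct mcs_spinlock *decode_tail(u32 tail)
        {
                int cpu = (tail >> _Q_TAIL_CPU_OFFSET) - 1;
                int idx = (tail & _Q_TAIL_IDX_MASK) >> _Q_TAIL_IDX_OFFSET;

                return per_cpu_ptr(&mcs_nodes[idx], cpu);
        }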

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long