07 Jul, 2017

1 commit

  • Update the dcache, inode, pid, mountpoint, and mount hash tables to use
    HASH_ZERO, and remove the explicit initialization that followed the
    allocations. In places where HASH_EARLY was used, such as in
    __pv_init_lock_hash(), a zeroed hash table was already assumed because
    memblock zeroes the memory.
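
    A minimal sketch (not taken from the patch itself) of the intended
    pattern, using the dentry cache as the example: pass HASH_ZERO to
    alloc_large_system_hash() and drop the bucket-initialization loop that
    used to follow the allocation. The sizing parameters are illustrative.

        /* before: allocate, then walk every bucket to initialize it */
        dentry_hashtable = alloc_large_system_hash("Dentry cache",
                                sizeof(struct hlist_bl_head),
                                dhash_entries, 13,
                                0,                      /* no HASH_ZERO */
                                &d_hash_shift, &d_hash_mask, 0, 0);
        for (loop = 0; loop < (1U << d_hash_shift); loop++)
                INIT_HLIST_BL_HEAD(dentry_hashtable + loop);

        /* after: ask the allocator for already-zeroed memory */
        dentry_hashtable = alloc_large_system_hash("Dentry cache",
                                sizeof(struct hlist_bl_head),
                                dhash_entries, 13,
                                HASH_ZERO,
                                &d_hash_shift, &d_hash_mask, 0, 0);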

    CPU: SPARC M6, Memory: 7T
    Before fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 11.798s

    After fix:
    Dentry cache hash table entries: 1073741824
    Inode-cache hash table entries: 536870912
    Mount-cache hash table entries: 16777216
    Mountpoint-cache hash table entries: 16777216
    ftrace: allocating 20414 entries in 40 pages
    Total time: 3.198s

    CPU: Intel Xeon E5-2630, Memory: 2.2T:
    Before fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.245s

    After fix:
    Dentry cache hash table entries: 536870912
    Inode-cache hash table entries: 268435456
    Mount-cache hash table entries: 8388608
    Mountpoint-cache hash table entries: 8388608
    CPU: Physical Processor ID: 0
    Total time: 3.244s

    Link: http://lkml.kernel.org/r/1488432825-92126-4-git-send-email-pasha.tatashin@oracle.com
    Signed-off-by: Pavel Tatashin
    Reviewed-by: Babu Moger
    Cc: David Miller
    Cc: Al Viro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Pavel Tatashin
     

12 Jan, 2017

1 commit

  • If the prev node is not in the running state or its vCPU has been
    preempted, we can give up our vCPU slices in pv_wait_node() ASAP.
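
    A rough sketch of the early-wait test this describes (the names follow
    the pvqspinlock code, but the snippet is an approximation rather than
    the exact hunk): while spinning in pv_wait_node(), bail out to pv_wait()
    as soon as the previous node is not running or its vCPU is preempted.

        /* consulted every few iterations of the pv_wait_node() spin loop */
        static inline bool pv_wait_early(struct pv_node *prev, int loop)
        {
                if ((loop & PV_PREV_CHECK_MASK) != 0)
                        return false;           /* only check periodically */

                return READ_ONCE(prev->state) != vcpu_running ||
                       vcpu_is_preempted(prev->cpu);
        }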

    Signed-off-by: Pan Xinhui
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: longman@redhat.com
    Link: http://lkml.kernel.org/r/1484035006-6787-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Fixed typos in the changelog, removed ugly linebreak from the code. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     

22 Sep, 2016

1 commit

  • cmpxchg_release() is more lightweight than cmpxchg() on some
    architectures (e.g. PPC). Moreover, in __pv_queued_spin_unlock() we only
    need a RELEASE in the fast path (pairing with *_try_lock() or *_lock()),
    and the slow path already uses smp_store_release(). So it is safe to use
    cmpxchg_release() here.
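
    A paraphrased sketch of the fast path in question (not a verbatim copy
    of the patch): the only functional change is that the fully ordered
    cmpxchg() becomes cmpxchg_release(), which still provides the RELEASE
    needed when handing the lock over.

        __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
        {
                u8 locked;

                /* RELEASE is sufficient; a full barrier is not needed */
                locked = cmpxchg_release(&lock->locked, _Q_LOCKED_VAL, 0);
                if (likely(locked == _Q_LOCKED_VAL))
                        return;

                /* _Q_SLOW_VAL (or corruption): unhash and kick the waiter */
                __pv_queued_spin_unlock_slowpath(lock, locked);
        }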

    Suggested-by: Boqun Feng
    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: benh@kernel.crashing.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: paulmck@linux.vnet.ibm.com
    Cc: paulus@samba.org
    Cc: virtualization@lists.linux-foundation.org
    Cc: waiman.long@hpe.com
    Link: http://lkml.kernel.org/r/1474277037-15200-2-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     

10 Aug, 2016

3 commits

  • Currently there is overlap between the pvqspinlock wait_again and
    spurious_wakeup stat counters. Because of lock stealing, it is no
    longer possible to accurately determine if a spurious wakeup has
    happened in the queue head. As the counters track both the queue node
    and queue head status, it is also hard to tell how many of those events
    come from the queue head and how many from the queue node.

    This patch changes the accounting rules so that spurious wakeup is
    only tracked in the queue node. The wait_again count, however, is
    only tracked in the queue head when the vCPU failed to acquire the
    lock after a vCPU kick. This should give a much better indication of
    the wait-kick dynamics in the queue node and the queue head.
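
    A hedged sketch of where the two counters end up under the new rules
    (the qstat_inc() helper and counter names follow qspinlock_stat.h; the
    surrounding code is omitted):

        /* queue node, in pv_wait_node(): woken up but the lock was not
         * handed to us -> count it as a spurious wakeup */
        qstat_inc(qstat_pv_spurious_wakeup, !READ_ONCE(node->locked));

        /* queue head, in pv_wait_head_or_lock(): about to wait again
         * after a kick that did not result in lock acquisition */
        qstat_inc(qstat_pv_wait_again, waitcnt);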

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464713631-1066-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Restructure pv_queued_spin_steal_lock() as I found it hard to read.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long

    Peter Zijlstra
     
  • When the lock holder vCPU is racing with the queue head:

    CPU 0 (lock holder)                CPU1 (queue head)
    ===================                =================
    spin_lock();                       spin_lock();
     pv_kick_node():                    pv_wait_head_or_lock():
                                         if (!lp) {
                                          lp = pv_hash(lock, pn);
                                          xchg(&l->locked, _Q_SLOW_VAL);
                                         }
                                         WRITE_ONCE(pn->state, vcpu_halted);
      cmpxchg(&pn->state,
       vcpu_halted, vcpu_hashed);
     WRITE_ONCE(l->locked, _Q_SLOW_VAL);
     (void)pv_hash(lock, pn);

    In this case, the lock holder inserts the pv_node of the queue head
    into the hash table and sets _Q_SLOW_VAL unnecessarily. This patch
    avoids that by restoring/setting the vcpu_hashed state after the
    adaptive lock spinning fails, as sketched below.
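
    A minimal sketch of the fix as described above (paraphrased, not the
    literal diff): when the queue head gives up its adaptive spinning in
    pv_wait_head_or_lock(), it is already hashed, so it records vcpu_hashed
    rather than vcpu_halted before sleeping; pv_kick_node() on the lock
    holder side then sees vcpu_hashed and skips the second hash.

        /* pv_wait_head_or_lock(), after the spinning attempt failed */
        WRITE_ONCE(pn->state, vcpu_hashed);     /* was: vcpu_halted */
        pv_wait(&l->locked, _Q_SLOW_VAL);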

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Pan Xinhui
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

16 Jun, 2016

1 commit

  • These functions have been deprecated for a while and there is only one
    user left; convert it and kill the functions.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Frederic Weisbecker
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Feb, 2016

2 commits

  • This patch enables the tracking of the number of slowpath locking
    operations performed. This can be used to compare against the number
    of lock stealing operations to see what percentage of locks are stolen
    versus acquired via the regular slowpath.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch moves the lock stealing count tracking code directly into
    pv_queued_spin_steal_lock() instead of going through a jacket function,
    simplifying the code.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

04 Dec, 2015

3 commits

  • In an overcommitted guest where some vCPUs have to be halted to make
    forward progress in other areas, it is highly likely that a vCPU later
    in the spinlock queue will be spinning while the ones earlier in the
    queue would have been halted. The spinning in the later vCPUs is then
    just a waste of precious CPU cycles, because they are not going to get
    the lock soon: the earlier ones have to be woken up and take their turn
    first.

    This patch implements an adaptive spinning mechanism where the vCPU
    will call pv_wait() if the previous vCPU is not running.
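
    A simplified sketch of the adaptive spin loop in pv_wait_node() (the
    constants and field names follow the pvqspinlock code, but the snippet
    is an approximation): spin a bounded number of times, and break out to
    pv_wait() early if the previous queue entry is seen as not running.

        for (wait_early = false, loop = SPIN_THRESHOLD; loop; loop--) {
                if (READ_ONCE(node->locked))
                        return;                 /* the lock was handed over */
                if (!(loop & PV_PREV_CHECK_MASK) &&
                    READ_ONCE(pp->state) != vcpu_running) {
                        wait_early = true;      /* prev halted: stop spinning */
                        break;
                }
                cpu_relax();
        }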

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                      Westmere              Haswell
    Patch          32 vCPUs  48 vCPUs    32 vCPUs  48 vCPUs
    -----          --------  --------    --------  --------
    Before patch    3m02.3s   5m00.2s     1m43.7s   3m03.5s
    After patch     3m03.0s   4m37.5s     1m43.0s   2m47.2s

    For 32 vCPUs, this patch doesn't cause any noticeable change in
    performance. For 48 vCPUs (over-committed), there is about 8%
    performance improvement.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch allows one attempt for the lock waiter to steal the lock
    when entering the PV slowpath. To prevent lock starvation, the pending
    bit will be set by the queue head vCPU when it is in the active lock
    spinning loop to disable any lock stealing attempt. This helps to
    reduce the performance penalty caused by lock waiter preemption while
    not having much of the downsides of a real unfair lock.

    The pv_wait_head() function was renamed as pv_wait_head_or_lock()
    as it was modified to acquire the lock before returning. This is
    necessary because of possible lock stealing attempts from other tasks.
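
    A condensed sketch of the one-shot steal attempt and the pending-bit
    guard described above (paraphrased from the patch description, not a
    verbatim extract):

        /* on slowpath entry: steal only if neither the locked byte nor
         * the pending bit is set; the queue head sets the pending bit
         * while it actively spins, which blocks further stealing */
        static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
        {
                return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
                       (cmpxchg(&lock->locked, 0, _Q_LOCKED_VAL) == 0);
        }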

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                      Westmere              Haswell
    Patch          32 vCPUs  48 vCPUs    32 vCPUs  48 vCPUs
    -----          --------  --------    --------  --------
    Before patch    3m15.6s  10m56.1s     1m44.1s   5m29.1s
    After patch     3m02.3s   5m00.2s     1m43.7s   3m03.5s

    For the overcommitted case (48 vCPUs), this patch is able to reduce
    kernel build time by more than 54% for Westmere and 44% for Haswell.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch enables the accumulation of kicking and waiting related
    PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
    option is selected. It also enables the collection of data that lets
    us calculate the kicking and wakeup latencies, which depend heavily
    on the CPUs being used.

    The statistical counters are per-cpu variables to minimize the
    performance overhead in their updates. These counters are exported
    via the debugfs filesystem under the qlockstat directory. When the
    corresponding debugfs files are read, summation and computing of the
    required data are then performed.
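
    The general shape of this per-CPU counter scheme, as a hedged sketch
    (the qstats array and qstat_num follow qspinlock_stat.h naming; the
    summation helper is illustrative):

        static DEFINE_PER_CPU(unsigned long, qstats[qstat_num]);

        /* on a debugfs read, sum the per-cpu counts on demand */
        static u64 qstat_sum(int stat)
        {
                u64 sum = 0;
                int cpu;

                for_each_possible_cpu(cpu)
                        sum += per_cpu(qstats[stat], cpu);
                return sum;
        }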

    The measured latencies for different CPUs are:

    CPU            Wakeup    Kicking
    ---            ------    -------
    Haswell-EX     63.6us     7.4us
    Westmere-EX    67.6us     9.3us

    The measured latencies varied a bit from run-to-run. The wakeup
    latency is much higher than the kicking latency.

    A sample of statistical counters after system bootup (with vCPU
    overcommit) was:

    pv_hash_hops=1.00
    pv_kick_unlock=1148
    pv_kick_wake=1146
    pv_latency_kick=11040
    pv_latency_wake=194840
    pv_spurious_wakeup=7
    pv_wait_again=4
    pv_wait_head=23
    pv_wait_node=1129

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

23 Nov, 2015

1 commit

  • The unlock function in queued spinlocks was optimized for better
    performance on bare metal systems at the expense of virtualized guests.

    For x86-64 systems, the unlock call needs to go through a
    PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
    registers before calling the real __pv_queued_spin_unlock()
    function. The thunk code may also be in a separate cacheline from
    __pv_queued_spin_unlock().

    This patch optimizes the PV unlock code path by:

    1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath as
    simple as possible.

    2) For x86-64, hand-coding an assembly function that combines the
    register-saving thunk code with the fastpath code. Only registers that
    are used in the fastpath will be saved and restored. If the fastpath
    fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores only
    one 32-bit register.

    With a microbenchmark of 5M lock/unlock loops, the table below shows
    the execution times before and after the patch with different numbers
    of threads in a VM running on a 32-core Westmere-EX box with x86-64
    4.2-rc1 based kernels:

    Threads    Before patch    After patch    % Change
    -------    ------------    -----------    --------
       1         134.1 ms        119.3 ms       -11%
       2          1286 ms          953 ms       -26%
       3          3715 ms         3480 ms       -6.3%
       4          4092 ms         3764 ms       -8.0%

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

18 Sep, 2015

1 commit

  • If _Q_SLOW_VAL has been set, the vCPU state must have been vcpu_hashed.
    The extra check at the end of __pv_queued_spin_unlock() is unnecessary
    and can be removed.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1441996658-62854-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

03 Aug, 2015

3 commits

  • For an over-committed guest with more vCPUs than physical CPUs
    available, it is possible that a vCPU may be kicked twice before
    getting the lock - once before it becomes queue head and once again
    before it gets the lock. All this CPU kicking and halting (VMEXIT)
    can be expensive and slow down system performance.

    This patch adds a new vCPU state (vcpu_hashed) which enables the code
    to delay CPU kicking until unlock time. Once this state is set, the
    new lock holder will set _Q_SLOW_VAL and fill in the hash table on
    behalf of the halted queue head vCPU. The original vcpu_halted state
    will be used only by pv_wait_node(), to differentiate other queue
    nodes from the queue head, as sketched below.
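
    A hedged sketch of the lock holder side after this change (paraphrased
    from the description above, not the literal patch): pv_kick_node()
    advances a halted queue head to vcpu_hashed, hashes the lock on its
    behalf, and defers the actual kick to the unlock path.

        static void pv_kick_node(struct qspinlock *lock, struct mcs_spinlock *node)
        {
                struct pv_node *pn = (struct pv_node *)node;
                struct __qspinlock *l = (void *)lock;

                /* only act if the queue head has really halted itself */
                if (cmpxchg(&pn->state, vcpu_halted, vcpu_hashed) != vcpu_halted)
                        return;

                /* hash the lock on behalf of the halted queue head and mark
                 * the slow path; the kick itself is deferred to unlock time */
                WRITE_ONCE(l->locked, _Q_SLOW_VAL);
                (void)pv_hash(lock, pn);
        }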

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When we unlock in __pv_queued_spin_unlock(), a failed cmpxchg() on the lock
    value indicates that we need to take the slow-path and unhash the
    corresponding node blocked on the lock.

    Since a failed cmpxchg() does not provide any memory-ordering guarantees,
    it is possible that the node data could be read before the cmpxchg() on
    weakly-ordered architectures and therefore return a stale value, leading
    to hash corruption and/or a BUG().

    This patch adds an smp_rmb() following the failed cmpxchg operation, so
    that the unhashing is ordered after the lock has been checked.
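
    A minimal sketch of the ordering fix (an illustration of the described
    change, not the exact diff), from inside __pv_queued_spin_unlock();
    here l is the byte-level view of the lock word:

        struct pv_node *node;
        u8 locked;

        locked = cmpxchg(&l->locked, _Q_LOCKED_VAL, 0);
        if (likely(locked == _Q_LOCKED_VAL))
                return;                 /* uncontended fast path */

        /*
         * The failed cmpxchg() provides no ordering, so the read barrier
         * makes sure pv_unhash() only runs after the lock byte has been
         * observed as _Q_SLOW_VAL, preventing a stale-node lookup.
         */
        smp_rmb();

        node = pv_unhash(lock);
        smp_store_release(&l->locked, 0);
        pv_kick(node->cpu);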

    Reported-by: Peter Zijlstra
    Signed-off-by: Will Deacon
    [ Added more comments]
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul McKenney
    Cc: Steve Capper
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150713155830.GL2632@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • - Rename the on-stack variable to match the data structure variable,

    - place the cmpxchg back under the comment that explains it,

    - clean up the WARN() statement to avoid superfluous conditionals
    and line-breaks.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

21 Jul, 2015

1 commit

  • Enabling locking-selftest in a VM guest may cause the following
    kernel panic:

    kernel BUG at .../kernel/locking/qspinlock_paravirt.h:137!

    This is due to the fact that the pvqspinlock unlock function expects
    either _Q_LOCKED_VAL or _Q_SLOW_VAL in the lock byte. This patch
    prevents the panic by ignoring the unexpected value when
    debug_locks_silent is set; otherwise, a warning will be printed.

    With this patch applied, the kernel locking-selftest completed
    without any noise.

    Tested-by: Masami Hiramatsu
    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1436663959-53092-1-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

19 May, 2015

1 commit

  • Since set_mb() is really about an smp_mb() -- not an IO/DMA barrier
    like mb() -- rename it to match the recent smp_load_acquire() and
    smp_store_release().
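
    For context, a hedged sketch of what the renamed helper boils down to
    on most architectures (the generic definition is paraphrased; per-arch
    versions may use cheaper instructions):

        /* store the value, then issue a full SMP memory barrier */
        #define smp_store_mb(var, value) \
                do { WRITE_ONCE(var, value); smp_mb(); } while (0)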

    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 May, 2015

1 commit

  • The xchg() function was used in pv_wait_node() to set a certain value
    and provide a memory barrier, which is exactly what the set_mb()
    function is for. This patch replaces the xchg() call with set_mb().
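
    The before/after below paraphrases the change as described (a sketch,
    not the literal diff):

        /* before: a full xchg() used only for its store + barrier effect */
        (void)xchg(&pn->state, vcpu_halted);

        /* after: set_mb() states that intent directly */
        set_mb(pn->state, vcpu_halted);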

    Suggested-by: Linus Torvalds
    Signed-off-by: Waiman Long
    Cc: Douglas Hatch
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Waiman Long
     

08 May, 2015

1 commit

  • Provide a separate (second) version of the spin_lock_slowpath for
    paravirt along with a special unlock path.

    The second slowpath is generated by adding a few pv hooks to the
    normal slowpath, but where those will compile away for the native
    case, they expand into special wait/wake code for the pv version.

    The actual MCS queue can use extra storage in the mcs_nodes[] array to
    keep track of state and therefore uses directed wakeups.

    The head contender has no such storage directly visible to the
    unlocker. So the unlocker searches a hash table with open addressing
    using a simple binary Galois linear feedback shift register.
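
    A hedged sketch of the hash table idea (illustrative only: the entry
    layout and helper names such as pv_hash_next() are assumptions, and the
    probe sequence is shown generically rather than as the LFSR mentioned
    above):

        /* each entry maps a lock address to the pv_node of its waiter */
        struct pv_hash_entry {
                struct qspinlock *lock;
                struct pv_node   *node;
        };

        static struct pv_hash_entry pv_lock_hash[1 << PV_LOCK_HASH_BITS];

        static struct qspinlock **pv_hash(struct qspinlock *lock,
                                          struct pv_node *node)
        {
                unsigned long hash = hash_ptr(lock, PV_LOCK_HASH_BITS);
                struct pv_hash_entry *he;

                /* open addressing: claim the first free slot in the probe
                 * sequence with cmpxchg() so concurrent hashers never race */
                for (;; hash = pv_hash_next(hash)) {
                        he = &pv_lock_hash[hash];
                        if (!cmpxchg(&he->lock, NULL, lock)) {
                                WRITE_ONCE(he->node, node);
                                return &he->lock;
                        }
                }
        }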

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long