31 May, 2019
1 commit
-
Based on 3 normalized pattern(s):
this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version this program is distributed in the
hope that it will be useful but without any warranty without even
the implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version [author] [kishon] [vijay] [abraham]
[i] [kishon]@[ti] [com] this program is distributed in the hope that
it will be useful but without any warranty without even the implied
warranty of merchantability or fitness for a particular purpose see
the gnu general public license for more details

this program is free software you can redistribute it and or modify
it under the terms of the gnu general public license as published by
the free software foundation either version 2 of the license or at
your option any later version [author] [graeme] [gregory]
[gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
[kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
[hk] [hemahk]@[ti] [com] this program is distributed in the hope
that it will be useful but without any warranty without even the
implied warranty of merchantability or fitness for a particular
purpose see the gnu general public license for more details

extracted by the scancode license scanner the SPDX license identifier
GPL-2.0-or-later
has been chosen to replace the boilerplate/reference in 1105 file(s).
Signed-off-by: Thomas Gleixner
Reviewed-by: Allison Randal
Reviewed-by: Richard Fontana
Reviewed-by: Kate Stewart
Cc: linux-spdx@vger.kernel.org
Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
Signed-off-by: Greg Kroah-Hartman
10 Apr, 2019
1 commit
-
The percpu event counts used by qspinlock code can be useful for
other locking code as well. So a new set of lockevent_* counting APIs
is introduced with the lock event names extracted out into the new
lock_events_list.h header file for easier addition in the future.
The existing qstat_inc() calls are replaced by either lockevent_inc() or
lockevent_cond_inc() calls. The qstat_hop() call is renamed to
lockevent_pv_hop(). The "reset_counters" debugfs file is also renamed to
".reset_counts".
Signed-off-by: Waiman Long
Acked-by: Peter Zijlstra
Acked-by: Davidlohr Bueso
Cc: Andrew Morton
Cc: Arnd Bergmann
Cc: Borislav Petkov
Cc: Davidlohr Bueso
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Tim Chen
Cc: Will Deacon
Link: http://lkml.kernel.org/r/20190404174320.22416-8-longman@redhat.com
Signed-off-by: Ingo Molnar
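As a rough illustration of the counting API described above, a per-CPU
event counter scheme might look like the following sketch (names and
layout are assumed for illustration, not taken from the kernel tree):

  #include <linux/percpu.h>

  /* Hypothetical event list; the real names live in lock_events_list.h. */
  enum lock_events {
          LOCKEVENT_lock_pending,         /* locks taken via the pending bit */
          LOCKEVENT_lock_slowpath,        /* slowpath entries */
          lockevent_num,                  /* total number of events */
  };

  static DEFINE_PER_CPU(unsigned long, lockevents[lockevent_num]);

  /* Unconditionally count one occurrence of an event. */
  #define lockevent_inc(ev)  this_cpu_inc(lockevents[LOCKEVENT_##ev])

  /* Count an event only when a condition holds. */
  #define lockevent_cond_inc(ev, cond)                  \
          do {                                          \
                  if (cond)                             \
                          lockevent_inc(ev);            \
          } while (0)

A call site would then simply do lockevent_inc(lock_slowpath) where the
old code used qstat_inc().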
28 Feb, 2019
1 commit
-
With the > 4 nesting levels case handled by the commit:
d682b596d993 ("locking/qspinlock: Handle > 4 slowpath nesting levels")
the BUG_ON() call in encode_tail() will never actually be triggered.
Remove it.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Will Deacon
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: https://lkml.kernel.org/r/1551057253-3231-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar
04 Feb, 2019
2 commits
-
Track the number of slowpath locking operations that are being done
without any MCS node available, as well as renaming lock_index[123] to
make them more descriptive.
Using these stat counters is one way to find out if a code path is
being exercised.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: James Morse
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: SRINIVAS
Cc: Thomas Gleixner
Cc: Will Deacon
Cc: Zhenzhong Duan
Link: https://lkml.kernel.org/r/1548798828-16156-3-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar -
Four queue nodes per CPU are allocated to enable up to 4 nesting levels
using the per-CPU nodes. Nested NMIs are possible on some architectures,
but it is still very unlikely that we will ever hit more than 4 nesting
levels with contention in the slowpath.
When that rare condition does happen, however, the system is likely to
hang or crash shortly afterwards, so this exception case needs to be
handled.
This is done by spinning directly on the lock using repeated trylock
attempts. This alternative code path should only be used when there are
nested NMIs. Assuming that the locks used by those NMI handlers will not
be heavily contended, a simple test-and-set (TAS) lock should work out.
Suggested-by: Peter Zijlstra
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Will Deacon
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: H. Peter Anvin
Cc: James Morse
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: SRINIVAS
Cc: Thomas Gleixner
Cc: Zhenzhong Duan
Link: https://lkml.kernel.org/r/1548798828-16156-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar
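A hedged sketch of the fallback described above, written as a standalone
helper for illustration (the helper name is hypothetical; in the kernel
this logic sits inline in the slowpath):

  #include <linux/spinlock.h>

  /*
   * When no per-CPU MCS node is available (more than 4 nested slowpath
   * entries, which realistically requires nested NMIs), do not queue at
   * all: spin directly on the lock word with repeated trylock attempts,
   * i.e. behave like a simple test-and-set lock for this rare case.
   */
  static __always_inline void qspinlock_nested_nmi_fallback(struct qspinlock *lock)
  {
          while (!queued_spin_trylock(lock))
                  cpu_relax();
  }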
17 Oct, 2018
2 commits
-
The qspinlock code supports up to 4 levels of slowpath nesting using
four per-CPU mcs_spinlock structures. For 64-bit architectures, they
fit nicely in one 64-byte cacheline.
The para-virtualized (PV) qspinlock code, however, needs to store more
information in the per-CPU node structure than there is space for, so it
uses a trick to hold the extra information in a second cacheline. As a
result, PV qspinlocks need to access two extra cachelines for their
per-CPU data, whereas the native qspinlock code only needs one.
Freshly added counter profiling of the qspinlock code revealed that it
is very rare to use more than two levels of slowpath nesting, so it does
not make sense to penalize the PV qspinlock code just to keep four
mcs_spinlock structures in the same cacheline for a native-code case
that rarely happens.
Extend the per-CPU node structure with two more long words, when PV
qspinlocks are configured, to hold the extra data the PV code needs.
As a result, the PV qspinlock code will enjoy the same benefit of using
just one extra cacheline as the native counterpart, in most cases.
[ mingo: Minor changelog edits. ]
Signed-off-by: Waiman Long
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar -
Queued spinlock supports up to 4 levels of lock slowpath nesting -
user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
often the nesting happens.
So add 3 more per-CPU stat counters to track the number of instances
where the nesting index goes to 1, 2 and 3 respectively.
On a dual-socket 64-core 128-thread Zen server, the following were the
new stat counter values under different circumstances:

  State                              slowpath   index1  index2  index3
  -----                              --------   ------  ------  ------
  After bootup                       1,012,150      82       0       0
  After parallel build + perf-top  125,195,009      82       0       0

So the chance of having more than 2 levels of nesting is extremely low.
[ mingo: Minor changelog edits. ]
Signed-off-by: Waiman Long
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Will Deacon
Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
Signed-off-by: Ingo Molnar
16 Oct, 2018
3 commits
-
On x86 we cannot do fetch_or() with a single instruction and thus end up
using a cmpxchg loop, which reduces determinism. Replace the fetch_or()
with a composite operation: tas-pending + load.
Using two instructions of course opens a window we previously did not
have. Consider the following scenario, with steps interleaved across
CPU0, CPU1 and CPU2:

  1) lock
       trylock -> (0,0,1)
  2) lock
       trylock /* fail */
  3) unlock -> (0,0,0)
  4) lock
       trylock -> (0,0,1)
  5) tas-pending -> (0,1,1)
     load-val (0,0,1)

  FAIL: _2_ owners
where 5) is our new composite operation. When we consider each part of
the qspinlock state as a separate variable (as we can when
_Q_PENDING_BITS == 8) then the above is entirely possible, because
tas-pending will only RmW the pending byte, so the later load is able
to observe prior tail and lock state (but not earlier than its own
trylock, which operates on the whole word, due to coherence).
To avoid this we need two things:
 - the load must come after the tas-pending (obviously, otherwise it
   can trivially observe prior state);
 - the tas-pending must be a full-word RmW instruction (it cannot be an
   XCHGB, for example), such that we cannot observe other state prior to
   setting pending.
On x86 we can realize this by using "LOCK BTS m32, r32" for tas-pending
followed by a regular load.
Note that observing later state is not a problem:
 - if we fail to observe a later unlock, we'll simply spin-wait for
   that store to become visible;
 - if we observe a later xchg_tail(), there is no difference from that
   xchg_tail() having taken place before the tas-pending.
Suggested-by: Will Deacon
Reported-by: Thomas Gleixner
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Will Deacon
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
Signed-off-by: Ingo Molnar -
While working my way through the code again, I felt the comments could
use some help.
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Will Deacon
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar -
Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
such that we lose the indent. This also results in a more natural code
flow, IMO.
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Will Deacon
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: andrea.parri@amarulasolutions.com
Cc: longman@redhat.com
Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
Signed-off-by: Ingo Molnar
27 Apr, 2018
9 commits
-
Currently, the qspinlock_stat code tracks only statistical counts in the
PV qspinlock code. However, it may also be useful to track the number
of locking operations done via the pending code vs. the MCS lock queue
slowpath for the non-PV case.
The qspinlock stat code is modified to do that. The stat counter
pv_lock_slowpath is renamed to lock_slowpath so that it can be used by
both the PV and non-PV cases.
Signed-off-by: Waiman Long
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Cc: will.deacon@arm.com
Link: http://lkml.kernel.org/r/1524738868-31318-14-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
When reaching the head of an uncontended queue on the qspinlock slow-path,
using a try_cmpxchg() instead of a cmpxchg() operation to transition the
lock word to _Q_LOCKED_VAL generates slightly better code for x86 and
pretty much identical code for arm64.
Reported-by: Peter Zijlstra
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-13-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
The qspinlock slowpath must ensure that the MCS node is fully initialised
before it can be reached by another CPU. This is currently enforced
by using a RELEASE operation when updating the tail and also when linking
the node into the waitqueue, since the control dependency off xchg_tail
is insufficient to enforce sufficient ordering; see:
  95bcade33a8a ("locking/qspinlock: Ensure node is initialised before updating prev->next")
Back-to-back RELEASE operations may be expensive on some architectures,
particularly those that implement them using fences under the hood. We
can replace the two RELEASE operations with a single smp_wmb() fence and
use RELAXED operations for the subsequent publishing of the node.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-12-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
When a locker reaches the head of the queue and takes the lock, a
concurrent locker may enqueue and force the lock holder to spin
whilst its node->next field is initialised. Rather than open-code
a READ_ONCE/cpu_relax() loop, this can be implemented using
smp_cond_load_relaxed() instead.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-10-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
Rather than dig into the counter field of the atomic_t inside the
qspinlock structure so that we can call smp_cond_load_acquire(), use
atomic_cond_read_acquire() instead, which operates on the atomic_t
directly.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-8-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
When a queued locker reaches the head of the queue, it claims the lock
by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
must also clear the tail as part of this operation so that subsequent
lockers can avoid taking the slowpath altogether.
Currently this is expressed as a cmpxchg() loop that practically only
runs up to two iterations. This is confusing to the reader and unhelpful
to the compiler. Rewrite the cmpxchg() loop without the loop, so that a
failed cmpxchg() implies that there is contention and we just need to
write to _Q_LOCKED_VAL without considering the rest of the lockword.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-7-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
The qspinlock locking slowpath utilises a "pending" bit as a simple form
of an embedded test-and-set lock that can avoid the overhead of explicit
queuing in cases where the lock is held but uncontended. This bit is
managed using a cmpxchg() loop which tries to transition the uncontended
lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).
Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved
indefinitely if the lock word is seen to oscillate between unlocked
(0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
able to take the lock in the cmpxchg() loop without queuing and pass it
around amongst themselves.
This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
using atomic_fetch_or, and then inspecting the old value to see whether
we need to spin on the current lock owner, or whether we now effectively
hold the lock. The tricky scenario is when concurrent lockers end up
queuing on the lock and the lock becomes available, causing us to see
a lockword of (n,0,0). With pending now set, simply queuing could lead
to deadlock as the head of the queue may not have observed the pending
flag being cleared. Conversely, if the head of the queue did observe
pending being cleared, then it could transition the lock from (n,0,0) ->
(0,0,1) meaning that any attempt to "undo" our setting of the pending
bit could race with a concurrent locker trying to set it.
We handle this race by preserving the pending bit when taking the lock
after reaching the head of the queue and leaving the tail entry intact
if we saw pending set, because we know that the tail is going to be
updated shortly.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
If a locker taking the qspinlock slowpath reads a lock value indicating
that only the pending bit is set, then it will spin whilst the
concurrent pending->locked transition takes effect.
Unfortunately, there is no guarantee that such a transition will ever be
observed since concurrent lockers could continuously set pending and
hand over the lock amongst themselves, leading to starvation. Whilst
this would probably resolve in practice, it means that it is not
possible to prove liveness properties about the lock and means that lock
acquisition time is unbounded.
Rather than removing the pending->locked spinning from the slowpath
altogether (which has been shown to heavily penalise a 2-threaded
locking stress test on x86), this patch replaces the explicit spinning
with a call to atomic_cond_read_relaxed and allows the architecture to
provide a bound on the number of spins. For architectures that can
respond to changes in cacheline state in their smp_cond_load implementation,
it should be sufficient to use the default bound of 1.
Suggested-by: Waiman Long
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: boqun.feng@gmail.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
'struct __qspinlock' provides a handy union of fields so that
subcomponents of the lockword can be accessed by name, without having to
manage shifts and masks explicitly and take endianness into account.
This is useful in qspinlock.h and also potentially in arch headers, so
move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
definition.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Acked-by: Boqun Feng
Cc: Linus Torvalds
Cc: Thomas Gleixner
Cc: linux-arm-kernel@lists.infradead.org
Cc: paulmck@linux.vnet.ibm.com
Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar
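Roughly, the layout being folded into 'struct qspinlock' can be pictured
as the following little-endian sketch (simplified and assumed for
illustration; the real definition also covers big-endian layouts):

  #include <linux/types.h>

  typedef struct qspinlock {
          union {
                  atomic_t val;           /* the whole lock word  */
                  struct {                /* named subcomponents  */
                          u8  locked;     /* locked byte          */
                          u8  pending;    /* pending bit(s)       */
                          u16 tail;       /* tail CPU + index     */
                  };
          };
  } arch_spinlock_t;

With this in place, code can write lock->locked or read lock->tail
directly instead of open-coding shifts and masks on the atomic value.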
13 Feb, 2018
2 commits
-
When queuing on the qspinlock, the count field for the current CPU's head
node is incremented. This needn't be atomic because locking in e.g. IRQ
context is balanced and so an IRQ will return with node->count as it
found it.
However, the compiler could in theory reorder the initialisation of
node[idx] before the increment of the head node->count, causing an
IRQ to overwrite the initialised node and potentially corrupt the lock
state.
Avoid the potential for this harmful compiler reordering by placing a
barrier() between the increment of the head node->count and the
subsequent node initialisation.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar -
When a locker ends up queuing on the qspinlock locking slowpath, we
initialise the relevant mcs node and publish it indirectly by updating
the tail portion of the lock word using xchg_tail. If we find that there
was a pre-existing locker in the queue, we subsequently update their
->next field to point at our node so that we are notified when it's our
turn to take the lock.
This can be roughly illustrated as follows:

  /* Initialise the fields in node and encode a pointer to node in tail */
  tail = initialise_node(node);

  /*
   * Exchange tail into the lockword using an atomic read-modify-write
   * operation with release semantics
   */
  old = xchg_tail(lock, tail);

  /* If there was a pre-existing waiter ... */
  if (old & _Q_TAIL_MASK) {
          prev = decode_tail(old);
          smp_read_barrier_depends();

          /* ... then update their ->next field to point to node. */
          WRITE_ONCE(prev->next, node);
  }

The conditional update of prev->next therefore relies on the address
dependency from the result of xchg_tail ensuring order against the
prior initialisation of node. However, since the release semantics of
the xchg_tail operation apply only to the write portion of the RmW,
then this ordering is not guaranteed and it is possible for the CPU
to return old before the writes to node have been published, consequently
allowing us to point prev->next to an uninitialised node.
This patch fixes the problem by making the update of prev->next a RELEASE
operation, which also removes the reliance on dependency ordering.
Signed-off-by: Will Deacon
Acked-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com
Signed-off-by: Ingo Molnar
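Concretely, the fix described above turns the publish step from the
illustration into a release store; a sketch, reusing the same hypothetical
helpers as the illustration:

  /* If there was a pre-existing waiter ... */
  if (old & _Q_TAIL_MASK) {
          prev = decode_tail(old);

          /*
           * Publish the node with RELEASE semantics so that all prior
           * initialisation of *node is ordered before the previous
           * waiter can observe the ->next pointer, independent of the
           * ordering provided by xchg_tail().
           */
          smp_store_release(&prev->next, node);
  }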
05 Dec, 2017
1 commit
-
Queued spinlocks are not used by DEC Alpha, and furthermore operations
such as READ_ONCE() and release/relaxed RMW atomics are being changed
to imply smp_read_barrier_depends(). This commit therefore removes the
now-redundant smp_read_barrier_depends() from queued_spin_lock_slowpath(),
and adjusts the comments accordingly.
Signed-off-by: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Ingo Molnar
17 Aug, 2017
1 commit
-
There is no agreed-upon definition of spin_unlock_wait()'s semantics,
and it appears that all callers could do just as well with a lock/unlock
pair. This commit therefore removes spin_unlock_wait() and related
definitions from core code.
Signed-off-by: Paul E. McKenney
Cc: Arnd Bergmann
Cc: Ingo Molnar
Cc: Will Deacon
Cc: Peter Zijlstra
Cc: Alan Stern
Cc: Andrea Parri
Cc: Linus Torvalds
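For callers, the conversion is mechanical; a sketch with a hypothetical
lock (not a call site from the tree):

  /* Before: wait for any current holder of obj->lock to release it. */
  spin_unlock_wait(&obj->lock);

  /* After: a lock/unlock pair gives at least the same guarantee. */
  spin_lock(&obj->lock);
  spin_unlock(&obj->lock);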
08 Jul, 2017
1 commit
-
In architectures that use qspinlock, like x86, prefetch is loaded
indirectly via the asm/qspinlock.h include. On other architectures, like
OpenRISC, which may want to use asm-generic/qspinlock.h, the build will
fail without the asm/prefetch.h include.
Fix this by including it directly.
Signed-off-by: Stafford Horne
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20170707195658.23840-1-shorne@gmail.com
Signed-off-by: Ingo Molnar
27 Jun, 2016
1 commit
-
queued_spin_lock_slowpath() should not worry about another
queued_spin_lock_slowpath() running in interrupt context and
changing node->count by accident, because node->count keeps
the same value every time we enter/leave queued_spin_lock_slowpath().
On some architectures this_cpu_dec() will save/restore irq flags,
which has high overhead. Use the much cheaper __this_cpu_dec() instead.
Signed-off-by: Pan Xinhui
Signed-off-by: Peter Zijlstra (Intel)
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman.Long@hpe.com
Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
[ Rewrote changelog. ]
Signed-off-by: Ingo Molnar
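A sketch of the substitution (the per-CPU array name is assumed here for
illustration):

  /*
   * Release the per-CPU MCS node slot when leaving the slowpath.
   * this_cpu_dec() may save/restore IRQ flags on some architectures;
   * the plain __this_cpu_dec() suffices because slowpath entry/exit is
   * balanced, so an interrupt always restores node->count before
   * returning.
   */
  __this_cpu_dec(mcs_nodes[0].count);     /* was: this_cpu_dec(...) */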
14 Jun, 2016
2 commits
-
Introduce smp_acquire__after_ctrl_dep(), this construct is not
uncommon, but the lack of this barrier is.
Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
semaphore code and the qspinlock code.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
This new form allows using hardware-assisted waiting.
Some hardware (ARM64 and x86) allows monitoring an address for changes,
so by providing a pointer we can use this to replace the cpu_relax()
with hardware-optimized methods in the future.
Requested-by: Will Deacon
Suggested-by: Linus Torvalds
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar
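A usage sketch of the pointer-based form, roughly as the qspinlock
slowpath would use it (mask and field names as used elsewhere in these
changelogs; treat the exact expression as illustrative):

  /*
   * Spin until the value loaded from the given address satisfies the
   * condition, then return it with ACQUIRE semantics.  VAL names the
   * freshly loaded value inside the condition.  Because the primitive
   * receives the address, an architecture can wait on the cacheline
   * (e.g. with a WFE-style instruction) instead of calling cpu_relax()
   * in a tight loop.
   */
  val = smp_cond_load_acquire(&lock->val.counter,
                              !(VAL & _Q_LOCKED_PENDING_MASK));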
08 Jun, 2016
3 commits
-
I figured we need to document the spin_is_locked() and
spin_unlock_wait() constraints somewhere.
Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
could find a place in there. But currently that document is stale to
the point of hardly being useful.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Boqun Feng
Cc: Davidlohr Bueso
Cc: Linus Torvalds
Cc: Pan Xinhui
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman Long
Cc: Will Deacon
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
While going over the code I noticed that xchg_tail() is a RELEASE but
had no obvious pairing commented.
It pairs with a somewhat unique address dependency through
decode_tail().
So the store-release of xchg_tail() is paired by the address
dependency of the load of xchg_tail followed by the dereference from
the pointer computed from that load.
The @old -> @prev transformation itself is pure, and therefore does
not depend on external state, so that is immaterial wrt. ordering.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Boqun Feng
Cc: Davidlohr Bueso
Cc: Linus Torvalds
Cc: Pan Xinhui
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman Long
Cc: Will Deacon
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar -
While this prior commit:
54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
... fixes spin_is_locked() and spin_unlock_wait() for the usage
in ipc/sem and netfilter, it does not in fact work right for the
usage in task_work and futex.
So while the 2-locks-crossed problem:

  spin_lock(A)               spin_lock(B)
  if (!spin_is_locked(B))    spin_unlock_wait(A)
    foo()                      foo();

... works with the smp_mb() injected by both spin_is_locked() and
spin_unlock_wait(), this is not sufficient for:

  flag = 1;
  smp_mb();                  spin_lock()
  spin_unlock_wait()         if (!flag)
                               // add to lockless list
  // iterate lockless list

... because in this scenario, the store from spin_lock() can be delayed
past the load of flag, uncrossing the variables and losing the
guarantee.
This patch reworks spin_is_locked() and spin_unlock_wait() to work in
both cases by exploiting the observation that while the lock byte
store can be delayed, the contender must have registered itself
visibly in other state contained in the word.
It also allows architectures to override both functions, as PPC
and ARM64 have an additional issue for which we currently have no
generic solution.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Boqun Feng
Cc: Davidlohr Bueso
Cc: Giovanni Gherdovich
Cc: Linus Torvalds
Cc: Pan Xinhui
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman Long
Cc: Will Deacon
Cc: stable@vger.kernel.org # v4.2 and later
Fixes: 54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
Signed-off-by: Ingo Molnar
29 Feb, 2016
1 commit
-
The newly introduced smp_cond_acquire() was used to replace the
slowpath lock acquisition loop. Similarly, the new function can also
be applied to the pending bit locking loop. This patch uses the new
function in that loop.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Douglas Hatch
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar
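A sketch of the substitution in the pending-bit wait loop (illustrative,
not a diff of the patch):

  /* Before: open-coded spin, with ordering handled separately. */
  while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
          cpu_relax();

  /* After: one documented primitive that also provides ACQUIRE ordering. */
  smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));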
04 Dec, 2015
3 commits
-
In an overcommitted guest where some vCPUs have to be halted to make
forward progress in other areas, it is highly likely that a vCPU later
in the spinlock queue will be spinning while the ones earlier in the
queue would have been halted. The spinning in the later vCPUs is then
just a waste of precious CPU cycles because they are not going to
get the lock soon, as the earlier ones have to be woken up and take
their turn to get the lock.
This patch implements an adaptive spinning mechanism where the vCPU
will call pv_wait() if the previous vCPU is not running.
Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                   Westmere              Haswell
  Patch          32 vCPUs  48 vCPUs   32 vCPUs  48 vCPUs
  -----          --------  --------   --------  --------
  Before patch    3m02.3s   5m00.2s    1m43.7s   3m03.5s
  After patch     3m03.0s   4m37.5s    1m43.0s   2m47.2s

For 32 vCPUs, this patch doesn't cause any noticeable change in
performance. For 48 vCPUs (over-committed), there is about 8%
performance improvement.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar -
This patch allows one attempt for the lock waiter to steal the lock
when entering the PV slowpath. To prevent lock starvation, the pending
bit will be set by the queue head vCPU when it is in the active lock
spinning loop to disable any lock stealing attempt. This helps to
reduce the performance penalty caused by lock waiter preemption while
not having much of the downsides of a real unfair lock.
The pv_wait_head() function was renamed to pv_wait_head_or_lock()
as it was modified to acquire the lock before returning. This is
necessary because of possible lock stealing attempts from other tasks.
Linux kernel builds were run in a KVM guest on an 8-socket, 4
cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
Haswell-EX system. Both systems are configured to have 32 physical
CPUs. The kernel build times before and after the patch were:

                   Westmere              Haswell
  Patch          32 vCPUs  48 vCPUs   32 vCPUs  48 vCPUs
  -----          --------  --------   --------  --------
  Before patch    3m15.6s  10m56.1s    1m44.1s   5m29.1s
  After patch     3m02.3s   5m00.2s    1m43.7s   3m03.5s

For the overcommitted case (48 vCPUs), this patch is able to reduce
kernel build time by more than 54% for Westmere and 44% for Haswell.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar -
Introduce smp_cond_acquire() which combines a control dependency and a
read barrier to form acquire semantics.
This primitive has two benefits:
 - it documents control dependencies,
 - it's typically cheaper than using smp_load_acquire() in a loop.
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Mike Galbraith
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Signed-off-by: Ingo Molnar
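The primitive amounts to roughly the following: spin on the condition
(which forms a control dependency on the subsequent code) and then issue
a read barrier to upgrade that into acquire ordering:

  #define smp_cond_acquire(cond)  do {                     \
          while (!(cond))                                   \
                  cpu_relax();                              \
          smp_rmb(); /* ctrl dependency + rmb := acquire */ \
  } while (0)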
23 Nov, 2015
3 commits
-
With optimistic prefetch of the next node cacheline, the next pointer
may have been properly initialized already. As a result, the reading
of node->next in the contended path may be redundant. This patch
eliminates the redundant read if the next pointer value is not NULL.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar -
A queue head CPU, after acquiring the lock, will have to notify
the next CPU in the wait queue that it has become the new queue
head. This involves loading a new cacheline from the MCS node of the
next CPU. That operation can be expensive and adds to the latency of
the locking operation.
This patch adds code to optimistically prefetch the next MCS node
cacheline if the next pointer is defined and it has been spinning
for the MCS lock for a while. This reduces the locking latency and
improves the system throughput.
The performance change will depend on whether the prefetch overhead
can be hidden within the latency of the lock spin loop. On really
short critical sections, there may not be a performance gain at all.
With longer critical sections, however, it was found to have a
performance boost of 5-10% over a range of different queue depths with
a spinlock loop microbenchmark.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar -
This patch replaces the cmpxchg() and xchg() calls in the native
qspinlock code with the more relaxed _acquire or _release versions of
those calls to enable other architectures to adopt queued spinlocks
with less memory barrier performance overhead.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar
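A sketch of the kind of substitution described above (variable and field
names are illustrative, not a diff of the patch):

  /* Acquiring the lock: only ACQUIRE ordering is needed on success. */
  old = atomic_cmpxchg_acquire(&lock->val, val, new);   /* was atomic_cmpxchg() */

  /* Publishing the new tail: only RELEASE ordering is needed. */
  old_tail = xchg_release(&tail_word, new_tail);        /* was xchg() */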
11 Sep, 2015
1 commit
-
Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
set, and Linus noted that the test-and-set implementation was retarded.
One should spin on the variable with a load, not an RMW.
While there, remove 'queued' from the name, as the lock isn't queued
at all, but a simple test-and-set.
Suggested-by: Linus Torvalds
Reported-by: Dave Chinner
Tested-by: Dave Chinner
Signed-off-by: Peter Zijlstra (Intel)
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman Long
Cc: stable@vger.kernel.org # v4.2+
Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
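A hedged sketch of the resulting test-and-set lock (the helper name is
hypothetical; the point is the read-only inner spin):

  static __always_inline void tas_spin_lock(struct qspinlock *lock)
  {
          do {
                  /* Spin with plain loads while the lock looks taken. */
                  while (atomic_read(&lock->val))
                          cpu_relax();
                  /* Lock looks free: try to take it with a single RMW. */
          } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);
  }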
03 Aug, 2015
1 commit
-
For an over-committed guest with more vCPUs than physical CPUs
available, it is possible that a vCPU may be kicked twice before
getting the lock - once before it becomes queue head and once again
before it gets the lock. All this CPU kicking and halting (VMEXIT)
can be expensive and slow down system performance.
This patch adds a new vCPU state (vcpu_hashed) which enables the code
to delay CPU kicking until unlock time. Once this state is set,
the new lock holder will set _Q_SLOW_VAL and fill in the hash table
on behalf of the halted queue head vCPU. The original vcpu_halted
state will be used by pv_wait_node() only to differentiate other
queue nodes from the queue head.
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar
08 May, 2015
1 commit
-
Provide a separate (second) version of the spin_lock_slowpath for
paravirt along with a special unlock path.
The second slowpath is generated by adding a few pv hooks to the
normal slowpath, but where those will compile away for the native
case, they expand into special wait/wake code for the pv version.
The actual MCS queue can use extra storage in the mcs_nodes[] array to
keep track of state and therefore uses directed wakeups.
The head contender has no such storage directly visible to the
unlocker. So the unlocker searches a hash table with open addressing
using a simple binary Galois linear feedback shift register.
Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Boris Ostrovsky
Cc: Borislav Petkov
Cc: Daniel J Blueman
Cc: David Vrabel
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Konrad Rzeszutek Wilk
Cc: Linus Torvalds
Cc: Oleg Nesterov
Cc: Paolo Bonzini
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Raghavendra K T
Cc: Rik van Riel
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar