18 Aug, 2016
3 commits
-
When wanting to wake up readers, __rwsem_mark_wake() currently
iterates the wait_list twice while looking to wakeup the first N
queued reader-tasks. While this can be quite inefficient, it was
there such that an awoken reader would be first and foremost
acknowledged by the lock counter.

Keeping the same logic, we can further benefit from the use of
wake_qs and avoid entirely the first wait_list iteration that sets
the counter as wake_up_process() isn't going to occur right away,
and therefore we maintain the counter->list order of going about
things.

Other than saving cycles with O(n) "scanning", this change also
nicely cleans up a good chunk of __rwsem_mark_wake(), making it
both visually cleaner and less tedious to read.

For example, the following improvements were seen on some
will-it-scale microbenchmarks, on a 48-core Haswell:

v4.7 v4.7-rwsem-v1
Hmean signal1-processes-8 5792691.42 ( 0.00%) 5771971.04 ( -0.36%)
Hmean signal1-processes-12 6081199.96 ( 0.00%) 6072174.38 ( -0.15%)
Hmean signal1-processes-21 3071137.71 ( 0.00%) 3041336.72 ( -0.97%)
Hmean signal1-processes-48 3712039.98 ( 0.00%) 3708113.59 ( -0.11%)
Hmean signal1-processes-79 4464573.45 ( 0.00%) 4682798.66 ( 4.89%)
Hmean signal1-processes-110 4486842.01 ( 0.00%) 4633781.71 ( 3.27%)
Hmean signal1-processes-141 4611816.83 ( 0.00%) 4692725.38 ( 1.75%)
Hmean signal1-processes-172 4638157.05 ( 0.00%) 4714387.86 ( 1.64%)
Hmean signal1-processes-203 4465077.80 ( 0.00%) 4690348.07 ( 5.05%)
Hmean signal1-processes-224 4410433.74 ( 0.00%) 4687534.43 ( 6.28%)

Stddev signal1-processes-8 6360.47 ( 0.00%) 8455.31 ( 32.94%)
Stddev signal1-processes-12 4004.98 ( 0.00%) 9156.13 (128.62%)
Stddev signal1-processes-21 3273.14 ( 0.00%) 5016.80 ( 53.27%)
Stddev signal1-processes-48 28420.25 ( 0.00%) 26576.22 ( -6.49%)
Stddev signal1-processes-79 22038.34 ( 0.00%) 18992.70 (-13.82%)
Stddev signal1-processes-110 23226.93 ( 0.00%) 17245.79 (-25.75%)
Stddev signal1-processes-141 6358.98 ( 0.00%) 7636.14 ( 20.08%)
Stddev signal1-processes-172 9523.70 ( 0.00%) 4824.75 (-49.34%)
Stddev signal1-processes-203 13915.33 ( 0.00%) 9326.33 (-32.98%)
Stddev signal1-processes-224 15573.94 ( 0.00%) 10613.82 (-31.85%)

Other runs that saw improvements include context_switch and pipe; and
as expected, this is particularly highlighted on larger thread counts
as it becomes more expensive to walk the list twice.

No change in wakeup ordering or semantics.
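
For illustration only, a minimal sketch of the single-pass shape described
above; the helper name, rwsem_waiter layout, and bias macro are assumed
from the surrounding text, not copied from the actual patch:

    #include <linux/rwsem.h>
    #include <linux/sched/wake_q.h>

    /*
     * Sketch: walk the wait list once, deferring the actual wakeups to
     * a wake_q so no second pass is needed to acknowledge woken readers.
     */
    static void mark_wake_readers(struct rw_semaphore *sem,
                                  struct wake_q_head *wake_q)
    {
            struct rwsem_waiter *waiter, *tmp;
            long adjustment = 0;

            list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
                    if (waiter->type == RWSEM_WAITING_FOR_WRITE)
                            break;

                    adjustment += RWSEM_ACTIVE_READ_BIAS;
                    list_del(&waiter->list);
                    wake_q_add(wake_q, waiter->task); /* deferred wakeup */
            }

            /* One counter update covers every reader queued for wakeup. */
            atomic_long_add(adjustment, &sem->count);
    }

Since wake_up_process() is deferred until wake_up_q() runs after the
wait_lock is dropped, the counter update can safely precede the wakeups
while preserving the counter->list ordering.
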
Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman.Long@hp.com
Cc: dave@stgolabs.net
Cc: jason.low2@hpe.com
Cc: wanpeng.li@hotmail.com
Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
-
Our rwsem code (xadd, at least) is rather well documented, but
there are a few really annoying comments in there that serve
no purpose and we shouldn't bother with them.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman.Long@hp.com
Cc: dave@stgolabs.net
Cc: jason.low2@hpe.com
Cc: wanpeng.li@hotmail.com
Link: http://lkml.kernel.org/r/1470384285-32163-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
-
We currently return a rw_semaphore structure, which is the
same lock we passed as the function's argument in the first
place. While there are several functions that choose this
return value, the callers use it, for example, for things
like ERR_PTR. This is not the case for __rwsem_mark_wake(),
and in addition this function is really about the lock
waiters (which we know there are at this point), so it's
somewhat odd to be returning the sem structure.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman.Long@hp.com
Cc: dave@stgolabs.net
Cc: jason.low2@hpe.com
Cc: wanpeng.li@hotmail.com
Link: http://lkml.kernel.org/r/1470384285-32163-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
16 Jun, 2016
1 commit
-
Now that we have fetch_add() we can stop using add_return() - val.
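
Schematically (not the actual diff), the conversion looks like this,
wrapped here in a hypothetical helper:

    #include <linux/atomic.h>

    static long rwsem_count_add(atomic_long_t *count, long adjustment)
    {
            /* Old style: recover the pre-add value by subtracting again. */
            /* return atomic_long_add_return(adjustment, count) - adjustment; */

            /* New style: fetch_add() returns the pre-add value directly. */
            return atomic_long_fetch_add(adjustment, count);
    }
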
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Jason Low
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman Long
Cc: linux-arch@vger.kernel.org
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar
08 Jun, 2016
4 commits
-
This patch moves the owner loading and checking code entirely inside of
rwsem_spin_on_owner() to simplify the logic of the
rwsem_optimistic_spin() loop.

Suggested-by: Peter Hurley
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Peter Hurley
Cc: Andrew Morton
Cc: Dave Chinner
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: Jason Low
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1463534783-38814-6-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar
-
In __rwsem_do_wake(), the reader wakeup code will assume a writer
has stolen the lock if the active reader/writer count is not 0.
However, this is not as reliable an indicator as the original
"< RWSEM_WAITING_BIAS" check. If another reader is present, the code
will still break out and exit even if the writer is gone. This patch
changes it to check the same "< RWSEM_WAITING_BIAS" condition to
reduce the chance of a false positive.

Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Peter Hurley
Cc: Andrew Morton
Cc: Dave Chinner
Cc: Davidlohr Bueso
Cc: Douglas Hatch
Cc: Jason Low
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1463534783-38814-5-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar
-
Currently, it is not possible to determine for sure if a reader
owns a rwsem by looking at the content of the rwsem data structure.
This patch adds a new state RWSEM_READER_OWNED to the owner field
to indicate that readers currently own the lock. This enables us to
address the following 2 issues in the rwsem optimistic spinning code:

1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
the owner field is NULL which can mean either the readers own
the lock or the owning writer hasn't set the owner field yet.
In the latter case, we miss the chance to do optimistic spinning.

2) While a writer is waiting in the OSQ and a reader takes the lock,
the writer will continue to spin when out of the OSQ in the main
rwsem_optimistic_spin() loop as the owner field is NULL, wasting
CPU cycles if some of the readers are sleeping.

Adding the new state will allow optimistic spinning to go forward as
long as the owner field is not RWSEM_READER_OWNED and the owner is
running, if set, but stop immediately when that state has been reached.

On a 4-socket Haswell machine running on a 4.6-rc1 based kernel, the
fio test with multithreaded randrw and randwrite tests on the same
file on a XFS partition on top of a NVDIMM were run, the aggregated
bandwidths before and after the patch were as follows:

Test       BW before patch   BW after patch   % change
----       ---------------   --------------   --------
randrw     988 MB/s          1192 MB/s        +21%
randwrite  1513 MB/s         1623 MB/s        +7.3%

The perf profile of the rwsem_down_write_failed() function in randrw
before and after the patch were:

19.95% 5.88% fio [kernel.vmlinux] [k] rwsem_down_write_failed
14.20% 1.52% fio [kernel.vmlinux] [k] rwsem_down_write_failed

The actual CPU cycles spent in rwsem_down_write_failed() dropped from
5.88% to 1.52% after the patch.

The xfstests suite was also run, and no regression was observed.
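
A condensed sketch of the owner-state idea; the marker value and helper
names are assumptions for illustration, and ->owner exists only with
optimistic spinning configured:

    #include <linux/rwsem.h>

    /* Assumed marker: a value that can never be a valid task pointer. */
    #define RWSEM_READER_OWNED	((struct task_struct *)1UL)

    static inline void rwsem_set_reader_owned(struct rw_semaphore *sem)
    {
            /* Only store if not already marked, limiting cacheline traffic. */
            if (READ_ONCE(sem->owner) != RWSEM_READER_OWNED)
                    WRITE_ONCE(sem->owner, RWSEM_READER_OWNED);
    }

    static inline bool rwsem_owner_is_reader(struct task_struct *owner)
    {
            return owner == RWSEM_READER_OWNED;
    }

Spinners can then keep going while the owner is a running writer, but
bail out as soon as the reader-owned marker is observed.
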
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Jason Low
Acked-by: Davidlohr Bueso
Cc: Andrew Morton
Cc: Dave Chinner
Cc: Douglas Hatch
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Hurley
Cc: Peter Zijlstra
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1463534783-38814-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar
-
Convert the rwsem count variable to an atomic_long_t since we use it
as an atomic variable. This also allows us to remove the
rwsem_atomic_{add,update}() "abstraction", which would now be an unnecessary
level of indirection. In follow-up patches, we also remove the
rwsem_atomic_{add,update}() definitions across the various architectures.

Suggested-by: Peter Zijlstra
Signed-off-by: Jason Low
[ Build warning fixes on various architectures. ]
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Davidlohr Bueso
Cc: Fenghua Yu
Cc: Heiko Carstens
Cc: Jason Low
Cc: Linus Torvalds
Cc: Martin Schwidefsky
Cc: Paul E. McKenney
Cc: Peter Hurley
Cc: Terry Rudd
Cc: Thomas Gleixner
Cc: Tim Chen
Cc: Tony Luck
Cc: Waiman Long
Link: http://lkml.kernel.org/r/1465017963-4839-2-git-send-email-jason.low2@hpe.com
Signed-off-by: Ingo Molnar
03 Jun, 2016
3 commits
-
When acquiring the rwsem write lock in the slowpath, we first try
to set count to RWSEM_WAITING_BIAS. When that is successful,
we then atomically add the RWSEM_WAITING_BIAS in cases where
there are other tasks on the wait list. This causes write lock
operations to often issue multiple atomic operations.

We can instead make the list_is_singular() check first, and then
set the count accordingly, so that we issue at most 1 atomic
operation when acquiring the write lock and reduce unnecessary
cacheline contention.

Signed-off-by: Jason Low
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Waiman Long
Acked-by: Davidlohr Bueso
Cc: Andrew Morton
Cc: Arnd Bergmann
Cc: Christoph Lameter
Cc: Fenghua Yu
Cc: Heiko Carstens
Cc: Ivan Kokshaysky
Cc: Jason Low
Cc: Linus Torvalds
Cc: Martin Schwidefsky
Cc: Matt Turner
Cc: Paul E. McKenney
Cc: Peter Hurley
Cc: Peter Zijlstra
Cc: Richard Henderson
Cc: Terry Rudd
Cc: Thomas Gleixner
Cc: Tim Chen
Cc: Tony Luck
Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
Signed-off-by: Ingo Molnar
-
Readers that are awoken will expect a nil ->task indicating
that a wakeup has occurred. Because of the way readers are
implemented, there's a small chance that the waiter will never
block in the slowpath (rwsem_down_read_failed), and therefore
requires some form of reference counting to avoid the following
scenario:

rwsem_down_read_failed()                  rwsem_wake()
  get_task_struct();
  spin_lock_irq(&wait_lock);
  list_add_tail(&waiter.list)
  spin_unlock_irq(&wait_lock);
                                          raw_spin_lock_irqsave(&wait_lock)
                                          __rwsem_do_wake()
  while (1) {
    set_task_state(TASK_UNINTERRUPTIBLE);
                                          waiter->task = NULL
    if (!waiter.task) // true
      break;
    schedule() // never reached

  __set_task_state(TASK_RUNNING);
  do_exit();
                                          wake_up_process(tsk); // boom

... and therefore race with do_exit() when the caller returns.
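
For orientation, a hedged sketch of the reference-counted ordering that
the paragraphs below converge on (field and helper names assumed):

    #include <linux/sched/wake_q.h>

    /* Waker side, wait_lock held; a sketch, not the verbatim patch. */
    static void queue_reader_wakeup(struct rwsem_waiter *waiter,
                                    struct wake_q_head *wake_q)
    {
            wake_q_add(wake_q, waiter->task);  /* holds a task reference */
            /*
             * The nil store must come last: once waiter->task is NULL
             * the waiter may return and exit, and only the wake_q
             * reference keeps the task alive until wake_up_q() runs.
             */
            smp_store_release(&waiter->task, NULL);
    }
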
There is also a mismatch between the smp_mb() and its documentation,
in that the serialization is done between reading the task and the
nil store. Furthermore, in addition to having the overlapping of
loads and stores to waiter->task guaranteed to be ordered within
that CPU, both wake_up_process() originally and now wake_q_add()
already imply barriers upon successful calls, which serves the
comment.

Now, as an alternative to perhaps inverting the checks in the blocker
side (which has its own penalty in that schedule is unavoidable),
with lockless wakeups this situation is naturally addressed and we
can just use the refcount held by wake_q_add(), instead of doing so
explicitly. Of course, we must guarantee that the nil store is done
as the _last_ operation in that the task must already be marked for
deletion to not fall into the race above. Spurious wakeups are also
handled transparently in that the task's reference is only removed
when wake_up_q() is actually called _after_ the nil store.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman.Long@hpe.com
Cc: dave@stgolabs.net
Cc: jason.low2@hp.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
-
As wake_qs gain users, we can teach rwsems about them such that
waiters can be awoken without the wait_lock. This is for both
readers and writers, the former being the most ideal candidate
as we can batch the wakeups shortening the critical region that
much more -- ie writer task blocking a bunch of tasks waiting to
service page-faults (mmap_sem readers).

In general, applying wake_qs to rwsem (xadd) is not difficult as
the wait_lock is intended to be released soon _anyways_, with
the exception of when a writer slowpath will proactively wakeup
any queued readers if it sees that the lock is owned by a reader,
in which case we simply do the wakeups with the lock held (see comment
in __rwsem_down_write_failed_common()).

Similar to other locking primitives, delaying the waiter being
awoken does allow, at least in theory, the lock to be stolen in
the case of writers, however no harm was seen in this (in fact
lock stealing tends to be a _good_ thing in most workloads), and
this is a tiny window anyways.

Some page-fault (pft) and mmap_sem intensive benchmarks show some
pretty constant reduction in systime (by up to ~8 and ~10%) on a
2-socket, 12 core AMD box. In addition, on an 8-core Westmere doing
page allocations (page_test):

aim9:
4.6-rc6 4.6-rc6
rwsemv2
Min page_test 378167.89 ( 0.00%) 382613.33 ( 1.18%)
Min exec_test 499.00 ( 0.00%) 502.67 ( 0.74%)
Min fork_test 3395.47 ( 0.00%) 3537.64 ( 4.19%)
Hmean page_test 395433.06 ( 0.00%) 414693.68 ( 4.87%)
Hmean exec_test 499.67 ( 0.00%) 505.30 ( 1.13%)
Hmean fork_test 3504.22 ( 0.00%) 3594.95 ( 2.59%)
Stddev page_test 17426.57 ( 0.00%) 26649.92 (-52.93%)
Stddev exec_test 0.47 ( 0.00%) 1.41 (-199.05%)
Stddev fork_test 63.74 ( 0.00%) 32.59 ( 48.86%)
Max page_test 429873.33 ( 0.00%) 456960.00 ( 6.30%)
Max exec_test 500.33 ( 0.00%) 507.66 ( 1.47%)
Max fork_test 3653.33 ( 0.00%) 3650.90 ( -0.07%)

4.6-rc6 4.6-rc6
rwsemv2
User 1.12 0.04
System 0.23 0.04
Elapsed 727.27 721.98

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Thomas Gleixner
Cc: Waiman.Long@hpe.com
Cc: dave@stgolabs.net
Cc: jason.low2@hp.com
Cc: peter@hurleysoftware.com
Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
16 May, 2016
1 commit
-
The new signal_pending exit path in __rwsem_down_write_failed_common()
was reported by Tetsuo Handa as breaking his kernel.

Upon inspection it was found that there are two things wrong with it:

- it forgets to remove WAITING_BIAS if it leaves the list empty, or

- it forgets to wake further waiters that were blocked on the now
  removed waiter.

Especially the first issue causes new lock attempts to block and stall
indefinitely, as the code assumes that pending waiters mean there is
an owner that will wake when it releases the lock.

Reported-by: Tetsuo Handa
Tested-by: Tetsuo Handa
Tested-by: Michal Hocko
Signed-off-by: Peter Zijlstra (Intel)
Cc: Alexander Shishkin
Cc: Andrew Morton
Cc: Arnaldo Carvalho de Melo
Cc: Chris Zankel
Cc: David S. Miller
Cc: Davidlohr Bueso
Cc: H. Peter Anvin
Cc: Jiri Olsa
Cc: Linus Torvalds
Cc: Max Filippov
Cc: Peter Zijlstra
Cc: Stephane Eranian
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: Vince Weaver
Cc: Waiman Long
Link: http://lkml.kernel.org/r/20160512115745.GP3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar
13 Apr, 2016
1 commit
-
Introduce a generic implementation necessary for down_write_killable().
This is a trivial extension of the already existing down_write() call
which can be interrupted by SIGKILL. This patch doesn't provide
down_write_killable() yet because arches have to provide the necessary
pieces beforehand.

rwsem_down_write_failed(), which is a generic slow path for the
write lock, is extended to take a task state and renamed to
__rwsem_down_write_failed_common(). The return value is either a valid
semaphore pointer or ERR_PTR(-EINTR).

rwsem_down_write_failed_killable() is exported as a new way to wait for
the lock and be killable.

For the rwsem-spinlock implementation, the current __down_write() is updated
in a similar way as __rwsem_down_write_failed_common() except it doesn't
need new exports, just a visible __down_write_killable().

Architectures which are not using the generic rwsem implementation are
supposed to provide their __down_write_killable() implementation and
use rwsem_down_write_failed_killable() for the slow path.

Signed-off-by: Michal Hocko
Cc: Andrew Morton
Cc: Chris Zankel
Cc: David S. Miller
Cc: Linus Torvalds
Cc: Max Filippov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Davidlohr Bueso
Cc: Jason Low
Cc: Thomas Gleixner
Cc: Tony Luck
Cc: linux-alpha@vger.kernel.org
Cc: linux-arch@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1460041951-22347-7-git-send-email-mhocko@kernel.org
Signed-off-by: Ingo Molnar
06 Oct, 2015
1 commit
-
As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
ordered archs can benefit from more relaxed use of barriers when locking
and unlocking, instead of regular full barrier semantics. While currently
only arm64 supports such optimizations, updating corresponding locking
primitives serves for other archs to immediately benefit as well, once the
necessary machinery is implemented, of course.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Thomas Gleixner
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: Paul E. McKenney
Cc: Paul E.McKenney
Cc: Peter Zijlstra
Cc: Will Deacon
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/1443643395-17016-6-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
08 May, 2015
1 commit
-
In up_write()/up_read(), rwsem_wake() will be called whenever it
detects that some writers/readers are waiting. The rwsem_wake()
function will take the wait_lock and call __rwsem_do_wake() to do the
real wakeup. For a heavily contended rwsem, doing a spin_lock() on
wait_lock will cause further contention on the heavily contended rwsem
cacheline resulting in delay in the completion of the up_read/up_write
operations.

This patch makes the wait_lock taking and the call to __rwsem_do_wake()
optional if at least one spinning writer is present. The spinning
writer will be able to take the rwsem and call rwsem_wake() later
when it calls up_write(). With the presence of a spinning writer,
rwsem_wake() will now try to acquire the lock using trylock. If that
fails, it will just quit.

Suggested-by: Peter Zijlstra (Intel)
Signed-off-by: Waiman Long
Signed-off-by: Peter Zijlstra (Intel)
Reviewed-by: Davidlohr Bueso
Acked-by: Jason Low
Cc: Andrew Morton
Cc: Borislav Petkov
Cc: Douglas Hatch
Cc: H. Peter Anvin
Cc: Linus Torvalds
Cc: Scott J Norton
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com
Signed-off-by: Ingo Molnar
07 Mar, 2015
1 commit
-
Ming reported soft lockups occurring when running xfstest due to
the following tip:locking/core commit:

b3fd4f03ca0b ("locking/rwsem: Avoid deceiving lock spinners")

When doing optimistic spinning in rwsem, threads should stop
spinning when the lock owner is not running. While a thread is
spinning on owner, if the owner reschedules, owner->on_cpu
returns false and we stop spinning.

However, this commit essentially caused the check to get
ignored because when we break out of the spin loop due to
!on_cpu, we continue spinning if sem->owner != NULL.

This patch fixes this by making sure we stop spinning if the
owner is not running. Furthermore, just like with mutexes,
refactor the code such that we don't have separate checks for
owner_running(). This makes it more straightforward in terms of
why we exit the spin on owner loop and we would also avoid
needing to "guess" why we broke out of the loop to make this
more readable.

Reported-and-tested-by: Ming Lei
Signed-off-by: Jason Low
Acked-by: Davidlohr Bueso
Cc: Andrew Morton
Cc: Dave Jones
Cc: Linus Torvalds
Cc: Michel Lespinasse
Cc: Oleg Nesterov
Cc: Paul E. McKenney
Cc: Peter Zijlstra
Cc: Sasha Levin
Cc: Thomas Gleixner
Cc: Tim Chen
Link: http://lkml.kernel.org/r/1425714331.2475.388.camel@j-VirtualBox
Signed-off-by: Ingo Molnar
24 Feb, 2015
1 commit
-
With the new standardized functions, we can replace all
ACCESS_ONCE() calls across relevant locking - this includes
lockref and seqlock while at it.

ACCESS_ONCE() does not work reliably on non-scalar types.
For example gcc 4.6 and 4.7 might remove the volatile tag
for such accesses during the SRA (scalar replacement of
aggregates) step:

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145
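
As a schematic example of the failure mode (types simplified, not taken
from the patch; the READ_ONCE() of that era explicitly supported
non-scalar types):

    #include <linux/compiler.h>

    struct pair {
            int a, b;
    };

    /*
     * With an aggregate, gcc 4.6/4.7 could drop ACCESS_ONCE()'s volatile
     * cast during SRA, silently breaking the "read exactly once" intent.
     */
    static struct pair load_pair_once(struct pair *p)
    {
            return READ_ONCE(*p);   /* previously: ACCESS_ONCE(*p) */
    }
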
Update the new calls regardless of whether it is a scalar type;
this is cleaner than having three alternatives.

Signed-off-by: Davidlohr Bueso
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Thomas Gleixner
Cc: Paul E. McKenney
Link: http://lkml.kernel.org/r/1424662301.6539.18.camel@stgolabs.net
Signed-off-by: Ingo Molnar
18 Feb, 2015
4 commits
-
37e9562453b ("locking/rwsem: Allow conservative optimistic
spinning when readers have lock") forced the default for
optimistic spinning to be disabled if the lock owner was
nil, which makes much sense for readers. However, while
it is not our priority, we can make some optimizations
for write-mostly workloads. We can bail the spinning step
and still be conservative if there are any active tasks,
otherwise there's really no reason not to spin, as the
semaphore is most likely unlocked.

This patch recovers most of a Unixbench 'execl' benchmark
throughput by sleeping less and making better average system
usage:

before:
CPU %user %nice %system %iowait %steal %idle
all 0.60 0.00 8.02 0.00 0.00 91.38

after:
CPU %user %nice %system %iowait %steal %idle
all 1.22 0.00 70.18 0.00 0.00 28.60

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Acked-by: Jason Low
Cc: Linus Torvalds
Cc: Michel Lespinasse
Cc: Paul E. McKenney
Cc: Tim Chen
Link: http://lkml.kernel.org/r/1422609267-15102-6-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
-
When readers hold the semaphore, the ->owner is nil. As such,
and unlike mutexes, '!owner' does not necessarily imply that
the lock is free. This will cause writers to potentially spin
excessively as they've been misled into thinking they have a
chance of acquiring the lock, instead of blocking.

This patch therefore enhances the counter check when the owner
is not set by the time we've broken out of the loop. Otherwise
we can return true as a new owner has the lock and thus we want
to continue spinning. While at it, we can make rwsem_spin_on_owner()
less ambiguous and return right away under need_resched conditions.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Jason Low
Cc: Linus Torvalds
Cc: Michel Lespinasse
Cc: Paul E. McKenney
Cc: Tim Chen
Link: http://lkml.kernel.org/r/1422609267-15102-5-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
-
In order to optimize the spinning step, we need to set the lock
owner as soon as the lock is acquired; after a successful counter
cmpxchg operation, that is. This is particularly useful as rwsems
need to set the owner to nil for readers, so there is a greater
chance of falling out of the spinning. Currently we only set the
owner much later in the game, in the more generic level -- latency
can be especially bad when waiting for a node->next pointer when
releasing the osq in up_write calls.

As such, update the owner inside rwsem_try_write_lock (when the
lock is obtained after blocking) and rwsem_try_write_lock_unqueued
(when the lock is obtained while spinning). This requires creating
a new internal rwsem.h header to share the owner related calls.

Also clean up some headers for mutex and rwsem.
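
A condensed sketch of setting the owner right after a successful
cmpxchg; the shape is inferred from the description, the bias macros
live in rwsem-xadd.c, and sem->count was still a plain long at this
stage:

    #include <linux/rwsem.h>
    #include <linux/sched.h>

    static inline void rwsem_set_owner(struct rw_semaphore *sem)
    {
            sem->owner = current;   /* shared via the new internal rwsem.h */
    }

    /* Acquire attempt while spinning: set the owner as soon as we win. */
    static inline bool try_write_lock_unqueued(struct rw_semaphore *sem)
    {
            long old, count = READ_ONCE(sem->count);

            for (;;) {
                    if (count != 0 && count != RWSEM_WAITING_BIAS)
                            return false;           /* lock is busy */

                    old = cmpxchg(&sem->count, count,
                                  count + RWSEM_ACTIVE_WRITE_BIAS);
                    if (old == count) {
                            rwsem_set_owner(sem);   /* before dropping the osq */
                            return true;
                    }
                    count = old;
            }
    }
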
Suggested-by: Peter Zijlstra
Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Jason Low
Cc: Linus Torvalds
Cc: Michel Lespinasse
Cc: Paul E. McKenney
Cc: Tim Chen
Link: http://lkml.kernel.org/r/1422609267-15102-4-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
-
The need for the smp_mb() in __rwsem_do_wake() should be
properly documented. Applies to both xadd and spinlock
variants.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Jason Low
Cc: Linus Torvalds
Cc: Michel Lespinasse
Cc: Paul E. McKenney
Cc: Tim Chen
Link: http://lkml.kernel.org/r/1422609267-15102-3-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
04 Feb, 2015
1 commit
-
Call __set_task_state() instead of assigning the new state
directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP
environments, keeping track of who last changed the state.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: "Paul E. McKenney"
Cc: Jason Low
Cc: Michel Lespinasse
Cc: Tim Chen
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1422257769-14083-2-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
03 Oct, 2014
1 commit
-
Commit 9b0fc9c09f1b ("rwsem: skip initial trylock in rwsem_down_write_failed")
checks whether there are known active lockers in order to avoid write
trylocking using expensive cmpxchg() when it likely wouldn't get the lock.

However, a subsequent patch was added such that we directly
check for sem->count == RWSEM_WAITING_BIAS right before trying
that cmpxchg().

Thus, commit 9b0fc9c09f1b now just adds overhead.
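
The end result described in the next paragraphs has roughly this shape
(a sketch with assumed names; sem->count was a plain long here):

    /* Sketch: gate the expensive cmpxchg() behind a plain count check. */
    static inline bool try_write_lock(long count, struct rw_semaphore *sem)
    {
            /*
             * Extra check: if count != RWSEM_WAITING_BIAS there are
             * active lockers, so the cmpxchg() would most likely fail
             * anyway -- skip it and the cacheline bouncing it causes.
             */
            return count == RWSEM_WAITING_BIAS &&
                   cmpxchg(&sem->count, RWSEM_WAITING_BIAS,
                           count + RWSEM_ACTIVE_WRITE_BIAS) == RWSEM_WAITING_BIAS;
    }
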
This patch modifies it so that we only do a check for whether
count == RWSEM_WAITING_BIAS.

Also, add a comment on why we do an "extra check" of count
before the cmpxchg().

Signed-off-by: Jason Low
Acked-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra (Intel)
Cc: Aswin Chandramouleeswaran
Cc: Chegu Vinod
Cc: Peter Hurley
Cc: Tim Chen
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1410913017.2447.22.camel@j-VirtualBox
Signed-off-by: Ingo Molnar
16 Sep, 2014
1 commit
-
rw-semaphore is the only type of lock doing this ugliness of
exporting at the end of the file.

Signed-off-by: Davidlohr Bueso
Cc: dave@stgolabs.net
Cc: peterz@infradead.org
Link: http://lkml.kernel.org/r/1410500066-5909-1-git-send-email-dave@stgolabs.net
Signed-off-by: Ingo Molnar
17 Jul, 2014
1 commit
-
The arch_mutex_cpu_relax() function, introduced by 34b133f, is
hacky and ugly. It was added a few years ago to address the fact
that common cpu_relax() calls include yielding on s390, and thus
impact the optimistic spinning functionality of mutexes. Nowadays
we use this function well beyond mutexes: rwsem, qrwlock, mcs and
lockref. Since the macro that defines the call is in the mutex header,
any users must include mutex.h, and the naming is misleading as well.

This patch (i) renames the call to cpu_relax_lowlatency ("relax, but
only if you can do it with very low latency") and (ii) defines it in
each arch's asm/processor.h local header, just like for regular cpu_relax
functions. On all archs, except s390, cpu_relax_lowlatency is simply cpu_relax,
and thus we can take it out of mutex.h. While this can seem redundant,
I believe it is a good choice as it allows us to move out arch specific
logic from generic locking primitives and enables future(?) archs to
transparently define it, similarly to System Z.

Signed-off-by: Davidlohr Bueso
Signed-off-by: Peter Zijlstra
Cc: Andrew Morton
Cc: Anton Blanchard
Cc: Aurelien Jacquiot
Cc: Benjamin Herrenschmidt
Cc: Bharat Bhushan
Cc: Catalin Marinas
Cc: Chen Liqin
Cc: Chris Metcalf
Cc: Christian Borntraeger
Cc: Chris Zankel
Cc: David Howells
Cc: David S. Miller
Cc: Deepthi Dharwar
Cc: Dominik Dingel
Cc: Fenghua Yu
Cc: Geert Uytterhoeven
Cc: Guan Xuetao
Cc: Haavard Skinnemoen
Cc: Hans-Christian Egtvedt
Cc: Heiko Carstens
Cc: Helge Deller
Cc: Hirokazu Takata
Cc: Ivan Kokshaysky
Cc: James E.J. Bottomley
Cc: James Hogan
Cc: Jason Wang
Cc: Jesper Nilsson
Cc: Joe Perches
Cc: Jonas Bonn
Cc: Joseph Myers
Cc: Kees Cook
Cc: Koichi Yasutake
Cc: Lennox Wu
Cc: Linus Torvalds
Cc: Mark Salter
Cc: Martin Schwidefsky
Cc: Matt Turner
Cc: Max Filippov
Cc: Michael Neuling
Cc: Michal Simek
Cc: Mikael Starvik
Cc: Nicolas Pitre
Cc: Paolo Bonzini
Cc: Paul Burton
Cc: Paul E. McKenney
Cc: Paul Gortmaker
Cc: Paul Mackerras
Cc: Qais Yousef
Cc: Qiaowei Ren
Cc: Rafael Wysocki
Cc: Ralf Baechle
Cc: Richard Henderson
Cc: Richard Kuo
Cc: Russell King
Cc: Steven Miao
Cc: Steven Rostedt
Cc: Stratos Karafotis
Cc: Tim Chen
Cc: Tony Luck
Cc: Vasily Kulikov
Cc: Vineet Gupta
Cc: Vineet Gupta
Cc: Waiman Long
Cc: Will Deacon
Cc: Wolfram Sang
Cc: adi-buildroot-devel@lists.sourceforge.net
Cc: linux390@de.ibm.com
Cc: linux-alpha@vger.kernel.org
Cc: linux-am33-list@redhat.com
Cc: linux-arm-kernel@lists.infradead.org
Cc: linux-c6x-dev@linux-c6x.org
Cc: linux-cris-kernel@axis.com
Cc: linux-hexagon@vger.kernel.org
Cc: linux-ia64@vger.kernel.org
Cc: linux@lists.openrisc.net
Cc: linux-m32r-ja@ml.linux-m32r.org
Cc: linux-m32r@ml.linux-m32r.org
Cc: linux-m68k@lists.linux-m68k.org
Cc: linux-metag@vger.kernel.org
Cc: linux-mips@linux-mips.org
Cc: linux-parisc@vger.kernel.org
Cc: linuxppc-dev@lists.ozlabs.org
Cc: linux-s390@vger.kernel.org
Cc: linux-sh@vger.kernel.org
Cc: linux-xtensa@linux-xtensa.org
Cc: sparclinux@vger.kernel.org
Link: http://lkml.kernel.org/r/1404079773.2619.4.camel@buesod1.americas.hpqcorp.net
Signed-off-by: Ingo Molnar
16 Jul, 2014
4 commits
-
Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
encapsulate the dependencies for rwsem optimistic spinning.
No logical changes here as it continues to depend on both
SMP and the XADD algorithm variant.

Signed-off-by: Davidlohr Bueso
Acked-by: Jason Low
[ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
Cc: aswin@hp.com
Cc: Chris Mason
Cc: Davidlohr Bueso
Cc: Josef Bacik
Cc: Linus Torvalds
Cc: Waiman Long
Signed-off-by: Ingo Molnar
-
Currently, we initialize the osq lock by directly setting the lock's values. It
would be preferable to use an init macro to do the initialization, like we do
with other locks.

This patch introduces and uses a macro and function for initializing the osq lock.
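
A sketch of the macro/function pair; exact definitions are assumed, and
the atomic_t tail matches the companion patch described below:

    #include <linux/atomic.h>

    #define OSQ_UNLOCKED_VAL    (0)
    #define OSQ_LOCK_UNLOCKED   { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }

    struct optimistic_spin_queue {
            atomic_t tail;  /* encoded CPU number of the queue tail */
    };

    static inline void osq_lock_init(struct optimistic_spin_queue *lock)
    {
            atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
    }
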
Signed-off-by: Jason Low
Signed-off-by: Peter Zijlstra
Cc: Scott Norton
Cc: "Paul E. McKenney"
Cc: Dave Chinner
Cc: Waiman Long
Cc: Davidlohr Bueso
Cc: Rik van Riel
Cc: Andrew Morton
Cc: "H. Peter Anvin"
Cc: Steven Rostedt
Cc: Tim Chen
Cc: Konrad Rzeszutek Wilk
Cc: Aswin Chandramouleeswaran
Cc: Linus Torvalds
Cc: Chris Mason
Cc: Josef Bacik
Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar
-
The cancellable MCS spinlock is currently used to queue threads that are
doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
the lock would access and queue the local node corresponding to the CPU that
it's running on. Currently, the cancellable MCS lock is implemented by using
pointers to these nodes.

In this patch, instead of operating on pointers to the per-cpu nodes, we
store the CPU numbers to which the per-cpu nodes correspond in an
atomic_t, as sketched below.
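
A sketch of the encoding; helper names and the node layout are assumed,
with 0 reserved to mean "unlocked", so stored values are cpu_nr + 1:

    #include <linux/percpu.h>

    struct optimistic_spin_node {
            struct optimistic_spin_node *next, *prev;
            int locked;     /* 1 if lock acquired */
            int cpu;        /* encoded CPU number of this node */
    };

    static DEFINE_PER_CPU_SHARED_ALIGNED(struct optimistic_spin_node, osq_node);

    static inline int encode_cpu(int cpu_nr)
    {
            return cpu_nr + 1;      /* 0 means no CPU / unlocked */
    }

    static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
    {
            return per_cpu_ptr(&osq_node, encoded_cpu_val - 1);
    }
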
A similar concept is used with the qspinlock.

By operating on the CPU # of the nodes using atomic_t instead of pointers
to those nodes, this can reduce the overhead of the cancellable MCS spinlock
by 32 bits (on 64 bit systems).

Signed-off-by: Jason Low
Signed-off-by: Peter Zijlstra
Cc: Scott Norton
Cc: "Paul E. McKenney"
Cc: Dave Chinner
Cc: Waiman Long
Cc: Davidlohr Bueso
Cc: Rik van Riel
Cc: Andrew Morton
Cc: "H. Peter Anvin"
Cc: Steven Rostedt
Cc: Tim Chen
Cc: Konrad Rzeszutek Wilk
Cc: Aswin Chandramouleeswaran
Cc: Linus Torvalds
Cc: Chris Mason
Cc: Heiko Carstens
Cc: Josef Bacik
Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
Signed-off-by: Ingo Molnar
-
Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
introduced a major performance regression for workloads such as
xfs_repair which mix read and write locking of the mmap_sem across
many threads. The result was that xfs_repair ran 5x slower on 3.16-rc2
than on 3.15 and used 20x more system CPU time.

Perf profiles indicate in some workloads that significant time can
be spent spinning on !owner. This is because we don't set the lock
owner when reader(s) obtain the rwsem.

In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
return false if there is no lock owner. The rationale is that if we
just entered the slowpath, yet there is no lock owner, then there is
a possibility that a reader has the lock. To be conservative, we'll
avoid spinning in these situations.

This patch reduced the total run time of the xfs_repair workload from
about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
back to close to the same performance as on 3.15.

Retesting of AIM7, which were some of the workloads used to test the
original optimistic spinning code, confirmed that we still get big
performance gains with optimistic spinning, even with this additional
regression fix. Davidlohr found that while the 'custom' workload took
a performance hit of ~-14% to throughput for >300 users with this
additional patch, the overall gain with optimistic spinning is
still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.

Tested-by: Dave Chinner
Acked-by: Davidlohr Bueso
Signed-off-by: Jason Low
Signed-off-by: Peter Zijlstra
Cc: Tim Chen
Cc: Linus Torvalds
Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
Signed-off-by: Ingo Molnar
05 Jun, 2014
2 commits
-
WARNING: line over 80 characters
#205: FILE: kernel/locking/rwsem-xadd.c:275:
+ old = cmpxchg(&sem->count, count, count + RWSEM_ACTIVE_WRITE_BIAS);

WARNING: line over 80 characters
#376: FILE: kernel/locking/rwsem-xadd.c:434:
+ * If there were already threads queued before us and there are no

WARNING: line over 80 characters
#377: FILE: kernel/locking/rwsem-xadd.c:435:
+ * active writers, the lock must be read owned; so we try to wake

total: 0 errors, 3 warnings, 417 lines checked
Signed-off-by: Andrew Morton
Signed-off-by: Peter Zijlstra
Cc: "H. Peter Anvin"
Cc: Davidlohr Bueso
Cc: Tim Chen
Cc: Linus Torvalds
Link: http://lkml.kernel.org/n/tip-pn6pslaplw031lykweojsn8c@git.kernel.org
Signed-off-by: Ingo Molnar
-
We have reached the point where our mutexes are quite fine tuned
for a number of situations. This includes the use of heuristics
and optimistic spinning, based on MCS locking techniques.

Exclusive ownership of read-write semaphores is, conceptually,
just about the same as mutexes, making them close cousins. To
this end we need to make them both perform similarly, and
right now, rwsems are simply not up to it. This was discovered
by both reverting commit 4fc3f1d6 (mm/rmap, migration: Make
rmap_walk_anon() and try_to_unmap_anon() more scalable) and
similarly, converting some other mutexes (ie: i_mmap_mutex) to
rwsems. This creates a situation where users have to choose
between a rwsem and mutex taking into account this important
performance difference. Specifically, the biggest difference between
both locks is when we fail to acquire a mutex in the fastpath:
optimistic spinning comes into play and we can avoid a large
amount of unnecessary sleeping and the overhead of moving tasks in
and out of the wait queue. Rwsems do not have such logic.

This patch, based on the work from Tim Chen and me, adds support
for write-side optimistic spinning when the lock is contended.
It also includes support for the recently added cancelable MCS
locking for adaptive spinning. Note that this is only applicable
to the xadd method; the spinlock rwsem variant remains intact.

Allowing optimistic spinning before putting the writer on the wait
queue reduces wait queue contention and provides a greater chance
for the rwsem to get acquired. With these changes, rwsem is on par
with mutex. The performance benefits can be seen on a number of
workloads. For instance, on an 8 socket, 80 core 64bit Westmere box,
aim7 shows the following improvements in throughput:

+--------------+---------------------+-----------------+
| Workload | throughput-increase | number of users |
+--------------+---------------------+-----------------+
| alltests | 20% | >1000 |
| custom | 27%, 60% | 10-100, >1000 |
| high_systime | 36%, 30% | >100, >1000 |
| shared | 58%, 29% | 10-100, >1000 |
+--------------+---------------------+-----------------+

There was also improvement on smaller systems, such as a quad-core
x86-64 laptop running a 30Gb PostgreSQL (pgbench) workload for up
to +60% in throughput for over 50 clients. Additionally, benefits
were also noticed in exim (mail server) workloads. Furthermore, no
performance regressions have been seen at all.

Based-on-work-from: Tim Chen
Signed-off-by: Davidlohr Bueso
[peterz: rej fixup due to comment patches, sched/rt.h header]
Signed-off-by: Peter Zijlstra
Cc: Alex Shi
Cc: Andi Kleen
Cc: Michel Lespinasse
Cc: Rik van Riel
Cc: Peter Hurley
Cc: "Paul E.McKenney"
Cc: Jason Low
Cc: Aswin Chandramouleeswaran
Cc: Andrew Morton
Cc: Linus Torvalds
Cc: "Scott J Norton"
Cc: Andrea Arcangeli
Cc: Chris Mason
Cc: Josef Bacik
Link: http://lkml.kernel.org/r/1399055055.6275.15.camel@buesod1.americas.hpqcorp.net
Signed-off-by: Ingo Molnar
05 May, 2014
1 commit
-
It took me quite a while to understand how rwsem's count field
manifested itself in different scenarios.

Add comments to provide a quick reference to the rwsem's count
field for each scenario where readers and writers are contending
for the lock.

Hopefully it will be useful for future maintenance of the code and
for people to get up to speed on how the logic in the code works.

Signed-off-by: Tim Chen
Cc: Davidlohr Bueso
Cc: Alex Shi
Cc: Michel Lespinasse
Cc: Rik van Riel
Cc: Peter Hurley
Cc: Paul E.McKenney
Cc: Jason Low
Cc: Peter Zijlstra
Cc: Linus Torvalds
Cc: Andrew Morton
Cc: Paul E. McKenney
Link: http://lkml.kernel.org/r/1399060437.2970.146.camel@schen9-DESK
Signed-off-by: Ingo Molnar
14 Feb, 2014
1 commit
-
Mark the rwsem functions that can be called from assembler asmlinkage.
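
For example, the affected declarations take this shape (illustrative
only; asmlinkage pins the C calling convention so the architecture
assembly fast paths can call these slow-path functions):

    #include <linux/linkage.h>
    #include <linux/rwsem.h>

    asmlinkage struct rw_semaphore *rwsem_down_read_failed(struct rw_semaphore *sem);
    asmlinkage struct rw_semaphore *rwsem_down_write_failed(struct rw_semaphore *sem);
    asmlinkage struct rw_semaphore *rwsem_wake(struct rw_semaphore *sem);
    asmlinkage struct rw_semaphore *rwsem_downgrade_wake(struct rw_semaphore *sem);
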
Signed-off-by: Andi Kleen
Link: http://lkml.kernel.org/r/1391845930-28580-9-git-send-email-ak@linux.intel.com
Signed-off-by: H. Peter Anvin
06 Nov, 2013
1 commit
-
Notably: changed lib/rwsem* targets from lib- to obj-, no idea about
the ramifications of that.

Suggested-by: Ingo Molnar
Signed-off-by: Peter Zijlstra
Link: http://lkml.kernel.org/n/tip-g0kynfh5feriwc6p3h6kpbw6@git.kernel.org
Signed-off-by: Ingo Molnar