18 Aug, 2016

3 commits

  • When wanting to wake up readers, __rwsem_mark_wake() currently
    iterates the wait_list twice while looking to wake up the first N
    queued reader tasks. While this can be quite inefficient, it was
    done so that an awoken reader would be first and foremost
    acknowledged by the lock counter.

    Keeping the same logic, we can further benefit from the use of
    wake_qs and avoid the first wait_list iteration that sets the
    counter entirely: since wake_up_process() isn't going to occur
    right away anymore, the counter->list order of going about things
    is maintained anyway. (A simplified sketch of the single-pass
    approach follows this entry.)

    Other than saving cycles with the O(n) "scanning", this change
    also nicely cleans up a good chunk of __rwsem_mark_wake(), leaving
    it both visually cleaner and less tedious to read.

    For example, the following improvements were seen in some
    will-it-scale microbenchmarks, on a 48-core Haswell:

    v4.7 v4.7-rwsem-v1
    Hmean signal1-processes-8 5792691.42 ( 0.00%) 5771971.04 ( -0.36%)
    Hmean signal1-processes-12 6081199.96 ( 0.00%) 6072174.38 ( -0.15%)
    Hmean signal1-processes-21 3071137.71 ( 0.00%) 3041336.72 ( -0.97%)
    Hmean signal1-processes-48 3712039.98 ( 0.00%) 3708113.59 ( -0.11%)
    Hmean signal1-processes-79 4464573.45 ( 0.00%) 4682798.66 ( 4.89%)
    Hmean signal1-processes-110 4486842.01 ( 0.00%) 4633781.71 ( 3.27%)
    Hmean signal1-processes-141 4611816.83 ( 0.00%) 4692725.38 ( 1.75%)
    Hmean signal1-processes-172 4638157.05 ( 0.00%) 4714387.86 ( 1.64%)
    Hmean signal1-processes-203 4465077.80 ( 0.00%) 4690348.07 ( 5.05%)
    Hmean signal1-processes-224 4410433.74 ( 0.00%) 4687534.43 ( 6.28%)

    Stddev signal1-processes-8 6360.47 ( 0.00%) 8455.31 ( 32.94%)
    Stddev signal1-processes-12 4004.98 ( 0.00%) 9156.13 (128.62%)
    Stddev signal1-processes-21 3273.14 ( 0.00%) 5016.80 ( 53.27%)
    Stddev signal1-processes-48 28420.25 ( 0.00%) 26576.22 ( -6.49%)
    Stddev signal1-processes-79 22038.34 ( 0.00%) 18992.70 (-13.82%)
    Stddev signal1-processes-110 23226.93 ( 0.00%) 17245.79 (-25.75%)
    Stddev signal1-processes-141 6358.98 ( 0.00%) 7636.14 ( 20.08%)
    Stddev signal1-processes-172 9523.70 ( 0.00%) 4824.75 (-49.34%)
    Stddev signal1-processes-203 13915.33 ( 0.00%) 9326.33 (-32.98%)
    Stddev signal1-processes-224 15573.94 ( 0.00%) 10613.82 (-31.85%)

    Other runs that saw improvements include context_switch and pipe; and
    as expected, this is particularly highlighted on larger thread counts
    as it becomes more expensive to walk the list twice.

    No change in wakeup ordering or semantics.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
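
    Below is a hypothetical user-space approximation of the single-pass
    idea, in plain C. The names (model_rwsem, mark_wake_readers, the
    pthread condition variables standing in for tasks) are illustrative
    and not the kernel's API: readers are dequeued and batched in one
    scan, the counter is bumped once for the whole batch, and the
    actual wakeups happen only after the wait_lock has been dropped.

      /*
       * Toy single-pass reader wakeup: one scan of the wait list,
       * one counter update, wakeups deferred until after the lock.
       */
      #include <pthread.h>
      #include <stdatomic.h>
      #include <stddef.h>

      enum waiter_type { WAITER_READER, WAITER_WRITER };

      struct waiter {
              struct waiter *next;
              enum waiter_type type;
              pthread_cond_t *sleeper;        /* stand-in for the waiter's task */
      };

      struct model_rwsem {
              atomic_long count;              /* reader/writer accounting */
              pthread_mutex_t wait_lock;
              struct waiter *wait_list;       /* singly linked for brevity */
      };

      #define MAX_WAKE 32

      /* Caller holds wait_lock; returns how many readers were queued to wake. */
      static int mark_wake_readers(struct model_rwsem *sem,
                                   pthread_cond_t *wake_q[MAX_WAKE])
      {
              struct waiter **pp = &sem->wait_list;
              int n = 0;

              /* Single scan: stop at the first writer or the end of the list. */
              while (*pp && (*pp)->type == WAITER_READER && n < MAX_WAKE) {
                      struct waiter *w = *pp;

                      wake_q[n++] = w->sleeper;   /* defer the actual wakeup */
                      *pp = w->next;              /* dequeue the reader */
              }

              /* One counter update covering the whole batch of readers. */
              atomic_fetch_add(&sem->count, n);
              return n;
      }

      void wake_readers(struct model_rwsem *sem)
      {
              pthread_cond_t *wake_q[MAX_WAKE];
              int i, n;

              pthread_mutex_lock(&sem->wait_lock);
              n = mark_wake_readers(sem, wake_q);
              pthread_mutex_unlock(&sem->wait_lock);

              /* Wakeups happen outside the wait_lock; counter was set first. */
              for (i = 0; i < n; i++)
                      pthread_cond_signal(wake_q[i]);
      }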
     
  • Our rwsem code (xadd, at least) is rather well documented, but
    there are a few really annoying comments in there that serve
    no purpose and we shouldn't bother with them.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • We currently return a rw_semaphore structure, which is the same
    lock we passed in as the function's argument in the first place.
    While there are several functions that choose this return value,
    their callers actually use it, for example, for things like
    ERR_PTR. This is not the case for __rwsem_mark_wake(), and in
    addition this function is really about the lock waiters (which we
    know exist at this point), so it's somewhat odd to be returning
    the sem structure.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

16 Jun, 2016

1 commit


08 Jun, 2016

4 commits

  • This patch moves the owner loading and checking code entirely
    inside of rwsem_spin_on_owner() to simplify the logic of the
    rwsem_optimistic_spin() loop.

    Suggested-by: Peter Hurley
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Peter Hurley
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-6-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • In __rwsem_do_wake(), the reader wakeup code will assume a writer
    has stolen the lock if the active reader/writer count is not 0.
    However, this is not as reliable an indicator as the original
    "< RWSEM_WAITING_BIAS" check. If another reader is present, the code
    will still break out and exit even if the writer is gone. This patch
    changes it to check the same "< RWSEM_WAITING_BIAS" condition to
    reduce the chance of false positives.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Peter Hurley
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-5-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Currently, it is not possible to determine for sure if a reader
    owns a rwsem by looking at the content of the rwsem data structure.
    This patch adds a new state RWSEM_READER_OWNED to the owner field
    to indicate that readers currently own the lock. This enables us to
    address the following 2 issues in the rwsem optimistic spinning code:

    1) rwsem_can_spin_on_owner() will disallow optimistic spinning if
    the owner field is NULL, which can mean either that readers own
    the lock or that the owning writer hasn't set the owner field yet.
    In the latter case, we miss the chance to do optimistic spinning.

    2) While a writer is waiting in the OSQ and a reader takes the lock,
    the writer will continue to spin when out of the OSQ in the main
    rwsem_optimistic_spin() loop, as the owner field is NULL, wasting
    CPU cycles if some of the readers are sleeping.

    Adding the new state allows optimistic spinning to go forward as
    long as the owner field is not RWSEM_READER_OWNED and the owner,
    if set, is running, and to stop immediately once that reader-owned
    state is observed. (A sketch of this decision follows this entry.)

    On a 4-socket Haswell machine running a 4.6-rc1 based kernel, the
    fio test with multithreaded randrw and randwrite tests was run on
    the same file on an XFS partition on top of an NVDIMM; the
    aggregated bandwidths before and after the patch were as follows:

    Test        BW before patch   BW after patch   % change
    ----        ---------------   --------------   --------
    randrw      988 MB/s          1192 MB/s        +21%
    randwrite   1513 MB/s         1623 MB/s        +7.3%

    The perf profiles of the rwsem_down_write_failed() function in
    randrw before and after the patch were:

    19.95% 5.88% fio [kernel.vmlinux] [k] rwsem_down_write_failed
    14.20% 1.52% fio [kernel.vmlinux] [k] rwsem_down_write_failed

    The actual CPU cycles spent in rwsem_down_write_failed() dropped
    from 5.88% to 1.52% after the patch.

    The xfstests suite was also run and no regressions were observed.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Jason Low
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Dave Chinner
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1463534783-38814-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
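
    A minimal sketch of the owner-tagging idea, as a user-space C
    model: READER_OWNED, keep_spinning() and struct task below are
    illustrative stand-ins, not the kernel's rwsem definitions. The
    point is that any non-NULL sentinel that can never be a real task
    pointer lets a spinner distinguish "reader owned" from "free or
    owner not yet published".

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct task { _Atomic bool on_cpu; };

      /* Any non-NULL value that can never be a real task pointer. */
      #define READER_OWNED ((struct task *)1UL)

      struct model_rwsem { _Atomic(struct task *) owner; };

      void set_writer_owner(struct model_rwsem *sem, struct task *writer)
      {
              atomic_store(&sem->owner, writer);
      }

      void set_reader_owned(struct model_rwsem *sem)
      {
              atomic_store(&sem->owner, READER_OWNED);
      }

      /*
       * A writer may keep optimistically spinning while the lock is
       * either free (owner == NULL) or held by a writer that is still
       * running. Once the owner reads as READER_OWNED, spinning is
       * pointless: sleeping readers cannot be spun off the lock.
       */
      bool keep_spinning(struct model_rwsem *sem)
      {
              struct task *owner = atomic_load(&sem->owner);

              if (owner == READER_OWNED)
                      return false;
              if (owner == NULL)
                      return true;    /* free, or owner not yet published */
              return atomic_load(&owner->on_cpu);
      }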
     
  • Convert the rwsem count variable to an atomic_long_t, since we use
    it as an atomic variable. This also allows us to remove the
    rwsem_atomic_{add,update}() "abstraction", which would now be an
    unnecessary level of indirection. In follow-up patches, we also
    remove the rwsem_atomic_{add,update}() definitions across the
    various architectures. (A small sketch of the direction follows
    this entry.)

    Suggested-by: Peter Zijlstra
    Signed-off-by: Jason Low
    [ Build warning fixes on various architectures. ]
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Terry Rudd
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Tony Luck
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1465017963-4839-2-git-send-email-jason.low2@hpe.com
    Signed-off-by: Ingo Molnar

    Jason Low
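
    The direction of the change can be sketched in user-space C11
    terms; the names and the ACTIVE_BIAS value below are illustrative,
    not the kernel's. The count is a native atomic long operated on
    with standard atomic primitives, with no per-architecture
    rwsem_atomic_{add,update}() wrapper in between.

      #include <stdatomic.h>

      struct model_rwsem { atomic_long count; };

      #define ACTIVE_BIAS 1L

      long reader_enter(struct model_rwsem *sem)
      {
              /* Previously hidden behind an arch-specific wrapper. */
              return atomic_fetch_add(&sem->count, ACTIVE_BIAS) + ACTIVE_BIAS;
      }

      void reader_exit(struct model_rwsem *sem)
      {
              atomic_fetch_sub(&sem->count, ACTIVE_BIAS);
      }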
     

03 Jun, 2016

3 commits

  • When acquiring the rwsem write lock in the slowpath, we first try
    to set count to RWSEM_WAITING_BIAS. When that is successful,
    we then atomically add the RWSEM_WAITING_BIAS in cases where
    there are other tasks on the wait list. This causes write lock
    operations to often issue multiple atomic operations.

    We can instead make the list_is_singular() check first, and then
    set the count accordingly, so that we issue at most one atomic
    operation when acquiring the write lock and reduce unnecessary
    cacheline contention. (A sketch of this follows this entry.)

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Christoph Lameter
    Cc: Fenghua Yu
    Cc: Heiko Carstens
    Cc: Ivan Kokshaysky
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Paul E. McKenney
    Cc: Peter Hurley
    Cc: Peter Zijlstra
    Cc: Richard Henderson
    Cc: Terry Rudd
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Tony Luck
    Link: http://lkml.kernel.org/r/1463445486-16078-2-git-send-email-jason.low2@hpe.com
    Signed-off-by: Ingo Molnar

    Jason Low
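
    A sketch of the single-atomic-op idea, using a toy model (the bias
    values, nr_waiters field and try_write_lock() name are
    illustrative, not the kernel's): the waiter-list check is done
    first, so one compare-and-swap both claims the lock and leaves the
    right waiting bias behind.

      #include <stdatomic.h>
      #include <stdbool.h>

      #define WAITING_BIAS        (-256L)
      #define ACTIVE_WRITE_BIAS   (WAITING_BIAS + 1L)

      struct model_rwsem {
              atomic_long count;
              int nr_waiters;     /* protected by the wait_lock in real code */
      };

      /* Called with the wait_lock held; 'count' was sampled by the caller. */
      bool try_write_lock(struct model_rwsem *sem, long count)
      {
              long newcount;

              if (count != WAITING_BIAS)
                      return false;

              /*
               * Decide up front whether other waiters must keep their
               * waiting bias, so a single cmpxchg does all the work.
               */
              newcount = (sem->nr_waiters == 1) ?
                              ACTIVE_WRITE_BIAS :
                              ACTIVE_WRITE_BIAS + WAITING_BIAS;

              return atomic_compare_exchange_strong(&sem->count, &count,
                                                    newcount);
      }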
     
  • Readers that are awoken will expect a nil ->task indicating
    that a wakeup has occurred. Because of the way readers are
    implemented, there's a small chance that the waiter will never
    block in the slowpath (rwsem_down_read_failed), and we therefore
    need some form of reference counting to avoid the following
    scenario:

    rwsem_down_read_failed()                  rwsem_wake()
      get_task_struct();
      spin_lock_irq(&wait_lock);
      list_add_tail(&waiter.list)
      spin_unlock_irq(&wait_lock);
                                              raw_spin_lock_irqsave(&wait_lock)
                                              __rwsem_do_wake()
      while (1) {
        set_task_state(TASK_UNINTERRUPTIBLE);
                                              waiter->task = NULL
        if (!waiter.task) // true
          break;
        schedule() // never reached
      }
      __set_task_state(TASK_RUNNING);

    do_exit();
                                              wake_up_process(tsk); // boom

    ... and therefore race with do_exit() when the caller returns.

    There is also a mismatch between the smp_mb() and its
    documentation, in that the serialization is done between reading
    the task and the nil store. Furthermore, in addition to the loads
    and stores to waiter->task being guaranteed to be ordered within
    that CPU, both wake_up_process() originally and now wake_q_add()
    already imply barriers upon successful calls, which serves the
    purpose of that comment.

    Now, as an alternative to perhaps inverting the checks on the
    blocker side (which has its own penalty in that schedule is
    unavoidable), with lockless wakeups this situation is naturally
    addressed and we can just use the reference held by wake_q_add(),
    instead of taking one explicitly. Of course, we must guarantee
    that the nil store is done as the _last_ operation, in that the
    task must already be queued for the deferred wakeup (and thus
    pinned) so as not to fall into the race above. Spurious wakeups
    are also handled transparently, in that the task's reference is
    only dropped when wake_up_q() is actually called, _after_ the nil
    store. (A sketch of this ordering follows this entry.)

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hp.com
    Cc: peter@hurleysoftware.com
    Link: http://lkml.kernel.org/r/1463165787-25937-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
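
    The ordering requirement can be sketched as follows; the names are
    illustrative stand-ins (model_wake_q_add() is not the kernel's
    wake_q_add()), and the point is only that the waker pins the task
    first and makes the nil store the very last step, with release
    semantics, so a reader observing task == NULL may safely return
    and even exit.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct task;                        /* opaque stand-in for task_struct */

      struct waiter { _Atomic(struct task *) task; };

      /* Stand-in for a wake_q-style deferral that takes a reference. */
      static void model_wake_q_add(struct task *t) { (void)t; }

      /* Waker side: pin the task, then do the nil store last. */
      void mark_reader_awoken(struct waiter *w)
      {
              struct task *t = atomic_load(&w->task);

              model_wake_q_add(t);                            /* pin first */
              atomic_store_explicit(&w->task, NULL,
                                    memory_order_release);    /* nil store last */
      }

      /* Reader side: a nil ->task means the wakeup already happened. */
      bool reader_saw_wakeup(struct waiter *w)
      {
              return atomic_load_explicit(&w->task,
                                          memory_order_acquire) == NULL;
      }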
     
  • As wake_qs gain users, we can teach rwsems about them such that
    waiters can be awoken without holding the wait_lock. This applies
    to both readers and writers, the former being the better candidate
    as we can batch the wakeups, shortening the critical region that
    much more -- i.e. a writer task blocking a bunch of tasks waiting
    to service page-faults (mmap_sem readers).

    In general, applying wake_qs to rwsem (xadd) is not difficult as
    the wait_lock is intended to be released soon _anyway_; the
    exception is when the writer slowpath proactively wakes up any
    queued readers upon seeing that the lock is owned by a reader, in
    which case we simply do the wakeups with the lock held (see the
    comment in __rwsem_down_write_failed_common()).

    As with other locking primitives, delaying the moment the waiter
    is awoken does allow, at least in theory, the lock to be stolen in
    the case of writers; however, no harm was seen from this (in fact
    lock stealing tends to be a _good_ thing in most workloads), and
    the window is tiny anyway.

    Some page-fault (pft) and mmap_sem intensive benchmarks show a
    fairly constant reduction in systime (by up to ~8% and ~10%) on a
    2-socket, 12-core AMD box. In addition, on an 8-core Westmere
    doing page allocations (page_test):

    aim9:
                           4.6-rc6             4.6-rc6-rwsemv2
    Min page_test 378167.89 ( 0.00%) 382613.33 ( 1.18%)
    Min exec_test 499.00 ( 0.00%) 502.67 ( 0.74%)
    Min fork_test 3395.47 ( 0.00%) 3537.64 ( 4.19%)
    Hmean page_test 395433.06 ( 0.00%) 414693.68 ( 4.87%)
    Hmean exec_test 499.67 ( 0.00%) 505.30 ( 1.13%)
    Hmean fork_test 3504.22 ( 0.00%) 3594.95 ( 2.59%)
    Stddev page_test 17426.57 ( 0.00%) 26649.92 (-52.93%)
    Stddev exec_test 0.47 ( 0.00%) 1.41 (-199.05%)
    Stddev fork_test 63.74 ( 0.00%) 32.59 ( 48.86%)
    Max page_test 429873.33 ( 0.00%) 456960.00 ( 6.30%)
    Max exec_test 500.33 ( 0.00%) 507.66 ( 1.47%)
    Max fork_test 3653.33 ( 0.00%) 3650.90 ( -0.07%)

                 4.6-rc6    4.6-rc6-rwsemv2
    User 1.12 0.04
    System 0.23 0.04
    Elapsed 727.27 721.98

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hp.com
    Cc: peter@hurleysoftware.com
    Link: http://lkml.kernel.org/r/1463165787-25937-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

16 May, 2016

1 commit

  • Tetsuo Handa reported that the new signal_pending exit path in
    __rwsem_down_write_failed_common() was breaking his kernel.

    Upon inspection it was found that there are two things wrong with
    it:

    - it forgets to remove WAITING_BIAS if it leaves the list empty, or
    - it forgets to wake further waiters that were blocked on the now
      removed waiter.

    Especially the first issue causes new lock attempts to block and
    stall indefinitely, as the code assumes that pending waiters mean
    there is an owner that will wake them up when it releases the
    lock. (A sketch of the error-path cleanup follows this entry.)

    Reported-by: Tetsuo Handa
    Tested-by: Tetsuo Handa
    Tested-by: Michal Hocko
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Alexander Shishkin
    Cc: Andrew Morton
    Cc: Arnaldo Carvalho de Melo
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Davidlohr Bueso
    Cc: H. Peter Anvin
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: Vince Weaver
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/20160512115745.GP3192@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
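
    A sketch of the error-path cleanup described above, as an
    illustrative model rather than the kernel code (the bias value and
    nr_waiters field are made up): a killed writer that dequeues
    itself either drops the waiting bias when nobody is left, or hands
    the wakeup on to whoever queued behind it.

      #include <stdatomic.h>

      #define WAITING_BIAS (-256L)        /* illustrative value */

      struct model_rwsem {
              atomic_long count;
              int nr_waiters;             /* protected by the wait_lock */
      };

      static void wake_next_waiters(struct model_rwsem *sem)
      {
              (void)sem;                  /* stub for the real wakeup path */
      }

      /* Called with the wait_lock held, after removing ourselves. */
      void writer_abort_cleanup(struct model_rwsem *sem)
      {
              if (sem->nr_waiters == 0)
                      atomic_fetch_add(&sem->count, -WAITING_BIAS); /* undo bias */
              else
                      wake_next_waiters(sem);  /* pass the wakeup along */
      }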
     

13 Apr, 2016

1 commit

  • Introduce a generic implementation necessary for down_write_killable().

    This is a trivial extension of the already existing down_write()
    call: a variant which can be interrupted by SIGKILL. This patch
    doesn't provide down_write_killable() yet, because architectures
    have to provide the necessary pieces first.

    rwsem_down_write_failed(), which is the generic slow path for the
    write lock, is extended to take a task state and renamed to
    __rwsem_down_write_failed_common(). The return value is either a
    valid semaphore pointer or ERR_PTR(-EINTR). (The shape of this
    common slowpath is sketched after this entry.)

    rwsem_down_write_failed_killable() is exported as a new way to
    wait for the lock and be killable.

    For the rwsem-spinlock implementation, the current __down_write()
    is updated in a similar way to __rwsem_down_write_failed_common(),
    except that it doesn't need new exports, just a visible
    __down_write_killable().

    Architectures which are not using the generic rwsem implementation are
    supposed to provide their __down_write_killable() implementation and
    use rwsem_down_write_failed_killable() for the slow path.

    Signed-off-by: Michal Hocko
    Cc: Andrew Morton
    Cc: Chris Zankel
    Cc: David S. Miller
    Cc: Linus Torvalds
    Cc: Max Filippov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Signed-off-by: Davidlohr Bueso
    Cc: Signed-off-by: Jason Low
    Cc: Thomas Gleixner
    Cc: Tony Luck
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-arch@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/1460041951-22347-7-git-send-email-mhocko@kernel.org
    Signed-off-by: Ingo Molnar

    Michal Hocko
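
    The shape of the common slowpath can be sketched with a user-space
    model; everything below (model_rwsem, SLEEP_KILLABLE, the
    fatal_signal flag standing in for fatal_signal_pending()) is
    illustrative, and the kernel returns ERR_PTR(-EINTR) where this
    toy returns NULL.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct model_rwsem {
              atomic_long count;          /* 0 == unlocked in this toy model */
              atomic_bool fatal_signal;   /* fatal_signal_pending() stand-in */
      };

      enum sleep_state { SLEEP_UNINTERRUPTIBLE, SLEEP_KILLABLE };

      static bool try_write_lock(struct model_rwsem *sem)
      {
              long expected = 0;

              return atomic_compare_exchange_strong(&sem->count, &expected, -1);
      }

      static void wait_a_bit(void) { /* schedule() stand-in */ }

      static struct model_rwsem *
      down_write_failed_common(struct model_rwsem *sem, enum sleep_state state)
      {
              for (;;) {
                      if (try_write_lock(sem))
                              return sem;     /* got the lock */
                      if (state == SLEEP_KILLABLE &&
                          atomic_load(&sem->fatal_signal))
                              return NULL;    /* ERR_PTR(-EINTR) in the kernel */
                      wait_a_bit();
              }
      }

      struct model_rwsem *down_write_failed(struct model_rwsem *sem)
      {
              return down_write_failed_common(sem, SLEEP_UNINTERRUPTIBLE);
      }

      struct model_rwsem *down_write_failed_killable(struct model_rwsem *sem)
      {
              return down_write_failed_common(sem, SLEEP_KILLABLE);
      }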
     

06 Oct, 2015

1 commit

  • As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
    variants of some atomic operations) and 6d79ef2d30e (locking,
    asm-generic: Add _{relaxed|acquire|release}() variants for
    'atomic_long_t'), weakly ordered archs can benefit from a more
    relaxed use of barriers when locking and unlocking, instead of
    regular full barrier semantics. While currently only arm64
    supports such optimizations, updating the corresponding locking
    primitives lets other archs benefit immediately as well, once the
    necessary machinery is implemented, of course. (A sketch of the
    acquire/release placement follows this entry.)

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E.McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1443643395-17016-6-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
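
    In C11 terms the direction looks like the sketch below (a toy
    reader counter, not the kernel's atomic_long_*_acquire/_release
    API): lock-acquire paths only need acquire ordering and unlock
    paths only need release ordering, instead of full barriers on both
    sides.

      #include <stdatomic.h>

      struct model_rwsem { atomic_long count; };

      void model_read_lock(struct model_rwsem *sem)
      {
              /* Acquire: critical-section accesses cannot move before this. */
              atomic_fetch_add_explicit(&sem->count, 1, memory_order_acquire);
      }

      void model_read_unlock(struct model_rwsem *sem)
      {
              /* Release: critical-section accesses cannot move after this. */
              atomic_fetch_sub_explicit(&sem->count, 1, memory_order_release);
      }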
     

08 May, 2015

1 commit

  • In up_write()/up_read(), rwsem_wake() will be called whenever it
    detects that some writers/readers are waiting. The rwsem_wake()
    function will take the wait_lock and call __rwsem_do_wake() to do the
    real wakeup. For a heavily contended rwsem, doing a spin_lock() on
    the wait_lock will cause further contention on the heavily
    contended rwsem cacheline, delaying the completion of the
    up_read()/up_write() operations.

    This patch makes taking the wait_lock and calling __rwsem_do_wake()
    optional if at least one spinning writer is present. The spinning
    writer will be able to take the rwsem and call rwsem_wake() later,
    when it itself calls up_write(). In the presence of a spinning
    writer, rwsem_wake() will now try to acquire the wait_lock using a
    trylock; if that fails, it will just quit. (A sketch of this
    follows this entry.)

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Acked-by: Jason Low
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
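
    A pthread-based sketch of the idea (illustrative only; has_spinner
    stands in for the osq check, do_wake() for __rwsem_do_wake()):
    with a spinner present, the wakeup path only trylocks the
    wait_lock and bails if it cannot get it immediately, leaving the
    wakeup to the spinner once it acquires the lock.

      #include <pthread.h>
      #include <stdatomic.h>
      #include <stdbool.h>

      struct model_rwsem {
              pthread_mutex_t wait_lock;
              atomic_bool has_spinner;    /* stand-in for "writer is spinning" */
      };

      static void do_wake(struct model_rwsem *sem) { (void)sem; /* elided */ }

      void model_rwsem_wake(struct model_rwsem *sem)
      {
              if (atomic_load(&sem->has_spinner)) {
                      /*
                       * A spinner will take the rwsem and do the wakeup when
                       * it releases it; don't fight over the wait_lock.
                       */
                      if (pthread_mutex_trylock(&sem->wait_lock) != 0)
                              return;
              } else {
                      pthread_mutex_lock(&sem->wait_lock);
              }

              do_wake(sem);
              pthread_mutex_unlock(&sem->wait_lock);
      }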
     

07 Mar, 2015

1 commit

  • Ming reported soft lockups occurring when running xfstest due to
    the following tip:locking/core commit:

    b3fd4f03ca0b ("locking/rwsem: Avoid deceiving lock spinners")

    When doing optimistic spinning in rwsem, threads should stop
    spinning when the lock owner is not running. While a thread is
    spinning on owner, if the owner reschedules, owner->on_cpu
    returns false and we stop spinning.

    However, this commit essentially caused the check to get ignored,
    because when we break out of the spin loop due to !on_cpu we would
    continue spinning as long as sem->owner != NULL.

    This patch fixes this by making sure we stop spinning if the
    owner is not running. Furthermore, just like with mutexes, the
    code is refactored such that we don't have a separate
    owner_running() check. This makes it more straightforward why we
    exit the spin-on-owner loop, and avoids having to "guess" why we
    broke out of it, which makes the code more readable. (A sketch of
    the resulting loop follows this entry.)

    Reported-and-tested-by: Ming Lei
    Signed-off-by: Jason Low
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Dave Jones
    Cc: Linus Torvalds
    Cc: Michel Lespinasse
    Cc: Oleg Nesterov
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Sasha Levin
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1425714331.2475.388.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
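
    The resulting loop shape can be sketched as below (a user-space
    model; need_resched(), cpu_relax() and struct task are local
    stand-ins, not the kernel's): spin while the same owner is on a
    CPU, and report on exit whether spinning is still worthwhile,
    instead of the caller re-deducing that from sem->owner != NULL.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct task { _Atomic bool on_cpu; };

      struct model_rwsem { _Atomic(struct task *) owner; };

      static bool need_resched(void) { return false; }   /* stand-in */
      static void cpu_relax(void) { }                    /* stand-in */

      bool spin_on_owner(struct model_rwsem *sem)
      {
              struct task *owner = atomic_load(&sem->owner);

              if (!owner)
                      return true;    /* nothing to watch; caller decides */

              while (atomic_load(&sem->owner) == owner) {
                      if (need_resched())
                              return false;   /* get off the CPU ourselves */
                      if (!atomic_load(&owner->on_cpu))
                              return false;   /* owner preempted: stop spinning */
                      cpu_relax();
              }

              /* Owner changed hands: spinning may still pay off. */
              return true;
      }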
     

24 Feb, 2015

1 commit

  • With the new standardized READ_ONCE()/WRITE_ONCE() functions, we
    can replace all ACCESS_ONCE() calls across relevant locking code -
    this includes lockref and seqlock while at it.

    ACCESS_ONCE() does not work reliably on non-scalar types.
    For example gcc 4.6 and 4.7 might remove the volatile tag
    for such accesses during the SRA (scalar replacement of
    aggregates) step:

    https://gcc.gnu.org/bugzilla/show_bug.cgi?id=58145

    Update to the new calls regardless of whether the type is scalar;
    this is cleaner than having three alternatives.

    Signed-off-by: Davidlohr Bueso
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1424662301.6539.18.camel@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

18 Feb, 2015

4 commits

  • 37e9562453b ("locking/rwsem: Allow conservative optimistic
    spinning when readers have lock") forced the default for
    optimistic spinning to be disabled if the lock owner was
    nil, which makes much sense for readers. However, while
    it is not our priority, we can make some optimizations
    for write-mostly workloads. We can bail the spinning step
    and still be conservative if there are any active tasks,
    otherwise there's really no reason not to spin, as the
    semaphore is most likely unlocked.

    This patch recovers most of a Unixbench 'execl' benchmark
    throughput by sleeping less and making better average system
    usage:

    before:
    CPU %user %nice %system %iowait %steal %idle
    all 0.60 0.00 8.02 0.00 0.00 91.38

    after:
    CPU %user %nice %system %iowait %steal %idle
    all 1.22 0.00 70.18 0.00 0.00 28.60

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Jason Low
    Cc: Linus Torvalds
    Cc: Michel Lespinasse
    Cc: Paul E. McKenney
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1422609267-15102-6-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • When readers hold the semaphore, the ->owner is nil. As such,
    and unlike mutexes, '!owner' does not necessarily imply that
    the lock is free. This can cause writers to spin excessively,
    as they've been misled into thinking they have a chance of
    acquiring the lock, instead of blocking.

    This patch therefore enhances the counter check when the owner
    is not set by the time we've broken out of the loop. Otherwise
    we can return true, as a new owner has the lock and thus we want
    to continue spinning. While at it, we can make rwsem_spin_on_owner()
    less ambiguous and return right away under need_resched conditions.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Michel Lespinasse
    Cc: Paul E. McKenney
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1422609267-15102-5-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • In order to optimize the spinning step, we need to set the lock
    owner as soon as the lock is acquired; after a successful counter
    cmpxchg operation, that is. This is particularly useful as rwsems
    need to set the owner to nil for readers, so there is a greater
    chance of falling out of the spinning. Currently we only set the
    owner much later in the game, at the more generic level -- latency
    can be especially bad when waiting for a node->next pointer when
    releasing the osq in up_write() calls.

    As such, update the owner inside rwsem_try_write_lock (when the
    lock is obtained after blocking) and rwsem_try_write_lock_unqueued
    (when the lock is obtained while spinning). This requires creating
    a new internal rwsem.h header to share the owner related calls.

    Also cleanup some headers for mutex and rwsem.

    Suggested-by: Peter Zijlstra
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Michel Lespinasse
    Cc: Paul E. McKenney
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1422609267-15102-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • The need for the smp_mb() in __rwsem_do_wake() should be
    properly documented. Applies to both xadd and spinlock
    variants.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Michel Lespinasse
    Cc: Paul E. McKenney
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1422609267-15102-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

04 Feb, 2015

1 commit

  • Call __set_task_state() instead of assigning the new state
    directly. These interfaces also aid CONFIG_DEBUG_ATOMIC_SLEEP
    environments, keeping track of who last changed the state.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: "Paul E. McKenney"
    Cc: Jason Low
    Cc: Michel Lespinasse
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1422257769-14083-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

03 Oct, 2014

1 commit

  • Commit 9b0fc9c09f1b ("rwsem: skip initial trylock in rwsem_down_write_failed")
    checks whether there are known active lockers in order to avoid write
    trylocking using an expensive cmpxchg() when it likely wouldn't get the lock.

    However, a subsequent patch was added such that we directly
    check for sem->count == RWSEM_WAITING_BIAS right before trying
    that cmpxchg().

    Thus, commit 9b0fc9c09f1b now just adds overhead.

    This patch modifies it so that we only check whether
    count == RWSEM_WAITING_BIAS.

    Also, add a comment on why we do an "extra check" of count
    before the cmpxchg(). (A sketch of the resulting check-then-cmpxchg
    pattern follows this entry.)

    Signed-off-by: Jason Low
    Acked-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Aswin Chandramouleeswaran
    Cc: Chegu Vinod
    Cc: Peter Hurley
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1410913017.2447.22.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
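
    The check-then-cmpxchg pattern looks roughly like the sketch below
    (the bias values are illustrative, not the kernel's): only attempt
    the expensive compare-and-swap when a plain load says the count is
    exactly the "waiters only, no active lockers" value it must be for
    the swap to succeed.

      #include <stdatomic.h>
      #include <stdbool.h>

      #define WAITING_BIAS        (-256L)
      #define ACTIVE_WRITE_BIAS   (WAITING_BIAS + 1L)

      bool write_trylock_slowpath(atomic_long *count)
      {
              long expected = WAITING_BIAS;

              /* Cheap pre-check before the expensive cmpxchg. */
              if (atomic_load(count) != WAITING_BIAS)
                      return false;

              return atomic_compare_exchange_strong(count, &expected,
                                                    ACTIVE_WRITE_BIAS);
      }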
     

16 Sep, 2014

1 commit


17 Jul, 2014

1 commit

  • The arch_mutex_cpu_relax() function, introduced by 34b133f, is
    hacky and ugly. It was added a few years ago to address the fact
    that common cpu_relax() calls include yielding on s390, and thus
    impact the optimistic spinning functionality of mutexes. Nowadays
    we use this function well beyond mutexes: rwsem, qrwlock, mcs and
    lockref. Since the macro that defines the call is in the mutex header,
    any users must include mutex.h and the naming is misleading as well.

    This patch (i) renames the call to cpu_relax_lowlatency ("relax, but
    only if you can do it with very low latency") and (ii) defines it in
    each arch's asm/processor.h local header, just like for regular cpu_relax
    functions. On all archs, except s390, cpu_relax_lowlatency is simply cpu_relax,
    and thus we can take it out of mutex.h. While this can seem redundant,
    I believe it is a good choice as it allows us to move out arch specific
    logic from generic locking primitives and enables future(?) archs to
    transparently define it, similarly to System Z.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra
    Cc: Andrew Morton
    Cc: Anton Blanchard
    Cc: Aurelien Jacquiot
    Cc: Benjamin Herrenschmidt
    Cc: Bharat Bhushan
    Cc: Catalin Marinas
    Cc: Chen Liqin
    Cc: Chris Metcalf
    Cc: Christian Borntraeger
    Cc: Chris Zankel
    Cc: David Howells
    Cc: David S. Miller
    Cc: Deepthi Dharwar
    Cc: Dominik Dingel
    Cc: Fenghua Yu
    Cc: Geert Uytterhoeven
    Cc: Guan Xuetao
    Cc: Haavard Skinnemoen
    Cc: Hans-Christian Egtvedt
    Cc: Heiko Carstens
    Cc: Helge Deller
    Cc: Hirokazu Takata
    Cc: Ivan Kokshaysky
    Cc: James E.J. Bottomley
    Cc: James Hogan
    Cc: Jason Wang
    Cc: Jesper Nilsson
    Cc: Joe Perches
    Cc: Jonas Bonn
    Cc: Joseph Myers
    Cc: Kees Cook
    Cc: Koichi Yasutake
    Cc: Lennox Wu
    Cc: Linus Torvalds
    Cc: Mark Salter
    Cc: Martin Schwidefsky
    Cc: Matt Turner
    Cc: Max Filippov
    Cc: Michael Neuling
    Cc: Michal Simek
    Cc: Mikael Starvik
    Cc: Nicolas Pitre
    Cc: Paolo Bonzini
    Cc: Paul Burton
    Cc: Paul E. McKenney
    Cc: Paul Gortmaker
    Cc: Paul Mackerras
    Cc: Qais Yousef
    Cc: Qiaowei Ren
    Cc: Rafael Wysocki
    Cc: Ralf Baechle
    Cc: Richard Henderson
    Cc: Richard Kuo
    Cc: Russell King
    Cc: Steven Miao
    Cc: Steven Rostedt
    Cc: Stratos Karafotis
    Cc: Tim Chen
    Cc: Tony Luck
    Cc: Vasily Kulikov
    Cc: Vineet Gupta
    Cc: Vineet Gupta
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: Wolfram Sang
    Cc: adi-buildroot-devel@lists.sourceforge.net
    Cc: linux390@de.ibm.com
    Cc: linux-alpha@vger.kernel.org
    Cc: linux-am33-list@redhat.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: linux-c6x-dev@linux-c6x.org
    Cc: linux-cris-kernel@axis.com
    Cc: linux-hexagon@vger.kernel.org
    Cc: linux-ia64@vger.kernel.org
    Cc: linux@lists.openrisc.net
    Cc: linux-m32r-ja@ml.linux-m32r.org
    Cc: linux-m32r@ml.linux-m32r.org
    Cc: linux-m68k@lists.linux-m68k.org
    Cc: linux-metag@vger.kernel.org
    Cc: linux-mips@linux-mips.org
    Cc: linux-parisc@vger.kernel.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-sh@vger.kernel.org
    Cc: linux-xtensa@linux-xtensa.org
    Cc: sparclinux@vger.kernel.org
    Link: http://lkml.kernel.org/r/1404079773.2619.4.camel@buesod1.americas.hpqcorp.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

16 Jul, 2014

4 commits

  • Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
    encapsulate the dependencies for rwsem optimistic spinning.
    No logical changes here as it continues to depend on both
    SMP and the XADD algorithm variant.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Jason Low
    [ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
    Cc: aswin@hp.com
    Cc: Chris Mason
    Cc: Davidlohr Bueso
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Waiman Long
    Signed-off-by: Ingo Molnar

    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • Currently, we initialize the osq lock by directly setting the lock's values. It
    would be preferable if we used an init macro to do the initialization, like we
    do with other locks.

    This patch introduces and uses a macro and function for initializing the osq lock.

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Scott Norton
    Cc: "Paul E. McKenney"
    Cc: Dave Chinner
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Tim Chen
    Cc: Konrad Rzeszutek Wilk
    Cc: Aswin Chandramouleeswaran
    Cc: Linus Torvalds
    Cc: Chris Mason
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • The cancellable MCS spinlock is currently used to queue threads that are
    doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
    the lock would access and queue the local node corresponding to the CPU that
    it's running on. Currently, the cancellable MCS lock is implemented by using
    pointers to these nodes.

    In this patch, instead of operating on pointers to the per-cpu nodes, we
    store the CPU numbers to which the per-cpu nodes correspond in an atomic_t.
    A similar concept is used with the qspinlock.

    By operating on the CPU numbers of the nodes using an atomic_t instead of
    pointers to those nodes, this reduces the size of the cancellable MCS
    spinlock by 32 bits (on 64-bit systems). (The encoding is sketched after
    this entry.)

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Scott Norton
    Cc: "Paul E. McKenney"
    Cc: Dave Chinner
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Tim Chen
    Cc: Konrad Rzeszutek Wilk
    Cc: Aswin Chandramouleeswaran
    Cc: Linus Torvalds
    Cc: Chris Mason
    Cc: Heiko Carstens
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
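
    The encoding can be sketched as below (MAX_CPUS, model_osq and the
    node array are illustrative): 0 means "no tail", cpu N is stored
    as N + 1, and the actual node is recovered through the per-CPU
    array, so the lock word itself only needs a 32-bit atomic_t.

      #include <stdatomic.h>
      #include <stddef.h>

      #define MAX_CPUS 64

      struct model_osq_node { struct model_osq_node *next; };

      static struct model_osq_node osq_node[MAX_CPUS];    /* per-CPU nodes */

      struct model_osq { atomic_int tail; };              /* 0 == unlocked */

      int encode_cpu(int cpu)
      {
              return cpu + 1;
      }

      struct model_osq_node *decode_cpu(int encoded)
      {
              return encoded ? &osq_node[encoded - 1] : NULL;
      }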
     
  • Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
    introduced a major performance regression for workloads such as
    xfs_repair which mix read and write locking of the mmap_sem across
    many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
    than on 3.15 and using 20x more system CPU time.

    Perf profiles indicate that in some workloads significant time can
    be spent spinning on !owner. This is because we don't set the lock
    owner when reader(s) obtain the rwsem.

    In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
    return false if there is no lock owner. The rationale is that if we
    just entered the slowpath, yet there is no lock owner, then there is
    a possibility that a reader has the lock. To be conservative, we'll
    avoid spinning in these situations.

    This patch reduced the total run time of the xfs_repair workload from
    about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
    back to close to the same performance as on 3.15.

    Retesting of AIM7, which were some of the workloads used to test the
    original optimistic spinning code, confirmed that we still get big
    performance gains with optimistic spinning, even with this additional
    regression fix. Davidlohr found that while the 'custom' workload took
    a performance hit of ~-14% to throughput for >300 users with this
    additional patch, the overall gain with optimistic spinning is
    still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.

    Tested-by: Dave Chinner
    Acked-by: Davidlohr Bueso
    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
     

05 Jun, 2014

2 commits

  • WARNING: line over 80 characters
    #205: FILE: kernel/locking/rwsem-xadd.c:275:
    + old = cmpxchg(&sem->count, count, count + RWSEM_ACTIVE_WRITE_BIAS);

    WARNING: line over 80 characters
    #376: FILE: kernel/locking/rwsem-xadd.c:434:
    + * If there were already threads queued before us and there are no

    WARNING: line over 80 characters
    #377: FILE: kernel/locking/rwsem-xadd.c:435:
    + * active writers, the lock must be read owned; so we try to wake

    total: 0 errors, 3 warnings, 417 lines checked

    Signed-off-by: Andrew Morton
    Signed-off-by: Peter Zijlstra
    Cc: "H. Peter Anvin"
    Cc: Davidlohr Bueso
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-pn6pslaplw031lykweojsn8c@git.kernel.org
    Signed-off-by: Ingo Molnar

    Andrew Morton
     
  • We have reached the point where our mutexes are quite fine tuned
    for a number of situations. This includes the use of heuristics
    and optimistic spinning, based on MCS locking techniques.

    Exclusive ownership of read-write semaphores is, conceptually,
    just about the same as with mutexes, making them close cousins.
    To this end we need to make them both perform similarly, and
    right now, rwsems are simply not up to it. This was discovered
    by both reverting commit 4fc3f1d6 (mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable) and,
    similarly, converting some other mutexes (ie: i_mmap_mutex) to
    rwsems. This creates a situation where users have to choose
    between a rwsem and a mutex taking into account this important
    performance difference. Specifically, the biggest difference
    between both locks is that when we fail to acquire a mutex in the
    fastpath, optimistic spinning comes into play and we can avoid a
    large amount of unnecessary sleeping and the overhead of moving
    tasks in and out of the wait queue. Rwsems do not have such logic.

    This patch, based on the work from Tim Chen and me, adds support
    for write-side optimistic spinning when the lock is contended.
    It also includes support for the recently added cancelable MCS
    locking for adaptive spinning. Note that this is only applicable
    to the xadd method, and the spinlock rwsem variant remains intact.
    (A tiny sketch of the spinning loop follows this entry.)

    Allowing optimistic spinning before putting the writer on the wait
    queue reduces wait queue contention and provides a greater chance
    for the rwsem to get acquired. With these changes, rwsem is on par
    with mutex. The performance benefits can be seen on a number of
    workloads. For instance, on a 8 socket, 80 core 64bit Westmere box,
    aim7 shows the following improvements in throughput:

    +--------------+---------------------+-----------------+
    | Workload | throughput-increase | number of users |
    +--------------+---------------------+-----------------+
    | alltests | 20% | >1000 |
    | custom | 27%, 60% | 10-100, >1000 |
    | high_systime | 36%, 30% | >100, >1000 |
    | shared | 58%, 29% | 10-100, >1000 |
    +--------------+---------------------+-----------------+

    There was also improvement on smaller systems, such as a quad-core
    x86-64 laptop running a 30Gb PostgreSQL (pgbench) workload for up
    to +60% in throughput for over 50 clients. Additionally, benefits
    were also noticed in exim (mail server) workloads. Furthermore, no
    performance regressions have been seen at all.

    Based-on-work-from: Tim Chen
    Signed-off-by: Davidlohr Bueso
    [peterz: rej fixup due to comment patches, sched/rt.h header]
    Signed-off-by: Peter Zijlstra
    Cc: Alex Shi
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Hurley
    Cc: "Paul E.McKenney"
    Cc: Jason Low
    Cc: Aswin Chandramouleeswaran
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: "Scott J Norton"
    Cc: Andrea Arcangeli
    Cc: Chris Mason
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1399055055.6275.15.camel@buesod1.americas.hpqcorp.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
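
    A very small user-space sketch of the write-side spinning loop
    (names and the toy lock word are illustrative; the kernel
    additionally wraps this in the cancellable MCS/osq queue and
    need_resched() checks): keep retrying a trylock while the current
    owner is still running on a CPU, otherwise fall back to blocking.

      #include <stdatomic.h>
      #include <stdbool.h>
      #include <stddef.h>

      struct task { _Atomic bool on_cpu; };

      struct model_rwsem {
              atomic_long count;                  /* 0 == unlocked */
              _Atomic(struct task *) owner;
      };

      static bool write_trylock(struct model_rwsem *sem, struct task *self)
      {
              long expected = 0;

              if (!atomic_compare_exchange_strong(&sem->count, &expected, -1))
                      return false;
              atomic_store(&sem->owner, self);    /* publish ownership */
              return true;
      }

      /* Returns true if the lock was taken while spinning, false to block. */
      bool optimistic_spin(struct model_rwsem *sem, struct task *self)
      {
              for (;;) {
                      struct task *owner = atomic_load(&sem->owner);

                      if (write_trylock(sem, self))
                              return true;        /* acquired while spinning */

                      if (owner && !atomic_load(&owner->on_cpu))
                              return false;       /* owner off CPU: go sleep */
              }
      }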
     

05 May, 2014

1 commit

  • It took me quite a while to understand how rwsem's count field
    manifested itself in different scenarios.

    Add comments to provide a quick reference to the rwsem's count
    field for each scenario where readers and writers are contending
    for the lock.

    Hopefully it will be useful for future maintenance of the code and
    for people to get up to speed on how the logic in the code works.

    Signed-off-by: Tim Chen
    Cc: Davidlohr Bueso
    Cc: Alex Shi
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Hurley
    Cc: Paul E.McKenney
    Cc: Jason Low
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1399060437.2970.146.camel@schen9-DESK
    Signed-off-by: Ingo Molnar

    Tim Chen
     

14 Feb, 2014

1 commit


06 Nov, 2013

1 commit