22 Sep, 2016

2 commits

  • Provide a down_read()/up_read() variant that keeps preemption disabled
    over the whole thing, when possible.

    This avoids a needless preemption point for constructs such as:

    percpu_down_read(&global_rwsem);
    spin_lock(&lock);
    ...
    spin_unlock(&lock);
    percpu_up_read(&global_rwsem);

    Such a construct perturbs timings. In particular, it was found to cure
    a performance regression in a follow-up patch in fs/locks.c. (A usage
    sketch combining this and the following entry appears after this list.)

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Provide a static initializer and a standard locking assertion method.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: dave@stgolabs.net
    Cc: der.herr@hofr.at
    Cc: oleg@redhat.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: riel@redhat.com
    Cc: tj@kernel.org
    Cc: viro@ZenIV.linux.org.uk
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
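
    A minimal usage sketch tying the two changes in this list together,
    assuming the interfaces they describe (DEFINE_STATIC_PERCPU_RWSEM(),
    percpu_rwsem_assert_held(), and the preemption-disabled read-side
    variants); this is an illustration, not code from the patches themselves:

    #include <linux/percpu-rwsem.h>
    #include <linux/spinlock.h>

    /* Static initialization, no explicit percpu_init_rwsem() call needed. */
    static DEFINE_STATIC_PERCPU_RWSEM(global_rwsem);
    static DEFINE_SPINLOCK(lock);

    static void read_section(void)
    {
            /*
             * Keep preemption disabled across the whole read section so the
             * spin_lock()/spin_unlock() pair does not add a preemption point
             * between the two rwsem operations.
             */
            percpu_down_read_preempt_disable(&global_rwsem);
            spin_lock(&lock);
            percpu_rwsem_assert_held(&global_rwsem); /* standard locking assertion */
            /* ... */
            spin_unlock(&lock);
            percpu_up_read_preempt_enable(&global_rwsem);
    }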
     

10 Aug, 2016

1 commit

  • Currently the percpu-rwsem switches to (global) atomic ops while a
    writer is waiting, which can be quite a while and slows down
    releasing the readers.

    This patch cures this problem by ordering the reader-state vs
    reader-count (see the comments in __percpu_down_read() and
    percpu_down_write()). This changes a global atomic op into a full
    memory barrier, which doesn't have the global cacheline contention.

    This also enables using the percpu-rwsem with rcu_sync disabled in order
    to bias the implementation differently, reducing the writer latency by
    adding some cost to readers. (A simplified model of the reader/writer
    ordering appears after this entry.)

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    [ Fixed modular build. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
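
    A deliberately simplified model of that ordering (illustrative only:
    read_count and readers_block are stand-ins for whatever the real
    implementation uses, the slow path and the writer's wait are omitted,
    and the real fast path gets its ordering from RCU-sched rather than an
    unconditional smp_mb()):

    #include <linux/percpu.h>
    #include <linux/preempt.h>
    #include <linux/compiler.h>
    #include <asm/barrier.h>

    static DEFINE_PER_CPU(unsigned int, read_count);
    static int readers_block;

    /* Reader: publish the per-CPU count, then check for a pending writer. */
    static bool reader_try_fast_path(void)
    {
            bool ok;

            preempt_disable();
            __this_cpu_inc(read_count);
            /*
             * Full barrier instead of a global atomic op; pairs with the
             * writer's barrier below. Either the writer sees our increment,
             * or we see readers_block and back out to the slow path.
             */
            smp_mb();
            ok = !READ_ONCE(readers_block);
            if (!ok)
                    __this_cpu_dec(read_count); /* back out; take the rwsem instead */
            preempt_enable();
            return ok;
    }

    /* Writer: block the fast path, then wait for the per-CPU counts to drain. */
    static void writer_block_readers(void)
    {
            WRITE_ONCE(readers_block, 1);
            smp_mb(); /* pairs with the reader-side barrier above */
            /* ... now sum read_count over all CPUs and wait for it to reach zero ... */
    }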
     

07 Oct, 2015

1 commit

  • Currently down_write/up_write call synchronize_sched_expedited()
    twice, which is evil. Change this code to rely on the rcu_sync primitives.
    This avoids the _expedited "big hammer", and it can be faster in
    the contended case, or even when a single thread does
    down_write/up_write in a loop.

    Of course, a single down_write() will take more time, but on the other
    hand it will be much more friendly to the whole system.

    To simplify the review, this patch doesn't update the comments; they are
    fixed by the next change. (A minimal rcu_sync usage sketch appears after
    this entry.)

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Oleg Nesterov
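
    A sketch of the resulting shape, using the rcu_sync API (rcu_sync_enter(),
    rcu_sync_exit(), rcu_sync_is_idle()); the struct layout and field names
    here are illustrative, and initialization plus the reader slow path are
    omitted:

    #include <linux/rcu_sync.h>
    #include <linux/rcupdate.h>
    #include <linux/rwsem.h>
    #include <linux/percpu.h>

    struct pcpu_rwsem_sketch {
            struct rcu_sync         rss;            /* tracks "a writer is pending/active" */
            unsigned int __percpu   *fast_read_ctr;
            struct rw_semaphore     rw_sem;
    };

    static void sketch_down_read(struct pcpu_rwsem_sketch *brw)
    {
            rcu_read_lock_sched();
            if (likely(rcu_sync_is_idle(&brw->rss)))
                    __this_cpu_inc(*brw->fast_read_ctr); /* no writer: stay lock-free */
            /* else: fall back to down_read(&brw->rw_sem) (not shown) */
            rcu_read_unlock_sched();
    }

    static void sketch_down_write(struct pcpu_rwsem_sketch *brw)
    {
            /*
             * rcu_sync_enter() only waits for a grace period when the state
             * actually changes, so writers no longer pay for
             * synchronize_sched_expedited() twice per down_write()/up_write().
             */
            rcu_sync_enter(&brw->rss);
            down_write(&brw->rw_sem);
            /* ... wait for the remaining fast-path readers to drain (not shown) ... */
    }

    static void sketch_up_write(struct pcpu_rwsem_sketch *brw)
    {
            up_write(&brw->rw_sem);
            rcu_sync_exit(&brw->rss); /* lazily re-enables the reader fast path */
    }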
     

15 Aug, 2015

2 commits


18 Dec, 2012

3 commits

  • Add lockdep annotations. Not only can this help to find potential
    problems; we also do not want false warnings if, say, a task takes two
    different percpu_rw_semaphores for reading. In other words, at least
    ->rw_sem should not use a single lock class.

    This patch exposes this internal lock to lockdep so that it represents the
    whole percpu_rw_semaphore. This way we do not need to add another "fake"
    ->lockdep_map and lock_class_key. More importantly, this also makes the
    output from lockdep much more understandable if it finds a problem.

    In short, with this patch, from lockdep's point of view percpu_down_read()
    and percpu_up_read() acquire/release ->rw_sem for reading, which matches
    the actual semantics. This abuses __up_read(), but I hope this is fine;
    in fact I'd like to have down_read_no_lockdep() as well, since
    percpu_down_read_recursive_readers() will need it.

    Signed-off-by: Oleg Nesterov
    Cc: Anton Arapov
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Mikulas Patocka
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • percpu_rw_semaphore->writer_mutex was only added to simplify the initial
    rewrite; the only thing it protects is clear_fast_ctr(), which otherwise
    could be called by multiple writers. ->rw_sem is enough to serialize the
    writers.

    Kill this mutex and add "atomic_t write_ctr" instead. The writers
    increment/decrement this counter; the readers check that it is zero
    instead of calling mutex_is_locked().

    Move atomic_add(clear_fast_ctr(), slow_read_ctr) under down_write() to
    avoid the race with other writers. This is a bit sub-optimal: only the
    first writer needs this, and we do not need to exclude the readers at this
    stage. But this is simple, and we do not want another internal lock until
    we add more features.

    And this speeds up the write-contended case. Before this patch, racing
    writers sleep in synchronize_sched_expedited() sequentially; with this
    patch, multiple synchronize_sched_expedited() calls can "overlap" with
    each other. Note: we can do more optimizations; this is only the first
    step.

    Signed-off-by: Oleg Nesterov
    Cc: Anton Arapov
    Cc: Ingo Molnar
    Cc: Linus Torvalds
    Cc: Michal Marek
    Cc: Mikulas Patocka
    Cc: "Paul E. McKenney"
    Cc: Peter Zijlstra
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • Currently the writer does msleep() plus synchronize_sched() 3 times to
    acquire/release the semaphore, and during this time the readers are
    blocked completely, even if the "write" section has not actually started
    or has already finished.

    With this patch, down_write/up_write does synchronize_sched() twice, and
    down_read/up_read are still possible during this time; they just use the
    slow path.

    percpu_down_write() first forces the readers to use the rw_semaphore and
    increment the "slow" counter to take the lock for reading; then it
    takes that rw_semaphore for writing and blocks the readers.

    Also, with this patch the code relies on the documented behaviour of
    synchronize_sched(); it doesn't try to pair synchronize_sched() with a
    barrier. (A sketch of the overall scheme described in this list appears
    after it.)

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Paul E. McKenney
    Cc: Linus Torvalds
    Cc: Mikulas Patocka
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Srikar Dronamraju
    Cc: Ananth N Mavinakayanahalli
    Cc: Anton Arapov
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
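
    A sketch of the scheme the three entries above describe: per-CPU fast
    counters gated by write_ctr, a slow counter under ->rw_sem, lockdep
    treating the whole thing as ->rw_sem held for reading, and two
    synchronize_sched() calls per write cycle. This paraphrases the commit
    text; the actual kernel source differs in detail (error handling, the
    up_read() path and the CONFIG_DEBUG_LOCK_ALLOC conditionals are omitted):

    #include <linux/percpu.h>
    #include <linux/rwsem.h>
    #include <linux/rcupdate.h>
    #include <linux/atomic.h>
    #include <linux/wait.h>
    #include <linux/lockdep.h>
    #include <linux/kernel.h>

    struct pcpu_rwsem_2012 {
            unsigned int __percpu   *fast_read_ctr;
            atomic_t                write_ctr;      /* "a writer is pending"; replaces writer_mutex */
            atomic_t                slow_read_ctr;
            struct rw_semaphore     rw_sem;
            wait_queue_head_t       write_waitq;
    };

    /* Fast path: only valid while no writer is pending. */
    static bool update_fast_ctr(struct pcpu_rwsem_2012 *brw, int val)
    {
            bool success = false;

            preempt_disable();
            if (likely(!atomic_read(&brw->write_ctr))) {
                    __this_cpu_add(*brw->fast_read_ctr, val);
                    success = true;
            }
            preempt_enable();
            return success;
    }

    static void sketch_down_read(struct pcpu_rwsem_2012 *brw)
    {
            if (likely(update_fast_ctr(brw, +1))) {
                    /* tell lockdep we hold ->rw_sem for reading; matches the real semantics */
                    rwsem_acquire_read(&brw->rw_sem.dep_map, 0, 0, _RET_IP_);
                    return;
            }
            /* slow path: a writer is pending or active */
            down_read(&brw->rw_sem);
            atomic_inc(&brw->slow_read_ctr);
            __up_read(&brw->rw_sem); /* drop the lock but keep lockdep's "held" state */
    }

    /* Sum and clear the per-CPU fast counters; only called under down_write(). */
    static int clear_fast_ctr(struct pcpu_rwsem_2012 *brw)
    {
            int sum = 0, cpu;

            for_each_possible_cpu(cpu) {
                    sum += per_cpu(*brw->fast_read_ctr, cpu);
                    per_cpu(*brw->fast_read_ctr, cpu) = 0;
            }
            return sum;
    }

    static void sketch_down_write(struct pcpu_rwsem_2012 *brw)
    {
            atomic_inc(&brw->write_ctr);    /* force new readers onto the slow path */
            synchronize_sched();            /* 1st: readers now see write_ctr != 0 */
            down_write(&brw->rw_sem);       /* serialize writers, block the slow readers */
            atomic_add(clear_fast_ctr(brw), &brw->slow_read_ctr);
            wait_event(brw->write_waitq, !atomic_read(&brw->slow_read_ctr));
    }

    static void sketch_up_write(struct pcpu_rwsem_2012 *brw)
    {
            up_write(&brw->rw_sem);         /* new readers may proceed, slow path only */
            synchronize_sched();            /* 2nd: then it is safe to re-enable the fast path */
            atomic_dec(&brw->write_ctr);
    }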
     

28 Nov, 2012

1 commit


29 Oct, 2012

2 commits

  • Use rcu_read_lock_sched / rcu_read_unlock_sched / synchronize_sched
    instead of rcu_read_lock / rcu_read_unlock / synchronize_rcu.

    This is an optimization. The RCU-protected region is very small, so
    there will be no latency problems if we disable preemption in this region.

    So we use rcu_read_lock_sched / rcu_read_unlock_sched, which translate
    to preempt_disable / preempt_enable. This is smaller (and supposedly
    faster) than the preemptible rcu_read_lock / rcu_read_unlock.

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
     
  • This patch introduces a new barrier pair, light_mb() and heavy_mb(), for
    percpu rw semaphores.

    This patch fixes a bug in percpu-rw-semaphores where a barrier was
    missing in percpu_up_write.

    This patch improves performance on the read path of
    percpu-rw-semaphores: on non-x86 CPUs, there was an smp_mb() in
    percpu_up_read. This patch changes it to a compiler barrier and removes
    the "#if defined(X86) ..." condition. (A sketch of the resulting barrier
    pairing appears after this list.)

    From: Lai Jiangshan
    Signed-off-by: Mikulas Patocka
    Signed-off-by: Linus Torvalds

    Mikulas Patocka
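
    A plausible shape of the read path after these two changes. This is
    reconstructed from the descriptions above, not copied from the patches:
    the struct and field names are made up, the slow path and down_write are
    omitted, and heavy_mb() is realized here via synchronize_sched() so that
    it can pair with a mere compiler barrier on the read side:

    #include <linux/percpu.h>
    #include <linux/mutex.h>
    #include <linux/rcupdate.h>
    #include <linux/compiler.h>

    #define light_mb()      barrier()               /* read path: compiler barrier only */
    #define heavy_mb()      synchronize_sched()     /* write path: implies a full barrier on every CPU */

    struct pcpu_rwsem_barriers {
            unsigned int __percpu   *counters;
            bool                    locked;
            struct mutex            mtx;
    };

    static void sketch_down_read(struct pcpu_rwsem_barriers *p)
    {
            rcu_read_lock_sched();  /* i.e. preempt_disable(): small, non-sleeping region */
            if (likely(!p->locked)) {
                    this_cpu_inc(*p->counters);
                    rcu_read_unlock_sched();
                    light_mb();     /* keep the critical section after the "locked" check */
                    return;
            }
            rcu_read_unlock_sched();
            /* writer active: take the slow path under p->mtx (not shown) */
    }

    static void sketch_up_read(struct pcpu_rwsem_barriers *p)
    {
            light_mb();             /* keep the critical section before the decrement */
            this_cpu_dec(*p->counters);
    }

    static void sketch_up_write(struct pcpu_rwsem_barriers *p)
    {
            heavy_mb();     /* the barrier percpu_up_write() was missing: order writes before unlock */
            p->locked = false;
            mutex_unlock(&p->mtx);
    }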
     

26 Sep, 2012

1 commit

  • This avoids cache line bouncing when many processes lock the semaphore
    for read.

    New percpu lock implementation

    The lock consists of an array of percpu unsigned integers, a boolean
    variable and a mutex.

    When we take the lock for read, we enter an RCU read-side section and
    check the "locked" variable. If it is false, we increase a percpu counter
    on the current CPU and exit the RCU section. If "locked" is true, we exit
    the RCU section, take the mutex and drop it (this waits until the writer
    has finished), and retry.

    Unlocking for read just decreases the percpu variable. Note that we can
    unlock on a different CPU than the one where we locked; in this case the
    counter underflows. The sum of all percpu counters represents the number
    of processes that hold the lock for read.

    When we need to lock for write, we take the mutex, set the "locked"
    variable to true, and synchronize RCU. Since RCU has been synchronized,
    no process can create new read locks. We then wait until the sum of the
    percpu counters is zero; when it is, there are no readers in the critical
    section. (A sketch of this scheme appears below.)

    Signed-off-by: Mikulas Patocka
    Signed-off-by: Jens Axboe

    Mikulas Patocka
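
    A sketch of the scheme just described (illustrative only; the names are
    made up and the merged code differs in detail):

    #include <linux/percpu.h>
    #include <linux/mutex.h>
    #include <linux/rcupdate.h>
    #include <linux/delay.h>

    struct pcpu_rwsem_v0 {
            unsigned int __percpu   *counters;      /* sum over CPUs = number of readers */
            bool                    locked;         /* a writer holds (or is taking) the lock */
            struct mutex            mtx;            /* serializes writers; readers wait on it when locked */
    };

    static void sketch_down_read(struct pcpu_rwsem_v0 *p)
    {
            for (;;) {
                    rcu_read_lock();
                    if (likely(!p->locked)) {
                            this_cpu_inc(*p->counters);     /* fast path: no writer */
                            rcu_read_unlock();
                            return;
                    }
                    rcu_read_unlock();
                    mutex_lock(&p->mtx);    /* wait for the writer to finish... */
                    mutex_unlock(&p->mtx);  /* ...then retry the fast path */
            }
    }

    static void sketch_up_read(struct pcpu_rwsem_v0 *p)
    {
            /* may run on a different CPU than the matching down_read; only the sum matters */
            this_cpu_dec(*p->counters);
    }

    static unsigned int sketch_count_readers(struct pcpu_rwsem_v0 *p)
    {
            unsigned int sum = 0;
            int cpu;

            for_each_possible_cpu(cpu)
                    sum += *per_cpu_ptr(p->counters, cpu);
            return sum;
    }

    static void sketch_down_write(struct pcpu_rwsem_v0 *p)
    {
            mutex_lock(&p->mtx);
            p->locked = true;
            synchronize_rcu();      /* after this, no new fast-path readers can appear */
            while (sketch_count_readers(p)) /* wait for the existing readers to drain */
                    msleep(1);
    }

    static void sketch_up_write(struct pcpu_rwsem_v0 *p)
    {
            p->locked = false;
            mutex_unlock(&p->mtx);
    }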