08 May, 2013

1 commit

  • This patch tries to reduce the number of cmpxchg calls in the writer
    failed path by checking the counter value first before issuing the
    instruction. If ->count is not set to RWSEM_WAITING_BIAS then there is
    no point wasting a cmpxchg call.

    Furthermore, Michel states "I suppose it helps due to the case where
    someone else steals the lock while we're trying to acquire
    sem->wait_lock."

    Two very different workloads and machines were used to see how this
    patch improves throughput: pgbench on a quad-core laptop and aim7 on a
    large 8 socket box with 80 cores.

    Some pgbench results on the quad-core laptop, comparing Michel's
    fast-path write lock stealing series (tps-rwsem) against that series
    plus this patch (tps-patch):

    | db_size | clients | tps-rwsem | tps-patch | % diff |
    +---------+---------+-----------+-----------+--------+
    | 160 MB  |       1 |      6906 |      9153 | +32.5% |
    | 160 MB  |       2 |     15931 |     22487 | +41.1% |
    | 160 MB  |       4 |     33021 |     32503 |        |
    | 160 MB  |       8 |     34626 |     34695 |        |
    | 160 MB  |      16 |     33098 |     34003 |        |
    | 160 MB  |      20 |     31343 |     31440 |        |
    | 160 MB  |      30 |     28961 |     28987 |        |
    | 160 MB  |      40 |     26902 |     26970 |        |
    | 160 MB  |      50 |     25760 |     25810 |        |
    +---------+---------+-----------+-----------+--------+
    | 1.6 GB  |       1 |      7729 |      7537 |        |
    | 1.6 GB  |       2 |     19009 |     23508 | +23.7% |
    | 1.6 GB  |       4 |     33185 |     32666 |        |
    | 1.6 GB  |       8 |     34550 |     34318 |        |
    | 1.6 GB  |      16 |     33079 |     32689 |        |
    | 1.6 GB  |      20 |     31494 |     31702 |        |
    | 1.6 GB  |      30 |     28535 |     28755 |        |
    | 1.6 GB  |      40 |     27054 |     27017 |        |
    | 1.6 GB  |      50 |     25591 |     25560 |        |
    +---------+---------+-----------+-----------+--------+
    | 7.6 GB  |       1 |      6224 |      7469 | +20.0% |
    | 7.6 GB  |       2 |     13611 |     12778 |        |
    | 7.6 GB  |       4 |     33108 |     32927 |        |
    | 7.6 GB  |       8 |     34712 |     34878 |        |
    | 7.6 GB  |      16 |     32895 |     33003 |        |
    | 7.6 GB  |      20 |     31689 |     31974 |        |
    | 7.6 GB  |      30 |     29003 |     28806 |        |
    | 7.6 GB  |      40 |     26683 |     26976 |        |
    | 7.6 GB  |      50 |     25925 |     25652 |        |
    +---------+---------+-----------+-----------+--------+

    The aim7 workloads also improved overall on top of Michel's patchset.
    Full graphs showing how the rwsem series plus this patch behaves on a
    large 8-socket machine against a vanilla kernel are available at:

    http://stgolabs.net/rwsem-aim7-results.tar.gz

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     

07 May, 2013

13 commits

  • Change explicit "signed long" declarations into plain "long" as suggested
    by Peter Hurley.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Michel Lespinasse
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This change fixes a race condition where a reader might determine it
    needs to block, but by the time it acquires the wait_lock the rwsem has
    active readers and no queued waiters.

    In this situation the reader can run in parallel with the existing
    active readers; it does not need to block until the active readers
    complete.

    Thanks to Peter Hurley for noticing this possible race.
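
    A deliberately simplified model of the recheck (userspace C; the field
    layout below is hypothetical and not the kernel's packed count encoding):

      #include <pthread.h>
      #include <stdbool.h>

      struct sem_model {
              pthread_mutex_t wait_lock;
              long active_readers;     /* granted read locks */
              bool writer_active;      /* write lock currently held */
              bool waiters_queued;     /* wait list non-empty */
      };

      /*
       * Recheck performed once wait_lock is held: if the rwsem gained
       * active readers and lost all waiters while we were on our way
       * here, a new reader may join them instead of sleeping.
       */
      static bool reader_still_needs_to_block(struct sem_model *sem)
      {
              bool block;

              pthread_mutex_lock(&sem->wait_lock);
              block = sem->waiters_queued || sem->writer_active;
              pthread_mutex_unlock(&sem->wait_lock);
              return block;
      }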

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When we decide to wake up readers, we must first grant them as many read
    locks as necessary, and then actually wake up all these readers. But in
    order to know how many read shares to grant, we must first count the
    readers at the head of the queue. This might take a while if there are
    many readers, and we want to be protected against a writer stealing the
    lock while we're counting. To that end, we grant the first reader lock
    before counting how many more readers are queued.

    We also require some adjustments to the wake_type semantics.

    RWSEM_WAKE_NO_ACTIVE used to mean that we had found the count to be
    RWSEM_WAITING_BIAS, in which case the rwsem was known to be free as
    nobody could steal it while we hold the wait_lock. This doesn't make
    sense once we implement fastpath write lock stealing, so we now use
    RWSEM_WAKE_ANY in that case.

    Similarly, when rwsem_down_write_failed found that a read lock was
    active, it would use RWSEM_WAKE_READ_OWNED which signalled that new
    readers could be woken without checking first that the rwsem was
    available. We can't do that anymore since the existing readers might
    release their read locks, and a writer could steal the lock before we
    wake up additional readers. So, we have to use a new RWSEM_WAKE_READERS
    value to indicate we only want to wake readers, but we don't currently
    hold any read lock.
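
    For illustration, the resulting wake types can be sketched as an
    enumeration like the following (names are taken from the description
    above; the exact definition in the source may differ):

      /* Sketch only -- see __rwsem_do_wake() for the real thing. */
      enum rwsem_wake_type {
              RWSEM_WAKE_ANY,         /* wake whatever is at the head of the
                                         queue */
              RWSEM_WAKE_READERS,     /* wake readers only; we hold no read
                                         lock, so a writer may still steal
                                         the rwsem */
              RWSEM_WAKE_READ_OWNED,  /* we already hold a read lock, so
                                         queued readers can be woken without
                                         rechecking availability */
      };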

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This is mostly for cleanup value:

    - We don't need several gotos to handle the case where the first
    waiter is a writer. Two simple tests will do (and generate very
    similar code).

    - In the remainder of the function, we know the first waiter is a reader,
    so we don't have to double check that. We can use do..while loops
    to iterate over the readers to wake (generates slightly better code).

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We can skip the initial trylock in rwsem_down_write_failed() if there
    are known active lockers already, thus saving one likely-to-fail
    cmpxchg.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In rwsem_down_write_failed(), if there are active locks after we wake up
    (i.e. the lock got stolen from us), skip taking the wait_lock and go
    back to sleep immediately.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Using rwsem_atomic_update to try stealing the write lock forced us to
    undo the adjustment in the failure path. We can have simpler and faster
    code by using cmpxchg instead.
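
    The difference can be sketched roughly as follows (userspace C with
    made-up constants; in the kernel the stealing attempt runs under
    sem->wait_lock):

      #include <stdatomic.h>
      #include <stdbool.h>

      #define RWSEM_WAITING_BIAS       (-65536L)   /* illustrative value */
      #define RWSEM_ACTIVE_WRITE_BIAS  (RWSEM_WAITING_BIAS + 1L)

      /* Old shape: unconditionally adjust the count, then undo on failure. */
      static bool steal_with_add_and_undo(atomic_long *count)
      {
              long old = atomic_fetch_add(count, RWSEM_ACTIVE_WRITE_BIAS);

              if (old == RWSEM_WAITING_BIAS)
                      return true;                  /* write lock acquired */
              atomic_fetch_sub(count, RWSEM_ACTIVE_WRITE_BIAS);   /* undo */
              return false;
      }

      /* New shape: a single cmpxchg, with nothing to undo when it fails. */
      static bool steal_with_cmpxchg(atomic_long *count)
      {
              long expected = RWSEM_WAITING_BIAS;

              return atomic_compare_exchange_strong(count, &expected,
                                                    RWSEM_ACTIVE_WRITE_BIAS);
      }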

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Some small code simplifications can be achieved by doing more
    aggressive lock stealing:

    - When rwsem_down_write_failed() notices that there are no active locks
    (and thus no thread to wake us if we decided to sleep), it used to wake
    the first queued process. However, stealing the lock is also sufficient
    to deal with this case, so we don't need this check anymore.

    - In try_get_writer_sem(), we can steal the lock even when the first waiter
    is a reader. This is correct because the code path that wakes readers is
    protected by the wait_lock. As to the performance effects of this change,
    they are expected to be minimal: readers are still granted the lock
    (rather than having to acquire it themselves) when they reach the front
    of the wait queue, so we have essentially the same behavior as in
    rwsem-spinlock.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When waking writers, we never grant them the lock - instead, they have
    to acquire it themselves when they run, and remove themselves from the
    wait_list when they succeed.

    As a result, we can do a few simplifications in rwsem_down_write_failed():

    - We don't need to check for !waiter.task since __rwsem_do_wake() doesn't
    remove writers from the wait_list

    - There is no point releasing the wait_lock before entering the wait loop,
    as we will need to reacquire it immediately. We can change the loop so
    that the lock is always held at the start of each loop iteration (see
    the sketch after this list).

    - We don't need to get a reference on the task structure, since the task
    is responsible for removing itself from the wait_list. There is no risk,
    like in the rwsem_down_read_failed() case, that a task would wake up and
    exit (thus destroying its task structure) while __rwsem_do_wake() is
    still running - wait_lock protects against that.
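
    The resulting loop has roughly this shape (a userspace sketch;
    try_write_lock() and sleep_until_woken() are placeholder stand-ins,
    not kernel functions):

      #include <pthread.h>
      #include <stdbool.h>

      struct sem_sketch {
              pthread_mutex_t wait_lock;   /* stand-in for sem->wait_lock */
              /* count, wait_list, ... omitted */
      };

      /* Trivial placeholders so the sketch is self-contained. */
      static bool try_write_lock(struct sem_sketch *sem) { (void)sem; return true; }
      static void sleep_until_woken(void) { }   /* stand-in for schedule() */

      /*
       * wait_lock is held at the top of every iteration, no task reference
       * is taken, and the waiter removes itself from the wait list once it
       * finally acquires the write lock.
       */
      static void down_write_failed_sketch(struct sem_sketch *sem)
      {
              pthread_mutex_lock(&sem->wait_lock);
              /* ... add ourselves to the wait list ... */

              while (!try_write_lock(sem)) {
                      pthread_mutex_unlock(&sem->wait_lock);
                      sleep_until_woken();
                      pthread_mutex_lock(&sem->wait_lock);
              }

              /* ... remove ourselves from the wait list ... */
              pthread_mutex_unlock(&sem->wait_lock);
      }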

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When trying to acquire a read lock, the RWSEM_ACTIVE_READ_BIAS
    adjustment doesn't cause other readers to block, so we never have to
    worry about waking them back after canceling this adjustment in
    rwsem_down_read_failed().

    We also never want to steal the lock in rwsem_down_read_failed(), so we
    don't have to grab the wait_lock either.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Remove the rwsem_down_failed_common function and replace it with two
    identical copies of its code in rwsem_down_{read,write}_failed.

    This is because we want to make different optimizations in
    rwsem_down_{read,write}_failed; we are adding this pure-duplication
    step as a separate commit in order to make it easier to check the
    following steps.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change reduces the size of the spinlocked and TASK_UNINTERRUPTIBLE
    sections in rwsem_down_failed_common():

    - We only need the sem->wait_lock to insert ourselves on the wait_list;
    the waiter node can be prepared outside of the wait_lock.

    - The task state only needs to be set to TASK_UNINTERRUPTIBLE immediately
    before checking if we actually need to sleep; it doesn't need to protect
    the entire function.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We are not planning to add any new waiter flags, so we can convert the
    waiter type into an enumeration.

    Background: David Howells suggested I do this back when I tried adding
    a new waiter type for unfair readers. However, I believe the cleanup
    applies regardless of that use case.
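
    A sketch of the cleanup (illustrative; the real waiter struct also has
    list linkage and a task pointer):

      /*
       * The waiter used to carry bitmask flags (one bit for "waiting for
       * read", one for "waiting for write") even though exactly one of
       * them is ever set.  An enumeration states that property directly.
       */
      enum rwsem_waiter_type {
              RWSEM_WAITING_FOR_WRITE,
              RWSEM_WAITING_FOR_READ,
      };

      struct rwsem_waiter_sketch {
              /* struct list_head list; struct task_struct *task; ... */
              enum rwsem_waiter_type type;
      };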

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

19 Feb, 2013

1 commit

  • Commit 5a505085f043 ("mm/rmap: Convert the struct anon_vma::mutex
    to an rwsem") changed struct anon_vma::mutex to an rwsem, which
    caused aim7 fork_test performance to drop by 50%.

    Yuanhan Liu did the following excellent analysis:

    https://lkml.org/lkml/2013/1/29/84

    and found that the regression is caused by strict, serialized,
    FIFO sequential write-ownership of rwsems. Ingo suggested
    implementing opportunistic lock-stealing for the front writer
    task in the waitqueue.

    Yuanhan Liu implemented lock-stealing for spinlock-rwsems,
    which indeed recovered much of the regression - confirming
    the analysis that the main factor in the regression was the
    FIFO writer-fairness of rwsems.

    In this patch we allow lock-stealing to happen when the first
    waiter is also a writer. With that change in place the
    aim7 fork_test performance is fully recovered on my
    Intel NHM EP, NHM EX, SNB EP 2S and 4S test-machines.

    Reported-by: lkp@linux.intel.com
    Reported-by: Yuanhan Liu
    Signed-off-by: Alex Shi
    Cc: David Howells
    Cc: Michel Lespinasse
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Anton Blanchard
    Cc: Arjan van de Ven
    Cc: paul.gortmaker@windriver.com
    Link: https://lkml.org/lkml/2013/1/29/84
    Link: http://lkml.kernel.org/r/1360069915-31619-1-git-send-email-alex.shi@intel.com
    [ Small stylistic fixes, updated changelog. ]
    Signed-off-by: Ingo Molnar

    Alex Shi
     

08 Mar, 2012

1 commit


13 Sep, 2011

1 commit

  • There is no reason to allow the lock protecting rwsems (the
    ownerless variant) to be preemptible on -rt. Convert it to raw.

    In mainline this change documents the low level nature of
    the lock - otherwise there's no functional difference. Lockdep
    and Sparse checking will work as usual.

    Signed-off-by: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     

27 Jan, 2011

1 commit

  • Peter Zijlstra pointed out, that the only user of asmregparm (x86) is
    compiling the kernel already with -mregparm=3. So the annotation of
    the rwsem functions is redundant. Remove it.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: David Howells
    Cc: Benjamin Herrenschmidt
    Cc: Matt Turner
    Cc: Tony Luck
    Cc: Heiko Carstens
    Cc: Paul Mundt
    Cc: David Miller
    Cc: Chris Zankel
    LKML-Reference:
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

10 Aug, 2010

5 commits

  • More code can be pushed from rwsem_down_read_failed and
    rwsem_down_write_failed into rwsem_down_failed_common.

    The following change, which adds down_read_critical infrastructure
    support, also benefits from having the flags available in a register
    rather than having to fish them out of the struct rwsem_waiter...

    Signed-off-by: Michel Lespinasse
    Acked-by: David Howells
    Cc: Mike Waychison
    Cc: Suleiman Souhlal
    Cc: Ying Han
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change addresses the following situation:

    - Thread A acquires the rwsem for read
    - Thread B tries to acquire the rwsem for write, notices there is already
    an active owner for the rwsem.
    - Thread C tries to acquire the rwsem for read, notices that thread B already
    tried to acquire it.
    - Thread C grabs the spinlock and queues itself on the wait queue.
    - Thread B grabs the spinlock and queues itself behind C. At this point A is
    the only remaining active owner on the rwsem.

    In this situation thread B could notice that the only remaining active
    locker on the rwsem (thread A) holds it for read, and decide to wake C to
    let it proceed in parallel with A since they both only want the rwsem for
    read.

    Signed-off-by: Michel Lespinasse
    Acked-by: David Howells
    Cc: Mike Waychison
    Cc: Suleiman Souhlal
    Cc: Ying Han
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Previously each waiting thread added a bias of RWSEM_WAITING_BIAS. With
    this change, the bias is added only once to indicate that the wait list is
    non-empty.

    This has a few nice properties which will be used in following changes
    (see the sketch below). When the spinlock is held and the waiter list is
    known to be non-empty:
    - count < RWSEM_WAITING_BIAS means there is an active writer on that sem
    - count == RWSEM_WAITING_BIAS means there are waiting threads and no
    active readers/writers on that sem
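
    Expressed as code, the two properties read roughly as follows (a sketch
    that assumes the spinlock is held and the wait list is known to be
    non-empty; the bias value is illustrative):

      #include <stdbool.h>

      #define RWSEM_WAITING_BIAS  (-65536L)   /* illustrative value */

      /* Only meaningful while wait_lock is held and waiters are queued. */
      static bool sem_has_active_writer(long count)
      {
              return count < RWSEM_WAITING_BIAS;
      }

      static bool sem_free_but_has_waiters(long count)
      {
              return count == RWSEM_WAITING_BIAS;
      }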

    Signed-off-by: Michel Lespinasse
    Acked-by: David Howells
    Cc: Mike Waychison
    Cc: Suleiman Souhlal
    Cc: Ying Han
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In __rwsem_do_wake(), we can skip the active count check unless we come
    there from up_xxxx(). Also when checking the active count, it is not
    actually necessary to increment it; this allows us to get rid of the read
    side undo code and simplify the calculation of the final rwsem count
    adjustment once we've counted the reader threads to wake.

    The basic observation is the following. When there are waiter threads on
    a rwsem and the spinlock is held, other threads can only increment the
    active count by trying to grab the rwsem in down_xxxx(). However
    down_xxxx() will notice there are waiter threads and take the down_failed
    path, blocking to acquire the spinlock on the way there. Therefore, a
    thread observing an active count of zero with waiters queued and the
    spinlock held is protected against other threads acquiring the rwsem
    until it wakes the last waiter or releases the spinlock.

    Signed-off-by: Michel Lespinasse
    Acked-by: David Howells
    Cc: Mike Waychison
    Cc: Suleiman Souhlal
    Cc: Ying Han
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This is in preparation for later changes in the series.

    In __rwsem_do_wake(), the first queued waiter is checked first in order to
    determine whether it's a writer or a reader. The code paths diverge at
    this point. The code that checks and increments the rwsem active count is
    duplicated on both sides - the point is that later changes in the series
    will be able to independently modify both sides.

    Signed-off-by: Michel Lespinasse
    Acked-by: David Howells
    Cc: Mike Waychison
    Cc: Suleiman Souhlal
    Cc: Ying Han
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

13 May, 2010

1 commit

  • If there are no active threads using a semaphore, it is always correct
    to unqueue blocked threads. This seems to be what was intended in the
    undo code.

    What was done instead was to look for a sem count of zero - this is an
    impossible situation, given that at least one thread is known to be
    queued on the semaphore. The code might be correct as written, but it's
    hard to reason about and it's not what was intended (otherwise the goto
    out would have been unconditional).

    Go for checking the active count - the alternative is not worth the
    headache.

    Signed-off-by: Michel Lespinasse
    Signed-off-by: David Howells
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

30 Jan, 2008

1 commit

  • Introduce the "asmregparm" calling convention for functions
    implemented in assembly with a fixed regparm input-parameter
    calling convention.

    Mark the semaphore and rwsem slowpath functions with it.

    Signed-off-by: Ingo Molnar
    Signed-off-by: Miklos Szeredi
    Signed-off-by: Thomas Gleixner

    Ingo Molnar
     

18 Dec, 2007

1 commit

  • The following commit

    http://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commitdiff;h=fdf8cb0909b531f9ae8f9b9d7e4eb35ba3505f07

    un-inlined a low-level rwsem function, but did not mark it as __sched.
    The result is that it now shows up as thread wchan (which also affects
    /proc/profile stats). The following simple patch fixes this by properly
    marking rwsem_down_failed_common() as a __sched function.

    This patch also marks down_read() and down_write() proper as __sched,
    which is up for discussion. For profiling, it is pretty much useless to
    know that a semaphore is being held - it is necessary to know _which_
    one. By going up another frame on the stack, the information becomes
    much more useful.

    In summary, the below change to lib/rwsem.c should be applied; the
    changes to kernel/rwsem.c could be applied if other kernel hackers agree
    with my proposal that down_read()/down_write() in the profile is not
    enough.

    [ akpm@linux-foundation.org: build fix ]

    Signed-off-by: Livio Soares
    Signed-off-by: Andrew Morton
    Signed-off-by: Ingo Molnar

    Livio Soares
     

11 Oct, 2006

1 commit


30 Sep, 2006

1 commit

  • Un-inlining rwsem_down_failed_common() (two callsites) reduced lib/rwsem.o
    on my Athlon (gcc 4.1.2) from 5935 to 5480 bytes (455 bytes saved).

    I thus guess that reduced icache footprint (and better function caching) is
    worth more than any function call overhead.

    Signed-off-by: Andreas Mohr
    Cc: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andreas Mohr
     

04 Jul, 2006

2 commits


01 May, 2005

1 commit


17 Apr, 2005

1 commit

  • Initial git repository build. I'm not bothering with the full history,
    even though we have it. We can create a separate "historical" git
    archive of that later if we want to, and in the meantime it's about
    3.2GB when imported into git - space that would just make the early
    git days unnecessarily complicated, when we don't have a lot of good
    infrastructure for it.

    Let it rip!

    Linus Torvalds