31 May, 2019

1 commit

  • Based on 3 normalized pattern(s):

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version this program is distributed in the
    hope that it will be useful but without any warranty without even
    the implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [kishon] [vijay] [abraham]
    [i] [kishon]@[ti] [com] this program is distributed in the hope that
    it will be useful but without any warranty without even the implied
    warranty of merchantability or fitness for a particular purpose see
    the gnu general public license for more details

    this program is free software you can redistribute it and or modify
    it under the terms of the gnu general public license as published by
    the free software foundation either version 2 of the license or at
    your option any later version [author] [graeme] [gregory]
    [gg]@[slimlogic] [co] [uk] [author] [kishon] [vijay] [abraham] [i]
    [kishon]@[ti] [com] [based] [on] [twl6030]_[usb] [c] [author] [hema]
    [hk] [hemahk]@[ti] [com] this program is distributed in the hope
    that it will be useful but without any warranty without even the
    implied warranty of merchantability or fitness for a particular
    purpose see the gnu general public license for more details

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-or-later

    has been chosen to replace the boilerplate/reference in 1105 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Richard Fontana
    Reviewed-by: Kate Stewart
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190527070033.202006027@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

10 Apr, 2019

1 commit

  • The percpu event counts used by qspinlock code can be useful for
    other locking code as well. So a new set of lockevent_* counting APIs
    is introduced, with the lock event names extracted out into the new
    lock_events_list.h header file to make future additions easier.

    The existing qstat_inc() calls are replaced by either lockevent_inc() or
    lockevent_cond_inc() calls.

    The qstat_hop() call is renamed to lockevent_pv_hop(). The "reset_counters"
    debugfs file is also renamed to ".reset_counts".
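
    A minimal sketch of what such per-CPU event counters look like
    (illustrative only; the enum terminator and exact definitions below
    are assumptions, not the kernel's verbatim code):

    DECLARE_PER_CPU(unsigned long, lockevents[LOCKEVENT_num]);

    #define lockevent_inc(ev)       this_cpu_inc(lockevents[LOCKEVENT_ ## ev])

    #define lockevent_cond_inc(ev, cond)                    \
            do {                                            \
                    if (cond)                               \
                            lockevent_inc(ev);              \
            } while (0)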

    Signed-off-by: Waiman Long
    Acked-by: Peter Zijlstra
    Acked-by: Davidlohr Bueso
    Cc: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20190404174320.22416-8-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

28 Feb, 2019

1 commit

  • With the > 4 nesting levels case handled by the commit:

    d682b596d993 ("locking/qspinlock: Handle > 4 slowpath nesting levels")

    the BUG_ON() call in encode_tail() will never actually be triggered.

    Remove it.
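
    For reference, a sketch of the resulting helper (illustrative, not the
    exact source): callers already guarantee idx < 4, so no runtime check
    is needed.

    static inline __pure u32 encode_tail(int cpu, int idx)
    {
            u32 tail;

            tail  = (cpu + 1) << _Q_TAIL_CPU_OFFSET;
            tail |= idx << _Q_TAIL_IDX_OFFSET;      /* assume idx < 4 */

            return tail;
    }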

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: https://lkml.kernel.org/r/1551057253-3231-1-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

04 Feb, 2019

2 commits

  • Track the number of slowpath locking operations that are being done
    without any MCS node available, as well as renaming lock_index[123] to
    make them more descriptive.

    Using these stat counters is one way to find out if a code path is
    being exercised.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: James Morse
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: SRINIVAS
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: Zhenzhong Duan
    Link: https://lkml.kernel.org/r/1548798828-16156-3-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Four queue nodes per CPU are allocated to enable up to 4 nesting levels
    using the per-CPU nodes. Nested NMIs are possible in some architectures.
    Still it is very unlikely that we will ever hit more than 4 nested
    levels with contention in the slowpath.

    When that rare condition happens, however, it is likely that the system
    will hang or crash shortly after that. It is not good and we need to
    handle this exception case.

    This is done by spinning directly on the lock using repeated trylock.
    This alternative code path should only be used when there are nested
    NMIs. Assuming that the locks used by those NMI handlers will not be
    heavily contended, a simple TAS locking should work out.
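
    A minimal sketch of that fallback (illustrative, not the exact patch):

    /*
     * All 4 per-CPU queue nodes are in use (e.g. deep NMI nesting):
     * don't queue at all, just spin on the lock word with repeated
     * trylock attempts until the lock is acquired.
     */
    if (unlikely(idx >= MAX_NODES)) {
            while (!queued_spin_trylock(lock))
                    cpu_relax();
            goto release;
    }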

    Suggested-by: Peter Zijlstra
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: James Morse
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: SRINIVAS
    Cc: Thomas Gleixner
    Cc: Zhenzhong Duan
    Link: https://lkml.kernel.org/r/1548798828-16156-2-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

17 Oct, 2018

2 commits

  • The qspinlock code supports up to 4 levels of slowpath nesting using
    four per-CPU mcs_spinlock structures. For 64-bit architectures, they
    fit nicely in one 64-byte cacheline.

    For para-virtualized (PV) qspinlocks it needs to store more information
    in the per-CPU node structure than there is space for. It uses a trick
    to use a second cacheline to hold the extra information that it needs.
    So PV qspinlock needs to access two extra cachelines for its information
    whereas the native qspinlock code only needs one extra cacheline.

    Freshly added counter profiling of the qspinlock code, however, revealed
    that it was very rare to use more than two levels of slowpath nesting.
    So it doesn't make sense to penalize PV qspinlock code in order to have
    four mcs_spinlock structures in the same cacheline to optimize for a case
    in the native qspinlock code that rarely happens.

    Extend the per-CPU node structure with two more long words, used only
    when PV qspinlocks are configured, to hold the extra data that the PV
    code needs.

    As a result, the PV qspinlock code will enjoy the same benefit of using
    just one extra cacheline like the native counterpart, for most cases.
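
    A sketch of the resulting per-CPU node layout (illustrative; names may
    differ from the kernel's):

    struct qnode {
            struct mcs_spinlock mcs;
    #ifdef CONFIG_PARAVIRT_SPINLOCKS
            long reserved[2];       /* extra state used by the PV code */
    #endif
    };

    static DEFINE_PER_CPU_ALIGNED(struct qnode, qnodes[MAX_NODES]);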

    [ mingo: Minor changelog edits. ]

    Signed-off-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1539697507-28084-2-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Queued spinlock supports up to 4 levels of lock slowpath nesting -
    user context, soft IRQ, hard IRQ and NMI. However, we are not sure how
    often the nesting happens.

    So add 3 more per-CPU stat counters to track the number of instances where
    nesting index goes to 1, 2 and 3 respectively.

    On a dual-socket 64-core 128-thread Zen server, the following were the
    new stat counter values under different circumstances:

    State                              slowpath      index1   index2   index3
    -----                              --------      ------   ------   ------
    After bootup                         1,012,150       82        0        0
    After parallel build + perf-top    125,195,009       82        0        0

    So the chance of having more than 2 levels of nesting is extremely low.

    [ mingo: Minor changelog edits. ]

    Signed-off-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1539697507-28084-1-git-send-email-longman@redhat.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

16 Oct, 2018

3 commits

  • On x86 we cannot do fetch_or() with a single instruction and thus end up
    using a cmpxchg() loop, which reduces determinism. Replace the fetch_or()
    with a composite operation: tas-pending + load.

    Using two instructions of course opens a window we previously did not
    have. Consider the scenario:

    CPU0                     CPU1                       CPU2

    1) lock
       trylock -> (0,0,1)

    2)                       lock
                             trylock /* fail */

    3) unlock -> (0,0,0)

    4)                                                  lock
                                                        trylock -> (0,0,1)

    5)                       tas-pending -> (0,1,1)
                             load-val (0,0,1)

    FAIL: _2_ owners

    where 5) is our new composite operation. When we consider each part of
    the qspinlock state as a separate variable (as we can when
    _Q_PENDING_BITS == 8) then the above is entirely possible, because
    tas-pending will only RmW the pending byte, so the later load is able
    to observe prior tail and lock state (but not earlier than its own
    trylock, which operates on the whole word, due to coherence).

    To avoid this we need 2 things:

    - the load must come after the tas-pending (obviously, otherwise it
    can trivially observe prior state).

    - the tas-pending must be a full word RmW instruction, it cannot be an XCHGB for
    example, such that we cannot observe other state prior to setting
    pending.

    On x86 we can realize this by using "LOCK BTS m32, r32" for
    tas-pending followed by a regular load.

    Note that observing later state is not a problem:

    - if we fail to observe a later unlock, we'll simply spin-wait for
    that store to become visible.

    - if we observe a later xchg_tail(), there is no difference from that
    xchg_tail() having taken place before the tas-pending.
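
    A rough sketch of the composite operation (illustrative only; the
    generic fallback keeps the single atomic_fetch_or_acquire(), and the
    kernel's real x86 override uses its own asm helpers rather than this
    hand-written asm):

    static __always_inline u32 fetch_set_pending_acquire(struct qspinlock *lock)
    {
            u32 val;
            bool was_pending;

            /* Full-word RmW that sets only the pending bit (LOCK BTS). */
            asm volatile("lock; btsl %2, %0"
                         : "+m" (lock->val.counter), "=@ccc" (was_pending)
                         : "Ir" (_Q_PENDING_OFFSET)
                         : "memory");

            /* Plain load: may observe later state, never earlier than the BTS. */
            val  = atomic_read(&lock->val) & ~_Q_PENDING_MASK;
            val |= was_pending ? _Q_PENDING_VAL : 0;

            return val;
    }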

    Suggested-by: Will Deacon
    Reported-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: andrea.parri@amarulasolutions.com
    Cc: longman@redhat.com
    Fixes: 59fb586b4a07 ("locking/qspinlock: Remove unbounded cmpxchg() loop from locking slowpath")
    Link: https://lkml.kernel.org/r/20181003130957.183726335@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While working my way through the code again, I felt the comments could
    use help.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andrea.parri@amarulasolutions.com
    Cc: longman@redhat.com
    Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Flip the branch condition after atomic_fetch_or_acquire(_Q_PENDING_VAL)
    such that we lose the indent. This also results in a more natural code
    flow IMO.

    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Will Deacon
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andrea.parri@amarulasolutions.com
    Cc: longman@redhat.com
    Link: https://lkml.kernel.org/r/20181003130257.156322446@infradead.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

27 Apr, 2018

9 commits

  • Currently, the qspinlock_stat code tracks only statistical counts in the
    PV qspinlock code. However, it may also be useful to track the number
    of locking operations done via the pending code vs. the MCS lock queue
    slowpath for the non-PV case.

    The qspinlock stat code is modified to do that. The stat counter
    pv_lock_slowpath is renamed to lock_slowpath so that it can be used by
    both the PV and non-PV cases.

    Signed-off-by: Waiman Long
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Cc: will.deacon@arm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-14-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When reaching the head of an uncontended queue on the qspinlock slow-path,
    using a try_cmpxchg() instead of a cmpxchg() operation to transition the
    lock word to _Q_LOCKED_VAL generates slightly better code for x86 and
    pretty much identical code for arm64.
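
    Roughly (illustrative, not the verbatim diff):

    /* Before: compare the returned value by hand. */
    if (atomic_cmpxchg_relaxed(&lock->val, val, _Q_LOCKED_VAL) == val)
            goto release;   /* No contention */

    /* After: try_cmpxchg() folds the comparison and refreshes 'val' on failure. */
    if (atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
            goto release;   /* No contention */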

    Reported-by: Peter Zijlstra
    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-13-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • The qspinlock slowpath must ensure that the MCS node is fully initialised
    before it can be reached by another CPU. This is currently enforced
    by using a RELEASE operation when updating the tail and also when linking
    the node into the waitqueue, since the control dependency off xchg_tail
    is insufficient to enforce the required ordering, see:

    95bcade33a8a ("locking/qspinlock: Ensure node is initialised before updating prev->next")

    Back-to-back RELEASE operations may be expensive on some architectures,
    particularly those that implement them using fences under the hood. We
    can replace the two RELEASE operations with a single smp_wmb() fence and
    use RELAXED operations for the subsequent publishing of the node.
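
    A sketch of the resulting publish sequence (illustrative):

    node->locked = 0;
    node->next = NULL;
    pv_init_node(node);

    /* One fence orders the node initialisation against both publishes. */
    smp_wmb();

    old = xchg_tail(lock, tail);            /* RELAXED is now sufficient */

    if (old & _Q_TAIL_MASK) {
            prev = decode_tail(old);
            /* RELAXED publish; ordering comes from the smp_wmb() above. */
            WRITE_ONCE(prev->next, node);
    }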

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-12-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • When a locker reaches the head of the queue and takes the lock, a
    concurrent locker may enqueue and force the lock holder to spin
    whilst its node->next field is initialised. Rather than open-code
    a READ_ONCE/cpu_relax() loop, this can be implemented using
    smp_cond_load_relaxed() instead.
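
    Roughly (illustrative):

    /* Before: open-coded busy wait on node->next. */
    while (!(next = READ_ONCE(node->next)))
            cpu_relax();

    /* After: let the architecture wait for the store efficiently. */
    next = smp_cond_load_relaxed(&node->next, (VAL));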

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-10-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • Rather than dig into the counter field of the atomic_t inside the
    qspinlock structure so that we can call smp_cond_load_acquire(), use
    atomic_cond_read_acquire() instead, which operates on the atomic_t
    directly.
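
    Roughly (illustrative):

    /* Before: reach into the atomic_t's internals. */
    val = smp_cond_load_acquire(&lock->val.counter,
                                !(VAL & _Q_LOCKED_PENDING_MASK));

    /* After: operate on the atomic_t directly. */
    val = atomic_cond_read_acquire(&lock->val,
                                   !(VAL & _Q_LOCKED_PENDING_MASK));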

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-8-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • When a queued locker reaches the head of the queue, it claims the lock
    by setting _Q_LOCKED_VAL in the lockword. If there isn't contention, it
    must also clear the tail as part of this operation so that subsequent
    lockers can avoid taking the slowpath altogether.

    Currently this is expressed as a cmpxchg() loop that practically only
    runs up to two iterations. This is confusing to the reader and unhelpful
    to the compiler. Rewrite the cmpxchg() loop without the loop, so that a
    failed cmpxchg() implies that there is contention and we just need to
    write to _Q_LOCKED_VAL without considering the rest of the lockword.
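
    A sketch of the rewritten claim (illustrative, not the exact patch):

    /* One attempt at (n,0,0) -> (0,0,1); 'tail' encodes our own node. */
    if ((val & _Q_TAIL_MASK) == tail &&
        atomic_try_cmpxchg_relaxed(&lock->val, &val, _Q_LOCKED_VAL))
            goto release;   /* No contention */

    /* Either somebody is queued behind us or pending is set. */
    set_locked(lock);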

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-7-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • The qspinlock locking slowpath utilises a "pending" bit as a simple form
    of an embedded test-and-set lock that can avoid the overhead of explicit
    queuing in cases where the lock is held but uncontended. This bit is
    managed using a cmpxchg() loop which tries to transition the uncontended
    lock word from (0,0,0) -> (0,0,1) or (0,0,1) -> (0,1,1).

    Unfortunately, the cmpxchg() loop is unbounded and lockers can be starved
    indefinitely if the lock word is seen to oscillate between unlocked
    (0,0,0) and locked (0,0,1). This could happen if concurrent lockers are
    able to take the lock in the cmpxchg() loop without queuing and pass it
    around amongst themselves.

    This patch fixes the problem by unconditionally setting _Q_PENDING_VAL
    using atomic_fetch_or, and then inspecting the old value to see whether
    we need to spin on the current lock owner, or whether we now effectively
    hold the lock. The tricky scenario is when concurrent lockers end up
    queuing on the lock and the lock becomes available, causing us to see
    a lockword of (n,0,0). With pending now set, simply queuing could lead
    to deadlock as the head of the queue may not have observed the pending
    flag being cleared. Conversely, if the head of the queue did observe
    pending being cleared, then it could transition the lock from (n,0,0) ->
    (0,0,1) meaning that any attempt to "undo" our setting of the pending
    bit could race with a concurrent locker trying to set it.

    We handle this race by preserving the pending bit when taking the lock
    after reaching the head of the queue and leaving the tail entry intact
    if we saw pending set, because we know that the tail is going to be
    updated shortly.
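
    A condensed sketch of the resulting pending path (illustrative, not
    the exact patch):

    /*
     * 0,0,0 -> 0,0,1 ; trylock
     * 0,0,1 -> 0,1,1 ; pending
     */
    val = atomic_fetch_or_acquire(_Q_PENDING_VAL, &lock->val);

    /*
     * Any contention: queue. Undo the pending bit only if it was clear
     * before, i.e. it is ours to undo.
     */
    if (unlikely(val & ~_Q_LOCKED_MASK)) {
            if (!(val & _Q_PENDING_MASK))
                    clear_pending(lock);
            goto queue;
    }

    /* We own pending: wait for the owner to go away, then take the lock. */
    if (val & _Q_LOCKED_MASK)
            smp_cond_load_acquire(&lock->val.counter, !(VAL & _Q_LOCKED_MASK));

    clear_pending_set_locked(lock);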

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-6-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • If a locker taking the qspinlock slowpath reads a lock value indicating
    that only the pending bit is set, then it will spin whilst the
    concurrent pending->locked transition takes effect.

    Unfortunately, there is no guarantee that such a transition will ever be
    observed since concurrent lockers could continuously set pending and
    hand over the lock amongst themselves, leading to starvation. Whilst
    this would probably resolve in practice, it means that it is not
    possible to prove liveness properties about the lock and means that lock
    acquisition time is unbounded.

    Rather than removing the pending->locked spinning from the slowpath
    altogether (which has been shown to heavily penalise a 2-threaded
    locking stress test on x86), this patch replaces the explicit spinning
    with a call to atomic_cond_read_relaxed and allows the architecture to
    provide a bound on the number of spins. For architectures that can
    respond to changes in cacheline state in their smp_cond_load implementation,
    it should be sufficient to use the default bound of 1.
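
    A sketch of the bounded wait (illustrative):

    if (val == _Q_PENDING_VAL) {
            int cnt = _Q_PENDING_LOOPS;

            /* Wait for the in-flight pending->locked hand-over, but only
             * for a bounded number of iterations before queuing. */
            val = atomic_cond_read_relaxed(&lock->val,
                                           (VAL != _Q_PENDING_VAL) || !cnt--);
    }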

    Suggested-by: Waiman Long
    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: boqun.feng@gmail.com
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-4-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • 'struct __qspinlock' provides a handy union of fields so that
    subcomponents of the lockword can be accessed by name, without having to
    manage shifts and masks explicitly and take endianness into account.

    This is useful in qspinlock.h and also potentially in arch headers, so
    move the 'struct __qspinlock' into 'struct qspinlock' and kill the extra
    definition.
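
    The merged layout, sketched for little-endian (big-endian reverses the
    field order; illustrative, not the verbatim header):

    typedef struct qspinlock {
            union {
                    atomic_t val;

                    struct {
                            u8      locked;
                            u8      pending;
                    };
                    struct {
                            u16     locked_pending;
                            u16     tail;
                    };
            };
    } arch_spinlock_t;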

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Acked-by: Waiman Long
    Acked-by: Boqun Feng
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Cc: linux-arm-kernel@lists.infradead.org
    Cc: paulmck@linux.vnet.ibm.com
    Link: http://lkml.kernel.org/r/1524738868-31318-3-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

13 Feb, 2018

2 commits

  • When queuing on the qspinlock, the count field for the current CPU's head
    node is incremented. This needn't be atomic because locking in e.g. IRQ
    context is balanced and so an IRQ will return with node->count as it
    found it.

    However, the compiler could in theory reorder the initialisation of
    node[idx] before the increment of the head node->count, causing an
    IRQ to overwrite the initialised node and potentially corrupt the lock
    state.

    Avoid the potential for this harmful compiler reordering by placing a
    barrier() between the increment of the head node->count and the subsequent
    node initialisation.
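
    A sketch of the fix (illustrative):

    idx = node->count++;
    tail = encode_tail(smp_processor_id(), idx);

    node += idx;

    /*
     * Keep the compiler from hoisting the node initialisation above the
     * count increment; an IRQ taken in between could otherwise pick the
     * same node and have its state overwritten.
     */
    barrier();

    node->locked = 0;
    node->next = NULL;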

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1518528177-19169-3-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     
  • When a locker ends up queuing on the qspinlock locking slowpath, we
    initialise the relevant mcs node and publish it indirectly by updating
    the tail portion of the lock word using xchg_tail. If we find that there
    was a pre-existing locker in the queue, we subsequently update their
    ->next field to point at our node so that we are notified when it's our
    turn to take the lock.

    This can be roughly illustrated as follows:

    /* Initialise the fields in node and encode a pointer to node in tail */
    tail = initialise_node(node);

    /*
     * Exchange tail into the lockword using an atomic read-modify-write
     * operation with release semantics
     */
    old = xchg_tail(lock, tail);

    /* If there was a pre-existing waiter ... */
    if (old & _Q_TAIL_MASK) {
            prev = decode_tail(old);
            smp_read_barrier_depends();

            /* ... then update their ->next field to point to node. */
            WRITE_ONCE(prev->next, node);
    }

    The conditional update of prev->next therefore relies on the address
    dependency from the result of xchg_tail ensuring order against the
    prior initialisation of node. However, since the release semantics of
    the xchg_tail operation apply only to the write portion of the RmW,
    then this ordering is not guaranteed and it is possible for the CPU
    to return old before the writes to node have been published, consequently
    allowing us to point prev->next to an uninitialised node.

    This patch fixes the problem by making the update of prev->next a RELEASE
    operation, which also removes the reliance on dependency ordering.
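
    In other words (sketch):

    /* RELEASE publish: node's initialisation is ordered before the node
     * becomes reachable via prev->next. */
    smp_store_release(&prev->next, node);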

    Signed-off-by: Will Deacon
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1518528177-19169-2-git-send-email-will.deacon@arm.com
    Signed-off-by: Ingo Molnar

    Will Deacon
     

05 Dec, 2017

1 commit


17 Aug, 2017

1 commit

  • There is no agreed-upon definition of spin_unlock_wait()'s semantics,
    and it appears that all callers could do just as well with a lock/unlock
    pair. This commit therefore removes spin_unlock_wait() and related
    definitions from core code.

    Signed-off-by: Paul E. McKenney
    Cc: Arnd Bergmann
    Cc: Ingo Molnar
    Cc: Will Deacon
    Cc: Peter Zijlstra
    Cc: Alan Stern
    Cc: Andrea Parri
    Cc: Linus Torvalds

    Paul E. McKenney
     

08 Jul, 2017

1 commit

  • In architectures that use qspinlock, like x86, prefetch is loaded
    indirectly via the asm/qspinlock.h include. On other architectures, like
    OpenRISC, which may want to use asm-generic/qspinlock.h, the build will
    fail without the asm/prefetch.h include.

    Fix this by including it directly.

    Signed-off-by: Stafford Horne
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20170707195658.23840-1-shorne@gmail.com
    Signed-off-by: Ingo Molnar

    Stafford Horne
     

27 Jun, 2016

1 commit

  • queued_spin_lock_slowpath() should not worry about another
    queued_spin_lock_slowpath() running in interrupt context and
    changing node->count by accident, because node->count keeps
    the same value every time we enter/leave queued_spin_lock_slowpath().

    On some architectures this_cpu_dec() will save/restore irq flags,
    which has high overhead. Use the much cheaper __this_cpu_dec() instead.
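
    Roughly (illustrative; the per-CPU array name is assumed from that era
    of the code):

    /* Before: IRQ-safe, may save/restore interrupt flags on some archs. */
    this_cpu_dec(mcs_nodes[0].count);

    /* After: interrupts cannot change node->count across this point. */
    __this_cpu_dec(mcs_nodes[0].count);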

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Rewrote changelog. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     

14 Jun, 2016

2 commits

  • Introduce smp_acquire__after_ctrl_dep(), this construct is not
    uncommon, but the lack of this barrier is.

    Use it to better express smp_rmb() uses in WRITE_ONCE(), the IPC
    semaphore code and the qspinlock code.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • This new form allows using hardware assisted waiting.

    Some hardware (ARM64 and x86) allows monitoring an address for changes,
    so by providing a pointer we can use this to replace the cpu_relax()
    with hardware optimized methods in the future.
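
    Roughly, for the qspinlock wait (illustrative):

    /* Old form: spin on an arbitrary expression. */
    smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK));

    /* New form: the pointer lets the architecture monitor the location
     * (e.g. WFE on ARM64, MONITOR/MWAIT on x86) instead of pure spinning. */
    val = smp_cond_load_acquire(&lock->val.counter,
                                !(VAL & _Q_LOCKED_PENDING_MASK));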

    Requested-by: Will Deacon
    Suggested-by: Linus Torvalds
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Jun, 2016

3 commits

  • I figured we need to document the spin_is_locked() and
    spin_unlock_wait() constraints somewhere.

    Ideally 'someone' would rewrite Documentation/atomic_ops.txt and we
    could find a place in there. But currently that document is stale to
    the point of hardly being useful.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While going over the code I noticed that xchg_tail() is a RELEASE but
    had no obvious pairing commented.

    It pairs with a somewhat unique address dependency through
    decode_tail().

    So the store-release of xchg_tail() is paired by the address
    dependency of the load of xchg_tail followed by the dereference from
    the pointer computed from that load.

    The @old -> @prev transformation itself is pure, and therefore does
    not depend on external state, so that is immaterial wrt. ordering.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • While this prior commit:

    54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")

    ... fixes spin_is_locked() and spin_unlock_wait() for the usage
    in ipc/sem and netfilter, it does not in fact work right for the
    usage in task_work and futex.

    So while the 2 locks crossed problem:

    spin_lock(A)                spin_lock(B)
    if (!spin_is_locked(B))     spin_unlock_wait(A)
      foo()                       foo();

    ... works with the smp_mb() injected by both spin_is_locked() and
    spin_unlock_wait(), this is not sufficient for:

    flag = 1;
    smp_mb();                   spin_lock()
    spin_unlock_wait()          if (!flag)
                                  // add to lockless list
    // iterate lockless list

    ... because in this scenario, the store from spin_lock() can be delayed
    past the load of flag, uncrossing the variables and losing the
    guarantee.

    This patch reworks spin_is_locked() and spin_unlock_wait() to work in
    both cases by exploiting the observation that while the lock byte
    store can be delayed, the contender must have registered itself
    visibly in other state contained in the word.

    It also allows for architectures to override both functions, as PPC
    and ARM64 have an additional issue for which we currently have no
    generic solution.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Davidlohr Bueso
    Cc: Giovanni Gherdovich
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: Will Deacon
    Cc: stable@vger.kernel.org # v4.2 and later
    Fixes: 54cf809b9512 ("locking,qspinlock: Fix spin_is_locked() and spin_unlock_wait()")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

29 Feb, 2016

1 commit

  • The newly introduced smp_cond_acquire() was used to replace the
    slowpath lock acquisition loop. Similarly, the new function can also
    be applied to the pending bit locking loop. This patch uses the new
    function in that loop.
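
    Roughly (illustrative):

    /* Before: open-coded wait for the lock owner to go away. */
    while ((val = atomic_read(&lock->val)) & _Q_LOCKED_MASK)
            cpu_relax();

    /* After: control dependency + read barrier give acquire semantics. */
    smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));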

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

04 Dec, 2015

3 commits

  • In an overcommitted guest where some vCPUs have to be halted to make
    forward progress in other areas, it is highly likely that a vCPU later
    in the spinlock queue will be spinning while the ones earlier in the
    queue would have been halted. The spinning in the later vCPUs is then
    just a waste of precious CPU cycles because they are not going to
    get the lock anytime soon, as the earlier ones have to be woken up and
    take their turn to get the lock.

    This patch implements an adaptive spinning mechanism where the vCPU
    will call pv_wait() if the previous vCPU is not running.
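
    A rough sketch of the check (illustrative; the structure and field
    names here are assumptions, not the exact patch):

    /* Only keep spinning if the previous queue entry's vCPU is running;
     * otherwise halt ourselves and wait to be kicked. */
    if (READ_ONCE(prev->state) != vcpu_running)
            pv_wait(&pn->state, vcpu_halted);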

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                         Westmere              Haswell
    Patch                32 vCPUs   48 vCPUs   32 vCPUs   48 vCPUs
    -----                --------   --------   --------   --------
    Before patch         3m02.3s    5m00.2s    1m43.7s    3m03.5s
    After patch          3m03.0s    4m37.5s    1m43.0s    2m47.2s

    For 32 vCPUs, this patch doesn't cause any noticeable change in
    performance. For 48 vCPUs (over-committed), there is about 8%
    performance improvement.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch allows one attempt for the lock waiter to steal the lock
    when entering the PV slowpath. To prevent lock starvation, the pending
    bit will be set by the queue head vCPU when it is in the active lock
    spinning loop to disable any lock stealing attempt. This helps to
    reduce the performance penalty caused by lock waiter preemption while
    not having much of the downsides of a real unfair lock.

    The pv_wait_head() function was renamed as pv_wait_head_or_lock()
    as it was modified to acquire the lock before returning. This is
    necessary because of possible lock stealing attempts from other tasks.

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                         Westmere              Haswell
    Patch                32 vCPUs   48 vCPUs   32 vCPUs   48 vCPUs
    -----                --------   --------   --------   --------
    Before patch         3m15.6s    10m56.1s   1m44.1s    5m29.1s
    After patch          3m02.3s    5m00.2s    1m43.7s    3m03.5s

    For the overcommitted case (48 vCPUs), this patch is able to reduce
    kernel build time by more than 54% for Westmere and 44% for Haswell.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Introduce smp_cond_acquire() which combines a control dependency and a
    read barrier to form acquire semantics.

    This primitive has two benefits:

    - it documents control dependencies,
    - it's typically cheaper than using smp_load_acquire() in a loop.
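
    The primitive itself is small (sketch, close to the actual definition):

    #define smp_cond_acquire(cond)  do {            \
            while (!(cond))                         \
                    cpu_relax();                    \
            smp_rmb(); /* ctrl + rmb := acquire */  \
    } while (0)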

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Nov, 2015

3 commits

  • With optimistic prefetch of the next node cacheline, the next pointer
    may have been properly initialized. As a result, the reading
    of node->next in the contended path may be redundant. This patch
    eliminates the redundant read if the next pointer value is not NULL.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • A queue head CPU, after acquiring the lock, will have to notify
    the next CPU in the wait queue that it has become the new queue
    head. This involves loading a new cacheline from the MCS node of the
    next CPU. That operation can be expensive and add to the latency of
    the locking operation.

    This patch adds code to optimistically prefetch the next MCS node
    cacheline if the next pointer is defined and it has been spinning
    for the MCS lock for a while. This reduces the locking latency and
    improves the system throughput.
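
    A sketch of the idea (illustrative):

    /* After spinning for the MCS lock, the next pointer may already be
     * visible; prefetch that node's cacheline for write to hide the miss
     * in the upcoming MCS unlock. */
    next = READ_ONCE(node->next);
    if (next)
            prefetchw(next);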

    The performance change will depend on whether the prefetch overhead
    can be hidden within the latency of the lock spin loop. On really
    short critical sections, there may not be any performance gain at all. With
    longer critical sections, however, it was found to have a performance
    boost of 5-10% over a range of different queue depths with a spinlock
    loop microbenchmark.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch replaces the cmpxchg() and xchg() calls in the native
    qspinlock code with the more relaxed _acquire or _release versions of
    those calls to enable other architectures to adopt queued spinlocks
    with less memory barrier performance overhead.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

11 Sep, 2015

1 commit

  • Dave ran into horrible performance on a VM without PARAVIRT_SPINLOCKS
    set and Linus noted that the test-and-set implementation was retarded.

    One should spin on the variable with a load, not a RMW.

    While there, remove 'queued' from the name, as the lock isn't queued
    at all, but a simple test-and-set.
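
    I.e. a classic test-and-test-and-set loop (sketch):

    /* Spin with plain loads; only attempt the atomic RMW when the lock
     * looks free. */
    do {
            while (atomic_read(&lock->val) != 0)
                    cpu_relax();
    } while (atomic_cmpxchg(&lock->val, 0, _Q_LOCKED_VAL) != 0);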

    Suggested-by: Linus Torvalds
    Reported-by: Dave Chinner
    Tested-by: Dave Chinner
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: stable@vger.kernel.org # v4.2+
    Link: http://lkml.kernel.org/r/20150904152523.GR18673@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Aug, 2015

1 commit

  • For an over-committed guest with more vCPUs than physical CPUs
    available, it is possible that a vCPU may be kicked twice before
    getting the lock - once before it becomes queue head and once again
    before it gets the lock. All this CPU kicking and halting (VMEXIT)
    can be expensive and slow down system performance.

    This patch adds a new vCPU state (vcpu_hashed) which enables the code
    to delay CPU kicking until unlock time. Once this state is set,
    the new lock holder will set _Q_SLOW_VAL and fill in the hash table
    on behalf of the halted queue head vCPU. The original vcpu_halted
    state will be used by pv_wait_node() only to differentiate other
    queue nodes from the queue head.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1436647018-49734-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

08 May, 2015

1 commit

  • Provide a separate (second) version of the spin_lock_slowpath for
    paravirt along with a special unlock path.

    The second slowpath is generated by adding a few pv hooks to the
    normal slowpath, but where those will compile away for the native
    case, they expand into special wait/wake code for the pv version.

    The actual MCS queue can use extra storage in the mcs_nodes[] array to
    keep track of state and therefore uses directed wakeups.

    The head contender has no such storage directly visible to the
    unlocker. So the unlocker searches a hash table with open addressing
    using a simple binary Galois linear feedback shift register.

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long