17 Dec, 2016

1 commit

  • Pull vfs updates from Al Viro:

    - more ->d_init() stuff (work.dcache)

    - pathname resolution cleanups (work.namei)

    - a few missing iov_iter primitives - copy_from_iter_full() and
    friends. These either copy the full requested amount, advance the
    iterator and return true, or fail, return false and do _not_ advance
    the iterator. Quite a few open-coded callers were converted, becoming
    more readable and harder to get wrong that way; see the sketch after
    this entry (work.iov_iter)

    - several assorted patches, the big one being logfs removal

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
    logfs: remove from tree
    vfs: fix put_compat_statfs64() does not handle errors
    namei: fold should_follow_link() with the step into not-followed link
    namei: pass both WALK_GET and WALK_MORE to should_follow_link()
    namei: invert WALK_PUT logics
    namei: shift interpretation of LOOKUP_FOLLOW inside should_follow_link()
    namei: saner calling conventions for mountpoint_last()
    namei.c: get rid of user_path_parent()
    switch getfrag callbacks to ..._full() primitives
    make skb_add_data,{_nocache}() and skb_copy_to_page_nocache() advance only on success
    [iov_iter] new primitives - copy_from_iter_full() and friends
    don't open-code file_inode()
    ceph: switch to use of ->d_init()
    ceph: unify dentry_operations instances
    lustre: switch to use of ->d_init()

    Linus Torvalds
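
    As a rough illustration of the conversion described above (a sketch, not
    taken from the tree; 'to', 'len' and 'from' are made-up names):

        /* before: a short copy leaves the iterator advanced */
        if (copy_from_iter(to, len, from) != len)
                return -EFAULT;         /* 'from' may now be partially advanced */

        /* after: all or nothing; 'from' is not advanced on failure */
        if (!copy_from_iter_full(to, len, from))
                return -EFAULT;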
     

15 Dec, 2016

1 commit

  • Pull xfs updates from Dave Chinner:
    "There is quite a varied bunch of stuff in this update, and some of it
    you will have already merged through the ext4 tree which imported the
    dax-4.10-iomap-pmd topic branch from the XFS tree.

    There is also a new direct IO implementation that uses the iomap
    infrastructure. It's much simpler, faster, and has lower IO latency
    than the existing direct IO infrastructure.

    Summary:
    - DAX PMD faults via iomap infrastructure
    - Direct-io support in iomap infrastructure
    - removal of now-redundant XFS inode iolock, replaced with VFS
    i_rwsem
    - synchronisation with fixes and changes in userspace libxfs code
    - extent tree lookup helpers
    - lots of little corruption detection improvements to verifiers
    - optimised CRC calculations
    - faster buffer cache lookups
    - deprecation of barrier/nobarrier mount options - we always use
    REQ_FUA/REQ_FLUSH where appropriate for data integrity now
    - cleanups to speculative preallocation
    - miscellaneous minor bug fixes and cleanups"

    * tag 'xfs-for-linus-4.10-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/dgc/linux-xfs: (63 commits)
    xfs: nuke unused tracepoint definitions
    xfs: use GPF_NOFS when allocating btree cursors
    xfs: use xfs_vn_setattr_size to check on new size
    xfs: deprecate barrier/nobarrier mount option
    xfs: Always flush caches when integrity is required
    xfs: ignore leaf attr ichdr.count in verifier during log replay
    xfs: use rhashtable to track buffer cache
    xfs: optimise CRC updates
    xfs: make xfs btree stats less huge
    xfs: don't cap maximum dedupe request length
    xfs: don't allow di_size with high bit set
    xfs: error out if trying to add attrs and anextents > 0
    xfs: don't crash if reading a directory results in an unexpected hole
    xfs: complain if we don't get nextents bmap records
    xfs: check for bogus values in btree block headers
    xfs: forbid AG btrees with level == 0
    xfs: several xattr functions can be void
    xfs: handle cow fork in xfs_bmap_trace_exlist
    xfs: pass state not whichfork to trace_xfs_extlist
    xfs: Move AGI buffer type setting to xfs_read_agi
    ...

    Linus Torvalds
     

06 Dec, 2016

1 commit

  • Since commit:

    4bcc595ccd80 ("printk: reinstate KERN_CONT for printing continuation lines")

    printk() requires KERN_CONT to continue log messages. Lots of printk()
    calls in lockdep.c and print_ip_sym() don't have it. As a result, lockdep
    reports are completely messed up.

    Add the missing KERN_CONT markers and inline print_ip_sym() where
    necessary; a minimal example of the continuation pattern follows this
    entry.

    Example of a messed up report:

    0-rc5+ #41 Not tainted
    -------------------------------------------------------
    syz-executor0/5036 is trying to acquire lock:
    (
    rtnl_mutex
    ){+.+.+.}
    , at:
    [] rtnl_lock+0x1c/0x20
    but task is already holding lock:
    (
    &net->packet.sklist_lock
    ){+.+...}
    , at:
    [] packet_diag_dump+0x1a6/0x1920
    which lock already depends on the new lock.
    the existing dependency chain (in reverse order) is:
    -> #3
    (
    &net->packet.sklist_lock
    +.+...}
    ...

    Without this patch all scripts that parse kernel bug reports are broken.

    Signed-off-by: Dmitry Vyukov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: andreyknvl@google.com
    Cc: aryabinin@virtuozzo.com
    Cc: joe@perches.com
    Cc: syzkaller@googlegroups.com
    Link: http://lkml.kernel.org/r/1480343083-48731-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
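
    For illustration, the continuation pattern the fix relies on looks
    roughly like this ('name' is a hypothetical variable):

        /* one logical report line emitted by two printk() calls */
        printk(KERN_INFO "but task is already holding lock: ");
        printk(KERN_CONT "%s\n", name);         /* continues the same line */

    Without the KERN_CONT marker the second call starts a new log record and
    the report is split across lines, as in the example above.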
     

02 Dec, 2016

5 commits

  • While debugging the unlock vs. dequeue race which resulted in futex
    state corruption, the lockless nature of rt_mutex_proxy_unlock()
    caused some confusion.

    Add commentary to explain why it is safe to do this locklessly. Add
    matching comments to rt_mutex_init_proxy_locked() for completeness'
    sake.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20161130210030.591941927@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • This is a leftover from the original rtmutex implementation, which used
    both bit0 and bit1 in the owner pointer. Commit:

    8161239a8bcc ("rtmutex: Simplify PI algorithm and make highest prio task get lock")

    ... removed the usage of bit1, but kept the extra mask around. This is
    confusing at best.

    Remove it and just use RT_MUTEX_HAS_WAITERS for the masking.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20161130210030.509567906@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • While debugging the rtmutex unlock vs. dequeue race, Will suggested using
    READ_ONCE() in rt_mutex_owner(), as it might race against the
    cmpxchg_release() in unlock_rt_mutex_safe().

    Will: "It's a minor thing which will most likely not matter in practice"

    A careful search did not unearth an actual problem in today's code, but
    it's better to be safe than surprised. A sketch of the resulting helper
    follows this entry.

    Suggested-by: Will Deacon
    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra (Intel)
    Cc: David Daney
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Steven Rostedt
    Cc:
    Link: http://lkml.kernel.org/r/20161130210030.431379999@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
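
    A sketch of the resulting helper (paraphrased from the description above,
    not a verbatim copy of the patch):

        static inline struct task_struct *rt_mutex_owner(struct rt_mutex *lock)
        {
                unsigned long owner = (unsigned long) READ_ONCE(lock->owner);

                /* strip the RT_MUTEX_HAS_WAITERS bit kept in the low bits */
                return (struct task_struct *) (owner & ~RT_MUTEX_HAS_WAITERS);
        }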
     
  • David reported a futex/rtmutex state corruption. It's caused by the
    following problem:

    CPU0                        CPU1                            CPU2

    l->owner=T1
                                rt_mutex_lock(l)
                                lock(l->wait_lock)
                                l->owner = T1 | HAS_WAITERS;
                                enqueue(T2)
                                boost()
                                unlock(l->wait_lock)
                                schedule()

                                                                rt_mutex_lock(l)
                                                                lock(l->wait_lock)
                                                                l->owner = T1 | HAS_WAITERS;
                                                                enqueue(T3)
                                                                boost()
                                                                unlock(l->wait_lock)
                                                                schedule()

    signal(->T2)                signal(->T3)

                                lock(l->wait_lock)
                                dequeue(T2)
                                deboost()
                                unlock(l->wait_lock)

                                                                lock(l->wait_lock)
                                                                dequeue(T3)
                                                                ===> wait list is now empty
                                                                deboost()
                                                                unlock(l->wait_lock)

                                lock(l->wait_lock)
                                fixup_rt_mutex_waiters()
                                  if (wait_list_empty(l)) {
                                    owner = l->owner & ~HAS_WAITERS;
                                    l->owner = owner
                                     ==> l->owner = T1
                                  }

                                                                lock(l->wait_lock)
    rt_mutex_unlock(l)                                          fixup_rt_mutex_waiters()
                                                                  if (wait_list_empty(l)) {
                                                                    owner = l->owner & ~HAS_WAITERS;
    cmpxchg(l->owner, T1, NULL)
     ===> Success (l->owner = NULL)
                                                                    l->owner = owner
                                                                     ==> l->owner = T1
                                                                  }

    That means the problem is caused by fixup_rt_mutex_waiters(), which does
    the RMW to clear the waiters bit unconditionally when there are no
    waiters in the rtmutex's rbtree.

    This can be fatal: a concurrent unlock can release the rtmutex in the
    fastpath because the waiters bit is not set. If the cmpxchg() gets in the
    middle of the RMW operation, then the previous owner, which just unlocked
    the rtmutex, is set as the owner again when the write takes place after
    the successful cmpxchg().

    The solution is rather trivial: verify that the owner member of the
    rtmutex has the waiters bit set before clearing it (see the sketch after
    this entry). This does not require a cmpxchg() or other atomic
    operations, because the waiters bit can only be set and cleared with the
    rtmutex wait_lock held. It's also safe against the fast path unlock
    attempt: the unlock attempt via cmpxchg() will either see the bit set and
    take the slowpath, or see the bit cleared and release the lock atomically
    in the fastpath.

    It's remarkable that the test program provided by David triggers really
    quickly on ARM64 and MIPS64, but refuses to reproduce on x86-64, even
    though the problem exists there as well. That may explain why this was
    not discovered earlier, despite the bug existing from day one of the
    rtmutex implementation more than 10 years ago.

    Thanks to David for meticulously instrumenting the code and providing the
    information that allowed this subtle problem to be decoded.

    Reported-by: David Daney
    Tested-by: David Daney
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mark Rutland
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior
    Cc: Will Deacon
    Cc: stable@vger.kernel.org
    Fixes: 23f78d4a03c5 ("[PATCH] pi-futex: rt mutex core")
    Link: http://lkml.kernel.org/r/20161130210030.351136722@linutronix.de
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
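
    A minimal sketch of the idea (simplified, not the exact patch): only
    clear the waiters bit if it is actually set.

        static void fixup_rt_mutex_waiters(struct rt_mutex *lock)
        {
                unsigned long owner, *p = (unsigned long *) &lock->owner;

                if (rt_mutex_has_waiters(lock))
                        return;

                /*
                 * Plain loads and stores suffice: the bit is only modified
                 * with lock->wait_lock held, and the cmpxchg() based unlock
                 * fastpath either sees the bit and takes the slowpath, or
                 * sees it cleared and releases the lock atomically.
                 */
                owner = READ_ONCE(*p);
                if (owner & RT_MUTEX_HAS_WAITERS)
                        WRITE_ONCE(*p, owner & ~RT_MUTEX_HAS_WAITERS);
        }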
     

30 Nov, 2016

1 commit

  • Christoph requested lockdep_assert_held() variants that distinguish
    between held-for-read or held-for-write.

    Provide:

    int lock_is_held_type(struct lockdep_map *lock, int read)

    which takes the same argument as lock_acquire(.read) and matches it to
    the held_lock instance.

    Use of this function should be gated by the debug_locks variable. When
    that is 0 the return value of the lock_is_held_type() function is
    undefined. This is done to allow both negative and positive tests for
    holding locks.

    By default we provide the (positive) lockdep_assert_held{,_exclusive,_read}()
    macros; a usage sketch follows this entry.

    Requested-by: Christoph Hellwig
    Signed-off-by: Peter Zijlstra (Intel)
    Tested-by: Jens Axboe
    Reviewed-by: Darrick J. Wong
    Signed-off-by: Dave Chinner

    Peter Zijlstra
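
    A hypothetical usage sketch of the new assertions (the i_rwsem example is
    illustrative, not from the patch):

        /* assert the rwsem is held at all, read or write */
        lockdep_assert_held(&inode->i_rwsem);

        /* assert it is held for write, respectively for read */
        lockdep_assert_held_exclusive(&inode->i_rwsem);
        lockdep_assert_held_read(&inode->i_rwsem);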
     

22 Nov, 2016

3 commits

  • … when owner vCPU is preempted

    An over-committed guest with more vCPUs than pCPUs suffers heavy overhead
    in the two *_spin_on_owner() loops (mutex and rwsem). This is caused by
    the lock holder preemption issue.

    Break out of the loop if the vCPU is preempted, i.e. if
    vcpu_is_preempted(cpu) returns true; see the sketch after this entry.

    test-case:
    perf record -a perf bench sched messaging -g 400 -p && perf report

    before patch:
    20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
    8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
    4.12% sched-messaging [kernel.vmlinux] [k] system_call
    3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
    2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
    2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
    2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

    after patch:
    9.99% sched-messaging [kernel.vmlinux] [k] mutex_unlock
    5.28% sched-messaging [unknown] [H] 0xc0000000000768e0
    4.27% sched-messaging [kernel.vmlinux] [k] __copy_tofrom_user_power7
    3.77% sched-messaging [kernel.vmlinux] [k] copypage_power7
    3.24% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
    3.02% sched-messaging [kernel.vmlinux] [k] system_call
    2.69% sched-messaging [kernel.vmlinux] [k] wait_consider_task

    Tested-by: Juergen Gross <jgross@suse.com>
    Signed-off-by: Pan Xinhui <xinhui.pan@linux.vnet.ibm.com>
    Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
    Acked-by: Christian Borntraeger <borntraeger@de.ibm.com>
    Acked-by: Paolo Bonzini <pbonzini@redhat.com>
    Cc: David.Laight@ACULAB.COM
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Peter Zijlstra <peterz@infradead.org>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: benh@kernel.crashing.org
    Cc: boqun.feng@gmail.com
    Cc: bsingharora@gmail.com
    Cc: dave@stgolabs.net
    Cc: kernellwp@gmail.com
    Cc: konrad.wilk@oracle.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: paulmck@linux.vnet.ibm.com
    Cc: paulus@samba.org
    Cc: rkrcmar@redhat.com
    Cc: virtualization@lists.linux-foundation.org
    Cc: will.deacon@arm.com
    Cc: xen-devel-request@lists.xenproject.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1478077718-37424-4-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Pan Xinhui
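
    Roughly, the owner-spin loop gains one more exit condition (a sketch
    based on the description above, not the literal diff):

        while (__mutex_owner(lock) == owner) {
                /*
                 * Stop spinning if the owner went to sleep, if we should
                 * reschedule, or if the owner's vCPU has been preempted.
                 */
                if (!owner->on_cpu || need_resched() ||
                    vcpu_is_preempted(task_cpu(owner)))
                        break;

                cpu_relax();
        }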
     
  • An over-committed guest with more vCPUs than pCPUs suffers heavy overhead
    in osq_lock().

    This is because if vCPU-A holds the osq lock and yields out, vCPU-B ends
    up waiting for the per-CPU node->locked to be set. IOW, vCPU-B waits for
    vCPU-A to run and unlock the osq lock.

    Use the new vcpu_is_preempted(cpu) interface to detect whether a vCPU is
    currently running, and break out of the spin-loop if it is not; see the
    sketch after this entry.

    test case:

    $ perf record -a perf bench sched messaging -g 400 -p && perf report

    before patch:
    18.09% sched-messaging [kernel.vmlinux] [k] osq_lock
    12.28% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
    5.27% sched-messaging [kernel.vmlinux] [k] mutex_unlock
    3.89% sched-messaging [kernel.vmlinux] [k] wait_consider_task
    3.64% sched-messaging [kernel.vmlinux] [k] _raw_write_lock_irq
    3.41% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner.is
    2.49% sched-messaging [kernel.vmlinux] [k] system_call

    after patch:
    20.68% sched-messaging [kernel.vmlinux] [k] mutex_spin_on_owner
    8.45% sched-messaging [kernel.vmlinux] [k] mutex_unlock
    4.12% sched-messaging [kernel.vmlinux] [k] system_call
    3.01% sched-messaging [kernel.vmlinux] [k] system_call_common
    2.83% sched-messaging [kernel.vmlinux] [k] copypage_power7
    2.64% sched-messaging [kernel.vmlinux] [k] rwsem_spin_on_owner
    2.00% sched-messaging [kernel.vmlinux] [k] osq_lock

    Suggested-by: Boqun Feng
    Tested-by: Juergen Gross
    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Acked-by: Christian Borntraeger
    Acked-by: Paolo Bonzini
    Cc: David.Laight@ACULAB.COM
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: benh@kernel.crashing.org
    Cc: bsingharora@gmail.com
    Cc: dave@stgolabs.net
    Cc: kernellwp@gmail.com
    Cc: konrad.wilk@oracle.com
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: paulmck@linux.vnet.ibm.com
    Cc: paulus@samba.org
    Cc: rkrcmar@redhat.com
    Cc: virtualization@lists.linux-foundation.org
    Cc: will.deacon@arm.com
    Cc: xen-devel-request@lists.xenproject.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1478077718-37424-3-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Translated to English. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
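
    The osq_lock() spin-wait gets a similar bail-out (sketch, not the
    literal diff):

        /* wait for the previous queue node to hand over the lock */
        while (!READ_ONCE(node->locked)) {
                /*
                 * Give up if we need to reschedule or if the CPU that is
                 * supposed to set node->locked is a preempted vCPU.
                 */
                if (need_resched() || vcpu_is_preempted(node_cpu(node->prev)))
                        goto unqueue;

                cpu_relax();
        }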
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

21 Nov, 2016

1 commit

  • Currently the wake_q data structure is defined by the WAKE_Q() macro.
    This macro, however, looks like a function call, since "wake" reads as a
    verb. Even checkpatch.pl was confused and reported warnings like:

    WARNING: Missing a blank line after declarations
    #548: FILE: kernel/futex.c:3665:
    + int ret;
    + WAKE_Q(wake_q);

    This patch renames the WAKE_Q() macro to DEFINE_WAKE_Q(), which clarifies
    what the macro is doing and eliminates the checkpatch.pl warnings (see
    the usage sketch after this entry).

    Signed-off-by: Waiman Long
    Acked-by: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1479401198-1765-1-git-send-email-longman@redhat.com
    [ Resolved conflict and added missing rename. ]
    Signed-off-by: Ingo Molnar

    Waiman Long
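
    Usage after the rename looks like this (illustrative caller; 'lock' and
    'task' are made-up names):

        DEFINE_WAKE_Q(wake_q);          /* was: WAKE_Q(wake_q); */

        spin_lock(&lock);
        wake_q_add(&wake_q, task);      /* queue the wakeup under the lock */
        spin_unlock(&lock);

        wake_up_q(&wake_q);             /* perform the wakeups afterwards */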
     

16 Nov, 2016

1 commit

  • With the s390 special case of a yielding cpu_relax() implementation gone,
    we can now remove all users of cpu_relax_lowlatency() and replace them
    with cpu_relax().

    Signed-off-by: Christian Borntraeger
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Catalin Marinas
    Cc: Heiko Carstens
    Cc: Linus Torvalds
    Cc: Martin Schwidefsky
    Cc: Nicholas Piggin
    Cc: Noam Camus
    Cc: Peter Zijlstra
    Cc: Russell King
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1477386195-32736-5-git-send-email-borntraeger@de.ibm.com
    Signed-off-by: Ingo Molnar

    Christian Borntraeger
     

25 Oct, 2016

5 commits

  • This patch makes the waiter that sets the HANDOFF flag start spinning
    instead of sleeping until the handoff is complete or the owner
    sleeps. Otherwise, the handoff will cause the optimistic spinners to
    abort spinning as the handed-off owner may not be running.

    Tested-by: Jason Low
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Ding Tianhong
    Cc: Imre Deak
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1472254509-27508-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch removes some of the redundant ww_mutex code in
    __mutex_lock_common().

    Tested-by: Jason Low
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Ding Tianhong
    Cc: Imre Deak
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/1472254509-27508-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Doesn't really matter yet, but pull the HANDOFF and trylock out from
    under the wait_lock.

    The intention is to add an optimistic spin loop here, which requires
    we do not hold the wait_lock, so shuffle code around in preparation.

    Also clarify the purpose of taking the wait_lock in the wait loop: it's
    tempting to want to avoid it altogether, but the cancellation cases need
    it to avoid losing wakeups.

    Suggested-by: Waiman Long
    Tested-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Implement lock handoff to avoid lock starvation.

    Lock starvation is possible because mutex_lock() allows lock stealing,
    where a running (or optimistic spinning) task beats the woken waiter
    to the acquire.

    Lock stealing is an important performance optimization because waiting
    for a waiter to wake up and get runtime can take a significant time,
    during which everybody would stall on the lock.

    The downside is, of course, that it allows for starvation.

    This patch has the waiter requesting a handoff if it fails to acquire
    the lock upon waking. This re-introduces some of the wait time,
    because once we do a handoff we have to wait for the waiter to wake up
    again.

    A future patch will add a round of optimistic spinning to attempt to
    alleviate this penalty, but if that turns out to not be enough, we can
    add a counter and only request handoff after multiple failed wakeups.

    There are a few tricky implementation details:

    - accepting a handoff must only be done in the wait-loop. Since the
    handoff condition is owner == current, it can easily cause
    recursive locking trouble.

    - accepting the handoff must be careful to provide the ACQUIRE
    semantics.

    - having the HANDOFF bit set on unlock requires care, we must not
    clear the owner.

    - we must be careful to not leave HANDOFF set after we've acquired
    the lock. The tricky scenario is setting the HANDOFF bit on an
    unlocked mutex.

    Tested-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
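
    A rough sketch of the waiter side described above (heavily simplified;
    names follow the changelog, not necessarily the final code):

        for (;;) {
                /* accepting a handoff only ever happens in this wait loop */
                if (__mutex_trylock(lock, first))
                        break;

                /*
                 * The oldest waiter requests a handoff after a failed
                 * wakeup, so a lock stealer cannot starve it forever.
                 */
                if (first)
                        __mutex_set_flag(lock, MUTEX_FLAG_HANDOFF);

                schedule();
        }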
     
  • The current mutex implementation has an atomic lock word and a
    non-atomic owner field.

    This disparity leads to a number of issues with the current mutex code
    as it means that we can have a locked mutex without an explicit owner
    (because the owner field has not been set, or already cleared).

    This leads to a number of weird corner cases, esp. between the
    optimistic spinning and debug code. Where the optimistic spinning
    code needs the owner field updated inside the lock region, the debug
    code is more relaxed because the whole lock is serialized by the
    wait_lock.

    Also, the spinning code itself has a few corner cases where we need to
    deal with a held lock without an owner field.

    Furthermore, it becomes even more of a problem when trying to fix
    starvation cases in the current code. We end up stacking special case
    on special case.

    To solve this rework the basic mutex implementation to be a single
    atomic word that contains the owner and uses the low bits for extra
    state.

    This matches how PI futexes and rt_mutex already work. By making the
    owner an integral part of the lock state, a lot of the problems
    disappear and we get a better option for dealing with starvation cases:
    direct owner handoff (see the owner-word sketch after this entry).

    Changing the basic mutex does however invalidate all the arch specific
    mutex code; this patch leaves that unused in-place, a later patch will
    remove that.

    Tested-by: Jason Low
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Will Deacon
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
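
    A sketch of the resulting owner-word layout (illustrative; see mutex.c
    for the authoritative definitions):

        /*
         * lock->owner is a single atomic word: the owning task_struct
         * pointer, with extra state packed into the low bits.
         */
        #define MUTEX_FLAG_WAITERS      0x01    /* waiters queued, unlock must go to the slowpath */
        #define MUTEX_FLAG_HANDOFF      0x02    /* unlock should hand the lock to the top waiter */
        #define MUTEX_FLAGS             0x03

        static inline struct task_struct *__owner_task(unsigned long owner)
        {
                return (struct task_struct *)(owner & ~MUTEX_FLAGS);
        }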
     

22 Sep, 2016

3 commits

  • It is now unused; remove it before someone else thinks it's a good idea
    to use it.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • stop_two_cpus() and stop_cpus() use stop_cpus_lock to avoid a deadlock:
    we need to ensure that the stopper functions can't be queued "backwards"
    from one another. This doesn't look nice; if we used an lglock then we
    would not really need stopper->lock, as cpu_stop_queue_work() could use
    lg_local_lock() under local_irq_save().

    OTOH it would be even better to avoid lglock in stop_machine.c and remove
    lg_double_lock(). This patch adds "bool stop_cpus_in_progress", set and
    cleared by queue_stop_cpus_work(), and changes cpu_stop_queue_two_works()
    to busy wait until it is cleared; see the sketch after this entry.

    queue_stop_cpus_work() sets stop_cpus_in_progress = T locklessly, but
    after it queues a work on CPU1 it must be visible to stop_two_cpus(CPU1,
    CPU2), which checks it under the same lock. And since stop_two_cpus()
    holds the 2nd lock too, queue_stop_cpus_work() cannot clear
    stop_cpus_in_progress if it is also going to queue a work on CPU2; it
    needs to take that 2nd lock to do so.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Tejun Heo
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20151121181148.GA433@redhat.com
    Signed-off-by: Ingo Molnar

    Oleg Nesterov
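
    Conceptually, the two sides interact like this (a simplified sketch, not
    the exact code; 'work_for_cpu' is a placeholder):

        /* queue_stop_cpus_work(), simplified */
        stop_cpus_in_progress = true;
        for_each_cpu(cpu, cpumask)
                cpu_stop_queue_work(cpu, work_for_cpu); /* per-CPU stop work */
        stop_cpus_in_progress = false;

        /*
         * cpu_stop_queue_two_works(), simplified: back off while a
         * multi-CPU queueing pass is in flight, so the two stopper works
         * cannot be queued in the reverse order.
         */
        while (stop_cpus_in_progress)
                cpu_relax();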
     
  • cmpxchg_release() is more lightweight than cmpxchg() on some architectures
    (e.g. PPC). Moreover, in __pv_queued_spin_unlock() we only need a RELEASE
    in the fast path (pairing with *_try_lock() or *_lock()), and the slow
    path already has an smp_store_release(). So it's safe to use
    cmpxchg_release() here; see the sketch after this entry.

    Suggested-by: Boqun Feng
    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: benh@kernel.crashing.org
    Cc: linuxppc-dev@lists.ozlabs.org
    Cc: mpe@ellerman.id.au
    Cc: paulmck@linux.vnet.ibm.com
    Cc: paulus@samba.org
    Cc: virtualization@lists.linux-foundation.org
    Cc: waiman.long@hpe.com
    Link: http://lkml.kernel.org/r/1474277037-15200-2-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
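
    The resulting unlock fast path looks roughly like this (sketch):

        __visible void __pv_queued_spin_unlock(struct qspinlock *lock)
        {
                struct __qspinlock *l = (void *)lock;
                u8 locked;

                /*
                 * RELEASE ordering is sufficient here: it pairs with the
                 * ACQUIRE on the locking side.
                 */
                locked = cmpxchg_release(&l->locked, _Q_LOCKED_VAL, 0);
                if (likely(locked == _Q_LOCKED_VAL))
                        return;

                /* _Q_SLOW_VAL was set: a vCPU is sleeping, go kick it */
                __pv_queued_spin_unlock_slowpath(lock, locked);
        }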
     

18 Aug, 2016

3 commits

  • When wanting to wake up readers, __rwsem_mark_wakeup() currently
    iterates the wait_list twice while looking to wake up the first N
    queued reader-tasks. While this can be quite inefficient, it was
    done so that an awoken reader would be first and foremost
    acknowledged by the lock counter.

    Keeping the same logic, we can further benefit from the use of
    wake_qs and entirely avoid the first wait_list iteration that sets
    the counter, as wake_up_process() isn't going to occur right away
    anyway, and therefore we maintain the counter->list order of going
    about things; see the sketch after this entry.

    Other than saving cycles with the O(n) "scanning", this change also
    nicely cleans up a good chunk of __rwsem_mark_wakeup(), making it
    both visually cleaner and less tedious to read.

    For example, the following improvements were seen on some
    will-it-scale microbenchmarks, on a 48-core Haswell:

    v4.7 v4.7-rwsem-v1
    Hmean signal1-processes-8 5792691.42 ( 0.00%) 5771971.04 ( -0.36%)
    Hmean signal1-processes-12 6081199.96 ( 0.00%) 6072174.38 ( -0.15%)
    Hmean signal1-processes-21 3071137.71 ( 0.00%) 3041336.72 ( -0.97%)
    Hmean signal1-processes-48 3712039.98 ( 0.00%) 3708113.59 ( -0.11%)
    Hmean signal1-processes-79 4464573.45 ( 0.00%) 4682798.66 ( 4.89%)
    Hmean signal1-processes-110 4486842.01 ( 0.00%) 4633781.71 ( 3.27%)
    Hmean signal1-processes-141 4611816.83 ( 0.00%) 4692725.38 ( 1.75%)
    Hmean signal1-processes-172 4638157.05 ( 0.00%) 4714387.86 ( 1.64%)
    Hmean signal1-processes-203 4465077.80 ( 0.00%) 4690348.07 ( 5.05%)
    Hmean signal1-processes-224 4410433.74 ( 0.00%) 4687534.43 ( 6.28%)

    Stddev signal1-processes-8 6360.47 ( 0.00%) 8455.31 ( 32.94%)
    Stddev signal1-processes-12 4004.98 ( 0.00%) 9156.13 (128.62%)
    Stddev signal1-processes-21 3273.14 ( 0.00%) 5016.80 ( 53.27%)
    Stddev signal1-processes-48 28420.25 ( 0.00%) 26576.22 ( -6.49%)
    Stddev signal1-processes-79 22038.34 ( 0.00%) 18992.70 (-13.82%)
    Stddev signal1-processes-110 23226.93 ( 0.00%) 17245.79 (-25.75%)
    Stddev signal1-processes-141 6358.98 ( 0.00%) 7636.14 ( 20.08%)
    Stddev signal1-processes-172 9523.70 ( 0.00%) 4824.75 (-49.34%)
    Stddev signal1-processes-203 13915.33 ( 0.00%) 9326.33 (-32.98%)
    Stddev signal1-processes-224 15573.94 ( 0.00%) 10613.82 (-31.85%)

    Other runs that saw improvements include context_switch and pipe; and
    as expected, this is particularly highlighted on larger thread counts
    as it becomes more expensive to walk the list twice.

    No change in wakeup ordering or semantics.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
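
    The single-pass shape of the wakeup path, as a sketch (simplified, not
    the full function):

        /*
         * One walk over the wait_list: count the readers and collect their
         * tasks; the wake_up_process() calls happen later, outside the
         * wait_lock, via wake_up_q().
         */
        list_for_each_entry_safe(waiter, tmp, &sem->wait_list, list) {
                if (waiter->type == RWSEM_WAITING_FOR_WRITE)
                        break;

                woken++;
                list_del(&waiter->list);
                wake_q_add(wake_q, waiter->task);
        }

        /* acknowledge all woken readers in the lock counter in one go */
        atomic_long_add(woken * adjustment, &sem->count);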
     
  • Our rwsem code (xadd, at least) is rather well documented, but
    there are a few really annoying comments in there that serve
    no purpose and we shouldn't bother with them.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-3-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • We currently return a rw_semaphore structure, which is the
    same lock that was passed in as the function's argument in the
    first place. While several functions choose this kind of return
    value, their callers actually use it, for example for things
    like ERR_PTR. This is not the case for __rwsem_mark_wake(); in
    addition, this function is really about the lock waiters (which
    we know exist at this point), so it's somewhat odd to be
    returning the sem structure.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hp.com
    Cc: dave@stgolabs.net
    Cc: jason.low2@hpe.com
    Cc: wanpeng.li@hotmail.com
    Link: http://lkml.kernel.org/r/1470384285-32163-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

10 Aug, 2016

5 commits

  • Currently the percpu-rwsem switches to (global) atomic ops while a
    writer is waiting, which could be quite a while, and that slows down
    releasing the readers.

    This patch cures the problem by ordering the reader-state vs the
    reader-count (see the comments in __percpu_down_read() and
    percpu_down_write(), and the conceptual sketch after this entry). This
    changes a global atomic op into a full memory barrier, which doesn't
    have the global cacheline contention.

    This also enables using the percpu-rwsem with rcu_sync disabled in order
    to bias the implementation differently, reducing the writer latency by
    adding some cost to readers.

    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Oleg Nesterov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    [ Fixed modular build. ]
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
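
    A conceptual model of the ordering (not the kernel code; the names and
    the drain helper are placeholders):

        /* reader fast path */
        __this_cpu_inc(*sem->read_count);
        smp_mb();                               /* count vs. readers_block */
        if (likely(!READ_ONCE(sem->readers_block)))
                return;                         /* read-locked */
        /* otherwise undo the increment and take the slow path */

        /* writer side */
        WRITE_ONCE(sem->readers_block, 1);
        smp_mb();                               /* readers_block vs. counts */
        wait_for_readers_to_drain(sem);         /* placeholder: sum per-CPU counts */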
     
  • Currently there is overlap between the pvqspinlock wait_again and
    spurious_wakeup stat counters. Because of lock stealing, it is
    no longer possible to accurately determine if a spurious wakeup has
    happened at the queue head. As the counters track both the queue node
    and queue head status, it is also hard to tell how many of those events
    come from the queue head and how many from the queue node.

    This patch changes the accounting rules so that spurious wakeup is
    only tracked in the queue node. The wait_again count, however, is
    only tracked in the queue head when the vCPU failed to acquire the
    lock after a vCPU kick. This should give a much better indication of
    the wait-kick dynamics in the queue node and the queue head.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boqun Feng
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Pan Xinhui
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1464713631-1066-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Restructure pv_queued_spin_steal_lock() as I found it hard to read.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long

    Peter Zijlstra
     
  • It's obviously wrong to set stat to NULL, so let's remove that.
    Otherwise it is always zero when we check the latency of kick/wake.

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Waiman Long
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1468405414-3700-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    Signed-off-by: Ingo Molnar

    Pan Xinhui
     
  • When the lock holder vCPU is racing with the queue head:

    CPU 0 (lock holder)                     CPU1 (queue head)
    ===================                     =================
    spin_lock();                            spin_lock();
      pv_kick_node():                         pv_wait_head_or_lock():
                                                if (!lp) {
                                                  lp = pv_hash(lock, pn);
                                                  xchg(&l->locked, _Q_SLOW_VAL);
                                                }
                                                WRITE_ONCE(pn->state, vcpu_halted);
        cmpxchg(&pn->state,
                vcpu_halted, vcpu_hashed);
        WRITE_ONCE(l->locked, _Q_SLOW_VAL);
        (void)pv_hash(lock, pn);

    In this case, the lock holder inserts the pv_node of the queue head into
    the hash table and sets _Q_SLOW_VAL unnecessarily. This patch avoids that
    by restoring/setting the vcpu_hashed state after adaptive lock spinning
    fails.

    Signed-off-by: Wanpeng Li
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Pan Xinhui
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Link: http://lkml.kernel.org/r/1468484156-4521-1-git-send-email-wanpeng.li@hotmail.com
    Signed-off-by: Ingo Molnar

    Wanpeng Li
     

26 Jul, 2016

1 commit

  • Pull locking updates from Ingo Molnar:
    "The locking tree was busier in this cycle than the usual pattern - a
    couple of major projects happened to coincide.

    The main changes are:

    - implement the atomic_fetch_{add,sub,and,or,xor}() API natively
    across all SMP architectures (Peter Zijlstra)

    - add atomic_fetch_{inc/dec}() as well, using the generic primitives
    (Davidlohr Bueso)

    - optimize various aspects of rwsems (Jason Low, Davidlohr Bueso,
    Waiman Long)

    - optimize smp_cond_load_acquire() on arm64 and implement LSE based
    atomic{,64}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    on arm64 (Will Deacon)

    - introduce smp_acquire__after_ctrl_dep() and fix various barrier
    mis-uses and bugs (Peter Zijlstra)

    - after discovering ancient spin_unlock_wait() barrier bugs in its
    implementation and usage, strengthen its semantics and update/fix
    usage sites (Peter Zijlstra)

    - optimize mutex_trylock() fastpath (Peter Zijlstra)

    - ... misc fixes and cleanups"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (67 commits)
    locking/atomic: Introduce inc/dec variants for the atomic_fetch_$op() API
    locking/barriers, arch/arm64: Implement LDXR+WFE based smp_cond_load_acquire()
    locking/static_keys: Fix non static symbol Sparse warning
    locking/qspinlock: Use __this_cpu_dec() instead of full-blown this_cpu_dec()
    locking/atomic, arch/tile: Fix tilepro build
    locking/atomic, arch/m68k: Remove comment
    locking/atomic, arch/arc: Fix build
    locking/Documentation: Clarify limited control-dependency scope
    locking/atomic, arch/rwsem: Employ atomic_long_fetch_add()
    locking/atomic, arch/qrwlock: Employ atomic_fetch_add_acquire()
    locking/atomic, arch/mips: Convert to _relaxed atomics
    locking/atomic, arch/alpha: Convert to _relaxed atomics
    locking/atomic: Remove the deprecated atomic_{set,clear}_mask() functions
    locking/atomic: Remove linux/atomic.h:atomic_fetch_or()
    locking/atomic: Implement atomic{,64,_long}_fetch_{add,sub,and,andnot,or,xor}{,_relaxed,_acquire,_release}()
    locking/atomic: Fix atomic64_relaxed() bits
    locking/atomic, arch/xtensa: Implement atomic_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/x86: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/tile: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    locking/atomic, arch/sparc: Implement atomic{,64}_fetch_{add,sub,and,or,xor}()
    ...

    Linus Torvalds
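
    For reference, the fetch-style primitives mentioned above return the
    pre-operation value (illustrative snippet):

        atomic_t v = ATOMIC_INIT(4);
        int old;

        old = atomic_fetch_add(2, &v);  /* old == 4, v is now 6 */
        old = atomic_fetch_or(1, &v);   /* old == 6, v is now 7 */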
     

27 Jun, 2016

1 commit

  • queued_spin_lock_slowpath() should not worry about another
    queued_spin_lock_slowpath() running in interrupt context and
    changing node->count by accident, because node->count keeps
    the same value every time we enter/leave queued_spin_lock_slowpath().

    On some architectures this_cpu_dec() will save/restore irq flags,
    which has high overhead. Use the much cheaper __this_cpu_dec() instead.

    Signed-off-by: Pan Xinhui
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Waiman.Long@hpe.com
    Link: http://lkml.kernel.org/r/1465886247-3773-1-git-send-email-xinhui.pan@linux.vnet.ibm.com
    [ Rewrote changelog. ]
    Signed-off-by: Ingo Molnar

    Pan Xinhui
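
    The change itself is of this shape (sketch):

        /*
         * node->count is only modified by this CPU and has the same value
         * on slowpath entry and exit, so the cheaper, non-IRQ-safe variant
         * is sufficient here.
         */
        __this_cpu_dec(mcs_nodes[0].count);     /* was: this_cpu_dec(...) */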
     

16 Jun, 2016

1 commit