16 Jul, 2014

7 commits

  • Just like with mutexes (CONFIG_MUTEX_SPIN_ON_OWNER),
    encapsulate the dependencies for rwsem optimistic spinning.
    No logical changes here as it continues to depend on both
    SMP and the XADD algorithm variant.
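
    As an illustration, gating code on such a symbol ends up looking roughly
    like the following (a sketch only; the symbol name CONFIG_RWSEM_SPIN_ON_OWNER
    is assumed by analogy with the mutex one, and the stub helper shown is how a
    non-spinning fallback is typically shaped, not the literal patch):

    #ifdef CONFIG_RWSEM_SPIN_ON_OWNER
    /* optimistic spinning fields and helpers are compiled in */
    #else
    static inline bool rwsem_optimistic_spin(struct rw_semaphore *sem)
    {
            /* spinning not configured: always take the sleeping slowpath */
            return false;
    }
    #endif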

    Signed-off-by: Davidlohr Bueso
    Acked-by: Jason Low
    [ Also make it depend on ARCH_SUPPORTS_ATOMIC_RMW. ]
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1405112406-13052-2-git-send-email-davidlohr@hp.com
    Cc: aswin@hp.com
    Cc: Chris Mason
    Cc: Davidlohr Bueso
    Cc: Josef Bacik
    Cc: Linus Torvalds
    Cc: Waiman Long
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • There are two definitions of struct rw_semaphore, one in linux/rwsem.h
    and one in linux/rwsem-spinlock.h.

    For some reason they have different names for the initial field. This
    makes it impossible to use C99 named initialization for
    __RWSEM_INITIALIZER() -- or we have to duplicate that entire thing
    along with the structure definitions.

    The simpler patch is renaming the rwsem-spinlock variant to match the
    regular rwsem.

    This allows us to switch to C99 named initialization.
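
    For reference, a C99 designated initializer for both variants can then
    take roughly this shape (a sketch; the field names and helper macros
    shown are illustrative, not the exact header contents):

    #define __RWSEM_INITIALIZER(name)                                \
    {                                                                \
            .count     = 0,                                          \
            .wait_lock = __RAW_SPIN_LOCK_UNLOCKED((name).wait_lock), \
            .wait_list = LIST_HEAD_INIT((name).wait_list)            \
    }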

    Signed-off-by: Peter Zijlstra
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-bmrZolsbGmautmzrerog27io@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • In the unlock function of the cancellable MCS spinlock, the first
    thing we do is retrieve the current CPU's osq node. However, due to
    the changes made in the previous patch, in the common case where the
    lock is not contended, we no longer need to access the current CPU's
    osq node.

    This patch optimizes this by only retrieving this CPU's osq node
    after we attempt the initial cmpxchg to unlock the osq and find
    that it is contended.
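
    A simplified sketch of the resulting unlock path (the osq_wait_next()
    fallback and other details of the real function are omitted, and the
    names follow the osq code this series describes, so treat this as an
    illustration rather than the exact code):

    void osq_unlock(struct optimistic_spin_queue *lock)
    {
            struct optimistic_spin_node *node, *next;
            int curr = encode_cpu(smp_processor_id());

            /*
             * Fast path: nobody is queued behind us, so a single cmpxchg
             * on the tail releases the lock without touching our node.
             */
            if (likely(atomic_cmpxchg(&lock->tail, curr,
                                      OSQ_UNLOCKED_VAL) == curr))
                    return;

            /*
             * Contended: only now fetch this CPU's node in order to find
             * and release the next spinner in the queue.
             */
            node = this_cpu_ptr(&osq_node);
            next = xchg(&node->next, NULL);
            if (next)
                    ACCESS_ONCE(next->locked) = 1;
    }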

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Scott Norton
    Cc: "Paul E. McKenney"
    Cc: Dave Chinner
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Tim Chen
    Cc: Konrad Rzeszutek Wilk
    Cc: Aswin Chandramouleeswaran
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1405358872-3732-5-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Currently, we initialize the osq lock by directly setting the lock's values. It
    would be preferable if we use an init macro to do the initialization like we do
    with other locks.

    This patch introduces and uses a macro and function for initializing the osq lock.
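
    A minimal sketch of such an initializer pair, assuming the atomic_t
    tail representation introduced earlier in this series (the exact macro
    and function names are assumptions):

    #define OSQ_LOCK_UNLOCKED { ATOMIC_INIT(OSQ_UNLOCKED_VAL) }

    static inline void osq_lock_init(struct optimistic_spin_queue *lock)
    {
            atomic_set(&lock->tail, OSQ_UNLOCKED_VAL);
    }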

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Scott Norton
    Cc: "Paul E. McKenney"
    Cc: Dave Chinner
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Tim Chen
    Cc: Konrad Rzeszutek Wilk
    Cc: Aswin Chandramouleeswaran
    Cc: Linus Torvalds
    Cc: Chris Mason
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1405358872-3732-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • The cancellable MCS spinlock is currently used to queue threads that are
    doing optimistic spinning. It uses per-cpu nodes, where a thread obtaining
    the lock would access and queue the local node corresponding to the CPU that
    it's running on. Currently, the cancellable MCS lock is implemented by using
    pointers to these nodes.

    In this patch, instead of operating on pointers to the per-cpu nodes, we
    store the CPU numbers to which the per-cpu nodes correspond in an atomic_t.
    A similar concept is used with the qspinlock.

    By operating on the CPU # of the nodes using atomic_t instead of pointers
    to those nodes, this can reduce the overhead of the cancellable MCS spinlock
    by 32 bits (on 64 bit systems).
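
    Conceptually, the new handle and the CPU encoding look something like
    this (a sketch; the helper names and the per-cpu variable are
    illustrative, with 0 reserved to mean "queue empty" so stored values
    are offset by one):

    struct optimistic_spin_queue {
            /*
             * Encoded CPU number of the tail node in the queue;
             * OSQ_UNLOCKED_VAL (0) means the queue is empty.
             */
            atomic_t tail;
    };

    #define OSQ_UNLOCKED_VAL (0)

    static inline int encode_cpu(int cpu_nr)
    {
            return cpu_nr + 1;
    }

    static inline struct optimistic_spin_node *decode_cpu(int encoded_cpu_val)
    {
            int cpu_nr = encoded_cpu_val - 1;

            return per_cpu_ptr(&osq_node, cpu_nr);
    }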

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Scott Norton
    Cc: "Paul E. McKenney"
    Cc: Dave Chinner
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Tim Chen
    Cc: Konrad Rzeszutek Wilk
    Cc: Aswin Chandramouleeswaran
    Cc: Linus Torvalds
    Cc: Chris Mason
    Cc: Heiko Carstens
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1405358872-3732-3-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Currently, the per-cpu nodes structure for the cancellable MCS spinlock is
    named "optimistic_spin_queue". However, in a follow up patch in the series
    we will be introducing a new structure that serves as the new "handle" for
    the lock. It would make more sense if that structure is named
    "optimistic_spin_queue". Additionally, since the current use of the
    "optimistic_spin_queue" structure are "nodes", it might be better if we
    rename them to "node" anyway.

    This preparatory patch renames all current "optimistic_spin_queue"
    to "optimistic_spin_node".

    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Scott Norton
    Cc: "Paul E. McKenney"
    Cc: Dave Chinner
    Cc: Waiman Long
    Cc: Davidlohr Bueso
    Cc: Rik van Riel
    Cc: Andrew Morton
    Cc: "H. Peter Anvin"
    Cc: Steven Rostedt
    Cc: Tim Chen
    Cc: Konrad Rzeszutek Wilk
    Cc: Aswin Chandramouleeswaran
    Cc: Linus Torvalds
    Cc: Chris Mason
    Cc: Heiko Carstens
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1405358872-3732-2-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • Commit 4fc828e24cd9 ("locking/rwsem: Support optimistic spinning")
    introduced a major performance regression for workloads such as
    xfs_repair which mix read and write locking of the mmap_sem across
    many threads. The result was xfs_repair ran 5x slower on 3.16-rc2
    than on 3.15 and using 20x more system CPU time.

    Perf profiles indicate that in some workloads significant time can
    be spent spinning on !owner. This is because we don't set the lock
    owner when reader(s) obtain the rwsem.

    In this patch, we'll modify rwsem_can_spin_on_owner() such that we'll
    return false if there is no lock owner. The rationale is that if we
    just entered the slowpath, yet there is no lock owner, then there is
    a possibility that a reader has the lock. To be conservative, we'll
    avoid spinning in these situations.
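
    A condensed sketch of that check (simplified; field and helper names
    follow the rwsem code this series describes, and the real function
    also bails out early on need_resched() before looking at the owner):

    static inline bool rwsem_can_spin_on_owner(struct rw_semaphore *sem)
    {
            struct task_struct *owner;
            bool on_cpu = false;    /* be conservative when there is no owner */

            rcu_read_lock();
            owner = ACCESS_ONCE(sem->owner);
            if (owner)
                    on_cpu = owner->on_cpu;
            rcu_read_unlock();

            /*
             * If sem->owner is not set, yet we just entered the slowpath,
             * a reader may hold the lock; to be safe, don't spin.
             */
            return on_cpu;
    }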

    This patch reduced the total run time of the xfs_repair workload from
    about 4 minutes 24 seconds down to approximately 1 minute 26 seconds,
    back to close to the same performance as on 3.15.

    Retesting of AIM7, which covers some of the workloads used to test the
    original optimistic spinning code, confirmed that we still get big
    performance gains with optimistic spinning, even with this additional
    regression fix. Davidlohr found that while the 'custom' workload took
    a performance hit of ~-14% to throughput for >300 users with this
    additional patch, the overall gain with optimistic spinning is
    still ~+45%. The 'disk' workload even improved by ~+15% at >1000 users.

    Tested-by: Dave Chinner
    Acked-by: Davidlohr Bueso
    Signed-off-by: Jason Low
    Signed-off-by: Peter Zijlstra
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/r/1404532172.2572.30.camel@j-VirtualBox
    Signed-off-by: Ingo Molnar

    Jason Low
     

22 Jun, 2014

1 commit


16 Jun, 2014

1 commit

  • When the rtmutex fast path is enabled the slow unlock function can
    create the following situation:

    spin_lock(foo->m->wait_lock);
    foo->m->owner = NULL;
    rt_mutex_lock(foo->m); refcnt);
    rt_mutex_unlock(foo->m); m->wait_lock); owner */
    clear_rt_mutex_waiters(m);
    owner = rt_mutex_owner(m);
    spin_unlock(m->wait_lock);
    if (cmpxchg(m->owner, owner, 0) == owner)
    return;
    spin_lock(m->wait_lock);
    }

    So in case of a new waiter incoming while the owner tries the slow
    path unlock we have two situations (owner on the left, incoming
    waiter on the right):

    unlock(wait_lock);
                                        lock(wait_lock);
    cmpxchg(p, owner, 0) == owner
                                        mark_rt_mutex_waiters(lock);
                                        acquire(lock);

    Or:

    unlock(wait_lock);
                                        lock(wait_lock);
                                        mark_rt_mutex_waiters(lock);
    cmpxchg(p, owner, 0) != owner
                                        enqueue_waiter();
                                        unlock(wait_lock);
    lock(wait_lock);
    wakeup_next waiter();
    unlock(wait_lock);
                                        lock(wait_lock);
                                        acquire(lock);

    If the fast path is disabled, then the simple

    m->owner = NULL;
    unlock(m->wait_lock);

    is sufficient as all access to m->owner is serialized via
    m->wait_lock;

    Also document and clarify the wakeup_next_waiter function as suggested
    by Oleg Nesterov.

    Reported-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Reviewed-by: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140611183852.937945560@linutronix.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

13 Jun, 2014

1 commit

  • Pull more locking changes from Ingo Molnar:
    "This is the second round of locking tree updates for v3.16, offering
    large system scalability improvements:

    - optimistic spinning for rwsems, from Davidlohr Bueso.

    - 'qrwlocks' core code and x86 enablement, from Waiman Long and PeterZ"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    x86, locking/rwlocks: Enable qrwlocks on x86
    locking/rwlocks: Introduce 'qrwlocks' - fair, queued rwlocks
    locking/mutexes: Documentation update/rewrite
    locking/rwsem: Fix checkpatch.pl warnings
    locking/rwsem: Fix warnings for CONFIG_RWSEM_GENERIC_SPINLOCK
    locking/rwsem: Support optimistic spinning

    Linus Torvalds
     

07 Jun, 2014

2 commits

  • When we walk the lock chain, we drop all locks after each step. So the
    lock chain can change under us before we reacquire the locks. That's
    harmless in principle as we just follow the wrong lock path. But it
    can lead to a false positive in the dead lock detection logic:

    T0 holds L0
    T0 blocks on L1 held by T1
    T1 blocks on L2 held by T2
    T2 blocks on L3 held by T3
    T3 blocks on L4 held by T4

    Now we walk the chain

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 ->
    lock T2 -> adjust T2 -> drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 -> lock L0 -> deadlock detected, but it's not a deadlock at all.

    Brad tried to work around that in the deadlock detection logic itself,
    but the more I looked at it the less I liked it, because it's crystal
    ball magic after the fact.

    We actually can detect a chain change very simply:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T2 times out and blocks on L0

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    So if we detect that T2 is now blocked on a different lock we stop the
    chain walk. That's also correct in the following scenario:

    lock T1 -> lock L2 -> adjust L2 -> unlock T1 -> lock T2 -> adjust T2 ->

    next_lock = T2->pi_blocked_on->lock;

    drop locks

    T3 times out and drops L3
    T2 acquires L3 and blocks on L4 now

    Now we continue:

    lock T2 ->

    if (next_lock != T2->pi_blocked_on->lock)
    return;

    We don't have to follow up the chain at that point, because T2
    propagated our priority up to T4 already.

    [ Folded a cleanup patch from peterz ]

    Signed-off-by: Thomas Gleixner
    Reported-by: Brad Mouring
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140605152801.930031935@linutronix.de
    Cc: stable@vger.kernel.org

    Thomas Gleixner
     
  • Even in the case when deadlock detection is not requested by the
    caller, we can detect deadlocks. Right now the code stops the lock
    chain walk and keeps the waiter enqueued, even on itself. Silly not to
    yell when such a scenario is detected and to keep the waiter enqueued.

    Return -EDEADLK unconditionally and handle it at the call sites.

    The futex calls return -EDEADLK. The non futex ones dequeue the
    waiter, throw a warning and put the task into a schedule loop.

    Tagged for stable as it makes the code more robust.

    Signed-off-by: Thomas Gleixner
    Cc: Steven Rostedt
    Cc: Peter Zijlstra
    Cc: Brad Mouring
    Link: http://lkml.kernel.org/r/20140605152801.836501969@linutronix.de
    Cc: stable@vger.kernel.org
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

06 Jun, 2014

1 commit

  • This rwlock uses the arch_spin_lock_t as a waitqueue, and assuming the
    arch_spin_lock_t is a fair lock (ticket, MCS, etc.), the resulting
    rwlock is a fair lock.

    It fits in the same 8 bytes as the regular rwlock_t by folding the
    reader and writer count into a single integer, using the remaining 4
    bytes for the arch_spinlock_t.

    Architectures that can single-copy address bytes can optimize
    queue_write_unlock() with a 0 write to the LSB (the write count).
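
    Roughly, the layout and the byte-store unlock described above take the
    following shape (a sketch; the exact field names in the generic and
    architecture headers may differ):

    typedef struct qrwlock {
            atomic_t        cnts;   /* reader count + writer byte (LSB)   */
            arch_spinlock_t lock;   /* fair waitqueue for contended cases */
    } arch_rwlock_t;

    /* Architectures with single-copy atomic byte stores can do: */
    static inline void queue_write_unlock(struct qrwlock *lock)
    {
            barrier();
            ACCESS_ONCE(*(u8 *)&lock->cnts) = 0;
    }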

    Performance as measured by Davidlohr Bueso (rwlock_t -> qrwlock_t):

    +--------------+--------------+----------------+
    | Workload     | #users       | delta          |
    +--------------+--------------+----------------+
    | alltests     | > 1400       | -4.83%         |
    | custom       | 0-100, > 100 | +1.43%, -1.57% |
    | high_systime | > 1000       | -2.61          |
    | shared       | all          | +0.32          |
    +--------------+--------------+----------------+

    http://www.stgolabs.net/qrwlock-stuff/aim7-results-vs-rwsem_optsin/

    Signed-off-by: Waiman Long
    [peterz: near complete rewrite]
    Signed-off-by: Peter Zijlstra
    Cc: Arnd Bergmann
    Cc: Linus Torvalds
    Cc: "Paul E.McKenney"
    Cc: linux-arch@vger.kernel.org
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/n/tip-gac1nnl3wvs2ij87zv2xkdzq@git.kernel.org
    Signed-off-by: Ingo Molnar

    Waiman Long
     

05 Jun, 2014

2 commits

  • WARNING: line over 80 characters
    #205: FILE: kernel/locking/rwsem-xadd.c:275:
    + old = cmpxchg(&sem->count, count, count + RWSEM_ACTIVE_WRITE_BIAS);

    WARNING: line over 80 characters
    #376: FILE: kernel/locking/rwsem-xadd.c:434:
    + * If there were already threads queued before us and there are no

    WARNING: line over 80 characters
    #377: FILE: kernel/locking/rwsem-xadd.c:435:
    + * active writers, the lock must be read owned; so we try to wake

    total: 0 errors, 3 warnings, 417 lines checked

    Signed-off-by: Andrew Morton
    Signed-off-by: Peter Zijlstra
    Cc: "H. Peter Anvin"
    Cc: Davidlohr Bueso
    Cc: Tim Chen
    Cc: Linus Torvalds
    Link: http://lkml.kernel.org/n/tip-pn6pslaplw031lykweojsn8c@git.kernel.org
    Signed-off-by: Ingo Molnar

    Andrew Morton
     
  • We have reached the point where our mutexes are quite fine tuned
    for a number of situations. This includes the use of heuristics
    and optimistic spinning, based on MCS locking techniques.

    Exclusive ownership of read-write semaphores is, conceptually,
    just about the same as for mutexes, making them close cousins. To
    this end we need to make them both perform similarly, and
    right now, rwsems are simply not up to it. This was discovered
    by both reverting commit 4fc3f1d6 (mm/rmap, migration: Make
    rmap_walk_anon() and try_to_unmap_anon() more scalable) and,
    similarly, converting some other mutexes (ie: i_mmap_mutex) to
    rwsems. This creates a situation where users have to choose
    between a rwsem and a mutex taking into account this important
    performance difference. Specifically, the biggest difference between
    the two locks is that when we fail to acquire a mutex in the fastpath,
    optimistic spinning comes into play and we can avoid a large
    amount of unnecessary sleeping and the overhead of moving tasks in
    and out of the wait queue. Rwsems do not have such logic.

    This patch, based on work from Tim Chen and me, adds support
    for write-side optimistic spinning when the lock is contended.
    It also includes support for the recently added cancelable MCS
    locking for adaptive spinning. Note that this is only applicable
    to the xadd method, and the spinlock rwsem variant remains intact.
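
    In condensed form, the write-side spin added here is shaped roughly
    like this (a simplified sketch; the helper names follow the rwsem code
    described in this series, but treat the details as illustrative rather
    than the exact implementation):

    static bool rwsem_optimistic_spin(struct rw_semaphore *sem)
    {
            struct task_struct *owner;
            bool taken = false;

            preempt_disable();

            if (!rwsem_can_spin_on_owner(sem))
                    goto done;
            if (!osq_lock(&sem->osq))
                    goto done;

            while (true) {
                    owner = ACCESS_ONCE(sem->owner);
                    if (owner && !rwsem_spin_on_owner(sem, owner))
                            break;

                    /* Try to acquire the write lock while spinning. */
                    if (rwsem_try_write_lock_unqueued(sem)) {
                            taken = true;
                            break;
                    }

                    /*
                     * No owner to spin on: the holder may be sleeping or
                     * the lock may be read-owned, so stop burning cycles.
                     */
                    if (!owner && (need_resched() || rt_task(current)))
                            break;

                    cpu_relax();
            }

            osq_unlock(&sem->osq);
    done:
            preempt_enable();
            return taken;
    }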

    Allowing optimistic spinning before putting the writer on the wait
    queue reduces wait queue contention and provides a greater chance
    for the rwsem to get acquired. With these changes, rwsem is on par
    with mutex. The performance benefits can be seen on a number of
    workloads. For instance, on an 8-socket, 80-core, 64-bit Westmere box,
    aim7 shows the following improvements in throughput:

    +--------------+---------------------+-----------------+
    | Workload     | throughput-increase | number of users |
    +--------------+---------------------+-----------------+
    | alltests     | 20%                 | >1000           |
    | custom       | 27%, 60%            | 10-100, >1000   |
    | high_systime | 36%, 30%            | >100, >1000     |
    | shared       | 58%, 29%            | 10-100, >1000   |
    +--------------+---------------------+-----------------+

    There was also improvement on smaller systems, such as a quad-core
    x86-64 laptop running a 30Gb PostgreSQL (pgbench) workload for up
    to +60% in throughput for over 50 clients. Additionally, benefits
    were also noticed in exim (mail server) workloads. Furthermore, no
    performance regressions have been seen at all.

    Based-on-work-from: Tim Chen
    Signed-off-by: Davidlohr Bueso
    [peterz: rej fixup due to comment patches, sched/rt.h header]
    Signed-off-by: Peter Zijlstra
    Cc: Alex Shi
    Cc: Andi Kleen
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Hurley
    Cc: "Paul E.McKenney"
    Cc: Jason Low
    Cc: Aswin Chandramouleeswaran
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: "Scott J Norton"
    Cc: Andrea Arcangeli
    Cc: Chris Mason
    Cc: Josef Bacik
    Link: http://lkml.kernel.org/r/1399055055.6275.15.camel@buesod1.americas.hpqcorp.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

04 Jun, 2014

3 commits

  • …/git/tip/tip into next

    Pull scheduler updates from Ingo Molnar:
    "The main scheduling related changes in this cycle were:

    - various sched/numa updates, for better performance

    - tree wide cleanup of open coded nice levels

    - nohz fix related to rq->nr_running use

    - cpuidle changes and continued consolidation to improve the
    kernel/sched/idle.c high level idle scheduling logic. As part of
    this effort I pulled cpuidle driver changes from Rafael as well.

    - standardized idle polling amongst architectures

    - continued work on preparing better power/energy aware scheduling

    - sched/rt updates

    - misc fixlets and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (49 commits)
    sched/numa: Decay ->wakee_flips instead of zeroing
    sched/numa: Update migrate_improves/degrades_locality()
    sched/numa: Allow task switch if load imbalance improves
    sched/rt: Fix 'struct sched_dl_entity' and dl_task_time() comments, to match the current upstream code
    sched: Consolidate open coded implementations of nice level frobbing into nice_to_rlimit() and rlimit_to_nice()
    sched: Initialize rq->age_stamp on processor start
    sched, nohz: Change rq->nr_running to always use wrappers
    sched: Fix the rq->next_balance logic in rebalance_domains() and idle_balance()
    sched: Use clamp() and clamp_val() to make sys_nice() more readable
    sched: Do not zero sg->cpumask and sg->sgp->power in build_sched_groups()
    sched/numa: Fix initialization of sched_domain_topology for NUMA
    sched: Call select_idle_sibling() when not affine_sd
    sched: Simplify return logic in sched_read_attr()
    sched: Simplify return logic in sched_copy_attr()
    sched: Fix exec_start/task_hot on migrated tasks
    arm64: Remove TIF_POLLING_NRFLAG
    metag: Remove TIF_POLLING_NRFLAG
    sched/idle: Make cpuidle_idle_call() void
    sched/idle: Reflow cpuidle_idle_call()
    sched/idle: Delay clearing the polling bit
    ...

    Linus Torvalds
     
  • …el/git/tip/tip into next

    Pull core locking updates from Ingo Molnar:
    "The main changes in this cycle were:

    - reduced/streamlined smp_mb__*() interface that allows more usecases
    and makes the existing ones less buggy, especially in rarer
    architectures

    - add rwsem implementation comments

    - bump up lockdep limits"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (33 commits)
    rwsem: Add comments to explain the meaning of the rwsem's count field
    lockdep: Increase static allocations
    arch: Mass conversion of smp_mb__*()
    arch,doc: Convert smp_mb__*()
    arch,xtensa: Convert smp_mb__*()
    arch,x86: Convert smp_mb__*()
    arch,tile: Convert smp_mb__*()
    arch,sparc: Convert smp_mb__*()
    arch,sh: Convert smp_mb__*()
    arch,score: Convert smp_mb__*()
    arch,s390: Convert smp_mb__*()
    arch,powerpc: Convert smp_mb__*()
    arch,parisc: Convert smp_mb__*()
    arch,openrisc: Convert smp_mb__*()
    arch,mn10300: Convert smp_mb__*()
    arch,mips: Convert smp_mb__*()
    arch,metag: Convert smp_mb__*()
    arch,m68k: Convert smp_mb__*()
    arch,m32r: Convert smp_mb__*()
    arch,ia64: Convert smp_mb__*()
    ...

    Linus Torvalds
     
  • Pull RCU changes from Ingo Molnar:
    "The main RCU changes in this cycle were:

    - RCU torture-test changes.

    - variable-name renaming cleanup.

    - update RCU documentation.

    - miscellaneous fixes.

    - patch to suppress RCU stall warnings while sysrq requests are being
    processed"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
    rcu: Provide API to suppress stall warnings while sysrc runs
    rcu: Variable name changed in tree_plugin.h and used in tree.c
    torture: Remove unused definition
    torture: Remove __init from torture_init_begin/end
    torture: Check for multiple concurrent torture tests
    locktorture: Remove reference to nonexistent Kconfig parameter
    rcutorture: Run rcu_torture_writer at normal priority
    rcutorture: Note diffs from git commits
    rcutorture: Add missing destroy_timer_on_stack()
    rcutorture: Explicitly test synchronous grace-period primitives
    rcutorture: Add tests for get_state_synchronize_rcu()
    rcutorture: Test RCU-sched primitives in TREE_PREEMPT_RCU kernels
    torture: Use elapsed time to detect hangs
    rcutorture: Check for rcu_torture_fqs creation errors
    torture: Better summary diagnostics for build failures
    torture: Notice if an all-zero cpumask is passed inside a critical section
    rcutorture: Make rcu_torture_reader() use cond_resched()
    sched,rcu: Make cond_resched() report RCU quiescent states
    percpu: Fix raw_cpu_inc_return()
    rcutorture: Export RCU grace-period kthread wait state to rcutorture
    ...

    Linus Torvalds
     

28 May, 2014

1 commit

  • The current deadlock detection logic does not work reliably due to the
    following early exit path:

    /*
     * Drop out, when the task has no waiters. Note,
     * top_waiter can be NULL, when we are in the deboosting
     * mode!
     */
    if (top_waiter && (!task_has_pi_waiters(task) ||
        top_waiter != task_top_pi_waiter(task)))
            goto out_unlock_pi;

    So this not only exits when the task has no waiters, it also exits
    unconditionally when the current waiter is not the top priority waiter
    of the task.

    So in a nested locking scenario, it might abort the lock chain walk
    and therefore miss a potential deadlock.

    Simple fix: Continue the chain walk, when deadlock detection is
    enabled.

    We also avoid the whole enqueue, if we detect the deadlock right away
    (A-A). It's an optimization, but also prevents that another waiter who
    comes in after the detection and before the task has undone the damage
    observes the situation and detects the deadlock and returns
    -EDEADLOCK, which is wrong as the other task is not in a deadlock
    situation.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Reviewed-by: Steven Rostedt
    Cc: Lai Jiangshan
    Cc: stable@vger.kernel.org
    Link: http://lkml.kernel.org/r/20140522031949.725272460@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

22 May, 2014

2 commits

  • …/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    " 1. Update RCU documentation. These were posted to LKML at
    https://lkml.org/lkml/2014/4/28/634.

    2. Miscellaneous fixes. These were posted to LKML at
    https://lkml.org/lkml/2014/4/28/645.

    3. Torture-test changes. These were posted to LKML at
    https://lkml.org/lkml/2014/4/28/667.

    4. Variable-name renaming cleanup, sent separately due to conflicts.
    This was posted to LKML at https://lkml.org/lkml/2014/5/13/854.

    5. Patch to suppress RCU stall warnings while sysrq requests are
    being processed. This patch is the RCU portions of the patch
    that Rik posted to LKML at https://lkml.org/lkml/2014/4/29/457.
    The reason for pushing this patch ahead instead of waiting until
    3.17 is that the NMI-based stack traces are messing up sysrq
    output, and in some cases also messing up the system as well."

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     
  • Signed-off-by: Ingo Molnar

    Ingo Molnar
     

15 May, 2014

2 commits

  • The torture tests are designed to run in isolation, but do not enforce
    this isolation. This commit therefore checks for concurrent torture
    tests, and refuses to start new tests while old tests are running.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • The locktorture module references CONFIG_LOCK_TORTURE_TEST_RUNNABLE,
    which does not exist. Which is a good thing, because otherwise
    randconfig testing could enable both rcutorture and locktorture
    concurrently, which the torture tests are not set up for. This
    commit therefore removes the reference, so that the test is immediately
    runnable only when it is inserted as a module.

    Reported-by: Paul Bolle
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     

14 May, 2014

1 commit

  • The current lock_torture_writer() spends too much time sleeping and not
    enough time hammering locks; an eight-CPU test will often be utilizing
    only a CPU or two. This commit therefore makes lock_torture_writer()
    sleep less and hammer more.

    Signed-off-by: Paul E. McKenney

    Paul E. McKenney
     

07 May, 2014

1 commit


06 May, 2014

1 commit


05 May, 2014

1 commit

  • It took me quite a while to understand how rwsem's count field
    manifested itself in different scenarios.

    Add comments to provide a quick reference to the rwsem's count
    field for each scenario where readers and writers are contending
    for the lock.

    Hopefully it will be useful for future maintenance of the code and
    for people to get up to speed on how the logic in the code works.
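
    As a reminder, the count arithmetic those comments describe is built
    from bias constants along these lines (the values shown are for the
    64-bit xadd variant and are illustrative):

    #define RWSEM_UNLOCKED_VALUE    0x00000000L
    #define RWSEM_ACTIVE_BIAS       0x00000001L
    #define RWSEM_ACTIVE_MASK       0xffffffffL
    #define RWSEM_WAITING_BIAS      (-RWSEM_ACTIVE_MASK-1)
    #define RWSEM_ACTIVE_READ_BIAS  RWSEM_ACTIVE_BIAS
    #define RWSEM_ACTIVE_WRITE_BIAS (RWSEM_WAITING_BIAS + RWSEM_ACTIVE_BIAS)

    /*
     * Each reader adds RWSEM_ACTIVE_READ_BIAS; a writer adds
     * RWSEM_ACTIVE_WRITE_BIAS, which also encodes the waiting part, so
     * the count alone tells whether readers, a writer or waiters are
     * present.
     */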

    Signed-off-by: Tim Chen
    Cc: Davidlohr Bueso
    Cc: Alex Shi
    Cc: Michel Lespinasse
    Cc: Rik van Riel
    Cc: Peter Hurley
    Cc: Paul E.McKenney
    Cc: Jason Low
    Cc: Peter Zijlstra
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Paul E. McKenney
    Link: http://lkml.kernel.org/r/1399060437.2970.146.camel@schen9-DESK
    Signed-off-by: Ingo Molnar

    Tim Chen
     

18 Apr, 2014

2 commits

  • Fuzzing a recent kernel with a large configuration hits the static
    allocation limits and disables lockdep.

    This patch doubles the limits.

    Signed-off-by: Sasha Levin
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1389208906-24338-1-git-send-email-sasha.levin@oracle.com
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Sasha Levin
     
  • Replace various -20/+19 hardcoded nice values with MIN_NICE/MAX_NICE.

    Signed-off-by: Dongsheng Yang
    Acked-by: Tejun Heo
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/ff13819fd09b7a5dba5ab5ae797f2e7019bdfa17.1394532288.git.yangds.fnst@cn.fujitsu.com
    Cc: devel@driverdev.osuosl.org
    Cc: devicetree@vger.kernel.org
    Cc: fcoe-devel@open-fcoe.org
    Cc: linux390@de.ibm.com
    Cc: linux-kernel@vger.kernel.org
    Cc: linux-mm@kvack.org
    Cc: linux-s390@vger.kernel.org
    Cc: linux-scsi@vger.kernel.org
    Cc: nbd-general@lists.sourceforge.net
    Cc: ocfs2-devel@oss.oracle.com
    Cc: openipmi-developer@lists.sourceforge.net
    Cc: qla2xxx-upstream@qlogic.com
    Cc: linux-arch@vger.kernel.org
    [ Consolidated the patches, twiddled the changelog. ]
    Signed-off-by: Ingo Molnar

    Dongsheng Yang
     

17 Apr, 2014

1 commit


11 Apr, 2014

1 commit

  • debug_mutex_unlock() would bail when !debug_locks and forgets to
    actually unlock.

    Reported-by: "Michael L. Semon"
    Reported-by: "Kirill A. Shutemov"
    Reported-by: Valdis Kletnieks
    Fixes: 6f008e72cd11 ("locking/mutex: Fix debug checks")
    Tested-by: Dave Jones
    Cc: Jason Low
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20140410141559.GE13658@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

08 Apr, 2014

1 commit

  • When the system has only one CPU, lglock is effectively a spinlock; map
    it directly to spinlock to eliminate the indirection and duplicate code.

    In addition to removing overhead, this drops 1.6k of code with a
    defconfig modified to have !CONFIG_SMP, and 1.1k with a minimal config.

    Signed-off-by: Josh Triplett
    Cc: Rusty Russell
    Cc: Michal Marek
    Cc: Thomas Gleixner
    Cc: David Howells
    Cc: "H. Peter Anvin"
    Cc: Nick Piggin
    Cc: Peter Zijlstra
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Josh Triplett
     

01 Apr, 2014

3 commits

  • Pull x86 LTO changes from Peter Anvin:
    "More infrastructure work in preparation for link-time optimization
    (LTO). Most of these changes is to make sure symbols accessed from
    assembly code are properly marked as visible so the linker doesn't
    remove them.

    My understanding is that the changes to support LTO are still not
    upstream in binutils, but are on the way there. This patchset should
    conclude the x86-specific changes, and remaining patches to actually
    enable LTO will be fed through the Kbuild tree (other than keeping up
    with changes to the x86 code base, of course), although not
    necessarily in this merge window"

    * 'x86-asmlinkage-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (25 commits)
    Kbuild, lto: Handle basic LTO in modpost
    Kbuild, lto: Disable LTO for asm-offsets.c
    Kbuild, lto: Add a gcc-ld script to let run gcc as ld
    Kbuild, lto: add ld-version and ld-ifversion macros
    Kbuild, lto: Drop .number postfixes in modpost
    Kbuild, lto, workaround: Don't warn for initcall_reference in modpost
    lto: Disable LTO for sys_ni
    lto: Handle LTO common symbols in module loader
    lto, workaround: Add workaround for initcall reordering
    lto: Make asmlinkage __visible
    x86, lto: Disable LTO for the x86 VDSO
    initconst, x86: Fix initconst mistake in ts5500 code
    initconst: Fix initconst mistake in dcdbas
    asmlinkage: Make trace_hardirqs_on/off_caller visible
    asmlinkage, x86: Fix 32bit memcpy for LTO
    asmlinkage Make __stack_chk_failed and memcmp visible
    asmlinkage: Mark rwsem functions that can be called from assembler asmlinkage
    asmlinkage: Make main_extable_sort_needed visible
    asmlinkage, mutex: Mark __visible
    asmlinkage: Make trace_hardirq visible
    ...

    Linus Torvalds
     
  • Pull scheduler changes from Ingo Molnar:
    "Bigger changes:

    - sched/idle restructuring: they are WIP preparation for deeper
    integration between the scheduler and idle state selection, by
    Nicolas Pitre.

    - add NUMA scheduling pseudo-interleaving, by Rik van Riel.

    - optimize cgroup context switches, by Peter Zijlstra.

    - RT scheduling enhancements, by Thomas Gleixner.

    The rest is smaller changes, non-urgent fixes and cleanups"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (68 commits)
    sched: Clean up the task_hot() function
    sched: Remove double calculation in fix_small_imbalance()
    sched: Fix broken setscheduler()
    sparc64, sched: Remove unused sparc64_multi_core
    sched: Remove unused mc_capable() and smt_capable()
    sched/numa: Move task_numa_free() to __put_task_struct()
    sched/fair: Fix endless loop in idle_balance()
    sched/core: Fix endless loop in pick_next_task()
    sched/fair: Push down check for high priority class task into idle_balance()
    sched/rt: Fix picking RT and DL tasks from empty queue
    trace: Replace hardcoding of 19 with MAX_NICE
    sched: Guarantee task priority in pick_next_task()
    sched/idle: Remove stale old file
    sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED
    cpuidle/arm64: Remove redundant cpuidle_idle_call()
    cpuidle/powernv: Remove redundant cpuidle_idle_call()
    sched, nohz: Exclude isolated cores from load balancing
    sched: Fix select_task_rq_fair() description comments
    workqueue: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    sys: Replace hardcoding of -20 and 19 with MIN_NICE and MAX_NICE
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:
    "Main changes:

    - Torture-test changes, including refactoring of rcutorture and
    introduction of a vestigial locktorture.

    - Real-time latency fixes.

    - Documentation updates.

    - Miscellaneous fixes"

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
    rcu: Provide grace-period piggybacking API
    rcu: Ensure kernel/rcu/rcu.h can be sourced/used stand-alone
    rcu: Fix sparse warning for rcu_expedited from kernel/ksysfs.c
    notifier: Substitute rcu_access_pointer() for rcu_dereference_raw()
    Documentation/memory-barriers.txt: Clarify release/acquire ordering
    rcutorture: Save kvm.sh output to log
    rcutorture: Add a lock_busted to test the test
    rcutorture: Place kvm-test-1-run.sh output into res directory
    rcutorture: Rename TREE_RCU-Kconfig.txt
    locktorture: Add kvm-recheck.sh plug-in for locktorture
    rcutorture: Gracefully handle NULL cleanup hooks
    locktorture: Add vestigial locktorture configuration
    rcutorture: Introduce "rcu" directory level underneath configs
    rcutorture: Rename kvm-test-1-rcu.sh
    rcutorture: Remove RCU dependencies from ver_functions.sh API
    rcutorture: Create CFcommon file for common Kconfig parameters
    rcutorture: Create config files for scripted test-the-test testing
    rcutorture: Add an rcu_busted to test the test
    locktorture: Add a lock-torture kernel module
    rcutorture: Abstract kvm-recheck.sh
    ...

    Linus Torvalds
     

12 Mar, 2014

1 commit

  • OK, so commit:

    1d8fe7dc8078 ("locking/mutexes: Unlock the mutex without the wait_lock")

    generates this boot warning when CONFIG_DEBUG_MUTEXES=y:

    WARNING: CPU: 0 PID: 139 at /usr/src/linux-2.6/kernel/locking/mutex-debug.c:82 debug_mutex_unlock+0x155/0x180() DEBUG_LOCKS_WARN_ON(lock->owner != current)

    And that makes sense, because as soon as we release the lock a
    new owner can come in...

    One would think that !__mutex_slowpath_needs_to_unlock()
    implementations suffer the same, but for DEBUG we fall back to
    mutex-null.h which has an unconditional 1 for that.

    The mutex debug code requires the mutex to be unlocked after
    doing the debug checks, otherwise it can find inconsistent
    state.

    Reported-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Cc: jason.low2@hp.com
    Link: http://lkml.kernel.org/r/20140312122442.GB27965@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

11 Mar, 2014

4 commits

  • Add in an extra reschedule in an attempt to avoid getting rescheduled
    the moment we've acquired the lock.

    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/n/tip-zah5eyn9gu7qlgwh9r6n2anc@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Since we want a task waiting for a mutex_lock() to go to sleep and
    reschedule on need_resched() we must be able to abort the
    mcs_spin_lock() around the adaptive spin.

    Therefore implement a cancelable mcs lock.
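
    Usage-wise, the adaptive spin then becomes bracketed by a pair of calls
    that can fail, along these lines (a sketch of the idea, not the literal
    diff; the osq field name is assumed):

    /*
     * Queue ourselves as a spinner; if this returns false we were
     * cancelled (e.g. need_resched() was set) and must fall back to
     * the regular sleeping slowpath instead of spinning.
     */
    if (!osq_lock(&lock->osq))
            goto slowpath;

    /* ... adaptive spinning on the current lock owner happens here ... */

    osq_unlock(&lock->osq);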

    Signed-off-by: Peter Zijlstra
    Cc: chegu_vinod@hp.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: Waiman.Long@hp.com
    Cc: torvalds@linux-foundation.org
    Cc: tglx@linutronix.de
    Cc: riel@redhat.com
    Cc: akpm@linux-foundation.org
    Cc: davidlohr@hp.com
    Cc: hpa@zytor.com
    Cc: andi@firstfloor.org
    Cc: aswin@hp.com
    Cc: scott.norton@hp.com
    Cc: Jason Low
    Link: http://lkml.kernel.org/n/tip-62hcl5wxydmjzd182zhvk89m@git.kernel.org
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • When running workloads that have high contention in mutexes on an 8 socket
    machine, mutex spinners would often spin for a long time with no lock owner.

    The main reason why this is occurring is in __mutex_unlock_common_slowpath(),
    if __mutex_slowpath_needs_to_unlock(), then the owner needs to acquire the
    mutex->wait_lock before releasing the mutex (setting lock->count to 1). When
    the wait_lock is contended, this delays the mutex from being released.
    We should be able to release the mutex without holding the wait_lock.
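
    Schematically, the fix reorders the unlock path like this (a sketch of
    the idea rather than the exact diff):

    /*
     * Release the mutex before taking wait_lock, so spinners can grab it
     * via the fastpath even while we contend on wait_lock below.
     */
    if (__mutex_slowpath_needs_to_unlock())
            atomic_set(&lock->count, 1);

    spin_lock_mutex(&lock->wait_lock, flags);
    /* ... wake up the next waiter, if any ... */
    spin_unlock_mutex(&lock->wait_lock, flags);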

    Signed-off-by: Jason Low
    Cc: chegu_vinod@hp.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: Waiman.Long@hp.com
    Cc: torvalds@linux-foundation.org
    Cc: tglx@linutronix.de
    Cc: riel@redhat.com
    Cc: akpm@linux-foundation.org
    Cc: davidlohr@hp.com
    Cc: hpa@zytor.com
    Cc: andi@firstfloor.org
    Cc: aswin@hp.com
    Cc: scott.norton@hp.com
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1390936396-3962-4-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low
     
  • The mutex->spin_mlock was introduced in order to ensure that only 1 thread
    spins for lock acquisition at a time to reduce cache line contention. When
    lock->owner is NULL and the lock->count is still not 1, the spinner(s) will
    continually release and obtain the lock->spin_mlock. This can generate
    quite a bit of overhead/contention, and also might just delay the spinner
    from getting the lock.

    This patch modifies the way optimistic spinners are queued by queuing before
    entering the optimistic spinning loop as opposed to acquiring before every
    call to mutex_spin_on_owner(). So in situations where the spinner requires
    a few extra spins before obtaining the lock, then there will only be 1 spinner
    trying to get the lock and it will avoid the overhead from unnecessarily
    unlocking and locking the spin_mlock.

    Signed-off-by: Jason Low
    Cc: tglx@linutronix.de
    Cc: riel@redhat.com
    Cc: akpm@linux-foundation.org
    Cc: davidlohr@hp.com
    Cc: hpa@zytor.com
    Cc: andi@firstfloor.org
    Cc: aswin@hp.com
    Cc: scott.norton@hp.com
    Cc: chegu_vinod@hp.com
    Cc: Waiman.Long@hp.com
    Cc: paulmck@linux.vnet.ibm.com
    Cc: torvalds@linux-foundation.org
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1390936396-3962-3-git-send-email-jason.low2@hp.com
    Signed-off-by: Ingo Molnar

    Jason Low