21 Jul, 2015

1 commit

  • Enabling locking-selftest in a VM guest may cause the following
    kernel panic:

    kernel BUG at .../kernel/locking/qspinlock_paravirt.h:137!

    This is because the pvqspinlock unlock function expects either
    _Q_LOCKED_VAL or _Q_SLOW_VAL in the lock byte. This patch suppresses
    that bug report when debug_locks_silent is set; otherwise, a warning
    is printed if the lock byte contains an unexpected value.

    With this patch applied, the kernel locking-selftest completed
    without any noise.

    Tested-by: Masami Hiramatsu
    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1436663959-53092-1-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
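
    A rough user-space sketch of the behaviour described above: warn about
    an unexpected lock byte instead of BUG()ing, and stay quiet while
    debug_locks_silent is set. The Q_* values and the flag are stand-ins
    for the kernel's own (_Q_LOCKED_VAL, _Q_SLOW_VAL, debug_locks_silent);
    this is only an illustration, not the actual patch:

        #include <stdint.h>
        #include <stdio.h>

        #define Q_LOCKED_VAL 1U         /* illustrative values */
        #define Q_SLOW_VAL   3U

        /* Stand-in for the flag the locking-selftest sets. */
        static int debug_locks_silent;

        /* Tolerate an unexpected lock byte instead of panicking. */
        static void pv_unlock_check(uint8_t lockval)
        {
            if (lockval == Q_LOCKED_VAL || lockval == Q_SLOW_VAL)
                return;                         /* expected states */

            if (!debug_locks_silent)
                fprintf(stderr,
                        "WARNING: pvqspinlock: lock byte is 0x%02x on unlock\n",
                        lockval);
            /* ...and continue rather than BUG() */
        }

        int main(void)
        {
            pv_unlock_check(Q_LOCKED_VAL);      /* silent */
            pv_unlock_check(0x42);              /* warns */
            debug_locks_silent = 1;
            pv_unlock_check(0x42);              /* ignored, as during the selftest */
            return 0;
        }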
     

25 Jun, 2015

2 commits

  • Pull scheduler updates from Thomas Gleixner:
    "This series of scheduler updates depends on sched/core and timers/core
    branches, which are already in your tree:

    - Scheduler balancing overhaul to plug a hard to trigger race which
    causes an oops in the balancer (Peter Zijlstra)

    - Lockdep updates which are related to the balancing updates (Peter
    Zijlstra)"

    * 'sched-hrtimers-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    sched,lockdep: Employ lock pinning
    lockdep: Implement lock pinning
    lockdep: Simplify lock_release()
    sched: Streamline the task migration locking a little
    sched: Move code around
    sched,dl: Fix sched class hopping CBS hole
    sched, dl: Convert switched_{from, to}_dl() / prio_changed_dl() to balance callbacks
    sched,dl: Remove return value from pull_dl_task()
    sched, rt: Convert switched_{from, to}_rt() / prio_changed_rt() to balance callbacks
    sched,rt: Remove return value from pull_rt_task()
    sched: Allow balance callbacks for check_class_changed()
    sched: Use replace normalize_task() with __sched_setscheduler()
    sched: Replace post_schedule with a balance callback list

    Linus Torvalds
     
  • Pull locking updates from Thomas Gleixner:
    "These locking updates depend on the alreay merged sched/core branch:

    - Lockless top waiter wakeup for rtmutex (Davidlohr)

    - Reduce hash bucket lock contention for PI futexes (Sebastian)

    - Documentation update (Davidlohr)"

    * 'sched-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking/rtmutex: Update stale plist comments
    futex: Lower the lock contention on the HB lock during wake up
    locking/rtmutex: Implement lockless top-waiter wakeup

    Linus Torvalds
     

23 Jun, 2015

4 commits

  • Pull timer updates from Thomas Gleixner:
    "A rather largish update for everything time and timer related:

    - Cache footprint optimizations for both hrtimers and timer wheel

    - Lower the NOHZ impact on systems which have NOHZ or timer migration
    disabled at runtime.

    - Optimize run time overhead of hrtimer interrupt by making the clock
    offset updates smarter

    - hrtimer cleanups and removal of restrictions to tackle some
    problems in sched/perf

    - Some more leap second tweaks

    - Another round of changes addressing the 2038 problem

    - First step to change the internals of clock event devices by
    introducing the necessary infrastructure

    - Allow constant folding for usecs/msecs_to_jiffies()

    - The usual pile of clockevent/clocksource driver updates

    The hrtimer changes contain updates to sched, perf and x86 as they
    depend on them plus changes all over the tree to cleanup API changes
    and redundant code, which got copied all over the place. The y2038
    changes touch s390 to remove the last non 2038 safe code related to
    boot/persistent clock"

    * 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (114 commits)
    clocksource: Increase dependencies of timer-stm32 to limit build wreckage
    timer: Minimize nohz off overhead
    timer: Reduce timer migration overhead if disabled
    timer: Stats: Simplify the flags handling
    timer: Replace timer base by a cpu index
    timer: Use hlist for the timer wheel hash buckets
    timer: Remove FIFO "guarantee"
    timers: Sanitize catchup_timer_jiffies() usage
    hrtimer: Allow hrtimer::function() to free the timer
    seqcount: Introduce raw_write_seqcount_barrier()
    seqcount: Rename write_seqcount_barrier()
    hrtimer: Fix hrtimer_is_queued() hole
    hrtimer: Remove HRTIMER_STATE_MIGRATE
    selftest: Timers: Avoid signal deadlock in leap-a-day
    timekeeping: Copy the shadow-timekeeper over the real timekeeper last
    clockevents: Check state instead of mode in suspend/resume path
    selftests: timers: Add leap-second timer edge testing to leap-a-day.c
    ntp: Do leapsecond adjustment in adjtimex read path
    time: Prevent early expiry of hrtimers[CLOCK_REALTIME] at the leap second edge
    ntp: Introduce and use SECS_PER_DAY macro instead of 86400
    ...

    Linus Torvalds
     
  • Pull scheduler updates from Ingo Molnar:
    "The main changes are:

    - lockless wakeup support for futexes and IPC message queues
    (Davidlohr Bueso, Peter Zijlstra)

    - Replace spinlocks with atomics in thread_group_cputimer(), to
    improve scalability (Jason Low)

    - NUMA balancing improvements (Rik van Riel)

    - SCHED_DEADLINE improvements (Wanpeng Li)

    - clean up and reorganize preemption helpers (Frederic Weisbecker)

    - decouple page fault disabling machinery from the preemption
    counter, to improve debuggability and robustness (David
    Hildenbrand)

    - SCHED_DEADLINE documentation updates (Luca Abeni)

    - topology CPU masks cleanups (Bartosz Golaszewski)

    - /proc/sched_debug improvements (Srikar Dronamraju)"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (79 commits)
    sched/deadline: Remove needless parameter in dl_runtime_exceeded()
    sched: Remove superfluous resetting of the p->dl_throttled flag
    sched/deadline: Drop duplicate init_sched_dl_class() declaration
    sched/deadline: Reduce rq lock contention by eliminating locking of non-feasible target
    sched/deadline: Make init_sched_dl_class() __init
    sched/deadline: Optimize pull_dl_task()
    sched/preempt: Add static_key() to preempt_notifiers
    sched/preempt: Fix preempt notifiers documentation about hlist_del() within unsafe iteration
    sched/stop_machine: Fix deadlock between multiple stop_two_cpus()
    sched/debug: Add sum_sleep_runtime to /proc/<pid>/sched
    sched/debug: Replace vruntime with wait_sum in /proc/sched_debug
    sched/debug: Properly format runnable tasks in /proc/sched_debug
    sched/numa: Only consider less busy nodes as numa balancing destinations
    Revert 095bebf61a46 ("sched/numa: Do not move past the balance point if unbalanced")
    sched/fair: Prevent throttling in early pick_next_task_fair()
    preempt: Reorganize the notrace definitions a bit
    preempt: Use preempt_schedule_context() as the official tracing preemption point
    sched: Make preempt_schedule_context() function-tracing safe
    x86: Remove cpu_sibling_mask() and cpu_core_mask()
    x86: Replace cpu_**_mask() with topology_**_cpumask()
    ...

    Linus Torvalds
     
  • Pull locking updates from Ingo Molnar:
    "The main changes are:

    - 'qspinlock' support, enabled on x86: queued spinlocks - these are
    now the spinlock variant used by x86 as they outperform ticket
    spinlocks in every category. (Waiman Long)

    - 'pvqspinlock' support on x86: paravirtualized variant of queued
    spinlocks. (Waiman Long, Peter Zijlstra)

    - 'qrwlock' support, enabled on x86: queued rwlocks. Similar to
    queued spinlocks, they are now the variant used by x86:

    CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
    CONFIG_QUEUED_SPINLOCKS=y
    CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
    CONFIG_QUEUED_RWLOCKS=y

    - various lockdep fixlets

    - various locking primitives cleanups, further WRITE_ONCE()
    propagation"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (24 commits)
    locking/lockdep: Remove hard coded array size dependency
    locking/qrwlock: Don't contend with readers when setting _QW_WAITING
    lockdep: Do not break user-visible string
    locking/arch: Rename set_mb() to smp_store_mb()
    locking/arch: Add WRITE_ONCE() to set_mb()
    rtmutex: Warn if trylock is called from hard/softirq context
    arch: Remove __ARCH_HAVE_CMPXCHG
    locking/rtmutex: Drop usage of __HAVE_ARCH_CMPXCHG
    locking/qrwlock: Rename QUEUE_RWLOCK to QUEUED_RWLOCKS
    locking/pvqspinlock: Rename QUEUED_SPINLOCK to QUEUED_SPINLOCKS
    locking/pvqspinlock: Replace xchg() by the more descriptive set_mb()
    locking/pvqspinlock, x86: Enable PV qspinlock for Xen
    locking/pvqspinlock, x86: Enable PV qspinlock for KVM
    locking/pvqspinlock, x86: Implement the paravirt qspinlock call patching
    locking/pvqspinlock: Implement simple paravirt support for the qspinlock
    locking/qspinlock: Revert to test-and-set on hypervisors
    locking/qspinlock: Use a simple write to grab the lock
    locking/qspinlock: Optimize for smaller NR_CPUS
    locking/qspinlock: Extract out code snippets for the next patch
    locking/qspinlock: Add pending bit
    ...

    Linus Torvalds
     
  • Pull RCU updates from Ingo Molnar:

    - Continued initialization/Kconfig updates: hide most Kconfig options
    from unsuspecting users.

    There's now a single high level configuration option:

    *
    * RCU Subsystem
    *
    Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)

    Which if answered in the negative, leaves us with a single
    interactive configuration option:

    Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)

    All the rest of the RCU options are configured automatically. Later
    on we'll remove this single leftover configuration option as well.

    - Remove all uses of RCU-protected array indexes: replace the
    rcu_[access|dereference]_index_check() APIs with READ_ONCE() and
    rcu_lockdep_assert()

    - RCU CPU-hotplug cleanups

    - Updates to Tiny RCU: a race fix and further code shrinkage.

    - RCU torture-testing updates: fixes, speedups, cleanups and
    documentation updates.

    - Miscellaneous fixes

    - Documentation updates

    * 'core-rcu-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (60 commits)
    rcutorture: Allow repetition factors in Kconfig-fragment lists
    rcutorture: Display "make oldconfig" errors
    rcutorture: Update TREE_RCU-kconfig.txt
    rcutorture: Make rcutorture scripts force RCU_EXPERT
    rcutorture: Update configuration fragments for rcutree.rcu_fanout_exact
    rcutorture: TASKS_RCU set directly, so don't explicitly set it
    rcutorture: Test SRCU cleanup code path
    rcutorture: Replace barriers with smp_store_release() and smp_load_acquire()
    locktorture: Change longdelay_us to longdelay_ms
    rcutorture: Allow negative values of nreaders to oversubscribe
    rcutorture: Exchange TREE03 and TREE08 NR_CPUS, speed up CPU hotplug
    rcutorture: Exchange TREE03 and TREE04 geometries
    locktorture: fix deadlock in 'rw_lock_irq' type
    rcu: Correctly handle non-empty Tiny RCU callback list with none ready
    rcutorture: Test both RCU-sched and RCU-bh for Tiny RCU
    rcu: Further shrink Tiny RCU by making empty functions static inlines
    rcu: Conditionally compile RCU's eqs warnings
    rcu: Remove prompt for RCU implementation
    rcu: Make RCU able to tolerate undefined CONFIG_RCU_KTHREAD_PRIO
    rcu: Make RCU able to tolerate undefined CONFIG_RCU_FANOUT_LEAF
    ...

    Linus Torvalds
     

20 Jun, 2015

2 commits

  • ... as of fb00aca4744 (rtmutex: Turn the plist into an rb-tree) we
    no longer use plists for queuing any waiters. Update stale comments.

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-4-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
     
    wake_futex_pi() wakes the task before releasing the hash bucket lock
    (HB). The first thing the woken-up task usually does is acquire the
    lock, which requires the HB lock. On SMP systems this leads to
    blocking on the HB lock, which is released by the owner shortly after.
    This patch rearranges the unlock path by first releasing the HB lock
    and then waking up the task.

    [ tglx: Fixed up the rtmutex unlock path ]

    Originally-from: Thomas Gleixner
    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Peter Zijlstra
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Davidlohr Bueso
    Link: http://lkml.kernel.org/r/20150617083350.GA2433@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

19 Jun, 2015

5 commits

  • Jiri reported a machine stuck in multi_cpu_stop() with
    migrate_swap_stop() as function and with the following src,dst cpu
    pairs: {11, 4} {13, 11} { 4, 13}

    4 11 13

    cpuM: queue(4 ,13)
    *Ma
    cpuN: queue(13,11)
    *N Na
    *M Mb
    cpuO: queue(11, 4)
    *O Oa
    *Nb
    *Ob

    Where *X denotes the CPU running the queueing of cpu-X and X[ab]
    denotes the first/second queued work.

    You'll observe that the top of the work queue for each of CPUs 4, 11
    and 13 is work queued from CPUs M, O and N respectively. IOW, a
    deadlock.

    Do away with the queueing trickery and introduce lg_double_lock() to
    lock both CPUs and fully serialize the stop_two_cpus() callers instead
    of the partial (and buggy) serialization we have now.

    Reported-by: Jiri Olsa
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Peter Zijlstra
    Cc: Rik van Riel
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150605153023.GH19282@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The current cmpxchg() loop in setting the _QW_WAITING flag for writers
    in queue_write_lock_slowpath() will contend with incoming readers
    causing possibly extra cmpxchg() operations that are wasteful. This
    patch changes the code to do a byte cmpxchg() to eliminate contention
    with new readers.

    A multithreaded microbenchmark running a 5M read_lock/write_lock loop
    on an 8-socket 80-core Westmere-EX machine running a 4.0-based kernel
    with the qspinlock patch has the following execution times (in ms)
    with and without the patch:

    With R:W ratio = 5:1

    Threads    w/o patch    with patch    % change
    -------    ---------    ----------    --------
       2            990           895       -9.6%
       3           2136          1912      -10.5%
       4           3166          2830      -10.6%
       5           3953          3629       -8.2%
       6           4628          4405       -4.8%
       7           5344          5197       -2.8%
       8           6065          6004       -1.0%
       9           6826          6811       -0.2%
      10           7599          7599        0.0%
      15           9757          9766       +0.1%
      20          13767         13817       +0.4%

    With small number of contending threads, this patch can improve
    locking performance by up to 10%. With more contending threads,
    however, the gain diminishes.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Arnd Bergmann
    Cc: Borislav Petkov
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1433863153-30722-3-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
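
    A user-space C11 sketch of the byte-cmpxchg idea described above. The
    union layout (writer byte in the low 8 bits, reader count above it,
    little-endian assumed) and the QW_* values mirror the kernel's qrwlock
    conventions, but this is only a model, not the kernel code:

        #include <stdatomic.h>
        #include <stdint.h>
        #include <stdio.h>

        #define QW_WAITING 0x01U    /* writer waiting for readers to drain */
        #define QW_WMASK   0xffU    /* writer byte */

        /* Layout sketch: low byte = writer state, upper bytes = reader count. */
        union qrwlock {
            _Atomic uint32_t cnts;
            struct {
                _Atomic uint8_t wmode;
                uint8_t rcnt[3];
            };
        };

        /* Old approach: full-word cmpxchg, which fails every time a reader
         * arrives and bumps the count living in the same word. */
        static void set_waiting_word(union qrwlock *l)
        {
            uint32_t cnts = atomic_load(&l->cnts);

            for (;;) {
                if (!(cnts & QW_WMASK) &&
                    atomic_compare_exchange_weak(&l->cnts, &cnts,
                                                 cnts | QW_WAITING))
                    return;
                cnts = atomic_load(&l->cnts);
            }
        }

        /* New approach: byte-wide cmpxchg on the writer byte only, so reader
         * traffic in the upper bytes can no longer make the attempt fail. */
        static void set_waiting_byte(union qrwlock *l)
        {
            uint8_t expect = 0;

            while (!atomic_compare_exchange_weak(&l->wmode, &expect, QW_WAITING))
                expect = 0;     /* only a competing writer gets us here */
        }

        int main(void)
        {
            union qrwlock a = { .cnts = 0 }, b = { .cnts = 0 };

            set_waiting_word(&a);
            set_waiting_byte(&b);
            printf("a=0x%08x b=0x%08x\n",
                   (unsigned)atomic_load(&a.cnts), (unsigned)atomic_load(&b.cnts));
            return 0;
        }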
     
    Add a lockdep annotation that WARNs if you 'accidentally' unlock a
    lock.

    This is especially helpful for code with callbacks, where the upper
    layer assumes a lock remains taken, but a lower layer thinks it may be
    able to drop and reacquire the lock.

    By unwittingly breaking up the lock, races can be introduced.

    Lock pinning is a lockdep annotation that helps with this: when you
    lockdep_pin_lock() a held lock, any unlock without a matching
    lockdep_unpin_lock() will produce a WARN. Think of this as a relative
    of lockdep_assert_held(), except you don't only assert that it is held
    now, but ensure it stays held until you release your assertion.

    RFC: a possible alternative API would be something like:

    int cookie = lockdep_pin_lock(&foo);
    ...
    lockdep_unpin_lock(&foo, cookie);

    Where we pick a random number for the pin_count; this makes it
    impossible to sneak a lock break in without also passing the right
    cookie along.

    I've not done this because it ends up generating code for !LOCKDEP,
    esp. if you need to pass the cookie around for some reason.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: ktkhai@parallels.com
    Cc: rostedt@goodmis.org
    Cc: juri.lelli@gmail.com
    Cc: pang.xunlei@linaro.org
    Cc: oleg@redhat.com
    Cc: wanpeng.li@linux.intel.com
    Cc: umgwanakikbuti@gmail.com
    Link: http://lkml.kernel.org/r/20150611124743.906731065@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
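
    A minimal user-space model of the pin-count idea (this is not the
    kernel lockdep API, just an illustration of the semantics): pinning a
    held lock makes any unlock before the matching unpin produce a
    warning, even if the lock is immediately re-acquired.

        #include <pthread.h>
        #include <stdio.h>

        struct pinned_lock {
            pthread_mutex_t mtx;
            unsigned int pin_count;     /* only touched while mtx is held */
        };

        static void pin_lock(struct pinned_lock *l)   { l->pin_count++; }
        static void unpin_lock(struct pinned_lock *l) { l->pin_count--; }

        static void unlock_checked(struct pinned_lock *l)
        {
            if (l->pin_count)
                fprintf(stderr, "WARN: unlocking a pinned lock (pins=%u)\n",
                        l->pin_count);
            pthread_mutex_unlock(&l->mtx);
        }

        /* A lower layer must not silently drop the lock the caller pinned. */
        static void lower_layer(struct pinned_lock *l)
        {
            unlock_checked(l);              /* triggers the warning...     */
            pthread_mutex_lock(&l->mtx);    /* ...even though it re-locks  */
        }

        int main(void)
        {
            struct pinned_lock l = { PTHREAD_MUTEX_INITIALIZER, 0 };

            pthread_mutex_lock(&l.mtx);
            pin_lock(&l);       /* like lockdep_pin_lock(): must stay held */
            lower_layer(&l);
            unpin_lock(&l);
            unlock_checked(&l);
            return 0;
        }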
     
    lock_release() takes a 'nested' argument that is mostly pointless
    these days; remove the implementation but leave the argument in place
    as a rudiment for now.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: ktkhai@parallels.com
    Cc: rostedt@goodmis.org
    Cc: juri.lelli@gmail.com
    Cc: pang.xunlei@linaro.org
    Cc: oleg@redhat.com
    Cc: wanpeng.li@linux.intel.com
    Cc: umgwanakikbuti@gmail.com
    Link: http://lkml.kernel.org/r/20150611124743.840411606@infradead.org
    Signed-off-by: Thomas Gleixner

    Peter Zijlstra
     
    Mark the task for later wakeup after the wait_lock has been released.
    This way, once the next task is awoken, it will have a better chance
    of finding the wait_lock free when it continues executing in
    __rt_mutex_slowlock() and tries to acquire the rtmutex, calling
    try_to_take_rt_mutex(). In contended scenarios, other tasks attempting
    to take the lock may acquire it first, right after the wait_lock is
    released, but (a) this can also occur with the current code, as it
    relies on the spinlock fairness, and (b) we are dealing with the
    top-waiter anyway, so it will always take the lock next.

    Signed-off-by: Davidlohr Bueso
    Cc: Steven Rostedt
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Sebastian Andrzej Siewior
    Cc: Davidlohr Bueso
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1432056298-18738-2-git-send-email-dave@stgolabs.net
    Signed-off-by: Thomas Gleixner

    Davidlohr Bueso
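
    A small user-space sketch of the ordering described above, loosely
    modeled on the kernel's deferred wake-queue idea: the waiter is only
    marked for wakeup while wait_lock is held, and the actual wakeup
    happens after the lock has been dropped. The types below are stand-ins:

        #include <pthread.h>
        #include <stdio.h>

        struct task { const char *name; };

        /* Deferred-wake list in the spirit of the kernel's wake_q. */
        struct wake_q {
            struct task *tasks[8];
            int n;
        };

        static void wake_q_add(struct wake_q *q, struct task *t)
        {
            q->tasks[q->n++] = t;           /* only mark for later wakeup */
        }

        static void wake_up_q(struct wake_q *q)
        {
            for (int i = 0; i < q->n; i++)
                printf("waking %s (wait_lock already released)\n",
                       q->tasks[i]->name);
            q->n = 0;
        }

        static pthread_mutex_t wait_lock = PTHREAD_MUTEX_INITIALIZER;
        static struct task top_waiter = { "top-waiter" };

        static void slow_unlock(void)
        {
            struct wake_q q = { .n = 0 };

            pthread_mutex_lock(&wait_lock);
            /* ... dequeue the top waiter ... */
            wake_q_add(&q, &top_waiter);
            pthread_mutex_unlock(&wait_lock);

            wake_up_q(&q);  /* wakeup happens after wait_lock is free */
        }

        int main(void) { slow_unlock(); return 0; }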
     

07 Jun, 2015

1 commit

  • The lock_class iteration of /proc/lock_stat is not serialized against
    the lockdep_free_key_range() call from module unload.

    Therefore it can happen that we find a class of which ->name/->key are
    no longer valid.

    There is a further bug in zap_class() that left ->name dangling. Cure
    this. Use RCU_INIT_POINTER(), because the new value is NULL and so no
    ordering is required.

    Since lockdep_free_key_range() is rcu_sched serialized, we can read
    both ->name and ->key under rcu_read_lock_sched() (preempt-disable)
    and be assured that if we observe a !NULL value it stays safe to use
    for as long as we hold that lock.

    If we observe both NULL, skip the entry.

    Reported-by: Jerome Marchand
    Tested-by: Jerome Marchand
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150602105013.GS3644@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

03 Jun, 2015

1 commit

  • Remove the line-break in the user-visible string and add the
    missing space in this error message:

    WARNING: lockdep init error! lock-(console_sem).lock was acquiredbefore lockdep_init

    Also:

    - don't yell, it's just a debug warning

    - denote references to function calls with '()'

    - standardize the lock name quoting

    - and finish the sentence.

    The result:

    WARNING: lockdep init error: lock '(console_sem).lock' was acquired before lockdep_init().

    Signed-off-by: Borislav Petkov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20150602133827.GD19887@pd.tnic
    [ Added a few more stylistic tweaks to the error message. ]
    Signed-off-by: Ingo Molnar

    Signed-off-by: Ingo Molnar

    Borislav Petkov
     

02 Jun, 2015

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU changes from Paul E. McKenney:

    - Initialization/Kconfig updates: hide most Kconfig options from unsuspecting users.
    There's now a single high level configuration option:

    *
    * RCU Subsystem
    *
    Make expert-level adjustments to RCU configuration (RCU_EXPERT) [N/y/?] (NEW)

    Which if answered in the negative, leaves us with a single interactive
    configuration option:

    Offload RCU callback processing from boot-selected CPUs (RCU_NOCB_CPU) [N/y/?] (NEW)

    All the rest of the RCU options are configured automatically.

    - Remove all uses of RCU-protected array indexes: replace the
    rcu_[access|dereference]_index_check() APIs with READ_ONCE() and rcu_lockdep_assert().

    - RCU CPU-hotplug cleanups.

    - Updates to Tiny RCU: a race fix and further code shrinkage.

    - RCU torture-testing updates: fixes, speedups, cleanups and
    documentation updates.

    - Miscellaneous fixes.

    - Documentation updates.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

14 May, 2015

1 commit

  • rt_mutex_trylock() must be called from thread context. It can be
    called from atomic regions (preemption or interrupts disabled), but
    not from hard/softirq/nmi context. Add a warning to alert abusers.

    The reasons for this are:

    1) There is a potential deadlock in the slowpath

    2) Another cpu which blocks on the rtmutex will boost the task
    which allegedly locked the rtmutex, but that cannot work
    because the hard/softirq context borrows the task context.

    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Sebastian Siewior

    Thomas Gleixner
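
    A user-space sketch of the guard described above. The in_*() context
    predicates are stand-in booleans here, since only the kernel can ask
    which context it is running in; the exact checks in the real patch may
    differ:

        #include <stdbool.h>
        #include <stdio.h>

        /* Stand-ins for the kernel's context predicates. */
        static bool in_irq_ctx, in_nmi_ctx, in_serving_softirq_ctx;

        static bool warn_on_once(bool cond, const char *what)
        {
            static bool warned;

            if (cond && !warned) {
                warned = true;
                fprintf(stderr, "WARNING: %s\n", what);
            }
            return cond;
        }

        /* Refuse (and warn) when called from hard/soft irq or NMI context. */
        static int rt_mutex_trylock_sketch(void)
        {
            if (warn_on_once(in_irq_ctx || in_nmi_ctx || in_serving_softirq_ctx,
                             "rt_mutex_trylock() called from invalid context"))
                return 0;       /* behave as if the trylock failed */

            /* ... real trylock fast/slow path would go here ... */
            return 1;
        }

        int main(void)
        {
            printf("thread context: %d\n", rt_mutex_trylock_sketch());
            in_serving_softirq_ctx = true;
            printf("softirq context: %d\n", rt_mutex_trylock_sketch());
            return 0;
        }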
     

13 May, 2015

1 commit

  • The rtmutex code is the only user of __HAVE_ARCH_CMPXCHG and we have a few
    other user of cmpxchg() which do not care about __HAVE_ARCH_CMPXCHG. This
    define was first introduced in 23f78d4a0 ("[PATCH] pi-futex: rt mutex core")
    which is v2.6.18. The generic cmpxchg was introduced later in 068fbad288
    ("Add cmpxchg_local to asm-generic for per cpu atomic operations") which is
    v2.6.25.
    Back then something was required to get rtmutex working with the fast
    path on architectures without cmpxchg and this seems to be the result.

    It popped up recently on rt-users because ARM (v6+) does not define
    __HAVE_ARCH_CMPXCHG (even though it implements it), which results in
    slower locking performance in the fast path.
    To put some numbers on it: preempt -RT, am335x, 10 loops of
    100000 invocations of rt_spin_lock() + rt_spin_unlock() (time "total" is
    the average of the 10 loops for the 100000 invocations, "loop" is
    "total / 100000 * 1000"):

    cmpxchg |     slowpath used    ||      cmpxchg used
            |    total    |  loop  ||    total     |  loop
    --------|-------------|--------||--------------|--------
    ARMv6   |  9129.4 us  |  91 ns ||   3311.9 us  |  33 ns
    generic |  9360.2 us  |  94 ns ||  10834.6 us  | 108 ns
    -------------------------------||----------------------

    Forcing it to the generic cmpxchg() made things worse for the slowpath
    and even worse in the cmpxchg() path. It boils down to 14 ns more per
    lock+unlock in a cache-hot loop, so it might not be that much in the
    real world. The last test was a substitute for a pre-ARMv6 machine,
    but then I was able to perform the comparison on an imx28, which is
    ARMv5 and therefore always uses the generic cmpxchg implementation.
    And the numbers:

             |     total     |   loop
    ---------|---------------|----------
    slowpath |  263937.2 us  |  2639 ns
    cmpxchg  |   16934.2 us  |   169 ns
    --------------------------------------

    The numbers are larger since the machine is slower in general.
    However, letting rtmutex use cmpxchg() instead of the slowpath seems
    to improve things.

    Since from the ARM (tested on am335x + imx28) point of view always
    using cmpxchg() in rt_mutex_lock() + rt_mutex_unlock() makes sense,
    I would drop the define.

    Signed-off-by: Sebastian Andrzej Siewior
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: will.deacon@arm.com
    Cc: linux-arm-kernel@lists.infradead.org
    Link: http://lkml.kernel.org/r/20150225175613.GE6823@linutronix.de
    Signed-off-by: Thomas Gleixner

    Sebastian Andrzej Siewior
     

12 May, 2015

1 commit

  • To be consistent with the queued spinlocks which use
    CONFIG_QUEUED_SPINLOCKS config parameter, the one for the queued
    rwlocks is now renamed to CONFIG_QUEUED_RWLOCKS.

    Signed-off-by: Waiman Long
    Cc: Borislav Petkov
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1431367031-36697-1-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

11 May, 2015

2 commits

  • Valentin Rothberg reported that we use CONFIG_QUEUED_SPINLOCKS
    in arch/x86/kernel/paravirt_patch_32.c, while the symbol is
    called CONFIG_QUEUED_SPINLOCK. (Note the extra 'S')

    But the typo was natural: the proper English term for such
    a generic object would be 'queued spinlocks' - so rename
    this and related symbols accordingly to the plural form.

    Reported-by: Valentin Rothberg
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: Waiman Long
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The xchg() function was used in pv_wait_node() to set a certain
    value and provide a memory barrier which is what the set_mb()
    function is for. This patch replaces the xchg() call by
    set_mb().

    Suggested-by: Linus Torvalds
    Signed-off-by: Waiman Long
    Cc: Douglas Hatch
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Waiman Long
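
    A user-space C11 analogue of the two idioms: the xchg()-based store
    pays for a return value nobody uses, while the set_mb()-style store
    plus full barrier expresses the intent directly (set_mb() was later
    renamed to smp_store_mb(), as listed earlier in this log). This is a
    model of the idea, not the kernel primitives themselves:

        #include <stdatomic.h>
        #include <stdio.h>

        static _Atomic int flag;

        /* What the call site needed: store a value, then a full barrier. */

        static void store_with_xchg(int v)
        {
            (void)atomic_exchange(&flag, v);    /* value + barrier, result unused */
        }

        static void store_with_set_mb(int v)    /* the set_mb()/smp_store_mb() idiom */
        {
            atomic_store_explicit(&flag, v, memory_order_relaxed);
            atomic_thread_fence(memory_order_seq_cst);  /* ~ smp_mb() */
        }

        int main(void)
        {
            store_with_xchg(1);
            store_with_set_mb(2);
            printf("flag = %d\n", atomic_load(&flag));
            return 0;
        }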
     

08 May, 2015

9 commits

  • Provide a separate (second) version of the spin_lock_slowpath for
    paravirt along with a special unlock path.

    The second slowpath is generated by adding a few pv hooks to the
    normal slowpath, but where those will compile away for the native
    case, they expand into special wait/wake code for the pv version.

    The actual MCS queue can use extra storage in the mcs_nodes[] array to
    keep track of state and therefore uses directed wakeups.

    The head contender has no such storage directly visible to the
    unlocker. So the unlocker searches a hash table with open addressing
    using a simple binary Galois linear feedback shift register.

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1429901803-29771-9-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
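
    To illustrate the "open addressing with a Galois LFSR probe sequence"
    idea from the entry above, here is a self-contained sketch; the tap
    mask, table size and hash seed are arbitrary choices for the example
    and not the kernel's actual parameters:

        #include <stdint.h>
        #include <stdio.h>

        /* 16-bit Galois LFSR step; 0xB400 is a common maximal-length tap mask. */
        static uint16_t lfsr_step(uint16_t x)
        {
            unsigned int lsb = x & 1;

            x >>= 1;
            if (lsb)
                x ^= 0xB400;
            return x;
        }

        #define NBUCKETS 256            /* power of two so we can mask */

        struct entry { void *lock; int node; };
        static struct entry table[NBUCKETS];

        /* Open addressing: successive probe positions come from the LFSR. */
        static void hash_insert(void *lock, int node)
        {
            uint16_t h = (uint16_t)((uintptr_t)lock >> 4) | 1;  /* non-zero seed */

            for (;;) {                  /* sketch: assumes the table never fills */
                struct entry *e = &table[h & (NBUCKETS - 1)];

                if (!e->lock) {
                    e->lock = lock;
                    e->node = node;
                    return;
                }
                h = lfsr_step(h);       /* next probe position */
            }
        }

        static int hash_find(void *lock)
        {
            uint16_t h = (uint16_t)((uintptr_t)lock >> 4) | 1;

            while (table[h & (NBUCKETS - 1)].lock != lock)
                h = lfsr_step(h);
            return table[h & (NBUCKETS - 1)].node;
        }

        int main(void)
        {
            int dummy;

            hash_insert(&dummy, 42);
            printf("node for lock: %d\n", hash_find(&dummy));
            return 0;
        }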
     
  • When we detect a hypervisor (!paravirt, see qspinlock paravirt support
    patches), revert to a simple test-and-set lock to avoid the horrors
    of queue preemption.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-8-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
     
    Currently, atomic_cmpxchg() is used to get the lock. However, this
    is not really necessary if there is more than one task in the queue
    and the queue head doesn't need to reset the tail code. For that case,
    a simple write to set the lock bit is enough, as the queue head will
    be the only one eligible to get the lock as long as it checks that
    both the lock and pending bits are not set. The current pending-bit
    waiting code ensures that the pending bit will not be set once the
    tail code in the lock is set.

    With that change, there is some slight improvement in the performance
    of the queued spinlock in the 5M-loop micro-benchmark run on a
    4-socket Westmere-EX machine, as shown in the tables below.

    [Standalone/Embedded - same node]
    # of tasks    Before patch    After patch    %Change
    ----------    ------------    -----------    -------
         3          2324/2321      2248/2265     -3%/-2%
         4          2890/2896      2819/2831     -2%/-2%
         5          3611/3595      3522/3512     -2%/-2%
         6          4281/4276      4173/4160     -3%/-3%
         7          5018/5001      4875/4861     -3%/-3%
         8          5759/5750      5563/5568     -3%/-3%

    [Standalone/Embedded - different nodes]
    # of tasks    Before patch     After patch     %Change
    ----------    ------------     -----------     -------
         3         12242/12237     12087/12093     -1%/-1%
         4         10688/10696     10507/10521     -2%/-2%

    It was also found that this change produced a much bigger performance
    improvement in the newer IvyBridge-EX chip, essentially closing the
    performance gap between the ticket spinlock and the queued spinlock.

    The disk workload of the AIM7 benchmark was run on a 4-socket
    Westmere-EX machine with both ext4 and xfs RAM disks at 3000 users
    on a 3.14 based kernel. The results of the test runs were:

    AIM7 XFS Disk Test
    kernel          JPM      Real Time    Sys Time    Usr Time
    ----------    -------    ---------    --------    --------
    ticketlock    5678233       3.17        96.61       5.81
    qspinlock     5750799       3.13        94.83       5.97

    AIM7 EXT4 Disk Test
    kernel          JPM      Real Time    Sys Time    Usr Time
    ----------    -------    ---------    --------    --------
    ticketlock    1114551      16.15       509.72       7.11
    qspinlock     2184466       8.24       232.99       6.01

    The ext4 filesystem run had a much higher spinlock contention than
    the xfs filesystem run.

    The "ebizzy -m" test was also run with the following results:

    kernel        records/s    Real Time    Sys Time    Usr Time
    ----------    ---------    ---------    --------    --------
    ticketlock       2075        10.00       216.35       3.49
    qspinlock        3023        10.00       198.20       4.80

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-7-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • When we allow for a max NR_CPUS < 2^14 we can optimize the pending
    wait-acquire and the xchg_tail() operations.

    By growing the pending bit to a byte, we reduce the tail to 16 bits.
    This means we can use xchg16 for the tail part and do away with all
    the repeated cmpxchg() operations.

    This in turn allows us to unconditionally acquire; the locked state
    as observed by the wait loops cannot change. And because both locked
    and pending are now a full byte we can use simple stores for the
    state transition, obviating one atomic operation entirely.

    This optimization is needed to make the qspinlock achieve performance
    parity with ticket spinlock at light load.

    All this is horribly broken on Alpha pre EV56 (and any other arch that
    cannot do single-copy atomic byte stores).

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-6-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
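
    A user-space C11 sketch of the word layout this enables (locked byte,
    pending byte, 16-bit tail; little-endian assumed) and of how the tail
    can then be published with a single 16-bit exchange. Field offsets
    follow the description above; treat the code as an illustration, not
    the kernel implementation:

        #include <stdatomic.h>
        #include <stdint.h>
        #include <stdio.h>

        /*
         * Word layout when NR_CPUS < 2^14:
         *   bits  0- 7: locked byte
         *   bits  8-15: pending byte
         *   bits 16-31: tail (node index + CPU number)
         */
        union qspinlock {
            _Atomic uint32_t val;
            struct {
                _Atomic uint8_t  locked;
                _Atomic uint8_t  pending;
                _Atomic uint16_t tail;
            };
        };

        /* Publish ourselves as the new tail with one 16-bit xchg, no cmpxchg loop. */
        static uint16_t xchg_tail(union qspinlock *l, uint16_t tail)
        {
            return atomic_exchange(&l->tail, tail);
        }

        /* Both locked and pending are full bytes, so plain stores suffice. */
        static void clear_pending_set_locked(union qspinlock *l)
        {
            atomic_store(&l->pending, 0);
            atomic_store(&l->locked, 1);
        }

        int main(void)
        {
            union qspinlock lock = { .val = 0 };
            uint16_t prev = xchg_tail(&lock, (3 /* cpu+1 */ << 2) | 0 /* idx */);

            clear_pending_set_locked(&lock);
            printf("prev tail %u, word now 0x%08x\n", prev,
                   (unsigned)atomic_load(&lock.val));
            return 0;
        }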
     
  • This is a preparatory patch that extracts out the following 2 code
    snippets to prepare for the next performance optimization patch.

    1) the logic for the exchange of new and previous tail code words
    into a new xchg_tail() function.
    2) the logic for clearing the pending bit and setting the locked bit
    into a new clear_pending_set_locked() function.

    This patch also simplifies the trylock operation before queuing by
    calling queued_spin_trylock() directly.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-5-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Because the qspinlock needs to touch a second cacheline (the per-cpu
    mcs_nodes[]), add a pending bit and allow a single in-word spinner
    before we punt to the second cacheline.

    It is possible to observe the pending bit without the locked bit when
    the last owner has just released but the pending owner has not yet
    taken ownership.

    In this case we would normally queue -- because the pending bit is
    already taken. However, in this case the pending bit is guaranteed
    to be released 'soon', therefore wait for it and avoid queueing.

    Signed-off-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-4-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Peter Zijlstra (Intel)
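
    A simplified, single-threaded sketch of the fast-path decision the
    pending bit enables: one in-word spinner claims the pending bit and
    waits for the lock byte to clear, and anything more contended falls
    back to queueing on the MCS node. The bit positions and the exact
    checks below are illustrative only:

        #include <stdatomic.h>
        #include <stdint.h>
        #include <stdio.h>

        #define Q_LOCKED    (1U << 0)
        #define Q_PENDING   (1U << 8)
        #define Q_TAIL_MASK (~0U << 16)     /* non-zero: someone is queued */

        static _Atomic uint32_t qlock;

        /* Return 1 if the lock was taken via the pending path, 0 if the
         * caller should fall back to the MCS queue. */
        static int try_pending_path(void)
        {
            uint32_t val = atomic_load(&qlock);

            if (val & (Q_PENDING | Q_TAIL_MASK))
                return 0;                   /* already a waiter: go queue */

            /* claim the single in-word spinner slot */
            if (!atomic_compare_exchange_strong(&qlock, &val, val | Q_PENDING))
                return 0;

            /* wait for the current owner to drop the lock byte */
            while (atomic_load(&qlock) & Q_LOCKED)
                ;

            /* take the lock: set LOCKED, clear PENDING */
            atomic_fetch_add(&qlock, Q_LOCKED - Q_PENDING);
            return 1;
        }

        int main(void)
        {
            printf("uncontended pending path: %s\n",
                   try_pending_path() ? "acquired" : "queued");
            return 0;
        }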
     
    This patch introduces a new generic queued spinlock implementation
    that can serve as an alternative to the default ticket spinlock. This
    queued spinlock should be almost as fair as the ticket spinlock. It
    has about the same speed in the single-threaded case and can be much
    faster in high-contention situations, especially when the spinlock is
    embedded within the data structure to be protected.

    Only in light to moderate contention where the average queue depth
    is around 1-3 will this queued spinlock be potentially a bit slower
    due to the higher slowpath overhead.

    This queued spinlock is especially suited to NUMA machines with a
    large number of cores, as the chance of spinlock contention is much
    higher on those machines. The cost of contention is also higher
    because of slower inter-node memory traffic.

    Due to the fact that spinlocks are acquired with preemption disabled,
    the process will not be migrated to another CPU while it is trying
    to get a spinlock. Ignoring interrupt handling, a CPU can only be
    contending in one spinlock at any one time. Counting soft IRQ, hard
    IRQ and NMI, a CPU can only have a maximum of 4 concurrent lock waiting
    activities. By allocating a set of per-cpu queue nodes and used them
    to form a waiting queue, we can encode the queue node address into a
    much smaller 24-bit size (including CPU number and queue node index)
    leaving one byte for the lock.

    Please note that the queue node is only needed when waiting for the
    lock. Once the lock is acquired, the queue node can be released to
    be used later.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Boris Ostrovsky
    Cc: Borislav Petkov
    Cc: Daniel J Blueman
    Cc: David Vrabel
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Konrad Rzeszutek Wilk
    Cc: Linus Torvalds
    Cc: Oleg Nesterov
    Cc: Paolo Bonzini
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Raghavendra K T
    Cc: Rik van Riel
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Cc: virtualization@lists.linux-foundation.org
    Cc: xen-devel@lists.xenproject.org
    Link: http://lkml.kernel.org/r/1429901803-29771-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
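
    A sketch of the encoding described above: per-CPU queue nodes (at most
    4 nesting levels) addressed by a small (CPU number + 1, node index)
    pair instead of a pointer, so the whole tail fits into the bits left
    over next to the lock byte. The exact shift and sizes here are
    illustrative:

        #include <stdint.h>
        #include <stdio.h>

        #define MAX_NODES 4     /* task, softirq, hardirq, NMI nesting levels */
        #define NR_CPUS   64    /* example size for this sketch */

        struct mcs_node { struct mcs_node *next; int locked; };

        /* Per-CPU array of queue nodes, as described above. */
        static struct mcs_node mcs_nodes[NR_CPUS][MAX_NODES];

        /*
         * Encode (cpu, idx) into a small tail value; the CPU is stored +1
         * so that tail == 0 can mean "no queue".
         */
        static uint32_t encode_tail(int cpu, int idx)
        {
            return ((uint32_t)(cpu + 1) << 2) | (uint32_t)idx;
        }

        static struct mcs_node *decode_tail(uint32_t tail)
        {
            int cpu = (int)(tail >> 2) - 1;
            int idx = (int)(tail & 3);

            return &mcs_nodes[cpu][idx];
        }

        int main(void)
        {
            uint32_t tail = encode_tail(17, 2);

            printf("tail = 0x%x, node = %p (expect %p)\n", tail,
                   (void *)decode_tail(tail), (void *)&mcs_nodes[17][2]);
            return 0;
        }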
     
  • In up_write()/up_read(), rwsem_wake() will be called whenever it
    detects that some writers/readers are waiting. The rwsem_wake()
    function will take the wait_lock and call __rwsem_do_wake() to do the
    real wakeup. For a heavily contended rwsem, doing a spin_lock() on
    wait_lock will cause further contention on the heavily contended rwsem
    cacheline resulting in delay in the completion of the up_read/up_write
    operations.

    This patch makes the wait_lock taking and the call to __rwsem_do_wake()
    optional if at least one spinning writer is present. The spinning
    writer will be able to take the rwsem and call rwsem_wake() later
    when it calls up_write(). With the presence of a spinning writer,
    rwsem_wake() will now try to acquire the lock using trylock. If that
    fails, it will just quit.

    Suggested-by: Peter Zijlstra (Intel)
    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Davidlohr Bueso
    Acked-by: Jason Low
    Cc: Andrew Morton
    Cc: Borislav Petkov
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1430428337-16802-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
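
    A user-space model of the optimization: when an optimistic-spinning
    writer is present, the wake path only trylocks the wait_lock and
    simply gives up if that fails, since the spinner will take the rwsem
    and run the wake path itself later. The structure below is a stand-in,
    not the kernel's struct rw_semaphore:

        #include <pthread.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct rwsem_model {
            pthread_mutex_t wait_lock;
            bool has_spinner;   /* an active optimistic-spinning writer */
            int waiters;
        };

        static void rwsem_wake_model(struct rwsem_model *sem)
        {
            if (sem->has_spinner) {
                /* Never spin on wait_lock: the spinner will do the wakeup
                 * from its own up_write() later anyway. */
                if (pthread_mutex_trylock(&sem->wait_lock) != 0)
                    return;             /* contended: just quit */
            } else {
                pthread_mutex_lock(&sem->wait_lock);
            }

            if (sem->waiters)
                printf("waking %d waiter(s)\n", sem->waiters);
            pthread_mutex_unlock(&sem->wait_lock);
        }

        int main(void)
        {
            struct rwsem_model sem = {
                .wait_lock = PTHREAD_MUTEX_INITIALIZER,
                .has_spinner = true,
                .waiters = 1,
            };

            rwsem_wake_model(&sem);
            return 0;
        }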
     
  • Ronny reported that the following scenario is not handled correctly:

    T1 (prio = 10)
       lock(rtmutex);

    T2 (prio = 20)
       lock(rtmutex)
          boost T1

    T1 (prio = 20)
       sys_set_scheduler(prio = 30)
       T1 prio = 30
       ....
       sys_set_scheduler(prio = 10)
       T1 prio = 30

    The last step is wrong as T1 should now be back at prio 20.

    Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()")
    only handles the case where a boosted task tries to lower its
    priority.

    Fix it by taking the new effective priority into account for the
    decision whether a change of the priority is required.

    Reported-by: Ronny Meeus
    Tested-by: Steven Rostedt
    Signed-off-by: Thomas Gleixner
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Steven Rostedt
    Cc:
    Cc: Borislav Petkov
    Cc: H. Peter Anvin
    Cc: Mike Galbraith
    Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
    Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
    Signed-off-by: Ingo Molnar

    Thomas Gleixner
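
    A tiny sketch of the decision the fix is about, using the same
    convention as the scenario above (a larger number means higher
    priority, unlike the kernel's internal inverted prio values): the
    priority that matters is the newly requested one boosted by the top
    pi-waiter, and only a change of that effective priority requires
    further action.

        #include <stdio.h>

        static int max(int a, int b) { return a > b ? a : b; }

        /* Effective priority: own (requested) priority boosted by the top waiter. */
        static int effective_prio(int requested_prio, int top_waiter_prio)
        {
            return max(requested_prio, top_waiter_prio);
        }

        int main(void)
        {
            int top_waiter = 20;    /* T2 blocks on the rtmutex */

            int before = effective_prio(30, top_waiter);  /* sys_set_scheduler(30) */
            int after  = effective_prio(10, top_waiter);  /* sys_set_scheduler(10) */

            printf("effective prio: %d -> %d (change needed: %s)\n",
                   before, after, before != after ? "yes" : "no");
            return 0;
        }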
     

22 Apr, 2015

1 commit

    The check for hrtimer_active() after starting the timer is
    pointless. If the timer is inactive, it has already expired and
    therefore the task pointer is already NULL.

    Signed-off-by: Thomas Gleixner
    Acked-by: Peter Zijlstra
    Cc: Preeti U Murthy
    Cc: Viresh Kumar
    Cc: Marcelo Tosatti
    Cc: Frederic Weisbecker
    Link: http://lkml.kernel.org/r/20150414203503.081830481@linutronix.de
    Signed-off-by: Thomas Gleixner

    Thomas Gleixner
     

17 Apr, 2015

1 commit

  • During sysrq's show-held-locks command it is possible that
    hlock_class() returns NULL for a given lock. The result is then (after
    the warning):

    |BUG: unable to handle kernel NULL pointer dereference at 0000001c
    |IP: [] get_usage_chars+0x5/0x100
    |Call Trace:
    | [] print_lock_name+0x23/0x60
    | [] print_lock+0x5d/0x7e
    | [] lockdep_print_held_locks+0x74/0xe0
    | [] debug_show_all_locks+0x132/0x1b0
    | [] sysrq_handle_showlocks+0x8/0x10

    This *might* happen because the thread on the other CPU drops the lock
    after we have looked at ->lockdep_depth, so ->held_locks no longer
    points to a lock that is held.

    The fix here is to simply ignore it and continue.

    Reported-by: Andreas Messerschmid
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Sebastian Andrzej Siewior
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

14 Apr, 2015

1 commit

  • Pull core locking changes from Ingo Molnar:
    "Main changes:

    - jump label asm preparatory work for PowerPC (Anton Blanchard)

    - rwsem optimizations and cleanups (Davidlohr Bueso)

    - mutex optimizations and cleanups (Jason Low)

    - futex fix (Oleg Nesterov)

    - remove broken atomicity checks from {READ,WRITE}_ONCE() (Peter
    Zijlstra)"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    powerpc, jump_label: Include linux/jump_label.h to get HAVE_JUMP_LABEL define
    jump_label: Allow jump labels to be used in assembly
    jump_label: Allow asm/jump_label.h to be included in assembly
    locking/mutex: Further simplify mutex_spin_on_owner()
    locking: Remove atomicy checks from {READ,WRITE}_ONCE
    locking/rtmutex: Rename argument in the rt_mutex_adjust_prio_chain() documentation as well
    locking/rwsem: Fix lock optimistic spinning when owner is not running
    locking: Remove ACCESS_ONCE() usage
    locking/rwsem: Check for active lock before bailing on spinning
    locking/rwsem: Avoid deceiving lock spinners
    locking/rwsem: Set lock ownership ASAP
    locking/rwsem: Document barrier need when waking tasks
    locking/futex: Check PF_KTHREAD rather than !p->mm to filter out kthreads
    locking/mutex: Refactor mutex_spin_on_owner()
    locking/mutex: In mutex_spin_on_owner(), return true when owner changes

    Linus Torvalds
     

09 Apr, 2015

1 commit

  • Similar to what Linus suggested for rwsem_spin_on_owner(), in
    mutex_spin_on_owner() instead of having while (true) and
    breaking out of the spin loop on lock->owner != owner, we can
    have the loop directly check for while (lock->owner == owner) to
    improve the readability of the code.

    It also shrinks the code a bit:

      text    data    bss     dec    hex    filename
      3721       0      0    3721    e89    mutex.o.before
      3705       0      0    3705    e79    mutex.o.after

    Signed-off-by: Jason Low
    Cc: Andrew Morton
    Cc: Aswin Chandramouleeswaran
    Cc: Davidlohr Bueso
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Link: http://lkml.kernel.org/r/1428521960-5268-2-git-send-email-jason.low2@hp.com
    [ Added code generation info. ]
    Signed-off-by: Ingo Molnar

    Jason Low
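
    A compilable sketch of the restructuring described above (mutex_model
    and owner_still_running() are stand-ins; the point is only the loop
    shape):

        #include <stdatomic.h>
        #include <stdbool.h>
        #include <stdio.h>

        struct mutex_model { _Atomic(void *) owner; };

        static bool owner_still_running(void *owner)
        {
            (void)owner;
            return true;        /* stand-in for the "owner on_cpu" check */
        }

        /* Before: open-coded infinite loop with an explicit break. */
        static bool spin_on_owner_old(struct mutex_model *lock, void *owner)
        {
            while (true) {
                if (atomic_load(&lock->owner) != owner)
                    break;
                if (!owner_still_running(owner))
                    return false;
            }
            return true;
        }

        /* After: the loop condition states directly what we spin on. */
        static bool spin_on_owner_new(struct mutex_model *lock, void *owner)
        {
            while (atomic_load(&lock->owner) == owner) {
                if (!owner_still_running(owner))
                    return false;
            }
            return true;
        }

        int main(void)
        {
            struct mutex_model m = { .owner = NULL };
            int me;

            /* The owner already changed (NULL != &me), so both return at once. */
            printf("old: %d, new: %d\n",
                   spin_on_owner_old(&m, &me), spin_on_owner_new(&m, &me));
            return 0;
        }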
     

25 Mar, 2015

1 commit