31 Mar, 2016

1 commit

  • A sequence of pairs [class_idx -> corresponding chain_key iteration]
    is printed for both the current held_lock chain and the cached chain.

    That exposes the two different class_idx sequences that led to that
    particular hash value.

    This helps with debugging hash chain collision reports.

    Signed-off-by: Alfredo Alvarez Fernandez
    Acked-by: Peter Zijlstra (Intel)
    Cc: Linus Torvalds
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-fsdevel@vger.kernel.org
    Cc: sedat.dilek@gmail.com
    Cc: tytso@mit.edu
    Link: http://lkml.kernel.org/r/1459357416-19190-1-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     

23 Mar, 2016

1 commit

  • kcov provides code coverage collection for coverage-guided fuzzing
    (randomized testing). Coverage-guided fuzzing is a testing technique
    that uses coverage feedback to determine new interesting inputs to a
    system. A notable user-space example is AFL
    (http://lcamtuf.coredump.cx/afl/). However, this technique is not
    widely used for kernel testing due to missing compiler and kernel
    support.

    kcov does not aim to collect as much coverage as possible. It aims to
    collect more or less stable coverage that is a function of syscall inputs.
    To achieve this goal it does not collect coverage in soft/hard
    interrupts, and instrumentation of some inherently non-deterministic or
    non-interesting parts of the kernel is disabled (e.g. scheduler, locking).

    Currently there is a single coverage collection mode (tracing), but the
    API anticipates additional collection modes. Initially I also
    implemented a second mode which exposes coverage in a fixed-size hash
    table of counters (what Quentin used in his original patch). I've
    dropped the second mode for simplicity.

    This patch adds the necessary support on the kernel side. The
    complementary compiler support was added in GCC revision 231296.

    We've used this support to build syzkaller system call fuzzer, which has
    found 90 kernel bugs in just 2 months:

    https://github.com/google/syzkaller/wiki/Found-Bugs

    We've also found 30+ bugs in our internal systems with syzkaller.
    Another (yet unexplored) direction where kcov coverage would greatly
    help is more traditional "blob mutation". For example, mounting a
    random blob as a filesystem, or receiving a random blob over wire.

    Why not gcov? A typical fuzzing loop looks as follows: (1) reset
    coverage, (2) execute a bit of code, (3) collect coverage, repeat.
    Typical coverage can be just a dozen basic blocks (e.g. an invalid
    input). In that context gcov becomes prohibitively expensive, because
    the reset/collect steps depend on the total number of basic blocks/edges
    in the program (about 2M in the case of the kernel), whereas the cost of
    kcov depends only on the number of executed basic blocks/edges. On top
    of that, the kernel requires per-thread coverage because there are
    always background threads and unrelated processes that also produce
    coverage. With inlined gcov instrumentation per-thread coverage is not
    possible.
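
    In user space that reset/run/collect loop maps onto the kcov debugfs
    interface roughly as in the sketch below. The debugfs path and the
    cover[0] counter layout follow the kcov documentation added by this
    patch, while the ioctl request encodings shown here are illustrative
    and should be taken from the installed uapi header, not this snippet:

        /* Hedged sketch of the per-thread kcov usage pattern (error handling omitted). */
        #include <fcntl.h>
        #include <stdio.h>
        #include <sys/ioctl.h>
        #include <sys/mman.h>
        #include <unistd.h>

        #define KCOV_INIT_TRACE _IOR('c', 1, unsigned long) /* illustrative encodings */
        #define KCOV_ENABLE     _IO('c', 100)
        #define KCOV_DISABLE    _IO('c', 101)
        #define COVER_SIZE      (64 << 10)                  /* entries, not bytes */

        int main(void)
        {
                int fd = open("/sys/kernel/debug/kcov", O_RDWR);
                unsigned long *cover, i, n;

                ioctl(fd, KCOV_INIT_TRACE, COVER_SIZE);
                cover = mmap(NULL, COVER_SIZE * sizeof(unsigned long),
                             PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
                ioctl(fd, KCOV_ENABLE, 0);      /* start collecting for this thread */
                cover[0] = 0;                   /* reset the recorded-PC counter */
                read(-1, NULL, 0);              /* the syscall under test */
                n = cover[0];                   /* number of PCs recorded */
                for (i = 0; i < n; i++)
                        printf("0x%lx\n", cover[i + 1]);
                ioctl(fd, KCOV_DISABLE, 0);
                close(fd);
                return 0;
        }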

    kcov exposes kernel PCs and control flow to user-space which is
    insecure. But debugfs should not be mapped as user accessible.

    Based on a patch by Quentin Casasnovas.

    [akpm@linux-foundation.org: make task_struct.kcov_mode have type `enum kcov_mode']
    [akpm@linux-foundation.org: unbreak allmodconfig]
    [akpm@linux-foundation.org: follow x86 Makefile layout standards]
    Signed-off-by: Dmitry Vyukov
    Reviewed-by: Kees Cook
    Cc: syzkaller
    Cc: Vegard Nossum
    Cc: Catalin Marinas
    Cc: Tavis Ormandy
    Cc: Will Deacon
    Cc: Quentin Casasnovas
    Cc: Kostya Serebryany
    Cc: Eric Dumazet
    Cc: Alexander Potapenko
    Cc: Kees Cook
    Cc: Bjorn Helgaas
    Cc: Sasha Levin
    Cc: David Drysdale
    Cc: Ard Biesheuvel
    Cc: Andrey Ryabinin
    Cc: Kirill A. Shutemov
    Cc: Jiri Slaby
    Cc: Ingo Molnar
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dmitry Vyukov
     

16 Mar, 2016

1 commit

  • $ make tags
    GEN tags
    ctags: Warning: drivers/acpi/processor_idle.c:64: null expansion of name pattern "\1"
    ctags: Warning: drivers/xen/events/events_2l.c:41: null expansion of name pattern "\1"
    ctags: Warning: kernel/locking/lockdep.c:151: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:133: null expansion of name pattern "\1"
    ctags: Warning: kernel/rcu/rcutorture.c:135: null expansion of name pattern "\1"
    ctags: Warning: kernel/workqueue.c:323: null expansion of name pattern "\1"
    ctags: Warning: net/ipv4/syncookies.c:53: null expansion of name pattern "\1"
    ctags: Warning: net/ipv6/syncookies.c:44: null expansion of name pattern "\1"
    ctags: Warning: net/rds/page.c:45: null expansion of name pattern "\1"

    Which are all the result of the DEFINE_PER_CPU pattern:

    scripts/tags.sh:200: '/\
    Acked-by: David S. Miller
    Acked-by: Rafael J. Wysocki
    Cc: Tejun Heo
    Cc: "Paul E. McKenney"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

29 Feb, 2016

7 commits

  • Add detection for chain_key collision under CONFIG_DEBUG_LOCKDEP.
    When a collision is detected the problem is reported and all lock
    debugging is turned off.

    Tested using liblockdep and the added tests before and after
    applying the fix, confirming both that the code added for the
    detection correctly reports the problem and that the fix actually
    fixes it.

    Tested tweaking lockdep to generate false collisions and
    verified that the problem is reported and that lock debugging is
    turned off.

    Also tested with lockdep's test suite after applying the patch:

    [ 0.000000] Good, all 253 testcases passed! |

    Signed-off-by: Alfredo Alvarez Fernandez
    Cc: Alfredo Alvarez Fernandez
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1455864533-7536-4-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Ingo Molnar
     
  • The chain_key hashing macro iterate_chain_key(key1, key2) does not
    generate a new different value if both key1 and key2 are 0. In that
    case the generated value is again 0. This can lead to collisions which
    can result in lockdep not detecting deadlocks or circular
    dependencies.

    Avoid the problem by using class_idx (1-based) instead of class id
    (0-based) as an input for the hashing macro 'key2' in
    iterate_chain_key(key1, key2).

    The use of class id created collisions in cases like the following:

    1.- Consider an initial state in which no class has been acquired yet.
    Under these circumstances an AA deadlock will not be detected by
    lockdep:

    lock  [key1,key2] -> new key    (key1 = old chain_key, key2 = id)
    ------------------------------
    A     [0,0] -> 0
    A     [0,0] -> 0   (collision)

    The newly generated chain_key collides with the one used before, and as
    a result the check for a deadlock is skipped.

    A simple test using liblockdep and a pthread mutex confirms the
    problem: (omitting stack traces)

    new class 0xe15038: 0x7ffc64950f20
    acquire class [0xe15038] 0x7ffc64950f20
    acquire class [0xe15038] 0x7ffc64950f20
    hash chain already cached, key: 0000000000000000 tail class:
    [0xe15038] 0x7ffc64950f20

    2.- Consider an ABBA deadlock involving two different tasks, again with
    no class yet acquired.

    T1 [key1,key2] -> new key           T2 [key1,key2] -> new key
    --                                  --
    A  [0,0] -> 0

    B  [0,1] -> 1

                                        B  [0,1] -> 1   (collision)

                                        A

    In this case the collision prevents lockdep from creating the new
    dependency A->B. This in turn results in lockdep not detecting the
    circular dependency when T2 acquires A.
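
    For reference, a sketch of the hashing step involved (the macro shape is
    approximated from the lockdep code of that era, and MAX_LOCKDEP_KEYS_BITS
    is a lockdep constant): with key1 == 0 and key2 == 0 every term below is
    zero, so the "new" chain key is 0 again, which is exactly the collision
    shown above. Feeding the 1-based class_idx as key2 guarantees a non-zero
    input for the first held class.

        #define iterate_chain_key(key1, key2) \
                (((key1) << MAX_LOCKDEP_KEYS_BITS) ^ \
                 ((key1) >> (64 - MAX_LOCKDEP_KEYS_BITS)) ^ \
                 (key2))

        /* in __lock_acquire(), sketch of the change: */
        chain_key = iterate_chain_key(chain_key, class_idx);  /* was: 0-based class id */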

    Signed-off-by: Alfredo Alvarez Fernandez
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1455147212-2389-4-git-send-email-alfredoalvarezernandez@gmail.com
    Signed-off-by: Ingo Molnar

    Alfredo Alvarez Fernandez
     
    Make use of wake-queues and enable the wakeup to occur after releasing
    the wait_lock. This is similar to what we do with the rtmutex top
    waiter: it slightly shortens the critical region and allows other
    waiters to acquire the wait_lock sooner. In low-contention cases it can
    also help the recently woken waiter find the wait_lock available
    (fastpath) when it continues execution.
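
    The resulting shape of the mutex unlock slowpath is roughly the sketch
    below (simplified; WAKE_Q() was the spelling of the on-stack initializer
    at the time, and the surrounding helper names are abbreviated):

        WAKE_Q(wake_q);                         /* on-stack wake queue */

        spin_lock_mutex(&lock->wait_lock, flags);
        if (!list_empty(&lock->wait_list)) {
                struct mutex_waiter *waiter =
                        list_entry(lock->wait_list.next, struct mutex_waiter, list);

                wake_q_add(&wake_q, waiter->task);  /* defer the actual wakeup */
        }
        spin_unlock_mutex(&lock->wait_lock, flags);

        wake_up_q(&wake_q);                     /* wake up after dropping wait_lock */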

    Reviewed-by: Waiman Long
    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Ding Tianhong
    Cc: Jason Low
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Tim Chen
    Cc: Waiman Long
    Cc: Will Deacon
    Link: http://lkml.kernel.org/r/20160125022343.GA3322@linux-uzut.site
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • This patch enables the tracking of the number of slowpath locking
    operations performed. This can be used to compare against the number
    of lock stealing operations to see what percentage of locks are stolen
    versus acquired via the regular slowpath.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • The newly introduced smp_cond_acquire() was used to replace the
    slowpath lock acquisition loop. Similarly, the new function can also
    be applied to the pending bit locking loop. This patch uses the new
    function in that loop.
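
    In the pending-bit path the waiting loop then reduces to roughly the
    following (sketch; field and mask names as used by the qspinlock code):

        /* before (sketch): explicit acquire load on every spin iteration */
        while ((val = smp_load_acquire(&lock->val.counter)) & _Q_LOCKED_MASK)
                cpu_relax();

        /* after (sketch): control dependency + read barrier */
        smp_cond_acquire(!(atomic_read(&lock->val) & _Q_LOCKED_MASK));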

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    This patch moves the lock-stealing count tracking code into
    pv_queued_spin_steal_lock() instead of going through a jacket function,
    simplifying the code.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Douglas Hatch
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1449778666-13593-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    Similar to commit b4b29f94856a ("locking/osq: Fix ordering of node
    initialisation in osq_lock"), the use of xchg_acquire() is
    fundamentally broken with MCS-like constructs.

    Furthermore, it turns out we rely on the global transitivity of this
    operation because the unlock path observes the pointer with a
    READ_ONCE(), not an smp_load_acquire().

    This is non-critical because the MCS code isn't actually used and
    mostly serves as documentation, a stepping stone to the more complex
    things we've built on top of the idea.
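
    The fix itself is essentially a one-liner in mcs_spin_lock() (sketch):
    publish the node with a fully ordered xchg() so the initialisation
    stores are visible before the tail update, mirroring the osq_lock fix
    referenced above.

        node->locked = 0;
        node->next   = NULL;

        /* was xchg_acquire(); ACQUIRE does not order the stores above */
        prev = xchg(lock, node);
        if (likely(prev == NULL))
                return;                 /* lock acquired, no contention */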

    Reported-by: Andrea Parri
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Will Deacon
    Fixes: 3552a07a9c4a ("locking/mcs: Use acquire/release semantics")
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

09 Feb, 2016

3 commits

  • Lockdep is initialized at compile time now. Get rid of lockdep_init().

    Signed-off-by: Andrey Ryabinin
    Signed-off-by: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Krinkin
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: mm-commits@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Andrey Ryabinin
     
  • Mike said:

    : CONFIG_UBSAN_ALIGNMENT breaks x86-64 kernel with lockdep enabled, i.e.
    : kernel with CONFIG_UBSAN_ALIGNMENT=y fails to load without even any error
    : message.
    :
    : The problem is that ubsan callbacks use spinlocks and might be called
    : before lockdep is initialized. Particularly this line in the
    : reserve_ebda_region function causes problem:
    :
    : lowmem = *(unsigned short *)__va(BIOS_LOWMEM_KILOBYTES);
    :
    : If i put lockdep_init() before reserve_ebda_region call in
    : x86_64_start_reservations kernel loads well.

    Fix this ordering issue permanently: change lockdep so that it uses hlists
    for the hash tables. Unlike a list_head, an hlist_head is in its
    initialized state when it is all-zeroes, so lockdep is ready for operation
    immediately upon boot - lockdep_init() need not have run.

    The patch will also save some memory.

    Probably lockdep_init() and lockdep_initialized can be done away with now.
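
    Why a zeroed hlist needs no init call, as a minimal illustration (the
    table name is picked from lockdep for flavour): an hlist_head is a
    single pointer and NULL already means "empty bucket", whereas a
    list_head must point back at itself and therefore needs a runtime
    initialisation pass.

        /* an all-zero .bss image means every bucket is a valid, empty hlist */
        static struct hlist_head chainhash_table[CHAINHASH_SIZE];

        /* hlist_empty() is just "h->first == NULL", usable from the very
         * first instruction of boot; no lockdep_init() ordering to get wrong. */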

    Suggested-by: Mike Krinkin
    Reported-by: Mike Krinkin
    Signed-off-by: Andrew Morton
    Cc: Andrey Ryabinin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: linux-kernel@vger.kernel.org
    Cc: mm-commits@vger.kernel.org
    Signed-off-by: Ingo Molnar

    Andrew Morton
     
    check_prev_add() caches the saved stack trace in a static trace variable
    to avoid duplicate save_trace() calls in dependencies involving trylocks.
    But that caching logic contains a bug: we may not save the trace on the
    first iteration due to an early return from check_prev_add(), and then
    on the second iteration, when we actually need the trace, we don't save
    it because we think we've already saved it.

    Let check_prev_add() itself control when the stack trace is saved.

    There is another bug: the trace variable is protected by the graph lock,
    but we can temporarily release the graph lock during printing.

    Fix this by invalidating the cached stack trace when we release the
    graph lock.

    Signed-off-by: Dmitry Vyukov
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: glider@google.com
    Cc: kcc@google.com
    Cc: peter@hurleysoftware.com
    Cc: sasha.levin@oracle.com
    Link: http://lkml.kernel.org/r/1454593240-121647-1-git-send-email-dvyukov@google.com
    Signed-off-by: Ingo Molnar

    Dmitry Vyukov
     

26 Jan, 2016

1 commit

  • Sasha reported a lockdep splat about a potential deadlock between RCU boosting
    rtmutex and the posix timer it_lock.

    CPU0                                        CPU1

    rtmutex_lock(&rcu->rt_mutex)
      spin_lock(&rcu->rt_mutex.wait_lock)
                                                local_irq_disable()
                                                spin_lock(&timer->it_lock)
                                                spin_lock(&rcu->mutex.wait_lock)
    --> Interrupt
        spin_lock(&timer->it_lock)

    This is caused by the following code sequence on CPU1

    rcu_read_lock();
    x = lookup();
    if (x)
            spin_lock_irqsave(&x->it_lock);
    rcu_read_unlock();
    return x;

    We could fix that in the posix timer code by keeping RCU read-locked
    across the spinlocked, irq-disabled section, but the above sequence is
    common and there is no reason not to support it.

    Taking rt_mutex.wait_lock irq-safe prevents the deadlock.
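
    In code terms the rtmutex slowpaths then take the lock roughly like this
    (sketch; the surrounding queue/dequeue logic is elided):

        raw_spin_lock_irqsave(&lock->wait_lock, flags);
        /* ... enqueue or dequeue the waiter, adjust priorities ... */
        raw_spin_unlock_irqrestore(&lock->wait_lock, flags);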

    Reported-by: Sasha Levin
    Signed-off-by: Thomas Gleixner
    Cc: Peter Zijlstra
    Cc: Paul McKenney

    Thomas Gleixner
     

12 Jan, 2016

1 commit

  • Pull locking updates from Ingo Molnar:
    "So we have a laundry list of locking subsystem changes:

    - continuing barrier API and code improvements

    - futex enhancements

    - atomics API improvements

    - pvqspinlock enhancements: in particular lock stealing and adaptive
    spinning

    - qspinlock micro-enhancements"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op
    futex: Cleanup the goto confusion in requeue_pi()
    futex: Remove pointless put_pi_state calls in requeue()
    futex: Document pi_state refcounting in requeue code
    futex: Rename free_pi_state() to put_pi_state()
    futex: Drop refcount if requeue_pi() acquired the rtmutex
    locking/barriers, arch: Remove ambiguous statement in the smp_store_mb() documentation
    lcoking/barriers, arch: Use smp barriers in smp_store_release()
    locking/cmpxchg, arch: Remove tas() definitions
    locking/pvqspinlock: Queue node adaptive spinning
    locking/pvqspinlock: Allow limited lock stealing
    locking/pvqspinlock: Collect slowpath lock statistics
    sched/core, locking: Document Program-Order guarantees
    locking, sched: Introduce smp_cond_acquire() and use it
    locking/pvqspinlock, x86: Optimize the PV unlock code path
    locking/qspinlock: Avoid redundant read of next pointer
    locking/qspinlock: Prefetch the next node cacheline
    locking/qspinlock: Use _acquire/_release() versions of cmpxchg() & xchg()
    atomics: Add test for atomic operations with _relaxed variants

    Linus Torvalds
     

18 Dec, 2015

1 commit

  • The Cavium guys reported a soft lockup on their arm64 machine, caused by
    commit c55a6ffa6285 ("locking/osq: Relax atomic semantics"):

    mutex_optimistic_spin+0x9c/0x1d0
    __mutex_lock_slowpath+0x44/0x158
    mutex_lock+0x54/0x58
    kernfs_iop_permission+0x38/0x70
    __inode_permission+0x88/0xd8
    inode_permission+0x30/0x6c
    link_path_walk+0x68/0x4d4
    path_openat+0xb4/0x2bc
    do_filp_open+0x74/0xd0
    do_sys_open+0x14c/0x228
    SyS_openat+0x3c/0x48
    el0_svc_naked+0x24/0x28

    This is because in osq_lock we initialise the node for the current CPU:

    node->locked = 0;
    node->next = NULL;
    node->cpu = curr;

    and then publish the current CPU in the lock tail:

    old = atomic_xchg_acquire(&lock->tail, curr);

    Once the update to lock->tail is visible to another CPU, the node is
    then live and can be both read and updated by concurrent lockers.

    Unfortunately, the ACQUIRE semantics of the xchg operation mean that
    there is no guarantee the contents of the node will be visible before
    the lock tail is updated. This can lead to lock corruption when, for
    example, a concurrent locker races to set the next field.
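
    The fix (sketch) is to publish the node with a fully ordered xchg(), so
    the initialisation stores above are guaranteed to be visible before the
    node appears in lock->tail:

        old = atomic_xchg(&lock->tail, curr);   /* was atomic_xchg_acquire() */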

    Fixes: c55a6ffa6285 ("locking/osq: Relax atomic semantics")
    Reported-by: David Daney
    Reported-by: Andrew Pinski
    Tested-by: Andrew Pinski
    Acked-by: Davidlohr Bueso
    Signed-off-by: Will Deacon
    Signed-off-by: Peter Zijlstra (Intel)
    Link: http://lkml.kernel.org/r/1449856001-21177-1-git-send-email-will.deacon@arm.com
    Signed-off-by: Linus Torvalds

    Will Deacon
     

04 Dec, 2015

4 commits

    In an overcommitted guest where some vCPUs have to be halted to make
    forward progress in other areas, it is highly likely that a vCPU later
    in the spinlock queue will be spinning while the ones earlier in the
    queue have been halted. The spinning in the later vCPUs is then just a
    waste of precious CPU cycles because they are not going to get the lock
    soon, as the earlier ones have to be woken up and take their turn first.

    This patch implements an adaptive spinning mechanism where the vCPU
    will call pv_wait() if the previous vCPU is not running.
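
    The check is rate-limited and boils down to roughly the following
    (sketch; names taken from the PV qspinlock code of this series, shown
    here for illustration):

        /* only look at the previous node's vCPU state every so often */
        static inline bool pv_wait_early(struct pv_node *prev, int loop)
        {
                if ((loop & PV_PREV_CHECK_MASK) != 0)
                        return false;

                return READ_ONCE(prev->state) != vcpu_running;
        }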

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                        Westmere                     Haswell
    Patch               32 vCPUs    48 vCPUs         32 vCPUs    48 vCPUs
    -----               --------    --------         --------    --------
    Before patch        3m02.3s     5m00.2s          1m43.7s     3m03.5s
    After patch         3m03.0s     4m37.5s          1m43.0s     2m47.2s

    For 32 vCPUs, this patch doesn't cause any noticeable change in
    performance. For 48 vCPUs (over-committed), there is about 8%
    performance improvement.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-8-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch allows one attempt for the lock waiter to steal the lock
    when entering the PV slowpath. To prevent lock starvation, the pending
    bit will be set by the queue head vCPU when it is in the active lock
    spinning loop to disable any lock stealing attempt. This helps to
    reduce the performance penalty caused by lock waiter preemption while
    not having much of the downsides of a real unfair lock.

    The pv_wait_head() function was renamed as pv_wait_head_or_lock()
    as it was modified to acquire the lock before returning. This is
    necessary because of possible lock stealing attempts from other tasks.
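
    The stealing attempt itself is a single trylock-style check, roughly
    (sketch; struct and mask names from the PV qspinlock code of the time):

        static inline bool pv_queued_spin_steal_lock(struct qspinlock *lock)
        {
                struct __qspinlock *l = (void *)lock;

                /* only steal if neither the locked nor the pending bit is set */
                return !(atomic_read(&lock->val) & _Q_LOCKED_PENDING_MASK) &&
                        (cmpxchg(&l->locked, 0, _Q_LOCKED_VAL) == 0);
        }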

    Linux kernel builds were run in KVM guest on an 8-socket, 4
    cores/socket Westmere-EX system and a 4-socket, 8 cores/socket
    Haswell-EX system. Both systems are configured to have 32 physical
    CPUs. The kernel build times before and after the patch were:

                        Westmere                     Haswell
    Patch               32 vCPUs    48 vCPUs         32 vCPUs    48 vCPUs
    -----               --------    --------         --------    --------
    Before patch        3m15.6s     10m56.1s         1m44.1s     5m29.1s
    After patch         3m02.3s     5m00.2s          1m43.7s     3m03.5s

    For the overcommitted case (48 vCPUs), this patch is able to reduce
    kernel build time by more than 54% for Westmere and 44% for Haswell.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447190336-53317-1-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    This patch enables the accumulation of kicking- and waiting-related
    PV qspinlock statistics when the new QUEUED_LOCK_STAT configuration
    option is selected. It also enables the collection of data which
    allows us to calculate the kicking and wakeup latencies, which have
    a heavy dependency on the CPUs being used.

    The statistical counters are per-cpu variables to minimize the
    performance overhead in their updates. These counters are exported
    via the debugfs filesystem under the qlockstat directory. When the
    corresponding debugfs files are read, summation and computing of the
    required data are then performed.

    The measured latencies for different CPUs are:

    CPU            Wakeup     Kicking
    ---            ------     -------
    Haswell-EX     63.6us     7.4us
    Westmere-EX    67.6us     9.3us

    The measured latencies varied a bit from run-to-run. The wakeup
    latency is much higher than the kicking latency.

    A sample of statistical counters after system bootup (with vCPU
    overcommit) was:

    pv_hash_hops=1.00
    pv_kick_unlock=1148
    pv_kick_wake=1146
    pv_latency_kick=11040
    pv_latency_wake=194840
    pv_spurious_wakeup=7
    pv_wait_again=4
    pv_wait_head=23
    pv_wait_node=1129

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-6-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • Introduce smp_cond_acquire() which combines a control dependency and a
    read barrier to form acquire semantics.

    This primitive has two benefits:

    - it documents control dependencies,
    - it's typically cheaper than using smp_load_acquire() in a loop.
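
    Its shape at introduction was roughly the following sketch (a busy-wait
    on the condition followed by a read barrier, which together with the
    control dependency gives acquire ordering):

        #define smp_cond_acquire(cond)  do {            \
                while (!(cond))                         \
                        cpu_relax();                    \
                smp_rmb(); /* ctrl + rmb := acquire */  \
        } while (0)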

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

23 Nov, 2015

5 commits

  • The unlock function in queued spinlocks was optimized for better
    performance on bare metal systems at the expense of virtualized guests.

    For x86-64 systems, the unlock call needs to go through a
    PV_CALLEE_SAVE_REGS_THUNK() which saves and restores 8 64-bit
    registers before calling the real __pv_queued_spin_unlock()
    function. The thunk code may also be in a separate cacheline from
    __pv_queued_spin_unlock().

    This patch optimizes the PV unlock code path by:

    1) Moving the unlock slowpath code from the fastpath into a separate
    __pv_queued_spin_unlock_slowpath() function to make the fastpath as
    simple as possible.

    2) For x86-64, hand-coding an assembly function that combines the
    register-saving thunk code with the fastpath code. Only registers that
    are used in the fastpath will be saved and restored. If the fastpath
    fails, the slowpath function will be called via another
    PV_CALLEE_SAVE_REGS_THUNK(). For 32-bit, it falls back to the C
    __pv_queued_spin_unlock() code as the thunk saves and restores only
    one 32-bit register.

    With a microbenchmark of 5M lock-unlock loops, the table below shows
    the execution times before and after the patch with different numbers
    of threads in a VM running on a 32-core Westmere-EX box with x86-64
    4.2-rc1 based kernels:

    Threads    Before patch    After patch    % Change
    -------    ------------    -----------    --------
       1         134.1 ms       119.3 ms       -11%
       2          1286 ms         953 ms       -26%
       3          3715 ms        3480 ms       -6.3%
       4          4092 ms        3764 ms       -8.0%

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-5-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    With optimistic prefetch of the next node cacheline, the next pointer
    may already have been properly initialized. As a result, the read of
    node->next in the contended path may be redundant. This patch
    eliminates the redundant read if the next pointer value is not NULL.
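
    The contended path then only spins for ->next when the earlier
    optimistic load saw NULL, roughly (sketch):

        if (!next)
                while (!(next = READ_ONCE(node->next)))
                        cpu_relax();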

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-4-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
    A queue head CPU, after acquiring the lock, will have to notify
    the next CPU in the wait queue that it has become the new queue
    head. This involves loading a new cacheline from the MCS node of the
    next CPU. That operation can be expensive and adds to the latency of
    the locking operation.

    This patch adds code to optimistically prefetch the next MCS node
    cacheline if the next pointer is defined and the CPU has been spinning
    on the MCS lock for a while. This reduces the locking latency and
    improves the system throughput.
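
    The prefetch itself is a two-line hint, roughly (sketch; its exact
    placement in the queueing loop is simplified here):

        next = READ_ONCE(node->next);
        if (next)
                prefetchw(next);        /* warm the next waiter's MCS node for write */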

    The performance change will depend on whether the prefetch overhead
    can be hidden within the latency of the lock spin loop. On really
    short critical sections, there may not be any performance gain at all.
    With longer critical sections, however, it was found to give a
    performance boost of 5-10% over a range of different queue depths with
    a spinlock loop microbenchmark.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-3-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • This patch replaces the cmpxchg() and xchg() calls in the native
    qspinlock code with the more relaxed _acquire or _release versions of
    those calls to enable other architectures to adopt queued spinlocks
    with less memory barrier performance overhead.

    Signed-off-by: Waiman Long
    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Andrew Morton
    Cc: Davidlohr Bueso
    Cc: Douglas Hatch
    Cc: H. Peter Anvin
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Scott J Norton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/1447114167-47185-2-git-send-email-Waiman.Long@hpe.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • There were still a number of references to my old Red Hat email
    address in the kernel source. Remove these while keeping the
    Red Hat copyright notices intact.

    Signed-off-by: Peter Zijlstra (Intel)
    Cc: Arnaldo Carvalho de Melo
    Cc: Jiri Olsa
    Cc: Linus Torvalds
    Cc: Mike Galbraith
    Cc: Peter Zijlstra
    Cc: Stephane Eranian
    Cc: Thomas Gleixner
    Cc: Vince Weaver
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

07 Nov, 2015

1 commit

  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    to have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible (see the sketch after
    this list). This is because checking for __GFP_WAIT, as was done
    historically, can now trigger false positives. Some exceptions like
    dm-crypt.c exist where the code intent is clearer if __GFP_DIRECT_RECLAIM
    is used instead of the helper due to flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
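
    The helper referenced above amounts to a single flag test (shape as in
    include/linux/gfp.h), and a caller-side check looks roughly like the
    illustrative fragment that follows:

        static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
        {
                return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
        }

        /* usage sketch in an allocation path that must not sleep */
        if (!gfpflags_allow_blocking(gfp_mask))
                return NULL;    /* atomic context: fail fast instead of sleeping */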

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     

04 Nov, 2015

2 commits

  • Pull scheduler changes from Ingo Molnar:
    "The main changes in this cycle were:

    - sched/fair load tracking fixes and cleanups (Byungchul Park)

    - Make load tracking frequency scale invariant (Dietmar Eggemann)

    - sched/deadline updates (Juri Lelli)

    - stop machine fixes, cleanups and enhancements for bugs triggered by
    CPU hotplug stress testing (Oleg Nesterov)

    - scheduler preemption code rework: remove PREEMPT_ACTIVE and related
    cleanups (Peter Zijlstra)

    - Rework the sched_info::run_delay code to fix races (Peter Zijlstra)

    - Optimize per entity utilization tracking (Peter Zijlstra)

    - ... misc other fixes, cleanups and smaller updates"

    * 'sched-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (57 commits)
    sched: Don't scan all-offline ->cpus_allowed twice if !CONFIG_CPUSETS
    sched: Move cpu_active() tests from stop_two_cpus() into migrate_swap_stop()
    sched: Start stopper early
    stop_machine: Kill cpu_stop_threads->setup() and cpu_stop_unpark()
    stop_machine: Kill smp_hotplug_thread->pre_unpark, introduce stop_machine_unpark()
    stop_machine: Change cpu_stop_queue_two_works() to rely on stopper->enabled
    stop_machine: Introduce __cpu_stop_queue_work() and cpu_stop_queue_two_works()
    stop_machine: Ensure that a queued callback will be called before cpu_stop_park()
    sched/x86: Fix typo in __switch_to() comments
    sched/core: Remove a parameter in the migrate_task_rq() function
    sched/core: Drop unlikely behind BUG_ON()
    sched/core: Fix task and run queue sched_info::run_delay inconsistencies
    sched/numa: Fix task_tick_fair() from disabling numa_balancing
    sched/core: Add preempt_count invariant check
    sched/core: More notrace annotations
    sched/core: Kill PREEMPT_ACTIVE
    sched/core, sched/x86: Kill thread_info::saved_preempt_count
    sched/core: Simplify preempt_count tests
    sched/core: Robustify preemption leak checks
    sched/core: Stop setting PREEMPT_ACTIVE
    ...

    Linus Torvalds
     
  • Pull locking changes from Ingo Molnar:
    "The main changes in this cycle were:

    - More gradual enhancements to atomic ops: new atomic*_read_ctrl()
    ops, synchronize atomic_{read,set}() ordering requirements between
    architectures, add atomic_long_t bitops. (Peter Zijlstra)

    - Add _{relaxed|acquire|release}() variants for inc/dec atomics and
    use them in various locking primitives: mutex, rtmutex, mcs, rwsem.
    This enables weakly ordered architectures (such as arm64) to make
    use of more locking related optimizations. (Davidlohr Bueso)

    - Implement atomic[64]_{inc,dec}_relaxed() on ARM. (Will Deacon)

    - Futex kernel data cache footprint micro-optimization. (Rasmus
    Villemoes)

    - pvqspinlock runtime overhead micro-optimization. (Waiman Long)

    - misc smaller fixlets"

    * 'locking-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    ARM, locking/atomics: Implement _relaxed variants of atomic[64]_{inc,dec}
    locking/rwsem: Use acquire/release semantics
    locking/mcs: Use acquire/release semantics
    locking/rtmutex: Use acquire/release semantics
    locking/mutex: Use acquire/release semantics
    locking/asm-generic: Add _{relaxed|acquire|release}() variants for inc/dec atomics
    atomic: Implement atomic_read_ctrl()
    atomic, arch: Audit atomic_{read,set}()
    atomic: Add atomic_long_t bitops
    futex: Force hot variables into a single cache line
    locking/pvqspinlock: Kick the PV CPU unconditionally when _Q_SLOW_VAL
    locking/osq: Relax atomic semantics
    locking/qrwlock: Rename ->lock to ->wait_lock
    locking/Documentation/lockstat: Fix typo - lokcing -> locking
    locking/atomics, cmpxchg: Privatize the inclusion of asm/cmpxchg.h

    Linus Torvalds
     

19 Oct, 2015

1 commit

  • …k/linux-rcu into core/rcu

    Pull RCU updates from Paul E. McKenney:

    - Miscellaneous fixes. (Paul E. McKenney, Boqun Feng, Oleg Nesterov, Patrick Marlier)

    - Improvements to expedited grace periods. (Paul E. McKenney)

    - Performance improvements to and locktorture tests for percpu-rwsem.
    (Oleg Nesterov, Paul E. McKenney)

    - Torture-test changes. (Paul E. McKenney, Davidlohr Bueso)

    - Documentation updates. (Paul E. McKenney)

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    Ingo Molnar
     

08 Oct, 2015

1 commit


07 Oct, 2015

8 commits

  • The locktorture module has a list of torture types, and specifying
    a type not on this list is supposed to cleanly fail the module load.
    Unfortunately, the "fail" happens without the "cleanly". This commit
    therefore adds the needed clean-up after an incorrect torture_type.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • Based on Peter Zijlstra's earlier patch.

    Change percpu_down_read() to use __down_read(); this way we can
    do rwsem_acquire_read() unconditionally at the start to make this
    code more symmetric and clean.

    Originally-From: Peter Zijlstra
    Signed-off-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Oleg Nesterov
     
  • Update the comments broken by the previous change.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Oleg Nesterov
     
  • Currently down_write/up_write calls synchronize_sched_expedited()
    twice, which is evil. Change this code to rely on rcu-sync primitives.
    This avoids the _expedited "big hammer", and this can be faster in
    the contended case or even in the case when a single thread does
    down_write/up_write in a loop.

    Of course, a single down_write() will take more time, but otoh it
    will be much more friendly to the whole system.
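
    Conceptually the writer now brackets its critical section with the
    rcu_sync state machine instead of two expedited grace periods, roughly
    (sketch; surrounding rwsem handling elided and function bodies
    simplified):

        void percpu_down_write(struct percpu_rw_semaphore *sem)
        {
                rcu_sync_enter(&sem->rss);      /* switch readers to the slow path */
                /* ... wait out fast-path readers, take the writer lock ... */
        }

        void percpu_up_write(struct percpu_rw_semaphore *sem)
        {
                /* ... release the writer lock ... */
                rcu_sync_exit(&sem->rss);       /* let readers use the fast path again */
        }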

    To simplify the review this patch doesn't update the comments, fixed
    by the next change.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Oleg Nesterov
     
  • This is the temporary ugly hack which will be reverted later. We only
    need it to ensure that the next patch will not break "change sb_writers
    to use percpu_rw_semaphore" patches routed via the VFS tree.

    The alloc_super()->destroy_super() error path assumes that it is safe
    to call percpu_free_rwsem() after kzalloc() without percpu_init_rwsem(),
    so let's not disappoint it.

    Signed-off-by: Oleg Nesterov
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Oleg Nesterov
     
  • This commit adds percpu_rwsem tests based on the earlier rwsem tests.

    Signed-off-by: Paul E. McKenney
    Cc: Oleg Nesterov
    Cc: Davidlohr Bueso
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
  • This commit exports percpu_down_read(), percpu_down_write(),
    __percpu_init_rwsem(), percpu_up_read(), and percpu_up_write() to allow
    locktorture to test them when built as a module.

    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Paul E. McKenney
     
    Real-time mutexes are one of the few general primitives
    that we do not have in locktorture. Address this -- a few
    considerations:

    o To spice things up, enable competing thread(s) to become
    rt, such that we can stress different prio boosting paths
    in the rtmutex code. Introduce a ->task_boost callback,
    only used by rtmutex-torturer. Tasks will boost/deboost
    around every 50k (arbitrarily) lock/unlock operations.

    o Hold times are similar to what we have for other locks:
    only occasionally having longer hold times (per ~200k ops).
    So we roughly do two full rt boost+deboosting ops with
    short hold times.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Paul E. McKenney
    Reviewed-by: Josh Triplett

    Davidlohr Bueso
     

06 Oct, 2015

2 commits

  • As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
    variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
    Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
    ordered archs can benefit from more relaxed use of barriers when locking
    and unlocking, instead of regular full barrier semantics. While currently
    only arm64 supports such optimizations, updating corresponding locking
    primitives serves for other archs to immediately benefit as well, once the
    necessary machinery is implemented of course.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E.McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1443643395-17016-6-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     
  • As of 654672d4ba1 (locking/atomics: Add _{acquire|release|relaxed}()
    variants of some atomic operations) and 6d79ef2d30e (locking, asm-generic:
    Add _{relaxed|acquire|release}() variants for 'atomic_long_t'), weakly
    ordered archs can benefit from more relaxed use of barriers when locking
    and unlocking, instead of regular full barrier semantics. While currently
    only arm64 supports such optimizations, updating corresponding locking
    primitives serves for other archs to immediately benefit as well, once the
    necessary machinery is implemented of course.

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Peter Zijlstra (Intel)
    Reviewed-by: Thomas Gleixner
    Cc: Andrew Morton
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Paul E.McKenney
    Cc: Peter Zijlstra
    Cc: Will Deacon
    Cc: linux-kernel@vger.kernel.org
    Link: http://lkml.kernel.org/r/1443643395-17016-5-git-send-email-dave@stgolabs.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso