04 Sep, 2013

1 commit

  • Pull core/locking changes from Ingo Molnar:
    "Main changes:

    - another mutex optimization, from Davidlohr Bueso

    - improved lglock lockdep tracking, from Michel Lespinasse

    - [ assorted smaller updates, improvements, cleanups. ]"

    * 'core-locking-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    generic-ipi/locking: Fix misleading smp_call_function_any() description
    hung_task debugging: Print more info when reporting the problem
    mutex: Avoid label warning when !CONFIG_MUTEX_SPIN_ON_OWNER
    mutex: Do not unnecessarily deal with waiters
    mutex: Fix/document access-once assumption in mutex_can_spin_on_owner()
    lglock: Update lockdep annotations to report recursive local locks
    lockdep: Introduce lock_acquire_exclusive()/shared() helper macros

    Linus Torvalds
     

31 Jul, 2013

1 commit

  • The check needs to be for > 1, because ctx->acquired is already incremented.
    This will prevent ww_mutex_lock_slow from returning -EDEADLK and not locking
    the mutex. It caused a lot of false gpu lockups on radeon with
    CONFIG_DEBUG_WW_MUTEX_SLOWPATH=y because a function that shouldn't be able
    to return -EDEADLK did.
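
    For illustration, a sketch of the corrected condition as it sits right
    after the lock has been acquired in the debug path (function and field
    names follow the mainline CONFIG_DEBUG_WW_MUTEX_SLOWPATH code; details
    may differ):

    /*
     * ctx->acquired was already incremented for the lock we just took,
     * so only inject -EDEADLK when at least one *other* lock is held;
     * the old "> 0" check wrongly fired for ww_mutex_lock_slow().
     */
    if (!ret && ctx->acquired > 1)
            return ww_mutex_deadlock_injection(lock, ctx);

    return ret;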

    Signed-off-by: Maarten Lankhorst
    Signed-off-by: Peter Zijlstra
    Cc: Alex Deucher
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/51F775B5.201@canonical.com
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     

26 Jul, 2013

1 commit


23 Jul, 2013

1 commit

  • Upon entering the slowpath, we immediately attempt to acquire
    the lock by checking if it is already unlocked. If we are lucky
    enough that this is the case, then we don't need to deal with
    any waiter related logic.

    Furthermore any checks for an empty wait_list are unnecessary as
    we already know that count is non-negative and hence no one is
    waiting for the lock.

    Move the count check and xchg calls to be done before any
    waiters are set up - including waiter debugging. Upon failure to
    acquire the lock, the xchg sets the counter to 0, instead of -1
    as it was originally. This can be done here since we set it back
    to -1 right at the beginning of the loop so other waiters are
    woken up when the lock is released.
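
    Roughly, the resulting top of the slowpath looks like this (a simplified
    sketch, not the exact diff; the "skip_wait" label is assumed to sit past
    all waiter setup):

    /*
     * Once more, can we acquire the lock cheaply, before doing any
     * waiter setup or debug bookkeeping?  On failure the xchg leaves
     * the count at 0; the wait loop below resets it to -1 so other
     * waiters still get woken on unlock.
     */
    if (atomic_read(&lock->count) >= 0 &&
        atomic_xchg(&lock->count, 0) == 1)
            goto skip_wait;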

    When tested on an 8-socket (80-core) system against a vanilla
    3.10-rc1 kernel, this patch provides some small performance
    benefits (+2-6%). While these could be considered within the
    noise level, the average percentages were stable across multiple
    runs and no performance regressions were seen. Two big winners,
    for small numbers of users (10-100), were the short and compute
    workloads, which saw +19.36% and +15.76% gains in jobs per minute.

    Also change some break statements to 'goto slowpath', which IMO
    makes the code a little more intuitive to read.

    Signed-off-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Acked-by: Maarten Lankhorst
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1372450398.2106.1.camel@buesod1.americas.hpqcorp.net
    Signed-off-by: Ingo Molnar

    Davidlohr Bueso
     

22 Jul, 2013

1 commit

  • mutex_can_spin_on_owner() is technically broken in that it would
    in theory allow the compiler to load lock->owner twice, seeing a
    pointer the first time and a NULL pointer the second time.

    Linus pointed out that a compiler has to be seriously broken to
    not compile this correctly - but nevertheless this change
    is correct as it will better document the implementation.
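
    The documented pattern amounts to loading the owner pointer exactly once
    (a sketch, close to the resulting helper):

    static inline int mutex_can_spin_on_owner(struct mutex *lock)
    {
            struct task_struct *owner;
            int retval = 1;

            rcu_read_lock();
            owner = ACCESS_ONCE(lock->owner);   /* one load, used for both tests */
            if (owner)
                    retval = owner->on_cpu;     /* only spin if the owner is running */
            rcu_read_unlock();

            return retval;
    }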

    Signed-off-by: Peter Zijlstra
    Acked-by: Davidlohr Bueso
    Acked-by: Waiman Long
    Acked-by: Linus Torvalds
    Acked-by: Thomas Gleixner
    Acked-by: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Link: http://lkml.kernel.org/r/20130719183101.GA20909@twins.programming.kicks-ass.net
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

12 Jul, 2013

1 commit

  • Move the definitions for wound/wait mutexes out to a separate
    header, ww_mutex.h. This reduces clutter in mutex.h, and
    increases readability.

    Suggested-by: Linus Torvalds
    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Acked-by: Rik van Riel
    Acked-by: Maarten Lankhorst
    Cc: Dave Airlie
    Link: http://lkml.kernel.org/r/51D675DC.3000907@canonical.com
    [ Tidied up the code a bit. ]
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     

26 Jun, 2013

3 commits

  • Injects EDEADLK conditions at pseudo-random intervals, with
    exponential backoff up to UINT_MAX (to ensure that every lock
    operation still completes in a reasonable time).

    This way we can test the wound slowpath even for ww mutex users
    where contention is never expected, and the ww deadlock
    avoidance algorithm is only needed for correctness against
    malicious userspace. An example would be protecting kernel
    modesetting properties, which thanks to single-threaded X isn't
    really expected to contend, ever.

    I've looked into using the CONFIG_FAULT_INJECTION
    infrastructure, but decided against it for two reasons:

    - EDEADLK handling is mandatory for ww mutex users and should
    never affect the outcome of a syscall. This is in contrast to -ENOMEM
    injection. So fine-grained configurability isn't required.

    - The fault injection framework only allows setting a simple
    probability of failure. Now the probability that a ww mutex acquire
    stage with N locks will never complete (due to too many injected
    EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock
    operations for the completely uncontended case would be O(exp(N)).
    The per-acquire-ctx exponential backoff solution chosen here only
    results in O(log N) overhead due to injection and so O(log N * N)
    lock operations. This way we can fail with high probability (and so
    have good test coverage even for fancy backoff and lock acquisition
    paths) without running into pathological cases.

    Note that EDEADLK will only ever be injected when we managed to
    acquire the lock. This prevents any behaviour changes for users
    which rely on the EALREADY semantics.
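
    The injection boils down to roughly the following, keyed off two counters
    in the acquire context (a sketch of the CONFIG_DEBUG_WW_MUTEX_SLOWPATH
    helper; details may differ):

    if (ctx->deadlock_inject_countdown-- == 0) {
            unsigned tmp = ctx->deadlock_inject_interval;

            /* exponential backoff, capped so that every lock operation
             * still completes in a reasonable time */
            if (tmp > UINT_MAX/4)
                    tmp = UINT_MAX;
            else
                    tmp = tmp*2 + tmp + tmp/2;

            ctx->deadlock_inject_interval  = tmp;
            ctx->deadlock_inject_countdown = tmp;
            ctx->contending_lock = lock;

            /* we only get here after the lock was actually acquired,
             * so EALREADY semantics are unaffected */
            ww_mutex_unlock(lock);
            return -EDEADLK;
    }

    return 0;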

    Signed-off-by: Daniel Vetter
    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser
    Signed-off-by: Ingo Molnar

    Daniel Vetter
     
  • Wound/wait mutexes are used when multiple lock acquisitions of a
    similar type can be done in an arbitrary order. The deadlock
    handling used here is called wait/wound in the RDBMS literature:
    the older task waits until it can acquire the contended lock. The
    younger task needs to back off and drop all the locks it is
    currently holding, i.e. the younger task is wounded.

    For full documentation please read Documentation/ww-mutex-design.txt.
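
    A minimal usage sketch of the API (illustrative only: my_ww_class and the
    two-lock helper are made-up names, and error paths other than -EDEADLK
    are trimmed):

    static DEFINE_WW_CLASS(my_ww_class);

    static int lock_two(struct ww_mutex *a, struct ww_mutex *b,
                        struct ww_acquire_ctx *ctx)
    {
            int ret;

            ww_acquire_init(ctx, &my_ww_class);

            /* first lock of the context: nothing held yet, no back-off */
            ret = ww_mutex_lock(a, ctx);

            while (!ret && (ret = ww_mutex_lock(b, ctx)) == -EDEADLK) {
                    /* we are the younger (wounded) transaction: drop what
                     * we hold, sleep until the contended lock is ours,
                     * then go back and retake the other one */
                    ww_mutex_unlock(a);
                    ww_mutex_lock_slow(b, ctx);
                    swap(a, b);
                    ret = 0;
            }

            if (!ret)
                    ww_acquire_done(ctx);
            return ret;
    }

    Both locks are later dropped with ww_mutex_unlock() and the context is
    torn down with ww_acquire_fini().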

    References: https://lwn.net/Articles/548909/
    Signed-off-by: Maarten Lankhorst
    Acked-by: Daniel Vetter
    Acked-by: Rob Clark
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     
  • This will allow me to call functions that have multiple
    arguments if the fastpath fails. This is required to support
    ticket mutexes, because they need to be able to pass an extra
    argument to the fail function.

    Originally I duplicated the functions by adding
    __mutex_fastpath_lock_retval_arg. This ended up being just a
    duplication of the existing function, so a way to test if
    fastpath was called ended up being better.

    This also cleaned up the reservation mutex patch somewhat by
    making it possible to call an atomic_set instead of atomic_xchg,
    and making it easier to detect if the wrong unlock function was
    previously used.
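
    The resulting calling pattern looks roughly like this (a sketch along
    the lines of the ww_mutex lock wrapper this enables; exact function
    names differ between the mutex flavours):

    int __sched ww_mutex_lock(struct ww_mutex *lock, struct ww_acquire_ctx *ctx)
    {
            int ret;

            might_sleep();

            /* the fastpath now only reports success or failure ... */
            ret = __mutex_fastpath_lock_retval(&lock->base.count);
            if (likely(!ret)) {
                    ww_mutex_set_context_fastpath(lock, ctx);
                    return 0;
            }

            /* ... so the caller can pick a slowpath that takes the
             * extra ctx argument */
            return __ww_mutex_lock_slowpath(lock, ctx);
    }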

    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: robclark@gmail.com
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113105.4001.83929.stgit@patser
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     

19 Apr, 2013

4 commits

  • Linus suggested that probably all the supported architectures can
    allow a negative mutex count without incorrect behavior, so we can
    then back out the architecture specific change and allow the
    mutex count to go to any negative number. That should further
    reduce contention for non-x86 architectures.

    Suggested-by: Linus Torvalds
    Signed-off-by: Waiman Long
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Davidlohr Bueso
    Cc: Norton Scott J
    Cc: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-5-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • The current mutex spinning code (with the MUTEX_SPIN_ON_OWNER
    option turned on) allows multiple tasks to spin on a single mutex
    concurrently. A potential problem with the current approach is
    that when the mutex becomes available, all the spinning tasks
    will try to acquire the mutex more or less simultaneously. As a
    result, there will be a lot of cacheline bouncing especially on
    systems with a large number of CPUs.

    This patch tries to reduce this kind of contention by putting
    the mutex spinners into a queue so that only the first one in
    the queue will try to acquire the mutex. This will reduce
    contention and allow all the tasks to move forward faster.

    The queuing of mutex spinners is done using an MCS lock based
    implementation, which further reduces contention on the mutex
    cacheline compared to a similar ticket spinlock based
    implementation. This patch adds a new field to the mutex data
    structure for holding the MCS lock. This expands the mutex size
    by 8 bytes for 64-bit systems and 4 bytes for 32-bit systems.
    This overhead is avoided if the MUTEX_SPIN_ON_OWNER option is
    turned off.
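
    A simplified sketch of the MCS-style queue used for the spinners (close
    to the patch, with some details trimmed):

    struct mspin_node {
            struct mspin_node *next;
            int                locked;      /* 1 if lock acquired */
    };

    static void mspin_lock(struct mspin_node **lock, struct mspin_node *node)
    {
            struct mspin_node *prev;

            node->locked = 0;
            node->next   = NULL;

            prev = xchg(lock, node);        /* queue ourselves at the tail */
            if (prev == NULL)
                    return;                 /* queue was empty: we own the lock */

            ACCESS_ONCE(prev->next) = node;
            smp_wmb();
            while (!ACCESS_ONCE(node->locked))  /* spin on our own cacheline only */
                    arch_mutex_cpu_relax();
    }

    static void mspin_unlock(struct mspin_node **lock, struct mspin_node *node)
    {
            struct mspin_node *next = ACCESS_ONCE(node->next);

            if (!next) {
                    /* no successor queued: try to release the lock outright */
                    if (cmpxchg(lock, node, NULL) == node)
                            return;
                    /* a successor is in the middle of queueing itself */
                    while (!(next = ACCESS_ONCE(node->next)))
                            arch_mutex_cpu_relax();
            }
            ACCESS_ONCE(next->locked) = 1;      /* hand the lock to the next spinner */
            smp_wmb();
    }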

    The following table shows the jobs per minute (JPM) scalability
    data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
    numactl command is used to restrict the running of the fserver
    workloads to 1/2/4/8 nodes with hyperthreading off.

    +-----------------+-----------+-----------+-------------+----------+
    |  Configuration  | Mean JPM  | Mean JPM  |  Mean JPM   | % Change |
    |                 | w/o patch |  patch 1  | patches 1&2 |  1->1&2  |
    +-----------------+------------------------------------------------+
    |                 |             User Range 1100 - 2000              |
    +-----------------+------------------------------------------------+
    | 8 nodes, HT off |  227972   |  227237   |   305043    |  +34.2%  |
    | 4 nodes, HT off |  393503   |  381558   |   394650    |   +3.4%  |
    | 2 nodes, HT off |  334957   |  325240   |   338853    |   +4.2%  |
    | 1 node , HT off |  198141   |  197972   |   198075    |   +0.1%  |
    +-----------------+------------------------------------------------+
    |                 |              User Range 200 - 1000              |
    +-----------------+------------------------------------------------+
    | 8 nodes, HT off |  282325   |  312870   |   332185    |   +6.2%  |
    | 4 nodes, HT off |  390698   |  378279   |   393419    |   +4.0%  |
    | 2 nodes, HT off |  336986   |  326543   |   340260    |   +4.2%  |
    | 1 node , HT off |  197588   |  197622   |   197582    |    0.0%  |
    +-----------------+-----------+-----------+-------------+----------+

    At the low user range of 10-100, the JPM differences were within
    +/-1%, so they are not that interesting.

    The fserver workload uses mutex spinning extensively. With just
    the mutex change in the first patch, there is no noticeable
    change in performance. Rather, there is a slight drop in
    performance. This mutex spinning patch more than recovers the
    lost performance and shows a significant increase of +30% at
    high user load with the full 8 nodes. Similar improvements were
    also seen in a 3.8 kernel.

    The table below shows the %time spent by different kernel
    functions as reported by perf when running the fserver workload
    at 1500 users with all 8 nodes.

    +-----------------------+-----------+---------+-------------+
    | Function              |  % time   | % time  |   % time    |
    |                       | w/o patch | patch 1 | patches 1&2 |
    +-----------------------+-----------+---------+-------------+
    | __read_lock_failed    |  34.96%   | 34.91%  |   29.14%    |
    | __write_lock_failed   |  10.14%   | 10.68%  |    7.51%    |
    | mutex_spin_on_owner   |   3.62%   |  3.42%  |    2.33%    |
    | mspin_lock            |    N/A    |   N/A   |    9.90%    |
    | __mutex_lock_slowpath |   1.46%   |  0.81%  |    0.14%    |
    | _raw_spin_lock        |   2.25%   |  2.50%  |    1.10%    |
    +-----------------------+-----------+---------+-------------+

    The fserver workload for an 8-node system is dominated by the
    contention in the read/write lock. Mutex contention also plays a
    role. With the first patch only, mutex contention is down (as
    shown by the __mutex_lock_slowpath figure), which helps a little
    bit. We saw only a few percent improvement with that.

    By applying patch 2 as well, the single mutex_spin_on_owner
    figure is now split out into an additional mspin_lock figure.
    The time increases from 3.42% to 11.23%. It shows a great
    reduction in contention among the spinners leading to a 30%
    improvement. The time ratio 9.9/2.33=4.3 indicates that there
    are on average 4+ spinners waiting in the spin_lock loop for
    each spinner in the mutex_spin_on_owner loop. Contention in
    other locking functions also goes down by quite a lot.

    The table below shows the performance change of both patches 1 &
    2 over patch 1 alone in other AIM7 workloads (at 8 nodes,
    hyperthreading off).

    +--------------+---------------+----------------+-----------------+
    | Workload     | mean % change | mean % change  |  mean % change  |
    |              | 10-100 users  | 200-1000 users | 1100-2000 users |
    +--------------+---------------+----------------+-----------------+
    | alltests     |       0.0%    |      -0.8%     |      +0.6%      |
    | five_sec     |      -0.3%    |      +0.8%     |      +0.8%      |
    | high_systime |      +0.4%    |      +2.4%     |      +2.1%      |
    | new_fserver  |      +0.1%    |     +14.1%     |     +34.2%      |
    | shared       |      -0.5%    |      -0.3%     |      -0.4%      |
    | short        |      -1.7%    |      -9.8%     |      -8.3%      |
    +--------------+---------------+----------------+-----------------+

    The short workload is the only one that shows a decline in
    performance probably due to the spinner locking and queuing
    overhead.

    Signed-off-by: Waiman Long
    Reviewed-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Norton Scott J
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-4-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • In the __mutex_lock_common() function, an initial entry into
    the lock slow path will cause two atomic_xchg instructions to be
    issued. Together with the atomic decrement in the fast path, a
    total of three atomic read-modify-write instructions will be
    issued in rapid succession. This can cause a lot of cache
    bouncing when many tasks are trying to acquire the mutex at the
    same time.

    This patch will reduce the number of atomic_xchg instructions
    used by checking the counter value first before issuing the
    instruction. The atomic_read() function is just a simple memory
    read. The atomic_xchg() function, on the other hand, can be up
    to two orders of magnitude or even more in cost when compared with
    atomic_read(). By using atomic_read() to check the value first
    before calling atomic_xchg(), we can avoid a lot of unnecessary
    cache coherency traffic. The only downside with this change is
    that a task on the slow path will have a tiny bit less chance of
    getting the mutex when competing with another task in the fast
    path.

    The same is true for the atomic_cmpxchg() function in the
    mutex-spin-on-owner loop. So an atomic_read() is also performed
    before calling atomic_cmpxchg().
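
    In code, the idea is simply to guard the expensive read-modify-write
    with a plain read (a hypothetical helper, shown with the x86-style check
    that tolerates any negative count):

    static inline bool mutex_try_grab(struct mutex *lock)
    {
            /* cheap shared read first: skip the RMW when it cannot succeed */
            if (atomic_read(&lock->count) < 0)
                    return false;

            /* costly exclusive read-modify-write, now with a good chance */
            return atomic_xchg(&lock->count, -1) == 1;
    }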

    The mutex locking and unlocking code for the x86 architecture
    can allow any negative number to be used in the mutex count to
    indicate that some tasks are waiting for the mutex. I am not so
    sure if that is the case for the other architectures. So the
    default is to avoid atomic_xchg() if the count has already been
    set to -1. For x86, the check is modified to include all
    negative numbers to cover a larger set of cases.

    The following table shows the jobs per minute (JPM) scalability
    data on an 8-node 80-core Westmere box with a 3.7.10 kernel. The
    numactl command is used to restrict the running of the
    high_systime workloads to 1/2/4/8 nodes with hyperthreading on
    and off.

    +-----------------+-----------+------------+----------+
    |  Configuration  | Mean JPM  |  Mean JPM  | % Change |
    |                 | w/o patch | with patch |          |
    +-----------------+-----------------------------------+
    |                 |      User Range 1100 - 2000       |
    +-----------------+-----------------------------------+
    | 8 nodes, HT on  |   36980   |   148590   | +301.8%  |
    | 8 nodes, HT off |   42799   |   145011   | +238.8%  |
    | 4 nodes, HT on  |   61318   |   118445   |  +51.1%  |
    | 4 nodes, HT off |  158481   |   158592   |   +0.1%  |
    | 2 nodes, HT on  |  180602   |   173967   |   -3.7%  |
    | 2 nodes, HT off |  198409   |   198073   |   -0.2%  |
    | 1 node , HT on  |  149042   |   147671   |   -0.9%  |
    | 1 node , HT off |  126036   |   126533   |   +0.4%  |
    +-----------------+-----------------------------------+
    |                 |       User Range 200 - 1000       |
    +-----------------+-----------------------------------+
    | 8 nodes, HT on  |   41525   |   122349   | +194.6%  |
    | 8 nodes, HT off |   49866   |   124032   | +148.7%  |
    | 4 nodes, HT on  |   66409   |   106984   |  +61.1%  |
    | 4 nodes, HT off |  119880   |   130508   |   +8.9%  |
    | 2 nodes, HT on  |  138003   |   133948   |   -2.9%  |
    | 2 nodes, HT off |  132792   |   131997   |   -0.6%  |
    | 1 node , HT on  |  116593   |   115859   |   -0.6%  |
    | 1 node , HT off |  104499   |   104597   |   +0.1%  |
    +-----------------+-----------+------------+----------+

    At the low user range of 10-100, the JPM differences were within
    +/-1%, so they are not that interesting.

    An AIM7 benchmark run has a pretty large run-to-run variance due
    to the random nature of the subtests executed, so a difference of
    less than +/-5% may not be really significant.

    This patch improves high_systime workload performance at 4 nodes
    and up by maintaining transaction rates without significant
    drop-off at high node count. The patch has practically no
    impact on 1- and 2-node systems.

    The table below shows the percentage time (as reported by perf
    record -a -s -g) spent on the __mutex_lock_slowpath() function
    by the high_systime workload at 1500 users for 2/4/8-node
    configurations with hyperthreading off.

    +---------------+-----------------+------------------+---------+
    | Configuration | %Time w/o patch | %Time with patch | %Change |
    +---------------+-----------------+------------------+---------+
    | 8 nodes       |     65.34%      |      0.69%       |  -99%   |
    | 4 nodes       |      8.70%      |      1.02%       |  -88%   |
    | 2 nodes       |      0.41%      |      0.32%       |  -22%   |
    +---------------+-----------------+------------------+---------+

    It is obvious that the dramatic performance improvement at 8
    nodes was due to the drastic cut in the time spent within the
    __mutex_lock_slowpath() function.

    The table below shows the improvements in other AIM7 workloads
    (at 8 nodes, hyperthreading off).

    +--------------+---------------+----------------+-----------------+
    | Workload     | mean % change | mean % change  |  mean % change  |
    |              | 10-100 users  | 200-1000 users | 1100-2000 users |
    +--------------+---------------+----------------+-----------------+
    | alltests     |      +0.6%    |    +104.2%     |    +185.9%      |
    | five_sec     |      +1.9%    |      +0.9%     |      +0.9%      |
    | fserver      |      +1.4%    |      -7.7%     |      +5.1%      |
    | new_fserver  |      -0.5%    |      +3.2%     |      +3.1%      |
    | shared       |     +13.1%    |    +146.1%     |    +181.5%      |
    | short        |      +7.4%    |      +5.0%     |      +4.2%      |
    +--------------+---------------+----------------+-----------------+

    Signed-off-by: Waiman Long
    Reviewed-by: Davidlohr Bueso
    Reviewed-by: Rik van Riel
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Norton Scott J
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-3-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     
  • As mentioned by Ingo, the SCHED_FEAT_OWNER_SPIN scheduler
    feature bit was really just an early hack to make with/without
    mutex-spinning testable. So it is no longer necessary.

    This patch removes the SCHED_FEAT_OWNER_SPIN feature bit and
    moves the mutex spinning code from kernel/sched/core.c back to
    kernel/mutex.c, which is where it belongs.

    Signed-off-by: Waiman Long
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Chandramouleeswaran Aswin
    Cc: Davidlohr Bueso
    Cc: Norton Scott J
    Cc: Rik van Riel
    Cc: Paul E. McKenney
    Cc: David Howells
    Cc: Dave Jones
    Cc: Clark Williams
    Cc: Peter Zijlstra
    Link: http://lkml.kernel.org/r/1366226594-5506-2-git-send-email-Waiman.Long@hp.com
    Signed-off-by: Ingo Molnar

    Waiman Long
     

08 Feb, 2013

1 commit


01 Mar, 2012

1 commit


31 Oct, 2011

1 commit

  • The changed files were only including linux/module.h for the
    EXPORT_SYMBOL infrastructure, and nothing else. Revector them
    onto the isolated export header for faster compile times.

    Nothing to see here but a whole lot of instances of:

    -#include <linux/module.h>
    +#include <linux/export.h>

    This commit is only changing the kernel dir; next targets
    will probably be mm, fs, the arch dirs, etc.

    Signed-off-by: Paul Gortmaker

    Paul Gortmaker
     

25 May, 2011

1 commit

  • In order to convert i_mmap_lock to a mutex we need a mutex equivalent to
    spin_lock_nest_lock(), thus provide the mutex_lock_nest_lock() annotation.

    As with spin_lock_nest_lock(), mutex_lock_nest_lock() allows annotation of
    the locking pattern where an outer lock serializes the acquisition order
    of nested locks. That is, if every time you lock multiple locks A, say A1
    and A2, you first acquire N, the order of acquiring A1 and A2 is
    irrelevant.
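
    A usage sketch of the annotation (the mapping/vma names below are only
    illustrative):

    mutex_lock(&mapping->outer_mutex);      /* "N": serializes who may take
                                             * the nested locks below */

    mutex_lock_nest_lock(&vma1->mutex, &mapping->outer_mutex);   /* "A1" */
    mutex_lock_nest_lock(&vma2->mutex, &mapping->outer_mutex);   /* "A2" */

    With the nest_lock annotation, lockdep knows the A1/A2 ordering cannot
    produce a deadlock and will not report it as one.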

    Signed-off-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc: David Miller
    Cc: Martin Schwidefsky
    Cc: Russell King
    Cc: Paul Mundt
    Cc: Jeff Dike
    Cc: Richard Weinberger
    Cc: Tony Luck
    Cc: KAMEZAWA Hiroyuki
    Cc: Hugh Dickins
    Cc: Mel Gorman
    Cc: KOSAKI Motohiro
    Cc: Nick Piggin
    Cc: Namhyung Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     

24 Apr, 2011

1 commit

  • Neil Brown pointed out that lock_depth somehow escaped the BKL
    removal work. Let's get rid of it now.

    Note that the perf scripting utilities still have a bunch of
    code for dealing with common_lock_depth in tracepoints; I have
    left that in place in case anybody wants to use that code with
    older kernels.

    Suggested-by: Neil Brown
    Signed-off-by: Jonathan Corbet
    Cc: Arnd Bergmann
    Cc: Peter Zijlstra
    Cc: Thomas Gleixner
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Link: http://lkml.kernel.org/r/20110422111910.456c0e84@bike.lwn.net
    Signed-off-by: Ingo Molnar

    Jonathan Corbet
     

14 Apr, 2011

1 commit

  • Since we now have p->on_cpu unconditionally available, use it to
    re-implement mutex_spin_on_owner.

    Requested-by: Thomas Gleixner
    Reviewed-by: Frank Rowand
    Cc: Mike Galbraith
    Cc: Nick Piggin
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Signed-off-by: Ingo Molnar
    Signed-off-by: Peter Zijlstra
    Link: http://lkml.kernel.org/r/20110405152728.826338173@chello.nl

    Peter Zijlstra
     

31 Mar, 2011

1 commit


26 Nov, 2010

1 commit

  • The spinning mutex implementation uses cpu_relax() in busy loops as a
    compiler barrier. Depending on the architecture, cpu_relax() may do more
    than needed in these specific mutex spin loops. On System z we also give
    up the time slice of the virtual cpu in cpu_relax(), which prevents
    effective spinning on the mutex.

    This patch replaces cpu_relax() in the spinning mutex code with
    arch_mutex_cpu_relax(), which can be defined by each architecture that
    selects HAVE_ARCH_MUTEX_CPU_RELAX. The default is still cpu_relax(), so
    this patch should not affect other architectures than System z for now.
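
    The hook itself is a one-line override with a generic fallback, roughly:

    /* include/linux/mutex.h: default for everyone else */
    #ifndef CONFIG_HAVE_ARCH_MUTEX_CPU_RELAX
    #define arch_mutex_cpu_relax()  cpu_relax()
    #endif

    /* an architecture selecting HAVE_ARCH_MUTEX_CPU_RELAX can then supply
     * a cheaper definition, e.g. a plain compiler barrier on System z */
    #define arch_mutex_cpu_relax()  barrier()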

    Signed-off-by: Gerald Schaefer
    Signed-off-by: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Gerald Schaefer
     

03 Sep, 2010

1 commit


19 May, 2010

1 commit

  • Currently, we can hit a nasty case with optimistic
    spinning on mutexes:

    CPU A tries to take a mutex, while holding the BKL

    CPU B tries to take the BKL while holding the mutex

    This looks like an AB-BA scenario but, in practice, is
    allowed and happens due to the auto-release-on-schedule()
    nature of the BKL.

    In that case, the optimistic spinning code can get us
    into a situation where instead of going to sleep, A
    will spin waiting for B who is spinning waiting for
    A, and the only way out of that loop is the
    need_resched() test in mutex_spin_on_owner().

    This patch fixes it by completely disabling spinning
    if we own the BKL. This adds one more detail to the
    extensive list of reasons why it's a bad idea for
    kernel code to be holding the BKL.
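
    The fix boils down to bailing out of the optimistic-spin path when the
    caller holds the BKL (a sketch; at the time lock_depth >= 0 meant "BKL
    held"):

    /*
     * If we own the BKL, the mutex owner may well be blocked on us
     * releasing it (the BKL auto-releases on schedule()), so spinning
     * could livelock: take the normal sleeping slowpath instead.
     */
    if (unlikely(current->lock_depth >= 0))
            goto slowpath;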

    Signed-off-by: Tony Breeds
    Acked-by: Linus Torvalds
    Acked-by: Peter Zijlstra
    Cc: Benjamin Herrenschmidt
    Cc:
    LKML-Reference:
    [ added an unlikely() attribute to the branch ]
    Signed-off-by: Ingo Molnar

    Tony Breeds
     

03 Dec, 2009

1 commit


11 Jun, 2009

2 commits


11 May, 2009

1 commit


06 May, 2009

1 commit


30 Apr, 2009

1 commit

  • include/linux/mutex.h:136: warning: 'mutex_lock' declared inline after being called
    include/linux/mutex.h:136: warning: previous declaration of 'mutex_lock' was here

    uninline it.

    [ Impact: clean up and uninline, address compiler warning ]

    Signed-off-by: Andrew Morton
    Cc: Al Viro
    Cc: Christoph Hellwig
    Cc: Eric Paris
    Cc: Paul Mackerras
    Cc: Peter Zijlstra
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Andrew Morton
     

29 Apr, 2009

1 commit


21 Apr, 2009

1 commit


10 Apr, 2009

1 commit

  • Impact: performance regression fix for s390

    The adaptive spinning mutexes will not always do what one would expect on
    virtualized architectures like s390. Especially the cpu_relax() loop in
    mutex_spin_on_owner might hurt if the mutex-holding cpu has been scheduled
    away by the hypervisor.

    We would end up in a cpu_relax() loop when there is no chance that the
    state of the mutex changes until the target cpu has been scheduled again by
    the hypervisor.

    For that reason we should change the default behaviour to no-spin on s390.

    We do have an instruction which allows us to yield the current cpu in
    favour of a different target cpu. Also we have an instruction which
    allows us to figure out if the target cpu is physically backed.

    However we need to do some performance tests until we can come up with
    a solution that will do the right thing on s390.

    Signed-off-by: Heiko Carstens
    Acked-by: Peter Zijlstra
    Cc: Martin Schwidefsky
    Cc: Christian Borntraeger
    LKML-Reference:
    Signed-off-by: Ingo Molnar

    Heiko Carstens
     

06 Apr, 2009

1 commit

  • Impact: build fix

    mutex_lock() was defined inline in kernel/mutex.c, but wasn't
    declared so in <linux/mutex.h>. This didn't cause a problem until
    checkin 3a2d367d9aabac486ac4444c6c7ec7a1dab16267 added the
    atomic_dec_and_mutex_lock() inline in between declaration and
    definition.

    This broke building with CONFIG_ALLOW_WARNINGS=n, e.g. make
    allnoconfig.

    Neither the source code nor the allnoconfig binary output shows any
    internal references to mutex_lock() in kernel/mutex.c, so
    presumably this "inline" is now-useless legacy.

    Cc: Eric Paris
    Cc: Peter Zijlstra
    Cc: Paul Mackerras
    Orig-LKML-Reference:
    Signed-off-by: H. Peter Anvin

    H. Peter Anvin
     

15 Jan, 2009

4 commits

  • Spin more aggressively. This is less fair but also markedly faster.

    The numbers:

    * dbench 50 (higher is better):
    spin 1282MB/s
    v10 548MB/s
    v10 no wait 1868MB/s

    * 4k creates (numbers in files/second, higher is better):
    spin avg 200.60 median 193.20 std 19.71 high 305.93 low 186.82
    v10 avg 180.94 median 175.28 std 13.91 high 229.31 low 168.73
    v10 no wait avg 232.18 median 222.38 std 22.91 high 314.66 low 209.12

    * File stats (numbers in seconds, lower is better):
    spin 2.27s
    v10 5.1s
    v10 no wait 1.6s

    ( The source changes are smaller than they look, I just moved the
    need_resched checks in __mutex_lock_common after the cmpxchg. )

    Signed-off-by: Chris Mason
    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Chris Mason
     
  • Change mutex contention behaviour such that it will sometimes busy wait on
    acquisition - moving its behaviour closer to that of spinlocks.

    This concept got ported to mainline from the -rt tree, where it was originally
    implemented for rtmutexes by Steven Rostedt, based on work by Gregory Haskins.

    Testing with Ingo's test-mutex application (http://lkml.org/lkml/2006/1/8/50)
    gave a 345% boost for VFS scalability on my testbox:

    # ./test-mutex-shm V 16 10 | grep "^avg ops"
    avg ops/sec: 296604

    # ./test-mutex-shm V 16 10 | grep "^avg ops"
    avg ops/sec: 85870

    The key criterion for the busy wait is that the lock owner has to be running on
    a (different) cpu. The idea is that as long as the owner is running, there is a
    fair chance it'll release the lock soon, and thus we'll be better off spinning
    instead of blocking/scheduling.

    Since regular mutexes (as opposed to rtmutexes) do not atomically track the
    owner, we add the owner in a non-atomic fashion and deal with the races in
    the slowpath.
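
    Conceptually, the spin loop added to the slowpath looks like this
    (a simplified sketch; the real code also handles owner changes, RT tasks
    and lockdep bookkeeping):

    for (;;) {
            struct task_struct *owner;

            /* while the owner is running on another cpu, it is likely to
             * release the lock soon, so keep spinning on it */
            owner = ACCESS_ONCE(lock->owner);
            if (owner && !mutex_spin_on_owner(lock, owner))
                    break;                  /* owner stopped running: block */

            if (atomic_cmpxchg(&lock->count, 1, 0) == 1) {
                    mutex_set_owner(lock);
                    preempt_enable();
                    return 0;               /* acquired without sleeping */
            }

            if (need_resched())
                    break;                  /* give the cpu to someone else */

            cpu_relax();
    }
    /* fall through to the normal blocking slowpath */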

    Furthermore, to ease the testing of the performance impact of this new
    code, there is a means to disable this behaviour at runtime (without
    having to reboot the system), when scheduler debugging is enabled
    (CONFIG_SCHED_DEBUG=y), by issuing the following command:

    # echo NO_OWNER_SPIN > /debug/sched_features

    This command re-enables spinning again (this is also the default):

    # echo OWNER_SPIN > /debug/sched_features

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • The problem is that dropping the spinlock right before schedule is a voluntary
    preemption point and can cause a schedule, right after which we schedule again.

    Fix this inefficiency by keeping preemption disabled until we schedule;
    do this by explicitly disabling preemption and providing a schedule()
    variant that assumes preemption is already disabled.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     
  • Remove a local variable by combining an assignment and test in one.

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Ingo Molnar

    Peter Zijlstra
     

24 Nov, 2008

1 commit


20 Oct, 2008

1 commit


29 Jul, 2008

1 commit