04 Jul, 2013

3 commits

  • We print a stack dump after the idr_remove warning. This is useful
    for finding the faulty piece of code. Let's do the same for
    ida_remove, as it would be equally useful there.

    [akpm@linux-foundation.org: convert the open-coded printk+dump_stack into WARN()]
    Signed-off-by: Jean Delvare
    Cc: Tejun Heo
    Cc: Takashi Iwai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jean Delvare
     
  • __this_cpu_write doesn't need to be protected by the spinlock, as we
    are doing a per-cpu write with preemption disabled. Another reason to
    move __this_cpu_write outside of the spinlock: __percpu_counter_sum is
    not an accurate counter anyway.

    Signed-off-by: Fan Du
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fan Du
     
  • Add functionality to serialize the output from dump_stack() to avoid
    mangling of the output when dump_stack is called simultaneously from
    multiple cpus.

    [akpm@linux-foundation.org: fix comment indenting, avoid inclusion of files where possible, fix uniprocessor build (__dump_stack undefined), remove unneeded ifdef around smp.h inclusion]
    Signed-off-by: Alex Thorlton
    Reported-by: Russ Anderson
    Reviewed-by: Robin Holt
    Cc: Vineet Gupta
    Cc: David S. Miller
    Cc: Richard Kuo
    Cc: Jesper Nilsson
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Thorlton
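
    The serialization idea above can be sketched in userspace (hypothetical
    names; the real kernel version also handles reentrancy from NMIs, which
    is omitted here): one lock is held around the whole multi-line dump so
    dumps from concurrent callers cannot interleave with each other.

    ```c
    #include <pthread.h>
    #include <stdio.h>

    /* One lock guards the entire multi-line dump. */
    static pthread_mutex_t dump_lock = PTHREAD_MUTEX_INITIALIZER;

    /* Stand-in for the real backtrace printer; returns lines emitted. */
    static int dump_frames(const char *who)
    {
        printf("%s: frame #0\n", who);
        printf("%s: frame #1\n", who);
        return 2;
    }

    int dump_stack_serialized(const char *who)
    {
        int lines;

        pthread_mutex_lock(&dump_lock);   /* serialize the whole dump */
        lines = dump_frames(who);
        pthread_mutex_unlock(&dump_lock);
        return lines;
    }
    ```

    Because the lock spans the whole dump rather than each printed line,
    a dump from one caller appears as one contiguous block in the output.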
     

03 Jul, 2013

4 commits

  • Pull per-cpu changes from Tejun Heo:
    "This pull request contains Kent's per-cpu reference counter. It has
    gone through several iterations since the last time and the dynamic
    allocation is gone.

    The usual usage is relatively straightforward, although the async kill
    confirm interface, which is not used in most cases, is somewhat icky.
    There also are some interface concerns - e.g. I'm not sure about
    passing in the @release callback during init, as that becomes funny
    when we later implement synchronous kill_and_drain - but nothing too
    serious and it's quite usable now.

    cgroup_subsys_state refcnting has already been converted and we should
    convert module refcnt (Kent?)"

    * 'for-3.11' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/percpu:
    percpu-refcount: use RCU-sched instead of normal RCU
    percpu-refcount: implement percpu_tryget() along with percpu_ref_kill_and_confirm()
    percpu-refcount: implement percpu_ref_cancel_init()
    percpu-refcount: add __must_check to percpu_ref_init() and don't use ACCESS_ONCE() in percpu_ref_kill_rcu()
    percpu-refcount: cosmetic updates
    percpu-refcount: consistently use plain (non-sched) RCU
    percpu-refcount: Don't use silly cmpxchg()
    percpu: implement generic percpu refcounting

    Linus Torvalds
     
  • Pull WW mutex support from Ingo Molnar:
    "This tree adds support for wound/wait style locks, which the graphics
    guys would like to make use of in the TTM graphics subsystem.

    Wound/wait mutexes are used when multiple lock acquisitions of a
    similar type can be done in an arbitrary order. The deadlock handling
    used here is called wait/wound in the RDBMS literature: the older
    task waits until it can acquire the contended lock; the younger
    task needs to back off and drop all the locks it is currently
    holding, i.e. the younger task is wounded.

    See this LWN.net description of W/W mutexes:

    https://lwn.net/Articles/548909/

    The comments there outline specific usecases for this facility (which
    have already been implemented for the DRM tree).

    Also see Documentation/ww-mutex-design.txt for more details"

    * 'core-mutexes-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    locking-selftests: Handle unexpected failures more strictly
    mutex: Add more w/w tests to test EDEADLK path handling
    mutex: Add more tests to lib/locking-selftest.c
    mutex: Add w/w tests to lib/locking-selftest.c
    mutex: Add w/w mutex slowpath debugging
    mutex: Add support for wound/wait style locks
    arch: Make __mutex_fastpath_lock_retval return whether fastpath succeeded or not

    Linus Torvalds
     
  • …ernel/git/arm/arm-soc

    Pull ARM SoC non-critical bug fixes from Arnd Bergmann:
    "These are various bug fixes that were not considered important enough
    for merging into 3.10.

    The majority of the ARM fixes are for the OMAP and at91 platforms, and
    there is another set of bug fixes for device drivers that resolve
    'randconfig' build errors and that the subsystem maintainers either
    did not pick up or preferred to get merged through the arm-soc tree."

    * tag 'fixes-non-critical-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/arm/arm-soc: (43 commits)
    ARM: at91/PMC: use at91_usb_rate() for UTMI PLL
    ARM: at91/PMC: fix at91sam9n12 USB FS init
    ARM: at91/PMC: at91sam9n12 family has a PLLB
    ARM: at91/PMC: sama5d3 family doesn't have a PLLB
    ARM: tegra: fix section mismatch in tegra_pmc_parse_dt
    ARM: mxs: don't select HAVE_PWM
    ARM: mxs: stub out mxs_pm_init for !CONFIG_PM
    cpuidle: calxeda: select ARM_CPU_SUSPEND
    ARM: mvebu: fix length of ethernet registers in mv78260 dtsi
    ARM: at91: cpuidle: Fix target_residency
    ARM: at91: fix at91_extern_irq usage for non-dt boards
    ARM: sirf: use CONFIG_SIRF rather than CONFIG_PRIMA2 where necessary
    clocksource: kona: adapt to CLOCKSOURCE_OF_DECLARE change
    X.509: do not emit any informational output
    mtd: omap2: allow building as a module
    [SCSI] nsp32: use mdelay instead of large udelay constants
    hwrng: bcm2835: fix MODULE_LICENSE tag
    ARM: at91: Change the internal SRAM memory type MT_MEMORY_NONCACHED
    ARM: at91: Fix link breakage when !CONFIG_PHYLIB
    MAINTAINERS: Add exynos filename match to ARM/S5P EXYNOS ARM ARCHITECTURES
    ...

    Linus Torvalds
     
  • Pull driver core updates from Greg KH:
    "Here's the big driver core merge for 3.11-rc1

    Lots of little things, and larger firmware subsystem updates, all
    described in the shortlog. Nice thing here is that we finally get rid
    of CONFIG_HOTPLUG, after 10+ years, thanks to Stephen Rothwell (it had
    been always on for a number of kernel releases, now it's just
    removed)"

    * tag 'driver-core-3.11-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (27 commits)
    driver core: device.h: fix doc compilation warnings
    firmware loader: fix another compile warning with PM_SLEEP unset
    build some drivers only when compile-testing
    firmware loader: fix compile warning with PM_SLEEP set
    kobject: sanitize argument for format string
    sysfs_notify is only possible on file attributes
    firmware loader: simplify holding module for request_firmware
    firmware loader: don't export cache_firmware and uncache_firmware
    drivers/base: Use attribute groups to create sysfs memory files
    firmware loader: fix compile warning
    firmware loader: fix build failure with !CONFIG_FW_LOADER_USER_HELPER
    Documentation: Updated broken link in HOWTO
    Finally eradicate CONFIG_HOTPLUG
    driver core: firmware loader: kill FW_ACTION_NOHOTPLUG requests before suspend
    driver core: firmware loader: don't cache FW_ACTION_NOHOTPLUG firmware
    Documentation: Tidy up some drivers/base/core.c kerneldoc content.
    platform_device: use a macro instead of platform_driver_register
    firmware: move EXPORT_SYMBOL annotations
    firmware: Avoid deadlock of usermodehelper lock at shutdown
    dell_rbu: Select CONFIG_FW_LOADER_USER_HELPER explicitly
    ...

    Linus Torvalds
     

26 Jun, 2013

6 commits

  • When CONFIG_PROVE_LOCKING is not enabled, more tests are expected to
    pass unexpectedly, but no tests that pass with CONFIG_PROVE_LOCKING
    enabled should start to fail.

    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113151.4001.77963.stgit@patser
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     
  • Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113141.4001.54331.stgit@patser
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     
  • None of the ww_mutex codepaths should be taken in the 'normal'
    mutex calls. The easiest way to verify this is by using the
    normal mutex calls, and making sure o.ctx is unmodified.

    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: robclark@gmail.com
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113130.4001.45423.stgit@patser
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     
  • This stresses the lockdep code in some ways specifically useful
    to ww_mutexes. It adds checks for most of the common locking
    errors.

    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: robclark@gmail.com
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113124.4001.23186.stgit@patser
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
     
  • Injects EDEADLK conditions at pseudo-random interval, with
    exponential backoff up to UINT_MAX (to ensure that every lock
    operation still completes in a reasonable time).

    This way we can test the wound slowpath even for ww mutex users
    where contention is never expected, and the ww deadlock
    avoidance algorithm is only needed for correctness against
    malicious userspace. An example would be protecting kernel
    modesetting properties, which thanks to single-threaded X isn't
    really expected to contend, ever.

    I've looked into using the CONFIG_FAULT_INJECTION
    infrastructure, but decided against it for two reasons:

    - EDEADLK handling is mandatory for ww mutex users and should
    never affect the outcome of a syscall. This is in contrast to -ENOMEM
    injection. So fine configurability isn't required.

    - The fault injection framework only allows setting a simple
    probability for failure. Now the probability that a ww mutex acquire
    stage with N locks will never complete (due to too many injected
    EDEADLK backoffs) is zero. But the expected number of ww_mutex_lock
    operations for the completely uncontended case would be O(exp(N)).
    The per-acquire-ctx exponential backoff solution chosen here only
    results in O(log N) overhead due to injection, and so O(log N * N)
    lock operations. This way we can fail with high probability (and so
    have good test coverage even for fancy backoff and lock acquisition
    paths) without running into pathological cases.

    Note that EDEADLK will only ever be injected when we managed to
    acquire the lock. This prevents any behaviour changes for users
    which rely on the EALREADY semantics.

    Signed-off-by: Daniel Vetter
    Signed-off-by: Maarten Lankhorst
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/20130620113117.4001.21681.stgit@patser
    Signed-off-by: Ingo Molnar

    Daniel Vetter
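
    The injection policy described above can be sketched as follows
    (hypothetical field and function names; the real code lives inside the
    ww_mutex acquire path): each acquire context carries a countdown, and
    when it hits zero a fake EDEADLK is injected and the interval doubles,
    saturating at UINT_MAX so every lock sequence eventually completes.

    ```c
    #include <limits.h>

    /* Per-acquire context for deterministic EDEADLK injection. */
    struct acquire_ctx {
        unsigned int deadlock_inject_interval;  /* current backoff interval */
        unsigned int deadlock_inject_countdown;
    };

    void ctx_init(struct acquire_ctx *ctx)
    {
        ctx->deadlock_inject_interval = 1;
        ctx->deadlock_inject_countdown = 1;
    }

    /* Returns 1 when a simulated -EDEADLK should be injected. */
    int inject_edeadlk(struct acquire_ctx *ctx)
    {
        if (--ctx->deadlock_inject_countdown == 0) {
            /* exponential backoff, saturating at UINT_MAX */
            if (ctx->deadlock_inject_interval > UINT_MAX / 2)
                ctx->deadlock_inject_interval = UINT_MAX;
            else
                ctx->deadlock_inject_interval *= 2;
            ctx->deadlock_inject_countdown = ctx->deadlock_inject_interval;
            return 1;
        }
        return 0;
    }
    ```

    With this scheme injections arrive at attempts 1, 3, 7, 15, ... so the
    overhead per acquire context grows only logarithmically, as the commit
    message argues.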
     
  • Wound/wait mutexes are used when multiple lock acquisitions of a
    similar type can be done in an arbitrary order. The deadlock
    handling used here is called wait/wound in the RDBMS literature:
    the older task waits until it can acquire the contended lock; the
    younger task needs to back off and drop all the locks it is
    currently holding, i.e. the younger task is wounded.

    For full documentation please read Documentation/ww-mutex-design.txt.

    References: https://lwn.net/Articles/548909/
    Signed-off-by: Maarten Lankhorst
    Acked-by: Daniel Vetter
    Acked-by: Rob Clark
    Acked-by: Peter Zijlstra
    Cc: dri-devel@lists.freedesktop.org
    Cc: linaro-mm-sig@lists.linaro.org
    Cc: rostedt@goodmis.org
    Cc: daniel@ffwll.ch
    Cc: Linus Torvalds
    Cc: Andrew Morton
    Cc: Thomas Gleixner
    Link: http://lkml.kernel.org/r/51C8038C.9000106@canonical.com
    Signed-off-by: Ingo Molnar

    Maarten Lankhorst
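
    The wait/wound decision rule can be reduced to a tiny sketch (names
    are hypothetical, not the real ww_mutex API): contexts are ordered by
    an age "ticket"; on contention the older context waits, while the
    younger one is wounded and must back off and drop its locks.

    ```c
    /* What a contending context must do. */
    enum ww_action { WW_WAIT, WW_BACKOFF };

    struct ww_ctx {
        unsigned long ticket;   /* lower ticket == older task */
    };

    /* Decide the outcome when 'me' contends with the current 'holder'. */
    enum ww_action ww_contend(const struct ww_ctx *me,
                              const struct ww_ctx *holder)
    {
        /* older task waits for the lock; younger task is wounded */
        return (me->ticket < holder->ticket) ? WW_WAIT : WW_BACKOFF;
    }
    ```

    Because the ordering is total, at least one task in any cycle is the
    youngest and backs off, which is what rules out deadlock.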
     

19 Jun, 2013

1 commit

  • When building a kernel using 'make -s', I expect to see no output,
    except for build warnings and errors. The build_OID_registry code
    always prints one line when run, which is not helpful to most people
    building kernels, and which makes it harder to automatically
    check for build warnings.

    Let's just remove the one line output.

    Signed-off-by: Arnd Bergmann
    Cc: David Howells
    Cc: Rusty Russell

    Arnd Bergmann
     

18 Jun, 2013

2 commits


17 Jun, 2013

1 commit

  • percpu-refcount was incorrectly using preempt_disable/enable() for RCU
    critical sections against call_rcu(). 6a24474da8 ("percpu-refcount:
    consistently use plain (non-sched) RCU") fixed it by converting the
    preemption operations to rcu_read_[un]lock(), citing that there isn't
    any advantage in using sched-RCU over the usual one; however,
    rcu_read_[un]lock() for the preemptible RCU implementation -
    CONFIG_TREE_PREEMPT_RCU, chosen when CONFIG_PREEMPT is set - is
    slightly more expensive than preempt_disable/enable().

    In a contrived microbench which repeats the following,

    - percpu_ref_get()
    - copy 32 bytes of data into percpu buffer
    - percpu_ref_put()
    - copy 32 bytes of data into percpu buffer

    the rcu_read_[un]lock() used in percpu_ref_get/put() makes it about
    15% slower than using sched-RCU.

    As the RCU critical sections are extremely short, using sched-RCU
    shouldn't have any latency implications. Convert to RCU-sched.

    Signed-off-by: Tejun Heo
    Acked-by: Kent Overstreet
    Acked-by: "Paul E. McKenney"
    Cc: Michal Hocko
    Cc: Rusty Russell

    Tejun Heo
     

14 Jun, 2013

3 commits

  • Implement percpu_tryget(), which stops giving out references once the
    percpu_ref is visible as killed. Because the refcnt is per-cpu,
    different CPUs will start to see the refcnt as killed at different
    points in time, and tryget() may continue to succeed on a subset of
    CPUs for a while after percpu_ref_kill() returns.

    For use cases where it's necessary to know when all CPUs start to see
    the refcnt as dead, percpu_ref_kill_and_confirm() is added. The new
    function takes an extra argument @confirm_kill which is invoked when
    the refcnt is guaranteed to be viewed as killed on all CPUs.

    While this isn't the prettiest interface, it doesn't force synchronous
    wait and is much safer than requiring the caller to do its own
    call_rcu().

    v2: Patch description rephrased to emphasize that tryget() may
    continue to succeed on some CPUs after kill() returns as suggested
    by Kent.

    v3: Function comment in percpu_ref_kill_and_confirm() updated to warn
    people not to depend on the implied RCU grace period from the
    confirm callback, as it's an implementation detail.

    Signed-off-by: Tejun Heo
    Slightly-Grumpily-Acked-by: Kent Overstreet

    Tejun Heo
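
    The visible behaviour of tryget-after-kill can be sketched with a
    single shared atomic (hypothetical names; the real implementation is
    per-cpu, which is exactly why different CPUs can observe the kill at
    different times, a nuance this simplified version cannot show):

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>

    struct ref {
        atomic_long count;
        atomic_bool killed;
    };

    void ref_init(struct ref *r)
    {
        atomic_init(&r->count, 1);
        atomic_init(&r->killed, false);
    }

    /* Stops giving out references once the ref is visible as killed. */
    int ref_tryget(struct ref *r)
    {
        if (atomic_load(&r->killed))
            return 0;
        atomic_fetch_add(&r->count, 1);
        return 1;
    }

    void ref_kill(struct ref *r)
    {
        atomic_store(&r->killed, true);
    }
    ```

    Note this sketch is not race-free the way the real code is; it only
    illustrates the contract that tryget() fails once the kill is visible
    to the caller.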
     
  • Normally, percpu_ref_init() initializes and percpu_ref_kill()
    initiates destruction which completes asynchronously. The
    asynchronous destruction can be problematic in init failure path where
    the caller wants to destroy half-constructed object - distinguishing
    half-constructed objects from the usual release method can be painful
    for complex objects.

    This patch implements percpu_ref_cancel_init() which synchronously
    destroys the percpu_ref without invoking release. To avoid
    unintentional misuse, the function requires the ref to have finished
    percpu_ref_init() but never been used, and triggers a WARN otherwise.

    v2: Explain the weird name and usage restriction in the function
    comment.

    Signed-off-by: Tejun Heo
    Acked-by: Kent Overstreet

    Tejun Heo
     
  • …() in percpu_ref_kill_rcu()

    Two small changes.

    * Unlike most init functions, percpu_ref_init() allocates memory and
    may fail. Let's mark it with __must_check in case the caller
    forgets.

    * percpu_ref_kill_rcu() is unnecessarily using ACCESS_ONCE() to
    dereference @ref->pcpu_count, which can be misleading. The pointer
    is guaranteed to be valid and visible and can't change underneath
    the function. Drop ACCESS_ONCE().

    Signed-off-by: Tejun Heo <tj@kernel.org>

    Tejun Heo
     

13 Jun, 2013

2 commits

  • * s/percpu_ref_release/percpu_ref_func_t/ as it's customary to have a
    _t postfix for types, and the type is going to be used for a different
    type of callback too.

    * Add @ARG to function comments.

    * Drop unnecessary and unaligned indentation from percpu_ref_init()
    function comment.

    Signed-off-by: Tejun Heo
    Acked-by: Kent Overstreet

    Tejun Heo
     
  • The 'while' loop needs to stop when 'nbytes == 0', or it will never
    terminate: 'nbytes' is a size_t, which is always greater than or equal
    to zero, so the 'nbytes >= 0' condition is always true.

    The related warning: (with EXTRA_CFLAGS=-W)

    lib/mpi/mpicoder.c:40:2: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]

    Signed-off-by: Chen Gang
    Cc: Rusty Russell
    Cc: David Howells
    Cc: James Morris
    Cc: Andy Shevchenko
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
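
    The type-limits bug above is easy to reproduce in isolation (the loop
    body here is a hypothetical stand-in for the mpicoder byte handling):
    with an unsigned size_t, 'nbytes >= 0' is always true, so the loop
    must terminate on 'nbytes > 0' instead.

    ```c
    #include <stddef.h>

    /* Consume 'nbytes' one at a time; returns the number of iterations.
     * Using 'while (nbytes >= 0)' here would loop forever (and draws the
     * -Wtype-limits warning), because size_t is unsigned. */
    size_t consume(size_t nbytes)
    {
        size_t iterations = 0;

        while (nbytes > 0) {
            nbytes--;           /* stand-in for consuming one byte */
            iterations++;
        }
        return iterations;
    }
    ```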
     

08 Jun, 2013

1 commit

  • Unlike kobject_set_name(), the kset_create_and_add() interface does not
    provide a way to use format strings, so make sure that the interface
    cannot be abused accidentally. It looks like all current callers use
    static strings, so there's no existing flaw.

    Signed-off-by: Kees Cook
    Signed-off-by: Greg Kroah-Hartman

    Kees Cook
     

06 Jun, 2013

1 commit

  • Since we have at least one user of this function outside of the
    CONFIG_NET scope, we have to provide it independently. The proposed
    solution is to move it to lib/net_utils.c with a corresponding
    configuration variable, and select that wherever it is needed.

    Signed-off-by: Andy Shevchenko
    Reported-by: Arnd Bergmann
    Acked-by: David S. Miller
    Acked-by: Arnd Bergmann
    Signed-off-by: Greg Kroah-Hartman

    Andy Shevchenko
     

04 Jun, 2013

3 commits

  • The cmpxchg() was just to ensure the debug check didn't race, which was
    a bit excessive. The caller is supposed to do the appropriate
    synchronization, which means percpu_ref_kill() can just do a simple
    store.

    Signed-off-by: Kent Overstreet
    Signed-off-by: Tejun Heo

    Kent Overstreet
     
  • This implements a refcount with similar semantics to
    atomic_get()/atomic_dec_and_test() - but percpu.

    It also implements two stage shutdown, as we need it to tear down the
    percpu counts. Before dropping the initial refcount, you must call
    percpu_ref_kill(); this puts the refcount in "shutting down mode" and
    switches back to a single atomic refcount with the appropriate
    barriers (synchronize_rcu()).

    It's also legal to call percpu_ref_kill() multiple times - it only
    returns true once, so callers don't have to reimplement shutdown
    synchronization.

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style tweak]
    Signed-off-by: Kent Overstreet
    Cc: Zach Brown
    Cc: Felipe Balbi
    Cc: Greg Kroah-Hartman
    Cc: Mark Fasheh
    Cc: Joel Becker
    Cc: Rusty Russell
    Cc: Jens Axboe
    Cc: Asai Thambi S P
    Cc: Selvan Mani
    Cc: Sam Bradshaw
    Cc: Jeff Moyer
    Cc: Al Viro
    Cc: Benjamin LaHaise
    Cc: Tejun Heo
    Cc: Oleg Nesterov
    Cc: Christoph Lameter
    Cc: Ingo Molnar
    Reviewed-by: "Theodore Ts'o"
    Signed-off-by: Tejun Heo

    Kent Overstreet
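
    The "kill returns true only once" guarantee can be sketched with a
    plain atomic exchange (hypothetical names; the percpu machinery and
    the synchronize_rcu()-based switch to atomic mode are omitted):

    ```c
    #include <stdatomic.h>
    #include <stdbool.h>

    struct pref {
        atomic_long count;
        atomic_bool dying;   /* set once shutdown has been initiated */
    };

    void pref_init(struct pref *r)
    {
        atomic_init(&r->count, 1);
        atomic_init(&r->dying, false);
    }

    /* Legal to call multiple times; returns true exactly once, for the
     * caller that actually initiated shutdown, so callers don't have to
     * reimplement shutdown synchronization themselves. */
    bool pref_kill(struct pref *r)
    {
        return !atomic_exchange(&r->dying, true);
    }
    ```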
     
  • debugfs currently lacks the ability to create attributes
    that set/get atomic_t values.

    This patch adds support for this through a new
    debugfs_create_atomic_t() function.

    Signed-off-by: Seth Jennings
    Acked-by: Greg Kroah-Hartman
    Acked-by: Mel Gorman
    Acked-by: Rik van Riel
    Acked-by: Konrad Rzeszutek Wilk
    Signed-off-by: Greg Kroah-Hartman

    Seth Jennings
     

25 May, 2013

1 commit

  • The umul_ppmm() macro for parisc uses the xmpyu assembler statement
    which does calculation via a floating point register.

    But usage of floating point registers inside the Linux kernel is not
    allowed, and gcc will stop compilation due to the -mdisable-fpregs
    compiler option.

    Fix this by disabling the umul_ppmm() and udiv_qrnnd() macros. The
    mpilib will then use the generic built-in implementations instead.

    Signed-off-by: Helge Deller

    Helge Deller
     

24 May, 2013

2 commits

  • Pull driver core fixes from Greg Kroah-Hartman:
    "Here are 3 tiny driver core fixes for 3.10-rc2.

    A needed symbol export, a change to make it easier to track down
    offending sysfs files with incorrect attributes, and a klist bugfix.

    All have been in linux-next for a while"

    * tag 'driver-core-3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    klist: del waiter from klist_remove_waiters before wakeup waiting process
    driver core: print sysfs attribute name when warning about bogus permissions
    driver core: export subsys_virtual_register

    Linus Torvalds
     
  • Fix a build error in vmw_vmci.ko when CONFIG_VMWARE_VMCI=m by changing
    iovec.o from lib-y to obj-y.

    ERROR: "memcpy_toiovec" [drivers/misc/vmw_vmci/vmw_vmci.ko] undefined!
    ERROR: "memcpy_fromiovec" [drivers/misc/vmw_vmci/vmw_vmci.ko] undefined!

    Signed-off-by: Randy Dunlap
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

22 May, 2013

1 commit

  • There is a race between klist_remove and klist_release. klist_remove
    uses a local variable 'waiter' saved on the stack. When klist_release
    calls wake_up_process(waiter->process) to wake up the waiter, the
    waiter might run immediately and reuse the stack. Then klist_release
    calls list_del(&waiter->list), modifying what is now another thread's
    stack and corrupting the prior waiter thread.

    The patch fixes it against kernel 3.9.

    Signed-off-by: wang, biao
    Acked-by: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    wang, biao
     

20 May, 2013

1 commit

  • ERROR: "memcpy_fromiovec" [drivers/vhost/vhost_scsi.ko] undefined!

    That function is only present with CONFIG_NET. Turns out that
    crypto/algif_skcipher.c also uses that outside net, but it actually
    needs sockets anyway.

    In addition, commit 6d4f0139d642c45411a47879325891ce2a7c164a added
    CONFIG_NET dependency to CONFIG_VMCI for memcpy_toiovec, so hoist
    that function and revert that commit too.

    socket.h already includes uio.h, so no callers need updating; trying
    that only broke things for x86_64 randconfig (thanks Fengguang!).

    Reported-by: Randy Dunlap
    Acked-by: David S. Miller
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Rusty Russell

    Rusty Russell
     

09 May, 2013

1 commit

  • Pull block driver updates from Jens Axboe:
    "It might look big in volume, but when categorized, not a lot of
    drivers are touched. The pull request contains:

    - mtip32xx fixes from Micron.

    - A slew of drbd updates, this time in a nicer series.

    - bcache, a flash/ssd caching framework from Kent.

    - Fixes for cciss"

    * 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
    bcache: Use bd_link_disk_holder()
    bcache: Allocator cleanup/fixes
    cciss: bug fix to prevent cciss from loading in kdump crash kernel
    cciss: add cciss_allow_hpsa module parameter
    drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
    mtip32xx: Workaround for unaligned writes
    bcache: Make sure blocksize isn't smaller than device blocksize
    bcache: Fix merge_bvec_fn usage for when it modifies the bvm
    bcache: Correctly check against BIO_MAX_PAGES
    bcache: Hack around stuff that clones up to bi_max_vecs
    bcache: Set ra_pages based on backing device's ra_pages
    bcache: Take data offset from the bdev superblock.
    mtip32xx: mtip32xx: Disable TRIM support
    mtip32xx: fix a smatch warning
    bcache: Disable broken btree fuzz tester
    bcache: Fix a format string overflow
    bcache: Fix a minor memory leak on device teardown
    bcache: Documentation updates
    bcache: Use WARN_ONCE() instead of __WARN()
    bcache: Add missing #include
    ...

    Linus Torvalds
     

08 May, 2013

3 commits

  • This patch tries to reduce the number of cmpxchg calls in the writer
    failed path by checking the counter value first before issuing the
    instruction. If ->count is not set to RWSEM_WAITING_BIAS then there is
    no point wasting a cmpxchg call.

    Furthermore, Michel states "I suppose it helps due to the case where
    someone else steals the lock while we're trying to acquire
    sem->wait_lock."

    Two very different workloads and machines were used to see how this
    patch improves throughput: pgbench on a quad-core laptop and aim7 on a
    large 8 socket box with 80 cores.

    Some results comparing Michel's fast-path write lock stealing
    (tps-rwsem) on a quad-core laptop running pgbench:

    | db_size | clients | tps-rwsem | tps-patch |
    +---------+----------+----------------+--------------+
    | 160 MB | 1 | 6906 | 9153 | + 32.5%
    | 160 MB | 2 | 15931 | 22487 | + 41.1%
    | 160 MB | 4 | 33021 | 32503 |
    | 160 MB | 8 | 34626 | 34695 |
    | 160 MB | 16 | 33098 | 34003 |
    | 160 MB | 20 | 31343 | 31440 |
    | 160 MB | 30 | 28961 | 28987 |
    | 160 MB | 40 | 26902 | 26970 |
    | 160 MB | 50 | 25760 | 25810 |
    ------------------------------------------------------
    | 1.6 GB | 1 | 7729 | 7537 |
    | 1.6 GB | 2 | 19009 | 23508 | + 23.7%
    | 1.6 GB | 4 | 33185 | 32666 |
    | 1.6 GB | 8 | 34550 | 34318 |
    | 1.6 GB | 16 | 33079 | 32689 |
    | 1.6 GB | 20 | 31494 | 31702 |
    | 1.6 GB | 30 | 28535 | 28755 |
    | 1.6 GB | 40 | 27054 | 27017 |
    | 1.6 GB | 50 | 25591 | 25560 |
    ------------------------------------------------------
    | 7.6 GB | 1 | 6224 | 7469 | + 20.0%
    | 7.6 GB | 2 | 13611 | 12778 |
    | 7.6 GB | 4 | 33108 | 32927 |
    | 7.6 GB | 8 | 34712 | 34878 |
    | 7.6 GB | 16 | 32895 | 33003 |
    | 7.6 GB | 20 | 31689 | 31974 |
    | 7.6 GB | 30 | 29003 | 28806 |
    | 7.6 GB | 40 | 26683 | 26976 |
    | 7.6 GB | 50 | 25925 | 25652 |
    ------------------------------------------------------

    The aim7 workloads overall improved on top of Michel's
    patchset. For full graphs of how the rwsem series plus this patch
    behaves on a large 8-socket machine against a vanilla kernel:

    http://stgolabs.net/rwsem-aim7-results.tar.gz

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
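
    The check-before-cmpxchg optimization can be sketched in userspace
    with C11 atomics (names and the specific transition are hypothetical,
    not the actual rwsem code): read the counter first, and only pay for
    the compare-exchange when the value could actually match.

    ```c
    #include <stdatomic.h>

    #define RWSEM_WAITING_BIAS (-1L)

    /* Try to transition the counter from RWSEM_WAITING_BIAS to 0.
     * Returns 1 on success, 0 otherwise. */
    long try_undo_wait(atomic_long *count)
    {
        long old = atomic_load(count);

        if (old != RWSEM_WAITING_BIAS)
            return 0;            /* no point wasting a cmpxchg call */

        /* attempt the transition only when the cheap check passed */
        return atomic_compare_exchange_strong(count, &old, 0L);
    }
    ```

    The early load is much cheaper than an unconditional cmpxchg on a
    contended cache line, which is where the commit's throughput gains
    come from.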
     
  • - make warning smp-safe
    - result of atomic _unless_zero functions should be checked by caller
    to avoid use-after-free error
    - trivial whitespace fix.

    Link: https://lkml.org/lkml/2013/4/12/391

    Tested: compile x86, boot machine and run xfstests
    Signed-off-by: Anatol Pomozov
    [ Removed line-break, changed to use WARN_ON_ONCE() - Linus ]
    Signed-off-by: Linus Torvalds

    Anatol Pomozov
     
  • Merge rwsem optimizations from Michel Lespinasse:
    "These patches extend Alex Shi's work (which added write lock stealing
    on the rwsem slow path) in order to provide rwsem write lock stealing
    on the fast path (that is, without taking the rwsem's wait_lock).

    I have unfortunately been unable to push this through -next before due
    to Ingo Molnar / David Howells / Peter Zijlstra being busy with other
    things. However, this has gotten some attention from Rik van Riel and
    Davidlohr Bueso who both commented that they felt this was ready for
    v3.10, and Ingo Molnar has said that he was OK with me pushing
    directly to you. So, here goes :)

    Davidlohr got the following test results from pgbench running on a
    quad-core laptop:

    | db_size | clients | tps-vanilla | tps-rwsem |
    +---------+----------+----------------+--------------+
    | 160 MB | 1 | 5803 | 6906 | + 19.0%
    | 160 MB | 2 | 13092 | 15931 |
    | 160 MB | 4 | 29412 | 33021 |
    | 160 MB | 8 | 32448 | 34626 |
    | 160 MB | 16 | 32758 | 33098 |
    | 160 MB | 20 | 26940 | 31343 | + 16.3%
    | 160 MB | 30 | 25147 | 28961 |
    | 160 MB | 40 | 25484 | 26902 |
    | 160 MB | 50 | 24528 | 25760 |
    ------------------------------------------------------
    | 1.6 GB | 1 | 5733 | 7729 | + 34.8%
    | 1.6 GB | 2 | 9411 | 19009 | + 101.9%
    | 1.6 GB | 4 | 31818 | 33185 |
    | 1.6 GB | 8 | 33700 | 34550 |
    | 1.6 GB | 16 | 32751 | 33079 |
    | 1.6 GB | 20 | 30919 | 31494 |
    | 1.6 GB | 30 | 28540 | 28535 |
    | 1.6 GB | 40 | 26380 | 27054 |
    | 1.6 GB | 50 | 25241 | 25591 |
    ------------------------------------------------------
    | 7.6 GB | 1 | 5779 | 6224 |
    | 7.6 GB | 2 | 10897 | 13611 | + 24.9%
    | 7.6 GB | 4 | 32683 | 33108 |
    | 7.6 GB | 8 | 33968 | 34712 |
    | 7.6 GB | 16 | 32287 | 32895 |
    | 7.6 GB | 20 | 27770 | 31689 | + 14.1%
    | 7.6 GB | 30 | 26739 | 29003 |
    | 7.6 GB | 40 | 24901 | 26683 |
    | 7.6 GB | 50 | 17115 | 25925 | + 51.5%
    ------------------------------------------------------

    (Davidlohr also has one additional patch which further improves
    throughput, though I will ask him to send it directly to you as I have
    suggested some minor changes)."

    * emailed patches from Michel Lespinasse :
    rwsem: no need for explicit signed longs
    x86 rwsem: avoid taking slow path when stealing write lock
    rwsem: do not block readers at head of queue if other readers are active
    rwsem: implement support for write lock stealing on the fastpath
    rwsem: simplify __rwsem_do_wake
    rwsem: skip initial trylock in rwsem_down_write_failed
    rwsem: avoid taking wait_lock in rwsem_down_write_failed
    rwsem: use cmpxchg for trying to steal write lock
    rwsem: more aggressive lock stealing in rwsem_down_write_failed
    rwsem: simplify rwsem_down_write_failed
    rwsem: simplify rwsem_down_read_failed
    rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
    rwsem: shorter spinlocked section in rwsem_down_failed_common()
    rwsem: make the waiter type an enumeration rather than a bitmask

    Linus Torvalds
     

07 May, 2013

4 commits

  • Change explicit "signed long" declarations into plain "long" as suggested
    by Peter Hurley.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Michel Lespinasse
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This change fixes a race condition where a reader might determine it
    needs to block, but by the time it acquires the wait_lock the rwsem has
    active readers and no queued waiters.

    In this situation the reader can run in parallel with the existing
    active readers; it does not need to block until the active readers
    complete.

    Thanks to Peter Hurley for noticing this possible race.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When we decide to wake up readers, we must first grant them as many read
    locks as necessary, and then actually wake up all these readers. But in
    order to know how many read shares to grant, we must first count the
    readers at the head of the queue. This might take a while if there are
    many readers, and we want to be protected against a writer stealing the
    lock while we're counting. To that end, we grant the first reader lock
    before counting how many more readers are queued.

    We also require some adjustments to the wake_type semantics.

    RWSEM_WAKE_NO_ACTIVE used to mean that we had found the count to be
    RWSEM_WAITING_BIAS, in which case the rwsem was known to be free as
    nobody could steal it while we hold the wait_lock. This doesn't make
    sense once we implement fastpath write lock stealing, so we now use
    RWSEM_WAKE_ANY in that case.

    Similarly, when rwsem_down_write_failed found that a read lock was
    active, it would use RWSEM_WAKE_READ_OWNED which signalled that new
    readers could be woken without checking first that the rwsem was
    available. We can't do that anymore since the existing readers might
    release their read locks, and a writer could steal the lock before we
    wake up additional readers. So, we have to use a new RWSEM_WAKE_READERS
    value to indicate we only want to wake readers, but we don't currently
    hold any read lock.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This is mostly for cleanup value:

    - We don't need several gotos to handle the case where the first
    waiter is a writer. Two simple tests will do (and generate very
    similar code).

    - In the remainder of the function, we know the first waiter is a reader,
    so we don't have to double check that. We can use do..while loops
    to iterate over the readers to wake (generates slightly better code).

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse