28 Aug, 2014

1 commit

  • When using pool space for a DMA buffer, each implementation may end up
    duplicating the calls to gen_pool_alloc() and gen_pool_virt_to_phys().

    Thus it's better to add a simple helper function, compatible with the
    common dma_alloc_coherent(), to save some code.
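
    For illustration, a minimal sketch of what such a helper could look like
    (hypothetical name and checks, assuming <linux/genalloc.h>; not
    necessarily the exact function added by the patch):

    static void *pool_dma_alloc(struct gen_pool *pool, size_t size,
                                dma_addr_t *dma)
    {
            /* One allocation call instead of open-coding both steps. */
            unsigned long vaddr = gen_pool_alloc(pool, size);

            if (!vaddr)
                    return NULL;

            if (dma)
                    *dma = gen_pool_virt_to_phys(pool, vaddr);

            return (void *)vaddr;
    }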

    Signed-off-by: Nicolin Chen
    Cc: "Hans J. Koch"
    Cc: Dan Williams
    Cc: Eric Miao
    Cc: Grant Likely
    Cc: Greg Kroah-Hartman
    Cc: Haojian Zhuang
    Cc: Jaroslav Kysela
    Cc: Kevin Hilman
    Cc: Liam Girdwood
    Cc: Mark Brown
    Cc: Mauro Carvalho Chehab
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sekhar Nori
    Cc: Takashi Iwai
    Cc: Vinod Koul
    Signed-off-by: Andrew Morton

    Nicolin Chen
     

08 Aug, 2014

1 commit

  • commit c75b53af2f0043aff500af0a6f878497bef41bca upstream.

    I use btree from 3.14-rc2 in my own module. When the btree module is
    removed, a warning arises:

    kmem_cache_destroy btree_node: Slab cache still has objects
    CPU: 13 PID: 9150 Comm: rmmod Tainted: GF O 3.14.0-rc2 #1
    Hardware name: Inspur NF5270M3/NF5270M3, BIOS CHEETAH_2.1.3 09/10/2013
    Call Trace:
    dump_stack+0x49/0x5d
    kmem_cache_destroy+0xcf/0xe0
    btree_module_exit+0x10/0x12 [btree]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    The cause is that it doesn't release the last btree node, when height = 1
    and fill = 1.

    [akpm@linux-foundation.org: remove unneeded test of NULL]
    Signed-off-by: Minfei Huang
    Cc: Joern Engel
    Cc: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minfei Huang
     

01 Jul, 2014

1 commit

  • commit 3afb69cb5572b3c8c898c00880803cf1a49852c4 upstream.

    idr_replace() open-codes the logic to calculate the maximum valid ID
    given the height of the idr tree; unfortunately, the open-coded logic
    doesn't account for the fact that the top layer may have unused slots
    and over-shifts the limit to zero when the tree is at its maximum
    height.

    The original commit includes test code demonstrating that idr_replace()
    fails to replace the value for an ID allocated when the tree is at its
    maximum height.

    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Lai Jiangshan
     

27 Jun, 2014

2 commits

  • commit 206a81c18401c0cde6e579164f752c4b147324ce upstream.

    The lzo decompressor can, if given some really crazy data, possibly
    overrun some variable types. Modify the checking logic to properly
    detect overruns before they happen.

    Reported-by: "Don A. Bailey"
    Tested-by: "Don A. Bailey"
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • [ Upstream commit bfc5184b69cf9eeb286137640351c650c27f118a ]

    Any process is able to send netlink messages with leftover bytes.
    Make the warning rate-limited to prevent too much log spam.

    The warning is supposed to help find userspace bugs, so print the
    triggering command name to implicate the buggy program.

    [v2: Use pr_warn_ratelimited instead of printk_ratelimited.]
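
    A sketch of the rate-limited form described above ('rem' and
    current->comm come from the attribute-parsing context; the exact message
    text is an assumption):

    pr_warn_ratelimited("netlink: %d bytes leftover after parsing attributes in process `%s'.\n",
                        rem, current->comm);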

    Signed-off-by: Michal Schmidt
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Schmidt
     

14 Apr, 2014

1 commit

  • [ Upstream commit 8b7b932434f5eee495b91a2804f5b64ebb2bc835 ]

    nla_strcmp compares the string length plus one, so it's implicitly
    including the nul-termination in the comparison.

    int nla_strcmp(const struct nlattr *nla, const char *str)
    {
            int len = strlen(str) + 1;
            ...
            d = memcmp(nla_data(nla), str, len);

    However, if NLA_STRING is used, userspace can send us a string without
    the nul-termination. This is a problem since the string
    comparison will not match as the last byte may be not the
    nul-termination.

    Fix this by skipping the comparison of the nul-termination if the
    attribute data is nul-terminated. Suggested by Thomas Graf.
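
    A sketch of the fixed comparison along these lines (details may differ
    from the upstream patch):

    int nla_strcmp(const struct nlattr *nla, const char *str)
    {
            int len = strlen(str);
            char *buf = nla_data(nla);
            int attrlen = nla_len(nla);
            int d;

            /* Ignore a trailing NUL in the attribute, if present. */
            if (attrlen > 0 && buf[attrlen - 1] == '\0')
                    attrlen--;

            d = attrlen - len;
            if (d == 0)
                    d = memcmp(nla_data(nla), str, len);

            return d;
    }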

    Cc: Florian Westphal
    Cc: Thomas Graf
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pablo Neira
     

21 Feb, 2014

1 commit

  • commit 6583327c4dd55acbbf2a6f25e775b28b3abf9a42 upstream.

    Commit d61931d89b, "x86: Add optimized popcnt variants" introduced
    compile flag -fcall-saved-rdi for lib/hweight.c. When combined with
    options -fprofile-arcs and -O2, this flag causes gcc to generate
    broken constructor code. As a result, a 64 bit x86 kernel compiled
    with CONFIG_GCOV_PROFILE_ALL=y prints message "gcov: could not create
    file" and runs into sproadic BUGs during boot.

    The gcc people indicate that these kinds of problems are endemic when
    using ad hoc calling conventions. It is therefore best to treat any
    file compiled with ad hoc calling conventions as an isolated
    environment and avoid things like profiling or coverage analysis,
    since those subsystems assume "normal" calling conventions.

    This patch avoids the bug by excluding lib/hweight.o from coverage
    profiling.

    Reported-by: Meelis Roos
    Cc: Andrew Morton
    Signed-off-by: Peter Oberparleiter
    Link: http://lkml.kernel.org/r/52F3A30C.7050205@linux.vnet.ibm.com
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Greg Kroah-Hartman

    Peter Oberparleiter
     

07 Feb, 2014

1 commit

  • commit 1431574a1c4c669a0c198e4763627837416e4443 upstream.

    When decompressing into memory, the output buffer length is set to some
    arbitrarily high value (0x7fffffff) to indicate the output is, virtually,
    unlimited in size.

    The problem with this is that some platforms have their physical memory at
    high physical addresses (0x80000000 or more), and that the output buffer
    address and its "unlimited" length cannot be added without overflowing.
    An example of this can be found in inflate_fast():

    /* next_out is the output buffer address */
    out = strm->next_out - OFF;
    /* avail_out is the output buffer size. end will overflow if the
     * output address is >= 0x80000104 */
    end = out + (strm->avail_out - 257);

    This has huge consequences on the performance of kernel decompression,
    since the following exit condition of inflate_fast() will be always true:

    } while (in < last && out < end);

    Indeed, "end" has overflowed and is now always lower than "out". As a
    result, inflate_fast() will return after processing one single byte of
    input data, and will thus need to be called an unreasonably high number of
    times. This probably went unnoticed because kernel decompression is fast
    enough even with this issue.

    Nonetheless, adjusting the output buffer length in such a way that the
    above pointer arithmetic never overflows results in a kernel decompression
    that is about 3 times faster on affected machines.
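
    A minimal sketch of the idea behind the fix, using hypothetical variable
    names (the actual upstream change may differ in detail):

    /* Bound the "unlimited" output length so that out + avail_out in
     * inflate_fast() cannot wrap past the end of the address space. */
    if (!out_len)
            out_len = ((size_t)~0) - (size_t)out_buf;
    strm->avail_out = out_len;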

    Signed-off-by: Alexandre Courbot
    Tested-by: Jon Medhurst
    Cc: Stephen Warren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Mark Brown
    Signed-off-by: Greg Kroah-Hartman

    Alexandre Courbot
     

12 Dec, 2013

1 commit

  • commit 674470d97958a0ec72f72caf7f6451da40159cc7 upstream.

    In struct gen_pool_chunk, end_addr means the end address of the memory
    chunk (inclusive), but the implementation treats it as the start address
    plus the size of the chunk (exclusive), so it points one past the
    correct ending address.

    Treating the ending address as "end plus one" overflows for a chunk that
    includes the last address of the memory map: e.g. with a starting
    address of 0xFFF00000 and a size of 0x100000 on a 32-bit machine, the
    computed ending address becomes 0x100000000.

    Use the correct ending address, i.e. starting address + size - 1.
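
    A sketch of the corrected chunk setup (field and variable names follow
    the description above; this is not the literal diff):

    chunk->start_addr = virt;
    chunk->end_addr   = virt + size - 1;  /* inclusive end, cannot wrap at the top of memory */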

    [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr]
    Signed-off-by: Joonyoung Shim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jonghwan Choi
    Signed-off-by: Greg Kroah-Hartman

    Joonyoung Shim
     

08 Dec, 2013

1 commit

  • [ Upstream commit 51c37a70aaa3f95773af560e6db3073520513912 ]

    For properly initialising the Tausworthe generator [1], we have
    a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15.

    Commit 697f8d0348 ("random32: seeding improvement") introduced
    a __seed() function that imposes boundary checks proposed by the
    errata paper [2] to properly ensure above conditions.

    However, we're off by one, as the function is implemented as:
    "return (x < m) ? x + m : x;", and called with __seed(X, 1),
    __seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15
    would be possible, whereas the lower boundary should actually
    be at least 2, 8, 16, just as GSL does. Fix this, as otherwise
    an initialization with an unwanted seed could mean that
    Tausworthe's PRNG properties cannot be ensured.

    Note that this PRNG is *not* used for cryptography in the kernel.

    [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps
    [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps

    Joint work with Hannes Frederic Sowa.
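
    A sketch of the corrected seeding calls implied by the bounds above
    (with __seed(x, m) returning x + m when x < m; state/field names may
    differ from the actual commit):

    state->s1 = __seed(i,  2U);    /* guarantees s1 > 1  */
    state->s2 = __seed(i,  8U);    /* guarantees s2 > 7  */
    state->s3 = __seed(i, 16U);    /* guarantees s3 > 15 */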

    Fixes: 697f8d0348a6 ("random32: seeding improvement")
    Cc: Stephen Hemminger
    Cc: Florian Weimer
    Cc: Theodore Ts'o
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

05 Dec, 2013

1 commit

  • commit 312b4e226951f707e120b95b118cbc14f3d162b2 upstream.

    Some setuid binaries will allow reading of files which have read
    permission by the real user id. This is problematic with files which
    use %pK because the file access permission is checked at open() time,
    but the kptr_restrict setting is checked at read() time. If a setuid
    binary opens a %pK file as an unprivileged user, and then elevates
    permissions before reading the file, then kernel pointer values may be
    leaked.

    This happens for example with the setuid pppd application on Ubuntu 12.04:

    $ head -1 /proc/kallsyms
    00000000 T startup_32

    $ pppd file /proc/kallsyms
    pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

    This will only leak the pointer value from the first line, but other
    setuid binaries may leak more information.

    Fix this by adding a check that, in addition to the current process
    having CAP_SYSLOG, the effective user and group IDs are equal to the
    real IDs.
    If a setuid binary reads the contents of a file which uses %pK then the
    pointer values will be printed as NULL if the real user is unprivileged.
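
    A sketch of the combined check described above, wrapped in a
    hypothetical helper for clarity (the in-tree change may be open-coded):

    static bool kptr_visible(void)
    {
            const struct cred *cred = current_cred();

            /* Require CAP_SYSLOG and no elevated effective ids. */
            return has_capability_noaudit(current, CAP_SYSLOG) &&
                   uid_eq(cred->uid, cred->euid) &&
                   gid_eq(cred->gid, cred->egid);
    }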

    Update the sysctl documentation to reflect the changes, and also correct
    the documentation to state that kptr_restrict=0 is the default.

    This is only a temporary solution to the issue. The correct solution is
    to do the permission check at open() time on files, and to replace %pK
    with a function which checks the open() time permission. Uses of %pK in
    printk should be removed, since no sane permission check can be done
    there; such messages should instead be protected by dmesg_restrict.

    Signed-off-by: Ryan Mallon
    Cc: Kees Cook
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ryan Mallon
     

13 Nov, 2013

1 commit

  • commit 3d77b50c5874b7e923be946ba793644f82336b75 upstream.

    Commit b1adaf65ba03 ("[SCSI] block: add sg buffer copy helper
    functions") introduces two sg buffer copy helpers, and calls
    flush_kernel_dcache_page() on pages in SG list after these pages are
    written to.

    Unfortunately, the commit may introduce a potential bug:

    - Before sending some SCSI commands, a kmalloc() buffer may be passed to
    the block layer, so flush_kernel_dcache_page() can end up seeing a slab
    page

    - According to cachetlb.txt, flush_kernel_dcache_page() is only called
    on "a user page", which surely can't be a slab page.

    - An architecture's implementation of flush_kernel_dcache_page() may use
    page mapping information for optimization, so page_mapping() will see
    the slab page and VM_BUG_ON() is triggered.

    Aaro Koskinen reported the bug on ARM/kirkwood when DEBUG_VM is enabled,
    and this patch fixes the bug by adding test of '!PageSlab(miter->page)'
    before calling flush_kernel_dcache_page().
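
    A sketch of the added guard (the surrounding sg miter code is omitted):

    if (!PageSlab(miter->page))
            flush_kernel_dcache_page(miter->page);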

    Signed-off-by: Ming Lei
    Reported-by: Aaro Koskinen
    Tested-by: Simon Baatz
    Cc: Russell King - ARM Linux
    Cc: Will Deacon
    Cc: Aaro Koskinen
    Acked-by: Catalin Marinas
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: "James E.J. Bottomley"
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

29 Jul, 2013

1 commit

  • commit 25c87eae1725ed77a8b44d782a86abdc279b4ede upstream.

    FAULT_INJECTION_STACKTRACE_FILTER selects FRAME_POINTER but
    that symbol is not available for MIPS.

    Fixes the following problem on a randconfig:
    warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP &&
    KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies
    (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN ||
    MN10300 || METAG) || ARCH_WANT_FRAME_POINTERS)

    Signed-off-by: Markos Chandras
    Acked-by: Steven J. Hill
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/5441/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Greg Kroah-Hartman

    Markos Chandras
     

13 Jun, 2013

1 commit

  • The 'while' loop needs to stop when 'nbytes == 0', or it will cause an
    issue, since 'nbytes' is a size_t and is always greater than or equal
    to zero.

    The related warning: (with EXTRA_CFLAGS=-W)

    lib/mpi/mpicoder.c:40:2: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
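
    A minimal illustration of the issue (hypothetical values, not the mpi
    code itself):

    size_t nbytes = 16;             /* hypothetical byte count           */

    while (nbytes > 0)              /* was effectively "nbytes >= 0",    */
            nbytes--;               /* which is always true for a size_t */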

    Signed-off-by: Chen Gang
    Cc: Rusty Russell
    Cc: David Howells
    Cc: James Morris
    Cc: Andy Shevchenko
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

25 May, 2013

1 commit

  • The umul_ppmm() macro for parisc uses the xmpyu assembler statement
    which does calculation via a floating point register.

    But usage of floating point registers inside the Linux kernel are not
    allowed and gcc will stop compilation due to the -mdisable-fpregs
    compiler option.

    Fix this by disabling the umul_ppmm() and udiv_qrnnd() macros. The
    mpilib will then use the generic built-in implementations instead.

    Signed-off-by: Helge Deller

    Helge Deller
     

24 May, 2013

2 commits

  • Pull driver core fixes from Greg Kroah-Hartman:
    "Here are 3 tiny driver core fixes for 3.10-rc2.

    A needed symbol export, a change to make it easier to track down
    offending sysfs files with incorrect attributes, and a klist bugfix.

    All have been in linux-next for a while"

    * tag 'driver-core-3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    klist: del waiter from klist_remove_waiters before wakeup waitting process
    driver core: print sysfs attribute name when warning about bogus permissions
    driver core: export subsys_virtual_register

    Linus Torvalds
     
  • Fix a build error in vmw_vmci.ko when CONFIG_VMWARE_VMCI=m by changing
    iovec.o from lib-y to obj-y.

    ERROR: "memcpy_toiovec" [drivers/misc/vmw_vmci/vmw_vmci.ko] undefined!
    ERROR: "memcpy_fromiovec" [drivers/misc/vmw_vmci/vmw_vmci.ko] undefined!

    Signed-off-by: Randy Dunlap
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

22 May, 2013

1 commit

  • There is a race between klist_remove and klist_release. klist_remove
    uses a local waiter variable saved on its stack. When klist_release
    calls wake_up_process(waiter->process) to wake up the waiter, the
    waiter might run immediately and reuse that stack. klist_release then
    calls list_del(&waiter->list), which modifies what is now the woken
    thread's stack and corrupts it.

    The patch fixes this against kernel 3.9.
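
    A sketch of the ordering fix: unlink the waiter before waking it, so the
    woken thread is free to reuse its stack.

    /* klist_release(): remove the waiter first, then wake the process. */
    list_del(&waiter->list);
    wake_up_process(waiter->process);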

    Signed-off-by: wang, biao
    Acked-by: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    wang, biao
     

20 May, 2013

1 commit

  • ERROR: "memcpy_fromiovec" [drivers/vhost/vhost_scsi.ko] undefined!

    That function is only present with CONFIG_NET. Turns out that
    crypto/algif_skcipher.c also uses that outside net, but it actually
    needs sockets anyway.

    In addition, commit 6d4f0139d642c45411a47879325891ce2a7c164a added
    CONFIG_NET dependency to CONFIG_VMCI for memcpy_toiovec, so hoist
    that function and revert that commit too.

    socket.h already includes uio.h, so no callers need updating; trying
    only broke things for x86_64 randconfig (thanks Fengguang!).

    Reported-by: Randy Dunlap
    Acked-by: David S. Miller
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Rusty Russell

    Rusty Russell
     

09 May, 2013

1 commit

  • Pull block driver updates from Jens Axboe:
    "It might look big in volume, but when categorized, not a lot of
    drivers are touched. The pull request contains:

    - mtip32xx fixes from Micron.

    - A slew of drbd updates, this time in a nicer series.

    - bcache, a flash/ssd caching framework from Kent.

    - Fixes for cciss"

    * 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
    bcache: Use bd_link_disk_holder()
    bcache: Allocator cleanup/fixes
    cciss: bug fix to prevent cciss from loading in kdump crash kernel
    cciss: add cciss_allow_hpsa module parameter
    drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
    mtip32xx: Workaround for unaligned writes
    bcache: Make sure blocksize isn't smaller than device blocksize
    bcache: Fix merge_bvec_fn usage for when it modifies the bvm
    bcache: Correctly check against BIO_MAX_PAGES
    bcache: Hack around stuff that clones up to bi_max_vecs
    bcache: Set ra_pages based on backing device's ra_pages
    bcache: Take data offset from the bdev superblock.
    mtip32xx: mtip32xx: Disable TRIM support
    mtip32xx: fix a smatch warning
    bcache: Disable broken btree fuzz tester
    bcache: Fix a format string overflow
    bcache: Fix a minor memory leak on device teardown
    bcache: Documentation updates
    bcache: Use WARN_ONCE() instead of __WARN()
    bcache: Add missing #include
    ...

    Linus Torvalds
     

08 May, 2013

3 commits

  • This patch tries to reduce the amount of cmpxchg calls in the writer
    failed path by checking the counter value first before issuing the
    instruction. If ->count is not set to RWSEM_WAITING_BIAS then there is
    no point wasting a cmpxchg call.

    Furthermore, Michel states "I suppose it helps due to the case where
    someone else steals the lock while we're trying to acquire
    sem->wait_lock."

    Two very different workloads and machines were used to see how this
    patch improves throughput: pgbench on a quad-core laptop and aim7 on a
    large 8 socket box with 80 cores.

    Some results comparing Michel's fast-path write lock stealing
    (tps-rwsem) on a quad-core laptop running pgbench:

    | db_size | clients | tps-rwsem | tps-patch |
    +---------+----------+----------------+--------------+
    | 160 MB | 1 | 6906 | 9153 | + 32.5%
    | 160 MB | 2 | 15931 | 22487 | + 41.1%
    | 160 MB | 4 | 33021 | 32503 |
    | 160 MB | 8 | 34626 | 34695 |
    | 160 MB | 16 | 33098 | 34003 |
    | 160 MB | 20 | 31343 | 31440 |
    | 160 MB | 30 | 28961 | 28987 |
    | 160 MB | 40 | 26902 | 26970 |
    | 160 MB | 50 | 25760 | 25810 |
    ------------------------------------------------------
    | 1.6 GB | 1 | 7729 | 7537 |
    | 1.6 GB | 2 | 19009 | 23508 | + 23.7%
    | 1.6 GB | 4 | 33185 | 32666 |
    | 1.6 GB | 8 | 34550 | 34318 |
    | 1.6 GB | 16 | 33079 | 32689 |
    | 1.6 GB | 20 | 31494 | 31702 |
    | 1.6 GB | 30 | 28535 | 28755 |
    | 1.6 GB | 40 | 27054 | 27017 |
    | 1.6 GB | 50 | 25591 | 25560 |
    ------------------------------------------------------
    | 7.6 GB | 1 | 6224 | 7469 | + 20.0%
    | 7.6 GB | 2 | 13611 | 12778 |
    | 7.6 GB | 4 | 33108 | 32927 |
    | 7.6 GB | 8 | 34712 | 34878 |
    | 7.6 GB | 16 | 32895 | 33003 |
    | 7.6 GB | 20 | 31689 | 31974 |
    | 7.6 GB | 30 | 29003 | 28806 |
    | 7.6 GB | 40 | 26683 | 26976 |
    | 7.6 GB | 50 | 25925 | 25652 |
    ------------------------------------------------------

    The aim7 workloads overall improved on top of Michel's patchset. For
    full graphs of how the rwsem series plus this patch
    behaves on a large 8 socket machine against a vanilla kernel:

    http://stgolabs.net/rwsem-aim7-results.tar.gz

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • - make warning smp-safe
    - result of atomic _unless_zero functions should be checked by caller
    to avoid use-after-free error
    - trivial whitespace fix.

    Link: https://lkml.org/lkml/2013/4/12/391

    Tested: compile x86, boot machine and run xfstests
    Signed-off-by: Anatol Pomozov
    [ Removed line-break, changed to use WARN_ON_ONCE() - Linus ]
    Signed-off-by: Linus Torvalds

    Anatol Pomozov
     
  • Merge rwsem optimizations from Michel Lespinasse:
    "These patches extend Alex Shi's work (which added write lock stealing
    on the rwsem slow path) in order to provide rwsem write lock stealing
    on the fast path (that is, without taking the rwsem's wait_lock).

    I have unfortunately been unable to push this through -next before due
    to Ingo Molnar / David Howells / Peter Zijlstra being busy with other
    things. However, this has gotten some attention from Rik van Riel and
    Davidlohr Bueso who both commented that they felt this was ready for
    v3.10, and Ingo Molnar has said that he was OK with me pushing
    directly to you. So, here goes :)

    Davidlohr got the following test results from pgbench running on a
    quad-core laptop:

    | db_size | clients | tps-vanilla | tps-rwsem |
    +---------+----------+----------------+--------------+
    | 160 MB | 1 | 5803 | 6906 | + 19.0%
    | 160 MB | 2 | 13092 | 15931 |
    | 160 MB | 4 | 29412 | 33021 |
    | 160 MB | 8 | 32448 | 34626 |
    | 160 MB | 16 | 32758 | 33098 |
    | 160 MB | 20 | 26940 | 31343 | + 16.3%
    | 160 MB | 30 | 25147 | 28961 |
    | 160 MB | 40 | 25484 | 26902 |
    | 160 MB | 50 | 24528 | 25760 |
    ------------------------------------------------------
    | 1.6 GB | 1 | 5733 | 7729 | + 34.8%
    | 1.6 GB | 2 | 9411 | 19009 | + 101.9%
    | 1.6 GB | 4 | 31818 | 33185 |
    | 1.6 GB | 8 | 33700 | 34550 |
    | 1.6 GB | 16 | 32751 | 33079 |
    | 1.6 GB | 20 | 30919 | 31494 |
    | 1.6 GB | 30 | 28540 | 28535 |
    | 1.6 GB | 40 | 26380 | 27054 |
    | 1.6 GB | 50 | 25241 | 25591 |
    ------------------------------------------------------
    | 7.6 GB | 1 | 5779 | 6224 |
    | 7.6 GB | 2 | 10897 | 13611 | + 24.9%
    | 7.6 GB | 4 | 32683 | 33108 |
    | 7.6 GB | 8 | 33968 | 34712 |
    | 7.6 GB | 16 | 32287 | 32895 |
    | 7.6 GB | 20 | 27770 | 31689 | + 14.1%
    | 7.6 GB | 30 | 26739 | 29003 |
    | 7.6 GB | 40 | 24901 | 26683 |
    | 7.6 GB | 50 | 17115 | 25925 | + 51.5%
    ------------------------------------------------------

    (Davidlohr also has one additional patch which further improves
    throughput, though I will ask him to send it directly to you as I have
    suggested some minor changes)."

    * emailed patches from Michel Lespinasse :
    rwsem: no need for explicit signed longs
    x86 rwsem: avoid taking slow path when stealing write lock
    rwsem: do not block readers at head of queue if other readers are active
    rwsem: implement support for write lock stealing on the fastpath
    rwsem: simplify __rwsem_do_wake
    rwsem: skip initial trylock in rwsem_down_write_failed
    rwsem: avoid taking wait_lock in rwsem_down_write_failed
    rwsem: use cmpxchg for trying to steal write lock
    rwsem: more agressive lock stealing in rwsem_down_write_failed
    rwsem: simplify rwsem_down_write_failed
    rwsem: simplify rwsem_down_read_failed
    rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
    rwsem: shorter spinlocked section in rwsem_down_failed_common()
    rwsem: make the waiter type an enumeration rather than a bitmask

    Linus Torvalds
     

07 May, 2013

13 commits

  • Change explicit "signed long" declarations into plain "long" as suggested
    by Peter Hurley.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Michel Lespinasse
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This change fixes a race condition where a reader might determine it
    needs to block, but by the time it acquires the wait_lock the rwsem has
    active readers and no queued waiters.

    In this situation the reader can run in parallel with the existing
    active readers; it does not need to block until the active readers
    complete.

    Thanks to Peter Hurley for noticing this possible race.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When we decide to wake up readers, we must first grant them as many read
    locks as necessary, and then actually wake up all these readers. But in
    order to know how many read shares to grant, we must first count the
    readers at the head of the queue. This might take a while if there are
    many readers, and we want to be protected against a writer stealing the
    lock while we're counting. To that end, we grant the first reader lock
    before counting how many more readers are queued.

    We also require some adjustments to the wake_type semantics.

    RWSEM_WAKE_NO_ACTIVE used to mean that we had found the count to be
    RWSEM_WAITING_BIAS, in which case the rwsem was known to be free as
    nobody could steal it while we hold the wait_lock. This doesn't make
    sense once we implement fastpath write lock stealing, so we now use
    RWSEM_WAKE_ANY in that case.

    Similarly, when rwsem_down_write_failed found that a read lock was
    active, it would use RWSEM_WAKE_READ_OWNED which signalled that new
    readers could be woken without checking first that the rwsem was
    available. We can't do that anymore since the existing readers might
    release their read locks, and a writer could steal the lock before we
    wake up additional readers. So, we have to use a new RWSEM_WAKE_READERS
    value to indicate we only want to wake readers, but we don't currently
    hold any read lock.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This is mostly for cleanup value:

    - We don't need several gotos to handle the case where the first
    waiter is a writer. Two simple tests will do (and generate very
    similar code).

    - In the remainder of the function, we know the first waiter is a reader,
    so we don't have to double check that. We can use do..while loops
    to iterate over the readers to wake (generates slightly better code).

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We can skip the initial trylock in rwsem_down_write_failed() if there
    are known active lockers already, thus saving one likely-to-fail
    cmpxchg.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In rwsem_down_write_failed(), if there are active locks after we wake up
    (i.e. the lock got stolen from us), skip taking the wait_lock and go
    back to sleep immediately.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Using rwsem_atomic_update to try stealing the write lock forced us to
    undo the adjustment in the failure path. We can have simpler and faster
    code by using cmpxchg instead.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Some small code simplifications can be achieved by doing more
    aggressive lock stealing:

    - When rwsem_down_write_failed() notices that there are no active locks
    (and thus no thread to wake us if we decided to sleep), it used to wake
    the first queued process. However, stealing the lock is also sufficient
    to deal with this case, so we don't need this check anymore.

    - In try_get_writer_sem(), we can steal the lock even when the first waiter
    is a reader. This is correct because the code path that wakes readers is
    protected by the wait_lock. As to the performance effects of this change,
    they are expected to be minimal: readers are still granted the lock
    (rather than having to acquire it themselves) when they reach the front
    of the wait queue, so we have essentially the same behavior as in
    rwsem-spinlock.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When waking writers, we never grant them the lock - instead, they have
    to acquire it themselves when they run, and remove themselves from the
    wait_list when they succeed.

    As a result, we can do a few simplifications in rwsem_down_write_failed():

    - We don't need to check for !waiter.task since __rwsem_do_wake() doesn't
    remove writers from the wait_list

    - There is no point releasing the wait_lock before entering the wait loop,
    as we will need to reacquire it immediately. We can change the loop so
    that the lock is always held at the start of each loop iteration.

    - We don't need to get a reference on the task structure, since the task
    is responsible for removing itself from the wait_list. There is no risk,
    like in the rwsem_down_read_failed() case, that a task would wake up and
    exit (thus destroying its task structure) while __rwsem_do_wake() is
    still running - wait_lock protects against that.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When trying to acquire a read lock, the RWSEM_ACTIVE_READ_BIAS
    adjustment doesn't cause other readers to block, so we never have to
    worry about waking them back after canceling this adjustment in
    rwsem_down_read_failed().

    We also never want to steal the lock in rwsem_down_read_failed(), so we
    don't have to grab the wait_lock either.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Remove the rwsem_down_failed_common function and replace it with two
    identical copies of its code in rwsem_down_{read,write}_failed.

    This is because we want to make different optimizations in
    rwsem_down_{read,write}_failed; we are adding this pure-duplication
    step as a separate commit in order to make it easier to check the
    following steps.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change reduces the size of the spinlocked and TASK_UNINTERRUPTIBLE
    sections in rwsem_down_failed_common():

    - We only need the sem->wait_lock to insert ourselves on the wait_list;
    the waiter node can be prepared outside of the wait_lock.

    - The task state only needs to be set to TASK_UNINTERRUPTIBLE immediately
    before checking if we actually need to sleep; it doesn't need to protect
    the entire function.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We are not planning to add any new waiter flags, so we can convert the
    waiter type into an enumeration.

    Background: David Howells suggested I do this back when I tried adding
    a new waiter type for unfair readers. However, I believe the cleanup
    applies regardless of that use case.
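
    A sketch of the resulting types (names may differ slightly from the
    actual commit):

    enum rwsem_waiter_type {
            RWSEM_WAITING_FOR_WRITE,
            RWSEM_WAITING_FOR_READ
    };

    struct rwsem_waiter {
            struct list_head list;
            struct task_struct *task;
            enum rwsem_waiter_type type;
    };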

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

03 May, 2013

2 commits

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for 3.10.

    Wierd bits:
    - OMAP drm changes required OMAP dss changes, in drivers/video, so I
    took them in here.
    - one more fbcon fix for font handover
    - VT switch avoidance in pm code
    - scatterlist helpers for gpu drivers - have acks from akpm

    Highlights:
    - qxl kms driver - driver for the spice qxl virtual GPU

    Nouveau:
    - fermi/kepler VRAM compression
    - GK110/nvf0 modesetting support.

    Tegra:
    - host1x core merged with 2D engine support

    i915:
    - vt switchless resume
    - more valleyview support
    - vblank fixes
    - modesetting pipe config rework

    radeon:
    - UVD engine support
    - SI chip tiling support
    - GPU registers initialisation from golden values.

    exynos:
    - device tree changes
    - fimc block support

    Otherwise:
    - bunches of fixes all over the place."

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (513 commits)
    qxl: update to new idr interfaces.
    drm/nouveau: fix build with nv50->nvc0
    drm/radeon: fix handling of v6 power tables
    drm/radeon: clarify family checks in pm table parsing
    drm/radeon: consolidate UVD clock programming
    drm/radeon: fix UPLL_REF_DIV_MASK definition
    radeon: add bo tracking debugfs
    drm/radeon: add new richland pci ids
    drm/radeon: add some new SI PCI ids
    drm/radeon: fix scratch reg handling for UVD fence
    drm/radeon: allocate SA bo in the requested domain
    drm/radeon: fix possible segfault when parsing pm tables
    drm/radeon: fix endian bugs in atom_allocate_fb_scratch()
    OMAPDSS: TFP410: return EPROBE_DEFER if the i2c adapter not found
    OMAPDSS: VENC: Add error handling for venc_probe_pdata
    OMAPDSS: HDMI: Add error handling for hdmi_probe_pdata
    OMAPDSS: RFBI: Add error handling for rfbi_probe_pdata
    OMAPDSS: DSI: Add error handling for dsi_probe_pdata
    OMAPDSS: SDI: Add error handling for sdi_probe_pdata
    OMAPDSS: DPI: Add error handling for dpi_probe_pdata
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "This fixes the cputime scaling overflow problems for good without
    having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
    as well."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "math64: New div64_u64_rem helper"
    sched: Avoid prev->stime underflow
    sched: Do not account bogus utime
    sched: Avoid cputime scaling overflow

    Linus Torvalds
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds