28 Aug, 2014

1 commit

  • When using pool space for a DMA buffer, each implementation may end up
    duplicating the calls to gen_pool_alloc() and gen_pool_virt_to_phys().

    Thus it's better to add a simple helper function, compatible with the
    common dma_alloc_coherent(), to save some code.
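
    For illustration, a minimal sketch of what such a helper could look like
    (hypothetical name and checks, assuming <linux/genalloc.h>; not
    necessarily the exact function added by the patch):

    static void *pool_dma_alloc(struct gen_pool *pool, size_t size,
                                dma_addr_t *dma)
    {
            /* One allocation call instead of open-coding both steps. */
            unsigned long vaddr = gen_pool_alloc(pool, size);

            if (!vaddr)
                    return NULL;

            if (dma)
                    *dma = gen_pool_virt_to_phys(pool, vaddr);

            return (void *)vaddr;
    }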

    Signed-off-by: Nicolin Chen
    Cc: "Hans J. Koch"
    Cc: Dan Williams
    Cc: Eric Miao
    Cc: Grant Likely
    Cc: Greg Kroah-Hartman
    Cc: Haojian Zhuang
    Cc: Jaroslav Kysela
    Cc: Kevin Hilman
    Cc: Liam Girdwood
    Cc: Mark Brown
    Cc: Mauro Carvalho Chehab
    Cc: Rob Herring
    Cc: Russell King
    Cc: Sekhar Nori
    Cc: Takashi Iwai
    Cc: Vinod Koul
    Signed-off-by: Andrew Morton

    Nicolin Chen
     

08 Aug, 2014

1 commit

  • commit c75b53af2f0043aff500af0a6f878497bef41bca upstream.

    I use btree from 3.14-rc2 in my own module. When the btree module is
    removed, a warning arises:

    kmem_cache_destroy btree_node: Slab cache still has objects
    CPU: 13 PID: 9150 Comm: rmmod Tainted: GF O 3.14.0-rc2 #1
    Hardware name: Inspur NF5270M3/NF5270M3, BIOS CHEETAH_2.1.3 09/10/2013
    Call Trace:
    dump_stack+0x49/0x5d
    kmem_cache_destroy+0xcf/0xe0
    btree_module_exit+0x10/0x12 [btree]
    SyS_delete_module+0x198/0x1f0
    system_call_fastpath+0x16/0x1b

    The cause is that it doesn't release the last btree node, when height = 1
    and fill = 1.

    [akpm@linux-foundation.org: remove unneeded test of NULL]
    Signed-off-by: Minfei Huang
    Cc: Joern Engel
    Cc: Johannes Berg
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Minfei Huang
     

01 Jul, 2014

1 commit

  • commit 3afb69cb5572b3c8c898c00880803cf1a49852c4 upstream.

    idr_replace() open-codes the logic to calculate the maximum valid ID
    given the height of the idr tree; unfortunately, the open-coded logic
    doesn't account for the fact that the top layer may have unused slots
    and over-shifts the limit to zero when the tree is at its maximum
    height.

    The original commit includes test code demonstrating that idr_replace()
    fails to replace the value for an ID allocated when the tree is at its
    maximum height.

    Acked-by: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Lai Jiangshan
     

27 Jun, 2014

2 commits

  • commit 206a81c18401c0cde6e579164f752c4b147324ce upstream.

    The lzo decompressor can, if given some really crazy data, possibly
    overrun some variable types. Modify the checking logic to properly
    detect overruns before they happen.

    Reported-by: "Don A. Bailey"
    Tested-by: "Don A. Bailey"
    Signed-off-by: Greg Kroah-Hartman

    Greg Kroah-Hartman
     
  • [ Upstream commit bfc5184b69cf9eeb286137640351c650c27f118a ]

    Any process is able to send netlink messages with leftover bytes.
    Make the warning rate-limited to prevent too much log spam.

    The warning is supposed to help find userspace bugs, so print the
    triggering command name to implicate the buggy program.

    [v2: Use pr_warn_ratelimited instead of printk_ratelimited.]
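
    A sketch of the rate-limited form described above ('rem' and
    current->comm come from the attribute-parsing context; the exact message
    text is an assumption):

    pr_warn_ratelimited("netlink: %d bytes leftover after parsing attributes in process `%s'.\n",
                        rem, current->comm);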

    Signed-off-by: Michal Schmidt
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Michal Schmidt
     

14 Apr, 2014

1 commit

  • [ Upstream commit 8b7b932434f5eee495b91a2804f5b64ebb2bc835 ]

    nla_strcmp compares the string length plus one, so it's implicitly
    including the nul-termination in the comparison.

    int nla_strcmp(const struct nlattr *nla, const char *str)
    {
            int len = strlen(str) + 1;
            ...
            d = memcmp(nla_data(nla), str, len);

    However, if NLA_STRING is used, userspace can send us a string without
    the nul-termination. This is a problem since the string
    comparison will not match as the last byte may be not the
    nul-termination.

    Fix this by skipping the comparison of the nul-termination if the
    attribute data is nul-terminated. Suggested by Thomas Graf.
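
    A sketch of the fixed comparison along these lines (details may differ
    from the upstream patch):

    int nla_strcmp(const struct nlattr *nla, const char *str)
    {
            int len = strlen(str);
            char *buf = nla_data(nla);
            int attrlen = nla_len(nla);
            int d;

            /* Ignore a trailing NUL in the attribute, if present. */
            if (attrlen > 0 && buf[attrlen - 1] == '\0')
                    attrlen--;

            d = attrlen - len;
            if (d == 0)
                    d = memcmp(nla_data(nla), str, len);

            return d;
    }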

    Cc: Florian Westphal
    Cc: Thomas Graf
    Signed-off-by: Pablo Neira Ayuso
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Pablo Neira
     

21 Feb, 2014

1 commit

  • commit 6583327c4dd55acbbf2a6f25e775b28b3abf9a42 upstream.

    Commit d61931d89b, "x86: Add optimized popcnt variants" introduced
    compile flag -fcall-saved-rdi for lib/hweight.c. When combined with
    options -fprofile-arcs and -O2, this flag causes gcc to generate
    broken constructor code. As a result, a 64 bit x86 kernel compiled
    with CONFIG_GCOV_PROFILE_ALL=y prints message "gcov: could not create
    file" and runs into sproadic BUGs during boot.

    The gcc people indicate that these kinds of problems are endemic when
    using ad hoc calling conventions. It is therefore best to treat any
    file compiled with ad hoc calling conventions as an isolated
    environment and avoid things like profiling or coverage analysis,
    since those subsystems assume "normal" calling conventions.

    This patch avoids the bug by excluding lib/hweight.o from coverage
    profiling.

    Reported-by: Meelis Roos
    Cc: Andrew Morton
    Signed-off-by: Peter Oberparleiter
    Link: http://lkml.kernel.org/r/52F3A30C.7050205@linux.vnet.ibm.com
    Signed-off-by: H. Peter Anvin
    Signed-off-by: Greg Kroah-Hartman

    Peter Oberparleiter
     

07 Feb, 2014

1 commit

  • commit 1431574a1c4c669a0c198e4763627837416e4443 upstream.

    When decompressing into memory, the output buffer length is set to some
    arbitrarily high value (0x7fffffff) to indicate the output is, virtually,
    unlimited in size.

    The problem with this is that some platforms have their physical memory at
    high physical addresses (0x80000000 or more), and that the output buffer
    address and its "unlimited" length cannot be added without overflowing.
    An example of this can be found in inflate_fast():

    /* next_out is the output buffer address */
    out = strm->next_out - OFF;
    /* avail_out is the output buffer size. end will overflow if the
     * output address is >= 0x80000104 */
    end = out + (strm->avail_out - 257);

    This has huge consequences on the performance of kernel decompression,
    since the following exit condition of inflate_fast() will be always true:

    } while (in < last && out < end);

    Indeed, "end" has overflowed and is now always lower than "out". As a
    result, inflate_fast() will return after processing one single byte of
    input data, and will thus need to be called an unreasonably high number of
    times. This probably went unnoticed because kernel decompression is fast
    enough even with this issue.

    Nonetheless, adjusting the output buffer length in such a way that the
    above pointer arithmetic never overflows results in a kernel decompression
    that is about 3 times faster on affected machines.
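
    A minimal sketch of the idea behind the fix, using hypothetical variable
    names (the actual upstream change may differ in detail):

    /* Bound the "unlimited" output length so that out + avail_out in
     * inflate_fast() cannot wrap past the end of the address space. */
    if (!out_len)
            out_len = ((size_t)~0) - (size_t)out_buf;
    strm->avail_out = out_len;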

    Signed-off-by: Alexandre Courbot
    Tested-by: Jon Medhurst
    Cc: Stephen Warren
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Cc: Mark Brown
    Signed-off-by: Greg Kroah-Hartman

    Alexandre Courbot
     

12 Dec, 2013

1 commit

  • commit 674470d97958a0ec72f72caf7f6451da40159cc7 upstream.

    In struct gen_pool_chunk, end_addr means the end address of the memory
    chunk (inclusive), but the implementation treats it as the start address
    plus the size of the chunk (exclusive), so it points one past the
    correct ending address.

    Treating the ending address as "end plus one" overflows for a chunk that
    includes the last address of the memory map: e.g. with a starting
    address of 0xFFF00000 and a size of 0x100000 on a 32-bit machine, the
    computed ending address becomes 0x100000000.

    Use the correct ending address, i.e. starting address + size - 1.
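
    A sketch of the corrected chunk setup (field and variable names follow
    the description above; this is not the literal diff):

    chunk->start_addr = virt;
    chunk->end_addr   = virt + size - 1;  /* inclusive end, cannot wrap at the top of memory */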

    [akpm@linux-foundation.org: add comment to struct gen_pool_chunk:end_addr]
    Signed-off-by: Joonyoung Shim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Jonghwan Choi
    Signed-off-by: Greg Kroah-Hartman

    Joonyoung Shim
     

08 Dec, 2013

1 commit

  • [ Upstream commit 51c37a70aaa3f95773af560e6db3073520513912 ]

    For properly initialising the Tausworthe generator [1], we have
    a strict seeding requirement, that is, s1 > 1, s2 > 7, s3 > 15.

    Commit 697f8d0348 ("random32: seeding improvement") introduced
    a __seed() function that imposes boundary checks proposed by the
    errata paper [2] to properly ensure above conditions.

    However, we're off by one, as the function is implemented as:
    "return (x < m) ? x + m : x;", and called with __seed(X, 1),
    __seed(X, 7), __seed(X, 15). Thus, an unwanted seed of 1, 7, 15
    would be possible, whereas the lower boundary should actually
    be at least 2, 8, 16, just as GSL does. Fix this, as otherwise
    an initialization with an unwanted seed could mean that
    Tausworthe's PRNG properties cannot be ensured.

    Note that this PRNG is *not* used for cryptography in the kernel.

    [1] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme.ps
    [2] http://www.iro.umontreal.ca/~lecuyer/myftp/papers/tausme2.ps

    Joint work with Hannes Frederic Sowa.
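
    A sketch of the corrected seeding calls implied by the bounds above
    (with __seed(x, m) returning x + m when x < m; state/field names may
    differ from the actual commit):

    state->s1 = __seed(i,  2U);    /* guarantees s1 > 1  */
    state->s2 = __seed(i,  8U);    /* guarantees s2 > 7  */
    state->s3 = __seed(i, 16U);    /* guarantees s3 > 15 */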

    Fixes: 697f8d0348a6 ("random32: seeding improvement")
    Cc: Stephen Hemminger
    Cc: Florian Weimer
    Cc: Theodore Ts'o
    Signed-off-by: Daniel Borkmann
    Signed-off-by: Hannes Frederic Sowa
    Signed-off-by: David S. Miller
    Signed-off-by: Greg Kroah-Hartman

    Daniel Borkmann
     

05 Dec, 2013

1 commit

  • commit 312b4e226951f707e120b95b118cbc14f3d162b2 upstream.

    Some setuid binaries will allow reading of files which have read
    permission by the real user id. This is problematic with files which
    use %pK because the file access permission is checked at open() time,
    but the kptr_restrict setting is checked at read() time. If a setuid
    binary opens a %pK file as an unprivileged user, and then elevates
    permissions before reading the file, then kernel pointer values may be
    leaked.

    This happens for example with the setuid pppd application on Ubuntu 12.04:

    $ head -1 /proc/kallsyms
    00000000 T startup_32

    $ pppd file /proc/kallsyms
    pppd: In file /proc/kallsyms: unrecognized option 'c1000000'

    This will only leak the pointer value from the first line, but other
    setuid binaries may leak more information.

    Fix this by adding a check that, in addition to the current process
    having CAP_SYSLOG, the effective user and group IDs are equal to the
    real IDs.
    If a setuid binary reads the contents of a file which uses %pK then the
    pointer values will be printed as NULL if the real user is unprivileged.
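
    A sketch of the combined check described above, wrapped in a
    hypothetical helper for clarity (the in-tree change may be open-coded):

    static bool kptr_visible(void)
    {
            const struct cred *cred = current_cred();

            /* Require CAP_SYSLOG and no elevated effective ids. */
            return has_capability_noaudit(current, CAP_SYSLOG) &&
                   uid_eq(cred->uid, cred->euid) &&
                   gid_eq(cred->gid, cred->egid);
    }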

    Update the sysctl documentation to reflect the changes, and also correct
    the documentation to state that kptr_restrict=0 is the default.

    This is only a temporary solution to the issue. The correct solution is
    to do the permission check at open() time on files, and to replace %pK
    with a function which checks the open() time permission. Uses of %pK in
    printk should be removed, since no sane permission check can be done
    there; such messages should instead be protected by dmesg_restrict.

    Signed-off-by: Ryan Mallon
    Cc: Kees Cook
    Cc: Alexander Viro
    Cc: Joe Perches
    Cc: "Eric W. Biederman"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ryan Mallon
     

13 Nov, 2013

1 commit

  • commit 3d77b50c5874b7e923be946ba793644f82336b75 upstream.

    Commit b1adaf65ba03 ("[SCSI] block: add sg buffer copy helper
    functions") introduces two sg buffer copy helpers, and calls
    flush_kernel_dcache_page() on pages in SG list after these pages are
    written to.

    Unfortunately, the commit may introduce a potential bug:

    - Before sending some SCSI commands, a kmalloc() buffer may be passed to
    the block layer, so flush_kernel_dcache_page() can end up seeing a slab
    page

    - According to cachetlb.txt, flush_kernel_dcache_page() is only called
    on "a user page", which surely can't be a slab page.

    - An architecture's implementation of flush_kernel_dcache_page() may use
    page mapping information for optimization, so page_mapping() will see
    the slab page and VM_BUG_ON() is triggered.

    Aaro Koskinen reported the bug on ARM/kirkwood when DEBUG_VM is enabled,
    and this patch fixes the bug by adding test of '!PageSlab(miter->page)'
    before calling flush_kernel_dcache_page().
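
    A sketch of the added guard (the surrounding sg miter code is omitted):

    if (!PageSlab(miter->page))
            flush_kernel_dcache_page(miter->page);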

    Signed-off-by: Ming Lei
    Reported-by: Aaro Koskinen
    Tested-by: Simon Baatz
    Cc: Russell King - ARM Linux
    Cc: Will Deacon
    Cc: Aaro Koskinen
    Acked-by: Catalin Marinas
    Cc: FUJITA Tomonori
    Cc: Tejun Heo
    Cc: "James E.J. Bottomley"
    Cc: Jens Axboe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Ming Lei
     

29 Jul, 2013

1 commit

  • commit 25c87eae1725ed77a8b44d782a86abdc279b4ede upstream.

    FAULT_INJECTION_STACKTRACE_FILTER selects FRAME_POINTER but
    that symbol is not available for MIPS.

    Fixes the following problem on a randconfig:
    warning: (LOCKDEP && FAULT_INJECTION_STACKTRACE_FILTER && LATENCYTOP &&
    KMEMCHECK) selects FRAME_POINTER which has unmet direct dependencies
    (DEBUG_KERNEL && (CRIS || M68K || FRV || UML || AVR32 || SUPERH || BLACKFIN ||
    MN10300 || METAG) || ARCH_WANT_FRAME_POINTERS)

    Signed-off-by: Markos Chandras
    Acked-by: Steven J. Hill
    Cc: linux-mips@linux-mips.org
    Patchwork: https://patchwork.linux-mips.org/patch/5441/
    Signed-off-by: Ralf Baechle
    Signed-off-by: Greg Kroah-Hartman

    Markos Chandras
     

13 Jun, 2013

1 commit

  • The 'while' loop needs to stop when 'nbytes == 0', or it will cause an
    issue, since 'nbytes' is a size_t and is always greater than or equal
    to zero.

    The related warning: (with EXTRA_CFLAGS=-W)

    lib/mpi/mpicoder.c:40:2: warning: comparison of unsigned expression >= 0 is always true [-Wtype-limits]
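
    A minimal illustration of the issue (hypothetical values, not the mpi
    code itself):

    size_t nbytes = 16;             /* hypothetical byte count           */

    while (nbytes > 0)              /* was effectively "nbytes >= 0",    */
            nbytes--;               /* which is always true for a size_t */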

    Signed-off-by: Chen Gang
    Cc: Rusty Russell
    Cc: David Howells
    Cc: James Morris
    Cc: Andy Shevchenko
    Acked-by: David Howells
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     

25 May, 2013

1 commit

  • The umul_ppmm() macro for parisc uses the xmpyu assembler statement
    which does calculation via a floating point register.

    But usage of floating point registers inside the Linux kernel are not
    allowed and gcc will stop compilation due to the -mdisable-fpregs
    compiler option.

    Fix this by disabling the umul_ppmm() and udiv_qrnnd() macros. The
    mpilib will then use the generic built-in implementations instead.

    Signed-off-by: Helge Deller

    Helge Deller
     

24 May, 2013

2 commits

  • Pull driver core fixes from Greg Kroah-Hartman:
    "Here are 3 tiny driver core fixes for 3.10-rc2.

    A needed symbol export, a change to make it easier to track down
    offending sysfs files with incorrect attributes, and a klist bugfix.

    All have been in linux-next for a while"

    * tag 'driver-core-3.10-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core:
    klist: del waiter from klist_remove_waiters before wakeup waitting process
    driver core: print sysfs attribute name when warning about bogus permissions
    driver core: export subsys_virtual_register

    Linus Torvalds
     
  • Fix a build error in vmw_vmci.ko when CONFIG_VMWARE_VMCI=m by changing
    iovec.o from lib-y to obj-y.

    ERROR: "memcpy_toiovec" [drivers/misc/vmw_vmci/vmw_vmci.ko] undefined!
    ERROR: "memcpy_fromiovec" [drivers/misc/vmw_vmci/vmw_vmci.ko] undefined!

    Signed-off-by: Randy Dunlap
    Acked-by: Rusty Russell
    Signed-off-by: Linus Torvalds

    Randy Dunlap
     

22 May, 2013

1 commit

  • There is a race between klist_remove and klist_release. klist_remove
    uses a local waiter variable saved on its stack. When klist_release
    calls wake_up_process(waiter->process) to wake up the waiter, the
    waiter might run immediately and reuse that stack. klist_release then
    calls list_del(&waiter->list), which modifies what is now the woken
    thread's stack and corrupts it.

    The patch fixes this against kernel 3.9.
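
    A sketch of the ordering fix: unlink the waiter before waking it, so the
    woken thread is free to reuse its stack.

    /* klist_release(): remove the waiter first, then wake the process. */
    list_del(&waiter->list);
    wake_up_process(waiter->process);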

    Signed-off-by: wang, biao
    Acked-by: Peter Zijlstra
    Signed-off-by: Greg Kroah-Hartman

    wang, biao
     

20 May, 2013

1 commit

  • ERROR: "memcpy_fromiovec" [drivers/vhost/vhost_scsi.ko] undefined!

    That function is only present with CONFIG_NET. Turns out that
    crypto/algif_skcipher.c also uses that outside net, but it actually
    needs sockets anyway.

    In addition, commit 6d4f0139d642c45411a47879325891ce2a7c164a added
    CONFIG_NET dependency to CONFIG_VMCI for memcpy_toiovec, so hoist
    that function and revert that commit too.

    socket.h already includes uio.h, so no callers need updating; trying
    only broke things for x86_64 randconfig (thanks Fengguang!).

    Reported-by: Randy Dunlap
    Acked-by: David S. Miller
    Acked-by: Michael S. Tsirkin
    Signed-off-by: Rusty Russell

    Rusty Russell
     

09 May, 2013

1 commit

  • Pull block driver updates from Jens Axboe:
    "It might look big in volume, but when categorized, not a lot of
    drivers are touched. The pull request contains:

    - mtip32xx fixes from Micron.

    - A slew of drbd updates, this time in a nicer series.

    - bcache, a flash/ssd caching framework from Kent.

    - Fixes for cciss"

    * 'for-3.10/drivers' of git://git.kernel.dk/linux-block: (66 commits)
    bcache: Use bd_link_disk_holder()
    bcache: Allocator cleanup/fixes
    cciss: bug fix to prevent cciss from loading in kdump crash kernel
    cciss: add cciss_allow_hpsa module parameter
    drivers/block/mg_disk.c: add CONFIG_PM_SLEEP to suspend/resume functions
    mtip32xx: Workaround for unaligned writes
    bcache: Make sure blocksize isn't smaller than device blocksize
    bcache: Fix merge_bvec_fn usage for when it modifies the bvm
    bcache: Correctly check against BIO_MAX_PAGES
    bcache: Hack around stuff that clones up to bi_max_vecs
    bcache: Set ra_pages based on backing device's ra_pages
    bcache: Take data offset from the bdev superblock.
    mtip32xx: mtip32xx: Disable TRIM support
    mtip32xx: fix a smatch warning
    bcache: Disable broken btree fuzz tester
    bcache: Fix a format string overflow
    bcache: Fix a minor memory leak on device teardown
    bcache: Documentation updates
    bcache: Use WARN_ONCE() instead of __WARN()
    bcache: Add missing #include
    ...

    Linus Torvalds
     

08 May, 2013

3 commits

  • This patch tries to reduce the amount of cmpxchg calls in the writer
    failed path by checking the counter value first before issuing the
    instruction. If ->count is not set to RWSEM_WAITING_BIAS then there is
    no point wasting a cmpxchg call.

    Furthermore, Michel states "I suppose it helps due to the case where
    someone else steals the lock while we're trying to acquire
    sem->wait_lock."

    Two very different workloads and machines were used to see how this
    patch improves throughput: pgbench on a quad-core laptop and aim7 on a
    large 8 socket box with 80 cores.

    Some results comparing Michel's fast-path write lock stealing
    (tps-rwsem) on a quad-core laptop running pgbench:

    | db_size | clients | tps-rwsem | tps-patch |
    +---------+----------+----------------+--------------+
    | 160 MB | 1 | 6906 | 9153 | + 32.5%
    | 160 MB | 2 | 15931 | 22487 | + 41.1%
    | 160 MB | 4 | 33021 | 32503 |
    | 160 MB | 8 | 34626 | 34695 |
    | 160 MB | 16 | 33098 | 34003 |
    | 160 MB | 20 | 31343 | 31440 |
    | 160 MB | 30 | 28961 | 28987 |
    | 160 MB | 40 | 26902 | 26970 |
    | 160 MB | 50 | 25760 | 25810 |
    ------------------------------------------------------
    | 1.6 GB | 1 | 7729 | 7537 |
    | 1.6 GB | 2 | 19009 | 23508 | + 23.7%
    | 1.6 GB | 4 | 33185 | 32666 |
    | 1.6 GB | 8 | 34550 | 34318 |
    | 1.6 GB | 16 | 33079 | 32689 |
    | 1.6 GB | 20 | 31494 | 31702 |
    | 1.6 GB | 30 | 28535 | 28755 |
    | 1.6 GB | 40 | 27054 | 27017 |
    | 1.6 GB | 50 | 25591 | 25560 |
    ------------------------------------------------------
    | 7.6 GB | 1 | 6224 | 7469 | + 20.0%
    | 7.6 GB | 2 | 13611 | 12778 |
    | 7.6 GB | 4 | 33108 | 32927 |
    | 7.6 GB | 8 | 34712 | 34878 |
    | 7.6 GB | 16 | 32895 | 33003 |
    | 7.6 GB | 20 | 31689 | 31974 |
    | 7.6 GB | 30 | 29003 | 28806 |
    | 7.6 GB | 40 | 26683 | 26976 |
    | 7.6 GB | 50 | 25925 | 25652 |
    ------------------------------------------------------

    The aim7 workloads overall improved on top of Michel's patchset. For
    full graphs of how the rwsem series plus this patch
    behaves on a large 8 socket machine against a vanilla kernel:

    http://stgolabs.net/rwsem-aim7-results.tar.gz

    Signed-off-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • - make warning smp-safe
    - result of atomic _unless_zero functions should be checked by caller
    to avoid use-after-free error
    - trivial whitespace fix.

    Link: https://lkml.org/lkml/2013/4/12/391

    Tested: compile x86, boot machine and run xfstests
    Signed-off-by: Anatol Pomozov
    [ Removed line-break, changed to use WARN_ON_ONCE() - Linus ]
    Signed-off-by: Linus Torvalds

    Anatol Pomozov
     
  • Merge rwsem optimizations from Michel Lespinasse:
    "These patches extend Alex Shi's work (which added write lock stealing
    on the rwsem slow path) in order to provide rwsem write lock stealing
    on the fast path (that is, without taking the rwsem's wait_lock).

    I have unfortunately been unable to push this through -next before due
    to Ingo Molnar / David Howells / Peter Zijlstra being busy with other
    things. However, this has gotten some attention from Rik van Riel and
    Davidlohr Bueso who both commented that they felt this was ready for
    v3.10, and Ingo Molnar has said that he was OK with me pushing
    directly to you. So, here goes :)

    Davidlohr got the following test results from pgbench running on a
    quad-core laptop:

    | db_size | clients | tps-vanilla | tps-rwsem |
    +---------+----------+----------------+--------------+
    | 160 MB | 1 | 5803 | 6906 | + 19.0%
    | 160 MB | 2 | 13092 | 15931 |
    | 160 MB | 4 | 29412 | 33021 |
    | 160 MB | 8 | 32448 | 34626 |
    | 160 MB | 16 | 32758 | 33098 |
    | 160 MB | 20 | 26940 | 31343 | + 16.3%
    | 160 MB | 30 | 25147 | 28961 |
    | 160 MB | 40 | 25484 | 26902 |
    | 160 MB | 50 | 24528 | 25760 |
    ------------------------------------------------------
    | 1.6 GB | 1 | 5733 | 7729 | + 34.8%
    | 1.6 GB | 2 | 9411 | 19009 | + 101.9%
    | 1.6 GB | 4 | 31818 | 33185 |
    | 1.6 GB | 8 | 33700 | 34550 |
    | 1.6 GB | 16 | 32751 | 33079 |
    | 1.6 GB | 20 | 30919 | 31494 |
    | 1.6 GB | 30 | 28540 | 28535 |
    | 1.6 GB | 40 | 26380 | 27054 |
    | 1.6 GB | 50 | 25241 | 25591 |
    ------------------------------------------------------
    | 7.6 GB | 1 | 5779 | 6224 |
    | 7.6 GB | 2 | 10897 | 13611 | + 24.9%
    | 7.6 GB | 4 | 32683 | 33108 |
    | 7.6 GB | 8 | 33968 | 34712 |
    | 7.6 GB | 16 | 32287 | 32895 |
    | 7.6 GB | 20 | 27770 | 31689 | + 14.1%
    | 7.6 GB | 30 | 26739 | 29003 |
    | 7.6 GB | 40 | 24901 | 26683 |
    | 7.6 GB | 50 | 17115 | 25925 | + 51.5%
    ------------------------------------------------------

    (Davidlohr also has one additional patch which further improves
    throughput, though I will ask him to send it directly to you as I have
    suggested some minor changes)."

    * emailed patches from Michel Lespinasse :
    rwsem: no need for explicit signed longs
    x86 rwsem: avoid taking slow path when stealing write lock
    rwsem: do not block readers at head of queue if other readers are active
    rwsem: implement support for write lock stealing on the fastpath
    rwsem: simplify __rwsem_do_wake
    rwsem: skip initial trylock in rwsem_down_write_failed
    rwsem: avoid taking wait_lock in rwsem_down_write_failed
    rwsem: use cmpxchg for trying to steal write lock
    rwsem: more agressive lock stealing in rwsem_down_write_failed
    rwsem: simplify rwsem_down_write_failed
    rwsem: simplify rwsem_down_read_failed
    rwsem: move rwsem_down_failed_common code into rwsem_down_{read,write}_failed
    rwsem: shorter spinlocked section in rwsem_down_failed_common()
    rwsem: make the waiter type an enumeration rather than a bitmask

    Linus Torvalds
     

07 May, 2013

13 commits

  • Change explicit "signed long" declarations into plain "long" as suggested
    by Peter Hurley.

    Signed-off-by: Davidlohr Bueso
    Reviewed-by: Michel Lespinasse
    Signed-off-by: Michel Lespinasse
    Signed-off-by: Linus Torvalds

    Davidlohr Bueso
     
  • This change fixes a race condition where a reader might determine it
    needs to block, but by the time it acquires the wait_lock the rwsem has
    active readers and no queued waiters.

    In this situation the reader can run in parallel with the existing
    active readers; it does not need to block until the active readers
    complete.

    Thanks to Peter Hurley for noticing this possible race.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When we decide to wake up readers, we must first grant them as many read
    locks as necessary, and then actually wake up all these readers. But in
    order to know how many read shares to grant, we must first count the
    readers at the head of the queue. This might take a while if there are
    many readers, and we want to be protected against a writer stealing the
    lock while we're counting. To that end, we grant the first reader lock
    before counting how many more readers are queued.

    We also require some adjustments to the wake_type semantics.

    RWSEM_WAKE_NO_ACTIVE used to mean that we had found the count to be
    RWSEM_WAITING_BIAS, in which case the rwsem was known to be free as
    nobody could steal it while we hold the wait_lock. This doesn't make
    sense once we implement fastpath write lock stealing, so we now use
    RWSEM_WAKE_ANY in that case.

    Similarly, when rwsem_down_write_failed found that a read lock was
    active, it would use RWSEM_WAKE_READ_OWNED which signalled that new
    readers could be woken without checking first that the rwsem was
    available. We can't do that anymore since the existing readers might
    release their read locks, and a writer could steal the lock before we
    wake up additional readers. So, we have to use a new RWSEM_WAKE_READERS
    value to indicate we only want to wake readers, but we don't currently
    hold any read lock.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This is mostly for cleanup value:

    - We don't need several gotos to handle the case where the first
    waiter is a writer. Two simple tests will do (and generate very
    similar code).

    - In the remainder of the function, we know the first waiter is a reader,
    so we don't have to double check that. We can use do..while loops
    to iterate over the readers to wake (generates slightly better code).

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We can skip the initial trylock in rwsem_down_write_failed() if there
    are known active lockers already, thus saving one likely-to-fail
    cmpxchg.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • In rwsem_down_write_failed(), if there are active locks after we wake up
    (i.e. the lock got stolen from us), skip taking the wait_lock and go
    back to sleep immediately.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Using rwsem_atomic_update to try stealing the write lock forced us to
    undo the adjustment in the failure path. We can have simpler and faster
    code by using cmpxchg instead.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Acked-by: Rik van Riel
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Some small code simplifications can be achieved by doing more
    aggressive lock stealing:

    - When rwsem_down_write_failed() notices that there are no active locks
    (and thus no thread to wake us if we decided to sleep), it used to wake
    the first queued process. However, stealing the lock is also sufficient
    to deal with this case, so we don't need this check anymore.

    - In try_get_writer_sem(), we can steal the lock even when the first waiter
    is a reader. This is correct because the code path that wakes readers is
    protected by the wait_lock. As to the performance effects of this change,
    they are expected to be minimal: readers are still granted the lock
    (rather than having to acquire it themselves) when they reach the front
    of the wait queue, so we have essentially the same behavior as in
    rwsem-spinlock.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When waking writers, we never grant them the lock - instead, they have
    to acquire it themselves when they run, and remove themselves from the
    wait_list when they succeed.

    As a result, we can do a few simplifications in rwsem_down_write_failed():

    - We don't need to check for !waiter.task since __rwsem_do_wake() doesn't
    remove writers from the wait_list

    - There is no point releasing the wait_lock before entering the wait loop,
    as we will need to reacquire it immediately. We can change the loop so
    that the lock is always held at the start of each loop iteration.

    - We don't need to get a reference on the task structure, since the task
    is responsible for removing itself from the wait_list. There is no risk,
    like in the rwsem_down_read_failed() case, that a task would wake up and
    exit (thus destroying its task structure) while __rwsem_do_wake() is
    still running - wait_lock protects against that.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • When trying to acquire a read lock, the RWSEM_ACTIVE_READ_BIAS
    adjustment doesn't cause other readers to block, so we never have to
    worry about waking them back after canceling this adjustment in
    rwsem_down_read_failed().

    We also never want to steal the lock in rwsem_down_read_failed(), so we
    don't have to grab the wait_lock either.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • Remove the rwsem_down_failed_common function and replace it with two
    identical copies of its code in rwsem_down_{read,write}_failed.

    This is because we want to make different optimizations in
    rwsem_down_{read,write}_failed; we are adding this pure-duplication
    step as a separate commit in order to make it easier to check the
    following steps.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • This change reduces the size of the spinlocked and TASK_UNINTERRUPTIBLE
    sections in rwsem_down_failed_common():

    - We only need the sem->wait_lock to insert ourselves on the wait_list;
    the waiter node can be prepared outside of the wait_lock.

    - The task state only needs to be set to TASK_UNINTERRUPTIBLE immediately
    before checking if we actually need to sleep; it doesn't need to protect
    the entire function.

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     
  • We are not planning to add any new waiter flags, so we can convert the
    waiter type into an enumeration.

    Background: David Howells suggested I do this back when I tried adding
    a new waiter type for unfair readers. However, I believe the cleanup
    applies regardless of that use case.
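
    A sketch of the resulting types (names may differ slightly from the
    actual commit):

    enum rwsem_waiter_type {
            RWSEM_WAITING_FOR_WRITE,
            RWSEM_WAITING_FOR_READ
    };

    struct rwsem_waiter {
            struct list_head list;
            struct task_struct *task;
            enum rwsem_waiter_type type;
    };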

    Signed-off-by: Michel Lespinasse
    Reviewed-by: Rik van Riel
    Reviewed-by: Peter Hurley
    Acked-by: Davidlohr Bueso
    Signed-off-by: Linus Torvalds

    Michel Lespinasse
     

03 May, 2013

2 commits

  • Pull drm updates from Dave Airlie:
    "This is the main drm pull request for 3.10.

    Wierd bits:
    - OMAP drm changes required OMAP dss changes, in drivers/video, so I
    took them in here.
    - one more fbcon fix for font handover
    - VT switch avoidance in pm code
    - scatterlist helpers for gpu drivers - have acks from akpm

    Highlights:
    - qxl kms driver - driver for the spice qxl virtual GPU

    Nouveau:
    - fermi/kepler VRAM compression
    - GK110/nvf0 modesetting support.

    Tegra:
    - host1x core merged with 2D engine support

    i915:
    - vt switchless resume
    - more valleyview support
    - vblank fixes
    - modesetting pipe config rework

    radeon:
    - UVD engine support
    - SI chip tiling support
    - GPU registers initialisation from golden values.

    exynos:
    - device tree changes
    - fimc block support

    Otherwise:
    - bunches of fixes all over the place."

    * 'drm-next' of git://people.freedesktop.org/~airlied/linux: (513 commits)
    qxl: update to new idr interfaces.
    drm/nouveau: fix build with nv50->nvc0
    drm/radeon: fix handling of v6 power tables
    drm/radeon: clarify family checks in pm table parsing
    drm/radeon: consolidate UVD clock programming
    drm/radeon: fix UPLL_REF_DIV_MASK definition
    radeon: add bo tracking debugfs
    drm/radeon: add new richland pci ids
    drm/radeon: add some new SI PCI ids
    drm/radeon: fix scratch reg handling for UVD fence
    drm/radeon: allocate SA bo in the requested domain
    drm/radeon: fix possible segfault when parsing pm tables
    drm/radeon: fix endian bugs in atom_allocate_fb_scratch()
    OMAPDSS: TFP410: return EPROBE_DEFER if the i2c adapter not found
    OMAPDSS: VENC: Add error handling for venc_probe_pdata
    OMAPDSS: HDMI: Add error handling for hdmi_probe_pdata
    OMAPDSS: RFBI: Add error handling for rfbi_probe_pdata
    OMAPDSS: DSI: Add error handling for dsi_probe_pdata
    OMAPDSS: SDI: Add error handling for sdi_probe_pdata
    OMAPDSS: DPI: Add error handling for dpi_probe_pdata
    ...

    Linus Torvalds
     
  • Pull scheduler fixes from Ingo Molnar:
    "This fixes the cputime scaling overflow problems for good without
    having bad 32-bit overhead, and gets rid of the div64_u64_rem() helper
    as well."

    * 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
    Revert "math64: New div64_u64_rem helper"
    sched: Avoid prev->stime underflow
    sched: Do not account bogus utime
    sched: Avoid cputime scaling overflow

    Linus Torvalds
     

02 May, 2013

1 commit

  • Pull VFS updates from Al Viro,

    Misc cleanups all over the place, mainly wrt /proc interfaces (switch
    create_proc_entry to proc_create(), get rid of the deprecated
    create_proc_read_entry() in favor of using proc_create_data() and
    seq_file etc).

    7kloc removed.

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (204 commits)
    don't bother with deferred freeing of fdtables
    proc: Move non-public stuff from linux/proc_fs.h to fs/proc/internal.h
    proc: Make the PROC_I() and PDE() macros internal to procfs
    proc: Supply a function to remove a proc entry by PDE
    take cgroup_open() and cpuset_open() to fs/proc/base.c
    ppc: Clean up scanlog
    ppc: Clean up rtas_flash driver somewhat
    hostap: proc: Use remove_proc_subtree()
    drm: proc: Use remove_proc_subtree()
    drm: proc: Use minor->index to label things, not PDE->name
    drm: Constify drm_proc_list[]
    zoran: Don't print proc_dir_entry data in debug
    reiserfs: Don't access the proc_dir_entry in r_open(), r_start() r_show()
    proc: Supply an accessor for getting the data from a PDE's parent
    airo: Use remove_proc_subtree()
    rtl8192u: Don't need to save device proc dir PDE
    rtl8187se: Use a dir under /proc/net/r8180/
    proc: Add proc_mkdir_data()
    proc: Move some bits from linux/proc_fs.h to linux/{of.h,signal.h,tty.h}
    proc: Move PDE_NET() to fs/proc/proc_net.c
    ...

    Linus Torvalds