Eric Lee / smarc-fsl-linux-kernel

11 Sep, 2015

40 commits

d097c0240 kmod: remove unecessary explicit wide CPU affinity setting ... Browse Code »

Khelper is affine to all CPUs. Now since it creates the
call_usermodehelper_exec_[a]sync() kernel threads, those inherit the wide
affinity.

As such explicitly forcing a wide affinity from those kernel threads
is like a no-op.

Just remove it. It's needless and it breaks CPU isolation users who
rely on workqueue affinity tuning.

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
b6b50a814 kmod: bunch of internal functions renames ... Browse Code »

This patchset does a bunch of cleanups and converts khelper to use system
unbound workqueues. The 3 first patches should be uncontroversial. The
last 2 patches are debatable.

Kmod creates kernel threads that perform userspace jobs and we want those
to have a large affinity in order not to contend busy CPUs. This is
(partly) why we use khelper which has a wide affinity that the kernel
threads it create can inherit from. Now khelper is a dedicated workqueue
that has singlethread properties which we aren't interested in.

Hence those two debatable changes:

_ We would like to use generic workqueues. System unbound workqueues are
a very good candidate but they are not wide affine, only node affine.
Now probably a node is enough to perform many parallel kmod jobs.

_ We would like to remove the wait_for_helper kernel thread (UMH_WAIT_PROC
handler) to use the workqueue. It means that if the workqueue blocks,
and no other worker can take pending kmod request, we can be screwed.
Now if we have 512 threads, this should be enough.

This patch (of 5):

Underscores on function names aren't much verbose to explain the purpose
of a function. And kmod has interesting such flavours.

Lets rename the following functions:

* __call_usermodehelper -> call_usermodehelper_exec_work
* ____call_usermodehelper -> call_usermodehelper_exec_async
* wait_for_helper -> call_usermodehelper_exec_sync

Signed-off-by: Frederic Weisbecker
Cc: Rik van Riel
Reviewed-by: Oleg Nesterov
Cc: Christoph Lameter
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Frederic Weisbecker
2015-09-11 04:29:01 +0800
60b61a6f4 kmod: correct documentation of return status of request_module ... Browse Code »

If request_module() successfully runs modprobe, but modprobe exits with a
non-zero status, then the return value from request_module() will be that
(positive) error status. So the return from request_module can be:

negative errno
zero for success
positive exit code.

Signed-off-by: NeilBrown
Cc: Goldwyn Rodrigues
Cc: Oleg Nesterov
Cc: Tejun Heo
Cc: Rusty Russell
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

NeilBrown
2015-09-11 04:29:01 +0800
b4cc0efea hfs: fix B-tree corruption after insertion at position 0 ... Browse Code »

Fix B-tree corruption when a new record is inserted at position 0 in the
node in hfs_brec_insert().

This is an identical change to the corresponding hfs b-tree code to Sergei
Antonov's "hfsplus: fix B-tree corruption after insertion at position 0",
to keep similar code paths in the hfs and hfsplus drivers in sync, where
appropriate.

Signed-off-by: Hin-Tak Leung
Cc: Sergei Antonov
Cc: Joe Perches
Reviewed-by: Vyacheslav Dubeyko
Cc: Anton Altaparmakov
Cc: Al Viro
Cc: Christoph Hellwig
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hin-Tak Leung
2015-09-11 04:29:01 +0800
7cb74be6f hfs,hfsplus: cache pages correctly between bnode_create and bnode_free ... Browse Code »

Pages looked up by __hfs_bnode_create() (called by hfs_bnode_create() and
hfs_bnode_find() for finding or creating pages corresponding to an inode)
are immediately kmap()'ed and used (both read and write) and kunmap()'ed,
and should not be page_cache_release()'ed until hfs_bnode_free().

This patch fixes a problem I first saw in July 2012: merely running "du"
on a large hfsplus-mounted directory a few times on a reasonably loaded
system would get the hfsplus driver all confused and complaining about
B-tree inconsistencies, and generates a "BUG: Bad page state". Most
recently, I can generate this problem on up-to-date Fedora 22 with shipped
kernel 4.0.5, by running "du /" (="/" + "/home" + "/mnt" + other smaller
mounts) and "du /mnt" simultaneously on two windows, where /mnt is a
lightly-used QEMU VM image of the full Mac OS X 10.9:

$ df -i / /home /mnt
Filesystem Inodes IUsed IFree IUse% Mounted on
/dev/mapper/fedora-root 3276800 551665 2725135 17% /
/dev/mapper/fedora-home 52879360 716221 52163139 2% /home
/dev/nbd0p2 4294967295 1387818 4293579477 1% /mnt

After applying the patch, I was able to run "du /" (60+ times) and "du
/mnt" (150+ times) continuously and simultaneously for 6+ hours.

There are many reports of the hfsplus driver getting confused under load
and generating "BUG: Bad page state" or other similar issues over the
years. [1]

The unpatched code [2] has always been wrong since it entered the kernel
tree. The only reason why it gets away with it is that the
kmap/memcpy/kunmap follow very quickly after the page_cache_release() so
the kernel has not had a chance to reuse the memory for something else,
most of the time.

The current RW driver appears to have followed the design and development
of the earlier read-only hfsplus driver [3], where-by version 0.1 (Dec
2001) had a B-tree node-centric approach to
read_cache_page()/page_cache_release() per bnode_get()/bnode_put(),
migrating towards version 0.2 (June 2002) of caching and releasing pages
per inode extents. When the current RW code first entered the kernel [2]
in 2005, there was an REF_PAGES conditional (and "//" commented out code)
to switch between B-node centric paging to inode-centric paging. There
was a mistake with the direction of one of the REF_PAGES conditionals in
__hfs_bnode_create(). In a subsequent "remove debug code" commit [4], the
read_cache_page()/page_cache_release() per bnode_get()/bnode_put() were
removed, but a page_cache_release() was mistakenly left in (propagating
the "REF_PAGES !REF_PAGE" mistake), and the commented-out
page_cache_release() in bnode_release() (which should be spanned by
!REF_PAGES) was never enabled.

References:
[1]:
Michael Fox, Apr 2013
http://www.spinics.net/lists/linux-fsdevel/msg63807.html
("hfsplus volume suddenly inaccessable after 'hfs: recoff %d too large'")

Sasha Levin, Feb 2015
http://lkml.org/lkml/2015/2/20/85 ("use after free")

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/740814
https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1027887
https://bugzilla.kernel.org/show_bug.cgi?id=42342
https://bugzilla.kernel.org/show_bug.cgi?id=63841
https://bugzilla.kernel.org/show_bug.cgi?id=78761

[2]:
http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfs/bnode.c?id=d1081202f1d0ee35ab0beb490da4b65d4bc763db
commit d1081202f1d0ee35ab0beb490da4b65d4bc763db
Author: Andrew Morton
Date: Wed Feb 25 16:17:36 2004 -0800

[PATCH] HFS rewrite

http://git.kernel.org/cgit/linux/kernel/git/tglx/history.git/commit/\
fs/hfsplus/bnode.c?id=91556682e0bf004d98a529bf829d339abb98bbbd

commit 91556682e0bf004d98a529bf829d339abb98bbbd
Author: Andrew Morton
Date: Wed Feb 25 16:17:48 2004 -0800

[PATCH] HFS+ support

[3]:
http://sourceforge.net/projects/linux-hfsplus/

http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.1/
http://sourceforge.net/projects/linux-hfsplus/files/Linux%202.4.x%20patch/hfsplus%200.2/

http://linux-hfsplus.cvs.sourceforge.net/viewvc/linux-hfsplus/linux/\
fs/hfsplus/bnode.c?r1=1.4&r2=1.5

Date: Thu Jun 6 09:45:14 2002 +0000
Use buffer cache instead of page cache in bnode.c. Cache inode extents.

[4]:
http://git.kernel.org/cgit/linux/kernel/git/\
stable/linux-stable.git/commit/?id=a5e3985fa014029eb6795664c704953720cc7f7d

commit a5e3985fa014029eb6795664c704953720cc7f7d
Author: Roman Zippel
Date: Tue Sep 6 15:18:47 2005 -0700

[PATCH] hfs: remove debug code

Signed-off-by: Hin-Tak Leung
Signed-off-by: Sergei Antonov
Reviewed-by: Anton Altaparmakov
Reported-by: Sasha Levin
Cc: Al Viro
Cc: Christoph Hellwig
Cc: Vyacheslav Dubeyko
Cc: Sougata Santra
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Hin-Tak Leung
2015-09-11 04:29:01 +0800
3725e9dd5 fs/coda: fix readlink buffer overflow ... Browse Code »

Dan Carpenter discovered a buffer overflow in the Coda file system
readlink code. A userspace file system daemon can return a 4096 byte
result which then triggers a one byte write past the allocated readlink
result buffer.

This does not trigger with an unmodified Coda implementation because Coda
has a 1024 byte limit for symbolic links, however other userspace file
systems using the Coda kernel module could be affected.

Although this is an obvious overflow, I don't think this has to be handled
as too sensitive from a security perspective because the overflow is on
the Coda userspace daemon side which already needs root to open Coda's
kernel device and to mount the file system before we get to the point that
links can be read.

[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Jan Harkes
Reported-by: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jan Harkes
2015-09-11 04:29:01 +0800
c5595fa2f checkpatch: add constant comparison on left side test ... Browse Code »

"CONST variable" checks like:

if (NULL != foo)
and
while (0 < bar(...))

where a constant (or what appears to be a constant like an upper case
identifier) is on the left of a comparison are generally preferred to be
written using the constant on the right side like:

if (foo != NULL)
and
while (bar(...) > 0)

Add a test for this.

Add a --fix option too, but only do it when the code is immediately
surrounded by parentheses to avoid misfixing things like "(0 < bar() +
constant)"

Signed-off-by: Joe Perches
Cc: Nicolas Morey Chaisemartin
Cc: Viresh Kumar
Cc: Dan Carpenter
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
54507b518 checkpatch: add __pmem to $Sparse annotations ... Browse Code »

commit 61031952f4c8 ("arch, x86: pmem api for ensuring durability of
persistent memory updates") added a new __pmem annotation for sparse
verification. Add __pmem to the $Sparse variable so checkpatch can
appropriately ignore uses of this attribute too.

Signed-off-by: Joe Perches
Reviewed-by: Ross Zwisler
Acked-by: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
4e5d56bdf checkpatch: fix left brace warning ... Browse Code »

Using checkpatch.pl with Perl 5.22.0 generates the following warning:

Unescaped left brace in regex is deprecated, passed through in regex;

This patch fixes the warnings by escaping occurrences of the left brace
inside the regular expression.

Signed-off-by: Eddie Kovsky
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Eddie Kovsky
2015-09-11 04:29:01 +0800
bf4daf12a checkpatch: avoid some commit message long line warnings ... Browse Code »

Fixes: and Link: lines may exceed 75 chars in the commit log.

So too can stack dump and dmesg lines and lines that seem
like filenames.

And Fixes: lines don't need to have a "commit" prefix before the
commit id.

Add exceptions for these types of lines.

Signed-off-by: Joe Perches
Reported-by: Paul Bolle
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
6e3007574 checkpatch: emit an error on formats with 0x%<decimal> ... Browse Code »

Using 0x%d is wrong. Emit a message when it happens.

Miscellanea:

Improve the %Lu warning to match formats like %16Lu.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
7bd7e483c checkpatch: make --strict the default for drivers/staging files and patches ... Browse Code »

Making --strict the default for staging may help some people submit
patches without obvious defects.

Signed-off-by: Joe Perches
Cc: Dan Carpenter
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
86406b1cb checkpatch: always check block comment styles ... Browse Code »

Some of the block comment tests that are used only for networking are
appropriate for all patches.

For example, these styles are not encouraged:

/*
block comment without introductory *
*/
and
/*
* block comment with line terminating */

Remove the networking specific test and add comments.

There are some infrequent false positives where code is lazily
commented out using /* and */ rather than using #if 0/#endif blocks
like:
/* case foo:
case bar: */
case baz:

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
7d3a9f673 checkpatch: report the right line # when using --emacs and --file ... Browse Code »

commit 34d8815f9512 ("checkpatch: add --showfile to allow input via pipe
to show filenames") broke the --emacs with --file option.

Fix it.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
100425dee checkpatch: add some <foo>_destroy functions to NEEDLESS_IF tests ... Browse Code »

Sergey Senozhatsky has modified several destroy functions that can
now be called with NULL values.

- kmem_cache_destroy()
- mempool_destroy()
- dma_pool_destroy()

Update checkpatch to warn when those functions are preceded by an if.

Update checkpatch to --fix all the calls too only when the code style
form is using leading tabs.

from:
if (foo)
(foo);
to:
(foo);

Signed-off-by: Joe Perches
Tested-by: Sergey Senozhatsky
Cc: David Rientjes
Cc: Julia Lawall
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
3e838b6c4 checkpatch: Allow longer declaration macros ... Browse Code »

Some really long declaration macros exist.

For instance;
DEFINE_DMA_BUF_EXPORT_INFO(exp_info);
and
DECLARE_DM_KCOPYD_THROTTLE_WITH_MODULE_PARM(name, description)

Increase the limit from 2 words to 6 after DECLARE/DEFINE uses.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
9f5af480f checkpatch: improve SUSPECT_CODE_INDENT test ... Browse Code »

Many lines exist like

if (foo)
bar;

where the tabbed indentation of the branch is not one more than the "if"
line above it.

checkpatch should emit a warning on those lines.

Miscellenea:

o Remove comments from branch blocks
o Skip blank lines in block

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
9d3e3c705 checkpatch: add warning on BUG/BUG_ON use ... Browse Code »

Using BUG/BUG_ON crashes the kernel and is just unfriendly.

Enable code that emits a warning on BUG/BUG_ON use.

Make the code emit the message at WARNING level when scanning a patch and
at CHECK level when scanning files so that script users don't feel an
obligation to fix code that might be above their pay grade.

Signed-off-by: Joe Perches
Reported-by: Geert Uytterhoeven
Tested-by: Geert Uytterhoeven
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
fe043ea12 checkpatch: warn on bare SHA-1 commit IDs in commit logs ... Browse Code »

Commit IDs should have commit descriptions too. Warn when a 12 to 40 byte
SHA-1 is used in commit logs.

Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
6b4a35fc1 lib/test_kasan.c: make kmalloc_oob_krealloc_less more correctly ... Browse Code »

In kmalloc_oob_krealloc_less, I think it is better to test
the size2 boundary.

If we do not call krealloc, the access of position size1 will still cause
out-of-bounds and access of position size2 does not. After call krealloc,
the access of position size2 cause out-of-bounds. So using size2 is more
correct.

Signed-off-by: Wang Long
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wang Long
2015-09-11 04:29:01 +0800
9789d8e0c lib/test_kasan.c: fix a typo ... Browse Code »

Signed-off-by: Wang Long
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wang Long
2015-09-11 04:29:01 +0800
b40bdb7fb lib/string_helpers: rename "esc" arg to "only" ... Browse Code »

To further clarify the purpose of the "esc" argument, rename it to "only"
to reflect that it is a limit, not a list of additional characters to
escape.

Signed-off-by: Kees Cook
Suggested-by: Rasmus Villemoes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2015-09-11 04:29:01 +0800
d89a3f733 lib/string_helpers: clarify esc arg in string_escape_mem ... Browse Code »

The esc argument is used to reduce which characters will be escaped. For
example, using " " with ESCAPE_SPACE will not produce any escaped spaces.

Signed-off-by: Kees Cook
Cc: Andy Shevchenko
Cc: Rasmus Villemoes
Cc: Mathias Krause
Cc: James Bottomley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Kees Cook
2015-09-11 04:29:01 +0800
cdf17449a hexdump: do not print debug dumps for !CONFIG_DEBUG ... Browse Code »

print_hex_dump_debug() is likely supposed to be analogous to pr_debug() or
dev_dbg() & friends. Currently it will adhere to dynamic debug, but will
not stub out prints if CONFIG_DEBUG is not set. Let's make it do the
right thing, because I am tired of having my dmesg buffer full of hex
dumps on production systems.

Signed-off-by: Linus Walleij
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Linus Walleij
2015-09-11 04:29:01 +0800
9bf98f168 lib/bitmap.c: bitmap_parselist can accept string with whitespaces on head or tail ... Browse Code »

In __bitmap_parselist we can accept whitespaces on head or tail during
every parsing procedure. If input has valid ranges, there is no reason to
reject the user.

For example, bitmap_parselist(" 1-3, 5, ", &mask, nmaskbits). After
separating the string, we get " 1-3", " 5", and " ". It's possible and
reasonable to accept such string as long as the parsing result is correct.

Signed-off-by: Pan Xinhui
Cc: Yury Norov
Cc: Chris Metcalf
Cc: Rasmus Villemoes
Cc: Sudeep Holla
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pan Xinhui
2015-09-11 04:29:01 +0800
d9282cb66 lib/bitmap.c: fix a special string handling bug in __bitmap_parselist ... Browse Code »

If string end with '-', for exapmle, bitmap_parselist("1,0-",&mask,
nmaskbits), It is not in a valid pattern, so add a check after loop.
Return -EINVAL on such condition.

Signed-off-by: Pan Xinhui
Cc: Yury Norov
Cc: Chris Metcalf
Cc: Rasmus Villemoes
Cc: Sudeep Holla
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pan Xinhui
2015-09-11 04:29:01 +0800
d21c3d4d1 lib/bitmap.c: correct a code style and do some, optimization ... Browse Code »

We can avoid in-loop incrementation of ndigits. Save current totaldigits
to ndigits before loop, and check ndigits against totaldigits after the
loop.

Signed-off-by: Pan Xinhui
Cc: Yury Norov
Cc: Chris Metcalf
Cc: Rasmus Villemoes
Cc: Sudeep Holla
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Pan Xinhui
2015-09-11 04:29:01 +0800
774636e19 proc: convert to kstrto*()/kstrto*_from_user() ... Browse Code »

Convert from manual allocation/copy_from_user/... to kstrto*() family
which were designed for exactly that.

One case can not be converted to kstrto*_from_user() to make code even
more simpler because of whitespace stripping, oh well...

Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2015-09-11 04:29:01 +0800
2d2e4715a kstrto*: accept "-0" for signed conversion ... Browse Code »

strtol(3) et al accept "-0", so should we.

Signed-off-by: Alexey Dobriyan
Cc: David Howells
Cc: Jan Kara
Cc: Joel Becker
Cc: Mark Fasheh
Cc: Theodore Ts'o
Cc: Rasmus Villemoes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2015-09-11 04:29:01 +0800
3cdea4d71 MAINTAINERS/CREDITS: mark MaxRAID as Orphan, move Anil Ravindranath to CREDITS ... Browse Code »

Anil's email address bounces and he hasn't had a signoff
in over 5 years.

Signed-off-by: Joe Perches
Cc: James Bottomley
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
515a9adce include/linux/printk.h: include pr_fmt in pr_debug_ratelimited ... Browse Code »

The other two implementations of pr_debug_ratelimited include pr_fmt,
along with every other pr_* function. But pr_debug_ratelimited forgot to
add it with the CONFIG_DYNAMIC_DEBUG implementation.

This patch unifies the behavior.

Signed-off-by: Jason A. Donenfeld
Cc: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jason A. Donenfeld
2015-09-11 04:29:01 +0800
52aa8536f kernel/cred.c: remove unnecessary kdebug atomic reads ... Browse Code »

Commit e0e817392b9a ("CRED: Add some configurable debugging [try #6]")
added the kdebug mechanism to this file back in 2009.

The kdebug macro calls no_printk which always evaluates arguments.

Most of the kdebug uses have an unnecessary call of
atomic_read(&cred->usage)

Make the kdebug macro do nothing by defining it with
do { if (0) no_printk(...); } while (0)
when not enabled.

$ size kernel/cred.o* (defconfig x86-64)
text data bss dec hex filename
2748 336 8 3092 c14 kernel/cred.o.new
2788 336 8 3132 c3c kernel/cred.o.old

Miscellanea:
o Neaten the #define kdebug macros while there

Signed-off-by: Joe Perches
Cc: David Howells
Cc: James Morris
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2015-09-11 04:29:01 +0800
2307e1a3c kernel/extable.c: remove duplicated include ... Browse Code »

Signed-off-by: Wei Yongjun
Acked-by: Steven Rostedt
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wei Yongjun
2015-09-11 04:29:01 +0800
8b839635e include/linux/poison.h: remove not-used poison pointer macros ... Browse Code »

Signed-off-by: Vasily Kulikov
Cc: Solar Designer
Cc: Thomas Gleixner
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vasily Kulikov
2015-09-11 04:29:01 +0800
8a5e5e02f include/linux/poison.h: fix LIST_POISON{1,2} offset ... Browse Code »

Poison pointer values should be small enough to find a room in
non-mmap'able/hardly-mmap'able space. E.g. on x86 "poison pointer space"
is located starting from 0x0. Given unprivileged users cannot mmap
anything below mmap_min_addr, it should be safe to use poison pointers
lower than mmap_min_addr.

The current poison pointer values of LIST_POISON{1,2} might be too big for
mmap_min_addr values equal or less than 1 MB (common case, e.g. Ubuntu
uses only 0x10000). There is little point to use such a big value given
the "poison pointer space" below 1 MB is not yet exhausted. Changing it
to a smaller value solves the problem for small mmap_min_addr setups.

The values are suggested by Solar Designer:
http://www.openwall.com/lists/oss-security/2015/05/02/6

Signed-off-by: Vasily Kulikov
Cc: Solar Designer
Cc: Thomas Gleixner
Cc: "Kirill A. Shutemov"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vasily Kulikov
2015-09-11 04:29:01 +0800
ecf1a3dff proc: change proc_subdir_lock to a rwlock ... Browse Code »

The proc_subdir_lock spinlock is used to allow only one task to make
change to the proc directory structure as well as looking up information
in it. However, the information lookup part can actually be entered by
more than one task as the pde_get() and pde_put() reference count update
calls in the critical sections are atomic increment and decrement
respectively and so are safe with concurrent updates.

The x86 architecture has already used qrwlock which is fair and other
architectures like ARM are in the process of switching to qrwlock. So
unfairness shouldn't be a concern in that conversion.

This patch changed the proc_subdir_lock to a rwlock in order to enable
concurrent lookup. The following functions were modified to take a
write lock:
- proc_register()
- remove_proc_entry()
- remove_proc_subtree()

The following functions were modified to take a read lock:
- xlate_proc_name()
- proc_lookup_de()
- proc_readdir_de()

A parallel /proc filesystem search with the "find" command (1000 threads)
was run on a 4-socket Haswell-EX box (144 threads). Before the patch, the
parallel search took about 39s. After the patch, the parallel find took
only 25s, a saving of about 14s.

The micro-benchmark that I used was artificial, but it was used to
reproduce an exit hanging problem that I saw in real application. In
fact, only allow one task to do a lookup seems too limiting to me.

Signed-off-by: Waiman Long
Acked-by: "Eric W. Biederman"
Cc: Alexey Dobriyan
Cc: Nicolas Dichtel
Cc: Al Viro
Cc: Scott J Norton
Cc: Douglas Hatch
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Waiman Long
2015-09-11 04:29:01 +0800
bdb4d100a procfs: always expose /proc/<pid>/map_files/ and make it readable ... Browse Code »

Currently, /proc//map_files/ is restricted to CAP_SYS_ADMIN, and is
only exposed if CONFIG_CHECKPOINT_RESTORE is set.

Each mapped file region gets a symlink in /proc//map_files/
corresponding to the virtual address range at which it is mapped. The
symlinks work like the symlinks in /proc//fd/, so you can follow them
to the backing file even if that backing file has been unlinked.

Currently, files which are mapped, unlinked, and closed are impossible to
stat() from userspace. Exposing /proc//map_files/ closes this
functionality "hole".

Not being able to stat() such files makes noticing and explicitly
accounting for the space they use on the filesystem impossible. You can
work around this by summing up the space used by every file in the
filesystem and subtracting that total from what statfs() tells you, but
that obviously isn't great, and it becomes unworkable once your filesystem
becomes large enough.

This patch moves map_files/ out from behind CONFIG_CHECKPOINT_RESTORE, and
adjusts the permissions enforced on it as follows:

* proc_map_files_lookup()
* proc_map_files_readdir()
* map_files_d_revalidate()

Remove the CAP_SYS_ADMIN restriction, leaving only the current
restriction requiring PTRACE_MODE_READ. The information made
available to userspace by these three functions is already
available in /proc/PID/maps with MODE_READ, so I don't see any
reason to limit them any further (see below for more detail).

* proc_map_files_follow_link()

This stub has been added, and requires that the user have
CAP_SYS_ADMIN in order to follow the links in map_files/,
since there was concern on LKML both about the potential for
bypassing permissions on ancestor directories in the path to
files pointed to, and about what happens with more exotic
memory mappings created by some drivers (ie dma-buf).

In older versions of this patch, I changed every permission check in
the four functions above to enforce MODE_ATTACH instead of MODE_READ.
This was an oversight on my part, and after revisiting the discussion
it seems that nobody was concerned about anything outside of what is
made possible by ->follow_link(). So in this version, I've left the
checks for PTRACE_MODE_READ as-is.

[akpm@linux-foundation.org: catch up with concurrent proc_pid_follow_link() changes]
Signed-off-by: Calvin Owens
Reviewed-by: Kees Cook
Cc: Andy Lutomirski
Cc: Cyrill Gorcunov
Cc: Joe Perches
Cc: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Calvin Owens
2015-09-11 04:29:01 +0800
d3691d2c6 proc: add cond_resched to /proc/kpage* read/write loop ... Browse Code »

Reading/writing a /proc/kpage* file may take long on machines with a lot
of RAM installed.

Signed-off-by: Vladimir Davydov
Suggested-by: Andres Lagar-Cavilla
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2015-09-11 04:29:01 +0800
f074a8f49 proc: export idle flag via kpageflags ... Browse Code »

As noted by Minchan, a benefit of reading idle flag from /proc/kpageflags
is that one can easily filter dirty and/or unevictable pages while
estimating the size of unused memory.

Note that idle flag read from /proc/kpageflags may be stale in case the
page was accessed via a PTE, because it would be too costly to iterate
over all page mappings on each /proc/kpageflags read to provide an
up-to-date value. To make sure the flag is up-to-date one has to read
/sys/kernel/mm/page_idle/bitmap first.

Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2015-09-11 04:29:01 +0800
33c3fc71c mm: introduce idle page tracking ... Browse Code »

Knowing the portion of memory that is not used by a certain application or
memory cgroup (idle memory) can be useful for partitioning the system
efficiently, e.g. by setting memory cgroup limits appropriately.
Currently, the only means to estimate the amount of idle memory provided
by the kernel is /proc/PID/{clear_refs,smaps}: the user can clear the
access bit for all pages mapped to a particular process by writing 1 to
clear_refs, wait for some time, and then count smaps:Referenced. However,
this method has two serious shortcomings:

- it does not count unmapped file pages
- it affects the reclaimer logic

To overcome these drawbacks, this patch introduces two new page flags,
Idle and Young, and a new sysfs file, /sys/kernel/mm/page_idle/bitmap.
A page's Idle flag can only be set from userspace by setting bit in
/sys/kernel/mm/page_idle/bitmap at the offset corresponding to the page,
and it is cleared whenever the page is accessed either through page tables
(it is cleared in page_referenced() in this case) or using the read(2)
system call (mark_page_accessed()). Thus by setting the Idle flag for
pages of a particular workload, which can be found e.g. by reading
/proc/PID/pagemap, waiting for some time to let the workload access its
working set, and then reading the bitmap file, one can estimate the amount
of pages that are not used by the workload.

The Young page flag is used to avoid interference with the memory
reclaimer. A page's Young flag is set whenever the Access bit of a page
table entry pointing to the page is cleared by writing to the bitmap file.
If page_referenced() is called on a Young page, it will add 1 to its
return value, therefore concealing the fact that the Access bit was
cleared.

Note, since there is no room for extra page flags on 32 bit, this feature
uses extended page flags when compiled on 32 bit.

[akpm@linux-foundation.org: fix build]
[akpm@linux-foundation.org: kpageidle requires an MMU]
[akpm@linux-foundation.org: decouple from page-flags rework]
Signed-off-by: Vladimir Davydov
Reviewed-by: Andres Lagar-Cavilla
Cc: Minchan Kim
Cc: Raghavendra K T
Cc: Johannes Weiner
Cc: Michal Hocko
Cc: Greg Thelen
Cc: Michel Lespinasse
Cc: David Rientjes
Cc: Pavel Emelyanov
Cc: Cyrill Gorcunov
Cc: Jonathan Corbet
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2015-09-11 04:29:01 +0800