07 Aug, 2014

40 commits

  • Use kernel.h definition.

    Signed-off-by: Fabian Frederick
    Cc: Alexey Dobriyan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Fabian Frederick
     
  • Complement commit 68aecfb97978 ("lib/string_helpers.c: make arrays
    static") by making the arrays const -- not only pointing to const
    strings. This moves them out of the data section to the r/o data
    section:

    text    data    bss     dec     hex  filename
    1150     176      0    1326     52e  lib/string_helpers.old.o
    1326       0      0    1326     52e  lib/string_helpers.new.o
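
    As a minimal illustration of the difference (hypothetical array names,
    not taken from the patch):

    /* Array of pointers to const strings -- the pointers themselves are
     * writable, so the array lands in the .data section: */
    static const char *units_old[] = { "B", "KiB", "MiB" };

    /* Const array of pointers to const strings -- everything is read-only
     * and ends up in .rodata: */
    static const char *const units_new[] = { "B", "KiB", "MiB" };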

    Signed-off-by: Mathias Krause
    Cc: James Bottomley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mathias Krause
     
  • For modern filesystems such as btrfs, t/p/e size-level operations are
    common. Add t/p/e size-unit parsing to memparse().
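
    A hedged usage sketch of the new suffixes (the wrapper below is
    invented; memparse() itself lives in lib/cmdline.c):

    static unsigned long long parse_size_arg(const char *arg)
    {
            char *end;

            /* "2T" now parses to 2 * 2^40 bytes; previously only the
             * k/m/g suffixes were understood. */
            return memparse(arg, &end);
    }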

    Signed-off-by: Gui Hecheng
    Acked-by: David Rientjes
    Reviewed-by: Satoru Takeuchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Gui Hecheng
     
  • The function may be useful for other drivers, so export it. (Suggested
    by Tejun Heo.)

    Note that I inverted the return value of glob_match; returning true on
    match seemed to make more sense.
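
    A brief usage sketch, assuming the pattern-first argument order (the
    helper and the quirk pattern are invented):

    /* With the inverted return value, glob_match() returns true on match. */
    static bool model_is_blacklisted(const char *model)
    {
            return glob_match("WDC*", model);
    }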

    Signed-off-by: George Spelvin
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • This was useful during development, and is retained for future
    regression testing.

    GCC appears to have no way to place string literals in a particular
    section; adding __initconst to a char pointer leaves the string itself
    in the default string section, where it will not be thrown away after
    module load.

    Thus all string constants are kept in explicitly declared and named
    arrays. Sorry this makes printk a bit harder to read. At least the
    tests are more compact.
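
    A hedged sketch of the pattern described (identifiers invented for
    illustration):

    /* A bare string literal can't be given a section, but a named array
     * initialized with it can, so it is discarded after init: */
    static const char test_pat[] __initconst = "a*[bc]d?";
    static const char test_str[] __initconst = "axxbdy";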

    Signed-off-by: George Spelvin
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • This is a helper function from drivers/ata/libata_core.c, where it is
    used to blacklist particular device models. It's being moved to lib/ so
    other drivers may use it for the same purpose.

    This implementation is non-recursive, so it is safe for the kernel stack.

    [akpm@linux-foundation.org: fix sparse warning]
    Signed-off-by: George Spelvin
    Cc: Randy Dunlap
    Cc: Tejun Heo
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    George Spelvin
     
  • Clean up unused `if 0'-ed functions, which have been dead since 2006
    (commits 87c2ce3b9305 ("lib/zlib*: cleanups") by Adrian Bunk and
    4f3865fb57a0 ("zlib_inflate: Upgrade library code to a recent version")
    by Richard Purdie):

    - zlib_deflateSetDictionary
    - zlib_deflateParams
    - zlib_deflateCopy
    - zlib_inflateSync
    - zlib_syncsearch
    - zlib_inflateSetDictionary
    - zlib_inflatePrime

    Signed-off-by: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • The name was modified from hlist_add_after() to hlist_add_behind() when
    adjusting the order of arguments to match that of klist_add_after().
    This is necessary so that old code using the previous argument order
    fails to compile instead of silently misbehaving.

    Make klist follow this naming scheme for consistency.

    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • All other add functions for lists have the new item as first argument
    and the position where it is added as second argument. This was changed
    for no good reason in this function and makes using it unnecessarily
    confusing.

    The name was changed to hlist_add_behind() to cause unconverted code to
    generate a compile error instead of using the wrong parameter order.
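
    For illustration, assuming the calling convention described above (the
    wrapper and node names are hypothetical):

    static void insert_after(struct hlist_node *pos, struct hlist_node *new_node)
    {
            /* Old: hlist_add_after(pos, new_node) -- position first.
             * New: the new node comes first, like other list add helpers. */
            hlist_add_behind(new_node, pos);
    }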

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Acked-by: Jeff Kirsher [intel driver bits]
    Cc: Hugh Dickins
    Cc: Christoph Hellwig
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • The argument names for hlist_add_after() are poorly chosen because they
    look the same as the ones for hlist_add_before() but have to be used
    differently.

    hlist_add_after_rcu() has made a better choice.

    Signed-off-by: Ken Helias
    Cc: "Paul E. McKenney"
    Cc: Christoph Hellwig
    Cc: Hugh Dickins
    Cc: Jeff Kirsher
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ken Helias
     
  • Fix coccinelle warnings.

    Signed-off-by: Neil Zhang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Neil Zhang
     
  • We need interrupts disabled when calling console_trylock_for_printk()
    only so that the cpu id we pass to can_use_console() remains valid (for
    other things console_sem provides all the exclusion we need and
    deadlocks on console_sem due to interrupts are impossible because we use
    down_trylock()). However if we are rescheduled, we are guaranteed to
    run on an online cpu so we can easily just get the cpu id in
    can_use_console().

    We can lose a bit of performance when we enable interrupts in
    vprintk_emit() and then disable them again in console_unlock() but OTOH
    it can somewhat reduce interrupt latency caused by console_unlock().

    We differ from (reverted) commit 939f04bec1a4 in that we avoid calling
    console_unlock() from vprintk_emit() with lockdep enabled as that has
    unveiled quite some bugs leading to system freezes during boot (e.g.
    https://lkml.org/lkml/2014/5/30/242,
    https://lkml.org/lkml/2014/6/28/521).

    Signed-off-by: Jan Kara
    Tested-by: Andreas Bombe
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • Some small cleanups to kernel/printk/printk.c. None of them should
    cause any change in behavior.

    - When CONFIG_PRINTK is defined, parenthesize the value of LOG_LINE_MAX.
    - When CONFIG_PRINTK is *not* defined, there is an extra LOG_LINE_MAX
    definition; delete it.
    - Pull an assignment out of a conditional expression in console_setup().
    - Use isdigit() in console_setup() rather than open coding it.
    - In update_console_cmdline(), drop a NUL-termination assignment;
    the strlcpy() call that precedes it guarantees it's not needed.
    - Simplify some logic in printk_timed_ratelimit().
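
    Two of these changes, sketched side by side in a simplified stand-in
    for console_setup() (not the actual code):

    static void parse_console_opt(char *str)
    {
            char *options;

            /* assignment pulled out of the conditional expression */
            options = strchr(str, ',');
            if (options)
                    *(options++) = '\0';

            /* isdigit() instead of an open-coded '0'..'9' comparison */
            if (isdigit(str[0]))
                    pr_info("console index given\n");
    }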

    Signed-off-by: Alex Elder
    Reviewed-by: Petr Mladek
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • Use the IS_ENABLED() macro rather than #ifdef blocks to set certain
    global values.
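
    For example (CONFIG_PRINTK is real; the variable name is illustrative):

    /* Instead of:
     *     #ifdef CONFIG_PRINTK
     *     static int printk_enabled = 1;
     *     #else
     *     static int printk_enabled;
     *     #endif
     * the value can be set in plain C:
     */
    static int printk_enabled = IS_ENABLED(CONFIG_PRINTK);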

    Signed-off-by: Alex Elder
    Acked-by: Borislav Petkov
    Reviewed-by: Petr Mladek
    Cc: Andi Kleen
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • Fix a few comments that don't accurately describe their corresponding
    code, and fix some minor typographical errors.

    Signed-off-by: Alex Elder
    Reviewed-by: Petr Mladek
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • Commit a8fe19ebfbfd ("kernel/printk: use symbolic defines for console
    loglevels") makes consistent use of symbolic values for printk() log
    levels.

    The naming scheme used is different from the one used for
    DEFAULT_MESSAGE_LOGLEVEL though. Change that symbol name to be
    MESSAGE_LOGLEVEL_DEFAULT for consistency. And because the value of that
    symbol comes from a similarly-named config option, rename
    CONFIG_DEFAULT_MESSAGE_LOGLEVEL as well.

    Signed-off-by: Alex Elder
    Cc: Andi Kleen
    Cc: Borislav Petkov
    Cc: Jan Kara
    Cc: John Stultz
    Cc: Petr Mladek
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • In do_syslog() there's a path used by kmsg_poll() and kmsg_read() that
    only needs to know whether there's any data available to read (and not
    its size). These callers only check for non-zero return. As a
    shortcut, do_syslog() returns the difference between what has been
    logged and what has been "seen."

    The comments say that the "count of records" should be returned but it's
    not. Instead it returns (log_next_idx - syslog_idx), which is a
    difference between buffer offsets--and the result could be negative.

    The behavior is the same (it'll be zero or not in the same cases), but
    the count of records is more meaningful and it matches what the comments
    say. So change the code to return that.

    Signed-off-by: Alex Elder
    Cc: Petr Mladek
    Cc: Jan Kara
    Cc: Joe Perches
    Cc: John Stultz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alex Elder
     
  • The default size of the ring buffer is too small for machines with a
    large number of CPUs under heavy load. What ends up happening when
    debugging is that the ring buffer wraps around and chews up old
    messages, making debugging impossible unless the size is passed as a
    kernel parameter. An idle system upon boot up will on average spew out
    only about one or two extra lines, but where this really matters is
    under heavy load, and that will vary widely depending on the system and
    environment.

    There are mechanisms to help increase the kernel ring buffer for tracing
    through debugfs, and those interfaces even allow growing the kernel ring
    buffer per CPU. We also have a static value which can be passed upon
    boot. Relying on debugfs, however, is not ideal for production, and the
    value passed upon bootup can only be used *after* an issue has crept
    up. Instead of being reactive, this adds a proactive
    measure which lets you scale the amount of contributions you'd expect to
    the kernel ring buffer under load by each CPU in the worst case
    scenario.

    We use num_possible_cpus() to avoid the complexities that dynamically
    changing the ring buffer size at run time would introduce;
    num_possible_cpus() gives us the upper limit on the possible number of
    CPUs, so we avoid having to deal with hotplugging CPUs on and off.
    This introduces the kernel configuration option LOG_CPU_MAX_BUF_SHIFT
    which is used to specify the maximum amount of contributions to the
    kernel ring buffer in the worst case before the kernel ring buffer flips
    over; the size is specified as a power of 2. The total amount of
    contributions made by the other CPUs must be greater than half of the
    default kernel ring buffer size (1 << LOG_BUF_SHIFT bytes) to trigger
    an increase upon bootup. The kernel ring buffer is increased to the
    next power of two that would fit the required minimum kernel ring buffer
    size plus the additional CPU contribution. For example if LOG_BUF_SHIFT
    is 18 (256 KB) you'd require at least 128 KB contributions by other CPUs
    in order to trigger an increase of the kernel ring buffer. With a
    LOG_CPU_MAX_BUF_SHIFT of 12 (4 KB) you'd require over 64 possible CPUs
    to trigger an increase. If you had 128 possible CPUs the minimum
    required kernel ring buffer size bumps to:

    ((1 << 18) + ((128 - 1) * (1 << 12))) / 1024 = 764 KB

    Since we require the ring buffer to be a power of two the new required
    size would be 1024 KB.

    These CPU contributions are ignored when the "log_buf_len" kernel
    parameter is used as it forces the exact size of the ring buffer to an
    expected power of two value.
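
    A standalone worked sketch of the sizing rule above (plain userspace C
    with a local roundup helper, not the in-kernel code):

    #include <stdio.h>

    static unsigned long roundup_pow_of_two(unsigned long x)
    {
            unsigned long r = 1;

            while (r < x)
                    r <<= 1;
            return r;
    }

    int main(void)
    {
            unsigned long log_buf_len = 1UL << 18;  /* LOG_BUF_SHIFT = 18         */
            unsigned long per_cpu     = 1UL << 12;  /* LOG_CPU_MAX_BUF_SHIFT = 12 */
            unsigned long cpus        = 128;
            unsigned long cpu_extra   = (cpus - 1) * per_cpu;

            /* Grow only when the CPU contribution exceeds half the default. */
            if (cpu_extra > log_buf_len / 2)
                    log_buf_len = roundup_pow_of_two(log_buf_len + cpu_extra);

            printf("%lu KB\n", log_buf_len / 1024); /* prints "1024 KB" */
            return 0;
    }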

    [pmladek@suse.cz: fix build]
    Signed-off-by: Luis R. Rodriguez
    Signed-off-by: Petr Mladek
    Tested-by: Davidlohr Bueso
    Tested-by: Petr Mladek
    Reviewed-by: Davidlohr Bueso
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Signed-off-by: Luis R. Rodriguez
    Suggested-by: Davidlohr Bueso
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • In practice, sizing the kernel ring buffer as a power of 2 is purely
    historical and not a requirement, especially now that we have LOG_ALIGN
    and use it for both static and dynamic allocations. It could have
    helped with implicit alignment back in the day: even the dynamically
    sized ring buffer was guaranteed to be aligned so long as
    CONFIG_LOG_BUF_SHIFT was set to produce an architecture-aligned
    __LOG_BUF_LEN, since log_buf_len=n would be allowed only if it was
    > __LOG_BUF_LEN and we always ended up rounding log_buf_len=n up to the
    next power of 2 with roundup_pow_of_two(), so any such multiple of 2
    should also be architecture aligned. These assumptions of course relied
    heavily on CONFIG_LOG_BUF_SHIFT producing an aligned value, but users
    can always change this.

    We now have precise alignment requirements set for the log buffer size
    for both static and dynamic allocations, but let's keep the old
    practice of using powers of 2 for its size, which gives easily
    predictable, scalable values and helps the allocators for dynamic
    allocations. We'll reuse this later, so move it into a helper.

    Signed-off-by: Luis R. Rodriguez
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • We have to consider alignment for the ring buffer both for the default
    static size and for the dynamic allocation made when the log_buf_len=n
    kernel parameter is passed to set the size to something larger than the
    default set by the architecture through CONFIG_LOG_BUF_SHIFT.

    The default static kernel ring buffer can be aligned properly if
    architectures set CONFIG_LOG_BUF_SHIFT properly; however, since we
    provide ranges for the size, even a sensible, aligned
    CONFIG_LOG_BUF_SHIFT value can be reduced to a non-aligned one. Commit
    6ebb017de9 ("printk: Fix alignment of buf causing crash on ARM EABI")
    by Andrew Lunn ensures the static buffer is always aligned, with the
    alignment decided by the compiler using __alignof__(struct log).

    When log_buf_len=n is used we allocate the ring buffer dynamically.
    Dynamic allocation varies: for the early allocation called before
    setup_arch(), memblock_virt_alloc() requests page alignment, while for
    the default kernel allocation memblock_virt_alloc_nopanic() requests no
    special alignment, which in turn ends up aligning the allocation to
    SMP_CACHE_BYTES, i.e. L1 cache aligned.

    Since we already know the required alignment for the kernel ring
    buffer, we can do better and explicitly request LOG_ALIGN alignment.
    This patch does that, to be safe and to make the dynamic allocation
    alignment explicit.

    Signed-off-by: Luis R. Rodriguez
    Tested-by: Petr Mladek
    Acked-by: Petr Mladek
    Cc: Andrew Lunn
    Cc: Stephen Warren
    Cc: Greg Kroah-Hartman
    Cc: Michal Hocko
    Cc: Petr Mladek
    Cc: Joe Perches
    Cc: Arun KS
    Cc: Kees Cook
    Cc: Davidlohr Bueso
    Cc: Chris Metcalf
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Luis R. Rodriguez
     
  • Signed-off-by: Geoff Levand
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geoff Levand
     
  • The DEFINE_SIMPLE_ATTRIBUTE macro should not end in a ';'. Fix the one
    use in the kernel tree that did not have a semicolon.
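
    For example (foo_get/foo_set are placeholders):

    /* Each use must now supply its own trailing semicolon: */
    DEFINE_SIMPLE_ATTRIBUTE(fops_foo, foo_get, foo_set, "%llu\n");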

    Signed-off-by: Joe Perches
    Acked-by: Guenter Roeck
    Acked-by: Luca Tettamanti
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • We have been chasing a memory corruption bug, which turned out to be
    caused by very old gcc (4.3.4), which happily turned conditional load
    into a non-conditional one, and that broke correctness (the condition
    was met only if lock was held) and corrupted memory.

    This particular problem with that particular code did not happen when
    newer gccs were used. I've brought this up with our gcc folks, as I
    wanted to make sure that this can't really happen again, and it turns
    out it actually can.

    Quoting Martin Jambor :
    "More current GCCs are more careful when it comes to replacing a
    conditional load with a non-conditional one, most notably they check
    that a store happens in each iteration of _a_ loop but they assume
    loops are executed. They also perform a simple check whether the
    store cannot trap which currently passes only for non-const
    variables. A simple testcase demonstrating it on an x86_64 is for
    example the following:

    $ cat cond_store.c

    /* headers needed for mprotect(), error() and errno */
    #include <sys/mman.h>
    #include <errno.h>
    #include <error.h>

    int g_1 = 1;

    int g_2[1024] __attribute__((section ("safe_section"), aligned (4096)));

    int c = 4;

    int __attribute__ ((noinline))
    foo (void)
    {
      int l;
      for (l = 0; (l != 4); l++) {
        if (g_1)
          return l;
        for (g_2[0] = 0; (g_2[0] >= 26); ++g_2[0])
          ;
      }
      return 2;
    }

    int main (int argc, char* argv[])
    {
      if (mprotect (g_2, sizeof(g_2), PROT_READ) == -1)
        {
          int e = errno;
          error (e, e, "mprotect error %i", e);
        }
      foo ();
      __builtin_printf("OK\n");
      return 0;
    }
    /* EOF */
    $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=0
    $ ./a.out
    OK
    $ ~/gcc/trunk/inst/bin/gcc cond_store.c -O2 --param allow-store-data-races=1
    $ ./a.out
    Segmentation fault

    The testcase fails the same at least with 4.9, 4.8 and 4.7. Therefore
    I would suggest building kernels with this parameter set to zero. I
    also agree with Jikos that the default should be changed for -O2. I
    have run most of the SPEC 2k6 CPU benchmarks (gamess and dealII
    failed, at -O2, not sure why) compiled with and without this option
    and did not see any real difference between respective run-times"

    Hopefully the default will be changed in newer gccs, but let's force it
    for kernel builds so that we are on the safe side even when older gccs
    are used.

    The code in question was an out-of-tree printk-in-NMI (yeah, surprise
    surprise, once again) patch written by Petr Mladek; let me quote his
    comment from our internal bugzilla:

    "I have spent few days investigating inconsistent state of kernel ring buffer.
    It went out that it was caused by speculative store generated by
    gcc-4.3.4.

    The problem is in assembly generated for make_free_space(). The functions is
    called the following way:

    + vprintk_emit();
    + log = MAIN_LOG; // with logbuf_lock
    or
    log = NMI_LOG; // with nmi_logbuf_lock
    cont_add(log, ...);
    + cont_flush(log, ...);
    + log_store(log, ...);
    + log_make_free_space(log, ...);

    If called with log = NMI_LOG then only the nmi_log_* global variables
    are safe to modify, but the generated code also stores into the
    (main_)log_* global variables:

    :
    55 push %rbp
    89 f6 mov %esi,%esi

    48 8b 05 03 99 51 01 mov 0x1519903(%rip),%rax # ffffffff82620868
    44 8b 1d ec 98 51 01 mov 0x15198ec(%rip),%r11d # ffffffff82620858
    8b 35 36 60 14 01 mov 0x1146036(%rip),%esi # ffffffff8224cfa8
    44 8b 35 33 60 14 01 mov 0x1146033(%rip),%r14d # ffffffff8224cfac
    4c 8b 2d d0 98 51 01 mov 0x15198d0(%rip),%r13 # ffffffff82620850
    4c 8b 25 11 61 14 01 mov 0x1146111(%rip),%r12 # ffffffff8224d098
    49 89 c2 mov %rax,%r10
    48 21 c2 and %rax,%rdx
    48 8b 1d 0c 99 55 01 mov 0x155990c(%rip),%rbx # ffffffff826608a0
    49 c1 ea 20 shr $0x20,%r10
    48 89 55 d0 mov %rdx,-0x30(%rbp)
    44 29 de sub %r11d,%esi
    45 29 d6 sub %r10d,%r14d
    4c 8b 0d 97 98 51 01 mov 0x1519897(%rip),%r9 # ffffffff82620840
    eb 7e jmp ffffffff81107029
    [...]
    85 ff test %edi,%edi # edi = 1 for NMI_LOG
    4c 89 e8 mov %r13,%rax
    4c 89 ca mov %r9,%rdx
    74 0a je ffffffff8110703d
    8b 15 27 98 51 01 mov 0x1519827(%rip),%edx # ffffffff82620860
    48 8b 45 d0 mov -0x30(%rbp),%rax
    48 39 c2 cmp %rax,%rdx # end of loop
    0f 84 da 00 00 00 je ffffffff81107120
    [...]
    85 ff test %edi,%edi # edi = 1 for NMI_LOG
    4c 89 0d 17 97 51 01 mov %r9,0x1519717(%rip) # ffffffff82620840
    ^^^^^^^^^^^^^^^^^^^^^^^^^^
    KABOOOM
    74 35 je ffffffff81107160

    It stores log_first_seq when edi == NMI_LOG. These instructions are
    also used when edi == MAIN_LOG, but the store is done speculatively
    before the condition is decided. It is unsafe because we do not have
    "logbuf_lock" in NMI context and some other process might modify
    "log_first_seq" in parallel"

    I believe that the best course of action is both:

    - building the kernel (and anything multi-threaded, I guess) with that
      optimization turned off
    - persuading the gcc folks to change the default for future releases

    Signed-off-by: Jiri Kosina
    Cc: Martin Jambor
    Cc: Petr Mladek
    Cc: Linus Torvalds
    Cc: Paul E. McKenney
    Cc: Peter Zijlstra
    Cc: Marek Polacek
    Cc: Jakub Jelinek
    Cc: Steven Noonan
    Cc: Richard Biener
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiri Kosina
     
  • Change zswap to use the zpool api instead of directly using zbud. Add a
    boot-time param to allow selecting which zpool implementation to use,
    with zbud as the default.

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Update zbud and zsmalloc to implement the zpool api.

    [fengguang.wu@intel.com: make functions static]
    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Add zpool api.

    zpool provides an interface for memory storage, typically of compressed
    memory. Users can select what backend to use; currently the only
    implementations are zbud, a low density implementation with up to two
    compressed pages per storage page, and zsmalloc, a higher density
    implementation with multiple compressed pages per storage page.
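
    A rough consumer sketch based on the description above (signatures
    approximated; consult include/linux/zpool.h for the real API):

    static int zpool_store_example(const void *src, size_t len)
    {
            struct zpool *pool;
            unsigned long handle;
            void *dst;

            pool = zpool_create_pool("zbud", GFP_KERNEL, NULL);
            if (!pool)
                    return -ENOMEM;

            if (zpool_malloc(pool, len, GFP_KERNEL, &handle)) {
                    zpool_destroy_pool(pool);
                    return -ENOMEM;
            }

            dst = zpool_map_handle(pool, handle, ZPOOL_MM_WO);
            memcpy(dst, src, len);
            zpool_unmap_handle(pool, handle);

            zpool_free(pool, handle);
            zpool_destroy_pool(pool);
            return 0;
    }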

    Signed-off-by: Dan Streetman
    Tested-by: Seth Jennings
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Change the type of the zbud_alloc() size param from unsigned int to
    size_t.

    Technically, this should not make any difference, as the zbud
    implementation already restricts the size to well within either type's
    limits; but as zsmalloc (and kmalloc) use size_t, and zpool will use
    size_t, this brings the size parameter type in line with zsmalloc/zpool.

    Signed-off-by: Dan Streetman
    Acked-by: Seth Jennings
    Tested-by: Seth Jennings
    Cc: Weijie Yang
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Currently, we use a rwlock tb_lock to protect concurrent access to the
    whole zram meta table. However, according to the actual access model,
    there is only a small chance for an upper-layer user to access the same
    table[index], so the current lock granularity is too big.

    The idea of optimization is to change the lock granularity from whole
    meta table to per table entry (table -> table[index]), so that we can
    protect concurrent access to the same table[index], meanwhile allow the
    maximum concurrency.

    With this in mind, several kinds of locks which could be used as a
    per-entry lock were tested and compared:

    Test environment:
    x86-64 Intel Core2 Q8400, system memory 4GB, Ubuntu 12.04,
    kernel v3.15.0-rc3 as base, zram with 4 max_comp_streams LZO.

    iozone test:
    iozone -t 4 -R -r 16K -s 200M -I +Z
    (1GB zram with ext4 filesystem, take the average of 10 tests, KB/s)

    Test            base       CAS        spinlock   rwlock     bit_spinlock
    -------------------------------------------------------------------------
    Initial write   1381094    1425435    1422860    1423075    1421521
    Rewrite         1529479    1641199    1668762    1672855    1654910
    Read            8468009    11324979   11305569   11117273   10997202
    Re-read         8467476    11260914   11248059   11145336   10906486
    Reverse Read    6821393    8106334    8282174    8279195    8109186
    Stride read     7191093    8994306    9153982    8961224    9004434
    Random read     7156353    8957932    9167098    8980465    8940476
    Mixed workload  4172747    5680814    5927825    5489578    5972253
    Random write    1483044    1605588    1594329    1600453    1596010
    Pwrite          1276644    1303108    1311612    1314228    1300960
    Pread           4324337    4632869    4618386    4457870    4500166

    To increase the chance of accessing the same table[index] concurrently,
    set zram to a small disksize (10MB) and let the threads run with a
    large loop count.

    fio test:
    fio --bs=32k --randrepeat=1 --randseed=100 --refill_buffers
    --scramble_buffers=1 --direct=1 --loops=3000 --numjobs=4
    --filename=/dev/zram0 --name=seq-write --rw=write --stonewall
    --name=seq-read --rw=read --stonewall --name=seq-readwrite
    --rw=rw --stonewall --name=rand-readwrite --rw=randrw --stonewall
    (10MB zram raw block device, take the average of 10 tests, KB/s)

    Test        base       CAS        spinlock   rwlock     bit_spinlock
    ---------------------------------------------------------------------
    seq-write    933789     999357    1003298     995961    1001958
    seq-read    5634130    6577930    6380861    6243912    6230006
    seq-rw      1405687    1638117    1640256    1633903    1634459
    rand-rw     1386119    1614664    1617211    1609267    1612471

    All the optimization methods show higher performance than the base;
    however, it is hard to say which method is the most appropriate.

    On the other hand, zram is mostly used on small embedded systems, so we
    don't want to increase the memory footprint.

    This patch picks the bit_spinlock method, packing the object size and
    page flags into an unsigned long table.value so as not to increase the
    memory overhead on either 32-bit or 64-bit systems.

    Finally, even though the different kinds of locks perform differently,
    we can ignore the difference: if zram is used as a swap device, the
    swap subsystem prevents concurrent access to the same swap slot; if
    zram is used as a block device with a filesystem on it, the upper
    filesystem and the page cache mostly prevent concurrent access to the
    same block. So we can ignore the performance differences among the
    locks.
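
    A simplified sketch of the chosen approach (constants and struct layout
    are illustrative, not the exact zram definitions):

    #include <linux/bit_spinlock.h>

    #define ENTRY_SIZE_BITS  24                /* low bits hold the object size */
    #define ENTRY_LOCK_BIT   ENTRY_SIZE_BITS   /* one flag bit doubles as a lock */

    struct table_entry {
            unsigned long handle;
            unsigned long value;               /* size | flags, incl. lock bit */
    };

    static void entry_lock(struct table_entry *e)
    {
            bit_spin_lock(ENTRY_LOCK_BIT, &e->value);
    }

    static void entry_unlock(struct table_entry *e)
    {
            bit_spin_unlock(ENTRY_LOCK_BIT, &e->value);
    }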

    Acked-by: Sergey Senozhatsky
    Reviewed-by: Davidlohr Bueso
    Signed-off-by: Weijie Yang
    Signed-off-by: Minchan Kim
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Some architectures (e.g., hexagon and PowerPC) could use a PAGE_SHIFT
    of 16 or more. In these cases u16 is not sufficiently large to
    represent a compressed page's size, so use size_t.

    Signed-off-by: Minchan Kim
    Reported-by: Weijie Yang
    Acked-by: Sergey Senozhatsky
    Cc: Jerome Marchand
    Cc: Nitin Gupta
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • Drop SECTOR_SIZE define, because it's not used.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Andrew Morton has recently noted that `struct table' actually
    represents a table entry and, thus, should be renamed. Rename it to
    `zram_table_entry'.

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Cc: Nitin Gupta
    Cc: Weijie Yang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • User-visible effect:
    Architectures that choose this method of maintaining cache coherency
    (MIPS and xtensa currently) are able to use high memory on cores with
    aliasing data cache. Without this fix such architectures can not use
    high memory (in case of xtensa it means that at most 128 MBytes of
    physical memory is available).

    The problem:
    VIPT cache with way size larger than MMU page size may suffer from
    aliasing problem: a single physical address accessed via different
    virtual addresses may end up in multiple locations in the cache.
    Virtual mappings of a physical address that always get cached in
    different cache locations are said to have different colors. L1 caching
    hardware usually doesn't handle this situation leaving it up to
    software. Software must avoid this situation as it leads to data
    corruption.

    What can be done:
    One way to handle this is to flush and invalidate data cache every time
    page mapping changes color. The other way is to always map physical
    page at a virtual address with the same color. Low memory pages already
    have this property. Giving architecture a way to control color of high
    memory page mapping allows reusing of existing low memory cache alias
    handling code.

    How this is done with this patch:
    Provide hooks that allow architectures with aliasing cache to align
    mapping address of high pages according to their color. Such
    architectures may enforce similar coloring of low- and high-memory page
    mappings and reuse existing cache management functions to support
    highmem.

    This code is based on the implementation of similar feature for MIPS by
    Leonid Yegoshin.
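
    As a generic illustration of what a "color" is (the numbers and names
    below are examples, not the patch's hook API):

    #define WAY_SIZE   (32 * 1024)              /* e.g. a 32 KB cache way */
    #define PAGE_SZ    4096
    #define NR_COLORS  (WAY_SIZE / PAGE_SZ)     /* 8 possible colors      */

    /* Two virtual addresses alias in the cache only if they have the same
     * color, i.e. the same cache index bits above the page offset. */
    static unsigned long page_color(unsigned long vaddr)
    {
            return (vaddr / PAGE_SZ) % NR_COLORS;
    }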

    Signed-off-by: Max Filippov
    Cc: Leonid Yegoshin
    Cc: Chris Zankel
    Cc: Marc Gauthier
    Cc: David Rientjes
    Cc: Steven Hill
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Max Filippov
     
  • When kernel device drivers or subsystems want to bind their lifespan to
    the lifespan of the mm_struct, they usually use one of the following
    methods:

    1. Manually calling a function in the interested kernel module. The
    function call needs to be placed in mmput. This method was rejected by
    several kernel maintainers.

    2. Registering to the mmu notifier release mechanism.

    The problem with the latter approach is that the mmu_notifier_release
    callback is called from __mmu_notifier_release (called from exit_mmap).
    That function iterates over the list of mmu notifiers and doesn't
    expect the release callback function to remove itself from the list.
    Therefore, the callback function in the kernel module can't release the
    mmu_notifier_object, which is actually the kernel module's object
    itself. As a result, the destruction of the kernel module's object must
    be done in a delayed fashion.

    This patch adds support for this delayed callback, by adding a new
    mmu_notifier_call_srcu function that receives a function ptr and calls
    that function with call_srcu. In that function, the kernel module
    releases its object. To use mmu_notifier_call_srcu, the calling module
    needs to first call a new function called
    mmu_notifier_unregister_no_release that, as its name implies,
    unregisters a notifier without calling its notifier release callback.

    This patch also adds a function that will call barrier_srcu so those
    kernel modules can sync with mmu_notifier.
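
    A hedged sketch of how a module might use the two new helpers (the
    struct layout and callback names are invented):

    struct my_object {
            struct mmu_notifier mn;
            struct rcu_head rcu;
    };

    static void my_object_free(struct rcu_head *rcu)
    {
            kfree(container_of(rcu, struct my_object, rcu));
    }

    static void my_release(struct mmu_notifier *mn, struct mm_struct *mm)
    {
            struct my_object *obj = container_of(mn, struct my_object, mn);

            /* Unregister without re-invoking ->release(), then defer the
             * actual free until after the SRCU grace period. */
            mmu_notifier_unregister_no_release(&obj->mn, mm);
            mmu_notifier_call_srcu(&obj->rcu, my_object_free);
    }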

    Signed-off-by: Peter Zijlstra
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Oded Gabbay
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Zijlstra
     
  • __kmap_atomic_idx is a per-CPU variable. Each CPU can use KM_TYPE_NR
    entries from FIXMAP, i.e. from 0 to KM_TYPE_NR - 1. Allowing
    __kmap_atomic_idx to overshoot to KM_TYPE_NR can mess up the next CPU's
    0th entry, which is a bug. Hence BUG_ON if __kmap_atomic_idx >=
    KM_TYPE_NR.

    Fix the off-by-one in this test.

    Signed-off-by: Chintan Pandya
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chintan Pandya
     
  • Charge reclaim and OOM currently use the charge batch variable, but
    batching is already disabled at that point. To simplify the charge
    logic, the batch variable is reset to the original request size when
    reclaim is entered, so it's functionally equivalent, but misleading.

    Switch reclaim/OOM to nr_pages, which is the original request size.

    Signed-off-by: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • The rarely-executed memory-allocation-failed callback path generates a
    WARN_ON_ONCE() when smp_call_function_single() succeeds. Presumably
    it's supposed to warn on failures.

    Signed-off-by: Sasha Levin
    Cc: Christoph Lameter
    Cc: Gilad Ben-Yossef
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Tejun Heo
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sasha Levin
     
  • This patch changes confusing #ifdef use in __access_remote_vm into
    merely ugly #ifdef use.

    Addresses bug https://bugzilla.kernel.org/show_bug.cgi?id=81651

    Signed-off-by: Rik van Riel
    Reported-by: David Binderman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rik van Riel
     
  • After a VMA is created with the VM_SOFTDIRTY flag set, /proc/pid/pagemap
    should report that the VMA's virtual pages are soft-dirty until
    VM_SOFTDIRTY is cleared (i.e., by the next write of "4" to
    /proc/pid/clear_refs). However, pagemap ignores the VM_SOFTDIRTY flag
    for virtual addresses that fall in PTE holes (i.e., virtual addresses
    that don't have a PMD, PUD, or PGD allocated yet).

    To observe this bug, use mmap to create a VMA large enough such that
    there's a good chance that the VMA will occupy an unused PMD, then test
    the soft-dirty bit on its pages. In practice, I found that a VMA that
    covered a PMD's worth of address space was big enough.

    This patch adds the necessary VMA lookup to the PTE hole callback in
    /proc/pid/pagemap's page walk and sets soft-dirty according to the VMAs'
    VM_SOFTDIRTY flag.
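
    A userspace sketch of the observation described above (pagemap bit 55
    is the soft-dirty bit; error handling trimmed):

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
            long psz = sysconf(_SC_PAGESIZE);
            /* Map enough to make it likely the VMA covers an unused PMD. */
            char *p = mmap(NULL, 1024 * psz, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            int fd = open("/proc/self/pagemap", O_RDONLY);
            uint64_t entry = 0;

            pread(fd, &entry, sizeof(entry),
                  ((uintptr_t)p / psz) * sizeof(entry));
            printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));
            return 0;
    }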

    Signed-off-by: Peter Feiner
    Acked-by: Cyrill Gorcunov
    Cc: Pavel Emelyanov
    Cc: Hugh Dickins
    Acked-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Peter Feiner
     
  • fault_around_bytes can only be changed via debugfs. Let's mark it
    read-mostly.

    Signed-off-by: Kirill A. Shutemov
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Cc: Dave Hansen
    Cc: Andrey Ryabinin
    Cc: Sasha Levin
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov