Eric Lee / smarc-fsl-linux-kernel

03 Aug, 2016

40 commits

a9bfd3321 crc32: use ktime_get_ns() for measurement ... Browse Code »

The crc32 test function measures the elapsed time in nanoseconds, but
uses 'struct timespec' for that. We want to remove timespec from the
kernel for y2038 compatibility, and ktime_get_ns() also helps make the
code simpler here.

It is also slightly better to use monontonic time, as we are only
interested in the time difference.

Link: http://lkml.kernel.org/r/20160617143932.3289626-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Cc: "David S . Miller"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2016-08-03 07:35:08 +0800
f003a1f18 lib/iommu-helper: skip to next segment ... Browse Code »

When a large enough area in the iommu bitmap is found but would span a
boundary we continue the search starting from the next bit position.
For large allocations this can lead to several useless invocations of
bitmap_find_next_zero_area() and iommu_is_span_boundary().

Continue the search from the start of the next segment (which is the
next bit position such that we'll not cross the same segment boundary
again).

Link: http://lkml.kernel.org/r/alpine.LFD.2.20.1606081910070.3211@schleppi
Signed-off-by: Sebastian Ott
Reviewed-by: Gerald Schaefer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sebastian Ott
2016-08-03 07:35:07 +0800
4cad35a7c get_maintainer.pl: reduce need for command-line option -f ... Browse Code »

If a vcs is used, look to see if the vcs tracks the file specified and
so the -f option becomes optional.

Link: http://lkml.kernel.org/r/7c86a8df0d48770c45778a43b6b3e4627b2a90ee.1469746395.git.joe@perches.com
Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2016-08-03 07:35:07 +0800
750afe7ba printk: add kernel parameter to control writes to /dev/kmsg ... Browse Code »

Add a "printk.devkmsg" kernel command line parameter which controls how
userspace writes into /dev/kmsg. It has three options:

* ratelimit - ratelimit logging from userspace.
* on - unlimited logging from userspace
* off - logging from userspace gets ignored

The default setting is to ratelimit the messages written to it.

This changes the kernel default setting of "on" to "ratelimit" and we do
that because we want to keep userspace spamming /dev/kmsg to sane
levels. This is especially moot when a small kernel log buffer wraps
around and messages get lost. So the ratelimiting setting should be a
sane setting where kernel messages should have a bit higher chance of
survival from all the spamming.

It additionally does not limit logging to /dev/kmsg while the system is
booting if we haven't disabled it on the command line.

Furthermore, we can control the logging from a lower priority sysctl
interface - kernel.printk_devkmsg.

That interface will succeed only if printk.devkmsg *hasn't* been
supplied on the command line. If it has, then printk.devkmsg is a
one-time setting which remains for the duration of the system lifetime.
This "locking" of the setting is to prevent userspace from changing the
logging on us through sysctl(2).

This patch is based on previous patches from Linus and Steven.

[bp@suse.de: fixes]
Link: http://lkml.kernel.org/r/20160719072344.GC25563@nazgul.tnic
Link: http://lkml.kernel.org/r/20160716061745.15795-3-bp@alien8.de
Signed-off-by: Borislav Petkov
Cc: Dave Young
Cc: Franck Bui
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Uwe Kleine-König
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Borislav Petkov
2016-08-03 07:35:06 +0800
6b1d174b0 ratelimit: extend to print suppressed messages on release ... Browse Code »

Extend the ratelimiting facility to print the amount of suppressed lines
when it is being released.

This use case is aimed at short-termed, burst-like users for which we
want to output the suppressed lines stats only once, after it has been
disposed of. For an example, see /dev/kmsg usage in a follow-on patch.

Also, change the printk() line we issue on release to not use
"callbacks" as it is misleading: we're not suppressing callbacks but
printk() calls.

This has been separated from a previous patch by Linus.

Link: http://lkml.kernel.org/r/20160716061745.15795-2-bp@alien8.de
Signed-off-by: Borislav Petkov
Cc: Dave Young
Cc: Franck Bui
Cc: Greg Kroah-Hartman
Cc: Ingo Molnar
Cc: Linus Torvalds
Cc: Peter Zijlstra
Cc: Steven Rostedt
Cc: Uwe Kleine-König
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Borislav Petkov
2016-08-03 07:35:06 +0800
b5644a153 fbdev/bfin_adv7393fb: move DRIVER_NAME before its first use ... Browse Code »

Move the DRIVER_NAME macro definition before the first usage site and
fix build error.

Link: http://lkml.kernel.org/r/20160801163937.GA28119@nazgul.tnic
Signed-off-by: Borislav Petkov
Reported-by: kbuild test robot
Cc: Jean-Christophe Plagniol-Villard
Cc: Tomi Valkeinen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Borislav Petkov
2016-08-03 07:35:05 +0800
40a7d9f5f printk: include <asm/sections.h> instead of <asm-generic/sections.h> ... Browse Code »

asm-generic headers are generic implementations for architecture
specific code and should not be included by common code. Thus use the
asm/ version of sections.h to get at the linker sections.

Link: http://lkml.kernel.org/r/1468285008-7331-1-git-send-email-hch@lst.de
Signed-off-by: Christoph Hellwig
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Christoph Hellwig
2016-08-03 07:35:05 +0800
cf7754441 printk: introduce suppress_message_printing() ... Browse Code »

Messages' levels and console log level are inspected when the actual
printing occurs, which may provoke console_unlock() and
console_cont_flush() to waste CPU cycles on every message that has
loglevel above the current console_loglevel.

Schematically, console_unlock() does the following:

console_unlock()
{
...
for (;;) {
...
raw_spin_lock_irqsave(&logbuf_lock, flags);
skip:
msg = log_from_idx(console_idx);

if (msg->flags & LOG_NOCONS) {
...
goto skip;
}

level = msg->level;
len += msg_print_text(); >> sprintfs
memcpy,
etc.

if (nr_ext_console_drivers) {
ext_len = msg_print_ext_header(); >> scnprintf
ext_len += msg_print_ext_body(); >> scnprintfs
etc.
}
...
raw_spin_unlock(&logbuf_lock);

call_console_drivers(level, ext_text, ext_len, text, len)
{
if (level >= console_loglevel && >> drop the message
!ignore_loglevel)
return;

console->write(...);
}

local_irq_restore(flags);
}
...
}

The thing here is this deferred `level >= console_loglevel' check. We
are wasting CPU cycles on sprintfs/memcpy/etc. preparing the messages
that we will eventually drop.

This can be huge when we register a new CON_PRINTBUFFER console, for
instance. For every such a console register_console() resets the

console_seq, console_idx, console_prev

and sets a `exclusive console' pointer to replay the log buffer to that
just-registered console. And there can be a lot of messages to replay,
in the worst case most of which can be dropped after console_loglevel
test.

We know messages' levels long before we call msg_print_text() and
friends, so we can just move console_loglevel check out of
call_console_drivers() and format a new message only if we are sure that
it won't be dropped.

The patch factors out loglevel check into suppress_message_printing()
function and tests message->level and console_loglevel before formatting
functions in console_unlock() and console_cont_flush() are getting
executed. This improves things not only for exclusive CON_PRINTBUFFER
consoles, but for every console_unlock() that attempts to print a
message of level above the console_loglevel.

Link: http://lkml.kernel.org/r/20160627135012.8229-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky
Reviewed-by: Petr Mladek
Cc: Tejun Heo
Cc: Jan Kara
Cc: Calvin Owens
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sergey Senozhatsky
2016-08-03 07:35:04 +0800
874f9c7da printk: create pr_<level> functions ... Browse Code »

Using functions instead of macros can reduce overall code size by
eliminating unnecessary "KERN_SOH" prefixes from format strings.

defconfig x86-64:

$ size vmlinux*
text data bss dec hex filename
10193570 4331464 1105920 15630954 ee826a vmlinux.new
10192623 4335560 1105920 15634103 ee8eb7 vmlinux.old

As the return value are unimportant and unused in the kernel tree, these
new functions return void.

Miscellanea:

- change pr_ macros to call new __pr_ functions
- change vprintk_nmi and vprintk_default to add LOGLEVEL_ argument

[akpm@linux-foundation.org: fix LOGLEVEL_INFO, per Joe]
Link: http://lkml.kernel.org/r/e16cc34479dfefcae37c98b481e6646f0f69efc3.1466718827.git.joe@perches.com
Signed-off-by: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Joe Perches
2016-08-03 07:35:04 +0800
bebca0528 printk: do not include interrupt.h ... Browse Code »

A trivial cosmetic change: interrupt.h header is redundant since commit
6b898c07cb1d ("console: use might_sleep in console_lock").

Link: http://lkml.kernel.org/r/20160620132847.21930-1-sergey.senozhatsky@gmail.com
Signed-off-by: Sergey Senozhatsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sergey Senozhatsky
2016-08-03 07:35:03 +0800
9d5059c95 dynamic_debug: only add header when used ... Browse Code »

kernel.h header doesn't directly use dynamic debug, instead we can
include it in module.c (which used it via kernel.h). printk.h only uses
it if CONFIG_DYNAMIC_DEBUG is on, changing the inclusion to only happen
in that case.

Link: http://lkml.kernel.org/r/1468429793-16917-1-git-send-email-luisbg@osg.samsung.com
[luisbg@osg.samsung.com: include dynamic_debug.h in drb_int.h]
Link: http://lkml.kernel.org/r/1468447828-18558-2-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt
Cc: Rusty Russell
Cc: Hidehiro Kawai
Cc: Borislav Petkov
Cc: Michal Nazarewicz
Cc: Rasmus Villemoes
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis de Bethencourt
2016-08-03 07:35:03 +0800
61e96496d task_work: use READ_ONCE/lockless_dereference, avoid pi_lock if !task_works ... Browse Code »

Change task_work_cancel() to use lockless_dereference(), this is what
the code really wants but we didn't have this helper when it was
written.

Also add the fast-path task->task_works == NULL check, in the likely
case this task has no pending works and we can avoid
spin_lock(task->pi_lock).

While at it, change other users of ACCESS_ONCE() to use READ_ONCE().

Link: http://lkml.kernel.org/r/20160610150042.GA13868@redhat.com
Signed-off-by: Oleg Nesterov
Cc: Andrea Parri
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2016-08-03 07:35:02 +0800
949bed2f5 include: mman: use bool instead of int for the return value of arch_validate_prot ... Browse Code »

For pure bool function's return value, bool is a little better more or
less than int.

Link: http://lkml.kernel.org/r/1469331815-2026-1-git-send-email-chengang@emindsoft.com.cn
Signed-off-by: Chen Gang
Acked-by: Michael Ellerman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Chen Gang
2016-08-03 07:35:02 +0800
bd804ba12 mailmap: add Linus Lüssing ... Browse Code »

For one thing, summarizes all non-umlaut versions into the umlaut one
(Linus Luessing -> Linus Lüssing).

For another, maps obsolete email addresses to the current @c0d3.blue
one.

Link: http://lkml.kernel.org/r/1467805371-2773-1-git-send-email-linus.luessing@c0d3.blue
Signed-off-by: Linus Lüssing
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Linus Lüssing
2016-08-03 07:34:14 +0800
db3f60012 uapi: move forward declarations of internal structures ... Browse Code »

Don't user forward declarations of internal kernel structures in headers
exported to userspace.

Move "struct completion;".
Move "struct task_struct;".

Link: http://lkml.kernel.org/r/20160713215808.GA22486@p183.telecom.by
Signed-off-by: Alexey Dobriyan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexey Dobriyan
2016-08-03 05:31:41 +0800
bd721ea73 treewide: replace obsolete _refok by __ref ... Browse Code »

There was only one use of __initdata_refok and __exit_refok

__init_refok was used 46 times against 82 for __ref.

Those definitions are obsolete since commit 312b1485fb50 ("Introduce new
section reference annotations tags: __ref, __refdata, __refconst")

This patch removes the following compatibility definitions and replaces
them treewide.

/* compatibility defines */
#define __init_refok __ref
#define __initdata_refok __refdata
#define __exit_refok __ref

I can also provide separate patches if necessary.
(One patch per tree and check in 1 month or 2 to remove old definitions)

[akpm@linux-foundation.org: coding-style fixes]
Link: http://lkml.kernel.org/r/1466796271-3043-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick
Cc: Ingo Molnar
Cc: Sam Ravnborg
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2016-08-03 05:31:41 +0800
ca945e715 memstick: don't allocate unused major for ms_block ... Browse Code »

When alloc_disk(0) is used the ->major number is completely ignored.
All devices are allocated with a major of BLOCK_EXT_MAJOR.

So remove registration and deregistration of 'major'.

Link: http://lkml.kernel.org/r/20160602064318.4403.49955.stgit@noble
Signed-off-by: NeilBrown
Cc: Keith Busch
Cc: Jens Axboe
Cc: Maxim Levitsky
Cc: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

NeilBrown
2016-08-03 05:31:41 +0800
bc083a64b init/Kconfig: make COMPILE_TEST depend on !UML ... Browse Code »

UML is a bit special since it does not have iomem nor dma. That means a
lot of drivers will not build if they miss a dependency on HAS_IOMEM.
s390 used to have the same issues but since it gained PCI support UML is
the only stranger.

We are tired of patching dozens of new drivers after every merge window
just to un-break allmod/yesconfig UML builds. One could argue that a
decent driver has to know on what it depends and therefore a missing
HAS_IOMEM dependency is a clear driver bug. But the dependency not
obvious and not everyone does UML builds with COMPILE_TEST enabled when
developing a device driver.

A possible solution to make these builds succeed on UML would be
providing stub functions for ioremap() and friends which fail upon
runtime. Another one is simply disabling COMPILE_TEST for UML. Since
it is the least hassle and does not force use to fake iomem support
let's do the latter.

Link: http://lkml.kernel.org/r/1466152995-28367-1-git-send-email-richard@nod.at
Signed-off-by: Richard Weinberger
Acked-by: Arnd Bergmann
Cc: Al Viro
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Richard Weinberger
2016-08-03 05:31:41 +0800
ca52953f5 fs/proc/task_mmu.c: suppress compilation warnings with W=1 ... Browse Code »

Suppress a bunch of warnings of the form:

fs/proc/task_mmu.c: In function 'show_smap_vma_flags':
fs/proc/task_mmu.c:635:22: warning: initialized field overwritten [-Wt override-init]
[ilog2(VM_READ)] = "rd",
^~~~
fs/proc/task_mmu.c:635:22: note: (near initialization for 'mnemonics[0]')

They happen because of the way we intentionally build the table, so
silence the warning when building with 'make W=1'.

Link: http://lkml.kernel.org/r/8727.1470022083@turing-police.cc.vt.edu
Signed-off-by: Valdis Kletnieks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Valdis Kletnieks
2016-08-03 05:31:41 +0800
519ded5a8 procfs: avoid 32-bit time_t in /proc/*/stat ... Browse Code »

/proc/stat shows (among lots of other things) the current boottime (i.e.
number of seconds since boot). While a 32-bit number is sufficient for
this particular case, we want to get rid of the 'struct timespec'
suffers from a 32-bit overflow in 2038.

This changes the code to use a struct timespec64, which is known to be
safe in all cases.

Link: http://lkml.kernel.org/r/20160617201247.2292101-1-arnd@arndb.de
Signed-off-by: Arnd Bergmann
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Arnd Bergmann
2016-08-03 05:31:41 +0800
ef419398b proc_oom_score: remove tasklist_lock and pid_alive() ... Browse Code »

This was needed before to ensure that ->signal != 0 and do_each_thread()
is safe, see commit b95c35e76b29b ("oom: fix the unsafe usage of
badness() in proc_oom_score()") for details.

Today tsk->signal can't go away and for_each_thread(tsk) is always safe.

Link: http://lkml.kernel.org/r/20160608211921.GA15508@redhat.com
Signed-off-by: Oleg Nesterov
Acked-by: David Rientjes
Acked-by: Michal Hocko
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Oleg Nesterov
2016-08-03 05:31:41 +0800
db4ad0360 MAINTAINERS: befs: add new maintainers ... Browse Code »

Salah Triki and Luis de Bethencourt are taking over maintainership of
befs.

Link: http://lkml.kernel.org/r/1469651079-32455-1-git-send-email-luisbg@osg.samsung.com
Signed-off-by: Luis de Bethencourt
Acked-by: Greg Kroah-Hartman
Acked-by: Theodore Ts'o
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Luis de Bethencourt
2016-08-03 05:31:41 +0800
9991a9c8d cgroup: update cgroup's document path ... Browse Code »

cgroup's document path is changed to "cgroup-v1". update it.

Link: http://lkml.kernel.org/r/1470148443-6509-1-git-send-email-iamyooon@gmail.com
Signed-off-by: seokhoon.yoon
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

seokhoon.yoon
2016-08-03 05:31:41 +0800
901d805c3 UBSAN: fix typo in format string ... Browse Code »

handle_object_size_mismatch() used %pk to format a kernel pointer with
pr_err(). This seemed to be a misspelling for %pK, but using this to
format a kernel pointer does not make much sence here.

Therefore use %p instead, like in handle_missaligned_access().

Link: http://lkml.kernel.org/r/20160730083010.11569-1-nicolas.iooss_linux@m4x.org
Signed-off-by: Nicolas Iooss
Acked-by: Andrey Ryabinin
Cc: Joe Perches
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Nicolas Iooss
2016-08-03 05:31:41 +0800
9b24fef9f sysv, ipc: fix security-layer leaking ... Browse Code »

Commit 53dad6d3a8e5 ("ipc: fix race with LSMs") updated ipc_rcu_putref()
to receive rcu freeing function but used generic ipc_rcu_free() instead
of msg_rcu_free() which does security cleaning.

Running LTP msgsnd06 with kmemleak gives the following:

cat /sys/kernel/debug/kmemleak

unreferenced object 0xffff88003c0a11f8 (size 8):
comm "msgsnd06", pid 1645, jiffies 4294672526 (age 6.549s)
hex dump (first 8 bytes):
1b 00 00 00 01 00 00 00 ........
backtrace:
kmemleak_alloc+0x23/0x40
kmem_cache_alloc_trace+0xe1/0x180
selinux_msg_queue_alloc_security+0x3f/0xd0
security_msg_queue_alloc+0x2e/0x40
newque+0x4e/0x150
ipcget+0x159/0x1b0
SyS_msgget+0x39/0x40
entry_SYSCALL_64_fastpath+0x13/0x8f

Manfred Spraul suggested to fix sem.c as well and Davidlohr Bueso to
only use ipc_rcu_free in case of security allocation failure in newary()

Fixes: 53dad6d3a8e ("ipc: fix race with LSMs")
Link: http://lkml.kernel.org/r/1470083552-22966-1-git-send-email-fabf@skynet.be
Signed-off-by: Fabian Frederick
Cc: Davidlohr Bueso
Cc: Manfred Spraul
Cc: [3.12+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Fabian Frederick
2016-08-03 05:31:41 +0800
b5afba297 mm: vmscan: fix memcg-aware shrinkers not called on global reclaim ... Browse Code »

We must call shrink_slab() for each memory cgroup on both global and
memcg reclaim in shrink_node_memcg(). Commit d71df22b55099 accidentally
changed that so that now shrink_slab() is only called with memcg != NULL
on memcg reclaim. As a result, memcg-aware shrinkers (including
dentry/inode) are never invoked on global reclaim. Fix that.

Fixes: b2e18757f2c9 ("mm, vmscan: begin reclaiming pages on a per-node basis")
Link: http://lkml.kernel.org/r/1470056590-7177-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: Mel Gorman
Cc: Hillf Danton
Cc: Vlastimil Babka
Cc: Joonsoo Kim
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2016-08-03 05:31:41 +0800
05eb6e726 radix-tree: account nodes to memcg only if explicitly requested ... Browse Code »

Radix trees may be used not only for storing page cache pages, so
unconditionally accounting radix tree nodes to the current memory cgroup
is bad: if a radix tree node is used for storing data shared among
different cgroups we risk pinning dead memory cgroups forever.

So let's only account radix tree nodes if it was explicitly requested by
passing __GFP_ACCOUNT to INIT_RADIX_TREE. Currently, we only want to
account page cache entries, so mark mapping->page_tree so.

Fixes: 58e698af4c63 ("radix-tree: account radix_tree_node to memory cgroup")
Link: http://lkml.kernel.org/r/1470057188-7864-1-git-send-email-vdavydov@virtuozzo.com
Signed-off-by: Vladimir Davydov
Acked-by: Johannes Weiner
Acked-by: Michal Hocko
Cc: [4.6+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vladimir Davydov
2016-08-03 05:31:41 +0800
c3cee3722 kasan: avoid overflowing quarantine size on low memory systems ... Browse Code »

If the total amount of memory assigned to quarantine is less than the
amount of memory assigned to per-cpu quarantines, |new_quarantine_size|
may overflow. Instead, set it to zero.

[akpm@linux-foundation.org: cleanup: use WARN_ONCE return value]
Link: http://lkml.kernel.org/r/1470063563-96266-1-git-send-email-glider@google.com
Fixes: 55834c59098d ("mm: kasan: initial memory quarantine implementation")
Signed-off-by: Alexander Potapenko
Reported-by: Dmitry Vyukov
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Alexander Potapenko
2016-08-03 05:31:41 +0800
7e0889789 kasan: improve double-free reports ... Browse Code »

Currently we just dump stack in case of double free bug.
Let's dump all info about the object that we have.

[aryabinin@virtuozzo.com: change double free message per Alexander]
Link: http://lkml.kernel.org/r/1470153654-30160-1-git-send-email-aryabinin@virtuozzo.com
Link: http://lkml.kernel.org/r/1470062715-14077-6-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2016-08-03 05:31:41 +0800
b3cbd9bf7 mm/kasan: get rid of ->state in struct kasan_alloc_meta ... Browse Code »

The state of object currently tracked in two places - shadow memory, and
the ->state field in struct kasan_alloc_meta. We can get rid of the
latter. The will save us a little bit of memory. Also, this allow us
to move free stack into struct kasan_alloc_meta, without increasing
memory consumption. So now we should always know when the last time the
object was freed. This may be useful for long delayed use-after-free
bugs.

As a side effect this fixes following UBSAN warning:
UBSAN: Undefined behaviour in mm/kasan/quarantine.c:102:13
member access within misaligned address ffff88000d1efebc for type 'struct qlist_node'
which requires 8 byte alignment

Link: http://lkml.kernel.org/r/1470062715-14077-5-git-send-email-aryabinin@virtuozzo.com
Reported-by: kernel test robot
Signed-off-by: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2016-08-03 05:31:41 +0800
47b5c2a0f mm/kasan: get rid of ->alloc_size in struct kasan_alloc_meta ... Browse Code »

Size of slab object already stored in cache->object_size.

Note, that kmalloc() internally rounds up size of allocation, so
object_size may be not equal to alloc_size, but, usually we don't need
to know the exact size of allocated object. In case if we need that
information, we still can figure it out from the report. The dump of
shadow memory allows to identify the end of allocated memory, and
thereby the exact allocation size.

Link: http://lkml.kernel.org/r/1470062715-14077-4-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2016-08-03 05:31:41 +0800
f7376aed6 mm/kasan, slub: don't disable interrupts when object leaves quarantine ... Browse Code »

SLUB doesn't require disabled interrupts to call ___cache_free().

Link: http://lkml.kernel.org/r/1470062715-14077-3-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Acked-by: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2016-08-03 05:31:41 +0800
4b3ec5a3f mm/kasan: don't reduce quarantine in atomic contexts ... Browse Code »

Currently we call quarantine_reduce() for ___GFP_KSWAPD_RECLAIM (implied
by __GFP_RECLAIM) allocation. So, basically we call it on almost every
allocation. quarantine_reduce() sometimes is heavy operation, and
calling it with disabled interrupts may trigger hard LOCKUP:

NMI watchdog: Watchdog detected hard LOCKUP on cpu 2irq event stamp: 1411258
Call Trace:
dump_stack+0x68/0x96
watchdog_overflow_callback+0x15b/0x190
__perf_event_overflow+0x1b1/0x540
perf_event_overflow+0x14/0x20
intel_pmu_handle_irq+0x36a/0xad0
perf_event_nmi_handler+0x2c/0x50
nmi_handle+0x128/0x480
default_do_nmi+0xb2/0x210
do_nmi+0x1aa/0x220
end_repeat_nmi+0x1a/0x1e
<> __kernel_text_address+0x86/0xb0
print_context_stack+0x7b/0x100
dump_trace+0x12b/0x350
save_stack_trace+0x2b/0x50
set_track+0x83/0x140
free_debug_processing+0x1aa/0x420
__slab_free+0x1d6/0x2e0
___cache_free+0xb6/0xd0
qlist_free_all+0x83/0x100
quarantine_reduce+0x177/0x1b0
kasan_kmalloc+0xf3/0x100

Reduce the quarantine_reduce iff direct reclaim is allowed.

Fixes: 55834c59098d("mm: kasan: initial memory quarantine implementation")
Link: http://lkml.kernel.org/r/1470062715-14077-2-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reported-by: Dave Jones
Acked-by: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2016-08-03 05:31:41 +0800
4a3d308d6 mm/kasan: fix corruptions and false positive reports ... Browse Code »

Once an object is put into quarantine, we no longer own it, i.e. object
could leave the quarantine and be reallocated. So having set_track()
call after the quarantine_put() may corrupt slab objects.

BUG kmalloc-4096 (Not tainted): Poison overwritten
-----------------------------------------------------------------------------
Disabling lock debugging due to kernel taint
INFO: 0xffff8804540de850-0xffff8804540de857. First byte 0xb5 instead of 0x6b
...
INFO: Freed in qlist_free_all+0x42/0x100 age=75 cpu=3 pid=24492
__slab_free+0x1d6/0x2e0
___cache_free+0xb6/0xd0
qlist_free_all+0x83/0x100
quarantine_reduce+0x177/0x1b0
kasan_kmalloc+0xf3/0x100
kasan_slab_alloc+0x12/0x20
kmem_cache_alloc+0x109/0x3e0
mmap_region+0x53e/0xe40
do_mmap+0x70f/0xa50
vm_mmap_pgoff+0x147/0x1b0
SyS_mmap_pgoff+0x2c7/0x5b0
SyS_mmap+0x1b/0x30
do_syscall_64+0x1a0/0x4e0
return_from_SYSCALL_64+0x0/0x7a
INFO: Slab 0xffffea0011503600 objects=7 used=7 fp=0x (null) flags=0x8000000000004080
INFO: Object 0xffff8804540de848 @offset=26696 fp=0xffff8804540dc588
Redzone ffff8804540de840: bb bb bb bb bb bb bb bb ........
Object ffff8804540de848: 6b 6b 6b 6b 6b 6b 6b 6b b5 52 00 00 f2 01 60 cc kkkkkkkk.R....`.

Similarly, poisoning after the quarantine_put() leads to false positive
use-after-free reports:

BUG: KASAN: use-after-free in anon_vma_interval_tree_insert+0x304/0x430 at addr ffff880405c540a0
Read of size 8 by task trinity-c0/3036
CPU: 0 PID: 3036 Comm: trinity-c0 Not tainted 4.7.0-think+ #9
Call Trace:
dump_stack+0x68/0x96
kasan_report_error+0x222/0x600
__asan_report_load8_noabort+0x61/0x70
anon_vma_interval_tree_insert+0x304/0x430
anon_vma_chain_link+0x91/0xd0
anon_vma_clone+0x136/0x3f0
anon_vma_fork+0x81/0x4c0
copy_process.part.47+0x2c43/0x5b20
_do_fork+0x16d/0xbd0
SyS_clone+0x19/0x20
do_syscall_64+0x1a0/0x4e0
entry_SYSCALL64_slow_path+0x25/0x25

Fix this by putting an object in the quarantine after all other
operations.

Fixes: 80a9201a5965 ("mm, kasan: switch SLUB to stackdepot, enable memory quarantine for SLUB")
Link: http://lkml.kernel.org/r/1470062715-14077-1-git-send-email-aryabinin@virtuozzo.com
Signed-off-by: Andrey Ryabinin
Reported-by: Dave Jones
Reported-by: Vegard Nossum
Reported-by: Sasha Levin
Acked-by: Alexander Potapenko
Cc: Dmitry Vyukov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Andrey Ryabinin
2016-08-03 05:31:41 +0800
d6507ff53 memcg: put soft limit reclaim out of way if the excess tree is empty ... Browse Code »

We've had a report about soft lockups caused by lock bouncing in the
soft reclaim path:

BUG: soft lockup - CPU#0 stuck for 22s! [kav4proxy-kavic:3128]
RIP: 0010:[] [] _raw_spin_lock+0x18/0x20
Call Trace:
mem_cgroup_soft_limit_reclaim+0x25a/0x280
shrink_zones+0xed/0x200
do_try_to_free_pages+0x74/0x320
try_to_free_pages+0x112/0x180
__alloc_pages_slowpath+0x3ff/0x820
__alloc_pages_nodemask+0x1e9/0x200
alloc_pages_vma+0xe1/0x290
do_wp_page+0x19f/0x840
handle_pte_fault+0x1cd/0x230
do_page_fault+0x1fd/0x4c0
page_fault+0x25/0x30

There are no memcgs created so there cannot be any in the soft limit
excess obviously:

[...]
memory 0 1 1

so all this just seems to be mem_cgroup_largest_soft_limit_node trying
to get spin_lock_irq(&mctz->lock) just to find out that the soft limit
excess tree is empty. This is just pointless wasting of cycles and
cache line bouncing during heavy parallel reclaim on large machines.
The particular machine wasn't very healthy and most probably suffering
from a memory leak which just caused the memory reclaim to trash
heavily. But bouncing on the lock certainly didn't help...

Fix this by optimistic lockless check and bail out early if the tree is
empty. This is theoretically racy but that shouldn't matter all that
much. First of all soft limit is a best effort feature and it is slowly
getting deprecated and its usage should be really scarce. Bouncing on a
lock without a good reason is surely much bigger problem, especially on
large CPU machines.

Link: http://lkml.kernel.org/r/1470073277-1056-1-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko
Acked-by: Vladimir Davydov
Acked-by: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-08-03 05:31:41 +0800
4e666314d mm, hugetlb: fix huge_pte_alloc BUG_ON ... Browse Code »

Zhong Jiang has reported a BUG_ON from huge_pte_alloc hitting when he
runs his database load with memory online and offline running in
parallel. The reason is that huge_pmd_share might detect a shared pmd
which is currently migrated and so it has migration pte which is
!pte_huge.

There doesn't seem to be any easy way to prevent from the race and in
fact seeing the migration swap entry is not harmful. Both callers of
huge_pte_alloc are prepared to handle them. copy_hugetlb_page_range
will copy the swap entry and make it COW if needed. hugetlb_fault will
back off and so the page fault is retries if the page is still under
migration and waits for its completion in hugetlb_fault.

That means that the BUG_ON is wrong and we should update it. Let's
simply check that all present ptes are pte_huge instead.

Link: http://lkml.kernel.org/r/20160721074340.GA26398@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Reported-by: zhongjiang
Acked-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michal Hocko
2016-08-03 05:31:41 +0800
649920c6a mm/hugetlb: avoid soft lockup in set_max_huge_pages() ... Browse Code »

In powerpc servers with large memory(32TB), we watched several soft
lockups for hugepage under stress tests.

The call traces are as follows:
1.
get_page_from_freelist+0x2d8/0xd50
__alloc_pages_nodemask+0x180/0xc20
alloc_fresh_huge_page+0xb0/0x190
set_max_huge_pages+0x164/0x3b0

2.
prep_new_huge_page+0x5c/0x100
alloc_fresh_huge_page+0xc8/0x190
set_max_huge_pages+0x164/0x3b0

This patch fixes such soft lockups. It is safe to call cond_resched()
there because it is out of spin_lock/unlock section.

Link: http://lkml.kernel.org/r/1469674442-14848-1-git-send-email-hejianet@gmail.com
Signed-off-by: Jia He
Reviewed-by: Naoya Horiguchi
Acked-by: Michal Hocko
Acked-by: Dave Hansen
Cc: Mike Kravetz
Cc: "Kirill A. Shutemov"
Cc: Paul Gortmaker
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jia He
2016-08-03 05:31:41 +0800
117dec978 tools/testing/radix-tree/linux/gfp.h: fix bitrotted value ... Browse Code »

Apparently, the tools/testing version dates to a few flags ago, and
we've sprouted 4 new ones since. Keep in sync with the value in the
main tree...

Link: http://lkml.kernel.org/r/23400.1469702675@turing-police.cc.vt.edu
Signed-off-by: Valdis Kletnieks
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Valdis Kletnieks
2016-08-03 05:31:41 +0800
1a8018fb4 mm: move swap-in anonymous page into active list ... Browse Code »

Every swap-in anonymous page starts from inactive lru list's head. It
should be activated unconditionally when VM decide to reclaim because
page table entry for the page always usually has marked accessed bit.
Thus, their window size for getting a new referece is 2 * NR_inactive +
NR_active while others is NR_inactive + NR_active.

It's not fair that it has more chance to be referenced compared to other
newly allocated page which starts from active lru list's head.

Johannes:

: The page can still have a valid copy on the swap device, so prefering to
: reclaim that page over a fresh one could make sense. But as you point
: out, having it start inactive instead of active actually ends up giving it
: *more* LRU time, and that seems to be without justification.

Rik:

: The reason newly read in swap cache pages start on the inactive list is
: that we do some amount of read-around, and do not know which pages will
: get used.
:
: However, immediately activating the ones that DO get used, like your patch
: does, is the right thing to do.

Link: http://lkml.kernel.org/r/1469762740-17860-1-git-send-email-minchan@kernel.org
Signed-off-by: Minchan Kim
Acked-by: Johannes Weiner
Acked-by: Rik van Riel
Cc: Nadav Amit
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2016-08-03 05:31:41 +0800
c5f88bd29 mm: fail prefaulting if page table allocation fails ... Browse Code »

I ran into this:

BUG: sleeping function called from invalid context at mm/page_alloc.c:3784
in_atomic(): 0, irqs_disabled(): 0, pid: 1434, name: trinity-c1
2 locks held by trinity-c1/1434:
#0: (&mm->mmap_sem){......}, at: [] __do_page_fault+0x1ce/0x8f0
#1: (rcu_read_lock){......}, at: [] filemap_map_pages+0xd6/0xdd0

CPU: 0 PID: 1434 Comm: trinity-c1 Not tainted 4.7.0+ #58
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
Call Trace:
dump_stack+0x65/0x84
panic+0x185/0x2dd
___might_sleep+0x51c/0x600
__might_sleep+0x90/0x1a0
__alloc_pages_nodemask+0x5b1/0x2160
alloc_pages_current+0xcc/0x370
pte_alloc_one+0x12/0x90
__pte_alloc+0x1d/0x200
alloc_set_pte+0xe3e/0x14a0
filemap_map_pages+0x42b/0xdd0
handle_mm_fault+0x17d5/0x28b0
__do_page_fault+0x310/0x8f0
trace_do_page_fault+0x18d/0x310
do_async_page_fault+0x27/0xa0
async_page_fault+0x28/0x30

The important bits from the above is that filemap_map_pages() is calling
into the page allocator while holding rcu_read_lock (sleeping is not
allowed inside RCU read-side critical sections).

According to Kirill Shutemov, the prefaulting code in do_fault_around()
is supposed to take care of this, but missing error handling means that
the allocation failure can go unnoticed.

We don't need to return VM_FAULT_OOM (or any other error) here, since we
can just let the normal fault path try again.

Fixes: 7267ec008b5c ("mm: postpone page table allocation until we have page to map")
Link: http://lkml.kernel.org/r/1469708107-11868-1-git-send-email-vegard.nossum@oracle.com
Signed-off-by: Vegard Nossum
Acked-by: Kirill A. Shutemov
Cc: "Hillf Danton"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Vegard Nossum
2016-08-03 05:31:41 +0800