23 Nov, 2015

6 commits

  • Merge slub bulk allocator updates from Andrew Morton:
    "This missed the merge window because I was waiting for some repairs to
    come in. Nothing actually uses the bulk allocator yet and the changes
    to other code paths are pretty small. And the net guys are waiting
    for this so they can start merging the client code"

    More comments from Jesper Dangaard Brouer:
    "The kmem_cache_alloc_bulk() call, in mm/slub.c, were included in
    previous kernel. The present version contains a bug. Vladimir
    Davydov noticed it contained a bug, when kernel is compiled with
    CONFIG_MEMCG_KMEM (see commit 03ec0ed57ffc: "slub: fix kmem cgroup
    bug in kmem_cache_alloc_bulk"). Plus the mem cgroup counterpart in
    kmem_cache_free_bulk() were missing (see commit 033745189b1b "slub:
    add missing kmem cgroup support to kmem_cache_free_bulk").

    I don't consider the fix stable-material because there are no in-tree
    users of the API.

    But with known bugs (for memcg) I cannot start using the API in the
    net-tree"

    * emailed patches from Andrew Morton :
    slab/slub: adjust kmem_cache_alloc_bulk API
    slub: add missing kmem cgroup support to kmem_cache_free_bulk
    slub: fix kmem cgroup bug in kmem_cache_alloc_bulk
    slub: optimize bulk slowpath free by detached freelist
    slub: support for bulk free with SLUB freelists

    Linus Torvalds
     
  • Adjust kmem_cache_alloc_bulk API before we have any real users.

    Adjust the API to return type 'int' instead of the previous 'bool'. This
    is done to allow future extension of the bulk alloc API.

    A future extension could be to allow SLUB to stop at a page boundary, when
    specified by a flag, and then return the number of objects.

    The advantage of this approach is that it would make it easier to run
    bulk alloc without local IRQs disabled, using a cmpxchg that "steals"
    the entire c->freelist or page->freelist. To avoid overshooting we would
    stop processing at a slab-page boundary; otherwise we always end up
    returning some objects at the cost of another cmpxchg.

    To stay compatible with future users of this API that link against an
    older kernel while using the new flag, we need to return the number of
    allocated objects with this API change.
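
    A minimal usage sketch of the adjusted API (an illustrative caller, not
    code from the tree; error handling is simplified):

      #include <linux/errno.h>
      #include <linux/slab.h>

      #define NR_OBJS 16

      static int example_bulk(struct kmem_cache *cache)
      {
              void *objs[NR_OBJS];
              int allocated;

              /* returns the number of objects allocated, 0 on failure */
              allocated = kmem_cache_alloc_bulk(cache, GFP_KERNEL,
                                                NR_OBJS, objs);
              if (!allocated)
                      return -ENOMEM;

              /* ... use objs[0] .. objs[allocated - 1] ... */

              kmem_cache_free_bulk(cache, allocated, objs);
              return 0;
      }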

    Signed-off-by: Jesper Dangaard Brouer
    Cc: Vladimir Davydov
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The initial implementation missed kmem cgroup support in the
    kmem_cache_free_bulk() call; add it.

    If CONFIG_MEMCG_KMEM is not enabled, the compiler should be smart enough
    to not add any asm code.

    Incoming bulk free objects can belong to different kmem cgroups, and the
    object free call can happen at a later point outside the memcg context.
    Thus, we need to keep the original kmem_cache in order to correctly verify
    whether a memcg object matches its "root_cache" (s->memcg_params.root_cache).

    Signed-off-by: Jesper Dangaard Brouer
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • The slab_pre_alloc_hook() call interacts with kmemcg and must not be
    called several times inside the bulk alloc for-loop, due to the call to
    memcg_kmem_get_cache().

    This would result in hitting the VM_BUG_ON in __memcg_kmem_get_cache.

    As suggested by Vladimir Davydov, change slab_post_alloc_hook() to be able
    to handle an array of objects.

    A subtle detail: the loop iterator "i" in slab_post_alloc_hook() must have
    the same type (size_t) as the size argument. This makes it easier for the
    compiler to see that it can remove the loop when all debug statements
    inside the loop evaluate to nothing. Note, this is only an issue because
    the kernel is compiled with the GCC option -fno-strict-overflow.

    In slab_alloc_node() the compiler inlines and optimizes the invocation of
    slab_post_alloc_hook(s, flags, 1, &object) by removing the loop and
    accessing the object directly.
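
    A simplified sketch of the reworked hook (paraphrased; the in-tree version
    runs the kmemcheck/kmemleak/kasan hooks for each object where the comment
    sits below):

      static inline void slab_post_alloc_hook(struct kmem_cache *s, gfp_t flags,
                                              size_t size, void **p)
      {
              size_t i;  /* same type as 'size'; lets GCC remove the loop
                          * when the per-object hooks compile to nothing */

              flags &= gfp_allowed_mask;
              for (i = 0; i < size; i++) {
                      void *object = p[i];

                      /* per-object debug/annotation hooks run here */
                      (void)object;
              }
              memcg_kmem_put_cache(s);  /* balances the single pre-alloc
                                         * memcg_kmem_get_cache() call */
      }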

    Signed-off-by: Jesper Dangaard Brouer
    Reported-by: Vladimir Davydov
    Suggested-by: Vladimir Davydov
    Reviewed-by: Vladimir Davydov
    Cc: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • This change focuses on improving the speed of object freeing in the
    "slowpath" of kmem_cache_free_bulk.

    The calls slab_free (fastpath) and __slab_free (slowpath) have been
    extended with support for bulk free, which amortizes the overhead of
    the (locked) cmpxchg_double.

    To use the new bulking feature, we build what I call a detached
    freelist. The detached freelist takes advantage of three properties:

    1) the free function call owns the object that is about to be freed,
    thus writing into this memory is synchronization-free.

    2) many freelists can co-exist side-by-side in the same slab-page,
    each with a separate head pointer.

    3) it is the visibility of the head pointer that needs synchronization.

    Given these properties, the brilliant part is that the detached freelist
    can be constructed without any need for synchronization: the freelist is
    built directly in the page objects, while the detached freelist itself is
    allocated on the stack of the kmem_cache_free_bulk call. Thus, the
    freelist head pointer is not visible to other CPUs.

    All objects in a SLUB freelist must belong to the same slab-page. Thus,
    constructing the detached freelist is about matching objects that belong
    to the same slab-page. The bulk free array is scanned in a progressive
    manner with a limited look-ahead facility.
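
    A sketch of the idea (field names here are illustrative, paraphrased from
    the description above): the chained objects live inside one slab page,
    while this small descriptor lives on the caller's stack.

      struct detached_freelist {
              struct page *page;   /* slab-page all objects belong to  */
              void *tail;          /* last object, end of the chain    */
              void *freelist;      /* first object, head of the chain  */
              int cnt;             /* number of objects in the chain   */
      };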

    Kmem debug support is handled in the call to slab_free().

    Notice that kmem_cache_free_bulk no longer needs to disable IRQs. This
    only slowed down bulk free of a single object by approximately 3 cycles.

    Performance data:
    Benchmarked[1] obj size 256 bytes on CPU i7-4790K @ 4.00GHz

    SLUB fastpath single object quick reuse: 47 cycles(tsc) 11.931 ns

    To get stable and comparable numbers, the kernel has been booted with
    "slab_nomerge" (this also improves performance for larger bulk sizes).

    Performance data, compared against fallback bulking:

    bulk - fallback bulk - improvement with this patch
    1 - 62 cycles(tsc) 15.662 ns - 49 cycles(tsc) 12.407 ns - improved 21.0%
    2 - 55 cycles(tsc) 13.935 ns - 30 cycles(tsc) 7.506 ns - improved 45.5%
    3 - 53 cycles(tsc) 13.341 ns - 23 cycles(tsc) 5.865 ns - improved 56.6%
    4 - 52 cycles(tsc) 13.081 ns - 20 cycles(tsc) 5.048 ns - improved 61.5%
    8 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.659 ns - improved 64.0%
    16 - 49 cycles(tsc) 12.412 ns - 17 cycles(tsc) 4.495 ns - improved 65.3%
    30 - 49 cycles(tsc) 12.484 ns - 18 cycles(tsc) 4.533 ns - improved 63.3%
    32 - 50 cycles(tsc) 12.627 ns - 18 cycles(tsc) 4.707 ns - improved 64.0%
    34 - 96 cycles(tsc) 24.243 ns - 23 cycles(tsc) 5.976 ns - improved 76.0%
    48 - 83 cycles(tsc) 20.818 ns - 21 cycles(tsc) 5.329 ns - improved 74.7%
    64 - 74 cycles(tsc) 18.700 ns - 20 cycles(tsc) 5.127 ns - improved 73.0%
    128 - 90 cycles(tsc) 22.734 ns - 27 cycles(tsc) 6.833 ns - improved 70.0%
    158 - 99 cycles(tsc) 24.776 ns - 30 cycles(tsc) 7.583 ns - improved 69.7%
    250 - 104 cycles(tsc) 26.089 ns - 37 cycles(tsc) 9.280 ns - improved 64.4%

    Performance data, compared against the current in-kernel bulking:

    bulk - curr in-kernel - improvement with this patch
    1 - 46 cycles(tsc) - 49 cycles(tsc) - improved (cycles:-3) -6.5%
    2 - 27 cycles(tsc) - 30 cycles(tsc) - improved (cycles:-3) -11.1%
    3 - 21 cycles(tsc) - 23 cycles(tsc) - improved (cycles:-2) -9.5%
    4 - 18 cycles(tsc) - 20 cycles(tsc) - improved (cycles:-2) -11.1%
    8 - 17 cycles(tsc) - 18 cycles(tsc) - improved (cycles:-1) -5.9%
    16 - 18 cycles(tsc) - 17 cycles(tsc) - improved (cycles: 1) 5.6%
    30 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    32 - 18 cycles(tsc) - 18 cycles(tsc) - improved (cycles: 0) 0.0%
    34 - 78 cycles(tsc) - 23 cycles(tsc) - improved (cycles:55) 70.5%
    48 - 60 cycles(tsc) - 21 cycles(tsc) - improved (cycles:39) 65.0%
    64 - 49 cycles(tsc) - 20 cycles(tsc) - improved (cycles:29) 59.2%
    128 - 69 cycles(tsc) - 27 cycles(tsc) - improved (cycles:42) 60.9%
    158 - 79 cycles(tsc) - 30 cycles(tsc) - improved (cycles:49) 62.0%
    250 - 86 cycles(tsc) - 37 cycles(tsc) - improved (cycles:49) 57.0%

    Performance with normal SLUB merging is significantly slower for larger
    bulking. This is believed to be (primarily) an effect of not having to
    share the per-CPU data structures, as tuning the per-CPU size can achieve
    similar performance.

    bulk - slab_nomerge - normal SLUB merge
    1 - 49 cycles(tsc) - 49 cycles(tsc) - merge slower with cycles:0
    2 - 30 cycles(tsc) - 30 cycles(tsc) - merge slower with cycles:0
    3 - 23 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:0
    4 - 20 cycles(tsc) - 20 cycles(tsc) - merge slower with cycles:0
    8 - 18 cycles(tsc) - 18 cycles(tsc) - merge slower with cycles:0
    16 - 17 cycles(tsc) - 17 cycles(tsc) - merge slower with cycles:0
    30 - 18 cycles(tsc) - 23 cycles(tsc) - merge slower with cycles:5
    32 - 18 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:4
    34 - 23 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:-1
    48 - 21 cycles(tsc) - 22 cycles(tsc) - merge slower with cycles:1
    64 - 20 cycles(tsc) - 48 cycles(tsc) - merge slower with cycles:28
    128 - 27 cycles(tsc) - 57 cycles(tsc) - merge slower with cycles:30
    158 - 30 cycles(tsc) - 59 cycles(tsc) - merge slower with cycles:29
    250 - 37 cycles(tsc) - 56 cycles(tsc) - merge slower with cycles:19

    Joint work with Alexander Duyck.

    [1] https://github.com/netoptimizer/prototype-kernel/blob/master/kernel/mm/slab_bulk_test01.c

    [akpm@linux-foundation.org: BUG_ON -> WARN_ON;return]
    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Make it possible to free a freelist with several objects by adjusting the
    API of slab_free() and __slab_free() to take a head, a tail and an object
    counter (cnt).

    A NULL tail indicates a single-object free of the head object. This allows
    compiler inline constant propagation in slab_free() and
    slab_free_freelist_hook() to avoid adding any overhead in the
    single-object free case.

    This allows a freelist with several objects (all within the same
    slab-page) to be freed using a single locked cmpxchg_double in
    __slab_free() and an unlocked cmpxchg_double in slab_free().
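
    A sketch of the adjusted internal signature and how a single-object free
    maps onto it (paraphrased from the description; slab_free_one() is a
    hypothetical wrapper used only for illustration):

      static __always_inline void slab_free(struct kmem_cache *s,
                                            struct page *page,
                                            void *head, void *tail,
                                            int cnt, unsigned long addr);

      /* hypothetical helper: a single-object free in terms of the new API */
      static __always_inline void slab_free_one(struct kmem_cache *s,
                                                struct page *page, void *object,
                                                unsigned long addr)
      {
              /* tail == NULL, cnt == 1: constant propagation strips the
               * multi-object handling, keeping the fastpath unchanged */
              slab_free(s, page, object, NULL, 1, addr);
      }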

    Object debugging on the free path is also extended to handle these
    freelists. When CONFIG_SLUB_DEBUG is enabled it will also detect if
    objects don't belong to the same slab-page.

    These changes are needed for the next patch to bulk free the detached
    freelists it introduces and constructs.

    Micro benchmarking showed no performance reduction due to this change
    when debugging is turned off (though compiled with CONFIG_SLUB_DEBUG).

    Signed-off-by: Jesper Dangaard Brouer
    Signed-off-by: Alexander Duyck
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     

22 Nov, 2015

1 commit

  • Merge misc fixes from Andrew Morton:
    "A bunch of fixes"

    * emailed patches from Andrew Morton :
    slub: mark the dangling ifdef #else of CONFIG_SLUB_DEBUG
    slub: avoid irqoff/on in bulk allocation
    slub: create new ___slab_alloc function that can be called with irqs disabled
    mm: fix up sparse warning in gfpflags_allow_blocking
    ocfs2: fix umask ignored issue
    PM/OPP: add entry in MAINTAINERS
    kernel/panic.c: turn off locks debug before releasing console lock
    kernel/signal.c: unexport sigsuspend()
    kasan: fix kmemleak false-positive in kasan_module_alloc()
    fat: fix fake_offset handling on error path
    mm/hugetlbfs: fix bugs in fallocate hole punch of areas with holes
    mm/page-writeback.c: initialize m_dirty to avoid compile warning
    various: fix pci_set_dma_mask return value checking
    mm: loosen MADV_NOHUGEPAGE to enable Qemu postcopy on s390
    mm: vmalloc: don't remove inexistent guard hole in remove_vm_area()
    tools/vm/page-types.c: support KPF_IDLE
    ncpfs: don't allow negative timeouts
    configfs: allow dynamic group creation
    MAINTAINERS: add Moritz as reviewer for FPGA Manager Framework
    slab.h: sprinkle __assume_aligned attributes

    Linus Torvalds
     

21 Nov, 2015

7 commits

  • The #ifdef of CONFIG_SLUB_DEBUG is located very far from the associated
    #else. For readability mark it with a comment.
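
    For example, the marked #else then reads roughly as:

      #ifdef CONFIG_SLUB_DEBUG
      /* ... a long stretch of debug-only code ... */
      #else /* !CONFIG_SLUB_DEBUG */
      /* ... stubs used when debugging is compiled out ... */
      #endif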

    Signed-off-by: Jesper Dangaard Brouer
    Acked-by: Christoph Lameter
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jesper Dangaard Brouer
     
  • Use the new function that can do allocation while interrupts are disabled.
    Avoids irq on/off sequences.

    Signed-off-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Bulk alloc needs a function like this (___slab_alloc, callable with irqs
    disabled) because it currently enables interrupts before calling
    __slab_alloc, which promptly disables them again using the expensive
    local_irq_save().
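
    A sketch of the split, paraphrased from the description (the in-tree
    wrapper may differ in detail, e.g. re-reading the per-CPU pointer under
    CONFIG_PREEMPT): ___slab_alloc() does the work and must be called with
    IRQs disabled, while __slab_alloc() keeps the old behaviour for regular
    callers.

      static void *__slab_alloc(struct kmem_cache *s, gfp_t gfpflags, int node,
                                unsigned long addr, struct kmem_cache_cpu *c)
      {
              void *p;
              unsigned long flags;

              local_irq_save(flags);
              p = ___slab_alloc(s, gfpflags, node, addr, c);
              local_irq_restore(flags);
              return p;
      }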

    Signed-off-by: Christoph Lameter
    Cc: Jesper Dangaard Brouer
    Cc: Pekka Enberg
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Alexander Duyck
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christoph Lameter
     
  • Kmemleak reports the following leak:

    unreferenced object 0xfffffbfff41ea000 (size 20480):
    comm "modprobe", pid 65199, jiffies 4298875551 (age 542.568s)
    hex dump (first 32 bytes):
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
    backtrace:
    [] kmemleak_alloc+0x4e/0xc0
    [] __vmalloc_node_range+0x4b8/0x740
    [] kasan_module_alloc+0x72/0xc0
    [] module_alloc+0x78/0xb0
    [] module_alloc_update_bounds+0x14/0x70
    [] layout_and_allocate+0x16f4/0x3c90
    [] load_module+0x2ff/0x6690
    [] SyS_finit_module+0x136/0x170
    [] system_call_fastpath+0x16/0x1b
    [] 0xffffffffffffffff

    kasan_module_alloc() allocates shadow memory for a module and frees it on
    module unloading. It doesn't store the pointer to the allocated shadow
    memory because it can be calculated from the shadowed address, i.e.
    kasan_mem_to_shadow(addr).

    Since kmemleak cannot find a pointer to the allocated shadow, it thinks
    that the memory leaked.

    Use kmemleak_ignore() to tell kmemleak that this is not a leak and that
    the shadow memory doesn't contain any pointers.
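
    An illustrative sketch only (not the in-tree kasan code): an allocation
    whose address is recomputed elsewhere instead of being stored would be
    flagged by kmemleak, so mark it as intentionally untracked. The helper
    name below is made up for the example.

      #include <linux/kmemleak.h>
      #include <linux/vmalloc.h>

      static void *alloc_untracked_shadow(size_t shadow_size)
      {
              void *shadow = vmalloc(shadow_size);

              if (shadow)
                      kmemleak_ignore(shadow); /* not a leak, holds no pointers */
              return shadow;
      }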

    Signed-off-by: Andrey Ryabinin
    Acked-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     
  • When building the kernel with gcc 5.2, the warning below is raised:

    mm/page-writeback.c: In function 'balance_dirty_pages.isra.10':
    mm/page-writeback.c:1545:17: warning: 'm_dirty' may be used uninitialized in this function [-Wmaybe-uninitialized]
    unsigned long m_dirty, m_thresh, m_bg_thresh;

    The m_dirty, m_thresh and m_bg_thresh variables are initialized in the
    "if (mdtc)" block, so if mdtc is NULL they won't be initialized before
    being used. Initialize m_dirty to zero; also initialize m_thresh and
    m_bg_thresh to keep things consistent.

    They are used later in the if condition: !mdtc || m_dirty
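
    The fix then amounts to (as I read the description; the exact comment
    wording in the tree may differ):

      unsigned long m_dirty = 0;      /* silence the bogus warning */
      unsigned long m_thresh = 0;
      unsigned long m_bg_thresh = 0;
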
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
  • MADV_NOHUGEPAGE processing is too restrictive. kvm already disables
    hugepage but hugepage_madvise() takes the error path when we ask to turn
    on the MADV_NOHUGEPAGE bit and the bit is already on. This causes Qemu's
    new postcopy migration feature to fail on s390 because its first action is
    to madvise the guest address space as NOHUGEPAGE. This patch modifies the
    code so that the operation succeeds without error now.

    For consistency reasons do the same for MADV_HUGEPAGE.

    Signed-off-by: Jason J. Herne
    Reviewed-by: Andrea Arcangeli
    Acked-by: Christian Borntraeger
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jason J. Herne
     
  • Commit 71394fe50146 ("mm: vmalloc: add flag preventing guard hole
    allocation") missed a spot. Currently remove_vm_area() decreases vm->size
    to "remove" the guard hole page, even when it isn't present. All but one
    user just free the vm_struct right away and never access vm->size anyway.

    Don't touch the size in remove_vm_area() and have __vunmap() use the
    proper get_vm_area_size() helper.
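
    For reference, the existing helper accounts for the guard page only when
    one is actually present (a sketch; the in-tree version may differ slightly):

      static inline size_t get_vm_area_size(const struct vm_struct *area)
      {
              if (!(area->flags & VM_NO_GUARD))
                      /* return actual size without guard page */
                      return area->size - PAGE_SIZE;
              return area->size;
      }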

    Signed-off-by: Jerome Marchand
    Acked-by: Andrey Ryabinin
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     

19 Nov, 2015

1 commit

  • DAX handling of COW faults has the wrong locking sequence:
      dax_fault does i_mmap_lock_read
      do_cow_fault does i_mmap_unlock_write

    Ross's commit[1] missed a fix[2] that Kirill added to Matthew's
    commit[3].

    Original COW locking logic was introduced by Matthew here[4].

    This should be applied to v4.3 as well.

    [1] 0f90cc6609c7 mm, dax: fix DAX deadlocks
    [2] 52a2b53ffde6 mm, dax: use i_mmap_unlock_write() in do_cow_fault()
    [3] 843172978bb9 dax: fix race between simultaneous faults
    [4] 2e4cdab0584f mm: allow page fault handlers to perform the COW

    Cc:
    Cc: Boaz Harrosh
    Cc: Alexander Viro
    Cc: Dave Chinner
    Cc: Jan Kara
    Cc: "Kirill A. Shutemov"
    Cc: Matthew Wilcox
    Acked-by: Ross Zwisler
    Signed-off-by: Yigal Korman
    Signed-off-by: Dan Williams

    Yigal Korman
     

11 Nov, 2015

3 commits

  • Merge final patch-bomb from Andrew Morton:
    "Various leftovers, mainly Christoph's pci_dma_supported() removals"

    * emailed patches from Andrew Morton :
    pci: remove pci_dma_supported
    usbnet: remove ifdefed out call to dma_supported
    kaweth: remove ifdefed out call to dma_supported
    sfc: don't call dma_supported
    nouveau: don't call pci_dma_supported
    netup_unidvb: use pci_set_dma_mask insted of pci_dma_supported
    cx23885: use pci_set_dma_mask insted of pci_dma_supported
    cx25821: use pci_set_dma_mask insted of pci_dma_supported
    cx88: use pci_set_dma_mask insted of pci_dma_supported
    saa7134: use pci_set_dma_mask insted of pci_dma_supported
    saa7164: use pci_set_dma_mask insted of pci_dma_supported
    tw68-core: use pci_set_dma_mask insted of pci_dma_supported
    pcnet32: use pci_set_dma_mask insted of pci_dma_supported
    lib/string.c: add ULL suffix to the constant definition
    hugetlb: trivial comment fix
    selftests/mlock2: add ULL suffix to 64-bit constants
    selftests/mlock2: add missing #define _GNU_SOURCE

    Linus Torvalds
     
  • Recently alloc_buddy_huge_page() was renamed to __alloc_buddy_huge_page(),
    so let's sync comments.

    Signed-off-by: Naoya Horiguchi
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • In commit a1c34a3bf00a ("mm: Don't offset memmap for flatmem") Laura
    fixed a problem for Srinivas relating to the bottom 2MB of RAM on an ARM
    IFC6410 board.

    One small wrinkle on ia64 is that it allocates the node_mem_map earlier
    in arch code, so it skips the block of code where "offset" is
    initialized.

    Move initialization of start and offset before the check for the
    node_mem_map so that they will always be available in the latter part of
    the function.

    Tested-by: Laura Abbott
    Fixes: a1c34a3bf00a (mm: Don't offset memmap for flatmem)
    Signed-off-by: Tony Luck
    Signed-off-by: Linus Torvalds

    Tony Luck
     

08 Nov, 2015

2 commits

  • Merge second patch-bomb from Andrew Morton:

    - most of the rest of MM

    - procfs

    - lib/ updates

    - printk updates

    - bitops infrastructure tweaks

    - checkpatch updates

    - nilfs2 update

    - signals

    - various other misc bits: coredump, seqfile, kexec, pidns, zlib, ipc,
    dma-debug, dma-mapping, ...

    * emailed patches from Andrew Morton : (102 commits)
    ipc,msg: drop dst nil validation in copy_msg
    include/linux/zutil.h: fix usage example of zlib_adler32()
    panic: release stale console lock to always get the logbuf printed out
    dma-debug: check nents in dma_sync_sg*
    dma-mapping: tidy up dma_parms default handling
    pidns: fix set/getpriority and ioprio_set/get in PRIO_USER mode
    kexec: use file name as the output message prefix
    fs, seqfile: always allow oom killer
    seq_file: reuse string_escape_str()
    fs/seq_file: use seq_* helpers in seq_hex_dump()
    coredump: change zap_threads() and zap_process() to use for_each_thread()
    coredump: ensure all coredumping tasks have SIGNAL_GROUP_COREDUMP
    signal: remove jffs2_garbage_collect_thread()->allow_signal(SIGCONT)
    signal: introduce kernel_signal_stop() to fix jffs2_garbage_collect_thread()
    signal: turn dequeue_signal_lock() into kernel_dequeue_signal()
    signals: kill block_all_signals() and unblock_all_signals()
    nilfs2: fix gcc uninitialized-variable warnings in powerpc build
    nilfs2: fix gcc unused-but-set-variable warnings
    MAINTAINERS: nilfs2: add header file for tracing
    nilfs2: add tracepoints for analyzing reading and writing metadata files
    ...

    Linus Torvalds
     
  • Pull trivial updates from Jiri Kosina:
    "Trivial stuff from trivial tree that can be trivially summed up as:

    - treewide drop of spurious unlikely() before IS_ERR() from Viresh
    Kumar

    - cosmetic fixes (that don't really affect basic functionality of the
    driver) for pktcdvd and bcache, from Julia Lawall and Petr Mladek

    - various comment / printk fixes and updates all over the place"

    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
    bcache: Really show state of work pending bit
    hwmon: applesmc: fix comment typos
    Kconfig: remove comment about scsi_wait_scan module
    class_find_device: fix reference to argument "match"
    debugfs: document that debugfs_remove*() accepts NULL and error values
    net: Drop unlikely before IS_ERR(_OR_NULL)
    mm: Drop unlikely before IS_ERR(_OR_NULL)
    fs: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: net: Drop unlikely before IS_ERR(_OR_NULL)
    drivers: misc: Drop unlikely before IS_ERR(_OR_NULL)
    UBI: Update comments to reflect UBI_METAONLY flag
    pktcdvd: drop null test before destroy functions

    Linus Torvalds
     

07 Nov, 2015

20 commits

  • Let's try to be consistent about data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that the compound_head() call can be unsafe in some
    contexts. Here's one example:

    CPU0                                        CPU1

    isolate_migratepages_block()
      page_count()
        compound_head()
          !!PageTail() == true
                                                put_page()
                                                  tail->first_page = NULL
          head = tail->first_page
                                                alloc_pages(__GFP_COMP)
                                                  prep_compound_page()
                                                    tail->first_page = head
                                                    __SetPageTail(p);
          !!PageTail() == true

    The race is purely theoretical. I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how we encode PageTail() and
    compound_head() within struct page, so that both can be updated in one
    shot.

    The patch introduces page->compound_head in the third double-word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail(),
    and if bit 0 is set the remaining bits are a pointer to the head page.
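
    With this encoding, compound_head() becomes a single load plus a bit test,
    roughly:

      static inline struct page *compound_head(struct page *page)
      {
              unsigned long head = READ_ONCE(page->compound_head);

              if (head & 1)
                      return (struct page *)(head - 1);
              return page;
      }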

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens up the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the first
    tail page to store the struct hugetlb_cgroup pointer. But that's out of
    scope for this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
    call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
    call_rcu_lazy() is not allowed, as it makes use of that bit and we could
    get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves the space occupied by compound_dtor and compound_order
    in struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for destructor
    lookup and store an index into it in struct page instead of a direct
    pointer to the destructor. It shouldn't be much trouble to maintain the
    table: we only have two destructors plus NULL at present.
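
    Roughly how the hardcoded table works (an illustrative sketch; the exact
    in-tree identifiers may differ): struct page stores a small enum index,
    and the destructor is looked up in a fixed array.

      enum compound_dtor_id {
              NULL_COMPOUND_DTOR,
              COMPOUND_PAGE_DTOR,
              HUGETLB_PAGE_DTOR,
              NR_COMPOUND_DTORS,
      };

      typedef void compound_page_dtor(struct page *);

      static compound_page_dtor * const compound_page_dtors[NR_COMPOUND_DTORS] = {
              [NULL_COMPOUND_DTOR]  = NULL,
              [COMPOUND_PAGE_DTOR]  = free_compound_page,
              [HUGETLB_PAGE_DTOR]   = free_huge_page,
      };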

    This patch frees up one word in tail pages for reuse. This is preparation
    for the next patch.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We are going to rework how compound_head() works. It will no longer use
    page->first_page as it does now.

    The only other user of page->first_page beyond compound pages is
    zsmalloc.

    Let's use page->private instead of page->first_page here. It occupies
    the same storage space.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Reviewed-by: Sergey Senozhatsky
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • We have a properly typed page->rcu_head; there is no need to cast page->lru.

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Andrea Arcangeli
    Acked-by: Christoph Lameter
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Each `struct size_class' contains `struct zs_size_stat': an array of
    NR_ZS_STAT_TYPE `unsigned long'. For zsmalloc built with no
    CONFIG_ZSMALLOC_STAT this results in a waste of `2 * sizeof(unsigned
    long)' per-class.

    The patch removes unneeded `struct zs_size_stat' members by redefining
    NR_ZS_STAT_TYPE (max stat idx in array).

    Since both NR_ZS_STAT_TYPE and zs_stat_type are compile time constants,
    GCC can eliminate zs_stat_inc()/zs_stat_dec() calls that use zs_stat_type
    larger than NR_ZS_STAT_TYPE: CLASS_ALMOST_EMPTY and CLASS_ALMOST_FULL at
    the moment.

    ./scripts/bloat-o-meter mm/zsmalloc.o.old mm/zsmalloc.o.new
    add/remove: 0/0 grow/shrink: 0/3 up/down: 0/-39 (-39)
    function old new delta
    fix_fullness_group 97 94 -3
    insert_zspage 100 86 -14
    remove_zspage 141 119 -22

    To summarize:
    a) each class now uses less memory
    b) we avoid a number of dec/inc stats (a minor optimization,
    but still).

    The gain will increase once we introduce additional stats.

    A simple IO test.

    iozone -t 4 -R -r 32K -s 60M -I +Z
    patched base
    " Initial write " 4145599.06 4127509.75
    " Rewrite " 4146225.94 4223618.50
    " Read " 17157606.00 17211329.50
    " Re-read " 17380428.00 17267650.50
    " Reverse Read " 16742768.00 16162732.75
    " Stride read " 16586245.75 16073934.25
    " Random read " 16349587.50 15799401.75
    " Mixed workload " 10344230.62 9775551.50
    " Random write " 4277700.62 4260019.69
    " Pwrite " 4302049.12 4313703.88
    " Pread " 6164463.16 6126536.72
    " Fwrite " 7131195.00 6952586.00
    " Fread " 12682602.25 12619207.50

    Signed-off-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • Signed-off-by: Hui Zhu
    Reviewed-by: Sergey Senozhatsky
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • We don't let the user disable the shrinker in zsmalloc (once it's been
    enabled), so there is no need to check ->shrinker_enabled in
    zs_shrinker_count(), at the moment at least.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • A cosmetic change.

    Commit c60369f01125 ("staging: zsmalloc: prevent mappping in interrupt
    context") added an in_interrupt() check to zs_map_object() and a
    'hardirq.h' include; but the in_interrupt() macro is defined in
    'preempt.h', not in 'hardirq.h', so include that instead.

    Signed-off-by: Sergey Senozhatsky
    Acked-by: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey Senozhatsky
     
  • In obj_malloc():

      if (!class->huge)
              /* record handle in the header of allocated chunk */
              link->handle = handle;
      else
              /* record handle in first_page->private */
              set_page_private(first_page, handle);

    So for a huge-class page we save the handle to first_page->private
    directly.

    But in obj_to_head():

      if (class->huge) {
              VM_BUG_ON(!is_first_page(page));
              return *(unsigned long *)page_private(page);
      } else
              return *(unsigned long *)obj;

    the stored value is dereferenced as if it were a pointer.

    The reason there has been no problem until now is that a huge-class page
    is born with ZS_FULL, so it can't be migrated. However, we need this patch
    for future work, "VM-aware zsmalloced page migration", to reduce external
    fragmentation.

    Signed-off-by: Hui Zhu
    Acked-by: Minchan Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • [akpm@linux-foundation.org: fix grammar]
    Signed-off-by: Hui Zhu
    Reviewed-by: Sergey Senozhatsky
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hui Zhu
     
  • Constify `struct zs_pool' ->name.

    [akpm@linux-foundation.org: constify zpool_create_pool()'s `type' arg also]
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey SENOZHATSKY
     
  • Make the return type of zpool_get_type const; the string belongs to the
    zpool driver and should not be modified. Remove the redundant type field
    in the struct zpool; it is private to zpool.c and isn't needed since
    ->driver->type can be used directly. Add comments indicating strings must
    be null-terminated.
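
    With the redundant copy gone, zpool_get_type() can simply hand back the
    driver's string (a sketch, assuming the zpool/driver layout described):

      const char *zpool_get_type(struct zpool *zpool)
      {
              return zpool->driver->type;
      }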

    Signed-off-by: Dan Streetman
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Instead of using a fixed-length string for the zswap params, use charp.
    This simplifies the code and uses less memory, as most zswap param strings
    will be less than the current maximum length.
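
    A sketch of the charp form (the parameter names, defaults and permissions
    shown here are illustrative, not necessarily the tree's values):

      static char *zswap_compressor = "lzo";
      module_param_named(compressor, zswap_compressor, charp, 0444);

      static char *zswap_zpool_type = "zbud";
      module_param_named(zpool, zswap_zpool_type, charp, 0444);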

    Signed-off-by: Dan Streetman
    Cc: Rusty Russell
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • On the next line the entry variable will be re-initialized, so there is no
    need to initialize it with NULL.

    Signed-off-by: Alexey Klimov
    Cc: Seth Jennings
    Cc: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexey Klimov
     
  • gcc version 5.2.1 20151010 (Debian 5.2.1-22)
    $ size mm/memcontrol.o mm/memcontrol.o.before
    text data bss dec hex filename
    35535 7908 64 43507 a9f3 mm/memcontrol.o
    35762 7908 64 43734 aad6 mm/memcontrol.o.before

    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • The "vma" parameter to khugepaged_alloc_page() is unused. It has to
    remain unused or the drop read lock 'map_sem' optimisation introduce by
    commit 8b1645685acf ("mm, THP: don't hold mmap_sem in khugepaged when
    allocating THP") wouldn't be safe. So let's remove it.

    Signed-off-by: Aaron Tomlin
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Aaron Tomlin
     
  • There are many places which use mapping_gfp_mask to restrict a more
    generic gfp mask which would be used for allocations that are not
    directly related to the page cache but are performed in the same context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.
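
    A minimal sketch of such a helper (I believe the in-tree name is
    mapping_gfp_constraint(), but treat the exact name and placement as an
    assumption here):

      static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                                 gfp_t gfp_mask)
      {
              return mapping_gfp_mask(mapping) & gfp_mask;
      }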

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Andrew stated the following

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged, but the clarity might
    motivate some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The primary purpose of watermarks is to ensure that reclaim can always
    make forward progress in PF_MEMALLOC context (kswapd and direct reclaim).
    These assume that order-0 allocations are all that is necessary for
    forward progress.

    High-order watermarks serve a different purpose. Kswapd had no high-order
    awareness before they were introduced
    (https://lkml.kernel.org/r/413AA7B2.4000907@yahoo.com.au). This was
    particularly important when there were high-order atomic requests. The
    watermarks both gave kswapd awareness and made a reserve for those atomic
    requests.

    There are two important side-effects of this. The most important is that
    a non-atomic high-order request can fail even though free pages are
    available and the order-0 watermarks are ok. The second is that
    high-order watermark checks are expensive, as the free list counts for
    all orders up to the requested order must be examined.

    With the introduction of MIGRATE_HIGHATOMIC it is no longer necessary to
    have high-order watermarks. Kswapd and compaction still need high-order
    awareness which is handled by checking that at least one suitable
    high-order page is free.
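
    A conceptual sketch of that check (hypothetical helper name; the real code
    also walks the per-migratetype free lists): once the order-0 watermark is
    met, it is enough that some free page of a suitable order exists.

      static bool zone_has_free_highorder_page(struct zone *z, unsigned int order)
      {
              unsigned int o;

              for (o = order; o < MAX_ORDER; o++)
                      if (z->free_area[o].nr_free)
                              return true;
              return false;
      }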

    With the patch applied, there was little difference in the allocation
    failure rates as the atomic reserves are small relative to the number of
    allocation attempts. The expected impact is that there will never be an
    allocation failure report that shows suitable pages on the free lists.

    The one potential side-effect of this is that in a vanilla kernel, the
    watermark checks may have kept a free page for an atomic allocation. Now,
    we are 100% relying on the HighAtomic reserves and an early allocation to
    have allocated them. If the first high-order atomic allocation is after
    the system is already heavily fragmented then it'll fail.

    [akpm@linux-foundation.org: simplify __zone_watermark_ok(), per Vlastimil]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman