07 Nov, 2015

35 commits

  • adler32 was renamed to zlib_adler32 back before 2.6.11.

    Signed-off-by: Anish Bhatt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anish Bhatt
     
  • Many DMA controllers and other devices set max_segment_size to
    indicate their scatter-gather capability, but have no interest in
    segment_boundary_mask. However, the existence of a dma_parms structure
    precludes the use of any default value, leaving them as zeros (assuming
    a properly kzalloc'ed structure). If a well-behaved IOMMU (or SWIOTLB)
    then tries to respect this by ensuring a mapped segment does not cross
    a zero-byte boundary, hilarity ensues.

    Since zero is a nonsensical value for either parameter, treat it as an
    indicator for "default", as might be expected. In the process, clean up
    a bit by replacing the bare constants with slightly more meaningful
    macros and removing the superfluous "else" statements.
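
    As a rough sketch of the intended semantics (not the literal patch; the
    helper bodies in mainline dma-mapping.h may differ in detail), a zeroed
    dma_parms field now simply falls back to a sane default:

    static inline unsigned int dma_get_max_seg_size(struct device *dev)
    {
            if (dev->dma_parms && dev->dma_parms->max_segment_size)
                    return dev->dma_parms->max_segment_size;
            return SZ_64K;                  /* default when left as zero */
    }

    static inline unsigned long dma_get_seg_boundary(struct device *dev)
    {
            if (dev->dma_parms && dev->dma_parms->segment_boundary_mask)
                    return dev->dma_parms->segment_boundary_mask;
            return DMA_BIT_MASK(32);        /* default when left as zero */
    }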

    [akpm@linux-foundation.org: dma-mapping.h needs sizes.h for SZ_64K]
    Signed-off-by: Robin Murphy
    Reviewed-by: Sumit Semwal
    Acked-by: Marek Szyprowski
    Cc: Arnd Bergmann
    Cc: Sakari Ailus
    Cc: Russell King
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Murphy
     
  • jffs2_garbage_collect_thread() can race with SIGCONT and end up sleeping
    in TASK_STOPPED state after SIGCONT has already been delivered. Add a new
    helper, kernel_signal_stop(), which does this correctly.
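
    A minimal sketch of what such a helper has to do -- re-check the pending
    stop and change state under ->siglock, so a concurrent SIGCONT cannot slip
    in between the check and the sleep (the mainline body may differ in
    detail):

    static inline void kernel_signal_stop(void)
    {
            spin_lock_irq(&current->sighand->siglock);
            if (current->jobctl & JOBCTL_STOP_DEQUEUED)
                    __set_current_state(TASK_STOPPED);
            spin_unlock_irq(&current->sighand->siglock);

            schedule();
    }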

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo
    Cc: David Woodhouse
    Cc: Felipe Balbi
    Cc: Markus Pargmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • 1. Rename dequeue_signal_lock() to kernel_dequeue_signal(). This
    matches another "for kthreads only" kernel_sigaction() helper.

    2. Remove the "tsk" and "mask" arguments, they are always current
    and current->blocked. And it is simply wrong if tsk != current.

    3. We could also remove the 3rd "siginfo_t *info" arg, but it looks
    potentially useful. However, we can simplify the callers if we change
    kernel_dequeue_signal() to accept info == NULL (see the sketch after
    this list).

    4. Remove _irqsave, it is never called from atomic context.
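
    A sketch of the resulting helper and a typical kthread caller (details
    trimmed; the mainline body may differ slightly):

    static inline int kernel_dequeue_signal(siginfo_t *info)
    {
            struct task_struct *tsk = current;
            siginfo_t __info;
            int ret;

            spin_lock_irq(&tsk->sighand->siglock);
            ret = dequeue_signal(tsk, &tsk->blocked, info ?: &__info);
            spin_unlock_irq(&tsk->sighand->siglock);

            return ret;
    }

    /* a kthread that only cares which signal arrived: */
    if (signal_pending(current))
            signr = kernel_dequeue_signal(NULL);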

    Signed-off-by: Oleg Nesterov
    Reviewed-by: Tejun Heo
    Cc: David Woodhouse
    Cc: Felipe Balbi
    Cc: Markus Pargmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • It is hardly possible to enumerate all problems with block_all_signals()
    and unblock_all_signals(). Just for example,

    1. block_all_signals(SIGSTOP/etc) simply can't help if the caller is
    multithreaded. Another thread can dequeue the signal and force the
    group stop.

    2. Even if the caller is single-threaded, it will "stop" anyway. It
    will not sleep, but it will spin in kernel space until SIGCONT or
    SIGKILL.

    And a lot more. In short, this interface hasn't worked at all for at
    least the last 10+ years.

    Daniel said:

    Yeah the only times I played around with the DRM_LOCK stuff was when
    old drivers accidentally deadlocked - my impression is that the entire
    DRM_LOCK thing was never really tested properly ;-) Hence I'm all for
    purging where this leaks out of the drm subsystem.

    Signed-off-by: Oleg Nesterov
    Acked-by: Daniel Vetter
    Acked-by: Dave Airlie
    Cc: Richard Weinberger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Oleg Nesterov
     
  • This patch adds tracepoints for analyzing requests for reading and writing
    metadata files. The tracepoints cover all in-place mdt files (cpfile,
    sufile, and datfile).

    Example of tracing mdt_insert_new_block():
    cp-14635 [000] ...1 30598.199309: nilfs2_mdt_insert_new_block: inode = ffff88022a8d0178 ino = 3 block = 155
    cp-14635 [000] ...1 30598.199520: nilfs2_mdt_insert_new_block: inode = ffff88022a8d0178 ino = 3 block = 5
    cp-14635 [000] ...1 30598.200828: nilfs2_mdt_insert_new_block: inode = ffff88022a8d0178 ino = 3 block = 253

    Signed-off-by: Hitoshi Mitake
    Signed-off-by: Ryusuke Konishi
    Cc: Steven Rostedt
    Cc: TK Kato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hitoshi Mitake
     
  • This patch adds tracepoints that would be useful for analyzing segment
    usage from the perspective of high-level sufile manipulation (check, alloc,
    free). sufile is an important in-place updated metadata file, so
    analyzing its behavior is useful for performance tuning.

    example of usage (a case of allocation):

    $ sudo bin/tpoint nilfs2:nilfs2_segment_usage_allocated
    Tracing nilfs2:nilfs2_segment_usage_allocated. Ctrl-C to end.
    segctord-17800 [002] ...1 10671.867294: nilfs2_segment_usage_allocated: sufile = ffff880054f908a8 segnum = 2
    segctord-17800 [002] ...1 10675.073477: nilfs2_segment_usage_allocated: sufile = ffff880054f908a8 segnum = 3

    Signed-off-by: Hitoshi Mitake
    Signed-off-by: Ryusuke Konishi
    Cc: Steven Rostedt
    Cc: Benixon Dhas
    Cc: TK Kato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hitoshi Mitake
     
  • This patch adds a tracepoint for transaction events of nilfs. With the
    tracepoint, these events can be tracked: begin, abort, commit, trylock,
    lock, and unlock. Basically, these events have corresponding functions,
    e.g. the begin event corresponds to nilfs_transaction_begin(). The unlock
    event is an exception; it corresponds to the iteration in
    nilfs_transaction_lock().

    Only one tracepoint is introduced: nilfs2_transaction_transition. The
    above events are distinguished with a newly introduced enum. With this
    tracepoint, we can analyze a critical section of segment construction.

    Sample output by tpoint of perf-tools:
    cp-4457 [000] ...1 63.266220: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800bf5ccc58 count = 1 flags = 9 state = BEGIN
    cp-4457 [000] ...1 63.266221: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800bf5ccc58 count = 0 flags = 9 state = COMMIT
    cp-4457 [000] ...1 63.266221: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800bf5ccc58 count = 0 flags = 9 state = COMMIT
    segctord-4371 [001] ...1 68.261196: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800b889bdf8 count = 0 flags = 10 state = TRYLOCK
    segctord-4371 [001] ...1 68.261280: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800b889bdf8 count = 0 flags = 10 state = LOCK
    segctord-4371 [001] ...1 68.261877: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800b889bdf8 count = 1 flags = 10 state = BEGIN
    segctord-4371 [001] ...1 68.262116: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800b889bdf8 count = 0 flags = 18 state = COMMIT
    segctord-4371 [001] ...1 68.265032: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800b889bdf8 count = 0 flags = 18 state = UNLOCK
    segctord-4371 [001] ...1 132.376847: nilfs2_transaction_transition: sb = ffff8802112b8800 ti = ffff8800b889bdf8 count = 0 flags = 10 state = TRYLOCK

    This patch also does trivial cleaning of comma usage in collection stage
    transition event for consistent coding style.

    Signed-off-by: Hitoshi Mitake
    Signed-off-by: Ryusuke Konishi
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hitoshi Mitake
     
  • This patch adds a tracepoint for tracking stage transitions of block
    collection in segment construction. With the tracepoint, we can analyze
    the behavior of segment construction in depth. It would be useful for
    bottleneck detection, debugging, etc.

    The tracepoint is created with the standard trace API of Linux (like ext3,
    ext4, f2fs and btrfs), so we can analyze it easily with existing tools. Of
    course, more detailed analysis will be possible if we create nilfs-specific
    analysis tools.

    Below is an example of event dump with Brendan Gregg's perf-tools
    (https://github.com/brendangregg/perf-tools). Time consumption between
    each stage can be obtained.

    $ sudo bin/tpoint nilfs2:nilfs2_collection_stage_transition
    Tracing nilfs2:nilfs2_collection_stage_transition. Ctrl-C to end.
    segctord-14875 [003] ...1 28311.067794: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_INIT
    segctord-14875 [003] ...1 28311.068139: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_GC
    segctord-14875 [003] ...1 28311.068139: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_FILE
    segctord-14875 [003] ...1 28311.068486: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_IFILE
    segctord-14875 [003] ...1 28311.068540: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_CPFILE
    segctord-14875 [003] ...1 28311.068561: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_SUFILE
    segctord-14875 [003] ...1 28311.068565: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_DAT
    segctord-14875 [003] ...1 28311.068573: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_SR
    segctord-14875 [003] ...1 28311.068574: nilfs2_collection_stage_transition: sci = ffff8800ce6de000 stage = ST_DONE

    To capture transitions correctly, this patch adds wrappers for the scnt
    member of nilfs_cstage. With this change, every stage transition can emit
    a trace event correctly.

    Signed-off-by: Hitoshi Mitake
    Signed-off-by: Ryusuke Konishi
    Cc: Steven Rostedt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Hitoshi Mitake
     
  • I noticed that commit a20135ffbc44 ("writeback: don't drain
    bdi_writeback_congested on bdi destruction") added a usage of
    rbtree_postorder_for_each_entry_safe() in mm/backing-dev.c which appears
    to try to rb_erase() elements from an rbtree while iterating over it using
    rbtree_postorder_for_each_entry_safe().

    Doing this will cause random nodes to be missed by the iteration because
    rb_erase() may rebalance the tree, changing the ordering that we're trying
    to iterate over.

    The previous documentation for rbtree_postorder_for_each_entry_safe()
    wasn't clear that this isn't allowed; it was taken from the docs for
    list_for_each_entry_safe(), where erasing isn't a problem because
    list_del() doesn't reorder the list.

    Explicitly warn developers about this potential pitfall.

    Note that I haven't fixed the actual issue that (it appears) the commit
    referenced above introduced (not familiar enough with that code).

    In general (and in this case), the patterns to follow are:

    - switch to rb_first() + rb_erase(); don't use
      rbtree_postorder_for_each_entry_safe() (a sketch of this pattern
      follows below);
    - keep the postorder iteration and don't rb_erase() at all. Instead
      just clear the fields of rb_node & cgwb_congested_tree as required by
      other users of those structures.
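
    A minimal sketch of the first pattern; struct my_entry, its rb member and
    the root tree are placeholders, not names from the affected code:

    struct rb_node *node;

    while ((node = rb_first(&root))) {
            struct my_entry *entry = rb_entry(node, struct my_entry, rb);

            rb_erase(node, &root);  /* safe: rb_first() is re-read each pass */
            kfree(entry);
    }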

    [akpm@linux-foundation.org: tweak comments]
    Signed-off-by: Cody P Schafer
    Cc: John de la Garza
    Cc: Michel Lespinasse
    Cc: Peter Zijlstra
    Cc: Rusty Russell
    Cc: Tejun Heo
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Cody P Schafer
     
  • This adds kvasprintf_const which tries to use kstrdup_const if possible:
    If the format string contains no % characters, or if the format string is
    exactly "%s", we delegate to kstrdup_const. Otherwise, we fall back to
    kvasprintf.

    Just as for kstrdup_const, the main motivation is to save memory by
    reusing .rodata when possible.

    The return value should be freed by kfree_const, just like for
    kstrdup_const.

    There is deliberately no kasprintf_const: In the vast majority of cases,
    the format string argument is a literal, so one can determine statically
    whether one could instead use kstrdup_const directly (which would also
    require one to change all corresponding kfree calls to kfree_const).
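
    An illustrative caller (inside a varargs setter; error handling omitted),
    showing that the result must be released with kfree_const():

    char *name;
    va_list ap;

    va_start(ap, fmt);
    name = kvasprintf_const(GFP_KERNEL, fmt, ap);   /* may point into .rodata */
    va_end(ap);

    /* ... use name ... */

    kfree_const(name);      /* never plain kfree() on this pointer */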

    Signed-off-by: Rasmus Villemoes
    Cc: Greg KH
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Rasmus Villemoes
     
  • Months back, this was discussed; see https://lkml.org/lkml/2015/1/18/289
    The result was that the 64-bit version, sign_extend64(), was considered
    "likely fine", "valuable" and "correct". The discussion petered out, but
    since there are possible users, let's add it.
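
    The new helper mirrors the existing 32-bit one; roughly:

    static inline __s64 sign_extend64(__u64 value, int index)
    {
            __u8 shift = 63 - index;
            return (__s64)(value << shift) >> shift;
    }

    /* e.g. treating bit 7 as the sign bit: sign_extend64(0x80, 7) == -128 */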

    Signed-off-by: Martin Kepplinger
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: George Spelvin
    Cc: Rasmus Villemoes
    Cc: Maxime Coquelin
    Cc: Denys Vlasenko
    Cc: Yury Norov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Kepplinger
     
  • It is often overlooked that sign_extend32(), despite its name, is safe to
    use for 16- and 8-bit types as well. This should help prevent sign
    extension from being done manually some other way.
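
    For example, widening a 12-bit two's-complement ADC sample held in a u16
    needs no hand-rolled masking and shifting (the values are illustrative):

    u16 raw = 0x0FA3;                       /* 12-bit sample, bit 11 is the sign bit */
    int value = sign_extend32(raw, 11);     /* == -93 */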

    Signed-off-by: Martin Kepplinger
    Cc: Peter Zijlstra
    Cc: Ingo Molnar
    Cc: Arnaldo Carvalho de Melo
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: George Spelvin
    Cc: Rasmus Villemoes
    Cc: Maxime Coquelin
    Cc: Denys Vlasenko
    Cc: Yury Norov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Martin Kepplinger
     
  • Cc: Andi Kleen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • On 64 bit system we have enough space in struct page to encode
    compound_dtor and compound_order with unsigned int.

    On x86-64 it leads to slightly smaller code size due to use of a plain
    MOV instead of MOVZX (zero-extended move), or a similar effect.

    allyesconfig:

    text data bss dec hex filename
    159520446 48146736 72196096 279863278 10ae5fee vmlinux.pre
    159520382 48146736 72196096 279863214 10ae5fae vmlinux.post

    On other architectures without native support of 16-bit data types the

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Let's try to be consistent about data type of page order.

    [sfr@canb.auug.org.au: fix build (type of pageblock_order)]
    [hughd@google.com: some configs end up with MAX_ORDER and pageblock_order having different types]
    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Stephen Rothwell
    Signed-off-by: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Hugh has pointed out that the compound_head() call can be unsafe in some
    contexts. Here is one example:

    CPU0: isolate_migratepages_block()
    CPU0:   page_count()
    CPU0:     compound_head()
    CPU0:       !!PageTail() == true
    CPU1: put_page()
    CPU1:   tail->first_page = NULL
    CPU0:       head = tail->first_page
    CPU1: alloc_pages(__GFP_COMP)
    CPU1:   prep_compound_page()
    CPU1:     tail->first_page = head
    CPU1:     __SetPageTail(p);
    CPU0:       !!PageTail() == true

    The race is purely theoretical; I don't think it's possible to trigger it
    in practice. But who knows.

    We can fix the race by changing how PageTail() and compound_head() are
    encoded within struct page, so that they can be updated in one shot.

    The patch introduces page->compound_head into the third double-word block,
    in front of compound_dtor and compound_order. Bit 0 encodes PageTail(),
    and when it is set the remaining bits are a pointer to the head page.
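
    Decoding then becomes a single load plus a bit test; roughly:

    static inline struct page *compound_head(struct page *page)
    {
            unsigned long head = READ_ONCE(page->compound_head);

            if (head & 1)
                    return (struct page *)(head - 1);
            return page;
    }

    static inline int PageTail(struct page *page)
    {
            return READ_ONCE(page->compound_head) & 1;
    }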

    The patch moves page->pmd_huge_pte out of that word, just in case an
    architecture defines pgtable_t as something that can have bit 0 set.

    hugetlb_cgroup uses page->lru.next in the second tail page to store a
    pointer to struct hugetlb_cgroup. The patch switches it to use
    page->private in the second tail page instead. The space is free since
    ->first_page is removed from the union.

    The patch also opens the possibility of removing the
    HUGETLB_CGROUP_MIN_ORDER limitation, since there is now space in the first
    tail page to store a struct hugetlb_cgroup pointer. But that's out of the
    scope of this patch.

    That means page->compound_head shares storage space with:

    - page->lru.next;
    - page->next;
    - page->rcu_head.next;

    That's too long a list to be absolutely sure, but it looks like nobody
    uses bit 0 of that word.

    page->rcu_head.next is guaranteed[1] to have bit 0 clear as long as we use
    call_rcu(), call_rcu_bh(), call_rcu_sched(), or call_srcu(). But a future
    call_rcu_lazy() is not allowed, as it makes use of the bit and we could
    get a false-positive PageTail().

    [1] http://lkml.kernel.org/g/20150827163634.GD4029@linux.vnet.ibm.com

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Hugh Dickins
    Cc: David Rientjes
    Cc: Vlastimil Babka
    Acked-by: Paul E. McKenney
    Cc: Aneesh Kumar K.V
    Cc: Andi Kleen
    Cc: Christoph Lameter
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • The patch halves the space occupied by compound_dtor and compound_order in
    struct page.

    For compound_order, it's a trivial long -> short conversion.

    For get_compound_page_dtor(), we now use a hardcoded table for destructor
    lookup and store its index in struct page instead of a direct pointer to
    the destructor. It shouldn't be much trouble to maintain the table: we
    currently have only two destructors and NULL.

    This patch frees up one word in tail pages for reuse. This is preparation
    for the next patch.
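
    A sketch of the table-based lookup (roughly what the patch does; exact
    names and sanity checks may differ):

    enum compound_dtor_id {
            NULL_COMPOUND_DTOR,
            COMPOUND_PAGE_DTOR,
    #ifdef CONFIG_HUGETLB_PAGE
            HUGETLB_PAGE_DTOR,
    #endif
            NR_COMPOUND_DTORS,
    };

    compound_page_dtor * const compound_page_dtors[] = {
            NULL,
            free_compound_page,
    #ifdef CONFIG_HUGETLB_PAGE
            free_huge_page,
    #endif
    };

    /* struct page stores the small enum index instead of a function pointer */
    static inline compound_page_dtor *get_compound_page_dtor(struct page *page)
    {
            return compound_page_dtors[page[1].compound_dtor];
    }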

    Signed-off-by: Kirill A. Shutemov
    Reviewed-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: "Paul E. McKenney"
    Cc: Andi Kleen
    Cc: Aneesh Kumar K.V
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Hugh Dickins
    Cc: Joonsoo Kim
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Since 8456a648cf44 ("slab: use struct page for slab management") nobody
    uses slab_page field in struct page.

    Let's drop it.

    Signed-off-by: Kirill A. Shutemov
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Reviewed-by: Andrea Arcangeli
    Cc: Joonsoo Kim
    Cc: Andi Kleen
    Cc: "Paul E. McKenney"
    Cc: Aneesh Kumar K.V
    Cc: Hugh Dickins
    Cc: Michal Hocko
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • Constify `struct zs_pool' ->name.

    [akpm@inux-foundation.org: constify zpool_create_pool()'s `type' arg also]
    Signed-off-by: Sergey Senozhatsky
    Acked-by: Dan Streetman
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sergey SENOZHATSKY
     
  • Make the return type of zpool_get_type const; the string belongs to the
    zpool driver and should not be modified. Remove the redundant type field
    in the struct zpool; it is private to zpool.c and isn't needed since
    ->driver->type can be used directly. Add comments indicating strings must
    be null-terminated.

    Signed-off-by: Dan Streetman
    Cc: Sergey Senozhatsky
    Cc: Seth Jennings
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • Change the param_free_charp() function from static to exported.

    It is used by zswap in the next patch ("zswap: use charp for zswap param
    strings").

    Signed-off-by: Dan Streetman
    Acked-by: Rusty Russell
    Cc: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dan Streetman
     
  • There are many places that use mapping_gfp_mask to restrict a more
    generic gfp mask used for allocations that are not directly related to
    the page cache but are performed in the same context.

    Let's introduce a helper function which makes the restriction explicit and
    easier to track. This patch doesn't introduce any functional changes.
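
    The helper itself is a one-liner; a conversion then reads as an explicit
    restriction rather than an open-coded AND (the caller below is
    illustrative):

    static inline gfp_t mapping_gfp_constraint(struct address_space *mapping,
                                               gfp_t gfp_mask)
    {
            return mapping_gfp_mask(mapping) & gfp_mask;
    }

    /* e.g. a path that must not recurse into the filesystem: */
    struct page *page =
            __page_cache_alloc(mapping_gfp_constraint(mapping, GFP_NOFS));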

    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Michal Hocko
    Suggested-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Someone has an 86 column display.

    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrew Morton
     
  • Andrew stated the following

    We have quite a history of remote parts of the kernel using
    weird/wrong/inexplicable combinations of __GFP_ flags. I tend
    to think that this is because we didn't adequately explain the
    interface.

    And I don't think that gfp.h really improved much in this area as
    a result of this patchset. Could you go through it some time and
    decide if we've adequately documented all this stuff?

    This patch first moves some GFP flag combinations that are part of the MM
    internals to mm/internal.h. The rest of the patch documents the __GFP_FOO
    bits under various headings and then documents the flag combinations. It
    will not help callers that are brain damaged but the clarity might motivate
    some fixes and avoid future mistakes.

    Signed-off-by: Mel Gorman
    Cc: Johannes Weiner
    Cc: Rik van Riel
    Cc: Vlastimil Babka
    Cc: David Rientjes
    Cc: Joonsoo Kim
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • High-order watermark checking exists for two reasons -- kswapd high-order
    awareness and protection for high-order atomic requests. Historically the
    kernel depended on MIGRATE_RESERVE to preserve min_free_kbytes as
    high-order free pages for as long as possible. This patch introduces
    MIGRATE_HIGHATOMIC that reserves pageblocks for high-order atomic
    allocations on demand and avoids using those blocks for order-0
    allocations. This is more flexible and reliable than MIGRATE_RESERVE was.

    A MIGRATE_HIGHATOMIC pageblock is created when an atomic high-order
    allocation request steals a pageblock but limits the total number to 1% of
    the zone. Callers that speculatively abuse atomic allocations for
    long-lived high-order allocations to access the reserve will quickly fail.
    Note that SLUB is currently not such an abuser as it reclaims at least
    once. It is possible that the pageblock stolen has few suitable
    high-order pages and will need to steal again in the near future but there
    would need to be strong justification to search all pageblocks for an
    ideal candidate.

    The pageblocks are unreserved if an allocation fails after a direct
    reclaim attempt.

    The watermark checks account for the reserved pageblocks when the
    allocation request is not a high-order atomic allocation.

    The reserved pageblocks can not be used for order-0 allocations. This may
    allow temporary wastage until a failed reclaim reassigns the pageblock.
    This is deliberate as the intent of the reservation is to satisfy a
    limited number of atomic high-order short-lived requests if the system
    requires them.

    The stutter benchmark was used to evaluate this but while it was running
    there was a systemtap script that randomly allocated between 1 high-order
    page and 12.5% of memory's worth of order-3 pages using GFP_ATOMIC. This
    is much larger than the potential reserve and it does not attempt to be
    realistic. It is intended to stress random high-order allocations from an
    unknown source, show that there is a reduction in failures without
    introducing an anomaly where atomic allocations are more reliable than
    regular allocations. The amount of memory reserved varied throughout the
    workload as reserves were created and reclaimed under memory pressure.
    The allocation failures once the workload warmed up were as follows:

    4.2-rc5-vanilla 70%
    4.2-rc5-atomic-reserve 56%

    The failure rate was also measured while building multiple kernels. The
    failure rate was 14% but is 6% with this patch applied.

    Overall, this is a small reduction but the reserves are small relative to
    the number of allocation requests. In early versions of the patch, the
    failure rate reduced by a much larger amount but that required much larger
    reserves and perversely made atomic allocations seem more reliable than
    regular allocations.

    [yalin.wang2010@gmail.com: fix redundant check and a memory leak]
    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: yalin wang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • MIGRATE_RESERVE preserves an old property of the buddy allocator that
    existed prior to fragmentation avoidance -- min_free_kbytes worth of pages
    tended to remain contiguous until the only alternative was to fail the
    allocation. At the time it was discovered that high-order atomic
    allocations relied on this property so MIGRATE_RESERVE was introduced. A
    later patch will introduce an alternative MIGRATE_HIGHATOMIC so this patch
    deletes MIGRATE_RESERVE and supporting code so it'll be easier to review.
    Note that this patch in isolation may look like a false regression if
    someone was bisecting high-order atomic allocation failures.

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The zonelist cache (zlc) was introduced to skip over zones that were
    recently known to be full. This avoided expensive operations such as the
    cpuset checks, watermark calculations and zone_reclaim. The situation
    today is different and the complexity of zlc is harder to justify.

    1) The cpuset checks are no-ops unless a cpuset is active and in general
    are a lot cheaper.

    2) zone_reclaim is now disabled by default and I suspect that was a large
    source of the cost that zlc wanted to avoid. When it is enabled, it's
    known to be a major source of stalling when nodes fill up and it's
    unwise to hit every other user with the overhead.

    3) Watermark checks are expensive to calculate for high-order
    allocation requests. Later patches in this series will reduce the cost
    of the watermark checking.

    4) The most important issue is that in the current implementation it
    is possible for a failed THP allocation to mark a zone full for order-0
    allocations and cause a fallback to remote nodes.

    The last issue could be addressed with additional complexity but as the
    benefit of zlc is questionable, it is better to remove it. If stalls due
    to zone_reclaim are ever reported then an alternative would be to
    introduce deferring logic based on a timeout inside zone_reclaim itself
    and leave the page allocator fast paths alone.

    The impact on page-allocator microbenchmarks is negligible as they don't
    hit the paths where the zlc comes into play. Most page-reclaim related
    workloads showed no noticeable difference as a result of the removal.

    The impact was noticeable in a workload called "stutter". One part uses a
    lot of anonymous memory, a second measures mmap latency and a third copies
    a large file. In an ideal world the latency application would not notice
    the mmap latency. On a 2-node machine the results of this patch are

    stutter                        4.3.0-rc1 (baseline)  4.3.0-rc1 (nozlc-v4)
    Min mmap 20.9243 ( 0.00%) 20.7716 ( 0.73%)
    1st-qrtle mmap 22.0612 ( 0.00%) 22.0680 ( -0.03%)
    2nd-qrtle mmap 22.3291 ( 0.00%) 22.3809 ( -0.23%)
    3rd-qrtle mmap 25.2244 ( 0.00%) 25.2396 ( -0.06%)
    Max-90% mmap 48.0995 ( 0.00%) 28.3713 ( 41.02%)
    Max-93% mmap 52.5557 ( 0.00%) 36.0170 ( 31.47%)
    Max-95% mmap 55.8173 ( 0.00%) 47.3163 ( 15.23%)
    Max-99% mmap 67.3781 ( 0.00%) 70.1140 ( -4.06%)
    Max mmap 24447.6375 ( 0.00%) 12915.1356 ( 47.17%)
    Mean mmap 33.7883 ( 0.00%) 27.7944 ( 17.74%)
    Best99%Mean mmap 27.7825 ( 0.00%) 25.2767 ( 9.02%)
    Best95%Mean mmap 26.3912 ( 0.00%) 23.7994 ( 9.82%)
    Best90%Mean mmap 24.9886 ( 0.00%) 23.2251 ( 7.06%)
    Best50%Mean mmap 22.0157 ( 0.00%) 22.0261 ( -0.05%)
    Best10%Mean mmap 21.6705 ( 0.00%) 21.6083 ( 0.29%)
    Best5%Mean mmap 21.5581 ( 0.00%) 21.4611 ( 0.45%)
    Best1%Mean mmap 21.3079 ( 0.00%) 21.1631 ( 0.68%)

    Note that the maximum stall latency went from 24 seconds to 12, which is
    still bad but an improvement. The mileage varies considerably: an earlier
    test on a 2-node machine went from 494 seconds to 47 seconds, and a 4-node
    machine that tested an earlier version of this patch went from a worst-case
    stall time of 6 seconds to 67ms. The nature of the benchmark is inherently
    unpredictable, as it is hammering the system, and the mileage will vary
    between machines.

    There is a secondary impact with potentially more direct reclaim because
    zones are now being considered instead of being skipped by zlc. In this
    particular test run it did not occur so will not be described. However,
    in at least one test the following was observed

    1. Direct reclaim rates were higher. This was likely due to direct reclaim
    being entered instead of the zlc disabling a zone and busy looping.
    Busy looping may have the effect of allowing kswapd to make more
    progress and in some cases may be better overall. If this is found then
    the correct action is to put direct reclaimers to sleep on a waitqueue
    and allow kswapd to make forward progress. Busy looping on the zlc is even
    worse than when the allocator used to blindly call congestion_wait().

    2. There was higher swap activity as direct reclaim was active.

    3. Direct reclaim efficiency was lower. This is related to 1 as more
    scanning activity also encountered more pages that could not be
    immediately reclaimed

    In that case, the direct page scan and reclaim rates are noticeable but
    it is not considered a problem for a few reasons

    1. The test is primarily concerned with latency. The mmap attempts are also
    faulted which means there are THP allocation requests. The ZLC could
    cause zones to be disabled causing the process to busy loop instead
    of reclaiming. This looks like elevated direct reclaim activity but
    it's the correct action to take based on what processes requested.

    2. The test hammers reclaim and compaction heavily. The number of successful
    THP faults is highly variable but affects the reclaim stats. It's not a
    realistic or reasonable measure of page reclaim activity.

    3. No other page-reclaim intensive workload that was tested showed a problem.

    4. If a workload is identified that benefitted from the busy looping then it
    should be fixed by having direct reclaimers sleep on a wait queue until
    woken by kswapd instead of busy looping. We had this class of problem before
    when congestion_wait() with a fixed timeout was a brain damaged decision
    but happened to benefit some workloads.

    If a workload is identified that relied on the zlc to busy loop then it
    should be fixed correctly and have a direct reclaimer sleep on a waitqueue
    until woken by kswapd.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Christoph Lameter
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The absence of __GFP_WAIT was used to signal that the caller was in atomic
    context and could not sleep. Now it is possible to distinguish between
    true atomic context and callers that are not willing to sleep. The latter
    should clear __GFP_DIRECT_RECLAIM so kswapd will still wake. As clearing
    __GFP_WAIT behaves differently, there is a risk that people will clear the
    wrong flags. This patch renames __GFP_WAIT to __GFP_RECLAIM to clearly
    indicate what it does -- setting it allows all reclaim activity, clearing
    it prevents it.
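
    In gfp.h terms the renamed flag is simply the union of the two reclaim
    bits introduced earlier in the series (simplified sketch):

    #define __GFP_RECLAIM \
            ((__force gfp_t)(___GFP_DIRECT_RECLAIM | ___GFP_KSWAPD_RECLAIM))

    /* a caller that must not sleep but still wants kswapd woken: */
    gfp_t gfp = GFP_KERNEL & ~__GFP_DIRECT_RECLAIM;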

    [akpm@linux-foundation.org: fix build]
    [akpm@linux-foundation.org: coding-style fixes]
    Signed-off-by: Mel Gorman
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Acked-by: Johannes Weiner
    Cc: Christoph Lameter
    Acked-by: David Rientjes
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • GFP_IOFS was intended to be shorthand for clearing two flags, not a set of
    allocation flags. There is only one user of this flag combination now and
    there appears to be no reason why Lustre had to be protected from reclaim
    stalls. As none of the sites appear to be atomic, this patch simply
    deletes GFP_IOFS and converts Lustre to using GFP_KERNEL, GFP_NOFS or
    GFP_NOIO as appropriate.

    Signed-off-by: Mel Gorman
    Cc: Oleg Drokin
    Cc: Andreas Dilger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • …d avoiding waking kswapd

    __GFP_WAIT has been used to identify atomic context in callers that hold
    spinlocks or are in interrupts. They are expected to be high priority and
    have access to one of two watermarks lower than "min", which can be
    referred to as the "atomic reserve". __GFP_HIGH users get access to the
    first lower watermark and can be called the "high priority reserve".

    Over time, callers had a requirement to not block when fallback options
    were available. Some have abused __GFP_WAIT, leading to a situation where
    an optimistic allocation with a fallback option can access atomic
    reserves.

    This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
    cannot sleep and have no alternative. High priority users continue to use
    __GFP_HIGH. __GFP_DIRECT_RECLAIM identifies callers that can sleep and
    are willing to enter direct reclaim. __GFP_KSWAPD_RECLAIM identifies
    callers that want to wake kswapd for background reclaim. __GFP_WAIT is
    redefined as a caller that is willing to enter direct reclaim and wake
    kswapd for background reclaim.

    This patch then converts a number of sites

    o __GFP_ATOMIC is used by callers that are high priority and have memory
    pools for those requests. GFP_ATOMIC uses this flag.

    o Callers that have a limited mempool to guarantee forward progress clear
    __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
    into this category where kswapd will still be woken but atomic reserves
    are not used as there is a one-entry mempool to guarantee progress.

    o Callers that are checking if they are non-blocking should use the
    helper gfpflags_allow_blocking() where possible. This is because
    checking for __GFP_WAIT as was done historically now can trigger false
    positives. Some exceptions like dm-crypt.c exist where the code intent
    is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
    flag manipulations.

    o Callers that built their own GFP flags instead of starting with GFP_KERNEL
    and friends now also need to specify __GFP_KSWAPD_RECLAIM.
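
    The gfpflags_allow_blocking() check mentioned above boils down to testing
    the direct-reclaim bit (simplified sketch; the branch taken below is a
    hypothetical caller-side path):

    static inline bool gfpflags_allow_blocking(const gfp_t gfp_flags)
    {
            return !!(gfp_flags & __GFP_DIRECT_RECLAIM);
    }

    /* preferred over testing __GFP_WAIT, which can now mislead: */
    if (!gfpflags_allow_blocking(gfp_mask))
            take_nonblocking_path();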

    The first key hazard to watch out for is callers that removed __GFP_WAIT
    and were depending on access to atomic reserves for inconspicuous reasons.
    In some cases it may be appropriate for them to use __GFP_HIGH.

    The second key hazard is callers that assembled their own combination of
    GFP flags instead of starting with something like GFP_KERNEL. They may
    now wish to specify __GFP_KSWAPD_RECLAIM. It's almost certainly harmless
    if it's missed in most cases as other activity will wake kswapd.

    Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Christoph Lameter <cl@linux.com>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Vitaly Wool <vitalywool@gmail.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

    Mel Gorman
     
  • This patch redefines which GFP bits are used for specifying mobility and
    the order of the migrate types. Once redefined it's possible to convert
    GFP flags to a migrate type with a simple mask and shift. The only
    downside is that readers of OOM kill messages and allocation failures may
    have been used to the existing values but scripts/gfp-translate will help.
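
    After the redefinition the conversion is a mask and a shift (simplified
    sketch; the mainline helper adds build-time sanity checks):

    #define GFP_MOVABLE_MASK  (__GFP_RECLAIMABLE | __GFP_MOVABLE)
    #define GFP_MOVABLE_SHIFT 3

    static inline int gfpflags_to_migratetype(const gfp_t gfp_flags)
    {
            if (unlikely(page_group_by_mobility_disabled))
                    return MIGRATE_UNMOVABLE;

            /* group based on mobility */
            return (gfp_flags & GFP_MOVABLE_MASK) >> GFP_MOVABLE_SHIFT;
    }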

    Signed-off-by: Mel Gorman
    Acked-by: Vlastimil Babka
    Cc: Christoph Lameter
    Cc: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • There is a seqcounter that protects against spurious allocation failures
    when a task is changing the allowed nodes in a cpuset. There is no need
    to check the seqcounter until a cpuset exists.

    Signed-off-by: Mel Gorman
    Acked-by: Christoph Lameter
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Acked-by: Johannes Weiner
    Cc: Vitaly Wool
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Overall, the intent of this series is to remove the zonelist cache which
    was introduced to avoid high overhead in the page allocator. Once this is
    done, it is necessary to reduce the cost of watermark checks.

    The series starts with minor micro-optimisations.

    Next it notes that GFP flags that affect watermark checks are abused.
    Historically, the absence of __GFP_WAIT identified callers that could not
    sleep and could access reserves. This was later abused by callers that
    simply prefer to avoid sleeping and have other options. A patch
    distinguishes between atomic callers, high-priority callers and those that
    simply wish to avoid sleep.

    The zonelist cache has been around for a long time but it is of dubious
    merit with a lot of complexity and some issues that are explained. The
    most important issue is that a failed THP allocation can cause a zone to
    be treated as "full". This potentially causes unnecessary stalls, reclaim
    activity or remote fallbacks. The issues could be fixed but it's not
    worth it. The series places a small number of other micro-optimisations
    on top before examining GFP flags watermarks.

    High-order watermarks enforcement can cause high-order allocations to fail
    even though pages are free. The watermark checks both protect high-order
    atomic allocations and make kswapd aware of high-order pages but there is
    a much better way that can be handled using migrate types. This series
    uses page grouping by mobility to reserve pageblocks for high-order
    allocations with the size of the reservation depending on demand. kswapd
    awareness is maintained by examining the free lists. By patch 12 in this
    series, there are no high-order watermark checks while preserving the
    properties that motivated the introduction of the watermark checks.

    This patch (of 10):

    No user of zone_watermark_ok_safe() specifies alloc_flags. This patch
    removes the unnecessary parameter.

    Signed-off-by: Mel Gorman
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Acked-by: Michal Hocko
    Reviewed-by: Christoph Lameter
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Pull MFD updates from Lee Jones:
    "New Device Support:
    - Add support for 88pm860; 88pm80x
    - Add support for 24c08 EEPROM; at24
    - Add support for Broxton Whiskey Cove; intel*
    - Add support for RTS522A; rts5227
    - Add support for I2C devices; intel_quark_i2c_gpio

    New Functionality:
    - Add microphone support; arizona
    - Add general purpose switch support; arizona
    - Add fuel-gauge support; da9150-core
    - Add shutdown support; sec-core
    - Add charger support; tps65217
    - Add flexible serial communication unit support; atmel-flexcom
    - Add power button support; axp20x
    - Add led-flash support; rt5033

    Core Frameworks:
    - Supply a generic macro for defining Regmap IRQs
    - Rework ACPI child device matching

    Fix-ups:
    - Use Regmap to access registers; tps6105x
    - Use DEFINE_RES_IRQ_NAMED() macro; da9150
    - Re-arrange device registration order; intel_quark_i2c_gpio
    - Allow OF matching; cros_ec_i2c, atmel-hlcdc, hi6421-pmic, max8997, sm501
    - Handle deferred probe; twl6040
    - Improve accuracy of headphone detect; arizona
    - Unnecessary MODULE_ALIAS() removal; bcm590xx, rt5033
    - Remove unused code; htc-i2cpld, arizona, pcf50633-irq, sec-core
    - Simplify code; kempld, rts5209, da903x, lm3533, da9052, arizona
    - Remove #iffery; arizona
    - DT binding adaptions; many

    Bug Fixes:
    - Fix possible NULL pointer dereference; wm831x, tps6105x
    - Fix 64bit bug; intel_soc_pmic_bxtwc
    - Fix signedness issue; arizona"

    * tag 'mfd-for-linus-4.4' of git://git.kernel.org/pub/scm/linux/kernel/git/lee/mfd: (73 commits)
    bindings: mfd: s2mps11: Add documentation for s2mps15 PMIC
    mfd: sec-core: Remove unused s2mpu02-rtc and s2mpu02-clk children
    extcon: arizona: Add extcon specific device tree binding document
    MAINTAINERS: Add binding docs for Cirrus Logic/Wolfson Arizona devices
    mfd: arizona: Remove bindings covered in new subsystem specific docs
    mfd: rt5033: Add RT5033 Flash led sub device
    mfd: lpss: Add Intel Broxton PCI IDs
    mfd: lpss: Add Broxton ACPI IDs
    mfd: arizona: Signedness bug in arizona_runtime_suspend()
    mfd: axp20x: Add a cell for the power button part of the, axp288 PMICs
    mfd: dt-bindings: Document pulled down WRSTBI pin on S2MPS1X
    mfd: sec-core: Disable buck voltage reset on watchdog falling edge
    mfd: sec-core: Dump PMIC revision to find out the HW
    mfd: arizona: Use correct type ID for device tree config
    mfd: arizona: Remove use of codec build config #ifdefs
    mfd: arizona: Simplify adding subdevices
    mfd: arizona: Downgrade type mismatch messages to dev_warn
    mfd: arizona: Factor out checking of jack detection state
    mfd: arizona: Factor out DCVDD isolation control
    mfd: Make TPS6105X select REGMAP_I2C
    ...

    Linus Torvalds
     

06 Nov, 2015

5 commits

  • Pull powerpc updates from Michael Ellerman:

    - Kconfig: remove BE-only platforms from LE kernel build from Boqun
    Feng
    - Refresh ps3_defconfig from Geoff Levand
    - Emit GNU & SysV hashes for the vdso from Michael Ellerman
    - Define an enum for the bolted SLB indexes from Anshuman Khandual
    - Use a local to avoid multiple calls to get_slb_shadow() from Michael
    Ellerman
    - Add gettimeofday() benchmark from Michael Neuling
    - Avoid link stack corruption in __get_datapage() from Michael Neuling
    - Add virt_to_pfn and use this instead of opencoding from Aneesh Kumar
    K.V
    - Add ppc64le_defconfig from Michael Ellerman
    - pseries: extract of_helpers module from Andy Shevchenko
    - Correct string length in pseries_of_derive_parent() from Nathan
    Fontenot
    - Free the MSI bitmap if it was slab allocated from Denis Kirjanov
    - Shorten irq_chip name for the SIU from Christophe Leroy
    - Wait 1s for secondaries to enter OPAL during kexec from Samuel
    Mendoza-Jonas
    - Fix _ALIGN_* errors due to type difference, from Aneesh Kumar K.V
    - powerpc/pseries/hvcserver: don't memset pi_buff if it is null from
    Colin Ian King
    - Disable hugepd for 64K page size, from Aneesh Kumar K.V
    - Differentiate between hugetlb and THP during page walk from Aneesh
    Kumar K.V
    - Make PCI non-optional for pseries from Michael Ellerman
    - Individual System V IPC system calls from Sam bobroff
    - Add selftest of unmuxed IPC calls from Michael Ellerman
    - discard .exit.data at runtime from Stephen Rothwell
    - Delete old orphaned PrPMC 280/2800 DTS and boot file, from Paul
    Gortmaker
    - Use of_get_next_parent to simplify code from Christophe Jaillet
    - Paginate some xmon output from Sam bobroff
    - Add some more elements to the xmon PACA dump from Michael Ellerman
    - Allow the tm-syscall selftest to build with old headers from Michael
    Ellerman
    - Run EBB selftests only on POWER8 from Denis Kirjanov
    - Drop CONFIG_TUNE_CELL in favour of CONFIG_CELL_CPU from Michael
    Ellerman
    - Avoid reference to potentially freed memory in prom.c from Christophe
    Jaillet
    - Quieten boot wrapper output with run_cmd from Geoff Levand
    - EEH fixes and cleanups from Gavin Shan
    - Fix recursive fenced PHB on Broadcom shiner adapter from Gavin Shan
    - Use of_get_next_parent() in of_get_ibm_chip_id() from Michael
    Ellerman
    - Fix section mismatch warning in msi_bitmap_alloc() from Denis
    Kirjanov
    - Fix ps3-lpm white space from Rudhresh Kumar J
    - Fix ps3-vuart null dereference from Colin King
    - nvram: Add missing kfree in error path from Christophe Jaillet
    - nvram: Fix function name in some errors messages, from Christophe
    Jaillet
    - drivers/macintosh: adb: fix misleading Kconfig help text from Aaro
    Koskinen
    - agp/uninorth: fix a memleak in create_gatt_table from Denis Kirjanov
    - cxl: Free virtual PHB when removing from Andrew Donnellan
    - scripts/kconfig/Makefile: Allow KBUILD_DEFCONFIG to be a target from
    Michael Ellerman
    - scripts/kconfig/Makefile: Fix KBUILD_DEFCONFIG check when building
    with O= from Michael Ellerman
    - Freescale updates from Scott: Highlights include 64-bit book3e
    kexec/kdump support, a rework of the qoriq clock driver, device tree
    changes including qoriq fman nodes, support for a new 85xx board, and
    some fixes.
    - MPC5xxx updates from Anatolij: Highlights include a driver for
    MPC512x LocalPlus Bus FIFO with its device tree binding
    documentation, mpc512x device tree updates and some minor fixes.

    * tag 'powerpc-4.4-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (106 commits)
    powerpc/msi: Fix section mismatch warning in msi_bitmap_alloc()
    powerpc/prom: Use of_get_next_parent() in of_get_ibm_chip_id()
    powerpc/pseries: Correct string length in pseries_of_derive_parent()
    powerpc/e6500: hw tablewalk: make sure we invalidate and write to the same tlb entry
    powerpc/mpc85xx: Add FSL QorIQ DPAA FMan support to the SoC device tree(s)
    powerpc/mpc85xx: Create dts components for the FSL QorIQ DPAA FMan
    powerpc/fsl: Add #clock-cells and clockgen label to clockgen nodes
    powerpc: handle error case in cpm_muram_alloc()
    powerpc: mpic: use IRQCHIP_SKIP_SET_WAKE instead of redundant mpic_irq_set_wake
    powerpc/book3e-64: Enable kexec
    powerpc/book3e-64/kexec: Set "r4 = 0" when entering spinloop
    powerpc/booke: Only use VIRT_PHYS_OFFSET on booke32
    powerpc/book3e-64/kexec: Enable SMP release
    powerpc/book3e-64/kexec: create an identity TLB mapping
    powerpc/book3e-64: Don't limit paca to 256 MiB
    powerpc/book3e/kdump: Enable crash_kexec_wait_realmode
    powerpc/book3e: support CONFIG_RELOCATABLE
    powerpc/booke64: Fix args to copy_and_flush
    powerpc/book3e-64: rename interrupt_end_book3e with __end_interrupts
    powerpc/e6500: kexec: Handle hardware threads
    ...

    Linus Torvalds
     
  • Merge patch-bomb from Andrew Morton:

    - inotify tweaks

    - some ocfs2 updates (many more are awaiting review)

    - various misc bits

    - kernel/watchdog.c updates

    - Some of mm. I have a huge number of MM patches this time and quite a
    lot of it is quite difficult and much will be held over to next time.

    * emailed patches from Andrew Morton : (162 commits)
    selftests: vm: add tests for lock on fault
    mm: mlock: add mlock flags to enable VM_LOCKONFAULT usage
    mm: introduce VM_LOCKONFAULT
    mm: mlock: add new mlock system call
    mm: mlock: refactor mlock, munlock, and munlockall code
    kasan: always taint kernel on report
    mm, slub, kasan: enable user tracking by default with KASAN=y
    kasan: use IS_ALIGNED in memory_is_poisoned_8()
    kasan: Fix a type conversion error
    lib: test_kasan: add some testcases
    kasan: update reference to kasan prototype repo
    kasan: move KASAN_SANITIZE in arch/x86/boot/Makefile
    kasan: various fixes in documentation
    kasan: update log messages
    kasan: accurately determine the type of the bad access
    kasan: update reported bug types for kernel memory accesses
    kasan: update reported bug types for not user nor kernel memory accesses
    mm/kasan: prevent deadlock in kasan reporting
    mm/kasan: don't use kasan shadow pointer in generic functions
    mm/kasan: MODULE_VADDR is not available on all archs
    ...

    Linus Torvalds
     
  • The previous patch introduced a flag that specified pages in a VMA should
    be placed on the unevictable LRU, but they should not be made present when
    the area is created. This patch adds the ability to set this state via
    the new mlock system calls.

    We add MLOCK_ONFAULT for mlock2 and MCL_ONFAULT for mlockall.
    MLOCK_ONFAULT will set the VM_LOCKONFAULT modifier for VM_LOCKED.
    MCL_ONFAULT should be used as a modifier to the two other mlockall flags.
    When used with MCL_CURRENT, all current mappings will be marked with
    VM_LOCKED | VM_LOCKONFAULT. When used with MCL_FUTURE, the mm->def_flags
    will be marked with VM_LOCKED | VM_LOCKONFAULT. When used with both
    MCL_CURRENT and MCL_FUTURE, all current mappings and mm->def_flags will be
    marked with VM_LOCKED | VM_LOCKONFAULT.

    Prior to this patch, mlockall() will unconditionally clear the
    mm->def_flags any time it is called without MCL_FUTURE. This behavior is
    maintained after adding MCL_ONFAULT. If a call to mlockall(MCL_FUTURE) is
    followed by mlockall(MCL_CURRENT), the mm->def_flags will be cleared and
    new VMAs will be unlocked. This remains true with or without MCL_ONFAULT
    in either mlockall() invocation.

    munlock() will unconditionally clear both vma flags. munlockall()
    unconditionally clears both VMA flags on all VMAs and in the mm->def_flags
    field.
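
    From userspace the new interface can be exercised roughly as follows.
    glibc had no mlock2() wrapper at the time, so the raw syscall is used;
    MLOCK_ONFAULT, MCL_ONFAULT and __NR_mlock2 must come from the updated
    uapi headers, and the mapping size below is purely illustrative:

    #include <sys/mman.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    size_t len = 64 << 20;          /* hypothetical 64 MiB region */

    /* lock lazily: pages become unevictable only once they are faulted in */
    void *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    syscall(__NR_mlock2, buf, len, MLOCK_ONFAULT);

    /* or apply the same semantics process-wide, now and for future mappings */
    mlockall(MCL_CURRENT | MCL_FUTURE | MCL_ONFAULT);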

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • The cost of faulting in all memory to be locked can be very high when
    working with large mappings. If only portions of the mapping will be used
    this can incur a high penalty for locking.

    For the example of a large file, this is the usage pattern for a large
    statistical language model (probably applies to other statistical or
    graphical models as well). For the security example, consider any
    application transacting in data that cannot be swapped out (credit card
    data, medical records, etc).

    This patch introduces the ability to request that pages are not
    pre-faulted, but are placed on the unevictable LRU when they are finally
    faulted in. The VM_LOCKONFAULT flag will be used together with VM_LOCKED
    and has no effect when set without VM_LOCKED. Setting the VM_LOCKONFAULT
    flag for a VMA will cause pages faulted into that VMA to be added to the
    unevictable LRU when they are faulted or if they are already present, but
    will not cause any missing pages to be faulted in.

    Exposing this new lock state means that we cannot overload the meaning of
    the FOLL_POPULATE flag any longer. Prior to this patch it was used to
    mean that the VMA for a fault was locked. This means we need the new
    FOLL_MLOCK flag to communicate the locked state of a VMA. FOLL_POPULATE
    will now only control if the VMA should be populated and in the case of
    VM_LOCKONFAULT, it will not be set.

    Signed-off-by: Eric B Munson
    Acked-by: Kirill A. Shutemov
    Acked-by: Vlastimil Babka
    Cc: Michal Hocko
    Cc: Jonathan Corbet
    Cc: Catalin Marinas
    Cc: Geert Uytterhoeven
    Cc: Guenter Roeck
    Cc: Heiko Carstens
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson
     
  • With the refactored mlock code, introduce a new system call for mlock.
    The new call will allow the user to specify what lock states are being
    added. mlock2 is trivial at the moment, but a follow on patch will add a
    new mlock state making it useful.

    Signed-off-by: Eric B Munson
    Acked-by: Michal Hocko
    Acked-by: Vlastimil Babka
    Cc: Heiko Carstens
    Cc: Geert Uytterhoeven
    Cc: Catalin Marinas
    Cc: Stephen Rothwell
    Cc: Guenter Roeck
    Cc: Jonathan Corbet
    Cc: Kirill A. Shutemov
    Cc: Michael Kerrisk
    Cc: Ralf Baechle
    Cc: Shuah Khan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric B Munson