16 Mar, 2016

40 commits

  • Use the pr_*() print calls in the AUTOFS_*() macros instead of printks
    and include the module name in the log message macros. Also use the
    AUTOFS_*() macros everywhere instead of raw printks.
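
    A minimal sketch of what module-prefixed log macros built on pr_*() can
    look like (the macro names and message format here are illustrative,
    not the patch's exact definitions; pr_fmt() must be defined before
    linux/printk.h is included):

        /* Illustrative only: prefix every message with the module name. */
        #define pr_fmt(fmt) KBUILD_MODNAME ":%s: " fmt, __func__

        #define AUTOFS_WARN(fmt, ...)  pr_warn(fmt "\n", ##__VA_ARGS__)
        #define AUTOFS_ERROR(fmt, ...) pr_err(fmt "\n", ##__VA_ARGS__)
        #define AUTOFS_DEBUG(fmt, ...) pr_debug(fmt "\n", ##__VA_ARGS__)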

    Signed-off-by: Ian Kent
    Cc: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Fix some whitespace formatting errors.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • The return from an ioctl when an invalid ioctl command is passed in
    should be EINVAL, not ENOSYS.
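
    The changed pattern in an ioctl dispatcher looks roughly like this (a
    sketch with an illustrative function name, not the exact autofs code):

        static int autofs_handle_ioctl(unsigned int cmd)
        {
                switch (cmd) {
                /* ... known AUTOFS_IOC_* commands handled above ... */
                default:
                        return -EINVAL;     /* previously returned -ENOSYS */
                }
        }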

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • The need for this is questionable but checkpatch.pl complains about the
    line length and it's a straightforward change.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Refactor autofs4_get_set_timeout() to eliminate a coding style error.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • Try to make the coding style completely consistent throughout the
    autofs module and in line with kernel coding style recommendations.

    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ian Kent
     
  • This is required for CRIU (Checkpoint/Restore In Userspace) to migrate
    a mount point when the write end of the pipe in user space is closed.

    Below is a brief description of the problem.

    To migrate a non-catatonic autofs mount point, one has to restore the
    control pipe between kernel and autofs master process.

    One of the autofs masters is systemd, which closes pipe write end after
    passing it to the kernel with mount call.

    To be able to restore the systemd control pipe one has to know which
    read pipe end in systemd corresponds to the write pipe end in the
    kernel. The pipe "fd" in mount options is not enough because it was
    closed and probably replaced by some other descriptor.

    Thus, some other attribute is required to be able to find the read pipe
    end. The best attribute to use to find the correct pipe end is inode
    number because it's unique for the whole system and can't be reused
    while the autofs mount exists.

    This attribute can also be used to recognize a situation where an autofs
    mount has no master (no process with specified "pgrp" or no file
    descriptor with "pipe_ino", specified in autofs mount options).

    Signed-off-by: Stanislav Kinsburskiy
    Signed-off-by: Ian Kent
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Stanislav Kinsburskiy
     
  • Similar to how relative extables are implemented, it is possible to emit
    the kallsyms table in such a way that it contains offsets relative to
    some anchor point in the kernel image rather than absolute addresses.

    On 64-bit architectures, it cuts the size of the kallsyms address table
    in half, since offsets between kernel symbols can typically be expressed
    in 32 bits. This saves several hundreds of kilobytes of permanent
    .rodata on average. In addition, the kallsyms address table is no
    longer subject to dynamic relocation when CONFIG_RELOCATABLE is in
    effect, so the relocation work done after decompression now doesn't have
    to do relocation updates for all these values. This saves up to 24
    bytes (i.e., the size of an ELF64 RELA relocation table entry) per value,
    which easily adds up to a couple of megabytes of uncompressed __init
    data on ppc64 or arm64. Even if these relocation entries typically
    compress well, the combined size reduction of 2.8 MB uncompressed for a
    ppc64_defconfig build (of which 2.4 MB is __init data) results in a ~500
    KB space saving in the compressed image.

    Since it is useful for some architectures (like x86) to retain the
    ability to emit absolute values as well, this patch also adds support
    for capturing both absolute and relative values when
    KALLSYMS_ABSOLUTE_PERCPU is in effect, by emitting absolute per-cpu
    addresses as positive 32-bit values, and addresses relative to the
    lowest encountered relative symbol as negative values, which are
    subtracted from the runtime address of this base symbol to produce the
    actual address.

    Support for the above is enabled by default for all architectures except
    IA-64 and Tile-GX, whose symbols are too far apart to capture in this
    manner.
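
    The decode side of this scheme can be pictured roughly as follows (a
    sketch; the real kernel table and symbol names, as well as the exact
    bias applied to negative entries, may differ):

        /* Sketch: turn a 32-bit kallsyms table entry back into an address. */
        static unsigned long kallsyms_entry_to_addr(s32 entry,
                                                    unsigned long relative_base,
                                                    bool absolute_percpu)
        {
                if (!absolute_percpu)
                        return relative_base + (u32)entry; /* relative offset */

                if (entry >= 0)
                        return entry;   /* absolute (zero-based per-cpu) address */

                /* negative entries encode an offset from the base symbol */
                return relative_base - 1 - entry;
        }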

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • Commit c6bda7c988a5 ("kallsyms: fix percpu vars on x86-64 with
    relocation") overloaded the 'A' (absolute) symbol type to signify that a
    symbol is not subject to dynamic relocation. However, the original A
    type does not imply that at all, and depending on the version of the
    toolchain, many A type symbols are emitted that are in fact relative to
    the kernel text, i.e., if the kernel is relocated at runtime, these
    symbols should be updated as well.

    For instance, on sparc32, the following symbols are emitted as absolute
    (kindly provided by Guenter Roeck):

    f035a420 A _etext
    f03d9000 A _sdata
    f03de8c4 A jiffies
    f03f8860 A _edata
    f03fc000 A __init_begin
    f041bdc8 A __init_text_end
    f0423000 A __bss_start
    f0423000 A __init_end
    f044457d A __bss_stop
    f044457d A _end

    On x86_64, similar behavior can be observed:

    ffffffff81a00000 A __end_rodata_hpage_align
    ffffffff81b19000 A __vvar_page
    ffffffff81d3d000 A _end

    Even if only a couple of them pass the symbol range check that results
    in them being taken into account for the final kallsyms symbol table, it
    is obvious that 'A' does not mean the symbol does not need to be updated
    at relocation time, and overloading its meaning to signify that is
    perhaps not a good idea.

    So instead, add a new percpu_absolute member to struct sym_entry, and
    when --absolute-percpu is in effect, use it to record symbols whose
    addresses should be emitted as final values rather than values that
    still require relocation at runtime. That way, we can drop the check
    against the 'A' type.
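
    In scripts/kallsyms.c terms this amounts to recording the property
    explicitly rather than inferring it from the symbol type; a simplified
    sketch (field and helper names are approximate):

        struct addr_range {
                unsigned long long start, end;
        };

        struct sym_entry {
                unsigned long long addr;
                unsigned int len;
                unsigned int start_pos;
                unsigned char *sym;
                unsigned int percpu_absolute;  /* new: zero-based per-cpu symbol */
        };

        /* With --absolute-percpu, mark per-cpu symbols instead of forcing 'A'. */
        static void mark_percpus_absolute(struct sym_entry *tab, unsigned int cnt,
                                          const struct addr_range *percpu)
        {
                unsigned int i;

                for (i = 0; i < cnt; i++)
                        if (tab[i].addr >= percpu->start &&
                            tab[i].addr <= percpu->end)
                                tab[i].percpu_absolute = 1;
        }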

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Acked-by: Rusty Russell
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • scripts/kallsyms.c has a special --absolute-percpu command line option
    which deals with the zero-based per-cpu offsets that are used when
    building for SMP on x86_64. This means that the option should only be
    passed in that case, so add a Kconfig symbol with the correct predicate,
    and use that instead.

    Signed-off-by: Ard Biesheuvel
    Tested-by: Guenter Roeck
    Reviewed-by: Kees Cook
    Tested-by: Kees Cook
    Acked-by: Rusty Russell
    Cc: Heiko Carstens
    Cc: Michael Ellerman
    Cc: Ingo Molnar
    Cc: H. Peter Anvin
    Cc: Benjamin Herrenschmidt
    Cc: Michal Marek
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ard Biesheuvel
     
  • This patch escapes a regex that uses left brace.

    Using checkpatch.pl with Perl 5.22.0 generates the warning: "Unescaped
    left brace in regex is deprecated, passed through in regex;"

    Comment from regcomp.c in Perl source: "Currently we don't warn when the
    lbrace is at the start of a construct. This catches it in the middle of
    a literal string, or when it's the first thing after something like
    "\b"."

    This works as a complement to 4e5d56bd ("checkpatch: fix left brace
    warning").

    Signed-off-by: Geyslan G. Bem
    Signed-off-by: Joe Perches
    Suggested-by: Peter Senna Tschudin
    Cc: Eddie Kovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geyslan G. Bem
     
  • Improve the test to allow casts to (unsigned) or (signed) to be found
    and fixed if desired.

    Signed-off-by: Joe Perches
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Kernel style prefers "unsigned int" over "unsigned" and "signed int"
    over "signed".

    Emit a warning for these simple signed/unsigned declarations. Fix
    it too if desired.
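
    The preference being enforced, as plain C declarations (illustrative
    variable names only):

        /* Flagged by the new checkpatch test: bare type specifiers */
        unsigned old_count;
        signed   old_delta;

        /* Preferred kernel style: spell out the full type */
        unsigned int new_count;
        signed int   new_delta;        /* or simply: int new_delta; */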

    Signed-off-by: Joe Perches
    Acked-by: David S. Miller
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • asm volatile and all its variants like __asm__ __volatile__ ("")
    are reported as errors with "Macros with complex values should be
    enclosed in parentheses".

    Make an exception for these asm volatile macro definitions by converting
    the "asm volatile" to "asm_volatile" so it appears as a single function
    call and the error isn't reported.
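
    A typical definition of this kind is the classic compiler barrier
    macro, shown here for illustration:

        /* Previously reported as a "complex value" macro needing parentheses,
         * although parenthesizing an asm statement makes no sense here. */
        #define barrier() __asm__ __volatile__("" : : : "memory")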

    Signed-off-by: Joe Perches
    Reported-by: Jeff Merkey
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joe Perches
     
  • Migration accounting in the memory controller used to have to handle
    both oldpage and newpage being on the LRU already; fuse's page cache
    replacement used to pass a recycled newpage that had been uncharged but
    not freed and removed from the LRU, and the memcg migration code used to
    uncharge oldpage to "pass on" the existing charge to newpage.

    Nowadays, pages are no longer uncharged when truncated from the page
    cache, but rather only at free time, so if an LRU page is recycled in
    page cache replacement it'll also still be charged. And we bail out of
    the charge transfer altogether in that case. Tell commit_charge() that
    we know newpage is not on the LRU, to avoid taking the zone->lru_lock
    unnecessarily from the migration path.

    But also, oldpage is no longer uncharged inside migration. We only use
    oldpage for its page->mem_cgroup and page size, so we don't care about
    its LRU state anymore either. Remove any mention from the kernel doc.

    Signed-off-by: Johannes Weiner
    Suggested-by: Hugh Dickins
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Mateusz Guzik
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Rather than scattering mem_cgroup_migrate() calls all over the place,
    have a single call from a safe place where every migration operation
    eventually ends up in - migrate_page_copy().

    Signed-off-by: Johannes Weiner
    Suggested-by: Hugh Dickins
    Acked-by: Vladimir Davydov
    Acked-by: Michal Hocko
    Cc: Mateusz Guzik
    Cc: Sergey Senozhatsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • There is a report of a performance drop caused by hugepage allocation,
    in which half of the CPU time is spent in pageblock_pfn_to_page()
    during compaction [1].

    In that workload, compaction is triggered to make hugepages, but most
    pageblocks are unavailable for compaction due to the pageblock type and
    skip bit, so compaction usually fails. The most costly operation in
    this case is finding a valid pageblock while scanning the whole zone
    range. To check whether a pageblock is valid to compact, a valid pfn
    within the pageblock is required, and we can obtain it by calling
    pageblock_pfn_to_page(). This function checks whether the pageblock is
    in a single zone and returns a valid pfn if possible. The problem is
    that we need to perform this check every time before scanning a
    pageblock, even when re-visiting it, and this turns out to be very
    expensive in this workload.

    Although we have no way to skip this pageblock check on systems where
    holes exist at arbitrary positions, we can use a cached value for zone
    contiguity and just do pfn_to_page() on systems without holes. This
    optimization considerably speeds up the above workload.

    Before vs After
    Max: 1096 MB/s vs 1325 MB/s
    Min:  635 MB/s vs 1015 MB/s
    Avg:  899 MB/s vs 1194 MB/s

    Avg is improved by roughly 30% [2].

    [1]: http://www.spinics.net/lists/linux-mm/msg97378.html
    [2]: https://lkml.org/lkml/2015/12/9/23
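
    The shape of the optimization can be sketched like this (assuming the
    cached flag lives in struct zone as zone->contiguous and the existing
    check is split out as a slow path; names approximate):

        /* Sketch: skip per-pageblock validation when the zone has no holes. */
        static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                                         unsigned long end_pfn,
                                                         struct zone *zone)
        {
                if (zone->contiguous)
                        return pfn_to_page(start_pfn);

                return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
        }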

    [akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
    Signed-off-by: Joonsoo Kim
    Reported-by: Aaron Lu
    Acked-by: Vlastimil Babka
    Tested-by: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • pageblock_pfn_to_page() is used to check that there is a valid pfn and
    that all pages in the pageblock are in a single zone. If there is a
    hole in the pageblock, passing an arbitrary position to
    pageblock_pfn_to_page() could cause the whole pageblock scan to be
    skipped, instead of just the hole page. For deterministic behaviour,
    it's better to always pass a pageblock-aligned range to
    pageblock_pfn_to_page(). This will also help further optimization of
    pageblock_pfn_to_page() in the following patch.
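
    Aligning the range before the call can be sketched as follows (the
    wrapper name is illustrative):

        /* Sketch: always hand pageblock_pfn_to_page() an aligned range. */
        static struct page *first_page_in_block(unsigned long pfn,
                                                struct zone *zone)
        {
                unsigned long block_start_pfn = round_down(pfn,
                                                           pageblock_nr_pages);
                unsigned long block_end_pfn = block_start_pfn +
                                              pageblock_nr_pages;

                return pageblock_pfn_to_page(block_start_pfn, block_end_pfn,
                                             zone);
        }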

    Signed-off-by: Joonsoo Kim
    Cc: Aaron Lu
    Cc: David Rientjes
    Cc: Mel Gorman
    Cc: Rik van Riel
    Acked-by: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • free_pfn and compact_cached_free_pfn are the pointers that remember the
    restart position of the freepage scanner. When they are reset or
    invalid, we set them to zone_end_pfn because the freepage scanner works
    in the reverse direction. But, because the zone range is defined as
    [zone_start_pfn, zone_end_pfn), zone_end_pfn is invalid to access.
    Therefore, we should not store it in free_pfn and
    compact_cached_free_pfn. Instead, we need to store zone_end_pfn - 1 in
    them. There is one more thing we should consider. The freepage scanner
    scans in reverse, one pageblock at a time. If free_pfn and
    compact_cached_free_pfn are set to the middle of a pageblock, that
    situation is treated as if the front part of the pageblock had already
    been scanned, so we lose the opportunity to scan there. To fix this up,
    this patch does a round_down() to guarantee that the reset position is
    pageblock aligned.

    Note that, thanks to the current pageblock_pfn_to_page() implementation,
    no actual access to zone_end_pfn happens yet. But the following patch
    will change pageblock_pfn_to_page(), so this patch is needed from now
    on.
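
    The corrected reset then looks roughly like this (the helper name is
    illustrative; the fields are those used by the compaction code):

        /* Sketch: reset the free scanner to the zone's last pageblock,
         * not to zone_end_pfn itself, which is one past the end. */
        static void reset_cached_free_pfn(struct zone *zone,
                                          struct compact_control *cc)
        {
                unsigned long free_pfn = round_down(zone_end_pfn(zone) - 1,
                                                    pageblock_nr_pages);

                zone->compact_cached_free_pfn = free_pfn;
                cc->free_pfn = free_pfn;
        }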

    Signed-off-by: Joonsoo Kim
    Acked-by: David Rientjes
    Acked-by: Vlastimil Babka
    Cc: Aaron Lu
    Cc: Mel Gorman
    Cc: Rik van Riel
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Joonsoo Kim
     
  • We define struct memblock_type *type in the memblock_add_region() and
    memblock_reserve_region() functions only for passing it to the
    memblock_add_range() and memblock_reserve_range() functions. Let's
    remove these variables and pass the type directly.
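
    The resulting pattern is simply to pass the global type in the call (a
    sketch; argument details such as the node id and flags are simplified):

        /* Sketch: no local 'struct memblock_type *type' variable needed. */
        static int __init_memblock memblock_add_region(phys_addr_t base,
                                                       phys_addr_t size,
                                                       int nid,
                                                       unsigned long flags)
        {
                return memblock_add_range(&memblock.memory, base, size,
                                          nid, flags);
        }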

    Signed-off-by: Alexander Kuleshov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Alexander Kuleshov
     
  • We want to couple all debugging features with debug_pagealloc_enabled()
    and not with the config option CONFIG_DEBUG_PAGEALLOC.
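
    The pattern is to replace compile-time fencing with the runtime check,
    roughly as below (setup_debug_feature() is an illustrative callee):

        /* Old: #ifdef CONFIG_DEBUG_PAGEALLOC around the feature setup.
         * New: honour the runtime state, which also reflects
         * booting with debug_pagealloc=off. */
        static void init_debug_features(void)
        {
                if (debug_pagealloc_enabled())
                        setup_debug_feature();
        }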

    Signed-off-by: Christian Borntraeger
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Reviewed-by: Thomas Gleixner
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     
  • We can use debug_pagealloc_enabled() to check if we can map the identity
    mapping with 1MB/2GB pages as well as to print the current setting in
    dump_stack.

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Heiko Carstens
    Cc: Thomas Gleixner
    Acked-by: David Rientjes
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     
  • We can use debug_pagealloc_enabled() to check if we can map the identity
    mapping with 2MB pages. We can also add the state into the dump_stack
    output.

    The patch does not touch the code for the 1GB pages, which ignored
    CONFIG_DEBUG_PAGEALLOC. Do we need to fence this as well?

    Signed-off-by: Christian Borntraeger
    Reviewed-by: Thomas Gleixner
    Acked-by: David Rientjes
    Cc: Laura Abbott
    Cc: Joonsoo Kim
    Cc: Heiko Carstens
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Borntraeger
     
  • After one of the bugfixes to freeze_page(), we no longer have frozen
    pages in the rmap, therefore the mapcount of all subpages of a frozen
    THP is zero. And we have an assert for that.

    Let's drop the code which deals with a non-zero mapcount of subpages.

    Signed-off-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill A. Shutemov
     
  • do_fault() assumes that PAGE_SIZE is the same as PAGE_CACHE_SIZE. Use
    linear_page_index() to calculate pgoff in the correct units.

    Signed-off-by: Matthew Wilcox
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Matthew Wilcox
     
  • There are several users that nest lock_page_memcg() inside lock_page()
    to prevent page->mem_cgroup from changing. But the page lock prevents
    pages from moving between cgroups, so that is unnecessary overhead.

    Remove lock_page_memcg() in contexts where the page is already locked,
    and fix the debug code in the page stat functions to be okay with the
    page lock.

    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Now that migration doesn't clear page->mem_cgroup of live pages anymore,
    it's safe to make lock_page_memcg() and the memcg stat functions take
    pages, and spare the callers from memcg objects.

    [akpm@linux-foundation.org: fix warnings]
    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Changing a page's memcg association complicates dealing with the page,
    so we want to limit this as much as possible. Page migration e.g. does
    not have to do that. Just like page cache replacement, it can forcibly
    charge a replacement page, and then uncharge the old page when it gets
    freed. Temporarily overcharging the cgroup by a single page is not an
    issue in practice, and charging is so cheap nowadays that this is much
    preferable to the headache of messing with live pages.

    The only place that still changes the page->mem_cgroup binding of live
    pages is when pages move along with a task to another cgroup. But that
    path isolates the page from the LRU, takes the page lock, and the move
    lock (lock_page_memcg()). That means page->mem_cgroup is always stable
    in callers that have the page isolated from the LRU or locked. Lighter
    unlocked paths, like writeback accounting, can use lock_page_memcg().

    [akpm@linux-foundation.org: fix build]
    [vdavydov@virtuozzo.com: fix lockdep splat]
    Signed-off-by: Johannes Weiner
    Acked-by: Vladimir Davydov
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Greg Thelen
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Cache thrash detection (see a528910e12ec "mm: thrash detection-based
    file cache sizing" for details) currently only works on the system
    level, not inside cgroups. Worse, as the refaults are compared to the
    global number of active cache, cgroups might wrongfully get all their
    refaults activated when their pages are hotter than those of others.

    Move the refault machinery from the zone to the lruvec, and then tag
    eviction entries with the memcg ID. This makes the thrash detection
    work correctly inside cgroups.

    [sergey.senozhatsky@gmail.com: do not return from workingset_activation() with locked rcu and page]
    Signed-off-by: Johannes Weiner
    Signed-off-by: Sergey Senozhatsky
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • For per-cgroup thrash detection, we need to store the memcg ID inside
    the radix tree cookie as well. However, on 32 bit that doesn't leave
    enough bits for the eviction timestamp to cover the necessary range of
    recently evicted pages. The radix tree entry would look like this:

    [ RADIX_TREE_EXCEPTIONAL(2) | ZONEID(2) | MEMCGID(16) | EVICTION(12) ]

    12 bits means 4096 pages, or 16M worth of recently evicted pages.
    But refaults are actionable up to distances covering half of memory. To
    not miss refaults, we have to stretch out the range at the cost of how
    precisely we can tell when a page was evicted. This way we can shave
    off lower bits from the eviction timestamp until the necessary range is
    covered. E.g. grouping evictions into 1M buckets (256 pages) will
    stretch the longest representable refault distance to 4G.

    This patch implements eviction buckets that are automatically sized
    according to the available bits and the necessary refault range, in
    preparation for per-cgroup thrash detection.

    The maximum actionable distance is currently half of memory, but to
    support memory hotplug of up to 200% of boot-time memory, we size the
    buckets to cover double the distance. Beyond that, thrashing won't be
    detectable anymore.

    During boot, the kernel will print out the exact parameters, like so:

    [ 0.113929] workingset: timestamp_bits=12 max_order=18 bucket_order=6

    In this example, there are 12 radix entry bits available for the
    eviction timestamp, to cover a maximum distance of 2^18 pages (this is a
    1G machine). Consequently, evictions must be grouped into buckets of
    2^6 pages, or 256K.
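
    Packing such an entry then looks roughly like this, shaving bucket_order
    bits off the eviction counter before folding in the memcg and zone
    identifiers (a sketch; constant and helper names approximate):

        static unsigned int bucket_order __read_mostly;  /* computed at boot */

        /* Sketch: pack (memcg id, zone, eviction time) into one entry. */
        static void *pack_shadow(int memcgid, struct zone *zone,
                                 unsigned long eviction)
        {
                eviction >>= bucket_order;              /* group into buckets */
                eviction = (eviction << MEM_CGROUP_ID_SHIFT) | memcgid;
                eviction = (eviction << NODES_SHIFT) | zone_to_nid(zone);
                eviction = (eviction << ZONES_SHIFT) | zone_idx(zone);

                return (void *)((eviction << RADIX_TREE_EXCEPTIONAL_SHIFT) |
                                RADIX_TREE_EXCEPTIONAL_ENTRY);
        }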

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • Per-cgroup thrash detection will need to derive a live memcg from the
    eviction cookie, and doing that inside unpack_shadow() will get nasty
    with the reference handling spread over two functions.

    In preparation, make unpack_shadow() clearly about extracting static
    data, and let workingset_refault() do all the higher-level handling.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • This is a compile-time constant, no need to calculate it on refault.

    Signed-off-by: Johannes Weiner
    Reviewed-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • These patches tag the page cache radix tree eviction entries with the
    memcg an evicted page belonged to, thus making per-cgroup LRU reclaim
    work properly and be as adaptive to new cache workingsets as global
    reclaim already is.

    This should have been part of the original thrash detection patch
    series, but was deferred due to the complexity of those patches.

    This patch (of 5):

    So far the only sites that needed to exclude charge migration to
    stabilize page->mem_cgroup have been per-cgroup page statistics, hence
    the name mem_cgroup_begin_page_stat(). But per-cgroup thrash detection
    will add another site that needs to ensure page->mem_cgroup lifetime.

    Rename these locking functions to the more generic lock_page_memcg() and
    unlock_page_memcg(). Since charge migration is a cgroup1 feature only,
    we might be able to delete it at some point, along with these now
    easy-to-identify locking sites.

    Signed-off-by: Johannes Weiner
    Suggested-by: Vladimir Davydov
    Acked-by: Vladimir Davydov
    Cc: Michal Hocko
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Johannes Weiner
     
  • zone_reclaimable_pages() is used in should_reclaim_retry() which uses it
    to calculate the target for the watermark check. This means that
    precise numbers are important for the correct decision.
    zone_reclaimable_pages uses zone_page_state which can contain stale data
    with per-cpu diffs not synced yet (the last vmstat_update might have run
    1s in the past).

    Use zone_page_state_snapshot() in zone_reclaimable_pages() instead.
    None of the current callers is in a hot path where getting the precise
    value (which involves per-cpu iteration) would cause an unreasonable
    overhead.
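
    For reference, the difference between the two accessors is roughly the
    following (a sketch of the SMP case; the snapshot variant additionally
    folds in the per-cpu deltas that vmstat_update has not synced yet):

        /* Cheap read: global counter only, may lag behind per-cpu deltas. */
        static unsigned long zone_state_cheap(struct zone *zone,
                                              enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);

                return x < 0 ? 0 : x;
        }

        /* Precise read: also sum the per-cpu diffs. More expensive, but
         * fine outside hot paths. */
        static unsigned long zone_state_snapshot(struct zone *zone,
                                                 enum zone_stat_item item)
        {
                long x = atomic_long_read(&zone->vm_stat[item]);
                int cpu;

                for_each_online_cpu(cpu)
                        x += per_cpu_ptr(zone->pageset,
                                         cpu)->vm_stat_diff[item];

                return x < 0 ? 0 : x;
        }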

    Signed-off-by: Michal Hocko
    Signed-off-by: Tetsuo Handa
    Suggested-by: David Rientjes
    Acked-by: David Rientjes
    Acked-by: Hillf Danton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Hocko
     
  • Some new MADV_* advices are not documented in sys_madvise() comment. So
    let's update it.

    [akpm@linux-foundation.org: modifications suggested by Michal]
    Signed-off-by: Naoya Horiguchi
    Acked-by: Michal Hocko
    Cc: Minchan Kim
    Cc: "Kirill A. Shutemov"
    Cc: Jason Baron
    Cc: Chen Gong
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Currently, on shrinker registration we clear SHRINKER_NUMA_AWARE if
    there is only one NUMA node present. The comment states that this will
    allow us to save some small loop time later. It used to be true when
    this code was added (see commit 1d3d4437eae1b ("vmscan: per-node
    deferred work")), but since commit 6b4f7799c6a57 ("mm: vmscan: invoke
    slab shrinkers from shrink_zone()") it doesn't make any difference.
    Anyway, running on a non-NUMA machine shouldn't make a shrinker NUMA
    unaware, so zap this hunk.

    Signed-off-by: Vladimir Davydov
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vladimir Davydov
     
  • Add support for the newly added kernel memory auto-onlining policy to
    the Xen balloon driver.

    Signed-off-by: Vitaly Kuznetsov
    Suggested-by: Daniel Kiper
    Reviewed-by: Daniel Kiper
    Acked-by: David Vrabel
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Currently, all newly added memory blocks remain in the 'offline' state
    unless someone onlines them; some Linux distributions carry special
    udev rules like:

    SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"

    to make this happen automatically. This is not a great solution for
    virtual machines where memory hotplug is being used to address high
    memory pressure situations as such onlining is slow and a userspace
    process doing this (udev) has a chance of being killed by the OOM killer
    as it will probably need to allocate some memory.

    Introduce default policy for the newly added memory blocks in
    /sys/devices/system/memory/auto_online_blocks file with two possible
    values: "offline" which preserves the current behavior and "online"
    which causes all newly added memory blocks to go online as soon as
    they're added. The default is "offline".

    Signed-off-by: Vitaly Kuznetsov
    Reviewed-by: Daniel Kiper
    Cc: Jonathan Corbet
    Cc: Greg Kroah-Hartman
    Cc: Daniel Kiper
    Cc: Dan Williams
    Cc: Tang Chen
    Cc: David Vrabel
    Acked-by: David Rientjes
    Cc: Naoya Horiguchi
    Cc: Xishi Qiu
    Cc: Mel Gorman
    Cc: "K. Y. Srinivasan"
    Cc: Igor Mammedov
    Cc: Kay Sievers
    Cc: Konrad Rzeszutek Wilk
    Cc: Boris Ostrovsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vitaly Kuznetsov
     
  • Arm and arm64 used to trigger this BUG_ON() - this has now been fixed.

    But a WARN_ON() here is sufficient to catch future buggy callers.

    Signed-off-by: Mika Penttilä
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mika Penttilä
     
  • VM_HUGETLB and VM_MIXEDMAP vmas need to be excluded to avoid compound
    pages being marked for migration and unexpected COWs when handling a
    hugetlb fault.

    Thanks to Naoya Horiguchi for reminding me of these checks.
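
    The exclusion boils down to a flags test on the VMA, along these lines
    (a sketch; the helper name is illustrative):

        /* Sketch: skip mappings whose pages must not be queued here. */
        static bool vma_migratable_here(struct vm_area_struct *vma)
        {
                if (vma->vm_flags & (VM_HUGETLB | VM_MIXEDMAP))
                        return false;   /* compound/special pages: hands off */

                return true;
        }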

    Signed-off-by: Liang Chen
    Signed-off-by: Gavin Guo
    Suggested-by: Naoya Horiguchi
    Acked-by: David Rientjes
    Cc: SeongJae Park
    Cc: Rik van Riel
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Liang Chen