13 Nov, 2013

40 commits

  • Only a couple of arches (sh/x86) use fpu_counter in task_struct, so it can
    be moved out into the arch-specific thread_struct, reducing the size of
    task_struct for the other arches.

    Compile tested sh defconfig + sh4-linux-gcc (4.6.3)

    Signed-off-by: Vineet Gupta
    Cc: Paul Mundt
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Vineet Gupta
     
  • if (unlikely(x) > 0) doesn't seem to help branch prediction

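    A user-space stand-in for the kernel macro, illustrating the point of the
    change: the hint should wrap the whole condition rather than a single
    operand (the variable name here is made up for the example).

        #include <stdio.h>

        #define unlikely(x)  __builtin_expect(!!(x), 0)

        static int waiters;   /* assume a non-negative counter */

        int main(void)
        {
                /* hint applied to one operand only: the compiler learns
                 * little about the branch itself */
                if (unlikely(waiters) > 0)
                        puts("slow path");

                /* hint applied to the full comparison: this is what actually
                 * guides the branch layout */
                if (unlikely(waiters > 0))
                        puts("slow path");

                return 0;
        }
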
    Signed-off-by: Roel Kluin
    Cc: Raghavendra K T
    Cc: Konrad Rzeszutek Wilk
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Roel Kluin
     
  • Commit 15d94b82565e ("reboot: move shutdown/reboot related functions to
    kernel/reboot.c") moved all kexec-related functionality to
    kernel/reboot.c, so kernel/sys.c no longer needs to include
    <linux/kexec.h>.

    Signed-off-by: Geert Uytterhoeven
    Cc: Robin Holt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Geert Uytterhoeven
     
  • The wrapper function delayacct_add_tsk() already checks 'tsk->delays',
    and __delayacct_add_tsk() has no other direct callers, so the redundant
    check can be removed.

    The label 'done' is also useless, so remove it too.

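    A hedged sketch of the resulting shape (the exact signature and return
    value are assumptions based on the changelog): the inline wrapper is the
    only place that needs to look at tsk->delays.

        static inline int delayacct_add_tsk(struct taskstats *d,
                                            struct task_struct *tsk)
        {
                if (!tsk->delays)
                        return 0;
                /* no second tsk->delays check and no 'done' label inside */
                return __delayacct_add_tsk(d, tsk);
        }
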
    Signed-off-by: Chen Gang
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Chen Gang
     
  • getenv() may return NULL if the given environment variable does not
    exist, which leads to a NULL dereference when calling strncat().

    Besides that, the environment variable name was copied to a temporary
    env_var buffer, but this copying can be avoided by simply using the input
    string.

    Lastly, the whole loop can be greatly simplified by using the snprintf
    function instead of playing with strncat.

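    A minimal user-space sketch of the two fixes described above (a NULL
    check on getenv() and snprintf-based assembly); the variable name and
    the surrounding parsing are illustrative, not the patched code.

        #include <stdio.h>
        #include <stdlib.h>

        int main(void)
        {
                char expanded[256];
                const char *val = getenv("DEMO_VAR");   /* may be NULL */

                if (!val)
                        val = "";   /* expand missing variables to nothing */

                /* snprintf bounds the copy and appends in one call,
                 * with no strncat juggling */
                snprintf(expanded, sizeof(expanded), "out %s out", val);
                puts(expanded);
                return 0;
        }
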
    By the way, the current implementation allows a recursive variable
    expansion, as in:

    $ echo 'out ${A} out ' | A='a ${B} a' B=b /tmp/a
    out a b a out

    I'm assuming this is just a side effect and not a conscious decision
    (especially as this may lead to an infinite loop), but I didn't want to
    change this behaviour without consulting.

    If the current behaviour is deemed incorrect, I'll be happy to send
    a patch without recursive processing.

    Signed-off-by: Michal Nazarewicz
    Cc: Kees Cook
    Cc: Jiri Kosina
    Cc: Jesper Juhl
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michal Nazarewicz
     
  • glibc recently changed the error string for ESTALE to remove "NFS" -

    https://sourceware.org/git/?p=glibc.git;a=commitdiff;h=96945714ec61951cc748da2b4b8a80cf02127ee9

    from: [ERR_REMAP (ESTALE)] = N_("Stale NFS file handle"),
    to: [ERR_REMAP (ESTALE)] = N_("Stale file handle"),

    And some have expressed concern that the kernel's errno.h
    comments still refer to NFS.

    So make that change... note that this is a comment-only change,
    and has no functional difference.

    Signed-off-by: Eric Sandeen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Eric Sandeen
     
  • The name_to_dev_t function has a comment block which lists the supported
    syntaxes for the device name. Add a bullet for the <major>:<minor>
    syntax, which is already supported in the code.

    Signed-off-by: Sebastian Capella
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Sebastian Capella
     
  • For some reason I managed to trick gcc into creating CRC symbols that
    are no longer absolute, but weak.

    Make modpost handle this case.

    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Use standard gcc __attribute__((alias(foo))) to define the syscall aliases
    instead of custom assembler macros.

    This is far cleaner, and also fixes my LTO kernel build.

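    A small GCC-only demonstration of the mechanism (not the kernel's actual
    syscall glue): the alias attribute binds a second symbol to the same
    definition, with no assembler macros involved.

        #include <stdio.h>

        long sys_demo(long arg)
        {
                return arg * 2;
        }

        /* "compat_sys_demo" resolves to the very same code as sys_demo */
        long compat_sys_demo(long arg) __attribute__((alias("sys_demo")));

        int main(void)
        {
                printf("%ld %ld\n", sys_demo(21), compat_sys_demo(21));
                return 0;
        }
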
    Signed-off-by: Andi Kleen
    Cc: Al Viro
    Cc: Geert Uytterhoeven
    Cc: Tetsuo Handa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andi Kleen
     
  • Who needs cramfs when you have squashfs? At least, we should warn people
    that cramfs is obsolete.

    Signed-off-by: Michael Opdenacker
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Michael Opdenacker
     
  • Tests various percpu operations.

    Enable with CONFIG_PERCPU_TEST=m.

    Signed-off-by: Greg Thelen
    Acked-by: Tejun Heo
    Acked-by: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • The CONFIG_HPET_MMAP Kconfig option exposes the memory map of the HPET
    registers to userspace. The Kconfig help points out that in some cases
    this can be a security risk as some systems may erroneously configure the
    map such that additional data is exposed to userspace.

    This is a problem for distributions -- some users want the MMAP
    functionality but it comes with a significant security risk. In an effort
    to mitigate this risk, and due to the low number of users of the MMAP
    functionality, I've introduced a kernel parameter, hpet_mmap_enable, that
    is required in order to actually have the HPET MMAP exposed.

    Signed-off-by: Prarit Bhargava
    Acked-by: Matt Wilson
    Signed-off-by: Clemens Ladisch
    Cc: Randy Dunlap
    Cc: Tomas Winkler
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Prarit Bhargava
     
  • Commit 0255d4918480 ("mm: Account for a THP NUMA hinting update as one
    PTE update") was added to account for the number of PTE updates when
    marking pages prot_numa. task_numa_work was using the old return value
    to track how much address space had been updated. Altering the return
    value causes the scanner to do more work than it is configured or
    documented to in a single unit of work.

    This patch reverts that commit and accounts for the number of THP
    updates separately in vmstat. It is up to the administrator to
    interpret the pair of values correctly. This is a straight-forward
    operation and likely to only be of interest when actively debugging NUMA
    balancing problems.

    The impact of this patch is that the NUMA PTE scanner will scan slower
    when THP is enabled and workloads may converge slower as a result. On
    the flip side, system CPU usage should be lower than recent tests
    reported. This is an illustrative example of a short single JVM specjbb
    test

    specjbb
                          3.12.0                3.12.0
                         vanilla           acctupdates
    TPut 1      26143.00 (  0.00%)     25747.00 ( -1.51%)
    TPut 7     185257.00 (  0.00%)    183202.00 ( -1.11%)
    TPut 13    329760.00 (  0.00%)    346577.00 (  5.10%)
    TPut 19    442502.00 (  0.00%)    460146.00 (  3.99%)
    TPut 25    540634.00 (  0.00%)    549053.00 (  1.56%)
    TPut 31    512098.00 (  0.00%)    519611.00 (  1.47%)
    TPut 37    461276.00 (  0.00%)    474973.00 (  2.97%)
    TPut 43    403089.00 (  0.00%)    414172.00 (  2.75%)

                  3.12.0      3.12.0
                 vanilla acctupdates
    User          5169.64     5184.14
    System         100.45       80.02
    Elapsed        252.75      251.85

    Performance is similar but note the reduction in system CPU time. While
    this showed a performance gain, it will not be universal but at least
    it'll be behaving as documented. The vmstats are obviously different but
    here is an obvious interpretation of them from mmtests.

                                 3.12.0      3.12.0
                                vanilla acctupdates
    NUMA page range updates     1408326    11043064
    NUMA huge PMD updates             0       21040
    NUMA PTE updates            1408326      291624

    "NUMA page range updates" == nr_pte_updates and is the value returned to
    the NUMA pte scanner. NUMA huge PMD updates were the number of THP
    updates which in combination can be used to calculate how many ptes were
    updated from userspace.

    Signed-off-by: Mel Gorman
    Reported-by: Alex Thorlton
    Reviewed-by: Rik van Riel
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • The same calculation is currently done in three different places.
    Factor out that code so future changes have to be made in only one place.

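    A hedged sketch of the factored helper mentioned below; this mirrors the
    usual overcommit formula, but treat the exact body as an assumption
    rather than the patch itself.

        unsigned long vm_commit_limit(void)
        {
                return ((totalram_pages - hugetlb_total_pages())
                        * sysctl_overcommit_ratio / 100) + total_swap_pages;
        }
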
    [akpm@linux-foundation.org: uninline vm_commit_limit()]
    Signed-off-by: Jerome Marchand
    Cc: Dave Hansen
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jerome Marchand
     
  • Currently dirty_background_ratio/dirty_ratio contain a percentage of total
    available memory, which consists of free pages and reclaimable pages. The
    number of these pages is not equal to the amount of total system memory,
    yet they are described as a percentage of total system memory in
    Documentation/sysctl/vm.txt. So fix the documentation to avoid
    misunderstanding.

    Signed-off-by: Zheng Liu
    Cc: Rob Landley
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zheng Liu
     
  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • Signed-off-by: Zhi Yong Wu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhi Yong Wu
     
  • The refcount routine did not fit the kernel get/put semantics exactly:
    there were too many judgement statements on the refcount, and it could go
    negative.

    This patch does the following:

    - move the refcount judgement into zswap_entry_put() to hide the
    resource-freeing function.

    - add a new function zswap_entry_find_get(), so that callers can easily
    use the following pattern:

        zswap_entry_find_get()
        ... /* do something */
        zswap_entry_put()

    - move some function declarations to eliminate a compile error.

    This patch is based on Minchan Kim's idea and suggestion.

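    A hedged sketch of the pattern described above; the helper names follow
    the changelog, while the bodies and locking details are illustrative
    rather than the actual zswap code.

        /* caller holds the tree lock; on success a reference is held */
        static struct zswap_entry *zswap_entry_find_get(struct rb_root *root,
                                                        pgoff_t offset)
        {
                struct zswap_entry *entry = zswap_rb_search(root, offset);

                if (entry)
                        zswap_entry_get(entry);
                return entry;
        }

        static void zswap_entry_put(struct zswap_tree *tree,
                                    struct zswap_entry *entry)
        {
                int refcount = --entry->refcount;

                BUG_ON(refcount < 0);
                if (refcount == 0)
                        zswap_free_entry(tree, entry);  /* free hidden behind put() */
        }
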
    Signed-off-by: Weijie Yang
    Cc: Seth Jennings
    Acked-by: Minchan Kim
    Cc: Bob Liu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Consider the following scenario:

    thread 0: reclaims entry x (gets the refcount, but does not yet call
              zswap_get_swap_cache_page)
    thread 1: calls zswap_frontswap_invalidate_page to invalidate entry x;
              when it finishes, entry x and its zbud are not freed as the
              refcount != 0, and now swap_map[x] = 0
    thread 0: now calls zswap_get_swap_cache_page;
              swapcache_prepare returns -ENOENT because entry x is not used
              any more, zswap_get_swap_cache_page returns
              ZSWAP_SWAPCACHE_NOMEM, and zswap_writeback_entry does nothing
              except put the refcount

    Now the memory of zswap_entry x and its zpage leaks.

    Modify:
    - check the refcount in the fail path and free the memory if it is no
    longer referenced.

    - use ZSWAP_SWAPCACHE_FAIL instead of ZSWAP_SWAPCACHE_NOMEM, as the fail
    path can be caused not only by nomem but also by invalidate.

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • We can't see the relationship with memcg from the parameters,
    so the name with memcg_idx would be more reasonable.

    Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Signed-off-by: Qiang Huang
    Reviewed-by: Pekka Enberg
    Acked-by: David Rientjes
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Cc: Glauber Costa
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • This patch fixes a problem where get_unmapped_area() can return an
    illegal address, which results in mmap(2) etc. failing.

    In case that the address higher than PAGE_SIZE is set to
    /proc/sys/vm/mmap_min_addr, the address lower than mmap_min_addr can be
    returned by get_unmapped_area(), even if you do not pass any virtual
    address hint (i.e. the second argument).

    This is because the current get_unmapped_area() code does not take into
    account mmap_min_addr.

    This leads to two actual problems as follows:

    1. mmap(2) can fail with EPERM on a process without CAP_SYS_RAWIO,
    even though no illegal parameter is passed.

    2. The bottom-up search path after the top-down search might not work in
    arch_get_unmapped_area_topdown().

    Note: The first and third chunks of my patch, which change the "len"
    check, are for a more precise check using mmap_min_addr, and not for
    solving the above problem.

    [How to reproduce]

    --- test.c -------------------------------------------------
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/mman.h>
    #include <errno.h>

    int main(int argc, char *argv[])
    {
            void *ret = NULL, *last_map;
            size_t pagesize = sysconf(_SC_PAGESIZE);

            do {
                    last_map = ret;
                    ret = mmap(0, pagesize, PROT_NONE,
                               MAP_PRIVATE|MAP_ANONYMOUS, -1, 0);
                    // printf("ret=%p\n", ret);
            } while (ret != MAP_FAILED);

            if (errno != ENOMEM) {
                    printf("ERR: unexpected errno: %d (last map=%p)\n",
                           errno, last_map);
            }

            return 0;
    }
    ---------------------------------------------------------------

    $ gcc -m32 -o test test.c
    $ sudo sysctl -w vm.mmap_min_addr=65536
    vm.mmap_min_addr = 65536
    $ ./test (run as a non-privileged user)
    ERR: unexpected errno: 1 (last map=0x10000)

    Signed-off-by: Akira Takeuchi
    Signed-off-by: Kiyoshi Owada
    Reviewed-by: Naoya Horiguchi
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Akira Takeuchi
     
  • When __rmqueue_fallback() doesn't find a free block with the required size
    it splits a larger page and puts the rest of the page onto the free list.

    But it has one serious mistake. When putting back, __rmqueue_fallback()
    always uses start_migratetype if the type is not CMA. However,
    __rmqueue_fallback() is only called when the start_migratetype queue is
    entirely empty. That is, __rmqueue_fallback() always puts memory back on
    the wrong queue, except when try_to_steal_freepages() changed the
    pageblock type (i.e. the requested size is smaller than half of a page
    block). The end result is that the anti-fragmentation framework increases
    fragmentation instead of decreasing it.

    Mel's original anti fragmentation does the right thing. But commit
    47118af076f6 ("mm: mmzone: MIGRATE_CMA migration type added") broke it.

    This patch restores sane and old behavior. It also removes an incorrect
    comment which was introduced by commit fef903efcf0c ("mm/page_alloc.c:
    restructure free-page stealing code and fix a bug").

    Signed-off-by: KOSAKI Motohiro
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • In general, every tracepoint should have zero overhead when it is
    disabled. However, trace_mm_page_alloc_extfrag() is one of the
    exceptions: it evaluates "new_type == start_migratetype" even when the
    tracepoint is disabled.

    The code can be moved into the tracepoint's TP_fast_assign(), which
    exists for exactly this purpose. This patch does that.

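    A hedged, simplified TRACE_EVENT sketch (the event name and field set are
    illustrative, not the real mm_page_alloc_extfrag definition): the
    comparison lives inside TP_fast_assign(), so it is only evaluated when
    the tracepoint is enabled and actually fires.

        TRACE_EVENT(extfrag_sketch,

                TP_PROTO(int alloc_migratetype, int fallback_migratetype,
                         int new_migratetype),

                TP_ARGS(alloc_migratetype, fallback_migratetype, new_migratetype),

                TP_STRUCT__entry(
                        __field(int, fallback_migratetype)
                        __field(int, change_ownership)
                ),

                TP_fast_assign(
                        __entry->fallback_migratetype = fallback_migratetype;
                        /* computed here, not at the call site */
                        __entry->change_ownership =
                                (new_migratetype == alloc_migratetype);
                ),

                TP_printk("fallback_migratetype=%d change_ownership=%d",
                          __entry->fallback_migratetype,
                          __entry->change_ownership)
        );
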
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • Currently, set_pageblock_migratetype() screws up MIGRATE_CMA and
    MIGRATE_ISOLATE if page_group_by_mobility_disabled is true. It rewrites
    the argument to MIGRATE_UNMOVABLE and these attributes are lost.

    The problem was introduced by commit 49255c619fbd ("page allocator: move
    check for disabled anti-fragmentation out of fastpath"). So a 4 year
    old issue may mean that nobody uses page_group_by_mobility_disabled.

    But anyway, this patch fixes the problem.

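    A hedged sketch of the fixed check (the guard is the part that changes;
    the final helper call is the existing one): only the ordinary
    migratetypes are collapsed to MIGRATE_UNMOVABLE, while CMA and ISOLATE
    pass through untouched.

        void set_pageblock_migratetype(struct page *page, int migratetype)
        {
                if (unlikely(page_group_by_mobility_disabled &&
                             migratetype < MIGRATE_PCPTYPES))
                        migratetype = MIGRATE_UNMOVABLE;

                set_pageblock_flags_group(page, (unsigned long)migratetype,
                                          PB_migrate, PB_migrate_end);
        }
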
    Signed-off-by: KOSAKI Motohiro
    Acked-by: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    KOSAKI Motohiro
     
  • The kernel's readahead algorithm sometimes interprets random read
    accesses as sequential and triggers unnecessary data prefetching from
    the storage device (impacting random read average latency).

    In order to identify sequential cache read misses, the readahead
    algorithm intends to check whether offset - previous offset == 1
    (trivial sequential reads) or offset - previous offset == 0 (sequential
    reads not aligned on page boundary):

    if (offset - (ra->prev_pos >> PAGE_CACHE_SHIFT) <= 1UL)

    Because ra->prev_pos is a loff_t (long long) while offset is a pgoff_t
    (unsigned long), the operands are implicitly converted. So when the
    previous offset is larger than the current offset (which happens on a
    random read pattern), the if condition is true and the access is wrongly
    interpreted as sequential. An unnecessary data prefetch is triggered,
    impacting the average random read latency.

    Storing the previous offset value in a "pgoff_t" variable (unsigned
    long) fixes the sequential read detection logic.

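    A stand-alone demonstration of the conversion issue (not kernel code):
    "unsigned" stands in for a 32-bit pgoff_t and "long long" for loff_t, so
    the behaviour matches a 32-bit kernel regardless of the host.

        #include <stdio.h>

        int main(void)
        {
                unsigned offset = 100;     /* current page offset (pgoff_t) */
                long long prev_ll = 5000;  /* previous offset kept as loff_t */
                unsigned prev_u = 5000;    /* previous offset kept as pgoff_t */

                /* mixed types: the subtraction is done as signed long long,
                 * so -4900 <= 1 and the access looks "sequential" */
                printf("mixed types:    %d\n", offset - prev_ll <= 1);

                /* both unsigned: the difference wraps to a huge value and
                 * the access is correctly treated as random */
                printf("unsigned types: %d\n", offset - prev_u <= 1);
                return 0;
        }
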
    Signed-off-by: Damien Ramonda
    Reviewed-by: Fengguang Wu
    Acked-by: Pierre Tardy
    Acked-by: David Cohen
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Damien Ramonda
     
  • It has been reported on very large machines that show_mem is taking almost
    5 minutes to display information. This is a serious problem if there is
    an OOM storm. The bulk of the cost is in show_mem doing a very expensive
    PFN walk to give us the following information

    Total RAM: Also available as totalram_pages
    Highmem pages: Also available as totalhigh_pages
    Reserved pages: Can be inferred from the zone structure
    Shared pages: PFN walk required
    Unshared pages: PFN walk required
    Quick pages: Per-cpu walk required

    Only the shared/unshared pages requires a full PFN walk but that
    information is useless. It is also inaccurate as page pins of unshared
    pages would be accounted for as shared. Even if the information was
    accurate, I'm struggling to think how the shared/unshared information
    could be useful for debugging OOM conditions. Maybe it was useful before
    rmap existed when reclaiming shared pages was costly but it is less
    relevant today.

    The PFN walk could be optimised a bit but why bother as the information is
    useless. This patch deletes the PFN walker and infers the total RAM,
    highmem and reserved pages count from struct zone. It omits the
    shared/unshared page usage on the grounds that it is useless. It also
    corrects the reporting of HighMem as HighMem/MovableOnly as ZONE_MOVABLE
    has similar problems to HighMem with respect to lowmem/highmem exhaustion.

    Signed-off-by: Mel Gorman
    Cc: David Rientjes
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mel Gorman
     
  • Signed-off-by: Daeseok Youn
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Daeseok Youn
     
  • vmstat_cpuup_callback() is a CPU notifier callback, which marks N_CPU to a
    node at CPU online event. However, it does not update this N_CPU info at
    CPU offline event.

    Changed vmstat_cpuup_callback() to clear N_CPU when the last CPU in the
    node goes offline, i.e. when the node no longer has any online CPU.

    Signed-off-by: Toshi Kani
    Acked-by: Christoph Lameter
    Reviewed-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • After a system has booted, N_CPU is not set on any node, as has_cpu shows
    an empty line.

    # cat /sys/devices/system/node/has_cpu
    (show-empty-line)

    setup_vmstat() registers its CPU notifier callback,
    vmstat_cpuup_callback(), which marks N_CPU for a node when a CPU is
    brought online. However, setup_vmstat() is called after all CPUs have
    been launched in the boot sequence.

    Changed setup_vmstat() to mark N_CPU to the nodes with online CPUs at
    boot, which is consistent with other operations in
    vmstat_cpuup_callback(), i.e. start_cpu_timer() and
    refresh_zone_stat_thresholds().

    Also added get_online_cpus() to protect the for_each_online_cpu() loop.

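    A hedged sketch of the boot-time marking described above (the placement
    inside setup_vmstat() follows the changelog, the loop body is
    illustrative):

        get_online_cpus();
        for_each_online_cpu(cpu) {
                start_cpu_timer(cpu);
                node_set_state(cpu_to_node(cpu), N_CPU);
        }
        put_online_cpus();
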
    Signed-off-by: Toshi Kani
    Acked-by: Christoph Lameter
    Reviewed-by: Yasuaki Ishimatsu
    Tested-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
    As we mentioned before, if hotpluggable memory is used by the kernel, it
    cannot be hot-removed. So memory hotplug users may want to set all
    hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

    Memory hotplug users may also set a node as movable node, which has
    ZONE_MOVABLE only, so that the whole node can be hot-removed.

    But the kernel cannot use memory in ZONE_MOVABLE, so with this approach
    the kernel cannot use memory in movable nodes. This will hurt NUMA
    performance, and other users may be unhappy.

    So we need a way to allow users to enable and disable this functionality.
    In this patch, we introduce the movable_node boot option to allow users
    to choose not to consume hotpluggable memory at early boot time, so that
    later it can be set as ZONE_MOVABLE.

    To achieve this, the movable_node boot option will control the memblock
    allocation direction. That said, after memblock is ready, before SRAT is
    parsed, we should allocate memory near the kernel image as we explained in
    the previous patches. So if movable_node boot option is set, the kernel
    does the following:

    1. After memblock is ready, make memblock allocate memory bottom up.
    2. After SRAT is parsed, make memblock behave as default, allocate memory
    top down.

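    A hedged sketch of that sequencing (call sites are simplified; the
    helpers named here are the ones this series adds or relies on):

        if (movable_node_is_enabled())
                memblock_set_bottom_up(true);   /* 1. memblock ready: go bottom-up */

        initmem_init();                         /* SRAT is parsed in here */

        memblock_set_bottom_up(false);          /* 2. back to the default top-down */
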
    Users can specify "movable_node" in kernel commandline to enable this
    functionality. For those who don't use memory hotplug or who don't want
    to lose their NUMA performance, just don't specify anything. The kernel
    will work as before.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Suggested-by: Kamezawa Hiroyuki
    Suggested-by: Ingo Molnar
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Memory reserved for crashkernel could be large. So we should not allocate
    this memory bottom up from the end of kernel image.

    When SRAT is parsed, we will be able to know which memory is hotpluggable,
    and we can avoid allocating this memory for the kernel. So reorder
    reserve_crashkernel() after SRAT is parsed.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    In a memory hotplug system, any numa node the kernel resides in should be
    unhotpluggable. And for a modern server, each node could have at least
    16GB memory. So memory around the kernel image is highly likely
    unhotpluggable.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    The direct memory mapping page table setup is such a case:
    init_mem_mapping() is called before SRAT is parsed. To prevent page
    tables from being allocated within hotpluggable memory, we will use the
    bottom-up direction to allocate page tables, from the end of the kernel
    image towards higher memory.

    Note:
    As for allocating page tables in lower memory, TJ said:

    : This is an optional behavior which is triggered by a very specific kernel
    : boot param, which I suspect is gonna need to stick around to support
    : memory hotplug in the current setup unless we add another layer of address
    : translation to support memory hotplug.

    As for the concern that page tables may occupy too much low memory when
    using 4K mappings (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both
    disable the use of >4k pages), TJ said:

    : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
    : the problem in itself either. Ignoring the option if 4k mapping is
    : required and memory consumption would be prohibitive should work, no?
    : Something like that would be necessary if we're gonna worry about cases
    : like this no matter how we implement it, but, frankly, I'm not sure this
    : is something worth worrying about.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Create a new function memory_map_top_down() to factor out the top-down
    direct memory mapping page table setup. This is also a preparation for
    the following patch, which will introduce the bottom-up memory mapping.
    That is, we will put the two ways of page table setup into separate
    functions and choose which way to use in init_mem_mapping(), which makes
    the code clearer.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    In a memory hotplug system, any numa node the kernel resides in should be
    unhotpluggable. And for a modern server, each node could have at least
    16GB memory. So memory around the kernel image is highly likely
    unhotpluggable.

    So the basic idea is: allocate memory from the end of the kernel image
    towards higher memory. Since memory allocated before SRAT is parsed
    won't amount to much, it will most likely be in the same node as the
    kernel image.

    The current memblock can only allocate memory top-down. So this patch
    introduces a new bottom-up allocation mode. Later, when we use this
    allocation direction, we will limit the start address to above the end
    of the kernel image.

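    A hedged sketch of the bottom-up bound described above (simplified from
    the shape of memblock_find_in_range_node(); fallback handling and
    warnings are omitted):

        if (memblock_bottom_up()) {
                /* never hand out anything below the end of the kernel image */
                phys_addr_t bottom_up_start = max(start, __pa_symbol(_end));

                ret = __memblock_find_range_bottom_up(bottom_up_start, end,
                                                      size, align, nid);
                if (ret)
                        return ret;
                /* otherwise fall back to the regular top-down search */
        }
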
    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • [Problem]

    The current Linux cannot migrate pages used by the kernel because of the
    kernel direct mapping. In Linux kernel space, va = pa + PAGE_OFFSET.
    When the pa is changed, we cannot simply update the pagetable and keep the
    va unmodified. So the kernel pages are not migratable.

    There are also some other issues that make kernel pages non-migratable.
    For example, the physical address may be cached somewhere and used
    later. It is not easy to update all such caches.

    When doing memory hotplug in Linux, we first migrate all the pages in one
    memory device somewhere else, and then remove the device. But if pages
    are used by the kernel, they are not migratable. As a result, memory used
    by the kernel cannot be hot-removed.

    Modifying the kernel direct mapping mechanism is too difficult to do,
    and it may make the kernel slower and less stable. So we use the
    following way to do memory hotplug.

    [What we are doing]

    In Linux, memory in one numa node is divided into several zones. One of
    the zones is ZONE_MOVABLE, which the kernel won't use.

    In order to implement memory hotplug in Linux, we are going to arrange
    all hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this
    memory. To do this, we need ACPI's help.

    In ACPI, SRAT(System Resource Affinity Table) contains NUMA info. The
    memory affinities in SRAT record every memory range in the system, and
    also, flags specifying if the memory range is hotpluggable. (Please refer
    to ACPI spec 5.0 5.2.16)

    With the help of SRAT, we have to do the following two things to achieve our
    goal:

    1. When doing memory hot-add, allow users to arrange hotpluggable memory
    as ZONE_MOVABLE.
    (This has been done by the MOVABLE_NODE functionality in Linux.)

    2. when the system is booting, prevent bootmem allocator from allocating
    hotpluggable memory for the kernel before the memory initialization
    finishes.

    The problem 2 is the key problem we are going to solve. But before solving it,
    we need some preparation. Please see below.

    [Preparation]

    Bootloader has to load the kernel image into memory. And this memory must
    be unhotpluggable. We cannot prevent this anyway. So in a memory hotplug
    system, we can assume any node the kernel resides in is not hotpluggable.

    Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
    But memblock has already started to work. In the current kernel,
    memblock allocates the following memory before SRAT is parsed:

    setup_arch()
    |->memblock_x86_fill() /* memblock is ready */
    |......
    |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
    |->reserve_real_mode() /* allocate memory under 1MB */
    |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
    |->dma_contiguous_reserve() /* specified by user, should be low */
    |->setup_log_buf() /* specified by user, several mega bytes */
    |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
    |->acpi_initrd_override() /* several mega bytes */
    |->reserve_crashkernel() /* could be large, should reorder */
    |......
    |->initmem_init() /* Parse SRAT */

    According to Tejun's advice, before SRAT is parsed, we should try our best
    to allocate memory near the kernel image. Since the whole node the kernel
    resides in won't be hotpluggable, and a node on a modern server may have
    at least 16GB of memory, allocating several megabytes of memory around the
    kernel image won't cross into hotpluggable memory.

    [About this patchset]

    So this patchset is the preparation for the problem 2 that we want to
    solve. It does the following:

    1. Make memblock be able to allocate memory bottom up.
    1) Keep all the memblock APIs' prototype unmodified.
    2) When the direction is bottom up, keep the start address greater than the
    end of kernel image.

    2. Improve init_mem_mapping() to support allocating page tables in the
    bottom-up direction.

    3. Introduce "movable_node" boot option to enable and disable this
    functionality.

    This patch (of 6):

    Create a new function __memblock_find_range_top_down() to factor out the
    top-down allocation from memblock_find_in_range_node(). This is a
    preparation because we will introduce a new bottom-up allocation mode in
    the following patch.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Implement mmap base randomization for the bottom up direction, so ASLR
    works for both mmap layouts on s390. See also commit df54d6fa5427 ("x86
    get_unmapped_area(): use proper mmap base for bottom-up direction").

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • This is more or less the generic variant of commit 41aacc1eea64 ("x86
    get_unmapped_area: Access mmap_legacy_base through mm_struct member").

    So effectively architectures which use an own arch_pick_mmap_layout()
    implementation but call the generic arch_get_unmapped_area() now can
    also randomize their mmap_base.

    All architectures which have an own arch_pick_mmap_layout() and call the
    generic arch_get_unmapped_area() (arm64, s390, tile) currently set
    mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic
    arch_pick_mmap_layout() function. So this change is a no-op currently.

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Add SetPageReclaim() before __swap_writepage() so that the page can be
    moved to the tail of the inactive list, which avoids unnecessary page
    scanning since this page was already reclaimed by the swap subsystem.

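    A hedged sketch of the ordering described above (surrounding writeback
    setup omitted): the page is marked before it is handed to the swap
    writeback path.

        /* rotate to the tail of the inactive list once writeback completes */
        SetPageReclaim(page);

        /* start writeback */
        __swap_writepage(page, &wbc, end_swap_bio_write);
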
    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang