13 Nov, 2013

40 commits

  • The Hot Pluggable field in SRAT specifies which memory is hotpluggable.
    As mentioned before, if hotpluggable memory is used by the kernel, it
    cannot be hot-removed. So memory hotplug users may want to place all
    hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

    Memory hotplug users may also set a node as a movable node, which contains
    only ZONE_MOVABLE, so that the whole node can be hot-removed.

    But the kernel cannot use memory in ZONE_MOVABLE, so with this setup the
    kernel cannot use any memory in movable nodes. This degrades NUMA
    performance, and users who do not need memory hotplug may be unhappy.

    So we need a way to allow users to enable and disable this functionality.
    This patch introduces the movable_node boot option, which lets users
    choose not to consume hotpluggable memory at early boot time, so that it
    can later be set up as ZONE_MOVABLE.

    To achieve this, the movable_node boot option controls the memblock
    allocation direction: after memblock is ready but before SRAT is parsed,
    we should allocate memory near the kernel image, as explained in the
    previous patches. So if the movable_node boot option is set, the kernel
    does the following:

    1. After memblock is ready, make memblock allocate memory bottom-up.
    2. After SRAT is parsed, make memblock revert to its default behavior and
    allocate memory top-down.

    Users can specify "movable_node" on the kernel command line to enable this
    functionality. Those who don't use memory hotplug, or who don't want to
    lose NUMA performance, can simply not specify anything; the kernel will
    work as before.
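
    A minimal sketch of how the option can be wired up (the real change is
    spread across arch/x86 setup code and NUMA init; memblock_set_bottom_up()
    and memblock_bottom_up() are the helpers this series adds):

        static int __init cmdline_parse_movable_node(char *p)
        {
                /*
                 * Until SRAT tells us what is hotpluggable, allocate
                 * bottom-up so early allocations stay near the kernel image.
                 */
                memblock_set_bottom_up(true);
                return 0;
        }
        early_param("movable_node", cmdline_parse_movable_node);

        /* ...and once SRAT has been parsed during NUMA init: */
        memblock_set_bottom_up(false);    /* back to the default top-down */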

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Suggested-by: Kamezawa Hiroyuki
    Suggested-by: Ingo Molnar
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Memory reserved for crashkernel could be large, so we should not allocate
    this memory bottom-up from the end of the kernel image.

    Once SRAT is parsed, we know which memory is hotpluggable and can avoid
    allocating that memory for the kernel. So reorder reserve_crashkernel()
    to run after SRAT is parsed.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    In a memory hotplug system, any NUMA node the kernel resides in should be
    treated as non-hotpluggable. And on a modern server, each node could have
    at least 16GB of memory, so memory around the kernel image is very likely
    not hotpluggable.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    One such case is the direct memory mapping page table setup:
    init_mem_mapping() is called before SRAT is parsed. To prevent page
    tables from being allocated in hotpluggable memory, we use the bottom-up
    direction and allocate page tables from the end of the kernel image
    towards higher memory.
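
    Roughly, init_mem_mapping() then chooses the mapping direction like this
    (a sketch; see the actual patch for the exact bounds and comments):

        if (memblock_bottom_up()) {
                unsigned long kernel_end = __pa_symbol(_end);

                /*
                 * Map [kernel_end, end) first so that page tables can be
                 * allocated just above the kernel image, then map
                 * [ISA_END_ADDRESS, kernel_end) using those page tables.
                 */
                memory_map_bottom_up(kernel_end, end);
                memory_map_bottom_up(ISA_END_ADDRESS, kernel_end);
        } else {
                memory_map_top_down(ISA_END_ADDRESS, end);
        }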

    Note:
    As for allocating page tables in lower memory, TJ said:

    : This is an optional behavior which is triggered by a very specific kernel
    : boot param, which I suspect is gonna need to stick around to support
    : memory hotplug in the current setup unless we add another layer of address
    : translation to support memory hotplug.

    As for the concern that page tables may occupy too much low memory when 4K
    mappings are used (CONFIG_DEBUG_PAGEALLOC and CONFIG_KMEMCHECK both
    disable the use of >4k pages), TJ said:

    : But as I said in the same paragraph, parsing SRAT earlier doesn't solve
    : the problem in itself either. Ignoring the option if 4k mapping is
    : required and memory consumption would be prohibitive should work, no?
    : Something like that would be necessary if we're gonna worry about cases
    : like this no matter how we implement it, but, frankly, I'm not sure this
    : is something worth worrying about.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Create a new function memory_map_top_down() to factor out the top-down
    direct memory mapping pagetable setup. This is also preparation for the
    following patch, which introduces bottom-up memory mapping: the two ways
    of setting up pagetables are put into separate functions, and
    init_mem_mapping() chooses which one to use, which makes the code clearer.

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • The Linux kernel cannot migrate pages used by the kernel. As a result,
    kernel pages cannot be hot-removed. So we cannot allocate hotpluggable
    memory for the kernel.

    ACPI SRAT (System Resource Affinity Table) contains the memory hotplug
    info. But before SRAT is parsed, memblock has already started to allocate
    memory for the kernel. So we need to prevent memblock from doing this.

    In a memory hotplug system, any NUMA node the kernel resides in should be
    treated as non-hotpluggable. And on a modern server, each node could have
    at least 16GB of memory, so memory around the kernel image is very likely
    not hotpluggable.

    So the basic idea is: allocate memory upwards from the end of the kernel
    image towards higher addresses. Since not much memory is allocated before
    SRAT is parsed, it will very likely end up in the same node as the kernel
    image.

    The current memblock can only allocate memory top-down, so this patch
    introduces a new bottom-up allocation mode. When this direction is used to
    allocate memory in later patches, the start address is limited to lie
    above the end of the kernel image.
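
    When the bottom-up direction is enabled, the allocator core roughly does
    the following (a sketch of the memblock_find_in_range_node() path; names
    follow the patch description and details may differ from the final diff):

        if (memblock_bottom_up()) {
                phys_addr_t bottom_up_start;
                phys_addr_t kernel_end = __pa_symbol(_end);

                /* never hand out memory below the end of the kernel image */
                bottom_up_start = max(start, kernel_end);

                ret = __memblock_find_range_bottom_up(bottom_up_start, end,
                                                      size, align, nid);
                if (ret)
                        return ret;

                /* nothing found above the kernel: warn and fall back */
        }

        return __memblock_find_range_top_down(start, end, size, align, nid);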

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Tejun Heo
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • [Problem]

    Linux currently cannot migrate pages used by the kernel because of the
    kernel direct mapping: in kernel space, va = pa + PAGE_OFFSET, so when the
    pa changes we cannot simply update the pagetable and keep the va
    unmodified. Kernel pages are therefore not migratable.

    There are also other issues that make kernel pages non-migratable. For
    example, a physical address may be cached somewhere and used later; it is
    not feasible to update all such caches.

    When doing memory hotplug in Linux, we first migrate all the pages in one
    memory device somewhere else, and then remove the device. But if pages
    are used by the kernel, they are not migratable. As a result, memory used
    by the kernel cannot be hot-removed.

    Modifying the kernel direct mapping mechanism is too difficult, and it
    could hurt kernel performance and stability. So we take the following
    approach to memory hotplug.

    [What we are doing]

    In Linux, memory in one numa node is divided into several zones. One of
    the zones is ZONE_MOVABLE, which the kernel won't use.

    In order to implement memory hotplug in Linux, we are going to arrange all
    hotpluggable memory in ZONE_MOVABLE so that the kernel won't use this
    memory. To do this, we need ACPI's help.

    In ACPI, the SRAT (System Resource Affinity Table) contains NUMA info. The
    memory affinities in SRAT record every memory range in the system, along
    with flags specifying whether the range is hotpluggable. (Please refer
    to ACPI spec 5.0, section 5.2.16.)

    With the help of SRAT, we have to do the following two things to achieve our
    goal:

    1. When doing memory hot-add, allow users to arrange hotpluggable memory
    as ZONE_MOVABLE.
    (This has already been done by the MOVABLE_NODE functionality in Linux.)

    2. When the system is booting, prevent the bootmem allocator from
    allocating hotpluggable memory for the kernel before memory
    initialization finishes.

    Problem 2 is the key problem we are going to solve, but before solving it
    we need some preparation. Please see below.

    [Preparation]

    The bootloader has to load the kernel image into memory, and that memory
    must not be hotpluggable; we cannot prevent this. So in a memory hotplug
    system, we can assume any node the kernel resides in is not hotpluggable.

    Before SRAT is parsed, we don't know which memory ranges are hotpluggable.
    But memblock has already started to work. In the current kernel,
    memblock allocates the following memory before SRAT is parsed:

    setup_arch()
    |->memblock_x86_fill() /* memblock is ready */
    |......
    |->early_reserve_e820_mpc_new() /* allocate memory under 1MB */
    |->reserve_real_mode() /* allocate memory under 1MB */
    |->init_mem_mapping() /* allocate page tables, about 2MB to map 1GB memory */
    |->dma_contiguous_reserve() /* specified by user, should be low */
    |->setup_log_buf() /* specified by user, several mega bytes */
    |->relocate_initrd() /* could be large, but will be freed after boot, should reorder */
    |->acpi_initrd_override() /* several mega bytes */
    |->reserve_crashkernel() /* could be large, should reorder */
    |......
    |->initmem_init() /* Parse SRAT */

    According to Tejun's advice, before SRAT is parsed we should try our best
    to allocate memory near the kernel image. Since the whole node the kernel
    resides in won't be hotpluggable, and a node on a modern server may have
    at least 16GB of memory, allocating several megabytes of memory around
    the kernel image won't cross into hotpluggable memory.

    [About this patchset]

    So this patchset is the preparation for solving problem 2. It does the
    following:

    1. Make memblock able to allocate memory bottom-up.
    1) Keep all the memblock APIs' prototypes unmodified.
    2) When the direction is bottom-up, keep the start address above the end
    of the kernel image.

    2. Improve init_mem_mapping() to support allocating page tables in the
    bottom-up direction.

    3. Introduce the "movable_node" boot option to enable and disable this
    functionality.

    This patch (of 6):

    Create a new function __memblock_find_range_top_down() to factor the
    top-down allocation out of memblock_find_in_range_node(). This is
    preparation for the new bottom-up allocation mode introduced in the
    following patch.
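
    The factored-out helper looks roughly like this (a sketch; it simply walks
    the free ranges from high to low and returns the first fit):

        static phys_addr_t __init_memblock
        __memblock_find_range_top_down(phys_addr_t start, phys_addr_t end,
                                       phys_addr_t size, phys_addr_t align,
                                       int nid)
        {
                phys_addr_t this_start, this_end, cand;
                u64 i;

                for_each_free_mem_range_reverse(i, nid, &this_start,
                                                &this_end, NULL) {
                        this_start = clamp(this_start, start, end);
                        this_end = clamp(this_end, start, end);

                        if (this_end < size)
                                continue;

                        cand = round_down(this_end - size, align);
                        if (cand >= this_start)
                                return cand;
                }

                return 0;
        }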

    Signed-off-by: Tang Chen
    Signed-off-by: Zhang Yanfei
    Acked-by: Tejun Heo
    Acked-by: Toshi Kani
    Cc: Thomas Gleixner
    Cc: Ingo Molnar
    Cc: "H. Peter Anvin"
    Cc: Wanpeng Li
    Cc: Thomas Renninger
    Cc: Yinghai Lu
    Cc: Jiang Liu
    Cc: Wen Congyang
    Cc: Lai Jiangshan
    Cc: Yasuaki Ishimatsu
    Cc: Taku Izumi
    Cc: Mel Gorman
    Cc: Michal Nazarewicz
    Cc: Minchan Kim
    Cc: Rik van Riel
    Cc: Johannes Weiner
    Cc: Kamezawa Hiroyuki
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Tang Chen
     
  • Implement mmap base randomization for the bottom up direction, so ASLR
    works for both mmap layouts on s390. See also commit df54d6fa5427 ("x86
    get_unmapped_area(): use proper mmap base for bottom-up direction").

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • This is more or less the generic variant of commit 41aacc1eea64 ("x86
    get_unmapped_area: Access mmap_legacy_base through mm_struct member").

    So effectively, architectures which use their own arch_pick_mmap_layout()
    implementation but call the generic arch_get_unmapped_area() can now also
    randomize their mmap_base.

    All architectures which have their own arch_pick_mmap_layout() and call
    the generic arch_get_unmapped_area() (arm64, s390, tile) currently set
    mmap_base to TASK_UNMAPPED_BASE. This is also true for the generic
    arch_pick_mmap_layout() function, so this change is currently a no-op.

    Signed-off-by: Heiko Carstens
    Cc: Radu Caragea
    Cc: Michel Lespinasse
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Chris Metcalf
    Cc: Martin Schwidefsky
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Heiko Carstens
     
  • Add SetPageReclaim() before __swap_writepage() so that the page can be
    moved to the tail of the inactive list, which avoids unnecessary page
    scanning of a page the swap subsystem has already reclaimed.
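
    A sketch of the resulting ordering in zswap's writeback path (function
    names as used in mm/zswap.c at the time; details may differ):

        /* have writeback completion rotate the page to the tail of the
         * inactive list instead of leaving it for the scanner */
        SetPageReclaim(page);
        __swap_writepage(page, &wbc, end_swap_bio_write);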

    Signed-off-by: Weijie Yang
    Reviewed-by: Bob Liu
    Reviewed-by: Minchan Kim
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Weijie Yang
     
  • When there are processes heavily creating small files while sync(2) is
    running, it can easily happen that quite some new files are created
    between WB_SYNC_NONE and WB_SYNC_ALL pass of sync(2). That can happen
    especially if there are several busy filesystems (remember that sync
    traverses filesystems sequentially and waits in WB_SYNC_ALL phase on one
    fs before starting it on another fs). Because WB_SYNC_ALL pass is slow
    (e.g. causes a transaction commit and cache flush for each inode in
    ext3), resulting sync(2) times are rather large.

    The following script reproduces the problem:

    function run_writers
    {
            for (( i = 0; i < 10; i++ )); do
                    mkdir $1/dir$i
                    for (( j = 0; j < 40000; j++ )); do
                            dd if=/dev/zero of=$1/dir$i/$j bs=4k count=4 &>/dev/null
                    done &
            done
    }

    for dir in "$@"; do
            run_writers $dir
    done

    sleep 40
    time sync

    Fix the problem by disregarding inodes dirtied after sync(2) was called
    in the WB_SYNC_ALL pass. To allow for this, sync_inodes_sb() now takes
    a time stamp when sync has started which is used for setting up work for
    flusher threads.

    To give some numbers: when the above script is run on two ext4 filesystems
    on a simple SATA drive, the average sync time from 10 runs is 267.549
    seconds with a standard deviation of 104.799426. With the patched kernel,
    the average sync time from 10 runs is 2.995 seconds with a standard
    deviation of 0.096.

    Signed-off-by: Jan Kara
    Reviewed-by: Fengguang Wu
    Reviewed-by: Dave Chinner
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jan Kara
     
  • The soft dirty bit allows us to track which pages have been written since
    the last clear_refs (done with "echo 4 > /proc/pid/clear_refs"). This is
    useful for userspace applications that want to know their memory
    footprints.

    Note that the kernel exposes this flag via bit 55 of /proc/pid/pagemap,
    and the semantics are not the default ones (they are scheduled to become
    the default in the near future). However, the flag shifts to the new
    semantics at the first clear_refs, and users of the soft dirty bit always
    do that before using the bit, so it's not a big deal. Users must avoid
    relying on the bit in page-types before the first clear_refs.
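
    As an illustration (not part of the patch), a minimal userspace reader of
    the soft dirty bit for one page of the current process could look like
    this; the pagemap layout (64-bit entries, soft dirty in bit 55) is
    documented in Documentation/vm/pagemap.txt:

        #include <stdio.h>
        #include <stdlib.h>
        #include <stdint.h>
        #include <unistd.h>
        #include <fcntl.h>

        int main(void)
        {
                long psize = sysconf(_SC_PAGESIZE);
                char *buf = malloc(psize);
                uint64_t entry;
                off_t off;
                int fd;

                buf[0] = 1;     /* touch the page so it is mapped */

                fd = open("/proc/self/pagemap", O_RDONLY);
                if (fd < 0) {
                        perror("open");
                        return 1;
                }

                /* one 64-bit entry per virtual page */
                off = ((uintptr_t)buf / psize) * sizeof(entry);
                if (pread(fd, &entry, sizeof(entry), off) !=
                    (ssize_t)sizeof(entry)) {
                        perror("pread");
                        return 1;
                }

                printf("soft-dirty: %d\n", (int)((entry >> 55) & 1));
                close(fd);
                return 0;
        }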

    Signed-off-by: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Cc: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • This flag shows that the VMA is "newly created" and thus represents
    "dirty" in the task's VM.

    You can clear it by "echo 4 > /proc/pid/clear_refs."

    Signed-off-by: Naoya Horiguchi
    Cc: Wu Fengguang
    Cc: Pavel Emelyanov
    Acked-by: Cyrill Gorcunov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • Signed-off-by: Zhang Yanfei
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • During swapoff, the frontswap_map was NULL-ified before calling
    frontswap_invalidate_area(). However, frontswap_invalidate_area() exits
    early if frontswap_map is NULL, so invalidate was never actually called
    during swapoff.

    This patch moves frontswap_map_set() in swapoff just after calling
    frontswap_invalidate_area() so outside of locks (swap_lock and
    swap_info_struct->lock). This shouldn't be a problem as during swapon
    the frontswap_map_set() is called also outside of any locks.

    Signed-off-by: Krzysztof Kozlowski
    Reviewed-by: Seth Jennings
    Cc: Konrad Rzeszutek Wilk
    Cc: Shaohua Li
    Cc: Minchan Kim
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Krzysztof Kozlowski
     
  • Signed-off-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Seth Jennings
     
  • Commit 248ac0e1943a ("mm/vmalloc: remove guard page from between vmap
    blocks") had the side effect of making vmap_area.va_end member point to
    the next vmap_area.va_start. This was creating an artificial reference
    to vmalloc'ed objects and kmemleak was rarely reporting vmalloc() leaks.

    This patch marks the vmap_area containing pointers explicitly and
    reduces the min ref_count to 2 as vm_struct still contains a reference
    to the vmalloc'ed object. The kmemleak add_scan_area() function has
    been improved to allow a SIZE_MAX argument covering the rest of the
    object (for simpler calling sites).

    Signed-off-by: Catalin Marinas
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Catalin Marinas
     
  • We pass the number of pages which hold the page structs of a memory
    section to free_map_bootmem(). This is correct when
    !CONFIG_SPARSEMEM_VMEMMAP, but wrong when CONFIG_SPARSEMEM_VMEMMAP: in
    that case we should pass the number of pages of a memory section to
    free_map_bootmem().

    So the fix is to remove the nr_pages parameter. When
    CONFIG_SPARSEMEM_VMEMMAP, we directly use the predefined macro
    PAGES_PER_SECTION in free_map_bootmem(). When !CONFIG_SPARSEMEM_VMEMMAP,
    we calculate the number of pages needed to hold the page structs for a
    memory section and use that value in free_map_bootmem().

    This was found by reading the code; I currently have no machine that
    supports memory hot-remove to test the fix on.

    Signed-off-by: Zhang Yanfei
    Reviewed-by: Wanpeng Li
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • For the functions below,

    - sparse_add_one_section()
    - kmalloc_section_memmap()
    - __kmalloc_section_memmap()
    - __kfree_section_memmap()

    they are always invoked to operate on one memory section, so it is
    redundant to pass a nr_pages parameter that is always the number of pages
    in one section. We can use the predefined macro PAGES_PER_SECTION directly
    instead of passing the parameter.

    Signed-off-by: Zhang Yanfei
    Cc: Wen Congyang
    Cc: Tang Chen
    Cc: Toshi Kani
    Cc: Yasuaki Ishimatsu
    Cc: Yinghai Lu
    Cc: Yasunori Goto
    Cc: Andy Whitcroft
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Zhang Yanfei
     
  • The memory.numa_stat file was not hierarchical: memory charged to the
    children was not shown in the parent's numa_stat.

    This change adds the "hierarchical_" stats to the existing stats. The
    new hierarchical stats include the sum of all children's values in
    addition to the value of the memcg.

    Tested: Create cgroup a, a/b and run workload under b. The values of
    b are included in the "hierarchical_*" under a.

    $ cd /sys/fs/cgroup
    $ echo 1 > memory.use_hierarchy
    $ mkdir a a/b

    Run workload in a/b:
    $ (echo $BASHPID >> a/b/cgroup.procs && cat /some/file && bash) &

    The hierarchical_ fields in parent (a) show use of workload in a/b:
    $ cat a/memory.numa_stat
    total=0 N0=0 N1=0 N2=0 N3=0
    file=0 N0=0 N1=0 N2=0 N3=0
    anon=0 N0=0 N1=0 N2=0 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    $ cat a/b/memory.numa_stat
    total=908 N0=552 N1=317 N2=39 N3=0
    file=850 N0=549 N1=301 N2=0 N3=0
    anon=58 N0=3 N1=16 N2=39 N3=0
    unevictable=0 N0=0 N1=0 N2=0 N3=0
    hierarchical_total=908 N0=552 N1=317 N2=39 N3=0
    hierarchical_file=850 N0=549 N1=301 N2=0 N3=0
    hierarchical_anon=58 N0=3 N1=16 N2=39 N3=0
    hierarchical_unevictable=0 N0=0 N1=0 N2=0 N3=0

    Signed-off-by: Ying Han
    Signed-off-by: Greg Thelen
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Ying Han
     
  • Refactor mem_control_numa_stat_show() to use a new stats structure for
    smaller and simpler code. This consolidates nearly identical code.

    text       data       bss        dec         hex     filename
    8,137,679  1,703,496  1,896,448  11,737,623  b31a17  vmlinux.before
    8,136,911  1,703,496  1,896,448  11,736,855  b31717  vmlinux.after
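
    A sketch of the kind of stats table the refactor introduces (field names
    approximate the patch):

        static const struct numa_stat {
                const char *name;
                unsigned int lru_mask;
        } stats[] = {
                { "total",       LRU_ALL },
                { "file",        LRU_ALL_FILE },
                { "anon",        LRU_ALL_ANON },
                { "unevictable", BIT(LRU_UNEVICTABLE) },
        };

    mem_control_numa_stat_show() then loops over this array twice: once for
    the memcg itself and once, prefixed with "hierarchical_", summing the
    values over the whole subtree.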

    Signed-off-by: Greg Thelen
    Signed-off-by: Ying Han
    Cc: Johannes Weiner
    Cc: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Greg Thelen
     
  • Use more appropriate NUMA_NO_NODE instead of -1

    Signed-off-by: Jianguo Wu
    Acked-by: KOSAKI Motohiro
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Khugepaged scans/frees HPAGE_PMD_NR normal pages and replaces them with a
    hugepage allocated from the node of the first scanned normal page. This
    policy is too rough and may produce unexpected results for its users.

    The problem is that the original page balancing among all nodes is broken
    once khugepaged starts. Consider the case where the first scanned normal
    page is allocated from node A while most of the other scanned normal
    pages are allocated from node B or C: khugepaged will always allocate the
    hugepage from node A, which causes extra memory pressure on node A that
    did not exist before khugepaged started.

    This patch tries to fix the problem by making khugepaged allocate the
    hugepage from the node that recorded the most hits among the scanned
    normal pages, so that the effect on the original page balancing is
    minimized.

    The other problem is that if the scanned normal pages are allocated
    equally from nodes A, B and C, node A will still suffer extra memory
    pressure after khugepaged starts.

    Andrew Davidoff reported a related issue several days ago. He wanted his
    application to interleave among all nodes, and "numactl --interleave=all
    ./test" was used to run the testcase, but the result was not as expected.

    cat /proc/2814/numa_maps:
    7f50bd440000 interleave:0-3 anon=51403 dirty=51403 N0=435 N1=435 N2=435 N3=50098

    The end result showed that most pages came from node 3 instead of being
    interleaved among nodes 0-3, which was unreasonable.

    This patch also fixes that issue by allocating hugepages round robin from
    all nodes that share the same hit record; after this patch the result is
    as expected:

    7f78399c0000 interleave:0-3 anon=51403 dirty=51403 N0=12723 N1=12723 N2=13235 N3=12722

    The simple testcase is like this:

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main()
    {
            char *p;
            int i;
            int j;

            for (i = 0; i < 200; i++) {
                    p = (char *)malloc(1048576);
                    printf("malloc done\n");

                    if (p == 0) {
                            printf("Out of memory\n");
                            return 1;
                    }
                    for (j = 0; j < 1048576; j++) {
                            p[j] = 'A';
                    }
                    printf("touched memory\n");

                    sleep(1);
            }
            printf("enter sleep\n");
            while (1) {
                    sleep(100);
            }
    }
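
    A sketch of the target-node selection described above (simplified; the
    real function lives in mm/huge_memory.c and uses khugepaged_node_load[],
    the per-node hit counter this patch introduces):

        static int khugepaged_find_target_node(void)
        {
                static int last_khugepaged_target_node = NUMA_NO_NODE;
                int nid, target_node = 0, max_value = 0;

                /* pick the first node with the largest number of hits */
                for (nid = 0; nid < MAX_NUMNODES; nid++) {
                        if (khugepaged_node_load[nid] > max_value) {
                                max_value = khugepaged_node_load[nid];
                                target_node = nid;
                        }
                }

                /* round robin among nodes that share the same maximum */
                if (target_node <= last_khugepaged_target_node) {
                        for (nid = last_khugepaged_target_node + 1;
                             nid < MAX_NUMNODES; nid++) {
                                if (khugepaged_node_load[nid] == max_value) {
                                        target_node = nid;
                                        break;
                                }
                        }
                }

                last_khugepaged_target_node = target_node;
                return target_node;
        }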

    [akpm@linux-foundation.org: make last_khugepaged_target_node local to khugepaged_find_target_node()]
    Reported-by: Andrew Davidoff
    Tested-by: Andrew Davidoff
    Signed-off-by: Bob Liu
    Cc: Andrea Arcangeli
    Cc: Kirill A. Shutemov
    Cc: Mel Gorman
    Cc: Yasuaki Ishimatsu
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Move alloc_hugepage() to a better place; there is no need for a separate
    #ifndef CONFIG_NUMA.

    Signed-off-by: Bob Liu
    Reviewed-by: Yasuaki Ishimatsu
    Acked-by: Kirill A. Shutemov
    Cc: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Andrew Davidoff
    Cc: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Bob Liu
     
  • Signed-off-by: Christian Hesse
    Acked-by: Seth Jennings
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Christian Hesse
     
  • Don't warn twice in __vmalloc_area_node() and __vmalloc_node_range() when
    the __vmalloc_area_node() allocation fails. This reverts commit
    46c001a2753f ("mm/vmalloc.c: emit the failure message before return").

    Signed-off-by: Wanpeng Li
    Reviewed-by: Zhang Yanfei
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The VM_UNINITIALIZED/VM_UNLIST flag introduced by f5252e009d5b ("mm:
    avoid null pointer access in vm_struct via /proc/vmallocinfo") is used to
    avoid accessing the pages field of a vm_struct whose pages have not been
    allocated yet when show_numa_info() is called.

    This patch moves the check to just before show_numa_info(), so that some
    messages can still be dumped via /proc/vmallocinfo. This reverts commit
    d157a55815ff ("mm/vmalloc.c: check VM_UNINITIALIZED flag in s_show
    instead of show_numa_info").

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • There is a race window between vmap_area tear down and showing vmap_area
    information:

    CPU A: remove_vm_area()
            spin_lock(&vmap_area_lock);
            va->vm = NULL;
            va->flags &= ~VM_VM_AREA;
            spin_unlock(&vmap_area_lock);

    CPU B: s_show()
            spin_lock(&vmap_area_lock);
            if (va->flags & (VM_LAZY_FREE | VM_LAZY_FREEING))
                    return 0;
            if (!(va->flags & VM_VM_AREA)) {
                    seq_printf(m, "0x%pK-0x%pK %7ld vm_map_ram\n",
                            (void *)va->va_start, (void *)va->va_end,
                            va->va_end - va->va_start);
                    return 0;
            }

    CPU A (continued):
            free_unmap_vmap_area(va);
                flush_cache_vunmap
                free_unmap_vmap_area_noflush
                    unmap_vmap_area
                    free_vmap_area_noflush
                        va->flags |= VM_LAZY_FREE

    The assumption that !VM_VM_AREA indicates a vm_map_ram allocation was
    introduced by d4033afdf828 ("mm, vmalloc: iterate vmap_area_list,
    instead of vmlist, in vmallocinfo()").

    However, !VM_VM_AREA can also mean that the vmap_area is being torn down
    in the race window shown above. This patch fixes it by not dumping any
    information for the !VM_VM_AREA case, and also removes the (VM_LAZY_FREE
    | VM_LAZY_FREEING) check since those flags are not possible in the
    !VM_VM_AREA case.

    Suggested-by: Joonsoo Kim
    Acked-by: KOSAKI Motohiro
    Signed-off-by: Wanpeng Li
    Cc: Mitsuo Hayasaka
    Cc: Zhang Yanfei
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • The caller address has already been set in set_vmalloc_vm(), there's no
    need to set it again in __vmalloc_area_node.

    Reviewed-by: Zhang Yanfei
    Signed-off-by: Wanpeng Li
    Cc: Joonsoo Kim
    Cc: KOSAKI Motohiro
    Cc: Mitsuo Hayasaka
    Cc: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Wanpeng Li
     
  • mpol_to_str() should not fail. Currently, it either fails because the
    string buffer is too small or because a string hasn't been defined for a
    mempolicy mode.

    If a new mempolicy mode is introduced and no string is defined for it,
    just warn and return "unknown".

    If the buffer is too small, just truncate the string and return, the
    same behavior as snprintf().

    This also fixes a bug where there was no NULL-byte termination when doing
    *p++ = '=' and *p++ = ':' and maxlen had been reached.

    Signed-off-by: David Rientjes
    Cc: KOSAKI Motohiro
    Cc: Chen Gang
    Cc: Rik van Riel
    Cc: Dave Jones
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    David Rientjes
     
  • Use more appropriate NUMA_NO_NODE instead of -1 in all archs' module_alloc()

    Signed-off-by: Jianguo Wu
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu
     
  • Chen Gong pointed out that set/unset_migratetype_isolate() was done in
    different functions in mm/memory-failure.c, which makes the code less
    readable/maintainable. So this patch does it in soft_offline_page().

    With this patch, we get to hold lock_memory_hotplug() longer but it's
    not a problem because races between memory hotplug and soft offline are
    very rare.

    Signed-off-by: Naoya Horiguchi
    Reviewed-by: Chen, Gong
    Acked-by: Andi Kleen
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Naoya Horiguchi
     
  • cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
    mem_online_node() to bring the CPU's node online if it is offline, and
    then call build_all_zonelists() to initialize the zone list.

    These steps are specific to memory hotplug and should be managed in
    mm/memory_hotplug.c. lock_memory_hotplug() should also be held for the
    whole sequence.

    For this reason, this patch replaces mem_online_node() with
    try_online_node(), which performs the whole steps with
    lock_memory_hotplug() held. try_online_node() is named after
    try_offline_node() as they have similar purpose.

    There is no functional change in this patch.
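
    The call site in cpu_up() then reduces to something like (a sketch):

        err = try_online_node(cpu_to_node(cpu));
        if (err)
                return err;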

    Signed-off-by: Toshi Kani
    Reviewed-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Toshi Kani
     
  • On large memory machines it can take a few minutes to get through
    free_all_bootmem().

    Currently, when free_all_bootmem() calls __free_pages_memory(), the number
    of contiguous pages that __free_pages_memory() passes to the buddy
    allocator is limited to BITS_PER_LONG. BITS_PER_LONG was originally
    chosen to keep things similar to mm/nobootmem.c. But it is more efficient
    to limit it to MAX_ORDER.

              base    new    change
    8TB       202s    172s      30s
    16TB      401s    351s      50s

    That is around 1%-3% improvement on total boot time.
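
    The resulting loop is roughly the following (a sketch of
    __free_pages_memory() after the change):

        static void __init __free_pages_memory(unsigned long start,
                                               unsigned long end)
        {
                int order;

                while (start < end) {
                        /* largest aligned chunk that fits, now capped at
                         * MAX_ORDER - 1 instead of a BITS_PER_LONG limit */
                        order = min(MAX_ORDER - 1UL, __ffs(start));

                        while (start + (1UL << order) > end)
                                order--;

                        __free_pages_bootmem(pfn_to_page(start), order);

                        start += 1UL << order;
                }
        }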

    This patch was spun off from the boot time rfc Robin and I had been
    working on.

    Signed-off-by: Robin Holt
    Signed-off-by: Nathan Zimmer
    Cc: Robin Holt
    Cc: "H. Peter Anvin"
    Cc: Ingo Molnar
    Cc: Mike Travis
    Cc: Yinghai Lu
    Cc: Mel Gorman
    Acked-by: Johannes Weiner
    Reviewed-by: Wanpeng Li
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Robin Holt
     
  • Use a helper function to check whether we need to deal with the oom
    condition.

    Signed-off-by: Qiang Huang
    Acked-by: David Rientjes
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Qiang Huang
     
  • Use "pfn_to_nid(pfn)" instead of "page_to_nid(pfn_to_page(pfn))".

    Signed-off-by: Xishi Qiu
    Acked-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • is_memblock_offlined() returning 1 means the memory block is offlined,
    but is_memblock_offlined_cb() returning 1 means the memory block is not
    offlined. This can confuse people, so rename the function.

    Signed-off-by: Xishi Qiu
    Acked-by: Yasuaki Ishimatsu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Use "populated_zone()" instead of "if (zone->present_pages)". Simplify
    the code, no functional change.

    Signed-off-by: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
    pgdat->node_spanned_pages". Simplify the code, no functional change.
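
    For reference, the helper is simply:

        static inline unsigned long pgdat_end_pfn(pg_data_t *pgdat)
        {
                return pgdat->node_start_pfn + pgdat->node_spanned_pages;
        }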

    Signed-off-by: Xishi Qiu
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
    pgdat->node_spanned_pages". Simplify the code, no functional change.

    Signed-off-by: Xishi Qiu
    Cc: James Hogan
    Cc: "Luck, Tony"
    Cc: Benjamin Herrenschmidt
    Cc: Paul Mackerras
    Cc: Paul Mundt
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Xishi Qiu
     
  • Since commit 13ece886d99c ("thp: transparent hugepage config choice"),
    transparent hugepage support is disabled by default, and
    TRANSPARENT_HUGEPAGE_ALWAYS is configured when TRANSPARENT_HUGEPAGE=y.

    And since commit d39d33c332c6 ("thp: enable direct defrag"), defrag is
    enabled for all transparent hugepage page faults by default, not only in
    MADV_HUGEPAGE regions.

    Signed-off-by: Jianguo Wu
    Reviewed-by: Wanpeng Li
    Acked-by: Kirill A. Shutemov
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jianguo Wu