Eric Lee / smarc-fsl-linux-kernel

24 Feb, 2013

40 commits

293c07e31 memory-failure: use num_poisoned_pages instead of mce_bad_pages

Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.

[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Suggested-by: Borislav Petkov
Reviewed-by: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xishi Qiu
2013-02-24 09:50:15 +0800
fa8dd8a92 memory-failure: do code refactor of soft_offline_page()

There are too many return points randomly intermingled with some "goto
done" return points. So adjust the function structure, one for the
success path, the other for the failure path. Use atomic_long_inc
instead of atomic_long_add.

Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Suggested-by: Andrew Morton
Cc: Borislav Petkov
Cc: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xishi Qiu
2013-02-24 09:50:15 +0800
0ebff32c3 memory-failure: fix an error of mce_bad_pages statistics

When doing

$ echo paddr > /sys/devices/system/memory/soft_offline_page

to offline a *free* page, the value of mce_bad_pages is incremented and
the page's HWPoison flag is set, but the page is still managed by the
buddy allocator.

$ cat /proc/meminfo | grep HardwareCorrupted

shows the value.

If we offline the same page again, the value of mce_bad_pages is
incremented *again*, so the count is now wrong. This assumes the page is
still free during this short window.

soft_offline_page()
    get_any_page()
        "else if (is_free_buddy_page(p))" branch return 0
    "goto done";
    "atomic_long_add(1, &mce_bad_pages);"

This patch:

Move the poisoned-page check to the beginning of the function in order to
fix the error.
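
The double count can be reproduced in a small user-space model. This is only a sketch under stated assumptions: struct page_model, bad_pages, offline_old() and offline_new() are hypothetical stand-ins, not kernel code, and the old control flow is simplified to the path quoted above.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model; not the kernel's struct page. */
struct page_model {
    bool hwpoison; /* models the HWPoison flag */
    bool free;     /* page still sits in the buddy allocator */
};

long bad_pages; /* stands in for mce_bad_pages */

/* Old ordering: a free page jumps straight to the accounting step,
 * so the HWPoison check further down is never reached for it. */
int offline_old(struct page_model *p)
{
    if (p->free)
        goto done;
    if (p->hwpoison)
        return -EBUSY;
    /* ... an in-use page would be migrated here ... */
done:
    p->hwpoison = true;
    bad_pages++;
    return 0;
}

/* New ordering: reject already-poisoned pages before anything else. */
int offline_new(struct page_model *p)
{
    if (p->hwpoison)
        return -EBUSY;
    /* ... free pages need no migration; in-use pages would be migrated ... */
    p->hwpoison = true;
    bad_pages++;
    return 0;
}
```

Offlining the same free page twice with offline_old() counts it twice, while offline_new() rejects the second request and leaves the counter unchanged.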

Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Tested-by: Naoya Horiguchi
Cc: Borislav Petkov
Cc: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Xishi Qiu
2013-02-24 09:50:15 +0800
194159fbc mm: remove MIGRATE_ISOLATE check in hotpath

Several functions test MIGRATE_ISOLATE, and some of them are hot paths, but
MIGRATE_ISOLATE is used only when CONFIG_MEMORY_ISOLATION is enabled (i.e.,
by CMA, memory-hotplug and memory-failure), which is not a common
configuration. So let's not add unnecessary overhead and code when
CONFIG_MEMORY_ISOLATION is disabled.
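
One common way to get this effect is a helper that collapses to a constant when the option is off. The sketch below assumes that pattern; the enum values and the names is_isolate() and count_as_free() are hypothetical and chosen for illustration only.

```c
#include <stdbool.h>

/* Hypothetical migrate types; the real kernel values differ. */
enum migratetype_model {
    MT_UNMOVABLE,
    MT_MOVABLE,
#ifdef CONFIG_MEMORY_ISOLATION
    MT_ISOLATE,
#endif
    MT_TYPES
};

#ifdef CONFIG_MEMORY_ISOLATION
static inline bool is_isolate(int mt)
{
    return mt == MT_ISOLATE;
}
#else
/* Option disabled: the test is a compile-time constant, so the
 * compiler can drop the branch from every caller. */
static inline bool is_isolate(int mt)
{
    (void)mt;
    return false;
}
#endif

/* Hot-path caller: isolated pages are not counted as free. */
int count_as_free(int mt, int nr_pages)
{
    if (is_isolate(mt))
        return 0;
    return nr_pages;
}
```

Built without CONFIG_MEMORY_ISOLATION defined, count_as_free() reduces to returning nr_pages.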

Signed-off-by: Minchan Kim
Cc: KOSAKI Motohiro
Acked-by: Michal Nazarewicz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Minchan Kim
2013-02-24 09:50:15 +0800
c60514b63 mm: increase totalram_pages when free pages allocated by bootmem allocator

Function put_page_bootmem() is used to free pages allocated by bootmem
allocator, so it should increase totalram_pages when freeing pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Maciej Rutecki
Cc: Chris Clayton
Cc: "Rafael J . Wysocki"
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Jianguo Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiang Liu
2013-02-24 09:50:15 +0800
306f2e9ee mm: set zone->present_pages to number of existing pages in the zone

Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:

present_pages = spanned_pages - absent_pages;
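
As a worked example of how the three counters relate, here is a minimal sketch. It assumes a simplified struct zone_model and a hypothetical "reserved" figure for pages that are never handed to the buddy system; the numbers are made up.

```c
/* Hypothetical, simplified zone; field names follow the commit text. */
struct zone_model {
    unsigned long spanned_pages; /* whole pfn span, holes included */
    unsigned long present_pages; /* pages that physically exist */
    unsigned long managed_pages; /* pages handed to the buddy system */
};

/* absent: pages in holes; reserved: existing pages kept away from
 * the buddy system (an assumption of this sketch). */
void zone_account(struct zone_model *z, unsigned long spanned,
                  unsigned long absent, unsigned long reserved)
{
    z->spanned_pages = spanned;
    z->present_pages = spanned - absent;
    z->managed_pages = z->present_pages - reserved;
}
```

With spanned = 1000, absent = 100 and reserved = 50, the zone ends up with present_pages = 900 and managed_pages = 850.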

Signed-off-by: Jiang Liu
Cc: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Maciej Rutecki
Cc: Chris Clayton
Cc: "Rafael J . Wysocki"
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Jianguo Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiang Liu
2013-02-24 09:50:15 +0800
b40da0494 mm: use zone->present_pages instead of zone->managed_pages where appropriate

Now we have zone->managed_pages for "pages managed by the buddy system
in the zone", so replace zone->present_pages with zone->managed_pages if
what the user really wants is the number of allocatable pages.

Signed-off-by: Jiang Liu
Cc: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Maciej Rutecki
Cc: Chris Clayton
Cc: "Rafael J . Wysocki"
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Jianguo Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Jiang Liu
2013-02-24 09:50:14 +0800
f7210e6c4 mm/memblock.c: use CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect movablecore_map in memblock_overlaps_region().

The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

Tang Chen
2013-02-24 09:50:14 +0800
01a178a94 acpi, memory-hotplug: support getting hotplug info from SRAT

We now provide an option for users who don't want to specify physical
memory addresses on the kernel command line.

/*
 * For movablemem_map=acpi:
 *
 * SRAT:            |_____| |_____| |_________| |_________| ......
 * node id:            0       1         1           2
 * hotpluggable:       n       y         y           n
 * movablemem_map:          |_____| |_________|
 *
 * Using movablemem_map, we can prevent memblock from allocating memory
 * on ZONE_MOVABLE at boot time.
 */

So the user just specifies movablemem_map=acpi, and the kernel will use
the hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.

If all the memory ranges in SRAT are hotpluggable, then no memory can be
used by the kernel. But before SRAT is parsed, memblock has already reserved
some memory ranges for other purposes, such as the kernel image. We cannot
prevent the kernel from using this memory, so we need to exclude these
ranges even if the memory is hotpluggable.

Furthermore, there could be several memory ranges in the single node
in which the kernel resides. We may skip one range that has memory
reserved by memblock, but if the rest of the memory is too small, then the
kernel will fail to boot. So, make the whole node in which the kernel
resides un-hotpluggable. Then the kernel has enough memory to use.

NOTE: Using this option will reduce NUMA performance because the
whole node will be set as ZONE_MOVABLE, and the kernel cannot use memory
on it. Users who don't want to lose NUMA performance should not use
it.

[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: use strcmp()]
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Len Brown
Cc: "Brown, Len"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
27168d38f acpi, memory-hotplug: extend movablemem_map ranges to the end of node

When implementing the movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as
ZONE_MOVABLE.

Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify
the whole node memory range, we need to extend it to the node end so
that we can use it to prevent memblock from allocating memory in the
ranges the user didn't specify.

We now implement movablemem_map boot option like this:

/*
 * For movablemem_map=nn[KMG]@ss[KMG]:
 *
 * SRAT:            |_____| |_____| |_________| |_________| ......
 * node id:            0       1         1           2
 * user specified:            |__|                 |___|
 * movablemem_map:            |___| |_________|    |______| ......
 *
 * Using movablemem_map, we can prevent memblock from allocating memory
 * on ZONE_MOVABLE at boot time.
 *
 * NOTE: In this case, SRAT info will be ignored.
 */

[akpm@linux-foundation.org: clean up code, fix build warning]
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Len Brown
Cc: "Brown, Len"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
e8d195525 acpi, memory-hotplug: parse SRAT before memblock is ready

On Linux, pages used by the kernel cannot be migrated. As a result,
if a memory range is used by the kernel, it cannot be hot-removed. So if we
want to hot-remove memory, we should prevent the kernel from using it.

The current way to prevent this is to specify a memory range with the
movablemem_map boot option and set it as ZONE_MOVABLE.

But while the system is booting, memblock allocates memory and
reserves it for the kernel. memblock is already working before we parse
SRAT and learn the node memory ranges, so it may allocate memory in
ranges that are to be set as ZONE_MOVABLE. This memory can be used by the
kernel and will never be freed.

So, let's parse SRAT before memblock is first called, which is early
enough.

The first call of memblock_find_in_range_node() is in:

setup_arch()
|-->setup_real_mode()

so this patch adds a function early_parse_srat() to parse SRAT, and calls
it before setup_real_mode() is called.

NOTE:

1) early_parse_srat() is called before numa_init(), and has initialized
numa_meminfo. So DO NOT clear numa_nodes_parsed in numa_init() and DO
NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
numa info.

2) I don't know why the original acpi_numa_init() used the count of memory
affinities parsed from SRAT as its return value. So I add a static
variable srat_mem_cnt to remember this count and use it as the return
value of the new acpi_numa_init().

[mhocko@suse.cz: parse SRAT before memblock is ready fix]
Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Len Brown
Cc: "Brown, Len"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
fb06bc8e5 page_alloc: bootmem limit with movablecore_map

Ensure the bootmem will not allocate memory from areas that may be
ZONE_MOVABLE. The map info is from movablecore_map boot option.

Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
Cc: Wu Jianguo
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
42f47e27e page_alloc: make movablemem_map have higher priority

If kernelcore or movablecore is specified at the same time as
movablemem_map, movablemem_map takes priority. This patch makes
find_zone_movable_pfns_for_nodes() calculate zone_movable_pfn[] with the
limit from zone_movable_limit[].

Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Cc: Wu Jianguo
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
6981ec311 page_alloc: introduce zone_movable_limit[] to keep movable limit for nodes

Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
limit from the movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() finds out to which node each range in
movablemem_map.map[] belongs, and calculates the low boundary of
ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen
Signed-off-by: Liu Jiang
Reviewed-by: Wen Congyang
Cc: Wu Jianguo
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
34b71f1e0 page_alloc: add movable_memmap kernel parameter

Add functions to parse the movablemem_map boot option. Since the option
can be specified more than once, all the maps are stored in the
global variable movablemem_map.map array.

We also keep the array in monotonically increasing order by start_pfn
and merge all overlapping ranges.
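
The sorted-insert-and-merge step can be sketched as follows. This is a user-space illustration, not the patch's code: struct range_map, MAP_MAX and map_insert() are hypothetical names, ranges are half-open [start_pfn, end_pfn), and merging ranges that merely touch is a choice made for this sketch.

```c
#include <stddef.h>
#include <string.h>

#define MAP_MAX 32 /* hypothetical capacity */

struct range {
    unsigned long start_pfn; /* inclusive */
    unsigned long end_pfn;   /* exclusive */
};

struct range_map {
    size_t nr;
    struct range map[MAP_MAX];
};

/* Insert [start, end), keeping map[] sorted by start_pfn and merging
 * every entry that overlaps or touches it. Returns 0 on success or
 * -1 if the array is full. */
int map_insert(struct range_map *m, unsigned long start, unsigned long end)
{
    size_t lo = 0, hi;

    /* Skip entries that end strictly before the new range starts. */
    while (lo < m->nr && m->map[lo].end_pfn < start)
        lo++;

    /* Absorb entries that start at or before the new range's end. */
    hi = lo;
    while (hi < m->nr && m->map[hi].start_pfn <= end) {
        if (m->map[hi].start_pfn < start)
            start = m->map[hi].start_pfn;
        if (m->map[hi].end_pfn > end)
            end = m->map[hi].end_pfn;
        hi++;
    }

    /* The array only grows when nothing was absorbed. */
    if (hi == lo && m->nr == MAP_MAX)
        return -1;

    /* Close the gap (or open one slot) so the merged range fits at lo. */
    memmove(&m->map[lo + 1], &m->map[hi],
            (m->nr - hi) * sizeof(m->map[0]));
    m->nr = m->nr - (hi - lo) + 1;
    m->map[lo].start_pfn = start;
    m->map[lo].end_pfn = end;
    return 0;
}
```

Because each insert leaves the array sorted and free of overlaps, the option can be given any number of times in any order and the final map is the same.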

[akpm@linux-foundation.org: improve comment]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: remove unneeded parens]
Signed-off-by: Tang Chen
Signed-off-by: Lai Jiangshan
Reviewed-by: Wen Congyang
Tested-by: Lin Feng
Cc: Wu Jianguo
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:14 +0800
4d59a7512 x86: get pg_data_t's memory from other node

During the implementation of SRAT support, we met a problem. In
setup_arch(), we have the following call series:

1) memblock is ready;
2) some functions use memblock to allocate memory;
3) parse ACPI tables, such as SRAT.

Before 3), we don't know which memory is hotpluggable, and as a result,
we cannot prevent memblock from allocating hotpluggable memory. So, in
2), there could be some hotpluggable memory allocated by memblock.

Now, we are trying to parse SRAT earlier, before memblock is ready. But
I think we need more investigation on this topic. So in this v5, I
dropped all the SRAT support, and v5 is just the same as v3, and it is
based on 3.8-rc3.

As planned, we will eventually support getting info from SRAT without
users' participation, and we will post another patch-set to do so.

Also, I think for now we can add this boot option as the first step
of supporting movable nodes. Since Linux cannot migrate direct
mapped pages, the only way for now is to restrict a whole node to
contain only movable memory.

Using SRAT is one way. But even if we can use SRAT, users still need an
interface to enable/disable this functionality if they don't want to
lose their NUMA performance. So I think a user interface is always
needed.

For now, users can disable this functionality by not specifying the boot
option. Later, we will post SRAT support, and add another option value
"movablecore_map=acpi" to use SRAT.

This patch:

If the system can create a movable node, in which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t. So, use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails.
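
The difference between the strict and the retrying allocation can be modelled in a few lines. The sketch below is a toy model under stated assumptions: struct node_model, NR_NODES, alloc_on_node() and alloc_try_node() are hypothetical, and a movable node is represented as a node with no memory usable by the boot allocator.

```c
#define NR_NODES 4 /* hypothetical node count */

struct node_model {
    unsigned long free_bytes; /* memory the boot allocator may use */
};

/* Strict variant: only the requested node is tried.
 * Returns the node used, or -1 on failure. */
int alloc_on_node(struct node_model *nodes, int nid, unsigned long size)
{
    if (nodes[nid].free_bytes < size)
        return -1;
    nodes[nid].free_bytes -= size;
    return nid;
}

/* Retrying variant: prefer nid, then fall back to any other node. */
int alloc_try_node(struct node_model *nodes, int nid, unsigned long size)
{
    int got = alloc_on_node(nodes, nid, size);
    int i;

    for (i = 0; got < 0 && i < NR_NODES; i++)
        got = alloc_on_node(nodes, i, size);
    return got;
}
```

If node 1 is entirely movable, the strict call fails for it, while the retrying call places the allocation on another node.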

Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Lai Jiangshan
Signed-off-by: Tang Chen
Signed-off-by: Jiang Liu
Cc: Wu Jianguo
Cc: Wen Congyang
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yasuaki Ishimatsu
2013-02-24 09:50:14 +0800
aa00d89c2 sched: do not use cpu_to_node() to find an offlined cpu's node.

If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
will return -1. As a result, cpumask_of_node(nid) will return NULL. In
this case, find_next_bit() in for_each_cpu will get a NULL pointer and
cause a panic.

Here is a call trace:
Call Trace:

select_fallback_rq+0x71/0x190
try_to_wake_up+0x2cb/0x2f0
wake_up_process+0x15/0x20
hrtimer_wakeup+0x22/0x30
__run_hrtimer+0x83/0x320
hrtimer_interrupt+0x106/0x280
smp_apic_timer_interrupt+0x69/0x99
apic_timer_interrupt+0x6f/0x80

There is a hrtimer process sleeping, whose cpu has already been
offlined. When it is woken up, it tries to find another cpu to run on, and
gets a nid of -1. As a result, cpumask_of_node(-1) returns NULL and causes a
kernel panic.

This patch fixes the problem by checking whether the nid is -1. If the nid is
not -1, a cpu on the same node will be picked. Otherwise, an online cpu on
another node will be picked.
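
The selection rule can be sketched in user space as below. This is an illustration only: struct cpu_model, NR_CPUS, NO_NODE and pick_fallback_cpu() are hypothetical, and the sketch also falls through to any online cpu when the node has none left, which is an assumption of this model.

```c
#include <stdbool.h>

#define NR_CPUS 8    /* hypothetical cpu count */
#define NO_NODE (-1) /* nid of a cpu whose node mapping was cleared */

struct cpu_model {
    int nid;
    bool online;
};

/* Pick a cpu for a task that last ran on prev_cpu.
 * Returns a cpu index, or -1 if no cpu is online. */
int pick_fallback_cpu(const struct cpu_model *cpus, int prev_cpu)
{
    int nid = cpus[prev_cpu].nid;
    int cpu;

    /* Only consult the node when the nid is valid; indexing a
     * per-node table with -1 is what the patch avoids. */
    if (nid != NO_NODE) {
        for (cpu = 0; cpu < NR_CPUS; cpu++)
            if (cpus[cpu].online && cpus[cpu].nid == nid)
                return cpu;
    }

    /* No valid node, or nothing online on it: take any online cpu. */
    for (cpu = 0; cpu < NR_CPUS; cpu++)
        if (cpus[cpu].online)
            return cpu;
    return -1;
}
```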

Signed-off-by: Tang Chen
Signed-off-by: Wen Congyang
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:13 +0800
e13fe8695 cpu-hotplug,memory-hotplug: clear cpu_to_node() when offlining the node

When the node is offlined, there is no memory/cpu on the node. If a
sleeping task last ran on a cpu of this node, it will be migrated to a cpu
on another node. So we can clear the cpu-to-node mapping.

[akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:13 +0800
76bba1423 cpu-hotplug, memory-hotplug: try offlining the node when hotremoving a cpu

The node will be offlined when all memory/cpu on the node is hotremoved.
So we should try to offline the node when hotremoving a cpu on the node.

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Cc: Len Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:13 +0800
90b30cdc1 memory-hotplug: export the function try_offline_node()

try_offline_node() will be needed in the tristate
drivers/acpi/processor_driver.c.

The node will be offlined when all memory/cpu on the node have been
hotremoved. So we need the function try_offline_node() in cpu-hotplug
path.

If memory-hotplug is disabled and cpu-hotplug is enabled:

1. no memory on the node
   we don't online the node, and the cpu's node is the nearest node.

2. the node contains some memory
   the node has been onlined, and the cpu's node is still needed
   to migrate the sleeping task on the cpu to the same node.

So we do nothing in try_offline_node() in this case.

[rientjes@google.com: export the function try_offline_node() fix]
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Cc: Len Brown
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:13 +0800
c4c605246 cpu_hotplug: clear apicid to node when the cpu is hotremoved

When a cpu is hotplugged, we call acpi_map_cpu2node() in
_acpi_map_lsapic() to store the cpu's node and apicid's node. But we
don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
hotremoved. If the node is also hotremoved, we will get the following
messages:

kernel BUG at include/linux/gfp.h:329!
invalid opcode: 0000 [#1] SMP
Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
RIP: 0010:[] [] allocate_slab+0x28d/0x300
RSP: 0018:ffff88078a049cf8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
FS: 00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
Call Trace:
new_slab+0x30/0x1b0
__slab_alloc+0x358/0x4c0
kmem_cache_alloc_node_trace+0xb4/0x1e0
alloc_fair_sched_group+0xd0/0x1b0
sched_create_group+0x3e/0x110
sched_autogroup_create_attach+0x4d/0x180
sys_setsid+0xd4/0xf0
system_call_fastpath+0x16/0x1b
Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
RIP [] allocate_slab+0x28d/0x300
RSP
---[ end trace adf84c90f3fea3e5 ]---

The reason is that the cpu's node is not NUMA_NO_NODE, so we call
alloc_pages_exact_node() to allocate memory on the node, but the node is
offlined.

If the node is onlined, we still need the cpu's node. For example: a task
on the cpu is sleeping when the cpu is hotremoved. We will choose
another cpu to run this task when it is woken up. If we know the cpu's
node, we will choose a cpu on the same node first. So we should clear the
cpu-to-node mapping when the node is offlined.

This patch only clears the apicid-to-node mapping when the cpu is
hotremoved.

[akpm@linux-foundation.org: fix section error]
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:13 +0800
d3eb1570a mempolicy: fix is_valid_nodemask()

is_valid_nodemask() was introduced by commit 19770b32609b ("mm: filter
based on a nodemask as well as a gfp_mask"), but it does not match its
comments, because it does not check zones whose index is > policy_zone.

Also, commit b377fd3982ad ("Apply memory policies to top two highest
zones when highest zone is ZONE_MOVABLE") tells us that if the
highest zone is ZONE_MOVABLE, we should also apply memory policies to
it, so ZONE_MOVABLE should be a valid zone for policies.
is_valid_nodemask() needs to be changed to match.

Fix: check all zones, even those whose zoneid > policy_zone. Use
nodes_intersects() instead of open-coding the check.

Reported-by: Wen Congyang
Signed-off-by: Lai Jiangshan
Signed-off-by: Tang Chen
Cc: Mel Gorman
Cc: Lee Schermerhorn
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Lai Jiangshan
2013-02-24 09:50:13 +0800
8a356ce38 memory-hotplug: consider compound pages when free memmap

usemap could also be allocated as compound pages, so we should also consider
compound pages when freeing memmap.

If we don't fix it, there could be problems when we free vmemmap
pagetables which are stored in compound pages. The old pagetables will
not be freed properly, and when we add the memory again, no new
pagetable will be created. The old pagetable entry is then used, and
the kernel will panic.

The call trace is like the following:

BUG: unable to handle kernel paging request at ffffea0040000000
IP: [] sparse_add_one_section+0xef/0x166
PGD 7ff7d4067 PUD 78e035067 PMD 78e11d067 PTE 0
Oops: 0002 [#1] SMP
Modules linked in: ip6table_filter ip6_tables ebtable_nat ebtables nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack ipt_REJECT xt_CHECKSUM iptable_mangle iptable_filter ip_tables bridge stp llc sunrpc binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr sg lpc_ich mfd_core i2c_i801 i2c_core i7core_edac edac_core ioatdma e1000e igb dca ptp pps_core sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
CPU 0
Pid: 4, comm: kworker/0:0 Tainted: G W 3.8.0-rc3-phy-hot-remove+ #3 FUJITSU-SV PRIMEQUEST 1800E/SB
RIP: 0010:[] [] sparse_add_one_section+0xef/0x166
RSP: 0018:ffff8807bdcb35d8 EFLAGS: 00010006
RAX: 0000000000000000 RBX: 0000000000000200 RCX: 0000000000200000
RDX: ffff88078df01148 RSI: 0000000000000282 RDI: ffffea0040000000
RBP: ffff8807bdcb3618 R08: 4cf05005b019467a R09: 0cd98fa09631467a
R10: 0000000000000000 R11: 0000000000030e20 R12: 0000000000008000
R13: ffffea0040000000 R14: ffff88078df66248 R15: ffff88078ea13b10
FS: 0000000000000000(0000) GS:ffff8807c1a00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: ffffea0040000000 CR3: 0000000001c0c000 CR4: 00000000000007f0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process kworker/0:0 (pid: 4, threadinfo ffff8807bdcb2000, task ffff8807bde18000)
Call Trace:
__add_pages+0x85/0x120
arch_add_memory+0x71/0xf0
add_memory+0xd6/0x1f0
acpi_memory_device_add+0x170/0x20c
acpi_device_probe+0x50/0x18a
really_probe+0x6c/0x320
driver_probe_device+0x47/0xa0
__device_attach+0x53/0x60
bus_for_each_drv+0x6c/0xa0
device_attach+0xa8/0xc0
bus_probe_device+0xb0/0xe0
device_add+0x301/0x570
device_register+0x1e/0x30
acpi_device_register+0x1d8/0x27c
acpi_add_single_object+0x1df/0x2b9
acpi_bus_check_add+0x112/0x18f
acpi_ns_walk_namespace+0x105/0x255
acpi_walk_namespace+0xcf/0x118
acpi_bus_scan+0x5b/0x7c
acpi_bus_add+0x2a/0x2c
container_notify_cb+0x112/0x1a9
acpi_ev_notify_dispatch+0x46/0x61
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x20e/0x5c0
worker_thread+0x12e/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
Code: 00 00 48 89 df 48 89 45 c8 e8 3e 71 b1 ff 48 89 c2 48 8b 75 c8 b8 ef ff ff ff f6 02 01 75 4b 49 63 cc 31 c0 4c 89 ef 48 c1 e1 06 aa 48 8b 02 48 83 c8 01 48 85 d2 48 89 02 74 29 a8 01 74 25
RIP [] sparse_add_one_section+0xef/0x166
RSP
CR2: ffffea0040000000
---[ end trace e7f94e3a34c442d4 ]---
Kernel panic - not syncing: Fatal exception

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:13 +0800
a1e565aa3 memory-hotplug: do not allocate pgdat if it was not freed when offline.

Since there is no way to guarantee that the address of pgdat/zone is not on
the stack of any kernel thread or used by other kernel objects without
reference counting or another synchronizing method, we cannot reset
node_data and free pgdat when offlining a node. Just reset pgdat to 0
and reuse the memory when the node is online again.

The problem was raised by Kamezawa Hiroyuki. The idea is from Wen
Congyang.

NOTE: If we don't reset pgdat to 0, the WARN_ON in free_area_init_node()
will be triggered.

[akpm@linux-foundation.org: fix warning when CONFIG_NEED_MULTIPLE_NODES=n]
[akpm@linux-foundation.org: fix the warning again again]
Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:13 +0800
d822b86a9 memory-hotplug: free node_data when a node is offlined

We call hotadd_new_pgdat() to allocate memory to store node_data. So we
should free it when removing a node.

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Reviewed-by: Kamezawa Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:13 +0800
60a5a19e7 memory-hotplug: remove sysfs file of node

Introduce a new function try_offline_node() to remove sysfs file of node
when all memory sections of this node are removed. If some memory
sections of this node are not removed, this function does nothing.

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:13 +0800
815121d2b memory_hotplug: clear zone when removing the memory

When memory is added, we update zone's and pgdat's start_pfn and
spanned_pages in __add_zone(). So we should revert them when the memory
is removed.

The patch adds a new function __remove_zone() to do this.
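
One way to picture the revert is to recompute the span from whatever sections remain. The sketch below assumes that approach and is not the patch's implementation: struct span_model and recompute_span() are hypothetical names, and sections are modelled as a flat array of "present" flags.

```c
#include <stdbool.h>

/* Hypothetical subset of a zone's span bookkeeping. */
struct span_model {
    unsigned long start_pfn;
    unsigned long spanned_pages;
};

/* Recompute the span from the sections still present.
 * pps is the number of pages per section. */
void recompute_span(const bool *present, unsigned long nr_sections,
                    unsigned long pps, struct span_model *z)
{
    unsigned long first = nr_sections, last = 0, i;

    for (i = 0; i < nr_sections; i++) {
        if (!present[i])
            continue;
        if (first == nr_sections)
            first = i; /* lowest remaining section */
        last = i;      /* highest remaining section so far */
    }

    if (first == nr_sections) {
        /* Nothing left: the zone becomes empty. */
        z->start_pfn = 0;
        z->spanned_pages = 0;
        return;
    }
    z->start_pfn = first * pps;
    z->spanned_pages = (last - first + 1) * pps;
}
```

Removing a section at either edge shrinks the span, and removing the last remaining section empties it.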

Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yasuaki Ishimatsu
2013-02-24 09:50:12 +0800
5fc1d66a2 memory-hotplug: integrated __remove_section() of CONFIG_SPARSEMEM_VMEMMAP.

Currently __remove_section for SPARSEMEM_VMEMMAP does nothing. But even
if we use SPARSEMEM_VMEMMAP, we can unregister the memory_section.

Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:12 +0800
0197518cd memory-hotplug: remove memmap of sparse-vmemmap

Introduce a new API vmemmap_free() to free and remove vmemmap
pagetables. Since pagetable implementations differ, each architecture
has to provide its own version of vmemmap_free(), just like
vmemmap_populate().

Note: vmemmap_free() is not implemented for ia64, ppc, s390, and sparc.

[mhocko@suse.cz: fix implicit declaration of remove_pagetable]
Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Jianguo Wu
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:12 +0800
bbcab8789 memory-hotplug: remove page table of x86_64 architecture

Find the page tables that map the removed memory and clear them, for the
x86_64 architecture.

[akpm@linux-foundation.org: make kernel_physical_mapping_remove() static]
Signed-off-by: Wen Congyang
Signed-off-by: Jianguo Wu
Signed-off-by: Jiang Liu
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:12 +0800
ae9aae9ed memory-hotplug: common APIs to support page tables hot-remove

When memory is removed, the corresponding pagetables should also be
removed. This patch introduces some common APIs to support removing vmemmap
pagetables and x86_64 direct mapping pagetables.

Not every page backing the virtual mapping of the removed memory can be
freed, because a page used as PGD/PUD may cover other memory as well as
the removed memory. So this patch uses the following method to check
whether a page can be freed or not.

1) When removing memory, the page structs of the removed memory are
filled with 0xFD.

2) When all page structs covered by a PT/PMD are filled with 0xFD, the
PT/PMD can be cleared. In this case, the page used as PT/PMD can be freed.

For direct mapping pages, update direct_pages_count[level] when we free
their pagetables. Do not free the pages again, because they were already
freed when offlining.

For vmemmap pages, free the pages and their pagetables.

For larger pages, do not split them into smaller ones because there is
no way to know if the larger page has been split. As a result, there is
no way to decide when to split. We deal with the larger pages in the
following way:

1) For direct mapped pages, all the pages were freed when they were
offlined. And since memory offline is done section by section, all
the memory ranges being removed are aligned to PAGE_SIZE. So we only
need to deal with unaligned pages when freeing vmemmap pages.

2) For vmemmap pages being used to store page structs, if part of the
larger page is still in use, just fill the unused part with 0xFD. And
when the whole page is filled with 0xFD, free the larger page.
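
The fill-and-check idea can be sketched in user-space C. This is only an
illustration, not the kernel code: the names sketch_mark_unused() and
sketch_page_is_free(), and the 4096-byte page size, are assumptions made
for the example. Only the 0xFD marker byte comes from the text above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_PAGE_SIZE 4096u  /* assumed page size for this sketch */
#define SKETCH_FILL_BYTE 0xFD   /* marker byte named in the commit message */

/* Mark [offset, offset + len) of a backing page as no longer in use. */
void sketch_mark_unused(unsigned char *page, size_t offset, size_t len)
{
    assert(offset <= SKETCH_PAGE_SIZE && len <= SKETCH_PAGE_SIZE - offset);
    memset(page + offset, SKETCH_FILL_BYTE, len);
}

/* Return true only when every byte of the page carries the marker. */
bool sketch_page_is_free(const unsigned char *page)
{
    for (size_t i = 0; i < SKETCH_PAGE_SIZE; i++)
        if (page[i] != SKETCH_FILL_BYTE)
            return false;
    return true;
}
```

A caller would mark each released range and free the backing page only
once sketch_page_is_free() returns true, so a partly used page is never
released early.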

[akpm@linux-foundation.org: fix typo in comment]
[tangchen@cn.fujitsu.com: do not calculate direct mapping pages when freeing vmemmap pagetables]
[tangchen@cn.fujitsu.com: do not free direct mapping pages twice]
[tangchen@cn.fujitsu.com: do not free page split from hugepage one by one]
[tangchen@cn.fujitsu.com: do not split pages when freeing pagetable pages]
[akpm@linux-foundation.org: use pmd_page_vaddr()]
[akpm@linux-foundation.org: fix used-uninitialised bug]
Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Jianguo Wu
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:12 +0800
cd099682e memory-hotplug: move pgdat_resize_lock into sparse_remove_one_section()

In __remove_section(), we held pgdat_resize_lock while calling
sparse_remove_one_section(). Taking this lock disables irqs, but we
don't need to hold it across the whole function. If we do some work to
free pagetables in free_section_usemap(), we need to call
flush_tlb_all(), which needs irqs enabled. Otherwise the WARN_ON_ONCE()
in smp_call_function_many() will be triggered.

If we hold the lock across the whole of sparse_remove_one_section(), we get this call trace:

------------[ cut here ]------------
WARNING: at kernel/smp.c:461 smp_call_function_many+0xbd/0x260()
Hardware name: PRIMEQUEST 1800E
......
Call Trace:
smp_call_function_many+0xbd/0x260
smp_call_function+0x3b/0x50
on_each_cpu+0x3b/0xc0
flush_tlb_all+0x1c/0x20
remove_pagetable+0x14e/0x1d0
vmemmap_free+0x18/0x20
sparse_remove_one_section+0xf7/0x100
__remove_section+0xa2/0xb0
__remove_pages+0xa0/0xd0
arch_remove_memory+0x6b/0xc0
remove_memory+0xb8/0xf0
acpi_memory_device_remove+0x53/0x96
acpi_device_remove+0x90/0xb2
__device_release_driver+0x7c/0xf0
device_release_driver+0x2f/0x50
acpi_bus_remove+0x32/0x6d
acpi_bus_trim+0x91/0x102
acpi_bus_hot_remove_device+0x88/0x16b
acpi_os_execute_deferred+0x27/0x34
process_one_work+0x20e/0x5c0
worker_thread+0x12e/0x370
kthread+0xee/0x100
ret_from_fork+0x7c/0xb0
---[ end trace 25e85300f542aa01 ]---
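
The effect of narrowing the lock can be shown with a toy model in
user-space C. Everything here is hypothetical: struct sketch_cpu and the
sketch_* functions are stand-ins invented for the example, and a single
flag models "irqs disabled while the lock is held".

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one flag stands in for "irqs disabled by the lock". */
struct sketch_cpu {
    bool irqs_disabled;
    int warnings;       /* times the flush ran with irqs disabled */
};

void sketch_lock(struct sketch_cpu *cpu)
{
    cpu->irqs_disabled = true;
}

void sketch_unlock(struct sketch_cpu *cpu)
{
    assert(cpu->irqs_disabled);     /* unlocking an unheld lock is a bug */
    cpu->irqs_disabled = false;
}

/* Stand-in for flush_tlb_all(): it expects irqs to be enabled. */
void sketch_flush(struct sketch_cpu *cpu)
{
    if (cpu->irqs_disabled)
        cpu->warnings++;
}

/* Before: the lock is held around the whole removal. */
void sketch_remove_wide(struct sketch_cpu *cpu)
{
    sketch_lock(cpu);
    /* ... update section bookkeeping ... */
    sketch_flush(cpu);              /* runs with irqs disabled */
    sketch_unlock(cpu);
}

/* After: the lock covers only the bookkeeping update. */
void sketch_remove_narrow(struct sketch_cpu *cpu)
{
    sketch_lock(cpu);
    /* ... update section bookkeeping ... */
    sketch_unlock(cpu);
    sketch_flush(cpu);              /* runs with irqs enabled */
}
```

In this model only sketch_remove_wide() increments the warning counter,
which mirrors the warning in the trace above.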

Signed-off-by: Tang Chen
Signed-off-by: Lai Jiangshan
Signed-off-by: Wen Congyang
Acked-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Tang Chen
2013-02-24 09:50:12 +0800
46723bfa5 memory-hotplug: implement register_page_bootmem_info_section of sparse-vmemmap

To remove a sparse-vmemmap memmap region that was allocated from
bootmem, the region needs to be registered by get_page_bootmem(). So
the patch searches the pages of the virtual mapping and registers them
with get_page_bootmem().

NOTE: register_page_bootmem_memmap() is not implemented for ia64,
ppc, s390, and sparc. So introduce CONFIG_HAVE_BOOTMEM_INFO_NODE
and revert register_page_bootmem_info_node() when platform doesn't
support it.

It's implemented by adding a new Kconfig option named
CONFIG_HAVE_BOOTMEM_INFO_NODE, which is automatically selected by
archs that fully support the memory-hotplug feature (currently only
x86_64).

Since we have 2 config options called MEMORY_HOTPLUG and
MEMORY_HOTREMOVE, used for memory hot-add and hot-remove respectively,
and the code in register_page_bootmem_info_node() is only used for
collecting information for hot-remove, place it under
MEMORY_HOTREMOVE.

Besides, page_isolation.c, which is selected by MEMORY_ISOLATION under
MEMORY_HOTPLUG, is a similar case, so move it too.

[mhocko@suse.cz: put register_page_bootmem_memmap inside CONFIG_MEMORY_HOTPLUG_SPARSE]
[linfeng@cn.fujitsu.com: introduce CONFIG_HAVE_BOOTMEM_INFO_NODE and revert register_page_bootmem_info_node()]
[mhocko@suse.cz: remove the arch specific functions without any implementation]
[linfeng@cn.fujitsu.com: mm/Kconfig: move auto selects from MEMORY_HOTPLUG to MEMORY_HOTREMOVE as needed]
[rientjes@google.com: fix defined but not used warning]
Signed-off-by: Wen Congyang
Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Tang Chen
Reviewed-by: Wu Jianguo
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Michal Hocko
Signed-off-by: Lin Feng
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yasuaki Ishimatsu
2013-02-24 09:50:12 +0800
24d335ca3 memory-hotplug: introduce new arch_remove_memory() for removing page table

To remove memory, we need to remove its page tables, but how to do so
depends on the architecture. So the patch introduces
arch_remove_memory() for removing page tables. For now it only calls
__remove_pages().

Note: __remove_pages() is not implemented for some architectures
(I don't know how to implement it for s390).

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Acked-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:12 +0800
46c66c4b7 memory-hotplug: remove /sys/firmware/memmap/X sysfs

When (hot)adding memory into the system, /sys/firmware/memmap/X/{end, start,
type} sysfs files are created. But there is no code to remove these
files. This patch implements the function to remove them.

We cannot free firmware_map_entry which is allocated by bootmem because
there is no way to do so when the system is up. But we can at least
remember the address of that memory and reuse the storage when the
memory is added next time.

This patch also introduces a new list map_entries_bootmem to link the
map entries allocated by bootmem when they are removed, and a lock to
protect it. And these entries will be reused when the memory is
hot-added again.

The idea was suggested by Andrew Morton.

NOTE: It is unsafe to return an entry pointer and release the
map_entries_lock. So we should not hold the map_entries_lock
separately in firmware_map_find_entry() and
firmware_map_remove_entry(). Hold the map_entries_lock across find
and remove /sys/firmware/memmap/X operation.

And also, users of these two functions need to be careful to
hold the lock when using these two functions.
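
The two ideas in this message, finding and removing under one lock hold
and parking removed entries for reuse, can be sketched in user-space C.
This is a minimal single-threaded model, not the code in
drivers/firmware/memmap.c: the sketch_* names are hypothetical, a bool
stands in for map_entries_lock, and matching on the address range alone
is an assumption made to keep the example short.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_entry {
    unsigned long start, end;
    struct sketch_entry *next;
};

struct sketch_map {
    bool locked;                    /* stand-in for map_entries_lock */
    struct sketch_entry *live;      /* entries currently exposed */
    struct sketch_entry *parked;    /* removed entries kept for reuse */
};

/* Caller must hold the lock; returns the link pointing at the match. */
struct sketch_entry **sketch_find_locked(struct sketch_map *m,
                                         unsigned long start,
                                         unsigned long end)
{
    assert(m->locked);
    for (struct sketch_entry **pp = &m->live; *pp; pp = &(*pp)->next)
        if ((*pp)->start == start && (*pp)->end == end)
            return pp;
    return NULL;
}

/* Find and unlink under a single lock hold, then park the entry. */
bool sketch_remove(struct sketch_map *m, unsigned long start,
                   unsigned long end)
{
    bool found = false;

    m->locked = true;
    struct sketch_entry **pp = sketch_find_locked(m, start, end);
    if (pp) {
        struct sketch_entry *e = *pp;
        *pp = e->next;              /* unlink from the live list */
        e->next = m->parked;        /* keep the storage for later */
        m->parked = e;
        found = true;
    }
    m->locked = false;
    return found;
}

/* Re-add a range, reusing a parked entry for the same range if any. */
struct sketch_entry *sketch_readd(struct sketch_map *m,
                                  unsigned long start, unsigned long end)
{
    struct sketch_entry *e = NULL;

    m->locked = true;
    for (struct sketch_entry **pp = &m->parked; *pp; pp = &(*pp)->next) {
        if ((*pp)->start == start && (*pp)->end == end) {
            e = *pp;
            *pp = e->next;          /* take it off the parked list */
            e->next = m->live;
            m->live = e;
            break;
        }
    }
    m->locked = false;
    return e;   /* NULL: nothing parked, caller allocates a new entry */
}
```

Because sketch_remove() never drops the lock between the lookup and the
unlink, no other path can invalidate the pointer it found.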

[tangchen@cn.fujitsu.com: Hold spinlock across find|remove /sys operation]
[tangchen@cn.fujitsu.com: fix the wrong comments of map_entries]
[tangchen@cn.fujitsu.com: reuse the storage of /sys/firmware/memmap/X/ allocated by bootmem]
[tangchen@cn.fujitsu.com: fix section mismatch problem]
[tangchen@cn.fujitsu.com: fix the doc format in drivers/firmware/memmap.c]
Signed-off-by: Wen Congyang
Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Tang Chen
Reviewed-by: Kamezawa Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Lai Jiangshan
Cc: Tang Chen
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Julian Calaby
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yasuaki Ishimatsu
2013-02-24 09:50:12 +0800
bbc76be67 memory-hotplug: remove redundant codes

Offlining memory blocks and checking whether memory blocks are offlined
are very similar operations. This patch introduces a new function to
remove the redundant code.

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Reviewed-by: Kamezawa Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:12 +0800
6677e3eaf memory-hotplug: check whether all memory blocks are offlined or not when removing memory

We remove the memory like this:

1. lock memory hotplug
2. offline a memory block
3. unlock memory hotplug
4. repeat 1-3 to offline all memory blocks
5. lock memory hotplug
6. remove memory(TODO)
7. unlock memory hotplug

All memory blocks must be offlined before removing memory. But we don't
hold the lock across the whole operation. So we should check whether
all memory blocks are offlined before step 6. Otherwise, the kernel may
panic.

Offlining a memory block and removing a memory device can be two
different operations. Users can just offline some memory blocks without
removing the memory device. For this purpose, the kernel already takes
lock_memory_hotplug() in __offline_pages(). To reuse that code for
memory hot-remove, we repeat steps 1-3 to offline all the memory
blocks, locking and unlocking memory hotplug each time, rather than
holding the memory hotplug lock across the whole operation.
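
The re-check before step 6 can be sketched in user-space C. This is a
single-threaded illustration under assumed names: struct sketch_device
and the sketch_* functions are hypothetical, a bool stands in for the
memory hotplug lock, and the fixed array of 8 blocks is arbitrary.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

enum sketch_state { SKETCH_ONLINE, SKETCH_OFFLINE };

struct sketch_device {
    bool hotplug_locked;            /* stand-in for the hotplug lock */
    enum sketch_state blocks[8];
    size_t nr_blocks;
    bool removed;
};

/* Steps 1-3: take the lock, offline one block, drop the lock. */
void sketch_offline_block(struct sketch_device *d, size_t i)
{
    assert(i < d->nr_blocks);
    d->hotplug_locked = true;
    d->blocks[i] = SKETCH_OFFLINE;
    d->hotplug_locked = false;
}

/* Steps 5-7: re-check every block under the lock, then remove. */
int sketch_remove_device(struct sketch_device *d)
{
    int ret = 0;

    d->hotplug_locked = true;
    for (size_t i = 0; i < d->nr_blocks; i++) {
        if (d->blocks[i] != SKETCH_OFFLINE) {
            ret = -1;               /* a block is online: refuse */
            break;
        }
    }
    if (ret == 0)
        d->removed = true;
    d->hotplug_locked = false;
    return ret;
}
```

Since the lock is dropped between iterations, a block could come back
online before step 5; the check inside sketch_remove_device() is what
turns that case into an error return instead of a removal.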

Signed-off-by: Wen Congyang
Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Tang Chen
Acked-by: KAMEZAWA Hiroyuki
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Yasuaki Ishimatsu
2013-02-24 09:50:11 +0800
993c1aad8 memory-hotplug: try to offline the memory twice to avoid dependence

Memory can't be offlined when CONFIG_MEMCG is selected. For example:
there is a memory device on node 1. The address range is [1G, 1.5G).
You will find 4 new directories memory8, memory9, memory10, and memory11
under the directory /sys/devices/system/memory/.

If CONFIG_MEMCG is selected, we allocate memory to store the page
cgroup when we online pages. When we online memory8, the memory that
stores its page cgroup is not provided by this memory device. But when
we online memory9, the memory that stores its page cgroup may be
provided by memory8. So we can't offline memory8 now. We should offline
the memory in reverse order.

When the memory device is hot-removed, we automatically offline the
memory provided by this memory device. But we don't know which memory
was onlined first, so offlining memory may fail. In that case, iterate
twice to offline the memory. First iteration: offline every non-primary
memory block. Second iteration: offline the primary (i.e. first added)
memory block.

This idea was suggested by KOSAKI Motohiro.
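
The two-pass approach can be sketched in user-space C. The model is an
assumption made for illustration: each block records the index of the
block that stores its metadata (standing in for the page cgroup
storage), and the sketch retries every block that failed in the first
pass rather than tracking a single primary block. The sketch_* names
are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct sketch_block {
    bool online;
    int holder;     /* index of the block storing our metadata, or -1 */
};

/* A block stays pinned while it stores metadata for an online block. */
bool sketch_try_offline(struct sketch_block *b, size_t n, size_t i)
{
    assert(i < n);
    for (size_t j = 0; j < n; j++)
        if (j != i && b[j].online && b[j].holder == (int)i)
            return false;
    b[i].online = false;
    return true;
}

/* Two passes: the second retries blocks pinned during the first. */
bool sketch_offline_all(struct sketch_block *b, size_t n)
{
    for (int pass = 0; pass < 2; pass++) {
        bool all_offline = true;

        for (size_t i = 0; i < n; i++)
            if (b[i].online && !sketch_try_offline(b, n, i))
                all_offline = false;
        if (all_offline)
            return true;
    }
    return false;
}
```

With four blocks where blocks 1 to 3 all keep their metadata in block 0,
the first pass offlines blocks 1 to 3 and the second pass offlines
block 0.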

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Wen Congyang
2013-02-24 09:50:11 +0800
a864b9d06 mm: memory_hotplug: no need to check res twice in add_memory

Remove one redundant check of res.

Signed-off-by: Sasha Levin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Sasha Levin
2013-02-24 09:50:11 +0800
41badc15c mm: make do_mmap_pgoff return populate as a size in bytes, not as a bool

do_mmap_pgoff() rounds up the desired size to the next PAGE_SIZE
multiple, but there was no equivalent code in mm_populate(), which
caused issues.

This could be fixed by introducing the same rounding in mm_populate();
however, I think it's preferable to make do_mmap_pgoff() return populate
as a size rather than as a boolean, so we don't have to duplicate the
size rounding logic in mm_populate().
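
The design choice can be sketched in user-space C. The functions below
are hypothetical stand-ins, not the kernel signatures: sketch_do_mmap()
rounds the length once and reports the number of bytes to populate
through an out-parameter, so the populate step needs no rounding of its
own. The 4096-byte page size is an assumption, and overflow for lengths
near ULONG_MAX is not handled.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define SKETCH_PAGE_SIZE 4096ul     /* assumed page size */

/* Round len up to the next multiple of the page size. */
unsigned long sketch_page_align(unsigned long len)
{
    return (len + SKETCH_PAGE_SIZE - 1) & ~(SKETCH_PAGE_SIZE - 1);
}

/*
 * Stand-in for the mapping call: rounds len once and reports, through
 * *populate, how many bytes the caller should prefault (0 for none).
 */
unsigned long sketch_do_mmap(unsigned long addr, unsigned long len,
                             bool want_populate, unsigned long *populate)
{
    assert(populate != NULL);
    len = sketch_page_align(len);
    *populate = want_populate ? len : 0;
    return addr;
}

/* Stand-in for the populate step: it trusts the size it is given. */
unsigned long sketch_populate(unsigned long addr, unsigned long len)
{
    (void)addr;
    return len / SKETCH_PAGE_SIZE;  /* number of pages to touch */
}
```

A request for 5000 bytes with populate enabled yields a populate size of
8192 in this sketch, so the populate step covers both pages of the
mapping without repeating the rounding.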

Signed-off-by: Michel Lespinasse
Acked-by: Rik van Riel
Tested-by: Andy Lutomirski
Cc: Greg Ungerer
Cc: David Howells
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds

Michel Lespinasse
2013-02-24 09:50:11 +0800