11 Jun, 2015
1 commit
-
Izumi found the following oops when hot re-adding a node:
BUG: unable to handle kernel paging request at ffffc90008963690
IP: __wake_up_bit+0x20/0x70
Oops: 0000 [#1] SMP
CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
RIP: 0010:[] [] __wake_up_bit+0x20/0x70
RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
Call Trace:
unlock_page+0x6d/0x70
generic_write_end+0x53/0xb0
xfs_vm_write_end+0x29/0x80 [xfs]
generic_perform_write+0x10a/0x1e0
xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
xfs_file_write_iter+0x79/0x120 [xfs]
__vfs_write+0xd4/0x110
vfs_write+0xac/0x1c0
SyS_write+0x58/0xd0
system_call_fastpath+0x12/0x76
Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
RIP [] __wake_up_bit+0x20/0x70
RSP
CR2: ffffc90008963690

Reproduce method (re-add a node):
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)

This seems to be a use-after-free problem, and the root cause is that
zone->wait_table was not set to *NULL* after it was freed in
try_offline_node().

When a node is hot re-added, its pgdat is reused, and so is the zone
struct. When pages are added to the target zone, the zone is initialized
first (including the wait_table) if it is not already initialized. The
check for whether a zone is initialized is based on zone->wait_table:

static inline bool zone_is_initialized(struct zone *zone)
{
	return !!zone->wait_table;
}

So if zone->wait_table is not set to *NULL* after it is freed, the
memory hotplug routine skips the initialization of the new zone when the
node is hot re-added, and the wait_table still points to the freed
memory. We then access the invalid address when trying to wake up the
waiters after the I/O operation on the page is done, as in the oops
above.

Signed-off-by: Gu Zheng
Reported-by: Taku Izumi
Reviewed-by: Yasuaki Ishimatsu
Cc: KAMEZAWA Hiroyuki
Cc: Tang Chen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
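
The wait_table fix above can be illustrated with a small userspace model. This is only a sketch under stated assumptions: struct toy_zone and the toy_* helpers are hypothetical stand-ins, not the kernel's structures or functions. It shows why the "is initialized" check only works if the pointer is cleared after the free.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct zone; only the relevant field. */
struct toy_zone {
	int *wait_table;	/* NULL means "zone not initialized" */
};

/* Same test as zone_is_initialized() in the commit message. */
static bool toy_zone_is_initialized(const struct toy_zone *zone)
{
	return zone->wait_table != NULL;
}

/*
 * Models the init done when pages are first added to a zone.
 * Returns true only if the zone was actually (re)initialized.
 */
static bool toy_zone_init_if_needed(struct toy_zone *zone)
{
	if (toy_zone_is_initialized(zone))
		return false;	/* skipped: zone looks initialized */
	zone->wait_table = calloc(16, sizeof(*zone->wait_table));
	return zone->wait_table != NULL;
}

/* Models the offline path with the fix applied: free, then clear. */
static void toy_zone_offline(struct toy_zone *zone)
{
	free(zone->wait_table);
	zone->wait_table = NULL;	/* without this, re-add skips the init */
}
```

If the assignment to NULL in toy_zone_offline() were dropped, a later toy_zone_init_if_needed() would return false and leave the zone pointing at freed memory, which is the failure mode described in the commit.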
16 Apr, 2015
1 commit
-
Now we have easy access to hugepages' activeness, so the existing helpers
to get the information can be cleaned up.

[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
Signed-off-by: Naoya Horiguchi
Cc: Hugh Dickins
Reviewed-by: Michal Hocko
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Apr, 2015
2 commits
-
There's a deadlock when concurrently hot-adding memory through the probe
interface and switching a memory block from offline to online.

When hot-adding memory via the probe interface, add_memory() first takes
mem_hotplug_begin() and then device_lock() is later taken when registering
the newly initialized memory block. This creates a lock dependency of (1)
mem_hotplug.lock (2) dev->mutex.

When switching a memory block from offline to online, dev->mutex is first
grabbed in device_online() when the write(2) transitions an existing
memory block from offline to online, and then online_pages() will take
mem_hotplug_begin().

This creates a lock inversion between mem_hotplug.lock and dev->mutex.
Vitaly reports that this deadlock can happen when kworker handling a probe
event races with systemd-udevd switching a memory block's state.

This patch requires the state transition to take mem_hotplug_begin()
before dev->mutex. Hot-adding memory via the probe interface creates a
memory block while holding mem_hotplug_begin(), so there is no way to take
dev->mutex first in this case.

online_pages() and offline_pages() are only called when transitioning
memory block state. We now require that mem_hotplug_begin() is taken
before calling them -- this requires exporting the mem_hotplug_begin() and
mem_hotplug_done() to generic code. In all hot-add and hot-remove cases,
mem_hotplug_begin() is done prior to device_online(). This is all that is
needed to avoid the deadlock.

Signed-off-by: David Rientjes
Reported-by: Vitaly Kuznetsov
Tested-by: Vitaly Kuznetsov
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Cc: "K. Y. Srinivasan"
Cc: Yasuaki Ishimatsu
Cc: Tang Chen
Cc: Vlastimil Babka
Cc: Zhang Zhen
Cc: Vladimir Davydov
Cc: Wang Nan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use macro section_nr_to_pfn() to switch between section and pfn, instead
of open-coding it. No semantic changes.

Signed-off-by: Sheng Yong
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 Mar, 2015
1 commit
-
Qiu Xishi reported the following BUG when testing hot-add/hot-remove of a
node under stress conditions:

BUG: unable to handle kernel paging request at 0000000000025f60
IP: next_online_pgdat+0x1/0x50
PGD 0
Oops: 0000 [#1] SMP
ACPI: Device does not support D3cold
Modules linked in: fuse nls_iso8859_1 nls_cp437 vfat fat loop dm_mod coretemp mperf crc32c_intel ghash_clmulni_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr microcode igb dca i2c_algo_bit ipv6 megaraid_sas iTCO_wdt i2c_i801 i2c_core iTCO_vendor_support tg3 sg hwmon ptp lpc_ich pps_core mfd_core acpi_pad rtc_cmos button ext3 jbd mbcache sd_mod crc_t10dif scsi_dh_alua scsi_dh_rdac scsi_dh_hp_sw scsi_dh_emc scsi_dh ahci libahci libata scsi_mod [last unloaded: rasf]
CPU: 23 PID: 238 Comm: kworker/23:1 Tainted: G O 3.10.15-5885-euler0302 #1
Hardware name: HUAWEI TECHNOLOGIES CO.,LTD. Huawei N1/Huawei N1, BIOS V100R001 03/02/2015
Workqueue: events vmstat_update
task: ffffa800d32c0000 ti: ffffa800d32ae000 task.ti: ffffa800d32ae000
RIP: 0010: next_online_pgdat+0x1/0x50
RSP: 0018:ffffa800d32afce8 EFLAGS: 00010286
RAX: 0000000000001440 RBX: ffffffff81da53b8 RCX: 0000000000000082
RDX: 0000000000000000 RSI: 0000000000000082 RDI: 0000000000000000
RBP: ffffa800d32afd28 R08: ffffffff81c93bfc R09: ffffffff81cbdc96
R10: 00000000000040ec R11: 00000000000000a0 R12: ffffa800fffb3440
R13: ffffa800d32afd38 R14: 0000000000000017 R15: ffffa800e6616800
FS: 0000000000000000(0000) GS:ffffa800e6600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000025f60 CR3: 0000000001a0b000 CR4: 00000000001407e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
refresh_cpu_vm_stats+0xd0/0x140
vmstat_update+0x11/0x50
process_one_work+0x194/0x3d0
worker_thread+0x12b/0x410
kthread+0xc6/0xd0
ret_from_fork+0x7c/0xb0

The cause is the "memset(pgdat, 0, sizeof(*pgdat))" at the end of
try_offline_node(), which resets the whole content of pgdat to 0. Because
the pgdat is accessed lock-free, users that are still using the pgdat,
such as the vmstat_update routine, will panic.

process A:                          offline node XX:

vmstat_update()
  refresh_cpu_vm_stats()
    for_each_populated_zone()
      find online node XX
    cond_resched()
                                    offline cpu and memory, then try_offline_node()
                                    node_set_offline(nid), and memset(pgdat, 0, sizeof(*pgdat))
      zone = next_zone(zone)
        pg_data_t *pgdat = zone->zone_pgdat;  // here pgdat is NULL now
          next_online_pgdat(pgdat)
            next_online_node(pgdat->node_id);  // NULL pointer access

So the solution here is to postpone the reset of the obsolete pgdat from
try_offline_node() to hotadd_new_pgdat(), and to reset only
pgdat->nr_zones and pgdat->classzone_idx to 0 rather than memset the whole
structure, to avoid breaking pointer information in pgdat.

Signed-off-by: Gu Zheng
Reported-by: Xishi Qiu
Suggested-by: KAMEZAWA Hiroyuki
Cc: David Rientjes
Cc: Yasuaki Ishimatsu
Cc: Taku Izumi
Cc: Tang Chen
Cc: Xie XiuQi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
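
The difference between the old memset and the selective reset in the pgdat fix above can be sketched in userspace. Everything here is an assumption made for illustration: struct toy_pgdat and the toy_* functions are hypothetical, and the "self" field merely stands in for the pointer members that lock-free readers follow.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, heavily reduced stand-in for pg_data_t. */
struct toy_pgdat {
	int node_id;
	int nr_zones;
	int classzone_idx;
	struct toy_pgdat *self;	/* stands in for pointers readers follow */
};

/* Old behaviour at offline time: wipe the whole structure. */
static void toy_offline_memset(struct toy_pgdat *pgdat)
{
	memset(pgdat, 0, sizeof(*pgdat));
}

/* New behaviour at re-add time: reset only the two counters. */
static void toy_hotadd_reset(struct toy_pgdat *pgdat)
{
	pgdat->nr_zones = 0;
	pgdat->classzone_idx = 0;
}
```

After toy_hotadd_reset() a concurrent reader still sees a valid node_id and valid pointers, whereas toy_offline_memset() leaves it dereferencing zeroed fields.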
11 Dec, 2014
2 commits
-
Memory hotplug and failure mechanisms have several places where pcplists
are drained so that pages are returned to the buddy allocator and can be
e.g. prepared for offlining. This is always done in the context of a
single zone, so we can reduce the pcplists drain to the single zone, which
is now possible.

The change should make memory offlining due to hotremove or failure
faster and no longer disturb unrelated pcplists.

Signed-off-by: Vlastimil Babka
Cc: Naoya Horiguchi
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Yasuaki Ishimatsu
Cc: Zhang Yanfei
Cc: Xishi Qiu
Cc: Vladimir Davydov
Cc: Joonsoo Kim
Cc: Michal Nazarewicz
Cc: Marek Szyprowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The functions for draining per-cpu pages back to buddy allocators
currently always operate on all zones. There are however several cases
where the drain is only needed in the context of a single zone, and
spilling other pcplists is a waste of time both due to the extra
spilling and later refilling.

This patch introduces a new zone pointer parameter to drain_all_pages()
and changes the dummy parameter of drain_local_pages() to be also a zone
pointer. When NULL is passed, the functions operate on all zones as
usual. Passing a specific zone pointer reduces the work to the single
zone.

All callers are updated to pass the NULL pointer in this patch.
Conversion to single zone (where appropriate) is done in further
patches.

Signed-off-by: Vlastimil Babka
Cc: Naoya Horiguchi
Cc: Mel Gorman
Cc: Rik van Riel
Cc: Yasuaki Ishimatsu
Cc: Zhang Yanfei
Cc: Xishi Qiu
Cc: Vladimir Davydov
Cc: Joonsoo Kim
Cc: Michal Nazarewicz
Cc: Marek Szyprowski
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
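
The "NULL means all zones" calling convention described in the drain_all_pages() commit above can be sketched as follows. This is a simplified userspace model, not the kernel implementation: the toy_* names, the fixed zone array and the plain counters are all assumptions for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_NR_ZONES 3

/* Hypothetical zone with a per-cpu list count and a buddy free count. */
struct toy_zone {
	int pcp_count;	/* pages sitting on the per-cpu list */
	int free_count;	/* pages in the buddy allocator */
};

static struct toy_zone toy_zones[TOY_NR_ZONES];

/* Return all per-cpu pages of one zone to the buddy allocator. */
static void toy_drain_zone(struct toy_zone *zone)
{
	zone->free_count += zone->pcp_count;
	zone->pcp_count = 0;
}

/* NULL drains every zone; a specific pointer drains only that zone. */
static void toy_drain_all_pages(struct toy_zone *zone)
{
	if (zone) {
		toy_drain_zone(zone);
		return;
	}
	for (size_t i = 0; i < TOY_NR_ZONES; i++)
		toy_drain_zone(&toy_zones[i]);
}
```

A caller that only cares about one zone passes its pointer and leaves the other per-cpu lists untouched, which is the saving the follow-up conversion patches aim for.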
14 Nov, 2014
2 commits
-
When memory is hot-added, all the memory is in offline state. So clear
all zones' present_pages because they will be updated in online_pages()
and offline_pages(). Otherwise, /proc/zoneinfo shows corrupted values:

When the memory of node2 is offline:
# cat /proc/zoneinfo
......
Node 2, zone Movable
......
spanned 8388608
present 8388608
managed 0

When we online memory on node2:
# cat /proc/zoneinfo
......
Node 2, zone Movable
......
spanned 8388608
present 16777216
managed 8388608

Signed-off-by: Tang Chen
Reviewed-by: Yasuaki Ishimatsu
Cc: [3.16+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In free_area_init_core(), zone->managed_pages is set to an approximate
value for lowmem, and will be adjusted when the bootmem allocator frees
pages into the buddy system.

But free_area_init_core() is also called by hotadd_new_pgdat() when
hot-adding memory. As a result, zone->managed_pages of the newly added
node's pgdat is set to an approximate value in the very beginning.

Even if the memory on that node has not been onlined,
/sys/devices/system/node/nodeXXX/meminfo has the wrong value:

hot-add node2 (memory not onlined)
cat /sys/devices/system/node/node2/meminfo
Node 2 MemTotal: 33554432 kB
Node 2 MemFree: 0 kB
Node 2 MemUsed: 33554432 kB
Node 2 Active: 0 kB

This patch fixes this problem by resetting the node's managed pages to 0
after hot-adding a new node.

1. Move reset_managed_pages_done from reset_node_managed_pages() to
   reset_all_zones_managed_pages()
2. Make reset_node_managed_pages() non-static
3. Call reset_node_managed_pages() in hotadd_new_pgdat() after pgdat
   is initialized

Signed-off-by: Tang Chen
Signed-off-by: Yasuaki Ishimatsu
Cc: [3.16+]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Oct, 2014
1 commit
-
When hot adding the same memory after hot removal, the following
messages are shown:

WARNING: CPU: 20 PID: 6 at mm/page_alloc.c:4968 free_area_init_node+0x3fe/0x426()
...
Call Trace:
dump_stack+0x46/0x58
warn_slowpath_common+0x81/0xa0
warn_slowpath_null+0x1a/0x20
free_area_init_node+0x3fe/0x426
hotadd_new_pgdat+0x90/0x110
add_memory+0xd4/0x200
acpi_memory_device_add+0x1aa/0x289
acpi_bus_attach+0xfd/0x204
acpi_bus_attach+0x178/0x204
acpi_bus_scan+0x6a/0x90
acpi_device_hotplug+0xe8/0x418
acpi_hotplug_work_fn+0x1f/0x2b
process_one_work+0x14e/0x3f0
worker_thread+0x11b/0x510
kthread+0xe1/0x100
ret_from_fork+0x7c/0xb0

The detailed explanation is as follows:

When hot removing memory, pgdat is set to 0 in try_offline_node(). But
if the pgdat is allocated by the bootmem allocator, the clearing step is
skipped.

And when hot adding the same memory, the uninitialized pgdat is reused.
But free_area_init_node() checks whether pgdat is set to zero. As a
result, free_area_init_node() hits WARN_ON().

This patch clears pgdat which is allocated by the bootmem allocator in
try_offline_node().

Signed-off-by: Yasuaki Ishimatsu
Cc: Zhang Zhen
Cc: Wang Nan
Cc: Tang Chen
Reviewed-by: Toshi Kani
Cc: Dave Hansen
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
10 Oct, 2014
1 commit
-
Currently memory-hotplug has two limits:

1. If the memory block is in ZONE_NORMAL, you can change it to
   ZONE_MOVABLE, but this memory block must be adjacent to ZONE_MOVABLE.

2. If the memory block is in ZONE_MOVABLE, you can change it to
   ZONE_NORMAL, but this memory block must be adjacent to ZONE_NORMAL.

With this patch, we can easily know which zone a memory block can be
onlined to, and don't need to know the above two limits.

Updated the related Documentation.
[akpm@linux-foundation.org: use conventional comment layout]
[akpm@linux-foundation.org: fix build with CONFIG_MEMORY_HOTREMOVE=n]
[akpm@linux-foundation.org: remove unused local zone_prev]
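
The two adjacency limits listed above can be expressed as simple pfn comparisons. The sketch below is an assumption-laden model, not the patch itself: struct toy_pfn_range and both helpers are hypothetical, ranges are half-open [start_pfn, end_pfn), and ZONE_MOVABLE is assumed to sit directly above ZONE_NORMAL.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical half-open pfn range: [start_pfn, end_pfn). */
struct toy_pfn_range {
	unsigned long start_pfn;
	unsigned long end_pfn;
};

/*
 * Limit 1: a block in ZONE_NORMAL can move to ZONE_MOVABLE only if it
 * is the last block of ZONE_NORMAL (i.e. it borders ZONE_MOVABLE).
 */
static bool toy_can_move_to_movable(const struct toy_pfn_range *block,
				    const struct toy_pfn_range *normal)
{
	return block->end_pfn == normal->end_pfn;
}

/*
 * Limit 2: a block in ZONE_MOVABLE can move to ZONE_NORMAL only if it
 * is the first block of ZONE_MOVABLE (i.e. it borders ZONE_NORMAL).
 */
static bool toy_can_move_to_normal(const struct toy_pfn_range *block,
				   const struct toy_pfn_range *movable)
{
	return block->start_pfn == movable->start_pfn;
}
```

A user-visible attribute that reports the result of such checks spares the administrator from reasoning about zone boundaries by hand.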
Signed-off-by: Zhang Zhen
Cc: Dave Hansen
Cc: David Rientjes
Cc: Toshi Kani
Cc: Yasuaki Ishimatsu
Cc: Naoya Horiguchi
Cc: Wang Nan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Aug, 2014
3 commits
-
This series of patches fixes a problem when adding memory in a bad manner.
For example, on an x86_64 machine booted with "mem=400M" and with 2GiB of
memory installed, the following commands cause a problem:

# echo 0x40000000 > /sys/devices/system/memory/probe
[ 28.613895] init_memory_mapping: [mem 0x40000000-0x47ffffff]
# echo 0x48000000 > /sys/devices/system/memory/probe
[ 28.693675] init_memory_mapping: [mem 0x48000000-0x4fffffff]
# echo online_movable > /sys/devices/system/memory/memory9/state
# echo 0x50000000 > /sys/devices/system/memory/probe
[ 29.084090] init_memory_mapping: [mem 0x50000000-0x57ffffff]
# echo 0x58000000 > /sys/devices/system/memory/probe
[ 29.151880] init_memory_mapping: [mem 0x58000000-0x5fffffff]
# echo online_movable > /sys/devices/system/memory/memory11/state
# echo online> /sys/devices/system/memory/memory8/state
# echo online> /sys/devices/system/memory/memory10/state
# echo offline> /sys/devices/system/memory/memory9/state
[ 30.558819] Offlined Pages 32768
# free
             total       used       free     shared    buffers     cached
Mem:        780588 18014398509432020     830552          0          0      51180
-/+ buffers/cache: 18014398509380840     881732
Swap:            0          0          0

This is because the above commands probe higher memory after onlining a
section with online_movable, which causes ZONE_HIGHMEM (or ZONE_NORMAL
for systems without ZONE_HIGHMEM) to overlap ZONE_MOVABLE.

After the second online_movable, the problem can be observed from
zoneinfo:

# cat /proc/zoneinfo
...
Node 0, zone Movable
pages free 65491
min 250
low 312
high 375
scanned 0
spanned 18446744073709518848
present 65536
managed 65536
...

This series of patches solves the problem by checking ZONE_MOVABLE when
choosing the zone for new memory. If the new memory is inside or higher
than ZONE_MOVABLE, it goes there instead.

After applying this series of patches, the free and zoneinfo results
(after offlining memory9) are as follows:

bash-4.2# free
             total       used       free     shared    buffers     cached
Mem:        780956      80112     700844          0          0      51180
-/+ buffers/cache:      28932     752024
Swap:            0          0          0

bash-4.2# cat /proc/zoneinfo
Node 0, zone DMA
pages free 3389
min 14
low 17
high 21
scanned 0
spanned 4095
present 3998
managed 3977
nr_free_pages 3389
...
start_pfn: 1
inactive_ratio: 1
Node 0, zone DMA32
pages free 73724
min 341
low 426
high 511
scanned 0
spanned 98304
present 98304
managed 92958
nr_free_pages 73724
...
start_pfn: 4096
inactive_ratio: 1
Node 0, zone Normal
pages free 32630
min 120
low 150
high 180
scanned 0
spanned 32768
present 32768
managed 32768
nr_free_pages 32630
...
start_pfn: 262144
inactive_ratio: 1
Node 0, zone Movable
pages free 65476
min 241
low 301
high 361
scanned 0
spanned 98304
present 65536
managed 65536
nr_free_pages 65476
...
start_pfn: 294912
inactive_ratio: 1

This patch (of 7):

Introduce zone_for_memory() in arch independent code for
arch_add_memory() use.

Many arch_add_memory() functions simply select ZONE_HIGHMEM or
ZONE_NORMAL and add new memory into it. However, with the existence of
ZONE_MOVABLE, the selection method should be carefully considered: if
new, higher memory is added after ZONE_MOVABLE is set up, the default
zone and ZONE_MOVABLE may overlap each other.

should_add_memory_movable() checks the status of ZONE_MOVABLE. If it
already contains memory, the address of the new memory is compared with
that of the movable memory. If the new memory is higher than the movable
memory, it should be added into ZONE_MOVABLE instead of the default zone.

Signed-off-by: Wang Nan
Cc: Zhang Yanfei
Cc: Dave Hansen
Cc: Ingo Molnar
Cc: Yinghai Lu
Cc: "Mel Gorman"
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: "Luck, Tony"
Cc: Benjamin Herrenschmidt
Cc: Paul Mackerras
Cc: Chris Metcalf
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
In store_mem_state(), we have:
...
else if (!strncmp(buf, "offline", min_t(int, count, 7)))
	online_type = -1;
...
case -1:
	ret = device_offline(&mem->dev);
	break;
...

Here, "offline" is hard coded as -1.

This patch does the following renaming:
ONLINE_KEEP -> MMOP_ONLINE_KEEP
ONLINE_KERNEL -> MMOP_ONLINE_KERNEL
ONLINE_MOVABLE -> MMOP_ONLINE_MOVABLE

and introduces MMOP_OFFLINE = -1 to avoid hard coding.
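
The benefit of the named constant can be shown with a small userspace parser. This is a sketch only: the enum reuses the names from the commit text, but the numeric values other than MMOP_OFFLINE = -1 are assumptions, and toy_parse_mem_state() is a hypothetical helper that expects a NUL-terminated string without a trailing newline (the kernel code quoted above uses strncmp() on the raw buffer instead).

```c
#include <assert.h>
#include <string.h>

/* Only MMOP_OFFLINE = -1 is stated in the commit; the rest is assumed. */
enum toy_mmop {
	MMOP_OFFLINE = -1,
	MMOP_ONLINE_KEEP,
	MMOP_ONLINE_KERNEL,
	MMOP_ONLINE_MOVABLE,
};

/*
 * Hypothetical helper: map a state string to an operation.
 * Returns 0 and stores the operation on success, -1 for unknown input.
 */
static int toy_parse_mem_state(const char *buf, enum toy_mmop *op)
{
	if (!strcmp(buf, "online_kernel"))
		*op = MMOP_ONLINE_KERNEL;
	else if (!strcmp(buf, "online_movable"))
		*op = MMOP_ONLINE_MOVABLE;
	else if (!strcmp(buf, "online"))
		*op = MMOP_ONLINE_KEEP;
	else if (!strcmp(buf, "offline"))
		*op = MMOP_OFFLINE;	/* named constant instead of a bare -1 */
	else
		return -1;
	return 0;
}
```

A later switch statement can then use "case MMOP_OFFLINE:" and stay readable without a comment explaining what -1 means.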
Signed-off-by: Tang Chen
Cc: Hu Tao
Cc: Greg Kroah-Hartman
Cc: Lai Jiangshan
Cc: Yasuaki Ishimatsu
Cc: Gu Zheng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
grow_zone_span and grow_pgdat_span are only called by
__meminit __add_zone

Signed-off-by: Fabian Frederick
Cc: Toshi Kani
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
05 Jun, 2014
3 commits
-
Memory migration uses a callback defined by the caller to determine how to
allocate destination pages. When migration fails for a source page,
however, it frees the destination page back to the system.

This patch adds a memory migration callback defined by the caller to
determine how to free destination pages. If a caller, such as memory
compaction, builds its own freelist for migration targets, this can reuse
already freed memory instead of scanning additional memory.

If the caller provides a function to handle freeing of destination pages,
it is called when page migration fails. If the caller passes NULL then
freeing back to the system will be handled as usual. This patch
introduces no functional change.

Signed-off-by: David Rientjes
Reviewed-by: Naoya Horiguchi
Acked-by: Mel Gorman
Acked-by: Vlastimil Babka
Cc: Greg Thelen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Replace ((x) >> PAGE_SHIFT) with the pfn macro.
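
As a minimal illustration of the replacement, the sketch below compares the open-coded shift with a macro of the same shape. TOY_PAGE_SHIFT (assumed 4 KiB pages), TOY_PFN_DOWN and both helper functions are hypothetical names for this example; the commit text only says "the pfn macro" without spelling out its definition.

```c
#include <assert.h>

#define TOY_PAGE_SHIFT	12UL	/* assumption: 4 KiB pages */
#define TOY_PFN_DOWN(x)	((x) >> TOY_PAGE_SHIFT)

/* Before: the shift is written out at the call site. */
static unsigned long toy_addr_to_pfn_open_coded(unsigned long addr)
{
	return addr >> TOY_PAGE_SHIFT;
}

/* After: the macro names the intent; the result is identical. */
static unsigned long toy_addr_to_pfn(unsigned long addr)
{
	return TOY_PFN_DOWN(addr);
}
```

Both functions round an address down to its page frame number, so the change is purely about readability.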
Signed-off-by: Fabian Frederick
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
kmem_cache_{create,destroy,shrink} need to get a stable value of
cpu/node online mask, because they init/destroy/access per-cpu/node
kmem_cache parts, which can be allocated or destroyed on cpu/mem
hotplug. To protect against cpu hotplug, these functions use
{get,put}_online_cpus. However, they do nothing to synchronize with
memory hotplug - taking the slab_mutex does not eliminate the
possibility of race as described in patch 2.

What we need there is something like get_online_cpus, but for memory.
We already have lock_memory_hotplug, which serves the purpose, but
it's a bit of a hammer right now, because it's backed by a mutex. As a
result, it imposes some limitations on locking order, which are not
desirable, and can't be used just like get_online_cpus. That's why in
patch 1 I substitute it with get/put_online_mems, which work exactly
like get/put_online_cpus except they block not cpu, but memory hotplug.

[ v1 can be found at https://lkml.org/lkml/2014/4/6/68. I NAK'ed it
  myself, because it used an rw semaphore for get/put_online_mems,
  making them deadlock prone. ]

This patch (of 2):

{un}lock_memory_hotplug, which is used to synchronize against memory
hotplug, is currently backed by a mutex, which makes it a bit of a
hammer - threads that only want to get a stable value of online nodes
mask won't be able to proceed concurrently. Also, it imposes some
strong locking ordering rules on it, which narrows down the set of its
usage scenarios.

This patch introduces get/put_online_mems, which are the same as
get/put_online_cpus, but for memory hotplug, i.e. executing code
inside a get/put_online_mems section will guarantee a stable value of
online nodes, present pages, etc.

lock_memory_hotplug()/unlock_memory_hotplug() are removed altogether.
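
The difference from a plain mutex is that several readers may hold the "online mems" reference at once, while a hotplug operation has to wait for all of them. The following is a single-threaded state model of that idea, not the kernel implementation: every toy_* name is hypothetical, nothing here is thread-safe, and where the real primitives would sleep these helpers simply return false.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical state: reader count plus a "hotplug running" flag. */
struct toy_mem_hotplug {
	int refcount;		/* active reader sections */
	bool hotplug_active;	/* a hotplug operation is in progress */
};

/* Reader entry: fails (instead of sleeping) while hotplug is running. */
static bool toy_get_online_mems(struct toy_mem_hotplug *h)
{
	if (h->hotplug_active)
		return false;
	h->refcount++;
	return true;
}

/* Reader exit. */
static void toy_put_online_mems(struct toy_mem_hotplug *h)
{
	assert(h->refcount > 0);
	h->refcount--;
}

/* Writer entry: only allowed when no reader section is open. */
static bool toy_mem_hotplug_try_begin(struct toy_mem_hotplug *h)
{
	if (h->refcount > 0 || h->hotplug_active)
		return false;
	h->hotplug_active = true;
	return true;
}

/* Writer exit. */
static void toy_mem_hotplug_done(struct toy_mem_hotplug *h)
{
	h->hotplug_active = false;
}
```

Two readers can be inside their sections at the same time, which a mutex-backed lock_memory_hotplug() would not allow.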
Signed-off-by: Vladimir Davydov
Cc: Christoph Lameter
Cc: Pekka Enberg
Cc: Tang Chen
Cc: Zhang Yanfei
Cc: Toshi Kani
Cc: Xishi Qiu
Cc: Jiang Liu
Cc: Rafael J. Wysocki
Cc: David Rientjes
Cc: Wen Congyang
Cc: Yasuaki Ishimatsu
Cc: Lai Jiangshan
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
24 Jan, 2014
2 commits
-
We don't need to do register_memory_resource() under
lock_memory_hotplug() since it has its own lock and doesn't make any
callbacks.

Also register_memory_resource() returns NULL on failure so we don't have
anything to clean up at this point.

The reason for this rfc is I was doing some experiments with hotplugging
of memory on some of our larger systems. While it seems to work, it can
be quite slow. With some preliminary digging I found that
lock_memory_hotplug is clearly ripe for breakup.

It could be broken up per nid or something but it also covers the
online_page_callback. The online_page_callback shouldn't be very hard
to break out.

Also there is the issue of various structures (wmarks come to mind) that
are only updated under the lock_memory_hotplug that would need to be
dealt with.

Cc: Tang Chen
Cc: Wen Congyang
Cc: Kamezawa Hiroyuki
Reviewed-by: Yasuaki Ishimatsu
Cc: "Rafael J. Wysocki"
Cc: Hedi
Cc: Mike Travis
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
bad_page() is cool in that it prints out a bunch of data about the page.
But, I can never remember which page flags are good and which are bad,
or whether ->index or ->mapping is required to be NULL.

This patch allows bad/dump_page() callers to specify a string about why
they are dumping the page and adds explanation strings to a number of
places. It also adds a 'bad_flags' argument to bad_page(), which it
then dumps out separately from the flags which are actually set.

This way, the messages will show specifically why the page was bad,
*specifically* which flags it is complaining about, if it was a page
flag combination which was the problem.

[akpm@linux-foundation.org: switch to pr_alert]
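
The idea of reporting a reason string plus the offending subset of flags can be sketched in userspace. The flag names, the report format and toy_bad_page() itself are hypothetical; the only part taken from the commit text is that a caller supplies a reason and a 'bad_flags' mask, and that the bad flags are shown separately from the flags actually set.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical page flag bits for the sketch. */
#define TOY_PG_LOCKED	(1UL << 0)
#define TOY_PG_DIRTY	(1UL << 1)
#define TOY_PG_LRU	(1UL << 2)

/*
 * Format a report into `out` and return the flags that are both set on
 * the page and listed in `bad_flags`.
 */
static unsigned long toy_bad_page(char *out, size_t len, unsigned long flags,
				  const char *reason, unsigned long bad_flags)
{
	unsigned long offending = flags & bad_flags;

	snprintf(out, len, "reason: %s flags: %#lx bad: %#lx",
		 reason, flags, offending);
	return offending;
}
```

A reader of the resulting message no longer has to remember which of the set flags are legal in the given context, because only the offending ones appear in the "bad" field.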
Signed-off-by: Dave Hansen
Reviewed-by: Christoph Lameter
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
22 Jan, 2014
3 commits
-
Correct ensure_zone_is_initialized() function description according to
the introduced memblock APIs for early memory allocations.

Signed-off-by: Grygorii Strashko
Signed-off-by: Santosh Shilimkar
Cc: "Rafael J. Wysocki"
Cc: Arnd Bergmann
Cc: Christoph Lameter
Cc: Greg Kroah-Hartman
Cc: H. Peter Anvin
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Konrad Rzeszutek Wilk
Cc: Michal Hocko
Cc: Paul Walmsley
Cc: Pavel Machek
Cc: Russell King
Cc: Tejun Heo
Cc: Tony Lindgren
Cc: Yinghai Lu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Clean-up to remove the dependency on bootmem headers.
Signed-off-by: Grygorii Strashko
Signed-off-by: Santosh Shilimkar
Reviewed-by: Tejun Heo
Cc: Yinghai Lu
Cc: Arnd Bergmann
Cc: Greg Kroah-Hartman
Cc: "Rafael J. Wysocki"
Cc: Christoph Lameter
Cc: H. Peter Anvin
Cc: Johannes Weiner
Cc: KAMEZAWA Hiroyuki
Cc: Konrad Rzeszutek Wilk
Cc: Michal Hocko
Cc: Paul Walmsley
Cc: Pavel Machek
Cc: Russell King
Cc: Tony Lindgren
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The Linux kernel cannot migrate pages used by the kernel. As a result,
hotpluggable memory used by the kernel won't be able to be hot-removed.
To solve this problem, the basic idea is to prevent memblock from
allocating hotpluggable memory for the kernel at early time, and arrange
all hotpluggable memory in ACPI SRAT (System Resource Affinity Table) as
ZONE_MOVABLE when initializing zones.

In the previous patches, we have marked hotpluggable memory regions with
MEMBLOCK_HOTPLUG flag in memblock.memory.

In this patch, we make memblock skip these hotpluggable memory regions
in the default top-down allocation function if movable_node boot option
is specified.

[akpm@linux-foundation.org: coding-style fixes]
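
The skipping rule can be sketched as a top-down scan over a sorted region array. This is a simplified model under stated assumptions, not memblock itself: struct toy_region, its "hotplug" field (standing in for the MEMBLOCK_HOTPLUG flag) and toy_find_top_down() are hypothetical, and the scan only picks a region rather than a precise address.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical memory region; `hotplug` models MEMBLOCK_HOTPLUG. */
struct toy_region {
	unsigned long base;
	unsigned long size;
	bool hotplug;
};

/*
 * Scan regions from the highest to the lowest (array assumed sorted by
 * base) and return the index of the first one that can hold `size`
 * bytes, or -1. With movable_node set, hotpluggable regions are skipped.
 */
static int toy_find_top_down(const struct toy_region *regions, size_t nr,
			     unsigned long size, bool movable_node)
{
	for (size_t i = nr; i-- > 0; ) {
		if (movable_node && regions[i].hotplug)
			continue;
		if (regions[i].size >= size)
			return (int)i;
	}
	return -1;
}
```

Without the boot option the highest region wins as before; with it, early allocations fall through to lower, non-hotpluggable memory, so the hotpluggable region stays free of kernel data.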
Signed-off-by: Tang Chen
Signed-off-by: Zhang Yanfei
Cc: "H. Peter Anvin"
Cc: "Rafael J . Wysocki"
Cc: Chen Tang
Cc: Gong Chen
Cc: Ingo Molnar
Cc: Jiang Liu
Cc: Johannes Weiner
Cc: Lai Jiangshan
Cc: Larry Woodman
Cc: Len Brown
Cc: Liu Jiang
Cc: Mel Gorman
Cc: Michal Nazarewicz
Cc: Minchan Kim
Cc: Prarit Bhargava
Cc: Rik van Riel
Cc: Taku Izumi
Cc: Tejun Heo
Cc: Thomas Gleixner
Cc: Thomas Renninger
Cc: Toshi Kani
Cc: Vasilis Liaskovitis
Cc: Wanpeng Li
Cc: Wen Congyang
Cc: Yasuaki Ishimatsu
Cc: Yinghai Lu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Nov, 2013
6 commits
-
The Hot-Pluggable field in SRAT specifies which memory is hotpluggable.
As we mentioned before, if hotpluggable memory is used by the kernel, it
cannot be hot-removed. So memory hotplug users may want to set all
hotpluggable memory in ZONE_MOVABLE so that the kernel won't use it.

Memory hotplug users may also set a node as movable node, which has
ZONE_MOVABLE only, so that the whole node can be hot-removed.

But the kernel cannot use memory in ZONE_MOVABLE. By doing this, the
kernel cannot use memory in movable nodes. This will degrade NUMA
performance, and other users may be unhappy.

So we need a way to allow users to enable and disable this functionality.
In this patch, we introduce the movable_node boot option to allow users to
choose not to consume hotpluggable memory at early boot time, so that
later we can set it as ZONE_MOVABLE.

To achieve this, the movable_node boot option will control the memblock
allocation direction. That said, after memblock is ready, before SRAT is
parsed, we should allocate memory near the kernel image as we explained in
the previous patches. So if movable_node boot option is set, the kernel
does the following:

1. After memblock is ready, make memblock allocate memory bottom up.
2. After SRAT is parsed, make memblock behave as default, allocate memory
   top down.

Users can specify "movable_node" in kernel commandline to enable this
functionality. For those who don't use memory hotplug or who don't want
to lose their NUMA performance, just don't specify anything. The kernel
will work as before.

Signed-off-by: Tang Chen
Signed-off-by: Zhang Yanfei
Suggested-by: Kamezawa Hiroyuki
Suggested-by: Ingo Molnar
Acked-by: Tejun Heo
Acked-by: Toshi Kani
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Wanpeng Li
Cc: Thomas Renninger
Cc: Yinghai Lu
Cc: Jiang Liu
Cc: Wen Congyang
Cc: Lai Jiangshan
Cc: Yasuaki Ishimatsu
Cc: Taku Izumi
Cc: Mel Gorman
Cc: Michal Nazarewicz
Cc: Minchan Kim
Cc: Rik van Riel
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
For the functions below,
- sparse_add_one_section()
- kmalloc_section_memmap()
- __kmalloc_section_memmap()
- __kfree_section_memmap()

they are always invoked to operate on one memory section, so it is
redundant to always pass a nr_pages parameter, which is the number of
pages in one section. So we can directly use the predefined macro
PAGES_PER_SECTION instead of passing the parameter.

Signed-off-by: Zhang Yanfei
Cc: Wen Congyang
Cc: Tang Chen
Cc: Toshi Kani
Cc: Yasuaki Ishimatsu
Cc: Yinghai Lu
Cc: Yasunori Goto
Cc: Andy Whitcroft
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
cpu_up() has #ifdef CONFIG_MEMORY_HOTPLUG code blocks, which call
mem_online_node() to put its node online if offlined and then call
build_all_zonelists() to initialize the zone list.

These steps are specific to memory hotplug, and should be managed in
mm/memory_hotplug.c. lock_memory_hotplug() should also be held for the
whole sequence.

For this reason, this patch replaces mem_online_node() with
try_online_node(), which performs all the steps with
lock_memory_hotplug() held. try_online_node() is named after
try_offline_node() as they have a similar purpose.

There is no functional change in this patch.
Signed-off-by: Toshi Kani
Reviewed-by: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use "pfn_to_nid(pfn)" instead of "page_to_nid(pfn_to_page(pfn))".
Signed-off-by: Xishi Qiu
Acked-by: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
An is_memblock_offlined() return of 1 means the memory block is offlined,
but is_memblock_offlined_cb() returning 1 means the memory block is not
offlined. This is confusing, so rename the function.

Signed-off-by: Xishi Qiu
Acked-by: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use "pgdat_end_pfn()" instead of "pgdat->node_start_pfn +
pgdat->node_spanned_pages". Simplify the code, no functional change.

Signed-off-by: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
13 Sep, 2013
1 commit
-
Pull ACPI and power management fixes from Rafael Wysocki:
"All of these commits are fixes that have emerged recently and some of
them fix bugs introduced during this merge window.Specifics:
1) ACPI-based PCI hotplug (ACPIPHP) fixes related to spurious events
After the recent ACPIPHP changes we've seen some interesting
breakage on a system that triggers device check notifications
during boot for non-existing devices. Although those
notifications are really spurious, we should be able to deal with
them nevertheless and that shouldn't introduce too much overhead.
Four commits to make that work properly.2) Memory hotplug and hibernation mutual exclusion rework
This was maent to be a cleanup, but it happens to fix a classical
ABBA deadlock between system suspend/hibernation and ACPI memory
hotplug which is possible if they are started roughly at the same
time. Three commits rework memory hotplug so that it doesn't
acquire pm_mutex and make hibernation use device_hotplug_lock
which prevents it from racing with memory hotplug.3) ACPI Intel LPSS (Low-Power Subsystem) driver crash fix
The ACPI LPSS driver crashes during boot on Apple Macbook Air with
Haswell that has slightly unusual BIOS configuration in which one
of the LPSS device's _CRS method doesn't return all of the
information expected by the driver. Fix from Mika Westerberg, for
stable.4) ACPICA fix related to Store->ArgX operation
AML interpreter fix for obscure breakage that causes AML to be
executed incorrectly on some machines (observed in practice).
From Bob Moore.5) ACPI core fix for PCI ACPI device objects lookup
There still are cases in which there is more than one ACPI device
object matching a given PCI device and we don't choose the one
that the BIOS expects us to choose, so this makes the lookup take
more criteria into account in those cases.6) Fix to prevent cpuidle from crashing in some rare cases
If the result of cpuidle_get_driver() is NULL, which can happen on
some systems, cpuidle_driver_ref() will crash trying to use that
pointer and the Daniel Fu's fix prevents that from happening.7) cpufreq fixes related to CPU hotplug
Stephen Boyd reported a number of concurrency problems with
cpufreq related to CPU hotplug which are addressed by a series of
fixes from Srivatsa S Bhat and Viresh Kumar.8) cpufreq fix for time conversion in time_in_state attribute
Time conversion carried out by cpufreq when user space attempts to
read /sys/devices/system/cpu/cpu*/cpufreq/stats/time_in_state
won't work correctly if cputime_t doesn't map directly to jiffies.
Fix from Andreas Schwab.

9) Revert of a troublesome cpufreq commit
Commit 7c30ed5 (cpufreq: make sure frequency transitions are
serialized) was intended to address some known concurrency
problems in cpufreq related to the ordering of transitions, but
unfortunately it introduced several problems of its own, so I
decided to revert it now and address the original problems later
in a more robust way.

10) Intel Haswell CPU models for intel_pstate from Nell Hardcastle.

11) cpufreq fixes related to system suspend/resume
The recent cpufreq changes that made it preserve CPU sysfs
attributes over suspend/resume cycles introduced a possible NULL
pointer dereference that caused it to crash during the second
attempt to suspend. Three commits from Srivatsa S Bhat fix that
problem and a couple of related issues.

12) cpufreq locking fix
cpufreq_policy_restore() should acquire the lock for reading, but
it acquires it for writing. Fix from Lan Tianyu"

* tag 'pm+acpi-fixes-3.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (25 commits)
cpufreq: Acquire the lock in cpufreq_policy_restore() for reading
cpufreq: Prevent problems in update_policy_cpu() if last_cpu == new_cpu
cpufreq: Restructure if/else block to avoid unintended behavior
cpufreq: Fix crash in cpufreq-stats during suspend/resume
intel_pstate: Add Haswell CPU models
Revert "cpufreq: make sure frequency transitions are serialized"
cpufreq: Use signed type for 'ret' variable, to store negative error values
cpufreq: Remove temporary fix for race between CPU hotplug and sysfs-writes
cpufreq: Synchronize the cpufreq store_*() routines with CPU hotplug
cpufreq: Invoke __cpufreq_remove_dev_finish() after releasing cpu_hotplug.lock
cpufreq: Split __cpufreq_remove_dev() into two parts
cpufreq: Fix wrong time unit conversion
cpufreq: serialize calls to __cpufreq_governor()
cpufreq: don't allow governor limits to be changed when it is disabled
ACPI / bind: Prefer device objects with _STA to those without it
ACPI / hotplug / PCI: Avoid parent bus rescans on spurious device checks
ACPI / hotplug / PCI: Use _OST to notify firmware about notify status
ACPI / hotplug / PCI: Avoid doing too much for spurious notifies
ACPICA: Fix for a Store->ArgX when ArgX contains a reference to a field.
ACPI / hotplug / PCI: Don't trim devices before scanning the namespace
...
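Item 6 above describes a crash caused by using a driver pointer that can be NULL. The following is only a user-space sketch of that kind of guard, not the actual patch: struct demo_driver, demo_get_driver() and demo_driver_ref() are hypothetical stand-ins, and the real cpuidle code differs in its types and locking.

```c
#include <stddef.h>

/* Hypothetical, simplified stand-in for a driver object with a refcount. */
struct demo_driver {
	int refcnt;
};

/* Hypothetical lookup: yields NULL when no driver has been registered. */
static struct demo_driver *demo_get_driver(struct demo_driver *registered)
{
	return registered;
}

/*
 * Take a reference only when a driver exists. Callers must be prepared
 * for a NULL return instead of dereferencing the result blindly.
 */
static struct demo_driver *demo_driver_ref(struct demo_driver *registered)
{
	struct demo_driver *drv = demo_get_driver(registered);

	if (drv)
		drv->refcnt++;
	return drv;
}
```

The point of the sketch is that the NULL case is handled before the pointer is touched, so a system without a registered driver takes the early path rather than faulting.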
12 Sep, 2013
7 commits
-
Until now we can't offline memory blocks which contain hugepages because a
hugepage is considered as an unmovable page. But now with this patch
series, a hugepage has become movable, so by using hugepage migration we
can offline such memory blocks.

What's different from other users of hugepage migration is that we need to
decompose all the hugepages inside the target memory block into free buddy
pages after hugepage migration, because otherwise free hugepages remaining
in the memory block interfere with memory offlining. For this reason we
introduce new functions dissolve_free_huge_page() and
dissolve_free_huge_pages().

Other than that, this patch straightforwardly adds hugepage migration
code, that is, it adds hugepage code to the functions which scan
over pfn and collect hugepages to be migrated, and adds a hugepage
allocation function to alloc_migrate_target().

As for larger hugepages (1GB for x86_64), it's not easy to do hotremove
over them because they are larger than a memory block. So we now simply
leave it to fail as it is.

[yongjun_wei@trendmicro.com.cn: remove duplicated include]
Signed-off-by: Naoya Horiguchi
Acked-by: Andi Kleen
Cc: Hillf Danton
Cc: Wanpeng Li
Cc: Mel Gorman
Cc: Hugh Dickins
Cc: KOSAKI Motohiro
Cc: Michal Hocko
Cc: Rik van Riel
Cc: "Aneesh Kumar K.V"
Signed-off-by: Wei Yongjun
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
lock_device_hotplug() serializes hotplug & online/offline operations. The
lock is held in common sysfs online/offline interfaces and ACPI hotplug
code paths.

And here are the code paths:
- CPU & Mem online/offline via sysfs online
  store_online()->lock_device_hotplug()

- Mem online via sysfs state:
  store_mem_state()->lock_device_hotplug()

- ACPI CPU & Mem hot-add:
  acpi_scan_bus_device_check()->lock_device_hotplug()

- ACPI CPU & Mem hot-delete:
  acpi_scan_hot_remove()->lock_device_hotplug()

try_offline_node() off-lines a node if all memory sections and cpus are
removed on the node. It is called from acpi_processor_remove() and
acpi_memory_remove_memory()->remove_memory() paths, both of which are in
the ACPI hotplug code.

try_offline_node() calls stop_machine() to stop all cpus while checking
all cpu status with the assumption that the caller is not protected from
CPU hotplug or CPU online/offline operations. However, the caller is
always serialized with lock_device_hotplug(). Also, the code needs to be
properly serialized with a lock, not by stopping all cpus at a random
place with stop_machine().

This patch removes the use of stop_machine() in try_offline_node() and
adds comments to try_offline_node() and remove_memory() that
lock_device_hotplug() is required.

Signed-off-by: Toshi Kani
Acked-by: Rafael J. Wysocki
Cc: KOSAKI Motohiro
Cc: KAMEZAWA Hiroyuki
Cc: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
add_memory() and remove_memory() can only handle a memory range aligned
with section. There are problems when an unaligned range is added and
then deleted as follows:

- add_memory() with an unaligned range succeeds, but __add_pages()
called from add_memory() adds a whole section of pages even though
a given memory range is less than the section size.
- remove_memory() to the added unaligned range hits BUG_ON() in
__remove_pages().

This patch changes add_memory() and remove_memory() to check at the
beginning whether a given memory range is section-aligned. As a result,
add_memory() fails with -EINVAL when a given range is unaligned, and does
not add such a memory range. This prevents remove_memory() from being
called with an unaligned range as well. Note that remove_memory() has to
use BUG_ON() since this function cannot fail.

[akpm@linux-foundation.org: avoid printk warnings]
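The alignment check described in this message can be sketched in user space roughly as follows. This is an illustration under stated assumptions, not the patch itself: demo_check_range() and the DEMO_SECTION_* constants are hypothetical names, and the 128 MiB section size is an assumed value, since the real section size depends on the architecture.

```c
#include <errno.h>
#include <stdint.h>

/* Assumed for illustration only; the real value is architecture-dependent. */
#define DEMO_SECTION_SIZE_BITS 27
#define DEMO_SECTION_SIZE (UINT64_C(1) << DEMO_SECTION_SIZE_BITS)

/*
 * Return 0 when both the start address and the size are multiples of the
 * section size, and -EINVAL otherwise, so an add path can refuse the range
 * before any pages are added.
 */
static int demo_check_range(uint64_t start, uint64_t size)
{
	uint64_t mask = DEMO_SECTION_SIZE - 1;

	if ((start & mask) || (size & mask))
		return -EINVAL;
	return 0;
}
```

In this sketch an add path would propagate the -EINVAL to its caller, whereas a remove path that cannot fail would have to treat a non-zero result as a fatal condition.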
Signed-off-by: Toshi Kani
Acked-by: KOSAKI Motohiro
Reviewed-by: Tang Chen
Reviewed-by: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use "zone_is_initialized()" instead of "if (zone->wait_table)".
Simplify the code, no functional change.

Signed-off-by: Xishi Qiu
Cc: Cody P Schafer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use "zone_is_empty()" instead of "if (zone->spanned_pages)".
Simplify the code, no functional change.

Signed-off-by: Xishi Qiu
Cc: Cody P Schafer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Use "zone_end_pfn()" instead of "zone->zone_start_pfn + zone->spanned_pages".
Simplify the code, no functional change.

[akpm@linux-foundation.org: fix build]
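The three helpers named in this commit and the two before it wrap small open-coded tests. A minimal user-space sketch of what such helpers compute is shown below. struct demo_zone and the demo_ prefixes are hypothetical; only the three fields mentioned in the messages are modelled, and the sketch assumes that "empty" means spanned_pages == 0, which makes that helper the negation of the open-coded "if (zone->spanned_pages)" test.

```c
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct zone with only the fields named above. */
struct demo_zone {
	unsigned long zone_start_pfn;
	unsigned long spanned_pages;
	void *wait_table;
};

/* First pfn past the end of the zone. */
static inline unsigned long demo_zone_end_pfn(const struct demo_zone *zone)
{
	return zone->zone_start_pfn + zone->spanned_pages;
}

/* A zone counts as initialized once its wait table has been set up. */
static inline bool demo_zone_is_initialized(const struct demo_zone *zone)
{
	return zone->wait_table != NULL;
}

/* Assumption of this sketch: a zone spanning no pages is empty. */
static inline bool demo_zone_is_empty(const struct demo_zone *zone)
{
	return zone->spanned_pages == 0;
}
```

Naming the tests this way keeps callers readable and puts each definition in a single place if the underlying fields change later.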
Signed-off-by: Xishi Qiu
Cc: Cody P Schafer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
I think we can remove "BUG_ON(start_pfn >= end_pfn)" in __offline_pages(),
because in memory_block_action() "nr_pages = PAGES_PER_SECTION * sections_per_block"
is always greater than 0.

memory_block_action()
  offline_pages()
    __offline_pages()
      BUG_ON(start_pfn >= end_pfn)

In v2.6.32, if info->length==0, this path may hit this BUG_ON().
acpi_memory_disable_device()
  remove_memory(info->start_addr, info->length)
    offline_pages()

A later Fujitsu patch renamed this function, and the BUG_ON() is now
unnecessary.

Signed-off-by: Xishi Qiu
Reviewed-by: Dave Hansen
Cc: Toshi Kani
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
31 Aug, 2013
1 commit
-
Since all of the memory hotplug operations have to be carried out
under device_hotplug_lock, they won't need to acquire pm_mutex if
device_hotplug_lock is held around hibernation.

For this reason, make the hibernation code acquire
device_hotplug_lock after freezing user space processes and
release it before thawing them. At the same time, drop the
lock_system_sleep() and unlock_system_sleep() calls from
lock_memory_hotplug() and unlock_memory_hotplug(), respectively.

Signed-off-by: Rafael J. Wysocki
Acked-by: Toshi Kani
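The ordering described in this message (freeze user space, take the hotplug lock, do the work, release the lock, thaw) can be sketched in user space with a single mutex shared by both paths. Everything below is hypothetical illustration code using POSIX threads, not kernel code: demo_hotplug_lock stands in for device_hotplug_lock, and the step log exists only so the ordering can be observed.

```c
#include <pthread.h>
#include <stddef.h>

enum demo_step { DEMO_FREEZE, DEMO_SNAPSHOT, DEMO_THAW, DEMO_HOTPLUG };

/* Stand-in for the one lock that both paths agree on. */
static pthread_mutex_t demo_hotplug_lock = PTHREAD_MUTEX_INITIALIZER;

/* Small log of the steps taken, for observation only. */
static enum demo_step demo_log[8];
static size_t demo_log_len;

static void demo_note(enum demo_step step)
{
	if (demo_log_len < sizeof(demo_log) / sizeof(demo_log[0]))
		demo_log[demo_log_len++] = step;
}

/* Hibernation path: the lock is held only between freeze and thaw. */
static void demo_hibernate(void)
{
	demo_note(DEMO_FREEZE);
	pthread_mutex_lock(&demo_hotplug_lock);
	demo_note(DEMO_SNAPSHOT);
	pthread_mutex_unlock(&demo_hotplug_lock);
	demo_note(DEMO_THAW);
}

/* Hotplug path: takes the same lock and no second one. */
static void demo_memory_hotplug(void)
{
	pthread_mutex_lock(&demo_hotplug_lock);
	demo_note(DEMO_HOTPLUG);
	pthread_mutex_unlock(&demo_hotplug_lock);
}
```

Because neither path in the sketch takes a second lock while holding the first, there is no pair of locks that two threads could acquire in opposite order, which is the condition an ABBA deadlock needs.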
10 Jul, 2013
2 commits
-
online_pages() is called from memory_block_action() when a user requests
to online a memory block via sysfs. This function needs to return a
proper error value in case of error.

Signed-off-by: Toshi Kani
Cc: Yasuaki Ishimatsu
Cc: Tang Chen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Signed-off-by: Tang Chen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
04 Jul, 2013
1 commit
-
Merge first patch-bomb from Andrew Morton:
- various misc bits
- I've been patchmonkeying ocfs2 for a while, as Joel and Mark have been
distracted. There has been quite a bit of activity.
- About half the MM queue
- Some backlight bits
- Various lib/ updates
- checkpatch updates
- zillions more little rtc patches
- ptrace
- signals
- exec
- procfs
- rapidio
- nbd
- aoe
- pps
- memstick
- tools/testing/selftests updates

* emailed patches from Andrew Morton: (445 commits)
tools/testing/selftests: don't assume the x bit is set on scripts
selftests: add .gitignore for kcmp
selftests: fix clean target in kcmp Makefile
selftests: add .gitignore for vm
selftests: add hugetlbfstest
self-test: fix make clean
selftests: exit 1 on failure
kernel/resource.c: remove the unneeded assignment in function __find_resource
aio: fix wrong comment in aio_complete()
drivers/w1/slaves/w1_ds2408.c: add magic sequence to disable P0 test mode
drivers/memstick/host/r592.c: convert to module_pci_driver
drivers/memstick/host/jmb38x_ms: convert to module_pci_driver
pps-gpio: add device-tree binding and support
drivers/pps/clients/pps-gpio.c: convert to module_platform_driver
drivers/pps/clients/pps-gpio.c: convert to devm_* helpers
drivers/parport/share.c: use kzalloc
Documentation/accounting/getdelays.c: avoid strncpy in accounting tool
aoe: update internal version number to v83
aoe: update copyright date
aoe: perform I/O completions in parallel
...