24 Feb, 2013
40 commits
-
When I use several fast SSDs for swap, swapper_space.tree_lock is
heavily contended. This patch gives each swap partition its own
address_space to reduce the lock contention. There is an array of
address_space for swap, and the swap entry type is the index into the array.

In my test with 3 SSDs, this increases the swapout throughput by 20%.
[akpm@linux-foundation.org: revert unneeded change to __add_to_swap_cache]
Signed-off-by: Shaohua Li
Cc: Hugh Dickins
Acked-by: Rik van Riel
Acked-by: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
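As a rough illustration of the per-partition address_space array described in the swap commit above, the following userspace sketch indexes an array by the swap entry type. Every name, the array size, and the bit split between type and offset are assumptions made for this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical userspace model; names and sizes are illustrative only. */
#define MODEL_MAX_SWAPFILES 4   /* assumed number of swap areas */
#define MODEL_TYPE_SHIFT    58  /* assumed split: high bits = type, low bits = offset */

typedef struct { unsigned long long val; } model_swp_entry;

struct model_address_space {
    int lock_acquisitions;      /* stands in for a per-space tree lock */
};

/* One address space per swap area instead of a single shared one. */
static struct model_address_space model_swapper_spaces[MODEL_MAX_SWAPFILES];

/* Pack a swap type and an offset into one entry value. */
static model_swp_entry model_swp_entry_make(unsigned type, unsigned long long offset)
{
    model_swp_entry e;

    assert(type < MODEL_MAX_SWAPFILES);
    e.val = ((unsigned long long)type << MODEL_TYPE_SHIFT) | offset;
    return e;
}

/* Extract the swap type from an entry. */
static unsigned model_swp_type(model_swp_entry e)
{
    return (unsigned)(e.val >> MODEL_TYPE_SHIFT);
}

/* The swap entry type is the index into the array of address spaces. */
static struct model_address_space *model_swap_address_space(model_swp_entry e)
{
    return &model_swapper_spaces[model_swp_type(e)];
}
```

In this model, entries on different swap areas resolve to different structures, so a lock embedded in each one would only be contended by traffic to the same area.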
According to akpm, this saves 1/2k text and makes things simple for the
next patch.

Numbers from Minchan:
add/remove: 1/0 grow/shrink: 6/22 up/down: 92/-516 (-424)
function old new delta
page_mapping - 48 +48
do_task_stat 2292 2308 +16
page_remove_rmap 240 248 +8
load_elf_binary 4500 4508 +8
update_queue 532 536 +4
scsi_probe_and_add_lun 2892 2896 +4
lookup_fast 644 648 +4
vcs_read 1040 1036 -4
__ip_route_output_key 1904 1900 -4
ip_route_input_noref 2508 2500 -8
shmem_file_aio_read 784 772 -12
__isolate_lru_page 272 256 -16
shmem_replace_page 708 688 -20
mark_buffer_dirty 228 208 -20
__set_page_dirty_buffers 240 220 -20
__remove_mapping 276 256 -20
update_mmu_cache 500 476 -24
set_page_dirty_balance 92 68 -24
set_page_dirty 172 148 -24
page_evictable 88 64 -24
page_cache_pipe_buf_steal 248 224 -24
clear_page_dirty_for_io 340 316 -24
test_set_page_writeback 400 372 -28
test_clear_page_writeback 516 488 -28
invalidate_inode_page 156 128 -28
page_mkclean 432 400 -32
flush_dcache_page 360 328 -32
__set_page_dirty_nobuffers 324 280 -44
shrink_page_list 2412 2356 -56

Signed-off-by: Shaohua Li
Suggested-by: Andrew Morton
Cc: Hugh Dickins
Acked-by: Rik van Riel
Cc: Minchan Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When correcting commit 04fa5d6a6547 ("mm: migrate: check page_count of
THP before migrating") Hugh Dickins noted that the control flow for
transhuge migration was difficult to follow. Unconditionally calling
put_page() in numamigrate_isolate_page() made the failure paths of both
migrate_misplaced_transhuge_page() and migrate_misplaced_page() more
complex than they should be. Further, he was extremely wary that an
unlock_page() should ever happen after a put_page() even if the
put_page() should never be the final put_page.

Hugh implemented the following cleanup to simplify the path by calling
putback_lru_page() inside numamigrate_isolate_page() if it failed to
isolate and always calling unlock_page() within
migrate_misplaced_transhuge_page().

There is no functional change after this patch is applied but the code
is easier to follow and unlock_page() always happens before put_page().

[mgorman@suse.de: changelog only]
Signed-off-by: Mel Gorman
Signed-off-by: Hugh Dickins
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Simon Jeons
Cc: Wanpeng Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
page->_last_nid fits into page->flags on 64-bit. The unlikely 32-bit
NUMA configuration with NUMA Balancing will still need an extra page
field. As Peter notes "Completely dropping 32bit support for
CONFIG_NUMA_BALANCING would simplify things, but it would also remove
the warning if we grow enough 64bit only page-flags to push the last-cpu
out."

[mgorman@suse.de: minor modifications]
Signed-off-by: Mel Gorman
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Simon Jeons
Cc: Wanpeng Li
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
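To make the idea of storing page->_last_nid inside page->flags (the commit above) more concrete, here is a small userspace sketch of packing a node id into a bit field of a 64-bit flags word. The field width, the shift, and all names are assumptions chosen for the example; the real layout depends on the kernel configuration.

```c
#include <assert.h>

/* Hypothetical layout: a 6-bit node id stored at bit 40 of a 64-bit word. */
#define MODEL_LAST_NID_WIDTH 6
#define MODEL_LAST_NID_SHIFT 40
#define MODEL_LAST_NID_MASK  ((1ULL << MODEL_LAST_NID_WIDTH) - 1)

struct model_page {
    unsigned long long flags;   /* ordinary flag bits share this word */
};

/* Read the node id back out of the flags word. */
static int model_page_last_nid(const struct model_page *p)
{
    return (int)((p->flags >> MODEL_LAST_NID_SHIFT) & MODEL_LAST_NID_MASK);
}

/* Overwrite only the node id field and leave every other bit untouched. */
static void model_page_set_last_nid(struct model_page *p, int nid)
{
    assert(nid >= 0 && (unsigned long long)nid <= MODEL_LAST_NID_MASK);
    p->flags &= ~(MODEL_LAST_NID_MASK << MODEL_LAST_NID_SHIFT);
    p->flags |= ((unsigned long long)nid & MODEL_LAST_NID_MASK) << MODEL_LAST_NID_SHIFT;
}
```

The sketch also hints at why a 32-bit configuration may still need a separate field: with fewer spare bits in the word there might be no room left for the node id.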
This is a preparation patch for moving page->_last_nid into page->flags
that moves page flag layout information to a separate header. This
patch is necessary because otherwise there would be a circular
dependency between mm_types.h and mm.h.

Signed-off-by: Mel Gorman
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Simon Jeons
Cc: Wanpeng Li
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The current definition of count_vm_numa_events() is wrong for
!CONFIG_NUMA_BALANCING as the following would miss the side-effect:

  count_vm_numa_events(NUMA_FOO, bar++);

There are no such users of count_vm_numa_events() but this patch fixes
it as it is a potential pitfall. Ideally both would be converted to
static inline but NUMA_PTE_UPDATES is not defined if
!CONFIG_NUMA_BALANCING and creating dummy constants just to have a
static inline would be similarly clumsy.

Signed-off-by: Mel Gorman
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Simon Jeons
Cc: Wanpeng Li
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
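The count_vm_numa_events() pitfall from the commit above can be reproduced in plain C. The two macros below are hypothetical stand-ins written for this sketch (they are not copied from the kernel headers): an empty stub drops its argument expression entirely, while a stub that casts the argument to void still evaluates it once.

```c
#include <assert.h>

/* Broken stub: the argument expression vanishes, so its side effects are lost. */
#define model_count_events_broken(item, delta) do { } while (0)

/* Fixed stub: still does no counting, but evaluates delta exactly once. */
#define model_count_events_fixed(item, delta) do { (void)(delta); } while (0)

/* Returns the value of bar after passing bar++ to the broken stub. */
static int model_side_effect_demo_broken(void)
{
    int bar = 0;

    model_count_events_broken(0, bar++);
    return bar;
}

/* Returns the value of bar after passing bar++ to the fixed stub. */
static int model_side_effect_demo_fixed(void)
{
    int bar = 0;

    model_count_events_fixed(0, bar++);
    return bar;
}
```

With the broken stub the increment never happens, so code that behaves correctly with the feature enabled silently changes behaviour when it is compiled out.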
Wanpeng Li pointed out that numamigrate_isolate_page() assumes that only
one base page is being migrated when in fact it can also be checking
THP.

The consequences are that a migration will be attempted when a target
node is nearly full and fail later. It's unlikely to be user-visible
but it should be fixed. While we are there, migrate_balanced_pgdat()
should treat nr_migrate_pages as an unsigned long as it is treated as a
watermark.

Signed-off-by: Mel Gorman
Suggested-by: Wanpeng Li
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Simon Jeons
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
s/me/be/ and clarify the comment a bit when we're changing it anyway.
Signed-off-by: Mel Gorman
Suggested-by: Simon Jeons
Cc: Peter Zijlstra
Cc: Andrea Arcangeli
Cc: Ingo Molnar
Cc: Wanpeng Li
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Signed-off-by: Kirill A. Shutemov
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If one storage interface or usb network interface (iSCSI case) exists in
the current configuration, memory allocation with GFP_KERNEL during
usb_device_reset() might trigger I/O transfer on the storage interface
itself and cause a deadlock, because 'us->dev_mutex' is held in
.pre_reset() and the storage interface can't do I/O transfer when the
reset is triggered by another interface, or the error handling can't be
completed if the reset is triggered by the storage itself (error
handling path).

Signed-off-by: Ming Lei
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Cc: Jens Axboe
Cc: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Reviewed-by: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Apply the introduced memalloc_noio_save() and memalloc_noio_restore() to
force memory allocation with no I/O during runtime_resume/runtime_suspend
callbacks on devices with the 'memalloc_noio' flag set.

Signed-off-by: Ming Lei
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Cc: Jens Axboe
Cc: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Cc: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Deadlock might be caused by allocating memory with GFP_KERNEL in the
runtime_resume and runtime_suspend callbacks of network devices in the iSCSI
situation, so mark network devices and their ancestors as 'memalloc_noio'
with the introduced pm_runtime_set_memalloc_noio().

Signed-off-by: Ming Lei
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Cc: Jens Axboe
Cc: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Cc: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Apply the introduced pm_runtime_set_memalloc_noio() to block devices so
that the PM core will teach mm not to allocate memory with GFP_IOFS when
calling the runtime_resume and runtime_suspend callbacks for block
devices and their ancestors.

Signed-off-by: Ming Lei
Cc: Jens Axboe
Cc: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Cc: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce the flag memalloc_noio in 'struct dev_pm_info' to help the PM core
teach mm not to allocate memory with the GFP_KERNEL flag, avoiding a
probable deadlock.

As explained in the comment, any GFP_KERNEL allocation inside
runtime_resume() or runtime_suspend() on any device in the path
from a block or network device to the root device in the device tree
may cause a deadlock. The introduced pm_runtime_set_memalloc_noio() sets
or clears the flag on the devices in the path recursively.

Signed-off-by: Ming Lei
Cc: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Cc: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Cc: Jens Axboe
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
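The path marking described for pm_runtime_set_memalloc_noio() in the commit above might be pictured with the simplified userspace model below. The structure, the function names, and in particular the rule used when clearing (stop at the first ancestor that still has a flagged child) are assumptions of this sketch and are not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical device node: only a parent link and the flag are modelled. */
struct model_device {
    struct model_device *parent;
    bool memalloc_noio;
};

/* True if any direct child of dev within devs[0..n) still has the flag set. */
static bool model_child_has_noio(const struct model_device *dev,
                                 const struct model_device *devs, size_t n)
{
    for (size_t i = 0; i < n; i++)
        if (devs[i].parent == dev && devs[i].memalloc_noio)
            return true;
    return false;
}

/*
 * Set or clear the flag on dev and on the devices between dev and the root.
 * When clearing, stop at the first ancestor that another flagged child
 * still depends on (an assumption of this model).
 */
static void model_set_memalloc_noio(struct model_device *dev, bool enable,
                                    const struct model_device *devs, size_t n)
{
    dev->memalloc_noio = enable;
    for (struct model_device *p = dev->parent; p != NULL; p = p->parent) {
        if (!enable && model_child_has_noio(p, devs, n))
            break;
        p->memalloc_noio = enable;
    }
}
```

A caller would mark a block or network device when it is registered and unmark it when it goes away, so that every device on the path to the root knows whether I/O-triggering allocations are safe in its runtime PM callbacks.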
This patch introduces PF_MEMALLOC_NOIO as a process flag (in the 'flags'
field of 'struct task_struct'), so that the flag can be set by one task to
avoid doing I/O inside memory allocation in the task's context.

The patch tries to solve one deadlock problem caused by block devices, and
the problem may happen at least in the situations below:

- during block device runtime resume, if memory allocation with
GFP_KERNEL is called inside the runtime resume callback of any one of its
ancestors (or the block device itself), the deadlock may be triggered
inside the memory allocation since it might not complete until the block
device becomes active and the involved page I/O finishes. The situation
was first pointed out by Alan Stern. It is not a good approach to
convert all GFP_KERNEL[1] in the path into GFP_NOIO because several
subsystems may be involved (for example, PCI, USB and SCSI may be
involved for a usb mass storage device, and network devices are involved
too in the iSCSI case)

- during block device runtime suspend, because runtime resume needs to
wait for completion of concurrent runtime suspend.

- during error handling of a usb mass storage device, USB bus reset will
be put on the device, so there shouldn't be any memory allocation with
GFP_KERNEL during USB bus reset, otherwise a deadlock similar to the one
above may be triggered. Unfortunately, any usb device may include one
mass storage interface in theory, so it requires all usb interface
drivers to handle the situation. In fact, most usb drivers don't know
how to handle bus reset on the device and don't provide .pre_reset() and
.post_reset() callbacks at all, so USB core has to unbind and bind the
driver for these devices. So it is still not practical to resort to
GFP_NOIO for solving the problem.

Also the introduced solution can be used by the block subsystem or block
drivers too, for example, to set the PF_MEMALLOC_NOIO flag before doing
actual I/O transfer.

It is not a good idea to convert all these GFP_KERNEL in the affected
path into GFP_NOIO because the functions doing that may be implemented
as library code and will be called in many other contexts.

In fact, memalloc_noio_flags() can convert some of the current static
GFP_NOIO allocations back into GFP_KERNEL in other non-affected contexts;
at least almost all GFP_NOIO in the USB subsystem can be converted into
GFP_KERNEL after applying the approach, making allocation with GFP_NOIO
happen only in runtime resume/bus reset/block I/O transfer contexts
generally.

[1], several GFP_KERNEL allocation examples in runtime resume path
- pci subsystem
acpi_os_allocate
Signed-off-by: Minchan Kim
Cc: Alan Stern
Cc: Oliver Neukum
Cc: Jiri Kosina
Cc: Mel Gorman
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Ingo Molnar
Cc: Peter Zijlstra
Cc: "Rafael J. Wysocki"
Cc: Greg KH
Cc: Jens Axboe
Cc: "David S. Miller"
Cc: Eric Dumazet
Cc: David Decotigny
Cc: Tom Herbert
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
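The save/restore pattern behind PF_MEMALLOC_NOIO (commit above) can be modelled in userspace roughly as follows. The bit values, the task structure, and the choice to mask only a single I/O bit are assumptions of this sketch; only the function names echo the memalloc_noio_save(), memalloc_noio_restore() and memalloc_noio_flags() helpers mentioned in the log.

```c
#include <assert.h>

/* Hypothetical bit values; the real constants live in kernel headers. */
#define MODEL_PF_MEMALLOC_NOIO 0x1u
#define MODEL_GFP_IO           0x1u
#define MODEL_GFP_FS           0x2u
#define MODEL_GFP_WAIT         0x4u
#define MODEL_GFP_KERNEL       (MODEL_GFP_WAIT | MODEL_GFP_IO | MODEL_GFP_FS)

struct model_task {
    unsigned flags;             /* stands in for task_struct->flags */
};

/* Enter a no-I/O section; returns the previous state of the flag. */
static unsigned model_noio_save(struct model_task *t)
{
    unsigned old = t->flags & MODEL_PF_MEMALLOC_NOIO;

    t->flags |= MODEL_PF_MEMALLOC_NOIO;
    return old;
}

/* Leave a no-I/O section by restoring the saved state, so nesting works. */
static void model_noio_restore(struct model_task *t, unsigned old)
{
    t->flags = (t->flags & ~MODEL_PF_MEMALLOC_NOIO) | old;
}

/* Applied by the allocator: strip the I/O bit while the task flag is set. */
static unsigned model_noio_flags(const struct model_task *t, unsigned gfp)
{
    if (t->flags & MODEL_PF_MEMALLOC_NOIO)
        gfp &= ~MODEL_GFP_IO;
    return gfp;
}
```

Because the restriction travels with the task rather than with each call site, library code can keep requesting GFP_KERNEL and still be safe when called from a runtime resume or bus reset path.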
From: Zlatko Calusic
Commit 92df3a723f84 ("mm: vmscan: throttle reclaim if encountering too
many dirty pages under writeback") introduced waiting on congested zones
based on a sane algorithm in shrink_inactive_list().

What this means is that there's no more need for throttling and
additional heuristics in balance_pgdat(). So, let's remove it and tidy
up the code.

Signed-off-by: Zlatko Calusic
Cc: Mel Gorman
Cc: Minchan Kim
Cc: Johannes Weiner
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
num_poisoned_pages counts up the number of pages isolated by memory
errors. But for thp, only one subpage is isolated because memory error
handler splits it, so it's wrong to add (1 << compound_trans_order).

[akpm@linux-foundation.org: tweak comment]
Signed-off-by: Naoya Horiguchi
Cc: Andi Kleen
Cc: Tony Luck
Cc: Wu Fengguang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Currently soft_offline_page() is hard to maintain because it has many
return points and goto statements. All of this mess comes from
get_any_page().

This function should only get the page refcount as the name implies, but it
also does some page isolating actions like SetPageHWPoison() and dequeuing
hugepage. This patch corrects it and introduces some internal
subroutines to make soft offlining code more readable and maintainable.

Signed-off-by: Naoya Horiguchi
Reviewed-by: Andi Kleen
Cc: Tony Luck
Cc: Wu Fengguang
Cc: Xishi Qiu
Cc: Jiang Liu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Since MCE is an x86 concept, and this code is in mm/, it would be better
to use the name num_poisoned_pages instead of mce_bad_pages.

[akpm@linux-foundation.org: fix mm/sparse.c]
Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Suggested-by: Borislav Petkov
Reviewed-by: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There are too many return points randomly intermingled with some "goto
done" return points. So restructure the function to have two exits, one
for the success path and the other for the failure path. Use
atomic_long_inc instead of atomic_long_add.

Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Suggested-by: Andrew Morton
Cc: Borislav Petkov
Cc: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When doing

  $ echo paddr > /sys/devices/system/memory/soft_offline_page

to offline a *free* page, the value of mce_bad_pages is incremented and
the page has its HWPoison flag set, but it is still managed by the page
buddy allocator.

  $ cat /proc/meminfo | grep HardwareCorrupted

shows the value.

If we offline the same page again, the value of mce_bad_pages is
incremented *again*, which means the value is now incorrect. Assume the
page is still free during this short time.

soft_offline_page()
get_any_page()
"else if (is_free_buddy_page(p))" branch return 0
"goto done";
"atomic_long_add(1, &mce_bad_pages);"

This patch:

Move the poisoned page check to the beginning of the function in order to
fix the error.

Signed-off-by: Xishi Qiu
Signed-off-by: Jiang Liu
Tested-by: Naoya Horiguchi
Cc: Borislav Petkov
Cc: Wanpeng Li
Cc: Andi Kleen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
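The double-counting problem in the soft offline commit above comes down to checking the poison flag before touching the counter. A minimal userspace model of that ordering is shown here; the structure, the counter name, and the -1 return value are all hypothetical and chosen only for the example.

```c
#include <assert.h>
#include <stdbool.h>

struct model_page {
    bool hwpoison;              /* stands in for the HWPoison page flag */
};

static long model_bad_pages;    /* stands in for the poisoned page counter */

/*
 * Soft-offline a page in the model. The poison check comes first, so a
 * page that was already offlined is never counted a second time.
 * Returns 0 on success and -1 if the page was already poisoned.
 */
static int model_soft_offline_page(struct model_page *p)
{
    if (p->hwpoison)
        return -1;

    p->hwpoison = true;
    model_bad_pages++;
    return 0;
}
```

Calling the function twice on the same page leaves the counter at one, which is the behaviour the patch description asks for.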
Several functions test MIGRATE_ISOLATE and some of those are hotpath but
MIGRATE_ISOLATE is used only if we enable CONFIG_MEMORY_ISOLATION (i.e.,
CMA, memory-hotplug and memory-failure), which are not common config
options. So let's not add unnecessary overhead and code when we don't
enable CONFIG_MEMORY_ISOLATION.

Signed-off-by: Minchan Kim
Cc: KOSAKI Motohiro
Acked-by: Michal Nazarewicz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
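One common way to avoid the hot-path cost mentioned in the MIGRATE_ISOLATE commit above is to hide the test behind a helper that collapses to a constant when the option is off. The sketch below uses a made-up MODEL_CONFIG_MEMORY_ISOLATION macro and made-up enum and function names; it shows the general technique and is not the patch itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Build with -DMODEL_CONFIG_MEMORY_ISOLATION to compile the real check in. */
enum model_migratetype {
    MODEL_MIGRATE_UNMOVABLE,
    MODEL_MIGRATE_MOVABLE,
#ifdef MODEL_CONFIG_MEMORY_ISOLATION
    MODEL_MIGRATE_ISOLATE,      /* only exists when isolation is configured */
#endif
    MODEL_MIGRATE_TYPES
};

#ifdef MODEL_CONFIG_MEMORY_ISOLATION
/* Isolation configured: perform the comparison. */
static inline bool model_is_migrate_isolate(int migratetype)
{
    return migratetype == MODEL_MIGRATE_ISOLATE;
}
#else
/* Isolation not configured: constant false, so callers' branches fold away. */
static inline bool model_is_migrate_isolate(int migratetype)
{
    (void)migratetype;
    return false;
}
#endif
```

Callers write `if (model_is_migrate_isolate(mt))` unconditionally, and a compiler can drop the dead branch in builds without the option.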
Function put_page_bootmem() is used to free pages allocated by bootmem
allocator, so it should increase totalram_pages when freeing pages into
the buddy system.

Signed-off-by: Jiang Liu
Cc: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Maciej Rutecki
Cc: Chris Clayton
Cc: "Rafael J . Wysocki"
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Jianguo Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now all users of "number of pages managed by the buddy system" have been
converted to use zone->managed_pages, so set zone->present_pages to what
it should be:

  present_pages = spanned_pages - absent_pages;

Signed-off-by: Jiang Liu
Cc: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Maciej Rutecki
Cc: Chris Clayton
Cc: "Rafael J . Wysocki"
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Jianguo Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now we have zone->managed_pages for "pages managed by the buddy system
in the zone", so replace zone->present_pages with zone->managed_pages if
what the user really wants is number of allocatable pages.

Signed-off-by: Jiang Liu
Cc: Wen Congyang
Cc: David Rientjes
Cc: Jiang Liu
Cc: Maciej Rutecki
Cc: Chris Clayton
Cc: "Rafael J . Wysocki"
Cc: Mel Gorman
Cc: Minchan Kim
Cc: KAMEZAWA Hiroyuki
Cc: Michal Hocko
Cc: Jianguo Wu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
…emblock_overlaps_region().
The definition of struct movablecore_map is protected by
CONFIG_HAVE_MEMBLOCK_NODE_MAP but its use in memblock_overlaps_region()
is not. So add CONFIG_HAVE_MEMBLOCK_NODE_MAP to protect the use of
movablecore_map in memblock_overlaps_region().

Signed-off-by: Tang Chen <tangchen@cn.fujitsu.com>
Reported-by: Stephen Rothwell <sfr@canb.auug.org.au>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> -
We now provide an option for users who don't want to specify physical
memory addresses on the kernel command line.

/*
* For movablemem_map=acpi:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* hotpluggable: n y y n
* movablemem_map: |_____| |_________|
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*/

So the user just specifies movablemem_map=acpi, and the kernel will use
hotpluggable info in SRAT to determine which memory ranges should be set
as ZONE_MOVABLE.

If all the memory ranges in SRAT are hotpluggable, then no memory can be
used by the kernel. But before parsing SRAT, memblock has already reserved
some memory ranges for other purposes, such as for the kernel image, and so
on. We cannot prevent the kernel from using this memory. So we need to
exclude these ranges even if this memory is hotpluggable.

Furthermore, there could be several memory ranges in the single node
which the kernel resides in. We may skip one range that has memory
reserved by memblock, but if the rest of the memory is too small, then the
kernel will fail to boot. So, make the whole node which the kernel
resides in un-hotpluggable. Then the kernel has enough memory to use.

NOTE: Using this option will reduce NUMA performance because the
whole node will be set as ZONE_MOVABLE, and the kernel cannot use memory
on it. If users don't want to lose NUMA performance, just don't use
it.

[akpm@linux-foundation.org: fix warning]
[akpm@linux-foundation.org: use strcmp()]
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Len Brown
Cc: "Brown, Len"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When implementing the movablemem_map boot option, we introduced an array
movablemem_map.map[] to store the memory ranges to be set as
ZONE_MOVABLE.

Since ZONE_MOVABLE is the last zone of a node, if the user didn't specify
the whole node memory range, we need to extend it to the node end so
that we can use it to prevent memblock from allocating memory in the
ranges the user didn't specify.

We now implement the movablemem_map boot option like this:
/*
* For movablemem_map=nn[KMG]@ss[KMG]:
*
* SRAT: |_____| |_____| |_________| |_________| ......
* node id: 0 1 1 2
* user specified: |__| |___|
* movablemem_map: |___| |_________| |______| ......
*
* Using movablemem_map, we can prevent memblock from allocating memory
* on ZONE_MOVABLE at boot time.
*
* NOTE: In this case, SRAT info will be ignored.
*/

[akpm@linux-foundation.org: clean up code, fix build warning]
Signed-off-by: Tang Chen
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Len Brown
Cc: "Brown, Len"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
On Linux, pages used by the kernel cannot be migrated. As a result,
if a memory range is used by the kernel, it cannot be hot-removed. So if we
want to hot-remove memory, we should prevent the kernel from using it.

The way now used to prevent this is to specify a memory range by the
movablemem_map boot option and set it as ZONE_MOVABLE.

But when the system is booting, memblock will allocate memory, and
reserve the memory for the kernel. And before we parse SRAT, and know the
node memory ranges, memblock is working. And it may allocate memory in
ranges to be set as ZONE_MOVABLE. This memory can be used by the kernel,
and never be freed.

So, let's parse SRAT before memblock is called first. And it is early
enough.

The first call of memblock_find_in_range_node() is in:

setup_arch()
|-->setup_real_mode()

so, this patch adds a function early_parse_srat() to parse SRAT, and calls
it before setup_real_mode() is called.

NOTE:

1) early_parse_srat() is called before numa_init(), and has initialized
numa_meminfo. So DO NOT clear numa_nodes_parsed in numa_init() and DO
NOT zero numa_meminfo in numa_init(), otherwise we will lose memory
numa info.

2) I don't know why the count of memory affinities parsed from SRAT is
used as the return value in the original acpi_numa_init(). So I add a static
variable srat_mem_cnt to remember this count and use it as the return
value of the new acpi_numa_init()

[mhocko@suse.cz: parse SRAT before memblock is ready fix]
Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Cc: KOSAKI Motohiro
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Wu Jianguo
Cc: Yasuaki Ishimatsu
Cc: Ingo Molnar
Cc: Thomas Gleixner
Cc: "H. Peter Anvin"
Cc: Len Brown
Cc: "Brown, Len"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Ensure the bootmem will not allocate memory from areas that may be
ZONE_MOVABLE. The map info is from movablecore_map boot option.

Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
Cc: Wu Jianguo
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
If kernelcore or movablecore is specified at the same time with
movablemem_map, movablemem_map will have higher priority to be
satisfied. This patch will make find_zone_movable_pfns_for_nodes()
calculate zone_movable_pfn[] with the limit from zone_movable_limit[].

Signed-off-by: Tang Chen
Reviewed-by: Wen Congyang
Cc: Wu Jianguo
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Introduce a new array zone_movable_limit[] to store the ZONE_MOVABLE
limit from the movablemem_map boot option for all nodes. The function
sanitize_zone_movable_limit() will find out to which node the ranges in
movable_map.map[] belong, and calculate the low boundary of
ZONE_MOVABLE for each node.

Signed-off-by: Tang Chen
Signed-off-by: Liu Jiang
Reviewed-by: Wen Congyang
Cc: Wu Jianguo
Reviewed-by: Lai Jiangshan
Tested-by: Lin Feng
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Add functions to parse the movablemem_map boot option. Since the option
could be specified more than once, all the maps will be stored in the
global variable movablemem_map.map array.

Also, we keep the array in monotonically increasing order by start_pfn,
and merge all overlapping ranges.

[akpm@linux-foundation.org: improve comment]
[akpm@linux-foundation.org: checkpatch fixes]
[akpm@linux-foundation.org: remove unneeded parens]
Signed-off-by: Tang Chen
Signed-off-by: Lai Jiangshan
Reviewed-by: Wen Congyang
Tested-by: Lin Feng
Cc: Wu Jianguo
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
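The bookkeeping described in the movablemem_map parsing commit above (an array kept sorted by start_pfn with overlapping ranges merged) could look roughly like the following userspace sketch. The capacity, the half-open range convention, the decision to also merge ranges that merely touch, and every name are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_MAP_MAX 8         /* assumed capacity of the map array */

struct model_range {
    unsigned long start_pfn;    /* inclusive */
    unsigned long end_pfn;      /* exclusive */
};

struct model_map {
    size_t nr;
    struct model_range map[MODEL_MAP_MAX];
};

/*
 * Insert [start, end) so that the array stays sorted by start_pfn and
 * contains no overlapping (or touching) ranges.
 * Returns 0 on success and -1 if the array is full.
 */
static int model_map_insert(struct model_map *m, unsigned long start,
                            unsigned long end)
{
    size_t pos, last, i;

    if (start >= end)
        return 0;               /* an empty range changes nothing */

    /* Find the first existing range that ends at or after the new start. */
    for (pos = 0; pos < m->nr; pos++)
        if (start <= m->map[pos].end_pfn)
            break;

    /* No overlap with map[pos]: shift the tail and insert a new slot. */
    if (pos == m->nr || end < m->map[pos].start_pfn) {
        if (m->nr == MODEL_MAP_MAX)
            return -1;
        for (i = m->nr; i > pos; i--)
            m->map[i] = m->map[i - 1];
        m->map[pos].start_pfn = start;
        m->map[pos].end_pfn = end;
        m->nr++;
        return 0;
    }

    /* Overlap: grow the new range over every range it reaches. */
    if (m->map[pos].start_pfn < start)
        start = m->map[pos].start_pfn;
    for (last = pos; last < m->nr && m->map[last].start_pfn <= end; last++)
        if (m->map[last].end_pfn > end)
            end = m->map[last].end_pfn;

    /* Ranges pos..last-1 collapse into one; close the resulting gap. */
    m->map[pos].start_pfn = start;
    m->map[pos].end_pfn = end;
    for (i = last; i < m->nr; i++)
        m->map[pos + 1 + (i - last)] = m->map[i];
    m->nr -= last - pos - 1;
    return 0;
}
```

Keeping the array sorted and merged at insertion time means later passes, such as working out a per-node boundary, can walk it once without re-checking for overlaps.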
During the implementation of SRAT support, we met a problem. In
setup_arch(), we have the following call series:

1) memblock is ready;
2) some functions use memblock to allocate memory;
3) parse ACPI tables, such as SRAT.

Before 3), we don't know which memory is hotpluggable, and as a result,
we cannot prevent memblock from allocating hotpluggable memory. So, in
2), there could be some hotpluggable memory allocated by memblock.

Now, we are trying to parse SRAT earlier, before memblock is ready. But
I think we need more investigation on this topic. So in this v5, I
dropped all the SRAT support, and v5 is just the same as v3, and it is
based on 3.8-rc3.

As we planned, we will support getting info from SRAT without users'
participation at last. And we will post another patch-set to do so.

And also, I think for now, we can add this boot option as the first step
of supporting movable node. Since Linux cannot migrate the direct
mapped pages, the only way for now is to limit the whole node containing
only movable memory.

Using SRAT is one way. But even if we can use SRAT, users still need an
interface to enable/disable this functionality if they don't want to
lose their NUMA performance. So I think, a user interface is always
needed.

For now, users can disable this functionality by not specifying the boot
option. Later, we will post SRAT support, and add another option value
"movablecore_map=acpi" to use SRAT.

This patch:

If the system creates a movable node, in which all memory of the node is
allocated as ZONE_MOVABLE, setup_node_data() cannot allocate memory for
the node's pg_data_t. So, use memblock_alloc_try_nid() instead of
memblock_alloc_nid() to retry when the first allocation fails.

Signed-off-by: Yasuaki Ishimatsu
Signed-off-by: Lai Jiangshan
Signed-off-by: Tang Chen
Signed-off-by: Jiang Liu
Cc: Wu Jianguo
Cc: Wen Congyang
Cc: Mel Gorman
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Thomas Gleixner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
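The retry behaviour that the commit above relies on (try the requested node first, then fall back to any node) can be pictured with this toy allocator. It models only the fallback policy; the per-node byte counters, the node count, and the function names are invented for the sketch and do not describe how memblock works internally.

```c
#include <assert.h>

#define MODEL_NODES    2        /* assumed number of nodes */
#define MODEL_ANY_NODE (-1)     /* hypothetical "no preference" marker */

/* Bytes the early allocator may hand out per node; 0 models a movable node. */
static unsigned long model_free[MODEL_NODES];

/* Allocate from one node, or from the first node that fits for MODEL_ANY_NODE.
 * Returns the node used, or -1 on failure. */
static int model_alloc_nid(unsigned long size, int nid)
{
    if (nid == MODEL_ANY_NODE) {
        for (int n = 0; n < MODEL_NODES; n++) {
            if (model_free[n] >= size) {
                model_free[n] -= size;
                return n;
            }
        }
        return -1;
    }

    assert(nid >= 0 && nid < MODEL_NODES);
    if (model_free[nid] < size)
        return -1;
    model_free[nid] -= size;
    return nid;
}

/* Prefer the requested node, but retry on any node if that fails. */
static int model_alloc_try_nid(unsigned long size, int nid)
{
    int got = model_alloc_nid(size, nid);

    if (got >= 0)
        return got;
    return model_alloc_nid(size, MODEL_ANY_NODE);
}
```

In the model, a node whose memory is entirely reserved for ZONE_MOVABLE has nothing to offer, so its per-node data ends up on another node instead of the allocation failing.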
If a cpu is offline, its nid will be set to -1, and cpu_to_node(cpu)
will return -1. As a result, cpumask_of_node(nid) will return NULL. In
this case, find_next_bit() in for_each_cpu will get a NULL pointer and
cause a panic.

Here is a call trace:

Call Trace:
select_fallback_rq+0x71/0x190
try_to_wake_up+0x2cb/0x2f0
wake_up_process+0x15/0x20
hrtimer_wakeup+0x22/0x30
__run_hrtimer+0x83/0x320
hrtimer_interrupt+0x106/0x280
smp_apic_timer_interrupt+0x69/0x99
apic_timer_interrupt+0x6f/0x80

There is a hrtimer process sleeping, whose cpu has already been
offlined. When it is woken up, it tries to find another cpu to run on, and
gets a -1 nid. As a result, cpumask_of_node(-1) returns NULL, and causes a
kernel panic.

This patch fixes this problem by checking whether the nid is -1. If nid is
not -1, a cpu on the same node will be picked. Otherwise, an online cpu on
another node will be picked.

Signed-off-by: Tang Chen
Signed-off-by: Wen Congyang
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
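The guard described in the commit above (only consult the node's CPU set when the nid is not -1) is sketched below as a userspace model. The arrays, the CPU count, and the function name are hypothetical; the real scheduler code works with cpumasks and task affinity, which this example leaves out.

```c
#include <assert.h>
#include <stddef.h>

#define MODEL_NR_CPUS 4         /* assumed number of CPUs */
#define MODEL_NO_NODE (-1)      /* nid of a CPU whose mapping was cleared */

static int model_cpu_node[MODEL_NR_CPUS];    /* CPU to node mapping */
static int model_cpu_online[MODEL_NR_CPUS];  /* nonzero if the CPU is online */

/*
 * Pick a CPU for a task that last ran on `cpu`.
 * Same-node CPUs are considered only when the nid is valid; otherwise the
 * search goes straight to any online CPU. Returns -1 if nothing is online.
 */
static int model_select_fallback_cpu(int cpu)
{
    int nid = model_cpu_node[cpu];
    int dest;

    if (nid != MODEL_NO_NODE) {
        for (dest = 0; dest < MODEL_NR_CPUS; dest++)
            if (model_cpu_online[dest] && model_cpu_node[dest] == nid)
                return dest;
    }

    for (dest = 0; dest < MODEL_NR_CPUS; dest++)
        if (model_cpu_online[dest])
            return dest;

    return -1;
}
```

Without the nid check, the same-node loop would be driven by a lookup for node -1, which is the NULL dereference the call trace shows.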
When the node is offlined, there is no memory/cpu on the node. If a
sleeping task last ran on a cpu of this node, it will be migrated to a cpu
on another node. So we can clear the cpu-to-node mapping.

[akpm@linux-foundation.org: numa_clear_node() and numa_set_node() can no longer be __cpuinit]
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The node will be offlined when all memory/cpu on the node is hotremoved.
So we should try to offline the node when hotremoving a cpu on the node.

Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Cc: Len Brown
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
try_offline_node() will be needed in the tristate
drivers/acpi/processor_driver.c.

The node will be offlined when all memory/cpu on the node have been
hotremoved. So we need the function try_offline_node() in cpu-hotplug
path.

If the memory-hotplug is disabled, and cpu-hotplug is enabled

1. no memory on the node
we don't online the node, and cpu's node is the nearest node.

2. the node contains some memory
the node has been onlined, and cpu's node is still needed
to migrate the sleep task on the cpu to the same node.

So we do nothing in try_offline_node() in this case.

[rientjes@google.com: export the function try_offline_node() fix]
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Cc: Len Brown
Signed-off-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
When a cpu is hotplugged, we call acpi_map_cpu2node() in
_acpi_map_lsapic() to store the cpu's node and apicid's node. But we
don't clear the cpu's node in acpi_unmap_lsapic() when this cpu is
hotremoved. If the node is also hotremoved, we will get the following
messages:

kernel BUG at include/linux/gfp.h:329!
invalid opcode: 0000 [#1] SMP
Modules linked in: ebtable_nat ebtables ipt_MASQUERADE iptable_nat nf_nat xt_CHECKSUM iptable_mangle bridge stp llc sunrpc ipt_REJECT nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc dm_mirror dm_region_hash dm_log dm_mod vhost_net macvtap macvlan tun uinput iTCO_wdt iTCO_vendor_support coretemp kvm_intel kvm crc32c_intel microcode pcspkr i2c_i801 i2c_core lpc_ich mfd_core ioatdma e1000e i7core_edac edac_core sg acpi_memhotplug igb dca sd_mod crc_t10dif megaraid_sas mptsas mptscsih mptbase scsi_transport_sas scsi_mod
Pid: 3126, comm: init Not tainted 3.6.0-rc3-tangchen-hostbridge+ #13 FUJITSU-SV PRIMEQUEST 1800E/SB
RIP: 0010:[] [] allocate_slab+0x28d/0x300
RSP: 0018:ffff88078a049cf8 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000000
RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000246
RBP: ffff88078a049d38 R08: 00000000000040d0 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000b5f R12: 00000000000052d0
R13: ffff8807c1417300 R14: 0000000000030038 R15: 0000000000000003
FS: 00007fa9b1b44700(0000) GS:ffff8807c3800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fa9b09acca0 CR3: 000000078b855000 CR4: 00000000000007e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process init (pid: 3126, threadinfo ffff88078a048000, task ffff8807bb6f2650)
Call Trace:
new_slab+0x30/0x1b0
__slab_alloc+0x358/0x4c0
kmem_cache_alloc_node_trace+0xb4/0x1e0
alloc_fair_sched_group+0xd0/0x1b0
sched_create_group+0x3e/0x110
sched_autogroup_create_attach+0x4d/0x180
sys_setsid+0xd4/0xf0
system_call_fastpath+0x16/0x1b
Code: 89 c4 e9 73 fe ff ff 31 c0 89 de 48 c7 c7 45 de 9e 81 44 89 45 c8 e8 22 05 4b 00 85 db 44 8b 45 c8 0f 89 4f ff ff ff 0f 0b eb fe 0b 90 eb fd 0f 0b eb fe 89 de 48 c7 c7 45 de 9e 81 31 c0 44
RIP [] allocate_slab+0x28d/0x300
RSP
---[ end trace adf84c90f3fea3e5 ]---

The reason is that the cpu's node is not NUMA_NO_NODE, so we will call
alloc_pages_exact_node() to allocate memory on the node, but the node is
offlined.

If the node is onlined, we still need the cpu's node. For example: a task
on the cpu is sleeping when the cpu is hotremoved. We will choose
another cpu to run this task when it is woken up. If we know the cpu's
node, we will choose a cpu on the same node first. So we should clear the
cpu-to-node mapping when the node is offlined.

This patch only clears the apicid-to-node mapping when the cpu is
hotremoved.

[akpm@linux-foundation.org: fix section error]
Signed-off-by: Wen Congyang
Signed-off-by: Tang Chen
Cc: Yasuaki Ishimatsu
Cc: David Rientjes
Cc: Jiang Liu
Cc: Minchan Kim
Cc: KOSAKI Motohiro
Cc: Mel Gorman
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Peter Zijlstra
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
is_valid_nodemask() was introduced by commit 19770b32609b ("mm: filter
based on a nodemask as well as a gfp_mask"), but it does not match its
comments, because it does not check zones whose index is > policy_zone.

Also, commit b377fd3982ad ("Apply memory policies to top two highest
zones when highest zone is ZONE_MOVABLE") told us that if the
highest zone is ZONE_MOVABLE, we should also apply memory policies to
it. So ZONE_MOVABLE should be a valid zone for policies, and
is_valid_nodemask() needs to be changed to match it.

Fix: check all zones, even those whose zoneid > policy_zone. Use
nodes_intersects() instead of open code to check it.

Reported-by: Wen Congyang
Signed-off-by: Lai Jiangshan
Signed-off-by: Tang Chen
Cc: Mel Gorman
Cc: Lee Schermerhorn
Cc: Jiang Liu
Cc: Jianguo Wu
Cc: Kamezawa Hiroyuki
Cc: Lai Jiangshan
Cc: Yasuaki Ishimatsu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
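
The intersection test that the is_valid_nodemask() commit above switches to can be illustrated with plain bitmasks. In this sketch a node mask is a 64-bit integer, and the second argument, a mask of nodes that have memory in any zone, is an assumption about what the requested mask gets compared against; the names are invented for the example.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint64_t model_nodemask;    /* bit n set means node n is in the mask */

/* True if the two masks share at least one node. */
static bool model_nodes_intersect(model_nodemask a, model_nodemask b)
{
    return (a & b) != 0;
}

/*
 * A requested mask is usable if at least one of its nodes has memory.
 * The "nodes with memory" mask covers every zone, including a movable one,
 * so no zone above some policy threshold is skipped.
 */
static bool model_is_valid_nodemask(model_nodemask requested,
                                    model_nodemask nodes_with_memory)
{
    return model_nodes_intersect(requested, nodes_with_memory);
}
```

A single mask intersection replaces a hand-written loop over nodes and zones, which is the simplification the changelog points to.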