09 Feb, 2017
2 commits
-
commit a96dfddbcc04336bbed50dc2b24823e45e09e80c upstream.
Reading a sysfs "memoryN/valid_zones" file leads to the following oops
when the first page of a range is not backed by struct page.
show_valid_zones() assumes that 'start_pfn' is always valid for
page_zone().
BUG: unable to handle kernel paging request at ffffea017a000000
IP: show_valid_zones+0x6f/0x160
This issue may happen on x86-64 systems with 64GiB or more memory since
their memory block size is bumped up to 2GiB. [1] An example of such
systems is desribed below. 0x3240000000 is only aligned by 1GiB and
this memory block starts from 0x3200000000, which is not backed by
struct page.BIOS-e820: [mem 0x0000003240000000-0x000000603fffffff] usable
Since test_pages_in_a_zone() already checks holes, fix this issue by
extending this function to return 'valid_start' and 'valid_end' for a
given range. show_valid_zones() then proceeds with the valid range.[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'Link: http://lkml.kernel.org/r/20170127222149.30893-3-toshi.kani@hpe.com
Signed-off-by: Toshi Kani
Cc: Greg Kroah-Hartman
Cc: Zhang Zhen
Cc: Reza Arbab
Cc: David Rientjes
Cc: Dan Williams
Signed-off-by: Greg Kroah-Hartman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
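A rough sketch of the extended interface and its caller (the signature is as described above; the show_valid_zones() fragment is paraphrased, not quoted from the patch):
int test_pages_in_a_zone(unsigned long start_pfn, unsigned long end_pfn,
                         unsigned long *valid_start, unsigned long *valid_end);

/* show_valid_zones(), roughly: operate on the hole-free sub-range */
unsigned long valid_start, valid_end;

if (!test_pages_in_a_zone(start_pfn, start_pfn + nr_pages,
                          &valid_start, &valid_end))
        return sprintf(buf, "none\n");
zone = page_zone(pfn_to_page(valid_start));
-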
commit deb88a2a19e85842d79ba96b05031739ec327ff4 upstream.
Patch series "fix a kernel oops when reading sysfs valid_zones", v2.
A sysfs memory file is created for each 2GiB memory block on x86-64 when
the system has 64GiB or more memory. [1] When the start address of a
memory block is not backed by struct page, i.e. a memory range is not
aligned by 2GiB, reading its 'valid_zones' attribute file leads to a
kernel oops. This issue was observed on multiple x86-64 systems with
more than 64GiB of memory. This patch-set fixes this issue.
Patch 1 first fixes an issue in test_pages_in_a_zone(), which does not
test the start section.
Patch 2 then fixes the kernel oops by extending test_pages_in_a_zone()
to return valid [start, end).
Note for stable kernels: The memory block size change was made by commit
bdee237c0343 ("x86: mm: Use 2GB memory block size on large-memory x86-64
systems"), which was accepted to 3.9. However, this patch-set depends
on (and fixes) the change to test_pages_in_a_zone() made by commit
5f0f2887f4de ("mm/memory_hotplug.c: check for missing sections in
test_pages_in_a_zone()"), which was accepted to 4.4.So, I recommend that we backport it up to 4.4.
[1] 'Commit bdee237c0343 ("x86: mm: Use 2GB memory block size on
large-memory x86-64 systems")'This patch (of 2):
test_pages_in_a_zone() does not check 'start_pfn' when it is aligned by
section since 'sec_end_pfn' is set equal to 'pfn'. Since this function
is called for testing the range of a sysfs memory file, 'start_pfn' is
always aligned by section.
Fix it by properly setting 'sec_end_pfn' to the next section pfn.
Also make sure that this function returns 1 only when the range belongs
to a zone.
Link: http://lkml.kernel.org/r/20170127222149.30893-2-toshi.kani@hpe.com
Signed-off-by: Toshi Kani
Cc: Andrew Banman
Cc: Reza Arbab
Cc: Greg KH
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
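The essence of the fix is a one-argument change in the section loop of test_pages_in_a_zone(); a sketch of before and after (helper names as in mainline):
/* before: for a section-aligned start_pfn,
 * SECTION_ALIGN_UP(start_pfn) == start_pfn, so sec_end_pfn == pfn
 * and the first section is never tested */
sec_end_pfn = SECTION_ALIGN_UP(start_pfn);

/* after: always round up past start_pfn, so its section is covered */
sec_end_pfn = SECTION_ALIGN_UP(start_pfn + 1);
The loop then steps pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION as before.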
01 Feb, 2017
1 commit
-
commit 8a1f780e7f28c7c1d640118242cf68d528c456cd upstream.
online_{kernel|movable} is used to change the memory zone to
ZONE_{NORMAL|MOVABLE} and online the memory.To check that memory zone can be changed, zone_can_shift() is used.
Currently the function returns a negative integer, a positive integer,
or 0. When the function returns a negative or positive integer, it
means that the memory zone can be changed to ZONE_{NORMAL|MOVABLE}.
But when the function returns 0, there are two meanings.
One of the meanings is that the memory zone does not need to be changed.
For example, when memory is in ZONE_NORMAL and onlined by online_kernel
the memory zone does not need to be changed.
Another meaning is that the memory zone cannot be changed. When memory
is in ZONE_NORMAL and onlined by online_movable, the memory zone may
not be changed to ZONE_MOVABLE due to memory online limitations (see
Documentation/memory-hotplug.txt). In this case, memory must not be
onlined.
The patch changes the return type of zone_can_shift() so that the memory
online operation fails when the memory zone cannot be changed, as follows:
Before applying patch:
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
node_scanned 0
spanned 8388608
present 7864320
managed 7864320
# echo online_movable > memory4097/state
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
node_scanned 0
spanned 8388608
present 8388608
managed 8388608
online_movable operation succeeded. But memory is onlined as
ZONE_NORMAL, not ZONE_MOVABLE.
After applying patch:
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
node_scanned 0
spanned 8388608
present 7864320
managed 7864320
# echo online_movable > memory4097/state
bash: echo: write error: Invalid argument
# grep -A 35 "Node 2" /proc/zoneinfo
Node 2, zone Normal
node_scanned 0
spanned 8388608
present 7864320
managed 7864320
online_movable operation failed because of failure of changing
the memory zone from ZONE_NORMAL to ZONE_MOVABLE.
Fixes: df429ac03936 ("memory-hotplug: more general validation of zone during online")
Link: http://lkml.kernel.org/r/2f9c3837-33d7-b6e5-59c0-6ca4372b2d84@gmail.com
Signed-off-by: Yasuaki Ishimatsu
Reviewed-by: Reza Arbab
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
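A sketch of the changed interface (the signature matches the upstream patch; the caller fragment from online_pages() is paraphrased):
bool zone_can_shift(unsigned long pfn, unsigned long nr_pages,
                    enum zone_type target, int *zone_shift);

/* caller side, roughly: a zero shift is now unambiguous */
int zone_shift = 0;

if (online_type == MMOP_ONLINE_MOVABLE &&
    !zone_can_shift(pfn, nr_pages, ZONE_MOVABLE, &zone_shift))
        return -EINVAL;
zone = move_pfn_range(zone_shift, pfn, pfn + nr_pages);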
28 Oct, 2016
2 commits
-
When I removed the per-zone bitlock hashed waitqueues in commit
9dcb8b685fc3 ("mm: remove per-zone hashtable of bitlock waitqueues"), I
removed all the magic hotplug memory initialization of said waitqueues
too.
But when I actually _tested_ the resulting build, I stupidly assumed
that "allmodconfig" would enable memory hotplug. And it doesn't,
because it enables KASAN instead, which then disables hotplug memory
support.
As a result, my build test of the per-zone waitqueues was totally
broken, and I didn't notice that the compiler warns about the now unused
iterator variable 'i'.
I guess I should be happy that that seems to be the worst breakage from
my clearly horribly failed test coverage.
Reported-by: Stephen Rothwell
Signed-off-by: Linus Torvalds
-
The per-zone waitqueues exist because of a scalability issue with the
page waitqueues on some NUMA machines, but it turns out that they hurt
normal loads, and now with the vmalloced stacks they also end up
breaking gfs2 that uses a bit_wait on a stack object:
wait_on_bit(&gh->gh_iflags, HIF_WAIT, TASK_UNINTERRUPTIBLE)
where 'gh' can be a reference to the local variable 'mount_gh' on the
stack of fill_super().
The reason the per-zone hash table breaks for this case is that there is
no "zone" for virtual allocations, and trying to look up the physical
page to get at it will fail (with a BUG_ON()).
It turns out that I actually complained to the mm people about the
per-zone hash table for another reason just a month ago: the zone lookup
also hurts the regular use of "unlock_page()" a lot, because the zone
lookup ends up forcing several unnecessary cache misses and generates
horrible code.
As part of that earlier discussion, we had a much better solution for
the NUMA scalability issue - by just making the page lock have a
separate contention bit, the waitqueue doesn't even have to be looked at
for the normal case.
Peter Zijlstra already has a patch for that, but let's see if anybody
even notices. In the meantime, let's fix the actual gfs2 breakage by
simplifying the bitlock waitqueues and removing the per-zone issue.
Reported-by: Andreas Gruenbacher
Tested-by: Bob Peterson
Acked-by: Mel Gorman
Cc: Peter Zijlstra
Cc: Andy Lutomirski
Cc: Steven Whitehouse
Signed-off-by: Linus Torvalds
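The replacement is a single static table hashed by (word, bit), so no page-to-zone lookup is needed; roughly the shape of the new bit_waitqueue():
#define WAIT_TABLE_BITS 8
#define WAIT_TABLE_SIZE (1 << WAIT_TABLE_BITS)
static wait_queue_head_t bit_wait_table[WAIT_TABLE_SIZE] __cacheline_aligned;

wait_queue_head_t *bit_waitqueue(void *word, int bit)
{
        const int shift = BITS_PER_LONG == 32 ? 5 : 6;
        unsigned long val = (unsigned long)word << shift | bit;

        return bit_wait_table + hash_long(val, WAIT_TABLE_BITS);
}
Because the hash is taken from the virtual address, this also works for wait_on_bit() on stack objects, which the per-zone physical-page lookup could not handle.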
08 Oct, 2016
1 commit
-
In dissolve_free_huge_pages(), free hugepages will be dissolved without
making sure that there are enough of them left to satisfy hugepage
reservations.
Fix this by adding a return value to dissolve_free_huge_pages() and
checking h->free_huge_pages vs. h->resv_huge_pages. Note that this may
lead to the situation where dissolve_free_huge_page() returns an error
and all free hugepages that were dissolved before that error are lost,
while the memory block still cannot be set offline.
Fixes: c8721bbb ("mm: memory-hotplug: enable memory hotplug to handle hugepage")
Link: http://lkml.kernel.org/r/20160926172811.94033-3-gerald.schaefer@de.ibm.com
Signed-off-by: Gerald Schaefer
Acked-by: Michal Hocko
Acked-by: Naoya Horiguchi
Cc: "Kirill A . Shutemov"
Cc: Vlastimil Babka
Cc: Mike Kravetz
Cc: "Aneesh Kumar K . V"
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc: Rui Teng
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
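A sketch of the added check, assuming the hstate accounting fields named in the description:
/* in dissolve_free_huge_page(), before taking the page off the free list */
if (PageHuge(page) && !page_count(page)) {
        struct hstate *h = page_hstate(page);

        /* refuse to dissolve if it would break existing reservations */
        if (h->free_huge_pages - h->resv_huge_pages == 0)
                return -EBUSY;
        /* ... proceed to dissolve and return 0 ... */
}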
29 Sep, 2016
1 commit
-
9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
prevents allocating from an empty nodemask, but as David points out, it is
still wrong. As node_online_map may include memoryless nodes, only
allocating from these nodes is meaningless.
This patch uses node_states[N_MEMORY] mask to prevent the above case.
Fixes: 9bb627be47a5 ("mem-hotplug: don't clear the only node in new_node_page()")
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1474447117.28370.6.camel@TP420
Signed-off-by: Li Zhong
Suggested-by: David Rientjes
Acked-by: Vlastimil Babka
Cc: Michal Hocko
Cc: John Allen
Cc: Xishi Qiu
Cc: Joonsoo Kim
Cc: Naoya Horiguchi
Cc: Tetsuo Handa
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
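Together with the fix in the next entry, the nodemask setup in new_node_page() ends up roughly as (a sketch, not the verbatim diff):
/* start from nodes that actually have memory, not node_online_map */
nodemask_t nmask = node_states[N_MEMORY];

/*
 * Prefer a different node, but never clear the only candidate:
 * an empty nodemask makes __alloc_pages_nodemask() fail outright.
 */
if (nodes_weight(nmask) > 1)
        node_clear(nid, nmask);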
20 Sep, 2016
1 commit
-
Commit 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest
neighbor node when mem-offline") introduced new_node_page() for memory
hotplug.
In new_node_page(), the nid is cleared before calling
__alloc_pages_nodemask(). But if it is the only node of the system, and
the first round allocation fails, it will not be able to get memory from
an empty nodemask, and will trigger oom.
The patch checks whether it is the last node on the system, and if it
is, then don't clear the nid in the nodemask.
Fixes: 394e31d2ceb4 ("mem-hotplug: alloc new page from a nearest neighbor node when mem-offline")
Link: http://lkml.kernel.org/r/1473044391.4250.19.camel@TP420
Signed-off-by: Li Zhong
Reported-by: John Allen
Acked-by: Vlastimil Babka
Cc: Xishi Qiu
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
12 Aug, 2016
1 commit
-
The following oops occurs after a pgdat is hotadded:
Unable to handle kernel paging request for data at address 0x00c30001
Faulting instruction address: 0xc00000000022f8f4
Oops: Kernel access of bad area, sig: 11 [#1]
SMP NR_CPUS=2048 NUMA pSeries
Modules linked in: ip6t_rpfilter ip6t_REJECT nf_reject_ipv6 ipt_REJECT nf_reject_ipv4 xt_conntrack ebtable_nat ebtable_broute bridge stp llc ebtable_filter ebtables ip6table_nat nf_conntrack_ipv6 nf_defrag_ipv6 nf_nat_ipv6 ip6table_mangle ip6table_security ip6table_raw ip6table_filter ip6_tables iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 nf_nat nf_conntrack iptable_mangle iptable_security iptable_raw iptable_filter nls_utf8 isofs sg virtio_balloon uio_pdrv_genirq uio ip_tables xfs libcrc32c sr_mod cdrom sd_mod virtio_net ibmvscsi scsi_transport_srp virtio_pci virtio_ring virtio dm_mirror dm_region_hash dm_log dm_mod
CPU: 0 PID: 0 Comm: swapper/0 Tainted: G W 4.8.0-rc1-device #110
task: c000000000ef3080 task.stack: c000000000f6c000
NIP: c00000000022f8f4 LR: c00000000022f948 CTR: 0000000000000000
REGS: c000000000f6fa50 TRAP: 0300 Tainted: G W (4.8.0-rc1-device)
MSR: 800000010280b033 CR: 84002028 XER: 20000000
CFAR: d000000001d2013c DAR: 0000000000c30001 DSISR: 40000000 SOFTE: 0
NIP refresh_cpu_vm_stats+0x1a4/0x2f0
LR refresh_cpu_vm_stats+0x1f8/0x2f0
Call Trace:
refresh_cpu_vm_stats+0x1f8/0x2f0 (unreliable)
Add per_cpu_nodestats initialization to the hotplug codepath.
Link: http://lkml.kernel.org/r/1470931473-7090-1-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab
Cc: Mel Gorman
Cc: Paul Mackerras
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
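The fix itself is small; as a sketch (structure and allocator names assumed from mainline):
/* in hotadd_new_pgdat(): give the new node its vmstat counters before use */
pgdat->per_cpu_nodestats = alloc_percpu(struct per_cpu_nodestat);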
29 Jul, 2016
3 commits
-
If we offline a node, alloc the new page from a nearest neighbor node
instead of the current node or other remote nodes, because re-migrate is
a waste of time and the distance of the remote nodes is often very
large.
Also use GFP_HIGHUSER_MOVABLE to alloc new page if the zone is movable
zone or highmem zone.
Link: http://lkml.kernel.org/r/5795E18B.5060302@huawei.com
Signed-off-by: Xishi Qiu
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Cc: Naoya Horiguchi
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
kswapd goes through some complex steps trying to figure out if it should
stay awake based on the classzone_idx and the requested order. It is
unnecessarily complex and passes in an invalid classzone_idx to
balance_pgdat(). What matters most of all is whether a larger order has
been requested and whether kswapd successfully reclaimed at the previous
order. This patch irons out the logic to check just that and the end
result is less headache-inducing.
Link: http://lkml.kernel.org/r/1467970510-21195-10-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Hillf Danton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
This moves the LRU lists from the zone to the node and related data such
as counters, tracing, congestion tracking and writeback tracking.
Unfortunately, due to reclaim and compaction retry logic, it is
necessary to account for the number of LRU pages on both zone and node
logic. Most reclaim logic is based on the node counters but the retry
logic uses the zone counters which do not distinguish inactive and
active sizes. It would be possible to leave the LRU counters on a
per-zone basis but it's a heavier calculation across multiple cache
lines that is much more frequent than the retry checks.
Other than the LRU counters, this is mostly a mechanical patch but note
that it introduces a number of anomalies. For example, the scans are
per-zone but using per-node counters. We also mark a node as congested
when a zone is congested. This causes weird problems that are fixed
later but is easier to review.
In the event that there is excessive overhead on 32-bit systems due to
the nodes being on LRU then there are two potential solutions:
1. Long-term isolation of highmem pages when reclaim is lowmem
When pages are skipped, they are immediately added back onto the LRU
list. If lowmem reclaim persisted for long periods of time, the same
highmem pages get continually scanned. The idea would be that lowmem
keeps those pages on a separate list until a reclaim for highmem pages
arrives that splices the highmem pages back onto the LRU. It potentially
could be implemented similar to the UNEVICTABLE list.
That would reduce the skip rate, with the potential corner case that
highmem pages have to be scanned and reclaimed to free lowmem slab pages.
2. Linear scan lowmem pages if the initial LRU shrink fails
This will break LRU ordering but may be preferable and faster during
memory pressure than skipping LRU pages.
Link: http://lkml.kernel.org/r/1467970510-21195-4-git-send-email-mgorman@techsingularity.net
Signed-off-by: Mel Gorman
Acked-by: Johannes Weiner
Acked-by: Vlastimil Babka
Cc: Hillf Danton
Cc: Joonsoo Kim
Cc: Michal Hocko
Cc: Minchan Kim
Cc: Rik van Riel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
27 Jul, 2016
2 commits
-
When memory is onlined, we are only able to rezone from ZONE_MOVABLE to
ZONE_KERNEL, or from (ZONE_MOVABLE - 1) to ZONE_MOVABLE.
To be more flexible, use the following criteria instead; to online
memory from zone X into zone Y:
* Any zones between X and Y must be unused.
* If X is lower than Y, the onlined memory must lie at the end of X.
* If X is higher than Y, the onlined memory must lie at the start of X.
Add zone_can_shift() to make this determination.
Link: http://lkml.kernel.org/r/1462816419-4479-3-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab
Reviewed-by: Yasuaki Ishimatsu
Cc: Greg Kroah-Hartman
Cc: Daniel Kiper
Cc: Dan Williams
Cc: Vlastimil Babka
Cc: Tang Chen
Cc: Joonsoo Kim
Cc: David Vrabel
Cc: Vitaly Kuznetsov
Cc: David Rientjes
Cc: Andrew Banman
Cc: Chen Yucong
Cc: Yasunori Goto
Cc: Zhang Zhen
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Add move_pfn_range(), a wrapper to call move_pfn_range_left() or
move_pfn_range_right().
No functional change. This will be utilized by a later patch.
Link: http://lkml.kernel.org/r/1462816419-4479-2-git-send-email-arbab@linux.vnet.ibm.com
Signed-off-by: Reza Arbab
Reviewed-by: Yasuaki Ishimatsu
Cc: Greg Kroah-Hartman
Cc: Daniel Kiper
Cc: Dan Williams
Cc: Vlastimil Babka
Cc: Tang Chen
Cc: Joonsoo Kim
Cc: David Vrabel
Cc: Vitaly Kuznetsov
Cc: David Rientjes
Cc: Andrew Banman
Cc: Chen Yucong
Cc: Yasunori Goto
Cc: Zhang Zhen
Cc: Shaohua Li
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
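A sketch of the wrapper, close to the patch as described (error handling simplified):
static struct zone *move_pfn_range(int zone_shift,
                                   unsigned long start_pfn,
                                   unsigned long end_pfn)
{
        struct zone *zone = page_zone(pfn_to_page(start_pfn));
        int ret = 0;

        if (zone_shift < 0)
                ret = move_pfn_range_left(zone + zone_shift, zone,
                                          start_pfn, end_pfn);
        else if (zone_shift)
                ret = move_pfn_range_right(zone, zone + zone_shift,
                                           start_pfn, end_pfn);

        return ret ? NULL : zone + zone_shift;
}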
28 May, 2016
2 commits
-
The register_page_bootmem_info_node() function needs to be marked __init
in order to avoid a new warning introduced by commit f65e91df25aa ("mm:
use early_pfn_to_nid in register_page_bootmem_info_node").
Otherwise you'll get a warning about how a non-init function calls
early_pfn_to_nid (which is __meminit).
Cc: Yang Shi
Cc: Andrew Morton
Signed-off-by: Linus Torvalds
-
register_page_bootmem_info_node() is invoked in mem_init(), so it will
be called before page_alloc_init_late() if DEFERRED_STRUCT_PAGE_INIT is
enabled. But, pfn_to_nid() depends on memmap which won't be fully setup
until page_alloc_init_late() is done, so replace pfn_to_nid() by
early_pfn_to_nid().
Link: http://lkml.kernel.org/r/1464210007-30930-1-git-send-email-yang.shi@linaro.org
Signed-off-by: Yang Shi
Cc: Mel Gorman
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
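The change is confined to the pfn walk in register_page_bootmem_info_node(); roughly:
for (; pfn < end_pfn; pfn += PAGES_PER_SECTION) {
        /*
         * early_pfn_to_nid() works before the memmap is fully set up,
         * unlike pfn_to_nid(), which needs struct page initialized
         */
        if (pfn_valid(pfn) && (early_pfn_to_nid(pfn) == node))
                register_page_bootmem_info_section(pfn);
}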
20 May, 2016
3 commits
-
CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE specifies the default value for the
memory hotplug onlining policy. Add a command line parameter to make it
possible to override the default. It may come handy for debug and
testing purposes.
Signed-off-by: Vitaly Kuznetsov
Cc: Jonathan Corbet
Cc: Dan Williams
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: David Vrabel
Cc: David Rientjes
Cc: Igor Mammedov
Cc: Lennart Poettering
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
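The parameter is a thin wrapper around the existing policy flag; a sketch (parameter name as in mainline, memhp_default_state=):
static int __init cmdline_parse_memhp_default_state(char *str)
{
        if (!strcmp(str, "online"))
                memhp_auto_online = true;
        else if (!strcmp(str, "offline"))
                memhp_auto_online = false;
        return 1;
}
__setup("memhp_default_state=", cmdline_parse_memhp_default_state);
-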
This patchset continues the work I started with commit 31bc3858ea3e
("memory-hotplug: add automatic onlining policy for the newly added
memory").Initially I was going to stop there and bring the policy setting logic
to userspace. I met two issues on this way:1) It is possible to have memory hotplugged at boot (e.g. with QEMU).
These blocks stay offlined if we turn the onlining policy on by
userspace.
2) My attempt to bring this policy setting to systemd failed, systemd
maintainers suggest to change the default in kernel or ... to use
tmpfiles.d to alter the policy (which looks like a hack to me):
https://github.com/systemd/systemd/pull/2938
Here I suggest to add a config option to set the default value for the
policy and a kernel command line parameter to override it.
This patch (of 2):
Introduce config option to set the default value for memory hotplug
onlining policy (/sys/devices/system/memory/auto_online_blocks). The
reason one would want to turn this option on is to have early onlining
for hotpluggable memory available at boot and to not require any
userspace actions to make memory hotplug work.
[akpm@linux-foundation.org: tweak Kconfig text]
Signed-off-by: Vitaly Kuznetsov
Cc: Jonathan Corbet
Cc: Dan Williams
Cc: "Kirill A. Shutemov"
Cc: Mel Gorman
Cc: David Vrabel
Cc: David Rientjes
Cc: Igor Mammedov
Cc: Lennart Poettering
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Make is_mem_section_removable() return bool to improve readability due
to this particular function only using either one or zero as its return
value.
Signed-off-by: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Mar, 2016
5 commits
-
Kernel style prefers a single string over split strings when the string is
'user-visible'.
Miscellanea:
- Add a missing newline
- Realign arguments
Signed-off-by: Joe Perches
Acked-by: Tejun Heo [percpu]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
online_pages() simply returns an error value if
memory_notify(MEM_GOING_ONLINE, &arg) returns a value that is not what we
want for successfully onlining target pages. This patch aims to print
more failure information in online_pages(), like offline_pages() does.
This patch also converts printk(KERN_<LEVEL>) to pr_<level>(), and moves
__offline_pages() to not print failure information with KERN_INFO,
according to David Rientjes's suggestion [1].
[1] https://lkml.org/lkml/2016/2/24/1094
Signed-off-by: Chen Yucong
Acked-by: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
The success of CMA allocation largely depends on the success of
migration, and a key factor of that is the page reference count. Until
now, the page reference has been manipulated by directly calling atomic
functions, so we cannot follow up who manipulates it and where. That
makes it hard to find the actual reason for a CMA allocation failure.
CMA allocation should be guaranteed to succeed, so finding the offending
place is really important.
In this patch, call sites where the page reference is manipulated are
converted to the introduced wrapper functions. This is a preparation
step to add a tracepoint to each page reference manipulation function.
With this facility, we can easily find the reason for a CMA allocation
failure. There is no functional change in this patch.
In addition, this patch also converts reference read sites. It will
help a second step that renames page._count to something else and
prevents later attempts to access it directly (suggested by Andrew).
Signed-off-by: Joonsoo Kim
Acked-by: Michal Nazarewicz
Acked-by: Vlastimil Babka
Cc: Minchan Kim
Cc: Mel Gorman
Cc: "Kirill A. Shutemov"
Cc: Sergey Senozhatsky
Cc: Steven Rostedt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
We can reuse the nid we've determined instead of repeated pfn_to_nid()
usages. Also zone_to_nid() should be a bit cheaper in general than
pfn_to_nid().
Signed-off-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: "Kirill A. Shutemov"
Cc: Rik van Riel
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: David Rientjes
Cc: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
-
Memory compaction can be currently performed in several contexts:
- kswapd balancing a zone after a high-order allocation failure
- direct compaction to satisfy a high-order allocation, including THP
page fault attempts
- khugepaged trying to collapse a hugepage
- manually from /proc
The purpose of compaction is two-fold. The obvious purpose is to
satisfy a (pending or future) high-order allocation, and is easy to
evaluate. The other purpose is to keep overall memory fragmentation low
and help the anti-fragmentation mechanism. The success wrt the latter
purpose is more difficult to evaluate.
The current situation wrt the purposes has a few drawbacks:
- compaction is invoked only when a high-order page or hugepage is not
available (or manually). This might be too late for the purposes of
keeping memory fragmentation low.
- direct compaction increases latency of allocations. Again, it would
be better if compaction was performed asynchronously to keep
fragmentation low, before the allocation itself comes.
- (a special case of the previous) the cost of compaction during THP
page faults can easily offset the benefits of THP.
- kswapd compaction appears to be complex, fragile and not working in
some scenarios. It could also end up compacting for a high-order
allocation request when it should be reclaiming memory for a later
order-0 request.
To improve the situation, we should be able to benefit from an
equivalent of kswapd, but for compaction - i.e. a background thread
which responds to fragmentation and the need for high-order allocations
(including hugepages) somewhat proactively.
One possibility is to extend the responsibilities of kswapd, which could
however complicate its design too much. It should be better to let
kswapd handle reclaim, as order-0 allocations are often more critical
than high-order ones.
Another possibility is to extend khugepaged, but this kthread is a
single instance and tied to THP configs.
This patch goes with the option of a new set of per-node kthreads called
kcompactd, and lays the foundations, without introducing any new
tunables. The lifecycle mimics kswapd kthreads, including the memory
hotplug hooks.
For compaction, kcompactd uses the standard compaction_suitable() and
compact_finished() criteria and the deferred compaction functionality.
Unlike direct compaction, it uses only sync compaction, as there's no
allocation latency to minimize.
This patch doesn't yet add a call to wakeup_kcompactd. The kswapd
compact/reclaim loop for high-order pages will be replaced by waking up
kcompactd in the next patch with the description of what's wrong with
the old approach.
Waking up of the kcompactd threads is also tied to kswapd activity and
follows these rules:
- we don't want to affect any fastpaths, so wake up kcompactd only from
the slowpath, as it's done for kswapd
- if kswapd is doing reclaim, it's more important than compaction, so
don't invoke kcompactd until kswapd goes to sleep
- the target order used for kswapd is passed to kcompactd
Possible future uses for kcompactd include the ability to wake up
kcompactd on demand in special situations, such as when hugepages are
not available (currently not done due to __GFP_NO_KSWAPD) or when a
fragmentation event (i.e. __rmqueue_fallback()) occurs. It's also
possible to perform periodic compaction with kcompactd.
[arnd@arndb.de: fix build errors with kcompactd]
[paul.gortmaker@windriver.com: don't use modular references for non modular code]
Signed-off-by: Vlastimil Babka
Cc: Andrea Arcangeli
Cc: "Kirill A. Shutemov"
Cc: Rik van Riel
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: David Rientjes
Cc: Michal Hocko
Cc: Johannes Weiner
Signed-off-by: Arnd Bergmann
Signed-off-by: Paul Gortmaker
Cc: Hugh Dickins
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
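The lifecycle described above boils down to a conventional kthread loop; a simplified sketch (helper names as in mainline):
static int kcompactd(void *p)
{
        pg_data_t *pgdat = (pg_data_t *)p;

        set_freezable();
        while (!kthread_should_stop()) {
                /* sleep until kswapd (or a future hook) requests work */
                wait_event_freezable(pgdat->kcompactd_wait,
                                     kcompactd_work_requested(pgdat));
                kcompactd_do_work(pgdat);       /* sync compaction only */
        }
        return 0;
}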
16 Mar, 2016
2 commits
-
There is a performance drop report due to hugepage allocation, and in
it half of the cpu time is spent on pageblock_pfn_to_page() in
compaction [1].
In that workload, compaction is triggered to make a hugepage but most of
pageblocks are un-available for compaction due to pageblock type and
skip bit so compaction usually fails. Most costly operations in this
case is to find valid pageblock while scanning whole zone range. To
check if pageblock is valid to compact, valid pfn within pageblock is
required and we can obtain it by calling pageblock_pfn_to_page(). This
function checks whether pageblock is in a single zone and return valid
pfn if possible. Problem is that we need to check it every time before
scanning pageblock even if we re-visit it and this turns out to be very
expensive in this workload.
Although we have no way to skip this pageblock check in the system where
hole exists at arbitrary position, we can use cached value for zone
continuity and just do pfn_to_page() in the system where hole doesn't
exist. This optimization considerably speeds up the above workload.
Before vs After
Max: 1096 MB/s vs 1325 MB/s
Min: 635 MB/s vs 1015 MB/s
Avg: 899 MB/s vs 1194 MB/s
Avg is improved by roughly 30% [2].
[1]: http://www.spinics.net/lists/linux-mm/msg97378.html
[2]: https://lkml.org/lkml/2015/12/9/23
[akpm@linux-foundation.org: don't forget to restore zone->contiguous on error path, per Vlastimil]
Signed-off-by: Joonsoo Kim
Reported-by: Aaron Lu
Acked-by: Vlastimil Babka
Tested-by: Aaron Lu
Cc: Mel Gorman
Cc: Rik van Riel
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
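After the patch, the common case is a single flag test; roughly:
static inline struct page *pageblock_pfn_to_page(unsigned long start_pfn,
                                                 unsigned long end_pfn,
                                                 struct zone *zone)
{
        /* cached: the zone is known to have no holes, skip the checks */
        if (zone->contiguous)
                return pfn_to_page(start_pfn);

        return __pageblock_pfn_to_page(start_pfn, end_pfn, zone);
}
Memory hotplug clears and re-evaluates zone->contiguous, since onlining or offlining ranges can change zone continuity.
-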
Currently, all newly added memory blocks remain in 'offline' state
unless someone onlines them; some linux distributions carry special udev
rules like:
SUBSYSTEM=="memory", ACTION=="add", ATTR{state}=="offline", ATTR{state}="online"
to make this happen automatically. This is not a great solution for
virtual machines where memory hotplug is being used to address high
memory pressure situations as such onlining is slow and a userspace
process doing this (udev) has a chance of being killed by the OOM killer
as it will probably need to allocate some memory.
Introduce a default policy for the newly added memory blocks in
/sys/devices/system/memory/auto_online_blocks file with two possible
values: "offline" which preserves the current behavior and "online"
which causes all newly added memory blocks to go online as soon as
they're added. The default is "offline".
Signed-off-by: Vitaly Kuznetsov
Reviewed-by: Daniel Kiper
Cc: Jonathan Corbet
Cc: Greg Kroah-Hartman
Cc: Daniel Kiper
Cc: Dan Williams
Cc: Tang Chen
Cc: David Vrabel
Acked-by: David Rientjes
Cc: Naoya Horiguchi
Cc: Xishi Qiu
Cc: Mel Gorman
Cc: "K. Y. Srinivasan"
Cc: Igor Mammedov
Cc: Kay Sievers
Cc: Konrad Rzeszutek Wilk
Cc: Boris Ostrovsky
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
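A sketch of the policy flag and its use at the end of the hot-add path (as in mainline; online_memory_block is the per-block online callback):
/* policy flag behind /sys/devices/system/memory/auto_online_blocks */
bool memhp_auto_online;

/* in add_memory(), roughly: online the new blocks if requested */
if (memhp_auto_online)
        walk_memory_range(PFN_DOWN(start), PFN_UP(start + size - 1),
                          NULL, online_memory_block);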
30 Jan, 2016
1 commit
-
Set IORESOURCE_SYSTEM_RAM in struct resource.flags of "System
RAM" entries.Signed-off-by: Toshi Kani
Signed-off-by: Borislav Petkov
Acked-by: David Vrabel # xen
Cc: Andrew Banman
Cc: Andrew Morton
Cc: Andy Lutomirski
Cc: Boris Ostrovsky
Cc: Borislav Petkov
Cc: Brian Gerst
Cc: Dan Williams
Cc: David Rientjes
Cc: Denys Vlasenko
Cc: Gu Zheng
Cc: H. Peter Anvin
Cc: Konrad Rzeszutek Wilk
Cc: Linus Torvalds
Cc: Luis R. Rodriguez
Cc: Mel Gorman
Cc: Naoya Horiguchi
Cc: Peter Zijlstra
Cc: Tang Chen
Cc: Thomas Gleixner
Cc: Toshi Kani
Cc: linux-arch@vger.kernel.org
Cc: linux-mm
Cc: xen-devel@lists.xenproject.org
Link: http://lkml.kernel.org/r/1453841853-11383-9-git-send-email-bp@alien8.de
Signed-off-by: Ingo Molnar
16 Jan, 2016
1 commit
-
In support of providing struct page for large persistent memory
capacities, use struct vmem_altmap to change the default policy for
allocating memory for the memmap array. The default vmemmap_populate()
allocates page table storage area from the page allocator. Given
persistent memory capacities relative to DRAM it may not be feasible to
store the memmap in 'System Memory'. Instead vmem_altmap represents
pre-allocated "device pages" to satisfy vmemmap_alloc_block_buf()
requests.
Signed-off-by: Dan Williams
Reported-by: kbuild test robot
Cc: Thomas Gleixner
Cc: Ingo Molnar
Cc: "H. Peter Anvin"
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
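The helper structure looks roughly like this (fields as introduced upstream; comments are paraphrase, not the original kernel-doc):
struct vmem_altmap {
        const unsigned long base_pfn;   /* first pfn of the device range */
        const unsigned long reserve;    /* pages reserved, never handed out */
        unsigned long free;             /* pages available for the memmap */
        unsigned long align;
        unsigned long alloc;            /* pages already allocated */
};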
15 Jan, 2016
1 commit
-
An out-of-memory condition is not a bug, and while we can't add new
memory in such a case, crashing the system seems wrong. Propagating the
return value from register_memory_resource() requires an interface change.
Signed-off-by: Vitaly Kuznetsov
Reviewed-by: Igor Mammedov
Acked-by: David Rientjes
Cc: Tang Chen
Cc: Naoya Horiguchi
Cc: Xishi Qiu
Cc: Sheng Yong
Cc: Zhu Guihua
Cc: Dan Williams
Cc: David Vrabel
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
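A sketch of the reworked helper (error values illustrative; callers then use IS_ERR()/PTR_ERR() instead of a NULL check):
static struct resource *register_memory_resource(u64 start, u64 size)
{
        struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);

        if (!res)
                return ERR_PTR(-ENOMEM);        /* was: BUG_ON(!res) */

        res->name = "System RAM";
        res->start = start;
        res->end = start + size - 1;
        res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;
        if (request_resource(&iomem_resource, res) < 0) {
                kfree(res);
                return ERR_PTR(-EEXIST);
        }
        return res;
}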
30 Dec, 2015
1 commit
-
test_pages_in_a_zone() does not account for the possibility of missing
sections in the given pfn range. pfn_valid_within always returns 1 when
CONFIG_HOLES_IN_ZONE is not set, allowing invalid pfns from missing
sections to pass the test, leading to a kernel oops.
Wrap an additional pfn loop with PAGES_PER_SECTION granularity to check
for missing sections before proceeding into the zone-check code.
This also prevents a crash from offlining memory devices with missing
sections. Despite this, it may be a good idea to keep the related patch
'[PATCH 3/3] drivers: memory: prohibit offlining of memory blocks with
missing sections' because missing sections in a memory block may lead to
other problems not covered by the scope of this fix.
Signed-off-by: Andrew Banman
Acked-by: Alex Thorlton
Cc: Russ Anderson
Cc: Alex Thorlton
Cc: Yinghai Lu
Cc: Greg KH
Cc: Seth Jennings
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
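A sketch of the section-granular walk in test_pages_in_a_zone() (helper names as in mainline; the inner zone check is elided):
for (pfn = start_pfn, sec_end_pfn = SECTION_ALIGN_UP(start_pfn);
     pfn < end_pfn;
     pfn = sec_end_pfn, sec_end_pfn += PAGES_PER_SECTION) {
        /* skip whole sections that have no memmap at all */
        if (!present_section_nr(pfn_to_section_nr(pfn)))
                continue;
        for (; pfn < sec_end_pfn && pfn < end_pfn;
             pfn += MAX_ORDER_NR_PAGES) {
                /* ... existing pfn_valid_within()/zone checks ... */
        }
}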
06 Nov, 2015
1 commit
-
Commit a2f3aa025766 ("[PATCH] Fix sparsemem on Cell") fixed an oops
experienced on the Cell architecture when init-time functions,
early_*(), are called at runtime by introducing an 'enum memmap_context'
parameter to memmap_init_zone() and init_currently_empty_zone(). This
parameter is intended to be used to tell whether the call of these two
functions is being made on behalf of a hotplug event, or happening at
boot-time. However, init_currently_empty_zone() does not use this
parameter at all, so remove it.
Signed-off-by: Yaowei Bai
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Oct, 2015
1 commit
-
Add add_memory_resource() to add memory using an existing "System RAM"
resource. This is useful if the memory region is being located by
finding a free resource slot with allocate_resource().
Xen guests will make use of this in their balloon driver to hotplug
arbitrary amounts of memory in response to toolstack requests.
Signed-off-by: David Vrabel
Reviewed-by: Daniel Kiper
Reviewed-by: Tang Chen
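A hypothetical caller in the style of the Xen balloon use case (signature as at introduction; nid and size are placeholders, error handling omitted):
struct resource *res = kzalloc(sizeof(*res), GFP_KERNEL);
int rc;

res->name = "System RAM";
res->flags = IORESOURCE_MEM | IORESOURCE_BUSY;

/* let the resource tree pick a free physical range of the right size */
rc = allocate_resource(&iomem_resource, res, size, 0, -1,
                       PAGES_PER_SECTION * PAGE_SIZE, NULL, NULL);
if (!rc)
        rc = add_memory_resource(nid, res);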
09 Sep, 2015
1 commit
-
Pull libnvdimm updates from Dan Williams:
"This update has successfully completed a 0day-kbuild run and has
appeared in a linux-next release. The changes outside of the typical
drivers/nvdimm/ and drivers/acpi/nfit.[ch] paths are related to the
removal of IORESOURCE_CACHEABLE, the introduction of memremap(), and
the introduction of ZONE_DEVICE + devm_memremap_pages().
Summary:
- Introduce ZONE_DEVICE and devm_memremap_pages() as a generic
mechanism for adding device-driver-discovered memory regions to the
kernel's direct map.
This facility is used by the pmem driver to enable pfn_to_page()
operations on the page frames returned by DAX ('direct_access' in
'struct block_device_operations').
For now, the 'memmap' allocation for these "device" pages comes
from "System RAM". Support for allocating the memmap from device
memory will arrive in a later kernel.
- Introduce memremap() to replace usages of ioremap_cache() and
ioremap_wt(). memremap() drops the __iomem annotation for these
mappings to memory that do not have i/o side effects. The
replacement of ioremap_cache() with memremap() is limited to the
pmem driver to ease merging the api change in v4.3.
Completion of the conversion is targeted for v4.4.
- Similar to the usage of memcpy_to_pmem() + wmb_pmem() in the pmem
driver, update the VFS DAX implementation and PMEM api to provide
persistence guarantees for kernel operations on a DAX mapping.
- Convert the ACPI NFIT 'BLK' driver to map the block apertures as
cacheable to improve performance.
- Miscellaneous updates and fixes to libnvdimm including support for
issuing "address range scrub" commands, clarifying the optimal
'sector size' of pmem devices, a clarification of the usage of the
ACPI '_STA' (status) property for DIMM devices, and other minor
fixes"* tag 'libnvdimm-for-4.3' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm: (34 commits)
libnvdimm, pmem: direct map legacy pmem by default
libnvdimm, pmem: 'struct page' for pmem
libnvdimm, pfn: 'struct page' provider infrastructure
x86, pmem: clarify that ARCH_HAS_PMEM_API implies PMEM mapped WB
add devm_memremap_pages
mm: ZONE_DEVICE for "device memory"
mm: move __phys_to_pfn and __pfn_to_phys to asm/generic/memory_model.h
dax: drop size parameter to ->direct_access()
nd_blk: change aperture mapping from WC to WB
nvdimm: change to use generic kvfree()
pmem, dax: have direct_access use __pmem annotation
dax: update I/O path to do proper PMEM flushing
pmem: add copy_from_iter_pmem() and clear_pmem()
pmem, x86: clean up conditional pmem includes
pmem: remove layer when calling arch_has_wmb_pmem()
pmem, x86: move x86 PMEM API to new pmem.h header
libnvdimm, e820: make CONFIG_X86_PMEM_LEGACY a tristate option
pmem: switch to devm_ allocations
devres: add devm_memremap
libnvdimm, btt: write and validate parent_uuid
...
05 Sep, 2015
1 commit
-
Commit f9126ab9241f ("memory-hotplug: fix wrong edge when hot add a new
node") hot-added memory range to memblock, after creating pgdat for new
node.
But there is a problem:
add_memory()
|--> hotadd_new_pgdat()
|--> free_area_init_node()
|--> get_pfn_range_for_nid()
|--> find start_pfn and end_pfn in memblock
|--> ......
|--> memblock_add_node(start, size, nid) -------- Here, just too late.
get_pfn_range_for_nid() will find that start_pfn and end_pfn are both 0.
As a result, when adding memory, dmesg will give the following wrong
message.
Initmem setup node 5 [mem 0x0000000000000000-0xffffffffffffffff]
On node 5 totalpages: 0
Built 5 zonelists in Node order, mobility grouping on. Total pages: 32588823
Policy zone: Normal
init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
The solution is simple: just add the memory range to memblock a little
earlier, before hotadd_new_pgdat().
[akpm@linux-foundation.org: coding-style fixes]
Signed-off-by: Tang Chen
Cc: Xishi Qiu
Cc: Yasuaki Ishimatsu
Cc: Kamezawa Hiroyuki
Cc: Taku Izumi
Cc: Gu Zheng
Cc: Naoya Horiguchi
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: David Rientjes
Cc: [4.2.x]
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
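A sketch of the reordered flow in add_memory() after the fix (simplified to the two calls that matter):
/* register the range with memblock first, so the pgdat setup below
 * (free_area_init_node() -> get_pfn_range_for_nid()) can see it */
memblock_add_node(start, size, nid);

pgdat = hotadd_new_pgdat(nid, start);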
28 Aug, 2015
1 commit
-
While pmem is usable as a block device or via DAX mappings to userspace
there are several usage scenarios that can not target pmem due to its
lack of struct page coverage. In preparation for "hot plugging" pmem
into the vmemmap add ZONE_DEVICE as a new zone to tag these pages
separately from the ones that are subject to standard page allocations.
Importantly "device memory" can be removed at will by userspace
unbinding the driver of the device.
Having a separate zone prevents allocation and otherwise marks these
pages that are distinct from typical uniform memory. Device memory has
different lifetime and performance characteristics than RAM. However,
since we have run out of ZONES_SHIFT bits this functionality currently
depends on sacrificing ZONE_DMA.
Cc: H. Peter Anvin
Cc: Ingo Molnar
Cc: Dave Hansen
Cc: Rik van Riel
Cc: Mel Gorman
Cc: Jerome Glisse
[hch: various simplifications in the arch interface]
Signed-off-by: Christoph Hellwig
Signed-off-by: Dan Williams
15 Aug, 2015
1 commit
-
When we add a new node, the edge of memory may be wrong.
e.g. system has 4 nodes, and node3 is movable, node3 mem:[24G-32G],
1. hotremove the node3,
2. then hotadd node3 with a part of memory, mem:[26G-30G],
3. call hotadd_new_pgdat()
free_area_init_node()
get_pfn_range_for_nid()
4. it will return wrong start_pfn and end_pfn, because we have not
update the memblock.
This patch also fixes a BUG_ON during hot-addition, please see
http://marc.info/?l=linux-kernel&m=142961156129456&w=2
Signed-off-by: Xishi Qiu
Cc: Yasuaki Ishimatsu
Cc: Kamezawa Hiroyuki
Cc: Taku Izumi
Cc: Tang Chen
Cc: Gu Zheng
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
07 Aug, 2015
1 commit
-
Commit 92923ca3aace ("mm: meminit: only set page reserved in the
memblock region") broke memory hotplug which expects the memmap for
newly added sections to be reserved until onlined by
online_pages_range(). This patch marks hotplugged pages as reserved
when adding new zones.
Signed-off-by: Mel Gorman
Reported-by: David Vrabel
Tested-by: David Vrabel
Cc: Nathan Zimmer
Cc: Robin Holt
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
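A sketch of the fix in the __add_section() path (loop shape assumed from the description, not the verbatim diff):
/* make the new section's pages reserved until online_pages_range() */
for (i = 0; i < PAGES_PER_SECTION; i++) {
        unsigned long pfn = phys_start_pfn + i;

        if (!pfn_valid(pfn))
                continue;
        SetPageReserved(pfn_to_page(pfn));
}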
25 Jun, 2015
1 commit
-
When hot add two nodes continuously, we found the vmemmap region info is
a bit messed up. The last region of node 2 is printed when node 3 is
hot-added, like the following:
Initmem setup node 2 [mem 0x0000000000000000-0xffffffffffffffff]
On node 2 totalpages: 0
Built 2 zonelists in Node order, mobility grouping on. Total pages: 16090539
Policy zone: Normal
init_memory_mapping: [mem 0x40000000000-0x407ffffffff]
[mem 0x40000000000-0x407ffffffff] page 1G
[ffffea1000000000-ffffea10001fffff] PMD -> [ffff8a077d800000-ffff8a077d9fffff] on node 2
[ffffea1000200000-ffffea10003fffff] PMD -> [ffff8a077de00000-ffff8a077dffffff] on node 2
...
[ffffea101f600000-ffffea101f9fffff] PMD -> [ffff8a074ac00000-ffff8a074affffff] on node 2
[ffffea101fa00000-ffffea101fdfffff] PMD -> [ffff8a074a800000-ffff8a074abfffff] on node 2
Initmem setup node 3 [mem 0x0000000000000000-0xffffffffffffffff]
On node 3 totalpages: 0
Built 3 zonelists in Node order, mobility grouping on. Total pages: 16090539
Policy zone: Normal
init_memory_mapping: [mem 0x60000000000-0x607ffffffff]
[mem 0x60000000000-0x607ffffffff] page 1G
[ffffea101fe00000-ffffea101fffffff] PMD -> [ffff8a074a400000-ffff8a074a5fffff] on node 2 [ffff8a074a600000-ffff8a074a7fffff] on node 3
[ffffea1800200000-ffffea18005fffff] PMD -> [ffff8a074a000000-ffff8a074a3fffff] on node 3
[ffffea1800600000-ffffea18009fffff] PMD -> [ffff8a0749c00000-ffff8a0749ffffff] on node 3
...
The cause is the last region was missed at the end of hot add memory,
and p_start, p_end, node_start were not reset, so when hot add memory to
a new node, it will consider they are not contiguous blocks and print
the previous one. So we print the last vmemmap region at the end of hot
add memory to avoid the confusion.
Signed-off-by: Zhu Guihua
Reviewed-by: Naoya Horiguchi
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
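The printer keeps the pending range as static state and coalesces contiguous mappings; the fix flushes that state when hot add finishes. The x86-64 flush helper looks roughly like (static variable names as on x86-64):
void __meminit vmemmap_populate_print_last(void)
{
        if (p_start) {
                pr_debug(" [%lx-%lx] PMD -> [%p-%p] on node %d\n",
                         addr_start, addr_end - 1, p_start, p_end - 1,
                         node_start);
                p_start = NULL;
                p_end = NULL;
                node_start = 0;
        }
}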
11 Jun, 2015
1 commit
-
Izumi found the following oops when hot re-adding a node:
BUG: unable to handle kernel paging request at ffffc90008963690
IP: __wake_up_bit+0x20/0x70
Oops: 0000 [#1] SMP
CPU: 68 PID: 1237 Comm: rs:main Q:Reg Not tainted 4.1.0-rc5 #80
Hardware name: FUJITSU PRIMEQUEST2800E/SB, BIOS PRIMEQUEST 2000 Series BIOS Version 1.87 04/28/2015
task: ffff880838df8000 ti: ffff880017b94000 task.ti: ffff880017b94000
RIP: 0010:[] [] __wake_up_bit+0x20/0x70
RSP: 0018:ffff880017b97be8 EFLAGS: 00010246
RAX: ffffc90008963690 RBX: 00000000003c0000 RCX: 000000000000a4c9
RDX: 0000000000000000 RSI: ffffea101bffd500 RDI: ffffc90008963648
RBP: ffff880017b97c08 R08: 0000000002000020 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffff8a0797c73800
R13: ffffea101bffd500 R14: 0000000000000001 R15: 00000000003c0000
FS: 00007fcc7ffff700(0000) GS:ffff880874800000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffc90008963690 CR3: 0000000836761000 CR4: 00000000001407e0
Call Trace:
unlock_page+0x6d/0x70
generic_write_end+0x53/0xb0
xfs_vm_write_end+0x29/0x80 [xfs]
generic_perform_write+0x10a/0x1e0
xfs_file_buffered_aio_write+0x14d/0x3e0 [xfs]
xfs_file_write_iter+0x79/0x120 [xfs]
__vfs_write+0xd4/0x110
vfs_write+0xac/0x1c0
SyS_write+0x58/0xd0
system_call_fastpath+0x12/0x76
Code: 5d c3 66 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 48 89 e5 48 83 ec 20 65 48 8b 04 25 28 00 00 00 48 89 45 f8 31 c0 48 8d 47 48 39 47 48 48 c7 45 e8 00 00 00 00 48 c7 45 f0 00 00 00 00 48
RIP [] __wake_up_bit+0x20/0x70
RSP
CR2: ffffc90008963690
Reproduce method (re-add a node):
Hot-add nodeA --> remove nodeA --> hot-add nodeA (panic)
This seems to be a use-after-free problem, and the root cause is
zone->wait_table was not set to *NULL* after freeing it in
try_offline_node().
When hot re-adding a node, we will reuse its pgdat, and so also the zone
struct; when adding pages to the target zone, it will init the zone
first (including the wait_table) if the zone is not initialized. The
judgement of whether the zone is initialized is based on zone->wait_table:
static inline bool zone_is_initialized(struct zone *zone)
{
return !!zone->wait_table;
}
So if we do not set zone->wait_table to *NULL* after freeing it, the
memory hotplug routine will skip the init of new zone when hot re-add
the node, and the wait_table still points to the freed memory, then we
will access the invalid address when trying to wake up the waiting
people after the i/o operation with the page is done, such as mentioned
above.
Signed-off-by: Gu Zheng
Reported-by: Taku Izumi
Reviewed-by: Yasuaki Ishimatsu
Cc: KAMEZAWA Hiroyuki
Cc: Tang Chen
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
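The fix is the single assignment after the vfree(); a sketch of the cleanup loop in try_offline_node() after the change:
for (i = 0; i < MAX_NR_ZONES; i++) {
        struct zone *zone = pgdat->node_zones + i;

        if (zone->wait_table) {
                vfree(zone->wait_table);
                /* the fix: don't leave a dangling pointer that makes
                 * the zone look initialized on re-add */
                zone->wait_table = NULL;
        }
}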
16 Apr, 2015
1 commit
-
Now we have easy access to hugepages' activeness, so the existing
helpers to get the information can be cleaned up.
[akpm@linux-foundation.org: s/PageHugeActive/page_huge_active/]
Signed-off-by: Naoya Horiguchi
Cc: Hugh Dickins
Reviewed-by: Michal Hocko
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: David Rientjes
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
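For reference, the flag-based helper this cleanup builds on stores activeness on the first tail page; roughly:
/* activeness is recorded as PagePrivate on the first tail page */
bool page_huge_active(struct page *page)
{
        VM_BUG_ON_PAGE(!PageHuge(page), page);
        return PageHead(page) && PagePrivate(&page[1]);
}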