06 Apr, 2019
1 commit
-
[ Upstream commit 4117992df66a26fa33908b4969e04801534baab1 ]
KASAN does not play well with the page poisoning (CONFIG_PAGE_POISONING).
It triggers false positives in the allocation path:BUG: KASAN: use-after-free in memchr_inv+0x2ea/0x330
Read of size 8 at addr ffff88881f800000 by task swapper/0
CPU: 0 PID: 0 Comm: swapper Not tainted 5.0.0-rc1+ #54
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
__asan_report_load8_noabort+0x19/0x20
memchr_inv+0x2ea/0x330
kernel_poison_pages+0x103/0x3d5
get_page_from_freelist+0x15e7/0x4d90because KASAN has not yet unpoisoned the shadow page for allocation
before it checks memchr_inv() but only found a stale poison pattern.Also, false positives in free path,
BUG: KASAN: slab-out-of-bounds in kernel_poison_pages+0x29e/0x3d5
Write of size 4096 at addr ffff8888112cc000 by task swapper/0/1
CPU: 5 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc1+ #55
Call Trace:
dump_stack+0xe0/0x19a
print_address_description.cold.2+0x9/0x28b
kasan_report.cold.3+0x7a/0xb5
check_memory_region+0x22d/0x250
memset+0x28/0x40
kernel_poison_pages+0x29e/0x3d5
__free_pages_ok+0x75f/0x13e0due to KASAN adds poisoned redzones around slab objects, but the page
poisoning needs to poison the whole page.Link: http://lkml.kernel.org/r/20190114233405.67843-1-cai@lca.pw
Signed-off-by: Qian Cai
Acked-by: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
24 Mar, 2019
1 commit
-
[ Upstream commit 2c2ade81741c66082f8211f0b96cf509cc4c0218 ]
The basic idea behind ->pagecnt_bias is: If we pre-allocate the maximum
number of references that we might need to create in the fastpath later,
the bump-allocation fastpath only has to modify the non-atomic bias value
that tracks the number of extra references we hold instead of the atomic
refcount. The maximum number of allocations we can serve (under the
assumption that no allocation is made with size 0) is nc->size, so that's
the bias used.However, even when all memory in the allocation has been given away, a
reference to the page is still held; and in the `offset < 0` slowpath, the
page may be reused if everyone else has dropped their references.
This means that the necessary number of references is actually
`nc->size+1`.Luckily, from a quick grep, it looks like the only path that can call
page_frag_alloc(fragsz=1) is TAP with the IFF_NAPI_FRAGS flag, which
requires CAP_NET_ADMIN in the init namespace and is only intended to be
used for kernel testing and fuzzing.To test for this issue, put a `WARN_ON(page_ref_count(page) == 0)` in the
`offset < 0` path, below the virt_to_page() call, and then repeatedly call
writev() on a TAP device with IFF_TAP|IFF_NO_PI|IFF_NAPI_FRAGS|IFF_NAPI,
with a vector consisting of 15 elements containing 1 byte each.Signed-off-by: Jann Horn
Signed-off-by: David S. Miller
Signed-off-by: Sasha Levin
13 Feb, 2019
1 commit
-
[ Upstream commit 3c0c12cc8f00ca5f81acb010023b8eb13e9a7004 ]
When CONFIG_KASAN is enabled on large memory SMP systems, the deferrred
pages initialization can take a long time. Below were the reported init
times on a 8-socket 96-core 4TB IvyBridge system.1) Non-debug kernel without CONFIG_KASAN
[ 8.764222] node 1 initialised, 132086516 pages in 7027ms2) Debug kernel with CONFIG_KASAN
[ 146.288115] node 1 initialised, 132075466 pages in 143052msSo the page init time in a debug kernel was 20X of the non-debug kernel.
The long init time can be problematic as the page initialization is done
with interrupt disabled. In this particular case, it caused the
appearance of following warning messages as well as NMI backtraces of all
the cores that were doing the initialization.[ 68.240049] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 68.241000] rcu: 25-...0: (100 ticks this GP) idle=b72/1/0x4000000000000000 softirq=915/915 fqs=16252
[ 68.241000] rcu: 44-...0: (95 ticks this GP) idle=49a/1/0x4000000000000000 softirq=788/788 fqs=16253
[ 68.241000] rcu: 54-...0: (104 ticks this GP) idle=03a/1/0x4000000000000000 softirq=721/825 fqs=16253
[ 68.241000] rcu: 60-...0: (103 ticks this GP) idle=cbe/1/0x4000000000000000 softirq=637/740 fqs=16253
[ 68.241000] rcu: 72-...0: (105 ticks this GP) idle=786/1/0x4000000000000000 softirq=536/641 fqs=16253
[ 68.241000] rcu: 84-...0: (99 ticks this GP) idle=292/1/0x4000000000000000 softirq=537/537 fqs=16253
[ 68.241000] rcu: 111-...0: (104 ticks this GP) idle=bde/1/0x4000000000000000 softirq=474/476 fqs=16253
[ 68.241000] rcu: (detected by 13, t=65018 jiffies, g=249, q=2)The long init time was mainly caused by the call to kasan_free_pages() to
poison the newly initialized pages. On a 4TB system, we are talking about
almost 500GB of memory probably on the same node.In reality, we may not need to poison the newly initialized pages before
they are ever allocated. So KASAN poisoning of freed pages before the
completion of deferred memory initialization is now disabled. Those pages
will be properly poisoned when they are allocated or freed after deferred
pages initialization is done.With this change, the new page initialization time became:
[ 21.948010] node 1 initialised, 132075466 pages in 18702ms
This was still about double the non-debug kernel time, but was much
better than before.Link: http://lkml.kernel.org/r/1544459388-8736-1-git-send-email-longman@redhat.com
Signed-off-by: Waiman Long
Reviewed-by: Andrew Morton
Cc: Andrey Ryabinin
Cc: Alexander Potapenko
Cc: Dmitry Vyukov
Cc: Michal Hocko
Cc: Pasha Tatashin
Cc: Oscar Salvador
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
31 Jan, 2019
1 commit
-
commit 4aa9fc2a435abe95a1e8d7f8c7b3d6356514b37a upstream.
This reverts commit 2830bf6f05fb3e05bc4743274b806c821807a684.
The underlying assumption that one sparse section belongs into a single
numa node doesn't hold really. Robert Shteynfeld has reported a boot
failure. The boot log was not captured but his memory layout is as
follows:Early memory node ranges
node 1: [mem 0x0000000000001000-0x0000000000090fff]
node 1: [mem 0x0000000000100000-0x00000000dbdf8fff]
node 1: [mem 0x0000000100000000-0x0000001423ffffff]
node 0: [mem 0x0000001424000000-0x0000002023ffffff]This means that node0 starts in the middle of a memory section which is
also in node1. memmap_init_zone tries to initialize padding of a
section even when it is outside of the given pfn range because there are
code paths (e.g. memory hotplug) which assume that the full worth of
memory section is always initialized.In this particular case, though, such a range is already intialized and
most likely already managed by the page allocator. Scribbling over
those pages corrupts the internal state and likely blows up when any of
those pages gets used.Reported-by: Robert Shteynfeld
Fixes: 2830bf6f05fb ("mm, memory_hotplug: initialize struct pages for the full memory section")
Cc: stable@kernel.org
Signed-off-by: Michal Hocko
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
29 Dec, 2018
2 commits
-
commit 17e2e7d7e1b83fa324b3f099bfe426659aa3c2a4 upstream.
While playing with gigantic hugepages and memory_hotplug, I triggered
the following #PF when "cat memoryX/removable":BUG: unable to handle kernel NULL pointer dereference at 0000000000000008
#PF error: [normal kernel read fault]
PGD 0 P4D 0
Oops: 0000 [#1] SMP PTI
CPU: 1 PID: 1481 Comm: cat Tainted: G E 4.20.0-rc6-mm1-1-default+ #18
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org 04/01/2014
RIP: 0010:has_unmovable_pages+0x154/0x210
Call Trace:
is_mem_section_removable+0x7d/0x100
removable_show+0x90/0xb0
dev_attr_show+0x1c/0x50
sysfs_kf_seq_show+0xca/0x1b0
seq_read+0x133/0x380
__vfs_read+0x26/0x180
vfs_read+0x89/0x140
ksys_read+0x42/0x90
do_syscall_64+0x5b/0x180
entry_SYSCALL_64_after_hwframe+0x44/0xa9The reason is we do not pass the Head to page_hstate(), and so, the call
to compound_order() in page_hstate() returns 0, so we end up checking
all hstates's size to match PAGE_SIZE.Obviously, we do not find any hstate matching that size, and we return
NULL. Then, we dereference that NULL pointer in
hugepage_migration_supported() and we got the #PF from above.Fix that by getting the head page before calling page_hstate().
Also, since gigantic pages span several pageblocks, re-adjust the logic
for skipping pages. While are it, we can also get rid of the
round_up().[osalvador@suse.de: remove round_up(), adjust skip pages logic per Michal]
Link: http://lkml.kernel.org/r/20181221062809.31771-1-osalvador@suse.de
Link: http://lkml.kernel.org/r/20181217225113.17864-1-osalvador@suse.de
Signed-off-by: Oscar Salvador
Acked-by: Michal Hocko
Reviewed-by: David Hildenbrand
Cc: Vlastimil Babka
Cc: Pavel Tatashin
Cc: Mike Rapoport
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman -
commit 2830bf6f05fb3e05bc4743274b806c821807a684 upstream.
If memory end is not aligned with the sparse memory section boundary,
the mapping of such a section is only partly initialized. This may lead
to VM_BUG_ON due to uninitialized struct page access from
is_mem_section_removable() or test_pages_in_a_zone() function triggered
by memory_hotplug sysfs handlers:Here are the the panic examples:
CONFIG_DEBUG_VM=y
CONFIG_DEBUG_VM_PGFLAGS=ykernel parameter mem=2050M
--------------------------
page:000003d082008000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
( test_pages_in_a_zone+0xde/0x160)
show_valid_zones+0x5c/0x190
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
test_pages_in_a_zone+0xde/0x160
Kernel panic - not syncing: Fatal exception: panic_on_oopskernel parameter mem=3075M
--------------------------
page:000003d08300c000 is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
Call Trace:
( is_mem_section_removable+0xb4/0x190)
show_mem_removable+0x9a/0xd8
dev_attr_show+0x34/0x70
sysfs_kf_seq_show+0xc8/0x148
seq_read+0x204/0x480
__vfs_read+0x32/0x178
vfs_read+0x82/0x138
ksys_read+0x5a/0xb0
system_call+0xdc/0x2d8
Last Breaking-Event-Address:
is_mem_section_removable+0xb4/0x190
Kernel panic - not syncing: Fatal exception: panic_on_oopsFix the problem by initializing the last memory section of each zone in
memmap_init_zone() till the very end, even if it goes beyond the zone end.Michal said:
: This has alwways been problem AFAIU. It just went unnoticed because we
: have zeroed memmaps during allocation before f7f99100d8d9 ("mm: stop
: zeroing memory during allocation in vmemmap") and so the above test
: would simply skip these ranges as belonging to zone 0 or provided a
: garbage.
:
: So I guess we do care for post f7f99100d8d9 kernels mostly and
: therefore Fixes: f7f99100d8d9 ("mm: stop zeroing memory during
: allocation in vmemmap")Link: http://lkml.kernel.org/r/20181212172712.34019-2-zaslonko@linux.ibm.com
Fixes: f7f99100d8d9 ("mm: stop zeroing memory during allocation in vmemmap")
Signed-off-by: Mikhail Zaslonko
Reviewed-by: Gerald Schaefer
Suggested-by: Michal Hocko
Acked-by: Michal Hocko
Reported-by: Mikhail Gavrilov
Tested-by: Mikhail Gavrilov
Cc: Dave Hansen
Cc: Alexander Duyck
Cc: Pasha Tatashin
Cc: Martin Schwidefsky
Cc: Heiko Carstens
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Greg Kroah-Hartman
17 Dec, 2018
1 commit
-
[ Upstream commit 8f416836c0d50b198cad1225132e5abebf8980dc ]
init_currently_empty_zone() will adjust pgdat->nr_zones and set it to
'zone_idx(zone) + 1' unconditionally. This is correct in the normal
case, while not exact in hot-plug situation.This function is used in two places:
* free_area_init_core()
* move_pfn_range_to_zone()In the first case, we are sure zone index increase monotonically. While
in the second one, this is under users control.One way to reproduce this is:
----------------------------1. create a virtual machine with empty node1
-m 4G,slots=32,maxmem=32G \
-smp 4,maxcpus=8 \
-numa node,nodeid=0,mem=4G,cpus=0-3 \
-numa node,nodeid=1,mem=0G,cpus=4-72. hot-add cpu 3-7
cpu-add [3-7]
2. hot-add memory to nod1
object_add memory-backend-ram,id=ram0,size=1G
device_add pc-dimm,id=dimm0,memdev=ram0,node=13. online memory with following order
echo online_movable > memory47/state
echo online > memory40/stateAfter this, node1 will have its nr_zones equals to (ZONE_NORMAL + 1)
instead of (ZONE_MOVABLE + 1).Michal said:
"Having an incorrect nr_zones might result in all sorts of problems
which would be quite hard to debug (e.g. reclaim not considering the
movable zone). I do not expect many users would suffer from this it
but still this is trivial and obviously right thing to do so
backporting to the stable tree shouldn't be harmful (last famous
words)"Link: http://lkml.kernel.org/r/20181117022022.9956-1-richard.weiyang@gmail.com
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")
Signed-off-by: Wei Yang
Acked-by: Michal Hocko
Reviewed-by: Oscar Salvador
Cc: Anshuman Khandual
Cc: Dave Hansen
Signed-off-by: Andrew Morton
Signed-off-by: Linus TorvaldsSigned-off-by: Sasha Levin
01 Dec, 2018
2 commits
-
[ Upstream commit c63ae43ba53bc432b414fd73dd5f4b01fcb1ab43 ]
Konstantin has noticed that kvmalloc might trigger the following
warning:WARNING: CPU: 0 PID: 6676 at mm/vmstat.c:986 __fragmentation_index+0x54/0x60
[...]
Call Trace:
fragmentation_index+0x76/0x90
compaction_suitable+0x4f/0xf0
shrink_node+0x295/0x310
node_reclaim+0x205/0x250
get_page_from_freelist+0x649/0xad0
__alloc_pages_nodemask+0x12a/0x2a0
kmalloc_large_node+0x47/0x90
__kmalloc_node+0x22b/0x2e0
kvmalloc_node+0x3e/0x70
xt_alloc_table_info+0x3a/0x80 [x_tables]
do_ip6t_set_ctl+0xcd/0x1c0 [ip6_tables]
nf_setsockopt+0x44/0x60
SyS_setsockopt+0x6f/0xc0
do_syscall_64+0x67/0x120
entry_SYSCALL_64_after_hwframe+0x3d/0xa2the problem is that we only check for an out of bound order in the slow
path and the node reclaim might happen from the fast path already. This
is fixable by making sure that kvmalloc doesn't ever use kmalloc for
requests that are larger than KMALLOC_MAX_SIZE but this also shows that
the code is rather fragile. A recent UBSAN report just underlines that
by the following reportUBSAN: Undefined behaviour in mm/page_alloc.c:3117:19
shift exponent 51 is too large for 32-bit type 'int'
CPU: 0 PID: 6520 Comm: syz-executor1 Not tainted 4.19.0-rc2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0xd2/0x148 lib/dump_stack.c:113
ubsan_epilogue+0x12/0x94 lib/ubsan.c:159
__ubsan_handle_shift_out_of_bounds+0x2b6/0x30b lib/ubsan.c:425
__zone_watermark_ok+0x2c7/0x400 mm/page_alloc.c:3117
zone_watermark_fast mm/page_alloc.c:3216 [inline]
get_page_from_freelist+0xc49/0x44c0 mm/page_alloc.c:3300
__alloc_pages_nodemask+0x21e/0x640 mm/page_alloc.c:4370
alloc_pages_current+0xcc/0x210 mm/mempolicy.c:2093
alloc_pages include/linux/gfp.h:509 [inline]
__get_free_pages+0x12/0x60 mm/page_alloc.c:4414
dma_mem_alloc+0x36/0x50 arch/x86/include/asm/floppy.h:156
raw_cmd_copyin drivers/block/floppy.c:3159 [inline]
raw_cmd_ioctl drivers/block/floppy.c:3206 [inline]
fd_locked_ioctl+0xa00/0x2c10 drivers/block/floppy.c:3544
fd_ioctl+0x40/0x60 drivers/block/floppy.c:3571
__blkdev_driver_ioctl block/ioctl.c:303 [inline]
blkdev_ioctl+0xb3c/0x1a30 block/ioctl.c:601
block_ioctl+0x105/0x150 fs/block_dev.c:1883
vfs_ioctl fs/ioctl.c:46 [inline]
do_vfs_ioctl+0x1c0/0x1150 fs/ioctl.c:687
ksys_ioctl+0x9e/0xb0 fs/ioctl.c:702
__do_sys_ioctl fs/ioctl.c:709 [inline]
__se_sys_ioctl fs/ioctl.c:707 [inline]
__x64_sys_ioctl+0x7e/0xc0 fs/ioctl.c:707
do_syscall_64+0xc4/0x510 arch/x86/entry/common.c:290
entry_SYSCALL_64_after_hwframe+0x49/0xbeNote that this is not a kvmalloc path. It is just that the fast path
really depends on having sanitzed order as well. Therefore move the
order check to the fast path.Link: http://lkml.kernel.org/r/20181113094305.GM15120@dhcp22.suse.cz
Signed-off-by: Michal Hocko
Reported-by: Konstantin Khlebnikov
Reported-by: Kyungtae Kim
Acked-by: Vlastimil Babka
Cc: Balbir Singh
Cc: Mel Gorman
Cc: Pavel Tatashin
Cc: Oscar Salvador
Cc: Mike Rapoport
Cc: Aaron Lu
Cc: Joonsoo Kim
Cc: Byoungyoung Lee
Cc: "Dae R. Jeong"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin -
[ Upstream commit 9d7899999c62c1a81129b76d2a6ecbc4655e1597 ]
Page state checks are racy. Under a heavy memory workload (e.g. stress
-m 200 -t 2h) it is quite easy to hit a race window when the page is
allocated but its state is not fully populated yet. A debugging patch to
dump the struct page state showshas_unmovable_pages: pfn:0x10dfec00, found:0x1, count:0x0
page:ffffea0437fb0000 count:1 mapcount:1 mapping:ffff880e05239841 index:0x7f26e5000 compound_mapcount: 1
flags: 0x5fffffc0090034(uptodate|lru|active|head|swapbacked)Note that the state has been checked for both PageLRU and PageSwapBacked
already. Closing this race completely would require some sort of retry
logic. This can be tricky and error prone (think of potential endless
or long taking loops).Workaround this problem for movable zones at least. Such a zone should
only contain movable pages. Commit 15c30bc09085 ("mm, memory_hotplug:
make has_unmovable_pages more robust") has told us that this is not
strictly true though. Bootmem pages should be marked reserved though so
we can move the original check after the PageReserved check. Pages from
other zones are still prone to races but we even do not pretend that
memory hotremove works for those so pre-mature failure doesn't hurt that
much.Link: http://lkml.kernel.org/r/20181106095524.14629-1-mhocko@kernel.org
Fixes: 15c30bc09085 ("mm, memory_hotplug: make has_unmovable_pages more robust")
Signed-off-by: Michal Hocko
Reported-by: Baoquan He
Tested-by: Baoquan He
Acked-by: Baoquan He
Reviewed-by: Oscar Salvador
Acked-by: Balbir Singh
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
Signed-off-by: Sasha Levin
09 Oct, 2018
1 commit
-
Remove the leftover pglist_data::numabalancing_migrate_lock and its
initialization, we stopped using this lock with:efaffc5e40ae ("mm, sched/numa: Remove rate-limiting of automatic NUMA balancing migration")
[ mingo: Rewrote the changelog. ]
Signed-off-by: Srikar Dronamraju
Acked-by: Mel Gorman
Cc: Linus Torvalds
Cc: Linux-MM
Cc: Peter Zijlstra
Cc: Rik van Riel
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/1538824999-31230-1-git-send-email-srikar@linux.vnet.ibm.com
Signed-off-by: Ingo Molnar
02 Oct, 2018
1 commit
-
Rate limiting of page migrations due to automatic NUMA balancing was
introduced to mitigate the worst-case scenario of migrating at high
frequency due to false sharing or slowly ping-ponging between nodes.
Since then, a lot of effort was spent on correctly identifying these
pages and avoiding unnecessary migrations and the safety net may no longer
be required.Jirka Hladky reported a regression in 4.17 due to a scheduler patch that
avoids spreading STREAM tasks wide prematurely. However, once the task
was properly placed, it delayed migrating the memory due to rate limiting.
Increasing the limit fixed the problem for him.Currently, the limit is hard-coded and does not account for the real
capabilities of the hardware. Even if an estimate was attempted, it would
not properly account for the number of memory controllers and it could
not account for the amount of bandwidth used for normal accesses. Rather
than fudging, this patch simply eliminates the rate limiting.However, Jirka reports that a STREAM configuration using multiple
processes achieved similar performance to 4.16. In local tests, this patch
improved performance of STREAM relative to the baseline but it is somewhat
machine-dependent. Most workloads show little or not performance difference
implying that there is not a heavily reliance on the throttling mechanism
and it is safe to remove.STREAM on 2-socket machine
4.19.0-rc5 4.19.0-rc5
numab-v1r1 noratelimit-v1r1
MB/sec copy 43298.52 ( 0.00%) 44673.38 ( 3.18%)
MB/sec scale 30115.06 ( 0.00%) 31293.06 ( 3.91%)
MB/sec add 32825.12 ( 0.00%) 34883.62 ( 6.27%)
MB/sec triad 32549.52 ( 0.00%) 34906.60 ( 7.24%Signed-off-by: Mel Gorman
Reviewed-by: Rik van Riel
Acked-by: Peter Zijlstra
Cc: Jirka Hladky
Cc: Linus Torvalds
Cc: Linux-MM
Cc: Srikar Dronamraju
Cc: Thomas Gleixner
Link: http://lkml.kernel.org/r/20181001100525.29789-2-mgorman@techsingularity.net
Signed-off-by: Ingo Molnar
05 Sep, 2018
1 commit
-
When scanning for movable pages, filter out Hugetlb pages if hugepage
migration is not supported. Without this we hit infinte loop in
__offline_pages() where we dopfn = scan_movable_pages(start_pfn, end_pfn);
if (pfn) { /* We have movable pages */
ret = do_migrate_range(pfn, end_pfn);
goto repeat;
}Fix this by checking hugepage_migration_supported both in
has_unmovable_pages which is the primary backoff mechanism for page
offlining and for consistency reasons also into scan_movable_pages
because it doesn't make any sense to return a pfn to non-migrateable
huge page.This issue was revealed by, but not caused by 72b39cfc4d75 ("mm,
memory_hotplug: do not fail offlining too early").Link: http://lkml.kernel.org/r/20180824063314.21981-1-aneesh.kumar@linux.ibm.com
Fixes: 72b39cfc4d75 ("mm, memory_hotplug: do not fail offlining too early")
Signed-off-by: Aneesh Kumar K.V
Reported-by: Haren Myneni
Acked-by: Michal Hocko
Reviewed-by: Naoya Horiguchi
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
30 Aug, 2018
1 commit
-
The conversion of the hotplug notifiers to a state machine left the
notifier.h includes around in some places. Remove them.Signed-off-by: Mukesh Ojha
Signed-off-by: Thomas Gleixner
Link: https://lkml.kernel.org/r/1535114033-4605-1-git-send-email-mojha@codeaurora.org
24 Aug, 2018
1 commit
-
A process can be killed with SIGBUS(BUS_MCEERR_AR) when it tries to
allocate a page that was just freed on the way of soft-offline. This is
undesirable because soft-offline (which is about corrected error) is
less aggressive than hard-offline (which is about uncorrected error),
and we can make soft-offline fail and keep using the page for good
reason like "system is busy."Two main changes of this patch are:
- setting migrate type of the target page to MIGRATE_ISOLATE. As done
in free_unref_page_commit(), this makes kernel bypass pcplist when
freeing the page. So we can assume that the page is in freelist just
after put_page() returns,- setting PG_hwpoison on free page under zone->lock which protects
freelists, so this allows us to avoid setting PG_hwpoison on a page
that is decided to be allocated soon.[akpm@linux-foundation.org: tweak set_hwpoison_free_buddy_page() comment]
Link: http://lkml.kernel.org/r/1531452366-11661-3-git-send-email-n-horiguchi@ah.jp.nec.com
Signed-off-by: Naoya Horiguchi
Reported-by: Xishi Qiu
Tested-by: Mike Kravetz
Cc: Michal Hocko
Cc:
Cc: Mike Kravetz
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
23 Aug, 2018
5 commits
-
Currently, whenever a new node is created/re-used from the memhotplug
path, we call free_area_init_node()->free_area_init_core(). But there is
some code that we do not really need to run when we are coming from such
path.free_area_init_core() performs the following actions:
1) Initializes pgdat internals, such as spinlock, waitqueues and more.
2) Account # nr_all_pages and # nr_kernel_pages. These values are used later on
when creating hash tables.
3) Account number of managed_pages per zone, substracting dma_reserved and
memmap pages.
4) Initializes some fields of the zone structure data
5) Calls init_currently_empty_zone to initialize all the freelists
6) Calls memmap_init to initialize all pages belonging to certain zoneWhen called from memhotplug path, free_area_init_core() only performs
actions #1 and #4.Action #2 is pointless as the zones do not have any pages since either the
node was freed, or we are re-using it, eitherway all zones belonging to
this node should have 0 pages. For the same reason, action #3 results
always in manages_pages being 0.Action #5 and #6 are performed later on when onlining the pages:
online_pages()->move_pfn_range_to_zone()->init_currently_empty_zone()
online_pages()->move_pfn_range_to_zone()->memmap_init_zone()This patch does two things:
First, moves the node/zone initializtion to their own function, so it
allows us to create a small version of free_area_init_core, where we only
perform:1) Initialization of pgdat internals, such as spinlock, waitqueues and more
4) Initialization of some fields of the zone structure dataThese two functions are: pgdat_init_internals() and zone_init_internals().
The second thing this patch does, is to introduce
free_area_init_core_hotplug(), the memhotplug version of
free_area_init_core():Currently, we call free_area_init_node() from the memhotplug path. In
there, we set some pgdat's fields, and call calculate_node_totalpages().
calculate_node_totalpages() calculates the # of pages the node has.Since the node is either new, or we are re-using it, the zones belonging
to this node should not have any pages, so there is no point to calculate
this now.Actually, we re-set these values to 0 later on with the calls to:
reset_node_managed_pages()
reset_node_present_pages()The # of pages per node and the # of pages per zone will be calculated when
onlining the pages:online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_zone_range()
online_pages()->move_pfn_range()->move_pfn_range_to_zone()->resize_pgdat_range()Also, since free_area_init_core/free_area_init_node will now only get called during early init, let us replace
__paginginit with __init, so their code gets freed up.[osalvador@techadventures.net: fix section usage]
Link: http://lkml.kernel.org/r/20180731101752.GA473@techadventures.net
[osalvador@suse.de: v6]
Link: http://lkml.kernel.org/r/20180801122348.21588-6-osalvador@techadventures.net
Link: http://lkml.kernel.org/r/20180730101757.28058-5-osalvador@techadventures.net
Signed-off-by: Oscar Salvador
Reviewed-by: Pavel Tatashin
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Pasha Tatashin
Cc: Aaron Lu
Cc: Dan Williams
Cc: David Hildenbrand
Cc: Joonsoo Kim
Cc: Mel Gorman
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Let us move the code between CONFIG_DEFERRED_STRUCT_PAGE_INIT to an inline
function. Not having an ifdef in the function makes the code more
readable.Link: http://lkml.kernel.org/r/20180730101757.28058-4-osalvador@techadventures.net
Signed-off-by: Oscar Salvador
Acked-by: Michal Hocko
Reviewed-by: Pavel Tatashin
Acked-by: Vlastimil Babka
Cc: Aaron Lu
Cc: Dan Williams
Cc: David Hildenbrand
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Pasha Tatashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__paginginit is the same thing as __meminit except for platforms without
sparsemem, there it is defined as __init.Remove __paginginit and use __meminit. Use __ref in one single function
that merges __meminit and __init sections: setup_usemap().Link: http://lkml.kernel.org/r/20180801122348.21588-4-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin
Signed-off-by: Oscar Salvador
Reviewed-by: Oscar Salvador
Cc: Pasha Tatashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
zone->node is configured only when CONFIG_NUMA=y, so it is a good idea to
have inline functions to access this field in order to avoid ifdef's in c
files.Link: http://lkml.kernel.org/r/20180730101757.28058-3-osalvador@techadventures.net
Signed-off-by: Pavel Tatashin
Signed-off-by: Oscar Salvador
Reviewed-by: Oscar Salvador
Acked-by: Michal Hocko
Acked-by: Vlastimil Babka
Cc: Aaron Lu
Cc: Dan Williams
Cc: David Hildenbrand
Cc: Joonsoo Kim
Cc: Mel Gorman
Cc: Pasha Tatashin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Patch series "Refactor free_area_init_core and add
free_area_init_core_hotplug", v6.This patchset does three things:
1) Clean up/refactor free_area_init_core/free_area_init_node
by moving the ifdefery out of the functions.
2) Move the pgdat/zone initialization in free_area_init_core to its
own function.
3) Introduce free_area_init_core_hotplug, a small subset of
free_area_init_core, which is only called from memhotlug code path. In this
way, we have:free_area_init_core: called during early initialization
free_area_init_core_hotplug: called whenever a new node is allocated/re-used (memhotplug path)This patch (of 5):
Moving the #ifdefs out of the function makes it easier to follow.
Link: http://lkml.kernel.org/r/20180730101757.28058-2-osalvador@techadventures.net
Signed-off-by: Oscar Salvador
Acked-by: Michal Hocko
Reviewed-by: Pavel Tatashin
Acked-by: Vlastimil Babka
Cc: Pasha Tatashin
Cc: Mel Gorman
Cc: Aaron Lu
Cc: Joonsoo Kim
Cc: Dan Williams
Cc: David Hildenbrand
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
18 Aug, 2018
4 commits
-
To improve page allocator's performance for order-0 pages, each CPU has
a Per-CPU-Pageset(PCP) per zone. Whenever an order-0 page is needed,
PCP will be checked first before asking pages from Buddy. When PCP is
used up, a batch of pages will be fetched from Buddy to improve
performance and the size of batch can affect performance.zone's batch size gets doubled last time by commit ba56e91c9401("mm:
page_alloc: increase size of per-cpu-pages") over ten years ago. Since
then, CPU has envolved a lot and CPU's cache sizes also increased.Dave Hansen is concerned the current batch size doesn't fit well with
modern hardware and suggested me to do two things: first, use a page
allocator intensive benchmark, e.g. will-it-scale/page_fault1 to find
out how performance changes with different batch sizes on various
machines and then choose a new default batch size; second, see how this
new batch size work with other workloads.In the first test, we saw performance gains on high-core-count systems
and little to no effect on older systems with more modest core counts.
In this phase's test data, two candidates: 63 and 127 are chosen.In the second step, ebizzy, oltp, kbuild, pigz, netperf, vm-scalability
and more will-it-scale sub-tests are tested to see how these two
candidates work with these workloads and decides a new default according
to their results.Most test results are flat. will-it-scale/page_fault2 process mode has
10%-18% performance increase on 4-sockets Skylake and Broadwell.
vm-scalability/lru-file-mmap-read has 17%-47% performance increase for
4-sockets servers while for 2-sockets servers, it caused 3%-8% performance
drop. Further analysis showed that, with a larger pcp->batch and thus
larger pcp->high(the relationship of pcp->high=6 * pcp->batch is
maintained in this patch), zone lock contention shifted to LRU add side
lock contention and that caused performance drop. This performance drop
might be mitigated by others' work on optimizing LRU lock.Another downside of increasing pcp->batch is, when PCP is used up and need
to fetch a batch of pages from Buddy, since batch is increased, that time
can be longer than before. My understanding is, this doesn't affect
slowpath where direct reclaim and compaction dominates. For fastpath,
throughput is a win(according to will-it-scale/page_fault1) but worst
latency can be larger now.Overall, I think double the batch size from 31 to 63 is relatively safe
and provide good performance boost for high-core-count systems.The two phase's test results are listed below(all tests are done with THP
disabled).Phase one(will-it-scale/page_fault1) test results:
Skylake-EX: increased batch size has a good effect on zone->lock
contention, though LRU contention will rise at the same time and
limited the final performance increase.batch score change zone_contention lru_contention total_contention
31 15345900 +0.00% 64% 8% 72%
53 17903847 +16.67% 32% 38% 70%
63 17992886 +17.25% 24% 45% 69%
73 18022825 +17.44% 10% 61% 71%
119 18023401 +17.45% 4% 66% 70%
127 18029012 +17.48% 3% 66% 69%
137 18036075 +17.53% 4% 66% 70%
165 18035964 +17.53% 2% 67% 69%
188 18101105 +17.95% 2% 67% 69%
223 18130951 +18.15% 2% 67% 69%
255 18118898 +18.07% 2% 67% 69%
267 18101559 +17.96% 2% 67% 69%
299 18160468 +18.34% 2% 68% 70%
320 18139845 +18.21% 2% 67% 69%
393 18160869 +18.34% 2% 68% 70%
424 18170999 +18.41% 2% 68% 70%
458 18144868 +18.24% 2% 68% 70%
467 18142366 +18.22% 2% 68% 70%
498 18154549 +18.30% 1% 68% 69%
511 18134525 +18.17% 1% 69% 70%Broadwell-EX: similar pattern as Skylake-EX.
batch score change zone_contention lru_contention total_contention
31 16703983 +0.00% 67% 7% 74%
53 18195393 +8.93% 43% 28% 71%
63 18288885 +9.49% 38% 33% 71%
73 18344329 +9.82% 35% 37% 72%
119 18535529 +10.96% 24% 46% 70%
127 18513596 +10.83% 23% 48% 71%
137 18514327 +10.84% 23% 48% 71%
165 18511840 +10.82% 22% 49% 71%
188 18593478 +11.31% 17% 53% 70%
223 18601667 +11.36% 17% 52% 69%
255 18774825 +12.40% 12% 58% 70%
267 18754781 +12.28% 9% 60% 69%
299 18892265 +13.10% 7% 63% 70%
320 18873812 +12.99% 8% 62% 70%
393 18891174 +13.09% 6% 64% 70%
424 18975108 +13.60% 6% 64% 70%
458 18932364 +13.34% 8% 62% 70%
467 18960891 +13.51% 5% 65% 70%
498 18944526 +13.41% 5% 64% 69%
511 18960839 +13.51% 5% 64% 69%Skylake-EP: although increased batch reduced zone->lock contention, but
the effect is not as good as EX: zone->lock contention is still as high as
20% with a very high batch value instead of 1% on Skylake-EX or 5% on
Broadwell-EX. Also, total_contention actually decreased with a higher
batch but that doesn't translate to performance increase.batch score change zone_contention lru_contention total_contention
31 9554867 +0.00% 66% 3% 69%
53 9855486 +3.15% 63% 3% 66%
63 9980145 +4.45% 62% 4% 66%
73 10092774 +5.63% 62% 5% 67%
119 10310061 +7.90% 45% 19% 64%
127 10342019 +8.24% 42% 19% 61%
137 10358182 +8.41% 42% 21% 63%
165 10397060 +8.81% 37% 24% 61%
188 10341808 +8.24% 34% 26% 60%
223 10349135 +8.31% 31% 27% 58%
255 10327189 +8.08% 28% 29% 57%
267 10344204 +8.26% 27% 29% 56%
299 10325043 +8.06% 25% 30% 55%
320 10310325 +7.91% 25% 31% 56%
393 10293274 +7.73% 21% 31% 52%
424 10311099 +7.91% 21% 32% 53%
458 10321375 +8.02% 21% 32% 53%
467 10303881 +7.84% 21% 32% 53%
498 10332462 +8.14% 20% 33% 53%
511 10325016 +8.06% 20% 32% 52%Broadwell-EP: zone->lock and lru lock had an agreement to make sure
performance doesn't increase and they successfully managed to keep total
contention at 70%.batch score change zone_contention lru_contention total_contention
31 10121178 +0.00% 19% 50% 69%
53 10142366 +0.21% 6% 63% 69%
63 10117984 -0.03% 11% 58% 69%
73 10123330 +0.02% 7% 63% 70%
119 10108791 -0.12% 2% 67% 69%
127 10166074 +0.44% 3% 66% 69%
137 10141574 +0.20% 3% 66% 69%
165 10154499 +0.33% 2% 68% 70%
188 10124921 +0.04% 2% 67% 69%
223 10137399 +0.16% 2% 67% 69%
255 10143289 +0.22% 0% 68% 68%
267 10123535 +0.02% 1% 68% 69%
299 10140952 +0.20% 0% 68% 68%
320 10163170 +0.41% 0% 68% 68%
393 10000633 -1.19% 0% 69% 69%
424 10087998 -0.33% 0% 69% 69%
458 10187116 +0.65% 0% 69% 69%
467 10146790 +0.25% 0% 69% 69%
498 10197958 +0.76% 0% 69% 69%
511 10152326 +0.31% 0% 69% 69%Haswell-EP: similar to Broadwell-EP.
batch score change zone_contention lru_contention total_contention
31 10442205 +0.00% 14% 48% 62%
53 10442255 +0.00% 5% 57% 62%
63 10452059 +0.09% 6% 57% 63%
73 10482349 +0.38% 5% 59% 64%
119 10454644 +0.12% 3% 60% 63%
127 10431514 -0.10% 3% 59% 62%
137 10423785 -0.18% 3% 60% 63%
165 10481216 +0.37% 2% 61% 63%
188 10448755 +0.06% 2% 61% 63%
223 10467144 +0.24% 2% 61% 63%
255 10480215 +0.36% 2% 61% 63%
267 10484279 +0.40% 2% 61% 63%
299 10466450 +0.23% 2% 61% 63%
320 10452578 +0.10% 2% 61% 63%
393 10499678 +0.55% 1% 62% 63%
424 10481454 +0.38% 1% 62% 63%
458 10473562 +0.30% 1% 62% 63%
467 10484269 +0.40% 0% 62% 62%
498 10505599 +0.61% 0% 62% 62%
511 10483395 +0.39% 0% 62% 62%Westmere-EP: contention is pretty small so not interesting. Note too high
a batch value could hurt performance.batch score change zone_contention lru_contention total_contention
31 4831523 +0.00% 2% 3% 5%
53 4834086 +0.05% 2% 4% 6%
63 4834262 +0.06% 2% 3% 5%
73 4832851 +0.03% 2% 4% 6%
119 4830534 -0.02% 1% 3% 4%
127 4827461 -0.08% 1% 4% 5%
137 4827459 -0.08% 1% 3% 4%
165 4820534 -0.23% 0% 4% 4%
188 4817947 -0.28% 0% 3% 3%
223 4809671 -0.45% 0% 3% 3%
255 4802463 -0.60% 0% 4% 4%
267 4801634 -0.62% 0% 3% 3%
299 4798047 -0.69% 0% 3% 3%
320 4793084 -0.80% 0% 3% 3%
393 4785877 -0.94% 0% 3% 3%
424 4782911 -1.01% 0% 3% 3%
458 4779346 -1.08% 0% 3% 3%
467 4780306 -1.06% 0% 3% 3%
498 4780589 -1.05% 0% 3% 3%
511 4773724 -1.20% 0% 3% 3%Skylake-Desktop: similar to Westmere-EP, nothing interesting.
batch score change zone_contention lru_contention total_contention
31 3906608 +0.00% 2% 3% 5%
53 3940164 +0.86% 2% 3% 5%
63 3937289 +0.79% 2% 3% 5%
73 3940201 +0.86% 2% 3% 5%
119 3933240 +0.68% 2% 3% 5%
127 3930514 +0.61% 2% 4% 6%
137 3938639 +0.82% 0% 3% 3%
165 3908755 +0.05% 0% 3% 3%
188 3905621 -0.03% 0% 3% 3%
223 3903015 -0.09% 0% 4% 4%
255 3889480 -0.44% 0% 3% 3%
267 3891669 -0.38% 0% 4% 4%
299 3898728 -0.20% 0% 4% 4%
320 3894547 -0.31% 0% 4% 4%
393 3875137 -0.81% 0% 4% 4%
424 3874521 -0.82% 0% 3% 3%
458 3880432 -0.67% 0% 4% 4%
467 3888715 -0.46% 0% 3% 3%
498 3888633 -0.46% 0% 4% 4%
511 3875305 -0.80% 0% 5% 5%Haswell-Desktop: zone->lock is pretty low as other desktops, though lru
contention is higher than other desktops.batch score change zone_contention lru_contention total_contention
31 3511158 +0.00% 2% 5% 7%
53 3555445 +1.26% 2% 6% 8%
63 3561082 +1.42% 2% 6% 8%
73 3547218 +1.03% 2% 6% 8%
119 3571319 +1.71% 1% 7% 8%
127 3549375 +1.09% 0% 6% 6%
137 3560233 +1.40% 0% 6% 6%
165 3555176 +1.25% 2% 6% 8%
188 3551501 +1.15% 0% 8% 8%
223 3531462 +0.58% 0% 7% 7%
255 3570400 +1.69% 0% 7% 7%
267 3532235 +0.60% 1% 8% 9%
299 3562326 +1.46% 0% 6% 6%
320 3553569 +1.21% 0% 8% 8%
393 3539519 +0.81% 0% 7% 7%
424 3549271 +1.09% 0% 8% 8%
458 3528885 +0.50% 0% 8% 8%
467 3526554 +0.44% 0% 7% 7%
498 3525302 +0.40% 0% 9% 9%
511 3527556 +0.47% 0% 8% 8%Sandybridge-Desktop: the 0% contention isn't accurate but caused by
dropped fractional part. Since multiple contention path's contentions
are all under 1% here, with some arithmetic operations like add, the
final deviation could be as large as 3%.batch score change zone_contention lru_contention total_contention
31 1744495 +0.00% 0% 0% 0%
53 1755341 +0.62% 0% 0% 0%
63 1758469 +0.80% 0% 0% 0%
73 1759626 +0.87% 0% 0% 0%
119 1770417 +1.49% 0% 0% 0%
127 1768252 +1.36% 0% 0% 0%
137 1767848 +1.34% 0% 0% 0%
165 1765088 +1.18% 0% 0% 0%
188 1766918 +1.29% 0% 0% 0%
223 1767866 +1.34% 0% 0% 0%
255 1768074 +1.35% 0% 0% 0%
267 1763187 +1.07% 0% 0% 0%
299 1765620 +1.21% 0% 0% 0%
320 1767603 +1.32% 0% 0% 0%
393 1764612 +1.15% 0% 0% 0%
424 1758476 +0.80% 0% 0% 0%
458 1758593 +0.81% 0% 0% 0%
467 1757915 +0.77% 0% 0% 0%
498 1753363 +0.51% 0% 0% 0%
511 1755548 +0.63% 0% 0% 0%Phase two test results:
Note: all percent change is against base(batch=31).ebizzy.throughput (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 2410037±7% 2600451±2% +7.9% 2602878 +8.0%
lkp-bdw-ex1 1493328 1489243 -0.3% 1492145 -0.1%
lkp-skl-2sp2 1329674 1345891 +1.2% 1351056 +1.6%
lkp-bdw-ep2 711511 711511 0.0% 710708 -0.1%
lkp-wsm-ep2 75750 75528 -0.3% 75441 -0.4%
lkp-skl-d01 264126 262791 -0.5% 264113 +0.0%
lkp-hsw-d01 176601 176328 -0.2% 176368 -0.1%
lkp-sb02 98937 98937 +0.0% 99030 +0.1%kbuild.buildtime (less is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 107.00 107.67 +0.6% 107.11 +0.1%
lkp-bdw-ex1 97.33 97.33 +0.0% 97.42 +0.1%
lkp-skl-2sp2 180.00 179.83 -0.1% 179.83 -0.1%
lkp-bdw-ep2 178.17 179.17 +0.6% 177.50 -0.4%
lkp-wsm-ep2 737.00 738.00 +0.1% 738.00 +0.1%
lkp-skl-d01 642.00 653.00 +1.7% 653.00 +1.7%
lkp-hsw-d01 1310.00 1316.00 +0.5% 1311.00 +0.1%netperf/TCP_STREAM.Throughput_total_Mbps (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 948790 947144 -0.2% 948333 -0.0%
lkp-bdw-ex1 904224 904366 +0.0% 904926 +0.1%
lkp-skl-2sp2 239731 239607 -0.1% 239565 -0.1%
lk-bdw-ep2 365764 365933 +0.0% 365951 +0.1%
lkp-wsm-ep2 93736 93803 +0.1% 93808 +0.1%
lkp-skl-d01 77314 77303 -0.0% 77375 +0.1%
lkp-hsw-d01 58617 60387 +3.0% 60208 +2.7%
lkp-sb02 29990 30137 +0.5% 30103 +0.4%oltp.transactions (higer is better)
machine batch=31 batch=63 batch=127
lkp-bdw-ex1 9073276 9100377 +0.3% 9036344 -0.4%
lkp-skl-2sp2 8898717 8852054 -0.5% 8894459 -0.0%
lkp-bdw-ep2 13426155 13384654 -0.3% 13333637 -0.7%
lkp-hsw-ep2 13146314 13232784 +0.7% 13193163 +0.4%
lkp-wsm-ep2 5035355 5019348 -0.3% 5033418 -0.0%
lkp-skl-d01 418485 4413339 -0.1% 4419039 +0.0%
lkp-hsw-d01 3517817±5% 3396120±3% -3.5% 3455138±3% -1.8%pigz.throughput (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.513e+08 1.507e+08 -0.4% 1.511e+08 -0.2%
lkp-bdw-ex1 2.060e+08 2.052e+08 -0.4% 2.044e+08 -0.8%
lkp-skl-2sp2 8.836e+08 8.845e+08 +0.1% 8.836e+08 -0.0%
lkp-bdw-ep2 8.275e+08 8.464e+08 +2.3% 8.330e+08 +0.7%
lkp-wsm-ep2 2.224e+08 2.221e+08 -0.2% 2.218e+08 -0.3%
lkp-skl-d01 1.177e+08 1.177e+08 -0.0% 1.176e+08 -0.1%
lkp-hsw-d01 1.154e+08 1.154e+08 +0.1% 1.154e+08 -0.0%
lkp-sb02 0.633e+08 0.633e+08 +0.1% 0.633e+08 +0.0%will-it-scale.malloc1.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 620181 620484 +0.0% 620240 +0.0%
lkp-bdw-ex1 1403610 1401201 -0.2% 1417900 +1.0%
lkp-skl-2sp2 1288097 1284145 -0.3% 1283907 -0.3%
lkp-bdw-ep2 1427879 1427675 -0.0% 1428266 +0.0%
lkp-hsw-ep2 1362546 1353965 -0.6% 1354759 -0.6%
lkp-wsm-ep2 2099657 2107576 +0.4% 2100226 +0.0%
lkp-skl-d01 1476835 1476358 -0.0% 1474487 -0.2%
lkp-hsw-d01 1308810 1303429 -0.4% 1301299 -0.6%
lkp-sb02 589286 589284 -0.0% 588101 -0.2%will-it-scale.malloc1.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 21289 21125 -0.8% 21241 -0.2%
lkp-bdw-ex1 28114 28089 -0.1% 28007 -0.4%
lkp-skl-2sp2 91866 91946 +0.1% 92723 +0.9%
lkp-bdw-ep2 37637 37501 -0.4% 37317 -0.9%
lkp-hsw-ep2 43673 43590 -0.2% 43754 +0.2%
lkp-wsm-ep2 28577 28298 -1.0% 28545 -0.1%
lkp-skl-d01 175277 173343 -1.1% 173082 -1.3%
lkp-hsw-d01 130303 129566 -0.6% 129250 -0.8%
lkp-sb02 113742±3% 116911 +2.8% 116417±3% +2.4%will-it-scale.malloc2.processes (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.206e+09 1.206e+09 -0.0% 1.206e+09 +0.0%
lkp-bdw-ex1 1.319e+09 1.319e+09 -0.0% 1.319e+09 +0.0%
lkp-skl-2sp2 8.000e+08 8.021e+08 +0.3% 7.995e+08 -0.1%
lkp-bdw-ep2 6.582e+08 6.634e+08 +0.8% 6.513e+08 -1.1%
lkp-hsw-ep2 6.671e+08 6.669e+08 -0.0% 6.665e+08 -0.1%
lkp-wsm-ep2 1.805e+08 1.806e+08 +0.0% 1.804e+08 -0.1%
lkp-skl-d01 1.611e+08 1.611e+08 -0.0% 1.610e+08 -0.0%
lkp-hsw-d01 1.333e+08 1.332e+08 -0.0% 1.332e+08 -0.0%
lkp-sb02 82485104 82478206 -0.0% 82473546 -0.0%will-it-scale.malloc2.threads (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.574e+09 1.574e+09 -0.0% 1.574e+09 -0.0%
lkp-bdw-ex1 1.737e+09 1.737e+09 +0.0% 1.737e+09 -0.0%
lkp-skl-2sp2 9.161e+08 9.162e+08 +0.0% 9.181e+08 +0.2%
lkp-bdw-ep2 7.856e+08 8.015e+08 +2.0% 8.113e+08 +3.3%
lkp-hsw-ep2 6.908e+08 6.904e+08 -0.1% 6.907e+08 -0.0%
lkp-wsm-ep2 2.409e+08 2.409e+08 +0.0% 2.409e+08 -0.0%
lkp-skl-d01 1.199e+08 1.199e+08 -0.0% 1.199e+08 -0.0%
lkp-hsw-d01 1.029e+08 1.029e+08 -0.0% 1.029e+08 +0.0%
lkp-sb02 68081213 68061423 -0.0% 68076037 -0.0%will-it-scale.page_fault2.processes (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 14509125±4% 16472364 +13.5% 17123117 +18.0%
lkp-bdw-ex1 14736381 16196588 +9.9% 16364011 +11.0%
lkp-skl-2sp2 6354925 6435444 +1.3% 6436644 +1.3%
lkp-bdw-ep2 8749584 8834422 +1.0% 8827179 +0.9%
lkp-hsw-ep2 8762591 8845920 +1.0% 8825697 +0.7%
lkp-wsm-ep2 3036083 3030428 -0.2% 3021741 -0.5%
lkp-skl-d01 2307834 2304731 -0.1% 2286142 -0.9%
lkp-hsw-d01 1806237 1800786 -0.3% 1795943 -0.6%
lkp-sb02 842616 837844 -0.6% 833921 -1.0%will-it-scale.page_fault2.threads
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1623294 1615132±2% -0.5% 1656777 +2.1%
lkp-bdw-ex1 1995714 2025948 +1.5% 2113753±3% +5.9%
lkp-skl-2sp2 2346708 2415591 +2.9% 2416919 +3.0%
lkp-bdw-ep2 2342564 2344882 +0.1% 2300206 -1.8%
lkp-hsw-ep2 1820658 1831681 +0.6% 1844057 +1.3%
lkp-wsm-ep2 1725482 1733774 +0.5% 1740517 +0.9%
lkp-skl-d01 1832833 1823628 -0.5% 1806489 -1.4%
lkp-hsw-d01 1427913 1427287 -0.0% 1420226 -0.5%
lkp-sb02 750626 748615 -0.3% 746621 -0.5%will-it-scale.page_fault3.processes (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 24382726 24400317 +0.1% 24668774 +1.2%
lkp-bdw-ex1 35399750 35683124 +0.8% 35829492 +1.2%
lkp-skl-2sp2 28136820 28068248 -0.2% 28147989 +0.0%
lkp-bdw-ep2 37269077 37459490 +0.5% 37373073 +0.3%
lkp-hsw-ep2 36224967 36114085 -0.3% 36104908 -0.3%
lkp-wsm-ep2 16820457 16911005 +0.5% 16968596 +0.9%
lkp-skl-d01 7721138 7725904 +0.1% 7756740 +0.5%
lkp-hsw-d01 7611979 7650928 +0.5% 7651323 +0.5%
lkp-sb02 3781546 3796502 +0.4% 3796827 +0.4%will-it-scale.page_fault3.threads (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1865820±3% 1900917±2% +1.9% 1826245±4% -2.1%
lkp-bdw-ex1 3094060 3148326 +1.8% 3150036 +1.8%
lkp-skl-2sp2 3952940 3953898 +0.0% 3989360 +0.9%
lkp-bdw-ep2 3420373±3% 3643964 +6.5% 3644910±5% +6.6%
lkp-hsw-ep2 2609635±2% 2582310±3% -1.0% 2780459 +6.5%
lkp-wsm-ep2 4395001 4417196 +0.5% 4432499 +0.9%
lkp-skl-d01 5363977 5400003 +0.7% 5411370 +0.9%
lkp-hsw-d01 5274131 5311294 +0.7% 5319359 +0.9%
lkp-sb02 2917314 2913004 -0.1% 2935286 +0.6%will-it-scale.read1.processes (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 73762279±14% 69322519±10% -6.0% 69349855±13% -6.0% (result unstable)
lkp-bdw-ex1 1.701e+08 1.704e+08 +0.1% 1.705e+08 +0.2%
lkp-skl-2sp2 63111570 63113953 +0.0% 63836573 +1.1%
lkp-bdw-ep2 79247409 79424610 +0.2% 78012656 -1.6%
lkp-hsw-ep2 67677026 68308800 +0.9% 67539106 -0.2%
lkp-wsm-ep2 13339630 13939817 +4.5% 13766865 +3.2%
lkp-skl-d01 10969487 10972650 +0.0% no data
lkp-hsw-d01 9857342±2% 10080592±2% +2.3% 10131560 +2.8%
lkp-sb02 5189076 5197473 +0.2% 5163253 -0.5%will-it-scale.read1.threads (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 62468045±12% 73666726±7% +17.9% 79553123±12% +27.4% (result unstable)
lkp-bdw-ex1 1.62e+08 1.624e+08 +0.3% 1.614e+08 -0.3%
lkp-skl-2sp2 58319780 59181032 +1.5% 59821353 +2.6%
lkp-bdw-ep2 74057992 75698171 +2.2% 74990869 +1.3%
lkp-hsw-ep2 63672959 63639652 -0.1% 64387051 +1.1%
lkp-wsm-ep2 13489943 13526058 +0.3% 13259032 -1.7%
lkp-skl-d01 10297906 10338796 +0.4% 10407328 +1.1%
lkp-hsw-d01 9636721 9667376 +0.3% 9341147 -3.1%
lkp-sb02 4801938 4804496 +0.1% 4802290 +0.0%will-it-scale.write1.processes (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.111e+08 1.104e+08±2% -0.7% 1.122e+08±2% +1.0%
lkp-bdw-ex1 1.392e+08 1.399e+08 +0.5% 1.397e+08 +0.4%
lkp-skl-2sp2 59369233 58994841 -0.6% 58715168 -1.1%
lkp-bdw-ep2 61820979 CPU throttle 63593123 +2.9%
lkp-hsw-ep2 57897587 57435605 -0.8% 56347450 -2.7%
lkp-wsm-ep2 7814203 7918017±2% +1.3% 7669068 -1.9%
lkp-skl-d01 8886557 8971422 +1.0% 8818366 -0.8%
lkp-hsw-d01 9171001±5% 9189915 +0.2% 9483909 +3.4%
lkp-sb02 4475406 4475294 -0.0% 4501756 +0.6%will-it-scale.write1.threads (higer is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1.058e+08 1.055e+08±2% -0.2% 1.065e+08 +0.7%
lkp-bdw-ex1 1.316e+08 1.300e+08 -1.2% 1.308e+08 -0.6%
lkp-skl-2sp2 54492421 56086678 +2.9% 55975657 +2.7%
lkp-bdw-ep2 59360449 59003957 -0.6% 58101262 -2.1%
lkp-hsw-ep2 53346346±2% 52530876 -1.5% 52902487 -0.8%
lkp-wsm-ep2 7774006 7800092±2% +0.3% 7558833 -2.8%
lkp-skl-d01 8346174 8235695 -1.3% no data
lkp-hsw-d01 8636244 8655731 +0.2% 8658868 +0.3%
lkp-sb02 4181820 4204107 +0.5% 4182992 +0.0%vm-scalability.anon-r-rand.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 11933873±3% 12356544±2% +3.5% 12188624 +2.1%
lkp-bdw-ex1 7114424±2% 7330949±2% +3.0% 7392419 +3.9%
lkp-skl-2sp2 6773277±5% 6492332±8% -4.1% 6543962 -3.4%
lkp-bdw-ep2 7133846±4% 7233508 +1.4% 7013518±3% -1.7%
lkp-hsw-ep2 4576626 4527098 -1.1% 4551679 -0.5%
lkp-wsm-ep2 2583599 2592492 +0.3% 2588039 +0.2%
lkp-hsw-d01 998199±2% 1028311 +3.0% 1006460±2% +0.8%
lkp-sb02 570572 567854 -0.5% 568449 -0.4%vm-scalability.anon-r-rand-mt.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1789419 1787830 -0.1% 1788208 -0.1%
lkp-bdw-ex1 3492595±2% 3554966±2% +1.8% 3558835±3% +1.9%
lkp-skl-2sp2 3856238±2% 3975403±4% +3.1% 3994600 +3.6%
lkp-bdw-ep2 3726963±11% 3809292±6% +2.2% 3871924±4% +3.9%
lkp-hsw-ep2 2131760±3% 2033578±4% -4.6% 2130727±6% -0.0%
lkp-wsm-ep2 2369731 2368384 -0.1% 2370252 +0.0%
lkp-skl-d01 1207128 1206220 -0.1% 1205801 -0.1%
lkp-hsw-d01 964317 992329±2% +2.9% 992099±2% +2.9%
lkp-sb02 567137 567346 +0.0% 566144 -0.2%vm-scalability.lru-file-mmap-read.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 19560469±6% 23018999 +17.7% 23418800 +19.7%
lkp-bdw-ex1 17769135±14% 26141676±3% +47.1% 26284723±5% +47.9%
lkp-skl-2sp2 14056512 13578884 -3.4% 13146214 -6.5%
lkp-bdw-ep2 15336542 14737654 -3.9% 14088159 -8.1%
lkp-hsw-ep2 16275498 15756296 -3.2% 15018090 -7.7%
lkp-wsm-ep2 11272160 11237231 -0.3% 11310047 +0.3%
lkp-skl-d01 7322119 7324569 +0.0% 7184148 -1.9%
lkp-hsw-d01 6449234 6404542 -0.7% 6356141 -1.4%
lkp-sb02 3517943 3520668 +0.1% 3527309 +0.3%vm-scalability.lru-file-mmap-read-rand.throughput (higher is better)
machine batch=31 batch=63 batch=127
lkp-skl-4sp1 1689052 1697553 +0.5% 1698726 +0.6%
lkp-bdw-ex1 1675246 1699764 +1.5% 1712226 +2.2%
lkp-skl-2sp2 1800533 1799749 -0.0% 1800581 +0.0%
lkp-bdw-ep2 1807422 1807758 +0.0% 1804932 -0.1%
lkp-hsw-ep2 1809807 1808781 -0.1% 1807811 -0.1%
lkp-wsm-ep2 1800198 1802434 +0.1% 1801236 +0.1%
lkp-skl-d01 696689 695537 -0.2% 694106 -0.4%
lkp-hsw-d01 698364 698666 +0.0% 696686 -0.2%
lkp-sb02 258939 258787 -0.1% 258199 -0.3%Link: http://lkml.kernel.org/r/20180711055855.29072-1-aaron.lu@intel.com
Signed-off-by: Aaron Lu
Suggested-by: Dave Hansen
Acked-by: Michal Hocko
Acked-by: Jesper Dangaard Brouer
Cc: Huang Ying
Cc: Kemi Wang
Cc: Tim Chen
Cc: Andi Kleen
Cc: Vlastimil Babka
Cc: Mel Gorman
Cc: Jesper Dangaard Brouer
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
There is no real reason to blow up just because the caller doesn't know
that __get_free_pages cannot return highmem pages. Simply fix that up
silently. Even if we have some confused users such a fixup will not be
harmful.[akpm@linux-foundation.org: mask off __GFP_HIGHMEM]
Link: http://lkml.kernel.org/r/20180622162841.25114-1-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reviewed-by: Andrew Morton
Cc: Jiankang Chen
Cc: Mel Gorman
Cc: Johannes Weiner
Cc: Yisheng Xie
Cc: Hanjun Guo
Cc: Kefeng Wang
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
__alloc_pages_slowpath() has for a long time contained code to ignore
node restrictions from memory policies for high priority allocations.
The current code that resets the zonelist iterator however does
effectively nothing after commit 7810e6781e0f ("mm, page_alloc: do not
break __GFP_THISNODE by zonelist reset") removed a buggy zonelist reset.
Even before that commit, mempolicy restrictions were still not ignored,
as they are passed in ac->nodemask which is untouched by the code.We can either remove the code, or make it work as intended. Since
ac->nodemask can be set from task's mempolicy via alloc_pages_current()
and thus also alloc_pages(), it may indeed affect kernel allocations,
and it makes sense to ignore it to allow progress for high priority
allocations.Thus, this patch resets ac->nodemask to NULL in such cases. This
assumes all callers can handle it (i.e. there are no guarantees as in
the case of __GFP_THISNODE) which seems to be the case. The same
assumption is already present in check_retry_cpuset() for some time.The expected effect is that high priority kernel allocations in the
context of userspace tasks (e.g. OOM victims) restricted by mempolicies
will have higher chance to succeed if they are restricted to nodes with
depleted memory, while there are other nodes with free memory left.It's not a new intention, but for the first time the code will match the
intention, AFAICS. It was intended by commit 183f6371aac2 ("mm: ignore
mempolicies when using ALLOC_NO_WATERMARK") in v3.6 but I think it never
really worked, as mempolicy restriction was already encoded in nodemask,
not zonelist, at that time.So originally that was for ALLOC_NO_WATERMARK only. Then it was
adjusted by e46e7b77c909 ("mm, page_alloc: recalculate the preferred
zoneref if the context can ignore memory policies") and cd04ae1e2dc8
("mm, oom: do not rely on TIF_MEMDIE for memory reserves access") to the
current state. So even GFP_ATOMIC would now ignore mempolicies after
the initial attempts fail - if the code worked as people thought it
does.Link: http://lkml.kernel.org/r/20180612122624.8045-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Acked-by: Michal Hocko
Acked-by: Mel Gorman
Cc: David Rientjes
Cc: Joonsoo Kim
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
The role of zero_resv_unavail() is to make sure that every struct page
that is allocated but is not backed by memory that is accessible by
kernel is zeroed and not in some uninitialized state.Since struct pages are allocated in blocks (2M pages in x86 case), we
can skip pageblock_nr_pages at a time, when the first one is found to be
invalid.This optimization may help since now on x86 every hole in e820 maps is
marked as reserved in memblock, and thus will go through this function.This function is called before sched_clock() is initialized, so I used
my x86 early boot clock patches to measure the performance improvement.With 1T hole on i7-8700 currently we would take 0.606918s of boot time,
but with this optimization 0.001103s.Link: http://lkml.kernel.org/r/20180615155733.1175-1-pasha.tatashin@oracle.com
Signed-off-by: Pavel Tatashin
Reviewed-by: Oscar Salvador
Reviewed-by: Naoya Horiguchi
Cc: Pasha Tatashin
Cc: Steven Sistare
Cc: Daniel Jordan
Cc: Michal Hocko
Cc: Matthew Wilcox
Cc: Ingo Molnar
Cc: Dan Williams
Cc: "Huang, Ying"
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
15 Aug, 2018
1 commit
-
Pull power management updates from Rafael Wysocki:
"These add a new framework for CPU idle time injection, to be used by
all of the idle injection code in the kernel in the future, fix some
issues and add a number of relatively small extensions in multiple
places.Specifics:
- Add a new framework for CPU idle time injection (Daniel Lezcano).
- Add AVS support to the armada-37xx cpufreq driver (Gregory
CLEMENT).- Add support for current CPU frequency reporting to the ACPI CPPC
cpufreq driver (George Cherian).- Rework the cooling device registration in the imx6q/thermal driver
(Bastian Stender).- Make the pcc-cpufreq driver refuse to work with dynamic scaling
governors on systems with many CPUs to avoid scalability issues
with it (Rafael Wysocki).- Fix the intel_pstate driver to report different maximum CPU
frequencies on systems where they really are different and to
ignore the turbo active ratio if hardware-managend P-states (HWP)
are in use; make it use the match_string() helper (Xie Yisheng,
Srinivas Pandruvada).- Fix a minor deferred probe issue in the qcom-kryo cpufreq driver
(Niklas Cassel).- Add a tracepoint for the tracking of frequency limits changes (from
Andriod) to the cpufreq core (Ruchi Kandoi).- Fix a circular lock dependency between CPU hotplug and sysfs
locking in the cpufreq core reported by lockdep (Waiman Long).- Avoid excessive error reports on driver registration failures in
the ARM cpuidle driver (Sudeep Holla).- Add a new device links flag to the driver core to make links go
away automatically on supplier driver removal (Vivek Gautam).- Eliminate potential race condition between system-wide power
management transitions and system shutdown (Pingfan Liu).- Add a quirk to save NVS memory on system suspend for the ASUS 1025C
laptop (Willy Tarreau).- Make more systems use suspend-to-idle (instead of ACPI S3) by
default (Tristian Celestin).- Get rid of stack VLA usage in the low-level hibernation code on
64-bit x86 (Kees Cook).- Fix error handling in the hibernation core and mark an expected
fall-through switch in it (Chengguang Xu, Gustavo Silva).- Extend the generic power domains (genpd) framework to support
attaching a device to a power domain by name (Ulf Hansson).- Fix device reference counting and user limits initialization in the
devfreq core (Arvind Yadav, Matthias Kaehlcke).- Fix a few issues in the rk3399_dmc devfreq driver and improve its
documentation (Enric Balletbo i Serra, Lin Huang, Nick Milner).- Drop a redundant error message from the exynos-ppmu devfreq driver
(Markus Elfring)"* tag 'pm-4.19-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rafael/linux-pm: (35 commits)
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
cpufreq: intel_pstate: Ignore turbo active ratio in HWP
cpufreq: Fix a circular lock dependency problem
cpu/hotplug: Add a cpus_read_trylock() function
x86/power/hibernate_64: Remove VLA usage
cpufreq: trace frequency limits change
cpufreq: intel_pstate: Show different max frequency with turbo 3 and HWP
cpufreq: pcc-cpufreq: Disable dynamic scaling on many-CPU systems
cpufreq: qcom-kryo: Silently error out on EPROBE_DEFER
cpufreq / CPPC: Add cpuinfo_cur_freq support for CPPC
cpufreq: armada-37xx: Add AVS support
dt-bindings: marvell: Add documentation for the Armada 3700 AVS binding
PM / devfreq: rk3399_dmc: Fix duplicated opp table on reload.
PM / devfreq: Init user limits from OPP limits, not viceversa
PM / devfreq: rk3399_dmc: fix spelling mistakes.
PM / devfreq: rk3399_dmc: do not print error when get supply and clk defer.
dt-bindings: devfreq: rk3399_dmc: move interrupts to be optional.
PM / devfreq: rk3399_dmc: remove wait for dcf irq event.
dt-bindings: clock: add rk3399 DDR3 standard speed bins.
...
14 Aug, 2018
1 commit
-
Merge changes in the PM core, system-wide PM infrastructure, generic
power domains (genpd) framework, ACPI PM infrastructure and cpuidle
for 4.19.* pm-core:
driver core: Add flag to autoremove device link on supplier unbind
driver core: Rename flag AUTOREMOVE to AUTOREMOVE_CONSUMER* pm-domains:
PM / Domains: Introduce dev_pm_domain_attach_by_name()
PM / Domains: Introduce option to attach a device by name to genpd
PM / Domains: dt: Add a power-domain-names property* pm-sleep:
PM / reboot: Eliminate race between reboot and suspend
PM / hibernate: Mark expected switch fall-through
x86/power/hibernate_64: Remove VLA usage
PM / hibernate: cast PAGE_SIZE to int when comparing with error code* acpi-pm:
ACPI / PM: save NVS memory for ASUS 1025C laptop
ACPI / PM: Default to s2idle in all machines supporting LP S0* pm-cpuidle:
ARM: cpuidle: silence error on driver registration failure
06 Aug, 2018
2 commits
-
At present, "systemctl suspend" and "shutdown" can run in parrallel. A
system can suspend after devices_shutdown(), and resume. Then the shutdown
task goes on to power off. This causes many devices are not really shut
off. Hence replacing reboot_mutex with system_transition_mutex (renamed
from pm_mutex) to achieve the exclusion. The renaming of pm_mutex as
system_transition_mutex can be better to reflect the purpose of the mutex.Signed-off-by: Pingfan Liu
Acked-by: Pavel Machek
Signed-off-by: Rafael J. Wysocki -
free_reserved_area() takes pointers as arguments to show which addresses
should be freed. However, it does this in a somewhat ambiguous way. If it
gets a kernel direct map address, it always works. However, if it gets an
address that is part of the kernel image alias mapping, it can fail.It fails if all of the following happen:
* The specified address is part of the kernel image alias
* Poisoning is requested (forcing a memset())
* The address is in a read-only portion of the kernel imageThe memset() fails on the read-only mapping, of course.
free_reserved_area() *is* called both on the direct map and on kernel image
alias addresses. We've just lucked out thus far that the kernel image
alias areas it gets used on are read-write. I'm fairly sure this has been
just a happy accident.It is quite easy to make free_reserved_area() work for all cases: just
convert the address to a direct map address before doing the memset(), and
do this unconditionally. There is little chance of a regression here
because we previously did a virt_to_page() on the address for the memset,
so we know these are not highmem pages for which virt_to_page() would fail.Signed-off-by: Dave Hansen
Signed-off-by: Thomas Gleixner
Cc: keescook@google.com
Cc: aarcange@redhat.com
Cc: jgross@suse.com
Cc: jpoimboe@redhat.com
Cc: gregkh@linuxfoundation.org
Cc: peterz@infradead.org
Cc: hughd@google.com
Cc: torvalds@linux-foundation.org
Cc: bp@alien8.de
Cc: luto@kernel.org
Cc: ak@linux.intel.com
Cc: Kees Cook
Cc: Andrea Arcangeli
Cc: Juergen Gross
Cc: Josh Poimboeuf
Cc: Greg Kroah-Hartman
Cc: Peter Zijlstra
Cc: Hugh Dickins
Cc: Linus Torvalds
Cc: Borislav Petkov
Cc: Andy Lutomirski
Cc: Andi Kleen
Link: https://lkml.kernel.org/r/20180802225826.1287AE3E@viggo.jf.intel.com
17 Jul, 2018
1 commit
-
Moving zero_resv_unavail before memmap_init_zone(), caused a regression on
x86-32.The cause is that we access struct pages before they are allocated when
CONFIG_FLAT_NODE_MEM_MAP is used.free_area_init_nodes()
zero_resv_unavail()
mm_zero_struct_page(pfn_to_page(pfn));
Tested-by: Matt Hart
Acked-by: Michal Hocko
Signed-off-by: Linus Torvalds
15 Jul, 2018
1 commit
-
We must zero struct pages for memory that is not backed by physical
memory, or kernel does not have access to.Recently, there was a change which zeroed all memmap for all holes in
e820. Unfortunately, it introduced a bug that is discussed here:https://www.spinics.net/lists/linux-mm/msg156764.html
Linus, also saw this bug on his machine, and confirmed that reverting
commit 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into
memblock.reserved") fixes the issue.The problem is that we incorrectly zero some struct pages after they
were setup.The fix is to zero unavailable struct pages prior to initializing of
struct pages.A more detailed fix should come later that would avoid double zeroing
cases: one in __init_single_page(), the other one in
zero_resv_unavail().Fixes: 124049decbb1 ("x86/e820: put !E820_TYPE_RAM regions into memblock.reserved")
Signed-off-by: Pavel Tatashin
Signed-off-by: Linus Torvalds
15 Jun, 2018
1 commit
-
mm/*.c files use symbolic and octal styles for permissions.
Using octal and not symbolic permissions is preferred by many as more
readable.https://lkml.org/lkml/2016/8/2/1945
Prefer the direct use of octal for permissions.
Done using
$ scripts/checkpatch.pl -f --types=SYMBOLIC_PERMS --fix-inplace mm/*.c
and some typing.Before: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
44
After: $ git grep -P -w "0[0-7]{3,3}" mm | wc -l
86Miscellanea:
o Whitespace neatening around these conversions.
Link: http://lkml.kernel.org/r/2e032ef111eebcd4c5952bae86763b541d373469.1522102887.git.joe@perches.com
Signed-off-by: Joe Perches
Acked-by: David Rientjes
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
08 Jun, 2018
8 commits
-
In __alloc_pages_slowpath() we reset zonelist and preferred_zoneref for
allocations that can ignore memory policies. The zonelist is obtained
from current CPU's node. This is a problem for __GFP_THISNODE
allocations that want to allocate on a different node, e.g. because the
allocating thread has been migrated to a different CPU.This has been observed to break SLAB in our 4.4-based kernel, because
there it relies on __GFP_THISNODE working as intended. If a slab page
is put on wrong node's list, then further list manipulations may corrupt
the list because page_to_nid() is used to determine which node's
list_lock should be locked and thus we may take a wrong lock and race.Current SLAB implementation seems to be immune by luck thanks to commit
511e3a058812 ("mm/slab: make cache_grow() handle the page allocated on
arbitrary node") but there may be others assuming that __GFP_THISNODE
works as promised.We can fix it by simply removing the zonelist reset completely. There
is actually no reason to reset it, because memory policies and cpusets
don't affect the zonelist choice in the first place. This was different
when commit 183f6371aac2 ("mm: ignore mempolicies when using
ALLOC_NO_WATERMARK") introduced the code, as mempolicies provided their
own restricted zonelists.We might consider this for 4.17 although I don't know if there's
anything currently broken.SLAB is currently not affected, but in kernels older than 4.7 that don't
yet have 511e3a058812 ("mm/slab: make cache_grow() handle the page
allocated on arbitrary node") it is. That's at least 4.4 LTS. Older
ones I'll have to check.So stable backports should be more important, but will have to be
reviewed carefully, as the code went through many changes. BTW I think
that also the ac->preferred_zoneref reset is currently useless if we
don't also reset ac->nodemask from a mempolicy to NULL first (which we
probably should for the OOM victims etc?), but I would leave that for a
separate patch.Link: http://lkml.kernel.org/r/20180525130853.13915-1-vbabka@suse.cz
Signed-off-by: Vlastimil Babka
Fixes: 183f6371aac2 ("mm: ignore mempolicies when using ALLOC_NO_WATERMARK")
Acked-by: Mel Gorman
Cc: Michal Hocko
Cc: David Rientjes
Cc: Joonsoo Kim
Cc: Vlastimil Babka
Cc:
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
This gives us five words of space in a single union in struct page. The
compound_mapcount moves position (from offset 24 to offset 20) on 64-bit
systems, but that does not seem likely to cause any trouble.Link: http://lkml.kernel.org/r/20180518194519.3820-11-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Vlastimil Babka
Acked-by: Kirill A. Shutemov
Cc: Christoph Lameter
Cc: Dave Hansen
Cc: Jérôme Glisse
Cc: Lai Jiangshan
Cc: Martin Schwidefsky
Cc: Pekka Enberg
Cc: Randy Dunlap
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Now that we can represent the location of 'deferred_list' in C instead of
comments, make use of that ability.Link: http://lkml.kernel.org/r/20180518194519.3820-9-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Vlastimil Babka
Acked-by: Kirill A. Shutemov
Cc: Christoph Lameter
Cc: Dave Hansen
Cc: Jérôme Glisse
Cc: Lai Jiangshan
Cc: Martin Schwidefsky
Cc: Pekka Enberg
Cc: Randy Dunlap
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
We're already using a union of many fields here, so stop abusing the
_mapcount and make page_type its own field. That implies renaming some of
the machinery that creates PageBuddy, PageBalloon and PageKmemcg; bring
back the PG_buddy, PG_balloon and PG_kmemcg names.As suggested by Kirill, make page_type a bitmask. Because it starts out
life as -1 (thanks to sharing the storage with _mapcount), setting a page
flag means clearing the appropriate bit. This gives us space for probably
twenty or so extra bits (depending how paranoid we want to be about
_mapcount underflow).Link: http://lkml.kernel.org/r/20180518194519.3820-3-willy@infradead.org
Signed-off-by: Matthew Wilcox
Acked-by: Kirill A. Shutemov
Acked-by: Vlastimil Babka
Cc: Christoph Lameter
Cc: Dave Hansen
Cc: Jérôme Glisse
Cc: Lai Jiangshan
Cc: Martin Schwidefsky
Cc: Pekka Enberg
Cc: Randy Dunlap
Cc: Andrey Ryabinin
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
finalise_ac() has parameter order which is not used at all. Remove it.
Signed-off-by: Huaisheng Ye
Acked-by: Michal Hocko
Reviewed-by: Andrew Morton
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
is_pageblock_removable_nolock() is not used outside of
mm/memory_hotplug.c. Move it next to unique caller
is_mem_section_removable() and make it static.Remove prototype in to silence gcc warning (W=1):
mm/page_alloc.c:7704:6: warning: no previous prototype for `is_pageblock_removable_nolock' [-Wmissing-prototypes]
Link: http://lkml.kernel.org/r/20180509190001.24789-1-malat@debian.org
Signed-off-by: Mathieu Malaterre
Suggested-by: Michal Hocko
Reviewed-by: Andrew Morton
Acked-by: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
While revisiting my Btrfs swapfile series [1], I introduced a situation
in which reclaim would lock i_rwsem, and even though the swapon() path
clearly made GFP_KERNEL allocations while holding i_rwsem, I got no
complaints from lockdep. It turns out that the rework of the fs_reclaim
annotation was broken: if the current task has PF_MEMALLOC set, we don't
acquire the dummy fs_reclaim lock, but when reclaiming we always check
this _after_ we've just set the PF_MEMALLOC flag. In most cases, we can
fix this by moving the fs_reclaim_{acquire,release}() outside of the
memalloc_noreclaim_{save,restore}(), althought kswapd is slightly
different. After applying this, I got the expected lockdep splats.1: https://lwn.net/Articles/625412/
Link: http://lkml.kernel.org/r/9f8aa70652a98e98d7c4de0fc96a4addcee13efe.1523778026.git.osandov@fb.com
Fixes: d92a8cfcb37e ("locking/lockdep: Rework FS_RECLAIM annotation")
Signed-off-by: Omar Sandoval
Reviewed-by: Andrew Morton
Cc: Peter Zijlstra
Cc: Tetsuo Handa
Cc: Ingo Molnar
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds -
Highmem's realsize always equals to freesize, so it is not necessary to
spare a variable to record this.Link: http://lkml.kernel.org/r/20180413083859.65888-1-richard.weiyang@gmail.com
Signed-off-by: Wei Yang
Reviewed-by: Andrew Morton
Cc: Michal Hocko
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
26 May, 2018
1 commit
-
Oscar has reported:
: Due to an unfortunate setting with movablecore, memblocks containing bootmem
: memory (pages marked by get_page_bootmem()) ended up marked in zone_movable.
: So while trying to remove that memory, the system failed in do_migrate_range
: and __offline_pages never returned.
:
: This can be reproduced by running
: qemu-system-x86_64 -m 6G,slots=8,maxmem=8G -numa node,mem=4096M -numa node,mem=2048M
: and movablecore=4G kernel command line
:
: linux kernel: BIOS-provided physical RAM map:
: linux kernel: BIOS-e820: [mem 0x0000000000000000-0x000000000009fbff] usable
: linux kernel: BIOS-e820: [mem 0x000000000009fc00-0x000000000009ffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000000f0000-0x00000000000fffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000000100000-0x00000000bffdffff] usable
: linux kernel: BIOS-e820: [mem 0x00000000bffe0000-0x00000000bfffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000feffc000-0x00000000feffffff] reserved
: linux kernel: BIOS-e820: [mem 0x00000000fffc0000-0x00000000ffffffff] reserved
: linux kernel: BIOS-e820: [mem 0x0000000100000000-0x00000001bfffffff] usable
: linux kernel: NX (Execute Disable) protection: active
: linux kernel: SMBIOS 2.8 present.
: linux kernel: DMI: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.0.0-prebuilt.qemu-project.org
: linux kernel: Hypervisor detected: KVM
: linux kernel: e820: update [mem 0x00000000-0x00000fff] usable ==> reserved
: linux kernel: e820: remove [mem 0x000a0000-0x000fffff] usable
: linux kernel: last_pfn = 0x1c0000 max_arch_pfn = 0x400000000
:
: linux kernel: SRAT: PXM 0 -> APIC 0x00 -> Node 0
: linux kernel: SRAT: PXM 1 -> APIC 0x01 -> Node 1
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00000000-0x0009ffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x00100000-0xbfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0x13fffffff]
: linux kernel: ACPI: SRAT: Node 1 PXM 1 [mem 0x140000000-0x1bfffffff]
: linux kernel: ACPI: SRAT: Node 0 PXM 0 [mem 0x1c0000000-0x43fffffff] hotplug
: linux kernel: NUMA: Node 0 [mem 0x00000000-0x0009ffff] + [mem 0x00100000-0xbfffffff] -> [mem 0x0
: linux kernel: NUMA: Node 0 [mem 0x00000000-0xbfffffff] + [mem 0x100000000-0x13fffffff] -> [mem 0
: linux kernel: NODE_DATA(0) allocated [mem 0x13ffd6000-0x13fffffff]
: linux kernel: NODE_DATA(1) allocated [mem 0x1bffd3000-0x1bfffcfff]
:
: zoneinfo shows that the zone movable is placed into both numa nodes:
: Node 0, zone Movable
: pages free 160140
: min 1823
: low 2278
: high 2733
: spanned 262144
: present 262144
: managed 245670
: Node 1, zone Movable
: pages free 448427
: min 3827
: low 4783
: high 5739
: spanned 524288
: present 524288
: managed 515766Note how only Node 0 has a hutplugable memory region which would rule it
out from the early memblock allocations (most likely memmap). Node1
will surely contain memmaps on the same node and those would prevent
offlining to succeed. So this is arguably a configuration issue.
Although one could argue that we should be more clever and rule early
allocations from the zone movable. This would be correct but probably
not worth the effort considering what a hack movablecore is.Anyway, We could do better for those cases though. We rely on
start_isolate_page_range resp. has_unmovable_pages to do their job.
The first one isolates the whole range to be offlined so that we do not
allocate from it anymore and the later makes sure we are not stumbling
over non-migrateable pages.has_unmovable_pages is overly optimistic, however. It doesn't check all
the pages if we are withing zone_movable because we rely that those
pages will be always migrateable. As it turns out we are still not
perfect there. While bootmem pages in zonemovable sound like a clear
bug which should be fixed let's remove the optimization for now and warn
if we encounter unmovable pages in zone_movable in the meantime. That
should help for now at least.Btw. this wasn't a real problem until commit 72b39cfc4d75 ("mm,
memory_hotplug: do not fail offlining too early") because we used to
have a small number of retries and then failed. This turned out to be
too fragile though.Link: http://lkml.kernel.org/r/20180523125555.30039-2-mhocko@kernel.org
Signed-off-by: Michal Hocko
Reported-by: Oscar Salvador
Tested-by: Oscar Salvador
Reviewed-by: Pavel Tatashin
Cc: Vlastimil Babka
Cc: Reza Arbab
Cc: Igor Mammedov
Cc: Vitaly Kuznetsov
Cc: Anshuman Khandual
Signed-off-by: Andrew Morton
Signed-off-by: Linus Torvalds
25 May, 2018
1 commit
-
This reverts the following commits that change CMA design in MM.
3d2054ad8c2d ("ARM: CMA: avoid double mapping to the CMA area if CONFIG_HIGHMEM=y")
1d47a3ec09b5 ("mm/cma: remove ALLOC_CMA")
bad8c6c0b114 ("mm/cma: manage the memory of the CMA area by using the ZONE_MOVABLE")
Ville reported a following error on i386.
Inode-cache hash table entries: 65536 (order: 6, 262144 bytes)
microcode: microcode updated early to revision 0x4, date = 2013-06-28
Initializing CPU#0
Initializing HighMem for node 0 (000377fe:00118000)
Initializing Movable for node 0 (00000001:00118000)
BUG: Bad page state in process swapper pfn:377fe
page:f53effc0 count:0 mapcount:-127 mapping:00000000 index:0x0
flags: 0x80000000()
raw: 80000000 00000000 00000000 ffffff80 00000000 00000100 00000200 00000001
page dumped because: nonzero mapcount
Modules linked in:
CPU: 0 PID: 0 Comm: swapper Not tainted 4.17.0-rc5-elk+ #145
Hardware name: Dell Inc. Latitude E5410/03VXMC, BIOS A15 07/11/2013
Call Trace:
dump_stack+0x60/0x96
bad_page+0x9a/0x100
free_pages_check_bad+0x3f/0x60
free_pcppages_bulk+0x29d/0x5b0
free_unref_page_commit+0x84/0xb0
free_unref_page+0x3e/0x70
__free_pages+0x1d/0x20
free_highmem_page+0x19/0x40
add_highpages_with_active_regions+0xab/0xeb
set_highmem_pages_init+0x66/0x73
mem_init+0x1b/0x1d7
start_kernel+0x17a/0x363
i386_start_kernel+0x95/0x99
startup_32_smp+0x164/0x168The reason for this error is that the span of MOVABLE_ZONE is extended
to whole node span for future CMA initialization, and, normal memory is
wrongly freed here. I submitted the fix and it seems to work, but,
another problem happened.It's so late time to fix the later problem so I decide to reverting the
series.Reported-by: Ville Syrjälä
Acked-by: Laura Abbott
Acked-by: Michal Hocko
Cc: Andrew Morton
Signed-off-by: Joonsoo Kim
Signed-off-by: Linus Torvalds