29 Apr, 2020

1 commit

  • commit 56df70a63ed5d989c1d36deee94cae14342be6e9 upstream.

    find_mergeable_vma() can return NULL. In this case it leads to a crash
    when we access vm_mm (its offset is 0x40) later in write_protect_page().
    This case did happen on our server. The following call trace was
    captured on a 4.19 kernel, with the patch below applied and KSM zero
    pages enabled, on our server.

    commit e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring")

    So add a vma check to fix it.
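    A minimal sketch of the check, in the KSM zero-page merge path where
    find_mergeable_vma() is called (simplified; names follow mm/ksm.c of
    that era, the exact placement here is illustrative):

    down_read(&mm->mmap_sem);
    vma = find_mergeable_vma(mm, rmap_item->address);
    if (vma) {
            err = try_to_merge_one_page(vma, page,
                                        ZERO_PAGE(rmap_item->address));
    } else {
            /* If the vma is out of date, there is nothing to merge. */
            err = 0;
    }
    up_read(&mm->mmap_sem);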

    BUG: unable to handle kernel NULL pointer dereference at 0000000000000040
    Oops: 0000 [#1] SMP NOPTI
    CPU: 9 PID: 510 Comm: ksmd Kdump: loaded Tainted: G OE 4.19.36.bsk.9-amd64 #4.19.36.bsk.9
    RIP: try_to_merge_one_page+0xc7/0x760
    Code: 24 58 65 48 33 34 25 28 00 00 00 89 e8 0f 85 a3 06 00 00 48 83 c4
    60 5b 5d 41 5c 41 5d 41 5e 41 5f c3 48 8b 46 08 a8 01 75 b8
    8b 44 24 40 4c 8d 7c 24 20 b9 07 00 00 00 4c 89 e6 4c 89 ff 48
    RSP: 0018:ffffadbdd9fffdb0 EFLAGS: 00010246
    RAX: ffffda83ffd4be08 RBX: ffffda83ffd4be40 RCX: 0000002c6e800000
    RDX: 0000000000000000 RSI: ffffda83ffd4be40 RDI: 0000000000000000
    RBP: ffffa11939f02ec0 R08: 0000000094e1a447 R09: 00000000abe76577
    R10: 0000000000000962 R11: 0000000000004e6a R12: 0000000000000000
    R13: ffffda83b1e06380 R14: ffffa18f31f072c0 R15: ffffda83ffd4be40
    FS: 0000000000000000(0000) GS:ffffa0da43b80000(0000) knlGS:0000000000000000
    CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    CR2: 0000000000000040 CR3: 0000002c77c0a003 CR4: 00000000007626e0
    DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
    DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
    PKRU: 55555554
    Call Trace:
    ksm_scan_thread+0x115e/0x1960
    kthread+0xf5/0x130
    ret_from_fork+0x1f/0x30

    [songmuchun@bytedance.com: if the vma is out of date, just exit]
    Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
    [akpm@linux-foundation.org: add the conventional braces, replace /** with /*]
    Fixes: e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring")
    Co-developed-by: Xiongchun Duan
    Signed-off-by: Muchun Song
    Signed-off-by: Andrew Morton
    Reviewed-by: David Hildenbrand
    Reviewed-by: Kirill Tkhai
    Cc: Hugh Dickins
    Cc: Yang Shi
    Cc: Claudio Imbrenda
    Cc: Markus Elfring
    Cc:
    Link: http://lkml.kernel.org/r/20200416025034.29780-1-songmuchun@bytedance.com
    Link: http://lkml.kernel.org/r/20200414132905.83819-1-songmuchun@bytedance.com
    Signed-off-by: Linus Torvalds
    Signed-off-by: Greg Kroah-Hartman

    Muchun Song
     

23 Nov, 2019

1 commit

  • It's possible to hit the WARN_ON_ONCE(page_mapped(page)) in
    remove_stable_node() when it races with __mmput() and squeezes in
    between ksm_exit() and exit_mmap().

    WARNING: CPU: 0 PID: 3295 at mm/ksm.c:888 remove_stable_node+0x10c/0x150

    Call Trace:
    remove_all_stable_nodes+0x12b/0x330
    run_store+0x4ef/0x7b0
    kernfs_fop_write+0x200/0x420
    vfs_write+0x154/0x450
    ksys_write+0xf9/0x1d0
    do_syscall_64+0x99/0x510
    entry_SYSCALL_64_after_hwframe+0x49/0xbe

    Remove the warning as there is nothing scary going on.

    Link: http://lkml.kernel.org/r/20191119131850.5675-1-aryabinin@virtuozzo.com
    Fixes: cbf86cfe04a6 ("ksm: remove old stable nodes more thoroughly")
    Signed-off-by: Andrey Ryabinin
    Acked-by: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrey Ryabinin
     

25 Sep, 2019

1 commit

  • Patch series "THP aware uprobe", v13.

    This patchset makes uprobe aware of THPs.

    Currently, when a uprobe is attached to text on a THP, the page is split
    by FOLL_SPLIT. As a result, uprobe eliminates the performance benefit of
    THP.

    This set makes uprobe THP-aware. Instead of FOLL_SPLIT, we introduce
    FOLL_SPLIT_PMD, which only splits the PMD for uprobe.

    After all uprobes within the THP are removed, the PTE-mapped pages are
    regrouped into a huge PMD.

    This set (plus a few THP patches) is also available at

    https://github.com/liu-song-6/linux/tree/uprobe-thp

    This patch (of 6):

    Move memcmp_pages() to mm/util.c and pages_identical() to mm.h, so that we
    can use them in other files.
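    A sketch of what the moved helpers look like after the change (modulo
    exact kmap details):

    /* mm/util.c: byte-compare the contents of two pages */
    int memcmp_pages(struct page *page1, struct page *page2)
    {
            char *addr1, *addr2;
            int ret;

            addr1 = kmap_atomic(page1);
            addr2 = kmap_atomic(page2);
            ret = memcmp(addr1, addr2, PAGE_SIZE);
            kunmap_atomic(addr2);
            kunmap_atomic(addr1);
            return ret;
    }

    /* include/linux/mm.h */
    static inline int pages_identical(struct page *page1, struct page *page2)
    {
            return !memcmp_pages(page1, page2);
    }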

    Link: http://lkml.kernel.org/r/20190815164525.1848545-2-songliubraving@fb.com
    Signed-off-by: Song Liu
    Acked-by: Kirill A. Shutemov
    Reviewed-by: Oleg Nesterov
    Cc: Johannes Weiner
    Cc: Matthew Wilcox
    Cc: William Kucharski
    Cc: Srikar Dronamraju
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Song Liu
     

19 Jun, 2019

1 commit

  • Based on 1 normalized pattern(s):

    this work is licensed under the terms of the gnu gpl version 2

    extracted by the scancode license scanner the SPDX license identifier

    GPL-2.0-only

    has been chosen to replace the boilerplate/reference in 48 file(s).

    Signed-off-by: Thomas Gleixner
    Reviewed-by: Allison Randal
    Reviewed-by: Enrico Weigelt
    Cc: linux-spdx@vger.kernel.org
    Link: https://lkml.kernel.org/r/20190604081204.624030236@linutronix.de
    Signed-off-by: Greg Kroah-Hartman

    Thomas Gleixner
     

15 May, 2019

2 commits

    This updates each existing invalidation to use the correct mmu notifier
    event that represents what is happening to the CPU page table. See the
    patch which introduced the events for the rationale behind this.

    Link: http://lkml.kernel.org/r/20190326164747.24405-7-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
    CPU page table updates can happen for many reasons, not only as a result
    of a syscall (munmap(), mprotect(), mremap(), madvise(), ...) but also as
    a result of kernel activities (memory compression, reclaim, migration,
    ...).

    Users of the mmu notifier API track changes to the CPU page table and
    take specific actions for them. However, the current API only provides
    the range of virtual addresses affected by the change, not why the
    change is happening.

    This patchset does the initial mechanical conversion of all the places
    that call mmu_notifier_range_init to also provide the default
    MMU_NOTIFY_UNMAP event as well as the vma if it is known (most
    invalidations happen against a given vma). Passing down the vma allows
    the users of the mmu notifier to inspect the new vma page protection.

    MMU_NOTIFY_UNMAP is always the safe default, as users of the mmu notifier
    should assume that every mapping for the range is going away when that
    event happens. A later patch converts the mm call paths to use more
    appropriate events for each call.

    This is done as 2 patches so that no call site is forgotten, especially
    as it uses the following coccinelle patch:

    %<----------------------------------------------------------------------
    @@
    expression E1, E3, E4;
    identifier VMA;
    @@
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    VMA->vm_mm, E3, E4)
    ...>

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(..., struct vm_area_struct *VMA, ...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN, VMA;
    @@
    FN(...) {
    struct vm_area_struct *VMA;
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, VMA,
    E2, E3, E4)
    ...>
    }

    @@
    expression E1, E2, E3, E4;
    identifier FN;
    @@
    FN(...) {
    <...
    mmu_notifier_range_init(E1,
    +MMU_NOTIFY_UNMAP, 0, NULL,
    E2, E3, E4)
    ...>
    }
    ---------------------------------------------------------------------->%

    Applied with:
    spatch --all-includes --sp-file mmu-notifier.spatch fs/proc/task_mmu.c --in-place
    spatch --sp-file mmu-notifier.spatch --dir kernel/events/ --in-place
    spatch --sp-file mmu-notifier.spatch --dir mm --in-place

    Link: http://lkml.kernel.org/r/20190326164747.24405-6-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Reviewed-by: Ralph Campbell
    Reviewed-by: Ira Weiny
    Cc: Christian König
    Cc: Joonas Lahtinen
    Cc: Jani Nikula
    Cc: Rodrigo Vivi
    Cc: Jan Kara
    Cc: Andrea Arcangeli
    Cc: Peter Xu
    Cc: Felix Kuehling
    Cc: Jason Gunthorpe
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Christian Koenig
    Cc: John Hubbard
    Cc: Arnd Bergmann
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

06 Mar, 2019

3 commits

    ksmd needs to search the stable tree to look for a suitable KSM page,
    but the KSM page might be locked for a while due to, e.g., a KSM page
    rmap walk. Basically it is not a big deal since commit 2c653d0ee2ae
    ("ksm: introduce ksm_max_page_sharing per page deduplication limit"),
    since max_page_sharing limits the number of shared KSM pages.

    But it is still not worth waiting for the lock; the page can be
    skipped and merged in the next scan, to avoid a potential stall if
    its content is still intact.

    Introduce a trylock mode to get_ksm_page() to not block on the page
    lock, like what try_to_merge_one_page() does. And define the three
    possible operations (nolock, lock and trylock) as an enum type to avoid
    stacking up bools and make the code more readable.

    Return -EBUSY if trylock fails, since NULL means no suitable KSM page
    was found, which is a valid case.

    With the default max_page_sharing setting (256), there is almost no
    observed change comparing lock vs trylock.

    However, with ksm02 of LTP, which sets max_page_sharing to 786432, a
    reduced ksmd full scan time can be observed. With the lock version,
    ksmd may take 10s - 11s to run two full scans; with the trylock version
    ksmd may take 8s - 11s to run two full scans. And the numbers of
    pages_sharing and pages_to_scan stay the same. Basically, this change
    has no harm.
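    A sketch of the enum and the trylock handling described above
    (simplified from what get_ksm_page() ends up doing):

    enum get_ksm_page_flags {
            GET_KSM_PAGE_NOLOCK,
            GET_KSM_PAGE_LOCK,
            GET_KSM_PAGE_TRYLOCK
    };

    /* inside get_ksm_page(stable_node, flags), after the page is pinned: */
    if (flags == GET_KSM_PAGE_TRYLOCK) {
            if (!trylock_page(page)) {
                    put_page(page);
                    return ERR_PTR(-EBUSY); /* caller retries next scan */
            }
    } else if (flags == GET_KSM_PAGE_LOCK)
            lock_page(page);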

    [hughd@google.com: fix BUG_ON()]
    Link: http://lkml.kernel.org/r/alpine.LSU.2.11.1902182122280.6914@eggly.anvils
    Link: http://lkml.kernel.org/r/1548793753-62377-1-git-send-email-yang.shi@linux.alibaba.com
    Signed-off-by: Yang Shi
    Signed-off-by: Hugh Dickins
    Suggested-by: John Hubbard
    Reviewed-by: Kirill Tkhai
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Yang Shi
     
    Add an optimization for KSM pages almost in the same way that we have
    for ordinary anonymous pages. If there is a write fault in a page
    which is mapped by only one pte, and it is not related to the swap
    cache, the page may be reused without copying its content.

    [ Note that we do not consider PageSwapCache() pages at least for now,
    since we don't want to complicate __get_ksm_page(), which has a nice
    optimization based on this (for the migration case). Currently it is
    spinning on PageSwapCache() pages, waiting for them to have unfrozen
    counters (i.e., for the migration to finish). But we don't want to
    make it also spin on swap cache pages that we try to reuse, since
    there is not a very high probability of reusing them. So, for now
    we do not consider PageSwapCache() pages at all. ]

    So in reuse_ksm_page() we check for 1) PageSwapCache() and 2)
    page_stable_node(), to skip a page which KSM is currently trying to
    link to the stable tree. Then we do page_ref_freeze() to prohibit KSM
    from merging one more page into the page we are reusing. After that,
    nobody can refer to the reused page: KSM skips !PageSwapCache() pages
    with a zero refcount; and the protection against all other participants
    is the same as for reused ordinary anon pages: pte lock, page lock and
    mmap_sem.
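    A simplified sketch of reuse_ksm_page() as described above (debug
    checks omitted):

    bool reuse_ksm_page(struct page *page, struct vm_area_struct *vma,
                        unsigned long address)
    {
            /* skip pages KSM is still working on, and swap cache pages */
            if (PageSwapCache(page) || !page_stable_node(page))
                    return false;
            /* prohibit a parallel get_ksm_page()/merge into this page */
            if (!page_ref_freeze(page, 1))
                    return false;

            page_move_anon_rmap(page, vma);
            page->index = linear_page_index(vma, address);
            page_ref_unfreeze(page, 1);
            return true;
    }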

    [akpm@linux-foundation.org: replace BUG_ON()s with WARN_ON()s]
    Link: http://lkml.kernel.org/r/154471491016.31352.1168978849911555609.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Reviewed-by: Yang Shi
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc: Andrea Arcangeli
    Cc: Christian Koenig
    Cc: Claudio Imbrenda
    Cc: Rik van Riel
    Cc: Huang Ying
    Cc: Minchan Kim
    Cc: Kirill Tkhai
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
  • Patch series "Replace all open encodings for NUMA_NO_NODE", v3.

    All these places for replacement were found by running the following
    grep patterns on the entire kernel code. Please let me know if this
    might have missed some instances. This might also have replaced some
    false positives. I will appreciate suggestions, inputs and review.

    1. git grep "nid == -1"
    2. git grep "node == -1"
    3. git grep "nid = -1"
    4. git grep "node = -1"

    This patch (of 2):

    At present there are multiple places where an invalid node number is
    encoded as -1. Even though implicitly understood, it is always better
    to have macros for this. Replace these open encodings of an invalid
    node number with the global macro NUMA_NO_NODE. This helps remove NUMA
    related assumptions like 'invalid node' from various places, redirecting
    them to a common definition.

    Link: http://lkml.kernel.org/r/1545127933-10711-2-git-send-email-anshuman.khandual@arm.com
    Signed-off-by: Anshuman Khandual
    Reviewed-by: David Hildenbrand
    Acked-by: Jeff Kirsher [ixgbe]
    Acked-by: Jens Axboe [mtip32xx]
    Acked-by: Vinod Koul [dmaengine.c]
    Acked-by: Michael Ellerman [powerpc]
    Acked-by: Doug Ledford [drivers/infiniband]
    Cc: Joseph Qi
    Cc: Hans Verkuil
    Cc: Stephen Rothwell
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Anshuman Khandual
     

29 Dec, 2018

3 commits

    The ksm thread unconditionally sleeps in ksm_scan_thread() after each
    iteration:

    schedule_timeout_interruptible(
            msecs_to_jiffies(ksm_thread_sleep_millisecs))

    The timeout is configured in /sys/kernel/mm/ksm/sleep_millisecs.

    In case a user writes a big value by mistake, and the thread enters
    schedule_timeout_interruptible(), it's not possible to cancel the
    sleep by writing a new smaller value; the thread just sleeps until the
    timeout expires.

    The patch fixes the problem by waking the thread each time the value
    is updated.

    This also may be useful for debugging purposes, and for userspace
    daemons which change the sleep_millisecs value depending on system load.
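    A sketch of the wake-up mechanism described above (assuming a dedicated
    wait queue; simplified):

    static DECLARE_WAIT_QUEUE_HEAD(ksm_iter_wait);

    /* in ksm_scan_thread(): sleep, but wake early if the value changes */
    sleep_ms = READ_ONCE(ksm_thread_sleep_millisecs);
    wait_event_interruptible_timeout(ksm_iter_wait,
            sleep_ms != READ_ONCE(ksm_thread_sleep_millisecs),
            msecs_to_jiffies(sleep_ms));

    /* in sleep_millisecs_store(): */
    ksm_thread_sleep_millisecs = msecs;
    wake_up_interruptible(&ksm_iter_wait);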

    Link: http://lkml.kernel.org/r/154454107680.3258.3558002210423531566.stgit@localhost.localdomain
    Signed-off-by: Kirill Tkhai
    Acked-by: Cyrill Gorcunov
    Reviewed-by: Andrew Morton
    Cc: Michal Hocko
    Cc: Hugh Dickins
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     
    To avoid having to change many call sites every time we want to add a
    parameter, use a structure to group all parameters for the mmu_notifier
    invalidate_range_start/end calls. No functional changes with this patch.
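    Conceptually, a call site ends up looking like this (field list
    abridged; the exact struct has a few more members):

    struct mmu_notifier_range {
            struct mm_struct *mm;
            unsigned long start;
            unsigned long end;
            /* ... */
    };

    struct mmu_notifier_range range;

    mmu_notifier_range_init(&range, mm, start, end);
    mmu_notifier_invalidate_range_start(&range);
    /* ... modify the page table ... */
    mmu_notifier_invalidate_range_end(&range);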

    [akpm@linux-foundation.org: coding style fixes]
    Link: http://lkml.kernel.org/r/20181205053628.3210-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Acked-by: Christian König
    Acked-by: Jan Kara
    Cc: Matthew Wilcox
    Cc: Ross Zwisler
    Cc: Dan Williams
    Cc: Paolo Bonzini
    Cc: Radim Krcmar
    Cc: Michal Hocko
    Cc: Felix Kuehling
    Cc: Ralph Campbell
    Cc: John Hubbard
    From: Jérôme Glisse
    Subject: mm/mmu_notifier: use structure for invalidate_range_start/end calls v3

    fix build warning in migrate.c when CONFIG_MMU_NOTIFIER=n

    Link: http://lkml.kernel.org/r/20181213171330.8489-3-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     
  • Replace jhash2 with xxhash.

    Perf numbers:
    Intel(R) Xeon(R) CPU E5-2420 v2 @ 2.20GHz
    ksm: crc32c hash() 12081 MB/s
    ksm: xxh64 hash() 8770 MB/s
    ksm: xxh32 hash() 4529 MB/s
    ksm: jhash2 hash() 1569 MB/s

    Sioh Lee did some testing:

    crc32c_intel: 1084.10ns
    crc32c (no hardware acceleration): 7012.51ns
    xxhash32: 2227.75ns
    xxhash64: 1413.16ns
    jhash2: 5128.30ns

    As jhash2 will always be slower (for data sizes like PAGE_SIZE), don't
    use it in ksm at all.

    Use only xxhash for now, because to use crc32c the crypto API must be
    initialized first - that requires some tricky solution to work well in
    all situations.
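    With that, KSM's checksum helper reduces to a thin wrapper around
    xxhash() (sketch):

    static u32 calc_checksum(struct page *page)
    {
            u32 checksum;
            void *addr = kmap_atomic(page);

            checksum = xxhash(addr, PAGE_SIZE, 0);
            kunmap_atomic(addr);
            return checksum;
    }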

    Link: http://lkml.kernel.org/r/20181023182554.23464-3-nefelim4ag@gmail.com
    Signed-off-by: Timofey Titovets
    Signed-off-by: leesioh
    Reviewed-by: Pavel Tatashin
    Reviewed-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Timofey Titovets
     

23 Aug, 2018

2 commits

  • Commit cafa0010cd51 ("Raise the minimum required gcc version to 4.6")
    recently exposed a brittle part of the build for supporting non-gcc
    compilers.

    Both Clang and ICC define __GNUC__, __GNUC_MINOR__, and
    __GNUC_PATCHLEVEL__ for quick compatibility with code bases that haven't
    added compiler specific checks for __clang__ or __INTEL_COMPILER.

    This is brittle, as they happened to get compatibility by posing as a
    certain version of GCC. This broke when the minimum version of GCC
    required to build the kernel was raised to a version above what ICC and
    Clang claim to be.

    Rather than always including compiler-gcc.h then undefining or
    redefining macros in compiler-intel.h or compiler-clang.h, let's
    separate out the compiler specific macro definitions into mutually
    exclusive headers, do more proper compiler detection, and keep shared
    definitions in compiler_types.h.
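    The resulting selection in compiler_types.h is roughly of this shape
    (illustrative sketch, not the verbatim hunk):

    #if defined(__clang__)
    #include <linux/compiler-clang.h>
    #elif defined(__INTEL_COMPILER)
    #include <linux/compiler-intel.h>
    #elif defined(__GNUC__)
    /* The compilers above also define __GNUC__, so order matters here. */
    #include <linux/compiler-gcc.h>
    #else
    #error "Unknown compiler"
    #endif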

    Fixes: cafa0010cd51 ("Raise the minimum required gcc version to 4.6")
    Reported-by: Masahiro Yamada
    Suggested-by: Eli Friedman
    Suggested-by: Joe Perches
    Signed-off-by: Nick Desaulniers
    Signed-off-by: Linus Torvalds

    Nick Desaulniers
     
    page_freeze_refs/page_unfreeze_refs have already been replaced by
    page_ref_freeze/page_ref_unfreeze, but the comments were not updated
    accordingly.

    Link: http://lkml.kernel.org/r/1532590226-106038-1-git-send-email-jiang.biao2@zte.com.cn
    Signed-off-by: Jiang Biao
    Acked-by: Michal Hocko
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jiang Biao
     

18 Aug, 2018

2 commits

    Use the new return type vm_fault_t for fault handlers. For now, this is
    just documenting that the function returns a VM_FAULT value rather than
    an errno. Once all instances are converted, vm_fault_t will become a
    distinct type.

    Ref-> commit 1c8f422059ae ("mm: change return type to vm_fault_t")

    In this patch all the callers of handle_mm_fault() are changed to return
    the vm_fault_t type.

    Link: http://lkml.kernel.org/r/20180617084810.GA6730@jordon-HP-15-Notebook-PC
    Signed-off-by: Souptick Joarder
    Cc: Matthew Wilcox
    Cc: Richard Henderson
    Cc: Tony Luck
    Cc: Matt Turner
    Cc: Vineet Gupta
    Cc: Russell King
    Cc: Catalin Marinas
    Cc: Will Deacon
    Cc: Richard Kuo
    Cc: Geert Uytterhoeven
    Cc: Michal Simek
    Cc: James Hogan
    Cc: Ley Foon Tan
    Cc: Jonas Bonn
    Cc: James E.J. Bottomley
    Cc: Benjamin Herrenschmidt
    Cc: Palmer Dabbelt
    Cc: Yoshinori Sato
    Cc: David S. Miller
    Cc: Richard Weinberger
    Cc: Guan Xuetao
    Cc: Thomas Gleixner
    Cc: "H. Peter Anvin"
    Cc: "Levin, Alexander (Sasha Levin)"
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Souptick Joarder
     
  • This patch is reworked from an earlier patch that Dan has posted:
    https://patchwork.kernel.org/patch/10131727/

    VM_MIXEDMAP is used by dax to tell mm paths like vm_normal_page() that
    the memory page it is dealing with is not typical memory from the linear
    map. The get_user_pages_fast() path, since it does not resolve the vma,
    is already using {pte,pmd}_devmap() as a stand-in for VM_MIXEDMAP, so we
    use that as a VM_MIXEDMAP replacement in some locations. In the cases
    where there is no pte to consult we fall back to using vma_is_dax() to
    detect the VM_MIXEDMAP special case.

    Now that we have explicit driver pfn_t-flag opt-in/opt-out for
    get_user_pages() support for DAX we can stop setting VM_MIXEDMAP. This
    also means we no longer need to worry about safely manipulating vm_flags
    in a future where we support dynamically changing the dax mode of a
    file.

    DAX should also now be supported with madvise_behavior(), vma_merge(),
    and copy_page_range().

    This patch has been tested against ndctl unit test. It has also been
    tested against xfstests commit: 625515d using fake pmem created by
    memmap and no additional issues have been observed.

    Link: http://lkml.kernel.org/r/152847720311.55924.16999195879201817653.stgit@djiang5-desk3.ch.intel.com
    Signed-off-by: Dave Jiang
    Acked-by: Dan Williams
    Cc: Jan Kara
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Dave Jiang
     

15 Jun, 2018

1 commit

    On our armv8a server (QDF2400), I noticed lots of WARN_ONs caused by
    rmap_item->address not being PAGE_SIZE aligned under memory pressure
    tests (starting 20 guests and running memhog in the host).

    WARNING: CPU: 4 PID: 4641 at virt/kvm/arm/mmu.c:1826 kvm_age_hva_handler+0xc0/0xc8
    CPU: 4 PID: 4641 Comm: memhog Tainted: G W 4.17.0-rc3+ #8
    Call trace:
    kvm_age_hva_handler+0xc0/0xc8
    handle_hva_to_gpa+0xa8/0xe0
    kvm_age_hva+0x4c/0xe8
    kvm_mmu_notifier_clear_flush_young+0x54/0x98
    __mmu_notifier_clear_flush_young+0x6c/0xa0
    page_referenced_one+0x154/0x1d8
    rmap_walk_ksm+0x12c/0x1d0
    rmap_walk+0x94/0xa0
    page_referenced+0x194/0x1b0
    shrink_page_list+0x674/0xc28
    shrink_inactive_list+0x26c/0x5b8
    shrink_node_memcg+0x35c/0x620
    shrink_node+0x100/0x430
    do_try_to_free_pages+0xe0/0x3a8
    try_to_free_pages+0xe4/0x230
    __alloc_pages_nodemask+0x564/0xdc0
    alloc_pages_vma+0x90/0x228
    do_anonymous_page+0xc8/0x4d0
    __handle_mm_fault+0x4a0/0x508
    handle_mm_fault+0xf8/0x1b0
    do_page_fault+0x218/0x4b8
    do_translation_fault+0x90/0xa0
    do_mem_abort+0x68/0xf0
    el0_da+0x24/0x28

    In rmap_walk_ksm, the rmap_item->address might still have the
    STABLE_FLAG set, so the start and end in handle_hva_to_gpa might not be
    PAGE_SIZE aligned. Thus it will cause exceptions in handle_hva_to_gpa
    on arm64.

    This patch fixes it by ignoring (not removing) the low bits of the
    address when doing rmap_walk_ksm.
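    A sketch of the idea in rmap_walk_ksm(): mask the KSM tracking bits
    into a local variable and use that for the walk, without touching
    rmap_item->address itself. The upstream fix masks the specific
    stable/unstable/seqnr flag bits; PAGE_MASK is used below as an
    equivalent illustration, since all those bits live below PAGE_SIZE.

    unsigned long addr = rmap_item->address & PAGE_MASK;

    if (addr < vma->vm_start || addr >= vma->vm_end)
            continue;
    if (!rwc->rmap_one(page, vma, addr, rwc->arg))
            return;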

    IMO, it should be backported to the stable tree. The storm of WARN_ONs
    is very easy for me to reproduce. More than that, I watched a panic
    (not reproducible) as follows:

    page:ffff7fe003742d80 count:-4871 mapcount:-2126053375 mapping: (null) index:0x0
    flags: 0x1fffc00000000000()
    raw: 1fffc00000000000 0000000000000000 0000000000000000 ffffecf981470000
    raw: dead000000000100 dead000000000200 ffff8017c001c000 0000000000000000
    page dumped because: nonzero _refcount
    CPU: 29 PID: 18323 Comm: qemu-kvm Tainted: G W 4.14.15-5.hxt.aarch64 #1
    Hardware name:
    Call trace:
    dump_backtrace+0x0/0x22c
    show_stack+0x24/0x2c
    dump_stack+0x8c/0xb0
    bad_page+0xf4/0x154
    free_pages_check_bad+0x90/0x9c
    free_pcppages_bulk+0x464/0x518
    free_hot_cold_page+0x22c/0x300
    __put_page+0x54/0x60
    unmap_stage2_range+0x170/0x2b4
    kvm_unmap_hva_handler+0x30/0x40
    handle_hva_to_gpa+0xb0/0xec
    kvm_unmap_hva_range+0x5c/0xd0

    I even injected a fault on purpose in kvm_unmap_hva_range by setting
    size=size-0x200; the call trace is similar to the above. So I thought
    the panic was similarly caused by the root cause of the WARN_ON.

    Andrea said:

    : It looks a straightforward safe fix, on x86 hva_to_gfn_memslot would
    : zap those bits and hide the misalignment caused by the low metadata
    : bits being erroneously left set in the address, but the arm code
    : notices when that's the last page in the memslot and the hva_end is
    : getting aligned and the size is below one page.
    :
    : I think the problem triggers in the addr += PAGE_SIZE of
    : unmap_stage2_ptes that never matches end because end is aligned but
    : addr is not.
    :
    : } while (pte++, addr += PAGE_SIZE, addr != end);
    :
    : x86 again only works on hva_start/hva_end after converting it to
    : gfn_start/end and that being in pfn units the bits are zapped before
    : they risk to cause trouble.

    Jia He said:

    : I've tested it myself on an arm64 server (QDF2400, 46 cpus, 96G mem).
    : Without this patch, the WARN_ON is very easy to reproduce. After this
    : patch, I have run the same benchmark for a whole day without any WARN_ONs.

    Link: http://lkml.kernel.org/r/1525403506-6750-1-git-send-email-hejianet@gmail.com
    Signed-off-by: Jia He
    Reviewed-by: Andrea Arcangeli
    Tested-by: Jia He
    Cc: Suzuki K Poulose
    Cc: Minchan Kim
    Cc: Claudio Imbrenda
    Cc: Arvind Yadav
    Cc: Mike Rapoport
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jia He
     

08 Jun, 2018

1 commit

    page_stable_node() and set_page_stable_node() are only used in mm/ksm.c
    and there is no point in keeping them in include/linux/ksm.h.

    [akpm@linux-foundation.org: fix SYSFS=n build]
    Link: http://lkml.kernel.org/r/1524552106-7356-3-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

28 Apr, 2018

1 commit


17 Apr, 2018

2 commits

  • Mike Rapoport says:

    These patches convert files in Documentation/vm to ReST format, add an
    initial index and link it to the top level documentation.

    There are no content changes in the documentation, except for a few
    spelling fixes. The relatively large diffstat stems from the indentation
    and paragraph wrapping changes.

    I've tried to keep the formatting as consistent as possible, but I could
    miss some places that needed markup and add some markup where it was not
    necessary.

    [jc: significant conflicts in vm/hmm.rst]

    Jonathan Corbet
     
  • Signed-off-by: Mike Rapoport
    Signed-off-by: Jonathan Corbet

    Mike Rapoport
     

12 Apr, 2018

1 commit

  • When using KSM with use_zero_pages, we replace anonymous pages
    containing only zeroes with actual zero pages, which are not anonymous.
    We need to do proper accounting of the mm counters, otherwise we will
    get wrong values in /proc and a BUG message in dmesg when tearing down
    the mm.
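    A sketch of the accounting fix in replace_page(), in the branch that
    installs the zero page (simplified):

    if (!is_zero_pfn(page_to_pfn(kpage))) {
            get_page(kpage);
            page_add_anon_rmap(kpage, vma, addr, false);
            newpte = mk_pte(kpage, vma->vm_page_prot);
    } else {
            newpte = pte_mkspecial(pfn_pte(page_to_pfn(kpage),
                                           vma->vm_page_prot));
            /*
             * We're replacing an anonymous page with a zero page, which is
             * not anonymous, so the anon counter must be dropped here.
             */
            dec_mm_counter(mm, MM_ANONPAGES);
    }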

    Link: http://lkml.kernel.org/r/1522931274-15552-1-git-send-email-imbrenda@linux.vnet.ibm.com
    Fixes: e86c59b1b1 ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Claudio Imbrenda
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Cc: Gerald Schaefer
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     

06 Apr, 2018

2 commits

  • This patch fixes a corner case for KSM. When two pages belong or
    belonged to the same transparent hugepage, and they should be merged,
    KSM fails to split the page, and therefore no merging happens.

    This bug can be reproduced by:
    * making sure ksm is running (in case disabling ksmtuned)
    * enabling transparent hugepages
    * allocating a THP-aligned 1-THP-sized buffer
    e.g. on amd64: posix_memalign(&p, 1<<21, 1<<21)
    Co-authored-by: Gerald Schaefer
    Reviewed-by: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc: Christian Borntraeger
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Claudio Imbrenda
     
  • stable_node_dup() is local to the source and does not need to be in
    global scope, so make it static.

    Cleans up sparse warning:

    mm/ksm.c:1321:13: warning: symbol 'stable_node_dup' was not declared. Should it be static?

    Link: http://lkml.kernel.org/r/20180206221005.12642-1-colin.king@canonical.com
    Signed-off-by: Colin Ian King
    Reviewed-by: Andrew Morton
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Colin Ian King
     

18 Mar, 2018

1 commit

  • ADI is a new feature supported on SPARC M7 and newer processors to allow
    hardware to catch rogue accesses to memory. ADI is supported for data
    fetches only and not instruction fetches. An app can enable ADI on its
    data pages, set version tags on them and use versioned addresses to
    access the data pages. Upper bits of the address contain the version
    tag. On M7 processors, upper four bits (bits 63-60) contain the version
    tag. If a rogue app attempts to access ADI enabled data pages, its
    access is blocked and processor generates an exception. Please see
    Documentation/sparc/adi.txt for further details.

    This patch extends mprotect to enable ADI (TSTATE.mcde), enable/disable
    MCD (Memory Corruption Detection) on selected memory ranges, enable
    TTE.mcd in PTEs, return ADI parameters to userspace and save/restore ADI
    version tags on page swap out/in or migration. ADI is not enabled by
    default for any task. A task must explicitly enable ADI on a memory
    range and set version tag for ADI to be effective for the task.

    Signed-off-by: Khalid Aziz
    Cc: Khalid Aziz
    Reviewed-by: Anthony Yznaga
    Signed-off-by: David S. Miller

    Khalid Aziz
     

07 Feb, 2018

1 commit

  • so that kernel-doc will properly recognize the parameter and function
    descriptions.

    Link: http://lkml.kernel.org/r/1516700871-22279-2-git-send-email-rppt@linux.vnet.ibm.com
    Signed-off-by: Mike Rapoport
    Cc: Jonathan Corbet
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Mike Rapoport
     

05 Dec, 2017

1 commit

  • Because READ_ONCE() now implies smp_read_barrier_depends(), the
    smp_read_barrier_depends() in get_ksm_page() is now redundant.
    This commit removes it and updates the comments.

    Signed-off-by: Paul E. McKenney
    Cc: Andrew Morton
    Cc: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: Michal Hocko
    Cc: "Kirill A. Shutemov"
    Cc: Ingo Molnar
    Cc: "Aneesh Kumar K.V"
    Cc: Claudio Imbrenda
    Cc:

    Paul E. McKenney
     

16 Nov, 2017

1 commit

  • This patch only affects users of mmu_notifier->invalidate_range callback
    which are device drivers related to ATS/PASID, CAPI, IOMMUv2, SVM ...
    and it is an optimization for those users. Everyone else is unaffected
    by it.

    When clearing a pte/pmd we are given a choice to notify the event under
    the page table lock (notify version of *_clear_flush helpers do call the
    mmu_notifier_invalidate_range). But that notification is not necessary
    in all cases.

    This patch removes almost all cases where it is useless to have a call
    to mmu_notifier_invalidate_range before
    mmu_notifier_invalidate_range_end. It also adds documentation in all
    those cases explaining why.

    Below is a more in depth analysis of why this is fine to do this:

    For a secondary TLB (non CPU TLB) like an IOMMU TLB or a device TLB
    (when the device uses something like ATS/PASID to get the IOMMU to walk
    the CPU page table to access a process virtual address space), there are
    only 2 cases when you need to notify those secondary TLBs while holding
    the page table lock when clearing a pte/pmd:

    A) page backing address is free before mmu_notifier_invalidate_range_end
    B) a page table entry is updated to point to a new page (COW, write fault
    on zero page, __replace_page(), ...)

    Case A is obvious: you do not want to take the risk of the device writing
    to a page that might now be used by something completely different.

    Case B is more subtle. For correctness it requires the following sequence
    to happen:
    - take page table lock
    - clear page table entry and notify (pmd/pte_huge_clear_flush_notify())
    - set page table entry to point to new page

    If clearing the page table entry is not followed by a notify before
    setting the new pte/pmd value, then you can break the memory model (like
    C11 or C++11) for the device.
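    In kernel terms, the required case B ordering corresponds to the
    *_notify helpers being used like this (schematic, COW/replace_page
    style path):

    spin_lock(ptl);                           /* page table lock */
    ptep_clear_flush_notify(vma, addr, ptep); /* clear + flush + ->invalidate_range */
    set_pte_at_notify(mm, addr, ptep,
                      mk_pte(new_page, vma->vm_page_prot));
    spin_unlock(ptl);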

    Consider the following scenario (device use a feature similar to ATS/
    PASID):

    Two addresses addrA and addrB such that |addrA - addrB| >= PAGE_SIZE; we
    assume they are write protected for COW (other cases of B apply too).

    [Time N] -----------------------------------------------------------------
    CPU-thread-0 {try to write to addrA}
    CPU-thread-1 {try to write to addrB}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {read addrA and populate device TLB}
    DEV-thread-2 {read addrB and populate device TLB}
    [Time N+1] ---------------------------------------------------------------
    CPU-thread-0 {COW_step0: {mmu_notifier_invalidate_range_start(addrA)}}
    CPU-thread-1 {COW_step0: {mmu_notifier_invalidate_range_start(addrB)}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+2] ---------------------------------------------------------------
    CPU-thread-0 {COW_step1: {update page table point to new page for addrA}}
    CPU-thread-1 {COW_step1: {update page table point to new page for addrB}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {preempted}
    CPU-thread-2 {write to addrA which is a write to new page}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+3] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {preempted}
    CPU-thread-2 {}
    CPU-thread-3 {write to addrB which is a write to new page}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+4] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {COW_step3: {mmu_notifier_invalidate_range_end(addrB)}}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {}
    DEV-thread-2 {}
    [Time N+5] ---------------------------------------------------------------
    CPU-thread-0 {preempted}
    CPU-thread-1 {}
    CPU-thread-2 {}
    CPU-thread-3 {}
    DEV-thread-0 {read addrA from old page}
    DEV-thread-2 {read addrB from new page}

    So here, because at time N+2 the cleared page table entry was not paired
    with a notification to invalidate the secondary TLB, the device sees the
    new value for addrB before seeing the new value for addrA. This breaks
    total memory ordering for the device.

    When changing a pte to write protect, or to point to a new write
    protected page with the same content (KSM), it is ok to delay the
    invalidate_range callback to mmu_notifier_invalidate_range_end() outside
    the page table lock. This is true even if the thread doing the page
    table update is preempted right after releasing the page table lock and
    before calling mmu_notifier_invalidate_range_end().

    Thanks to Andrea for thinking of a problematic scenario for COW.

    [jglisse@redhat.com: v2]
    Link: http://lkml.kernel.org/r/20171017031003.7481-2-jglisse@redhat.com
    Link: http://lkml.kernel.org/r/20170901173011.10745-1-jglisse@redhat.com
    Signed-off-by: Jérôme Glisse
    Cc: Andrea Arcangeli
    Cc: Nadav Amit
    Cc: Joerg Roedel
    Cc: Suravee Suthikulpanit
    Cc: David Woodhouse
    Cc: Alistair Popple
    Cc: Michael Ellerman
    Cc: Benjamin Herrenschmidt
    Cc: Stephen Rothwell
    Cc: Andrew Donnellan
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Jérôme Glisse
     

04 Oct, 2017

1 commit

    In this place the mm is unlocked, so the vmas or the vma list may
    change. Take mmap_sem for read to protect them from modifications.

    Link: http://lkml.kernel.org/r/150512788393.10691.8868381099691121308.stgit@localhost.localdomain
    Fixes: e86c59b1b12d ("mm/ksm: improve deduplication of zero pages with colouring")
    Signed-off-by: Kirill Tkhai
    Acked-by: Michal Hocko
    Reviewed-by: Andrea Arcangeli
    Cc: Minchan Kim
    Cc: zhong jiang
    Cc: Ingo Molnar
    Cc: Claudio Imbrenda
    Cc: "Kirill A. Shutemov"
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Kirill Tkhai
     

07 Sep, 2017

1 commit

    attribute_groups are not supposed to change at runtime. All functions
    working with attribute_groups provided by <linux/sysfs.h> work with
    const attribute_group. So mark the non-const structs as const.
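    For KSM this boils down to constifying the sysfs group (attribute list
    abridged):

    static struct attribute *ksm_attrs[] = {
            &sleep_millisecs_attr.attr,
            &pages_to_scan_attr.attr,
            /* ... */
            NULL,
    };

    static const struct attribute_group ksm_attr_group = {
            .attrs = ksm_attrs,
            .name = "ksm",
    };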

    Link: http://lkml.kernel.org/r/1501157167-3706-2-git-send-email-arvind.yadav.cs@gmail.com
    Signed-off-by: Arvind Yadav
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Arvind Yadav
     

11 Aug, 2017

1 commit

    Nadav reported that KSM can corrupt user data due to the TLB batching
    race [1]. That means data the user has written can be lost.

    Quote from Nadav Amit:
    "For this race we need 4 CPUs:

    CPU0: Caches a writable and dirty PTE entry, and uses the stale value
    for write later.

    CPU1: Runs madvise_free on the range that includes the PTE. It would
    clear the dirty-bit. It batches TLB flushes.

    CPU2: Writes 4 to /proc/PID/clear_refs , clearing the PTEs soft-dirty.
    We care about the fact that it clears the PTE write-bit, and of
    course, batches TLB flushes.

    CPU3: Runs KSM. Our purpose is to pass the following test in
    write_protect_page():

    if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
    (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)))

    Since it will avoid TLB flush. And we want to do it while the PTE is
    stale. Later, and before replacing the page, we would be able to
    change the page.

    Note that all the operations CPU1-3 perform can happen in parallel
    since they only acquire mmap_sem for read.

    We start with two identical pages. Everything below regards the same
    page/PTE.

    CPU0                  CPU1               CPU2               CPU3
    ----                  ----               ----               ----
    Write the same
    value on page

    [cache PTE as
    dirty in TLB]
                          MADV_FREE
                          pte_mkclean()
                                             4 > clear_refs
                                             pte_wrprotect()
                                                                write_protect_page()
                                                                [ success, no flush ]

                                                                pages_identical()
                                                                [ ok ]
    Write to page
    different value

    [Ok, using stale
    PTE]
                                                                replace_page()

    Later, CPU1, CPU2 and CPU3 would flush the TLB, but that is too late.
    CPU0 already wrote on the page, but KSM ignored this write, and it got
    lost"

    In the above scenario, MADV_FREE is fixed by changing the TLB batching
    API, including [set|clear]_tlb_flush_pending. The remaining part is the
    soft-dirty path.

    This patch changes soft-dirty to use the TLB batching API instead of
    flush_tlb_mm, and makes KSM check for a pending TLB flush by using
    mm_tlb_flush_pending, so that it will flush the TLB to avoid data loss
    if there are other parallel threads with a pending TLB flush.
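    The KSM side of the fix is essentially one more condition in
    write_protect_page() (sketch):

    if (pte_write(*pvmw.pte) || pte_dirty(*pvmw.pte) ||
        (pte_protnone(*pvmw.pte) && pte_savedwrite(*pvmw.pte)) ||
        mm_tlb_flush_pending(mm)) {
            pte_t entry;

            /*
             * Another CPU may still hold a stale writable TLB entry for
             * this pte, so flush before write protecting and comparing.
             */
            entry = ptep_clear_flush(vma, pvmw.address, pvmw.pte);
            /* write protect and re-check dirty, as before */
    }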

    [1] http://lkml.kernel.org/r/BD3A0EBE-ECF4-41D4-87FA-C755EA9AB6BD@gmail.com

    Link: http://lkml.kernel.org/r/20170802000818.4760-8-namit@vmware.com
    Signed-off-by: Minchan Kim
    Signed-off-by: Nadav Amit
    Reported-by: Nadav Amit
    Tested-by: Nadav Amit
    Reviewed-by: Andrea Arcangeli
    Cc: Mel Gorman
    Cc: Hugh Dickins
    Cc: "David S. Miller"
    Cc: Andy Lutomirski
    Cc: Heiko Carstens
    Cc: Ingo Molnar
    Cc: Jeff Dike
    Cc: Martin Schwidefsky
    Cc: Mel Gorman
    Cc: Nadav Amit
    Cc: Rik van Riel
    Cc: Russell King
    Cc: Sergey Senozhatsky
    Cc: Tony Luck
    Cc: Yoshinori Sato
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

07 Jul, 2017

5 commits

    If a candidate stable_node_dup has been found and it can accept further
    merges, it can be refiled to the head of the list to speed up the next
    searches, without altering which dup is found and how the dups
    accumulate in the chain.

    We already refiled it back to the head in the prune_stale_stable_nodes
    case, but we didn't refile it if not pruning (which is more common).
    And we also refiled it when it was already at the head, which is
    unnecessary (in the prune_stale_stable_nodes case, nr > 1 means there's
    more than one dup in the chain, it doesn't mean it's not already at the
    head of the chain).

    The stable_node_chain list is single threaded and there's no SMP locking
    contention so it should be faster to refile it to the head of the list
    also if prune_stale_stable_nodes is false.

    Profiling shows the refile happens 1.9% of the time when a dup is found
    with a max_page_sharing limit setting of 3 (with max_page_sharing of 2
    the refile never happens of course as there's never space for one more
    merge) which is reasonably low. At higher max_page_sharing values it
    should be much less frequent.

    This is just an optimization.

    Link: http://lkml.kernel.org/r/20170518173721.22316-4-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Arjan van de Ven
    Cc: Davidlohr Bueso
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
    Some static checker complains if chain/chain_prune returns a potentially
    stale pointer.

    There are two output parameters to chain/chain_prune: one is tree_page,
    the other is stable_node_dup. Like in get_ksm_page, the caller has to
    check tree_page for NULL before touching the stable_node. Similarly, in
    chain/chain_prune the caller has to check tree_page before touching the
    stable_node_dup returned or the original stable_node passed as a
    parameter.

    Because the tree_page is never returned as a stale pointer, it may be
    more intuitive to return tree_page and to pass stable_node_dup for
    reference instead of the reverse.

    This patch purely swaps the two output parameters of chain/chain_prune
    as a cleanup for the static checker and to mimic the get_ksm_page
    behavior more closely. There's no change to the caller at all except
    the swap, it's purely a cleanup and it is a noop from the caller point
    of view.

    Link: http://lkml.kernel.org/r/20170518173721.22316-3-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Dan Carpenter
    Tested-by: Dan Carpenter
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Arjan van de Ven
    Cc: Davidlohr Bueso
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Patch series "KSMscale cleanup/optimizations".

    There are no fixes here it's just minor cleanups and optimizations.

    1/3 removes makes the "fix" for the stale stable_node fall in the
    standard case without introducing new cases. Setting stable_node to
    NULL was marginally safer, but stale pointer is still wiped from the
    caller, this looks cleaner.

    2/3 should fix the false positive from Dan's static checker.

    3/3 is a microoptimization to apply the the refile of future merge
    candidate dups at the head of the chain in all cases and to skip it in
    one case where we did it and but it was a noop (to avoid checking if
    it was already at the head but now we've to check it anyway so it got
    optimized away).

    This patch (of 3):

    When the stable_node chain is collapsed we can as well set the caller
    stable_node to match the returned stable_node_dup in chain_prune().

    This way the collapse case becomes indistinguishable from the regular
    stable_node case and we can remove two branches from the KSM page
    migration handling slow paths.

    While it was all correct this looks cleaner (and faster) as the caller has
    to deal with fewer special cases.

    Link: http://lkml.kernel.org/r/20170518173721.22316-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Arjan van de Ven
    Cc: Davidlohr Bueso
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Cc: Dan Carpenter
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • If merge_across_nodes was manually set to 0 (not the default value) by
    the admin or a tuned profile on NUMA systems triggering cross-NODE page
    migrations, a stable_node use after free could materialize.

    If the chain is collapsed stable_node would point to the old chain that
    was already freed. stable_node_dup would be the stable_node dup now
    converted to a regular stable_node and indexed in the rbtree in
    replacement of the freed stable_node chain (not anymore a dup).

    This special case where the chain is collapsed in the NUMA replacement
    path, is now detected by setting stable_node to NULL by the chain_prune
    callee if it decides to collapse the chain. This tells the NUMA
    replacement code that even if stable_node and stable_node_dup are
    different, this is not a chain if stable_node is NULL, as the
    stable_node_dup was converted to a regular stable_node and the chain was
    collapsed.

    It is generally safer for the callee to force the caller's stable_node
    to NULL the moment it becomes stale, so any other mistake like this
    would result in an instant Oops, easier to debug than a use after free.

    Otherwise the replace logic would act as if stable_node was a valid
    chain, when in fact it was freed. Notably
    stable_node_chain_add_dup(page_node, stable_node) would run on a stale
    stable_node.

    Andrey Ryabinin found the source of the use after free in chain_prune().

    Link: http://lkml.kernel.org/r/20170512193805.8807-2-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Andrey Ryabinin
    Reported-by: Evgheni Dereveanchin
    Tested-by: Andrey Ryabinin
    Cc: Petr Holasek
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Arjan van de Ven
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     
  • Without a max deduplication limit for each KSM page, the list of the
    rmap_items associated to each stable_node can grow infinitely large.

    During the rmap walk each entry can take up to ~10usec to process
    because of IPIs for the TLB flushing (both for the primary MMU and the
    secondary MMUs with the MMU notifier). With only 16GB of address space
    shared in the same KSM page, that would amount to dozens of seconds of
    kernel runtime.

    A ~256 max deduplication factor will reduce the latencies of the rmap
    walks on KSM pages to order of a few msec. Just doing the
    cond_resched() during the rmap walks is not enough, the list size must
    have a limit too, otherwise the caller could get blocked in (schedule
    friendly) kernel computations for seconds, unexpectedly.

    There's room for optimization to significantly reduce the IPI delivery
    cost during the page_referenced(), but at least for page_migration in
    the KSM case (used by hard NUMA bindings, compaction and NUMA balancing)
    it may be inevitable to send lots of IPIs if each rmap_item->mm is
    active on a different CPU and there are lots of CPUs. Even if we ignore
    the IPI delivery cost, we still have to walk the whole KSM rmap list, so
    we can't allow millions or billions (i.e. an unlimited number) of
    entries in the KSM stable_node rmap_item lists.

    The limit is enforced efficiently by adding a second dimension to the
    stable rbtree. So there are three types of stable_nodes: the regular
    ones (identical as before, living in the first flat dimension of the
    stable rbtree), the "chains" and the "dups".

    Every "chain" and all "dups" linked into a "chain" enforce the invariant
    that they represent the same write protected memory content, even if
    each "dup" will be pointed by a different KSM page copy of that content.
    This way the stable rbtree lookup computational complexity is unaffected
    if compared to an unlimited max_sharing_limit. It is still enforced
    that there cannot be KSM page content duplicates in the stable rbtree
    itself.

    Adding the second dimension to the stable rbtree only after the
    max_page_sharing limit hits, provides for a zero memory footprint
    increase on 64bit archs. The memory overhead of the per-KSM page
    stable_tree and per virtual mapping rmap_item is unchanged. Only after
    the max_page_sharing limit hits, we need to allocate a stable_tree
    "chain" and rb_replace() the "regular" stable_node with the newly
    allocated stable_node "chain". After that we simply add the "regular"
    stable_node to the chain as a stable_node "dup" by linking hlist_dup in
    the stable_node_chain->hlist. This way the "regular" (flat) stable_node
    is converted to a stable_node "dup" living in the second dimension of
    the stable rbtree.

    During stable rbtree lookups the stable_node "chain" is identified as
    stable_node->rmap_hlist_len == STABLE_NODE_CHAIN (aka
    is_stable_node_chain()).

    When dropping stable_nodes, the stable_node "dup" is identified as
    stable_node->head == STABLE_NODE_DUP_HEAD (aka is_stable_node_dup()).

    The STABLE_NODE_DUP_HEAD must be a unique valid pointer never used
    elsewhere in any stable_node->head/node to avoid clashes with the
    stable_node->node.rb_parent_color pointer, and different from
    &migrate_nodes. So the second field of &migrate_nodes is picked and
    verified as always safe with a BUILD_BUG_ON in case the list_head
    implementation changes in the future.

    The STABLE_NODE_CHAIN marker is picked as a random negative value in
    stable_node->rmap_hlist_len. rmap_hlist_len cannot become negative when
    it's a "regular" stable_node or a stable_node "dup".

    The stable_node_chain->nid is irrelevant. The stable_node_chain->kpfn
    is aliased in a union with a time field used to rate limit the
    stable_node_chain->hlist prunes.

    The garbage collection of the stable_node_chain happens lazily during
    stable rbtree lookups (as for all other kind of stable_nodes), or while
    disabling KSM with "echo 2 >/sys/kernel/mm/ksm/run" while collecting the
    entire stable rbtree.

    While the "regular" stable_nodes and the stable_node "dups" must wait
    for their underlying tree_page to be freed before they can be freed
    themselves, the stable_node "chains" can be freed immediately if the
    stable_node->hlist turns empty. This is because the "chains" are never
    pointed by any page->mapping and they're effectively stable rbtree KSM
    self contained metadata.
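    The two kinds of second-dimension nodes described above are told apart
    by small helpers of this shape (sketch; the exact chain constant is an
    implementation detail):

    #define STABLE_NODE_CHAIN -1024   /* an impossible rmap_hlist_len value */
    #define STABLE_NODE_DUP_HEAD ((struct list_head *)&migrate_nodes.prev)

    static inline int is_stable_node_chain(struct stable_node *chain)
    {
            return chain->rmap_hlist_len == STABLE_NODE_CHAIN;
    }

    static inline int is_stable_node_dup(struct stable_node *dup)
    {
            return dup->head == STABLE_NODE_DUP_HEAD;
    }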

    [akpm@linux-foundation.org: fix non-NUMA build]
    Signed-off-by: Andrea Arcangeli
    Tested-by: Petr Holasek
    Cc: Hugh Dickins
    Cc: Davidlohr Bueso
    Cc: Arjan van de Ven
    Cc: Evgheni Dereveanchin
    Cc: Andrey Ryabinin
    Cc: Gavin Guo
    Cc: Jay Vosburgh
    Cc: Mel Gorman
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

03 Jun, 2017

1 commit

  • "err" needs to be left set to -EFAULT if split_huge_page succeeds.
    Otherwise if "err" gets clobbered with zero and write_protect_page
    fails, try_to_merge_one_page() will succeed instead of returning -EFAULT
    and then try_to_merge_with_ksm_page() will continue thinking kpage is a
    PageKsm when in fact it's still an anonymous page. Eventually it'll
    crash in page_add_anon_rmap.
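    The fix amounts to not letting a successful split overwrite err (sketch
    of the relevant part of try_to_merge_one_page()):

    err = -EFAULT;
    /* ... */
    if (PageTransCompound(page)) {
            /* a successful split must not clobber err with 0 */
            if (split_huge_page(page))
                    goto out_unlock;
    }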

    This has been reproduced on a Fedora 25 kernel but I can reproduce it
    with upstream too.

    The bug was introduced by commit f765f540598a ("ksm: prepare to new THP
    semantics"), which was merged in v4.5.

    page:fffff67546ce1cc0 count:4 mapcount:2 mapping:ffffa094551e36e1 index:0x7f0f46673
    flags: 0x2ffffc0004007c(referenced|uptodate|dirty|lru|active|swapbacked)
    page dumped because: VM_BUG_ON_PAGE(!PageLocked(page))
    page->mem_cgroup:ffffa09674bf0000
    ------------[ cut here ]------------
    kernel BUG at mm/rmap.c:1222!
    CPU: 1 PID: 76 Comm: ksmd Not tainted 4.9.3-200.fc25.x86_64 #1
    RIP: do_page_add_anon_rmap+0x1c4/0x240
    Call Trace:
    page_add_anon_rmap+0x18/0x20
    try_to_merge_with_ksm_page+0x50b/0x780
    ksm_scan_thread+0x1211/0x1410
    ? prepare_to_wait_event+0x100/0x100
    ? try_to_merge_with_ksm_page+0x780/0x780
    kthread+0xd9/0xf0
    ? kthread_park+0x60/0x60
    ret_from_fork+0x25/0x30

    Fixes: f765f54059 ("ksm: prepare to new THP semantics")
    Link: http://lkml.kernel.org/r/20170513131040.21732-1-aarcange@redhat.com
    Signed-off-by: Andrea Arcangeli
    Reported-by: Federico Simoncelli
    Acked-by: Kirill A. Shutemov
    Cc: Hugh Dickins
    Cc:
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Andrea Arcangeli
     

04 May, 2017

2 commits

    rmap_one's return value controls whether rmap_walk should continue to
    scan other ptes or not, so it's a target for changing to boolean. Return
    true if the scan should be continued. Otherwise, return false to stop
    the scanning.

    This patch makes rmap_one's return value boolean.
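    After the change, the callback in struct rmap_walk_control has this
    shape (sketch):

    struct rmap_walk_control {
            void *arg;
            /* return false to stop the walk, true to keep scanning ptes */
            bool (*rmap_one)(struct page *page, struct vm_area_struct *vma,
                             unsigned long addr, void *arg);
            int (*done)(struct page *page);
            struct anon_vma *(*anon_lock)(struct page *page);
            bool (*invalid_vma)(struct vm_area_struct *vma, void *arg);
    };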

    Link: http://lkml.kernel.org/r/1489555493-14659-10-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     
  • There is no user of the return value from rmap_walk() and friends so
    this patch makes them void-returning functions.

    Link: http://lkml.kernel.org/r/1489555493-14659-9-git-send-email-minchan@kernel.org
    Signed-off-by: Minchan Kim
    Cc: Anshuman Khandual
    Cc: Hillf Danton
    Cc: Johannes Weiner
    Cc: Kirill A. Shutemov
    Cc: Michal Hocko
    Cc: Naoya Horiguchi
    Cc: Vlastimil Babka
    Signed-off-by: Andrew Morton
    Signed-off-by: Linus Torvalds

    Minchan Kim
     

02 Mar, 2017

1 commit